Preference-based mining of top-K influential nodes in social networks


Future Generation Computer Systems journal homepage: www.elsevier.com/locate/fgcs

Jingyu Zhou ∗, Yunlong Zhang, Jia Cheng
School of Software, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai 200240, PR China

Article info

Article history: Received 27 January 2012; received in revised form 21 May 2012; accepted 18 June 2012; available online xxxx

Keywords: User preference; Influence maximization; SVD; Collaborative filtering; Social network

Abstract

Many important applications can be generalized as the influence maximization problem, which targets finding a K-node set in a social network that has the maximum influence. Previous work only considers that influence is propagated through the network with a uniform probability. However, because users actually have different preferences on topics, such a uniform propagation can result in inaccurate results. To solve this problem, we have designed a two-stage mining algorithm (GAUP) to mine the most influential nodes in a network on a given topic. Given a set of users’ documents labeled with topics, GAUP first computes user preferences with a latent feature model based on SVD or a model based on vector space. Then, to find the top-K nodes in the second stage, GAUP adopts a greedy algorithm that is guaranteed to find a solution within 63% of the optimal. Our evaluation on the task of expert finding shows that GAUP performs better than the state-of-the-art greedy algorithm, SVD-based collaborative filtering, and HITS.

© 2012 Elsevier B.V. All rights reserved.

∗ Corresponding author. Tel.: +86 21 34204125; fax: +86 21 34205145. E-mail addresses: [email protected] (J. Zhou), [email protected] (Y. Zhang), [email protected] (J. Cheng).
0167-739X/$ – see front matter © 2012 Elsevier B.V. All rights reserved. doi:10.1016/j.future.2012.06.011

1. Introduction

A social network reflects a social structure made up of a set of individuals and the ties among them, such as relationships, connections, and interactions. Online social networks such as Facebook and LinkedIn have been remarkably popular for making friends and finding jobs, and other types of social networks include scientific collaboration networks [1], mobile phone networks [2], and instant messaging networks [3].

Being very popular, social networks have a strong synergy with business applications. Among them, “viral marketing” [4,5] tries to select a set of users that will potentially influence a large number of other users to choose a product, where the influence works mostly by “word-of-mouth”. Searching for domain experts [6,7] and finding the most important blogs [8] are other examples that can be solved with analysis on social networks.

Both “viral marketing” and searching for domain experts can be generalized as the influence maximization problem [9]: given a social network and a parameter K, find a K-node set that eventually has the maximum influence, where influence propagates in the network according to a stochastic cascade model. In other words, the problem is to select the top-K influential nodes in the network such that they eventually influence the most people in the network. Kempe et al. [9] have proved that finding the optimal solution to the influence maximization problem is NP-hard. However, a simple greedy algorithm can approximate the optimal solution within a factor of (1 − 1/e). Their experiments show that the greedy algorithm significantly outperforms the classic approach using degree- and centrality-based heuristics.

After Kempe’s seminal work, later work [8,10,2] improved the algorithm’s efficiency and scaled it to larger social networks. However, all these algorithms follow the same diffusion process as in Kempe’s work, where each active node has the same influence on other nodes. In reality, different topics propagate differently in networks [11,12]. Topical Affinity Propagation (TAP) [13] models topic-level social influence on large networks and demonstrates that different topics actually lead to different influence results. Although the influence graph generated from TAP reveals the influence at a fine-grained level, TAP was only applied to the problems of finding representative nodes or constructing influential subgraphs. No research on the influence maximization problem takes user preferences into account. Without consideration of user preferences, the results of influence maximization are often inaccurate for a specific topic.

We study the influence maximization problem with the consideration of user preferences. To account for user preferences during the influence propagation, we propose a two-stage algorithm, called Greedy Algorithm based on Users’ Preferences (GAUP), for the influence maximization problem. Specifically, for a social network and a set of users’ documents labeled with topics, our two-stage algorithm works as follows:

1. We calculate each user’s preference for different topics. Inspired by SVD-based Latent Semantic Indexing (LSI) [14,15], we model user preferences using latent features from the user-topic matrix. Alternatively, preferences can also be modeled with


the vector space model (VSM), popularly used in information retrieval.
2. We extend an independent cascade model to accommodate the user preferences calculated in the first stage, and then use a greedy algorithm to compute an approximate solution to the influence maximization problem for a specific topic. This greedy algorithm has a proven error bound of about 37%, i.e., its solution is within a factor of (1 − 1/e) of the optimal.

We have evaluated our GAUP algorithm on the academic coauthor network DBLP (see footnote 1). Experimental results show that GAUP significantly outperforms the traditional greedy algorithm in terms of influence spread on specific topics. We compared GAUP with other popular algorithms for the task of searching for domain experts and found that GAUP is more effective than SVD-based collaborative filtering. In this extended version of our conference paper [16], our new experimental results show that (1) LSI-based user preferences are more effective than those of the VSM; and (2) compared with the link-based HITS algorithm [17], GAUP produces more reliable results, because HITS suffers from the problem of topic drift.

The rest of the paper is organized as follows. Section 2 discusses background. Section 3 elaborates the design of GAUP. Section 4 presents experimental results and validates the effectiveness of our approach. Section 5 summarizes related work. Finally, Section 6 concludes our work.

2. Preliminaries

This section first discusses influence diffusion models and solutions to the influence maximization problem [9], and then describes Singular Value Decomposition (SVD) [18,19]. Readers familiar with the content may skip this section.

2.1. Influence diffusion models

When considering models for influence spread through a social network, we often describe a node as being either active or inactive. Active nodes are those that have been influenced by other active nodes; once active, they are able to influence their inactive neighbors. Inactive nodes are those that have not been influenced by their active neighbors.
The influence diffusion from the perspective of an initially inactive node v unfolds as follows: as time progresses, more and more of v’s neighboring nodes become active; v may become active at some point because of this; and v’s activation may in turn affect its neighboring nodes. Each node can switch from the inactive to the active state, but not the opposite. Popular influence diffusion models include the Linear Threshold Model and the Independent Cascade Model (IC).

In the linear threshold model, a node v becomes active if the influence of all its active neighboring nodes added together exceeds a threshold. All nodes that were active in step t − 1 remain active in step t. The threshold for each node is drawn uniformly from the interval [0, 1], representing the fraction of v’s neighbors that must be active in order to activate v.

In the independent cascade model, the diffusion process also unfolds in discrete steps: when a node v becomes active in step t − 1, it is given a single chance to activate each of its inactive neighbors with a probability p, which is independent of the history thus far. If multiple nodes try to activate the same node, their attempts are sequenced in an arbitrary order. If an attempt succeeds, the activated node becomes active in step t. Whether or not v succeeds, it cannot make any further attempts to activate its neighboring nodes in subsequent rounds. The process ends when no more activations are possible.

The activation probability p in Kempe’s work is uniform across all edges of the graph. If nodes v and w have $c_{v,w}$ parallel edges, v has

1 http://www.informatik.uni-trier.de/~ley/db/.




a total probability of $1 - (1 - p)^{c_{v,w}}$ to activate w once it becomes active.

The above models do not consider topics or user preferences on different topics. Intuitively, the stronger the preference for topic T of v and its neighboring nodes, the more likely the influence is to happen. Thus, we treat the activation probability p not as uniform, but as related to user preferences on a topic T.

2.2. Approximation of influence maximization problem

The “influence maximization problem” is NP-hard, so finding an optimal solution is generally hard. However, approximation algorithms do exist and can find solutions with proven guarantees if the influence of a seed set is submodular, non-negative, and monotone. A function f(·) is submodular if it satisfies a natural “diminishing returns” property: the marginal gain from adding an element to a set S is at least as high as the marginal gain from adding the same element to a superset of S. Formally, for all pairs of sets S ⊆ T (here T denotes a set, not a topic) and every element v, a submodular function satisfies:

$$f(S \cup \{v\}) - f(S) \ge f(T \cup \{v\}) - f(T). \qquad (1)$$
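As a small illustration of inequality (1), the following sketch brute-force checks the diminishing-returns property for a set-coverage function, a standard example of a monotone submodular function. The `coverage` function and the toy `neighborhoods` data are assumptions made for this example; they are not the influence function used in this paper.

```python
from itertools import combinations

# Hypothetical toy data: each candidate node "covers" a set of users.
neighborhoods = {
    "a": {1, 2, 3},
    "b": {3, 4},
    "c": {4, 5, 6},
    "d": {1, 6},
}

def coverage(seed_set):
    """f(S): number of distinct users covered by the nodes in S."""
    covered = set()
    for node in seed_set:
        covered |= neighborhoods[node]
    return len(covered)

def marginal_gain(seed_set, v):
    """f(S ∪ {v}) − f(S)."""
    return coverage(set(seed_set) | {v}) - coverage(seed_set)

def is_submodular(nodes):
    """Check Eq. (1) for every pair S ⊆ T and every v outside T."""
    nodes = list(nodes)
    subsets = [set(c) for r in range(len(nodes) + 1)
               for c in combinations(nodes, r)]
    for small in subsets:
        for big in subsets:
            if not small <= big:
                continue
            for v in nodes:
                if v in big:
                    continue
                if marginal_gain(small, v) < marginal_gain(big, v):
                    return False
    return True

print(is_submodular(neighborhoods))    # True
print(marginal_gain(set(), "b"))       # 2
print(marginal_gain({"a", "c"}, "b"))  # 0
```

The last two lines show the diminishing return directly: node "b" adds two new users to the empty set, but nothing once "a" and "c" are already selected.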

Nemhauser [20,21] has shown that if f(·) is submodular, non-negative, and monotone, the NP-hard “influence maximization problem” can be approximated by a greedy hill-climbing algorithm within a factor of (1 − 1/e). Our GAUP exploits this property in the second stage to find approximate solutions (Section 3.3).

2.3. Singular Value Decomposition

SVD is a well-known matrix factorization method that factors an n × c matrix R into three matrices:

$$R = U \cdot \Sigma \cdot V^{T}, \qquad (2)$$

where U and V are two orthogonal matrices of size n × n and c × c, respectively. $\Sigma$ is a diagonal matrix of size n × c that has all singular values of matrix R as its diagonal entries. The rank of R is r (r ≤ c ≤ n). The nonzero diagonal entries of $\Sigma$ are positive and stored in decreasing order of magnitude.

Typical usage of SVD keeps the k largest diagonal values of $\Sigma$, thus producing a rank-k approximation matrix $R_k$. Over all rank-k matrices, $R_k$ minimizes the Frobenius norm $\|R - R_k\|$. In other words, SVD provides the best lower-rank approximations of the original matrix R. The value of k should be large enough to capture all the important structures in the original matrix and small enough to avoid overfitting errors.

In this paper, we use SVD to compute user preferences, which are represented as low-dimensional latent features learnt from a user-topic matrix.

3. Design

This section first formulates the definition of influence maximization for a topic, and then discusses details of the two-stage algorithm GAUP.

3.1. Problem formulation

We formulate the problem as follows. A social network is an undirected graph G = (V, E). Vertices in V are the nodes in the network, and edges in E model the relationships between nodes. There is a set of users’ documents with topic labels. Given a number K and a topic label T, the task is to generate a seed set S of K vertices, with the objective that the influence spread from the seed set is as large as possible for topic T.

In this paper, we use the DBLP dataset, where G is a coauthorship graph, vertices are authors of academic papers, and an edge indicates that the corresponding two authors have coauthored a paper. Parallel edges between two vertices denote

Table 1
Notations.

Terms          Descriptions
G = (V, E)     A graph G with vertex set V and edge set E
K              Number of seeds to be selected
R              Number of rounds of simulations
T              A selected topic (conference)
p              The uniform propagation probability
p_{v,w}        The propagation probability that node v activates node w
C_{a,T}        Node a’s preference on topic T
S              The seed set of K influential nodes
IS(S)          Influence spread of S
ISST(S, T)     Influence spread of S on a specific topic T

the number of papers coauthored by the two authors. In DBLP, we can get information on each paper, including authors and conferences. We regard each conference as a distinct topic and each paper as a labeled document. Table 1 lists the notations used in this paper.

3.2. Computing user preferences

We consider two different methods for computing a user’s preference for a topic. The first, the LSI model, is a matrix factorization technique that predicts user preferences from a large-scale user-topic matrix. The other, the VSM, is a content-based approach that uses document similarity as preferences for topics.

3.2.1. LSI-based user preferences

To compute user preferences for a specific topic, we adopt the LSI model, which can project user preferences into a reduced latent space. By assigning a weight to each latent item, our algorithm can compute the preference value of a user for a given topic. Specifically, for a given user-topic matrix M, our algorithm obtains preference predictions using the following steps:

1. Factor M using SVD to obtain U, $\Sigma$, and V.
2. Reduce the matrix $\Sigma$ to dimension k.
3. Compute the square root of the reduced matrix $\Sigma_k$ to obtain $\Sigma_k^{1/2}$.
4. Compute two matrices X and Y, where $X = U_k \Sigma_k^{1/2}$ and $Y = \Sigma_k^{1/2} V_k^{T}$.
5. Predict user a’s preference on topic T as follows:

$$C_{a,T} = C_0 + X(a) \cdot Y(T). \qquad (3)$$

X is a matrix of dimension n × k, describing the authors’ preferences in the k-dimensional latent space, i.e., the weights the authors give to the k latent features. X(a) denotes the a-th row of X, representing user a’s weights for the k latent features. Y is a matrix of dimension k × c, representing the relationship between the k-dimensional latent space and the c topics. Y(T) denotes the T-th column of Y, representing the weights of the k latent features for the T-th topic. $C_0$ is a constant. To compute the predictions, we simply calculate the dot product of the a-th row of X and the T-th column of Y and add $C_0$ back. $C_{a,T}$ is the prediction of the a-th author’s preference for topic T.

From the DBLP dataset, we can get a user-topic matrix M, where rows are authors and columns are conferences. The question is how to define the values in matrix M. Inspired by the rating matrix used in collaborative filtering, entries in matrix M are rated on a 5-star scale, where a higher score means that the user prefers the conference more strongly. Table 2 lists the rules for this translation. Following these rules, we obtain the user-topic matrix M, where $M_{ij}$ denotes the preference of author i for conference j. A zero value means that author i is not interested in conference j or has not published in the conference.

Table 2
Translation rules for converting publication counts to 1–5 ratings. The left column is the personal fraction of papers that a person publishes at a specific conference.

# in this conference / personal total #    Ratings
0                                          0
(0, 0.02]                                  1
(0.02, 0.05]                               2
(0.05, 0.1]                                3
(0.1, 0.15]                                4
(0.15, 1]                                  5
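The five LSI steps and the rating rules of Table 2 can be sketched with NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names, the toy `counts` matrix, and the default `c0 = 0.0` are assumptions, since this section does not state the value of the constant $C_0$.

```python
import numpy as np

def rating_from_fraction(fraction):
    """Map a personal publication fraction to a 0-5 rating (Table 2 rules)."""
    if fraction <= 0:
        return 0
    for upper, rating in [(0.02, 1), (0.05, 2), (0.1, 3), (0.15, 4)]:
        if fraction <= upper:
            return rating
    return 5

def lsi_preferences(M, k, c0=0.0):
    """Return predicted preferences C[a, T] = c0 + X(a) . Y(T) for all a, T.

    c0 stands in for the constant C_0 of Eq. (3); its value is an assumption.
    """
    U, s, Vt = np.linalg.svd(M, full_matrices=False)  # step 1: M = U Sigma V^T
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]       # step 2: keep k dimensions
    sqrt_sigma = np.diag(np.sqrt(s_k))                # step 3: Sigma_k^(1/2)
    X = U_k @ sqrt_sigma                              # step 4: n x k
    Y = sqrt_sigma @ Vt_k                             # step 4: k x c
    return c0 + X @ Y                                 # step 5: Eq. (3)

# Hypothetical publication counts: rows are authors, columns are conferences.
counts = np.array([
    [30, 2, 0],
    [1, 0, 40],
    [10, 10, 10],
    [0, 25, 1],
], dtype=float)
fractions = counts / counts.sum(axis=1, keepdims=True)
M = np.vectorize(rating_from_fraction)(fractions).astype(float)

C = lsi_preferences(M, k=2)
print(M)
print(np.round(C, 2))
```

With k equal to the full rank, the prediction reproduces M exactly; a smaller k smooths the ratings, which is what lets the model assign a nonzero preference to a conference where an author has not yet published.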

The resulting matrix is large and extremely sparse. Performing SVD on such a matrix is computationally intensive. To solve this problem, we remove authors who have published in fewer than ten conferences from the matrix, thus reducing the size of the matrix significantly. The intuition is that we are interested in the most influential people among all authors, and these important ones tend to publish heavily at many different conferences.

3.2.2. VSM-based user preferences

In the vector space model, we use all documents authored by a user to create a person-document [22]. Similarly, for each topic, we combine all documents of that topic to construct a topic-document. All documents form a lexicon set $L = \{t_1, t_2, \ldots, t_{|L|}\}$. Then we compute vectors for all person-documents and topic-documents, where the i-th element of the vector is the TF-IDF weight for lexicon term $t_i$. Finally, an active user’s preference on a topic is defined as the cosine similarity between their corresponding vectors, i.e.,

$$C_{a,T} = \frac{\sum_{j=1}^{|L|} a_j \cdot D(T)_j}{\sqrt{\sum_{j=1}^{|L|} a_j^2} \cdot \sqrt{\sum_{j=1}^{|L|} D(T)_j^2}}, \qquad (4)$$

where a is the person-document vector for the active user and D(T) is the topic-document vector for the topic T.

3.3. Greedy algorithm for mining top-K nodes

3.3.1. Extended Independent Cascade (EIC) model

In the traditional Independent Cascade (IC) model, the probability of an activation is uniform across all edges. However, such a uniform probability cannot capture the preferences of users on different topics. For mobile social networks, Wang et al. [2] extend the IC model to accommodate contact frequency. Similarly, GAUP considers that the probability of influence is not only associated with influence frequency, but is also correlated with user preferences: the stronger the preferences of v and w for topic T, the more likely the influence is to happen. To describe this property, we define $p_{v,w}$ as follows:

$$p_{v,w} = p \cdot F(C_{v,T}, C_{w,T}). \qquad (5)$$

This new influence model is called the Extended Independent Cascade (EIC) model. In the above formula, the original uniform probability p is weighted with a user preference function F(·). The values $C_{v,T}$ and $C_{w,T}$ are calculated by Eq. (3). In experiments, we choose the function $F(C_{v,T}, C_{w,T})$ to be the square of the product of $C_{v,T}$ and $C_{w,T}$. When nodes v and w have $c_{v,w}$ parallel edges, v has a total probability of $1 - (1 - p_{v,w})^{c_{v,w}}$ to activate w once it becomes active.

3.3.2. Approximation guarantee

Mining top-K influential nodes in the new EIC model is NP-hard, but can be approximated with a greedy hill-climbing algorithm. The EIC model is an edge-weighted version of the IC model, for which Kempe [9] proves that the influence function is submodular. Because the objective function is obviously non-negative and monotone, we can use the greedy strategy to obtain a $(1 - 1/e - \epsilon)$-approximation [9].
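The EIC activation rule can be sketched as a Monte Carlo simulation. The code below is an illustrative sketch, not the authors' implementation: the function names, the toy graph, and the preference values in [0, 1] are hypothetical, and clamping $p_{v,w}$ to 1 is an added assumption to keep it a valid probability (this section does not say how preference values are scaled).

```python
import random

def edge_probability(p, c_v, c_w, parallel_edges=1):
    """Total probability that active v activates w under the EIC model.

    p_vw = p * F(C_v, C_w) with F = (C_v * C_w)^2, as in Eq. (5); with c
    parallel edges the total probability is 1 - (1 - p_vw)^c.
    """
    p_vw = min(1.0, p * (c_v * c_w) ** 2)  # clamp to 1: assumption
    return 1.0 - (1.0 - p_vw) ** parallel_edges

def simulate_cascade(graph, pref, seeds, p, rng):
    """Run one cascade and return the activated node set RS(S).

    graph: dict node -> dict neighbor -> number of parallel edges
           (undirected, so each edge is stored in both directions).
    pref:  dict node -> preference C_{v,T} for the chosen topic.
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for v in frontier:  # each newly active node gets a single chance
            for w, count in graph[v].items():
                if w in active:
                    continue
                if rng.random() < edge_probability(p, pref[v], pref[w], count):
                    active.add(w)
                    next_frontier.append(w)
        frontier = next_frontier
    return active

def influence_spread(graph, pref, seeds, p, rounds=1000, seed=0):
    """Average |RS(S)| over the given number of simulated cascades."""
    rng = random.Random(seed)
    total = sum(len(simulate_cascade(graph, pref, seeds, p, rng))
                for _ in range(rounds))
    return total / rounds

# Hypothetical coauthor graph and topic preferences.
graph = {
    "a": {"b": 2, "c": 1},
    "b": {"a": 2, "c": 1},
    "c": {"a": 1, "b": 1, "d": 1},
    "d": {"c": 1},
}
pref = {"a": 0.9, "b": 0.8, "c": 0.5, "d": 0.1}
print(influence_spread(graph, pref, {"a"}, p=0.5))
```

Setting every preference to 1 and using a single edge per pair recovers the plain IC model with uniform probability p.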


3.3.3. The greedy mining algorithm

To compute the influence spread, both in mining the seed set and in evaluating the performance of algorithms, we use Monte Carlo simulations. Let S be the seed set computed by the greedy algorithm and RS(S) denote the resulting set of nodes activated by the influence cascade. The influence spread after the random process is |RS(S)|.

For any topic T, we obtain a user-preference vector UP(T) from the previous step. GAUP takes the graph G, the number K, and the vector UP(T) as input and generates a seed set S of cardinality K, with the purpose that the expected influence spread of the seed set S is as large as possible for topic T. The idea of the greedy algorithm is to run for K rounds, where each round finds the vertex that maximizes the incremental influence spread in that round.

Algorithm 1 describes the details of GAUP. First we calculate the propagation probability of each edge in the social network using the vector UP(T), based on the EIC model (lines 5–8). Then, for each vertex, we compute its influence spread and insert the vertex into a priority queue sorted by influence spread values (lines 10–17). The first vertex chosen is the head element of the queue (lines 19–21). For the remaining K − 1 vertices, the algorithm first pops a vertex from the priority queue. If the vertex has been visited in this round, then we have found the vertex for this round (lines 25–30). Otherwise, the increase in influence spread from adding this node to the current set S is computed and the vertex is inserted back into the priority queue (lines 32–37).

Note that lines 23–39 implement the CELF optimization [8], which takes advantage of the submodularity property of the influence maximization objective. Because of submodularity, during each round, the incremental influence spread of a large number of nodes does not need to be re-evaluated when their values in the previous round are less than the values of some other nodes in the current round. Previous work [8] has shown that this optimization can improve the performance by a factor of up to 700.
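The lazy re-evaluation idea behind CELF can be sketched as follows. This is a minimal illustration under stated assumptions: the `celf_greedy` name and its heap layout are hypothetical, and a deterministic set-coverage function stands in for the Monte Carlo estimate of influence spread so that the result is reproducible.

```python
import heapq

def celf_greedy(candidates, spread, k):
    """Select k seeds with lazy (CELF-style) greedy evaluation.

    spread(S) must be monotone and submodular for the lazy skipping to be
    valid. Returns the seed set, its spread, and the number of spread calls.
    """
    seeds, current = set(), 0.0
    # Heap entries: (-marginal_gain, tie_breaker, node, round_when_evaluated).
    heap = [(-spread({v}), order, v, 1) for order, v in enumerate(candidates)]
    heapq.heapify(heap)
    evaluations = len(heap)

    for round_no in range(1, k + 1):
        while heap:
            neg_gain, order, v, evaluated_in = heapq.heappop(heap)
            if evaluated_in == round_no:
                # The gain is fresh for this round; by submodularity no
                # stale entry left in the heap can exceed it.
                seeds.add(v)
                current += -neg_gain
                break
            # Stale gain from an earlier round: re-evaluate and re-insert.
            gain = spread(seeds | {v}) - current
            evaluations += 1
            heapq.heappush(heap, (-gain, order, v, round_no))
    return seeds, current, evaluations

# Hypothetical stand-in for influence spread: number of users covered.
neighborhoods = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}, "d": {1, 6}}

def coverage(seed_set):
    covered = set()
    for node in seed_set:
        covered |= neighborhoods[node]
    return len(covered)

seeds, value, evaluations = celf_greedy(list(neighborhoods), coverage, 2)
print(sorted(seeds), value, evaluations)  # ['a', 'c'] 6.0 5
```

In this toy run the second round re-evaluates only one candidate, so the selection uses 5 spread evaluations where a plain greedy pass would use 7 (4 in the first round and 3 in the second).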

Algorithm 1 The greedy algorithm for mining top-K influential nodes.
 1: Input: graph G = (V, E), number K, topic T, user preference vector UP(T)
 2: Output: a seed set S
 3:
 4: Initialize S = ∅, R = 10000, Q = ∅, current_is = 0
 5: // Compute activation probability of each edge
 6: for each edge (v, w) ∈ E do
 7:     p_{v,w} = p · F(C_{v,T}, C_{w,T})
 8: end for
 9:
10: // Compute influence spread of each vertex
11: for each vertex v ∈ V do
12:     is_v = 0
13:     for j = 1 to R do
14:         is_v += |RS(v)|/R
15:     end for
16:     insert node(v, is_v) to Q
17: end for
18:
19: // 1st vertex is the one that has the largest influence spread
20: node = pop(Q)
21: S = {node.v}, current_is = node.is
22:
23: for i = 2 to K do
24:     while true do
25:         node = pop(Q)
26:         if node.v has been visited in this round i then
27:             S = S ∪ {node.v}
28:             current_is += node.is
29:             break
30:         end if
31:
32:         v = node.v, is = 0
33:         for j = 1 to R do
34:             is += |RS(S ∪ {v})|
35:         end for
36:         is = is/R − current_is
37:         insert node(v, is) to Q
38:     end while
39: end for

4. Evaluation

In this section, we conduct experiments on an academic coauthorship dataset to answer the following questions:

• Compared with traditional approaches, how well does our GAUP algorithm perform in terms of influence spread on a selected topic?
• Compared with other approaches, how well can GAUP help find domain experts?
• Which of the LSI-based and VSM-based user preference models is better?

4.1. Experimental setup

Dataset. We extract an academic co-author network from DBLP, which provides bibliographic information on major computer science journals and proceedings. DBLP indexes more than one million articles. In the extracted co-author network, each node represents an author and an edge represents that the corresponding two authors have collaborated once. Parallel edges mean that the authors have collaborated more than once. We only select the papers published in conferences and extract the authors who have published in at least 10 conferences. The generated network, with papers from 411 conferences, contains 8627 nodes and 91 574 edges. For the VSM-based user preference, we use paper titles to represent documents, because DBLP does not contain the full content of papers and obtaining it would be expensive. We use standard word stemming and remove stop words.

Influence model. We conduct experiments to compare the state-of-the-art greedy algorithm using the IC model and GAUP using the EIC model. In the IC model, the probability p is set to 0.01 [9]. In the EIC model, the probability $p_{v,w}$ is computed by Eq. (5), where $F(C_{v,T}, C_{w,T}) = (C_{v,T} \cdot C_{w,T})^2$ in this paper.

Table 3
The algorithms.

Algorithms   Descriptions
GAUP         Greedy algorithm using LSI-based user preferences
GA           The state-of-the-art greedy algorithm
CF           Collaborative filtering algorithm based on SVD
Random       Baseline of choosing random nodes in the graph
HITS         Hyperlink-Induced Topic Search algorithm [17]

Algorithms and metrics. Table 3 lists the algorithms compared in this paper. GA and GAUP are the greedy algorithms mentioned above. CF is the collaborative filtering algorithm based on SVD that selects the nodes with the largest $C_{v,T}$ values as the seed set. For expert search, we also consider the HITS algorithm [17]. For the co-authors of a paper, we add links pointing to the first author from all other co-authors. For evaluation, the following metrics are used:

• ISST(S, T): influence spread of seed set S on a specific topic T. This metric evaluates the influence on a specific topic that takes account of user preferences. ISST is defined as Eq. (6), where node v ranges over all nodes activated by seed set S:

$$\mathrm{ISST}(S, T) = \sum_{v \in RS(S)} C_{v,T}^{2}. \qquad (6)$$

• IS(S): the traditional influence spread of seed set S that does not take user preferences into account.

To obtain IS(S) and ISST(S, T) for each seed set S, we run Monte Carlo simulations using the corresponding IC and EIC models 10,000 times and take the average. For all these algorithms, we compare their IS(S) and ISST(S, T) with seed set sizes of 5, 10, 15, and 20.

4.2. Study on influence spread

This experiment evaluates the performance of the proposed GAUP algorithm in a real co-author network and compares GAUP with other algorithms on the specific topic of SIGCOMM.

Fig. 1. A comparison of ISST on a specific topic.

Fig. 1 shows the ISST of various algorithms using the EIC model. From the figure, we can observe that Random, as the baseline, performs very badly. GAUP performs significantly better than the other algorithms when K ≥ 15. GAUP outperforms GA by about 10% when K is 15 and by more than 30% when K is 20. This is because GAUP can find more important nodes with respect to the specified topic. When K is 5 or 10, GA finds nodes that can activate more nodes than GAUP (shown in Fig. 2). As a result, the cumulative influence of GA on the topic exceeds that of GAUP. CF performs worse than GA by about 30%, because CF does not consider the influence diffusion process.

Fig. 2. A comparison of influence spread of different algorithms.

Fig. 2 compares the IS(S) of various algorithms in the IC model. Because IS(S) does not take user preferences into account, the performance of GAUP is not as good as that of GA, since GA can find the most influential nodes across all topics. GAUP performs significantly better than Random and CF, because GAUP can find the most influential nodes in a specific topic. CF does not perform well because the influence diffusion process is not used.

Table 4
IS(comm) and IS(kdd).

Seed set   5     10    15    20
comm       11    30    47    86
kdd        30    41    56    89

Table 5
Ratios of ISST of comm over ISST of kdd for different topics.

Topic      5      10     15     20
SIGCOMM    1.67   2.47   2.91   2.81
SIGKDD     0.13   0.16   0.31   0.27

ISST vs. IS. This experiment studies the effectiveness of ISST and IS when considering users’ topic preferences. We first use the GAUP algorithm to generate two seed sets, comm and kdd, for topics SIGCOMM and SIGKDD, respectively. With these two seed sets, we compute their IS and ISST values. Table 4 shows that when K increases, IS(comm) and IS(kdd) are becoming very
Table 4 shows that when K increases, IS(comm) and IS(kdd) are becoming very

close. In contrast, Table 5 illustrates the ratios of ISST of comm over ISST of kdd for different topics. For topic SIGCOMM, we can observe that ISST(comm, SIGCOMM) is significantly larger than ISST(kdd, SIGCOMM), by more than a factor of two when K ≥ 15. On the other hand, ISST(comm, SIGKDD) is only about one third of ISST(kdd, SIGKDD) for topic SIGKDD when K ≥ 15. These results demonstrate that ISST effectively measures the influence spread on specific topics and should be used as the metric, rather than the traditional IS, when considering user preferences.

4.3. Expert search

This experiment studies whether the GAUP algorithm can find experts in a specific domain. Again, we choose SIGCOMM as the topic. Table 6 shows the seed sets obtained by GA, CF, GAUP, and HITS when K = 5. None of the authors selected by GA is in the field of networks, because GA does not take domain preference into account and tends to find experts in other domains. In contrast, GAUP, CF, and HITS can successfully find networking experts. We verified that all these people have published in well-known networking conferences. Apart from Jennifer Rexford, the four other experts selected by GAUP and CF are all different. James F. Kurose appears in both the GAUP and HITS results. Joseph Naor has published many papers at networking conferences, but none in SIGCOMM. This indicates that our algorithm can effectively find domain experts.

To further study the effectiveness of GAUP, CF, and HITS, we designed the following experiments to compare GAUP and CF, and to demonstrate the topic drift problem of HITS.

4.3.1. GAUP vs. CF

With the GAUP and CF results given in Table 6, it is generally hard to judge which set of users is better. To address this problem, we designed the following experiment. For the field of networks, we choose two topics, SIGCOMM and ICCCN. Then the GAUP and CF algorithms are run to find the top 5, 10, 15, and 20 experts. Finally, we consider the experts that are found for both topics.

Fig. 3 illustrates the results of these two algorithms. We can observe that for the top 5, 10, and 15 experts, CF finds none that appear in both topics. In contrast, GAUP discovers 2, 4, and 5 overlapping ones for the top 5, 10, and 15, respectively. CF finds only one overlapping expert among the top 20 users, in which case GAUP finds 7.

Table 6
The seed set of GA, CF, and GAUP when K = 5.

GAUP                  GA                    CF                    HITS
James F. Kurose       Wei-Ying Ma           Simon S. Lam          Donald F. Towsley
Jennifer Rexford      Philip S. Yu          Jennifer Rexford      James F. Kurose
Joseph Naor           Wen Gao               Vishal Misra          Zhen Liu
Deborah Estrin        Mahmut T. Kandemir    Donald F. Towsley     Weibo Gong
Thomas E. Anderson    Thomas S. Huang       Lixia Zhang           Vishal Misra

Fig. 3. A comparison of GAUP and CF for finding experts in both ICCCN and SIGCOMM.

Table 7
Topic drift of the HITS algorithm using three different seed sets.

20-set                30-set                31-set
Donald F. Towsley     Donald F. Towsley     Wen Gao
James F. Kurose       James F. Kurose       Xilin Chen
Zhen Liu              Zhen Liu              Shiguang Shan
Weibo Gong            Krithi Ramamritham    Qingming Huang
Vishal Misra          Weibo Gong            Debin Zhao

Table 8
Top five users on topic SIGCOMM ranked by LSI and VSM preferences.

LSI                   VSM
Simon S. Lam          Donald F. Towsley
Jennifer Rexford      Scott Shenker
Vishal Misra          Lixia Zhang
Donald F. Towsley     J.J. Garcia-Luna-Aceves
Lixia Zhang           James F. Kurose

This experiment shows that GAUP can find common experts for different sub-topics in a domain. Thus, these experts are more likely to be the most influential ones in the domain. The CF algorithm does not consider the influence spread process and only finds experts for specific topics. As a result, CF does not produce results as good as those of GAUP.

4.3.2. Topic drift of HITS

The effectiveness of HITS depends on the quality of the initial set of nodes. Previous work [23,24] has shown that HITS may suffer from topic drift. To illustrate this problem, we run the HITS algorithm with three different seed sets for topic SIGCOMM:

• 20-set: we select 20 users from the experts returned by GAUP and CF for topic SIGCOMM;
• 30-set: we add 10 new users to 20-set, some of whom are not in the field of networks;
• 31-set: we add a new user, Wen Gao, to 30-set.

Table 7 shows the top five experts found by the HITS algorithm for these three seed sets. The results of 20-set and 30-set differ in only one person, and all of them are well-known networking people. However, none of the results for 31-set is in the field of networks. This is because Wen Gao and his neighboring nodes are well connected in the graph generated by HITS. As a result, the most highly ranked authorities and hubs deviate from the original topic of SIGCOMM.

This experiment demonstrates that the quality of HITS is highly dependent on the seed set. Because GAUP does not need such seeds, a direct comparison of HITS and our GAUP is difficult. We emphasize that GAUP is more reliable than HITS for finding experts, because it considers user preferences in the influence model.
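To make the seed-set dependence concrete, the following sketch runs the standard iterative HITS update on links built the way this paper builds them (every other co-author points to the first author). The author names, the `papers` list, and the function names are hypothetical and serve only as an illustration; this is not the experimental code.

```python
import math

def build_links(papers):
    """For each paper, link every other co-author to the first author."""
    edges = []
    for authors in papers:
        first = authors[0]
        edges.extend((coauthor, first) for coauthor in authors[1:])
    return edges

def normalize(scores):
    """Scale scores to unit Euclidean length (no-op if all are zero)."""
    norm = math.sqrt(sum(x * x for x in scores.values())) or 1.0
    return {n: x / norm for n, x in scores.items()}

def hits(edges, iterations=50):
    """Return (authority, hub) scores; an edge (u, v) means u links to v."""
    nodes = sorted({n for edge in edges for n in edge})
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum of hub scores of nodes linking in.
        new_auth = {n: 0.0 for n in nodes}
        for u, v in edges:
            new_auth[v] += hub[u]
        # Hub update: sum of authority scores of nodes linked to.
        new_hub = {n: 0.0 for n in nodes}
        for u, v in edges:
            new_hub[u] += new_auth[v]
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub

# Hypothetical papers; the first name in each list is the first author.
papers = [["alice", "bob", "carol"], ["alice", "dave"], ["bob", "carol"]]
edges = build_links(papers)
auth, hub = hits(edges)
print(sorted(auth, key=auth.get, reverse=True))
```

Because the scores are determined entirely by the links among the nodes in the starting set, adding a single author with a densely connected neighborhood can move the top-ranked authorities away from the intended topic.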

Fig. 4. A comparison of the top 30 users on topic SIGCOMM ranked by LSI and VSM preferences.
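The two preference models compared in Section 4.4 can be illustrated on toy data. The sketch below assumes that the LSI preference is read from a rank-k SVD reconstruction of a user-by-topic count matrix, and that the VSM preference is the cosine similarity between a user's term vector and a topic's term vector. The matrices, the vocabulary, and the function names `lsi_preferences` and `vsm_preferences` are hypothetical; they are not the data or code behind Fig. 4.

```python
import numpy as np

def lsi_preferences(user_topic, k):
    """Rank-k SVD reconstruction of a user-by-topic count matrix."""
    u, s, vt = np.linalg.svd(user_topic, full_matrices=False)
    return (u[:, :k] * s[:k]) @ vt[:k, :]

def vsm_preferences(user_terms, topic_terms):
    """Cosine similarity between each user's term vector and a topic's term vector."""
    norms = np.linalg.norm(user_terms, axis=1) * np.linalg.norm(topic_terms)
    return (user_terms @ topic_terms) / norms

# Toy paper counts (hypothetical). Columns: SIGCOMM, INFOCOM, CVPR.
user_topic = np.array([
    [5.0, 4.0, 0.0],   # user 0: networking
    [4.0, 5.0, 0.0],   # user 1: networking
    [0.0, 0.0, 6.0],   # user 2: vision
    [0.0, 3.0, 0.0],   # user 3: networking, but no SIGCOMM paper
])
lsi = lsi_preferences(user_topic, k=2)

# Toy term counts (hypothetical). Columns: "routing", "congestion", "face".
user_terms = np.array([
    [3.0, 2.0, 0.0],
    [2.0, 3.0, 0.0],
    [0.0, 0.0, 3.0],
    [2.0, 0.0, 0.0],
])
sigcomm_terms = np.array([1.0, 1.0, 0.0])
vsm = vsm_preferences(user_terms, sigcomm_terms)

print("LSI SIGCOMM scores:", np.round(lsi[:, 0], 2))
print("VSM SIGCOMM scores:", np.round(vsm, 2))
```

In this toy data, user 3 has no SIGCOMM paper, yet the rank-2 reconstruction assigns a positive SIGCOMM score because user 3 shares the latent networking factor with users 0 and 1. The vision user keeps a score of essentially zero under both models.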

4.4. A comparison of user preference models

This experiment studies the effectiveness of LSI-based and VSM-based user preference models. For both approaches, we compute the top 30 users for the topic of SIGCOMM; the top five users are listed in Table 8, and the results for all 30 users are shown in Fig. 4. We verified that all these users are interested in networks, and all top five users have published more than five papers in SIGCOMM. In fact, two users, Donald F. Towsley and Lixia Zhang, appear in the top five for both approaches.

However, when looking at the top 30 users, Fig. 4 shows that the VSM becomes less accurate, as only 20 out of 30 have actually published papers in SIGCOMM. In contrast, LSI achieves 29 out of 30. This indicates that there may be subtle differences between a user's preference and the topic of SIGCOMM, which the VSM cannot capture, especially for the users ranked 21–30. LSI models user preferences via latent variables and is more accurate.

In summary, the LSI approach performs better than the VSM. Because LSI does not rely on text content and uses only structure information among users, LSI is more applicable to social networks.

5. Related work

5.1. Influence maximization

The "influence maximization problem" was first studied in the context of viral marketing strategies [25,26] and revenue maximization [5]. These studies target marketing strategies, while this paper focuses on finding the most influential people in a social network.
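The problem can be made concrete with a small sketch. Greedy selection repeatedly adds the node with the largest marginal gain in expected spread. Because spread is submodular, a marginal gain computed in an earlier round is an upper bound on the current one, so most nodes need not be re-evaluated (the lazy evaluation idea behind CELF [8]). The code below assumes an independent-cascade-style diffusion and, purely for illustration, takes the activation probability of a node to be that node's preference for the topic. The graph, the preference values, and the names `simulate_spread` and `lazy_greedy` are hypothetical, and this is not GAUP's actual diffusion model.

```python
import heapq
import random

def simulate_spread(graph, prob, seeds, runs, rng):
    """Average number of activated nodes over `runs` cascade simulations."""
    total = 0
    for _ in range(runs):
        active = set(seeds)
        frontier = list(seeds)
        while frontier:
            node = frontier.pop()
            for nbr in graph.get(node, ()):
                # Each newly active node gets one chance to activate each neighbor.
                if nbr not in active and rng.random() < prob(node, nbr):
                    active.add(nbr)
                    frontier.append(nbr)
        total += len(active)
    return total / runs

def lazy_greedy(graph, prob, k, runs=20, seed=0):
    """Select k seeds greedily, re-evaluating marginal gains lazily (CELF-style)."""
    rng = random.Random(seed)
    nodes = set(graph) | {v for nbrs in graph.values() for v in nbrs}
    # Heap entries: (-marginal gain, node, round in which the gain was computed).
    heap = [(-simulate_spread(graph, prob, [v], runs, rng), v, 0)
            for v in sorted(nodes)]
    heapq.heapify(heap)
    chosen, spread = [], 0.0
    while len(chosen) < k and heap:
        neg_gain, node, stamp = heapq.heappop(heap)
        if stamp == len(chosen):
            # The gain is fresh and at the top of the heap; stale entries are
            # upper bounds that are already smaller, so this node is the best.
            chosen.append(node)
            spread += -neg_gain
        else:
            # Stale gain: recompute against the current seed set and reinsert.
            gain = simulate_spread(graph, prob, chosen + [node], runs, rng) - spread
            heapq.heappush(heap, (-gain, node, len(chosen)))
    return chosen, spread

# Hypothetical toy graph: A has three followers, B has four.
graph = {"A": ["a1", "a2", "a3"], "B": ["b1", "b2", "b3", "b4"]}
# Hypothetical topic preferences: only A's followers care about the topic.
preference = {"a1": 1.0, "a2": 1.0, "a3": 1.0,
              "b1": 0.0, "b2": 0.0, "b3": 0.0, "b4": 0.0}

uniform_seeds, uniform_spread = lazy_greedy(graph, lambda u, v: 1.0, k=1)
pref_seeds, pref_spread = lazy_greedy(graph, lambda u, v: preference[v], k=1)
print("uniform probability:", uniform_seeds, uniform_spread)
print("preference-weighted:", pref_seeds, pref_spread)
```

With a uniform probability the node with the most followers is selected. Once per-node preferences enter the activation probability, the selection moves to the node whose followers are interested in the topic. The probabilities are set to 0 or 1 only to make the toy output deterministic.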


Kempe et al. [9] formulate influence maximization as a discrete optimization problem and prove it to be NP-hard. They propose a greedy approximation algorithm that requires significant computation. Several recent studies [8,10,2,1] aim at improving the algorithm's efficiency. CELF [8] optimizes the greedy algorithm by exploiting the submodularity property to reduce the number of evaluations of the influence spread of users, which is also used by GAUP. NewGreedy [10] removes edges that will not contribute to propagation to get a smaller graph. MixedGreedy [10] combines CELF and NewGreedy, where the first round uses NewGreedy and the remaining rounds use CELF. CGA [2] first detects communities in a social network, and then employs a dynamic programming algorithm for selecting communities to find influential users. Kimura and Saito [1] propose shortest-path-based influence cascade models and provide efficient algorithms for computing influence spread under these models. However, they do not directly address the efficiency issue of the greedy algorithms studied by Kempe et al., because the influence cascade models are different.

None of these approaches considers user preferences during influence diffusion; they instead use a uniform probability. GAUP differs from these approaches by taking user preferences into account.

5.2. Expert search

Previous work has proposed two ways to find experts. The first one leverages information retrieval techniques [22,27]. The P@NOPTIC Expert system [22] first collects all documents authored by an individual and concatenates them to create a "person-document". Then, a standard document search system runs over these person-documents, where experts for an input query are returned as the people corresponding to the highest-ranked person-documents. An improvement by Balog et al. [27] is to first collect relevant documents and then use them to find experts. The specific domain of an expert is modeled by keywords in the information retrieval approach.
In contrast, our approach uses LSI-computed hidden variables to represent user preferences.

The second method is the link-based method [6,7,28], which exploits the structure of social networks. Campbell et al. [6] use a modified version of the Hyperlink-Induced Topic Search (HITS) [17] algorithm to identify authorities. Zhang et al. [7] study the expertise finding problem in a social network of an online forum, where variants of PageRank and HITS are tested. Ghosh and Lerman [29] compared a number of influence metrics over Digg.com data and found that α-centrality is the best predictor of influence. We have shown that given the right seed set, HITS can identify experts in a domain. However, HITS suffers from the problem of topic drift [23,24], which can significantly affect the quality of results.

Our work is different from these two types of approaches, as finding experts is an application of our top-K mining algorithm, which explicitly models user preferences and is based on the influence maximization model.

5.3. Collaborative filtering

Collaborative filtering [30] is one of the most successful recommendation approaches. Unlike content-based approaches, CF approaches make predictions from large-scale item-user matrices. Generally, CF approaches can be classified into memory-based and model-based approaches. Memory-based schemes achieve high accuracy by exploiting similarity functions of items and users [31–34]. Model-based CF approaches [35–37] first cluster all items or users into classes using machine learning algorithms, and then use these classes for prediction. Recently, CF approaches based on matrix factorization have been shown to be very effective in the Netflix competition [38,39].

GAUP also adopts matrix factorization, which is used to estimate user preferences for the subsequent mining of top nodes. The use of matrix factorization in GAUP differs from CF in that GAUP explicitly filters out many rows in the matrix to reduce the matrix size, while CF has to work with the whole matrix. The goal of CF is to find the most relevant items for a given user, which is different from our purpose of searching for the most influential users. Additionally, our experiments have shown that collaborative filtering alone cannot find the most influential nodes.

6. Conclusions

For the influence maximization problem, we have proposed a new GAUP algorithm to mine top-K influential nodes in social networks based on user preferences. In contrast, previous work does not take user preferences into account and only considers a uniform probability model for influence diffusion. Our GAUP algorithm works in two stages. To compute user preferences in the first stage, we study two different models: an LSI model, which projects user preferences into a reduced latent space, and a VSM-based model. GAUP adopts a greedy algorithm in the second stage to find the most influential nodes in the network for a topic.

Evaluation results with an academic social network demonstrate that our GAUP algorithm can maximize the influence spread on a topic. For the two models of user preferences, experimental results show that the LSI-based model is more accurate. We have compared GAUP with the SVD-based collaborative filtering algorithm and HITS for expert search, and have found that GAUP is more likely to find the most influential domain experts than collaborative filtering. In addition, GAUP is more reliable than HITS, because HITS suffers from the problem of topic drift.

Acknowledgments

We thank the anonymous referees of FCST and FGCS for their helpful comments on earlier drafts of this paper. This work was supported in part by NSFC (Grant No. 61003012).
References

[1] M. Kimura, K. Saito, Tractable models for information diffusion in social networks, in: PKDD, in: LNAI, vol. 4213, 2006, pp. 259–271.
[2] Y. Wang, G. Cong, G. Song, K. Xie, Community-based greedy algorithm for mining top-k influential nodes in mobile social networks, in: KDD, Washington, DC, 2010, pp. 1039–1048.
[3] J. Leskovec, E. Horvitz, Planetary-scale views on a large instant-messaging network, in: WWW, Beijing, China, 2008, pp. 915–924.
[4] H. Ma, H. Yang, M.R. Lyu, I. King, Mining social networks using heat diffusion processes for marketing candidates selection, in: Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM, Napa Valley, CA, 2008, pp. 233–242.
[5] J. Hartline, V.S. Mirrokni, M. Sundararajan, Optimal marketing strategies over social networks, in: WWW, Beijing, China, 2008, pp. 189–198.
[6] C.S. Campbell, P.P. Maglio, A. Cozzi, B. Dom, Expertise identification using email communications, in: CIKM, New Orleans, LA, 2003, pp. 528–531.
[7] J. Zhang, M.S. Ackerman, L. Adamic, Expertise networks in online communities: structure and algorithms, in: WWW, Banff, Alberta, Canada, 2007, pp. 221–230.
[8] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, N. Glance, Cost-effective outbreak detection in networks, in: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD, San Jose, CA, 2007, pp. 420–429.
[9] D. Kempe, J. Kleinberg, E. Tardos, Maximizing the spread of influence through a social network, in: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD, Washington, D.C., 2003, pp. 137–146.
[10] W. Chen, Y. Wang, S. Yang, Efficient influence maximization in social networks, in: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD, Paris, France, 2009, pp. 199–208.
[11] K. Saito, M. Kimura, K. Ohara, H. Motoda, Learning continuous-time information diffusion model for social behavioral data analysis, in: Proceedings of the 1st Asian Conference on Machine Learning: Advances in Machine Learning, ACML, Nanjing, China, 2009, pp. 322–337.
[12] K. Saito, M. Kimura, K. Ohara, H. Motoda, Behavioral analyses of information diffusion models by observed data of social network, in: Proceedings of the 2010 International Conference on Social Computing, Behavioral Modeling, Advances in Social Computing Prediction, SBP10, 2010, pp. 149–158.


[13] J. Tang, J. Sun, C. Wang, Z. Yang, Social influence analysis in large-scale networks, in: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD, Paris, France, 2009, pp. 807–816.
[14] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, R. Harshman, Indexing by latent semantic analysis, Journal of the American Society for Information Science 41 (6) (1990) 391–407.
[15] M. Baumann, U. Helmke, Singular value decomposition of time-varying matrices, Future Generation Computer Systems 19 (3) (2003) 353–361.
[16] Y. Zhang, J. Zhou, J. Cheng, Preference-based top-k influential nodes mining in social networks, in: The 6th International Conference on Frontier of Computer Science and Technology, FCST, Changsha, China, 2011, pp. 1512–1518.
[17] J.M. Kleinberg, Authoritative sources in a hyperlinked environment, Journal of the ACM 46 (5) (1999) 604–632.
[18] J. Cadzow, SVD representation of unitarily invariant matrices, IEEE Transactions on Acoustics, Speech and Signal Processing 32 (3) (1984) 512–516.
[19] D. Gleich, L. Zhukov, SVD based term suggestion and ranking system, in: Proceedings of the 4th IEEE International Conference on Data Mining, 2004, pp. 391–394.
[20] G. Cornuejols, M. Fisher, G. Nemhauser, Location of bank accounts to optimize float, Management Science 23 (8) (1977) 789–810.
[21] G. Nemhauser, L. Wolsey, M. Fisher, An analysis of the approximations for maximizing submodular set functions, Mathematical Programming 14 (1978) 265–294.
[22] N. Craswell, D. Hawking, A.-M. Vercoustre, P. Wilkins, P@NOPTIC expert: searching for experts not just for documents, in: Ausweb, 2001, pp. 21–25.
[23] K. Bharat, M.R. Henzinger, Improved algorithms for topic distillation in a hyperlinked environment, in: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR, Melbourne, Australia, 1998, pp. 104–111.
[24] L. Li, Y. Shang, W. Zhang, Improvement of HITS-based algorithms on web documents, in: WWW, Honolulu, Hawaii, USA, 2002, pp. 527–535.
[25] P. Domingos, M. Richardson, Mining the network value of customers, in: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD, San Francisco, CA, 2001, pp. 57–66.
[26] M. Richardson, P. Domingos, Mining knowledge-sharing sites for viral marketing, in: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD, Edmonton, Alberta, Canada, 2002, pp. 61–70.
[27] K. Balog, L. Azzopardi, M. de Rijke, Formal models for expert finding in enterprise corpora, in: SIGIR, 2006, pp. 43–50.
[28] Y. Wang, A. Nakao, Poisonedwater: an improved approach for accurate reputation ranking in p2p networks, Future Generation Computer Systems 26 (8) (2010) 1317–1326.
[29] R. Ghosh, K. Lerman, Predicting influential users in online social networks, in: The Fourth SNA-KDD Workshop, in Conjunction with the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD, 2010.
[30] X. Su, T.M. Khoshgoftaar, A survey of collaborative filtering techniques, Advances in Artificial Intelligence (2009).
[31] G.-R. Xue, C. Lin, Q. Yang, W. Xi, H.-J. Zeng, Y. Yu, Z. Chen, Scalable collaborative filtering using cluster-based smoothing, in: SIGIR'05, 2005, pp. 114–121.
[32] J. Wang, A.P. de Vries, M.J.T. Reinders, Unifying user-based and item-based collaborative filtering approaches by similarity fusion, in: SIGIR'06, ACM Press, 2006, pp. 501–508.




[33] D. Zhang, J. Cao, J. Zhou, M. Guo, V. Raychoudhury, An efficient collaborative filtering approach using smoothing and fusing, in: Proceedings of the 2009 International Conference on Parallel Processing, ICPP, 2009, pp. 558–565.
[34] B. Sarwar, G. Karypis, J. Konstan, J. Riedl, Item-based collaborative filtering recommendation algorithms, in: WWW'01, Hong Kong, Hong Kong, 2001, pp. 285–295.
[35] G. Linden, B. Smith, J. York, Amazon.com recommendations: item-to-item collaborative filtering, IEEE Internet Computing 7 (1) (2003) 76–80.
[36] T. Zhang, V.S. Iyengar, Recommender systems using linear classifiers, Journal of Machine Learning Research 2 (2002) 313–334.
[37] Y. Zhang, J. Koren, Efficient Bayesian hierarchical user modeling for recommendation system, in: SIGIR'07, 2007, pp. 47–54.
[38] R. Salakhutdinov, A. Mnih, Probabilistic matrix factorization, in: NIPS, 2008.
[39] Y. Koren, R. Bell, C. Volinsky, Matrix factorization techniques for recommender systems, IEEE Computer 42 (8) (2009) 42–49.

Jingyu Zhou received the B.S. degree in Computer Science from Zhejiang University, China, in 1999. He received the M.S. and Ph.D. degrees in Computer Science from the University of California at Santa Barbara in 2003 and 2006, respectively. He joined the Software Institute at Shanghai Jiao Tong University in 2006. He is generally interested in information retrieval, systems, and security. His current work is on vertical web search, including information retrieval, Chinese analysis, and search systems. His past work includes network and application security, parallel scientific computing, cluster-based storage systems and middleware systems, and cluster-based Internet services.

Yunlong Zhang received his bachelor's degree in Electrical Engineering from Tsinghua University, China, in 2006. He was a software engineer at PetroChina from 2006 to 2007. He is currently a master's student at Shanghai Jiao Tong University. His research interests are social networks and machine learning.

Jia Cheng received his bachelor's degree in Software Engineering from Jilin University, China, in 2007. He was a software engineer in the IA Division of Neusoft Co., Ltd. from 2007 to 2008. He is currently a master's student at Shanghai Jiao Tong University. His research interests are text mining and machine learning.