A framework for WWW user activity analysis based on user interest


Knowledge-Based Systems 21 (2008) 905–910

Contents lists available at ScienceDirect

Knowledge-Based Systems journal homepage: www.elsevier.com/locate/knosys

Jianping Zeng *, Shiyong Zhang, Chengrong Wu

Department of Computing & Information Technology, Fudan University, Shanghai 200433, PR China

Article info

Article history: Received 1 November 2007; received in revised form 29 March 2008; accepted 30 March 2008; available online 6 April 2008

Keywords: User activity; User interest; Interactive website; Integration framework

Abstract

User activity plays an important role in the operation of many kinds of websites. Researchers have made significant progress on user activity modeling in several areas of Web information processing, such as Web log mining and blog social network analysis. However, user activity on interactive websites, such as BBS and discussion group websites, has attracted little attention. In addition, integrating the software modules responsible for user activity analysis is a critical issue. We propose a framework, which can be easily implemented, for analyzing user activity on an interactive website. In the framework, the user activity model is represented by a hidden Markov model (HMM), and a method for computing user interest is provided. User activity analysis tasks, such as user group discovery, can be performed within the framework. In the experiments, we investigate three problems related to user interest and user activity. The experimental results show that the framework is helpful for user activity analysis and that the proposed user interest describes user activity on an interactive website well; hence it can be used as an effective measure in analyzing user activity.

© 2008 Elsevier B.V. All rights reserved.

* Corresponding author. Tel.: +86 13564317273. E-mail addresses: [email protected] (J. Zeng), [email protected] (S. Zhang), [email protected] (C. Wu).

0950-7051/$ - see front matter © 2008 Elsevier B.V. All rights reserved. doi:10.1016/j.knosys.2008.03.049

1. Introduction

With the advance of Web technology, many kinds of websites have appeared, such as blogs and electronic commerce sites. User activity plays an important role in the operation of these websites. In the blog space, many users cite articles in other blogs by placing links, and the large number of such links indicates the existence of a social network [1–3]. On an electronic commerce website, service providers usually improve their service with personalized recommendation algorithms, which are mainly based on the results of user habit and activity analysis [4,5]. Hence, it is both interesting and necessary to study user activity and its effect on analysis tasks.

Significant progress has been made on user activity modeling by researchers in the field of Web information processing [1,13,14]. Here, we briefly review research on user activity in Web log mining, blog social network analysis, and topic tracking.

One of the main tasks in Web log mining is to discover hidden information, such as user browsing habits, and to provide it to website managers or service providers for improving content arrangement. Several kinds of models have been proposed to describe user activity in this area. A Markov model is used to represent user activity over a continuous time period, and the next web page is then predicted based on the model [15,16].

Blog social network analysis is also related to user activity. Blogs are usually associated with the personal information of their authors, such as age, geographic location, and personal interests. An author can act as a user who visits other blogs, finds articles of interest, and places links to them in his or her own blog. These activities lead to an interlinked structure of blogs whose authors share some friendship and location proximity [2,3]. Hence, user activity information is helpful for discovering social networks. Several similarity measures, for example measures based on text similarity or on user browsing times, have been proposed to describe the difference between two activities [17].

In the field of topic detection and tracking (TDT) on the WWW, user interest in text is considered useful for answering a range of important queries about the content of a document collection. To achieve this goal, an author-topic model is required to describe topics and user interest. With an appropriate author model, we can establish which subjects an author writes about, which authors are likely to have written documents similar to an observed document, and which authors produce similar work. There are several kinds of topic models, such as pLSI [10], LDA [6], AT [8,12], and the finite mixture model [11]. They treat documents as mixtures of topics, and topics as collections of keywords. As an extension of the author model, an AT model that incorporates the sender and receiver of e-mail documents can capture the interest of two users in the text [7].

On some kinds of interactive websites, such as BBS and group discussion sites, many thousands of users visit the website every day, post articles or reply to them, and thereby sustain continuous discussion on the website. It is important to investigate these users, especially their activity and their interest in all kinds of topics. However, the approaches mentioned above are not suitable for capturing this user activity, since they do not provide a method for computing user interest. Without the user interest, it is difficult to investigate a user's response to a given topic. In addition, modeling user activity from the large number of articles on a website is a complex task, which consists of crawling articles, text processing, model training, user activity analysis, and so on. Hence, the integration of these software modules is also a critical issue in performing user activity analysis.

In this paper, we propose an integration framework for the analysis of user activity on an interactive website. In the framework, the user activity model is represented by a hidden Markov model (HMM), and a method for computing user interest is provided. The main contributions are as follows. Firstly, we propose a reasonable definition of user interest and investigate its effectiveness in user activity analysis. Secondly, we propose an algorithm for user group discovery based on user interest, which obtains a more suitable result than conventional methods. Thirdly, we propose a framework for user interest and activity analysis, which can be easily implemented and in which several analysis tasks can be performed.

The paper is organized as follows. In the next section, we describe the integration framework and present the formulation of user interest and the related algorithms. The user group discovery algorithm is also presented. In the third section, we use the framework to conduct several experiments on real-world data and analyze the results. In the last section, we conclude the work and point out future research.

2. Framework for WWW user activity analysis

2.1. Description of the framework

User activity is an important element in the framework of user analysis.
However, to protect personal privacy, it is unsuitable to use a user's registered information, such as age, education, and interests. On an interactive website, each user can post an article, and other people can reply to it. A user usually posts or replies to an article according to his or her interest. Hence, we construct a user activity model based on the articles the user posts or replies to on the website, which requires several operations, such as crawling and preprocessing. The framework is shown in Fig. 1. Its workflow is as follows.

(1) Crawl the articles that users post or reply to on a website. The articles are in text form.
(2) Preprocess the articles, for example by deleting articles that are too short or duplicated.
(3) Extract the user activity data, that is, the articles that each user posted or replied to.
(4) Convert each article into a word sequence. A word dictionary is required for this step.
(5) Learn an activity model for each user from the converted word sequences, and store the models in a knowledge base.
(6) For a set of new word sequences, obtained in the same way as in (1)–(4), perform user activity analysis based on the knowledge base.

The goal of this framework is to analyze user activity, which is associated with user interest. User interest is the essential element of the framework. Hence, we focus on the user activity model, the computation of user interest, and activity analysis tasks such as user group discovery and user reply activity. The other modules are explained in the experiment section.

2.2. User activity model

A topic can be considered as a set of documents, and a document can be considered as a probability distribution over the word space. However, the topics in a document collection usually transit from one to another with some probability. For example, if education is currently a hot topic on a BBS, then topics about science and technology are more likely to appear than topics about tourism. It is useful to incorporate this kind of transition information into the topic model and thereby improve the accuracy of the user activity representation. To capture topic transitions, we use a hidden Markov model [18] to describe the user activity. The model is described as follows,

$$\lambda = (A, B, \pi, N, M) \tag{1}$$

where $N$ is the number of hidden states and $A = \{a_{ij}\}$ is the state transition matrix, with $a_{ij}$ the probability of a transition from state $i$ to state $j$; $M$ is the number of distinct words in the documents, $B$ is the word probability distribution of each hidden state, and $\pi$ is the initial probability distribution over the hidden states.

2.3. Definition of user interest

For a word sequence extracted from a given document, the larger its likelihood with respect to the user model, the higher the user's interest in the document. We therefore define the user interest as

$$\mathrm{Interest}(\lambda_u, O) = \frac{\log P(O \mid \lambda_u) - \log P_{\min}}{\log P_{\max} - \log P_{\min}} \tag{2}$$

where $O$ is the word sequence of the given article, $\lambda_u$ is the user model of user $u$, and $P_{\max}$ and $P_{\min}$ are the bounds of the user's interest over documents of the same length. They are computed as follows.

$$P_{\min} = \min_{|O_i| = |O|} P(O_i \mid \lambda_u), \quad i = 1, \ldots, K \tag{3}$$

$$P_{\max} = \max_{|O_i| = |O|} P(O_i \mid \lambda_u), \quad i = 1, \ldots, K \tag{4}$$
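For intuition, Eqs. (2)–(4) can be evaluated exactly on a toy model by enumerating every word sequence of the same length. The sketch below is a minimal illustration, not the authors' implementation: the two-state model, its probabilities, and the function names `log_likelihood` and `interest` are assumptions made for the example, and the likelihood is computed with the standard forward algorithm [9].

```python
import itertools
import math


def log_likelihood(seq, pi, A, B):
    """Forward algorithm: return log P(seq | model) for a discrete HMM."""
    n_states = len(pi)
    # alpha[i] = P(v_1..v_t, state_t = i | model)
    alpha = [pi[i] * B[i][seq[0]] for i in range(n_states)]
    for v in seq[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n_states)) * B[j][v]
            for j in range(n_states)
        ]
    return math.log(sum(alpha))


def interest(seq, pi, A, B):
    """Eq. (2), with P_min and P_max found by exhaustive enumeration."""
    n_words = len(B[0])
    # Score every candidate sequence of the same length (M**L of them).
    scores = [
        log_likelihood(list(cand), pi, A, B)
        for cand in itertools.product(range(n_words), repeat=len(seq))
    ]
    log_p_min, log_p_max = min(scores), max(scores)
    return (log_likelihood(seq, pi, A, B) - log_p_min) / (log_p_max - log_p_min)


# Toy user model (all numbers invented): N = 2 hidden states, M = 3 words.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]

score = interest([0, 1, 0, 0], pi, A, B)
print(round(score, 3))
```

The enumeration visits $M^L$ sequences (81 in this toy case), so it is only practical for very small dictionaries and short sequences.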

In Eqs. (3) and (4), $K$ is the number of all possible documents of the same length as $O$.

2.4. Computation of user interest

2.4.1. Model training

For a given word sequence $SEQ = \{v_1, v_2, \ldots, v_n\}$, where $v_i$ denotes an index in the word dictionary, we train an HMM by maximizing the objective function given in Eq. (5).

Fig. 1. Framework for user activity analysis.

$$P(SEQ \mid \lambda) = \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(v_{t+1})\, \beta_{t+1}(j), \quad 1 \le t \le n-1 \tag{5}$$
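The identity behind Eq. (5) can be checked numerically: with the emission term $b_j(v_{t+1})$ included, as in the tutorial [9], the double sum gives the same value for every $t$ and equals the total forward probability. The following is a minimal pure-Python sketch under that reading; the toy parameters and the helper names `forward`, `backward`, and `likelihood_at` are assumptions for illustration only.

```python
def forward(seq, pi, A, B):
    """alpha[t][i] = P(v_1..v_t, state_t = i | model), with 0-based t."""
    n = len(pi)
    alpha = [[pi[i] * B[i][seq[0]] for i in range(n)]]
    for v in seq[1:]:
        prev = alpha[-1]
        alpha.append([
            sum(prev[i] * A[i][j] for i in range(n)) * B[j][v]
            for j in range(n)
        ])
    return alpha


def backward(seq, A, B):
    """beta[t][i] = P(v_{t+1}..v_n | state_t = i, model), with 0-based t."""
    n = len(A)
    beta = [[1.0] * n]  # beta at the final time step is 1 for every state
    for v in reversed(seq[1:]):
        nxt = beta[0]
        beta.insert(0, [
            sum(A[i][j] * B[j][v] * nxt[j] for j in range(n))
            for i in range(n)
        ])
    return beta


def likelihood_at(t, seq, alpha, beta, A, B):
    """Double sum of Eq. (5) evaluated at 0-based time index t."""
    n = len(A)
    return sum(
        alpha[t][i] * A[i][j] * B[j][seq[t + 1]] * beta[t + 1][j]
        for i in range(n)
        for j in range(n)
    )


# Toy model (all numbers invented): N = 2 hidden states, M = 3 words.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
seq = [0, 1, 2, 2, 0]

alpha = forward(seq, pi, A, B)
beta = backward(seq, A, B)
values = [likelihood_at(t, seq, alpha, beta, A, B) for t in range(len(seq) - 1)]
total = sum(alpha[-1])  # P(SEQ | model) from the forward pass alone
print(values, total)
```

For realistic sequence lengths the probabilities underflow, so a practical implementation would use scaling or log-space arithmetic; that is omitted here to keep the sketch short.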

In Eq. (5), $\alpha_t(i)$ and $\beta_t(j)$ are the forward and backward variables [9], respectively. To maximize (5), a series of iterations is performed with the Baum-Welch algorithm [9]. That is, let

$$\bar{a}_{ij} = \frac{\sum_{t=1}^{n-1} \xi_t(i,j)}{\sum_{t=1}^{n-1} \gamma_t(i)} \tag{6}$$

$$\bar{b}_j(k) = \frac{\sum_{t=1,\; v_t = o_k}^{n} \gamma_t(j)}{\sum_{t=1}^{n} \gamma_t(j)}, \quad 1 \le j \le N,\; k = 1, \ldots, M \tag{7}$$

$$\bar{\pi}_j = \gamma_1(j), \quad 1 \le j \le N \tag{8}$$

where $o_k$ is the $k$-th output symbol of the model,

$$\xi_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(v_{t+1})\, \beta_{t+1}(j)}{P(SEQ \mid \lambda)}, \quad \text{and} \quad \gamma_t(i) = \sum_{j=1}^{N} \xi_t(i,j).$$

Then the new model $\bar{\lambda} = (\bar{A}, \bar{B}, \bar{\pi}, N, M)$ is obtained, and it can be proved that $P(SEQ \mid \bar{\lambda}) \ge P(SEQ \mid \lambda)$ [9], where $\lambda$ is the model of the previous iteration. So, by repeatedly computing (6)–(8), we finally obtain an HMM fitted to the word sequence SEQ.

2.4.2. User interest computation

The main problem is to calculate $P_{\max}$ and $P_{\min}$ in (3) and (4). It is difficult to obtain the two values efficiently because the search space is huge; for example, $K = 100^{10}$ if the number of features $M$ is 100 and the length of the word sequence is 10. Hence, some kind of search technique is needed. Here we use a method based on a genetic algorithm [19], with (3) and (4) as the objective functions, to obtain the optimized sequences and the two values. The size of a chromosome is equal to the length of $O$. The process of searching for $P_{\min}$ is as follows.

Input: user model $\lambda$, sequence length $L$
Process:
1. Initialize the population by randomly selecting a set of word sequences $O = \{O_1, O_2, \ldots, O_n\}$, where each $O_i$ has length $L$.
2. Compute the fitness of each chromosome, $f = P(O_i \mid \lambda)$.
3. Perform selection: select the next population $O'$ from $O$ by the roulette wheel selection method, where a chromosome with smaller fitness has a higher selection probability.
4. Perform crossover and mutation to get the new population $O' = \{O_1, O_2, \ldots, O_n\}$.
5. Repeat steps 2–4 until the value of $f$ no longer changes or the number of iterations exceeds a threshold.
6. End.
Output: $P_{\min}$

The procedure for obtaining $P_{\max}$ is similar, except that a chromosome with higher fitness has a higher selection probability.

2.5. Group discovery

A user group is a set of users who share a similar interest in topics. A user can belong to one or more groups, since his or her interest differs across topics. Finding user groups is an important task in interactive website analysis. A group is usually associated with a topic. Hence, to discover groups among users, we should specify a topic that reflects the group interest. Based on the definition of user interest, group discovery becomes straightforward. The basic idea is to employ a clustering algorithm based on the mixture model. The group discovery algorithm GDA is described as follows.

Input: a set of user models $\{\lambda_1, \lambda_2, \ldots, \lambda_n\}$ and word sequences $\{O_1, O_2, \ldots, O_m\}$, with each $O_j$ describing a topic
Process:
1. Compute the interest of each user in each word sequence, $x_{ij} = \mathrm{Interest}(\lambda_i, O_j)$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, m$.
2. Find the maximum likelihood parameters of the group mixture model via the EM algorithm.

Suppose that the group interest can be represented by a Gaussian mixture model,

$$p(x \mid \theta) = \sum_{i=1}^{W} c_i\, p_i(x \mid \theta_i) \tag{9}$$

where $x$ is an $m$-dimensional vector, $W$ is the number of groups among these users, and the parameters are $\theta = (c_1, c_2, \ldots, c_W, \theta_1, \theta_2, \ldots, \theta_W)$ such that $\sum_{i=1}^{W} c_i = 1$. Each $p_i$ is a multivariate Gaussian density function, that is,

$$p_i(x \mid \theta_i) = \frac{1}{(2\pi)^{m/2}\, |\Sigma_i|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu_i)^{T}\, \Sigma_i^{-1}\, (x - \mu_i)\right) \tag{10}$$

Thus the estimates of the new parameters in terms of the old parameters $\theta^{g}$ are as follows [20],

$$c_i = \frac{1}{n} \sum_{j=1}^{n} p(i \mid x_j, \theta^{g}) \tag{11}$$

$$\mu_i = \frac{\sum_{j=1}^{n} x_j\, p(i \mid x_j, \theta^{g})}{\sum_{j=1}^{n} p(i \mid x_j, \theta^{g})} \tag{12}$$

$$\Sigma_i = \frac{\sum_{j=1}^{n} p(i \mid x_j, \theta^{g})\, (x_j - \mu_i)(x_j - \mu_i)^{T}}{\sum_{j=1}^{n} p(i \mid x_j, \theta^{g})} \tag{13}$$

where $x_j = (\mathrm{Interest}(\lambda_j, O_1), \mathrm{Interest}(\lambda_j, O_2), \ldots, \mathrm{Interest}(\lambda_j, O_m))$ is the interest vector of user $j$. By iterating (11)–(13), we finally obtain the best parameters $\theta$. The number of mixture components $W$ can be obtained by cross validation. In this way, we get the best group set from the users.

Output: the user clusters $\{(u_i, g_i, pb_i)\}$, indicating that user $u_i$ belongs to group $g_i$ with probability $pb_i$.

3. Experiments and analysis

In this section, we use the framework to download a large number of articles from an interactive website, create user activity models, and verify the effectiveness of the framework in activity analysis. Most importantly, we verify the effectiveness of the user interest computation in activity analysis by investigating the relation between user interest and reply activity, the relation between user interest and user similarity, and the effectiveness of the user group discovery algorithm.

3.1. Data collection and preprocessing

To investigate user interest more thoroughly, we select a well-known user discussion website as our data source. The website has thousands of registered users, who are mainly young university students. It has several channels, such as "News", "Stock", and "Military", which serve as containers for different kinds of topics. The statistics of the number of new articles per day show that the users of the website are more interested in the articles in the "News" channel than in those in other channels, so in the experiment we select the articles in the "News" channel as our data source. In this channel, the articles are mainly related to national, international, economic, social, and financial news. Clearly, some users are more interested in international news, while others are interested only in financial news. Hence, the channel provides a suitable test bed for our user interest verification and analysis tasks.

In such a discussion group, each user can post articles and reply to them. It is reasonable to suppose that a user is more likely to post articles that he or she is interested in than articles that he or she is not interested in. It should be noted that not all users in the discussion group post or reply to a similar number of articles in a given time period. For example, Fig. 2 shows the number of postings over two months for some users. If the number of postings is small, the user's interest is not fully reflected on the website, so in the experiment we select only the users with more than 300 postings. Using the framework, we crawl all posting records from August 6, 2006 to February 6, 2007. To filter out meaningless articles, the framework discards the records whose text length is below 100. After these steps, we get a dataset of 22032 records with the fields "userid", "text", "posttime", and "flag", whose meanings are shown in Table 1. The dataset serves as the raw data in our experiments; for convenience, we name it "IntDataSet".
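The selection rules of this subsection can be sketched as follows. This is an illustration under stated assumptions, not the framework's actual code: the text does not say whether length is measured in characters or words (characters are assumed here), whether postings are counted before or after the length filter (after is assumed), or how the text is tokenized (whitespace splitting stands in for a real word segmenter). The record fields follow the dataset description; the function names and the sample records are hypothetical.

```python
from collections import Counter

MIN_TEXT_LENGTH = 100  # records with shorter text are discarded
MIN_POSTINGS = 300     # a user must have more postings than this


def build_dataset(records, min_len=MIN_TEXT_LENGTH, min_posts=MIN_POSTINGS):
    """Drop short texts, then keep only records from sufficiently active users."""
    long_enough = [r for r in records if len(r["text"]) >= min_len]
    counts = Counter(r["userid"] for r in long_enough)
    return [r for r in long_enough if counts[r["userid"]] > min_posts]


def to_word_sequence(text, dictionary):
    """Map tokens to dictionary indices, skipping words not in the dictionary."""
    return [dictionary[w] for w in text.lower().split() if w in dictionary]


# Hypothetical sample records; thresholds are lowered so the toy data survives.
records = [
    {"userid": "u1", "text": "stock market news today", "posttime": "2006-08-06", "flag": "post"},
    {"userid": "u1", "text": "market news", "posttime": "2006-08-07", "flag": "reply"},
    {"userid": "u2", "text": "ok", "posttime": "2006-08-07", "flag": "reply"},
    {"userid": "u1", "text": "international news and stock", "posttime": "2006-08-08", "flag": "post"},
]
dictionary = {"stock": 0, "market": 1, "news": 2}

kept = build_dataset(records, min_len=10, min_posts=2)
sequences = [to_word_sequence(r["text"], dictionary) for r in kept]
print(sequences)
```

With the lowered thresholds, the three records of user `u1` are kept and the short record of `u2` is dropped; with the default thresholds the toy data would be filtered out entirely.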

Fig. 2. Number of postings for some selected users.

Table 1
Explanation of the record fields

Field      Meaning
userid     User identification
text       Main message of the record
posttime   Time of the post or reply
flag       Post or reply

3.2. User interest and reply activity

In this experiment, we examine two kinds of user interest: the user's interest in the articles that he or she chooses to reply to, and the interest in the articles that he or she does not choose to reply to. In this way, we can explore the role of user interest in the analysis of user reply activity. We count the number of postings of each user in IntDataSet and select the nine users with the most postings. We take the records of these users whose posttime is between October 10, 2006 and November 10, 2006 as the training dataset, and then employ the method of Section 2.4.1 to create a user model for each user. We take the records of each user whose posttime is between November 21, 2006 and January 30, 2007 as the test dataset. However, some further filtering of the test set is needed, since the users do not visit the website every day. To ensure the validity of the test, we select the records in the following way. Suppose that a user posts or replies to an article at time t; then we select the records whose posttime is within one day before t, as shown in Fig. 3.

We then compute, for each of the nine selected users, the average user interest in the word sequences of the articles in the test set. The result is shown in Fig. 4. For all of the users, the average interest in the articles that are replied to is larger than that in the articles that are not replied to. However, some users, such as user 1 and user 6, show similar interest in the two kinds of articles. This matches the real world: a user does not reply to every article he or she is interested in, but usually selects only some of them.

Fig. 3. Articles selection.

Fig. 4. Average user interest on test set.

3.3. User interest and user similarity

We know from real society that the interests of two users are more similar if the users share a more similar background, such as education and work experience. We examine the relation between user activity and user interest. The user activity is represented by the user model, which is based on the articles that the user chooses to post or reply to. We select the 40 users who published the most articles between August 20, 2006 and October 20, 2006. We then convert the articles into word sequences and train an HMM for each user in the framework. We then compute the distance between two users m and n, measured by the KL (Kullback–Leibler [9]) distance, that is,

$$D_{KL}(\lambda_m, \lambda_n) = \frac{1}{V}\left(\log P(ws \mid \lambda_m) - \log P(ws \mid \lambda_n)\right) \tag{14}$$

where $ws$ is a word sequence generated by $\lambda_m$ and $V$ is its length. To investigate these users' interest in different kinds of topics, we then select 452 records from IntDataSet and compute the average user interest in the text. We then compute the ratio of user interest between each pair of users, that is

$$R_{int}(u_i, u_j) = \frac{\sum_{k=1}^{n} \mathrm{interest}(\lambda_{u_i}, O_k)}{\sum_{k=1}^{n} \mathrm{interest}(\lambda_{u_j}, O_k)} \tag{15}$$

The relation between the model distance $D_{KL}(\lambda_m, \lambda_n)$ and the ratio of user interest $R_{int}(u_i, u_j)$ is shown in Fig. 5. As the trend line shows, when the interests of two users are more similar, that is, when the ratio $R_{int}$ is closer to 1, the model distance between the two users is smaller, which means the two users are more similar. As the ratio of user interest increases, the model distance increases too. The result fits the real observation that users who have similar interests have similar activity models.

Fig. 5. Model distance and the ratio of user interest.

3.4. User group and user interest

In this experiment, we investigate the relation between user groups and users. We select the 50 users who published the most articles in IntDataSet. We then manually select five kinds of articles related to five topics, namely "TAIWAN issue", "economic news", "entertainment news", "sport news", and "education news"; each topic includes 100 articles published between August 20, 2006 and October 20, 2006. To get the word sequences that describe the topics, we employ a partition clustering algorithm based on the hidden Markov model [18], with the number of partitions set to 5. Then we run the group discovery algorithm GDA with the 50 user models and the five word sequences as parameters.

To show the effectiveness of the user clustering algorithm based on user interest, we also conduct the clustering experiment with two other methods. The first is a model-based partition clustering algorithm using the KL distance as the distance measurement [21]. The second is the same as the GDA algorithm, except that in step 1 of GDA we use the log likelihood $\log P(O \mid \lambda)$ as the measurement instead of the user interest. The results of the three clustering algorithms are compared by the measurement

$$r = \frac{d_i}{d_b}$$

where $d_i$ denotes the average distance between users within the clusters, and $d_b$ is the average distance between each pair of clusters. By this definition, the smaller $r$ is, the better the user grouping. The result is shown in Table 2. We can see clearly that the clustering that uses interest as the distance measurement obtains the best clusters.

Table 2
Comparison of user group discovery methods

Group discovery method                                           r
Model-based partition clustering algorithm using KL distance    0.78
GDA-like with the log likelihood as the measurement             1.12
GDA                                                              0.25

4. Conclusion and future work

User activity analysis is an important task in website analysis, especially for interactive websites. Research on user activity on many interactive websites, such as BBS and discussion group websites, has not attracted much attention. The integration of the software modules responsible for user activity analysis is also a critical issue. This paper proposes a framework, which can be easily implemented, for the analysis of user activity on an interactive website. The user activity model is represented by a hidden Markov model (HMM), and a method for computing user interest is provided. User activity analysis tasks, such as user group discovery, can be performed in the framework. We use the framework to download articles from an interactive website, create user activity models, and verify the effectiveness of the framework in activity analysis. Most importantly, we verify the effectiveness of the user interest computation in activity analysis by investigating the relation between user interest and reply activity, the relation between user interest and user similarity, and the effectiveness of the user group discovery algorithm.

In further research, we will concentrate on providing more kinds of user activity analysis based on user interest in the framework. For example, we will design an algorithm for studying the evolution of user groups over time and examine how user interest varies as time goes on; consequently, we will design a mechanism for organizing the user activity models so that the evolution analysis can be performed efficiently.

Acknowledgment

We thank J.F. Xie for providing the computer program for crawling data from the website.

References

[1] Y.Y. Ahn, S. Han, H. Kwak, Analysis of topological characteristics of huge online social networking services, in: Proceedings of WWW, 2007, pp. 835–844.
[2] T.Y. Berger-Wolf, J. Saia, A framework for analysis of dynamic social networks, in: Proceedings of KDD, 2006, pp. 523–528.
[3] P. Sarkar, A.W. Moore, Dynamic social network analysis using latent space models, SIGKDD Explorations 7 (2) (2005) 31–40.
[4] T. Iwata, K. Saito, T. Yamada, Modeling user behavior in recommender systems based on maximum entropy, in: Proceedings of WWW, 2007, pp. 1281–1282.
[5] C.-N. Ziegler, S.M. McNee, J.A. Konstan, G. Lausen, Improving recommendation lists through topic diversification, in: Proceedings of WWW, 2005, pp. 22–32.
[6] D.M. Blei, A.Y. Ng, M.I. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research 3 (2003) 993–1022.
[7] A. McCallum, A. Corrada-Emanuel, X. Wang, Topic and role discovery in social networks, in: Proceedings of 19th International Joint Conference on Artificial Intelligence, 2005.
[8] M. Rosen-Zvi, T. Griffiths, M. Steyvers, P. Smyth, The author-topic model for authors and documents, in: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, Banff, Canada, 2004, pp. 487–494.
[9] L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, in: Proceedings of the IEEE, vol. 77, 1989, pp. 257–286.
[10] T. Hofmann, Probabilistic latent semantic indexing, in: Proceedings of the 22nd Annual International SIGIR Conference, Berkeley, California, USA, 1999, pp. 35–44.
[11] S. Morinaga, K. Yamanishi, Tracking dynamics of topic trends using a finite mixture model, in: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004.
[12] M. Steyvers, P. Smyth, M. Rosen-Zvi, T. Griffiths, Probabilistic author-topic models for information discovery, in: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, pp. 306–315.
[13] X. Song, C.-Y. Lin, B.L. Tseng, M.-T. Sun, Modeling and predicting personal information dissemination behavior, in: Proceedings of ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, 2005.
[14] X. Song, C.-Y. Lin, B.L. Tseng, M.-T. Sun, Modeling evolutionary and relational behaviors for community-based dynamic recommendation, in: Proceedings of SIAM Data Mining Conference, 2006.
[15] E. Manavoglou, D. Pavlov, C.L. Giles, Probabilistic user behaviour models, in: Proceedings of ICDM, 2003.
[16] I. Cadez, D. Heckerman, C. Meek, P. Smyth, S. White, Visualization of navigation patterns on a web site using model based clustering, in: Proceedings of ACM KDD2000 Conference, 2000.
[17] E. Adar, L.A. Adamic, Tracking information epidemics in blogspace, in: Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence, 2005.
[18] J.P. Zeng, S.Y. Zhang, Incorporating topic transition in topic detection and tracking algorithms, Expert System with Applications, 38(3) doi:10.1016/j.eswa.2007.09.013.
[19] J.P. Zeng, D.H. Guo, Method for Masquerade intrusion detection based on HMM and genetic algorithm, Journal of Chinese Computer System 28 (7) (2007) 1210–1215.
[20] J.A. Bilmes, A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models, Technical report TR-97-021, U.C. Berkeley, 1998.
[21] S. Zhong, Probabilistic Model-based Clustering of Complex Data, Ph.D. Thesis, The University of Texas at Austin, 2003.