G Model
ARTICLE IN PRESS
JSS-9355; No. of Pages 11
The Journal of Systems and Software xxx (2014) xxx–xxx
Contents lists available at ScienceDirect
The Journal of Systems and Software journal homepage: www.elsevier.com/locate/jss
Measuring the veracity of web event via uncertainty
Xinzhi Wang a, Xiangfeng Luo a,b,∗, Huiming Liu a
a School of Computer Engineering and Science, Shanghai University, Shanghai, China
b State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi, China
Article info
Article history: Received 25 February 2014; received in revised form 16 June 2014; accepted 10 July 2014; available online xxx
Keywords: Web event veracity; Topic detection and tracking; Big data
Abstract
Web events, whose data constitute one kind of big data, have attracted considerable interest in recent years. However, most existing work fails to measure the veracity of web events. In this research, we propose an approach that measures the veracity of a web event via its uncertainty, based on the distribution of its features over websites of different confidence. Firstly, the proposed approach mines from the data of a web event the event features that may influence the measurement of uncertainty. Secondly, a computational model is introduced to simulate the influence of these features on the evolution of the web event. Thirdly, matrix operations are introduced to facilitate practical computation. Finally, experiments are conducted based on the above analysis, and the results show that the proposed uncertainty measuring algorithm is promising for measuring the veracity of web events in big data. © 2014 Elsevier Inc. All rights reserved.
1. Introduction
Today, big data attracts more and more attention. The challenges and opportunities of the big data era are commonly summarized as five Vs: volume (amount of data), velocity (speed of data in and out), variety (range of data types and sources), value (desirable quality of data) and veracity (trustworthiness of various data) (see footnote 1). The volume of big data is massive, and the main technologies to store it include distributed caches, distributed databases and distributed file systems (Zhang et al., 2014; Zhang X and Liu C et al., 2014). Big data is also time-sensitive. This property, velocity, refers to the speed at which new data is generated and the speed at which data moves around; the main methods to face this challenge include MapReduce (Dean and Ghemawat, 2008) and concurrency control. Moreover, big data comes as text, video, audio, webpages, streams and even aggregations of these, which is the meaning of variety. Another V to take into account is value: users should be able to benefit from any attempt to collect and leverage big data. The last V, veracity, refers to the trustworthiness of big data, which comes in large amounts, at high speed, in many forms and with uneven quality. We can safely argue that 'veracity' is the most important
∗ Corresponding author at: School of Computer Engineering and Science, Shanghai University, 333 Nanchen Road, Baoshan District, Shanghai 200444, China. Tel.: +86 13817505710.
1 http://staff.science.uva.nl/∼demch/////presentations/sne2013-01-03-bigdata5V-infra-v04.pdf.
V of big data for gaining ultimate and valid value. This paper aims to measure the veracity of web data, which is one of the challenges standing between users and big data. A web event is a story or a scandal that occurs in society or on the web and is reflected by a series of associated webpages over time. It is hard to measure web event veracity because a large volume of varied web events happens all the time. However, the veracity of a web event has to be determined in order to obtain trustworthy and valuable information. During its evolution, a social event is deeply affected by the corresponding web information, one manifestation of which is uncertainty. In other words, monitoring the veracity of a web event via its uncertainty can help users understand social events. Generally, events with high uncertainty are more likely to turn into popular or emergent events (Haddow et al., 2010). Here the uncertainty of a web event is determined by the distribution of its features over websites of different confidence, instead of by the entropy used in informatics. For instance, if an event has high uncertainty and is distributed over low-confidence websites, then the event is more likely to be fake, namely its veracity is low; and vice versa. So, this research employs multifactor-based uncertainty to measure the veracity of web events. However, it is difficult to manually analyze the veracity of the large volume of web events as they evolve on the web, because doing so costs a great deal of time and energy. So, to understand and quickly respond to these web events, it is necessary to measure web event veracity automatically. For instance, a Chinese girl named MeiLing Guo showed off her luxurious life on the web and claimed to be one of the managers of the Red Cross in China. This message was noticed by netizens, who suspected corruption in the Red Cross. The event ended with an administrative reorganization of the Red Cross and resulted in a sharp
http://dx.doi.org/10.1016/j.jss.2014.07.023 0164-1212/© 2014 Elsevier Inc. All rights reserved.
Please cite this article in press as: Wang, X., et al., Measuring the veracity of web event via uncertainty. J. Syst. Software (2014), http://dx.doi.org/10.1016/j.jss.2014.07.023
decline in donations, which had a bad influence on the Red Cross of China. This web event had a great impact on our society. The veracity of this event was low at first, with just one doubtful microblog post, but rose in the end as various guesses emerged. The veracity of a web event thus varies during its evolution. One manifestation of the veracity of a web event is its uncertainty, and the other is website confidence; both are determined by various factors such as the distribution of webpages over websites and the distribution of attributes over webpages. In this paper, we first propose an approach that mines several event features from the webpages of a web event. Then, a computational model is introduced to simulate the influence of these features on the evolution of the web event. Furthermore, matrix operations are introduced to facilitate practical computation. Last but not least, experiments are conducted based on the above analysis, and the results show that the proposed model is promising for measuring the veracity of a web event via its uncertainty. This paper is organized as follows. Related work is introduced in Section 2. Basic terms and event features are defined in Section 3. The veracity measurement of web events via uncertainty is introduced in Section 4. The relation between the measuring processes and matrix operations is discussed in Section 5. Section 6 shows some experiments and analysis. Conclusions are given in the last section.
2. Related work
Calculating the veracity of a web event is a tough task, since the data of web events is characterized by volume, variety and velocity. Moreover, the veracity of a web event, one manifestation of which is uncertainty, varies during its evolution. However, few researchers have directly addressed the measurement of web event veracity or uncertainty. In the research area of topic detection and tracking (TDT) (Jin et al., 2010; Yin et al., 2008; Yang et al., 2009; Makkonen, 2003), some aspects of web events are measured during their evolution, such as detecting unknown events and segmenting event information, without considering web event uncertainty. Nevertheless, TDT technologies lay some foundations for our work, namely measuring the veracity of web events via uncertainty. In the research area of TDT, Allan et al. (1998a,b) calculated the similarity among news stories and classified the news into corresponding events. If a piece of news is not similar to any previous event, it is considered a new event. Chen et al. (2007) put forward the time-hardening theory, which models the event life cycle based on the time relations among documents. Wei and Chang (2007) proposed a technique to build an event evolution model, which finds frequent event sections and temporal relations of one event to construct frequent patterns. Here, an event section refers to a sub-event or an event stage. Yang and Shi (2006) proposed a method to extract an evolutionary relationship diagram from news reports; the diagram reflects the relationships among web events or between an event and its sub-events, and thus displays the structure of web events. Some other aspects (Li et al., 2009; Li and Lau, 2011; Ng et al., 2003, 2005) have also been discussed.
In general, TDT techniques (Allan, 2000; Mei and Zhai, 2005; Jo et al., 2007; Nallapati et al., 2004) attempt to detect unknown events and cluster the corresponding news reports into their events; TDT also involves tracking events, but does not involve measuring the veracity of a web event during its evolution. So it is hard to provide users with a global and clear understanding of a web event. This paper proposes a method to calculate the veracity of a web event by taking advantage of web event uncertainty. This paper also discusses which factors may influence the uncertainty of a web event, such as attribute uncertainty, webpage
Fig. 1. Interaction among users, web events, websites, webpages and event attributes.
uncertainty, website confidence, classified webpage uncertainty, the distribution of attributes on webpages, and the distribution of classified webpages on websites. On the other hand, Salton and Yang (1973) and Kenneth and Yang (2007) analyze many factors affecting website confidence. In addition, different types of websites have different confidence (Nadine and Burke, 2002). Traditional media websites, such as CNN, New York Times and the Wall Street, always have higher confidence than other websites. Metzger (2000) indicates that website confidence is dynamic and uncertain, which makes it hard to evaluate clearly. The distribution of attributes on webpages and the distribution of classified webpages on websites have a great influence on both web event uncertainty and website confidence. In this paper, we define a number of web event features, establish a mathematical model for computing web event veracity via its uncertainty, obtain web event veracity through an iterative computational model, and finally carry out matrix operations and experiments, which show that the iterative algorithm is promising for computing web event uncertainty.

3. Basic terms and event features
As discussed in Section 1, to measure the veracity of a web event via uncertainty, it is necessary to study the features of the web event during its evolution. Before discussing web event features, we define some basic terms that help in understanding the veracity measuring process.

3.1. Terms definitions
Normally, users pay attention to web events when they surf the internet. These web events are provided by webpages, which are distributed over different websites and contain abundant event attributes. The whole chain that influences the measurement of event veracity is depicted in Fig. 1. The five levels are users, web events, websites, webpages and event attributes, respectively. The arrows describe the relations among
Fig. 2. Relations among webpages and their attributes for one web event. (In the figure, $\phi_i$ is the classified webpage tagged with i; $ws_i$ is the ith website related to web event e; $k_i$ is the ith attribute belonging to an event.)
different levels. This section defines some basic terms and analyzes the event features based on the chain shown in Fig. 1.

Definition 1 (Uncertainty of attributes, $es(k_{ij})$). The uncertainty of attribute $k_{ij}$ is the ability of the attribute to connect with other attributes; it is also related to the confidence of the websites containing the attribute. Here, $k_{ij}$ is the jth attribute in the ith website. Attributes are keywords mined using TF-IDF (Li et al., 2009), and their relations are obtained by the Apriori algorithm. If an attribute connects to many other attributes and appears in many low-confidence websites, then the attribute has high uncertainty, and a web event containing it therefore has low veracity; and vice versa.

Definition 2 (Uncertainty of webpage, $c(\omega_{ij})$). The uncertainty of webpage $\omega_{ij}$ is proportional to the uncertainty of its attributes and is a generalized uncertainty of its attributes. Here, $\omega_{ij}$ is the jth webpage from the ith website. One webpage can be denoted as $\omega_{ij} = \{k_{i1}, k_{i2}, \ldots, k_{im}\}$, and its weight vector can be denoted by $\vec{\omega}_{ij} = (w_{i1}, w_{i2}, \ldots, w_{im})^T$, where $w_{im}$ is the weight of $k_{im}$, which can be mined according to TF-IDF (Li et al., 2009). If a web event appears in many uncertain webpages, then its veracity is low; and vice versa.

Definition 3 (Uncertainty of classified webpage, $c(\phi_i)$). The uncertainty of a classified webpage $\phi_i$ implies the uncertainty of a subordinated web event, namely a sub-event, which is distributed over several websites. Here, $\phi_i$ is the ith classified webpage belonging to a specific web event. A classified webpage of a web event is a set of webpages that are highly similar to each other and that describe one specific subordinated web event. We let $\phi_i = \{\omega_{ij}, \ldots, \omega_{nm}\}$ denote a classified webpage collection in a specific period. Just as for a webpage, if a classified webpage of a web event is distributed over low-confidence websites, then it has high uncertainty, and the web event has low veracity; and vice versa.

Definition 4 (Website confidence, $T(ws_i)$). The confidence of a website $ws_i$ is the authority degree of the website. Here, $ws_i$ is the ith website related to a specific web event. It can be formalized as $T(ws_i) = \langle c(\omega_{ij}), c(\phi_i) \rangle$. A website is a carrier of a web event and its attributes. Its confidence is influenced by two factors: one is the uncertainty of its own webpages $c(\omega_{ij})$, and the other is the uncertainty of the classified webpages (subordinated web events) $c(\phi_i)$. If the information of a web event is distributed over high-confidence websites, then the veracity of the web event is high; and vice versa.

Definition 5 (Uncertainty of web event, $uc(e)$). The uncertainty of web event e is the ability of the event to derive other events during a period of time, which can be formalized as $uc(e) = \langle es(k_{ij}), c(\omega_{ij}), c(\phi_i) \rangle$, where $es(k_{ij})$ is the uncertainty of attributes $k_{ij}$, $c(\omega_{ij})$ is the uncertainty of webpages $\omega_{ij}$, and $c(\phi_i)$ is the uncertainty of classified webpages $\phi_i$. $uc(e)$ is significantly affected by the uncertainty or confidence of the above event features. The uncertainty of attributes, webpages and classified webpages, together with the confidence of websites, determines the uncertainty of a web event and thereby its veracity. The more uncertain a web event is, the less veracious it is, so we can obtain the veracity of a web event via its uncertainty based on the distribution of its features over websites of different confidence.

Definition 6 (Veracity of web event, $ver(e)$). The veracity of a web event is the trustworthiness of the web event during its evolution. It is inversely related to the uncertainty of the web event: $ver(e) = 1 - uc(e)$. If highly uncertain words of a web event are distributed over highly uncertain webpages and low-confidence websites, and the related classified webpages are uncertain, then the corresponding web event has high uncertainty and its veracity is low (the event may include some fake information). Conversely, if the features (attributes, webpages, classified webpages) are certain and distributed over confident websites, then the event is veracious with low uncertainty (the event contains a lot of true information).

Fig. 3. Relations among event attributes and webpages belonging to one website. (In the figure, $\omega_{ij}$ is the webpage tagged with j in the ith website of a specific web event; $ws_i$ is the ith website related to web event e; $k_{ij}$ is one attribute belonging to $ws_i$.)

Since a classified webpage is a subordinated web event, we use similarity to obtain classified webpages. The similarity among the webpages of a web event can be denoted as $I$:

$$I = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nn} \end{pmatrix} \tag{1}$$

where $a_{ij} = (\vec{\omega}_{ab} \cdot \vec{\omega}_{cd}) / (|\vec{\omega}_{ab}|\,|\vec{\omega}_{cd}|)$, $i = a + b$, $j = c + d$, and $a_{ij}$ is the similarity between webpages $\omega_{ab}$ and $\omega_{cd}$. Webpages with high similarity are merged into one classified webpage corresponding to one subordinated web event. To avoid ambiguity, we call the attribute $k_i$ after the combination. The whole process is shown in Figs. 2 and 3.
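The similarity matrix of Eq. (1) and the merging of similar webpages can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the greedy seed-based grouping rule, the threshold of 0.8 and the toy weight vectors are all assumptions introduced here, since the paper only states that highly similar webpages are merged.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length weight vectors (one entry of Eq. 1)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def similarity_matrix(pages):
    """Matrix I: pairwise cosine similarity of webpage weight vectors."""
    return [[cosine(u, v) for v in pages] for u in pages]

def classify(pages, threshold=0.8):
    """Greedy grouping (assumed rule): a page joins the first group whose seed
    page is at least `threshold` similar; otherwise it starts a new group."""
    groups = []  # each group is a list of page indices; the first index is the seed
    for idx, vec in enumerate(pages):
        for group in groups:
            if cosine(pages[group[0]], vec) >= threshold:
                group.append(idx)
                break
        else:
            groups.append([idx])
    return groups

# Hypothetical TF-IDF weight vectors for three webpages over three attributes.
pages = [
    [1.0, 0.0, 1.0],
    [0.9, 0.1, 1.0],
    [0.0, 1.0, 0.0],
]
I = similarity_matrix(pages)
groups = classify(pages)  # pages 0 and 1 are merged, page 2 stays alone
```

Each resulting group plays the role of one classified webpage $\phi_i$; a different clustering rule would change the groups but not the shape of the output.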
3.2. Event features
This section discusses some characteristics of the terms and presents how to process them for measuring the veracity of a web event via uncertainty. As shown in Fig. 2, a series of websites provides many classified webpages along with many event attributes. This means that website confidence and classified webpage uncertainty affect each other
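according to the distribution of classified webpages on websites, which is defined as follows:

$$\Theta = \begin{pmatrix} \theta_{11} & \cdots & \theta_{1n} \\ \vdots & \ddots & \vdots \\ \theta_{v1} & \cdots & \theta_{vn} \end{pmatrix} \tag{2}$$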
where $\theta_{ij} = 1$ if elements of $\phi_j$ appear in $ws_i$, and $\theta_{ij} = 0$ otherwise. Namely, $\theta_{ij}$ denotes whether the ith website involves the jth classified webpage or not. One web event has only one distribution of classified webpages on websites. If the distribution is built on uncertain classified webpages and low-confidence websites, then the veracity of the web event is low. In the computing process, this distribution is treated as one of the web event features and is denoted as $\Theta$.

Fig. 3 shows that a website provides many webpages with a number of event attributes, and these attributes are distributed over webpages covering various events. This distribution implies that the confidence of a website is affected by the uncertainty of its webpages, while webpage uncertainty and attribute uncertainty affect each other according to the distribution of attributes on webpages. Moreover, the veracity of a web event is influenced by this distribution. We define it as the distribution of event attributes on webpages $\Lambda^i$, which is also an event feature. Herein, we use $\Lambda^i$ to describe the ith distribution of event attributes on webpages during a period of time.

$$\Lambda^i = \begin{pmatrix} \lambda^i_{11} & \cdots & \lambda^i_{1n} \\ \vdots & \ddots & \vdots \\ \lambda^i_{m1} & \cdots & \lambda^i_{mn} \end{pmatrix} \tag{3}$$

where $\lambda^i_{mn} = 1$ if $k_{in} \in \omega_{im}$ and $\lambda^i_{mn} = 0$ if $k_{in} \notin \omega_{im}$; that is, $\lambda^i_{mn}$ denotes whether the nth event attribute exists in the mth webpage or not.

From the description above, we find that a webpage has two labels besides its attributes. One is where the webpage comes from (which website it belongs to); the other is what the webpage offers (which classified webpage it belongs to). The distribution of attributes on webpages ties attribute uncertainty and webpage uncertainty closely together. In the same way, the distribution of classified webpages on websites ties classified webpage uncertainty and website confidence closely together. The difference between the two distributions, as a clue for measuring event uncertainty, is discussed in Section 4. After the uncertainty of a web event is obtained, its veracity can be determined.

So far, we have obtained the website $ws_i$, webpage $\omega_{ij}$, classified webpage $\phi_i$, event attribute $k_{ij}$, the distribution of attributes on webpages and the distribution of classified webpages on websites. All of these event features come from webpages and websites. Some user-side features could also be obtained, for example from Google Trend (http://trends.google.com), but such features are somewhat delayed in time. So in this paper, we consider the webpages of a web event alone. In the next section, we employ a computational model to analyze how the event features influence the veracity measuring process of a web event. All the features play their different roles according to the two distributions. More details are discussed in the next section.
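The two distributions of Eqs. (2) and (3) are 0/1 incidence matrices, so they can be built directly from membership data. The sketch below is illustrative only: the function names, the toy page identifiers and the attribute words are hypothetical, and the choice of rows for webpages and columns for attributes in the second matrix is an assumption made to match the index order (webpage, attribute) used in Eq. (4).

```python
def classified_page_distribution(site_pages, groups):
    """Eq. (2): entry [i][j] is 1 if any page of classified webpage j
    is hosted on website i, else 0."""
    return [
        [1 if any(p in pages for p in group) else 0 for group in groups]
        for pages in site_pages
    ]

def attribute_distribution(page_attrs, attributes):
    """Eq. (3) for one website: entry [m][n] is 1 if attribute n occurs
    in webpage m, else 0 (rows = webpages, columns = attributes)."""
    return [[1 if a in attrs else 0 for a in attributes] for attrs in page_attrs]

# Hypothetical data: two websites hosting three pages in total.
site_pages = [{"p1", "p2"}, {"p3"}]          # pages hosted by each website
groups = [{"p1", "p3"}, {"p2"}]              # classified webpages (sub-events)
theta = classified_page_distribution(site_pages, groups)

# Hypothetical attributes of the two pages of the first website.
attributes = ["donation", "luxury", "manager"]
page_attrs = [{"donation", "luxury"}, {"luxury"}]
lam = attribute_distribution(page_attrs, attributes)
```

Here `theta` has one row per website and one column per classified webpage, while one `lam` matrix is built per website.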
Fig. 4. Computational model of web event uncertainty.
4. Veracity measuring process of web event via uncertainty
According to the discussion in the last section, computing web event veracity via uncertainty is influenced by multiple factors, such as attribute uncertainty, webpage uncertainty, website confidence and the distributions among these factors. The uncertainty of attributes and the uncertainty of webpages influence each other according to the distribution of attributes on webpages. Based on the webpages, we can get the website confidence, which lays the foundation for later computation. Furthermore, website confidence and classified webpage uncertainty influence each other according to the distribution of classified webpages on websites. One classified webpage equals one sub-event, so we finally get the veracity of a web event via uncertainty from the uncertainty of the features and the confidence of the corresponding websites through the two distributions shown in Fig. 4.

In Fig. 4, a web event is supported by a classified webpage set $\phi$, where $\phi = \{\phi_1, \phi_2, \ldots, \phi_{n(\phi)}\}$ and $n(\phi)$ is the number of classified webpages supporting the web event. The classified webpage set interacts with the website set $ws$, where $ws = \{ws_1, ws_2, \ldots, ws_{n(ws)}\}$ and $n(ws)$ is the number of websites reporting the web event, according to the distribution of classified webpages on websites $\Theta$. The webpage set of a web event e includes all the webpages in all websites and can be denoted as $\Omega = \{\Omega_1, \Omega_2, \ldots, \Omega_{n(ws)}\}$, where $\Omega_i$ is the set of all the webpages in the ith website belonging to the web event and can be denoted by $\Omega_i = \{\omega_{i1}, \omega_{i2}, \ldots, \omega_{in(\Omega_i)}\}$; $n(\Omega_i)$ is the number of webpages in the ith website related to web event e. In the same way, the attribute set includes all the attributes and can be denoted by $K = \{K_1, K_2, \ldots, K_{n(ws)}\}$, where $K_i$ is the set of all the attributes in the ith website and can be denoted by $K_i = \{k_{i1}, k_{i2}, \ldots, k_{in(K_i)}\}$; $n(K_i)$ is the number of attributes in the ith website related to web event e. The webpage set and the attribute set interact with each other according to the vector of attribute distributions on webpages $\vec{\Lambda} = (\Lambda^1, \Lambda^2, \ldots, \Lambda^{n(ws)})^T$.

Based on the analysis above, the veracity computation of a web event via uncertainty can be divided into three parts:
1) How the distribution of attributes on webpages influences the veracity of a web event via uncertainty. In this part, based on the distribution of attributes on webpages, we compute the event attribute uncertainty, the webpage uncertainty and the webpage based website confidence.
2) How the uncertainty of classified webpages and the confidence of websites influence the veracity of a web event. In this part, based on the distribution of classified webpages on websites, we measure the uncertainty of each classified webpage, which describes a different sub-event of the web event.
3) The veracity of the web event is computed based on the uncertainty of the classified webpages and the confidence of the corresponding websites.
4.1. Measuring of attribute uncertainty and website confidence
After the discussion of the basic features of a web event, we introduce some deductions as guidance for measuring the veracity of a web event via its uncertainty and website confidence. These event features also make the process visible, which makes it easy to find some common rules.

Deduction 1. If all the websites support the entire set of event attributes, shown as $K_i$ in Fig. 3, then the uncertainty of webpage $c(\omega_{ij})$ and the uncertainty of webpage attributes $es(k_{ij})$ should be low, and the veracity of the web event via uncertainty reaches its highest point, since all the websites have the same view of the event.

Deduction 2. If most of the attributes of a web event are provided by a few webpages distributed over different websites, the uncertainty of webpage $c(\omega_{ij})$ and the uncertainty of webpage attributes $es(k_{ij})$ will be high, and the veracity of the web event via uncertainty will be low.
Table 1. The event features and symbols used in our paper.
Name | Symbol
Web event | $e$
Classified webpage | $\phi_i$
Event attributes | $K$
Distribution of classified webpages on websites | $\Theta$
Similarity | $I$
Website | $ws_i$
Webpage in website | $\omega_{ij}$
Webpage attributes | $k_{ij}$
Distribution of webpages in websites | $\Lambda^i$
Web event veracity | $ver(e)$
Web event uncertainty | $uc(e)$
Webpage uncertainty | $c(\omega_{ij})$
Classified webpage uncertainty | $c(\phi_i)$
Website confidence | $T(ws_i)$
Webpages based website confidence | $t(ws_i)$
Uncertainty of webpage attributes | $es(k_{ij})$
Based on Deductions 1 and 2, which analyze how attributes and webpages contribute to the veracity of a web event via uncertainty through the distribution of event attributes on webpages $\Lambda^i$, we choose an iterative computational process to implicitly compute the veracity of the web event via uncertainty using the distribution of attributes on webpages. At each iteration, the webpage uncertainty and the event attribute uncertainty are inferred from each other as follows:

$$c(\omega_{ij}) = \frac{1}{\sum_{n=1}^{n(K_i)} \lambda^i_{jn}} \sum_{n=1}^{n(K_i)} es(k_{in})\,\lambda^i_{jn} \tag{4}$$

where $c(\omega_{ij})$ denotes the uncertainty of webpage $\omega_{ij}$; $es(k_{ij})$ denotes the uncertainty of attribute $k_{ij}$; $\lambda^i_{jn}$ denotes whether the attribute $k_{in}$ exists in webpage $\omega_{ij}$ or not; $n$ is the order of the event attribute; and $\sum_{n=1}^{n(K_i)} \lambda^i_{jn}$ is the number of attributes in webpage $\omega_{ij}$. Note that webpage $\omega_{ij}$ is different from classified webpage $\phi_i$: $\omega_{ij}$ denotes a webpage from one website, while $\phi_i$ denotes one classified webpage supporting a subordinated web event. Eq. (4) assigns the average uncertainty of its attributes to the corresponding webpage.

According to Deductions 1 and 2, the uncertainty of event attributes can be obtained from the uncertainty of the webpages that they belong to. After getting the uncertainty of a webpage from Eq. (4), we can easily get its certainty degree as $1 - c(\omega_{ij})$. Then $\prod_{n=1}^{n(\Omega_i)} (1 - c(\omega_{in}))^{\lambda^i_{nj}}$, where $n$ now runs over the webpages of the ith website, is the completed certainty degree of attribute $k_{ij}$, and $1 - \prod_{n=1}^{n(\Omega_i)} (1 - c(\omega_{in}))^{\lambda^i_{nj}}$ is the uncertainty of attribute $k_{ij}$:

$$es'(k_{ij}) = 1 - \prod_{n=1}^{n(\Omega_i)} (1 - c(\omega_{in}))^{\lambda^i_{nj}} \tag{5}$$

where $es'(k_{ij})$ is the auxiliary variable used to get the attribute uncertainty $es(k_{ij})$. The meanings of the other variables are in accordance with the above description.

In order to facilitate the computation and veracity exploration, we do some transformations as follows.

$$es'(k_{ij}) = 1 - \prod_{n=1}^{n(\Omega_i)} (1 - c(\omega_{in}))^{\lambda^i_{nj}}$$

$$1 - es'(k_{ij}) = \prod_{n=1}^{n(\Omega_i)} (1 - c(\omega_{in}))^{\lambda^i_{nj}}$$

$$\ln(1 - es'(k_{ij})) = \sum_{n=1}^{n(\Omega_i)} \ln\left((1 - c(\omega_{in}))^{\lambda^i_{nj}}\right) \tag{6}$$

Attribute uncertainty can then be obtained as follows.

$$es(k_{ij}) = -\ln(1 - es'(k_{ij})) = -\sum_{n=1}^{n(\Omega_i)} \ln\left((1 - c(\omega_{in}))^{\lambda^i_{nj}}\right) \tag{7}$$

Up to now, we have obtained the attribute uncertainty and the webpage uncertainty using the distribution of attributes on webpages (Table 1). The webpages based website confidence (denoted by $t(ws_i)$, which lays the foundation for computing website confidence) can be measured by the average certainty degree of the webpages:

$$t(ws_i) = \frac{1}{n} \sum_{q=1}^{n} (1 - c(\omega_{iq})) \tag{8}$$
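One sweep of Eqs. (4), (5) and (8) for a single website can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names and toy values are hypothetical, the matrix `lam` uses rows for webpages and columns for attributes, the current attribute uncertainties are assumed to lie in [0, 1), and a webpage without attributes is assigned uncertainty 0. The loop and stopping rule of the full iteration are deliberately left out.

```python
def page_uncertainty(lam, es):
    """Eq. (4): average uncertainty of the attributes present in each webpage."""
    result = []
    for row in lam:
        present = sum(row)                       # number of attributes in the page
        total = sum(e * x for e, x in zip(es, row))
        result.append(total / present if present else 0.0)  # assumption: empty page -> 0
    return result

def attribute_aux_uncertainty(lam, c):
    """Eq. (5): es'(k_n) = 1 - prod_j (1 - c_j) ** lam[j][n]."""
    out = []
    for n in range(len(lam[0])):
        certain = 1.0
        for j, row in enumerate(lam):
            certain *= (1.0 - c[j]) ** row[n]    # pages without the attribute contribute 1
        out.append(1.0 - certain)
    return out

def site_confidence(c):
    """Eq. (8): mean certainty degree of the website's webpages."""
    return sum(1.0 - x for x in c) / len(c)

# Hypothetical toy website: 2 webpages (rows) x 3 attributes (columns).
lam = [[1, 1, 0],
       [0, 1, 1]]
es = [0.2, 0.4, 0.6]                 # current attribute uncertainties (assumed values)
c = page_uncertainty(lam, es)        # webpage uncertainties
es_aux = attribute_aux_uncertainty(lam, c)
t = site_confidence(c)               # webpages based website confidence
```

In this toy case the second attribute occurs in both webpages, so its auxiliary uncertainty combines both page certainties, while the other two attributes simply inherit the uncertainty of their single page.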
Eq. (8) measures the webpages based website confidence during a period of time; $n$ is the number of webpages in the corresponding website; $\omega_{iq}$ denotes the webpages provided by the website $ws_i$; $q$ is the order of the webpage in the corresponding website. As described above, we can infer the webpages based website confidence and the uncertainty of both webpages and attributes during the iteration process. The iteration terminates when the computation reaches a stable state. As in other iterative approaches, we initialize the uncertainty of attributes with uniform values.

4.2. Measuring uncertainty of classified webpages and confidence of websites
According to the discussion above, the distribution between classified webpages and websites partly determines the veracity of a web event via uncertainty, so we give another two deductions as follows:

Deduction 3. The uncertainty of classified webpage $c(\phi_i)$ decreases as the website confidence $T(ws_i)$ increases. If the classified webpages of a web event are supported by all kinds of highly authoritative websites (i.e., high-confidence websites), the corresponding classified webpages are much more certain.

Deduction 4. Website confidence $T(ws_i)$ is affected by the uncertainty of classified webpage $c(\phi_i)$.
If a website supports most of the highly certain classified webpages of a web event, its confidence is strengthened, and in return the web event veracity is promoted. If most of the highly uncertain classified webpages are supported by one website, the website confidence is weakened, and the corresponding web event veracity is reduced. In other words, the uncertainty of a classified webpage is inversely related to web event veracity, while website confidence is proportional to web event veracity. We employ $c(\phi_i) \propto T(ws_i)$ to get the website confidence, which contributes to the final web event veracity. According to Deductions 3 and 4, the classified webpage, which stores the information of a subordinated web event, and the website confidence determine each other. Because of this interdependency between classified webpage uncertainty and website confidence, we choose an iterative process to measure the uncertainty of a web event using the distribution of classified webpages on websites, which avoids having to settle the parameters of the two distributions. The classified webpage uncertainty and the website confidence are deduced from each other in the iteration through the distribution between them. At the beginning, the value of the webpages based website confidence $t(ws_i)$ is assigned to the website confidence $T(ws_i)$. Then the classified webpage uncertainty can be calculated by

$$c'(\phi_j) = \prod_{i=1}^{n(ws)} (1 - T(ws_i))^{\theta_{ij}}$$

$$c(\phi_j) = \ln(c'(\phi_j)) = \sum_{i=1}^{n(ws)} \ln\left((1 - T(ws_i))^{\theta_{ij}}\right) \tag{9}$$

where $c'(\phi_j)$ is an auxiliary variable used to get the classified webpage uncertainty $c(\phi_j)$; $T(ws_i)$ is the confidence of the website and $(1 - T(ws_i))$ is its unauthentic degree; $n(ws)$ is the number of websites affecting the corresponding web event; $\theta_{ij}$ indicates whether website $ws_i$ includes webpages belonging to $\phi_j$. Note again that the classified webpage $\phi_i$ is different from the webpage $\omega_{ij}$: the former is a classified webpage that may belong to all kinds of websites, while $\omega_{ij}$ is a webpage belonging to one website. According to Deduction 4, the confidence of a website is greatly affected by its web event, so we get

$$T(ws_i) = \frac{\sum_{j=1}^{n(\phi)} (1 - c(\phi_j))\,\theta_{ij}}{\sum_{j=1}^{n(\phi)} \theta_{ij}} \tag{10}$$

Here the meanings of the website confidence $T(ws_i)$, $\theta_{ij}$ and the classified webpage uncertainty $c(\phi_j)$ have been explained above. Under the influence of the classified webpages, the website confidence is revised to a steady value.

4.3. Measuring veracity of web event via uncertainty
The whole process is displayed in Fig. 4. Herein, we detail the event features and show how these features play their part through the attribute distribution and the classified webpage distribution. An iteration process is employed in measuring web event uncertainty. When all the states reach a stable state, the computation stops. In the whole process, we focus on how the attributes change instead of modeling explicitly the distribution of attributes on webpages and of classified webpages on websites, whose parameters are hard to settle as discussed in Section 4. After a sequence of computations, we obtain the veracity of a web event via its uncertainty using all kinds of event features through the distribution of attributes on webpages and the distribution of classified webpages on websites. Because web event uncertainty is inversely related to web event veracity, we can get the web event veracity once the web event uncertainty is known.

5. How the iteration processes work with matrix operations
From the description above we know that some event features influence each other, but it is hard to find out how they affect the uncertainty computation. Herein, we use matrix operations to analyze how the event features affect the uncertainty of a web event. Matrix representation also facilitates the calculation of the distributions. So we use matrix operations to simulate the interaction between the distribution of attributes on webpages and the distribution of classified webpages on websites.
To analyze how the event features influence each other and to facilitate computation, we use vectors to represent the event attribute uncertainty, webpage uncertainty and webpage based website confidence as following. Since each website contains abundant event attributes, the uncertainty of attributes is denoted as following: = (es( ), . . ., es( es() 1 n(ws) )) ) = (es( ), . . ., es( )) es( i i1 iu
T
) 1 ≤ u ≤ n( i
) is the uncertainty of where n(ws) is the number of websites; es( i attributes belong to the ith website; n(i ) is the number of event attributes for each website; ij denotes the jth attribute in the ith website. Since each website contains abundant webpages, the uncertainty of the webpages is denoted as following: →), ..., c(− −→ ) = (c(− c( − 1 n(ws) ))
If all the subordinated web events of an event are uncertain, the corresponding web event is uncertain. And if none of subordinated web event is certain, then the event is the most uncertain. Deduction 5 is put forward. Deduction 5. Web event veracity ver(e) via uncertainty uc(e) can be determined by its classified webpages. Based on Deduction 5, the veracity of web event can be measured as following. 1
uc(e) = c(ϕi ) n
5.1. Iteration process between attributes and webpages with matrix operations
n
(11)
i=1
ver(e) = 1 − uc(e) Where n is the number of classified webpage belonging to web event e. Corresponding meanings of other symbols are shown prior.
i ) = (c(i1 ), ..., c(iv ))T c(
i) 1 ≤ v ≤ n(
i ) is the number of webpages in websites; ij denotes the where n( jth attribute in the ith website. For the attributes distributions on webpages and the website confidence, the donations are expressed as following: − → = (1 , . . ., n )T → = (t(ws ), . . ., (ws ))T t(− ws) n 1 denotes the webpages distribution on websites; i denotes where − → is a vector donating the ith webpage distribution on websites; t(ws) all the confidence of websites. −−→ Let En×1 denotes a vector with the elements equal to 1. Details can be seen in the far left of Fig. 4.
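As a point of reference for the matrix formulation, the scalar updates of Eqs. (9)–(11) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the 0/1 matrix `omega` (rows are websites, columns are classified webpages), the toy values, and all function names are hypothetical, and the uncertainties passed to Eqs. (10) and (11) are assumed to be already scaled to [0, 1], since how the logarithmic value of Eq. (9) is mapped to that range is not specified here.

```python
import math

def aux_uncertainty(T, omega, j):
    """Product form of Eq. (9): prod_i (1 - T[i]) ** omega[i][j]."""
    value = 1.0
    for i, t_i in enumerate(T):
        value *= (1.0 - t_i) ** omega[i][j]
    return value

def log_uncertainty(T, omega, j):
    """Sum form of Eq. (9): sum_i omega[i][j] * ln(1 - T[i])."""
    return sum(omega[i][j] * math.log(1.0 - t_i) for i, t_i in enumerate(T))

def website_confidence(c, omega, i):
    """Eq. (10): mean of (1 - c[j]) over the classified webpages website i supports.

    Assumes website i supports at least one classified webpage.
    """
    support = sum(omega[i])
    return sum((1.0 - c_j) * w for c_j, w in zip(c, omega[i])) / support

def event_veracity(c):
    """Eq. (11): uc(e) is the mean uncertainty; ver(e) = 1 - uc(e)."""
    uc = sum(c) / len(c)
    return uc, 1.0 - uc

# Hypothetical toy data: 3 websites (rows) x 2 classified webpages (columns).
omega = [[1, 0], [1, 1], [0, 1]]
T = [0.9, 0.8, 0.5]   # assumed website confidences
c = [0.2, 0.4]        # assumed uncertainties, already scaled to [0, 1]

c_aux = [aux_uncertainty(T, omega, j) for j in range(2)]
T_new = [website_confidence(c, omega, i) for i in range(3)]
uc, ver = event_veracity(c)
print(c_aux, T_new, uc, ver)
```

The sketch also makes the role of the logarithm visible: the sum form returned by `log_uncertainty` equals the logarithm of the product form, which is what allows the update to be written as a matrix product in this section.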
Please cite this article in press as: Wang, X., et al., Measuring the veracity of web event via uncertainty. J. Syst. Software (2014), http://dx.doi.org/10.1016/j.jss.2014.07.023
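At the beginning of the first iteration, we set the initial value of $es(\vec{\varepsilon}_i)$ to $es^{0}(\vec{\varepsilon}_i)$. According to Eqs. (4) and (7) in Section 4, we obtain Eq. (12).²

$$\begin{aligned} es(\vec{\varepsilon}_i) &= -\Lambda_i * \ln\left(1 - c(\vec{\pi}_i)\right) \\ c(\vec{\pi}_i) &= \Lambda_i * es(\vec{\varepsilon}_i) \,./\, \left(\Lambda_i * \vec{E}_{\mathrm{row}(\Lambda_i)\times 1}\right) \end{aligned} \tag{12}$$

where $\Lambda_i$ denotes the attributes distribution on the webpages of the $i$th website, $es(\vec{\varepsilon}_i)$ the attribute uncertainty and $c(\vec{\pi}_i)$ the webpage uncertainty of that website. Then $c^{0}(\vec{\pi}_i) = \Lambda_i * es^{0}(\vec{\varepsilon}_i) \,./\, (\Lambda_i * \vec{E}_{\mathrm{row}(\Lambda_i)\times 1})$. More clearly, $es(\vec{\varepsilon}_i)$ can be treated as a hidden variable; that is, the two equations in Eq. (12) can be merged into one, as in Eq. (13).

$$c^{n}(\vec{\pi}_i) = -\Lambda_i * \Lambda_i * \ln\left(1 - c^{n-1}(\vec{\pi}_i)\right) \,./\, \left(\Lambda_i * \vec{E}_{\mathrm{row}(\Lambda_i)\times 1}\right) = \cdots = f_1\left(\Lambda_i, c^{0}(\vec{\pi}_i), \vec{E}_{\mathrm{row}(\Lambda_i)\times 1}\right) \tag{13}$$

Since Eq. (13) is a recursive function, $c^{n}(\vec{\pi}_i)$ can be obtained after $n$ recursions from the starting value $c^{0}(\vec{\pi}_i)$. It is inadvisable to write out the recursion directly, so we use function $f_1$ to denote the process. Since $c^{0}(\vec{\pi}_i)$ and $\vec{E}_{\mathrm{row}(\Lambda_i)\times 1}$ are constant, $c^{n}(\vec{\pi}_i)$ is determined by the distribution of attributes on webpages, $\Lambda_i$. This means that if we can figure out the rules of the corresponding distribution, we can obtain all the uncertainties of the event features. Furthermore, the confidence of the corresponding website can be expressed by

$$t(ws_i) = \frac{1}{n}\left(1 - c^{n}(\vec{\pi}_i)\right) * \vec{E}_{\mathrm{row}(\pi_i)\times 1} = \frac{1}{n}\left(1 - f_1\left(\Lambda_i, c^{0}(\vec{\pi}_i), \vec{E}_{\mathrm{row}(\Lambda_i)\times 1}\right)\right) * \vec{E}_{\mathrm{row}(\pi_i)\times 1} = D_1\left(\Lambda_i, c^{0}(\vec{\pi}_i), \vec{E}_{\mathrm{row}(\Lambda_i)\times 1}, \vec{E}_{\mathrm{row}(\pi_i)\times 1}\right) \tag{14}$$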
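The matrix operations used in Eqs. (12) and (14) can be illustrated with a small, dependency-free Python sketch. It is a sketch under assumptions rather than the authors' implementation: the distribution matrix is taken to be a 0/1 matrix with one row per webpage and one column per attribute, the all-ones vector is sized to the number of columns so that the product is defined, and the names `matvec`, `elementwise_div`, `webpage_uncertainty` and `webpage_based_confidence` are hypothetical.

```python
def matvec(A, x):
    """Matrix-vector product A * x for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def elementwise_div(a, b):
    """Element-wise division, written './' in the text."""
    return [x / y for x, y in zip(a, b)]

def webpage_uncertainty(dist, es):
    """Second line of Eq. (12): (dist * es) ./ (dist * E).

    Each webpage receives the average uncertainty of the attributes it contains.
    """
    ones = [1.0] * len(dist[0])  # all-ones vector E (assumed length: number of columns)
    return elementwise_div(matvec(dist, es), matvec(dist, ones))

def webpage_based_confidence(c):
    """Eq. (14): t(ws_i) = (1/n) * sum over webpages of (1 - c)."""
    return sum(1.0 - x for x in c) / len(c)

# Element-wise division example from footnote 2.
print(elementwise_div([1, 2, 3], [3, 2, 1]))

# Hypothetical website with 2 webpages (rows) and 3 attributes (columns).
dist = [[1, 1, 0],
        [0, 1, 1]]
es = [0.2, 0.4, 0.6]          # assumed attribute uncertainties
c = webpage_uncertainty(dist, es)
t = webpage_based_confidence(c)
print(c, t)
```

Dividing by `dist * E` normalizes each row by the number of attributes the webpage actually contains, which is why the element-wise operator is needed instead of an ordinary matrix inverse.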
2 The symbol "′" at the top right of a matrix or vector denotes transposition, and the symbol "." before an operator denotes an element-wise operation, as in Matlab; for example, [1, 2, 3] ./ [3, 2, 1] = [1/3, 2/2, 3/1].

As Eq. (14) shows, $t(ws_i)$ can be obtained using function $f_1$. Eq. (14) corresponds to Eq. (8). To facilitate observation, we use function $D_1$ to denote the result more clearly. Like the uncertainty of webpages, the website confidence is also determined by the distribution $\Lambda_i$. All the website confidences can then be written as follows.

$$t(\vec{ws}) = \left(D_1\left(\Lambda_1, c^{0}(\vec{\pi}_1), \vec{E}_{\mathrm{row}(\Lambda_1)\times 1}, \vec{E}_{\mathrm{row}(\pi_1)\times 1}\right), \ldots, D_1\left(\Lambda_n, c^{0}(\vec{\pi}_n), \vec{E}_{\mathrm{row}(\Lambda_n)\times 1}, \vec{E}_{\mathrm{row}(\pi_n)\times 1}\right)\right) = D_1\left(\vec{\Lambda}, c^{0}(\vec{\pi}), \vec{E}_{\mathrm{row}(\Lambda)\times 1}, \vec{E}_{\mathrm{row}(\pi)\times 1}\right) \tag{15}$$

where $\vec{\Lambda} = (\Lambda_1, \Lambda_2, \ldots, \Lambda_n)^{T}$ is the vector of attributes distributions on webpages and $c^{0}(\vec{\pi})$ is the starting value of the webpage uncertainty. $\vec{E}_{n\times 1}$ is a vector of size $n \times 1$ whose elements all equal 1, and $\vec{E}_{\mathrm{row}(\Lambda)\times 1}$ stands for $n$ such vectors with unit values; the same holds for $\vec{E}_{\mathrm{row}(\pi)\times 1}$. Since $c^{0}(\vec{\pi})$, $\vec{E}_{\mathrm{row}(\Lambda)\times 1}$ and $\vec{E}_{\mathrm{row}(\pi)\times 1}$ are constant, the website confidences are determined by the distribution $\vec{\Lambda}$. Up to now, the website confidence has been discussed using matrix operations through the attributes distribution on webpages. The following part discusses how to explain the iteration process of the uncertainty computation of a web event using matrix operations.

5.2. Iteration process between classified webpages and websites with matrix operations

As in the previous part, some vectors are introduced first, including the website confidence and the classified webpage uncertainty.

$$T(\vec{ws}) = \left(T(ws_1), \ldots, T(ws_{n(ws)})\right)^{T}, \qquad c(\vec{\phi}) = \left(c(\phi_1), \ldots, c(\phi_{n(\phi)})\right)^{T}$$

Corresponding to Eqs. (9) and (10), an iteration process exists between the website confidence and the classified webpage uncertainty. Based on the two equations, the following equation is given.

$$\begin{aligned} T(\vec{ws}) &= \Omega * \left(1 - c(\vec{\phi})\right) \,./\, \left(\Omega * \vec{E}_{\mathrm{row}(\Omega)\times 1}\right) \\ c(\vec{\phi}) &= \Omega * \ln\left(1 - T(\vec{ws})\right) \end{aligned} \tag{16}$$

In Section 5.1, we obtained the webpage-based website confidence, and the starting value of the website confidence equals it, namely $T^{0}(\vec{ws}) = t(\vec{ws})$. The starting value of the classified webpage uncertainty can then be expressed by

$$c^{0}(\vec{\phi}) = \Omega * \ln\left(1 - T^{0}(\vec{ws})\right) = \Omega * \ln\left(1 - t(\vec{ws})\right)$$

More clearly, $T(\vec{ws})$ can be eliminated after some transformations as follows.

$$c^{n}(\vec{\phi}) = \Omega * \ln\left(1 - \Omega * \left(1 - c^{n-1}(\vec{\phi})\right) \,./\, \left(\Omega * \vec{E}_{\mathrm{row}(\Omega)\times 1}\right)\right) = \cdots = f_2\left(\Omega, c^{0}(\vec{\phi}), \vec{E}_{\mathrm{row}(\Omega)\times 1}\right) \tag{17}$$

As in Eq. (17), the recursion process can be denoted as $f_2$. Except for the matrix $\Omega$, all other parameters, such as $c^{0}(\vec{\phi})$ and $\vec{E}_{\mathrm{row}(\Omega)\times 1}$, are constant, so the classified webpage uncertainty is determined by the classified webpages distribution on websites, as shown in the middle of Fig. 4. As Eq. (11) shows, the uncertainty of a web event is the sum of the classified webpage uncertainties scaled by $1/n$, so we get the following equation.

$$uc = \frac{1}{n}\, c^{n}(\vec{\phi}) * \vec{E}_{\mathrm{row}(c)\times 1} = \frac{1}{n}\, f_2\left(\Omega, c^{0}(\vec{\phi}), \vec{E}_{\mathrm{row}(\Omega)\times 1}\right) * \vec{E}_{\mathrm{row}(c)\times 1} \tag{18}$$

Here $\Omega$ is the classified webpages distribution on websites. Furthermore, we know the values of $c^{0}(\vec{\phi})$ and $t(\vec{ws})$ from the iteration process described above, so we obtain the final equation:

$$\begin{aligned} uc &= \frac{1}{n}\, f_2\left(\Omega, \Omega * \ln\left(1 - t(\vec{ws})\right), \vec{E}_{\mathrm{row}(\Omega)\times 1}\right) * \vec{E}_{\mathrm{row}(\phi)\times 1} \\ &= \frac{1}{n}\, f_2\left(\Omega, \Omega * \ln\left(1 - D_1\left(\vec{\Lambda}, c^{0}(\vec{\pi}), \vec{E}_{\mathrm{row}(\Lambda)\times 1}, \vec{E}_{\mathrm{row}(\pi)\times 1}\right)\right), \vec{E}_{\mathrm{row}(\Omega)\times 1}\right) * \vec{E}_{\mathrm{row}(\phi)\times 1} \\ &= D_2\left(\Omega, \vec{\Lambda}, c^{0}(\vec{\pi}), \vec{E}_{\mathrm{row}(\Lambda)\times 1}, \vec{E}_{\mathrm{row}(\pi)\times 1}, \vec{E}_{\mathrm{row}(\Omega)\times 1}, \vec{E}_{\mathrm{row}(\phi)\times 1}\right) \end{aligned} \tag{19}$$

The meanings of the factors coincide with the previous descriptions. Finally, we use function $D_2$ to express the result clearly. We can easily confirm that the web event veracity via uncertainty is determined by distributions: the attributes distributions on the webpages belonging to the corresponding websites, $\vec{\Lambda} = (\Lambda_1, \ldots, \Lambda_n)$, and the classified webpages distribution on websites, $\Omega$. We can therefore conclude that $uc = D(\Omega, \vec{\Lambda})$. Here $D$ can be seen as a brief representation of $D_2(\Omega, \vec{\Lambda}, c^{0}(\vec{\pi}), \ldots)$ shown in Eq. (19). That is, web event veracity via uncertainty is determined by the two distributions, which accords with the veracity analysis discussed in Section 4. In the computing process, however, iteration successfully avoids settling the specific distributions. According to the computation and evaluation of the iteration process in implementing the veracity of web event via uncertainty with matrix operations, we can conclude that our iteration process is in line with the veracity analysis described in Section 4. At the same time, the iteration process has the advantage of removing the demand for the parameters of specific distributions. Our method employs the iteration process among a series of event
Table 2. The details of the dataset used for period detection (the life length of the 100 events is about 30–40 days).

| Feature | Value |
| Average number of seeds per event | 2 |
| Average number of Webpages per event | 5556 |
| Average number of event attributes per event | 16,856 |
| Average number of days per event | 40 |
| Average number of Webpages per day | 146 |
| Average number of event attributes per day | 469 |
features instead of modeling the distribution of attributes on webpages and of classified webpages on websites directly; that is, we compute the web event veracity via uncertainty through the two distributions without modeling them directly.

6. Experiments

This section implements the proposed web event veracity model. As discussed earlier, web event veracity is closely connected to web event uncertainty and website confidence. High uncertainty of attributes, webpages and classified webpages, together with low website confidence, yields high uncertainty and low veracity of a web event. These features influence each other through two distributions (attributes on webpages, classified webpages on websites), and a sparse distribution over uncertain webpages and unauthentic websites leads to a more uncertain and less veracious web event. The data sets and the experimental analysis are presented in this section.

6.1. Data sets

The web events in our experiments were downloaded from Baidu (http://news.baidu.com) and other news websites. The seeds of the events were extracted from baidu.com. We chose 900,000 webpages about 100 events as the experimental data set, which covers politics, accidents, disasters, terrorist attacks and so on. Table 2 shows statistics of these data. The start timestamp is determined following Jin et al. (2010), and the end timestamp is set to the time at which we stopped crawling webpages. The average web event life length is about 30–40 days.

6.2. Experimental analysis of how the event features influence the veracity of web event via uncertainty

6.2.1. How the classified webpages distribution on the websites influences the veracity of web event via uncertainty

In Fig. 5, with the distribution of attributes on webpages fixed, we compute the uncertainty of the web event as the distribution of classified webpages on websites varies.
Fig. 6. How the website confidence and the diversity influence the web event uncertainty.
Suppose the number of edges is max(ne) when all the classified webpages are supported by all websites. The density of edges is then the number of existing edges divided by max(ne). From Fig. 5 we can see that the uncertainty of the web event is 0 and its veracity is 1 when the density of edges is near 1, while the uncertainty is high and the veracity is low when the density of edges is close to 0. The results accord with Deductions 1 and 2. When the density of edges is near 1, all the subordinated web events of a web event are supported by all websites, which indicates that the event has been acknowledged by all media. In this case, the corresponding web event uncertainty is low and its veracity is high. When the density is close to 0, few websites support the related subordinated web events and the evolution direction of the event is quite uncertain, so the corresponding web event uncertainty is high and its veracity is low.

6.2.2. How the website confidence influences the web event uncertainty

We assigned the website confidences randomly as initial values at the beginning of the iterative process, and then computed the web event uncertainty. Fig. 6 shows how the website confidence influences the web event veracity. From the experimental results, we obtain the uncertainty distribution pattern; namely, web event veracity via uncertainty is related not only to the website confidence but also to the diversity of webpages. Website confidence drifts slightly with the type of website, such as news, bbs and blog, and web event veracity is affected accordingly. Therefore, the website confidence and the distribution of classified webpages on websites both affect the web event uncertainty.

6.2.3. Experimental results about website confidence

According to the descriptions in Sections 3–5, the web event uncertainty is influenced by the website confidence, the attribute uncertainty, the webpage uncertainty, the classified webpage uncertainty and the two related distributions. This section uses website confidence to check whether our algorithm is effective, because website confidence is easy for humans to evaluate.
Fig. 5. How the classified webpage distribution on the websites influences the web event uncertainty.
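The edge density that Fig. 5 varies can be computed directly from a support matrix. The following is a minimal sketch, assuming the distribution of classified webpages on websites is stored as a 0/1 matrix (one row per website, one column per classified webpage); the name `edge_density` and the toy matrices are hypothetical.

```python
def edge_density(support):
    """Number of existing edges divided by max(ne), the edge count when
    every classified webpage is supported by every website."""
    n_websites = len(support)
    n_classified = len(support[0])
    max_edges = n_websites * n_classified
    existing = sum(1 for row in support for w in row if w)
    return existing / max_edges

# Hypothetical example: 3 websites x 2 classified webpages.
sparse = [[1, 0], [1, 1], [0, 1]]
full = [[1, 1], [1, 1], [1, 1]]
print(edge_density(sparse), edge_density(full))
```

A density of 1 corresponds to the case described in Section 6.2.1 in which every subordinated web event is supported by all websites.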
6.2.3.1. Statistical results. We selected 23 representative websites from the Internet and computed the website confidence according to the iterative process proposed in Section 4; the results are shown in Table 3. The third column in Table 3 gives the manually labeled (artificial) website confidences. For the labeling, we trained seven experimenters, who then graded the website confidence based on their experience. The five columns are range of
Fig. 7. Analysis of experimental data.
website confidence, website confidence through manual grading, fluctuation of website confidence and deviation of website confidence, respectively. The results show that:
1) For news websites, the labeled artificial confidence is quite consistent with the minimum computed confidence, and their Pearson correlation coefficient reaches 0.966 (Fig. 7).
2) For blog sites, the labeled artificial confidence is consistent with the minimum computed confidence in most cases, and the Pearson correlation coefficient, excluding one website (Bokerb), reaches 0.838.
3) For bbs sites, the Pearson correlation coefficient is as low as −0.074, which is not encouraging.
For the third class of website (bbs), our analysis suggests that the human judgment is pessimistic because people always take all information into account, whereas our computation is based on only part of that information. Desultory information is often found in bbs and rarely in news websites or blog sites, and we mine only the part of the data related to web events, so there are some amplified emotions that we cannot control. There must therefore be some features in the computation of website confidence that we did not consider. Besides, the labeled results can serve as a reference but cannot be fully relied upon. On careful observation, the labeled results show the same trend as the computed results, which supports the effectiveness of our algorithm. From the above experimental results and analysis, we draw the following conclusions:
Conclusion 1: News websites have high confidence and low fluctuation; that is, news websites are stable, and the calculated results are quite consistent with human cognition. Therefore, information on news websites is more veracious.
Conclusion 2: The confidence of blog sites is relatively low and fluctuates more than that of news sites; that is, their performance is not very stable. The calculated results are consistent with human cognition to some degree. Hence, information on blog sites is less veracious.
Conclusion 3: The confidence of bbs fluctuates strongly; that is, its performance is not stable. The calculated results do not accord well with human cognition. As a result, information on bbs is the least veracious.
After this series of analyses, we conclude that the confidence of news websites and blog websites is in line with human cognition, while the confidence of bbs has some error because people are emotional beings. The above conclusions confirm the correctness of the website confidence computation algorithm and also shed light on our future study.

6.2.3.2. Illustration of website confidence. Fig. 8 displays the confidence of bokee over a period of time, which involves more than 100 web events, including 900,000 webpages, with a life length of about 30–40 days. As time goes on, all the event features, such as attribute uncertainty, webpage uncertainty and classified webpage uncertainty, evolve, and the website confidence shifts correspondingly. The horizontal coordinate is time and the vertical coordinate is the website confidence of "bokee". Fig. 8 illustrates that, over time, the 100 mined events evolve considerably and the corresponding website confidence varies. As can be seen from Fig. 8, the calculated bokee confidence ranges over 0.74–0.99, while the confidence given by human cognition is 0.67, so the calculated result is relatively high. From Table 3, its fluctuation is 0.069, which compares unfavorably with other websites.
Fig. 8. The confidence distribution of “bokee” over time.
Table 3. Website confidences of the 23 website sources.

| Source | Range | M | Fluctuation | D | Type | Website url |
| Sohu | [0.86, 0.93] | 0.83 | 0.014 | 0.07 | News | http://www.sohu.com/ |
| Sina | [0.88, 0.95] | 0.85 | 0.017 | 0.07 | News | http://www.sina.com.cn/ |
| Ifeng | [0.84, 0.91] | 0.88 | 0.02 | 0.07 | News | http://www.ifeng.com/ |
| 163 | [0.85, 0.94] | 0.81 | 0.02 | 0.09 | News | http://www.163.com/ |
| Xinmin | [0.80, 0.99] | 0.77 | 0.041 | 0.19 | News | http://www.xinmin.cn/ |
| 19lou | [0.70, 0.99] | 0.61 | 0.07 | 0.29 | News | http://www.19lou.com/ |
| yangzhi8 | [0.72, 0.99] | 0.65 | 0.081 | 0.27 | News | http://www.yangzhi8.com.cn/ |
| mpaper366 | [0.66, 0.99] | 0.60 | 0.112 | 0.33 | News | http://www.mpaper366.com/ |
| Bokerb | [0.92, 0.99] | 0.63 | 0.015 | 0.07 | Blog | http://www.bokerb.com/ |
| Weibo | [0.77, 0.86] | 0.71 | 0.018 | 0.09 | Blog | http://weibo.com/ |
| Focus | [0.71, 0.97] | 0.68 | 0.045 | 0.26 | Blog | http://sh.focus.cn/ |
| Blogbus | [0.69, 0.99] | 0.66 | 0.051 | 0.3 | Blog | http://www.blogbus.com/ |
| Bokee | [0.74, 0.99] | 0.67 | 0.069 | 0.25 | Blog | http://www.blogbus.com/ |
| Tieba.Baidu | [0.86, 0.93] | 0.63 | 0.015 | 0.07 | BBS | http://tieba.baidu.com |
| Tianya | [0.84, 0.92] | 0.65 | 0.018 | 0.08 | BBS | http://www.tianya.cn/ |
| China | [0.88, 0.96] | 0.65 | 0.018 | 0.08 | BBS | http://club.china.com/ |
| Mop | [0.76, 0.90] | 0.61 | 0.035 | 0.14 | BBS | http://www.mop.com/ |
| Douban | [0.78, 0.95] | 0.74 | 0.036 | 0.17 | BBS | http://www.douban.com/ |
| RenRen | [0.78, 0.94] | 0.65 | 0.037 | 0.16 | BBS | http://www.renren.com/ |
| Kaixin001 | [0.77, 0.96] | 0.63 | 0.048 | 0.19 | BBS | http://www.kaixin001.com/ |
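The correlation reported for news websites can be checked against the news rows of Table 3. The sketch below is illustrative only: it assumes that the "minimum computed confidence" is the lower bound of the Range column and that the labeled confidence is the M column, and the helper name `pearson` is hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# News rows of Table 3: Sohu, Sina, Ifeng, 163, Xinmin, 19lou, yangzhi8, mpaper366.
min_computed = [0.86, 0.88, 0.84, 0.85, 0.80, 0.70, 0.72, 0.66]  # lower bound of Range
labeled = [0.83, 0.85, 0.88, 0.81, 0.77, 0.61, 0.65, 0.60]       # column M

r_news = pearson(min_computed, labeled)
print(round(r_news, 3))
```

Under these assumptions the value agrees with the coefficient of 0.966 quoted for news websites in the statistical results.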
6.2.4. Experimental analysis of web event uncertainty

According to the descriptions in Sections 3–5, the veracity of a web event via uncertainty is influenced by the attributes distribution on webpages and the classified webpages distribution on websites. This section analyzes the experimental results of the veracity of a web event via all the event features. Fig. 9³ shows the uncertainty of the web event "tapping phone" at different times. The horizontal coordinate is time and the vertical coordinate is the web event uncertainty. The veracity of this event is inverse to its uncertainty. The calculated results in Fig. 9 are ordered by day. The third day corresponds to July 10, 2012, when "World Service" published its last issue; on July 12, "Sunday Times" and "The Sun" fell deep into the eavesdropping mire; on July 15, chief executive Brooks resigned from Murdoch's news group. The uncertainty of this web event is high and its veracity is low at the beginning of the evolution, and the veracity then increases over the following period. Over the whole process, the two tendencies agree. There are other examples in our experiments; here we analyze only one of them. The results shown in Fig. 9 agree with the social phenomenon. When the event is in the outbreak state, the corresponding web event veracity is low. When the event enters the decay state, the corresponding web event veracity is high. The results accord with human cognition.

Fig. 9. The uncertainty distribution of web event “tapping phone”.

3 http://en.wikipedia.org/wiki/News_of_the_World_phone_tapping_scandal.

7. Conclusions

This article has modeled how to compute the veracity of a web event via uncertainty, based on the distribution of its features over websites of different confidence, whose data is big. Some web events have positive effects by facilitating public supervision, while others have negative effects, such as the MeiLing Guo event. Our work sheds light on responding to and understanding popular or emergency web events in a timely manner for individuals and social groups. We propose a method to measure web event veracity by calculating web event uncertainty. The main contributions of this paper are as follows. Some event features that influence the measurement of veracity during the evolution of a web event are mined to calculate the veracity of the web event. Relations among features play an important role in computing the veracity of a web event; these relations include the distribution of attributes on webpages and of classified webpages on websites, and iterative processes are employed to handle them. Matrix operations are carried out to confirm that the result of the iteration process is consistent with the mathematical model. Finally, experiments are conducted based on the above analysis, and the results show that the iterative algorithm is promising. In all, this paper analyzed web events from their event features to the relations among these features, and all the analysis contributes to the computation of web event veracity.

Acknowledgements

This work was supported in part by the National Science Foundation of China under Grant nos. 91024012, 91324005 and the Open Project Program of the State Key Laboratory of Mathematical Engineering and Advanced Computing.

References

Allan, J., 2000. Topic Detection and Tracking Event-Based Information Organization. Kluwer, Norwell, MA.
Allan, J., Carbonell, G., Doddington, G., Yamron, J., Yang, Y., 1998a. Topic detection and tracking pilot study final report. In: Proceedings of the Broadcast News Transcription and Understanding Workshop.
Allan, J., Papka, R., Lavrenko, V., 1998b. On-line new event detection and tracking. In: Proc. 21st Annu. Int. ACM SIGIR Conf. Res. Development Inf. Retrieval, Melbourne, Australia, 1998, pp. 37–45.
Chen, C.C., Chen, Y., Chen, M.C., 2007. An aging theory for event lifecycle modeling. IEEE Trans. Syst. Man Cybern. A Syst. Hum. 37 (March (2)), 237–248.
Dean, J., Ghemawat, S., 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51 (1), 107–113.
Haddow, D., Bullock, A., Coppola, P., 2010. Introduction to Emergency Management.
Jin, X., Spangler, S., Ma, R., Han, J., 2010. Topic initiator detection on the World Wide Web. In: Proceedings of the 19th International Conference on World Wide Web, pp. 481–490.
Jo, Y., Lagoze, C., Lee Giles, C., 2007. Detecting research topics via the correlation between graphs and texts. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 370–379.
Kenneth, C., Yang, C., 2007. Factors influencing Internet users’ perceived credibility of news-related blogs in Taiwan. Telemat. Inform. 24 (May (2)), 69–85.
Li, F., Lau, R., 2011. Emerging technologies and applications on interactive entertainments. J. Multimedia 6 (2), 107–114.
Li, Q., Lau, R., Leung, E., Li, F., Lee, V., Wah, B., Ashman, H., 2009. Emerging internet technologies for e-learning. IEEE Inter. Comput. 13 (July (4)), 11–17.
Makkonen, J., 2003. Investigation on event evolution in TDT. In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language, pp. 43–48.
Mei, Q., Zhai, C., 2005. Discovering evolutionary theme patterns from text: an exploration of temporal text mining. In: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pp. 198–207.
Metzger, M.J., 2000. An Overview of Internet Credibility. Department of Communication, University of California Santa Barbara http://www.eomm.ucsb. ed u/mmetzger flash.htm
Nadine Wathen, C., Burke, J., 2002. Believe it or not: factors influencing credibility on the web. J. Am. Soc. Inform. Sci. Technol. 53 (2), 134–144.
Nallapati, R., Feng, A., Peng, F., Allan, J., 2004. Event threading within news topics. In: Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, pp. 446–453.
Ng, B., Li, F., Lau, R., Si, A., Siu, A., 2003. A performance study of multi-server systems for distributed virtual environments. Inform. Sci. 154 (1), 85–93.
Ng, B., Lau, R., Si, A., Li, F., 2005. Multi-server support for large scale distributed virtual environments. IEEE Trans. Multimedia 7 (6), 1054–1065.
Salton, G., Yang, C.S., 1973. On the Specification of Term Values in Automatic Indexing. Cornell University.
Wei, C., Chang, Y., 2007. Discovering event evolution patterns from document sequences. IEEE Trans. Syst. Man Cybern. A Syst. Hum. 37 (March (2)), 273–283.
Yang, C., Shi, X., 2006. Discovering event evolution graphs from newswires. In: Proceedings of the 15th International Conference on World Wide Web, pp. 945–946.
Yang, C., Shi, X., Wei, C., 2009. Discovering event evolution graphs from news corpora. IEEE Trans. Syst. Man Cybern. A 39 (4), 850–863.
Yin, X., Han, J., Yu, P.S., 2008. Truth discovery with multiple conflicting information providers on the web. IEEE Trans. Knowl. Data Eng. 20 (6), 796–808.
Zhang, X., Liu, C., Nepal, S., Yang, C., Dou, W., Chen, J., 2014. A hybrid approach for scalable sub-tree anonymization over big data using MapReduce on cloud. J. Comput. Syst. Sci. 80, 1008–1020.
Zhang, X., Yang, T., Liu, C., Chen, J., 2014. A scalable two-phase top-down specialization approach for data anonymization using MapReduce on cloud. IEEE Trans. Parallel Distrib. Syst. 25 (2), 363–373.
Xinzhi Wang received the Bachelor’s degree from Shanghai University, Shanghai, China, in 2012, where she is currently working toward the Master’s degree in the School of Computers. Her main research interests include topic detection and tracking, sentiment analysis and web mining.
Xiangfeng Luo received the Master’s and Ph.D. degrees from the Hefei University of Technology, Hefei, China, in 2000 and 2003, respectively. He is a Professor with the School of Computers, Shanghai University, Shanghai, China. He was a Postdoctoral Researcher with the China Knowledge Grid Research Group, Institute of Computing Technology, Chinese Academy of Sciences, from 2003 to 2005. He has authored or coauthored more than 50 publications, and his publications have appeared in the IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING, IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, and IEEE TRANSACTIONS ON LEARNING TECHNOLOGY. His main research interests include web wisdom, cognitive informatics, and text understanding. Dr. Luo has served as a Guest Editor of ACM Transactions on Intelligent Systems and Technology. He has also served on the committees of a number of conferences and workshops, including as Program Co-Chair of the International Conference on Web-based Learning (ICWL 2010) (Shanghai), the International Conference on Web Information Systems and Mining (WISM 2012) (Chengdu), and the International Workshop on Cognitive-based Text Understanding and Web Wisdom (CTUW 2011) (Sydney), and as a PC member of more than 40 conferences and workshops.
Huimin Liu received the Bachelor’s and Master’s degrees from Shanghai University, Shanghai, China, in 2010 and 2013, respectively. He is currently working for the data center of the Agricultural Bank of China. His main research interests include topic detection and tracking and web mining.