Measuring the veracity of web event via uncertainty



G Model

ARTICLE IN PRESS

JSS-9355; No. of Pages 11

The Journal of Systems and Software xxx (2014) xxx–xxx

Contents lists available at ScienceDirect

The Journal of Systems and Software journal homepage: www.elsevier.com/locate/jss

Measuring the veracity of web event via uncertainty

Xinzhi Wang a, Xiangfeng Luo a,b,∗, Huiming Liu a

a School of Computer Engineering and Science, Shanghai University, Shanghai, China
b State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi, China

Article info

Article history:
Received 25 February 2014
Received in revised form 16 June 2014
Accepted 10 July 2014
Available online xxx

Keywords:
Web event veracity
Topic detection and tracking
Big data

Abstract

Web events, whose data constitute one kind of big data, have attracted considerable interest in recent years. However, most existing work fails to measure the veracity of web events. In this research, we propose an approach that measures the veracity of a web event via its uncertainty, based on the distribution of its features over websites of different confidence. First, the proposed approach mines from the web event data various event features that may influence the measurement of uncertainty. Second, a computational model is introduced to simulate how these features influence the evolution of the web event. Third, matrix operations are used to facilitate practical computation. Finally, experiments are conducted based on the above analysis, and the results show that the proposed uncertainty measuring algorithm is promising for measuring the veracity of web events in big data. © 2014 Elsevier Inc. All rights reserved.

1. Introduction

Today, big data is attracting more and more attention. The challenges and opportunities of the big data era are defined by five Vs, i.e., volume (amount of data), velocity (speed of data in and out), variety (range of data types and sources), value (desirable quality of data) and veracity (trustworthiness of various data).¹ The volume of big data is massive, and the main technologies for storing it include distributed caches, distributed databases and distributed file systems (Zhang et al., 2014; Zhang X and Liu C et al., 2014). Big data is also time-sensitive. This property, called velocity, refers to the speed at which new data is generated and the speed at which data moves around; the main methods for meeting this challenge include MapReduce (Dean and Ghemawat, 2008) and concurrency control. Moreover, big data comes as text, video, audio, webpages, streams and even aggregations of these, which is the meaning of variety. Another V to take into account is value, which matters because users expect to benefit from any attempt to collect and leverage big data. The last V, veracity, refers to the trustworthiness of big data, which arrives in large amounts, at high speed, in many forms and with uneven quality. We can safely argue that 'veracity' is the most important

∗ Corresponding author at: School of Computer Engineering and Science, Shanghai University, 333 Nanchen Road, Baoshan District, Shanghai 200444, China. Tel.: +86 13817505710. E-mail addresses: [email protected] (X. Wang), [email protected] (X. Luo), [email protected] (H. Liu). 1 http://staff.science.uva.nl/∼demch/////presentations/sne2013-01-03-bigdata5V-infra-v04.pdf.

V of big data for gaining ultimate and valid value. This paper aims to measure the veracity of web data, which is one challenge in bridging users and big data. A web event is a story or a scandal that occurs in society or on the web and is reflected by a series of associated web pages over time. It is hard to measure web event veracity because a large volume of varied web events happens all the time. However, the veracity of a web event has to be assessed in order to obtain trustworthy and valuable information. During its evolution, a social event is deeply affected by the corresponding web information, one manifestation of which is uncertainty. In other words, monitoring the veracity of a web event via uncertainty can help users understand social events. Generally, events with high uncertainty are more likely to turn into popular or emergent events (Haddow et al., 2010). Here the uncertainty of a web event is determined by the distribution of its features over websites of different confidence, instead of the entropy used in informatics. For instance, if an event has highly uncertain features distributed on low-confidence websites, then the event is more likely to be fake, namely its veracity is low, and vice versa. This research therefore employs multifactor-based uncertainty to measure the veracity of web events. However, it is difficult to analyze manually the veracity of the large volume of web events as they evolve on the web, because doing so costs a great deal of time and energy. So, to understand and quickly respond to these web events, it is necessary to measure web event veracity automatically. For instance, a Chinese girl named MeiLing Guo showed off her luxurious life on the web and claimed to be one of the managers of the Red Cross in China. This message was noticed by netizens, who suspected corruption in the Red Cross. The event ended with an administrative reorganization of the Red Cross and resulted in a sharp

http://dx.doi.org/10.1016/j.jss.2014.07.023 0164-1212/© 2014 Elsevier Inc. All rights reserved.

Please cite this article in press as: Wang, X., et al., Measuring the veracity of web event via uncertainty. J. Syst. Software (2014), http://dx.doi.org/10.1016/j.jss.2014.07.023


decline in donations, which had a bad influence on the Red Cross of China. This web event had a great impact on our society. The veracity of this event was low at first, when it rested on a single doubtful microblog post, but rose in the end as various guesses emerged. The veracity of a web event thus varies during its evolution. One manifestation of the veracity of a web event is its uncertainty and the other is website confidence; both are determined by factors such as the distribution of webpages over websites and the distribution of attributes over webpages. In this paper, we first propose an approach for mining several event features from the webpages of a web event. Then, a computational model is introduced to simulate how these features influence the evolution of the web event. Furthermore, matrix operations are used to facilitate practical computation. Last but not least, experiments are conducted based on the above analysis, and the results show that the proposed model is promising for measuring the veracity of a web event via its uncertainty. The rest of this paper is organized as follows. Related work is introduced in Section 2. Basic terms and event features are defined in Section 3. The veracity measurement of web events via uncertainty is introduced in Section 4. The relation between the measuring processes and matrix operations is discussed in Section 5. Section 6 presents experiments and analysis. Conclusions are given in the last section.

2. Related work

Calculating the veracity of web events is a tough task because web event data are characterized by volume, variety and velocity. Moreover, the veracity of a web event, one manifestation of which is uncertainty, varies during the event's evolution. However, few researchers have contributed directly to measuring web event veracity or uncertainty. In the research area of topic detection and tracking (TDT) (Jin et al., 2010; Yin et al., 2008; Yang et al., 2009; Makkonen, 2003), some aspects of web events are measured during the evolution process, such as detecting unknown events and segmenting event information, but web event uncertainty is not considered. Nevertheless, TDT technologies lay some foundations for our work, namely measuring the veracity of web events via uncertainty.

In the research area of TDT, Allan et al. (1998a,b) calculated the similarity among news items and classified them into the corresponding events. If a piece of news is not similar to any previous event, it is considered a new event. Chen et al. (2007) put forward the time-hardening theory, which models the event life cycle based on the time relations among documents. Wei and Chang (2007) proposed a technique for building an event evolution model, which finds frequent event sections and the temporal relations of an event in order to construct frequent patterns. Here, an event section refers to a sub-event or an event stage. Yang and Shi (2006) proposed a method for extracting an evolutionary relationship diagram from news reports; the diagram reflects the relationships among web events, or between an event and its sub-events, and thereby displays the structure of web events. Some other aspects (Li et al., 2009; Li and Lau, 2011; Ng et al., 2003, 2005) have also been discussed.

In general, TDT techniques (Allan, 2000; Mei and Zhai, 2005; Jo et al., 2007; Nallapati et al., 2004) attempt to detect unknown events and cluster the corresponding news reports into their events. TDT also involves tracking events, but it does not involve measuring the veracity of a web event during its evolution, so it is hard to give users a global and clear understanding of the web event. This paper presents a method for calculating the veracity of a web event by taking advantage of web event uncertainty. It also discusses the factors that may influence the uncertainty of a web event, such as attribute uncertainty, webpage

Fig. 1. Interaction among users, web events, websites, webpages and event attributes.

uncertainty, website confidence, event classified webpage uncertainty, attribute distribution on webpages, and event classified webpage distribution on websites. On the other hand, Salton and Yang (1973) and Kenneth and Yang (2007) analyze many factors affecting website confidence. In addition, different types of websites have different confidence (Nadine and Burke, 2002). Traditional media websites, such as CNN, New York Times, and the Wall Street, always have higher confidence than other websites. Metzger (2000) indicates that website confidence is dynamic and uncertain, which makes it hard to evaluate clearly. The attribute distribution on webpages and the classified webpage distribution on websites have great influence on web event uncertainty as well as on website confidence. In this paper, we define a number of web event features, then establish a mathematical model for computing web event veracity via its uncertainty, obtain the web event veracity through an iterative computational model, and finally carry out matrix operations and experiments, which show that the iterative algorithm is promising for computing web event uncertainty.

3. Basic terms and event features

As discussed in Section 1, to measure the veracity of a web event via uncertainty, it is necessary to study the features of the web event during its evolution. Before discussing web event features, we define some basic terms that help in understanding the veracity measuring process.

3.1. Terms definitions

Normally, users pay attention to web events when they surf the internet. These web events are presented by webpages, which are distributed over different websites and contain abundant event attributes. The whole chain that influences the measurement of event veracity is depicted in Fig. 1. The five levels are users, web events, websites, webpages and event attributes, respectively. And the arrows describe the relations among


Fig. 2. Relations among webpages and their attributes for one web event. (In the figure, ϕi is the classified webpage tagged with i; wsi is the ith website related to web event e; ki is the ith attribute belonging to an event.)

different levels. This section defines some basic terms and analyzes the event features based on the chain shown in Fig. 1.

Definition 1 (Uncertainty of attributes, $es(\lambda_{ij})$). The uncertainty of attribute $\lambda_{ij}$ is the ability of the attribute to connect with others; it is also related to the confidence of the websites containing the attribute. Here, $\lambda_{ij}$ is the jth attribute in the ith website. Attributes are keywords mined using TF-IDF (Li et al., 2009), and their relations are obtained by the Apriori algorithm. If an attribute connects to many other attributes and appears in many low-confidence websites, then the attribute has high uncertainty, and vice versa. Therefore, a web event containing such an attribute has low veracity.

Definition 2 (Uncertainty of webpage, $c(\psi_{ij})$). The uncertainty of webpage $\psi_{ij}$ is proportional to the uncertainty of its attributes and is a generalized uncertainty of those attributes. Here, $\psi_{ij}$ is the jth webpage from the ith website. One webpage can be denoted as $\psi_{ij} = \{\lambda_{i1}, \lambda_{i2}, \ldots, \lambda_{im}\}$, and its weight vector can be denoted by $\vec{\psi}_{ij} = (w_{i1}, w_{i2}, \ldots, w_{im})^T$, where $w_{im}$ is the weight of $\lambda_{im}$, which can be mined according to TF-IDF (Li et al., 2009). If a web event appears in many uncertain webpages, then its veracity is low, and vice versa.

Definition 3 (Uncertainty of classified webpage, $c(\varphi_i)$). The uncertainty of a classified webpage $\varphi_i$ implies the uncertainty of a subordinated web event, namely a sub-event, which is distributed over several websites. Here, $\varphi_i$ is the ith classified webpage belonging to a specific web event. A classified webpage of a web event is a set of webpages that are highly similar to each other and that describe one specific subordinated web event. We let $\varphi_i = \{\psi_{ij}, \ldots, \psi_{nm}\}$ denote a classified webpage collection in a specific period. As with webpages, if a classified webpage of a web event is distributed on low-confidence websites, then it has high uncertainty and the web event has low veracity, and vice versa.
As a classified webpage corresponds to a subordinated web event, we use similarity to obtain classified webpages. The similarity among the webpages of a web event can be denoted as $I$:

\[
I = \begin{pmatrix}
a_{11} & \cdots & a_{1n} \\
\vdots & \ddots & \vdots \\
a_{n1} & \cdots & a_{nn}
\end{pmatrix}
\tag{1}
\]

where $a_{ij} = (\vec{\psi}_{ab} \cdot \vec{\psi}_{cd}) / (|\vec{\psi}_{ab}|\,|\vec{\psi}_{cd}|)$, $i = a + b$, $j = c + d$, and $a_{ij}$ is the similarity between webpages $\psi_{ab}$ and $\psi_{cd}$. Webpages with high similarity are merged into one classified webpage corresponding to one subordinated web event. To avoid ambiguity, we call an attribute $k_i$ after the combination. The whole process is shown in Figs. 2 and 3.

Fig. 3. Relations among event attributes and webpages belonging to one website. (In the figure, $\psi_{ij}$ is the webpage tagged with j in the ith website of a specific web event; $ws_i$ is the ith website related to web event e; $\lambda_{ij}$ is one attribute belonging to $ws_i$.)

Definition 4 (Website confidence, $T(ws_i)$). The confidence of a website $ws_i$ is the authority degree of the website. Here, $ws_i$ is the ith website related to a specific web event. It can be formalized as $T(ws_i) = \langle c(\psi_{ij}), c(\varphi_i) \rangle$. A website is a carrier of a web event and its attributes. Its confidence is influenced by two factors. One is the uncertainty of its own webpages $c(\psi_{ij})$, and the other is the uncertainty of the classified webpages (subordinated web events) $c(\varphi_i)$. If the information of a web event is distributed on high-confidence websites, then the veracity of the web event is high, and vice versa.

Definition 5 (Uncertainty of web event, $uc(e)$). The uncertainty of web event e is the ability of the event to derive other events during a period of time, which can be formalized as $uc(e) = \langle es(\lambda_{ij}), c(\psi_{ij}), c(\varphi_i) \rangle$, where $es(\lambda_{ij})$ is the uncertainty of attributes $\lambda_{ij}$, $c(\psi_{ij})$ is the uncertainty of webpages $\psi_{ij}$, and $c(\varphi_i)$ is the uncertainty of event classified webpages $\varphi_i$. $uc(e)$ is significantly affected by the uncertainty or confidence of the above event features. The uncertainty of attributes, webpages and classified webpages, together with the confidence of websites, determines the uncertainty of the web event and hence its veracity. The more uncertain a web event is, the less veracious it is, so we can obtain the veracity of a web event via its uncertainty, based on the distribution of its features over websites of different confidence.

Definition 6 (Veracity of web event, $ver(e)$). The veracity of a web event is the trustworthiness of the web event during its evolution. It is inversely related to the uncertainty of the web event: $ver(e) = 1 - uc(e)$. If highly uncertain words of a web event are distributed on highly uncertain webpages and low-confidence websites, and the related classified webpages are uncertain, then the web event has high uncertainty and its veracity is low (the event may include some fake information). Conversely, if the features (attributes, webpages, classified webpages) are certain and distributed on confident websites, then the event is veracious with low uncertainty (the event contains a lot of true information).
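The similarity matrix of Eq. (1) and the merging of similar webpages can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each webpage is assumed to be given as a TF-IDF weight vector over a shared attribute vocabulary, and the names `similarity_matrix`, `merge_similar` and the cut-off `threshold=0.8`, as well as the single-link grouping rule, are hypothetical choices that the text does not specify.

```python
import numpy as np

def similarity_matrix(weights):
    """Cosine similarity between the rows of `weights` (one weight vector per webpage)."""
    w = np.asarray(weights, dtype=float)
    norms = np.linalg.norm(w, axis=1)
    norms[norms == 0.0] = 1.0  # an all-zero page gets similarity 0 instead of NaN
    unit = w / norms[:, None]
    return unit @ unit.T

def merge_similar(sim, threshold=0.8):
    """Group webpages whose similarity reaches `threshold`.

    Assumption: pages are linked transitively (single-link grouping);
    the paper only says that highly similar webpages are merged.
    """
    n = sim.shape[0]
    parent = list(range(n))

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a in range(n):
        for b in range(a + 1, n):
            if sim[a, b] >= threshold:
                parent[find(b)] = find(a)

    groups = {}
    for page in range(n):
        groups.setdefault(find(page), []).append(page)
    return sorted(groups.values())

# Three webpages over a three-attribute vocabulary; pages 0 and 1 point the same way.
pages = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
sim = similarity_matrix(pages)
classified = merge_similar(sim)
```

Each resulting group plays the role of one classified webpage, i.e. one candidate sub-event.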

3.2. Event features

This section discusses some characteristics of the terms and shows how to process them for measuring the veracity of a web event via uncertainty. As shown in Fig. 2, a series of websites provides many classified webpages along with many event attributes. This means that website confidence and classified webpage uncertainty affect each other


according to the classiﬁed webpages distribution on websites. It is deﬁned as follows:

\[
\chi = \begin{pmatrix}
\chi_{11} & \cdots & \chi_{1n} \\
\vdots & \ddots & \vdots \\
\chi_{v1} & \cdots & \chi_{vn}
\end{pmatrix}
\tag{2}
\]

where $\chi_{ij} = 1$ if elements of $\varphi_j$ appear in $ws_i$, and $\chi_{ij} = 0$ otherwise. Namely, $\chi_{ij}$ denotes whether the ith website involves the jth classified webpage or not. A web event has only one distribution of classified webpages on websites. If the distribution is built on uncertain classified webpages and low-confidence websites, then the veracity of the web event is low. In the computing process, this distribution is treated as one of the web event features, namely the distribution of classified webpages on websites, denoted as $\chi$. Fig. 3 shows that a website provides many webpages with a number of event attributes, and these attributes are distributed on webpages covering various events. This distribution implies that the confidence of a website is affected by the uncertainty of its webpages, while webpage uncertainty and attribute uncertainty affect each other according to the distribution of attributes on webpages. Moreover, the veracity of the web event is influenced by this distribution. We define it as the distribution of event attributes on webpages $\theta^i$, which is also an event feature. Herein, we use $\theta^i$ to describe the distribution of event attributes on the webpages of the ith website during a period of time.

\[
\theta^i = \begin{pmatrix}
\theta^i_{11} & \cdots & \theta^i_{1m} \\
\vdots & \ddots & \vdots \\
\theta^i_{n1} & \cdots & \theta^i_{nm}
\end{pmatrix}
\tag{3}
\]

where

\[
\theta^i_{mn} = \begin{cases}
1, & \lambda_{in} \in \psi_{im} \\
0, & \lambda_{in} \notin \psi_{im}
\end{cases}
\]

denotes whether the nth event attribute exists in the mth webpage or not.

From the description above, we find that a webpage has two labels besides its attributes. One is where the webpage comes from (which website it belongs to); the other is what the webpage can offer (which classified webpage it belongs to). The distribution of attributes on webpages makes attribute uncertainty and webpage uncertainty closely related to each other. In the same way, the distribution of classified webpages on websites makes the uncertainty of classified webpages and the website confidence closely related to each other. How the two distributions serve as clues for measuring event uncertainty will be discussed in Section 4. After the uncertainty of a web event is obtained, its veracity can be determined.

By now, we have the website $ws_i$, webpage $\psi_{ij}$, classified webpage $\varphi_i$, event attribute $\lambda_{ij}$, the distribution of attributes on webpages and the distribution of classified webpages on websites. All of these event features are obtained from the webpages and websites. Some user-side features could also be obtained, for example from Google Trend (http://trends.google.com). However, users' features are somewhat delayed in time, so in this paper we consider the webpages of the web event alone. In the next section, we employ a computational model to analyze how the event features influence the veracity measuring process of a web event. In fact, all the features play different roles according to the two distributions. More details will be discussed in the next section.
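The two 0/1 distributions of Eqs. (2) and (3) can be built directly from membership information. The sketch below is only an illustration: the function names, the use of Python sets, and the toy page and attribute identifiers are assumptions introduced here, not part of the paper.

```python
import numpy as np

def classified_distribution(site_pages, classified_pages):
    """chi[i, j] = 1 if website i hosts at least one webpage of classified webpage j."""
    chi = np.zeros((len(site_pages), len(classified_pages)), dtype=int)
    for i, pages in enumerate(site_pages):
        for j, group in enumerate(classified_pages):
            if set(pages) & set(group):
                chi[i, j] = 1
    return chi

def attribute_distribution(pages, attributes):
    """theta[m, n] = 1 if attribute n occurs in webpage m (all pages from one website)."""
    return np.array([[1 if a in page else 0 for a in attributes] for page in pages])

# Hypothetical toy data: two websites, three webpages, two classified webpages.
site_pages = [{"p1", "p2"}, {"p3"}]
classified_pages = [{"p1", "p3"}, {"p2"}]
chi = classified_distribution(site_pages, classified_pages)

# Attributes found on the two webpages of the first website.
attributes = ["donation", "charity", "luxury"]
theta_1 = attribute_distribution([{"donation", "charity"}, {"charity"}], attributes)
```

Here rows of `theta_1` index webpages and columns index attributes, matching the reading of the where-clause of Eq. (3) in which the first subscript refers to the webpage.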

Fig. 4. Computational model of web event uncertainty.

4. Veracity measuring process of web event via uncertainty

According to the discussion in the last section, computing web event veracity via uncertainty is influenced by multiple factors, such as attribute uncertainty, webpage uncertainty, website confidence and the distributions among these factors. The uncertainty of attributes and that of webpages influence each other according to the distribution of attributes on webpages. Based on the webpages we can get the website confidence, which lays the foundation for later computation. Furthermore, the website confidence and the classified webpage uncertainty influence each other according to the distribution of classified webpages on websites. One classified webpage corresponds to one sub-event, so we finally obtain the veracity of the web event via uncertainty from the uncertainty of the features and the confidence of the corresponding websites through the two distributions shown in Fig. 4.

Fig. 4 shows that a web event is supported by a classified webpage set $\varphi = \{\varphi_1, \varphi_2, \ldots, \varphi_{n(\varphi)}\}$, where $n(\varphi)$ is the number of classified webpages supporting the web event. The classified webpage set interacts with the website set $ws = \{ws_1, ws_2, \ldots, ws_{n(ws)}\}$, where $n(ws)$ is the number of websites reporting the web event, according to the distribution of classified webpages on websites $\chi$. The webpage set of a web event e includes all the webpages in all websites and can be denoted as $\psi = \{\psi_1, \psi_2, \ldots, \psi_{n(ws)}\}$, where $\psi_i = \{\psi_{i1}, \psi_{i2}, \ldots, \psi_{in(\psi_i)}\}$ is the set of all webpages in the ith website belonging to the web event and $n(\psi_i)$ is the number of webpages in the ith website related to web event e. In the same way, the attribute set includes all the attributes and can be denoted by $\lambda = \{\lambda_1, \lambda_2, \ldots, \lambda_{n(ws)}\}$, where $\lambda_i = \{\lambda_{i1}, \lambda_{i2}, \ldots, \lambda_{in(\lambda_i)}\}$ is the set of all attributes in the ith website and $n(\lambda_i)$ is the number of attributes in the ith website related to web event e. The webpage set and the attribute set interact with each other according to the attribute distribution vector on webpages $\vec{\theta} = (\theta^1, \theta^2, \ldots, \theta^{n(ws)})^T$.

Based on the analysis above, the veracity computation of a web event via uncertainty can be divided into three parts:

1) How the distribution of attributes on webpages influences the veracity of the web event via uncertainty. In this part, based on the distribution of attributes on webpages, we compute the event attribute uncertainty, the webpage uncertainty and the webpage-based website confidence.
2) How the uncertainty of classified webpages and the confidence of websites influence the veracity of the web event. In this part, based on the distribution of classified webpages on websites, we measure


the uncertainty of classified webpages, each of which describes a different sub-event of the web event.
3) The veracity of the web event is computed based on the uncertainty of the classified webpages and the confidence of the corresponding websites.

4.1. Measuring of attribute uncertainty and website confidence

After the discussion of the basic event features of a web event, we introduce some deductions as guidance for measuring the veracity of a web event via its uncertainty and website confidence. These event features also make the process visible, which makes it easier to find some common rules.

Deduction 1. If all the websites support all of the event attributes, shown as $\lambda_i$ in Fig. 3, then the uncertainty of webpages $c(\psi_{ij})$ and the uncertainty of webpage attributes $es(\lambda_{ij})$ should be low, and the veracity of the web event via uncertainty reaches its highest point, since all the websites hold the same view of the event.

Deduction 2. If most of the attributes of a web event are provided by a few webpages distributed on different websites, then the uncertainty of webpages $c(\psi_{ij})$ and the uncertainty of webpage attributes $es(\lambda_{ij})$ will be high, and the veracity of the web event via uncertainty will be low.

Table 1. The event features and symbols used in our paper.

Name                                               Symbol
Web event                                          $e$
Classified webpage                                 $\varphi_i$
Event attributes                                   $k_j$
Distribution of classified webpages on websites    $\chi$
Similarity                                         $I$
Website                                            $ws_i$
Webpage in website                                 $\psi_{ij}$
Webpage attributes                                 $\lambda_{ij}$
Distribution of webpages in websites               $\theta^i$
Web event veracity                                 $ver(e)$
Web event uncertainty                              $uc(e)$
Webpage uncertainty                                $c(\psi_{ij})$
Classified webpage uncertainty                     $c(\varphi_i)$
Website confidence                                 $T(ws_i)$
Webpages based website confidence                  $t(ws_i)$
Uncertainty of webpage attributes                  $es(\lambda_{ij})$

Based on Deductions 1 and 2, which analyze how attributes and webpages contribute to the veracity of a web event via uncertainty through the distribution of event attributes on webpages $\theta^i$, we choose an iterative computational process to compute the veracity of the web event implicitly from that distribution. At each iteration, the webpage uncertainty and the event attribute uncertainty are inferred from each other as follows:

\[
c(\psi_{ij}) = \frac{1}{\sum_{n=1}^{n(\lambda_i)} \theta^i_{jn}} \sum_{n=1}^{n(\lambda_i)} \left( es(\lambda_{in})\,\theta^i_{jn} \right)
\tag{4}
\]

where $c(\psi_{ij})$ denotes the uncertainty of webpage $\psi_{ij}$; $es(\lambda_{ij})$ denotes the uncertainty of attribute $\lambda_{ij}$; $\theta^i_{jn}$ denotes whether the attribute $\lambda_{in}$ exists in webpage $\psi_{ij}$ or not; $n$ indexes the event attributes; $n(\lambda_i)$ is the number of attributes in the ith website; and $\sum_{n=1}^{n(\lambda_i)} \theta^i_{jn}$ is the number of attributes in webpage $\psi_{ij}$. Note that webpage $\psi_{ij}$ is different from classified webpage $\varphi_i$: $\psi_{ij}$ denotes a webpage from one website, while $\varphi_i$ denotes one classified webpage supporting a subordinated web event.

Eq. (4) assigns the average uncertainty of the attributes to the uncertainty of the corresponding webpage. According to Deductions 1 and 2, the uncertainty of event attributes can be obtained from the uncertainty of the webpages they belong to. After getting the uncertainty of a webpage from Eq. (4), we can easily get its certainty degree as $1 - c(\psi_{ij})$. Then $\prod_{0 < n \le n(\psi_i)} (1 - c(\psi_{in}))^{\theta^i_{nj}}$ is the complete certainty degree of attribute $\lambda_{ij}$ and $1 - \prod_{0 < n \le n(\psi_i)} (1 - c(\psi_{in}))^{\theta^i_{nj}}$ is the uncertainty of attribute $\lambda_{ij}$:

\[
es'(\lambda_{ij}) = 1 - \prod_{0 < n \le n(\psi_i)} \left(1 - c(\psi_{in})\right)^{\theta^i_{nj}}
\tag{5}
\]

where $es'(\lambda_{ij})$ is the auxiliary variable used to get the attribute uncertainty $es(\lambda_{ij})$, and the product runs over the webpages of the ith website. The meanings of the other variables are as described above.

In order to facilitate the computation and veracity exploration, we apply the following transformations:

\[
1 - es'(\lambda_{ij}) = \prod_{0 < n \le n(\psi_i)} \left(1 - c(\psi_{in})\right)^{\theta^i_{nj}}
\]

\[
\ln\left(1 - es'(\lambda_{ij})\right) = \sum_{0 < n \le n(\psi_i)} \ln\left(\left(1 - c(\psi_{in})\right)^{\theta^i_{nj}}\right)
\tag{6}
\]

The attribute uncertainty can then be obtained as follows:

\[
es(\lambda_{ij}) = -\ln\left(1 - es'(\lambda_{ij})\right) = -\sum_{0 < n \le n(\psi_i)} \ln\left(\left(1 - c(\psi_{in})\right)^{\theta^i_{nj}}\right)
\tag{7}
\]

Up to now, we have obtained the attribute uncertainty and the webpage uncertainty using the distribution of attributes on webpages (Table 1). The webpage-based website confidence (denoted by $t(ws_i)$, which lays the foundation for computing the website confidence) can be measured by the average certainty degree of the webpages:

\[
t(ws_i) = \frac{1}{n} \sum_{q=1}^{n} \left(1 - c(\psi_{iq})\right)
\tag{8}
\]
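One sweep of the mutual update in Eqs. (4) to (8) for a single website can be sketched as below. This is a hedged illustration rather than the authors' code: the function names are hypothetical, `theta[n, j]` is assumed to be 1 when attribute j occurs in webpage n, every webpage is assumed to contain at least one attribute, and the clipping constant `eps` is an added safeguard that keeps the logarithm defined; the text itself does not state a normalization or a concrete stopping threshold, so only a single pass is shown.

```python
import numpy as np

def webpage_uncertainty(es, theta):
    """Eq. (4): each webpage gets the mean uncertainty of the attributes it contains."""
    es = np.asarray(es, dtype=float)
    theta = np.asarray(theta, dtype=float)
    return (theta @ es) / theta.sum(axis=1)

def attribute_uncertainty(c, theta, eps=1e-12):
    """Eqs. (5)-(7): returns the auxiliary es' and the attribute uncertainty es."""
    theta = np.asarray(theta, dtype=float)
    # Assumption: clip so that ln(1 - c) stays finite when a page is fully uncertain.
    c = np.clip(np.asarray(c, dtype=float), 0.0, 1.0 - eps)
    # Eq. (6): ln(1 - es'_j) = sum_n theta[n, j] * ln(1 - c_n)
    log_certainty = theta.T @ np.log1p(-c)
    es_aux = 1.0 - np.exp(log_certainty)   # Eq. (5)
    es = -log_certainty                    # Eq. (7)
    return es_aux, es

def page_based_confidence(c):
    """Eq. (8): mean certainty degree of the website's webpages."""
    return float(np.mean(1.0 - np.asarray(c, dtype=float)))

# Two webpages, three attributes; page 0 holds attributes 0 and 1, page 1 holds 1 and 2.
theta = np.array([[1, 1, 0],
                  [0, 1, 1]])
es0 = np.array([0.2, 0.4, 0.6])            # illustrative starting values
c = webpage_uncertainty(es0, theta)        # approx. [0.3, 0.5]
es_aux, es1 = attribute_uncertainty(c, theta)
t = page_based_confidence(c)               # approx. 0.6
```

Repeating the two update functions in turn, starting from uniform attribute uncertainties, would give the iteration described in the text; how the values are kept in range across iterations is left open here because the section does not specify it.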

Eq. (8) measures the webpage-based website confidence during a period of time; $n$ is the number of webpages in the corresponding website; $\psi_{iq}$ denotes the webpages provided by the website $ws_i$; and $q$ indexes the webpages in the corresponding website. As described above, we can infer the webpage-based website confidence and the uncertainty of both webpages and attributes during the iteration process. The iteration terminates when the computation reaches a stable state. As in other iterative approaches, we initialize the uncertainty of attributes with uniform values.

4.2. Measuring uncertainty of classified webpages and confidence of websites

According to the discussion above, the distribution between classified webpages and websites partly determines the veracity of a web event via uncertainty, so we give another two deductions as follows:

Deduction 3. The uncertainty of a classified webpage $c(\varphi_i)$ decreases as the website confidence $T(ws_i)$ increases. If the classified webpages of a web event are supported by many authoritative websites (i.e., high-confidence websites), the corresponding classified webpages are highly certain.

Deduction 4. The website confidence $T(ws_i)$ is affected by the uncertainty of the classified webpage $c(\varphi_i)$.


If a website supports most of the highly certain classified webpages of a web event, its confidence will be strengthened and, in return, the web event veracity will be raised. If most of the highly uncertain classified webpages are supported by one website, the website confidence will be weakened and the corresponding web event veracity will be reduced. In other words, the uncertainty of classified webpages is inversely related to web event veracity, while website confidence is positively related to it. We employ $c(\varphi_i) \propto T(ws_i)$ to get the website confidence, which contributes to the final web event veracity. According to Deductions 3 and 4, the classified webpage, which stores the information of a subordinated web event, and the website confidence determine each other. Because of this interdependency between the classified webpage uncertainty and the website confidence, we choose an iterative process to measure the uncertainty of the web event implicitly using the distribution of classified webpages on websites, which avoids having to fix the parameters of the two distributions. The classified webpage uncertainty and the website confidence are deduced from each other in the iteration process through the distribution between them. At the beginning, the value of the webpage-based website confidence $t(ws_i)$ is assigned to the website confidence $T(ws_i)$. Then the classified webpage uncertainty can be calculated by

\[
c'(\varphi_j) = \prod_{0 < i \le n(ws)} \left(1 - T(ws_i)\right)^{\chi_{ij}}
\]

\[
c(\varphi_j) = \ln\left(c'(\varphi_j)\right) = \sum_{0 < i \le n(ws)} \ln\left(\left(1 - T(ws_i)\right)^{\chi_{ij}}\right)
\tag{9}
\]

The whole process is displayed in Fig. 4. Herein, we detail the event features and show how these features play their part through the attribute distribution and the classified webpage distribution. An iteration process is employed in measuring the web event uncertainty. When all the states reach a stable state, the computing process stops. In the whole process, we focus on how the attributes change instead of modeling explicitly the distribution of attributes on webpages and of classified webpages on websites, whose parameters are hard to settle, as discussed in Section 4. After a sequence of computations, we obtain the veracity of the web event via its uncertainty using all kinds of event features through the distribution of attributes on webpages and the distribution of classified webpages on websites. Because web event uncertainty is inversely related to web event veracity, we can get the web event veracity once the web event uncertainty is known.

5. How the iteration processes work with matrix operations

From the description above we know that some event features influence each other, but it is hard to see how they affect the uncertainty computation process. Herein, we use matrix operations to analyze how the event features affect the uncertainty of a web event. Matrix representation also facilitates the calculation of the distributions. So we use matrix operations to simulate the interaction processes between the distribution of attributes on webpages and the distribution of classified webpages on websites.

In Eq. (9), $c'(\varphi_j)$ is an auxiliary variable used to obtain the classified webpage uncertainty $c(\varphi_j)$; $T(ws_i)$ is the confidence of website $ws_i$, and $(1 - T(ws_i))$ is its unauthentic degree; $n(ws)$ is the number of websites affecting the corresponding web event; and $\chi_{ij}$ indicates whether website $ws_i$ includes webpages belonging to $\varphi_j$. Note again that a classified webpage $\varphi_j$ is different from an individual webpage: the former gathers webpages from all kinds of websites, while the latter belongs to a single website. According to Deduction 4, the confidence of a website is greatly affected by its web events, so we get

$$T(ws_i) = \frac{\sum_{j=1}^{n(\varphi)} \bigl(1 - c(\varphi_j)\bigr)\,\chi_{ij}}{\sum_{j=1}^{n(\varphi)} \chi_{ij}} \tag{10}$$
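The two update rules in Eqs. (9) and (10) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: it uses the auxiliary product $c'(\varphi_j)$ directly as the uncertainty (rather than its logarithm) so that all values stay in [0, 1], and the matrix `chi`, the starting confidences `t`, and the function names are hypothetical.

```python
from math import prod

def update_uncertainty(chi, T):
    """Product form c'(phi_j) of Eq. (9) for every classified webpage j.

    Assumption: c' itself is used as the uncertainty, without the logarithm.
    """
    n_ws, n_phi = len(chi), len(chi[0])
    return [prod((1.0 - T[i]) ** chi[i][j] for i in range(n_ws))
            for j in range(n_phi)]

def update_confidence(chi, c):
    """Eq. (10): mean certainty (1 - c) over the classified webpages a website supports."""
    T = []
    for row in chi:
        support = sum(row)
        certain = sum((1.0 - cj) * x for cj, x in zip(c, row))
        # A website that supports nothing gets confidence 0 (an assumption).
        T.append(certain / support if support else 0.0)
    return T

# chi[i][j] = 1 if website i includes webpages of classified webpage j.
chi = [[1, 1],   # website 0 supports classified webpages 0 and 1
       [0, 1]]   # website 1 supports classified webpage 1 only
t = [0.5, 0.5]   # hypothetical webpage-based confidences t(ws_i)

c = update_uncertainty(chi, t)   # one pass of Eq. (9): [0.5, 0.25]
T = update_confidence(chi, c)    # one pass of Eq. (10): [0.625, 0.75]
print(c, T)
```

In this toy case the classified webpage supported by both websites ends up less uncertain (0.25) than the one supported by a single website (0.5), and the website that only supports the better-attested webpage receives the higher confidence. Repeating the two calls until `T` stops changing gives the iteration described in the text.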

Here the meanings of the website confidence $T(ws_i)$, $\chi_{ij}$ and the classified webpage uncertainty $c(\varphi_j)$ are as explained above. Under the influence of the classified webpages, the website confidence is revised until it reaches a steady value.

4.3. Measuring veracity of web event via uncertainty

To analyze how the event features inﬂuence each other and to facilitate computation, we use vectors to represent the event attribute uncertainty, webpage uncertainty and webpage based website conﬁdence as following. Since each website contains abundant event attributes, the uncertainty of attributes is denoted as following: = (es( ), . . ., es( es() 1 n(ws) )) ) = (es( ), . . ., es( )) es( i i1 iu

T

) 1 ≤ u ≤ n( i

) is the uncertainty of where n(ws) is the number of websites; es( i attributes belong to the ith website; n(i ) is the number of event attributes for each website; ij denotes the jth attribute in the ith website. Since each website contains abundant webpages, the uncertainty of the webpages is denoted as following: →), ..., c(− −→ ) = (c(− c( − 1 n(ws) ))

If all the subordinated web events of an event are uncertain, the corresponding web event is uncertain, and if none of its subordinated web events is certain, the event is the most uncertain. Deduction 5 is therefore put forward.

Deduction 5. Web event veracity ver(e) via uncertainty uc(e) can be determined by its classified webpages.

Based on Deduction 5, the veracity of a web event can be measured as follows:

$$uc(e) = \frac{1}{n}\sum_{i=1}^{n} c(\varphi_i), \qquad ver(e) = 1 - uc(e) \tag{11}$$

where $n$ is the number of classified webpages belonging to web event $e$. The meanings of the other symbols are as given earlier.

5.1. Iteration process between attributes and webpages with matrix operations

i ) = (c(i1 ), ..., c(iv ))T c(

i) 1 ≤ v ≤ n(

i ) is the number of webpages in websites; ij denotes the where n( jth attribute in the ith website. For the attributes distributions on webpages and the website conﬁdence, the donations are expressed as following: − → = (1 , . . ., n )T → = (t(ws ), . . ., (ws ))T t(− ws) n 1 denotes the webpages distribution on websites; i denotes where − → is a vector donating the ith webpage distribution on websites; t(ws) all the conﬁdence of websites. −−→ Let En×1 denotes a vector with the elements equal to 1. Details can be seen in the far left of Fig. 4.

Please cite this article in press as: Wang, X., et al., Measuring the veracity of web event via uncertainty. J. Syst. Software (2014), http://dx.doi.org/10.1016/j.jss.2014.07.023


At the beginning of the ﬁrst iteration process, we set the initial ) as es0 ( ). According to computation of Eqs. (4) and values of es( i i (7) in Section 4, we can get reference as Eq. (12).2 ) = − ∗ ln(1 − c( i )) es( i i

−−−−−→ )./( ∗ − i ) = i ∗ es( c( i i Erow(i )×1 )

(12)

−−−−−→ )./( ∗ − i ) = i ∗ es0 ( Then c 0 ( i i Erow(i )×1 ). ) can be a hidden variable, namely the two More clearly, es( i equations in Eq. (12) can be merged into one as Eq. (13). i) c n (

−−−−−−→ i ))./(i ∗ Erow( )×1 ) = −i ∗ i ∗ ln(1 − c n−1 ( i −−−−−−→ i ), Erow( )×1 ) = · · · = f1 (i , c 0 ( i

(13)

i ) can be gain after For Eq. (13) is a recursion function, c n ( i ). It is unadvisable times recursion with the starting value c 0 (

nto express the recursion function directly, so we suppose function f1 to i ) and Erow()×1 are constant, so c n ( i) denote the process. Since c 0 ( is determined by the distribution of attributes on webpage i . This means if we can ﬁgure out the rules of corresponding distribution, we can get all the uncertainties of event features. Furthermore, the conﬁdence of corresponding website can be expressed by, t(wsi )

=

1 −−−−−−→ i )) ∗ Erow( )×1 (1 − c n ( i n

1 −−−−−−→ −−−−−−→ i ), Erow( )×1 )) ∗ Erow( )×1 (1 − f1 (i , c 0 ( i i n − − − − − − → − − − − − − → i ), Erow( )×1 , Erow( )×1 ) = D1 (i , c 0 ( i i =

(14)


Corresponding to Eqs. (9) and (10), an iteration process exists between the website confidence and the classified webpage uncertainty. Based on the two equations, the following equation is given:

$$\begin{aligned} T(\vec{ws}) &= \chi * \bigl(1 - c(\vec{\varphi})\bigr) \,./\, \bigl(\chi * \vec{E}_{\mathrm{row}(\chi)\times 1}\bigr) \\ c(\vec{\varphi}) &= \chi' * \ln\bigl(1 - T(\vec{ws})\bigr) \end{aligned} \tag{16}$$

In Section 5.1 we obtained the webpage-based website confidence, and the starting value of the website confidence equals it, namely $T^0(\vec{ws}) = t(\vec{ws})$. The starting value of the classified webpage uncertainty can then be expressed by

$$c^0(\vec{\varphi}) = \chi' * \ln\bigl(1 - T^0(\vec{ws})\bigr) = \chi' * \ln\bigl(1 - t(\vec{ws})\bigr)$$

More clearly, $T(\vec{ws})$ can be eliminated after some transformations as follows:

$$c^n(\vec{\varphi}) = \chi' * \ln\Bigl(1 - \chi * \bigl(1 - c^{n-1}(\vec{\varphi})\bigr) \,./\, \bigl(\chi * \vec{E}_{\mathrm{row}(\chi)\times 1}\bigr)\Bigr) = \cdots = f_2\bigl(\chi, c^0(\vec{\varphi}), \vec{E}_{\mathrm{row}(\chi)\times 1}\bigr) \tag{17}$$

As in Eq. (17), the recursion process is denoted by $f_2$. Apart from the matrix $\chi$, all the other parameters, such as $c^0(\vec{\varphi})$ and $\vec{E}_{\mathrm{row}(\chi)\times 1}$, are constant, so the classified webpage uncertainty is determined by the distribution of classified webpages on websites, as shown in the middle of Fig. 4. As Eq. (11) shows, the uncertainty of a web event is the normalized sum of the classified webpage uncertainties, so we get the following equation:

$$uc = \frac{1}{n}\, c^n(\vec{\varphi})' * \vec{E}_{\mathrm{row}(c)\times 1} = \frac{1}{n}\, f_2\bigl(\chi, c^0(\vec{\varphi}), \vec{E}_{\mathrm{row}(\chi)\times 1}\bigr)' * \vec{E}_{\mathrm{row}(c)\times 1} \tag{18}$$
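The matrix form in Eqs. (16)–(18) maps naturally onto array operations. The NumPy sketch below is an illustration only and rests on several assumptions that are not in the paper: it exponentiates $\chi' * \ln(1 - T)$ so that the uncertainty stays in [0, 1] (the product form of Eq. (9)), it clips $1 - T$ away from zero before taking the logarithm, it requires every website to support at least one classified webpage, and the names `iterate_veracity`, `tol` and `max_iter` are hypothetical.

```python
import numpy as np

def iterate_veracity(chi, t0, tol=1e-9, max_iter=100):
    """Alternate the two lines of Eq. (16) until the confidences settle."""
    chi = np.asarray(chi, dtype=float)       # shape (n_ws, n_phi)
    ones = np.ones(chi.shape[1])             # the all-ones vector E
    T = np.asarray(t0, dtype=float)          # T^0(ws) = t(ws)

    def uncertainty(T):
        # exp(chi' * ln(1 - T)) equals the product in Eq. (9); clipping avoids ln(0).
        return np.exp(chi.T @ np.log(np.clip(1.0 - T, 1e-300, 1.0)))

    for _ in range(max_iter):
        c = uncertainty(T)
        T_new = (chi @ (1.0 - c)) / (chi @ ones)   # "./" is element-wise division
        converged = np.max(np.abs(T_new - T)) < tol
        T = T_new
        if converged:
            break

    c = uncertainty(T)
    uc = float(c @ np.ones_like(c)) / c.size       # Eq. (18): mean uncertainty
    return T, c, uc, 1.0 - uc                      # ver = 1 - uc, as in Eq. (11)

chi = [[1, 1],
       [0, 1]]
T, c, uc, ver = iterate_veracity(chi, [0.5, 0.5])
print(T, c, uc, ver)
```

Stopping on the largest change in `T` is one simple reading of "all the states come to some stable state"; other stopping rules would fit the description equally well.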

Here is the classiﬁed webpages distribution on website. FurAs Eq. (14) shows, t(wsi ) can be got using function f1 . Eq. (14) → via the and t(− thermore, we have known the value of c 0 (ϕ) ws) corresponds to Eq. (8). To facilitate observation, we use function D1 iteration process as described above, then we can get the ﬁnal to denote the result more clearly. As the uncertainty of webpage, equation like this. the website conﬁdence is also determined by distribution i . Then, all the websites conﬁdences can be show as follows. −−−−−−−→ −−−−−−−→ −−−−−−−→ −−−−−−−→ → = (D ( , c 0 ( 1 ), Erow(1 )×1 , Erow(1 )×1 ), . . ., D1 (n , c 0 ( n ), Erow(n )×1 , Erow(n )×1 )) t(− ws) 1 1 (15) −−−−−−→ −−−−−−→ c 0 ( ), Erow()×1 , Erow()×1 ) = D1 (, = (1 , 2 , . . ., n )T means the vector of attributes diswhere ) is the starting value of the webpage tribution on webpage. c 0 ( −−→ uncertainty. En×1 means a vector with the elements equal to 1 with −−−−−−−→ size of n × 1 and Erow(− → )×1 means n vectors with unit value. The −−−−−−→ same as Erow()×1 . −−−−−−→ −−−−−−→ ), Erow()×1 and Erow()×1 are constant, so the website Since, c 0 ( conﬁdences are determined by distribution . Up to now, the website conﬁdence has been discussed using matrix operations through the attributes distribution on webpages. The following content will discuss how to explain the iteration process of the uncertainty computing of web event using matrix operation. 5.2. Iteration process between classiﬁed webpages and websites with matrix operations Like last part, some vectors are introduced ﬁrst, including website conﬁdence and classiﬁed webpage uncertainty. T → = (T (ws ), . . ., T (ws T (− ws) 1 n(ws) ))

−−→ c(ϕ) = (c(ϕ1 ), . . ., c(ϕn(ϕ) ))T

2 The symbol "′" at the top right of a matrix or vector means transposition, and the symbol "." before an operator means an element-wise operation, as in Matlab; for example, [1, 2, 3] ./ [3, 2, 1] = [1/3, 2/2, 3/1].
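For readers who do not use Matlab, the two notational conventions in this footnote have direct NumPy counterparts. The short sketch below is only illustrative; the array names are arbitrary.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([3.0, 2.0, 1.0])

# Matlab "a ./ b": element-wise division, giving [1/3, 2/2, 3/1].
ratio = a / b

# Matlab "M'": transposition of a matrix.
M = np.array([[1, 2, 3],
              [4, 5, 6]])
Mt = M.T          # shape changes from (2, 3) to (3, 2)

# Matlab "M * v": matrix-vector product, written with "@" in NumPy.
row_sums = M @ np.ones(3)
print(ratio, Mt.shape, row_sums)
```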

uc

−−−−−→ −−−−−−→ − → − = f2 (, ∗ ln(1 − t(ws)), Erow()×1 ) ∗ Erow(ϕ)×1 −−−−−−→ −−−−−→ −−−−−−→ −−−−−−→ c 0 ( = f2 (, ∗ ln(1 − D1 (, ), Erow()×1 , Erow()×1 )), Erow()×1 ) ∗ Erow(ϕ)×1 −−−−−−→ −−−−−→ −−−−−−→ −−−−−−→ c 0 ( ), Erow()×1 , Erow()×1 , Erow()×1 , Erow(ϕ)×1 ) = D2 (, ,

(19)

The meanings of the factors agree with the previous descriptions. Finally, we use the function D2 to express the result compactly. It is then easy to confirm that web event veracity via uncertainty is determined by distributions: the distribution of attributes on the webpages of each website, and the distribution of classified webpages on websites. We can therefore write uc as a function D of these two distributions, where D is a brief representation of D2 in Eq. (19). That is, web event veracity via uncertainty is determined by the two distributions, which is in accordance with the veracity analysis in Section 4. In the computing process, however, iteration avoids having to settle the specific distributions.

From the computation and evaluation of the iteration process with matrix operations, we conclude that our iteration process is in line with the veracity analysis described in Section 4. The iteration process also has the advantage that it removes the need for the parameters of specific distributions. Our method employs the iteration process among a series of event


Table 2
The details of the dataset used for period detection (the life length of the 100 events is about 30–40 days).

Feature                                         Value
Average number of seeds per event               2
Average number of webpages per event            5556
Average number of event attributes per event    16,856
Average number of days per event                40
Average number of webpages per day              146
Average number of event attributes per day      469

features instead of modeling the distribution of attributes on webpages and of classified webpages on websites directly; that is, we compute the web event veracity via uncertainty through the two distributions without modeling them explicitly.

6. Experiments

This section implements the proposed web event veracity model. As discussed earlier, web event veracity is closely connected to web event uncertainty and website confidence. High uncertainty of attributes, webpages and classified webpages, together with low website confidence, yields high uncertainty and low veracity of a web event. These features influence each other through two distributions (attributes on webpages, classified webpages on websites), and a sparse distribution over uncertain webpages and unauthentic websites leads to a more uncertain and less veracious web event. The data sets and the experimental analysis are presented in this section.

6.1. Data sets

The web events in our experiments are downloaded from Baidu (http://news.baidu.com) and other news websites. The seeds of the events are extracted from baidu.com. We chose 900,000 webpages about 100 events as the experimental data set, which covers politics, accidents, disasters, terrorist attacks and so on. Table 2 shows various statistics of these data. The start timestamp is determined following Jin et al. (2010), and the end timestamp is set to the time at which we stopped crawling webpages. The average web event life length is about 30–40 days.

6.2. Experimental analysis of how the event features influence the veracity of web event via uncertainty

6.2.1. How the classified webpages distribution on the websites influences the veracity of web event via uncertainty

In Fig. 5, with the distribution of attributes on webpages fixed, we compute the uncertainty of a web event as the distribution of classified webpages on websites varies.

Fig. 6. How the website conﬁdence and the diversity inﬂuence on the web event uncertainty.

Suppose the number of edges is max(ne) when all the classified webpages are supported by all websites. The density of edges is then the number of existing edges divided by max(ne). From Fig. 5 we can see that the uncertainty of a web event is 0 and its veracity is 1 when the density of edges is near 1, while the uncertainty is high and the veracity is low when the density of edges is close to 0. The results are in accordance with Deductions 1 and 2. When the density of edges is near 1, all the subordinated web events of a web event are supported by all websites, which indicates that the event has been acknowledged by all media. In this case, the corresponding web event uncertainty is low and its veracity is high. When the density is close to 0, few websites support the related subordinated web events and the evolution direction of the event is quite uncertain, so the corresponding web event uncertainty is high and its veracity is low.

6.2.2. How the website confidence influences the web event uncertainty

We assigned the website confidences randomly as the initial values at the beginning of the iterative process, and then computed the web event uncertainty. Fig. 6 shows how the website confidence influences the web event veracity. The experimental results reveal the uncertainty distribution pattern: web event veracity via uncertainty is related not only to the website confidence but also to the diversity of webpages. Website confidence drifts slightly with the type of website (news, BBS, blog), and web event veracity is affected accordingly. Therefore, both the website confidence and the distribution of classified webpages on websites affect the web event uncertainty.

6.2.3. Experimental results about website confidence

According to the descriptions in Sections 3–5, the web event uncertainty is influenced by the website confidence, attribute uncertainty, webpage uncertainty, classified webpage uncertainty and the two related distributions. This section uses website confidence to check whether our algorithm is effective, because website confidence is easy for humans to evaluate.

Fig. 5. How the event classiﬁed webpage distribution on the websites inﬂuence the web event uncertainty.
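The edge density used on the horizontal axis of Fig. 5 can be computed directly from the website/classified-webpage support matrix. The sketch below is a minimal illustration; the function name `edge_density` and the example matrix are hypothetical, and max(ne) is taken to be the number of websites times the number of classified webpages, as the definition in Section 6.2.1 implies.

```python
def edge_density(chi):
    """Number of existing support edges divided by max(ne).

    chi[i][j] is truthy when website i supports classified webpage j.
    """
    n_edges = sum(1 for row in chi for x in row if x)
    max_edges = len(chi) * len(chi[0])   # every website supports every classified webpage
    return n_edges / max_edges

chi = [[1, 1, 0],
       [0, 1, 0]]
density = edge_density(chi)   # 3 of the 6 possible edges exist
print(density)
```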

6.2.3.1. Statistical results. We selected 23 representative websites from the Internet and computed the website confidence according to the iterative process proposed in Section 4; the results are shown in Table 3. The third column in Table 3 gives the manually labeled (artificial) website confidences. For the labeling, we trained seven experimenters, who then graded the website confidence based on their experience. The five columns are range of


Fig. 7. Analysis of experimental data.

website confidence, website confidence through manual grading, fluctuation of website confidence and deviation of website confidence, respectively. The results show that:

1) For news websites, the manually labeled confidence is quite consistent with the minimum computed confidence, and their Pearson correlation coefficient reaches 0.966 (Fig. 7).
2) For blog sites, the manually labeled confidence is consistent with the minimum computed confidence in most cases, and the Pearson correlation coefficient, excluding one website (Bokerb), reaches 0.838.
3) For BBS sites, the Pearson correlation coefficient is as low as −0.074, which is not encouraging.

For the third class of website (BBS), our analysis suggests that human graders take all available information into account, whereas our computation is based only on the part of the information that we mined. BBS sites often contain disorganized information that rarely appears on news websites or blog sites, and because we mine only part of the web-event-related data, there are some amplified emotions that we cannot control. There must therefore be some features relevant to the computation of website confidence that we did not consider. Besides, the labeled results can serve as a reference but cannot be fully relied upon. On careful observation, the labeled results show the same trend as the computed results, which supports the effectiveness of our algorithm.

According to the above experimental results and analysis, we reach the following conclusions:

Conclusion 1: News websites have strong confidence and low fluctuation; that is, news websites are stable, and the calculated results agree well with human cognition. Therefore, information from news websites is more veracious.

Conclusion 2: The confidence of blog sites is relatively low and fluctuates more than that of news sites; that is, their performance is not very stable. The calculated results agree with human cognition to some degree. Hence, information from blog sites is less veracious.

Conclusion 3: BBS confidence fluctuates strongly; that is, its performance is not stable. The calculated results do not agree well with human cognition. As a result, information from BBS sites is the least veracious.

After this series of analyses, we conclude that the confidence of news websites and blog websites is in line with human cognition, while the confidence of BBS sites shows some error because human judgment is influenced by emotion. These conclusions support the correctness of the website confidence computation algorithm, and they also shed light on our future study.

6.2.3.2. Illustration of website confidence. Fig. 8 displays the confidence of bokee over a period of time, which involves more than 100 web events comprising 900,000 webpages with a life length of about 30–40 days. As time goes on, all the event features, such as attribute uncertainty, webpage uncertainty and classified webpage uncertainty, evolve, and the website confidence shifts correspondingly. The horizontal coordinate is time and the vertical coordinate is the website confidence of "bokee". Fig. 8 illustrates that, as time goes on, the 100 mined events evolve considerably and the corresponding website confidence varies. As can be seen from Fig. 8, the calculated bokee confidence ranges over 0.74–0.99, while the confidence from human cognition is 0.67; the calculated result is relatively high. From Table 3 we can see that its fluctuation is 0.069, which is poor compared with other websites.

Fig. 8. The confidence distribution of "bokee" over time.


Table 3
Website confidences of the 23 website sources.

Source        Range          M      Fluctuation   D      Type   Website url
Sohu          [0.86, 0.93]   0.83   0.014         0.07   News   http://www.sohu.com/
Sina          [0.88, 0.95]   0.85   0.017         0.07   News   http://www.sina.com.cn/
Ifeng         [0.84, 0.91]   0.88   0.02          0.07   News   http://www.ifeng.com/
163           [0.85, 0.94]   0.81   0.02          0.09   News   http://www.163.com/
Xinmin        [0.80, 0.99]   0.77   0.041         0.19   News   http://www.xinmin.cn/
19lou         [0.70, 0.99]   0.61   0.07          0.29   News   http://www.19lou.com/
yangzhi8      [0.72, 0.99]   0.65   0.081         0.27   News   http://www.yangzhi8.com.cn/
mpaper366     [0.66, 0.99]   0.60   0.112         0.33   News   http://www.mpaper366.com/
Bokerb        [0.92, 0.99]   0.63   0.015         0.07   Blog   http://www.bokerb.com/
Weibo         [0.77, 0.86]   0.71   0.018         0.09   Blog   http://weibo.com/
Focus         [0.71, 0.97]   0.68   0.045         0.26   Blog   http://sh.focus.cn/
Blogbus       [0.69, 0.99]   0.66   0.051         0.3    Blog   http://www.blogbus.com/
Bokee         [0.74, 0.99]   0.67   0.069         0.25   Blog   http://www.blogbus.com/
Tieba.Baidu   [0.86, 0.93]   0.63   0.015         0.07   BBS    http://tieba.baidu.com
Tianya        [0.84, 0.92]   0.65   0.018         0.08   BBS    http://www.tianya.cn/
China         [0.88, 0.96]   0.65   0.018         0.08   BBS    http://club.china.com/
Mop           [0.76, 0.90]   0.61   0.035         0.14   BBS    http://www.mop.com/
Douban        [0.78, 0.95]   0.74   0.036         0.17   BBS    http://www.douban.com/
RenRen        [0.78, 0.94]   0.65   0.037         0.16   BBS    http://www.renren.com/
Kaixin001     [0.77, 0.96]   0.63   0.048         0.19   BBS    http://www.kaixin001.com/
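The Pearson correlation reported for news websites can be recomputed from the eight News rows of Table 3, pairing the lower bound of each Range (the minimum computed confidence) with the manually graded value M. The helper below is a plain-Python sketch written for this check, not the authors' code, and the variable names are hypothetical. With these eight rows it gives a coefficient of about 0.966, in line with the value stated in Section 6.2.3.1.

```python
from math import sqrt

def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# News rows of Table 3: lower bound of Range, and manual grade M.
news_min    = [0.86, 0.88, 0.84, 0.85, 0.80, 0.70, 0.72, 0.66]
news_manual = [0.83, 0.85, 0.88, 0.81, 0.77, 0.61, 0.65, 0.60]

r_news = pearson(news_min, news_manual)
print(round(r_news, 3))
```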

6.2.4. Experimental analysis of web event uncertainty

According to the descriptions in Sections 3–5, the veracity of a web event via uncertainty is influenced by the attributes distribution on webpages and the classified webpages distribution on websites. This section analyzes the experimental results for the veracity of a web event using all the event features. Fig. 9³ shows the uncertainty of the web event "tapping phone" at different times. The horizontal coordinate is time and the vertical coordinate is web event uncertainty; the veracity of this event is the converse of its uncertainty. The calculated results in Fig. 9 are ordered by day. The third day corresponds to July 10, 2012, when "World Service" published the last stage; on July 12, "Sunday times" and "the sun" fell deep into the eavesdropping mire; on July 15, the chief executive Brooks resigned from the Murdoch news group. The uncertainty of this web event is high and its veracity is low at the beginning of its evolution, and the veracity then increases over the following period. Over the whole process the two tendencies agree. There are other examples in our experiments; here we analyze only one of them. The results shown in Fig. 9 agree with the social phenomenon: when the event is in the break-out state, the corresponding web event veracity is low, and when the event enters the decay state, the corresponding web event veracity is high. The results are in accordance with human cognition.

Fig. 9. The uncertainty distribution of web event "tapping phone".

3 http://en.wikipedia.org/wiki/News of the World phone tapping scandal.

7. Conclusions

This article has modeled how to compute the veracity of a web event via uncertainty, based on the distribution of its features over different kinds of confident websites, whose data is big. Some web events have positive effects by facilitating public supervision, while others have negative effects, such as the MeiLing Guo event. Our work sheds light on how individuals and social groups can respond to and understand popular or emergency web events in a timely manner. We propose a method to measure web event veracity by calculating web event uncertainty. The main contributions of this paper are as follows. Some event features that influence the measurement of veracity during the evolution of a web event are mined to calculate the veracity of the web event. The relations among features play an important role in computing the veracity of a web event; these relations include the distribution of attributes on webpages and of classified webpages on websites, and iterative processes are employed to handle them. Matrix operations are used to confirm that the result of the iteration process coincides with the mathematical model. Finally, experiments are conducted based on the above analysis, and the results show that the iterative algorithm is promising. In all, this paper analyzes web events from their event features to the relations among these features, and all of the analysis contributes to the computation of web event veracity.

Acknowledgements

This work was supported in part by the National Science Foundation of China under Grant nos. 91024012, 91324005 and the Open Project Program of the State Key Laboratory of Mathematical Engineering and Advanced Computing.

References

Allan, J., 2000. Topic Detection and Tracking Event-Based Information Organization. Kluwer, Norwell, MA.
Allan, J., Carbonell, G., Doddington, G., Yamron, J., Yang, Y., 1998a. Topic detection and tracking pilot study final report. In: Proceedings of the Broadcast News Transcription and Understanding Workshop.
Allan, J., Papka, R., Lavrenko, V., 1998b. On-line new event detection and tracking. In: Proc. 21st Annu. Int. ACM SIGIR Conf. Res. Development Inf. Retrieval, Melbourne, Australia, 1998, pp. 37–45.
Chen, C.C., Chen, Y., Chen, M.C., 2007. An aging theory for event lifecycle modeling. IEEE Trans. Syst. Man Cybern. A Syst. Hum. 37 (March (2)), 237–248.


Dean, J., Ghemawat, S., 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51 (1), 107–113.
Haddow, D., Bullock, A., Coppola, P., 2010. Introduction to Emergency Management.
Jin, X., Spangler, S., Ma, R., Han, J., 2010. Topic initiator detection on the World Wide Web. In: Proceedings of the 19th International Conference on World Wide Web, pp. 481–490.
Jo, Y., Lagoze, C., Lee Giles, C., 2007. Detecting research topics via the correlation between graphs and texts. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 370–379.
Kenneth, C., Yang, C., 2007. Factors influencing Internet users' perceived credibility of news-related blogs in Taiwan. Telemat. Inform. 24 (May (2)), 69–85.
Li, F., Lau, R., 2011. Emerging technologies and applications on interactive entertainments. J. Multimedia 6 (2), 107–114.
Li, Q., Lau, R., Leung, E., Li, F., Lee, V., Wah, B., Ashman, H., 2009. Emerging internet technologies for e-learning. IEEE Inter. Comput. 13 (July (4)), 11–17.
Makkonen, J., 2003. Investigation on event evolution in TDT. In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language, pp. 43–48.
Mei, Q., Zhai, C., 2005. Discovering evolutionary theme patterns from text: an exploration of temporal text mining. In: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pp. 198–207.
Metzger, M.J., 2000. An Overview of Internet Credibility. Department of Communication, University of California Santa Barbara http://www.eomm.ucsb. ed u/mmetzger flash.htm
Nadine Wathen, C., Burke, J., 2002. Believe it or not: factors influencing credibility on the web. J. Am. Soc. Inform. Sci. Technol. 53 (2), 134–144.
Nallapati, R., Feng, A., Peng, F., Allan, J., 2004. Event threading within news topics. In: Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, pp. 446–453.
Ng, B., Li, F., Lau, R., Si, A., Siu, A., 2003. A performance study of multi-server systems for distributed virtual environments. Inform. Sci. 154 (1), 85–93.
Ng, B., Lau, R., Si, A., Li, F., 2005. Multi-server support for large scale distributed virtual environments. IEEE Trans. Multimedia 7 (6), 1054–1065.
Salton, G., Yang, C.S., 1973. On the Specification of Term Values in Automatic Indexing. Cornell University.
Wei, C., Chang, Y., 2007. Discovering event evolution patterns from document sequences. IEEE Trans. Syst. Man Cybern. A Syst. Hum. 37 (March (2)), 273–283.
Yang, C., Shi, X., 2006. Discovering event evolution graphs from newswires. In: Proceedings of the 15th International Conference on World Wide Web, pp. 945–946.
Yang, C., Shi, X., Wei, C., 2009. Discovering event evolution graphs from news corpora. IEEE Trans. Syst. Man Cybern. A 39 (4), 850–863.
Yin, X., Han, J., Yu, P.S., 2008. Truth discovery with multiple conflicting information providers on the web. IEEE Trans. Knowl. Data Eng. 20 (6), 796–808.
Zhang, X., Liu, C., Nepal, S., Yang, C., Dou, W., Chen, J., 2014. A hybrid approach for scalable sub-tree anonymization over big data using MapReduce on cloud. J. Comput. Syst. Sci. 80, 1008–1020.
Zhang, X., Yang, T., Liu, C., Chen, J., 2014. A scalable two-phase top-down specialization approach for data anonymization using MapReduce on cloud. IEEE Trans. Parallel Distrib. Syst. 25 (2), 363–373.


Xinzhi Wang received the Bachelor’s degree from Shanghai University, Shanghai, China, in 2012, where she is currently working toward the Master degree in the School of Computers. Her main research interests include topic detection and tracking, sentimental analysis and web mining.

Xiangfeng Luo received the Master's and Ph.D. degrees from the Hefei University of Technology, Hefei, China, in 2000 and 2003, respectively. He is a Professor with the School of Computers, Shanghai University, Shanghai, China. He was a Postdoctoral Researcher with the China Knowledge Grid Research Group, Institute of Computing Technology, Chinese Academy of Sciences, from 2003 to 2005. He has authored or coauthored more than 50 publications, and his publications have appeared in the IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING, IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, and IEEE TRANSACTIONS ON LEARNING TECHNOLOGY. His main research interests include web wisdom, cognitive informatics, and text understanding. Dr. Luo has served as the Guest Editor of ACM Transactions on Intelligent Systems and Technology. He has also served on the committees of a number of conferences and workshops, including as Program Co-Chair of the International Conference on Web-based Learning (ICWL 2010) (Shanghai), the International Conference on Web Information Systems and Mining (WISM 2012) (Chengdu), and the International Workshop on Cognitive-based Text Understanding and Web Wisdom (CTUW 2011) (Sydney), and as a PC member of more than 40 conferences and workshops.

Huimin Liu received the Bachelor’s and Master’s degrees from Shanghai University, Shanghai, China, in 2010 and 2013, respectively. He is currently working for data center of Agriculture Bank of China. His main research interests include topic detection and tracking and web mining.
