A graph-based approach to ememes identification and tracking in Social Media streams

A graph-based approach to ememes identification and tracking in Social Media streams

Accepted Manuscript A Graph-Based Approach to Ememes Identification and Tracking in Social Media Streams Ekaterina Shabunina, Gabriella Pasi PII: DOI...

771KB Sizes 21 Downloads 192 Views

Accepted Manuscript

A Graph-Based Approach to Ememes Identification and Tracking in Social Media Streams Ekaterina Shabunina, Gabriella Pasi PII: DOI: Reference:

S0950-7051(17)30484-7 10.1016/j.knosys.2017.10.013 KNOSYS 4075

To appear in:

Knowledge-Based Systems

Received date: Revised date: Accepted date:

14 April 2017 11 October 2017 12 October 2017

Please cite this article as: Ekaterina Shabunina, Gabriella Pasi, A Graph-Based Approach to Ememes Identification and Tracking in Social Media Streams, Knowledge-Based Systems (2017), doi: 10.1016/j.knosys.2017.10.013

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

ACCEPTED MANUSCRIPT

A Graph-Based Approach to Ememes Identification and Tracking in Social Media Streams Ekaterina Shabunina, Gabriella Pasi

CR IP T

Universit`a degli Studi di Milano-Bicocca, Dipartimento di Informatica Sistemistica e Comunicazione, Viale Sarca 336, 20126 Milan, Italy

Abstract

A meme, as defined by Richard Dawkins, is a unit of information, a concept or an idea that

AN US

spreads from person to person within a culture. Examples of memes can be a musical melody, a catchy phrase, trending news, behavioral patterns, etc. In this article the task of identifying potential memes in a stream of texts is addressed: in particular, the content generated by users

of Social Media is considered as a rich source of information offering an updated window on the world happenings and on opinions of people. A textual electronic meme, a.k.a. ememe, is

M

here considered as a frequently replicated set of related words that propagates through the Web over time. In this article an approach is proposed that aims to identify ememes in Social Media streams represented as graph of words. Furthermore, a set of measures is defined to track the

ED

change of information in time.

Keywords: Meme, Ememe, Information Evolution, Information Propagation, Text Graph,

PT

Graph Degeneracy, Fuzzy Logic

CE

1. Introduction

A meme, as defined by Richard Dawkins [1], is a “unit of information” or a concept that

AC

spreads from person to person within a culture. An example of a meme can be a musical melody or a whole song, a catchy phrase or a word, a trending news, a behavioral pattern, etc. Memes

5

can be manifested in several ways and forms, ranging from images [2], videos [3], texts [4, 5], etc. In the latter case, the concepts/ideas behind a meme are expressed by means of sentences Email addresses: [email protected] ( Ekaterina Shabunina), [email protected] ( Gabriella Pasi) Preprint submitted to Knowledge-Based Systems

October 12, 2017

ACCEPTED MANUSCRIPT

in natural language, and they are replicated and spread by (slight) variations of the employed vocabulary. Nowadays, the most common way to propagate opinions and ideas on the Web is constituted 10

by Social Media, where users can freely express themselves: in the 21st century the User Gen-

CR IP T

erated Content (UGC) represents the voice of people. This is due to the fact that, in the Web 2.0, users do not have external control on the content they generate, while in the previous instance

of the Web only professionals were allowed to publish. In particular, micro-blogging platforms boost the creation of UGC and, therefore, they constitute an important source of news and opin15

ions on the world happenings, which diffuse at a real-time speed. Thus, they offer to both Social Media users and researchers an invaluable repository of potential information.

AN US

This social context offers an unprecedented opportunity to analyse the content generated by millions of users of different countries and cultural environments to the aim of excerpting cultural trends and ideas, and thus to identify memes. The ability to model a stream of messages and track 20

its change over time would provide high benefits to many fields of research. In [4] this issue was first outlined to the aim of identifying and tracking memes on the blogosphere, and the word

M

ememe was proposed to denote an “electronic meme”.

Based on the findings reported in [4], in this paper we propose an approach that relies on a graph-based representation of Social Media posts to the aim of identifying ememes in any Social Media platform. To this purpose, the Graph-of-Words representation introduced in [6] is

ED

25

employed to model a stream of social posts. Moreover, an enhanced k-core degeneracy process

PT

is applied to the obtained graph to the aim of identifying and extracting the ememes hidden in the Social Media stream. The major novelties of the work reported in this article are constituted by the definition of ememe and of its formal representation, and the definition of a set of measures that allow to evaluate both quantitatively and qualitatively the identified ememes, and to track

CE

30

their change in time in social streams.

AC

An experimental evaluation on the micro-blogging platform Twitter of both the proposed

approach and the measures on the extracted ememes has demonstrated their validity and applicability.

35

Twitter has been selected as a case study due to its wide usage and popularity, as well as for

the availability of the useful functionalities it offers. Hashtags in particular allow to explicitly specify the main topic of tweets. 2

ACCEPTED MANUSCRIPT

It is important to outline that in this work hashtags are not considered as ememes, but rather they are topics composed of subtopics and flactations of opinions; this was investigated in our 40

earlier work [7]. To reduce the computational complexity of the whole approach, the hashtags are used as a topical filter of the considered Twitter streams to reduce both the volume of tweets to

CR IP T

be analyzed and the related potential ememes. Additionally, the identification of ememes related to a predefined topic allows to control some problems typical of Natural Language Processing (NLP) as word sense disambiguation, due to the explicit knowledge of the topical context. 45

The article is organized as follows: Section 2 provides an extensive review of the related literature; Section 3 introduces the scientific background and the techniques used in the approach

proposed in this paper. Section 4 introduces the novel method for the identification of ememes

AN US

in Social Media streams, the proposed formal representation of an ememe and the proposed set

of measures for the evaluation of the identified ememes. Section 5 reports the experimental 50

evaluation of the proposed method; and finally, Section 6 concludes the work and indicates some future research directions.

M

2. Related Literature

In 1976, in his book The Selfish Gene [1], Richard Dawkins coined the term meme to mean a

55

ED

“unit of human cultural evolution”, analogous to a gene in genetics. Dawkins defined the memes as “selfish replicators”, which have the mission to survive by leaping from brain to brain in a process of imitation. Dawkins also introduced three properties that a meme should have: 1)

PT

longevity - the length of the life span of a replicator; 2) fecundity - the rate at which the meme copies are created; and 3) copy-fidelity - the fidelity of the replicated copies to the initial meme.

60

CE

In Computer Science a first application of the meme notion gave origin to the so-called “memetics” (by analogy with genetics), which is a branch of evolutionary computation [8]. Besides memetics, which is not the focus of the research reported in our article, the notion

AC

of meme has played an important role in the context of Web 2.0, in particular, in relation to Social Media. Here, in the realm of the User Generated Content, a concept or an idea can easily replicate and propagate through time. The work reported in [4] was one of the first contributions

65

that addressed the issue of the representation and of the definition of measures that characterize “Internet textual memes”, which the authors name ememes. In [4] an ememe is defined as “a unit of information (idea) replicated and propagated through 3

ACCEPTED MANUSCRIPT

the Web by one or more of the Internet-based services and applications”. Specifically, the authors propose to formally define an ememe as a micro ontology generated by posts on the blogosphere 70

in OWL. Additionally, they propose a set of measures that characterize an ememe: replication fidelity, mutation degree, spread, and longevity.

CR IP T

A semi-automatic approach to memes identification is proposed in [9]. Here memes are

extracted from a corpus of documents; by this approach, a concept is defined as an n-gram that occurs in a significant number of documents in the corpus. Two concepts are considered 75

related when they frequently co-occur in the same documents (a threshold frequency is set by the user). A semantic network is then constructed, where nodes are entity concepts and edges

express relations between concepts. A meme is defined as a minimal semantic network, i.e. the

AN US

network obtained by removing redundant relationships between concepts. A big limitation of this approach is that it is not fully automatic, and it requires the user feedback. 80

The meme-tracking approach proposed in [10] assumes that memes are frequently quoted phrases and their variations. To identify memes, a phrase graph is built, where the nodes represent the quoted phrases, and the directed edges represent the decreasing edit distance between the

M

phrases. As an application of the proposed technique, both news and blogs articles are explored. To measure the information spread, the authors propose to examine the volume of blogs and 85

news media that contain the quoted phrase at each timestamp, and they study the time difference

ED

for mentioning a phrase between the news media and blogs. Similar to [10], in [11] a meme is considered as a frequently quoted phrase, but the main

PT

focus of the work is to explore the mutation characteristics of memes depending on the size of the phrase, on the media source or social network source, and on time. 90

A meme ranking problem is introduced in [12], where the objective is to maximize the overall

CE

network activity by selecting which memes to show to the user. A set of “heuristic rules” is proposed, which estimates the probability that “short snippets of text, images, sounds or videos”

AC

(or memes, as they call them) that users post in social media will go viral. The proposed heuristics take into account cosine similarity of a meme to the user’s profile, the similarity among the users,

95

and the intrinsic decay of the repost probability along time. A different approach to meme identification in Twitter is proposed in [13], where tweets are

initially grouped into “protomemes”, i.e. groups that share one instance of entities, such as a hashtag, a mention, a URL, or a phrase. The identified protomemes are then clustered based on a 4

ACCEPTED MANUSCRIPT

set of similarity measures applied to content, network and user types. The obtained clusters are 100

assumed to be the memes. One of the most recent studies on memes is presented in [14]. In this large-scale work carried out on the Facebook social network, the focus was on the imperfections of fidelity of information

CR IP T

being propagated.

As we have seen, in the works proposed in the Computer Science literature, many percep105

tions of how the notion of meme can be interpreted and instantiated have appeared, and several interpretations of “ememes” have been provided. A synthesis of the proposed definitions is re-

ported here below. The software system Truthy presented in [5] assumes that a meme may have a number of types such as “hashtags” and “mentions” on Twitter, URLs and the preprocessed text

110

AN US

of the tweet itself. In [15] a Twitter hashtag is considered as an online meme and as an event. In [12] the meme notion is associated with “short snippets of text, photos, audio, or videos” that a user posts in Social Media. In [13] the term meme is used to denote sets of topically coherent tweets. In [10, 11, 14] memes are assumed to be frequently quoted phrases. A micro ontology

representation of an ememe is proposed in [4]. In [9], a meme is represented as a semantic

115

M

network that is composed of entity concepts with strong links.

The discovered variety of definitions, representations and perceptions of how a meme can be interpreted on the Web brings an insight on the absence of a unique interpretation of the meme

ED

concept applied to Social Media, and the need to define one, which serves as the motivation of our work.

120

PT

Relying on the above mentioned literature, a meme, defined originally as a “unit of human cultural evolution” [1], can only be evaluated a posteriori by domain experts based on historical data. An automatic detection of memes is a very difficult task due to the fact that the persistence

CE

of an meme, and therefore its aliveness is strongly influenced by people. Not every viral message that has received bursty attention can be defined as a meme.

AC

In our work we propose a model for detecting potential memes. We refer to these potential

125

memes as “ememes” since in Social Media a viral spread of a catchy idea or an opinion can quickly become an “electronic meme”, which has to be further evaluated by time and by domain experts to determine whether it is a true meme or not. Therefore, in our work we do not claim to detect true memes but we rather try to discover potential memes, a.k.a. “ememes”, based on a stream of Social Media data. 5

ACCEPTED MANUSCRIPT

130

3. Graph-Based Representation and Tracking of Textual Information in Social Media The aim of this paper is threefold: 1) to formally represent Social Media posts by means of a graph-based representation; 2) to extract potential memes (that we call ememes) from the ob-

CR IP T

tained graph-based representation; and 3) to track the evolution in time of the identified ememes. These tasks have some similarities with other related research fields that in last years have 135

raised the attention of the scientific community. These include: Topic Detection and Tracking

[16, 17, 18, 19], and Event Detection [20, 21]. However, it is worth to be underlined that even

though both events and memes can be formally defined as a set of interdependent terms, the

main difference is that an event is constrained by time while the notion of a meme is not. On

140

AN US

the contrary, one of the main characteristics of a meme is its persistence in time. Nonetheless,

an event may give rise to a number of memes related to the event itself or to some aspects of it, but this is not mandatory. Similarly, compared to the definition of a meme, the notion of a topic differs due to its higher generality.

In fact a topic may encompass several memes at different points in time. In the following, the issue of formally representing a text is addressed. In fact, the core aspects of any text analysis is the choice of the text representation model. The most common

M

145

approach is called the bag-of-words model, which was first mentioned by Zellig Harris [22]. This

ED

simple representation reduces a text into a set of words by disregarding syntax. An alternative approach is to represent a text as a graph, where the nodes are the words and the edges represent meaningful relations between words. This representation is richer than the bag-of-words model, because it carries additional information. In fact a graph-based representation of texts allows the

PT

150

exploitation of the connections between words. Moreover, appropriate weights associated with

CE

the edges can represent the strength of the relationship between words. Furthermore, graphs allow to measure some useful properties of words (centrality, connec-

tivity, etc). For the above reasons our choice is to represent texts as graphs. Currently, a variety of works in the literature propose a graph-based representation of texts;

AC

155

however, it is not the scope of our work to give an extensive review of them. A broad review of graph-based approaches for text representations is provided in [23]. In our work, to model a stream of Social Media posts we employ the graph-based represen-

tation called Graph-of-Words, the unweighted and directed version of which was first introduced 160

in [6]. In our approach we employ the weighted and undirected Graph-of-Words, as reported in 6

ACCEPTED MANUSCRIPT

Definition 1. Definition 1 (Graph-of-Words). Let G = (V, E) be a weighted undirected graph, where V is

a set of nodes and E is a set of edges. We denote by Wv : V → R+ the weight mapping on the

165

CR IP T

nodes, and by We : E → R+ the weight mapping on the edges. The nodes of the graph represent unique terms, and edges connect terms that co-occur in the considered text.

In a weighted graph the weighted degree of a node v is computed as the sum of the weights of its incident edges [24]. We assume that function Wv computes the weighted degree of a node, i.e. a node’s weight is its weighted degree.

In Definition 2, we introduce the concept of k-core graph degeneracy [25], which we employ to identify the potential ememes in the Graph-of-Words representing the stream of Social Media

AN US

170

posts.

Definition 2 (k-core). Let G = (V, E) be a graph and k be a positive real number. A subgraph Hk = (V0 , E0 ) of the graph G, induced by the subset of nodes V0 ⊆ V and the subset of edges

E0 ⊆ E, is called a k-core or a core of order k of graph G iff deg(v) ≥ k, ∀ v ∈ V0 . And Hk is the maximum subgraph which satisfies the property deg(v) ≥ k, ∀ v ∈ V0 , i.e. it cannot be augmented

ED

without losing this property.

M

175

More specifically, in our work we define and apply an extension of the k-core degeneracy algorithm, which considers the weighted degrees of nodes in the Graph-of-Words instead of the

180

PT

classic degrees of nodes. Thus, the k-core of a graph is the maximal connected subgraph where each node has a weighted degree of at least k within the subgraph. The set of all k-cores of the graph forms the k-core decomposition of a graph.

CE

The k-degenerated graph notion was introduced by Stephen Seidman in 1983 [25], and later,

in [26] an efficient algorithm has been introduced for determining the core decompositions of a

AC

graph with O(max(m, n)) time complexity, which equals O(m) for connected networks, where n

185

is the number of nodes and m is the number of edges. This algorithm is based on the recursive elimination of all nodes and edges incident with them of degree less than k.

In Section 4 the proposed process for identifying and measuring ememes is defined in-depth

and formally. Meanwhile, a general overview of the overall process is the following: given a finite stream of Social Media posts collected over a finite temporal window we define an iterative 7

ACCEPTED MANUSCRIPT

190

procedure to construct a graph (Graph-of-Words) that formally represents this stream. To this aim, we first define a graph associated with the first post in the stream, then we iteratively update it by generating and adding (one by one) the graphs of the next posts of the considered stream. We apply then to the obtained graph the k-core degeneracy algorithm to extract the core of the graph

195

CR IP T

that we assume to represent an ememe or a set of ememes. Then, we repeat this procedure for the subsequent stream of Social Media messages collected in on an equally sized time window,

and we obtain its k-core. This procedure is further repeated for a pre-defined number of streams of Social Media posts collected in temporally equal and sequential time windows. It is important

to notice that each Social Media stream may contain a different number of posts and a different number of ememes. Once the ememes from each target stream have been identified, we apply the ememe measures to track the temporal and topical changes of the ememes identified in these sequential streams of Social Media posts.

AN US

200

4. Formal Representation of an Ememe and its Measures

In the following subsections we define the process of the formal representation of a stream

205

M

of Social Media posts (Section 4.1), the process aimed at the identification and formal representation of ememes from the considered Social Media stream (Section 4.2), and the measures that

ED

we have defined in order to evaluate different aspects of a potential ememe and of its evolution through time (Section 4.3).

PT

4.1. The Ememe Identification Process

We assume to have a finite stream of Social Media posts. We partition this finite stream of Social Media posts into N sequential substreams of posts appearing in time windows of equal du-

CE

210

ration. We denote these sequential substreams as T S 1 , T S 2 , ..., T S n , ..., T S N . The granularity of the considered time window may depend on both the topic and the purpose of the analysis, and it

AC

could be an hour, a day, a week, etc. The smaller the granularity the more potential ememes can be identified, but the noise can also increase. The larger the granularity the simpler it is to verify

215

the persistence in time of the identified ememes. In Section 5 we will demonstrate how different time granularities impact the quality, quantity and topical aspects of the identified ememes. Introducing a partitioning of an observed stream of posts into substreams T S s is necessary in order to perform an analysis of the evolution of ememes over time. 8

CE

PT

ED

M

AN US

CR IP T

ACCEPTED MANUSCRIPT

Figure 1: General scheme of the process of generating the graph-based representation of the T S n substream of Social

1

AC

Media posts.

Figure 1: General scheme of the process of generating the graph-based representation of the T S n substream of Social Media posts.

9

ACCEPTED MANUSCRIPT

In the following we explain the process of constructing the weighted undirected Graph-of220

Words GT S n (see Definition 1 in Section 3) representing the T S n substream of posts associated with the time window n; the overall process is illustrated in Figure 1. We denote by GT S n the Graph-of-Words representing the target substream of posts T S n ; we denote by g pT S n the Graphi

CR IP T

of-Words representing each post of the T S n . Each post pTi S n of the target T S n is preprocessed

(see Section 5.2 for details), including its conversion to a bag-of-words of single terms of the 225

post. Then we create the GT S n of the target T S n by means of an iterative process. We start by

representing the first preprocessed post pT1 S n of the target substream T S n as a fully-connected graph g pT S n , where the nodes represent the words of the post. To treat all words in the single 1

post equally we initially set all the weighted degrees in g pT S n to 1. As outlined in Section 3, the 1

230

AN US

weighted degree of a node is defined as the sum of the weights of its incident edges. Therefore,

to share the weighted degree of each node over its incident edges, we compute the weight of each edge as w =

1 l−1 ,

where l is the number of nodes in the fully-connected g pT S n . The obtained 1

Graph-of-Words g pT S n is the initial GT S n . Then we repeat the same process for the next post 1

pT2 S n in

the target substream T S n and obtain its Graph-of-Words representation g pT S n . The initial 2

2

235

M

GT S n is then merged with g pT S n : the weights of all the edges of the existing nodes in GT S n are incremented by the weights of the edges in the second post’s graph g pT S n . For those nodes that did 2

not exist in the initial GT S n new nodes and corresponding edges are created by maintaining the

ED

edges’ weights from the g pT S n . The weighted degrees of the nodes are recalculated by summing 2

up all the weights on the edges incident to each node. This procedure is iterated for every post in

PT

the T S n to obtain a final Graph-of-Words GT S n of the Social Media posts substream T S n : GT S n = g pT S n + g pT S n + ... + g pT S n + ... + g pT S n ,

240

1

2

i

M

CE

where M is the number of Social Media posts in the substream T S n . Based on the process illustrated above, the larger and the more general (in terms of topics) is

the Social Media posts substream T S n , the bigger is the Graph-of-Words associated with it.

AC

We normalize the weighted degrees of the nodes of the final Graph-of-Words GT S n for a

245

given T S n in the [0, 1] interval by dividing all the nodes’ weights by the largest node weight in the GT S n . Similarly, we normalize the weights on the edges by dividing them by the largest edge weight in the GT S n . The described process is then repeated for all the subsequent T S s in the given stream of

Social Media posts. Thus, we obtain a temporal sequence of Graph-of-Words representations of 10

ACCEPTED MANUSCRIPT

250

all substreams: T S 1 → GT S 1 , ..., T S n → GT S n , ..., T S N → GT S N . Then we apply the k-core degeneracy algorithm to each final and normalized graph GT S n :

CR IP T

in Section 5.3 a discussion related to the setting of parameter k is presented. The application of the k-core degeneracy process is aimed at identifying the core vocabulary of the ememes in 255

the T S n . In fact, the outcome of this algorithm is constituted by a set of subgraphs: the nodes of each subgraph represent a set of words that co-occur predominantly in the observed Social Media posts substream. We claim that these subgraphs may be considered as a basis of the

AN US

representations of ememes of the substream T S n .

Definition 3 (k-core of the Graph-of-Words as an Ememes Basis). An ememes basis BT S n is 260

the set of subgraphs (one or more) generated by applying the k-core graph degeneracy to the final Graph-of-Words GT S n associated with the substream T S n .

Thus, the ememes basis BT S n represents the core granules of information in the stream T S n .

M

To further verify the quality of the obtained subgraphs in the ememe basis BT S n , an outliers detection technique is applied, which aims at identifying nodes with significantly high weighted 265

degree. We call these nodes “central concepts”. The extraction of “central concept(s)” of a

ED

subgraph forces it to split into a number of additional subgraphs. This way we can identify the ememes that spin around the same central concepts or entities.

PT

4.2. Formal Representation of an Ememe As outlined in the previous section, each subgraph, generated by the application of the enhanced k-core algorithm to the Graph-of-Words GT S n associated with a substream T S n , consti-

CE

270

tutes a basis for the formal representation of an ememe. We propose then to derive from each subgraph a formal representation of the associated ememe. In particular, we represent an ememe

AC

as a fuzzy subset of the terms contained in the subgraph, as specified in Definition 4. Definition 4 (Ememe). Given a substream T S n of Social Media posts, an ememe mTi S n is repre-

275

sented as a fuzzy subset of terms of the ith subgraph from the ememes basis BT S n . VmT S n denotes the vocabulary employed in the ememe representation and is composed of the i

terms associated with the nodes of the ith subgraph of the ememes basis BT S n (Definition 3). The 11

ACCEPTED MANUSCRIPT

degree of membership of each word is the normalized weight of the corresponding node in the ith subgraph of the ememes basis BT S n . µmT S n : VmT S n → [0, 1],

280

i

i

i

CR IP T

where µmT S n denotes the membership function of the ememe mTi S n .

The fuzzy subset representing the ememe mTi S n is denoted as follows:

FmT S n = {µmT S n (v1 )/v1 , µmT S n (v2 )/v2 , ... µmT S n (v j )/v j }, i

i

i

i

where the notation µmT S n (v j )/v j denotes a term v j ∈ VmT S n and its associated membership i

285

i

value.

AN US

Furthermore, we introduce a fuzzy relation RmT S n associated with ememe mTi S n ; the memi

bership function of the fuzzy relation is defined on the cartesian product VmT S n × VmT S n of the TSn

vertices of the ith subgraph from the ememes basis B

i

i

(Definition 3). The membership value

of a pair of vertices (vi ,v j ), is the weight of the edges connecting the two nodes in the considered subgraph. 4.3. Ememe Measures

M

290

In this section we define the measures that we propose to characterize an ememe. The pro-

ED

posed measures allow: a) to validate whether the identified ememe possesses the main characteristic of a meme, i.e. persistence in time; b) to track and measure the change of ememes as they propagate through time; c) to identify the most dominant ememe(s) for each time window in the

PT

295

target Social Media posts stream.

CE

To the purpose of defining the proposed measures, we formally define a Social Media post. Definition 5 (Representation of a Social Media post). A Social Media post pTj S n is a set of

AC

words. 300

Each Social Media post pTj S n of the T S n is represented as the set of words that it contains, ex-

cluding duplicate words and stopwords (see Section 5.2 for details). In this case the membership function is binary: the words of the post are either included in the set or not. In fact, as Social Media posts are usually very short texts, the crucial point is the presence of a word and not the count of its occurrences. 12

ACCEPTED MANUSCRIPT

305

To verify an ememe’s most important characteristic, i.e. its persistence in time, and to model and track the different transformations that an ememe can undergo, in the following we propose a set of measures. To this aim we have taken inspiration from the ememe’s characterizing measures prevalence, and fidelity.

310

CR IP T

proposed in [4], and we define: relatedness of an ememe to a Social Media post, coverage,

Definition 6 (Relatedness of an ememe to a Social Media post). The relatedness of an ememe

mTi S n to a Social Media post pTj S n is the degree of inclusion of the fuzzy subset representing the ememe mTi S n to the set representing the post pTj S n .

where t is the term in the 315

pTj S n

Count(pTj S n P



mTi S n )

Count(pTj S n )

P

t∈V pT S n

=

j

min(µ pT S n , µmT S n ) j

P

t∈V pT S n

i

µ pT S n

AN US

Rel(mTi S n ⊆ pTj S n ) =

P

,

j

j

post’s vocabulary V pT S n . j

The relatedness of an ememe to a Social Media post is calculated based on the number of terms in common between the ememe’s fuzzy subset and the post’s classical set. Since the ememe’s terms are weighted (the degree of membership of these terms to the ememe fuzzy

M

subset), the relatedness score is simply calculated as the sum of the weights of the terms that the post has in common with the ememe, divided over the number of terms in the post. Definition 7 (Coverage). The coverage of an ememe mTi S n is the percentage of the T S n posts

ED

320

pT S n that have a relatedness score to the ememe mTi S n above a threshold J.

PT

In Section 5.4 we present a discussion related to the setting of the threshold J. Coverage measures the proportion of posts in the T S n substream that ”include” the ememe

325

CE

mTi S n . This measure allows to assume that a Social Media post may include more than one ememe.

In Definition 8, we propose a measure to evaluate the importance of each ememe in compar-

AC

ison to other ememes of the target T S n . Prevalence can be used to identify the most dominant ememe for each Social Media posts substream. Definition 8 (Prevalence). Each Social Media post pTj S n from T S n is evaluated to establish to

330

which ememe it has the highest score of relatedness. Then the prevalence of an ememe in a T S n is calculated by computing the percentage of posts from T S n that have the highest relatedness score to that ememe. 13

ACCEPTED MANUSCRIPT

The measure of ememe prevalence allows to comparatively evaluate the predominance of each ememe identified in the considered T S n with respect to the other ememes of that T S n . 335

The measure of ememe’s prevalence is calculated by choosing for each Social Media post the ememe to which it has the highest relatedness score, in other words, which ememe is the most

CR IP T

similar to this post. Hence, we obtain the importance of each ememe for the target T S n through the percentage of the T S n posts that each ememe “represents”. By comparing the prevalence values of all ememes in T S n we identify the most dominant ememe of the T S n . 340

In Definition 9 we introduce a measure called Fidelity for the evaluation of the similarity between two ememes. Fidelity is a comparative measure. It is calculated for an ememe of interest

from one substream in relation to all other ememes in previous and subsequent substreams. The

AN US

procedure of calculating the Fidelity measure will be presented on two hypothetic ememes mTi S n from the substream T S n and mTj S n+m from one of the subsequent substreams T S n+m . 345

Definition 9 (Fidelity). Fidelity is the value of similarity between ememe mTi S n and ememe mTj S n+m . The similarity between the fuzzy subsets representing the two ememes is computed as sug-

M

gested in [27]:

ED

S (mTi S n , mTj S n+m ) = 1/h

min(µmT S n (vk ), µmT S n+m (vk )) h P i j ), ( k=1 max(µmT S n (vk ), µmT S n+m (vk )) i

j

where vk is a term of the vocabulary of the ememe, and h is the number of terms in the 350

vocabulary of the ememe with the highest number of terms.

PT

The measure of fidelity allows to verify the ememe’s main characteristic: its persistence in time. It allows to identify whether a given ememe is preserved in the previous and next time

CE

windows.

We emphasize the fact that all the previously presented definitions and measures are applica-

ble to any Social Media platform.

AC

355

5. Experimental Evaluation The method proposed in this paper is aimed at both identifying ememes in social media

streams and tracking their evolution in time. As outlined in Section 3, this task differs from the tasks of Topic/Event Detection and Tracking, which have been widely explored in the literature. 360

Due to the novelty of this task, in the literature there is a lack of baseline methods and datasets 14

ACCEPTED MANUSCRIPT

for comparative evaluations. The prior works on memes/ememes, presented in Section 2, have performed preliminary and specific evaluations. To evaluate the approach proposed in this article we aim to asses two distinct aspects: the quality of the ememes extracted from a stream of Social Media posts, and a quantitative evaluation of the proposed ememe measures. In Section 5.1 we introduce the datasets we created specifically for this purpose. Then we describe the performed

CR IP T

365

data preprocessing steps (Section 5.2) and the phase of parameter setting (Section 5.3). Finally, in Section 5.4 we present the results obtained by applying the proposed method to the generated datasets. 5.1. Datasets

In this section we describe the datasets that have been created to evaluate the proposed ap-

AN US

370

proach w.r.t the tasks of ememe representation, their identification in Social Media, and measuring their properties.

We have chosen the micro-blogging platform Twitter as a case study due to its explicit purpose to spread what people think about the world happenings around them, and to share the posts 375

of the people who they like. For this reason Twitter constitutes a good platform for studying

M

ememes from their identification to their spreading dynamics. Moreover, Twitter exposes some useful functionalities. For example, hashtags constitute an important Twitter mechanism that

ED

allows to explicitly specify a topic contained in a tweet, i.e., a Twitter post of only up to 140 characters. To the purpose of analyzing the information generated by Social Media we are faced 380

with the problem of the immense amount of UGC available, and, consequently, of the number

PT

of ememes we can identify. In order to limit the stream of Twitter messages to be analyzed, by maintaining a clear reference to a broad and meaningful topic, we have performed a prefiltering

CE

phase of the crawled tweets based on a general hashtag (e.g. #politics). Some tweets contain more than one hashtag; this represents either a number of sub-topics addressed in the tweet or 385

a number of core concepts of the main topic. Additionally, the identification of ememes related

AC

to a predefined topic allows to decrease some difficult problems typical of NLP, as word sense disambiguation, due to the explicit knowledge of the topical content. We have created a crawler that collects tweets on three separate topics, using the Twitter

Search API 1.1. The three collected datasets consist of tweets in the English language only. The 390

topics we have selected are the following: Economy, Finance, Politics. The statistics related to all three datasets are presented in Table 1. 15

ACCEPTED MANUSCRIPT

Hashtag

Number of days

Dates

Economy

#economy

57

01/03/2016 - 26/04/2016

180,690

Politics

#politics

48

09/03/2016 - 26/04/2016

303,358

Finance

#finance

51

06/03/2016 - 26/04/2016

334,660

Table 1: Datasets statistics

5.2. Data Preprocessing

Number of tweets

CR IP T

Topic

Since in this work we focus on Social Media, the data preprocessing step is particularly important. First of all, the tweets in a stream related to a broad topic have been preprocessed 395

by eliminating special characters (except for the common Social Media symbols such as “@”

AN US

for mentioning a user and “#” for hashtag, and “-” to maintain the hyphen connected words),

URLs, punctuation and by performing tokenization. We have not performed stemming and Partof-Speech tagging; in fact, we have aimed at preserving all possible information in the short Twitter messages. For the same reason we have used a short stopwords list to filter out the most 400

common English words.

M

From each of the three datasets we have eliminated the respective hashtag by which the dataset was collected along with the word of the hashtag since it is a redundant information contained in each tweet. As part of the Graph-of-Words algorithm we have eliminated all non-unique

405

ED

words per tweet. Due to the fact that words in Social Media can have intentional misspellings, we do not perform any prefiltering or corrections of the words. It is important to notice that the

PT

proposed algorithm already eliminates non-important information due to the fact that the words that characterize this information are absent from the k-core of the graph of the target Social Media stream. All the timestamps of the tweets have been converted to the UTC+01:00 timezone.

410

CE

Retweets have been preprocessed along with the tweets. 5.3. Parameter Setting

AC

In this section we introduce the experimental setup. In particular, we address the issue of

setting of the k parameter for the k-core graph degeneracy process. We have studied the behavior and the influence of the k parameter on the extracted ememes

through an experimental evaluation of the proposed algorithm on the different datasets and on 415

different time window sizes. The produced analysis allowed us to identify the desired characteristics in the extracted ememes, based on which the setting of the k parameter was performed. 16

ACCEPTED MANUSCRIPT

Economy

Politics

Finance

TS granularity

hour

day

hour

day

hour

day

Number of TSs in the dataset

1351

57

1114

48

1145

51

133.7

6340.0

272.3

6319.9

292.3

6561.9

Average number of edges per TS

3600.5

65485.7

7164.1

127412.4

8021.7

128987.7

Average number of nodes per TS

647.8

6541.5

1246.6

11303.1

1203.4

10107.4

Perc. of nodes with weight ≥ 0.1

32.6%

1.6%

9.2%

0.55%

6.9%

0.6%

210.9

103.3

114.6

62.5

82.5

60.2

10.9%

0.6%

2,7%

0.16%

2.3%

0.23%

70.3

39.2

33.9

17.8

27.3

23.4

5.7%

0.33%

1.3%

0.06%

1.2%

0.14%

36.7

21.4

16.1

6.9

14.8

14.4

Average number of nodes with weight ≥ 0.1 Perc. of nodes with weight ≥ 0.2

Average number of nodes with weight ≥ 0.2 Perc. of nodes with weight ≥ 0.3

AN US

Average number of nodes with weight ≥ 0.3

CR IP T

Average number of tweets per TS

Perc. of nodes with weight ≥ 0.4

Average number of nodes with weight ≥ 0.4 Perc. of nodes with weight ≥ 0.5

Average number of nodes with weight ≥ 0.5

3.6%

0.22%

0.76%

0.04%

0.83%

0.09%

23.5

14.1

9.5

4

10

9.7

2.7%

0.16%

0.53%

0.02%

0.61%

0.07%

17.3

10.6

6.6

2.7

7.4

6.6

Table 2: The distribution statistics of the degrees of nodes in all studied datasets. All the values are averaged over all

Economy

0.5

Politics

TS=day

TS = hour

ED

TS = hour

M

T S s in the considered dataset.

0.4

0.3

Finance

TS=day

TS = hour

TS=day

0.2

0.3

0.3

PT

Table 3: k parameter setup for each dataset for each T S granularity.

In Table 2, for each dataset we present the statistics of the distribution of the nodes’ degrees averaged over all T S s, for both time granularities of one hour and one day (for generating the

5.4. Preliminary Evaluations and Results

AC

420

CE

T S s). In Table 3 the k values for each dataset are reported, for both granularity sizes of T S .

In this section we describe the results obtained from the evaluation of the proposed approach

on the datasets introduced in Section 5.1, with the k parameter set as described in Section 5.3.

We start by giving some ecamples of the extracted ememes, identified according to the process described in Sections 4.1 and 4.2. Next, we provide a discussion of the results achieved by

425

applying the ememe measures defined in Section 4.3. 17

ACCEPTED MANUSCRIPT

Economy

Politics

Finance

hour

day

hour

day

hour

day

Number of TSs in the dataset

1,351

57

1,114

48

1,145

51

Average number of ememes per TS

2.4

1.2

1.6

1

1.4

1

Maximum number of ememes in one TS found

11

3

8

1

5

2

Average number of words in an ememe

5.4

7.2

CR IP T

TS granularity

6.7

10.5

6.8

8.5

Table 4: Ememes related statistics for each dataset.

5.4.1. Examples of Ememes Identified.

In this subsection we provide a few examples of ememes extracted from the employed

AN US

datasets, along with some statistics related to them.

In Figure 2 we present a sample of the ememes extracted from the Economy dataset with TS’s 430

time granularity of one hour; here ememes are represented by means of the words that constitute the support of their fuzzy representations (the support of a fuzzy set is the set containing all elements with nonzero membership). By the proposed representation it is visible how ememes diffuse and evolve over a timeline of one month. Each of these ememes is represented by the

435

M

words tha co-occur more in a tweet over the Economy dataset on hourly basis, as explained in Section 4.1. In this example, we emphasize with different tonalities in the gray scale and

ED

indentation the two initial ememes (the first one in bold black and the second one in bold gray and with indent), and how these evolve over time, gaining and loosing words in their vocabulary. We leave without any font style and tonality change the words that were only used once and that

440

PT

were not propagated in the next ememes instances. In Table 4 we present some statistics related to the results obtained from the ememes extrac-

CE

tion from Twitter streams, including the average and maximum numbers of ememes in a T S , and the average number of words in an ememe. From this table it becomes evident that the average number of ememes per T S is consistently larger and is greater than 1 throughout all three datasets

AC

for the hourly time granularity than for the daily time granularity. This interesting effect of the

445

T S ’s time granularity size on the identified ememes number is further visible when looking at the vocabulary of the ememes. For example, in Figure 2 we observe a 14-ememes long evolution and diffusion on the “climate change” and “support local producers” related topic, where the T S ’s time granularity is an hour. Meanwhile, on the same dataset of Economy, by setting the 18

ED

M

AN US

CR IP T

ACCEPTED MANUSCRIPT

1

AC

CE

PT

Figure 2: Evolution of ememes in time. Economy dataset. TS granularity of 1 hour.

Figure 2: Evolution of ememes in time. Economy dataset. TS granularity of 1 hour.

19

ACCEPTED MANUSCRIPT

Dataset

TS

Ememes

Economy

5am 06.04

1. #sales #business prices back 2. growth global fragile weak imf’s lagarde merger news 3. president

CR IP T

4. @dollarvigilante watch japan well land rising sun 5. #android 6. #money #finance

7. hailstorm rains lash parts himachal crops destroyed plan #news 8. rebounding southeast asia seeks buffer china slowdown Finance

2am 03.04

1. @murfo portfolio diversification key maximising return

AN US

#superannuation #capitalallocation #retirement

2. 30 great latest #forex #jobs #money #job #hiring #stocks Politics

2pm 21.03

1. trump #tcot obama cuba #news

2. ireland defends healy rae attacker #news #ireland @whispersnewsltd 3. political up #news

4. talk without losing friends #debates #2016election #pointsofview

M

#communication #strategy @strategymgzine

ED

Table 5: Ememes examples.

T S ’s time granularity to 1 day, we do not observe a single ememe on this topic. Another example 450

concerns the Politics dataset. With the T S ’s time granularity set to one hour, 62% of ememes

PT

mention Donald Trump. While in the same Politics dataset but with the T S ’s time granularity set to 1 day, Trump was mentioned in 100% of tweets. In Table 5 we exemplify a few different

CE

samples of ememes identified with our algorithm in the same T S . While our datatsets of tweets were gathered on three topics, from the obtained results it is evi-

455

dent that the selected topics are in turn composed of subtopics associated with numerous ememes

AC

throughout the observed datasets. For example, in the hourly partitioning of the Economy dataset, we find 178 ememes concerning “China” that spread throughout the whole observation period. These ememes related to “China” concern China’s growth, trade, etc. Similarly, we find other countries, such as USA, Greece, Iran, Russia, India, UK, Pakistan, etc. as subtopics, along with

460

common economic matters as production, investing, oil related, stock markets, business, real estate, employment, etc. And all of these “subtopics” related to countries or economic matters 20

ACCEPTED MANUSCRIPT

merge and split throughout the whole observation period creating numerous ememes. It is important to highlight that, as observable from Figure 2 and from Table 5, many ememes’ timestamps refer to the night and early morning hours. This is due to both the fact that, as 465

mentioned in Section 5.2, the timestamps of the tweets were converted to the UTC+01:00 (Eu-

5.4.2. Evaluations of the Ememe Measures.

CR IP T

rope/Rome) timezone and that most of the English tweets are geolocated in USA.

In the following we present the results obtained by applying the proposed measures on the extracted ememes. All the values in the following tables are averaged per each ememe of T S n 470

and for all T S s in the dataset, unless stated otherwise.

AN US

Relatedness of an ememe to a tweet; this measure allows us to evaluate how each tweet in

the T S n is related to each ememe, extracted from that T S n . In Table 6 we present the statistics related to the Relatedness measure. In addition to the averaged Relatedness scores, we present the average percentage of tweets with a zero Relatedness score per each ememe of T S n , and the 475

averaged Relatedness measure excluding these zero scores. The average percentage of tweets from T S n with zero score of Relatedness to each observed ememe in T S n is in the interval 45% -

M

71%, with the highest figures for the Economy topic and the lowest figures for the Finance topic. This phenomenon of large number of tweets unrelated to a specific ememe of a substream T S n is

480

ED

natural, and it represents an evident overall diversity of information in the Social Media stream, even in the case of a well defined general topic and for small time granularities as an hour or a day. As expected, the Relatedness scores are from two to five times better when excluding the

PT

tweets that do not relate to the target ememe. An observation of Tables 4 and 6 brings evidence to the presence of a correlation between the percentage of tweets with zero Relatedness scores

485

CE

and the averaged number of ememes per T S , and an inverse correlation between the Relatedness score and the average number of ememes per T S . The Coverage measure allows to identify the overall percentage of the T S n tweets that in-

AC

clude each ememe of this T S n . The threshold J (see Definition 7) on the Relatedness score for the Coverage measure is set to 0.1. This value was empirically identified from the statistics illustrated in Table 6, where the averaged Relatedness, without the zero scored tweets, was over 0.1

490

for all datasets and all T S granularities. In Table 7 we present the Coverage value (in percentage) averaged per each ememe of T S n over all the T S s for each dataset. On average, ememes were “covered” by 9.5% of T S tweets in the worst case and by 28.9% in the best. Interestingly, the 21

ACCEPTED MANUSCRIPT

Economy hour

day

hour

day

hour

day

0.03

0.08

0.04

0.05

0.06

0.09

70.9%

72.7%

63.1%

55.4%

58%

44.9%

0.15

0.29

0.12

0.1

0.15

0.16

Averaged Relatedness Percentage of tweets with zero Relatedness

Finance

Averaged Relatedness without zeros

CR IP T

TS granularity

Politics

Table 6: Statistics related to the Relatedness of an ememe to a tweet measure.

Economy

Politics

TS granularity

hour

day

hour

Averaged Coverage in percentage

9.5%

14.9%

15.2%

Finance

day

hour

day

18.9%

28.9%

22.6%

AN US

Table 7: Statistics related to the Coverage measure.

Economy topic, which has comparatively the largest number of ememes per T S on average and at maximum, as it can be seen in Table 4, has the poorest Coverage of ememes by the tweets 495

from their T S s, when compared to the other two topics in analysis.

The Prevalence measure allows us to understand the importance of each ememe identified

M

in the T S n in comparison to the other ememes of this T S n . In Table 8 we present the values (in percentage) of the prevailing ememe per T S , and the gap size (in percentage) between the

500

ED

prevailing ememe in the T S and the next closest ememe, which we call the “size of Prevalence”. The obtained results demonstrate that the most prevailing ememes per T S on average are leading by almost the size of their Prevalence. This result is especially interesting since in four

PT

out of six studied cases the average number of identified ememes per T S was over one, and the maximum number of ememes per T S has reached an 11 in the Economy, 8 in the Politics, and 5

505

CE

in the Finance topics, all with hourly time granularity (see Table 4). This brings to the reasonable conclusion that as a general rule a Social Media substream on a broad topic has one prevailing ememe and a number of ememes with relatively much smaller values and importance levels.

AC

The Fidelity measure allows to evaluate the persistence in time of the identified ememes. In

Figure 3 we illustrate the Fidelity measure results on a concrete example of 10 ememes, from the Economy dataset, concerning a thread of discussion on the raise of the minimum wages to $15

510

per hour in USA (only values of Fidelity over 0.1 are presented in Figure 3). From this figure we can notice that even though having equal sets of terms, the two pairs of ememes 1 and 8, and 22

ACCEPTED MANUSCRIPT

Economy TS granularity

Politics

Finance

hour

day

hour

day

hour

day

Prevailing ememe score in percentage

27.4%

28.7%

39.6%

44.5%

49.9%

54.2%

Size of Prevalence in percentage

21.7%

27.9%

36%

44.5%

45.2%

53.8%

CR IP T

Table 8: Statistics related to the Prevalence measure. All values are averaged over all TSs in a dataset.

ememes 2 and 3 have Fidelity of 0.85 due to the fact that the same terms have different degrees of membership in the two different ememes (see Definition 4). 5.4.3. Discussion

The preliminary evaluations on three datasets from a popular Social Media platform demon-

AN US

515

strated the potentiality of the proposed approach. From the point of view of the identification of ememes we believe that the proposed method has presented plausible results in extracting meaningful ememes and in identifying a number of separate ememes in a substream. The defined measures represent a good means to track the evolution of ememes in time through the evaluation of the persistence of an ememe in time, its prevalence in a substream and its representation

M

520

in a substream by the Social Media posts of that substream. An important aspect in the tracking and analysis of evolution of ememes in time relies on the choice of the granularity for the

525

PT

6. Conclusions

ED

partitioning of a Social Media stream, as it has been presented in this section.

In this paper we have addressed the research issue of identifying and tracking ememes in a

CE

stream of Social Media posts. In particular, we have provided a formal definition of ememe and of its formal representation, as well as the definition of a set of measures that allow to evaluate both quantitatively and qualitatively the identified ememes, and to track their change in time in

AC

social media streams.

530

The majority of approaches that have been proposed in the literature to address the issue of

modeling Information Propagation focus on a graph-based representation of Social Networks in which the information is diffusing [28, 29, 30]. In our work we represent instead the content of a stream of Social Media posts as a graph, by means of a Graph-of-Words representation, and we extract ememes from this graph with the k-core graph degeneracy technique. Most impor23

ED

M

AN US

CR IP T

ACCEPTED MANUSCRIPT

AC

CE

PT

Figure 3: Fidelity scores between a thread of ememes. Economy dataset. TS granularity of 1 hour.

1

Figure 3: Fidelity scores between a thread of ememes. 24 Economy dataset. TS granularity of 1 hour.

ACCEPTED MANUSCRIPT

535

tantly, we introduce and formally define a set of measures aimed at evaluating and analyzing the variations and properties of the extracted ememes. To prove the effectiveness of our approach in a computationally feasible way, we have used Twitter as a case study platform. The experimental evaluation of our approach has shown its

540

CR IP T

validity in extracting ememes from streams of Social Media posts. The experiments have also

outlined the validity of the proposed measures, by bringing interesting insights on different aspects of the information evolution and propagation in Social Media. There are important and challenging open issues that we would like to address in the future. As one of our future objectives we aim to define a set of additional measures for the identification of events such as birth,

merge, split, disappearance of ememes; for this task we will take inspiration from different but related fields, including communities evolution in Social Networks [28].

AN US

545

Another challenging future direction concerns the improvement of the analysis of the texts extracted from Social Media to the aim of modeling the impact of negated words and of the qualification of words through adjectives.

Last but not least we also intend to improve the technique aimed at setting the value of the k parameter in the k-core algorithm.

M

550

Moreover we aim at performing more extensive experiments on diverse Social Media platforms, and on various topics and languages, by also investigating the links of this research to the

ED

related fields of Topic/Event Detection and Tracking.

555

References

PT

References

CE

[1] R. Dawkins, The Selfish Gene, Oxford University Press, Oxford, UK, 1976. [2] M. Cha, A. Mislove, K. P. Gummadi, A measurement-driven analysis of information propagation in the flickr social network, in: Proceedings of the 18th International Conference on World Wide Web, WWW ’09, ACM, New York,

AC

NY, USA, 2009, pp. 721–730. doi:10.1145/1526709.1526806.

560

URL http://doi.acm.org/10.1145/1526709.1526806

[3] L. Xie, A. Natsev, J. R. Kender, M. Hill, J. R. Smith, Visual memes in social media: Tracking real-world news in youtube videos, in: Proceedings of the 19th ACM International Conference on Multimedia, MM ’11, ACM, New York, NY, USA, 2011, pp. 53–62. doi:10.1145/2072298.2072307.

URL http://doi.acm.org/10.1145/2072298.2072307

25

ACCEPTED MANUSCRIPT

565

[4] G. Bordogna, G. Pasi, A fuzzy approach to the conceptual identification of ememes on the blogosphere, in: Fuzzy Systems (FUZZ), 2013 IEEE International Conference on, 2013, pp. 1–8. doi:10.1109/FUZZ-IEEE.2013. 6622390. [5] J. Ratkiewicz, M. Conover, M. Meiss, B. Gonc¸alves, S. Patil, A. Flammini, F. Menczer, Truthy: Mapping the spread of astroturf in microblog streams, in: Proceedings of the 20th International Conference Companion on World Wide Web, WWW ’11, ACM, New York, NY, USA, 2011, pp. 249–252. doi:10.1145/1963192.1963301. URL http://doi.acm.org/10.1145/1963192.1963301

CR IP T

570

[6] F. Rousseau, M. Vazirgiannis, Graph-of-word and tw-idf: New approach to ad hoc ir, in: Proceedings of the 22Nd

ACM International Conference on Information & Knowledge Management, CIKM ’13, ACM, New York, NY, USA, 2013, pp. 59–68. doi:10.1145/2505515.2505671. 575

URL http://doi.acm.org/10.1145/2505515.2505671

[7] E. Shabunina, S. Marrara, G. Pasi, An approach to analyse a hashtag-based topic thread in twitter, in: Natural

AN US

Language Processing and Information Systems - 21st International Conference on Applications of Natural Language to Information Systems, NLDB 2016, Salford, UK, June 22-24, 2016, Proceedings, 2016, pp. 350–358. doi:10.1007/978-3-319-41754-7_34. 580

URL http://dx.doi.org/10.1007/978-3-319-41754-7_34

[8] F. Neri, C. Cotta, P. Moscato, Handbook of Memetic Algorithms, Springer Publishing Company, Incorporated, 2011.

[9] H. Beck-Fernandez, D. F. Nettleton, Identification and extraction of memes represented as semantic networks from

585

M

free text online forums, in: MDAI 2013 - Modeling Decisions for Artificial Intelligence, Barcelona, 2013. [10] J. Leskovec, L. Backstrom, J. Kleinberg, Meme-tracking and the dynamics of the news cycle, in: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09, ACM,

ED

New York, NY, USA, 2009, pp. 497–506. doi:10.1145/1557019.1557077. URL http://doi.acm.org/10.1145/1557019.1557077 [11] M. P. Simmons, L. A. Adamic, E. Adar, Memes online: Extracted, subtracted, injected, and recollected., in: L. A. 590

Adamic, R. A. Baeza-Yates, S. Counts (Eds.), ICWSM, The AAAI Press, 2011.

PT

URL http://dblp.uni-trier.de/db/conf/icwsm/icwsm2011.html#SimmonsAA11 [12] F. Bonchi, C. Castillo, D. Ienco, Meme ranking to maximize posts virality in microblogging platforms, Journal of Intelligent Information Systems 40 (2) (2013) 211–239. doi:10.1007/s10844-011-0181-4.

CE

URL http://dx.doi.org/10.1007/s10844-011-0181-4

595

[13] M. JafariAsbagh, E. Ferrara, O. Varol, F. Menczer, A. Flammini, Clustering memes in social media streams, Social Network Analysis and Mining 4 (1) (2014) 237. doi:10.1007/s13278-014-0237-x.

AC

URL http://dx.doi.org/10.1007/s13278-014-0237-x

[14] L. A. Adamic, T. M. Lento, E. Adar, P. C. Ng, Information evolution in social networks, in: Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, WSDM ’16, ACM, New York, NY, USA,

600

2016, pp. 473–482. doi:10.1145/2835776.2835827. URL http://doi.acm.org/10.1145/2835776.2835827

[15] K. Y. Kamath, J. Caverlee, K. Lee, Z. Cheng, Spatio-temporal dynamics of online memes: A study of geo-tagged tweets, in: Proceedings of the 22Nd International Conference on World Wide Web, WWW ’13, ACM, New York,

26

ACCEPTED MANUSCRIPT

NY, USA, 2013, pp. 667–678. doi:10.1145/2488388.2488447. 605

URL http://doi.acm.org/10.1145/2488388.2488447 [16] M. Cataldi, L. Di Caro, C. Schifanella, Emerging topic detection on twitter based on temporal and social terms evaluation, in: Proceedings of the Tenth International Workshop on Multimedia Data Mining, MDMKDD ’10, URL http://doi.acm.org/10.1145/1814245.1814249

610

CR IP T

ACM, New York, NY, USA, 2010, pp. 4:1–4:10. doi:10.1145/1814245.1814249. [17] M. Mori, T. Miura, I. Shioya, Topic detection and tracking for news web pages, in: Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence, WI ’06, IEEE Computer Society, Washington, DC, USA, 2006, pp. 338–342. doi:10.1109/WI.2006.171. URL http://dx.doi.org/10.1109/WI.2006.171

[18] R. Prabowo, M. Thelwall, I. Hellsten, A. Scharnhorst, Evolving debates in online communication: a graph analyti615

cal approach., Internet Research 18 (5) (2008) 520–540.

AN US

URL http://dblp.uni-trier.de/db/journals/intr/intr18.html#PrabowoTHS08

[19] H. Sayyadi, L. Raschid, A graph analytical approach for topic detection, ACM Trans. Internet Technol. 13 (2) (2013) 4:1–4:23. doi:10.1145/2542214.2542215.

URL http://doi.acm.org/10.1145/2542214.2542215 620

[20] P. Meladianos, G. Nikolentzos, F. Rousseau, Y. Stavrakas, M. Vazirgiannis, Degeneracy-based real-time sub-event detection in twitter stream, in: ICWSM, 2015.

[21] H. Sayyadi, M. Hurst, A. Maykov, Event detection and tracking in social streams, in: In Proceedings of the Inter-

M

national Conference on Weblogs and Social Media (ICWSM 2009). AAAI, 2009. [22] Z. Harris, Distributional structure, Word 10 (23) (1954) 146–162. 625

[23] R. Blanco, C. Lioma, Graph-based term weighting for information retrieval, Inf. Retr. 15 (1) (2012) 54–92. doi:

ED

10.1007/s10791-011-9172-x.

URL http://dx.doi.org/10.1007/s10791-011-9172-x [24] M. E. J. Newman, Scientific collaboration networks. II. Shortest paths, weighted networks, and centrality, Phys. Rev. E 64 (2001) 016132.

[25] S. B. Seidman, Network structure and minimum degree, Social Networks 5 (3) (1983) 269 – 287. doi:10.1016/

PT

630

0378-8733(83)90028-X.

URL http://www.sciencedirect.com/science/article/pii/037887338390028X

CE

[26] V. Batagelj, M. Zaversnik, An o(m) algorithm for cores decomposition of networks, arXiv preprint cs/0310049 (2003).

635

URL http://arxiv.org/abs/cs/0310049

AC

[27] W.-J. Wang, New similarity measures on fuzzy sets and on elements, Fuzzy Sets and Systems 85 (3) (1997) 305 – 309. doi:http://dx.doi.org/10.1016/0165-0114(95)00365-7. URL http://www.sciencedirect.com/science/article/pii/0165011495003657

[28] S. Y. Bhat, M. Abulaish, Hoctracker: Tracking the evolution of hierarchical and overlapping communities in

640

dynamic social networks, IEEE Transactions on Knowledge and Data Engineering 27 (4) (2014) 1019–1032. doi:10.1109/TKDE.2014.2349918.

URL http://dx.doi.org/10.1109/TKDE.2014.2349918

27

ACCEPTED MANUSCRIPT

[29] J. Yang, J. Leskovec, Modeling information diffusion in implicit networks, in: Proceedings of the 2010 IEEE International Conference on Data Mining, ICDM ’10, IEEE Computer Society, Washington, DC, USA, 2010, pp. 645

599–608. doi:10.1109/ICDM.2010.22. URL http://dx.doi.org/10.1109/ICDM.2010.22 [30] J. Cheng, L. Adamic, P. A. Dow, J. M. Kleinberg, J. Leskovec, Can cascades be predicted?, in: Proceedings of the

doi:10.1145/2566486.2567997. URL http://doi.acm.org/10.1145/2566486.2567997

AC

CE

PT

ED

M

AN US

650

CR IP T

23rd International Conference on World Wide Web, WWW ’14, ACM, New York, NY, USA, 2014, pp. 925–936.

28