Accepted Manuscript

A Graph-Based Approach to Ememes Identification and Tracking in Social Media Streams
Ekaterina Shabunina, Gabriella Pasi

PII: S0950-7051(17)30484-7
DOI: 10.1016/j.knosys.2017.10.013
Reference: KNOSYS 4075

To appear in: Knowledge-Based Systems

Received date: 14 April 2017
Revised date: 11 October 2017
Accepted date: 12 October 2017

Please cite this article as: Ekaterina Shabunina, Gabriella Pasi, A Graph-Based Approach to Ememes Identification and Tracking in Social Media Streams, Knowledge-Based Systems (2017), doi: 10.1016/j.knosys.2017.10.013

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Università degli Studi di Milano-Bicocca, Dipartimento di Informatica Sistemistica e Comunicazione, Viale Sarca 336, 20126 Milan, Italy
Abstract

A meme, as defined by Richard Dawkins, is a unit of information, a concept or an idea that spreads from person to person within a culture. Examples of memes are a musical melody, a catchy phrase, trending news, behavioral patterns, etc. In this article the task of identifying potential memes in a stream of texts is addressed: in particular, the content generated by users of Social Media is considered as a rich source of information offering an updated window on world happenings and on people's opinions. A textual electronic meme, a.k.a. ememe, is here considered as a frequently replicated set of related words that propagates through the Web over time. In this article an approach is proposed that aims to identify ememes in Social Media streams represented as graphs of words. Furthermore, a set of measures is defined to track the change of information over time.

Keywords: Meme, Ememe, Information Evolution, Information Propagation, Text Graph, Graph Degeneracy, Fuzzy Logic
1. Introduction
A meme, as defined by Richard Dawkins [1], is a "unit of information" or a concept that spreads from person to person within a culture. An example of a meme can be a musical melody or a whole song, a catchy phrase or a word, trending news, a behavioral pattern, etc. Memes can be manifested in several ways and forms, ranging from images [2] and videos [3] to texts [4, 5]. In the latter case, the concepts/ideas behind a meme are expressed by means of sentences in natural language, and they are replicated and spread by (slight) variations of the employed vocabulary.

Email addresses: [email protected] (Ekaterina Shabunina), [email protected] (Gabriella Pasi)

Nowadays, the most common way to propagate opinions and ideas on the Web is constituted by Social Media, where users can freely express themselves: in the 21st century User Generated Content (UGC) represents the voice of the people. This is due to the fact that, in the Web 2.0, users are not subject to external control over the content they generate, while in the previous instance of the Web only professionals were allowed to publish. In particular, micro-blogging platforms boost the creation of UGC and, therefore, they constitute an important source of news and opinions on world happenings, which diffuse at real-time speed. Thus, they offer to both Social Media users and researchers an invaluable repository of potential information.

This social context offers an unprecedented opportunity to analyse the content generated by millions of users from different countries and cultural environments with the aim of extracting cultural trends and ideas, and thus of identifying memes. The ability to model a stream of messages and track its change over time would bring high benefits to many fields of research. In [4] this issue was first outlined with the aim of identifying and tracking memes on the blogosphere, and the word ememe was proposed to denote an "electronic meme".

Based on the findings reported in [4], in this paper we propose an approach that relies on a graph-based representation of Social Media posts with the aim of identifying ememes in any Social Media platform. To this purpose, the Graph-of-Words representation introduced in [6] is employed to model a stream of social posts. Moreover, an enhanced k-core degeneracy process is applied to the obtained graph with the aim of identifying and extracting the ememes hidden in the Social Media stream. The major novelties of the work reported in this article are the definition of ememe and of its formal representation, and the definition of a set of measures that allow the evaluation, both quantitative and qualitative, of the identified ememes, and the tracking of their change over time in social streams.

An experimental evaluation, on the micro-blogging platform Twitter, of both the proposed approach and the measures applied to the extracted ememes has demonstrated their validity and applicability.

Twitter has been selected as a case study due to its wide usage and popularity, as well as for the useful functionalities it offers. Hashtags in particular allow users to explicitly specify the main topic of tweets.
It is important to point out that in this work hashtags are not considered as ememes; rather, they are topics composed of subtopics and fluctuations of opinions, as investigated in our earlier work [7]. To reduce the computational complexity of the whole approach, hashtags are used as a topical filter of the considered Twitter streams, so as to reduce both the volume of tweets to be analyzed and the related potential ememes. Additionally, the identification of ememes related to a predefined topic allows us to control some problems typical of Natural Language Processing (NLP), such as word sense disambiguation, thanks to the explicit knowledge of the topical context.

The article is organized as follows: Section 2 provides an extensive review of the related literature; Section 3 introduces the scientific background and the techniques used in the approach proposed in this paper; Section 4 introduces the novel method for the identification of ememes in Social Media streams, the proposed formal representation of an ememe and the proposed set of measures for the evaluation of the identified ememes; Section 5 reports the experimental evaluation of the proposed method; and finally, Section 6 concludes the work and indicates some future research directions.
2. Related Literature

In 1976, in his book The Selfish Gene [1], Richard Dawkins coined the term meme to denote a "unit of human cultural evolution", analogous to a gene in genetics. Dawkins defined memes as "selfish replicators", whose mission is to survive by leaping from brain to brain in a process of imitation. Dawkins also introduced three properties that a meme should have: 1) longevity, the length of the life span of a replicator; 2) fecundity, the rate at which copies of the meme are created; and 3) copy-fidelity, the fidelity of the replicated copies to the initial meme.

In Computer Science a first application of the meme notion gave origin to the so-called "memetics" (by analogy with genetics), which is a branch of evolutionary computation [8].

Besides memetics, which is not the focus of the research reported in our article, the notion of meme has played an important role in the context of the Web 2.0, in particular in relation to Social Media. Here, in the realm of User Generated Content, a concept or an idea can easily replicate and propagate through time. The work reported in [4] was one of the first contributions that addressed the issues of the representation, and of the definition of measures that characterize, "Internet textual memes", which the authors name ememes. In [4] an ememe is defined as "a unit of information (idea) replicated and propagated through the Web by one or more of the Internet-based services and applications". Specifically, the authors propose to formally define an ememe as a micro ontology, expressed in OWL, generated by posts on the blogosphere. Additionally, they propose a set of measures that characterize an ememe: replication fidelity, mutation degree, spread, and longevity.
A semi-automatic approach to meme identification is proposed in [9]. Here memes are extracted from a corpus of documents; in this approach, a concept is defined as an n-gram that occurs in a significant number of documents in the corpus. Two concepts are considered related when they frequently co-occur in the same documents (a frequency threshold is set by the user). A semantic network is then constructed, where nodes are entity concepts and edges express relations between concepts. A meme is defined as a minimal semantic network, i.e. the network obtained by removing redundant relationships between concepts. A major limitation of this approach is that it is not fully automatic, as it requires user feedback.

The meme-tracking approach proposed in [10] assumes that memes are frequently quoted phrases and their variations. To identify memes, a phrase graph is built, where the nodes represent the quoted phrases, and the directed edges represent the decreasing edit distance between the phrases. As an application of the proposed technique, both news and blog articles are explored. To measure the information spread, the authors propose to examine the volume of blogs and news media that contain the quoted phrase at each timestamp, and they study the time difference between news media and blogs in mentioning a phrase.

Similarly to [10], in [11] a meme is considered as a frequently quoted phrase, but the main focus of the work is to explore the mutation characteristics of memes depending on the size of the phrase, on the media source or social network source, and on time.
A meme ranking problem is introduced in [12], where the objective is to maximize the overall network activity by selecting which memes to show to the user. A set of "heuristic rules" is proposed, which estimates the probability that "short snippets of text, images, sounds or videos" (or memes, as the authors call them) that users post in social media will go viral. The proposed heuristics take into account the cosine similarity of a meme to the user's profile, the similarity among users, and the intrinsic decay of the repost probability over time.

A different approach to meme identification in Twitter is proposed in [13], where tweets are initially grouped into "protomemes", i.e. groups that share one instance of an entity, such as a hashtag, a mention, a URL, or a phrase. The identified protomemes are then clustered based on a set of similarity measures applied to content, network and user types. The obtained clusters are assumed to be the memes.

One of the most recent studies on memes is presented in [14]. In this large-scale work carried out on the Facebook social network, the focus was on the imperfections in the fidelity of the information being propagated.
As we have seen, in the works proposed in the Computer Science literature many perceptions of how the notion of meme can be interpreted and instantiated have appeared, and several interpretations of "ememes" have been provided. A synthesis of the proposed definitions is reported here below. The software system Truthy presented in [5] assumes that a meme may have a number of types, such as "hashtags" and "mentions" on Twitter, URLs, and the preprocessed text of the tweet itself. In [15] a Twitter hashtag is considered as an online meme and as an event. In [12] the meme notion is associated with "short snippets of text, photos, audio, or videos" that a user posts in Social Media. In [13] the term meme is used to denote sets of topically coherent tweets. In [10, 11, 14] memes are assumed to be frequently quoted phrases. A micro ontology representation of an ememe is proposed in [4]. In [9] a meme is represented as a semantic network composed of entity concepts with strong links.

This variety of definitions, representations and perceptions of how a meme can be interpreted on the Web highlights the absence of a unique interpretation of the meme concept applied to Social Media, and the need to define one, which serves as the motivation of our work.

According to the above mentioned literature, a meme, defined originally as a "unit of human cultural evolution" [1], can only be evaluated a posteriori by domain experts based on historical data. The automatic detection of memes is a very difficult task, due to the fact that the persistence of a meme, and therefore its aliveness, is strongly influenced by people. Not every viral message that has received bursty attention can be defined as a meme.

In our work we propose a model for detecting potential memes. We refer to these potential memes as "ememes", since in Social Media a viral spread of a catchy idea or an opinion can quickly become an "electronic meme", which has to be further evaluated by time and by domain experts to determine whether it is a true meme or not. Therefore, in our work we do not claim to detect true memes; rather, we try to discover potential memes, a.k.a. "ememes", in a stream of Social Media data.
3. Graph-Based Representation and Tracking of Textual Information in Social Media

The aim of this paper is threefold: 1) to formally represent Social Media posts by means of a graph-based representation; 2) to extract potential memes (that we call ememes) from the obtained graph-based representation; and 3) to track the evolution in time of the identified ememes.

These tasks have some similarities with other related research fields that in recent years have attracted the attention of the scientific community. These include Topic Detection and Tracking [16, 17, 18, 19] and Event Detection [20, 21]. However, it is worth underlining that even though both events and memes can be formally defined as sets of interdependent terms, the main difference is that an event is constrained by time while the notion of a meme is not. On the contrary, one of the main characteristics of a meme is its persistence in time. Nonetheless, an event may give rise to a number of memes related to the event itself or to some aspects of it, but this is not mandatory. Similarly, compared to the definition of a meme, the notion of a topic differs due to its higher generality. In fact, a topic may encompass several memes at different points in time.

In the following, the issue of formally representing a text is addressed. A core aspect of any text analysis is the choice of the text representation model. The most common
approach is called the bag-of-words model, which was first mentioned by Zellig Harris [22]. This simple representation reduces a text to a set of words, disregarding syntax. An alternative approach is to represent a text as a graph, where the nodes are the words and the edges represent meaningful relations between words. This representation is richer than the bag-of-words model, because it carries additional information: a graph-based representation of texts allows the exploitation of the connections between words. Moreover, appropriate weights associated with the edges can represent the strength of the relationship between words. Furthermore, graphs allow the measurement of some useful properties of words (centrality, connectivity, etc.). For the above reasons, our choice is to represent texts as graphs.

Currently, a variety of works in the literature propose graph-based representations of texts; however, it is not within the scope of our work to give an extensive review of them. A broad review of graph-based approaches to text representation is provided in [23].

In our work, to model a stream of Social Media posts we employ the graph-based representation called Graph-of-Words, the unweighted and directed version of which was first introduced in [6]. In our approach we employ the weighted and undirected Graph-of-Words, as reported in
Definition 1.

Definition 1 (Graph-of-Words). Let G = (V, E) be a weighted undirected graph, where V is a set of nodes and E is a set of edges. We denote by Wv : V → R+ the weight mapping on the nodes, and by We : E → R+ the weight mapping on the edges. The nodes of the graph represent unique terms, and edges connect terms that co-occur in the considered text.

In a weighted graph the weighted degree of a node v is computed as the sum of the weights of its incident edges [24]. We assume that the function Wv computes the weighted degree of a node, i.e. a node's weight is its weighted degree.

In Definition 2 we introduce the concept of k-core graph degeneracy [25], which we employ to identify the potential ememes in the Graph-of-Words representing the stream of Social Media posts.

Definition 2 (k-core). Let G = (V, E) be a graph and k be a positive real number. A subgraph Hk = (V0, E0) of the graph G, induced by the subset of nodes V0 ⊆ V and the subset of edges E0 ⊆ E, is called a k-core, or a core of order k, of graph G iff deg(v) ≥ k, ∀ v ∈ V0, and Hk is the maximum subgraph satisfying this property, i.e. it cannot be augmented without losing it.
More specifically, in our work we define and apply an extension of the k-core degeneracy algorithm, which considers the weighted degrees of the nodes in the Graph-of-Words instead of the classic degrees. Thus, the k-core of a graph is the maximal connected subgraph in which each node has a weighted degree of at least k within the subgraph. The set of all k-cores of the graph forms the k-core decomposition of the graph.

The k-degenerated graph notion was introduced by Stephen Seidman in 1983 [25]; later, an efficient algorithm was introduced in [26] for determining the core decomposition of a graph with O(max(m, n)) time complexity, which equals O(m) for connected networks, where n is the number of nodes and m is the number of edges. This algorithm is based on the recursive elimination of all nodes of degree less than k, together with the edges incident to them.
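To make the weighted variant concrete, the following sketch is an illustrative reading of the recursive-elimination idea with weighted degrees; it is our own code, not the authors' implementation, and the function name and the edge-map input format are assumptions (for brevity, the sketch returns the surviving node set and omits the connectivity check):

```python
def weighted_k_core(edges, k):
    """Return the node set of the weighted k-core of an undirected graph.

    `edges` maps a (u, v) node pair to a positive edge weight.  Nodes
    whose weighted degree (sum of incident edge weights) falls below k
    are removed iteratively, mirroring the recursive elimination of the
    classic algorithm but with weighted degrees.
    """
    # Build a weighted adjacency structure.
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    changed = True
    while changed:
        changed = False
        for node in list(adj):
            # Weighted degree = sum of incident edge weights.
            if sum(adj[node].values()) < k:
                for nb in adj.pop(node):
                    del adj[nb][node]  # drop incident edges as well
                changed = True
    return set(adj)
```

Removing one node may push a neighbour's weighted degree below k, which is why the sweep is repeated until no further node is eliminated.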
In Section 4 the proposed process for identifying and measuring ememes is defined formally and in depth. Meanwhile, a general overview of the overall process is the following: given a finite stream of Social Media posts collected over a finite temporal window, we define an iterative procedure to construct a graph (Graph-of-Words) that formally represents this stream. To this aim, we first define a graph associated with the first post in the stream, then we iteratively update it by generating and adding (one by one) the graphs of the next posts of the considered stream. We then apply to the obtained graph the k-core degeneracy algorithm to extract the core of the graph, which we assume to represent an ememe or a set of ememes. Then, we repeat this procedure for the subsequent stream of Social Media messages collected in an equally sized time window, and we obtain its k-core. This procedure is further repeated for a pre-defined number of streams of Social Media posts collected in temporally equal and sequential time windows. It is important to notice that each Social Media stream may contain a different number of posts and a different number of ememes. Once the ememes of each target stream have been identified, we apply the ememe measures to track the temporal and topical changes of the ememes identified in these sequential streams of Social Media posts.
4. Formal Representation of an Ememe and its Measures
In the following subsections we define the process of formally representing a stream of Social Media posts (Section 4.1), the process aimed at identifying and formally representing the ememes of the considered Social Media stream (Section 4.2), and the measures that we have defined in order to evaluate different aspects of a potential ememe and of its evolution through time (Section 4.3).

4.1. The Ememe Identification Process

We assume to have a finite stream of Social Media posts. We partition this finite stream into N sequential substreams of posts appearing in time windows of equal duration. We denote these sequential substreams as TS_1, TS_2, ..., TS_n, ..., TS_N. The granularity of the considered time window may depend on both the topic and the purpose of the analysis, and it could be an hour, a day, a week, etc. The smaller the granularity, the more potential ememes can be identified, but the noise can also increase. The larger the granularity, the simpler it is to verify the persistence in time of the identified ememes. In Section 5 we will demonstrate how different time granularities impact the quality, quantity and topical aspects of the identified ememes. Partitioning an observed stream of posts into substreams TS_n is necessary in order to perform an analysis of the evolution of ememes over time.
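The partitioning into equal-duration windows can be sketched as follows; this is illustrative code of ours (the function name, the `(timestamp, text)` input format, and anchoring the first window at the first post are all assumptions, since the paper does not prescribe an implementation):

```python
from datetime import datetime, timedelta

def partition_stream(posts, window):
    """Partition a finite stream of (timestamp, text) posts into
    substreams TS_1, ..., TS_N of equal duration `window`.

    Posts are assumed sorted by timestamp; the first post anchors TS_1.
    """
    if not posts:
        return []
    start = posts[0][0]
    substreams = []
    for ts, text in posts:
        index = int((ts - start) / window)   # which window this post falls in
        while len(substreams) <= index:
            substreams.append([])            # keep empty windows to preserve the sequence
        substreams[index].append(text)
    return substreams
```

Changing `window` (an hour, a day, a week) directly realizes the granularity trade-off discussed above.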
Figure 1: General scheme of the process of generating the graph-based representation of the TS_n substream of Social Media posts.
In the following we explain the process of constructing the weighted undirected Graph-of-Words G_TSn (see Definition 1 in Section 3) representing the substream of posts TS_n associated with the time window n; the overall process is illustrated in Figure 1. We denote by G_TSn the Graph-of-Words representing the target substream of posts TS_n, and by g_i the Graph-of-Words representing the i-th post p_i of TS_n. Each post p_i of the target TS_n is preprocessed (see Section 5.2 for details), including its conversion to a bag-of-words of the single terms of the post. Then we create the G_TSn of the target TS_n by means of an iterative process. We start by representing the first preprocessed post p_1 of the target substream TS_n as a fully connected graph g_1, where the nodes represent the words of the post. To treat all the words in a single post equally, we initially set all the weighted degrees in g_1 to 1. As outlined in Section 3, the weighted degree of a node is defined as the sum of the weights of its incident edges. Therefore, to share the weighted degree of each node over its incident edges, we compute the weight of each edge as w = 1/(l - 1), where l is the number of nodes in the fully connected g_1. The obtained Graph-of-Words g_1 is the initial G_TSn. Then we repeat the same process for the next post p_2 in the target substream TS_n and obtain its Graph-of-Words representation g_2. The initial G_TSn is then merged with g_2: the weights of all the edges between nodes already existing in G_TSn are incremented by the weights of the corresponding edges in the second post's graph g_2. For those nodes that did not exist in the initial G_TSn, new nodes and corresponding edges are created, maintaining the edge weights from g_2. The weighted degrees of the nodes are recalculated by summing up all the weights on the edges incident to each node. This procedure is iterated for every post in TS_n, to obtain the final Graph-of-Words G_TSn of the Social Media posts substream TS_n:

G_TSn = g_1 + g_2 + ... + g_i + ... + g_M,

where M is the number of Social Media posts in the substream TS_n. Based on the process illustrated above, the larger and the more general (in terms of topics) the substream TS_n, the bigger the Graph-of-Words associated with it.

We normalize the weighted degrees of the nodes of the final Graph-of-Words G_TSn for a given TS_n to the [0, 1] interval by dividing all the node weights by the largest node weight in G_TSn. Similarly, we normalize the weights of the edges by dividing them by the largest edge weight in G_TSn.

The described process is then repeated for all the subsequent substreams in the given stream of Social Media posts. Thus, we obtain a temporal sequence of Graph-of-Words representations of all the substreams: TS_1 → G_TS1, ..., TS_n → G_TSn, ..., TS_N → G_TSN.

Then we apply the k-core degeneracy algorithm to each final and normalized graph G_TSn.
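The iterative construction, merging and normalization just described can be sketched in a few lines. This is an illustrative reading of the procedure (the function name and the input format, a list of term lists, are our assumptions): each post contributes a clique whose edges weigh 1/(l - 1), cliques are merged by summing edge weights, and node and edge weights are finally divided by their maxima.

```python
from collections import defaultdict

def build_graph_of_words(posts):
    """Build the normalized weighted undirected Graph-of-Words of a
    substream.  `posts` is a list of posts, each a list of unique
    preprocessed terms.  Returns (edge_weights, node_weights), both
    normalized to [0, 1].
    """
    edges = defaultdict(float)
    for terms in posts:
        l = len(terms)
        if l < 2:
            continue  # a single-term post yields no edges
        w = 1.0 / (l - 1)  # share each node's unit weighted degree over its edges
        for i in range(l):
            for j in range(i + 1, l):
                edges[frozenset((terms[i], terms[j]))] += w  # merge by summation
    # Weighted degree of a node = sum of its incident edge weights.
    degree = defaultdict(float)
    for pair, w in edges.items():
        for node in pair:
            degree[node] += w
    max_e = max(edges.values(), default=1.0)
    max_d = max(degree.values(), default=1.0)
    return ({p: w / max_e for p, w in edges.items()},
            {n: d / max_d for n, d in degree.items()})
```

Using `frozenset` pairs as edge keys makes the graph undirected: the edge (a, b) and the edge (b, a) accumulate into the same entry.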
In Section 5.3 a discussion related to the setting of the parameter k is presented. The application of the k-core degeneracy process is aimed at identifying the core vocabulary of the ememes in TS_n. In fact, the outcome of this algorithm is a set of subgraphs: the nodes of each subgraph represent a set of words that predominantly co-occur in the observed Social Media posts substream. We claim that these subgraphs may be considered as a basis for the representations of the ememes of the substream TS_n.

Definition 3 (k-core of the Graph-of-Words as an Ememes Basis). An ememes basis B_TSn is the set of subgraphs (one or more) generated by applying the k-core graph degeneracy to the final Graph-of-Words G_TSn associated with the substream TS_n.

Thus, the ememes basis B_TSn represents the core granules of information in the stream TS_n. To further verify the quality of the subgraphs in the ememes basis B_TSn, an outlier detection technique is applied, which aims at identifying nodes with a significantly high weighted degree. We call these nodes "central concepts". The extraction of the "central concept(s)" of a subgraph forces it to split into a number of additional subgraphs. In this way we can identify the ememes that spin around the same central concepts or entities.
4.2. Formal Representation of an Ememe

As outlined in the previous section, each subgraph generated by the application of the enhanced k-core algorithm to the Graph-of-Words G_TSn associated with a substream TS_n constitutes a basis for the formal representation of an ememe. We then propose to derive from each subgraph a formal representation of the associated ememe. In particular, we represent an ememe as a fuzzy subset of the terms contained in the subgraph, as specified in Definition 4.

Definition 4 (Ememe). Given a substream TS_n of Social Media posts, an ememe m_i is represented as a fuzzy subset of the terms of the i-th subgraph from the ememes basis B_TSn.

V_mi denotes the vocabulary employed in the ememe representation; it is composed of the terms associated with the nodes of the i-th subgraph of the ememes basis B_TSn (Definition 3). The degree of membership of each word is the normalized weight of the corresponding node in the i-th subgraph of the ememes basis B_TSn:

μ_mi : V_mi → [0, 1],

where μ_mi denotes the membership function of the ememe m_i.

The fuzzy subset representing the ememe m_i is denoted as follows:

F_mi = { μ_mi(v_1)/v_1, μ_mi(v_2)/v_2, ..., μ_mi(v_j)/v_j },

where the notation μ_mi(v_j)/v_j denotes a term v_j ∈ V_mi and its associated membership value.

Furthermore, we introduce a fuzzy relation R_mi associated with the ememe m_i; the membership function of the fuzzy relation is defined on the cartesian product V_mi × V_mi of the vertices of the i-th subgraph from the ememes basis B_TSn (Definition 3). The membership value of a pair of vertices (v_i, v_j) is the weight of the edge connecting the two nodes in the considered subgraph.

4.3. Ememe Measures
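As an illustration, the fuzzy subset of Definition 4 can be materialized as a plain mapping from terms to membership degrees. The helper below is our own sketch (the function name is invented, and re-normalizing by the largest node weight within the subgraph is our reading of "normalized weight"):

```python
def ememe_from_subgraph(node_weights):
    """Represent an ememe as a fuzzy subset of terms: a mapping from
    each term of a k-core subgraph to its membership degree in [0, 1],
    obtained by dividing each node weight by the subgraph's maximum.
    """
    top = max(node_weights.values())
    return {term: w / top for term, w in node_weights.items()}
```

For example, a subgraph with weighted degrees `{'meme': 4.0, 'viral': 2.0, 'web': 1.0}` yields the fuzzy subset `{1.0/meme, 0.5/viral, 0.25/web}` in the paper's notation.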
In this section we define the measures that we propose to characterize an ememe. The proposed measures allow: a) to validate whether an identified ememe possesses the main characteristic of a meme, i.e. persistence in time; b) to track and measure the change of ememes as they propagate through time; and c) to identify the most dominant ememe(s) for each time window in the target Social Media posts stream.

For the purpose of defining the proposed measures, we formally define a Social Media post.

Definition 5 (Representation of a Social Media post). A Social Media post p_j of TS_n is a set of words.

Each Social Media post p_j of TS_n is represented as the set of words that it contains, excluding duplicate words and stopwords (see Section 5.2 for details). In this case the membership function is binary: a word of the post is either included in the set or not. In fact, as Social Media posts are usually very short texts, the crucial point is the presence of a word and not the count of its occurrences.
To verify an ememe's most important characteristic, i.e. its persistence in time, and to model and track the different transformations that an ememe can undergo, in the following we propose a set of measures. To this aim we have taken inspiration from the ememe-characterizing measures proposed in [4], and we define: relatedness of an ememe to a Social Media post, coverage, prevalence, and fidelity.

Definition 6 (Relatedness of an ememe to a Social Media post). The relatedness of an ememe m_i to a Social Media post p_j is the degree of inclusion of the fuzzy subset representing the ememe m_i in the set representing the post p_j:

Rel(m_i ⊆ p_j) = Count(p_j ∩ m_i) / Count(p_j) = [ Σ_{t ∈ V_pj} min(μ_pj(t), μ_mi(t)) ] / [ Σ_{t ∈ V_pj} μ_pj(t) ],

where t is a term in the post's vocabulary V_pj.

The relatedness of an ememe to a Social Media post is calculated based on the terms in common between the ememe's fuzzy subset and the post's classical set. Since the ememe's terms are weighted (by the degree of membership of these terms in the ememe's fuzzy subset), the relatedness score is simply calculated as the sum of the weights of the terms that the post has in common with the ememe, divided by the number of terms in the post.

Definition 7 (Coverage). The coverage of an ememe m_i is the percentage of the TS_n posts p_j that have a relatedness score to the ememe m_i above a threshold J.
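A direct reading of Definitions 6 and 7 in code is given below; this is an illustrative sketch of ours (function names and the percentage convention are assumptions). Since a post's membership function is binary, min(μ_pj(t), μ_mi(t)) reduces to the ememe membership of each of the post's terms, and the denominator to the number of terms in the post:

```python
def relatedness(ememe, post):
    """Degree of inclusion (Definition 6) of an ememe's fuzzy subset in
    a post.  `ememe` maps terms to membership degrees in [0, 1];
    `post` is a set of words with binary membership 1.
    """
    if not post:
        return 0.0
    # Sum of the ememe memberships of the post's terms, over post length.
    return sum(ememe.get(t, 0.0) for t in post) / len(post)

def coverage(ememe, posts, threshold):
    """Percentage (Definition 7) of posts whose relatedness to the
    ememe exceeds the threshold J."""
    hits = sum(1 for p in posts if relatedness(ememe, p) > threshold)
    return 100.0 * hits / len(posts)
```

For instance, an ememe `{'a': 1.0, 'b': 0.5}` scores (1.0 + 0.5) / 4 = 0.375 against the post `{'a', 'b', 'c', 'd'}`.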
PT
In Section 5.4 we present a discussion related to the setting of the threshold J. Coverage measures the proportion of posts in the T S n substream that ”include” the ememe
325
CE
mTi S n . This measure allows to assume that a Social Media post may include more than one ememe.
In Definition 8, we propose a measure to evaluate the importance of each ememe in compar-
AC
ison to other ememes of the target T S n . Prevalence can be used to identify the most dominant ememe for each Social Media posts substream. Definition 8 (Prevalence). Each Social Media post pTj S n from T S n is evaluated to establish to
330
which ememe it has the highest score of relatedness. Then the prevalence of an ememe in a T S n is calculated by computing the percentage of posts from T S n that have the highest relatedness score to that ememe. 13
ACCEPTED MANUSCRIPT
The measure of ememe prevalence allows us to comparatively evaluate the predominance of each ememe identified in the considered TS_n with respect to the other ememes of that TS_n. It is calculated by choosing, for each Social Media post, the ememe to which the post has the highest relatedness score, in other words, the ememe that is most similar to the post. Hence, we obtain the importance of each ememe for the target TS_n through the percentage of the TS_n posts that each ememe "represents". By comparing the prevalence values of all ememes in TS_n we identify the most dominant ememe of the TS_n.
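A minimal sketch of the Prevalence computation (Definition 8), assuming ememes are represented as fuzzy term sets and posts as term sets; the names and toy data are ours:

```python
def relatedness(ememe, post_terms):
    # Sum of membership degrees of shared terms / number of post terms.
    return sum(ememe.get(t, 0.0) for t in post_terms) / len(post_terms)

def prevalence(ememes, posts):
    """Definition 8: for each post pick the ememe with the highest
    relatedness, then return the percentage of posts 'won' by each ememe."""
    wins = [0] * len(ememes)
    for p in posts:
        scores = [relatedness(m, p) for m in ememes]
        wins[scores.index(max(scores))] += 1
    return [100.0 * w / len(posts) for w in wins]

ememes = [{"economy": 0.9, "growth": 0.6},   # hypothetical ememe 1
          {"china": 0.8, "trade": 0.5}]      # hypothetical ememe 2
posts = [{"economy", "growth"},
         {"china", "trade", "deal"},
         {"economy", "news"}]
print(prevalence(ememes, posts))  # ememe 1 prevails in 2 of 3 posts
```

The ememe with the largest returned percentage is the dominant ememe of the substream.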
In Definition 9 we introduce a measure called Fidelity for the evaluation of the similarity between two ememes. Fidelity is a comparative measure: it is calculated for an ememe of interest from one substream in relation to all other ememes in previous and subsequent substreams. The calculation of the Fidelity measure is presented for two hypothetical ememes, m_i^{TS_n} from the substream TS_n and m_j^{TS_{n+m}} from one of the subsequent substreams TS_{n+m}.

Definition 9 (Fidelity). Fidelity is the value of similarity between ememe m_i^{TS_n} and ememe m_j^{TS_{n+m}}. The similarity between the fuzzy subsets representing the two ememes is computed as suggested in [27]:
$$S(m_i^{TS_n}, m_j^{TS_{n+m}}) = \frac{1}{h} \sum_{k=1}^{h} \frac{\min\big(\mu_{m_i^{TS_n}}(v_k),\ \mu_{m_j^{TS_{n+m}}}(v_k)\big)}{\max\big(\mu_{m_i^{TS_n}}(v_k),\ \mu_{m_j^{TS_{n+m}}}(v_k)\big)},$$

where v_k is a term of the vocabularies of the two ememes, and h is the number of terms in the vocabulary of the ememe with the highest number of terms.
The measure of fidelity allows us to verify the ememe's main characteristic: its persistence in time. It allows us to identify whether a given ememe is preserved in the previous and next time windows.
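The similarity underlying Fidelity can be sketched as follows, using the union of the two ememes' vocabularies as the comparison domain (a minimal sketch; function names and toy data are ours):

```python
def fidelity(m1, m2):
    """Similarity between two fuzzy term sets, following [27]: sum of
    min/max ratios of membership degrees, divided by the size h of the
    larger vocabulary. Terms absent from one ememe contribute 0."""
    h = max(len(m1), len(m2))
    total = 0.0
    for v in set(m1) | set(m2):
        a, b = m1.get(v, 0.0), m2.get(v, 0.0)
        if max(a, b) > 0:
            total += min(a, b) / max(a, b)
    return total / h

# Two hypothetical ememes with equal supports but different degrees:
e1 = {"wage": 0.9, "minimum": 0.7, "hour": 0.4}
e2 = {"wage": 0.6, "minimum": 0.7, "hour": 0.5}
print(fidelity(e1, e2))  # below 1 despite identical term sets
```

This also illustrates why two ememes with equal term sets can have a Fidelity below 1: the same terms may carry different membership degrees in the two ememes.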
We emphasize that all the previously presented definitions and measures are applicable to any Social Media platform.
5. Experimental Evaluation

The method proposed in this paper is aimed at both identifying ememes in social media streams and tracking their evolution in time. As outlined in Section 3, this task differs from the tasks of Topic/Event Detection and Tracking, which have been widely explored in the literature. Due to the novelty of this task, the literature lacks baseline methods and datasets for comparative evaluations. The prior works on memes/ememes, presented in Section 2, have performed preliminary and specific evaluations. To evaluate the approach proposed in this article we aim to assess two distinct aspects: the quality of the ememes extracted from a stream of Social Media posts, and a quantitative evaluation of the proposed ememe measures. In Section 5.1 we introduce the datasets we created specifically for this purpose. Then we describe the performed data preprocessing steps (Section 5.2) and the parameter setting phase (Section 5.3). Finally, in Section 5.4 we present the results obtained by applying the proposed method to the generated datasets.

5.1. Datasets

In this section we describe the datasets that have been created to evaluate the proposed approach w.r.t. the tasks of ememe representation, their identification in Social Media, and the measurement of their properties.
We have chosen the micro-blogging platform Twitter as a case study, due to its explicit purpose of spreading what people think about the world happenings around them, and of sharing the posts of the people they like. For this reason Twitter constitutes a good platform for studying ememes, from their identification to their spreading dynamics. Moreover, Twitter exposes some useful functionalities. For example, hashtags constitute an important Twitter mechanism that allows users to explicitly specify a topic contained in a tweet, i.e., a Twitter post of up to 140 characters. When analyzing the information generated by Social Media we are faced with the problem of the immense amount of UGC available and, consequently, of the number of ememes we can identify. In order to limit the stream of Twitter messages to be analyzed, while maintaining a clear reference to a broad and meaningful topic, we have performed a prefiltering phase on the crawled tweets based on a general hashtag (e.g., #politics). Some tweets contain more than one hashtag; this represents either a number of sub-topics addressed in the tweet or a number of core concepts of the main topic. Additionally, the identification of ememes related to a predefined topic alleviates some difficult problems typical of NLP, such as word sense disambiguation, due to the explicit knowledge of the topical content.

We have created a crawler that collects tweets on three separate topics, using the Twitter Search API 1.1. The three collected datasets consist of tweets in the English language only. The topics we have selected are the following: Economy, Finance, Politics. The statistics related to all three datasets are presented in Table 1.
Topic      Hashtag     Number of days   Dates                     Number of tweets
Economy    #economy    57               01/03/2016 - 26/04/2016   180,690
Politics   #politics   48               09/03/2016 - 26/04/2016   303,358
Finance    #finance    51               06/03/2016 - 26/04/2016   334,660

Table 1: Datasets statistics

5.2. Data Preprocessing
Since in this work we focus on Social Media, the data preprocessing step is particularly important. First of all, the tweets in a stream related to a broad topic have been preprocessed by eliminating special characters (except for the common Social Media symbols such as "@" for mentioning a user and "#" for hashtags, and "-" to preserve hyphenated words), URLs and punctuation, and by performing tokenization. We have not performed stemming or Part-of-Speech tagging; in fact, we have aimed at preserving all possible information in the short Twitter messages. For the same reason we have used a short stopword list to filter out only the most common English words.

From each of the three datasets we have eliminated the hashtag by which the dataset was collected, along with the corresponding word, since it is redundant information contained in each tweet. As part of the Graph-of-Words algorithm we have eliminated all non-unique words per tweet. Since words in Social Media can have intentional misspellings, we do not perform any prefiltering or correction of the words. It is important to notice that the proposed algorithm already eliminates non-important information, since the words that characterize this information are absent from the k-core of the graph of the target Social Media stream. All the timestamps of the tweets have been converted to the UTC+01:00 timezone.
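The preprocessing steps described above can be sketched as follows (a simplified illustration; the regular expressions, the stopword list and the function name are ours, not the exact ones used in the experiments):

```python
import re

STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "on"}  # short illustrative list

def preprocess(tweet, topic_hashtag="#economy"):
    """Simplified tweet cleaning: drop URLs, strip special characters
    while keeping @, # and -, tokenize, remove stopwords and the
    collection hashtag (and its word), and deduplicate the tokens."""
    tweet = re.sub(r"https?://\S+", " ", tweet.lower())  # remove URLs
    tweet = re.sub(r"[^\w\s@#-]", " ", tweet)            # strip punctuation
    word = topic_hashtag.lstrip("#")
    tokens = [t for t in tweet.split()
              if t not in STOPWORDS and t != topic_hashtag and t != word]
    # Graph-of-Words step: keep only unique words per tweet (order preserved).
    return list(dict.fromkeys(tokens))

print(preprocess("The #economy is slowing!! http://t.co/x #economy slowing @imf"))
# -> ['slowing', '@imf']
```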
Retweets have been preprocessed along with the tweets.

5.3. Parameter Setting
In this section we introduce the experimental setup. In particular, we address the issue of setting the k parameter for the k-core graph degeneracy process. We have studied the behavior and the influence of the k parameter on the extracted ememes through an experimental evaluation of the proposed algorithm on the different datasets and on different time window sizes. This analysis allowed us to identify the desired characteristics of the extracted ememes, based on which the setting of the k parameter was performed.
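The generic iterative pruning that underlies a weighted k-core degeneracy can be sketched as follows. This is only an illustration: the exact node-weighting scheme of the proposed method is defined in Section 4.1 and is not reproduced here; the graph, the edge weights and the `k_core` function below are ours:

```python
def k_core(adj, weight, k):
    """Weighted k-core pruning sketch: repeatedly remove nodes whose
    weighted degree falls below k until all remaining nodes satisfy it.
    adj: {node: set(neighbors)}, weight: {(u, v): edge weight}."""
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy

    def deg(u):
        return sum(weight.get((u, v), weight.get((v, u), 0.0)) for v in adj[u])

    changed = True
    while changed:
        changed = False
        for u in [u for u in adj if deg(u) < k]:
            for v in adj.pop(u):     # remove u and its incident edges
                adj[v].discard(u)
            changed = True
    return set(adj)

# Toy co-occurrence graph: a dense triangle plus a weakly attached node.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
w = {("a", "b"): 0.5, ("a", "c"): 0.4, ("b", "c"): 0.6, ("c", "d"): 0.1}
print(k_core(adj, w, k=0.5))  # the weakly connected node 'd' is pruned
```

Raising k prunes more of the graph, which is why its setting directly controls the size and specificity of the extracted ememes.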
                                           Economy           Politics           Finance
TS granularity                             hour     day      hour     day      hour     day
Number of TSs in the dataset               1351     57       1114     48       1145     51
Average number of tweets per TS            133.7    6340.0   272.3    6319.9   292.3    6561.9
Average number of edges per TS             3600.5   65485.7  7164.1   127412.4 8021.7   128987.7
Average number of nodes per TS             647.8    6541.5   1246.6   11303.1  1203.4   10107.4
Perc. of nodes with weight ≥ 0.1           32.6%    1.6%     9.2%     0.55%    6.9%     0.6%
Avg. number of nodes with weight ≥ 0.1     210.9    103.3    114.6    62.5     82.5     60.2
Perc. of nodes with weight ≥ 0.2           10.9%    0.6%     2.7%     0.16%    2.3%     0.23%
Avg. number of nodes with weight ≥ 0.2     70.3     39.2     33.9     17.8     27.3     23.4
Perc. of nodes with weight ≥ 0.3           5.7%     0.33%    1.3%     0.06%    1.2%     0.14%
Avg. number of nodes with weight ≥ 0.3     36.7     21.4     16.1     6.9      14.8     14.4
Perc. of nodes with weight ≥ 0.4           3.6%     0.22%    0.76%    0.04%    0.83%    0.09%
Avg. number of nodes with weight ≥ 0.4     23.5     14.1     9.5      4        10       9.7
Perc. of nodes with weight ≥ 0.5           2.7%     0.16%    0.53%    0.02%    0.61%    0.07%
Avg. number of nodes with weight ≥ 0.5     17.3     10.6     6.6      2.7      7.4      6.6

Table 2: The distribution statistics of the degrees of nodes in all studied datasets. All the values are averaged over all TSs in the considered dataset.

                   Economy          Politics         Finance
TS granularity     hour     day     hour     day     hour     day
k                  0.5      0.4     0.3      0.2     0.3      0.3

Table 3: k parameter setup for each dataset for each TS granularity.
In Table 2, for each dataset, we present the statistics of the distribution of the nodes' degrees averaged over all TSs, for both time granularities of one hour and one day (used for generating the TSs). In Table 3 the k values for each dataset are reported, for both granularity sizes of TS.

5.4. Preliminary Evaluations and Results

In this section we describe the results obtained from the evaluation of the proposed approach on the datasets introduced in Section 5.1, with the k parameter set as described in Section 5.3. We start by giving some examples of the extracted ememes, identified according to the process described in Sections 4.1 and 4.2. Next, we provide a discussion of the results achieved by applying the ememe measures defined in Section 4.3.
                                            Economy        Politics       Finance
TS granularity                              hour    day    hour    day    hour    day
Number of TSs in the dataset                1,351   57     1,114   48     1,145   51
Average number of ememes per TS             2.4     1.2    1.6     1      1.4     1
Maximum number of ememes found in one TS    11      3      8       1      5       2
Average number of words in an ememe         5.4     7.2    6.7     10.5   6.8     8.5

Table 4: Ememes related statistics for each dataset.
5.4.1. Examples of Ememes Identified

In this subsection we provide a few examples of ememes extracted from the employed datasets, along with some statistics related to them.

In Figure 2 we present a sample of the ememes extracted from the Economy dataset with a TS time granularity of one hour; here ememes are represented by means of the words that constitute the support of their fuzzy representations (the support of a fuzzy set is the set containing all elements with nonzero membership). The proposed representation makes visible how ememes diffuse and evolve over a timeline of one month. Each of these ememes is represented by the words that co-occur most often in a tweet over the Economy dataset on an hourly basis, as explained in Section 4.1. In this example, we emphasize with different tonalities in the gray scale and with indentation the two initial ememes (the first one in bold black and the second one in bold gray and with indent), and how these evolve over time, gaining and losing words in their vocabulary. We leave without any font style and tonality change the words that were only used once and that were not propagated to the next ememe instances.

In Table 4 we present some statistics related to the results obtained from the ememes extraction from Twitter streams, including the average and maximum numbers of ememes in a TS, and the average number of words in an ememe. From this table it becomes evident that the average number of ememes per TS is consistently larger for the hourly time granularity than for the daily one, and is greater than 1 throughout all three datasets for the hourly granularity. This interesting effect of the TS time granularity on the number of identified ememes is further visible when looking at the vocabulary of the ememes. For example, in Figure 2 we observe a 14-ememes long evolution and diffusion of the topic related to "climate change" and "support local producers", where the TS time granularity is an hour. Meanwhile, on the same dataset of Economy, by setting the
Figure 2: Evolution of ememes in time. Economy dataset. TS granularity of 1 hour.
Dataset    TS          Ememes
Economy    5am 06.04   1. #sales #business prices back
                       2. growth global fragile weak imf's lagarde merger news
                       3. president
                       4. @dollarvigilante watch japan well land rising sun
                       5. #android
                       6. #money #finance
                       7. hailstorm rains lash parts himachal crops destroyed plan #news
                       8. rebounding southeast asia seeks buffer china slowdown
Finance    2am 03.04   1. @murfo portfolio diversification key maximising return #superannuation #capitalallocation #retirement
                       2. 30 great latest #forex #jobs #money #job #hiring #stocks
Politics   2pm 21.03   1. trump #tcot obama cuba #news
                       2. ireland defends healy rae attacker #news #ireland @whispersnewsltd
                       3. political up #news
                       4. talk without losing friends #debates #2016election #pointsofview #communication #strategy @strategymgzine

Table 5: Ememes examples.
TS time granularity to 1 day, we do not observe a single ememe on this topic. Another example concerns the Politics dataset. With the TS time granularity set to one hour, 62% of the ememes mention Donald Trump; in the same Politics dataset with the TS time granularity set to 1 day, Trump was mentioned in 100% of the tweets. In Table 5 we exemplify a few different samples of ememes identified with our algorithm in the same TS.

While our datasets of tweets were gathered on three topics, from the obtained results it is evident that the selected topics are in turn composed of subtopics associated with numerous ememes throughout the observed datasets. For example, in the hourly partitioning of the Economy dataset, we find 178 ememes concerning "China" that spread throughout the whole observation period. These ememes related to "China" concern China's growth, trade, etc. Similarly, we find other countries, such as the USA, Greece, Iran, Russia, India, the UK, Pakistan, etc., as subtopics, along with common economic matters such as production, investing, oil, stock markets, business, real estate, employment, etc. All of these "subtopics" related to countries or economic matters
merge and split throughout the whole observation period, creating numerous ememes.

It is important to highlight that, as observable from Figure 2 and from Table 5, many ememes' timestamps refer to the night and early morning hours. This is due both to the fact that, as mentioned in Section 5.2, the timestamps of the tweets were converted to the UTC+01:00 (Europe/Rome) timezone, and to the fact that most of the English tweets are geolocated in the USA.

5.4.2. Evaluations of the Ememe Measures
In the following we present the results obtained by applying the proposed measures to the extracted ememes. All the values in the following tables are averaged per ememe of TS_n and over all TSs in the dataset, unless stated otherwise.

Relatedness of an ememe to a tweet: this measure allows us to evaluate how each tweet in the TS_n is related to each ememe extracted from that TS_n. In Table 6 we present the statistics related to the Relatedness measure. In addition to the averaged Relatedness scores, we present the average percentage of tweets with a zero Relatedness score per ememe of TS_n, and the averaged Relatedness measure excluding these zero scores. The average percentage of tweets from TS_n with a zero Relatedness score to each observed ememe in TS_n is in the interval 45% - 71%, with the highest figures for the Economy topic and the lowest figures for the Finance topic. This phenomenon of a large number of tweets unrelated to a specific ememe of a substream TS_n is natural, and it reflects an evident overall diversity of information in the Social Media stream, even in the case of a well defined general topic and for small time granularities such as an hour or a day. As expected, the Relatedness scores are two to five times higher when excluding the tweets that do not relate to the target ememe. An inspection of Tables 4 and 6 brings evidence of a correlation between the percentage of tweets with zero Relatedness scores and the average number of ememes per TS, and of an inverse correlation between the Relatedness score and the average number of ememes per TS.

The Coverage measure allows us to identify the overall percentage of the TS_n tweets that include each ememe of this TS_n. The threshold J (see Definition 7) on the Relatedness score for the Coverage measure is set to 0.1. This value was empirically identified from the statistics illustrated in Table 6, where the averaged Relatedness, without the zero-scored tweets, was over 0.1 for all datasets and all TS granularities. In Table 7 we present the Coverage value (in percentage) averaged per ememe of TS_n over all the TSs for each dataset. On average, ememes were "covered" by 9.5% of TS tweets in the worst case and by 28.9% in the best. Interestingly, the
                                             Economy        Politics       Finance
TS granularity                               hour    day    hour    day    hour    day
Averaged Relatedness                         0.03    0.08   0.04    0.05   0.06    0.09
Percentage of tweets with zero Relatedness   70.9%   72.7%  63.1%   55.4%  58%     44.9%
Averaged Relatedness without zeros           0.15    0.29   0.12    0.1    0.15    0.16

Table 6: Statistics related to the Relatedness of an ememe to a tweet measure.
                                  Economy        Politics       Finance
TS granularity                    hour    day    hour    day    hour    day
Averaged Coverage in percentage   9.5%    14.9%  15.2%   18.9%  28.9%   22.6%

Table 7: Statistics related to the Coverage measure.
Economy topic, which has comparatively the largest number of ememes per TS both on average and at maximum, as can be seen in Table 4, has the poorest Coverage of ememes by the tweets of their TSs, when compared to the other two topics under analysis.

The Prevalence measure allows us to understand the importance of each ememe identified in the TS_n in comparison to the other ememes of this TS_n. In Table 8 we present the values (in percentage) of the prevailing ememe per TS, and the gap (in percentage) between the prevailing ememe in the TS and the next closest ememe, which we call the "size of Prevalence". The obtained results demonstrate that the most prevailing ememes per TS are, on average, leading by almost the size of their Prevalence. This result is especially interesting since in four out of six studied cases the average number of identified ememes per TS was over one, and the maximum number of ememes per TS reached 11 in the Economy, 8 in the Politics, and 5 in the Finance topic, all with hourly time granularity (see Table 4). This leads to the reasonable conclusion that, as a general rule, a Social Media substream on a broad topic has one prevailing ememe and a number of ememes with relatively much smaller values and importance levels.

The Fidelity measure allows us to evaluate the persistence in time of the identified ememes. In Figure 3 we illustrate the Fidelity measure results on a concrete example of 10 ememes from the Economy dataset, concerning a thread of discussion on the raising of the minimum wage to $15 per hour in the USA (only Fidelity values over 0.1 are presented in Figure 3). From this figure we can notice that, even though they have equal sets of terms, the two pairs of ememes 1 and 8, and
                                        Economy        Politics       Finance
TS granularity                          hour    day    hour    day    hour    day
Prevailing ememe score in percentage    27.4%   28.7%  39.6%   44.5%  49.9%   54.2%
Size of Prevalence in percentage        21.7%   27.9%  36%     44.5%  45.2%   53.8%

Table 8: Statistics related to the Prevalence measure. All values are averaged over all TSs in a dataset.
ememes 2 and 3 have a Fidelity of 0.85, due to the fact that the same terms have different degrees of membership in the two different ememes (see Definition 4).

Figure 3: Fidelity scores between a thread of ememes. Economy dataset. TS granularity of 1 hour.

5.4.3. Discussion

The preliminary evaluations on three datasets from a popular Social Media platform demonstrated the potential of the proposed approach. From the point of view of the identification of ememes, we believe that the proposed method has produced plausible results in extracting meaningful ememes and in identifying a number of separate ememes in a substream. The defined measures represent a good means to track the evolution of ememes in time, through the evaluation of the persistence of an ememe in time, its prevalence in a substream, and its representation in a substream by the Social Media posts of that substream. An important aspect in the tracking and analysis of the evolution of ememes in time relies on the choice of the granularity for the partitioning of a Social Media stream, as has been presented in this section.

6. Conclusions

In this paper we have addressed the research issue of identifying and tracking ememes in a stream of Social Media posts. In particular, we have provided a formal definition of ememe and of its formal representation, as well as the definition of a set of measures that allow us to evaluate both quantitatively and qualitatively the identified ememes, and to track their change in time in social media streams.

The majority of approaches that have been proposed in the literature to address the issue of modeling Information Propagation focus on a graph-based representation of the Social Networks in which the information is diffusing [28, 29, 30]. In our work we instead represent the content of a stream of Social Media posts as a graph, by means of a Graph-of-Words representation, and we extract ememes from this graph with the k-core graph degeneracy technique. Most importantly, we introduce and formally define a set of measures aimed at evaluating and analyzing the variations and properties of the extracted ememes.

To prove the effectiveness of our approach in a computationally feasible way, we have used Twitter as a case study platform. The experimental evaluation of our approach has shown its
validity in extracting ememes from streams of Social Media posts. The experiments have also outlined the validity of the proposed measures, by bringing interesting insights on different aspects of information evolution and propagation in Social Media. There are important and challenging open issues that we would like to address in the future. As one of our future objectives we aim to define a set of additional measures for the identification of events such as the birth, merge, split, and disappearance of ememes; for this task we will take inspiration from different but related fields, including community evolution in Social Networks [28].

Another challenging future direction concerns the improvement of the analysis of the texts extracted from Social Media, with the aim of modeling the impact of negated words and of the qualification of words through adjectives.

Last but not least, we also intend to improve the technique aimed at setting the value of the k parameter in the k-core algorithm.

Moreover, we aim at performing more extensive experiments on diverse Social Media platforms, and on various topics and languages, also investigating the links of this research to the related fields of Topic/Event Detection and Tracking.
References
[1] R. Dawkins, The Selfish Gene, Oxford University Press, Oxford, UK, 1976.
[2] M. Cha, A. Mislove, K. P. Gummadi, A measurement-driven analysis of information propagation in the Flickr social network, in: Proceedings of the 18th International Conference on World Wide Web, WWW '09, ACM, New York, NY, USA, 2009, pp. 721-730. doi:10.1145/1526709.1526806.
[3] L. Xie, A. Natsev, J. R. Kender, M. Hill, J. R. Smith, Visual memes in social media: Tracking real-world news in YouTube videos, in: Proceedings of the 19th ACM International Conference on Multimedia, MM '11, ACM, New York, NY, USA, 2011, pp. 53-62. doi:10.1145/2072298.2072307.
[4] G. Bordogna, G. Pasi, A fuzzy approach to the conceptual identification of ememes on the blogosphere, in: Fuzzy Systems (FUZZ), 2013 IEEE International Conference on, 2013, pp. 1-8. doi:10.1109/FUZZ-IEEE.2013.6622390.
[5] J. Ratkiewicz, M. Conover, M. Meiss, B. Gonçalves, S. Patil, A. Flammini, F. Menczer, Truthy: Mapping the spread of astroturf in microblog streams, in: Proceedings of the 20th International Conference Companion on World Wide Web, WWW '11, ACM, New York, NY, USA, 2011, pp. 249-252. doi:10.1145/1963192.1963301.
[6] F. Rousseau, M. Vazirgiannis, Graph-of-word and TW-IDF: New approach to ad hoc IR, in: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, CIKM '13, ACM, New York, NY, USA, 2013, pp. 59-68. doi:10.1145/2505515.2505671.
[7] E. Shabunina, S. Marrara, G. Pasi, An approach to analyse a hashtag-based topic thread in Twitter, in: Natural Language Processing and Information Systems - 21st International Conference on Applications of Natural Language to Information Systems, NLDB 2016, Salford, UK, June 22-24, 2016, Proceedings, 2016, pp. 350-358. doi:10.1007/978-3-319-41754-7_34.
[8] F. Neri, C. Cotta, P. Moscato, Handbook of Memetic Algorithms, Springer Publishing Company, Incorporated, 2011.
[9] H. Beck-Fernandez, D. F. Nettleton, Identification and extraction of memes represented as semantic networks from free text online forums, in: MDAI 2013 - Modeling Decisions for Artificial Intelligence, Barcelona, 2013.
[10] J. Leskovec, L. Backstrom, J. Kleinberg, Meme-tracking and the dynamics of the news cycle, in: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09, ACM, New York, NY, USA, 2009, pp. 497-506. doi:10.1145/1557019.1557077.
[11] M. P. Simmons, L. A. Adamic, E. Adar, Memes online: Extracted, subtracted, injected, and recollected, in: L. A. Adamic, R. A. Baeza-Yates, S. Counts (Eds.), ICWSM, The AAAI Press, 2011.
[12] F. Bonchi, C. Castillo, D. Ienco, Meme ranking to maximize posts virality in microblogging platforms, Journal of Intelligent Information Systems 40 (2) (2013) 211-239. doi:10.1007/s10844-011-0181-4.
[13] M. JafariAsbagh, E. Ferrara, O. Varol, F. Menczer, A. Flammini, Clustering memes in social media streams, Social Network Analysis and Mining 4 (1) (2014) 237. doi:10.1007/s13278-014-0237-x.
[14] L. A. Adamic, T. M. Lento, E. Adar, P. C. Ng, Information evolution in social networks, in: Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, WSDM '16, ACM, New York, NY, USA, 2016, pp. 473-482. doi:10.1145/2835776.2835827.
[15] K. Y. Kamath, J. Caverlee, K. Lee, Z. Cheng, Spatio-temporal dynamics of online memes: A study of geo-tagged tweets, in: Proceedings of the 22nd International Conference on World Wide Web, WWW '13, ACM, New York, NY, USA, 2013, pp. 667-678. doi:10.1145/2488388.2488447.
[16] M. Cataldi, L. Di Caro, C. Schifanella, Emerging topic detection on Twitter based on temporal and social terms evaluation, in: Proceedings of the Tenth International Workshop on Multimedia Data Mining, MDMKDD '10, ACM, New York, NY, USA, 2010, pp. 4:1-4:10. doi:10.1145/1814245.1814249.
[17] M. Mori, T. Miura, I. Shioya, Topic detection and tracking for news web pages, in: Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence, WI '06, IEEE Computer Society, Washington, DC, USA, 2006, pp. 338-342. doi:10.1109/WI.2006.171.
[18] R. Prabowo, M. Thelwall, I. Hellsten, A. Scharnhorst, Evolving debates in online communication: a graph analytical approach, Internet Research 18 (5) (2008) 520-540.
[19] H. Sayyadi, L. Raschid, A graph analytical approach for topic detection, ACM Trans. Internet Technol. 13 (2) (2013) 4:1-4:23. doi:10.1145/2542214.2542215.
[20] P. Meladianos, G. Nikolentzos, F. Rousseau, Y. Stavrakas, M. Vazirgiannis, Degeneracy-based real-time sub-event detection in Twitter stream, in: ICWSM, 2015.
[21] H. Sayyadi, M. Hurst, A. Maykov, Event detection and tracking in social streams, in: Proceedings of the International Conference on Weblogs and Social Media (ICWSM 2009), AAAI, 2009.
[22] Z. Harris, Distributional structure, Word 10 (23) (1954) 146-162.
[23] R. Blanco, C. Lioma, Graph-based term weighting for information retrieval, Inf. Retr. 15 (1) (2012) 54-92. doi:10.1007/s10791-011-9172-x.
[24] M. E. J. Newman, Scientific collaboration networks. II. Shortest paths, weighted networks, and centrality, Phys. Rev. E 64 (2001) 016132.
[25] S. B. Seidman, Network structure and minimum degree, Social Networks 5 (3) (1983) 269-287. doi:10.1016/0378-8733(83)90028-X.
[26] V. Batagelj, M. Zaversnik, An O(m) algorithm for cores decomposition of networks, arXiv preprint cs/0310049 (2003).
[27] W.-J. Wang, New similarity measures on fuzzy sets and on elements, Fuzzy Sets and Systems 85 (3) (1997) 305-309. doi:10.1016/0165-0114(95)00365-7.
[28] S. Y. Bhat, M. Abulaish, HOCTracker: Tracking the evolution of hierarchical and overlapping communities in dynamic social networks, IEEE Transactions on Knowledge and Data Engineering 27 (4) (2014) 1019-1032. doi:10.1109/TKDE.2014.2349918.
[29] J. Yang, J. Leskovec, Modeling information diffusion in implicit networks, in: Proceedings of the 2010 IEEE International Conference on Data Mining, ICDM '10, IEEE Computer Society, Washington, DC, USA, 2010, pp. 599-608. doi:10.1109/ICDM.2010.22.
[30] J. Cheng, L. Adamic, P. A. Dow, J. M. Kleinberg, J. Leskovec, Can cascades be predicted?, in: Proceedings of the 23rd International Conference on World Wide Web, WWW '14, ACM, New York, NY, USA, 2014, pp. 925-936. doi:10.1145/2566486.2567997.