A generic framework and methodology for extracting semantics from co-occurrences

Aditya Ramana Rachakonda ⁎, Srinath Srinivasa, Sumant Kulkarni, M.S. Srinivasan 1

Open Systems Lab, IIIT Bangalore, India

Article history: Received 15 June 2013; Received in revised form 24 May 2014; Accepted 17 June 2014; Available online 27 June 2014

Keywords: Cognitive models; Co-occurrence; Data mining; Text mining

Abstract

Extracting semantic associations from text corpora is an important problem with several applications. It is well understood that semantic associations from text can be discerned by observing patterns of co-occurrences of terms. However, much of the work in this direction has been piecemeal, addressing specific kinds of semantic associations. In this work, we propose a generic framework using which several kinds of semantic associations can be mined. The framework comprises a co-occurrence graph of terms, along with a set of graph operators. A methodology for using this framework is also proposed, where the properties of a given semantic association can be hypothesized and tested over the framework. To show the generic nature of the proposed model, four different semantic associations are mined over a corpus comprising Wikipedia articles. The design of the proposed framework is inspired by cognitive science — specifically the interplay between semantic and episodic memory in humans.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

Extracting latent semantics from human-generated text corpora, like collections of news articles, emails, blog posts and so on, is an important problem with several application areas. A practical approach to acquiring such latent semantics is to observe patterns of term distribution. It is well known that term distributions in human-generated text are not independent of each other, and sets of semantically related terms tend to occur together [1]. In linguistics, this correlation is termed the distributional hypothesis [2]. Because of this, observing occurrence and co-occurrence patterns of terms has become the primary mode of gathering latent semantics from unstructured text; some examples include [3–6]. However, to the best of our knowledge, much of the research effort in this direction has produced piecemeal solutions focused on identifying specific kinds of semantic associations.

In this work, we try to build a theory around the question of how semantic associations can be discerned from co-occurrences and propose a generic framework for using co-occurrence patterns to extract different kinds of semantic associations. In the proposed framework, each document in a corpus is treated as an "episode" and a raw form of semantic knowledge is represented as a weighted graph showing co-occurrences of terms across the corpus. A set of graph operations called primitives is also proposed, using which several kinds of co-occurrence patterns can be discerned. Finally, a methodology is proposed, where a hypothesis about a semantic association can be tested on the co-occurrence graph using the primitives. The validity of the hypothesis is established by testing the outcome of an algorithm with human subjects, as well as by comparisons with related approaches.

⁎ Corresponding author at: Q801, Ajmera Infinity, Electronics City, Bangalore, India 560100. E-mail addresses: [email protected] (A.R. Rachakonda), [email protected] (S. Srinivasa), [email protected] (S. Kulkarni), [email protected] (M.S. Srinivasan).
1 Affiliated with IBM India Software Labs and with IIIT Bangalore as a part-time research scholar.



Unlike traditional natural language processing, in this work we do not rely on rule-based text processing techniques like part-of-speech tagging, dependency trees and so on. Instead, we assume such techniques to be unavailable and use free text along with statistical keyword detection algorithms to measure co-occurrences. The underlying data structures and primitives do not depend on the semantic associations proposed and are constant across all the algorithms. While the techniques used in the specific algorithms (like random walks, clustering, K–L divergence, etc.) are not novel by themselves, the focus of this work is on the generic nature of the proposed framework. A single data structure, along with a small set of graph primitives, can be used to mine different kinds of semantic associations by following a methodology. This way, there is a basis for explaining the rationale behind a semantic association that is mined from the corpus.

The rest of the paper is organized as follows. Section 2 briefly describes related literature in mining latent semantics from text. Section 3 proposes the 3-layer framework for mining semantic associations. Four different latent semantic associations based on this model are then demonstrated in Section 4, and concluding remarks are noted in Section 5.

2. Related literature

In this section, we survey literature in mining latent semantics. The objective here is to provide a realistic backdrop for this work, rather than a comprehensive survey. For the latter, the interested reader may refer to [7–11].

2.1. Co-occurrence graph mining

Co-occurrences represented as a graph are one of the fundamental structures using which the distributional hypothesis can be applied to semantics mining. The way a co-occurrence graph is constructed is usually algorithm specific, as different kinds of co-occurrences are thought to capture different aspects of meaning.

In the context of word-sense disambiguation, Widdows et al. use lists of nouns from a part-of-speech (POS) tagged corpus to construct a co-occurrence graph [12,13]. They then use a graph clustering algorithm to identify significant clusters of words, thereby distinguishing between word senses. In topic mining, Mihalcea and Tarau [14] propose a document-level co-occurrence graph of terms (nouns and adjectives) inside a document and compute a random-walk centrality of the nodes to identify the terms representing the topic of the document. They also show that the same technique can be used on sentences in a document to identify the sentence that best summarizes the document.

There are also several algorithms which use a co-occurrence graph consisting of explicitly heterogeneous nodes connected in a k-partite arrangement. A noun–adjective bipartite co-occurrence graph can be used to determine the sentiments associated with the nouns [15]. A bipartite random walk on such a graph is used to identify important opinions (adjectives) and important product features (nouns) in a given product corpus. Similarly, a noun–verb bipartite co-occurrence graph can be used to mine semantics, especially in speech disambiguation [16]. In our work, we propose a set of semantics mining algorithms on a single co-occurrence graph of terms built over a large textual corpus.

2.2. Dimensionality reduction

An alternate approach to mining latent semantics has focused on discovering non-trivial latent co-occurrences by means of dimensionality reduction, as in Latent Semantic Analysis (LSA) [17].
LSA uses singular value decomposition on a term–document matrix and then collapses the vector space by eliminating all but the top k dimensions, so that document vectors which were far apart in the original space come closer in the new space. Such a recomputed space establishes extraneous associations between documents and terms beyond what was originally captured. Dimension reduction techniques have also been applied to co-occurrence graph mining in Hyperspace Analogue to Language (HAL) [18] and Correlated Occurrence Analogue to Lexical Semantics (COALS) [3]. Both algorithms work on a term–term matrix composed of co-occurrence vectors instead of document vectors for mining semantic associations.

Despite their impressive results, LSA and its variants do not have sound mathematical underpinnings for the extracted semantics. This means that, while terms can be semantically related by collapsing dimensions, LSA is not able to assign a label to such associations. In addition, LSA computations are global in nature, involving the entire corpus, which makes incremental updates to computed semantic relatedness difficult. Several research efforts have tried to extend LSA in different directions as well as explore newer models for capturing latent semantics.

2.3. Generative models

Generative models in semantics started as a mathematically sound extension to LSA. Here, documents in the corpus are considered to be generated by a mixture of one or more random processes. Hofmann proposed pLSI [19], a probabilistic approach for a topical mixture model. Here, a document is modeled as comprising a set of topics, where each topic generates terms with a given probability distribution. Latent Dirichlet Allocation (LDA) [20] is an extension of pLSI, where a document is modeled as generated using a mixture of k (finite) hypothetical topics, and a topic is a probability distribution over all the terms observed in the corpus, based on a Dirichlet prior.


Statistical techniques like expectation maximization or Gibbs sampling are applied to invert this process and identify the set of topics which generated the corpus [21]. LDA has grown in popularity in recent times, with several implementations publicly available and several extensions over the base LDA model.

In contrast to the current literature on latent semantics, our work is primarily based on cognitive modeling of semantics. Rather than model a document as a mixture of probability distributions, we look at the intensional definitions of semantic associations. For instance, rather than viewing a topic as a probability distribution over terms, we approach the question by asking what a "topic" means. We answer this by a notion of semantic aboutness (introduced in Section 3), and address the question of how aboutness can be discerned over co-occurrence patterns. This way, we can provide a rationale for every association that is mined from the corpus.

3. The 3-layer cognitive model

Our approach for mining latent semantics is based on building a "cognitive" model of semantics—in other words, modeling how humans understand meaning. Our proposed model is based on several well-known concepts of semantics in language from Analytic Philosophy and of semantic memory from Cognitive Science.

In Analytic Philosophy, real-world as well as abstract entities are said to be represented as concepts (facts) in the human mind, and their interplay is represented as associations (propositions) [22]. In such a representation, the meaning of a term (written/spoken word) is equivalent to the corresponding concept which gets evoked in the mind. Ordinary Language Philosophy, a branch of analytic philosophy, in turn propounds that terms in a language acquire such meaning through their usage in the language [23]. It thus asserts a relationship between term usage and semantics, thereby implicitly reaffirming the distributional hypothesis.

In Cognitive Psychology, our long-term declarative memories are modeled as being made of episodic and semantic memories [24]. The input to our memory is made of sequences of experiences called episodes. Some episodes are stored in the episodic memory, where the temporal order of events inside an episode is maintained. But episodes also contribute to semantic memory, where latent semantic properties of episodes are observed. Semantic memory acts as a store of organized knowledge acquired by observing a large number of episodes and is necessary for our use of language.

3.1. 3-layer model

To deal with human-generated representations like text and speech, we model the process of semantic representation and reasoning in humans by dividing it into three clear compartments (Fig. 1):

1. Analytic layer: It models our semantic memory and hence is at the core of how semantics are represented in our brains. This layer can be described as a set of concepts and a set of associations between the concepts. Concepts are mental abstractions of real-world entities like Federer, space shuttle, or of abstract notions like π, machine learning. Each concept acts as a point of access to concepts related to it. As concepts belong to the cognitive space, they can vary slightly from person to person but overlap significantly and thus enable communication. Associations are mental abstractions of relationships between concepts. For example, plays(Federer, Wimbledon) is an association describing the relationship between the concepts Federer and Wimbledon.
Apart from associations, concepts also indirectly influence other concepts in the analytic layer. So, the above example is more related to the concepts Tennis and Sport than to an arbitrary concept like space shuttle. We define a relationship called aboutness to describe the influence between concepts.

Definition 1. Aboutness: Aboutness is a score between a set of concepts P and a concept c, modeled as a function valued in the real unit interval [0, 1]. The aboutness scores of any set of concepts P over all the concepts in the analytic layer are represented as an aboutness distribution A(P).

2. Episodic layer: It deals with cogent snippets of information called episodes, which help in building the complex analytic layer above. An episode is an autobiographical situation involving the subject (the speaker or author) and has certain episodic objectives.

Fig. 1. 3-layer model.


An episode is abstracted as a small subset of concepts and associations from the analytic layer which together express an idea. The idea is the glue which binds the concepts and associations expressed in an episode. Hence an episode like "Federer won a Grand Slam" can be said to contain the concepts Federer and Grand Slam and the association win(Federer, Grand Slam). Although semantics referred to in episodes are sourced from the analytic layer, what is stated in an episode about a concept may be quite different from the semantic signature of the concept in the analytic layer.

3. Linguistic layer: It is the layer which populates an idea in an episode with terms from our vocabulary. Terms represented in a human language are external manifestations of concepts. Unlike concepts, terms exhibit synonymy and polysemy. Terms are not associated with one another directly, but through their usage in episodes they gain associations with concepts. This association is a model of meaning in our language. As we keep using terms differently, they start getting associated with different concepts, and hence their associations change in accordance with usage.

So every thought we communicate is essentially a set of concepts (analytic layer) expressed as an idea unit (episode) using a set of terms along with associated structure (language). Note that the three layers are simplified representations of a complex reality, and hence other significant notions like truth, the objective of an episode, and grammar are ignored from the semantic standpoint of this model. The 3-layer model offers an abstraction of how meaning is associated with language, in order to replicate such structures in systems elsewhere to enable semantic understanding, however rudimentary. Although semantics mining algorithms like LSA and LDA implicitly follow a similar structure, they do not model how the different interactions across these layers yield qualitatively different semantics.

3.2. 3-layer model for co-occurrence semantics

By understanding the process through which we embed semantics into language, we can recover some of the semantics algorithmically. Hence we extend the model onto a machine and use a document corpus as a replacement for a lifetime of episodes. The linguistic layer processes input documents based on their language and identifies named entities (terms) in the text. In practice we assume the language to be English and use a combination of algorithms and heuristics to extract terms. The query and response would both be sets of terms and not grammatical sentences. Such a linguistic representation, though rudimentary, forms the base for several powerful semantics extraction algorithms.

Earlier, an episode was modeled as a set of concepts and associations which represent a cogent snippet of information. As the concepts are latent in the document corpus, we use the extracted terms as stand-ins for the concepts. The co-occurrences between terms in a document represent the associations between terms. Hence sets of co-occurring terms in a document form an episode for the machine. Considering co-occurrences to be the only observed facts, we replace the analytic layer with a co-occurrence graph and a set of operations called primitives (Fig. 2).

3.2.1. Co-occurrence layer

Formally, the co-occurrence graph G is a weighted, undirected graph of the form:

G = (T, C, w)    (1)

where T is the set of all terms in the corpus and C is the set of all pair-wise co-occurrences across terms inferred from the episodes in the corpus. The function w indicates the corresponding co-occurrence count between two terms t_i, t_j ∈ T in the corpus. To enable extraction of semantics from the co-occurrence graph, we need to define several operations (called primitives) by which we can operate on the graph.

Definition 2. Closure and focus: Given a set of terms X, their closure X* is the set of all terms which co-occur with at least one of the terms in X. Their focus X⊥ is the set of all terms which co-occur with all the terms in X.

X* = {v | ∃u ∈ X, (u, v) ∈ C}    (2)

X⊥ = {v | ∀u ∈ X, (u, v) ∈ C}    (3)
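To make the co-occurrence layer concrete, the following is a minimal Python sketch of the graph of Eq. (1) and the closure and focus primitives of Eqs. (2) and (3). The function names and the dict-of-dicts representation are our own illustrative choices, not from the paper.

```python
from collections import defaultdict
from itertools import combinations

def build_cooccurrence_graph(episodes):
    """Build G = (T, C, w): each episode is a set of terms, and w[t][u]
    accumulates the pair-wise co-occurrence count of t and u (Eq. 1)."""
    w = defaultdict(lambda: defaultdict(int))
    for terms in episodes:
        for t, u in combinations(sorted(set(terms)), 2):
            w[t][u] += 1
            w[u][t] += 1
    return w

def closure(w, X):
    """X*: terms that co-occur with at least one term in X (Eq. 2)."""
    return set().union(*(set(w[x]) for x in X))

def focus(w, X):
    """X⊥: terms that co-occur with every term in X (Eq. 3)."""
    return set.intersection(*(set(w[x]) for x in X))
```

For instance, over the two episodes {Federer, Wimbledon} and {Nadal, Wimbledon}, focus(w, {"Federer", "Nadal"}) returns {"Wimbledon"}, so the pair is coherent in the sense of Definition 3 below.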

Fig. 2. 3-layer model with co-occurrence layer.


Definition 3. Coherence: A set of terms X is said to be coherent if X⊥ ≠ ∅. Incoherent terms—terms which do not share co-occurring terms—are of little use in co-occurrence based semantics.

Definition 4. Neighborhood: Given a term t, its co-occurrence neighborhood N(t) is the set of all terms co-occurring with t along with their co-occurrence counts. In other words, it is the "star"-like sub-graph in G originating from t. Formally:

N(t) = (T_N(t), C_N(t), w)    (4)

where T_N(t) = {t} ∪ {u | u ∈ T, {t, u} ∈ C}, C_N(t) = {{t, u} | u ∈ T, {t, u} ∈ C}, and w is the corresponding edge weight in G.

The neighborhood represents the relationship of a set of terms with all the terms which co-occur with them. The neighborhood of a set of terms X can be defined in its two canonical forms: N(X*), the neighborhood closure, and N(X⊥), the neighborhood focus. Formally:

N(X*) = ∪_{x ∈ X} N(x)    (5)

N(X⊥) = ∩_{x ∈ X} N(x)    (6)

When computing the neighborhood of a set of terms X representing a compound concept, the primitives X* and X⊥ are treated as hypothetical terms representing the compound concept, so that the neighborhoods N(X*) and N(X⊥) still appear like star graphs. The co-occurrence weights of terms in the neighborhood of X⊥ and X* are updated as follows:

w(X⊥, u) = min_{x ∈ X} w(x, u)    (7)

w(X*, u) = Σ_{x ∈ X} w(x, u)    (8)

The above equations are essentially multiset (bag) intersection and multiset sum (⊎) operations respectively.

Definition 5. Semantic context: Given a co-occurrence graph G, the semantic context ψ(t) of a term t is the sub-graph of G induced by the vertices of the neighborhood, T_N(t). An induced sub-graph H of a graph G contains a subset of vertices of G and all edges of the form {v1, v2} from G such that v1, v2 ∈ V(H). Formally the semantic context is defined as:

ψ(t) = (T_N(t), C_ψ(t), w_t)    (9)

where C_ψ(t) = {{v1, v2} | v1, v2 ∈ T_N(t), {v1, v2} ∈ C}. As before, for a set of terms X we define the semantic contexts of their closure and focus:

ψ(X*) = ⊎_{x ∈ X} ψ(x)    (10)

ψ(X⊥) = ∩_{x ∈ X} ψ(x)    (11)
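Continuing the same sketch, the compound-concept weight updates of Eqs. (7) and (8) are a multiset intersection and a multiset sum over the neighborhoods, and the semantic context of Eq. (9) is an induced sub-graph. The helper names below are again ours:

```python
from collections import defaultdict

def focus_weights(w, X):
    """w(X⊥, u) = min_{x∈X} w(x, u): multiset intersection (Eq. 7)."""
    common = set.intersection(*(set(w[x]) for x in X))
    return {u: min(w[x][u] for x in X) for u in common}

def closure_weights(w, X):
    """w(X*, u) = Σ_{x∈X} w(x, u): multiset sum (Eq. 8)."""
    total = defaultdict(int)
    for x in X:
        for u, c in w[x].items():
            total[u] += c
    return dict(total)

def semantic_context(w, t):
    """ψ(t): the sub-graph of G induced by the vertices of N(t) (Eq. 9)."""
    nodes = {t} | set(w[t])
    return {v: {u: c for u, c in w[v].items() if u in nodes} for v in nodes}
```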

The semantic context is an important data structure for co-occurrence based latent semantics. We claim that a large number of latent semantics pertinent to terms in X will be found within the semantic context of either X* or X⊥. It is not necessary to process the entire graph for extracting latent semantics related to a small set of terms. Of course, it is important that any such set of terms X is coherent in the first place.

Co-occurrence between two terms along with a weight is an undirected relationship between the terms. However, when viewed from the vantage point of either of the terms, the relative probability of co-occurrence of the other term need not be identical. To capture this asymmetry, we define the notion of generatability.

Definition 6. Generatability: If a term u co-occurs with t, then the probability that an arbitrarily chosen term co-occurring with t happens to be u is called the generatability of u in the context of t. Formally:

Γ_{t→u} = w(t, u) / Σ_{x ∈ T_N(t)} w(t, x)    if u ∈ T_N(t) ∖ {t}
Γ_{t→u} = 0                                   otherwise    (12)
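In the running sketch, the generatability distribution of Eq. (12) is simply a row-normalization of a term's co-occurrence counts. The helper below is a hypothetical name of ours, reused in the later sketches:

```python
def generatability(w, t):
    """Γ_t of Eq. (12): the probability that a randomly chosen
    co-occurrence of t is with u, for every neighbour u of t."""
    total = sum(w[t].values())
    return {u: c / total for u, c in w[t].items()} if total else {}
```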


Every edge {u, v} ∈ C can have two generatability probabilities associated with it (Γ_{u→v}, Γ_{v→u}). All the generatability probabilities originating from a term u form a probability distribution called the generatability distribution Γ_u of u over its neighborhood N(u). Generatability of a target term can be extended from a single source term to sets of source terms using the focus and closure operators. Hence, Γ_{X*→u} and Γ_{X⊥→u} are calculated as above, with the co-occurrence weights adjusted according to the multiset operations specified in Eqs. (7) and (8).

3.3. Methodology for mining semantics

The proposed co-occurrence graph along with the set of primitives forms a generic framework using which, we claim, several kinds of semantic associations can be mined. In order to do this, we propose a methodology for using the framework. The methodology involves three steps. For a given semantic association S, do the following:

1. Provide an intensional definition of S, indicating what S is supposed to mean in terms of the analytic layer.
2. Hypothesize an extensional definition for S, about how S manifests itself across episodes. This is also called the episodic hypothesis for S.
3. Test the episodic hypothesis for S over the co-occurrence graph by formally representing the hypothesis using the proposed set of primitives.

These three steps are fundamental to the way we mine semantics, as reflected in the set of different semantics mining algorithms in the following section.

4. Semantic associations

Using the proposed methodology, we present four different kinds of semantics extraction tasks. These tasks were implemented over a co-occurrence graph generated from named entities extracted from Wikipedia articles. The algorithms are generic enough to work on any kind of dataset. However, evaluating the correctness of a semantics extraction algorithm is difficult over arbitrary datasets. Semantic associations represent the collective worldview of the population that generated the corpus. The validity of the collective worldview is easiest to verify with human evaluators when the corpus is of an encyclopedic nature like Wikipedia.

4.1. Document corpus

In this work, we used the English language Wikipedia as the document corpus for our model. The entire free text of the dataset was used for measuring co-occurrences. The dataset was cleaned by removing all the non-article pages—like category pages, talk pages, user pages and so on—and stub pages, and in each of the remaining pages the tables, info-boxes and general references were removed from the text, as the primary objective was to extract semantics based solely on co-occurrences. In the experiments, co-occurrence was measured between keywords observed in the text. For the sake of clarity and repeatability, we picked all the terms with their own Wikipedia page as the set of keywords. The Wikipedia data was obtained in May 2011, and the co-occurrence graph built using it contains more than 7 million nodes and 155 million edges. As the algorithms described below were devised at various points in time, an older version of the Wikipedia data was sometimes used. Such instances are explicitly mentioned in the relevant sections.

4.1.1. Topical anchors

The first kind of semantic association that we consider is called topical anchors. Topical anchors are concepts representing the topic of an episode, based on the terms that have occurred in the episode.
For example, if a document has words like Federer, Nadal and Wimbledon, it would be very useful if their association with Tennis could be established, even if the word Tennis does not appear in the document. Here, Tennis acts as the topical anchor for this set of words.2 Topical anchors find applications in automatic labeling of conversations, email messages and so on, and are useful in settings like handling customer complaints.

Topical anchors are an existing work which has been adapted to the 3-layer model [25,26]. The contribution in this work is to revisit the algorithm through the lens of the 3-layer model and provide an episodic hypothesis to explain why the algorithm results in topical anchors. To extract topical anchors, we adopt the 3-layer methodology by first providing an intensional definition for topical anchors and then an episodic hypothesis. Finally, the hypothesis is tested on the co-occurrence graph by an algorithm built using the set of primitives.

4.1.1.1. Intensional definition. The topical anchor t of a set of concepts Q is the concept whose aboutness distribution A(t) resembles the aboutness distribution of Q: A(Q). This implies that the topical anchor t is a concept which is semantically about the same set of concepts as the set Q is collectively about.

2 A working implementation of Topical Anchors can be found at http://osl.iiitb.ac.in/~aditya/anchors.php.


The next step is to reduce this definition into an extensional definition or episodic hypothesis. If t is a topic for the set of concepts in Q, how will it be evidenced across different episodes?

4.1.1.2. Episodic hypothesis. If a set of terms Q is observed in an episode, the topical anchor of the terms is the term t whose probability of generation increases with the length of the episode.

To explain the episodic hypothesis in intuitive terms, consider the following example as a hypothetical claim: Suppose we are witness to a semantically meaningful conversation, a document, or an article containing the terms {Roger Federer, Wimbledon, Davis Cup}. In such a context, we are bound to encounter the term Tennis the longer the conversation, document or article gets.

Note the difference between the intensional and extensional definitions. The episodic hypothesis addresses observable patterns in human language usage that represent the intensional structure of the semantic association. The episodic hypothesis in this case seems like a bold and unusual claim. Indeed, it is not necessary for a sentence or even a set of sentences about a topic t to mention t. For instance, status updates on Facebook or tweets on Twitter may make statements about (say) Tennis without mentioning the term "Tennis." But the hypothesis is based on "long enough" episodes. A long enough document about a given topic will need to keep mentioning the topic in order to reinforce the relevance of what is said to the topic. The topic is what the document is "about."

4.1.1.3. Co-occurrence algorithm. To test the above hypothesis, we represent it formally on the co-occurrence graph as follows. Given a coherent set of terms Q in a corpus represented as a co-occurrence graph G, their topical anchor is the term with the highest cumulative generatability score in an infinitely long random walk executed on ψ(Q*).

In our implementation, an OPIC-like algorithm adapted from Abiteboul et al. [27] is used to perform the random walk. Every node representing a term in Q is initialized with a seed cash, and this cash is distributed to its neighbors in accordance with their generatability values. This process is iterated by picking any node uniformly at random and distributing its cash to its neighbors, again in accordance with its generatability to other nodes. As this process is repeated, the cash-flow history at every node is recorded. The cash-flow history of a node is the total cash distributed by the node over all iterations up to this point. The relative ordering of cash-flow histories of nodes converges to a fixed point indicating the centrality of nodes in the context. A set of example results of this algorithm is shown in Table 1. As in OPIC, each iteration of the random walk has a complexity of O(N²), where N is the number of nodes in the semantic context ψ(Q*).

There is a subtle but important distinction from OPIC in the way the cash is distributed to a node's neighbors. The cash at u is not distributed in the ratio of its co-occurrence edge weights to its neighbors but according to the generatability of each neighbor. These two values would be exactly the same for cash distributed by a node u if N(u) ⊆ ψ(Q*). But when N(u) contains terms which are not in ψ(Q*), the two metrics differ. For example, in the graph shown in Fig. 3, let the sub-graph in the dotted circle indicate ψ(Q*) and the sub-graph with the dashed edges indicate N(u). As earlier, let u have a cash of x to distribute.
If we just take ψ(Q*), i.e., ignore a and b and distribute the cash at u in the ratio of its co-occurrence edge weights in ψ(Q*), then r and s would receive cash of 0.66x and 0.33x respectively. But when distributing cash using generatability, where Γ_{u→r} = 0.05 and Γ_{u→s} = 0.025, r and s would receive cash of 0.05x and 0.025x respectively. This effectively reduces u's say in determining the topical anchor of ψ(Q*) by taking into account the structure of the graph outside of ψ(Q*). When we distribute cash along generatability edges, not all the cash at a node is necessarily distributed. The undistributed cash is leaked out of the system and is not distributed any further.

In addition, the algorithm was found to be fairly robust in resolving polysemy in the query terms. This is because the semantic context sub-graph is distinctly different for polysemous terms, based on the context provided by the surrounding terms. The cash-leaking random walk is designed to accentuate the differences between the semantic context sub-graph and the rest of the graph. The results for some polysemic query terms are presented in Table 2.
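The following is a minimal sketch of the cash-leaking random walk as we read it from the description above; the fixed iteration budget (instead of a convergence test) and the function name are our own simplifications. Here gamma[u] is the generatability distribution of u over the full graph (the generatability helper from the earlier sketch), and context_nodes are the vertices of ψ(Q*).

```python
import random

def cash_leaking_walk(context_nodes, gamma, seeds, iterations=100_000):
    """OPIC-like walk: cash flows along generatability edges inside ψ(Q*);
    the share routed to terms outside ψ(Q*) leaks out of the system."""
    cash = {v: 0.0 for v in context_nodes}
    history = {v: 0.0 for v in context_nodes}
    for q in seeds:
        cash[q] = 1.0                       # seed cash on the query terms Q
    nodes = list(context_nodes)
    for _ in range(iterations):
        u = random.choice(nodes)            # pick a node uniformly at random
        history[u] += cash[u]               # record its cash-flow history
        for v, g in gamma[u].items():
            if v in cash:                   # only in-context neighbours get cash
                cash[v] += cash[u] * g
        cash[u] = 0.0                       # undistributed cash is leaked
    return history                          # highest history = topical anchor
```

Ranking the returned histories and picking the top term corresponds to the cl 1 variant discussed in the evaluation below.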

Table 1
Example results from the topical anchor experiments.

Input terms (Q)                        Topical anchors (top 3)
mit, stanford, harvard                 university, college, united states
manchester united, chelsea, arsenal    london, football, football club
injection, surjection, bijection       mathematics, set, function
rice, wheat, barley                    food, agriculture, maize
volt, watt, ohm, tesla                 unit, electricity, current


Fig. 3. An example sub-graph.

4.1.1.4. Validating the hypothesis. The experiment was conducted in two phases. In the first phase, volunteers were invited to submit episodes (sets of terms) for which topical anchors as per the analytic definition were mined. In the second phase, a set of 100 such volunteer-given queries was randomly chosen and presented to a larger set of volunteers in a randomized order. These volunteers were asked to write down at most three topical anchors for each of the queries, independent of the algorithm. The volunteers were not aware of the existence of such an algorithm. There were a total of 86 volunteers, and each volunteer was given at most 30 random queries from the hundred questions, of which she was asked to evaluate the queries she was comfortable with. The topical anchors given by the volunteers were recorded and compared with the topical anchors generated by the algorithm.

For the experiment, we partitioned the topical anchors given by the volunteers into confidence intervals based on the percentage of evaluators agreeing upon a topical anchor. A confidence interval from x to x + c is a bucket where x% to (x + c)% of the evaluators who answered that query agree that the terms in the bucket are topical anchors for the query. For example, if 95% of the evaluators answer computer for the query "CPU, hard disk, monitor, mouse", then we put computer into the 90–100 confidence interval. The confidence intervals are in steps of 10, starting from 40–50 and going up to 90–100. Evaluator-generated topical anchors which are not in any of the confidence intervals above 40 are ignored because of the lack of adequate support. With a confidence cut-off at 40, there are a total of 156 topical anchors in confidence intervals of 40 and above, across all the 100 chosen queries. Some queries, like "summer, winter, spring, autumn", have just one anchor, season, and some others, like "Volt, Watt, Ohm, Tesla", have several evaluator-given anchors like unit, electricity and physics.

We evaluated three different algorithms for computing the most central nodes. First, we used a variant of TF–IDF to identify the most important nodes in a sub-graph. As there is no notion of a document in the co-occurrence graph, we had to use an interpretation of TF–IDF which is similar in spirit. The Term Frequency (TF) of a node for a context ψ(Q*) was defined as the sum of the weights of its edges to nodes in the context. The Inverse Document Frequency (IDF) of a node was defined as the log of the ratio between the sum of all the edge weights of edges from all the nodes in the context and the sum of the edge weights of edges from the given node. The product of TF and IDF was used as the score of a node, as shown in Eq. (13):

TF(i, ψ(Q*)) = Σ_{x ∈ T_ψ(Q*)} w(i, x)

IDF(i, ψ(Q*)) = log( Σ_{x ∈ T_ψ(Q*)} Σ_{y ∈ T} w(x, y) / Σ_{z ∈ T} w(i, z) )    (13)
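Under the same graph representation, the TF–IDF-style node score of Eq. (13) can be sketched as follows. The function name node_tfidf is ours, and we assume every node of the context has at least one edge, which holds by construction:

```python
import math

def node_tfidf(w, context_nodes):
    """Score each node of ψ(Q*) by the TF–IDF variant of Eq. (13)."""
    # Numerator of IDF: total edge weight leaving all nodes of the context.
    context_mass = sum(sum(w[x].values()) for x in context_nodes)
    scores = {}
    for i in context_nodes:
        tf = sum(c for u, c in w[i].items() if u in context_nodes)
        idf = math.log(context_mass / sum(w[i].values()))
        scores[i] = tf * idf
    return scores
```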

We also compared the two modes of distributing cash: (i) the cash of a node is distributed in accordance with the generatability, with the undistributed cash leaked out of the system, and (ii) the cash of a node is distributed in the ratio of the co-occurrence weights in the sub-graph. The former is called cl, where cl stands for cash leakage, and the latter opic.

Table 2
Polysemic query terms from the topical anchor experiments.

Input terms (Q)           Topical anchors (top 3)
java, sumatra, borneo     indonesia, island
java, ruby, perl          programming language, programming
amazon, nile, congo       river, africa
amazon, ebay, google      internet, united states, software


Each algorithm was executed in three different variants based on the number of topical anchors chosen. The number of correctly identified topical anchors of tfidf 1 was computed, where the 1 stands for picking only one topical anchor per query. The number of correctly identified topical anchors of tfidf 3 and tfidf 10 was also computed, where the number of topical anchors generated by the algorithm is 3 and 10 respectively. The same procedure was repeated with opic 1, opic 3 and opic 10, and with cl 1, cl 3 and cl 10. On the whole, there were 3 different algorithms computed 3 times with a varying number of topical anchors on each of the 100 test queries. The results were plotted with the confidence intervals for the topical anchors on the horizontal axis and the number of correctly picked topical anchors on the vertical axis, and are presented in Fig. 4. The vertical bars indicate the number of topical anchors chosen by the volunteers.

The plot in Fig. 4 compares the number of volunteer-picked topical anchors with those of all three algorithms. The results show that tfidf 10 and opic 10 could correctly pick only 40 and 56 topical anchors respectively of the 156. In contrast, the random walk cl 1 performed much better with a hit-rate of 67, and the cash-leaking random walks cl 3 and cl 10 correctly identified 110 and 149 of the topical anchors respectively. This implies that cl 3 had a recall of 70% and cl 10 had a recall of 95.5%. Computing a reliable precision, on the other hand, is not feasible, as we do not know the topical nature of those terms which were not picked by the volunteers. For example, for the terms {Mickey Mouse, Donald Duck, Goofy}, the evaluator-generated topical anchor is Cartoon whereas the algorithm returned Disney. This result is not completely incorrect, but due to its fuzziness it will not figure in any precision computation.

This experiment validates the algorithm, that the topical anchors of a set of terms are the terms with the highest cumulative generatability, and in turn validates the episodic hypothesis, that the topical anchors are the terms which are most generatable in a text containing the query terms. In the original work, the algorithm was shown to have performed on par with a recent multi-document topic labeling algorithm [26]. Refer to the same for further experiments and results involving topical anchors.

4.2. Semantic siblings

The second algorithm that we present concerns the notion of semantic siblings. Semantic siblings are sets of terms representing concepts that play similar roles in one or more settings, but are not synonyms. For example, the terms Sapphire, Emerald and Topaz form semantic siblings, as they are all different types of gems. Given an ontology, semantic siblings are concepts that share the same conceptual parent. But without a formal ontology, identifying semantic siblings is not straightforward. Identifying semantic siblings is an important problem in several applications like semantic query expansion, recommender systems, semantic matching of documents, etc.

4.2.1. Specific related work

Early work in semantic siblings can be traced to automated thesauri construction techniques [28]. Most of the existing techniques exploit structural cues for mining semantic siblings. For example, to identify semantic siblings, there are algorithms which use HTML and XHTML tags [29], and comma-separated terms in a sentence [30]. Also, words occurring along a column or a row of a table in a web page are likely to be siblings [31].

Fig. 4. Comparison between TF–IDF and random walks with and without cash leakage.


Sometimes 'x is a y' patterns in text can be used to determine the parent–child tree, which can in turn help determine semantic siblings [32]. The semantic sibling relationship between terms can also act as a foundation for a different algorithm. Dorow et al. use co-occurrences constructed out of lists of nouns (semantic siblings) to disambiguate between different senses of a term [13].

In this work, we hypothesize that, like several other semantic associations, semantic siblings are also latent in co-occurrence patterns and can be mined without using structural cues like lists or tables. This makes sure that the algorithm can be language agnostic and can perform well even on unstructured corpora like free text or transcripts. To extract semantic siblings and to differentiate them from synonyms, the problem is posed as a set expansion problem. A small set (cardinality: 3) of semantic siblings is taken as input and a larger set (20) of siblings is returned as the result. As before, we follow the 3-layer methodology of defining semantic siblings with intensional and extensional definitions, and reducing it to a co-occurrence algorithm.

4.2.2. Intensional definition

A semantic sibling s of a set of concepts Q = {q1, q2, …, qn} is a concept whose aboutness distribution resembles the aboutness distribution of each of the concepts in Q, i.e., (A(q1) ≈ A(s)) ∧ (A(q2) ≈ A(s)) ∧ … ∧ (A(qn) ≈ A(s)). This implies that the semantic sibling should be a concept whose relevance to other concepts is similar to that of each of the given set of concepts.

4.2.3. Episodic hypothesis

The extensional definition for semantic siblings is hypothesized as follows. Elements from a set of concepts Q are said to be semantic siblings of one another if, given an episode e that features one of the concepts q ∈ Q, it is possible to find another episode e′ featuring some other concept q′ ∈ Q, with the rest of the concepts and associations in e′ nearly identical to those of e.

The observational hypothesis is based on a notion of replaceability of semantically similar concepts. In the association plays(Federer, Wimbledon), Federer can be replaced by Nadal, but not (say) by Germany. Two concepts which are semantically similar will be replaceable in most of the associations.

4.2.3.1. Co-occurrence Algorithm 1 (direct). Two different algorithms on the co-occurrence graph were tested for computing replaceability. Given a coherent set of terms Q which are semantic siblings of one another, in a corpus represented as a co-occurrence graph G, a term s is a semantic sibling of the terms in Q if the properties of the neighborhood N(s) are similar to the properties of the neighborhood N(qi) of each term in Q. The generatability distribution of a node is the best way to capture the properties of the neighborhood. In this algorithm, henceforth referred to as direct, the generatability distribution Γ_s of a candidate sibling s is compared with the generatability distribution Γ_qi of each of the terms in Q. This results in a vector of scores. The magnitude of the vector is a measure of the replaceability of s and is used to determine the semantic siblings.

4.2.3.2. Co-occurrence Algorithm 2 (interleaved). Given a coherent set of terms Q which are semantic siblings of one another, in a corpus represented as a co-occurrence graph G, a term s is a semantic sibling of the terms in Q if the properties of the neighborhood N(((Q ∖ {qi}) ∪ {s})*), where one of the query terms is replaced by s, are similar to the properties of the neighborhood N(Q*).
In this algorithm, henceforth referred to as interleaved, the joint generatability distribution Γ_Q of the terms in Q over N(Q*) is estimated by assuming them to be independent. Similarly, a new set Q′i = (Q ∖ {qi}) ∪ {s} is constructed, where the ith input sibling is replaced by the candidate s. The candidate s can replace qi if Γ_Q is similar to Γ_{Q′i}. Again there is a vector of scores, and a process similar to direct is followed.

Both algorithms compare one probability distribution with another, and for this purpose we used Kullback–Leibler divergence (K–L divergence) [33]. The K–L divergence of a distribution B with respect to a distribution A is given as:

D_KL(A||B) = Σ_i A(i) ln( A(i) / B(i) )    (14)
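A sketch of Eq. (14) and of the direct method's scoring follows. The eps smoothing for events to which B assigns zero probability is our own choice (the paper does not say how these are handled), and generatability is the helper from the earlier sketch:

```python
import math

def kl_divergence(a, b, eps=1e-12):
    """D_KL(A||B) for distributions given as dicts (Eq. 14)."""
    return sum(p * math.log(p / b.get(i, eps)) for i, p in a.items() if p > 0)

def direct_scores(w, query, candidates):
    """Magnitude of the direct method's K–L vector per candidate sibling;
    candidates with the lowest magnitudes are the semantic siblings."""
    q_dists = [generatability(w, q) for q in query]
    return {
        s: math.hypot(*(kl_divergence(d, generatability(w, s)) for d in q_dists))
        for s in candidates
    }
```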

K–L divergence between two distributions is a non-negative number in the range [0, ∞], where a lower value indicates higher similarity. Given a query Q = {q1, q2, q3} and a candidate sibling s, the direct algorithm computes the vector

⟨D_KL(Γ_q1 || Γ_s), D_KL(Γ_q2 || Γ_s), D_KL(Γ_q3 || Γ_s)⟩.

For the same example, the resultant vector in the interleaved algorithm would be

⟨D_KL(Γ_Q || Γ_Q′1), D_KL(Γ_Q || Γ_Q′2), D_KL(Γ_Q || Γ_Q′3)⟩.

Every term in ψ(Q*) is chosen as a candidate and the above vectors are computed using the direct method and the interleaved method. In each of the algorithms, the candidates are ordered based on the magnitude of the vector. Semantic siblings are those terms whose


vectors have the lowest magnitudes. Please refer to Algorithms 1 and 2 for further details. The interleaved algorithm has a time complexity of O(N³) and the direct algorithm has a time complexity of O(N²), where N is the number of nodes in the semantic context ψ(Q*).

Algorithm 1. Semantic siblings using interleaved method.

Some sample queries along with the top 10 siblings given by the interleaved algorithm are shown in Table 3.

4.2.3.3. Validating the hypothesis. The evaluation methodology was similar to topical anchors, but as the algorithm has several correct answers, volunteers were shown the results and asked to choose the correct semantic siblings from the ones generated.

Algorithm 2. Semantic siblings using direct method.

As in the case of topical anchors, a random set of 100 volunteer-generated semantic sibling sets of size 3 was chosen. For these 100 sets, 20 semantic siblings were computed using the direct and the interleaved methods. To eliminate bias towards any algorithm, the results of these computations were merged and sorted in alphabetical order before presenting them to our human evaluators. Evaluators were presented with each set of query semantic sibling terms accompanied by a larger set of expanded semantic siblings. So for a given semantic sibling query, the evaluators were shown anywhere between 20 and 40 semantic siblings based on the overlap of results between the algorithms. The evaluation proved a difficult task, as there were a total of 3506 decisions to be made across the hundred queries, and hence it was more time intensive than the topical anchors evaluation. The order of the queries was randomized between evaluators, and they were asked to evaluate as many queries as they felt comfortable with. A total of 18 evaluators volunteered for the purpose, and on average,


Table 3
Semantic siblings results.

Input terms (Q): sapphire, emerald, topaz
Siblings:        gemstone, opal, amethyst, garnet, peridot, lapis lazuli, turquoise, beryl, onyx, pearl

Input terms (Q): roger federer, rafael nadal, andy roddick
Siblings:        janko tipseravic, marat safin, arnaud clement, mario ancic, mardy fish, marcos baghdatis, jurgen melzer, paul-henri mathieu, jose acasuso, michael berrer

every query received answers from 6.9 evaluators. Three of the 100 queries had only three evaluators, and the remaining 97 queries had four or more evaluators. To eliminate any errors due to accidental clicks and other trivial biases, only those results which were chosen as semantic siblings by at least two evaluators were considered.

Semantic siblings tend to exhibit a hypothetical parent term, i.e., they are either co-hyponyms of a shared hypernym, like {Federer, Nadal, Roddick} → Tennis player or sportsperson, or co-meronyms of a shared holonym, like {Germany, France, Spain} → Europe. Technically, a term like Entity is a hypernym, and Universe is a holonym, for any given set of terms, but these are too generic to be of any use. Hence, the evaluators were allowed to find the hypothetical parent at the right level of generalization which they found suitable in everyday usage, given that the query terms were semantic siblings themselves.

Based on such an evaluation, we plotted the accuracy of both the direct and the interleaved algorithms in Fig. 5. For generating 20 semantic siblings, on average the interleaved method yielded 63.4% precision and the direct method yielded 59.2% precision. Although direct and interleaved have similar precision, there was very little overlap in their results. Of the 3506 terms given to the evaluators, 2133 were chosen as semantic siblings by at least two evaluators. Amongst those 2133, 1267 were generated by interleaved and 1184 by direct. This means that only 318 siblings were generated by both the algorithms, which indicates that the result space of each algorithm had a minimal overlap with the other.

In comprehending these results, there are several factors which must be taken into account. For example, in the semantic context ψ(Q*) where Q = {Federer, Nadal, Roddick}, the number of nodes was 2334. Only 4% of these nodes (<100 nodes) represented tennis players. Sometimes the number of semantic siblings available in the context was so low that it was less than 20. For example, one of the queries in the evaluation had three terms {Aries, Taurus, Gemini} from the zodiac signs. Though their semantic context was large, the number of possible semantic siblings could only be 9 more. Of those 9, all except Cancer were picked up by the algorithm. Hence the chosen threshold of 20 might not be optimal from a recall standpoint, but it was chosen as a trade-off between maximizing the number of semantic siblings obtained and reducing the load on evaluators.

Both the algorithms are purely based on heuristics derived from the cognitive model and hence are not optimized through any supervised learning algorithms. We found that if we had an oracle who could choose the better algorithm on looking at the query, then the overall precision could be improved to 71%. Another point of interest is that different kinds of terms exhibited different co-occurrence patterns, and we found that terms which represent people had a specific signature of their own. In our dataset, the algorithms performed significantly better on queries about people: the interleaved algorithm had a precision of 77.5% and direct had a precision of 65.6% for queries in which the terms in Q represented people.

We also compared these algorithms with queries on structured datasets like WordNet,3 YAGO2,4 and DBPedia.5 Using WordNet, YAGO2 and DBPedia, we computed the semantic siblings for all the hundred queries, and the results are presented in Table 4.
WordNet is a hand-tagged lexical database for the English language, which defines the semantics of terms and also defines several semantic associations between terms. Amongst the associations, WordNet provides a relationship between terms called sister terms, which is similar to the semantic sibling relationship. The 100 semantic sibling queries in the experiments consisted of 300 unique query terms (3 × 100 queries). Of these, 203 terms were defined in WordNet, of which 201 terms had the association sister terms defined. Each term had a set of senses, and the sister term could be a semantic sibling in any of the senses. To get the semantic siblings of a query, we took an intersection of the sister terms of each of the query terms. Only 28 of the 100 queries had a non-null intersection, and for those 28 cases WordNet returned an average of 22 results per query.

An important point of note here is that the semantic sibling queries submitted by the volunteers were predominantly (>50%) proper nouns. Though it might seem that proper nouns are not represented well in WordNet, in our experiments we found that WordNet contains many proper nouns, including those that denote people. For example, proper noun query terms like Akira Kurosawa, Mickey Mouse, Golden Retriever, Poseidon and Victoria Falls have sister terms defined in WordNet. Hence, the reason for the low coverage is not that the queries were outside the scope of WordNet; it is that any handcrafted dictionary like WordNet cannot be as exhaustive as free text.

In another comparison, we chose the structured dataset YAGO2, which consists of a set of semi-automatically built semantic associations. Like WordNet, YAGO2 describes several relationships between terms, one among them being is–a. The is–a association captures hyponym–hypernym relationships, which are inherent in semantic siblings. Hence, to find semantic siblings of a query, we computed an intersection of the is–a parents of the query terms and then counted the number of other terms which shared at least one of their parents with all the query terms.

3 http://wordnet.princeton.edu/.
4 http://www.mpi-inf.mpg.de/yago-naga/yago/.
5 http://dbpedia.org/About.


Fig. 5. Accuracy of the direct and the interleaved algorithms.

We found that only 36 of the 100 queries had all query terms represented. Also, there was no simple way by which we could eliminate the highly generic is–a parents. Hence each of the queries had hundreds of thousands of semantic siblings. Even assuming that a clever technique could be used to prune the result set to an acceptable size, two-thirds of the queries still could not be answered.

Finally, we chose DBPedia, a structural information base built using data in Wikipedia, for evaluation. A technique similar to that for YAGO2 was used with DBPedia. Since DBPedia is based on Wikipedia, its size is much greater than YAGO2, leading to an increased number of queries being answered. However, as in the case of YAGO2, each of the queries had millions of semantic siblings. In comparison, the results of direct had 5 or more siblings for 84 queries and 10 or more siblings for 73 queries. Similarly, the results of interleaved had 5 or more siblings for 94 queries and 10 or more siblings for 78 queries. The greater coverage is due to the fact that the algorithms are based on co-occurrence in unstructured text and do not rely on structural cues for semantics extraction. Handcrafted structural datasets are useful only in a small set of cases, and any real world approach should consider automated unstructured techniques.

4.3. Topical markers

The third kind of semantic association that we present is called topical markers. Topical markers are concepts which are mostly unique to a topic and are improbable to be significantly associated with other topics. For example, the term double fault in an episode can determine the topic of the episode as Tennis with high confidence. Topical markers can be used to find snippets of text related to a topic in a large textual stream. For instance, the topical markers for Machine Learning can be computed and used to search a text stream like Twitter to identify tweets related to machine learning with high confidence, even though the term machine learning itself need not appear in the tweet.

4.3.1. Specific related work

The relative importance of a term in a document with respect to the entire corpus was initially addressed using metrics like TF–IDF [34]. Though TF–IDF assigns a higher weight to terms important in a document, it fails to identify them uniquely. For example, if we consider the Internet as a corpus and a page on Harry Potter as our document, the term magic might have a high weight for that page. Still, it is not unique to Harry Potter, unlike terms like Hogwarts or Hermione Granger. Statistically improbable phrases6 address this issue by looking for terms which are not only important but also unique to a document. They have been successfully used to determine and delete duplicates in a document corpus [35]. Although statistically improbable phrases can be found in the term–document space, there is no such equivalent in the conceptual space. For example, unlike Harry Potter, concepts like machine learning or Tennis need not be confined to the boundaries of a small set of documents. The 3-layer methodology for identifying topical markers follows.

4.3.2. Intensional definition

A topical marker m for a given concept t is a concept such that t has high aboutness values for those concepts for which m has a high aboutness value, but not necessarily vice versa. Let us consider Tennis and one of its topical markers, double fault.
Tennis will have a high aboutness to those concepts which are related to double fault, but not necessarily the other way around. In other words, m is a marker for t if t is relevant to whatever m is relevant to, but not vice versa.

4.3.3. Episodic hypothesis

The term m is the topical marker of a term t if, on observing m in an episode, the probability of generation of t increases with the length of the episode or with the number of such episodes.

6 Originally introduced by Amazon.com: http://www.amazon.com/gp/search-inside/sipshelp.html.


Table 4
Comparison with structured datasets.

                    Queries answered    Mean no. of results
WordNet             28                  22
YAGO2               36                  0.51 M
DBPedia             52                  1.75 M
Direct ≥5           84                  NA
Direct ≥10          73                  NA
Interleaved ≥5      94                  NA
Interleaved ≥10     78                  NA

The episodic hypothesis is similar to that of topical anchors, but instead of observing a set of terms to determine the topic, a topical marker term can independently determine the topic. The problem here is not to determine the topic, but to determine the marker m, which can unilaterally determine the topic with high probability.

4.3.4. Co-occurrence algorithm

Given a topic represented as a term t on a co-occurrence graph G, a topical marker m is a term that with high probability generates terms only in ψ(t). That is, m is a term whose generatability distribution lies almost completely within the vertices of ψ(t), as it should be unique to ψ(t). To compute such terms, which are central to ψ(t), we adopt a variant of the HITS algorithm first introduced by Kleinberg [36]. This algorithm uses mutual recursion to simultaneously compute the most central (authorities) and the most unique (hubs) vertices of a context. First, a bipartite graph is created by dividing each vertex in ψ(t) into an authority and a hub, as shown in Fig. 6. Initially, the scores of all the hubs are set to 1/n, where n is the number of vertices in ψ(t). The authority and hub scores of the terms are then computed recursively as in Eq. (15):

$$\mathrm{auth}(a) = \sum_{x \,\in\, N(a) \,\cap\, T_{\psi(t)}} \Gamma_{x \to a} \cdot \mathrm{hub}(x), \qquad \mathrm{hub}(a) = \sum_{x \,\in\, N(a) \,\cap\, T_{\psi(t)}} \Gamma_{a \to x} \cdot \mathrm{auth}(x) \tag{15}$$
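As a concrete illustration, here is a minimal Python sketch of this computation (our own illustrative code, not the authors' implementation). It assumes the semantic context is given as a dict `gamma` mapping each term to a dict of its out-neighbors with generatability weights Γ, already restricted to the terms of ψ(t); the per-iteration normalization and the auth × hub marker score follow the description in the next paragraph.

```python
def topical_markers(gamma, k=10, iterations=50):
    """Rank terms of a semantic context by marker score = auth * hub.

    gamma: dict mapping term -> {neighbor: generatability weight},
           already restricted to the terms of the context psi(t).
    """
    terms = list(gamma)
    n = len(terms)
    auth = {u: 1.0 / n for u in terms}
    hub = {u: 1.0 / n for u in terms}

    # Precompute in-neighbors for the authority update.
    incoming = {u: {} for u in terms}
    for u, nbrs in gamma.items():
        for v, w in nbrs.items():
            if v in incoming:
                incoming[v][u] = w

    for _ in range(iterations):
        # auth(a) = sum over in-neighbors x of Gamma(x -> a) * hub(x)
        auth = {a: sum(w * hub[x] for x, w in incoming[a].items()) for a in terms}
        # hub(a) = sum over out-neighbors x of Gamma(a -> x) * auth(x)
        hub = {a: sum(w * auth[x] for x, w in gamma[a].items()) for a in terms}
        # Normalize both score vectors after every iteration.
        za, zh = sum(auth.values()) or 1.0, sum(hub.values()) or 1.0
        auth = {a: s / za for a, s in auth.items()}
        hub = {a: s / zh for a, s in hub.items()}

    # Marker score: terms that are both good hubs and good authorities.
    score = {a: auth[a] * hub[a] for a in terms}
    return sorted(score, key=score.get, reverse=True)[:k]
```

Once computed, the markers can be used as described above, for instance flagging a tweet as being about machine learning whenever it contains any of the top-k markers, even if it never mentions the topic term itself.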

The hub and authority scores thus computed are normalized after every iteration. After convergence, the product of the authority and hub scores of each term is treated as the marker score of the term. The terms are ranked by their marker scores and the top k terms are chosen as the topical markers of t. As in HITS, the time complexity of the algorithm is O(N²) per iteration, where N is the number of nodes in the semantic context ψ(t).

In this work, every term in the generatability graph is considered to be both a hub and an authority with certain scores. The hub score of a term indicates how well the term can generate good authorities in its semantic context; the authority score of a term indicates how well the term can be generated by good hubs in the semantic context. Since authorities need not necessarily belong to the topic, the topical markers need to be chosen using hubs. But good hubs by themselves sometimes turn out to be relatively rare terms of the topic, or even terms which are under-represented in the corpus and hence merely seem rare and unique to a context. This problem is alleviated by choosing terms which are both good hubs and good authorities, i.e., terms with high hub and authority scores. Some example results are given in Table 5.

4.3.5. Validating the hypothesis

The evaluation methodology was through human validation and was similar to that of semantic siblings. Since each query results in many correct topical markers, the volunteers were shown the top 30 results and asked to select the correct topical markers for the given topic. A set of 50 topics was randomly chosen as queries from the set of human-generated queries for topical anchors and semantic siblings. In total, 19 people volunteered for this evaluation. Thirty topical markers were generated for each of the 50 query topics, so every evaluator had to make a maximum of 1500 decisions; hence they were asked to evaluate only as many queries as they felt comfortable with. As in the case of semantic siblings, to eliminate accidental clicks and other trivial biases, every term in the results which was chosen by at least two evaluators was considered a topical marker. Based on this evaluation method, our algorithm yielded an overall precision of 93.8% for generating 30 topical markers. The accuracy is plotted in Fig. 7.

In the second method of evaluation, 10 of the 50 query topics were randomly chosen. For each of these topics, we selected the top 10 topical markers generated by our algorithm, and for each of them we performed a web search using the Google Custom Search API (http://developers.google.com/custom-search/). By default, this API searches only on specified websites, but it was suitably configured to perform searches over the entire web.




Fig. 6. Example: semantic context to bipartite graph.

For each of the 10 topical markers used as search queries, we selected the top 10 results returned by the Google Custom Search API. Hence, there were 100 web pages for each topic (10 results for each of the 10 topical markers of the topic). We manually classified these web pages as being relevant to the topic or not. In total, 1000 such decisions were made (for the 10 topics), and the evaluation resulted in 902 of the 1000 pages being correct. The search queries were the topical markers generated by our algorithm for the given topic, not the topics themselves. Hence, the high number of web pages classified as correctly belonging to the topic indicates the effectiveness of the topical markers algorithm in generating terms which uniquely determine the topic. Fig. 8 shows, for each topic, the number of search results across all 10 of its topical markers that were classified as correctly belonging to the topic; the 10 topics are shown on the x-axis.

We also compared the topical markers with the top 30 terms having the highest TF–IDF for each of the 50 query topics. The TF–IDF metric used for this purpose is the same as that in Eq. (13), used in the topical anchors evaluation. The TF–IDF algorithm fares very poorly compared to our topical markers algorithm, as it fails to generate terms that are not just important but also unique to the topic. For example, for the topic capitalism, the terms with the highest TF–IDF scores are the United States, Germany, France, the United Kingdom, etc. Even though these terms are important to capitalism, they are not topical markers.

In addition, we performed a qualitative comparison of our results with Google AdWords (http://adwords.google.com) for the query topics. The Google AdWords service allows the user to find important keywords for any topic which they may wish to advertise. This mostly helps in query expansion, but it is not suitable for finding Statistically Improbable Phrases (SIPs) for the topic. For example, for the topic quantum mechanics, the results generated by Google AdWords and the topical markers generated by our algorithm are shown in Table 6. All the keywords suggested by Google AdWords contain the term quantum mechanics; they therefore miss the important topical markers which do not contain the topic name yet uniquely identify it. This suggests that augmenting an application like Google AdWords with topical markers may be attractive.

4.4. Topic expansion

The last semantic association we consider in this paper is called Topic Expansion. Topic expansion of a topic t is the process of unfolding t into a set of concepts that are collectively "about" t, which is essentially the opposite problem of topical anchor computation. Topic expansion is a divergent computation and also needs to contend with the problem of multiple senses of terms. Each sense of a term represents a different concept in the analytic layer, but at the linguistic layer, the same term is used to refer to the different concepts.

4.4.1. Specific related work

Word Sense Disambiguation (WSD) algorithms by Widdows et al. [12,13] use a co-occurrence graph constructed from lists of terms which share a semantic sibling relationship. To disambiguate the senses of a term, its neighborhood graph is clustered into different components belonging to distinct senses. In a similar vein, SenseClusters [37] is a tool which uses co-occurrences as feature vectors so as to use existing clustering techniques.
Along with co-occurrences, they also use other features (derived using heuristics or from algorithms like LSA) to find clusters representing different senses of a given word. There are several WSD algorithms which choose terms based on morphological analysis of the text to create a co-occurrence graph [38–41]. Among these, HyperLex [39] mines a minimal spanning tree in a context defined over an adjective–noun co-occurrence graph to identify a topic hierarchy specific to the context. Pantel and Lin [41] use clustering by committee on a co-occurrence graph created out of special contexts like verb–object pairs to differentiate the senses of a given word. It should be noted that the above approaches construct specialized forms of co-occurrence graphs tailored for WSD. In contrast, the focus here is on a generic framework and methodology that can achieve similar results.

Another relevant area of work is topic modeling algorithms like LDA [20]. LDA can be seen as clustering the terms in the corpus along the different topics that represent the corpus. While this is different from expanding a topic given its topical anchor term, we suitably apply LDA so as to use it as a benchmark for comparison. Topic expansion finds various applications where a more detailed description of topics is needed, as well as in sense disambiguation. As with the rest of the episodic hypotheses described earlier, we follow the 3-layer methodology for topic expansion.



Table 5
Topical marker results.

machine learning: weighted majority algorithm, boosting, semi-supervised learning, knowledge discovery, institute of photogrammetry and geoinformation, supervised learning, 3d geometry, international conference on machine learning, concept drift, data mining, data stream mining, unsupervised learning, computational intelligence, image interpretation, rapidminer

quantum mechanics: quantum tic tac toe, schrödinger equation, hamiltonian, copenhagen interpretation, epr paradox, quantum field theory, density matrix, quantum entanglement, wave function, wave function collapse, physics, bell's theorem, electron, classical mechanics, hilbert space

Fig. 7. Accuracy of the topical markers algorithm.

4.4.2. Intensional definition

For a topic represented by a concept t, a topic expansion TE(t) = {c1, c2, c3, …, cn} is a set of concepts which collectively display a high aboutness for t. When describing the results of a topic expansion, it also makes sense to order the concepts by their individual aboutness scores for t, making TE(t) a tuple of terms ⟨c1, c2, c3, …, cn⟩, where:

$$A(c_1 \to t) \geq A(c_2 \to t) \geq A(c_3 \to t) \geq \ldots \geq A(c_n \to t).$$

We know that a concept has the maximum aboutness score of 1 for itself. Hence it will always be the first concept in TE(t); that is, c1 = t.
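Read operationally, the definition suggests ranking candidate concepts by their aboutness for t. A minimal sketch follows; the aboutness(c, t) helper returning the A(c → t) score is assumed to be available from the framework, and its signature is ours:

```python
def topic_expansion_by_aboutness(candidates, t, aboutness, n=10):
    """Order a set of candidate concepts by their aboutness for topic t.

    aboutness(c, t) is assumed to return the A(c -> t) score;
    t itself always ranks first, since A(t -> t) = 1 is maximal.
    """
    ranked = sorted(candidates | {t}, key=lambda c: aboutness(c, t), reverse=True)
    return ranked[:n]
```

The actual algorithm below is more involved, since it must also separate the different senses of t into distinct clusters.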

Fig. 8. Effectiveness of topical markers in determining the topic.



Table 6
Results for quantum mechanics.

Google AdWords: postulates of quantum mechanics, basics of quantum mechanics, quantum mechanics pdf, quantum mechanics basics, perturbation theory quantum mechanics, what is quantum mechanics, quantum mechanical model of atom, quantum mechanics wiki, lectures on quantum mechanics, books on quantum mechanics, quantum mechanics books, application of quantum mechanics, quantum mechanics video lectures, relativistic quantum mechanics, quantum statistical mechanics

Topical markers: quantum tic tac toe, schrödinger equation, hamiltonian, copenhagen interpretation, epr paradox, quantum field theory, density matrix, quantum entanglement, wave function, wave function collapse, physics, bell's theorem, electron, classical mechanics, hilbert space

4.4.3. Episodic hypothesis

In a long enough conversation about a concept t, the probability of the joint occurrence of concepts about t, including t itself, is much higher than the joint probability of concepts unrelated to t. In other words, as the length and the number of episodes about concept t increase, topically relevant terms tend to cluster together within and across episodes, and these clusters tend to include the topical term t. The last condition makes the hypothesis consistent with the episodic hypothesis for topical anchors.

4.4.4. Co-occurrence algorithm

For a given term t, there are four steps to translate the episodic hypothesis into a co-occurrence algorithm. The episodic hypothesis requires us to partition the neighborhood of t into clusters. To do this, we generate all possible clusters from N(t) and then reduce this set by a series of cluster merging and filtration steps.

4.4.4.1. Cluster generation. Given the topical term t, we start expanding t to include other co-occurring terms based on their generatability. If N(t) has k nodes, then a total of k clusters are initially generated. This is done by running the cluster generation algorithm separately with each of the k co-occurring terms; the ith run generates a cluster using t and the ith most generatable term in N(t). In any given run, as the expanded set grows, we consider it as one unit and take the generatability scores from the focus of the expanded set. The cluster generation algorithm for the ith run is described below:

Algorithm 3. Cluster generation with the ith most generatable term.
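Algorithm 3 appears as a figure in the original. A rough Python sketch consistent with the surrounding description follows; the greedy expansion rule and the stopping threshold are our assumptions, and generatability(cluster, term) is assumed to return the Γ score of generating a term from the expanded set treated as one unit:

```python
def generate_cluster(t, u_i, neighbors, generatability, threshold=0.1):
    """Grow the i-th cluster from topic t and its i-th most
    generatable neighbor u_i (sketch; expansion rule assumed)."""
    cluster = {t, u_i}
    candidates = set(neighbors) - cluster
    while candidates:
        # Pick the term the current expanded set generates most strongly.
        best = max(candidates, key=lambda x: generatability(cluster, x))
        if generatability(cluster, best) < threshold:
            break  # no remaining term is generatable enough
        cluster.add(best)
        candidates.remove(best)
    return cluster


def generate_all_clusters(t, neighbors, gen_from_t, generatability):
    """One cluster per neighbor of t, seeded in decreasing order of
    generatability from t; the run index i is the cluster's rank."""
    ordered = sorted(neighbors, key=gen_from_t, reverse=True)
    return [generate_cluster(t, u, neighbors, generatability) for u in ordered]
```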

The term u_i, the ith most generatable term from t, is used as the "key" sense term for expanding the ith most important cluster. The index i associated with a cluster represents the importance rank of the generated cluster.

4.4.4.2. Cluster merging. Cluster generation produces all candidate clusters generated from the given topic t and one of its neighbors. After this step, two or more clusters may represent the same sense and are redundant as separate clusters. In the second step, we progressively merge clusters based on their similarity. The similarity between clusters Ca and Cb is given by their overlap:

$$O(C_a, C_b) = \frac{|C_a \cap C_b|}{\min(|C_a|, |C_b|)} \tag{16}$$

Algorithm 4 describes cluster merging. The outcome of cluster merging should be the dominant clusters depicting the different senses of the topic t. However, there could still be extraneous clusters left that are not highly generatable from t and do not depict any major sense of the topic t.



Algorithm 4. Cluster merging.
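Algorithm 4 is likewise shown as a figure. A minimal sketch of greedy merging consistent with the text, where the most-overlapping pair is merged until the maximum overlap falls below a threshold (assumed here to be the α of Section 4.4.5), and the merged cluster keeps the lower, i.e. more important, index:

```python
def overlap(a, b):
    """Cluster similarity O(Ca, Cb) from Eq. (16)."""
    return len(a & b) / min(len(a), len(b))


def merge_clusters(clusters, alpha=0.9):
    """Greedily merge the most-overlapping pair of clusters until the
    maximum pairwise overlap drops below alpha. `clusters` is a list of
    sets ordered by importance (lower index = more important)."""
    clusters = [set(c) for c in clusters]
    while len(clusters) > 1:
        pairs = [(overlap(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        score, i, j = max(pairs)
        if score < alpha:
            break
        clusters[i] |= clusters[j]   # merged cluster keeps the lower index i
        del clusters[j]
    return clusters
```

The filtration step described next reuses the same loop, except that the less important of the two most-overlapping clusters is dropped rather than merged, with the threshold β.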

4.4.4.3. Filtration. In the third step, extraneous clusters which do not depict any major sense of t are filtered out. Recall that the cluster index represents the "importance" of a cluster, i.e., the dominance of the sense that the cluster represents: the lower the index, the greater the importance. When two clusters are merged in step two, the result retains the lower index. This means that the extraneous clusters that are left over tend to have a higher index, and hence lower importance. The filtration step drops such low-importance clusters using a logic similar to cluster merging: each time, the pair of clusters with the highest overlap is chosen, and the less important of the two is dropped. This process is repeated until the maximum overlap between any two clusters drops below another moderate threshold β. After this, we arrive at n ≤ S clusters, each representing a different dominant sense of the topic.

4.4.4.4. Ranking. In the final step, we rank the terms in each topic expansion cluster in decreasing order of their importance with respect to the sense of the cluster. The importance of a term is computed as its exclusivity score, where the exclusivity of two terms t_m and t_n is defined as:

$$E(t_m, t_n) = \Gamma_{t_m \to t_n} \cdot \Gamma_{t_n \to t_m} \tag{17}$$

Exclusivity is the product of the two-way generatabilities and can be modeled as an undirected relation on the co-occurrence graph. The exclusivity score of a term with respect to the topic represents not only the importance of the term to the topic, but also the importance of the topic itself to the given term.
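A short sketch of the ranking step over the same dict-of-dicts generatability graph used earlier (illustrative code, not the authors' implementation):

```python
def exclusivity(gamma, a, b):
    """E(a, b) = Gamma(a -> b) * Gamma(b -> a), as in Eq. (17)."""
    return gamma.get(a, {}).get(b, 0.0) * gamma.get(b, {}).get(a, 0.0)


def rank_cluster(gamma, cluster, t):
    """Order a sense cluster by each term's exclusivity with topic t."""
    return sorted(cluster, key=lambda x: exclusivity(gamma, x, t), reverse=True)
```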

Overall, the topic expansion algorithm has a time complexity of O(N³), where N is the number of nodes in the semantic context ψ(t). Table 7 shows some examples of topic terms and the dominant senses expanded using the above algorithm.

Table 7
Example results from the topic expansion experiments.

Topic (t): amazon
Sense 1: amazon river, amazon rainforest, rainforest, brazil, peru, andes
Sense 2: amazon.com, brazil, consumer electronics, services, mp3, internet, company, october 23, software
Sense 3: amazons, river, artemis, nile, greek, herodotus, mythology, civilization, black sea

Topic (t): corpus
Sense 1: habeas corpus, eighth amendment, mandamus, capital punishment in the united states, writ, appeal, sentence, amendment, governor
Sense 2: native speaker, word sense disambiguation, natural language, machine translation, computational linguistics, natural language processing, substitution, linguistic typology, recognition
Sense 3: hippocratic corpus, kos, hippocrates, history of medicine, oath, acute, new york university, medical, byzantine

Topic (t): filter
Sense 1: boolean prime ideal theorem, order theory, ideal, partially ordered set, boolean algebra, glossary of order theory, infimum, axiom of choice, lattice
Sense 2: glass, water, light, metal, oxygen, color, heat, chemistry, fish
Sense 3: camera, exposure, photography, glass, lens, photographic film, visible light, light, optics

4.4.5. Validating the hypothesis

To evaluate topic expansion, 25 ambiguous, polysemous terms were chosen as topics so as to demonstrate the sense disambiguation aspect of the algorithm. For each of these 25 terms, the topic expansion algorithm was run with α = 0.9 and β = 0.5. The results of the algorithm were compared with topic modeling and word sense disambiguation algorithms.

For each of the 25 terms, we compared the topic expansions (clusters) generated by our algorithm with the topics generated by LDA. Instead of giving a topic term as input, we gave all the documents in our corpus containing the term as input to LDA. LDA generates a finite number of topics as output; in our case, we generated 10 topics. Some of the input documents might mention the topic only in passing, and such documents could cause LDA to generate some unrelated topics. To account for this, we manually chose the three most related topics from the 10 output topics such that they possessed distinct senses. For the same topic term, we picked the three best clusters from the results of topic expansion using the inherent ranking of clusters generated by the algorithm. To evaluate the algorithms, we represented each cluster using its top 10 terms.

To measure the goodness of the clusters, we used two metrics: cohesiveness and relatedness.



Fig. 9. Comparison of cohesiveness scores between topic expansion and LDA.

The cohesiveness of a cluster is a measure assigned by the evaluator, specifying how closely related the terms within the cluster are in forming a given sense. The relatedness of a cluster is a measure assigned by the evaluator, specifying the relevance of the cluster to one of the senses of the topic term t. There were six such clusters for each query topic term (three from LDA and three from topic expansion), and the clusters were jumbled such that an evaluator looking at them would not be able to identify which algorithm generated each. A total of 22 volunteers evaluated the topic expansion of each of the input topic terms. We asked them to rate each cluster on its cohesiveness and its relatedness on a scale of 0 to 3, with 0 corresponding to nonsense or completely unrelated and 3 being excellent. Such evaluations require a lot of manual effort: there were 25 topics, each having six clusters, where each cluster had 10 terms. Hence, for the complete evaluation, the volunteers had to essentially go through 1500 terms and comprehend them before passing judgment. They were given the choice of evaluating only those sets which they were comfortable with. On average, we found that each volunteer evaluated 15 input terms.

The cohesiveness of each cluster was calculated as the mean evaluator rating for cohesiveness of that cluster. The overall cohesiveness score of a topic term was the mean of its three cluster-wise cohesiveness scores. The overall cohesiveness score was computed for both algorithms for each of the input topic terms, as shown in Fig. 9. We observed that the cohesiveness of the topic expansion clusters was better than that of the LDA clusters. Over all the terms, the average cohesiveness score was 2.22 for topic expansion, whereas it was 1.7 for LDA. We also computed the overall relatedness score, in a similar fashion, for each topic term, for topic expansion and LDA clusters separately. We observed that the relatedness of the topic expansion clusters was always higher than that of the LDA clusters for any given topic. The relatedness scores for the two algorithms are shown in Fig. 10. Over all the terms, the average relatedness score was 2.16 for topic expansion, whereas it was 1.48 for LDA.

In another experiment, we compared the relatedness scores of clusters generated using topic expansion against the relatedness scores of clusters generated by a word sense disambiguation algorithm [13]. That algorithm uses Markov Clustering (MCL) on the neighborhood graph of the topic term, after removing the topic term, to detect different senses. We used the same algorithm with the parameters given in the original paper (inflation parameter = 2, expansion parameter = 2), computed the clusters, and picked the 10 most important nodes in each cluster for our evaluation. This word sense disambiguation algorithm was proposed on a specific co-occurrence graph built by connecting the nouns occurring in list-structured data. This and other word sense disambiguation algorithms rely on cues based on linguistic structure (like lists and syntactical patterns) in the data and are not well suited for unstructured data. To demonstrate this, we used our co-occurrence graph instead of a graph of nouns co-occurring in lists in this experiment. The evaluation methodology was similar to that described earlier, and the overall relatedness and cohesiveness scores were computed for each of the 25 input terms based on evaluator inputs.
We found that the results of topic expansion were considerably better than those of the MCL-based clustering method, as shown in Fig. 11. Over all the terms, the average relatedness score was 2.16 for topic expansion, whereas it was 1.003 for word sense disambiguation.

Fig. 10. Comparison of relatedness scores between topic expansion and LDA.



Fig. 11. Comparison of relatedness scores between topic expansion and word sense disambiguation.

We also observed that the latter method does not focus on ordering the terms within a cluster according to their importance with respect to the topic. Hence the cohesiveness scores of its clusters, based on their top 10 terms, were quite low and did not warrant a comparison. However, for the sake of completeness, when compared, the average cohesiveness score over all the terms was 0.75 for word sense disambiguation, while it was 2.22 for topic expansion. These experiments show that the topic expansion algorithm gives an ordered set of highly exclusive terms with respect to the topic, and that it compares favorably against existing topic modeling and word sense disambiguation algorithms.

5. Conclusions

The research presented in this paper approaches the problem of mining latent semantics with two specific objectives: (a) to explore generic infrastructural elements that can be used to solve several forms of latent semantics mining, and (b) to be able to explain the extracted results. These objectives led us to look into cognitive science, and the result is the proposed 3-layer framework and methodology.

5.1. Future directions

Cognitive modeling has a rich body of literature which can deeply impact research in text mining and analytics. In fact, our contention is that the notion of analytics will eventually give way to a notion of cognitics. While analytics is primarily about extracting knowledge from data, cognitics is about model building as well as feeding semantics back into the operations of the system from which the data is collected. Just as major application programs are currently shipped with an inbuilt analytics module, we envisage that future applications will be shipped with a cognitics module that not only extracts knowledge from the application's dynamics, but also intelligently contributes to those dynamics. Rudimentary forms of cognitics already exist in the form of recommender systems.

References

[1] M. Sahlgren, The Word-Space Model, (Ph.D. thesis) Stockholm University, 2006.
[2] H. Rubenstein, J.B. Goodenough, Contextual correlates of synonymy, Commun. ACM 8 (10) (1965) 627–633.
[3] D.L.T. Rohde, L.M. Gonnerman, D.C. Plaut, An improved method for deriving word meaning from lexical co-occurrence, Cogn. Sci. 7 (2004) 573–605.
[4] I. Dagan, F.C. Pereira, L. Lee, Similarity-based estimation of word cooccurrence probabilities, ACL '94, 1994, pp. 272–278.
[5] M. Patel, J.A. Bullinaria, J.P. Levy, Extracting semantic representations from large text corpora, Proceedings of the 4th Neural Computation and Psychology Workshop, 1998, pp. 199–212.
[6] Y. Ohsawa, N.E. Benson, M. Yachida, Keygraph: automatic indexing by co-occurrence graph based on building construction metaphor, ADL '98, 1998, pp. 12–18.
[7] M.W. Berry, Survey of Text Mining, Springer-Verlag, 2003.
[8] A. Hotho, A. Nuernberger, G. Paass, A brief survey of text mining, LDV Forum GLDV J. Comput. Linguist. Lang. Technol. 20 (1) (2005) 19–62.
[9] B. Pang, L. Lee, Opinion mining and sentiment analysis, Found. Trends Inf. Retr. 2 (2008) 1–135.
[10] F. Sebastiani, Machine learning in automated text categorization, ACM Comput. Surv. 34 (2002) 1–47.
[11] E. Hovy, R. Navigli, S.P. Ponzetto, Collaboratively built semi-structured content and artificial intelligence: the story so far, Artif. Intell. 194 (1) (2013) 2–27.
[12] D. Widdows, B. Dorow, A graph model for unsupervised lexical acquisition, 19th International Conference on Computational Linguistics, 2002, pp. 1093–1099.
[13] B. Dorow, D. Widdows, K. Ling, J.-P. Eckmann, D. Sergi, E. Moses, Using curvature and Markov clustering in graphs for lexical acquisition and word sense discrimination, 2nd Workshop Organized by the MEANING Project (MEANING-2005), Trento, Italy, 2005.
[14] R. Mihalcea, P. Tarau, TextRank: bringing order into texts, EMNLP '04, 2004.
[15] A. Ghose, P.G. Ipeirotis, A. Sundararajan, Opinion mining using econometrics: a case study on reputation systems, ACL '07, 2007, pp. 416–423.
[16] I. Dagan, L. Lee, F.C. Pereira, Similarity-based models of word cooccurrence probabilities, Mach. Learn. 34 (1) (1999) 43–69.
[17] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, Indexing by latent semantic analysis, J. Am. Soc. Inf. Sci. 41 (1990) 391–407.
[18] K. Lund, C. Burgess, Producing high-dimensional semantic spaces from lexical co-occurrence, Behav. Res. Methods Instrum. Comput. 28 (1996) 203–208.
[19] T. Hofmann, Probabilistic latent semantic indexing, SIGIR '99, 1999, pp. 50–57.
[20] D. Blei, A. Ng, M. Jordan, Latent Dirichlet allocation, J. Mach. Learn. Res. 3 (2003) 993–1022.
[21] M. Steyvers, T. Griffiths, Probabilistic topic models, in: T. Landauer, S.D. McNamara, W. Kintsch (Eds.), Latent Semantic Analysis: A Road to Meaning, Laurence Erlbaum, 2007, pp. 424–440.
[22] L. Wittgenstein, Tractatus Logico-Philosophicus, Routledge, 1922 (Propositions 2 and 3).
[23] L. Wittgenstein, Philosophical Investigations, Blackwell, 1953 (Proposition 43, translated by G.E.M. Anscombe).
[24] E. Tulving, Episodic and semantic memory, in: E. Tulving, W. Donaldson (Eds.), Organization of Memory, 1972.



[25] A.R. Rachakonda, S. Srinivasa, Finding the topical anchors of a context using lexical cooccurrence data, CIKM '09, 2009, pp. 1741–1746.
[26] A.R. Rachakonda, S. Srinivasa, Vector-based ranking techniques for identifying the topical anchors of a context, COMAD '09, 2009, pp. 9–20.
[27] S. Abiteboul, M. Preda, G. Cobena, Adaptive on-line page importance computation, WWW '03, 2003, pp. 280–290.
[28] D. Lin, Automatic retrieval and clustering of similar words, COLING '98, 1998, pp. 768–774.
[29] M. Brunzel, M. Spiliopoulou, Discovering semantic sibling groups from web documents with XTREEM-SG, EKAW '06, 2006, pp. 141–157.
[30] L. Sarmento, V. Jijkuon, M. de Rijke, E. Oliveira, "More like these": growing entity classes from seeds, CIKM '07, 2007, pp. 959–962.
[31] Y. He, D. Xin, SEISA: set expansion by iterative similarity aggregation, WWW '11, 2011, pp. 427–436.
[32] W. Phillips, E. Riloff, Exploiting strong syntactic heuristics and co-training to learn semantic lexicons, EMNLP '02, vol. 10, ACL, Stroudsburg, PA, USA, 2002, pp. 125–132.
[33] S. Kullback, R.A. Leibler, On information and sufficiency, Ann. Math. Stat. (1951) 79–86.
[34] K.S. Jones, A statistical interpretation of term specificity and its application in retrieval, J. Doc. 28 (1972) 11–21.
[35] M. Errami, Z. Sun, A.C. George, T.C. Long, M.A. Skinner, J.D. Wren, H.R. Garner, Identifying duplicate content using statistically improbable phrases, Bioinformatics 26 (11) (2010) 1453–1457.
[36] J.M. Kleinberg, Authoritative sources in a hyperlinked environment, J. ACM 46 (5) (1999).
[37] A. Purandare, T. Pedersen, SenseClusters: finding clusters that represent word senses, Demonstration Papers at HLT-NAACL 2004, ACL, Stroudsburg, PA, USA, 2004, pp. 26–29.
[38] K. Tanaka-Ishii, H. Iwasaki, Clustering co-occurrence graph based on transitivity, 5th Workshop on Very Large Corpora, 1997, pp. 91–100.
[39] J. Véronis, HyperLex: lexical cartography for information retrieval, Comput. Speech Lang. 18 (3) (2004) 223–252.
[40] E. Agirre, D. Martínez, O.L. de Lacalle, A. Soroa, Two graph-based algorithms for state-of-the-art WSD, EMNLP 2006, ACL, 2006, pp. 585–593.
[41] P. Pantel, D. Lin, Discovering word senses from text, SIGKDD '02, ACM, 2002, pp. 613–619.

Aditya Ramana Rachakonda holds a PhD and an M.Tech. from the International Institute of Information Technology, Bangalore, India. His research interests include co-occurrence based semantics extraction, text mining and information retrieval. He was an intern at Yahoo! Labs and contributed to Yahoo! Facebook Search. He has also worked on research projects funded by Wipro Applied Research and HP Labs. He is currently employed at Big Data Labs, American Express, India.

Srinath Srinivasa holds a PhD from the Berlin-Brandenburg Graduate School for Distributed Information Systems (GkVI), Germany, and an MS from IIT Madras, India. He works in the broad areas of web sciences, multi-agent systems, network analysis and text mining. He is a member of various technical and organizational committees for international conferences. As part of academic community outreach, Srinath has served on the Board of Studies of Goa University and as a member of the Academic Council of the National Institute of Engineering, Mysore. He has served as a technical reviewer for various journals like the VLDB Journal, IEEE Transactions on Knowledge and Data Engineering, and IEEE Transactions on Cloud Computing. He is also the recipient of various national and international grants for his research activities.

Sumant Kulkarni is a PhD candidate at IIIT Bangalore. He holds a Bachelor's degree in Computer Science and Engineering from the Sri Jayachamarajendra College of Engineering, Mysore, India. He has worked in the storage and set-top box domains for more than four years. His interests include cognitive models for semantics, topic discovery and expansion, conceptual modeling and information retrieval. He has developed indexing and retrieval methodologies for Unicode text in the Kannada language.

M.S. Srinivasan is pursuing his MS at IIIT Bangalore. He holds a Bachelor of Engineering in Computer Science from the Government College of Technology, Coimbatore, India. He has worked in the areas of application development, services and product development for more than 12 years. His primary work is in the area of developing a cognitive model for discovering celebrities from Twitter data. His other interests include machine learning, topic modeling and knowledge representation. He is also affiliated with IBM India Software Labs, Bangalore, India.