Fusing distributional and experiential information for measuring semantic relatedness



Information Fusion 14 (2013) 281–287

Contents lists available at SciVerse ScienceDirect

Information Fusion journal homepage: www.elsevier.com/locate/inffus

Yair Neuman ⇑, Dan Assaf, Yohai Cohen

Department of Education, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel

Article info

Article history:
Received 6 October 2011
Received in revised form 6 December 2011
Accepted 2 February 2012
Available online 23 February 2012

Keywords:
Semantic representation
Semantic relatedness
Cognition
Family resemblance
Semiotics
Interdisciplinary research

Abstract

Models of semantic relatedness have usually focused on language-based distributional information without taking into account “experiential data” concerning the embodied sensorial source of the represented concepts. In this paper, we present an integrative cognitive model of semantic relatedness. The model, semantic family resemblance, uses a variation of the co-product as a mathematical structure that guides the fusion of distributional and experiential information. Our algorithm provides superior results in a set expansion task and a significant correlation with two benchmark datasets of human-rated word-pair similarity.

© 2012 Elsevier B.V. All rights reserved.

⇑ Corresponding author. E-mail address: [email protected] (Y. Neuman).
1566-2535/$ - see front matter © 2012 Elsevier B.V. All rights reserved. doi:10.1016/j.inffus.2012.02.001

1. Introduction

Semantic similarity, or in its more extensive sense “semantic relatedness”, involves the degree to which two words are close in their meaning [1–4]. In fact, this notion can be traced back to antiquity and to Aristotle, who identified four strategies through which associations are formed in our mind [5]: similarity (e.g., an orange and a lemon), difference (e.g., high versus low), contiguity in time (e.g., sunrise and a rooster’s crow), and contiguity in space (e.g., a cup and a spoon). These perceptually grounded associations may provide the basic strata for semantic relatedness, as words correspond with concepts that have an embodied base [6–8].

While the sensorimotor aspect of associations has been recognized from antiquity, the availability of huge language corpora has naturally led to the reliance on “language-based distributional data” for building semantic representations and for measuring semantic similarity and relatedness. This approach may computationally solve the problem of measuring semantic relatedness through brute force and huge knowledge sources such as Wikipedia, WordNet, or the New York Times archive [2,9,10]. However, it cannot replace a cognitively economical model of semantic relatedness. As argued by Perlovsky [11, p. 2099]: “Current engineering approaches attempt to develop computer capabilities for language and cognition separately ... Nature does it differently.” In a series of papers [12–15] Perlovsky argues that “Abstract thoughts cannot emerge without language” [13, p. 71] and that the logic of this interdependence involves the knowledge instinct (KI) to fit top-down to bottom-up signals. Moreover, he argues that conceptual structures evolve from “vague-to-crisp”, and that language that is significantly crisp [13, p. 75] may mediate the emergence of conceptual structures. For instance, while the concept of “dog” is perceptually vague, the use of the sign “dog” to signify the objects of the dogs’ family may group them together, despite huge differences between dog instances such as a Chihuahua and a Saint Bernard. While presenting a mathematical model of this process, Perlovsky does not directly address the issue of semantic relatedness, which may be an interesting test case for his mathematical modeling of language and thought.

In psychology, Andrews et al. [16] criticize the orthogonality of the experiential and distributional dimensions in models of semantic representation and call for an integrated model. Moreover, they provide empirical support for the benefits of an integrative model. Nevertheless, their reliance on hand-crafted semantic feature norms limits the representation of the experiential dimension in two important senses. Theoretically, this model assumes that people hold in their mind huge feature sets for each word. It is doubtful whether this assumption is the most economical one. Practically, the norms used by the researchers were hand-crafted and cannot be easily extended to new words and contexts.

There is, however, another source of difficulty in integrating the two dimensions for studying semantic relatedness. Human language as a complex evolutionary system, e.g. [17,18], involves the abstraction of signs from their original embodied context through a complex network of connotations. For instance, while the adjective “sweet” is sensorially embedded in our taste, chains of connotations [5] lead it far beyond its source to abstract senses such as “Sweet Dreams” [8]. In other words, the meaning of a word is an emerging phenomenon that cannot be simply grounded in experiential data. This difficulty is further elaborated through Wittgenstein’s idea of family resemblance.

1.1. Family resemblance

One of the main problems in understanding concept formation is that different instances of a given concept do not share a fixed set of features that may be analytically used for defining it. This implies that feature norms are limited in modeling the resemblance of concepts and the words we use to represent them. For instance, looking for a set of features that uniquely defines “game” may end in bitter disappointment. After all, what is the set of defining features that uniquely characterizes both playing with a teddy bear and playing chess? This lack of essential features for defining a concept is usually referred to in the context of family resemblance. Wittgenstein coined the term “family resemblance” [19] as a part of his attack on essentialism. The idea behind “family resemblance” is that instances of a given concept are “united not by a single common defining feature, but by a complex network of overlapping and crisscrossing similarities” [20, p. 121]. The same argument holds for words that represent these concepts. The semantic relatedness of words cannot be simply reduced to feature norms because family resemblance, whether of words or concepts, is an emerging property of the semantic network. This emerging meaning cannot be captured at the micro or the macro level of the network but only at its mesoscopic level [21,22].
A model that accounts for family resemblance of words should be located at this mesoscopic level and show how semantic relatedness is an “emerging” phenomenon.

1.2. Lexical priming and abstraction

In this paper, we address the challenge of integrating the experiential and the distributional dimensions through two cognitive processes: abstraction and lexical priming. Lexical priming [23] involves the idea that every word is mentally primed to occur with particular other words, semantic sets, and pragmatic functions. In this context, semantic associations are formed when a word (or a word sequence) is associated in the mind of a language-user with a semantic set or class, some members of which are also collocates for that user [23, p. 24]. In other words, the distributional dimension of semantic representation may be traced back to the psycho-linguistic process of lexical priming and the way it guides our processing of linguistic data. A measure of lexical priming may be used for constructing the distributional dimension.

The second dimension can be studied through abstraction. Language involves words that refer to concrete entities whose basic meaning is clearly grounded in our sensorimotor experience. On the other hand, language also denotes objects whose meaning cannot be traced to sensorimotor experience. Such words, for example “God” or “Justice”, are clearly more abstract than “doughnut” or “apple”. The meaning of words may be associated with their level of abstractness. For instance, by using the Corpus of Contemporary American English (COCA) [24], we found that while the word “love” primes abstract nouns such as “affair” and “affection”, it also primes more concrete nouns such as “sweets” and “songs”. Based on their level of abstractness, we may better understand the meaning of nouns primed by “love” and group them into sets and pragmatic functions. For instance, from the perspective of pragmatic function, songs and sweets are gifts given by the lover to his or her loved one. In this sense, and combined with lexical priming, the abstractness level of a word may be used as a heuristic cue for building the experiential dimension of a semantic representation. Given the importance of heuristics in human cognition [25], it is worth examining the use of abstractness level as such a heuristic. Nevertheless, it is clear that neither dimension is sufficient when used in isolation from its complement. Therefore, we present an abstract structure that fuses these sources of information. First, however, we present the way in which we measured abstractness and lexical priming.

2. Measuring abstractness and concreteness

This section describes the way in which we measured the abstractness level of words as a step toward the identification of the experiential dimension and the measurement of semantic relatedness. Concrete words refer to things, events, and properties that we can perceive directly with our senses, such as banana, tree, and sweet. Abstract words refer to ideas and concepts that are distant from immediate perception, such as God, justice, and science. In this section, we describe an algorithm that can automatically calculate a numerical rating of the degree of abstractness of a word on a scale from 0 (highly concrete) to 1 (highly abstract). The algorithm was developed by Peter Turney and is cited in [8,26]. It is a variation of Turney and Littman’s [27] algorithm that rates words according to their semantic orientation, and it calculates the abstractness of a given word by comparing it to 20 abstract words and 20 concrete words that are used as paradigms of abstractness and concreteness. The abstractness of a given word is the sum of its similarity with the 20 abstract paradigm words minus the sum of its similarity with the 20 concrete paradigm words.
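The scoring rule just described can be illustrated with a minimal sketch. It is not Turney's implementation: the two-dimensional vectors, the two paradigm words per class (instead of 20), and the min-max rescaling to the 0 to 1 scale are all assumptions made here for illustration, since the text does not state how raw scores are mapped onto that scale.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def raw_abstractness(word, vectors, abstract_paradigms, concrete_paradigms):
    """Sum of similarities to abstract paradigms minus sum of similarities to concrete ones."""
    w = vectors[word]
    pos = sum(cosine(w, vectors[p]) for p in abstract_paradigms)
    neg = sum(cosine(w, vectors[p]) for p in concrete_paradigms)
    return pos - neg

def rescale(scores):
    """Min-max rescaling to [0, 1]; an assumption, not stated in the paper."""
    lo, hi = min(scores.values()), max(scores.values())
    return {w: (s - lo) / (hi - lo) if hi > lo else 0.5 for w, s in scores.items()}

# Hypothetical toy vectors standing in for rows of a real vector space model.
vectors = {
    "justice": [0.9, 0.1], "god": [0.8, 0.2],      # abstract paradigm words
    "banana": [0.1, 0.9], "tree": [0.2, 0.8],      # concrete paradigm words
    "science": [0.7, 0.3], "apple": [0.15, 0.85],  # words to be rated
}
abstract_paradigms = ["justice", "god"]
concrete_paradigms = ["banana", "tree"]

raw = {w: raw_abstractness(w, vectors, abstract_paradigms, concrete_paradigms)
       for w in ("science", "apple")}
ratings = rescale(raw)
print(raw, ratings)
```

With only two rated words the rescaling trivially sends one to 1 and the other to 0; on a full vocabulary the same rule would spread the ratings across the interval.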
The similarity of words was measured through a Vector Space Model (VSM) of semantics [28]. The MRC Psycholinguistic Database Machine Usable Dictionary [29], which contains 4295 words rated with degrees of abstractness, was used to guide the search for paradigm words.¹ Turney used half of these words to train a supervised learning algorithm and the other half to validate it. On the testing set, the algorithm attains a correlation of 0.81 with the dictionary ratings. This indicates that the algorithm agrees well with human judgments of the degrees of abstractness of words.

For building a list of words rated according to their level of abstraction, Turney used a corpus of $5 \times 10^{10}$ words (280 gigabytes of plain text) gathered from university websites by a webcrawler² and then indexed it with the Wumpus search engine [30].³ The list was selected from the terms (words and phrases) in the WordNet lexicon.⁴ By querying Wumpus, he obtained the frequency of each WordNet term in the corpus and selected all terms with a frequency of 100 or more. This resulted in a set of 114,501 terms. Next he used Wumpus to search for up to 10,000 phrases per term, where a phrase consists of the given term plus four words to the left of the term and four words to the right of the term. These phrases were used to build a word–context frequency matrix F with 114,501 rows and 139,246 columns. A row vector in F corresponds to a term in WordNet and the columns in F correspond to contexts (the words to the left and right of a given term in a given phrase) in which the term appeared. The columns in F are unigrams (single words) in WordNet with a frequency of 100 or more in the corpus. A given unigram is represented by two columns, one marked left and one marked right.

¹ The dictionary is available at http://www.ota.oucs.ox.ac.uk/headers/1054.xml.
² The corpus was collected by Charles Clarke at the University of Waterloo.
³ Wumpus is available at http://www.wumpus-search.org/.
⁴ WordNet is available at http://www.wordnet.princeton.edu/.

Suppose r is the term corresponding to the ith row in F and c is the term corresponding to the jth column in F. Let c be marked left. Let $f_{ij}$ be the cell in the ith row and jth column of F. The numerical value in the cell $f_{ij}$ is the number of phrases found by Wumpus in which the center term was r and c was the unigram closest to r on the left side of r. That is, $f_{ij}$ is the frequency with which r was found in the context c in our corpus. A new matrix X, with the same number of rows and columns as F, was formed by calculating the Positive Pointwise Mutual Information (PPMI) of each cell in F [31]. The function of PPMI is to emphasize cells in which the frequency $f_{ij}$ is statistically surprising, and hence particularly informative. This matrix was then smoothed with a truncated Singular Value Decomposition (SVD), which decomposes X into the product of three matrices: $U_k \Sigma_k V_k^{T}$. Finally, the terms were represented by the matrix $U_k \Sigma_k^{p}$, which has 114,501 rows (one for each term) and k columns (one for each latent contextual factor). The semantic similarity of two terms is given by the cosine of the two corresponding rows in $U_k \Sigma_k^{p}$.

After generating the paradigm words with the training set and evaluating them with the testing set, Turney used them to assign abstractness ratings to every term in the matrix. As a result, we now have a set of 114,501 terms (words and phrases) with abstractness ratings ranging from 0 to 1.⁵ Based on the testing set performance, these 114,501 ratings would have a correlation of 0.81 with human ratings and an accuracy of 85% on binary (abstract or concrete) classification.

In addition, Turney has produced a word-pair matrix with 30,955 different words/phrases, listing for each word/phrase the 50 words that it significantly primes according to the procedure described above. We describe this word-pair matrix as “Turney’s primed words list” (TPWL), as it includes a list of words and the words they prime.
Now that we have (1) a list of words rated according to their abstractness level, and (2) a list of words/phrases and the words/phrases they prime, we may move to the next phase: presenting our model.
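To make the distributional step of this section concrete, the following sketch applies PPMI weighting to a tiny word–context count matrix and compares rows by cosine. It is an illustration under stated assumptions, not the pipeline used to build the actual ratings: the 3 × 3 matrix and its row and column labels are invented, the truncated SVD smoothing step is omitted to keep the example dependency-free, and the function names are hypothetical.

```python
import math

def ppmi(F):
    """Positive pointwise mutual information for each cell of a count matrix (list of rows)."""
    total = sum(sum(row) for row in F)
    row_sums = [sum(row) for row in F]
    col_sums = [sum(col) for col in zip(*F)]
    X = []
    for i, row in enumerate(F):
        out = []
        for j, f in enumerate(row):
            if f == 0:
                out.append(0.0)  # PMI is undefined for zero counts; PPMI sets the cell to 0
                continue
            pmi = math.log((f * total) / (row_sums[i] * col_sums[j]))
            out.append(max(pmi, 0.0))  # keep only positive associations
        X.append(out)
    return X

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical counts. Rows: "cat", "dog", "justice".
# Columns: "pet" (marked left), "fur" (marked right), "law" (marked left).
F = [
    [10, 8, 0],
    [9, 7, 1],
    [0, 1, 12],
]
X = ppmi(F)
sim_cat_dog = cosine(X[0], X[1])
sim_cat_justice = cosine(X[0], X[2])
print(round(sim_cat_dog, 3), sim_cat_justice)
```

In the full procedure the rows of X would be replaced by rows of $U_k \Sigma_k^{p}$ before taking cosines; the weighting and the similarity measure are otherwise the same.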

⁵ A copy of the 114,501 rated terms is available on request from Peter Turney.

3. The co-product as the building block of semantic family resemblance

Our basic assumption is that a model of Semantic Family Resemblance (SFR) should ideally be grounded in some kind of universal mathematical structure. We are well aware that in practice the performance of an algorithm associating the members of a given family may have nothing to do with mathematical universality. Nevertheless, it is still a scientific ideal that should be taken into account. To address this challenge we use the idea of the “co-product” adopted from Category Theory [32,33].

Category Theory is a mathematical language that describes similar phenomena in different mathematical fields. It is ostensibly very simple since it deals with objects and maps (known as “morphisms”) between those objects, denoted by arrows. Its ability to do so, however, makes it an abstract and powerful tool for modeling beyond mathematics. One of the abstract structures in Category Theory is the co-product. We first define the co-product and then explain the way we use it to fuse experiential and distributional information.

The first component of a category is objects (A, B, C, etc.). In our case, we will treat words as objects. The second component of a category is maps between objects (denoted as f, g, h, etc.). These maps, or “morphisms”, represented as arrows, are merely a way of relating the objects to themselves and to each other. For instance, if the word “cat” mentally primes the word “mouse”, then an arrow (i.e., a relation) can be drawn from “cat” to “mouse”.

Let us assume that L is a given category consisting of objects and relations. A co-product of L-objects a and b is another object in the category, a + b, together with a pair of arrows ($i_a: a \to a + b$, $i_b: b \to a + b$) such that for any pair of arrows of the form ($f: a \to c$, $g: b \to c$) there is exactly one arrow $[f, g]: a + b \to c$ that makes the diagram in Fig. 1 commute, in the sense that

$$[f, g] \circ i_a = f \quad \text{and} \quad [f, g] \circ i_b = g.$$

$[f, g]$ is called the co-product arrow of f and g with respect to the injections $i_a$ and $i_b$. The co-product is presented in Fig. 1.

Fig. 1. The co-product (objects a, b, a + b, and c; arrows $i_a$, $i_b$, f, g, and [f, g]).

The co-product is described as the “least specific” object to which each object of the family admits a morphism. It is a way of seeing the general object (“the best of its type”) from the perspective of the specific objects [33]. This is the reason why the co-product and its dual notion (the product) have been proposed as powerful structures for modeling, specifically in the context of human cognition [34–37], and, as proposed by Neuman and Nave [36], for the modeling of family resemblance. In other words, as the co-product is the “least specific” object to which each object of the category/family admits a morphism, that is, the “least specific” object to which the other objects send their arrows, it is a kind of “attractor” that stitches together the a and b objects of the category. Identifying the co-product associated with certain words may help us in identifying the members/objects of a given family/category. Therefore, the next section introduces a novel definition of semantic family resemblance in terms of the co-product.

4. Semantic family resemblance

A Semantic Family Resemblance Graph (SFRG) is the directed graph G = (V, E) where G is a co-product, V is the set of 4-tuple words $\{V_{a_i}, V_{b_i}, V_{a_i+b_i}, V_{c_i}\}$, and E is a set of the directed edges $(a_i, a_i + b_i)$, $(a_i, c)$, $(b_i, a_i + b_i)$, $(b_i, c)$, $(a_i + b_i, c)$ such that an edge from a word X to a word Y indicates that (1) Y is primed by X and (2) Y is higher in its abstractness level than X. It must be emphasized that G is a co-product because given two words $a_i$ and $b_i$ and a third word to which they are related, it is a universal “law” that if there is a fourth word to which they are related, there is exactly one relation from the third to the fourth word such that the diagram commutes.

This highly abstract idea can be easily explained through a concrete example. A dentist chair is_a chair, just as a barber chair is_a chair. Both objects admit a morphism to chair in the sense that they have the relation is_a with the object chair. A chair is_a furniture. Therefore, the structure we have is a co-product because if both a dentist chair and a barber chair are furniture, it must be the case that a chair is furniture. The next section presents the way we use the abstract definition of the co-product in order to identify semantic family resemblance.

4.1. Identifying semantic family resemblance

For identifying SFR and building the SFRG, we used a simple procedure. We considered each word/phrase A in Turney’s “primed words list” as a basic level word and looked for a B word/phrase that it primes under the following conditions:

(1) A and B should not differ in their level of abstraction by more than one quarter of a standard deviation. That is, the absolute difference between their abstractness levels is below 0.025: $|A - B| < 0.025$.
(2) There should be a word that both A and B prime on Turney’s “primed words list” (TPWL) and that has an abstraction level that is higher than A and B up to 0.025 SD. This word is the co-product (i.e., a + b) and is denoted by C.
(3) There should be a word D that is primed by A, B, and C, and is higher in its abstraction level than C up to 0.025 SD.

Moreover, to limit the size of the semantic representation, we decided to add another constraint:

(4) A and B should prime each other in the sense that B is among the 50 words that appear after A in TPWL and vice versa.

One should notice that according to condition (4), A and B factor through each other and therefore construct an equivalence class. The model is presented in Fig. 2. The above structure integrates lexical priming and abstractness level and produces a list of words together with their semantic set.

Fig. 2. A model of semantic family resemblance (nodes A, B, C, and D; arrows f and g).

5. Analysis and results

By automatically applying the above procedure to TPWL, we constructed a three-layer directed graph comprised of 4332 different words at the basic level (A and B), 2289 different words at the second level (i.e., C), and 1444 different concepts at the third level (i.e., D). It must be noted that this graph does not constitute a hierarchy, as some words may appear in more than one level and lower level words may be associated with different higher level words.

Table 1. Results of the set expansion task.

Disease
1. Measles: poliomyelitis, smallpox, tetanus, rubella, mumps, diphtheria, vaccine, polio, malaria
2. Polio: tetanus, pertussis, typhoid, measles, diphtheria, vaccine, paralytic
3. Smallpox: typhoid, measles, diphtheria, typhus, vaccine
4. Varicella: rubella, vaccine, zoster, pox, chickenpox
5. Cholera: typhoid, diphtheria, dysentery, dengue, vibrio, brucellosis
6. Tuberculosis: pulmonary
7. Hyperthyroidism: goiter, hypothyroidism, hyperparathyroidism
8. Diphtheria: smallpox, measles, typhoid, poliomyelitis, pertussis, tetanus, rubella, polio, mumps, cholera
9. Aids: hepatitis, HIV
10. Cancer: liver, bladder, ovarian, lung

Medications
1. Valium: Xanax, diazepam, benzodiazepine
2. Prozac: ssri, fluoxetine, Zoloft, sedative, paxil
3. Xanax: Valium, diazepam, Zoloft, sedative, paxil
4. Penicillin: streptomycin, tetracycline
5. Amphotericin: Acyclovir
6. Fluoxetine: Tricyclic, Prozac, zoloft, ssri, amitriptyline
7. Diazepam: Xanax, Valium, atropine, benzodiazepine, phenytoin, Phenobarbital
8. Nonsteroidal: nsaid, Tylenol, ibuprofen
9. Tylenol: ibuprofen, nonsteroidal, advil, acetaminophen, aspirin
10. Nsaid: nonsteroidal, ibuprofen, acetaminophen, naproxen

Music
1. Violin: flute
2. Flute: violin, oboe
3. Piano: viola, clarinet, cello
4. Guitar: tenor, banjo
5. Cello: viola, clarinet, piano, bassoon, percussion
6. Clarinet: viola, cello, bassoon, piano, soprano, oboe, bass
7. Bass: clarinet
8. Viola: cello, violincello, clarinet, piano
9. Banjo: harmonica, guitar, dulcimer, mandolin
10. Mandolin: dulcimer, harmonica, fiddle, banjo

Country
1. Cameroon: Bulgaria, Benin, Togo, Cambodia, Congo, Yaounde, Nigeria
2. Brazil: Chile, Cambodia, Bolivia, Bulgaria, Ecuador, Cayman, Barbados, Colombia, Botswana, Bermuda
3. Togo: Cameroon, Ghana, Gambia, Benin, Senegal, Niger, Uganda, Trinidad, Lome
4. China: Chile, Korea, Asia
5. Indonesia: Philippines, Laos, Cambodia, Bandung, Bahasa, Timur, Jakarta
6. Vietnam: Myanmar, Cambodia, Laos, Zimbabwe, Yugoslavia, Veteran, Korea
7. Venezuela: Uruguay, Uzbekistan, Rico
8. Spain: Paris, Germany, Russia
9. Serbia: Bosnia, Sudan
10. Yugoslavia: Zaire, Rwanda, Zimbabwe, Vietnam

City
1. Berlin: Amsterdam, Frankfurt
2. Brisbane: Perth, Adelaide, Sydney, Melbourne, Australia
3. Trento: Adige, Veneto, Bolzano
4. Troy: Athens, Bradford
5. Torino: Genova, Napoli, Milano, Italia, bologna, Roma, Firenze
6. Utrecht: Nijmegen, Leiden
7. Roma: Piazza, Padova, Firenze, Venezia, Torino, Lazio, dell
8. Sarajevo: Bosnia, Bosnian
9. Sidney: Prescott
10. Amsterdam: Berlin

Person
1. Balzac: Flaubert, Proust, Zola, Stendhal, Baudelaire
2. Proust: Zola, Flaubert, Balzac
3. Lennon: McCartney
4. Rubens: Rembrandt, Vermeer
5. Renoir: Cezanne, Seurat, van Gogh, Gauguin, Degas
6. Spinoza: Descartes, Leibniz
7. Jesus: resurrection, Christ, son, followers, prayer, disciple, death
8. Cezanne: Seurat, Degas, Renoir, van Gogh, Gauguin, Monet, Chagall, Botticelli
9. Monet: Magritte, van Gogh, Degas, Cezanne
10. Bach: Mozart

Food
1. Tomato: cucumber, onion, puree, pepper, dried, sauce, ketchup
2. Eggplant: zucchini, okra, grilled, cucumber, stuffed
3. Toffee: almond
4. Salad: pasta, cucumber
5. Ravioli: lasagna
6. Zucchini: eggplant, stuffed
7. Merlot: pinot, riesling, zinfandel, cabernet, chardonnay, blanc
8. Cabernet: pinot, zinfandel, merlot, chardonnay
9. Milk: sugar, juice, eggs, vanilla
10. Shrimp: fried
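The four conditions of Section 4.1 can be sketched as a search over a priming list and a table of abstractness ratings. This is an illustrative reading rather than the authors' code: the function name, the toy priming lists, and the toy ratings are hypothetical, and the phrase “higher ... up to 0.025 SD” is interpreted here, as an assumption, to mean a strictly positive difference of at most `step` = 0.025 on the rating scale.

```python
def sfr_tuples(primes, abstractness, tol=0.025, step=0.025):
    """Return sorted (A, B, C, D) tuples satisfying conditions (1)-(4) under the stated reading."""
    results = []
    for a, a_primed in primes.items():
        for b in a_primed:
            # Condition (4): A and B must prime each other.
            if b == a or b not in primes or a not in primes[b]:
                continue
            # Condition (1): A and B are close in abstractness.
            if abs(abstractness[a] - abstractness[b]) >= tol:
                continue
            base = max(abstractness[a], abstractness[b])
            shared = set(a_primed) & set(primes[b])
            for c in shared:
                # Condition (2): C is primed by A and B and is slightly more abstract than both.
                if c in (a, b) or c not in primes:
                    continue
                if not (0 < abstractness[c] - base <= step):
                    continue
                for d in shared & set(primes[c]):
                    # Condition (3): D is primed by A, B, and C and is slightly more abstract than C.
                    if d in (a, b, c):
                        continue
                    if 0 < abstractness[d] - abstractness[c] <= step:
                        results.append((a, b, c, d))
    return sorted(results)

# Hypothetical priming lists (word -> words it primes) and abstractness ratings.
primes = {
    "measles": ["mumps", "vaccine", "disease"],
    "mumps": ["measles", "vaccine", "disease"],
    "vaccine": ["disease", "measles"],
    "banana": ["measles"],
}
abstractness = {"measles": 0.30, "mumps": 0.31, "vaccine": 0.32,
                "disease": 0.34, "banana": 0.10}

tuples = sfr_tuples(primes, abstractness)
print(tuples)
```

In this toy run “banana” is rejected by condition (4), and “measles” and “vaccine” prime each other but share no qualifying C, so only the measles/mumps pair yields a complete structure. The semantic neighborhood of a seed A would then be the set of B words appearing with it in such tuples.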

5.1. Testing the model on a set expansion task

If the above procedure associates words sharing some kind of “semantic family resemblance”, then we should see it in the set of words associated with each token. To test this hypothesis, we chose seven super-ordinate categories that we manually identified in the data: (1) Disease (e.g., Polio), (2) Medication (e.g., Valium), (3) Musical instruments (e.g., piano), (4) Country (e.g., Russia), (5) City (e.g., Berlin), (6) Person (e.g., Bach), and (7) Food (e.g., salad). For each category, we manually identified ten tokens, and for each token we automatically identified its semantic neighborhood according to the above procedure. The categories were manually identified to make sure, through human judgment, that they are well established and that each token of the categories is indeed a valid token. However, the semantic neighborhood of each token or “seed” was automatically identified to test the power of our model to identify semantic relatedness. In a sense, this procedure is actually an automatic set expansion [38–40] as epitomized by Google Sets™ [http://labs.google.com/sets]. Set expansion techniques compute terms’ similarity and then choose from the terms those most similar to the seeds [39].

Table 1 presents the unique tokens identified for each target. One should notice that the tokens varied in their frequency. For instance, given the target “Measles” the algorithm identified nine different tokens. However, while “Poliomyelitis” appeared as the pair of “Measles” 17 times, “Malaria” appeared only once.

To evaluate whether the semantic neighborhood that we automatically identified for each token is valid, and to compare it with another algorithm, we used the Latent Semantic Analysis (LSA) [41] near-neighbors procedure with the “General reading up to first year college” semantic space [lsa.colorado.edu], and compared our neighbors with the words identified by LSA for each target word from our list. The reason for choosing LSA for comparison is that, in contrast with other algorithms and tools for set expansion (e.g., Google Sets), LSA has been presented as a psychological model of the mind [42]. In addition, a comparison to Google Sets is problematic as this algorithm is not public and its results “cannot be reliably replicated” [40, p. 347].
As the number of neighbors identified by our algorithm varies from case to case, we compared the n neighbors identified by our algorithm to the first n results of LSA. The decision of whether a word belongs to the category was made manually by a human judge. Table 2 presents the mean average precision (MAP) for SFR and LSA. It can be seen that the MAP of our algorithm is significantly higher (p < .001) than the one gained by LSA. Our model has two advantages. First, it relies on minimal resources rather than on a huge term-to-term matrix created by harvesting the Web (e.g. [38,40,43]) or other resources such as Wikipedia [39]. Second, it expands the single seed in a bottom-up manner rather than by relying on “semi-structured web pages that contain ‘lists’ of items”, as does the algorithm developed by Wang and Cohen [40].
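The comparison protocol described above (n neighbors from one method against the first n results of the other, each judged for category membership) can be sketched as follows. The exact MAP computation used for Table 2 is not spelled out in the text, so this sketch simply averages per-seed precision; the seed lists, the LSA-style outputs, and the relevance judgments below are all hypothetical.

```python
def precision_at_n(candidates, judged_valid, n):
    """Fraction of the first n candidates that the judge accepted as category members."""
    top = candidates[:n]
    if not top:
        return 0.0
    return sum(1 for w in top if w in judged_valid) / len(top)

def compare_methods(seeds):
    """seeds maps a seed word to (method A neighbors, method B ranked results, judged-valid set)."""
    a_scores, b_scores = [], []
    for a_list, b_list, valid in seeds.values():
        n = len(a_list)  # method B is truncated to the number of neighbors method A returned
        a_scores.append(precision_at_n(a_list, valid, n))
        b_scores.append(precision_at_n(b_list, valid, n))
    return sum(a_scores) / len(a_scores), sum(b_scores) / len(b_scores)

# Hypothetical judgments for two music seeds; not data from the paper.
seeds = {
    "flute": (["violin", "oboe"], ["music", "violin", "band"], {"violin", "oboe"}),
    "guitar": (["tenor", "banjo"], ["song", "sing", "banjo"], {"banjo"}),
}
mean_a, mean_b = compare_methods(seeds)
print(mean_a, mean_b)
```

Truncating the second method to n keeps the comparison fair when the first method returns a variable number of neighbors per seed.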

Table 2. MAP for SFR and LSA.

       Food   Disease   Medication   Music   Country   City   Person   MAP
SFR    84     69        100          86      82        73     90       83.42
LSA    28     26        0            20      52        11     30       23.85

5.2. Semantic relatedness

To test the ability of Semantic Family Resemblance (SFR) to identify semantic relatedness, we used the WordSimilarity-353 dataset [2,44]. This dataset, which is used as a benchmark for measuring semantic relatedness, comprises 353 word pairs rated for their relatedness by human judges. We followed the convention of measuring relatedness against this benchmark by using the Spearman rank-order correlation coefficient. An algorithm using WikiRelate obtained a 0.19–0.48 correlation; WordNet [45] gained a 0.33–0.35 correlation with human ratings; an algorithm using a large-scale semantic network [46] gained a correlation of 0.50; an algorithm based on Roget’s Thesaurus gained a 0.55 correlation; and a similar result of 0.56 was gained by LSA [44]. Agirre and colleagues [9] combined a graph-based algorithm applied to WordNet with distributional similarities collected from a 1.6-terabyte Web corpus, and gained a 0.66 correlation with the WordSim353 dataset. By using Wikipedia links rather than content, Milne and Witten’s [3] algorithm (WLM) gained a 0.69 correlation. They justify the use of their algorithm in terms of the limited data sources it uses, its minimal preprocessing, and its accessibility. By using Explicit Semantic Analysis (ESA) with Wikipedia as a knowledge resource, Gabrilovich and Markovitch [2] gained a 0.75 correlation. Their algorithm outperforms LSA, and they further argue that another advantage of ESA is that while “latent semantic models are notoriously difficult to interpret ... [t]he Explicit Semantic Analysis ... circumvents this problem”. The best results were recently gained by Temporal Semantic Analysis (TSA) [10], which uses the New York Times archive spanning over 130 years. TSA gained a 0.80 correlation. Table 3 summarizes these findings.

Table 3. Summary of algorithm correlations with humans.

Algorithm                   Correlation with humans
WordNet [45]                0.33–0.35
WikiRelate! [47]            0.19–0.48
Wojtinnek & Pulman [46]     0.50
Roget’s Thesaurus [45]      0.55
LSA [48]                    0.56
Agirre et al. [9]           0.66
WLM [3]                     0.69
ESA – Wikipedia [2]         0.75
TSA [10]                    0.80

To measure semantic relatedness, we used extremely limited computational resources (128 MB) and a variation of the SFR graph as presented in Fig. 3.

Fig. 3. The semantic family resemblance graph used to model semantic relatedness (nodes E, A, B, C, and D).

Given a word pair A–B, we calculated whether A primes B and vice versa. The score ranged from 0 (neither word primes the other) to 2 (each word primes the other). We also used the number of different words E that prime both A and B and are lower than A and B in their level of abstractness, the number of different words C that are primed by A and B and are higher than them in abstractness level, and the number of different words D that are primed by A, B, and C and are higher than them in their level of abstractness. The semantic relatedness between A and B was defined as follows:

$$\mathrm{SemRel}(A, B) = \mathrm{MEAN}\bigl(\mathrm{Rank}_{AB},\ \mathrm{Rank}(\mathrm{SUM}(E, C, D))\bigr)$$
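The SemRel definition above can be read as the mean of two rank variables computed across all word pairs. The sketch below illustrates that reading. The tie handling (average ranks, rank 1 for the smallest value) and the field names are assumptions, and the four pairs with their counts are invented for illustration.

```python
def average_ranks(values):
    """Rank values in ascending order (1 = smallest), giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def sem_rel(pairs):
    """SemRel per pair: mean of Rank(AB score) and Rank(E + C + D)."""
    rank_ab = average_ranks([p["ab"] for p in pairs])
    rank_sum = average_ranks([p["e"] + p["c"] + p["d"] for p in pairs])
    return [(x + y) / 2 for x, y in zip(rank_ab, rank_sum)]

# Hypothetical pairs: "ab" is the 0-2 mutual priming score; e, c, d are counts of E, C, D words.
pairs = [
    {"ab": 2, "e": 5, "c": 3, "d": 1},
    {"ab": 0, "e": 0, "c": 0, "d": 0},
    {"ab": 1, "e": 2, "c": 1, "d": 0},
    {"ab": 1, "e": 2, "c": 1, "d": 1},
]
scores = sem_rel(pairs)
print(scores)
```

The resulting scores would then be correlated with the human ratings by a rank-order coefficient, which is insensitive to the particular scale of the ranks.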


Notice that, following the convention in studies of semantic relatedness, we used the ranked values of a given variable rather than the raw data (e.g., RankAB). By using this measure, we gained Spearman’s $\rho = 0.46$ (p < .001). Therefore, by using the most limited resources described in the literature, we have gained better results than those gained by WordNet and not significantly far from those gained by WikiRelate and the large semantic network used by Wojtinnek and Pulman [46].

Another way of analyzing the results is by using Binary Logistic Regression Analysis. We defined the average judgment of semantic relatedness in WordSimilarity-353 in binary terms as high (above the median) or low (below or equal to the median). As predictors we used (1) RankC, (2) RankD, (3) RankE, and (4) RankAB. The base rate for prediction was 50.4%. The regression was statistically significant ($\chi^2(4) = 60.82$, p < .001) with a 66.9% correct classification rate. That is a 16.5 percentage-point improvement in prediction over the base rate. By using Backward Conditional Binary Logistic Regression Analysis, it was found that only three variables predicted the criterion: RankC, RankE, and RankAB.

Following the request of one of the reviewers, we replicated our analysis by using Rubenstein and Goodenough’s [48] dataset, abbreviated as R–G. Rubenstein and Goodenough asked human subjects to rate the similarity of 65 English word pairs. The algorithms used to measure this similarity and the correlations they gained with human judgment, as summarized by Budanitsky and Hirst [1], appear in Table 4. By applying our procedure to the R–G dataset we gained a statistically significant correlation (r = 0.519, p < .001). As argued by Mohammad and Hirst [49, p. 2], the R–G dataset involves pairs that are “all noun pairs and these that were semantically close were also semantically similar; the dataset did not contain word pairs that are semantically related but not semantically similar”.
Based on this critique, and to improve our correlation, we measured the frequency with which each word pair from R–G appears in COCA. We inserted the first word and measured the number of times the second word appears within a window of four words to the left or to the right of the target word. The raw frequency for each pair was ranked. A statistically significant correlation was found between this rank and the human similarity judgment of R–G (r = 0.663, p < .001). This result indicates that word-pair similarity in the R–G dataset can be accounted for through simple co-occurrence in COCA. Based on this finding, we created a new similarity measure, which is the average of the pair co-occurrence rank and our previous similarity measure, and gained r = 0.694 (p < .001). When defining the human similarity judgment in binary terms and applying Binary Logistic Regression Analysis by using the same procedure that we used for WordSimilarity-353, we gained statistically significant results (χ²(1) = 11.46, p < .001) with a 70.8% correct classification rate, which is a 20 percentage-point improvement in prediction over the base rate. When including the co-occurrence measure in the equation, the prediction was improved to 78.5%. In sum, by using a model fusing experiential and distributional information with limited resources, we have shown that semantic

Table 4
Algorithms used to measure word pair similarity and correlations with human judgment.

Algorithm                       Correlation with humans
Hirst and St-Onge, relHS        0.786
Jiang and Conrath, disJC        0.781
Leacock and Chodorow, simLC     0.838
Lin, simL                       0.819
Resnik, simR                    0.779
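
The window-based co-occurrence measure used for the R–G pairs can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`window_cooccurrence`, `average_ranks`), the toy token list, and the values in `previous_rank` are hypothetical, and a real run would query COCA rather than a short in-memory sentence. The sketch counts how often the second word falls within four tokens to the left or right of the first, ranks the raw counts, and averages that rank with the rank of an earlier similarity measure.

```python
def window_cooccurrence(tokens, target, other, window=4):
    """Count occurrences of `other` within `window` tokens of each `target`."""
    count = 0
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        # Context = tokens to the left and right, excluding the target itself.
        context = tokens[lo:i] + tokens[i + 1:hi]
        count += context.count(other)
    return count


def average_ranks(values):
    """1-based ascending ranks; tied values share their mean rank."""
    ordered = sorted(values)
    n = len(ordered)
    # Mean of the first and last 1-based positions of each value.
    return [
        (ordered.index(v) + 1 + n - ordered[::-1].index(v)) / 2
        for v in values
    ]


# Toy corpus and word pairs, invented for illustration only.
tokens = ("the cup and the spoon lay near the cup "
          "while a rooster crowed at dawn").split()
pairs = [("cup", "spoon"), ("rooster", "dawn"), ("cup", "dawn")]

freqs = [window_cooccurrence(tokens, a, b) for a, b in pairs]
cooc_rank = average_ranks(freqs)

# Hypothetical ranks from a previously computed similarity measure.
previous_rank = [2.0, 3.0, 1.0]

# Fused measure: mean of the co-occurrence rank and the earlier rank.
combined = [(c + p) / 2 for c, p in zip(cooc_rank, previous_rank)]
print(freqs, cooc_rank, combined)
```

Note that this sketch counts a context word once for every target occurrence whose window contains it; whether the original procedure counts overlapping windows the same way is not stated in the text.
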

relatedness can be evaluated in a significant way. The meaning of those findings is discussed in the next section.

6. Discussion

Understanding the way semantic representation is formed in the human mind is of interest to cognitive scientists and related disciplines. A growing body of literature points to the embodied nature of cognition, and therefore it seems reasonable to integrate experiential and distributional information in modeling semantic relatedness. From a semiotic perspective, in order to understand meaning and relatedness one needs to be sensitive to the unique way natural language, as a sign-system, is formed and evolves as a complex dynamic social system. While the meaning of the signs we use evolves from embodied experience, and while these signs are statistically collocated with other words, their meaning is an emergent phenomenon created at the mesoscopic level of analysis, between the micro-level of a word and its embodied source and the macro-level of its collocations with other words. In this paper, we have introduced a cognitively economical model that integrates experiential and distributional information through a mesoscopic organization captured by the abstract mathematical structure of the co-product. The model performs well with extremely limited resources. Moreover, the model is context-sensitive: a pair of words may be related to a different extent when stitched together under different co-products. This contextual sensitivity has not been studied in this paper and will be the target of future studies.

Acknowledgment

The authors would like to thank Peter Turney for his cooperation and reading of the final draft and the anonymous reviewers for their constructive comments.

References

[1] A. Budanitsky, G. Hirst, Evaluating WordNet-based measures of lexical semantic relatedness, Comput. Linguist. 32 (2006) 13–47.
[2] E. Gabrilovich, S. Markovitch, Computing semantic relatedness using Wikipedia-based explicit semantic analysis, in: Proceedings of the Twentieth International Joint Conference on Artificial Intelligence (IJCAI), 2007, pp. 1606–1611.
[3] D. Milne, I.H. Witten, An effective, low-cost measure of semantic relatedness obtained from Wikipedia links, in: Proceedings of the AAAI 2008 Workshop on Wikipedia and Artificial Intelligence (WIKIAI 2008), Chicago, IL, 2008.
[4] T. Zesch, I. Gurevych, Wisdom of crowds versus wisdom of linguists – measuring the semantic relatedness of words, J. Nat. Language Eng. 16 (2010) 25–59.
[5] M. Danesi, Metaphorical ‘‘networks’’ and verbal communication: a semiotic perspective on human discourse, Sign Syst. Stud. 31 (2003) 341–363.
[6] A. Clark, Language, embodiment, and the cognitive niche, Trends Cogn. Sci. 10 (2006) 370–374.
[7] G. Lakoff, M. Johnson, Philosophy in the Flesh, Basic Books, New York, 1999.
[8] Y. Neuman, P. Turney, Y. Cohen, How language enables abstraction: a study in computational cultural psychology, Integr. Psychol. Behav. (2011), http://dx.doi.org/10.1007/s12124-011-9165-8.
[9] E. Agirre, E. Alfonseca, K. Hall, J. Kravalova, M. Pasca, A. Soroa, A study on similarity and relatedness using distributional and wordnet-based approaches, in: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2009, pp. 19–27.
[10] K. Radinsky, E. Agichtein, E. Gabrilovich, S. Markovitch, A word at a time: computing word relatedness using temporal semantic analysis, in: Proceedings of the 20th International Conference on World Wide Web (WWW), 2011. doi: 10.1145/1963405.1963455.
[11] L. Perlovsky, Cognitive high level information fusion, Inform. Sci. 177 (2007) 2099–2118.
[12] L.I. Perlovsky, Language and cognition, Neural Networks 22 (3) (2009) 247–257.
[13] L.I. Perlovsky, R. Ilin, Neurally and mathematically motivated architecture for language and thought, Special issue ‘‘Brain and language architectures: where we are now?’’, Open Neuroimag. J. 4 (2010) 70–80.
[14] L.I. Perlovsky, Joint Acquisition of Language and Cognition, WebmedCentral Brain 1(10) (2010) WMC00994.
[15] L.I. Perlovsky, Language and Cognition Interaction Neural Mechanisms, Computational Intelligence and Neuroscience, 2011. doi: 10.1155/2011/454587.
[16] M. Andrews, G. Vigliocco, D.P. Vinson, Integrating experiential and distributional data to learn semantic representations, Psychol. Rev. 116 (3) (2009) 463–498.
[17] C. Beckner, R. Blythe, J. Bybee, M.H. Christiansen, W. Croft, N.C. Ellis, J. Holland, J. Ke, D. Larsen-Freeman, T. Schoenemann, Language is a complex adaptive system: position paper, Lang. Learn. 59 (2009) 1–26.
[18] Y. Neuman, O. Nave, E. Dolev, Buzzwords on their way to a tipping point: a view from the Blogosphere, Complexity 16 (4) (2011) 58–68.
[19] L. Wittgenstein, Philosophical Investigations, Blackwell, Oxford, 1953.
[20] H.-J. Glock, A Wittgenstein Dictionary, Blackwell, Oxford, 1996.
[21] R.B. Laughlin, D. Pines, J. Schmalian, B.P. Stojković, P. Wolynes, The middle way, Proc. Natl. Acad. Sci. USA 97 (2000) 32–37.
[22] Y. Neuman, Meaning making in language and biology, Perspect. Biol. Med. 48 (2005) 320–327.
[23] M. Hoey, Lexical Priming: A New Theory of Words and Language, Routledge, London, 2005.
[24] M. Davies, The 385+ million word Corpus of Contemporary American English (1990–2008+): design, architecture, and linguistic insights, Int. J. Corpus Linguist. 14 (2009) 159–190.
[25] G. Gigerenzer, Simple Heuristics That Make Us Smart, Oxford University Press, Oxford, 1999.
[26] P. Turney, Y. Neuman, D. Assaf, Y. Cohen, Literal and metaphorical sense identification through concrete and abstract context, in: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, Scotland, UK, July 27–31, 2011, pp. 680–690.
[27] P.D. Turney, M.L. Littman, Measuring praise and criticism: inference of semantic orientation from association, ACM T. Inform. Syst. (TOIS) 21 (4) (2003) 315–346.
[28] P.D. Turney, P. Pantel, From frequency to meaning: vector space models of semantics, J. Artif. Intell. Res. (JAIR) 37 (2010) 141–188.
[29] M. Coltheart, The MRC psycholinguistic database, Q. J. Exp. Psychol. 33A (4) (1981) 497–505.
[30] S. Büttcher, C. Clarke, Efficiency vs. effectiveness in terabyte-scale information retrieval, in: Proceedings of the 14th Text REtrieval Conference (TREC), Gaithersburg, MD, 2005.
[31] P. Turney, Similarity of semantic relations, Comput. Linguist. 32 (2006) 379–416.
[32] R. Goldblatt, Topoi: The Categorial Analysis of Logic, North Holland Publishing Company, Amsterdam, 1979.
[33] F.W. Lawvere, S.H. Schanuel, Conceptual Mathematics, Cambridge University Press, Cambridge, 1997.
[34] A.C. Ehresmann, J.-P. Vanbremeersch, Memory Evolutive Systems, Elsevier, New York, 2007.
[35] Y. Neuman, A novel generic conception of structure: solving Piaget’s riddle, in: L. Rudolph, J. Valsiner (Eds.), Mathematical Models for Research on Cultural Dynamics, Routledge, London, in press.
[36] Y. Neuman, O. Nave, A mathematical theory of sign-mediated concept formation, Appl. Math. Comput. 201 (2008) 72–81.
[37] S. Phillips, W.H. Wilson, Categorial compositionality II: universal constructions and a general theory of (quasi-)systematicity in human cognition, PLoS Comput. Biol. 7 (2011) 1–11.
[38] P. Pantel, E. Crestan, A. Borkovsky, A.-M. Popescu, V. Vyas, Web-scale distributional similarity and entity set expansion, in: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, 2009, pp. 938–947.
[39] L. Sarmento, V. Jijkoun, M. de Rijke, E. Oliveira, More like these: growing entity classes from seeds, in: Proceedings of the 16th ACM Conference on Information and Knowledge Management, 2007, pp. 959–962.
[40] R. Wang, W. Cohen, Language-independent set expansion of named entities using the web, in: ICDM 2007, Seventh IEEE International Conference on Data Mining, 2007, pp. 342–350.
[41] T. Landauer, P. Foltz, D. Laham, Introduction to latent semantic analysis, Discourse Process. 25 (1998) 259–284.
[42] T. Landauer, S. Dumais, A solution to Plato’s problem: the latent semantic analysis theory of the acquisition, induction, and representation of knowledge, Psychol. Rev. 104 (2) (1997) 211–240.
[43] R.C. Wang, W.W. Cohen, Iterative set expansion of named entities using the web, in: ICDM, IEEE Computer Society, 2008, pp. 1091–1096.
[44] L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, E. Ruppin, Placing search in context: the concept revisited, ACM T. Inform. Syst. 20 (1) (2002) 116–131.
[45] M. Jarmasz, S. Szpakowicz, Roget’s thesaurus and semantic similarity, in: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP), 2003, pp. 212–219.
[46] P. Wojtinnek, S. Pulman, Semantic relatedness from automatically generated semantic networks, in: Proceedings of the Ninth International Conference on Computational Semantics (IWCS’11), 2011.
[47] M. Strube, S. Ponzetto, WikiRelate! Computing semantic relatedness using Wikipedia, in: AAAI’06, Boston, MA, 2006.
[48] H. Rubenstein, J. Goodenough, Contextual correlates of synonymy, CACM 8 (10) (1965) 627–633.
[49] S. Mohammad, G. Hirst, Distributional measures as proxies for semantic relatedness, submitted for publication.