Evaluation of analogical proportions through Kolmogorov complexity


Knowledge-Based Systems 29 (2012) 20–30

doi:10.1016/j.knosys.2011.06.022

Meriam Bayoudh b,1, Henri Prade a, Gilles Richard a,*

a IRIT, 118 Route de Narbonne, 31062 Toulouse Cedex 9, France
b Centre IRD de Guyane, Route de Montabo BP165, 97323 Cayenne CEDEX, France

* Corresponding author. E-mail addresses: [email protected] (M. Bayoudh), [email protected] (H. Prade), [email protected] (G. Richard).
1 On leave from IRIT, presently at Université des Antilles et de la Guyane at Cayenne.

Article history: Available online 18 July 2011

Keywords: Analogical proportion; Kolmogorov complexity; Common sense analogies; Search engine; Google

Abstract

In this paper, we try to identify analogical proportions, i.e., statements of the form "a is to b as c is to d", expressed in linguistic terms. While it is conceivable to use an algebraic model for testing proportions such as "2 is to 4 as 5 is to 10", or even such as "read is to reader as lecture is to lecturer", there is no algebraic framework to support statements such as "engine is to car as heart is to human" or "wine is to France as beer is to England", helping to recognize them as meaningful analogical proportions. The idea is then to rely on text corpora, or even on the Web itself, where one may expect to find the pragmatics and the semantics of the words, in their common use. In that context, in order to attach a numerical value to the "analogical ratio" corresponding to the phrase "a is to b", we start from the works of Kolmogorov on complexity theory. This is the basis for a universal measure of the information content of a word a, or of a word a with respect to another one b, which, in practice, is estimated in a statistical manner. We investigate the link between a purely logical, recently introduced view of analogical proportions and its counterpart based on Kolmogorov theory. The criteria proposed for testing candidate proportions fit with the expected properties (symmetry, central permutation) of analogical proportions. This leads to a new computational method to define, and ultimately to try to detect, analogical proportions in natural language. Experiments with classifiers based on these ideas are reported, and results are rather encouraging with respect to the recognition of common sense linguistic analogies. The approach is also compared with existing works on similar problems.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

Despite its heuristic status, analogical reasoning is a commonly used form of reasoning which has the ability to shortcut long chains of classical deductions, while often reaching the same conclusions. It is largely accepted that analogy is the basis for creativity, as it puts different paradigms into correspondence (see [16,35,14]). Analogical reasoning is based on the human ability to identify "situations" or "problems" a and c, and then to "deduce" that if b is a solution for problem a, then some d, whose relation to c is similar to the relation between a and b, might be a solution for problem c. Such a relation involving 4 items a, b, c, d is called an analogical proportion, or analogy for short, usually denoted a:b::c:d, and should be read "a is to b as c is to d". Algebraic frameworks giving concise definitions of analogical proportions have been deeply investigated in [37] in recent years. For instance, when the universe is the set R of real numbers, the truth of a:b::c:d is interpreted as a × d = b × c, justifying "2 is to 4 as 5 is to 10".

Another example, now involving sequences of bits, could be "01 is to 10 as 11 is to 00", just because 01 and 10 do not share any bit, and this is also the case with 11 and 00. In [28,30], a complete logical framework has been developed, mainly Boolean-oriented, i.e. where the underlying universe is B = {0, 1} or is isomorphic to B. In the field of artificial intelligence, analogy-discovering programs have been designed for specialized areas where there exists an underlying minimal algebraic structure. Natural language analogies like "engine is to car as heart is to human" or "wine is to France as beer is to England" are more at a linguistic or conceptual level: a simple mathematical structure is missing to cope with such proportions. Sowa's conceptual graphs (CG) offer an appealing framework for representing concepts: core knowledge can be encoded using CG, and then, with the help of a structured linguistic database (like e.g. WordNet), we could discover analogies, as with the VivoMind analogy engine [35] for instance. There is another option coming from the works of Gentner [13,14] on the so-called structure mapping theory (SMT), implemented in the structure mapping engine (SME) [11]. This way to proceed allows the authors to exhibit high level analogical proportions: for instance, the analogical proportion "planets are to the sun as electrons are to the atom's nucleus" comes from the mapping between a representation of the solar system and a representation of the Bohr model of the atom.


Obviously, this can only be done with the help of a costly, high level, hand-coded representation. And this is exactly what we want to avoid here! In the field of computational linguistics, the works of Turney et al. [42,41], relying on corpus-based techniques to learn semantic features like analogies, synonyms, antonyms and associations, are very successful, and we will devote Section 5 to investigating this approach and comparing it with the one we propose. But let us carry on with our ideas now. In [27], a method dealing with natural language analogies but avoiding any pre-coding of the universe has been developed. The main idea is that each word a carries an "information content" that is formally defined via its Kolmogorov complexity, K(a), which is an ideal natural number. In order to build up an effective implementation, this number has to be estimated. Thanks to the works of Solomonoff [34], it appears that K(a) can be related to the probability of a to "appear". Thus, applying a kind of reverse process, we start from a probability distribution to estimate the Kolmogorov complexity. Among the candidates to provide a probability distribution over the set of English words, the World Wide Web is a strong one. Considering Google as a web mining engine, it is an easy game to get for each word (in our case, a concept representation) its frequency and to consider it as a probability to appear in a document. Then we are done with the estimation of the Kolmogorov complexity of a word: applying our definitions, which involve only the complexities of a, b, c and d, we can now check whether a:b::c:d holds or does not hold. It appears that the proposed definitions are rather consistent with a sample of well-agreed analogies, as we shall see. Obviously, the Web is a relatively dynamic corpus, and we could imagine improving our work within a more homogeneous database where, in some sense, noise has been filtered. Starting from our previous works, we first re-implement a classifier using a structured database coming from the US National Institute of Standards and Technology (NIST) TREC Document Databases (http://www.nist.gov/srd/nistsd22.htm). Then a careful examination of our results leads us to propose other options, bridging the gap between a purely Boolean view and a Kolmogorov-based definition. Our paper is organized as follows: the next section starts from an informal analysis of the core concepts underlying an analogical proportion, leading to the well-agreed axioms defining this proportion. We also provide the Boolean interpretation of such a proportion and highlight the properties we expect to be satisfied in another context. In Section 3, we switch to natural language analogies, briefly recalling the main principles of Kolmogorov complexity theory and its companion concept known as the universal distribution. We show how to use it to provide different practical definitions for analogical proportions between concepts represented as words, highlighting the link with the logical setting described in Section 2. In Section 4, we examine the results we get through diverse sets of experimentations, and we show that they bridge the apparent gap between the Boolean framework and the complexity-based framework. Sections 5 and 6 provide a comparative discussion of the proposed model with another approach developed in computational linguistics, at the methodological level and on a preliminary experiment. Finally, we survey related works and conclude. This paper is a fully revised and substantially expanded version of a conference paper [3].


2. Analogical proportions: a logical view

An analogical proportion (from time to time in the remainder of the paper, the word 'analogy' will be used as a shortcut for 'analogical proportion') can be considered as a relation involving 4 items and satisfying some basic axioms which are supposed to capture its essence and that we recall below. Let us start with an informal analysis of the core concepts underlying this relation.

2.1. Brief analysis

In order to transfer knowledge, analogical reasoning considers two situations in parallel and compares them by putting them into correspondence. In the structure mapping theory terminology [11], the output of this process would be the so-called "mapping function". Here, we want to stick to a simpler context where each situation involves only two entities or items, say a, c on the one hand, and b, d on the other hand. The comparison then bears on the pair a and b, and on the pair c and d. This naturally leads to consider two kinds of properties:

– what is common in terms of properties to a and b: let us denote it com(a, b),
– and what is specific to a and not shared by b: we denote it spec(a, b).

Due to the intended meaning of com and spec, it is natural to assume com(a, b) = com(b, a), but in general we cannot assume spec(a, b) = spec(b, a): spec(a, b) ≠ spec(b, a) is more realistic. With this view,

– a is represented by the pair (com(a, b), spec(a, b)),
– b is represented by the pair (com(a, b), spec(b, a)),

while

– c is represented by the pair (com(c, d), spec(c, d)),
– d is represented by the pair (com(c, d), spec(d, c)).

Then, an analogical proportion between the 4 items, expressing that a is to b as c is to d, amounts to state that the way a and b differ is the same as the way c and d differ, namely, using our notation:

spec(a, b) = spec(c, d) and spec(b, a) = spec(d, c),

assuming symmetry in the way the parallel is done. This simple informal observation highlights two expected properties:

– a is to b as a is to b, and
– if a is to b as c is to d, then c is to d as a is to b (due to the symmetry of the = operator).

Going a little bit deeper in this informal analysis, we can also observe that since spec(a, b) = spec(c, d), a differs from c through the properties of a shared with b, previously denoted com(a, b), and it is the same for b with respect to d. This amounts to write spec(a, c) = spec(b, d), since they are both equal to com(a, b). A symmetric reasoning leads to spec(c, a) = spec(d, b), which, together with the previous equality, exactly means that a is to c as b is to d. We retrieve here the central permutation postulate that most authors associate with analogical proportion, together with the symmetry postulate already mentioned. We have thus retrieved the 3 characteristic properties usually requested for a proper definition of analogical proportions. It is time now for a formalization.

2.2. Formal setting

The best option is to consider a first order setting where a, b, c, d are variables and A denotes a quaternary relation. A is an analogical proportion when it satisfies the following axioms:

– A(a, b, a, b) (identity)
– A(a, b, c, d) ⇒ A(c, d, a, b) (symmetry)


– A(a, b, c, d) ⇒ A(a, c, b, d) (central permutation)

Using these axioms, we infer that an analogical relation A should satisfy A(a, a, a, a) and A(a, a, b, b), which is intuitively satisfactory, but A(a, b, b, a) does not hold in general as soon as a ≠ b (back to our informal analysis, this is due to the fact that spec(a, b) ≠ spec(b, a)). These axioms, which have been considered for a long time and which are directly inspired from the characteristic properties of numerical proportions, are supposed to capture the essence of analogical proportions. Clearly, the third postulate (central permutation) is the strongest one and is in some sense specific to analogical proportions. In the case of analogy between numbers, a ratio-based reading is natural, as for instance in the example 3:6::4:8, and obviously agrees with the idea of central permutation. This is also the case with a difference-based reading for a numerical analogy such as 13:15::17:19. When it comes to geometry, a, b, c and d are vectors or points in $\mathbb{R}^2$: to be in analogical proportion, they have to be the vertices of a parallelogram, $\vec{ab} = \vec{cd}$, which is equivalent to d(a, b) = d(c, d) and d(a, c) = d(b, d) (where d is the Euclidean distance). But when it comes to analogical proportions between words representing concepts, it may be more problematic: general analogical statements such as "engine is to car as heart is to human" or "wine is to France as beer is to England" have to be handled differently. This situation will be examined in Section 3. Basic properties of analogy can be easily deduced from the axioms. For instance:

Proposition 1. If A is an analogical relation, then the 5 following properties hold:

(i) A(a, b, c, d) → A(c, a, d, b) (by symmetry + central permutation)
(ii) A(a, b, c, d) → A(b, d, a, c) (by central permutation + symmetry)
(iii) A(a, b, c, d) → A(b, a, d, c) (by (ii) + central permutation)
(iv) A(a, b, c, d) → A(d, c, b, a) (by (iii) + symmetry)
(v) A(a, b, c, d) → A(d, b, c, a) (by (iv) + central permutation)

This means that, when an analogical proportion holds for (a, b, c, d), the same proportion holds for 7 permutations of (a, b, c, d), including the 2 obtained by symmetry and by central permutation, leading to a class of 8 permutations satisfying the proportion. When there is no ambiguity about the context, the standard notation (that we use in the remainder of this paper) for the analogy A(a, b, c, d) is a:b::c:d. Let us now consider a Boolean interpretation of analogy.

2.3. Boolean interpretation

When the items a, b, c, d belong to a structured universe, it is relatively easy to define an analogical proportion, and this has been done for diverse universes: Boolean lattice, sets, strings, etc. (see [37] for instance). In this section, we recall the Boolean model (where items belong to B = {0, 1}) as defined in [26], and we underline some remarkable properties. In that case, a:b::c:d is defined as the following Boolean formula:

((a ∧ ¬b) ≡ (c ∧ ¬d)) ∧ ((¬a ∧ b) ≡ (¬c ∧ d))

This formula is true for the 6 truth value assignments of a, b, c, d appearing in Table 1, and is false for the 2⁴ − 6 = 10 remaining possible assignments.
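As a quick illustration (ours, not from the original paper), both claims can be machine-checked: enumerating the 16 assignments confirms that exactly 6 satisfy the definition, and closing a 4-tuple under the symmetry and central permutation axioms recovers the class of 8 equivalent permutations of Proposition 1. A minimal Python sketch:

```python
from itertools import product

def analogy(a, b, c, d):
    # Boolean definition: ((a ∧ ¬b) ≡ (c ∧ ¬d)) ∧ ((¬a ∧ b) ≡ (¬c ∧ d))
    return (a and not b) == (c and not d) and (not a and b) == (not c and d)

# Exactly the 6 assignments of Table 1 satisfy the formula.
valid = [t for t in product((0, 1), repeat=4) if analogy(*t)]
assert len(valid) == 6

def closure(start):
    """Close a 4-tuple under symmetry (a,b,c,d)->(c,d,a,b)
    and central permutation (a,b,c,d)->(a,c,b,d)."""
    seen, todo = {start}, [start]
    while todo:
        a, b, c, d = todo.pop()
        for t in ((c, d, a, b), (a, c, b, d)):
            if t not in seen:
                seen.add(t)
                todo.append(t)
    return seen

# The class of 8 permutations mentioned after Proposition 1.
assert len(closure(('a', 'b', 'c', 'd'))) == 8
```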

This relation over B⁴ satisfies the 3 axioms required from an analogical proportion, and several equivalent writings have been proposed in [26], e.g.:

Definition 1. a:b::c:d iff ((a → b) ≡ (c → d)) ∧ ((b → a) ≡ (d → c))

Starting from the initial definition, this is an immediate consequence of the Boolean equivalence (a ∧ ¬b) ≡ ¬(a → b). As it has been noticed in [29], if a ≡ b and c ≡ d both hold, then a:b::c:d holds, which is formally expressed as

(a ≡ b) ∧ (c ≡ d) → a:b::c:d

Let us also mention Klein's definition, namely (a ≡ b) ≡ (c ≡ d), which can be regarded as the first attempt to give a binary interpretation to analogical proportion. In Klein's view, the 2 patterns 0110 and 1001 are allowed (together with 1010 and 0101), which is not intuitively satisfactory (since it is not expected that b:a::c:d follows from a:b::c:d). In fact, with our definition, we only have:

a:b::c:d → ((a ≡ b) ≡ (c ≡ d))
Table 1
Analogy truth table: Boolean model.

a  b  c  d
0  0  0  0
1  1  1  1
0  1  0  1
1  0  1  0
0  0  1  1
1  1  0  0

It can also be noticed that the definition is left unchanged when the pair (a, b) is replaced by the pair (¬b, ¬a), i.e. a:b::c:d ≡ ¬b:¬a::c:d. Together with the symmetry and central permutation axioms, this is equivalent to a:b::c:d → c:¬b::d:¬a. Property (iii), added to the fact that neither a:b::¬a:¬b nor a:¬a::b:¬b hold, could seem a bit counter-intuitive at first glance. Nevertheless, this should not come as a surprise if we remember that analogy is only characterized by the three axioms of Section 2, which do not constrain its behavior with respect to operators that are associated to a particular interpretative setting. Moreover, it is clear that the logical definition of an analogical proportion relies on something more essential than a superficial formal similarity (which might lead to think that a:b::¬a:¬b holds, which is wrong). When, instead of dealing with Boolean values, we deal with concepts represented as words (like "car" or "human"), we cannot rely on any pre-existing structure to provide a definition for analogical proportion. Before leaving this section, let us however investigate how we could deal with such analogies by applying algebraic methods.

2.4. Formal frameworks to deal with natural language analogies

Let us consider two basic analogies, namely "read is to reader as write is to writer" and "heart is to human as engine is to car". Back to our initial analysis, when a:b::c:d holds, it means that a and b differ in the same way as c and d differ (we use the notion of specificities in that case). In the case of "read is to reader as write is to writer", where a, b, c, d are easily identified, we have spec(a, b) = ∅ = spec(c, d) and spec(b, a) = 'er' = spec(d, c). The case of "heart is to human as engine is to car" is more tricky and does not rely on a simple syntactic operation. In fact, the words are used here to refer to concepts, and they implicitly call for external pieces of knowledge such as:

partOf(heart, human)
stop(heart) → ¬move(human)
stop(heart) → ¬think(human)
isFunctional(human) → move(human) ∨ think(human)

which constitutes an implicit knowledge base from which we can infer

¬(move(human) ∨ think(human)) → ¬isFunctional(human)

and finally

stop(heart) → ¬isFunctional(human).

The same kind of implicit knowledge applies to car and engine:

partOf(engine, car)
stop(engine) → ¬move(car)
stop(engine) → ¬neutralGear(car)
isFunctional(car) → move(car) ∨ neutralGear(car)

from which we finally infer

stop(engine) → ¬isFunctional(car).

Let us denote KB(heart, human) the first knowledge base and KB(engine, car) the second one. In some sense, KB(heart, human) (resp. KB(engine, car)) specifies the link between heart and human (resp. engine and car). Evaluating the analogy amounts to comparing these links and noticing their (partial) identity. Obviously, "wine is to France as beer is to England" would lead to the same kind of treatment, but using other predicates, leading for instance to:

isaDrink(beer), isaDrink(wine), drink(wine, France), drink(beer, England),
alcohol(beer), alcohol(wine), isaCountry(France), isaCountry(England)

Representing this knowledge base with a Boolean table, we get Table 2.

Table 2
Boolean modeling for "wine is to France as beer is to England".

          Alcohol  isaDrink  isaCountry  drink(beer, England)  drink(wine, France)
Wine      1        1         0           0                     1
France    0        0         1           0                     1
Beer      1        1         0           1                     0
England   0        0         1           1                     0

We observe that the analogical proportion holds componentwise, which allows us to conclude that the proportion holds as a whole. With that way to proceed, checking whether an analogical proportion holds becomes easier and relies only on the identification of proportions between atomic facts. A similarly complex hand-coded representation has to be done for the SMT ("structure mapping theory") engine (which has been implemented in LISP). With the last version of the works of Forbus et al. (see [12] for instance), derived from SME ("structure mapping engine"), it is not necessary to hand-code the whole representation of the text at hand; only a sketch is needed. But at least the user needs to identify the basic components in the sketch in order to hand-label them with terms from a given knowledge base (in that case derived from OpenCyc, http://www.cyc.com/opencyc). Unfortunately, this knowledge base has to be seriously extended to cope with the whole scope of analogy-making: till now, this extension has to be done manually. As we understand, in any case, a structure has to be brought to the core knowledge from which we can work, and this is not always an easy task. That is why a different viewpoint has been developed in [27], one that does not rely on any representation or structure and that we describe in the following section.

3. Analogies in natural language: a complexity view

"Wine is to France as beer is to England" is a good example of what we try to capture in this section. In that case, representations (i.e. words) would be only implicit and summarized in terms of information amounts, noticing that we are interested in what "information" is common to a and b (resp. c and d), and more importantly in what "information" is added/deleted when "going from" a to b or from c to d. So if we are able to properly define this notion of "information" for words representing concepts, it could be the basis for a quantitative, information-based interpretation of analogy between concepts. This is why we turn to information theory. When it comes to information theory, at least two candidate theories compete:

– The theory developed by C. Shannon in 1948, whose fundamental concept is the notion of entropy: for a given string, this entropy is usually expressed as the average number of bits needed for an emitter to send the string to a receiver. This measures the quantity of information contained in the transmitted string. This notion of information is thus related to a notion of transmission and is based on probability theory, since Shannon entropy is just an average value.
– By contrast, the theory developed in the sixties by A. Kolmogorov, also known as Kolmogorov complexity theory, is only linked to the way we can describe a given string with a Turing machine, without any reference to a notion of transmission or probability. This framework provides a kind of universal information measure without any reference to an average value.
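To make the contrast concrete (standard definitions, recalled here for convenience rather than taken from the paper): Shannon's entropy is an average over a source distribution, whereas Kolmogorov complexity is attached to a single individual string,

$H(X) = -\sum_{x} p(x)\,\log_2 p(x) \qquad \text{vs.} \qquad K(x) = \min\{\,|p| : U(\varepsilon, p) = x\,\},$

where U is a universal Turing machine; the second definition is made precise in Section 3.1 below.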



However, there are strong mathematical links between the Shannon and Kolmogorov views, which are clearly highlighted in [23]. On top of that, Kolmogorov complexity can also be used as a powerful tool to deal with communication complexity, as it has been recently shown in [18]. Starting from the now universal model of communication complexity defined in [46], the authors investigate the use of the Kolmogorov definition within lower bound complexity proofs, targeting diverse communication protocols. It emerges that their technique leads to simpler proofs (simpler than those coming from Shannon's theory, as in [1] for instance), highlighting once again that the intuitions underlying Kolmogorov complexity and Shannon's entropy are very close despite their different theoretical settings. Nevertheless, in this paper, we use the Kolmogorov setting, as it appears to be more immediately appropriate for our purpose. Indeed, we are interested in evaluating the informative content of a concept, represented by a word (e.g., 'car'), both in an absolute manner and relatively to another concept (e.g., 'engine'). These questions are reminiscent of Kolmogorov complexity theory, which handles strings instead of 'concepts'. This observation has led us to try to adapt the Kolmogorov setting to our problem.

3.1. Kolmogorov theory: brief overview

Developed in the late 1960s, Kolmogorov complexity theory aims to give a formal meaning to the notion of 'information content'. For a given string x (in that context, a finite sequence of 0s and 1s), the Kolmogorov complexity K(x) is a numerical measure of the descriptive complexity contained in x. In this paper, we simply give some notations and intuitions that are useful to understand our work. We start from a universal Turing machine U, with an input tape containing a string y, a program tape containing a string p, and an output tape. Universal simply means that any other machine can be simulated with U: following Church's thesis, there are such machines. When we start to run p on U with y as input, if the machine halts, we have a finite string x on the output tape, and a finite part pr of p has been read. It is convenient to adopt a functional notation: U(y, pr) = x. It means that there is a way to transform y into x using pr, or any program with pr as prefix. Another way to put it is to say that pr can reconstruct x with the help of auxiliary data y. Then the conditional Kolmogorov complexity of x relative to y is:

Definition 2.

K(x|y) = min{|pr| : U(y, pr) = x}

where |pr| denotes the length of pr (the number of bits needed to encode it). In some sense, K(x|y) represents the shortest way to go from y to x. Then the Kolmogorov complexity of x is just:

Definition 3.

K(x) = K(x|ε)

where ε denotes the empty string.

Given a program p such that |p| = K(x), able to produce x from U with no auxiliary string, p can be understood as the essence of x, since we cannot recover x from a shorter program than p. It is thus natural to consider p as the most compressed version of x, and the size of p, K(x), as a measure of the amount of information contained in x. With this viewpoint, K(x|y) measures the amount of information we need to recover x from y. K is extended to pairs of strings simply by putting that K(x, y) is the length of the shortest program which can output the pair and then halt. There is a huge literature on the works of Kolmogorov: a comprehensive description can be found in the book [23]. The function K enjoys a lot of amazing properties, one of which being that it is not computable. This is obviously an issue we shall have to deal with when it comes to implementing a practical tool. Let us postpone this issue to Section 3.3. For now, having a concise definition of "information content", we have the necessary tool to interpret an analogical proportion in the context of natural language.

3.2. Kolmogorov model for analogy in natural language

Taking inspiration from the definitions above, [27] starts with some obvious and simple ideas:

– we work on flat finite strings representing concepts: a, b, c and d are simple strings, and we only have access to their information content via K;
– following our initial analysis, the common understanding of "a:b::c:d" is that a differs from b in the same way as c differs from d, which can be expressed through the conditional complexities of the items.

This leads to the following candidate interpretations of a:b::c:d:

1. A first, simple definition is:

[K(a/b) = K(c/d)] ∧ [K(b/a) = K(d/c)]   (I1)

Obviously, this definition obeys the first axiom and the symmetry postulate of an analogical proportion. But there is no way, starting from the Kolmogorov complexity properties, to infer that central permutation holds.

2. To take into account the central permutation postulate, required for a genuine interpretation of analogy, we can enforce the fact that a:c::b:d should hold as well, then leading to interpret a:b::c:d as the following more constrained requirement:

I1 ∧ I2

where I2 is

[K(a/c) = K(b/d)] ∧ [K(c/a) = K(d/b)]   (I2)

Now, it is clear that our second definition satisfies the requirements to be an analogical proportion. Starting from that, it remains to see if the formula properly captures the expected semantics by checking 2 points:

– any well-agreed analogy should satisfy the formula,
– and a natural language construction involving 4 words which is not considered as an analogy should not satisfy the formula.

In fact, when considering the first simple definition of analogical proportion I1, we understand that a pair of words (a, b) is represented by a real-valued vector in $\mathbb{R}^2$, $\vec{v}_{a,b} = (K(a/b), K(b/a))$, and I1 can be rewritten as:


$\|\vec{v}_{a,b} - \vec{v}_{c,d}\| = 0$

Although this formulation of I1 is exactly equivalent to the previously given one, it provides a more compact notation. Using the same notation, I2, which enforces the central permutation property, can now be rewritten as:

$\|\vec{v}_{a,c} - \vec{v}_{b,d}\| = 0$

Although these last notations carry exactly the same semantics as the initial definitions, they fit better with the implementation process:

– Instead of comparing separately each component K(a/b) with K(c/d), etc., we first compute the vector norms above. Then, since there is no way to get exactly 0 for each norm in practice, we use thresholds that have been experimentally tuned for our classification purpose.
– This view is more amenable to a testing process where we only have to rank-order 4-tuples of words in terms of their relevance to the idea of analogical proportion (in the sense captured by I1 and/or I2). For instance, when dealing with I1 only (but the same obviously applies to I2), between 2 candidates a:b::c:d and a:b::c′:d′, we consider the "best" one to be the one associated with $\min(\|\vec{v}_{a,b} - \vec{v}_{c,d}\|, \|\vec{v}_{a,b} - \vec{v}_{c',d'}\|)$.

Before going to a practical implementation, it remains to define a protocol to estimate K(a/b) for every couple of words (a, b). This is the object of the next section.

3.3. Universal distribution

At this stage, there is no clear way to estimate K(a/b) (or even K(a)) for a given couple of words a, b. As explained above, K is not computable, but it is at least upper semicomputable, i.e. it can be computably approximated from above. It has to be noticed that none of our definitions makes use of what is known as the Normalized Information Distance (nid) between 2 strings x and y:

$\mathrm{nid}(x, y) = \frac{\max\{K(x|y), K(y|x)\}}{\max\{K(x), K(y)\}}$

In fact, it has recently been shown [39] that nid is neither upper semicomputable nor lower semicomputable. This is a reason not to use it in this paper, despite the fact that a practical approximation of nid using compression, known as the normalized compression distance (ncd) and developed in [7,21], has turned out to be very competitive when compared to other standard distances used for data mining and clustering. Back to our initial issue of finding a way to approximate K, we take our inspiration from [6], which builds on the work of Ray Solomonoff [34], whose idea was to define a kind of universal distribution over all possible objects to overcome the problem of unknown prior distributions within Bayes' formula. For a given string a, i.e. a finite sequence of 0s and 1s, his idea was to consider 2^{−K(a)} as the unknown a priori probability of a when nothing else is known. Roughly speaking, this formula tells us that the more complex the string a, the more unlikely it is. In other words, simple strings are more likely than complex ones. In fact, in order to define a true probability measure, this definition needs some technical refinements which are out of the scope of this paper (we have to restrict the type of authorized programs to the so-called "reduced programs" [20,22,4]). With this in mind, the mapping a ↦ 2^{−K(a)} becomes a probability distribution over the set of finite strings {0, 1}*. From our point of view, we can understand this number as the probability for a to appear (i.e., in that case, to be produced by a Turing machine).

As the negative base-2 logarithm of 2^{−K(a)} is just K(a), any process generating strings whose mass distribution is known can be used as a Kolmogorov complexity estimator: if p(a) is the probability of a to be generated by the process, then an estimation of K(a) is just −log₂(p(a)). As explained previously, we get an upper bound of the complexity, since K(a) ≤ −log₂(p(a)); in other words, a mass distribution will allocate a complexity greater than the real one, and we do not know the range of the error. At this stage of our work, this is not an important issue. It remains for us to find a process generating a known mass distribution over words, relevant for our purpose. This is a simple problem since, as we will see, any corpus of words, equipped with a suitable querying engine, will do the job.

4. Putting the complexity view in practice

In this section, we investigate two types of corpora, and we show how our diverse experimentations validate (at least partially) the ideas described above, using the same kind of strategy as in [6,27], which we recall below.

4.1. Probability distribution generator

Since our words (or strings) are just syntactic representations of concepts, it is relevant to deal with a text corpus or database where these words get their meaning. When we are looking for a word a, querying the database to get the number n of pages where a appears at least once, then dividing this number by the total number M of pages in the corpus, we get the frequency p = n/M of this word in our database (more precisely, the frequency of the pages containing at least one occurrence of the word). Considering this frequency as a probability, and applying the negative logarithm, we get −log₂(p) as an estimation of the Kolmogorov complexity of a. This is in accordance with the intuitive idea that, if a word a is rarely used (low probability), the underlying concept is relatively complex, thus justifying the high complexity measure. But we have to deal with K(a/b) as well: this is naturally estimated via

$-\log_2 p(a|b) = -\log_2 \frac{p(a,b)}{p(b)} = \log_2 p(b) - \log_2 p(a,b)$

where p(a, b) is the proportion of pages containing both the words a and b. Let us investigate what the expected behaviour of our formulas is.

– Let us start with the definition of I1. From an implementation viewpoint, we expect the sum K(a/b) − K(c/d) + K(b/a) − K(d/c) to be close to 0 in order to classify a:b::c:d as an analogical proportion. This leads us to compute

$\log_2 p(a) + \log_2 p(b) - \log_2 p(c) - \log_2 p(d) - 2\log_2 p(a,b) + 2\log_2 p(c,d)$

which is equal to

$\log_2 \frac{p(a)\,p(b)\,p(c,d)^2}{p(c)\,p(d)\,p(a,b)^2}$

This number is close to 0 when

$N_1 = \frac{p(a)\,p(b)\,p(c,d)^2}{p(c)\,p(d)\,p(a,b)^2}$

is close to 1.

Among the numerous options to get this result in our context, one is to have both p(a) and p(b) close to p(a, b), and both p(c) and p(d) close to p(c, d). In terms of probability, this means a and b are not independent. Indeed, p(a) close to p(a, b) implies 2 facts: (1) K(b/a), estimated via −log₂ p(b/a), is close to 0, i.e. it is easy to get b from a. (2) The probability of a page containing a but not b is close to 0.


In our context, the probability of a page containing snow without containing flake (and vice versa) is quite low. That is why a proportion like "drop is to rain as flake is to snow" is easily classified as an analogy by our system.

– Let us carry on with definition I2, which is a new condition added to I1 to enforce the central permutation property. This new condition is similar to I1 where we have permuted b and c. So a computation as above for the compound condition I1 ∧ I2 leads to consider the number

$N_2 = \frac{p(a)\,p(c,d)\,p(b,d)}{p(d)\,p(a,b)\,p(a,c)}$

which has to be close to 1. The raw probabilities p(b) and p(c) disappear in this final formula, confirming the fact that they can be permuted in the proportion, which was not the case for the previous number N1.
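As an illustration of how these quantities can be computed from raw page counts, here is a sketch under our own assumptions: `page_count` is a hypothetical corpus primitive (e.g. backed by a search engine or a TREC index), M is the total number of pages, and the threshold `eps` is a tuning parameter, as discussed above:

```python
import math

M = 1_000_000  # assumed total number of pages in the corpus (hypothetical)

def page_count(*words):
    """Hypothetical primitive: number of pages containing at least one
    occurrence of every word in `words`."""
    raise NotImplementedError

def p(*words):
    return page_count(*words) / M

def K(a):
    # estimate of K(a): -log2 p(a), an upper bound of the true complexity
    return -math.log2(p(a))

def K2(a, b):
    # estimate of K(a/b): -log2 p(a|b) = log2 p(b) - log2 p(a, b)
    return math.log2(p(b)) - math.log2(p(a, b))

def N1(a, b, c, d):
    return (p(a) * p(b) * p(c, d) ** 2) / (p(c) * p(d) * p(a, b) ** 2)

def N2(a, b, c, d):
    return (p(a) * p(c, d) * p(b, d)) / (p(d) * p(a, b) * p(a, c))

def classify(a, b, c, d, eps=0.5):
    """Accept a:b::c:d when both N1 and N2 are close enough to 1,
    i.e. when |log2 Ni| is below an experimentally tuned threshold."""
    return abs(math.log2(N1(a, b, c, d))) <= eps and \
           abs(math.log2(N2(a, b, c, d))) <= eps
```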

We now have everything we need for our experimentation.

4.2. First experiments

Our experiments have been done with two corpora of words:

– In [27], we used the World Wide Web, taking advantage of Google as an effective querying engine. Our program is written in Javascript enhanced with the Google API (AJAX programming).
– In this paper, we use a TREC database (GOV) containing one year of US government proceedings (http://www.nist.gov/srd/nistsd22.htm) and the usual programs to submit queries.

From http://www.teachersdesk.org/vocabanal.html, we get a list of 50 well-agreed analogies. To build up our negative examples, we proceed in two ways:

– Method 1: Starting from an analogy a:b::c:d of the previous list, we build up a:b::d:c as a negative example, switching the 2 last items of the proportion, since it has been seen in a previous section that, when a:b::c:d holds, a:b::d:c does not hold in general (except if c and d are synonyms, and thus a and b as well).
– Method 2: Starting from an analogy a:b::c:d, we randomly choose a word d′ without any link with a, b, c, d and we build up a:b::c:d′ as a negative example.

At the end of this process, we get a testing set of 150 elements altogether. We summarize our results in terms of confusion matrices. Let us first recall the matrix (Table 3) we got in [27] when the Web is used as a text corpus, Google as a querying engine, and I1 ∧ I2 as the definition of analogy.

Table 3
Results with I1 ∧ I2 and the Web.

     Positive examples   Negative examples
+    38                  25
−    12                  75

This leads to an accuracy rate of (38 + 75)/150 × 100% ≈ 75%. This result has been improved up to 80% thanks to a careful examination of the failures, where it appears that polysemic words (for instance "view") generate a lot of errors: we replaced these polysemic examples with non-polysemic ones, which is possible due to the large size of the underlying corpus (the Web). So we keep the size of our testing set, but we modify its content. In the case of the GOV database, we were faced with the fact that some words appearing in the initial list of examples do not appear in the GOV corpus, or appear with a very low frequency (due to the fact that this is a government proceedings corpus where, for instance, "mice" does not appear, so we cannot deal with "mouse is to mice as woman is to women"). When a word appears with a very low frequency, it means we are close to 0 in terms of probability, and the estimated complexity becomes a very big number with which we cannot deal in a concise way with a standard programming language. So we are led to reduce the size of our testing set to 70 (28 positive and 42 negative examples), and, when using only I1 as the definition of analogy, we get the results of Table 4.

Table 4
Results with I1 and the GOV corpus.

     Positive examples   Negative examples
+    24                  24
−    4                   18

This leads to an accuracy rate of (24 + 18)/70 × 100% = 60% (this is mainly due to bad results on negative examples obtained by method 1). With I1 ∧ I2, the confusion matrix becomes Table 5.

Table 5
Results with I1 ∧ I2 and the GOV corpus.

     Positive examples   Negative examples
+    24                  19
−    4                   23

Finally, we obtain an accuracy rate of (24 + 23)/70 × 100% ≈ 70%, without any other preliminary tests. Nothing changes for the positive examples, but a slight improvement appears with the negative examples. The errors are apparently more numerous with negative examples generated by method 1 (reversing the ordering of two words in a genuine analogy). Finally, as such the method is not better than the results we got in [27] with the original examples, but our testing set is a bit small.

4.3. A new option: introducing negation

The above analysis leads us to consider a new definition for analogical proportion. The first definition I1 using Kolmogorov complexity was not sufficient to ensure the central permutation property, expected from an analogical proportion. That is why we have added to I1 (whose implementation is via the test N1 ≈ 1) a second condition I2 to enforce central permutation (whose implementation is via the test N2 ≈ 1). But, as seen in Section 2.3, there is another property, related to the negation operator, which is a:b::c:d → c:¬b::d:¬a. Obviously, this property has no translation in a Kolmogorov framework, just because it does not make sense to define K(c/¬b), for instance. Nevertheless, in terms of frequencies, it makes sense to consider the frequency of the pages containing at least one occurrence of c among the pages which do not contain a single occurrence of b, leading to an estimation of p(c/¬b). From a practical viewpoint, the term ¬b refers to the pages which do not contain an occurrence of b: instead of taking into account the presence of a word c in the context where a word b appears (i.e. p(c/b)), we consider the presence of the word c in the context where the word b does not appear, i.e. p(c/¬b); if n_b (resp. n_{c,b}) denotes the number of pages containing b (resp. both c and b), this is simply estimated as (n_c − n_{c,b})/(M − n_b). Obviously this gives some information about the existing link between c and b, which could be helpful in testing analogical proportions involving c and b. Since these conditional probabilities have no direct interpretation within the Kolmogorov framework, there is no need at this stage to go for a log function. This is why we express the corresponding new constraint I3 for analogical proportion directly with probabilities, enforcing the property above:

(p(c/¬b) = p(d/¬a)) ∧ (p(b/¬c) = p(a/¬d))   (I3)

When adding I3 on the same test set (so using I1 ∧ I2 ∧ I3 as the definition of analogy), the confusion matrix becomes Table 6.

Table 6
Results with I1 ∧ I2 ∧ I3 and the GOV corpus.

     Positive examples   Negative examples
+    24                  12
−    4                   30

Finally, we obtain an accuracy rate of (24 + 30)/70 × 100% ≈ 77%, which is comparable to (slightly better than) the result obtained with the Web. We still observe that we generally fail to correctly classify an analogy containing polysemic words or common nouns having a name as homonym. Of course, there is room for improvement: for instance, for the sake of symmetry, it could be interesting to add the following constraint I4 to I3:

(p(a/¬b) = p(c/¬d)) ∧ (p(b/¬a) = p(d/¬c))   (I4)

but at this stage, we think a larger test set is necessary before going for more accurate definitions. Before leaving this section, let us come back to the vectorial notation previously introduced. If we now denote $\vec{u}_{a,b} = (p(a/\neg b), p(b/\neg a))$, then I3 is just

$\|\vec{u}_{c,b} - \vec{u}_{d,a}\| = 0$

and I4 is expressed via the new constraint

$\|\vec{u}_{a,b} - \vec{u}_{c,d}\| = 0$

5. Turney's evaluation of the similarity between pairs of words

Among the works related to analogy, the approach of Turney [42,41] is probably one of the closest to this paper. Turney has investigated diverse techniques, from the vector space model (VSM) to the latest latent relational analysis [41,40], all of them departing from word frequency counts. It appears that these techniques can be used to solve multiple-choice analogy questions coming from the former scholastic assessment test or SAT test (college entrance test in the US) [8]. The primary aim was not to deal with analogies, but to classify analogous word pairs. The core notion is similarity, and at this stage Turney distinguishes between two kinds of similarity: attributional similarity, referring to a correspondence between attributes (e.g. "X is red", or "X has wheels"), and relational similarity, being a correspondence between relations (e.g. "X is greater than Y", or "X is made of Y"):

– when there is a high degree of attributional similarity between 2 words a and b, they are synonyms;
– when there is a high degree of relational similarity between 2 pairs of words (a, b) and (c, d), they constitute an analogy a:b::c:d.

– In the VSM approach, a pair of words (a, b) is represented as a multi-dimensional real vector $\vec{v}_{a,b}$. The measure of the strength of the "analogical link" between 2 pairs of words (a, b) and (c, d) is the cosine of the 2 associated vectors $\vec{v}_{a,b}$ and $\vec{v}_{c,d}$: the closer to 1 the cosine, the stronger the link. The way to translate a pair of words into a real-valued vector is to choose a list of joining terms, such as "for" or "to", and to build up phrases like "a for b", "a to b", etc. Turney chose a list of 64 joining terms JT and built up all the 128 combinations "a JT b" and "b JT a" (let us call them joining term patterns), getting 128 short sentences. Using AltaVista as a querying engine for the Web and looking for the previously instantiated joining term patterns, the corresponding component in $\vec{v}_{a,b}$ is log(x + 1), where x is the number of hits (i.e. the number of documents matching the query) for the given pattern. (It is noticeable that not only Turney, but other authors as well [31,24,32], consider log-based measures more effective than raw values.) This approach is quite successful in ordering analogical proportions according to human understanding: it achieves a score of 47% of successful guesses (corresponding to the best way of completing a pair of words with another pair, taken among five possible choices, in order to get an analogy), far higher than random guessing at 20% (since 5 choices (c, d) are given for each pair of words (a, b)).
– The author extended this approach to the more sophisticated latent relational analysis, where joining term patterns are derived from the corpus at hand. Starting from a pair (a, b), we search in an auxiliary thesaurus of synonyms all the synonyms for a and b. Then we build up alternate pairs of words (a′, b′) with all these synonyms: in some sense, a:b::a′:b′ are near analogical proportions. Then we search in the target corpus all short sentences starting with a and ending with b: these phrases are supposed to capture the relations between the words in each pair. Replacing the words by wild cards, we get patterns similar to the previous joining term patterns. For instance, "the mason cut the stone with" will generate a pattern "X * the stone *". Obviously, a typical pair of words will generate a huge number of patterns, and this number is reduced via a feature selection technique. From now on, the joining terms are not predefined: they are discovered in the corpus. It has to be noticed that this way to proceed is not very far from our predicate-based suggestion in Section 2.4. The obvious advantage of Turney's method is its flexibility (at the price of complexity!). Then a pair-pattern frequency matrix is built, in which each cell represents the number of times the corresponding pair (row) appears in the corpus with the corresponding pattern (column). Obviously we get a sparse matrix, which is then compressed using the Singular Value Decomposition method. At the end of this process, each pair of words (a, b) is still represented as a real-valued vector $\vec{v}_{a,b}$, i.e. the corresponding row in the matrix. For 2 pairs of words (a, b) and (c, d), this procedure leads to 2 real-valued vectors of the same dimension. The process of computing the relational similarity between 2 pairs of words (a, b) and (c, d) is as follows: we compute the cosine cos(a, b, c, d) of the 2 original row vectors $\vec{v}_{a,b}$ and $\vec{v}_{c,d}$. Then we compute all the cosines cos(a′, b′, c′, d′) corresponding to alternate synonym pairs (a′, b′) and (c′, d′): the final relational similarity between (a, b) and (c, d) is just the average of all the cosines cos(a′, b′, c′, d′) greater than or equal to the initial cosine cos(a, b, c, d). This way to proceed ensures a score of 56.8% (instead of 47.7%) on the SAT corpus.
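For concreteness, a sketch of the VSM scoring just described (ours, not Turney's code; `hits` is a hypothetical query-count primitive and the joining-term list is an illustrative subset, not Turney's actual 64 terms):

```python
import math

JOINING_TERMS = ["of", "for", "to", "in", "on", "with"]  # illustrative subset

def hits(phrase):
    """Hypothetical primitive: number of documents matching `phrase`."""
    raise NotImplementedError

def pair_vector(a, b):
    # one log(x + 1) component per instantiated joining-term pattern
    patterns = [f"{a} {jt} {b}" for jt in JOINING_TERMS] + \
               [f"{b} {jt} {a}" for jt in JOINING_TERMS]
    return [math.log(hits(q) + 1) for q in patterns]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

def relational_similarity(a, b, c, d):
    # closer to 1 means a stronger analogical link a:b::c:d
    return cosine(pair_vector(a, b), pair_vector(c, d))
```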

Starting from these works, the author proposes in [41] a uniform approach for analogies, synonyms, antonyms and word associations. Analogical proportion is the core concept from which we can derive the other ones. A pair of words (a, b) are antonyms as soon as they make an analogical proportion with the pair (black, white), i.e., a:b::black:white holds. Two words (a, b) are synonymous as soon as the proportion a:b::levied:imposed holds. Finally, they are associated when a:b::doctor:hospital holds. This is a rather original view, and it obviously avoids building up ad hoc algorithms to deal with these diverse semantic phenomena: analogical proportion is a way to subsume all these concepts. Analogy detection is then considered as classifying analogous word pairs, i.e. classification of semantic relations between words, the basic hypothesis being that lexical knowledge is relational, not attributional (for instance, the knowledge in WordNet comes from the graph structure).
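To illustrate this uniform view (our sketch, building on the `relational_similarity` function above; the prototype pairs are those quoted in the text, and the threshold is a free parameter we introduce for illustration):

```python
PROTOTYPES = {
    "antonyms":   ("black", "white"),
    "synonyms":   ("levied", "imposed"),
    "associated": ("doctor", "hospital"),
}

def semantic_relation(a, b, threshold=0.6):
    """Label the pair (a, b) by the prototype pair with which it best
    forms an analogical proportion a:b::c:d."""
    scores = {name: relational_similarity(a, b, c, d)
              for name, (c, d) in PROTOTYPES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "unrelated"
```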


6. Further experiments

Let us go back to our approach and investigate the implementation process. Back to our vectorial notation, where $\vec{v}_{a,b} = (K(a/b), K(b/a))$ and $\vec{u}_{a,b} = (p(a/\neg b), p(b/\neg a))$, our 4 constraints take the following forms (for the sake of simplicity, I1, I2, I3 and I4 will from now on denote the left-hand sides of these equalities, e.g. $I_1 = \|\vec{v}_{a,b} - \vec{v}_{c,d}\|$):

$\|\vec{v}_{a,b} - \vec{v}_{c,d}\| = 0$   (I1)
$\|\vec{v}_{a,c} - \vec{v}_{b,d}\| = 0$   (I2)
$\|\vec{u}_{c,b} - \vec{u}_{d,a}\| = 0$   (I3)
$\|\vec{u}_{a,b} - \vec{u}_{c,d}\| = 0$   (I4)

But now, if instead of classifying a proportion a:b::c:d, we follow Turney's approach, which is simply to rank-order pairs of words (c:d) in terms of their relevance with regard to another pair (a:b) (the stem), then we do not rely on thresholds anymore. For instance, starting from the stem (mason, stone), we could have to order:

– mason:stone::teacher:chalk
– mason:stone::carpenter:wood
– mason:stone::soldier:gun
– mason:stone::photograph:camera
– mason:stone::book:word

From a human understanding, mason:stone::carpenter:wood is the "best" analogy, and in the former scholastic assessment test or SAT test (college entrance test in the US), (carpenter:wood) is considered the best match to build up an analogical proportion with (mason:stone). In our case, we now have to order the norms in increasing order, the best match corresponding to the smallest norm. In fact, to have a fair ordering process, we need to normalize our numbers. So let us consider the following definitions:

1. $I_1^{normal} = I_1 / \max(\|\vec{v}_{a,b}\|, \|\vec{v}_{c,d}\|)$
2. $I_2^{normal} = I_2 / \max(\|\vec{v}_{a,c}\|, \|\vec{v}_{b,d}\|)$
3. $I_3^{normal} = I_3 / \max(\|\vec{u}_{c,b}\|, \|\vec{u}_{d,a}\|)$
4. $I_4^{normal} = I_4 / \max(\|\vec{u}_{a,b}\|, \|\vec{u}_{c,d}\|)$

We immediately understand that we have a lot of options at this stage that we cannot discriminate without a deeper investigation. Let us consider these different options, each of them having its own logic:

– We may simply use $I_1^{normal}$, and in that case the best analogy will be the one with the smallest value of $I_1^{normal}$.
– We may also order the possible solutions according to the increasing value of $\max(I_1^{normal}, I_2^{normal})$ (thus requiring both $I_1^{normal}$ and $I_2^{normal}$ to be as small as possible).
– Or, leaving a pure Kolmogorov framework, we may use $\max(I_3^{normal}, I_4^{normal})$.
– Or we may use all the indexes together through $\max(I_1^{normal}, I_2^{normal}, I_3^{normal}, I_4^{normal})$.
– Another way to take advantage of all four elementary indexes $I_k$ is to take into account their collective propensity to privilege some pair of words as being the best solution. This leads us to test a majority rule (where $I_1^{normal}$ is used in case of ties).

We have used a set of 147 pairs of words coming from the SAT [8] that are to be completed with another pair of words to be selected among 5 options. Each problem is thus built up with a stem and five pairs of words to rank-order. We use our Web implementation with the Google search engine. In Table 7, we provide the results (in terms of accuracy rate) we got by running all these options.

Table 7
Results with a testing set of 147 examples coming from SAT.

Rule                                                         Accuracy rate (%)
I1_normal                                                    25
max(I1_normal, I2_normal)                                    26
max(I3_normal, I4_normal)                                    30
max(I1_normal, I2_normal, I3_normal, I4_normal)              29
Majority rule                                                32

Remember that a pure random choice would give an average accuracy of 20% (one option to be chosen among 5). Our table highlights the fact that Kolmogorov complexity is slightly better than a pure random choice, but does not perform very well when it comes to ordering different options to build up an analogical proportion. In order to get a better result, we need to combine the diverse formulas.
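A sketch of the normalized indices and of the majority-rule ranking (our illustration; `v_pair` and `u_pair` are assumed to map a pair of words to the vectors (K(a/b), K(b/a)) and (p(a/¬b), p(b/¬a)) respectively, e.g. built from the frequency estimates of Section 4.1):

```python
import math
from collections import Counter

def norm(v):
    return math.hypot(*v)

def normalized_indices(v_pair, u_pair, a, b, c, d):
    """Return (I1_normal, I2_normal, I3_normal, I4_normal)."""
    def idx(pair_fn, x, y, z, t):
        px, py = pair_fn(x, y), pair_fn(z, t)
        return math.dist(px, py) / max(norm(px), norm(py))
    return (idx(v_pair, a, b, c, d),   # I1_normal
            idx(v_pair, a, c, b, d),   # I2_normal
            idx(u_pair, c, b, d, a),   # I3_normal
            idx(u_pair, a, b, c, d))   # I4_normal

def best_by_majority(v_pair, u_pair, stem, candidates):
    """Majority rule over the four indices; I1_normal breaks ties."""
    a, b = stem
    scores = {cd: normalized_indices(v_pair, u_pair, a, b, *cd)
              for cd in candidates}
    votes = Counter()
    for k in range(4):  # each index votes for the candidate it ranks first
        votes[min(candidates, key=lambda cd: scores[cd][k])] += 1
    top = max(votes.values())
    return min((cd for cd in candidates if votes[cd] == top),
               key=lambda cd: scores[cd][0])
```

For the mason:stone example above, `candidates` would be the five (c, d) pairs, and the function returns the pair that a majority of the four indices ranks first.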

To conclude this section, we have to notice that our representation is similar to Turney's, in the sense that each pair of words (a, b) is represented as a real-valued vector, but in our case:

– we have a maximum of 8 components per vector, that is to say half of the dimension needed by Turney;
– the semantics of the components is completely different and is related to a complexity view.

In terms of practical implementation, we have to highlight several points where we really differ from Turney's works:

– We use only the Web as a text corpus, without any pre-processing of the initial features: for instance, there is no search for morphological variations (i.e. plural, gender, etc.), as is the case for Turney.
– When we are looking for a pair of words (a, b) using Google, there is no fixed window to search for the words: they are taken into account as soon as they occur on a web page, and they may be separated by a huge number of words. In some sense, we let Google do the job without any extra constraints.
– We are still very sensitive to polysemic words, especially those having a homonym which is a name. For instance, if you look for "best" with Google, you will get a lot of companies or associations around the world having "BEST" as a name (example: Board of European Students of Technology). This artificially increases the frequency of "best" and thus decreases its Kolmogorov complexity relative to any other word. Ultimately, we cannot accurately classify analogies such as "winner:best::…".

7. Related works

For formal, algebra-based approaches to analogical proportions, the reader may refer to [36,37,25,2]. Moreover, in [33], formal proportions of structured objects have been investigated, going through second order logic substitutions: this approach allows its authors to capture high level mappings between highly structured universes. [37] might be viewed as a particular case of it and as a practical approach to build up an analogical-proportion based learning engine. Nevertheless, apart from the previously cited [35], these works do not really provide a framework to deal with natural language analogies. The use of analogy in cognition, and especially in learning, has been rigorously identified and widely discussed, e.g. [15]. But, as far as we know, Cornuejols [9] was the first to establish a link between analogical reasoning and Kolmogorov complexity, leaving the rather strict logical framework coming from [10] or, more recently, [30]. The author advocates a kind of "simplicity principle" to serve as a starting point for modeling analogy via Kolmogorov theory, considered as a mathematical formulation of Occam's razor. This approach is completely in line with the works in [5], in which "choose the pattern that provides the briefest representation of the available information" is the rational basis of a wide range of cognitive processes, including analogical reasoning. On the other hand, there are a number of works in diverse fields (linguistics, cognitive sciences) using word frequencies to design various similarity/dissimilarity measures, e.g. [38]. However, despite some obvious relations, our approach is quite different: the starting point is Kolmogorov complexity and its theoretical developments. We go through Google frequencies to estimate this complexity, but any alternative method able to provide an upper bound for this complexity would be suitable. For instance, in diverse other fields (spam filtering, network intrusion detection), when the string at hand is sufficiently long, complexity is estimated via compression. In our case, it simply does not make sense to compress "mason", "stone", etc. Moreover, it has been shown in [27] that a definition of analogy based on the Jaccard index provides much less satisfactory results in terms of analogy classification. This suggests that the information-theoretic model coming from Kolmogorov theory could be an accurate definition not only for strings, but also for concepts represented as words. Analogy and metaphors play an important role in linguistic creativity. In that perspective, different from ours, let us mention the system Palimpsest (designed by [44]) that builds up analogies or metaphors starting from the WordNet ontology. This system is able to overcome the rigid framework of WordNet by dynamically creating new concepts, implementing what is known as dynamic type hierarchy (as first advocated in [45] and developed in [43]). Veale's approach is thus able to capture metaphors which are not given in the ontology. Nevertheless, considering the metaphoric joke due to Sigmund Freud, "A wife is like an umbrella. Sometimes one takes a cab", cited by [44], it may be funny to notice that our system is able to classify "wife:umbrella::…".
8. Conclusion

In this paper, we have established a parallel between a purely abstract view of analogical proportion, suitable for structured domains, and a more practical one, dedicated to the handling of natural language analogies. Based on Kolmogorov complexity as a concise definition of the notion of ''information content'', we have provided diverse formulas that may be suitable for classifying analogical proportions in natural language. Using the relationship between Kolmogorov complexity and the universal distribution, we get a method to estimate this complexity, and then a way to implement a practical tool to check our definitions. Using frequency as an approximation of a probability, our definitions are implemented via the computation of diverse numbers, using a structured database, or the Web, as text corpus. A careful analysis of our implementation formulas leads us to take into account another kind of information, which cannot be described via Kolmogorov theory, but which makes sense in practice. When used for classification purposes, this last definition leads to results which are slightly better than those previously obtained with the Web as target corpus.

It also appears that the proposed approach performs better when it has to identify analogies in a collection mixing analogies with 4-tuples of words containing an intruder than when it has to rank-order potential solutions. Indeed, the four numbers I1, I2, I3 and I4, despite their theoretical basis, provide only a rather rough view of the comparison of the relations between two words, and are insufficient to accurately capture the differences between word pairs that may seem somewhat acceptable, at first glance, for constituting analogies with a given pair of words.

Obviously, it remains to check our ideas on a larger scale, in at least two directions:

• On the formal side, in order to get a more accurate classifier, to transfer into our definition other logical properties expected from an analogical proportion but not expressible within the Kolmogorov framework (such as p(a|¬b) = p(c|¬d) and p(b|¬a) = p(d|¬c)); see the sketch below.
• On the practical side, to work with other structured databases, more general than the TREC GOV corpus and having a larger diversity of words, allowing us to investigate a larger set of examples.

When properly developed, our definitions may provide the ability to identify analogies between concepts, starting only from natural language raw text, without the need for any pre-processing or external lexicon.
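As a pointer for the first direction above, here is a minimal sketch, under the assumption that the same page counts used for the complexity estimation are available, of how a property such as p(a|¬b) = p(c|¬d) could be checked; the function names, the tolerance and the whole test are hypothetical and were not part of the evaluated system.

def p_given_not(hits_a: int, hits_b: int, hits_ab: int, n: int) -> float:
    """Estimate p(a | not b): pages with a but without b, among pages without b."""
    return (hits_a - hits_ab) / (n - hits_b)

def negation_property_holds(ha, hb, hab, hc, hd, hcd, n, tol=0.05):
    """Check p(a|not b) = p(c|not d) up to a tolerance, from raw page counts."""
    return abs(p_given_not(ha, hb, hab, n) - p_given_not(hc, hd, hcd, n)) <= tol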
Acknowledgements

The authors are indebted to Mohand Boughanem and Cécile Laffaire for providing them with the opportunity to access the GOV corpus.

References

[1] Z. Bar-Yossef, T.S. Jayram, R. Kumar, D. Sivakumar, An information statistics approach to data stream and communication complexity, Journal of Computer and System Sciences 68 (2004) 702–732. Special issue on FOCS 2002.
[2] N. Barbot, L. Miclet, La proportion analogique dans les groupes: applications aux permutations et aux matrices, Technical Report 1914, IRISA, 2009.
[3] M. Bayoudh, H. Prade, G. Richard, A Kolmogorov complexity view of analogy: from logical modeling to experimentations, in: Research and Development in Intelligent Systems XXVII incorporating Applications and Innovations in Intelligent Systems XVIII, Proceedings 30th SGAI International Conference on Artificial Intelligence (AI-2010), Springer Verlag, Cambridge, UK, 14–16 December 2010.
[4] C. Bennett, P. Gács, M. Li, P. Vitányi, W. Zurek, Information distance, IEEE Transactions on Information Theory 44 (1998) 1407–1423.
[5] N. Chater, The search for simplicity: a fundamental cognitive principle?, The Quarterly Journal of Experimental Psychology 52 (2) (1999) 273–302.
[6] R. Cilibrasi, P. Vitányi, Automatic meaning discovery using Google, Manuscript, CWI, 2004.
[7] R. Cilibrasi, P. Vitányi, Clustering by compression, IEEE Transactions on Information Theory 51 (2005).
[8] C. Claman, 10 Real SATs, The College Board, 1997.
[9] A. Cornuéjols, Analogie, principe d'économie et complexité algorithmique, in: Actes des 11èmes Journées Françaises de l'Apprentissage, Sète, France, 1996.
[10] T.R. Davies, S.J. Russell, A logical approach to reasoning by analogy, in: IJCAI-87, Morgan Kaufmann, 1987, pp. 264–270.
[11] B. Falkenhainer, K.D. Forbus, D. Gentner, The structure-mapping engine: algorithm and examples, Artificial Intelligence 41 (1989) 1–63.
[12] K. Forbus, A. Lovett, K. Lockwood, J. Wetzel, C. Matuk, B. Jee, J. Usher, CogSketch, in: AAAI'08: Proceedings of the 23rd National Conference on Artificial Intelligence, AAAI Press, 2008, pp. 1878–1879.
[13] D. Gentner, Structure-mapping: a theoretical framework for analogy, Cognitive Science 7 (1983) 155–170.
[14] D. Gentner, The mechanisms of analogical learning, in: Similarity and Analogical Reasoning, Cambridge University Press, 1989, pp. 197–241.
[15] D. Gentner, K.J. Holyoak, B. Kokinov (Eds.), The Analogical Mind: Perspectives from Cognitive Science, MIT Press, 2001.
[16] A.K. Goel, Design, analogy and creativity, IEEE Expert 12 (1997) 62–70.
[17] J. Guan, Y. Gan, H. Wang, Discovering pattern-based subspace clusters by pattern tree, Knowledge-Based Systems 22 (2009) 569–579.
[18] M. Kaplan, S. Laplante, Kolmogorov complexity and combinatorial methods in communication complexity, Theoretical Computer Science 412 (2011) 2524–2535.
[19] S. Klein, Culture, mysticism & social structure and the calculation of behavior, in: Proceedings 5th European Conference on Artificial Intelligence (ECAI'82), Paris Orsay, 1982, pp. 141–146.
[20] A.N. Kolmogorov, Three approaches to the quantitative definition of information, Problems in Information Transmission 1 (1965) 1–7.
[21] M. Li, X. Chen, X. Li, B. Ma, P. Vitányi, The similarity metric, IEEE Transactions on Information Theory 50 (12) (2004) 3250–3264.
[22] M. Li, P. Vitányi, An Introduction to Kolmogorov Complexity and Its Applications, Springer-Verlag, 1997. ISBN 0-387-94053-7.
[23] M. Li, P. Vitányi, An Introduction to Kolmogorov Complexity and Its Applications, Springer-Verlag, 2008.
[24] D. Lin, An information-theoretic definition of similarity, in: ICML'98: Proceedings of the Fifteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1998, pp. 296–304.
[25] L. Miclet, A. Delhay, Relation d'analogie et distance sur un alphabet défini par des traits, Technical Report 1632, IRISA, 2004.
[26] L. Miclet, H. Prade, Handling analogical proportions in classical logic and fuzzy logics settings, in: Proceedings 10th ECSQARU, LNCS, vol. 5590, Springer, Verona, 2009, pp. 638–650.
[27] H. Prade, G. Richard, Testing analogical proportions with Google using Kolmogorov information theory, in: Proceedings International Conference FLAIRS-22, AAAI Press, Fort Myers, USA, 2009, pp. 272–277.
[28] H. Prade, G. Richard, Analogical proportions: another logical view, in: M. Bramer, R. Ellis, M. Petridis (Eds.), Research and Development in Intelligent Systems XXVI, Proceedings 29th Annual International Conference on AI (SGAI'09), Cambridge, UK, December 2009, Springer, 2010, pp. 121–134.
[29] H. Prade, G. Richard, Nonmonotonic reasoning – from cataloguing to analogizing, in: Proceedings International Conference 30 Years of Nonmonotonic Reasoning (NonMon@30), Lexington, October 22–25, 2010.
[30] H. Prade, G. Richard, Reasoning with logical proportions, in: Proceedings International Conference on Principles of Knowledge Representation and Reasoning (KR'10), Toronto, Canada, 2010, pp. 546–555.
[31] P. Resnik, Semantic similarity in a taxonomy: an information-based measure and its application to problems of ambiguity in natural language, Journal of Artificial Intelligence Research 11 (1999) 95–130.
[32] G. Ruge, Experiment on linguistically-based term associations, Information Processing & Management 28 (1992) 317–332.
[33] U. Schmid, H. Gust, K. Kühnberger, J. Burghardt, An algebraic framework for solving proportional and predictive analogies, in: Proceedings European Conference on Cognitive Science, 2003, pp. 295–300.
[34] R.J. Solomonoff, A formal theory of inductive inference, Information and Control 7 (1) (1964).
[35] J.F. Sowa, A.K. Majumdar, Analogical reasoning, in: Proceedings International Conference on Conceptual Structures, LNAI 2746, Springer-Verlag, Dresden, 2003, pp. 16–36.
[36] N. Stroppa, F. Yvon, An analogical learner for morphological analysis, in: Proceedings 9th Conference on Computational Natural Language Learning (CoNLL-2005), 2005, pp. 120–127.
[37] N. Stroppa, F. Yvon, Analogical learning and formal proportions: definitions and methodological issues, Technical Report, ENST Paris, 2005.
[38] E. Terra, C.L.A. Clarke, Frequency estimates for statistical word similarity measures, in: Proceedings of the Human Language Technology and North American Chapter of the Association for Computational Linguistics Conference (HLT/NAACL 2003), 2003, pp. 244–251.
[39] S.A. Terwijn, L. Torenvliet, P.M.B. Vitányi, Nonapproximability of the normalized information distance, Journal of Computer and System Sciences 77 (2011) 738–742.
[40] P.D. Turney, The latent relation mapping engine: algorithm and experiments, Journal of Artificial Intelligence Research 33 (2008) 615–655.
[41] P.D. Turney, A uniform approach to analogies, synonyms, antonyms, and associations, in: COLING'08: Proceedings of the 22nd International Conference on Computational Linguistics, Association for Computational Linguistics, Morristown, NJ, USA, 2008, pp. 905–912.
[42] P.D. Turney, M.L. Littman, Corpus-based learning of analogies and semantic relations, Machine Learning 60 (2005) 251–278.
[43] T. Veale, Dynamic type creation in metaphor interpretation and analogical reasoning: a case-study with WordNet, in: Proceedings of ICCS 2003, the 2003 International Conference on Conceptual Structures, 2003.
[44] T. Veale, An analogy-oriented type hierarchy for linguistic creativity, Knowledge-Based Systems 19 (2006) 471–479.
[45] E.C. Way, Knowledge Representation and Metaphor (Studies in Cognitive Systems), Kluwer Academic Publishers, Amsterdam, 1991.
[46] A.C.-C. Yao, Some complexity questions related to distributive computing (preliminary report), in: Proceedings of the Eleventh Annual ACM Symposium on Theory of Computing (STOC'79), ACM, New York, NY, USA, 1979, pp. 209–213.