Knowledge-Based Systems 29 (2012) 20–30
Evaluation of analogical proportions through Kolmogorov complexity

Meriam Bayoudh b,1, Henri Prade a, Gilles Richard a,*

a IRIT, 118 Route de Narbonne, 31062 Toulouse Cedex 9, France
b Centre IRD de Guyane, Route de Montabo BP 165, 97323 Cayenne CEDEX, France

* Corresponding author. E-mail addresses: [email protected] (M. Bayoudh), [email protected] (H. Prade), [email protected] (G. Richard).
1 On leave from IRIT, presently at Université des Antilles et de la Guyane at Cayenne.

doi:10.1016/j.knosys.2011.06.022
Article info

Article history: Available online 18 July 2011

Keywords: Analogical proportion; Kolmogorov complexity; Common sense analogies; Search engine; Google
Abstract

In this paper, we try to identify analogical proportions, i.e., statements of the form ''a is to b as c is to d'', expressed in linguistic terms. While it is conceivable to use an algebraic model for testing proportions such as ''2 is to 4 as 5 is to 10'', or even such as ''read is to reader as lecture is to lecturer'', there is no algebraic framework to support statements such as ''engine is to car as heart is to human'' or ''wine is to France as beer is to England'', helping to recognize them as meaningful analogical proportions. The idea is then to rely on text corpora, or even on the Web itself, where one may expect to find the pragmatics and the semantics of the words, in their common use. In that context, in order to attach a numerical value to the ''analogical ratio'' corresponding to the phrase ''a is to b'', we start from the works of Kolmogorov on complexity theory. This is the basis for a universal measure of the information content of a word a, or of a word a with respect to another one b, which, in practice, is estimated in a statistical manner. We investigate the link between a purely logical, recently introduced view of analogical proportions and its counterpart based on Kolmogorov theory. The criteria proposed for testing candidate proportions fit with the expected properties (symmetry, central permutation) of analogical proportions. This leads to a new computational method to define, and ultimately to try to detect, analogical proportions in natural language. Experiments with classifiers based on these ideas are reported, and results are rather encouraging with respect to the recognition of common sense linguistic analogies. The approach is also compared with existing works on similar problems.

© 2011 Elsevier B.V. All rights reserved.
1. Introduction

Despite its heuristic status, analogical reasoning is a commonly used form of reasoning which has the ability to shortcut long chains of classical deductions, while often reaching the same conclusions. It is largely accepted that analogy is the basis for creativity, as it puts different paradigms into correspondence (see [16,35,14]). Analogical reasoning is based on the human ability to identify ''situations'' or ''problems'' a and c, and then to ''deduce'' that, if b is a solution for problem a, then some d, whose relation to c is similar to the relation between a and b, might be a solution for problem c. Such a relation involving 4 items a, b, c, d is called an analogical proportion, or analogy for short, usually denoted a:b::c:d and read ''a is to b as c is to d''. Algebraic frameworks giving concise definitions of analogical proportions have been deeply investigated in [37] in recent years. For instance, when the universe is the set $\mathbb{R}$ of real numbers, the truth of a:b::c:d is interpreted as $a \times d = b \times c$, justifying ''2 is to 4 as 5 is to 10''.
Another example, now involving sequences of bits, could be ''01 is to 10 as 11 is to 00'', just because 01 and 10 do not share any bit, and this is also the case for 11 and 00. In [28,30], a complete logical framework has been developed, mainly Boolean-oriented, i.e. where the underlying universe is $\mathbb{B} = \{0, 1\}$ or isomorphic to $\mathbb{B}$. In the field of artificial intelligence, analogy-discovering programs have been designed for specialized areas where there exists an underlying minimal algebraic structure. Natural language analogies like ''engine is to car as heart is to human'' or ''wine is to France as beer is to England'' live more at a linguistic or conceptual level: a simple mathematical structure is missing to cope with such proportions. Sowa's conceptual graphs (CG) offer an appealing framework for representing concepts: core knowledge can be encoded using CG, and then, with the help of a structured linguistic database (like e.g. WordNet), we could discover analogies, as with the VivoMind analogy engine [35] for instance. Another option comes from the works of Gentner [13,14] on the so-called structure-mapping theory (SMT), implemented in the structure-mapping engine (SME) [11]. This way to proceed allows the author to exhibit high-level analogical proportions: for instance, the analogical proportion ''planets are to the sun as electrons are to the atom's kernel'' comes from the mapping between a representation of the solar system and a representation of the Bohr model of the atom. Obviously, this can only
be done with the help of a costly, high-level, hand-coded representation. And this is exactly what we want to avoid here! In the field of computational linguistics, the works of Turney et al. [42,41], relying on corpus-based techniques to learn semantic features like analogies, synonyms, antonyms and associations, are very successful, and we devote Section 5 to investigating this approach and comparing it with the one we propose. But let us carry on with our ideas now. In [27], a method dealing with natural language analogies but avoiding any pre-coding of the universe has been developed. The main idea is that each word a carries an ''information content'' that is formally defined via its Kolmogorov complexity, K(a), which is an ideal natural number. In order to build an effective implementation, this number has to be estimated. Thanks to the works of Solomonoff [34], it appears that K(a) can be related to the probability of a ''appearing''. Thus, applying a kind of reverse process, we start from a probability distribution to estimate the Kolmogorov complexity. Among the candidates to provide a probability distribution over the set of English words, the World Wide Web is a strong one. Considering Google as a web mining engine, it is an easy game to get, for each word (in our case, a concept representation), its frequency, and to consider it as a probability to appear in a document. Then we are done with the estimation of the Kolmogorov complexity of a word: applying our definitions, which involve only the complexities of a, b, c and d, we can now check whether a:b::c:d holds or not. It appears that the proposed definitions are rather consistent with a sample of well-agreed analogies, as we shall see. Obviously, the Web is a relatively dynamic corpus, and we could imagine improving our results with a more homogeneous database where, in some sense, noise has been filtered out. Starting from our previous works, we first re-implement a classifier using a structured database coming from the US National Institute of Standards and Technology (NIST) TREC Document Databases (http://www.nist.gov/srd/nistsd22.htm). Then a careful examination of our results leads us to propose other options, bridging the gap between a purely Boolean view and a Kolmogorov-based definition. Our paper is organized as follows: the next section starts from an informal analysis of the core concepts underlying an analogical proportion, leading to the well-agreed axioms defining this proportion. We also provide the Boolean interpretation of such a proportion and highlight the properties we expect to be satisfied in another context. In Section 3, we switch to natural language analogies, briefly recalling the main principles of Kolmogorov complexity theory and its companion concept known as the universal distribution. We show how to use it to provide different practical definitions of analogical proportion between concepts represented as words, highlighting the link with the logical setting described in Section 2. In Section 4, we examine the results obtained through diverse sets of experiments, and we show that they bridge the apparent gap between the Boolean framework and the complexity-based framework. Sections 5 and 6 provide a comparative discussion of the proposed model with another approach developed in computational linguistics, at the methodological level and on a preliminary experiment. Finally, we survey related works and conclude. This paper is a fully revised and substantially expanded version of a conference paper [3].
2. Analogical proportions: a logical view

An analogical proportion (from time to time in the remainder of the paper, the word ''analogy'' will be used as a shortcut for ''analogical proportion'') can be considered as a relation involving 4 items and satisfying some basic axioms which are supposed to capture its essence, and which we recall below. Let us start with an informal analysis of the core concepts underlying this relation.

2.1. Brief analysis

In order to transfer knowledge, analogical reasoning considers two situations in parallel and compares them by putting them into correspondence. In the structure-mapping theory terminology [11], the output of this process would be the so-called ''mapping function''. Here, we want to stick to a simpler context where each situation involves only two entities or items, say a, c on the one hand, and b, d on the other hand. The comparison then bears on the pair a and b, and on the pair c and d. This naturally leads to considering two kinds of properties: what is common to a and b, denoted com(a, b), and what is specific to a and not shared by b, denoted spec(a, b). Due to the intended meaning of com and spec, it is natural to assume com(a, b) = com(b, a), but in general we cannot assume spec(a, b) = spec(b, a): spec(a, b) ≠ spec(b, a) is more realistic. With this view,

a is represented by the pair (com(a, b), spec(a, b)),
b is represented by the pair (com(a, b), spec(b, a)),
c is represented by the pair (com(c, d), spec(c, d)),
d is represented by the pair (com(c, d), spec(d, c)).

Then, an analogical proportion between the 4 items, expressing that a is to b as c is to d, amounts to stating that the way a and b differ is the same as the way c and d differ, namely, using our notation:
spec(a, b) = spec(c, d) and spec(b, a) = spec(d, c)

assuming symmetry in the way the parallel is drawn. This simple informal observation highlights two expected properties:

– a is to b as a is to b, and
– if a is to b as c is to d, then c is to d as a is to b (due to the symmetry of the = operator).

Going a little deeper into this informal analysis, we can also observe that, since spec(a, b) = spec(c, d), a differs from c through the properties a shares with b, previously denoted com(a, b), and the same holds for b with respect to d. This amounts to writing spec(a, c) = spec(b, d), since both are equal to com(a, b). A symmetric reasoning leads to spec(c, a) = spec(d, b), which together with the previous equality exactly means that a is to c as b is to d. We retrieve here the central permutation postulate that most authors associate with analogical proportion, together with the symmetry postulate already mentioned. We have thus retrieved the 3 characteristic properties usually requested for a proper definition of analogical proportions. A small set-based sketch of this informal reading is given below. It is time now for a formalization.
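To make this reading concrete, here is a minimal sketch (ours, not from the original paper) where words are modeled as small feature sets, com as set intersection and spec as set difference; the feature sets themselves are invented for illustration.

```python
# Minimal sketch: com/spec over invented feature sets (illustration only).
wine, france = {"drink", "alcohol", "french"}, {"country", "french"}
beer, england = {"drink", "alcohol", "english"}, {"country", "english"}

def com(x, y):   # what x and y have in common
    return x & y

def spec(x, y):  # what is specific to x and not shared by y
    return x - y

# "wine is to France as beer is to England" requires
# spec(a, b) == spec(c, d) and spec(b, a) == spec(d, c):
print(spec(wine, france) == spec(beer, england))   # True: {'drink', 'alcohol'}
print(spec(france, wine) == spec(england, beer))   # True: {'country'}
```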
2.2. Formal setting
The best option is to consider a first order setting where a, b, c, d are variables and A denotes a quaternary relation. A is an analogical proportion when it satisfies the following axioms:
A(a, b, a, b) (identity)
A(a, b, c, d) ⇒ A(c, d, a, b) (symmetry)
A(a, b, c, d) ⇒ A(a, c, b, d) (central permutation)

Using these axioms, we infer that an analogical relation A should satisfy A(a, a, a, a) and A(a, a, b, b), which is intuitively satisfactory, but A(a, b, b, a) does not hold in general as soon as a ≠ b (back to our informal analysis, this is due to the fact that spec(a, b) ≠ spec(b, a)). These axioms, which have been considered for a long time and are directly inspired from the characteristic properties of numerical proportions, are supposed to capture the essence of analogical proportions. Clearly, the third postulate (central permutation) is the strongest one and is in some sense specific to analogical proportions. In the case of analogy between numbers, a ratio-based reading is natural, as in the example 3:6::4:8, and obviously agrees with the idea of central permutation. This is also the case with a difference-based reading for a numerical analogy such as 13:15::17:19. When it comes to geometry, a, b, c and d are vectors or points in $\mathbb{R}^2$: to be in analogical proportion, they have to be the vertices of a parallelogram, $\vec{ab} = \vec{cd}$, which is equivalent to d(a, b) = d(c, d) and d(a, c) = d(b, d) (where d is the Euclidean distance). But when it comes to analogical proportions between words representing concepts, things are more problematic: general analogical statements such as ''engine is to car as heart is to human'' or ''wine is to France as beer is to England'' have to be handled differently. This situation will be examined in Section 3. Basic properties of analogy can easily be deduced from the axioms. For instance:

Proposition 1. If A is an analogical relation, then the 5 following properties hold:

A(a, b, c, d) → A(c, a, d, b) (i) (by symmetry + central permutation)
A(a, b, c, d) → A(b, d, a, c) (ii) (by central permutation + symmetry)
A(a, b, c, d) → A(b, a, d, c) (iii) (by ii + central permutation)
A(a, b, c, d) → A(d, c, b, a) (iv) (by iii + symmetry)
A(a, b, c, d) → A(d, b, c, a) (v) (by iv + central permutation)

This means that, when an analogical proportion holds for (a, b, c, d), the same proportion holds for 7 permutations of (a, b, c, d), including the 2 obtained by symmetry and by central permutation, leading to a class of 8 permutations satisfying the proportion (the sketch below enumerates this class). When there is no ambiguity about the context, the standard notation (used in the remainder of this paper) for the analogy A(a, b, c, d) is a:b::c:d.
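As a quick sanity check (our own sketch, not part of the paper), the class of 8 permutations can be enumerated by closing a 4-tuple under the symmetry and central permutation postulates:

```python
# Sketch: closure of (a, b, c, d) under symmetry and central permutation.
def closure(t):
    seen, todo = {t}, [t]
    while todo:
        a, b, c, d = todo.pop()
        for nxt in ((c, d, a, b),    # symmetry
                    (a, c, b, d)):   # central permutation
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen

print(len(closure(('a', 'b', 'c', 'd'))))   # 8, as stated in the text
```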
Let us now consider a Boolean interpretation of analogy.

2.3. Boolean interpretation

When the items a, b, c, d belong to a structured universe, it is relatively easy to define an analogical proportion, and this has been done for diverse universes: Boolean lattices, sets, strings, etc. (see [37] for instance). In this section, we recall the Boolean model (where items belong to $\mathbb{B} = \{0, 1\}$) as defined in [26], and we underline some remarkable properties. In that case, a:b::c:d is defined as the following Boolean formula:

$((a \wedge \neg b) \equiv (c \wedge \neg d)) \wedge ((\neg a \wedge b) \equiv (\neg c \wedge d))$

This formula is true for the 6 truth-value assignments of a, b, c, d appearing in Table 1, and is false for the $2^4 - 6 = 10$ remaining possible assignments.

Table 1
Analogy truth table: Boolean model.

a  b  c  d
0  0  0  0
1  1  1  1
0  1  0  1
1  0  1  0
0  0  1  1
1  1  0  0

This relation over $\mathbb{B}^4$ satisfies the 3 axioms required from an analogical proportion, and several equivalent writings have been proposed in [26], e.g.

Definition 1. (a:b::c:d) iff $((a \rightarrow b) \equiv (c \rightarrow d)) \wedge ((b \rightarrow a) \equiv (d \rightarrow c))$

Starting from the initial definition, this is an immediate consequence of the Boolean equivalence $(a \wedge \neg b) \equiv \neg(a \rightarrow b)$. As noticed in [29], if $a \equiv b$ and $c \equiv d$ both hold, then a:b::c:d holds, which is formally expressed as

$(a \equiv b) \wedge (c \equiv d) \rightarrow (a{:}b{::}c{:}d)$

Klein's definition [19], $(a \equiv b) \equiv (c \equiv d)$, can be regarded as the first attempt to give a binary interpretation to analogical proportion. In Klein's view, the 2 patterns 0110 and 1001 are allowed (together with 1010 and 0101), which is not intuitively satisfactory (since it is not expected that b:a::c:d follows from a:b::c:d). In fact, with our definition, we only have:

$(a{:}b{::}c{:}d) \rightarrow ((a \equiv b) \equiv (c \equiv d))$

A brute-force check of this definition is sketched below.
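The following sketch (our illustration, assuming nothing beyond the formula above) enumerates the 16 assignments and recovers the 6 rows of Table 1:

```python
# Sketch: brute-force check of the Boolean analogical proportion.
from itertools import product

def boolean_analogy(a, b, c, d):
    # ((a ∧ ¬b) ≡ (c ∧ ¬d)) ∧ ((¬a ∧ b) ≡ (¬c ∧ d))
    return (bool(a and not b) == bool(c and not d)) and \
           (bool(not a and b) == bool(not c and d))

valid = [t for t in product((0, 1), repeat=4) if boolean_analogy(*t)]
print(valid)            # exactly the 6 rows of Table 1
assert len(valid) == 6  # the 2^4 - 6 = 10 other assignments are rejected
```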
The Boolean definition also satisfies a property related to the negation operator: a:b::c:d → c:¬b::d:¬a. Property (iii), added to the fact that neither a:b::¬a:¬b nor a:¬a::b:¬b holds, could seem a bit counter-intuitive at first glance. Nevertheless, this should not come as a surprise if we remember that analogy is only characterized by the three axioms of Section 2, which do not constrain its behavior with respect to operators associated with a particular interpretative setting. Moreover, it is clear that the logical definition of an analogical proportion relies on something more essential than a superficial formal similarity (which might lead one to think that a:b::¬a:¬b holds, which is wrong). When, instead of dealing with Boolean values, we deal with concepts represented as words (like ''car'' or ''human''), we cannot rely on any pre-existing structure to provide a definition of analogical proportion. Before leaving this section, let us however investigate how we could deal with such analogies by applying algebraic methods.

2.4. Formal frameworks to deal with natural language analogies

Let us consider two basic analogies, namely ''read is to reader as write is to writer'' and ''heart is to human as engine is to car''. Back to our initial analysis, when a:b::c:d holds, it means that a and b differ in the same way as c and d differ (we use the notion of specificities in that case). In the case of ''read is to reader as write is to writer'', where a, b, c, d are easily identified, we have spec(a, b) = ∅ = spec(c, d) and spec(b, a) = 'er' = spec(d, c). The case of ''heart is to human as engine is to car'' is trickier and does not rely on a simple syntactic operation. In fact, the words are used here to refer to concepts, and they implicitly call for external pieces of knowledge such as:
partOf(heart, human)
stop(heart) → ¬move(human)
stop(heart) → ¬think(human)
isFunctional(human) → move(human) ∨ think(human)

which constitutes an implicit knowledge base from which we can infer

¬(move(human) ∨ think(human)) → ¬isFunctional(human)

and finally

stop(heart) → ¬isFunctional(human).

The same kind of implicit knowledge applies to car and engine:

partOf(engine, car)
stop(engine) → ¬move(car)
stop(engine) → ¬neutralGear(car)
isFunctional(car) → move(car) ∨ neutralGear(car)

from which we finally infer

stop(engine) → ¬isFunctional(car).

Let us denote by KB(heart, human) the first knowledge base and by KB(engine, car) the second one. In some sense, KB(heart, human) (resp. KB(engine, car)) specifies the link between heart and human (resp. engine and car). Evaluating the analogy amounts to comparing these links and noticing their (partial) identity. Obviously, ''wine is to France as beer is to England'' would lead to the same kind of treatment, but using other predicates, leading for instance to:

isaDrink(beer), isaDrink(wine), drink(wine, France), drink(beer, England),
alcohol(beer), alcohol(wine), isaCountry(France), isaCountry(England)

Representing this knowledge base with a Boolean table, we get Table 2.

Table 2
Boolean modeling for ''wine is to France as beer is to England''.

          Alcohol   isaDrink   isaCountry   Drink(beer, England)   Drink(wine, France)
Wine      1         1          0            0                      1
France    0         0          1            0                      1
Beer      1         1          0            1                      0
England   0         0          1            1                      0

We observe that the analogical proportion holds componentwise, which allows us to conclude that the proportion holds in its whole (a short sketch of this check is given below). With that way to proceed, checking whether an analogical proportion holds becomes easier, and relies only on the identification of proportions between atomic facts. A similarly complex hand-coded representation has to be built for the SMT (''structure mapping theory'') engine (which has been implemented in LISP). With the latest version of the works of Forbus et al. (see [12] for instance), derived from SME (''structure mapping engine''), it is not necessary to hand-code the whole representation of the text at hand; only a sketch is needed. But at least the user needs to identify the basic components in the sketch in order to hand-label them with terms from a given knowledge base (in that case derived from OpenCyc, http://www.cyc.com/opencyc). Unfortunately, this knowledge base has to be seriously extended to cope with the whole scope of analogy-making: so far, this extension has to be done manually. As we understand it, in any case, a structure has to be brought to the core knowledge before we can work, and this is not always an easy task. That is why a different viewpoint has been developed in [27], one that does not rely on any representation or structure, and which we describe in the following section.
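As an illustration of the componentwise check on Table 2, here is a small sketch (ours; the feature encoding mirrors the table):

```python
# Sketch: checking "wine:France::beer:England" componentwise over Table 2.
FEATURES = {  # alcohol, isaDrink, isaCountry, drink(beer, England), drink(wine, France)
    "wine":    (1, 1, 0, 0, 1),
    "france":  (0, 0, 1, 0, 1),
    "beer":    (1, 1, 0, 1, 0),
    "england": (0, 0, 1, 1, 0),
}

def boolean_analogy(a, b, c, d):
    return (bool(a and not b) == bool(c and not d)) and \
           (bool(not a and b) == bool(not c and d))

holds = all(boolean_analogy(a, b, c, d)
            for a, b, c, d in zip(FEATURES["wine"], FEATURES["france"],
                                  FEATURES["beer"], FEATURES["england"]))
print(holds)  # True: the proportion holds on every column, hence in its whole
```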
3. Analogies in natural language: a complexity view

''Wine is to France as beer is to England'' is a good example of what we try to capture in this section. In that case, representations (i.e. words) are only implicit and summarized in terms of information amounts: we are interested in what ''information'' is common to a and b (resp. c and d), and more importantly in what ''information'' is added/deleted when ''going from'' a to b, or from c to d. So if we are able to properly define this notion of ''information'' for words representing concepts, it could be the basis for a quantitative, information-based interpretation of analogy between concepts. This is why we turn to information theory, where at least two candidates compete:

– The theory developed by C. Shannon in 1948, whose fundamental concept is the notion of entropy: for a given string, this entropy is usually expressed as the average number of bits needed for an emitter to send the string to a receiver. This measures the quantity of information contained in the transmitted string. This notion of information is thus related to a notion of transmission, and is based on probability theory, since Shannon entropy is just an average value.
– In contrast, the theory developed in the sixties by A. Kolmogorov, also known as Kolmogorov complexity theory, is only linked to the way we can describe a given string with a Turing machine, without any reference to a notion of transmission or probability. This framework provides a kind of universal information measure without any reference to an average value.
However, there are strong mathematical links between the Shannon and Kolmogorov views, which are clearly highlighted in [23]. On top of that, Kolmogorov complexity can also be used as a powerful tool to deal with communication complexity, as has recently been shown in [18]. Starting from the now universal model of communication complexity defined in [46], the authors investigate the use of the Kolmogorov definition within lower-bound complexity proofs, targeting diverse communication protocols. It emerges that their technique leads to simpler proofs (simpler than those coming from Shannon's theory, as in [1] for instance), highlighting once again that the intuitions underlying Kolmogorov complexity and Shannon entropy are very close, despite their different theoretical settings. Nevertheless, in this paper we use the Kolmogorov setting, as it appears to be more immediately appropriate for our purpose. Indeed, we are interested in evaluating the informative content of a concept, represented by a word (e.g., 'car'), both in an absolute manner and relatively to another concept (e.g., 'engine'). These questions are reminiscent of Kolmogorov complexity theory, which handles strings instead of 'concepts'. This observation has led us to try to adapt the Kolmogorov setting to our problem.

3.1. Kolmogorov theory: brief overview

Developed in the late 1960s, Kolmogorov complexity theory aims at giving a formal meaning to the notion of 'information content'. For a given string x (in that context, a finite sequence of 0s and 1s), the Kolmogorov complexity K(x) is a numerical measure of the descriptive complexity contained in x. In this paper, we simply give some notations and intuitions that are useful to understand our work. We start from a universal Turing machine U, with an input tape containing a string y, a program tape containing a string p, and an output tape. Universal simply means that any other machine can be simulated by U: following Church's thesis, such machines exist. When we run p on U with y as input, if the machine halts, we have a finite string x on the output tape, and a finite prefix pr of p has been read. It is convenient to adopt a functional notation: U(y, pr) = x. It means that there is a way to transform y into x using pr, or any program having pr as a prefix. Another way to put things is to say that pr can reconstruct x with the help of the auxiliary data y. Then the conditional Kolmogorov complexity of x relative to y is:

Definition 2. $K(x|y) = \min\{|pr| : U(y, pr) = x\}$

where |pr| denotes the length of pr (the number of bits needed to encode it). In some sense, K(x|y) represents the shortest way to go from y to x. Then the Kolmogorov complexity of x is just:

Definition 3. $K(x) = K(x|\epsilon)$, where $\epsilon$ denotes the empty string.

Given a program p such that |p| = K(x), able to produce x on U with no auxiliary string, p can be understood as the essence of x, since we cannot recover x from a shorter program. It is thus natural to consider p as the most compressed version of x, and the size of p, K(x), as a measure of the amount of information contained in x. With this viewpoint, K(x|y) measures the amount of information we need to recover x from y. K is extended to pairs of strings simply by putting that K(x, y) is the length of the shortest program which can output the pair (x, y) and then halt. There is a huge literature on the works of Kolmogorov: a comprehensive description can be found in the book [23]. The function K enjoys a lot of remarkable properties, one of which is that it is not computable. This is obviously an issue we have to deal with when it comes to implementing a practical tool; let us postpone it to Section 3.3. For now, having a concise definition of ''information content'', we have the necessary tool to interpret an analogical proportion in the context of natural language.

3.2. Kolmogorov model for analogy in natural language

Taking inspiration from the definitions above, [27] starts with some simple ideas: we work on flat finite strings representing concepts; a, b, c and d are simple strings, and we only have access to their information content via K. Two interpretations are then considered:

1. Following our initial analysis, the common understanding of ''a:b::c:d'' is that a differs from b as c differs from d (and conversely), which leads to interpret a:b::c:d as:

$[K(a/b) = K(c/d)] \wedge [K(b/a) = K(d/c)]$   (I1)

Obviously, this definition obeys the first axiom and the symmetry postulate of an analogical proportion. But there is no way, starting from the Kolmogorov complexity properties, to infer that central permutation holds.

2. To take into account the central permutation postulate, required for a genuine interpretation of analogy, we can enforce the fact that a:c::b:d should hold as well, leading to interpret a:b::c:d as the more constrained requirement $I_1 \wedge I_2$, where I2 is:

$[K(a/c) = K(b/d)] \wedge [K(c/a) = K(d/b)]$   (I2)

Now, it is clear that our second definition satisfies the requirements to be an analogical proportion. Starting from that, it remains to check whether the formula properly captures the expected semantics, by verifying 2 points:

– any well-agreed analogy should satisfy the formula;
– a natural language construction involving 4 words which is not considered an analogy should not satisfy the formula.

In fact, when considering the first simple definition of analogical proportion, I1, we understand that a pair of words (a, b) is represented by a real-valued vector in $\mathbb{R}^2$, $\vec{v}_{a,b} = (K(a/b), K(b/a))$, and I1 can be rewritten as:
$\|\vec{v}_{a,b} - \vec{v}_{c,d}\| = 0$

Although this formulation of I1 is exactly equivalent to the one previously given, it provides a more compact notation. Using the same notation, I2, which enforces the central permutation property, can now be rewritten as:

$\|\vec{v}_{a,c} - \vec{v}_{b,d}\| = 0$

Although these notations carry exactly the same semantics as the initial definitions, they fit better with the implementation process: instead of comparing separately each component K(a/b) with K(c/d), etc., we first compute the above vector norms. Then, since there is no way to get exactly 0 for each norm, we use thresholds that have been experimentally tuned for our classification purpose (a small sketch is given below). This view is also more amenable to a testing process where we only have to rank-order 4-tuples of words in terms of their relevance to the idea of analogical proportion (in the sense captured by I1 and/or I2). For instance, when dealing with I1 only (but the same obviously applies to I2), between 2 candidates a:b::c:d and a:b::c′:d′, we consider the ''best'' one to be the one associated with $\min(\|\vec{v}_{a,b} - \vec{v}_{c,d}\|, \|\vec{v}_{a,b} - \vec{v}_{c',d'}\|)$. Before going to a practical implementation, it remains to define a protocol to estimate K(a/b) for every couple of words (a, b). This is the object of the next section.
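A minimal sketch of these two tests, assuming some estimator `K(x, y)` of the conditional complexity K(x/y) is available (Section 3.3 builds one from document frequencies); the threshold value is a placeholder:

```python
# Sketch: the I1/I2 tests as vector-norm comparisons (K is assumed given).
import math

def v(K, x, y):
    return (K(x, y), K(y, x))        # v_{x,y} = (K(x/y), K(y/x))

def I1(K, a, b, c, d):
    return math.dist(v(K, a, b), v(K, c, d))

def I2(K, a, b, c, d):
    return math.dist(v(K, a, c), v(K, b, d))

def looks_analogical(K, a, b, c, d, threshold=1.0):
    # threshold: tuned experimentally, as explained in the text
    return I1(K, a, b, c, d) <= threshold and I2(K, a, b, c, d) <= threshold
```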
3.3. Universal distribution

At this stage, there is no obvious way to estimate K(a/b) (or even K(a)) for a given couple of words a, b. As explained above, K is not computable, but it is at least upper semicomputable, i.e. it can be computably approximated from above. It has to be noticed that none of our definitions makes use of what is known as the Normalized Information Distance (nid) between 2 strings x and y:

$nid(x, y) = \frac{\max\{K(x|y), K(y|x)\}}{\max\{K(x), K(y)\}}$
In fact, it has recently been shown [39] that nid is neither upper semicomputable nor lower semicomputable. This is a reason not to use it in this paper, despite the fact that a practical approximation of nid using compression, known as the normalized compression distance (ncd) and developed in [7,21], has turned out to be very competitive when compared to other standard distances used for data mining and clustering. Back to our initial issue of finding a way to approximate K, we take our inspiration from [6], which builds on the work of Ray Solomonoff [34], whose idea was to define a kind of universal distribution over all possible objects, to overcome the problem of the unknown prior distribution within Bayes' formula. For a given string a, i.e. a finite sequence of 0s and 1s, his idea was to consider $2^{-K(a)}$ as the unknown a priori probability of a when nothing else is known. Roughly speaking, this formula tells us that the more complex the string a, the more unlikely it is: simple strings are more likely than complex ones. In fact, in order to define a true probability measure, this definition needs some technical refinements which are out of the scope of this paper (we have to restrict the type of authorized programs to the so-called ''reduced programs'' [20,22,4]). With this in mind, the application $a \mapsto 2^{-K(a)}$ becomes a probability distribution over the set of finite strings $\{0, 1\}^*$. From our point of view, we can understand this number as the probability for a to appear (i.e., in that case, to be produced by a Turing machine). As it is quite clear that the negative base-2 logarithm of $2^{-K(a)}$ is just K(a), any process generating
strings whose mass distribution is known can be used as a Kolmogorov complexity estimator: if p(a) is the probability for a to be generated by the process, then an estimation of K(a) is just $-\log_2(p(a))$. As explained previously, we get an upper bound of the complexity, since $K(a) \leq -\log_2(p(a))$; in other words, a mass distribution will allocate a complexity greater than the real one, and we do not know the range of the error. At this stage of our work, this is not an important issue. It remains to find a process generating a known mass distribution over words, relevant for our purpose. This is a simple problem since, as we will see, any corpus of words equipped with a suitable querying engine will do the job.

4. Putting the complexity view in practice

In this section, we investigate two types of corpora, and we show how our diverse experiments validate (at least partially) the ideas described above, using the same kind of strategy as in [6,27], which we recall below.

4.1. Probability distribution generator

Since our words (or strings) are just syntactic representations of concepts, it is relevant to deal with a text corpus or database where these words get their meaning. When we are looking for a word a, querying the database to get the number n of pages where a appears at least once, then dividing this number by the total number M of pages in the corpus, we get the frequency p = n/M of this word in our database (more precisely, the frequency of the pages containing at least one occurrence of the word). Considering this frequency as a probability and applying the negative base-2 logarithm, we get $-\log_2(p)$ as an estimation of the Kolmogorov complexity of a. This is in accordance with the intuitive idea that, if a word a is rarely used (low probability), the underlying concept is relatively complex, thus justifying the high complexity measure. But we have to deal with K(a/b) as well: this is naturally estimated via
$K(a/b) \approx -\log_2 p(a|b) = -\log_2 \frac{p(a, b)}{p(b)} = \log_2 p(b) - \log_2 p(a, b)$

where p(a, b) is the proportion of pages containing both words a and b. A minimal sketch of this estimation protocol is given below.
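In this sketch, `hits(...)` is a hypothetical function standing for a corpus query (number of pages containing all the given words), and `M` a hypothetical corpus size; neither name comes from the paper.

```python
# Sketch: estimating K(a) and K(a/b) from page counts (hits() is hypothetical).
import math

M = 50_000_000  # hypothetical total number of pages in the corpus

def hits(*words) -> int:
    raise NotImplementedError("query the corpus / search engine here")

def K(a: str) -> float:
    return -math.log2(hits(a) / M)                    # K(a) ~ -log2 p(a)

def K_cond(a: str, b: str) -> float:
    # K(a/b) ~ -log2 p(a|b) = log2 p(b) - log2 p(a, b)
    return math.log2(hits(b) / M) - math.log2(hits(a, b) / M)
```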
Let us investigate the expected behaviour of our formulas, starting with the definition of I1. From an implementation viewpoint, we expect the sum $K(a/b) - K(c/d) + K(b/a) - K(d/c)$ to be close to 0 in order to classify a:b::c:d as an analogical proportion. This leads us to compute:

$\log_2 p(a) + \log_2 p(b) - \log_2 p(c) - \log_2 p(d) - 2\log_2 p(a, b) + 2\log_2 p(c, d)$

which is equal to

$\log_2 \frac{p(a)\,p(b)\,p(c, d)^2}{p(c)\,p(d)\,p(a, b)^2}$

This number is close to 0 when

$N_1 = \frac{p(a)\,p(b)\,p(c, d)^2}{p(c)\,p(d)\,p(a, b)^2}$

is close to 1.
Among the numerous options to get this result in our context, one is to have both p(a) and p(b) close to p(a, b), and both p(c) and p(d) close to p(c, d). In terms of probability, this means a and b are not independent. Indeed, p(a) close to p(a, b) implies 2 facts: (1) K(b/a), estimated via $-\log_2 p(b/a)$, is close to 0, i.e. it is easy to get b from a. (2) The probability of a page containing a but not b is close to 0.
In our context, the probability of a page containing snow without containing flake (and vice versa) is quite low. That is why proportions like ''drop is to rain as flake is to snow'' are easily classified as analogies by our system. Let us carry on with definition I2, which is a new condition added to I1 to enforce the central permutation property. This new condition is similar to I1 where b and c have been permuted. So a computation as above, for the compound condition $I_1 \wedge I_2$, leads to consider the number

$N_2 = \frac{p(a)\,p(c, d)\,p(b, d)}{p(d)\,p(a, b)\,p(a, c)}$

which has to be close to 1.
The raw probabilities p(b) and p(c) disappear in the final formula, confirming the fact that they can be permuted in the proportion, which was not the case for the previous number N1. We now have everything we need for our experiments.

4.2. First experiments

Our experiments have been done with two corpora:

– In [27], we used the World Wide Web, taking advantage of Google as an effective querying engine. Our program is written in Javascript enhanced with the Google API (AJAX programming).
– In this paper, we use a TREC database (GOV) containing one year of US government proceedings (http://www.nist.gov/srd/nistsd22.htm) and the usual programs to submit queries.

From http://www.teachersdesk.org/vocabanal.html, we got a list of 50 well-agreed analogies. To build up our negative examples, we proceed in two ways:

– Method 1: Starting from an analogy a:b::c:d of the previous list, we build a:b::d:c as a negative example, switching the last two items of the proportion, since it has been seen in a previous section that, when a:b::c:d holds, a:b::d:c does not hold in general (except if c and d are synonyms, and thus a and b too).
– Method 2: Starting from an analogy a:b::c:d, we randomly choose a word d′ without any link to a, b, c, d, and we build a:b::c:d′ as a negative example.

At the end of this process, we get a testing set of 150 elements altogether. We summarize our results in terms of confusion matrices. Let us first recall the matrix (Table 3) obtained in [27] when the Web is used as text corpus, Google as querying engine, and $I_1 \wedge I_2$ as the definition of analogy.

Table 3
Results with $I_1 \wedge I_2$ and the Web.

     Positive examples   Negative examples
+    38                  25
-    12                  75

This leads to an accuracy rate of $\frac{38+75}{150} \times 100\% \approx 75\%$. This result was improved up to 80% thanks to a careful examination of the failures, where it appeared that polysemic words (for instance ''view'') generate a lot of errors: we replaced these polysemic examples with non-polysemic ones, which is possible due to the large size of the underlying corpus (the Web). So we kept the size of our testing set but modified its content. In the case of the GOV database, we were faced with the fact that some words appearing in the initial list of examples do not appear in the GOV corpus, or appear with a very low frequency (this is a government proceedings corpus where, for instance, ''mice'' does not appear, so we cannot deal with ''mouse is to mice as woman is to women''). When a word appears with a very low frequency, it means we are close to 0 in terms of
probability, and the estimated complexity becomes a very big number which cannot be handled in a concise way with a standard programming language. So we were led to reduce the size of our testing set to 70 (28 positive and 42 negative examples); using only I1 as the definition of analogy, we get the results of Table 4. This leads to an accuracy rate of $\frac{24+18}{70} \times 100\% = 60\%$ (this is mainly due to bad results on negative examples obtained by method 1). With $I_1 \wedge I_2$, the confusion matrix becomes Table 5: we obtain an accuracy rate of $\frac{24+23}{70} \times 100\% \approx 67\%$, without any other preliminary tests. Nothing changes for the positive examples, but a slight improvement appears on the negative examples. The errors are apparently more numerous with negative examples generated by method 1 (reversing the ordering of two words in a genuine analogy). Finally, as such, the method is not better than the results we got in [27] with the original examples, but our testing set is a bit small.

4.3. A new option: introducing negation

The above analysis leads us to consider a new definition of analogical proportion. The first definition I1 using Kolmogorov complexity was not sufficient to ensure the central permutation property expected from an analogical proportion. That is why we added to I1 (implemented via the test $N_1 \approx 1$) a second condition I2 to enforce central permutation (implemented via the test $N_2 \approx 1$). But, as seen in Section 2.3, there is another property, related to the negation operator: a:b::c:d → c:¬b::d:¬a. Obviously, this property has no translation in the Kolmogorov framework, just because it does not make sense to define K(c/¬b), for instance. Nevertheless, in terms of frequencies, it makes sense to consider the frequency of pages containing at least one occurrence of c among the pages which do not contain a single occurrence of b, leading to an estimation of p(c/¬b). From a practical viewpoint, the term ¬b refers to the pages which do not contain an occurrence of b: instead of taking into account the presence of a word c in the context where a word b appears (i.e. p(c/b)), we consider the presence of the word c in the context where the word b does not appear, i.e. p(c/¬b). Obviously this gives some information about the existing link between c and b, which could be helpful in testing an analogical proportion involving c and b. Since these conditional probabilities have no direct interpretation within the Kolmogorov framework, there is no need at this stage to go for a log function. This is why we express with probabilities the new constraint I3 enforcing the property above:

$(p(c/\neg b) = p(d/\neg a)) \wedge (p(b/\neg c) = p(a/\neg d))$   (I3)

When adding I3 on the same test set (so using $I_1 \wedge I_2 \wedge I_3$ as the definition of analogy), the confusion matrix becomes Table 6: we obtain an accuracy rate of $\frac{24+30}{70} \times 100\% \approx 77\%$, which is comparable to (slightly better than) the result obtained with the
Table 4
Results with I1 and GOV corpus.

     Positive examples   Negative examples
+    24                  24
-    4                   18

Table 5
Results with $I_1 \wedge I_2$ and GOV corpus.

     Positive examples   Negative examples
+    24                  19
-    4                   23
Table 6
Results with $I_1 \wedge I_2 \wedge I_3$ and GOV corpus.

     Positive examples   Negative examples
+    24                  12
-    4                   30
Web. We still observe that we generally fail to correctly classify analogies containing polysemic words, or common nouns having a name as homonym. Of course, there is room for improvement: for instance, for the sake of symmetry, it could be interesting to add the following constraint I4 to I3:
$(p(a/\neg b) = p(c/\neg d)) \wedge (p(b/\neg a) = p(d/\neg c))$   (I4)

but at this stage, we think a larger test set is necessary before going for more accurate definitions. Before leaving this section, let us come back to the vector notation previously introduced. If we now denote $\vec{u}_{a,b} = (p(a/\neg b), p(b/\neg a))$, then I3 is just

$\|\vec{u}_{c,b} - \vec{u}_{d,a}\| = 0$

and I4 is expressed via the new constraint

$\|\vec{u}_{a,b} - \vec{u}_{c,d}\| = 0$

A sketch of the corresponding estimations is given below.
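This sketch relies on the same assumptions as before (a hypothetical `hits()` page-count function and corpus size `M`):

```python
# Sketch: negated conditionals and the I3/I4 vector tests.
import math

def p_neg(c: str, b: str, hits, M: int) -> float:
    # p(c/¬b): pages containing c among the pages not containing b
    return (hits(c) - hits(c, b)) / (M - hits(b))

def u(x: str, y: str, hits, M: int):
    return (p_neg(x, y, hits, M), p_neg(y, x, hits, M))   # u_{x,y}

def I3(a, b, c, d, hits, M):
    return math.dist(u(c, b, hits, M), u(d, a, hits, M))

def I4(a, b, c, d, hits, M):
    return math.dist(u(a, b, hits, M), u(c, d, hits, M))
```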
5. Turney's evaluation of the similarity between pairs of words

Among the works related to analogy, the approach of Turney [42,41] is probably one of the closest to this paper. Turney has investigated diverse techniques, from the vector space model (VSM) to the more recent latent relational analysis [41,40], all of them starting from word frequency counts. It appears that these techniques can be used to solve multiple-choice analogy questions coming from the former Scholastic Assessment Test, or SAT test (a college entrance test in the US) [8]. The primary aim was not to deal with analogies as such, but to classify analogous word pairs. The core notion is similarity, and Turney distinguishes between two kinds of similarity: attributional similarity, referring to a correspondence between attributes (e.g. ''X is red'', ''X has wheels''), and relational similarity, a correspondence between relations (e.g. ''X is greater than Y'', ''X is made of Y''):

– when there is a high degree of attributional similarity between 2 words a and b, they are synonyms;
– when there is a high degree of relational similarity between 2 pairs of words (a, b) and (c, d), they constitute an analogy a:b::c:d.
In the VSM approach, a pair of words (a, b) is represented as a multi-dimensional real vector $\vec{v}_{a,b}$. The measure of the strength of the ''analogical link'' between 2 pairs of words (a, b) and (c, d) is the cosine of the 2 associated vectors $\vec{v}_{a,b}$ and $\vec{v}_{c,d}$: the closer the cosine to 1, the stronger the link. The way to translate a pair of words into a real-valued vector is to choose a list of joining terms, such as ''for'' or ''to'', and to build up phrases like ''a for b'', ''a to b'', etc. Turney chose a list of 64 joining terms JT and built up all 128 combinations ''a JT b'' and ''b JT a'' (let us call them joining-term patterns), getting 128 short sentences. Using AltaVista as a querying engine for the Web and looking for the instantiated joining-term patterns, the corresponding component in $\vec{v}_{a,b}$ is log(x + 1) (5)
5 It is worth noticing that not only Turney, but other authors as well [31,24,32], consider log-based measures more effective than raw values.
where x is the number of hits (i.e. the number of documents matching the query) for the given pattern. This approach is quite successful in ordering analogical proportions according to human understanding: it reaches a score of 47% of successful guesses (corresponding to the best way of completing a pair of words with another pair, taken among five possible choices, in order to get an analogy), far higher than random guessing at 20% (since 5 choices (c, d) are given for each pair of words (a, b)). The author extended this approach to the more sophisticated latent relational analysis, where joining-term patterns are derived from the corpus at hand. Starting from a pair (a, b), we search an auxiliary thesaurus for all the synonyms of a and b. Then we build alternate pairs of words (a′, b′) with all these synonyms: in some sense, the a:b::a′:b′ are near analogical proportions. Then we search the target corpus for all short sentences starting with a and ending with b: these phrases are supposed to capture the relations between the words in each pair. Replacing the words by wild cards, we get patterns similar to the previous joining-term patterns. For instance, ''the mason cut the stone with'' will generate a pattern ''X * the stone *''. Obviously, a typical pair of words will generate a huge number of patterns, and this number is reduced via a feature selection technique. From now on, the joining terms are not predefined: they are discovered in the corpus. It has to be noticed that this way to proceed is not very far from our predicate-based suggestion in Section 2.4. The obvious advantage of Turney's method is its flexibility (at the price of complexity!). Then a pair-pattern frequency matrix is built, in which each cell represents the number of times the corresponding pair (row) appears in the corpus with the corresponding pattern (column). Obviously, we get a sparse matrix, which is then compressed using the Singular Value Decomposition method. At the end of this process, each pair of words (a, b) is still represented as a real-valued vector $\vec{v}_{a,b}$, i.e. the corresponding row in the matrix. The process of computing the relational similarity between 2 pairs of words (a, b) and (c, d) is as follows: we compute the cosine cos(a, b, c, d) of the 2 original row vectors $\vec{v}_{a,b}$ and $\vec{v}_{c,d}$. Then we compute all the cosines cos(a′, b′, c′, d′) corresponding to alternate synonym pairs (a′, b′) and (c′, d′): the final relational similarity between (a, b) and (c, d) is just the average of all the cosines cos(a′, b′, c′, d′) greater than or equal to the initial cosine cos(a, b, c, d). This way to proceed ensures a score of 56.8% (instead of 47.7%) on the SAT corpus.
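For illustration, here is a toy sketch of the VSM scoring described above; the joining terms shown are a placeholder subset, `hits_phrase` is a hypothetical phrase-count function, and the base-10 logarithm is one possible reading of ''log'':

```python
# Toy sketch of Turney's VSM pair representation and cosine scoring.
import math

JOINING_TERMS = ["of", "for", "to", "in", "with"]  # Turney used 64 such terms

def hits_phrase(phrase: str) -> int:
    raise NotImplementedError("count documents matching the exact phrase")

def pair_vector(a: str, b: str):
    # one log(x + 1) component per instantiated pattern "a JT b" / "b JT a"
    return [math.log10(hits_phrase(f"{x} {jt} {y}") + 1)
            for jt in JOINING_TERMS for (x, y) in ((a, b), (b, a))]

def cosine(u, v) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# relational similarity of (a, b) and (c, d):
#   cosine(pair_vector(a, b), pair_vector(c, d))
```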
Starting from these works, the author proposes in [41] a uniform approach for analogies, synonyms, antonyms and word associations, where analogical proportion is the core concept from which the other ones can be derived. Two words (a, b) are antonyms as soon as they make an analogical proportion with the pair (black, white), i.e., a:b::black:white holds. Two words (a, b) are synonyms as soon as the proportion a:b::levied:imposed holds. Finally, they are associated when a:b::doctor:hospital holds. This is a rather original view, and it obviously avoids building ad hoc algorithms to deal with these diverse semantic phenomena: analogical proportion is a way to subsume all these concepts. Analogy detection is then considered as classifying analogous word pairs, i.e. the classification of semantic relations between words, the basic hypothesis being that lexical knowledge is relational, not attributional (for instance, the knowledge in WordNet comes from the graph structure).
6. Further experiments Let us go back to our approach and let us investigate the implementation process. Back to our vectorial notation where ! ! v a;b ¼ ðKða=bÞ; Kðb=aÞÞ and ua;b ¼ ðpða=:bÞ; pðb=:aÞÞ, our 4 constraints get the following forms6: !
!
k v a;b v c;d k ¼ 0 ðI1 Þ ! ! k a;c b;d k ¼ 0 ! ! k uc;b ud;a k ¼ 0 ! ! k ua;b uc;d k ¼ 0
v
v
ðI2 Þ
ðI4 Þ
mason:stone::teacher:chalk mason:stone::carpenter:wood mason:stone::soldier:gun mason:stone::photograph:camera mason:stone::book:word.
From a human understanding, mason:stone::carpenter:wood is the ‘‘best’’ analogy, and in the former scholastic assessment test or SAT test (college entrance test in the US), (carpenter:wood) is considered as the best match to build up an analogical proportion with (mason:stone). In our case, we have now to order the norms in increasing order, the better match corresponding to the smaller norm. In fact, to have a fair ordering process, we need to normalize our numbers. So let us consider the following definitions: 1. Inormal ¼ 1 2. Inormal ¼ 2 3.
Inormal 3
¼
4.
Inormal 4
¼
I1
!
!
!
!
!
!
!
!
maxðk v a;b k;k v c;s kÞ I2 maxðk v a;c k;k v b;d kÞ I3 maxðk uc;b k;k ud;a kÞ I4 maxðk ua;b k;k uc;d kÞ
We immediately understand that we have a lot of options at this stage that we cannot discriminate without a deeper investigation. Let us consider these different options, each of them having its own logic: We may simply use Inormal and in that case, the best analogy will 1 be the one with the smallest value for Inormal . 1 We may also order the possible solutions according to the increasing value of maxðInormal ; Inormal Þ (thus requiring to have 1 2 normal normal both I1 and I2 as small as possible). Or leaving a pure Kolmogorov framework, we may use max ðInormal ; Inormal Þ. 3 4 Or using all the indexes together by using maxðInormal ; 1 Inormal ; Inormal ; Inormal Þ. 2 3 4 Another way to take advantage of all the four elementary indexes Ik is to take into account their collective propensity to privilege some pair of words as being the best solution. This leads us to test a majority rule (where Inormal is used in case of ties). 1 We have used a set of 147 pairs of words coming from SAT [8] that are to be completed with another pair of words to be selected among 5 options. Each problem is thus built up with a stem and 6
Rules Inormal 1
max Inormal ; Inormal 1 2 max Inormal ; Inormal 3 4 max Inormal ; Inormal ; Inormal ; Inormal 1 2 3 4 Majority rule
Accuracy rate (%) 25 26 30 29 32
ðI3 Þ
But now, if instead of classifying a proportion a:b::c:d, we follow the Turney’s approach which is simply to rank-order pairs of words (c:d) in terms of their relevance with regard to another pair (a:b) (the stem), then we do not rely on thresholds anymore. For instance, starting from the stem (mason, stone), we could have to order: – – – – –
Table 7 Results with a testing set of 147 examples coming from SAT.
For the sake of simplicity, I1, I2, I3, I4 will now relate to the left-hand side of the ! ! equalities (i.e. I1 ¼ k v a;b v c;d k for instance).
five pairs of words to rank-order. We use our Web implementation using Google search engine. In Table 7, we provide the results (in terms of accuracy rate) we got by running all these options. Remember that a pure random choice will give an average accuracy of 20% (one option to be chosen among 5). Our table highlights the fact that Kolmogorov complexity is slightly better that a pure random choice but does not perform very well when it comes to order different options to build up an analogical proportion. In order to get a better result, we need to combine the diverse formulas. To conclude this section, we have to notice that our representation is similar to the Turney’s one in the sense that each pair of words (a, b) is represented as a real valued vector, but in our case: We have a maximum of 8 components per vector, that is to say half of the dimension needed by Turney.
The semantics of the components is completely different and is related to a complexity view. In terms of practical implementation, we have to highlight several points where we really differ from Turney’s works: We use only the Web as a text corpus without any pre-processing of the initial feature: for instance, there is no search for morphological variations (i.e. plural, gender, etc.) as it is the case for Turney. When we are looking for a pair of words (a, b) using Google, there is no fixed window to search for the words. They are taken into account as soon as they occur on a web page, they may be separated by a huge number of words. In some sense, we leave Google to do the job without any extra constraints. We are still very sensitive to polysemic words, especially those ones having a homonym which is a name. For instance, if you look for ‘‘best’’ with Google, you will get a lot of companies or associations around the world having ‘‘BEST’’ as name (example: Board of European Students of Technology). This will artificially increase the frequency of ‘‘best’’ then decrease its Kolmogorov complexity related to any other word. Ultimately, we cannot accurately classify ‘‘winner:best
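A sketch of the ranking protocol just described; the normalized indexes `i1n`..`i4n` are assumed to be implemented as above, candidates are (c, d) pairs, and the tie-breaking detail follows our reading of the text:

```python
# Sketch: ranking SAT-style candidates, with the majority rule and I1-ties.
from collections import Counter

def best_candidate(stem, candidates, index):
    a, b = stem
    return min(candidates, key=lambda cd: index(a, b, cd[0], cd[1]))

def majority_rule(stem, candidates, indexes):
    # indexes: e.g. [i1n, i2n, i3n, i4n]; each index votes for its best candidate
    votes = Counter(best_candidate(stem, candidates, ix) for ix in indexes)
    (top, n), *rest = votes.most_common()
    if rest and rest[0][1] == n:                             # tie between candidates:
        return best_candidate(stem, candidates, indexes[0])  # fall back on I1^normal
    return top
```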
7. Related works

For a general overview of works on analogy, the reader may refer to [36,37,25,2]. Moreover, in [33], formal proportions of structured objects have been investigated, going through second-order logic substitutions: this approach allows its authors to capture high-level mappings between highly structured universes. [37] might be viewed as a particular case of it, and as a practical approach to build an analogical-proportion-based learning engine. Nevertheless, apart from the previously cited [35], these works do not really provide a framework to deal with natural language analogies. The use of analogy in cognition, and especially in learning, has been rigorously identified and widely discussed, e.g. [15]. But, as far as we know, Cornuéjols [9] was the first to establish a link between analogical reasoning and Kolmogorov complexity, leaving the rather strict logical framework coming from [10] or, more recently, [30]. The author advocates a kind of ''simplicity principle'' to serve as a starting point for modeling analogy via Kolmogorov theory, considered as a mathematical formulation of Occam's razor. This approach is completely in line with the works in [5], in which ''choose the pattern that provides the briefest representation of the available information'' is the rational basis of a wide range of cognitive processes, including analogical reasoning. On the other hand, there are a number of works in diverse fields (linguistics, cognitive sciences) using word frequencies to design various similarity/dissimilarity measures, e.g. [38]. However, despite some obvious relations, our approach is quite different: the starting point is Kolmogorov complexity and its theoretical developments. We go through Google frequencies to estimate this complexity, but any alternative method able to provide an upper bound of this complexity would be suitable. For instance, in diverse other fields (spam filtering, network intrusion detection), when the string at hand is sufficiently long, complexity is estimated via compression; in our case, it simply does not make sense to compress ''mason'', ''stone'', etc. Moreover, it has been shown in [27] that a definition of analogy based on the Jaccard index provides much less satisfactory results in terms of analogy classification. This suggests that the information-theoretic model coming from Kolmogorov theory could be an accurate definition not only for strings, but also for concepts represented as words. Analogy and metaphors play an important role in linguistic creativity. In that perspective, different from ours, let us mention the system Palimpsest (designed by [44]), which builds analogies or metaphors starting from the WordNet ontology. This system is able to overcome the rigid framework of WordNet by dynamically creating new concepts, implementing what is known as a dynamic type hierarchy (as first advocated in [45] and developed in [43]). Veale's approach is thus able to capture metaphors which are not given in the ontology. Nevertheless, considering the metaphoric joke due to Sigmund Freud, ''A wife is like an umbrella. Sometimes one takes a cab'', cited by [44], it may be amusing to notice that our system is able to deal with the pair ''wife:umbrella'' as well.
8. Conclusion

In this paper, we have established a parallel between a purely abstract view of analogical proportion, suitable for structured
domains, and a more practical one, dedicated to the handling of natural language analogies. Based on Kolmogorov complexity as a concise definition of the notion of ''information content'', we provide diverse formulas that may be suitable for classifying analogical proportions in natural language. Using the relationship between Kolmogorov complexity and the universal distribution, we get a method to estimate this complexity, and then a way to implement a practical tool to check our definitions. Using frequency as an approximation of probability, our definitions are implemented via the computation of diverse numbers, using a structured database or the Web as text corpus. A careful analysis of our implementation formulas leads us to take into account another kind of information, which cannot be described via Kolmogorov theory, but which makes sense in practice. When used for classification purposes, this last definition leads to results which are slightly better than those previously obtained with the Web as target corpus. It also appears that the proposed approach performs better when we have to identify analogies in a collection containing both analogies and 4-tuples of words containing an intruder, than when we have to rank-order potential solutions. Indeed, the four numbers I1, I2, I3 and I4, despite their theoretical basis, provide only a rather rough view of the comparison of the relations between two words, and are insufficient to accurately capture the differences between word pairs that may seem somewhat acceptable, at first glance, for constituting analogies with a given pair of words. Obviously, it remains to check our ideas on a larger scale, in at least two directions:

– On the formal side, in order to get a more accurate classifier, to transfer into our definitions other logical properties expected from an analogical proportion but not expressible within the Kolmogorov framework (like p(a/¬b) = p(c/¬d) and p(b/¬a) = p(d/¬c)).
– On the practical side, to work with other structured databases, more general than the TREC GOV corpus and having a larger diversity of words, allowing us to investigate a larger set of examples.

When properly developed, our definitions may provide the ability to identify analogies between concepts, starting only from natural language raw text, without the need for any pre-processing or external lexicon.

Acknowledgements

The authors are indebted to Mohand Boughanem and Cécile Laffaire for providing them with the opportunity to access the GOV corpus.

References

[1] Z. Bar-Yossef, T.S. Jayram, R. Kumar, D. Sivakumar, An information statistics approach to data stream and communication complexity, Journal of Computer and System Sciences 68 (2004) 702–732. Special Issue on FOCS 2002.
[2] N. Barbot, L. Miclet, La proportion analogique dans les groupes: applications aux permutations et aux matrices, Technical Report 1914, IRISA, 2009.
[3] M. Bayoudh, H. Prade, G. Richard, A Kolmogorov complexity view of analogy: from logical modeling to experimentations, in: Research and Development in Intelligent Systems XXVII incorporating Applications and Innovations in Intelligent Systems XVIII, Proceedings 13th SGAI International Conference (AI-2010), Springer Verlag, Cambridge, UK, 14–16 December 2010.
[4] C. Bennett, P. Gacs, M. Li, P. Vitányi, W. Zurek, Information distance, IEEE Transactions on Information Theory 44 (1998) 1407–1423.
[5] N. Chater, The search for simplicity: a fundamental cognitive principle?, The Quarterly Journal of Experimental Psychology 52 (2) (1999) 273–302.
[6] R. Cilibrasi, P. Vitányi, Automatic meaning discovery using Google, Manuscript, CWI, 2004.
[7] R. Cilibrasi, P. Vitányi, Clustering by compression, IEEE Transactions on Information Theory 51 (2005).
[8] C. Claman, 10 Real SATs, The College Board, 1997.
[9] A. Cornuéjols, Analogie, principe d'économie et complexité algorithmique [Analogy, the economy principle and algorithmic complexity], in: Actes des 11èmes Journées Françaises de l'Apprentissage, Sète, France, 1996.
[10] T.R. Davies, S.J. Russell, A logical approach to reasoning by analogy, in: Proceedings IJCAI'87, Morgan Kaufmann, 1987, pp. 264–270.
[11] B. Falkenhainer, K.D. Forbus, D. Gentner, The structure-mapping engine: algorithm and examples, Artificial Intelligence 41 (1989) 1–63.
[12] K. Forbus, A. Lovett, K. Lockwood, J. Wetzel, C. Matuk, B. Jee, J. Usher, CogSketch, in: AAAI'08: Proceedings of the 23rd National Conference on Artificial Intelligence, AAAI Press, 2008, pp. 1878–1879.
[13] D. Gentner, Structure-mapping: a theoretical framework for analogy, Cognitive Science 7 (1983) 155–170.
[14] D. Gentner, The mechanisms of analogical learning, in: Similarity and Analogical Reasoning, Cambridge University Press, 1989, pp. 197–241.
[15] D. Gentner, K.J. Holyoak, B. Kokinov (Eds.), The Analogical Mind: Perspectives from Cognitive Science, MIT Press, 2001.
[16] A.K. Goel, Design, analogy and creativity, IEEE Expert 12 (1997) 62–70.
[17] J. Guan, Y. Gan, H. Wang, Discovering pattern-based subspace clusters by pattern tree, Knowledge-Based Systems 22 (2009) 569–579.
[18] M. Kaplan, S. Laplante, Kolmogorov complexity and combinatorial methods in communication complexity, Theoretical Computer Science 412 (2011) 2524–2535.
[19] S. Klein, Culture, mysticism & social structure and the calculation of behavior, in: Proceedings 5th European Conference on Artificial Intelligence (ECAI'82), Paris-Orsay, 1982, pp. 141–146.
[20] A.N. Kolmogorov, Three approaches to the quantitative definition of information, Problems of Information Transmission 1 (1965) 1–7.
[21] M. Li, X. Chen, X. Li, B. Ma, P. Vitányi, The similarity metric, IEEE Transactions on Information Theory 50 (12) (2004) 3250–3264.
[22] M. Li, P. Vitányi, An Introduction to Kolmogorov Complexity and Its Applications, second ed., Springer-Verlag, 1997. ISBN 0-387-94053-7.
[23] M. Li, P. Vitányi, An Introduction to Kolmogorov Complexity and Its Applications, third ed., Springer Verlag, 2008.
[24] D. Lin, An information-theoretic definition of similarity, in: ICML '98: Proceedings of the Fifteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1998, pp. 296–304.
[25] L. Miclet, A. Delhay, Relation d'analogie et distance sur un alphabet défini par des traits [Analogy relation and distance over an alphabet defined by features], Technical Report 1632, IRISA, 2004.
[26] L. Miclet, H. Prade, Handling analogical proportions in classical logic and fuzzy logics settings, in: Proceedings 10th ECSQARU, LNCS, vol. 5590, Springer, Verona, 2009, pp. 638–650.
[27] H. Prade, G. Richard, Testing analogical proportions with Google using Kolmogorov information theory, in: Proceedings 22nd International FLAIRS Conference, AAAI Press, Fort Myers, USA, 2009, pp. 272–277.
[28] H. Prade, G. Richard, Analogical proportions: another logical view, in: M. Bramer, R. Ellis, M. Petridis (Eds.), Research and Development in Intelligent Systems XXVI, Proceedings 29th Annual International Conference on AI (SGAI'09), Springer, Cambridge, UK, December 2009, pp. 121–134.
[29] H. Prade, G. Richard, Nonmonotonic reasoning – from cataloguing to analogizing, in: Proceedings International Conference 30 Years of Nonmonotonic Reasoning (Nonmon@30), Lexington, Oct. 22–25, 2010b.
[30] H. Prade, G. Richard, Reasoning with logical proportions, in: Proceedings International Conference on Principles of Knowledge Representation and Reasoning (KR'10), Toronto, Canada, 2010c, pp. 546–555.
[31] P. Resnik, Semantic similarity in a taxonomy: an information-based measure and its application to problems of ambiguity in natural language, Journal of Artificial Intelligence Research 11 (1999) 95–130.
[32] G. Ruge, Experiment on linguistically-based term associations, Information Processing & Management 28 (1992) 317–332.
[33] U. Schmid, H. Gust, K. Kühnberger, J. Burghardt, An algebraic framework for solving proportional and predictive analogies, in: Proceedings European Conference on Cognitive Science, 2003, pp. 295–300.
[34] R.J. Solomonoff, A formal theory of inductive inference, Information and Control 7 (1) (1964).
[35] J.F. Sowa, A.K. Majumdar, Analogical reasoning, in: Proceedings International Conference on Conceptual Structures, LNAI 2746, Springer-Verlag, Dresden, 2003, pp. 16–36.
[36] N. Stroppa, F. Yvon, An analogical learner for morphological analysis, in: Proceedings 9th Conference on Computational Natural Language Learning (CoNLL-2005), 2005a, pp. 120–127.
[37] N. Stroppa, F. Yvon, Analogical learning and formal proportions: definitions and methodological issues, Technical Report, ENST Paris, 2005b.
[38] E. Terra, C.L.A. Clarke, Frequency estimates for statistical word similarity measures, in: Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL 2003), 2003, pp. 244–251.
[39] S.A. Terwijn, L. Torenvliet, P.M.B. Vitányi, Nonapproximability of the normalized information distance, Journal of Computer and System Sciences 77 (2011) 738–742.
[40] P.D. Turney, The latent relation mapping engine: algorithm and experiments, Journal of Artificial Intelligence Research 33 (2008) 615–655.
[41] P.D. Turney, A uniform approach to analogies, synonyms, antonyms, and associations, in: COLING '08: Proceedings of the 22nd International Conference on Computational Linguistics, Association for Computational Linguistics, Morristown, NJ, USA, 2008b, pp. 905–912.
[42] P.D. Turney, M.L. Littman, Corpus-based learning of analogies and semantic relations, Machine Learning 60 (2005) 251–278.
[43] T. Veale, Dynamic type creation in metaphor interpretation and analogical reasoning: a case-study with WordNet, in: Proceedings ICCS 2003, International Conference on Conceptual Structures, 2003.
[44] T. Veale, An analogy-oriented type hierarchy for linguistic creativity, Knowledge-Based Systems 19 (2006) 471–479.
[45] E.C. Way, Knowledge Representation and Metaphor (Studies in Cognitive Systems), Kluwer Academic Publishers, Amsterdam, 1991.
[46] A.C.-C. Yao, Some complexity questions related to distributive computing (preliminary report), in: Proceedings of the Eleventh Annual ACM Symposium on Theory of Computing (STOC '79), ACM, New York, NY, USA, 1979, pp. 209–213.