Expert Systems with Applications 36 (2009) 1860–1875. www.elsevier.com/locate/eswa
Learning semantic relatedness from term discrimination information
D. Cai*, C.J. van Rijsbergen
Department of Computing Science, University of Glasgow, Glasgow G12 8RZ, UK
Abstract

Formalization and quantification of the intuitive notion of relatedness between terms has long been a major challenge for computing science, and an intriguing problem for other sciences. In this study, we meet the challenge by considering a general notion of relatedness between terms and a given topic. We introduce a formal definition of a relatedness measure based on term discrimination measures. Measurement of discrimination information (MDI) of terms is a fundamental issue for many areas of science. In this study, we focus on MDI, and present an in-depth investigation into the concept of discrimination information conveyed in a term. Information radius is an information measure relevant to a wide variety of applications and is the basis of this investigation. In particular, we formally interpret discrimination measures in terms of a simple but important property identified by this study, and argue the interpretation is essential for guiding their application. The discrimination measures can then naturally and conveniently be utilized to formalize and quantify the relatedness between terms and a given topic. Some key points about the information radius, discrimination measures and relatedness measures are also made. An example is given to demonstrate how the relatedness measures can deal with some basic concepts of applications in the context of text information retrieval (IR). We summarize important features of, and differences between, the information radius and two other information measures, from a practical perspective. The aim of this study is part of an attempt to establish a theoretical framework, with MDI at its core, towards effective estimation of semantic relatedness between terms. Due to its generality, our method can be expected to be a useful tool with a wide range of application areas. © 2008 Elsevier Ltd. All rights reserved.
Keywords: Measurement of discrimination information; Learning of semantic relations; Relatedness measures; Informative terms; Good discriminators; Information retrieval; Query expansion
* Corresponding author. E-mail addresses: [email protected] (D. Cai), [email protected] (C.J. van Rijsbergen).
0957-4174/$ - see front matter © 2008 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2007.12.072

1. Introduction

Measurement of discrimination information (MDI) of terms is a fundamental issue for many areas of science. It has a wide spectrum of applications including computational linguistics, information retrieval, text annotation, natural language processing, machine learning and translation, knowledge representation and extraction, bioinformatics and chemoinformatics.

A concept is a unit of thought, formed by mentally combining some or all of the characteristics of a concrete or abstract, real or imaginary object. Concepts exist in the mind as abstract entities independent of the terms used to express them. A term is one or more words used to express a concept (ANSI/NISO Z39.19-2005, 2005).

Discrimination information of a term, as used in this paper, refers to the amount of information conveyed by a term in support of a certain category of documents relevant to a specific topic of interest and rejecting other categories. An informative term, also called a good discriminator, should have a high capability of categorizing documents. Categorization used in this paper refers to the process of classifying documents based on their similarity with respect to a group of unrelated topics. Each document of a given classification universe should belong to one, and only one, of the categories. According to such a view, categories should be clearly defined, mutually exclusive and collectively exhaustive. Categorization is another fundamental issue in computing science; the present study is distinct from, but related to, that line of work.
The idea that some terms are more informative than others is rather vague. Intuitively, it is accepted that terms with a higher power of discrimination should be considered more informative. Statistically, terms thought of as having higher power of discrimination tend to contribute more to the expected amount of discrimination information than others. The extent of the contributions that terms make may hence be used as a device for representing the informativeness of terms. The formula used to compute this extent is called a discrimination measure. The underlying mathematical structures that enable the computation are information measures; they provide powerful tools for estimating the expected amount.

There are two kinds of semantic relations commonly used in studies: semantically similar and semantically related (Chiarello, Burgess, Richards, & Pollock, 1990). Semantic relatedness is a more general concept than semantic similarity. Similar terms are usually considered to be related due to their likeness (synonymy); dissimilar terms may also be semantically related by lexical relations (antonymy, hyperonymy, hyponymy, meronymy, holonymy, troponymy, etc.), or by co-occurrence statistics of corpora (Budanitsky & Hirst, 2006). The corpora-based methods are generally context-dependent in character. Computational applications typically require semantic relatedness rather than just semantic similarity (Budanitsky & Hirst, 2001).
A number of applications can be regarded as cases where measuring term relatedness is the main concern, including question answering (Moldovan, Badulescu, Tatu, Antohe, & Girju, 2004), noun-modifier pairs (Nastase & Szpakowicz, 2003), synonym recognition (Turney, Littman, Bigham, & Shnayder, 2003), measurement of semantic relational similarity (Turney, 2006), measurement of textual cohesion (Morris & Hirst, 1991), latent semantic analysis (Landauer & Dumais, 1997), query expansion (Cai, 2004) and word sense disambiguation (Florian & Yarowsky, 2002). The problem of formalizing and quantifying the intuitive notion of relatedness between terms has a long history in philosophy, psychology and computing science. Considerable efforts have been made to propose a variety of relatedness measures: some using lexical resources (manually built thesauri) (Lee, Kim, & Lee, 1993; Richardson, Smeaton, & Murphy, 1994), some using co-occurrence statistics (unsupervised learning from corpora) (Banerjee & Pedersen, 2003; Corley & Mihalcea, 2005; Dagan, Lee, & Pereira, 1999; Hirst & Budanitsky, 2005; Lee, 1999; Marx, Dagan, Buhmann, & Shamir, 2002; Mohammad & Hirst, submitted for publication; Mohammad and Hirst, 2006a,b; Pantel & Lin, 2002; Resnik, 1999; Seco, Veale, & Hayes, 2004; Weeds & Weir, 2005), and some using hybrid techniques (combining both statistical and lexical information) (Han, Sun, Chen, & Xie, 2006; Jiang & Conrath, 1997; Pekar & Staab, 2003; Resnik, 1999; Rodriguez & Egenhofer, 2003). The aim of this study is part of an attempt to establish a theoretical framework towards an effective estimation of relatedness between terms. The core of the framework is
the MDI. One particular information measure, the information radius, is the main focus of this study. We concentrate on discussing the concept of the power of discrimination of a term: by formally interpreting the concept in terms of the information radius, and by quantitatively expressing it by means of discrimination measures. We also highlight properties and relationships of the discrimination measures, make some key points clarifying problems inherent in applications, and address solutions. We then give a practical example, which demonstrates how our method can deal with some basic concepts of applications. A detailed discussion of performance analysis and comparison of experimental results obtained with our method can be found in our other work (Cai, 2004; Cai & Van Rijsbergen, 2007).

The remainder of the paper is organized as follows. In Section 2, we describe the underlying hypotheses of this study, and introduce a general definition of the relatedness measures. In Section 3, we give an easily understood account of the mathematical concept of information radius, which is a basis of this study. In Section 4, we concern ourselves with the formal discussion and quantitative expression of the concept of discrimination information conveyed by a term (the issue of MDI), and introduce the discrimination measures and their properties. In Section 5, we show how to apply our knowledge of discrimination information to a practical problem: the measurement of term relatedness. Some key points about the information radius, discrimination measures and relatedness measures are also made, respectively, in Sections 3–5. An example application of our method is demonstrated in Section 6. Finally, conclusions are drawn in Section 7.

2. Relatedness measures

Many terms tend to co-occur more often than we expect by chance. This is often indicative of some related relations between terms.
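The phrase 'more often than we expect by chance' can be made concrete by comparing observed co-occurrence counts with the counts expected under independence. A minimal illustrative sketch follows; pointwise mutual information is used here only as a familiar instance of such a comparison, not as the measure proposed in this paper, and the counts are invented:

```python
import math

def pmi(count_xy, count_x, count_y, n_windows):
    """Pointwise mutual information: log2 of observed vs. chance co-occurrence.

    count_xy: context windows containing both terms; count_x, count_y:
    windows containing each term; n_windows: total number of windows.
    """
    p_xy = count_xy / n_windows
    p_x, p_y = count_x / n_windows, count_y / n_windows
    return math.log2(p_xy / (p_x * p_y))

# 'computing'/'algorithm' co-occur far above chance; 'computing'/'bicycle' below.
assert pmi(80, 1000, 100, 100_000) > 0   # observed >> expected under independence
assert pmi(1, 1000, 1000, 100_000) < 0   # observed < expected under independence
```

A positive score indicates co-occurrence beyond chance; a score near zero indicates statistical independence.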
Humans can usually judge whether one pair of terms is more related than another: for example, a human would easily judge that 'computing' and 'algorithm' are more related than 'computing' and 'bicycle'. However, it is not easy for computational systems to make such an apparently simple judgement. Morris and Hirst (2004), in a study on semantic relations, wrote: "When people read a text, the relations between the words contribute to their understanding of it. Related word pairs may join together to form larger groups of related words that can extend freely over sentence boundaries." These larger term groups contribute to the meaning of text through "the cohesive effect achieved by the continuity of lexical meaning" (Halliday & Hasan, 1976). These words tell us that related relations guide human understanding of texts from term meanings to information or knowledge, providing new and possibly unexpected insights into the real world.
2.1. Underlying hypotheses

Corpora-based methods attempt to learn related relations, and to provide computational evidence to support the learning theoretically and experimentally, by means of distributional analysis of statistical data of the corpora. There are two underlying hypotheses for doing this. The distributional hypothesis (Harris, 1968) states: related terms tend to occur in similar contexts. However, the converse is often not true; that is, terms co-occurring in similar contexts may not be related to each other. The topic-related hypothesis addresses this; it claims: terms related to a given topic within similar contexts tend to be related to each other. Aitchison (2005) also states that term "co-occurrence in a linguistic context was therefore crucial . . . co-occurrence and context provide the key". The hypotheses and statements tell us, with respect to a given topic, that co-occurrence statistics can provide valuable clues about semantic and contextual information, and that corpora-based methods can enable learning of related relations from these clues.

Turney (2006), in a study on semantic relations, wrote: intuitively, we may expect that lexicon-based algorithms would be better at capturing synonymy than corpora-based algorithms, since lexicons, such as WordNet (Fellbaum, 1998; Miller, 1990), explicitly provide synonymy information that is only implicit in corpora. However, the experiments do not support this intuition. The reason for this may be that corpora-based methods can generally identify terms that have similar co-occurrence patterns; the identified terms can be related or similar to each other in meaning, or even have opposite meanings (Dagan, 2000).

A typical application requiring related relations is the process of word sense disambiguation: the assigning of a particular meaning (sense) to a term based on the context in which it occurs.
In other words, a term occurring in some context can be disambiguated by assigning it the sense most closely related to terms co-occurring in the same context. For example, consider the term 'bank' and the following sentences: "A bank is a business that provides banking services for profit. Traditional banking services include receiving deposits of money, lending money and processing transactions. Some banks issue banknotes as legal tender. Many banks offer ancillary financial services to make additional profit; for example: selling insurance products, investment products or stock broking." The sentences tell us the sense of 'bank': it is clearly 'financial bank', rather than 'river bank', because many terms in the context have the financial sense: business, services, profit, deposits, money, transactions, issue, banknotes, legal, tender, ancillary, financial, selling, insurance, investment, stock, broking. Another typical application requiring related relations is the process of query expansion, which modifies the user's query (topic) so as to more accurately describe the user's
information needs. The terms selected from the relevant documents are added to the query; the added terms, expressing the concepts used in the relevant documents, are expected to be those that express the same concepts used in the information needs. For example, suppose a user is interested in information about 'wildfowl'. The user presents a query 'swan' to a retrieval system. Then, from the initial search results, he marks those documents he feels are relevant to the query, and returns the marked documents to the system. The system automatically selects some terms from the marked documents according to some relatedness measure, adds them to the query, retrieves from the database again with the expanded query, and yields the results as a ranked similarity list. Documents about 'ducks' or 'geese' may be expected to be near the top of the ranked list, because the terms 'ducks' and 'geese' are treated as closely related to 'swan'. This relatedness may refer to their semantic similarity: they are all in the Anatidae family of birds. It may also refer to their co-occurrence patterns: they tend to occur as the subjects of the same verbs ('feed', 'lay', 'mate', 'swim', etc.), and tend to be modified by the same adjectives ('aquatic', 'breeding', 'terrestrial', 'webbed', etc.). Suppose also that there are two texts in the database: The Ugly Duckling (a fable) and The Vain Man (containing 'all his geese are swans'). The two texts would be placed at the bottom of the ranked list because of lower similarities: the same verbs and adjectives would not occur in the contexts of these texts. An interesting study (Fellbaum, 1995) gives some possible reasons for the co-occurrence of semantically opposed terms (for adjectives, nouns and verbs).
Some of the reasons are: syntactic frames ('beautiful and ugly alike', 'from the first to the last', 'cry as well as smile'), redundancy ('dark, rather than light', 'beginning, not ending', 'not increase, but decrease'), and style ('Tobacco Road is dead. Long live Tobacco Road'). Sometimes an explicit reference to one state presupposes the opposite or reverse state: 'The flowers are languishing' may imply that 'they were blooming', and 'The old man is very rich now' may imply that 'he was poor when he was young'. Such presuppositions are frequently overtly expressed, resulting in the co-occurrence of semantically opposed terms.

2.2. Definition of relatedness

In order to learn related relations, by the topic-related hypothesis, we need to consider only those terms which are related to a given topic within similar contexts. An interesting question thus arises immediately: what is meant by saying that a term is related to a given topic, or, more precisely, that a term conveys information related to a given topic? We now attempt to give an answer, which leads to a general definition of relatedness used throughout this study.

Let $D$ be a corpus of documents, with $|D| = m$. Let $T$ be a set of topics, with $|T| = r$. Let $V$ be a vocabulary of terms used to index the individual documents in $D$, with $|V| = n$.
Suppose that $D_i \subseteq D$ is the category of all the documents relevant to topic $s_i \in T$, and that $V_{D_i}$ is the sub-vocabulary consisting of those terms that appear in at least one document in $D_i$, where $i = 1, 2, \ldots, r$. Suppose also that all the categories form a partition over $D$, that is, $\cup_{i=1}^{r} D_i = D$ and $D_i \cap D_j = \emptyset$ (generally, $V_{D_i} \cap V_{D_j} \neq \emptyset$), where $1 \le i < j \le r$.

In practice, it is unlikely that all terms in $V_{D_i}$ would be closely related to topic $s_i$, and it is very difficult to estimate the relatedness of each term to $s_i$. Our aim therefore becomes to judge which terms are informative in expressing context $D_i$. As mentioned previously, terms having higher power of discrimination, i.e., tending to contribute more to the expected amount of discrimination information than others, should be considered more informative. Thus, if we have a discrimination measure to estimate the extent of the contributions that terms make, we have a way to measure the informativeness of terms, and then to quantify relatedness.

Now let $H_1, H_2, \ldots, H_r$ be competing hypotheses, with $H_i$ the hypothesis that 'term $t$ conveys information in expressing context $D_i$' ($i = 1, 2, \ldots, r$). In order to quantify the relatedness of term $t$ to topic $s_i$ in terms of the discrimination information of $t$, we need only adopt an assumption (stated rather informally): the statement 'the amount of information of term $t$ in expressing context $D_i$' can be restated as 'the power of discrimination of term $t$ in support of hypothesis $H_i$ rejecting all other hypotheses $H_j$ ($j = 1, 2, \ldots, i-1, i+1, \ldots, r$)'.

The above discussion may already answer the question and give the meaning of a term being related to a given topic. The issue of MDI, which is the main subject of this study and will be discussed in the following sections, derives its importance from the fact that it provides a means to formally define a relatedness measure, which is now introduced as follows.

Definition 1.
Let $P_{D_i}(t)$ be the distribution of terms in the category $D_i$ with a priori probability $\lambda_i$, where $i = 1, 2, \ldots, r$. The extent of term $t \in V$ related to topic $s_i$ is defined as

$$\mathrm{rel}(t, s_i) = w_{s_i}(t) \cdot \Psi^{(i)}(\{\lambda_j\}, \{P_{D_j}\}) = w_{s_i}(t) \cdot \mathrm{ifd}^{(i)}(t) \qquad (1)$$

which is referred to as a relatedness measure between individual terms and topic $s_i$, where $w_{s_i}(t) \ge 0$ is a weighting function estimating the importance of $t$ concerning $s_i$, and $\mathrm{ifd}^{(i)}(t)$ is a discrimination measure estimating the amount of information conveyed by $t$ in support of $H_i$ rejecting $H_1, \ldots, H_{i-1}, H_{i+1}, \ldots, H_r$. In particular, if we consider only the term discrimination information, without incorporating the topic term weights into relatedness values, that is,

$$\mathrm{rel}^{*}(t, s_i) = \Psi^{(i)}(\{\lambda_j\}, \{P_{D_j}\}) = \mathrm{ifd}^{(i)}(t) \qquad (2)$$
then we can gain an insight into how the discrimination information contributes to the performance. Clearly,
$\mathrm{rel}^{*}(t, s_i)$ is a special case of $\mathrm{rel}(t, s_i)$, in which the weighting function $w_{s_i}(t) = 1$ for every term $t \in V_{D_i}$.

Formalization and quantification of the intuitive notion of relatedness between terms has long been a major challenge for many areas of computing science, and an intriguing problem for other sciences. The relatedness measure of Eq. (2) provides a useful tool for giving a formal expression of the notion of relatedness between any two terms drawn from category $D_i$. That is, for a given term $t_0 \in V_{D_i}$ and any term $t \in V_{D_i}$, we have

$$\mathrm{rel}^{*}(t, t_0) = \Psi^{(i)}(\{\lambda_j\}, \{P_{D_j}\}) = \mathrm{ifd}^{(i)}(t)$$

which is a special case of $\mathrm{rel}^{*}(t, s_i)$ where $s_i = t_0$ is a single-term topic.

Some recent studies of semantic relations, for instance (Hirst & Budanitsky, 2005; McCarthy, Koeling, Weeds, & Carroll, 2004; Patwardhan, Banerjee, & Pedersen, 2003; Stevenson & Greenwood, 2005), have experimentally shown the method proposed in Jiang and Conrath (1997) to be good for individual study tasks. Their method is basically lexical-based (using the lexical taxonomy structure) in conjunction with corpus statistical data to calculate semantic similarity between terms. Our method using MDI targets semantic relatedness from a very different angle, without using lexical structure. Combining taxonomic methods with statistical methods may have some benefits; the main benefit is that it complements the contextual information and domain knowledge with taxonomic structure. However, the combination is only a matter of technical detail, whereas MDI is a fundamental issue, and has to be faced in almost all computational applications.

In the following sections, we concentrate on discussing the discrimination measures, $\mathrm{ifd}^{(i)}(t)$, in depth: by formally interpreting them in terms of the information radius, $K_r(\{\lambda_j\}, \{P_{D_j}\})$, and by quantitatively expressing them by means of the item $\mathrm{ifd}_K(t)$ and sub-items $\mathrm{ifd}^{(i)}(t)$ (where $i = 1, 2, \ldots, r$) of $K_r(\{\lambda_j\}, \{P_{D_j}\})$.
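The product form of Definition 1 can be sketched directly. In the minimal illustration below, both the weighting function and the discrimination measure are hypothetical stand-ins (the paper defines the discrimination measures only in later sections, and the lookup table is invented for demonstration):

```python
from typing import Callable, Dict

def relatedness(term: str,
                weight: Callable[[str], float],        # w_{s_i}(t) >= 0
                discrimination: Callable[[str], float]  # ifd^{(i)}(t)
                ) -> float:
    """Relatedness of `term` to a topic s_i as the product of Eq. (1)."""
    return weight(term) * discrimination(term)

# Illustrative stand-ins: a toy discrimination score looked up from a
# precomputed table, and a uniform weight, which yields rel* of Eq. (2).
ifd_table: Dict[str, float] = {"swan": 0.82, "geese": 0.64, "bicycle": 0.01}

rel_star = lambda t: relatedness(t,
                                 weight=lambda _: 1.0,
                                 discrimination=lambda u: ifd_table.get(u, 0.0))

assert rel_star("swan") > rel_star("bicycle")
```

Any concrete topic-weighting scheme and any of the discrimination measures discussed later can be plugged in for the two callables.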
3. The information radius

Information radius is the basis for the formal interpretation and quantitative expression of the discrimination measures used in this paper. This section intends to give an easily understood account of the mathematical concept of information radius. Some key points, which are essential to understand when applying the information radius, are also addressed.

3.1. Information moment

To gain a full appreciation of the discrimination power of the information radius, it is necessary not only to consider its interpretation but also to become acquainted with some other supporting considerations (i.e., some simple properties). An excellent paper about these has been provided by Sibson (1969). In addition, it is helpful to become
familiar with the concept of information moment and its interpretation. The following is illustrative.

Let $\mathbb{P}_n$ be a convex set¹ of all finite multinomial (discrete) probability distributions defined on a probability space $(V, 2^V)$,

$$\mathbb{P}_n = \{P(t) = (p(t_1), p(t_2), \ldots, p(t_n)) \mid t \in V\}$$

where $p(t_j) \ge 0$ ($j = 1, \ldots, n$) and $\sum_{j=1}^{n} p(t_j) = 1$. Let $H_i$ be the hypothesis that term $t$ is drawn from the category $D_i$, and $P_{D_i}(t) \in \mathbb{P}_n$ the distribution of terms in $D_i$ with a priori probability $\lambda_i$, where $i = 1, 2, \ldots, r$. Also, let $H$ be the hypothesis that term $t$ is drawn from a category $D$, and $P(t) \in \mathbb{P}_n$ the distribution of terms in $D$. The information moment for the categories $D_1, \ldots, D_r$ and the category $D$ is defined by

$$K_r(\{\lambda_j\}, \{P_{D_j}\} : P) = \sum_{i=1}^{r} \lambda_i I(P_{D_i} : P) = \sum_{i=1}^{r} \lambda_i \left( \sum_{t \in V} P_{D_i}(t) \log \frac{P_{D_i}(t)}{P(t)} \right)$$

where $I(P_{D_i} : P)$ is called the directed divergence (Kullback, 1959), because it can be used to measure the expected divergence of distribution $P(t)$ from distribution $P_{D_i}(t)$. The logarithm base is immaterial; throughout this paper, logarithms are taken to base 2 unless otherwise specified.

If we regard $\lambda_i$ as the probability of $P_{D_i}(t)$ being correct, then the information moment can be interpreted as the expected gain in information on rejecting $P(t)$ in support of $P_{D_i}(t)$ for $i = 1, \ldots, r$. In particular, when $r = 1$, we have $\lambda_1 = 1$ and the corresponding information moment reduces to the directed divergence $K_1(\{1\}, \{P_{D_1}\} : P) = I(P_{D_1} : P)$.

Now, assume that all $P_{D_1}(t), \ldots, P_{D_r}(t)$ are known, and let a composite distribution be

$$P_R(t) = \lambda_1 P_{D_1}(t) + \cdots + \lambda_r P_{D_r}(t)$$

Obviously, we have $P_R(t) \in \mathbb{P}_n$ by virtue of the convexity of $\mathbb{P}_n$. It was shown (Sibson, 1969) that the information moment satisfies the equality

$$K_r(\{\lambda_j\}, \{P_{D_j}\} : P) = K_r(\{\lambda_j\}, \{P_{D_j}\} : P_R) + I(P_R : P)$$

It is clear that the first item on the right side of the equality above is a constant when $\lambda_1, \ldots, \lambda_r$ and $P_{D_1}(t), \ldots, P_{D_r}(t)$ are given, and that the second item is a function of the distribution $P(t) \in \mathbb{P}_n$. Consequently, it can be seen that $K_r(\{\lambda_j\}, \{P_{D_j}\} : P)$ arrives uniquely at a minimum when $P(t) = P_R(t)$, that is,

$$\inf_{P \in \mathbb{P}_n} \{K_r(\{\lambda_j\}, \{P_{D_j}\} : P)\} = K_r(\{\lambda_j\}, \{P_{D_j}\} : P_R)$$

since $I(P_R : P) \ge 0$ with equality if and only if $P_R(t) = P(t)$. The minimum can be regarded as the expected gain in information on judging which $P_{D_i}(t)$ should be correct.

3.2. Information radius

The information radius for these $r$ distributions $P_{D_i}(t)$ with a priori probabilities $\lambda_i$, due to Sibson (1969), is defined as the minimum, and denoted by

$$K_r(\{\lambda_j\}, \{P_{D_j}\}) = K_r(\{\lambda_j\}, \{P_{D_j}\} : P_R)$$

Therefore, it can immediately be expressed as

$$K_r(\{\lambda_j\}, \{P_{D_j}\}) = \sum_{i=1}^{r} \lambda_i I(P_{D_i} : P_R) \triangleq \sum_{i=1}^{r} \lambda_i I_{iR}(P_{D_i} : P_R) = \sum_{t \in V} \left( \sum_{i=1}^{r} \lambda_i P_{D_i}(t) \log \frac{P_{D_i}(t)}{P_R(t)} \right)$$

It can be seen that $K_r(\{\lambda_j\}, \{P_{D_j}\}) \ge 0$ with equality if and only if $P_{D_{k_1}}(t) = \cdots = P_{D_{k_s}}(t)$, in which $\lambda_{k_l} > 0$, where $l = 1, \ldots, s$ and $1 \le s \le r$ (that is, it vanishes if and only if those $P_{D_{k_l}}(t)$, for which the corresponding coefficients $\lambda_{k_l}$ are not equal to zero, are identical).

Furthermore, for $r$ disjoint probability distributions² the information radius reduces to the entropy of the a priori probability distribution $P_\lambda = \{\lambda_1, \lambda_2, \ldots, \lambda_r\}$:

$$K_r(\{\lambda_j\}, \{P_{D_j}\}) = \left( \sum_{t \in V_1} + \cdots + \sum_{t \in V_r} \right) \left( \sum_{i=1}^{r} \lambda_i P_{D_i}(t) \log \frac{P_{D_i}(t)}{\lambda_1 P_{D_1}(t) + \cdots + \lambda_r P_{D_r}(t)} \right)$$
$$= \sum_{t \in V_1} \lambda_1 P_{D_1}(t) \log \frac{P_{D_1}(t)}{\lambda_1 P_{D_1}(t)} + \cdots + \sum_{t \in V_r} \lambda_r P_{D_r}(t) \log \frac{P_{D_r}(t)}{\lambda_r P_{D_r}(t)}$$
$$= -\lambda_1 \log \lambda_1 - \cdots - \lambda_r \log \lambda_r = H(P_\lambda)$$

Notice that, when $\lambda_1 \lambda_2 \cdots \lambda_r \neq 0$, we have $\lambda_1 P_{D_1}(t) + \lambda_2 P_{D_2}(t) + \cdots + \lambda_r P_{D_r}(t) = 0$ if and only if $P_{D_1}(t) = P_{D_2}(t) = \cdots = P_{D_r}(t) = 0$. Thus, if we assume that $\lambda_i > 0$ for $i = 1, \ldots, r$, then the $P_{D_i}(t)$ are absolutely continuous³ with respect to the composite distribution $P_R(t)$, that is, $P_{D_i}(t) \ll P_R(t)$. Therefore, under this assumption, the information radius can be used to compare arbitrary term distributions over $(V, 2^V)$. Because of this outstanding property, the information radius appears to be of some general interest.

¹ By the convexity of the set $\mathbb{P}_n$ we mean here that $\lambda_1 P_{D_1}(t) + \lambda_2 P_{D_2}(t) + \cdots + \lambda_r P_{D_r}(t) = P_R(t) \in \mathbb{P}_n$ if $P_{D_i}(t) \in \mathbb{P}_n$ for $i = 1, 2, \ldots, r$ and $P_\lambda = \{\lambda_1, \lambda_2, \ldots, \lambda_r\}$ is an a priori probability distribution concerning the $r$ distributions $P_{D_1}(t), P_{D_2}(t), \ldots, P_{D_r}(t)$.
² The $r$ probability distributions $P_{D_i}(t)$, $i = 1, \ldots, r$, are said to be disjoint if $P_{D_i}(t) \ge 0$ when $t \in V_i$ and $P_{D_i}(t) = 0$ when $t \notin V_i$, where $V_1, \ldots, V_r$ is a partition of $V$, i.e., $V = V_1 \cup \cdots \cup V_r$ and $V_i \cap V_j = \emptyset$ ($1 \le i, j \le r$; $i \neq j$).
³ Probability distribution $P_1(t)$ is said to be absolutely continuous with respect to distribution $P_2(t)$, in symbols $P_1(t) \ll P_2(t)$, if $P_1(t) = 0$ whenever $P_2(t) = 0$.

There are many computational systems which can
benefit from applying the information radius, in particular systems where an a priori probability distribution in the sense of Bayesian statistics is needed.

3.3. An important situation

An important situation, for a given topic $s$, is where we consider classifying documents into two categories: $D_1 = R$, the set of all the documents relevant to $s$, and $D_2 = \bar{R} = D - R$, the set of all the documents non-relevant to $s$. That is, categories $R$ and $\bar{R}$ form a relevance classification over the corpus $D$ with respect to $s$. Thus, in what follows, we will consider only the case where $r = 2$, and always assume that each term is associated with two opposite hypotheses (i.e., $H_2 = \bar{H}_1$, the complement of $H_1$).

Let $H_1$ and $H_2$ be the hypotheses that term $t$ is drawn from categories $R$ and $\bar{R}$, respectively. Let $P_R(t) = P(t \mid H_1)$ and $P_{\bar{R}}(t) = P(t \mid H_2)$ be the distributions of terms in $R$ and $\bar{R}$, with a priori probabilities $\lambda_1$ and $\lambda_2$, respectively. Also, let $H_R$ be the hypothesis that term $t$ is drawn from the corpus $R \cup \bar{R} = D$, and $P_{R\cup\bar{R}}(t) = P(t \mid H_R) = \lambda_1 P_R(t) + \lambda_2 P_{\bar{R}}(t)$ the composite distribution of terms under this hypothesis. Denote the corresponding information radius as

$$K(\lambda_1, \lambda_2; P_R, P_{\bar{R}}) = \lambda_1 I_{1R}(P_R : P_{R\cup\bar{R}}) + \lambda_2 I_{2R}(P_{\bar{R}} : P_{R\cup\bar{R}}) = \sum_{t \in V}\left( \lambda_1 P_R(t)\log\frac{P_R(t)}{P_{R\cup\bar{R}}(t)} + \lambda_2 P_{\bar{R}}(t)\log\frac{P_{\bar{R}}(t)}{P_{R\cup\bar{R}}(t)} \right) \qquad (3)$$

which can be viewed as the expected divergence between distributions $P_R(t)$ and $P_{\bar{R}}(t)$. Based on the interpretation of the information gain, if we view $\lambda_1$ and $\lambda_2$ as the initial probabilities that the respective distributions $P_R(t)$ and $P_{\bar{R}}(t)$ are correct, then the information radius can be interpreted as the expected gain in information on rejecting $P_{R\cup\bar{R}}(t)$ in support of $P_R(t)$ and $P_{\bar{R}}(t)$ (Jardine & Sibson, 1971). The expected gain measures the expected amount of discrimination information conveyed by term $t$ in support of $H_1$ rejecting $H_2$.

It can be seen that the property $0 \le K(\lambda_1, \lambda_2; P_R, P_{\bar{R}}) \le 1$ holds. In fact, by the definition, the lower bound $K(\lambda_1, \lambda_2; P_R, P_{\bar{R}}) \ge 0$ holds, as pointed out earlier for the general case; the upper bound can be shown, from Eq. (3), by

$$K(\lambda_1, \lambda_2; P_R, P_{\bar{R}}) = \sum_{t\in V}\left( \lambda_1 P_R(t)\log\frac{\lambda_1 P_R(t)}{\lambda_1 P_R(t)+\lambda_2 P_{\bar{R}}(t)} + \lambda_2 P_{\bar{R}}(t)\log\frac{\lambda_2 P_{\bar{R}}(t)}{\lambda_1 P_R(t)+\lambda_2 P_{\bar{R}}(t)} \right) - \left( \sum_{t\in V}\lambda_1 P_R(t)\log\lambda_1 + \sum_{t\in V}\lambda_2 P_{\bar{R}}(t)\log\lambda_2 \right)$$
$$\le 0 + 0 - (\lambda_1\log\lambda_1 + \lambda_2\log\lambda_2) \le 1$$

since each term of the first summation is non-positive, and $-(\lambda_1\log\lambda_1 + \lambda_2\log\lambda_2)$ is the binary entropy of $\{\lambda_1, \lambda_2\}$, which never exceeds 1 for base-2 logarithms. More of its properties are to be found in Sibson (1969).
Notice that $K(\lambda_1, \lambda_2; P_R, P_{\bar{R}})$ is not symmetric in the arguments $P_R(t)$ and $P_{\bar{R}}(t)$, nor in $\lambda_1$ and $\lambda_2$. It may be desirable to have a symmetric divergence measure, meaningful in terms of the information radius, when there is no particular reason to emphasize either $P_R(t)$ or $P_{\bar{R}}(t)$. A symmetric divergence measure can easily be introduced by considering the more particular situation where $\lambda_1 = \lambda_2 = \frac{1}{2}$. Denoting the corresponding information radius by $K(P_R, P_{\bar{R}})$, we thus further obtain

$$K(P_R, P_{\bar{R}}) = \frac{1}{2}\sum_{t\in V}\left( P_R(t)\log\frac{2P_R(t)}{P_R(t)+P_{\bar{R}}(t)} + P_{\bar{R}}(t)\log\frac{2P_{\bar{R}}(t)}{P_R(t)+P_{\bar{R}}(t)} \right) \qquad (4)$$
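The equal-weight case of Eq. (4) is straightforward to compute. A minimal sketch follows; the observation that this quantity coincides with the well-known Jensen–Shannon divergence is a standard fact added here for orientation, not a claim of the paper, and the example distributions are invented:

```python
import math

def symmetric_information_radius(p, q):
    """K(P_R, P_R̄) of Eq. (4): the information radius with λ1 = λ2 = 1/2
    (numerically identical to the Jensen–Shannon divergence, base-2 logs)."""
    k = 0.0
    for t in set(p) | set(q):
        a, b = p.get(t, 0.0), q.get(t, 0.0)
        if a > 0:
            k += 0.5 * a * math.log2(2 * a / (a + b))
        if b > 0:
            k += 0.5 * b * math.log2(2 * b / (a + b))
    return k

p = {"swan": 0.7, "feed": 0.3}
q = {"swan": 0.1, "feed": 0.9}

assert 0.0 < symmetric_information_radius(p, q) <= 1.0
# Swapping the arguments leaves the value essentially unchanged (symmetry):
assert abs(symmetric_information_radius(p, q)
           - symmetric_information_radius(q, p)) < 1e-12
```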
3.4. Some key points

There are some important points to be made about the information radius. Clarification of these points is necessary for clearly understanding and properly applying this information measure.

* A Classification Criterion

In a practical application context, the first stage in measuring the power of discrimination of terms is to calculate the expected divergence, followed by derivation of the contributions made by individual terms to the expected divergence. Underlying this is the following Classification Criterion: the divergence measure should be independent of the addition or removal of terms unrelated to the classification. By saying that terms are unrelated to the classification, it is meant here that they have an invariant probability over all the categories considered.

It is essential that a divergence measure satisfies the Classification Criterion: when a term $t$ has an equal probability, i.e., $P_R(t) = P_{\bar{R}}(t)$, this implies that '$t$ does not provide any profitable discrimination information for classifying $D$ into $R$ or $\bar{R}$'. This implication should be carefully distinguished from '$t$ is not related to the topic'. A term may be closely related to the topic while being entirely unrelated to the classification. For instance, consider the topic 'What is tomorrow's computer?' The term 'computer' may be unrelated to the relevance classification for a corpus catalogued as 'computing science': 'computer' would be distributed rather uniformly over the whole corpus, and thus has an invariant probability over all sub-categories. However, everyone would agree that 'computer' is central to the topic. It is rather intuitive and understandable that the addition or removal of terms unrelated to the classification, such as 'computer', should make no difference to the expected divergence.
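The Classification Criterion can be checked numerically for the symmetric measure of Eq. (4). In the sketch below, the distributions are invented, and the tails left by dropping the invariant term are deliberately not renormalized, in order to isolate that term's zero contribution to the sum:

```python
import math

def k_sym(p, q):
    """Symmetric information radius of Eq. (4), base-2 logarithms."""
    total = 0.0
    for t in set(p) | set(q):
        a, b = p.get(t, 0.0), q.get(t, 0.0)
        if a > 0:
            total += 0.5 * a * math.log2(2 * a / (a + b))
        if b > 0:
            total += 0.5 * b * math.log2(2 * b / (a + b))
    return total

# 'computer' has the same probability in both categories, i.e. it is
# unrelated to the classification in the sense of the criterion.
p = {"algorithm": 0.5, "bicycle": 0.1, "computer": 0.4}
q = {"algorithm": 0.1, "bicycle": 0.5, "computer": 0.4}

# Dropping the invariant term leaves the divergence unchanged: its item
# contributes P_R(t)·log 1 + P_R̄(t)·log 1 = 0 to the sum.
p2 = {t: v for t, v in p.items() if t != "computer"}
q2 = {t: v for t, v in q.items() if t != "computer"}
assert abs(k_sym(p, q) - k_sym(p2, q2)) < 1e-12
```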
* $K(P_R, P_{\bar{R}})$ cannot be reduced

From Eq. (4), the symmetric information radius satisfies

$$K(P_R, P_{\bar{R}}) = \frac{1}{2}\sum_{t \in V}\big(P_R(t)\log 2 + P_{\bar{R}}(t)\log 2 + \phi(t)\big) = \frac{1}{2}\sum_{t \in V}\big(P_R(t) + P_{\bar{R}}(t)\big) + \frac{1}{2}\Phi = 1 + \frac{1}{2}\Phi$$

where

$$\Phi = \sum_{t \in V}\left(P_R(t)\log\frac{P_R(t)}{P_R(t) + P_{\bar{R}}(t)} + P_{\bar{R}}(t)\log\frac{P_{\bar{R}}(t)}{P_R(t) + P_{\bar{R}}(t)}\right) = \sum_{t \in V}\big(\phi_1(t) + \phi_2(t)\big) = \sum_{t \in V}\phi(t)$$

It is important to understand that $K(P_R, P_{\bar{R}})$ cannot be reduced to the summation $\Phi$ by eliminating the constants $1$ and $\frac{1}{2}$ in the last expression. In fact, $\Phi$ cannot serve as a divergence measure. First, $\Phi$ is non-positive: for each term $t \in V$, the two sub-items $\phi_1(t)$ and $\phi_2(t)$ of each item $\phi(t)$ are both non-positive, since $P_R(t), P_{\bar{R}}(t) \le P_R(t) + P_{\bar{R}}(t)$, and so is the item itself; thus the summation over the individual items receives a non-positive value. We can also easily verify $\Phi \le 0$ from $0 \le K(P_R, P_{\bar{R}}) = 1 + \frac{1}{2}\Phi \le 1$ (we showed $0 \le K(\lambda_1, \lambda_2; P_R, P_{\bar{R}}) \le 1$ for the general case in Section 3.3). Second, $\Phi$ does not satisfy the Classification Criterion. That is, its items

$$\phi(t) = P_R(t)\log\frac{P_R(t)}{P_R(t) + P_{\bar{R}}(t)} + P_{\bar{R}}(t)\log\frac{P_{\bar{R}}(t)}{P_R(t) + P_{\bar{R}}(t)} = P_R(t)\log\frac{1}{2} + P_{\bar{R}}(t)\log\frac{1}{2} = -P_R(t) - P_{\bar{R}}(t) \neq 0$$

do not vanish when $P_R(t) = P_{\bar{R}}(t) \neq 0$. Thus, $\Phi$ depends on the addition or removal of terms unrelated to the classification.

* Term distributions overlap

Denote by $V_R$ and $V_{\bar{R}}$ the sub-vocabularies consisting of those terms that appear in at least one document in $R$ and $\bar{R}$, respectively. Generally, we have $V_R \cap V_{\bar{R}} \neq \emptyset$. If the two distributions $P_R(t)$ and $P_{\bar{R}}(t)$ overlap (see Footnote 4) over some sub-vocabulary $C \subseteq V_R \cap V_{\bar{R}}$, i.e., $P_R(t) = P_{\bar{R}}(t)$ for every $t \in C$, then $K(\lambda_1, \lambda_2; P_R, P_{\bar{R}})$ drops sharply. In the extreme case where they overlap over the whole vocabulary $V$, we have $K(\lambda_1, \lambda_2; P_R, P_{\bar{R}}) = 0$. This implies that the information radius satisfies the Classification Criterion.

* Term distributions are joint or disjoint

Two distributions $P_R(t)$ and $P_{\bar{R}}(t)$ are said to be joint over $V$ if $V_R \cap V_{\bar{R}} \neq \emptyset$ (see Footnote 2); they are said to be completely joint over $V$ if $V_R = V_{\bar{R}}$. They are said to be disjoint over $V$ if $V_R \cap V_{\bar{R}} = \emptyset$. In the disjoint case, $K(\lambda_1, \lambda_2; P_R, P_{\bar{R}})$ reduces to the entropy of the a priori probability distribution $P_\lambda = \{\lambda_1, \lambda_2\}$ (see Section 3.2 for the general case). In particular, the symmetric $K(P_R, P_{\bar{R}})$ reduces to unity:

$$K(P_R, P_{\bar{R}}) = -\lambda_1\log\lambda_1 - \lambda_2\log\lambda_2 = -\frac{1}{2}\log\frac{1}{2} - \frac{1}{2}\log\frac{1}{2} = \frac{1}{2}\log 2 + \frac{1}{2}\log 2 = 1$$
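The Classification Criterion contrast above can be checked numerically. The following is our own sketch (base-2 logarithms; the names `k_item` and `phi_item` are ours): it computes the per-term item of the symmetric information radius $K$ and the per-term item of the auxiliary summation discussed above. A term with equal probability in both categories contributes zero to $K$ but a strictly negative amount to the summation.

```python
import math

def k_item(pt, qt):
    """Per-term contribution to the symmetric information radius K."""
    m = 0.5 * (pt + qt)
    s = 0.0
    if pt > 0:
        s += 0.5 * pt * math.log2(pt / m)
    if qt > 0:
        s += 0.5 * qt * math.log2(qt / m)
    return s

def phi_item(pt, qt):
    """Per-term item phi(t) of the summation Phi."""
    denom = pt + qt
    v = 0.0
    if pt > 0:
        v += pt * math.log2(pt / denom)
    if qt > 0:
        v += qt * math.log2(qt / denom)
    return v

# A classification-unrelated term (equal probabilities) contributes
# nothing to K, but -(pt + qt) to Phi:
print(k_item(0.2, 0.2))    # 0.0
print(phi_item(0.2, 0.2))  # -0.4
```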
Notice that, if $P_R(t)$ and $P_{\bar{R}}(t)$ are not completely joint, then $P_R(t) \not\ll P_{\bar{R}}(t)$ and/or $P_{\bar{R}}(t) \not\ll P_R(t)$. However, for the information radius, the distributions $P_R(t)$ and $P_{\bar{R}}(t)$ need not be absolutely continuous with respect to each other, since $P_R(t) \ll \bar{P}(t)$ and $P_{\bar{R}}(t) \ll \bar{P}(t)$, where $\bar{P}(t) = \lambda_1 P_R(t) + \lambda_2 P_{\bar{R}}(t)$, hold unconditionally. Such an outstanding property is not possessed by some other commonly used divergence measures, such as the directed divergence $I(P_R : P_{\bar{R}})$.

4. Measurement of discrimination information (MDI)

This section concentrates on the formal definition and quantitative interpretation of discrimination measures: the issue of MDI, which is the core of this study. In what follows, we always assume $\lambda_1 \neq 0$ and $\lambda_2 \neq 0$.

4.1. Definition of discrimination measures

As mentioned previously, in order to measure the discrimination power of a term, we need to measure the extent of the contribution made by the term to the expected divergence. Let us return to Eq. (3) in Section 3.3. The information radius is a sum of items:

$$K(\lambda_1, \lambda_2; P_R, P_{\bar{R}}) = \sum_{t \in V}\mathrm{ifd}_K(t) = \sum_{t \in V}\big(\lambda_1\,\mathrm{ifd}^{(1)}(t) + \lambda_2\,\mathrm{ifd}^{(2)}(t)\big)$$

and its two sub-items are, respectively,

$$\mathrm{ifd}^{(1)}(t) = \mathrm{ifd}_{I_1}(t) = P_R(t)\log\frac{P_R(t)}{\bar{P}(t)} = P(t|H_1)\,i(H_1 : \bar{H}\,|\,t)$$

$$\mathrm{ifd}^{(2)}(t) = \mathrm{ifd}_{I_2}(t) = P_{\bar{R}}(t)\log\frac{P_{\bar{R}}(t)}{\bar{P}(t)} = P(t|H_2)\,i(H_2 : \bar{H}\,|\,t)$$

where $\bar{P}(t) = \lambda_1 P_R(t) + \lambda_2 P_{\bar{R}}(t)$ is the mixture distribution and $\bar{H}$ the corresponding averaged hypothesis.
Let us consider the first sub-item $\mathrm{ifd}_{I_1}(t)$; a similar discussion applies to the second sub-item $\mathrm{ifd}_{I_2}(t)$. It is remarkable that the likelihood ratio satisfies

$$\frac{P_R(t)}{\bar{P}(t)} = \frac{P(t|H_1)}{P(t|\bar{H})} = \frac{P(H_1|t)P(t)/P(H_1)}{P(\bar{H}|t)P(t)/P(\bar{H})} = \frac{P(H_1|t)}{P(\bar{H}|t)}\cdot\frac{P(\bar{H})}{P(H_1)} = O(H_1|t)/O(H_1)$$

where $O(H_1|t)$ is the odds in favour of $H_1$ against $\bar{H}$ given $t$, and $O(H_1)$ is the odds in favour of $H_1$ against $\bar{H}$. The likelihood ratio or, in Turing's terminology, the Bayes factor is an intuitive and important concept in information theory. Turing introduced the expression 'Bayes factor in favour of a hypothesis'. Denote the logarithm of the Bayes factor by

$$i(H_1 : \bar{H}\,|\,t) = \log\big(O(H_1|t)/O(H_1)\big) = \log O(H_1|t) - \log O(H_1)$$

which is called a discrimination factor. Kullback (1959) defined $i(H_1 : \bar{H}\,|\,t)$ as the 'information for discrimination' in favour of $H_1$ against $\bar{H}$. Good (1950) gives a similar interpretation: he describes $i(H_1 : \bar{H}\,|\,t)$ as the 'weight of evidence' concerning $H_1$ as opposed to $\bar{H}$, provided by $t$ (in this case, the occurrence of term $t$ is thought of as a piece of evidence). Therefore, $i(H_1 : \bar{H}\,|\,t)$ can be used to measure the amount of information conveyed by term $t$ in favour of hypothesis $H_1$ against hypothesis $\bar{H}$ or, put alternatively, in favour of $P_R(t)$ against $\bar{P}(t)$, when $t$ occurs. In this paper, we use 'in favour of $H_1$ against $\bar{H}$' and 'in favour of $P_R(t)$ against $\bar{P}(t)$' interchangeably.

Footnote 4: Two probability distributions are said to overlap over some sub-vocabulary $C \subseteq V$ if their densities coincide over $C$. In particular, when they overlap over the whole domain, $P_R(t) = P_{\bar{R}}(t)$ for all $t \in V$.

Consequently, the factor $i(H_1 : \bar{H}\,|\,t)$ in $\mathrm{ifd}_{I_1}(t)$ measures the power of term $t$ to discriminate between the two opposing hypotheses $H_1$ and $\bar{H}$. The probability $P(t|H_1)$ in $\mathrm{ifd}_{I_1}(t)$ measures the significance of term $t$ concerning category $R$ in determining the discrimination power. Thus, $\mathrm{ifd}_{I_1}(t)$ indicates the amount of 'information for discrimination' conveyed by term $t$ in support of hypothesis $H_1$, rejecting hypothesis $\bar{H}$.

The above explains what we mean by the discrimination information of a given term. We can now introduce discrimination measures which compute the extent of the contributions made by individual terms to the expected divergence. We make the following formal definition.

Definition 2. Let $P_R(t)$ and $P_{\bar{R}}(t)$ be the distributions of terms in the categories $R$ and $\bar{R}$, with a priori probabilities $\lambda_1$ and $\lambda_2$, respectively. Suppose $\bar{P}(t) = \lambda_1 P_R(t) + \lambda_2 P_{\bar{R}}(t)$.
For each term $t \in V$, the information of $t$ for discrimination in support of $H_1$, rejecting $\bar{H}$, is defined by

$$\mathrm{ifd}_{I_1}(t) = P_R(t)\log\frac{P_R(t)}{\lambda_1 P_R(t) + \lambda_2 P_{\bar{R}}(t)} \tag{5}$$

and the information of $t$ for discrimination in support of $H_2$, rejecting $\bar{H}$, is defined by

$$\mathrm{ifd}_{I_2}(t) = P_{\bar{R}}(t)\log\frac{P_{\bar{R}}(t)}{\lambda_1 P_R(t) + \lambda_2 P_{\bar{R}}(t)} \tag{6}$$

in which $\mathrm{ifd}_{I_1}(t)$ and $\mathrm{ifd}_{I_2}(t)$ are referred to as the discrimination measures of terms concerning the categories $R$ and $\bar{R}$, respectively, and

$$\mathrm{ifd}_K(t) = \lambda_1\,\mathrm{ifd}_{I_1}(t) + \lambda_2\,\mathrm{ifd}_{I_2}(t) \tag{7}$$

is referred to as the combined discrimination measure of terms concerning the corpora $D$. Notice that the properties $P_R(t) \ll \bar{P}(t)$ and $P_{\bar{R}}(t) \ll \bar{P}(t)$ inherent in the information radius ensure that the sub-items satisfy $\mathrm{ifd}_{I_i}(t) < \infty$ for $i = 1, 2$, and thus the item $\mathrm{ifd}_K(t)$ exists, for every term $t \in V$.
Notice also that $\mathrm{ifd}_{I_1}(t) = \mathrm{ifd}_{I_2}(t) = \mathrm{ifd}_K(t) = 0$ for any term $t$ appearing in both $R$ and $\bar{R}$ with $P_R(t) = P_{\bar{R}}(t)$. Thus, the contribution, to the expected divergence, of terms unrelated to the relevance classification is zero. Therefore, the information radius satisfies the Classification Criterion, and the three discrimination measures emphasize the importance of those terms which have variant probabilities over the categories $R$ and $\bar{R}$. In particular, from Eq. (4), we have the following symmetric discrimination measures:

$$\mathrm{ifd}_{I_1}(t) = P_R(t)\log\frac{2P_R(t)}{P_R(t) + P_{\bar{R}}(t)}, \qquad \mathrm{ifd}_{I_2}(t) = P_{\bar{R}}(t)\log\frac{2P_{\bar{R}}(t)}{P_R(t) + P_{\bar{R}}(t)}, \qquad \mathrm{ifd}_K(t) = \frac{1}{2}\big(\mathrm{ifd}_{I_1}(t) + \mathrm{ifd}_{I_2}(t)\big)$$
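The three measures of Eqs. (5)–(7) can be computed directly for a single term. The sketch below is our own illustration (base-2 logarithms, convention $0\log(0/x) = 0$; the function name is ours): the example exhibits the opposite signs of the two sub-items when $P_R(t) \neq P_{\bar{R}}(t)$.

```python
import math

def ifd_measures(pr_t, prbar_t, lam1=0.5, lam2=0.5):
    """Discrimination measures of Eqs. (5)-(7) for a single term t.

    pr_t = P_R(t) and prbar_t = P_Rbar(t) are the term's probabilities
    in the relevant and non-relevant categories.  Uses base-2 logs and
    the convention 0*log(0/x) = 0.  Returns (ifd1, ifd2, ifdK).
    """
    mix = lam1 * pr_t + lam2 * prbar_t       # mixture distribution
    ifd1 = pr_t * math.log2(pr_t / mix) if pr_t > 0 else 0.0
    ifd2 = prbar_t * math.log2(prbar_t / mix) if prbar_t > 0 else 0.0
    return ifd1, ifd2, lam1 * ifd1 + lam2 * ifd2

# With P_R(t) > P_Rbar(t), the two sub-items take opposite signs:
i1, i2, ik = ifd_measures(0.3, 0.1)
print(i1 > 0, i2 < 0)  # True True
```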
4.2. A property of discrimination measures

Notice that each item $\mathrm{ifd}_K(t)$ can be positive or negative, and so can its two sub-items $\mathrm{ifd}_{I_1}(t)$ and $\mathrm{ifd}_{I_2}(t)$. Before giving the quantitative interpretation of the discrimination measures, let us first consider their properties through the following simple but important theorem.

Theorem 1. For an arbitrary term $t \in V$ satisfying $P_R(t)\,P_{\bar{R}}(t) > 0$, we always have:

(1) $\mathrm{ifd}_{I_1}(t) = 0$ if and only if $P_R(t) = P_{\bar{R}}(t)$, i.e., $\mathrm{ifd}_{I_2}(t) = 0$;
(2) $\mathrm{ifd}_{I_1}(t) > 0$ if and only if $P_R(t) > P_{\bar{R}}(t)$, i.e., $\mathrm{ifd}_{I_2}(t) < 0$.

Proof. Since $P_R(t) \neq 0$ and $P_{\bar{R}}(t) \neq 0$:

(1) $\mathrm{ifd}_{I_1}(t) = 0$ if and only if $P_R(t) = \lambda_1 P_R(t) + \lambda_2 P_{\bar{R}}(t) = \lambda_1 P_R(t) + (1 - \lambda_1)P_{\bar{R}}(t)$, i.e., $(1 - \lambda_1)P_R(t) = (1 - \lambda_1)P_{\bar{R}}(t)$, i.e., $P_R(t) = P_{\bar{R}}(t)$, i.e., $(1 - \lambda_2)P_{\bar{R}}(t) = (1 - \lambda_2)P_R(t)$, i.e., $P_{\bar{R}}(t) = (1 - \lambda_2)P_R(t) + \lambda_2 P_{\bar{R}}(t) = \lambda_1 P_R(t) + \lambda_2 P_{\bar{R}}(t)$, if and only if $\mathrm{ifd}_{I_2}(t) = 0$.

(2) $\mathrm{ifd}_{I_1}(t) > 0$ if and only if $P_R(t) > \lambda_1 P_R(t) + \lambda_2 P_{\bar{R}}(t) = \lambda_1 P_R(t) + (1 - \lambda_1)P_{\bar{R}}(t)$, i.e., $(1 - \lambda_1)P_R(t) > (1 - \lambda_1)P_{\bar{R}}(t)$, i.e., $P_R(t) > P_{\bar{R}}(t)$, i.e., $(1 - \lambda_2)P_{\bar{R}}(t) < (1 - \lambda_2)P_R(t)$, i.e., $P_{\bar{R}}(t) < (1 - \lambda_2)P_R(t) + \lambda_2 P_{\bar{R}}(t) = \lambda_1 P_R(t) + \lambda_2 P_{\bar{R}}(t)$, if and only if $\mathrm{ifd}_{I_2}(t) < 0$.

The proof is complete. □

From Theorem 1 we learn that, for each term $t \in V$, $\mathrm{ifd}_{I_1}(t)\,\mathrm{ifd}_{I_2}(t) \le 0$. Thus, we have the following interesting interpretations.

* If $P_R(t) = P_{\bar{R}}(t)$, then the discrimination factors $i(H_1 : \bar{H}\,|\,t) = i(H_2 : \bar{H}\,|\,t) = 0$: term $t$ gives us no discrimination information about the classification, and the corresponding amount of information is $\mathrm{ifd}_K(t) = 0$.
* If $P_R(t) > P_{\bar{R}}(t)$, then:

(a) $(1 - \lambda_1)P_R(t) > (1 - \lambda_1)P_{\bar{R}}(t)$, i.e., $P_R(t) > \lambda_1 P_R(t) + \lambda_2 P_{\bar{R}}(t)$, and thus $i(H_1 : \bar{H}\,|\,t) > 0$: term $t$ conveys information in favour of $H_1$ against $\bar{H}$, and contributes the amount $\mathrm{ifd}_{I_1}(t) = |\mathrm{ifd}_{I_1}(t)|$ in support of $H_1$, rejecting $\bar{H}$;

(b) $(1 - \lambda_2)P_{\bar{R}}(t) < (1 - \lambda_2)P_R(t)$, i.e., $P_{\bar{R}}(t) < \lambda_1 P_R(t) + \lambda_2 P_{\bar{R}}(t)$, and thus $i(H_2 : \bar{H}\,|\,t) < 0$: term $t$ also conveys information concerning $H_2$ against $\bar{H}$, and contributes the amount $\mathrm{ifd}_{I_2}(t) = -|\mathrm{ifd}_{I_2}(t)|$ in support of $H_2$, rejecting $\bar{H}$.

The total amount of information $\mathrm{ifd}_K(t)$ is the weighted algebraic sum of the two sub-items under the a priori probabilities $\lambda_1$ and $\lambda_2$. Therefore:

– If $\mathrm{ifd}_K(t) > 0$, the positive sign is dominated by the positive sub-item $\mathrm{ifd}_{I_1}(t)$, and $\mathrm{ifd}_K(t)$ indicates that $t$ supports $H_1$ more strongly than $H_2$.
– If $\mathrm{ifd}_K(t) < 0$, the negative sign is dominated by the negative sub-item $\mathrm{ifd}_{I_2}(t)$, and $\mathrm{ifd}_K(t)$ indicates that $t$ supports $H_2$ more strongly than $H_1$.

* If $P_R(t) < P_{\bar{R}}(t)$, then:

(a) $(1 - \lambda_1)P_R(t) < (1 - \lambda_1)P_{\bar{R}}(t)$, i.e., $P_R(t) < \lambda_1 P_R(t) + \lambda_2 P_{\bar{R}}(t)$, and thus $i(H_1 : \bar{H}\,|\,t) < 0$: term $t$ conveys information concerning $H_1$ against $\bar{H}$, and contributes the amount $\mathrm{ifd}_{I_1}(t) = -|\mathrm{ifd}_{I_1}(t)|$ in support of $H_1$, rejecting $\bar{H}$;

(b) $(1 - \lambda_2)P_{\bar{R}}(t) > (1 - \lambda_2)P_R(t)$, i.e., $P_{\bar{R}}(t) > \lambda_1 P_R(t) + \lambda_2 P_{\bar{R}}(t)$, and thus $i(H_2 : \bar{H}\,|\,t) > 0$: term $t$ also conveys information in favour of $H_2$ against $\bar{H}$, and contributes the amount $\mathrm{ifd}_{I_2}(t) = |\mathrm{ifd}_{I_2}(t)|$ in support of $H_2$, rejecting $\bar{H}$.

Similarly, for the weighted algebraic sum of the two sub-items:

– If $\mathrm{ifd}_K(t) > 0$, the sign of the total amount of information is dominated by $\mathrm{ifd}_{I_2}(t)$, and $\mathrm{ifd}_K(t)$ indicates that $t$ supports $H_2$ more strongly than $H_1$.
– If $\mathrm{ifd}_K(t) < 0$, the sign of the total amount of information is dominated by $\mathrm{ifd}_{I_1}(t)$, and $\mathrm{ifd}_K(t)$ indicates that $t$ supports $H_1$ more strongly than $H_2$.

4.3. Some key points

Despite the attractiveness of the three discrimination measures of Eqs. (5)–(7), there are potential problems in applying them without a clear understanding of their relationship and properties. We draw attention to some key points that highlight and clarify these problems, so that the measures can be used properly.

* The relationship of the discrimination measures

The relationship of the three measures is clearly shown in Eqs. (5)–(7). From the property given in Theorem 1, we can see that the two measures $\mathrm{ifd}_{I_1}(t)$ and $\mathrm{ifd}_{I_2}(t)$ are merged into the combined measure $\mathrm{ifd}_K(t)$: it measures not only the positive discrimination information of term $t$ supporting $H_1$, but also the negative discrimination information inherent in $t$ supporting $H_2$ when $t$ also appears in non-relevant documents. It is essential to understand that, from a positive amount $\mathrm{ifd}_K(t) > 0$, we cannot infer that term $t$ strongly supports $H_1$. This is because we may have $\mathrm{ifd}_{I_2}(t) > 0$ which, in this case, must be accompanied by $\mathrm{ifd}_{I_1}(t) < 0$; then $\mathrm{ifd}_K(t) > 0$ suggests that term $t$ supports $H_2$ more strongly than it supports $H_1$.

* The verification that term $t$ strongly supports hypothesis $H_1$

Obviously, in order to judge effectively whether a term $t$ supports $H_1$ more strongly than it supports $H_2$, we need to consider some further condition. To do so, suppose now that $\mathrm{ifd}_K(t) > 0$ and that term $t$ appears in both relevant and non-relevant documents, that is, $P_R(t) > 0$ and $P_{\bar{R}}(t) > 0$. Then:

– If $P_R(t) > P_{\bar{R}}(t)$, then $\mathrm{ifd}_K(t) = \lambda_1\,\mathrm{ifd}_{I_1}(t) - \lambda_2\,|\mathrm{ifd}_{I_2}(t)| > 0$, which indicates that term $t$ conveys more information supporting $H_1$ than it conveys negative information supporting $H_2$.
– If $P_R(t) < P_{\bar{R}}(t)$, then $\mathrm{ifd}_K(t) = \lambda_2\,\mathrm{ifd}_{I_2}(t) - \lambda_1\,|\mathrm{ifd}_{I_1}(t)| > 0$, which indicates that term $t$ conveys less information supporting $H_1$ than it conveys negative information supporting $H_2$.

From these two indications, we learn that term $t$ can be said to be in favour of $H_1$ more than of $H_2$ if it satisfies the two conditions $\mathrm{ifd}_K(t) > 0$ and $P_R(t) > P_{\bar{R}}(t)$ simultaneously. We also learn that verification with the two conditions is equivalent to verification with the single condition

$$\lambda_1\,\mathrm{ifd}_{I_1}(t) - \lambda_2\,|\mathrm{ifd}_{I_2}(t)| > 0 \tag{8}$$

This inequality is thus a precondition for judging whether term $t$ supports hypothesis $H_1$ more strongly than it supports hypothesis $H_2$.

* The simplest discrimination measure

Clearly, the 'prime culprit' leading to $\mathrm{ifd}_{I_2}(t) > 0$ is $P_R(t) < P_{\bar{R}}(t)$. Therefore, whether term $t$ supports $H_1$ more strongly than it supports $H_2$ depends on the relationship between $P_R(t)$ and $P_{\bar{R}}(t)$, rather than on the mathematical sign of the discrimination measures. This clearly tells us that, as long as $P_R(t) > P_{\bar{R}}(t)$, term $t$ is deemed to convey information strongly supporting $H_1$; otherwise, term $t$ strongly supports $H_2$.
Thus, the difference $P_R(t) - P_{\bar{R}}(t)$ is in fact the simplest discrimination measure, and a more general form for applications is

$$\mathrm{sdm}(t) = \big[\lambda_1 P_R(t) - \lambda_2 P_{\bar{R}}(t)\big]^{\beta} \tag{9}$$

where the parameter $\beta \neq 0$.
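Eq. (9) is a one-liner. The sketch below is ours; the remark on $\beta$ in the docstring is our practical observation, not a claim of the paper.

```python
def sdm(pr_t, prbar_t, lam1=0.5, lam2=0.5, beta=1.0):
    """Simplest discrimination measure, Eq. (9):
    [lam1 * P_R(t) - lam2 * P_Rbar(t)] ** beta.

    Note (our remark, not the paper's): for non-integer beta the base
    must be non-negative, so in practice the measure would be applied
    to candidate terms with lam1 * P_R(t) > lam2 * P_Rbar(t).
    """
    return (lam1 * pr_t - lam2 * prbar_t) ** beta

print(round(sdm(0.3, 0.1), 3))  # 0.1
```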
1869
5. Learning relatedness from MDI

So far, we have established a theoretical framework for semantic relatedness and given a formal account of MDI, the core of the framework. We are now in a position to apply this knowledge to a practical problem: learning the relatedness between individual terms and a given topic.

5.1. Expression of relatedness

The discrimination measures corresponding to $R$ and $\bar{R}$ have no direct implications in practice, since the relevance classification in question is itself the object of the classification: there is no a priori way to obtain the discrimination measures. In the theory of categorization, it has been assumed that classification of new examples is based on their similarity to known examples or to a category prototype (Medin, Goldstone, & Gentner, 1990). This kind of assumption suggests an analogy using samples. Let $N$ be a set of samples taken from the corpora $D$, and let $N_1 \subseteq N$ and $N_2 = N - N_1$ be the sets of documents relevant and non-relevant to the topic $s$, respectively. That is, $N_1$ and $N_2$ form a classification over $N$.

As we learnt from the previous sections, the extent to which term $t$ is related to the topic $s$ can be computed by the amount of discrimination information conveyed by $t$ in support of one of $P_{N_1}(t)$ and $P_{N_2}(t)$, rejecting the other. All the discrimination measures given in the last section can be used to measure this amount. Thus, with Definition 1 in Section 2.2, we can write a series of definitions as follows.

Definition 3. Let $P_{N_1}(t)$ and $P_{N_2}(t)$ be the distributions of terms in the sets $N_1$ and $N_2$, with a priori probabilities $\lambda_1$ and $\lambda_2$, respectively. Let $s(t) \ge 0$ estimate the importance of term $t \in V$ concerning topic $s$. Then, for each term $t \in V$ and a given topic $s$, the extent of $t$ related to $s$ is defined by

$$\mathrm{rel}_{I_1}(t, s) = s(t)\,\mathrm{ifd}_{I_1}(t) = s(t)\,P_{N_1}(t)\log\frac{P_{N_1}(t)}{\lambda_1 P_{N_1}(t) + \lambda_2 P_{N_2}(t)} \tag{10}$$

where $\mathrm{ifd}_{I_1}(t)$ is given in Eq. (5); the extent of $t$ non-related to $s$ is defined by

$$\mathrm{rel}_{I_2}(t, s) = s(t)\,\mathrm{ifd}_{I_2}(t) = s(t)\,P_{N_2}(t)\log\frac{P_{N_2}(t)}{\lambda_1 P_{N_1}(t) + \lambda_2 P_{N_2}(t)} \tag{11}$$

where $\mathrm{ifd}_{I_2}(t)$ is given in Eq. (6); and the total extent of $t$ related to $s$ is defined by

$$\mathrm{rel}_K(t, s) = s(t)\,\mathrm{ifd}_K(t) = s(t)\big(\lambda_1\,\mathrm{ifd}_{I_1}(t) + \lambda_2\,\mathrm{ifd}_{I_2}(t)\big) \tag{12}$$

or even by the simplest relatedness measure

$$\mathrm{rel}_{sdm}(t, s) = s(t)\,\mathrm{sdm}(t) = s(t)\big[\lambda_1 P_{N_1}(t) - \lambda_2 P_{N_2}(t)\big]^{\beta} \tag{13}$$

where $\mathrm{sdm}(t)$ is given in Eq. (9). In particular, we have the following relatedness measures:

$$\mathrm{rel}_{I_1}(t, s) = \mathrm{ifd}_{I_1}(t) \tag{14}$$

$$\mathrm{rel}_{I_2}(t, s) = \mathrm{ifd}_{I_2}(t) \tag{15}$$

$$\mathrm{rel}_K(t, s) = \lambda_1\,\mathrm{ifd}_{I_1}(t) + \lambda_2\,\mathrm{ifd}_{I_2}(t) \tag{16}$$

$$\mathrm{rel}_{sdm}(t, s) = \big[\lambda_1 P_{N_1}(t) - \lambda_2 P_{N_2}(t)\big]^{\beta} \tag{17}$$
which consider only the discrimination information of terms, without incorporating the weights of topic terms into the relatedness values.

5.2. Reduction of domain

Suppose $\lambda_1 = |N_1|/|N| > 0$ and $\lambda_2 = |N_2|/|N| > 0$. Notice that, generally, $V_{N_1} \cap V_{N_2} \neq \emptyset$, and the whole domain $V$ can be partitioned into four sub-domains:

$$V = (V_{N_1} \cap V_{N_2}) \cup (V_{N_1} - V_{N_2}) \cup (V_{N_2} - V_{N_1}) \cup (V - V_N).$$

Thus, the discrimination measure can be decomposed into

$$\mathrm{ifd}_K(t) = \begin{cases} \lambda_1\,\mathrm{ifd}_{I_1}(t) + \lambda_2\,\mathrm{ifd}_{I_2}(t) & \text{when } t \in V_{N_1} \cap V_{N_2} \\ \lambda_1\,\mathrm{ifd}_{I_1}(t) + \lambda_2\big(0\log\tfrac{0}{r_1}\big) > 0 & \text{when } t \in V_{N_1} - V_{N_2} \\ \lambda_1\big(0\log\tfrac{0}{r_2}\big) + \lambda_2\,\mathrm{ifd}_{I_2}(t) > 0 & \text{when } t \in V_{N_2} - V_{N_1} \\ \lambda_1\big(0\log\tfrac{0}{0}\big) + \lambda_2\big(0\log\tfrac{0}{0}\big) = 0 & \text{when } t \in V - V_N \end{cases}$$

where $r_i = r_i(t) = \lambda_i P_{N_i}(t) > 0$ for $i = 1, 2$.

When $t \in V - V_N$, $P_{N_1}(t) = P_{N_2}(t) = 0$ and $\mathrm{ifd}_K(t) = 0$. In this case, term $t$ gives us no discrimination information for the relevance classification, and $\mathrm{rel}(t, s) = 0$. Thus, it is not necessary to consider terms in $V - V_N$, and $\mathrm{ifd}_K(t)$ with domain $t \in V$ can immediately be reduced to the one with domain $t \in V_N$.

When $t \in V_{N_2} - V_{N_1} \subseteq V_N$, we have $P_{N_2}(t) > P_{N_1}(t) = 0$. Thus, term $t$ contributes $\mathrm{ifd}_{I_1}(t) = 0$ for supporting $H_1$, and contributes $\mathrm{ifd}_{I_2}(t) = P_{N_2}(t)\log\frac{1}{\lambda_2} > 0$ for supporting $H_2$. That is, term $t$ contributes the total amount $\mathrm{ifd}_K(t) = \lambda_2\,\mathrm{ifd}_{I_2}(t)$ for supporting only $H_2$. In other words, terms appearing only in the non-relevant sample documents offer no statistical information supporting the relevant hypothesis; conversely, they provide information fully supporting the non-relevant hypothesis. Such terms might also be informative, but would be non-related to the topic. Therefore, we should be concerned only with those terms that appear in at least one relevant sample document, and throw away all terms $t \in V_{N_2} - V_{N_1} = V_N - V_{N_1}$. Consequently, $\mathrm{ifd}_K(t)$ with domain $t \in V_N$ can further be reduced to the one with domain $t \in V_{N_1}$.

Next, when $t \in V_{N_1} - V_{N_2} \subseteq V_{N_1}$, we have $P_{N_1}(t) > P_{N_2}(t) = 0$. Thus, term $t$ contributes $\mathrm{ifd}_{I_1}(t) = P_{N_1}(t)\log\frac{1}{\lambda_1} > 0$ for supporting $H_1$, and contributes $\mathrm{ifd}_{I_2}(t) = 0$ for supporting $H_2$. That is, term $t$ contributes the total amount $\mathrm{ifd}_K(t) = \lambda_1\,\mathrm{ifd}_{I_1}(t)$ for supporting only $H_1$. In other words, terms appearing only in relevant sample documents provide information fully supporting the relevant hypothesis. Such terms should be considered closely related to the topic.

Finally, when $t \in V_{N_1} \cap V_{N_2} \subseteq V_{N_1}$, we have $P_{N_1}(t) > 0$ and $P_{N_2}(t) > 0$. In this case (note that $\mathrm{ifd}_{I_1}(t)$ and $\mathrm{ifd}_{I_2}(t)$ are opposite in sign), term $t$ contributes the amount $\mathrm{ifd}_K(t)$ for supporting both $H_1$ and $H_2$. In other words, terms appearing in both relevant and non-relevant sample documents convey statistical information supporting both the relevant and the non-relevant hypotheses. In particular, when $P_{N_1}(t) > P_{N_2}(t)$, term $t$ supports the relevant hypothesis more strongly than the non-relevant one, and should be considered related to the topic. In fact, the main task of all our relatedness measures is to deal with such terms.

5.3. Some key points

In applying the relatedness measures, the following points are worth mentioning.

* The relationship of the relatedness measures

From Eqs. (10)–(12), for each term $t \in V_{N_1}$, we can write

$$\mathrm{rel}_K(t, s) = \lambda_1\,\mathrm{rel}_{I_1}(t, s) + \lambda_2\,\mathrm{rel}_{I_2}(t, s)$$

which can be positive or negative. From Theorem 1, we have $\mathrm{rel}_{I_1}(t, s)\,\mathrm{rel}_{I_2}(t, s) \le 0$. Thus, $\mathrm{rel}_K(t, s)$ is also a weighted algebraic sum of two opposite relatedness values: it captures not only the relatedness of terms to the topic, but also the non-relatedness inherent in terms when they also appear in non-relevant documents.
It should be emphasized that the single condition $\mathrm{rel}_K(t, s) > 0$ cannot guarantee that term $t$ is closely related to the topic. It is true that $\mathrm{rel}_K(t, s) > 0$ implies $\mathrm{ifd}_K(t) > 0$ (since $s(t) \ge 0$); however, this is not enough to infer that term $t$ supports $P_{N_1}(t)$ more strongly than it supports $P_{N_2}(t)$.

* The verification that term $t$ is closely related to topic $s$

Clearly, if $P_{N_1}(t) > P_{N_2}(t)$, then $\mathrm{rel}_{I_1}(t, s) \ge 0$ and $\mathrm{rel}_{I_2}(t, s) \le 0$; in this case, if also $\mathrm{rel}_K(t, s) > 0$, then we have

$$\lambda_1\,\mathrm{rel}_{I_1}(t, s) - \lambda_2\,|\mathrm{rel}_{I_2}(t, s)| > 0 \tag{18}$$

which indicates that term $t$ is related to topic $s$. Thus, in order to effectively select informative terms closely related to $s$, we must verify the two conditions $P_{N_1}(t) > P_{N_2}(t)$ and $\mathrm{rel}_K(t, s) > 0$ simultaneously for each selected term. Verification with the two conditions is again equivalent to verification with a single condition: the inequality given in Eq. (18), or that given in Eq. (8).

* The sample set

In practice, it is normally accepted that relatedness methods do not treat the situation where $N_1 = \emptyset$ and $N_2 = N$ (i.e., $\lambda_1 = 0$ and $\lambda_2 = 1$), that is, where no relevance information is available and all samples are found to be non-relevant to the topic. On the other hand, for the situation where $N_1 = N$ and $N_2 = \emptyset$ (i.e., $\lambda_1 = 1$ and $\lambda_2 = 0$), that is, where all samples are judged relevant to the topic, the relatedness measure given in Eq. (12), for instance, can be written

$$\mathrm{rel}_K(t, s) = s(t)\,\lambda_1\,\mathrm{ifd}_{I_1}(t) = s(t)\,(-\lambda_1\log\lambda_1)\,P_{N_1}(t) \tag{19}$$
To sum up, we consider a candidate $t \in V_{N_1}$ whenever it satisfies $P_{N_1}(t) > P_{N_2}(t)$. The problem can then be stated as that of computing relatedness values for the candidates using one of the relatedness measures; the candidates are then sorted in decreasing order of these values. The candidates with the highest (positive) values should be given high priority in selection, as they make the greatest contributions to the expected divergence. The selected candidates should be regarded as informative terms (or good discriminators), capable of distinguishing relevant documents from many non-relevant ones.

In the next section we move from mathematical abstraction to a concrete example, to help further clarify the ideas involved in our method. One of the most prominent applications of MDI is in the analysis and retrieval of text documents; the example we have chosen is therefore set in the context of text information retrieval. With this example, we attempt to demonstrate how our method can deal with some basic concepts of applications, and how the mathematical analysis is supported by empirical evidence drawn from performance experiments.
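The selection procedure summarized above can be sketched as follows. This is our own illustrative Python, not the paper's code: term distributions and the importance weights $s(t)$ are plain dictionaries, base-2 logarithms are used, and both conditions, $P_{N_1}(t) > P_{N_2}(t)$ and $\mathrm{rel}_K(t, s) > 0$, are verified for each candidate before ranking.

```python
import math

def select_expansion_terms(p_n1, p_n2, s, lam1=0.5, lam2=0.5, n_terms=10):
    """Rank candidates t in V_N1 by rel_K(t, s) = s(t) * ifd_K(t), Eq. (12).

    p_n1, p_n2: term distributions over the relevant / non-relevant
    samples; s: importance weights s(t) >= 0.  Only candidates with
    P_N1(t) > P_N2(t) and rel_K(t, s) > 0 are kept.
    """
    scored = []
    for t, p1 in p_n1.items():
        p2 = p_n2.get(t, 0.0)
        if p1 <= p2:                      # first condition: P_N1(t) > P_N2(t)
            continue
        mix = lam1 * p1 + lam2 * p2
        ifd1 = p1 * math.log2(p1 / mix)
        ifd2 = p2 * math.log2(p2 / mix) if p2 > 0 else 0.0
        rel_k = s.get(t, 0.0) * (lam1 * ifd1 + lam2 * ifd2)
        if rel_k > 0:                     # second condition: rel_K(t, s) > 0
            scored.append((t, rel_k))
    scored.sort(key=lambda pair: -pair[1])
    return scored[:n_terms]

# Toy sample distributions (hypothetical numbers, for illustration only).
p_n1 = {'neural': 0.3, 'network': 0.25, 'the': 0.2, 'computer': 0.15}
p_n2 = {'the': 0.2, 'computer': 0.3, 'network': 0.05}
s = {'neural': 1.0, 'network': 1.0, 'the': 0.1, 'computer': 1.0}
print(select_expansion_terms(p_n1, p_n2, s))
# 'neural' ranks above 'network'; 'the' and 'computer' are filtered out.
```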
Query expansion is a process that revises the original query representation, strengthening some concepts so as to describe the information need more accurately. In particular, when expansion terms are selected from a set of relevant documents, query expansion is an effective technique: it adds to the original query those terms, occurring in the relevant documents, that are expected to express concepts describing the information needs.

The idea of query expansion using information measures does not appear out of nowhere. Many past studies, examining both theoretical and experimental aspects, have shown that it can be profitable to employ an information measure as a device for constructing relatedness measures to select informative terms. For instance, the studies in Amati and van Rijsbergen (2002), Cai (2004), Carpineto, Mori, and Romano (1998), Carpineto, Mori, Romano, and Bigi (2001), Cronen-Townsend, Zhou, and Croft (2002), Dagan et al. (1999), Lafferty and Zhai (2001), van Rijsbergen (1979), and Zhai and Lafferty (2001) used the directed divergence measure (Kullback, 1959):

$$I(P_R : P_{\bar{R}}) \triangleq I_{12}(P_R : P_{\bar{R}}) = \sum_{t \in V}P_R(t)\log\frac{P_R(t)}{P_{\bar{R}}(t)}$$

The studies in Cai (2004) and Carpineto et al. (1998) used the divergence measure (Kullback, 1959):

$$J(P_R, P_{\bar{R}}) = I_{12}(P_R : P_{\bar{R}}) + I_{21}(P_R : P_{\bar{R}}) = \sum_{t \in V}\big(P_R(t) - P_{\bar{R}}(t)\big)\log\frac{P_R(t)}{P_{\bar{R}}(t)}$$

The studies in Cai (2004), van Rijsbergen (1979), and Wong and Yao (1992) used the information radius measure (Sibson, 1969). The studies in Berger and Lafferty (1999), Biru, El-Hamdouchi, Rees, and Willett (1989), Cai (2004), Church and Hanks (1990), Gauch, Wang, and Rachakonda (1999), Kang and Choi (1997), Kim and Choi (1999), Kwon, Kim, and Choi (1994), Mandala, Tokunaga, and Tanaka (2000), van Rijsbergen (1979), Wong and Yao (1989), and Xu and Croft (1996) used the expected mutual information measure (Good, 1950; Kullback, 1959; Shannon, 1948). The studies in Carmel, Farchi, Petruschka, and Soffer (2002) and Fan, Gordon, and Pathak (2005) used the entropy increase measure (Shannon, 1948; Rao, 1982).

A necessary condition for applying $I(P_R : P_{\bar{R}})$ is $P_R(t) \ll P_{\bar{R}}(t)$; a necessary condition for applying $J(P_R, P_{\bar{R}})$ is both $P_R(t) \ll P_{\bar{R}}(t)$ and $P_{\bar{R}}(t) \ll P_R(t)$; the measures are meaningless otherwise. Such requirements may be too strong in a practical IR context, and may not be satisfied when we attempt to derive the two distributions from different document categories. As mentioned previously, the measure $K(\lambda_1, \lambda_2; P_R, P_{\bar{R}})$ is well-defined: it places no such requirement on the arguments $P_R(t)$ and $P_{\bar{R}}(t)$. Also, our experimental results have shown no significant difference between the retrieval performances obtained from the three measures (Cai, 2004). Therefore, it may be appropriate to apply the information radius to query
expansion in situations where the absolute continuity of term distributions does not hold.

6.1. Feedback processes

Suppose $q$ is a query formulated by a user of a retrieval system. Let the sample set $N$, where $|N| = a > 0$, consist of the top-ranked documents retrieved in an initial retrieval iteration. Let $N_1$ and $N_2$ be the sample subsets of documents relevant and non-relevant to $q$, respectively. Following the discussion in Section 5.2, we assume that $V_{N_1}$ constitutes the source of candidates. Our aim is to judge which candidates are informative ones, distinguishing relevant documents from non-relevant ones with respect to $q$. To this end, we can apply the relatedness measures, for instance Eqs. (10), (12) and (16), to compute the relatedness between the individual candidates and $q$.

(I) Relevance feedback

In a relevance feedback process, the top-ranked documents in set $N$ can be displayed graphically to the user, and screen pointers can be used to designate some of them as relevant to the user's information needs. Suppose $N_1$ (where $0 < |N_1| < a$) is the set of designated documents. If $|N_1| = 0$, that is, no positive relevance information is available and all sample documents are found to be non-relevant, the user should be required to reformulate the query and submit it to the retrieval system to produce an effective sample set. If $|N_1| = a$, that is, all sample documents are judged relevant to $q$, the user can terminate the search upon being satisfied that enough relevant documents have been found. Otherwise, to obtain more relevant documents, the user can either enter an iterative relevance feedback loop by applying one of the relatedness measures, or enter an iterative pseudo-relevance feedback loop by taking an extra 'non-relevant' sample set $@$ and merging it into the sample set $N$ (see below).

(II) Pseudo-relevance feedback

In an operational situation where no relevance information is available in advance, all documents in $N$ are treated as (pseudo-)relevant to $q$, and $V_N$ as the source of candidates. In order to take an extra 'non-relevant' sample set, we proceed as follows. Let $@$ be a set of documents ranked in the initial retrieval, $@ = \{d_{b+1}, d_{b+2}, \ldots, d_{b+c}\}$, where the integers $b > a$ and $c \ge 1$, and the subscripts $b+1, b+2, \ldots, b+c$ are ranking positions. The choice of $a$, $b$ and $c$ depends on the specific retrieval strategy and is immaterial here. For instance, we may take $a = 10$, $b = 1000$ and $c = 30$ if we have sufficient belief that the documents in $@ = \{d_{1001}, d_{1002}, \ldots, d_{1030}\}$ are non-relevant to $q$. Thus, we obtain an alternative sample set
$I = N \cup @$, with $N \cap @ = \emptyset$ (generally, $V_N \cap V_@ \neq \emptyset$). That is, we use the sample sets $N$ and $@$ in place of the sets $N_1$ and $N_2$, just as in relevance feedback. If $c \neq a$, then we can use $K(\lambda_1, \lambda_2; P_N, P_@)$ to construct the relatedness measure, as discussed for relevance feedback, with $\lambda_1 = \frac{a}{a+c}$ and $\lambda_2 = \frac{c}{a+c}$. If $c = a$, then we can apply $K(P_N, P_@)$ to construct the relatedness measure, as discussed for relevance feedback, with $\lambda_1 = \lambda_2 = \frac{1}{2}$.

However, in pseudo-relevance feedback, if the initial retrieval returns low precision, the estimates of the relatedness values may be poor, because limited and noisy training samples provide insufficient and unreliable relevance information. In this case, the expanded query cannot be expected to produce any further improvement in retrieval performance.

6.2. Query expansion

In order to investigate to what extent each relatedness measure contributes to the improvement of retrieval performance, we have carried out a number of experiments. Detailed descriptions of the methodologies, such as the composition of the vocabulary $V$, the weighting functions for document terms $w_d(t)$ and query terms $w_q(t)$, the similarity measure $sim(d, q)$ between a document $d \in D$ and the query $q$, the reweighting function $rew_{q'}(t)$ for the expanded query terms, the estimation of the term probability distributions $P_{N_1}(t)$ and $P_{N_2}(t)$, the estimation of the weighting functions $T(t)$ for candidates, the optimal size of the sample set and number of expansion terms, and the standard evaluation measures, can be found in Cai (2004).
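The sampling scheme of Section 6.1 is easy to make concrete. The sketch below is our own illustration (function and variable names are hypothetical, not from the paper): it splits an initial ranking into the pseudo-relevant sample $N$ (the $a$ top-ranked documents) and the extra 'non-relevant' sample (documents ranked $b+1$ to $b+c$), and returns the corresponding a priori probabilities.

```python
def prf_samples(ranked_docs, a=10, b=1000, c=30):
    """Split an initial ranking into a pseudo-relevant sample N (the a
    top-ranked documents) and an extra 'non-relevant' sample (documents
    ranked b+1 .. b+c); a, b and c are strategy-dependent choices.
    Returns (N, extra, lam1, lam2) with lam1 = a/(a+c), lam2 = c/(a+c).
    """
    if not (b > a and c >= 1 and b + c <= len(ranked_docs)):
        raise ValueError('need b > a, c >= 1 and enough ranked documents')
    n = ranked_docs[:a]
    extra = ranked_docs[b:b + c]          # d_{b+1}, ..., d_{b+c}
    lam1, lam2 = a / (a + c), c / (a + c)
    return n, extra, lam1, lam2

ranked = [f'd{i}' for i in range(1, 2001)]   # documents by initial rank
n, extra, lam1, lam2 = prf_samples(ranked)
print(len(n), len(extra), lam1, lam2)  # 10 30 0.25 0.75
```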
In this example, we show only the experimental results obtained from $\mathrm{rel}_K(t, q)$ of Eq. (16) in relevance feedback, which give an insight into how the discrimination information of candidates contributes to the performances. The results in Fig. 1 consist of the average performances (over 50 queries) obtained, respectively, from: benchmark-1 (the original queries), benchmark-2 (the expanded queries in pseudo-relevance feedback, using Rocchio's method), benchmark-3 (the expanded queries in relevance feedback, using Rocchio's method), and $\mathrm{rel}_K(t, q)$ (in relevance feedback). Despite its age, Rocchio's method (1971) and its variants have been shown to achieve good performance, and are widely used in many information retrieval-related tasks (Buckley & Salton, 1995; Cai, 2004; Carpineto et al., 2001; Crammer & Singer, 2003; Fan et al., 2005).

From Fig. 1 we can see that the performances of the expanded queries obtained from $\mathrm{rel}_K(t, q)$ are dramatically better than benchmark-1 and benchmark-2 at all evaluation points, for all the different parts of the queries, and markedly better than benchmark-3 at almost all evaluation points, for all the different parts of the queries. The experimental results in this example demonstrate that, if partial relevance information over a sample set is available, our relatedness measures can produce performance improvements in a practical information retrieval setting. As this paper focuses on theoretical analysis and a formal method discussion for applications, readers interested in how our method is supported by empirical evidence drawn from a number of performance experiments are referred to our companion work (Cai, 2004).

7. Conclusion
Fig. 1. This example shows the performances obtained from retrieval using the FT collection (provided by the TREC ad hoc data (Voorhees & Harman, 1999)), with 210,158 documents, against 50 TREC queries (351–400). Each query was produced, respectively, from the corresponding title field (denoted title-only), both the title and description fields (denoted desc + title), and the full text (denoted full-text) of the query. The standard evaluation measures used are: A-P (average precision over the set of 50 queries); R-P (R-precision, i.e., precision at the number $|R|$ of documents); P@5 (average precision at the 5 top-ranked documents); and P@10 (average precision at the 10 top-ranked documents).
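The evaluation measures listed in the caption can be computed from a ranked document list in a few lines. The following is a minimal sketch with invented document identifiers, not the evaluation code used in our experiments:

```python
def precision_at(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant (P@k)."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank holding a relevant document."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

ranked = ["d3", "d7", "d1", "d9", "d4"]   # system ranking for one query
relevant = {"d3", "d1", "d5"}             # relevance judgements, |R| = 3
print(precision_at(ranked, relevant, 5))            # P@5
print(precision_at(ranked, relevant, len(relevant)))  # R-precision: P@|R|
print(average_precision(ranked, relevant))          # A-P for this query
```

A-P over a query set, as reported in Fig. 1, is simply the mean of `average_precision` across the 50 queries.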
In this study, we presented a thorough investigation into measuring the discrimination power of terms. We focused on developing a formal method, based on the information radius, for estimating the discrimination information conveyed by terms. In particular, we formally interpreted the discrimination measures, and argued that the interpretation is essential for guiding practical applications. We introduced a formal definition, based on the discrimination measures, for the intuitive notion of relatedness between terms and a given topic. A practical example using our method in the context of text IR was given. As mentioned previously, co-occurrence statistics can provide valuable clues about semantic and contextual information; the distributions $P_R(t)$ and $P_{\overline{R}}(t)$ can characterize statistical data and thus reveal such clues; and the divergence information between the distributions can be used to identify terms having similar co-occurrence patterns (clues) and enable the learning of relatedness relations. The three information measures $I(P_R : P_{\overline{R}})$, $J(P_R, P_{\overline{R}})$ and $K(\lambda_1, \lambda_2; P_R, P_{\overline{R}})$ address, in different ways, the issue of how to estimate the expected divergence. Their strength lies in their ability to provide rational and sound estimates and
thus to capture semantic relatedness between terms. We summarize some important features of, and differences between, the three information measures from practical application viewpoints as follows.

- $I(P_R : P_{\overline{R}})$, $J(P_R, P_{\overline{R}})$ and $K(\lambda_1, \lambda_2; P_R, P_{\overline{R}})$ emphasize the importance of those terms with variant probabilities within the sets $R$ and $\overline{R}$, and remove the dependence on terms with invariant probabilities, as such terms would not provide profitable information for the relevance classification. If $P_R(t) = P_{\overline{R}}(t)$ for all $t \in C \subseteq V_R \cap V_{\overline{R}}$, the values $I(P_R : P_{\overline{R}})$, $J(P_R, P_{\overline{R}})$ and $K(\lambda_1, \lambda_2; P_R, P_{\overline{R}})$ drop sharply. If $P_R(t) = P_{\overline{R}}(t)$ for all $t \in V$, then $I(P_R : P_{\overline{R}}) = J(P_R, P_{\overline{R}}) = K(\lambda_1, \lambda_2; P_R, P_{\overline{R}}) = 0$.
- When $P_R(t)$ and $P_{\overline{R}}(t)$ are completely disjoint, i.e., $V_R \cap V_{\overline{R}} = \emptyset$, $K(\lambda_1, \lambda_2; P_R, P_{\overline{R}})$ reduces to the entropy of its a priori probability distribution. In this case, $I(P_R : P_{\overline{R}})$ and $J(P_R, P_{\overline{R}})$ do not exist.
- $I(P_R : P_{\overline{R}})$ requires $V_R \subseteq V_{\overline{R}}$; $J(P_R, P_{\overline{R}})$ requires $V_R = V_{\overline{R}}$; $K(\lambda_1, \lambda_2; P_R, P_{\overline{R}})$ places no requirement on the relationship between $V_R$ and $V_{\overline{R}}$. In other words, $I(P_R : P_{\overline{R}})$ requires the absolute continuity $P_R(t) \ll P_{\overline{R}}(t)$ for $t \in V$; $J(P_R, P_{\overline{R}})$ requires both $P_R(t) \ll P_{\overline{R}}(t)$ and $P_{\overline{R}}(t) \ll P_R(t)$ for $t \in V$; $K(\lambda_1, \lambda_2; P_R, P_{\overline{R}})$ need not satisfy any absolute continuity condition, since $P_R(t), P_{\overline{R}}(t) \ll \lambda_1 P_R(t) + \lambda_2 P_{\overline{R}}(t)$ for all $t \in V$.
- $I(P_R : P_{\overline{R}})$ is not symmetric in $P_R(t)$ and $P_{\overline{R}}(t)$; $J(P_R, P_{\overline{R}})$ is symmetric in $P_R(t)$ and $P_{\overline{R}}(t)$; $K(\lambda_1, \lambda_2; P_R, P_{\overline{R}})$ is symmetric neither in $P_R(t)$ and $P_{\overline{R}}(t)$ nor in $\lambda_1$ and $\lambda_2$ (a symmetric information radius can be obtained by setting $\lambda_1 = \lambda_2 = \frac{1}{2}$).
- In the application of $K(\lambda_1, \lambda_2; P_R, P_{\overline{R}})$, the a priori probability distribution $P_\lambda = \{\lambda_1, \lambda_2\}$ must be provided beforehand, and the choice of $P_\lambda$ depends on the specific model itself. No a priori probability distribution is needed for the applications of $I(P_R : P_{\overline{R}})$ and $J(P_R, P_{\overline{R}})$.
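The contrasting behaviour summarized above, in particular the treatment of disjoint vocabularies, can be checked numerically. The following sketch (the function names and example distributions are ours, for illustration only) contrasts the three measures with $\lambda_1 = \lambda_2 = \frac{1}{2}$:

```python
import math

def kl(p, q):
    """I(P : Q) = sum_t p(t) log(p(t)/q(t)); needs P absolutely continuous w.r.t. Q."""
    total = 0.0
    for t, pt in p.items():
        if pt > 0.0:
            qt = q.get(t, 0.0)
            if qt == 0.0:
                return math.inf  # absolute continuity violated: I(P : Q) does not exist
            total += pt * math.log(pt / qt)
    return total

def j_div(p, q):
    """J(P, Q) = I(P : Q) + I(Q : P); needs mutual absolute continuity."""
    return kl(p, q) + kl(q, p)

def info_radius(p, q, lam1=0.5, lam2=0.5):
    """K(lam1, lam2; P, Q): divergence from the mixture M; always finite."""
    m = {t: lam1 * p.get(t, 0.0) + lam2 * q.get(t, 0.0) for t in set(p) | set(q)}
    return lam1 * kl(p, m) + lam2 * kl(q, m)

p = {"loan": 0.7, "rate": 0.3}    # V_P
q = {"river": 0.6, "water": 0.4}  # V_Q, disjoint from V_P
print(kl(p, q))          # inf: I does not exist
print(j_div(p, q))       # inf: J does not exist
print(info_radius(p, q)) # log 2: the entropy of the a priori distribution {1/2, 1/2}
```

On the disjoint supports, $K$ collapses to $H(P_\lambda) = \log 2$ while $I$ and $J$ are undefined, which is exactly the difference exploited when $K$ is preferred in sparse-data settings.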
It is clear that each discrimination measure is uniquely determined by its two arguments $P_R(t)$ and $P_{\overline{R}}(t)$. Evidently, the problem of estimating these two arguments is crucial for effectively measuring the discrimination power of terms. Ongoing and future work includes (i) establishing a unified framework to support a systematic investigation into the effective estimation of probability distributions and (ii) carrying out extensive experimental studies, performance comparisons and analyses, particularly using extremely large corpora. Finally, it should be emphasized that our method can be applied to any type of data over a discrete domain, and can form part of the groundwork for any semantic analysis task.
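As a simple illustration of the estimation problem, the most basic choice for the two arguments is a maximum-likelihood estimate of $P_R(t)$ from relative term frequencies over a sample set. The sketch below assumes this naive estimator (our own, for illustration; the estimation methods actually used are detailed in Cai (2004)):

```python
from collections import Counter

def estimate_distribution(docs):
    """Maximum-likelihood estimate of P_R(t): the relative frequency of each
    term over the sample set R. docs is a list of tokenized documents."""
    counts = Counter(t for d in docs for t in d)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

# Toy sample set R of two tokenized documents.
R = [["bank", "loan", "bank"], ["loan", "rate"]]
p_R = estimate_distribution(R)
```

In practice such a raw estimate is usually smoothed, since zero probabilities are exactly what cause $I$ and $J$ to become undefined.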
Acknowledgement

This study was supported in part by EPSRC.
References

Aitchison, J. (2005). Review: M. Lynne Murphy, Semantic relations and the lexicon: Antonymy, synonymy and other paradigms. International Journal of Lexicography, 18(1), 106–109.
Amati, G., & van Rijsbergen, C. J. (2002). Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Transactions on Information Systems, 20(4), 357–389.
ANSI/NISO Z39.19-2005. (2005). Guidelines for the construction, format, and management of monolingual thesauri. National Information Standards Organization.
Banerjee, S., & Pedersen, T. (2003). Extended gloss overlaps as a measure of semantic relatedness. In Proceedings of the 18th International Joint Conference on Artificial Intelligence (pp. 805–810).
Berger, A., & Lafferty, J. (1999). Information retrieval as statistical translation. In Proceedings of the 22nd Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval (pp. 222–229).
Biru, T., El-Hamdouchi, A., Rees, R. S., & Willett, P. (1989). Inclusion of relevance information in the term discrimination model. Journal of Documentation, 45(2), 85–109.
Buckley, C., & Salton, G. (1995). Optimisation of relevance feedback weights. In Proceedings of the 18th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval (pp. 351–357).
Budanitsky, A., & Hirst, G. (2001). Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures. In Proceedings of the Workshop on WordNet and Other Lexical Resources, Second Meeting of the North American Chapter of the Association for Computational Linguistics (pp. 29–34).
Budanitsky, A., & Hirst, G. (2006). Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1), 13–47.
Cai, D. (2004). IfD: Information for Discrimination. PhD thesis, University of Glasgow, Glasgow, Scotland.
Cai, D., & van Rijsbergen, C. J. (2007). Reconsidering the fundamentals of measurement of discrimination information. In Proceedings of the 1st International Conference on the Theory of Information Retrieval (ICTIR'07) (pp. 151–158).
Carmel, D., Farchi, E., Petruschka, Y., & Soffer, A. (2002). Automatic query refinement using lexical affinities with maximal information gain. In Proceedings of the 25th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval (pp. 283–290).
Carpineto, C., De Mori, R., & Romano, G. (1998). Informative term selection for automatic query expansion. In The 7th Text REtrieval Conference (TREC-7) (pp. 363–369). NIST Special Publication.
Carpineto, C., De Mori, R., Romano, G., & Bigi, B. (2001). An information-theoretic approach to automatic query expansion. ACM Transactions on Information Systems, 19(1), 1–27.
Chiarello, C., Burgess, C., Richards, L., & Pollock, A. (1990). Semantic and associative priming in the cerebral hemispheres: Some words do, some words don't ... sometimes, some places. Brain and Language, 38(1), 75–104.
Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), 22–29.
Corley, C., & Mihalcea, R. (2005). Measuring the semantic similarity of texts. In Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment (pp. 13–18).
Crammer, K., & Singer, Y. (2003). A family of additive online algorithms for category ranking. Journal of Machine Learning Research, 3, 1025–1058.
Cronen-Townsend, S., Zhou, Y., & Croft, W. B. (2002). Predicting query performance. In Proceedings of the 25th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval (pp. 299–306).
Dagan, I. (2000). Contextual word similarity. In Handbook of Natural Language Processing (pp. 459–475).
Dagan, I., Lee, L., & Pereira, F. C. N. (1999). Similarity-based models of word co-occurrence probabilities. Machine Learning, 34(1–3), 43–69 (special issue on natural language learning).
Fan, W., Gordon, M. D., & Pathak, P. (2005). Effective profiling of consumer information retrieval needs: A unified framework and empirical comparison. Decision Support Systems, 40(2), 213–233.
Fellbaum, C. (1995). Co-occurrence and antonymy. International Journal of Lexicography, 8(4), 281–303.
Fellbaum, C. (1998). WordNet: An Electronic Lexical Database. Cambridge, MA: The MIT Press.
Florian, R., & Yarowsky, D. (2002). Modeling consensus: Classifier combination for word sense disambiguation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 25–32).
Gauch, S., Wang, J., & Rachakonda, S. M. (1999). A corpus analysis approach for automatic query expansion and its extension to multiple databases. ACM Transactions on Information Systems, 17(3), 250–269.
Good, I. J. (1950). Probability and the Weighing of Evidence. London: Charles Griffin.
Halliday, M. A. K., & Hasan, R. (1976). Cohesion in English. London: Longman.
Han, L., Sun, L., Chen, G., & Xie, L. (2006). ADSS: An approach to determining semantic similarity. Advances in Engineering Software, 37(2), 129–132.
Harris, Z. S. (1968). Mathematical Structures of Language. New York: John Wiley.
Hirst, G., & Budanitsky, A. (2005). Correcting real-word spelling errors by restoring lexical cohesion. Natural Language Engineering, 11(1), 87–111.
Jardine, N., & Sibson, R. (1971). Mathematical Taxonomy. London: John Wiley & Sons.
Jiang, J. J., & Conrath, D. W. (1997). Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the 10th International Conference on Research in Computational Linguistics (pp. 19–33).
Kang, H., & Choi, K. (1997). Two-level document ranking using mutual information in natural language information retrieval. Information Processing and Management, 33(3), 289–306.
Kim, M., & Choi, K. (1999). A comparison of collocation-based similarity measures in query expansion. Information Processing and Management, 35(1), 19–30.
Kullback, S. (1959). Information Theory and Statistics. New York: Wiley.
Kwon, O., Kim, M., & Choi, K. (1994). Query expansion using domain-adapted, weighted thesaurus in an extended Boolean model. In Proceedings of the 3rd International Conference on Information and Knowledge Management (ACM-CIKM) (pp. 140–146).
Lafferty, J., & Zhai, C. (2001). Document language models, query models, and risk minimization for information retrieval. In Proceedings of the 24th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval (pp. 111–119).
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104(2), 211–240.
Lee, J. H., Kim, M. H., & Lee, Y. J. (1993). Information retrieval based on conceptual distance in is-a hierarchies. Journal of Documentation, 49, 188–207.
Lee, L. (1999). Measures of distributional similarity. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (pp. 25–32).
Mandala, R., Tokunaga, T., & Tanaka, H. (2000). Query expansion using heterogeneous thesauri. Information Processing and Management, 36(3), 361–378.
Marx, Z., Dagan, I., Buhmann, J., & Shamir, E. (2002). Coupled clustering: A method for detecting structural correspondence. Journal of Machine Learning Research, 3, 747–780.
McCarthy, D., Koeling, R., Weeds, J., & Carroll, J. (2004). Finding predominant word senses in untagged text. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (pp. 280–287).
Medin, D. L., Goldstone, R. L., & Gentner, D. (1990). Similarity involving attributes and relations: Judgments of similarity and difference are not inverses. Psychological Science, 1(1), 64–69.
Miller, G. (1990). WordNet: An on-line lexical database. International Journal of Lexicography, 3(4), 235–244 (special issue).
Mohammad, S., & Hirst, G. (submitted for publication). Distributional measures as proxies for semantic relatedness. Available from:
Mohammad, S., & Hirst, G. (2006a). Determining word sense dominance using a thesaurus. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (pp. 121–128).
Mohammad, S., & Hirst, G. (2006b). Distributional measures of concept-distance: A task-oriented evaluation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
Moldovan, D., Badulescu, A., Tatu, M., Antohe, D., & Girju, R. (2004). Models for the semantic classification of noun phrases. In Proceedings of the Workshop on Computational Lexical Semantics (pp. 60–67).
Morris, J., & Hirst, G. (1991). Lexical cohesion computed by thesaural relations as an indicator of the structure of text. Computational Linguistics, 17(1), 21–48.
Morris, J., & Hirst, G. (2004). Non-classical lexical semantic relations. In Proceedings of the HLT-NAACL Workshop on Computational Lexical Semantics (pp. 46–51).
Nastase, V., & Szpakowicz, S. (2003). Exploring noun-modifier semantic relations. In Fifth International Workshop on Computational Semantics (pp. 285–301).
Pantel, P., & Lin, D. (2002). Discovering word senses from text. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (pp. 613–619).
Patwardhan, S., Banerjee, S., & Pedersen, T. (2003). Using measures of semantic relatedness for word sense disambiguation. In Proceedings of the 4th International Conference on Intelligent Text Processing and Computational Linguistics (pp. 241–257).
Pekar, V., & Staab, S. (2003). Word classification based on combined measures of distributional and semantic similarity. In Proceedings of the Research Note Sessions of the 10th Conference of the European Chapter of the Association for Computational Linguistics (pp. 147–150).
Rao, C. R. (1982). Diversity and dissimilarity coefficients: A unified approach. Theoretical Population Biology, 21, 24–43.
Resnik, P. (1999). Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research, 11, 95–130.
Richardson, R., Smeaton, A., & Murphy, J. (1994). Using WordNet as a knowledge base for measuring semantic similarity between words. In Proceedings of the AICS Conference.
Rocchio, J. J. (1971). Relevance feedback in information retrieval. In The SMART Retrieval System: Experiments in Automatic Document Processing (pp. 313–323).
Rodriguez, M. A., & Egenhofer, M. J. (2003). Determining semantic similarity among entity classes from different ontologies. IEEE Transactions on Knowledge and Data Engineering, 15(2), 442–456.
Seco, N., Veale, T., & Hayes, J. (2004). An intrinsic information content metric for semantic similarity in WordNet. In Proceedings of the 16th European Conference on Artificial Intelligence.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423, 623–656.
Sibson, R. (1969). Information radius. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 14, 149–160.
Stevenson, M., & Greenwood, M. A. (2005). A semantic approach to IE pattern induction. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (pp. 379–386).
Turney, P. D. (2006). Similarity of semantic relations. Computational Linguistics, 32(3), 379–416.
Turney, P. D., Littman, M. L., Bigham, J., & Shnayder, V. (2003). Combining independent modules to solve multiple-choice synonym and analogy problems. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (pp. 482–489).
van Rijsbergen, C. J. (1979). Information Retrieval (2nd ed.). London: Butterworths.
Voorhees, E. M., & Harman, D. (1999). Overview of the Seventh Text REtrieval Conference (TREC-7). In The 7th Text REtrieval Conference (TREC-7) (pp. 1–23). NIST Special Publication.
Weeds, J., & Weir, D. (2005). Co-occurrence retrieval: A flexible framework for lexical distributional similarity. Computational Linguistics, 31(4), 439–475.
Wong, S. K. M., & Yao, Y. Y. (1989). A probability distribution model for information retrieval. Information Processing and Management, 25(1), 39–53.
Wong, S. K. M., & Yao, Y. Y. (1992). An information-theoretic measure of term specificity. Journal of the American Society for Information Science, 43(1), 54–61.
Xu, J., & Croft, W. B. (1996). Query expansion using local and global document analysis. In Proceedings of the 19th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval (pp. 4–11).
Zhai, C., & Lafferty, J. (2001). Model-based feedback in the language modeling approach to information retrieval. In Proceedings of the 24th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval (pp. 403–410).