Transitive closures of fuzzy thesauri for information-retrieval systems


Int. J. Man-Machine Studies (1986) 25, 343-356

JAMES C. BEZDEK, GAUTAM BISWAS AND LI-YA HUANG

Computer Science Department, University of South Carolina, Columbia, SC 29208, U.S.A.

(Received January 1986 and in revised form June 1986)

In this paper we represent a thesaurus (R) for an information system as the sum of two fuzzy relations, S (synonyms) and G (generalizations). The max-star completion of R is defined as $\hat{R}_*$, the max-star transitive closure of R. We interpret $\hat{R}_*$, which extends the concept-pair fuzzy relation R initially provided by an expert, as a linguistic completion of the thesaurus. Six max-star completions, corresponding to six well-known T-norms, are defined, analysed, and numerically illustrated on a nine-term dictionary. The application of our results in the context of document retrieval is this: one may use $\hat{R}_*$ as a means of effecting replacements of terms appearing in a natural-language document request. The weights $(\hat{R}_*)_{ij}$ can be used to diminish or increase one's confidence in the degree of support being developed for each document considered relevant to a given query. The ijth element of $\hat{R}_*$ can be regarded as the ultimate extent to which term j can be "reached" from term i; the values in $\hat{R}_*$ thus represent degrees of confidence in max-star transitive chains.

0020-7373/86/090343+14$03.00/0 © 1986 Academic Press Inc. (London) Limited

1. Introduction

The need to relate natural-language queries in a free vocabulary to document descriptors necessitates an information structure powerful enough to capture relationships among terms of descriptors that represent the subject domain concepts. By vocabulary we mean here a set of descriptors (keywords, terms, phrases, etc.) that are relevant to a particular concept or topic domain. There are, generally speaking, two approaches to information structures: classification, wherein hierarchical groups of concepts are constructed by subdividing the technical fields into classes described by the vocabulary; and the thesaurus approach, wherein specific descriptors (or groups thereof) are related to each other through, e.g. synonym or generalization (implication) tables. These approaches are described at length in Salton & McGill (1983) and Bartschi (1985).

The prototypical retrieval system described in Biswas, Subramanian, Marques & Bezdek (1985) uses two fuzzy relations to represent the thesaurus. Specifically, we have developed a retrieval model that uses a synonym relation S and an implication relation G to relate descriptor pairs in the concept domain. In this paper we describe an extension of the model realized by considering the sum of S and G, say R = S + G, and its transitive closure ($\hat{R}$) as a basis for chaining to develop a degree of support for each document being considered for a query. Another purpose of this paper is to emphasize that there are many choices available for representation of the relational data; towards this end, we shall illustrate the calculation and interpretation of six different forms of the transitive closure of R.

Consider the problem of taking a document or search request presented in natural language and attempting to use an automatic procedure to generate the content identifications associated with the request. This task involves many difficulties because of the complexity and diversity of natural language. One of the principal difficulties stems from the fact that many distinct words are often used to supply the same (or nearly the same) meaning. Thus synonyms arise as an artifact of the evolution of natural language. A second and broader concept is the idea of generalization (narrower terms to broader ones), and its converse, specialization (broader terms to narrower ones). Generalization and specialization are one form of reasoning towards objectives represented through relationships between natural-language phrase pairs. This is precisely the action of the human expert that we hope to approximate by our relational model: the ability to draw inferences about the best retrieval for a given request by understanding descriptor-pair relations. Thus, our model must be able to elicit, represent, manipulate, and draw inferences with synonym, generalization, and specialization relations on descriptor pairs.

The basic structure we have chosen for language normalization is the thesaurus. We define a thesaurus as a list of domain-specific terms together with two numerical relationships, viz., synonyms and generalizations (narrow to broad). Other writers use different terminology; e.g. Miyamoto, Miyake & Nakayama (1983) prescribe four kinds of relations, viz., synonyms, broader terms, narrower terms, and related terms. In any case the extraction of exhaustive numerical relationships for dictionary pairs from an expert is virtually impossible in a real-world situation wherein the dictionary may contain thousands of terms. For example, with n = 1000 terms and two relations there are (potentially) 2(n(n - 1)/2) = 999 000 descriptor pairs that might be supplied by a catalogue expert. From this it is clear that no expert will exhaust the possible set of relations one might pose for a thesaurus.
One of the objectives of this paper is to develop a means for completing a set of partial relationships supplied to the system by an expert. More specifically, we show below that transitive closures of R can be used to construct "complete" thesauri from partial relational information. A related aspect of system design that will not be further discussed below is the matter of representational schemes or data structures which are adequate and appropriate for overall implementation in an information retrieval system.

2. Related literature

In this section we present a brief description of several other approaches to the construction of a thesaurus. Salton (1971) describes a fully automatic thesaurus construction method based on the vocabulary contained in a sample document collection assumed to be typical for a given subject area. A frequency count is made of the words contained in a set of documents, and each document is identified by certain high-frequency words included in it. The sample collection is represented initially by a term-document, or concept-document, matrix. The matrix element at the intersection of row i and column j represents the weight of term j in document i, which is a non-negative integer. Then similarity coefficients between terms, based on co-occurrence characteristics of the terms in the documents, are computed. For example, the document-term matrix shown in Table 1 might, with an associative procedure, result in the formation of three term (thesaurus) groups: {T1, T6}, {T4, T7}, {T2, T3, T5}.

TABLE 1

            Terms
Document    T1   T2   T3   T4   T5   T6   T7
D1           3    0    0    2    0    6    1
D2           0    0    1    3    2    0    2
D3           0    2    3    0    4    0    0
D4           1    2    1    0    3    1    0
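The grouping step described above can be illustrated with a small sketch. The text does not say which similarity coefficient is used, so the cosine coefficient below is an assumption, as are the function names; the matrix is transcribed from Table 1.

```python
import math

# Document-term weights transcribed from Table 1 (rows D1..D4, columns T1..T7).
WEIGHTS = [
    [3, 0, 0, 2, 0, 6, 1],  # D1
    [0, 0, 1, 3, 2, 0, 2],  # D2
    [0, 2, 3, 0, 4, 0, 0],  # D3
    [1, 2, 1, 0, 3, 1, 0],  # D4
]


def term_vector(j):
    """Column j of the matrix: the weights of term T(j+1) across the documents."""
    return [row[j] for row in WEIGHTS]


def cosine(u, v):
    """Cosine coefficient (an assumed choice; the source names no specific measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def term_similarity():
    """Symmetric term-term similarity matrix built from co-occurrence weights."""
    n = len(WEIGHTS[0])
    return [[cosine(term_vector(i), term_vector(j)) for j in range(n)]
            for i in range(n)]


def most_similar(sim, i):
    """Index of the term, other than i, with the largest similarity to term i."""
    return max((j for j in range(len(sim)) if j != i), key=lambda j: sim[i][j])


if __name__ == "__main__":
    sim = term_similarity()
    for i in range(len(sim)):
        print(f"T{i + 1} is closest to T{most_similar(sim, i) + 1}")
```

Under this assumed measure T1 and T6 are each other's nearest neighbours, as are T4 and T7, while T2, T3 and T5 each have their nearest neighbour among themselves, which agrees with the three groups quoted in the text.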

Lancaster (1972) described the use of correlation coefficients for the similarity of terms. Radecki (1976) proposed a mathematical model of an information retrieval system based on the concept of a fuzzy thesaurus. He defined a fuzzy thesaurus as a set of terms T satisfying the following conditions:

(a) There is a fuzzy relation $\mu_s : T \times T \to [0, 1]$ which is reflexive, symmetric, and transitive, supplied by an expert [see equations (3) below].

(b) There is a non-empty set $t_d \subset T$, called a set of elementary descriptors, and a $\lambda$-synonymic relation $s_\lambda \subset T \times T$. $s_\lambda$ is used to drive the retrieval process.

Radecki's method is well summarized by Zenner, Caluwe & Kerre (1985), who propose a modification of Radecki's model that circumvents several difficulties of the $\lambda$-cut approach.

Miyamoto et al. (1983) use a set-theoretical model of an abstract thesaurus and relate it to co-occurrence frequencies by employing fuzzy-set theory to generate a fuzzy pseudothesaurus. Their main idea is as follows. Let $W = \{\omega_1, \omega_2, \ldots, \omega_m\}$ be the set of keywords and $C = \{c_1, c_2, \ldots, c_n\}$ be the set of concepts. A function $h : W \to [0, 1]^C$ maps each word $\omega_i$ to an associated fuzzy set of concepts $h(\omega_i)$. An additive measure M is defined on fuzzy subsets of C. Two fuzzy relations are introduced:

$$s(\omega_i, \omega_j) = M(h(\omega_i) \cap h(\omega_j)) / M(h(\omega_i) \cup h(\omega_j)) \tag{1(a)}$$

$$t(\omega_i, \omega_j) = M(h(\omega_i) \cap h(\omega_j)) / M(h(\omega_i)) \tag{1(b)}$$

Equation 1(a) means $\omega_i$ and $\omega_j$ are fuzzy related terms (RT) with value s. Equation 1(b) means $\omega_i$ is a fuzzy narrower term of $\omega_j$ with value t. These relations correspond to the synonym and generalization relations discussed below; however, our relations are not derived with the equations in (1). In the pseudothesaurus, the set of concepts is replaced by the set of various articles or bibliographic citations. Map h is derived from the matrix $[h_{ij}]$ of co-occurrences, where $h_{ij}$ is the frequency with which the keyword $\omega_i$ occurs in the article $c_j$. Then:

$$s(\omega_i, \omega_j) = \sum_k \min(h_{ik}, h_{jk}) \Big/ \sum_k \max(h_{ik}, h_{jk}) \tag{2(a)}$$

$$t(\omega_i, \omega_j) = \sum_k \min(h_{ik}, h_{jk}) \Big/ \sum_k h_{ik} \tag{2(b)}$$

An algorithm for pseudothesaurus generation and several numerical examples of this method of thesaurus construction are given in Miyamoto et al. (1983).
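As a minimal sketch of equations 2(a) and 2(b), the following computes both degrees from a small co-occurrence matrix. The counts in `H` and the function names are hypothetical and are chosen only to make the arithmetic easy to follow; they are not taken from Miyamoto et al. (1983).

```python
def overlap(h, i, j):
    """Shared numerator of 2(a) and 2(b): sum over k of min(h_ik, h_jk)."""
    return sum(min(a, b) for a, b in zip(h[i], h[j]))


def related_degree(h, i, j):
    """Equation 2(a): degree to which keywords i and j are related terms."""
    denominator = sum(max(a, b) for a, b in zip(h[i], h[j]))
    return overlap(h, i, j) / denominator if denominator else 0.0


def narrower_degree(h, i, j):
    """Equation 2(b): degree to which keyword i is a narrower term of keyword j."""
    denominator = sum(h[i])
    return overlap(h, i, j) / denominator if denominator else 0.0


# Hypothetical co-occurrence counts: rows are keywords, columns are articles.
H = [
    [2, 1, 0, 0],  # keyword 0
    [4, 2, 1, 0],  # keyword 1 occurs wherever keyword 0 does, and more often
    [0, 0, 3, 1],  # keyword 2 shares no article with keyword 0
]

if __name__ == "__main__":
    print(related_degree(H, 0, 1), narrower_degree(H, 0, 1), narrower_degree(H, 1, 0))
```

With these counts keyword 0 is fully a narrower term of keyword 1 (degree 3/3), while the converse degree and the related-term degree are both 3/7. This shows why t is not symmetric although s is.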


3. Mathematical model of the (S, G) thesaurus

Zadeh (1965) introduced the notion of fuzzy sets to enable quantification of nonstatistical imprecision. Many other authors have developed retrieval system models which utilize the fuzzy-sets approach. Important progress in this area can be found in the works of Buell & Kraft (1981), Buckles & Petry (1983), Anvari & Rose (1987), and Eastman (1983). The basic structure of our model is the fuzzy relation, first defined by Zadeh (1965), and expanded upon in Zadeh (1971). Briefly, if $X = \{x_1, x_2, \ldots, x_n\}$ is any set of objects, a fuzzy relation in X is a function $\rho : X \times X \to [0, 1]$. The (n × n) matrix of values $[\rho(x_i, x_j)]$ is a convenient characterization of $\rho$: we shall call $R = [r_{ij}] = [\rho(x_i, x_j)]$ the relation matrix of $\rho$ (and, as is customary, we may call R "the relation" when no confusion can arise). The value $r_{ij}$ is the degree of strength of relationship between $x_i$ and $x_j$. We say that:

$$R \text{ is reflexive} \iff I_n \le R; \tag{3(a)}$$

$$R \text{ is symmetric} \iff R = R^T; \tag{3(b)}$$

$$R \text{ is max-}* \text{ transitive} \iff R \ge R(\vee *)R = R^2. \tag{3(c)}$$

$I_n$ in 3(a) is the (n × n) identity matrix; $(\le)$ is matrix ordering component by component (thus, e.g. R is reflexive iff $r_{ii} = 1\ \forall i$); superscript (T) means transpose; and $(\vee *)$ denotes generalized matrix multiplication using the maximum $(\vee)$ over n pairwise star $(*)$ operations. More specifically, $(\vee *)$ means that whenever real matrices A, B are commensurable, their product $P = A(\vee *)B$ has for its ijth entry

$$p_{ij} = \bigvee_{k=1}^{n} (a_{ik} * b_{kj}), \tag{4}$$

where $(*)$ is a commutative, associative binary operation in $\mathbb{R} \times \mathbb{R}$. In the sequel our interest lies with six star operators, commonly called T-norms (Bonissone & Decker, 1985), which will be defined at length below. With this preliminary structure we now define the relations that comprise our thesaurus:

Definition 1. Let $T = \{t_1, \ldots, t_n\}$ be a dictionary of document descriptors. S is a set of synonyms for T in case S is a reflexive, symmetric fuzzy relation in T × T.

Note that we include as synonyms those descriptor pairs that are "partially" and symmetrically equivalent. For example, consider the words: $t_i$ = about; $t_j$ = around; and $t_k$ = nearly. One might agree to let "about" and "nearly" be fully synonymous, so $s_{ik} = s_{ki} = 1$; and perhaps that "around" can be taken as a partial replacement for either of these, say $s_{ij} = s_{ji} = 0.8$, and $s_{jk} = s_{kj} = 0.9$.

Definition 2. Let $T = \{t_1, \ldots, t_n\}$ be a dictionary of document descriptors. G is a set of generalizations for T in case G is a fuzzy relation in T × T such that: $g_{ii} = 0\ \forall i$, and if $g_{ij} > 0$, $g_{ij} = g(t_i, t_j)$ denotes the extent to which the narrower term $t_i$ implies the broader term $t_j$.
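The generalized product (4) and the three properties in (3) can be sketched in a few lines. The function names are illustrative, the star operation is passed in as an ordinary two-argument function, and `min` is used here only as one familiar choice of $(*)$. The matrix `S` encodes the about/around/nearly example of Definition 1.

```python
def max_star(A, B, star):
    """Generalized product (4): entry ij is the max over k of star(a_ik, b_kj)."""
    inner = len(B)
    return [[max(star(A[i][k], B[k][j]) for k in range(inner))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def is_reflexive(R):
    """3(a): every diagonal entry equals 1."""
    return all(R[i][i] == 1.0 for i in range(len(R)))


def is_symmetric(R):
    """3(b): R equals its transpose."""
    n = len(R)
    return all(R[i][j] == R[j][i] for i in range(n) for j in range(n))


def is_transitive(R, star, tol=1e-12):
    """3(c): the max-star square of R does not exceed R in any entry."""
    P = max_star(R, R, star)
    n = len(R)
    return all(P[i][j] <= R[i][j] + tol for i in range(n) for j in range(n))


# Synonym degrees from the example: order of terms is about, around, nearly.
S = [
    [1.0, 0.8, 1.0],
    [0.8, 1.0, 0.9],
    [1.0, 0.9, 1.0],
]

if __name__ == "__main__":
    print(is_reflexive(S), is_symmetric(S), is_transitive(S, min))
    print(max_star(S, S, min)[0][1])
```

The example relation is reflexive and symmetric, as Definition 1 requires, but it is not max-min transitive: chaining "about" to "around" through "nearly" gives min(1, 0.9) = 0.9, which exceeds the directly supplied 0.8.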


Remark 1. Any generalization from $t_i$ to $t_j$ also implies particularization from $t_j$ to $t_i$. However, the value $g_{ji}$ needed for the reverse implication cannot in general be computed from $g_{ij}$, and will generally differ functionally from pair to pair.

Remark 2. We have chosen to define $g_{ii} = 0$ to make G a pure parent relation. A mathematical reason for this choice will be apparent below.

Remark 3. We tacitly assume that if $s_{ij} = s_{ji} > 0$ for $(x_i, x_j) \in T \times T$, then $g_{ij} = g_{ji} = 0$. That is, if a concept pair are synonymous, they cannot have a non-symmetric generalization-specialization relationship; and conversely, if $g_{ij}$ and/or $g_{ji} > 0$, then $s_{ij} = s_{ji} = 0$. In other words, synonymity and generalization are assumed to be mutually exclusive ideas.

Remark 4. In our formulation we assume that the entries of S and G are obtained by an interactive knowledge engineering session involving a domain expert and an indexing specialist. This approach is not currently taken in practice; one of the objectives of this paper is to suggest that it should be. The extension of R through transitive closure relieves the human interface of worries about consistency and completeness. The difficulties inherent in obtaining reliable numerical values have resulted in a new scheme based on linguistic transitive closures that enable the semantic net to be specified in linguistic terms (cf. Bezdek, Pettus, Stephens & Zhang, in press).

As an example, consider the terms: $t_i$ = production-system; $t_j$ = rule-based; and $t_k$ = antecedent.

Since "antecedent" is narrower than "production-system", we might put $g_{ki} = 0.8$; and since "production-system" does not imply "antecedent", $g_{ik} = 0$. On the other hand, "production-system" and "rule-based" are synonyms in the domain, so we define $s_{ij} = s_{ji} = 1$ and $g_{ij} = g_{ji} = 0$. The matrix G represents both (narrow to broad) and (broad to narrow) implication relations between document descriptor pairs. If we take the sum of S and G, we have a fuzzy thesaurus for T as in Table 2.

TABLE 2
An (S, G) thesaurus for T in (6)

R      prs   rub  |  ant   prc  |  pat   pre   evi   dat   hyp
prs   1.00  1.00  | 0.00  0.00  | 0.00  0.00  0.00  0.00  0.00
rub   1.00  1.00  | 0.00  0.00  | 0.00  0.00  0.00  0.00  0.00
------------------+-------------+------------------------------
ant   1.00  0.00  | 1.00  1.00  | 0.50  0.60  0.80  0.30  0.20
prc   0.00  0.00  | 1.00  1.00  | 0.00  0.00  0.00  0.00  0.00
------------------+-------------+------------------------------
pat   0.00  0.00  | 0.50  0.00  | 1.00  0.00  0.00  0.00  0.00
pre   0.00  0.00  | 0.60  0.00  | 0.00  1.00  0.00  0.00  0.00
evi   0.00  0.00  | 0.80  0.00  | 0.00  0.00  1.00  0.00  0.00
dat   0.00  0.00  | 0.30  0.00  | 0.00  0.00  0.00  1.00  0.00
hyp   0.00  0.00  | 0.20  0.00  | 0.00  0.00  0.00  0.00  1.00


Definition 3. Let $T = \{t_1, \ldots, t_n\}$ be a dictionary of document descriptors. Let S and G be synonym and generalization relations, respectively, for T. The (S, G) thesaurus of T is the fuzzy relation

$$R = S + G. \tag{5}$$

We refer to S and G in (5) as an (S, G) decomposition of R. In (5) "+" is the usual matrix addition. R is reflexive but not necessarily symmetric. We have exhibited the two components of R to emphasize that the expert uses different forms of reasoning about natural-language relationships during the construction of a thesaurus for T. The symbol R is used to suggest that $r_{ij}$ assesses the extent to which $t_i$ can be used to "replace" $t_j$, i.e., R is in some sense a replacement relation for T. However, R is clearly incomplete in all but trivial instances of (n), because the expert will not be able to supply all of the potential relationships in S and G. As an example, let $T = \{t_1, t_2, \ldots, t_9\}$ be:

t1 = production-system    prs;    6(a)
t2 = rule-based           rub;    6(b)
t3 = antecedent           ant;    6(c)
t4 = precedent            prc;    6(d)
t5 = pattern              pat;    6(e)
t6 = premise              pre;    6(f)
t7 = evidence             evi;    6(g)
t8 = data                 dat;    6(h)
t9 = hypothesis           hyp;    6(i)

Table 2 exhibits one (S, G) thesaurus for T that might have been supplied by an expert. We draw attention to the blocked structure of R fashioned by lines, which can be symbolically represented as

$$R = \begin{bmatrix} R_{11} & R_{12} & R_{13} \\ R_{21} & R_{22} & R_{23} \\ R_{31} & R_{32} & R_{33} \end{bmatrix}, \quad \text{where, e.g. } R_{11} = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \text{ etc.}$$

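This structure is related to the (S, G) decomposition of R as follows:

$$R = S + G = \begin{bmatrix} R_{11} & 0 & 0 \\ 0 & R_{22} & R_{23} \\ 0 & R_{32} & R_{33} \end{bmatrix} + \begin{bmatrix} 0 & 0 & 0 \\ R_{21} & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}.$$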


TABLE 3
The (S, G) decomposition of R in Table 2

S      prs   rub   ant   prc   pat   pre   evi   dat   hyp
prs   1.00  1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
rub   1.00  1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
ant   0.00  0.00  1.00  1.00  0.50  0.60  0.80  0.30  0.20
prc   0.00  0.00  1.00  1.00  0.00  0.00  0.00  0.00  0.00
pat   0.00  0.00  0.50  0.00  1.00  0.00  0.00  0.00  0.00
pre   0.00  0.00  0.60  0.00  0.00  1.00  0.00  0.00  0.00
evi   0.00  0.00  0.80  0.00  0.00  0.00  1.00  0.00  0.00
dat   0.00  0.00  0.30  0.00  0.00  0.00  0.00  1.00  0.00
hyp   0.00  0.00  0.20  0.00  0.00  0.00  0.00  0.00  1.00

G      prs   rub   ant   prc   pat   pre   evi   dat   hyp
prs   0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
rub   0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
ant   1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
prc   0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
pat   0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
pre   0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
evi   0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
dat   0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
hyp   0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00

The (S, G) decomposition of R is shown in Table 3. Note especially that G contains exactly one non-zero entry in this example, viz. $g_{31} = 1$. This single value will provide linking for many term set pairs upon taking the transitive closures of R. Further consideration of Table 3 leads one to ask: are there relationships between pairs of terms in T × T that were not supplied by the constructing source? Clearly this is the case. In section 4 we describe a method which "completes" R (mathematically). It remains to be seen whether or not the completion of R by our method provides a "better" thesaurus for the retrieval system at hand than the one supplied by an expert.
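The nine-term example can be reproduced with a short sketch that builds R from the expert-supplied degrees and then splits it into S and G. The splitting rule is an assumption based on Remark 3: an entry is treated as a synonym degree when it equals its mirror entry, and as a generalization degree otherwise. This rule would misclassify a pair whose two generalization degrees happened to be equal. Function names are illustrative.

```python
TERMS = ["prs", "rub", "ant", "prc", "pat", "pre", "evi", "dat", "hyp"]


def build_thesaurus():
    """Assemble the relation R of the nine-term example, entry by entry."""
    n = len(TERMS)
    idx = {term: k for k, term in enumerate(TERMS)}
    R = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

    def synonym(a, b, degree):
        # Synonym degrees are symmetric, so both entries are set.
        R[idx[a]][idx[b]] = R[idx[b]][idx[a]] = degree

    synonym("prs", "rub", 1.0)
    synonym("ant", "prc", 1.0)
    for term, degree in [("pat", 0.5), ("pre", 0.6), ("evi", 0.8),
                         ("dat", 0.3), ("hyp", 0.2)]:
        synonym("ant", term, degree)
    # The single generalization: "antecedent" implies "production-system".
    R[idx["ant"]][idx["prs"]] = 1.0
    return R


def decompose(R):
    """Split R into (S, G), assuming symmetric entries are synonym degrees."""
    n = len(R)
    S = [[R[i][j] if R[i][j] == R[j][i] else 0.0 for j in range(n)]
         for i in range(n)]
    G = [[R[i][j] - S[i][j] for j in range(n)] for i in range(n)]
    return S, G


if __name__ == "__main__":
    S, G = decompose(build_thesaurus())
    nonzero = [(TERMS[i], TERMS[j], G[i][j])
               for i in range(len(G)) for j in range(len(G)) if G[i][j] > 0]
    print(nonzero)
```

Applied to this example, the rule leaves exactly one non-zero entry in G, the one from "ant" to "prs", in agreement with the text.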

4. Transitive closure algorithms

The closure $\hat{R}$ of a fuzzy relation R is a relation derived from R that has some specific property. Regardless of the property involved, $\hat{R}$ should have the smallest number of additional pairs $(x_i, x_j)$ unioned with those in R that result in $\hat{R}$ having the desired property. As examples, we describe three kinds of closures: the reflexive, symmetric, and $(\vee *)$ transitive closures of R.

(a) The reflexive closure of R is $R \vee I_n$. The reflexive closure of relation R in Table 2 is R itself, because R was already reflexive.

TABLE 4
The symmetric closure of R in Table 2

       prs   rub   ant   prc   pat   pre   evi   dat   hyp
prs   1.00  1.00  1.00  0.00  0.00  0.00  0.00  0.00  0.00
rub   1.00  1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
ant   1.00  0.00  1.00  1.00  0.50  0.60  0.80  0.30  0.20
prc   0.00  0.00  1.00  1.00  0.00  0.00  0.00  0.00  0.00
pat   0.00  0.00  0.50  0.00  1.00  0.00  0.00  0.00  0.00
pre   0.00  0.00  0.60  0.00  0.00  1.00  0.00  0.00  0.00
evi   0.00  0.00  0.80  0.00  0.00  0.00  1.00  0.00  0.00
dat   0.00  0.00  0.30  0.00  0.00  0.00  0.00  1.00  0.00
hyp   0.00  0.00  0.20  0.00  0.00  0.00  0.00  0.00  1.00

(b) The symmetric closure of R is $R \vee R^T$. The symmetric closure of relation R in Table 2 is shown in Table 4.

(c) The $(\vee *)$ transitive closure of R is $\hat{R}_* = R \vee R^{n-1}$, where $R^k$ is calculated recursively, $R^k = R \vee R^{k-1}$, and powers of R on the right-hand side are computed as $(\vee *)$ products using (4). Zadeh (1971) proved that $\hat{R}_*$ was well-defined for $(*) = (\wedge = \min)$ and $(*) = (\cdot = \text{product})$. Efficient algorithms for computing $\hat{R}_*$ for the $(\vee \wedge)$ case are discussed by Dunn (1973) and Kandel & Yelowitz (1974). Algorithmic constructions of $\hat{R}_*$ via the methods presented in Zadeh (1971), Dunn (1973) and Kandel & Yelowitz (1974) are, respectively, essentially $O(n^4)$, $O(n^2)$, and $O(n^3)$ processes in the number of arithmetic operations. This issue is crucially important if powers of R must be calculated on line. One of the virtues of representing a thesaurus R by its transitive closure $\hat{R}_*$ is that $\hat{R}_*$ can be computed once off-line, and accessed during retrieval as a look-up table. This provides significant speed-up during the retrieval computations, and demonstrates one of the greatest advantages of our approach, especially as the number of domain terms becomes large. Moderate values of n ($\approx 10^3$) will require processing times (off-line) in the hours range (our system has n = 144, and took about 1.5 h for each $\hat{R}_*$ on the VAX 11/780). As n becomes large, Dunn's algorithm clearly becomes the technique of choice.

Bezdek & Harris (1978) discussed other forms for $(*)$, including the so-called max-$\Delta$ or $(\vee \Delta)$ form, wherein $(*) = \Delta$ and $\Delta$ is defined as $a \,\Delta\, b = \vee(0, a + b - 1)$ for $a, b \in [0, 1]$. Subsequent work by Bandler & Kohout (1985) and others has led to a systematization of $(*)$s which is well summarized in Bonissone & Decker (1985). Specifically, they identify six T-norms for $a, b \in [0, 1]$ as follows:

$$T_0:\quad a \circ b = \begin{cases} \wedge(a, b), & \vee(a, b) = 1 \\ 0, & \text{otherwise} \end{cases} \tag{7(a)}$$

$$T_1:\quad a \,\Delta\, b = \vee(0, a + b - 1) \tag{7(b)}$$

$$T_{1.5}:\quad a \,\nabla\, b = (ab)/(2 - (a + b - ab)) \tag{7(c)}$$

$$T_2:\quad a \cdot b = ab \tag{7(d)}$$

$$T_{2.5}:\quad a \bullet b = (ab)/(a + b - ab) \tag{7(e)}$$

$$T_3:\quad a \wedge b = \wedge(a, b) \tag{7(f)}$$

[TABLE 5: the six transitive closures of the thesaurus R listed in Table 2]
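The six operators in (7) translate directly into code. The sketch below is illustrative only: the function names are hypothetical, the 0/0 case of $T_{2.5}$ at a = b = 0 is resolved to 0 by assumption, and a small tolerance absorbs floating-point rounding when the operators are compared.

```python
def t0(a, b):
    """T0, 7(a): min(a, b) when max(a, b) = 1, and 0 otherwise."""
    return min(a, b) if max(a, b) == 1.0 else 0.0


def t1(a, b):
    """T1, 7(b): max(0, a + b - 1)."""
    return max(0.0, a + b - 1.0)


def t15(a, b):
    """T1.5, 7(c): ab / (2 - (a + b - ab))."""
    return (a * b) / (2.0 - (a + b - a * b))


def t2(a, b):
    """T2, 7(d): the ordinary product."""
    return a * b


def t25(a, b):
    """T2.5, 7(e): ab / (a + b - ab), taken as 0 when ab = 0 (assumption)."""
    return (a * b) / (a + b - a * b) if a * b > 0.0 else 0.0


def t3(a, b):
    """T3, 7(f): the minimum."""
    return min(a, b)


T_NORMS = [t0, t1, t15, t2, t25, t3]


def values(a, b):
    """The six operator values at (a, b), listed from T0 to T3."""
    return [t(a, b) for t in T_NORMS]


def is_ordered(a, b, tol=1e-9):
    """True when the six values are non-decreasing from T0 to T3."""
    v = values(a, b)
    return all(x <= y + tol for x, y in zip(v, v[1:]))


if __name__ == "__main__":
    grid = [k / 10 for k in range(11)]
    print(all(is_ordered(a, b) for a in grid for b in grid))
    print([round(v, 3) for v in values(0.6, 0.5)])
```

The pair (0.6, 0.5) is used because these are the degrees that link "premise" and "pattern" to "antecedent" in Table 2; the six operators give approximately 0, 0.1, 0.25, 0.3, 0.375 and 0.5 at that point.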

It is easy to show that for any $a, b \in [0, 1]$, we have $T_0 \le T_1 \le T_{1.5} \le T_2 \le T_{2.5} \le T_3$. Since $\hat{R}_*$ exists and is well defined for $(*) = (\wedge)$, this chain of inequalities guarantees us that the same is true for $\hat{R}_*$ computed by $(\vee *)$ using any of the six operators. Moreover, they are clearly ordered as

$$\hat{R}_0 \le \hat{R}_1 \le \hat{R}_{1.5} \le \hat{R}_2 \le \hat{R}_{2.5} \le \hat{R}_3. \tag{8}$$

Finally, since the three algorithms for calculating $\hat{R}_*$ all produce $\hat{R}_3$, it is clear that they possess the same properties for all six $\hat{R}_*$s. These facts have not, to our knowledge, been previously stated; the proofs are obvious. As an example, we exhibit in Table 5 the six transitive closures of the thesaurus R listed in Table 2. Of the nine blocks in the original matrix as partitioned in Table 2, eight bear the same values in $\hat{R}_*$ over the six choices for $(*)$ shown in equations (7). Only $\hat{R}_{33}$ varies with $(*)$. Thus, in Table 5 each entry of $\hat{R}_{33}$ has six values, arranged schematically as follows:

$$\begin{bmatrix} T_0 & T_1 & T_{1.5} \\ T_2 & T_{2.5} & T_3 \end{bmatrix}$$

Any one of the matrices in Table 5 can be viewed as a completion of R through $(\vee *)$ chaining. Before we analyse these relations further, we formalize this idea as:

Definition 4. Let $T = \{t_1, \ldots, t_n\}$ be a dictionary of document descriptors, and let R be an (S, G) thesaurus for T. The max-star $(\vee *)$ completion of R for T is $\hat{R}_*$, the transitive closure of R, computed as

$$\hat{R}_* = R \vee R^{n-1}, \tag{9(a)}$$

where

$$\forall k:\quad R^k = R \vee R^{k-1}, \tag{9(b)}$$

and

$$R^2 = R(\vee *)R, \quad (*) \text{ as in equations [7(a-f)]}. \tag{9(c)}$$
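A minimal sketch of the max-star completion follows. It is not one of the published algorithms cited above: it simply accumulates the maximum over successive $(\vee *)$ powers of R, which is one reading of 9(a)-(c), and it costs on the order of $n^4$ star operations. Three of the six operators are included for brevity, and all function names are illustrative. The matrix is R from Table 2.

```python
# R from Table 2; term order: prs, rub, ant, prc, pat, pre, evi, dat, hyp.
R = [
    [1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 1.0, 0.5, 0.6, 0.8, 0.3, 0.2],
    [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.3, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
]


def bounded(a, b):
    """T1 of 7(b)."""
    return max(0.0, a + b - 1.0)


def product(a, b):
    """T2 of 7(d)."""
    return a * b


def minimum(a, b):
    """T3 of 7(f)."""
    return min(a, b)


def max_star(A, B, star):
    """Generalized product (4) of two square matrices of equal size."""
    n = len(A)
    return [[max(star(A[i][k], B[k][j]) for k in range(n)) for j in range(n)]
            for i in range(n)]


def closure(relation, star):
    """Max-star transitive closure: entrywise max over the powers R, R^2, ..., R^n."""
    n = len(relation)
    power = [row[:] for row in relation]
    result = [row[:] for row in relation]
    for _ in range(n - 1):
        power = max_star(power, relation, star)  # next max-star power of R
        result = [[max(x, y) for x, y in zip(r1, r2)]
                  for r1, r2 in zip(result, power)]
    return result


if __name__ == "__main__":
    PRE, PAT = 5, 4
    for name, star in [("T1", bounded), ("T2", product), ("T3", minimum)]:
        print(name, round(closure(R, star)[PRE][PAT], 2))
```

The loop runs n - 1 times because, for any T-norm, a chain that revisits a term can never be stronger than the same chain with the detour removed, so chains of at most n links suffice.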

In what follows we refer to $\hat{R}_*$ as a max-star thesaurus of R. When $\hat{R}_* = \hat{S}_* + \hat{G}_*$, we shall call $\hat{S}_*$ and $\hat{G}_*$ the max-star synonyms and generalizations (implications) of T; and shall refer to $(\hat{S}_*, \hat{G}_*)$ as a max-star decomposition of $\hat{R}_*$. To relate the max-star thesauri of R to our information retrieval application, we examine the structure of R more closely. The initial relationship of the nine terms in T to each other is shown in Fig. 1.

FIG. 1. Initial relation of terms in T as represented in R.

The max-star thesauri of Table 5 can be decomposed into their $\hat{S}_*$ and $\hat{G}_*$ components as shown in Table 6. We note first that $\hat{S}_*$ is unique and identical for all six operators except in the (33) block. There are six matrices corresponding to the six different blocks for $(\hat{R}_*)_{33}$ (these values are indicated as (****)). From $\hat{S}_*$ we see, for example, that the max-star thesaurus of R will regard "antecedent" and "precedent" as fully equivalent (synonymous) upon max-star completion using any of the six T-norms in (7). This seems plausible, and we emphasize here the point that through max-star completion this relationship on T × T is computed, not elicited from an expert. As a further example, we find that $\hat{S}_*$ suggests that "antecedent" and "premise" are partially synonymous (i.e. mutually replaceable to the extent 0.60). This would enable us to replace either term for the other in a retrieval situation, but the extent to which a document matches a "partially replaced" term would be weakened (by using the 0.60 in some way that is left unspecified in this paper).
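Since the paper leaves the weakening rule unspecified, the following is only one hypothetical way a completed thesaurus could be consulted as a look-up table during retrieval. The rule used here, taking the best replacement degree among the terms of a document, is an assumption and not the authors' method. The dictionary `REPLACE` and both function names are illustrative; the two degrees are the ones quoted in the text.

```python
# Hypothetical excerpt of a completed thesaurus:
# REPLACE[(a, b)] is the degree to which term a may replace term b.
REPLACE = {
    ("antecedent", "precedent"): 1.00,
    ("precedent", "antecedent"): 1.00,
    ("antecedent", "premise"): 0.60,
    ("premise", "antecedent"): 0.60,
}


def replacement_degree(candidate, query_term, table):
    """Degree to which `candidate` can stand in for `query_term` (1 for an exact match)."""
    if candidate == query_term:
        return 1.0
    return table.get((candidate, query_term), 0.0)


def term_support(query_term, document_terms, table):
    """Assumed rule: the best replacement degree over the document's terms."""
    return max((replacement_degree(t, query_term, table) for t in document_terms),
               default=0.0)


if __name__ == "__main__":
    print(term_support("premise", {"antecedent", "data"}, REPLACE))
    print(term_support("premise", {"premise"}, REPLACE))
    print(term_support("premise", {"data"}, REPLACE))
```

A document indexed only by "antecedent" would then support a query on "premise" to the extent 0.60 rather than 1, which is the kind of weakening the paragraph above describes.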

TABLE 6
Decomposition of $\hat{R}_*$ into its synonym and implicational components

$\hat{S}_*$   prs   rub   ant   prc   pat   pre   evi   dat   hyp
prs   1.00  1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
rub   1.00  1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
ant   0.00  0.00  1.00  1.00  0.50  0.60  0.80  0.30  0.20
prc   0.00  0.00  1.00  1.00  0.50  0.60  0.80  0.30  0.20
pat   0.00  0.00  0.50  0.50  1.00  ****  ****  ****  ****
pre   0.00  0.00  0.60  0.60  ****  1.00  ****  ****  ****
evi   0.00  0.00  0.80  0.80  ****  ****  1.00  ****  ****
dat   0.00  0.00  0.30  0.30  ****  ****  ****  1.00  ****
hyp   0.00  0.00  0.20  0.20  ****  ****  ****  ****  1.00

$\hat{G}_*$   prs   rub   ant   prc   pat   pre   evi   dat   hyp
prs   0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
rub   0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
ant   1.00  1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
prc   1.00  1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
pat   0.50  0.50  0.00  0.00  0.00  0.00  0.00  0.00  0.00
pre   0.60  0.60  0.00  0.00  0.00  0.00  0.00  0.00  0.00
evi   0.80  0.80  0.00  0.00  0.00  0.00  0.00  0.00  0.00
dat   0.30  0.30  0.00  0.00  0.00  0.00  0.00  0.00  0.00
hyp   0.20  0.20  0.00  0.00  0.00  0.00  0.00  0.00  0.00


Turning to the (33) block of $\hat{S}_*$, we have six choices, for example, for the completed max-star relationship between "premise" and "pattern", as shown in Table 7:

TABLE 7
The six (6, 5) elements in $\hat{R}_*$

        pat
pre   [ 0.00  0.10  0.25 ]
      [ 0.30  0.38  0.50 ]

From Table 7 we can infer that the most optimistic completion of R via $(\vee \wedge)$ allows us to replace "premise" with "pattern" with "certainty" of 0.50, whereas $T_0 = (\vee \circ)$ is most pessimistic, allowing no replacement at all. It is not our objective to argue here that one of these values is preferable to the others; rather, we emphasize that an expert would choose one norm that seems most suitable in the context of the domain-specific term set T on hand. (In fact, however, we have an intuitive preference for a value in the range $T_1 = 0.10$ to $T_2 = 0.30$, perhaps the average of $T_1$ and $T_2$, based on the properties of max-($\Delta$) and max-($\cdot$) developed in Bezdek & Harris, 1978.)

Finally, we can discuss $\hat{R}_*$ in terms of internal- and external-block relationships. If we divide T into $\{t_1, t_2\} = T_1$, $\{t_3, t_4\} = T_2$ and $\{t_5, t_6, t_7, t_8, t_9\} = T_3$, the three groups have external and internal relational categories as in Table 8. Combining these two relations yields the information in Table 9.

TABLE 8

External    T1      T2      T3          Internal    T1      T2      T3
T1          --      *       *           T1          syn     --      --
T2          imp     --      p.syn       T2          --      syn     --
T3          *       p.syn   --          T3          --      --      *

*      Lack of relation between two groups in that category.
--     Belongs to another category.
p.syn  Partial synonym, not every term of that set is synonymous to every term of the other set.

TABLE 9

Initial blocks in R                     Completed blocks in $\hat{R}_*$
      T1    T2     T3                         T1       T2       T3
T1    syn   *      *                    T1    R = Ri   R = Ri   R = Ri
T2    imp   syn    p.syn                T2    R ≠ Ri   R = Ri   R ≠ Ri
T3    *     p.syn  *                    T3    R ≠ Ri   R ≠ Ri   R ≠ Ri

* No relation between two groups.

Thus, it appears that if there is an implication relation between some terms of two different blocks in R, then after completion via the max-star transitive closure operation, each term of the implied set will be implied to the extent of the original implication degree, but not vice versa. As one term of $T_2$ (antecedent) implies one term of $T_1$ (production system) initially, after transitive closure $T_2$ implies each term of $T_1$. On the other hand, if there is a synonym relation between some terms of two different blocks, after the transitive closure operation, each term of the two blocks will have synonym relationships to the extent of the original synonym degrees. As only one term of $T_2$ and $T_3$ have a synonym relation initially, and $T_1$ and $T_3$ have no relation initially, after transitive closure operations $T_1$ and $T_3$ have the same synonym relations as $T_2$ and $T_3$. This is because $T_2$ and $T_3$ have a synonym relation initially. Thirdly, owing to the external relation, max-$*$ transitive closures produce different degrees of synonymity as $(*)$ runs through different formulae. And finally, it seems that different $T_i$s produce different values only when a group initially has no relation among elements; but varying $(*)$ makes no difference when the (operated) upon group has some initial relation between elements.

5. Conclusions

We have defined T as a dictionary of document descriptors; S as a set of synonyms for T which are reflexive and symmetric; and G as a set of generalizations for T which are neither reflexive nor symmetric. The fuzzy pseudothesaurus R = S + G is used to represent all the relations between terms in T × T supplied by an expert. R can be "completed" using a number of different transitive closure operations. After $\hat{R}_*$, the max-star $(\vee *)$ completion by transitive closure of R, is computed, the $(\hat{S}_*, \hat{G}_*)$ decomposition of $\hat{R}_*$ can be computed and used as a basis for term replacement in T. Finally, we have given the utilization and interpretation of $\hat{R}_*$ in the context of document retrieval systems. Two significant advantages of the proposed approach are that it can be done off-line, so that powers of R need not be computed at (real) retrieval time; and it completes (or fills in) a partial knowledge base supplied by a document expert in a mathematically tractable and linguistically plausible way.

This research was supported by NSF Grant IST-8407860.

References

ANVARI, M. & ROSE, G. F. (1987). Fuzzy relational databases. In BEZDEK, J., Ed. The Analysis of Fuzzy Information, Vol. 2(14). Boca Raton: CRC Press.
BANDLER, W. & KOHOUT, L. J. (1985). Probabilistic versus fuzzy production rules in expert systems. International Journal of Man-Machine Studies, 22, 347-353.
BARTSCHI, M. (1985). An overview of information retrieval subjects. Computer, 18, 67-84.
BEZDEK, J. C. & HARRIS, J. D. (1978). Fuzzy partitions and relations: an axiomatic basis for clustering. Fuzzy Sets and Systems, 1, 111-127.
BEZDEK, J. C., PETTUS, R., STEPHENS, L. & ZHANG, W. (1986). Knowledge representation using linguistic fuzzy similarity relations. International Journal of Approximate Reasoning. In press.
BISWAS, G., SUBRAMANIAN, V., MARQUES, M. M. & BEZDEK, J. (1985). A document retrieval system using a fuzzy expert system approach. IEEE Proceedings of the SMC, 126-130.
BONISSONE, P. P. & DECKER, K. S. (1985). Selecting uncertainty calculi and granularity: an experiment in trading-off precision and complexity. Technical Report, GE Corp., Schenectady, NY.
BUCKLES, B. P. & PETRY, F. E. (1983). Information-theoretical characterization of fuzzy relational data bases. IEEE Transactions on Systems, Man and Cybernetics, SMC-13, 74-77.
BUELL, D. A. & KRAFT, D. H. (1981). A model for a weighted retrieval system. Journal of the American Society for Information Science, 32, 211-216.
DUNN, J. C. (1973). A graph theoretic analysis of pattern classification via Tamura's fuzzy relation. IEEE Transactions on Systems, Man and Cybernetics, SMC-4, no. 3, 310-313.
EASTMAN, C. M. (1983). A lexical analysis of keywords in high level programming languages. International Journal of Man-Machine Studies, 19, 595-607.
KANDEL, A. & YELOWITZ, L. (1974). Fuzzy chains. IEEE Transactions on Systems, Man and Cybernetics, 472-475.
LANCASTER, F. W. (1972). Vocabulary Control for Information Retrieval. Washington, DC: Information Resources.
MIYAMOTO, S., MIYAKE, T. & NAKAYAMA, K. (1983). Generation of a pseudothesaurus for information retrieval based on cooccurrences and fuzzy set operations. IEEE Transactions on Systems, Man and Cybernetics, 13, 62-70.
RADECKI, T. (1976). Concept of fuzzy thesaurus. Information Processing and Management, 12, 313-318.
SALTON, G., Ed. (1971). The SMART Retrieval System: Experiments in Automatic Document Processing. Englewood Cliffs, NJ: Prentice-Hall.
SALTON, G. & MCGILL, M. J. (1983). Introduction to Modern Information Retrieval. New York: McGraw-Hill.
ZADEH, L. A. (1965). Fuzzy sets. Information and Control, 8, 338-353.
ZADEH, L. A. (1971). Similarity relations and fuzzy orderings. Information Sciences, 3, 177-200.
ZENNER, B. R. C., CALUWE, M. M. D. & KERRE, E. E. (1985). Retrieval systems using fuzzy expressions. Fuzzy Sets and Systems, 17, 9-22.