Outline of a fuzzy logic approach to information retrieval


Int. J. Man-Machine Studies (1981) 14, 169-178

TADEUSZ RADECKI

Main Library and Scientific Information Centre, Technical University of Wrocław, 50-370 Wrocław, Poland

(Received 5 March 1980)

An information retrieval method based on fuzzy logic is presented. The method described takes into account in a straightforward way the varying importance of descriptors which reflect the content of the information system documents as well as the varying formal relevance grades of documents in relation to a given query. The use of the simple operations of fuzzy logic allows retrieval of documents with the highest grades of formal relevance (in a given information system).

1. Introduction

The information retrieval process takes place by means of document and query search patterns. Since document and query search patterns are only better or worse representatives of the underlying content, the quality of an information retrieval process depends on how accurately the search patterns represent the content of the documents and queries. The quality of the document and query search patterns can be improved by use of a thesaurus and by taking into account the varying importance of descriptors in the search patterns of documents and queries. This has been confirmed experimentally (Salton, 1971).

An indexing method that follows the above suggestions calls for a mathematical method for developing a document retrieval strategy which allows, in as simple a way as possible, for the varying importance of descriptors in the search patterns of documents and queries. This mathematical apparatus should also take into account the differentiation of pertinence, relevance, and formal relevance values of the information system documents. Generally, an adequate mathematical apparatus for describing document retrieval systems should consider the varying importance of a given feature in particular elements of a universe of discourse.

Of the mathematical methods known, the mathematical apparatus which best fulfils the above postulates is the theory of fuzzy sets, the basis of which was given by Zadeh (1965, 1971). The basic idea of the theory of fuzzy sets is that the grades of membership of particular elements of a universe of discourse in a given fuzzy set are determined by the so-called membership function, which is a generalization of the well-known characteristic function. The transition from membership to non-membership of elements of a universe of discourse in a fuzzy set is continuous, as opposed to conventional set theory.
In the case of well-known set theory methods of document retrieval (Dabrowski, 1975; Pawlak, 1973; van Rijsbergen, 1975; Wong & Chiang, 1971) a given document is included in the information system response to a given query if the search pattern of that query (being a certain Boolean expression of descriptors) is true for the document search pattern. If the document search patterns are sets of descriptors with assigned weights indicating their importance in the documents, then the problem of finding a simple algorithm for information retrieval becomes more difficult. It seems natural that the information system response to a given query should then be the set of those documents the search patterns of which are sufficiently true in relation to the query search pattern. This is the reason why the mathematical apparatus used in this paper for development of a retrieval strategy for documents indexed by weighted descriptors is fuzzy logic (Kandel, 1973; Lee & Chang, 1971; Lee, 1972; Negoita & Ralescu, 1975), based on the notions of the theory of fuzzy sets and symbolic logic.

First of all we shall present the basic notions of the mathematical apparatus used and then we shall define the document retrieval system considered. Then we shall describe the document retrieval system language and present an algorithm for assigning documents to particular queries.

0020-7373/81/020169 + 10 $02.00/0 © 1981 Academic Press Inc. (London) Limited

2. Definitions

In this section we shall define the basic notions of the theory of fuzzy sets and fuzzy logic which will be used further on in the paper.

Definition 1. Let $U$ be a universe of discourse. A fuzzy set $A$ in $U$, written as $A \underset{f}{\subset} U$, is defined as the set of ordered pairs as follows:

$$A = \{(x, \mu_A(x)) \mid x \in U\},$$

where $\mu_A(x)$ stands for the grade of membership of $x$ in $A$. It is assumed that $\mu_A(x)$ is a real number in the closed interval $[0, 1]$. The nearer the value of $\mu_A(x)$ to unity, the higher the grade of membership of $x$ in $A$. In other words, a fuzzy set $A$ in $U$ is characterized by the so-called membership function $\mu_A: U \to [0, 1]$. Of course, the membership function is a generalization of the characteristic function known in conventional set theory.

EXAMPLE 1

Let $U = \{x_1, x_2, x_3, x_4, x_5, x_6\}$ be a universe of discourse. An exemplifying fuzzy set $A \underset{f}{\subset} U$ can be written as follows:

$$A = \{(x_1, 0.1), (x_2, 0.8), (x_3, 0.7), (x_4, 0.2), (x_5, 0.9), (x_6, 1)\}.$$

Definition 2. Let $U_1$ and $U_2$ be two sets of objects considered. A binary fuzzy relation $R$ is a fuzzy set in the Cartesian product $U_1 \times U_2$, written as $R \underset{f}{\subset} U_1 \times U_2$, and is characterized by the membership function $\mu_R: U_1 \times U_2 \to [0, 1]$, which associates with each ordered pair $(x, y)$, $x \in U_1$, $y \in U_2$, a real number $\mu_R(x, y) \in [0, 1]$ representing the grade of membership of $(x, y)$ in $R$. Thus, a binary fuzzy relation can be written as the set of ordered triples:

$$R = \{(x, y, \mu_R(x, y)) \mid x \in U_1,\ y \in U_2\}.$$

The nearer the value of $\mu_R(x, y)$ to unity, the higher the grade of membership of $(x, y)$, $x \in U_1$, $y \in U_2$, in $R$.

In practice it is sometimes sufficient to perform the operations on fuzzy sets which are not defined in the whole universe of discourse but in subsets of that universe. This kind of fuzzy set is called a level fuzzy set (Radecki, 1977).

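Definition 3. Let $A$ be a given fuzzy set in a universe of discourse $U$. A $\lambda$-level fuzzy set $A_\lambda$ is defined as follows:

$$A_\lambda = \{(x, \mu_{A_\lambda}(x) = \mu_A(x)) \mid \mu_A(x) \ge \lambda,\ x \in U\}, \quad \lambda \in [0, 1].$$

Generally, $A = A_{\lambda = 0}$, that is, the fuzzy set $A \underset{f}{\subset} U$ is a particular case of a $\lambda$-level fuzzy set $A_\lambda \underset{f}{\subset} A(\lambda) \subset U$, where $A(\lambda) = \{x \mid \mu_A(x) \ge \lambda,\ x \in U\}$.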
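As a minimal sketch of Definitions 1 and 3, a fuzzy set over a finite universe can be held as a mapping from elements to grades of membership, and a level fuzzy set obtained by dropping the pairs whose grade falls below the chosen level. The dictionary representation and the function names `level_set` and `support` are illustrative assumptions, not part of the paper; the data are those of Example 1.

```python
from typing import Dict, Set

# Assumed representation: element name -> grade of membership in [0, 1].
FuzzySet = Dict[str, float]


def level_set(a: FuzzySet, lam: float) -> FuzzySet:
    """Return the lam-level fuzzy set: the pairs of `a` whose grade is >= lam."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in the closed interval [0, 1]")
    return {x: mu for x, mu in a.items() if mu >= lam}


def support(a: FuzzySet, lam: float) -> Set[str]:
    """Return the ordinary set A(lam) of elements whose grade is >= lam."""
    return set(level_set(a, lam))


# The fuzzy set A of Example 1.
A = {"x1": 0.1, "x2": 0.8, "x3": 0.7, "x4": 0.2, "x5": 0.9, "x6": 1.0}

A_07 = level_set(A, 0.7)    # keeps x2, x3, x5 and x6 with their grades
A_085 = support(A, 0.85)    # ordinary set containing x5 and x6
```

With the level set to 0 the function returns the whole fuzzy set, which mirrors the remark that a fuzzy set is a particular case of a level fuzzy set.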

Definition 4. A fuzzy variable $x_i$ is a variable which assumes its truth-values in the closed interval $[0, 1]$ of real numbers representing grades of membership of a given object (belonging to the universe of discourse $U$) in certain fuzzy sets (defined in $U$).

Definition 5. The fuzzy formulae $f(x_1, x_2, \ldots, x_n)$ generated by fuzzy variables $x_1, x_2, \ldots, x_n$ are defined recursively as follows.

1. The numbers 0 and 1 are fuzzy formulae.
2. A fuzzy variable $x_i$ ($i = 1, 2, \ldots, n$) is a fuzzy formula.
3. If $u$ is a fuzzy formula, then $\neg u$ is a fuzzy formula.
4. If $u$ and $v$ are fuzzy formulae, then $u \vee v$ is a fuzzy formula.
5. If $u$ and $v$ are fuzzy formulae, then $u \wedge v$ is a fuzzy formula.
6. Only those formulae resulting from rules 1-5 are fuzzy formulae.

Definition 6. Denoting the truth-value assigned to a fuzzy variable $x_i$ ($i = 1, 2, \ldots, n$) by $W(x_i)$, the truth-value $W(c)$ of a fuzzy formula $c$ is uniquely determined by means of the following rules:

1. $W(c) = 0$ if $c = 0$.
2. $W(c) = 1$ if $c = 1$.
3. $W(c) = W(x_i)$ if $c = x_i$.
4. $W(c) = 1 - W(a)$ if $c = \neg a$.
5. $W(c) = \max[W(a), W(b)]$ if $c = a \vee b$.
6. $W(c) = \min[W(a), W(b)]$ if $c = a \wedge b$.

Since every fuzzy variable $x_i$ of a fuzzy formula $c$ can assume an infinite number of truth-values, there exists an infinite number of distinct assignments of truth-values to the fuzzy variables of a fuzzy formula. In the special case where the fuzzy variables $x_i$ ($i = 1, 2, \ldots, n$) assume only two truth-values, 0 or 1, the truth-values of the fuzzy formulae will also assume only these values. It thus follows that the widely-known two-valued logic is a particular case of fuzzy logic.

EXAMPLE 2

Let $c = (x_1 \wedge \neg x_2) \vee x_3$ be a fuzzy formula and $W(x_1) = 0.2$, $W(x_2) = 0.6$ and $W(x_3) = 0.5$ be the truth-values of the fuzzy variables $x_1$, $x_2$ and $x_3$. The truth-value $W(c)$ of the fuzzy formula $c$ is calculated by performing the following operations:

$$\begin{aligned}
W(c) &= \max[W(x_1 \wedge \neg x_2), W(x_3)] \\
&= \max[\min(W(x_1), W(\neg x_2)), W(x_3)] \\
&= \max[\min(W(x_1), 1 - W(x_2)), W(x_3)] \\
&= \max[\min(0.2, 1 - 0.6), 0.5] \\
&= \max(0.2, 0.5) = 0.5.
\end{aligned}$$
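The recursive rules of Definitions 5 and 6 translate directly into a small evaluator. The following is only a sketch under stated assumptions: the nested-tuple encoding of formulae (`("not", u)`, `("or", u, v)`, `("and", u, v)`, with strings for variables) and the function name `truth_value` are illustrative choices, not notation from the paper. The data reproduce Example 2.

```python
from typing import Dict, Tuple, Union

# Assumed encoding: 0 or 1 for constants, a string for a fuzzy variable,
# and tuples ("not", u), ("or", u, v), ("and", u, v) for compound formulae.
Formula = Union[int, str, Tuple]


def truth_value(c: Formula, w: Dict[str, float]) -> float:
    """Evaluate W(c) for formula c under the variable assignment w."""
    if isinstance(c, int):
        if c not in (0, 1):
            raise ValueError("only the constants 0 and 1 are fuzzy formulae")
        return float(c)                      # rules 1 and 2
    if isinstance(c, str):
        return w[c]                          # rule 3
    op = c[0]
    if op == "not":
        return 1.0 - truth_value(c[1], w)    # rule 4
    if op == "or":
        return max(truth_value(c[1], w), truth_value(c[2], w))  # rule 5
    if op == "and":
        return min(truth_value(c[1], w), truth_value(c[2], w))  # rule 6
    raise ValueError(f"unknown operator: {op!r}")


# Example 2: c = (x1 AND NOT x2) OR x3.
c = ("or", ("and", "x1", ("not", "x2")), "x3")
W = {"x1": 0.2, "x2": 0.6, "x3": 0.5}

result = truth_value(c, W)
```

Restricting the assignment to the values 0 and 1 makes the same function behave as an ordinary two-valued evaluator, in line with the remark preceding Example 2.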

3. Document retrieval system

By a document retrieval system we mean the following quadruple:

$$I = (D, Q, T, \psi),$$

where $D$ is a set of the information system documents of cardinality $|D| = n$, $Q$ is a set of queries directed to the information system, of cardinality $|Q| = m$, $T$ is a set of descriptors of cardinality $|T| = k$, whereas $\psi$ is a mapping in the form:

$$\psi: Q \to 2^D.$$

The mapping $\psi$ represents an algorithm for assigning documents to particular queries. For a given query $q \in Q$, the set $\psi(q) \subset D$ is the information system response to that query. If we refer to the information system response to a given query $q \in Q$, the response being ranked in decreasing order according to the grade of formal relevance of the search pattern of that query to the search patterns of the information system documents $d \in D$, then we will symbolize this by $\mathrm{Ord}\,\psi(q)$.

Let us assume that we know the relation $F$ of the description of the information system documents, which is a binary fuzzy relation in the form:

$$F = \{(d, t, \mu_F(d, t)) \mid d \in D,\ t \in T\},$$

where $\mu_F: D \times T \to [0, 1]$ is a function specifying the importance of descriptor $t \in T$ in the description of document $d \in D$ for each ordered pair $(d, t)$, $d \in D$, $t \in T$. On the basis of the binary fuzzy relation $F$ in the Cartesian product $D \times T$, written as $F \underset{f}{\subset} D \times T$, we can define the search pattern of a document $d \in D$. The search pattern of a document $d \in D$ is a fuzzy set $F_d$ in the set $T$ of descriptors, i.e.

$$F_d = \{(t, \mu_{F_d}(t) = \mu_F(d, t)) \mid t \in T\}.$$

It is obvious that in the search patterns $F_d$ of documents $d \in D$ it is not necessary to retain those ordered pairs $(t, \mu_{F_d}(t))$ for which $\mu_{F_d}(t) = 0$. Moreover, it is also likely that the removal from the document search patterns of those ordered pairs $(t, \mu_{F_d}(t))$ for which the values of $\mu_{F_d}(t)$ are nearly 0 would not significantly affect the quality of the information system response. Hence, it would be advisable to make an optimal selection of the value of $\lambda^*$ from the point of view of time and quality of retrieval, which would allow the system to operate on the $\lambda^*$-level search patterns of documents.


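By the $\lambda^*$-level search pattern of a document $d \in D$ we will understand a $\lambda^*$-level fuzzy set $F_{d(\lambda^*)}$ defined as:

$$F_{d(\lambda^*)} = \{(t, \mu_{F_{d(\lambda^*)}}(t) = \mu_{F_d}(t)) \mid \mu_{F_d}(t) \ge \lambda^*,\ t \in T\}.$$

Obviously, for $\lambda^* = 0$, $F_{d(\lambda^*)} = F_d$.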
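One way to picture the step from the description relation to the level search patterns is to group the weighted triples by document and drop the pairs below the chosen level. This is a sketch only: the list-of-triples input, the function name `search_patterns`, and all weights below are hypothetical and are not taken from the paper.

```python
from typing import Dict, Iterable, Tuple

Triple = Tuple[str, str, float]   # (document, descriptor, importance)


def search_patterns(relation: Iterable[Triple],
                    lam_star: float = 0.0) -> Dict[str, Dict[str, float]]:
    """Group (d, t, weight) triples into one level search pattern per document.

    Pairs with weight 0 are never stored, since the text notes that they
    need not be retained; other pairs are kept only if weight >= lam_star.
    """
    patterns: Dict[str, Dict[str, float]] = {}
    for d, t, mu in relation:
        if not 0.0 <= mu <= 1.0:
            raise ValueError(f"weight out of range for ({d}, {t}): {mu}")
        pattern = patterns.setdefault(d, {})   # keep documents that end up empty
        if mu > 0.0 and mu >= lam_star:
            pattern[t] = mu
    return patterns


# Hypothetical description relation for two documents.
F = [
    ("d1", "t1", 0.8), ("d1", "t2", 0.7), ("d1", "t3", 0.2),
    ("d2", "t1", 0.1), ("d2", "t2", 0.0), ("d2", "t4", 0.9),
]

full = search_patterns(F)          # level 0: every non-zero pair is kept
pruned = search_patterns(F, 0.3)   # level 0.3: low-weight pairs are dropped
```

Raising the level shortens the stored patterns, which is the trade-off between retrieval time and response quality that the choice of the level is meant to balance.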

4. Document retrieval system language

A set $T^*$ of correct expressions of a document retrieval system language is defined as the smallest set containing the set $T$ of descriptors and such that if $t', t'' \in T^*$, then $t' \wedge t''$, $t' \vee t''$, $\neg t'$, and $\neg t''$ also belong to the set $T^*$ of correct expressions, where the symbols $\wedge$, $\vee$ and $\neg$ stand for the logical operations of conjunction, disjunction and negation, respectively. We will call the elements of the set $T^*$ complex descriptors of the document retrieval system language. In our case the meanings of complex descriptors are fuzzy sets which are accordingly determined on the basis of the $\lambda^*$-level search patterns of the information system documents $d \in D$.

We define below the $\lambda^*$-level meanings $M_t^{\lambda^*}$ of the correct expressions $t \in T^*$ of our document retrieval system language. First, however, we shall introduce some necessary symbols. Let $x_t$ stand for a fuzzy variable corresponding to descriptor $t$ in the set $T$. By means of the symbol $x_{t(\lambda^*)}(d)$ we denote the truth-value of fuzzy variable $x_t$. This truth-value represents the importance of descriptor $t$ in the $\lambda^*$-level search pattern $F_{d(\lambda^*)}$ of document $d$. On the other hand, by $c_t$ we denote a fuzzy formula corresponding to complex descriptor $t$ in the set $T^*$ of correct expressions, and by $c_{t(\lambda^*)}(d)$ we symbolize the truth-value of fuzzy formula $c_t$ corresponding to document $d$, calculated on the basis of the $\lambda^*$-level search pattern of that document. If in the $\lambda^*$-level search pattern $F_{d(\lambda^*)}$ of a document $d$ there does not occur an ordered pair $(t, \mu_{F_{d(\lambda^*)}}(t))$ corresponding to descriptor $t$ ($t \in T$) which is included in the complex descriptor in question, then in calculating the truth-value $c_{t(\lambda^*)}(d)$ one assumes that the fuzzy variable $x_t$ corresponding to that ordered pair takes the truth-value $x_{t(\lambda^*)}(d)$ equal to 0.

Using the symbols introduced above we can define the $\lambda^*$-level meanings $M_t^{\lambda^*}$ of correct expressions $t \in T^*$ in the following way:

$$M_t^{\lambda^*} = \{(d, \mu_{M_t^{\lambda^*}}(d) = c_{t(\lambda^*)}(d)) \mid d \in D,\ c_{t(\lambda^*)}(d) \ne 0\}.$$

In determining the $\lambda^*$-level meanings $M_t^{\lambda^*}$ of correct expressions $t \in T^*$ of the document retrieval system language certain semantic properties may appear useful. These properties may be represented in the form of the following proposition.

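Proposition 1. In document retrieval system $I$ the following properties hold:

1. $(\forall t' \in T^*)(M^{\lambda^*}_{t = t' \vee t'} = M^{\lambda^*}_{t = t'})$,
   $(\forall t' \in T^*)(M^{\lambda^*}_{t = t' \wedge t'} = M^{\lambda^*}_{t = t'})$.
2. $(\forall t', t'' \in T^*)(M^{\lambda^*}_{t = t' \wedge t''} = M^{\lambda^*}_{t = t'' \wedge t'})$,
   $(\forall t', t'' \in T^*)(M^{\lambda^*}_{t = t' \vee t''} = M^{\lambda^*}_{t = t'' \vee t'})$.
3. $(\forall t', t'', t''' \in T^*)(M^{\lambda^*}_{t = t' \wedge (t'' \wedge t''')} = M^{\lambda^*}_{t = (t' \wedge t'') \wedge t'''})$,
   $(\forall t', t'', t''' \in T^*)(M^{\lambda^*}_{t = t' \vee (t'' \vee t''')} = M^{\lambda^*}_{t = (t' \vee t'') \vee t'''})$.
4. $(\forall t', t'' \in T^*)(M^{\lambda^*}_{t = t' \wedge (t' \vee t'')} = M^{\lambda^*}_{t = t'})$,
   $(\forall t', t'' \in T^*)(M^{\lambda^*}_{t = t' \vee (t' \wedge t'')} = M^{\lambda^*}_{t = t'})$.
5. $(\forall t', t'', t''' \in T^*)(M^{\lambda^*}_{t = t' \wedge (t'' \vee t''')} = M^{\lambda^*}_{t = (t' \wedge t'') \vee (t' \wedge t''')})$,
   $(\forall t', t'', t''' \in T^*)(M^{\lambda^*}_{t = t' \vee (t'' \wedge t''')} = M^{\lambda^*}_{t = (t' \vee t'') \wedge (t' \vee t''')})$.
6. $(\forall t', t'' \in T^*)(M^{\lambda^*}_{t = \neg(t' \wedge t'')} = M^{\lambda^*}_{t = \neg t' \vee \neg t''})$,
   $(\forall t', t'' \in T^*)(M^{\lambda^*}_{t = \neg(t' \vee t'')} = M^{\lambda^*}_{t = \neg t' \wedge \neg t''})$.
7. $(\forall t' \in T^*)(M^{\lambda^*}_{t = \neg\neg t'} = M^{\lambda^*}_{t = t'})$.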

Proof. The above-mentioned semantic properties of the document retrieval system language result directly from the definition of a $\lambda^*$-level meaning $M_t^{\lambda^*}$ of a complex descriptor $t \in T^*$, and from the fact that the operations $\vee$, $\wedge$, and $\neg$ applied to fuzzy formulae fulfil the laws of idempotency, commutativity, associativity, absorption, distributivity, involution and de Morgan's laws.
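The properties of Proposition 1 can be spot-checked numerically. The sketch below computes the meaning of a complex descriptor over a few documents and compares the meanings of the two sides of each law. It is an illustration, not a proof: the tuple encoding of complex descriptors, the function names and the document weights are all hypothetical. Exact fractions are used instead of floating-point numbers so that double negation returns exactly the original weight.

```python
from fractions import Fraction as Fr


def truth_value(c, pattern):
    """Truth-value of complex descriptor c for one document search pattern.

    A descriptor missing from the pattern is given the truth-value 0.
    """
    if isinstance(c, str):
        return pattern.get(c, Fr(0))
    op = c[0]
    if op == "not":
        return 1 - truth_value(c[1], pattern)
    left, right = truth_value(c[1], pattern), truth_value(c[2], pattern)
    if op == "and":
        return min(left, right)
    if op == "or":
        return max(left, right)
    raise ValueError(f"unknown operator: {op!r}")


def meaning(c, patterns):
    """Meaning of c: the documents with a non-zero truth-value, with grades."""
    graded = {d: truth_value(c, p) for d, p in patterns.items()}
    return {d: g for d, g in graded.items() if g != 0}


# Hypothetical level search patterns of three documents.
patterns = {
    "d1": {"t1": Fr(4, 5), "t2": Fr(7, 10)},
    "d2": {"t2": Fr(3, 10), "t3": Fr(1)},
    "d3": {},
}

a, b, c3 = "t1", "t2", "t3"
laws = {
    "idempotency": (("or", a, a), a),
    "commutativity": (("and", a, b), ("and", b, a)),
    "associativity": (("or", a, ("or", b, c3)), ("or", ("or", a, b), c3)),
    "absorption": (("and", a, ("or", a, b)), a),
    "distributivity": (("and", a, ("or", b, c3)),
                       ("or", ("and", a, b), ("and", a, c3))),
    "de Morgan": (("not", ("and", a, b)), ("or", ("not", a), ("not", b))),
    "involution": (("not", ("not", a)), a),
}

# Each law holds on this data if both sides have identical meanings.
holds = {name: meaning(lhs, patterns) == meaning(rhs, patterns)
         for name, (lhs, rhs) in laws.items()}
```

In a retrieval setting these equalities mean that a query may be rewritten into any equivalent form allowed by the laws without changing the graded response.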

5. Description of the document retrieval process

We present below an algorithm for assigning documents to queries addressed to the information retrieval system. In the described system the process of document retrieval takes place as follows: for a given query the truth-value of the fuzzy formula (complex descriptor) representing that query is determined for every $\lambda^*$-level search pattern $F_{d(\lambda^*)}$ of document $d \in D$. Every such value represents the grade of membership of a document in the $\lambda^*$-level meaning $M_t^{\lambda^*}$ of complex descriptor $t$ representing the query search pattern. Knowing $M_t^{\lambda^*}$ corresponding to a given query, it is possible to issue to the user the documents with the highest grades of formal relevance (in a given information system).

In more detail, in our system $I$ the process of document retrieval in response to a given query $q_j \in Q$, $j = 1, 2, \ldots, m$, is carried out in the following stages:

1. The creation (in accordance with the syntactic rules of the document retrieval system language) of a complex descriptor $t_j \in T^*$ representing the search pattern of this query.
2. The determination of the $\lambda^*$-level meaning $M_{t_j}^{\lambda^*}$ of the complex descriptor $t_j$ in accordance with the semantic rules of the document retrieval system language.
3. The issue of documents in decreasing order according to the grade of their membership in $M_{t_j}^{\lambda^*}$.

If we do not take this order into account, we can regard the documents issued in response to a given query $q_j \in Q$ whose search pattern is represented by the complex descriptor $t_j \in T^*$ as forming the set

$$D_{M_{t_j}^{\lambda^*}} = \{d \mid (d, \mu_{M_{t_j}^{\lambda^*}}(d)) \in M_{t_j}^{\lambda^*}\}.$$

And therefore

$$\psi(q_j) = D_{M_{t_j}^{\lambda^*}}.$$

On the other hand, the ordered response $\mathrm{Ord}\,\psi(q_j)$ of the information system to query $q_j$ can be generally defined as

$$\mathrm{Ord}\,\psi(q_j) = (K_1, K_2, \ldots, K_r),$$

where $K_i$, $i = 1, 2, \ldots, r$, are sets of documents defined in the following way:

$$K_i = \{d \mid \mu_{M_{t_j}^{\lambda^*}}(d) = a_i,\ d \in D_{M_{t_j}^{\lambda^*}}\},$$

where $a_i \in \mathcal{A}_{M_{t_j}^{\lambda^*}} = \{\mu_{M_{t_j}^{\lambda^*}}(d) \mid (d, \mu_{M_{t_j}^{\lambda^*}}(d)) \in M_{t_j}^{\lambda^*}\}$, $a_{i-1} > a_i$ and $\bigcup_{i=1}^{r} K_i = D_{M_{t_j}^{\lambda^*}}$. Obviously, the sets $K_i$, $i = 1, 2, \ldots, r$, are equivalence classes determined by an equivalence relation $R_j$ defined as:

$$(\forall d', d'' \in \psi(q_j))(d' R_j d'' \Leftrightarrow \mu_{M_{t_j}^{\lambda^*}}(d') = \mu_{M_{t_j}^{\lambda^*}}(d'')).$$

In view of the cost and the perceptual limitations of an information system user, it is not always a good idea to issue the whole set $D_{M_{t_j}^{\lambda^*}}$ of documents in response to a given query $q_j \in Q$. Accepting the risk that some documents of a sufficient grade of pertinence will not be issued, the information system response $\psi(q_j)$ should be limited to the documents with the highest grades of formal relevance to that query (in a given system). This can be done in either of two ways: by establishing a threshold value $\lambda_j$ of formal relevance, or by fixing a maximum permissible number $N_j$ of documents to be included in the information system response $\psi(q_j)$ to a query $q_j \in Q$. In the first case

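$$\psi(q_j) = M_{t_j}^{\lambda^*}(\lambda_j), \quad \lambda_j \in [0, 1], \quad j = 1, 2, \ldots, m,$$

where $M_{t_j}^{\lambda^*}(\lambda_j) = \{d \mid \mu_{M_{t_j}^{\lambda^*}}(d) \ge \lambda_j,\ d \in D_{M_{t_j}^{\lambda^*}}\}$. On the other hand,

$$\mathrm{Ord}\,\psi(q_j) = (K_1, K_2, \ldots, K_r),$$

where $a_r = \lambda_j$. With the second way of limiting the size of the response we have

$$\psi(q_j) = M_{t_j}^{\lambda^*}(\lambda_{N_j}) = \max_{\substack{\lambda \in (0, 1] \\ |M_{t_j}^{\lambda^*}(\lambda)| \le N_j}} M_{t_j}^{\lambda^*}(\lambda), \quad j = 1, 2, \ldots, m,$$

and

$$\mathrm{Ord}\,\psi(q_j) = (K_1, K_2, \ldots, K_r),$$

where the following condition is fulfilled: [the condition, a relation indexed from $i = 1$, is illegible in the source].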

Of the two above-mentioned ways of limiting the size of the information system response $\psi(q_j)$ to a query $q_j \in Q$ the more natural way seems to be the fixing of a maximum permissible number $N_j$ of documents in the response. Taking this method of limiting the size of the information system response, the retrieval result can be characterized by one of the following cases ($j = 1, 2, \ldots, m$):

1. $(\forall \lambda \in (0, 1])(|M_{t_j}^{\lambda^*}(\lambda)| = 0)$;
2. $(\forall \lambda \in (0, 1])(0 < |M_{t_j}^{\lambda^*}(\lambda)| \ldots$ [the rest of Case 2 and the statement of Case 3 are illegible in the source];
4. $|M_{t_j}^{\lambda^*}(\lambda = \max\{a \mid a \in \mathcal{A}_{M_{t_j}^{\lambda^*}}\})| > N_j$.


When Case 1 occurs, there are no documents of sufficient grade of formal relevance to the query $q_j \in Q$ in the set $D$ of documents. The information system response characterized by Case 2 consists of those documents whose grade of formal relevance is higher than 0, where the number of documents issued is less than the number required. In Case 3 the information system fulfils the user's requirements from the point of view of the number of supplied documents with a grade of formal relevance higher than 0. When Case 4 occurs, then in accordance with the above-described algorithm for assigning documents to queries $\psi(q_j) = \emptyset$. In this case it seems better to assume that $\psi(q_j) = M_{t_j}^{\lambda^*}(\lambda = \max\{a \mid a \in \mathcal{A}_{M_{t_j}^{\lambda^*}}\})$ or that the information system response $\psi(q_j)$ to the query $q_j$ represented by the complex descriptor $t_j$ is a certain subset of the set $M_{t_j}^{\lambda^*}(\lambda = \max\{a \mid a \in \mathcal{A}_{M_{t_j}^{\lambda^*}}\})$ of cardinality equal to $N_j$.

EXAMPLE 3

Let $D = \{d_1, d_2, d_3, d_4, d_5, d_6\}$ be a set of documents and $T = \{t_1, t_2, t_3, t_4\}$ a set of descriptors. The document search patterns are as follows:

$F_{d_1} = \{(t_1, 0.8), (t_2, 0.7), \ldots$

$F_{d_3(0.3)} = \{(t_1, 0.4), (t_2, 0.3), (t_3, 0.6), \ldots$

Assuming that the query search pattern of a query $q$ is represented by a complex descriptor in the form $t = \neg t_1 \wedge \neg t_2 \wedge (t_3 \vee t_4)$, we get the following truth-values of the fuzzy formula $c_t$ corresponding to the complex descriptor $t$ representing the query $q$:

$$c_{t(0.3)}(d_1) = 0.2, \quad c_{t(0.3)}(d_2) = 0, \quad c_{t(0.3)}(d_3) = 0.6,$$

$$c_{t(0.3)}(d_4) = 0.5, \quad c_{t(0.3)}(d_5) = 0, \quad c_{t(0.3)}(d_6) = 0.6.$$

Therefore

$$M_t^{0.3} = \{(d_1, 0.2), (d_3, 0.6), (d_4, 0.5), (d_6, 0.6)\}.$$


Let the maximum permissible number of documents to be included in the response of the information system in question to the query $q$ represented by the complex descriptor $t$ be $N = 3$. In this case $\mathrm{Ord}\,\psi(q) = (\{d_3, d_6\}, \{d_4\})$, whereas $\psi(q) = \{d_3, d_4, d_6\}$.
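The last stage of the retrieval process, ranking the documents and limiting the size of the response, can be sketched as follows, starting from the meaning obtained in Example 3. The function names `ordered_response` and `limit_by_count` are illustrative assumptions. The count-limited variable takes whole equivalence classes from the top while the total stays within the permitted number, which is one reading of "the largest level set with at most N documents"; it returns an empty response when even the top class is too large.

```python
from typing import Dict, List, Set

Meaning = Dict[str, float]   # document -> non-zero grade of formal relevance


def ordered_response(meaning: Meaning, lam: float = 0.0) -> List[Set[str]]:
    """Classes of equally graded documents with grade >= lam, highest first."""
    grades = sorted({g for g in meaning.values() if g >= lam}, reverse=True)
    return [{d for d, g in meaning.items() if g == a} for a in grades]


def limit_by_count(meaning: Meaning, n_max: int) -> List[Set[str]]:
    """Largest prefix of whole classes whose total size does not exceed n_max."""
    response: List[Set[str]] = []
    total = 0
    for k in ordered_response(meaning):
        if total + len(k) > n_max:
            break                      # adding this class would exceed n_max
        response.append(k)
        total += len(k)
    return response


# The meaning obtained in Example 3.
M_t = {"d1": 0.2, "d3": 0.6, "d4": 0.5, "d6": 0.6}

ranked = limit_by_count(M_t, 3)            # classes {d3, d6} then {d4}
issued = set().union(*ranked)              # unordered response
by_threshold = ordered_response(M_t, 0.55) # threshold variant: only {d3, d6}
too_small = limit_by_count(M_t, 1)         # top class has 2 documents: empty
```

The empty result for a limit of 1 corresponds to the situation in which the top class alone exceeds the permitted number; the text suggests issuing that class, or a subset of it, in such a case.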

6. Final remarks

Since fuzzy logic is a generalization of two-valued logic, the information retrieval method presented in this paper can be treated as a generalization of the widely-known sequential file method. Bearing in mind that the use of varying weights of descriptors depicting documents allows the creation of more adequate document search patterns, the use of the retrieval method described in the present paper should give better results compared with the sequential file method and other retrieval methods based on set theory. Another advantage of the retrieval method described is that it takes into account the varying grades of formal relevance of particular information system documents to a given query. The use of the simple operations of fuzzy logic allows the retrieval of the documents with the highest grades of formal relevance in relation to the query posed. These advantages are the direct result of the application of fuzzy logic to the development of a document retrieval strategy in conditions where document search patterns are sets of weighted descriptors. Of course, where these search patterns are ordinary sets of certain descriptors, then the use of the method described (where no limitation is imposed on the response size) will give the same results as in the case of the sequential file method.

One should notice that because of the time necessary for document retrieval the process of finding the information system response to a given query should be carried out on the basis of $\lambda^*$-level search patterns of the documents. Therefore, a vital question is the optimal (from the point of view of retrieval time and quality of response) selection of a threshold value $\lambda^*$. Because of the difficulty in the analytical determination of the threshold value $\lambda^*$, its selection for each concrete information system should be carried out experimentally.
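The remark that ordinary (unweighted) search patterns lead to the same response as a two-valued method can be illustrated with a small comparison. In this sketch plain Boolean matching over descriptor sets stands in for the sequential file method, which is an assumption made for illustration; the documents and the query are hypothetical.

```python
# Hypothetical documents described by ordinary sets of descriptors.
docs = {
    "d1": {"t1", "t3"},
    "d2": {"t1", "t2"},
    "d3": {"t3"},
}


def boolean_match(descriptors):
    """Two-valued evaluation of the query: t1 AND NOT t2."""
    return ("t1" in descriptors) and ("t2" not in descriptors)


def fuzzy_grade(descriptors):
    """Fuzzy evaluation of the same query with weights restricted to 0 and 1."""
    def weight(t):
        return 1.0 if t in descriptors else 0.0
    return min(weight("t1"), 1.0 - weight("t2"))


boolean_response = {d for d, s in docs.items() if boolean_match(s)}

# Fuzzy meaning: documents with a non-zero grade, as in the definition of M.
fuzzy_meaning = {}
for d, s in docs.items():
    grade = fuzzy_grade(s)
    if grade != 0:
        fuzzy_meaning[d] = grade
```

On this data both evaluations select the same documents, and every grade in the fuzzy meaning equals 1, so the ranking carries no additional information.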

References

DABROWSKI, M. (1975). A general model of distribution of objects in information retrieval systems. Information Systems, 1, 147.
KANDEL, A. (1973). On minimization of fuzzy functions. IEEE Transactions on Computers, C-22, 826.
LEE, R. C. T. (1972). Fuzzy logic and the resolution principle. Journal of the Association for Computing Machinery, 19, 109.
LEE, R. C. T. & CHANG, C. L. (1971). Some properties of fuzzy logic. Information and Control, 19, 417.
NEGOITA, C. W. & RALESCU, D. A. (1975). Applications of Fuzzy Sets to Systems Analysis. Basel and Stuttgart: Birkhäuser Verlag.
PAWLAK, Z. (1973). Mathematical foundations of information retrieval. CCPAS Reports No. 101. Computation Centre Polish Academy of Sciences, Warsaw, Poland.
RADECKI, T. (1977). Level fuzzy sets. Journal of Cybernetics, 7, 189.
SALTON, G. (Ed.) (1971). The SMART Retrieval System - Experiments in Automatic Document Processing. Englewood Cliffs, New Jersey: Prentice-Hall.
VAN RIJSBERGEN, C. J. (1975). Information Retrieval. London, Boston: Butterworths.
WONG, E. & CHIANG, T. C. (1971). Canonical structure in attribute based file organization. Communications of the Association for Computing Machinery, 14, 593.
ZADEH, L. A. (1965). Fuzzy sets. Information and Control, 8, 338.
ZADEH, L. A. (1971). Similarity relations and fuzzy orderings. Information Sciences, 3, 177.