Inform. Stor. Retr. Vol. 7, pp. 89-94. Pergamon Press 1971. Printed in Great Britain

ON A MODEL OF INFORMATION RETRIEVAL SYSTEM BASED ON THESAURUS

W. M. TURSKI
Computation Centre, Polish Academy of Sciences, Palac Kultury i Nauki, Pok. 1050 Warszawa
Summary: This paper describes a model of an information retrieval system based on a thesaurus. Definitions of thesaurus, classification, document description, information query and response are given. Inclusiveness and two other properties of the considered system are proven.

SPECIAL SYMBOLS

$\exists$, $\forall$    quantifiers
$\preceq$    "precedes" sign
$\Rightarrow$    "implies" sign
$\vee$    alternative sign
$\wedge$    conjunction sign
$\Leftrightarrow$    "is equivalent" sign
$\setminus$    set difference
$\cup$    set union
$\cap$    set intersection
$\times$    Cartesian product
$\subset$    "is a subset of" sign
$\in$    "is a member of" sign

1. THESAURUS
FOR THE purpose of this paper we shall define the thesaurus to be a finite set $T$ of terms $t$ satisfying the following conditions:

(1) There exists a nonempty subset $T_0 \subset T$, called the descriptor set, and a symmetric, transitive and reflexive relation $\mathcal{S} \subset T \times T$, called the synonymy relation, such that:

(1.1) no two distinct descriptors are synonymous, i.e.

$$t_1 \neq t_2 \wedge t_1 \mathcal{S} t_2 \Rightarrow (t_1 \in T \setminus T_0) \vee (t_2 \in T \setminus T_0),$$

(1.2) for each element of the ascriptor set, $T \setminus T_0$, there exists a synonymous descriptor, i.e.

$$t_1 \in T \setminus T_0 \Rightarrow \exists_{t \in T_0}\; t_1 \mathcal{S} t.$$

(2) A transitive and nonsymmetric relation $\mathcal{R} \subset T_0 \times T_0$, called the generalization relation, is defined in the descriptor set. If $t_1 \mathcal{R} t_2$, we shall frequently say that the descriptor $t_1$ is more general than the descriptor $t_2$, or that the descriptor $t_2$ is more specific than the descriptor $t_1$.
Corollary. For any given ascriptor $t_1$ there exists exactly one descriptor $t$ satisfying $t_1 \mathcal{S} t$. Indeed, let there exist two distinct such descriptors $t$ and $t'$, both members of $T_0$, with $t_1 \mathcal{S} t$ and $t_1 \mathcal{S} t'$. Since the relation $\mathcal{S}$ is transitive and symmetric, we would have $t_1 \mathcal{S} t' \Leftrightarrow t' \mathcal{S} t_1$, and $t_1 \mathcal{S} t$, hence $t' \mathcal{S} t$, which contradicts (1.1).

Remarks. The above definition of the thesaurus does not assume anything about possible relations between ascriptors and descriptors (or among descriptors) except the ones mentioned explicitly. In particular, all flexion forms may be considered as ascriptors and the corresponding radices as descriptors. Similarly, the introduced relation of synonymy does not encompass all connotations of this notion; for discussion cf. e.g. [1]. The important problem of homonymy has not been touched upon at all; it is simply assumed that the considered set of terms does not contain homographs. In justification we may refer to the time-honoured practice of distinguishing the denotations of homonyms by attaching a digit, or a digit sequence, at the end of terms to differentiate between various semantic meanings.
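The conditions (1.1), (1.2) and (2), together with the Corollary, can be checked mechanically on a small thesaurus. The following is a minimal Python sketch under stated assumptions: relations are stored as sets of ordered pairs, the synonymy relation is built from disjoint synonym classes (so it is reflexive, symmetric and transitive by construction), "nonsymmetric" is read as "no pair occurs in both directions", and the function names and the ascriptors "ODE" and "Riccati equation" are hypothetical.

```python
def synonymy_from_classes(classes):
    """Build a reflexive, symmetric, transitive relation from disjoint synonym classes."""
    return {(a, b) for cls in classes for a in cls for b in cls}


def check_thesaurus(terms, descriptors, syn, gen):
    """Return the list of violated conditions; an empty list means none was found."""
    errors = []
    if not descriptors or not descriptors <= terms:
        errors.append("T0 must be a nonempty subset of T")
    # (1.1): two distinct synonymous terms may not both be descriptors.
    for a, b in sorted(syn):
        if a < b and a in descriptors and b in descriptors:
            errors.append(f"(1.1) violated by {a!r} and {b!r}")
    # (1.2): every ascriptor needs a synonymous descriptor.
    for a in sorted(terms - descriptors):
        if not any((a, d) in syn for d in descriptors):
            errors.append(f"(1.2) violated by {a!r}")
    # (2): gen relates descriptors only, is transitive, and has no symmetric pair
    # (the reading of "nonsymmetric" assumed in this sketch).
    for a, b in sorted(gen):
        if a not in descriptors or b not in descriptors:
            errors.append(f"(2) {a!r} or {b!r} is not a descriptor")
        if (b, a) in gen:
            errors.append(f"(2) symmetric pair {a!r}, {b!r}")
        for c, d in gen:
            if b == c and (a, d) not in gen:
                errors.append(f"(2) transitivity fails for {a!r}, {b!r}, {d!r}")
    return errors


def descriptor_of(term, descriptors, syn):
    """Corollary: the single descriptor synonymous with `term`."""
    matches = [d for d in descriptors if (term, d) in syn]
    if len(matches) != 1:
        raise ValueError(f"{term!r} has {len(matches)} synonymous descriptors")
    return matches[0]


# Hypothetical toy thesaurus; pairs in gen read (more general, more specific).
ODE, RICCATI = "ordinary differential equation", "Riccati's equation"
terms = {ODE, "ODE", RICCATI, "Riccati equation"}
descriptors = {ODE, RICCATI}
syn = synonymy_from_classes([{ODE, "ODE"}, {RICCATI, "Riccati equation"}])
gen = {(ODE, RICCATI)}

print(check_thesaurus(terms, descriptors, syn, gen))        # []
print(descriptor_of("Riccati equation", descriptors, syn))  # Riccati's equation
```

Promoting the ascriptor "ODE" to a descriptor in this example would make `check_thesaurus` report a violation of (1.1), which is exactly the situation excluded in the proof of the Corollary.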
2. DOCUMENTS, THEIR DESCRIPTIONS AND ORDERING
We shall consider a document to be a finite sequence $\alpha_1 \tau_1 \alpha_2 \tau_2 \ldots \alpha_n \tau_n \alpha_{n+1}$, where $\{\tau_1, \tau_2, \ldots, \tau_n\}$ is a nonempty subset of $T$, and the $\alpha_i$ are immaterial (from our present point of view) sequences of symbols. Some (or even all) of the $\alpha_i$ may be empty.

By the description of a document we shall understand the sequence resulting from a document by the application of the following steps: all immaterial symbols are deleted, all ascriptors are replaced by synonymous descriptors, each second (and further) occurrence of the same descriptor is deleted, as are all descriptors whose further specifications appear in the considered sequence. Hence, the description of a document is a sequence

$$t = t_1, t_2, \ldots, t_k \tag{1}$$

in which no descriptor appears twice and which does not contain any pair satisfying $t_i \mathcal{R} t_j$; $i, j \in \{1, 2, \ldots, k\}$.

We shall say that the description of the document $t_1 = t_{11}, t_{12}, \ldots, t_{1k}$ is contained in the description of another document $t_2 = t_{21}, t_{22}, \ldots, t_{2l}$, and write $t_1 \preceq t_2$, if

$$\forall_{1 \le i \le k}\; \exists_{1 \le j \le l}\; (t_{1i} = t_{2j} \vee t_{1i} \mathcal{R} t_{2j}), \tag{2}$$
i.e. when each descriptor from $t_1$ is either identical with a descriptor from $t_2$ or is a generalization of a descriptor from $t_2$.

It can be verified that the relation $\preceq$ induces a (partial) order in the set of document descriptions. Indeed, from (2) it follows that this relation is reflexive and nonsymmetric; it remains to demonstrate that it is also transitive. Let $t_1 \preceq t_2$ and $t_2 \preceq t_3$, $t_3 = t_{31}, t_{32}, \ldots, t_{3r}$. Let us choose an arbitrary $i$, $1 \le i \le k$. Since $t_1 \preceq t_2$, by virtue of (2) there exists such $j$, $1 \le j \le l$, that either $t_{1i} = t_{2j}$ or $t_{1i} \mathcal{R} t_{2j}$. Since $t_2 \preceq t_3$, by the same token, there exists such $p$, $1 \le p \le r$, that either $t_{2j} = t_{3p}$ or $t_{2j} \mathcal{R} t_{3p}$. Consequently, we arrive at one of the following four cases:

$t_{1i} = t_{2j} = t_{3p}$, i.e. $t_{1i} = t_{3p}$
$t_{1i} = t_{2j} \mathcal{R} t_{3p}$, i.e. $t_{1i} \mathcal{R} t_{3p}$
$t_{1i} \mathcal{R} t_{2j} = t_{3p}$, i.e. $t_{1i} \mathcal{R} t_{3p}$
$t_{1i} \mathcal{R} t_{2j} \mathcal{R} t_{3p}$, i.e. $t_{1i} \mathcal{R} t_{3p}$ (since $\mathcal{R}$ is transitive)
Therefore the descriptor $t_{3p}$ is either identical with $t_{1i}$ or is its specification. Since $i$ was chosen arbitrarily, this conclusion is true for all descriptors from $t_1$, i.e. $t_1 \preceq t_3$. Q.E.D.

The process of obtaining the descriptions of documents will be henceforth referred to as the classification process.

Remarks. By accepting the above definition of the relation "to be contained in" we formalized the following intuitive theses:

1. If two descriptions of two documents differ only in that one contains descriptors not appearing in the other, the description containing more descriptors is "wider".

2. The occurrence in a description of a document of a specific descriptor implies applicability of all more general descriptors, which reduces the second part of the alternative in (2) to point 1 above. Moreover, even if such general descriptors were present in the document, the classification process would have deleted them.

As an example, let the descriptor "ordinary differential equation" be more general than the descriptor "Riccati's equation". If "Riccati's equation" appears in the description of one document, and "ordinary differential equation" in the description of another, then, ceteris paribus, the latter description is contained in the former.

3. QUERIES AND THESAURUS-BASED INFORMATION RETRIEVAL SYSTEMS
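As an illustration of the preceding section, the classification process and the relation "to be contained in" can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the immaterial symbols are taken to be already removed, so a document arrives as the list of thesaurus terms it contains; the generalization relation is a transitively closed set of (general, specific) pairs; and the names `classify`, `contained_in` and `descriptor_of`, as well as the terms "equation", "stability", "ODE" and "Riccati equation", are hypothetical.

```python
def classify(terms_in_document, descriptor_of, gen):
    """Description of a document.

    terms_in_document: thesaurus terms in order of occurrence
                       (immaterial symbols are assumed to be removed already).
    descriptor_of:     dict mapping each term to its synonymous descriptor.
    gen:               set of (general, specific) descriptor pairs,
                       assumed to be transitively closed.
    """
    description = []
    for term in terms_in_document:
        descriptor = descriptor_of[term]        # replace ascriptors by descriptors
        if descriptor not in description:       # delete repeated occurrences
            description.append(descriptor)
    # Delete descriptors whose further specification also occurs.
    return [d for d in description
            if not any((d, other) in gen for other in description)]


def contained_in(t1, t2, gen):
    """Relation (2): each descriptor of t1 equals or generalizes a descriptor of t2."""
    return all(any(a == b or (a, b) in gen for b in t2) for a in t1)


EQ, ODE = "equation", "ordinary differential equation"
RICCATI, STAB = "Riccati's equation", "stability"
descriptor_of = {EQ: EQ, ODE: ODE, RICCATI: RICCATI, STAB: STAB,
                 "ODE": ODE, "Riccati equation": RICCATI}
gen = {(EQ, ODE), (EQ, RICCATI), (ODE, RICCATI)}

former = classify(["Riccati equation", "ODE", STAB, RICCATI], descriptor_of, gen)
latter = classify([ODE, STAB, "ODE"], descriptor_of, gen)
print(former)   # ["Riccati's equation", 'stability']
print(latter)   # ['ordinary differential equation', 'stability']
print(contained_in(latter, former, gen), contained_in(former, latter, gen))  # True False
```

The first document mentions both "Riccati's equation" and its generalization, and the classification deletes the generalization; the second description is then contained in the first but not conversely, in agreement with the example of the Remarks.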
By a query we shall understand a finite sequence of descriptors without repetitions and generalizations. Thus, queries are of exactly the same form as document descriptions, and the relation "to be contained in" can be readily extended to queries and to query-description pairs.

Following SALTON ([2], Chapter 6.3) we shall consider a thesaurus-based information retrieval system to be a quadruple $(T, D, R, \rho)$, where $T$ is the thesaurus (with the descriptor set $T_0$), $D$ a collection of documents, $R$ a query set and $\rho$ a mapping $R \to 2^D$ assigning subsets of $D$ to queries. For a given query $r$, the set $Q = \rho(r) \subset D$ shall be called the response of the system to the query $r$.

The classification process described in Section 2 may assign identical descriptions to different documents (e.g. documents differing in immaterial symbols, in synonymous descriptors, in the use of generalizations and in the frequency with which ascriptors and descriptors occur). The very important practical problem of differentiating between identically classified documents shall not be considered in this paper, the remaining parts of which are devoted to some properties of sets and sequences of responses to sets and sequences of queries.

4. INCLUSIVENESS OF THESAURUS-BASED INFORMATION RETRIEVAL SYSTEMS
An information retrieval system is inclusive [2] if the query set is (partially) ordered and

$$\{r_1, r_2\} \subset R \wedge r_1 \preceq r_2 \Rightarrow \rho(r_1) \supset \rho(r_2). \tag{3}$$

The thesaurus-based information retrieval system in which the response to a query consists of all documents whose descriptions contain the query is inclusive.

In order to prove this statement we shall demonstrate first that for any two queries $r_1$, $r_2$ such that $r_1 \preceq r_2$, we have $\tau(r_1) \supset \tau(r_2)$, where

$$\tau(r) = \{t : \exists_{d \in D}\; t = t(d) \succeq r\}$$

is the set of document descriptions containing $r$, and $t(d)$ denotes the description of document $d$, obtained by the classification process. Assume $t \in \tau(r_2)$, i.e. $t \succeq r_2$; since $r_2 \succeq r_1$ and $\succeq$ is transitive, $t \succeq r_1$, hence $t \in \tau(r_1)$. It follows immediately that $t \in \tau(r_2) \Rightarrow t \in \tau(r_1)$ and therefore $\tau(r_2) \subset \tau(r_1)$.

It remains to show that $\tau(r_1) \supset \tau(r_2) \Rightarrow \rho(r_1) \supset \rho(r_2)$. By virtue of our assumption, $\rho(r) = \{d : d \in D \wedge t(d) \succeq r\} = \{d : d \in D \wedge t(d) \in \tau(r)\}$. Let $d' \in \rho(r_2)$, i.e. $d' \in D \wedge t(d') \in \tau(r_2)$. Since $\tau(r_2) \subset \tau(r_1)$, $t(d') \in \tau(r_1)$ and $d' \in \rho(r_1)$. This completes the proof of inclusiveness.

The following important conclusion (valid for any inclusive retrieval system) may be applied to the considered thesaurus-based information retrieval system:
if $r_1 \preceq r_2 \preceq \ldots \preceq r_n$, then $\rho(r_1) \supset \rho(r_2) \supset \ldots \supset \rho(r_n)$.

The practical value of this conclusion is that it permits the search for responses to more specific queries to be restricted to the response set of more general queries. This fact considerably simplifies the design of on-line retrieval systems, for which the basic assumption is that the user will improve on the scope of his query by a conversational iteration process (e.g. by using more specific descriptors or adding further ones); in such a system the response of each iteration step (after the first one) may be formed from the response given at the preceding step, without consulting complete backfiles.

5. FURTHER PROPERTIES OF THESAURUS-BASED INFORMATION RETRIEVAL SYSTEMS
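As an illustration of the conversational iteration discussed at the close of Section 4, the following minimal Python sketch forms each response from the preceding one instead of scanning the whole collection. It assumes that document descriptions are already available as lists of descriptors and that the generalization relation is a transitively closed set of (general, specific) pairs; the names `respond` and `candidates`, the document identifiers and the descriptors "equation" and "stability" are hypothetical.

```python
def contained_in(t1, t2, gen):
    """Relation (2): each descriptor of t1 equals or generalizes a descriptor of t2."""
    return all(any(a == b or (a, b) in gen for b in t2) for a in t1)


def respond(query, descriptions, gen, candidates=None):
    """rho(query): identifiers of documents whose description contains the query.

    If `candidates` is given, only those documents are examined; by
    inclusiveness this loses nothing when `candidates` is the response
    to a query contained in `query`.
    """
    pool = descriptions.keys() if candidates is None else candidates
    return {d for d in pool if contained_in(query, descriptions[d], gen)}


EQ, ODE = "equation", "ordinary differential equation"
RICCATI, STAB = "Riccati's equation", "stability"
gen = {(EQ, ODE), (EQ, RICCATI), (ODE, RICCATI)}
descriptions = {
    "d1": [RICCATI, STAB],
    "d2": [ODE],
    "d3": [RICCATI],
    "d4": [STAB],
    "d5": [EQ],
}

# A chain of increasingly specific queries: r1 is contained in r2, r2 in r3.
r1, r2, r3 = [EQ], [ODE], [RICCATI, STAB]

first = respond(r1, descriptions, gen)                      # full scan
second = respond(r2, descriptions, gen, candidates=first)   # scan of rho(r1) only
third = respond(r3, descriptions, gen, candidates=second)   # scan of rho(r2) only
print(sorted(first), sorted(second), sorted(third))
# ['d1', 'd2', 'd3', 'd5'] ['d1', 'd2', 'd3'] ['d1']
```

In this toy collection the restricted searches return the same sets as full scans would, and the three responses are nested as the conclusion above requires.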
Let $r_1 = r_{11}, r_{12}, \ldots, r_{1k}$ and $r_2 = r_{21}, r_{22}, \ldots, r_{2l}$ be queries. Denote by $\hat{r}$ the set of descriptors consisting of:

1°. Descriptors common to $r_1$ and $r_2$, i.e.

$$\exists_{1 \le j \le l}\; (r_{1i} = r_{2j}) \Rightarrow r_{1i} \in \hat{r}, \qquad 1 \le i \le k. \tag{4}$$

2°. Descriptors appearing in one query which are generalizations of descriptors of the second one, i.e.

$$\exists_{1 \le j \le l}\; (r_{1i} \mathcal{R} r_{2j}) \Rightarrow r_{1i} \in \hat{r}$$

and

$$\exists_{1 \le i \le k}\; (r_{2j} \mathcal{R} r_{1i}) \Rightarrow r_{2j} \in \hat{r}.$$

Let furthermore $r_i \setminus \hat{r}$ denote the set of descriptors of $r_i$ not included in $\hat{r}$, and $r_M$ the set of most specific common generalizations of pairs of descriptors from $r_1 \setminus \hat{r}$ and $r_2 \setminus \hat{r}$, i.e.

$$r \in r_M \Leftrightarrow \exists_{r' \in r_1 \setminus \hat{r}}\; \exists_{r'' \in r_2 \setminus \hat{r}} \Big( (r \mathcal{R} r' \wedge r \mathcal{R} r'') \wedge \neg \exists_{r^*} \big( r \mathcal{R} r^* \wedge (r^* \mathcal{R} r' \wedge r^* \mathcal{R} r'') \big) \Big).$$

Obviously, not every pair $r'$, $r''$ has a most specific common generalization.

Denote by $r_1 \cdot r_2$ the set of descriptors consisting of the union of $\hat{r}$ and $r_M$ with repetitions and specifications deleted (repetitions and specifications of descriptors may have been included in $r_M$). Denote by $r_1 + r_2$ the set of descriptors consisting of the union of $r_1$ and $r_2$, with repetitions and generalizations deleted. It is easy to see that if $r_1 \preceq r_2$ then $r_1 \cdot r_2 = r_1$ and $r_1 + r_2 = r_2$.

The following two properties (analogous to the theorems in [2], Ch. 6.3, 6.4) can be derived directly from the adopted definitions.
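Before the two properties are stated, the constructions of $r_1 \cdot r_2$ and $r_1 + r_2$ can be made concrete with a minimal Python sketch. It rests on assumptions not found in the text: queries are lists of descriptors, the generalization relation is obtained as the transitive closure of a few hypothetical (general, specific) pairs, and all function names and the toy descriptors are illustrative only.

```python
def transitive_closure(edges):
    """Smallest transitive relation containing the given (general, specific) pairs."""
    closure = set(edges)
    while True:
        extra = {(a, d) for a, b in closure for c, d in closure if b == c} - closure
        if not extra:
            return closure
        closure |= extra


def common_part(r1, r2, gen):
    """Descriptors shared by both queries, plus descriptors of one query
    that generalize a descriptor of the other (points 1 and 2 above)."""
    hat = [a for a in r1 if any(a == b or (a, b) in gen for b in r2)]
    hat += [b for b in r2 if b not in hat and any((b, a) in gen for a in r1)]
    return hat


def most_specific_common_generalizations(rest1, rest2, descriptors, gen):
    """r_M: for each pair, the common generalizations with no common generalization below them."""
    r_m = []
    for a in rest1:
        for b in rest2:
            common = [g for g in descriptors if (g, a) in gen and (g, b) in gen]
            for g in common:
                if g not in r_m and not any((g, h) in gen for h in common):
                    r_m.append(g)
    return r_m


def query_product(r1, r2, descriptors, gen):
    hat = common_part(r1, r2, gen)
    rest1 = [a for a in r1 if a not in hat]
    rest2 = [b for b in r2 if b not in hat]
    r_m = most_specific_common_generalizations(rest1, rest2, descriptors, gen)
    merged = list(dict.fromkeys(hat + r_m))                                # repetitions deleted
    return [x for x in merged if not any((y, x) in gen for y in merged)]   # specifications deleted


def query_sum(r1, r2, gen):
    merged = list(dict.fromkeys(r1 + r2))                                  # repetitions deleted
    return [x for x in merged if not any((x, y) in gen for y in merged)]   # generalizations deleted


EQ, DE = "equation", "differential equation"
ODE, PDE = "ordinary differential equation", "partial differential equation"
RICCATI, HEAT, STAB = "Riccati's equation", "heat equation", "stability"
descriptors = [EQ, DE, ODE, PDE, RICCATI, HEAT, STAB]
gen = transitive_closure({(EQ, DE), (DE, ODE), (DE, PDE), (ODE, RICCATI), (PDE, HEAT)})

r1, r2 = [RICCATI, STAB], [HEAT, STAB]
print(query_product(r1, r2, descriptors, gen))  # ['stability', 'differential equation']
print(query_sum(r1, r2, gen))                   # ["Riccati's equation", 'stability', 'heat equation']
```

Here "differential equation" enters the product although it occurs in neither query. For a pair with $r_1 \preceq r_2$, such as `[ODE]` and `[RICCATI, STAB]`, the sketch returns the first query as the product and the descriptors of the second as the sum.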
P 1. In a thesaurus-based information retrieval system, the response to query $r_1 + r_2$ is a subset of the set intersection of responses to queries $r_1$ and $r_2$. The response to query $r_1 \cdot r_2$ includes the set union of responses to queries $r_1$ and $r_2$. In formulae:

$$\rho(r_1 + r_2) \subset \rho(r_1) \cap \rho(r_2), \qquad \rho(r_1 \cdot r_2) \supset \rho(r_1) \cup \rho(r_2). \tag{5}$$
We shall prove first

$$r_1 + r_2 \succeq r_1, \qquad r_1 + r_2 \succeq r_2, \tag{6}$$

$$r_1 \cdot r_2 \preceq r_1, \qquad r_1 \cdot r_2 \preceq r_2. \tag{7}$$
Let $r \in r_1$; by construction of $r_1 + r_2$, either $r \in r_1 + r_2$, or descriptor $r$ has been deleted from the union of $r_1$ and $r_2$ because it was a generalization of a descriptor included in $r_1 + r_2$. In both cases the definition of $\preceq$ is satisfied. Quite similarly one proves the second part of (6).

In order to show the first relation (7) we shall consider a descriptor $r \in r_1 \cdot r_2$. By construction of $r_1 \cdot r_2$, either $r \in r_1$, or $r$ is a generalization of an element of $r_1$ (whose possible membership in $r_2$ is totally immaterial for our considerations). Direct application of the definition of the relation $\preceq$ completes the proof; similarly for the second part of (7).

The remainder of the proof of P 1 follows from the inclusiveness of the considered retrieval system. By virtue of (6),

$$\rho(r_1 + r_2) \subset \rho(r_1), \qquad \rho(r_1 + r_2) \subset \rho(r_2),$$

hence $\rho(r_1 + r_2) \subset \rho(r_1) \cap \rho(r_2)$. Similarly, by virtue of (7),

$$\rho(r_1 \cdot r_2) \supset \rho(r_1), \qquad \rho(r_1 \cdot r_2) \supset \rho(r_2),$$

hence $\rho(r_1 \cdot r_2) \supset \rho(r_1) \cup \rho(r_2)$. Q.E.D.
The first part of P 1 may be formulated somewhat more strongly:

P 2. In a thesaurus-based information retrieval system the response to query $r_1 + r_2$ is the set intersection of responses to queries $r_1$ and $r_2$.
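Property P 2 can be tried out on a small collection before its proof is read. The Python sketch below is an illustration under stated assumptions, not a verification: document descriptions are given directly as lists of descriptors, the generalization relation is a transitively closed set of (general, specific) pairs, and the names `respond`, `query_sum` and `p2_holds`, the document identifiers and the descriptor "stability" are hypothetical.

```python
def contained_in(t1, t2, gen):
    """Relation (2): each descriptor of t1 equals or generalizes a descriptor of t2."""
    return all(any(a == b or (a, b) in gen for b in t2) for a in t1)


def respond(query, descriptions, gen):
    """rho(query): identifiers of documents whose description contains the query."""
    return {d for d, t in descriptions.items() if contained_in(query, t, gen)}


def query_sum(r1, r2, gen):
    """r1 + r2: union of both queries with repetitions and generalizations deleted."""
    merged = list(dict.fromkeys(r1 + r2))
    return [x for x in merged if not any((x, y) in gen for y in merged)]


def p2_holds(r1, r2, descriptions, gen):
    """Compare rho(r1 + r2) with the intersection of rho(r1) and rho(r2)."""
    direct = respond(query_sum(r1, r2, gen), descriptions, gen)
    return direct == respond(r1, descriptions, gen) & respond(r2, descriptions, gen)


ODE, RICCATI, STAB = "ordinary differential equation", "Riccati's equation", "stability"
gen = {(ODE, RICCATI)}
descriptions = {
    "d1": [RICCATI, STAB],
    "d2": [ODE],
    "d3": [RICCATI],
    "d4": [STAB],
}

for r1, r2 in [([ODE], [STAB]), ([ODE], [RICCATI]), ([RICCATI], [STAB])]:
    print(query_sum(r1, r2, gen), p2_holds(r1, r2, descriptions, gen))
```

For each of the three query pairs the directly computed response coincides with the intersection, so in this example the response to the sum could be formed from two stored responses without a further search of the collection.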
We shall prove first that $\rho(r_1) \cap \rho(r_2) \subset \rho(r_1 + r_2)$. Let $d \in \rho(r_2) \cap \rho(r_1)$, i.e. $d \in \rho(r_1)$ and $d \in \rho(r_2)$. Since $d \in \rho(r_1)$, $t(d) \succeq r_1$, and since $d \in \rho(r_2)$, $t(d) \succeq r_2$. Let us choose an arbitrary descriptor $r \in r_1 + r_2$; by construction of the set $r_1 + r_2$, $r$ is an element of $r_1$ or of $r_2$ (or, what is quite immaterial for our present considerations, of both $r_1$ and $r_2$). Assuming, for example, $r \in r_1$, we observe that from $t(d) \succeq r_1$ it follows that $t(d)$ contains either $r$ or its specification; similarly for $r \in r_2$. Therefore, each descriptor from the query $r_1 + r_2$ either appears in $t(d)$ or is a generalization of a descriptor from $t(d)$, i.e. $t(d) \succeq r_1 + r_2$; consequently, $d \in \rho(r_1 + r_2)$. Hence, indeed,

$$\rho(r_1) \cap \rho(r_2) \subset \rho(r_1 + r_2).$$

Since, on the other hand, from P 1 it follows that $\rho(r_1) \cap \rho(r_2) \supset \rho(r_1 + r_2)$, we may write

$$\rho(r_1 + r_2) = \rho(r_1) \cap \rho(r_2).$$

Q.E.D.

Remarks. Unfortunately, since the construction of query $r_1 \cdot r_2$ admits in it descriptors alien to $r_1$ and $r_2$, the second part of P 1 cannot, in general, be similarly strengthened. The practical value of property P 2 consists in the simplification of the search process needed to form the response to the query which is the "sum" of two queries.

REFERENCES

[1] W. VAN ORMAN QUINE: From a Logical Point of View. Harvard University Press, Cambridge, Mass. (1964).
[2] G. SALTON: Automatic Information Organization and Retrieval. McGraw-Hill, New York (1968).