LOCAL FEEDBACK AND INTELLIGENT AUTOMATIC QUERY EXPANSION†
PIRKKO PIETILAINEN
University of Oulu, Faculty of Science, Library, SF-90570, Oulu 57, Finland
(Received for publication 4 July 1982)
Abstract-An iterative method for information retrieval is presented. It uses searchonyms found in the previously retrieved set of documents for query expansion. Only the largest values of the relation of resemblance between the query and the documents are used to form the feedback seed. From this top retrieved set of documents, the most informative features are selected as searchonyms, which are subsequently used in query reformulation. Large operational bibliographic data bases are used to simulate the behavior of this method.

1. INTRODUCTION
Local clustering methods [1, 2] try to replace manually or automatically constructed general (global) thesauri with alternative expressions found per search, called searchonyms. Thesauri are used to help the searcher select synonyms, to avoid losing information that is expressed in other words. Global methods, in addition to being very expensive, are felt to be inferior to local methods in that they use general synonyms instead of searchonyms, which are unique to each search formulation. A synonym is a word or expression that can replace another word in any context, whereas a searchonym can be used as a replacement only in a given search formulation, i.e. in a specific context; this is why searchonyms should lead to more precise results.

In pattern recognition terms, retrieval may be viewed as a feature selection problem, see e.g. [3]. In a bibliographic data base these features may be index terms, title words, abstract words, citations, source information, etc. In the present method the most informative features of the first retrieved set of documents are selected and then used to reformulate the original query. The experiments reported here can be characterized as in Fig. 1.

To increase precision and to avoid noise in searchonym selection, a proper weighting and normalization method is needed [1, 2]. Here, only the top retrieved set of documents is used as the seed of a new query formulation.
A top retrieved set is the set of highest-ranked documents in a ranked-output retrieval system. (It may also be the set of most relevant documents selected by the user.) For automatic ranking, the similarity measure described in [4], the relation of resemblance, is used here. The present feedback approach is based on the notion that a bibliographic data base can be utilized as a knowledge base at the time of retrieval. The data base is seen here as a semantic network [5], but without any syntactic elements, where searchonyms are generated like answers in a question-answering system.
Fig. 1. Query $q_1$ retrieves the document set $D_1$, from which searchonyms are selected to form a new query $q_2$, which retrieves the set $D_2$; the new searchonyms generated from $D_2$ form $q_3$, which is used to retrieve the set $D_3$.
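The iteration shown in Fig. 1 can be sketched as a short loop. This is a minimal illustration, not the author's implementation: `local_feedback`, `toy_retrieve` and `toy_select` are hypothetical names, the two toy callables merely stand in for the ranked retrieval and feature selection steps, and the titles are invented.

```python
from collections import Counter
from typing import Callable, List, Set

Query = Set[str]
Document = Set[str]


def local_feedback(
    q1: Query,
    retrieve_top_set: Callable[[Query], List[Document]],
    select_searchonyms: Callable[[List[Document]], Query],
    iterations: int = 2,
) -> List[Query]:
    """Return [q1, q2, ...]; each new query consists of the searchonyms
    selected from the top set retrieved by the previous query."""
    queries = [set(q1)]
    for _ in range(iterations):
        top_set = retrieve_top_set(queries[-1])   # D_i
        new_query = select_searchonyms(top_set)   # q_(i+1)
        if not new_query:                         # nothing to feed back: stop
            break
        queries.append(new_query)
    return queries


# Invented toy data (assumption): titles reduced to sets of words.
titles = [
    {"flow", "pipe", "sudden", "enlargement"},
    {"flow", "duct", "sudden", "expansion"},
    {"flow", "duct", "abrupt", "expansion"},
    {"heat", "transfer", "abrupt", "expansion"},
]


def toy_retrieve(query: Query) -> List[Document]:
    # Stand-in for ranked retrieval: every title sharing a word with the query.
    return [t for t in titles if t & query]


def toy_select(top_set: List[Document]) -> Query:
    # Stand-in for feature selection: words found in more than one title.
    counts = Counter(w for t in top_set for w in t)
    return {w for w, n in counts.items() if n > 1}


for i, q in enumerate(local_feedback({"sudden"}, toy_retrieve, toy_select), 1):
    print(f"q{i} =", sorted(q))
```

Passing the two steps in as callables keeps the loop independent of any particular ranking or weighting scheme.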
†This work was supported by NORDINFO (Nordic Council for Scientific Information and Research Libraries).
2. THEORY
The relation of resemblance [4] of a document $d$ and a query $q$ is defined as

$$R = \frac{\sum_{t \in d \cap q} \operatorname{inf}(t)}{\sum_{t \in q} \operatorname{inf}(t)} \qquad (1)$$

where

$$\operatorname{inf}(t) = -\log_2 m(t) \qquad (2)$$

is the amount of semantic information [6] of a term $t$, and $m(t)$ is its probability, defined as

$$m(t) = \frac{f(t)}{\sum_i f(t_i)} \qquad (3)$$
where $f(t)$ is the number of documents in the data base containing the term $t$. Equation (1) is used to rank the documents in nonincreasing partial order of resemblance to the query $q$. The relation of resemblance $R$ can take $2^n - 1$ non-zero values for queries with $n$ terms (one for each non-empty subset of matched query terms), giving rise to $2^n - 1$ document sets with different R-values.

In this feedback experiment the top-ranked sets of documents are selected as seeds to generate searchonyms for the next search iteration. The seed size can be chosen so that the total number of documents does not exceed a preselected number $k$. (In the following experiments, $k = 20$ is used.) Feature extraction takes into account only those features (terms, descriptors, etc.) which appear in more than one document of the seed set. These features, the searchonyms, are then used to reformulate the query, either by mixing query terms, i.e. using $\bigcup_{i=1}^{j} q_i$ as the new query after $j$ iterations, or independently, using only the searchonyms obtained from the previously retrieved set as a new query.

The amount of semantic information of a term $t$ found in a seed $s$ is

$$\operatorname{inf}(f_s) = f_s \operatorname{inf}(t) \qquad (4)$$

where $f_s$ is the frequency of the term $t$ in the seed $s$.
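Equations (1)-(4) and the seed selection rule can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the author's implementation: documents and queries are modelled as sets of title words, $m(t)$ is normalized by the sum of all document frequencies, searchonyms exclude words already present in the query, the percentage is taken as each term's share of the summed weights, and all function names and the toy titles are hypothetical.

```python
import math
from collections import Counter
from itertools import groupby


def document_frequencies(docs):
    """f(t): number of documents containing term t."""
    return Counter(t for d in docs for t in set(d))


def inf(t, f, total):
    """Eq. (2): inf(t) = -log2 m(t), with m(t) = f(t) / total (assumed eq. (3))."""
    return -math.log2(f[t] / total)


def resemblance(doc, query, f, total):
    """Eq. (1) as assumed here: information of the matched query terms
    divided by the information of all query terms."""
    known = [t for t in query if t in f]   # terms absent from the data base are ignored
    denom = sum(inf(t, f, total) for t in known)
    if denom == 0:
        return 0.0
    return sum(inf(t, f, total) for t in known if t in doc) / denom


def select_seed(docs, query, f, total, k=20):
    """Add whole groups of equal R, best first, while the seed holds at most k documents."""
    scored = sorted(((resemblance(d, query, f, total), i) for i, d in enumerate(docs)),
                    key=lambda x: -x[0])
    seed = []
    for r, group in groupby(scored, key=lambda x: x[0]):
        members = [i for _, i in group]
        if r == 0 or len(seed) + len(members) > k:
            break
        seed.extend(members)
    return [docs[i] for i in seed]


def searchonyms(seed, query, f, total):
    """Terms found in more than one seed document, weighted by eq. (4):
    inf(f_s) = f_s * inf(t). Returns {term: (f_s, weight, percent of new query)}."""
    f_s = Counter(t for d in seed for t in set(d))
    weights = {t: n * inf(t, f, total) for t, n in f_s.items()
               if n > 1 and t not in query}   # excluding query words is an assumption
    norm = sum(weights.values()) or 1.0
    return {t: (f_s[t], w, 100 * w / norm) for t, w in weights.items()}


# Invented toy titles (assumption), reduced to sets of words.
titles = [
    {"flow", "duct", "sudden", "enlargement"},
    {"flow", "duct", "sudden", "enlargement", "downstream"},
    {"flow", "pipe", "abrupt", "expansion", "downstream"},
    {"heat", "transfer", "plate"},
]
f = document_frequencies(titles)
total = sum(f.values())
q1 = {"flow", "sudden", "enlargement"}
seed = select_seed(titles, q1, f, total, k=3)
for term, (fs, weight, percent) in sorted(searchonyms(seed, q1, f, total).items()):
    print(f"{term}: f={f[term]} f_s={fs} inf(f_s)={weight:.2f} {percent:.0f}%")
```

Because documents matching the same subset of query terms receive exactly the same R, grouping equal scores reproduces the idea of taking whole R-value sets as the seed rather than cutting a ranked list at an arbitrary position.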
3. EXPERIMENT
In order to test the basic properties of local feedback in large data bases, only document titles are used in the searches. Abstract texts could also have been used, but, as in full-text retrieval [1, 2], they might need some metrical constraints. Index terms added to bibliographic data bases can also be searched with this method and may be fed back to the searcher; a method along this line is described in [7]. Using natural language text has, however, wider application areas, see e.g. [8].

An experiment was done with the query "FLOW in TUBES or PIPES with SUDDEN WIDENING or ENLARGEMENT" = $q_1$, originally used by De Heer [9], which was also used in testing the relation of resemblance [4]. The query $q_1$ was applied to the document titles (stop-words omitted) in three data bases in the Lockheed/Dialog information retrieval system: NTIS 64-82/Iss13, COMPENDEX 70-82/Apr, and INSPEC 78-82/Iss08. The method described above was simulated by selecting as seeds the whole sets of highest R-values, so that the number of titles (documents) was less than 20. Seed 1 sizes became 6, 16, and 10 titles in the files NTIS, COMPENDEX, and INSPEC, respectively (Table 1). Seed 1 (16 titles) of the file COMPENDEX is shown completely in Fig. 2. All COMPENDEX seeds are presented as examples here (Figs. 2-4). These first seeds gave rise to 5, 14, and 6 new search words (searchonyms), forming $q_2$, in the respective data bases.

Table 1. Searchonyms obtained from the first seed in the three data bases NTIS (a), COMPENDEX (b), and INSPEC (c). The words are given with their title frequency, $f$, frequency in seed 1, $f_s$, amount of semantic information, $\operatorname{inf}(f_s)$, and percentage of weight in the formulation of the new query $q_2$.

These searchonyms are shown in Table 1 with their (title) frequency in the data base, $f$, their frequency in the seed, $f_s$, their amount of semantic information, $\operatorname{inf}(f_s)$, calculated as in (4), and the percentage of their weight in the new (independent) query, %. Fig. 3 shows the 18 top titles retrieved with $q_2$ (Table 1(b)), derived from seed 1 of the file COMPENDEX. The second-iteration searchonym lists $q_3$ are given in Table 2 with the same information $f$, $f_s$, $\operatorname{inf}(f_s)$ and % as in Table 1, for all three files NTIS, COMPENDEX, and INSPEC. Seed sizes in the second iteration became 13, 18, and 7, respectively, in these files, and the numbers of new words obtained were 13 (Table 2a), 9 (Table 2b), and 6 (Table 2c), respectively. The resulting top titles after three iterations in COMPENDEX are shown in Fig. 4.

4. ANALYSIS OF THE EXPERIMENT
Four of the searchonyms found appeared in all three data bases in the first two iterations.
Table 2. Searchonyms obtained from the second seed in the three data bases NTIS (a), COMPENDEX (b), and INSPEC (c). The words are given with their title frequency, $f$, frequency in seed 2, $f_s$, amount of semantic information, $\operatorname{inf}(f_s)$, and percentage of weight in the formulation of the new query $q_3$.
These are ABRUPT, DOWNSTREAM, DUCT, and CIRCULAR. In addition, there were 6 words appearing in two of the data bases concerned: PIPE, THROUGH, TRANSFER, OSCILLATIONS, TURBULENT, and INVESTIGATION. Three of these words can be considered synonymous with terms in the original query $q_1$: ABRUPT (SUDDEN), and DUCT and PIPE (TUBES, PIPES). The words TRANSFER, OSCILLATIONS and TURBULENT characterize the phenomena appearing in flows under these conditions, see e.g. [10] and Figs. 2-4.

An encouraging result of these experiments is that completely irrelevant words appear only in the NTIS data base, and only two of them. The words TOTAL and CHARGE in $q_3$ (Table 2a) came from seed 2 of NTIS, which contained 2 completely irrelevant titles from high energy physics. All other seeds in all three data bases contained so little noise that it did not contribute to the new queries, i.e. completely irrelevant words did not appear more than once in the top retrieved set and thus did not create searchonyms.

Good examples of the power of searchonymity are the words DOWNSTREAM and THROUGH, which, with proper weighting, are clearly useful query terms in this context but could probably not be obtained from any conventional thesaurus. It is also important to note that the singular forms PIPE and TUBE (Table 1) appear as searchonyms when their plural forms are used in the original query.

Mixing query words by using $\bigcup_{i=1}^{j} q_i$ as the new query after $j$ iterations might be a useful alternative, as then, for example, the search condition "SUDDEN EXPANSION" would have been obtained in COMPENDEX, i.e. SUDDEN from $q_1$ and EXPANSION from $q_3$ (Table 2).

As can be seen in Fig. 4, after 2 independent iterations the method has found titles with synonymous expressions like "abrupt pipe expansion" (Ref. 1 in Fig. 4), "tube with an abrupt cross-sectional expansion" (Ref. 2 in Fig. 4), "abrupt changes in geometry" (Ref. 5 in Fig. 4), or "abrupt change of section" (Ref. 9 in Fig. 4), which are evident alternative expressions for "tubes or pipes with sudden widening or enlargement" in the original query.

Fig. 2. Top titles (R ≥ 0.28) returned to the query $q_1$ in the COMPENDEX data base.

5. CONCLUDING REMARKS
Six concluding remarks can be made.

(1) This method produces both frequent and rare terms as searchonyms, and it handles them properly: very common words do not dominate the new query, yet they can still be used to improve it. "Normalization weighting", found to be the most useful method by Attar and Fraenkel [1, 2], is closest to the method used here, as both methods use term weights proportional to the local frequency, $f_s$, and decreasing with increasing data base frequency, $f$. Unlike "normalization weighting", however, the present method can also utilize frequent terms, owing to its weighting scheme. For example, in COMPENDEX the "normalization weighting"-type weight $f_s/f$ would give the word DOWNSTREAM a 16.5 times greater weight than the word TUBE, whereas in this method the ratio is only 1.7 (Table 1), which is more plausible in this context.

(2) The experiment also supports the finding of [1] that stemming might not be needed in local feedback. Attar and Fraenkel came to this conclusion when they compared unstemmed and stemmed clustering performance. Results along these lines here are the singular forms obtained for plural words in the original query, and the observation that one grammatical variant of a word might be better than another, like CROSSSECTIONAL as opposed to CROSS-SECTION (Table 1a).
Fig. 4. Top titles (R > 0.44) retrieved with the query $q_3$ in the COMPENDEX data base.
(3) Using only those seed terms which occur more than once in a seed seems to have a useful effect, as many completely irrelevant terms seem to be left out by this choice.

(4) In the experiment a local cluster size of less than twenty titles was used. Increasing the seed size might increase recall but would probably lower precision; hence it is not clear that essentially larger seeds are required with this method. Further research on the optimal seed size should be done.

(5) Even in these very large data bases, local feedback methods can produce, in addition to synonymity relations, useful relations specific to the search at hand. The latter relations are not normally produced by any global tool.

(6) Further investigation is clearly needed to find out, for example, what kinds of queries can produce searchonyms with this method. The examples presented here seem to indicate that typical titles of technical articles can be good queries in this respect.

REFERENCES
[1] R. ATTAR and A. S. FRAENKEL, Experiments in local metrical feedback in full-text retrieval systems. Inform. Proc. Management 1981, 17, 115-126.
[2] R. ATTAR and A. S. FRAENKEL, Local feedback in full-text retrieval systems. J. ACM 1977, 24(3), 397-417.
[3] L. C. SMITH, Artificial intelligence applications in information systems. Ann. Rev. Inform. Sci. Technol. 1980, 15, 67-115.
[4] P. PIETILAINEN, Relation on resemblance in information retrieval. Inform. Proc. Management 1982, 18, 55-59.
[5] N. V. FINDLER, A heuristic information retrieval system based on associative networks. In Associative Networks (Edited by N. V. FINDLER), pp. 305-326. Academic Press, New York (1979).
[6] R. CARNAP and Y. BAR-HILLEL, An outline of a theory of semantic information. Research Lab. Electronics, Mass. Inst. Technol., Tech. Rept. No. 247 (1952). Reprinted in Y. BAR-HILLEL, Language and Information. Addison-Wesley, Reading, Mass. (1964).
[7] T. E. DOSZKOCS, An associative interactive dictionary (AID) for online bibliographic searching. Proceedings of the 41st Annual Meeting of the ASIS, Vol. 15, 1978, 105-109.
[8] D. E. WALKER, The organization and use of information: Contributions of information science, computational linguistics and artificial intelligence. J. ASIS 1981, 32, 347-363.
[9] T. DE HEER, Quasi comprehension of natural language simulated by means of information traces. Inform. Proc. Management 1979, 15, 89-98.
[10] A. M. BENSON et al., Fluid mechanics. Entrance and exit effects. Kirk-Othmer Encyclopedia of Chemical Technology 3. (Edited by M. GRAYSON and D. ECKROTH) Vol. 10, pp. [illegible]-594. Wiley and Sons, Inc., New York (1980).