Local feedback and intelligent automatic query expansion


PIRKKO PIETILAINEN

University of Oulu, Faculty of Science, Library, SF-90570 Oulu 57, Finland

(Received for publication 4 July 1982)

Abstract-An iterative method for information retrieval is presented. It expands the query with searchonyms found in the previously retrieved set of documents. Only the largest values of the relation of resemblance between the query and the documents are used to form the feedback seed. From this top retrieved set of documents, the most informative features are selected as searchonyms, which are subsequently used in query reformulation. Large operational bibliographic data bases are used to simulate the behavior of this method.

1. INTRODUCTION

Local clustering methods [1, 2] try to replace manually or automatically constructed general (global) thesauri with alternative expressions found separately for each search, the searchonyms. Thesauri help the searcher select synonyms so as to avoid losing information that is expressed in other words. Global methods, in addition to being very expensive, are felt to be inferior to local methods in that they use general synonyms instead of searchonyms, which are unique to each search formulation. A synonym is a word or expression that can replace another word in any context, whereas a searchonym can serve as a replacement only in a given search formulation, i.e. in a specific context; this is why searchonyms should lead to more precise results.

In pattern recognition terms, retrieval may be viewed as a feature selection problem, see e.g. [3]. In a bibliographic data base these features may be index terms, title words, abstract words, citations, source information, etc. In the present method the most informative features of the first retrieved set of documents are selected and then used to reformulate the original query. The experiments reported here can be characterized as in Fig. 1.

To increase the precision and to avoid noise in searchonym selection, a proper weighting and normalization method is needed [1, 2]. The method used here takes only the top retrieved set of documents as the seed of a new query formulation. A top retrieved set is the set of best-ranked documents in a ranked output retrieval system. (It may also be the set of most relevant documents selected by the user.) For automatic ranking, the similarity measure described in [4], the relation of resemblance, is used here.

The present feedback approach is based on the notion that a bibliographic data base can be utilized as a knowledge base at the time of retrieval. The data base is seen here as a semantic network [5], but without any syntactic elements, in which searchonyms are generated like answers in a question-answering system.

Fig. 1. Query $q_1$ retrieves the document set $D_1$, from which searchonyms are selected to form a new query $q_2$; $q_2$ retrieves the set $D_2$, and the new searchonyms generated from it, $q_3$, are used to retrieve the set $D_3$.
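The iteration characterized in Fig. 1 can be sketched as a short loop. This is only an illustrative sketch: the function name `local_feedback` and the two callables `retrieve` and `select_searchonyms` are hypothetical placeholders for the ranking and feature-selection steps, and documents are assumed to be reduced to sets of title words.

```python
from typing import Callable, List, Sequence, Set

Query = Set[str]
Document = Set[str]  # assumption: a title reduced to its set of words


def local_feedback(
    q1: Query,
    retrieve: Callable[[Query], Sequence[Document]],
    select_searchonyms: Callable[[Sequence[Document]], Query],
    iterations: int = 2,
) -> List[Query]:
    """Return [q1, q2, ...]; each new query is built from the previous top set."""
    queries: List[Query] = [set(q1)]
    for _ in range(iterations):
        top_set = retrieve(queries[-1])          # D_i: top retrieved set for q_i
        new_query = select_searchonyms(top_set)  # q_(i+1): searchonyms found in D_i
        if not new_query:                        # nothing to feed back, so stop early
            break
        queries.append(new_query)
    return queries
```

With `iterations=2` the loop produces the independent queries $q_2$ and $q_3$ from the original query $q_1$, which corresponds to the two feedback steps shown in Fig. 1.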

This work was supported by NORDINFO (Nordic Council for Scientific Information and Research Libraries).


2. THEORY

The relation of resemblance [4] of a document $d$ and a query $q$ is defined as

$$R = \frac{\sum_{t \in d \cap q} \mathrm{inf}(t)}{\sum_{t \in q} \mathrm{inf}(t)} \tag{1}$$

where

$$\mathrm{inf}(t) = -\log_2 m(t) \tag{2}$$

is the amount of semantic information [6] of a term $t$, and $m(t)$ is its probability, defined as

$$m(t) = \frac{f(t)}{\sum_i f(t_i)} \tag{3}$$

where $f(t)$ is the frequency of documents in the data base containing the term $t$.

Equation (1) is used to rank the documents in nonincreasing partial order of resemblance to the query $q$. The relation of resemblance $R$ has $2^n - 1$ non-zero values for queries with $n$ terms, giving rise to $2^n - 1$ document sets with different $R$-values. In this feedback experiment, top ranked sets of documents are selected as seeds to generate searchonyms for the next search iteration. The seed size can be selected so that the total number of documents does not exceed a certain preselected number $k$. (In the following experiments, $k = 20$ is used.)

Feature extraction takes into account only those features (terms, descriptors, etc.) which appear in more than one document of the seed set. These features, the searchonyms, are then used to reformulate the query, either by mixing query terms, i.e. using $\bigcup_{i=1}^{j} q_i$ as the new query after $j$ iterations, or independently, using only the searchonyms obtained from the previously retrieved set as a new query. The amount of semantic information of a term $t$ found in a seed $s$ is

$$\mathrm{inf}(t_s) = f_s \, \mathrm{inf}(t) \tag{4}$$

where $f_s$ is the frequency of term $t$ in the seed $s$.
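The ranking, seed selection and searchonym weighting described in this section can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: it assumes that $R$ is the semantic information of the query terms present in a document divided by that of all query terms, that $m(t) = f(t) / \sum_i f(t_i)$, and that every term met has an entry in `doc_freq`. All function and parameter names are hypothetical, and excluding the current query terms from the new words (the `exclude` parameter) is an assumption as well.

```python
import math
from collections import Counter
from typing import Dict, FrozenSet, List, Sequence, Set


def inf(term: str, doc_freq: Dict[str, int], total: int) -> float:
    """Eqs. (2)-(3): inf(t) = -log2 m(t), with m(t) = f(t) / total."""
    return -math.log2(doc_freq[term] / total)


def resemblance(doc: Set[str], query: Set[str],
                doc_freq: Dict[str, int], total: int) -> float:
    """Assumed form of Eq. (1): matched query information / all query information."""
    # Sorting fixes the summation order, so equal term subsets give equal floats.
    matched = sum(inf(t, doc_freq, total) for t in sorted(query & doc))
    return matched / sum(inf(t, doc_freq, total) for t in sorted(query))


def select_seed(titles: Sequence[Set[str]], query: Set[str],
                doc_freq: Dict[str, int], total: int, k: int = 20) -> List[Set[str]]:
    """Add whole sets of equal R-value, best first, while the seed holds at most k titles."""
    levels: Dict[float, List[Set[str]]] = {}
    for title in titles:
        r = resemblance(title, query, doc_freq, total)
        if r > 0:  # only non-zero R-values can enter a seed
            levels.setdefault(r, []).append(title)
    seed: List[Set[str]] = []
    for r in sorted(levels, reverse=True):
        if len(seed) + len(levels[r]) > k:
            break
        seed.extend(levels[r])
    return seed


def searchonym_weights(seed: Sequence[Set[str]], doc_freq: Dict[str, int], total: int,
                       exclude: FrozenSet[str] = frozenset()) -> Dict[str, float]:
    """Eq. (4): weight f_s * inf(t) for terms occurring in more than one seed title."""
    f_s = Counter(t for title in seed for t in title)  # titles are sets: one count per title
    return {t: n * inf(t, doc_freq, total)
            for t, n in f_s.items() if n >= 2 and t not in exclude}


def percent_weights(weights: Dict[str, float]) -> Dict[str, float]:
    """Share of each searchonym in the new (independent) query, in per cent."""
    whole = sum(weights.values())
    return {t: 100.0 * w / whole for t, w in weights.items()}
```

Because the assumed $R$ depends only on which query terms a title contains, titles fall into at most $2^n - 1$ groups of equal non-zero $R$, and `select_seed` adds whole groups until the next one would push the seed past $k$. If the best group alone is larger than $k$, this sketch returns an empty seed.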

3. EXPERIMENT

In order to test the basic properties of local feedback from large data bases, only document titles are used in the searches. Abstract texts could also have been used but, as in full-text retrieval [1, 2], they might need some metrical constraints. Index terms added in bibliographic data bases can also be searched with this method and may be fed back to the searcher; a method along this line is described in [7]. Using natural language text has, however, wider application areas, see e.g. [8].

An experiment was done with the query $q_1$ = "FLOW in TUBES or PIPES with SUDDEN WIDENING or ENLARGEMENT", originally used by De Heer [9] and also used in testing the relation of resemblance [4]. The query $q_1$ was applied to the document titles (stop-words are omitted) in three data bases in the Lockheed/Dialog information retrieval system: NTIS 64-82/Iss13, COMPENDEX 70-82/Apr, and INSPEC 78-82/Iss0X.

The method described above was simulated by selecting as seeds the whole sets of highest $R$-values, so that the number of titles (documents) was less than 20. The sizes of seed 1 became 6, 16, and 10 titles in the files NTIS, COMPENDEX, and INSPEC respectively (Table 1). Seed 1 of the file COMPENDEX (16 titles) is shown completely in Fig. 2. All COMPENDEX seeds are presented as examples here (Figs. 2-4). These first seeds gave rise to 5, 14 and 6 new search words, the searchonyms $q_2$, in the three data bases respectively. These searchonyms are shown in Table 1 with their (title) frequency in the data base, $f$, their frequency in the seed, $f_s$, their amount of semantic information, $\mathrm{inf}(t_s)$, calculated as in (4), and the percentage of their weight in the new (independent) query, %. Fig. 3 shows the 18 top titles retrieved with $q_2$ (Table 1(b)), derived from seed 1 of the file COMPENDEX.

The second-iteration searchonym lists $q_3$ are given in Table 2 with the same information, $f$, $f_s$, $\mathrm{inf}(t_s)$ and %, as in Table 1, for all three files NTIS, COMPENDEX, and INSPEC. Seed sizes in the second iteration became 13, 18, and 7 respectively in these files, and the numbers of new words obtained were 13 (Table 2a), 9 (Table 2b), and 6 (Table 2c) respectively. The resulting top titles after three iterations in COMPENDEX are shown in Fig. 4.

Table 1. Searchonyms obtained from the first seed in three data bases: NTIS (a), COMPENDEX (b), and INSPEC (c). The words are given with their title frequency, $f$, frequency in seed 1, $f_s$, amount of semantic information, $\mathrm{inf}(t_s)$, and percentage of weight in the formulation of the new query, $q_2$.

4. ANALYSIS OF THE EXPERIMENT

Four of the searchonyms found appeared in all three data bases in the first two iterations. These are ABRUPT, DOWNSTREAM, DUCT, and CIRCULAR. In addition, there were 6 words appearing in two of the data bases concerned: PIPE, THROUGH, TRANSFER, OSCILLATIONS, TURBULENT, and INVESTIGATION. Three of these words can be considered synonymous with terms in the original query $q_1$: ABRUPT (SUDDEN), and DUCT and PIPE (TUBES, PIPES). The words TRANSFER, OSCILLATIONS and TURBULENT characterize the phenomena appearing in flows in these conditions, see e.g. [10] and Figs. 2-4.

An encouraging thing about these experiments is that completely irrelevant words appear only in the NTIS data base, and only two of them. The words TOTAL and CHARGE in $q_3$ (Table 2a) came from seed 2 of NTIS, which contained 2 completely irrelevant titles from high energy physics. All other seeds in all three data bases contained so little noise that it did not contribute to the new queries, i.e. completely irrelevant words did not appear more than once in the top retrieved set and thus did not create searchonyms.

Good examples of the power of searchonymity are the words DOWNSTREAM and THROUGH, which clearly, with proper weighting, are useful query terms in this connection but could probably not be obtained from any conventional thesaurus. It is also important to note that the singular forms PIPE and TUBE (Table 1) appear as searchonyms when their plural forms are used in the original query.

The method of mixing query words by using $\bigcup_{i=1}^{j} q_i$ as the new query after $j$ iterations might be a useful alternative, as then, for example, in COMPENDEX the search condition "SUDDEN EXPANSION" would have been obtained, i.e. SUDDEN from $q_1$ and EXPANSION from $q_3$ (Table 2). As can be seen in Fig. 4, after 2 independent iterations the method has found titles with synonymous expressions like "abrupt pipe expansion" (Ref. 1 in Fig. 4), "tube with an abrupt cross-sectional expansion" (Ref. 2 in Fig. 4), "abrupt changes in geometry" (Ref. 5 in Fig. 4), or "abrupt change of section" (Ref. 9 in Fig. 4), which are evident alternative expressions for "tubes or pipes with sudden widening or enlargement" of the original query.

Table 2. Searchonyms obtained from the second seed in three data bases: NTIS (a), COMPENDEX (b), and INSPEC (c). The words are given with their title frequency, $f$, frequency in seed 2, $f_s$, amount of semantic information, $\mathrm{inf}(t_s)$, and percentage of weight in the formulation of the new query, $q_3$.

Fig. 2. Top titles ($R \ge 0.28$) returned to the query $q_1$ in the COMPENDEX data base.

5. CONCLUDING REMARKS

Six concluding remarks can be made.

(1) This method produces both frequent and rare terms as searchonyms, and it can handle them properly, so that very common words do not dominate the new query but can still be used to improve it. "Normalization weighting", found to be the most useful method by Attar and Fraenkel [1, 2], is closest to the method used here, as both methods use term weights that are proportional to the local frequency, $f_s$, and that decrease with increasing data base frequency, $f$. Unlike "normalization weighting", however, the present method can also utilize frequent terms, owing to its weighting scheme. For example, in COMPENDEX the "normalization weighting"-type weight $f_s/f$ would give the word DOWNSTREAM 16.5 times greater weight than the word TUBE, whereas in this method the ratio is only 1.7 (Table 1), which is more plausible in this connection.

(2) The experiment also supports the finding of [1] that no stemming might be needed in local feedback. Attar and Fraenkel came to this conclusion when they compared unstemmed and stemmed clustering performance. Results along these lines here are the examples of obtaining singular forms of plural words in the original query, and the observation that one grammatical variant of a word might be better than another, like CROSS-SECTIONAL as opposed to CROSS-SECTION (Table 1a).

(3) Using only those seed terms which occur more than once in a seed seems to have a useful effect, as many completely irrelevant terms seem to be left out by this choice.

(4) In the experiment, a local cluster size of less than twenty titles was used. Increasing the seed size might increase the recall but would probably lower the precision, and hence it is not clear that essentially larger seeds are required with this method. Further research on the optimal seed size should be done.

(5) Even in these very large data bases, local feedback methods can produce, in addition to synonymity relations, useful relations specific to the search at hand. The latter relations are not normally produced by any global tool.

(6) Further investigation is clearly needed to find out, for example, what kinds of queries can produce searchonyms with this method. The examples presented here seem to indicate that typical titles of technical articles can be good queries in this respect.

Fig. 4. Top titles ($R > 0.44$) retrieved to the query $q_3$ in the COMPENDEX data base.

REFERENCES

[1] R. ATTAR and A. S. FRAENKEL, Experiments in local metrical feedback in full-text retrieval systems. Inform. Proc. Management 1981, 17, 115-126.
[2] R. ATTAR and A. S. FRAENKEL, Local feedback in full-text retrieval systems. J. ACM 1977, 24(3), 397-417.
[3] L. C. SMITH, Artificial intelligence applications in information systems. Ann. Rev. Inform. Sci. Technol. 1980, 15, 67-115.
[4] P. PIETILAINEN, Relation on resemblance in information retrieval. Inform. Proc. Management 1982, 18, 55-59.
[5] N. V. FINDLER, A heuristic information retrieval system based on associative networks. In Associative Networks (Edited by N. V. FINDLER), pp. 305-326. Academic Press, New York (1979).
[6] R. CARNAP and Y. BAR-HILLEL, An outline of a theory of semantic information, Research Lab. Electronics, Mass. Inst. Technol., Tech. Rept. No. 247 (1952). Reprinted in Y. BAR-HILLEL, Language and Information. Addison-Wesley, Reading, Mass. (1964).
[7] T. E. DOSZKOCS, An associative interactive dictionary (AID) for online bibliographic searching. Proceedings of the 41st Annual Meeting of the ASIS, Vol. 15, 1978, 105-109.
[8] D. E. WALKER, The organization and use of information: Contributions of information science, computational linguistics and artificial intelligence. J. ASIS 1981, 32, 347-363.
[9] T. DE HEER, Quasi comprehension of natural language simulated by means of information traces. Inform. Proc. Management 1979, 15, 89-98.
[10] A. M. B~NSOL et al., Fluid mechanics. Entrance and exit effects. Kirk-Othmer Encyclopedia of Chemical Technology, 3rd Edn (Edited by M. GRAYSON and D. ECKROTH), Vol. 10, %Q-594. Wiley and Sons, Inc., New York (1980).