Dynamic dictionary updating

Information Processing & Management, Vol. 13, pp. 235-245. Pergamon Press, 1977. Printed in Great Britain.


DYNAMIC DICTIONARY UPDATING

ROBERT G. CRAWFORD

Department of Computing and Information Science, Queen's University, Kingston, Ontario, Canada

Abstract: A method for updating the dictionary in a dynamic information retrieval system is presented. It is shown that as a collection changes through addition and deletion of documents, the appropriate set of index terms may be determined without complete periodic regeneration of the dictionary. Results are presented for experiments involving a complete change in collection membership, with the dynamic dictionary updating methods shown to be effective.

1. INTRODUCTION

At the conclusion of his paper “A Theory of Indexing”, Salton lists a number of important questions which must be examined if the theory is to receive practical application. One of these questions is as follows [1, p. 60]:

“Can the computation of term values obtained from a static model of a given document collection be maintained in a dynamic environment where old documents are removed, and new ones are added? If not, how often must one recompute the term values?”

This question defines quite well the considerations of this paper. It is clear that the dictionary must reflect the subject matter of the collection. Consider two ways, as implied in the above question, in which this may be assured.
(i) Periodic regeneration of the dictionary. At intervals, when the document collection and/or terminology has undergone a change, a new dictionary may be constructed. There are two problems with this.
(a) Prior to reconstructing the dictionary, performance of the system may be expected to decline as the dictionary becomes less reflective of the document collection.
(b) For very large collections, the time required to do this at reasonable intervals may be prohibitive.
(ii) Continuous updating of the dictionary. All changes, as they occur in the system, are reflected by updating of the dictionary. Here the principal problem is the development of a methodology to do this.
In this paper a methodology for dynamic updating of the dictionary is presented, thus answering the introductory question in the affirmative.

2. BASIC CONSIDERATIONS

There are four areas which should be briefly outlined to serve as a basis for the developments presented in this paper.

2.1 The environment
The discussion and results to be presented are in the context of a fully automatic information retrieval system. That is, dictionary construction, indexing, classification, and retrieval are all performed using fully automatic methods. Thus, it is assumed that all terms in a document surrogate (title, abstract, or full text) are considered for assignment as index terms to that document and appropriate weights are also calculated and assigned.

2.2 Purpose of a dynamic dictionary
The purpose of a dynamic dictionary is to provide for:
(i) the maintenance of existing levels of performance by reflecting changes in the document set, i.e. additions and retirements;
(ii) the improvement of existing levels of performance by reflecting changes due to document space modification [2-4].
New documents are indexed by a dictionary which fully reflects the document collection, thus maintaining system performance. In the static system this is not true, and new documents may be indexed by a dictionary which does not reflect changes in the collection due to recent document additions and retirements.

Additionally, however, document space modification is reflected in the dynamic dictionary, providing for improved performance of a query not only because of the actual document space changes, but also due to improved indexing of the query by the updated dictionary.

2.3 Transactions in a dynamic system
It is important to bear in mind the situations affecting terminology which may occur in a dynamic information retrieval system.
(i) A new term is entered into the subject area of the collection, representing either:
(a) a new idea in the subject area, or
(b) a new terminology for an existing concept.
(ii) An old term passes from common usage due to either:
(a) decreasing importance of the idea it represents, or
(b) the introduction of new terminology replacing the old term.
(iii) The coverage of the collection shifts:
(a) from one subject area to another due to changes in the interests of the users, or
(b) from a specific subject area to a more general subject area, or
(c) from a general subject area to a specific subset of that subject.
(iv) Through document modifications, based on user evaluations:
(a) a term increases in weight in the collection, or
(b) a term decreases in weight in the collection.
It is these situations which must be handled through the dynamic dictionary.

2.4 The static dictionary
To understand the problems involved with dynamic dictionary updating for the situations listed in the previous section, consider the organization of a static dictionary and document collection. Figure 1 shows a sample of a dictionary and a document vector. Consider how updating may be done on the basis of the total information contained in these files. For example, suppose that document 1291 as shown in Fig. 1 were retired from the collection. In what manner would this affect concept number 172 and its associated term in the dictionary? It is possible that the term is no longer used anywhere in the collection (i.e. document frequency = 0), but this cannot be determined without an examination of all document vectors, which is prohibitive. Alternatively, suppose that concept number 172 is still used in the collection but is no longer a useful index term. How can the other documents which were indexed by this concept be changed?

Dictionary
Term          Concept number
Bacillus      1031
Bacteria      172
Basophilic    3019
Behavior      462
Benign        701

Document vector
Document number: 1291
Concept/weight pairs: 172/1, 191/2, 367/1, 491/1, 567/3, 3412/1, 4096/2

Fig. 1. Sample elements of a static system.

An additional problem also may occur. Suppose, for example, that the term “ocular” was not used to index a set of documents in ophthalmology because it occurred in a high proportion of the documents and was, therefore, a non-discriminator. However, over a period of time the collection is greatly expanded, with the addition of documents in more general areas of medicine. Also, many of the specific documents in ophthalmology are retired. In other words, in the present collection the term “ocular” is an excellent discriminator and should be included as an index term. Two questions must be answered.
(i) What information must be maintained so that a shift in “status” of a term (e.g. from non-discriminator to discriminator) may be detected?
(ii) Given the change in “status” of a term, how may the indexing of the documents be modified to account for this? (That is, the original documents were not indexed by “ocular”. Must we re-examine all the original text?)
Part of the purpose of the previous examples is to motivate the need for maintaining further information if dynamic dictionary updating is to be possible. In the following section, precisely what information must be kept is outlined.

3. THE DYNAMIC SYSTEM: DICTIONARY, DOCUMENTS AND TERM STATUS MAP

Given the necessity for additional information for use in updating of the dictionary, it must be determined what information must be maintained, where it is to be kept, and how it is to be updated. It appears desirable to include as much of the necessary information as possible as a part of the existing dictionary and document files, as opposed to constructing many new and (possibly) redundant files. The following constraints must therefore be met:
(i) The dictionary must contain all terms used in the collection and have a concept number for each.
(ii) The document vector must have a concept number and associated weight for all terms occurring in the document (i.e. “full” information for each document).
A dictionary and a document collection of this form are the basis for a dynamic system. Additionally, it is necessary to have:
(iii) Information associated with each concept, sufficient to determine the status of the concept.
(iv) A provision for searching the collection using only “index” terms, rather than the “full” vectors which seem to be suggested by (i) and (ii).
This fourth provision is crucial. Searches done using “full” vectors are shown to provide for a very low (and unacceptable) level of retrieval performance [5]. A dictionary which includes all terms in the collection (i.e. stop words or low frequency words are not omitted to keep the dictionary size small) and a set of document vectors which include concepts and weights for all terms in each document (provisions (i) and (ii)) are easy to construct. Thus, consider the problem of what information about a concept is sufficient to determine its status as an index term (provision (iii)).

Consider a term, k, and its possible status in the collection as a result of the initial construction of the dictionary. Term k:
(i) Is not an index term due to high frequency of occurrence.
(ii) Is not an index term due to very low frequency of occurrence.
(iii) Is not an index term since it is a non-discriminator or poor discriminator.
(iv) Is an index term since it is a discriminator.
Now, under addition, deletion, or modification of documents, the status of term k may change. Cases (i) and (ii) are easily handled by maintaining frequencies of occurrence of terms. It is cases (iii) and (iv) which present difficulty. It is shown that the discrimination value, which has previously proven effective for static dictionary construction [5, 6], may be revised for use in the dynamic situation.

3.1 Dynamic discrimination value
The discrimination value of a term i is defined as follows:

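$$D_i = \frac{Q_i - Q}{Q} \times 100 \qquad (1)$$

where Q is the compactness of the collection and Qi is the compactness with term i deleted. This may be re-written as:

$$D_i = \frac{\Delta Q_i}{Q} \times 100, \quad \text{where } \Delta Q_i = Q_i - Q. \qquad (2)$$

Now Qi and Q are each a function of all documents in the document space. However, ΔQi is a function only of those documents in which term i actually occurs. Further, for a collection of N documents, $d_1, \ldots, d_N$, with centroid $c$:

$$Q = \frac{1}{N} \sum_{j=1}^{N} \cos(c, d_j) \qquad (3)$$

$$Q_i = \frac{1}{N} \sum_{j=1}^{N} \cos(c^i, d_j^i) \qquad (4)$$

where superscript i indicates the deleted term. Thus:

$$\Delta Q_i = Q_i - Q = \frac{1}{N} \sum_{j=1}^{N} \left[ \cos(c^i, d_j^i) - \cos(c, d_j) \right] \qquad (5)$$

Clearly, the term in brackets approximates zero for all j such that term i does not occur in $d_j$; thus, to a close approximation, this may be written

$$\Delta Q_i = \frac{1}{N} \sum_{j:\, d_{ji} \neq 0} \left[ \cos(c^i, d_j^i) - \cos(c, d_j) \right]$$

where $d_{ji}$ is the weight of term i in document $d_j$.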

The quantity ΔQi has the property of being easier to compute than Qi and is therefore considered for usefulness in updating. Clearly, if ΔQi can be maintained for each term i, then it is a simple matter to compute Di (using eqn (2)) for any term i for which the value of Qi changes. The value Di for a term, as computed on the basis of updated values of ΔQi, is the dynamic discrimination value. The usefulness of the dynamic discrimination value is tested and the results discussed in a later section.
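The relationship between the exact difference Qi - Q and its restriction to the documents containing term i can be illustrated with a small Python sketch. Everything here is an illustrative assumption rather than the paper's implementation: documents are dictionaries mapping concept numbers to weights, the centroid is kept as the plain sum of the document vectors (the cosine is unaffected by scaling), and the four toy documents and all function names are hypothetical.

```python
import math

def cos(u, v):
    """Cosine correlation of two sparse vectors given as {concept: weight}."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def without(vec, i):
    """Copy of vec with term i deleted."""
    return {k: w for k, w in vec.items() if k != i}

def centroid(docs):
    """Centroid as the sum of the document vectors."""
    c = {}
    for d in docs:
        for k, w in d.items():
            c[k] = c.get(k, 0.0) + w
    return c

def compactness(docs, deleted=None):
    """Q, or Q_i when a term is deleted from every vector."""
    if deleted is not None:
        docs = [without(d, deleted) for d in docs]
    c = centroid(docs)
    return sum(cos(c, d) for d in docs) / len(docs)

def delta_q_exact(docs, i):
    """Q_i - Q computed over the whole collection."""
    return compactness(docs, deleted=i) - compactness(docs)

def delta_q_approx(docs, i):
    """Same difference, summed only over documents in which term i occurs."""
    c = centroid(docs)
    ci = without(c, i)
    total = sum(cos(ci, without(d, i)) - cos(c, d) for d in docs if d.get(i, 0))
    return total / len(docs)

def discrimination_value(dq, q):
    """D_i = 100 * dQ_i / Q."""
    return 100.0 * dq / q

# Hypothetical toy collection: concept 9 occurs everywhere, concept 5 only once.
docs = [
    {1: 2, 2: 1, 9: 1},
    {1: 1, 3: 2, 9: 1},
    {2: 1, 4: 2, 9: 1},
    {3: 1, 5: 2, 9: 1},
]
```

For a term that occurs in every document the two quantities coincide, and such a term comes out with a negative value, as expected of a non-discriminator. For a rare term the restricted sum ignores the small shift of the centroid seen by the other documents, so in a collection this tiny the two agree in sign but not closely in magnitude; the neglected shift is spread over more documents as the collection grows.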

3.2 The term status map

It is apparent then that, for each term in the collection, the following information is necessary to determine the status of the term for use in indexing:
(i) the document frequency of occurrence
(ii) the total frequency of occurrence
(iii) the dynamic discrimination value (by keeping ΔQi and Di for each concept i).
This answers the question of what additional information must be kept. It must be determined where this information is to be maintained, and finally shown how it may be updated.

Figure 2 shows samples of the files required for dynamic dictionary updating. The Dictionary and the Document vectors are maintained as in the conventional system, except that all terms are included. That is, for each natural language term in the dictionary there is a list of the concept numbers into which the term is mapped. Also, for each document there is a vector specifying the weight of each concept occurring in that document. It is the third file shown in Fig. 2, the term status map, which enables the dynamics. As shown, there is for each concept in the system a record of the information necessary for dynamically determining the index status of the term, as well as an indicator showing whether the concept, as presently used, is considered to be an index term.

Algorithms for the dynamic updating and utilization of the term status map must be described. However, it is first important to consider how the basic operations in an information retrieval system are carried out using the files described.

(a) Dictionary
Term          Concept
Bacillus      1031
Bacteria      174
Basophilic    3019
Be            380
Because       5011
Behavior      462
Benign        781

(b) Document vectors (full)
Document number: 1291
Concepts/weights: 174/1, 191/2, 367/1, 380/1, 491/1, 567/3, 3412/1, 4096/2, 5010/2, 5052/1

(c) Term status map
Concept   Document    Total       ΔQi      Di        Index term
          frequency   frequency                      status
174       12          16          .061     8.133     Yes
380       132         215         -.046    -6.004    No
462       11          14          .009     1.221     Yes
781       14          18          .072     9.734     Yes
1031      6           8           .037     5.064     Yes

Fig. 2. Samples of files required for dynamic dictionary updating.
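One way to picture a term status map record is as a small keyed structure, one entry per concept number. The sketch below is only an assumed layout: the class and field names are hypothetical, and the five records are the sample rows of Fig. 2(c). The helper shows that each row's ΔQi and Di are mutually consistent under eqn (2) if Di is expressed as a percentage of Q.

```python
from dataclasses import dataclass

@dataclass
class TermStatus:
    """One term status map record (field names are illustrative)."""
    doc_freq: int        # number of documents in which the concept occurs
    total_freq: int      # total frequency of occurrence
    delta_q: float       # maintained value of dQ_i
    d_value: float       # dynamic discrimination value D_i
    is_index_term: bool  # current index status indicator

# Sample rows of Fig. 2(c), keyed by concept number.
term_status_map = {
    174: TermStatus(12, 16, 0.061, 8.133, True),
    380: TermStatus(132, 215, -0.046, -6.004, False),
    462: TermStatus(11, 14, 0.009, 1.221, True),
    781: TermStatus(14, 18, 0.072, 9.734, True),
    1031: TermStatus(6, 8, 0.037, 5.064, True),
}

def implied_compactness(rec):
    """Q implied by a record if D_i = 100 * dQ_i / Q."""
    return 100.0 * rec.delta_q / rec.d_value

def index_concepts(status_map):
    """Concept numbers currently flagged as index terms."""
    return sorted(c for c, rec in status_map.items() if rec.is_index_term)
```

Keying the records by concept number means that both the update algorithms and query processing reach a record with a single lookup.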

3.3 Standard operations with the dynamic dictionary
The design of the files discussed in the previous section (the dictionary, document and term status map files) becomes clearer as their use is considered. Ignoring momentarily the actual updating process, consider the use of these files under standard operations of indexing and searching.

DOCUMENT INDEXING is done as in a static system; however, the vectors produced are “full” vectors, containing a concept number and weight for every term occurring in the document. The effect of these full vectors on the search process is insignificant when query indexing is done as follows.

QUERY INDEXING is a two step process. First, the dictionary is used to index the query, producing a “full” query vector having a concept and weight for every term occurring in the query (unless the term is new to the system). This full query vector is then filtered through the term status map to produce an index query vector containing only concepts which are currently to be used for indexing. That is, each concept in the full query vector is either kept in the index query vector (in the case where the indicator in the term status map indicates an index term) or deleted (in the case where the indicator indicates a non-index term). An example of this process is shown in Fig. 3. Most of the processing involved in handling natural language queries is in the “lookup” of the words of the text in the dictionary. Thus, the additional step of mapping these concepts into the final index query vector does not add significantly to the processing required.

(a) Natural language query
“What is the post-operative procedure for patients undergoing corneal transplant.”

(b) Full query vector (concepts to represent every term in the original query; concepts/weights)
974/1, 977/1, 1835/1, 2490/1, 3373/1, 3569/1, 3681/1, 4652/1, 4771/1, 4864/1, 5066/1

(c) Index query vector (only terms which are shown by the term status map to be index terms are included)
3569/1, 3681/1, 4771/1

Fig. 3. Example of query indexing in the dynamic system.

SEARCHING in the dynamic system, whether clustered or not, occurs much as in a static system. The final index query vector is correlated with centroids and document vectors as necessary to fulfill the requirements of the search algorithm. The principal difference from a standard search lies in the use of the full document vectors. It has been shown [5] that when ranking correlation is used, results are not affected by the use of full document vectors. This is due to the fairly constant proportionality between the length of the “full” and “index” vectors for documents.

Having discussed what information is required to dynamically update the dictionary, where this information is maintained, and how it is utilized to perform standard retrieval operations, it is necessary to consider how the dictionary is updated.

3.4 Dynamic dictionary updating algorithms
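Before turning to the update algorithms, the two-step query indexing of Section 3.3 can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not the system's code: the function names are hypothetical, the word-to-concept entries in `toy_dictionary` are invented for the example (Fig. 3 does not say which word maps to which concept), and the index status flags are simply read off Fig. 3, where only concepts 3569, 3681 and 4771 survive filtering.

```python
def full_query_vector(words, dictionary):
    """Step 1: look each word up in the dictionary; terms new to the system are skipped."""
    full = {}
    for word in words:
        for concept in dictionary.get(word.lower(), []):
            full[concept] = full.get(concept, 0) + 1
    return full

def index_query_vector(full_query, index_status):
    """Step 2: keep only concepts the term status map flags as index terms."""
    return {c: w for c, w in full_query.items() if index_status.get(c, False)}

# Full query vector of Fig. 3(b).
fig3_full = {974: 1, 977: 1, 1835: 1, 2490: 1, 3373: 1, 3569: 1,
             3681: 1, 4652: 1, 4771: 1, 4864: 1, 5066: 1}

# Index status as implied by Fig. 3(c): only three concepts are index terms.
fig3_status = {c: c in (3569, 3681, 4771) for c in fig3_full}

# Hypothetical dictionary fragment, for illustrating step 1 only.
toy_dictionary = {"corneal": [3569], "transplant": [4771]}
```

Applying `index_query_vector(fig3_full, fig3_status)` yields the three-concept vector of Fig. 3(c). The filter costs one status lookup per query concept, which is small next to the dictionary lookups of step 1.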

The dictionary, more particularly the term status map, is updated following the addition of new documents, the retirement of old documents and the modification of the document space. The dictionary is not directly updated as a result of query submission, but is updated if the document space is modified as a result of the query. Thus, queries do not introduce new terminology to the system. All new terms are entered through occurrence in new documents. However, through their use in queries, their occurrence may be expanded to old documents by document space modification.

Updating of the three files (as shown in Fig. 2) proceeds as follows. The dictionary is updated when:
(i) A new term is introduced by its occurrence in a new document.
(ii) An old term is deleted as it is no longer used in any document (due to document retirement or modification which effected its reduction to zero weight).
The Term Status Map and the Document space are updated when:

(i) New documents are added.
(ii) Old documents are retired.
(iii) The document space is modified.
Thus, dictionary updating algorithms are needed to handle the cases of document addition and document retirement. However, the third case, that of document space modification, may be considered simply as a combination of the first two processes. During document space modification a document (the unmodified one) is retired from the collection and a new document (the modified one) is added to the collection. Figure 4 summarizes the notation used to express the algorithms.

c          centroid vector of document space
c_i        frequency (weight) of term i in c
d          full document vector for the document under consideration (i.e., being added or deleted)
d_i        frequency (weight) of term i in d
N          number of documents in the collection
Q          compactness of the collection: (1/N) Σ_{j=1..N} cos(c, d_j)
Qsum       total compactness of the collection (i.e. N*Q)
TCOS       correlation between centroid and full document
TCOS_i     correlation between centroid and document with term i deleted
ΔCOS_i     difference in correlation due to term i
ΔQ_i       sum of correlation differences due to term i
D_i        dynamic discrimination value of term i
DOCFREQ_i  number of documents in which term i occurs

Note that c, N, Q, and Qsum are all “global values”, i.e., are values which give information about the entire collection; d and TCOS values pertain to a single document; the other values all relate to a particular term.

Fig. 4. Notation used in dictionary updating algorithms.

Figure 5 outlines the algorithm (Algorithm A) for dynamic modification of the set of index terms upon addition of a new document to the system. This algorithm may appear to be unwieldy, but a close examination shows that there is little increase in processing over adding a document to a collection based on a static dictionary. Steps (ii) and (ix), i.e. the indexing of the document and the clustering of the document, each involve a lot of processing, but this is not a function of the dynamic system. Steps (iii) through (viii) handle the actual dynamic updating and involve only straightforward numeric computation. Most of this computation is required to update the dynamic discrimination value for each term used in the new document. Step (viii-g) involves the determination of the index status of each term occurring in the new document. This determination is based on the newly updated values describing the term. A general algorithm for handling this step is necessary.

Figure 6 outlines Algorithm B for dynamic dictionary updating upon retirement of a document from the system. The algorithm is similar to Algorithm A, with reductions rather than increases in appropriate values. However, the order of computation in these algorithms differs, and is crucial. It is clear that the algorithms should be consistent; that is, the addition and subsequent deletion of a document, with no intervening changes, should leave the system unchanged. Study of Algorithms A and B shows that this is the case.

It is important to note that Algorithms A and B do not include the actual determination of the index status of terms. Rather, these algorithms are useful for updating the values which describe each term. The index status of each term is then determined on the basis of these values. Figure 7 suggests an algorithm for evaluating the index status of a term based on the values maintained for it. The specific algorithm to be utilized would depend on the particular implementation. For example, for one collection tested, the following values yield good results:

tL = 1, tH = ∞, dL = 0, dH = 300, Dc = 0.

As shown previously [5], Dc may be increased, resulting in fewer index terms, with only a slight

decrease in retrieval performance. Likewise, the frequency limits may be modified for systems in which either very high precision or very high recall is required. Algorithm C is simple and may be easily modified to produce the set of index terms which provide good average performance over all queries for a particular collection and user population.

STEP   PROCEDURE                          EXPLANATION
i)     Read document                      New document received (natural language text)
ii)    Document indexing                  Produce d, the concept vector for the new document
iii)   c ← c + d                          Update centroid of the collection
iv)    N ← N + 1                          Update collection size
v)     TCOS ← COS(c, d)                   Correlate new document with new centroid
vi)    Qsum ← Qsum + TCOS                 Update total system compactness
vii)   Q ← Qsum/N                         New compactness Q
viii)  FOR ∀ i ∋ d_i ≠ 0 COMPUTE          For each term in document d update its term status map:
   a)  TCOS_i ← [(c·d) - c_i d_i] / √[(Σ_j c_j² - c_i²)(Σ_j d_j² - d_i²)]
                                          Correlate document with centroid; term i deleted
   b)  ΔCOS_i ← TCOS_i - TCOS             Change in correlation due to term i
   c)  ΔQ_i ← ΔQ_i + ΔCOS_i
   d)  D_i ← ΔQ_i (*100)/Q                Dynamic discrimination value of term i updated
   e)  DOCFREQ_i ← DOCFREQ_i + 1          Update frequencies of occurrence of term i
   f)  TOTFREQ_i ← TOTFREQ_i + d_i
   g)  TEST index status of term i        (Algorithm C)
ix)    Add document d to collection       (Modify clusters as necessary)

Fig. 5. Algorithm A: Dictionary updating upon addition of a document.

STEP   PROCEDURE                          EXPLANATION
i)     Delete document d from collection  Appropriate updating of clusters and centroids may be necessary
ii)    TCOS ← COS(c, d)                   Correlate document with centroid
iii)   N ← N - 1                          Update collection size
iv)    c ← c - d                          Update collection centroid
v)     Qsum ← Qsum - TCOS                 Update collection compactness
vi)    Q ← Qsum/N
vii)   FOR ∀ i ∋ d_i ≠ 0 COMPUTE          For each term in document d update the term status map:
   a)  TCOS_i ← [(c·d) - c_i d_i] / √[(Σ_j c_j² - c_i²)(Σ_j d_j² - d_i²)]
                                          Correlate document with centroid; term i deleted
   b)  ΔCOS_i ← TCOS_i - TCOS             Difference in correlation due to term i is used to update
   c)  ΔQ_i ← ΔQ_i - ΔCOS_i               dynamic discrimination value of term i
   d)  D_i ← ΔQ_i (*100)/Q
   e)  DOCFREQ_i ← DOCFREQ_i - 1          Update frequencies of occurrence of term i
   f)  TOTFREQ_i ← TOTFREQ_i - d_i
   g)  Test index status of term i        (Algorithm C)

Fig. 6. Algorithm B: Dictionary updating upon retirement of a document.

Term i is an Index Term IF:
1. tL < Total frequency of term i < tH, and
2. dL < Document frequency of term i < dH, and
3. Dc < Discrimination value of term i.
Term i is not an Index Term otherwise.

Parameters:
tL - low cutoff for total frequency values
tH - high cutoff for total frequency values
dL - low cutoff for document frequency values
dH - high cutoff for document frequency values
Dc - discrimination value cutoff

Fig. 7. Algorithm C: Evaluation of the index status of a term.

Determining the index status of a term is equivalent to the static case of adding and deleting terms in a dictionary. However, since all terms are maintained in the dynamic dictionary, deletion of a term (and likewise addition) could have two meanings. First, a term which is no longer used in any documents would be deleted from the dictionary in the conventional sense. Second, a term which becomes a non-discriminator is essentially “deleted” from the set of index terms. The first case is simply a matter of keeping track of the frequency of occurrence of a term. It is the second case which is of interest and which is determined by Algorithm C.

When Algorithms A, B and C are applied successively the set of index terms may be expected to change, continuing to reflect the document collection. Some question may be raised as to the validity of the successive deletion of non-discriminators which may occur. In particular, Clemons and Newton [7] found retrieval to be adversely affected upon successive deletion of non-discriminators. Their method was based on the assumption that “if a fixed number of nondiscriminators were to be deleted, they should optimally be deleted one at a time, with recalculation of discrimination value following each deletion”. However, the recalculation of discrimination values after the deletion of many terms does not give a true indication of the discriminating effect of terms in the actual documents. That is, unless the results of the calculations can be related back to the original documents they are of no value. In light of this, it is important to see that a different approach is taken here. Discrimination values are recalculated (updated) only on the basis of changes in the document collection. A set of index terms is maintained which accurately reflects the current set of documents.

4. EXPERIMENTS AND EVALUATION
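As groundwork for the evaluation, the numeric core of Algorithms A and B can be sketched in Python. This is a minimal reading of Figs. 5 and 6 under several assumptions, none of them confirmed by the paper: documents are dictionaries of concept weights, the centroid is kept as a running sum, indexing and clustering (steps outside the dynamic update) are left out, and ΔQi of eqn (2) is taken to be the stored sum of correlation differences divided by N, so that Di = 100 * sum / Qsum. The class and method names are hypothetical. In `retire`, every correlation is computed before d is subtracted from the centroid; that ordering is chosen here so that an addition followed by a retirement cancels.

```python
import math
from collections import defaultdict

class DynamicTermStats:
    """Global values (c, N, Qsum) plus the per-term values of the term status map."""

    def __init__(self):
        self.c = defaultdict(float)        # centroid, kept as the sum of document vectors
        self.n = 0                         # N
        self.q_sum = 0.0                   # Qsum = N * Q
        self.dq_sum = defaultdict(float)   # per term: sum of correlation differences
        self.doc_freq = defaultdict(int)
        self.total_freq = defaultdict(float)

    def _correlations(self, d):
        """Return TCOS and {i: TCOS_i} for d against the current centroid."""
        dot = sum(w * self.c[i] for i, w in d.items())
        c2 = sum(v * v for v in self.c.values())
        d2 = sum(w * w for w in d.values())
        tcos = dot / math.sqrt(c2 * d2)
        tcos_i = {}
        for i, w in d.items():
            denom = (c2 - self.c[i] ** 2) * (d2 - w * w)
            tcos_i[i] = (dot - self.c[i] * w) / math.sqrt(denom) if denom > 0 else 0.0
        return tcos, tcos_i

    def add(self, d):
        """Algorithm A: update the centroid first, then correlate."""
        for i, w in d.items():
            self.c[i] += w
        self.n += 1
        tcos, tcos_i = self._correlations(d)
        self.q_sum += tcos
        for i, w in d.items():
            self.dq_sum[i] += tcos_i[i] - tcos
            self.doc_freq[i] += 1
            self.total_freq[i] += w

    def retire(self, d):
        """Algorithm B: correlate against the centroid that still contains d."""
        tcos, tcos_i = self._correlations(d)
        self.n -= 1
        self.q_sum -= tcos
        for i, w in d.items():
            self.dq_sum[i] -= tcos_i[i] - tcos
            self.doc_freq[i] -= 1
            self.total_freq[i] -= w
            self.c[i] -= w

    def d_value(self, i):
        """Dynamic discrimination value: 100 * (dq_sum / N) / (Qsum / N)."""
        return 100.0 * self.dq_sum[i] / self.q_sum

# Hypothetical toy data: four initial documents and one to add and then retire.
docs = [{1: 2, 2: 1, 9: 1}, {1: 1, 3: 2, 9: 1}, {2: 1, 4: 2, 9: 1}, {3: 1, 5: 2, 9: 1}]
new_doc = {5: 1, 6: 2, 9: 1}
```

Adding `new_doc` and then retiring it returns N, Qsum and every per-term sum to their previous values up to floating-point rounding, which is the consistency property required of the two algorithms. The per-term sums are not revisited for documents already in the collection when the centroid moves, so the maintained values are approximations to a full recomputation.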

A principal result is required in support of the algorithms presented. They must be shown to be correct; i.e. that they in fact provide an effective means of dynamically updating a dictionary. The dictionary updating algorithms are evaluated using a collection consisting of documents chosen from the fields of general medicine and ophthalmology. An initial collection is constructed of 50 documents from the medical collection. Forty documents are selected from the ophthalmology collection as the new documents to be added to the system. The initial collection (50 medical documents) is modified by the random addition of new (ophthalmology) documents and retirement of old (medical) documents until a completely new collection, consisting of 40 documents in ophthalmology, is generated.

Three dictionaries are necessary for evaluation. First, the medical static dictionary is constructed for the initial set of 50 documents from the medical collection. Second, the ophthalmology dynamic dictionary is constructed by starting with the medical static dictionary and updating it appropriately as documents are added to and deleted from the collection. Third, the ophthalmology static dictionary is constructed for the set of 40 ophthalmology documents taken as a static collection.

A set of 15 terms is chosen from these collections for evaluation. Table 1 lists these terms as they occur in the dictionaries. Of these 15 terms, 4 do not occur in the medical collection, 2 do not occur in the ophthalmology collection, and 9 occur in both collections. Table 1 shows the effectiveness of the proposed dynamic updating algorithms. The values for the ophthalmology static dictionary are those which would be achieved if the dictionary were completely regenerated for the new collection. The values for the ophthalmology dynamic dictionary, on the other hand, are achieved through dynamic updating of the dictionary as documents are added and deleted. It is, of course, to be expected that the frequencies should agree for both dictionaries; however, it is the close agreement in both sign and magnitude of the static and dynamic discrimination values which is important. It is apparent that, in general, an algorithm used to determine the set of index terms based on either the static or dynamic set of values would produce the same set of index terms.

Because of these results, the static and dynamic dictionaries are not tested for retrieval effectiveness. Dictionaries which are approximately equivalent, as these are, will provide for a similar level of retrieval performance.
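The claim that either set of values leads to the same index terms can be checked mechanically for a few rows of Table 1. The sketch below applies a version of Algorithm C to seven of the test terms. Several points are assumptions: the inequalities are taken as strict, the default cutoffs follow the sample values quoted in Section 3.4 with the high total-frequency cutoff read as unbounded, the first triple of each row is taken to be the dynamic dictionary and the second the regenerated static one, and the function and variable names are hypothetical.

```python
import math

def is_index_term(doc_freq, total_freq, d_value,
                  t_low=1, t_high=math.inf, d_low=0, d_high=300, d_cut=0.0):
    """Algorithm C with strict inequalities (an assumed reading of Fig. 7)."""
    return (t_low < total_freq < t_high
            and d_low < doc_freq < d_high
            and d_cut < d_value)

# (document frequency, total frequency, discrimination value) for the
# ophthalmology collection: dynamic dictionary first, static dictionary second.
table1_rows = {
    "AND":       ((31, 60, -35.644), (31, 60, -28.602)),
    "BLOOD":     ((1, 2, 1.031),     (1, 2, 0.991)),
    "BODY":      ((2, 3, 2.461),     (2, 3, 2.539)),
    "CATARACT":  ((3, 6, 5.568),     (3, 6, 5.129)),
    "HISTOLOGY": ((3, 4, 5.804),     (3, 4, 5.625)),
    "LENS":      ((3, 6, 5.038),     (3, 6, 5.902)),
    "SCLERA":    ((4, 8, 14.127),    (4, 8, 13.484)),
}

def index_terms(rows, which):
    """Terms selected by Algorithm C from the dynamic (0) or static (1) values."""
    return {term for term, values in rows.items() if is_index_term(*values[which])}
```

With these cutoffs both `index_terms(table1_rows, 0)` and `index_terms(table1_rows, 1)` select every listed term except AND, whose discrimination value is strongly negative in both dictionaries.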
On the basis of the results shown, the effectiveness of the dynamic dictionary updating algorithms may be concluded.

Table 1. Statistics for the test terms for the three dictionaries tested

               Medical static             Ophthalmology dynamic      Ophthalmology static
Term           Doc.   Total  Discrim.     Doc.   Total  Discrim.     Doc.   Total  Discrim.
               freq.  freq.  value        freq.  freq.  value        freq.  freq.  value
AND            38     76     -50.457      31     60     -35.644      31     60     -28.602
ANIMAL         3      4      2.944        0      0      -            0      0      -
BACILLUS       3      4      3.085        0      0      -            0      0      -
BLOOD          3      8      11.401       1      2      1.031        1      2      0.991
BODY           3      5      2.281        2      3      2.461        2      3      2.539
CATARACT       0      0      -            3      6      5.568        3      6      5.129
HISTOLOGY      1      1      0.546        3      4      5.804        3      4      5.625
LENS           3      9      14.096       3      6      5.038        3      6      5.902
MELANOMA       0      0      -            3      4      4.639        3      4      4.573
METASTASIC     1      1      0.170        3      9      8.299        3      9      7.843
OF             50     196    -390.3       40     174    -348.5       40     174    -335.5
OPHTHALMIC     0      0      -            3      3      0.980        3      3      1.494
PHENOMENA      2      2      0.913        2      4      4.165        2      4      4.000
SCLERA         0      0      -            4      8      14.127       4      8      13.484
SYNDROME       2      [illegible]  2.881  2      3      3.027        2      3      2.973

5. SUMMARY AND CONCLUSIONS

There are two important conclusions to be drawn regarding dynamic dictionary updating. First, it is feasible. Second, it is practical. Results in the previous section showed that, with the maintenance of some additional information, and the utilization of appropriate algorithms, a dictionary may be dynamically updated effectively. Additionally, examination of the algorithms shows that this method is more practical than periodic regeneration of the dictionary. Each time a document is added to or retired from the system the term status map is accessed for all the terms in the document. However, a key practical consideration is that these accesses of the term status map may be done at any time. Thus, if a document is retired at a time when the system is very busy, the vector for the retired document may be saved and the term status map updated at a later time when the system is quieter. This does not detract in any appreciable way from the principle of having the dictionary always reflect the current contents of the document collection. First, for a very large document collection, each document affects the status of each term only slightly; second, the term status map will reflect the contents of the document collection on a day by day or hour by hour basis, if not moment by moment. The overhead of the additional processing required to keep the dictionary updated for each change in the set of documents is clearly preferable to the hours and possibly days required to regenerate the entire dictionary.

The one event during which term status map access is critical is query submission. However, due to the comparatively long time required to access natural language terms in a dictionary, the additional access of the term status map for each of the query concepts is not significant. Access to the term status map is by concept number; thus the term status map may be maintained on disk in a convenient way, making it easy to access. Also, queries have comparatively few terms and so require only a few accesses of the term status map. While it is clear that any implementation of these algorithms would have to be done carefully to ensure practicality, it seems apparent that such implementation is entirely feasible.

In addition to showing the effectiveness and efficiency of the updating algorithms, a significant contribution is the derivation of the dynamic discrimination value. Clearly, just as the discrimination value was useful and even necessary [5] in the static case for determination of index terms, a similar tool is required in the dynamic case. The dynamic discrimination value fills the requirements.

There is another interesting and important benefit to be gained from the use of the dynamic dictionary system as outlined. Information retrieval systems are generally designed to provide good retrieval on the average. Thus, the dictionary consists of the set of index terms which must necessarily be utilized by users of the system.
However, there are instances in which this does not provide for good results for a particular query. Consider the following three cases.
(i) A user would prefer to formulate a search query using some narrow specific term(s) which are not included in the standard set of index terms, but which occur in the collection.
(ii) A user would prefer to formulate a search query using some broad general terms which occur in the documents but are not included in the standard set of index terms, being poor discriminators.
(iii) Two or more distinct user groups are querying the same collection; however, the standard set of index terms provides good results for only one user group.

In the static case, we are stuck with the documents as indexed. If a term is not a standard index term, it is not used to index the documents, and documents having that term may only be determined by a re-examination of the natural language text of the documents. It is possible to index a collection using several dictionaries. However, this requires duplication of the collection of document vectors for each dictionary. With large collections this is not feasible.

Utilizing the design of the dynamic system given in this paper, the problems are easily solved. The main consideration is that all documents are fully indexed, reflecting all terms occurring in the original text. It is only when a query vector is filtered through the term status map that non-index terms are removed. It is a simple matter to provide users with a means of specifying non-standard index terms which are to be retained for search purposes. In the case of distinct user populations, additional values may be maintained in the term status map specifying the index status of each term for each user group. Thus, dictionaries need not be constructed for a static document collection and an average user. Rather, the dictionary may be dynamic with respect to both changes in the collection and the needs of different users.

REFERENCES
[1] G. SALTON, A Theory of Indexing. Cornell University (March 1974).
[2] T. BRAUEN, Document vector modification in on-line information retrieval. M.Sc. Thesis, Cornell University (Sept. 1969).
[3] M. D. KERCHNER, Dynamic document processing in clustered collections. Ph.D. Thesis, Cornell University (Dec. 1971).
[4] R. G. CRAWFORD, Automatic dictionary construction and updating, Chap. V: Dynamic information retrieval system. Ph.D. Thesis, Cornell University (June 1975).
[5] R. G. CRAWFORD, Automatic negative dictionary construction. Report No. ISR-12 to the National Science Foundation, Department of Computer Science, Cornell University (Nov. 1974).
[6] G. SALTON, C. S. YANG and C. T. YU, Contribution to the theory of indexing. Proc. of the IFIP Congress (Aug. 1974).
[7] E. K. CLEMONS and J. E. NEWTON, A sequential refinement method for removing non-discriminators from a document collection. Department of Computer Science, Cornell University (May 1973).
