Information Processing & Management, Vol. 13, pp. 235-245. Pergamon Press 1977. Printed in Great Britain
DYNAMIC DICTIONARY UPDATING
ROBERT G. CRAWFORD
Department of Computing and Information Science, Queen’s University, Kingston, Ontario, Canada
Abstract-A method for updating the dictionary in a dynamic information retrieval system is presented. It is shown that as a collection changes through addition and deletion of documents, the appropriate set of index terms may be determined without complete periodic regeneration of the dictionary. Results are presented for experiments involving a complete change in collection membership, with the dynamic dictionary updating methods shown to be effective.
1. INTRODUCTION
At the conclusion of his paper “A Theory of Indexing”, Salton lists a number of important questions which must be examined if the theory is to receive practical application. One of these questions is as follows [1, p. 60]:

“Can the computation of term values obtained from a static model of a given document collection be maintained in a dynamic environment where old documents are removed, and new ones are added? If not, how often must one recompute the term values?”

This question defines quite well the considerations of this paper. It is clear that the dictionary must reflect the subject matter of the collection. Consider two ways, as implied in the above question, in which this may be assured.

(i) Periodic regeneration of the dictionary. At intervals, when the document collection and/or terminology has undergone a change, a new dictionary may be constructed. There are two problems with this.
(a) Prior to reconstructing the dictionary, performance of the system may be expected to decline as the dictionary becomes less reflective of the document collection.
(b) For very large collections, the time required to do this at reasonable intervals may be prohibitive.
(ii) Continuous updating of the dictionary. All changes, as they occur in the system, are reflected by updating of the dictionary. Here the principal problem is the development of a methodology to do this.

In this paper a methodology for dynamic updating of the dictionary is presented, thus answering the introductory question in the affirmative.

2. BASIC CONSIDERATIONS
There are four areas which should be briefly outlined to serve as a basis for the developments presented in this paper.

2.1 The environment
The discussion and results to be presented are in the context of a fully automatic information retrieval system. That is, dictionary construction, indexing, classification, and retrieval are all performed using fully automatic methods. Thus, it is assumed that all terms in a document surrogate (title, abstract, or full text) are considered for assignment as index terms to that document and that appropriate weights are also calculated and assigned.

2.2 Purpose of a dynamic dictionary
The purpose of a dynamic dictionary is to provide for:
(i) the maintenance of existing levels of performance by reflecting changes in the document set, i.e. additions and retirements;
(ii) the improvement of existing levels of performance by reflecting changes due to document space modification [2-4].

New documents are indexed by a dictionary which fully reflects the document collection, thus maintaining system performance. In the static system this is not true, and new documents may be indexed by a dictionary which does not reflect changes in the collection due to recent document additions and retirements.
Additionally, however, document space modification is reflected in the dynamic dictionary, providing for improved performance of a query not only because of the actual document space changes, but also due to improved indexing of the query by the updated dictionary.

2.3 Transactions in a dynamic system
It is important to bear in mind the situations affecting terminology which may occur in a dynamic information retrieval system.
(i) A new term is entered into the subject area of the collection, representing either:
(a) a new idea in the subject area, or
(b) a new terminology for an existing concept.
(ii) An old term passes from common usage due to either:
(a) decreasing importance of the idea it represents, or
(b) the introduction of new terminology replacing the old term.
(iii) The coverage of the collection shifts:
(a) from one subject area to another due to changes in the interests of the users, or
(b) from a specific subject area to a more general subject area, or
(c) from a general subject area to a specific subset of that subject.
(iv) Through document modifications, based on user evaluations:
(a) a term increases in weight in the collection, or
(b) a term decreases in weight in the collection.
It is these situations which must be handled through the dynamic dictionary.

2.4 The static dictionary
To understand the problems involved with dynamic dictionary updating for the situations listed in the previous section, consider the organization of a static dictionary and document collection. Figure 1 shows a sample of a dictionary and a document vector. Consider how updating may be done on the basis of the total information contained in these files.

For example, suppose that document 1291 as shown in Fig. 1 were retired from the collection. In what manner would this affect concept number 172 and its associated term in the dictionary? It is possible that the term is no longer used anywhere in the collection (i.e. document frequency = 0), but this cannot be determined without an examination of all document vectors, which is prohibitive. Alternatively, suppose that concept number 172 is still used in the collection but is no longer a useful index term. How can the other documents which were indexed by this concept be changed?

[Figure 1: a sample of dictionary entries (terms, among them "Basophilic" and "Behavior", with concept numbers 1031, 172, 3019, 462 and 701) and the vector for document number 1291, with concept/weight pairs 172/1, 191/2, 367/1, 491/1, 567/3, 3412/1, 4096/?.]
Fig. 1. Sample elements of a static system.

An additional problem also may occur. Suppose, for example, that the term “ocular” was not used to index a set of documents in ophthalmology because it occurred in a high proportion of the documents and was, therefore, a non-discriminator. However, over a period of time the collection is greatly expanded, with the addition of documents in more general areas of medicine. Also, many of the specific documents in ophthalmology are retired. In other words, in the present collection the term “ocular” is an excellent discriminator and should be included as an index term. Two questions must be answered.
(i) What information must be maintained so that a shift in “status” of a term (e.g. from non-discriminator to discriminator) may be detected?
(ii) Given the change in “status” of a term, how may the indexing of the documents be modified to account for this? (I.e. the original documents were not indexed by “ocular”. Must we re-examine all the original text?)

Part of the purpose of the previous examples is to motivate the need for maintaining further information if dynamic dictionary updating is to be possible. In the following section, precisely what information must be kept is outlined.

3. THE DYNAMIC SYSTEM: DICTIONARY, DOCUMENTS AND TERM STATUS MAP
Given the necessity for additional information for use in updating of the dictionary, it must be determined what information must be maintained, where it is to be kept, and how it is to be updated. It appears desirable to include as much of the necessary information as possible as a part of the existing dictionary and document files, as opposed to constructing many new and (possibly) redundant files. The following constraints must therefore be met:
(i) The dictionary must contain all terms used in the collection and have a concept number for each.
(ii) The document vector must have a concept number and associated weight for all terms occurring in the document (i.e. “full” information for each document).

A dictionary and a document collection of this form are the basis for a dynamic system. Additionally, it is necessary to have:
(iii) Information associated with each concept, sufficient to determine the status of the concept.
(iv) A provision for searching the collection using only “index” terms, rather than the “full” vectors which seem to be suggested by (i) and (ii).

This fourth provision is crucial. Searches done using “full” vectors are shown to provide for a very low (and unacceptable) level of retrieval performance [5].

A dictionary which includes all terms in the collection (i.e. stop words or low frequency words are not omitted to keep the dictionary size small) and a set of document vectors which include concepts and weights for all terms in each document (provisions (i) and (ii)) are easy to construct. Thus, consider the problem of what information about a concept is sufficient to determine its status as an index term (provision (iii)).

Consider a term, k, and its possible status in the collection as a result of the initial construction of the dictionary. Term k:
(i) Is not an index term due to high frequency of occurrence.
(ii) Is not an index term due to very low frequency of occurrence.
(iii) Is not an index term since it is a non-discriminator or poor discriminator.
(iv) Is an index term since it is a discriminator.

Now, under addition, deletion, or modification of documents, the status of term k may change. Cases (i) and (ii) are easily handled by maintaining frequencies of occurrence of terms. It is cases (iii) and (iv) which present difficulty. It is shown that the discrimination value, which has previously proven effective for static dictionary construction [5, 6], may be revised for use in the dynamic situation.

3.1 Dynamic discrimination value
The discrimination value of a term i is defined as follows:
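$$D_i = \frac{Q_i - Q}{Q} \times 100 \qquad (1)$$

where $Q$ is the compactness of the collection and $Q_i$ is the compactness with term $i$ deleted. This may be re-written as:

$$D_i = \frac{\Delta Q_i}{Q} \times 100 \qquad (2)$$

where $\Delta Q_i = Q_i - Q$.

Now $Q_i$ and $Q$ are each a function of all documents in the document space. However, $\Delta Q_i$ is a function only of those documents in which term $i$ actually occurs. Further:

$$Q = \frac{1}{N} \sum_{j=1}^{N} \cos(c, d_j) \qquad (3)$$

$$Q_i = \frac{1}{N} \sum_{j=1}^{N} \cos(c^i, d_j^i) \qquad (4)$$

where superscript $i$ indicates the deleted term, for a collection of $N$ documents, $d_1, \ldots, d_N$, with centroid $c$. Thus:

$$\Delta Q_i = Q_i - Q = \frac{1}{N} \sum_{j=1}^{N} \left[ \cos(c^i, d_j^i) - \cos(c, d_j) \right] \qquad (5)$$

Clearly, the term in brackets approximates zero for all $j$ such that $d_{ji} = 0$ (i.e. term $i$ does not occur in document $j$); thus, to a close approximation, this may be written

$$\Delta Q_i = \frac{1}{N} \sum_{j:\, d_{ji} \neq 0} \left[ \cos(c^i, d_j^i) - \cos(c, d_j) \right] \qquad (6)$$
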
The quantity $\Delta Q_i$ has the property of being easier to compute than $Q_i$ and is therefore considered for usefulness in updating. Clearly, if $\Delta Q_i$ can be maintained for each term $i$, then it is a simple matter to compute $D_i$ (using eqn (2)) for any term $i$ for which the value of $Q_i$ changes. The value $D_i$ for a term, as computed on the basis of updated values of $\Delta Q_i$, is the dynamic discrimination value. The usefulness of the dynamic discrimination value is tested and the results discussed in a later section.
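
The relationship between the exact and the approximate forms of $\Delta Q_i$ can be checked on a small example. The following is a minimal sketch, not the paper's implementation: it assumes sparse vectors stored as Python dicts, a cosine correlation, a centroid kept as the plain sum of the document vectors (the cosine is unaffected by scaling), and a four-document toy collection invented for illustration. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine correlation between two sparse vectors (dict: concept -> weight)."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def centroid(docs):
    """Sum of all document vectors; scaling by 1/N would not change any cosine."""
    c = {}
    for d in docs:
        for k, w in d.items():
            c[k] = c.get(k, 0.0) + w
    return c


def without(vec, term):
    """Copy of a vector with one term deleted (the superscript-i vectors)."""
    return {k: w for k, w in vec.items() if k != term}


def compactness(docs, deleted=None):
    """Q when deleted is None, otherwise Q_i for the deleted term."""
    c = centroid(docs)
    if deleted is not None:
        c = without(c, deleted)
        docs = [without(d, deleted) for d in docs]
    return sum(cosine(c, d) for d in docs) / len(docs)


def delta_q_exact(docs, term):
    """Q_i - Q summed over every document, as in eqn (5)."""
    return compactness(docs, term) - compactness(docs)


def delta_q_approx(docs, term):
    """The same difference restricted to documents containing the term."""
    c = centroid(docs)
    c_i = without(c, term)
    total = 0.0
    for d in docs:
        if d.get(term, 0.0) != 0.0:
            total += cosine(c_i, without(d, term)) - cosine(c, d)
    return total / len(docs)


def discrimination_value(docs, term):
    """D_i computed from the approximate Delta Q_i, following eqn (2)."""
    return 100.0 * delta_q_approx(docs, term) / compactness(docs)


# Hypothetical toy collection: concept 1 occurs everywhere, concept 5 only once.
docs = [{1: 1, 2: 2}, {1: 1, 3: 1}, {1: 2, 4: 1}, {1: 1, 2: 1, 5: 1}]
```

In this toy collection concept 1 occurs in every document, so the two forms coincide for it and its discrimination value is negative; concept 5 occurs in one document and receives a positive value from both forms. With only four documents the approximate and exact values for concept 5 differ noticeably, which is consistent with the approximation being intended for collections large enough that deleting one term barely moves the centroid.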
3.2 The term status map
It is apparent then that, for each term in the collection, the following information is necessary to determine the status of the term for use in indexing:
(i) the document frequency of occurrence
(ii) the total frequency of occurrence
(iii) the dynamic discrimination value (by keeping $\Delta Q_i$ and $D_i$ for each concept $i$).

This answers the question of what additional information must be kept. It must be determined where this information is to be maintained, and finally shown how it may be updated.

Figure 2 shows samples of the files required for dynamic dictionary updating. The Dictionary and the Document vectors are maintained as in the conventional system, except that all terms are included. That is, for each natural language term in the dictionary there is a list of the concept numbers into which the term is mapped. Also, for each document there is a vector specifying the weight of each concept occurring in that document.

It is the third file shown in Fig. 2, the term status map, which enables the dynamics. As shown, there is for each concept in the system a record of the information necessary for dynamically determining the index status of the term, as well as an indicator showing whether the concept, as presently used, is considered to be an index term.

(a) Dictionary
  Term          Concept
  Bacillus      1031
  Bacteria      174
  Basophilic    3019
  Be            380
  Because       5011
  Behavior      462
  Benign        781

(b) Document Vectors (full)
  Document number   Concepts/weights
  1291              174/1, 191/2, 367/1, 380/1, 491/1, 567/3,
                    3412/1, 4096/2, 5010/2, 5052/1

(c) Term Status Map
  Concept   Document Frequency   Total Frequency   ΔQi      Di       Index Term Status
  174       12                   16                 .061     8.133   Yes
  380       132                  215               -.046    -6.004   No
  462       11                   14                 .009     1.221   Yes
  781       14                   18                 .072     9.734   Yes
  1031      6                    8                  .037     5.064   Yes

Fig. 2. Samples of files required for dynamic dictionary updating.

Algorithms for the dynamic updating and utilization of the term status map must be described. However, it is first important to consider how the basic operations in an information retrieval system are carried out using the files described.
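
The three files of Fig. 2 map naturally onto simple keyed records. The following is a minimal sketch using the sample values from Fig. 2; the Python representation itself (a dataclass and plain dicts, and the names `TermStatus` and `current_index_concepts`) is an assumption made for illustration, not a layout given in the paper.

```python
from dataclasses import dataclass


@dataclass
class TermStatus:
    doc_freq: int        # number of documents containing the concept
    total_freq: int      # total occurrences of the concept in the collection
    delta_q: float       # maintained value of Delta Q_i
    d_value: float       # dynamic discrimination value D_i
    is_index_term: bool  # current index status indicator


# (a) Dictionary: natural language term -> list of concept numbers
dictionary = {
    "bacillus": [1031], "bacteria": [174], "basophilic": [3019], "be": [380],
    "because": [5011], "behavior": [462], "benign": [781],
}

# (b) Full document vectors: document number -> {concept: weight}
document_vectors = {
    1291: {174: 1, 191: 2, 367: 1, 380: 1, 491: 1, 567: 3,
           3412: 1, 4096: 2, 5010: 2, 5052: 1},
}

# (c) Term status map, keyed by concept number
term_status_map = {
    174: TermStatus(12, 16, 0.061, 8.133, True),
    380: TermStatus(132, 215, -0.046, -6.004, False),
    462: TermStatus(11, 14, 0.009, 1.221, True),
    781: TermStatus(14, 18, 0.072, 9.734, True),
    1031: TermStatus(6, 8, 0.037, 5.064, True),
}


def current_index_concepts(doc_number):
    """Concepts of a full vector that the sample status map marks as index terms.

    Concepts missing from the sample map are left out here; a complete map
    would hold a record for every concept in the system.
    """
    vector = document_vectors[doc_number]
    return sorted(c for c in vector
                  if c in term_status_map and term_status_map[c].is_index_term)
```

Because the status map is keyed by concept number, the index status of any concept in a full vector is one lookup away, which is the property the later sections rely on.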
3.3 Standard operations with the dynamic dictionary
The design of the files discussed in the previous section (the dictionary, document and term status map files) becomes clearer as their use is considered. Ignoring momentarily the actual updating process, consider the use of these files under standard operations of indexing and searching.

DOCUMENT INDEXING is done as in a static system; however, the vectors produced are “full” vectors, containing a concept number and weight for every term occurring in the document. The effect of these full vectors on the search process is insignificant when query indexing is done as follows.

QUERY INDEXING is a two step process. First, the dictionary is used to index the query, producing a “full” query vector having a concept and weight for every term occurring in the query (unless the term is new to the system). This full query vector is then filtered through the term status map to produce an index query vector containing only concepts which are currently to be used for indexing. That is, each concept in the full query vector is either kept in the index query vector (in the case where the indicator in the term status map indicates an index term) or deleted (in the case where the indicator indicates a non-index term). An example of this process is shown in Fig. 3. Most of the processing involved in handling natural language queries is in the “lookup” of the words of the text in the dictionary. Thus, the additional step of mapping these concepts into the final index query vector does not add significantly to the processing required.

(a) Natural Language Query
  "What is the post-operative procedure for patients undergoing corneal transplant."

(b) Full Query Vector (every term in the original query is represented by concepts) (concepts/weights)
  974/1, 977/1, 1835/1, 2490/1, 3373/1, 3569/1, 3681/1, 4652/1, 4771/1, 4864/1, 5066/1

(c) Index Query Vector (only terms which are shown by the term status map to be index terms are included)
  3569/1, 3681/1, 4771/1

Fig. 3. Example of query indexing in the dynamic system.

SEARCHING in the dynamic system, whether clustered or not, occurs much as in a static system. The final index query vector is correlated with centroids and document vectors as necessary to fulfill the requirements of the search algorithm. The principal difference from a standard search lies in the use of the full document vectors. It has been shown [5] that when ranking correlation is used, results are not affected by the use of full document vectors. This is due to the fairly constant proportionality between the length of the “full” and “index” vectors for documents.

Having discussed what information is required to dynamically update the dictionary, where this information is maintained, and how it is utilized to perform standard retrieval operations, it is necessary to consider how the dictionary is updated.
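
The two-step query indexing of Section 3.3 can be sketched in a few lines. This is an illustrative sketch only: the dictionary entries and index status flags are the samples from Fig. 2, the query text is hypothetical, and using the within-query term count as the weight is an assumption. The last three lines apply the second step to the vectors of Fig. 3, taking part (c) of that figure as the index status.

```python
import re

# Sample entries taken from Fig. 2; each term maps to a list of concept numbers.
dictionary = {"bacillus": [1031], "bacteria": [174], "be": [380],
              "behavior": [462], "benign": [781]}
# Index status indicator from the term status map of Fig. 2.
index_status = {174: True, 380: False, 462: True, 781: True, 1031: True}


def full_query_vector(text, dictionary):
    """Step 1: dictionary lookup; terms new to the system are skipped."""
    vector = {}
    for word in re.findall(r"[a-z]+", text.lower()):
        for concept in dictionary.get(word, []):
            vector[concept] = vector.get(concept, 0) + 1
    return vector


def index_query_vector(full_vector, index_status):
    """Step 2: keep only concepts currently flagged as index terms."""
    return {c: w for c, w in full_vector.items() if index_status.get(c, False)}


query = "Can bacillus behavior be benign?"   # hypothetical query text
full = full_query_vector(query, dictionary)
index = index_query_vector(full, index_status)

# The vectors of Fig. 3, with the status implied by part (c) of that figure.
fig3_full = {c: 1 for c in (974, 977, 1835, 2490, 3373, 3569,
                            3681, 4652, 4771, 4864, 5066)}
fig3_status = {c: c in (3569, 3681, 4771) for c in fig3_full}
fig3_index = index_query_vector(fig3_full, fig3_status)
```

The second step is a single pass over the query concepts with one status lookup each, which illustrates why the filtering adds little to the cost of the dictionary lookup.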
3.4 Dynamic dictionary updating algorithms
The dictionary, more particularly the term status map, is updated following the addition of new documents, the retirement of old documents and the modification of the document space. The dictionary is not directly updated as a result of query submission, but is updated if the document space is modified as a result of the query. Thus, queries do not introduce new terminology to the system. All new terms are entered through occurrence in new documents. However, through their use in queries, their occurrence may be expanded to old documents by document space modification.

Updating of the three files (as shown in Fig. 2) proceeds as follows. The dictionary is updated when:
(i) A new term is introduced by its occurrence in a new document.
(ii) An old term is deleted as it is no longer used in any document (due to document retirement or modification which effected its reduction to zero weight).

The term status map and the document space are updated when:
(i) New documents are added.
(ii) Old documents are retired.
(iii) The document space is modified.

Thus, dictionary updating algorithms are needed to handle the cases of document addition and document retirement. However, the third case, that of document space modification, may be considered simply as a combination of the first two processes. During document space modification a document (the unmodified one) is retired from the collection and a new document (the modified one) is added to the collection.

Figure 4 summarizes the notation used to express the algorithms.

  c           centroid vector of the document space
  c_i         frequency (weight) of term i in c
  d           a full document vector for the document under consideration (i.e., being added or deleted)
  d_i         frequency (weight) of term i in d
  N           number of documents in the collection
  Q           compactness of the collection
  Q_sum       sum over j = 1, ..., N of cos(c, d_j) (i.e. N*Q)
  TCOS        correlation between centroid and full document
  TCOS_i      correlation between centroid and document with term i deleted
  ΔCOS_i      difference in correlation due to term i
  ΔQ_i        sum of correlation differences due to term i
  D_i         dynamic discrimination value of term i
  DOCFREQ_i   number of documents in which term i occurs

  Note that c, N, Q, and Q_sum are all "global values", i.e., are values which give information about the entire document collection; d and TCOS pertain to a single document; the other values all relate to a particular term.

Fig. 4. Notation used in dictionary updating algorithms.

Figure 5 outlines the algorithm for dynamic modification of the set of index terms upon addition of a new document to the system. This algorithm may appear to be unwieldy, but a close examination shows that there is little increase in processing over adding a document to a collection based on a static dictionary. Steps (ii) and (ix), i.e. the indexing of the document, and the clustering of the document, each involve a lot of processing, but this is not a function of the dynamic system. Steps (iii) through (viii) handle the actual dynamic updating and involve only straightforward numeric computation. Most of this computation is required to update the dynamic discrimination value for each term used in the new document. Step (viii-g) involves the determination of the index status of each term occurring in the new document. This determination is based on the newly updated values describing the term. A general algorithm for handling this step is necessary.

  STEP     PROCEDURE                                   EXPLANATION
  (i)      Read document                               New document received (natural language text)
  (ii)     Indexing of document                        Produce d, the concept vector for the new document
  (iii)    c ← c + d                                   Update centroid of the collection
  (iv)     N ← N + 1                                   Update collection size
  (v)      TCOS ← COS(c, d)                            Correlate new document with centroid
  (vi)     Q_sum ← Q_sum + TCOS                        Update total system compactness
  (vii)    Q ← Q_sum / N                               New compactness Q
  (viii)   FOR ∀ i ∋ d_i ≠ 0 COMPUTE:                  For each term i in document d update its term status map:
    (a)    TCOS_i ← (c·d − c_i d_i) / sqrt((Σ_j c_j² − c_i²)(Σ_j d_j² − d_i²))
                                                       Correlate document with centroid; term i deleted
    (b)    ΔCOS_i ← TCOS_i − TCOS                      Change in correlation due to term i
    (c)    ΔQ_i ← ΔQ_i + ΔCOS_i
    (d)    D_i ← ΔQ_i (×100) / Q                       Update dynamic discrimination value of term i
    (e)    DOCFREQ_i ← DOCFREQ_i + 1                   Update frequencies of occurrence of term i
    (f)    TOTFREQ_i ← TOTFREQ_i + d_i
    (g)    TEST index status of term i                 (Algorithm C)
  (ix)     Add document d to collection                (Modify clusters as necessary)

Fig. 5. Algorithm A: Dictionary updating upon addition of a document.

Figure 6 outlines Algorithm B for dynamic dictionary updating upon retirement of a document from the system. The algorithm is similar to Algorithm A, with reductions rather than increases in appropriate values. However, the order of computation in these algorithms differs, and is crucial. It is clear that the algorithms should be consistent; that is, the addition and subsequent deletion of a document, with no intervening changes, should leave the system unchanged. Study of Algorithms A and B shows that this is the case.

  STEP     PROCEDURE                                   EXPLANATION
  (i)      Delete document d from collection           Appropriate updating of clusters and centroids may be necessary
  (ii)     TCOS ← COS(c, d)                            Correlate document with centroid
  (iii)    N ← N − 1                                   Update collection size
  (iv)     c ← c − d                                   Update collection centroid
  (v)      Q_sum ← Q_sum − TCOS                        Update collection compactness
  (vi)     Q ← Q_sum / N
  (vii)    FOR ∀ i ∋ d_i ≠ 0 COMPUTE:                  For each term i in document d update the term status map:
    (a)    TCOS_i ← (c·d − c_i d_i) / sqrt((Σ_j c_j² − c_i²)(Σ_j d_j² − d_i²))
                                                       Correlate document and centroid with term i deleted
    (b)    ΔCOS_i ← TCOS_i − TCOS                      Difference in correlation due to term i is used to update
    (c)    ΔQ_i ← ΔQ_i − ΔCOS_i                        dynamic discrimination value of term i
    (d)    D_i ← ΔQ_i (×100) / Q
    (e)    DOCFREQ_i ← DOCFREQ_i − 1                   Update frequencies of occurrence of term i
    (f)    TOTFREQ_i ← TOTFREQ_i − d_i
    (g)    Test index status of term i                 (Algorithm C)

Fig. 6. Algorithm B: Dictionary updating upon retirement of a document.

It is important to note that Algorithms A and B do not include the actual determination of the index status of terms. Rather, these algorithms are useful for updating the values which describe each term. The index status of each term is then determined on the basis of these values. Figure 7 suggests an algorithm for evaluating the index status of a term based on the values maintained for it. The specific algorithm to be utilized would depend on the particular implementation. For example, for one collection tested, the following values yield good results:

$t_L = 1$, $t_H = \infty$, $d_L = 0$, $d_H = 300$, $D_c = 0$.

As shown previously [5], $D_c$ may be increased, resulting in fewer index terms, with only a slight decrease in retrieval performance. Likewise, the frequency limits may be modified for systems in which either very high precision or very high recall are required. Algorithm C is simple and may be easily modified to produce the set of index terms which provide good average performance over all queries for a particular collection and user population.

  Term i is an Index Term IF:
    1. tL < Total Frequency of term i < tH, and
    2. dL < Document Frequency of term i < dH, and
    3. Dc < Discrimination Value of term i.
  Term i is not an Index Term otherwise.

  Parameters:
    tL - low cutoff for total frequency values
    tH - high cutoff for total frequency values
    dL - low cutoff for document frequency values
    dH - high cutoff for document frequency values
    Dc - discrimination value cutoff

Fig. 7. Algorithm C: Evaluation of the index status of a term.

Determining the index status of a term is equivalent to the static case of adding and deleting terms in a dictionary. However, since all terms are maintained in the dynamic dictionary, deletion of a term (and likewise addition) could have two meanings. First, a term which is no longer used in any documents would be deleted from the dictionary in the conventional sense. Second, a term which becomes a non-discriminator is essentially “deleted” from the set of index terms. The first case is simply a matter of keeping track of the frequency of occurrence of a term. It is the second case which is of interest and which is determined by Algorithm C. When Algorithms A, B and C are applied successively the set of index terms may be expected to change, continuing to reflect the document collection.

Some question may be raised as to the validity of the successive deletion of non-discriminators which may occur. In particular, Clemons and Newton [7] found retrieval to be adversely affected upon successive deletion of non-discriminators. Their method was based on the assumption that “if a fixed number of nondiscriminators were to be deleted, they should optimally be deleted one at a time, with recalculation of discrimination value following each deletion”. However, the recalculation of discrimination values after the deletion of many terms does not give a true indication of the discriminating effect of terms in the actual documents. That is, unless the results of the calculations can be related back to the original documents they are of no value. In light of this, it is important to see that a different approach is taken here. Discrimination values are recalculated (updated) only on the basis of changes in the document collection. A set of index terms is maintained which accurately reflects the current set of documents.
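
Algorithms A, B and C can be sketched together as one small class. This is a hedged illustration rather than the paper's implementation, and it rests on several assumptions that are not stated in the source: vectors are sparse dicts; the centroid is kept as the plain sum of document vectors; the running per-term sum of ΔCOS_i is normalized by N when the discrimination value is formed (so that D_i = 100 · ΔQ_i / Q_sum); D_i is computed on demand instead of being stored; and on retirement every correlation is taken before the document leaves the centroid, so that adding and then retiring the same document restores the state. The class and method names and the toy data are hypothetical.

```python
import math
from collections import defaultdict


class DynamicDictionary:
    """Toy term status map updated by Algorithms A (add) and B (retire)."""

    def __init__(self, t_low=1, t_high=math.inf, d_low=0, d_high=300, d_cut=0.0):
        self.c = defaultdict(float)         # centroid kept as the sum of document vectors
        self.n = 0                          # N, number of documents
        self.q_sum = 0.0                    # Q_sum = N * Q
        self.dq_sum = defaultdict(float)    # per-term running sum of dCOS_i
        self.doc_freq = defaultdict(int)    # DOCFREQ_i
        self.tot_freq = defaultdict(float)  # TOTFREQ_i
        self.cutoffs = (t_low, t_high, d_low, d_high, d_cut)

    def _correlations(self, d):
        """Return TCOS and {i: dCOS_i}, using a centroid that includes d."""
        dot = sum(w * self.c[k] for k, w in d.items())
        c2 = sum(v * v for v in self.c.values())
        d2 = sum(w * w for w in d.values())
        tcos = dot / math.sqrt(c2 * d2)
        dcos = {}
        for i, d_i in d.items():
            c_i = self.c[i]
            denom = math.sqrt(max(0.0, (c2 - c_i * c_i) * (d2 - d_i * d_i)))
            tcos_i = (dot - c_i * d_i) / denom if denom > 0 else 0.0
            dcos[i] = tcos_i - tcos
        return tcos, dcos

    def add(self, d):
        """Algorithm A: the numeric updates for a new full document vector d."""
        for k, w in d.items():
            self.c[k] += w
        self.n += 1
        tcos, dcos = self._correlations(d)
        self.q_sum += tcos
        for i, d_i in d.items():
            self.dq_sum[i] += dcos[i]
            self.doc_freq[i] += 1
            self.tot_freq[i] += d_i

    def retire(self, d):
        """Algorithm B, with all correlations taken before d leaves the centroid."""
        tcos, dcos = self._correlations(d)
        for k, w in d.items():
            self.c[k] -= w
        self.n -= 1
        self.q_sum -= tcos
        for i, d_i in d.items():
            self.dq_sum[i] -= dcos[i]
            self.doc_freq[i] -= 1
            self.tot_freq[i] -= d_i

    def d_value(self, i):
        """D_i = 100 * (dq_sum_i / N) / Q, which simplifies to 100 * dq_sum_i / Q_sum."""
        return 100.0 * self.dq_sum[i] / self.q_sum if self.q_sum else 0.0

    def is_index_term(self, i):
        """Algorithm C with strict inequalities on all three tests."""
        t_low, t_high, d_low, d_high, d_cut = self.cutoffs
        return (t_low < self.tot_freq[i] < t_high
                and d_low < self.doc_freq[i] < d_high
                and d_cut < self.d_value(i))


# Hypothetical toy collection; concept 1 occurs in every document.
collection = [{1: 1, 2: 2}, {1: 1, 3: 1}, {1: 2, 4: 1}, {1: 1, 2: 1, 5: 1}]
dd = DynamicDictionary()
for doc in collection:
    dd.add(doc)

snapshot = (dd.n, dd.q_sum, dict(dd.dq_sum), dict(dd.doc_freq))
extra = {2: 1, 6: 3}   # hypothetical document, added and then retired
dd.add(extra)
dd.retire(extra)       # the state should return to the snapshot, up to rounding
```

Under these assumptions the add-then-retire round trip leaves the collection size, Q_sum, the per-term sums and the frequencies where they started, which is the consistency property required of Algorithms A and B. Concept 1, which occurs in every toy document, ends with a negative discrimination value and is therefore rejected by the Algorithm C test with the cutoff values quoted above.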
4. EXPERIMENTS AND EVALUATION
A principal result is required in support of the algorithms presented. They must be shown to be correct; i.e. it must be shown that they in fact provide an effective means of dynamically updating a dictionary.

The dictionary updating algorithms are evaluated using a collection consisting of documents chosen from the fields of general medicine and ophthalmology. An initial collection is constructed of 50 documents from the medical collection. Forty documents are selected from the ophthalmology collection as the new documents to be added to the system. The initial collection (50 medical documents) is modified by the random addition of new (ophthalmology) documents and retirement of old (medical) documents until a completely new collection, consisting of 40 documents in ophthalmology, is generated.

Three dictionaries are necessary for evaluation. First, the medical static dictionary is constructed for the initial set of 50 documents from the medical collection. Second, the ophthalmology dynamic dictionary is constructed by starting with the medical static dictionary and updating it appropriately as documents are added to and deleted from the collection. Third, the ophthalmology static dictionary is constructed for the set of 40 ophthalmology documents taken as a static collection.

A set of 15 terms is chosen from these collections for evaluation. Table 1 lists these terms as they occur in the dictionaries. Of these 15 terms, 4 do not occur in the medical collection, 2 do not occur in the ophthalmology collection, and 9 occur in both collections.

Table 1 shows the effectiveness of the proposed dynamic updating algorithms. The values for the ophthalmology static dictionary are those which would be achieved if the dictionary were completely regenerated for the new collection. The values for the ophthalmology dynamic dictionary, on the other hand, are achieved through dynamic updating of the dictionary as documents are added and deleted. It is, of course, to be expected that the frequencies should agree for both dictionaries; however, it is the close agreement in both sign and magnitude of the static and dynamic discrimination values which is important. It is apparent that, in general, an algorithm used to determine the set of index terms based on either the static or dynamic set of values would produce the same set of index terms.

Because of these results, the static and dynamic dictionaries are not tested for retrieval effectiveness. Dictionaries which are approximately equivalent, as these are, will provide for a similar level of retrieval performance.
On the basis of the results shown, the effectiveness of the dynamic dictionary updating algorithms may be concluded.

Table 1. Statistics for the test terms for the three dictionaries tested
(DF = document frequency, TF = total frequency, D = discrimination value)

                Medical static           Ophthalmology dynamic    Ophthalmology static
  Term          DF    TF    D            DF    TF    D            DF    TF    D
  AND           38    76    -50.457      31    60    -35.644      31    60    -28.602
  ANIMAL        3     4     2.944        0     0     -            0     0     -
  BACILLUS      3     4     3.085        0     0     -            0     0     -
  BLOOD         3     8     11.401       1     2     1.031        1     2     0.991
  BODY          3     5     2.281        2     3     2.461        2     3     2.539
  CATARACT      0     0     -            3     6     5.568        3     6     5.129
  HISTOLOGY     1     1     0.546        3     4     5.804        3     4     5.625
  LENS          3     9     14.096       3     6     5.038        3     6     5.902
  MELANOMA      0     0     -            3     4     4.639        3     4     4.573
  METASTASIC    1     1     0.170        3     9     8.299        3     9     7.843
  OF            50    196   -390.3       40    174   -348.5       40    174   -335.5
  OPHTHALMIC    0     0     -            3     3     0.980        3     3     0.913
  PHENOMENA     2     2     1.494        2     4     4.165        2     4     4.000
  SCLERA        0     0     -            4     8     14.127       4     8     13.484
  SYNDROME      2     ?     2.881        2     3     3.027        2     3     2.973

5. SUMMARY AND CONCLUSIONS
There are two important conclusions to be drawn regarding dynamic dictionary updating. First, it is feasible. Second, it is practical.

Results in the previous section showed that, with the maintenance of some additional information, and the utilization of appropriate algorithms, a dictionary may be dynamically updated effectively.

Additionally, examination of the algorithms shows that this method is more practical than periodic regeneration of the dictionary. Each time a document is added to or retired from the system the term status map is accessed for all the terms in the document. However, a key practical consideration is that these accesses of the term status map may be done at any time. Thus, if a document is retired at a time when the system is very busy, the vector for the retired document may be saved and the term status map updated at a later time when the system is quieter. This does not detract in any appreciable way from the principle of having the dictionary always reflect the current contents of the document collection. First of all, for a very large document collection, each document affects the status of each term only slightly; second of all, the term status map will reflect the contents of the document collection on a day by day or hour by hour basis if not momentarily. The overhead of the additional processing required to keep the dictionary updated for each change in the set of documents is clearly preferable to the hours and possibly days required to regenerate the entire dictionary.

The one event during which term status map access is critical is query submission. However, due to the comparatively long time required to access natural language terms in a dictionary, the additional access of the term status map for each of the query concepts is not significant. Access to the term status map is by concept number, thus the term status map may be maintained on disk in a convenient way, making it easy to access. Also, queries have comparatively few terms and so require only a few accesses of the term status map. While it is clear that any implementation of these algorithms would have to be done carefully to ensure practicality, it seems apparent that such implementation is entirely feasible.

In addition to showing the effectiveness and efficiency of the updating algorithms, a significant contribution is the derivation of the dynamic discrimination value. Clearly, just as the discrimination value was useful and even necessary [5] in the static case for determination of index terms, a similar tool is required in the dynamic case. The dynamic discrimination value fills the requirements.

There is another interesting and important benefit to be gained from the use of the dynamic dictionary system as outlined. Information retrieval systems are generally designed to provide good retrieval on the average. Thus, the dictionary consists of the set of index terms which must necessarily be utilized by users of the system. However, there are instances in which this does not provide for good results for a particular query. Consider the following three cases.
(i) A user would prefer to formulate a search query using some narrow specific term(s) which are not included in the standard set of index terms, but which occur in the collection.
(ii) A user would prefer to formulate a search query using some broad general terms which occur in the documents but are not included in the standard set of index terms, being poor discriminators.
(iii) Two or more distinct user groups are querying the same collection; however, the standard set of index terms provides good results for only one user group.

In the static case, we are stuck with the documents as indexed. If a term is not a standard index term, it is not used to index the documents, and documents having that term may only be determined by a re-examination of the natural language text of the documents. It is possible to index a collection using several dictionaries. However, this requires duplication of the collection of document vectors for each dictionary. With large collections this is not feasible.

Utilizing the design of the dynamic system given in this paper, the problems are easily solved. The main consideration is that all documents are fully indexed, reflecting all terms occurring in the original text. It is only when a query vector is filtered through the term status map that non-index terms are removed. It is a simple matter to provide users with a means of specifying non-standard index terms which are to be retained for search purposes. In the case of distinct user populations, additional values may be maintained in the term status map specifying the index status of each term for each user group. Thus, dictionaries need not be constructed for a static document collection and an average user. Rather, the dictionary may be dynamic with respect to both changes in the collection and the needs of different users.
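
The two extensions suggested above, user-specified terms to be retained and a separate index status per user group, both amount to a small change in the query filtering step. The following is a minimal sketch under stated assumptions: the group names, the per-group status values and the function name `filter_query` are all hypothetical, and the concept numbers are borrowed from Fig. 2 purely for illustration.

```python
def filter_query(full_vector, status_by_group, group, retained=()):
    """Keep a concept if the group's status flags it as an index term,
    or if the user explicitly asked for it to be retained."""
    retained = set(retained)
    return {concept: weight for concept, weight in full_vector.items()
            if concept in retained
            or status_by_group.get(concept, {}).get(group, False)}


# Hypothetical per-group index status for three concepts from Fig. 2.
status_by_group = {
    174: {"clinicians": True, "researchers": True},
    380: {"clinicians": False, "researchers": False},
    462: {"clinicians": False, "researchers": True},
}
full_vector = {174: 1, 380: 2, 462: 1}

for_clinicians = filter_query(full_vector, status_by_group, "clinicians")
for_researchers = filter_query(full_vector, status_by_group, "researchers")
with_override = filter_query(full_vector, status_by_group, "clinicians",
                             retained=[380])
```

Since the document vectors are full vectors, none of these variations requires re-indexing the collection; only the filter applied to the query changes.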
REFERENCES
[1] G. SALTON, A Theory of Indexing. Cornell University (March 1974).
[2] T. BRAUEN, Document vector modification in on-line information retrieval. M.Sc. Thesis, Cornell University (Sept. 1969).
[3] M. D. KERCHNER, Dynamic document processing in clustered collections. Ph.D. Thesis, Cornell University (Dec. 1971).
[4] R. G. CRAWFORD, Automatic dictionary construction and updating, Chap. V-Dynamic information retrieval system. Ph.D. Thesis, Cornell University (June 1975).
[5] R. G. CRAWFORD, Automatic negative dictionary construction. Report No. ISR-12 to the National Science Foundation, Department of Computer Science, Cornell University (Nov. 1974).
[6] G. SALTON, C. S. YANG and C. T. YU, Contribution to the theory of indexing. Proc. of the IFIP Congress (Aug. 1974).
[7] E. K. CLEMONS and J. E. NEWTON, A sequential refinement method for removing non-discriminators from a document collection. Department of Computer Science, Cornell University (May 1973).