Information Processing & Management Vol. 24, No. 1, pp. 17-22, 1988
Printed in Great Britain.
0306-4573/88 $3.00 + .00
© 1988 Pergamon Journals Ltd.
AN IMPROVED ALGORITHM FOR THE CALCULATION OF EXACT TERM DISCRIMINATION VALUES

ABDELMOULA EL-HAMDOUCHI and PETER WILLETT*
Department of Information Studies, University of Sheffield, Western Bank, Sheffield S10 2TN, UK

*To whom all correspondence should be addressed.

(Received 24 March 1987; accepted 8 May 1987)
Abstract-The term discrimination model provides a means of evaluating indexing terms in automatic document retrieval systems. This article describes an efficient algorithm for the calculation of term discrimination values that may be used when the interdocument similarity measure used is the cosine coefficient and when the document representatives have been weighted using one particular term-weighting scheme. The algorithm has an expected running time proportional to Nn² for a collection of N documents, each of which has been assigned an average of n terms.
1. INTRODUCTION
The term discrimination model [1, 2] has been suggested as a basis for the evaluation of indexing terms in computerized document retrieval systems. For some term i, the term discrimination value, DV_i, measures the extent to which the use of i as an indexing term affects the separation of the documents in the multidimensional space defined by the indexing vocabulary. Several studies have demonstrated a strong relationship between term discrimination and term frequency, with the most highly discriminating terms being those of intermediate frequencies of occurrence in document collections.

We are currently reevaluating the use of the term discrimination model as a basis for automatic indexing strategies. One obvious current limitation of the model is that the calculation of the DV_i values involves extensive computation. For a collection of N documents indexed by a total of M terms, the obvious algorithm [3] for the computation of all M DV_i values involves the calculation of O(N²M) interdocument similarity coefficients, and most studies of term discrimination have accordingly used an approximate method for the calculation of the DV_i values. Willett [3] has recently described an algorithm for the calculation of exact discrimination values that involves the calculation of only O(N²n) similarity coefficients, where n is the mean number of indexing terms assigned to each of the documents in a file. This article reports a new algorithm for the calculation of exact term discrimination values that has an expected running time of order O(Nn²).
2. CALCULATION OF TERM DISCRIMINATION VALUES
A collection of N documents is assumed to be represented by a series of document vectors D_j, 1 ≤ j ≤ N. Each such document vector contains M elements, where M is the number of distinct terms that have been used for the indexing of the collection: the ith element, 1 ≤ i ≤ M, d_ji, contains the number of occurrences of the ith term in the jth document. In many cases, including all of the seven document collections considered here, the indexing is binary in character, so that the d_ji values are either 0 or 1. Thus, the collection may be visualized as an N × M bit matrix in which the jth row represents the occurrence of terms in the jth document, and the ith column the occurrences of the ith term in the documents. A measure of the similarity between some pair of documents D_j and D_k may then be calculated using a coefficient such as the cosine coefficient, COSJK, defined by
COSJK = \frac{\sum_{i=1}^{M} d_{ji} \, d_{ki}}{\left( \sum_{i=1}^{M} d_{ji}^2 \cdot \sum_{i=1}^{M} d_{ki}^2 \right)^{1/2}}
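As a concrete illustration, the following Python fragment (a minimal sketch; the five-term vectors are invented for the example) computes COSJK directly from this definition:

import math

def cosine(dj, dk):
    # COSJK = sum_i dji*dki / (sum_i dji^2 * sum_i dki^2)^(1/2)
    dot = sum(a * b for a, b in zip(dj, dk))
    norm = math.sqrt(sum(a * a for a in dj) * sum(b * b for b in dk))
    return dot / norm if norm else 0.0

dj = [1, 1, 0, 1, 0]   # document j indexed by terms 1, 2 and 4
dk = [0, 1, 0, 1, 1]   # document k indexed by terms 2, 4 and 5
print(cosine(dj, dk))  # 2 / sqrt(3 * 3) = 0.6666...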
A space in which all of the documents are as far apart as possible will be one that corresponds to minimizing the sum of the COSJK values for all pairs of documents j, k. This sum, Q, is defined by
Q = \sum_{j,k=1,\; j<k}^{N} COSJK
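Computed directly, Q requires one coefficient for each of the N(N - 1)/2 distinct pairs, as in the following sketch (a hypothetical three-document collection); it is exactly this quadratic pattern of coefficient calculations that the centroid identities below avoid:

import math

def cosine(dj, dk):
    dot = sum(a * b for a, b in zip(dj, dk))
    # binary vectors: the sum of squares equals the number of terms
    norm = math.sqrt(sum(dj) * sum(dk))
    return dot / norm if norm else 0.0

docs = [[1, 1, 0, 1],
        [0, 1, 1, 0],
        [1, 0, 1, 1]]
N = len(docs)
Q = sum(cosine(docs[j], docs[k])
        for j in range(N) for k in range(j + 1, N))
print(Q)   # one coefficient per distinct pair j < k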
The effect of an individual term i on the interdocument similarities may be determined by calculating Q and then recalculating it when i has been deleted from the index term lists of the documents to which it had been assigned; this corresponds to setting all of the elements d_ji to 0 for 1 ≤ j ≤ N. The difference between Q and this new sum, Q_i, then gives a measure of the extent to which the presence of i in the indexing vocabulary affects the separation of the documents in the collection that is being analyzed. The term discrimination value of i, DV_i, is defined to be
DK = (Pi - QIJ’Q and the merits of different indexing terms may be evaluated by ranking the corresponding DF$ values. Terms with large positive values will occur at the top of the ranking; such terms are considered to be potentially useful for the representation of document content, whereas those terms lower down the ranking are considered to be poor discriminators [ 1,2]. The algorithm to be discussed in this article takes as its basis some recent work by Voorhees [4], which involves the use of the group average hierarchic agglomerative clustering method for automatic document classification. This clustering method may be implemented by a series of fusions in which those two clusters are fused for which the average interdocument similarity coefficient is a m~imum, the average being taken over all of the x, * xb coefficients for a pair of clusters containing x, and xb documents, respectively. However, the evaluation of the x, * xb coefficients can be replaced by the evaluation of a single coefficient as we now demonstrate. Let A be the linear combination, or centroid, of the document vectors Dj, 1 5 j I x,, representing the documents in the first cluster so that A = where
Wj is the weight of the jth
2 wjDj
J=l
vector in A. The ith component
a, =
of A, ai, is given by
2 WjdJ,
J=l
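For instance (a small sketch with invented vectors and uniform weights, purely for illustration; any w_j scheme may be substituted):

docs = [[1, 1, 0, 1],
        [0, 1, 1, 0]]
weights = [0.5, 0.5]
M = len(docs[0])
# ai = sum_j wj * dji, the ith component of the centroid A
a = [sum(w * d[i] for w, d in zip(weights, docs)) for i in range(M)]
print(a)   # [0.5, 1.0, 0.5, 0.5]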
where d_ji is the ith component of the jth document in the cluster. The vector B and its components b_i may be defined similarly in terms of the x_b documents in the second cluster. The dot product of the two vectors A and B is given by
DOTPRODAB = \sum_{j=1}^{x_a} w_j D_j \cdot \sum_{k=1}^{x_b} w_k D_k

which may be rewritten as

DOTPRODAB = \sum_{j=1}^{x_a} \sum_{k=1}^{x_b} w_j \, w_k \, DOTPRODJK
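This identity is easily checked numerically; the sketch below uses hypothetical random binary clusters and arbitrary weights:

import numpy as np

rng = np.random.default_rng(0)
A_docs = rng.integers(0, 2, size=(3, 8))      # xa = 3 binary documents
B_docs = rng.integers(0, 2, size=(4, 8))      # xb = 4 binary documents
wa, wb = rng.random(3), rng.random(4)

lhs = (wa @ A_docs) @ (wb @ B_docs)           # DOTPRODAB via the two centroids
rhs = sum(wa[j] * wb[k] * (A_docs[j] @ B_docs[k])   # xa*xb pairwise dot products
          for j in range(3) for k in range(4))
print(np.isclose(lhs, rhs))                   # True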
where DOTPRODJK is the dot product of the vectors D_j and D_k. Hence, the dot product of the two centroid vectors is given by the sum of the weighted dot products between all x_a * x_b pairs of documents for which one document is in the first cluster and the other in the second. Various weighting schemes may be used to obtain the weights w_j and w_k [5]. In this article we consider the weight w_j = 1/SUMSQJ^{1/2}, where
SUMSQJ = \sum_{i=1}^{M} d_{ji}^2
and similarly for w_k. In this case, the expression for DOTPRODAB becomes:

DOTPRODAB = \sum_{j=1}^{x_a} \sum_{k=1}^{x_b} COSJK
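A numerical check of this special case (hypothetical data again; a dummy term in column 0 guarantees that no document vector is empty):

import numpy as np

rng = np.random.default_rng(1)
A_docs = rng.integers(0, 2, size=(3, 8)).astype(float)
B_docs = rng.integers(0, 2, size=(4, 8)).astype(float)
A_docs[:, 0] = B_docs[:, 0] = 1.0             # dummy term: no empty documents

wa = 1.0 / np.sqrt((A_docs ** 2).sum(axis=1)) # wj = 1/SUMSQJ^(1/2)
wb = 1.0 / np.sqrt((B_docs ** 2).sum(axis=1))
lhs = (wa @ A_docs) @ (wb @ B_docs)           # single intercentroid dot product
rhs = sum(wa[j] * wb[k] * (A_docs[j] @ B_docs[k])   # sum of the COSJK values
          for j in range(3) for k in range(4))
print(np.isclose(lhs, rhs))                   # True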
Thus, the calculation of x_a * x_b interdocument cosine similarities may be reduced to the calculation of a single intercentroid dot product similarity when the 1/SUMSQJ^{1/2} weight is used. This result was used by Voorhees in her work on hierarchic document clustering [4], and we will show how it may be applied to the development of an efficient algorithm for the calculation of exact term discrimination values. Let A = B = C, where C is the centroid of the entire document collection, so that:
DOTPRODCC = \sum_{j=1}^{N} \sum_{k=1}^{N} w_j \, w_k \, DOTPRODJK
Note that the left-hand side of this expression, the dot product of C with itself, is equivalent to SUMSQC. The calculation of a DV_i value involves the sum of all of the distinct interdocument similarities, that is, the N(N - 1)/2 similarities for which j < k, and thus the equation above may be formulated as:
SUMSQC = 2 \sum_{j,k=1,\; j<k}^{N} w_j \, w_k \, DOTPRODJK + \sum_{j=1}^{N} w_j^2 \, SUMSQJ
from which Q and Q_i may be obtained as

2Q = SUMSQC - \sum_{j=1}^{N} w_j^2 \, SUMSQJ

and

2Q_i = SUMSQCI - \sum_{j=1}^{N} w_{ij}^2 \, SUMSQJI
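The first of these identities can be demonstrated directly. The sketch below (hypothetical binary data, with a dummy term to avoid empty documents) recovers Q from the collection centroid alone and checks it against the brute-force pairwise sum:

import numpy as np

rng = np.random.default_rng(2)
docs = rng.integers(0, 2, size=(6, 10)).astype(float)
docs[:, 0] = 1.0                              # dummy term: no empty documents
N = len(docs)

sumsq = (docs ** 2).sum(axis=1)               # SUMSQJ for each document
w = 1.0 / np.sqrt(sumsq)                      # the 1/SUMSQJ^(1/2) weight
c = w @ docs                                  # the collection centroid C
sumsqc = c @ c                                # SUMSQC = DOTPRODCC
q_fast = 0.5 * (sumsqc - (w ** 2 * sumsq).sum())   # 2Q = SUMSQC - sum wj^2*SUMSQJ

q_slow = sum(w[j] * w[k] * (docs[j] @ docs[k])     # N(N-1)/2 cosine coefficients
             for j in range(N) for k in range(j + 1, N))
print(np.isclose(q_fast, q_slow))             # True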
where SUMSQCI and SUMSQJI refer to the sums of the squares for the centroid and for the jth document, respectively, when the ith term has been deleted from the indexing vocabulary, and where w_ij is the corresponding weight when this deletion has occurred. DV_i has been defined above as (Q_i - Q)/Q. However, Q is a constant for a given collection, and thus the denominator will not affect the ranking of a set of DV_i values if it is the relative rather than the absolute values that are of interest. The term Q in the denominator may hence be neglected, as may the factor of 2 on the left-hand side of the equations above since this, again, will not affect the ranking. We hence obtain:
DV_i = SUMSQCI - SUMSQC - \sum_{j=1}^{N} w_{ij}^2 \, SUMSQJI + \sum_{j=1}^{N} w_j^2 \, SUMSQJ
and thus, if w_j = 1/SUMSQJ^{1/2} and w_ij = 1/SUMSQJI^{1/2}, then each term w_j^2 SUMSQJ and each term w_{ij}^2 SUMSQJI is equal to 1, the two summations cancel, and:

DV_i = SUMSQCI - SUMSQC
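As a check on this result (a sketch with hypothetical binary data; a dummy term keeps every document nonempty even after a deletion), SUMSQCI - SUMSQC does indeed equal 2(Q_i - Q):

import numpy as np

rng = np.random.default_rng(3)
docs = rng.integers(0, 2, size=(6, 10)).astype(float)
docs[:, 0] = 1.0                              # dummy term: documents stay nonempty

def q(d):
    # brute-force sum of the N(N-1)/2 distinct cosine coefficients
    w = 1.0 / np.sqrt((d ** 2).sum(axis=1))
    n = len(d)
    return sum(w[j] * w[k] * (d[j] @ d[k])
               for j in range(n) for k in range(j + 1, n))

def sumsqc(d):
    # squared length of the weighted collection centroid
    w = 1.0 / np.sqrt((d ** 2).sum(axis=1))
    c = w @ d
    return c @ c

i = 3                                         # delete an arbitrary term i
docs_i = docs.copy()
docs_i[:, i] = 0.0
print(np.isclose(sumsqc(docs_i) - sumsqc(docs),
                 2 * (q(docs_i) - q(docs))))  # True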
To summarize: if DV_i is calculated with the index terms characterizing each document being weighted by the 1/SUMSQJ^{1/2} weight, the N(N - 1)/2 interdocument similarities may be replaced by a single calculation that involves only the centroids of the document collection when the ith term is and is not used for indexing purposes. This observation may be used to frame an extremely efficient algorithm for the calculation of the M DV_i values, as we now describe. If c_k and c_ik are the kth components of C and CI, then, substituting for SUMSQCI and SUMSQC, we obtain:
DV_i = \sum_{k=1}^{M} c_{ik}^2 - \sum_{k=1}^{M} c_k^2
     = \sum_{k=1,\; k \neq i}^{M} (c_{ik}^2 - c_k^2) - c_i^2
     = \sum_{k=1,\; k \neq i}^{M} [(c_{ik} + c_k)(c_{ik} - c_k)] - c_i^2
Let KI be the set of documents indexed by both i and k, and KNI the set of those indexed by k but not by i. For each term k <> i, c_k can be expressed as follows:
c_k = \sum_{D_j \in KI} w_j d_{jk} + \sum_{D_j \in KNI} w_j d_{jk}
Accordingly,

\sum_{D_j \in KNI} w_j d_{jk} = c_k - \sum_{D_j \in KI} w_j d_{jk}
Similarly, c_ik can be rewritten as follows:

c_{ik} = \sum_{D_j \in KI} w_{ij} d_{jk} + \sum_{D_j \in KNI} w_{ij} d_{jk}

Note that in KNI, since i is absent from these documents, w_{ij} d_{jk} = w_j d_{jk}.
Therefore, DV_i can be expressed as follows, after the obvious simplifications:

DV_i = \sum_{k=1,\; k \neq i}^{M} [DIF_k (DIF_k + 2 c_k)] - c_i^2
where, for each term k (and taking account of the binary indexing convention above), DIF_k is equivalent to:

DIF_k = \sum_{D_j \in KI} (w_{ij} - w_j) \, d_{jk}
      = \sum_{D_j \in KI} [(1/(SZ_j - 1))^{1/2} - (1/SZ_j)^{1/2}]
where the summation is extended over the documents indexed by both i and k, and SZ_j is the size, or number of index terms, describing document j. The important thing to note about this last expression is that DIF_k is nonzero only for those terms that occur in the documents indexed by i; for any other term k, c_ik and c_k will be equal and thus cancel each other out. Hence, rather than considering all M terms, DV_i may be evaluated by considering only those terms that cooccur with i. Such terms may be identified efficiently by using an inverted file to identify those documents in the collection that contain i, and then restricting attention to the other terms that occur in these documents.
3. THE ALGORITHM
The discussion in the previous section leads to the following algorithm, which calculates all M discrimination values for a collection of documents. C and DIF are two arrays that are used to cumulate the contributions to the two components of the above expression. The algorithm assumes that the documents are stored for search in an N × M integer matrix in which the element d_ji contains the number of occurrences of the ith term in the jth document.

FOR i := 1 TO M DO
BEGIN
    C(i) := DIF(i) := 0;
    FOR j := 1 TO N DO
        C(i) := C(i) + w_j * d_ji;
END;
FOR i := 1 TO M DO
BEGIN
    FOR j := 1 TO N DO
        IF d_ji > 0 THEN
            FOR k := 1 TO M DO
                IF d_jk > 0 THEN
                    DIF(k) := DIF(k) + (w_ij - w_j) * d_jk;
    DV_i := 0;
    FOR k := 1 TO M DO
        IF k <> i THEN
            DV_i := DV_i + DIF(k) * (DIF(k) + 2 * C(k));
    DV_i := DV_i - C(i)^2;
    FOR k := 1 TO M DO
        DIF(k) := 0;
END.
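The following Python sketch (with invented random test data) implements the same two-pass computation, but uses an inverted file so that only nonzero entries are visited, as suggested at the end of the previous section; binary indexing and the 1/SUMSQJ^{1/2} weight are assumed throughout, and one value is checked against the SUMSQCI - SUMSQC formulation:

import math
import random

random.seed(42)
N, M = 30, 20
# hypothetical binary document file: doc_terms[j] = set of terms in document j;
# term 0 is assigned to every document so that no document can become empty
doc_terms = [set(random.sample(range(1, M), random.randint(2, 6))) | {0}
             for _ in range(N)]

# inverted file: postings[i] = list of the documents containing term i
postings = [[] for _ in range(M)]
for j, terms in enumerate(doc_terms):
    for i in terms:
        postings[i].append(j)

sz = [len(t) for t in doc_terms]              # SZj (binary: SUMSQJ = SZj)
w = [1.0 / math.sqrt(s) for s in sz]          # wj
wi = [1.0 / math.sqrt(s - 1) for s in sz]     # wij, after deleting one term

# first pass: the centroid components C(k) = sum_j wj * djk
C = [0.0] * M
for j, terms in enumerate(doc_terms):
    for k in terms:
        C[k] += w[j]

DV = [0.0] * M
for i in range(M):
    DIF = {}                                  # DIF(k), held sparsely per term i
    for j in postings[i]:                     # only the documents containing i ...
        delta = wi[j] - w[j]
        for k in doc_terms[j]:                # ... and only the terms cooccurring with i
            DIF[k] = DIF.get(k, 0.0) + delta
    DV[i] = sum(d * (d + 2.0 * C[k])
                for k, d in DIF.items() if k != i) - C[i] ** 2

# spot check against DVi = SUMSQCI - SUMSQC for one term
def sumsqc(deleted=None):
    c = [0.0] * M
    for terms in doc_terms:
        kept = terms - {deleted}
        wj = 1.0 / math.sqrt(len(kept))
        for k in kept:
            c[k] += wj
    return sum(x * x for x in c)

print(abs(DV[5] - (sumsqc(5) - sumsqc())) < 1e-9)    # True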
4. RESULTS AND DISCUSSION
An inspection of the above algorithm shows that it has a running time proportional to NM². Since M and N are generally of comparable magnitude in free-text retrieval systems, this implies a file-size dependency varying as N³, and it might accordingly be thought that the algorithm was of no practical use. However, the time-consuming inspection of the d_ji and d_jk values to determine whether a particular term occurs in a particular document may be eliminated by storing only the nonzero entries in an inverted file and in the document file, as is, in fact, commonly done in text retrieval systems. To allow an analysis of the algorithm, we make two assumptions: first, that all of the documents have the same number of indexing terms, n, and second, that all of the lists in the inverted file are of the same length, m. Although the first assumption may not be too inaccurate, the second certainly is, since textual entities such as words are known to have hyperbolic frequency distributions [6]. If the two assumptions are accepted, then the innermost loop of the algorithm will be performed nm times for each one of the M indexing terms, giving an overall time requirement of order O(nmM). Note that m is given by m = Nn/M, that is, the total number of postings in the inverted file divided by the number of inverted file lists; thus, an alternative expression for the time requirement is of order O(Nn²), since mM = nN.
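This operation count is easy to confirm empirically; the sketch below (hypothetical random collections in which every document has exactly n terms) counts the executions of the innermost loop for several collection sizes:

import random

random.seed(0)
for N, n, M in [(200, 5, 50), (400, 5, 50), (400, 10, 100)]:
    doc_terms = [random.sample(range(M), n) for _ in range(N)]
    postings = [[] for _ in range(M)]
    for j, terms in enumerate(doc_terms):
        for i in terms:
            postings[i].append(j)
    # innermost loop runs once per (term i, document j containing i, term k in j)
    inner = sum(len(doc_terms[j]) for i in range(M) for j in postings[i])
    print(N, n, inner, N * n * n)   # inner == N*n^2 when every document has n terms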
Table 1. Document test collections

Collection    N        n      CPU (sec)*   Nn²/CPU (×10⁴)
Keen          800      9.1    1.9          4.0
Evans         2542     6.6    3.0          3.7
Harding       2472     36.3   88.0         3.7
Cranfield     1400     28.7   28.2         4.1
LISA          6004     39.1   236.5        4.0
UKCIS         27361    6.7    33.5         3.1
INSPEC        12864    36.0   421.2        4.0

*This time is reduced by 3 to 6 times, depending on the collection characteristics, if the program is compiled using the level-3 optimising FORTRAN compiler.
Thus, if CPU is the central processor time for the execution of the algorithm, exclusive of I/O operations, it would be expected that Nn²/CPU should be constant across different test collections. The storage requirement, additional to that of the inverted and direct files, is of order O(M) for the arrays C and DIF.

The algorithm was encoded in FORTRAN 77 and run on an IBM 3083 using seven document test collections, as shown in Table 1; further details of the collections used are presented by Griffiths et al. [7]. An inspection of the table reveals two things. First, Nn²/CPU is indeed approximately constant, thus confirming the expected O(Nn²) running time of our algorithm. Second, the magnitudes of the CPU values are all small, thus implying that the term discrimination model can be used on a routine basis even with large data sets.

Acknowledgements-We thank the British Council for the award of a Ph.D. studentship to AEH and the referees for their helpful comments on an earlier draft of the article.
REFERENCES

1. Salton, G.; Yang, C. S.; Yu, C. T. A theory of term importance in automatic text analysis. Journal of the American Society for Information Science 26: 33; 1975.
2. Salton, G.; Wong, A.; Yang, C. S. A vector space model for automatic indexing. Communications of the ACM 18: 613; 1975.
3. Willett, P. An algorithm for the calculation of exact term discrimination values. Information Processing and Management 21: 225; 1985.
4. Voorhees, E. Implementing agglomerative hierarchic clustering algorithms for use in document retrieval. Information Processing and Management 22: 465; 1986.
5. Sparck Jones, K. Index term weighting. Information Storage and Retrieval 9: 619; 1973.
6. Lynch, M. F. Variety generation: a reinterpretation of Shannon's mathematical theory of communication, and its implications for information science. Journal of the American Society for Information Science 28: 19; 1977.
7. Griffiths, A.; Luckhurst, H. C.; Willett, P. Using interdocument similarity information in document retrieval systems. Journal of the American Society for Information Science 37: 3; 1986.