COMPRESSION OF LARGE INVERTED FILES WITH HYPERBOLIC TERM ~~STRI3~TION E. I. SCHUEGRAF Departmentof MathematicsSt. Francis Xavier University Antigonish, NovaScotia,Canada
Abstract-Thestorage requjrements for retrieval systems utili~ng inverted files are calculated assuming different storage modes. Various methods for compression of these large files are analyzed. Binary vectors compressed by run-length coding as well as lists of document numbers were found to be suitable. The problem of minimal storage requirements for the inverted file is solved for different assumptions about index term distributions. A representation combining run-length coded binary vectors with list of document numbers was found to be the most economical. Parameter values for this minimum storage form are calculated and specified in tables as well as displayed graphically.
1. INTRODUCTION Many interactive data base management systems as well as most automated library systems must respond quickIy to user inquiries, This constraint presents substantial problems especially in the design of on-line systems accessing a large data base such as a library catalog file of the document tile in a retrospective document retrieval system. Various solutions to achieve reasonable response time have been suggested, ranging from the use of special technoIogy[l, 21 to tile structures tailored to a particular application [3]. A common approach to satisfying the above mentioned constraint is to generate from the original data an auxiliary file whose only function is to provide quick response to queries. Unfortunately, the storage requirements nearly double, since the redundant material in the auxiliary file almost duplicates the original data. The main distinction between the original and the auxiliary file is the di~erent structure imposed on the data. As noted by ~~ENAs[~] the organization of the auxiliary file “becomes another file problem in itself, possibly of the same magnitude as the data base itself”. In theory, it is possible to select the organization of the auxiliary file from the many possible methods described in the literature[S]. However, most systems use either the inverted file (TDMS, GIS) or the tree organization (II&S). Tree structured files are discussed by Sussenguth [6] and Stanfel [i’],while inverted files are described by GARDENAS[4,8] and KING[9]. A comparison of the efficiency of the two structures for document retrieval was made by EIN-DoR[IO].From these discussions in the literature the fact emerges that an inverted file performs better than a tree structured auxiliary file, especially with respect to complex queries. For the subsequent discussion it is assumed that an inverted file is used as the tool for answering queries. 2. METHODSFOR STORINGINVERTEDFILES Two alternate interpretations of inverted files are possible, resulting in two different storage forms and manipulation methods. The first interpretation envisages the file as a store of lists of document numbers [ 111.A Iist L(k) consists of all the document, or accession numbers of all the records in the data base which contain the kth index element. The number of different lists D is determined by the size of the index set, since each element k has one associated list L(k). If the database is comprised of N records, [ log, N bits are needed to represent one accession number. (r(x) denotes the smallest integer greater than x.) Problems arising from the distribution of index terms resulting in different list lengths have been discussed by Low~z[l2].The inverted index is usually stored as a sequence of accession numbers and pointers associated with each index term indicating the beginning of each list. Figure I shows such an ~rangement. If it is assumed that there are a total of i entries in the inverted file, the storage requirements in bits may be calculated as,
37x
E. J. :bii?EXTtfl~~
SCHUEGRAF
POINTERS
ACCESSION
NWE3f.RS
Fig. 1. Inverted file interpreted as a store of lists.
with the first term representing the storage for the accession numbers and the second term accounting for the pointers. Coordinate indexing provides the second interpretation for inverted files. Assume that all index terms are ordered according to some criteria and can thus be numbered sequentially. For each document in the collection it is possible to construct a Boolean vector with N components. The ith component is true if index term i is contained in the document, and false, otherwise. Boolean values may be represented by the digits zero and one, and thus have a convenient representation for computer processing. There are several advantages to using binary vectors [9] as the basis for an inverted file. Most machines allow easy manipulation of bit strings and logical operations on these strings can be performed normally by single machine instructions. The answering of queries thus is quite rapid and requires little programming effort. Furthermore, all vectors are of the same length and the pointers required in the list interpretation of an inverted file can be eliminated and may be replaced by address calculations. The storage requirements for an inverted index based on binary vectors is given by S”E(.TI,R= N * D.
(2)
In contrast to (I), the storage needed is independent of the number of index entries and is only a function of the number of documents N in the collection and the size D of the index set. The I non-zero entries in the index represent only a small fraction of the N * D bits required and thus most of the N * D bits are zero. Elimination or compression of the many zeros can reduce storage requirements. The approach by THIEL.and HEAPS[~l] combines the list interpretation with the binary vector. If a few successively numbered records contain the same index term a bit map for these records is stored rather than the record numbers. Some storage overhead is introduced by the necessity for special codes indicating the storage mode used (binary vector/document numbers). The method is quite eflicient for clustered document collections or documents which have been preordered. For inverted files based on the binary vector another method of compressing the zeros, called run-length coding may be borrowed from picture compression.
3. RUN-LENGTH CODING The idea of run-length coding is fairly simple. A string of consecutive zeros terminated by a one (called a run) is replaced hy the length of the run[ 131. The main area of application has been image compression[l4], but it has also been suggested for compression of bibliographic files[ IS, 251. Various attempts have been made to optimize the code for the length of a runll6,17]. For reasons of convenient computer manipulation a fixed number of bits b is often chosen to encode the length of a run. Assume that the average length of runs is r, then a value of b = [ log, r is the optimal assignment for b and produces minimum storage requ~ements. Thus, the only problem remaining in the choosing of b is the calculation of the average run length, r(n, k), of a binary vector with n components
containg
k ones. There are
such vectors 0 I: with n-k zeros and k ones. Assuming that each of these vectors occurs with the same probability, it is sufficient for the computation of r to determine the total number of runs, R(n, k), in all these vectors. The value of R may then be calculated as
Compression of large inverted files with hyperbolic term distribution
r(n, k) =
01 (n-k) r(n, k)
379
(3)
R(n, k) may be computed by the foI~owing arguments: Let S(n, k) be the set of ail vectors with n components, consisting of n - k zeros and k ones. An element of S is said to have property A if the bit in position n is a one. The set S(n, k) may be divided into two mutually exclusive subsets, S(n, k, A), containing all vectors with property A, and the complement S(n, k, A). The number of different vectors in each set are given by
Assume that R(n - 1, k) is already known and R(n, k) is to be calculated. There are three possible cases of constructing a member of S(n, k) (see Fig. 2).
S(411
J
Fig. 2. Construc~jonof S(5,Z).
(1) Take the elements of S(n - 1, k, A ) and add a zero in position n. This increases the length of the runs but the number of runs remains the same. (2) Take the elements of S(n - 1, k, A) and add a zero in position n. The addit~oRa1zero adds a run to each element in addition to the number of runs in S(n - 1, k, A), The set S(n - I, k - 1, n-2 elements and since one run is added per element the number of additional runs is A)bas k_, C > n-2 ( k-l >’ (3) The third case of constructing an element of S(n, k) is by taking an element of S(n - 1, k - 1) and adding a one in position n. The number of runs stay the same as in S(n - f, k - I), given by R(n - 1, k - I). A11the elements of S{n, k) are covered by these three cases and thus the number of runs, R(n, k), is given by R(n,k)=R(n-1,k-l)+R(n-I,k)c
(4)
The last two terms are the contribution from cases 1 and 2, whereas the first is the contribution of case 3. This recursion formula has its start given by the following obvious values: R(j,j-l)=j R(j, 1) = 20’ - 1).
(5)
Standard mathematical procedures give, after considerable manipulation, the following solution to the recursion:
R(n,k)=
a result which may be verified by substituting may be computed as
(6) into (4). Using (3) and (6) the average run-length
r(n,k)=&
(7)
which in turn yields for the number of bits b(n, k) for encoding
the runs
The above result is optima1 only for a single binary vector with n components of which there are k ones, and is optimal only for those entire inverted files, where each term is used to index exactly the same number of documents. As this assumption is not very realistic, eqn (8) may not specify the optimal value for the inverted file as a whole. For optimization of the total storage the vectors associated with all the index terms should be considered, since the number of non-zero components in the vectors may vary widely. 4. HYPERBOI~IC
TERM DISTRIBUTIONS
AND RUG-LENGTK
CODING
Various studies and experiments have been conducted to determine the empirical distribution of index terms, while many attempts have been made at resolving the question of the most optimal distribution for retrieval[l&211. As in other bibliometric areas[22], a distribution commoniy referred to as Zipf’s Law 1231,is known to fit most of the observed data for index term distributions. Zipf’s Law implies, that if the D index terms are ranked in order of decreasing frequency, then the frequency f of a term of rank r is given by j(r)
The constant C, the only parameter approximation, as
= C/r.
in the distribution,
(9) may be computed using a standard series
I C = log, D + 0.271 with I being the total number of entries in the index. The knowledge of the term distribution may be utilized to find the most economic storage mode for the entire index, since eqn (9) specifies the number of non-zero entries in the N-component binary vector associated with a term of rank r. If the index is stored as lists of document numbers, then (9) may be interpreted as the length of the lists. The choice of representation for the inverted file under the assumption of a hyperbolic distribution appears difficult. For terms of high frequency the binary vector seems to be more economical, especially if run-length coding is used for compression. On the other hand, document numbers appear more suitable for low frequency terms. An uncompressed bit vector need not be considered since even for the most frequent terms run-length coding is more economical. The frequency I( of the term, for which run-Iength coding and bit vector representation use the same amount of storage, can be determined by solving the equation
N=(K+I)log,
(&, 1
(10)
for K. However, this equation has no solution since the right side of eqn (IO) has a maximum at K = (N/e)- 1 with a value of (N/e log2). Therefore, uncompressed bit vectors can be
Compression of large inverted files with hyperbolic term distribution
381
eliminated, and only run-length coded bit vectors and document numbers remain for consideration. If it is assumed that a combination of the two representations is to be the most economical, then there must be a certain rank, say rO, at which run-length coded representation is changed to document numbers. The storage requirement for the entire index, assuming Zipf’s distribution for the index terms, can thus be expressed as,
(11)
with the first sum expressing the contribution from run-length coded vectors, while the second term results from the lists of document numbers. To determine the optimal value for S, it is sufficient to find the minimum of SC with respect to r,. Straightforward computations lead to the following equation for rO r. =
ClogN
_
1
( > 1+$
log?
which has no analytical solution but may be solved numerically. Tables 1 and 2 show ro, for typical values of C and N, while Figs. 3 and 4 are graphic representations. The results show the expected behavior, that the rank r. at which the mode of representation is to be switched decreases as the document collection increases. The rank r. also increases when the Zipf constant C increases, since this implies normally a large vocabulary and more entries in the index. Having determined the optimal rank r. it is still necessary to find the number of bits to be used for representing a run. For computers with fixed word length it is easier to represent a run by a code of fixed length rather than by a variable length code. The advantage of fixed length codes over variable length codes is discussed in detail by SCHUEGRAF and HEAPS[%]. Minimization of wasted storage leads to the following formula for the number of bits b to be used for coding a run:
i(” +1) log*-(c/r.)+N
b = r=I r
2
(:+
1
1)
Table 1. Values for r,, b, R, as a function of Zipf’s constant C N = 100,000 b R
ZIPF C
r,
10,000 30,000 50,000 80,000 100,000
1773 5314 8855 14,165 17,706
8.43 7.63 7.26 6.92 6.76
0.707 0.641 0.603 0.537 0.516
N = l,OOO,OOO b R
r.
1561 4673 7785 12,454 15,566
11.64 10.85 10.48 10.14 9.98
0.750 0.695 0.654 0.602 0.588
Table 2. Values of r,, b, R, as a function of the number of documents N
N
ro
c = 10,000 b R
10,000 2069 50,000 1852 100,000 1773 500,000 1618 l,OOO,OOO 1560 lO,OOO,OOO 1398
5.23 7.46 8.43 10.67 11.64 14.88
0.642 0.686 0.707 0.735 0.750 0.780
r,
c = 100,000 b R
17;706 16,145 15,566 13,937
6.76 9.01 9.98 13.21
0.516 0.565 0.588 0.675
382
E. J.
SCHUEGRAF
+ 5
105
t
103 I ~-~~_ 104
-~
'0
C
&i
~I
Fig. 4
_~L ~_
L
105
Fig. 3. Fig. 3. r, as a function of C. Fig. 4. r,, as a function of N.
which may be expressed
as
b=J-
log 2
1
log* r, ~+rD~C~[logrO-l]+C
log;+
(12)
r0 + c . (log r, + 0.577)
Values for b are also contained in Tables 1 and 2, as functions of C and N, with Figs. 5 and 6 showing the graphic representation. As anticipated, the code length for a run increases with the number of documents, but decreases with vocabulary growth as indicated by large values of C. Since the code length b must be an integer the values given in Tables 1 and 2 should be rounded to the next highest integer. However, practical considerations may force other values of b, because of the fixed character or word length of the computer to be used. For example, a choice of 8- or 1Zbits for the code length of a run would be quite suitable for computers with an 8-bit character representation, since all computers have special instructions for processing of characters. If this approach is chosen, storage space is sacrificed for convenient programming and easy processing. The advantage of the combined run-length coded binary vector and document number representation over the two other storage modes becomes apparent when the compression ratio R is calculated. The compression is defined as the ratio of the length of the compressed file to that 1b I 50t
lo-
P___
------g--____~--.a
+--*
N~1,00,,,000
*-+N
A*___
=
1.000.000
5.
-104
5.106 Fig. 5. 6
as a function of C.
. 105
.--
C
Compression of large inverted tiles with hyperhdic
term distribution
383
tb
i .__ __. __b
_._..__1
104
l_l.-.. * 106 105 A_
.
Fig. 6. b as a function of N.
of the uncompressed file. Since the storage space occupied by the list representation is always less than that of uncompressed bit vectors, the former storage mode is used in the calculation of the compression. It is defined as
with SLISTfrom eqn (1) and SC from eqn (ii). Neglecting the contribution from the pointers in (I) allows to compute R as a simple expression: R=l-
(13)
Tables 1 and 2 contain values for R which were calculated by rounding the proper values for b to the nearest integer and using eqn (13). Considerable savings in storage space, between 30 and 40%, may be realized and are especially significant for inverted files containing many documents and a large vocabulary. 5. CONCLUSION
The analysis of storage methods for inverted files in the previous sections has produced two interesting results. It has been shown that under certain conditions a binary vector compressed by run-length coding will occupy less space than the corresponding list of document numbers. The conditions are satisfied if there are k non-zero entries in the n-component binary vector and the inequality
(k + 1)Flog,(A)
holds. Unfortunately, real systems rarely exhibit the property that every index term has the same number of postings, so that the above conditions are almost never satisfied for an entire file. Recently, it was proposed to use equifrequent fragments rather than words as index elements for a retrieval system[24-26,291. In this case, the above conditions would be fuhilled and eqn (8) would specify the best choice for the code length of a run. The storage space for the entire inverted file would be optimized by choosing that particular value. Because most systems do not exhibit the equifrequency property, a more realistic assumption about index term distribution was made. A hyperbolic distribution, the Zipf Law, was adopted, since it is the one most often observed in real systems. Assuming that distribution, the problem of minimizing the storage space for the entire file was soked by a combination of run-length coded binary vectors and lists of document numbers. Some values of the two parameters controlling this representation were calculated and specified in two tables. A choice of these values will
384
E. J. SCHUEGRAF
result in a minimum value for the storage requirements. The relative insensitivity of the values for the storage space with respect to the parameters con&offing the index term distribution allows immediate appii~ation to practical systems. The implementation of a combined run-length coded vector and document number inverted fife appears not to be di~cult, since similar ideas have been already explored[9,11]. The combined approach promises considerable economic benefits, a very important fact, since very few other methods for the compression of general binary matrices exist. It has been shown1271,that even the use of unambiguous bit matrices for compression of large binary matrices is infeasible, although convenient retrieval algorithms based on this storage form exist [28]. As an alternative to the use of unambiguous bit matrices for the compression of binary matrices, such as inverted files, the feasibility of using multiple projections for compression is presently being investigated. Practical problems regarding retrieval and dynamic updating of inverted files in various storage modes are also being considered. AcknowledRemenls-This research was supported by the National Research Council of Canada and the University Council for Research of St. Francis Xavier IJniversity. The many helpful discussions with Dr. S. K. AAI.M are also gratefully acknowledged.
REFERENCES [I] J. I.. PARKER.A logic per track retrieval system. Proceedings IFIP Congress, pp. 71 l-716. North Holland, Amsterdam (1971). [2] 6. PARHAMI,A highly parallel computing system for information retrieval. proceedings Fati Joinf Computer Conference, pp. 681690 (1972). 131J. J. DIMSDALEand H. S. HEAPS, File structure for an on-line catalog of one million titles. J. Lib. Au&m. 1973,6.37. [4] A. F. CARDENAS, Analysis and performance of inverted data base structures. Commun. ACM 1975 18(S), 253. IS] D. LEFKOV~TZ, File Structures for On-Line Systems. Spartan Books, New York (1969). [6] E. H. SUSSENFUTH, Use of tree structnres for processing files. Comma ACM 1963, 6(S), 272. [7] L. E. STANFEL,Three structures for optimal searching. J. ACM 1970, I7(3) 508, IS] A. F. CARDENAS, Evaluation and selection of file organization -a model and system. Comm. ACM 1973,I6(9) 540. [9] D. R. KING, The binary vector as the hasis of an inverted index file. J. Lib. Au~om. 1974, 7(4), 307. [IO] P. EIN-DOR, The comparatjve efficiency of two dictionary structures for document retrieval. INFOR 1974,12(l), 87 [I l] I,. H. THIELand H. S. HEAPS,Program design for retrospective searcheson large data bases. ~nforrn~S&r, R&r. 1972.8. l!: C. LOWEThe influence of data base characteristics and usageon direct accessfile organizations. JAC.V 1%X, 15.535, P. EI.IAS, Predictive coding. IRE 7”rans. Inform. Theory IT-I, 1955, 16. H. KOBAYA~HIand L. R. BAWL,Image data compression by predictive coding. ID&f J, Res. Deuel. 1974, 18, 164. M. F. LYNCH.Compression of bibliographic files using an adaptation of run-length coding. Inform.Star.R&r. l973,9, 207. (161 S. W. GOI.OMB,Run-length encoding. IEEE ‘Iiuns. 1fortn. Theory IT-f2, 1966. IT-12, 399. 1171S. D. BRADLEY,Optimizing a scheme for run-length encoding, Proc. IEEE 1969, 57, 108. 1181N. HOUSTON and E. WAI.L, ‘The distribution of term usage in manipulative indexes. Am. &cum. 1964, 15, 10.5. [l9] G. A. SALTON,Computer evaluation of indexing and text processing. 1ACM 1968, IS, 8. 1201P. ZWDE and V. SLAMECKA,Distribution of indexing terms for maximum efficiency of information transmission, Am. &x&m. 1%7, IS, 204. 1211E. SVENONI~US. An ex~riment in index term frequency. J: ASIS 1972, 23, 109. 1221R. A. FA~RT~o~N~, empirical hyperbolic distributions (Bradford-Zipf-Mandelbrot~ for hihliometric description and prediction. J: Docum. 1969, 25, 319. 1231G. K. ZIPF. Human ~e~al~iour and the Prjncip~e of Lens! Effort. Addison Wesley, New York (19491. 1241 E. J. SCHUEGRAF and H. S. HEAPS,Selection of eqnifrequent word fragments for information retrieval. fnform. Star. Retr. 1973, 9, 697. 1251I. 3. BARTON, S. E. CREASEY, M. F. LYNCHand F. F. SNELI.,An infor~tion theoretic approach to text searehing in direct access systems. ~omrnu~. ACM 1974, f2, 345. 1261 and H. S. HEAPS,Guerv processing in a retrospective document retrieval system that uses word . _ E. J. SCHUEGRAF fragments as language elements. ~nj~rrnn& Proc. kgmt. 1976, i2, 283. 1271S. K. AALT~ and E. J. SCHUEGRAF. A. determination of the number of unambiguous hit matrices. fnt. J. Camp. hf#rm. Sci. to be published. 1281Y. R. WANG,Data retrieval algorithm for unambiguous bit matrices. Paper presented at the 7th Annual Princefan Conf. on Inform. Sci. and Systems, 1973, 22-23 March. [29] A. C.-GLARE.E. M. COOKand M. F. LYNCH,The identi~cation of variable length equifrequent character strings in a natural language data base. Camp. J. 1972, 15, 259.
LIZ] [I31 [14] [15]