DOCUMENT RETRIEVAL USING A SERIAL BIT STRING SEARCH
ALAN F. HARDING, MICHAEL F. LYNCH and PETER WILLETT*
Department of Information Studies, University of Sheffield, Western Bank, Sheffield S10 2TN, England
Abstract-An experimental best match retrieval system is described based on the serial file organisation. Documents and queries are characterised by fixed length bit strings and the time-consuming character-by-character term match is preceded by a bit string search to eliminate large numbers of documents which cannot possibly satisfy the query. Two methods, one fully automatic and one partially manual in character, are described for the generation of such bit string characterisations. Retrieval experiments with a large document test collection show that the two-level search can increase substantially the efficiency of serial searching while maintaining retrieval effectiveness, and that a single-level search based only upon the bit strings results in only a small decrease in effectiveness in some cases.
1. INTRODUCTION
The great majority of current online bibliographic retrieval systems are based on the inverted file organisation. Although this provides a rapid response to Boolean search statements, it entails large computational overheads in the extensive sorting operations involved in the generation and updating of the indexes, the volume of disc space needed for their storage, and the complex software required to access the various files. Because of these overheads, there is now a growing body of interest in the use of serial, or direct, files which do not require indexes but which, unlike conventional batch SDI services, could provide a sufficiently fast response for interactive retrieval. Two main approaches have been suggested for increasing the speed of serial searching. The first of these involves the use of special purpose hardware as reviewed by HOLLAAR[1, 2]. An alternative, software-based approach has been described by HICKEY[3] and by DUNN et al.[4] in which a fast initial search is used to eliminate all but a small percentage of the records in a file. This initial search is based on a fixed-length bit string which is associated with each of the records in the file, and which is matched against a comparable query bit string: only those few records which match the query undergo the computationally demanding character-by-character comparison for a match on the actual query terms. Implicit in this latter approach is a means of mapping the vocabulary used for indexing the items in the file into a dedicated bit string of reasonable length: we term this approach vocabulary reduction. HICKEY[3] and DUNN et al.[4] used a superimposed coding technique due to HARRISON[5, 6] in which digrams and trigrams chosen from the words in documents are mapped into the bit string.
A similar two-level search is used in a chemical context as exemplified by the serial file used for interactive searching of the 5 million chemical compounds in the Chemical Abstracts Service Registry System[7]; in this, the bit string is used to denote the presence or absence of a limited number of substructural fragments which have been carefully selected from the well nigh infinite range of possible substructures[8]. An obvious means of reducing the variety of word types encountered in natural language data bases is the use of short character strings, and there have been many reports of information retrieval systems based on such strings[9-16]. All of these reports have assumed the use of exact match or partial match systems in which the query fragments are combined for search using the well known Boolean operations of conjunction, disjunction and negation. Increasingly, however, information retrieval research has moved towards the use of best match searching in which the documents comprising a file are individually matched against a list of query terms, and ranked in decreasing order of some similarity or distance function. BURNETT et al.[17] and WILLETT[18, 19] have described best match experiments with the Cranfield 1400 document test collection involving a range of methods for vocabulary reduction, including the use of fixed length and variable length substrings, and of truncation and hash coding procedures. It was found that sets of a few hundreds of certain of these descriptor types resulted in a level of retrieval effectiveness comparable with that obtained from use of the complete word vocabulary, despite the great disparity in vocabulary size. This paper describes a novel means of selecting small sets of discriminating textual substrings for the representation of document content, and reports retrieval experiments using this, and other methods for vocabulary reduction, with a large document test collection.

*Author to whom correspondence should be addressed.

2. IDENTIFICATION OF DISCRIMINATING SUBSTRINGS
COLOMBO and RUSH[20] have suggested that a string such as -YBD-, if used in place of the chemical element name MOLYBDENUM in a serial search of a textual data base, would reduce search costs because fewer character comparisons would be needed but would be most unlikely to retrieve any other terms incorrectly. This simple idea forms the basis for the procedure suggested here which involves three main stages. Firstly, a KLIC index is produced using the index terms or title words in the document collection. Each entry of the index shows a word indexed at one of its characters other than the last, and also gives the frequency of occurrence of the word; an extract from such an index is shown in Fig. 1 with the index letter shown by arrows above and below the column. A sorted word list is also used to monitor the process.
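The first stage can be illustrated with a minimal sketch. It assumes that an index entry is a (letter, position, word, frequency) tuple and that entries are ordered by index letter and then by the characters that follow it; the function name `klic_index` and this ordering are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter


def klic_index(term_occurrences):
    """Build a KLIC-style index from an iterable of term occurrences.

    Every word is indexed at each of its characters other than the last,
    and each entry carries the word's frequency of occurrence.
    """
    freq = Counter(term_occurrences)
    entries = []
    for word, count in freq.items():
        # Positions 0 .. len(word) - 2: the last character is never indexed.
        for pos in range(len(word) - 1):
            entries.append((word[pos], pos, word, count))
    # Assumed ordering: index letter first, then the context to its right.
    entries.sort(key=lambda e: (e[0], e[2][e[1] + 1:], e[2]))
    return entries


if __name__ == "__main__":
    for letter, pos, word, count in klic_index(["ROCKET", "ROCKET", "SHOCK"]):
        print(letter, pos, word, count)
```

Scanning the entries for one index letter then brings together all words sharing the same context to the right of that letter, which is the property the second stage relies on.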
In the second stage, the words containing each character are examined in turn, working from the least frequent characters of the alphabet since discriminating strings are likely to include characters of lower frequency. Words of high frequency are considered together with variants on the same stem or root. The substrings of the word or stem which include the character on which the KLIC index is currently arranged are examined: for the words ROCKET and RETROROCKETS, the substrings would include ROCK, OCKE, CKET, OCK, CKE, and KET. Certain of these occur in other stems: thus OCK is seen to occur in BLOCKAGE at the head of the list in Fig. 1, as well as in SHOCK, as may be found by looking up the fragment OCK in the KLIC index. The substring CKE is found, for this vocabulary, to be the shortest discriminant substring, and is selected to represent this group of terms. In general, for high frequency terms, the substring uniquely identifies the stem, but for low frequency terms, which are not likely to be used very frequently as query terms[21] and thus need not be delineated in such detail, the requirement for uniqueness is relaxed, so that a group of unrelated stems may be conflated: thus the substring VERS occurs in REVERSAL, REVERSED, etc. as well as in the unrelated terms VERSES and PERVERSE. The results of this stage are indicated by the substrings enclosed in boxes in the KLIC index of Fig. 1.

Fig. 1. Use of a KLIC index for the identification of shortest discriminating substrings. Boxes enclose groups of words sharing a common such substring.

The final stage involves the marking in the dictionary of words from which shortest discriminant strings have been derived so as to avoid multiple strings being erroneously selected for a single stem. The second and third stages of the procedure are then repeated until words of frequency greater than some low limit in the list are exhausted. There is clearly much latitude in the procedure, and movement up and down the scale of vocabulary size is clearly possible, depending upon the threshold frequency which is chosen. The method is dependent in part upon the particular vocabulary under consideration and, unlike our previous approaches to vocabulary reduction[17, 18], is primarily manual in operation, involving as it does subjective judgements about what words or stems should be conflated. In this respect, the approach is similar to the work involved in the development of a stemming algorithm for retrieval purposes[22], the use of which, it may be noted, also results in a reduction in vocabulary size owing to the conflation of words which share a common root or stem, and to the procedure described by MULLIN for the correction of character recognition systems[23]. Examples of discriminating substrings and of some of the words to which they were assigned in the experiments reported below are shown in Fig. 2.

Once all of the discriminating text fragments have been identified, a document bit string is created by matching each of the document terms in turn against the set of word fragments, and setting that bit which corresponds to the longest matching fragment for the term.
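The second stage and the creation of a document bit string can be sketched as follows. This is only an automated approximation of a step that the text describes as largely manual: the helper names, the toy comparison vocabulary (BASKET is a hypothetical addition), the use of a Python integer as the bit string, and the tie-breaking rule for equally long fragments are all assumptions made for illustration.

```python
def substrings_containing(word, letter):
    """Return every substring of `word` that includes `letter`."""
    return {
        word[i:j]
        for i in range(len(word))
        for j in range(i + 1, len(word) + 1)
        if letter in word[i:j]
    }


def shortest_discriminants(group, others, letter):
    """Shortest substrings shared by all words in `group`, absent from `others`."""
    common = set.intersection(*(substrings_containing(w, letter) for w in group))
    unique = [s for s in common if not any(s in other for other in others)]
    if not unique:
        return []
    shortest = min(len(s) for s in unique)
    return sorted(s for s in unique if len(s) == shortest)


def document_bit_string(terms, fragments):
    """Set, for each term, the bit of its longest matching fragment.

    Bit i of the returned integer corresponds to fragments[i]. Terms that
    match no fragment set no bit; ties go to the earlier fragment (assumed).
    """
    bits = 0
    for term in terms:
        matches = [i for i, frag in enumerate(fragments) if frag in term]
        if matches:
            best = max(matches, key=lambda i: len(fragments[i]))
            bits |= 1 << best
    return bits


if __name__ == "__main__":
    group = ["ROCKET", "RETROROCKETS"]
    others = ["BLOCKAGE", "SHOCK", "BASKET"]  # toy vocabulary, assumed
    print(shortest_discriminants(group, others, "K"))  # ['CKE']
    fragments = ["CKE", "VERS", "HANG", "HAN"]
    print(bin(document_bit_string(["ROCKET", "CHANGES"], fragments)))  # 0b101
```

In this toy vocabulary OCK is rejected because it also occurs in BLOCKAGE and SHOCK, which mirrors the ROCKET example in the text.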
For comparison in the experiments reported below, we have also used a method for vocabulary reduction based on division hashing which performed well in earlier work[18]. In this, some convenient fixed length prefix string (four characters in our experiments) of each term, space filled if necessary, is treated as a binary integer and divided by some specified number, d, where the document and query bit strings are to be of length d bits. The division results in a remainder, r, in the range 0 to d - 1 which is used to set the (r + 1)th bit in the bit string representing the document or query. Since the experiments were carried out on an ICL 1906S computer which has a 24-bit word length, the divisors were all one less than multiples of 24.
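A minimal sketch of division hashing is given below. It assumes ASCII character codes, whereas the original machine used its own internal character code, so the actual bit positions would differ; the function names are illustrative.

```python
def hash_bit(term, d, prefix_len=4):
    """Return the remainder r (0 .. d - 1) for a term's space-filled prefix."""
    prefix = term[:prefix_len].ljust(prefix_len)  # space fill short terms
    # Treat the prefix bytes as one binary integer (ASCII assumed here).
    value = int.from_bytes(prefix.encode("ascii"), "big")
    return value % d


def hashed_bit_string(terms, d):
    """Build a d-bit string as a Python int; bit r is the (r + 1)th bit."""
    bits = 0
    for term in terms:
        bits |= 1 << hash_bit(term, d)
    return bits


if __name__ == "__main__":
    d = 719  # one less than a multiple of 24, as in the experiments
    print(hash_bit("CHANGE", d), hash_bit("CHANGED", d))  # same prefix, same bit
    print(bin(hashed_bit_string(["CHANGE", "VERTEX", "ROCKET"], d)).count("1"))
```

Because only the prefix is hashed, variants such as CHANGE and CHANGED always share a bit, while unrelated terms may collide by chance.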
-HANG-: CHANGE, CHANGED, CHANGES, CHANGING, EXCHANGE, EXCHANGED, UNCHANGED
-HARG-: CHARGE, CHARGED, CHARGING, DISCHARGED
-VERT-: CONVERTER, INVERTED, SEMIVERTEX, VERTEX, VERTICAL, VERTOL
-VISC-: INVISCID, NONVISCOUS, VISCID, VISCOELASTIC, VISCOELASTICITY, VISCOPLASTIC, VISCOSITY, VISCOUS

Fig. 2. Discriminating substrings and the terms to which they were assigned in the Vaswani data base.
3. EXPERIMENTAL DETAILS
The experiments used the Vaswani/National Physical Laboratory document collection which is a large test set containing 11429 documents and 93 queries for which relevance judgements are available. The version of the collection employed here had the documents automatically indexed from titles and abstracts using an extensive stopword list and a suffix stripping algorithm with similar procedures being applied to the query statements: in all, a total of 7119 distinct stem types were identified. A serial search using the bit strings involved the following four steps: (i) matching of the query bit string against each of the document bit strings in turn to identify those documents having some minimal number of bits in common with the query; (ii) matching of the list of query terms against each of the lists of document terms corresponding to the documents passing the bit string match; (iii) ranking of the documents in decreasing order of some matching function based on the common terms; (iv) application of a cut-off to retrieve some fixed number of documents. The efficiency of the bit strings in eliminating documents from the term match is described in the results below by the screenout, which is the percentage of the file eliminated by the bit string search. In exact match, or partial match, searching, the full document record would need to be retrieved and inspected only if there was an exact, or an inclusive, match between the document and query bit strings. In the case of best match searching, however, all documents must be inspected which have at least one bit set corresponding to a query bit since this implies the possibility of at least one term matching; accordingly, the screenout which may be expected is very much lower than in the case of exact or partial match retrieval.
However, a simple trade-off may be effected between search efficiency and search effectiveness by specifying a minimal number of bit correspondences for the term match to take place: in the experiments reported below, a threshold, t, of 1, 2, or 3 bits was used. When t = 1, the effectiveness will be exactly the same as that of the normal term match but, as the threshold is raised, the effectiveness of recall-oriented searches in particular may suffer owing to the need to include in the output documents having very few terms in common with the query. In an operational implementation, the thresholds used would depend upon the length of a particular query. For example, a natural language need statement might well yield twenty or more search terms and a high threshold would be required to limit the term matching to an acceptable amount. For the Vaswani collection, however, the queries are fairly short, having a mean of 6.6 terms per query, and the three chosen thresholds are accordingly low. It should be noted that although the use of t > 1 will increase the screenout, and hence decrease the amount of term matching and the overall elapsed time for the search, it is also likely to increase the time required for the first level, bit string search. This increase arises from the need to shift the computer words arising from ANDing the document and query bit strings through a register to determine the exact number of matching bits: in the case of t = 1, a machine branch may be executed as soon as a matching computer word is identified. Experiments were also carried out in which the bit strings alone were used as the basis for the calculation of the matching coefficient between a document and the query. This one-level search is a much more stringent test of the ability of the bit strings to characterise the content of the documents and queries, and is also still more efficient in operation since no term matching is required at all.
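The two-level and one-level searches described above can be sketched as follows. The sketch assumes that bit strings are held as Python integers, that documents are (identifier, bit string, term list) tuples, and that coordination level is the matching function; all names are illustrative, and the word-by-word early branch used on the original machine for t = 1 is not modelled.

```python
def bit_count(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def two_level_search(query_bits, query_terms, documents, t=1, cutoff=15):
    """Bit string screen, then term match, ranking and cut-off.

    Returns (ranked document ids, screenout as a percentage of the file).
    """
    query = set(query_terms)
    ranked, screened = [], 0
    for doc_id, doc_bits, doc_terms in documents:
        # Level 1: AND the bit strings and count the matching bits.
        if bit_count(doc_bits & query_bits) < t:
            screened += 1
            continue
        # Level 2: term match, here scored by coordination level.
        score = len(query & set(doc_terms))
        if score > 0:
            ranked.append((score, doc_id))
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    screenout = 100.0 * screened / len(documents) if documents else 0.0
    return [doc_id for _, doc_id in ranked[:cutoff]], screenout


def one_level_search(query_bits, documents, cutoff=15):
    """Rank on the bit strings alone: score is the number of common bits."""
    scored = [(bit_count(bits & query_bits), doc_id) for doc_id, bits, _ in documents]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:cutoff]]
```

Raising `t` trades effectiveness for efficiency: more documents are screened out before the term match, but documents sharing fewer than `t` bits with the query can no longer be retrieved.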
The effectiveness of these searches was characterised by the effectiveness function, E[24], which is defined as

\[ E = 1 - \frac{(1 + b^{2})PR}{b^{2}P + R}, \]

where b is a user-defined parameter which reflects the importance attached by the user to precision (P) and to recall (R), R and P being calculated on the basis of some fixed number of retrieved documents. The figures reported below refer to the mean E value when averaged over the entire set of 93 queries using cut-offs of 15 and 65 documents, these corresponding to a precision-oriented search for which b was set to 0.5, and to a recall-oriented search for which b was set to 2.0. It should be noted that the lower the E value, the better the retrieval.
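As a worked illustration of the effectiveness function, the sketch below computes E for a single query from a list of retrieved documents and a set of relevant documents. The function name and the convention of returning 1.0 when nothing relevant is retrieved are assumptions made for the example.

```python
def effectiveness(retrieved, relevant, b):
    """E = 1 - (1 + b^2) P R / (b^2 P + R); lower values are better."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    if hits == 0:
        return 1.0  # assumed convention: no relevant document retrieved
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    return 1.0 - (1 + b * b) * precision * recall / (b * b * precision + recall)


if __name__ == "__main__":
    # Four documents retrieved, both relevant documents among them:
    # P = 0.5, R = 1.0.
    for b in (0.5, 2.0):
        print(b, round(effectiveness([1, 2, 3, 4], [2, 4], b), 4))
```

With P = 0.5 and R = 1.0 the function gives E of about 0.444 for b = 0.5 and about 0.167 for b = 2.0, showing how the same output is judged more harshly when precision is weighted more heavily.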
A range of methods has been suggested for determining the correlation between document and query representatives. The experiments here used the following three types of matching function: (i) simple coordination level, i.e. the number of common terms; (ii) Dice coefficient in which the number of common terms is normalised by the sum of the lengths of the document and query term lists; (iii) inverse document frequency weighting in which a match on a term of collection frequency f results in a contribution of log(N/f) to the overall match value[21], N being the number of documents in the collection.

4. RESULTS AND DISCUSSION
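For reference when reading the results, the three matching functions defined at the end of Section 3 can be sketched as follows. The factor of 2 in the Dice coefficient follows the usual definition and does not change the ranking; the use of natural logarithms and the names `collection_freq` and `n_docs` are assumptions made for the example.

```python
import math


def coordination(doc_terms, query_terms):
    """Simple coordination level: the number of common terms."""
    return len(set(doc_terms) & set(query_terms))


def dice(doc_terms, query_terms):
    """Common terms normalised by the summed lengths of the two term lists."""
    d, q = set(doc_terms), set(query_terms)
    total = len(d) + len(q)
    return 2.0 * len(d & q) / total if total else 0.0


def idf_match(doc_terms, query_terms, collection_freq, n_docs):
    """Sum of log(N / f) over the common terms (natural log assumed)."""
    common = set(doc_terms) & set(query_terms)
    return sum(math.log(n_docs / collection_freq[term]) for term in common)


if __name__ == "__main__":
    doc, query = ["a", "b", "c"], ["b", "c", "d"]
    print(coordination(doc, query))
    print(dice(doc, query))
    print(idf_match(doc, query, {"b": 10, "c": 100}, 1000))
```

Coordination level treats all common terms alike, Dice penalises long term lists, and the inverse document frequency weighting gives rarer terms a larger contribution.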
The results for the two-level searches, in which the bit strings are used to limit the amount of term matching carried out, and for the one-level searches, in which the bit strings alone are used for determining the degree of similarity between a document and the query, are given in Tables 1 and 2. The figures correspond to the use of a set of 719 discriminating substrings, of hashing with a comparable divisor, and of the full set of 7119 word stems. Table 1 details the screenout obtained using thresholds of 1, 2, and 3 bits (or words in the case of the term match). It will be seen that both methods of vocabulary reduction result in a high level of screenout from the bit string search, and this is so even when the minimal threshold of t = 1 is used. The screenout for the stems is that obtainable from a best match search in which an inverted file is available. In such a case, for t = 1, the lists from the inverted file corresponding to the query terms may be ORed together to yield a list of those documents which have at least one term in common with the query and thus will have a non-zero value for any of the matching functions above; in the case of t = 2 (or 3), the screenout figures correspond to an inverted file search in which pairs (or triplets) of term lists are ANDed together and the set of resultant lists ORed together. The documents in the resulting list may then be inspected one at a time to identify the best matches[25] in much the same way as the documents passing the bit string comparison are inspected in the work reported here.

The one-level searches were carried out without any threshold so that all documents with at least one bit in common with the query entered into the ranking. The results obtained are shown in Table 2 and it will be seen that the discriminating substrings yield a level of effectiveness little different to that provided by the full stem vocabulary, which is about ten times as large. The performance of the division hashing is noticeably poorer.

Table 1. Screenout using discriminating substrings, division hashing and the full term lists, and with a threshold, t, of 1, 2 or 3 matches

                            t = 1    t = 2    t = 3
Discriminating substring     62.8     88.4     96.8

Table 2. Retrieval effectiveness in a one-level search using discriminating substrings, division hashing, and the full term lists

Methods: discriminating substring, division hashing, term list
Matching functions: coordination, Dice, IDF, each with b = 0.5 and b = 2.0
E values, in source order: 0.81, 0.76, 0.87, 0.84, 0.79, 0.75, 0.85, 0.80, 0.88, 0.85, 0.84, 0.79, 0.80, 0.74, 0.88, 0.83, 0.80, 0.74

A series of experiments was carried out using bit strings based on division hashing to determine the effect of variations in the length of the bit string on the efficiency and effectiveness of searching[19]. The results of these experiments, using strings containing between 239 and 959 bits, are shown in Tables 3 and 4. It will be seen that, at first, the screenout rises rapidly with an increase in the length of the bit string but that the rate of increase then slows and the screenout tends to the screenout obtainable
from an inverted file search. From the results in Table 3, it is clear that increases in the length of the bit string above some number of bits will result in only a marginal increase in screenout, a result which is in line with work reported elsewhere in the context of substructural searches of machine-readable chemical structure files[26].
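The inverted file screenout, which the bit string screenout approaches as the strings lengthen, can be sketched as follows. The sketch assumes the inverted file is a dictionary from term to a list of document numbers; it ANDs every combination of t term lists and ORs the results, as described in the discussion of Table 1. The function name and the toy data are illustrative.

```python
from itertools import combinations


def inverted_file_screenout(query_terms, inverted, n_docs, t=1):
    """Percentage of the file sharing fewer than t terms with the query."""
    lists = [set(inverted.get(term, ())) for term in set(query_terms)]
    candidates = set()
    # AND each combination of t term lists, then OR the resulting lists.
    for combo in combinations(lists, t):
        candidates |= set.intersection(*combo)
    return 100.0 * (n_docs - len(candidates)) / n_docs


if __name__ == "__main__":
    inverted = {"a": [1, 2], "b": [2, 3], "c": [4]}  # toy inverted file
    for t in (1, 2, 3):
        print(t, inverted_file_screenout(["a", "b", "c"], inverted, 10, t))
```

For t = 1 the combinations are the single term lists, so the loop reduces to ORing the lists together; no bit string of any length can screen out more documents than this without losing documents that share terms with the query.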
The effectiveness figures in Table 4 are based on one-level searches using coordination level matching and exhibit a variation comparable to the screenout results in Table 3. As the length of the bit string is increased, the effectiveness rises, i.e. E decreases, slowly. This is again in agreement with our earlier experiments using the Cranfield test collection[17, 18] and the work reported by CRAWFORD[27] where the size of the vocabulary was varied by the elimination of terms with low discriminatory power. In the discussion so far, it has been assumed that the reduced vocabularies will be used as the basis for a comparison of document and query bit strings using conventional computer
hardware. However, developments in computer design suggest an attractive alternative means of using reduced vocabularies. Associative parallel processors (APP)[28] search data by content rather than by address, as in a conventional index-based retrieval system. Content-based access is achieved by means of a storage device in which corresponding positions in each memory location can be interrogated in parallel, and a correct match causes the address of the corresponding location to be indicated to the hardware. This provides for a very fast search, although costs and fabrication methods restrict the sizes of such processors to some thousands of locations, so that the full advantages may not be achieved unless very fast means are provided for refilling the device from large scale backing storage media. A typical design involves sequential operations on bit slices and hence the processor has a width which is the number of fields in each record, and a depth which is the number of documents which can be processed at one time. It is clear that the use of a restricted vocabulary whereby the content of a document or query is represented in a bit string of fixed and limited length provides one means by which the economic viability of such devices could be enhanced substantially. LEE and SCHUEGRAF[29] have described a device involving an APP placed between a disc drive and the central processor which permits "on-the-fly" retrieval at search rates of some megabytes per second from a document file based on bit strings similar to those described here (although in their design, the bit strings are suggested as a means for the generation of a hash code rather than as document characterisations in their own right). These developments raise the possibility of considering fast mechanisms for serial search. A processing rate of 1 Mbyte, i.e. 8 Mbit, per sec using a reduced vocabulary containing 1000 terms would correspond to a search rate of some 8000 documents per second.
Bit strings that matched the query would trigger an access to backing storage for the retrieval of the full document record for term matching and display to a user. It is likely that the first such document would be displayed in a period comparable with the delay expected in current online retrieval systems and, subsequent to this first retrieval, thinking time would probably far outweigh the search time of the processor. Thus although the total elapsed time for the search might be greater than in current interactive retrieval systems, the delay is unlikely to seriously incommode a user. It may also be noted that, in the case of a single user, accesses to backing storage would be very much more under control in such a system since the accesses would take place in the same sequence as the file, thus reducing in large part the seek time associated with moving from track to track; similar comments would apply in a multi-user environment if searching was carried out by continuous scanning of the bit strings, rather than by initiating the scan each time a query was presented.

5. CONCLUSIONS
A two-level search procedure has been described for the searching of files of documents. The first level consists of a rapid comparison of fixed length bit strings which are generated from the query and document term lists. Only those documents having some minimal number of bit matches with the query undergo the second-level term list comparison. Retrieval experiments with the Vaswani test collection show that the procedure can considerably increase the efficiency of serial searching while maintaining the effectiveness of a full search. Experiments with a one-level search, in which the bit strings alone are used as the basis for the calculation of a matching function between documents and queries, show that a novel approach to the identification of discriminating character strings provides an extremely effective means of deriving redundant representations of document content. There is a distinct trade-off between the length of the bit string used and the efficiency and effectiveness of retrieval. Although performance increases with increasing bit string length, improvements above a certain point are likely to be gained only at the expense of a large increase in length. The exact point at which the marginal increase in performance is outweighed by increased storage and search times will depend, inter alia, upon the type of reduced vocabulary employed, the size of the file which is to be searched, and computer hardware constraints. The document and query bit strings may be implemented most efficiently in novel computer architectures based on associative processor devices.

Acknowledgements-Our thanks are due to Dr. P. K. T. Vaswani for the document collection, and to British Library Research and Development Department for funding during the early part of this work.
REFERENCES
[1] L. A. HOLLAAR, Unconventional computer architectures for information retrieval. Ann. Rev. Inform. Sci. Technol. 1979, 14, 129.
[2] L. A. HOLLAAR, Text retrieval computers. Computer 1979, 12, 40.
[3] T. HICKEY, Searching linear files on-line. On-line Review 1977, 1, 53.
[4] R. G. DUNN, W. FISANICK and A. ZAMORA, A chemical substructure search system based on Chemical Abstracts Index Nomenclature. J. Chem. Inform. Comput. Sci. 1977, 17, 212.
[5] M. C. HARRISON, Implementation of the substring test by hashing. Commun. ACM 1971, 14, 777.
[6] A. L. THARP and K. C. TAI, The practicality of text signatures for accelerating string searching. Software Practice and Experience 1982, 12, 35.
[7] N. A. FARMER and M. P. O'HARA, A new source of substance information from Chemical Abstracts Service. Database 1980, 3, 10.
[8] A. FELDMAN and L. HODES, An efficient design for chemical structure searching. I. The screens. J. Chem. Inform. Comput. Sci. 1975, 15, 147.
[9] I. J. BARTON, S. E. CREASEY, M. F. LYNCH and M. J. SNELL, An information-theoretic approach to text searching in direct access systems. Commun. ACM 1974, 17, 345.
[10] C. E. GOBLE, A free-text retrieval system using hash codes. Comput. J. 1975, 18, 18.
[11] E. J. SCHUEGRAF and H. S. HEAPS, Query processing in a retrospective document retrieval system that uses word fragments as language elements. Inform. Proc. Management 1976, 12, 283.
[12] P. W. WILLIAMS and M. T. KHALLAGHI, Document retrieval using a substring index. Comput. J. 1977, 20, 257.
[13] T. DE HEER, Quasi comprehension of natural language simulated by means of information traces. Inform. Proc. Management 1979, 15, 89.
[14] C. S. ROBERTS, Partial-match retrieval via the method of superimposed codes. Proc. IEEE 1979, 67, 1624.
[15] V. S. ALAGAR, Algorithms for processing partial match queries using word fragments. Inform. Systems 1980, 5, 323.
[16] D. KROPP and G. WALCH, A graph structured text field index based on word fragments. Inform. Proc. Management 1981, 17, 363.
[17] J. E. BURNETT, D. COOPER, M. F. LYNCH, P. WILLETT and M. WYCHERLEY, Document retrieval experiments using indexing vocabularies of varying size. I. Variety generation symbols assigned to the fronts of index terms. J. Documentation 1979, 35, 197.
[18] P. WILLETT, Document retrieval experiments using indexing vocabularies of varying size. II. Hashing, truncation, digram and trigram encoding of index terms. J. Documentation 1979, 35, 296.
[19] P. WILLETT, The effect of attribute variety on retrieval performance. J. Informatics 1979, 3, 3.
[20] D. S. COLOMBO and J. E. RUSH, Use of word fragments in computer-based retrieval systems. J. Chemical Documentation 1969, 9, 47.
[21] K. SPARCK JONES, A statistical interpretation of term specificity and its application in retrieval. J. Documentation 1972, 28, 11.
[22] M. LENNON, D. S. PEIRCE, B. D. TARRY and P. WILLETT, A comparison of some conflation algorithms for information retrieval. J. Inform. Sci. 1981, 3, 177.
[23] J. K. MULLIN, Reliable indexing using unreliable recognition devices. IEEE Trans. Pattern Analysis Machine Intelligence 1981, PAMI-3, 347.
[24] C. J. VAN RIJSBERGEN, Information Retrieval. Butterworth, London (1979).
[25] A. F. SMEATON and C. J. VAN RIJSBERGEN, The nearest neighbour problem in information retrieval. An algorithm using upperbounds. ACM SIGIR Forum 1981, 16, 83.
[26] P. WILLETT, The effect of screen set size on retrieval from chemical substructure search systems. J. Chem. Inform. Comput. Sci. 1979, 19, 253.
[27] R. W. CRAWFORD, Negative Dictionary Construction. Scientific Report ISR-22, Cornell University (1974).
[28] K. J. THURBER and L. D. WALD, Associative and parallel processors. Computing Surveys 1975, 7, 215.
[29] R. M. LEE and E. J. SCHUEGRAF, An associative file store using fragments for run-time indexing and compression. Information Retrieval Research (Edited by R. N. ODDY, S. E. ROBERTSON, C. J. VAN RIJSBERGEN and P. W. WILLIAMS), pp. 280-295. Butterworth, London (1981).