Algorithms for processing partial match queries using word fragments

Information Systems Vol. 5, pp. 313-332. Pergamon Press Ltd., 1980. Printed in Great Britain

ALGORITHMS FOR PROCESSING PARTIAL MATCH QUERIES USING WORD FRAGMENTS†

VANGALUR S. ALAGAR
Department of Computer Science, Concordia University, Montreal, Canada H3G 1M8

(Received 12 September 1979; in revised form 17 April 1980)

Abstract: Algorithms are given to process partially specified queries in a compressed database system. The proposed methods effectively handle queries that use either whole words or word fragments as language elements. The methods are compared and critically evaluated in terms of design and retrieval costs. The analyses show that the method which exploits the interdependence of fragments, as well as the relevance of fragments to records in the file, has the maximum design cost and the least retrieval cost.

1. INTRODUCTION

Nearly all large commercial organizations and institutions use computerized information systems for their day-to-day services. Examples of such systems include directories in a telephone company, personnel files in a large corporation, scientific and technical information in libraries and part lists in a machine manufacturing company. Although the type of data and the scope of their use vary in these examples, all of them process a large amount of information, and hence computers are necessary for their storage and for retrieval on specific requests. There are three important requirements for the well functioning of an information retrieval system: (1) economical storage of the database that stores all the information relevant to the goal of the system, (2) fast retrieval of information pertinent to the queries of the users, (3) retrieving all and only those records that are relevant to a request. Several researchers[1-3] have given methods to compress a database and thus achieve economy of storage. However, very little work has been done to investigate the adaptability of a compressed database in the contexts of problems 2 and 3. In a recent paper Schuegraf and Heaps[4] discuss query processing in a compressed database, and we extend their work in several directions. Taking the compression method reported by Schuegraf and Heaps[4] as a test bed, we explore several algorithms for query processing and evaluate their cost and efficiency. In general, the cost and efficiency of a search algorithm depend on the structure of the file (database), the storage medium, the complexity of the query set and the searchable elements chosen in the design of the system. Without reasonable restrictions and assumptions on these, one cannot meaningfully evaluate the performance of query processing in a database system.

In this paper we discuss algorithms for performing partial match searches on a compressed file stored on a direct access device when fragments chosen for compression[4] are used as searchable elements. First we present basic formalisms related to the model of a file, query type, mode of transaction and cost model. Section 2 briefly reviews partial match search methods and the compression algorithm proposed in [4]. Section 3 discusses the framework for the proposed algorithms. Section 4 critically examines three algorithms of varying complexity for performing partial match retrieval on a compressed database and compares them in terms of design and retrieval costs.

We conceive of a database as a collection of information called a file. An individual unit of a file is a record, and there is a logical separator between records. Each record is formatted so that all records have the same number (k) of fields. If the ith field takes values from a domain V_i, a file F is simply a subset of V_1 x ... x V_k. A compression is a mapping that assigns to each distinct value a belonging to any V_i, i = 1, ..., k, a code u, which is usually a string of finite length over the alphabet {0, 1}. Thus any record (a_1, ..., a_k) with a_i in V_i is uniquely coded as (u_1, ..., u_k), where u_i is the image of a_i under the mapping. An important requirement of a compression method is that the mapping used has a unique inverse. The set F_c = {(u_1, ..., u_k) | (a_1, ..., a_k) in F} is called the compressed file of F. A key is an element of F, and a fully specified key is an element of V_1 x ... x V_k.

A user of an information retrieval system is one who has one or more information requests, called queries, to pose to the system. The most common query type is the intersection query, for which a record is to be retrieved only if it is in a specified subset (possibly empty) of the file. The class of boolean queries defined by specifying (attribute, value) pairs connected by logical operators is identical to the set of intersection queries. In this paper we restrict our discussion to a subfamily of intersection queries, called partial match queries. The class of partial match queries includes the subclasses of exact match queries and single key queries. A partial match query with s keys specified, s <= k, is one in which s fields are specified by the user; the other k - s positions are not specified (indicated by *), either because the

†This work is supported by the National Research Council of Canada grant No. A3552.
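The partial match semantics just described can be sketched as follows; the record data and field layout are illustrative, not taken from the paper.

```python
# Partial match query semantics: a query is a k-tuple in which None plays
# the role of '*' (an unspecified field). Records and fields are made up.

def matches(record, query):
    """True if the record agrees with the query on every specified field."""
    return all(q is None or q == r for r, q in zip(record, query))

file_F = [
    ("smith", "montreal", "1979"),
    ("jones", "toronto", "1980"),
    ("smith", "toronto", "1980"),
]

# The query (smith, *, *) specifies s = 1 of k = 3 fields.
q = ("smith", None, None)
relevant = [r for r in file_F if matches(r, q)]
print(relevant)   # both records whose first field is "smith"
```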


user does not know the values to be specified or does not care to specify them. We shall also consider queries which specify truncated values from the various fields; these are in fact special cases of partially specified queries. More formally, a partially specified query is an element of (V_1 ∪ {*}) x ... x (V_k ∪ {*}), where * indicates the unspecification of a component. For example, if k = 3, (v_1, *, v_3) and (v_1, *, *) with v_i in V_i are partially specified queries. Any record in F which matches the specified values is relevant to the given query. Our decision to restrict the query type is based on two reasons. The first is a comment made by Rivest[5]: the simpler intersection queries are easy to formulate and algorithms are known for their adequate handling, whereas the most general intersection query (thus a complex boolean query) can be shown to require searching the entire file almost all the time; in fact, it would take longer to specify such a query than to read the entire file. The second is that a compressed database can effectively be used to process not only partial match queries but also queries in which one or more specified values are themselves partially specified, i.e. truncated. In general the system is supposed to present a set of records in response to a query. If several queries are presented to the system collectively, we say that the system is operated in batch mode. If queries are presented one at a time, we say that the retrieval is done on-line. In this paper it is assumed that on-line retrieval is the natural choice of the user. The cost or measure of search for a single query is the on-line measure, i.e. the time taken to answer a single query. Typically this depends on the access time, which in turn depends on the structural properties of the storage medium.
If the file is stored on a direct access storage device such as a magnetic disk, the access time is the sum of three quantities x_1, x_2 and x_3, where x_1 is the seek time, the time required to position a reading head at the correct cylinder; x_2 is the latency time, the time taken to position the right sector of the track under the reading head; and x_3 is the transmission time required to transfer one block of data into core. Note that the tracks of a disk can be considered a linear storage medium, and as such the access time between any two storage locations is an increasing function of the directed (with reference to the rotation of the disk) distance between them. Both Rivest[5] and Burkhard[6] have considered the number of accesses to secondary storage as a standard measure to evaluate their algorithms for partial match retrieval of binary attributed records. This can be justified, since the quantity x_1 + x_2 dominates x_3 if one assumes that the amount of information transferred to core per access is a constant and no assumptions are made on the relative physical placement of records. However, if the relative physical locations of records are such that x_1 + x_2 is a constant for each and every most commonly occurring query, then the cost x_3 is the appropriate measure. This paper will consider methods that are consistent with both measures.
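As a rough illustration of the two cost measures, the toy model below uses assumed timing constants; the only point being made is the split between an access-count-dominated cost and a transmission-dominated cost.

```python
# Toy model of disk access cost (x1 seek, x2 latency, x3 transfer per block).
# The timing constants are assumptions chosen only for illustration.
X1_SEEK, X2_LATENCY, X3_TRANSFER = 30.0, 8.3, 1.0   # milliseconds

def random_access_cost(accesses, blocks_per_access=1):
    # x1 + x2 is paid on every access: cost is proportional to the access count.
    return accesses * (X1_SEEK + X2_LATENCY + blocks_per_access * X3_TRANSFER)

def consecutive_cost(blocks):
    # x1 + x2 is paid once; x3 dominates when pertinent blocks are consecutive.
    return X1_SEEK + X2_LATENCY + blocks * X3_TRANSFER

print(random_access_cost(10))   # ten scattered blocks
print(consecutive_cost(10))     # the same ten blocks placed consecutively
```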

2. COMPRESSION AND PARTIAL MATCH SEARCH METHODS - REVIEW

A number of researchers have studied the problem of partial match retrieval, and Knuth[7] reviews different techniques. It is to be noted that inverted list techniques are adequate for single-key retrieval in a multi-attributed file, but are not suitable for partial match retrieval. This is due to the fact that a partially specified query, say with s specified attributes, retrieves a subset of the file which is the intersection of s lists in the inverted file system. The amount of work that has to be done (to find the intersection) increases as s increases, whereas the expected number of records in response to such a query decreases. Wong and Chiang[8] have proposed the use of a large number of short lists, so that the response to a query is the union of lists instead of an intersection. Their file design is thus different from inverted lists. The drawback of this method is that the number of short lists will be very large. One way to counteract this effect is to restrict, and thus set a bound on, s. Exploiting theories of finite geometries and setting a bound on s, Abraham, Ghosh and Ray-Chaudhuri[9] have proposed a file design which permits redundant records to be stored. The theory behind this algebraic filing scheme is so complex that the time needed to set up the file design is too high. The first efficient solution to the problem of partial match retrieval is due to Gustafson[10]. His model of a file is the one proposed by Hsiao and Harary[11]. In the method of Gustafson, if there are at most j attributes per record, then the number of lists examined to answer a query is C(w - s, j - s), where w is the total number of lists permitted in the system. In this method there are two important improvements over the previously mentioned methods: (1) each record is stored only once, and (2) the expected amount of work decreases as s increases.
The one disadvantage is that records not relevant to a query might be retrieved, although the set of retrieved records will always include all desirable records. Rivest[5] has studied a family of partial match retrieval file designs and discussed optimality questions. Burkhard[6] has introduced a special kind of block design as a file design which has the same behaviour as the methods outlined by Rivest, but is somewhat superior in the sense that when Rivest's method fails to obtain a file design his method succeeds. Bentley and Burkhard[12] have given heuristics for the construction of tries for storage and retrieval of records on partially specified queries. All this work assumes a uniform distribution of queries. Two recent papers[13, 14] have assumed prescribed probabilities with which the values of the various fields in queries are specified and have given methods to minimize the overall retrieval time. Heuristic methods in [13] construct near-optimal tries; although no analytical results are given, the empirical results suggest optimality. In [14] a partitioned hashing technique that extracts a sufficient number of bits from each field and concatenates these bits to obtain a bucket address is given and shown to be optimal. However, no criterion is given for the actual selection of bits.


The algorithms that we discuss in Section 4 use the fragments selected for compression as search elements. We justify in Section 3 the assumption that the selected fragments would either include, or closely match, the terms specified in most user queries. We now review the essential details of compression coding as advocated in [1, 4]. There are two phases in the process of compressing and coding a database. The first phase is concerned with the selection of equiprobable word fragments. The selection algorithm[1] operates on the following guidelines: (1) An input parameter t (a positive integer), called the threshold, is given. (2) The frequency of occurrence of each fragment must be t or close to t. (In fact the algorithm selects fragments that are only nearly equiprobable.) (3) The length of each fragment must be as large as possible. (4) Single characters are kept in addition to fragments. This decision has two effects: the equiprobable property is destroyed when a considerable number of single characters are forced to be selected, and each word in the database is ensured a representation as a concatenation of selected fragments. (5) The set of fragments is a minimal set. The second phase is coding the entire database. Assume that D = (d_1, ..., d_p) is the set of fragments selected by the algorithm[1] and stored in the dictionary D. Each fragment is assigned a unique code of length ⌈log_2 p⌉ bits, where ⌈x⌉ is the least integer greater than or equal to x. Since no selected fragment can occur as a subfragment of another selected fragment, each word in the database can be coded as a unique concatenation of the codes corresponding to the fragments that uniquely (and minimally) make up the word. Through dictionary lookup this process is reversed when decoding the database.
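The coding phase can be illustrated with a small hypothetical dictionary. The greedy longest-match fragmentation below is an assumption made for the sketch; the paper only requires that each word decompose uniquely into selected fragments.

```python
import math

# Illustrative fragment dictionary (hypothetical); the selection algorithm
# of the paper would produce near-equifrequent fragments plus single chars.
D = ["th", "re", "tion", "a", "c", "e", "i", "n", "o", "r", "s", "t"]
CODE_LEN = math.ceil(math.log2(len(D)))          # ceil(log2 p) bits per code
code = {frag: format(idx, f"0{CODE_LEN}b") for idx, frag in enumerate(D)}

def encode(word):
    """Greedy longest-match fragmentation (an assumption for this sketch)."""
    out, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):        # try the longest fragment first
            if word[i:j] in code:
                out.append(code[word[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no fragment covers {word[i:]!r}")
    return "".join(out)

def decode(bits):
    # Reverse the process through dictionary lookup on fixed-length codes.
    inv = {v: k for k, v in code.items()}
    return "".join(inv[bits[i:i + CODE_LEN]] for i in range(0, len(bits), CODE_LEN))

print(decode(encode("creation")))   # round-trips through c + re + a + tion
```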
The primary advantage of using equiprobable fragments is that each fragment carries the same amount of information, and hence the selected set of fragments gives an excellent overview of the data. In practice, however, there can be no perfectly equifrequent fragment set.

3. FRAMEWORK FOR QUERY PROCESSING

In general the terms specified in a user query need not be fragments, and all queries need not be equally likely. To aid the user in the formulation of queries, the dictionary of fragments can be made available; this enables the user to specify the fragments that best match the terms in their queries. Alternatively, the system can extract fragments consistent with the probability distribution of queries. One may assume that the probability of specification of a particular field in a query depends only on that field and is independent of which other fields are specified in the query. Let p_1, ..., p_k be the probabilities, where p_i is the probability that the ith field is specified. There is no loss of generality in assuming p_1 >= p_2 >= ... >= p_k. Intuitively it is clear that the system must generate more fragments from the


ith field than the jth field if p_i > p_j. Recall from Section 2 the conditions satisfied by the fragments extracted by the algorithm[1]. Those imply that for every distinct value in V_i there is at least one fragment; conversely, given a value v in V_i, there is at least one fragment in D that matches v. Let Pr[a_i = v] denote the probability that the ith field specified in a query will have the value v in V_i. Then p_i = Σ Pr[a_i = v], where the summation extends over all values in V_i. Since a query that specifies the ith field is equally likely to specify any of the |V_i| values, we have Pr[a_i = v] = p_i/|V_i|, and this is the minimum probability that the ith field specified in a query will match a fragment, since every value v has at least one fragment extracted from it. If t is the threshold and n_i = |V_i|/(t p_i) fragments are selected from the ith field so that these occur approximately t or more times, then 1/(n_i t) = Pr[a_i = v], i.e. Pr[a_i = v] has a uniform measure among the fragments selected in V_i. Moreover, t remains the same for all fields, and thus the selected fragments can be used for compression as well as for query processing. To process a query it is essential to have a preprocessor that transforms a query into a chain of fragments. It is also essential to maintain the lists of records associated with fragments. If R(d_i) = {R_j | d_i is extracted from R_j}, then the collection R(D) = {R(d_i) | d_i in D} is known as the inverted lists. Thus in order to process a query in a compressed database it seems necessary to maintain (1) a dictionary D, (2) inverted lists R(D) and (3) the compressed file. The general query processing algorithm is: (1) Using D, transform the query words into a chain of fragments. (2) Retrieve the records pertinent to the fragments from the compressed file. (3) Using D, decode each record and output it. Steps (1) and (3) are common to all the algorithms that we discuss in Section 4.
Moreover, we assume D is small and can be kept in core, so that the total cost of steps 1 and 3 is negligible compared to the cost of step 2. As such, we distinguish our algorithms according to the "on-line" measure of performing step 2.

4. ALGORITHMS

We discuss three algorithms for processing partially specified queries. It is assumed that each query has been transformed into a chain of fragments. The algorithms are given in order of decreasing cost. We assume a file of N records R_1, ..., R_N.

4.1 Schuegraf and Heaps[4] have proposed a method that uses all the inverted lists in query processing. Thus it is necessary to maintain (1) a dictionary D, (2) the inverted lists R(D) and (3) the compressed file. We review this method and analyze its cost. ALGORITHM INV: (1) For each d_i in the transformed query, read in R(d_i).

(2) Obtain L, the intersection of the lists R(d_i). (3) For each record number in L, read and decode the actual record.
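A minimal in-memory sketch of ALGORITHM INV; the data, the fragment names and the stand-in decoding are all illustrative.

```python
# ALGORITHM INV sketched with in-memory stand-ins (illustrative data).
R = {                       # inverted lists R(d_i): fragment -> record numbers
    "mon": {1, 3},
    "tre": {1, 2, 3},
    "al": {1, 3, 4},
}
compressed_file = {1: "mon|tre|al", 2: "tre", 3: "mon|tre|al", 4: "al"}

def decode(coded):
    # Stand-in for dictionary-lookup decoding of a compressed record.
    return coded.replace("|", "")

def algorithm_inv(query_fragments):
    # Step 1: read in R(d_i) for each fragment of the transformed query.
    lists = [R[d] for d in query_fragments]
    # Step 2: L, the intersection of the lists (built in main memory).
    L = set.intersection(*lists)
    # Step 3: read and decode each pertinent record.
    return {n: decode(compressed_file[n]) for n in sorted(L)}

print(algorithm_inv(["mon", "al"]))   # records 1 and 3
```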


Analysis. If s fragments appear in the transformed query, then s accesses are required to get the lists of record numbers pertinent to the fragments. Assuming that the buffer can hold |R(d_i)| numbers (>= t), the cost of processing step 1 is (λ + μt)s, where λ and μ are constants. The cost of step 2 is negligible since L can be constructed in main memory. Let |L| denote the length of the list. All the records corresponding to these numbers must be brought from disk to main memory. Any further processing done in memory does not depend on the chosen set of fragments in D. We ignore the cost of processing in memory but account for the cost of accessing the disk to read all records into memory. If the number of records per block on disk is the same as the size of the buffer and B denotes this number, then the number of requests to secondary storage is |L|/B. We estimate |L| in a probabilistic sense that is consistent (approximately) with the equifrequent property of the selected fragments. Assume that the average number of distinct fragments per field is p/k. Then the probability that a fragment in a given query matches a fragment in a field is k/p. Since the fragments specified in a query may occur independently among the various fields, the probability that exactly r out of the s specified fragments in a query match dictionary fragments is C(s, r)(k/p)^r (1 - k/p)^(s-r). This is also the probability that exactly r inverted lists, each containing a proportion t/N of the file, are to be intersected. Since these are independent events, the expected proportion of the file to be examined is given by

Σ_{r=0}^{s} C(s, r) (k/p)^r (1 - k/p)^(s-r) (t/N)^r = (kt/pN + 1 - k/p)^s.

Hence the expected value of |L| is

N(kt/pN + 1 - k/p)^s > N(1 - k/p)^s [1 + skt/N(p - k)].
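The closed form above is the binomial theorem applied with a = k/p and b = t/N, and the lower bound follows from Bernoulli's inequality. A quick numerical check with toy parameters:

```python
from math import comb

# Verify sum_r C(s,r)(k/p)^r (1-k/p)^(s-r) (t/N)^r = (kt/pN + 1 - k/p)^s
# and the stated lower bound, for assumed toy parameters.
k, p, t, N, s = 4, 100, 10, 10_000, 3
a, b = k / p, t / N

lhs = sum(comb(s, r) * a**r * (1 - a)**(s - r) * b**r for r in range(s + 1))
rhs = (a * b + 1 - a)**s
assert abs(lhs - rhs) < 1e-12

# Bernoulli: N(kt/pN + 1 - k/p)^s >= N(1 - k/p)^s (1 + s*k*t/(N*(p - k))).
lower = N * (1 - a)**s * (1 + s * k * t / (N * (p - k)))
assert N * rhs >= lower
print(N * rhs, lower)
```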

Thus the total cost of processing a query with s specified fragments is at least

C_1(t, s) = (λ + μt)s + N(1 - k/p)^s [1 + skt/N(p - k)]/B.   (1)

It is easy to see that C_1(t, 0) = N/B and that C_1(t, s), 0 < s <= k, is maximum for s = k. With a view to decreasing the cost of processing a query, we present the next two methods.

4.2 In this method we partition the collection of inverted lists into two classes. One class contains the inverted lists of certain groups of fragments organized as chains of trees, and the other class contains the inverted lists of some individual fragments. Groups of fragments are formed by imposing a partial order on the set of fragments D. The ordering is defined as follows: a fragment d_i is related to d_j under the partial order < (precede) if d_i precedes d_j in a record. We say d_i (d_j) is a predecessor (successor) of d_j (d_i) if d_i < d_j holds. There is a good possibility for two

fragments to precede each other, but this situation can be avoided by temporarily not considering them; such fragments are nevertheless retained in the dictionary for the purpose of compression. Essentially there are two classes of fragments: E_0 = {d_i | there exists a d_j in E_0 such that d_i < d_j or d_j < d_i but not both} and E_1 = {d_i | for every d_j in E_0 there is no relation between d_i and d_j}. In principle we have N partial order relations, one for each record. For example, suppose the fragment sets (d_1 d_2 d_3), (d_1 d_2 d_4) and (d_1 d_2 d_5 d_6) belong to records 1, 2 and 3; then d_1 precedes d_2 in records 1, 2 and 3, and so on. Since a partial order is a transitive relation, we can always generate a set of chains {c_i} of fragments with the property that if d_i and d_j belong to a chain then d_i < d_j iff i < j. Since chains can be merged to obtain the longest possible chains, assume that {c_i} are maximal chains. For the previous example we have three maximal chains: c_1 = d_1 d_2 d_3; c_2 = d_1 d_2 d_4; c_3 = d_1 d_2 d_5 d_6. For any chain c_i = {d_{i_1}, ..., d_{i_r}} we define (1) d_{i_1} as the first element or unique predecessor; note that there is no predecessor of d_{i_1} in c_i, although a predecessor of d_{i_1} might occur in some other chain; (2) the length of the chain, denoted |c_i|, as r, i.e. the number of fragments in the chain; (3) c_j as a partial chain of c_i if either (a) the first element of c_j is the first element of c_i and for every other element of c_j the predecessor is in c_j, or (b) for every element of c_j, including its first element, the predecessor belongs to c_i, with |c_j| <= |c_i|; (4) the inverted list R(c_i) as the intersection of the inverted lists of the fragments in c_i.

The set E_1 is a special case of E_0, since each chain of E_1 is of length 1. We remark here that in general the size of the inverted list for a chain is smaller than the size of the inverted list corresponding to a fragment, and hence if the chains are generated, maintained efficiently and matched against partial chains in a given query, the number of disk accesses will be reduced. We shall explain below how this may be done. Obviously there is one chain per record. However, it is highly likely that a longer chain obtained from one record will have partial chains each of which completely matches chains from other records. More precisely, we may define the dependency between chains as follows: two chains c_i and c_j are r-dependent if r = min(|c_i|, |c_j|), the first elements of c_i and c_j are the same and they have a common partial chain of length r. The chains {c_i} generated from the fragments can be examined for r-dependencies, r >= 1, and dependent chains can be grouped together. One such group of chains having the same first element will be called a dependence tree of chains (DTC). As an example consider the chains (d_1 d_2 d_3), (d_1 d_3 d_5 d_6), (d_1 d_3 d_4 d_8), (d_2 d_4 d_6), (d_2 d_4 d_8), (d_2 d_4 d_7 d_9). The dependence trees are as shown in Diagram 1.
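The chain-per-record observation can be sketched as follows. The records and fragment names are hypothetical, and grouping chains by a shared first element is a simplification of full r-dependence.

```python
from collections import defaultdict

# Each record yields one maximal chain of fragments under the precedence
# order; chains sharing a first element are grouped into one dependence
# tree of chains (DTC). The data are hypothetical.
records = {
    1: ["d1", "d2", "d3"],
    2: ["d1", "d2", "d4"],
    3: ["d1", "d2", "d5", "d6"],
}

def group_chains(chains):
    """Group chains by their first element (unique predecessor)."""
    trees = defaultdict(list)
    for rec, chain in chains.items():
        trees[chain[0]].append((rec, tuple(chain)))
    return dict(trees)

trees = group_chains(records)
print(trees)   # a single dependence tree rooted at d1
```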

The existence of dependence trees demonstrates the non-independence of the fragments extracted from the file, even though the fragments are equiprobable. The term "dependence tree" has been used by van Rijsbergen[15] in the context of describing the non-independence of index terms used in a document retrieval system. Observe that a path from the root to a leaf node of a DTC is a chain. Since disjoint chains are recoverable from a DTC, it is enough to associate with each DTC an inverted list of record numbers, defined to be the disjoint union of the inverted lists of the chains in the DTC. Regarding dependence trees we have the following remarks: (1) If DTC_1 and DTC_2 are two dependence trees, a maximal chain in DTC_1 (DTC_2) cannot be a partial chain, or a chain, in DTC_2 (DTC_1). In view of this, we assert that each dependence tree is maximal. (2) If the set of predecessors of a fragment d_k is (d_1, ..., d_r), then the dependence trees with roots belonging to this set will contain all chains having d_k. (3) The actual fragments need not be stored in the dependence trees; it is sufficient to keep pointers to the actual locations of the fragments in D. Ideally, the partial order relations being known during the extraction of fragments, the sets E_0, E_1 and all the dependence trees should be built concurrently. However, for the sake of clarity and simplicity, let us assume that the inverted lists and partial orders are computed first and that this information is available in a column-ordered record-fragment incidence matrix.

Example 1. Let F = R_1, ..., R_8 be a file and D = (d_1, ..., d_4) the fragments extracted from F. Let the record-fragment matrix M be

         d_1  d_2  d_3  d_4
    R_7   1    0    0    1
    R_1   1    0    1    0
    R_2   1    1    0    1
    R_3   1    1    1    0
    R_8   0    1    1    0
    R_4   0    1    1    0
    R_5   0    0    1    1
    R_6   0    0    1    1

The columns give the inverted lists corresponding to the fragments, and the precedence relation is implicit along each row. Although the association between a fragment and its field of extraction is lost, we have succeeded in obtaining E_0 and E_1. Thus the incidence matrix is more informative than pure inverted lists. Given such an incidence matrix, the dependence trees (for example, the DTCs corresponding to M) can be set up easily.

In general the algorithm given below assumes an incidence matrix M as input and constructs all DTCs.

ALGORITHM DTC: For all practical purposes it can be assumed that the incidence matrix has no zero column and no row having only a single 1. (1) [Initialize] Set DTC ← { }; j ← 0; r_0 ← 0. Here DTC is the collection of all dependence trees, j is the number of fragments processed so far and r_j is the number of records allocated to the DTCs. (2) [Next unique predecessor] j ← j + 1. (3) [Pack records having d_j as unique predecessor] Rearrange rows r_{j-1} + 1, ..., N so as to get a line of 1's in the jth column. Set n_j to the last row that has a 1 in the jth column. (4) [Construct a new tree] Make d_j the root of a new tree DT. For i := r_{j-1} + 1 until n_j do steps (a) and (b). (a) c ← <d_{k_1}, ..., d_{k_r}>, where k_1 < ... < k_r are the column numbers greater than j in which row i has a 1. (b) If c is already a chain in DT, insert i in the inverted list R(c); otherwise insert c in DT and set R(c) ← {i}. (c) Set DTC ← DTC ∪ {DT}. (5) [Any more trees?] If n_j < N and j < p, set r_j ← n_j and return to step (2); otherwise stop.
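A compact sketch of ALGORITHM DTC; the incidence matrix is represented as a record-to-columns map instead of being physically rearranged, and the data are illustrative, not the paper's Example 1.

```python
# Sketch of ALGORITHM DTC: each row is allocated to the tree rooted at its
# first fragment (unique predecessor), and the rest of the row is its chain.
matrix = {              # record -> fragment columns holding a 1, in order
    1: [1, 2, 3],
    2: [1, 2, 4],
    3: [1, 2, 5, 6],
    4: [2, 4, 6],
}

def algorithm_dtc(matrix):
    trees = {}          # root fragment -> {chain: inverted list R(c)}
    for rec, cols in sorted(matrix.items()):
        root, chain = cols[0], tuple(cols[1:])
        dt = trees.setdefault(root, {})
        # Records sharing a chain share its inverted list R(c).
        dt.setdefault(chain, []).append(rec)
    return trees

trees = algorithm_dtc(matrix)
print(sorted(trees))    # roots of the dependence trees
```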


Since m trees are constructed, the cost of step 3 at the ith iteration is (N - i + 1)t and that of step 4 is O[(t - b)(p - i)], where b denotes the amount of overlap between chains. Thus the total cost of constructing all dependence trees is approximately

Σ_{i=1}^{m} [(N - i + 1)t + (t - b)(p - i)] + b(p - 1) = m[(N - (m - 1)/2)t + (t - b)(p - (m + 1)/2)] + b(p - 1).

Substituting for (t - b) from (2), we approximate and simplify this expression to O(mN). The algorithm will need slight modifications when the whole matrix cannot be held in memory. Given the equiprobable nature of the fragments, the value of m will increase with b. Unless the joint distribution of the fragments is known, the quantity b cannot be predicted. However, we can estimate m in a probabilistic sense. Towards this we introduce the concept of absorbing fragments. A fragment d_i in D_p = (d_1, ..., d_p) will be called an absorbing fragment with respect to a set of records if for all k, i < k <= p, d_i precedes d_k in some record of the set. For example, in the record-fragment matrix

         d_1  d_2  d_3
    R_1   1    1    0
    R_2   1    1    0
    R_3   0    1    1

d_2 is the only absorbing fragment. Note that there are two dependence trees, one with root d_1 and the other with root d_2. It is clear from the definition and the example that an absorbing fragment is a candidate for the root of a DTC. However, the root of a DTC need not be an absorbing fragment, since a maximal chain containing the root and a leaf node of a DTC need not contain all the fragments imposed by the above definition. Thus we assert that if there are X_p absorbing fragments among D_p, these must occur among the roots of the dependence trees constructed by algorithm DTC. This implies X_p <= m. The bounds on m are given by the following theorem.

Theorem 1. The lower and upper bounds of the expected value of m are log p and 2√p respectively.

Proof. Since m >= X_p, the lower bound of the expected value of m is the expected value of X_p. We will determine the expected value of X_p assuming the elements (d_1, ..., d_p) are independent and uniform. Although the assumptions of independence and uniformity are not justified, the result gives some insight into the value of m; moreover, in practice the true lower bound will be much higher than what is stated in the theorem. Let

G_p(y) = Σ_n Pr(X_p = n) y^n   (3)

be the generating function of the probability distribution of X_p. If we pass from the state D_p to D_{p+1}, then with probability 1/(p + 1) a new absorbing fragment is added and with probability p/(p + 1) no new absorbing fragment is added. Thus we write

G_{p+1}(y) = G_p(y)(p + y)/(p + 1),

so that

G_p(y) = (1/p!) Π_{s=0}^{p-1} (s + y) = (1/p!) Σ_r |s(p, r)| y^r,

where the s(p, r) are Stirling numbers of the first kind. This gives

Pr(X_p = n) = |s(p, n)|/p!.   (4)

From (4) the expected value of X_p turns out to be Σ_{r=1}^{p} 1/r, which is approximately log p + γ, where γ is Euler's constant. For the upper bound consider all the disjoint chains included in the dependence trees. If W_p denotes this number, we must have m <= W_p. Assuming that the chains have different fragments, the quantity W_p turns out to be the length of a longest increasing subsequence in a permutation of p elements. No exact formula is known for this quantity; however, simulation studies reported in [20] reveal that 2√p is an excellent approximation to the average of W_p. Hence the theorem is proved.

Towards efficient query processing we need the dictionary D in which each d_i, i = 1, ..., m, keeps an index to its DTC and each d_j, j = m + 1, ..., p, keeps an index to its inverted list. In addition, we associate with each d_i, i = 1, ..., m, a bit string of length i - 1 which indicates the dependence trees in which d_i occurs. If i_1, ..., i_s are the s fragments generated from the terms of a user query, the pertinent records may be obtained as described below.

ALGORITHM DTCL: (1) [Initialize] Set X_1 ← { } and X_2 ← { }. (2) [Search D] For j := 1 to s do this step. Search D for i_j. If i_j = d_r, 1 <= r <= m, put r in X_1; otherwise put r in X_2.

Note that both X, and X2 cannot be empty. (3) [X, is not empty] If X, is empty execute step 4. Let r be the minimum of the integers in X,. Let Y{jldi < d,}. This information is available in the bit string associated with d,. If 1 t 1Y( > s -IX,/ execute step 4 else do the following: read in the dependence trees d, and di, jE Y. The each

pertinent of

these

record

numbers

dependent

will

trees

belong that

to chains

contain

all

in the

329

Algorithms for processing partial m atch queries using word fragments

specified fragments. Let L denote this set of record numbers. Execute step 5.
(4) [X1 is empty] For each r in X2 read the inverted list corresponding to d_r. Form L, the intersection of these lists.
(5) [Read pertinent records] For each record number in L, read the record and decode it.

Storage and cost analysis
We need extra storage for the m dependence trees, their inverted lists and the inverted lists corresponding to the p − m remaining fragments. The size of the inverted lists is t(p − m) + N, since the inverted lists of the dependence trees are mutually disjoint. Thus the additional storage demanded by this method over the pure inverted list design is (N − tm + storage for the trees). This additional storage is the trade-off one pays for the gain in processing speed. Since the expected length of L is the same as in Method 1, the cost of steps 3 and 4 must be compared to (A + pt)s. Intuitively it is clear that when the framework for query processing is as enunciated in Section 3, the fields that are most likely and most often to be specified in queries will not only contribute more fragments to D than other fields, but will also more often be selected as the roots of dependence trees. In other words, X1 will be non-empty for most queries. When this happens step 3 or step 4 is executed, whereas when X1 is empty only step 4 is executed. Let L_s and U_s denote the lower and upper bounds of the average retrieval cost A_s. Consider the fragment generation from query terms as random; this is justified because the queries themselves are random. Let α = m/p. The probability that exactly j out of the s specified fragments are roots of dependence trees is

q_j = C(s, j) α^j (1 − α)^(s−j),  0 ≤ j ≤ s.
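The list-intersection path (steps 4 and 5) can be sketched as follows. This is a minimal illustration assuming the inverted lists fit in memory; the fragment names and record numbers are hypothetical:

```python
def retrieve_by_inverted_lists(x2, inverted):
    """Step 4: intersect the inverted lists of the fragments in X2."""
    lists = [set(inverted[d]) for d in x2]
    pertinent = set.intersection(*lists) if lists else set()
    # Step 5 would read and decode the records numbered in `pertinent`.
    return sorted(pertinent)

# Hypothetical inverted lists (record numbers per fragment).
inverted = {"d1": [2, 3, 5], "d2": [1, 2, 4], "d4": [2, 3, 4]}
retrieve_by_inverted_lists(["d2", "d4"], inverted)  # -> [2, 4]
```

On disk each list read costs one or more device accesses, which is why the analysis below counts the number of lists touched rather than the in-memory intersection work.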

Lower bound L_s: With probability q_0, the s fragments belong to the p − m fragments that are not roots of dependence trees, and s inverted lists are read in step 4. For j ≥ 1, the j fragments that match the roots of dependence trees can be such that step 3 is executed just once. Hence we have

L_s = s·q_0 + Σ_{j=1}^{s} (s − j + 1)·q_j,

and L_s < s.

Upper bound U_s: Let X1 consist of the integers i_1, . . . , i_j. Assume that these integers are uniform and independent in the range 1, . . . , m. The expected value of their minimum is close to m/(j + 1). For j ≥ 1, if m/(j + 1) ≤ s − j then step 3 is executed at an expected cost of about m/(j + 1) accesses; if m/(j + 1) > s − j then step 4 is executed at a cost of s − j accesses. The condition m/(j + 1) ≤ s − j is a quadratic inequality in j with roots j1 ≤ j2, and three cases arise.

Case 1: 1 ≤ s < 2√m − 1. Here m/(j + 1) > s − j for every j ≥ 1. Thus we have

U_s = s·q_0 + Σ_{j=1}^{s} (s − j)·q_j.

Since Σ_{j=1}^{s} j·q_j = sα, we have U_s = s(1 − α) < s.

Case 2: 2√m − 1 ≤ s < 1 + m/2. The roots j1 and j2 are real, and m/(j + 1) ≤ s − j holds for j1 ≤ j ≤ j2. Hence we have

U_s = s·q_0 + Σ_{j=1}^{j1−1} (s − j)·q_j + Σ_{j=j1}^{j2} (m/(j + 1))·q_j + Σ_{j=j2+1}^{s} (s − j)·q_j.

Now consider

U_s − s = U_s − s(q_0 + q_1 + . . . + q_s) = − Σ_{j<j1, j>j2} j·q_j + Σ_{j=j1}^{j2} (m/(j + 1) − s)·q_j.   (7)

Every term on the right-hand side of (7) is negative, so U_s < s.

Case 3: 1 + m/2 ≤ s ≤ p. The roots j1 and j2 are real and distinct but j1 is negative, so the condition m/(j + 1) ≤ s − j becomes 1 ≤ j ≤ j2. Hence

U_s = s·q_0 + Σ_{j=1}^{j2} (m/(j + 1))·q_j + Σ_{j=j2+1}^{s} (s − j)·q_j,
and the rest of the analysis is similar to case 2. Since the upper bound remains smaller than s in all cases, we establish A_s < s. An exact expression for A_s is difficult to obtain even when the conditional distributions are known; however, it can be computed. Although our analysis assumed independence, experiments conducted with a small data set of 50 records confirm our analysis. To sum up, the main advantages of this method are: (1) the reduction in the size of the inverted lists; (2) the gain in processing speed; and (3) a cost function that remains directly proportional to the size of the result. Next we shall show how the average retrieval cost per query can be minimized by a linear arrangement of records with redundancy.

4.3
Ghosh [16] first introduced the property of consecutive retrieval for a given query set and file. A set of queries is said to have the CR (consecutive retrieval) property with respect to a record set if the records can be placed in a linear storage medium (such as tracks of a disk) so that the records pertinent to every query occupy consecutive storage locations without redundancy. Easwaran [17] has given necessary and sufficient conditions for a query set to have the CR property with respect to a file. In general, a query set and a file may not have the CR property. In the spirit of the CR property, we introduce the consecutive retrieval on fragments (CROF) property in the following way. Let

R(D) = {S1, . . . , Sp}, where Si = R(di), and F = ∪{Si} = {R1, . . . , RN}.

Thus F is the file consisting of the union of the sets Si, where Si is the non-empty inverted list of the fragment di. Since the records belonging to F − Si are not related to di, they should not be retrieved when di alone is given in a query. We call the elements of F − Si foreign to di. We say the set of fragments D has the CROF property if the following condition holds: there exists a one-to-one function that maps the records in F into the set {1, . . . , N} such that for each di ∈ D, the corresponding Si can be mapped onto an interval Ii which contains the images of all the elements in Si but not the images of those in F − Si. If D has the CROF property then all records pertinent to any partial chain can be obtained in minimum time: it is sufficient to know the first and last records pertinent to a partial chain in order to retrieve the whole set of relevant records. Since the fragments occur equiprobably on partial chains, we can expect to achieve minimum overall retrieval time if the CROF property holds for D.
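Whether a CROF ordering exists can be tested by brute force on a tiny file. The sketch below enumerates record orderings and checks that every inverted list occupies consecutive positions; in practice one would use a consecutive-ones test (such as a PQ-tree algorithm) rather than enumeration, and the data here are illustrative:

```python
from itertools import permutations

def has_crof(n_records, inverted_lists):
    """Brute-force CROF test: does some ordering of records 1..n place
    every inverted list in consecutive storage locations?"""
    for order in permutations(range(1, n_records + 1)):
        pos = {r: i for i, r in enumerate(order)}
        # A list is consecutive iff its positions span exactly len(s) slots.
        if all(max(pos[r] for r in s) - min(pos[r] for r in s) == len(s) - 1
               for s in inverted_lists):
            return True
    return False

# Four illustrative inverted lists over six records; an ordering exists.
has_crof(6, [{2, 3, 5}, {1, 2, 4}, {3, 5, 6}, {2, 3, 4}])  # -> True
```

The brute force is factorial in the number of records, so it is only a specification of the property, not a practical design procedure.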

Example 2
Assume D = {d1, d2, d3, d4} and that there are 6 records in F = {R1, . . . , R6}. Let the inverted lists be given by Si = R(di), where

S1 = {R2, R3, R5}, S2 = {R1, R2, R4}, S3 = {R3, R5, R6}, S4 = {R2, R3, R4}.

Define the mapping g : F → {1, 2, . . . , 6} by

g(R1) = 1, g(R2) = 3, g(R3) = 4, g(R4) = 2, g(R5) = 5, g(R6) = 6.

Now g(S1) = [3, 5], g(S2) = [1, 3], g(S3) = [4, 6] and g(S4) = [2, 4]. It is straightforward to verify that g is a one-to-one function and maps the sets Si so that the CROF property holds for the fragments with respect to the given records. Assume that the records (R1, . . . , R6) are stored in the locations (1, 3, 4, 2, 5, 6). Consider any partial chain composed of (d1, . . . , d4); say c = (* d2 * d4). The records pertinent to this chain are R2 and R4 (the intersection of S2 and S4) and these are in consecutive storage locations. One can verify that for every partial chain the pertinent records can be found in consecutive locations of storage.

Example 3
Let F = {R1, . . . , R8} and assume d1, d2, d3 and d4 are fragments. [Record-fragment incidence matrix M, with rows R1, . . . , R8 and columns d1, . . . , d4.] The matrix M can be transformed by a permutation of the rows to M1. [Matrix M1: a row permutation of M in which the 1's of every column occupy consecutive rows.]

If the CROF property holds for D, then there exists a mapping g : F → {1, . . . , N} for which Si = R(di) has an interval [l_i, u_i] as its image under g. The image of a partial chain c_s = (d_{i1}, . . . , d_{is}) is [l_c, u_c], where l_c = max{l_{ij}} and u_c = min{u_{ij}}. If l_c > u_c there is no record satisfying the query from which the partial chain was generated. In terms of the record-fragment incidence matrix, the CROF property holds if and only if there exists a permutation of the rows of the matrix such that in each column the 1's occur in consecutive rows. Thus once the fragments D which occur with equiprobability in a file F are selected by the algorithm [1], we have to examine the existence of the CROF property for D. If the CROF property does not hold for D with respect to the file, we need to examine the possibility of allowing redundant storage of records so as to achieve the minimum possible retrieval time for any partial chain of length s.
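The interval arithmetic above (l_c as the maximum of the lower ends, u_c as the minimum of the upper ends) reduces query answering to a pair of comparisons. A minimal sketch with illustrative fragment intervals under some CROF mapping g:

```python
def answer_chain(intervals, chain):
    """Intersect the storage intervals of the fragments named in the chain.
    Returns the (lo, hi) range of pertinent locations, or None if empty."""
    lo = max(intervals[d][0] for d in chain)
    hi = min(intervals[d][1] for d in chain)
    return (lo, hi) if lo <= hi else None

# Illustrative fragment intervals (lower end, upper end) under a mapping g.
intervals = {"d1": (3, 5), "d2": (1, 3), "d3": (4, 6), "d4": (2, 4)}
answer_chain(intervals, ["d2", "d4"])  # -> (2, 3)
answer_chain(intervals, ["d2", "d3"])  # -> None (no record qualifies)
```

Only the first and last pertinent locations need to be computed; the records between them are then read in one sequential sweep.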


For Example 3 the CROF property exists, and one organization of the records is R1 R4 R6 R7 R3 R2 R5 R8. The CROF property does not hold for Example 1. However, if the records R5, R2 and R7 are placed twice, as in R1 R5 R2 R7 R8 R6 R7 R3 R5 R4 R2, all queries that specify one or more fragments will have their pertinent records in consecutive storage locations. In this organization the redundancy factor is 3/8, which is much smaller than the redundancy in a total inverted file organization; for this example 3/8 is the minimum possible redundancy factor. This type of organization will be called a consecutive retrieval on fragments with redundancy (CROFR) organization. Four questions arise naturally: (1) Does the CROF property exist? (2) How can an organization having the CROFR property be obtained when the CROF property does not exist? (3) What is the cost of such an investigation? (4) What is the redundancy factor? In [17] necessary and sufficient conditions are given for the existence of the CR property with respect to a query set; in general the investigation of existence is a difficult problem. In [19] methods are given to find query inverted file organizations with redundancy. Thus the answers to the first three questions are similar to the findings in [17, 19]. We shall answer the fourth question by deriving an upper bound for the redundancy factor.

Estimation of redundancy
There are p − m fragments, all of which occur among the m DTCs constructed by Algorithm DTC. All chains that occur among these fragments occur as partial chains in one or more DTCs; let w denote this number. Assuming that these chains are equally likely to be absorbed among the DTCs, each of these chains occurs among m/w DTCs on the average. The number of records pertinent to each of these chains is t(p − m)/w, since on the average there are t records pertinent to each fragment. In order to force a linear arrangement of records such that the records pertinent to all fragments are found in adjacent storage locations, we must associate the records of these w chains with at most [m − (m/w)] DTCs. Hence the redundancy cannot exceed [m − (m/w)]·(t(p − m)/w). It remains to estimate w, which is equal to the length of the longest ascending sequence in a permutation of p − m elements. The approximation given in [20] for the average of this value is quite good even for moderately small values. Hence we take w = 2√(p − m) and obtain an upper bound for the number of redundant records, given by

[m − (m/w)]·(t(p − m)/w) = (tm/4)·(2√(p − m) − 1).   (8)

From Theorem 1 we have m ≤ 2√p. Hence the bound (8) is at most (t√p/2)·(2√(p − m) − 1). This shows that the derived upper bound is strictly smaller than the redundancy factor in a total inverted organization. The following two examples give some insight into the extent of the reduction achievable in a CROFR organization.

Example 4
Consider 8 records from which 5 fragments are extracted. The record-fragment incidence matrix M and the dependence trees are shown below:

[Record-fragment incidence matrix M, with columns d1, . . . , d5, and the dependence trees.]

An organization of the records with redundancy factor 1/2 is: R3 R5 R1 R6 R8 R2 R5 R8 R7 R5 R1 R4. There exists no organization with redundancy factor smaller than 1/2, and the upper bound (8) gives 5.28/8. The redundancy factor in a total inverted organization is 11/4.

Example 5
We consider all binary attributed records (excluding the string of 0's), each having four attributes (bits); there are 15 records. We consider 9 fragments, each being pertinent to four records. These fragments are: d1 = (01--), d2 = (10--), d3 = (11--), d4 = (--01), d5 = (--10), d6 = (--11), d7 = (-01-), d8 = (-10-) and d9 = (-11-). The records are numbered as R1 = 0001, R2 = 0010, . . . , R15 = 1111. The dependence trees and a linear organization with redundancy are shown in Diagram 4. The redundancy factor is 9/15, the upper bound gives the value 14/15, and the redundancy factor of a total inverted organization is 1.4.
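Example 5's counts can be verified by direct enumeration. This sketch treats the fragments as 4-character patterns with '-' as a wildcard (an assumption about the intended matching rule):

```python
def matches(pattern, record):
    """A 4-bit record matches a pattern when every non-'-' position agrees."""
    return all(c == "-" or c == b for c, b in zip(pattern, record))

records = [format(v, "04b") for v in range(1, 16)]   # R1 = 0001 .. R15 = 1111
patterns = ["01--", "10--", "11--", "--01", "--10", "--11",
            "-01-", "-10-", "-11-"]
sizes = [sum(matches(p, r) for r in records) for p in patterns]
# Each fragment is pertinent to exactly four records, so a total inverted
# file makes 36 placements of 15 records: redundancy factor (36 - 15)/15 = 1.4.
```

The enumeration confirms that each of the nine fragments is pertinent to exactly four records and that a total inverted organization carries a redundancy factor of 1.4.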

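The upper bound of (8) can be evaluated numerically. A sketch of the estimate with w = 2√(p − m); the parameter values are illustrative:

```python
from math import sqrt

def redundant_records_bound(t, p, m):
    """Upper bound [m - m/w] * t(p - m)/w on the number of redundant
    records, with w = 2*sqrt(p - m) (the average length of the longest
    ascending sequence in a permutation of p - m elements)."""
    w = 2 * sqrt(p - m)
    return (m - m / w) * (t * (p - m) / w)

# Illustrative values: t records per fragment, p fragments, m tree roots.
redundant_records_bound(t=4, p=9, m=6)
```

Algebraically the returned value equals (tm/4)(2√(p − m) − 1), the closed form used in (8).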

Diagram 4. Dependence trees and a linear organization with redundancy for Example 5.

5. CONCLUDING REMARKS

The feasibility of designing an information retrieval system that uses word fragments as basic language elements in a query has been investigated. The discussion has been restricted to queries that are partially specified. Any preprocessing that might be done to transform query terms to fragments, and any further processing of records after they have been brought to main memory, are common to all the methods discussed in this paper, although they may differ in detail and complexity. The three methods are compared and critically evaluated in terms of the retrieval cost, which is the on-line measure, i.e. the number of disk accesses needed to answer a single query. In the first method, which uses the traditional inverted lists for the fragments, the cost as a function of s (the number of specified fragments) increases with s and is thus inversely proportional to the size of the result. This implies that one has to pay more to get less. This situation is remedied in the second method and to a great extent eliminated in the third method. Upper and lower bounds for the average cost in method 2 show that the dependence trees and their associated inverted lists decrease the retrieval cost considerably. Method 3 refines the second method to achieve a linear ordering of records with nearly minimum redundancy and force minimum overall retrieval time. Thus in terms of retrieval, the costs of the three methods form a strictly monotonically decreasing sequence. Since the dependence trees together with their inverted lists are kept, method 2 seems to require more storage than method 1. However, our analysis for method 3 shows that the maximum redundancy in a CROFR organization is strictly smaller than that of a total inverted organization. Hence method 3 requires the least space and has the least retrieval cost among the methods discussed. Finally, one cannot but compare the design costs. It is clear that method 1 has the least design cost. It has been shown that the cost of setting up the dependence trees is O(mN), log p ≤ m ≤ 2√p. The cost of setting up a CROFR organization is high, since this is an inherently difficult problem. However, the methods suggested in [19] and the nice structure of the dependence trees should be used in an implementation of method 3. Typically the design cost is incurred only once; the drastic reduction in the operational cost of the system will adequately compensate, and in fact pay dividends, for the initial cost incurred.

Acknowledgement: The author thanks the referees for their valuable comments, which enabled the revision of the paper to the present form.

REFERENCES

[1] E. J. Schuegraf and H. S. Heaps: Selection of equifrequent word fragments for information retrieval. Inform. Stor. Retr. 9, 697-711 (1973).
[2] V. R. Walker: Compaction of names by x-grams. Proc. Am. Soc. Inform. Sci. 6, 129-135 (1969).
[3] D. S. Colombo and J. E. Rush: Use of word fragments in computer-based retrieval systems. J. Chem. Docum. 3, 47-50 (1969).
[4] E. J. Schuegraf and H. S. Heaps: Query processing in a retrospective document retrieval system that uses word fragments as language elements. Inform. Proc. Management 12, 283-292 (1976).
[5] R. L. Rivest: Partial-match retrieval algorithms. SIAM J. Comput. 5, 19-50 (1976).
[6] W. A. Burkhard: Partial match retrieval. BIT 16, 13-31 (1976).
[7] D. E. Knuth: The Art of Computer Programming, Vol. 3: Sorting and Searching. Addison-Wesley, Reading, Mass. (1974).
[8] E. Wong and T. C. Chiang: Canonical structure in attribute based file organization. CACM 14, 309-319 (1971).
[9] C. T. Abraham, S. P. Ghosh and D. K. Ray-Chaudhuri: File organization schemes based on finite geometries. Inform. Control 12, 143-163 (1968).
[10] R. A. Gustafson: A randomized combinatorial file structure for storage and retrieval systems. Ph.D. Thesis, University of South Carolina (1969).
[11] D. Hsiao and F. Harary: A formal system for information retrieval from files. CACM 13, 218-222 (1970).
[12] J. L. Bentley and W. A. Burkhard: Heuristics for partial match retrieval data base design. IPL 4, 132-135 (1976).
[13] V. S. Alagar and C. Soochai: Partial match retrieval for non-uniform query distributions. Proc. NCC '79, New York, pp. 775-780 (1979).
[14] A. V. Aho and J. D. Ullman: Optimal partial-match retrieval when fields are independently specified. ACM Trans. Database Systems 4(2), 168-179 (1979).
[15] C. J. Van Rijsbergen: A theoretical basis for the use of co-occurrence data in information retrieval. J. Docum. 33(2), 106-119 (1977).
[16] S. P. Ghosh: File organization: the consecutive retrieval property. CACM 15, 802-808 (1972).
[17] K. P. Easwaran: Faithful representation of a family of sets by a set of intervals. SIAM J. Comput. 4, 56-67 (1975).
[18] S. P. Ghosh: Data Base Organization for Data Management. Academic Press, New York (1977).
[19] S. P. Ghosh: The consecutive storage of relevant records with redundancy. CACM 18, 464-471 (1975).
[20] R. M. Baer and P. Brock: Natural sorting over permutation spaces. Math. Comput. 22, 385-410 (1968).