Information Systems Vol. 17, No. 5, pp. 359-380, 1992. Printed in Great Britain. All rights reserved.
0306-4379/92 $5.00 + 0.00. Copyright © 1992 Pergamon Press Ltd
LINEAR-DENSITY HASHING WITH DYNAMIC OVERFLOW SHARING

MURLIDHAR KOUSHIK¹ and GEORGE DIEHR²

¹Department of Management Science, School of Business Administration, DJ-10, University of Washington, Seattle, WA 98195, U.S.A.
²California State University at San Marcos, College of Business Administration, San Marcos, CA 92096, U.S.A.

(Received 24 July 1991; in revised form 21 April 1992)

Abstract. A new method for handling overflow records in a linear hash file is introduced. This method, referred to as dynamic overflow sharing, requires that a small number of contiguous pages in the file be treated as a group for the initial sharing of an overflow page. An increase in the number of overflow records belonging to the group could result in the overflow page itself being split. The resulting two overflow pages are now associated with half the pages in the group. This process of overflow page splitting and reallocation can continue with further growth in the number of records hashing to the group. An important element in dynamic overflow sharing is that pointers to overflow are maintained in the home pages themselves in the form of a small extendible directory. A primary advantage of this method is that it is possible to retrieve any record with a maximum of two disk accesses. In addition, it enables file expansion to be performed at per record access costs which are superior to other recently reported schemes. Dynamic overflow sharing is described and analyzed for two situations: (1) a linear, downward sloping load distribution and (2) a uniform load distribution of records in the file. A comparison of the performance in the two cases reveals that, in general, the average retrieval, insertion and storage utilization performance of linear density hashing is superior to uniform density hashing and only slightly worse than conventional (non-dynamic) hashing.

Key words: File organization, linear hashing, non-uniform load distribution, overflow page splitting
1. INTRODUCTION
Several new hashing schemes have been developed and analyzed over the last few years. These schemes differ from traditional hashing in that they dynamically reorganize storage space as records are inserted and deleted. Dynamic hashing methods include virtual hashing [1], dynamic hashing [2,3], extendible hashing [4-6] and linear hashing [7-12]. Dynamic reorganization eliminates the costly, periodic reorganizations necessary with conventional hashing in the face of growth. Of these schemes, linear hashing is particularly attractive because it exhibits good performance in terms of storage utilization and retrieval, insertion and deletion costs. In addition, it has the advantage over other dynamic hashing schemes that no directory is required.

In this paper we propose and analyze a linear hashing scheme which has two new features: (1) a method for handling overflows which uses embedded, within-page pointers and (2) a constant growth-rate hash function which generates a linear, downward sloping load distribution over the pages in the file. Since these ideas are central to the linear hashing scheme proposed, they are briefly described below.

We introduce a method of overflow handling, referred to as dynamic overflow sharing, according to which a physically contiguous group of home pages shares one or more overflow pages. An overflow page is initially allocated to a group when any of its home pages overflows. Subsequent overflows are all directed to this overflow page as long as it has room. If the insertion of a new record finds the overflow page to be full, a new overflow page is allocated and the overflow records are redistributed between the two overflow pages. This is done in such a manner that each overflow page is now associated with half the pages in the group. Thus, the overflow pages can themselves split and merge as the overflow area grows and shrinks. An important element in dynamic overflow sharing is that all pointers to overflow are embedded in the home pages of the file; each home page therefore needs room for a small, extendible directory. The primary advantage of dynamic overflow sharing is that a successful or unsuccessful search requires a maximum of two disk accesses, thereby eliminating one of the drawbacks of linear hashing. In addition, it enables file insertion and
expansion to be performed at per record access costs which are superior to other reported methods. This approach to overflow management also permits pseudo-sequential processing with a two-page buffer. Furthermore, it is easily adapted to conventional (non-dynamic) hashing.

The growth of the file under the impact of inserted records is modeled by the expansion scheme to be used, denoted P:Q, which implies that P pages are expanded into Q pages, where Q is typically equal to P + 1. A P:Q expansion scheme implies a growth rate of Q/P. We describe a scheme in which file growth is accomplished by expanding the space required for records belonging to a group of home pages. The actual number of new pages added during a single expansion is a function of the growth rate and the group size. Thus, when using a 1:2 expansion scheme with a group size of 4 pages, an expansion implies that records belonging to the 4 pages are redistributed into 8 pages: the 4 existing pages and 4 new pages. Using a 2:3 expansion scheme with the same group size results in each expansion adding 2 new pages; two expansions are therefore required to generate a new group. We note that this expansion process results in a constant growth-rate model, since each expansion cycle results in the file size increasing by a factor equal to the growth rate.

We examine the performance of the constant growth-rate model with dynamic overflow sharing using (1) a linear, downward sloping load distribution and (2) a uniform load distribution. The linear, downward sloping load distribution is achieved by the use of a hash function which is linearly decreasing. Pages at one (logical) end of the file have a higher expected load than pages at the other end. The linear load distribution results in a substantial reduction in the oscillation in system performance as the file space expands and contracts. In particular, we observe that by increasing the value of P, the within-cycle variation can be decreased to the point that performance is almost non-cyclic. For example, at P = 4 the difference in minimum and maximum number of expected page accesses for successful retrievals across an expansion cycle is on the order of 0.002. Analysis of dynamic overflow sharing with the linear load distribution also reveals an improvement in retrieval and insertion performance as compared to a scheme that uses a uniform load distribution. Our method is also competitive with other variants of linear hashing. In fact, with this scheme the price paid for dynamic expansion is very small when compared with static hashing using the same overflow method, and represents an improvement over other methods of overflow handling, such as linear probing [9,10] and recursive hashing [12].

The remainder of the paper is organized as follows: Section 2 reviews previous work on linear hashing and puts the current work in perspective. Key issues in the design of linear hashing schemes are also identified here. Linear hashing using a linear density hash function is presented in Section 3. Dynamic overflow sharing is described in Section 4 and file expansion is discussed in Section 5. Analysis of the proposed method is described in Section 6. Numerical results for the linear and uniform load distributions are given in Section 7. Conclusions and suggestions for further research are given in Section 8.
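Returning to the P:Q expansion bookkeeping described earlier, the number of pages added by a single group expansion follows directly from the growth rate and the group size. The trivial sketch below only restates the worked examples from the text; the function name is illustrative.

```python
def pages_added_per_expansion(P, Q, G):
    """One expansion redistributes the records of a group of G home pages over
    G*Q/P pages, so it appends G*(Q-P)/P new pages to the file."""
    return G * (Q - P) // P

print(pages_added_per_expansion(1, 2, 4))  # 4: a group of 4 pages is redistributed into 8
print(pages_added_per_expansion(2, 3, 4))  # 2: two such expansions build up a new group of 4
```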
2. REVIEW OF LINEAR HASHING
This section reviews the basic ideas of linear hashing. A complete description of the various schemes used to implement linear hashing and other related methods can be found in [3,7-14]. Linear hashing assumes the existence of a contiguous logical address space of home pages, each with capacity for B fixed-length records. Expansion of the file is achieved by adding a new page to the end of the file and relocating some of the records from one or more existing pages to the new page. This operation is termed a split. A split is triggered by the use of a control rule such as constant (nominal or actual) storage utilization. If a single existing page is expanded in order to create a new page, then a complete cycle of such expansions leads to a doubling of the file size. In this case, a full expansion is completed in one pass through the pages in the file. Linear hashing using a uniform load distribution exhibits a variation in performance within a cycle. The maximum number of accesses typically occurs in the middle of a cycle. The minimum number occurs at the beginning (or equivalently, the end) of the cycle when the distribution of records is uniform and hence performance is the same as static hashing when using an identical overflow handling method.
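For background, the split-driven addressing of classical linear hashing can be sketched as follows. This is a textbook-style illustration of the standard scheme just described, not the linear-density scheme developed in this paper; N0, level and next_split are the usual bookkeeping variables.

```python
def classic_linear_hash_address(key, N0, level, next_split):
    """Classical linear hashing: pages 0..next_split-1 of the current level have
    already been split, so keys that land there are re-addressed with the next
    hash function in the sequence h_i(key) = key mod (N0 * 2**i)."""
    addr = key % (N0 * 2 ** level)
    if addr < next_split:                 # this page was already split in the current cycle
        addr = key % (N0 * 2 ** (level + 1))
    return addr

# With N0 = 4 pages and one page already split, keys re-hash over pages 0..7 as needed.
print([classic_linear_hash_address(k, 4, 0, 1) for k in (3, 4, 8, 12)])   # [3, 4, 0, 4]
```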
A file organization method that reduces or eliminates variation in performance is attractive. A method related to linear hashing, called spiral storage, has been proposed as a means of achieving constant mean performance while the storage space expands and contracts in proportion to the demand for storage. The original idea is due to Martin [15], who explains this concept in terms of an exponential spiral. Mullin [16] analyzes spiral storage using a separate overflow area with overflow records being organized as linked lists emanating from the home pages. Lomet [17] examines non-cyclic performance in the context of extendible hashing, with file growth accomplished by means of node doubling, where the page size grows to accommodate the records that need to be stored in it.

Linear hashing, in contrast to methods such as extendible hashing, must deal with overflow. A variety of methods for overflow handling have been suggested in the literature. Separate chaining is employed in [7,8,11,16], where overflow records emanating from a page are arranged in a linked list. This method does not perform well for record retrieval and file expansion, especially when the design load factor is high, since a number of seeks have to be performed when traversing the linked list. A modification of this basic scheme is proposed in [18], where an overflow page may be shared by 2 pages. When one overflow page is full and another record needs to be inserted into it, a new overflow page is allocated and linked to the first overflow page by a pointer. This scheme has the obvious advantage that the length of any chain is reduced, leading to a reduction in access costs for insertion, retrieval and expansion. However, since in general there could be a number of overflow pages allocated, not all of which are full, storage utilization tends to be lower than for separate chaining. Another modification is described in [12], where the overflow files are themselves organized as a series of linear hash files. This method allows very high levels of utilization to be realized, but insertion and expansion costs tend to increase rapidly at these levels.

Overflow handling without a separate overflow area is proposed in [9,10,14]. These schemes attempt to store a colliding record in the nearest non-full page in the file. In [14], however, a pointer from each home page links all such overflow records, whereas [9] adopts linear probing. Since there is no separate overflow area, these methods maintain locality of reference. Insertion and retrieval costs are higher as the load factor increases, partly due to the increase in secondary collisions generated. Linear probing is adopted with the use of internal memory for separators (bit strings which determine whether a given record is present in a page without having to access the page) in [10]. Retrieval costs are thus constant at one access, but insertion and expansion costs tend to exhibit the same performance characteristics as noted above for [9].

As mentioned earlier, linear hashing with a uniform distribution of keys exhibits oscillatory performance. Consider a scheme in which the load factor (ratio of the number of records to the number of home pages) is maintained constant. At the beginning of each expansion cycle the expected load of each page is equal to the load factor. Retrieval performance at this point in time is equivalent to static hashing.
As soon as the first page expands, and until the cycle ends, the file will have two different page loads: those pages which have expanded will have lower load than those not expanded. The expanded pages will tend to have less overflow and hence require closer to one access to their records; but this does not sufficiently compensate for the performance of non-expanded pages, whose expected load factor may well exceed 1.0, thus resulting in considerable overflow. As shown in subsequent graphs and tables, even with an efficient overflow handling scheme, worst-case performance may exceed best-case performance by 5%; with linked-list overflow schemes the deterioration may be considerably greater. Performance is typically improved, and variation reduced, if the file is expanded through a series of so-called "partial expansions". When using two partial expansions, the file size increases to 1.5 times its original size at the end of the first partial expansion, and to twice its original size at the end of the second partial expansion. Each partial expansion thus induces a different growth rate. The reduction in average growth rate results in improved average performance over a full expansion cycle, though variation within each partial cycle is still observed.

We conclude this section by outlining a number of performance objectives for linear hashing schemes. This provides a framework into which existing proposals can be mapped (as shown below) and also permits a subsequent evaluation of the scheme presented here.
(1) Good overall retrieval and insertion performance: mean performance over an expansion cycle should be close to that achieved by static hashing schemes [3,7,8,12,16,18].
(2) Low variation in retrieval performance: variation should be minimized by setting an upper bound on the maximum number of accesses [4,6,17].
(3) Constant performance over an expansion cycle: cycling of performance should be minimized [15-17].
(4) Constant growth rate: partial expansions cause the growth rate of a file to vary since each subcycle introduces a different growth rate; a constant growth rate scheme is preferred [3,16].
(5) Good storage utilization: effective storage utilizations normally achieved by conventional, static hashing schemes should be attained [6-9,12].
(6) Pseudo-sequential processing: it should be possible to process the records in the file in hash-key order without multiple accesses to the same page, with minimum buffer size.
It is clear that no individual scheme completely meets all the requirements specified above. A number of schemes meet some of the objectives while providing partial support for others. Some of the objectives are not well supported by most linear hashing schemes. For instance, most linear hashing methods use linked structures for overflow management, and hence do not adequately control variation in retrieval performance, since the number of accesses for a successful retrieval depends on the length of the overflow chain. This is also true of schemes that use linear probing. These methods also do not have the capability for pseudo-sequential processing using a buffer size of (say) 2 pages, since records belonging to a given home page may be scattered over a large number of pages.
3. LINEAR HASHING USING A LINEAR LOAD DISTRIBUTION
This section develops the hash function for generating a linear load distribution in the file. We begin with a standard hash function which yields a uniform density hash-key in the range (0, 1). Any function that belongs to the class of universal hash functions [19] is sufficient for this purpose. The hash-key value is then transformed to a second hash-key, also in the range (0, 1), but with a linear density. We will term this second hash-key the LDH-key (for linear density hash-key). The final step maps the LDH-key to a physical page address. The process can be summarized as: key, to uniform hash-key U, to LDH-key x, to physical page address.
The density function of the LDH-key depends on the expansion scheme used for the file. The expansion scheme determines the number of pages, P, which are "split" when an expansion occurs and the number of pages, Q (Q > P), into which records from the P pages are distributed. Though P could theoretically assume any positive integral value, we restrict it to be of the form 2^n, n = 0, 1, 2, ...; this requirement is imposed in view of the overflow handling scheme to be introduced in Section 4. Typically, Q = P + 1, and thus an expansion adds a single page to the file; however, the analysis and expansion algorithms are not limited to this choice. The density at a given LDH-key value x is:

f(x) = C + C(Q/P - 1)(1 - x),    0 <= x <= 1    (1)
where C is a constant chosen to assure that f is a density function. The load ratio of a file is the ratio of the densities at the extreme points of x, that is, f(0)/f(1). Thus, in what might be termed a "conventional" expansion scheme, where 1 page expands to create 2 pages (P = 1, Q = 2), the density at x = 0 is twice the density at x = 1. As will be seen, the page (or pages) chosen for expansion are always those with the highest (expected) load factor.
The value of the constant C is determined by solving the following (where F(·) is the distribution function):

F(1) = ∫₀¹ f(x) dx = [Cx - C(Q/P - 1)(1 - x)²/2]₀¹ = 1    (2)

Hence C = 2P/(P + Q). The density at x = 1 is C = 2P/(P + Q), while the density at x = 0 is C(Q/P) = 2Q/(P + Q). The mapping from a uniformly distributed hash-key, U, to a logical address x is derived from equation (2) by setting F(x) = U, which gives the quadratic

(C/2)(Q/P - 1)x² - C(Q/P)x + U = 0.

Using the second root of the quadratic formula and simplifying, the logical address corresponding to hash-key U is:

x = Q/(Q - P) - [(Q/(Q - P))² - U(P + Q)/(Q - P)]^(1/2)    (3)
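A minimal sketch of the transformation of equations (1)-(3) is given below. It assumes only that U is produced by some uniform hash function; the mapping from the logical address to a physical page (presented in [21]) is not shown.

```python
import math

def ldh_address(u, p, q):
    """Map a uniform hash-key u in (0,1) to an LDH-key (logical address) in (0,1)
    with the linear density of equation (1), using equation (3)."""
    r = q / (q - p)                       # Q/(Q - P)
    return r - math.sqrt(r * r - u * (p + q) / (q - p))

# Example 1 of the paper: for the 1:2 scheme, U = 0.25 and 0.75 map to about 0.20 and 0.68;
# for the 2:3 scheme they map to about 0.22 and 0.71.
for u in (0.25, 0.75):
    print(round(ldh_address(u, 1, 2), 2), round(ldh_address(u, 2, 3), 2))
```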
Fig. 1. Logical model of file expansion using linear density function and 1:2 expansion scheme. (a) Initial file creation; (b) after split of page 0; (c) after split of page 1; (d) after one expansion cycle is complete.
Example 1. With a 1:2 expansion scheme, the low-end density is C = 2/3. Since the load ratio is Q/P = 2, the high-end density is C(Q/P) = 4/3. Figure 1(a) illustrates the density and also shows a mapping (to be described in Section 6) from the logical address range to a physical set of pages at addresses 0 through N0 - 1. From equation (1), f(x) = 2C - Cx, F(x) = 2Cx - Cx²/2, and for any hash-key U, the logical address is x = 2 - (4 - 3U)^(1/2). For example, the values U = 0.25, U = 0.75 transform to x = 0.20 and x = 0.68, respectively. Figure 1(b) illustrates the result of expanding page 0 into 2 pages. Page 0, which was at the high density end of the distribution (x = 0), is expanded to create 2 pages.
Fig. 2. Overflow page allocation and reallocation (group size 4, page size 3). (a) Initial overflow page allocation (caused by insertion of record with key 8789). (b) Overflow page reallocation after a split of O-14 (caused by insertion of record with key 427). (c) Overflow page reallocation after a split of O-23 (caused by insertion of record with key 4728). (d) Overflow page reallocation after a split of O-23 and O-39 (caused by insertion of records with keys 8928 and 8695, respectively).
For simplicity, the 2 resultant pages are shown appended to the end of the file. These 2 pages will now have (expected) nominal load factors which (in the limit of N0) are 1/2 the nominal load factor of page 0 before the expansion took place. In fact, when P pages are expanded to create Q pages, the logical-to-physical transformation maps P of the Q pages to the addresses of the original P pages and Q - P of the addresses to the end of the file. Figure 1(c) shows the result of expanding page 1, which is now at the high density end of the file, to create 2 pages, and Fig. 1(d) illustrates the logical arrangement of pages in the file after one cycle of expansions is complete. At this point, pages 0 and N0 represent the high density pages in the file, and pages N0 - 1 and 2N0 - 1 represent the low density end of the file.†

If a 2:3 expansion scheme is used, the LDH-key density at x = 1 is C = 4/5, the density at x = 0 is 6/5 and the load ratio is 3/2. The mapping from hash-key U to LDH-key is x = 3 - (9 - 5U)^(1/2). For example, the values U = 0.25, U = 0.75 transform to x = 0.22 and x = 0.71, respectively. Comparing this transformation to the case of 1:2 expansion illustrates how the density f(x) approaches a uniform density as the ratio P:Q increases. If, for example, P = 10 and Q = 11, the values U = 0.25, U = 0.75 transform to x = 0.241 and x = 0.741, respectively. While a hashing scheme with near-constant mean performance and near-uniform density function is attractive, expanding 10 pages at one time to create a new page significantly increases the average insertion cost.
4. LINEAR HASHING WITH DYNAMIC OVERFLOW SHARING
The proposed overflow handling scheme allows an overflow page to be shared by one or more home pages. The maximum number of home pages that can share an overflow page is limited to a fixed value, called the group size, G, which is restricted to be of the form 2^n, for n = 0, 1, 2, .... As we will show, the actual number of home pages sharing an overflow page decreases as more records are inserted into the group. The group size is a design parameter that affects file performance. A very small group size (such as 1 or 2) has the adverse effect of reducing storage utilization, while a large value (such as 16 or 32) tends to increase insertion costs at lower levels of storage utilization. Our analysis and results indicate that a group size of 4 or 8 is a good compromise. Although the discussion below illustrates the scheme with a group size of 4, it is easily generalized to any valid group size. Figure 2 illustrates the proposed overflow handling scheme using group size 4 and page size 3 for a small number of keys. Key sizes used in this example have been intentionally kept small.

†For simplicity, the load density in the file is shown as a linear, continuous distribution; the actual distribution will be a step function, with each step corresponding to 2 pages in the file.
It is assumed that each home page provides some space in its header area for a small, extendible directory†, and that each overflow page similarly has space for a status indicator. Each entry in the extendible directory is of the form (bit-string, pointer). The pointer is the address of an overflow page or is null. The length of the bit-string varies from one to some higher value, but is typically no larger than 2 bits. The function of the status indicator in an overflow page is to determine (without accessing all the home pages in a group) the number of home pages associated with the overflow page‡. We adopt a method according to which 2^(status-value) represents the number of home pages involved, where status-value is the decimal value of the indicator. For a group size of 4 or 8, the status indicator needs no more than 2 bits. Thus, for a group size of 4, an indicator of "10" implies that a single overflow page is shared by the full group, indicator "01" that the overflow page is shared by the home page and its neighbor, and indicator "00" that the overflow page is associated with a single home page. The value "11" is unused for a group size of 4, but is the initial value for a group size of 8. The use and dynamic nature of the extendible directory and the status indicator will be made clear in the following discussion.

Initially none of the home pages is associated with an overflow page and the pointers in the directory of each home page have null values. If the insertion of a new record into any of these pages causes an overflow, then an overflow page is allocated, the new record is written into it and a pointer to this new page is placed in each of the 4 home pages (see Fig. 2a). Each home page now has 2 entries in its directory, with both entries having a bit-string of length one and the same pointer value. The overflow page has its status indicator set to a value of "10". The number of disk accesses required to process such an overflow page allocation (including placement of pointers in home pages) is clearly 2G + 1.

When an overflow page itself overflows, a new overflow page is allocated and the overflow records are distributed between the 2 overflow pages. This operation is termed an overflow page split. Each of the 2 resulting overflow pages is now associated with half the number of home pages that were associated with the original overflow page, and consequently has a status indicator value of "01". Insertion accesses are reduced if the new overflow page is associated with those home pages which include the home page that received the new record. Figure 2(b) illustrates this reallocation of overflow pages (caused by the insertion of a record with key 427 into home page H-33). Records belonging to home pages H-31 and H-32 are retained in the original overflow page O-14 so that pointers in these pages do not need to be changed§. Overflow records belonging to home pages H-33 and H-34 are relocated to the new overflow page O-23. This requires that pointers in these pages be updated. The number of disk accesses required to process this reallocation is G + 3, as shown below:

Read home page H-33 (which is full);
Read overflow page O-14 (also full);
Write overflow page O-23 with records belonging to home pages H-33 and H-34;
Rewrite overflow page O-14 with records belonging to home pages H-31 and H-32;
Rewrite home page H-33 with new overflow pointer to page O-23;
Read home page H-34;
Rewrite home page H-34 with new overflow pointer to page O-23.
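The status-value rule used in this procedure can be sketched compactly. The snippet below is an illustrative helper only (it treats the page labels H-31 through H-34 as integers and assumes the home page's offset within its group is known); it is not taken from the paper's algorithms.

```python
def sharing_home_pages(group_start, page_offset, status_value):
    """Decode an overflow page's status indicator: 2**status_value home pages
    share it, namely the aligned block of that size (within the group) that
    contains the home page at group_start + page_offset."""
    span = 2 ** status_value                     # number of sharing home pages
    block_start = (page_offset // span) * span   # aligned block within the group
    return [group_start + block_start + i for i in range(span)]

# Figure 2(b): O-23 has status "01" (value 1) and is reached from H-33 (offset 2),
# so it is shared by H-33 and H-34.
print(sharing_home_pages(31, 2, 1))   # [33, 34]
# Figure 2(a): status "10" (value 2) means the full group H-31..H-34 shares the page.
print(sharing_home_pages(31, 0, 2))   # [31, 32, 33, 34]
```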
Note that in the second step above, when page O-14 is read and found to be full, its status indicator reveals that it is shared by all the home pages in the group.

†As noted in [20], virtually all implementations of external storage hashing use pointers. Thus, adopting a pointer-oriented scheme seems a small cost to pay for managing overflow.
‡Other implementations of dynamic hashing schemes incorporate a status indicator in each home page (see, for example, [6]). An undesirable consequence of this in our model is that in the event of an overflow page split, other home pages would also have to be accessed even though their directory entries do not need any updating, thus increasing the average cost of an insertion.
§Note that there exists the possibility that all records in page O-14 are from pages H-33 and H-34, or even a small probability that all overflow records are from page H-33. The former can be dealt with by redistributing records based on the home page they belong to, so that pages H-33 and H-34 each have their own overflow page, while pages H-31 and H-32 do not have any overflow page associated with them. The latter situation is handled by digital splitting of records, i.e. records are redistributed based on the value of one or more bits as described later on in this section.
This information, together with the fact that home page H-33 is the recipient of the new record, is used to determine that the new overflow page allocated should be associated with home pages H-33 and H-34. Pages H-31 and H-32 therefore do not need to be accessed since they continue to be associated with O-14. While 7 accesses may seem high, note first that overflow page splits occur infrequently. Second, the number of page accesses computed on a per record basis will necessarily be much smaller than 7. Furthermore, even a simple linked list scheme utilizing separate chaining will require on the order of 4 (or more) page accesses for each overflow record inserted when the list traverses 2 overflow pages.

Continued insertion of records into this home group could result in further reallocation of overflow pages, such as the situation depicted in Fig. 2(c), where pages H-33 and H-34 have their own overflow pages, O-31 and O-23 respectively. This split of page O-23 (caused by the insertion of the record with key 4728 into H-33), leading to the creation of O-31, needs only home page H-33 to be updated; the number of disk accesses required (5) is now independent of the group size G. Each overflow page now has a status indicator value of "00". If page O-23 again overflows, page H-34 will be associated with 2 overflow pages, O-23 and O-39. Redistribution of records between pages O-23 and O-39 uses a variation of the extendible hashing scheme of [4]. Specifically, 1 (or more) bit(s) of the key is used to determine the division of overflow records into overflow pages. For example, records with value "0" of this bit are retained in O-23 and records with bit "1" are relocated into O-39. Further splits of either O-23 or O-39 are handled by using a second bit, etc. Figure 2(d) shows the situation obtained when page O-39 is split and new page O-45 is allocated. All overflow records with bit patterns ending in "00" or "10" are allocated in page O-23; those ending in "01" are in page O-39 and those ending in "11" are located in page O-45.

The extendible directory in a home page could theoretically grow to be as large as necessary to accommodate all overflow records hashing to the page. As a practical matter, though, we are interested in ensuring that the directory does not consume too much space. Note that the directory will typically have 2 entries, with either null-valued pointers or with both pointers to the same overflow page. Observe also that 2 directory entries are sufficient for all cases from 1 overflow page per group to 2 overflow pages per home page. In the latter case, the first directory entry is associated with an overflow page having records with a bit value of "0" and the second entry with an overflow page having records with a bit value of "1". If more than 2 overflow pages are required for a single page, or in the rare case where a split on 1 bit directs all overflow records to a single page, 2 bits are used for redistribution, and the directory size doubles to 4 entries. As in extendible hashing, several pointers may have the same value; for example, 2 pointers to 1 page (e.g. the pointers corresponding to "00" and "10") and 2 other pointers to another page (e.g. the pointers corresponding to "01" and "11"). Note that it is highly unlikely that over 8 pointers will ever be required in this directory.

Deletion of a record can be handled in two ways: deletion with space reclamation and deletion by marking. The method used for deletion from overflow pages should be consistent with the method used for deletion from home pages.
A problem that can arise with deletion and space reclamation is the need to merge "twin" overflow pages when the total number of "active" overflow records in the 2 pages is less than or equal to B. This situation can only be recognized by accessing the twin page. If file volatility is high, page merging can impose a high access cost. The recommended alternative for either of the above approaches is to wait until the home page is to be expanded and to carry out this local reorganization at that time.

This method of overflow handling possesses a number of significant advantages. Since overflow records belonging to a single page are at most 1 disk access away from the home page, the scheme maintains (logical) locality of reference. Retrievals, whether successful or unsuccessful, require a maximum of two disk accesses†. Insertions into home pages require 2 disk accesses. A majority of insertions into overflow require 3 disk accesses. Only those insertions which cause an overflow page to split require more than 3 accesses.

†One of the drawbacks of linear hashing schemes using other methods of overflow handling has been the inability to impose a bound on the number of disk accesses required for a random retrieval (see for example [13, p. 112]). The scheme proposed here addresses this issue and thereby reduces the record-to-record variation in the number of accesses for retrieval. The scheme is also such that average performance over a file will always be less than two accesses per retrieval even for very small page sizes, and slightly over one access per record for moderate to large page sizes.
However, overflow page splits occur infrequently and, further, the impact of these additional accesses is small when insertion costs are computed on a per record basis. Finally, as we show in Section 7, the scheme results in high levels of storage utilization even for small page sizes.
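The two-access bound can be made concrete with a small sketch. The representation below (a Page class, a choose_entry helper, integer keys) is assumed for illustration only and is not the paper's implementation; it simply shows that a search reads the home page and, at most, one overflow page selected through the in-page directory.

```python
# Illustrative sketch of a search under dynamic overflow sharing.
# Page objects are hypothetical in-memory stand-ins for disk pages;
# touching one of them stands for one disk access.

class Page:
    def __init__(self, records, directory=None):
        self.records = records            # keys stored in the page
        self.directory = directory or []  # [(bit-string, overflow_page_or_None), ...]

def choose_entry(directory, key):
    """Pick the directory entry whose bit-string matches the key's low-order bits."""
    for bits, ovf in directory:
        if bits == "" or format(key, "b").zfill(len(bits))[-len(bits):] == bits:
            return ovf
    return None

def search(home_page, key):
    accesses = 1                          # first access: the home page
    if key in home_page.records:
        return True, accesses
    ovf = choose_entry(home_page.directory, key)
    if ovf is None:                       # no overflow page for this bit pattern
        return False, accesses
    accesses += 1                         # second (and last possible) access
    return key in ovf.records, accesses

# A home page whose two directory entries point at one shared overflow page:
shared = Page([8789])
home = Page([3893, 3400, 306], directory=[("0", shared), ("1", shared)])
print(search(home, 8789))   # (True, 2)
print(search(home, 3400))   # (True, 1)
print(search(home, 9999))   # (False, 2): unsuccessful search, still at most 2 accesses
```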
5. FILE CREATION AND EXPANSION
We assume that the file is created with a certain number of records, using the linear density function of (3), and that N0 pages numbered 0, 1, ..., N0 - 1 are initially allocated to the file (see Fig. 1a). The number of home pages needed for file creation depends on the number of records initially available for loading, the page size B, and the design load factor. It is convenient to describe the load factor in terms of the nominal load factor, z', which is the average load factor described in terms of home pages only:

z' = (Number of records in the file)/(BN)
where N is the number of home pages currently in the file. The insertion of a new record into any page in the file increases the nominal load factor. If this load factor is greater than a predetermined threshold value z, a file expansion is carried out. We assume a process that simultaneously expands all the pages in one group before proceeding to the pages of the next group. The control rule can therefore be viewed as carrying out an expansion each time zBG/P records are inserted into the file†.

Expansion of the file needs some method for determining whether a given record remains in the original page or is moved to a page in the new group. We assume the existence of a sequence of independent hash functions for this purpose. Estimation of the cost (in terms of the number of disk accesses) incurred during an expansion depends on the expansion scheme. The simplest case is that of a 1:2 expansion. Let V denote an existing group, X the new group obtained by expanding V, and expand(V) the cost in disk accesses. Then,

expand(V) = total accesses to (read home and overflow pages of group V)
  + (rewrite home and overflow pages of group V for records that remain in the group)
  + (write home and overflow pages of the new group X for records that are moved from V).

When rewriting V it is assumed that pages are recompacted by moving records as close as possible to their respective home pages. When using other expansion schemes, additional accesses are required to maintain pointers in the home pages of the new group. This arises because the new group is formed a few pages at a time. In general, the approach used here is that pointers in the pages of the newly formed group are maintained almost exactly as though the records were inserted one-by-one into the new group. Similar considerations need to be adopted for other group sizes and expansion schemes. The use of a linear density function and a constant growth-rate model make the algorithms for file management slightly more complex than for uniform density linear hashing. In particular, the expansion sequence of linear hashing must be modified to ensure that the linear density profile is maintained with expansion. This in turn affects the addressing problem. Due to limitations of space we do not describe these algorithms in this paper. Algorithms for determining the home page address of a record and for locating the next group to be expanded are presented in [21].

†Note that this scheme is equivalent to the scheme used in [3], that is, a file expansion is carried out after each insertion of a fixed number of records.
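A minimal sketch of the control rule described above follows. The class name and the record counter are illustrative; the actual split sequence and addressing are those of the algorithms in [21] and are not modeled here.

```python
class ExpansionController:
    """Trigger a group expansion each time z*B*G/P records have been inserted,
    which keeps the nominal load factor at the design threshold z."""

    def __init__(self, z, B, G, P, Q):
        self.records_per_expansion = z * B * G / P
        self.growth_rate = Q / P          # each full cycle multiplies the file size by Q/P
        self.pending = 0.0

    def record_inserted(self):
        self.pending += 1
        if self.pending >= self.records_per_expansion:
            self.pending -= self.records_per_expansion
            return True                   # caller expands the next group (P pages into Q pages)
        return False

# With z = 1.0, B = 20, G = 4 and a 1:2 scheme, one group expansion occurs per 80 insertions.
ctrl = ExpansionController(z=1.0, B=20, G=4, P=1, Q=2)
print(sum(ctrl.record_inserted() for _ in range(800)))   # 10 expansions
```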
6. ANALYSIS OF SYSTEM PERFORMANCE
This section analyzes the performance of linear hashing when overflow is managed using the page splitting scheme described in Section 4. In order to facilitate comparison, the analysis is done for both linear and uniform density functions. Four measures of performance are considered: retrieval accesses, insertion accesses, expansion accesses and effective storage utilization.
Fig. 3. Linear density file after 25% of the file has expanded (1:2 expansion scheme). The figure marks the initial high-end page density h, the current high-end page density α_h, the page density α_m in the last page yet to be split (or the first page split at the beginning of the cycle), the low-end page density α_l, and the mean page densities α_n and α_s of the non-split and split portions of the file.
Successful as well as unsuccessful retrieval are considered. For successful retrieval, file performance is a function of page performance, while for the others, file performance is a function of group performance.

6.1. Linear hashing with a linear loading density

A linear hash file using linear density loading is characterized by several file parameters:
1. P:Q expansion scheme;
2. group size, G;
3. page size, B;
4. nominal load factor, z.

Performance analysis for LDH files is based on the following observations. Let p_s and p_n (= 1 - p_s) represent the split and non-split fractions in the file respectively. At the beginning (or end) of an expansion cycle, p_s = 0 (or p_s = 1), and the load densities in the pages of the file vary uniformly from h at the high-density end to l at the low-density end, as shown earlier in Fig. 1(a and d). Thus, z = (h + l)/2B. Since h = l(Q/P), we have l = 2zB/(1 + Q/P) and h = 2zB/(1 + P/Q). Next we consider the behaviour of the file in the middle of an expansion cycle, when 0 < p_s < 1 (see Fig. 3). The file now consists of a split portion and a non-split portion, and the load density in the pages of the file exhibits a point of discontinuity. The page density at the point of discontinuity†, denoted α_m, is a function of p_s. Its initial value is l, and it moves along the line from l to h as file expansion proceeds, finally ending at h at the end of the expansion cycle. It can be seen that α_m = h(P + p_s)/Q. Since the nominal load factor in the file is maintained constant, it follows that α_h and α_l, the page densities at the high and low end of the file, must be continuously changing such that α_h >= h and α_l >= l. The page densities in the non-split portion of the file vary uniformly from α_h at the high-density end to α_m at the point of discontinuity, and in the split portion from α_m to α_l at the low-density end of the file. The first step in the analysis is thus the determination of high-end and low-end page densities. Let α_s = (α_m + α_l)/2 and α_n = (α_h + α_m)/2 denote the mean page densities for the split and non-split portions of the file respectively. Then the mean nominal load factor in the file can be expressed as:

z = (Expected page density)/(Page capacity)    (4)

†Lower case subscripts used with α denote page densities, while upper case subscripts denote group densities.
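The end-of-cycle densities and the discontinuity density follow directly from these relations. A small sketch is given below; the mid-cycle values α_h and α_l, which come from solving equation (4), are not reproduced here.

```python
def cycle_densities(z, B, P, Q):
    """Low- and high-end page densities l and h at the start (or end) of a cycle,
    from z = (h + l)/(2B) and h = l*(Q/P)."""
    low = 2 * z * B / (1 + Q / P)     # l
    high = 2 * z * B / (1 + P / Q)    # h
    return low, high

def alpha_m(z, B, P, Q, p_s):
    """Page density at the point of discontinuity for split fraction p_s."""
    _, h = cycle_densities(z, B, P, Q)
    return h * (P + p_s) / Q

# 1:2 scheme, page size 20, nominal load factor 1.0:
# l is about 13.3 records/page, h about 26.7, and alpha_m moves from l to h over the cycle.
l, h = cycle_densities(1.0, 20, 1, 2)
print(round(l, 1), round(h, 1),
      round(alpha_m(1.0, 20, 1, 2, 0.0), 1), round(alpha_m(1.0, 20, 1, 2, 1.0), 1))
```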
Note that when p_s = 0, that is, when no splits have taken place, equation (4) reduces to z = α_n/B, which is, by definition, the nominal load factor in the file. On substituting for z, α_m, α_s and p_n in equation (4), α_h (and hence α_l) can be expressed as a function of p_s. Thus, α_h is initially at h, gradually increases as the file expands and reaches a maximum at the mid-point of the cycle, and then decreases to h at the end of the cycle. The behavior of α_l around l is similar. Note that group densities can be derived from the corresponding page densities by using a factor of G; thus, α_H = Gα_h, α_M = Gα_m and α_L = Gα_l.

A general approach used for determining mean file performance is as follows. Let Z_s(p_s) and Z_n(p_s) denote average performance measures defined over the split and non-split portions of the file when the split fraction of the file is p_s. Then the average performance of the file at p_s is the weighted sum, Z(p_s) = p_s Z_s(p_s) + p_n Z_n(p_s). In the limit each value of p_s has an equal probability of occurrence, and the mean performance over an expansion cycle is obtained by integrating over the range of p_s:

Z̄ = ∫₀¹ [p_s Z_s(p_s) + p_n Z_n(p_s)] dp_s    (6)

Thus, for each of the performance measures described below, we derive expressions for the non-split portion of the file; expressions for the split portion of the file will be similar, and the mean performance of the file can then be obtained from equation (6).

Mean accesses for successful search. If a record exists, the mean number of disk accesses required for a successful search depends on the number of records that hash to the record's home page. If the number of records hashing to the home page is not greater than its capacity, then 1 disk access is sufficient to retrieve any record. If the page has overflowed, 1 disk access is necessary to retrieve a record in the home page, and 2 disk accesses are required to retrieve a record located in an overflow page. Thus, for the non-split portion of the file, the expected number of accesses per record, E_n(RETR), is the total (expected) number of records in home pages, E(n_h), plus 2 times the (expected) number of records in overflow pages, E(n_o), divided by the total number of records, M:

E_n(RETR) = (E(n_h) + 2E(n_o))/M = 1 + E(n_o)/M

The expected total number of overflow records is the number of home pages, N, times the expected overflow records per page, or E(n_o) = N E(OVFL). Note that N/M is also the inverse of the average page density, or 2/(α_h + α_m); thus

E_n(RETR) = 1 + 2E(OVFL)/(α_h + α_m).    (7)

To determine E(OVFL), let E_λ(OVFL) denote the expected number of records overflowing a page with nominal density λ. Then,

E_λ(OVFL) = Σ_{k=B+1}^{∞} (k - B) f_λ(k)

where f_λ(k) is the probability that a randomly chosen record hashes to a page with k records. E(OVFL) is then obtained by integrating the above expression for E_λ(OVFL) over λ:

E(OVFL) = ∫_{α_m}^{α_h} U(λ) Σ_{k=B+1}^{∞} (k - B) f_λ(k) dλ

where U(λ) is the probability of a page having a density of λ. Since λ is uniformly distributed over [α_m, α_h], U(λ) = 1/(α_h - α_m). Rearranging the order of integration and summation and simplifying gives:

E(OVFL) = [1/(α_h - α_m)] Σ_{k=B+1}^{∞} (k - B) ∫_{α_m}^{α_h} f_λ(k) dλ    (8)
Using the Poisson density for f_λ(k), it is known that ∫ f_λ(k) dλ is the difference of two cumulative Poisson densities (see [22, p. 442]), and so

f(k) = [Σ_{j=0}^{k} e^(-α_m) α_m^j/j! - Σ_{j=0}^{k} e^(-α_h) α_h^j/j!] / (α_h - α_m)    (9)

Note that equation (9) represents a probability distribution (see, for example, [23, p. 249]). Substituting equation (8) into equation (7) and letting f(k) denote the probability density represented by equation (9) gives the expression for the expected number of disk accesses for successful search:

E_n(RETR) = 1 + [2/(α_h + α_m)] Σ_{k=B+1}^{∞} (k - B) f(k)    (10)

Equation (10) can be efficiently computed by writing Σ_{k=B+1}^{∞} k f(k) as (α_h + α_m)/2 - Σ_{k=0}^{B} k f(k), and similarly for the remaining sum, to yield:

E_n(RETR) = 2 - [2/(α_h + α_m)] { Σ_{k=0}^{B} k f(k) + B [1 - Σ_{k=0}^{B} f(k)] }    (11)
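Equations (9) and (11) can be evaluated directly. The sketch below is only an illustration: the density bounds passed in are placeholder values, whereas in the paper α_m and α_h come from the load-factor analysis above.

```python
import math

def poisson_cdf(k, lam):
    """Cumulative Poisson probability P(K <= k) for mean lam."""
    return sum(math.exp(-lam) * lam**j / math.factorial(j) for j in range(k + 1))

def f(k, a_m, a_h):
    """Equation (9): probability that a record hashes to a page with k records
    when the page density is uniform on [a_m, a_h]."""
    return (poisson_cdf(k, a_m) - poisson_cdf(k, a_h)) / (a_h - a_m)

def retr_nonsplit(B, a_m, a_h):
    """Equation (11): expected accesses per successful retrieval, non-split portion."""
    s_f = sum(f(k, a_m, a_h) for k in range(B + 1))
    s_kf = sum(k * f(k, a_m, a_h) for k in range(B + 1))
    return 2 - (2 / (a_h + a_m)) * (s_kf + B * (1 - s_f))

# Illustrative densities only (placeholders, not taken from the paper's tables):
print(round(retr_nonsplit(B=20, a_m=13.3, a_h=26.7), 3))
```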
Mean accesses for unsuccessful search. Closed form expressions for the computation of unsuccessful search accesses become very cumbersome even with a moderate value of B and small G, since joint probabilities are involved and a number of combinations have to be considered†. An alternative method is as follows. Define URET(k) to be the mean number of disk accesses required for an unsuccessful search when there are k records in the group‡. Let P_D(k) denote the probability that the (unsuccessful) retrieval finds k records in a group having density D. Note that D varies uniformly from α_H to α_M. Then, proceeding in a manner analogous to that for successful retrieval, E_n(URET) can be expressed as

E_n(URET) = ∫_{α_M}^{α_H} U(D) Σ_{k=0}^{∞} URET(k) P_D(k) dD

where U(D) = 1/(α_H - α_M) is the probability of a non-split group having density D. Simplifying gives:

E_n(URET) = Σ_{k=0}^{∞} URET(k) P(k)

where P(k) = (F_M(k) - F_H(k))/(α_H - α_M) is the probability that a randomly chosen (non-existent) record hashes to a non-split group with k records, and F_M(k) and F_H(k) are cumulative group densities, that is:

F_M(k) = P_M(K <= k) = Σ_{r=0}^{k} P_M(r)  and  F_H(k) = P_H(K <= k) = Σ_{r=0}^{k} P_H(r).

Average effective storage utilization. Let N_T be the total number of pages, home and overflow, required to store the M records in the non-split portion of the file. Then, effective storage utilization is M/(BN_T). The total number of pages is given by the number of groups, N/G, times the expected number of pages per group, E(TBL):

N_T = (N/G) E(TBL)

†See [6] for an example of this method.
‡Estimation of file performance for unsuccessful retrieval, insertion, expansion and storage utilization requires that the corresponding group performance parameters be known. Thus, we define URET(k), INSR(k), EXP(k) and TBL(k) as estimates describing group performance, where k is the number of records which hash to a group. These values are obtained by simulation as described in Section 7.
Let TBL(k) denote the expected number of pages, home and overflow, allocated to a group of size G which contains k records. Then, the mean number of pages per group is

E(TBL) = Σ_{k=0}^{∞} P(k) TBL(k).

Average effective storage utilization is then given by E_n(ESU) = MG/(BN E(TBL)). The factor MG/N is equal to the average group density, or (α_H + α_M)/2. Thus, the final expression for average effective storage utilization is:

E_n(ESU) = (α_H + α_M)/(2B E(TBL))

Mean accesses for insertion. Let INSR(k) denote the number of accesses per record required to insert k records into a group. Note that INSR(k) includes additional accesses needed to perform overflow page splits, if necessary, but excludes accesses required for expanding the file. The mean number of accesses required to insert a record can be derived in a manner similar to E_n(URET) as

E_n(INSR) = Σ_{k=0}^{∞} P(k) INSR(k).

Expected accesses for file expansion. In contrast to the performance measures discussed above, expansion accesses are determined solely by the density experienced in the high-density end of the file. Since this density varies with p_s, the expected number of accesses for accomplishing file growth varies within an expansion cycle. Let EXP(k) denote the average number of accesses per record required to expand a group containing k records. Then E(EXP), the expected number of accesses per record for file expansion, is given by

E(EXP) = ∫₀¹ (1 - p_s) E_n(EXP) dp_s

where E_n(EXP) = Σ_{k=0}^{∞} P_n(k) EXP(k).
6.2. Linear hashing with uniform loading density

The assumptions involved in the performance analysis of UDH files are the same as for LDH files, with the only exception being that the page and group densities in the split and non-split portions of the file are uniformly distributed. The expected number of records in a non-split page is (Q/P) times the expected number of records in a split page, that is, α_n = (Q/P)α_s. Equation (4) can therefore be simplified as

z = α_s [p_s + (Q/P)p_n] / B

which yields

α_s = zB/[p_s + (Q/P)p_n]  and  α_n = zB/[p_n + (P/Q)p_s].

Mean performance over an expansion cycle is obtained from equation (6).

Mean accesses for successful retrieval. The assumptions underlying the retrieval process are identical to those in the case of linear density hashing. Then, similar to equation (11),

E_n(RETR) = 2 - (2/α_n) { Σ_{k=0}^{B} k f(k) + B [1 - Σ_{k=0}^{B} f(k)] }

where f(k) = e^(-α_n) α_n^k / k! is the Poisson probability that a randomly chosen record hashes to a non-split page having k records. Expressions describing the other measures of performance are derived in a similar way and are therefore given below without full discussion.

Mean accesses for unsuccessful retrieval. As with linear density hashing, file performance is a function of group performance, and so

E_n(URET) = Σ_{k=0}^{∞} P(k) URET(k)

where P(k) = e^(-α_N) α_N^k / k! is the Poisson probability of a non-split group having k records.
Average effective storage utilization.

E_n(ESU) = α_N/(B E(TBL))

where E(TBL) = Σ_{k=0}^{∞} P(k) TBL(k).

Mean number of accesses for insertion.

E_n(INSR) = Σ_{k=0}^{∞} P(k) INSR(k).

Mean number of accesses for file expansion. The number of accesses incurred for expanding the file depends on the density in a non-split group. Since this density varies over an expansion cycle, expansion performance tends to cycle. Hence,

E(EXP) = ∫₀¹ E_n(EXP) dp_s

where E_n(EXP) = Σ_{k=0}^{∞} P(k) EXP(k).
7. PERFORMANCE RESULTS
This section presents numerical results for the proposed linear density hashing scheme with comparison to a uniform density scheme using the identical overflow handling method. Unless otherwise specified, all results represent values over one expansion cycle. Results shown for successful retrieval have been directly computed from the expressions described in Section 6. Results for unsuccessful retrieval, insertion and effective storage utilization have been estimated by simulation at the group level. Simulation was used to estimate URET(k), INSR(k) and TBL(k), where k is the number of records in the group. Data for determining expansion costs, EXP(k), were collected in a similar manner, but rather than simulate just one group, P groups were simulated†, for P = 1, 2 and 4.

Table 1 compares the successful retrieval and storage utilization performance as a function of nominal load factor. The results are shown for group sizes 4 and 8, expansion schemes 1:2, 2:3 and 4:5, and nominal load factors ranging from 0.70 to 1.30. Variation in retrieval performance for both schemes is indicated by including the mean, minimum, and maximum values. Performance of conventional, static hashing using the new overflow handling method is shown in the "Min†" column under "Uniform density hashing".

Consider first the retrieval performance using a linear density function. As with conventional hashing, the expected number of disk accesses increases with increasing nominal load factor. For a given nominal load factor, retrieval performance improves with higher P values; for example, at a nominal load factor of 1.0 the expected accesses decrease from 1.130 to 1.104 as the expansion scheme increases from 1:2 to 4:5. A comparison of the mean values with those in the "Min†" column reflects the price paid for dynamic expansion. It can be seen that this cost is rather small; thus, at a load factor of 1.0 for the 2:3 expansion scheme, the linear density function requires 1.104 expected accesses versus 1.089 for conventional hashing.

Table 1 also shows the effective storage utilization obtained for each nominal load factor. Note that storage utilization also cycles for both schemes; the data shown are the average values over a complete cycle. For fixed group size and nominal load factor, effective storage utilization improves as P increases. When other parameters are fixed, effective storage utilization is also better as the group size increases from 4 to 8. Figure 4 illustrates average storage utilization as a function of nominal load factor for group size 4, page size 20 and various expansion schemes. It can be seen that utilization improves with increasing load factor, but only up to a certain point. Note, for example, that with the 2:3 expansion scheme, utilization increases up to about 0.88 at a load factor of 1.20 but then decreases to about 0.87 as the load factor further increases to 1.30.

†Keys required for the simulation were generated using Lehmer's algorithm as described in [24]. Independent streams were used for each simulation run and also for each group when more than one group had to be simultaneously simulated.
Table 1. Effective storage utilization and mean number of accesses for successful search as a function of nominal load factor (page size 20)

                         Linear density hashing                            Uniform density hashing
Expansion  Nominal      Storage utilization   Accesses for search         Storage utilization   Accesses for search
scheme     load factor  G=4      G=8          Mean    Min     Max          G=4      G=8          Mean    Min†    Max

1:2        0.70         0.671    0.686        1.024   1.019   1.029        0.675    0.689        1.039   1.008   1.059
           0.90         0.779    0.802        1.087   1.077   1.096        0.766    0.785        1.104   1.050   1.136
           1.00         0.811    0.833        1.130   1.119   1.139        0.794    0.814        1.144   1.089   1.177
           1.20         0.842    0.859        1.223   1.212   1.231        0.828    0.844        1.229   1.189   1.258

2:3        0.70         0.668    0.676        1.013   1.012   1.014        0.668    0.680        1.017   1.008   1.023
           0.90         0.773    0.808        1.063   1.060   1.065        0.774    0.803        1.071   1.050   1.082
           1.00         0.818    0.851        1.104   1.100   1.106        0.811    0.841        1.111   1.089   1.123
           1.20         0.864    0.879        1.201   1.197   1.203        0.853    0.870        1.204   1.189   1.214

4:5        0.70         0.669    0.674        1.009   1.009   1.010        0.668    0.674        1.011   1.008   1.012
           0.90         0.767    0.808        1.054   1.053   1.054        0.769    0.807        1.056   1.050   1.060
           1.00         0.815    0.861        1.093   1.092   1.094        0.815    0.856        1.096   1.089   1.100
           1.20         0.876    0.887        1.192   1.191   1.193        0.871    0.884        1.194   1.189   1.196

†Values in this column represent average performance for static hashing.
This suggests that for a given group size and page size there is a specific range of nominal load factors which result in efficient performance; increasing the nominal load factor beyond the upper limit of this range will result in poor performance, since retrieval and other costs continue to increase while storage utilization decreases.

In comparison with uniform density hashing, linear density has better average retrieval performance for fixed nominal load factor. Neither scheme seems to dominate in effective storage utilization. In general, uniform density hashing yields better utilization at lower values of nominal load factor, but linear density hashing is slightly better at nominal load factors above (about) 0.90.

Table 2 presents essentially the same performance data as Table 1 but, to provide an equitable basis for comparison, the average storage utilization is held constant for both schemes. The "+" for uniform density hashing indicates that it was not possible to achieve the specified storage utilization throughout the cycle for any nominal load factor.
Fig. 4. Average storage utilization as a function of nominal load factor for linear density function using various expansion schemes (group size 4, page size 20). (-) 1:2 expansion scheme; (- -) 2:3 expansion scheme; (- - -) 4:5 expansion scheme.
Table 2. Number of accesses for successful search as a function of average effective storage utilization (page size 20)

                          Linear density hashing      Uniform density hashing
Expansion   Effective
scheme      storage
            utilization   Mean    Min     Max          Mean    Min†    Max

Group size 4
1:2         0.70          1.035   1.025   1.049        1.051   1.011   1.082
            0.75          1.062   1.048   1.080        1.088   1.026   1.144
            0.80          1.112   1.097   1.129        1.156   1.057   1.303

2:3         0.70          1.022   1.020   1.025        1.028   1.014   1.037
            0.75          1.047   1.043   1.051        1.053   1.030   1.072
            0.80          1.086   1.081   1.092        1.098   1.063   1.134
            0.85          1.157   1.150   1.162        1.195   1.120   +

4:5         0.70          1.017   1.017   1.018        1.019   1.015   1.022
            0.75          1.042   1.041   1.043        1.044   1.037   1.051
            0.80          1.080   1.079   1.082        1.082   1.073   1.091
            0.85          1.132   1.130   1.134        1.141   1.124   1.161

Group size 8
1:2         0.70          1.028   1.020   1.040        1.043   1.006   1.078
            0.75          1.048   1.037   1.064        1.072   1.015   1.129
            0.80          1.084   1.072   1.101        1.123   1.034   1.239

2:3         0.70          1.018   1.016   1.020        1.022   1.010   1.031
            0.75          1.033   1.030   1.037        1.039   1.020   1.054
            0.80          1.057   1.053   1.063        1.068   1.039   1.095
            0.85          1.102   1.096   1.107        1.126   1.073   1.200

4:5         0.70          1.014   1.014   1.015        1.016   1.012   1.018
            0.75          1.029   1.028   1.030        1.030   1.024   1.035
            0.80          1.050   1.049   1.051        1.052   1.044   1.058
            0.85          1.082   1.081   1.084        1.089   1.075   1.101

†Values in this column represent average performance for conventional (non-dynamic) hashing.
+The specified storage utilization was never achieved throughout the cycle for any nominal load factor.
Average retrieval performance of linear density hashing is generally superior to that of uniform density hashing, and the improvements are generally larger at higher values of storage utilization. In addition, for fixed average storage utilization, the worst-case performance of linear density hashing is better (though by very slight margins) than the average performance of uniform density hashing. Further, the amount of cycling (as measured by the difference between the maximum and minimum columns) is considerably lower with linear density.

Measures that describe the improvement of linear density hashing over uniform density hashing can be developed as follows. For a disk-based system the best retrieval performance that can be achieved is one access. Using this as the base performance, we have

    improvement(mean) = [mean(uniform density) - mean(linear density)] / [mean(uniform density) - 1.0].

The base performance for cycling is 0, and so the improvement in cycling is best described as

    improvement(cycling) = [range(uniform density) - range(linear density)] / range(uniform density),

where range(·) = max(·) - min(·). Adopting this approach, the 1:2 expansion scheme yields an improvement in mean retrieval performance of about 30% over all values of effective storage utilization and group sizes; the improvement decreases somewhat for the 2:3 and 4:5 expansion schemes. Cycling is reduced by around 70-80% over all expansion schemes and group sizes.
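As a concrete check of these measures, the following minimal Python sketch (our illustration, not part of the original analysis; the function names are ours) evaluates both formulas for one row of Table 2, the group size 4, 1:2 scheme at an effective storage utilization of 0.75.

    # Improvement measures from the formulas above, evaluated for the
    # group size 4, 1:2 expansion scheme, 0.75 utilization row of Table 2.
    def improvement_mean(mean_uniform, mean_linear):
        # Base (best possible) retrieval performance is one access.
        return (mean_uniform - mean_linear) / (mean_uniform - 1.0)

    def improvement_cycling(min_uniform, max_uniform, min_linear, max_linear):
        # Base cycling is zero, so the uniform density range is the denominator.
        range_uniform = max_uniform - min_uniform
        range_linear = max_linear - min_linear
        return (range_uniform - range_linear) / range_uniform

    print(improvement_mean(1.088, 1.062))                   # ~0.30, i.e. about 30%
    print(improvement_cycling(1.026, 1.144, 1.048, 1.080))  # ~0.73, i.e. about 73%

These two values are consistent with the roughly 30% mean improvement and 70-80% cycling reduction reported above.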
Table 3 compares mean performance with respect to unsuccessful retrievals and insertions. Insertion costs shown here do not include file expansion costs. The conclusions that can be drawn are similar to those for successful retrievals: in general, the mean performance of linear density hashing is better than that of uniform density hashing over all values of storage utilization. Though not shown, cycling is also considerably lower with the use of a linear density function.

Table 4 illustrates the average performance of file expansion for group size 4, expansion schemes 1:2 and 2:3, and page sizes 5, 20 and 50. Expansion costs are reported in two ways: on a per record basis and on a per expansion basis. Per record results are useful in estimating total insertion costs (insertion costs plus expansion costs), while per expansion results enable estimation of the actual time for processing an expansion.
Table 3. Mean number of accesses for unsuccessful search and insertion as a function of average effective storage utilization (page size 20)

                                 Unsuccessful search           Insertion
Expansion   Effective            Linear     Uniform            Linear     Uniform
scheme      storage utilization  density    density            density    density

Group size 4
1:2         0.70                 1.186      1.248              2.328      2.381
            0.75                 1.305      1.375              2.462      2.533
            0.80                 1.481      1.570              2.653      2.732
2:3         0.70                 1.134      1.165              2.315      2.325
            0.75                 1.269      1.297              2.453      2.467
            0.80                 1.451      1.479              2.610      2.644
            0.85                 1.688      1.750              2.837      2.904
4:5         0.70                 1.110      1.122              2.309      2.314
            0.75                 1.258      1.261              2.463      2.470
            0.80                 1.449      1.459              2.597      2.606
            0.85                 1.652      1.668              2.711      2.802

Group size 8
1:2         0.70                 1.164      1.226              2.364      2.425
            0.75                 1.260      1.336              2.491      2.561
            0.80                 1.401      1.493              2.652      2.713
2:3         0.70                 1.119      1.114              2.371      2.372
            0.75                 1.212      1.241              2.460      2.487
            0.80                 1.340      1.378              2.580      2.622
            0.85                 1.526      1.588              2.758      2.817
4:5         0.70                 1.099      1.109              2.390      2.400
            0.75                 1.196      1.204              2.467      2.466
            0.80                 1.323      1.332              2.526      2.552
            0.85                 1.482      1.502              2.681      2.718
In comparing per record performance across different expansion schemes, it has to be noted that when P > 1, the accesses per cycle need to be multiplied by ln 2/ln(Q/P) to reflect costs incurred over a doubling of the file size; for the 2:3 scheme (P = 2, Q = 3) this factor is ln 2/ln 1.5, or about 1.71. The data shown for the 2:3 expansion scheme reflect this normalization.

The results demonstrate the benefits of the new overflow handling scheme using either a uniform or a linear density. Interestingly, although an increase in storage utilization results in an increase in the accesses per expansion, it actually yields a lower per record expansion cost. This happens because, even though more pages are accessed during an expansion, the larger number of records at the higher level of utilization causes the per record cost to be lower. This contrasts sharply with other schemes in which overflow records having the same hash key are scattered over a large number of pages.

Figures 5-7 illustrate performance at page sizes 5 and 50 for different group sizes and expansion schemes. All figures display the mean performance of linear density hashing as the solid line; the mean and minimum performance of uniform density hashing are represented by the dashed line and the dotted line, respectively. Results are consistent with the tabled values at page size 20.
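The normalization factor and the per record cost accounting described above can be illustrated with a minimal Python sketch (ours, not the authors'); the function names are illustrative, and the numerical inputs are taken from the group size 4, 2:3 scheme, page size 20, 0.80 utilization entries of Tables 3 and 4.

    import math

    def doubling_normalization(P, Q):
        # Accesses per cycle under a P:Q expansion scheme, multiplied by this
        # factor, express the cost over a doubling of the file size.
        return math.log(2) / math.log(Q / P)

    def total_insertion_cost(insertion_accesses, expansion_accesses_per_record):
        # Total per record cost = insertion cost plus the per record share
        # of file expansion work.
        return insertion_accesses + expansion_accesses_per_record

    print(doubling_normalization(2, 3))        # ~1.71 for the 2:3 scheme
    print(total_insertion_cost(2.610, 0.58))   # ~3.19 accesses per record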
Table 4. Mean number of accesses for file expansion as a function of average effective storage utilization (group size 4)

                                 Linear density hashing          Uniform density hashing
Expansion   Effective            B=5      B=20     B=50          B=5      B=20     B=50
scheme      storage utilization

Average accesses per record†
1:2         0.70                 0.87     0.22     0.09          0.88     0.23     0.09
            0.75                 0.81     0.20     0.08          0.83     0.21     0.08
            0.80                 0.76     0.19     0.08          0.77     0.20     0.08
2:3         0.70                 2.56     0.65     0.26          2.60     0.65     0.26
            0.75                 2.46     0.60     0.24          2.48     0.61     0.24
            0.80                 2.34     0.58     0.22          2.34     0.59     0.23
            0.85                 **       0.57     0.22          **       0.57     0.23

Average accesses per expansion over one cycle
1:2         0.70                 14.34    13.11    12.97         14.86    13.49    13.10
            0.75                 15.62    13.59    13.15         16.15    14.40    13.73
            0.80                 18.33    14.75    13.85         20.66    16.14    15.05
2:3         0.70                 12.56    11.46    11.02         12.71    11.54    11.08
            0.75                 13.70    12.00    11.43         13.96    12.21    11.45
            0.80                 15.73    12.96    11.83         16.33    13.37    12.30
            0.85                 ***      14.90    13.61         ***      15.76    14.37

†Results reported for the 2:3 expansion scheme have been normalized to reflect values over a doubling of the file size.
Fig. 5. Comparison of successful search performance as a function of effective storage utilization (group size 4, expansion scheme 1:2). (-) Linear density hashing mean; (-----) uniform density hashing mean; (---) conventional (non-dynamic) hashing/uniform density hashing minimum.
Fig. 6. Comparison of unsuccessful search performance as a function of effective storage utilization (group size 8, expansion scheme 2:3). (-) Linear density hashing mean; (-----) uniform density hashing mean; (---) conventional (non-dynamic) hashing/uniform density hashing minimum.
Fig. 7. Comparison of insertion performance as a function of effective storage utilization (group size 8, expansion scheme 1:2). (-) Linear density hashing mean; (-----) uniform density hashing mean; (---) conventional (non-dynamic) hashing/uniform density hashing minimum.
8. SUMMARY AND DISCUSSION
In this paper we have introduced a linear-density hash function and a new method for handling overflow records in a linear hash file, referred to as dynamic overflow sharing. This method requires that a small number of contiguous home pages in the file be grouped together for sharing overflow. Pointers to overflow are maintained in the home pages in the form of a small, extendible directory. The process of associating overflow pages with home pages is dynamic: as the number of overflow records in a group increases, the overflow pages themselves split, and each of the split overflow pages is then associated with a smaller number of home pages. In the limit it is possible for a single home page to have pointers to more than one overflow page. A primary advantage of this method is that it bounds the number of disk accesses for a retrieval, successful or unsuccessful, to two (an illustrative sketch of the retrieval path is given at the end of this section). In addition, by clustering the overflow records belonging to a home page into a few overflow pages, it enables file expansion to be performed in a highly efficient manner.

Dynamic overflow sharing has been analyzed for two cases: a linear, downward sloping load distribution and a more conventional, uniform load distribution. Both are constant growth rate models in that each expansion cycle causes the file to grow by a constant factor. The control rule used for file expansion is based on a constant load factor on the home pages in the file. The analysis has been performed for the following performance measures: successful and unsuccessful retrieval, insertion, expansion and storage utilization. Note that deletion costs have not been reported because they depend on the details of the method used for their implementation.

We can summarize our findings with respect to the objectives previously stated in Section 2 as follows:
(1) Good overall retrieval and insertion performance. Linear density hashing has an expected number of accesses for successful search and insertion which is lower than the average realized with a uniform density scheme. Performance is only marginally worse than is possible with static hashing using an identical overflow handling method.
(2) Low variance in search performance. The overflow handling scheme ensures that the maximum number of accesses for a successful or unsuccessful search is two; no long chains of pages need to be accessed. Low variance of performance holds for either linear or uniform density hashing.
(3) Constant performance over an expansion cycle. Performance of both schemes tends to vary over an expansion cycle; however, the variation with linear density hashing is small (on the order of 20-50%) compared to that of uniform density hashing.
(4) Constant growth rate. Since a single cycle is not broken down into subcycles by the use of partial expansions, the growth rate of the schemes presented here is essentially constant.
(5) Good storage utilization. Effective storage utilization of 80-85% is possible with either the linear or the uniform density function.
(6) Pseudo-sequential processing is possible (for either density function) given buffer space for two pages: one for a home page and the other for an overflow page.

As is typical with almost any file organization, there is no single setting of parameters which is optimal in all respects. The file or database designer must consider the following factors:
(1) Lower nominal load factors produce better retrieval and insertion performance at the cost of poorer storage utilization.
(2) Higher blocking factors reduce the expected number of retrievals at the cost of increased page transfer times.
(3) Expansion factors closer to one (i.e. larger values of P with Q = P + 1) result in performance which asymptotically approaches conventional hashing, but at an increased per record cost for file expansion. If file expansion can be done during slack periods, a large value for P (e.g. 8 or greater) might be acceptable. For example, rather than using a fixed nominal load factor to control splits, expansions might be deferred during peak operating hours. An increase of even 10% in the number of records, with no file expansion, typically results in an increase of less than 5% in the expected accesses for retrieval. "Batch" or background page expansions would then occur during off-peak hours until the nominal load factor was restored to the desired value.

In general, we believe that the use of a linear density hash function and the proposed overflow handling method result in a hashing scheme which performs very well against common evaluation criteria. However, several issues and potential refinements to the method deserve further attention.

(1) Retrieval and insertion performance (actual disk access time) will be improved if overflow pages are in physical proximity (e.g. on the same disk cylinder) to their home pages. This could be achieved by preallocating (say) one overflow page for every G home pages within the same linear address space. The cost would be a decrease in storage utilization.
(2) Storage utilization might be improved if each home page were assigned a "buddy" home page to serve as its first overflow page. Buddies would be determined so that pages with high expected load factors were paired with pages with low expected load factors.
(3) We have used an overflow page size which is identical to the home page size. It is not clear that this is optimal; storage utilization could probably be improved with smaller overflow pages.
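To make the two-access retrieval bound concrete, the following minimal Python sketch (our illustration, not the authors' implementation) shows a retrieval path in which each home page carries a small directory of overflow page pointers. The class and function names, the plain modulus used for home page addressing, and the use of further hash bits to select a directory slot are simplifying assumptions; a real linear hash file would use the split-pointer address computation.

    # Minimal sketch of retrieval with an overflow directory embedded in each
    # home page. At most one home page and one overflow page are read.

    class OverflowPage:
        def __init__(self):
            self.records = {}            # key -> record

    class HomePage:
        def __init__(self):
            self.records = {}            # key -> record
            self.dir_depth = 0           # depth of the embedded extendible directory
            self.directory = [None]      # 2**dir_depth overflow-page slots

    def search(home_pages, hash_fn, key):
        """Return (record_or_None, page_accesses); never more than two accesses."""
        h = hash_fn(key)
        home = home_pages[h % len(home_pages)]          # first disk access
        if key in home.records:
            return home.records[key], 1
        # Select an overflow page from the embedded directory using further
        # bits of the hash value (a single slot while dir_depth is 0).
        slot = (h // len(home_pages)) % (1 << home.dir_depth)
        overflow = home.directory[slot]
        if overflow is None:
            return None, 1                              # unsuccessful, one access
        if key in overflow.records:                     # second (and last) disk access
            return overflow.records[key], 2
        return None, 2

    # Example: a record stored in its home page is found in a single access.
    pages = [HomePage() for _ in range(8)]
    home_index = hash("key-1") % 8
    pages[home_index].records["key-1"] = "record-1"
    print(search(pages, hash, "key-1"))                 # ('record-1', 1)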