INFORMATION SCIENCES 48, 99-117 (1989)
Multidimensional Quantile Hashing Is Very Efficient for Nonuniform Distributions*

HANS-PETER KRIEGEL and BERNHARD SEEGER

Praktische Informatik, University of Bremen, D-2800 Bremen 33, West Germany
ABSTRACT
Previous multidimensional hashing schemes without directory, which generalize the concept of Litwin's linear hashing, partition the data space into equidistant cells using a dynamic orthogonal grid. Thus the performance of these schemes degenerates for nonuniform record distributions. We propose a new scheme without directory, called multidimensional quantile hashing, which avoids this important drawback. For independent nonuniform record distributions, quantile hashing exhibits practically the same performance as for uniform distributions. The performance gain of quantile hashing in comparison with other schemes without directory is demonstrated by experiments.
1. INTRODUCTION
Concerning database systems for standard applications, e.g. commercial applications, there is a large variety of index structures at the disposal of the database designer for the implementation of the physical level of a database system. As demonstrated in [4], there are efficient tree-based structures, such as multidimensional B-trees, in particular kB-trees. For nonstandard databases used in applications such as image processing, design and manufacturing (CAD/CAM), etc., none of the tree-based index structures is suitable, since they cluster records according to the lexicographical order of the attributes. Thus the retrieval time to answer a complex query depends on which of the attributes is specified.
*This work was supported by the Deutsche Forschungsgemeinschaft (German Research Society).
When designed suitably, hash-based index structures cluster records which are close together in the key space. This is the type of clustering needed in nonstandard applications. Furthermore, multidimensional hashing schemes can be designed to fulfill all the requirements for use in engineering database systems: to be dynamic, to be suitable for secondary storage devices, and to support point and spatial objects [13, 23].

In this paper we focus on the problem of deriving efficient multidimensional hashing (MDH) schemes organizing a file of d-attribute composite keys K = (K_1, ..., K_d), where d > 1 is the dimension of the key K. These keys are stored in buckets, and b is the maximum number of keys (capacity) in one bucket. Moreover, we will assume the domain of the keys to be the unit cube E = [0,1)^d. This requirement can easily be fulfilled for an arbitrary domain by a simple transformation.

The basic idea of multidimensional index structures is to divide the data space into disjoint regions. The objects contained in one region are stored in one bucket, or in a short chain of buckets. In order to support dynamic adaptation, the number of disjoint regions depends on the number of records. MDH schemes commonly partition the data space using a dynamic grid. The various MDH schemes are adaptations of one-dimensional dynamic hashing schemes, either linear hashing [11] or extendible hashing [3]. MDH schemes fall into one of the following three groups: (1) MDH without directory, (2) MDH with an array-based directory, (3) multidimensional hash trees.

Typical representatives of the second group are the grid file [12] and multidimensional extendible hashing [24, 14]. The major disadvantage of these schemes is the growth of the directory, which is organized as a d-dimensional dynamic array on secondary storage. For instance, in the case of the grid file the directory size for a uniform distribution is O(n^{1+(d-1)/(db)}) [20], and may be exponential in the case of nonuniform distributions. This aspect was the reason for suggesting hash trees, such as the multilevel grid file [25], the interpolation-based grid file [17], and the balanced multidimensional hash tree [16]. The most important property of these schemes is that the directory is organized similarly to a K-D-B-tree [22]. Every internal node of the tree is organized like a grid directory, and the pointers of this local grid directory refer again to nodes of the tree. The size of such a local grid directory is limited by the capacity of a bucket. The data buckets correspond to the leaves of the tree.

The principle of MDH without directory is that for every key K an address H(K) ∈ {0, ..., m-1} can be computed by an address function H. Here m denotes the number of chains in the file, and a chain can consist of more than
one bucket. Then an address corresponds to exactly one chain. However, no information needs to be stored on secondary storage to compute addresses. Typical representatives of this group are the interpolation-based scheme [1], multidimensional linear hashing [18], multidimensional digital hashing [15], and multidimensional order-preserving linear hashing with partial expansions (MOLHPE) [5]. The main drawback of these schemes is that they assume a uniform record distribution to yield optimal performance. Unfortunately, there is no scheme with good overall performance. In the case of uniform key distributions, MOLHPE seems to be the best scheme, whereas in the case of strongly correlated key distributions hash trees outperform all other schemes. In this paper our goal is to propose a new MDH scheme without directory which exhibits for independent or weakly correlated key distributions nearly the same performance as MOLHPE for a uniform distribution.

This paper is organized as follows. In Section 2, we give a brief introduction to MOLHPE and its address function. A detailed discussion of MOLHPE can be found in [5]. In Section 3 we present the quantile method, which will be added on top of MOLHPE. The result of this combination is our new scheme, called multidimensional quantile hashing. In Section 4 we describe how the file is reorganized, and in Section 5 we report experimental results on quantile hashing and a performance comparison with other multidimensional hashing schemes without directory.
2. REVIEW OF MOLHPE
In this section we will review multidimensional order-preserving linear hashing with partial expansions (MOLHPE). In [5] we presented MOLHPE completely and reported on experimental performance comparisons. MDH schemes without a directory are based on (one-dimensional) linear hashing [11]. Using a hashing function H, we compute the address of a short chain of buckets, where the set of addresses {0, ..., m-1} is time-varying. Giving up the directory, we have to allow overflow records, i.e. records which cannot be placed in the first bucket of the corresponding chain, called the primary bucket. The overflow records are stored in a so-called secondary bucket, which is chained to the primary bucket. The primary bucket resides in the primary file, the secondary bucket in the secondary file. This very simple way of treating overflow records is called bucket chaining. One chain of buckets is also called a page. Overflow records can be handled more efficiently with other strategies, such as recursive hashing [19] and overflow handling in the primary file [10].
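To make the page organization concrete, the following minimal Python sketch models bucket chaining as just described. The class and field names are our own illustration, not taken from the paper; the default capacities are chosen to match the experimental setup in Section 5.

```python
class Page:
    """A chain of buckets: one primary bucket plus chained secondary buckets.
    Names and representation are our own illustration, not the paper's code."""

    def __init__(self, primary_capacity=31, secondary_capacity=7):
        self.primary_capacity = primary_capacity
        self.secondary_capacity = secondary_capacity
        self.primary = []        # records in the primary bucket (primary file)
        self.secondary = []      # chained secondary buckets (secondary file)

    def insert(self, record):
        """Store a record, spilling overflow records into secondary buckets."""
        if len(self.primary) < self.primary_capacity:
            self.primary.append(record)
            return
        if not self.secondary or len(self.secondary[-1]) == self.secondary_capacity:
            self.secondary.append([])    # extend the chain by one secondary bucket
        self.secondary[-1].append(record)

    def records(self):
        return self.primary + [r for bucket in self.secondary for r in bucket]
```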
The expansion and contraction of the file are triggered by a rule called the control function. For instance, if the storage utilization exceeds a fixed threshold, the file is expanded by one page. A global pointer p points to the page which will be expanded next. The records in page p are divided into two groups of about equal size. One group remains in the old page p; the other group is allocated in the new page. A variable L denotes the level of the file, i.e., it indicates how often the file size has doubled. To focus on the performance of MDH schemes without directory, we will define the relative load factor.

DEFINITION 1. Let f be a file, K be the key inserted next into the file, and C_1, ..., C_m, m ≥ 1, be a disjoint partition of the file into chains. Then the relative load factor lf(C_i) of the chain C_i, 1 ≤ i ≤ m, is given by

lf(C_i) = Pb(K ∈ C_i) / max_{1 ≤ j ≤ m} Pb(K ∈ C_j),

where Pb(K ∈ C_i) denotes the probability with which key K is inserted into chain C_i.
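Definition 1 translates directly into code. The following toy computation, with insertion probabilities invented purely for illustration, shows the relative load factors of m = 4 chains:

```python
def relative_load_factors(chain_probs):
    """lf(C_i) = Pb(K in C_i) / max_j Pb(K in C_j), as in Definition 1."""
    peak = max(chain_probs)
    return [p / peak for p in chain_probs]

# Invented insertion probabilities for m = 4 chains:
print(relative_load_factors([0.10, 0.30, 0.40, 0.20]))  # [0.25, 0.75, 1.0, 0.5]
```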
The basic paradigm of hashing schemes is that the best retrieval performance is achieved when the records are distributed as uniformly as possible over all the pages of the file, i.e., the relative load factors of the pages are all nearly equal. Assuming a uniform record distribution, linear hashing and its multidimensional counterparts [1, 18] produce pages with relative load factors between 0.5 and 1. To achieve a more even load, partial expansions have been proposed for linear hashing [9]. The pages are partitioned into groups. If a new page is added to the file, one group of pages will be expanded, and the records in this group must be redistributed over the pages. However, previous MDH schemes without directory do not allow partial expansions under the requirement of a compact address space. For the sake of a simple explanation, we do not consider partial expansions in this paper. How partial expansions can be built in is shown in [5].

In the following we will introduce a simplified version of MOLHPE which generalizes linear hashing similarly to [15]. As already indicated, the key space is partitioned by an equidistant grid, where all objects in one cell of the grid are stored in one page on secondary storage. Now we want to present our address function G, which was originally proposed to compute directory addresses for multidimensional extendible hashing [14]. First, let us assume a file with 2^L pages, i.e., the file has reached level L. Moreover, each dimension is treated equally and has a local level L_j, j ∈ {1, ..., d}, which is given by

L_j = L DIV d + 1   if j ∈ {1, ..., s-1},
L_j = L DIV d       otherwise.
The axis s = L MOD d + 1 is the axis along which the expansions are carried out and is therefore called the split axis. Using the bit representation of the key K = (K_1, ..., K_d) ∈ [0,1)^d, K_j = Σ_{l ≥ 1} b_l^j 2^{-l}, we compute for each dimension j, 1 ≤ j ≤ d, an index

i_j = Σ_{l=1}^{L_j} b_l^j 2^{L_j - l}.
Then the address is given by

G(i_1, ..., i_d) = 0   if max{i_1, ..., i_d} = 0,
G(i_1, ..., i_d) = i_z · ∏_{j ∈ M} f_j + Σ_{j ∈ M} c_j · i_j   otherwise,   (2)

where z, M, t, f_j, and c_j are given as follows (with the convention ⌊log_2 0⌋ = -1):

z = max{ j ∈ {1, ..., d} | ⌊log_2 i_j⌋ = max{ ⌊log_2 i_k⌋ | k ∈ {1, ..., d} } },
M = {1, ..., d} \ {z},
t = ⌊log_2 i_z⌋,
f_j = 2^{t+1} for j ∈ M, j < z,   f_j = 2^t for j ∈ M, j > z,
c_j = ∏_{k ∈ M, k > j} f_k.
In order to get an intuitive understanding of the function G, we depict in Figure 1 the addresses of a file with 16 pages (d = 2) produced by the function G as a function of the index array i = (i_1, i_2). Originally, the file consisted of 4 pages with addresses 0, 1, 2, 3. One doubling of the file corresponds to adding the pages with addresses 4, 5, 6, 7. These new addresses are in lexicographical ordering with respect to (i_1, i_2). One further doubling of the file creates the new pages with addresses 8, ..., 15. Again these new addresses are in lexicographical ordering, but now with respect to (i_2, i_1). On the right-hand side of Figure 1 we depict the addresses produced by the function G as a function of the domain of the keys. Every address corresponds to a rectangle-shaped region of the data space. For example, address 5 corresponds to the rectangle [0.5, 0.75) × [0.25, 0.5).
Fig. 1. File with 16 pages addressed by G (d = 2).
In analogy to linear hashing, the file can be expanded page by page using an expansion pointer ep = (ep_1, ..., ep_d). The next page to be split is addressed by G(ep). After splitting a page, the expansion pointer is updated to the next higher value in lexicographical ordering with respect to (i_s, i_1, ..., i_{s-1}, i_{s+1}, ..., i_d). However, after a doubling of the file size, the expansion pointer is reset to ep := (0, ..., 0).
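As a cross-check of our reconstruction of Equation (2), the following Python sketch (our own, with 0-based axes and the ⌊log_2 0⌋ = -1 convention) computes G and reproduces the address layout of Figure 1:

```python
from math import floor, log2

def exponent(i):
    """floor(log2(i)), with the convention exponent(0) = -1."""
    return floor(log2(i)) if i > 0 else -1

def G(idx):
    """Sketch of the address function G of Equation (2), reconstructed from the
    text; axes are 0-based here, so z is the 0-based split coordinate."""
    d = len(idx)
    if max(idx) == 0:
        return 0
    emax = max(exponent(i) for i in idx)
    z = max(j for j in range(d) if exponent(idx[j]) == emax)
    M = [j for j in range(d) if j != z]
    t = exponent(idx[z])
    f = {j: 2 ** (t + 1) if j < z else 2 ** t for j in M}
    addr = idx[z]
    for j in M:
        addr *= f[j]                       # i_z * prod_{j in M} f_j
    for pos, j in enumerate(M):
        c = 1
        for k in M[pos + 1:]:              # c_j = prod of f_k for k in M, k > j
            c *= f[k]
        addr += c * idx[j]
    return addr

# Reproduces the 16-page layout of Figure 1 (rows printed top to bottom, i_2 = 3..0):
for i2 in range(3, -1, -1):
    print([G((i1, i2)) for i1 in range(4)])
# [12, 13, 14, 15]
# [8, 9, 10, 11]
# [2, 3, 5, 7]
# [0, 1, 4, 6]
```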
3. THE QUANTILE METHOD
In this section we will present the quantile method, which is applicable to any MDH scheme with or without directory. We will apply it to MOLHPE: the quantile method is supposed to yield practically the same retrieval performance for independent nonuniform distributions as for uniform ones, and MOLHPE performs better than its competitors for uniform distributions. Thus we expect practically optimal behavior of the quantile method applied to MOLHPE. However, in the following we do not consider partial expansions, in order to give a simple intuitive explanation of the quantile method. In this case MOLHPE is similar to [1], a generalization of one-dimensional linear hashing. Then MOLHPE guarantees for uniform distributions that lf(C_i) ∈ [0.5, 1.0]. We want to fulfill this condition for nonuniform distributions as well. We will achieve this by making the grid not equidistant, but adjusted to the distribution of the objects.

Let us now introduce some further notation. Let F denote the d-dimensional distribution function of the keys K = (K_1, ..., K_d) ∈ [0,1)^d. Then f_i, 1 ≤ i ≤ d, denotes the one-dimensional distribution function of the marginal distribution of the ith dimension. For 0 ≤ α ≤ 1 and i ∈ {1, ..., d}, the α-quantile of the ith dimension is the domain value x(α) ∈ [0,1) of the ith key component such that

f_i(x(α)) = α.

Furthermore, the distribution F is called an independent distribution iff

F(K_1, ..., K_d) = f_1(K_1) · ... · f_d(K_d).
Let us now demonstrate our method, considering 2-dimensional (d = 2) keys K = (K_1, K_2). Let us assume that starting from an empty file we have to partition the key space [0,1)^2 for the first time, and we decide to partition the first dimension. If f_1 is a nonuniform distribution, we will not partition the first axis in the middle, but we will choose the 1/2-quantile x(1/2) as the so-called partitioning point. This guarantees that a new record will be stored with equal probability in the page corresponding to the rectangle [0, x(1/2)) × [0,1) or in the page corresponding to the rectangle [x(1/2), 1.0) × [0,1); see the left side of Figure 2. During the next expansion we will partition the second dimension (axis), choosing the 1/2-quantile y(1/2) of the second dimension as the partitioning point; see the right side of Figure 2. Figure 3 shows a file consisting of 16 pages, where each axis has been partitioned at the 1/4, 1/2, 3/4 quantiles. As depicted in Figure 3, the partitioning points are stored in optimal balanced binary trees which can easily be kept in main memory.
Fig. 2. Partitioning the 2-dimensional key space [0,1)^2.
Fig. 3. Optimal grid partition of a file consisting of 16 pages.
The most important property of these binary trees of partitioning points is the following: for each type of operation, a nonuniformly distributed query value K_j, j = 1, 2, is transformed into a uniformly distributed α ∈ [0,1) by searching the corresponding binary tree of partitioning points. This α is then used as input to the retrieval and update algorithms of MOLHPE. Let us mention that best performance can be achieved only for an independent distribution, since the data space is partitioned by an orthogonal grid. The hyperplanes of a partitioning point are perpendicular to the dimension to which the partitioning point belongs; thus the choice of the partitioning points of the jth dimension should only depend on the distribution function f_j. In case this independence assumption is not fulfilled, the method which we suggest will still be very efficient, but not practically optimal any more.

In most applications the distribution of objects is unknown, and thus the quantiles of the dimensions are also unknown. Therefore we will approximate the unknown distribution using the records presently in the file. Burkhard was the first to use a stochastic approximation process for adapting the partitioning points to the underlying distribution [2]. We will call this process the quantile method. In [2] the quantile method is applied to the 1-dimensional interpolation-based scheme [1]. Now we will apply the quantile method to MOLHPE.
The quantile method was described for the first time in [21]. A survey of the theory of stochastic approximation processes, in particular the quantile method, can be found in [8]. For each i, 1 ≤ i ≤ d, we will approximate the proper quantiles according to the following iteration scheme. Let
(1) i ∈ {1, ..., d}, and x(α) be the α-quantile of the distribution function f_i;
(2) x_1(α) be our best initial guess for x(α);
(3) K^1, ..., K^n be the sequence of keys which have been inserted into the file, where K^j = (K_1^j, ..., K_d^j) ∈ [0,1)^d, 1 ≤ j ≤ n;
(4)

m(n, l) = |{ K^j | x_l(α) > K_i^j, j ∈ {1, ..., n} }|,   (3)

where the sequence {x_l(α)} of estimates is given by

x_{l+1}(α) = x_l(α) - a_l ( m(l, l)/l - α ).   (4)

Here {a_l} is a sequence such that

(a) a_l → 0 (l → ∞),
(b) Σ_{l=1}^{k} a_l^2 converges for k → ∞.
The stochastic function m(n, l) denotes the number of records whose ith component key is below the lth estimate x_l(α) of the α-quantile x(α) when n records are presently in the file. The expected value of m(n, l)/n corresponds to f_i(x_l(α)). According to (4), the (l+1)th estimate of x(α) can be computed only after at least l records are stored in the file. The sequence {a_l} = {1/l}, l ≥ 1, fulfills conditions (a) and (b). However, in order to guarantee a fast adaptation to the distribution, we have to choose a_l depending on the file size. For example, in our implementation we choose for d = 2 a_l = 1/2^{L DIV d}, where L is the present level of the file. As l → ∞ the sequence {x_l(α)} converges to x(α) with probability 1 if f_i is continuous almost everywhere. Obviously, a disadvantage of the quantile method is that a sorted sequence of keys will prevent fast convergence.
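A compact Python sketch of the iteration (3)-(4); the driver loop, the test data, and the step sizes a_l = 1/l are our own illustration, not the paper's implementation. For keys with distribution function F(x) = sqrt(x) the true median is 0.25:

```python
import bisect, random

def quantile_estimate(keys, alpha, x0, step):
    """Iterate x_{l+1} = x_l - a_l * (m(l, l)/l - alpha), Equation (4), where
    m(l, l) counts the keys inserted so far that lie below the current estimate."""
    x, seen = x0, []
    for l, k in enumerate(keys, start=1):
        bisect.insort(seen, k)               # keep the inserted keys sorted
        m = bisect.bisect_left(seen, x)      # m(l, l): number of keys below x_l
        x -= step(l) * (m / l - alpha)       # Equation (4)
    return x

random.seed(1)
skewed = [random.random() ** 2 for _ in range(5000)]   # nonuniform: F(x) = sqrt(x)
print(quantile_estimate(skewed, 0.5, 0.5, lambda l: 1.0 / l))  # converges toward 0.25
```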
Let us now consider a file of 2^L pages, where L is the level of the file. In analogy to MOLHPE, we choose the levels L_i, i ∈ {1, ..., d}. Then the ith axis is partitioned by 2^{L_i} - 1 partitioning points pp_i(α), 0 < α < 1. Each of the pp_i(α) is the newest estimate of the α-quantile, computed using the quantile method. The set P_i, 1 ≤ i ≤ d, of partitioning points of the ith axis is given by

P_i = { pp_i(α) | α = Σ_{j=1}^{L_i} b_j 2^{-j}, b_j ∈ {0,1}, α ≠ 0 }.
The whole key space is partitioned by the hyperplanes H_i(α) = { K ∈ [0,1)^d | K_i = pp_i(α) ∈ P_i }, i ∈ {1, ..., d}. Each pp_i(α) can be characterized by a bitstring b_1 ... b_z with

α = Σ_{j=1}^{z} b_j 2^{-j}.
Now for each axis j, j ∈ {1, ..., d}, we organize the partitioning points within a binary tree, which may be stored explicitly (using pointers) or implicitly (within an array, using no pointers). These d binary trees have a total storage requirement of O((d/b) n^{1/d}), where n is the number of records in the file and b is the capacity of a data bucket. Thus these binary trees can easily be kept in main memory. The most important property of these binary trees of partitioning points is the following: for each type of operation and each axis j, 1 ≤ j ≤ d, a nonuniformly distributed query value K_j is transformed into a uniformly distributed α ∈ [0,1) by searching the corresponding binary tree of partitioning points. This α is then used as input to the retrieval and update algorithms of MOLHPE. The algorithms for retrieval and expansion are basically the same as for MOLHPE without quantile approximations.
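The transformation of a query value into α can be sketched as a descent through the (here implicit) binary tree of one axis; the dictionary representation of the partitioning points and all names are our own illustration:

```python
def key_to_alpha(key, pp, L_i):
    """Descend the binary tree of partitioning points of one axis: pp maps an
    alpha-value to the current quantile estimate pp_i(alpha); L_i is the local
    level. The bits b_1 ... b_{L_i} of the returned alpha are fixed one per level."""
    alpha = 0.0
    for j in range(1, L_i + 1):
        node = alpha + 2.0 ** (-j)      # alpha-value of the tree node visited next
        if key >= pp[node]:
            alpha = node                # bit b_j = 1: descend to the right subtree
        # else bit b_j = 0: descend to the left subtree
    return alpha

# Toy quantile estimates of one axis for L_i = 2 (invented values):
pp = {0.25: 0.1, 0.5: 0.2, 0.75: 0.6}
print(key_to_alpha(0.15, pp, 2))   # 0.25: the key lies between pp(1/4) and pp(1/2)
```

The resulting α corresponds to the cell index i_j = α · 2^{L_j} along axis j, which is then fed into the address function G.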
4. REORGANIZATION OF THE FILE
None of the previous MDH schemes adapts gracefully to the distribution of the records in key space. Using the suggested quantile hashing, we reorganize by adapting the partition of the key space to the present distribution. This is
done incrementally by computing a new estimate for a partitioning point pp_i(α) using (4) and adjusting the partition of the key space to the new partitioning hyperplane H_i(α), 1 ≤ i ≤ d. We say that we move the partitioning hyperplane H_i(α). We call the rule which determines whether the file will be reorganized or not the control function for the reorganization of the file. Let i ∈ {1, ..., d} be the axis to be considered next for reorganization. We will consider the following possible control functions:

(C1) The file is reorganized before each expansion (or contraction).

(C2) Let x_i(α) be the α-quantile to be reorganized next, and pp_i(α) be the present estimate of x_i(α). For β ∈ (0,1), we compute the confidence interval I_β for x_i(α). If pp_i(α) ∈ I_β, then pp_i(α) is accepted as α-quantile. Otherwise a new estimate is computed for pp_i(α) according to Equation (4), and the file is reorganized by moving the hyperplane H_i(α). Then α is advanced to the α-value which will be reorganized next. This step is performed after a predefined number of insertions and deletions. For a given β ∈ (0,1), the confidence interval I_β for x_i(α) is a compact part of the domain in which x_i(α) lies with probability 1 - β.

(C3) Compute the number of records with ith component key between two neighboring partitioning points pp_i(α_L) and pp_i(α_R), pp_i(α_L) < pp_i(α_R). Subtract (α_R - α_L)·n, which is the corresponding expected value. Take the maximum of this difference over all pairs of neighboring partitioning points on axis i, compute a new estimate for the corresponding partitioning point, and advance the axis to be reorganized cyclically.

Control function (C1) has the following property if we assume that the expansion of the file is controlled by the storage utilization: the partition of the key space remains invariant if after an expansion of the file a record is always inserted and a different record is deleted. Thus the distribution of the records in key space may change without the partition of the key space adapting. Such undesirable behavior is prevented by control function (C2). Furthermore, (C2) exhibits the following desirable property: before computing a new estimate for a quantile approximation, we can decide whether the old estimate already suffices; a sketch of one possible realization of this test is given below. Control function (C3) determines the worst estimate on each axis. Thus it requires computation for all partitioning points and a search for the maximum difference between present estimate and expected value. Obviously, this needs intensive computation. Summarizing, we can say that (C2), (C3), or a combination of both is an adequate control function for the reorganization of the file.
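The paper does not spell out how the confidence interval I_β is constructed. One plausible realization of the (C2) test, sketched under the assumption of a normal approximation to the binomially distributed count of records below the estimate:

```python
from math import sqrt

def accept_estimate(m, n, alpha, z=2.58):
    """Hypothetical (C2) test (our assumption): if pp_i(alpha) were the true
    alpha-quantile, the count m of the n stored records below it would be
    Binomial(n, alpha); accept the estimate if m/n lies in the
    normal-approximation confidence interval around alpha.
    z = 2.58 corresponds to beta of about 0.01."""
    half_width = z * sqrt(alpha * (1.0 - alpha) / n)
    return abs(m / n - alpha) <= half_width

# Example: 15,000 records, 7450 of them below the estimated median:
print(accept_estimate(7450, 15000, 0.5))   # True: 0.4967 lies within 0.5 +- 0.0105
```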
As mentioned before, during a reorganization one partitioning hyperplane is moved. If this move is done in one step, it requires O(n^{1-1/d}) page accesses, where n is the number of records presently in the file. Although this may sound very high, O(N^{1-1/d}) page accesses are required by the grid file when adding a new partitioning hyperplane to the grid directory, where N (≥ n) is the number of directory entries. If a hyperplane is moved in one step, our method (and the same is true for the grid file) loses its dynamic character. Since reorganization only improves retrieval performance and is not as crucial as restructuring in the grid file, we will amortize the time required for moving one partitioning hyperplane over a sequence of insert and delete operations. Thus, step by step, we will adjust each pair of chains separated by the hyperplane to be moved to the new estimate. We call a reorganization local if the expected number of page accesses during the reorganization is constant. The corresponding control function for the reorganization of the file is called linear. A local reorganization of the file is only possible for hashing schemes which allow overflow records.

As we did for the expansion of the file, we define the following variables: ra ∈ {1, ..., d} denotes the axis (dimension) whose partitioning points will be recomputed next. We call this axis the reorganization axis. The pair of chains to be reorganized next is specified by d reorganization pointers rp_1, ..., rp_d. Now the chain G(rp_1, ..., rp_d) and its right neighbor chain with respect to axis ra will be adjusted during the next local reorganization.

After inserting a record, we will either expand the file or perform a local reorganization of it. More precisely, after each insertion which does not require an expansion, a local reorganization will be carried out. This strategy is supported by the following observation obtained from experimental runs of our implementation: for most nonuniform distributions the expected number of page accesses for an insertion (including the cost for local reorganization and expansion) is considerably lower for quantile hashing than for other schemes that do not use the quantile method. Since this phenomenon is rather unexpected, it justifies our strategy all the more.

Now we will present an algorithm for local reorganization using control function (C2):
ALGORITHM FOR LOCAL REORGANIZATION.

1. For chain Next = G(rp_1, ..., rp_d) compute the right partitioning point pp_ra(α_r) on axis ra;
2. IF rp_i = 0 for all i ∈ {1, ..., d}\{ra}
   THEN compute for a given β ∈ (0,1) a confidence interval I_β for the α_r-quantile x_ra(α_r);
        IF pp_ra(α_r) ∈ I_β
        THEN rp_ra := rp_ra + 1;
             IF rp_ra = 2^{L_ra} - 1
             THEN ra := ra MOD d + 1;
                  rp_i := 0 for all i ∈ {1, ..., d}
             END
        ELSE compute a new estimate for the α_r-quantile using Equation (4)
        END
   END;
3. Adjust chain Next and its right neighbor chain with respect to axis ra to the new estimate for pp_ra(α_r);
4. Advance (rp_1, ..., rp_d) to the next value.
Fig. 4. Moving the hyperplane H_2(1/2) using local reorganization.
First let us mention that this algorithm does not require any disk access if pp_ra(α_r) ∈ I_β is fulfilled. This situation will typically occur for uniform distributions. Second, step 3 of the algorithm is the essential part, where a hyperplane is moved. Now we will illustrate steps 3 and 4 of the algorithm using the following example.
EXAMPLE. Let us consider the situation in Figure 3, and let us assume that the estimates of the quantiles are stored in the nodes of the binary trees. Thus we have the following parameters: d = 2, L_1 = 2, L_2 = 2, L = 4. Furthermore we consider rp = (0, 1) and ra = 2, and assume that the file will not be expanded during the following four insertions. This implies that the reorganization algorithm is executed after each of these insertions. An insertion of a record triggers the computation of a new estimate of the 1/2-quantile y(1/2) with respect to the second dimension; see the left side of Figure 4. Then we adjust the page Next = G(rp) to the new estimate by removing all corresponding records from the upper neighbor page. We advance the reorganization pointer to the next page, which is still delimited by the old estimate of y(1/2). After two more insertions we have adjusted the partition of two more pairs of pages to the new estimate; see Figure 4. Eventually, after one more insertion, we adjust the last pair of pages and reorganize the binary tree of the second dimension (Figure 5).

Fig. 5. Update of the binary tree after moving the hyperplane H_2(1/2).
5. EXPERIMENTAL RESULTS
In order to demonstrate the performance of quantile hashing for nonuniform distributions, we have implemented it in MODULA-2 on an Olivetti M24 PC
under MSDOS. In MOLHPE the following parameters were chosen: the control function for the expansion of the file is expansion after 28 insertions, the capacity of a primary bucket is 31 records, and that of a secondary bucket is 7 records. From all the experiments, we select the following two for demonstrating the graceful adaptation of quantile hashing to a nonuniform distribution.

EXPERIMENT 5.1.
For d = 2, K = (K_1, K_2) ∈ [0,1)^2, we have

Pb(K_i < 1/2) = 1/4,   Pb(K_i ≥ 1/2) = 3/4,

where Pb denotes the probability with which K_i is in the specified interval, i = 1, 2. However, we only accept component keys K_i which follow the distribution and are not in [0.6, 0.7]. The application of the quantile method to MOLHPE yields the partition of the key space depicted in Figure 6. On the left side of Figure 7 we have plotted the average number of page accesses in a successful exact-match query, with and without the quantile method, as a function of the level of the file. The right side of Figure 7 depicts the average storage utilization with and without the quantile method for various levels of the file.
Fig. 6. Partition of the key space generated by Experiment 5.1.
Let us remark that level 9 corresponds to 15,000 records in the file and that the implemented version of MOLHPE uses one partial expansion per doubling of the file. Thus the performance of this kind of MOLHPE is the same as that of the schemes proposed in [1] or [18]. The advantage of the quantile method is obvious. The average number of page accesses in a successful exact-match query is reduced from more than 4 to less than 1.5; the average storage utilization is improved from approximately 60% to more than 75%. Let us remark that although the grid file is tailored to best exact-match query performance, it is outperformed by MOLHPE combined with the quantile method. It will be interesting to observe the superlinear growth of the grid directory for this nonuniform distribution.

EXPERIMENT 5.2. For d = 2, K = (K_1, K_2) ∈ [0,1)^2, the component keys follow a Gaussian distribution with expected value 0.5 and variance 1. Furthermore, the component keys are scaled by a constant factor to map them into the unit interval. The application of the quantile method yields the partition of the key space depicted in Figure 8. This partition, depicted for level 10, gives an idea of the graceful adaptation.
Fig. 7. Performance parameters of Experiment 5.1: left, the average number of page accesses in a successful exact-match query; right, the average storage utilization, as a function of the level of the file.
On the left side of Figure 9 we have plotted the average number of page accesses in a successful exact-match query, with and without the quantile method, for levels 8, 9, 10. Here level 10 corresponds to 30,000 records. Since the performance of our scheme without the quantile method is cyclic and stationary, we did not extend this experiment beyond level 10. However, the right side of Figure 9 clearly shows the positive trend in the performance of the quantile method with increasing file size. We do not depict the storage utilization; using the quantile method, it exceeds 70% for levels higher than 9. The improvement in page accesses for quantile hashing is even greater in this experiment than in the previous one. For level 9 the number of page accesses is improved from approximately 8 to less than 2. A worst-case comparison of exact-match queries is clearly to the advantage of the grid file. But even for this type of query, quantile hashing outperforms the grid file on average for large enough files. For files with about 10^6 records we expect quantile hashing to perform successful exact-match queries in practically one disk access. The greatest advantage of quantile hashing over the grid file for nonuniform distributions will be realized for complex queries such as range queries, where the growth of the directory influences the retrieval performance.
Fig. 8. Partition of the key space generated by Experiment 5.2.

Fig. 9. Average number of page accesses in a successful exact-match query as a function of the level of the file.
6. CONCLUSION
In this paper we have dealt with multidimensional hashing schemes without directory which generalize the concept of linear hashing to multidimensional keys. We have proposed a new scheme without directory, called multidimensional quantile hashing, which combines the quantile method and multidimensional order-preserving linear hashing with partial expansions (MOLHPE). In comparison with previous schemes without directory, quantile hashing significantly improves retrieval performance for independent nonuniform distributions. This behavior is achieved by allowing a nonequidistant grid partition, which adapts to the underlying distribution of the keys. First, we gave a brief review of MOLHPE, which is the multidimensional counterpart of linear hashing with partial expansions. Then we introduced stochastic approximation processes, in particular the quantile method, which estimates the quantiles using the records presently in the file. In order to guarantee dynamic insertion and deletion, we suggested the concept of local reorganization, where hyperplanes of the grid are moved step by step. Finally, we demonstrated the performance of quantile hashing in an experimental comparison with previous schemes without directory.

Future work in this area should deal with (1) more efficient overflow strategies than those considered in this paper; (2) the application of concepts like parallel retrieval algorithms and concurrency control, which have been proposed for linear hashing (including these concepts, our scheme is suitable for use as an efficient index in a database system); (3) performance comparisons of our scheme using real data occurring, for example, in geographical applications; (4) the design of a scheme which performs well for sorted insertions (one solution to this problem is presented in [7]).

We are thankful to the anonymous referees for their helpful comments. Additionally, we are grateful to Professor Peter C. Lockemann, who made it possible for Bernhard Seeger to work at the Computer Science Department of the University of Karlsruhe.
REFERENCES

1. W. A. Burkhard, Interpolation-based index maintenance, BIT 23:274-294 (1983).
2. W. A. Burkhard, Index maintenance for nonuniform record distributions, in Proceedings of the 3rd ACM SIGACT/SIGMOD Symposium on PODS, 1984, pp. 173-180.
3. R. Fagin, J. Nievergelt, N. Pippenger, and H. R. Strong, Extendible hashing: A fast access method for dynamic files, ACM Trans. Database Systems 4(3):315-344 (1979).
4. H. P. Kriegel, Performance comparison of index structures for multikey retrieval, in Proceedings of the ACM SIGMOD International Conference on Management of Data, 1984, pp. 186-196.
5. H. P. Kriegel and B. Seeger, Multidimensional order preserving linear hashing with partial expansions, in Proceedings of the International Conference on Database Theory, Lecture Notes Comput. Sci. 243, 1986, pp. 203-220.
6. H. P. Kriegel and B. Seeger, Multidimensional dynamic quantile hashing is very efficient for nonuniform record distributions, in Proceedings of the International Conference on Data Engineering, 1987, pp. 10-17.
7. H. P. Kriegel and B. Seeger, PLOP-hashing: A grid file without directory, in Proceedings of the International Conference on Data Engineering, 1988, pp. 369-376.
8. H. J. Kushner and D. S. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems, Appl. Math. Sci. 26, Springer-Verlag, 1978.
9. P. A. Larson, Linear hashing with partial expansions, in Proceedings of the 6th International Conference on VLDB, 1980, pp. 224-232.
10. P. A. Larson, Linear hashing with overflow handling by linear probing, ACM Trans. Database Systems 10(1):75-89 (1985).
11. W. Litwin, Linear hashing: A new tool for file and table addressing, in Proceedings of the 6th International Conference on VLDB, 1980, pp. 212-223.
12. J. Nievergelt, H. Hinterberger, and K. C. Sevcik, The grid file: An adaptable, symmetric multikey file structure, ACM Trans. Database Systems 9(1):38-71 (1984).
13. J. Nievergelt and K. Hinrichs, Storage and access structures for geometric databases, in Proceedings of the International Conference on Foundations of Data Organization, 1985, pp. 335-345.
14. E. J. Otoo, A mapping function for the directory of a multidimensional extendible hashing, in Proceedings of the 10th International Conference on VLDB, 1984, pp. 491-506.
15. E. J. Otoo, A multidimensional digital hashing scheme for files with composite keys, in Proceedings of the ACM SIGMOD International Conference on Management of Data, 1985, pp. 214-229.
16. E. J. Otoo, Balanced multidimensional extendible hash tree, in Proceedings of the 5th ACM SIGACT/SIGMOD Symposium on PODS, 1986.
17. M. Ouksel, The interpolation based grid file, in Proceedings of the 4th ACM SIGACT/SIGMOD Symposium on PODS, 1985.
18. M. Ouksel and P. Scheuermann, Storage mappings for multidimensional linear dynamic hashing, in Proceedings of the 2nd ACM SIGACT/SIGMOD Symposium on PODS, 1983.
19. K. Ramamohanarao and R. Sacks-Davis, Recursive linear hashing, ACM Trans. Database Systems 9(3):369-391 (1984).
20. M. Regnier, Analysis of grid file algorithms, BIT 25:335-357 (1985).
21. H. Robbins and S. Monro, A stochastic approximation method, Ann. Math. Statist. 22:400-407 (1951).
22. J. T. Robinson, The K-D-B-tree: A search structure for large multidimensional dynamic indexes, in Proceedings of the ACM SIGMOD International Conference on Management of Data, 1981, pp. 10-18.
23. B. Seeger and H. P. Kriegel, Design and implementation of spatial access methods, in Proceedings of the 14th International Conference on VLDB, 1988, pp. 360-371.
24. M. Tamminen, The extendible cell method for closest point problems, BIT 22:27-41 (1982).
25. K.-Y. Whang and R. Krishnamurthy, Multilevel Grid Files, draft report, IBM Research Lab., Yorktown Heights, 1985.

Received 3 February 1988