q:,.“) (4) Range of p: on x- axis (e.g., p:,, > q,,,) The relations are chosen in a way that the number, the type (restricted on one or on both ends) of constraints and the axes involved differ. The performance of each indexing method is expected to be Table 1 Direction relations between objects and mapping to range constraints Relation R between objects
Range constraints R’ for MBRs p’ to be retrieved 1 constraint
alZ_north( p, q) = 'Pi'qj(Pi,y
Example
between MBRs
on y-axis:
PI!., >s:.,
'4J.v)
example: an(NL, AU)
some-north( P. q) =
2 constraints
3Pivq, VP,%, gPi%,
(4\.,
(P,., ’ q,.,) A (Pa., ’ q,.y)h (P,.,
on y-axis:
example: sn(BU, GR)
4) =
1 constraint
(Pt., ’ qJ.Y)A [(Pj,, ‘4j.X) h(PC.Y ’ qI.Y)l A [(P,.x
(PL, ‘4:J 2 constraints
all_strictly_north(P, ‘Pi’qj vpi3qj
VPi3qj
example: asn(GE, IT)
on y-axis: on x-axis:
(41.X
317
Y. Theodoridis et al. / Data L Knowledge Engineering 27 (1998) 313-336 Table 1 Continued Relation R between
Example
objects
Range constraints R’ for MBRs p’ to be retrieved
some_strictly_north(p, q) =
2 constraints
gpivq, gpigq, VP,%, VP&,
(41.,
b., > q,.J A (pi., < q,.,) A [(P,,,‘q,.x)A(P,,, [(P,,,
‘s,.Y)lh ‘q,..J
on y-axis:
(4.X
example: ssn( BE, FR)
all_north-east(p, q) =
1 constraint
vp,vq,
(P;., ’ 4A.J 1 constraint on x-axis:
[(P,,x’q,,JA(PL,
‘q,J
CPlx ’ 4:.x)
example: ane(DE, NL)
some_north_east( p. q) = 3Pivqj 3P;%, VpJq,
L(P8.x‘4j.x)“(PJ.Y (Pt., <$.,)A [(P,,,‘q,.x)A(P,,,
example sne(AU, CH)
on y-axis:
2 constraints ‘ql..Y)l A ‘q,.J
on y-axis:
(41.,
similar for relations with sim&r range constraints. Therefore the behaviour of an indexing method to any spatial relation can be predicted from other relations with similar range characteristics [35]. Because MBRs differ from the actual objects they enclose, they are not always adequate to express the relation between the actual objects. For this reason, spatial queries involve the following two-step strategy [ 12,221: ?? First a filter step based on MBRs is used to rapidly eliminate MBRs of objects that could not possibly satisfy the query and select a set of potential candidates. ?? Then during a rejinement step each candidate is examined (by using computational geometry techniques) and false hits are detected and eliminated. Unlike topological relations, where the refinement step is the rule [26], only two direction relations of Table 1, some_strictly_north and some_north_east, need a refinement step. For the rest of the
Y. Theodoridis et al. I Data & Knowledge
318
Engineering
27 (1998) 313-336
relations all retrieved MBRs correspond to objects that satisfy the query. Details about the retrieval of direction relations using MBRs can be found in [28]. In the next sections we will show how the above results can be applied to indexing techniques available in commercial DBMS (namely, B-trees [5], KDB-trees [30] and R-trees [15,2]).
3. Retrieval of direction
relations
The retrieval of direction relations in existing DBMSs could be accomplished either by maintaining traditional indexes (e.g. B-trees) or, alternatively, by incorporating Abstract Data Types (ADTs) with specialised indexes defined by external code (e.g. KDB-trees or R-trees). Application developers should decide which method is the most appropriate for their application needs. In the rest of the section we describe how B-, KDB- and R-trees can be used to retrieve direction relations and provide analytical formulas that estimate the performance of the three methods. Existing formulas focus on traditional retrieval (match or range queries on B-trees and KDB-trees, and overlap queries on R-trees). Our work extends previous research by estimating the expected cost (i.e., number of disk accesses) for the retrieval of direction relations and the corresponding range queries. In the following discussion we assume that both data and query rectangles follow a random (i.e., uniform-like) distribution over the unit square address space. Table 2 lists the symbols to be used hereafter in our discussion. 3.1. Retrieval of direction relations using B-trees The first solution for the retrieval of direction relations includes the maintenance of a group of four alphanumeric indexes, such as B-trees’, each corresponding to one of the four numbers: p:,,, pi,,, p:,,, Table 2 List of Symbols Symbol
Description
P
Spatial object A point of object p x and y coordinates of point p, MBR of p Ordered pair of lower (pi) and upper (p:) point of p’ Size of p’s projection on axis x and y respectively Spatial relation Set of range constraints for R Retrieval cost (in terms of disk accesses) of relation R Range for B-tree range query Query windows for KDB-tree (lower or upper) range query Query window for R-tree range query R-tree node rectangle
P, Pw
P,.,
P’ (Pi.
P:)
IPL IPI,
R R’ C(R) r
Q’, Q” Q P
* In particular, the implementation used in existing systems is the B+ -tree index [20] and, in this paper, we will think of B’-trees when we use the term B-trees.
Y. Theodoridis et al. I Data & Knowledge Engineering 27 (1998) 313-336
319
P :,y. The processing of a query of the form ‘find all objects p that satisfy a given direction relation R with respect to ob_ject 4’ using the above set of four B-trees involves the following steps: Step 1. Depending in the relation to be retrieved, select the B-tree(s) to be searched (according to the last column of Table 1). Step 2. Search each B-tree involved to find the corresponding answer sets. Step 3. If more than one indexes are involved, find the intersection set (a ‘realistic’ assumption is that this procedure is executed in main memory). Step 4. If necessary, follow a refinement step for the selected objects. As an example, consider the query: ‘Find the countries p all_north_east of Switzerland (CH)‘. In this case, two out of four B-trees, those for p;,, and plx parameters, need to be searched, because only P I, and p;,, participate in the range constraints for all_north_east. The two trees are illustrated in Fig. 2a3 where each label indicates the appropriate coordinate for the corresponding object p. The sets {CZ’, LU’, BE’, PL’, UK’, NL’, IR’, DE’, SW’, NO’, FI’, IC’} and {SW’, CZ’, PL’, YU’, HU’, FI’, RO’, AL’, GR’, BU} are the two answer sets. The intersection set {CZ’, PL’, SW’, FI’} contains the object IDS that satisfy the query (illustrated as the dark shaded area in Fig. 2b). A refinement step is not needed for the retrieval of all_north_east; that is, all retrieved MBRs correspond to objects that satisfy the query. If the data keys are stored in a B-tree of height h with L leaf nodes, then the average cost C(r) for the retrieval of a constraint with length (range) r is [6]: C(r) = h + r *L
(1)
B-tree for p’,y
B-tree for p’
1.x
64 Fig. 2. Retrieval of relation all_norrh_easr
(b) using B-trees.
3 For illustration reasons, in the examples of the tree structures we assume a branching factor of 4, i.e., each node contains at most four entries.
320
Y. Theodoridis et al. I Data & Knowledge Engineering 27 (1998) 313-336
For the computation of the cost C(r) we need estimated values for the parameters of Eq. 1, namely h, L, r. Let m be the maximum number of entries in a B-tree node, c the average capacity of a node, and N the total number of keys stored in the leaf nodes. The following equations [l] describe the average h and L values: h= 1+[1og_~]
(2)
L=Nl(c*m)
(3)
What remains in order to have a complete expression for Eq. 1 is the value of parameter r, which depends on the particular constraint. Table 3 provides the values of Y for the possible range constraints of Table 1, if we assume that 191, - lqlv is the size of a data object MBR, and 1q1, * /qlY the size of a query object MBR (the constants refer only to x- coordinate since constraints for y- coordinate can be expressed in a similar way). Using information from Table 3 and Eqs. 2 and 3 we can estimate the expected cost for each constraint (Eq. 1) and, summing up, the expected cost C(R,K) for the retrieval of a spatial relation R with k constraints (1 s k d 4), i.e.,
C(R, k) = i C(r,) = i (h + ri. L) i=l i=l
(4)
The cost depends significantly on the particular direction relation because the number of B-trees to be searched is equal to the number of range constraints involved in the definition of that relation. All-north involves only one constraint (p;,, > q: ,), while some_strictly_north contains four: (q:,, < p:., < q:,JJ A (I$, > q:,,) A (41.x < p:,, < 4:.x) A (4,1.x
Table 3 Average values for range r of a constraint Average range r
Constraint ~:,~<4:,,
orp:,,>91,c
orp:x>9:,x
orp:,x<9q:,x
r = (1 - IsI,)/
4i,,
r = 141,
pi., %,
r = (1 + Isl,Y2
Or p:.?L
r = (1 - (2. IPI, + MN2
Y. Theodoridis et al. I Data & Knowledge Engineering 27 (1998) 313-336
321
sets. Table 4 presents the query windows for each direction relation, assuming the unit work space [O,l] as global space. Query windows Q’ and QU refer to the KDB-trees for the lower and upper MBR comer points, respectively, expressed as a set of four values that correspond to the coordinates (Q:,,,
Qt.,. QL,,, Q:,,) and (Q;txT Q;,, Q:,xy QiJ
respectively.
Consider again the query: ‘l%nd the countries p all_north_east of Switzerland (CH)‘. In this case, only one out of the two KDB-trees needs to be searched, because only one query window (Q’) is specified (Fig. 3b). The involved KDB-tree is illustrated in Fig. 3a where each label indicates the Table 4 Query windows for the retrieval of direction relations using GLIB-trees Relation
Range query Q’
aZl_north(p, q)
(0, q;,,> 1, 1)
Range query Q”
some_north(p, q)
(0, q:,“> 1, 1)
all_strictly_north( p, q)
(q:,x. q:,YT q:,,* 1)
some_strictly_north(p, q)
all_north_eas?( p, q)
some_north_east( P, 4)
Illustration
Y. Theodoridis
322
et al. I Data & Knowledge
Engineering
27 (1998)
(4 Fig. 3. Retrieval
313-336
(b) of relation ali_norfh_easr using KDB-trees.
lower left point pi for each object MBR p’. The nodes to be searched are X and Y (1st level entries), D, F and G (2nd level entries), i.e., the ones that intersect Q’ (these node entries are coloured grey in Fig. 3a). Among the entries of nodes D, F and G, the ones that satisfy the range query are CZ’, PL’, SW’ and FI’. To our knowledge, there do not exist analytical models that describe the retrieval cost of KDB-trees in the literature. Robinson [30] estimates that the expected cost of a window query with size Q should be C(Q) = Q * P, where P is the total number of nodes (pages) of the tree structure. However, this estimation is oversimplified. In order to facilitate analysis, we consider a KDB-tree as a simple R-tree with two properties: (a) point data, and (b) non-overlapping nodes. According to [23,1 I], the estimated retrieval cost (i.e., number of disk accesses) of an overlap query with size 191,. ki, using R-trees (and consequently KDB-trees) is:
(5) where h is the height of the tree structure, m is the average capacity of a node, N is the node size at level j (the root is assumed at The expression for computing the height h = 1+
the expected N,
is the maximum total number of level j = h and h of the R-tree
N
i
log,.., ~ c*m 1 ’
number of entries in a KDB-tree node, c data entries, and n, * . nJ,y is the average the leaf-nodes at level j = 1). (or KDB-tree) is [ 111:
(6)
number of nodes N, at level j is:
=L (c . rn)]
and the average node sizes nj,, and nj.? are given by:
(7)
Y. Theodoridis et al. I Data & Knowledge Engineering 27 (1998) 313-336 nj,x = nj,y =(l/Nj)‘”
323 (8)
Using information from Table 4 and Eqs. 6, 7 and 8 we can estimate the cost for each query window Q’ and Q” (using Eq. 5) and, summing up, the expected cost C(R) for the retrieval of a direction relation R: C(R) =
C(le'i,. IQ?,) + C(iQ"I,, IQ“,)
3.3. Retrieval of direction
(9)
relations using R-trees
The R-tree multidimensional data structure [15] stores the MBRs of the actual data objects in the leaf nodes and each intermediate node stores rectangles (node rectangles) which enclose all rectangles that correspond to lower level nodes. The fact that R-trees permit overlap among node entries sometimes leads to unsuccessful hits on the tree structure. Several variants of the original method [33,2] have been proposed to address the problem of performance degradation caused by the overlapping regions or excessive dead-space (space in the intermediate nodes, not covering lower level nodes) respectively. In order to retrieve objects that satisfy a direction relation with respect to .a reference object q we have to specify the MBRs p’ that could approximate such objects (Table 1) and then to search the R-tree nodes that contain these MBRs. Table 5 presents the constraints for the node rectangles P for each direction relation of Table 1. Notice that the same relation between node rectangles and the reference MBR holds for all the levels of the tree structure. For instance, the intermediate nodes that could enclose other (intermediate or leaf) nodes P that satisfy the general constraint (P,,, > q:,,) should also satisfy the same constraint. This conclusion is applicable to all direction relations. Fig. 4 shows how the MBRs of Fig. lb are grouped and stored in an R-tree. At the lower level MBRs of countries (denoted by two letters) are grouped into seven leaf nodes A, B, C, D, E, F and G. At the next level, the seven leaf nodes are grouped into two larger intermediate nodes X and Y, which in turn compose the root node of the tree. If we consider the query ‘Find the countries p all_north_east of Switzerland (CH)‘: the nodes to be searched are X and Y (1st level), B, E, F and G (2nd level) i.e., the ones that satisfy the range constraint (Pu,, > q:,y)~ (P,,, > q:,,). Among the entries of leaf nodes B, E, F and G, the countries all-north-east of Switzerland are SW’, FI’, CZ’ and PL’. We have implemented and tested the most popular R-tree variations on the retrieval of direction relations using MBRs of variable sizes [28]. R*-trees [2] have consistently better performance than Table 5 Constraints
for the R-tree node rectangles
Relation
all_north(p, q) some_north( p, q) all_strictly_north( p. q) some_strictly_north( p, q) all-north-east( p, q) some_north_east( p, q)
Constraints
for the node rectangles
P to be searched
(P”., ’ q:.,) (P”., ’ 4L.Y) A (Pl., < 4L.J CPU., 4L.,)h(P,.x
Y. Theodoridis et al. I Data & Knowledge Engineering 27 (1998) 313-336
324
(a)
(b)
Fig. 4. Retrieval of relation all_north_eastusing R-trees.
both R-trees [15] and Rf-trees [33] (which were shown to be inefficient for direction relations) and are used in the rest of the paper (hereafter, when we use the term R-tree, we imply the R*-tree). Most of the work in the literature [23,11,36] has dealt with the expected performance of R-trees for processing overlap queries, i.e., the retrieval of data objects p that share common area with a query window q. Q. 5 describes the expected cost of retrieval for both R- and KDB-trees, where the height of the tree is given by Eq. 6. The fact that the data objects MBRs and R-tree nodes at each level are possibly overlapping, makes the computation of the R-tree nodes size different and not straightforward as the KDB-tree one (Eq. 8). If Nj is the number of nodes at level j, and Dj is the density4 of node rectangles then the average sizes nj,Xand nj,, of node rectangles at each direction are given by the following formula [36]:
nj..x
=
yy
=(Dj/A$)"2
where (Dj_J'*
(10)
-
1 2
(11)
(c - m)“* and *j-l *j
=
(12)
G
Notice from Eqs. 11 and 12 that Dj and Nj can be computed recursively using D, and NO which denote the density D and the amount N, respectively, of the data object MBRs. Therefore, we can 4 We define density D of a set of N objects D = (Z;“, , s,/global area).
to be the sum of the object
areas
s, divided
by the global
area:
Y. Theodoridis et al. I Data & Knowledge Engineering 27 (1998) 313-336
325
......................... .............
..........
all_strictly_north@,q)
some-north @,q)
:
Fig. 5. Query windows for the estimation
of the retrieval cost.
estimate the retrieval cost of an overlap query just with the knowledge of two data set properties (density D and amount N) and the query window q. Since Eq. 5 expresses the expected performance of R-trees on overlap queries with a query window q, in order to estimate the cost of a direction relation R(p, q) we need the following transformation: Wp, 4) * overlaW,
Q) .
In other words, the retrieval of a direction relation using R-trees is equivalent (in terms of cost) to the retrieval of an overlap query using an appropriate query window Q. The necessary transformation Q for each direction relation R should take into consideration the corresponding constraint of the node rectangles in Table 5 (and not data MBRs), because access to nodes (not data) is the single factor for the retrieval cost [28]. The appropriate query windows Q for the range constraints of Table 5 are illustrated in Fig. 5. Using information from Fig. 5 and Eqs. 10, 11 and 12 we can estimate the expected cost for the query window Q (Eq. 5) which equals to the expected cost C(R) for the retrieval of a direction relation R:
C’(R)= C(IQI,TIQIy)= j=l i {&
y)}
*(nj,_x + IQIx)*(nj,y+ IQI
(13)
Intuitively R-trees perform better than other indexing methods in cases where many constraints are involved in the definition of the direction relation of interest. The experimental results of the next section demonstrate that this is actually the case when four constraints are involved, while for one constraint, B-trees have a better performance and for the intermediate relations (two or three constraints) KDB-trees are strong competitors.
4. Performance
comparison
In the previous section we have argued that the performance of each indexing method significantly depends on the particular range. In this section we present several experimental results that justify our
326
Y. Theodoridis et al. I Data & Knowledge
Engineering
27 (1998) 313-336
argument. In order to experimentally quantify the performance, we created tree structures by inserting 10 000 randomly generated MBRs. We tested three data $Zes: ?? the first file contains small MBRs: the average area of each rectangle is 0.003% of the global area (i.e., D = 0.3), ?? the second file contains medium MBRs: the average area of each rectangle is 0.25% of the global area (i.e., D = 25), ?? the third file contains large MBRs: the average area of each rectangle is 5% of the global area (i.e., D = 500). The search procedure used three query jifiles for each data file containing 100 rectangles, also randomly generated, with similar size properties as the data rectangles. We used the same data/query files for the retrieval of direction relations with B-trees, KDB-trees and R-trees. The performance of each data structure per relation (number of disk accesses for each data/query size combination) is illustrated in Fig. 6. As a first conclusion we notice the way that the data and query size affect the performance of the data structures. The performance of B-trees and KDB-trees depends significantly on the query size while for R-trees the data size is the main factor responsible for the high or low performance of the structure. The main conclusions that can be drawn from Fig. 6 are: ?? The B-tree is the most efficient structure for l-constraint relations only (e.g. all_north). When more range constraints (even on the same axis) are involved it is not competitive. ?? The KDB-tree is the most competitive index when two constraints for the same representative point are involved (e.g. alZ_north_east). It is also very competitive when both representative points are involved and the corresponding query windows are too selective (e.g. aZZ_strictZy_north, some_strictZy_north). ?? The R-tree is the winner when four constraints are involved (some_strictZy_north, some_north_east). However, when the data rectangles are large, the R-tree is not able to organise the data efficiently and, therefore, it usually loses against the B-tree or the KDB-tree even for relations involving many constraints. From the above results, it is evident that there is not an overall winner but the performance of the structures is affected by the number of constraints involved and the size of the data and query rectangles. In order to evaluate the quality of the analytical models proposed in Section 3, we compared the expected and the experimental cost for the retrieval of the direction relations. For the analytical estimation of the B-tree, ISDB-tree and R-tree indexing mechanisms we used the formulas of Eqs. 4, 9 and 13, respectively, and the following typical values: number of data N = 10 000, average node capacity c = 0.67, maximum number of entries in a node m = 126 or 84 or 50 (for B-trees, KDB-trees and R-trees, respectively). Table 6 lists the relative errors of the actual results compared to the predictions of the proposed analytical models for the various data/query size combinations. The estimations of the performance of all indexing methods are very close to the experimental results. In most cases, the relative error is lower than 15%. Only for large data and relations with large answer sets, such as all-north and all-north-east the relative error may be higher. Such analytical models are useful to spatial query optimisers because they permit cost prediction. Since the most appropriate index for some relations (aZZ_strictZy_north, some_strictZy_north etc.) is not known beforehand, but depends on the size of the data and query MBRs, the analytical models can be used to select the most suitable method.
321
Y. Theodoridis et al. / Data & Knowledge Engineering 27 (1998) 313-336 ..--.
-
------Ra
140-f
____----
120.;__----____-'--100 .. 80 " ,_._..__-..-_ 60 ..
--.____.-
40 " 2o.r
_. s/a
. 5hn
. m/s nvm
s/l
. ml
us
l/m
VI
nVl
Us
lim
VI
um
111
some-north
all-north lw)T
s/s
s/m
wl
ml5
dm
some_strictly_north
all_strictly_north 180
140 120 100 160 80 L
. ..-.*
*,-----..
*. “**'--\
.._/-___ loo
--7
__----___--50 f
0:
s/s
shll
all-north-east Fig. 6. Performance
s/l
m/s
mhll
nvl
us
some - north- east comparison
(disk accesses
per data/query
size combination).
Fig. 7 illustrates the expected performance of the three data structures for the direction relations implemented. For the analytical estimations we used data and query MBRs with the sizes of each side varying from 0.001 up to 0.256 of the global unit space (i.e., the area of each rectangle varied from 0.0012 up to 0.2562 of the global area) and assumed that the data and query MBRs are equal sized, a property which usually appears in real geographic (and other) applications. According to Fig. 7, the choice of the most efficient method depends on the data and query size for three out of six direction relations (some_north, aZZ_strictZy_north and some_strictZy_north). R-trees perform better for these relations when object size is small but, after a specified threshold, KDB-trees
328
Y. Theodoridis et al. I Data & Knowledge
Table 6 Relative error in estimating
the performance
Direction relation
Engineering
27 (1998) 313-336
of the tree structures
Relative error B-trees
KDB-trees
R-trees
all_north some-north all_strictly_north some_strictly_north all_north_east some_north_east
O-25% O-10% O-20% O-10% O-25% O-15%
O-30% O-10% 5-15% 5-15% 5-30% O-15%
O-25% O-10% 5-30% O-30% O-20% S-25%
Total average
5%
10%
15%
-. . . .
&_
---KDEb-w-R-lme
_.._.._.......-.-.._--.....__ 60
-...
-I.*
t
*.
04 0.001
8 0.002
0.m
0.008
0.016
0.022
0.084
0.126
01
0.258
I
0.001
0.002
0.004
all-north
0.006
0.016
0.022
0.064
0.126
0.266
some-north
MO-
180
120 ..
140 120
100 "
loo
60 .. W~.......................~~"----'
I
,’ ,’ ,' .,
-1'
40 *.
01 O.Wl
I
0.002
0.004
0.008
0.018
0.032
0.064
0.126
0.266
0.001
0.002
0.004
0.063
0.016
0.032
0.084
0.126
0.266
some_strictly_north
all_strictly_north zoo 180
-*.-.____
160 140 120 loo 60
---------------------
M
0.001
o.io2O.&l o.io6o.o1e
o&2
o.iiu
0.;26
0.k
0.002
0.004
0.606
0.016
0.032
some_north_east
all_north_east Fig. 7. Analytical
0.001
performance
comparison
(disk accesses per data-query
size).
0.064
0.126
0.268
Y. Theodoridis et al. / Data & Knowledge Engineering 27 (1998) 313-336
329
(or B-trees for the some-north relation) are the best choice. The knowledge of such a threshold is very important for a spatial query optimiser. It is also important to notice that the ordering of the three data structures per relation remains unchangeable for the experimental and the analytical results of Figs. 7 and 8, respectively, since query optimisers primarily need to know the best solution and, secondarily, the exact retrieval cost. In general, usually there is no overall winner even for a specified direction relation. Therefore, we study modifications or extensions of the three data structures which could be more suitable for certain range queries. In the next sections we propose such extensions of B-, KDB- and R-trees.
5. Extensions One- and multi-dimensional data structures have been designed to efficiently answer some particular types of queries. For example, B-trees index alphanumeric data in order to answer match or range queries, KDB-trees are efficient for match or range queries on multi-dimensional points and R-trees were designed for overlap queries on multi-dimensional MBRs. However, direction relations between spatial objects in 2D space do not follow directly one of these categories and, therefore, in Section 3 we presented the appropriate procedures in order to use one of these ‘classic’ data structures. In this section we propose extensions of the three data structures that facilitate efficiency for some 2D range queries.
5.1. Extensions of B-trees In the case of B-trees, we propose schemes for the maintenance of information regarding the MBR extents in the index; we call the proposed structures composite B-trees. As shown in Section 3, the number of B-trees to be searched is equal to the number of constraints that compose the query. This inconvenience may be overcome if the B-tree accommodates additional information regarding the MBRs in its leaf nodes. That is, a composite key may be maintained instead of a simple numerical value (i.e., the coordinate value of one MBR comer). This composite key consists of the coordinate values of the two MBR comers in one axis (x- or y- axis) or even of all four coordinates of the two MBR comers. One of these coordinates is the primary component, and based on its value the B-tree is built. The rest of the coordinates are used for the elimination of irrelevant MBRs. The key of the composite B-tree considered here consists of two values that represent the lower and upper coordinates of each MBR in one of the two axes. One of these values is the primary component. This scheme reduces each MBR into two line segments, which represent its extents over the two dimensions of the space and are indexed separately. Clearly, the efficient retrieval of direction relations can be achieved when four composite B-trees are maintained: B,,X-treethat keeps pj,, as the primary component and P:,~ as the secondary component, and, similarly, B,,X-tree, B,,,-tree and B,,,-tree. Some relations imply the search of only one B-tree (such as aZZ_north),while the rest imply exactly two searches (i.e., one for each axis such as alZ_north_east). In addition, provided that the information regarding the MBRs’ extents over each axis are maintained in two indices, the approach to satisfy a query is not unique. For instance, relation some-north requires no index to be searched for the x- axis, but it may be satisfied by searching either
Y. Theodoridis et al. I Data & Knowledge
330
Engineering
27 (1998) 313-336
a &-tree (condition to be fulfilled: q;,, Cp,‘,, < q:.,; refinement condition: p:.,. > q:,,) or a B,,,-tree (condition to be fulfilled: p:,, > q:,,; refinement condition: q,‘,y
nsn
anslamssn~m
mmaslsslmcsx
MWUYsJb=
direction relation
direction relstion
direction mlalion
(a) small data / query fig. 8. Performance
(b) medium data / query comparison
of the classic and the composite
(c) large data / query B-tree.
Y. Theodoridis et al. I Data & Knowledge Engineering 27 (1998) 313-336
331
Compared to the other methods, the composite B-tree is the most efficient one for large data when more than one constraints are involved (e.g. some_north and some_str-ictly_north for large data files), since R-trees are unable to index such data efficiently, but it is sensitive to query size; large query windows are not handled efficiently by composite B-trees [37]. 5.2. Extensions
of KDB-trees
A similar approach can be considered for KDB-trees. KDB-trees handle two-dimensional data efficiently when the search procedure involves one of the two-representative points of the MBRs. However, most of the direction relations involve both points and, as a consequence, the intersection of two answer sets should be computed (Step 3 in Section 3.2). The determination of both the answer sets and their intersection can be a highly time-consuming procedure when dealing with large databases. We propose the maintenance of the opposite MBR corner, along with the corner on which indexing is based, in the leaf nodes of a composite KDB-tree. The additional point will serve for the fast elimination of irrelevant MBRs. Clearly, the efficient retrieval of direction relations can be achieved when two composite KDB-trees are maintained: a KDB,-tree where indexing is based on the lower left comers of the MBRs; and, a KDBU-tree where indexing is based on the upper right comers of the MBRs. A range relation query may be satisfied by searching one of the two composite KDB-trees. Similarly to the composite B-tree the selection of the most effective tree is not trivial, and should consider: (a) the direction relation involved; (b) the query window size and position; and (c) the distribution of MBRs over the work space. The three schemes of the composite B-tree &in also be applied for the selection of the most efficient composite KDB-tree, provided that linear ranges are substituted by area ranges. We have implemented and tested the three schemes of composite KDB-trees and again found the third one to be the most efficient solution. A comparison of the classic KDB-tree and the third scheme is illustrated in Fig. 9. As a general conclusion, the composite KDB-tree outperforms the classic KDB-tree in most cases. The opposite happens only for the all-north and alZ_north_east relations where only one index is
an
snII1mssn~ar
direction
m
r&lion
“”
P”
UT
*
nsle%lanmEme
direction mlarion
(a) small data / query Fig. 9. Performance
(b) medium data / query comparison
of the classic and the composite
dimcfion mlatioa
(c) large data I query KDB-tree.
Y. Theodoridis
332
et al. 1 Data LB Knowledge
m
YI111mmmlMW
direction
111
M
Engineering
1111
UK
IDC
27 (1998) 313-336
sn
111
dirsclionrelation
mlstio~
(a) small data / query Fig. 10. Performance
(b) medium data / query comparison
*ml
PI)
UT
UT
directionmlation
(c) large data / query .
of the 2D R-tree and the ID R-tree.
accessed (see Table 4). Compared to the other methods, the composite KDB-tree is the most efficient one for the some-north relation and small or medium data, where the corresponding query window is too selective, but it is insufficient for the rest ones. 5.3. Extensions of R-trees R-trees handle two-dimensional data efficiently when the search procedure involves both axes of the work space. However, several direction relations, such as alZ_north and some-north, involve search on only one axis. In such cases, the information regarding the other axis, which is maintained in the two-dimensional R-tree is useless. Clearly, an 1D R-tree (i.e., a segment tree), which is an index of the MBR extents along the axis of interest, would be more efficient because it is more compact (tree nodes accommodate a larger number of entries, since two instead of four coordinates are stored per MBR) and effective (the MBR extents along the other axis do not affect the maintenance of the index) than the 2D R-tree. For the retrieval of direction relations that involve both axes, two ID R-trees are needed to index the MBR extents over each axis separately. Each index provides a set of MBRs, and the intersection of the two sets composes the qualified set. In such cases though, the 2D R-tree is expected to be more efficient. We have implemented and tested the 1D R-tree in comparison to the classic 2D version. The results are illustrated in Fig. 10. According to Fig. 10, the 1D R-tree outperforms the 2D R-tree only for the aZZ_north and some-north relations. In the other cases, where two ID indexes need to be searched, it is insufficient. Some-north is the only relation where the 1D R-tree is the overall winner, provided that data MBRs are not large.
6. Conclusion This paper describes implementations of direction relations for spatial database systems. Direction relations constitute a special type of queries for spatial access methods, which so far have been
Y. Theodoridis et al. I Data & Knowledge
Engineering
27 (1998) 313-336
333
concerned with traditional window queries, topological relations and nearest neighbour queries. Despite the fact that direction queries are of equal importance to previous ones they have not been extensively implemented. In this work we defined direction relations between points and used these definitions as a basis for relations between objects. For the purposes of the paper we used a set of six object relations based on the north direction, but a large number of additional ones can be defined depending on the application domain. Then we showed how these relations can be transformed to range queries and retrieved in existing DBMS using three alternative indexing techniques: B-, KDBand R-trees. We also provided analytical formulas for the expected performance which, in most cases, proved to be very accurate for all the indexing methods (relative error usually lower than 15%), a fact that renders the derived formulas suitable for query optimisers. The main conclusion that arises from the experimental and analytical comparison of the alternative indexing techniques is that there does not exist a single data structure that performs best in all queries but the performance depends on the following factors: ?? the number and the type of range constraints involved in the definition of the direction relations of interest, ?? the data size (i.e., the size of the primary MBRs), and ?? the query size (i.e., the size of the reference MBRs). Besides the classic implementations of the three data structures, appropriate extensions of them (namely composite B-trees, composite KDB-trees and ID R-trees, respectively) were also implemented in order to facilitate efficiency for some types of queries. The guidelines to a spatial query optimiser that should choose the most suitable indexing method for a specific input query are summarised in Table 7. These guidelines can be extended to include other spatial data structures (e.g. Grid files) and other direction relations. We currently work on the conjunction of the above guidelines with other data structures, which adopt different techniques for organising the spatial objects’ approximations [34]. Table 7 Decision rules for the efficient retrieval of direction relations Direction relation (as a set of range constraints)
B-trees Classic
1 constraint on one axis (e.g., all-north) 2 constraints on one axis (e.g., some-north) 1 constraint on the first axis PLUS 1 constraint on the second axis (e.g., all_north_east) 1 constraint on the first axis PLUS 2 constraints on the second axis (e.g., all_strictly_north) 2 constraints on the first axis PLUS 2 constraints on the second axis (e.g., some_strictly_north, some_north_east) “May not be efficient for large data. bMay not be efficient for large queries.
KDB-trees Comp.
Classic
R-trees Comp.
2D
1D
334
Y. Theodoridis
et al. I Data & Knowledge
Engineering
27 (1998)
313-336
Progress can also be achieved in specialised data structures for queries that involve direction relations or combinations of several types of spatial information (‘find the k-nearest land parcels northeast of a lake’).
Acknowledgements Yannis Theodoridis, Emmanuel Stefanakis and Timos Sellis were partially supported by the European Commission under the TMR project ‘CHOROCHRONOS: A Research Network for Spatiotemporal Database Systems’. Emmanuel Stefanakis and Timos Sellis were also supported by the Department of Research and Technology of Greece (YPER’94). Dimitris Papadias was supported by a DAG Grant from Hong Kong Research Grants Council.
References [l] D.S. Batory, B’ trees and indexed sequential Management
of Data,
files: A performance
Proc.
comparison,
ACM SIGMOD
Intl. Conf
on
1981.
[2] N. Beckmann, [3] [4] [5]
[6] [7] [S]
H.P. Kriegel, R. Schneider, B. Seeger, The R*-tree: An efficient and robust access method for points and rectangles, Proc. ACM SIGMOD Conf. Management of Data, 1990. T. Brinkhoff, HP Kriegel, R. Schneider, Comparison of approximations of complex objects used for approximationbased query processing in spatial database systems, Proc. 9th IEEE Intl. Conf on Data Engineering, 1993. A.G. Cohn, The challenge of qualitative spatial reasoning, ACM Computing Survey 27(3) (1995) 323-325. D. Comer, The ubiquitous B-Tree, ACM Computing Surveys 1 l(2) (1979) 121-137. Z. Cui, A.G. Cohn, D.A. Randell, Qualitative and topological relationships in spatial databases, Proc. of the 3rd Intl. Symp. on Large Spatial Databases (SSD), 1993. E. Clementini, J. Sharma, M. Egenhofer, Modeling topological spatial relations: Strategies for query processing, Intl. Journal of Computer and Graphics 18(6) (1994) 815-822. M. Egenhofer, Reasoning about binary topological relations, Proc. of the 2nd Intl. Symp. on Large Spatial Databases (SSD),
I 99 1.
[9] M. Egenhofer, Systems
J. Sharma, Assessing
the consistency
of complete and incomplete
topological
Geographical
information,
1( 1) ( 1993) 47-68.
[IO] C. Faloutsos,
T. Sellis, N. Roussopoulos, Analysis of object oriented spatial access methods, Proc. of ACM SIGMOD of Data, 1987. [1 I] C. Faloutsos, I. Kamel, Beyond uniformity and independence: Analysis of R-trees using the concept of fractal dimension, Proc. of the 13th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems (PODS), 1994. [12] A.U. Frank, MAPQUERY: Database query language for retrieval of geometric data and its graphical representation, Intl. Conf
on Management
ACM SIGGRAPH [13]
16(3)
A.U. Frank, Qualitative Systems
lO(3) ( 1996)
1141 M. Grigni,
(1982)
[ 151 A. Guttman, Management
Cardinal directions
as an example, Intl. Journal
of Geographic
Information
269-290.
D. Papadias,
Intelligence,
199-207.
spatial reasoning:
C. Papadimitrious,
Topological
inference,
Proc.
of the Intl.
Joint
Conf
of Artificial
1995.
R-trees: of Data,
A dynamic
index structure
for spatial
searching,
Proc.
of ACM
SIGMOD
Intl. Conf.
1984.
[ 161 D. Hernandez, [ 171 D. Hemandez,
Qualitative
Qualitative Representation
[18] A. Herskovits,
Language
distances,
Proc.
of Spatial Knowledge 2nd Intl. Conf
and Spatial Cognition
(Cambridge
(Springer Verlag LNAI, 1994).
on Spatial Information
University
Theory
Press, 1986).
(COSIT),
1995.
on
Y. Theodoridis et al. 1 Data & Knowledge Engineering 27 (1998) 313-336
335
u91 PD. Holmes, E. Jungert, Symbolic abd geometrc connectivity graph methods for route planning in digitized maps, IEEE Trans. on Pattern Analysis and Machine Intelligence 14(5) (1992) 549-565. 1973). WI D. Knuth, The Art of Computer Programming, Vol. 3, Sorting and Searching (Addison-Wesley, WI S.-Y. Lee, M.-C. Yang, J.-W. Chen, Signature file as a spatial filter for inconic image database, Journal of Visual Languages and Computing 3 (1992) 373-397. WI J. Orenstein, Spatial query processing in an object-oriented database system, Proc. of ACM SIGMOD Intl. Co@ on Management of Data, 1986. ~31 B. Pagel, H. Six, H. Toben, P. Widmayer, Towards an analysis of range query performance, Proc. of the I2th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems (PODS), 1993. space, Very Large Data ~241D. Papadias, T. Sellis, Qualitative representation of spatial knowledge in two-dimensional Bases Journal 3(4) (1994) 479-516. language, Journal of Visual Languages and Computing 6( 1) v51 D. Papadias, T. Sellis, A pictorial query-by-example (1995) 53-72. WI D. Papadias, Y. Theodoridis, T. Sellis, M. Egenhofer, Topological relations in the world of minimum bounding rectangles: A study with R-trees, Proc. of ACM SIGMOD Intl. Conj on Management of Data, 1995. 1271D. Papadias, M.J. Egenhofer, J. Sharma, Hierarchical reasoning about direction relations, Proc. of the 4th ACM Workshop on GIS, 1996. ml D. Papadias, Y. Theodoridis, Spatial relations, minimum bounding rectangles and spatial data structures, Intl. Journal of Geographical Information Science 11(2) ( 1997) 11l- 138. ~291D. Peuquet, Z. Ci-Xiang, An algorithm to determine the directional relationship between arbitrarily shaped polygons in the plane, Pattern Recognition 20( 1) (1987) 65-74. J.T. Robinson, The K-D-B-Trees: A search structure for large multidimensional dynamic indexes. Proc. of ACM 1301 SIGMOD Intl. Conf. on Management of Data, 1981. [311N. Roussopoulos, C. Faloutsos, T. Sellis, An efficient pictorial database system for PSQL, IEEE Trans. on Software Engineering 14(5) (1988) 639-650. 1321N. Rou~sopoulos, F. Kelley, F. Vincent, Nearest neighbor queries, Proc. of ACM SIGMOD Intl. Co@ on Management of Data, 1995. objects, Proc. of the [331T. Sellis, N. Roussopoulos, C. Faloutsos, The R’-tree: A dynamic index for multi-dimensional Intl. Co@ on Very Lurge Data Bases (‘VLDB), 1987. [341E. Stefanakis, Y. Theodoridis, T. Sellis, Y.-C. Lee, Point representation of spatial objects and query window extension: A new technique for spatial access methods, Intl. Journal of Geographical Information Science 1 l(6) (1997) 529-554. 1351Y. Theodoridis, D. Papadias, Range queries involving spatial relations: A performance analysis, Proc. of the 2nd Intl. Con& on Spatial Information Theory (COSIT), 1995. [361Y. Theodoridis, T. Sellis, A model for the prediction of R-tree performance, Proc. of the 15th ACM SIGACT-SIGMODSIGART Symp. on Principles of Database Systems (PODS), 1996. 1371Y. Theodoridis, D. Papadias, E. Stefanakis, Supporting direction relations in spatial database systems, Proc. of the 7th Intl. Symp. on Spatial Data Handling (SDH), 1996. Yannis Theodoridii received his Diploma Degree in Electrical Engineering in 1990 and his Ph.D. in Electrical and Computer Engineering in 1996, both from the National Technical University of Athens, Greece. Since 1996, he is a Research Engineer in the Knowledge and Database Systems Laboratory, National Technical University of Athens, Greece. His research interests include spatial and spatiotemporal database systems, multimedia database systems, and geographical information systems. Dr. Theodoridis has oublishe :d several articles in refereed ioumals and international conferences in the above areas. Dr. Theodoridis is a member of IEEE and ACM.
Dimitris Papadias is an Assistant Professor at the Computer Science Dept. of the Hong Kong University of Science and Technology. Before joining HKUST he held research and teaching positions at the Institute for Intelligent Robotics and Information Systems (Canada), the Electrical and Computer Engineering Dept., National Technical University of Athens (Greece), the Department of Geoinformation, Technical University of Vienna (Austria), the Department of Computer Science and Engineering, University of California at San Diego, the National Center for Geographic Information and Analysis, University of Maine (USA), and the Artificial Intelligence Research Division, German National Research Center for Information Technology (GMD).
336
Y. Theodoridis et al. I Data & Knowledge Engineering 27 (1998) 313-336
Emmanuel Stefanakis was born in Athens, Greece, in 1969. He received his Ph.D. deeree in Comouter Science. Department Gf Electrical gnd Computer Engineering, National Technical University of Athens, Greece, 1997, M.Sc.E. degree, Department of Geodesy 1 and Geomatics Engineering, Universitv of New Brunswick: Canad;, 1994, and DipLEng. degree (with distinction), Department of Rural and Surveying Engineering, National Technical University of Athens, Greece. 1992. He has received several scholarships and awards in Greece and Canada and has several publications in international journals, books and conferences. Since 1991 he has been involved in several research projects dealing with database and GIS applications. His research interests include Database Systems, Geographic Information Systems and Digital Mapping.
Timos Sellis received his Diploma Degree in Electrical Engineering in 1982 from the National Technical University of Athens, Greece. In 1983 he received his M.Sc. degree from Harvard University and in 1986 his Ph.D. from the University of California at Berkeley, where he was a member of the INGRES group, both in Computer Science. In 1986, he joined the Department of Computer Science of the University of Maryland, College Park, as an Assistant Professor in 1992. Between 1992 and 1996 he was an Associate Professor at the Computer Science Division of the National Technical University of Athens, in Athens, Greece, where he is currently a Full Professor. His research interests include extended relational database systems, active database systems, and spatial, image and multimedia database svstems. Prof. Sellis has oublished over 60 articles in refereed journals and international conferences in the above areas. He is a recipient of a Presidential Young Investigator (PYI) award for 1990-1995. He is a member of the editorial boards of the International Journal on Intelligent Information Systems, and GeoInformatica. Prof. Sellis is a member of IEEE and ACM.