Information and Software Technology 47 (2005) 785–795 www.elsevier.com/locate/infsof
Multi-way spatial join selectivity for the ring join graph

Jun-Ki Min (a), Ho-Hyun Park (b,*), Chin-Wan Chung (c)

(a) School of Internet-Media Engineering, Korea University of Technology and Education, Byeongcheon-myeon, Cheonan, Chungnam 330-708, South Korea
(b) School of Electrical and Electronics Engineering, Chung-Ang University, 221, Huksuk-Dong, Dongjak-Gu, Seoul 156-756, South Korea
(c) Department of Electrical Engineering and Computer Science, Korea Advanced Institute of Science and Technology, 373-1, Kusong-Dong, Yusong-Gu, Daejeon 305-701, South Korea

Received 1 April 2004; revised 8 January 2005; accepted 10 January 2005
Available online 7 April 2005
Abstract

Efficient spatial query processing is very important since the applications of spatial DBMSs (e.g. GIS, CAD/CAM, LBS) handle massive amounts of data and consume much time. Many spatial queries contain a multi-way spatial join because they compute relationships (e.g. intersect) among spatial data. Thus, accurate estimation of the spatial join selectivity is essential to generate an efficient spatial query execution plan that takes advantage of spatial access methods. For multi-way spatial joins, selectivity estimation formulae have been developed only for two kinds of query types, tree and clique. However, the selectivity estimation for the general query graph which contains cycles has not been developed yet. To fill this gap, we devise a formula for the multi-way spatial ring join selectivity. This is an indispensable step to compute the selectivity of the general multi-way spatial join whose join graph contains cycles. Our experiment shows that the estimated sizes of query results using our formula are close to the sizes of actual query results.

© 2005 Elsevier B.V. All rights reserved.

Keywords: Spatial data; Spatial join selectivity; Multi-way join; Databases
1. Introduction

In the past few decades, research on spatial database management systems (SDBMSs) has actively progressed since applications using spatial information, such as geographic information systems (GIS), computer aided design (CAD), multimedia systems and satellite image databases, and location based services (LBS), have increased.

The spatial join is a common spatial query type which requires a high processing cost due to the high complexity and large volume of spatial data. Thus, to reduce the overall processing cost, the spatial join is processed in two steps (the filter step and the refinement step) [5,11]. As shown in Fig. 1(a), the filter step evaluates whether tuples satisfy the constraints of a given spatial query, using the MBR (Minimum Bounding Rectangle) approximation. The refinement step (Fig. 1(b)) checks, using computational geometry algorithms, whether the candidate tuples (i.e. the outputs of the filter step) really satisfy the constraints of the given spatial query. This paper, like most related spatial database literature, focuses on query processing in the filter step [2,18].

Many spatial queries include the multi-way spatial join because spatial queries mainly compute relationships (e.g. intersect) among spatial data, such as 'Find all buildings which are adjacent to roads that intersect with boundaries of districts'. The multi-way spatial join combines m (m > 2) spatial relations using m − 1 or more spatial predicates. Since the multi-way spatial join combines tuples from m spatial relations into a single m-tuple whenever the combination satisfies the join conditions (e.g. intersect), the estimated number of results of the multi-way spatial join is:

\[ (\#\text{ of all possible } m\text{-tuples}) \cdot \mathrm{Prob}(\text{an } m\text{-tuple is a solution}) \tag{1} \]
* Corresponding author. Tel.: +82 2 820 5345; fax: +82 2 825 1584. E-mail addresses: [email protected] (J.-K. Min), hohyun@cau.ac.kr (H.-H. Park), [email protected] (C.-W. Chung).
0950-5849/$ - see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.infsof.2005.01.002
The front part of the above formula equals the cardinality of the Cartesian product of the m relations, and the latter part is the multi-way spatial join selectivity, which means the probability that an m-tuple satisfies the join predicates.

Fig. 1. Spatial join processing steps.

Due to the high complexity of the spatial join operation and the large volume of spatial data, an accurate estimation of the spatial join selectivity has a great influence on the query optimizer and the spatial database management system. Recently, cost models for several kinds of spatial joins have been studied [10,12–15,18]. In particular, for the multi-way spatial join, selectivity formulae have been developed only for two kinds of query types, tree and clique [14]. However, the formula for the selectivity of the general multi-way spatial join whose join graph contains cycles has not been derived yet. Therefore, the selectivity of a join whose graph contains cycles could not be estimated accurately. Instead, only the tree typed spatial join and the clique typed spatial join have been used as approximate estimates. However, the selectivity of the tree typed spatial join overestimates the size of the ring typed spatial join result since the tree typed spatial join does not have a cycle. Also, the selectivity of the clique typed spatial join underestimates the size of the ring typed spatial join result since the clique typed spatial join considers only mutually intersecting spatial data. Thus, we develop a formula for the selectivity of the ring typed spatial join. This is an indispensable step for computing the selectivity of the general multi-way spatial join whose join graph contains cycles. Traditionally in the database area, the selectivity estimation problem for a query graph that contains cycles has been considered very difficult [8]. Thus, this work should be considered a theoretical breakthrough that advances the selectivity estimation problem of the general multi-way spatial join whose join graph contains cycles.

Our contributions are as follows:

• find properties of the result that satisfies the constraints of the multi-way spatial ring join.
• devise a formula for the selectivity of the multi-way spatial ring join.
• show the accuracy of the proposed formula through experiments.

The rest of this paper is organized as follows. Section 2 describes the multi-way tree spatial join selectivity and the multi-way clique spatial join selectivity through a survey of the previous work. Section 3 describes the multi-way ring spatial join among various multi-way spatial joins. Section 4 contains experimental results on uniformly distributed data in the two-dimensional space, which show the accuracy of the proposed formula. Section 5 concludes this paper.
2. Preliminaries

In this paper, like most related spatial database literature, we assume that all spatial data are uniformly distributed in the d-dimensional unit work space, $WS = [0,1)^d$ ('uniformity assumption of the placement distribution' [9]), and that all spatial data are rectangles. The notations used in this paper are summarized in Table 1.

Table 1. Notations
Symbol         Description
$s_{R_i}$      A spatial object of relation $R_i$
$|s_{R_i}|$    The average length (on each dimension) of a spatial object in $R_i$
$\|S\|$        The number of elements of set or relation $S$

Formally, a multi-way spatial join can be expressed as follows [14]:

• Given m relations $R_1, R_2, \ldots, R_m$ and a query Q, where $Q_{ij}$ is the binary spatial predicate between $R_i$ and $R_j$, find all m-tuples $\{\{s_{R_1}, s_{R_2}, \ldots, s_{R_m}\} \mid \forall i,j : s_{R_i} \in R_i,\ s_{R_j} \in R_j \text{ and } Q_{ij} = \mathrm{TRUE}\}$.

A multi-way spatial join can be modeled by a query graph $G_Q$ whose nodes represent relations and whose edges represent spatial predicates. Various spatial conditions (intersect, meet, include, etc. [4,6]) can be applied as spatial join predicates. However, following the standard approach in the spatial join literature, intersect (not disjoint) is considered the default join predicate. Papadias et al. [16] showed how the spatial selectivity using intersect could be applied to the other spatial predicates.

2.1. Selectivity for the multi-way spatial tree join

Huang et al. [7] and Theodoridis et al. [18] provided the formula for the 2-way spatial join selectivity. As shown in Fig. 2, in order for a spatial object $(= s_{R_i})$ of relation $R_i$ and a spatial object $(= s_{R_j})$ of relation $R_j$ to intersect, assuming the location of $s_{R_i}$ is fixed, an end point of $s_{R_j}$ (denoted by •) should exist within the area (dotted area in Fig. 2) that is an extension of $s_{R_i}$ by the size of $s_{R_j}$ in each dimension. Equivalently, assuming the location of $s_{R_j}$ is fixed, an end point of $s_{R_i}$ (denoted by +) should exist within the area (not shown in Fig. 2) that is an extension of $s_{R_j}$ by the size of $s_{R_i}$ in each dimension. Thus, the 2-way spatial join selectivity is represented by Eq. (2).

\[ (|s_{R_i}| + |s_{R_j}|)^d \tag{2} \]

Fig. 2. Intersect condition of two rectangles.

The multi-way tree join is a multi-way spatial join whose join graph is a tree. Based on Eq. (2), Papadias et al. [14] provided a formula for the multi-way tree join selectivity. As shown in Fig. 3, several intersect relationships hold between nodes in a multi-way tree join graph. Since there is no cycle in the join graph, the joins between adjacent nodes in the join graph $G_Q$ are mutually independent. Thus, the selectivity of a multi-way join among n relations for the tree typed join graph is expressed by Eq. (3):

\[ \left( \prod_{\forall R_i, R_j :\, G_Q(i,j) = \mathrm{true}} (|s_{R_i}| + |s_{R_j}|) \right)^d \tag{3} \]

In the above formula, $G_Q(i,j) = \mathrm{true}$ denotes that there is an edge between the node for $R_i$ and the node for $R_j$ in the join graph $G_Q$.

Fig. 3. An example of a tree join graph.

2.2. Selectivity for the multi-way spatial clique join

Another multi-way spatial join style is the clique join, where the spatial objects of all relations mutually intersect. In the multi-way clique join, the join graph literally forms a clique. Papadias et al. [14] found the following property of the query results of the clique join. As shown in Fig. 4, if a set of rectangles mutually intersect, then they share a common area on each dimension. When a spatial object $(= s_{R_i})$ of relation $R_i$ and a spatial object $(= s_{R_j})$ of relation $R_j$ intersect in one-dimensional space (see Fig. 5), the common area of $s_{R_i}$ and $s_{R_j}$ exists. Let the common area of $s_{R_i}$ and $s_{R_j}$ be $\mathrm{CArea}(\{s_{R_i}, s_{R_j}\})$ and the average length of $\mathrm{CArea}(\{s_{R_i}, s_{R_j}\})$ be $|\mathrm{CArea}(\{s_{R_i}, s_{R_j}\})|$.

Fig. 4. An example of a clique join graph.

Fig. 5. The common area of $s_{R_i}$ and $s_{R_j}$.

Papadias et al. [14] showed that $|\mathrm{CArea}(\{s_{R_i}, s_{R_j}\})|$ is:

\[ |\mathrm{CArea}(\{s_{R_i}, s_{R_j}\})| = \frac{|s_{R_i}| \cdot |s_{R_j}|}{|s_{R_i}| + |s_{R_j}|} \tag{4} \]

Furthermore, they derived the general form of the average length of the common area on one dimension as follows:

\[
\begin{aligned}
|\mathrm{CArea}(\{s_{R_1}, \ldots, s_{R_n}\})|
&= \frac{|\mathrm{CArea}(\{s_{R_1}, \ldots, s_{R_{n-1}}\})| \cdot |s_{R_n}|}{|\mathrm{CArea}(\{s_{R_1}, \ldots, s_{R_{n-1}}\})| + |s_{R_n}|} \\
&= \frac{\dfrac{|\mathrm{CArea}(\{s_{R_1}, \ldots, s_{R_{n-2}}\})| \cdot |s_{R_{n-1}}|}{|\mathrm{CArea}(\{s_{R_1}, \ldots, s_{R_{n-2}}\})| + |s_{R_{n-1}}|} \cdot |s_{R_n}|}{\dfrac{|\mathrm{CArea}(\{s_{R_1}, \ldots, s_{R_{n-2}}\})| \cdot |s_{R_{n-1}}|}{|\mathrm{CArea}(\{s_{R_1}, \ldots, s_{R_{n-2}}\})| + |s_{R_{n-1}}|} + |s_{R_n}|} \\
&= \cdots = \frac{\prod_{i=1}^{n} |s_{R_i}|}{\sum_{i=1}^{n} \prod_{j=1, j \neq i}^{n} |s_{R_j}|}
\end{aligned}
\tag{5}
\]

When $s_{R_1}, \ldots, s_{R_n}$ mutually intersect, $s_{R_1}, \ldots, s_{R_{n-1}}$ also mutually intersect and $\mathrm{CArea}(\{s_{R_1}, \ldots, s_{R_{n-1}}\})$ intersects $s_{R_n}$. Therefore, the probability that $s_{R_1}, \ldots, s_{R_n}$ mutually intersect is equal to the product of the probability that $s_{R_1}, \ldots, s_{R_{n-1}}$ mutually intersect and the probability that $\mathrm{CArea}(\{s_{R_1}, \ldots, s_{R_{n-1}}\})$ and $s_{R_n}$ intersect. Thus, the formula for the multi-way clique join selectivity in d-dimensional space is Eq. (6) [14].

\[
\begin{aligned}
&(\mathrm{Clique}(\{s_{R_1}, \ldots, s_{R_n}\}))^d \\
&\quad = \left( \mathrm{Clique}(\{s_{R_1}, \ldots, s_{R_{n-1}}\}) \cdot \left( |\mathrm{CArea}(\{s_{R_1}, \ldots, s_{R_{n-1}}\})| + |s_{R_n}| \right) \right)^d \\
&\quad = \cdots = \left( \sum_{i=1}^{n} \prod_{j=1, j \neq i}^{n} |s_{R_j}| \right)^d
\end{aligned}
\tag{6}
\]

In the second line of Eq. (6), the front part is the probability that $s_{R_1}, \ldots, s_{R_{n-1}}$ mutually intersect and the latter part is the probability that $\mathrm{CArea}(\{s_{R_1}, \ldots, s_{R_{n-1}}\})$ and $s_{R_n}$ intersect. In [14], since the formula for the general join graph which contains cycles was not derived, the formulae for the tree join graph and the clique join graph were considered as an upper bound and a lower bound, respectively.
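The formulae surveyed in this section can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the 0-based indexing of relations, and the edge-list representation of the join graph are assumptions made here, not part of the original paper. Average lengths are normalized to the unit work space, so an average side of 500 in a 100,000-wide domain becomes 0.005.

```python
from math import prod


def tree_selectivity(lengths, edges, d=2):
    """Eq. (3): product of (|s_Ri| + |s_Rj|) over the join-graph edges, to the power d."""
    return prod(lengths[i] + lengths[j] for i, j in edges) ** d


def common_area_length(lengths):
    """Eq. (5): average 1-D length of the area shared by mutually intersecting objects."""
    n = len(lengths)
    denom = sum(prod(lengths[j] for j in range(n) if j != i) for i in range(n))
    return prod(lengths) / denom


def clique_selectivity(lengths, d=2):
    """Eq. (6), closed form: (sum over i of the product of all other lengths)^d."""
    n = len(lengths)
    return sum(prod(lengths[j] for j in range(n) if j != i) for i in range(n)) ** d


def estimated_result_size(cardinalities, selectivity):
    """Eq. (1): size of the Cartesian product times the join selectivity."""
    return prod(cardinalities) * selectivity


if __name__ == "__main__":
    # Three relations of 10,000 objects with normalized average length 0.005,
    # joined as a chain R1 - R2 - R3 (tree) or as a clique.
    lengths = [0.005] * 3
    chain = [(0, 1), (1, 2)]
    print(round(estimated_result_size([10000] * 3, tree_selectivity(lengths, chain))))
    print(round(estimated_result_size([10000] * 3, clique_selectivity(lengths))))
```

With these inputs the two printed estimates agree with the Tree and Clique estimates that Table 3 reports for Datacombi1 at m = 3 (10000 and 5625).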
Fig. 7. Examples satisfying Property 1. Fig. 6. An example of the 4-way ring join graph and a result.
3. Multi-way ring join selectivity

As mentioned in Section 2, the formulae of the multi-way spatial join selectivity for tree and clique join graphs have been proposed by Papadias et al. [14]. However, the selectivity for the general query graph, especially for general cyclic graphs, has not been devised. This section proposes a formula for the ring typed query graph, which is a simple cycle. This is an indispensable step to compute the selectivity of the general multi-way spatial join graph. The join graph $G_Q$ of the ring typed spatial join query is a connected 2-regular graph.¹ Since the 3-way join graph of the ring join is equal to that of the clique join, this work considers ring joins above 3-way. To assign the relation variable $R_i$ to each node in $G_Q$, we increase the relational subscript i from 1 to m in the counter-clockwise direction of the ring, as in Fig. 6. Therefore, the node $R_i$ connects with $R_{i+1}$ where $1 \le i < m$, and the node $R_m$ connects with the node $R_1$.

Under the uniformity assumption of the spatial data distribution, since the selectivity on one dimension is independent of those on other dimensions, the extension to the multi-dimensional space is straightforward (note that if random variables X and Y are mutually independent, $E(XY) = E(X)E(Y)$). Therefore, for simplicity, we explain the selectivity for the spatial ring join in one-dimensional space.

3.1. Properties of the result of a ring join

To derive a formula for the multi-way spatial join selectivity, the properties of the result of the join should be identified. Because the multi-way spatial ring join graph has a cycle, there is a non-terminating sequence of join conditions: for $\{R_1, \ldots, R_m\}$, $R_1$ intersects $R_2$ and ... and $R_m$ intersects $R_1$ and $R_1$ intersects $R_2$, and so on. Therefore, we adopt the mod operator to indicate a relation. Let a solution S of the m-way spatial join whose join graph is a ring² be an m-tuple $(S = \{s_{R_1}, s_{R_2}, \ldots, s_{R_m}\})$. The m-tuple S must satisfy Property 1.

Property 1. Let $S = \{s_{R_1}, s_{R_2}, \ldots, s_{R_m}\}$ be a solution of the m-way ring spatial join. For $\forall i > 0$, $s_{R_{(i \bmod m)+1}}$ intersects $s_{R_{((i+1) \bmod m)+1}}$.

Property 1 is directly derived from the m-way spatial ring join conditions. We mentioned that the m-way spatial ring join graph is a connected 2-regular graph. However, as shown in Fig. 7, there are many join graphs which satisfy Property 1. For example, the spatial join conditions of Fig. 7(b), ($R_1$ intersects $R_2$) ∧ ($R_2$ intersects $R_3$) ∧ ($R_3$ intersects $R_4$) ∧ ($R_4$ intersects $R_5$) ∧ ($R_5$ intersects $R_1$) ∧ ($R_1$ intersects $R_3$), satisfy Property 1. But the join graph is not a connected 2-regular graph because it has an additional condition, ($R_1$ intersects $R_3$). This implies that query graphs more specific than the spatial ring join always generate solutions which are also solutions of the spatial ring join graph. Therefore, the formula to be derived in Section 3.2 for the multi-way spatial ring join selectivity can be used to find an upper bound for a multi-way spatial join which contains cycles. This is more accurate than the tree join selectivity, which was the upper bound of Papadias et al. [14]. Theorem 1 shows another property of S.

Theorem 1. Let $S = \{s_{R_1}, s_{R_2}, \ldots, s_{R_m}\}$ be a solution of the m-way spatial ring join. Then, $\exists i$ such that $s_{R_{(i \bmod m)+1}}$ mutually intersects at least one element of

\[ S' = S - \{s_{R_{((i-1) \bmod m)+1}}, s_{R_{(i \bmod m)+1}}, s_{R_{((i+1) \bmod m)+1}}\} = \{s_{R_1}, \ldots, s_{R_{((i-2) \bmod m)+1}}, s_{R_{((i+2) \bmod m)+1}}, \ldots, s_{R_m}\}. \]

Proof. For $\forall i$, if $s_{R_{(i \bmod m)+1}}$ does not intersect any element of $S'$ as in Fig. 8, then for some j, $s_{R_{((i-j) \bmod m)+1}}$ cannot intersect $s_{R_{((i-j-1) \bmod m)+1}}$. But, by Property 1, $s_{R_{((i-j) \bmod m)+1}}$ must intersect $s_{R_{((i-j-1) \bmod m)+1}}$. Therefore, S is not a solution of the m-way spatial ring join. ∎

As previously stated, if S satisfies the join conditions of the m-way spatial clique join query, S is also a solution of the m-way spatial ring join query because S satisfies Property 1 and Theorem 1.
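Property 1 and the clique condition can be checked directly on one-dimensional data. The sketch below is illustrative only: it assumes line segments are given as closed intervals `(lo, hi)` listed in ring order, and the function names are made up for this example.

```python
from itertools import combinations


def intersects(a, b):
    """Two closed 1-D intervals (lo, hi) intersect when neither lies strictly beyond the other."""
    return a[0] <= b[1] and b[0] <= a[1]


def satisfies_property_1(segments):
    """Every segment intersects its successor in the ring, wrapping from the last to the first."""
    m = len(segments)
    return all(intersects(segments[i], segments[(i + 1) % m]) for i in range(m))


def is_clique(segments):
    """All pairs of segments mutually intersect."""
    return all(intersects(a, b) for a, b in combinations(segments, 2))


if __name__ == "__main__":
    # s1 and s3 overlap widely; s2 and s4 sit on opposite sides and never meet.
    ring_only = [(0, 10), (1, 3), (2, 8), (7, 9)]
    print(satisfies_property_1(ring_only), is_clique(ring_only))
```

The example tuple is a non-clique solution of the 4-way ring join: the ring conditions hold, but the second and fourth segments are disjoint.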
Therefore, we can divide the solutions of the ring join into the clique case and the non-clique case. As shown in Fig. 9, if all elements of S do not mutually intersect but satisfy the join conditions of the m-way spatial ring join (i.e. a non-clique case), then S can be partitioned by the following lemma.

¹ The graph $G_Q(V, E)$ is a connected graph and the degree of each node in graph $G_Q$ is 2.
² We call it the m-way spatial ring join.

Fig. 8. An example which is not a result of the m-way ring join.

Fig. 9. The partitions of S.

Lemma 1. Let S be a solution of the m-way spatial ring join such that all elements of S do not mutually intersect. Then S is partitioned into disjoint subsets C, $T_l$, and $T_r$ such that:

(i) $2 \le \|C\| \le m-2$, all elements of C mutually intersect and the corresponding partial graph for C of the join graph $G_Q$ is not connected.
(ii) $\|T_l\| \ge 1$, $\|T_r\| \ge 1$, and no element of $T_l$ intersects any element of $T_r$.
(iii) the common area of C must intersect at least one element of $T_l$ and one element of $T_r$.

Proof (sketch). The full proof is shown in Appendix A. By Theorem 1, there is a set C whose corresponding partial graph of the join graph $G_Q$ is not connected (condition (i)). In addition, according to condition (i), there is a subset $T = S - C$. Then, since the graph for C is not connected, T is partitioned into $T_l$ and $T_r$ such that no element of $T_l$ intersects any element of $T_r$ (condition (ii)). Suppose that no element of $T_l$ intersects the common area of C. In this case, at least one element of $T_l$ and two elements ($s_{R_i}$ and $s_{R_j}$) in C mutually intersect, since the corresponding partial graph for C is not a connected graph, where $R_i$ and $R_j$ are not adjacent in the join graph $G_Q$. Thus, there always exists a new C which satisfies condition (iii). (The same arguments hold for $T_r$.) ∎

Section 3.2 describes the formula for the multi-way spatial ring join selectivity using the properties of the multi-way spatial ring join result.

3.2. Formula for the selectivity of the m-way spatial ring join

As mentioned in Section 3.1, a solution $S = \{s_{R_1}, s_{R_2}, \ldots, s_{R_m}\}$ of the m-way spatial ring join is partitioned into C, $T_r$, and $T_l$ if all elements of S do not mutually intersect. Thus, as the first step to derive the selectivity of the m-way spatial ring join, we devise the guide function $q(S, C, T_l, T_r, k)$ to choose the valid partitions C, $T_l$, and $T_r$ among all possible partitions of S. The following definition shows the guide function, which is derived from Property 1 and Lemma 1.

Definition 1. The guide function $q(S, C, T_l, T_r, k)$ is true when S, C, $T_l$, $T_r$ and k satisfy the following conditions.

(1) C, $T_l$, and $T_r$ are disjoint, $C \cup T_l \cup T_r = S$, $2 \le \|C\| = k \le m-2$, $\|T_l\| = a \ge 1$, and $\|T_r\| = b \ge 1$.
(2) $\exists\, s_{R_{(i \bmod m)+1}}$ and $s_{R_{(j \bmod m)+1}} \in C$ such that $(i \bmod m)+1 \ne (j \bmod m)+1$ and $s_{R_{((i+1) \bmod m)+1}} \notin C$ as well as $s_{R_{((j+1) \bmod m)+1}} \notin C$.
(3) If $s_{R_{(j \bmod m)+1}} \in T_l$, then $s_{R_{((j-1) \bmod m)+1}}$ and $s_{R_{((j+1) \bmod m)+1}} \notin T_r$.

The first condition of the guide function $q(S, C, T_l, T_r, k)$ comes from the fact that S is partitioned into C, $T_l$, and $T_r$. The second condition of the guide function is derived from Lemma 1-(i), which requires that the corresponding partial graph for C is not connected. In order that the partial graph $G_C(V', E')$ for C is not connected, there should exist at least two nodes $R_{(i \bmod m)+1}$ and $R_{(j \bmod m)+1} \in V'$ whose left adjacent nodes (i.e. $R_{((i+1) \bmod m)+1}$ and $R_{((j+1) \bmod m)+1}$) in $G_Q$ are not in $V'$. Suppose that, for only one node $R_{(x \bmod m)+1} \in V'$, the edge $\langle R_{(x \bmod m)+1}, R_{((x+1) \bmod m)+1} \rangle \notin E'$. In this case, $R_{((x+1) \bmod m)+1} \notin V'$. Then, $G_C(V', E')$ is a connected graph since every node $R_{(i \bmod m)+1} \in V'$ except $R_{(x \bmod m)+1}$ has the edge to $R_{((i+1) \bmod m)+1}$ (including the edge between $R_{((x-1) \bmod m)+1}$ and $R_{(x \bmod m)+1}$).

In addition, the final condition of the guide function is derived from Lemma 1-(ii). With respect to Property 1, $s_{R_{((j-1) \bmod m)+1}}$ should intersect $s_{R_{(j \bmod m)+1}}$ and $s_{R_{(j \bmod m)+1}}$ should intersect $s_{R_{((j+1) \bmod m)+1}}$. However, by Lemma 1-(ii), no element of $T_l$ intersects any element of $T_r$. Therefore, $T_l$ and $T_r$ should satisfy condition (3).

So far, we have explained the valid partitioning of S into C, $T_l$ and $T_r$. Among the three items of Lemma 1, we used only the first two to build the guide function of Definition 1. Lemma 1-(iii) will be used later. Next, we present the formula for the selectivity of the m-way spatial ring join with C, $T_l$ and $T_r$.

Note that, when S is a solution of the m-way spatial ring join, the case that all elements of S mutually intersect and the case that all elements of S do not mutually intersect are independent. In addition, the case that a valid partition of S is a solution of the m-way spatial ring join is independent from the cases for other partitions of S. Therefore, to obtain the selectivity of the m-way spatial ring join, we derive a formula for each partition C, $T_l$, and $T_r$ of S. Then, we sum up the formula for the clique join selectivity of S and the formulae for all valid partitions of S. Therefore, the probability that S is a solution of the m-way spatial ring join is:

\[
(\mathrm{Ring}(S = \{s_{R_1}, \ldots, s_{R_m}\}))^d = \left( \mathrm{Clique}(S) + \sum_{k=2}^{m-2} \; \sum_{\forall C, T_l, T_r :\, q(S,C,T_l,T_r,k) = \mathrm{true}} \mathrm{Prob}(\text{a partition } C, T_l, T_r \text{ of } S \text{ satisfies Lemma 1}) \right)^d
\tag{7}
\]
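The guide function of Definition 1 can be read as a filter over all three-way splits of the ring positions. The following sketch enumerates the partitions that pass it. It is an interpretation written for illustration, not code from the paper: relations are identified by 0-based ring positions (so position i stands for $R_{i+1}$), and the name `valid_partitions` is hypothetical.

```python
from itertools import combinations


def valid_partitions(m):
    """Yield (C, Tl, Tr) as tuples of 0-based ring positions that satisfy Definition 1."""
    nodes = range(m)
    for k in range(2, m - 1):  # condition (1): 2 <= ||C|| = k <= m - 2
        for C in combinations(nodes, k):
            c_set = set(C)
            # Condition (2): at least two members of C whose ring successor lies outside C,
            # i.e. the partial graph for C is not connected.
            open_ends = sum(1 for i in C if (i + 1) % m not in c_set)
            if open_ends < 2:
                continue
            T = [i for i in nodes if i not in c_set]
            # Condition (1): split the remaining positions into non-empty Tl and Tr.
            for a in range(1, len(T)):
                for Tl in combinations(T, a):
                    tl_set = set(Tl)
                    Tr = tuple(i for i in T if i not in tl_set)
                    tr_set = set(Tr)
                    # Condition (3): no member of Tl is ring-adjacent to a member of Tr.
                    if all((j - 1) % m not in tr_set and (j + 1) % m not in tr_set
                           for j in Tl):
                        yield C, Tl, Tr


if __name__ == "__main__":
    for partition in valid_partitions(4):
        print(partition)
```

For m = 4 the only admissible choices of C are the two pairs of opposite ring positions, each with two ways to assign the remaining positions to $T_l$ and $T_r$.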
Now, we derive the formula for the probability, in Eq. (7), that a certain partition C, $T_l$, and $T_r$ of S satisfies Lemma 1. Suppose that a partition $C = \{s_{c_1}, \ldots, s_{c_k}\}$, $T_l = \{s_{l_1}, \ldots, s_{l_a}\}$, and $T_r = \{s_{r_1}, \ldots, s_{r_b}\}$ of S satisfies the guide function $q(S, C, T_l, T_r, k)$. By Lemma 1-(i), all elements of C should mutually intersect. Then, there is a common area $(= \mathrm{CArea}(C))$ of C. Since no element of $T_l$ intersects any element of $T_r$ on $\mathrm{CArea}(C)$, a subarea in $\mathrm{CArea}(C)$ which is not shared by any element of $T_r$ should exist (see Fig. 10). We call this area $\mathrm{A\_diff}(C, T_r)$. By Lemma 1-(iii), $\mathrm{CArea}(C)$ intersects at least one element of $T_r$. The average length of the area which is shared by $\mathrm{CArea}(C)$ and an element of $T_r$, when $\mathrm{CArea}(C)$ and an element of $T_r$ intersect, is given by Eq. (8).

[Eq. (8): an expression over all $s_{r_j} \in T_r$ in terms of $\mathrm{Clique}(C \cup \{s_{r_j}\})$, $|\mathrm{CArea}(C \cup \{s_{r_j}\})|$ and the factor $1/\binom{b}{1}$; the displayed form is not recoverable from this copy.]

Fig. 10. $\mathrm{A\_diff}(C, T_r)$.

In Eq. (8), we can compute $|\mathrm{CArea}(C \cup \{s_{r_j}\})|$ based on Eq. (5) in Section 2. As shown in Fig. 10, when all elements of C mutually intersect, the subarea $(= \mathrm{A\_diff})$ which does not intersect any element of $T_r$ exists in $\mathrm{CArea}(C)$ since no element of $T_r$ intersects any element of $T_l$. Thus, A_diff does not contain the common area of any possible subset of $T_r$ $(= T_r^i)$ and C. The average length of the common area which is shared by all elements of C and a certain $T_r^i$, such that $\|T_r^i\| = i$, is:

\[ \frac{1}{\binom{b}{i}} \sum_{\forall T_r^i \subseteq T_r :\, \|T_r^i\| = i} |\mathrm{CArea}(C \cup T_r^i)| \tag{9} \]

As mentioned above, $\mathrm{A\_diff}(C, T_r)$ is the subarea of $\mathrm{CArea}(C)$ which is not shared by any element of $T_r$. That is, $\mathrm{A\_diff}(C, T_r)$ does not contain the common areas of any possible $T_r^i$ and C. In other words, $\mathrm{A\_diff}(C, T_r)$ is the common area of all possible subareas, each of which does not contain the common area of $\mathrm{CArea}(C)$ and a certain $T_r^i$. Let the subarea of $\mathrm{CArea}(C)$ which does not contain the common area of a certain $T_r^i$ and $\mathrm{CArea}(C)$ be $\mathrm{sub\_diff}(C, T_r^i)$. The length of a $\mathrm{sub\_diff}(C, T_r^i)$ is:

\[ |\mathrm{sub\_diff}(C, T_r^i)| = |\mathrm{CArea}(C)| - \frac{1}{\binom{b}{i}} \sum_{\forall T_r^i \subseteq T_r :\, \|T_r^i\| = i} |\mathrm{CArea}(C \cup T_r^i)| \tag{10} \]

Above all, in order that $\mathrm{A\_diff}(C, T_r)$ exists on the common area of C, the common area $(= \mathrm{CArea}(C))$ of C should exist. As mentioned in Section 2, when all elements of C mutually intersect, $\mathrm{CArea}(C)$ exists. And, if all possible subareas $(= \mathrm{sub\_diff}(C, T_r^i))$, each of which does not contain the common area of $\mathrm{CArea}(C)$ and a certain $T_r^i$, mutually intersect, then there is $\mathrm{A\_diff}(C, T_r)$. The formula for the probability that a set of line segments mutually intersect is provided in Eq. (6). Thus, we can derive the formula for the probability that $\mathrm{A\_diff}(C, T_r)$ exists as follows:

\[ \mathrm{Clique}(C) \cdot \sum_{j=1}^{b} \prod_{i=1, i \neq j}^{b} |\mathrm{sub\_diff}(C, T_r^i)| \tag{11} \]

$\mathrm{A\_diff}(C, T_r)$ is the common area of all possible $\mathrm{sub\_diff}(C, T_r^i)$s. Thus, the average length of $\mathrm{A\_diff}(C, T_r)$ is:

\[ |\mathrm{A\_diff}(C, T_r)| = \frac{\prod_{i=1}^{b} |\mathrm{sub\_diff}(C, T_r^i)|}{\sum_{j=1}^{b} \prod_{i=1, i \neq j}^{b} |\mathrm{sub\_diff}(C, T_r^i)|} \tag{12} \]

Next, we consider the relationship between $\mathrm{A\_diff}(C, T_r)$ and $T_l$. With a little loss of generality, we assume that all elements of $T_l = \{s_{l_1}, s_{l_2}, \ldots, s_{l_a}\}$ and A_diff mutually intersect. However, by Lemma 1-(ii), no element of $T_l$ intersects any element of $T_r$. Since all elements of $T_l$ mutually intersect A_diff and do not intersect any element of $T_r$, $s_{l_1} \in T_l$ in Fig. 11 also intersects A_diff and does not intersect any element of $T_r$. Generally, the probability that two line segments intersect is the sum of the lengths of the two line segments (see Eq. (2)). However, this probability cannot be applied to this case because the following additional condition must hold between A_diff and $s_{l_1}$. As shown in Fig. 11, when the end point (denoted by +) of A_diff is in $s_{l_1}$ and the end point (denoted by •) of $s_{l_1}$ is in A_diff, $s_{l_1}$ intersects A_diff and does not intersect any element of $T_r$.

Fig. 11. A possible configuration of A_diff and $s_{l_1}$.
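Eqs. (9), (10) and (12) translate almost literally into code once Eq. (5) is available. The sketch below works on plain lists of normalized average lengths for C and $T_r$; the function names are illustrative assumptions and do not come from the paper.

```python
from itertools import combinations
from math import prod


def carea_len(lengths):
    """Eq. (5): average common length of mutually intersecting segments."""
    n = len(lengths)
    denom = sum(prod(lengths[j] for j in range(n) if j != i) for i in range(n))
    return prod(lengths) / denom


def sub_diff_len(c_lens, tr_lens, i):
    """Eq. (10): |CArea(C)| minus the Eq. (9) average over all i-element subsets of Tr."""
    subsets = list(combinations(tr_lens, i))
    avg_common = sum(carea_len(list(c_lens) + list(s)) for s in subsets) / len(subsets)
    return carea_len(c_lens) - avg_common


def a_diff_len(c_lens, tr_lens):
    """Eq. (12): common length of the b sub_diff areas, in the same form as Eq. (5)."""
    subs = [sub_diff_len(c_lens, tr_lens, i) for i in range(1, len(tr_lens) + 1)]
    b = len(subs)
    denom = sum(prod(subs[i] for i in range(b) if i != j) for j in range(b))
    return prod(subs) / denom


if __name__ == "__main__":
    # Two segments in C and one or two segments in Tr, all of average length 0.01.
    print(sub_diff_len([0.01, 0.01], [0.01], 1))
    print(a_diff_len([0.01, 0.01], [0.01, 0.01]))
```

With equal lengths L the sketch reproduces the hand calculation: $|\mathrm{CArea}(C)| = L/2$ for two segments, the single-element sub_diff is $L/2 - L/3 = L/6$, and with two elements in $T_r$ the sub_diff lengths $L/6$ and $L/4$ combine to $|\mathrm{A\_diff}| = L/10$.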
Note that the probability that a point is in a line segment is equal to the length of the line segment [9,14]. In this case, since each end point of the two line segments is contained in the other line segment at the same time, the two probabilities that a point is in a line segment should be multiplied (note that if X and Y are mutually independent, $P(X \cap Y) = P(X)P(Y)$). Therefore, the probability that $s_{l_1}$ intersects A_diff and does not intersect any element of $T_r$ is:

\[ |\mathrm{A\_diff}(C, T_r)| \cdot |s_{l_1}| \tag{13} \]

Now, we derive the general formula for the probability that all elements of $T_l$ and A_diff mutually intersect and no element of $T_l$ intersects any element of $T_r$. In order that all elements of $T_l$ and A_diff mutually intersect, all elements of $T_l$ should mutually intersect. Then, as shown in Fig. 12, when the end point (= +) of A_diff is in the common area $(= \mathrm{CArea}(T_l))$ of $T_l$ and the end point (= •) of the rightmost line segment (e.g. $s_{l_1}$ in Fig. 12) among all elements of $T_l$ is in A_diff, all elements of $T_l$ and A_diff mutually intersect and no element of $T_l$ intersects any element of $T_r$. Under the uniform distribution assumption, the probability that each line segment in $T_l$ is the rightmost one is equally $1/a$. Therefore, the probability that all elements of $T_l$ and A_diff mutually intersect and no element of $T_l$ intersects any element of $T_r$ is:

\[
\begin{aligned}
&\mathrm{Clique}(T_l) \cdot \sum_{\forall s_{l_i} \in T_l} |\mathrm{CArea}(T_l)| \cdot \frac{1}{a} \cdot |\mathrm{A\_diff}(C, T_r)| \\
&\quad = \mathrm{Clique}(T_l) \cdot |\mathrm{CArea}(T_l)| \cdot |\mathrm{A\_diff}(C, T_r)| \\
&\quad = \left( \sum_{\forall s_{l_i} \in T_l} \; \prod_{\forall s_{l_j} \in T_l,\, s_{l_j} \neq s_{l_i}} |s_{l_j}| \right) \cdot \left( \frac{\prod_{\forall s_{l_i} \in T_l} |s_{l_i}|}{\sum_{\forall s_{l_i} \in T_l} \prod_{\forall s_{l_j} \in T_l,\, s_{l_j} \neq s_{l_i}} |s_{l_j}|} \right) \cdot |\mathrm{A\_diff}(C, T_r)| \\
&\quad = |\mathrm{A\_diff}(C, T_r)| \cdot \prod_{\forall s_{l_i} \in T_l} |s_{l_i}|
\end{aligned}
\tag{14}
\]

We have now derived the formula for the relationship between $\mathrm{A\_diff}(C, T_r)$ and $T_l$. Thus, by Eqs. (11), (12) and (14), we can compute the probability that a valid partition C, $T_l$, and $T_r$ of S is a solution of the m-way spatial ring join.

\[
\mathrm{Prob}(\mathrm{A\_diff}(C, T_r) \text{ exists}) \cdot |\mathrm{A\_diff}(C, T_r)| \cdot \prod_{\forall s_{l_i} \in T_l} |s_{l_i}| = \mathrm{Clique}(C) \cdot \prod_{i=1}^{b} |\mathrm{sub\_diff}(C, T_r^i)| \cdot \prod_{\forall s_{l_j} \in T_l} |s_{l_j}|
\tag{15}
\]

Eq. (15) is for one valid partition of S. Thus, we extend Eq. (15) to all possible valid partitions of S in a multi-dimensional space. The probability that S is a solution of the m-way spatial ring join in a d-dimensional space is:

\[
(\mathrm{Ring}(S = \{s_{R_1}, \ldots, s_{R_m}\}))^d = \left( \mathrm{Clique}(S) + \sum_{k=2}^{m-2} \; \sum_{\forall C, T_l, T_r :\, q(S,C,T_l,T_r,k) = \mathrm{TRUE}} \mathrm{Clique}(C) \cdot \prod_{i=1}^{b} |\mathrm{sub\_diff}(C, T_r^i)| \cdot \prod_{\forall s_{l_j} \in T_l} |s_{l_j}| \right)^d
\tag{16}
\]
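Putting Definition 1 and Eqs. (5), (6), (10) and (16) together gives a compact estimator. The code below is a sketch under stated assumptions rather than the authors' implementation: relations are 0-based ring positions, `lengths` holds normalized average lengths in ring order, and all function names are chosen here for illustration.

```python
from itertools import combinations
from math import prod


def clique_1d(lens):
    """Eq. (6) in one dimension: probability that all segments mutually intersect."""
    n = len(lens)
    return sum(prod(lens[j] for j in range(n) if j != i) for i in range(n))


def carea_len(lens):
    """Eq. (5): average length of the common area."""
    return prod(lens) / clique_1d(lens)


def sub_diff_len(c_lens, tr_lens, i):
    """Eq. (10): |CArea(C)| minus the average |CArea(C u Tr^i)| over i-element subsets."""
    subsets = list(combinations(tr_lens, i))
    avg = sum(carea_len(c_lens + list(s)) for s in subsets) / len(subsets)
    return carea_len(c_lens) - avg


def valid_partitions(m):
    """Partitions (C, Tl, Tr) of ring positions accepted by the guide function."""
    for k in range(2, m - 1):
        for C in combinations(range(m), k):
            c_set = set(C)
            if sum(1 for i in C if (i + 1) % m not in c_set) < 2:
                continue  # partial graph for C would be connected
            T = [i for i in range(m) if i not in c_set]
            for a in range(1, len(T)):
                for Tl in combinations(T, a):
                    tr_set = set(T) - set(Tl)
                    if all((j - 1) % m not in tr_set and (j + 1) % m not in tr_set
                           for j in Tl):
                        yield C, Tl, tuple(sorted(tr_set))


def ring_selectivity(lengths, d=2):
    """Eq. (16): clique term plus one term per valid partition, raised to the power d."""
    total = clique_1d(lengths)
    for C, Tl, Tr in valid_partitions(len(lengths)):
        c_lens = [lengths[i] for i in C]
        tr_lens = [lengths[i] for i in Tr]
        term = clique_1d(c_lens)
        term *= prod(sub_diff_len(c_lens, tr_lens, i) for i in range(1, len(tr_lens) + 1))
        term *= prod(lengths[j] for j in Tl)
        total += term
    return total ** d


if __name__ == "__main__":
    # Four relations of 10,000 objects each, normalized average length 0.01.
    print(round(10000 ** 4 * ring_selectivity([0.01] * 4)))
```

For four relations of equal length 0.01 this evaluates to $(16/3 \cdot 10^{-6})^2$, which scaled by $10{,}000^4$ rounds to 284444, the Ring estimate listed for Datacombi3 at m = 4 in Table 3. For m = 3 no partition passes the guide function, so the result reduces to the clique selectivity.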
In Eq. (16), the function Clique( ) is the probability that all elements of the input mutually intersect. As mentioned earlier, when all elements of S mutually intersect, S also satisfies Property 1. Thus, the front part of Eq. (16) is the probability that all elements of S mutually intersect. In addition, the last part is the probability that all elements of S do not mutually intersect but satisfy Property 1. Eq. (16) considers all possible C, $T_l$ and $T_r$. Appendix B shows how to apply this equation to 4-way and 5-way spatial ring joins. The complexity of Eq. (16) is mainly determined by the number of elements of C and the number of elements of $T_r$ because $T_l = S - (C \cup T_r)$. The following lemma shows the complexity of Eq. (16).

Lemma 2. The complexity of Eq. (16) is:

\[ O\left( \sum_{k=2}^{m-2} \left( \binom{m}{k} - m \right) \cdot (m - k - 1) \right) = O\left( \sum_{k=2}^{m-2} \binom{m}{k} \cdot m \right) = O(m^2 \cdot 2^m) \]

Proof. When the number of elements of C is k, the number of all possible C's is $\binom{m}{k} - m$ since the corresponding partial graph for C is not connected. The number of all possible $T_r$'s is $m - k - 1$ because $\|S\| - \|C\| = m - k$ and $\|T_r\| \ge 1$. ∎

Fig. 12. A possible configuration of A_diff and $T_l$.

Lemma 2 shows that the complexity grows exponentially with m. Fortunately, m is small in practice ($m \ll 10$, and commonly $m \le 5$).
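The count $\binom{m}{k} - m$ used in the proof of Lemma 2 can be checked by brute force for small m. The sketch below assumes 0-based ring positions and uses the same 'at least two open ends' test as condition (2) of Definition 1; the function name is illustrative.

```python
from itertools import combinations
from math import comb


def count_disconnected_subsets(m, k):
    """Number of k-subsets of an m-node ring whose induced partial graph is not connected."""
    count = 0
    for C in combinations(range(m), k):
        c_set = set(C)
        # A subset forming a single contiguous arc has exactly one member
        # whose ring successor lies outside the subset.
        open_ends = sum(1 for i in C if (i + 1) % m not in c_set)
        if open_ends >= 2:
            count += 1
    return count


if __name__ == "__main__":
    for m in range(4, 8):
        for k in range(2, m - 1):
            print(m, k, count_disconnected_subsets(m, k), comb(m, k) - m)
```

The two printed counts agree because exactly m of the k-subsets form a single contiguous arc of the ring when $2 \le k \le m-2$.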
Table 2. Characteristics of data sets
Relation   # of obj   Domain area          Average length
R1.i       10,000     100,000 × 100,000    500 × 500
R2.i       10,000     100,000 × 100,000    707 × 707
R3.i       10,000     100,000 × 100,000    1000 × 1000
Fig. 13. Example graphs in 5-way join queries.
4. Experimental results The evaluation of the proposed selectivity formula was performed based on a variety of experimental tests on uniformly distributed rectangular data sets We use R*-tree
index [3] as a spatial access method for the data sets. The characteristics of the data sets are summarized in Table 2. The experiments were performed on a Sun Ultra II 168 MHz platform on which Solaris 2.5.1 was running with 384 MB of main memory. To show the accuracy of our formula, we compared its error ratio with those of the formulae for clique and tree join selectivities of [14] with the same experimental data. Example query graphs for each query type in a 5-way join (mZ5) are shown in Fig. 13. Following the standard experimental methodology in the spatial join literature, the spatial predicate used for our experiment is intersect. Data Combi1: R1.1 R1.2 R1.3 R1.4 R1.5 R1.6 R1.7 Data Combi2: R2.1 R2.2 R2.3 R2.4 R2.5 R2.6 R2.7 Data Combi3: R3.1 R3.2 R3.3 R3.4 R3.5 R3.6 R3.7 Data Combi4: R3.1 R3.2 R1.1 R3.3 R3.4 R1.2 R3.5 Data Combi5: R3.1 R2.1 R3.2 R3.3 R3.4 R1.1 R3.5 Data Combi6: R1.1 R1.2 R2.1 R3.1 R3.2 R1.3 R3.3 For this experiment, we extracted the above six data combinations from the data sets shown in Table 2. Data Combi1 to Data Combi3 have the same average lengths
Table 3
Number of actual and estimated results for various data combinations

           Datacombi1                 Datacombi2                 Datacombi3
m  Type      Actual  Estimate   (%)     Actual  Estimate   (%)     Actual  Estimate   (%)
3  Tree       10410     10000   3.9      40840     39976   2.1     164445     16000   2.7
3  Ring        5789      5625   2.8      23233     22486   3.2      92436     90000   2.6
3  Clique      5780      5625   2.8      23233     22486   3.2      92436     90000   2.6
4  Tree       10574     10000   5.4      81154     79928   1.5     668468    640000   4.3
4  Ring        4690      4444   5.2      35978     35523   1.3     298155    284444   4.6
4  Clique      2622      2500   4.7      20518     19982   2.6     167792    160000   4.6
5  Tree       10239     10000   2.3     167347    159807   4.5    2734878   2560000   6.4
5  Ring        3576      3588   0.3      61492     57331   6.8     987513    918403   7.0
5  Clique       914       977   6.9      16628     15606   6.1     268844    250000   7.0
6  Tree       10154     10000   1.5     344468    319517   7.2   11177913  10240000   8.4
6  Ring        3037      2940   3.2     106232     93928  11.6    3422737   3010225  12.1
6  Clique       318       352  10.7      12320     11233   8.8     397386    360000   9.4
7  Tree       10293     10000   2.8     703225    638841   9.2   45370436  40960000   9.7
7  Ring        2475      2501   1.1     193257    159766  17.3   11923129  10243556  14.1
7  Clique        79       120  51.9       9425      7642  18.9     547700    490000  10.5

           Datacombi4                 Datacombi5                 Datacombi6
m  Type      Actual  Estimate   (%)     Actual  Estimate   (%)     Actual  Estimate   (%)
3  Tree      103888     90000  13.4     103654     84905  18.1      14245     14569   2.3
3  Ring       52020     40000  23.1      76241     58274  23.6       8914      9159   2.7
3  Clique     52020     40000  23.1      76241     58274  23.6       8914      9159   2.7
4  Tree      237165    202500  14.6     417267    339621  18.6      42308     42450   0.3
4  Ring      143471    116736  18.6     219610    175146  20.2      21615     23283   7.7
4  Clique     84819     62500  26.3     132298     97406  26.4      12708     12853   1.1
5  Tree      957945    810000  15.4    1690925   1358483  19.7     171313    169802   0.9
5  Ring      431317    367034  14.9     695167    551163  20.7      58814     67788  15.3
5  Clique    124828     90000  27.9     208675    146536  29.8      17455     17174   1.6
6  Tree     2326730   1822500  21.7    4259543   3056586  28.2     416301    382053   8.2
6  Ring      630930    404554  35.9    1001157    660501  34.0      66785     63887   4.3
6  Clique     79731     40000  49.8     132901     68696  48.3      10585      6922  34.6
7  Tree     5346195   4100625  23.3    9975296   6877318  31.1    1034891    859620  16.9
7  Ring     2010103   1319371  34.4    3444996   2153667  37.5     331825    212105  36.1
7  Clique    104498     50625  51.6     183793     88477  51.9      19232      8471  55.9
among relations, whereas Data Combi4 to Data Combi6 have different average lengths among relations. We made a comparison between the actual join sizes using R-trees [13] and the estimated result sizes using Eq. (16). Table 3 lists the actual and the estimated result sizes. The error rate, reported as a percentage in Table 3, is defined by:

\[
\text{Error rate} = \frac{\left|\,\#\text{ of actual results} - \#\text{ of estimated results}\,\right|}{\#\text{ of actual results}}
\]
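The error rate defined above can be recomputed directly from any Actual/Estimate pair in Table 3. The following is a minimal sketch; the function name `error_rate` is illustrative and does not come from the paper.

```python
def error_rate(actual: int, estimate: float) -> float:
    """Relative error of an estimated result size, in percent.

    Mirrors the definition in the text: |actual - estimate| / actual.
    """
    if actual <= 0:
        raise ValueError("the actual result size must be positive")
    return abs(actual - estimate) / actual * 100.0


# Table 3, Datacombi1, m = 3, tree join: actual 10410, estimate 10000.
print(round(error_rate(10410, 10000), 1))  # 3.9

# Table 3, Datacombi1, m = 7, clique join: actual 79, estimate 120.
print(round(error_rate(79, 120), 1))  # 51.9
```

Both values agree with the corresponding (%) entries of Table 3. Note that the denominator is the actual size, so an overestimate on a very small result (such as the 7-way clique join) produces a large error rate.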
From Table 3, the estimated result size of the query is close to the actual result size. In the worst case for the ring join (the 7-way join on Data Combi5), the error rate is 37.5%, which is below 40%. Consequently, compared with the formulae of [14], the error rate of our formula is not high. As a result, our formula for the selectivity of the m-way spatial ring join is sufficiently accurate in spite of a little loss of generality. As shown in Table 3, the error rate generally increases as the number of relations increases. As presented in Sections 2 and 3, the selectivity of the $m$-way spatial join is computed using the selectivities of the $m'$-way spatial joins where $m'$ is less than $m$. Therefore, as the number of relations increases, the error is propagated. Also, the error rates for the cases in which the average lengths of relations are similar (i.e. Data Combi1, Data Combi2, and Data Combi3) are lower than those for the cases in which the average lengths of relations are diverse (i.e. Data Combi4, Data Combi5, and Data Combi6). As a result, when the number of relations participating in a multi-way join is small and the average lengths of relations are similar, an accurate join selectivity is obtained.
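The two observations above (a worst case below 40% for the ring join, and lower error when average lengths are similar) can be checked against the ring-join (%) entries of Table 3. The sketch below simply copies those entries; the names `RING_ERROR`, `worst_case` and `group_mean` are illustrative.

```python
from statistics import mean

# Ring-join error rates (%) copied from Table 3, ordered m = 3, 4, 5, 6, 7.
RING_ERROR = {
    "Datacombi1": [2.8, 5.2, 0.3, 3.2, 1.1],
    "Datacombi2": [3.2, 1.3, 6.8, 11.6, 17.3],
    "Datacombi3": [2.6, 4.6, 7.0, 12.1, 14.1],
    "Datacombi4": [23.1, 18.6, 14.9, 35.9, 34.4],
    "Datacombi5": [23.6, 20.2, 20.7, 34.0, 37.5],
    "Datacombi6": [2.7, 7.7, 15.3, 4.3, 36.1],
}

# Combinations whose relations have the same average lengths, and those
# whose relations have different average lengths (as described in the text).
SIMILAR = ("Datacombi1", "Datacombi2", "Datacombi3")
DIVERSE = ("Datacombi4", "Datacombi5", "Datacombi6")


def worst_case(errors):
    """Largest error rate over all data combinations and all m."""
    return max(max(values) for values in errors.values())


def group_mean(errors, names):
    """Mean error rate over the listed data combinations."""
    return mean(e for name in names for e in errors[name])


print(worst_case(RING_ERROR))  # 37.5
print(group_mean(RING_ERROR, SIMILAR) < group_mean(RING_ERROR, DIVERSE))  # True
```

The averages are only a coarse summary; individual entries such as the 7-way join on Datacombi6 (36.1%) show that the trend with m matters as much as the data combination.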
5. Conclusion

Selectivity estimation formulae for the window query and the 2-way spatial join were already developed [1,7,9,17,18]. But research on multi-way spatial join selectivity is immature. For the multi-way spatial joins, the selectivity estimation formulae only for some restricted forms of query graphs, tree and clique, have been developed [14]. However, the selectivity estimation formula for the general join graph has not been developed yet. This indicates that deriving a formula for the general join selectivity is very difficult. A general join graph may contain cycles. In this paper, we derive a formula for the multi-way spatial ring join selectivity as a first step toward the general join graph. First of all, we found the properties of the result of a multi-way spatial ring join query. We then formulated an equation (Eq. (16)) using these properties. We compared the estimated sizes of results with those of the actual results using the R*-tree for uniformly distributed data sets with dimensionality $d = 2$. The comparison showed the accuracy of our proposed formula in spite of a little loss of generality. The formula for the multi-way spatial ring join selectivity can be used as a basic formula in spatial query optimizers to generate efficient query execution plans and in performance
measurements of spatial access methods. Also, the formula for the multi-way spatial ring join selectivity can be used to find an upper bound for a multi-way spatial join whose join graph contains cycles. For our future work, we are interested in the selectivity of a general multi-way spatial join. Our paper used statically gathered statistics before selectivity estimation. But many query optimizers estimate selectivity using dynamically gathered statistics based on random sampling. Thus, another future work is selectivity estimation using random sampling.
Acknowledgements

This research was supported by the Chung-Ang University Research Grants in 2004.

Appendix A. Proof of Lemma 1

Let $S = \{s_{R_1}, s_{R_2}, \ldots, s_{R_m}\}$ be a solution of the m-way spatial ring join.

Lemma 1. When all elements of $S$ do not mutually intersect in one dimension, then $S$ is partitioned into disjoint subsets $C$, $T_l$, and $T_r$ such that:
(i) $2 \le \|C\| \le m - 2$, all elements of $C$ mutually intersect and the corresponding partial graph for $C$ of the join graph $G_Q$ is not connected.
(ii) $\|T_l\| \ge 1$, $\|T_r\| \ge 1$, no element of $T_l$ intersects any element of $T_r$.
(iii) the common area of $C$ must intersect at least one element of $T_l$ and one element of $T_r$.

Proof. By Theorem 1, there is a set $C$ whose corresponding partial graph of the join graph $G_Q$ is not connected (the condition (i)). In addition, according to the condition (i), there is a subset $T = S - C$. Since $S$ does not mutually intersect (meaning that the elements of $S$ do not mutually intersect) in one dimension, all elements of $T$ do not mutually intersect on the common area of $C$. Then, $T$ can be separated into $T_r$ and $T_l$ which satisfy the conditions (ii) and (iii).

Suppose some elements ($= C_r'$) of $T_r$ intersect some elements ($= C_l'$) of $T_l$ on the common area of $C$. Let $T_r' = T_r - C_r'$ and $T_l' = T_l - C_l'$ where $T_r' \neq \emptyset$ or $T_l' \neq \emptyset$.^3 Then there are four cases (see Fig. A.1):

^3 If $T_r' = \emptyset$ and $T_l' = \emptyset$, then the elements of $S$ mutually intersect.

Case 1. No element of $T_r'$ intersects the common area of $C$ and no element of $T_l'$ intersects the common area of $C$ (see Fig. A.1(a)). In this case, define $C'$ in the following two ways: $C' = C \cup C_r' - C_b$, $T_l'' = T_l \cup C_b$ and $T_r'' = T_r'$ where $T_r' \neq \emptyset$ and $\forall s_{C_b}$ ($\in C_b \subset C$) intersects all or some elements of $T_l'$ and does not intersect any element of $T_r'$, or $C' = C \cup C_l' - C_a$, $T_r'' = T_r \cup C_a$ and $T_l'' = T_l'$ where $T_l' \neq \emptyset$ and $\forall s_{C_a}$ ($\in C_a \subset C$) intersects all or some elements of $T_r'$ and does not intersect any element of $T_l'$. Then $C'$ satisfies Lemma 1, and $T_r''$ and $T_l''$ do not intersect on the common area of $C'$. Also, some elements of $T_r''$ must intersect a common area of $C'$. If no element of $T_r''$ ($= T_r'$ where $C' = C \cup C_r' - C_b$) intersects a common area of $C'$, no element of $T_r''$ intersects any element of $C - C_b$ and any element of $C_r'$. Thus, $S$ is not a solution of the multi-way spatial ring join query. And if no element of $T_r''$ ($= C_r' \cup T_r'$ where $C' = C \cup C_l'$) intersects a common area of $C'$, no element of $T_r''$ intersects any element of $C - C_a$ and any element of $C_l'$. Thus, $S$ is not a solution of the multi-way spatial ring join. (The same arguments hold for $T_l''$.)

Case 2. All or some elements of $T_r'$ intersect the common area of $C$, and all or some elements of $T_l'$ intersect the common area of $C$ (see Fig. A.1(b)). In this case, define $C'$ in the following two ways: $C' = C \cup C_r'$, $T_r'' = T_r'$, and $T_l'' = T_l = C_l' \cup T_l'$ where $T_r' \neq \emptyset$; or $C' = C \cup C_l'$, $T_r'' = T_r = C_r' \cup T_r'$, and $T_l'' = T_l'$ where $T_l' \neq \emptyset$. Then $C'$ satisfies Lemma 1 because $T_r'$ and $T_l'$ are not empty sets, and elements of $T_r''$ and elements of $T_l''$ do not intersect on the common area of $C'$. Also, some elements of $T_r''$ must intersect a common area of $C'$. If no element of $T_r''$ ($= T_r'$ where $C' = C \cup C_r'$) intersects any element of $C'$, no element of $T_r''$ intersects any element of $C_r'$. Thus, $S$ is not a solution of the multi-way spatial ring join. And if no element of $T_r''$ ($= C_r' \cup T_r'$ where $C' = C \cup C_l'$) intersects any element of $C'$, no element of $T_r''$ intersects any element of $C_l'$. Thus, $S$ is not a solution of the multi-way spatial ring join. (The same arguments hold for $T_l''$.)

Case 3. No element of $T_l'$ intersects the common area of $C$ and all or some elements of $T_r'$ intersect the common area of $C$ (see Fig. A.1(c)). In this case, define $C'$ in the following two ways: $C' = C \cup C_l' - C_a$, $T_l'' = T_l'$ and $T_r'' = T_r \cup C_a$, where $T_l' \neq \emptyset$ and $\forall s_{C_a}$ ($\in C_a \subset C$) intersects all or some elements of $T_r'$ and does not intersect any element of $T_l'$, or $C' = C \cup C_r'$, $T_l'' = T_l = T_l' \cup C_l'$ and $T_r'' = T_r'$. Then $C'$ satisfies Lemma 1, and $T_l''$ and $T_r''$ do not intersect on the common area of $C'$. Also, some elements of $T_l''$ must intersect a common area of $C'$. If no element of $T_l''$ ($= T_l'$ where $C' = C \cup C_l' - C_a$) intersects any element of $C'$, no element of $T_l''$ intersects any element of $C - C_a$ and any element of $C_l'$. Thus, $S$ is not a solution of the multi-way spatial ring join query. And if no element of $T_l''$ ($= T_l' \cup C_l'$ where $C' = C \cup C_r'$) intersects any element of $C'$, no element of $T_l''$ intersects any element of $C_r'$. Thus, $S$ is not a solution of the multi-way spatial ring join. (The same arguments hold for $T_r''$.)

Case 4 is similar to Case 3. □

Fig. A.1. Examples in which $T_r$ and $T_l$ intersect on the common area of $C$.
Appendix B. Example application of Eq. (16)

In this appendix, we give examples of how to apply Eq. (16) to 4-way and 5-way spatial ring join queries.

Table B.1
Spatial relationship of a 4-way spatial ring join query result in each dimension
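When $m = 4$, all elements of $S$ or a subset of $S$ ($= C$) mutually intersect in one dimension, as shown in Table B.1. Let $C = \{s_{R_1}, s_{R_3}\}$ and $T_l = \{s_{R_2}\}$; then $T_r = \{s_{R_4}\} = S - C - T_l$. If we apply Eq. (16) to this case, we obtain Eq. (B.1).

\[
(|s_{R_1}| + |s_{R_3}|) \cdot \left[ \frac{|s_{R_1}| \cdot |s_{R_3}|}{|s_{R_1}| + |s_{R_3}|} - \frac{|s_{R_1}| \cdot |s_{R_3}| \cdot |s_{R_4}|}{|s_{R_1}| \cdot |s_{R_3}| + |s_{R_3}| \cdot |s_{R_4}| + |s_{R_4}| \cdot |s_{R_1}|} \right] \cdot |s_{R_2}| \tag{B.1}
\]

Also, if $C = \{s_{R_1}, s_{R_3}\}$ and $T_l = \{s_{R_4}\}$, we obtain the following:

\[
(|s_{R_1}| + |s_{R_3}|) \cdot \left[ \frac{|s_{R_1}| \cdot |s_{R_3}|}{|s_{R_1}| + |s_{R_3}|} - \frac{|s_{R_1}| \cdot |s_{R_3}| \cdot |s_{R_2}|}{|s_{R_1}| \cdot |s_{R_3}| + |s_{R_3}| \cdot |s_{R_2}| + |s_{R_2}| \cdot |s_{R_1}|} \right] \cdot |s_{R_4}| \tag{B.2}
\]

By applying Eq. (16) to all possible $C$, $T_l$, $T_r$, and including the 4-way clique join selectivity, the formula for the 4-way spatial ring join selectivity is:

\[
\begin{aligned}
& (|s_{R_1}| + |s_{R_2}|) \cdot \left[ \frac{|s_{R_1}| \cdot |s_{R_2}|}{|s_{R_1}| + |s_{R_2}|} + |s_{R_3}| \right] \cdot \left[ \frac{|s_{R_1}| \cdot |s_{R_2}| \cdot |s_{R_3}|}{|s_{R_1}| \cdot |s_{R_2}| + |s_{R_2}| \cdot |s_{R_3}| + |s_{R_3}| \cdot |s_{R_1}|} + |s_{R_4}| \right] \\
& + (|s_{R_1}| + |s_{R_3}|) \cdot \left[ \frac{|s_{R_1}| \cdot |s_{R_3}|}{|s_{R_1}| + |s_{R_3}|} - \frac{|s_{R_1}| \cdot |s_{R_3}| \cdot |s_{R_4}|}{|s_{R_1}| \cdot |s_{R_3}| + |s_{R_3}| \cdot |s_{R_4}| + |s_{R_4}| \cdot |s_{R_1}|} \right] \cdot |s_{R_2}| \\
& + (|s_{R_1}| + |s_{R_3}|) \cdot \left[ \frac{|s_{R_1}| \cdot |s_{R_3}|}{|s_{R_1}| + |s_{R_3}|} - \frac{|s_{R_1}| \cdot |s_{R_3}| \cdot |s_{R_2}|}{|s_{R_1}| \cdot |s_{R_3}| + |s_{R_3}| \cdot |s_{R_2}| + |s_{R_2}| \cdot |s_{R_1}|} \right] \cdot |s_{R_4}| \\
& + (|s_{R_2}| + |s_{R_4}|) \cdot \left[ \frac{|s_{R_2}| \cdot |s_{R_4}|}{|s_{R_2}| + |s_{R_4}|} - \frac{|s_{R_2}| \cdot |s_{R_4}| \cdot |s_{R_1}|}{|s_{R_2}| \cdot |s_{R_4}| + |s_{R_4}| \cdot |s_{R_1}| + |s_{R_1}| \cdot |s_{R_2}|} \right] \cdot |s_{R_3}| \\
& + (|s_{R_2}| + |s_{R_4}|) \cdot \left[ \frac{|s_{R_2}| \cdot |s_{R_4}|}{|s_{R_2}| + |s_{R_4}|} - \frac{|s_{R_2}| \cdot |s_{R_4}| \cdot |s_{R_3}|}{|s_{R_2}| \cdot |s_{R_4}| + |s_{R_4}| \cdot |s_{R_3}| + |s_{R_3}| \cdot |s_{R_2}|} \right] \cdot |s_{R_1}|
\end{aligned}
\tag{B.3}
\]

For the 5-way spatial ring join, $k$ is 2 or 3. Suppose $C = \{s_{R_1}, s_{R_3}\}$, $T_l = \{s_{R_2}\}$ and $T_r = \{s_{R_4}, s_{R_5}\}$. Then applying Eq. (16) yields Eq. (B.4).

\[
\begin{aligned}
& (|s_{R_1}| + |s_{R_3}|) \cdot \left[ \mathrm{C\_Area}(\{s_{R_1}, s_{R_3}\}) - \frac{1}{2} \left( \mathrm{C\_Area}(\{s_{R_1}, s_{R_3}, s_{R_4}\}) + \mathrm{C\_Area}(\{s_{R_1}, s_{R_3}, s_{R_5}\}) \right) \right] \\
& \cdot \left[ \mathrm{C\_Area}(\{s_{R_1}, s_{R_3}\}) - \mathrm{C\_Area}(\{s_{R_1}, s_{R_3}, s_{R_4}, s_{R_5}\}) \right] \cdot |s_{R_2}|
\end{aligned}
\tag{B.4}
\]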
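Suppose $C = \{s_{R_1}, s_{R_3}\}$, $T_l = \{s_{R_4}, s_{R_5}\}$ and $T_r = \{s_{R_2}\}$. Then applying Eq. (16) yields Eq. (B.5).

\[
(|s_{R_1}| + |s_{R_3}|) \cdot \left[ \mathrm{C\_Area}(\{s_{R_1}, s_{R_3}\}) - \mathrm{C\_Area}(\{s_{R_1}, s_{R_3}, s_{R_2}\}) \right] \cdot |s_{R_4}| \cdot |s_{R_5}| \tag{B.5}
\]

Suppose $C = \{s_{R_1}, s_{R_3}, s_{R_4}\}$, $T_l = \{s_{R_2}\}$ and $T_r = \{s_{R_5}\}$. Then applying Eq. (16) yields Eq. (B.6).

\[
\mathrm{Clique}(\{s_{R_1}, s_{R_3}, s_{R_4}\}) \cdot \left[ \mathrm{C\_Area}(\{s_{R_1}, s_{R_3}, s_{R_4}\}) - \mathrm{C\_Area}(\{s_{R_1}, s_{R_3}, s_{R_4}\}) \right] \cdot |s_{R_2}| \tag{B.6}
\]

In this way, the formula for the 5-way spatial ring join selectivity can be obtained by applying Eq. (16) to all $C$, $T_l$ and $T_r$.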
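The worked cases in this appendix can be evaluated numerically. The sketch below is illustrative only and rests on stated assumptions: the helper names (`clique`, `c_area`, `ring_term`, `ring4`) are not from the paper; the general n-element forms of Clique and C_Area are inferred from the two- and three-element fractions written out in Eq. (B.1); and `ring_term` covers only the single-element $T_r$ pattern seen in Eqs. (B.1), (B.2) and (B.5), not Eqs. (B.4) or (B.6). Arguments are the per-dimension average lengths $|s_{R_i}|$.

```python
from math import prod


def clique(lengths):
    """Assumed general form: sum over i of the product of all lengths but the i-th.

    For two lengths this is |a| + |b|, the leading factor of Eq. (B.1).
    """
    return sum(prod(lengths[:i] + lengths[i + 1:]) for i in range(len(lengths)))


def c_area(lengths):
    """Assumed general form of C_Area: product of lengths divided by clique().

    For two lengths this is |a||b| / (|a| + |b|), as written out in Eq. (B.1).
    """
    return prod(lengths) / clique(lengths)


def ring_term(c, t_l, t_r):
    """Pattern of Eqs. (B.1), (B.2), (B.5), valid here only when t_r has one element."""
    if len(t_r) != 1:
        raise ValueError("this sketch only covers a single-element T_r")
    return clique(c) * (c_area(c) - c_area(c + t_r)) * prod(t_l)


def ring4(s1, s2, s3, s4):
    """One-dimensional 4-way ring value following the structure of Eq. (B.3)."""
    clique_term = (
        (s1 + s2)
        * (s1 * s2 / (s1 + s2) + s3)
        * (s1 * s2 * s3 / (s1 * s2 + s2 * s3 + s3 * s1) + s4)
    )
    return (
        clique_term
        + ring_term([s1, s3], [s2], [s4])  # Eq. (B.1)
        + ring_term([s1, s3], [s4], [s2])  # Eq. (B.2)
        + ring_term([s2, s4], [s3], [s1])
        + ring_term([s2, s4], [s1], [s3])
    )


# Equal unit lengths: ring value 16/3 versus clique value 4 per dimension.
ratio_1d = ring4(1.0, 1.0, 1.0, 1.0) / clique([1.0, 1.0, 1.0, 1.0])
print(ratio_1d ** 2)  # squared for d = 2 (an assumption of this sketch)

# Eq. (B.5) with unit lengths: C = {s_R1, s_R3}, T_l = {s_R4, s_R5}, T_r = {s_R2}.
print(ring_term([1.0, 1.0], [1.0, 1.0], [1.0]))
```

With equal lengths, the squared per-dimension ratio is 16/9 (about 1.78), which is consistent with the ratio of the ring and clique estimates for Datacombi1 at m = 4 in Table 3 (4444 / 2500). The first bracketed product in `ring4` also expands to `clique([s1, s2, s3, s4])`, which is why Eq. (B.3) is described as including the 4-way clique join selectivity.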
References

[1] S. Acharya, V. Poosala, S. Ramaswamy, Selectivity estimation in spatial databases, Proc. ACM SIGMOD 1999; 13–24.
[2] T. Brinkhoff, H. Kriegel, B. Seeger, Efficient processing of spatial joins using R-trees, Proc. ACM SIGMOD 1993; 237–246.
[3] N. Beckmann, H. Kriegel, R. Schneider, B. Seeger, The R*-tree: an efficient and robust access method for points and rectangles, Proc. ACM SIGMOD 1990; 322–331.
[4] M.J. Egenhofer, Reasoning about binary topological relations, Proc. SSD 1991; 143–160.
[5] R.H. Güting, An introduction to spatial database systems, VLDB J. 3 (4) (1994) 357–399.
[6] R.H. Güting, M. Schneider, Realm-based spatial data types: the ROSE algebra, VLDB J. 4 (2) (1995) 243–286.
[7] Y.-W. Huang, N. Jing, E.A. Rundensteiner, A cost model for estimating the performance of spatial joins using R-trees, Proc. SSDBMS 1997.
[8] M. Jarke, J. Koch, Query optimization in database systems, ACM Comput. Surv. 16 (2) (1984) 111–152.
[9] I. Kamel, C. Faloutsos, On packing R-trees, Proc. CIKM 1993; 490–499.
[10] N. Mamoulis, D. Papadias, Multiway spatial joins, ACM TODS 26 (4) (2001) 424–475.
[11] J.A. Orenstein, Spatial query processing in an object-oriented database system, Proc. ACM SIGMOD 1986; 326–336.
[12] H.H. Park, C.W. Chung, Complexity of estimating multi-way join result sizes for area skewed spatial data, Inform. Process. Lett. 76 (3) (2000) 121–129.
[13] H.H. Park, G.H. Cha, C.W. Chung, Multi-way joins using R-trees: methodology and performance evaluation, Proc. SSD 1999; 229–250.
[14] D. Papadias, N. Mamoulis, Y. Theodoridis, Processing and optimization of multiway spatial join using R-tree, Proc. ACM PODS 1999; 44–55.
[15] D. Papadias, N. Mamoulis, Y. Theodoridis, Constraint-based processing of multiway spatial joins, Algorithmica 30 (2) (2001) 188–215.
[16] D. Papadias, Y. Theodoridis, E. Stefanakis, Multidimensional range query processing with spatial relations, Geograph. Syst. 4 (4) (1997) 343–365.
[17] Y. Theodoridis, T. Sellis, A model for prediction of R-tree performance, Proc. ACM PODS 1996; 161–171.
[18] Y. Theodoridis, E. Stefanakis, T. Sellis, Cost models for join queries in spatial databases, Proc. of IEEE ICDE 1998; 476–483.