Information Systems 40 (2014) 67–83
MSSQ: Manhattan Spatial Skyline Queries☆
Wanbin Son, Seung-won Hwang, Hee-Kap Ahn*
Department of Computer Science and Engineering, Pohang University of Science and Technology, Republic of Korea
Article info
Article history: Received 18 April 2012; Received in revised form 1 June 2013; Accepted 1 October 2013; Recommended by Xifeng Yan; Available online 18 October 2013
Keywords: Spatial skyline queries; Spatial databases; Manhattan distance; Query semantics
Abstract
Skyline queries have gained attention lately for supporting effective retrieval over massive spatial data. While efficient algorithms have been studied for spatial skyline queries using the Euclidean distance, these algorithms are (1) still quite computationally intensive and (2) unaware of the road constraints. Our goal is to develop a more efficient algorithm for L1 distance, also known as Manhattan distance, which closely reflects road network distance for metro areas. We present a simple and efficient algorithm which, given a set P of data points and a set Q of query points in the plane, returns the set of spatial skyline points in just $O(|P| \log |P|)$ time, assuming that $|Q| \le |P|$. This is significantly lower in complexity than the best known method. In addition to efficiency and applicability, our algorithm has two other desirable properties, independent computation and extensibility to L∞ norm distance, which naturally invite parallelism and widen applicability. Our extensive empirical results suggest that our algorithm outperforms the state-of-the-art approaches by orders of magnitude. We also present efficient algorithms that report the changes of the skyline points when single or multiple query points move along the x- or y-axis. © 2013 Elsevier Ltd. All rights reserved.
1. Introduction Skyline queries have gained attention [1–5] because of their ability to retrieve “desirable” objects that are not worse than any other object in the database. Recently, these queries have been applied to spatial data, as we illustrate with the example below. Consider a hotel search scenario for a conference trip to Minneapolis, where the user marks two locations of interest, e.g., the conference venue and an airport, as Fig. 1 (a) illustrates. Given these two query locations, one option is to identify hotels that are close to both locations. When
☆ Work by Son and Ahn was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2011-0030044). Work by Hwang was supported by Microsoft Research Asia. * Corresponding author. Tel.: +82 542792387. E-mail addresses:
[email protected] (W. Son),
[email protected] (S.-w. Hwang),
[email protected] (H.-K. Ahn).
0306-4379/$ - see front matter © 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.is.2013.10.001
considering the Euclidean distance, we can say that hotel H5, located in the middle of the two query points, is more desirable than H4, i.e., H5 “dominates” H4. The goal is to narrow down the choice of hotels to a few desirable hotels that are not dominated by any other objects, i.e., no other object is closer to all the given query points simultaneously. However, as Fig. 1(b) shows, considering these query and data points on the map, the Euclidean distance, quantifying the length of the line segment between H5 and the query points, does not consider the road constraints and thus severely underestimates the actual distance. Going back to Fig. 1(a), we can now assume that the dotted lines represent the underlying road network and revisit the problem to identify desirable objects with respect to L1 distance. In this new problem, H4 and H5 are equally desirable, as both are three blocks away from the conference venue and two blocks from the airport. In general, the Manhattan distance, or L1 distance, reflects actual road network distances well for well-connected
W. Son et al. / Information Systems 40 (2014) 67–83
Fig. 1. Hotel search scenario: (a) abstract view and (b) actual Minneapolis map. [Panel (a) plots hotels H1–H8, the airport, and the venue on x and y axes ranging from 0 to 1.0.]
Fig. 2. Road network maps of (a) Pasadena and (b) Ontario, California.

Table 1. Inversion ratios of four real road networks.
Road network                  Inversion ratio (%)
City of Pasadena              4.13
City of Ontario               4.61
City of San Joaquin County    6.13
California                    6.78
metro areas such as Pasadena and Ontario (Fig. 2) in California. The experimental results for real road networks, summarized in Table 1, support this claim.1 In the experiment, we repeated the following 1000 times for each network. We chose a node randomly and constructed two sorted lists of the nodes of the network, one in the ascending order of network distance and the other in the ascending order of L1 distance from the chosen node. Then we counted the number of inversions between the two lists. Table 1 shows the average inversion ratio of each road network,
1 The first two road networks are from OpenStreetMap (http://www.openstreetmap.org/), and the other two road networks are available at http://www.cs.fsu.edu/~lifeifei/SpatialDataset.htm.
which is less than 7%. For Pasadena and Ontario, the inversion ratios are even less than 5%.
Skyline queries have been actively studied for Euclidean distance [6–9]. Given a set P of data points and a set Q of query points in the plane, the most efficient algorithm known so far has the time complexity of $O(|P|(|S| \log |CH(Q)| + \log |P|))$ [8,9]. Here S denotes the set of spatial skyline points, and $CH(Q)$ denotes the standard convex hull of Q in the underlying metric. These algorithms are based on a geometric interpretation of spatial dominance of a point p over another point p′: p is not spatially dominated by p′ if and only if there is at least one query point on the side of the bisecting line of p and p′ that contains p. From this observation, they showed that every data point p lying in $CH(Q)$ is a skyline point, because there is at least one query point on the side of the line bisecting p and any other data point that contains p. They also showed, using a similar argument, that a site of the Voronoi diagram of P is a skyline point if its Voronoi cell has a nonempty intersection with $CH(Q)$. The geometric interpretation of spatial dominance also holds for L1, because the bisecting line of two points p and p′ in L1 norm distance is the set of points equidistant from p and p′, and therefore there is at least one query
point on the side of the line containing p if and only if p is not spatially dominated by p′. This implies that (a) every data point p lying in the orthogonal convex hull of Q is a skyline point and (b) a site of the Voronoi diagram of P in L1 metric is a skyline point if its Voronoi cell has a nonempty intersection with the convex hull. Therefore, we can compute a “subset” of the spatial skyline points by constructing the convex hull and the Voronoi diagram, which can be done in $O(|Q| \log |Q|)$ time and $O(|P| \log |P|)$ time, respectively. However, Fig. 3 shows that there are still some skyline points not belonging to the two cases above. For example, p2 is skyline, because none of the other points dominates it. But p2 is not contained in the orthogonal convex hull of the query points, and its Voronoi cell (gray region) does not intersect the orthogonal convex hull of Q. This example suggests that we need not only to maintain the subset of skyline points for cases (a) and (b), but also to check whether the remaining data points are skyline or not. This takes $O(|P||S| \log |CH(Q)|)$ time, which is the same as the total time complexity required for Euclidean distance. In clear contrast, we develop a simple and efficient algorithm that computes skyline points in just $O(|P| \log |P|)$ time for L1 metric, assuming $|Q| \le |P|$. Our extensive empirical results suggest that our algorithm significantly outperforms the state-of-the-art algorithms in spatial and general skyline problems. Our contributions can be summarized as follows:
• We study the Manhattan Spatial Skyline Queries (MSSQ) problem, which arises in advanced query semantics, such as ranking and skyline queries of massive spatial datasets.
• We show that a straightforward extension of the existing algorithm under L2 distance is inefficient for our problem, and present a simple and efficient algorithm that computes skyline points in just $O(|P| \log |P|)$ time.
• We also propose an algorithm for MSSQ when query points move either vertically or horizontally. Our algorithm runs in $O(|P| \log |P|)$ time when only one query point moves and in $O(|P|^2 |Q|)$ time when more than one query point moves.
• We show that our algorithm can easily be parallelized by computing each skyline point independently.
• Our algorithm also extends straightforwardly to the Chebyshev distance, also known as L∞ distance, which is used extensively for spatial logistics in warehouses [10].
• We evaluate our framework using synthetic data and show that our algorithms are faster by orders of magnitude than the current state-of-the-art approaches.
2. Related work
This section provides a brief survey of work related to spatial query processing. Skyline queries were introduced in the context of finding the maximum vectors [1]. Since then they have been studied in database applications, both in the course of enhancing the efficiency of computation [2,3,11,4,5,12] and in the course of enhancing the quality of results [13–15], by narrowing down skyline results using properties such as frequency, k-dominance, and k-representativeness of skyline results. The spatial query problem has been extensively studied in query semantics, such as ranking neighboring objects by their distances to a single query point [16–18]. For multiple query points, Papadias et al. [19] studied ranking by a class of monotone “aggregation” functions of distances from multiple query points. For a spatial skyline query with a single query point, Huang and Jensen [20] studied the problem of finding spatial locations that are not dominated by the network distance to the query point. For multiple query points, Sharifzadeh and Shahabi [6] proposed two algorithms that identify the skyline locations for the given query points such that no other location is closer to all of the query points. However, it was shown that the solution proposed in [6] is incorrect [8,9]. Later Sharifzadeh et al. proposed a modified version [7], but the new algorithm is computationally more expensive than the original one. The best known algorithm for spatial skyline queries using L2 distance runs in time $O(|P|(|S| \log |CH(Q)| + \log |P|))$ [8,9]. Deng et al. [21] studied the spatial skyline query problem for multiple query points in road networks. They focused on reducing the number of network distance computations during skyline computation. In contrast, little is known about the spatial skyline query problem under L1 distance. To the best of our knowledge, our work is the first result using L1 distance and is significantly more efficient than the ones using L2 distance.
3. Problem definition
Fig. 3. The orthogonal convex hull of queries (white disks) and the Voronoi cell (gray region) of p2 in L1 metric. The point p2 is skyline, which satisfies neither case (a) nor (b).
In the spatial skyline query problem, we are given two point sets: a set P of data points and a set Q of query points in the plane, assuming that $|Q| \le |P|$. In general, the purpose of querying a data set is to extract a subset of the data set with respect to the query set, and the query set behaves as a set of constraints which each skyline point must satisfy. In many practical situations, the number of constraints is much smaller than the size of the data under consideration, and therefore the assumption is reasonable.
Distance function $d(p, q)$ returns the L1 distance between a pair of points p and q, that is, the sum of the absolute differences of their coordinates. Given this distance function, our goal is to find the set of spatial skyline points. Our definitions are consistent with prior literature [8], as we restate below.
Definition 1. We say that $p_1$ spatially dominates $p_2$ if and only if $d(p_1, q) \le d(p_2, q)$ for every $q \in Q$, and $d(p_1, q') < d(p_2, q')$ for some $q' \in Q$.
Definition 2. A point $p \in P$ is a spatial skyline point with respect to Q if and only if p is not spatially dominated by any other point of P.
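Definitions 1 and 2 can be illustrated with a direct, quadratic-time check. This is only a reference sketch for small inputs, not the algorithm developed in this paper; the function names `l1`, `dominates`, and `skyline_bruteforce` and the sample coordinates are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def l1(p: Point, q: Point) -> float:
    """L1 (Manhattan) distance d(p, q)."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def dominates(p1: Point, p2: Point, queries: Sequence[Point]) -> bool:
    """Definition 1: p1 is no farther than p2 from every query point
    and strictly closer to at least one query point."""
    no_farther = all(l1(p1, q) <= l1(p2, q) for q in queries)
    strictly_closer = any(l1(p1, q) < l1(p2, q) for q in queries)
    return no_farther and strictly_closer


def skyline_bruteforce(points: Sequence[Point],
                       queries: Sequence[Point]) -> List[Point]:
    """Definition 2: keep the points that no point of P dominates.
    A point never dominates itself, so no explicit self-exclusion is needed."""
    return [p for p in points
            if not any(dominates(other, p, queries) for other in points)]


# Hypothetical sample data (not taken from the paper).
queries = [(0, 0), (4, 0)]
points = [(2, 0), (2, 1), (5, 5), (0, 1)]
print(skyline_bruteforce(points, queries))  # [(2, 0), (0, 1)]
```

Here (2, 1) and (5, 5) are dominated by (2, 0), while (0, 1) survives because it is strictly closer to the first query point than any other data point.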
4. Observation
The basic idea of our algorithm is as follows. To determine whether $p \in P$ is skyline or not, the approach under L2 distance performs dominance tests with the current skyline points (which we later discuss in detail, denoted as baseline algorithm PSQ, in Section 7). Under L1 distance, we use a different approach in which we check the existence of a point that dominates p. To do this, we introduce another definition (below) of spatial dominance between two points which is equivalent to Definition 1. We denote by $C(p, q)$ the L1 disk (its closure) centered at q with radius $d(p, q)$.
Definition 3. We say that $p_1$ spatially dominates $p_2$ if and only if $p_1$ is contained in $C(p_2, q)$ for every $q \in Q$, and is contained in the interior of $C(p_2, q')$ for some $q' \in Q$.
Based on this new definition, a straightforward approach would be, for each data point p, to compute the L1 disks for every $q \in Q$, and check whether there is any data point satisfying the definition. However, this already takes $O(|Q|)$ time per data point just to compute the L1 disks. Instead, we use some geometric properties of the L1 disks $C(p, q)$ to compute the common intersection of the L1 disks in $O(\log |Q|)$ time for each data point p, and perform the dominance test efficiently. We denote by R(p) the common intersection of $C(p, q)$ for every $q \in Q$. Note that p itself is always contained in R(p) (in fact p is on the boundary of R(p)). By Definitions 2 and 3, we have the following three cases for data points contained in R(p):
(a) there is no data point in R(p), other than p, or
(b) there is some data point p′ in the interior of R(p), or
(c) there is some data point p′ in R(p), other than p, but no data point in the interior of R(p).
Case (a) obviously implies that p is skyline. For case (b), p′ dominates p, and therefore p is not skyline. For case (c), if p′ is contained in the interior of some L1 disk $C(p, q)$ for a $q \in Q$, then p′ dominates p, and therefore p is not skyline. On the other hand, if every data point in R(p) lies on the boundary of $C(p, q)$ for all $q \in Q$, then p is skyline.
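The three cases can be sketched with linear scans in place of the logarithmic-time queries developed later. The sketch assumes the coordinate change u = x + y, v = x − y, under which each L1 disk becomes an axis-parallel square, so R(p) becomes an axis-parallel rectangle. The names `rotate`, `common_intersection`, and `classify` are illustrative, and data points equal to p are ignored.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]


def l1(p: Point, q: Point) -> float:
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def rotate(z: Point) -> Point:
    """Map (x, y) to (u, v) = (x + y, x - y); |dx| + |dy| = max(|du|, |dv|)."""
    return (z[0] + z[1], z[0] - z[1])


def common_intersection(p: Point, queries: Sequence[Point]) -> Box:
    """R(p) as a rectangle (u_lo, u_hi, v_lo, v_hi) in rotated coordinates."""
    u_lo = v_lo = float("-inf")
    u_hi = v_hi = float("inf")
    for q in queries:
        r = l1(p, q)                      # radius of C(p, q)
        uq, vq = rotate(q)
        u_lo, u_hi = max(u_lo, uq - r), min(u_hi, uq + r)
        v_lo, v_hi = max(v_lo, vq - r), min(v_hi, vq + r)
    return u_lo, u_hi, v_lo, v_hi


def classify(p: Point, points: Sequence[Point],
             queries: Sequence[Point]) -> Tuple[str, bool]:
    """Return (case, is_skyline) following cases (a), (b), (c)."""
    u_lo, u_hi, v_lo, v_hi = common_intersection(p, queries)
    inside = [z for z in points if z != p
              and u_lo <= rotate(z)[0] <= u_hi
              and v_lo <= rotate(z)[1] <= v_hi]
    if not inside:
        return "a", True                  # nothing else in R(p)
    for z in inside:
        u, v = rotate(z)
        if u_lo < u < u_hi and v_lo < v < v_hi:
            return "b", False             # a point in the interior of R(p)
    # Case (c): boundary points dominate p only if strictly closer to some q.
    dominated = any(l1(z, q) < l1(p, q) for z in inside for q in queries)
    return "c", not dominated


print(classify((2, 1), [(2, 0), (2, 1), (5, 5), (0, 1)], [(0, 0), (4, 0)]))
print(classify((1, 0), [(1, 0), (0, 1)], [(0, 0)]))
```

The linear scans make each test cost O(|P| + |Q|); the point of Section 5 is to replace them with range counting and segment dragging queries.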
5. Algorithm
In this section, we show how to handle each of the three cases efficiently so as to achieve an $O(\log |P|)$ time algorithm for determining whether a data point is skyline or not.
5.1. Data structures
We first introduce the data structures we build on P and Q to support “range counting queries” and “segment dragging queries” efficiently. These two queries are the building blocks of our proposed algorithm. Range counting is a fundamental problem in computational geometry and spatial databases. We preprocess a set P of points in order to determine the number of points of P contained in a query range R. Among the known results, we implement the range counting structure proposed in [22], building a balanced binary tree for one dimension and storing the information for the other dimension as well. This structure on n points in the plane can be constructed in $O(n \log n)$ time, and it answers a range counting query in $O(\log n)$ time. We build this structure for both P and Q, which we denote as rCountP and rCountQ, respectively. Note that rCountP can be built once offline, while rCountQ needs to be built at query time. (However, this cost is negligible, as we will empirically report in Section 8.) A segment dragging query, informally speaking, determines the next point “hit” by a given query segment $\overline{st}$ when it is “dragged” to $\overline{s't'}$, with s moving in direction $\phi_s$ and t in direction $\phi_t$. Fig. 5(a) shows an example where $\phi_s$ and $\phi_t$ are parallel, in the direction of the gray arrow. There are two types of queries, parallel tracks and dragging out of a corner. When $\phi_s$ and $\phi_t$ are parallel, the query is of the parallel tracks type, as in Fig. 5(a) and (b). When the initial query segment $\overline{st}$ is a point, the query is of the dragging out of a corner type, as in Fig. 5(c).
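To make the range counting interface concrete, the following is a minimal sketch of a two-dimensional counting structure for axis-parallel rectangles. It is not the structure of [22]: it is a plain merge-sort tree, assumed here for brevity, with $O(n \log^2 n)$ construction and $O(\log^2 n)$ query time, which is weaker than the bounds quoted above. The class name `RangeCounter` is hypothetical.

```python
from bisect import bisect_left, bisect_right
from typing import List, Sequence, Tuple


class RangeCounter:
    """Counts points inside a closed axis-parallel rectangle."""

    def __init__(self, points: Sequence[Tuple[float, float]]) -> None:
        pts = sorted(points)                       # sort by x, then y
        self.xs: List[float] = [x for x, _ in pts]
        self.n = len(pts)
        # Iterative segment tree over x-order; each node stores sorted y values.
        self.tree: List[List[float]] = [[] for _ in range(2 * self.n)]
        for i, (_, y) in enumerate(pts):
            self.tree[self.n + i] = [y]
        for i in range(self.n - 1, 0, -1):
            self.tree[i] = sorted(self.tree[2 * i] + self.tree[2 * i + 1])

    @staticmethod
    def _count_y(ys: List[float], y1: float, y2: float) -> int:
        return bisect_right(ys, y2) - bisect_left(ys, y1)

    def count(self, x1: float, x2: float, y1: float, y2: float) -> int:
        """Number of points with x1 <= x <= x2 and y1 <= y <= y2."""
        lo = bisect_left(self.xs, x1) + self.n
        hi = bisect_right(self.xs, x2) + self.n
        total = 0
        while lo < hi:                             # standard bottom-up walk
            if lo & 1:
                total += self._count_y(self.tree[lo], y1, y2)
                lo += 1
            if hi & 1:
                hi -= 1
                total += self._count_y(self.tree[hi], y1, y2)
            lo >>= 1
            hi >>= 1
        return total


rc = RangeCounter([(0, 0), (1, 2), (2, 1), (3, 3), (2, 2)])
print(rc.count(1, 2, 1, 2))  # 3
```

Because R(p) is axis-parallel only after the coordinate change u = x + y, v = x − y, such a structure would be built on the transformed points when used for queries with R(p).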
It was shown in [23,24] that one can preprocess a set P of n points into a data structure of O(n) size in $O(n \log n)$ time that answers a segment dragging query of the “parallel tracks” or “dragging out of a corner” type in $O(\log n)$ time. We build this structure on Q, which we denote as sDragQ.
5.2. Computing the common intersection R(p)
Since each $C(p, q)$ is an L1 circle, that is, a copy of an axis-parallel square rotated by 45°, R(p) is a rectangle with sides parallel to the lines y = x or y = −x (see Fig. 4(a)). Therefore, R(p) is determined by at most four query points: when R(p) is a point, it is the only common intersection of four L1 disks. Otherwise, it is determined by at most three query points.
Lemma 1. There are at most four query points that determine the sides of R(p). We can identify them in $O(\log |Q|)$ time after $O(|Q| \log |Q|)$ time preprocessing.
Proof. We first show that the sides of R(p) are determined by at most four query points.
Fig. 4. The vertical line and the horizontal line through p subdivide the plane into four quadrants, $Q_1$, $Q_2$, $Q_3$, and $Q_4$. Depending on the locations of the query points (white disks), there are five cases (a)–(e) of the common intersection R(p) (gray region).
Consider the subdivision of the plane into regions (quadrants) defined by the vertical line through p and the horizontal line through p. Then at least one of the four quadrants contains some query points in its closure unless $Q = \emptyset$. Without loss of generality, we assume that the top right quadrant contains some query points. That is, the set $Q_1 = \{q \mid q_x \ge p_x,\ q_y \ge p_y,\ q \in Q\}$ is not empty. We also denote by $Q_2$, $Q_3$, and $Q_4$ the sets of query points in the top left, the bottom left, and the bottom right quadrants, respectively. Note that a query point lying on the border of two quadrants belongs to both sets. Consider the case that $Q_1 = Q$. Then the bottom left side of R(p) is determined by p. The other three sides are determined by three query points: the one with the smallest x-coordinate determines the bottom right side, the one with the smallest y-coordinate determines the top left side, and the one with the smallest L1 distance from p determines the top right side of R(p). Fig. 4(a) shows this case. Consider now the case that $Q_2 \neq \emptyset$ and $Q = Q_1 \cup Q_2$, as shown in Fig. 4(b). In this case, p is the bottom corner of R(p), and therefore the bottom left and the bottom right sides are determined by p. The top right side is determined either by the query point in $Q_1$ with the smallest L1 distance from p or by the query point in $Q_2$ with the smallest y-coordinate. The top left side is determined by one of two such query points as above, after switching the roles of $Q_1$ and $Q_2$. If $Q_i \neq \emptyset$ for all i = 1, 2, 3, 4, R(p) is a point that coincides with p, as shown in Fig. 4(e). Otherwise, R(p) is just a line segment. If $Q_3 \neq \emptyset$ and $Q = Q_1 \cup Q_3$, then $C(p, q) \cap C(p, q')$ for any $q \in Q_1$ and $q' \in Q_3$ is a line segment. Fig. 4(c) illustrates this case. The lower endpoint of the segment is determined either by the
query point in $Q_1$ with the smallest x-coordinate or by the query point in $Q_3$ with the largest y-coordinate. If two of $Q_2$, $Q_3$, and $Q_4$ are not empty, R(p) is a line segment one of whose endpoints is p. Fig. 4(d) shows the case that both $Q_2$ and $Q_3$ are not empty. In this case, the lower endpoint of R(p) is p and the upper endpoint is determined by one of three query points: the query point in $Q_1$ with the smallest y-coordinate, the query point in $Q_3$ with the largest x-coordinate, or the query point in $Q_2$ with the smallest L1 distance from p. For the cases in which $Q_2$ and $Q_4$ are not empty, or $Q_3$ and $Q_4$ are not empty, one endpoint of R(p) is p and the other endpoint is determined by one of three such extreme query points. For a data point p, we perform four range counting queries on rCountQ, using each region (or quadrant) of the subdivision by the vertical and horizontal lines through p as a query. Based on the results of the range counting queries, we determine the case it belongs to and identify the query points that determine the sides of R(p). Once we construct the segment dragging query structure sDragQ, we can find these query points with at most four segment dragging queries as follows. For the query point in $Q_1$ that has the smallest x-coordinate (or y-coordinate), we use “dragging by parallel tracks” with the vertical (or horizontal) line through p as in Fig. 5(a) (or (b)). For the query point in $Q_1$ that has the smallest L1 distance from p, we use “dragging out of a corner” with p as in Fig. 5(c). The extreme query points in $Q_2$, $Q_3$, and $Q_4$ can be found similarly by using the same segment dragging queries. □
Once we identify the four query points that determine the sides of R(p), we perform a counting query on rCountP using the range R(p) as a query. If there is only one data point p in R(p), it belongs to case (a) in Section 4 and p is skyline. Otherwise, we perform a range counting query
Fig. 5. Segment dragging queries to find (a) the query point with the smallest x-coordinate, (b) the query point with the smallest y-coordinate, and (c) the query point with the smallest L1 distance from p in quadrant Q1 of p.
with the interior of R(p) to check whether there is any data point in the interior of R(p). If this is the case (case (b) in Section 4), p is not skyline. If there is no data point in the interior of R(p), it belongs to case (c) in Section 4.
5.3. For data points on the boundary of R(p)
We now show how to handle case (c) in Section 4. For this case, we should check whether there is a point $p' \in P \setminus \{p\}$ lying on the boundary of R(p) but lying in the interior of $C(p, q)$ for some $q \in Q$. Clearly, no such point exists if and only if p is skyline. To perform this test efficiently, we use the observation that p′ lies on the boundary of $C(p, q)$ for every $q \in Q$ if and only if all points in Q lie in specific regions determined by p′. We can check whether all points in Q lie in these regions in $O(\log |Q|)$ time, after $O(|Q| \log |Q|)$ time preprocessing. Without loss of generality, we assume that $Q_1$ is not empty, that is, p lies on the bottom left side of R(p) as in Section 5.2. Recall that $C(p, q)$ for every $q \in Q$ contains R(p) by definition. We denote by $\ell_v$ the vertical line through the bottom corner of R(p), and by $\ell_h$ the horizontal line through the left corner of R(p). We denote by $\ell_s$ the line consisting of points equidistant from the bottom left side and the top right side of R(p) (see Fig. 6(a)).
Lemma 2. Assume that the interior of R(p) is not empty but contains no point in P, and p lies in the interior of the bottom left side of R(p). A data point $p' (\neq p)$ lies on the boundary of $C(p, q)$ for every $q \in Q$ if and only if
(i) p′ lies on the bottom left side of R(p) and all query points lie above or on $\ell_h$ and $\ell_s$, and on the right of or on $\ell_v$ (Fig. 6(a)),
(ii) p′ lies on the top left side of R(p) and all query points lie on $\ell_h$ but above or on $\ell_s$ (Fig. 6(b)),
(iii) p′ lies on the bottom right side of R(p) and all query points lie on $\ell_v$ but above or on $\ell_s$, or
(iv) p′ lies on the top right side of R(p) and all query points lie on $\ell_s$ but above or on $\ell_h$, and on the right of or on $\ell_v$ (Fig. 6(c)).
Proof. As it is straightforward to see that the necessary condition holds, we only prove the sufficient condition. Since the interior of R(p) is not empty and p lies in the
interior of the bottom left side of R(p), p always lies on the bottom left side of $C(p, q)$ for every $q \in Q$. If there is a query point q′ below $\ell_h$, the left corner of $C(p, q')$ lies in the interior of the bottom left side of R(p) and R(p) is not contained in $C(p, q')$, which contradicts the definition of R(p). We can show analogously that all query points lie on or to the right of $\ell_v$. For any query point q′ below $\ell_s$, $C(p, q')$ does not contain the top right side of R(p), which again contradicts the definition of R(p). Consider case (i) in which there is a data point p′ on the bottom left side of R(p). Then p′ lies on the bottom left side of $C(p, q)$ for every $q \in Q$. Since the bottom left side of R(p) is the common intersection of the bottom left sides of $C(p, q)$ for every $q \in Q$, p′ does not impose any additional constraint on the locations of query points. Consider case (ii) in which there is a data point p′ on the top left side of R(p). Then p′ lies on the top left side of $C(p, q)$ for every $q \in Q$. Therefore the only additional constraint is that all query points lie on $\ell_h$. Case (iii) can be shown analogously. Consider case (iv) in which there is a data point p′ on the top right side of R(p). Then p′ lies on the top right side of $C(p, q)$ for every $q \in Q$. Therefore the only additional constraint is that all query points lie on $\ell_s$. □
Note that when p′ lies on a corner of R(p), we consider it contained in both sides of R(p) sharing the corner. Therefore the lemma holds if every query point satisfies one of the two conditions of the sides. The lemma above implies that we can test whether $d(p', q) = d(p, q)$ for every $q \in Q$ in $O(\log |Q|)$ time by performing a few range counting queries on rCountQ. The case in which p lies on the bottom or the left corner of R(p) can be handled as follows. When $Q_2 \neq \emptyset$, p lies on the bottom corner of R(p), and all query points in $Q_2$ must lie on or above the horizontal line through the right corner of R(p). When $Q_4 \neq \emptyset$, p lies on the left corner of R(p), and all query points in $Q_4$ must lie on or to the right of the vertical line through the top corner of R(p). It remains to show the case in which R(p) degenerates to a line segment (see Fig. 4(c) and (d)). (Note that when R(p) is a point, $d(p', q) = d(p, q)$ for every $q \in Q$ if and only if p = p′; see Fig. 4(e).) We denote by $R_1$ the region bounded from below by $\ell_h$ and bounded from the left by $\ell_v$. We denote by $R_2$ the region bounded from above by the horizontal line through the lower endpoint of R(p) and bounded from the right by the vertical line through the upper endpoint of R(p). Let $\ell'_s$ be the line consisting of points equidistant from p and p′ (that is, their bisector).
Fig. 6. A data point $p' (\neq p)$ lies on the boundary of $C(p, q)$ for every $q \in Q$ if and only if all query points lie in the gray region or on the thick line segments.
Lemma 3. Assume that R(p) is a line segment. A data point $p' (\neq p)$ lies on the boundary of $C(p, q)$ for every $q \in Q$ if and only if
(i) p or p′ lies in the interior of the segment R(p) and all query points lie in $R_1 \cup R_2$ (Fig. 6(d)), or
(ii) p and p′ lie on the opposite endpoints of R(p) and all query points lie in $R_1 \cup R_2 \cup \ell'_s$ (Fig. 6(e)).
Again, we can test whether $d(p', q) = d(p, q)$ for every $q \in Q$ in $O(\log |Q|)$ time by performing a few range counting queries on rCountQ.
Lemma 4. We can decide in $O(\log |Q|)$ time whether the data points on a side of R(p) dominate p or not.
5.4. Computing all the skyline points
The following pseudocode summarizes our algorithm.
Algorithm (MSSQ).
Input: a set P of data points and a set Q of query points
Output: the list S of all skyline points
1.  initialize the list S
2.  construct two range counting query structures: rCountP of P and rCountQ of Q
3.  construct a segment dragging query structure sDragQ of Q
4.  for i ← 1 to |P|
5.  do
6.      determine the quadrants containing query points, by querying rCountQ with quadrants of p_i /* Section 5.2 */
7.      determine four sides of R(p_i), by querying sDragQ with p_i /* Section 5.2 */
8.      count ← query rCountP with R(p_i)
9.      if count = 1 /* p_i is the only point in R(p_i) */
10.         then insert p_i to S /* p_i is skyline */
11.         else query rCountP with interior of R(p_i)
12.             if there is no data point in the interior of R(p_i)
13.                 then query rCountP with the regions defined by sides and corners of R(p_i) to check whether they contain data points /* Section 5.3 */
14.                     query rCountQ with regions defined by ℓh, ℓv, and ℓs (or R1, R2, ℓ′s) for sides and corners of R(p_i) containing data points /* Section 5.3 */
15.                     if all query points lie in the regions defined above
16.                         then insert p_i to S /* p_i is skyline */
17. return S
In Line 2 of algorithm MSSQ, we construct two range counting query structures, one for P and one for Q. This can be done in $O(|P| \log |P|)$ time using $O(|P|)$ space [22]. The segment dragging query structure in Line 3 can be constructed in $O(|Q| \log |Q|)$ time and $O(|Q|)$ space [23,24]. In the for-loop, we use four queries to rCountQ to determine the quadrants containing query points, at most four queries to sDragQ to find the query points determining the sides of $R(p_i)$, and eight queries to rCountP and at most six queries to rCountQ to determine whether any data point on the boundary of $R(p_i)$ dominates $p_i$. Each such query can be answered in logarithmic time: a query to sDragQ or rCountQ takes $O(\log |Q|)$ time, and a query to rCountP takes $O(\log |P|)$ time [22]. Therefore the for-loop takes $O(|P|(\log |P| + \log |Q|))$ time in total. Because we assume that $|P| \ge |Q|$, the total time complexity of the algorithm MSSQ is $O(|P| \log |P|)$ and the space complexity is $O(|P|)$.
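As a reading aid, the query counts stated above can be tallied into a single bound. This only restates the analysis in the text, treating the per-iteration counts (4 + 4 + 6 queries on the structures built on Q, 8 queries on rCountP) as upper bounds:

```latex
\begin{aligned}
T &= \underbrace{O(|P|\log|P|) + O(|Q|\log|Q|)}_{\text{Lines 2 and 3}}
   + |P|\,\bigl(\underbrace{(4+4+6)\,O(\log|Q|)}_{\text{rCountQ and sDragQ}}
   + \underbrace{8\,O(\log|P|)}_{\text{rCountP}}\bigr) \\
  &= O\bigl(|P|(\log|P| + \log|Q|)\bigr) + O(|Q|\log|Q|) \\
  &= O(|P|\log|P|) \qquad \text{since } |Q| \le |P|.
\end{aligned}
```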
Fig. 7. A complete run of MSSQ for $P = \{p_1, p_2, p_3, p_4\}$ and $Q = \{q_1, q_2\}$.
Theorem 1. Given a set P of data points and a set Q of query points in the plane, the algorithm MSSQ returns the set of all skyline points in $O(|P| \log |P|)$ time.
An example of the algorithm's operation is shown in Fig. 7. For each $p \in P$, MSSQ computes R(p) and checks whether R(p) contains any other data point in its closure. Clearly, $R(p_1)$ and $R(p_2)$ do not contain any other data point (Fig. 7(a) and (b), respectively), so $p_1$ and $p_2$ are skyline points. Regions $R(p_3)$ and $R(p_4)$ contain $p_1$ in their interiors (Fig. 7(c) and (d), respectively), so $p_3$ and $p_4$ are spatially dominated by $p_1$.
Fig. 8. An L1 disk (a) and an L∞ disk (b).
Our algorithm has the following two desirable properties:
• Easy parallelization: As shown in algorithm MSSQ, each loop iteration represents an independent computation for $p_i$ and does not depend on other points. This property naturally invites loop parallelization.
• Easy extension to L∞: Our observations for the L1 disk $C(p, q)$ also hold for the L∞ disk, which is simply a 45° rotation of the L1 disk, as illustrated in Fig. 8. This shows that our algorithm also works under L∞ distance, widely used for spatial logistics in warehouses, simply by rotating the input dataset by 45° around the origin.
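The L∞ extension can be checked numerically. The sketch below assumes the linear map (x, y) → ((x + y)/2, (x − y)/2), which is a 45° rotation combined with a reflection and a uniform scaling by 1/√2; the scaling and reflection are conveniences introduced here so that distances match exactly, and since they act on all distances alike they do not affect dominance. The names `linf`, `l1`, and `to_l1_frame` are illustrative.

```python
from itertools import product
from typing import Tuple

Point = Tuple[float, float]


def linf(p: Point, q: Point) -> float:
    """Chebyshev (L-infinity) distance."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


def l1(p: Point, q: Point) -> float:
    """Manhattan (L1) distance."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def to_l1_frame(z: Point) -> Point:
    """Map z so that L-infinity distances in the input equal L1 distances
    in the output: |a| + |b| = max(|a + b|, |a - b|)."""
    return ((z[0] + z[1]) / 2, (z[0] - z[1]) / 2)


# Exhaustive check on a small integer grid (halves are exact in floating point).
grid = list(product(range(-3, 4), repeat=2))
assert all(linf(p, q) == l1(to_l1_frame(p), to_l1_frame(q))
           for p in grid for q in grid)
print(linf((1, 4), (3, -1)), l1(to_l1_frame((1, 4)), to_l1_frame((3, -1))))
```

Under this assumption, an L∞ skyline query on (P, Q) can be answered by running an L1 skyline algorithm on the transformed copies of P and Q.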
The algorithm can be generalized to higher dimensions. In $\mathbb{R}^d$, R(p) is defined as follows. Any L1 ball is a scaled and translated copy of the cross polytope, and the cross polytope has $2^{d-1}$ pairs of parallel facets in $\mathbb{R}^d$ [25]. So, an L1 ball and R(p) can each be defined by $2^{d-1}$ pairs of parallel facets. The intersection of two L1 balls can be computed in $O(2^d)$ time by considering their facets, so we can compute R(p) in $O(2^d |Q|)$ time. To check emptiness of R(p), we use the orthogonal range query structure [22]. For each pair of parallel facets of the d-dimensional cross polytope, we set an imaginary axis that is perpendicular to the facets. As a result, we obtain a $2^{d-1}$-dimensional space. We map the points in P into this $2^{d-1}$-dimensional space, and then construct the orthogonal range query structure for them. By using this structure, we can check emptiness of R(p) efficiently. The orthogonal range query structure can report emptiness of R(p) in $O(\log^{2^{d-1}-1} |P|)$ time once it is constructed in $O(|P| \log^{2^{d-1}-1} |P|)$ time and space. Therefore we can compute all the skyline points in $\mathbb{R}^d$ in $O(|P|(\log^{2^{d-1}-1} |P| + 2^d |Q|))$ time and $O(|P| \log^{2^{d-1}-1} |P|)$ space for $d \ge 3$.
6. Tracing moving query points
In this section, we introduce a variation of the MSSQ problem, where data points are fixed and each query point $q_i$ moves either vertically or horizontally at unit speed. More precisely, for a nonnegative real number $t$, let $q_i(t)$ denote the translation of $q_i$ at time $t$, that is, $q_i(t) := q_i + (t, 0)$ (or $q_i(t) := q_i - (t, 0)$) if $q_i$ moves along a horizontal line, and $q_i(t) := q_i + (0, t)$ (or $q_i(t) := q_i - (0, t)$) if $q_i$ moves along a vertical line. Let $Q(t)$ be the set of query points at time $t$, and let $R(p, t)$ be $R(p)$ defined by $Q(t)$. Our goal is to report the changes of the skyline points while $t$ increases.

Without loss of generality, we assume that $p$ lies in the interior of the bottom left side of $R(p, t)$. The cases where $p$ lies in the interior of some other side of $R(p, t)$ can be handled by mirroring the data points and query points along the vertical line or the horizontal line through $p$. The cases where $p$ lies at a corner of $R(p, t)$ can be handled by combining two of the cases above.

6.1. When a single query point moves

We assume that all query points are fixed, except one query point $q$ which belongs to the top right quadrant of a data point $p$ when $t = 0$, and moves vertically downward, that is, $q(t) := q - (0, t)$. Then there is a real number $t_1 \ge 0$ such that $q(t_1)$ belongs to the boundary of the top right and bottom right quadrants. This allows us to divide the path of $q$ into two subpaths, the path from $q(0)$ to $q(t_1)$ and the remaining part, and to focus on a single subpath contained in one quadrant. So from now on, we consider only the path from $q(0)$ to $q(t_1)$.

Observation 1. For any real numbers $t'$ and $t''$ satisfying $0 \le t' < t'' \le t_1$, we have $C(p, q(t'')) \subseteq C(p, q(t'))$ and $R(p, t'') \subseteq R(p, t')$.

The observation above follows from the definition of the $L_1$ circle and $d(p, q(t'')) \le d(p, q(t'))$. This observation, together with the cases in Section 4, implies that (1) once a nonskyline point becomes skyline, it remains skyline, and (2) once a skyline point becomes nonskyline, it remains nonskyline. The second claim can easily be understood if we reverse the translation of $q$, that is, if we move it from $q(t_1)$ to $q(0)$.

From this we can compute all skyline points as follows. First we check whether each $p \in P$ is skyline or not at time $t = 0$ as explained in Section 5. Then for each data point $p$, we compute the time $t$ ($0 \le t \le t_1$) when $p$ changes its membership in the set of skyline points, if it does, by updating $C(p, q(t))$ and $R(p, t)$ while $t$ increases.

6.1.1. Events

Here we show only the case in which a nonskyline point becomes skyline during the time period from 0 to $t_1$. The other case can be shown symmetrically. Note that $q = q(0)$. We denote by $R'(p)$ the intersection of $C(p, q')$ for all queries $q' \in Q \setminus \{q\}$. Thus $R(p, t) = R'(p) \cap C(p, q(t))$, which implies that the membership of $p$ does not change during the time interval when $R'(p) \subseteq C(p, q(t))$.
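Observation 1 can be illustrated with a small numerical check. The sketch below is an illustration only, with hypothetical helper names (`ball_inside`, `circles_are_nested`). It treats $C(p, q)$ as the $L_1$ ball centered at $q$ with radius $d(p, q)$ and uses the standard fact that one ball of a norm contains another exactly when the distance between the centers plus the inner radius does not exceed the outer radius.

```python
def l1(a, b):
    """Manhattan (L1) distance between two points in the plane."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def ball_inside(inner_center, inner_radius, outer_center, outer_radius, eps=1e-9):
    """True if the inner L1 ball is contained in the outer L1 ball."""
    return l1(inner_center, outer_center) + inner_radius <= outer_radius + eps


def circles_are_nested(p, q, steps=10):
    """Check C(p, q(t'')) is inside C(p, q(t')) for sampled t' < t''.

    q starts in the top right quadrant of p and moves vertically
    downward until it reaches the horizontal line through p (time t1).
    """
    t1 = q[1] - p[1]
    times = [t1 * i / steps for i in range(steps + 1)]
    for t_early, t_late in zip(times, times[1:]):
        q_early = (q[0], q[1] - t_early)
        q_late = (q[0], q[1] - t_late)
        if not ball_inside(q_late, l1(p, q_late), q_early, l1(p, q_early)):
            return False
    return True
```

Because containment is transitive, checking consecutive sample times is enough to cover every sampled pair $t' < t''$.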
Since $p$ is not skyline at time 0, $R(p, 0)$ contains some data point dominating $p$. Now observe that $C(p, q(t))$ shrinks during the translation of $q$ only at the two top sides of $C(p, q(t))$. More precisely, the top left and top right sides move vertically downward at the same speed. Fig. 9 illustrates that $C(p, q(t))$ sweeps $R'(p)$ either (a) by one side only, or (b) by two sides.

Lemma 5. A nonskyline point $p$ becomes skyline when the moving sides of $C(p, q(t))$ together sweep over all data points dominating $p$, and this is the only case in which $p$ becomes skyline.

Proof. A data point $p$ is skyline if and only if $R(p)$ does not contain any data point dominating $p$. Since $p$ is not skyline at time 0, $R(p)$ contains some data point dominating $p$. As $t$ increases, $C(p, q(t))$ shrinks at its two top sides, and $R(p, t) = R'(p) \cap C(p, q(t))$ also shrinks at its two top sides accordingly, as illustrated in Fig. 9. So $p$ becomes skyline if and only if the moving sides of $C(p, q(t))$ together sweep over all data points dominating $p$. And once $R(p, t)$ contains no data point dominating $p$, it never contains such a point again until $t = t_1$. □

Lemma 6. Assume that there are at least two distinct data points, $p'$ and $p''$, lying in the interior of the bottom left side of $R(p)$. Then either both points dominate $p$ or neither of them dominates $p$. This also holds for any two data points lying in the interior of the bottom right side of $R(p)$.

Proof. Assume to the contrary that $p'$ dominates $p$ but $p''$ does not dominate $p$. This occurs only when an $L_1$ circle $C(p, q')$ for some $q' \in Q$ contains $p'$ in its interior but contains $p''$ on one of its sides. That side cannot be the bottom left or top right side of $C(p, q')$, because it would also contain $p'$ on the same side. But this contradicts the assumption that $p'$ and $p''$ lie in the interior of the bottom left side of $R(p) = \bigcap_{q' \in Q} C(p, q')$. □

We can compute this event by segment dragging queries in $O(\log |P|)$ time as follows.
Imagine that we do the translation of $q$ in reverse, starting from $C(p, q(t_1))$. Clearly, $C(p, q(t_1))$ does not contain any data point dominating $p$, because $p$ is skyline at time $t_1$. Then $p$ becomes nonskyline when a data point dominating $p$ is hit by the moving sides for the first time. If $C(p, q(t))$ sweeps $R'(p)$ by one side only, then we can use a parallel query and get the first data point dominating $p$ hit by the side. If two sides of $C(p, q(t))$ together sweep $R'(p)$, then two dragging-out-of-a-corner queries give us the answer.

We compute $R'(p)$ for each $p \in P$ in $O(\log |Q|)$ time, and check whether a nonskyline point $p$ becomes skyline, or a skyline point $p$ becomes nonskyline, by a few segment dragging queries in $R'(p)$ in $O(\log |P|)$ time as described above. Therefore our algorithm reports all the changes of skyline points when one query point moves in $O(|P| \log |P|)$ time with $O(|P|)$ space.

Fig. 9. $R'(p)$ (gray rectangle) and $C(p, q(t))$ while $q$ moves vertically downward. (a) The top left side of $C(p, q(t))$ sweeps $R'(p)$, and (b) two top sides together sweep $R'(p)$.

6.2. When multiple query points move

In this section, we consider the case in which more than one query point moves vertically or horizontally at unit speed from time 0 to $t_1$. The efficiency of our algorithm in the previous section comes from Observation 1, which states that $R(p, t'') \subseteq R(p, t')$ for any $0 \le t' < t'' \le t_1$ when a single query point moves. But this is not necessarily the case when more than one query point moves. Fig. 10 shows such an example where two query points move. At time $t = 0$, the $L_1$ circle $C(p, q(0))$ is contained in the other $L_1$ circle $C(p, q'(0))$, and $R(p, 0) = C(p, q(0))$. Then as $t$ increases, $C(p, q(t))$ expands but $C(p, q'(t))$ shrinks such that the top left side of $R(p, t)$ determined by $C(p, q'(t))$ moves downward along its bottom left side, and the top right side determined by $C(p, q(t))$ moves upward along its bottom right side. So $R(p, t'') \not\subseteq R(p, t')$ for any $0 \le t' < t'' \le t_1$, and we need an approach different from the one of the previous section.

To solve this problem we subdivide the timeline into intervals such that in each interval, the four determinants (the query points whose $L_1$ circles determine the four sides of $R(p, t)$) remain the same. Then we check whether $p$ changes its membership in the skyline set within the interval. We bound how many times the determinants of $R(p)$ change, and also how many times $p$ changes its membership in the skyline set.

6.2.1. Computing determinants from time 0 to $t_1$

Without loss of generality we assume that $q$ is a query point in the top right quadrant of $p$ at time 0. Imagine that $q$ moves vertically downward at unit speed and hits the horizontal line through $p$ at time $t_1$, as in the previous section. Then $C(p, q(t))$ is defined by its four sides. In other words, $C(p, q(0))$ is the common intersection of four halfspaces such that the supporting lines of two of them have orientation 45° and the supporting lines of the other two have orientation 135°. The supporting lines of the four halfspaces have the same distance to $q$. As $q$ moves vertically downward at unit speed, the top left and top right supporting lines move vertically downward as well but at speed 2 (that is, by $2t$ after time $t$), while the other two supporting lines are stationary. The top left and top right supporting lines move vertically upward at speed 2 if $q$ moves vertically upward. If $q$ moves horizontally, the top right and bottom right supporting lines move horizontally as well at speed 2, while the other two supporting lines are stationary.

By definition, the top left side of $R(p, t)$ is determined by the query point whose $L_1$ circle has the lowest top left side among all $L_1$ circles of query points. It is not difficult to see that the other sides of $R(p, t)$ are also determined by extreme ones of the $L_1$ circles.

Lemma 7. The $L_1$ circle determining the sides of $R(p, t)$ changes $O(|Q|)$ times.

Proof. We first assume that $p$ lies in the interior of the bottom left side of $R(p)$ and show the lemma for the top left side of $R(p)$. The lower envelope of the y-intercepts of the lines supporting the top left sides of all $L_1$ circles over time $t$ represents the combinatorial changes (and their corresponding times) of the top left side of $R(p, t)$. That is, the timeline splits into intervals at these changes such that in each interval, the $L_1$ circle determining the top left side of $R(p, t)$ remains the same. It is obvious that the top left side of an $L_1$ circle is stationary if the corresponding query is stationary or moves horizontally. Otherwise, the top left side of an $L_1$ circle either (1) moves vertically downward at speed 2 (if the query moves downward) or (2) moves vertically upward at speed 2 (if the query moves upward). Therefore, the y-intercept of the supporting line of a top left side is either constant or a linear function of $t$ with slope $2$ or $-2$.
Since every function graph has one of the three slopes in $\{0, 2, -2\}$, the lower envelope of those graphs consists of at most three line segments with slopes in $\{0, 2, -2\}$. Thus, we conclude that the $L_1$ circle determining the top left side of $R(p, t)$ changes at most three times. This can be shown for the other sides of $R(p, t)$ analogously. When $p$ lies on the bottom corner of $R(p, t)$, the sides of $R(p, t)$ change by the combination of two cases, one with $p$ in the bottom left side and the other with $p$ in the bottom right side of $R(p, t)$. Since each moving query goes from the top right quadrant to the top left quadrant and vice versa at most once, there are $O(|Q|)$ changes in the set of query points contained in each quadrant. Thus, the $L_1$ circle determining a side of $R(p, t)$ changes $O(|Q|)$ times. □

Fig. 10. A query point $q$ moves vertically upward and another query point $q'$ moves vertically downward from time 0 to $t_1$.

6.2.2. Computing the changes of skyline points

We can compute the changes of skyline points based on the proof of the lemma above. We first compute whether $p$ is skyline or not at time 0. At that time, we count the number of data points dominating $p$. Then, we compute, for each side of $R(p, t)$, the lower envelope of the y-intercept graphs defined in the proof of the lemma above, and compute the events at which the supporting line determining the corresponding side of $R(p, t)$ changes. These events split the timeline into intervals. In each interval, we handle the data points that are hit by the moving sides of $R(p, t)$ in order, using two sorted lists of $P$, one sorted along orientation 45° and the other along orientation 135°.

Lemma 8. The sides of $R(p, t)$ hit a data point a constant number of times.

Proof. The y-intercept graph of the top left supporting line of an $L_1$ circle consists of at most two segments with slopes in $\{0, 2, -2\}$, but not of two segments with slopes $2$ and $-2$. Therefore, the lower envelope of the y-intercept graphs has only one local minimum or local maximum, and a data point is hit by the top left side of $R(p, t)$ at most twice. This also holds for the other sides of $R(p, t)$. □

By the above lemma, for each data point $p$ there are $O(|P|)$ events in total where a data point is hit by a side of $R(p, t)$. For each such event, we spend $O(|Q|)$ time to decide whether the data point dominates $p$ or not, and update the counter of data points dominating $p$. Therefore, it takes $O(|P|^2 |Q|)$ time in total to compute the changes of skyline points.

Fig. 11 shows a worst case example. In Fig. 11(a), all query points lie in $Q_1$ and $Q_2$ of $p_1$, so $p_1$ is on the bottom corner of $R(p_1, t)$ for $0 \le t \le t_1$.
The top left side and the top right side of $R(p_1, t)$ are determined by $q_2$ and $q_1$, respectively. There are $\lceil |P|/2 \rceil$ data points on the horizontal line segment connecting the top corners of $R(p_1, 0)$ and $R(p_1, t_1)$. The distance between any two consecutive points on the line $\ell$ through the top corners of $R(p_1, 0)$ and $R(p_1, t_1)$ is greater than or equal to some constant $d_c$. So, whenever $R(p_1, t)$ hits a data point $p'$ on the line $\ell$, $R(p_1, t)$ contains no data points in its interior, and we need to check whether $p'$ dominates $p_1$ or not. Fig. 11(b) shows another data point $p_2$ located to the left of $p_1$. Since $C(p_1, q_1(t)) \subset C(p_2, q_1(t))$ and $C(p_1, q_2(t)) \subset C(p_2, q_2(t))$, the top corner of $R(p_1, t)$ is located vertically below the top corner of $R(p_2, t)$ at distance $d(p_1, p_2)$. Thus, the length of the intersection of $\ell$ and $R(p_2, t)$ is twice $d(p_1, p_2)$. If this length is smaller than or equal to $d_c$, then whenever $R(p_2, t)$ hits a data point on $\ell$, $R(p_2, t)$ does not contain any data points in its interior. Therefore, we need to check dominance at each event. Both $p_1$ and $p_2$ change from skyline points to nonskyline points and vice versa $O(|P|)$ times. We can place the other $\lfloor |P|/2 \rfloor - 2$ data points between $p_1$ and $p_2$ such that all of them also change from skyline points to nonskyline points and vice versa $O(|P|)$ times. Thus, there are $O(|P|^2)$ changes.

Fig. 11. A worst case example.

7. Implementation

In our implementation of MSSQ, an R-tree is used to efficiently prune out nonskyline points from $P$. More specifically, we first find a range bounding $Q$ and read a constant number of points in this region from the R-tree. For each such point $p$, we identify the bounding box for $\bigcup_{i=1}^{|Q|} C(p, q_i)$. Any point outside of this bounding box can be safely pruned as it would be dominated by $p$. We intersect such bounding boxes and retrieve the points falling into this region, which can be efficiently supported by the R-tree, as the intersected region is also a rectangular range. We call the reduced dataset $P'$. For fair comparison, we build all our baselines to use this reduced dataset $P'$. In this section, we discuss our two baselines, PSQ and BBS, which represent the current state of the art for spatial and classic skyline algorithms, respectively.

7.1. PSQ

PSQ builds upon Lemmas 1 and 2 in [9] to compute skyline queries in $L_1$ metric space. By these two lemmas, after sorting the points in $P$ in ascending order of distance from some query point $q \in Q$, we can compute all skyline points in $O(|P| |S| |Q|)$ time. Specifically, in this sorting, if some points have the same distance from $q$, then we break the tie by the distance from the other query points in $Q$. After this sorting, we visit each point $p$ in the sorted order and decide whether it is skyline or not by testing dominance against the skyline points already found. This algorithm is essentially that of [8,9]. The algorithm of Deng et al. [21] works essentially in the same way as PSQ when it does not use road networks. The following pseudocode formally presents PSQ.
Algorithm (PSQ).
Input: $P'$, $Q$
Output: $S$
initialize the list $S$ and the array $A$
$A \leftarrow (p, d(p, q))$ for all $p \in P'$ and one query point $q \in Q$
sort $A$ by distance in ascending order
for $i \leftarrow 1$ to $|P'|$
    do if $A[i]$ is not dominated by points in $S$
        then insert $A[i]$ into $S$
return $S$

7.2. BBS

Meanwhile, BBS is a well-known algorithm for general skyline problems [4]. To apply a general skyline algorithm to spatial problems, we need to compute the distance of each point in $P'$ from all query points, and then generate $|Q|$-dimensional data, where dimension $i$ of point $p$ represents the distance of $p$ to $q_i$. The following pseudocode summarizes this transformation procedure.

Algorithm (BBS).
Input: $P'$, $Q$
Output: $S$
initialize the list $S$ and the arrays $A_i$, where $i$ is an integer with $1 \le i \le |Q|$
for $i \leftarrow 1$ to $|Q|$
    do $A_i \leftarrow d(p, q_i)$ for all $p \in P'$
run BBS on $A$ to get $S$
return $S$

To prune out data points that cannot be skyline points, BBS traverses the R-tree for the $|Q|$-dimensional points. During the traversal, BBS sorts the data points in ascending order of sum-of-coordinates. For the sorted list, it also spends $O(|P| |S| |Q|)$ time to compute all the skyline points.

Table 2. Parameters used for synthetic datasets.

| Parameter | Setting |
| Distribution of data points | Uniform distribution, Gaussian distribution |
| Dataset cardinality | 100k, 500k, 1M, 2M |
| The number of points in a query | 4, 8, 12, 16, 20 |
| Standard deviation of points in a query | 0.02, 0.04, 0.06, 0.08, 0.1 |
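As a rough illustration of the two baselines described in Sections 7.1 and 7.2, the sketch below maps each point to its $|Q|$-dimensional vector of $L_1$ distances (the transformation used for BBS) and then applies the sort-and-filter idea of PSQ. It is a simplified sketch under stated assumptions, not the authors' implementation: it uses no R-tree, the function names are hypothetical, and the tie-breaking rule is realized by sorting the distance vectors lexicographically.

```python
def l1(a, b):
    """Manhattan (L1) distance between two points in the plane."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def distance_vector(p, queries):
    """Map p to its |Q|-dimensional vector of distances to the query points."""
    return tuple(l1(p, q) for q in queries)


def dominates(u, v):
    """u dominates v: no coordinate is larger and at least one is smaller."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))


def psq_skyline(points, queries):
    """Sort-and-filter spatial skyline in the spirit of PSQ.

    Lexicographic order on distance vectors sorts by the distance to
    queries[0] and breaks ties by the distances to the other queries,
    so any point that dominates p is visited before p.
    """
    keyed = sorted((distance_vector(p, queries), p) for p in points)
    skyline = []  # list of (distance vector, point) pairs found so far
    for vec, p in keyed:
        if not any(dominates(svec, vec) for svec, _ in skyline):
            skyline.append((vec, p))
    return [p for _, p in skyline]
```

For example, with queries `[(0, 0), (4, 0)]` and points `[(2, 2), (3, 0), (1, 0), (2, 0)]`, the point `(2, 2)` has distance vector `(4, 4)` and is dominated by `(2, 0)` with vector `(2, 2)`, while the other three points are mutually incomparable.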
8. Experimental evaluation

In this section, we outline our experimental settings, and present evaluation results to validate the efficiency and effectiveness of our framework. We compare our algorithm (MSSQ) with PSQ and BBS. As datasets, we use both synthetic datasets and a real dataset of points of interest (POI) in California. We carry out our experiments on Linux with an Intel Q6600 CPU and 3 GB memory, and the algorithms are coded in C++.
Fig. 12. Effect of the dataset cardinality for synthetic datasets of uniform distribution: (a) $|Q| = 4$; (b) $|Q| = 12$; (c) $|Q| = 20$; and (d) $|S|$.
Fig. 13. Effect of $|Q|$ for synthetic datasets of uniform distribution: (a) $|P| = 1$M; (b) $|P| = 2$M; (c) MSSQ for large $|Q|$; and (d) $|S|$ for large $|Q|$.
8.1. Experimental settings

8.1.1. Synthetic dataset

A synthetic dataset contains up to two million random locations in a 2D space. The space of the datasets is limited to the unit space, i.e., the lower and upper bounds of all points are 0 and 1 for each dimension, respectively. Specifically, we use four synthetic datasets with 100k, 500k, 1M, and 2M points. Data points are drawn from a uniform distribution or a Gaussian distribution. The default distribution of the synthetic dataset is the uniform distribution. The parameters are summarized in Table 2. We also generate queries randomly using the parameters in Table 2. Query points are normally distributed with standard deviation $\sigma$ to control the distribution. When $\sigma$ is low, query points are clustered in a small area, and when it is high, they are scattered over a wide area. We set $\sigma$ to 0.06, except for the experiment showing the effect of the standard deviation in Fig. 15. For data of Gaussian distribution in Fig. 16, we set $\sigma$ to 0.02.
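A dataset of this kind could be generated along the following lines. This is only a sketch: the function name `make_synthetic_dataset`, the query center `(0.5, 0.5)`, and the fixed seed are assumptions made for illustration, since the text does not state where the query distribution is centered or how randomness is seeded. Only the uniform data distribution is covered.

```python
import random


def make_synthetic_dataset(num_points, num_queries, sigma,
                           center=(0.5, 0.5), seed=0):
    """Generate uniform data points and normally distributed query points.

    Data points are drawn uniformly from the unit square. Query points
    are drawn from a normal distribution with standard deviation sigma
    around `center`, which is an assumed value, not taken from the paper.
    """
    rng = random.Random(seed)
    points = [(rng.random(), rng.random()) for _ in range(num_points)]
    queries = [(rng.gauss(center[0], sigma), rng.gauss(center[1], sigma))
               for _ in range(num_queries)]
    return points, queries
```

With the default setting described above, a call such as `make_synthetic_dataset(100_000, 4, 0.06)` would produce the smallest configuration in Table 2.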
8.1.2. POI dataset

We also validate our proposed framework using a real-life dataset. In particular, we use a sampled POI dataset that has 104,770 locations in 63 different categories, as shown in Fig. 17(a).²

Fig. 14. Breakdown of the response times in Fig. 12(a).

Table 3. Breakdown of the response times in Fig. 12(a). All times are in seconds.

| # of data | Compute SSP | Construct D.S | R-tree OP | Total |
| 100k | 0.004 | 0.003 | 0.051 | 0.058 |
| 500k | 0.034 | 0.016 | 0.361 | 0.411 |
| 1M | 0.052 | 0.025 | 0.510 | 0.587 |
| 2M | 0.144 | 0.080 | 1.466 | 1.690 |
² Available at http://www.cs.fsu.edu/~lifeifei/SpatialDataset.htm.

8.2. Efficiency of MSSQ

We first validate the efficiency in Fig. 12 by comparing the response times of the three algorithms over varying data size $|P|$. From the figure, we can observe that our proposed algorithm is highly scalable over varying $|P|$, consistently outperforming both baselines. The performance gap widens as $|P|$ and $|Q|$ increase. For example, when $|P| = 2$M and $|Q| = 20$, our algorithm is up to 100 times faster than BBS. Note that we may have a large number of skyline points for a large dataset. These skyline points can be used as candidates for further selection or optimization determined by user preference.

Fig. 15. Effect of the standard deviation for query points.

Fig. 16. Effect of the dataset cardinality (a, b and c) and $|Q|$ (d) for synthetic datasets of Gaussian distribution: (a) $|Q| = 4$; (b) $|Q| = 12$; (c) $|Q| = 20$; and (d) $|P| = 2$M.

Fig. 13 similarly studies the effect of the query size $|Q|$. We again observe that our algorithm is the clear winner in all settings, outperforming BBS by up to 100 times when $|P| = 2$M and $|Q| = 20$. Recall that the worst-case asymptotic running time of our algorithm is $O(|P| \log |P| + |Q| \log |Q|)$, but by our assumption that $|P| \ge |Q|$, the second term on $|Q|$ is dominated by the first one and is omitted in the simplified time complexity. Therefore, the average running times measured from experiments can fluctuate depending on the number of query points.
Moreover, there are architecture-specific details, such as the caching strategy, that affect the running time. In Fig. 13(c) and (d), we test our algorithm for large numbers of query points: 50, 100, 150, 200, 250, and 300. MSSQ($i$) denotes the response time when $|P| = i$. Even for large $|Q|$, our algorithm takes only a few seconds. Fig. 13(d) shows $|S|$ over varying $|Q|$. The number of skyline points increases as the query size increases. A large number of skyline points can be used as candidates for further selection or optimization determined by user preference. For instance, a straightforward way of finding top-k skyline points is to compute all the skyline points by our algorithm and then to use a selection algorithm to find the top-k skyline points among them, which can be done in the same time and space complexity as MSSQ.
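The top-k post-processing step can be sketched as follows. The ranking criterion is not specified in the text, so the score used here, the sum of $L_1$ distances to all query points, is purely an assumption for illustration, and `top_k_skyline` is a hypothetical helper. The selection itself uses `heapq.nsmallest` from the Python standard library.

```python
import heapq


def l1(a, b):
    """Manhattan (L1) distance between two points in the plane."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def top_k_skyline(skyline, queries, k):
    """Select the k skyline points with the smallest total distance.

    `skyline` is assumed to be the output of a skyline algorithm such
    as MSSQ. The score (sum of distances to all query points) is an
    assumed user preference, not one prescribed by the paper.
    """
    def total_distance(p):
        return sum(l1(p, q) for q in queries)

    return heapq.nsmallest(k, skyline, key=total_distance)
```

Any other user-defined score could be passed in the same way, since the selection step only needs a key function over the already computed skyline points.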
For closer observation, Fig. 14 and Table 3 show the breakdown of our response times reported in Fig. 12(a) (i.e., when $|Q| = 4$). From this breakdown, we can observe that I/O costs (of traversing the R-tree) dominate the response times, and the second dominant factor is the computation of dominance tests. The remaining cost, including that of building data structures, is insignificant. Fig. 15 shows the effect of the standard deviation of query points. Fig. 16 shows the experimental results for data points drawn from a Gaussian distribution. We observe that the results show trends similar to the ones for uniformly distributed datasets.

Lastly, we report our results for the real-life California POI data shown in Fig. 17(a). The response times are reported in Fig. 17(b) for varying $|Q|$. We observe that our proposed algorithm consistently outperforms the baselines, and the performance gap increases as the query size increases. The speedup of our algorithm was up to 12 (when $|Q| = 20$). This finding is consistent with that from synthetic data.

Fig. 17. Effect of $|Q|$ for POI datasets: (a) 10k sampled points from the California POI dataset.

8.3. Efficiency of tracing moving query points

In this section, we present evaluation results of tracing moving query points. We denote by MoveOne the algorithm for tracing a single moving query point in Section 6.1. We use the same experimental settings in Table 2 for the evaluation. To show the efficiency of MoveOne, we compare the running time of MoveOne for computing all the intermediate skyline points with the running time of algorithm MSSQ executed at 10 regular intervals on the path of the moving point. The experimental results show that the skyline set changes at least a few hundred times, and MoveOne runs faster than the 10 executions of MSSQ. Specifically, Fig. 18(a) shows the response times of MSSQ and MoveOne over varying data sizes of $P$ when $|Q| = 8$. Clearly, MoveOne shows better performance than MSSQ. Fig. 18(b) shows the response times of MoveOne over a varying number of moving sequences when we fix $|P| = 2$M and $|Q| = 8$. The response time grows slowly when we increase the number of moving sequences.

Fig. 18. Response times of MoveOne: (a) $|Q| = 8$ and (b) $|Q| = 16$.

Table 4 shows the average number of skyline changes for a moving query point when $|P| = 2$M, where S and NS denote the number of skyline points and nonskyline points, respectively. The column S-NS shows the number of data points that are skyline points in the beginning, but change into nonskyline points later. The other columns similarly show the number of corresponding changes.

Table 4. The average number of skyline changes for a moving query point.

| $|Q|$ | S | NS | S-NS | NS-S | NS-S-NS |
| 8 | 26,544.5 | 76,909.0 | 505.1 | 1398.7 | 7.6 |
| 12 | 32,300.4 | 80,566.0 | 357.6 | 1008.1 | 5.3 |
| 16 | 39,072.9 | 89,037.3 | 368.4 | 935.6 | 4.0 |

Let MoveAll denote the algorithm for tracing moving query points described in Section 6.2 when all query points are moving. Fig. 19 shows the efficiency of MoveAll. In the figure, (a) and (b) show the response times of MSSQ and MoveAll over varying data size $|P|$ when $|Q| = 8$ and $|Q| = 16$, respectively. Again we run MSSQ at 10 regular intervals on the paths of the moving points. The experimental results show that MoveAll runs faster than the 10 executions of MSSQ.

Fig. 19. Response times of MoveAll: (a) $|Q| = 8$ and (b) $|Q| = 16$.
9. Conclusion

We have studied Manhattan spatial skyline query processing and presented an efficient algorithm. We showed that our algorithm can identify the correct result in $O(|P| \log |P|)$ time with the desirable properties of easy parallelizability and extensibility. We also proposed algorithms for spatial skyline queries when query points move either vertically or horizontally. These algorithms run in $O(|P| \log |P|)$ time when only one query point moves and in $O(|P|^2 |Q|)$ time when more than one query point moves. Our algorithms assume that each query point moves in a fixed direction. We can handle a more complex motion by decomposing it into a sequence of axis-aligned motions and applying our algorithm to the sequence. In addition, our extensive experiments validated the efficiency and effectiveness of our proposed algorithms using both synthetic and real-life data.

References

[1] H.T. Kung, F. Luccio, F. Preparata, On finding the maxima of a set of vectors, J. Assoc. Comput. Mach. 22 (4) (1975) 469–476.
[2] S. Börzsönyi, D. Kossmann, K. Stocker, The skyline operator, in: ICDE '01: Proceedings of the 17th International Conference on Data Engineering, IEEE Computer Society, Washington, DC, USA, 2001, pp. 421–430.
[3] K. Tan, P. Eng, B.C. Ooi, Efficient progressive skyline computation, in: VLDB '01: Proceedings of the 27th International Conference on Very Large Data Bases, 2001, pp. 301–310.
[4] D. Papadias, Y. Tao, G. Fu, B. Seeger, An optimal and progressive algorithm for skyline queries, in: SIGMOD '03: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, 2003, pp. 467–478.
[5] J. Chomicki, P. Godfrey, J. Gryz, D. Liang, Skyline with presorting, in: ICDE '03: Proceedings of the 19th International Conference on Data Engineering, 2003, pp. 717–719.
[6] M. Sharifzadeh, C. Shahabi, The spatial skyline queries, in: VLDB '06: Proceedings of the 32nd International Conference on Very Large Data Bases, 2006, pp. 751–762.
[7] M. Sharifzadeh, C. Shahabi, L. Kazemi, Processing spatial skyline queries in both vector spaces and spatial network databases, ACM Trans. Database Syst. 34 (3) (2009) 1–43.
[8] W. Son, M.-W. Lee, H.-K. Ahn, S.-w. Hwang, Spatial skyline queries: an efficient geometric algorithm, in: SSTD '09: Proceedings of the 11th International Symposium on Spatial and Temporal Databases, 2009, pp. 247–264.
[9] M.-W. Lee, W. Son, H.-K. Ahn, S.-w. Hwang, Spatial skyline queries: exact and approximation algorithms, GeoInformatica 15 (4) (2011) 665–697.
[10] G. Cormier, Operational research methods for efficient warehousing, in: A. Langevin, D. Riopel (Eds.), Logistics Systems, Springer, US, 2005, pp. 93–122.
[11] D. Kossmann, F. Ramsak, S. Rost, Shooting stars in the sky: an online algorithm for skyline queries, in: VLDB '02: Proceedings of the 28th International Conference on Very Large Data Bases, 2002, pp. 275–286.
[12] P. Godfrey, R. Shipley, J. Gryz, Maximal vector computation in large data sets, in: VLDB '05: Proceedings of the 31st International Conference on Very Large Data Bases, 2005, pp. 229–240.
[13] C.-Y. Chan, H.V. Jagadish, K.-L. Tan, A.K.H. Tung, Z. Zhang, On high dimensional skylines, in: EDBT '06: Proceedings of the 10th International Conference on Extending Database Technology, 2006, pp. 478–495.
[14] C.-Y. Chan, H.V. Jagadish, K.-L. Tan, A.K.H. Tung, Z. Zhang, Finding k-dominant skylines in high dimensional space, in: SIGMOD '06: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, ACM, New York, NY, USA, 2006, pp. 503–514.
[15] X. Lin, Y. Yuan, Q. Zhang, Y. Zhang, Selecting stars: the k most representative skyline operator, in: ICDE '07: Proceedings of the 23rd International Conference on Data Engineering, 2007, pp. 86–95.
[16] N. Roussopoulos, S. Kelley, F. Vincent, Nearest neighbor queries, in: SIGMOD '95: Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, 1995, pp. 71–79.
[17] S. Berchtold, C. Böhm, D.A. Keim, H.-P. Kriegel, A cost model for nearest neighbor search in high-dimensional data space, in: PODS '97: Proceedings of the 16th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 1997, pp. 78–86.
[18] K.S. Beyer, J. Goldstein, R. Ramakrishnan, U. Shaft, When is “nearest neighbor” meaningful?, in: ICDT '99: Proceedings of the 7th International Conference on Database Theory, 1999, pp. 217–235.
[19] D. Papadias, Y. Tao, K. Mouratidis, C.K. Hui, Aggregate nearest neighbor queries in spatial databases, ACM Trans. Database Syst. 30 (2) (2005) 529–576.
[20] X. Huang, C.S. Jensen, In-route skyline querying for location-based services, in: W2GIS '04: Proceedings of the International Workshop on Web and Wireless Geographical Information Systems, 2004, pp. 120–135.
[21] K. Deng, X. Zhou, H. Shen, Multi-source skyline query processing in road networks, in: ICDE '07: Proceedings of the 23rd International Conference on Data Engineering, 2007, pp. 796–805.
[22] P.K. Agarwal, J. Erickson, Geometric range searching and its relatives, in: Advances in Discrete and Computational Geometry, American Mathematical Society, 1999, pp. 1–56.
[23] B. Chazelle, An algorithm for segment-dragging and its implementation, Algorithmica 3 (1) (1988) 205–221.
[24] J. Mitchell, $L_1$ shortest paths among polygonal obstacles in the plane, Algorithmica 8 (1) (1992) 55–88.
[25] H.S.M. Coxeter, Regular Polytopes, Dover, New York, 1973.