Parallel k-dominant skyline queries in high-dimensional datasets


ARTICLE IN PRESS

JID: INS

[m3Gsc;January 16, 2019;21:4]

Information Sciences xxx (xxxx) xxx

Contents lists available at ScienceDirect

Information Sciences journal homepage: www.elsevier.com/locate/ins

Parallel k-dominant skyline queries in high-dimensional datasets

Yi-Wen Peng, Wei-Mei Chen 1,∗

Department of Electronic and Computer Engineering, National Taiwan University of Science and Technology, Taipei 106, Taiwan

Article info

Article history: Received 1 February 2018; Revised 19 December 2018; Accepted 11 January 2019; Available online xxx

Keywords: Skyline; k-Dominant skyline; Parallel algorithm; GPU computing

Abstract

The skyline operator has been used to select preference points in many applications. The previously proposed k-dominant skyline, which relaxes the idea of dominance, reduces the number of skyline points in high-dimensional datasets. However, retrieving both skyline and k-dominant skyline points in high-dimensional datasets is computationally expensive. In this paper, we aim to consider all attributes simultaneously when retrieving skyline and k-dominant skyline points using a new data representation. Moreover, we propose a parallel k-dominant skyline algorithm, which obtains efficiency by exploiting data parallelism. In the proposed algorithm, we introduce a data representation that significantly reduces both the number of verifications between points and the computation per verification. We implement the proposed algorithm on GPU frameworks, which are designed to perform data-parallel computation. For skyline queries, the experimental results show that the proposed algorithm outperforms the state-of-the-art GPU-based algorithms. In high-dimensional space, the proposed algorithm is up to 10 times faster than the state-of-the-art GPU-based algorithm for skyline queries. We also evaluate and discuss the performance of the proposed algorithm for k-dominant skyline queries. With the proposed data representation, each point is checked against fewer than 5% of the points on average in k-dominant skyline queries, and 95% of verifications take only two comparisons. © 2019 Published by Elsevier Inc.

1. Introduction

The skyline operator [9] retrieves the preference points from databases, which is essential for many applications involving multi-criteria analysis. In multi-objective optimization [16,17,19] in engineering and economics, this task is commonly known as finding the Pareto frontier, which is the set of all Pareto efficient or Pareto optimal allocations, or as the maxima vector problem [3,11,12,25].

Given a d-dimensional dataset D, each point p in D consists of d attributes $(p_1, p_2, \ldots, p_d)$, where $p_i$ is the attribute of p in the ith dimension. A point p ∈ D is said to dominate another point q ∈ D, written $p \prec q$, if $p_i \le q_i$ for all $i \in \{1, 2, \ldots, d\}$ and there exists $j \in \{1, 2, \ldots, d\}$ such that $p_j < q_j$. Otherwise, p does not dominate q, which is written $p \nprec q$. A point is a skyline point if it is not dominated by any point in the same dataset. A skyline query retrieves all skyline points in a given dataset. Consider the example shown in Fig. 1; points C, I, and J are skyline points because they are not dominated by any point. By contrast, the remaining points are not skyline points since each of them is dominated by at least one point.
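The dominance definition above can be sketched as a brute-force check. This is only an illustration and not the algorithm proposed in this paper: the function names `dominates` and `skyline` and the two-dimensional sample points are hypothetical, and smaller attribute values are assumed to be preferred.

```python
from typing import Dict, List, Sequence


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """True if p dominates q: p_i <= q_i in every dimension and p_j < q_j in at least one."""
    no_worse_everywhere = all(a <= b for a, b in zip(p, q))
    strictly_better_somewhere = any(a < b for a, b in zip(p, q))
    return no_worse_everywhere and strictly_better_somewhere


def skyline(points: Dict[str, Sequence[float]]) -> List[str]:
    """Brute-force skyline: labels of the points that no other point dominates."""
    return [
        name
        for name, q in points.items()
        if not any(dominates(p, q) for other, p in points.items() if other != name)
    ]


# Hypothetical 2-D sample data (not taken from the paper's figures).
POINTS = {"a": (1, 4), "b": (2, 2), "c": (4, 1), "d": (3, 3), "e": (4, 4)}

if __name__ == "__main__":
    print(skyline(POINTS))  # ['a', 'b', 'c']
```

The quadratic scan compares every pair of points attribute by attribute, which is the cost that the techniques discussed in the rest of the paper try to reduce.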

∗ Corresponding author. E-mail address: [email protected] (W.-M. Chen).
1 This work is partially supported by the Ministry of Science and Technology under Grant MOST106-2221-E-011-025.

https://doi.org/10.1016/j.ins.2019.01.039 0020-0255/© 2019 Published by Elsevier Inc.

Please cite this article as: Y.-W. Peng and W.-M. Chen, Parallel k-dominant skyline queries in high-dimensional datasets, Information Sciences, https://doi.org/10.1016/j.ins.2019.01.039

Table 1
An example of k-dominant skylines.

Label   Attributes   Skyline   k-dominant skyline
                               k=3    k=2    k=1
A       (0,5,1)      √         √      √
B       (4,3,6)
C       (5,2,3)
D       (1,6,5)
F       (7,4,7)
H       (3,0,2)      √         √
I       (6,7,0)      √         √
J       (2,1,4)      √         √

Although the skyline query is widely used in many applications, the number of skyline points increases exponentially as the number of dimensions grows, which is known as the curse of dimensionality. If the dataset D is drawn uniformly and independently from $[0, 1]^d$, the expected number of skyline points is asymptotic to $(\log n)^{d-1}/(d-1)!$, where n is the size of D [22]. We show this problem in Fig. 2 by indicating the percentages of skyline points in three synthetic datasets of one million points each, generated by the standard skyline dataset generator [9]. For the correlated dataset, the impact is comparatively small because the attributes of each point are correlated; about 20% of points are skyline points when the number of dimensions is 30. In the independent dataset, over 90% of points are skyline points when the number of dimensions is greater than 20. The growth is even faster in the anti-correlated dataset since the attributes of each point are negatively correlated. In this case, over 90% of points are skyline points when the number of dimensions is greater than 16. Skylines therefore provide less representative information in high-dimensional space since most points are identified as skyline points.

To retrieve more meaningful skyline points in high-dimensional space, the k-dominant skyline query was proposed in [10], which relaxes the definition of dominance. After the relaxation, a point p ∈ D is said to k-dominate another point q ∈ D, written $p \prec_k q$, if there are at least k attributes of p such that $p_i \le q_i$ for $i \in \{1, 2, \ldots, d\}$ and there exists $j \in \{1, 2, \ldots, d\}$ such that $p_j < q_j$. A point is a k-dominant skyline point if it is not k-dominated by any point in the same dataset. A k-dominant skyline query retrieves all k-dominant skyline points in a given dataset. The skyline query is the special case of the k-dominant skyline query where k = d. The dataset in Table 1, where n = 8 and d = 3, contains four k-dominant skyline points when k = 3, which are also the skyline points. After the relaxation, there is one k-dominant skyline point when k = 2 and none when k = 1. Hence, k-dominant skyline queries return more useful results when k is selected properly.

However, the computational complexity of the k-dominant skyline operator is higher than that of the skyline operator because, for a pair of points, we have to check their dominance relationship in $\binom{d}{k}$ possible ways. The challenge for k-dominant skyline queries is that the computational cost grows rapidly as the number of k-combinations of d attributes and the dataset size n increase. Although the skyline is the special case of the k-dominant skyline where k = d, algorithms for skyline queries are not easily adapted to k-dominant skyline queries. The existing algorithms for skyline queries obtain efficiency from transitivity [4,9,15,24,32] and incomparability [27]. Consider three points p, q, and r in a dataset. If $p \prec q$ and $q \prec r$, we have $p \prec r$ by transitivity. If $p \nprec q$ and $q \nprec p$, then p and q are incomparable. In k-dominant skyline queries, transitivity no longer holds because of cyclic dominance relationships [10]: when k < d, it is possible that $p \prec_k q$, $q \prec_k r$, and $r \prec_k p$. Fortunately, we can still improve performance by exploiting incomparability.

In this paper, we propose an efficient parallel algorithm for k-dominant skyline queries in high-dimensional space. To improve efficiency, we aim to identify incomparability as quickly as possible by considering all attributes simultaneously. Therefore, we introduce a new data representation, called the characteristic bitmap, which rearranges and records attributes in a different way. With characteristic bitmaps, we also know the strength of each point, which refers to its ability to dominate other points. In addition, incomparability can be identified using only the characteristic bitmaps of two points. Thus, the proposed algorithm obtains better performance from data parallelism. We implemented the proposed algorithm on GPU frameworks, which are designed to perform data-parallel computation.

The experimental results show that the proposed algorithm is up to 10 times faster than the state-of-the-art GPU-based algorithms for skyline queries in high-dimensional datasets. With the proposed data representation, each point is checked against fewer than 5% of the points on average, and 95% of verifications take only two comparisons.

The remainder of this paper is organized as follows. In Section 2, we review previous works that improve the efficiency of skyline and k-dominant skyline queries. In Section 3, we introduce the new data representation and its operations with examples. In Section 4, we introduce the proposed algorithms for skyline and k-dominant skyline queries. In Section 5, we evaluate the performance of the proposed algorithms for skyline and k-dominant skyline queries, respectively, and Section 6 concludes this paper.
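The k-dominance definition given in the Introduction, and the counts quoted there for Table 1, can be checked with a short brute-force sketch. The helper names `k_dominates` and `k_dominant_skyline` are hypothetical, the predicate follows the definition exactly as worded in this paper, and the three points used to show a dominance cycle are an illustrative assumption rather than data from the paper.

```python
from typing import Dict, List, Sequence

# Attribute vectors of Table 1 (smaller values are preferred).
TABLE_1 = {
    "A": (0, 5, 1), "B": (4, 3, 6), "C": (5, 2, 3), "D": (1, 6, 5),
    "F": (7, 4, 7), "H": (3, 0, 2), "I": (6, 7, 0), "J": (2, 1, 4),
}


def k_dominates(p: Sequence[float], q: Sequence[float], k: int) -> bool:
    """p k-dominates q: p_i <= q_i in at least k dimensions and p_j < q_j in some dimension."""
    no_worse = sum(a <= b for a, b in zip(p, q))
    strictly_better = any(a < b for a, b in zip(p, q))
    return no_worse >= k and strictly_better


def k_dominant_skyline(points: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Brute force: labels of the points that no other point k-dominates."""
    return [
        name
        for name, q in points.items()
        if not any(k_dominates(p, q, k) for other, p in points.items() if other != name)
    ]


if __name__ == "__main__":
    for k in (3, 2, 1):
        print(k, k_dominant_skyline(TABLE_1, k))
    # 3 ['A', 'H', 'I', 'J']
    # 2 ['A']
    # 1 []
```

With k = d = 3 the predicate reduces to ordinary dominance. The loss of transitivity is easy to reproduce: for the hypothetical points p = (0, 1, 2), q = (1, 2, 0), and r = (2, 0, 1), each point 2-dominates the next one in the cycle p, q, r, p.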

2. Related work

The concept of k-dominant skyline queries was first proposed by Chan et al. [10] to alleviate the curse of dimensionality in high-dimensional space. The authors proposed three algorithms, namely the one-scan algorithm (OSA), the two-scan algorithm (TSA), and the sorted retrieval algorithm (SRA), to retrieve k-dominant skyline points. Furthermore, they extended those algorithms to automatically determine a value of k for which at least an expected number of k-dominant skyline points is returned. After relaxing the dominance relation, the search space of k-dominant skyline queries should be significantly reduced. However, those algorithms scan all points in a given dataset, which is not efficient. The state-of-the-art algorithm, called k-ZSearch [29], efficiently performs k-dominant skyline queries by exploiting the clustering property of the Z-order curve. Although k-ZSearch showed better performance, neither TSA nor k-ZSearch is well suited for parallel and distributed environments. In [36], a parallel algorithm for k-dominant skyline queries in the MapReduce framework was proposed. The authors used a point-based bound tree (PB-tree) to split the data space for parallel computation. Other algorithms applied k-dominant skyline queries to other scenarios, such as incomplete datasets [30,31], combined datasets [2,18], and data streams [23].

We also studied previous research on skyline queries for useful properties. The skyline operator was first introduced by Borzsony et al. [9] and has received substantial research attention. The authors proposed the basic block-nested-loop (BNL) and divide-and-conquer (DnC) algorithms to compute the skyline. BNL is a naive algorithm that keeps a set of candidate skyline points and scans the dataset repeatedly. After all points have been compared with the candidates, all points remaining in the set are skyline points. In the worst case, the time complexity is $O(n^2)$. The DnC algorithm, in contrast, obtains performance from region-level relations using an m-way partitioning scheme. Borzsony et al. also replace m-way partitioning with B-tree and R-tree for space partitioning. However, neither BNL nor DnC can produce skyline points progressively.
The existing skyline algorithms can be categorized into two approaches: partition-based and sorting-based algorithms. We first discuss partition-based algorithms. In [37], two online algorithms, based on bitmaps and on a B+-tree extension, respectively, were proposed. While the bitmap-based algorithm is similar to our proposed algorithm in that it obtains performance from bit manipulations, the bitmaps are generated and used in a different way. The index algorithm maps high-dimensional points into a single-dimensional space with a transformation mechanism, and a B+-tree is used to index the transformed points. Although both algorithms provide quick initial response time compared to BNL and DnC, the memory space required in high-dimensional space is enormous. To tackle this problem, a nearest neighbor (NN) search-based algorithm was proposed [24]. With a monotonic distance function and an R*-tree [5], an NN search is performed to find the point with the minimum distance from the origin of coordinates. The results of the NN search are used to partition the space and search other regions recursively. Although NN is significantly faster for up to four dimensions compared with previous algorithms, the cost increases as the number of dimensions grows due to the overlapping area between partitions. Hence, the branch-and-bound skyline (BBS) algorithm [34], which is also based on an NN search but is optimal in terms of nodes accessed, was proposed. The major difference is that BBS visits only the nodes that might contain skyline points and does not access the same node twice. While BBS showed good efficiency for both progressive and complete skyline computation, the execution time is still long in high-dimensional space, where the R-tree is inefficient. Recently, a correlation-aware approach for skyline queries in high-dimensional datasets, called HashSkyline [38], was proposed. The HashSkyline algorithm uses a hash-based method to speed up skyline queries for correlated datasets.
Among sorting-based algorithms, sort-filter-skyline (SFS) [15] was the first algorithm proposed after BNL. SFS presorts all points according to an entropy function before checking the dominance relations between points. In other words, SFS performs a topological sort with respect to the dominance relations. SFS also maintains a window like BNL, but points added to the window are guaranteed to be skyline points. Thus, experiments showed that SFS outperforms BNL because of a lower bookkeeping overhead. In [20], a maximal-vector algorithm called linear elimination sort for skyline (LESS) was introduced, which improves SFS with an external sort routine. In the external-sort phase, LESS eliminates some points earlier by maintaining a small eliminate-filter window. As a result, LESS shows significant improvements on synthetic and real datasets in the experiments. To further reduce computation, the sort and limit skyline algorithm (SaLSa) [4] was proposed, which evaluates skyline queries without applying the skyline filter to all points. SaLSa also presorts all points according to a monotonic function, but it selects a stop point in the filter-scan process. The authors studied how to select a good sorting function and stop point to achieve the best performance. Although SaLSa performs well with a proper selection, all sorting-based algorithms suffer from the large number of computations required in the filter-scan phase.

Parallel techniques exist for MapReduce [33,35,39,40], GPU [7,8,14], and other distributed environments [1,13,21]. Since the proposed algorithm is based on GPUs, we focus on three GPU-based skyline algorithms. The first GPU-based skyline algorithm is GNL [14], which runs nested loops. GNL is a GPU-based brute-force algorithm in which each point is compared with the other points by its own thread. In [7], the GGS algorithm presorts all points according to a monotonic function and prunes non-skyline points in batches.
To obtain high throughput and efficiency, the authors introduced a branch-free process to check the dominance relations between points and discussed optimizations for memory management. The most recent state-of-the-art GPU-based skyline algorithm is SkyAlign [8], which is a partition-based algorithm. This algorithm uses a static partitioning mechanism, which is GPU-friendly, to efficiently identify the relations between points at the region level. The optimization techniques for GPU computations are fully discussed, including memory management and thread divergence. In the experiments, the results showed that SkyAlign outperforms the optimized GGS in various scenarios.

3. Data representation

In this section, we introduce the data representation used in the proposed algorithm and the verification of dominance relations. Determining the dominance relations between points is the basic operation in skyline queries because skyline points are not dominated by any point. Consider two points $p = (p_1, p_2, p_3)$ and $q = (q_1, q_2, q_3)$ in a three-dimensional space; conventional methods sequentially compare the attributes of the two points to determine the dominance relation between p and q. For simplicity, we assume that the attributes of p and q are distinct in the example of Fig. 3. In Fig. 3, the comparison tree for verifying whether p dominates q in the conventional way indicates that p and q are incomparable in most cases. In the case of p = (0, 5, 1) and q = (4, 3, 2), it takes at least two scalar comparisons to recognize that p does not dominate q because 0 < 4 and 5 > 3.

To improve the computation process, we consider all attributes simultaneously to efficiently verify whether p dominates q. In the above case, we compare the most significant bits of the attributes, as shown in Fig. 4. Since the most significant bit of $p_1$ is less than that of $q_1$, we have $p_1 < q_1$. Since the most significant bit of $p_2$ is greater than that of $q_2$, we have $p_2 > q_2$. Because $p_1 < q_1$ and $p_2 > q_2$, we conclude that $p \nprec q$ after two bit comparisons. We can pack these related bit comparisons into one scalar comparison. Consequently, it takes only one scalar comparison to identify that $p \nprec q$. In the following, we introduce a new data representation that allows the proposed method to compare bits efficiently.

Table 2
An example of 3-dimensional dataset.

        Attribute                Rank                   Bitmap
Label   1       2       3        R^(1)  R^(2)  R^(3)    B^(3)  B^(2)  B^(1)   Skyline
A       0.184   0.578   0.067    0      5      1        010    000    011     √
B       0.504   0.393   0.628    4      3      6        101    011    010
C       0.565   0.146   0.152    5      2      3        100    011    101
D       0.216   0.602   0.233    1      6      5        011    010    101
F       0.972   0.415   0.642    7      4      7        111    101    101
H       0.356   0.054   0.092    3      0      2        000    101    100     √
I       0.601   0.802   0.010    6      7      0        110    110    010     √
J       0.241   0.068   0.165    2      1      4        001    100    010     √

3.1. Characteristic bitmap

We introduce a new data representation, called the characteristic bitmap, which extracts and records the corresponding bits of ranks. Instead of attributes, which are usually floating-point numbers, we use ranks, which are non-negative integers, to make the comparisons simple and efficient. We give the definitions of ranks and characteristic bitmaps in the following.

Definition 1 (Rank). Given a d-dimensional dataset D of size n, the ith rank of a point p ∈ D, denoted $R^{(i)}(p)$, is the number of points in D whose attribute in the ith dimension is less than that of p. Then

\[
R^{(i)}(p) = \left( b_i^{(m)} b_i^{(m-1)} \cdots b_i^{(1)} \right)_2 = \sum_{j=1}^{m} 2^{j-1} \cdot b_i^{(j)},
\]

where $i \in \{1, 2, \ldots, d\}$, $b_i^{(m)}$ is the most significant bit, and $b_i^{(1)}$ is the least significant bit.

Definition 2 (Characteristic Bitmap). Given a d-dimensional dataset of size n, each point has $m = \lceil \log_2 n \rceil$ characteristic bitmaps. Each characteristic bitmap is composed of d bits. The jth characteristic bitmap of a point p is denoted as follows:

\[
B^{(j)}(p) = \left( b_1^{(j)} b_2^{(j)} \cdots b_d^{(j)} \right)_2,
\]

where $j \in \{1, 2, \ldots, m\}$ and $b_i^{(j)} \in \{0, 1\}$ is the jth bit of the ith rank of p.
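Definitions 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's GPU implementation: the function names `ranks` and `characteristic_bitmaps` are hypothetical, ranks are obtained by direct counting rather than by sorting, and each bitmap is stored as an integer whose leftmost bit belongs to dimension 1. The data are the attribute values of Table 2.

```python
from typing import List, Sequence

# Attribute values of Table 2, in the order A, B, C, D, F, H, I, J.
TABLE_2 = [
    (0.184, 0.578, 0.067), (0.504, 0.393, 0.628), (0.565, 0.146, 0.152),
    (0.216, 0.602, 0.233), (0.972, 0.415, 0.642), (0.356, 0.054, 0.092),
    (0.601, 0.802, 0.010), (0.241, 0.068, 0.165),
]


def ranks(data: Sequence[Sequence[float]]) -> List[List[int]]:
    """R^(i)(p): number of points whose ith attribute is smaller than that of p (Definition 1)."""
    d = len(data[0])
    return [[sum(other[i] < p[i] for other in data) for i in range(d)] for p in data]


def characteristic_bitmaps(rank_row: Sequence[int], m: int) -> List[int]:
    """Return [B^(m), ..., B^(1)] for one point (Definition 2); dimension 1 is the leftmost bit."""
    bitmaps = []
    for j in range(m, 0, -1):                  # from the most significant bit downwards
        bitmap = 0
        for r in rank_row:                     # append the jth bit of each rank
            bitmap = (bitmap << 1) | ((r >> (j - 1)) & 1)
        bitmaps.append(bitmap)
    return bitmaps


if __name__ == "__main__":
    m = max(1, (len(TABLE_2) - 1).bit_length())   # ceil(log2 n) = 3 for n = 8
    rank_of_a = ranks(TABLE_2)[0]
    print(rank_of_a)                                                   # [0, 5, 1]
    print([format(b, "03b") for b in characteristic_bitmaps(rank_of_a, m)])
    # ['010', '000', '011']
```

Counting ranks this way costs $O(n^2 d)$; a sort per dimension would be the natural choice for large datasets.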

We use point A in Table 2 to demonstrate the conversion into characteristic bitmaps. In this example, the number of characteristic bitmaps is $m = \lceil \log_2 8 \rceil = 3$ because the size of the dataset is 8. After ranking, $R^{(1)}(A) = 0$, $R^{(2)}(A) = 5$, and $R^{(3)}(A) = 1$, and the binary representations of the ranks are $(000)_2$, $(101)_2$, and $(001)_2$. Then, the bits of all ranks are extracted and collected into their corresponding characteristic bitmaps. The characteristic bitmaps, starting from the most significant bits, are $B^{(3)}(A) = 010$, $B^{(2)}(A) = 000$, and $B^{(1)}(A) = 011$.

3.2. Verification of dominance relation

Using characteristic bitmaps, we can verify dominance relations between points without extracting bits repeatedly. In each verification, we perform bitmap comparisons until the termination conditions are met. In the following, we illustrate the verifications using characteristic bitmaps with the dataset in Table 2. First, we consider only the first ranks $R^{(1)}$ to explain the bitmap comparisons. Then, we illustrate the complete verifications, which consider all ranks simultaneously.

Consider points H and D with the first ranks $R^{(1)}$ only. Verifying whether H dominates D is then equivalent to verifying whether $R^{(1)}(H)$ is less than $R^{(1)}(D)$. To achieve that with characteristic bitmaps, we start by comparing the characteristic bitmaps that consist of the most significant bits. If the bits are equal, we compare the next characteristic bitmaps until the bits differ or all bitmaps have been compared. Thus, the first comparison is between $B^{(3)}(H) = 0$ and $B^{(3)}(D) = 0$, where $R^{(1)}(H) = 3 = (011)_2$ and $R^{(1)}(D) = 1 = (001)_2$. Since $B^{(3)}(H) = B^{(3)}(D)$, we have to compare $B^{(2)}(H) = 1$ and $B^{(2)}(D) = 0$. Since $B^{(2)}(H) > B^{(2)}(D)$, we have $R^{(1)}(H) > R^{(1)}(D)$.

Table 3
All situations of comparing two bits and the corresponding bitwise operations, where ¬, ∧, and ⊕ are the NOT, AND, and XOR bitwise operations, respectively.

X   Y   Relation   Bitwise operation
0   1   <          ¬X ∧ Y
0   0   =          ¬(X ⊕ Y)
1   1   =          ¬(X ⊕ Y)
1   0   >          X ∧ ¬Y

Table 3 lists all situations of comparing two bits and the bitwise operations that identify the corresponding relations. We perform those operations on characteristic bitmaps to identify the relations between two ranks. As mentioned previously, we can pack multiple bit comparisons into one scalar comparison by collecting bits and performing bitwise operations.

Using the bitmap comparisons above, we can examine all ranks of two points simultaneously from the view of characteristic bitmaps. Now, consider points H and I with all ranks, and verify whether H dominates I. Since $B^{(3)}(H) = 000$ and $B^{(3)}(I) = 110$, we have $R^{(1)}(H) < R^{(1)}(I)$, $R^{(2)}(H) < R^{(2)}(I)$, and an undetermined relation between $R^{(3)}(H)$ and $R^{(3)}(I)$. Thus, we compare $B^{(2)}(H)$ and $B^{(2)}(I)$. Since $B^{(2)}(H) = 101$ and $B^{(2)}(I) = 110$, this comparison alone suggests an undetermined relation between $R^{(1)}(H)$ and $R^{(1)}(I)$, $R^{(2)}(H) < R^{(2)}(I)$, and $R^{(3)}(H) > R^{(3)}(I)$. However, the relations of H and I in $R^{(1)}$ and $R^{(2)}$ were already determined when comparing $B^{(3)}(H)$ and $B^{(3)}(I)$. Hence, in each bitmap comparison, we mask the ranks whose relation is already determined using a bitmap U, which is the result of checking the equality of the previous characteristic bitmaps.

Algorithm 1 shows the details of verifying whether a point p dominates another point q, where bitmaps G and U are used to record the status of the ranks.

Algorithm 1 DOM(p, q).
Require: Points p, q with m characteristic bitmaps
Ensure: Determine if p dominates q
    Set all bits of G and U to 0 and 1, respectively
    i ← m
    while i > 0 and Ones(G) = 0 and Ones(U) > 0 do
        G ← B^(i)(p) ∧ ¬B^(i)(q) ∧ U
        U ← ¬(B^(i)(p) ⊕ B^(i)(q)) ∧ U
        i ← i − 1
    if Ones(G) = 0 and Ones(U) < d then
        return true
    else
        return false

If a rank of p is greater than that of q, the corresponding bit of G is set to 1; otherwise it is set to 0. If the relation of a rank is undetermined, the corresponding bit of U is set to 1; otherwise it is set to 0. Hence, all bits of G and U are initialized to 0 and 1, respectively. Then, we perform bitmap comparisons and update bitmaps G and U. The comparison process continues until the answer is determined or all characteristic bitmaps have been compared. We identify whether the answer is determined by counting the number of ones in G and U with Ones. If $|\{i \mid R^{(i)}(p) > R^{(i)}(q)\}| = \mathrm{Ones}(G) > 0$, we have $p \nprec q$. If $\mathrm{Ones}(G) = 0$ and $\mathrm{Ones}(U) = 0$, we have $p \prec q$ because $R^{(i)}(p) < R^{(i)}(q)$ for all $i \in \{1, 2, \ldots, d\}$. Hence, if $\mathrm{Ones}(G) = 0$ and $\mathrm{Ones}(U) > 0$, the comparison continues. Finally, we have $p \prec q$ if $\mathrm{Ones}(G) = 0$ and $\mathrm{Ones}(U) < d$; otherwise, we have $p \nprec q$. Since at most m characteristic bitmaps are compared, the complexity of Algorithm 1 is O(m), which does not depend on the number of dimensions.

For instance, consider DOM(H, B), which verifies whether point H dominates point B in Table 2; G and U are initialized to 000 and 111, respectively. We first compare the characteristic bitmaps that consist of the most significant bits of all ranks by computing $G = B^{(3)}(H) \wedge \neg B^{(3)}(B) \wedge U$ and $U = \neg(B^{(3)}(H) \oplus B^{(3)}(B)) \wedge U$. Since $G = 000 \wedge \neg 101 \wedge 111 = 000$ and $U = \neg(000 \oplus 101) \wedge 111 = 010$, the relation between $R^{(2)}(H)$ and $R^{(2)}(B)$ is undetermined because the corresponding bits in G and U are 0 and 1, respectively. In addition, we have $R^{(1)}(H) < R^{(1)}(B)$ and $R^{(3)}(H) < R^{(3)}(B)$ because their corresponding bits are 0 in both U and G. The DOM(H, B) operation continues because $\mathrm{Ones}(G) = 0$ and $\mathrm{Ones}(U) = 1$. After comparing $B^{(2)}(H)$ and $B^{(2)}(B)$, we have $G = 101 \wedge \neg 011 \wedge 010 = 000$ and $U = \neg(101 \oplus 011) \wedge 010 = 000$. Since $\mathrm{Ones}(G) = 0$ and $\mathrm{Ones}(U) = 0$, all ranks of H are less than those of B. Consequently, H is found to dominate B after two bitmap comparisons.
3.3. Weight of characteristic bitmap

Characteristic bitmaps are related to lattice structures [26,28] that can be used as partition indices. By mapping points into regions according to characteristic bitmaps, we know the relative strength of a point. To illustrate this feature, we use the example in Fig. 1 and transform its attributes into characteristic bitmaps, as shown in Fig. 5. Since it is a two-dimensional dataset, there are four regions according to $B^{(m)}$, which is composed of the most significant bits of the ranks. For example, point I is mapped into region 00 because $R^{(1)}(I) = (001)_2$, $R^{(2)}(I) = (001)_2$, and $B^{(m)}(I) = 00$. As seen in the figure, point I dominates the points in region 11 and partially dominates the points in regions 01 and 10. We observe that points in region 00 are the strongest because they tend to dominate more points. Points in regions 01 and 10 are weaker than points in region 00 but stronger than points in region 11. The weakest points are in region 11 because they do not dominate any point in other regions. A stronger point tends to dominate other points more easily and dominates more points, and the strength is related to the number of ones in the characteristic bitmaps. We can eliminate unnecessary DOM operations by sorting datasets according to strength and verifying dominance relations with stronger points first. In the following, we define the weight of a bitmap, which represents the strength of a point.

Fig. 1. A skyline in a two-dimensional dataset, where black circles are skyline points and white circles are non-skyline points.

Fig. 2. Percentage of skyline points of synthetic datasets.

Definition 3 (Weight of Characteristic Bitmap). The weight is the number of 1s in a characteristic bitmap, denoted as follows:

\[
\omega\left(B^{(j)}(p)\right) = \sum_{i=1}^{d} b_i^{(j)}, \quad j \in \{1, 2, \ldots, m\}.
\]

Fig. 6 indicates the corresponding weights and the partial dominance relations between characteristic bitmaps of three-dimensional datasets. We know the strength of points according to their weights, and we can prune non-skyline points earlier by performing dominance tests with the stronger points first. In the example in Table 2, we compute the weights and order the dataset by comparing $\omega(B^{(3)})$ of the points first. If the values of $\omega(B^{(3)})$ are the same, $\omega(B^{(2)})$ of the points are compared, followed by $\omega(B^{(1)})$ in case the values of $\omega(B^{(2)})$ are also the same. As seen in Table 4, the points are ordered according to their strengths, where point H is the strongest because $\omega(B^{(3)}(H)) = 0$. By checking the dominance relations between all points and the strongest point H, three of the four non-skyline points (B, C, and F) are identified and pruned.

Fig. 3. The comparison tree associated with determining the dominance relation between p and q, where black squares represent incomparable.

p1 = 0 = (000)_2  ?  (100)_2 = 4 = q1    most significant bits: 0 < 1  ⟹  p1 < q1
p2 = 5 = (101)_2  ?  (011)_2 = 3 = q2    most significant bits: 1 > 0  ⟹  p2 > q2
p3 = 1 = (001)_2  ?  (010)_2 = 2 = q3
p1 < q1 and p2 > q2  ⟹  p ⊀ q
Fig. 4. Verifying whether p dominates q using the proposed method.

Fig. 5. Result of ranked example shown in Fig. 1 (the four regions are labeled 00, 01, 10, and 11).

ω = 0:   000
ω = 1:   001   010   100
ω = 2:   011   101   110
ω = 3:   111
Fig. 6. The binary lattice of three-dimensional datasets.

Table 4
Sorted dataset in Table 2 according to weights.

Label   B^(3), (ω)   B^(2), (ω)   B^(1), (ω)   Skyline
H       000, (0)     101, (2)     100, (1)     √
A       010, (1)     000, (0)     011, (2)     √
J       001, (1)     100, (1)     010, (1)     √
C       100, (1)     011, (2)     101, (2)
D       011, (2)     010, (1)     101, (2)
I       110, (2)     110, (2)     010, (1)     √
B       101, (2)     011, (2)     010, (1)
F       111, (3)     101, (2)     101, (2)

4. Proposed algorithm

Using characteristic bitmaps, we efficiently verify dominance relations between two points and eliminate unnecessary operations based on the strength of points. To further improve the overall performance, we present a parallel algorithm for skyline and k-dominant skyline queries on massively parallel architectures. The following section illustrates the proposed algorithm for skyline queries.

4.1. Skyline query

We introduce a parallel skyline algorithm for massively parallel architectures, called PSA, which obtains performance from data parallelism and consists of two phases, as shown in Algorithm 2. The first phase pre-processes the dataset by transforming and sorting, and the second phase performs DOM operations to prune non-skyline points.

Algorithm 2 PSA.
Require: A d-dimensional dataset D of size n
Ensure: All skyline points of D
    m ← ⌈log2 n⌉
    for i ← 1 to d do
        Sort D in non-ascending order by i-th attributes
        for p ∈ D do in parallel
            Extract bits from R^(i)(p) and save to the corresponding characteristic bitmaps
    Sort D in non-ascending order by ω(B^(m))
    for q ∈ D do in parallel
        for p ∈ D do
            if ω(B^(m)(p)) ≤ ω(B^(m)(q)) and DOM(p, q) then
                Mark q as dominated
    S ← {p ∈ D | p is non-dominated}
    return S

In the first phase, the given dataset is sorted in non-ascending order according to each attribute to obtain the corresponding ranks of all points. Then, the bits of each rank are extracted and saved to the corresponding characteristic bitmaps. After transforming the attributes into characteristic bitmaps, we sort the given dataset according to the weights of $B^{(m)}$, which is composed of the most significant bits. In the second phase, we verify in parallel whether each point is dominated by other points. Each point is verified only against points whose weight of the first characteristic bitmap $B^{(m)}$ is smaller than or equal to its own. Since the dataset is ordered according to the weight of $B^{(m)}$, all necessary DOM operations for a point are finished once a weaker point has been scanned. Furthermore, if a point is marked as dominated, there is no need to verify it against other points. After the essential DOM operations are performed, the skyline points are collected and returned. Theorem 1 proves the correctness of eliminating DOM operations with weaker points.

Theorem 1. If $\omega(B^{(m)}(p)) > \omega(B^{(m)}(q))$, p does not dominate q.

Proof. Assume that $p \prec q$. Then we have $R^{(i)}(p) \le R^{(i)}(q)$ for all $i \in \{1, 2, \ldots, d\}$, and there exists $j \in \{1, 2, \ldots, d\}$ such that $R^{(j)}(p) < R^{(j)}(q)$. Since $R^{(i)}(p) \le R^{(i)}(q)$ implies that the most significant bit of $R^{(i)}(p)$ is no greater than that of $R^{(i)}(q)$ in every dimension, we have $\omega(B^{(m)}(p)) \le \omega(B^{(m)}(q))$, which is a contradiction. This completes the proof.
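The two phases of PSA and the pruning rule of Theorem 1 can be put together in a sequential sketch. This is a CPU illustration under assumptions rather than the paper's GPU implementation: the loop over q that the algorithm runs in parallel is an ordinary loop here, ranks are obtained by counting instead of sorting, points are ordered with the smallest weight of $B^{(m)}$ first (the order shown in Table 4), and the function names are hypothetical.

```python
from typing import List, Sequence


def ones(x: int) -> int:
    """Number of 1 bits, i.e. the weight of a characteristic bitmap."""
    return bin(x).count("1")


def to_bitmaps(data: Sequence[Sequence[float]]) -> List[List[int]]:
    """Phase 1: ranks, then bitmaps [B^(m), ..., B^(1)] per point (dimension 1 = leftmost bit)."""
    n, d = len(data), len(data[0])
    m = max(1, (n - 1).bit_length())          # ceil(log2 n) bits cover the ranks 0..n-1
    result = []
    for p in data:
        rk = [sum(o[i] < p[i] for o in data) for i in range(d)]
        result.append([
            sum(((r >> j) & 1) << (d - 1 - i) for i, r in enumerate(rk))
            for j in range(m - 1, -1, -1)
        ])
    return result


def dom(bp: Sequence[int], bq: Sequence[int], d: int) -> bool:
    """DOM(p, q) on characteristic bitmaps, following the steps of Algorithm 1."""
    g, u = 0, (1 << d) - 1
    for x, y in zip(bp, bq):
        if g or not u:
            break
        g, u = x & ~y & u, ~(x ^ y) & u
    return g == 0 and ones(u) < d


def psa(data: Sequence[Sequence[float]]) -> List[int]:
    """Phase 2 (sequential here): indices of the skyline points of data."""
    d = len(data[0])
    bm = to_bitmaps(data)
    order = sorted(range(len(data)), key=lambda i: ones(bm[i][0]))   # strongest points first
    result = []
    for q in order:
        dominated = False
        for p in order:
            if ones(bm[p][0]) > ones(bm[q][0]):
                break                          # Theorem 1: weaker points cannot dominate q
            if p != q and dom(bm[p], bm[q], d):
                dominated = True
                break                          # no need to test q against further points
        if not dominated:
            result.append(q)
    return sorted(result)


# Attribute values of Table 2, in the order A, B, C, D, F, H, I, J.
TABLE_2 = [
    (0.184, 0.578, 0.067), (0.504, 0.393, 0.628), (0.565, 0.146, 0.152),
    (0.216, 0.602, 0.233), (0.972, 0.415, 0.642), (0.356, 0.054, 0.092),
    (0.601, 0.802, 0.010), (0.241, 0.068, 0.165),
]

if __name__ == "__main__":
    print(psa(TABLE_2))   # [0, 5, 6, 7], i.e. points A, H, I, and J
```

On a GPU, the outer loop would presumably be distributed so that each thread handles one point q, while the early exit on the weight keeps the inner scan short for strong points.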


4.2. k-Dominant skyline query

In this section, we introduce a parallel k-dominant skyline algorithm, called PSA$_k$, which extends PSA. First, we describe the DOM$_k(p, q)$ operation, which verifies whether a point p k-dominates another point q; the details are shown in Algorithm 3. We modify the termination conditions but use the same bitwise operations as in DOM$(p, q)$ because the k-dominance relation is a relaxation of the dominance relation.

Algorithm 3 DOM$_k(p, q)$.
Require: Points p, q with m characteristic bitmaps
Ensure: Determine if p k-dominates q
    Set all bits of $G$ to 0 and all bits of $U$ to 1
    $i \leftarrow m$
    while $i > 0$ and $d - \mathrm{Ones}(G) \ge k$ and $k > d - \mathrm{Ones}(G) - \mathrm{Ones}(U)$ do
        $G \leftarrow G \lor (B^{(i)}(p) \land \lnot B^{(i)}(q) \land U)$
        $U \leftarrow \lnot(B^{(i)}(p) \oplus B^{(i)}(q)) \land U$
        $i \leftarrow i - 1$
    if $d - \mathrm{Ones}(G) \ge k$ and $\mathrm{Ones}(G) + \mathrm{Ones}(U) < d$ then
        return true
    else
        return false

Here $G$ accumulates, over the bit levels processed so far, the attributes in which p has a larger rank than q, and $U$ marks the attributes that are still undecided. By the definition of the k-dominance relation, if $|\{i \mid R^{(i)}(p) \le R^{(i)}(q)\}| = d - \mathrm{Ones}(G) < k$, we have $p \nprec_k q$. If $|\{i \mid R^{(i)}(p) < R^{(i)}(q)\}| = d - \mathrm{Ones}(G) - \mathrm{Ones}(U) \ge k$, we have $p \prec_k q$. Hence, the comparison continues while $d - \mathrm{Ones}(G) \ge k$ and $k > d - \mathrm{Ones}(G) - \mathrm{Ones}(U)$. Finally, we have $p \prec_k q$ if $\mathrm{Ones}(G) \le d - k$ and $\mathrm{Ones}(G) + \mathrm{Ones}(U) < d$; otherwise, we have $p \nprec_k q$.

Second, we focus on eliminating unnecessary DOM$_k$ operations using the weights of characteristic bitmaps; the details of PSA$_k$ are shown in Algorithm 4. The first phase of PSA$_k$, which transforms the attributes into characteristic bitmaps and sorts the points according to their strength, performs the same operations as in PSA.

Algorithm 4 PSA$_k$.
Require: A d-dimensional dataset D of size n
Ensure: All k-dominant skyline points of D
    $m \leftarrow \log_2 n$
    for $i \leftarrow 1$ to $d$ do
        Sort $D$ in non-ascending order by the $i$-th attribute
        for $p \in D$ do in parallel
            Extract bits from $R^{(i)}(p)$ and save them to the corresponding characteristic bitmaps
    Sort $D$ in non-ascending order by $\omega(B^{(m)})$
    for $q \in D$ do in parallel
        for $p \in D$ do
            if $\omega(B^{(m)}(p)) - \omega(B^{(m)}(q)) \le d - k$ and DOM$_k(p, q)$ then
                Mark $q$ as k-dominated
    $S \leftarrow \{p \in D \mid p \text{ is not } k\text{-dominated}\}$
    return $S$

In the second phase, we again verify in parallel whether each point is k-dominated by other points. However, we have to relax the condition used to eliminate unnecessary DOM$_k$ operations: each point q is checked only against points p with $\omega(B^{(m)}(p)) - \omega(B^{(m)}(q)) \le d - k$. Finally, the points that are not k-dominated are collected and returned after all necessary DOM$_k$ operations are performed. Theorem 2 establishes the correctness of eliminating DOM$_k$ operations in k-dominant skyline queries.

Theorem 2. If $\omega(B^{(m)}(p)) - \omega(B^{(m)}(q)) > d - k$, then p does not k-dominate q.

Proof. Each unit of the weight difference requires an attribute whose most significant rank bit is 1 for p and 0 for q. Since $\omega(B^{(m)}(p)) - \omega(B^{(m)}(q)) > d - k$, there are at least $d - k + 1$ ranks of p such that $R^{(i)}(p) > R^{(i)}(q)$ for $i \in \{1, 2, \ldots, d\}$. Hence, there are at most $k - 1$ ranks of p such that $R^{(i)}(p) \le R^{(i)}(q)$ for $i \in \{1, 2, \ldots, d\}$, and p does not k-dominate q.

5. Experiments

In this section, we evaluate the performance of the proposed algorithms on skyline queries and k-dominant skyline queries. The proposed algorithm can be executed with different parallel computing techniques, such as MPI, OpenMP, and GPU. With characteristic bitmaps, we can easily verify dominance relations in parallel and balance the workload of the verifications used in queries. The choice of technique depends on the use case. If the dataset contains hundreds of millions of points, a GPU might not be suitable because of its limited memory space, and MPI is a proper solution. However, the communication between machines then becomes a critical factor, which is not the focus of this paper. Since OpenMP and GPU use shared memory on the same machine, the communication cost between processors while verifying dominance relations can be ignored. We use a GPU in the experiments because its performance is greatly affected by workload balance, which is the major issue we address.

Fig. 7. The average execution time (s) of GGS, SkyAlign, and PSA when varying both cardinality and dimensionality such that $n = 2^d$ ($d$ from 18 to 23): (a) correlated, (b) independent, (c) anti-correlated.

All experiments are conducted on Debian 8 with an Intel Core i7-4770K 3.5 GHz CPU, 8 GB RAM, and an Nvidia GTX TITAN-X. We use C++11 and CUDA 7.5 for all experiments. For datasets, we use the standard skyline dataset generator [9] to generate correlated, independent, and anti-correlated datasets. In each experiment, we execute each algorithm 100 times and compute the average execution time.

To optimize the implementation and execution of all algorithms, the number of resident threads per block is set to 32, which is equal to the warp size. All data are stored in global memory and loaded into cache when used because most data are used only once. Moreover, coalesced memory access and memory alignment are considered to maximize memory bandwidth. Since we sort the datasets in the first phase, we employ the efficient parallel sorting function from Thrust [6], a C++ STL-like GPU library. In the second phase, we not only sort the given dataset by $\omega(B^{(m)})$ but also place points that have the same $B^{(m)}$ in contiguous memory to increase the warp efficiency.
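Ordering points by $\omega(B^{(m)})$ while keeping points with an identical $B^{(m)}$ adjacent amounts to a sort on a compound key. The Python sketch below shows one way to build such an ordering on the host. It is only an illustration: the paper's implementation sorts with Thrust on the GPU, and the function name and the tie-breaking by bitmap value are assumptions.

```python
def warp_friendly_order(top_bitmaps):
    """Order point indices by non-ascending weight of B^(m), grouping equal bitmaps.

    top_bitmaps[p] is the integer encoding of B^(m)(p). Breaking ties by the
    bitmap value itself (an assumption) puts identical bitmaps next to each other.
    """
    def key(p):
        ones = bin(top_bitmaps[p]).count("1")
        return (-ones, top_bitmaps[p])

    return sorted(range(len(top_bitmaps)), key=key)


order = warp_friendly_order([0b01, 0b11, 0b00, 0b10, 0b01])
print(order)  # [1, 0, 4, 3, 2]
```

With such a layout, neighbouring threads of a warp would tend to read points that share the same leading bitmap and therefore take the same branch in the weight test.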

5.1. Experiments on skyline queries

The PSA algorithm is compared with the optimized GGS [8] and SkyAlign [8], the state-of-the-art algorithms on GPU. The GGS algorithm sorts the dataset according to a monotonic function and eliminates non-skyline points in batches. In addition, it uses a branch-free process to obtain high throughput and efficiency. The SkyAlign algorithm, a partition-based algorithm, uses static partitioning to identify unnecessary verifications as early as possible and thereby eliminate DOM operations. All three algorithms are compiled with the same optimization settings.

We conduct three experiments to compare the average execution times. First, we measure the efficiency of the proposed algorithm by varying both the cardinality and dimensionality such that $n = 2^d$. Since the number of characteristic bitmaps is $m = \log_2 n = d$, verifying a dominance relation takes d comparisons in the worst case for both the conventional method and the proposed method in this experiment. Fig. 7 shows the average execution time of the three algorithms on the three synthetic datasets when varying d from 18 to 23. In general, PSA is faster than both GGS and SkyAlign; however, when the cardinality is small in correlated datasets, GGS is faster because PSA transforms attributes into bitmaps before verifying the dominance relations.

To measure the efficiency of verification in PSA, we compute the average number of bitmap comparisons per DOM operation on the synthetic datasets. As shown in Fig. 8 for increasing dimensionality with $n = 10^6$, the average number of bitmap comparisons decreases as the dimensionality increases. Although each point has $\lceil \log_2 n \rceil = 20$ characteristic bitmaps, most DOM operations obtain the answer after first comparing two characteristic bitmaps. Hence, considering the significant bits of all attributes simultaneously is more efficient than attribute-by-attribute comparison in high-dimensional datasets.

Fig. 8. The average number of bitmap comparisons for each DOM operation.

Figs. 9 and 10 show the trends with respect to increasing dimensionality with $n = 10^6$ and $n = 10^7$, respectively. When the dimensionality is low, SkyAlign and GGS are faster than PSA, but the execution times of GGS and SkyAlign increase rapidly. When the dimensionality is high, PSA is faster than GGS and SkyAlign. The execution times of SkyAlign and PSA remain the same when d > 26 in independent and anti-correlated datasets because the numbers of skyline points are the same. PSA shows better performance in high-dimensional datasets because the efficient DOM operations consider all attributes simultaneously.

Fig. 9. The average execution time when varying d with $n = 10^6$.

Fig. 11 shows the trends with respect to increasing cardinality with d = 30. In general, PSA is significantly faster than both GGS and SkyAlign. Although the number of characteristic bitmaps increases as the cardinality increases, the execution time of PSA increases slowly because most DOM operations do not need to compare all characteristic bitmaps. Hence, PSA scales well with dimensionality in high-dimensional datasets.

5.2. Experiments on k-dominant skyline queries

In k-dominant skyline queries, the dominance relation is relaxed and the transitivity of the dominance relation no longer holds. The branch-free comparison of GGS and the pruning technique of SkyAlign cannot be adapted to k-dominant skyline queries. Hence, we compare PSA$_k$ with a basic algorithm that is based on the definition of the k-dominant skyline [10] and performs the verifications of the k-dominance relation in parallel. It sorts the dataset in non-ascending order according to the sum of attributes. Then, each point is verified against all other points for the k-dominance relation using the conventional method in parallel.

We evaluate the performance of PSA$_k$ by varying k in high-dimensional datasets. Figs. 12 and 13 show the trends of the average execution time and the percentage of k-dominant skyline points with respect to increasing k with $n = 10^6$ and $d \in \{24, 30\}$. The y-axes on the left-hand side and right-hand side indicate the average execution time and the percentage of k-dominant skyline points, respectively. In correlated datasets, the percentages of k-dominant skyline points are very low, and the execution times are almost the same because the percentage of k-dominant skyline points is less than 10%. The execution time of the basic algorithm is 3 to 60 times longer than that of PSA$_k$.
Fig. 10. The average execution time when varying d with $n = 10^7$.

Fig. 11. The average execution time when varying n with d = 30.

Fig. 12. The average execution time when varying k with $n = 10^6$, d = 24.

In independent and anti-correlated datasets, the trends of the execution times of both algorithms and of the percentages of k-dominant skyline points are similar. The basic algorithm is faster than PSA$_k$ when k is small because there is no k-dominant skyline point and PSA$_k$ spends time on data preprocessing. The number of k-dominant skyline points and the execution time of the basic algorithm increase rapidly as k increases. However, the longest execution times of PSA$_k$ occur when $k \approx 0.8d$ because each point is checked against more points. The execution times decrease as k increases further because more DOM$_k$ operations are eliminated by $\omega(B^{(m)})$. Consequently, PSA$_k$ is up to 200 times faster than the basic algorithm.
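For reference, the definition-based baseline described at the start of this subsection can be sketched in a few lines of Python. The sketch is sequential and assumes that larger attribute values are preferred; the function names are hypothetical, and the paper's baseline runs the per-point checks in parallel on the GPU.

```python
def k_dominates(p, q, k):
    """p k-dominates q: p is no worse in at least k attributes and better in one."""
    no_worse = sum(a >= b for a, b in zip(p, q))
    better = sum(a > b for a, b in zip(p, q))
    return no_worse >= k and better >= 1


def basic_k_dominant_skyline(points, k):
    """Definition-based k-dominant skyline over a list of equal-length tuples."""
    # Non-ascending order by attribute sum, as in the baseline description.
    ordered = sorted(points, key=sum, reverse=True)
    return [q for q in ordered
            if not any(k_dominates(p, q, k) for p in ordered)]


# Three points that 2-dominate each other in a cycle, so none survives for k = 2.
cycle = [(1, 2, 3), (3, 1, 2), (2, 3, 1)]
print(basic_k_dominant_skyline(cycle, 2))  # []
print(basic_k_dominant_skyline(cycle, 3))  # all three points remain
```

The cyclic example illustrates the loss of transitivity: each point is 2-dominated by another one, so a point cannot be discarded as a potential dominator merely because it is itself k-dominated.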


Fig. 13. The average execution time when varying k with $n = 10^6$, d = 30.

Fig. 14. The average number of points checked when varying k with $n = 10^6$.

Fig. 15. The average number of characteristic bitmaps compared for each DOM$_k$ operation when varying k with $n = 10^6$, d = 24.


Fig. 16. The average number of bitmap comparisons for each DOM$_k$ operation when varying k with $n = 10^6$, d = 30.
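The quantity plotted in Figs. 15 and 16 is the number of characteristic bitmaps a DOM$_k$ operation inspects before it can stop. The Python sketch below follows the loop of Algorithm 3 and returns that count together with the answer; it also applies the weight test of Theorem 2 before calling the operation. It is a sequential sketch, not the authors' CUDA code: larger values are assumed to be preferred, values are assumed to be distinct within each attribute, $m = \lceil \log_2 n \rceil$ is assumed, $G$ is accumulated across bit levels, and all names are hypothetical.

```python
from math import ceil, log2


def ones(x):
    """Number of 1 bits in a non-negative integer."""
    return bin(x).count("1")


def rank_bitmaps(data):
    """bitmaps[p][j] packs bit j of the rank of p in every attribute."""
    n = len(data)
    m = max(1, ceil(log2(n)))  # assumption: enough bits for ranks 0 .. n-1
    ranks = [[0] * len(data[0]) for _ in range(n)]
    for i in range(len(data[0])):
        # Non-ascending sort: the largest value gets rank 0.
        order = sorted(range(n), key=lambda p: data[p][i], reverse=True)
        for r, p in enumerate(order):
            ranks[p][i] = r
    return [[sum(((r >> j) & 1) << i for i, r in enumerate(rv)) for j in range(m)]
            for rv in ranks]


def dom_k(bp, bq, d, k):
    """Return (p k-dominates q, number of bitmap levels compared)."""
    full = (1 << d) - 1
    g, u = 0, full  # g: attributes where p is worse; u: attributes still tied
    i = len(bp)
    while i > 0 and d - ones(g) >= k and k > d - ones(g) - ones(u):
        i -= 1
        g |= bp[i] & ~bq[i] & u
        u &= ~(bp[i] ^ bq[i]) & full
    return (d - ones(g) >= k and ones(g) + ones(u) < d), len(bp) - i


def psa_k(data, k):
    """Indices of k-dominant skyline points, using the Theorem 2 weight filter."""
    d = len(data[0])
    bm = rank_bitmaps(data)
    w = [ones(b[-1]) for b in bm]  # weight of the most significant bitmap
    return [q for q in range(len(data))
            if not any(w[p] - w[q] <= d - k and dom_k(bm[p], bm[q], d, k)[0]
                       for p in range(len(data)))]


points = [(1, 8), (2, 6), (3, 7), (4, 3), (5, 5), (6, 1), (7, 4), (8, 2)]
print(psa_k(points, 2))  # [0, 2, 4, 6, 7]
```

For k = d the loop conditions reduce to an ordinary dominance test, which is why the call above returns the conventional skyline of the eight sample points.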

To measure the work efficiency of PSA$_k$, we count the average percentage of points checked for each point and the average number of bitmap comparisons in each DOM$_k$ operation. Fig. 14 shows the average percentage of points checked in the three datasets with $d \in \{24, 30\}$. The total execution time of PSA$_k$ is mainly determined by the number of DOM$_k$ operations, which is proportional to the number of points checked. Thus, we also observe peaks when $k \approx 0.8d$, matching those in the plots of the average execution time. The results indicate that PSA$_k$ is efficient because the average percentage of points checked is less than 5%.

Figs. 15 and 16 show the average number of bitmap comparisons for each DOM$_k$ operation when varying k with $n = 10^6$ and $d \in \{24, 30\}$. We show the median, the first and third quartiles, and the lower and upper whiskers, which represent the 5th and 95th percentiles, respectively. When k is close to $d/2$, the upper whisker of the number of bitmap comparisons is up to six. In other words, 95% of the DOM$_k$ operations obtain the answer after comparing no more than six characteristic bitmaps. The median and the upper and lower whiskers converge to two as k increases, which means that 95% of the DOM$_k$ operations obtain the answer after comparing two characteristic bitmaps. The results show that considering all attributes simultaneously verifies k-dominance relations efficiently.

6. Conclusions

Skyline and k-dominant skyline queries have been used frequently to retrieve preference points, but the computational cost increases rapidly as the cardinality and dimensionality increase. In this paper, we introduced characteristic bitmaps to verify dominance and k-dominance relations efficiently by considering all attributes simultaneously. We also used the weights of characteristic bitmaps to represent their strength and eliminated unnecessary verifications by region-level relations. Furthermore, we proposed parallel algorithms for skyline and k-dominant skyline queries on massively parallel architectures.
The results of the experiments, in which we evaluated the performance of PSA and PSA$_k$, indicate that PSA outperforms the state-of-the-art algorithms and that PSA$_k$ performs well for k-dominant skyline queries. In high-dimensional datasets, PSA is up to 10 times faster than the state-of-the-art algorithm. PSA$_k$ is up to 200 times faster than the basic algorithm. With characteristic bitmaps, each point is checked on average against less than 5% of the points in k-dominant skyline queries, and 95% of the DOM$_k$ operations take only two comparisons.

In general, k-dominant skyline queries can select more meaningful representative skyline points in high-dimensional datasets. However, an important step in applying the k-dominant skyline algorithm is to determine an appropriate k such that the size of the k-dominant skyline is manageable. From our experimental observations, smaller values of k are recommended if the dimensionality is large. To extend the applications of the k-dominant skyline, developing an efficient strategy to identify the value of k in large high-dimensional datasets is a worthwhile research topic.

References

[1] F.N. Afrati, P. Koutris, D. Suciu, J.D. Ullman, Parallel skyline queries, Theory Comput. Syst. 57 (4) (2015) 1008–1037.
[2] A. Awasthi, A. Bhattacharya, S. Gupta, U.K. Singh, k-dominant skyline join queries: extending the join paradigm to k-dominant skylines, in: 2017 IEEE 33rd International Conference on Data Engineering, 2017, pp. 99–102.
[3] Z.-D. Bai, C.-C. Chao, H.-K. Hwang, W.-Q. Liang, On the variance of the number of maxima in random vectors and its applications, Ann. Appl. Probab. 8 (3) (1998) 886–895.
[4] I. Bartolini, P. Ciaccia, M. Patella, SaLSa: computing the skyline without scanning the whole sky, in: Proceedings of the 15th ACM International Conference on Information and Knowledge Management, 2006, pp. 405–414.
[5] N. Beckmann, H.-P. Kriegel, R. Schneider, B. Seeger, The R*-tree: an efficient and robust access method for points and rectangles, in: Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data, 1990, pp. 322–331.
[6] N. Bell, J. Hoberock, Thrust: a productivity-oriented library for CUDA, 2, GPU Computing Gems Jade Edition, 2011, pp. 359–371.


[7] K.S. Bøgh, I. Assent, M. Magnani, Efficient GPU-based skyline computation, in: Proceedings of the Ninth International Workshop on Data Management on New Hardware, 2013, pp. 5:1–5:6.
[8] K.S. Bøgh, S. Chester, I. Assent, SkyAlign: a portable, work-efficient skyline algorithm for multicore and GPU architectures, VLDB J. 25 (6) (2016) 817–841.
[9] S. Borzsony, D. Kossmann, K. Stocker, The skyline operator, in: Proceedings 17th International Conference on Data Engineering, 2001, pp. 421–430.
[10] C.-Y. Chan, H.V. Jagadish, K.-L. Tan, A.K.H. Tung, Z. Zhang, Finding k-dominant skylines in high dimensional space, in: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, 2006, pp. 503–514.
[11] W.-M. Chen, H.-K. Hwang, T.-H. Tsai, Maxima-finding algorithms for multidimensional samples: a two-phase approach, Comput. Geom. 45 (12) (2012) 33–53.
[12] W.-M. Chen, H.-K. Hwang, T.-H. Tsai, Efficient maxima-finding algorithms for random planar samples, Discrete Math. Theor. Comput. Sci. 6 (1) (2003) 107–122.
[13] S. Chester, D. Sidlauskas, I. Assent, K.S. Bøgh, Scalable parallelization of skyline computation for multi-core processors, in: 2015 IEEE 31st International Conference on Data Engineering, 2015, pp. 1083–1094.
[14] W. Choi, L. Liu, B. Yu, Multi-criteria decision making with skyline computation, in: 2012 IEEE 13th International Conference on Information Reuse Integration, 2012, pp. 316–323.
[15] J. Chomicki, P. Godfrey, J. Gryz, D. Liang, Skyline with presorting: theory and optimizations, in: Intelligent Information Processing and Web Mining, Springer Berlin Heidelberg, Berlin, Heidelberg, 2005, pp. 595–604.
[16] C.A.C. Coello, G.B. Lamont, D.A. Van Veldhuizen, et al., Evolutionary algorithms for solving multi-objective problems, 5, Springer, 2007.
[17] K. Deb, K. Sindhya, J. Hakanen, Multi-objective optimization, in: Decision Sciences: Theory and Practice, CRC Press, 2016, pp. 145–184.
[18] L.G. Dong, X.W. Cui, Finding k-dominant skyline for combined dataset, in: Measurement Technology and its Application III, 568, 2014, pp. 1534–1538.
[19] M. Ehrgott, Multicriteria optimization, Springer Science & Business Media, 2006.
[20] P. Godfrey, R. Shipley, J. Gryz, Maximal vector computation in large data sets, in: Proceedings of the 31st International Conference on Very Large Data Bases, 2005, pp. 229–240.
[21] K. Hose, A. Vlachou, A survey of skyline processing in highly distributed environments, VLDB J. 21 (3) (2012) 359–384.
[22] H.-K. Hwang, T.-H. Tsai, W.-M. Chen, Threshold phenomena in k-dominant skylines of random samples, SIAM J. Comput. 42 (2) (2013) 405–441.
[23] M. Kontaki, A.N. Papadopoulos, Y. Manolopoulos, Continuous k-dominant skyline computation on multidimensional data streams, in: Proceedings of the 2008 ACM Symposium on Applied Computing, 2008, pp. 956–960.
[24] D. Kossmann, F. Ramsak, S. Rost, Shooting stars in the sky: an online algorithm for skyline queries, in: Proceedings of the 28th International Conference on Very Large Data Bases, 2002, pp. 275–286.
[25] H.T. Kung, F. Luccio, F.P. Preparata, On finding the maxima of a set of vectors, J. ACM 22 (4) (1975) 469–476.
[26] J. Lee, S.-W. Hwang, BSkyTree: scalable skyline computation using a balanced pivot selection, in: Proceedings of the 13th International Conference on Extending Database Technology, 2010, pp. 195–206.
[27] J. Lee, S.-W. Hwang, SkyTree: scalable skyline computation for sensor data, in: Proceedings of the Third International Workshop on Knowledge Discovery from Sensor Data, 2009, pp. 114–123.
[28] J. Lee, S.-W. Hwang, Scalable skyline computation using a balanced pivot selection technique, Inf. Syst. 39 (2014) 1–21.
[29] K.C. Lee, W.-C. Lee, B. Zheng, H. Li, Y. Tian, Z-SKY: an efficient skyline query processing framework based on Z-order, VLDB J. 19 (3) (2010) 333–362.
[30] Z. Ma, K. Zhang, S. Wang, C. Yu, A double-index-based k-dominant skyline algorithm for incomplete data stream, in: 2013 IEEE 4th International Conference on Software Engineering and Service Science, 2013, pp. 750–753.
[31] X. Miao, Y. Gao, G. Chen, T. Zhang, k-Dominant skyline queries on incomplete data, Inf. Sci. 367 (2016) 990–1011.
[32] M. Morse, J.M. Patel, H.V. Jagadish, Efficient skyline computation over low-cardinality domains, in: Proceedings of the 33rd International Conference on Very Large Data Bases, 2007, pp. 267–278.
[33] K. Mullesgaard, J.L. Pedersen, H. Lu, Y. Zhou, Efficient skyline computation in MapReduce, in: 17th International Conference on Extending Database Technology, 2014, pp. 37–48.
[34] D. Papadias, Y. Tao, G. Fu, B. Seeger, An optimal and progressive algorithm for skyline queries, in: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, 2003, pp. 467–478.
[35] Y. Park, J.-K. Min, K. Shim, Parallel computation of skyline and reverse skyline queries using MapReduce, Proc. VLDB Endowment 6 (14) (2013) 2002–2013.
[36] M.A. Siddique, T. Hao, Y. Morimoto, k-Dominant skyline query computation in MapReduce environment, IEICE Trans. Inf. Syst. 98 (5) (2015) 1027–1034.
[37] K.-L. Tan, P.-K. Eng, B.C. Ooi, Efficient progressive skyline computation, in: Proceedings of the 27th International Conference on Very Large Data Bases, 2001, pp. 301–310.
[38] B. Yu, W. Choi, L. Liu, Exploring correlation for fast skyline computation, J. Supercomput. 73 (11) (2017) 5071–5102.
[39] J. Zhang, X. Jiang, W.-S. Ku, X. Qin, Efficient parallel skyline evaluation using MapReduce, IEEE Trans. Parallel Distrib. Syst. 27 (7) (2016) 1996–2009.
[40] S. Zhang, N. Mamoulis, D.W. Cheung, Scalable skyline computation using object-based space partitioning, in: Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, 2009, pp. 483–494.
