Knowledge-Based Systems 74 (2015) 89–105
Incremental evaluation of top-k combinatorial metric skyline query

Tao Jiang a, Bin Zhang a,*, Dan Lin b, Yunjun Gao c, Qing Li d

a College of Mathematics, Physics and Information Engineering, Jiaxing University, 56 Yuexiu Road (South), Jiaxing 314001, China
b Department of Computer Science, Missouri University of Science and Technology, 500 West 15th Street, Rolla, MO 65409, USA
c College of Computer Science, Zhejiang University, 38 Zheda Road, Hangzhou 310027, China
d Department of Computer Science, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong, China

* Corresponding author.
Article info

Article history: Received 22 March 2014; Received in revised form 22 September 2014; Accepted 9 November 2014; Available online 15 November 2014

Keywords: Query processing; Combinatorial skyline; Metric skyline; Algorithm; Spatial database
Abstract

In this paper, we define a novel type of skyline query, namely the top-k combinatorial metric skyline (kCMS) query. The kCMS query aims to find k combinations of data points according to a monotonic preference function such that each combination has the query object in its metric skyline. The kCMS query enables a new set of location-based applications that traditional skyline queries cannot offer. To answer the kCMS query, we propose two efficient query algorithms, which leverage a suite of techniques including sorting and threshold mechanisms, a reusing technique, and pruning heuristics to incrementally and quickly generate combinations of possible query results. We have conducted extensive experimental studies, and the results demonstrate both the effectiveness and the efficiency of our proposed algorithms.

© 2014 Elsevier B.V. All rights reserved.
1. Introduction

A skyline query retrieves every data point whose attribute vector is not dominated by that of any other data point in the same dataset. This type of query has fostered a large number of applications that facilitate decision making [2], business planning [20], sensor network management [33], etc. Recently, an interesting skyline query variant has emerged, called the combinatorial skyline [25,4,21], which returns groups of data points and ensures that the combination of data points in each such group is not dominated by any other data points. However, the combinatorial skyline can only handle data points in Euclidean space, which limits its adoption in the many applications whose data points lie in metric space, such as those processing biological sequences and textual strings. It is worth noting that processing skyline queries in metric space is a very challenging task, and very few works [3,8] have been proposed so far.

To address the aforementioned challenges, in this paper, we define and solve a novel type of skyline query, namely the top-k combinatorial metric skyline (kCMS) query. The kCMS query retrieves k combinations of data objects in metric space that satisfy the following two conditions: (1) each combination G has a query
object q in its metric skyline, i.e., q is not dominated by any other objects in the dataset with respect to G; and (2) they are the top-k combinations with respect to a strictly monotonic function of q. The significance of this query lies in its ability to identify the impact of q on multiple groups of data objects.

For a better understanding of the kCMS query, let us step through the following example. Consider a logistics company which aims to find two locations to open new branches near a warehouse q. An important selection criterion is location diversity: the two selected locations should not be concentrated in a narrow region, and no other branch should be near either of them. In other words, the combination can serve a larger number of users, and no branch in the combination is close to the branches of potential competitors. Therefore, to maximize profit, the combination of the two locations should have q in its metric skyline. In addition to the above selection criteria, the logistics company has one more requirement: the two new branches should be near q, i.e., the sum of the distances between the two locations and the warehouse should be minimized. All of these selection criteria can be handled coherently by our proposed kCMS query as follows. Fig. 1 shows six candidate locations that the logistics company considers as new branch locations. Since the logistics company is looking for two locations (i.e., m = 2), all combinations of two locations are listed in Fig. 1(c), where the column 'y/n' indicates whether the combination has q in its metric skyline or
not, and the column 'adist(Gj)' is the sum of the distances of the two locations (in the combination Gj) to the warehouse q. As shown in Fig. 1(c), there are multiple combinations which have q in their metric skyline. Assuming that the logistics company is only interested in the top 4 combinations closest to the warehouse q, it issues a 4CMS query, which returns the following results: G1{p1, p2}, G3{p1, p4}, G4{p1, p5}, and G7{p2, p4}. More specifically, G2 is not selected because q is not in its metric skyline; G8 is not selected because it is farther from the warehouse q than every combination in the query result, as is the case for the other non-selected combinations. Here, we refer to these groups as 2-skyline groups of the dataset, where the number 2 indicates the number of objects in each group.

Besides the above example, the kCMS query can facilitate decision making in a variety of applications, such as trip planning and disaster management. For example, suppose Bob has only one day for sightseeing in a city. There are ten attractions around his hotel, but he has time to visit at most three of them. In this case, Bob can perform a kCMS query on the ten attractions with his hotel as the query object q and the sum of distances as the monotonic function. The kCMS query will return the top k groups of attractions, each group containing three attractions. After reviewing the query results, Bob can pick one group of attractions for his trip. Another example is to leverage kCMS to dispatch rescue teams efficiently in emergency scenarios. Specifically, when a disaster occurs, more than one rescue team may be required at the scene, since each rescue team typically has its own specialty. It is thus a challenging task to determine, in a timely manner, the best combinations of rescue teams which have complementary specialties and are closest to the scene. Our proposed kCMS query can aid this challenging selection process by returning the k best combinations of rescue teams.

To answer the kCMS query, a naïve approach is to enumerate all combinations of objects, and then check whether each combination satisfies the two conditions of the kCMS query: (i) whether the combination has q in its metric skyline; (ii) whether the combination is one of the top k results according to the given monotonic function of the query object. Such an exhaustive approach may end up comparing C(n, m) = n!/((n − m)!m!) m-skyline groups, which is obviously time consuming. Therefore, in this paper, we propose an efficient kCMS query algorithm which seamlessly integrates the following techniques: (i) the incremental combination sorting (ICS) algorithm, which progressively generates the combinations according to the monotonic function used by the query; (ii) the early stopping (ES) technique, which helps significantly reduce the search space; (iii) the triangle-based pruning (TP) and spatial pruning (SP) heuristics, which prune a large number of ineligible combinations at an early stage; and (iv) the reuse heap (RH) technique, which further avoids redundant I/O accesses. In summary, our contributions are the following:
- We define a new skyline query variant, i.e., the top-k combinatorial metric skyline (kCMS) query.
- We propose a novel query algorithm which can efficiently answer kCMS queries.
- We formally prove the correctness of some of the pruning heuristics based on the theory of spatial skylines [28,29].
- We conduct extensive experiments using both real and synthetic datasets, and the results demonstrate the effectiveness and efficiency of our proposed algorithms under various experimental settings.

The rest of this paper is organized as follows. Section 2 reviews the related work. Section 3 gives the definition of the kCMS query. Sections 4 and 5 present the proposed query algorithms. Section 6 reports the experimental results and our findings. Finally, Section 7 concludes this paper and outlines some directions for future work.

2. Related work

In this section, we review the existing work related to the kCMS query, namely skyline queries and their variants, combinatorial skyline queries, and diversified queries.

2.1. Skyline queries and their variants

Since Börzsönyi et al. [2] proposed the skyline operator in the database community, a large number of algorithms have been proposed in the literature. For example, Kossmann et al. [17] iteratively partitioned the data into overlapping partitions around Nearest Neighbor (NN) objects and developed an NN approach to obtain the skyline objects using an R-tree [11]. Papadias et al. [23] improved the NN method by introducing the Branch and Bound Skyline (BBS) algorithm, which needs to traverse the R-tree only once to retrieve the skyline objects. To further improve the performance, Bartolini et al. [1] proposed the Sort and Limit Skyline algorithm (SaLSa), which does not even need to scan the whole dataset. More recently, Lee et al. [19] developed a skyline query processing framework based on the Z-order, called Z-SKY. In addition, many variants of the skyline query have also been extensively explored, such as the reverse skyline [6,18,9], subspace skyline [31,34], probabilistic skyline [24], preference skyline [16,22], threshold skyline [35], and mutual skyline [15].

There are also studies in the area of spatial skyline queries (SSQ) [28] and metric skyline queries (MSQ) [3]. Given a set of data points P and a set of query points Q, each data point can be associated with a number of derived spatial attributes, whereby each derived spatial attribute is the distance from the data point to a query point qi in Q. An SSQ retrieves those points in P which are not dominated by any other points in P considering their derived spatial attributes.
[Fig. 1(a) shows the dataset on the x–y plane: six points p1–p6 and the query object q = (8, 7); the M-tree entry e1 contains p1, p2, p3, and q, the entry e2 contains p4, p5, and p6, and the shaded region is the convex hull of Q = {p1, p3, p5}.]

(b) the coordinates

ID | (x, y) | dist(pi, q)
p1 | (7, 8) | 1.41
p2 | (6, 7) | 2.00
p3 | (6, 6) | 2.24
p4 | (10, 9) | 2.83
p5 | (11, 6) | 3.16
p6 | (12, 8) | 4.12

(c) all combinations

ID | adist(Gj) | y/n
G1 (p1 & p2) | 3.41 | y
G2 (p1 & p3) | 3.65 | n
G3 (p1 & p4) | 4.24 | y
G4 (p1 & p5) | 4.57 | y
G5 (p1 & p6) | 5.53 | y
G6 (p2 & p3) | 4.24 | n
G7 (p2 & p4) | 4.83 | y
G8 (p2 & p5) | 5.16 | y
G9 (p2 & p6) | 6.12 | y
G10 (p3 & p4) | 5.07 | y
G11 (p3 & p5) | 5.40 | y
G12 (p3 & p6) | 6.36 | y
G13 (p4 & p5) | 5.99 | n
G14 (p4 & p6) | 6.95 | n
G15 (p5 & p6) | 7.28 | n

Fig. 1. Illustration of a combinatorial metric skyline (m = 2).
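As a quick sanity check on Fig. 1, the following Python snippet recomputes adist for two of the combinations; it is a sketch using the coordinates from Fig. 1(b) and the Euclidean distance, and the variable names are ours.

import math

q = (8, 7)
coords = {'p1': (7, 8), 'p2': (6, 7), 'p3': (6, 6),
          'p4': (10, 9), 'p5': (11, 6), 'p6': (12, 8)}

def dist(a, b):
    # Euclidean distance between two 2-D points.
    return math.hypot(a[0] - b[0], a[1] - b[1])

def adist(G):
    # Sum of the distances from every member of the combination to q.
    return sum(dist(coords[p], q) for p in G)

print(round(adist(['p1', 'p2']), 2))  # 3.41, matching G1 in Fig. 1(c)
print(round(adist(['p2', 'p4']), 2))  # 4.83, matching G7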
Sharifzadeh et al. [28] utilized Voronoi diagrams, Delaunay graphs, and convex hulls to propose three efficient SSQ algorithms. Specifically, they proposed the B2S2 and VS2 algorithms for static datasets, and VCS2 for streaming query points; these exploit the geometric properties of the SSQ problem space to avoid an exhaustive examination of all point pairs in P and Q. Later, Son et al. [27] and Sharifzadeh et al. [29] further improved the VS2 algorithm. However, the L2 distance used in the above SSQ algorithms cannot reflect the road network distance in metro areas. Therefore, Son et al. [26] proposed to use the L1 (i.e., Manhattan) distance instead of L2. Based on SSQ, Chen et al. [3] defined a new skyline query with dynamic attributes, called the metric skyline query (MSQ), where the attributes of each data object are given by a set of dimension functions. MSQ is not limited to spatial data because it returns skyline points with dynamic attributes in the metric space. The main difference between SSQ and MSQ is that the distance functions used in MSQ include not only the Euclidean distance function of [28] but also other metric functions (e.g., the edit distance function). In other words, MSQ is a more generic skyline query. In terms of query algorithms, unlike SSQ, MSQ makes use of triangle-based pruning heuristics to reduce the search space.

Since our proposed kCMS query is closely related to SSQ [28] and MSQ [3], in the following we provide more details for these two types of queries. Let us reconsider the example in Fig. 1(a), where the dark gray region corresponds to the convex hull of the reference points Q = {p1, p3, p5}. Now assume that we need to compute the spatial skyline of Q. Clearly, the query object q is a spatial skyline point of Q according to Theorem 1 in [28], since the convex hull of Q contains q. In other words, {p1, p3, p5} is a result of the combinatorial metric skyline of q. On the other hand, p2 and p6 are also spatial skyline points of Q since they are the closest points to p3 and p5 in Q, respectively (see Lemma 1 in [28]). In order to retrieve the metric skyline of Q, MSQ searches the metric index, an M-tree [5], in a best-first manner [23]. Fig. 1(a) depicts a small M-tree, where the entry e1 contains p1, p2, p3, and q, and the entry e2 contains p4, p5, and p6. For the sake of clear presentation, we use the Euclidean distance as the similarity measure. The key is defined as the sum of the minimum distances between the current entry (e.g., e1) and each data point (e.g., p1) in Q. First, MSQ accesses the entry e1 and inserts its children q and p2 into the auxiliary heap H in the form (entry, key), since e1 has the minimum aggregate distance to Q. Note that p1 and p3 are not inserted into H since they belong to the data objects of Q. Then, q is popped from H and becomes a result of the MSQ. Next, p2 is inserted into the result set since it is not dominated by any point in the result set. In contrast, p4 is pruned because it is dominated by q. At last, the algorithm obtains p6 as another result of the MSQ. Unfortunately, the aforementioned algorithms only focus on individual data objects, not on their combinations. Therefore, they cannot be applied to solve our proposed kCMS query.

2.2. Combinatorial skyline queries

Our proposed kCMS query is also closely related to combinatorial skyline queries [25,12,14,4], which return the best groups of data objects according to the features of their elements [21].
Su et al. [25] were the first to introduce the top-k combinatorial skyline query (k-CSQ). The k-CSQ query aims to find the k combinations of skyline objects whose aggregate values for the most preferred attribute are the highest. The preference order is crucial in reducing the exponential search space. In fact, traditional skyline queries can be considered a special case of CSQ in which each combination contains only one skyline object. Guo et al. [12] also studied the CSQ and designed a pattern-based pruning algorithm to dramatically reduce the search space. The CSQ and k-CSQ may look similar to
our kCMS query. However, the k-CSQ retrieves skyline objects in Euclidean space, whereas our proposed kCMS query computes skylines in metric space. Therefore, many pruning heuristics of the k-CSQ cannot be directly used for the kCMS query. Recently, Chung et al. [4] defined an extended version of the k-CSQ and proposed two efficient query algorithms, the decomposition algorithm (DA) and the improved decomposition algorithm (IDA), which report all combinatorial skyline results. DA recursively decomposes the whole problem into a series of subproblems and then executes the skyline operator for each subproblem. The DA algorithm can prune the combinations that cannot be combinatorial skyline results without enumerating all combinations. In fact, some objects do not contribute new combinations to the combinatorial skyline, which may result in identical solutions across multiple subproblems. To avoid processing duplicate subproblems, Chung et al. further proposed the IDA algorithm, which sorts objects in descending order of dom(ti), the number of objects that dominate ti. Im et al. [14] studied the group skyline query, which is similar to CSQ in spirit, and developed two group skyline algorithms, GIncremental and GDynamic. GIncremental first removes all k-dominated objects from the dataset and then incrementally generates all candidate groups by exploiting various properties of group skyline computation. GDynamic overcomes the weakness of GIncremental by generating at once the set of all candidate groups that include a specific object. Moreover, GDynamic maintains a sorted list for each dimension as an index structure. Magnani et al. [21] introduced aggregate skylines, where the skyline works as a filtering predicate on sets of records; aggregate skyline queries merge the functionalities of two basic database operators, skyline and group-by. Compared with these existing works, our algorithms share some similarity with [4,14] in terms of the use of incremental computation. However, none of the existing works is able to answer the proposed kCMS query in metric space.

2.3. Diversified queries

During the last decade, diversified queries [30,13,32,10,7] have attracted considerable attention from the database community due to their applicability in many domains, such as ambiguous keyword search and personalized results. Gollapudi et al. [10] developed a set of natural axioms that a diversification system is expected to satisfy, and showed that no diversification function can satisfy all the axioms simultaneously. Drosou et al. [7] surveyed, classified, and comparatively studied various definitions, algorithms, and metrics for result diversification. In fact, diversity is also very important for the skyline query. Tao et al. [30] first introduced the concept of diversity into the skyline query and proposed the representative skyline, which best describes the tradeoffs among the different dimensions offered by the full skyline. Huang et al. [13] integrated k-means clustering into skyline computation to capture skyline diversity and improve the usefulness of skyline results. Valkanas et al. [32] presented a novel definition of diversity which, in contrast to previous proposals, is intuitive because it is based solely on the domination relationships among points. Our algorithms also consider the concept of diversity, which is integrated into the combinatorial skyline query. To the best of our knowledge, this is the first attempt at diversified queries over combinatorial data.

3. Problem statement
In this section, we formally define the top-k combinatorial metric skyline (kCMS) query. Table 1 summarizes the notations used throughout this paper. We use point and object interchangeably to refer to a database object.
Table 1. Symbols and their descriptions.

Notation | Description
P, n | The dataset, and the number of objects in the dataset
m | The number of objects in a combination
dist, adist | The metric function, and the sum-distance function
e | An entry in the index
p, q | A data object, and the query object
Ω | An ascending list
Ω_x^y, |Ω_x^y| | The set of combinations of selecting y objects from x objects, and its cardinality
G, |G| | A combination and its cardinality
Grlt | The resultant set of combinations of the CMS query
Grfn | The refined set of candidate combinations of the CMS query
k_score | The k-th score of the combinations in Grlt computed by adist
rfn_minscore | The minimum score of the combinations in Grfn computed by adist
H, Hr | An auxiliary heap and a reuse heap, respectively
[Fig. 2 depicts the IDM-tree as a matrix of nodes: row 0 at the bottom holds the empty sets Ω_0^0, Ω_1^0, . . ., Ω_{m−1}^0; each row y (1 ≤ y ≤ m) holds the nodes Ω_y^y, Ω_{y+1}^y, . . .; the top row m spans the nodes Ω_m^m through Ω_x^m across columns 1 through x − m + 1; and each node Ω_x^y is linked to its two children Ω_{x−1}^y and Ω_{x−1}^{y−1}.]

Fig. 2. The logic structure of the IDM-tree.
For easy illustration, we use the sum function as the monotonic function in this paper. Given a dataset P with n data objects pi (i ∈ [1, n]) in a metric space, for any two different objects p, p′ ∈ P\G, p dominates p′ with respect to the combination G ⊆ P containing m data objects (denoted as p ≺_G p′) if the following conditions hold: (i) ∀pi ∈ G, dist(p, pi) ≤ dist(p′, pi); and (ii) ∃pj ∈ G, dist(p, pj) < dist(p′, pj). The pairwise distance dist(pi, pj) between data objects pi and pj (i, j ∈ [1, n]) is a metric function satisfying the following properties for ∀u, v, w ∈ P: (i) dist(u, v) > 0 for u ≠ v, (ii) dist(u, v) = 0 ⇔ u = v, (iii) dist(u, v) = dist(v, u), and (iv) dist(u, w) ≤ dist(u, v) + dist(v, w). In what follows, we first introduce the basic definition of the metric skyline query, and then define our proposed kCMS query.

Definition 1 (Metric Skyline Query, MSQ [3]). Given a metric space database P and a reference set Q = {r1, r2, . . ., rm}, a metric skyline query returns all the objects such that each object p among them is not dominated by any other object p′ ∈ P\{p} w.r.t. Q, namely, ¬∃p′ ∈ P\{p}: p′ ≺_Q p.
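To make the dominance test concrete, here is a minimal Python sketch; it assumes 2-D points given as tuples and uses the Euclidean distance as the metric, and the function names are ours, not the paper's.

import math

def dist(u, v):
    # Euclidean distance as a concrete metric; any metric works here.
    return math.hypot(u[0] - v[0], u[1] - v[1])

def dominates(p, p_prime, G):
    # p dominates p' w.r.t. G: p is no farther from every pi in G and
    # strictly closer to at least one pi in G.
    return (all(dist(p, pi) <= dist(p_prime, pi) for pi in G) and
            any(dist(p, pi) < dist(p_prime, pi) for pi in G))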
Definition 2 (top-k Combinatorial Metric Skyline, kCMS). Given a metric space database P and a query object q, a top-k combinatorial metric skyline query retrieves k combinations such that, for each combination G ⊆ P (|G| = m, m ≥ 2), (i) q is among the metric skyline of G, and (ii) the k combinations have the minimum sum score Σ_{j=1}^{k} adist(q, Gj), where adist(q, G) = Σ_{i=1}^{m} dist(pi, q), pi ∈ G, dist(·) is a metric function, and m denotes the number of objects in G.

To the best of our knowledge, none of the existing works has studied the kCMS problem. As discussed in the introduction, the naïve approach that uses a linear scan (denoted as the LS approach) is very time consuming due to its extensive computation and huge I/O cost. In the following sections, we present our proposed efficient kCMS query algorithms.
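For reference, the following is a minimal Python sketch of the LS baseline under Definition 2, assuming 2-D points as tuples, Euclidean distance, and q not stored in P; all names are illustrative rather than the paper's.

import math
from itertools import combinations

def dist(u, v):
    return math.hypot(u[0] - v[0], u[1] - v[1])

def q_in_metric_skyline(q, G, P):
    # q belongs to the metric skyline of G iff no p' in P \ G dominates q
    # w.r.t. G (Definition 1).
    return not any(
        all(dist(pp, pi) <= dist(q, pi) for pi in G) and
        any(dist(pp, pi) < dist(q, pi) for pi in G)
        for pp in P if pp not in G)

def kcms_linear_scan(q, P, m, k):
    # LS baseline: enumerate all C(n, m) combinations, keep those with q in
    # their metric skyline, and return the k combinations of smallest adist.
    cands = [G for G in combinations(P, m) if q_in_metric_skyline(q, G, P)]
    cands.sort(key=lambda G: sum(dist(p, q) for p in G))
    return cands[:k]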
4. Incremental combination sorting algorithm for kCMS query

In this section, we first introduce a novel data structure, the Iterative Decomposing and Merging tree (IDM-tree), which is the key data structure used by our query algorithm. Then, we discuss how to create the IDM-tree and utilize it to incrementally generate combinations of data objects.

4.1. The structure of the IDM-tree

Assume that there are n data objects p1, p2, . . ., pn in the dataset P, and each object pi has a weight w(pi). These n objects are organized in an ascending order of their weights and form a list Ω = {p1, p2, . . ., pn}, where w(p1) < w(p2) < . . . < w(pn). Then, the weight of a combination of objects (denoted as G) can be computed as w(G) = Σ_{i=1}^{m} w(pi ∈ G), where m (2 ≤ m < n) is the number of objects in G. Let Ω_x^y be the set of combinations formed by selecting y data objects from the first x (x ≥ y) data objects in Ω. For example, Ω_3^2 represents the set {{p1, p2}, {p1, p3}, {p2, p3}}.

We now proceed to introduce the IDM-tree, which is a matrix structure containing m + 1 rows (or levels) and multiple columns. The bottom row is the 0-th row, and each node in this row corresponds to an empty set Ω_j^0 (0 ≤ j ≤ m − 1). The node Ω_x^y in the y-th row (1 ≤ y ≤ m, y ≤ x ≤ n) contains the combinations of selecting y data objects from the first x data objects in Ω, and is linked to the two adjacent sets Ω_{x−1}^y and Ω_{x−1}^{y−1}, as shown in Fig. 2. We call the current set Ω_x^y the father set, and the adjacent sets Ω_{x−1}^y and Ω_{x−1}^{y−1} the children sets. The ordinal of the columns starts from the left side. Each node in the first column corresponds to an initial set Ω_j^j (1 ≤ j ≤ m), which includes only the combination of the first j data objects in Ω, e.g., Ω_2^2 = {{p1, p2}}. Observe that there exists a recursive relationship among the three sets Ω_x^y, Ω_{x−1}^y, and Ω_{x−1}^{y−1}, as shown in equality (1):

Ω_x^y = Ω_{x−1}^y ∪ (Ω_{x−1}^{y−1} ⊕ px),  (1)

where the symbol '⊕' concatenates a set with a data object. For example, if x = 3 and y = 2, then Ω_x^y = Ω_3^2 = {{p1, p2}, {p1, p3}, {p2, p3}}, Ω_{x−1}^y = Ω_2^2 = {{p1, p2}}, and Ω_{x−1}^{y−1} ⊕ px = Ω_2^1 ⊕ p3 = {{p1}, {p2}} ⊕ p3 = {{p1, p3}, {p2, p3}}. We refer to the incremental set of Ω_x^y as ΔΩ_x^y, which is the set Ω_{x−1}^{y−1} ⊕ px. Thus, equality (1) can be rewritten as equality (2):

Ω_x^y = Ω_{x−1}^y ∪ ΔΩ_x^y.  (2)
According to the equalities (1) and (2), we obtain two important properties of the IDM-tree.

Property 1 (Incremental Property). Each combination of a set Ω_x^y can be represented in an incremental form that: (i) consists of its children sets in the same row, and (ii) can be decomposed until its children sets consist of the initial sets in the bottom row.

The incremental property holds since the set Ω_x^y is divided into the two sets Ω_{x−1}^y and Ω_{x−1}^{y−1} ⊕ px, which can be further decomposed into the incremental form according to the equalities (1) and (2). Fig. 3 shows the incremental forms of a set Ω_4^3: Ω_4^3 = ΔΩ_3^3 ∪ ΔΩ_4^3 and Ω_4^3 = ΔΩ_3^3 ∪ (ΔΩ_2^2 ⊕ p4) ∪ (ΔΩ_1^1 ⊕ p3 ⊕ p4) ∪ (ΔΩ_2^1 ⊕ p3 ⊕ p4), where ΔΩ_3^3 = Ω_3^3 = {{p1, p2, p3}}, ΔΩ_2^2 = Ω_2^2 = {{p1, p2}}, ΔΩ_1^1 = Ω_1^1 = {{p1}}, and ΔΩ_2^1 = {{p2}}. By taking advantage of this property, we propose an incremental storage schema that significantly saves storage space. For instance, a node in the IDM-tree, e.g., Ω_4^3, only needs to store its incremental combinations, as shown in Fig. 3.

[Fig. 3 decomposes Ω_4^3 = {(p1, p2, p3), (p1, p2, p4), (p1, p3, p4), (p2, p3, p4)} as Ω_4^3 = Ω_3^3 ∪ ΔΩ_4^3, with ΔΩ_4^3 = Ω_3^2 ⊕ p4 = (Ω_2^2 ∪ ΔΩ_3^2) ⊕ p4 = (ΔΩ_2^2 ∪ ΔΩ_3^2) ⊕ p4.]

Fig. 3. Illustration of the incremental form for the IDM-tree.

Property 2 (Inclusive Property). The prefix of ΔΩ_x^y is the child of Ω_x^y, namely, Ω_{x−1}^{y−1}. The combinations of Ω_{x−1}^y are contained in their father set Ω_x^y.

The inclusive property indicates that the combinations of a set Ω_x^y can be easily generated from its children sets and the current data object px.
4.2. Creating the IDM-tree

To construct an IDM-tree, we use m B+-trees, denoted as the BC1-tree, BC2-tree, . . ., BCm-tree, where the i in BCi-tree indicates the number of data objects in the corresponding combination Gi. In each BCi-tree, w(Gi) is the key, and the combinations with respect to the nodes Ω_x^1, Ω_x^2, . . ., Ω_x^m are treated as the data to be indexed. The construction of the IDM-tree consists of two main phases, recursive inserting and global amending, as described below.

Recursive inserting. Produce new combinations in a zigzag order by scanning the IDM-tree from bottom to top and left to right. New combinations are inserted into the corresponding B+-trees until the integer x reaches its maximum while the following inequality (3) is satisfied:

|Ω_x^m| ≤ k ≤ |Ω_{x+1}^m|.  (3)

Global amending. Produce groups of combinations, called extended combinations (i.e., p_{x+1} ⊕ Ω_{x−1}^{y−1}), in order to identify the top-k combinations. A group of extended combinations is formed from a node Ω_{x−1}^{y−1} (y ≤ x) and the data objects behind px (i.e., p_{x+1}, p_{x+2}, . . .). The procedure continues until the integer r (r ≥ 1) in the following inequality (4) reaches its maximum value:

min_{j=1..|Ω_{x−1}^{y−1}|} w(p_{x+r} ⊕ Gj ∈ Ω_{x−1}^{y−1}) ≤ max_{j=1..|ΔΩ_{x−1}^y|} w(G′j ∈ ΔΩ_{x−1}^y).  (4)

In addition, the algorithm only inserts the combinations of a group p_{x+i} ⊕ Ω_{x−1}^{y−1} (1 ≤ i ≤ r, y ≤ x) whose weights are less than the maximum weight of the combinations in ΔΩ_x^m. In other words, for the current combination G ∈ p_{x+i} ⊕ Ω_{x−1}^{y−1}, the following inequality (5) is satisfied:

w(G ∈ p_{x+i} ⊕ Ω_{x−1}^{y−1}) ≤ max_{j=1..|ΔΩ_x^m|} w(G′j ∈ ΔΩ_x^m), 1 ≤ i ≤ r.  (5)

In order to distinguish the updated set obtained after the global amending from the old set, we denote the ΔΩ_x^y obtained after the global amending as Δ′Ω_x^y.

4.3. Progressively outputting the combinations using the IDM-tree

Next, we discuss the detailed incremental combination sorting (ICS) algorithm, which incrementally generates the combinations using the IDM-tree. The main idea is based on dynamic programming. The ICS algorithm recursively invokes two operations, decomposition and merging. The decomposition operation decomposes a set Ω_x^y (y ≤ x) into the two smaller sets Ω_{x−1}^y and Ω_{x−1}^{y−1} ⊕ px using Eq. (1). The merging operation then sorts and merges them into an ordered new set.

Lemma 1. In the IDM-tree, the combinations of any set Ω_x^y are in order.

Proof. Since any set Ω_x^y can be decomposed into a series of combinations which are sorted by the merging operation, Ω_x^y is in order. □

Lemma 2. Given a set Ω_x^y, the following relationship between ΔΩ_x^y and its children set Ω_{x−1}^y always holds: max_{i=1..|ΔΩ_x^y|} w(Gi ∈ ΔΩ_x^y) > max_{i=1..|Ω_{x−1}^y|} w(Gi ∈ Ω_{x−1}^y).

Proof. Since w(px) > w(p_{x−1}), Lemma 2 is correct by definition. □

Lemma 1 indicates that not only the combinations in Ω_x^y are in order, but also the combinations in ΔΩ_x^y, since ΔΩ_x^y = Ω_{x−1}^{y−1} ⊕ px. In other words, they are sorted in the local data space. In addition, Lemma 2 shows that the combinations in ΔΩ_x^y keep an ascending order in the whole space. However, our goal is to have the combinations in ΔΩ_x^y sorted in the global data space so that we can generate them incrementally. To achieve this, we leverage the following Theorem 1.

Theorem 1. If min_{j=1..|ΔΩ_{x+1}^y|} w(Gj ∈ ΔΩ_{x+1}^y) > max_{j=1..|Ω_x^y|} w(Gj ∈ Ω_x^y), the combinations in ΔΩ_x^y are in a global order; otherwise, the combinations in Δ′Ω_x^y are in a global order.

Proof. By the incremental property, we have min_{j=1..|ΔΩ_{x+k}^y|} w(Gj ∈ ΔΩ_{x+k}^y) = w(p_{x+k}) + min_{j=1..|Ω_{x+k−1}^{y−1}|} w(Gj ∈ Ω_{x+k−1}^{y−1}) for k ≥ 1. Clearly, min_{j=1..|Ω_{x+k−1}^{y−1}|} w(Gj ∈ Ω_{x+k−1}^{y−1}) is invariant for any integer k. In other words, if the prerequisite is satisfied, there does not exist a combination in any set ΔΩ_{x+k}^y (k ≥ 1) with a weight less than the maximum weight of the combinations in ΔΩ_x^y. In addition, Δ′Ω_x^y is in a global order since it has been amended globally. □

Based on Theorem 1, we develop the ICS algorithm as shown in Algorithm 1. Specifically, the ICS algorithm recursively produces combinations in a zig-zag manner (lines 1–2). At the beginning, the initial combinations, i.e., ΔΩ_y^y (2 ≤ y ≤ m), are inserted into the corresponding B+-trees.
[Fig. 4 traces the running example: Ω = {p1, p2, p3, p4} with w(p1) = 1, w(p2) = 2, w(p3) = 3, and w(p4) = 3.5; after the global amending, Δ′Ω_3^2 = {{p3, p1}, {p4, p1}} and Δ′Ω_4^2 = {{p3, p2}, {p4, p2}, {p4, p3}}, and the combinations {p3, p1}, {p4, p1}, {p3, p2}, {p4, p2}, {p4, p3} carry the weights 4, 4.5, 5, 5.5, 6.5.]

Fig. 4. The illustration of progressively outputting the combinations.
Algorithm 1. Incremental Combination Sorting (ICS) algorithm

ICS(Ω, m, k)
Input: an ordered list Ω, the parameters k and m
Output: incremental combinations
1. for (col = 2; |Ω_{x+1}^y| ≤ k ≤ |Ω_{x+2}^y|; col = col + 1) do // apply inequality (3)
2.   for (row = 2; row ≤ m; row = row + 1) do
3.     x = col + row − 1, y = row
4.     insert the combinations in ΔΩ_x^y into the BCy-tree
5.     μ = max_{j=1..|Δ′Ω_{x−1}^y|} w(Gj ∈ Δ′Ω_{x−1}^y); ν = min_{j=1..|ΔΩ_x^y|} w(Gj ∈ ΔΩ_x^y)
6.     if ν ≤ μ then // globally amend by applying Theorem 1
7.       for (k = 1; min_{j=1..|ΔΩ_{x+k}^y|} w(Gj ∈ ΔΩ_{x+k}^y) < μ; k = k + 1) do // apply inequality (4)
8.         for (i = 1; w(Gi ∈ ΔΩ_{x+k}^y) < μ; i = i + 1) do // apply inequality (5)
9.           if Gi is not in the BCy-tree then // produce Gi in an ascending order
10.            insert Gi into the BCy-tree to obtain Δ′Ω_{x−1}^y
11.    if (y is equal to m) then output the combinations in Δ′Ω_{x−1}^y
Then, the algorithm generates the current incremental combinations ΔΩ_x^y and inserts them into the BCy-tree (line 4). Meanwhile, ICS uses the variable μ to store the maximum weight of the updated incremental combinations of the previous column (line 5). Note that if the incremental combinations of the previous column have not been updated, then Δ′Ω_{x−1}^y = ΔΩ_{x−1}^y. Another variable ν stores the minimum weight of the incremental combinations of the current column, i.e., min_{j=1..|ΔΩ_x^y|} w(Gj ∈ ΔΩ_x^y) (line 5). Next, if ν ≤ μ (lines 6–10), the algorithm updates the incremental combinations of the current column by conducting the global amending. Finally, when the superscript y is equal to m, the algorithm outputs the incremental combinations of the previous column, since they are then in a global order (line 11). The following example illustrates the procedure of the ICS algorithm.

Example 1. Assume that we need to select a combination of two objects from Ω = {p1, p2, p3, p4} in Fig. 4, where the objects have the weights 1, 2, 3, and 3.5, respectively. To generate all combinations in an ascending order of w(G), the ICS algorithm first outputs {p2, p1}. Then, ICS uses p3 and Ω_2^1 = {{p1}, {p2}} to generate ΔΩ_3^2 = ({p1} | {p2}) ⊕ p3 = {{p3, p1}, {p3, p2}}. Since w({p4, p1}) < w({p3, p2}), ICS uses the combinations in ΔΩ_4^2 to globally amend ΔΩ_3^2. Thus, the combinations in ΔΩ_4^2 and ΔΩ_3^2 are sorted. ICS obtains the updated combinations after the global amending, which are Δ′Ω_3^2 = {{p3, p1}, {p4, p1}} and Δ′Ω_4^2 = {{p3, p2}, {p4, p2}, {p4, p3}}. □
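The ordering behavior in Example 1 can be reproduced with a plain heap; the sketch below is only a reference implementation (it materializes every pair, which the ICS algorithm deliberately avoids), and the names are ours.

import heapq
from itertools import combinations

def sorted_pairs(points, weight, k):
    # Emit up to k 2-combinations in ascending total weight.
    heap = [(weight(a) + weight(b), (a, b)) for a, b in combinations(points, 2)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(min(k, len(heap)))]

w = {'p1': 1, 'p2': 2, 'p3': 3, 'p4': 3.5}   # the weights of Example 1
print(sorted_pairs(list(w), w.get, 5))
# [('p1','p2'), ('p1','p3'), ('p1','p4'), ('p2','p3'), ('p2','p4')],
# matching the weights 3, 4, 4.5, 5, 5.5 shown in Fig. 4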
5. Top-k Combinatorial Metric Skyline (kCMS) query processing

So far, we have seen that our proposed ICS algorithm does not need to compute an exponential number of combinations to obtain the top-k combinations according to the monotonic function given in the query. Based on the ICS algorithm, we propose an efficient kCMS query algorithm that incrementally generates the combinatorial metric skyline. The details of the algorithm are elaborated in the following.

5.1. An overview of the kCMS query algorithm

A general framework for kCMS query processing consists of three phases: (i) the enumerating phase, (ii) the pruning phase, and (iii) the refinement phase. Let Grlt denote the query results, and Grfn denote the combinations to be refined; Grfn contains intermediate nodes to be extended later.

At the beginning, the enumerating phase searches the index (i.e., an M-tree [5]) in a best-first (BF) [23] manner until the heap H becomes empty. When a data object p is popped, the algorithm enumerates the combinations formed by p and the other data objects (or nodes) in H. These combinations are processed one by one. The current enumeration terminates once the current combination reaches a threshold value, and the algorithm then begins the next round of enumeration. The whole search terminates when certain conditions (given below) are satisfied. During this phase, the IDM-tree is built. The pruning phase prunes the combinations that are not qualified as query results according to several pruning heuristics. The refinement phase extends the combinations by replacing the current entry e with its children ei so that the resulting combinations contain only data objects and no intermediate nodes. Then, the obtained combinations are processed to find the final query results. The monotonic function considered in this paper is the sum of the attribute values of the metric skyline, i.e., adist(G) = Σ_{i=1}^{|G|} dist(pi, q).

5.2. Early Stopping (ES) for the kCMS query

In the enumerating phase, it is obviously very time consuming to compute all combinations of all data objects. In order to improve efficiency, we propose several stopping criteria that help terminate the enumeration much earlier without examining all combinations. Our idea follows the spirit of [1], i.e., 'stop the skyline computation without applying the skyline filter to all the objects.' This method significantly reduces the number of objects to be checked and the candidate combinations to be evaluated. The stopping criteria are presented as follows. Let k_score denote the k-th score of the combinations in Grlt, and rfn_minscore denote the minimum score of the combinations in Grfn. Then, we have Theorem 2 below.

Theorem 2. Let entry e be the top entry in H and |G| = m. The kCMS query can be terminated if e.key × m > k_score and rfn_minscore > k_score.

Proof. Since rfn_minscore is larger than k_score, there cannot be any combination in Grfn with a score lower than k_score. Moreover, all new combinations inserted into Grfn after e is popped must have a score larger than rfn_minscore. If e.key × m is larger than k_score, all new combinations generated using e and the entries in H will have a score larger than k_score. This is because any entry e′ in G except e has a score larger than e.key,
since the algorithm searches the nodes in a best-first manner, in ascending order. Combining the above two cases, we conclude that there does not exist a combination with a score lower than k_score. Therefore, the search can be stopped. □

Based on Theorem 2, we immediately obtain Corollary 1 as follows.

Corollary 1. Assume that the current entry popped from H is e and |G| = m. The kCMS query can stop searching the rest of the data space if Grfn = ∅ and e.key × m > k_score.

Theorem 2 and Corollary 1 show that the algorithm can stop at a certain point even though most of the index has not yet been visited. In fact, Bartolini et al. [1] have proved that using the MiniMax rule to choose the stop point is optimal, independent of the specific function used to sort the objects. Clearly, Theorem 2 and Corollary 1 are very important because they shrink the search space significantly. However, for the current data object p, the algorithm still needs to enumerate a large number of combinations. For example, the number of combinations is C(100, 3) = 161,700 if n = 100 and m = 3. Therefore, we propose another stopping criterion to further shorten the enumeration process. The new criterion is given by Theorem 3.

Theorem 3. Given the combination G sorted by the ICS algorithm, if its minimum distance to q, i.e., mindist(q, G), is greater than k_score, the enumeration can be stopped, where mindist(q, G) = Σ_{i=1}^{|G|} mindist(q, ei ∈ G) is the sum of the minimum distances between q and each ei ∈ G.

Proof. Since the incremental combinatorial sorting method is used, all combinations are generated in a global order. In other words, there does not exist any later combination with a score lower than k_score if mindist(q, G) > k_score. □

Theorem 3 is critical to the performance of the kCMS query because it avoids enumerating and evaluating a large number of unnecessary combinations. Our experimental results in Section 6 also confirm the effectiveness of Theorem 3. It is worth noting that the ICS algorithm must be adopted for Theorem 3 to apply. Next, we introduce the pruning condition for the combinations in Grfn.

Theorem 4. The kCMS query algorithm can delete G′ and the combinations behind G′ from Grfn if all combinations in Grfn are in an ascending order and mindist(q, G′) > k_score.

Theorem 4 helps improve query performance by largely reducing the storage space needed by Grfn and by avoiding the evaluation of unqualified combinations. Note that Theorems 2–4 can be easily revised for other monotonic functions, such as Π_{ei∈G} dist(ei ∈ G), which computes the volume over the attributes of a metric skyline.
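The two stopping tests reduce to a few comparisons at runtime; the following Python sketch states them explicitly (parameter names are our own stand-ins for the quantities in Theorems 2 and 3 and Corollary 1).

def can_terminate(e_key, m, k_score, rfn_minscore, rfn_empty):
    # Theorem 2 / Corollary 1: stop the whole search once the cheapest
    # possible new combination from the top heap entry already exceeds
    # k_score and Grfn can no longer improve the result.
    return e_key * m > k_score and (rfn_empty or rfn_minscore > k_score)

def stop_enumeration(mindist_q_G, k_score):
    # Theorem 3: with ICS the combinations arrive in ascending order, so
    # the first G with mindist(q, G) > k_score ends the current round.
    return mindist_q_G > k_score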
5.3. Triangle-based Pruning (TP)

Up to now, we have introduced the main stopping criteria used by the kCMS query algorithm. In what follows, we present the triangle-based pruning (TP) heuristics. Let q be a query object, and let p and rp be the pivot object and the radius of the entry e, respectively. We denote the maximum distance between any objects pi ∈ e and pj ∈ e (i ≠ j) as UB(pi, pj), and the minimum distance between any object pk ∈ e and q as LB(pk, q). Then, we obtain the following pruning heuristics.

Theorem 5 (Triangle-based Pruning Heuristics). Let e be a node in the metric index which contains at least (m + 1) data objects. The node e does not contain any results of a CMS query if UB(pi, pj) < LB(pk, q).

[Fig. 5 shows a node e = {p1, p2, p3, p4, p5} with pivot p5, two outside points p6 and p7, and a distant query object q.]

Fig. 5. Illustration of triangle-based pruning.
Proof. Let G be a combination from the index node e (|G| = m). There must exist an object p′ ∈ e whose distance from any object pi ∈ G, dist(p′, pi ∈ G), is no more than UB(pi, pj). Since UB(pi, pj) < LB(pk, q), it holds that dist(p′, pi ∈ G) < LB(pk, q) ≤ dist(q, pi ∈ G). In other words, q is dominated by p′ with respect to G, so G cannot be a result of the CMS query. Since G is an arbitrary combination from e, the proof is complete. □

Theorem 5 indicates that a node far from q generally does not contain any results of a CMS query. Therefore, it is more efficient for the query algorithm to retrieve the data objects closer to q first. To this end, our proposed algorithm follows the best-first manner [23], where the key is defined as the minimum distance between the entry e and q. Fig. 5 illustrates the rationale of Theorem 5, where p5 is the pivot of the index node e = {p1, p2, p3, p4, p5} and q is the query object. Assume that the candidate combination G is the set {p1, p2, p3}. The distance between p2 and p3, namely dist(p2, p3), is the maximum distance between any two data objects in the node e, and LB(p4, q) is the minimum distance between any data object pk ∈ e (k ∈ [1, 5]) and q. Since dist(p2, p3) < LB(p4, q), we have dist(pi, p4) < dist(pi, q) for all i ∈ [1, 3]. In other words, p4 dominates q with respect to G. Hence, G is not a result of the CMS query.

However, Theorem 5 requires pre-computing the distance of each pair of objects in order to obtain UB(pi, pj). To further improve efficiency, we replace UB(pi, pj) with 2rp, since UB(pi, pj) ≤ dist(pi, p) + dist(pj, p) ≤ 2rp. Moreover, we also replace LB(pk, q) with dist(p, q) − rp, since dist(p, q) − rp ≤ dist(p, q) − dist(p, pk) ≤ LB(pk, q). Accordingly, we obtain the following Corollary 2.

Corollary 2. Let e be a node in a metric index which contains at least (m + 1) data objects. The node e does not contain any query result if 3rp ≤ dist(q, p).

We omit the full proof of Corollary 2 to save space, since it follows straightforwardly from Theorem 5: if 3rp ≤ dist(q, p), then 2rp ≤ dist(q, p) − rp, so the upper bound on pairwise distances falls below the lower bound on distances to q. Corollary 2 enables our kCMS query algorithm to quickly prune a large number of combinations in the node e by comparing them only against the query object. It provides a basis for pruning the combinations in an intermediate node. Observe that, like Theorem 5, Corollary 2 can be applied only when the node e contains at least (m + 1) data objects. However, in some cases, a node can be pruned even when it has only m data objects, as given in the following Corollary 3. In Corollary 3, UB(Y, X) = dist(x, y) + rx + ry is an upper bound on the distance between any object in an index node X and any object in an index node Y, where x and y are their pivots and rx and ry their radii.

Corollary 3. Given a node X in a metric index which contains m data objects, another node Y, and a query object q: if UB(Y, X) < LB(q, X), X does not contain a result of the CMS query.

Proof. Since dist(y′ ∈ Y, ∀x′ ∈ X) ≤ UB(Y, X) < LB(q, X) ≤ dist(q, ∀x′ ∈ X), q is dominated by any object y′ ∈ Y w.r.t. the combination formed from X. Hence, X does not contain any query result of the CMS query. □

It is worth noting that Corollary 3 has less pruning power than Corollary 2, because Corollary 3 can prune just one combination when its condition is satisfied. Therefore, we further develop the following Corollary 4, which can be used in a more general case. Corollary 4 is derived from the observation that the data objects in a combination usually come from multiple index nodes rather than a single index node.
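The pivot-and-radius test of Corollary 2 is cheap to evaluate; here is a minimal Python sketch, assuming 2-D points as tuples and Euclidean distance, with names of our own choosing.

import math

def dist(u, v):
    return math.hypot(u[0] - v[0], u[1] - v[1])

def prunable_by_corollary2(pivot, radius, n_objects, m, q):
    # Corollary 2: a node holding at least m + 1 objects cannot contribute
    # any CMS result once 3 * radius <= dist(q, pivot), because some object
    # in the node then dominates q w.r.t. every combination from the node.
    return n_objects >= m + 1 and 3 * radius <= dist(q, pivot)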
Algorithm 2. kCMS_Processing

Input: M-tree I constructed over P, user-specified parameters m and k, query object q
Output: the result of the kCMS query
1. Grlt = ∅, Grfn = ∅, k_score = +∞, initialize min-heap H accepting entries in the form (e, key)
2. insert (Root, mindist(e, q)) into heap H
3. while (heap H is not empty) do
4.   remove top entry e from H
5.   if ((e.key × m) > k_score) // Theorem 2 and Corollary 1
6.     if (Grfn is not empty and rfn_minscore > k_score) return
7.     if (Grfn is empty) return
8.   if (e is a data object)
9.     for ∀ combination G of e not marked as a false positive do // by the ICS method
10.      if (mindist(q, G) > k_score) break // Theorem 3
11.      ProcessG(G, Grlt, Grfn, k, k_score)
12.  else // intermediate node
13.    if (|e| ≥ m + 1 and 3 × e.rp ≤ dist(e.piv, q)) // Corollary 2, rp is the radius of e
14.      mark all combinations from e as false positives
15.    if (|e| == m and ∃e′ s.t. UB(e′, e) < LB(q, e) or e′ dominates q w.r.t. G from e) // Corollary 3
16.      mark the combination from e as a false positive
17.    if (|e| + |e′| ≥ m and ∃e′ s.t. UB(e′, e) < LB(q, e) and UB(e′, e) < LB(q, e′)) // Corollary 4
18.      mark the combinations formed by e and e′ as false positives
19.    for each child ei ∈ e do
20.      insert (ei, mindist(ei, q)) into heap H
21.    for ∀ combination G of Grfn including e do // update the combinations of Grfn
22.      remove G from Grfn
23.      for each combination G′ of G do // obtain G′ by replacing e of G with ei
24.        if (mindist(q, G′) > k_score) continue // Theorem 4
25.        ProcessG(G′, Grlt, Grfn, k, k_score)
Corollary 4. Assume that two nodes X and Y together contain at least m data objects (|X| + |Y| ≥ m), and that a combination G consists of l1 (l1 < m) objects from X and m − l1 objects from Y. If UB(X, Y) < LB(q, X) and UB(X, Y) < LB(q, Y), then G is not a result of the CMS query.

Proof. Given a point p′ ∈ (X ∪ Y)\G and a point pi ∈ G (i ∈ [1, |G|]), we have dist(p′, pi) ≤ UB(X, Y) < LB(q, X). If pi ∈ X, it holds that dist(q, x) − rx < dist(q, x) − dist(x, pi), because dist(x, pi) < rx. On the other hand, we have dist(q, x) − dist(x, pi) ≤ dist(q, pi) according to the triangle inequality. Thus, dist(p′, pi) < dist(q, x) − rx < dist(q, x) − dist(x, pi) ≤ dist(q, pi), that is, dist(p′, pi) < dist(q, pi). Similarly, if pi ∈ Y, it holds that dist(p′, pi) < dist(q, pi). In summary, in both cases we can always conclude that dist(p′, pi) < dist(q, pi) for every pi ∈ G. Therefore, G is not a result of the CMS query by Definition 2. □

Similar to Corollary 3, Corollary 4 is applied only when the information of the sibling nodes is available. Clearly, Corollary 2 and Corollary 4 play an important role in quickly pruning an intermediate node e with at least m + 1 data objects.
5.4. kCMS query processing
We now present the complete kCMS query algorithm (shown in Algorithm 2), which integrates our proposed Early Stopping (ES) and Triangle-based Pruning (TP) techniques. We will step through the algorithm using the set of data points P = {p1, . . ., p9} and the query point q in Fig. 6, with m = 2 and k = 2. First, we index all the objects in the dataset P using an M-tree (denoted as I). kCMS_Processing retrieves the qualified combinations by traversing the M-tree in a best-first manner [23].
[Fig. 6(a) depicts the dataset: q lies near p3 and p4, with the pruning regions PR(p3, q) and PR(p4, q) drawn around it. Fig. 6(b) depicts the M-tree structure: Root → {e1, e2}, e1 → {e3, e4}, e2 → {e5, e6}, with e3 = {p1, p2}, e4 = {p3, p4}, e5 = {p5, p6}, and e6 = {p7, p8, p9}.]

Fig. 6. Illustration of kCMS query in an M-tree.
Table 2. kCMS processing for the example in Fig. 6 (k = 2, m = 2).

Actions | Contents of H | Candidate combinations | Contents of Grfn | Contents of Grlt
Visit Root | e1, e2 | – | ∅ | ∅
Expand e1 | e4, e3, e2 | – | ∅ | ∅
Expand e4 | p4, p3, e3, e2 | – | ∅ | ∅
Visit p4 | p3, e3, e2 | {p4, p3}, {p4, e3}, {p4, e2} | {{p4, e3}, {p4, e2}} | {{p4, p3}}
Visit p3 | e3, e2 | {p3, e3}, {p3, e2} | {{p3, e3}, {p4, e3}, {p4, e2}, {p3, e2}} | {{p4, p3}}
Expand e3 | p2, p1, e2 | {p3, p2}, {p3, p1} | {{p4, e2}, {p3, e2}} | {{p4, p3}, {p3, p2}}
Specifically, we maintain a min-heap H with entries in the form (e, key) (line 1), where the key is defined as the minimum distance between the entry e and the query point q (i.e., mindist(e, q)). Intuitively, a small key may result in a small aggregate value of a combination. We also initialize two empty sets, Grlt and Grfn, at line 1, which store the kCMS results and the candidate combinations for refinement, respectively. Each combination in Grfn includes at least one intermediate entry that will be extended later. The query algorithm starts by inserting e1 and e2 into H with their keys (i.e., mindist(e1, q) and mindist(e2, q)). Table 2 shows the contents of H, Grlt, and Grfn as well as the candidate combinations at each step, where a combination marked with a strikethrough is pruned in the next step. Then, e1, which has the minimum mindist, is removed from H, and its children e3 and e4, along with their mindist values, are inserted into H. At this point, the algorithm verifies whether the stopping condition is satisfied (lines 5–7) according to Theorem 2 and Corollary 1. Next, the algorithm processes e in the following two cases.

If e is a data object, the algorithm uses the ICS method to gradually enumerate the combinations that contain e and the entries in H (line 9). For example, the algorithm generates three combinations, {p4, p3}, {p4, e3}, and {p4, e2}, after p4 is removed from H. Thanks to the ICS approach, the enumeration in line 9 can stop after only a few combinations are enumerated. Line 10 exits the enumeration of the current entry e when the termination condition, i.e., mindist(q, G) > k_score, is satisfied. This allows many candidate combinations to be pruned by Theorem 3, and as a result the algorithm achieves a significant performance improvement. Before exiting the current
enumeration, each combination G is processed by ProcessG. Since {p4, e3} (respectively {p4, e2}) cannot be pruned by the sibling entry of any entry in {p4, e3} (respectively {p4, e2}), {p4, e3} and {p4, e2} are inserted into Grfn (see line 5 in ProcessG). Note that this method is very effective because it prunes some combinations of Grfn in advance. The combination {p4, p3}, in contrast, is inserted into the result set Grlt since q is in its metric skyline. Similarly, p3 is removed from H, and {p3, e3} and {p3, e2} are added to Grfn.

In the other case, when e is an intermediate node, the algorithm first filters out false positives (unqualified combinations) using Corollary 2, Corollary 3, and Corollary 4 (lines 13–18). These false positives are not processed any further (line 9). In order to better prune the combination(s) in line 15 and line 18, we only select the data objects (or entries) in e's sibling entry (denoted as e′). Then, the children of e are inserted into H (lines 19–20). In our example, after e3 is removed, the combination from e3, i.e., {p2, p1}, is marked as a false positive (line 15), and its children p2 and p1 are inserted into H. On the other hand, the algorithm uses e's children to update the combinations in Grfn which contain e, traversing all combinations in Grfn. Each updated combination is denoted as G′. For the above example, the algorithm first deletes {p3, e3} and {p4, e3} from Grfn, and then generates four new combinations by replacing e3 in each combination of Grfn (that contains e3) with its children p2 and p1, respectively. Finally, the algorithm processes each G′ one by one by invoking ProcessG. If mindist(q, G′) > k_score, the processing of G′ is skipped. As a result, {p4, p2} and {p4, p1} are pruned; meanwhile, {p3, p2} and {p3, p1} are used to update Grlt via UpdateRlt. Accordingly, the contents of Grfn and Grlt become {{p3, p2}, {p3, p1}, {p4, e2}, {p3, e2}} and {{p4, p3}, {p3, p2}}, respectively.
Procedure ProcessG(G, Grlt, Grfn, k, k_score)
Input: G: the candidate combination; Grlt: the resultant set; Grfn: the set of candidate combinations for refinement; k: a user-specified parameter; k_score: the k-th score of Grlt
Output: process G and maintain Grlt, Grfn, and k_score
1. if G consists of data objects then
2.   invoke the MSQ algorithm to judge whether q is in the metric skyline w.r.t. G
3.   if G is a result then UpdateRlt(G, Grlt, k, k_score)
4. else // G includes intermediate entry(entries)
5.   if (∃e′ dominates q w.r.t. G) then do nothing // select e′ from the sibling entries of ei ∈ G
6.   else insert G into Grfn
Procedure UpdateRlt(G, Grlt, k, k_score)
Input: G: the candidate combination; Grlt: the resultant set; k: a parameter; k_score: the k-th score of Grlt
Output: update Grlt and k_score
1. if (|Grlt| < k − 1) insert G into Grlt
2. else if (|Grlt| == k − 1)
3.   insert G into Grlt
4.   k_score = max{adist(q, Gi ∈ Grlt), i ∈ [1, |Grlt|]}
5. else // |Grlt| ≥ k
6.   if adist(q, G) < k_score then
7.     delete the combination Gj ∈ Grlt with the largest score
8.     insert G into Grlt and update k_score
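The bookkeeping in UpdateRlt amounts to maintaining a bounded max-heap of the k best scores; the Python sketch below illustrates one way to do this (Grlt here is a heap of (−score, G) pairs, and G should be an ordering-comparable tuple; the names are our own, not the authors' API).

import heapq

def update_result(G, score, Grlt, k):
    # Keep the k smallest-score combinations; return the current k_score
    # (+inf until k results exist).
    if len(Grlt) < k:
        heapq.heappush(Grlt, (-score, G))
    elif score < -Grlt[0][0]:
        heapq.heapreplace(Grlt, (-score, G))  # evict the worst result
    return -Grlt[0][0] if len(Grlt) == k else float('inf')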
At this moment, it holds that rfn_minscore = adist(q, {p4, e2}) and k_score = adist(q, {p3, p2}). In its next step, the query algorithm removes p2 from H and examines the stopping condition (lines 5–7). Since dist(p2, q) × 2 > k_score and rfn_minscore > k_score, the algorithm terminates according to Theorem 2. Note that in the above example, where k = 2, the TP technique has not been used. TP takes effect when the parameter k becomes larger. For example, if k = 4, the combinations formed by e5 and e6, i.e., {p5, p7}, {p5, p8}, {p5, p9}, {p6, p7}, {p6, p8}, and {p6, p9}, will be pruned by Corollary 4; the combination from e5, i.e., {p5, p6}, will be pruned by Corollary 3; and the combinations from e6, i.e., {p7, p8}, {p7, p9}, and {p8, p9}, will be pruned by Corollary 2.

5.5. Enhanced query processing with spatial pruning and the reuse technique

As presented in the previous section, the kCMS_Processing algorithm achieves a significant performance improvement over the baseline approach by pruning a large amount of the search space. However, there is still room for improvement. Observe that kCMS_Processing needs to execute metric skyline queries (MSQs) to check whether a combination is a result of the CMS query of q whenever the combination cannot be pruned by the ES and TP heuristics. For one kCMS query, multiple MSQs may be executed, which may visit the same nodes and introduce unnecessary I/O accesses. Therefore, in this section, we propose an enhanced kCMS query algorithm by leveraging two techniques: spatial pruning (SP) and the reuse heap (RH) [15].

Spatial pruning technique. We illustrate the basic idea of the spatial pruning heuristics using the 2-dimensional example shown in Fig. 7, where G = {p1, p2, p3} is a candidate combination and q is the query object. For each i ∈ [1, 3], we draw a circle C(pi, dist(pi, q)) centered at the point pi ∈ G with radius dist(pi, q). The region bounded by the circle C(pi, dist(pi, q)) is called the pruning region PR(pi, q) of pi (i ∈ [1, 3]), e.g., PR(p1, q). The intersection of the pruning regions of all data objects in G is called the pruning region PR(G, q) of G, i.e., PR(G, q) = PR(p1, q) ∩ PR(p2, q) ∩ PR(p3, q), shown as the dark gray region with dashed lines in the figure. Obviously, any object p′ ∈ P\G located in the region PR(G, q) dominates q with respect to G (i.e., p′ ≺_G q), since dist(p′, pi) ≤ dist(q, pi) for every object pi ∈ G (i ∈ [1, 3]). Therefore, G cannot be a result of the CMS of q, and G can be safely pruned. Based on the observation in the above example, we propose the spatial pruning heuristics as follows.

Theorem 6 (Spatial Pruning Heuristics). Given a query object q and a combination G from the dataset P (G ⊆ P, |G| = m), if the pruning region PR(G, q) contains at least one object p′ ∈ P\G, G can be safely pruned; otherwise, G is a result of the combinatorial metric skyline.
PR(G, q)
Y
PR(p2, q)
PR(p3, q)
8
p2 6
p6
p5
p1
2
p4 2
4
G = {p1, p2, p3}
Theorem 6 helps avoid computing expensive metric skyline queries for checking whether a combination is the query result or not. Instead we only need to carry out m window queries. Correspondingly, the new query algorithm only requires replacing the second line in ProcessG with the spatial pruning heuristics. More importantly, this approach greatly reduces the search space because it only needs to search the union of all pruning region of data objects in G. Thus, it is much more efficient than computing MSQ. Consider the example in Fig. 7, where the combination G consists of three data objects p1, p2, and p3. Since the PR(G, q) does not include any other data object p0 2 PnG, q is not dominated by p0 with respect to G = {p1, p2, p3}. Therefore, we can quickly conclude that G is a result of the CMS query. In fact, we even do not need to run all m window queries to determine whether a combination G is a result of the CMS query. Reconsidering the example in Fig. 7, we can conclude that {p1, p3, p2}, {p1, p3, p6}, and {p1, p3, p5} are in the query results by running only the window query of p1 and q. This is because PR(p1, q) only contains two data objects p1 and p3 in G, and there does not exist any other data object p0 R G which dominates q. In other words, G is a query result. The challenge here is to determine which window query needs to be executed first. Generally, a smaller pruning region may contain fewer data objects than a bigger pruning region. Therefore, we run the window queries in an ascending order of the size of their corresponding pruning regions. For G = {p1, p3, p2}, we will first run the window query for p3, followed by p1 and p2. In fact, after we obtain PR({p3, p1}, q), we can terminate the window query of p2, the rationale behind which is presented by the following Corollary 5. Corollary 5. Consider two sets G G0 and a query object q. If q is a metric skyline object with respect to G (jGj = m), then q is also a metric skyline object with respect to G0 (jG0 j > m).
Proof. Since G is a result of the CMS query, the pruning region PR(G, q) does not contain any data object p′ ∈ P\G. On the other hand, PR(G′, q) ⊆ PR(G, q) holds, since PR(G′, q) = PR(G, q) ∩ (∩p∈G′\G PR(p, q)). Therefore, PR(G′, q) does not contain any data object p′ ∈ P\G′. According to Theorem 6, the query object q is not dominated by any data object p′ ∈ P\G′, which completes the proof. □
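As an illustration of this ordering and of the early termination it enables, here is a hedged C++ sketch reusing the Point and dist helpers from the sketch above. The linear scans stand in for the index-based window queries, and all names are illustrative; the early stop relies on the monotonicity behind Corollary 5 (shrinking the intersection can never admit new outsiders).

#include <algorithm>
#include <vector>

// Run the m "window queries" of G in ascending order of window radius and
// stop early: once the running intersection holds no outsider, adding the
// remaining windows cannot change the verdict.
bool verifiedByOrderedWindows(std::vector<int> G,
                              const std::vector<Point>& P,
                              const Point& q) {
    std::sort(G.begin(), G.end(), [&](int a, int b) {
        return dist(P[a], q) < dist(P[b], q);     // smaller radius first
    });
    std::vector<int> candidates;                  // outsiders inside all processed windows
    for (int j = 0; j < (int)P.size(); ++j)
        if (std::find(G.begin(), G.end(), j) == G.end())
            candidates.push_back(j);
    for (int gi : G) {                            // one window query per member of G
        std::vector<int> survivors;
        for (int c : candidates)
            if (dist(P[c], P[gi]) <= dist(q, P[gi]))
                survivors.push_back(c);
        candidates.swap(survivors);
        if (candidates.empty())
            return true;                          // early stop: G is certified a result
    }
    return candidates.empty();                    // empty PR(G, q): G is a CMS result
}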
Fig. 7. Illustration of spatial pruning: the pruning regions PR(p1, q), PR(p2, q), and PR(p3, q) and their intersection PR(G, q) for G = {p1, p2, p3}.
Corollary 5 helps quickly identify results of the CMS query and saves computation cost. To avoid repeated computation, we search the index in a best-first (BF) [23] manner. Meanwhile, to reduce the memory cost, we employ a compressed storage scheme: for example, the two combinations {p1, p3, p2} and {p1, p3, p6} share the common prefix {p1, p3}. Based on this observation, we compress the combinations that contain the same number of objects and share a common prefix by storing their common prefix only once.
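A minimal sketch of this prefix-compressed storage is given below. The record layout and the object identifiers are illustrative assumptions, not the paper's actual data structures.

#include <vector>

// One record stores a shared prefix once, plus the differing last objects.
// E.g., {p1, p3, p2}, {p1, p3, p6}, {p1, p3, p5} become prefix {1, 3} with
// tails {2, 6, 5} (object ids are illustrative).
struct CompressedCombos {
    std::vector<int> prefix;   // common leading objects of equal-size combinations
    std::vector<int> tails;    // one entry per stored combination
};

// Expand a record back into explicit combinations when they must be verified.
std::vector<std::vector<int>> expand(const CompressedCombos& c) {
    std::vector<std::vector<int>> out;
    for (int t : c.tails) {
        std::vector<int> combo = c.prefix;   // copy the shared prefix
        combo.push_back(t);                  // append the distinguishing tail
        out.push_back(combo);
    }
    return out;
}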
Reuse heap technique. Recall that if a candidate combination G cannot be pruned by the ES or TP heuristics, kCMS_Processing needs to execute multiple window queries, as discussed under spatial pruning, to verify G. As an example, Fig. 6 shows the query windows of p3 and p4, that is, PR(p3, q) and PR(p4, q), respectively. From the example, we can see that some entries (e.g., Root, e1, e4) are visited multiple times when performing window queries for different objects, which results in a large amount of unnecessary I/O and CPU cost. Therefore, if we store the M-tree nodes visited by previous window queries, we may be able to reuse some of them for subsequent window queries. In fact, using the reuse technique, all the window queries can be answered by traversing the M-tree only once, which significantly reduces the overall I/O cost. Similarly, when kCMS_Processing invokes the MSQ algorithm to verify whether the current combination is a result of the CMS query of q, we can also use the reuse technique to ensure that the M-tree is traversed only once. To implement the reuse technique, we need to store the visited M-tree nodes. This can be done either by maintaining all the visited nodes in a reuse heap Hr or by storing only the leaf entries of the M-tree. Since maintaining all the visited nodes takes considerable space, we adopt the second option. To ensure that no entry is missed, a visited entry must not be discarded before the entire query processing completes unless it has been expanded. It is worth noting that our reuse heap technique differs from the caching techniques in most database management systems: caching typically keeps the most recent entries, whereas the reuse heap preserves specific entries. We refer to kCMS_Processing with both spatial pruning and the reuse technique as the spatial reuse kCMS algorithm. Based on the above analysis, we have the following Theorem 7.

Theorem 7. For the entries in the reuse heap Hr, the spatial reuse kCMS algorithm traverses the M-tree only once, whereas the kCMS algorithm traverses the M-tree multiple times.
Proof. Since the kCMS algorithm needs to execute the MSQs (or window queries) to check whether a combination G is a query result, it usually traverses the M-tree multiple times. This redundant I/O access is avoided by the reuse heap Hr, which stores the visited entries so that each of them is read from the M-tree at most once. □

Reconsider the example in Fig. 6 and assume that each node access causes one I/O operation. To generate the combination {p4, p3}, the kCMS algorithm accesses the Root node and the entries e1 and e4, so the I/O cost is 3. To compute the metric skyline of {p4, p3}, the algorithm requires 4 further I/O accesses because it needs to access the Root node, e1, e4, and additionally e3. With the reuse heap, we save the 3 repeated I/O accesses to the Root node, e1, and e4, since their index information is stored in Hr = {p4, p3, e3, e2}.
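The following hedged C++ sketch illustrates the single-traversal idea: all m window queries of a combination are evaluated in one pass over an M-tree-like index, pruning subtrees with the covering radius. The Entry layout and helper names are assumptions for illustration, reusing the Point and dist helpers from the earlier sketches; the real algorithm additionally keeps unexpanded leaf entries in Hr so that later combinations are verified without re-reading them.

#include <vector>

struct Entry {                        // simplified M-tree entry (assumed layout)
    Point center;                     // routing object, or the data object itself
    double radius;                    // covering radius; 0 for a data entry
    std::vector<Entry*> children;     // empty for data entries
};

bool isMemberOf(const Point& p, const std::vector<Point>& G) {
    for (const Point& g : G)
        if (p.x == g.x && p.y == g.y) return true;
    return false;
}

// Evaluate all m window queries of G in ONE index traversal: a subtree is
// descended only if it may intersect every window C(gi, dist(gi, q)).
bool outsiderInPR(Entry* root, const std::vector<Point>& G, const Point& q) {
    std::vector<double> r;                        // window radii dist(gi, q)
    for (const Point& g : G) r.push_back(dist(g, q));
    std::vector<Entry*> stack = {root};
    while (!stack.empty()) {
        Entry* e = stack.back(); stack.pop_back();
        bool mayReachPR = true;
        for (size_t i = 0; i < G.size(); ++i)
            if (dist(e->center, G[i]) - e->radius > r[i]) {
                mayReachPR = false; break;        // subtree misses window i entirely
            }
        if (!mayReachPR) continue;
        if (e->children.empty()) {                // data entry lying in every window
            if (!isMemberOf(e->center, G)) return true;   // outsider found: prune G
        } else {
            for (Entry* c : e->children) stack.push_back(c);
        }
    }
    return false;                                 // PR(G, q) holds no outsider
}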
5.6. Discussion

Although we have developed many pruning heuristics for the kCMS query, there may still be room for further improvement. For example, some existing techniques used by spatial skyline queries, such as the Voronoi diagram, the Delaunay graph, and the convex hull [28,29], may be useful for metric skyline queries as well, because the spatial skyline is very similar to the metric skyline. The main difference between them is that the spatial skyline is only applicable in Euclidean space, whereas the metric skyline can use any distance function in metric space, e.g., the edit distance. Therefore, for kCMS queries on spatial datasets, new pruning heuristics can be developed to boost the pruning power. In this setting, the definition of the kCMS query needs to be revised by adding the keyword ''spatial'': we replace the phrase ''q is in the metric skyline of G'' with ''q is in the spatial skyline of G'', and the combinatorial metric skyline correspondingly becomes the combinatorial spatial skyline. Based on the theories in [28], the following useful lemma and theorems can be obtained.

Lemma 3. For each gi ∈ G, if gi has the query point q as its closest point in the dataset P, the combination G is a result of the combinatorial spatial skyline of q.

Proof. If q is the closest point to gi in P, we have dist(q, gi) < dist(p′, gi) for all p′ ∈ P (p′ ≠ q). By definition, no point in P spatially dominates q. Therefore, G is a result of the combinatorial spatial skyline of q. □

Theorem 8. Any combination G ⊆ P whose convex hull contains the query point q is a result of the combinatorial spatial skyline of q.

Let VC(q) denote the Voronoi cell that contains the query point q; VC(q) is a convex polygon in Euclidean space. Then we can obtain the following Theorem 9.

Theorem 9. If the Voronoi cell VC(q) intersects the boundary of the convex hull of G, then G is a result of the combinatorial spatial skyline of q.

Lemma 3 shows that a combination is a combinatorial spatial skyline of q solely because of q's location, regardless of where the other data points of P are located. Theorem 8 enables our algorithm to efficiently retrieve a large number of result combinations by examining them only against the query object q. Theorem 9 identifies combinations in the combinatorial spatial skyline of q by examining only the data points in a limited local proximity around q. Lemma 3, Theorem 8, and Theorem 9 thus help us quickly select combinations that are results of the combinatorial spatial skyline of q; these produce the seed combinations used to further improve the pruning power. In fact, we also make use of the following Theorem 10 to reduce the time complexity of our algorithms by disregarding distance computations against non-convex points.

Theorem 10. Whether G is a result of the combinatorial spatial skyline of q does not depend on any non-convex point gi ∈ G.

We omit the proofs of Theorems 8–10 since they follow straightforwardly from Theorem 1, 8, and Theorem 2 in [28], respectively.
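For the Euclidean case, the Theorem 8 test reduces to a point-in-convex-hull check. Below is a hedged 2D C++ sketch based on Caratheodory's theorem (q lies in conv(G) iff q lies in a triangle spanned by points of G), reusing the Point type from the earlier sketches; it is an illustration, not the paper's implementation.

#include <vector>

// Signed area test: cross(q, a, b) > 0 iff q is to the left of edge a->b.
double cross(const Point& o, const Point& a, const Point& b) {
    return (a.x - o.x) * (b.y - o.y) - (a.y - o.y) * (b.x - o.x);
}

// q is inside (or on the boundary of) triangle abc iff the three signed
// areas do not have strictly mixed signs.
bool inTriangle(const Point& q, const Point& a, const Point& b, const Point& c) {
    double d1 = cross(q, a, b), d2 = cross(q, b, c), d3 = cross(q, c, a);
    bool hasNeg = (d1 < 0) || (d2 < 0) || (d3 < 0);
    bool hasPos = (d1 > 0) || (d2 > 0) || (d3 > 0);
    return !(hasNeg && hasPos);
}

// Theorem 8 test in 2D: q ∈ conv(G) iff q lies in some triangle of G's
// points (O(m^3) triples, cheap for the small m used in practice).
bool hullContains(const std::vector<Point>& G, const Point& q) {
    for (size_t i = 0; i < G.size(); ++i)
        for (size_t j = i + 1; j < G.size(); ++j)
            for (size_t k = j + 1; k < G.size(); ++k)
                if (inTriangle(q, G[i], G[j], G[k])) return true;
    return false;    // q outside conv(G): Theorem 8 gives no conclusion
}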
6. Experimental study

In this section, we evaluate the effectiveness and efficiency of our proposed algorithms through extensive experiments using both real and synthetic datasets.

6.1. Experimental settings

The real dataset in our experiments is Color (Col) [31], a 9-dimensional dataset containing 68 K image data items. Each data item is associated with nine attributes, such as brightness and saturation.
The L1-norm distance is used to measure the similarity between any two feature vectors extracted from the images. The synthetic dataset that we use is Signature (Sig) [3], which contains 50 K randomly generated strings; each string in Sig consists of 64 English letters, and the edit distance function is used. In addition, we also generate two more datasets, Correlated (Cor) and Independent (Ind) [3]. Table 3 lists the statistics of all the datasets considered in the experiments.

Table 3
The statistics of datasets.

Datasets       Size (K)    Dimensionality    Measure
Color          68          3                 L1-norm
Signature      50          64                Edit distance
Correlated     256         3                 L2-norm
Independent    256         3                 L2-norm

For each dataset, we index the objects using an M-tree [5] with a page size of 2048 bytes and randomly select 50 data objects as query objects. Since the related skyline query algorithms work only in vector space while our algorithms work in metric space, we do not compare against them in the following experiments. We mainly compare the performance of three algorithms: the basic kCMS query algorithm without the reuse technique (denoted as kCMS, or C in figures), the improved kCMS query processing with spatial pruning and the reuse technique (denoted as kCMS + SR, or S in figures), and the brute-force approach using a linear scan (denoted as LS, or L in figures). Each reported value in the following diagrams is the average query cost of 50 queries, whose locations follow the distribution of the corresponding dataset. All the algorithms were implemented in C++ and run on a Windows PC with a 2.0 GHz dual-core CPU and 4 GB RAM. The monotonic function for sorting the combinations in the ICS algorithm is Σei∈G mindist(ei, q).
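As a small illustration of this sorting key, the following hedged C++ sketch computes the aggregate lower-bound distance of a combination of M-tree entries, reusing the Entry, Point, and dist helpers from the earlier sketches. The mindist formula max(0, dist(q, center) − radius) is the standard M-tree lower bound and is an assumption about the implementation.

#include <algorithm>
#include <vector>

// Lower bound on the distance from q to any object stored under entry e.
double mindist(const Entry& e, const Point& q) {
    return std::max(0.0, dist(q, e.center) - e.radius);
}

// Priority key used to generate candidate combinations incrementally:
// combinations with smaller aggregate mindist are examined first.
double icsKey(const std::vector<Entry>& G, const Point& q) {
    double sum = 0.0;
    for (const Entry& e : G) sum += mindist(e, q);   // Σ mindist(ei, q) over ei in G
    return sum;
}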
6.2. Experimental results

In this subsection, we present four sets of experiments that compare the performance of the kCMS, kCMS + SR, and LS algorithms on four datasets: Col, Sig, Cor, and Ind. We extract the first three dimensions of the original datasets Col and Sig for all the experiments except the one evaluating the effect of dimensionality. The performance of our proposed algorithms is investigated under a variety of parameters, including the parameter k, the dimensionality of the datasets (dim), the cardinality of the datasets (N), and the number of objects per combination (m). In each experiment, we vary only one parameter while keeping the others fixed at their default values. We use the following four performance metrics: the number of node accesses (denoted as NA), the CPU time (excluding I/O time), the number of distance computations, and the maximum number of entries in the reuse heap (denoted as MH).

The effect of parameter k. The first set of experiments explores the effect of the parameter k on the query performance over the 3D datasets Col (68 K), Cor (256 K), and Ind (256 K) and the 64D dataset Sig (50 K), where m = 2. Fig. 8 shows the CPU time and I/O cost of all the algorithms, and Fig. 9 plots the number of distance computations conducted by each algorithm. The abbreviations of the algorithms (C for kCMS, S for kCMS + SR, and L for LS) are listed under the horizontal axis, and the MH of kCMS + SR is listed at the bottom of each dark gray bar.
Fig. 8. CPU time and node accesses vs. k (m = 2). (a) Col(68K, 3D); (b) Sig(50K, 64D); (c) Cor(256K, 3D); (d) Ind(256K, 3D).
Fig. 9. Distance computation vs. k (m = 2). (a) Col(68K, 3D); (b) Sig(50K, 64D); (c) Cor(256K, 3D); (d) Ind(256K, 3D).
Fig. 10. CPU time and node accesses vs. dim (k = 16, m = 2). (a) Col(68K); (b) Cor(256K).
As expected, the CPU time and NA of all three algorithms increase with k, because the number of candidate combinations grows with k. Since the baseline approach LS is thousands of times slower than our two proposed algorithms, due to the need to generate an exponential number of combinations, we do not report the results of LS in the subsequent experiments. Among the three algorithms, kCMS + SR performs best in all aspects. In particular, its query cost increases more slowly than that of the others, and it achieves the best performance in all cases. This is because kCMS + SR adopts both the reuse heap (RH) technique and the spatial pruning (SP) method: SP helps reduce the number of candidate combinations to be evaluated, while RH cuts down the number of node accesses by ensuring that the M-tree is traversed only once. In particular, kCMS + SR conducts far fewer distance computations on every dataset, because the SP method limits the search range.
Fig. 11. Distance computation vs. dim (k = 16, m = 2). (a) Col(68K); (b) Cor(256K).
In addition, we observe that the MH of kCMS + SR increases slightly with the growth of k. This is because, when k is larger, more index nodes need to be visited, which in turn increases the number of entries stored in the reuse heap. Overall, kCMS and kCMS + SR significantly outperform LS, and kCMS + SR achieves the best performance.

The effect of dimensionality (dim). Figs. 10 and 11 demonstrate the effect of the dimensionality (dim) on the query performance, varying dim from 2 to 5 with k = 16 and m = 2. Due to space limitations, we only report the results on the datasets Col (68 K) and Cor (256 K); the results on Sig (50 K) and Ind (256 K) are similar and thus omitted here. From Figs. 10 and 11, we observe that kCMS + SR outperforms kCMS in terms of node accesses by about 1–2 orders of magnitude. This is again attributed to the stronger pruning power provided by the spatial pruning heuristics (SP), and it also confirms the effectiveness of Theorem 6 and Corollary 5. In addition, from Fig. 10, we see that both the CPU time and NA increase with dim, although kCMS + SR's CPU time increases more slowly than kCMS's. Moreover, as shown in Fig. 11, the number of distance computations also increases with the growth of dim. The reason for this behavior is that a high-dimensional combination is less likely to be a result of a kCMS query, so both kCMS and kCMS + SR need to process more candidate combinations to obtain the first k query results.
Fig. 12. CPU time and node accesses vs. dataset size N (k = 16, m = 2). (a) Col(3D); (b) Sig(64D); (c) Cor(3D); (d) Ind(3D).
On the other hand, the figures show that the dimensionality of the dataset has little impact on the CPU time but more impact on the number of distance computations and the maximum number of entries in the reuse heap (MH). This is because the distance computation for each data point in a high-dimensional space is more expensive than in a low-dimensional space, and a combination in a high-dimensional space has a lower probability of being a result of the kCMS query. As dim increases, more index nodes need to be visited, resulting in a larger MH and higher I/O and CPU costs. It is worth noting that for extremely large dim, the reuse heap may become very large, and hence even the performance of kCMS + SR may suffer from the time spent managing the reuse heap.

The effect of dataset cardinality (N). In this set of experiments, we evaluate the scalability of our approaches by varying the cardinality N of the datasets, with k = 16 and m = 2. In particular, we vary N from 12 K to 60 K for the 3D Col, from 10 K to 50 K for the 64D Sig, and from 64 K to 1024 K for the 3D Cor and Ind. The experimental results are shown in Figs. 12 and 13. Observe that, in most cases, the query cost of both algorithms increases slightly as the number of objects in the dataset grows. The reason is that the size of the M-tree increases with N, which forces the algorithms to visit more entries when computing the combinatorial metric skyline. kCMS incurs more CPU time and node accesses than kCMS + SR because kCMS traverses the M-tree repeatedly. There are some exceptions in the range [40 K, 50 K] on Sig, probably due to the characteristics of the data. Fig. 13 also presents the MH of kCMS + SR with respect to N at the bottom of the corresponding bars. As expected, the size of MH grows as N increases.
Fig. 13. Distance computation vs. dataset size N (k = 16, m = 2). (a) Col(3D); (b) Sig(64D); (c) Cor(3D); (d) Ind(3D).
The reason is that a bigger dataset has more combinatorial metric skyline objects, which results in a bigger MH. This is not always the case, however, and kCMS + SR remains very efficient even on the bigger datasets. In summary, the experimental results show that kCMS + SR always performs best and achieves relatively consistent performance in all cases.

The effect of parameter m. Figs. 14 and 15 present the query performance as a function of the parameter m over the 3D datasets Col (12 K), Cor (64 K), and Ind (64 K) and the 64D dataset Sig (10 K), with k = 16. From the figures, we see that the CPU time increases slightly with m. The reason is that the larger m is, the more objects need to be considered to generate the candidate combinations; in fact, the number of combinations increases at an exponential rate. Thus, both algorithms must evaluate more candidate combinations to obtain the first k query results. However, the number of node accesses (NA) and the number of distance computations do not increase with m in most cases; sometimes a bigger m even leads to fewer node accesses, e.g., m = 3. This is because both kCMS and kCMS + SR adopt the ICS algorithm and generate the candidate combinations incrementally according to the aggregate distance from the query point, which accelerates the discovery of the initial query results. Moreover, our proposed ES, TP, and SP techniques provide strong pruning power, so many candidate combinations are pruned before they are evaluated. Another important reason is that a bigger m generally yields more qualified combinations according to Corollary 5. Therefore, our proposed algorithms sometimes achieve better performance for larger m, which indicates their high pruning capability and confirms the good scalability of our approaches with respect to m.
Fig. 14. CPU time and node accesses vs. m (k = 16). (a) Col(12K, 3D); (b) Sig(10K, 64D); (c) Cor(64K, 3D); (d) Ind(64K, 3D).
Fig. 15. Distance computation vs. m (k = 16). (a) Col(12K, 3D); (b) Sig(10K, 64D); (c) Cor(64K, 3D); (d) Ind(64K, 3D).
7. Conclusions and future work

In this paper, we have proposed a novel type of skyline query, namely the kCMS (top-k combinatorial metric skyline) query, which can be adopted in various applications such as business data analysis and decision making. To answer kCMS queries efficiently, we designed two algorithms, kCMS and kCMS + SR, which combine the advantages of a series of techniques, including early stopping, triangle-based pruning, spatial pruning, and the reuse heap. In the future, we plan to extend our algorithms to tackle the so-called bichromatic kCMS query, which involves two datasets. In addition, we are also interested in studying other variants of the combinatorial metric skyline query, such as the constrained combinatorial metric skyline, the combinatorial metric skyline based on clustering, and the combinatorial metric skyline with respect to both metric and non-metric attributes.

Acknowledgments

Bin Zhang was supported in part by ZJNSF Grant LY14F020038. Yunjun Gao was supported in part by NSFC Grants 61379033 and 61003049, the National Key Basic Research and Development Program (i.e., 973 Program) No. 2015CB352502, the Cyber Innovation Joint Research Center of Zhejiang University, and the Key Project of Zhejiang University Excellent Young Teacher Fund (Zijin Plan).

References

[1] I. Bartolini, P. Ciaccia, M. Patella, Efficient sort-based skyline evaluation, ACM Trans. Database Syst. 33 (4) (2008) 1–49.
[2] S. Börzsönyi, D. Kossmann, K. Stocker, The skyline operator, in: 17th Int'l Conf. on Data Engineering, 2–6 April, IEEE Computer Society, Heidelberg, Los Alamitos, 2001, pp. 421–430.
[3] L. Chen, X. Lian, Efficient processing of metric skyline queries, IEEE Trans. Knowl. Data Eng. 21 (3) (2009) 351–365.
[4] Y.-C. Chuang, I.-F. Su, C. Lee, Efficient computation of combinatorial skyline queries, Inform. Syst. 38 (3) (2013) 369–387.
[5] P. Ciaccia, M. Patella, P. Zezula, M-tree: an efficient access method for similarity search in metric spaces, in: 23rd Int'l Conf. on Very Large Data Bases, 25–29 August, Morgan Kaufmann, Athens, San Francisco, 1997, pp. 426–435.
[6] E. Dellis, B. Seeger, Efficient computation of reverse skyline queries, in: 33rd Int'l Conf. on Very Large Data Bases, 23–27 September, ACM, Vienna, New York, 2007, pp. 291–302.
[7] M. Drosou, E. Pitoura, Search result diversification, SIGMOD Rec. 39 (1) (2010) 41–47.
[8] D. Fuhry, R. Jin, D. Zhang, Efficient skyline computation in metric space, in: 12th Int'l Conf. on Extending Database Technology, 24–26 March, ACM, Saint Petersburg, New York, 2009, pp. 1042–1051.
[9] Y. Gao, Q. Liu, B. Zheng, G. Chen, On efficient reverse skyline query processing, Expert Syst. Appl. 41 (7) (2014) 3237–3249.
[10] S. Gollapudi, A. Sharma, An axiomatic approach for result diversification, in: Proceedings of the WWW Conference, Madrid, Spain, 2009, pp. 381–390.
[11] A. Guttman, R-trees: a dynamic index structure for spatial searching, in: 1984 ACM SIGMOD Int'l Conf. on Management of Data, 18–21 June, ACM, Boston, New York, 1984, pp. 47–57.
[12] X. Guo, C. Xiao, Y. Ishikawa, Combination skyline queries, in: Transactions on Large-Scale Data- and Knowledge-Centered Systems VI, LNCS 7600, 2012, pp. 1–30.
[13] Z. Huang, Y. Xiang, B. Zhang, X. Liu, A clustering based approach for skyline diversity, Expert Syst. Appl. 38 (7) (2011) 7984–7993.
[14] H. Im, S. Park, Group skyline computation, Inform. Sci. 188 (2012) 151–169.
[15] T. Jiang, Y. Gao, B. Zhang, D. Lin, Q. Li, Monochromatic and bichromatic mutual skyline queries, Expert Syst. Appl. 41 (4) (2014) 1885–1900.
[16] B. Jiang, J. Pei, X. Lin, D.W. Cheung, J. Han, Mining preferences from superior and inferior examples, in: 14th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, 24–27 August, ACM, Las Vegas, New York, 2008, pp. 390–398.
[17] D. Kossmann, F. Ramsak, S. Rost, Shooting stars in the sky: an online algorithm for skyline queries, in: 28th Int'l Conf. on Very Large Data Bases, 20–23 August, Hong Kong, 2002, pp. 275–286.
[18] X. Lian, L. Chen, Reverse skyline search in uncertain databases, ACM Trans. Database Syst. 35 (1) (2010). Article 3.
[19] K.C.K. Lee, W.-C. Lee, B. Zheng, H. Li, Y. Tian, Z-SKY: an efficient skyline query processing framework based on Z-order, VLDB J. 19 (3) (2010) 333–362.
[20] C. Li, B.C. Ooi, A.K.H. Tung, S. Wang, DADA: a data cube for dominant relationship analysis, in: ACM SIGMOD Int'l Conf. on Management of Data, 27–29 June, ACM, Chicago, New York, 2006, pp. 659–670.
[21] M. Magnani, I. Assent, From stars to galaxies: skyline queries on aggregate data, in: Proceedings of the ACM EDBT Conference, March 18–22, Genoa, Italy, 2013, pp. 477–488.
[22] D. Mindolin, J. Chomicki, Discovering relative importance of skyline attributes, Proc. VLDB Endowment 2 (1) (2009) 610–621.
[23] D. Papadias, Y. Tao, G. Fu, B. Seeger, Progressive skyline computation in database systems, ACM Trans. Database Syst. 30 (1) (2005) 41–82.
[24] J. Pei, B. Jiang, X. Lin, Y. Yuan, Probabilistic skylines on uncertain data, in: 33rd Int'l Conf. on Very Large Data Bases, 23–27 September, ACM, Vienna, New York, 2007, pp. 15–26.
[25] I.-F. Su, Y.-C. Chuang, C. Lee, Top-k combinatorial skyline queries, in: 15th International Conference on Database Systems for Advanced Applications, April, Springer, Tsukuba, Japan, 2010, pp. 79–93.
[26] W. Son, S.-W. Hwang, H.-K. Ahn, MSSQ: Manhattan spatial skyline queries, Inform. Syst. 40 (2014) 67–83, http://dx.doi.org/10.1016/j.is.2013.10.001.
[27] W. Son, M.-W. Lee, H.-K. Ahn, S.-W. Hwang, Spatial skyline queries: an efficient geometric algorithm, in: 11th International Symposium on Spatial and Temporal Databases, 8–10 July, Aalborg, Denmark, 2009, pp. 247–264.
[28] M. Sharifzadeh, C. Shahabi, The spatial skyline queries, in: 32nd Int'l Conf. on Very Large Data Bases, 12–15 September, ACM, Seoul, New York, 2006, pp. 751–762.
[29] M. Sharifzadeh, C. Shahabi, L. Kazemi, Processing spatial skyline queries in both vector spaces and spatial network databases, ACM Trans. Database Syst. 34 (3) (2009). Article 14.
[30] Y. Tao, L. Ding, X. Lin, J. Pei, Distance-based representative skyline, in: Proceedings of the IEEE ICDE Conference, March 29, IEEE Computer Society, Shanghai, China, 2009, pp. 892–903.
[31] Y. Tao, X. Xiao, J. Pei, SUBSKY: efficient computation of skylines in subspaces, in: 22nd Int'l Conf. on Data Engineering, 3–8 April, IEEE Computer Society, Atlanta, Los Alamitos, 2006, pp. 65–65.
[32] G. Valkanas, A.N. Papadopoulos, D. Gunopulos, SkyDiver: a framework for skyline diversification, in: Proceedings of the ACM EDBT Conference, March 18–22, Genoa, Italy, 2013, pp. 406–417.
[33] G. Wang, J. Xin, L. Chen, Y. Liu, Energy-efficient reverse skyline query processing over wireless sensor networks, IEEE Trans. Knowl. Data Eng. 24 (7) (2012) 1259–1275.
[34] Y. Yuan, X. Lin, Q. Liu, W. Wang, J.X. Yu, Q. Zhang, Efficient computation of the skyline cube, in: 31st Int'l Conf. on Very Large Data Bases, August 30–September 2, ACM, Trondheim, New York, 2005, pp. 241–252.
[35] W. Zhang, X. Lin, Y. Zhang, Threshold-based probabilistic top-k dominating query, VLDB J. 19 (2) (2010) 283–305.