Knowledge-Based Systems 74 (2015) 89–105
Incremental evaluation of top-k combinatorial metric skyline query

Tao Jiang a, Bin Zhang a,*, Dan Lin b, Yunjun Gao c, Qing Li d

a College of Mathematics, Physics and Information Engineering, Jiaxing University, 56 Yuexiu Road (South), Jiaxing 314001, China
b Department of Computer Science, Missouri University of Science and Technology, 500 West 15th Street, Rolla, MO 65409, USA
c College of Computer Science, Zhejiang University, 38 Zheda Road, Hangzhou 310027, China
d Department of Computer Science, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong, China

* Corresponding author.
Article info

Article history: Received 22 March 2014; Received in revised form 22 September 2014; Accepted 9 November 2014; Available online 15 November 2014

Keywords: Query processing; Combinatorial skyline; Metric skyline; Algorithm; Spatial database
Abstract

In this paper, we define a novel type of skyline query, namely the top-k combinatorial metric skyline (kCMS) query. The kCMS query aims to find k combinations of data points according to a monotonic preference function such that each combination has the query object in its metric skyline. The kCMS query enables a new set of location-based applications that traditional skyline queries cannot offer. To answer the kCMS query, we propose two efficient query algorithms, which leverage a suite of techniques including sorting and threshold mechanisms, a reusing technique, and pruning heuristics to incrementally and quickly generate combinations of possible query results. We have conducted extensive experimental studies, and the results demonstrate both the effectiveness and the efficiency of our proposed algorithms.

© 2014 Elsevier B.V. All rights reserved.
1. Introduction

A skyline query retrieves every data point whose attribute vector is not dominated by that of any other data point in the same dataset. This type of query has fostered a large number of applications that facilitate decision making [2], business planning [20], sensor network management [33], etc. Recently, an interesting skyline query variant has emerged, called the combinatorial skyline [25,4,21], which returns groups of data points and ensures that the combination of data points in each such group is not dominated by any other data points. However, the combinatorial skyline can only handle data points in Euclidean space, which limits its adoption in the many applications whose data points lie in metric space, such as those processing biological sequences and textual strings. It is worth noting that processing skyline queries in metric space is a very challenging task, and very few works [3,8] have been proposed so far.

To address the aforementioned challenges, in this paper, we define and solve a novel type of skyline query, namely the top-k combinatorial metric skyline (kCMS) query. The kCMS query retrieves k combinations of data objects in metric space that satisfy the following two conditions: (1) each combination G has a query
object q in its metric skyline, i.e., q is not dominated by any other objects in the dataset with respect to G; and (2) they are the top-k combinations with respect to a strictly monotonic function of q. The significance of this query lies in its ability to identify the impact of q on multiple groups of data objects.

For a better understanding of the kCMS query, let us step through the following example. Consider a logistics company which aims to find two locations to open new branches near a warehouse q. An important selection criterion is location diversity: the two selected locations should not be concentrated in a narrow region, and no other branch should be near either of them. In other words, the combination can serve a larger number of users, and no branch in the combination is close to the branches of potential competitors. Therefore, to maximize profit, the combination of the two locations should have q in its metric skyline. In addition to the above selection criteria, the logistics company has one more requirement: the two new branches should be near q, i.e., the sum of the distances between the two locations and the warehouse should be minimized. All of these selection criteria can be handled coherently by our proposed kCMS query as follows. Fig. 1 shows six candidate locations that the logistics company considers as new branch locations. Since the logistics company is looking for two locations (i.e., m = 2), all combinations of two locations are listed in Fig. 1(c), where the column 'y/n' indicates whether the combination has q in its metric skyline or
not, and the column 'adist(Gj)' is the sum of the distances of the two locations (in the combination Gj) to the warehouse q. As shown in Fig. 1(c), there are multiple combinations which have q in their metric skyline. Assuming that the logistics company is only interested in the top 4 combinations closest to the warehouse q, it issues a 4CMS query, which returns the following results: G1{p1, p2}, G3{p1, p4}, G4{p1, p5}, and G7{p2, p4}. More specifically, G2 is not selected because q is not in its metric skyline; G8 is not selected because it is farther from the warehouse q than every combination in the query result, as is the case for the other non-selected combinations. Here, we refer to these groups as 2-skyline groups of the dataset, where the number 2 indicates the number of objects in each group.

Besides the above example, the kCMS query can facilitate decision making in a variety of applications, such as trip planning and disaster management. For example, suppose Bob has only one day for sightseeing in a city. There are ten attractions around his hotel, but he has time to visit at most three of them. In this case, Bob can perform a kCMS query on the ten attractions with his hotel as the query object q and the sum of distances as the monotonic function. The kCMS query will return the top k groups of attractions, each group containing three attractions. After reviewing the query results, Bob can pick one group of attractions for his trip. Another example is to leverage kCMS to dispatch rescue teams efficiently in emergency scenarios. Specifically, when a disaster occurs, more than one rescue team may be required at the scene, since each rescue team typically has its own specialty. It is thus a challenging task to determine, in a timely manner, the best combinations of rescue teams which have complementary specialties and are closest to the scene. Our proposed kCMS query can aid this challenging selection process by returning the k best combinations of rescue teams.

To answer the kCMS query, a naïve approach is to enumerate all combinations of objects, and then check whether each combination satisfies the two conditions of the kCMS query: (i) whether the combination has q in its metric skyline; (ii) whether the combination is one of the top k results according to the given monotonic function of the query object. Such an exhaustive approach may end up comparing C(n, m) = n!/((n − m)!m!) m-skyline groups, which is obviously time consuming. Therefore, in this paper, we propose an efficient kCMS query algorithm which seamlessly integrates the following techniques: (i) the incremental combination sorting (ICS) algorithm, which progressively generates the combinations according to the monotonic function used by the query; (ii) the early stopping (ES) technique, which helps significantly reduce the search space; (iii) the triangle-based pruning (TP) and spatial pruning (SP) heuristics, which prune a large number of ineligible combinations at an early stage; and (iv) the reuse heap (RH) technique, which further avoids redundant I/O accesses. In summary, our contributions are the following:
- We define a new skyline query variant, i.e., the top-k combinatorial metric skyline (kCMS) query.
- We propose a novel query algorithm which can efficiently answer kCMS queries.
- We formally prove the correctness of some of the pruning heuristics based on the theory of spatial skylines [28,29].
- We conduct extensive experiments using both real and synthetic datasets, and the results demonstrate the effectiveness and efficiency of our proposed algorithms under various experimental settings.

The rest of this paper is organized as follows. Section 2 reviews the related work. Section 3 gives the definition of the kCMS query. Sections 4 and 5 present the proposed query algorithms. Section 6 reports the experimental results and our findings. Finally, Section 7 concludes this paper and outlines some directions for future work.

2. Related work

In this section, we review the existing work related to the kCMS query, namely skyline queries and their variants, combinatorial skyline queries, and diversified queries.

2.1. Skyline queries and their variants

Since Börzsönyi et al. [2] proposed the skyline operator in the database community, a large number of algorithms have been proposed in the literature. For example, Kossmann et al. [17] iteratively partitioned the data into overlapping partitions around Nearest Neighbor (NN) objects and developed an NN approach to obtain the skyline objects using an R-tree [11]. Papadias et al. [23] improved the NN method by introducing the Branch and Bound Skyline (BBS) algorithm, which needs to traverse the R-tree only once to retrieve the skyline objects. To further improve the performance, Bartolini et al. [1] proposed the Sort and Limit Skyline algorithm (SaLSa), which does not even need to scan the whole dataset. More recently, Lee et al. [19] developed a skyline query processing framework based on the Z-order, called Z-SKY. In addition, many variants of the skyline query have also been extensively explored, such as the reverse skyline [6,18,9], subspace skyline [31,34], probabilistic skyline [24], preference skyline [16,22], threshold skyline [35], and mutual skyline [15].

There are also studies in the area of spatial skyline queries (SSQ) [28] and metric skyline queries (MSQ) [3]. Given a set of data points P and a set of query points Q, each data point can be associated with a number of derived spatial attributes, whereby each derived spatial attribute is the distance from the data point to a query point qi in Q. An SSQ retrieves those points in P which are not dominated by any other points in P considering their derived spatial attributes.
[Fig. 1(a) shows the dataset on the x–y plane: six points p1–p6 and the query object q = (8, 7); the M-tree entry e1 contains p1, p2, p3, and q, the entry e2 contains p4, p5, and p6, and the shaded region is the convex hull of Q = {p1, p3, p5}.]

(b) the coordinates

ID | (x, y) | dist(pi, q)
p1 | (7, 8) | 1.41
p2 | (6, 7) | 2.00
p3 | (6, 6) | 2.24
p4 | (10, 9) | 2.83
p5 | (11, 6) | 3.16
p6 | (12, 8) | 4.12

(c) all combinations

ID | adist(Gj) | y/n
G1 (p1 & p2) | 3.41 | y
G2 (p1 & p3) | 3.65 | n
G3 (p1 & p4) | 4.24 | y
G4 (p1 & p5) | 4.57 | y
G5 (p1 & p6) | 5.53 | y
G6 (p2 & p3) | 4.24 | n
G7 (p2 & p4) | 4.83 | y
G8 (p2 & p5) | 5.16 | y
G9 (p2 & p6) | 6.12 | y
G10 (p3 & p4) | 5.07 | y
G11 (p3 & p5) | 5.40 | y
G12 (p3 & p6) | 6.36 | y
G13 (p4 & p5) | 5.99 | n
G14 (p4 & p6) | 6.95 | n
G15 (p5 & p6) | 7.28 | n

Fig. 1. Illustration of a combinatorial metric skyline (m = 2).
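As a quick sanity check on Fig. 1, the following Python snippet recomputes adist for two of the combinations; it is a sketch using the coordinates from Fig. 1(b) and the Euclidean distance, and the variable names are ours.

import math

q = (8, 7)
coords = {'p1': (7, 8), 'p2': (6, 7), 'p3': (6, 6),
          'p4': (10, 9), 'p5': (11, 6), 'p6': (12, 8)}

def dist(a, b):
    # Euclidean distance between two 2-D points.
    return math.hypot(a[0] - b[0], a[1] - b[1])

def adist(G):
    # Sum of the distances from every member of the combination to q.
    return sum(dist(coords[p], q) for p in G)

print(round(adist(['p1', 'p2']), 2))  # 3.41, matching G1 in Fig. 1(c)
print(round(adist(['p2', 'p4']), 2))  # 4.83, matching G7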
Sharifzadeh et al. [28] utilized Voronoi diagrams, Delaunay graphs, and convex hulls to propose three efficient SSQ algorithms. Specifically, they proposed the B2S2 and VS2 algorithms for static datasets, and VCS2 for streaming query points; these exploit the geometric properties of the SSQ problem space to avoid an exhaustive examination of all point pairs in P and Q. Later, Son et al. [27] and Sharifzadeh et al. [29] further improved the VS2 algorithm. However, the L2 distance used in the above SSQ algorithms cannot reflect the road network distance in metro areas. Therefore, Son et al. [26] proposed to use the L1 (i.e., Manhattan) distance instead of L2. Based on SSQ, Chen et al. [3] defined a new skyline query with dynamic attributes, called the metric skyline query (MSQ), where the attributes of each data object are given by a set of dimension functions. MSQ is not limited to spatial data because it returns skyline points with dynamic attributes in the metric space. The main difference between SSQ and MSQ is that the distance functions used in MSQ include not only the Euclidean distance function of [28] but also other metric functions (e.g., the edit distance function). In other words, MSQ is a more generic skyline query. In terms of query algorithms, unlike SSQ, MSQ makes use of triangle-based pruning heuristics to reduce the search space.

Since our proposed kCMS query is closely related to SSQ [28] and MSQ [3], in the following we provide more details for these two types of queries. Let us reconsider the example in Fig. 1(a), where the dark gray region corresponds to the convex hull of the reference points Q = {p1, p3, p5}. Now assume that we need to compute the spatial skyline of Q. Clearly, the query object q is a spatial skyline point of Q according to Theorem 1 in [28], since the convex hull of Q contains q. In other words, {p1, p3, p5} is a result of the combinatorial metric skyline of q. On the other hand, p2 and p6 are also spatial skyline points of Q since they are the closest points to p3 and p5 in Q, respectively (see Lemma 1 in [28]). In order to retrieve the metric skyline of Q, MSQ searches the metric index, an M-tree [5], in a best-first manner [23]. Fig. 1(a) depicts a small M-tree, where the entry e1 contains p1, p2, p3, and q, and the entry e2 contains p4, p5, and p6. For the sake of clear presentation, we use the Euclidean distance as the similarity measure. The key is defined as the sum of the minimum distances between the current entry (e.g., e1) and each data point (e.g., p1) in Q. First, MSQ accesses the entry e1 and inserts its children q and p2 into the auxiliary heap H in the form (entry, key), since e1 has the minimum aggregate distance to Q. Note that p1 and p3 are not inserted into H since they belong to the data objects of Q. Then, q is popped from H and becomes a result of the MSQ. Next, p2 is inserted into the result set since it is not dominated by any point in the result set. In contrast, p4 is pruned because it is dominated by q. At last, the algorithm obtains p6 as another result of the MSQ. Unfortunately, the aforementioned algorithms only focus on individual data objects, not on their combinations. Therefore, they cannot be applied to solve our proposed kCMS query.

2.2. Combinatorial skyline queries

Our proposed kCMS query is also closely related to combinatorial skyline queries [25,12,14,4], which return the best groups of data objects according to the features of their elements [21].
Su et al. [25] were the first to introduce the top-k combinatorial skyline query (k-CSQ). The k-CSQ query aims to find the k combinations of skyline objects whose aggregate values for the most preferred attribute are the highest. The preference order is crucial in reducing the exponential search space. In fact, traditional skyline queries can be considered a special case of CSQ in which each combination contains only one skyline object. Guo et al. [12] also studied the CSQ and designed a pattern-based pruning algorithm to dramatically reduce the search space. The CSQ and k-CSQ may look similar to
our kCMS query. However, the k-CSQ retrieves skyline objects in Euclidean space, whereas our proposed kCMS query computes skylines in metric space. Therefore, many pruning heuristics of the k-CSQ cannot be directly used for the kCMS query. Recently, Chung et al. [4] defined an extended version of the k-CSQ and proposed two efficient query algorithms, the decomposition algorithm (DA) and the improved decomposition algorithm (IDA), which report all combinatorial skyline results. DA recursively decomposes the whole problem into a series of subproblems and then executes the skyline operator for each subproblem. The DA algorithm can prune the combinations that cannot be combinatorial skyline results without enumerating all combinations. In fact, some objects do not contribute new combinations to the combinatorial skyline, which may result in identical solutions across multiple subproblems. To avoid processing duplicate subproblems, Chung et al. further proposed the IDA algorithm, which sorts objects in descending order of dom(ti), the number of objects that dominate ti. Im et al. [14] studied the group skyline query, which is similar to CSQ in spirit, and developed two group skyline algorithms, GIncremental and GDynamic. GIncremental first removes all k-dominated objects from the dataset and then incrementally generates all candidate groups by exploiting various properties of group skyline computation. GDynamic overcomes the weakness of GIncremental by generating at once the set of all candidate groups that include a specific object. Moreover, GDynamic maintains a sorted list for each dimension as an index structure. Magnani et al. [21] introduced aggregate skylines, where the skyline works as a filtering predicate on sets of records; aggregate skyline queries merge the functionalities of two basic database operators, skyline and group-by. Compared with these existing works, our algorithms share some similarity with [4,14] in terms of the use of incremental computation. However, none of the existing works is able to answer the proposed kCMS query in metric space.

2.3. Diversified queries

During the last decade, diversified queries [30,13,32,10,7] have attracted considerable attention from the database community due to their applicability in many domains, such as ambiguous keyword search and personalized results. Gollapudi et al. [10] developed a set of natural axioms that a diversification system is expected to satisfy, and showed that no diversification function can satisfy all the axioms simultaneously. Drosou et al. [7] surveyed, classified, and comparatively studied various definitions, algorithms, and metrics for result diversification. In fact, diversity is also very important for the skyline query. Tao et al. [30] first introduced the concept of diversity into the skyline query and proposed the representative skyline, which best describes the tradeoffs among the different dimensions offered by the full skyline. Huang et al. [13] integrated k-means clustering into skyline computation to capture skyline diversity and improve the usefulness of skyline results. Valkanas et al. [32] presented a novel definition of diversity which, in contrast to previous proposals, is intuitive because it is based solely on the domination relationships among points. Our algorithms also consider the concept of diversity, which is integrated into the combinatorial skyline query. To the best of our knowledge, this is the first attempt at diversified queries over combinatorial data.

3. Problem statement
In this section, we formally define the top-k combinatorial metric skyline (kCMS) query. Table 1 summarizes the notations used throughout this paper. We use point and object interchangeably to refer to a database object.
Table 1. Symbols and their descriptions.

Notation | Description
P, n | The dataset, and the number of objects in the dataset
m | The number of objects in a combination
dist, adist | The metric function, and the sum-distance function
e | An entry in the index
p, q | A data object, and the query object
Ω | An ascending list
Ω_x^y, |Ω_x^y| | The set of combinations of selecting y objects from x objects, and its cardinality
G, |G| | A combination and its cardinality
Grlt | The resultant set of combinations of the CMS query
Grfn | The refined set of candidate combinations of the CMS query
k_score | The k-th score of the combinations in Grlt computed by adist
rfn_minscore | The minimum score of the combinations in Grfn computed by adist
H, Hr | An auxiliary heap and a reuse heap, respectively
[Fig. 2 depicts the IDM-tree as a matrix of nodes: row 0 at the bottom holds the empty sets Ω_0^0, Ω_1^0, . . ., Ω_{m−1}^0; each row y (1 ≤ y ≤ m) holds the nodes Ω_y^y, Ω_{y+1}^y, . . .; the top row m spans the nodes Ω_m^m through Ω_x^m across columns 1 through x − m + 1; and each node Ω_x^y is linked to its two children Ω_{x−1}^y and Ω_{x−1}^{y−1}.]

Fig. 2. The logic structure of the IDM-tree.
For easy illustration, we use the sum function as the monotonic function in this paper. Given a dataset P with n data objects pi (i ∈ [1, n]) in a metric space, for any two different objects p, p′ ∈ P\G, p dominates p′ with respect to the combination G ⊆ P containing m data objects (denoted as p ≺_G p′) if the following conditions hold: (i) ∀pi ∈ G, dist(p, pi) ≤ dist(p′, pi); and (ii) ∃pj ∈ G, dist(p, pj) < dist(p′, pj). The pairwise distance dist(pi, pj) between data objects pi and pj (i, j ∈ [1, n]) is a metric function satisfying the following properties for ∀u, v, w ∈ P: (i) dist(u, v) > 0 for u ≠ v, (ii) dist(u, v) = 0 ⇔ u = v, (iii) dist(u, v) = dist(v, u), and (iv) dist(u, w) ≤ dist(u, v) + dist(v, w). In what follows, we first introduce the basic definition of the metric skyline query, and then define our proposed kCMS query.

Definition 1 (Metric Skyline Query, MSQ [3]). Given a metric space database P and a reference set Q = {r1, r2, . . ., rm}, a metric skyline query returns all the objects such that each object p among them is not dominated by any other object p′ ∈ P\{p} w.r.t. Q, namely, ¬∃p′ ∈ P\{p}: p′ ≺_Q p.
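To make the dominance test concrete, here is a minimal Python sketch; it assumes 2-D points given as tuples and uses the Euclidean distance as the metric, and the function names are ours, not the paper's.

import math

def dist(u, v):
    # Euclidean distance as a concrete metric; any metric works here.
    return math.hypot(u[0] - v[0], u[1] - v[1])

def dominates(p, p_prime, G):
    # p dominates p' w.r.t. G: p is no farther from every pi in G and
    # strictly closer to at least one pi in G.
    return (all(dist(p, pi) <= dist(p_prime, pi) for pi in G) and
            any(dist(p, pi) < dist(p_prime, pi) for pi in G))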
Definition 2 (top-k Combinatorial Metric Skyline, kCMS). Given a metric space database P and a query object q, a top-k combinatorial metric skyline query retrieves k combinations such that, for each combination G ⊆ P (|G| = m, m ≥ 2), (i) q is among the metric skyline of G, and (ii) the k combinations have the minimum sum score Σ_{j=1}^{k} adist(q, Gj), where adist(q, G) = Σ_{i=1}^{m} dist(pi, q), pi ∈ G, dist(·) is a metric function, and m denotes the number of objects in G.

To the best of our knowledge, none of the existing works has studied the kCMS problem. As discussed in the introduction, the naïve approach that uses a linear scan (denoted as the LS approach) is very time consuming due to its extensive computation and huge I/O cost. In the following sections, we present our proposed efficient kCMS query algorithms.
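For reference, the following is a minimal Python sketch of the LS baseline under Definition 2, assuming 2-D points as tuples, Euclidean distance, and q not stored in P; all names are illustrative rather than the paper's.

import math
from itertools import combinations

def dist(u, v):
    return math.hypot(u[0] - v[0], u[1] - v[1])

def q_in_metric_skyline(q, G, P):
    # q belongs to the metric skyline of G iff no p' in P \ G dominates q
    # w.r.t. G (Definition 1).
    return not any(
        all(dist(pp, pi) <= dist(q, pi) for pi in G) and
        any(dist(pp, pi) < dist(q, pi) for pi in G)
        for pp in P if pp not in G)

def kcms_linear_scan(q, P, m, k):
    # LS baseline: enumerate all C(n, m) combinations, keep those with q in
    # their metric skyline, and return the k combinations of smallest adist.
    cands = [G for G in combinations(P, m) if q_in_metric_skyline(q, G, P)]
    cands.sort(key=lambda G: sum(dist(p, q) for p in G))
    return cands[:k]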
4. Incremental combination sorting algorithm for kCMS query

In this section, we first introduce a novel data structure, the Iterative Decomposing and Merging tree (IDM-tree), which is the key data structure used by our query algorithm. Then, we discuss how to create the IDM-tree and utilize it to incrementally generate combinations of data objects.

4.1. The structure of the IDM-tree

Assume that there are n data objects p1, p2, . . ., pn in the dataset P, and each object pi has a weight w(pi). These n objects are organized in an ascending order of their weights and form a list Ω = {p1, p2, . . ., pn}, where w(p1) < w(p2) < . . . < w(pn). Then, the weight of a combination of objects (denoted as G) can be computed as w(G) = Σ_{i=1}^{m} w(pi ∈ G), where m (2 ≤ m < n) is the number of objects in G. Let Ω_x^y be the set of combinations formed by selecting y data objects from the first x (x ≥ y) data objects in Ω. For example, Ω_3^2 represents the set {{p1, p2}, {p1, p3}, {p2, p3}}.

We now proceed to introduce the IDM-tree, which is a matrix structure containing m + 1 rows (or levels) and multiple columns. The bottom row is the 0-th row, and each node in this row corresponds to an empty set Ω_j^0 (0 ≤ j ≤ m − 1). The node Ω_x^y in the y-th row (1 ≤ y ≤ m, y ≤ x ≤ n) contains the combinations of selecting y data objects from the first x data objects in Ω, and is linked to the two adjacent sets Ω_{x−1}^y and Ω_{x−1}^{y−1}, as shown in Fig. 2. We call the current set Ω_x^y the father set, and the adjacent sets Ω_{x−1}^y and Ω_{x−1}^{y−1} the children sets. The ordinal of the columns starts from the left side. Each node in the first column corresponds to an initial set Ω_j^j (1 ≤ j ≤ m), which includes only the combination of the first j data objects in Ω, e.g., Ω_2^2 = {{p1, p2}}. Observe that there exists a recursive relationship among the three sets Ω_x^y, Ω_{x−1}^y, and Ω_{x−1}^{y−1}, as shown in equality (1):

Ω_x^y = Ω_{x−1}^y ∪ (Ω_{x−1}^{y−1} ⊕ px),  (1)

where the symbol '⊕' concatenates a set with a data object. For example, if x = 3 and y = 2, then Ω_x^y = Ω_3^2 = {{p1, p2}, {p1, p3}, {p2, p3}}, Ω_{x−1}^y = Ω_2^2 = {{p1, p2}}, and Ω_{x−1}^{y−1} ⊕ px = Ω_2^1 ⊕ p3 = {{p1}, {p2}} ⊕ p3 = {{p1, p3}, {p2, p3}}. We refer to the incremental set of Ω_x^y as ΔΩ_x^y, which is the set Ω_{x−1}^{y−1} ⊕ px. Thus, equality (1) can be rewritten as equality (2):

Ω_x^y = Ω_{x−1}^y ∪ ΔΩ_x^y.  (2)
According to the equalities (1) and (2), we obtain two important properties of the IDM-tree.

Property 1 (Incremental Property). Each combination of a set Ω_x^y can be represented in an incremental form that: (i) consists of its children sets in the same row, and (ii) can be decomposed until its children sets consist of the initial sets in the bottom row.

The incremental property holds since the set Ω_x^y is divided into the two sets Ω_{x−1}^y and Ω_{x−1}^{y−1} ⊕ px, which can be further decomposed into the incremental form according to the equalities (1) and (2). Fig. 3 shows the incremental forms of a set Ω_4^3: Ω_4^3 = ΔΩ_3^3 ∪ ΔΩ_4^3 and Ω_4^3 = ΔΩ_3^3 ∪ (ΔΩ_2^2 ⊕ p4) ∪ (ΔΩ_1^1 ⊕ p3 ⊕ p4) ∪ (ΔΩ_2^1 ⊕ p3 ⊕ p4), where ΔΩ_3^3 = Ω_3^3 = {{p1, p2, p3}}, ΔΩ_2^2 = Ω_2^2 = {{p1, p2}}, ΔΩ_1^1 = Ω_1^1 = {{p1}}, and ΔΩ_2^1 = {{p2}}. By taking advantage of this property, we propose an incremental storage schema that significantly saves storage space. For instance, a node in the IDM-tree, e.g., Ω_4^3, only needs to store its incremental combinations, as shown in Fig. 3.

[Fig. 3 decomposes Ω_4^3 = {(p1, p2, p3), (p1, p2, p4), (p1, p3, p4), (p2, p3, p4)} as Ω_4^3 = Ω_3^3 ∪ ΔΩ_4^3, with ΔΩ_4^3 = Ω_3^2 ⊕ p4 = (Ω_2^2 ∪ ΔΩ_3^2) ⊕ p4 = (ΔΩ_2^2 ∪ ΔΩ_3^2) ⊕ p4.]

Fig. 3. Illustration of the incremental form for the IDM-tree.

Property 2 (Inclusive Property). The prefix of ΔΩ_x^y is the child of Ω_x^y, namely, Ω_{x−1}^{y−1}. The combinations of Ω_{x−1}^y are contained in their father set Ω_x^y.

The inclusive property indicates that the combinations of a set Ω_x^y can be easily generated from its children sets and the current data object px.
4.2. Creating the IDM-tree

To construct an IDM-tree, we use m B+-trees, denoted as the BC1-tree, BC2-tree, . . ., BCm-tree, where the i in BCi-tree indicates the number of data objects in the corresponding combination Gi. In each BCi-tree, w(Gi) is the key, and the combinations with respect to the nodes Ω_x^1, Ω_x^2, . . ., Ω_x^m are treated as the data to be indexed. The construction of the IDM-tree consists of two main phases, recursive inserting and global amending, as described below.

Recursive inserting. Produce new combinations in a zigzag order by scanning the IDM-tree from bottom to top and left to right. New combinations are inserted into the corresponding B+-trees until the integer x reaches its maximum while the following inequality (3) is satisfied:

|Ω_x^m| ≤ k ≤ |Ω_{x+1}^m|.  (3)

Global amending. Produce groups of combinations, called extended combinations (i.e., p_{x+1} ⊕ Ω_{x−1}^{y−1}), in order to identify the top-k combinations. A group of extended combinations is formed from a node Ω_{x−1}^{y−1} (y ≤ x) and the data objects behind px (i.e., p_{x+1}, p_{x+2}, . . .). The procedure continues until the integer r (r ≥ 1) in the following inequality (4) reaches its maximum value:

min_{j=1..|Ω_{x−1}^{y−1}|} w(p_{x+r} ⊕ Gj ∈ Ω_{x−1}^{y−1}) ≤ max_{j=1..|ΔΩ_{x−1}^y|} w(G′j ∈ ΔΩ_{x−1}^y).  (4)

In addition, the algorithm only inserts the combinations of a group p_{x+i} ⊕ Ω_{x−1}^{y−1} (1 ≤ i ≤ r, y ≤ x) whose weights are less than the maximum weight of the combinations in ΔΩ_x^m. In other words, for the current combination G ∈ p_{x+i} ⊕ Ω_{x−1}^{y−1}, the following inequality (5) is satisfied:

w(G ∈ p_{x+i} ⊕ Ω_{x−1}^{y−1}) ≤ max_{j=1..|ΔΩ_x^m|} w(G′j ∈ ΔΩ_x^m), 1 ≤ i ≤ r.  (5)

In order to distinguish the updated set obtained after the global amending from the old set, we denote the ΔΩ_x^y obtained after the global amending as Δ′Ω_x^y.

4.3. Progressively outputting the combinations using the IDM-tree

Next, we discuss the detailed incremental combination sorting (ICS) algorithm, which incrementally generates the combinations using the IDM-tree. The main idea is based on dynamic programming. The ICS algorithm recursively invokes two operations, decomposition and merging. The decomposition operation decomposes a set Ω_x^y (y ≤ x) into the two smaller sets Ω_{x−1}^y and Ω_{x−1}^{y−1} ⊕ px using Eq. (1). The merging operation then sorts and merges them into an ordered new set.

Lemma 1. In the IDM-tree, the combinations of any set Ω_x^y are in order.

Proof. Since any set Ω_x^y can be decomposed into a series of combinations which are sorted by the merging operation, Ω_x^y is in order. □

Lemma 2. Given a set Ω_x^y, the following relationship between ΔΩ_x^y and its children set Ω_{x−1}^y always holds: max_{i=1..|ΔΩ_x^y|} w(Gi ∈ ΔΩ_x^y) > max_{i=1..|Ω_{x−1}^y|} w(Gi ∈ Ω_{x−1}^y).

Proof. Since w(px) > w(p_{x−1}), Lemma 2 is correct by definition. □

Lemma 1 indicates that not only the combinations in Ω_x^y are in order, but also the combinations in ΔΩ_x^y, since ΔΩ_x^y = Ω_{x−1}^{y−1} ⊕ px. In other words, they are sorted in the local data space. In addition, Lemma 2 shows that the combinations in ΔΩ_x^y keep an ascending order in the whole space. However, our goal is to have the combinations in ΔΩ_x^y sorted in the global data space so that we can generate them incrementally. To achieve this, we leverage the following Theorem 1.

Theorem 1. If min_{j=1..|ΔΩ_{x+1}^y|} w(Gj ∈ ΔΩ_{x+1}^y) > max_{j=1..|Ω_x^y|} w(Gj ∈ Ω_x^y), the combinations in ΔΩ_x^y are in a global order; otherwise, the combinations in Δ′Ω_x^y are in a global order.

Proof. By the incremental property, we have min_{j=1..|ΔΩ_{x+k}^y|} w(Gj ∈ ΔΩ_{x+k}^y) = w(p_{x+k}) + min_{j=1..|Ω_{x+k−1}^{y−1}|} w(Gj ∈ Ω_{x+k−1}^{y−1}) for k ≥ 1. Clearly, min_{j=1..|Ω_{x+k−1}^{y−1}|} w(Gj ∈ Ω_{x+k−1}^{y−1}) is invariant for any integer k. In other words, if the prerequisite is satisfied, there does not exist a combination in any set ΔΩ_{x+k}^y (k ≥ 1) with a weight less than the maximum weight of the combinations in ΔΩ_x^y. In addition, Δ′Ω_x^y is in a global order since it has been amended globally. □

Based on Theorem 1, we develop the ICS algorithm as shown in Algorithm 1. Specifically, the ICS algorithm recursively produces combinations in a zig-zag manner (lines 1–2). At the beginning, the initial combinations, i.e., ΔΩ_y^y (2 ≤ y ≤ m), are inserted into the corresponding B+-trees.
[Fig. 4 traces the running example: Ω = {p1, p2, p3, p4} with w(p1) = 1, w(p2) = 2, w(p3) = 3, and w(p4) = 3.5; after the global amending, Δ′Ω_3^2 = {{p3, p1}, {p4, p1}} and Δ′Ω_4^2 = {{p3, p2}, {p4, p2}, {p4, p3}}, and the combinations {p3, p1}, {p4, p1}, {p3, p2}, {p4, p2}, {p4, p3} carry the weights 4, 4.5, 5, 5.5, 6.5.]

Fig. 4. The illustration of progressively outputting the combinations.
Algorithm 1. Incremental Combination Sorting (ICS) algorithm

ICS(Ω, m, k)
Input: an ordered list Ω, the parameters k and m
Output: incremental combinations
1. for (col = 2; |Ω_{x+1}^y| ≤ k ≤ |Ω_{x+2}^y|; col = col + 1) do // apply inequality (3)
2.   for (row = 2; row ≤ m; row = row + 1) do
3.     x = col + row − 1, y = row
4.     insert the combinations in ΔΩ_x^y into the BCy-tree
5.     μ = max_{j=1..|Δ′Ω_{x−1}^y|} w(Gj ∈ Δ′Ω_{x−1}^y); ν = min_{j=1..|ΔΩ_x^y|} w(Gj ∈ ΔΩ_x^y)
6.     if ν ≤ μ then // globally amend by applying Theorem 1
7.       for (k = 1; min_{j=1..|ΔΩ_{x+k}^y|} w(Gj ∈ ΔΩ_{x+k}^y) < μ; k = k + 1) do // apply inequality (4)
8.         for (i = 1; w(Gi ∈ ΔΩ_{x+k}^y) < μ; i = i + 1) do // apply inequality (5)
9.           if Gi is not in the BCy-tree then // produce Gi in an ascending order
10.            insert Gi into the BCy-tree to obtain Δ′Ω_{x−1}^y
11.    if (y is equal to m) then output the combinations in Δ′Ω_{x−1}^y
Then, the algorithm generates the current incremental combinations ΔΩ_x^y and inserts them into the BCy-tree (line 4). Meanwhile, ICS uses the variable μ to store the maximum weight of the updated incremental combinations of the previous column (line 5). Note that if the incremental combinations of the previous column have not been updated, then Δ′Ω_{x−1}^y = ΔΩ_{x−1}^y. Another variable ν stores the minimum weight of the incremental combinations of the current column, i.e., min_{j=1..|ΔΩ_x^y|} w(Gj ∈ ΔΩ_x^y) (line 5). Next, if ν ≤ μ (lines 6–10), the algorithm updates the incremental combinations of the current column by conducting the global amending. Finally, when the superscript y is equal to m, the algorithm outputs the incremental combinations of the previous column, since they are then in a global order (line 11). The following example illustrates the procedure of the ICS algorithm.

Example 1. Assume that we need to select a combination of two objects from Ω = {p1, p2, p3, p4} in Fig. 4, where the objects have the weights 1, 2, 3, and 3.5, respectively. To generate all combinations in an ascending order of w(G), the ICS algorithm first outputs {p2, p1}. Then, ICS uses p3 and Ω_2^1 = {{p1}, {p2}} to generate ΔΩ_3^2 = ({p1} | {p2}) ⊕ p3 = {{p3, p1}, {p3, p2}}. Since w({p4, p1}) < w({p3, p2}), ICS uses the combinations in ΔΩ_4^2 to globally amend ΔΩ_3^2. Thus, the combinations in ΔΩ_4^2 and ΔΩ_3^2 are sorted. ICS obtains the updated combinations after the global amending, which are Δ′Ω_3^2 = {{p3, p1}, {p4, p1}} and Δ′Ω_4^2 = {{p3, p2}, {p4, p2}, {p4, p3}}. □
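The ordering behavior in Example 1 can be reproduced with a plain heap; the sketch below is only a reference implementation (it materializes every pair, which the ICS algorithm deliberately avoids), and the names are ours.

import heapq
from itertools import combinations

def sorted_pairs(points, weight, k):
    # Emit up to k 2-combinations in ascending total weight.
    heap = [(weight(a) + weight(b), (a, b)) for a, b in combinations(points, 2)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(min(k, len(heap)))]

w = {'p1': 1, 'p2': 2, 'p3': 3, 'p4': 3.5}   # the weights of Example 1
print(sorted_pairs(list(w), w.get, 5))
# [('p1','p2'), ('p1','p3'), ('p1','p4'), ('p2','p3'), ('p2','p4')],
# matching the weights 3, 4, 4.5, 5, 5.5 shown in Fig. 4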
5. Top-k Combinatorial Metric Skyline (kCMS) query processing

So far, we have seen that our proposed ICS algorithm does not need to compute an exponential number of combinations to obtain the top-k combinations according to the monotonic function given in the query. Based on the ICS algorithm, we propose an efficient kCMS query algorithm that incrementally generates the combinatorial metric skyline. The details of the algorithm are elaborated in the following.

5.1. An overview of the kCMS query algorithm

A general framework for kCMS query processing consists of three phases: (i) the enumerating phase, (ii) the pruning phase, and (iii) the refinement phase. Let Grlt denote the query results, and Grfn denote the combinations to be refined; Grfn contains intermediate nodes to be extended later.

At the beginning, the enumerating phase searches the index (i.e., an M-tree [5]) in a best-first (BF) [23] manner until the heap H becomes empty. When a data object p is popped, the algorithm enumerates the combinations formed by p and the other data objects (or nodes) in H. These combinations are processed one by one. The current enumeration terminates once the current combination reaches a threshold value, and the algorithm then begins the next round of enumeration. The whole search terminates when certain conditions (given below) are satisfied. During this phase, the IDM-tree is built. The pruning phase prunes the combinations that are not qualified as query results according to several pruning heuristics. The refinement phase extends the combinations by replacing the current entry e with its children ei so that the resulting combinations contain only data objects and no intermediate nodes. Then, the obtained combinations are processed to find the final query results. The monotonic function considered in this paper is the sum of the attribute values of the metric skyline, i.e., adist(G) = Σ_{i=1}^{|G|} dist(pi, q).

5.2. Early Stopping (ES) for the kCMS query

In the enumerating phase, it is obviously very time consuming to compute all combinations of all data objects. In order to improve efficiency, we propose several stopping criteria that help terminate the enumeration much earlier without examining all combinations. Our idea follows the spirit of [1], i.e., 'stop the skyline computation without applying the skyline filter to all the objects.' This method significantly reduces the number of objects to be checked and the candidate combinations to be evaluated. The stopping criteria are presented as follows. Let k_score denote the k-th score of the combinations in Grlt, and rfn_minscore denote the minimum score of the combinations in Grfn. Then, we have Theorem 2 below.

Theorem 2. Let entry e be the top entry in H and |G| = m. The kCMS query can be terminated if e.key × m > k_score and rfn_minscore > k_score.

Proof. Since rfn_minscore is larger than k_score, there cannot be any combination in Grfn with a score lower than k_score. Moreover, all new combinations inserted into Grfn after e is popped must have a score larger than rfn_minscore. If e.key × m is larger than k_score, all new combinations generated using e and the entries in H will have a score larger than k_score. This is because any entry e′ in G except e has a score larger than e.key,
since the algorithm searches the nodes in a best-first manner, in ascending order. Combining the above two cases, we conclude that there does not exist a combination with a score lower than k_score. Therefore, the search can be stopped. □

Based on Theorem 2, we immediately obtain Corollary 1 as follows.

Corollary 1. Assume that the current entry popped from H is e and |G| = m. The kCMS query can stop searching the rest of the data space if Grfn = ∅ and e.key × m > k_score.

Theorem 2 and Corollary 1 show that the algorithm can stop at a certain point even though most of the index has not yet been visited. In fact, Bartolini et al. [1] have proved that using the MiniMax rule to choose the stop point is optimal, independent of the specific function used to sort the objects. Clearly, Theorem 2 and Corollary 1 are very important because they shrink the search space significantly. However, for the current data object p, the algorithm still needs to enumerate a large number of combinations. For example, the number of combinations is C(100, 3) = 161,700 if n = 100 and m = 3. Therefore, we propose another stopping criterion to further shorten the enumeration process. The new criterion is given by Theorem 3.

Theorem 3. Given the combination G sorted by the ICS algorithm, if its minimum distance to q, i.e., mindist(q, G), is greater than k_score, the enumeration can be stopped, where mindist(q, G) = Σ_{i=1}^{|G|} mindist(q, ei ∈ G) is the sum of the minimum distances between q and each ei ∈ G.

Proof. Since the incremental combinatorial sorting method is used, all combinations are generated in a global order. In other words, there does not exist any later combination with a score lower than k_score if mindist(q, G) > k_score. □

Theorem 3 is critical to the performance of the kCMS query because it avoids enumerating and evaluating a large number of unnecessary combinations. Our experimental results in Section 6 also confirm the effectiveness of Theorem 3. It is worth noting that the ICS algorithm must be adopted for Theorem 3 to apply. Next, we introduce the pruning condition for the combinations in Grfn.

Theorem 4. The kCMS query algorithm can delete G′ and the combinations behind G′ from Grfn if all combinations in Grfn are in an ascending order and mindist(q, G′) > k_score.

Theorem 4 helps improve query performance by largely reducing the storage space needed by Grfn and by avoiding the evaluation of unqualified combinations. Note that Theorems 2–4 can be easily revised for other monotonic functions, such as Π_{ei∈G} dist(ei ∈ G), which computes the volume over the attributes of a metric skyline.
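The two stopping tests reduce to a few comparisons at runtime; the following Python sketch states them explicitly (parameter names are our own stand-ins for the quantities in Theorems 2 and 3 and Corollary 1).

def can_terminate(e_key, m, k_score, rfn_minscore, rfn_empty):
    # Theorem 2 / Corollary 1: stop the whole search once the cheapest
    # possible new combination from the top heap entry already exceeds
    # k_score and Grfn can no longer improve the result.
    return e_key * m > k_score and (rfn_empty or rfn_minscore > k_score)

def stop_enumeration(mindist_q_G, k_score):
    # Theorem 3: with ICS the combinations arrive in ascending order, so
    # the first G with mindist(q, G) > k_score ends the current round.
    return mindist_q_G > k_score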
5.3. Triangle-based Pruning (TP)

Up to now, we have introduced the main stopping criteria used by the kCMS query algorithm. In what follows, we present the triangle-based pruning (TP) heuristics. Let q be a query object, and let p and rp be the pivot object and the radius of the entry e, respectively. We denote the maximum distance between any objects pi ∈ e and pj ∈ e (i ≠ j) as UB(pi, pj), and the minimum distance between any object pk ∈ e and q as LB(pk, q). Then, we obtain the following pruning heuristics.

Theorem 5 (Triangle-based Pruning Heuristics). Let e be a node in the metric index which contains at least (m + 1) data objects. The node e does not contain any results of a CMS query if UB(pi, pj) < LB(pk, q).

[Fig. 5 shows a node e = {p1, p2, p3, p4, p5} with pivot p5, two outside points p6 and p7, and a distant query object q.]

Fig. 5. Illustration of triangle-based pruning.
Proof. Let G be a combination from the index node e (|G| = m). There must exist an object p′ ∈ e whose distance from any object pi ∈ G, dist(p′, pi ∈ G), is no more than UB(pi, pj). Since UB(pi, pj) < LB(pk, q), it holds that dist(p′, pi ∈ G) < LB(pk, q) ≤ dist(q, pi ∈ G). In other words, q is dominated by p′ with respect to G, so G cannot be a result of the CMS query. Since G is an arbitrary combination from e, the proof is complete. □

Theorem 5 indicates that a node far from q generally does not contain any results of a CMS query. Therefore, it is more efficient for the query algorithm to retrieve the data objects closer to q first. To this end, our proposed algorithm follows the best-first manner [23], where the key is defined as the minimum distance between the entry e and q. Fig. 5 illustrates the rationale of Theorem 5, where p5 is the pivot of the index node e = {p1, p2, p3, p4, p5} and q is the query object. Assume that the candidate combination G is the set {p1, p2, p3}. The distance between p2 and p3, namely dist(p2, p3), is the maximum distance between any two data objects in the node e, and LB(p4, q) is the minimum distance between any data object pk ∈ e (k ∈ [1, 5]) and q. Since dist(p2, p3) < LB(p4, q), we have dist(pi, p4) < dist(pi, q) for all i ∈ [1, 3]. In other words, p4 dominates q with respect to G. Hence, G is not a result of the CMS query.

However, Theorem 5 requires pre-computing the distance of each pair of objects in order to obtain UB(pi, pj). To further improve efficiency, we replace UB(pi, pj) with 2rp, since UB(pi, pj) ≤ dist(pi, p) + dist(pj, p) ≤ 2rp. Moreover, we also replace LB(pk, q) with dist(p, q) − rp, since dist(p, q) − rp ≤ dist(p, q) − dist(p, pk) ≤ LB(pk, q). Accordingly, we obtain the following Corollary 2.

Corollary 2. Let e be a node in a metric index which contains at least (m + 1) data objects. The node e does not contain any query result if 3rp ≤ dist(q, p).

We omit the full proof of Corollary 2 to save space, since it follows straightforwardly from Theorem 5: if 3rp ≤ dist(q, p), then 2rp ≤ dist(q, p) − rp, so the upper bound on pairwise distances falls below the lower bound on distances to q. Corollary 2 enables our kCMS query algorithm to quickly prune a large number of combinations in the node e by comparing them only against the query object. It provides a basis for pruning the combinations in an intermediate node. Observe that, like Theorem 5, Corollary 2 can be applied only when the node e contains at least (m + 1) data objects. However, in some cases, a node can be pruned even when it has only m data objects, as given in the following Corollary 3. In Corollary 3, UB(Y, X) = dist(x, y) + rx + ry is an upper bound on the distance between any object in an index node X and any object in an index node Y, where x and y are their pivots and rx and ry their radii.

Corollary 3. Given a node X in a metric index which contains m data objects, another node Y, and a query object q: if UB(Y, X) < LB(q, X), X does not contain a result of the CMS query.

Proof. Since dist(y′ ∈ Y, ∀x′ ∈ X) ≤ UB(Y, X) < LB(q, X) ≤ dist(q, ∀x′ ∈ X), q is dominated by any object y′ ∈ Y w.r.t. the combination formed from X. Hence, X does not contain any query result of the CMS query. □

It is worth noting that Corollary 3 has less pruning power than Corollary 2, because Corollary 3 can prune just one combination when its condition is satisfied. Therefore, we further develop the following Corollary 4, which can be used in a more general case. Corollary 4 is derived from the observation that the data objects in a combination usually come from multiple index nodes rather than a single index node.
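The pivot-and-radius test of Corollary 2 is cheap to evaluate; here is a minimal Python sketch, assuming 2-D points as tuples and Euclidean distance, with names of our own choosing.

import math

def dist(u, v):
    return math.hypot(u[0] - v[0], u[1] - v[1])

def prunable_by_corollary2(pivot, radius, n_objects, m, q):
    # Corollary 2: a node holding at least m + 1 objects cannot contribute
    # any CMS result once 3 * radius <= dist(q, pivot), because some object
    # in the node then dominates q w.r.t. every combination from the node.
    return n_objects >= m + 1 and 3 * radius <= dist(q, pivot)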
Algorithm 2. kCMS_Processing

Input: M-tree I constructed over P, user-specified parameters m and k, query object q
Output: the result of the kCMS query
1. Grlt = ∅, Grfn = ∅, k_score = +∞, initialize min-heap H accepting entries in the form (e, key)
2. insert (Root, mindist(e, q)) into heap H
3. while (heap H is not empty) do
4.   remove top entry e from H
5.   if ((e.key × m) > k_score) // Theorem 2 and Corollary 1
6.     if (Grfn is not empty and rfn_minscore > k_score) return
7.     if (Grfn is empty) return
8.   if (e is a data object)
9.     for ∀ combination G of e not marked as a false positive do // by the ICS method
10.      if (mindist(q, G) > k_score) break // Theorem 3
11.      ProcessG(G, Grlt, Grfn, k, k_score)
12.  else // intermediate node
13.    if (|e| ≥ m + 1 and 3 × e.rp ≤ dist(e.piv, q)) // Corollary 2, rp is the radius of e
14.      mark all combinations from e as false positives
15.    if (|e| == m and ∃e′ s.t. UB(e′, e) < LB(q, e) or e′ dominates q w.r.t. G from e) // Corollary 3
16.      mark the combination from e as a false positive
17.    if (|e| + |e′| ≥ m and ∃e′ s.t. UB(e′, e) < LB(q, e) and UB(e′, e) < LB(q, e′)) // Corollary 4
18.      mark the combinations formed by e and e′ as false positives
19.    for each child ei ∈ e do
20.      insert (ei, mindist(ei, q)) into heap H
21.    for ∀ combination G of Grfn including e do // update the combinations of Grfn
22.      remove G from Grfn
23.      for each combination G′ of G do // obtain G′ by replacing e of G with ei
24.        if (mindist(q, G′) > k_score) continue // Theorem 4
25.        ProcessG(G′, Grlt, Grfn, k, k_score)
Corollary 4. Assume that two nodes X and Y together contain at least m data objects (|X| + |Y| ≥ m), and that a combination G consists of l1 (l1 < m) objects from X and m − l1 objects from Y. If UB(X, Y) < LB(q, X) and UB(X, Y) < LB(q, Y), then G is not a result of the CMS query.

Proof. Given a point p′ ∈ (X ∪ Y)\G and a point pi ∈ G (i ∈ [1, |G|]), we have dist(p′, pi) ≤ UB(X, Y) < LB(q, X). If pi ∈ X, it holds that dist(q, x) − rx < dist(q, x) − dist(x, pi), because dist(x, pi) < rx. On the other hand, we have dist(q, x) − dist(x, pi) ≤ dist(q, pi) according to the triangle inequality. Thus, dist(p′, pi) < dist(q, x) − rx < dist(q, x) − dist(x, pi) ≤ dist(q, pi), that is, dist(p′, pi) < dist(q, pi). Similarly, if pi ∈ Y, it holds that dist(p′, pi) < dist(q, pi). In summary, in both cases we can always conclude that dist(p′, pi) < dist(q, pi) for every pi ∈ G. Therefore, G is not a result of the CMS query by Definition 2. □

Similar to Corollary 3, Corollary 4 is applied only when the information of the sibling nodes is available. Clearly, Corollary 2 and Corollary 4 play an important role in quickly pruning an intermediate node e with at least m + 1 data objects.
5.4. kCMS query processing
We now present the complete kCMS query algorithm (shown in Algorithm 2), which integrates our proposed Early Stopping (ES) and Triangle-based Pruning (TP) techniques. We will step through the algorithm using the set of data points P = {p1, . . ., p9} and the query point q in Fig. 6, with m = 2 and k = 2. First, we index all the objects in the dataset P using an M-tree (denoted as I). kCMS_Processing retrieves the qualified combinations by traversing the M-tree in a best-first manner [23].
[Fig. 6(a) depicts the dataset: q lies near p3 and p4, with the pruning regions PR(p3, q) and PR(p4, q) drawn around it. Fig. 6(b) depicts the M-tree structure: Root → {e1, e2}, e1 → {e3, e4}, e2 → {e5, e6}, with e3 = {p1, p2}, e4 = {p3, p4}, e5 = {p5, p6}, and e6 = {p7, p8, p9}.]

Fig. 6. Illustration of kCMS query in an M-tree.
Table 2. kCMS processing for the example in Fig. 6 (k = 2, m = 2).

Actions | Contents of H | Candidate combinations | Contents of Grfn | Contents of Grlt
Visit Root | e1, e2 | – | ∅ | ∅
Expand e1 | e4, e3, e2 | – | ∅ | ∅
Expand e4 | p4, p3, e3, e2 | – | ∅ | ∅
Visit p4 | p3, e3, e2 | {p4, p3}, {p4, e3}, {p4, e2} | {{p4, e3}, {p4, e2}} | {{p4, p3}}
Visit p3 | e3, e2 | {p3, e3}, {p3, e2} | {{p3, e3}, {p4, e3}, {p4, e2}, {p3, e2}} | {{p4, p3}}
Expand e3 | p2, p1, e2 | {p3, p2}, {p3, p1} | {{p4, e2}, {p3, e2}} | {{p4, p3}, {p3, p2}}
Specifically, we maintain a min-heap H with entries in the form (e, key) (line 1), where the key is defined as the minimum distance between the entry e and the query point q (i.e., mindist(e, q)). Intuitively, a small key may result in a small aggregate value of a combination. We also initialize two empty sets, Grlt and Grfn, at line 1, which store the kCMS results and the candidate combinations for refinement, respectively. Each combination in Grfn includes at least one intermediate entry that will be extended later. The query algorithm starts by inserting e1 and e2 into H with their keys (i.e., mindist(e1, q) and mindist(e2, q)). Table 2 shows the contents of H, Grlt, and Grfn as well as the candidate combinations at each step, where a combination marked with a strikethrough is pruned in the next step. Then, e1, which has the minimum mindist, is removed from H, and its children e3 and e4, along with their mindist values, are inserted into H. At this point, the algorithm verifies whether the stopping condition is satisfied (lines 5–7) according to Theorem 2 and Corollary 1. Next, the algorithm processes e in the following two cases.

If e is a data object, the algorithm uses the ICS method to gradually enumerate the combinations that contain e and the entries in H (line 9). For example, the algorithm generates three combinations, {p4, p3}, {p4, e3}, and {p4, e2}, after p4 is removed from H. Thanks to the ICS approach, the enumeration in line 9 can stop after only a few combinations are enumerated. Line 10 exits the enumeration of the current entry e when the termination condition, i.e., mindist(q, G) > k_score, is satisfied. This allows many candidate combinations to be pruned by Theorem 3, and as a result the algorithm achieves a significant performance improvement. Before exiting the current
enumeration, each combination G is processed by ProcessG. Since {p4, e3} (respectively {p4, e2}) cannot be pruned by the sibling entry of any entry in {p4, e3} (respectively {p4, e2}), {p4, e3} and {p4, e2} are inserted into Grfn (see line 5 in ProcessG). Note that this method is very effective because it prunes some combinations of Grfn in advance. The combination {p4, p3}, in contrast, is inserted into the result set Grlt since q is in its metric skyline. Similarly, p3 is removed from H, and {p3, e3} and {p3, e2} are added to Grfn.

In the other case, when e is an intermediate node, the algorithm first filters out false positives (unqualified combinations) using Corollary 2, Corollary 3, and Corollary 4 (lines 13–18). These false positives are not processed any further (line 9). In order to better prune the combination(s) in line 15 and line 18, we only select the data objects (or entries) in e's sibling entry (denoted as e′). Then, the children of e are inserted into H (lines 19–20). In our example, after e3 is removed, the combination from e3, i.e., {p2, p1}, is marked as a false positive (line 15), and its children p2 and p1 are inserted into H. On the other hand, the algorithm uses e's children to update the combinations in Grfn which contain e, traversing all combinations in Grfn. Each updated combination is denoted as G′. For the above example, the algorithm first deletes {p3, e3} and {p4, e3} from Grfn, and then generates four new combinations by replacing e3 in each combination of Grfn (that contains e3) with its children p2 and p1, respectively. Finally, the algorithm processes each G′ one by one by invoking ProcessG. If mindist(q, G′) > k_score, the processing of G′ is skipped. As a result, {p4, p2} and {p4, p1} are pruned; meanwhile, {p3, p2} and {p3, p1} are used to update Grlt via UpdateRlt. Accordingly, the contents of Grfn and Grlt become {{p3, p2}, {p3, p1}, {p4, e2}, {p3, e2}} and {{p4, p3}, {p3, p2}}, respectively.
Procedure ProcessG(G, Grlt, Grfn, k, k_score)
Input: G: the candidate combination; Grlt: the resultant set; Grfn: the set of candidate combinations for refinement; k: a user-specified parameter; k_score: the k-th score of Grlt
Output: process G and maintain Grlt, Grfn, and k_score
1. if G consists of data objects then
2.   invoke the MSQ algorithm to judge whether q is in the metric skyline w.r.t. G
3.   if G is a result then UpdateRlt(G, Grlt, k, k_score)
4. else // G includes intermediate entry(entries)
5.   if (∃e′ dominates q w.r.t. G) then do nothing // select e′ from the sibling entries of ei ∈ G
6.   else insert G into Grfn
Procedure UpdateRlt(G, Grlt, k, k_score)
Input: G: the candidate combination; Grlt: the resultant set; k: a parameter; k_score: the k-th score of Grlt
Output: update Grlt and k_score
1. if (|Grlt| < k − 1) insert G into Grlt
2. else if (|Grlt| == k − 1)
3.   insert G into Grlt
4.   k_score = max{adist(q, Gi ∈ Grlt), i ∈ [1, |Grlt|]}
5. else // |Grlt| ≥ k
6.   if adist(q, G) < k_score then
7.     delete the combination Gj ∈ Grlt with the largest score
8.     insert G into Grlt and update k_score
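The bookkeeping in UpdateRlt amounts to maintaining a bounded max-heap of the k best scores; the Python sketch below illustrates one way to do this (Grlt here is a heap of (−score, G) pairs, and G should be an ordering-comparable tuple; the names are our own, not the authors' API).

import heapq

def update_result(G, score, Grlt, k):
    # Keep the k smallest-score combinations; return the current k_score
    # (+inf until k results exist).
    if len(Grlt) < k:
        heapq.heappush(Grlt, (-score, G))
    elif score < -Grlt[0][0]:
        heapq.heapreplace(Grlt, (-score, G))  # evict the worst result
    return -Grlt[0][0] if len(Grlt) == k else float('inf')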
At this moment, it holds that rfn_minscore = adist(q, {p4, e2}) and k_score = adist(q, {p3, p2}). In its next step, the query algorithm removes p2 from H and examines the stopping condition (lines 5–7). Since dist(p2, q) × 2 > k_score and rfn_minscore > k_score, the algorithm terminates according to Theorem 2. Note that in the above example, where k = 2, the TP technique has not been used. TP takes effect when the parameter k becomes larger. For example, if k = 4, the combinations formed by e5 and e6, i.e., {p5, p7}, {p5, p8}, {p5, p9}, {p6, p7}, {p6, p8}, and {p6, p9}, will be pruned by Corollary 4; the combination from e5, i.e., {p5, p6}, will be pruned by Corollary 3; and the combinations from e6, i.e., {p7, p8}, {p7, p9}, and {p8, p9}, will be pruned by Corollary 2.

5.5. Enhanced query processing with spatial pruning and the reuse technique

As presented in the previous section, the kCMS_Processing algorithm achieves a significant performance improvement over the baseline approach by pruning a large amount of the search space. However, there is still room for improvement. Observe that kCMS_Processing needs to execute metric skyline queries (MSQs) to check whether a combination is a result of the CMS query of q whenever the combination cannot be pruned by the ES and TP heuristics. For one kCMS query, multiple MSQs may be executed, which may visit the same nodes and introduce unnecessary I/O accesses. Therefore, in this section, we propose an enhanced kCMS query algorithm by leveraging two techniques: spatial pruning (SP) and the reuse heap (RH) [15].

Spatial pruning technique. We illustrate the basic idea of the spatial pruning heuristics using the 2-dimensional example shown in Fig. 7, where G = {p1, p2, p3} is a candidate combination and q is the query object. For each i ∈ [1, 3], we draw a circle C(pi, dist(pi, q)) centered at the point pi ∈ G with radius dist(pi, q). The region bounded by the circle C(pi, dist(pi, q)) is called the pruning region PR(pi, q) of pi (i ∈ [1, 3]), e.g., PR(p1, q). The intersection of the pruning regions of all data objects in G is called the pruning region PR(G, q) of G, i.e., PR(G, q) = PR(p1, q) ∩ PR(p2, q) ∩ PR(p3, q), shown as the dark gray region with dashed lines in the figure. Obviously, any object p′ ∈ P\G located in the region PR(G, q) dominates q with respect to G (i.e., p′ ≺_G q), since dist(p′, pi) ≤ dist(q, pi) for every object pi ∈ G (i ∈ [1, 3]). Therefore, G cannot be a result of the CMS of q, and G can be safely pruned. Based on the observation in the above example, we propose the spatial pruning heuristics as follows.

Theorem 6 (Spatial Pruning Heuristics). Given a query object q and a combination G from the dataset P (G ⊆ P, |G| = m), if the pruning region PR(G, q) contains at least one object p′ ∈ P\G, G can be safely pruned; otherwise, G is a result of the combinatorial metric skyline.
PR(G, q)
Y
PR(p2, q)
PR(p3, q)
8
p2 6
p6
p5
p1
2
p4 2
4
G = {p1, p2, p3}
Theorem 6 helps avoid computing expensive metric skyline queries for checking whether a combination is the query result or not. Instead we only need to carry out m window queries. Correspondingly, the new query algorithm only requires replacing the second line in ProcessG with the spatial pruning heuristics. More importantly, this approach greatly reduces the search space because it only needs to search the union of all pruning region of data objects in G. Thus, it is much more efficient than computing MSQ. Consider the example in Fig. 7, where the combination G consists of three data objects p1, p2, and p3. Since the PR(G, q) does not include any other data object p0 2 PnG, q is not dominated by p0 with respect to G = {p1, p2, p3}. Therefore, we can quickly conclude that G is a result of the CMS query. In fact, we even do not need to run all m window queries to determine whether a combination G is a result of the CMS query. Reconsidering the example in Fig. 7, we can conclude that {p1, p3, p2}, {p1, p3, p6}, and {p1, p3, p5} are in the query results by running only the window query of p1 and q. This is because PR(p1, q) only contains two data objects p1 and p3 in G, and there does not exist any other data object p0 R G which dominates q. In other words, G is a query result. The challenge here is to determine which window query needs to be executed first. Generally, a smaller pruning region may contain fewer data objects than a bigger pruning region. Therefore, we run the window queries in an ascending order of the size of their corresponding pruning regions. For G = {p1, p3, p2}, we will first run the window query for p3, followed by p1 and p2. In fact, after we obtain PR({p3, p1}, q), we can terminate the window query of p2, the rationale behind which is presented by the following Corollary 5. Corollary 5. Consider two sets G G0 and a query object q. If q is a metric skyline object with respect to G (jGj = m), then q is also a metric skyline object with respect to G0 (jG0 j > m).
Proof. Since G is a result of the CMS query, the pruning region PR(G, q) does not contain any data object p′ ∈ P\G. On the other hand, PR(G′, q) ⊆ PR(G, q) holds, since PR(G′, q) = PR(G, q) ∩ (∩p∈G′\G PR(p, q)). Therefore, PR(G′, q) does not contain any data object p′ ∈ P\G′. According to Theorem 6, the query object q is not dominated by any data object p′ ∈ P\G′, which completes the proof. □
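As an illustration of this ordering and of the early termination it enables, here is a hedged C++ sketch reusing the Point and dist helpers from the sketch above. The linear scans stand in for the index-based window queries, and all names are illustrative; the early stop relies on the monotonicity behind Corollary 5 (shrinking the intersection can never admit new outsiders).

#include <algorithm>
#include <vector>

// Run the m "window queries" of G in ascending order of window radius and
// stop early: once the running intersection holds no outsider, adding the
// remaining windows cannot change the verdict.
bool verifiedByOrderedWindows(std::vector<int> G,
                              const std::vector<Point>& P,
                              const Point& q) {
    std::sort(G.begin(), G.end(), [&](int a, int b) {
        return dist(P[a], q) < dist(P[b], q);     // smaller radius first
    });
    std::vector<int> candidates;                  // outsiders inside all processed windows
    for (int j = 0; j < (int)P.size(); ++j)
        if (std::find(G.begin(), G.end(), j) == G.end())
            candidates.push_back(j);
    for (int gi : G) {                            // one window query per member of G
        std::vector<int> survivors;
        for (int c : candidates)
            if (dist(P[c], P[gi]) <= dist(q, P[gi]))
                survivors.push_back(c);
        candidates.swap(survivors);
        if (candidates.empty())
            return true;                          // early stop: G is certified a result
    }
    return candidates.empty();                    // empty PR(G, q): G is a CMS result
}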
Fig. 7. Illustration of spatial pruning: the pruning regions PR(p1, q), PR(p2, q), and PR(p3, q) and their intersection PR(G, q) for G = {p1, p2, p3}.
Corollary 5 helps quickly identify results of the CMS query and saves computation cost. To avoid repeated computation, we search the index in a best-first (BF) [23] manner. Meanwhile, to reduce the memory cost, we employ a compressed storage scheme: for example, the two combinations {p1, p3, p2} and {p1, p3, p6} share the common prefix {p1, p3}. Based on this observation, we compress the combinations that contain the same number of objects and share a common prefix by storing their common prefix only once.
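A minimal sketch of this prefix-compressed storage is given below. The record layout and the object identifiers are illustrative assumptions, not the paper's actual data structures.

#include <vector>

// One record stores a shared prefix once, plus the differing last objects.
// E.g., {p1, p3, p2}, {p1, p3, p6}, {p1, p3, p5} become prefix {1, 3} with
// tails {2, 6, 5} (object ids are illustrative).
struct CompressedCombos {
    std::vector<int> prefix;   // common leading objects of equal-size combinations
    std::vector<int> tails;    // one entry per stored combination
};

// Expand a record back into explicit combinations when they must be verified.
std::vector<std::vector<int>> expand(const CompressedCombos& c) {
    std::vector<std::vector<int>> out;
    for (int t : c.tails) {
        std::vector<int> combo = c.prefix;   // copy the shared prefix
        combo.push_back(t);                  // append the distinguishing tail
        out.push_back(combo);
    }
    return out;
}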
Reuse heap technique. Recall that if a candidate combination G cannot be pruned by the ES or TP heuristics, kCMS_Processing needs to execute multiple window queries, as discussed under spatial pruning, to verify G. As an example, Fig. 6 shows the query windows of p3 and p4, that is, PR(p3, q) and PR(p4, q), respectively. From the example, we can see that some entries (e.g., Root, e1, e4) are visited multiple times when performing window queries for different objects, which results in a large amount of unnecessary I/O and CPU cost. Therefore, if we store the M-tree nodes visited by previous window queries, we may be able to reuse some of them for subsequent window queries. In fact, using the reuse technique, all the window queries can be answered by traversing the M-tree only once, which significantly reduces the overall I/O cost. Similarly, when kCMS_Processing invokes the MSQ algorithm to verify whether the current combination is a result of the CMS query of q, we can also use the reuse technique to ensure that the M-tree is traversed only once. To implement the reuse technique, we need to store the visited M-tree nodes. This can be done either by maintaining all the visited nodes in a reuse heap Hr or by storing only the leaf entries of the M-tree. Since maintaining all the visited nodes takes considerable space, we adopt the second option. To ensure that no entry is missed, a visited entry must not be discarded before the entire query processing completes unless it has been expanded. It is worth noting that our reuse heap technique differs from the caching techniques in most database management systems: caching typically keeps the most recent entries, whereas the reuse heap preserves specific entries. We refer to kCMS_Processing with both spatial pruning and the reuse technique as the spatial reuse kCMS algorithm. Based on the above analysis, we have the following Theorem 7.

Theorem 7. For the entries in the reuse heap Hr, the spatial reuse kCMS algorithm traverses the M-tree only once, whereas the kCMS algorithm traverses the M-tree multiple times.
Proof. Since the kCMS algorithm needs to execute the MSQs (or window queries) to check whether a combination G is a query result, it usually traverses the M-tree multiple times. This redundant I/O access is avoided by the reuse heap Hr, which stores the visited entries so that each of them is read from the M-tree at most once. □

Reconsider the example in Fig. 6 and assume that each node access causes one I/O operation. To generate the combination {p4, p3}, the kCMS algorithm accesses the Root node and the entries e1 and e4, so the I/O cost is 3. To compute the metric skyline of {p4, p3}, the algorithm requires 4 further I/O accesses because it needs to access the Root node, e1, e4, and additionally e3. With the reuse heap, we save the 3 repeated I/O accesses to the Root node, e1, and e4, since their index information is stored in Hr = {p4, p3, e3, e2}.
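The following hedged C++ sketch illustrates the single-traversal idea: all m window queries of a combination are evaluated in one pass over an M-tree-like index, pruning subtrees with the covering radius. The Entry layout and helper names are assumptions for illustration, reusing the Point and dist helpers from the earlier sketches; the real algorithm additionally keeps unexpanded leaf entries in Hr so that later combinations are verified without re-reading them.

#include <vector>

struct Entry {                        // simplified M-tree entry (assumed layout)
    Point center;                     // routing object, or the data object itself
    double radius;                    // covering radius; 0 for a data entry
    std::vector<Entry*> children;     // empty for data entries
};

bool isMemberOf(const Point& p, const std::vector<Point>& G) {
    for (const Point& g : G)
        if (p.x == g.x && p.y == g.y) return true;
    return false;
}

// Evaluate all m window queries of G in ONE index traversal: a subtree is
// descended only if it may intersect every window C(gi, dist(gi, q)).
bool outsiderInPR(Entry* root, const std::vector<Point>& G, const Point& q) {
    std::vector<double> r;                        // window radii dist(gi, q)
    for (const Point& g : G) r.push_back(dist(g, q));
    std::vector<Entry*> stack = {root};
    while (!stack.empty()) {
        Entry* e = stack.back(); stack.pop_back();
        bool mayReachPR = true;
        for (size_t i = 0; i < G.size(); ++i)
            if (dist(e->center, G[i]) - e->radius > r[i]) {
                mayReachPR = false; break;        // subtree misses window i entirely
            }
        if (!mayReachPR) continue;
        if (e->children.empty()) {                // data entry lying in every window
            if (!isMemberOf(e->center, G)) return true;   // outsider found: prune G
        } else {
            for (Entry* c : e->children) stack.push_back(c);
        }
    }
    return false;                                 // PR(G, q) holds no outsider
}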
5.6. Discussion

Although we have developed many pruning heuristics for the kCMS query, there may still be room for further improvement. For example, some existing techniques used by spatial skyline queries, such as the Voronoi diagram, the Delaunay graph, and the convex hull [28,29], may be useful for metric skyline queries as well, because the spatial skyline is very similar to the metric skyline. The main difference between them is that the spatial skyline is only applicable in Euclidean space, whereas the metric skyline can use any distance function in metric space, e.g., the edit distance. Therefore, for kCMS queries on spatial datasets, new pruning heuristics can be developed to boost the pruning power. In this setting, the definition of the kCMS query needs to be revised by adding the keyword ''spatial'': we replace the phrase ''q is in the metric skyline of G'' with ''q is in the spatial skyline of G'', and the combinatorial metric skyline correspondingly becomes the combinatorial spatial skyline. Based on the theories in [28], the following useful lemma and theorems can be obtained.

Lemma 3. For each gi ∈ G, if gi has the query point q as its closest point in the dataset P, the combination G is a result of the combinatorial spatial skyline of q.

Proof. If q is the closest point to gi in P, we have dist(q, gi) < dist(p′, gi) for all p′ ∈ P (p′ ≠ q). By definition, no point in P spatially dominates q. Therefore, G is a result of the combinatorial spatial skyline of q. □

Theorem 8. Any combination G ⊆ P whose convex hull contains the query point q is a result of the combinatorial spatial skyline of q.

Let VC(q) denote the Voronoi cell that contains the query point q; VC(q) is a convex polygon in Euclidean space. Then we can obtain the following Theorem 9.

Theorem 9. If the Voronoi cell VC(q) intersects the boundary of the convex hull of G, then G is a result of the combinatorial spatial skyline of q.

Lemma 3 shows that a combination is a combinatorial spatial skyline of q solely because of q's location, regardless of where the other data points of P are located. Theorem 8 enables our algorithm to efficiently retrieve a large number of result combinations by examining them only against the query object q. Theorem 9 identifies combinations in the combinatorial spatial skyline of q by examining only the data points in a limited local proximity around q. Lemma 3, Theorem 8, and Theorem 9 thus help us quickly select combinations that are results of the combinatorial spatial skyline of q; these produce the seed combinations used to further improve the pruning power. In fact, we also make use of the following Theorem 10 to reduce the time complexity of our algorithms by disregarding distance computations against non-convex points.

Theorem 10. Whether G is a result of the combinatorial spatial skyline of q does not depend on any non-convex point gi ∈ G.

We omit the proofs of Theorems 8–10 since they follow straightforwardly from Theorem 1, 8, and Theorem 2 in [28], respectively.
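For the Euclidean case, the Theorem 8 test reduces to a point-in-convex-hull check. Below is a hedged 2D C++ sketch based on Caratheodory's theorem (q lies in conv(G) iff q lies in a triangle spanned by points of G), reusing the Point type from the earlier sketches; it is an illustration, not the paper's implementation.

#include <vector>

// Signed area test: cross(q, a, b) > 0 iff q is to the left of edge a->b.
double cross(const Point& o, const Point& a, const Point& b) {
    return (a.x - o.x) * (b.y - o.y) - (a.y - o.y) * (b.x - o.x);
}

// q is inside (or on the boundary of) triangle abc iff the three signed
// areas do not have strictly mixed signs.
bool inTriangle(const Point& q, const Point& a, const Point& b, const Point& c) {
    double d1 = cross(q, a, b), d2 = cross(q, b, c), d3 = cross(q, c, a);
    bool hasNeg = (d1 < 0) || (d2 < 0) || (d3 < 0);
    bool hasPos = (d1 > 0) || (d2 > 0) || (d3 > 0);
    return !(hasNeg && hasPos);
}

// Theorem 8 test in 2D: q ∈ conv(G) iff q lies in some triangle of G's
// points (O(m^3) triples, cheap for the small m used in practice).
bool hullContains(const std::vector<Point>& G, const Point& q) {
    for (size_t i = 0; i < G.size(); ++i)
        for (size_t j = i + 1; j < G.size(); ++j)
            for (size_t k = j + 1; k < G.size(); ++k)
                if (inTriangle(q, G[i], G[j], G[k])) return true;
    return false;    // q outside conv(G): Theorem 8 gives no conclusion
}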
6. Experimental study

In this section, we evaluate the effectiveness and efficiency of our proposed algorithms through extensive experiments using both real and synthetic datasets.

6.1. Experimental settings

The real dataset in our experiments is Color (Col) [31], a 9-dimensional dataset containing 68 K image data items. Each data item is associated with nine attributes, such as brightness and saturation.
The L1-norm distance is used to measure the similarity between any two feature vectors extracted from the images. The synthetic dataset that we use is Signature (Sig) [3], which contains 50 K randomly generated strings; each string in Sig consists of 64 English letters, and the edit distance function is used. In addition, we also generate two more datasets, Correlated (Cor) and Independent (Ind) [3]. Table 3 lists the statistics of all the datasets considered in the experiments.

Table 3
The statistics of datasets.

Datasets       Size (K)    Dimensionality    Measure
Color          68          3                 L1-norm
Signature      50          64                Edit distance
Correlated     256         3                 L2-norm
Independent    256         3                 L2-norm

For each dataset, we index the objects using an M-tree [5] with a page size of 2048 bytes and randomly select 50 data objects as query objects. Since the related skyline query algorithms work only in vector space while our algorithms work in metric space, we do not compare against them in the following experiments. We mainly compare the performance of three algorithms: the basic kCMS query algorithm without the reuse technique (denoted as kCMS, or C in figures), the improved kCMS query processing with spatial pruning and the reuse technique (denoted as kCMS + SR, or S in figures), and the brute-force approach using a linear scan (denoted as LS, or L in figures). Each reported value in the following diagrams is the average query cost of 50 queries, whose locations follow the distribution of the corresponding dataset. All the algorithms were implemented in C++ and run on a Windows PC with a 2.0 GHz dual-core CPU and 4 GB RAM. The monotonic function for sorting the combinations in the ICS algorithm is Σei∈G mindist(ei, q).
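As a small illustration of this sorting key, the following hedged C++ sketch computes the aggregate lower-bound distance of a combination of M-tree entries, reusing the Entry, Point, and dist helpers from the earlier sketches. The mindist formula max(0, dist(q, center) − radius) is the standard M-tree lower bound and is an assumption about the implementation.

#include <algorithm>
#include <vector>

// Lower bound on the distance from q to any object stored under entry e.
double mindist(const Entry& e, const Point& q) {
    return std::max(0.0, dist(q, e.center) - e.radius);
}

// Priority key used to generate candidate combinations incrementally:
// combinations with smaller aggregate mindist are examined first.
double icsKey(const std::vector<Entry>& G, const Point& q) {
    double sum = 0.0;
    for (const Entry& e : G) sum += mindist(e, q);   // Σ mindist(ei, q) over ei in G
    return sum;
}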
6.2. Experimental results

In this subsection, we present four sets of experiments that compare the performance of the kCMS, kCMS + SR, and LS algorithms on four datasets: Col, Sig, Cor, and Ind. We extract the first three dimensions of the original datasets Col and Sig for all the experiments except the one evaluating the effect of dimensionality. The performance of our proposed algorithms is investigated under a variety of parameters, including the parameter k, the dimensionality of the datasets (dim), the cardinality of the datasets (N), and the number of objects per combination (m). In each experiment, we vary only one parameter while keeping the others fixed at their default values. We use the following four performance metrics: the number of node accesses (denoted as NA), the CPU time (excluding I/O time), the number of distance computations, and the maximum number of entries in the reuse heap (denoted as MH).

The effect of parameter k. The first set of experiments explores the effect of the parameter k on the query performance over the 3D datasets Col (68 K), Cor (256 K), and Ind (256 K) and the 64D dataset Sig (50 K), where m = 2. Fig. 8 shows the CPU time and I/O cost of all the algorithms, and Fig. 9 plots the number of distance computations conducted by each algorithm. The abbreviations of the algorithms (C for kCMS, S for kCMS + SR, and L for LS) are listed under the horizontal axis, and the MH of kCMS + SR is listed at the bottom of each dark gray bar.
Fig. 8. CPU time and node accesses vs. k (m = 2). (a) Col(68K, 3D); (b) Sig(50K, 64D); (c) Cor(256K, 3D); (d) Ind(256K, 3D).
Fig. 9. Distance computation vs. k (m = 2). (a) Col(68K, 3D); (b) Sig(50K, 64D); (c) Cor(256K, 3D); (d) Ind(256K, 3D).
Fig. 10. CPU time and node accesses vs. dim (k = 16, m = 2). (a) Col(68K); (b) Cor(256K).
As expected, the CPU time and NA of all three algorithms increase with k, because the number of candidate combinations grows with k. Since the baseline approach LS is thousands of times slower than our two proposed algorithms, due to the need to generate an exponential number of combinations, we do not report the results of LS in the subsequent experiments. Among the three algorithms, kCMS + SR performs best in all aspects. In particular, its query cost increases more slowly than that of the others, and it achieves the best performance in all cases. This is because kCMS + SR adopts both the reuse heap (RH) technique and the spatial pruning (SP) method: SP helps reduce the number of candidate combinations to be evaluated, while RH cuts down the number of node accesses by ensuring that the M-tree is traversed only once. In particular, kCMS + SR conducts far fewer distance computations on every dataset, because the SP method limits the search range.
Fig. 11. Distance computation vs. dim (k = 16, m = 2). (a) Col(68K); (b) Cor(256K).
In addition, we observe that the MH of kCMS + SR increases slightly with the growth of k. This is because, when k is larger, more index nodes need to be visited, which in turn increases the number of entries stored in the reuse heap. Overall, kCMS and kCMS + SR significantly outperform LS, and kCMS + SR achieves the best performance.

The effect of dimensionality (dim). Figs. 10 and 11 demonstrate the effect of the dimensionality (dim) on the query performance, varying dim from 2 to 5 with k = 16 and m = 2. Due to space limitations, we only report the results on the datasets Col (68 K) and Cor (256 K); the results on Sig (50 K) and Ind (256 K) are similar and thus omitted here. From Figs. 10 and 11, we observe that kCMS + SR outperforms kCMS in terms of node accesses by about 1–2 orders of magnitude. This is again attributed to the stronger pruning power provided by the spatial pruning heuristics (SP), and it also confirms the effectiveness of Theorem 6 and Corollary 5. In addition, from Fig. 10, we see that both the CPU time and NA increase with dim, although kCMS + SR's CPU time increases more slowly than kCMS's. Moreover, as shown in Fig. 11, the number of distance computations also increases with the growth of dim. The reason for this behavior is that a high-dimensional combination is less likely to be a result of a kCMS query, so both kCMS and kCMS + SR need to process more candidate combinations to obtain the first k query results.
Fig. 12. CPU time and node accesses vs. dataset size N (k = 16, m = 2). (a) Col(3D); (b) Sig(64D); (c) Cor(3D); (d) Ind(3D).
On the other hand, the figures show that the dimensionality of the dataset has little impact on the CPU time but more impact on the number of distance computations and the maximum number of entries in the reuse heap (MH). This is because the distance computation for each data point in a high-dimensional space is more expensive than in a low-dimensional space, and a combination in a high-dimensional space has a lower probability of being a result of the kCMS query. As dim increases, more index nodes need to be visited, resulting in a larger MH and higher I/O and CPU costs. It is worth noting that for extremely large dim, the reuse heap may become very large, and hence even the performance of kCMS + SR may suffer from the time spent managing the reuse heap.

The effect of dataset cardinality (N). In this set of experiments, we evaluate the scalability of our approaches by varying the cardinality N of the datasets, with k = 16 and m = 2. In particular, we vary N from 12 K to 60 K for the 3D Col, from 10 K to 50 K for the 64D Sig, and from 64 K to 1024 K for the 3D Cor and Ind. The experimental results are shown in Figs. 12 and 13. Observe that, in most cases, the query cost of both algorithms increases slightly as the number of objects in the dataset grows. The reason is that the size of the M-tree increases with N, which forces the algorithms to visit more entries when computing the combinatorial metric skyline. kCMS incurs more CPU time and node accesses than kCMS + SR because kCMS traverses the M-tree repeatedly. There are some exceptions in the range [40 K, 50 K] on Sig, probably due to the characteristics of the data. Fig. 13 also presents the MH of kCMS + SR with respect to N at the bottom of the corresponding bars. As expected, the size of MH grows as N increases.
Fig. 13. Distance computation vs. dataset size N (k = 16, m = 2). (a) Col(3D); (b) Sig(64D); (c) Cor(3D); (d) Ind(3D).
The reason is that a bigger dataset has more combinatorial metric skyline objects, which results in a bigger MH. This is not always the case, however, and kCMS + SR remains very efficient even on the bigger datasets. In summary, the experimental results show that kCMS + SR always performs best and achieves relatively consistent performance in all cases.

The effect of parameter m. Figs. 14 and 15 present the query performance as a function of the parameter m over the 3D datasets Col (12 K), Cor (64 K), and Ind (64 K) and the 64D dataset Sig (10 K), with k = 16. From the figures, we see that the CPU time increases slightly with m. The reason is that the larger m is, the more objects need to be considered to generate the candidate combinations; in fact, the number of combinations increases at an exponential rate. Thus, both algorithms must evaluate more candidate combinations to obtain the first k query results. However, the number of node accesses (NA) and the number of distance computations do not increase with m in most cases; sometimes a bigger m even leads to fewer node accesses, e.g., m = 3. This is because both kCMS and kCMS + SR adopt the ICS algorithm and generate the candidate combinations incrementally according to the aggregate distance from the query point, which accelerates the discovery of the initial query results. Moreover, our proposed ES, TP, and SP techniques provide strong pruning power, so many candidate combinations are pruned before they are evaluated. Another important reason is that a bigger m generally yields more qualified combinations according to Corollary 5. Therefore, our proposed algorithms sometimes achieve better performance for larger m, which indicates their high pruning capability and confirms the good scalability of our approaches with respect to m.
Fig. 14. CPU time and node accesses vs. m (k = 16). (a) Col(12K, 3D); (b) Sig(10K, 64D); (c) Cor(64K, 3D); (d) Ind(64K, 3D).
Fig. 15. Distance computation vs. m (k = 16). (a) Col(12K, 3D); (b) Sig(10K, 64D); (c) Cor(64K, 3D); (d) Ind(64K, 3D).
7. Conclusions and future work

In this paper, we have proposed a novel type of skyline query, namely the kCMS (top-k combinatorial metric skyline) query, which can be adopted in various applications such as business data analysis and decision making. To answer kCMS queries efficiently, we designed two algorithms, kCMS and kCMS + SR, which combine the advantages of a series of techniques, including early stopping, triangle-based pruning, spatial pruning, and the reuse heap. In the future, we plan to extend our algorithms to tackle the so-called bichromatic kCMS query, which involves two datasets. In addition, we are also interested in studying other variants of the combinatorial metric skyline query, such as the constrained combinatorial metric skyline, the combinatorial metric skyline based on clustering, and the combinatorial metric skyline with respect to both metric and non-metric attributes.

Acknowledgments

Bin Zhang was supported in part by ZJNSF Grant LY14F020038. Yunjun Gao was supported in part by NSFC Grants 61379033 and 61003049, the National Key Basic Research and Development Program (i.e., 973 Program) No. 2015CB352502, the Cyber Innovation Joint Research Center of Zhejiang University, and the Key Project of Zhejiang University Excellent Young Teacher Fund (Zijin Plan).

References

[1] I. Bartolini, P. Ciaccia, M. Patella, Efficient sort-based skyline evaluation, ACM Trans. Database Syst. 33 (4) (2008) 1–49.
[2] S. Börzsönyi, D. Kossmann, K. Stocker, The skyline operator, in: 17th Int'l Conf. on Data Engineering, 2–6 April, IEEE Computer Society, Heidelberg, Los Alamitos, 2001, pp. 421–430.
[3] L. Chen, X. Lian, Efficient processing of metric skyline queries, IEEE Trans. Knowl. Data Eng. 21 (3) (2009) 351–365.
[4] Y.-C. Chuang, I.-F. Su, C. Lee, Efficient computation of combinatorial skyline queries, Inform. Syst. 38 (3) (2013) 369–387.
[5] P. Ciaccia, M. Patella, P. Zezula, M-tree: an efficient access method for similarity search in metric spaces, in: 23rd Int'l Conf. on Very Large Data Bases, 25–29 August, Morgan Kaufmann, Athens, San Francisco, 1997, pp. 426–435.
[6] E. Dellis, B. Seeger, Efficient computation of reverse skyline queries, in: 33rd Int'l Conf. on Very Large Data Bases, 23–27 September, ACM, Vienna, New York, 2007, pp. 291–302.
[7] M. Drosou, E. Pitoura, Search result diversification, SIGMOD Rec. 39 (1) (2010) 41–47.
[8] D. Fuhry, R. Jin, D. Zhang, Efficient skyline computation in metric space, in: 12th Int'l Conf. on Extending Database Technology, 24–26 March, ACM, Saint Petersburg, New York, 2009, pp. 1042–1051.
[9] Y. Gao, Q. Liu, B. Zheng, G. Chen, On efficient reverse skyline query processing, Expert Syst. Appl. 41 (7) (2014) 3237–3249.
[10] S. Gollapudi, A. Sharma, An axiomatic approach for result diversification, in: Proceedings of the WWW Conference, Madrid, Spain, 2009, pp. 381–390.
[11] A. Guttman, R-trees: a dynamic index structure for spatial searching, in: 1984 ACM SIGMOD Int'l Conf. on Management of Data, 18–21 June, ACM, Boston, New York, 1984, pp. 47–57.
[12] X. Guo, C. Xiao, Y. Ishikawa, Combination skyline queries, in: Transactions on Large-Scale Data- and Knowledge-Centered Systems VI, LNCS 7600, 2012, pp. 1–30.
[13] Z. Huang, Y. Xiang, B. Zhang, X. Liu, A clustering based approach for skyline diversity, Expert Syst. Appl. 38 (7) (2011) 7984–7993.
[14] H. Im, S. Park, Group skyline computation, Inform. Sci. 188 (2012) 151–169.
[15] T. Jiang, Y. Gao, B. Zhang, D. Lin, Q. Li, Monochromatic and bichromatic mutual skyline queries, Expert Syst. Appl. 41 (4) (2014) 1885–1900.
[16] B. Jiang, J. Pei, X. Lin, D.W. Cheung, J. Han, Mining preferences from superior and inferior examples, in: 14th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, 24–27 August, ACM, Las Vegas, New York, 2008, pp. 390–398.
[17] D. Kossmann, F. Ramsak, S. Rost, Shooting stars in the sky: an online algorithm for skyline queries, in: 28th Int'l Conf. on Very Large Data Bases, 20–23 August, Hong Kong, 2002, pp. 275–286.
[18] X. Lian, L. Chen, Reverse skyline search in uncertain databases, ACM Trans. Database Syst. 35 (1) (2010). Article 3.
[19] K.C.K. Lee, W.-C. Lee, B. Zheng, H. Li, Y. Tian, Z-SKY: an efficient skyline query processing framework based on Z-order, VLDB J. 19 (3) (2010) 333–362.
[20] C. Li, B.C. Ooi, A.K.H. Tung, S. Wang, DADA: a data cube for dominant relationship analysis, in: ACM SIGMOD Int'l Conf. on Management of Data, 27–29 June, ACM, Chicago, New York, 2006, pp. 659–670.
[21] M. Magnani, I. Assent, From stars to galaxies: skyline queries on aggregate data, in: Proceedings of the ACM EDBT Conference, March 18–22, Genoa, Italy, 2013, pp. 477–488.
[22] D. Mindolin, J. Chomicki, Discovering relative importance of skyline attributes, Proc. VLDB Endowment 2 (1) (2009) 610–621.
[23] D. Papadias, Y. Tao, G. Fu, B. Seeger, Progressive skyline computation in database systems, ACM Trans. Database Syst. 30 (1) (2005) 41–82.
[24] J. Pei, B. Jiang, X. Lin, Y. Yuan, Probabilistic skylines on uncertain data, in: 33rd Int'l Conf. on Very Large Data Bases, 23–27 September, ACM, Vienna, New York, 2007, pp. 15–26.
[25] I.-F. Su, Y.-C. Chuang, C. Lee, Top-k combinatorial skyline queries, in: 15th International Conference on Database Systems for Advanced Applications, April, Springer, Tsukuba, Japan, 2010, pp. 79–93.
[26] W. Son, S.-W. Hwang, H.-K. Ahn, MSSQ: Manhattan spatial skyline queries, Inform. Syst. 40 (2014) 67–83, http://dx.doi.org/10.1016/j.is.2013.10.001.
[27] W. Son, M.-W. Lee, H.-K. Ahn, S.-W. Hwang, Spatial skyline queries: an efficient geometric algorithm, in: 11th International Symposium on Spatial and Temporal Databases, 8–10 July, Aalborg, Denmark, 2009, pp. 247–264.
[28] M. Sharifzadeh, C. Shahabi, The spatial skyline queries, in: 32nd Int'l Conf. on Very Large Data Bases, 12–15 September, ACM, Seoul, New York, 2006, pp. 751–762.
[29] M. Sharifzadeh, C. Shahabi, L. Kazemi, Processing spatial skyline queries in both vector spaces and spatial network databases, ACM Trans. Database Syst. 34 (3) (2009). Article 14.
[30] Y. Tao, L. Ding, X. Lin, J. Pei, Distance-based representative skyline, in: Proceedings of the IEEE ICDE Conference, March 29, IEEE Computer Society, Shanghai, China, 2009, pp. 892–903.
[31] Y. Tao, X. Xiao, J. Pei, SUBSKY: efficient computation of skylines in subspaces, in: 22nd Int'l Conf. on Data Engineering, 3–8 April, IEEE Computer Society, Atlanta, Los Alamitos, 2006, pp. 65–65.
[32] G. Valkanas, A.N. Papadopoulos, D. Gunopulos, SkyDiver: a framework for skyline diversification, in: Proceedings of the ACM EDBT Conference, March 18–22, Genoa, Italy, 2013, pp. 406–417.
[33] G. Wang, J. Xin, L. Chen, Y. Liu, Energy-efficient reverse skyline query processing over wireless sensor networks, IEEE Trans. Knowl. Data Eng. 24 (7) (2012) 1259–1275.
[34] Y. Yuan, X. Lin, Q. Liu, W. Wang, J.X. Yu, Q. Zhang, Efficient computation of the skyline cube, in: 31st Int'l Conf. on Very Large Data Bases, August 30–September 2, ACM, Trondheim, New York, 2005, pp. 241–252.
[35] W. Zhang, X. Lin, Y. Zhang, Threshold-based probabilistic top-k dominating query, VLDB J. 19 (2) (2010) 283–305.