A cost-efficient framework for finding prospective customers based on reverse skyline queries


Accepted Manuscript

A Cost-Efficient Framework for Finding Prospective Customers Based on Reverse Skyline Queries

Bo Yin, Ke Gu, Xuetao Wei, Siwang Zhou, Yonghe Liu

PII: S0950-7051(18)30179-5
DOI: 10.1016/j.knosys.2018.04.011
Reference: KNOSYS 4293

To appear in: Knowledge-Based Systems

Received date: 23 November 2017
Revised date: 8 April 2018
Accepted date: 9 April 2018

Please cite this article as: Bo Yin, Ke Gu, Xuetao Wei, Siwang Zhou, Yonghe Liu, A Cost-Efficient Framework for Finding Prospective Customers Based on Reverse Skyline Queries, Knowledge-Based Systems (2018), doi: 10.1016/j.knosys.2018.04.011

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


[Title Page]


A Cost-Efficient Framework for Finding Prospective Customers Based on Reverse Skyline Queries

Bo Yin a,b,∗, Ke Gu a,b, Xuetao Wei c,∗, Siwang Zhou d, Yonghe Liu e

a School of Computer and Communication Engineering, ChangSha University Of Science and Technology, Changsha 410114, China
b Hunan Provincial Key Laboratory of Intelligent Processing of Big Data on Transportation, Changsha 410114, China
c School of Information Technology and Department of Electrical Engineering and Computing Systems, University of Cincinnati, Cincinnati OH 45221, USA
d College of Computer Science and Electronic Engineering, Hunan University, Changsha 410114, China
e Department of Computer Science and Engineering, University of Texas at Arlington, Arlington TX 76019, USA

Corresponding author name: Bo Yin
Affiliation: School of Computer and Communication Engineering, ChangSha University Of Science and Technology, Changsha 410114, China
Email address: [email protected]
Telephone number: +86 137 55005469


A cost-efficient framework for finding prospective customers based on reverse skyline queries


Bo Yin a,b,∗, Ke Gu a,b, Xuetao Wei c,∗, Siwang Zhou d, Yonghe Liu e

a ChangSha University Of Science and Technology, China
b Hunan Provincial Key Laboratory of Intelligent Processing of Big Data on Transportation, China
c University of Cincinnati, USA
d Hunan University, China
e University of Texas at Arlington, USA

Abstract


Analyzing customers' information for marketing insights is critical for a company that wants to remain competitive in increasingly fierce market competition. Rank-based and dominance-based customer selection are widely used in market analysis to find prospective customers. However, rank-based solutions require companies to define weight vectors, which is not practical in real applications, and dominance-based approaches (e.g., reverse skyline queries) suffer from high overhead and poor progressiveness. In this paper, we propose a cost-efficient framework to find prospective customers for a target product. We first target the most prospective customers, for whom no better product exists compared with the target product. We formulate the problem based on reverse skyline queries, and propose a new algorithm that significantly reduces the query cost by pruning unqualified customers without any false positive, and that identifies reverse skyline points as early as possible based on the decision region and the effect region. We then further target arbitrary k prospective customers and formulate the problem as the top-k reverse skyline (Top-k RS) query. We extend the notion of reverse skyline to the reverse skyline order in order to support arbitrary k and group-based promotions. We evaluate our framework with extensive experiments, and our results demonstrate that it delivers promising results for reverse skyline queries and can efficiently support top-k RS queries.

∗ Corresponding author
Email addresses: [email protected] (Bo Yin), [email protected] (Ke Gu), [email protected] (Xuetao Wei), [email protected] (Siwang Zhou), [email protected] (Yonghe Liu)

Preprint submitted to Knowledge-Based Systems, April 10, 2018

Keywords: Prospective customers, Competitive products, Cost-efficient, Reverse skyline

1. Introduction


The phenomenon of the Internet of Things (IoT) fuses the digital and physical worlds by connecting a proliferation of smart devices to the Internet [1–3]. Business Insider Intelligence predicts that the number of IoT devices will hit 34 billion by 2020 [4]. The application of IoT to businesses makes customers' information huge and full of variety. It is critical for a company to gain insights from customers' information in order to remain competitive in increasingly fierce market competition [5–9]. A typical application is to find the prospective customers who are most interested in a target product from a large collection of customers. This enables the company to utilize its limited marketing resources more efficiently. From the perspective of product competition, a customer's interest in the target product is influenced by the existence of competitive products of the same type. Specifically, the target product attracts a customer most if there is no better product for that customer. These attracted customers have a high probability of buying the target product. As a result, companies have a strong desire to identify these attracted customers, namely, the most prospective customers.

Rank-based approaches [10–14] and dominance-based approaches [15–27] are the two major approaches to the customer selection problem. The rank-based approach utilizes a preference function, in which weights represent the relative importance of different product attributes, to quantitatively express the interests of a customer towards a certain type of product. The highly ranked products are more attractive to the customer. Thus, the potential market of a product can be modeled as the reverse top-k query result. However, this approach suffers from the following limitations: (i) it requires the specification of a set of weights in the preference function and a ranking threshold k, which are difficult to obtain in many cases; (ii) it fails to provide an overall view of the customers. To address these concerns, the dominance-based approach, e.g., the reverse skyline query, measures the "attractiveness" of a product by dominance tests. The non-dominated products are more attractive to the customer (a product pi dynamically dominates a product pj with respect to a customer if pi is better in at least one attribute and comparable in the rest of the attributes).

Towards this direction, reverse skyline queries [15] retrieve the set of customers whose non-dominated sets contain the target product. The reverse skyline provides a good overview of potential customers from the perspective of companies. In this paper, we capitalize on reverse skyline queries to define the most prospective customers as the reverse skyline result set. For example, consider a laptop company that wants to determine the potential market of a new laptop product. Figure 1 shows the preferences of customers on laptops as points in the data space. We consider two attributes of a laptop, the price and the heat emission. The new laptop product is specified as a query point, say p. In Figure 1(a), each customer c (represented by the product that c has preferred) in the original two-dimensional space is transformed to a new customer c′ using the mapping function c′[m] = |c[m] − c1[m]|. The target laptop p is a skyline point with respect to c1; hence, c1 would be interested in the target laptop p. The same holds for customers c3 and c8, both of which are in the reverse skyline with respect to p. As a result, the company might offer laptop p to customers with preferences c1, c3 and c8.

Although the reverse skyline can be used to find the most prospective customers, the reverse skyline query is complex, heavyweight and time-consuming. The major challenge is that data points (customers) that do not belong to the reverse skyline set cannot simply be discarded during query processing; otherwise, false positives may occur. Therefore, a great number of dominance tests are required to identify a reverse skyline point. Prior works [15, 16] used a filter-refinement strategy, which first computes a set of global skyline points as candidate reverse skyline points, and then verifies each candidate based on precomputed dynamic skyline sets and window queries. However, the prior works have the following problems: they incur expensive I/O costs by traversing the R-tree repeatedly during refinement; they cannot report any reverse skyline point in the filter step, which results in poor progressiveness; and they precompute the dynamic skyline for each point in C, which makes them not cost-efficient for a single query and limits them to query-intensive applications. There are also several works designed for reverse skyline variants and different settings [17–27], which differ from our focus in this paper. Therefore, a more cost-efficient approach is desired to conduct reverse skyline queries for finding the most prospective customers.

Another issue with the reverse skyline is that its cardinality is fixed for a given dataset of customer preferences and a target product.

The fixed number of the most prospective customers is not flexible in practice. Companies may need the flexibility to identify an arbitrary number k of prospective customers. For example, in an online hotel booking application, the company may adjust the number of prospective customers to whom it will advertise according to the number of vacant rooms. Specifically, there may be many vacant rooms that have not been booked in the off-season; to attract more tourists, the company specifies a large k value. On the other hand, the booking of rooms reaches its peak during the peak season; then, the best marketing strategy is to select a subset of the reverse skyline and advertise to the customers in this subset. Furthermore, in the example of marketing a new laptop, in order to promote the product's impact and capture the market quickly, the company is always willing to advertise to more customers. Therefore, only finding the most prospective customers (e.g., three in Figure 1) may not be sufficient, and a larger customer set is desired in this case. With a large number k of prospective customers, the company may need to further group the k customers for better management. The customers can be grouped according to their interest in the target product, such that different promotion strategies can be applied to different groups.

In this paper, we study the problem of finding prospective customers for a target product. We target the most prospective customers first, and then arbitrary k customers, for marketing and for applying varied promotions to different groups. We first propose a new algorithm, OneTraversal, to find the most prospective customers based on reverse skyline queries. We propose effective pruning techniques, based on the concept of the effect region, that prune away data points that are not in the reverse skyline while avoiding false positives. We also propose effective techniques to identify reverse skyline points as early as possible. By integrating these techniques, our OneTraversal has the appealing feature of computing the reverse skyline based only on a small subset of customers, without preprocessing and refinement. Then, we formulate the problem of finding arbitrary k prospective customers as the top-k reverse skyline (Top-k RS) query, which ranks customers based on the notion of the reverse skyline order in order to support arbitrary k as well as group-based promotions. We assign each customer an order in the whole customer set, which we call the reverse skyline order, by integrating both inter-group and intra-group sorting. Specifically, following the view of product competition, we first group customers and sort the groups according to their interest in the target product (i.e., the number of better products).

Afterwards, for customers in the same group, we sort them according to the selectivity probability, which considers the similarity of the target product to both the customer's preferences and the better products. We extend our OneTraversal to support top-k RS queries, and propose a preprocessing-based algorithm which presorts customers based on the reverse skyline order to handle multiple k. Our contributions can be summarized as follows:

• We propose a cost-efficient framework to find prospective customers for a target product. We first target the most prospective customers, for whom no better product exists compared with the target product. We then further target arbitrary k prospective customers.


• We propose a new progressive algorithm OneTraversal to find the most prospective customers based on the reverse skyline queries. We introduce the concepts of decision region and effect region. By using effect region, our OneTraversal significantly reduces the query cost by pruning unqualified customers without any false positive. By using decision region and a distance threshold, our OneTraversal progressively returns query results and identifies reverse skyline points as early as possible.


• We formulate the problem of finding k prospective customers as a top-k RS query problem that returns the top-k customers based on the reverse skyline order. We first group and rank customers according to the number of competitive products. We then rank customers in the same group by using the selectivity probability function. The query results support arbitrary k and group-based promotions. We also propose effective schemes to improve the query efficiency based on the reverse skyline order.


• We conduct extensive experiments using both synthetic and real datasets. Our results demonstrate that our algorithms achieve promising results for reverse skyline queries and can efficiently support top-k RS queries.

The remainder of the paper is organized as follows. Section 2 surveys the related work. Section 3 presents the background and preliminaries. Section 4 proposes our algorithm OneTraversal and proves its superiority over previous work. Section 5 formalizes the problem of finding the k most prospective customers and presents algorithms for solving it efficiently. The performance evaluation of the proposed algorithms is reported in Section 6. Finally, we conclude the paper in Section 7.

2. Related work


Rank-based customer/product selection. Rank-based approaches quantify the importance or impact of an object (i.e., a customer or a product) by using a weight vector over the object's attributes. Each object is assigned a score based on the weight vector, and objects that score higher are more important. As a result, customer/product selection problems are modeled as top-k [28] or reverse top-k [10–14] queries. Xu et al. [28] proposed to select m products according to product adoption models, in order to maximize the sales of the company. Vlachou et al. introduced the concept of the reverse top-k query in [10], and studied the problem of retrieving the m most influential objects by defining the influence of an object as the cardinality of its reverse top-k set in [11]. The top m products are selected according to the individual influences of products in [11]. In contrast, [12] considered the joint influence of products and selects the m products that maximize the total number of customers in their reverse top-k sets. Other criteria are used in [13, 14], e.g., diversity [13, 14] and coverage [14]. Rank-based solutions require customers to define the weight vectors, which is not practical in real applications.

Dominance-based customer/product selection. Dominance-based approaches evaluate the importance of an object from the perspective of multiple criteria. Skyline queries [29], also known as the maximum vector problem [30], have been widely used for preference queries based on domination operations. Skyline queries aim to find attractive products (from the customer's view) in a given product set, i.e., products that are not worse than (dominated by) any other product. Li et al. [31] studied how to position a product by capitalizing on the dominance relationships between products and potential buyers. A linear optimization query is proposed to model constraints as a linear plane that is anti-correlated with regard to all dimensions; a single product is returned for the company in [31]. Differently, given a set of candidate products, a set of existing products, and user requirements on each attribute, the problem of assisting companies to select k products from the candidate products has been studied in [32–34]. A product is attractive if it is not dominated by any existing product and it satisfies customers' requirements.

Specifically, Lin et al. [32] selected k products to maximize the expected total number of customers of the k products. Peng et al. [33] considered two criteria, i.e., the total profit and the number of customers. Xu et al. [34] derived a distance-based adoption model and aimed to select k products to maximize the expected market share. Zhou et al. [35] selected the top k products based on the favorite probability. In [36–39], the problem of finding the k most "attractive" products is modeled as k-representative skyline queries. The k points are selected from the skyline set with respect to various optimization objectives, e.g., the number of distinct dominated points [36], the user selection probability [37], diversity [38, 39], and significance [39]. The models for selecting k skyline points typically reduce to the maximum coverage problem, and approximate approaches are designed since the problem is NP-hard for higher-dimensional datasets [32–34, 36–39].

Reverse skyline queries. The reverse skyline query is a dominance-based query operator that focuses on customer selection (i.e., retrieving customers who would find the target product "attractive", from the company's view). Dellis and Seeger [15] introduced the reverse skyline query and proposed a filter-refinement algorithm which first computes the global skyline as the set of candidate reverse skyline points, and then conducts refinement processing to verify whether those candidates are reverse skyline points (Section 3.2). The dynamic skyline (precomputed and kept on disk) and window queries are used for refinement. Gao et al. [16] proposed a reuse technique to improve the efficiency of [15]: the R-tree nodes that have been visited are stored in memory for candidate refinement, thereby avoiding multiple traversals of the R-tree. Nevertheless, the proposed algorithms also need to precompute the dynamic skyline and follow the filter-refinement framework of [15]. Works [15, 16] focused on the monochromatic scenario. Differently, bichromatic reverse skyline queries are studied in [17–20], where two datasets are used to store customer and company data, respectively. The main idea is to identify reverse skyline points using the midpoints of quadrant skylines: the R-tree nodes that are dominated by the midpoints cannot be (or contain) reverse skyline points. As the (midpoints of the) quadrant skylines are updated based on visited nodes during query processing, the dominated nodes can be safely pruned. Heuristics are proposed to optimize the processing order of the indexed R-trees to minimize the I/O cost and prune nodes early. Our pruning techniques are quite different from midpoint-based pruning: we prune nodes based on the effect region of the nodes and do not need to keep the information of pruned nodes. In addition to bichromatic reverse skyline queries, many other reverse skyline variants have also been proposed.

Specifically, Arvanitis et al. [19] also addressed the problem of selecting the k most attractive candidates (K-MAC) to maximize the overall number of reverse skyline points of the k selected products. The problem reduces to the maximum k-coverage problem, and a greedy algorithm is proposed to return approximate results. Islam et al. [21] addressed the problem of answering why-not questions in reverse skyline queries. Gao et al. [22] studied the causality and responsibility problem for the non-answers to probabilistic reverse skyline queries. The solutions of [17–22] are extended versions of [15] designed for reverse skyline variants, and thus they are not applicable to our problem. Moreover, Islam and Liu [23] studied the problem of finding the k most promising products to maximize the market contribution; a probability is associated with product adoption among customers based on dynamic dominance and reverse skylines. Deshpande and Deepak [24] considered non-metric attribute domains and proposed non-indexed algorithms which exploit block-based processing and preprocessing to reduce the I/O costs and speed up the computation. Reverse skyline queries in wireless sensor networks are studied in [25]. Park et al. [26] and Islam et al. [27] studied parallel reverse skyline queries on shared-nothing clusters: they build a quad-tree for data partitioning, and reduce the search space by extending the idea of midpoint-based pruning [17–20].

In this work, we focus on market analysis from the view of the company. We study two specific customer searching problems: (i) finding the most prospective customers, and (ii) finding arbitrary k prospective customers. With respect to the first problem, rank-based approaches [10–14] need to specify weight vectors, which is not practical for real applications. Dominance-based approaches [15–27] retrieve prospective customers based on reverse skyline queries. However, these approaches either suffer from high overhead and poor progressiveness for reverse skyline queries, or are designed for reverse skyline variants and different settings, which differ from our focus in this paper. Our proposed algorithm has the appealing feature of computing the reverse skyline based only on a small subset of customers, which is proved to be no larger than the global skyline set, without refinement or preprocessing. With respect to the second problem of finding arbitrary k prospective customers, the approaches of [32–39] focus on finding interesting products (thereby using skyline queries), and the parameter k is constrained to be smaller than the cardinality of the skyline set. Highlighting the perspective of product competition, we study the interests of customers in a target product based on the number of competitive products, in order to support arbitrary k and group-based promotion.

3. Background and preliminaries


3.1. Reverse skyline
We first introduce the notion of the dynamic skyline [40], which finds interesting products from the perspective of customers. Let P denote a set of products of a certain type. We use d numerical attributes to describe the quality of a product on various aspects. The i-th attribute value of a product is denoted by p[i]. Consequently, the quality of a product p ∈ P is represented by a multi-dimensional point p = <p[1], p[2], . . . , p[d]>. Let c denote a customer demanding a product of this type. The preferences of c are represented by a multi-dimensional point c = <c[1], c[2], . . . , c[d]>.


Definition 1 (Dynamic skyline). (from [40]) Given a dataset of products P , the dynamic skyline with respect to a customer c, denoted as DSKP (c), contains all products in P that are not dynamically dominated. Specifically, a product p1 ∈ P dynamically dominates p2 ∈ P with respect to c, if it holds that: (1) |p1 [m] − c[m]| ≤ |p2 [m] − c[m]| for all dimension m, and (2) |p1 [k] − c[k]| < |p2 [k] − c[k]| for at least one dimension k.
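For concreteness, the dynamic dominance test of Definition 1 can be transcribed directly into code. The following Python sketch (the function and variable names are ours, not the paper's, and the toy data are purely illustrative) checks whether a product p1 dynamically dominates a product p2 with respect to a customer c, and uses it to compute the dynamic skyline of a small product set by brute force.

```python
def dynamically_dominates(p1, p2, c):
    """True if p1 dynamically dominates p2 with respect to customer c (Definition 1)."""
    dims = range(len(c))
    no_worse = all(abs(p1[m] - c[m]) <= abs(p2[m] - c[m]) for m in dims)
    strictly_better = any(abs(p1[m] - c[m]) < abs(p2[m] - c[m]) for m in dims)
    return no_worse and strictly_better

def dynamic_skyline(P, c):
    """Brute-force dynamic skyline of product set P with respect to customer c."""
    return [p for p in P if not any(dynamically_dominates(q, p, c) for q in P if q != p)]

# Toy data in the (price, heat emission) space of the running example.
products = [(3.0, 2.0), (2.0, 4.0), (5.0, 1.0), (4.0, 4.0)]
customer = (3.5, 2.5)
print(dynamic_skyline(products, customer))
```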


We next introduce the notion of the reverse skyline [15], which finds the most prospective customers from the perspective of companies. We consider the monochromatic reverse skyline for a given dataset of customers C and a target product p. Although the preferences of customers for products of a certain type can be collected from purchase logs, online search records, and questionnaires, a direct way to describe the preferences of a customer is to use the parameters of the product he/she has purchased or searched for online. Hence, each customer c ∈ C also stands for a competitive product of the target product p. We can use dynamic dominance to find the interesting products among C and p for c. Then, if p is a dynamic skyline point with respect to c, c is a prospective customer of p.

Definition 2 (Reverse skyline). (from [15]) Given a dataset of customers C and a target product p as the query point, the reverse skyline with respect to p, denoted as RSKC(p), contains all customers c ∈ C such that p is a dynamic skyline point with respect to c, i.e., p ∈ DSKC(c).
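Definition 2 immediately gives a quadratic-time reference implementation: a customer c is in the reverse skyline of p exactly when no other customer's preferred product dynamically dominates p with respect to c. The sketch below (our own names; the data are illustrative) is only a baseline for intuition and testing, not one of the paper's algorithms.

```python
def dyn_dominates(p1, p2, c):
    """p1 dynamically dominates p2 with respect to customer c (Definition 1)."""
    dims = range(len(c))
    return (all(abs(p1[m] - c[m]) <= abs(p2[m] - c[m]) for m in dims)
            and any(abs(p1[m] - c[m]) < abs(p2[m] - c[m]) for m in dims))

def reverse_skyline_bruteforce(C, p):
    """Customers c in C for which p is not dynamically dominated by any other customer's product."""
    result = []
    for c in C:
        competitors = [o for o in C if o != c]
        if not any(dyn_dominates(o, p, c) for o in competitors):
            result.append(c)
    return result

customers = [(1.0, 3.0), (2.0, 1.0), (4.0, 4.0), (0.5, 5.0)]
target = (2.0, 2.5)
print(reverse_skyline_bruteforce(customers, target))
```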

Figure 1: Dynamic skyline and reverse skyline. (a) Dynamic skyline; (b) Reverse skyline. Both panels plot price (horizontal axis) versus heat emission (vertical axis).

Consider the dataset given in Figure 1(a). The dynamic skyline of c1 is {p, c2, c3, c8}. Hence, customer c1 would be interested in p and in the products that customers c2, c3 and c8 have preferred. According to Definition 2, c1 is a reverse skyline point of p. Figure 1(b) illustrates the reverse skyline of p, which includes c1, c3, and c8.

Definition 3 (Global Skyline). (from [15]) Given a dataset of customers C, the global skyline with respect to the target product p, denoted as GSKC (p), contains all customers that are not globally dominated. Specifically, a customer c1 ∈ C globally dominates another c2 ∈ C with respect to p, if (1) |c1 [m] − p[m]| ≤ |c2 [m] − p[m]| and (c1 [m] − p[m])(c2 [m] − p[m]) > 0 for all dimension m, and (2) |c1 [k] − p[k]| < |c2 [k] − p[k]| for at least one dimension k.
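The global dominance test of Definition 3 differs from dynamic dominance only in the additional same-quadrant condition. A direct transcription (again a hedged sketch with our own names, following the conditions exactly as stated) looks as follows.

```python
def globally_dominates(c1, c2, p):
    """True if customer c1 globally dominates customer c2 with respect to query point p (Definition 3)."""
    dims = range(len(p))
    same_quadrant = all((c1[m] - p[m]) * (c2[m] - p[m]) > 0 for m in dims)  # condition (1), second part
    no_worse = all(abs(c1[m] - p[m]) <= abs(c2[m] - p[m]) for m in dims)     # condition (1), first part
    strictly_better = any(abs(c1[m] - p[m]) < abs(c2[m] - p[m]) for m in dims)  # condition (2)
    return same_quadrant and no_worse and strictly_better

def global_skyline(C, p):
    """Brute-force global skyline: customers not globally dominated by any other customer."""
    return [c for c in C if not any(globally_dominates(o, c, p) for o in C if o != c)]
```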

The global skyline can be used as the candidate set of the reverse skyline; in this way, the search space can be reduced [15]. Figure 2(a) illustrates the global skyline with respect to p, which includes points c1, c3, c5, c7, c8, and c9. Table 1 provides an overview of the most basic symbols used in this paper. In the rest of the paper, we omit the dataset C and the query p in the symbols (e.g., RSKC(p) and GSKC(p)) when they are clear from the context or not important for the discussion.

Figure 2: Global skyline and an example of the RSSA algorithm. (a) Global skyline; (b) Refinement step, showing the DDR(c1) and DADR(c1) regions and the query window of c1.


3.2. The RSSA algorithm
There are several approaches designed for reverse skyline queries in different settings, and many of them are based on RSSA [15]. In the following, we describe the classic reverse skyline query algorithm RSSA in more detail. Suppose that the dataset C is indexed by an R-tree. In a preprocessing step, RSSA computes the dynamic skyline for each customer c ∈ C and maintains an approximate version on disk. Based on the approximate dynamic skyline, two regions are computed for later refinement: the DDR region contains the points dominated by at least one dynamic skyline point, while the DADR region contains the points dominating some dynamic skyline point. If the query point p falls in the DDR region of c, c cannot belong to the reverse skyline of p. On the contrary, if p is inside the DADR region, c must be a reverse skyline point. Figure 2(b) illustrates the DDR and DADR regions of customer c1 using shaded areas.

When a reverse skyline query is initiated, RSSA first computes, in the filter step, the global skyline as the candidate set of the reverse skyline. Then, in the refinement step, the query point p is iteratively examined against the precomputed DDR and DADR regions of each candidate customer. Some candidate customers are thereby identified as reverse skyline points or non-reverse skyline points. The remaining candidates are subsequently examined further by a window query against the dataset C: if there exists no point in the query window, the candidate customer is determined to be a reverse skyline point.

Table 1: Frequent notations

Symbol     Interpretation
D          a set of points
q          a query point
p          a point of D
p[i]       the value of point p on dimension i
DSKD(q)    the dynamic skyline of D with respect to q
RSKD(q)    the reverse skyline of D with respect to q
GSKD(q)    the global skyline of D with respect to q
SRSD(q)    the selective reverse skyline of D with respect to q
DR(p)      the decision region of p
ER(p)      the effect region of p


Consider the global skyline point c1 in Figure 2. Since the query point p is neither in the DDR nor in the DADR region of c1, a refinement step is issued. The slashed grid in Figure 2(b) shows the query window of c1. Because no point of C is in the query window, c1 is output as a reverse skyline point.

In the following, we present our framework for the efficient computation of the reverse skyline and the top-k reverse skyline. We first present our OneTraversal algorithm for reverse skyline queries, which significantly reduces the I/O cost and achieves better progressiveness than RSSA, in Section 4. We then present algorithms to support top-k reverse skyline queries efficiently based on the reverse skyline order, in Section 5.

4. OneTraversal: efficient processing of reverse skyline queries


In this section, we first detail the problems of the RSSA algorithm and introduce the concept of the decision region, which is used to determine whether a customer is a reverse skyline point or not. We then present a series of pruning rules based on the effect region that avoid false positives. Finally, we present a more efficient pruning-based reverse skyline algorithm, termed OneTraversal.

4.1. Problems of RSSA
Firstly, the RSSA algorithm cannot compute the reverse skyline using only the candidate set (i.e., the global skyline set). RSSA needs to perform the refinement processing to check each candidate customer to obtain the query results. Hence, RSSA has to traverse the R-tree over the dataset C repeatedly (computing the candidate set and conducting refinements). Let |GSK| be the cardinality of the global skyline set GSK. Then, the number of traversals is O(|GSK|) in the worst case. When C is a uniformly distributed dataset, we have O(|GSK|) ≈ O(2^d (ln|C| − d·ln 2)^(d−1) / d!) ≈ O(2^d ln^(d−1)|C| / d!) [41]. Consequently, repeated traversals incur prohibitively expensive I/O costs, especially for larger datasets or higher-dimensional data.

Secondly, RSSA cannot report any reverse skyline point during the computation of the candidate customers. In order to retrieve the first reverse skyline point, several iterations of checking DDR and DADR regions may be required. In the worst case, no reverse skyline point is identified based on the DDR and DADR regions, and many window queries are further conducted, which results in poor progressiveness.

Finally, RSSA precomputes and then maintains the dynamic skyline for each customer of C. This preprocessing is an expensive operation. It makes RSSA not cost-efficient for a single query and confines it to query-intensive applications.
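To get a feel for how quickly this candidate set grows, one can simply evaluate the simplified estimate 2^d ln^(d−1)|C| / d! for a few dataset sizes and dimensionalities. The numbers below are purely illustrative of the asymptotic formula, not measurements, and the helper name is our own.

```python
import math

def expected_global_skyline_size(n, d):
    """Rough estimate 2^d * ln(n)^(d-1) / d! of the global skyline size for a uniform dataset (Section 4.1)."""
    return (2 ** d) * (math.log(n) ** (d - 1)) / math.factorial(d)

for n in (10_000, 100_000, 1_000_000):
    for d in (2, 3, 4, 5):
        print(f"|C|={n:>9}, d={d}: ~{expected_global_skyline_size(n, d):,.0f} candidates")
```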


4.2. Decision region
Given a query point p, let us partition the d-dimensional space into 2^d quadrants with d orthogonal hyperplanes. For example, in Figure 3, we partition the data space into four quadrants using a horizontal and a vertical line. Consider the customer c1 in Figure 3. The decision region of c1, denoted as DR(c1), is the region whose points dynamically dominate p with respect to c1. To illustrate, consider the rectangle with p as a corner point and centered at c1 in Figure 3. The rectangle, excluding its corner points, is the decision region of c1. Note that the DR region does not contain the corners of the rectangle. This is because each corner point r (marked with "×") has the property that |r[m] − c1[m]| = |p[m] − c1[m]| on every dimension m; hence, r does not dynamically dominate p with respect to c1. Then, according to Definition 2, c1 belongs to the reverse skyline with respect to p if and only if there does not exist a customer in the DR(c1) region.

Lemma 1. A customer c1 belongs to RSK(p) iff there does not exist a customer c2 such that c2 is inside DR(c1).


Figure 3: Example of decision region


In the following, we give an example of the "false positive" problem that can arise when customers not in RSK are discarded during query processing, and analyze it using DR regions. Given three customers c1, c2 and c3, assume that c2 is inside DR(c1) and c1 is inside DR(c3). Unfortunately, it does not follow that c2 falls in DR(c3); Figure 3 shows an example of such a case. Suppose that we have accessed c1 and c2, and we discard c1 since c2 falls in DR(c1). Next, we access c3, and c3 will be recognized as a reverse skyline point, because no remaining customer is inside DR(c3) (c1 has already been discarded). In fact, c3 does not belong to the reverse skyline, since c1 is in DR(c3) (Lemma 1).
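The pitfall can be reproduced with three concrete points. The coordinates below are our own illustration, chosen to mimic the qualitative configuration of Figure 3 rather than taken from the paper: c2 lies in DR(c1) and c1 lies in DR(c3), yet c2 does not lie in DR(c3), so discarding c1 too early makes c3 look like a reverse skyline point.

```python
def in_DR(x, c, p):
    """x lies in DR(c): x dynamically dominates the query point p with respect to c."""
    dims = range(len(p))
    return (all(abs(x[m] - c[m]) <= abs(p[m] - c[m]) for m in dims)
            and any(abs(x[m] - c[m]) < abs(p[m] - c[m]) for m in dims))

p  = (0.0, 0.0)        # query point (target product)
c1 = (1.5, 3.0)
c2 = (2.5, 1.0)
c3 = (1.0, 4.0)

print(in_DR(c2, c1, p))   # True : c2 is inside DR(c1), so a naive method discards c1
print(in_DR(c1, c3, p))   # True : but c1 is inside DR(c3) ...
print(in_DR(c2, c3, p))   # False: ... and c2 is not, so c3 would wrongly be reported as a reverse skyline point
```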


4.3. Pruning rules
In this subsection, we discuss how to prune customers not in the reverse skyline without any false positive. We present two important pruning rules (Lemma 2 and a relaxed version, Lemma 3). Let us use SRSC(p) (abbr. SRS) to denote the subset of customers from C that are not pruned by Lemma 3 with respect to the query point p. We call SRS the selective reverse skyline. We then prove that SRS is sufficient to compute the reverse skyline correctly and that SRS ⊆ GSK (Lemma 4). Hence, the reverse skyline can be retrieved based on a subset of GSK, while avoiding the precomputation and refinement processing. This idea is integrated into our OneTraversal algorithm as described in Section 4.4.

Figure 4: Example of effect region, showing ER1(c), ER2(c), and the midpoint Mc between p and c.

In the following, we first introduce the concept of the effect region of a customer c.

Definition 4 (Effect region). Given a customer c, let S(c) denote the set of customers with respect to which c dynamically dominates the query point p. The effect region of c, denoted as ER(c), is defined as the hyperrectangle with c and p as diagonal corners, excluding the corners on the quadrant intersection planes, as shown in Figure 4; every customer located in the effect region dynamically dominates p with respect to any customer in S(c).


Obviously, customers in S(c) are not reverse skyline points as a result of c. Hence, if we delete c during query processing, false positives caused by this deletion could occur for customers in S(c). Let Mc be the midpoint between p and c. The dominance region of Mc (the lighter gray area, excluding the midpoint, in Figure 4) contains all customers in S(c). In order to avoid false positives, our main idea is to find other customers that help us identify the customers of S(c) as non-reverse skyline points. We use the effect region to capture this idea. Specifically, ∀o ∈ ER(c) and ∀s ∈ S(c), o dynamically dominates p with respect to s; that is, o ∈ DR(s). Hence, s can be recognized as a non-reverse skyline point based on o.

An issue that needs to be addressed is that o may also belong to S(c). In that case, o can be used to identify the other customers in S(c), but o itself may be wrongly recognized as a reverse skyline point once c is discarded. To solve this problem, we divide ER(c) into two parts, ER1(c) and ER2(c), such that ER2(c) is the intersection of ER(c) and S(c), and ER1(c) = ER(c) − ER2(c). Customers in ER2(c) are thus also from S(c). Consider the rectangle with Mc and c as diagonal corners: the region ER2(c) is this rectangle excluding the midpoint Mc. The following lemma provides the basic pruning rule that allows us to prune c safely by using customers in ER(c):

Lemma 2. Given a customer c ∈ C, c can be safely pruned if there is at least one customer in ER1(c) or two customers in ER2(c).

Proof. Let o be a customer in ER(c). Clearly, o is in the DR region of each customer in S(c). Hence, customers in S(c) cannot belong to the reverse skyline as a result of o. In particular, if o is inside ER2(c), o is itself a point of S(c). Let r be another customer in ER2(c). Then o ∈ S(r) and r ∈ S(o), i.e., r and o are not reverse skyline points as a result of each other. Hence, customer c can be safely pruned.

M

Notice that, customers inside ER2 (c) cannot belong to the reverse skyline as a result of c. Therefore, if there is a customer o in ER2 (c), we can mark o as a non-reverse skyline point and then discard c. Based on this idea, we have the following relaxed pruning rule:

ED

Lemma 3. Given a customer c and another o in ER(c), we can safely prune c. Especially, we mark o as a non-reverse skyline point when o is inside ER2 (c).

CE

PT

We define the subset of customers who are not pruned as the selective reverse skyline, and proves an important lemma (Lemma 4) which will help us to answer a reverse skyline query correctly based on the selective reverse skyline.

AC

Definition 5 (Selective Reverse Skyline). Given a dataset of customers C, the selective reverse skyline with respect to query point p, denoted as SRSC (p), contains all customers who are not pruned by Lemma 3. Lemma 4. SRS is sufficient to compute RSK correctly and SRS ⊆ GSK.

Proof. ∀c ∉ SRS, c ∉ RSK. ∀c ∈ SRS with c ∉ RSK, ∃o ∈ SRS such that o ∈ DR(c) or o has been marked as a non-reverse skyline point. Hence, SRS is sufficient to compute RSK. As the customers that globally dominate c are included in ER(c), we conclude that SRS ⊆ GSK.

4.4. Our OneTraversal
We now present our OneTraversal algorithm for efficient reverse skyline computation. The OneTraversal algorithm:

• Does not require the maintenance of the dynamic skyline sets.
• Traverses the R-tree of dataset C only once, and is thus less expensive in terms of I/O cost.
• Discards entries based on our pruning rules as early as possible, in order to avoid costly accesses to their child nodes (in terms of computation and I/O cost).
• Expands a non-leaf entry not in the reverse skyline only if it is necessary in order to determine whether another customer belongs to the reverse skyline.


Framework. Let us first provide a general description of the OneTraversal algorithm. Like the RSSA algorithm, OneTraversal traverses the R-tree in an order such that it always evaluates and expands the tree node closest to the query point p. To do so, a min-heap H is built on the entries of the R-tree. We use U to denote the set of visited entries that are not pruned. The entries of U can be divided into three categories: a set RSK that stores the currently found reverse skyline points; a heap CUR that stores temporary reverse skyline points, i.e., points such that no visited entry intersects their DR regions; and another heap Hnon that contains entries whose customers are non-reverse skyline points. Thus, U = RSK ∪ CUR ∪ Hnon. In each iteration, OneTraversal extracts the top entry ei of H. If ei is a non-leaf entry represented by a minimum bounding rectangle (MBR), Min(ei) is defined as the corner of the MBR nearest to the query point p, and Max(ei) is the corner with the largest distance to p. Figure 5 shows an example where Min(ei) is the min-corner of the MBR while Max(ei) is the max-corner. In particular, if ei is a point, both Min(ei) and Max(ei) are ei itself. Then, OneTraversal conducts the following processing for ei:

1. Prune ei if possible. Meanwhile, OneTraversal moves every customer c ∈ CUR to Hnon if ei is a point and c is inside ER2(Min(ei)).

2. Otherwise, if it can be determined based on U that ei does not belong to the reverse skyline, OneTraversal inserts ei into Hnon. Based on ei, CUR is updated, such that the updated points of CUR are definitely reverse skyline points of the visited entries of C. For this purpose, as long as there is a customer c in CUR such that ei intersects DR(c), ei is expanded, until we can determine whether c is a reverse skyline point or not.

Figure 5: Example of MBR (showing the min-corner and max-corner relative to the query point p).

3. Else,


• If ei is a non-leaf entry, OneTraversal inserts its children entries into H.
• Else, in order to determine whether point ei is a temporary reverse skyline point or not, OneTraversal checks ei against the entries in Hnon. As long as there is a node enon intersecting DR(ei), enon is expanded. Next, OneTraversal moves customers from CUR to Hnon if those customers have ei in their DR regions, and prunes as many entries from Hnon as possible.


4. Let distance(Min(ei), p) be the distance from p to Min(ei). OneTraversal reports all customers c ∈ CUR as reverse skyline points if distance(c, p) ≤ distance(Min(ei), p)/2.

The algorithm terminates when H is empty. Next, we discuss the techniques and OneTraversal's execution in detail. From the definition of the DR region, we can see that whether a customer c belongs to the reverse skyline depends only on the other customers in the same quadrant as c, more precisely, on the customers in DR(c). Since all quadrants are symmetrical, in the following discussion we concentrate on the upper-right quadrant of the query point.

More on pruning. OneTraversal checks the pruning conditions when a new entry of the R-tree needs to be visited or when a node in Hnon is expanded. We stress that an entry ei only needs to be compared with entries with smaller distances to the query p for the pruning of ei. The pruning rules of Lemma 2 and Lemma 3 only consider sets of points. We now extend the pruning rules to entries that are represented by MBRs.

Figure 6 shows five pruning conditions for ei. Note that if a node has at least one edge inside ER1(Min(ei)), we can infer that there must exist a real point (i.e., a customer) of the node that falls in ER1(Min(ei)). Consider Figures 6(a)-6(b), where there is a customer (or a node with at least one edge of its MBR) in the region ER1(Min(ei)). This guarantees that there exists a customer of C in the ER1 region of any point in ei; hence, ei can be pruned. Apparently, if there is a customer c (or a node ej with at least one edge) in ER(Min(ei)) but not in ER1(Min(ei)), ei cannot be pruned based on c (or node ej) alone. Nevertheless, when two such cases occur, it can be inferred that there are at least two customers in ER(Min(ei)) and thus ei can also be pruned (Figures 6(c)-6(e)). In summary, OneTraversal prunes an entry ei according to the following rules:

1. If ei is a point,

• if there is a customer c inside ER(Min(ei)) or there is a node ej with at least one edge inside ER1(Min(ei)), prune ei. Specifically, when c falls in ER2(Min(ei)), c is moved to Hnon.
• Else, if there are two nodes such that each node has at least one edge inside ER(Min(ei)), prune ei.

2. Else,

• if there is a customer c (or a node ej with at least one edge) inside ER1(Min(ei)), prune ei.
• Else, prune ei if there are two nodes such that each node has at least one edge inside ER(Min(ei)), or two customers inside ER2(Min(ei)), or one node with at least one edge inside ER(Min(ei)) and one customer inside ER2(Min(ei)).


Example 1. In the example of Figure 7(a), OneTraversal first accesses customer c1, which has the minimum distance from p, and hence CUR = {c1} at the beginning. Next, e4 is examined. As depicted in Figure 7(b), since c1 is inside ER1(Min(e4)), e4 is pruned away.

Identifying entries not in the reverse skyline. When ei is pruned, clearly, ei cannot belong to the reverse skyline. Otherwise, if there is a customer (or a node with at least one edge) inside the DR region of Min(ei), ei is not in the reverse skyline but cannot be pruned; hence, ei is inserted into Hnon. Next, OneTraversal updates CUR based on ei to ensure that the customers in the updated CUR are still temporary reverse skyline points after ei has been visited.

Figure 6: Example of pruning. (a) Condition 1; (b) Condition 2; (c) Condition 3; (d) Condition 4; (e) Condition 5.


For this purpose, as long as there is a customer c in CUR such that ei intersects DR(c), ei may contain a customer inside DR(c), and thus ei is expanded and its children entries are examined. In the end, the customer c either remains in CUR or is moved to Hnon. Moreover, as ei is expanded, its children entries may be pruned based on U; thus, OneTraversal checks the pruning conditions on the children entries to minimize the size of Hnon.

Example 2. Continuing the example in Figure 7(b), after pruning e4, we next examine e5. Entry e5 cannot be pruned, but e5 is not a part of the reverse skyline (because c1 is inside DR(Min(e5))). As e5 intersects DR(c1) (Figure 7(c)), e5 is expanded in order to update the only customer c1 in CUR. Since no customer of e5 falls in DR(c1), c1 stays in CUR. Meanwhile, c5, c6 and c7 are pruned, as those customers have c1 in their ER1 regions. So we have CUR = {c1}.

Identifying reverse skyline points. Note that OneTraversal accesses entries of the R-tree in ascending order of their distance to the query point p. Given a data point c, if no visited entry intersects DR(c), c is recognized as a temporary reverse skyline point. Since entries that have not yet been visited may still intersect DR(c), the temporary reverse skyline point c cannot yet be recognized as a final reverse skyline point.

Figure 7: Running example of the OneTraversal algorithm. (a) Dataset C; (b) Accessing c1 and e4; (c) Accessing e5; (d) Accessing c2 and e6; (e) Accessing e7.


In the following, we detail how to identify final reverse skyline points among the temporary reverse skyline points as early as possible, to enable progressive output of results (Lemma 5).

Lemma 5. Let ei be the latest visited entry. Then, all customers c ∈ CU R can be reported as reverse skyline points if distance(c, p) ≤ distance(M in(ei ), p)/2.


Proof. The proof is obvious. The unvisited entries cannot intersect the DR regions of customers with distance (to query p) smaller than distance(M in(ei ), p)/2.


Therefore, we can use distance(Min(ei), p)/2 as the distance threshold, and the temporary reverse skyline points with distance not larger than the threshold are removed from CUR and reported as part of the reverse skyline immediately.

Example 3. In the example of Figure 7(c), we have CUR = {c1} after checking e5. OneTraversal then accesses c2. Since c1 is inside DR(c2) but not inside ER(c2), c2 is not a reverse skyline point, but it cannot be pruned. Hence, c2 is inserted into Hnon. Next, e6 is accessed. Similarly, e6 is also inserted into Hnon. Now the threshold is distance(Min(e6), p)/2.


As distance(c1, p) < distance(Min(e6), p)/2, c1 is immediately reported as a reverse skyline point. Thus, RSK = {c1} and Hnon = {c2, e6} now.

Algorithm description. The OneTraversal algorithm is presented in Algorithm 1, which follows the framework presented above. Given an entry ei, the pruning processing is described in lines 5-7. If ei cannot be pruned, OneTraversal checks whether ei does not belong to the reverse skyline (lines 8-11). When OneTraversal can neither prune ei nor determine that ei does not belong to the reverse skyline, OneTraversal expands ei if ei is a non-leaf entry (lines 13-14). If ei is a point, OneTraversal uses an important operation, Compare(), to determine whether point ei is a temporary reverse skyline point or not (lines 16-30). For this purpose, Compare() expands the non-leaf entries in Hnon that intersect DR(ei) and conducts a depth-first traversal. The details of Compare() are presented in Algorithm 2. After this processing, ei is pruned, moved into Hnon, or determined to be a temporary reverse skyline point. As some entries from Hnon are expanded and their children nodes are inserted into Hnon (line 22 in Algorithm 1 and lines 8-9 in Algorithm 2), OneTraversal prunes these new entries in Hnon if possible to minimize the size of Hnon (line 29). Furthermore, OneTraversal moves the customers from CUR whose DR regions contain ei into Hnon (line 30). After checking ei, OneTraversal outputs the customers in CUR with distances not larger than distance(Min(ei), p)/2 as part of the reverse skyline (line 31).

Example 4. In the running example of Figure 7, OneTraversal finally accesses entry e7. As e7 can neither be pruned nor inserted into Hnon, e7 is expanded. The customer c11 from e7 is then inserted into CUR, as no entry from U is inside DR(c11). Customers c12 and c13 from e7 are pruned, as c11 is inside their ER1 regions. Hence, the reverse skyline is RSK = {c1, c11}.

Analysis of OneTraversal. We first present an important lemma that guarantees the correctness of OneTraversal. After that, we analyze the efficiency of our algorithm in terms of processing time, space requirement and progressiveness.

Lemma 6. Any customer added to RSK during the execution of the algorithm is guaranteed to be a final reverse skyline point. Proof. This is guaranteed by Lemma 2 , Lemma 3 and Lemma 5.
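Before the cost analysis, the following deliberately simplified, point-only sketch illustrates how the pieces of this section fit together: points are scanned in ascending distance from p, pruned via the effect region (Lemma 3), classified via the decision region (Lemma 1), and reported progressively via the half-distance threshold (Lemma 5). It is our own illustration of the ideas under the assumptions stated in the comments (no R-tree, no MBR entries, ties and boundary cases ignored), not the paper's Algorithm 1.

```python
import math

def dist(a, b):
    return math.sqrt(sum((a[m] - b[m]) ** 2 for m in range(len(a))))

def in_DR(x, c, p):   # x dynamically dominates p with respect to c
    d = range(len(p))
    return (all(abs(x[m] - c[m]) <= abs(p[m] - c[m]) for m in d)
            and any(abs(x[m] - c[m]) < abs(p[m] - c[m]) for m in d))

def in_ER(o, c, p):   # o lies in the box between p and c
    return all(min(p[m], c[m]) <= o[m] <= max(p[m], c[m]) for m in range(len(p)))

def in_ER2(o, c, p):  # o lies in the box between the midpoint of p, c and c
    mc = [(p[m] + c[m]) / 2.0 for m in range(len(p))]
    return all(min(mc[m], c[m]) <= o[m] <= max(mc[m], c[m]) for m in range(len(p)))

def one_traversal_points_only(C, p):
    visited, cur, non, rsk = [], [], [], []
    for c in sorted(C, key=lambda x: dist(x, p)):          # ascending distance from p
        # Pruning (Lemma 3): one kept point inside ER(c) is enough to discard c.
        witnesses = [o for o in visited if in_ER(o, c, p)]
        if witnesses:
            for o in witnesses:
                if in_ER2(o, c, p) and o in cur:           # the witness itself cannot be a reverse skyline point
                    cur.remove(o); non.append(o)
            continue                                       # c is pruned and never kept
        # Classification (Lemma 1): c is a temporary reverse skyline point iff DR(c) is empty so far.
        if any(in_DR(o, c, p) for o in visited):
            non.append(c)
        else:
            cur.append(c)
        visited.append(c)
        # A newly visited point may invalidate earlier temporary reverse skyline points.
        for o in list(cur):
            if o is not c and in_DR(c, o, p):
                cur.remove(o); non.append(o)
        # Progressive reporting (Lemma 5).
        threshold = dist(c, p) / 2.0
        for o in list(cur):
            if dist(o, p) <= threshold:
                cur.remove(o); rsk.append(o)
    rsk.extend(cur)    # remaining temporary points are final once the scan ends
    return rsk
```

Comparing its output against the brute-force reverse skyline from Section 3 on small random inputs is a convenient sanity check for the sketch.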

With respect to I/O cost, OneTraversal requires only one traversal of the indexed R-tree, and it only visits the nodes that contain customers in SRS. In contrast, as analyzed in Section 4.1, RSSA traverses the R-tree multiple times, and the number of traversals is O(|GSK|) in the worst case.

Algorithm 1: OneTraversal(C, p)
Input: C is a dataset indexed by an R-tree; p is a query point.
Output: RSK(p), the reverse skyline w.r.t. p.
1   H := ∅; Hnon := ∅; CUR := ∅; RSK(p) := ∅;
2   insert all entries of the root of the R-tree into H, sorted by distance from p;
3   while H ≠ ∅ do
4       remove the top entry ei;
5       if pruned(ei, U) then
6           delete ei;
7           move every customer c ∈ CUR to Hnon if ei is a point and c is inside ER2(Min(ei));
8       else if inside(U, DR(Min(ei))) then
9           update CUR based on ei;
10          insert ei (its children entries if ei is expanded) into Hnon;
11          prune entries in Hnon using the pruning rules;
12      else
13          if ei is a non-leaf entry then
14              expand ei, insert its child entries into H;
15          else if reverse(ei, U) then
16              add ei into CUR;
17          else
18              T := true;
19              foreach non-leaf entry enon ∈ Hnon do
20                  if enon intersects DR(ei) then
21                      expand enon;
22                      insert its children entries into Hnon;
23                      foreach child entry e′ in enon do
24                          Compare(ei, e′);
25                          if T == false then
26                              go to endexpand;
27              endexpand:
28              if T == true then insert ei into CUR;
29              prune new entries in Hnon using the pruning rules;
30          move the points in CUR whose DR regions contain ei into Hnon;
31      remove the points in CUR with distance not larger than distance(Min(ei), p)/2 to RSK(p);


Algorithm 2: Compare(ei, e′)
Input: ei is the entry being processed; e′ is an entry in Hnon.
Output: updated boolean variable T; updated dataset U.
1   if pruned(ei, e′) then
2       delete ei; T := false;
3   else if inside(e′, DR(Min(ei))) then
4       insert ei into Hnon; T := false;
5   else if reverse(ei, U) then
6       T := T ∧ true;
7   else if e′ is a non-leaf node and e′ intersects DR(ei) then
8       expand e′;
9       insert its child entries into Hnon;
10      foreach child entry e″ in e′ do
11          Compare(ei, e″);
12          if T == false then
13              break;


Consequently, OneTraversal is more I/O efficient than RSSA.

We quantify the computational cost by the number of data comparisons, e.g., dominance tests and window queries. OneTraversal reduces the number of comparisons by pruning unqualified customers. Each customer in C is compared with entries in U (the maximum size of U is |SRS|). Hence, the number of comparisons is O(|C| · |SRS|). On the contrary, RSSA first retrieves the global skyline, which costs O(|C| · |GSK|), and then refines the global skyline points based on DDR and DADR regions, which costs O(|GSK| · k); the typical k value is 10 for a three-dimensional dataset that contains 20,000 points [15]. Finally, RSSA verifies the global skyline points that have not been identified as reverse skyline points or non-reverse skyline points by using window queries, which costs O(|GSK| · h) in the worst case, where h is the height of the R-tree indexed over C. We have proved that |SRS| ≤ |GSK| (Lemma 4). Consequently, OneTraversal is more efficient than RSSA in terms of computational cost. The overall processing time consists of the I/O cost and the computational cost; from the above analysis, we can see that OneTraversal is more efficient in overall processing cost than RSSA. Furthermore, since OneTraversal does not need to precompute the dynamic skyline as RSSA does, OneTraversal is markedly faster for a single query.

With respect to progressiveness, recall that OneTraversal examines customers in ascending order of their distances from the query p. A customer c is reported as a reverse skyline point if distance(c, p) is not larger than the distance threshold and no point that has been checked falls in DR(c). Intuitively, if c is closer to p, it is more likely that c is a reverse skyline point. Hence, OneTraversal reports the first result after only a few checks. In contrast, RSSA cannot report any result during the computation of the global skyline set. In the worst case, RSSA outputs the first reverse skyline point only after comparing the DDR and DADR regions of all global skyline points and then conducting a few window queries. The experimental results (Section 6) verify the superiority of OneTraversal in terms of progressiveness.

5. Top-k reverse skyline queries

Although the reverse skyline query is effective in identifying the most prospective customers, it cannot effectively support the problem of finding k prospective customers. Let |RSK| be the cardinality of the reverse skyline set RSK.

RSK. When k ≤ |RSK|, a scoring function can be used to rank points in RSK, and then the top-k ranking points are returned as the results. The scoring function could be the distance from target product p, the number of influencing points, and even more complicated expressions [10–14, 28]. Nevertheless, when k > |RSK|, the reverse skyline is not sufficient. Furthermore, in such a case, company may want to group k customers, such that different promotion strategies will be implemented to different groups. From the point of competitive products, if there exists another product that is better than the target product, user would be less interested in this target product. In this sense, customers can be grouped according to their degrees of “interest” on the product. Promotion should be enhanced when the degree of interest decreases. In this section, we study the problem of finding k most prospective customers with respect to product p. We formulate the problem as the top-k reverse skyline (Top-k RS) queries, which rank customers based on reverse skyline order in order to support arbitrary k as well as group-based promotion. Customers are first grouped according to their “interest” degrees, i..e., the number of competitive products. Then, customers in the same group are ordered according to the selectivity probability which considers the similarity of the target product to both customers’ preferences and better products. In the following, we first give the formal definition of top-k RS query, and then present three algorithms for the query.


5.1. Reverse skyline order and top-k RS queries

Definition 6 (Reverse skyline layers). Given a dataset of customers C and a target product as the query point p, the reverse skyline layers of C with respect to p are a series of subsets {C1, C2, . . . , Cl}, where (1) C = C1 ∪ C2 ∪ · · · ∪ Cl; and (2) ∀c ∈ Ci, there are i − 1 customers in region DR(c).


Recall that the preferences of a customer for products of a certain type are represented by the parameters of the purchased product of the same type. Hence, a customer also stands for a competitive product. The reverse skyline layers group the customers of C according to the number of customers/competitive products in their DR regions, since competitive products in the DR regions are better than p. For example, for customers in the i-th layer Ci, there are i − 1 products better than the target product p. Hence, customers in the same layer are viewed as having the same degree of interest in p.
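To make the layer construction concrete, the following minimal C++ sketch (our own illustration, not the paper's optimized algorithm) assigns every customer its reverse skyline layer by brute force. It assumes that the decision region DR(c) of a customer c with respect to the query product p contains exactly the points that are at least as close to c as p in every dimension and strictly closer in at least one dimension; the helper names insideDR and layerOf are ours.

#include <cstddef>
#include <cmath>
#include <cstdio>
#include <vector>

using Point = std::vector<double>;

// Assumed decision-region test: o lies in DR(c) w.r.t. query p if
// |o_i - c_i| <= |p_i - c_i| in every dimension i, strictly in at least one.
bool insideDR(const Point& o, const Point& c, const Point& p) {
    bool strict = false;
    for (std::size_t i = 0; i < c.size(); ++i) {
        double od = std::fabs(o[i] - c[i]);
        double pd = std::fabs(p[i] - c[i]);
        if (od > pd) return false;
        if (od < pd) strict = true;
    }
    return strict;
}

// Layer index of customer idx (Definition 6): one plus the number of other
// customers that fall inside its DR region. Brute force, O(|C|) per customer.
int layerOf(std::size_t idx, const std::vector<Point>& C, const Point& p) {
    int competitors = 0;
    for (std::size_t j = 0; j < C.size(); ++j)
        if (j != idx && insideDR(C[j], C[idx], p)) ++competitors;
    return competitors + 1;
}

int main() {
    Point p = {5.0, 5.0};                               // target product
    std::vector<Point> C = {{4.0, 6.0}, {6.5, 4.5},     // toy 2-d customers
                            {3.0, 3.0}, {2.0, 8.0}};
    for (std::size_t i = 0; i < C.size(); ++i)
        std::printf("customer %zu -> layer %d\n", i + 1, layerOf(i, C, p));
}

Under this test, a customer with an empty DR region has no competitive product better than p and therefore falls into the first layer C1.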

Figure 8: Reverse skyline layers


Example 5. Consider the example of Figure 8, where the customers are organized into three reverse skyline layers. We depict the outlines of the DR regions of the different layers using dashed lines. The 1-st layer C1 = {c1, c2, c3} is also the reverse skyline set of C; customers in C1 do not have any other customer in their DR regions. The 2-nd layer C2 contains customers c4, c5, c6. These customers have exactly one customer in their DR regions, i.e., c1 ∈ DR(c4), c2 ∈ DR(c5), and c3 ∈ DR(c6). The 3-rd layer C3 contains customers c7 and c8: c7 has c1 and c2 inside its DR region, while c8 has c2 and c5.
We have grouped and ranked customers according to the reverse skyline layers they are located in. We now discuss the order of customers within the same layer. Intuitively, when the target product p is closer to a customer's favorite product, the probability that the customer selects p is higher. For example, consider the two customers c4 and c5 from C2 in Figure 8. Product p is closer to c5 than to c4; hence, p better matches the preferences of c5. Furthermore, recall that the products in DR(c) are superior to p for customer c. A customer is more inclined to select p when this superiority is less obvious. In Figure 8, c1 is the superior product for c4 while c2 is the superior product for c5; since p is closer to c2 than to c1, p is a better substitute for c5 than for c4. To summarize, we define the selectivity probability for a customer c to select product p as follows:

F(c, p) = 1 / distance(c, p),   if c ∈ C1;
F(c, p) = |DR(c)| / ( distance(c, p) × Σ_{o ∈ DR(c)} distance(o, p) ),   if c ∈ Ci (i > 1).     (1)

We define the pointwise order by combining the reverse skyline layers and the intra-layer selectivity probability.

Definition 7 (Reverse skyline order). Given a dataset of customers C and a target product as the query point p, C is grouped into reverse skyline layers. Suppose that customer c ranks kc-th in layer Ci according to its selectivity probability value. Then, the order of c in C is Σ_{j=1}^{i−1} |Cj| + kc.
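As an illustration of Eq. (1) and Definition 7, the sketch below computes the intra-layer score of a customer and its resulting global rank. The Euclidean distance and the helper names selectivity and globalOrder are our own assumptions; the |DR(c)| factor follows Eq. (1) as reconstructed above.

#include <cstddef>
#include <cmath>
#include <vector>

using Point = std::vector<double>;

double dist(const Point& a, const Point& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(s);
}

// Eq. (1): first-layer customers are scored by 1/distance(c, p); deeper
// customers are additionally penalized by the distances of their competitors
// (the products in DR(c)) to p.
double selectivity(const Point& c, const Point& p, const std::vector<Point>& dr) {
    if (dr.empty()) return 1.0 / dist(c, p);            // c is in layer C1
    double sum = 0.0;
    for (const Point& o : dr) sum += dist(o, p);
    return static_cast<double>(dr.size()) / (dist(c, p) * sum);
}

// Definition 7: if c ranks kc-th inside layer Ci, its global order is
// |C1| + ... + |C_{i-1}| + kc.
int globalOrder(const std::vector<int>& layerSizes, int i, int kc) {
    int order = kc;
    for (int j = 0; j < i - 1; ++j) order += layerSizes[j];
    return order;
}

int main() {
    Point p = {5.0, 5.0}, c = {4.0, 6.0};
    std::vector<Point> dr = {{4.5, 5.5}};               // competitors in DR(c)
    double score = selectivity(c, p, dr);               // intra-layer score
    int rank = globalOrder({3, 3, 2}, 2, 1);            // best of C2 -> order 4
    return (score > 0.0 && rank == 4) ? 0 : 1;
}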


By using the reverse skyline order, we formalize the problem of finding the k most prospective customers as a top-k reverse skyline query.

Definition 8 (Top-k reverse skyline (Top-k RS) query). Given a dataset of customers C, the top-k reverse skyline query retrieves a subset Stopk composed of the first k points ranked according to the reverse skyline order:

Stopk = (∪_{j=1}^{i−1} Cj) ∪ C′i,     (2)

where |Stopk| = k, C′i ⊆ Ci, and ∀c ∈ C′i and o ∈ Ci − C′i, it holds that ko > kc.


Example 6. Referring to Figure 8, the ranked series of the three layers are C1 = {c2, c1, c3}, C2 = {c5, c4, c6} and C3 = {c7, c8}, respectively. Suppose that the company plans to advertise product p to five prospective customers. Then, according to Definition 8, customers c1 to c5 are returned by the top-five reverse skyline query.
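Once the layers have been ordered by selectivity probability, answering a Top-k RS query amounts to concatenating whole layers and truncating the last one, which is essentially what the PointRank() procedure of Algorithm 4 below does. A minimal sketch with our own helper name topK:

#include <cstddef>
#include <string>
#include <vector>

// Each inner vector is a reverse skyline layer, already ordered by
// selectivity probability (Definition 7).
std::vector<std::string> topK(const std::vector<std::vector<std::string>>& layers,
                              std::size_t k) {
    std::vector<std::string> result;
    for (const auto& layer : layers)
        for (const auto& c : layer) {
            if (result.size() == k) return result;      // truncate the last layer
            result.push_back(c);
        }
    return result;                                       // k exceeds |C|
}

int main() {
    // Ordered layers from Example 6.
    std::vector<std::vector<std::string>> layers = {
        {"c2", "c1", "c3"}, {"c5", "c4", "c6"}, {"c7", "c8"}};
    std::vector<std::string> top5 = topK(layers, 5);     // -> c2, c1, c3, c5, c4
    return top5.size() == 5 ? 0 : 1;
}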


5.2. Algorithms

Pointwise algorithm. We first present the Pointwise algorithm. Like OneTraversal, Pointwise traverses the R-tree over C in ascending order of distances from the query point p. The core pseudocode is described in Algorithm 3, which consists of two phases. The first phase generates a candidate set Sc for Stopk (lines 1-28); the candidate set Sc contains a series of subsets {C1, C2, . . . , Cj} such that Stopk ⊆ Sc. The second phase retrieves the final top-k points from Sc by calling the procedure PointRank() (line 29). The details of PointRank() are presented in Algorithm 4. Specifically, in the first phase, Pointwise fetches the top entry ei that is closest to p, and if ei is a point, it examines ei against each layer Cj

Algorithm 3: TopKReverse(C, p)
Input: C is a dataset indexed by an R-tree; p is a query point.
Output: Stopk is the set of top-k reverse skyline points of p.
1  while H ≠ Φ do
2    if top entry ei of H is a point then
3      n = 0;
4      foreach Cj from C1 to Cm do
5        Ctemp = Φ;
6        foreach point c in Cj do
7          if inside(c, DR(ei)) then
8            n++; v_ei = v_ei + distance(c, p);
9            if distance(c, p) < distance(ei, p)/2 then
10             mark c “true”; C′j = C′j ∪ c;
11         else if inside(ei, DR(c)) then
12           move c from Cj to Ctemp;
13           v_c = v_c + distance(ei, p);
14       if Ctemp ≠ Φ then
15         append Ctemp to Cj+1;
16       if Σ_{1≤l≤j} |C′l| ≥ k then
17         break;
18     foreach Ci from Cj+1 to Cm do
19       if n > j then
20         break;
21       else
22         foreach point c in Ci do
23           if inside(c, DR(ei)) then
24             n++; v_ei = v_ei + distance(c, p);
25     append ei to Cn;
26     m = j;
27   else
28     expand ei, insert children entries into H;
29 Sc = ∪_{l=1}^{m} Cl; return PointRank(Sc, k);


sequentially from C1 to the highest layer Cm (lines 4-17). Let n be the number of customers in DR(ei), and v_ei = Σ_{c ∈ DR(ei)} distance(c, p) (lines 7-8); the values n and v_ei will be used to compute the reverse skyline order of ei in PointRank(). Consider a customer c in layer Cj. Recall that customers whose distance from p is larger than 2 · distance(c, p) cannot be inside DR(c). Hence, if distance(c, p) < distance(ei, p)/2, customer c is an exact member of Cj and it is marked “true” (lines 9-10); we use a set C′j to keep these “true” points from Cj (line 10). Otherwise, Pointwise checks whether ei is inside DR(c); if so, c should be moved to Cj+1, and we use Ctemp to keep the points that should be moved to Cj+1 (lines 11-13). After scanning Cj, Pointwise appends Ctemp to Cj+1 (lines 14-15). Meanwhile, Pointwise counts the number of points that are marked “true” in ∪_{l=1}^{j} Cl. If this number is not smaller than k, then the top-k points must be inside ∪_{l=1}^{j} Cl. Let Sc = ∪_{l=1}^{j} Cl; Sc is used as the candidate set for Stopk. In order to determine whether ei is a member of Sc, Pointwise compares ei with the points in each remaining layer Ci (i > j) (lines 18-24). Note that if n > j, ei cannot belong to Stopk; hence, Pointwise does not need to compute the exact value of n, which reduces the time complexity (lines 19-20). We omit the description of the second phase (Algorithm 4, PointRank()) here, since the top-k points are retrieved simply according to their reverse skyline orders. The procedure OrdDR(Ci, num) in PointRank() selects the top num points from Ci.

OneTraversal-based algorithm. An improvement of Pointwise is to use OneTraversal to facilitate the computation of the reverse skyline layers. The improved algorithm avoids scanning the whole dataset, thereby reducing the query cost.


Lemma 7. Let C′ = C − RSK, and let RSK_{C′} be the reverse skyline set of C′. Given a customer c ∈ RSK_{C′}, for any customer o inside DR(c) we have o ∈ RSK.
Proof. It is straightforward, since if o ∉ RSK, then c ∉ RSK_{C′}.


For illustration, we term RSK as RSK1, the reverse skyline of C − RSK as RSK2, and so on. Given a customer c ∈ RSKi, according to Lemma 7 it is sufficient to compare c with the customers in RSKj (j < i) in order to determine the reverse skyline layer of c. Furthermore, an interesting corollary of Lemma 7 is that if c is in reverse skyline layer Ck, then k ≥ i. Motivated by this, the OneTraversal-based algorithm computes the top-k RS results as follows. The algorithm first computes the reverse skyline set RSK1,


then RSK2. Clearly, C1 = RSK1. Thereafter, the algorithm computes the reverse skyline layers of the customers in RSK2 by comparing them with the customers in RSK1. Next, the algorithm retrieves RSK3 and computes the reverse skyline layers of the customers in RSK3, and so on. Like the Pointwise algorithm, if the number of customers in the current ∪_{l=1}^{j} Cl is not smaller than k, it can be determined that the top-k results must be in ∪_{l=1}^{i} Cl as well as in ∪_{l=1}^{i} RSKl, where i ≤ j. We adapt OneTraversal for the computation of all RSKl (l ≤ j) as follows: one min-heap keeps the entries that still need to be accessed, and another min-heap keeps the entries that do not belong to the current reverse skyline set. When the first reverse skyline set RSK1 has been computed, the entries not in RSK1 are kept in the second heap, and the OneTraversal algorithm is executed again to obtain the second reverse skyline set RSK2, with the first heap now collecting the customers not in RSK2. This process is repeated until all RSKl (l ≤ j) are retrieved (see the sketch below).

Algorithm with precomputed ordered reverse skyline layers. We then present an algorithm with a preprocessing technique to facilitate the query processing. When the dataset C has a high dimensionality or a large size, retrieving the top-k points from C is time-consuming. Furthermore, multiple queries with different k may be issued. Thus, a preferable solution is to precompute the reverse skyline layers and order the customers in each layer; then, when k is specified, the top-k points can be returned directly from the ordered dataset. The reverse skyline layers can be computed as follows. The algorithm first computes all RSKi by using the two min-heaps as described in the OneTraversal-based approach, and then examines the customers from RSK1 to RSKm sequentially to determine the reverse skyline layer of each customer and the corresponding selectivity probability. Finally, the points in the same layer are ordered by calling the procedure OrdDR(Ci, |Ci|).
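To picture the peeling strategy, the following simplified in-memory sketch repeatedly extracts the reverse skyline of the remaining customers (RSK1, RSK2, ...) with brute-force decision-region tests. The actual algorithm instead alternates between two min-heaps over R-tree entries, so this is only a conceptual illustration; insideDR is the same assumed decision-region test sketched in Section 5.1.

#include <cstddef>
#include <cmath>
#include <vector>

using Point = std::vector<double>;

// Assumed decision-region test (see the sketch in Section 5.1).
bool insideDR(const Point& o, const Point& c, const Point& p) {
    bool strict = false;
    for (std::size_t i = 0; i < c.size(); ++i) {
        double od = std::fabs(o[i] - c[i]), pd = std::fabs(p[i] - c[i]);
        if (od > pd) return false;
        if (od < pd) strict = true;
    }
    return strict;
}

// Peel reverse skyline sets: RSK1 is the reverse skyline of C, RSK2 the
// reverse skyline of C - RSK1, and so on. By Lemma 7, the layer of a customer
// in RSK_i only depends on customers in RSK_j with j < i.
std::vector<std::vector<Point>> peelReverseSkylines(std::vector<Point> rest, const Point& p) {
    std::vector<std::vector<Point>> rsk;
    while (!rest.empty()) {
        std::vector<Point> current, remaining;
        for (std::size_t i = 0; i < rest.size(); ++i) {
            bool isRS = true;
            for (std::size_t j = 0; j < rest.size() && isRS; ++j)
                if (j != i && insideDR(rest[j], rest[i], p)) isRS = false;
            (isRS ? current : remaining).push_back(rest[i]);
        }
        if (current.empty()) { rsk.push_back(remaining); break; }  // degenerate case
        rsk.push_back(current);
        rest.swap(remaining);
    }
    return rsk;
}

The layers Ci are then derived by comparing each customer in RSKi only with the customers in the earlier sets, as described above.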


Algorithm 4: PointRank(Sc, k)
Input: Sc is the candidate set for Stopk; k is the cardinality of Stopk.
Output: Stopk is the set of top-k reverse skyline points of p.
1  Stopk = Φ; num = k;
2  foreach layer Ci from C1 to Cm in Sc do
3    if |Ci| == num then
4      Stopk = Stopk ∪ OrdDR(Ci, |Ci|); break;
5    else if |Ci| < num then
6      Stopk = Stopk ∪ OrdDR(Ci, |Ci|);
7      num = num − |Ci|;
8    else
9      Stopk = Stopk ∪ OrdDR(Ci, num); break;
10 return Stopk;

6. Experimental evaluation


In this section, we experimentally evaluate the proposed framework for reverse skyline and top-k reverse skyline queries. As discussed in the related work section, prior work [15, 16] used a filter-and-refinement strategy for reverse skyline queries and thus suffered from a high overhead and poor progressiveness. Other related approaches were designed for reverse skyline variants or different settings [17–27], many of which are based on RSSA [15]. Therefore, without loss of generality, we compare our proposed OneTraversal with the classic algorithm RSSA [15] with respect to reverse skyline queries. We used three synthetic customer datasets with different distributions: independent, anti-correlated, and clustered. The first two datasets are generated as described in [29]. The clustered dataset [15, 16, 42, 43] is composed of ten randomly centered clusters of equal size, whose points follow a Gaussian distribution with mean equal to the associated centroid and variance 0.05. We also used one real-world dataset, namely CarDB, which consists of 50,000 cars with six attributes: make, model, year, price, mileage and location [15, 23]. The two numerical attributes, price and mileage, are selected for these experiments. The data space is normalized to the range from 0 to 1000 on every dimension.
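For reproducibility, a clustered dataset of the kind described above can be generated with a few lines of C++. This is our own sketch (the paper gives no generation code): ten random centroids, equal-sized clusters, independent Gaussian noise with variance 0.05 per dimension, and a final scaling to the normalized space [0, 1000]; clamping to the unit cube before scaling is our choice.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Sketch of the clustered-data generator described in Section 6.
std::vector<std::vector<double>> clusteredData(std::size_t n, std::size_t dim,
                                               unsigned seed = 42) {
    const std::size_t clusters = 10;
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    std::normal_distribution<double> noise(0.0, std::sqrt(0.05));  // variance 0.05

    std::vector<std::vector<double>> centroids(clusters, std::vector<double>(dim));
    for (auto& c : centroids)
        for (auto& v : c) v = uni(gen);                  // random cluster centers

    std::vector<std::vector<double>> data;
    data.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        const auto& c = centroids[i % clusters];         // equal-sized clusters
        std::vector<double> pt(dim);
        for (std::size_t d = 0; d < dim; ++d) {
            double v = c[d] + noise(gen);
            v = std::min(1.0, std::max(0.0, v));         // clamp to the unit cube
            pt[d] = v * 1000.0;                          // scale to [0, 1000]
        }
        data.push_back(std::move(pt));
    }
    return data;
}

int main() { return clusteredData(60000, 3).empty() ? 1 : 0; }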


Each dataset is indexed by an R-tree with a page size of 4096 bytes. The query product is randomly generated in the data space. We measured two metrics: the number of I/Os and the processing time. In particular, the processing time consists of two parts, the CPU time and the I/O cost, where each page access is penalized with 10 milliseconds. All parameters and their settings are listed in Table 2. All algorithms were implemented in C++ and executed on a 2.6 GHz Intel(R) Core(TM) 2 Duo CPU with 8 GB RAM.

Table 2: Parameters used in experiments

Parameter                 Setting                         Default
Dataset dimensionality    2, 3, 4, 5                      3
Dataset cardinality       20K, 60K, 100K, 200K            60K
k                         100, 200, 300, 400, 500         100


6.1. Results on reverse skyline queries

Pruning capabilities. We first study the efficiency of the pruning rules by comparing the cardinalities of three sets: the reverse skyline (RS), the selective reverse skyline (SRS), and the subset of SRS (denoted as “Marked”) composed of customers that are marked as points not belonging to the reverse skyline. Figure 9(a) plots the results for the independent customer datasets whose dimensionality d varies from 2 to 5, in logarithmic scale. The cardinalities of all three sets grow exponentially with the dimensionality. Compared with the size of the original dataset, a great number of customers are pruned away during the reverse skyline computation. More specifically, for the datasets with 60K customers, 92.8% ∼ 99.9% of the customers are pruned, and 79% ∼ 83% of the unpruned customers (i.e., the selective reverse skyline) are marked as points not belonging to the reverse skyline. For the datasets with 200K customers, the proportions are 96.4% ∼ 99.9% and 77% ∼ 95%, respectively. Figure 9(b) plots the results for the three synthetic datasets with 60K customers and dimensionality 3. The independent dataset has the largest RS, SRS and “Marked” sets, while the clustered dataset has the smallest. The corresponding numbers are also presented for clarity in Figure 9(d). It is important to note

Figure 9: Pruning capability

that customers not in SRS are safely pruned away during the reverse skyline computation. Although RSSA retrieves the global skyline as its candidate set, customers not in the global skyline cannot be pruned, as these customers are needed to verify whether the candidates are real members of RS. Figure 9(c) shows the results for the real CarDB dataset.

Query efficiency. We examine the efficiency of the algorithms RSSA and OneTraversal for answering reverse skyline queries, in terms of the number of I/Os and the total processing time. In order to test the effect of the dimensionality on the query performance, we set the customer dataset cardinality to 60K and vary the dimensionality from 2 to 5. Figure 10(a) illustrates the number of I/Os required for the independent dataset, in logarithmic scale. As expected (refer to the complexity analysis in Section 4), RSSA is much more expensive than OneTraversal. The experimental results show that our


OneTraversal algorithm outperforms RSSA by 1 ∼ 2 orders of magnitude. As the R-tree index degrades with dimensionality, the number of I/Os of both algorithms increases evidently as the dimensionality increases. Figure 10(b) illustrates an analogous behavior of the two algorithms on the clustered customer dataset, with the number of I/Os being smaller. The specific numbers for the synthetic datasets are presented in Figure 10(c) for clarity. In Figures 10(d)-10(e) we plot the processing time by breaking the total processing time into two parts, corresponding to the I/O and CPU costs, respectively. In all settings, for both RSSA and OneTraversal, the I/O cost dominates the overall processing time. OneTraversal is notably faster than RSSA in both I/O and CPU time. In terms of the total processing time, on the independent customer dataset RSSA requires 3 ∼ 50 times more time than OneTraversal (Figure 10(d)), and on the clustered customer dataset RSSA requires on average 19 times more time than OneTraversal (Figure 10(e)). Figure 10(f) illustrates the results for the real CarDB dataset, where our OneTraversal algorithm again shows significant performance gains. The main reason for the superiority of OneTraversal is its efficient pruning strategy, which discards a great number of non-reverse skyline points, thereby reducing the number of page accesses as well as the number of dominance tests. We next test the effect of the cardinality of the customer dataset on the query. We set the dimensionality to 3 and vary the dataset cardinality from 20K to 200K. The experimental results are plotted in Figure 11. OneTraversal consistently performs better than RSSA, which implies a good scalability of OneTraversal. Compared with the dimensionality, the number of I/Os grows more smoothly as the dataset cardinality increases (Figures 11(a)-11(b)). Our approach outperforms RSSA by about one order of magnitude. Figures 11(c)-(d) plot the processing time in logarithmic scale. The results on the anti-correlated customer dataset are similar and omitted due to space limitations.

Progressiveness. We evaluate the progressiveness of the RSSA and OneTraversal algorithms in returning reverse skyline points incrementally. Figures 12(a)-12(b) show the processing time as a function of the number of customers returned, on the independent and clustered datasets, respectively. We use a dataset of 60K customers with dimensionality 3. The figures show that OneTraversal is notably more progressive than RSSA. Specifically, on the independent customer dataset, OneTraversal outputs the first 5% of the reverse skyline points in 7% of its time, while RSSA requires 78% of its processing time. The reason is that RSSA cannot report any reverse skyline point during the computation of the global skyline set.

Figure 10: I/Os & processing time vs. dimensionality (RS query)

6.2. Results on top-k RS queries

In this part, we compare the performance of our proposed approaches for top-k RS queries: the Pointwise approach, the OneTraversal-based approach, and the OneTraversal-pre approach (the precomputation approach based on OneTraversal).

Distribution of reverse skyline layers. To test the effect of the dataset dimensionality, we fix the dataset cardinality to 60K and vary the dimensionality from 2 to 4. Figures 13(a)-13(c) plot the histograms of the reverse skyline layers for the independent customer datasets. We can see that, when the dimensionality is 2, a large part of the points lie in higher layers. However, as the dimensionality increases, points in lower layers become dominant and their number grows rapidly. Furthermore, as the dimensionality increases, the length of the reverse skyline layers (i.e., the number of layers) gets significantly smaller. This is because the higher the dimensionality is, the more points enter the reverse skyline. We then fix the data dimensionality to 3 and vary the cardinality from 20K to 200K. The effects of the dataset cardinality are shown in Figure 13(b), Figure 13(d) and Figure 13(e). The layer length

(a) I/Os vs. cardinality (Independent)  (b) I/Os vs. cardinality (Clustered)  (c) processing time vs. cardinality (Independent)  (d) processing time vs. cardinality (Clustered)
Figure 11: I/Os & processing time vs. cardinality (RS query)



as well as the number of points in each layer grows with the dataset cardinality. Figure 13(f) illustrates the experimental results for the anti-correlated customer dataset, with the dimensionality fixed to 3 and the cardinality to 20K. Compared with the independent customer dataset under the same experimental setting (Figure 13(d)), the anti-correlated dataset shows a similar tendency in its histogram distribution. The corresponding statistics are presented in Figures 13(g)-13(h), i.e., the number of customers in the top 10% of the reverse skyline layers, the number of customers in the bottom 10% of the reverse skyline layers, and the average layer value. For the clustered customer dataset, the difference between the first two values is less obvious than for the independent and anti-correlated datasets. We also present the distribution of reverse skyline layers for the CarDB dataset in Figure 13(i).

Cost for computation of reverse skyline layers. We compare the cost of computing the reverse skyline layers for two approaches: the Pointwise approach and the OneTraversal-pre approach. For the Pointwise approach, we set k to the cardinality of the customer dataset. We measure the total processing

(a) processing time vs. number of reported results (Independent)  (b) processing time vs. number of reported results (Clustered)
Figure 12: Processing time vs. number of reported results




time. The effects of the data dimensionality are reported in Figure 14(a). The OneTraversal-pre approach outperforms the Pointwise approach significantly by computing the reverse skyline layers based on the results of the OneTraversal algorithm: customers only need to be compared with the reverse skylines in lower levels, thereby reducing the computational cost. The effects of the dataset cardinality are reported in Figure 14(b). As the cardinality increases, the processing time of both approaches grows rapidly. Compared with the dimensionality, the cardinality has a more obvious effect on the performance. Although both the I/O cost and the CPU cost of each reverse skyline layer computation grow rapidly with the dimensionality, the layer length shrinks and more points concentrate in the lower layers (refer to Figure 13); hence, the effect of the dataset dimensionality on the cost of computing all layers is alleviated. On the other hand, we need to compute the layer of each point in the dataset, and thus the cost increases evidently with the cardinality of the dataset.

Query efficiency. We proceed to compare the different approaches for top-k RS queries: the Pointwise approach, the OneTraversal-based approach, and the OneTraversal-pre approach. We set k to 100. We first study the effects of the dimensionality on the different approaches. Figures 15(a)-15(c) plot the number of I/Os required for the different data distributions. The OneTraversal-based approach is more efficient than the Pointwise approach. Figures 15(d)-15(f) illustrate the overall processing time. Apparently, OneTraversal-pre has a significant advantage over its competitors. Because of the merit of

(a)-(e) distribution of reverse skyline layers (Independent: d=2 N=60K; d=3 N=60K; d=4 N=60K; d=3 N=20K; d=3 N=100K); (f) Anti-correlated, d=3, N=20K; (g) vs. dimensionality (60K); (h) vs. cardinality (3d); (i) CarDB dataset
Figure 13: Distribution of reverse skyline layers


precomputation, when a query is issued, OneTraversal-pre directly fetches the top k customers from ordered data points without any computation. Hence, little time is required for the query processing. We then study the effects of dataset cardinality on th query efficiency. Figure 16 reports the relevant results. We can see that both the number of I/Os and the processing time increase with dimensionality, for Pointwise and OneTraversal-based approaches. The precomputation based approach OneTraversal-pre again outperforms other two algorithms significantly. Effect of k. In this experiment, we fix the data cardinality to 60K and the dimensionality to 3d, and we then vary k from 100 to 500. Figure 17(a) reports the relevant results for the independent customer dataset. The digits above the x-axis are the number of required layers for corresponding top k RS queries. For k = 150 and k = 200, both approaches need to explore 40

ACCEPTED MANUSCRIPT

105

Pointwise

103

102

101 2

3

Pointwise OneTraversal-pre

processing time(second)

processing time(second)

OneTraversal-pre

4

5

104

103

102

20K

dimensionality

CR IP T

104

60K

100K

200K

cardinality

(a) processing time vs. dimensionality

AN US

(b) processing time vs. cardinality

Figure 14: Processing time for the computation of reverse skyline layers (independent)

AC

CE

PT

ED

M

the first three reverse skyline layers. The costs for k = 150 and k = 200 of each algorithm are the same. When k = 250 and k = 300, the first five layers need to be explored. As the first seven layers own 387 points while the first eight layers own 512 points, eight layers need to be explored for k = 400 and k = 500. The results for the anti-correlated and clustered datasets are reported in Figure 17(b) and Figure 17(c), respectively. It is seen that the independent dataset incurs highest cost while the clustered dataset incurs lowest cost. The OneTraversal-based approach outperforms Pointwise significantly, again. In addition, more reverse skyline layers are explored for anti-correlated and clustered datasets, as the two datasets have a smaller number of customers in lower reverse skyline layers (refer to Figure 13).

41

105 105

Pointwise

Pointwise

OneTraversal-based

OneTraversal-based

CR IP T

ACCEPTED MANUSCRIPT

Pointwise

OneTraversal-based

4

10

4

10

number of I/Os

103

103

2

3

4

101

5

2

dimensionality

(a) I/Os vs. dimensionality (Independent)

101

5

2

3

Pointwise

Pointwise

OneTraversal-based

OneTraversal-based

OneTraversal-pre

OneTraversal-pre

101

100

102

5

OneTraversal-pre

M

processing time (second)

102

4

dimensionality

(c) I/Os vs. dimensionality (Clustered)

(b) I/Os vs. dimensionality (Anti-correlated)

103

ED

processing time (second)

4

OneTraversal-based

Pointwise

103

3

dimensionality

processing time (second)

101

103

102

102

102

AN US

number of I/Os

number of I/Os

104

101

102

101

100

100

3

4

PT

2

dimensionality

(d) processing time vs. dimensionality (Independent)

5

2

3

4

dimensionality

(e) processing time vs. dimensionality (Anti-correlated)

5

2

3

4

(Clustered)

AC

CE

Figure 15: I/Os & processing time vs. dimensionality (top-k RS query)

42

5

dimensionality

(f) processing time vs. dimensionality

ACCEPTED MANUSCRIPT

Pointwise

105

105

OneTraversal-based

OneTraversal-based

104

4

102

number of I/Os

number of I/Os

number of I/Os

103

103

102

60K

100K

101

200K

20K

60K

cardinality

OneTraversal-based OneTraversal-pre

processing time (second)

101

100K cardinality

2

10

101

100

100

100

200K

Pointwise

AN US

processing time (second)

101

102

100K

cardinality

OneTraversal-pre

102

60K

(c) I/Os vs. cardinality (Clustered)

OneTraversal-based

OneTraversal-pre

60K

20K

Pointwise

OneTraversal-based

20K

101

200K

(b) I/Os vs. cardinality (Anti-correlated)

103

Pointwise

103

100K

cardinality

(a) I/Os vs. cardinality (Independent)

processing time (second)

103

102

20K

CR IP T

10

104

101

Pointwise

Pointwise

OneTraversal-based

200K

20K

60K

100K

20K

200K

60K

100K

200K

cardinality

cardinality

M

(d) processing time vs. cardinality (Independent) (e) processing time vs. cardinality (Anti-correlated)(f) processing time vs. cardinality (Clustered)

4x103

ED

Figure 16: I/Os & processing time vs. cardinality (top-k RS query)

4x103

Pointwise

AC 0

2

100

3

3

200

5

5

300

K

7

8

400

(a) Independent

8

OneTraversal-based

3x103

8

500

number of I/Os

1x103

2x103

1x103

0

Pointwise

OneTraversal-based

3x103 number of I/Os

PT

2x103

CE

number of I/Os

3x103

4x103

Pointwise

OneTraversal-based

3

100

5

5

8

200

9

300

11

13

15

14

400

500

K

(b) Anti-correlated

Figure 17: I/Os vs. k

43

2x103

1x103

0

4

100

5

6

200

8

10

300

13

15

400

K

(c) Clustered

17

17

500

ACCEPTED MANUSCRIPT

7. Conclusion

ACKNOWLEDGMENTS

AN US

CR IP T

In this work, we proposed a cost-efficient framework to find (a) the most prospective customers and (b) arbitrary k prospective customers. We first proposed OneTraversal to find the most prospective customers based on the notion of reverse skyline queries. OneTraversal significantly reduces the query cost by pruning unqualified points without any false positive and enables the progressive output of results. We then formulated the problem of finding k prospective customers as the top-k reverse skyline (Top-k RS) query, and extended the notion of reverse skyline to the reverse skyline order to support arbitrary k and group-based promotion. Our evaluation results on both real and synthetic datasets demonstrate that our framework achieves promising results for reverse skyline queries and can efficiently support top-k RS queries.

M

This research was supported by the Natural Science Foundation of Hunan Province [grant number 2016JJ3012]. We would like to thank anonymous reviewers for their constructive comments.

ED

References

PT

[1] J. Gubbi, R. Buyya, S. Marusic, M. Palaniswami, Internet of things (iot): A vision, architectural elements, and future directions, Future generation computer systems 29 (7) (2013) 1645–1660.

CE

[2] L. Da Xu, W. He, S. Li, Internet of things in industries: A survey, IEEE Transactions on industrial informatics 10 (4) (2014) 2233–2243.

AC

[3] A. Al-Fuqaha, M. Guizani, M. Mohammadi, M. Aledhari, M. Ayyash, Internet of things: A survey on enabling technologies, protocols, and applications, IEEE Communications Surveys & Tutorials 17 (4) (2015) 2347–2376. [4] http://www.businessinsider.com. [5] Z. Bi, L. Da Xu, C. Wang, Internet of things for enterprise systems of modern manufacturing, IEEE Transactions on industrial informatics 10 (2) (2014) 1537–1546. 44

ACCEPTED MANUSCRIPT

CR IP T

[6] M. R. Palattella, M. Dohler, A. Grieco, G. Rizzo, J. Torsner, T. Engel, L. Ladid, Internet of things in the 5g era: Enablers, architecture, and business models, IEEE Journal on Selected Areas in Communications 34 (3) (2016) 510–527. [7] I. Lee, K. Lee, The internet of things (iot): Applications, investments, and challenges for enterprises, Business Horizons 58 (4) (2015) 431–440.

[8] R. M. Dijkman, B. Sprenkels, T. Peeters, A. Janssen, Business models for the internet of things, International Journal of Information Management 35 (6) (2015) 672–678.

AN US

[9] C.-W. Tsai, C.-F. Lai, M.-C. Chiang, L. T. Yang, et al., Data mining for internet of things: A survey., IEEE Communications Surveys and Tutorials 16 (1) (2014) 77–97.

[10] A. Vlachou, C. Doulkeridis, Y. Kotidis, K. Nørv˚ ag, Reverse top-k queries, in: Data Engineering (ICDE), 2010 IEEE 26th International Conference on, IEEE, 2010, pp. 365–376.

ED

M

[11] A. Vlachou, C. Doulkeridis, K. Nørv˚ ag, Y. Kotidis, Identifying the most influential data objects with reverse top-k queries, Proceedings of the VLDB Endowment 3 (1-2) (2010) 364–372. [12] J.-L. Koh, C.-Y. Lin, A. L. Chen, Finding k most favorite products based on reverse top-t queries, The VLDB Journal 23 (4) (2014) 541–564.

PT

[13] O. Gkorgkas, A. Vlachou, C. Doulkeridis, K. Nørv˚ ag, Finding the most diverse products using preference queries., in: EDBT, 2015, pp. 205–216.

AC

CE

[14] S. Wang, M. A. Cheema, Y. Zhang, X. Lin, Selecting representative objects considering coverage and diversity, in: Second International ACM Workshop on Managing and Mining Enriched Geo-Spatial Data, ACM, 2015, pp. 31–38. [15] E. Dellis, B. Seeger, Efficient computation of reverse skyline queries, in: Proceedings of the 33rd international conference on Very large data bases, VLDB Endowment, 2007, pp. 291–302. [16] Y. Gao, Q. Liu, B. Zheng, G. Chen, On efficient reverse skyline query processing, Expert Systems with Applications 41 (7) (2014) 3237–3249. 45

ACCEPTED MANUSCRIPT

CR IP T

[17] X. Lian, L. Chen, Monochromatic and bichromatic reverse skyline search over uncertain databases, in: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, ACM, 2008, pp. 213– 226. [18] X. Wu, Y. Tao, R. C.-W. Wong, L. Ding, J. X. Yu, Finding the influence set through skylines, in: Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, ACM, 2009, pp. 1030–1041.

AN US

[19] A. Arvanitis, A. Deligiannakis, Y. Vassiliou, Efficient influence-based processing of market research queries, in: Proceedings of the 21st ACM international conference on Information and knowledge management, ACM, 2012, pp. 1193–1202.

[20] M. Islam, W. Rahayu, C. Liu, T. Anwar, B. Stantic, et al., Computing influence of a product through uncertain reverse skyline, Proceedings of the 29th International Conference on Scientific and Statistical Database Management (4) (2017) 1–12.

ED

M

[21] M. S. Islam, R. Zhou, C. Liu, On answering why-not questions in reverse skyline queries, in: Proceedings of the 29th IEEE International Conference on Data Engineering, IEEE, 2013, pp. 973–984.

PT

[22] Y. Gao, Q. Liu, G. Chen, L. Zhou, B. Zheng, Finding causality and responsibility for probabilistic reverse skyline query non-answers, IEEE Transactions on Knowledge and Data Engineering 28 (11) (2016) 2974– 2987.

CE

[23] M. S. Islam, C. Liu, Know your customer: computing k-most promising products for targeted marketing, The VLDB Journal 25 (4) (2016) 545– 570.

AC

[24] P. M. Deshpande, P. Deepak, Efficient reverse skyline retrieval with arbitrary non-metric similarity measures, in: Proceedings of the 14th International Conference on Extending Database Technology, ACM, 2011, pp. 319–330. [25] G. Wang, J. Xin, L. Chen, Y. Liu, Energy-efficient reverse skyline query processing over wireless sensor networks, IEEE Transactions on Knowledge and Data Engineering 24 (7) (2012) 1259–1275. 46

ACCEPTED MANUSCRIPT

[26] Y. Park, J.-K. Min, K. Shim, Parallel computation of skyline and reverse skyline queries using mapreduce, Proceedings of the VLDB Endowment 6 (14) (2013) 2002–2013.

CR IP T

[27] M. S. Islam, C. Liu, W. Rahayu, T. Anwar, Q+ tree: An efficient quad tree based data indexing for parallelizing dynamic and reverse skylines, in: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, ACM, 2016, pp. 1291–1300.

AN US

[28] S. Xu, Y. Lin, H. Xie, J. Lui, A provable algorithmic approach to product selection problems for market entry and sustainability, in: Proceedings of the 26th International Conference on Scientific and Statistical Database Management, ACM, 2014, p. 19.

[29] S. Borzsony, D. Kossmann, K. Stocker, The skyline operator, in: Data Engineering, 2001. Proceedings. 17th International Conference on, IEEE, 2001, pp. 421–430.

M

[30] H.-T. Kung, F. Luccio, F. P. Preparata, On finding the maxima of a set of vectors, Journal of the ACM (JACM) 22 (4) (1975) 469–476.

ED

[31] C. Li, B. C. Ooi, A. K. Tung, S. Wang, Dada: a data cube for dominant relationship analysis, in: Proceedings of the 2006 ACM SIGMOD international conference on Management of data, ACM, 2006, pp. 659–670.

PT

[32] C.-Y. Lin, J.-L. Koh, A. L. Chen, Determining k-most demanding products with maximum expected number of total customers, IEEE Transactions on Knowledge and Data Engineering 25 (8) (2013) 1732–1747.

CE

[33] Y. Peng, R. C.-W. Wong, Q. Wan, Finding top-k preferable products, IEEE Transactions on Knowledge and Data Engineering 24 (10) (2012) 1774–1788.

AC

[34] S. Xu, J. Lui, Product selection problem: improve market share by learning consumer behavior, ACM Transactions on Knowledge Discovery from Data (TKDD) 10 (4) (2016) 34. [35] X. Zhou, K. Li, G. Xiao, Y. Zhou, K. Li, Top k favorite probabilistic products queries, IEEE Transactions on Knowledge and Data Engineering 28 (10) (2016) 2808–2821. 47

ACCEPTED MANUSCRIPT

[36] X. Lin, Y. Yuan, Q. Zhang, Y. Zhang, Selecting stars: The k most representative skyline operator, in: Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on, IEEE, 2007, pp. 86–95.

CR IP T

[37] A. D. Sarma, A. Lall, D. Nanongkai, R. J. Lipton, J. Xu, Representative skylines using threshold-based preference distributions, in: Data Engineering (ICDE), 2011 IEEE 27th International Conference on, IEEE, 2011, pp. 387–398.

AN US

[38] Y. Tao, L. Ding, X. Lin, J. Pei, Distance-based representative skyline, in: Data Engineering, 2009. ICDE’09. IEEE 25th International Conference on, IEEE, 2009, pp. 892–903.

[39] M. Magnani, I. Assent, M. L. Mortensen, Taking the big picture: representative skylines based on significance and diversity, The VLDB journal 23 (5) (2014) 795–815.

M

[40] D. Papadias, Y. Tao, G. Fu, B. Seeger, An optimal and progressive algorithm for skyline queries, in: Proceedings of the 2003 ACM SIGMOD international conference on Management of data, ACM, 2003, pp. 467–478.

ED

[41] P. Godfrey, R. Shipley, J. Gryz, Maximal vector computation in large data sets, in: Proceedings of the 31st international conference on Very large data bases, VLDB Endowment, 2005, pp. 229–240.

PT

[42] Y. Tao, X. Xiao, J. Pei, Subsky: Efficient computation of skylines in subspaces, in: Data Engineering, 2006. ICDE’06. Proceedings of the 22nd International Conference on, IEEE, 2006, pp. 65–65.

AC

CE

[43] Z. Zhang, Y. Yang, R. Cai, D. Papadias, A. Tung, Kernel-based skyline cardinality estimation, in: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, ACM, 2009, pp. 509– 522.



Figure captions



Figure 1. Dynamic skyline and reverse skyline
Figure 2. Global skyline and an example of RSSA algorithm
Figure 3. Example of decision region
Figure 4. Example of effect region
Figure 5. Example of MBR
Figure 6. Example of pruning
Figure 7. Running example of the OneTraversal algorithm
Figure 8. Reverse skyline layers
Figure 9. Pruning capability
Figure 10. I/Os & processing time vs. dimensionality (RS query)
Figure 11. I/Os & processing time vs. cardinality (RS query)
Figure 12. Processing time vs. number of reported results
Figure 13. Distribution of reverse skyline layers
Figure 14. Processing time for the computation of reverse skyline layers (independent)
Figure 15. I/Os & processing time vs. dimensionality (top-k RS query)
Figure 16. I/Os & processing time vs. cardinality (top-k RS query)
Figure 17. I/Os vs. k



List of tables



Table 1. Frequent notations
Table 2. Parameters used in experiments
