Information Sciences 232 (2013) 449–463
Information Sciences journal homepage: www.elsevier.com/locate/ins
Skyline queries on keyword-matched data
Hyunsik Choi a, HaRim Jung a, Ki Yong Lee b, Yon Dohn Chung a,⇑
a Department of Computer Science and Engineering, College of Information and Communication, Korea University, Seoul 136-713, Republic of Korea
b Department of Computer Science, Sookmyung Women’s University, Seoul, Republic of Korea
Article info
Article history: Received 28 July 2009; received in revised form 9 January 2012; accepted 30 January 2012; available online 10 February 2012
Keywords: Information technology and system; Database management; Query processing; Spatial database; Textual database
Abstract
Given a set of d-dimensional tuples with textual descriptions, a keyword-matched skyline query retrieves a skyline computed from tuples whose textual descriptions contain all query words. For example, suppose a customer prefers cars with low mileage and low price, and finds a car equipped with ‘air bag’ and ‘sunroof’ in an online shop. In such a case, a keyword-matched skyline query is highly recommended. Although there are many applications for this type of query, to date there have not been any studies on keyword-matched skyline queries. In this paper, we define a keyword-matched skyline query and propose an efficient and progressive algorithm, named Keyword-Matched Skyline search (KMS). KMS utilizes the IR2-tree as an index structure. To retrieve a keyword-matched skyline, it performs nearest neighbor search in a branch and bound manner. While traversing the IR2-tree, KMS effectively prunes unqualified nodes by means of both spatial and textual information of nodes. To demonstrate the efficiency of KMS, we conducted extensive experiments in various settings. The experimental results show that KMS is very efficient in terms of computational cost and I/O cost. © 2012 Elsevier Inc. All rights reserved.
1. Introduction
Recently, skyline queries have emerged as an important query type for various applications involving multi-criteria decision making, because they are well suited to retrieving the best tuples. Given a set of d-dimensional tuples, a skyline is the set of tuples that are not dominated by any other tuple. A tuple tp is said to dominate another tuple tp′ when tp is not worse than tp′ on all dimensions and tp is better than tp′ on at least one dimension. Unlike queries that rank by a single criterion (e.g., top-k and nearest neighbor queries), skyline queries can return numerous results, depending on the data distribution and cardinality. However, many of these results are meaningless to the user, who then has to search again for the desired information. Therefore, skyline queries should take user preferences into account so that users can find interesting information effectively. In this paper, we use keywords as a new kind of preference for skyline queries. We assume that each d-dimensional tuple is additionally described by a textual attribute. Many internet applications manage an enormous volume of data that carries textual descriptions. We integrate skyline queries and keyword search in order to provide an improved means of searching for desired information. We introduce a keyword match as a preference predicate, and we propose skyline queries on
⇑ Corresponding author. doi:10.1016/j.ins.2012.01.045
Fig. 1. An example of keyword skyline (x-axis: mileage; y-axis: price; points a to k).
keyword-matched data. Words are among the most useful means of describing users’ preferences because they reflect human thinking well. Suppose that a dataset consists of d-dimensional tuples with textual information, and a query retrieves the skyline tuples whose textual descriptions contain specific query keywords. We call such queries keyword-matched skyline queries. For simplicity, we assume a textual attribute is a set of keywords, such as tags in Web 2.0 [23]. However, it can be replaced with a narrative description without any loss of generality.
Example 1. A customer wants to find a used car on a shopping web site. He/she prefers cars with both lower mileage and lower price and desires a car equipped with air bags for safety.
Example 2. Through an online hotel search service provided by a travel agency, a traveler is looking for a hotel which is cheap and close to the beach. In addition, he/she needs wireless internet and a babysitter service.
In such cases, a keyword-matched skyline query is highly recommended. Example 1 is shown in Fig. 1, where 2-dimensional points denote used cars in an online shop. A used car (a point) has two attributes: mileage (x-axis) and price (y-axis). The keywords enclosed by angle brackets (i.e., ⟨ ⟩) denote features (e.g., air bag, cruiser control, sunroof, etc.) with which cars are equipped. As mentioned above, a car tp dominates another car tp′ if tp is not larger¹ than tp′ on all axes and tp is smaller than tp′ on at least one axis. In this example, a keyword-matched skyline query returns a skyline computed from tuples whose textual descriptions contain the keyword ‘air bag’. That is, the keyword-matched skyline tuples for ‘air bag’ are {b, e, i}, connected by a dashed line in Fig. 1. Note that the keyword-matched skyline differs from the original skyline: the original skyline contains the points {a, e, i}, denoted as filled circles, and only the points {e, i} belong to both the keyword-matched skyline and the original skyline.
Although there are many applications for keyword-matched skyline queries, to date no efficient algorithms are known. Intuitively, this problem can be solved by combining a keyword search algorithm and a skyline algorithm. The combination approach first retrieves the keyword-matched tuples from the dataset, and then computes the skyline tuples among the keyword-matched tuples. This approach yields correct answers by Definition 4, as will be discussed in Section 3.1. With respect to skyline computation, several index-based algorithms (e.g., nearest neighbor search and branch and bound skyline search) [15,24] show the best performance among existing skyline algorithms.
However, they are not applicable to the combination approach since index structures cannot be constructed on the fly from the keyword-matched tuples. Therefore, we consider non-index-based skyline algorithms [2] (i.e., block-nested-loop and divide-and-conquer) for the skyline computation part of a keyword-matched skyline query, but we exclude the divide-and-conquer approach due to its scalability problem [3].
As a straightforward approach, we introduce an INverted-index-based Keyword Skyline algorithm (INKS), which combines an inverted index [9,33,25] and a block-nested-loop algorithm. Although INKS shows reasonable performance, it has two disadvantages. First, as mentioned above, the skyline computation of INKS cannot be assisted by any index structure because index structures cannot be constructed on the fly from the keyword-matched tuples retrieved through an inverted index, and non-index-based skyline algorithms perform worse than index-based ones. Second, the keyword search and the skyline computation are processed independently. In detail, INKS cannot discard non-skyline tuples (i.e., non-final results) during the keyword search. That is, all the tuples matched to the query keywords are passed to the skyline computation algorithm as input data. Therefore, INKS incurs a significant overhead during both the keyword search and the skyline computation.
In order to solve this problem, we propose an efficient algorithm, referred to as the Keyword-Matched Skyline algorithm (KMS). The proposed algorithm is based on an IR2-tree [8] as an index structure. To retrieve a keyword-matched skyline, it performs nearest neighbor search (NN) [26,15] in a branch and bound approach [12,24]. While traversing an IR2-tree, KMS effectively prunes unqualified nodes by means of both spatial and textual information of nodes. In addition, our algorithm is progressive (or online) [27] since it is based on the nearest neighbor search [15,24]. To demonstrate the performance and scalability, we carry out extensive experiments on real and synthetic datasets. The experimental results show that KMS is very efficient in terms of CPU and I/O costs. To the best of our knowledge, this paper is the first work that deals with skyline queries on keyword-matched data.
The rest of the paper is organized as follows. Section 2 presents some related work. In Section 3, we define the problem of the keyword-matched skyline and propose the KMS algorithm. In Section 4, we present experimental results. We present our conclusions in Section 5.
¹ Without loss of generality, we assume that lower values are better.
2. Related work
2.1. Skyline queries
In this section, we give a brief overview of skyline algorithms for static data. Kung et al. [16,17] proposed the first skyline algorithm, for what they referred to as the maximal vector problem, in a conventional processing framework. In the database context, Borzsonyi et al. [2] proposed two skyline algorithms, namely the block-nested loop (BNL) and the divide-and-conquer algorithm (D&C). Neither needs an index structure. During skyline processing, BNL maintains a list of skyline candidates in main memory. Each time BNL reads a tuple from a data file, it compares the tuple with the list of skyline candidates and updates the list by discarding whichever tuples are dominated, either candidates in the list or the tuple just read. Due to its simplicity, BNL is applicable to any application and any dimensionality. In addition, BNL provides reasonable performance for mid-range dimensionalities [3].
Kossmann et al. [15] discovered that, for any monotone scoring function M, a skyline can be computed efficiently by performing a nearest neighbor search according to M, and proved the correctness of this approach. Based on this observation, they proposed the nearest neighbor algorithm, referred to as NN. In NN, it is assumed that the dataset is indexed by an R-tree [10]. NN is the first study to deal with the skyline problem with a geometric method. Initially, NN performs a nearest neighbor query to find the point with the minimum distance (i.e., L1 norm) [26] to the origin of the data domain. Based on the found nearest neighbor point, NN divides the data space into 2d partitions, where d is dimensionality. Then, NN searches again for the nearest neighbor within each divided region except the dominance region, i.e., the region that the found nearest neighbor dominates. This process is iterated until no skyline tuples can be found within the divided regions. The found nearest neighbors become the result of the skyline query.
Also, once nearest neighbors are retrieved, they are guaranteed to be skyline points without further processing. Therefore, NN progressively returns results before the end of the process.
Papadias et al. [24] proposed the branch and bound skyline search algorithm (BBS). In the literature [29,28], BBS is regarded as the state of the art in skyline computation algorithms for static data. BBS also adopts a nearest neighbor search on an R-tree. BBS overcomes several disadvantages of NN, such as redundant skyline reports and an increasing to-do list size. Consequently, BBS is more efficient than NN in terms of CPU and I/O costs. For skyline retrieval, BBS traverses an R-tree and finds skyline tuples by means of the branch and bound approach [12]. While traversing the R-tree, BBS prunes nodes dominated by already-found skyline tuples. Also, BBS maintains a heap of the nodes that have to be visited, and the heap is sorted according to the minimum distance (MINDIST) [26] from the origin of the data space to the minimum bounding rectangle (MBR) [10] of each node. The search process terminates when the heap becomes empty.
So far, there have been many variations of skyline queries [20,11,22,13]. Recently, skyline queries have been gaining popularity in various environments, such as P2P and stream data [4,21,31,18]. In addition, recent research on skyline queries deals with low cardinality data and uncertain data [14,1,30,5].
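The observation shared by NN and BBS, that visiting points in ascending order of a monotone score lets each non-dominated point be reported immediately, can be illustrated with a minimal Python sketch. It is not the R-tree-based NN or BBS procedure: the index traversal is replaced by a plain sort on the L1 norm, and the names `dominates` and `skyline_by_l1_order` are hypothetical. The four sample points are the coordinates of b, e, i and j shown in Fig. 2.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    """p dominates q: no larger on every axis, strictly smaller on at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline_by_l1_order(points: Sequence[Point]) -> List[Point]:
    """Visit points in ascending L1 distance to the origin (lower is better).

    If q dominates p then sum(q) < sum(p), so every possible dominator of p
    is visited before p. A point that no reported skyline point dominates
    is therefore a skyline point and can be returned progressively.
    """
    result: List[Point] = []
    for p in sorted(points, key=sum):
        if not any(dominates(q, p) for q in result):
            result.append(p)
    return result


# Coordinates of b, e, i, j as listed in Fig. 2.
sample = [(3, 4), (9, 1), (1, 10), (6, 8)]
print(skyline_by_l1_order(sample))  # [(3, 4), (9, 1), (1, 10)]
```

The sort stands in for the heap ordered by MINDIST. An index-based method gains its efficiency from pruning whole subtrees, which this flat version cannot show.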
2.2. Information retrieval
IR systems have used several models [9], such as the boolean model, the probabilistic model, and the vector space model. In this paper, we use the boolean model, which is a simple but useful IR model based on boolean algebra. The boolean model determines the relevance of each data item solely according to whether it contains the query keywords or not. In the information retrieval field, there are a couple of well-known indexing methods [9,25], such as the inverted index [33] and the signature file [7,19]. The inverted index has been widely used as an index structure for text search, and it is regarded as the most adequate in practice for text search. An inverted index roughly consists of three components: an index file, a posting file, and a document file. An index file indexes distinct words through specific index structures, such as the B+-tree or
hash tables. In an index file, each word points to an entry of the posting file. Each document is distinguished by a unique identifier, which is referred to as docId. In the posting file, the docIds of the documents in which each distinct word occurs are stored in contiguous entries. Each entry is matched to a disk block such that a relatively small number of disk accesses is sufficient to read all docIds for each query keyword. A query is evaluated by obtaining all docIds for each query keyword and intersecting them according to the given query.
In a signature file, each word in the textual description field is hashed into a fixed-length signature (i.e., a bit string). A signature for a textual description is obtained by superimposing (i.e., ORing) the fixed-length signatures of the words in the textual description. A query signature is generated from the query words in the same manner. In retrieval, the signature of the textual description is compared to the query signature. Most of the textual data that do not match the query are eliminated by comparing signatures. However, because of the inherent characteristics of the hashing mechanism, there may be false drops, which are discarded by comparing the actual query terms. In general, the signature length affects the number of false drops: the longer the signature, the fewer false drops occur.
Recently, De Felipe et al. [8] proposed the IR2-tree, which is designed to index data consisting of both spatial and textual information. An IR2-tree is a modification of the R-tree, and it has two types of index information: MBRs for spatial data and signatures for textual data. In more detail, the lowest level node (i.e., a leaf node) of an IR2-tree contains both an MBR tightly enclosing an object and a signature of its terms.
On the other hand, the ith level node (i.e., a non-leaf node) contains both an MBR enclosing the MBRs of the (i − 1)th level nodes and a signature that is superimposed from the signatures of the (i − 1)th level nodes. Thus, the ith level node has a signature that includes all signatures of the subtree rooted at this node. Based on these two kinds of index information, the IR2-tree facilitates filtering of unqualified nodes in terms of spatial positions and textual information. The construction and maintenance methods of the IR2-tree are similar to those of the R-tree, which is a dynamic index structure. They can be obtained by modifying some of the key methods, such as chooseSubtree, splitNode, condenseTree and adjustNode, to consider not only MBRs but also signatures.
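The superimposed signatures described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the signature scheme of the IR2-tree paper: the hash function (SHA-256), the 64-bit default length, the three bits set per word, and the names `word_sig`, `make_sig` and `sig_chk` are all choices made here for the example.

```python
import hashlib


def word_sig(word: str, length: int = 64, bits_per_word: int = 3) -> int:
    """Hash one word to a fixed-length bit string with a few bits set (assumed scheme)."""
    sig = 0
    for i in range(bits_per_word):
        digest = hashlib.sha256(f"{i}:{word}".encode("utf-8")).digest()
        sig |= 1 << (int.from_bytes(digest[:8], "big") % length)
    return sig


def make_sig(words, length: int = 64) -> int:
    """Superimpose (OR) the word signatures of a textual description."""
    sig = 0
    for word in words:
        sig |= word_sig(word, length)
    return sig


def sig_chk(query_sig: int, node_sig: int) -> bool:
    """True when every bit of the query signature is also set in the node signature."""
    return (query_sig & node_sig) == query_sig


# A description that contains the query word always passes the check.
leaf_1 = make_sig(["air bag", "cruiser control"])
leaf_2 = make_sig(["leather seats", "sun roof"])
query = make_sig(["air bag"])
print(sig_chk(query, leaf_1))           # True

# A non-leaf signature is the OR of its children's signatures,
# so it passes whenever any child passes.
parent = leaf_1 | leaf_2
print(sig_chk(query, parent))           # True
```

The check never rejects a description that contains all query words, but it may accept one that does not (a false drop), which is why an exact keyword comparison is still needed afterwards. Longer signatures make such collisions less likely.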
3. Skyline queries on keyword-matched data
In this section, we define a keyword-matched skyline and present our proposed algorithm. Section 3.1 provides the problem definition. Section 3.2 introduces a naive approach for a keyword-matched skyline. Section 3.3 describes the proposed algorithm, named Keyword-Matched Skyline search (KMS).
3.1. Problem definition
Let $D^d$ be a d-dimensional dataset, where d is the dimensionality. $D^d$ consists of a set of d-dimensional tuples. A d-dimensional tuple tp is defined as $\langle V, W \rangle$, where V is a value vector consisting of d numerical values (i.e., $V = (v_1, v_2, \ldots, v_d)$) and W is a set of strings (i.e., $W = \{w_1, w_2, \ldots, w_k\}$, where $w_j$ ($j = 1, \ldots, k$) is a keyword). In addition, $tp_i.V$ denotes the value vector of a tuple $tp_i$, and $tp_i.W$ indicates its set of keywords.
Definition 1. Let W be a set of query keywords. A tuple tp in $D^d$ is a keyword-matched tuple for W if and only if $\forall w \in W,\ w \in tp.W$. Let us express a keyword search operator as $Q_k(D^d, W)$.
In this paper, we use the boolean model [25] as the IR model, and thus the textual parts of keyword-matched tuples include all query keywords.
Definition 2. Let tp and tp′ be tuples in $D^d$, where $tp.V = (v_1, v_2, \ldots, v_d)$ and $tp'.V = (u_1, u_2, \ldots, u_d)$. Then, tp dominates tp′, denoted $tp \prec tp'$, if and only if $\forall i,\ v_i \le u_i$ and $\exists i,\ v_i < u_i$. Conversely, tp does not dominate tp′, denoted $tp \nprec tp'$, if and only if $\exists i,\ v_i > u_i$.
Definition 3. A tuple tp in $D^d$ is a skyline tuple if and only if $\forall tp' \in D^d,\ tp' \nprec tp$. Let the skyline operator be $Q_s(D^d)$.
The skyline operator $Q_s(D^d)$ retrieves the set of skyline tuples, which are not dominated by other tuples, from the given dataset $D^d$.
Definition 4. Let W be a set of query keywords and A be the set of keyword-matched tuples for W in $D^d$. A tuple tp in $D^d$ is a keyword-matched skyline tuple for W if and only if $tp \in A$ and $\forall tp' \in A,\ tp' \nprec tp$.
Given a set of query keywords W and a dataset $D^d$, by Definition 4, a keyword-matched skyline query, denoted as $Q_{ks}(D^d, W)$, retrieves the skyline computed over the tuples whose textual attributes contain all words of W. That is, we can derive the following equivalence rule:
$$Q_{ks}(D^d, W) \equiv Q_s(Q_k(D^d, W)) \qquad (1)$$
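Definitions 1 to 4 and rule (1) translate directly into code. The following Python sketch is only an illustration of the definitions, using a quadratic skyline test rather than any of the algorithms discussed in this paper. The class `Tup` and the function names are hypothetical. Tuples b, e, i and j use the values listed in Fig. 2, while the coordinates and keywords of tuple a are assumed for the example, chosen so that a dominates b as in Fig. 1.

```python
from typing import FrozenSet, List, NamedTuple, Tuple


class Tup(NamedTuple):
    name: str
    V: Tuple[float, ...]   # value vector
    W: FrozenSet[str]      # keyword set


def dominates(tp: Tup, other: Tup) -> bool:
    """Definition 2: tp is no larger on every dimension and smaller on at least one."""
    pairs = list(zip(tp.V, other.V))
    return all(v <= u for v, u in pairs) and any(v < u for v, u in pairs)


def q_k(data: List[Tup], query: FrozenSet[str]) -> List[Tup]:
    """Definition 1: tuples whose keyword set contains every query keyword."""
    return [tp for tp in data if query <= tp.W]


def q_s(data: List[Tup]) -> List[Tup]:
    """Definition 3: tuples not dominated by any tuple of the given set."""
    return [tp for tp in data if not any(dominates(o, tp) for o in data)]


def q_ks(data: List[Tup], query: FrozenSet[str]) -> List[Tup]:
    """Rule (1): skyline of the keyword-matched tuples."""
    return q_s(q_k(data, query))


features = frozenset({"air bag", "cruiser control"})
data = [
    Tup("a", (2, 2), frozenset({"leather seats", "sun roof"})),  # assumed values
    Tup("b", (3, 4), features),
    Tup("e", (9, 1), features),
    Tup("i", (1, 10), features),
    Tup("j", (6, 8), features),
]

print([tp.name for tp in q_s(data)])                           # ['a', 'e', 'i']
print([tp.name for tp in q_ks(data, frozenset({"air bag"}))])  # ['b', 'e', 'i']
```

The two results differ because b is dominated only by a, which does not match the keyword. This is why the skyline must be computed after the keyword filter rather than by intersecting the plain skyline with the keyword-matched tuples.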
Algorithm 1. INKS
1: procedure INKS($D^d$, W = {w1, w2, . . . , wn})
2:   K ← ∅ // an intermediate set of docIds
3:   S ← $Q_k(D^d, \{w_1\})$ // Retrieve corresponding docIds from the inverted index
4:   for i = 2 to n do // n is the number of terms
5:     K ← $Q_k(D^d, \{w_i\})$
6:     S ← intersect(S, K)
7:   end for
8:   O ← fetch(S) // Fetch data objects from the database
9:   R ← BNL(O)
10:  return R
11: end procedure
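Algorithm 1 can be sketched in Python as follows. This is an in-memory illustration under stated assumptions, not the authors' Java implementation: the inverted index and the database are plain dictionaries rather than disk-based files, the BNL window is assumed to fit in memory so a single pass suffices, and the names `inks` and `bnl` are hypothetical. The posting lists and object values are the ones shown in Fig. 2.

```python
from typing import Dict, List, Sequence, Set, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def bnl(objects: Dict[str, Point]) -> List[str]:
    """Block-nested-loop skyline with an in-memory window of candidates."""
    window: List[str] = []
    for doc_id, point in objects.items():
        if any(dominates(objects[c], point) for c in window):
            continue  # the tuple just read is dominated: discard it
        # Drop candidates that the new tuple dominates, then keep the new tuple.
        window = [c for c in window if not dominates(point, objects[c])]
        window.append(doc_id)
    return window


def inks(index: Dict[str, Set[str]], database: Dict[str, Point],
         query: Sequence[str]) -> List[str]:
    """Keyword search phase (intersect posting lists), then skyline phase (BNL)."""
    s = set(index.get(query[0], set()))
    for word in query[1:]:
        s &= index.get(word, set())
    fetched = {doc_id: database[doc_id] for doc_id in sorted(s)}
    return bnl(fetched)


# Posting lists and object values as shown in Fig. 2.
index = {
    "air bag": {"b", "e", "f", "i", "j", "k"},
    "cruiser control": {"b", "d", "e", "i", "j"},
}
database = {"b": (3, 4), "e": (9, 1), "i": (1, 10), "j": (6, 8)}

print(inks(index, database, ["air bag", "cruiser control"]))  # ['b', 'e', 'i']
```

Only the four intersected docIds are fetched, so the database dictionary needs no entries for the other objects. Every fetched tuple still goes through BNL, which is the overhead that motivates KMS.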
3.2. Inverted-index-based keyword skyline search
As a straightforward approach, we introduce an inverted-index-based keyword skyline search (INKS) method, which combines an inverted index [9,33,25] and the block-nested-loop (BNL) algorithm [2]. The inverted index has been widely used as an index structure for text search, and is recognized as the most adequate in practice for text search. To process a keyword-matched skyline query, INKS proceeds in two phases: (1) a keyword search and (2) a skyline search. As shown in Fig. 2, in the keyword search phase INKS performs simple keyword matching through the inverted index: it obtains all docIds for each query keyword by reading both the index file and the posting file. Next, INKS intersects the docId sets according to the given query, and then it fetches the data objects corresponding to the intersected docIds from the database. In the skyline search phase, INKS passes the fetched data objects to BNL, which computes the skyline tuples. Algorithm 1 shows this process in detail.
3.3. Keyword-matched skyline search algorithm
In this section, we present the proposed algorithm, the Keyword-Matched Skyline search algorithm (KMS). KMS performs nearest neighbor (NN) search by using the branch and bound technique [12]. In order to store data consisting of both spatial information and textual descriptions, KMS makes use of an IR2-tree. We consider the value vector of a tuple as a point in d-dimensional space. For each tuple, the value vector is indexed by an MBR and its keywords are indexed by a signature. Fig. 3 shows
Fig. 2. Inverted-index-based keyword skyline search. (1) Keyword match phase: the docIds for ‘air bag’ {b, e, f, i, j, k} and ‘cruiser control’ {b, d, e, i, j} are retrieved from the inverted index and intersected, giving {b, e, i, j}. (2) Skyline search phase: the data objects b(3,4), e(9,1), i(1,10) and j(6,8) are fetched from the database and BNL computes the keyword-matched skyline {b, e, i}.
Fig. 3. IR2-tree.
the organization of an IR2-tree based on the example in Fig. 1. In Fig. 3a, each rectangle represents an MBR. In Fig. 3b, a bit string in each entry denotes a signature. For query processing, KMS traverses the IR2-tree. While traversing the IR2-tree, KMS examines two conditions, the dominance check and the signature check, at each visit of a node. The dominance check examines whether a node is dominated by any of the already-found skyline tuples. The signature check investigates whether the signature of a node includes the query keywords by using the following operation:
$$\mathrm{sig\_chk}(r, s) = \begin{cases} \text{true} & \text{if } r = (r \wedge s) \\ \text{false} & \text{otherwise} \end{cases} \qquad (2)$$
where r denotes the query signature and s indicates the signature of a node. If at least one of the two conditions for node e is not satisfied, node e is discarded immediately because none of its descendants can belong to the keyword-matched skyline.
Here, we explain the KMS algorithm in detail. For simplicity, we assume a 2D dataset (i.e., d = 2), but KMS is sufficiently general for datasets of higher dimensionality (i.e., d > 2). First, KMS initializes a list S that will contain the keyword-matched skyline tuples; S is used for dominance checks during query processing. Also, KMS creates an empty heap H to hold the entries to be visited, and the heap is sorted in ascending order according to MINDIST [26] between the origin of the data space and an MBR (or a point). MINDIST is the L1 norm, namely the sum of the two coordinates of a point if the node is a leaf node, or the sum of the coordinates of the MBR’s lower-left corner if the node is a non-leaf node. Consequently, KMS visits entries in ascending order according to their distance to the origin of the d-dimensional space. Also, KMS generates the query signature r by using the make_sig operator, which hashes the query terms and superimposes their hash values. KMS inserts the entries of the root node into the heap H. Then, KMS iteratively performs the following steps, referred to as the spread process, until the heap H is empty. KMS removes the top entry e from H, and then checks the two conditions (i.e., dominance check and signature check) for e. If at least one check fails, e is discarded immediately. Otherwise, according to whether the node e is a non-leaf node or a leaf node, the algorithm proceeds as follows:
Fig. 4. Heap contents.
Algorithm 2. KMS
1: procedure KMS(W = {w1, w2, . . . , wn})
2:   init S // initializing empty set
3:   init H // initializing empty heap
4:   r ← make_sig(W) // Getting signature
5:   H.add(root)
6:   while H ≠ ∅ do
7:     e ← H.remove() // remove a top entry
8:     if sig_chk(r, e.s) and ∀e′ ∈ S, e′ ⊀ e then
9:       if e is non-leaf then
10:        for each child ei in e do
11:          if sig_chk(r, ei.s) and ∀e′ ∈ S, e′ ⊀ ei then
12:            H.add(ei)
13:          end if
14:        end for
15:      else // if e is leaf
16:        if ∀w ∈ W, w ∈ e.W then
17:          S.add(e)
18:        end if
19:      end if
20:    end if
21:  end while
22: end procedure
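Algorithm 2 can be sketched in Python over a small hand-built tree. The sketch makes several assumptions that are not in the paper: the tree is an in-memory object graph rather than a paged IR2-tree, each node stores only the lower-left corner of its MBR (which is all that MINDIST and the dominance check need here), signatures use one SHA-256-derived bit per word in a 64-bit string, and the class `Entry`, the grouping into nodes n1 to n3, and the values of tuple a are hypothetical. Tuples b, e, i and j follow Fig. 2.

```python
import hashlib
import heapq
from itertools import count

SIG_BITS = 64  # assumed signature length


def make_sig(words):
    """Superimpose (OR) one hashed bit per word (assumed hashing scheme)."""
    sig = 0
    for word in words:
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        sig |= 1 << (int.from_bytes(digest[:8], "big") % SIG_BITS)
    return sig


def sig_chk(r, s):
    return (r & s) == r


def dominates(u, v):
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))


class Entry:
    """Leaf entry (point plus keywords) or non-leaf entry (list of children)."""

    def __init__(self, name, point=None, words=(), children=()):
        self.name = name
        self.children = list(children)
        if self.children:
            dims = range(len(self.children[0].low))
            # Lower-left corner of the MBR and OR of the child signatures.
            self.low = tuple(min(c.low[i] for c in self.children) for i in dims)
            self.words = frozenset()
            self.sig = 0
            for child in self.children:
                self.sig |= child.sig
        else:
            self.low = tuple(point)
            self.words = frozenset(words)
            self.sig = make_sig(self.words)


def kms(root, query):
    query = frozenset(query)
    r = make_sig(query)
    S = []                      # keyword-matched skyline found so far
    tie = count()               # tie-breaker so the heap never compares entries
    H = [(sum(root.low), next(tie), root)]

    def dominated(entry):
        return any(dominates(t.low, entry.low) for t in S)

    while H:
        _, _, e = heapq.heappop(H)          # entry with the smallest MINDIST
        if not sig_chk(r, e.sig) or dominated(e):
            continue
        if e.children:                      # non-leaf: spread to qualifying children
            for child in e.children:
                if sig_chk(r, child.sig) and not dominated(child):
                    heapq.heappush(H, (sum(child.low), next(tie), child))
        elif query <= e.words:              # leaf: exact match removes false drops
            S.append(e)
    return S


features = ("air bag", "cruiser control")
a = Entry("a", (2, 2), ("leather seats", "sun roof"))  # assumed values
b, e_, i, j = (Entry("b", (3, 4), features), Entry("e", (9, 1), features),
               Entry("i", (1, 10), features), Entry("j", (6, 8), features))
root = Entry("root", children=[Entry("n1", children=[a, b]),
                               Entry("n2", children=[i, j]),
                               Entry("n3", children=[e_])])

print([t.name for t in kms(root, {"air bag"})])  # ['b', 'e', 'i']
```

Because the exact keyword match at the leaves removes any false drop, the result does not depend on whether the signature of a happens to collide with the query signature. The tuples are appended to S in ascending MINDIST order, which is what makes the output progressive.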
If e is a non-leaf node, KMS obtains the child entries of e and checks the two conditions for each of them. Each child entry that satisfies both conditions is added to the heap H. Otherwise, it is discarded immediately.
If e is a leaf node, e is guaranteed not to be dominated by any keyword-matched tuple. However, e must be tested by an exact keyword match because there may be false positives, i.e., a signature test cannot completely guarantee that the textual data of a node (or of its descendants, for a non-leaf node) contain all the given query keywords. If an entry e contains all query keywords, it belongs to the keyword-matched skyline. Thus, KMS adds e to S.
When the heap H becomes empty, the spread process terminates. During the spread process, every tuple added to S is guaranteed to be a keyword-matched skyline tuple. In other words, the algorithm can progressively return keyword-matched skyline tuples before the end of the process. This algorithm is shown in Algorithm 2.
Suppose that a keyword-matched skyline query for ‘air bag’ is issued by a user, based on the dataset in Fig. 3. First, KMS makes a query signature for ‘air bag’ (e.g., 0001 0001 0100 1000)². KMS adds entries e6 and e7 of the root to H. Next, the spread process begins by removing the top entry e6 from the heap H. KMS tries to add the child entries (e1, e2, e3) of e6, but e2 is discarded because it does not match the query signature. In other words, none of the child nodes of e2 has ‘air bag’. Thus, only e1 and e3 are added to H. Again, KMS removes the top entry e1 from H, and tries to spread e1 to its child entries a and b. Here, a is discarded because of a signature check failure. Thus, KMS adds only b to H. Next, KMS removes b from H and adds b to S since the tuple b passes the two checks and it is a leaf entry, i.e., the tuple b belongs to the keyword-matched skyline. Again, KMS removes e7 from H and adds its child e4 to H, and discards e5 because it is dominated by b.
The spread process continues in the same manner until the heap H is empty. During the remainder of the spread process, tuples g and h are
The signature can be generated by an appropriate hash function on the attributes of the objects [6].
discarded by a signature check failure, and both e and i are added to S. Finally, f is discarded because it is dominated by e, which was inserted into S earlier. The changes of the heap contents during the spread process are shown in Fig. 4. In the figure, the entries pruned by signature checks are shown with diagonal lines, and an entry pruned by a dominance check is shown with a horizontal line. In the second column of the figure, each element ⟨ei, m⟩ represents an entry ei in the heap and its MINDIST m.
Theorem 1. All tuples added to S are keyword-matched skyline tuples.
Proof. Assume to the contrary that a tuple tpi was added to S, but tpi does not belong to the keyword-matched skyline. There are three cases: tpi is dominated by some tuple in S, tpi does not contain all of the query keywords, or both. If tpi is dominated by a skyline tuple, there is at least one tuple dominating tpi. Assume there is a tuple tpj that dominates tpi. Then, tpj is not larger than tpi on all axes, and is smaller than tpi on at least one axis, which implies MINDIST(tpj) < MINDIST(tpi). Thus, tpj was added to S prior to tpi because the heap H guarantees that KMS visits nodes in ascending order according to their distance to the origin of the d-dimensional space. Therefore, the tuple tpi should have been pruned by tpj. This contradicts the fact that tpi was added to S. Also, if tpi does not contain all query keywords, then tpi fails either the signature test sig_chk(r, tpi.s) or the exact keyword match, so tpi should have been pruned. This again contradicts the fact that tpi was added to S. Consequently, by contradiction, Theorem 1 is proved. □
4. Experimental evaluation
We evaluate the performance of KMS compared to INKS and BBS [24]. In all experiments, we focus on both CPU time and the number of page accesses. We use real and synthetic datasets in the experiments.
Each dataset consists of d-dimensional tuples whose domain is normalized to [0,1]. We conduct various experiments on each dataset. We use various signature lengths l in the range [32, 768] bits and various numbers of query keywords k in the range [1, 5]. In experiments on the synthetic dataset, we additionally examine the performance with respect to value distribution, cardinality, dimensionality, signature length and skew factor of the word distribution. With BBS, a keyword-matched skyline query is processed by filtering out, at each leaf node, the tuples that do not match the query words, because BBS uses the R-tree, whose intermediate nodes contain only MBRs and pointers to their child nodes. Thus, while visiting intermediate nodes, BBS is not able to prune unqualified nodes whose descendant nodes do not contain the query keywords. The comparison between KMS and BBS clearly shows the pruning power of the signatures that reside in intermediate nodes of the IR2-tree. The experiments were carried out on a system with a 2.4 GHz AMD64 CPU, 4 GB of memory, and a 320 GB HDD (SATA2, 7200 RPM). We implemented all the algorithms and data structures in Java. The page size, on which the size of an IR2-tree node and of an inverted index entry depends, is matched to the disk block size (i.e., 4096 bytes).
4.1. Real data
For a real dataset, we collected sets of items from eBay Motors (http://www.motors.ebay.com). We developed a simple crawler using the eBay API in order to gather data. We gathered all categories, such as minivan, SUV, and sedan, from the US and Canada eBay Motors sites. The value attributes used in the experiments are ‘Buy it price’ (i.e., price fixed by seller) and ‘mileage’. Each item is described by various features (e.g., air bag, leather seat, CD player, etc.). We consider the features as keywords that describe items. The real dataset has the following characteristics. The dataset consists of 15,332 items, and there are 1557 distinct words for items.
The word count for each item is in the range [1, 13], and the average word count for each item is approximately 5. Also, the words have a distribution similar to the Zipf distribution [32]. Fig. 5a shows the distribution of these words. Here, the x-axis represents distinct words sorted in descending order of frequency, and the y-axis denotes their frequency. The points in Fig. 5b represent the cars. Here, the coordinates of the points indicate their attributes (i.e., mileage and price).
Figs. 6 and 7 show the results of experiments on this real dataset with respect to various numbers of query keywords and various signature lengths, respectively. In both experiments, KMS significantly outperforms both INKS and BBS. KMS is one order of magnitude faster, as shown in Figs. 6a and 7a. Also, KMS involves far fewer page accesses than the other methods, as shown in Figs. 6b and 7b. These results show that KMS effectively prunes unqualified nodes during query processing. However, the results of this experiment do not match our expectation that an increase in the number of query keywords (or in the signature length) leads to a reduction of CPU time and fewer page accesses. Instead, the results indicate that both CPU time and page accesses increase. This is because the dataset is too small for a proper examination of the capabilities of the algorithms. In contrast, as shown in Fig. 6b, the number of page accesses of INKS decreases as the number of query keywords increases. This is because the number of intersected docIds decreases, which reduces the number of tuples fetched (see Lines 6 and 8 in Algorithm 1).
Fig. 5. Data collected from eBay motors.
Fig. 6. Varying the numbers of query keywords on real data.
4.2. Synthetic data

In order to evaluate the effects of various parameters, we generated two synthetic datasets, (i) an independent dataset and (ii) an anti-correlated dataset, as shown in Fig. 9, each with 10,000 distinct words. Independent and anti-correlated distributions are commonly used in the skyline literature [2,15]. In the independent dataset, the dimensions are independent of each other, whereas in the anti-correlated dataset the dimensions are anti-correlated with each other. For example, in a 2-dimensional anti-correlated dataset, if a tuple has a low value in one dimension, its value in the other dimension is likely to be high. We vary the cardinality N in the range [100k, 1M], the dimensionality d in the range [2, 5], the signature length l in the range [32, 768] bits, the number of query words k in the range [1, 5], and the Zipf skew factor h of the word distribution in the range [0, 1]. Each tuple is described by 6 words. Fig. 8 summarizes these experimental parameters; their default settings are typeset in boldface. In each experiment, we vary a single parameter while the other parameters keep their default values.
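The following Python sketch illustrates one way such datasets could be generated. It is not the generator used in the experiments or the one described in [2]: the anti-correlated points are produced by a simplified construction (points scattered around the hyperplane on which the coordinates sum to d/2), and the function names, the `spread` parameter, and the rejection step are assumptions made for illustration. Words are drawn with probability proportional to 1/rank^h, so that h = 0 yields a uniform distribution.

```python
import random
from itertools import accumulate
from typing import FrozenSet, List, Sequence, Tuple

Point = Tuple[float, ...]


def independent_point(d: int, rng: random.Random) -> Point:
    """Each coordinate is drawn uniformly and independently from [0, 1)."""
    return tuple(rng.random() for _ in range(d))


def anti_correlated_point(d: int, rng: random.Random, spread: float = 0.05) -> Point:
    """Simplified anti-correlated point: coordinates average to about 0.5.

    A low value in one dimension is compensated by higher values elsewhere.
    Candidates that leave the unit cube are rejected and redrawn.
    """
    while True:
        center = rng.gauss(0.5, spread)
        u = [rng.random() for _ in range(d)]
        mean_u = sum(u) / d
        point = tuple(center + ui - mean_u for ui in u)
        if all(0.0 <= x <= 1.0 for x in point):
            return point


def zipf_words(vocab_size: int, skew: float, count: int, rng: random.Random,
               cum_weights: Sequence[float]) -> FrozenSet[int]:
    """Draw `count` distinct word ids; word of rank r has weight 1 / r**skew."""
    chosen = set()
    while len(chosen) < count:
        chosen.add(rng.choices(range(vocab_size), cum_weights=cum_weights, k=1)[0])
    return frozenset(chosen)


def generate_dataset(n: int, d: int, distribution: str, vocab_size: int = 10_000,
                     skew: float = 0.8, words_per_tuple: int = 6,
                     seed: int = 0) -> List[Tuple[Point, FrozenSet[int]]]:
    """Return n (point, words) pairs for 'independent' or 'anti-correlated' data."""
    rng = random.Random(seed)
    cum = list(accumulate(1.0 / (rank ** skew) for rank in range(1, vocab_size + 1)))
    make_point = independent_point if distribution == "independent" else anti_correlated_point
    return [
        (make_point(d, rng), zipf_words(vocab_size, skew, words_per_tuple, rng, cum))
        for _ in range(n)
    ]


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Sample Pearson correlation, used here only to inspect the generated data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5
```

For 2-dimensional data, `pearson` applied to the two coordinates should be close to zero for the independent dataset and clearly negative for the anti-correlated one.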
Fig. 7. Varying signature sizes (bit) on real data.
Fig. 8. Experiment parameters.
Fig. 9. Synthetic data: (a) independent; (b) anti-correlated.
4.2.1. The effect of cardinality

In this section, we evaluate the scalability with respect to the cardinality N of the dataset. We conduct experiments with various cardinalities N ∈ [100k, 1M], and set d = 3, k = 2, l = 384 and h = 0.8. As shown in Fig. 10, KMS generally outperforms INKS, and the performance difference increases rapidly as the cardinality increases. This shows that KMS performs effectively on large datasets. However, the difference between KMS and INKS is smaller for the anti-correlated dataset than for the independent dataset. Specifically, KMS is outperformed by INKS when 100k ≤ N < 200k in the case of the anti-correlated dataset. This is because the heap of KMS is much larger for the anti-correlated dataset than for the independent dataset, and a huge heap results in high overhead for dominance checks (refer to Lines 8 and 11 in Algorithm 2). In contrast, in INKS, the inverted index finds the set of keyword-matched tuples regardless of the distribution of tuple values, and BNL works well with a small number of tuples although its complexity is O(n²) in the worst case. However, KMS outperforms INKS as the cardinality increases. Fig. 11 shows that KMS is one order of magnitude faster than INKS in all cases. We also conducted experiments with BBS, but BBS did not terminate when N > 200k because of highly intensive I/O operations. Therefore, we omit the results of BBS in the following experiments.

4.2.2. The effect of dimensionality

Dimensionality is important for skyline-like queries because the skyline problem considers multi-dimensional attributes, and increasing the dimensionality increases the number of tuples retrieved [2].
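The growth of the skyline with dimensionality can be observed with a small Python sketch. It draws independent uniform points, computes the skyline of their first d coordinates with a simple block-nested-loop pass, and reports the skyline size for each d. The function names, the dataset size, and the seed are illustrative assumptions and are unrelated to the experimental settings of this paper; smaller values are assumed to be preferable.

```python
import random
from typing import Dict, List, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    """p dominates q if p is no worse everywhere and strictly better somewhere."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points: List[Point]) -> List[Point]:
    """Block-nested-loop skyline kept entirely in memory."""
    window: List[Point] = []
    for p in points:
        if any(dominates(w, p) for w in window):
            continue  # p is dominated, so it cannot be a skyline point
        # Drop candidates that p dominates, then keep p as a candidate.
        window = [w for w in window if not dominates(p, w)]
        window.append(p)
    return window


def skyline_sizes(n: int, max_d: int, seed: int = 0) -> Dict[int, int]:
    """Skyline size of the same n points projected onto their first d dimensions."""
    rng = random.Random(seed)
    data = [tuple(rng.random() for _ in range(max_d)) for _ in range(n)]
    return {d: len(skyline([p[:d] for p in data])) for d in range(2, max_d + 1)}


if __name__ == "__main__":
    for d, size in skyline_sizes(n=500, max_d=5, seed=1).items():
        print(f"d = {d}: {size} skyline points")
```

Because a point that is undominated in the first d dimensions remains undominated when a further dimension is added (assuming no ties), the reported sizes do not decrease as d grows. A larger skyline in turn means more dominance checks per candidate.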
Fig. 10. Cardinality vs. processing cost.
Fig. 11. Cardinality vs. number of page accesses.
Fig. 12. Dimensionality vs. CPU time.
In this experiment, we study the effect of dimensionality. We vary the dimensionality d in the range [2, 5], while we set N = 1M, k = 2, l = 384 and h = 0.8. As shown in Figs. 12 and 13, KMS generally performs better than INKS, although the CPU time of KMS increases with increasing dimensionality. The reasons are as follows: (i) the IR2-tree, like the R-tree, deteriorates with increasing dimensionality, and (ii) because the number of keyword-matched skyline tuples grows, the cost of dominance checks grows with increasing dimensionality. In contrast, BNL, on which INKS is based, is relatively insensitive to the dimensionality. As mentioned in [15], however, datasets with up to 5 dimensions are sufficient for most applications, and skyline queries on 2-dimensional datasets are the most common.

Fig. 13. Dimensionality vs. number of page accesses.

4.2.3. The effect of the number of query keywords

In order to study the effect of the number of query keywords, we vary the number of query keywords k in the range [1, 5]. Here, we set N = 1M, d = 3, l = 384 and h = 0.8. In this experiment, KMS generally outperforms INKS in terms of both CPU time and I/O cost. Fig. 14 shows that the difference between KMS and INKS increases rapidly as the number of query keywords increases, especially in the case of the anti-correlated dataset. This is because the more query keywords are given, the more selective the query signature r becomes, and thus the more entries that do not match the query signature are pruned early. Fig. 15 shows that the number of page accesses of INKS decreases when k ≤ 3 due to the decrease in the number of tuples read. However, the number of page accesses of INKS increases when k > 3 because the cost of fetching and intersecting docIds outweighs the savings from the reduced number of tuples read.

4.2.4. The effect of signature length

In this experiment, we study the effect of signature length. There is a trade-off between the probability of a false alarm and the signature length. The longer the signature, the lower the probability of a false alarm in keyword matching, which improves the filtering accuracy. However, increasing the signature length makes the index larger. Hence, it is important to find the signature length that best suits a given purpose. Here, we set N = 1M, d = 3, k = 2, and h = 0.8, and we vary the signature length l in the range [32, 768]. As shown in Figs. 16 and 17, the performance of KMS degrades when l < 256 because of frequent false positives. This is more apparent in the case of the anti-correlated dataset. In particular, KMS does not terminate when l ≤ 32 in the case of the anti-correlated dataset, so we omit that result. When 256 ≤ l < 768, KMS performs efficiently.
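The trade-off between signature length and false alarms can be illustrated with a short Python sketch of superimposed coding. Each word sets a few bit positions in an l-bit signature, the signatures of all words of a tuple are OR-ed together, and a query may match only if every query bit is set. The hashing scheme (SHA-256 with three bits per word), the function names, and the simulation parameters are assumptions chosen for illustration; they are not the signature extraction method used in the experiments. The sketch measures tuple-level signatures only, whereas signatures of intermediate nodes superimpose many more words and therefore saturate faster.

```python
import hashlib
import random
from functools import lru_cache
from typing import Iterable


@lru_cache(maxsize=None)
def word_signature(word: str, length: int, bits_per_word: int = 3) -> int:
    """Map a word to an integer bitmask with up to `bits_per_word` bits set."""
    sig = 0
    for i in range(bits_per_word):
        digest = hashlib.sha256(f"{i}:{word}".encode("utf-8")).digest()
        sig |= 1 << (int.from_bytes(digest[:8], "big") % length)
    return sig


def superimpose(words: Iterable[str], length: int) -> int:
    """OR the word signatures together (superimposed coding)."""
    sig = 0
    for word in words:
        sig |= word_signature(word, length)
    return sig


def may_contain(entry_sig: int, query_sig: int) -> bool:
    """True if every query bit is set; false positives are possible."""
    return (entry_sig & query_sig) == query_sig


def false_positive_rate(length: int, n_docs: int = 2000, words_per_doc: int = 6,
                        vocab_size: int = 10_000, k: int = 2, seed: int = 0) -> float:
    """Fraction of non-matching tuples that the signature test fails to reject."""
    rng = random.Random(seed)
    false_positives = negatives = 0
    for _ in range(n_docs):
        doc = [f"w{i}" for i in rng.sample(range(vocab_size), words_per_doc)]
        query = [f"w{i}" for i in rng.sample(range(vocab_size), k)]
        if set(query) <= set(doc):
            continue  # a true match, not relevant for the false-positive rate
        negatives += 1
        if may_contain(superimpose(doc, length), superimpose(query, length)):
            false_positives += 1
    return false_positives / negatives


if __name__ == "__main__":
    for length in (32, 128, 384, 768):
        print(length, false_positive_rate(length))
```

The signature test never rejects a true match, because the bits of a contained word are always a subset of the bits of the tuple signature. Short signatures fill up quickly, so non-matching entries pass the test more often and fewer entries are pruned.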
Fig. 14. Number of query keywords vs. CPU time.
Fig. 15. Number of query keywords vs. number of page accesses.
Fig. 16. Signature size (bit) vs. CPU time.
Fig. 17. Signature size (bit) vs. number of page accesses. Each panel plots page accesses (y-axis) against the length of signature (x-axis, 0 to 768) for KMS and INKS.
Fig. 18. Index size according to the length of a signature.
Fig. 19. Skewness vs. CPU time.
In this experiment, we observed that in KMS, when l > 384, one node occupies more than one page block. This increases the index size, as shown in Fig. 18, and the larger index incurs more page accesses. As a result, the number of page accesses deteriorates when l > 640, as shown in Fig. 17. In addition, Fig. 18 includes the bulk insertion time. It does not directly affect the performance of KMS, since the IR2-tree is a dynamic index and we assume that the dataset is already loaded in the database; however, it gives an indication of the maintenance cost of the IR2-tree. Based on this experiment, we chose 384 bits as the optimal signature length in terms of both filtering accuracy and index size.
Fig. 20. Skewness vs. number of page accesses.
4.2.5. The effect of word distribution

In this experiment, we evaluate the sensitivity of the proposed algorithm to the word distribution. We vary the Zipf skew factor h in the range [0.0, 1.0], and fix N = 1M, d = 3, k = 2, and l = 384. A skew factor of 0.0 corresponds to a uniform distribution. Regardless of the skew factor h, KMS outperforms INKS in both experiments, as shown in Fig. 19. In particular, KMS is almost insensitive to the skew factor in the case of the independent dataset. In Fig. 20, there is a reversal between KMS and INKS. When h < 0.4, the number of keyword-matched tuples may be small. In such cases, the inverted index and BNL work very fast, whereas KMS must visit a number of non-leaf nodes despite the few results. However, KMS gradually starts to outperform INKS when h > 0.4. In terms of I/O cost, INKS reads all tuples matched to each query keyword. In contrast, KMS reads only those nodes whose signatures include the query signature and which are not dominated by any tuple in S. For these reasons, KMS involves fewer page accesses than INKS when the word distribution is skewed.

5. Conclusions

In this paper, we defined the keyword-matched skyline problem and proposed an efficient and progressive algorithm, referred to as the Keyword-Matched Skyline search algorithm (KMS). To the best of our knowledge, this work is the first attempt to address skyline queries on keyword-matched data. The proposed algorithm retrieves keyword-matched skyline tuples by using the nearest-neighbor search technique on an IR2-tree. While traversing the tree, KMS effectively prunes unqualified nodes by means of the spatial and textual information stored in the IR2-tree nodes. In order to evaluate the performance, we conducted extensive experiments in various settings. The experimental results demonstrated that KMS is very efficient with respect to both computational and I/O costs.
In particular, the scalability of KMS with respect to the cardinality of the data is useful for very large databases. Moreover, its insensitivity to word distributions shows that the proposed algorithm performs effectively in diverse environments. In future work, we plan to extend our proposal to deal with not only keyword matching but also exact and range matching for numerical attributes.

Acknowledgement

This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (No. 2010-0025218).

References

[1] M.J. Atallah, Y. Qi, Computing all skyline probabilities for uncertain data, in: Proceedings of the 28th Symposium on Principles of Database Systems, ACM, Rhode Island, USA, 2009, pp. 279–287.
[2] S. Börzsönyi, D. Kossmann, K. Stocker, The skyline operator, in: Proceedings of the 17th International Conference on Data Engineering, Heidelberg, Germany, 2001, pp. 421–430.
[3] J. Chomicki, P. Godfrey, J. Gryz, D. Liang, Skyline with presorting, in: Proceedings of the 19th International Conference on Data Engineering, Bangalore, India, 2003, pp. 717–816.
[4] B. Cui, L. Chen, L. Xu, H. Lu, G. Song, Q. Xu, Efficient skyline computation in structured peer-to-peer systems, IEEE Transactions on Knowledge and Data Engineering 21 (7) (2009) 1059–1072.
[5] X. Ding, X. Lian, L. Chen, H. Jin, Continuous monitoring of skylines over uncertain data streams, Information Sciences 184 (1) (2012) 196–214.
[6] U. Deppisch, S-tree: a dynamic balanced signature index for office retrieval, in: SIGIR86, Proceedings of the 9th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Pisa, Italy, 1986, pp. 77–87.
[7] C. Faloutsos, Signature files: design and performance comparison of some signature extraction methods, in: ACM International Conference on Management of Data, Austin, Texas, 1985, pp. 63–82.
[8] I.D. Felipe, V. Hristidis, N. Rishe, Keyword search on spatial databases, in: Proceedings of the 24th International Conference on Data Engineering, Cancun, Mexico, 2008, pp. 656–665.
[9] W.B. Frakes, R.A. Baeza-Yates, Information Retrieval: Data Structures & Algorithms, Prentice-Hall, 1992.
[10] A. Guttman, R-trees: a dynamic index structure for spatial searching, in: ACM International Conference on Management of Data, Boston, Massachusetts, 1984, pp. 47–57.
[11] J.S. Heo, K.Y. Whang, M.S. Kim, Y.R. Kim, I.Y. Song, The partitioned-layer index: answering monotone top-k queries using the convex skyline and partitioning-merging technique, Information Sciences 179 (19) (2009) 3286–3308.
[12] G.R. Hjaltason, H. Samet, Distance browsing in spatial databases, ACM Transactions on Database Systems 24 (2) (1999) 265–318.
[13] H. Im, S. Park, Group skyline computation, Information Sciences 188 (2012) 151–169.
[14] G.T. Kailasam, J. Lee, J.-W. Rhee, J. Kang, Efficient skycube computation using point and domain-based filtering, Information Sciences 180 (7) (2010) 1090–1103.
[15] D. Kossmann, F. Ramsak, S. Rost, Shooting stars in the sky: an online algorithm for skyline queries, in: Proceedings of the 28th International Conference on Very Large Data Bases, Hong Kong, China, 2002, pp. 275–286.
[16] H.T. Kung, On the computational complexity of finding the maxima of a set of vectors, in: Annual Symposium on Foundations of Computer Science, 1974, pp. 117–121.
[17] H.T. Kung, F. Luccio, F.P. Preparata, On finding the maxima of a set of vectors, Journal of the ACM 22 (4) (1975) 469–476.
[18] J. Lee, J. Kim, S. Hwang, Supporting efficient distributed skyline computation using skyline views, Information Sciences 14 (2011).
[19] D.L. Lee, Y.M. Kim, G. Patel, Efficient signature file methods for text retrieval, IEEE Transactions on Knowledge and Data Engineering 7 (3) (1995) 423–435.
[20] K.C. Lee, W.C. Lee, B. Zheng, H. Li, Y. Tian, Z-SKY: an efficient skyline query processing framework based on Z-order, VLDB Journal: Very Large Data Bases 19 (3) (2010) 333–362.
[21] M.D. Morse, J.M. Patel, W.I. Grosky, Efficient continuous skyline computation, Information Sciences 177 (17) (2007) 3411–3437.
[22] K. Mouratidis, M. Hadjieleftheriou, D. Papadias, Conceptual partitioning: an efficient method for continuous nearest neighbor monitoring, in: Proceedings of the ACM International Conference on Management of Data, Baltimore, Maryland, 2005, pp. 634–645.
[23] T. O'Reilly, What is web 2.0? Design patterns and business models for the next generation of software, 2005.
[24] D. Papadias, Y. Tao, G. Fu, B. Seeger, Progressive skyline computation in database systems, ACM Transactions on Database Systems (TODS) 30 (1) (2005) 41–82.
[25] B. Ricardo, R. Berthier, Modern Information Retrieval, Pearson Education Limited, England, 1999.
[26] N. Roussopoulos, S. Kelley, F. Vincent, Nearest neighbor queries, in: Proceedings of the ACM International Conference on Management of Data, NY, USA, 1995, pp. 71–79.
[27] K.L. Tan, P.K. Eng, B.C. Ooi, Efficient progressive skyline computation, in: International Conference on Very Large Data Bases, Rome, Italy, 2001, pp. 301–310.
[28] Y. Tao, D. Papadias, Maintaining sliding window skylines on data streams, IEEE Transactions on Knowledge and Data Engineering 18 (2) (2006) 377–391.
[29] P. Wu, D. Agrawal, O. Egecioglu, A.E. Abbadi, Deltasky: optimal maintenance of skyline deletions without exclusive dominance region generation, in: Proceedings of the 23rd International Conference on Data Engineering, Istanbul, Turkey, 2007, pp. 486–495.
[30] W. Zhang, X. Lin, Y. Zhang, W. Wang, J.X. Yu, Probabilistic skyline operator over sliding windows, in: Proceedings of the 25th International Conference on Data Engineering, Shanghai, China, 2009, pp. 1060–1071.
[31] L. Zhu, Y. Tao, S. Zhou, Distributed skyline retrieval with low bandwidth consumption, IEEE Transactions on Knowledge and Data Engineering 21 (3) (2009) 384–400.
[32] G.K. Zipf, Human Behaviour and the Principle of Least Effort: An Introduction to Human Ecology, Addison-Wesley, 1949.
[33] J. Zobel, A. Moffat, K. Ramamohanarao, Inverted files versus signature files for text indexing, ACM Transactions on Database Systems 23 (4) (1998) 453–490.