ARTICLE IN PRESS
Information Systems 32 (2007) 560–574 www.elsevier.com/locate/infosys
Enabling soft queries for data retrieval
Hwanjo Yu^a, Seung-won Hwang^b, Kevin Chen-Chuan Chang^c
^a Computer Science Department, University of Iowa, Iowa City, IA 52242, USA
^b Department of Computer Science and Engineering, POSTECH, Pohang, Korea
^c Computer Science Department, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Received 28 October 2005; received in revised form 26 January 2006; accepted 5 February 2006
Recommended by P. Loucopoulos
Abstract

Data retrieval, i.e., finding relevant data from large databases, has become a serious problem as myriad databases have been brought online on the Web. For instance, querying the for-sale houses in Chicago on realtor.com returns thousands of matching houses. Similarly, querying "digital camera" on froogle.com returns hundreds of thousands of results. Data retrieval is essentially an online ranking problem, i.e., ranking data results according to the user's preference effectively and efficiently. This paper proposes a new rank-query framework for effectively incorporating "user-friendly" rank-query formulation into "database (DB)-friendly" rank-query processing, in order to enable "soft" queries on databases. Our framework assumes, as the "back-end," the score-based ranking model for expressive and efficient query processing. On top of the score-based model, as the "front-end," we adopt an SVM-ranking mechanism providing intuitive and exploratory query formulation. In essence, our framework enables users to formulate queries simply by ordering some sample objects, while the system learns the "DB-friendly" ranking function F from the partial orders. Such learned functions can then be processed and optimized by existing database systems. We demonstrate the efficiency and effectiveness of our framework using real-life user queries and datasets: our results show that the system effectively learns quantitative ranking functions from qualitative user feedback with efficient online processing.
© 2005 Elsevier B.V. All rights reserved.

Keywords: Soft queries; Data retrieval
1. Introduction

As we move toward a digital world, information abounds everywhere; retrieving desired data thus becomes a ubiquitous challenge. In particular, with the widespread adoption of the Internet, myriad databases have been brought online, providing massive data

*Corresponding author. Tel.: +1 319 335 0734.
E-mail addresses:
[email protected] (H. Yu),
[email protected] (S.-w. Hwang),
[email protected] (K.C.-C. Chang).
through searchable query interfaces. (The July 2000 survey of [1] claims that there were 500 billion hidden "pages," or data objects, in 10^5 online sources.) While databases provide well-maintained, high-quality structured data, at this sheer scale users face the hurdle of searching and retrieving. This data retrieval problem, that of finding relevant data from large databases, has thus become a clear challenge. (By "retrieval," we intend to stress relevance-based matching, even for structured "data," much like text retrieval for
0306-4379/$ - see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.is.2006.02.001
Fig. 1. Examples of online search facilities for supporting data retrieval.
finding relevant documents.) To illustrate, Fig. 1 shows several example scenarios. Consider user Amy, who is looking for a house in Chicago. She searches realtor.com with a few constraints on city, price, beds, and baths, which returns 3581 matching houses. Similarly, when Amy searches froogle.com for "digital camera", she is again overwhelmed by a total of 746,000 matches. She will have to sift through and sort out all these matches. Or, Amy may realize that she must "narrow" her query; however, at this other extreme, which is equally undesirable, she may well get no hits at all. She will likely "oscillate" manually between these extremes before eventually managing to complete her data retrieval task, if at all. Relational databases offer little support for such retrieval tasks. Traditional Boolean-based query models like SQL are based on "hard" criteria (e.g., price < $100,000), while users often employ "soft" criteria reflecting their specific senses of "relevance" or "preference." Unlike flat Boolean results, such fuzzy criteria naturally call for ranking, to indicate how well the results match. Such ranking is essential for data retrieval: it orders answers according to their matching "scores." Thus, on one hand, there will not be too many matches, since ranking focuses users on the best matches. On the other hand, neither will there be no hits, since ranking will return even partial matches. While such ranking has been the norm for "text" retrieval [2] (e.g., search engines like Google), it is critically missing in relational database systems for supporting similar "data" retrieval. To enable such soft queries for data retrieval, we observe two major barriers. First, user-friendliness: the data retrieval system should be "user friendly," so that ordinary users can easily express their preferences. Note that, unlike traditional data management with
mostly "canned transactions" written by application developers, a data retrieval system must accommodate ordinary users who cannot express their implicit preferences by formulating a query or function. Second, DB-friendliness: the system should be "DB-friendly," i.e., compatible with existing relational DBMS, so that it can be executed and optimized by any DBMS. Note that data retrieval, with many interesting scenarios online, must essentially achieve responsive processing. While there has been existing work on supporting ranking in both the database and machine learning communities (discussed in Section 6), due to their different interests, no effort has been ventured toward enabling soft queries for data retrieval. On one hand, the database community has studied rank query processing [3-6]. However, this work clearly lacks support for intuitively formulating the ranking in the first place, to accommodate everyday users (as Section 2 will discuss). On the other hand, the machine learning community has focused on learning or formulating ranking from examples [7,8]. However, such ranking functions are hardly amenable to relational DBMS for efficient processing. This paper develops techniques "bridging" databases and machine learning, to provide systematic solutions for data retrieval. We propose a new framework such that: (1) to achieve user-friendliness, it allows users to qualitatively and intuitively express their preferences by ordering some sample objects; (2) to achieve DB-friendliness, it learns a quantitative global ranking function that is amenable to existing relational DBMS. In summary, our framework seamlessly integrates the front-end machine learner with a back-end processing engine that evaluates the learned functions.
The new contributions of this paper are summarized as follows.
- We develop the duality of the ranking and classification views in Section 3.1, in order to connect the "user-friendly" query formulation (i.e., learning ranking from relative orderings) with the "DB-friendly" query processing (i.e., processing ranking from absolute scores).
- We provide an intuitive interpretation of the SVM ranking solution [8], using the duality and presenting Corollaries 1 and 2 and Remark 1 in Section 3.2.2.
- We develop the top sampling method, which (1) provides an "exploratory" interface to users; (2) further enhances the SVM performance for ranking; and (3) is efficiently expressed in SQL and thus facilitates the integration with RDBMS.
- We experimentally show that the top sampling method is efficient and reduces the amount of user feedback needed to achieve high accuracy.
We motivate and describe the architecture of our framework (Section 2) and present the component techniques (Section 3). We demonstrate the efficiency and effectiveness of our framework using real-life queries and data sets (Section 4). We discuss valuable lessons learned from our user study and further challenges in building a data retrieval-integrated relational system (Section 5). We discuss related work in Section 6.

2. Overview: bridging rank formulation and processing

This section motivates and introduces our approach. Our goal is to seamlessly integrate user-friendly rank formulation with DB-friendly rank processing. As Section 1 explained, such a "mix" is critical for enabling soft queries for data retrieval.

2.1. DB-friendly rank processing: score-based model

First, we argue that the score-based ranking model is both amenable and expressive for query processing. To see why, consider a data retrieval scenario, where queries capture preference.

Example 1. Amy, who is looking for a house, prefers those that are somehow cheap, large, and in a safe area. Assume these "predicates" or "features" are
specified (e.g., cheap below), each as a soft predicate returning a matching score.

Predicate cheap(h.price):
  If (h.price > 500,000) Then Return 1.0 - h.price/MAX_PRICE
  Else Return 1.0

To rank results in the order of her preference, she may formulate a query combining the features with min as the ranking function. Such a query is expressible in SQL, using the order by clause to order results by user-specified ranking criteria.

Query Q:
  select h.id, h.address from House h
  where h.city = "Chicago"
  order by min(f1: cheap(h.price), f2: large(h.size), f3: safe(h.zip))

The task in this score-based model is ranking a database of n objects D = {u_1, ..., u_n} (e.g., House in Example 1). For each object u, some m soft predicates f_1, ..., f_m evaluate u to scores in [0:1], which are then aggregated by some ranking function F, i.e., F(f_1, ..., f_m)[u] = F(f_1[u], ..., f_m[u]). All objects are then ranked, highest first, by their ranking scores F(f_1, ..., f_m)[u], or F[u] for short. For instance, Query Q in Example 1 uses min(f_1, f_2, f_3) as the ranking function. This view is (1) amenable and (2) expressive enough to enable effective query processing. First, such a ranking function is amenable to existing relational DBMS and thus already expressible in SQL (e.g., as Query Q in Example 1). Second, it is simple yet expressive, determining a global ordering with a single formula. (Such score-based models have served IR well, e.g., the tf/idf scoring function for ranking.) (Other emerging ranking models are discussed in Section 6.)

2.2. User-friendly rank formulation: machine learning approach

While the score-based model is expressive and efficient, formulating such ranking functions is challenging for users. It is far from trivial for the user to articulate how she evaluates each and every object into an absolute numeric score, that is, to express her preference by defining the soft predicates and the combining function.
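In executable form, the soft predicates and combining function that the user would have to write look like the following minimal Python sketch of Example 1's Query Q. The house records, MAX_PRICE value, and the large predicate are hypothetical illustrations, not part of the paper's system.

```python
# Sketch of the score-based model of Example 1: soft predicates map each
# object to [0,1], and the ranking function F = min(...) aggregates them.
# Data, MAX_PRICE, and large() are hypothetical.

MAX_PRICE = 1_000_000

def cheap(price):
    # Soft predicate of Example 1: full score up to $500,000, then decaying.
    return 1.0 if price <= 500_000 else 1.0 - price / MAX_PRICE

def large(size, max_size=5000):
    # Hypothetical soft predicate: score grows with size, capped at 1.0.
    return min(size / max_size, 1.0)

houses = [
    {"id": 1, "price": 400_000, "size": 2500},
    {"id": 2, "price": 800_000, "size": 4500},
    {"id": 3, "price": 600_000, "size": 1500},
]

def F(h):
    # Ranking function of Query Q (restricted to two predicates here).
    return min(cheap(h["price"]), large(h["size"]))

ranked = sorted(houses, key=F, reverse=True)   # highest score first
```

Note how, under min, an object is only as good as its weakest predicate: the second house's low cheapness score dominates despite its large size.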
Note that, unlike typical relational queries, which are usually formulated by application developers or DB administrators, common users for
data retrieval tasks are ordinary people like Amy. Thus, to accommodate such users, the formulation of rank criteria must be essentially supported; without it, ranking is not usable. To enable effective rank formulation, we believe the framework should be both intuitive and exploratory. First, preference often stems from relative ordering without explicit absolute scores. Thus, while scoring is an underlying "computational machinery" for capturing a desired preference, explicit scoring is non-intuitive and overly demanding for most users. To be intuitive, the framework should allow users to specify only relative orderings, or partial orders (but not absolute scores); it is up to the system to infer the underlying ranking function from a few given examples. Second, ranking often requires context knowledge of what objects are available in the database to be ranked. However, data retrieval is inherently exploratory; users are exploring an unfamiliar database for what they want, and thus such context knowledge is often lacking. Thus, the framework should present what is available in the database, and let users focus only on what is presented. These examples on one hand serve as a "guided" tour of D, and on the other hand provide sufficient context for user interaction. Together, both requirements lead us to pursue an interactive "rank-by-examples" paradigm for rank formulation; consequently, the critical ability of "inference by examples" (for finding the implicit ranking criteria) clearly suggests a machine learning approach. With interactive sampling and labeling of
training examples, our "learning machine" will infer the desired ranking function. However, unlike a conventional learning problem of classifying objects into groups, our learning machine must learn a global ranking function F that outputs a ranking score for each data object, so that it is adoptable in the score-based ranking model for efficient processing. The learning machine must also learn from partial orders, to provide the intuitive formulation. Additionally, the dynamic nature of online querying poses strict constraints on response time and user intervention: the ranking function must be learned instantly with minimal user intervention.

2.3. The RankFP framework

Putting these together, we develop the RankFP framework, aiming to integrate a "front-end" for learning-based rank query formulation with a "back-end" for score-based rank query processing. As Fig. 2 illustrates, first, with the "iterative learning" front-end (at the top), our RankFP framework supports users in formulating queries in a process that is exploratory (as the system iteratively shows sample objects from the database) and intuitive (users specify only a partial ordering on the sample). Second, with the score-based rank processing back-end (at the bottom), our framework supports integrated query processing to return ranked answers efficiently. Section 3 will present the techniques for each component. Note that, unlike typical document retrieval tasks, users in data retrieval tasks are often willing
Fig. 2. Framework RankFP: rank formulation and processing for data retrieval.
to perform many iterations to further refine the ranking functions; a document retrieval task usually ends as soon as the user finds a few satisfying documents. However, users in data retrieval tasks often want to retrieve as many candidates as possible before they make decisions, because the decision often involves a high cost. For instance, users searching for for-sale houses or digital camcorders do not finish their tasks simply by retrieving a few good examples.
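As described below, each round of this interaction ends when the ranking induced by the learned function over the sample agrees closely with the user's ordering R*, with agreement measured by Kendall's tau. A minimal pure-Python sketch; the orderings and the convergence threshold here are illustrative:

```python
# Sketch: Kendall's tau between the user's ordering R* over a sample S
# and the ordering R_S^F induced by the learned function F. The object
# ids, orderings, and threshold are illustrative.
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings given as lists of object ids,
    highest-ranked first. Assumes the same ids appear in both, no ties."""
    pos_a = {u: i for i, u in enumerate(rank_a)}
    pos_b = {u: i for i, u in enumerate(rank_b)}
    concordant = discordant = 0
    for u, v in combinations(rank_a, 2):
        # Positive product: both rankings order u and v the same way.
        if (pos_a[u] - pos_a[v]) * (pos_b[u] - pos_b[v]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

r_star = ["a", "b", "c", "d", "e"]   # user's ordering R* over S
r_f    = ["a", "c", "b", "d", "e"]   # ordering induced by F (one swap)
tau = kendall_tau(r_star, r_f)       # 10 pairs, 1 discordant
converged = tau >= 0.8               # illustrative convergence threshold
```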
Rank formulation: The rank formulation module (Fig. 2, top) iteratively interacts with the user to learn the desired ranking function. This process operates in rounds, as Fig. 2 illustrates (at the top) and Fig. 3 shows in detail. In each round, the learning machine selects a sample S of a small number l of objects (for l << |D|; e.g., l = 5 in our study). The user orders these examples by her desired ranking R*; she thus "labels" these examples as training data. The learning machine then constructs a function F from the training examples so far; let R_S^F be the induced ranking over the latest sample S. At convergence, i.e., when R_S^F is sufficiently close to R* (i.e., when F is accurate on S), the learner halts and outputs F as the learned ranking function. (In particular, we measure such convergence with Kendall's tau, the most widely used measure of similarity between two orderings such as R* and R_S^F [9-11].) This "learning-till-convergence" mechanism is simple and elegant, and satisfies our goal of intuitive and exploratory query formulation. (Sections 3.2 and 3.3 will present the techniques for the rank formulation.)

Rank processing: The rank processing module (Fig. 2, bottom) carries out the learned function F for online query processing over the entire database. Section 3.1 will present, through Theorem 1, how to connect the learned function to the score-based ranking model for efficient and integrated query processing.

3. The RankFP framework: enabling rank formulation and processing online
In this section, we present the techniques for realizing the RankFP framework (Fig. 2). First, Section 3.1 develops how we "connect" the score-based ranking view, which is effective for the processing back-end, with the classification view, which is effective for the learning front-end. Second, Section 3.2 investigates SVM as the learning machine (Step 3a in Fig. 3). Finally, Section 3.3 develops techniques to make rank formulation and processing "online," e.g., selective sampling for effective online learning with minimal user intervention (Step 3b in Fig. 3).

3.1. Duality of ranking and classification view

As argued in Section 2, the score-based ranking model, viewing ranking as induced by a ranking function F, is amenable and expressive for query
Fig. 3. The front-end rank formulation: ‘‘learning-till-convergence’’.
processing. For now, let us assume the ranking function F is linear, i.e., F(f_1, ..., f_m)[u] = w_1·f_1[u] + ... + w_m·f_m[u]. (We will show that it generalizes to nonlinear functions in Section 3.2.) Let w = (w_1, ..., w_m) be the weight vector (which is what the learner will infer). Also, let f_i = (f_1, ..., f_m)[u_i] be the feature vector of data object u_i; in our learning framework, these features are simply the attributes of each database tuple (e.g., price, city as in Example 1). We can thus write F_w(f)[u_i] = w · f_i, which maps an object u_i to its score F[u_i] by weighting its various features. As our hypothesis, suppose there exists such a ranking function F that is consistent with the desired ranking R*. Our ranking problem is thus to induce an order of objects by comparing their scores, and our goal in rank formulation is to find such an F_w (or the weight vector w) such that:

u_i >=_R u_j <=> F_w(f)[u_i] >= F_w(f)[u_j]   (1)
            <=> w · f_i >= w · f_j.           (2)

We denote u_i >=_R u_j, or (u_i, u_j) ∈ R, when u_i is ranked higher than u_j according to an ordering R. To automatically infer the ranking function F_w by applying a machine learning method as a formulation front-end, we rewrite Eq. (2) as the following Eq. (3):

w · (f_i - f_j) >= 0.   (3)

Now the learning problem becomes a binary classification problem on pairwise orderings: let d_ij = f_i - f_j be the feature-difference vector between u_i and u_j. The problem is then formulated as the binary classification problem of determining whether (u_i, u_j) ∈ R*:

(u_i, u_j) ∈ R* <=> F_w(d_ij) >= 0 <=> w · d_ij >= 0.   (4)

Thus, our rank formulation problem can be stated as follows: let R* be a ranking over database D. Given training data, i.e., partial orders, as a set {((u_i, u_j), y_ij)}, where u_i ∈ D, u_j ∈ D, and y_ij = +1 if (u_i, u_j) ∈ R* and y_ij = -1 otherwise, our goal is to learn a function F_w that classifies every pair of objects (u_i, u_j) from D with respect to R*, as Eq. (4) defines. We now formally develop the duality of the classification and ranking views through the following theorem.
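The pairwise transformation of Eqs. (3) and (4) is mechanical; the sketch below turns a user-ordered sample into labeled difference vectors. The feature values are illustrative:

```python
# Sketch of the pairwise transformation of Eqs. (3)-(4): a partial order
# over objects becomes labeled feature-difference vectors d_ij = f_i - f_j.
# The sample feature vectors are hypothetical.

def make_pairs(ordered_features):
    """ordered_features: list of feature vectors, best first (R*).
    Returns (d_ij, y_ij) training pairs; both orientations are emitted
    so the classifier sees positive and negative examples."""
    pairs = []
    n = len(ordered_features)
    for i in range(n):
        for j in range(i + 1, n):
            d = [a - b for a, b in zip(ordered_features[i],
                                       ordered_features[j])]
            pairs.append((d, +1))                # (u_i, u_j) in R*
            pairs.append(([-x for x in d], -1))  # reversed pair
    return pairs

sample = [[0.9, 0.8], [0.7, 0.6], [0.2, 0.3]]   # user-ordered sample S
training = make_pairs(sample)                    # 3 pairs, 6 labeled vectors
```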
Theorem 1 (Duality). Let R* = (f_1, f_2, ...) be the global ordering determined by a pairwise function F such that F(d_ij) > 0 for every (f_i, f_j) pair satisfying i < j. Then, F(f_i) > F(f_j) if i < j.

Proof 1. From Eqs. (3) and (4), for all d_ij ∈ R*: F_w(d_ij) > 0 <=> F_w(f_i - f_j) > 0 <=> w · (f_i - f_j) > 0 <=> w · f_i > w · f_j <=> F_w(f_i) > F_w(f_j). □

This duality tells us that F, a classifier function learned from the pairwise difference vectors, can serve as the global ranking function generating a score per object, which allows us to integrate seamlessly with existing relational databases.

3.2. Incorporating SVM learning for rank formulation

Our formulation of the duality (Theorem 1) enables us to adopt any linear binary classification method (e.g., Perceptron, Winnow, SVM)^1 for learning the ranking function F. Among those, Support Vector Machines (SVMs) [12-14] have recently been most actively developed in the machine learning community, as they have been shown to demonstrate high generalization performance thanks to the margin maximization property^2; that is, they learn an accurate classification function that generalizes well beyond the training data. (Generalization performance denotes the performance of the learned function on "unseen" data.) Applying SVM to Eq. (4) is essentially equivalent to the solution proposed in [8], which learns an ordinal regression function from Eq. (2). In this section, we provide an intuitive interpretation of the solution by using the duality and presenting Corollaries 1 and 2 and Remark 1, which explain how SVM can also improve the generalization of ranking. This justifies that our framework, by adopting SVM as the learning machine, can learn an F_w that is concordant with the given training data (i.e., partial orders from R*) and also generalizes well to rank unseen data with respect to R*. We will also use the analyses in this section to justify our top sampling method in Section 3.3.

^1 Given a set of (positive and negative) training points, a linear binary classifier finds the weight vector w (and thus the ranking function F_w), which defines a "hyperplane" separating the positive and negative examples.
^2 SVMs compute the classification boundary with the highest margin, i.e., the distance between the boundary and the closest data points (the support vectors) in the feature space.
3.2.1. SVM classification
Let us first overview SVM classification. Suppose there exists a function F_w such that Eq. (4) holds for some partial ordering R0 ⊆ R*. Then we can rescale w such that the following Eq. (5) holds for those partial orders:

w · d_ij >= 1   for all (u_i, u_j) ∈ R0.   (5)
For instance, say that the smallest output from the function of Eq. (4) is 0.01 for all d_ij ∈ R0, and that the vector producing 0.01 is d^s. Then we can rescale w such that w · d^s = 1 and w · d > 1 for all other vectors d. (We do not actually rescale it; but, to understand the SVM's margin maximization property, it is crucial to see that a w satisfying Eq. (5) exists whenever a w satisfying Eq. (4) exists for the set of vectors d_ij ∈ R0.) The particular feature-difference vectors d_ij for which (5) is satisfied with the equality sign (e.g., d^s in the above example) are called support vectors. Thus, in SVM classification, support vectors are the data objects closest to the decision boundary (w · d = 0), because the decision function F_w(d^s) for the support vectors d^s returns the smallest possible value (= 1). The margin denotes the distance from the support vector d^s to the decision boundary (w · d = 0) in the feature space, which is formulated as the following Eq. (6), since F(d^s) = 1:

m = F_w(d^s)/||w|| = 1/||w||.   (6)

SVMs compute a function F_w of the highest margin m by minimizing ||w|| in Eq. (6). Fig. 4 illustrates an example of the margin-maximized decision boundary (i.e., w · d = 0) of SVM binary classification in a two-dimensional feature space. Each data pair d is represented by 'o' or '+' according to its class (e.g., F('+') >= 1 and F('o') <= -1). The data pairs on the dotted lines are the support vectors. SVM computes the boundary that separates the two groups of data and also maximizes the margin, i.e., the distance between the boundary and the support vectors in the feature space.

Fig. 4. Margin-maximized boundary in a two-dimensional space.

3.2.2. SVM for ranking
In our ranking problem, from Eq. (2), we see that a linear ranking function F_w projects data vectors onto a weight vector w. For instance, Fig. 5 illustrates linear projections of four vectors {f_1, f_2, f_3, f_4} onto two different weight vectors w_1 and w_2 in a two-dimensional feature space.

Fig. 5. Linear projection of ranking.

Both F_w1 and F_w2 produce the same ordering R for the four vectors, namely f_1 >_R f_2 >_R f_3 >_R f_4. The ranking difference of two vectors (f_i, f_j) according to a ranking function F_w can be denoted by the geometrical distance between the two vectors projected onto w, formulated as w · (f_i - f_j)/||w||. For instance, in Fig. 5, the ranking difference of (f_1, f_2) according to w_1 is denoted by δ_1 = w_1 · (f_1 - f_2)/||w_1||.

Corollary 1. Suppose F_w is an SVM function of Eq. (4) learned from partial orders R0 ⊆ R*. Then, the support vectors of the function F_w represent the data pairs that are closest to each other in ranking.

Proof 2. Let d^s = f_i^s - f_j^s be the support vector, where F_w(d^s) = 1. Then, from Eq. (3), F_w(f_i^s - f_j^s) = 1 <=> w · (f_i^s - f_j^s) = 1, which is, by the definition of
the support vector, the smallest possible value over all data pairs (u_i, u_j) ∈ R0. Thus, its ranking difference according to F_w (i.e., w · (f_i^s - f_j^s)/||w||) is also the smallest among all data pairs (u_i, u_j) ∈ R0. □
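Corollary 1 can be checked directly: under a linear F_w, the support-vector pair is the pair with the smallest projected difference w · (f_i - f_j)/||w||. A small sketch, with illustrative vectors and weights:

```python
# Sketch of Corollary 1: under a linear F_w, the pair of objects with the
# smallest projected difference |w.(f_i - f_j)|/||w|| is the pair
# "closest in ranking" (the support-vector pair). Data are illustrative.
import math
from itertools import combinations

def closest_pair_in_ranking(w, vectors):
    norm = math.sqrt(sum(x * x for x in w))
    def diff(fi, fj):
        # Projected ranking difference of Section 3.2.2.
        return abs(sum(wk * (a - b) for wk, a, b in zip(w, fi, fj))) / norm

    return min(combinations(vectors, 2), key=lambda p: diff(*p))

w = (2.0, 1.0)                                   # hypothetical weights
vectors = [(0.9, 0.8), (0.7, 0.9), (0.3, 0.4)]   # hypothetical objects
pair = closest_pair_in_ranking(w, vectors)       # the two closest in ranking
```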
Corollary 2. The ranking function F generated by the SVM maximizes the minimal difference in ranking between any data pairs.

Proof 3. The SVM computes a function that maximizes the margin, and the margin is formulated as w · d^s/||w|| = w · (f_i^s - f_j^s)/||w||, which, from the proof of Corollary 1, denotes the minimal difference in ranking between any data pairs. □

Remark 1. Our framework, adopting SVM as the learning machine, generates a ranking function of high generalization performance.

Rationale. Consider the two linear ranking functions F_w1 and F_w2 in Fig. 5. Although the two functions produce the same ordering R for the four vectors, namely f_1 >_R f_2 >_R f_3 >_R f_4, we intuitively expect w_1 to generalize better than w_2, because the minimal difference of two projected vectors under w_1 (i.e., δ_1) is larger than that under w_2 (i.e., δ_2). From Corollaries 1 and 2, SVM computes the weight vector w that maximizes the minimal difference in ranking. Thus, our framework, adopting the SVM as the learning machine, generates a ranking function of high generalization performance.

3.2.3. Incorporating nonlinear ranking
We have so far discussed the framework assuming the ranking function F is linear. Conceptually, linear ranking functions consider only the sum of individual attributes, since a weight is assigned to each individual attribute (e.g., F_w(d) = w_1·d_1 + w_2·d_2 when d is a two-dimensional vector, such as price and size for house data). Though our experimental results report that the vast majority, i.e., over 90%, of user preferences are learned accurately with linear ranking functions, we also need to support nonlinear ranking functions for complex preferences. Complex preferences in this context are those that consider more than individual features (e.g., F(d) = w_1·d_1 + w_2·d_2 + w_3·d_1·d_2) and thus cannot be expressed by linear functions.
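To illustrate such a complex preference, the sketch below adds the product d_1·d_2 as an explicit extra feature, a crude stand-in for the kernel trick discussed next; the feature map and the weights are hypothetical:

```python
# Sketch: a preference rewarding balanced attributes via a product term
# w3*d1*d2 becomes linear again after an explicit feature map (a crude
# stand-in for the kernel trick). phi and w are hypothetical.

def phi(d):
    d1, d2 = d
    return (d1, d2, d1 * d2)   # augment features with the product term

w = (1.0, 1.0, 3.0)            # hypothetical learned weights

def F(d):
    return sum(wi * xi for wi, xi in zip(w, phi(d)))

a, b = (0.9, 0.1), (0.5, 0.5)
# Any linear function with equal weights on d1 and d2 scores a and b
# identically (d1 + d2 = 1.0 for both), yet F separates them through
# the product term, preferring the balanced object b.
```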
SVMs support the "kernel trick" for nonlinear classification [12], which can be exploited to formulate a nonlinear ranking function. (Refer to [8] for details.) However, as the function gets more complex and expressive (as a nonlinear function usually does), its generalization performance tends to decrease, and it thus requires more training data to be learned accurately. This fact is explained by the bias-variance tradeoff in classification [12]. It is also nontrivial to optimize nonlinear kernel parameters in an online environment. Thus, we leave the seamless integration of nonlinear ranking into the framework as future work. In this paper, we simply use two kernels in parallel without optimizing the kernel parameters, a linear kernel and an RBF kernel,^3 to support both linearly rankable and complex preferences in a "naive" way. That is, at Step 3 of Fig. 3, the learning machine learns two ranking functions from the partial orders: one with the linear kernel and the other with the RBF kernel. We determine which function to report as the result by comparing the accuracies of the linear and nonlinear functions computed at Step 5. If the user's preference is linearly rankable, the linear kernel will converge quickly to high accuracy, as the linear kernel generalizes well from a smaller sample. (In our experiments, linear kernels reach over 90% expected accuracy mostly within two iterations for linearly rankable preferences.) In contrast, if the user's preference is not linearly rankable, the RBF kernel will eventually outperform the linear kernel.

3.3. Satisfying online requirements

While the duality (Theorem 1) enables us to apply the learning machine to rank formulation, the dynamic nature of our online querying framework poses very strict constraints on (1) user intervention and (2) response time. First, for online rank formulation, we discuss how to learn effectively with minimal user intervention. Second, for online rank processing, we discuss how to evaluate ranking efficiently as query processing, leveraging the optimizer of the underlying DBMS.

3.3.1.
Online formulation: Top sampling
Toward the goal of online formulation, we develop the top sampling technique, which on one

^3 SVMs with RBF kernels have infinite VC-dimension [13], i.e., they can classify any possible partition of a data set. In the context of ranking, this means they can express a function with any ordering.
hand, provides an exploratory interface to users and on the other hand, further enhances the learning performance of SVM. The top sampling technique is applied in Step 3(b) in the framework of Fig. 3. Particularly, this technique minimizes the user intervention required to achieve the user-specified accuracy. The key idea of the top sampling is at each round to select the data objects that (1) are most informative for ranking such that the ordering on the data maximizes the ‘‘degree’’ of learning and (2) are also highly ranked so that users can focus on the sample of their interests. First, the most informative sample is the data objects that are most ambiguous in ranking: From Corollary 1, the support vectors are most ambiguous data pairs in ranking, and an SVM ranking function is represented by the support vectors. Thus, the user’s feedback on those data will accelerate the ‘‘degree’’ of learning by quickly identifying the support vectors. Selecting the most ambiguous l data objects, S iþ1 , is formulated into the following optimization problem: arg min CðS iþ1 Þ; S iþ1
ð7Þ
where CðS iþ1 Þ (i.e., the cost function of S iþ1 ) is the sum of the ranking difference of every data pair within S iþ1 , i.e., X CðS iþ1 Þ ¼ jFi ð~ uj ~ uk Þj; ð8Þ 8ð~ uj ;~ uk Þ2S iþ1
i is the number of iterations, Fi is the ranking function learned at ith iteration, and Siþ1 is the set of data selected for the next iteration. Since in our framework every data pair ð~ uj ; ~ uk Þ 2 Siþ1 is used as training data, the set, S iþ1 , that minimizes the cost function CðSiþ1 Þ of Eq. (8) is most ambiguous in ranking. However, a direct optimization of Eq. (8) requires jDj C l times of SVM function evaluations at each round, as it needs to evaluate every possible l data objects in the datasets, which would delay the response time intolerably long. The selected data also hardly serves as an exploratory sample since it does not consider the user’s preference. Thus, our top sampling selects the top l data objects ranked according to the function learned in the previous round, that is, select L ¼ fu1 ; :::; ul g
such that F(u_i) ≥ F(u_j) for every u_i ∈ L and u_j ∉ L. As the data in L are consecutive in ranking, the cost function of Eq. (8) for L is likely smaller than that for a randomly chosen sample. Top sampling is efficient as well as exploratory: (1) selecting such a set L requires at most a single scan of the dataset, while the direct optimization performs C(|D|, l) evaluations; (2) it selects a highly ranked sample according to the function learned from the user's feedback, and thus naturally presents users with candidates that are unseen but highly likely to be preferred. Our experiment in Section 4 shows that top sampling achieves high accuracy more quickly than random sampling.

3.3.2. Online processing

As a next step, we carry out "classification" with the learned function F over the entire database. Since data retrieval scenarios naturally involve large databases, such processing should be efficient and scalable toward the goal of online processing. However, such development has been clearly lacking: machine learning methods have not had to address this "processing" aspect, as their objective is usually to optimize accuracy. As a major contribution toward online processing, recall that our duality result (Theorem 1) naturally enables us to leverage the query optimizer of a DBMS for efficient and scalable query processing (in place of evaluating the pairwise classification function for all object pairs): once the pairwise classification function is learned (which, by the duality property, is conveniently the per-object ranking function F as well), the user query is readily expressible in SQL using F, as Example 1 demonstrated. Further, top sampling can be expressed in SQL as well, which enables a uniform DB-friendly interface for both query processing and sample selection. To illustrate, at the ith iteration, our top sampling scheme simply retrieves the k objects ordered by F_i.
That is, for Example 1:

select price, size, zip
from Houses
where city = 'Chicago'
order by F_i(price, size, zip)
limit k

Note that the actual optimization of such queries inside the database engine is beyond the focus of this work; however, we discuss related research issues in Section 5.
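To make the selection of Section 3.3.1 concrete, here is a minimal Python sketch of the Eq. (8) cost and the single-scan top-l selection that the SQL above performs inside the engine. The names `cost`, `top_sample`, `F`, `D`, and `l` are stand-ins for the paper's cost function, selection step, learned ranking function, database, and sample size:

```python
import numpy as np

def cost(F, S):
    """Eq. (8): sum of |F(u_j - u_k)| over all pairs in S.
    For a linear F, a low cost means the objects in S are close
    in ranking, i.e., most ambiguous."""
    return sum(abs(F(uj - uk))
               for j, uj in enumerate(S) for uk in S[j + 1:])

def top_sample(F, D, l):
    """Top sampling: a single scan picks the l objects ranked
    highest by the current function F, avoiding the C(|D|, l) search."""
    scores = np.array([F(u) for u in D])
    return [D[i] for i in np.argsort(-scores)[:l]]
```

Because the top-l objects are consecutive in the ranking, their pairwise score differences, and hence their Eq. (8) cost, tend to be small, so the top sample approximates the ambiguous set without the combinatorial search.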
4. Experimental evaluation

This section reports our extensive experiments studying the usability (or "user-friendliness") and efficiency (or "DB-friendliness") of our RankFP framework. First, for usability, we used Kendall's τ [9–11], a measure widely used to quantify the similarity of two orderings, i.e., the ideal ordering R and the ordering R_F generated by our system. Second, for efficiency, we measured absolute response time. Our experiments were conducted on a Pentium 4 2 GHz PC with 1 GB RAM.

Implementation. We use ν-SVM4 (included in LIBSVM5). As for the sample size l, we set it to 5, which gave fairly good results among all values we tried. Note that choosing l is a trade-off: a small l requires more iterations, while a large l makes the ordering non-trivial for the user.
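As a rough stand-in for this setup (the paper uses LIBSVM's ν-SVM; here, purely for illustration, a simplified linear RankSVM trained by subgradient descent on hinge loss over pairwise difference vectors, with `learn_ranking_function` and the toy feedback being hypothetical names and data):

```python
import numpy as np

def learn_ranking_function(pairs, dim, epochs=300, lr=0.05, reg=1e-3):
    """Learn a linear ranking function F(u) = w . u from partial orders.

    `pairs` is a list of (preferred, other) attribute vectors; each pair
    contributes a difference vector d = preferred - other on which we
    want w . d >= 1 (the margin), as in a linear RankSVM."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for a, b in pairs:
            d = np.asarray(a, float) - np.asarray(b, float)
            if w @ d < 1.0:          # margin violated: move w toward d
                w += lr * d
            w -= lr * reg * w        # small L2 regularization
    return w

# Toy feedback: the user prefers cheaper (attr 0, in $100k) and
# larger (attr 1, in 1000 sq ft) houses.
feedback = [((1.0, 2.0), (2.0, 1.5)),   # cheap+large over pricey+small
            ((1.5, 2.5), (1.5, 2.0)),   # same price, larger wins
            ((1.0, 2.0), (1.5, 2.0))]   # same size, cheaper wins
w = learn_ranking_function(feedback, dim=2)
F = lambda u: float(w @ np.asarray(u, float))
```

The learned weights are negative on price and positive on size, i.e., a quantitative function recovered from purely qualitative orderings.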
The queries in Fig. 6 show that the user is interested in cheap and large houses with many beds and baths. Further, the predicate definitions in Fig. 7 illustrate the user's preference in more detail: for instance, according to the definitions in Predicates 1, the user is willing to trade 200 square feet in size (which decreases the score of b by 20) for one more bedroom (which increases the score of c by the same amount). Predicate 1 can be formulated as the
Fig. 6. Preference queries.
Data set. We perform our experiments on a real-estate system (as Example 1 introduced) with the real-life house dataset used in [4].6 This dataset was extracted from realtor.com and contains all the for-sale houses in Illinois, resulting in N = 20,990 objects for relation house, each with attributes id, price, size, beds, baths, zip, and city. In addition, we translated the zip code into latitude and longitude coordinates to support the notion of closeness between two locations. To evaluate the framework extensively, we first synthetically generate queries of various complexity and measure the performance in Section 4.1. We then evaluate the framework with real-life queries, collected from our user study, in Section 4.2.

4.1. Experiments on synthetic queries

First, we evaluate the framework using synthetic queries, expressed in the score-based model, as shown in Fig. 6: Query 1 shows a scenario of finding a house in a big city (Chicago), while Query 2 shows a scenario of finding a house in a small city (Urbana).

4 ν-SVM employs a semantically meaningful soft-margin parameter ν [15,16]. An intuitive setting of ν normally works well. We fixed ν = 0.1 for our experiments.
5 http://www.csie.ntu.edu.tw/~cjlin/libsvm
6 http://aim.cs.uiuc.edu/readme.html
Fig. 7. Definitions of fuzzy predicates.
following linear ranking function:

  F = 100 − 0.001 · price + 0.1 · size + 20 · beds + 20 · baths.   (9)

To contrast, the definitions in Predicates 2 (Fig. 7) express more complex preferences: for instance, from the function many1(beds) (c), we can observe that the user penalizes heavily houses with fewer than three bedrooms, while giving the same score to all houses with more than five bedrooms. As a result, queries using Predicates 2 are not "linearly rankable". To evaluate the framework with queries of various degrees of complexity, we mix and match the two queries and the two sets of predicate definitions, i.e., Q1 + P1, Q2 + P1, Q1 + P2, and Q2 + P2. For each combination, we generate user feedback, i.e., partial orderings, based on the given query combination. We then measure the accuracy of the learned ranking function and the response time at each round, with both random sampling and top sampling. Table 1 summarizes the results. We highlight our observations as follows:

For the linearly rankable queries (i.e., Q1 + P1 and Q2 + P1 in Table 1), the top sampling technique yields noticeably higher performance from the second round on. Fig. 8 illustrates the performance difference between random sampling and top sampling at each round. The first round shows little difference, because the first sample can only be selected randomly in both cases. However, as rounds go on, top sampling quickly achieves higher performance with a smaller sample. Refs. [17,18] observe similar behaviors in binary classification problems. Observe also that the response time is similar for Q1 and Q2 when random sampling is used, whereas the cost of top sampling is proportional to the database size: since there are more houses in Chicago than in Urbana, top sampling is less efficient for Q1 than for Q2. Supporting efficient top sampling using indexing would thus be interesting future work.

For the non-linearly rankable queries (i.e., Q1 + P2 and Q2 + P2), the Gaussian kernel performs better than the linear kernel. However, top sampling with the Gaussian kernel does not perform noticeably better than random sampling.

Table 1
Performance results (averaged over 20 runs)

Q         R    RAN                          TOP
               Acc     Kernel     Time      Acc     Kernel     Time
Q1 + P1   1    85.75   Linear     0.002     85.75   Linear     0.005
          2    88.18   Linear     0.003     91.03   Linear     0.020
          3    90.60   Linear     0.003     92.71   Linear     0.022
          4    92.93   Linear     0.005     93.72   Linear     0.024
          5    94.67   Linear     0.006     94.79   Linear     0.027
Q2 + P1   1    77.53   Linear     0.002     77.53   Linear     0.002
          2    81.12   Linear     0.003     87.04   Linear     0.011
          3    89.90   Linear     0.004     91.61   Linear     0.012
          4    93.24   Linear     0.006     94.01   Linear     0.015
          5    94.43   Linear     0.006     94.97   Linear     0.017
Q1 + P2   1    75.53   Gaussian   0.002     75.53   Gaussian   0.002
          2    80.41   Gaussian   0.003     80.91   Gaussian   0.021
          3    84.90   Gaussian   0.004     84.96   Gaussian   0.022
          4    85.19   Gaussian   0.005     84.79   Gaussian   0.025
          5    85.22   Gaussian   0.006     85.38   Gaussian   0.026
Q2 + P2   1    71.87   Gaussian   0.002     71.87   Gaussian   0.002
          2    78.11   Gaussian   0.003     79.11   Gaussian   0.014
          3    78.34   Gaussian   0.004     78.57   Gaussian   0.015
          4    79.73   Gaussian   0.005     79.01   Gaussian   0.016
          5    79.85   Gaussian   0.006     79.21   Gaussian   0.019

RAN: random sampling; TOP: top sampling; Q: query; R: # of rounds (l = 5); Acc: accuracy (%); Time: average response time (s); Kernel: the kernel of higher accuracy.
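As a quick sanity check of Eq. (9), the trade-off stated above (200 square feet of size for one bedroom) indeed leaves the score unchanged. The house attribute values below are made up purely for illustration:

```python
def F(price, size, beds, baths):
    # Eq. (9): the linear ranking function matching Predicate 1
    return 100 - 0.001 * price + 0.1 * size + 20 * beds + 20 * baths

a = F(price=250_000, size=1800, beds=3, baths=2)   # larger, fewer beds
b = F(price=250_000, size=1600, beds=4, baths=2)   # smaller, one more bed
# 0.1 * 200 = 20 = the weight of one bedroom, so the two scores tie
```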
4.2. Experiments on real queries

In this section, we evaluate the framework in more realistic settings. Ten ordinary users tested our system with their own house preferences, from which 100 real queries were collected. Note that, in this user-study setting, the perfect ordering R the user intends remains unknown. (It is infeasible for each user to provide a complete ordering R on hundreds or thousands of houses.) Thus, the accuracy of the ranking function at each iteration is evaluated against the partial ordering the user specifies in the next iteration. That is, the accuracy of the ranking function F_i learned at the ith iteration is measured by comparing the user's partial ordering on S_{i+1} at the next iteration with the ordering that F_i generates on S_{i+1}. This measure approximates the generalization performance of the ranking function well, as S_{i+1} is not part of the training data for learning F_i. Further, this evaluation method yields fair evaluations from users, since they are not aware of whether they are providing feedback or evaluating the functions at each round.
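The cross-round evaluation just described can be sketched as a simple pairwise-agreement check. Here `F_i` stands for the function learned at round i and `user_order` for the user's ordering of S_{i+1} (most preferred first); both names are illustrative:

```python
def round_accuracy(F_i, user_order):
    """Fraction of the user's pairwise orders on S_{i+1} that the
    previously learned function F_i reproduces."""
    pairs = [(a, b) for i, a in enumerate(user_order)
                    for b in user_order[i + 1:]]
    agree = sum(1 for a, b in pairs if F_i(a) > F_i(b))
    return agree / len(pairs)
```

This is the pairwise-concordance core of Kendall-style similarity, restricted to the l objects shown in the next round.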
(Left panel: Query 1 + Predicate 1; right panel: Query 2 + Predicate 1.)
Fig. 8. Performance convergence of two sampling techniques on the linearly rankable queries. TOP: top sampling; RAN: random sampling; X-axis: # of rounds; Y-axis: accuracy.
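The accuracy plotted here is Kendall's τ; for reference, it can be computed directly from its definition, τ = (concordant − discordant) / (n(n − 1)/2), as in this small self-contained sketch:

```python
def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two orderings of the same items
    (each given as a list, most-preferred first)."""
    pos_b = {item: i for i, item in enumerate(rank_b)}
    n = len(rank_a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # rank_a puts rank_a[i] before rank_a[j]; does rank_b agree?
            if pos_b[rank_a[i]] < pos_b[rank_a[j]]:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

τ is 1 for identical orderings and −1 for exactly reversed ones; the O(n²) loop suffices for the sample sizes used here.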
(Both panels compare the linear kernel and the Gaussian kernel.)
Fig. 9. Distribution of user preferences generating over 90% (left) and 100% (right) accuracy. X-axis: # of rounds; Y-axis: # of preferences.
However, this measure severely disfavors top sampling: intuitively, top sampling is most effective for learning precisely when the user's ordering on S_{i+1} differs from the ordering predicted by the function learned in the previous round. We thus use random sampling for the user study reported in this section. Note, however, that top sampling is expected to be effective in practice, as most real queries are linearly rankable (as Fig. 9 will show) and top sampling is effective with linear ranking functions (as discussed in Section 4.1).

4.2.1. Overall result

Fig. 9 shows the distribution of user preferences achieving over 90% and 100% accuracy, respectively, at each iteration. (Note that the accuracy here is an approximation, computed over five random data objects.) Observe from Fig. 9 that the linear kernel mostly reaches 90% accuracy within the second iteration, and 100% accuracy within the third. We deduce the following observations from our analysis of the experiments.
For the real preferences, our rank query framework formulates an accurate (accuracy ≥ 90%) ranking function within a couple of interactions with the user (iterations ≤ 2), quickly (response time ≤ 10 ms with random sampling) and automatically (without any parameter tuning during processing). For instance, in the rank query on houses in Chicago, where |D| ≈ 1500, users obtained an over-90%-accurate ranked list by ordering just two samples of 5 objects among the 1500 houses. In other words, our framework provided over 90% accurate pairwise
orders on 2,248,500 (= 1500 × 1499) house pairs from just 20 (= 2 × C(5, 2)) pairwise orders. As our experiments on the synthetic queries showed, top sampling can further improve the accuracy. The accuracy also scales with the database size |D|: for instance, Table 1 shows that 20 pairwise orders surprisingly produced higher accuracy on the larger |D| (the houses in Chicago) than on the houses in Urbana. This can be explained by the fact that the attributes of the Urbana houses are less variant, e.g., mostly sharing the same zip code, which provides fewer clues to the learning machine.
5. Discussion

Integration with relational database systems: Recently, there have been research efforts in processing rank queries in a relational context. Refs. [5,19] have proposed rank processing as a layer "on top of" relational databases, exploiting histograms [5] and materialized views [19], respectively, for efficient rank processing. However, these works cannot be adopted as a processing back-end for our framework. First, they rely on the assumption that the ranking function is a k-nearest-neighbor function [5] or a monotonic weighted average function (i.e., all weights are restricted to be positive) [19]. In contrast, as illustrated with F1 in Section 4, the learned ranking function can be an arbitrary non-monotonic function, e.g., the weight for attribute price is negative, as high price negatively affects the preference. Second, and more importantly, these works cannot support the integration of fuzzy ranking with Boolean ranking conditions (whose implications we discuss below). In contrast, for systematic support of data retrieval, our framework suggests the seamless integration of ranking machinery "within" a relational database system. As we demonstrated, such integration presents a powerful query mechanism in which fuzzy ranking and Boolean filtering work together seamlessly: Boolean constraints filter the database D to set the acceptable scope (e.g., only houses in Chicago), while fuzzy criteria rank the scope to set a preferred ordering (e.g., larger, cheaper, and safer). Efficient support of such queries within RDBMSs was pioneered by [3,20]. These works process order by after filtering, at the end of query processing, by completely materializing the F score for every object to determine the full ranking. Such a filtering-first scheme may not adapt to run-time
specifics: e.g., filtering may be expensive, or ranking may require index-based accesses (no longer available after filtering). As there can thus be many alternative query plans, there have been recent efforts to schedule ranking and filtering more generally. Ref. [21] proposes a new operator devised to support rank joins, later complemented by [22], which extends the relational query optimizer to use such operators. Most recently, the RankSQL system [23] developed an algebraic foundation for the seamless support of ranking and filtering, treating "rank-aware" operators as first-class constructs just like Boolean operators; these rank-aware operators subsume the rank join operator [21,22].

5.1. Lessons learned from our user study

Lessening "cognitive load": In our user study, we observed an interesting dilemma. On one hand, selective or top sampling selects the data objects that are most difficult to distinguish in terms of ranking, in order to maximize the degree of learning. On the other hand, this makes the ordering harder for users (i.e., the "cognitive load" becomes higher) and often resulted in inconsistent orderings in our user study. To thoroughly understand the related HCI (human–computer interaction) issues, it is important to pursue a more systematic study of user psychology. It is challenging to design a rank-learning interface that facilitates (or enforces) consistent ordering while balancing the cognitive load against the degree of learning.

Designing evaluation metrics: From our user study, it is tricky to design an evaluation metric that fairly reflects the performance of the system. First, an absolutely accurate ranking may not be necessary, as users tend to search more extensively in data retrieval (e.g., house shopping), whereas they are often satisfied by a few top results in document retrieval systems. Second, an accurate ranking may not even exist, as users can be ambivalent about their own exact ranking.
The actual perceived ranking performance may thus be better than our experiment suggests.

6. Related work

As our framework consists of rank learning and rank processing, this section discusses the state of the art in each of these areas.

Rank learning: For "user-friendly" rank formulation, we adopt a machine learning approach,
in particular SVM [12], to learn a quantitative ranking function from qualitative feedback. SVM has proven highly effective in classification [12–14]. Ref. [8] developed an SVM ordinal regression method, and Ref. [10] applied it to optimizing search engines. We apply it to enable soft queries on relational database systems. Refs. [17,18,24] studied SVM selective sampling techniques that improve learning performance and also provide an interactive retrieval framework. However, they are limited to binary classification (e.g., whether an image is relevant or not) and thus do not generalize to our rank learning. Our top sampling method enhances rank-learning performance and also provides an exploratory sample converging to the final top-k results.

Rank processing: Existing work on modeling and processing rank queries falls into two major paradigms, depending on how query conditions are represented and combined. As we discussed in Section 2, one such paradigm is the score-based model (or quantitative ranking), where a query condition is represented as a fuzzy predicate mapping each data object into an absolute numerical score, e.g., new[a] = 0.9. The predicate scores of an object are then combined by a user-specified mathematical function such as min or avg. Within the score-based model, existing works have studied processing algorithms either on top of [5,19] or inside [3,20–23] relational databases (as discussed in Section 5), and also in middleware access scenarios [4,6,25]. Alternatively, qualitative ranking [26–28] has recently been proposed, where the predicate is replaced by a qualitative preference expression. That is, instead of defining a predicate that evaluates each and every object into an absolute score, the user specifies preferences in terms of some attributes, such as "I like house a better than house b in terms of age."
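The contrast between the two models can be made concrete: two conflicting qualitative preferences (say, a preferred on age but b preferred on size) are incomparable as orders alone, while a score-based model quantifies and thus resolves the conflict. The weights and house values below are purely illustrative:

```python
def score(house, w_age=-2.0, w_size=0.01):
    # illustrative score-based combination: newer and larger is better
    return w_age * house["age"] + w_size * house["size"]

a = {"age": 5,  "size": 1800}   # newer but smaller
b = {"age": 10, "size": 2400}   # older but larger
# score(a) = -10 + 18 = 8 ; score(b) = -20 + 24 = 4 : a wins the trade-off
```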
While this alleviation makes the qualitative ranking model more "intuitive", it compromises "expressiveness" in return. First, as orderings are specified only for some (likely a minority of) objects, it is unclear how to rank the "unspecified" objects among themselves and against the "specified" objects. Second, due to the lack of absolute quantification, conflicts among multiple orderings cannot be resolved. To illustrate, suppose a user prefers house a over b in terms of age, while preferring b over a in terms of size. While such trade-off situations are common in real-life queries, a and b cannot be differentiated in the
qualitative ranking model, since the absolute degrees of the two conflicting preferences cannot be compared, as they can in the score-based model. (To illustrate, Ref. [29] reports experimental results where 36k objects tie as the top in an anti-correlated dataset of 1 million objects.) We stress that our work effectively "combines" the two models, coupling the expressiveness of the score-based model (as the back-end) with the intuitiveness of the qualitative ranking model (as the front-end). In essence, we adopt a machine learning approach to bridge the expressiveness gap.

7. Conclusion

This paper proposes a new data retrieval framework that incorporates user-friendly rank formulation into DB-friendly query processing. In particular, SVM techniques are adopted as the front-end to build an intuitive rank formulation that is compatible and integrated with the back-end score-based query processing. The experiments on a real-estate dataset show promising results: the data retrieval system effectively learns quantitative ranking functions from the users' qualitative feedback and efficiently processes the ranking queries. The proposed framework also introduces new open problems, such as: (1) in databases, rank query processing with no restrictions on the type of ranking function, to improve the efficiency of rank query processing in a DBMS; and (2) in HCI, lessening the cognitive load and designing evaluation metrics, to improve the user interface for rank query formulation.

References

[1] BrightPlanet.com, The deep web: surfacing hidden value, http://brightplanet.com/technology/deepweb.asp (July 2000).
[2] G. Salton, Automatic Text Processing, Addison-Wesley, Reading, MA, 1989.
[3] M.J. Carey, D. Kossmann, On saying "enough already!" in SQL, in: Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'97), 1997.
[4] K.C.-C. Chang, S.-W. Hwang, Minimal probing: supporting expensive predicates for top-k queries, in: Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'02), 2002.
[5] S. Chaudhuri, L. Gravano, Evaluating top-k selection queries, in: Proceedings of the International Conference on Very Large Databases (VLDB'99), 1999.
[6] R. Fagin, A. Lotem, M. Naor, Optimal aggregation algorithms for middleware, in: Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'01), 2001.
[7] W.W. Cohen, R.E. Schapire, Y. Singer, Learning to order things, in: Proceedings of Advances in Neural Information Processing Systems (NIPS'98), 1998.
[8] R. Herbrich, T. Graepel, K. Obermayer, Large margin rank boundaries for ordinal regression, in: Advances in Large Margin Classifiers, MIT Press, Cambridge, MA, 2000.
[9] M. Kendall, Rank Correlation Methods, Hafner, 1955.
[10] T. Joachims, Optimizing search engines using clickthrough data, in: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'02), 2002.
[11] A.M. Mood, D.C. Boes, F.A. Graybill, Introduction to the Theory of Statistics, McGraw-Hill, New York, 1974.
[12] V.N. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[13] C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery 2 (1998) 121–167.
[14] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, Cambridge, 2000.
[15] B. Schölkopf, A.J. Smola, R.C. Williamson, P.L. Bartlett, New support vector algorithms, Neural Computation 12 (2000) 1083–1121.
[16] C.-C. Chang, C.-J. Lin, Training ν-support vector classifiers: theory and algorithms, Neural Computation 13 (2001) 2119–2147.
[17] G. Schohn, D. Cohn, Less is more: active learning with support vector machines, in: Proceedings of the International Conference on Machine Learning (ICML'00), 2000, pp. 839–846.
[18] S. Tong, D. Koller, Support vector machine active learning with applications to text classification, in: Proceedings of the International Conference on Machine Learning (ICML'00), 2000, pp. 999–1006.
[19] V. Hristidis, N. Koudas, Y. Papakonstantinou, PREFER: a system for the efficient execution of multi-parametric ranked queries, in: Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'01), 2001.
[20] M.J. Carey, D. Kossmann, Reducing the braking distance of an SQL query engine, in: Proceedings of the International Conference on Very Large Databases (VLDB'98), 1998.
[21] I.F. Ilyas, W.G. Aref, A.K. Elmagarmid, Joining ranked inputs in practice, in: Proceedings of the International Conference on Very Large Databases (VLDB'02), 2002.
[22] I.F. Ilyas, W.G. Aref, A.K. Elmagarmid, Supporting top-k join queries in relational databases, in: Proceedings of the International Conference on Very Large Databases (VLDB'03), 2003.
[23] C. Li, K.C.-C. Chang, I.F. Ilyas, S. Song, RankSQL: query algebra and optimization for relational top-k queries, in: Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'05), 2005.
[24] E. Chang, S. Tong, Support vector machine active learning for image retrieval, in: Proceedings of ACM Multimedia 2001, 2001.
[25] N. Bruno, L. Gravano, A. Marian, Evaluating top-k queries over web-accessible databases, in: Proceedings of the International Conference on Data Engineering (ICDE'02), 2002.
[26] J. Chomicki, Preference formulas in relational queries, ACM Transactions on Database Systems.
[27] W. Kiessling, Foundations of preferences in database systems, in: Proceedings of the International Conference on Very Large Databases (VLDB'02), 2002.
[28] W. Kiessling, G. Köstler, Preference SQL: design, implementation, experiences, in: Proceedings of the International Conference on Very Large Databases (VLDB'02), 2002.
[29] D. Kossmann, F. Ramsak, S. Rost, Shooting stars in the sky: an online algorithm for skyline queries, in: Proceedings of the International Conference on Very Large Databases (VLDB'02), 2002.