Result Merging for Structured Queries on the Deep Web with Active Relevance Weight Estimation

Result Merging for Structured Queries on the Deep Web with Active Relevance Weight Estimation

Author’s Accepted Manuscript Result Merging for Structured Queries on the Deep Web with Active Relevance Weight Estimation Jing Yuan, Lihong He, Eduar...

1MB Sizes 1 Downloads 25 Views

Author’s Accepted Manuscript Result Merging for Structured Queries on the Deep Web with Active Relevance Weight Estimation Jing Yuan, Lihong He, Eduard C. Dragut, Weiyi Meng, Clement Yu www.elsevier.com/locate/infosys

PII: DOI: Reference:

S0306-4379(16)30271-X http://dx.doi.org/10.1016/j.is.2016.06.005 IS1147

To appear in: Information Systems Received date: 12 March 2014 Revised date: 14 December 2015 Accepted date: 14 June 2016 Cite this article as: Jing Yuan, Lihong He, Eduard C. Dragut, Weiyi Meng and Clement Yu, Result Merging for Structured Queries on the Deep Web with Active Relevance Weight Estimation, Information Systems, http://dx.doi.org/10.1016/j.is.2016.06.005 This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting galley proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Result Merging for Structured Queries on the Deep Web with Active Relevance Weight Estimation Jing Yuan Computer Science Department, University of Illinois at Chicago

Lihong He, Eduard C. Dragut Computer and Information Sciences Department, Temple University

Weiyi Meng Computer Science Department, Binghamton University

Clement Yu Computer Science Department, University of Illinois at Chicago

Preprint submitted to Information Systems

December 4, 2015

Result Merging for Structured Queries on the Deep Web with Active Relevance Weight Estimation Jing Yuan Computer Science Department, University of Illinois at Chicago

Lihong He, Eduard C. Dragut Computer and Information Sciences Department, Temple University

Weiyi Meng Computer Science Department, Binghamton University

Clement Yu Computer Science Department, University of Illinois at Chicago

Abstract Data integration systems on the Deep Web offer a transparent means to query multiple data sources at once. Result merging– the generation of an overall ranked list of results from different sources in response to a query– is a key component of a data integration system. In this work we present a result merging model, called Active Relevance Weight Estimation model. Different from the existing techniques for result merging, we estimate the relevance of a data source in answering a query at query time. The relevances for a set of data sources are expressed with a (normalized) weighting scheme: the larger the weight for a data source the more relevant the source is in answering a query. We estimate the weights of a data source in each subset of the data sources involved in a training query. Because an online query may not exactly match any training query, we devise methods to obtain a subset of training queries that are related to the online query. We estimate the relevance weights of the online query from the weights of this subset of training queries. Our experiments show that our method outperforms the leading merging algorithms with comparable response time.

Preprint submitted to Information Systems

December 4, 2015

1. Introduction

5

10

15

20

25

30

35

40

The amount of data produced by us, humans, in the Big Data era increases yearly by 30%1 . Googles index exceeds 60 trillion Web pages2 and beyond its and similar search engines reach lies an even vaster web of data (aka Deep Web): educational data, medical research data, financial information and a myriad of other material stored in Web databases. Many deep Web data integrators nowadays avoid building (in house) Web data warehouses that collect data from all sources. Instead they query the sources at runtime and dynamically update the answers supplied to users, e.g., Kayak.com and tripadvisory.com. When a user’s query is passed to a data integrator (DI), the DI transforms the query to match the requirements of each data source (or Web database (WDB)), and submits a transformed query to each WDB. WDBs return ordered lists of results. Once the results from WDBs are collected, the DI merges them into a single ranked list using the ranking information supplied by WDBs. This task is called result merging [1, 2]. We assume that the matching records across the returned lists are given [3, 4, 5]. In this paper, we present two result merging algorithms that are query dependent. Our approach is different from the existing techniques which generally compute a general merging model (from past queries) which is then applied indiscriminately to all (subsequent) queries. DIs are particularly useful for vertical search, i.e., searching on a specific segment of online content, such as automotive industry, medical information, or travel. Since WDBs may contain conflicting data, DIs can deliver more accurate (correct) results to users by aggregating the information about the same entities from each WDB [6, 7]. Background: Queries for structured data on the Web are typically formulated on HTML (query) forms, which have more than one input in general. For example, a restaurant query form (e.g., yelp.com) has several fields (attributes), such as location, cuisine, price and features. These queries are also called structured queries, as opposed to keyword queries, which are called unstructured queries. To our knowledge, result merging has mostly been studied for keyword search [1, 2] (surveys) on unstructured data. In this work, we present result merging methods for structured queries. A structured query on the Web can be regarded as an SQL select query with the where clause always given, but the order by clause potentially missing. For instance, the query “River North, Chicago, Greek, $” corresponds to select * from RESTAURANTS where location = ‘River North, Chicago’ and cuisine = ‘Greek ’ and price = ‘$’. The order by clause is missing in this query. Such a query has an implicit ranking, i.e., the user does not mention the desired ranking criteria. The search engine employs a custom ranking algorithm to order the list of results so that it meets user’s ranking expectation. These algorithms generally use attributes or external knowledge that are opaque to users, such 1 https://hbr.org/2012/10/big-data-the-management-revolution/ar 2 www.google.com/insidesearch/howsearchworks/thestory/

3

45

50

55

60

65

70

75

80

85

as user ratings of the searched items or user profiling [8]. In explicit ranking, on the other hand, users specify how the results need to be ordered by the WDB: e.g., by price or year in the book domain. In this work, we are interested in those queries with implicit rankings. Notice that we are not interested in guessing the ranking strategy of WDBs, rather by employing a wisdom-of-thecrowd approach (where the “crowd” is the WDBs), we want to best approximate user’s implicit ranking at query time. Overview of Our Merging Algorithm: A key characteristic of structured queries on the deep Web is that the set of WDBs capable of processing a query varies from one query to another. Consider the following structured query: q = (Cuisine = “French”; Price = “$$”; location = “Grant Park Museums”). Data sources, such as DineSite and MenuIsm, can process Cuisine. The former’s query capabilities can understand Price, but it cannot understand location (because it does not have this or an equivalent attribute for it in its HTML form). The latter handles Neighborhood, but it does not handle Price. We call the WDBs capable of processing a query q the Involved Web Databases (IWDBs) for the query q. We use a merging algorithm based on local ranks (Section 2). We first compute a mapping from the ranks in the WDBs to a global scoring scheme. For this we apply a rank-to-score mapping, e.g., the Borda scores [9] (Section 2). Then, we estimate a weight reflecting the quality or performance of each WDB. The score of a record returned from a WDB sei is multiplied by the weight of sei to generate the final score for the record. This boosts the final scores of the records returned from WDBs with better quality. The DI generates a ranked list of all records in descending order of the final scores of the records. The focus of this paper is about how to determine the weight of each WDB such that the merged results have high effectiveness. A naive weighting scheme uses the average precision of a set of training queries as the assessment of the quality of a WDB [10]. This weighting scheme is problematic because the performance of each WDB varies significantly for different queries. We propose a solution that attempts to estimate the weight (performance) of a WDB for each user query. The set of all possible (structured) queries that can be formulated from an HTML form is the Cartesian product of all its possible inputs. A naive strategy of enumerating the entire set of queries and estimate each WDB’s weight for each query is infeasible. A simple HTML form with 5 fields can yield a query space of over 240 million distinct queries (e.g., cars.com) [11]. Posting such a large number of queries poses an unreasonable load on WDBs. We propose a training procedure to estimate the capability (i.e., weight) of each WDB in processing an unseen (future) query. We also introduce two indexing schemes, IWDB-based and query-based indices, to save the learnt weights of the WDBs associated with each query for fast online query processing. Consider a query q = “Grant Park Museums, French, $”. Suppose that from a set of WDBs, the IWDBs for q are Chicago Reader (4), Menuism(5), Menupages(6) and Yahoo (7). We place the IDs of the WDBs in parentheses for convenient references. Assume that the estimated weights during training for the WDBs are 0.45, 0.15, 0.2, and 0.2, respectively. An entry in either an 4

Figure 1: The workflow of our ARWE model.

90

95

100

105

110

IWDB-based or a query-based index is of the form (key, content). The content has the same meaning for both indices; it is the set of weights learned for the IWDBs from a training query. Key is different for the two indices. It is the set of the IDs of the IWDBs sorted in ascending order in an IWDB-based index and it is the values of the attributes in q. In our example query, content = (0.45, 0.15, 0.2, 0.2) for both indices and key = (4, 5, 6, 7) in an IWDB-based index and key = (Grant Park Museums, French, $) in the query-based index. Notice that two or more queries may have the same set of IWDBs and thus have the same key in an IWDB-based index. Different queries have different keys in a query-based index. When several distinct training queries have the same key in an IWDB-based index, we average their weights per IWDB at the end of the training phase and store the averaged weights to be used online. The merging algorithm used online by the DI depends on the deployed indexing scheme. Let q ′ be an online query. With an IWDB-based index, we use the set of IWDBs associated with q ′ as the look up key. With a query-based index, we use the values of the attributes in q ′ as the lookup key. If an exact match is identified for q ′ in either of the indices, the weights of the IWDBs in the content are used for result merging. If no exact match is found in either index, we perform an approximate search in the index with the key of q ′ . We then identify a set of relevant entries in the index for q ′ , compile a set of IWDBs for q ′ from these entries and compute the weights for each IWDB by aggregating the weights in these entries. We use the new weights to compute the final aggregation scores of the records returned by the IWDBs and produce the final ranked list. We call our result merging model the Active Relevance Weight Estimation (ARWE) model. The workflow of ARWE is depicted in Figure 1. If no match (exact or approximate) is found in either index, our model reduces to the weighted version of the underlying rank-to-score mapping. For in-

5

115

120

stance, if we use Borda-fuse to implement the rank-to-score mapping our model reduces to the weighted Borda-fuse model [10], which calculates the weights of WDBs based on their average precision. Our main contributions in this paper are: • ARWE is the first model for result merging for records returned by structured queries with implicit rankings, to our knowledge. • Different from previous works on result merging, we determine the subset of WDBs capable of processing a query, and online estimate their performance (weight) for answering the query.

125

• We propose two indexing schemes, the IWDB-based and query-based, to support effective query processing. • We extensively evaluated ARWE. We constructed an operational DI (yumimeta.com), which allowed us to perform experiments on real-life data. ARWE outperformed the (Weighted) Borda-fuse [10] and RRF [12] algorithms.

130

135

140

145

150

The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 describes the construction of the two indices. We explain how to utilize the two indices for online weight assignment and rank aggregation in Section 4. Section 5 presents the experimental study. The paper concludes with Section 6. 2. Related Work The integration of Web databases has received significant attention in recent years (e.g., [13, 14, 15, 16, 17, 18, 11, 19, 20] and [21] (survey)). The work has focused on problems such as efficient crawling, finding the semantic equivalent fields in a set of interfaces, clustering and categorization of Web databases. In this work we address the problem of merging of ranked lists of structured data in response to structured queries. To our knowledge this problem has not been addressed. Since our work is most similar to the work on distributed information retrieval [1, 2] (surveys), in this section, we review this line of work. The merging algorithms for unstructured documents can be classified into three categories [1, 22] based on the type of information used: (1) merging based on full document content, (2) merging based on search result records (SRRs) and (3) merging based on local ranks. The principles of (2) and (3) are also applicable to merging structured records, while those of (1) are not. We describe only (2) and (3). Merging Based on Search Result Records. Several result merging algorithms are proposed to perform merging based on the information available in the retrieved SRRs, particularly the titles and snippets in the SRRs, e.g., [23, 24, 25]. Different from documents, returned structured records are brief and most of the records satisfy the query conditions (i.e., the where clause). For

6

155

160

165

170

175

180

185

190

195

200

instance, for the query “River North, Chicago, Greek, $” the vast majority of the SRRs returned by a WDB are Greek restaurants in River North, Chicago, but they are nevertheless ranked based on some custom strategy that is not immediately transparent from the values of the attributes in the records. For instance, a WDB may use a prestige-based relevance [26] that considers the colocation relationship between spatial objects. This relationship is not obvious from the mere glance over the records and requires complex off-line analysis of the spatial relationships between objects. Merging Based on Local Ranks. These algorithms can be classified into two main categories: similarity conversion based and voting based. The former converts the local ranks into similarities so that similarity-based merging techniques can be applied. The state-of-the-art algorithms in this category are SSL [27], SAFE [28] and MoRM [29]. These methods assume that a centralized sample database is created in advance by sampling documents from each WDB. For each query, SSL and SAFE rank documents in the centralized sample database with a retrieval algorithm g. They build a mapping function between the ranks in each WDB and the document scores given by g. The goal is to map the local ranks/scores to a common scale. Having a common scale makes possible to construct the final ranked list. These methods are not directly applicable in our setting because they arbitrarily select a single fixed centralized retrieval algorithm for learning the mapping. This is problematic in a heterogeneous metasearch environment: first, because a single algorithm is suboptimal for learning comparable scores for independent and uncooperative sources and second, because it is difficult to guess a proper centralized retrieval algorithm. MoRM attempts to alleviate the issues of SSL and SAFE by considering several centralized retrieval algorithms. However, the question still remains: which algorithms to choose from? In addition, note that the ranking solutions developed along the lines of SSL, SAFE and MoRM are inherently domain dependent. For one, the centralized retrieval algorithms are different for different domains: e.g., book domain vs. restaurant domain. Our algorithm is domain independent. Voting based methods treat each WDB as a voter and each result as a candidate in an election. They are more suitable for DIs. There are primarily two types of voting strategies: variations based on Borda’s method [10] and variations based on Condorcet’s method [30]. We use the Borda-fuse method to illustrate a possible rank-to-score mapping in this paper. This algorithm assigns a score to each object returned by a WDB. We need to assign scores that would account for the cases when the lists to be integrated are of uneven lengths. We first union all the lists Li to determine the cardinality of the union. Let the length of the union list be n. Then, in a list of k objects Li we assign scores to its objects as follows. The first object in Li gets score n, the second object in Li gets score (n − 1), etc. Let A = n + (n − 1) + ... + 1. Then, the list so far gets sum of scores B = n + (n − 1) + ... + (n − k + 1). Each of the other objects in the union which does not appear in Li is assigned a score of A−B n−k . The objects in the union are ranked in descending order of the sum of scores over all Li ’s. Reciprocal rank fusion (RRF) [12] is another voting based method. RRF outperforms Condorcet’s method and individual rank learning methods. RRF 7

205

210

215

220

sorts the documents according to a naive scoring formula. Given a set D of documents to be ranked and a set of R, for each permutation on ∑rankings 1 1..|D|, we compute RRF (d ∈ D) = σ∈R k+σ(d) , where σ(d) is the rank of the document d, and the constant k (not to be confused with top-k ) mitigates the impact of high weights. k = 60 was empirically determined in [12] and independently confirmed in [31]. RRF is utilized to build an DI for genomics data in [31]. Outranking Fusion (OF) [32] is a result merging method that uses decision rules to identify positive and negative reasons for judging whether a document receives a better rank than another documet. The method uses two conditions: the concordance and discordant conditions. Within the above notation, for two documents d1 and d2 , we compute two sets, Cp (d1 , d2 ) and Dv (d1 , d2 ), to judge whether d1 is ahead of d2 in the merged ranking. Cp (d1 , d2 ) = {σ ∈ R|σ(d1 ) ≤ σ(d2 ) − p} is the support set, while Dv (d1 , d2 ) = {σ ∈ R|σ(d1 ) ≥ σ(d2 ) + v} is the refutal set. p is the preference threshold which is the accepted variation of document positions supporting the hypothesis “d1 is ahead of d2 .” v is the veto threshold which is the variation of document positions that refute “d1 is ahead of d2 .” The concordance condition is |Cp (d1 , d2 )| ≥ cmin and the discordance condition is |Dv (d1 , d2 )| ≤ dmax , where cmin and dmax are two parameters. d1 is ahead of d2 in the merged ranking if both conditions are satisfied. 3. ARWE Training Phase In this section we describe the training phase of our ARWE result merging model and the construction of the two indices.

225

230

235

240

3.1. Per Query Weight Estimation In this section we describe our procedure to learn the per query weights for the WDBs for a set of training queries. Let Qgold be a set of gold standard queries. These are the queries for which the ideal top-k ranking is known. Let q ∈ Qgold and Lq be the ideal top-k list of results for q. Assume that we have d > 1 WDBs. Let Li be the ranked list of records returned by the ith WDB in response to the query q. Let [wi |1 ≤ i ≤ d] be the list of relevance weights of the WDBs for q. Our goal is to determine wi ’s. We describe here a procedure to determine the weights of the IWDBs for each query in Qgold . In Section 4 we describe how to utilize the learnt weights to determine the weights of the IWDBs for unseen (online) queries. Our training process has the following steps. First, submit q to each WDB. Retain only the top-k records from each Li . Second, determine the records in all the lists Li that refer to the same entity (i.e., apply record linkage [4] across all the lists). We use our record linkage algorithm reported in [33]. This is assumed to be known for the queries in Qgold . Third, determine a rank-to-score mapping. Fourth, use the mapping to assign scores to each record in each list Li . Fifth, equate the score of each R in Lq to the aggregated weighted scores of

8

245

250

255

260

265

270

R across all the lists Li . Finally, solve the resulted set of equations to determine the weights wi ’s. We describe how we compute the weights now. Denote by Sq the rank-toscore mapping induced for query q. Sq is a two-dimensional real-valued matrix [sji ]k×d , i.e., the matrix has a column for each WDB and a row for each record in Lq . The scores of the records in the list Li ’s not in Lq are not included in Sq . sji is the score assigned to the record Rj ∈ Lq based on the rank of Rj in the list Li . Let [Tq ] be the vector of scores of the records in the list Lq . That is, Tq [j] is the score of the record in the jth position in Lq according to the same rank-to-score mapping. By equating the score of each record in Lq to its weight aggregated score from the lists returned by the WDBs, we get the following system of linear equations for the variables wi ’s:       s1,1 s1,2 · · · s1,d w1 t1 s2,1 s2,2 · · · s2,d   w2   t2        (1)  .. .. ..  ×  ..  =  ..  ..  .     . . . .  . sk,1 sk,2 · · · sk,d wd tk We first attempt to solve the system of linear equations in wi . If the system is consistent, we use its solution. Otherwise, we estimate suitable values for wi . In other words, we try to find the weight values for the model which “best” fits the data (current query). We utilize the least squares method for this task. ∑k Let F = j=1 fj2 , where fj = sj,1 w1 + sj,2 w2 + · · · + sj,d wd − tj . The least squares method finds its optimum when the sum of squared score residuals, F , is minimum. The minimum of the sum of squares is found by setting the gradient to zero. Since the model contains d parameters there are d gradient equations. We give an example of the entire process. The choice of the rank-to-score mapping is orthogonal to our overall model. Preferably, this must be domain independent and effective. The voting-based scoring functions (Section 2) are domain independent and are quite effective in practice [1]. We adopt the Borda scores to obtain Sq and Tq ; Tq [j] = n − j + 1 for j ∈ [1..k]. Example 1. Consider the example in Figure 2. There are three lists of results from three WDBs and the gold standard list for some training query. WDB1, WDB2, WDB3 returns 3, 5 and 5 records, respectively. The records R6 from WDB2 and R7 from WDB3 are not present in Lq . Their scores do not appear in Sq . The union of the three lists has 7 records. (top−)k = 5, d = 3 and n = 7. Using the Borda method, we obtain the following system of linear equations:     5 7 5 7   6  6  6 4 w1     2.5 1.5 7  ×  w2  =  5  (2)     7  4  5 1.5 w3 2.5 3 3 3

9

Figure 2: Example of returned lists of results with Borda scores (scores) and unknown weights.

275

The system does not have a solution. We use the least square method. For instance, the residual of the weight aggregated score and the ground truth score of R3 is f3 = 2.5 w1 + 1.5 w2 + 7 w3 − 5. We obtain the values w1 = 1.415, w2 = 5.681 and w3 = 7.179 with the least square method. We repeat the process described above for each training query q ∈ Qgold and store the weights in a custom index structure, which we describe next.

280

285

290

295

300

3.2. Weight Indexing Schemes We propose two indexing schemes, IWDB-based and query-based, for weights to effectively and efficiently assist the online query processing. We present them in turn. We describe their usage online in Section 4. Assume that the HTML query form of the DI has h fields A1 , A2 , · · · , Ah . Let q ∈ Qgold and IW DBq be the list of IWDBs for q sorted in ascending order of their IDs. 3.2.1. Involved Web Databases Here we describe the method for identifying the involved WDBs for a query q. We create an index data structure such that for each attribute-value pair Ai = aij in a HTML form we record the list of WDBs that can process Ai = aij . We identify the set of IWDBs for q as follow. For each attribute-value pair A = a in q we use the index to find the set of WDBs that can process A = a. Let X be the set of WDBs such that each one of them can process some field of q. The subset of Web databases in X whose set of SRRs in response to q is not empty is the set of involved Web databases (IWDB) for q. 3.2.2. IWDB-based Index In an IWDB-based index, each entry of the index is represented as (key, content), where key is IW DBq and content is Wq . Suppose the query q is Location = West Loop, Cuisine = Sushi, Price = $$$. Among the WDBs of the DI, suppose that IW DBq is [Dexknows (2), Menuism (5), Menupage (6), Yahoo (7), Yellow Pages (8)] and their corresponding weights are Wq = 10

305

310

315

320

325

330

335

340

[0.32, 0.21, 0.27, 0.12, 0.08]. Then (key, content) =([2, 5, 6, 7, 8], [0.32, 0.21, 0.27, 0.12, 0.08]). If two or more distinct training queries have the same key in an IWDB-based index, we average their weights per IWDB at the end of the training phase and store the averaged weights in the index. The size of an IWDB-based index is upper bounded by 2d , i.e., it can have as many entries as the cardinality of the powerset of the set of WDBs. The rationale for the use of IWDB-based index is that the training queries sharing the same set of IWDBs with a test (online) query can provide direct comparison over the IWDBs and should be recorded in advance. Hence, given a test query q ′ with the same set of IWDBs as q, the trained weights for q can be applied to the rank merging of the records retrieved for q ′ . 3.2.3. Query-based Index Oftentimes, the performance of a data source can be fairly divergent for two different values a and b of an attribute Ai of the query form. The performance depends on the accuracy and the abundance of the information that can be obtained from the database of the data source for a and b, respectively: the information extracted from the data source may be more complete for Ai = a than for Ai = b. For instance, a WDB may have a more complete set of Chinese restaurants than Greek restaurants. Therefore, a WDB that performs well on a query q1 may perform poorly on a different query q2 , even if q1 and q2 have the same set of attributes and are only different in the values. Consequently, we also consider a query-based index, where for each q ∈ Qgold , key is the set of attribute-values pairs in the query q, and content is a vector w with d entries, where w[j] = null if the jth WDB is not in IW DBq and w[j] is the estimated weight of the jth WDB in IW DBq . For the example query in the previous section, key = (Location = West Loop, Cuisine = Sushi, Price = $$$) and content = [null, 0.32, null, null, 0.21, 0.27, 0.12, 0.08, null], where we assumed that d = 9. We omit the attributes in a key throughout the paper when their meaning is implicit from the values. 3.3. Indexing Discussion We construct a synthetic example of an IWDB-index to emphasize a number of important issues with this index and the usefulness of a query-based index. In this example, the DI has a query form with three fields A, B and C. The sets of values of these fields are: A = {a1 , a2 , · · · }, B = {b1 , b2 , · · · } and C = {c1 , c2 , · · · }. There are four WDBs. Five training queries are used to build the IWDB-based index. The index is shown in Table 1. The column “Queries” shows the training queries for each index entry. It is not included in an actual IWDB-based index; it is shown for illustration purposes. We depict the index in a matrix-like format to ease its understanding. Observe that the last entry of the index corresponds to the training queries q2 and q3 . Suppose that the weights ′ ′ ′′ ′′ corresponding to the query q2 are w42 and w44 , and those of q3 are w42 and w44 . ′ ′′ ′ ′′ w44 + w44 w42 + w42 and w44 = . The weights for the key = (2, 4) are w42 = 2 2

11

Table 1: Example of fragment IWDB-based index.

Key

Content Queries W DB1 W DB2 W DB3 W DB4 1,3,4 w11 w13 w14 q1 = (a1 , −, −) 1,4 w21 w24 q5 = (−, b1 , −) 1,2,3 w31 w32 w33 q4 = (a2 , −, c1 ) 2,4 w42 w44 q2 = (−, b2 , c1 ); q3 = (−, b3 , c2 )

345

350

355

360

365

370

375

This issue does not occur in a query-based index since each query has a distinct entry in the index. Let q ′ = {a1 , b1 } be a test query. Its IWDBs are W DB1 , W DB3 and W DB4 . Hence, its key is (1, 3, 4), for which there is a matching entry in the index (Line 1, Table 1). We use the weights in this entry to merge the lists of results from W DB1 , W DB3 and W DB4 in response to q ′ . Although, we show that an DI with an IWDB-based index outperforms the existing result merging algorithms (Section 5), the weight estimates for q ′ could be further improved with a querybased index. Notice that the weights of the entry with the key (1, 3, 4) are calculated with respect to a training query containing only the attribute-value pair A = a1 , whereas the test query q ′ in addition to A = a1 also has B = b1 . Better weight estimates for q ′ are obtained by taking into account the weights of the WDBs for B = b1 as well (Line 2). In other words, we attempt to quantify the performance of the WDBs w.r.t. both attribute values. Let q ′′ = {a1 , c1 } be another online query. Its set of IWDBs is {W DB1 , W DB2 , W DB3 , W DB4 }. Hence, its key = (1, 2, 3, 4), for which there is no entry in the index matching it exactly. We perform an approximate matching look up with this key in the index and look for those entries with the property that the union of their keys equals the key of q ′′ . There are at least a couple of sets of keys with this property: e.g., using the entries in Lines 1 and 3, or the entries in Lines 1 and 4. The next section describes how to make an educated choice between these alternatives. 4. Query Processing With ARWE In this section, we describe two novel strategies for result merging. The two strategies are differentiated by their underlying indexing data structures. One of them uses the IWDB-based index and the other uses the query-based index. The distinctive feature among them is the way the relevance weights for the IWDBs are estimated for unseen queries. Let qtest be an online query and IW DBtest be its set of involved Web databases. qtest may or may not have been seen in the training phase. Recall that the lookup key of qtext is the list of IWDBs associated with qtext when searching in an IWDB-based index and is the set of attribute-value pairs in qtext when searching in a query-based index. In this section we present the estimation of relevance weights of the IWDBs for queries that do not match exactly an entry in the index. (If an entry exactly matches the key of qtext in

12

380

385

either of the indices then the weights of the IWDBs in the content of that entry are used for merging the list of results in response to qtest .) For either of the index data structure, if an exact match cannot be found, then our objective is to identify a relevant set of entries for qtest in the index so that by aggregating their weights (for IW DBtest ) we obtain the estimates of the relevance weights for IW DBtest in answering qtest . There are several ways to perform this task. We first attempted a Borda-based method, but the results were poorer than the baseline. We apply the Condorcet model [30, 34] to calculate a unified and normalized list of weights for IW DBtest . This gives significantly better results. In this paper we present only this method. We proceed as follows: • Step 1: find a set of relevant entries M for qtest in the index.

390

395

400

405

410

415

420

• Step 2: apply the Condorcet model to M to derive the weights of IW DBtest . Sections 4.1 and 4.2 present two alternative ways of accomplishing Step 1. In Section 4.1 we present how to find a set of relevant entries for qtest in an IWDB-based index. In Section 4.2 we describe the process for a query-based index. We introduce the Condorcet model and describe our methodology for carrying out Step 2 in Section 4.3. We denote the set of entries in an index by E = {e1 , e2 , · · · , er }, where ei = (keyi , wi ), keyi is the key of the ith entry and wi is the associated list of weights. We denote the set of keys in the index E by K = {keyi } for i = 1, · · · , r. 4.1. Finding Relevant Entries in an IWDB-Index In this section E is the set of entries in an IWDB-based index, hence keyi represents the set of the IDs of the involved WDBs in ascending order and wi are their corresponding weights. To find relevant entries in an IWDB-based index for the test query qtest , we need to determine a minimum subset M ⊆ K such that each WDB involved in processing of qtest appears in some keyj ∈ M . That is, M is such that ∀se ∈ IW DBtest ∃keyj ∈ M, se ∈ keyj and M is of minimum cardinality among all the sets with this property. Additionally, the Condorcet model for weight aggregation requires that the element sets (keys are regarded as sets) in M are connected. Intuitively, this means that for any pair of keys key, key′ ∈ M there exists a list of keys [X1 , ..., Xu ], Xi ∈ M ∀i ∈ [1..u] such that key = X1 , key′ = Xu and X1 ∩ X2 ̸= ∅, · · · , Xu−1 ∩ Xu ̸= ∅. The incentive to find a minimum subset of keys M from K to cover IW DBtest is that the weights of the IWDBs’ associated with M can provide us a good characterization of the performance of the Web databases in IW DBtest . The reason is that they are obtained from a set of training queries whose set of IWDBs is very similar to that of qtest . Thus if we aggregate the weights that are calculated for this set of training queries, the resulted weights are applicable for weighting the relevance of each se ∈ IW DBtest . In other words, when an exact match is not found, approximate matching entries in the index provide a good estimation of the performance of IW DBtest for the unseen query qtest . 13

Algorithm 1 Weighted Set Cover - A Greedy Algorithm Input: Test query qtest , the set of its IWDBs IW DBtest , and the set of keys in the IWDB-based index K Output: A set cover M that covers IW DBtest 1: 2: 3: 4: 5: 6: 7: 8: 9: 10:

425

430

435

440

445

C ← ∅; M ← ∅; K ′ ← K; while C ̸= IW DBtest do

1 ∀ key ∈ K ′ ]; |(IW DBtest − C) ∩ key| Find key ∈ K whose cost (i.e., α(key)) is the smallest, say keyt . If multiple such keys exist, select the one with the smallest cardinality; C ← C ∪ (IW DBtest ∩ keyt ); M ← M ∪ {keyt }; K ′ ← K ′ − {keyt }; return M ; Let α(key) =

We formulate our problem as a variant of the Weighted Connected Set Cover problem (WCSC) [35]. In WCSC, we have a ∪ universe U of elements and a nonempty family of subsets F of U such that A∈F A = U and a connected graph∪G = (F, E) on vertex set F. A connected set cover is a set cover R ⊆ F (i.e., A∈R A = U ) such that the subgraph G[R] induced by R is connected. (A graph is connected if there is a path from any vertex to any other vertex in the graph.) If additionally each subset A ∈ F is assigned a positive cost c(A), we have the weighted connected set cover problem, which is the task of computing the connected set cover with minimum weight subfamily of sets (vertices). Both the weighted and un-weighted versions of the connected set cover are NP-hard problems. Elbassioni et al. [35] gave the first polynomial approximation algorithm with approximation guarantees that approximates WSCS no more than √ O( p log p) times optimal, where p = |F|. This algorithm reduces an instance of the WSCS to an instance of the well-studied node-weighted Group Steiner √ Tree problem. Then they use the O( p log p) approximation algorithm for that problem by Khandekar et al. [36]. Although, it gives approximation guarantees, this algorithm is not suitable in practice as it has a large time complexity. We provide a greedy algorithm based on the framework of the greedy algorithm for the weighted set cover problem by Chvatal [37]. The algorithm is depicted in Algorithm 1. We first disregard the connectivity requirement for M and then discuss how to alter the algorithm to accommodate this requirement. In our problem setting, M corresponds to R, IW DBtest to U and K to F; we define the cost of a set as the inverse of the number of uncovered elements in IW DBtest that can be covered by the set. We regard a key as a set of WDB IDs here. Our goal is to find a set of keys M to cover all the elements in IW DBtest . Let C represent the set of elements covered so far. We define the cost α of

14

each key ∈ K as α(key) =

450

455

460

465

470

475

480

485

1 |(IW DBtest − C) ∩ key|

(3)

The definition indicates that the cost of a key key depends on the number of uncovered elements in IW DBtest that key can cover. Intuitively, a smaller cost indicates that more uncovered elements in IW DBtest are covered by the set. Clearly, a set with a smaller cost is more desirable than one with a larger cost. C = ∅ when the algorithm starts and it is updated each time we select the key with the minimum cost. If multiple such keys exist, we select the one with the smallest cardinality |key|. Then, we add newly covered elements to C. If keyt is selected, we update C as C ← C ∪ (IW DBtest ∩ keyt ). In each step we choose one key to join C, until C = IW DBtest (we cover the entire IW DBtest ). The chief intuition of Algorithm 1 is that if an exact training query is present in the index, its key has the least cost with the smallest cardinality and thus is selected first. For the time-complexity of the algorithm, notice that each time we invoke the statement while (Step 4), we need to compute/update the cost of each key ∈ K that has not been selected yet (Step 5). This step is significantly sped up by using an inverted list data structure. The inverted list records the set of keys associated with each WDB. Specifically, if a key contains the ID of a WDB se, then key is in the inverted list indexed under the ID of se. Let m′ denote the number of indexed entries under IW DBtest . For each step, by going through the entries under IW DBtest in the inverted list, we are able to compute the cost of all involved keys (having at least one element in common with IW DBtest ) within time O(m′ ). In the worst case, to find a set cover for IW DBtest requires m iterations, i.e. execute the while loop m = |IW DBtest | times. Thus the running time of Algorithm 1 is at most O(m m′ ). Example 2. Take Figure 3 as an example. Suppose that the set of IWDBs for the test query qtest is IW DBtest = {1, 2, 3, 4, 5, 6, 7} and we want to find a set M of keys to cover it. According to Algorithm 1, in each step we select the key that has the smallest cost. Initially, we have C = ∅, and α(key1 ) = 21 , α(key2 ) = α(key5 ) = 14 , α(key3 ) = α(key4 ) = α(key6 ) = α(key7 ) = 31 . The keys key2 and key5 have the same smallest cost 14 and the same size 4. In this case, we randomly select one of them and add the elements it covers into C. Suppose we select key2 , we have C = {1, 2, 3, 5} and M = {key2 }. In the second step, we again compute the cost of each key. We have α(key1 ) = α(key3 ) = ∞, α(key4 ) = 11 = 1, α(key5 ) = α(key6 ) = α(key7 ) = 12 . The keys key5 , key6 and key7 have the same smallest cost 12 , while key6 and key7 have the same size 3. We randomly select one of them to add into M and update C. Suppose that key7 is added into M , then we have C = {1, 2, 3, 5, 6, 7} = ̸ IW DBtest and M = {key2 , key7 }. In the third step, we have α(key4 ) = α(key5 ) = 1, all others are infinity. We select key4 which has a smaller size, to add into M and update C. At this point, we have C = T . Thus one solution produced by Algorithm 1 is M = {key2 , key4 , key7 }. 15

Figure 3: Example of a fragment IWDB-based index.

490

495

500

505

510

The set M of keys returned by Algorithm 1 is not guaranteed to be connected. We describe now the alterations required by Algorithm 1 to return a connected M . In our setting, we define the vertex set graph G = (K, E), where there is a vertex for each key in K and there is an edge between two vertices vi and vj if their keys keyi and keyj , respectively, overlap, i.e., keyi ∩ keyj ̸= ∅. For Algorithm 1 to return a connected M , it needs to further expand M until for each keyi ∈ M there exists a path between keyi and every other set of keys keyj ∈ M in the vertex set graph G. We slightly modify Algorithm 1 by adding a constraint when selecting a key from K. In each iteration, among the keys not yet selected (i.e., from K ′ ) that overlap with the keys selected in the previous iterations, we select the key with the smallest cost (Equation 3). Notice that the appropriate set of keys can be accumulatively identified to speed up the procedure. The reason is that if a key overlaps with a previously picked key and it is identified as a candidate key, it remains a candidate key in the following iterations as well. So once a key keyt is added to M , we check the keys that were not previously identified as candidates and add them to the list of candidate keys if they have a common element with keyt . In case we are not able to identify enough candidate keys to cover IW DBtest , our merging model reduces to the weighted model of the rank-to-score model used in training (Section 3.1). For instance, if Borda-fuse rank-to-score method is utilized, then the model reduces to the Weighted Borda-fuse (WBF) [10], when a cover cannot be found for IW DBtest . 4.2. Look up in a Query-based Index

515

The methodology employed for an IWDB-based index (i.e., driven by minimum set cover) is not applicable to a query-based index. Recall that the keys in an query-based index are the training queries themselves. And, thus, the set cover for the keys obtained from a query-based index does not offer enough information about the IWDBs of a test query. An example helps illustrate the issue. Consider the query-based index shown in Table 2. We need to find relevant entries for the test query qtest : “Location = Loop, Cuisine = –, Price =

16

Table 2: Example of fragment query-based index.

Key Loop, French, -, Italian, $$ Lakeview, French, $

520

525

530

535

WDB1 w11 w21 w31

Content WDB2 WDB3

w32

w33

WDB4 w14 w24 w34

$$”. Suppose that its set of IWDBs is {W DB1 , W DB2 , W DB3 , W DB4 }. Applying the algorithm in the previous section to the keys in this index, we find the minimum set cover M = {key1 , key2 }. However, the weights corresponding to key1 and key2 are not appropriate to be used for weight aggregation since they only contain the weight information for W DB1 and W DB4 . We thus propose a different method to find relevant entries in a query-based index. In this section E is a query-based index, hence, for a training query q, keyi is the structured query itself and wi denotes the set of weights associated with the IWDBs for q. Recall also that we are describing the case when the key of the test query qtest does not have an exact match in the index. We propose a technique for identifying the relevant entries in the index for qtest based on the notion of “Entry Query Relevance” (EQR). EQR is inspired from the concept of Result Query Relevance (RQR) [24] from web page retrieval. In web page retrieval, RQR is utilized to measure the relevance of SRRs to a query by calculating the portion of the query terms present in SRRs (i.e., titles, snippets, etc.). We use EQR to measure the relevance of training queries (keys in a query-based index) to qtest . EQR measures the degree of overlap between the attribute-value pairs in a training query and the attribute-value pairs in a test query. Intuitively, the more attribute-value pairs the key (i.e., training query) of an index entry shares with a test query the more relevant that entry is for the test query. EQR is defined as: |keyi ∩ qtest | EQR(qtest , ei ) = |qtest | where keyi ∩ qtest is the set of attribute-value pairs appearing in both keyi and qtest , and |qtest | is the number of attribute-value pairs in qtest . For example, using Table 2, for the test query qtest =“Loop, French, $$$”, we can find the entry e1 with the key key1 = (Loop, French, -), and the entry e2 with the key key2 = (Lakeview, French, $). We have EQR(qtest , e1 ) = 23 and EQR(qtest , e2 ) = 13 . Observe that EQR is 1 (max) for any training query and 0 (min) for a query whose attribute-value pairs are not present in the set of training queries. To avoid collecting too many entries from the index, we restrict the set of relevant entries for a test query qtest to those keys that are either subsets or supersets of qtest . The entries with this property can be quickly collected with an inverted index. The superset entries can be collected by performing an intersection of all the lists of entries associated with target attribute-value pairs in the inverted index. For each relevant entry, we calculate its EQR for qtest and use the value as a relevance coefficient for aggregation (next section).

17

540

545

550

555

560

565

570

575

Observe that the set of relevant keys obtained this way inherently meets the connectivity requirement in Step 2. The reason is that first each non-null entry of wi of a matching keyi corresponds to a WDB that is capable of processing some attribute-value pair in keyi . And, second, because if on one hand keyi is a superset of qtest then wi must have a non-null value (weight) for all WDBs in IW DBtest , and if on the other hand keyi is a subset of qtest then wi has a non-null value for at least one of the WDBs in IW DBtest . 4.3. Aggregating the Relevance Weights In this section we describe our method of computing the weights of the WDBs in IW DBtest . At this stage in the algorithm, we have a nonempty subset of entries Ω ⊆ E from an IWDB/query-based index relevant to the test query qtest . For ease of presentation we view Ω as a set of weight records with the property that ∀ω ∈ Ω ∃t ∈ IW DBtest , ω[t] ̸= null. Our goal is to devise a method of aggregating the weight records into one single weight record β that represents the relevance weight of the WDBs in IW DBtest in answering qtest . We begin with an overview of the Condorcet model for relevance weight estimation, then describe our method for computing the weights of the WDBs in IW DBtest that utilizes this model. 4.3.1. The Condorcet Model In general, we are given a finite set of candidates, a set of voters and a set of votes. The vote of a voter is a linear order of the candidates, with the meaning that the candidate in position i is preferred to the one in position j if i < j. We can construct a corresponding Condorcet graph G = (V, E) such that there is a vertex for each candidate and for each candidate pair (x, y), there exists an edge from x to y (denoted by x → y) if x receives at least as many preferences as y in the set of votes. If the candidates received the same number of preferences then there is an edge pointing in each direction (denoted x ↔ y). A Condorcet graph that has at least one edge between every pair of vertices is called semi-complete [38]. Montague et al. [30] proved an important property of a semi-complete graph: Every semi-connected graph contains a Hamiltonian path. A Hamiltonian path of a graph G in general is a path that visits each vertex exactly once. Montague et al. [30] proposed the Condorcet-fuse method that finds a Hamiltonian path in O(n d log n) time, where n is the number of candidates and d is the number of voters. A Hamiltonian traversal of the semicomplete Condorcet graph produces the rankings and weights for the candidates. We will illustrate the process of obtaining the weights from a Hamiltonian path when we discuss our use of the Condorcet model. 4.3.2. Weight Relevance Computation for the IWDBs The main steps of our algorithm (Algorithm 2) for computing the relevance weights of the WDBs in IW DBtest are:

580

1. Condorcet graph G construction (Lines 1 - 15);

18

2. Find a Hamiltonian path in G. This gives the quality order of the WDBs in IW DBtest in processing qtest (Line 16); 3. Compute the relevance weights for the WDBs in IW DBtest for answering qtest (Lines 17 - 21). 585

590

595

We describe these steps in turn now. Condorcet Graph Construction In our setting the set of candidates is the WDBs in IW DBtest . Hence, the set of vertices V = IW DBtest . To apply the Condorcet model we need to define the edge relationship between vertices in the graph and their weights. Edge Definition: The edge between a pair of data sources sei , sej ∈ IW DBtest , i < j is defined as follows. Let Ωij ⊆ Ω be the set of weight records where both sei and sej have non-null weights. Let numi,j = |Ωij |. Let wli denote the weight of the WDB sei recorded in the weight record ωl ∈ Ωij . We use the formula below to denote the accumulated percentage that sej is preferred over sei in Ω. ∑ accuP CTij = sgn(wli − wlj ) pctBetter(wli , wlj ), (4) ωl ∈Ωij

where sgn is the sign function. pctBetter is defined as:  |wli − wlj |     wlj |wlj − wli | pctBetter(wli , wlj ) =    wli  0

605

610

if wlj > wli , if wlj = wli

accuP CTi,j to represent the average preference relationship between numi,j sei and sej . The sign of accuP CTi,j gives the direction of the edge between the pair of data sources sei and sej . If accuP CTi,j > 0 then we add the edge sei → sej to G; if accuP CTi,j < 0 then we add the edge sej → sei to G. If accuP CTi,j = 0 then we add the edge sei ↔ sej to G. accuP CTij (Equation 4) is slightly modified when a query-based index is utilized: each term in the summation is multiplied by EQR(q, ei ). Algorithm 2 describes the process when the query-based index is used (see Lines 5 and 7). EQR(q, ei ) needs to be removed from Lines 5 and 7 when an IWDB-index is utilized. Semi-complete Graph: We discuss now how to modify graph G so that it becomes a semi-complete graph (Line 15 in Algorithm 2). Because the set of relevant entries Ω is connected, the graph G built so far is connected (Lines 1 -14). To apply the Condorcet method we need G to be a semi-complete graph. Consequently, we need to ensure that the graph contains at least one edge between every pair of nodes. We perform this task by adding necessary edges to G while preserving the preference relationship among the nodes. We proceed as follows. First, we need to define a new weight mapping function for edges in the We use

600

if wli > wlj ,

19

Algorithm 2 Weights Aggregation for Relevant Entries Input: IW DBtest and Ω - the set of weight records Output: The weights βi (1 ≤ i ≤ h) for the WDBs in IW DBtest 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 615

620

for each 1 ≤ j < k ≤ t do weightedCountj,k = 0; for each entry ei = (keyi , wi ) ∈ Ω do if wij > wik then (wij − wik ) accuP CTj,k + = EQR(q, ei ) wik else (wik − wij ) accuP CTj,k − = EQR(q, ei ) wij if accuP CTj,k > 0 then add sej → sek to G; accuP CTj,k is the weight of this edge; numj,k else if accuP CTj,k < 0 then add sek → sej to G; accuP CTj,k is the weight of this edge; numj,k modify G into a semi-complete graph G′ ; get Hamiltonian path HP : seim → seim−1 → · · · → sei1 in G′ for each seil ∈ HP, l > 1 do ∏l accuP CTp,p−1 βl = x p=2 ; nump,p−1 ∑h solve the equation x + l=2 βl = 1; for each seil in HP do compute βl by replacing x in the formula in Line 18;

graph. We denote it by λ. For each edge (se, se′ ) ∈ E, λ(se, se′ ) is the weight of the edge in G. The weights of these edges will not be changed. Second, for each vertex se ∈ V , we first identify se’s two-hop neighbors. Then, we add a directed edge between se and each of its two-hop neighbors, if an edge does not already exist between them. The direction and weight of each newly added edge are computed based on the intermediate node that connects se and its two-hop neighbor. Here is an example. Let se′ be a two-hop neighbor of se such that there is no edge between them in the graph. Hence, we need to add an edge between se and se′ . Let se′′ be a common neighbor of both se and se′ . Suppose λ(se,se′′ )

625

λ(se′′ ,se′ )

we have se −−−−−−→ se′′ −−−−−−→ se′ , then we know that se is λ(se, se′′ ) times better than se′′ , and se′′ is λ(se′′ , se′ ) times better than se′ . Thus we can compute the preference value λ(se, se′ ) = (λ(se, se′′ ) + 1)(λ(se′′ , se′ ) + 1) − 1. λ(se,se′ )

So we add an edge se −−−−−→ se′ in G.

20

Algorithm 3 Building Semi-Complete Graph Input: Connected graph G = (V, E) Output: Semi-Complete graph G′ = (V, E ′ ) 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11:

630

635

640

for each node se in V do sign = True; while sign do sign = False; H ← ∅; use Breadth-first-search to find two-hop neighbors of se and add them into H; if H ̸= ∅ then for each two-hop neighbor se′ ∈ H do compute preference value λ(se, se′ ) based on the intermediate node that connects se and se′ ; add edge (se, se′ ) to E with weight λ(se, se′ ); sign = True;

The intuition is that the preference relationship among the nodes is transitive. Thus we are able to utilize existing preference values in G to compute the preference values of newly connected pairs of nodes. This procedure, which is depicted in Algorithm 3, produces a new graph G′ = (V, E ′ ) with E ⊆ E ′ , which is semi-complete. Compute Relevance Weights We can now get a Hamiltonian path: HP = seim → seim−1 → · · · → sei1 in G′ , m = |IW DBtest |. Assume that the WDB sei1 ∈ HP has a relevance weight of x. We define the relevance weight of the WDB seil in terms of the relevance weights of the WDBs in HP after it in HP , i.e., seil−1 → · · · → sei1 , and the weights of the edges between l ∏ accuP CTp,p−1 . The only two successive WDBs in HP . That is, βl = x nump,p−1 p=2 unknown parameter is x. We can obtain a value for it by constraining the sum of all relevance weights of all WDBs in IW DBtest to be 1 (Lines 17 - 19). That ∑m is, l=1 βl = x+ m ∏ l m ∏ l ( ∑ ∑ accuP CTp,p−1 accuP CTp,p−1 ) +x = x 1+ = 1. Having x we nump,p−1 nump,p−1 p=2 p=2 l=2

645

650

l=2

can compute the relevance weights of all WDBs in IW DBtest (Lines 20 - 21). Cycles If there is a cycle in the Condorcet graph then all the WDBs in the cycle are considered to have equivalent relevance weight. We skip the details of the required changes in Algorithm 2 due to space limitation. Time Complexity Algorithm 2 computes the weight relevance for the WDBs in IW DBtest in O(|IW DBtest |2 |Ω|). The time complexity is dominated by the procedure to generate the semi-complete Condorcet graph (Algorithm 3), which takes O(|IW DBtest |2 |Ω|) in the worst case. This is impractical for large

21

Figure 4: Retrieval effectiveness of the merging models.

IW DBtest and Ω. However, the number of WDBs selected for a given query is in general small (less than 10 [39]). This is an affordable cost for our result merging model in practice, as shown experimentally in Section 5. 5. Evaluation 655

660

665

670

675

The main goal of the experimental study is to show that our result merging algorithms outperform the existing merging algorithms on real life data. Secondly, we want to analyze the robustness of our algorithms by varying the properties of the training data set (queries). 5.1. Performance Metrics and Datasets We are not aware of any gold standard set of structured queries for structured data on the Web. Consequently, to create an unbiased testing environment we have implemented an DI, called herein Yumi (yumi-meta.com), that connects to 9 Web databases: ChicagoReader, CitySearch, DexKnows, MenuIsm, MenuPages, Metromix, local.Yahoo, YellowPages, and Yelp. We restricted Yumi to retrieve only restaurant entities from these data sources. Yumi also connects to a tenth data source, Zagat. Zagat is regarded as ground truth holder for this domain. Zagat is a well-known and highly regarded authority in the restaurant ranking/rating industry in U.S. Hence, a highly ranked restaurant by Zagat is very likely to be a good restaurant. Using this domain as a case study we get unbiased testing environment with ground truth virtually for free. In Yumi, users are given 50 options for location, 19 options for cuisine type and 6 options for price. The total number of possible user queries is 5700, from which we randomly selected 350 training queries and 150 test queries. ARWE-IWDB finds exact matches for 79 of the 150 test queries. ARWE-Q finds exact matches for only 12 queries. All 9 Web databases return non empty lists of results to each of the 500 (350 + 150) queries. For comparison purposes, we have implemented the Borda-fuse, Weighted Borda-fuse (WBF) [10] and RRF [12]. RRF in particular was shown to outperform many result merging algorithms, including (weighted) Condorcet-fuse [30].


Table 3: Retrieval effectiveness when the percentage of exact match test queries is varied.

             Percentage of Exact Match Test Queries (NDCG@10)
Models       0%      10%     20%     40%     80%     100%
ARWE-IWDB    0.477   0.463   0.491   0.598   0.645   0.683
ARWE-Q       0.551   0.596   0.641   0.692   0.851   0.926

These algorithms are described in Section 2. In this section we show that our algorithms outperform all of them by significant margins. Throughout this section we refer to our algorithms as ARWE-IWDB and ARWE-Q. The former refers to our merging algorithm that uses the IWDB-based index, while the latter refers to our merging algorithm that uses the query-based index.

5.2. Experimental Results
In this section we evaluate the performance of our proposed result merging algorithms.


5.2.1. Accuracy of Various Merging Models
In this set of experiments we compare the performance of our merging algorithms against that of four other merging models. We use the normalized discounted cumulative gain (NDCG) [40], a popular measure for search performance evaluation; a higher NDCG value indicates better performance (a small computation sketch is provided after Section 5.2.2). The results are averaged over the 150 test queries. As shown in Figure 4, the ARWE models achieve the best performance among all the merging models. The ARWE models are better than the other models at every NDCG rank level, and by significant margins. In particular, at NDCG@1, ARWE-Q and ARWE-IWDB achieve gains of more than 10 points over RRF, which ranks third. We also conducted t-tests on the improvements of ARWE over the other models. The results show that the improvements are statistically significant for NDCG@1-10 (p-value ≤ 0.05).

5.2.2. Exact Match versus Non-Exact Match
The objective of this set of experiments is to test the effectiveness of ARWE-IWDB and ARWE-Q under different percentages of exact match test queries. We vary the number of exact match test queries from 0 to 150 and calculate the NDCG values of the merged lists produced by the two merging models. The results reported in Table 3 are averaged over the 150 test queries. We can see from the table that as the percentage of exact match queries increases, the average NDCG@10 value increases for both merging models. That is, the accuracy of both merging models increases as more exact matches appear among the test queries. Compared to ARWE-IWDB, the average NDCG@10 values of ARWE-Q increase more rapidly. This verifies that the query-based index can provide more fine-grained and accurate weight information about the IWDBs for an exact match test query.
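The sketch below shows one common way to compute NDCG@k from graded relevance judgments (gain divided by a log2 position discount, normalized by the ideal ordering); it is provided for illustration only, and the grades in the example are hypothetical.

    import math
    from typing import List

    def dcg_at_k(gains: List[float], k: int) -> float:
        # Discounted cumulative gain of the top-k results with a log2 discount.
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

    def ndcg_at_k(gains: List[float], k: int) -> float:
        # Normalize by the DCG of the ideal (descending-gain) ordering.
        ideal = dcg_at_k(sorted(gains, reverse=True), k)
        return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

    # Example: hypothetical graded relevance of the top-5 merged results.
    # ndcg_at_k([3, 2, 3, 0, 1], 5) is approximately 0.972.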

Table 4: Retrieval effectiveness of merging models when the percentage of training queries is varied.

             Number of Queries in Training Set (NDCG@10)
Models       1%      2%      3%      4%      5%      6%
WBF          0.376   0.380   0.385   0.389   0.391   0.394
ARWE-IWDB    0.452   0.461   0.467   0.475   0.483   0.496
ARWE-Q       0.571   0.592   0.614   0.628   0.639   0.642

Table 5: The average running times (ART) of the result merging algorithms.

Models     RRF    Borda-fuse    WBF    ARWE-IWDB    ARWE-Q
ART (ms)   148    149           153    161          167

5.2.3. Effect of the Size of the Training Set
The objective of this set of experiments is to evaluate the performance of the ARWE-IWDB and ARWE-Q models under different sizes of the training data set. We use the NDCG measure to compare the accuracy of the two ARWE merging algorithms with the baseline merging algorithm (i.e., WBF). In this experiment we vary the number of training queries from 57 (1%) to 342 (6%). In each iteration the sample set of queries is drawn at random without replacement. The average NDCG@10 values are reported in Table 4. We draw the following observations. The average NDCG@10 value of the merged list increases as the number of training queries increases. This holds true for all the merging models. While the average NDCG@10 of ARWE-Q increases by 0.071, those of ARWE-IWDB and WBF increase by 0.044 and 0.018, respectively. The reason is that as more queries are used in the training phase, more exact matches are present in the index, which improves the effectiveness of weight assignment and rank aggregation for a test query. ARWE-IWDB outperforms WBF by 20% under most conditions. ARWE-Q outperforms ARWE-IWDB by about 10%.

5.2.4. Efficiency of Various Merging Models
In this set of experiments we measure the efficiency of the merging models. We report the average running time (in milliseconds) over the 150 test queries. The timer starts after all lists of SRRs in response to a test query are retrieved (i.e., the network communication time is ignored) and ends when the merged list is generated. We ran our programs on a PC with a dual-core 2.53GHz CPU and 2GB of RAM. The average running times are reported in Table 5. The results show that the efficiencies of the ARWE models are comparable to those of the other algorithms. The ARWE algorithms take extra time when an exact match cannot be found in the index, because of the procedures for finding relevant entries and building the Condorcet graph. On average, ARWE-IWDB and ARWE-Q take 8 and 14 more milliseconds than WBF, respectively, to produce a final list for a test query. We also observe that ARWE-IWDB has a lower average running time than ARWE-Q. The reason is that ARWE-IWDB finds exact matches for 79 of the 150 test queries; that is, ARWE-IWDB finds an exact match with probability greater than 50% for a test query. ARWE-Q finds exact matches for only 12 test queries, less than 10% of the total. Hence, ARWE-Q invokes the weighting procedure more frequently than ARWE-IWDB, leading to a longer average running time. The experimental results suggest that it is worthwhile to trade some running time for improved accuracy. Moreover, we believe that the average computation time can be further reduced with a more powerful computing infrastructure.

6. Conclusion


In this work we present a result merging algorithm that estimates the relevance weights of WDBs dynamically at query time. We estimate the weights for a set of training queries and store them for future use in a custom index data structure. We propose two index data structures: one based on the Web databases involved in answering a query and one based on the queries themselves. We develop a method for identifying the set of training queries relevant to an online query and estimate the weights of the WDBs for the online query from the relevance weights of these training queries. We also develop a weight aggregation method based on the rank merging framework of the Condorcet model. We show experimentally that, by actively computing the relevance weights at query time, our method outperforms several existing result merging algorithms by significant margins. To our knowledge, our method is the first designed exclusively for structured queries on the Web.

References

[1] W. Meng, C. T. Yu, Advanced Metasearch Engine Technology, Synthesis Lectures on Data Management, Morgan & Claypool Publishers, 2010.
[2] M. Shokouhi, L. Si, Federated search, Foundations and Trends in Information Retrieval 5 (1) (2011) 1–102.
[3] A. K. Elmagarmid, P. G. Ipeirotis, V. S. Verykios, Duplicate record detection: A survey, TKDE.
[4] H. Köpcke, E. Rahm, Frameworks for entity matching: A comparison, DKE 69 (2).
[5] L. Shu, A. Chen, M. Xiong, W. Meng, Efficient spectral neighborhood blocking for entity resolution, in: ICDE, 2011.
[6] X. Li, X. L. Dong, K. B. Lyons, W. Meng, D. Srivastava, Truth finding on the deep web: Is the problem solved?, in: PVLDB, 2013.
[7] B. Zhao, B. I. P. Rubinstein, J. Gemmell, J. Han, A Bayesian approach to discovering truth from conflicting sources for data integration, PVLDB.
[8] V. Snášel, A. Abraham, S. Owais, J. Platoš, P. Krömer, Chapter 7: User Profiles Modeling in Information Retrieval Systems, Advanced Information and Knowledge Processing, Springer, 2010.
[9] J. de Borda, Mémoire sur les élections au scrutin, in: Histoire de l'Académie Royale des Sciences, 1781.
[10] J. Aslam, M. Montague, Models for metasearch, in: SIGIR, ACM, 2001, pp. 276–284.
[11] J. Madhavan, D. Ko, L. Kot, V. Ganapathy, A. Rasmussen, A. Halevy, Google's deep web crawl, PVLDB.
[12] G. V. Cormack, C. L. A. Clarke, S. Buettcher, Reciprocal rank fusion outperforms Condorcet and individual rank learning methods, in: SIGIR, 2009.
[13] L. Barbosa, J. Freire, Combining classifiers to identify online databases, in: WWW, 2007.
[14] H. A. Mahmoud, A. Aboulnaga, Schema clustering and retrieval for multi-domain pay-as-you-go data integration systems, in: SIGMOD, 2010.
[15] C. Sheng, N. Zhang, Y. Tao, X. Jin, Optimal algorithms for crawling a hidden database in the web, PVLDB.
[16] W. Liu, X. Meng, W. Meng, ViDE: A Vision-Based Approach for Deep Web Data Extraction, TKDE 22.
[17] Z. Zhang, B. He, K. Chang, Light-weight domain-based form assistant: querying web databases on the fly, in: VLDB, 2005.
[18] M. Salloum, X. L. Dong, D. Srivastava, V. J. Tsotras, Online ordering of overlapping data sources, PVLDB 7 (3) (2013) 133–144.
[19] W. Wu, C. Yu, A. Doan, W. Meng, An interactive clustering-based approach to integrating source query interfaces on the deep web, in: SIGMOD, 2004.
[20] B. He, K. Chang, Statistical schema matching across web query interfaces, in: SIGMOD, 2003.
[21] E. C. Dragut, W. Meng, C. T. Yu, Deep Web Query Interface Understanding and Integration, Morgan & Claypool Publishers, 2012.
[22] R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval: The Concepts and Technology behind Search, Addison-Wesley Professional, 2011.
[23] T. Tsikrika, M. Lalmas, Merging techniques for performing data fusion on the web, in: CIKM, 2001.
[24] Y. Lu, W. Meng, L. Shu, C. Yu, K. Liu, Evaluation of result merging strategies for metasearch engines, Web Information Systems Engineering.
[25] C. C. Vogt, G. W. Cottrell, Fusion via a linear combination of scores, Inf. Retr. 1 (3).
[26] X. Cao, G. Cong, C. S. Jensen, Retrieving top-k prestige-based relevant spatial web objects, PVLDB.
[27] L. Si, J. Callan, A semisupervised learning method to merge search engine results, TOIS 21 (4).
[28] M. Shokouhi, J. Zobel, Robust result merging using sample-based score estimates, TOIS 27 (3).
[29] D. Hong, L. Si, Mixture model with multiple centralized retrieval algorithms for result merging in federated search, in: SIGIR, 2012, pp. 821–830.
[30] M. Montague, J. Aslam, Condorcet fusion for improved retrieval, in: CIKM, 2002, pp. 538–548.
[31] Q. Hu, J. Huang, J. Miao, A robust approach to optimizing multi-source information for enhancing genomics retrieval performance, BMC Bioinfo.
[32] M. Farah, D. Vanderpooten, An outranking approach for rank aggregation in information retrieval, in: ACM SIGIR, 2007, pp. 591–598.
[33] E. C. Dragut, B. DasGupta, B. P. Beirne, A. Neyestani, B. Atassi, C. T. Yu, W. Meng, Merging query results from local search engines for georeferenced objects, TWEB 8 (4) (2014) 20:1–20:29.
[34] M. de Borda, Essai sur l'application de l'analyse à la probabilité des décisions rendues à la pluralité des voix, 1785.
[35] K. Elbassioni, S. Jelić, D. Matijević, Note: The relation of connected set cover and group Steiner tree, Theor. Comput. Sci. 438 (2012) 96–101.
[36] R. Khandekar, G. Kortsarz, Z. Nutov, Approximating fault-tolerant group-Steiner problems, Theor. Comput. Sci. 416 (2012) 55–64.
[37] V. Chvátal, A greedy heuristic for the set-covering problem, Math. of Operations Research 4 (3). doi:10.2307/3689577.
[38] I. N. R. Rao, S. S. R. Raju, On semi complete graphs, International Journal of Computational Cognition 7 (3) (2009) 50–54.
[39] T. T. Avrahami, L. Yau, L. Si, J. Callan, The FedLemur project: Federated search in the real world, JASIST.
[40] K. Järvelin, J. Kekäläinen, Cumulated gain-based evaluation of IR techniques, TOIS 20 (2002) 422–446.