Information Processing and Management 51 (2015) 306–328
Learning combination weights in data fusion using Genetic Algorithms

Kripabandhu Ghosh (a,*), Swapan Kumar Parui (a,1), Prasenjit Majumder (b,2)

(a) Indian Statistical Institute, 203 Barrackpore Trunk Road, Kolkata 700108, West Bengal, India
(b) Dhirubhai Ambani Institute of Information and Communication Technology, Near Indroda Circle, Gandhinagar 382007, Gujarat, India

Article history: Received 2 August 2013; received in revised form 26 November 2014; accepted 12 December 2014; available online 9 January 2015.

Keywords: Information retrieval; Data fusion; Linear combination; Genetic Algorithms
Abstract

Researchers have shown that a weighted linear combination in data fusion can produce better results than an unweighted combination. Many techniques have been used to determine the linear combination weights. In this work, we use the Genetic Algorithm (GA) for this purpose. The GA is not new and has been applied earlier in several other areas but, to the best of our knowledge, it has not been used for fusion of runs in information retrieval. First, we use the GA to learn the optimum fusion weights using the entire set of relevance assessments. Next, we learn the weights from the relevance assessments of the top retrieved documents only. Finally, we also learn the weights by a twofold training and testing on the queries. We test our method on the runs submitted in TREC. Our weight learning scheme, using both full and partial sets of relevance assessments, produces significant improvements over the best candidate run, CombSUM, CombMNZ, Z-Score, the linear combination method with the performance level and performance level squared weighting schemes, the multiple linear regression-based weight learning scheme, the mixture model result merging scheme, LambdaMerge, ClustFuseCombSUM and ClustFuseCombMNZ. Furthermore, we study how the correlation among the scores in the runs can be used to eliminate redundant runs in a set of runs to be fused. We observe that similar runs make similar contributions to fusion, so eliminating the redundant runs in a group of similar runs does not hurt fusion performance in any significant way.

© 2014 Elsevier Ltd. All rights reserved.
1. Introduction

Data fusion has been used as an effective tool for improving information retrieval performance. IR researchers have proposed several methods of combining two or more retrieved lists to produce a single list that places the useful documents from all the lists at higher ranks. Given a set of retrieved lists (or runs, as they are popularly referred to) and a fusion algorithm, it is vital to choose the combination weights that will yield an improvement in performance. Previous results show that weighted fusion outperforms unweighted fusion when appropriate weights are assigned to the fused runs (Wu et al., 2009). Much research has been done on learning or optimizing the fusion weights. Vogt and Cottrell (1998) tried to predict the fusion performance of run pairs using multiple linear regression based on several features. Wu et al.
* Corresponding author. Tel.: +91 8335045897; fax: +91 3325773035.
E-mail addresses: [email protected] (K. Ghosh), [email protected] (S.K. Parui), [email protected] (P. Majumder).
1 Tel.: +91 8335045897; fax: +91 3325773035.
2 Tel.: +91 9712660746; fax: +91 07930520010.
http://dx.doi.org/10.1016/j.ipm.2014.12.002
0306-4573/© 2014 Elsevier Ltd. All rights reserved.
(2009) stated that power functions can be useful in finding good weights for fusion. Bartell et al. (1994) also tried to maximize fusion performance. In the last two papers, the Conjugate Gradient method (Press et al., 1995) was applied to Guttman's Point Alienation function (Guttman, 1978). Vogt and Cottrell used the golden section search method (Press et al., 1995) for optimization.

The Genetic Algorithm (GA) has been used to find useful solutions to optimization and search problems. GAs generate solutions to optimization problems using techniques inspired by natural evolution, and can thus be used to learn the optimum fusion weights for a weighted linear combination of the retrieval scores of different runs.

Next, we study how using only the top ranked documents in the linear combination of scores affects retrieval performance. We consider runs at a given depth k per query and run the optimization algorithm on them. The optimum weights thus learned are then tested on the full runs. The idea of using the top k documents is modeled on the Multi-armed Bandit problem (Auer et al., 1995), where the gambler has no initial knowledge about the levers and tries to maximize the gain based on the knowledge of each lever acquired so far. Here we attempt to use partial knowledge about the IR performance of each run to learn the optimal fusion weights. Pal et al. (2011) found that reducing the pool size per topic does not have much effect on evaluation; we draw motivation from this observation as well.

We also study how the correlation between the scores of runs can help in removing redundant runs in data fusion. We calculate the correlation values among the run pairs. First we choose the run pair with the highest correlation between them and drop the run with the inferior retrieval performance. We fuse the remaining runs and see if there is any significant drop in fusion performance. If not, we consider the pair with the next highest correlation.
We repeat the procedure until there is a significant drop in performance. This study explores whether runs that are highly correlated in terms of retrieval scores make similar contributions to fusion performance. The contributions of the present paper are summarized below:

1. Given a set of runs, a GA-based approach to finding the optimal weights for an efficient fusion of these runs, on the basis of their retrieval scores, is proposed.
2. It is shown that if the learning of the fusion weights by the GA is based only on the top-ranked documents, there is not much loss in the efficiency of the resulting fusion. In other words, if only lower depths in the ranked pool of documents are used by the GA to learn the fusion weights, the performance of the resulting fusion is not hurt much.
3. A new approach to determining the runs that make insignificant contributions to fusion, and hence to determining the smallest subset of the runs to be fused without much loss in efficiency, is proposed. It is based on the fusion weights learnt by the GA and the correlation coefficients between pairs of runs obtained on the basis of their retrieval scores.

The rest of the paper is organized as follows: We discuss related work in Section 2. In Section 3, we describe the GA and discuss how it can be used in the present fusion problem. The experimental setup is described in Section 4. We present our experimental results and a comparative study in Section 5 and conclude in Section 6.

2. Related work

Work of different genres has been done on data fusion. Many new data fusion techniques have been proposed; on the other hand, approaches that focused on improving the existing methods have also been reported. Fox and Shaw proposed the CombMIN, CombMAX, CombSUM, CombANZ and CombMNZ algorithms based on linear combinations of scores (Fox and Shaw, 1993). Among these, CombSUM and CombMNZ have emerged as effective methods.
Lee (1997) performed an experiment on six submitted runs of TREC 3 and concluded that CombMNZ was slightly better than CombSUM. This claim, however, was contradicted in many works, such as Montague and Aslam (2001), Lillis et al. (2006) and Wu et al. (2009), and no clear inference could be drawn about the supremacy of any single approach.

Popular voting algorithms have also been used effectively in data fusion. Montague and Aslam (2002) used the Condorcet method (named after the French mathematician and philosopher Marquis de Condorcet) to good effect in data fusion; the fusion algorithm was called Condorcet fusion. Another voting algorithm, viz., Borda count (named after another French mathematician, Jean-Charles de Borda), was used by Aslam and Montague (2001) and was thereafter referred to as Borda fusion. Cormack et al. (2009) showed that Reciprocal Rank Fusion paired with learning to rank outperforms Condorcet fusion and individual rank-learning methods. Lillis et al. (2006) proposed a fusion algorithm named ProbFuse, which estimated the probability of relevance of documents based on their positions in the ranked list. Khudyak Kozorovitsky and Kurland (2011) used a document cluster-based approach (named ClustFuse) to find the retrieval scores in the fused list.

There is, however, a fundamental factor differentiating the fusion algorithms. CombSUM (and the other Comb variants), ProbFuse and ClustFuse make use of the relevance scores assigned to the documents in the input runs that are fused, while Condorcet, Borda and Reciprocal Rank fusion use the ranks of the documents in the ranked lists. For the algorithms that use relevance scores, Lee (1997) introduced a score normalization scheme for the algorithms proposed by Fox and Shaw and showed that score normalization is important in data fusion. Similarly, Savoy (2004) applied a new score normalization formula, which he called the Z-Score, to linear combination fusion algorithms.
Montague and Aslam (2001) also studied different score normalization schemes.
Many researchers have tried to improve fusion performance by optimizing the weights used in the linear combination of runs. Vogt and Cottrell (1998) tried to predict the fusion performance of run pairs using multiple linear regression based on several features. Wu et al. (2009) used power functions to find good weights for fusion. Bartell et al. (1994) also tried to maximize fusion performance using the Conjugate Gradient method (Press et al., 1995) on Guttman's Point Alienation function (Guttman, 1978). Wu (2012) used multiple linear regression to learn appropriate combination weights for data fusion. Wu (2013) used a linear discriminant analysis based approach to learning weights for Condorcet fusion. Wu and McClean (2005) used the correlation coefficient between runs to assign linear combination weights in data fusion on the submitted runs of TREC 5. Sheldon et al. (2011) used neural networks to generate different query reformulations and merge their results. Hong and Si (2012) used the Expectation Maximization algorithm to learn the combination weights.

In this paper, we learn the linear fusion weights that maximize the performance of the fused run using the GA. We learn the weights using all the documents in the runs; we also learn the weights using the performance information from only the top few documents per query. Previous researchers have learned the weights using all the documents in each run (Wu et al., 2009; Bartell et al., 1994). Finally, we do a twofold training and testing on the queries.

Correlation among runs has been used effectively by researchers in data fusion. Wu and McClean (2005) determined combination weights using the correlation of each pair of runs. They also suggested that (1) good results strongly correlate with each other, and (2) a strong correlation among the component runs (that is, the runs participating in fusion) is harmful for data fusion performance. Wu et al.
(2009) opined that the runs submitted by the same group are likely to be similar. They chose at most one run from any particular group for fusion, and there was no marked difference in performance from the case when no discrimination was made based on groups. In this paper, we have studied how redundant runs can be identified and removed from the fusion exercise without sacrificing retrieval performance in any significant way.

3. Genetic Algorithm

3.1. Introduction

Suppose f is a real-valued function defined on a bounded set A = [a1, b1] × [a2, b2] × ... × [ap, bp], where each [ai, bi] is an interval on the real line. The goal of the GA is to find a point x0 = (x01, x02, ..., x0p) in A such that f(x0) = max_{x in A} f(x). For the actual implementation of the GA, a grid is superimposed on the set A; let S be the set of grid points. S represents A in a discretised manner, and the representation becomes more accurate with a higher resolution of the grid. What the GA actually finds is a point y0 = (y01, y02, ..., y0p) in S such that f(y0) = max_{y in S} f(y). A higher resolution of the grid will take y0 closer to x0. Every grid point y in S corresponds to a binary string of length, say, L, where a larger L indicates a higher resolution of the grid.

For example, let us consider a function f defined on A = [0, 10] × [0, 10] (Fig. 1) as f(x1, x2) = exp{−(x1 − 5)² − (x2 − 5)²}. Note that the maximum value of f occurs at x0 = (5, 5). Now, let the interval corresponding to each xi be discretised on the basis of binary strings of length 3, where the binary strings 000 and 111 represent the values 0 and 10 respectively. In general, suppose M = (b1 b2 b3, b4 b5 b6) is a grid point in S, where the binary strings b1 b2 b3 and b4 b5 b6 represent the discretised values of x1 and x2 respectively (Fig. 2). Then the corresponding point in A = [0, 10] × [0, 10] is computed as follows. Let z1 and z2 be the two decimal values corresponding to the binary numbers b1 b2 b3 and b4 b5 b6 respectively.
Clearly, zi belongs to {0, 1, ..., 7}. Then (y1, y2) = (10 z1/7, 10 z2/7) is the point in A corresponding to the grid point M in S. Thus, the value of f at the grid point M = (b1 b2 b3, b4 b5 b6) is in fact the value of f at (y1, y2) in A. The task of the GA is to find the grid point at which the value of f is maximum. Note that here the number of grid points is only 64, and the maximum f value attainable at these grid points (i.e., f(y0)) will be much less than f(x0). However, if the length of the binary string used to discretise x1 and x2 is increased, the value of f(y0) will be closer to f(x0) due to the higher resolution of the grid. For example, if the length of the binary string is increased from 6 to 8, the number of grid points will increase from 64 to 256, resulting in an enhanced value of f(y0).

Fig. 1. Rectangle A = [0, 10] × [0, 10].

Fig. 2. The grid S on which the GA operates.

The GA in fact operates on the binary strings representing the grid points of S. However, it converts such a string into a point x in A whenever needed. The details are provided in the sub-sections below. An optimization approach is based on exploration, on exploitation, or on both. The GA is based on both these aspects and is hence capable of avoiding getting stuck at a local optimum. The mutation and crossover operations described below contribute to exploration and exploitation respectively. Moreover, the GA is applicable even if the function to be optimized is not differentiable, not continuous, or without a closed form. The convergence properties of the GA are discussed later in this section. The basic steps of the GA are:

3.1.1. Initialization
An initial population P1 = {y1, y2, ..., ym} of m grid points (or members) selected at random from S is formed, where m (called the population size) is an even integer. One way of making such a random selection is to select 0 or 1, each with probability 0.5, for every position in the binary string corresponding to each of the m members in P1. This population undergoes changes through several iterations (each such iteration is called a generation in GA parlance). One generation consists of the selection, crossover, mutation and elitism operations that are described below. The value of m is taken as 30 in our experiments.

3.1.2. Selection
The fitness value of a member yi of the population P1 is defined as f(yi). Let F = Σ_{i=1}^{m} f(yi). Now, a random selection of one member from P1 is made in such a way that the probability of yi being selected is f(yi)/F. Note that a member having a higher fitness value is more likely to be selected. This process is repeated m times so that a new set P2 of m members is generated from P1.
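The decoding of a member's binary string into a point of A, and the fitness evaluation used in selection, can be sketched as follows. This is our own illustration of the 3-bits-per-coordinate example above, not the authors' code:

```python
import math

def decode(bits):
    # Split a 6-bit member into two 3-bit halves; each half's integer
    # value z in {0, ..., 7} maps to the coordinate 10 * z / 7 in [0, 10].
    half = len(bits) // 2
    z1, z2 = int(bits[:half], 2), int(bits[half:], 2)
    levels = 2 ** half - 1          # 7 for 3-bit halves
    return 10.0 * z1 / levels, 10.0 * z2 / levels

def f(x1, x2):
    # The example objective from the text, maximized at (5, 5).
    return math.exp(-(x1 - 5) ** 2 - (x2 - 5) ** 2)

x1, x2 = decode("101011")           # z1 = 5, z2 = 3 -> (50/7, 30/7)
print(round(f(x1, x2), 5))          # 0.00608
```

A longer bit string per coordinate simply increases `levels`, i.e., the grid resolution.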
The two sets P1 and P2 are not necessarily the same. A member having a smaller fitness value in P1 is more likely to disappear from P2. On the other hand, a member having a higher fitness value in P1 is more likely to appear more than once in P2.

3.1.3. Crossover
A random pairing of the m members of P2 is done. Each such pair creates a pair of offspring through the crossover operation. A site (in the binary string) is first selected at random with equal probability (there are L − 1 possible sites). On the basis of this site, the two members are split into heads and tails as (head1, tail1) and (head2, tail2). The two offspring are then defined as (head1, tail2) and (head2, tail1). Each pair of members undergoes the crossover operation with probability pc. If a pair is selected for the crossover operation, a site is selected as above; otherwise, the members become their own offspring (that is, the two corresponding binary strings remain unchanged). Let P3 denote the population obtained after crossover. In our experiments, the value of pc is taken as 0.7. In Table 1, a crossover example is shown where L = 15 and the selected site is 5.

Table 1. Crossover: an illustration (/ is the crossover point).

  Member 1:  11010/0010011011      Offspring 1:  11010/1100001111
  Member 2:  11011/1100001111      Offspring 2:  11011/0010011011

3.1.4. Mutation
In the mutation operation, one of the L positions in a binary string is randomly selected with equal probability and the binary value in the selected position is flipped (that is, "0" is changed to "1" and "1" is changed to "0"). Each offspring of the crossover operation undergoes the mutation operation with probability pm. If an offspring is not selected for the mutation operation, its binary string stays unchanged. In Table 2, offspring 1 and 2 are mutated at positions 4 and 7 respectively. Let P4 denote the population obtained after the mutation operation.

Table 2. Mutation: an illustration.

  Original offspring 1:  110101100001111      Mutated offspring 1:  110001100001111
  Original offspring 2:  110110010011011      Mutated offspring 2:  110110110011011

Mutation takes care of the exploration aspect of the GA search, and a larger value of pm leads to more exploration. This value is normally kept fixed at around 0.01. In our implementation, however, we start with the high value pm = 0.2 and reduce it by a factor of 0.9 after every 25 generations. The idea is that as the algorithm progresses, the need for exploration diminishes and a lower value of pm suffices; a value that is large compared to the progress the algorithm has made wastes time on unnecessary exploration. In other words, we deliberately reduce the search space while making sure that the reduced search space still contains the global maximum.

3.1.5. Elitism
In the elitism operation, the member having the highest fitness value in the population of one generation is carried over to the population of the next generation, replacing the member having the lowest fitness value. Thus, in any generation, the population contains the member having the highest fitness value across all past generations. Note that this fittest member takes part in the selection, crossover and mutation operations and is more likely to produce members with high fitness values. Let P5 be the population obtained after the elitism operation.

3.1.6. Convergence
The role of mutation in the GA is to restore lost or unexplored genetic material to the population so as to prevent premature convergence of the GA to suboptimal solutions (Srinivas and Patnaik, 1994). Large values of the mutation probability transform the GA into a purely random search algorithm, while some mutation is required to prevent premature convergence.
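The selection, crossover, mutation and elitism operations described above can be put together as one GA generation. The following is a simplified sketch on bit-string members, not the implementation used in the experiments; all function and variable names are ours:

```python
import random

def ga_generation(pop, fitness, pc=0.7, pm=0.2, rng=random):
    """One GA generation: selection, crossover, mutation, elitism.
    pop is a list of equal-length bit strings; its size m is even."""
    best = max(pop, key=fitness)              # remembered for elitism
    # Selection: fitness-proportional (roulette-wheel) sampling, m times.
    weights = [fitness(y) for y in pop]
    selected = [rng.choices(pop, weights=weights)[0] for _ in pop]
    # Crossover: random pairing; with probability pc, swap tails at a
    # random site (there are L - 1 possible sites).
    rng.shuffle(selected)
    offspring = []
    for a, b in zip(selected[::2], selected[1::2]):
        if rng.random() < pc:
            site = rng.randrange(1, len(a))
            a, b = a[:site] + b[site:], b[:site] + a[site:]
        offspring += [a, b]
    # Mutation: with probability pm, flip one randomly chosen bit.
    mutated = []
    for s in offspring:
        if rng.random() < pm:
            i = rng.randrange(len(s))
            s = s[:i] + ('1' if s[i] == '0' else '0') + s[i + 1:]
        mutated.append(s)
    # Elitism: the best past member replaces the current worst.
    worst = min(range(len(mutated)), key=lambda i: fitness(mutated[i]))
    mutated[worst] = best
    return mutated
```

Iterating this function while decaying pm by a factor of 0.9 every 25 generations reproduces the schedule described above.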
Rudolph (1994) has described the GA as a Markov chain, and has proved that an elitist GA converges to the global maximum of the fitness value irrespective of the initial population.

We now illustrate how the above GA operations are actually implemented. For this, we consider the exponential function described at the beginning of this section. Let m, the population size, be 4. Thus, in the initialization stage, 4 grid points are randomly selected from the 64 grid points shown in Fig. 2 to form the initial population P1 of size 4. Let us assume that these 4 randomly selected grid points, representing population P1, are as shown in the first column of Table 3. Now the functional values of these 4 grid points are computed the way described at the beginning of this section; they are shown in the second column of Table 3. The selection probabilities of the 4 members of the population are computed as described in Section 3.1.2 and are shown in the third column of Table 3. Now a member from the first column is selected at random using the probabilities in the third column. Let it be "101011". This is the first entry in the fourth column, representing population P2. This random selection is repeated another three times and
Table 3. GA: an illustration. (P1: initial population; P2: after selection; P3: after crossover; P4: after mutation; P5: after elitism.)

  P1        Value of f        Prob. of selection   P2        Random pairing       P3        P4        P5        New value of f
  000001    4.010 x 10^-17    6.5 x 10^-15         101011    (101011, 100110)     101010    101010    101010    0.0001027
  101011    0.00608           0.98306              100110                         100111    100111    101011    0.00608
  010010    0.000103          0.01665              010010    (010010, 101011)     011011    011001    011001    0.0000017
  100110    0.00000173        0.00028              101011                         100010    100011    100011    0.36045
the outcomes are shown in the last three entries of the fourth column. Note that the member ("101011") in population P1 having the highest probability is selected twice in population P2, while the member ("000001") in population P1 having the lowest probability is not selected in population P2. Thus stronger members (i.e., those having higher f values) are more likely to survive the selection process. Now, the members in the fourth column are paired at random; an example of such a pairing is shown in the fifth column. Each pair is now selected for crossover with probability pc = 0.7. Suppose both the pairs in the fifth column are selected for crossover. For crossover, we have L − 1 = 5 possible sites here, and for each crossover a site is selected with probability 1/5 (see Section 3.1.3). Suppose the two sites for the two pairs are 4 and 2 respectively. The population P3 thus created after crossover is shown in the sixth column. On the basis of the mutation operation described in Section 3.1.4, suppose the fifth and sixth bits of the last two entries in the sixth column are selected for mutation. The resulting population P4 after mutation is shown in the seventh column. Now, note that the highest fitness value in population P1 corresponds to "101011", while the member having the lowest fitness value in population P4 is "100111". So, in the elitism operation, "101011" replaces "100111" in P4. Thus, the final population of the current iteration is P5, shown in the eighth column; the functional values of the members of P5 are shown in the last column. In one generation, therefore, the old population P1 (first column) is changed through the GA operations into a new population P5 (eighth column). In the next generation, the new population P5 takes the position (first column) of the old population and the whole process is repeated. Now we will discuss how the GA is implemented for our data fusion task.
The value of L depends on the number of runs to be fused. In our experiments, we use 16 bits for the weight of each run. Asymptotically, the GA converges in the sense that the strongest member of the population converges to the optimal point y0 = (y01, y02, ..., y0p) in S (Rudolph, 1994). However, we run the algorithm for finitely many generations, and the output of the algorithm is the member of the population for which the value of f is maximum. The number of generations needed for convergence of the GA depends largely on the number of runs to be fused. In our experiments, the number of generations varies from 25 to 1000.

3.2. Optimization

Let r1, r2, ..., rN be N runs on a query set Q with scores s1, s2, ..., sN respectively, for a given document and a query. Our aim here is to find a linear combination (in fact, a weighted average) of s1, s2, ..., sN that produces the maximum MAP value over all possible combinations. In other words, we would like to determine the weights w1, w2, ..., wN (0 ≤ wi ≤ 1 and Σ_{i=1}^{N} wi = 1) such that the new score s = w1 s1 + w2 s2 + ... + wN sN leads to the highest MAP. We will use the GA described above to find these optimal weights wi. However, for the GA, the search space needs to be unconstrained, while the search space A of (w1, w2, ..., wN) here is constrained in the N-dimensional space. To overcome this, we convert the constrained search space A in N dimensions to an unconstrained search space B in (N − 1) dimensions such that there is a one-to-one and onto mapping between A and B. It is clear that A corresponds to the surface of the positive quadrant of the unit N-dimensional hyper-sphere. B, which we intend to map A to, is a rectangle in the (N − 1)-dimensional space. Just as an example, consider the surface of the positive quadrant of the unit sphere and the 2-dimensional rectangle [0, π/2] × [0, π/2]. One can see that there is a one-to-one and onto mapping between them.
Formally, B and the corresponding mapping are defined as follows: B = [0, π/2] × [0, π/2] × ... × [0, π/2] ((N − 1) times), and

  w1      = sin² θ1
  w2      = cos² θ1 sin² θ2
  w3      = cos² θ1 cos² θ2 sin² θ3
  w4      = cos² θ1 cos² θ2 cos² θ3 sin² θ4
  ...
  w(N−1)  = cos² θ1 ... cos² θ(N−2) sin² θ(N−1)
  wN      = cos² θ1 ... cos² θ(N−2) cos² θ(N−1)                                  (1)
where θi ∈ [0, π/2] for all i. Let (θ1, θ2, ..., θ(N−1)) be a point in B and let (w1, w2, ..., wN) be the corresponding point in A. Suppose g = w1 s1 + w2 s2 + ... + wN sN denotes the score of the run r produced by the linear combination (fusion) of the runs r1, r2, ..., rN with respective weights w1, w2, ..., wN. Note that g is a function of (θ1, θ2, ..., θ(N−1)). Let the number of relevant documents for an information need qj ∈ Q be mj. Suppose, for the fused run r, R(j,k) = {d1, d2, ..., dk} is the set of the k top ranked retrieved results. Now let a function f(θ1, θ2, ..., θ(N−1)) be defined as

  f = MAP(g, Q) = (1/|Q|) Σ_{j=1}^{|Q|} (1/mj) Σ_{k=1}^{mj} Precision(R(j,k)),

which is the mean average precision of the run r on Q. Precision(R(j,k)) is the proportion of retrieved documents that are relevant to qj at the point in the ranked list at which we reach the document dk. Now our task is to determine the values of θ1, θ2, ..., θ(N−1) (equivalently, the values of w1, w2, ..., wN) for which f is maximum. We use the GA described above to maximize the function f(θ1, θ2, ..., θ(N−1)) over the unconstrained space B. Once the values of θ1, θ2, ..., θ(N−1) maximizing f are obtained, the values of the optimal weights w1, w2, ..., wN are obtained from Eq. (1) above. These optimal weights define the best linear combination of the runs r1, r2, ..., rN, providing the highest MAP.
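Eq. (1) can be implemented directly. The sketch below (ours; names are illustrative) maps a point (θ1, ..., θ(N−1)) of B to the weights (w1, ..., wN), which by construction lie in [0, 1] and sum to 1:

```python
import math

def angles_to_weights(thetas):
    # thetas: (theta_1, ..., theta_{N-1}), each in [0, pi/2].
    # Returns (w_1, ..., w_N) following Eq. (1); the running product
    # cos^2(theta_1) ... cos^2(theta_{i-1}) telescopes, so the sum is 1.
    weights, cos_prod = [], 1.0
    for t in thetas:
        weights.append(cos_prod * math.sin(t) ** 2)
        cos_prod *= math.cos(t) ** 2
    weights.append(cos_prod)    # w_N has no sin^2 factor
    return weights

w = angles_to_weights([0.3, 1.0, 0.7])   # N = 4 runs
print(abs(sum(w) - 1.0) < 1e-12)         # True: a valid weighted average
```

The GA can therefore search the rectangle B freely, and every candidate member decodes to a legal set of fusion weights.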
4. Experimental setup

4.1. Using the top-ranked documents in fusion

Pal et al. (2011) found that if pooling is done by choosing lower depths for each query, the quality of evaluation is not hurt. So, the top documents can be a good representative of the quality of the run. This is likely to be more prominent when top-heavy measures like Mean Average Precision are used for evaluation. In this paper, we study how the documents at lower depths can be used in learning the fusion weights by the GA. Table 4 shows the correlation coefficients between the MAPs of the fused runs at depth 1000 and the MAPs of the same runs at lower depths, viz., 100, 50 and 25. The MAPs even at lower depths show high correlation with the MAP values at depth 1000. This led us to believe that using only the documents at shallower depths may be sufficient for achieving good fusion runs by the GA.

Let r1, r2, ..., rN be N runs which we want to fuse with optimum weights. For the GA, the objective function is the MAP of the fused run; that is, we want to maximize the MAP of the fused run. For each query qj, we select the top d documents of each run. Let these new runs be r1^d, r2^d, ..., rN^d. Fig. 3 gives a pictorial view of the situation. It shows, for a given query, N runs with documents at ranks 1 to 1000. dk_l is the document of run k at rank l. For example, d1_1, d1_2, ..., d1_1000 are the documents of run 1 at ranks 1, 2, ..., 1000 respectively. Now if we consider the top d = 25 documents of run 1, the corresponding variant of run 1, say r1^25, will contain the documents d1_1, d1_2, ..., d1_25, and the documents at lower ranks will not be considered. We fuse these new runs with some weights w1^d, w2^d, ..., wN^d to get a fused run, say, f^d. This can be visualized in Fig. 4. In this figure, K is the total number of documents in the fused run. Since each of the N runs contributes d documents to the fused run, the fused run will contain at most N·d unique documents; the inequality arises because some documents may appear in more than one run. If, for a given query, each of the N runs has d documents, then the fused run will contain a ranked union of all the documents in the runs participating in fusion. This fused run f^d is nothing but a ranked pool of the top d documents (per query) in r1, r2, ..., rN. Let the optimal weights (for which the MAP of f^d is maximized) learned by the GA be w1,opt^d, w2,opt^d, ..., wN,opt^d. These optimal weights are then tested on the runs r1, r2, ..., rN at depth 1000 (note that training was done using depth d, which is much less than 1000). We will see if these learned weights yield good retrieval performance.
Table 4. Correlation of MAP at 1000 documents with MAPs at lower depths.

  Dataset   Top 100    Top 50     Top 25
  TREC 3    0.599034   0.577695   0.582920
  TREC 5    0.985229   0.970289   0.935517
  TREC 6    0.993314   0.983439   0.956707
  TREC 7    0.978372   0.959592   0.932662
Fig. 3. Runs with varying depths.
Fig. 4. Fused run. K is the size of the union of the N sets each having d documents.
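The construction of the fused run f^d above can be sketched as follows. This is our own illustration, assuming each run is given, per query, as a ranked list of (document, normalized score) pairs:

```python
def fuse_top_d(runs, weights, d):
    # Truncate each run to its top-d documents per query, then fuse the
    # truncated runs by a weighted sum of scores; a document absent from
    # a run contributes 0 there.
    fused = {}
    queries = set().union(*(run.keys() for run in runs))
    for q in queries:
        pooled = {}
        for w, run in zip(weights, runs):
            for doc, score in run.get(q, [])[:d]:    # keep only top d
                pooled[doc] = pooled.get(doc, 0.0) + w * score
        # Rank the pooled documents (at most N * d of them) by fused score.
        fused[q] = sorted(pooled.items(), key=lambda item: -item[1])
    return fused

run1 = {"q1": [("a", 1.0), ("b", 0.5), ("c", 0.2)]}
run2 = {"q1": [("b", 1.0), ("c", 0.8)]}
order = [doc for doc, _ in fuse_top_d([run1, run2], [0.6, 0.4], d=2)["q1"]]
print(order)   # ['b', 'a', 'c']
```

In the training phase the GA evaluates the MAP of such a fused pool for each candidate weight vector; the learned weights are then applied to the full runs at depth 1000.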
4.2. Score normalization

Let r_1, r_2, ..., r_N be N runs on a query q. Suppose, for the sake of simplicity, that d is a document appearing in each of the N runs, and let s_1, s_2, ..., s_N be the similarity scores of d with q in the N runs. A run is the retrieval result of a search system that follows a certain retrieval algorithm. This algorithm varies widely over the runs, and different runs are likely to assign different ranges of scores to the retrieved documents. For example, in run r_1 the similarity scores may range from 43.085 to 447.169, whereas in run r_2 the range can be 0.56 to 0.99. Clearly, these values cannot be combined unless they are mapped to a common range, say [0, 1]. So we use the following normalization scheme proposed by Lee (1997):

normalized_score = (score − min_score) / (max_score − min_score)

So a score s_1, for a document d and run r_1, is normalized to, say, s_1^norm using the above scheme, where min_score and max_score are the minimum and maximum scores, respectively, among all the documents in the run for the given query q. Note that this scheme maps all the scores in a run, for a given query, to the range [0, 1]: the maximum score is mapped to 1, the minimum score is mapped to 0, and the intermediate values are adjusted accordingly. We perform this normalization for each run and get a set of normalized scores for document d (for query q): s_1^norm, s_2^norm, ..., s_N^norm. Then we perform a weighted combination (say, g) of the scores as g = w_1 s_1^norm + w_2 s_2^norm + ... + w_N s_N^norm. So g is the score of d in the fused run for a given query q. We determine the weights w_1, w_2, ..., w_N using the GA such that, over all the documents in all the N runs, the g scores lead to optimum performance of the fused run. For documents that appear in fewer than N runs for a given query, the scores in the runs where they do not appear are set to zero.

4.3. Correlation between retrieval scores of a pair of runs

Let r_1 and r_2 be two runs and Q = {q_1, q_2, ..., q_M} be a query set. Let D = {d_1, d_2, ..., d_n} be the set of documents appearing in at least one of the two runs for at least one query. Consider a query q from Q. For each document d_i we have an ordered pair of scores (s_1i, s_2i), where s_1i is the score of d_i in run r_1 and s_2i is its score in run r_2. We assign a zero score to a document in a run where it does not appear. For example, if document d appears in run r_1 with score s and is absent from run r_2, then the pair of scores for d is (s, 0). Thus, for each query, we have n pairs of scores (s_1i, s_2i), and over all the queries in Q we have Mn such pairs. We remove the (0, 0) pairs, i.e., pairs where both scores are zero.
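The normalization and weighted-combination steps of Section 4.2, together with the per-query score pairing of Section 4.3, can be sketched as follows. This is a minimal per-query illustration under our own assumptions (runs as dicts mapping document ids to scores); the function names are ours, and the paper itself pools pairs over all M queries before computing the correlation.

```python
import math

def min_max_normalize(run):
    """Map one run's scores for a query onto [0, 1] (Lee, 1997)."""
    lo, hi = min(run.values()), max(run.values())
    if hi == lo:                     # degenerate case: all scores equal
        return {doc: 1.0 for doc in run}
    return {doc: (s - lo) / (hi - lo) for doc, s in run.items()}

def weighted_fusion(runs, weights):
    """g = w1*s1_norm + ... + wN*sN_norm; a document absent from a run
    contributes a zero score in that run."""
    fused = {}
    for w, run in zip(weights, runs):
        for doc, s in min_max_normalize(run).items():
            fused[doc] = fused.get(doc, 0.0) + w * s
    return fused

def score_correlation(run1, run2):
    """Pearson correlation between the scores two runs assign to the
    documents of one query; (0, 0) pairs are dropped (Section 4.3)."""
    docs = set(run1) | set(run2)
    pairs = [(run1.get(d, 0.0), run2.get(d, 0.0)) for d in docs]
    pairs = [p for p in pairs if p != (0.0, 0.0)]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = math.sqrt(sum((x - mx) ** 2 for x, _ in pairs))
    sy = math.sqrt(sum((y - my) ** 2 for _, y in pairs))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0
```

Two identical runs thus yield a correlation of 1, and runs that rank the shared documents in opposite score order yield a correlation near −1.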
On the basis of the remaining pairs, we compute the correlation coefficient between the scores obtained by r_1 and r_2.

5. Experimental results

We run the GA on the submitted Ad Hoc runs of TREC 3, 5, 6 and 7, available at http://trec.nist.gov/results.html. The full list of the runs used in our experiments is given in Table 17 towards the end of the paper; the table lists the runs in decreasing order of MAP. The total numbers of submitted runs for TREC 3, 5, 6 and 7 are 40, 61, 74 and 103, respectively. We used the simple GA code implemented at the Kanpur Genetic Algorithms Laboratory, available at http://www.iitk.ac.in/kangal/codes.shtml. We compared the GA with the following baselines:

CombSUM, CombMNZ and Z-Score: These are unsupervised parameter-free methods (Fox and Shaw, 1993; Savoy, 2004).

Linear combination method with performance level (LC) and performance level squared (LC2): These are supervised parameter-free methods (Wu et al., 2009).

Multiple linear regression based method (RegFuse): This is a supervised method (Wu, 2012). For this algorithm, the logistic model parameters a and b are trained on all the documents for all the queries in Section 5.1, on the top-ranked documents for all the queries in Section 5.2, and on disjoint query sets in Section 5.3.

Mixture Model (MixModel): This is also a supervised method (Hong and Si, 2012). Here, we chose K, the number of latent groups, as 3.
Lambda Merge (LambdaMerge): This is another supervised method (Sheldon et al., 2011). For LambdaMerge, we optimized MAP and used all the ''Query-document features'' and ''Gating features''. However, gating features like IsRewrite, RewriteScore, RewriteRank, Overlap@N and RewriteLen had little significance, since our original query and the reformulated query were the same. In other words, we did not do any query reformulation and used the LambdaMerge scheme as a baseline data fusion method.

ClustFuse variants ClustFuseCombSUM and ClustFuseCombMNZ: These are unsupervised baselines (Khudyak Kozorovitsky and Kurland, 2011). In that paper, the authors considered only the top d highest-ranked documents in a retrieved list for fusion, where the chosen d values were 20, 30, 40 and 50, and they fused only three runs at a time. These restrictions were adopted to cope with the high resource complexity of their algorithm. So we ran our experiment with the top 10 runs only (a larger number of runs takes a very long computing time) and chose d = 50. For the clustering used in ClustFuse, as mentioned in the paper, we used the K-nearest-neighbour algorithm with K = 10. The ClustFuse method incorporates a single free parameter, λ, whose value is chosen from the set {0, 0.1, ..., 1}.

5.1. Genetic algorithm on all documents

First we consider d = 1000 and the top 30 runs in terms of MAP. The results are shown in Table 5, which shows that GA produces numerical improvements in MAP, P@5 and P@10 over the best component run as well as the unweighted linear combination methods (CombSUM, CombMNZ and Z-Score), the linear combination method with performance level (LC) and performance level squared (LC2), multiple linear regression RegFuse, the mixture model for result merging MixModel, and LambdaMerge.
These pairwise differences are statistically significant at the 95% confidence level (p-value < 0.05) by the Wilcoxon signed-rank test (Siegel, 1956). In other words, the performance improvement (in terms of MAP, P@5 and P@10) achieved by GA over all the other runs shown in Table 5 is statistically significant. For comparison with the ClustFuse variants ClustFuseCombSUM and ClustFuseCombMNZ, however, we had to make the adjustments already discussed when we described the baselines. Table 6 shows the comparisons of GA with the two ClustFuse variants; note that all the reported values are obtained on the basis of the top 50 documents in a run. We can see that GA outperforms both ClustFuseCombSUM and ClustFuseCombMNZ in MAP, P@5 and P@10, and these pairwise differences are statistically significant at the 95% confidence level (p-value < 0.05) by the Wilcoxon signed-rank test.

5.2. Genetic algorithm using top-ranked documents

Here the training was done at depths d = 100, 50 and 25, while the testing was done at depth d = 1000. Tables 7 and 8 show the results. For all the depths and top 30 runs, GA produced numerical improvements in MAP, P@5 and P@10 over

Table 5. MAPs with GA on all documents. The best performance is indicated by boldfaced figures.

             TREC 3                  TREC 5                  TREC 6                  TREC 7
             MAP     P@5     P@10    MAP     P@5     P@10    MAP     P@5     P@10    MAP     P@5     P@10
Best comp    0.4226  0.7440  0.7220  0.3165  0.5572  0.5660  0.4491  0.6680  0.6650  0.3702  0.6920  0.6940
CombSUM      0.3193  0.6320  0.5940  0.3561  0.6120  0.5420  0.3993  0.6600  0.5560  0.4168  0.7400  0.6860
CombMNZ      0.3050  0.5560  0.5440  0.3132  0.5040  0.4400  0.3534  0.6240  0.5260  0.3843  0.6760  0.6120
Z-Score      0.4243  0.7320  0.7400  0.3613  0.6080  0.5420  0.4097  0.6640  0.5620  0.4012  0.7400  0.6700
LC           0.3169  0.6400  0.6000  0.3605  0.6200  0.5400  0.4354  0.6800  0.5880  0.4305  0.7480  0.7040
LC2          0.3153  0.6120  0.5980  0.3626  0.6160  0.5480  0.4793  0.7240  0.6220  0.4477  0.7600  0.7060
RegFuse      0.3249  0.6360  0.6060  0.3733  0.6360  0.5760  0.4343  0.6880  0.5880  0.4394  0.7560  0.7020
MixModel     0.3189  0.6360  0.5980  0.3573  0.6160  0.5460  0.4100  0.6665  0.5580  0.4190  0.7460  0.6800
LambdaMerge  0.3050  0.6250  0.5880  0.3460  0.6050  0.5360  0.3890  0.6550  0.5880  0.4120  0.7360  0.6720
GA           0.4736  0.8080  0.7700  0.3840  0.6833  0.6360  0.5483  0.7720  0.6980  0.5470  0.8050  0.7580
Table 6. MAPs with GA on all documents (ClustFuse). The best performance is indicated by boldfaced figures.

                  TREC 3                  TREC 5                  TREC 6                  TREC 7
                  MAP     P@5     P@10    MAP     P@5     P@10    MAP     P@5     P@10    MAP     P@5     P@10
Best comp         0.1794  0.7440  0.7220  0.2190  0.6360  0.5660  0.3474  0.6760  0.6540  0.2560  0.6920  0.6940
ClustFuseCombSUM  0.1929  0.3930  0.5090  0.2093  0.4470  0.4023  0.2920  0.4650  0.4365  0.2935  0.5320  0.5873
ClustFuseCombMNZ  0.2169  0.6000  0.6170  0.2124  0.4410  0.4082  0.2950  0.4760  0.4410  0.2840  0.5412  0.5654
GA                0.3006  0.7680  0.7930  0.3262  0.6960  0.6269  0.4550  0.7240  0.6800  0.4572  0.8280  0.9000
Table 7. MAPs with GA using top-ranked documents. The best performance is indicated by boldfaced figures.

             TREC 3                  TREC 5                  TREC 6                  TREC 7
             MAP     P@5     P@10    MAP     P@5     P@10    MAP     P@5     P@10    MAP     P@5     P@10
Best comp    0.4226  0.7440  0.7220  0.3165  0.6360  0.5660  0.4491  0.6680  0.6540  0.3702  0.6920  0.6940
CombSUM      0.3193  0.6320  0.5940  0.3561  0.6120  0.5420  0.3993  0.6600  0.5560  0.4168  0.7400  0.6860
CombMNZ      0.3050  0.5560  0.5440  0.3132  0.5040  0.4400  0.3534  0.6240  0.5260  0.3843  0.6760  0.6120
Z-Score      0.4243  0.7320  0.7400  0.3613  0.6080  0.5420  0.4097  0.6640  0.5620  0.4012  0.7400  0.6700
LC           0.3169  0.6400  0.6000  0.3605  0.6200  0.5400  0.4354  0.6800  0.5880  0.4305  0.7480  0.7040
LC2          0.3153  0.6120  0.5980  0.3626  0.6160  0.5480  0.4793  0.7240  0.6220  0.4477  0.7600  0.7060
RegFuse      0.3249  0.6360  0.6060  0.3733  0.6360  0.5760  0.4343  0.6880  0.5880  0.4394  0.7560  0.7020
MixModel     0.3189  0.6360  0.5980  0.3573  0.6160  0.5460  0.4100  0.6665  0.5580  0.4190  0.7460  0.6800
LambdaMerge  0.3050  0.6250  0.5880  0.3460  0.6050  0.5360  0.3890  0.6550  0.5880  0.4120  0.7360  0.6720
GA           0.4736  0.8080  0.7700  0.3840  0.6833  0.6360  0.5483  0.7720  0.6980  0.5470  0.8050  0.7580
GA@100       0.4796  0.8200  0.7720  0.3910  0.6960  0.6360  0.5483  0.7720  0.6980  0.5462  0.8040  0.7590
GA@50        0.4782  0.8160  0.7840  0.3817  0.6960  0.6140  0.5372  0.7640  0.6800  0.5324  0.7960  0.7580
GA@25        0.4776  0.8120  0.7700  0.3819  0.6794  0.6200  0.5439  0.7520  0.6920  0.5305  0.8048  0.7340
Table 8. MAPs with GA using top-ranked documents (ClustFuse). The best performance is indicated by boldfaced figures.

                  TREC 3                  TREC 5                  TREC 6                  TREC 7
                  MAP     P@5     P@10    MAP     P@5     P@10    MAP     P@5     P@10    MAP     P@5     P@10
Best comp         0.1794  0.7440  0.7220  0.2190  0.6360  0.5660  0.3474  0.6760  0.6540  0.2560  0.6920  0.6940
ClustFuseCombSUM  0.1929  0.3930  0.5090  0.2093  0.4470  0.4023  0.2920  0.4650  0.4365  0.2935  0.5320  0.5873
ClustFuseCombMNZ  0.2169  0.6000  0.6170  0.2124  0.4410  0.4082  0.2950  0.4760  0.4410  0.2840  0.5412  0.5654
GA                0.3006  0.7680  0.7930  0.3262  0.6960  0.6269  0.4550  0.7240  0.6800  0.4572  0.8280  0.9000
GA@50             0.3015  0.7850  0.7990  0.3180  0.7152  0.6180  0.4500  0.7180  0.6750  0.4430  0.8370  0.9230
GA@25             0.2998  0.7820  0.7920  0.3170  0.7080  0.6150  0.4460  0.7180  0.7280  0.4435  0.8158  0.9120
the best component run as well as CombSUM, CombMNZ, Z-Score, LC, LC2, RegFuse, MixModel and LambdaMerge, and all these pairwise differences are statistically significant at the 95% confidence level (p-value < 0.05) by the Wilcoxon signed-rank test. For the top 10 runs, this experiment was done for comparison with the ClustFuse variants; note that here the experiment was done for d = 50 and 25. Table 8 shows that GA outperforms both ClustFuseCombSUM and ClustFuseCombMNZ in MAP, P@5 and P@10, and these pairwise differences are statistically significant at the 95% confidence level (p-value < 0.05) by the Wilcoxon signed-rank test.

It is evident that the ranked pool produced by fusion achieves retrieval performance that is at least as good as when all the documents are used for optimization. Also, in almost all cases, the number of documents in the fused run (as shown in Fig. 4) is much less than 1000 when only the top documents are used. This indicates that the ranked pool formed out of the top documents contains a rich collection of useful documents that consistently produces encouraging retrieval performance. Table 9 shows that the average number of documents per query in the fused run varies greatly over datasets and over different depths: it can be as high as 1000 (TREC 6, top 100) or as low as 78 (TREC 7, top 25).

We now consider a more resource-constrained situation. We take only the top d (= 100, 50, 25) documents per query in the fused run, so that the relevance assessments of only the top d documents per query are used for finding the MAP of the fused run. Tables 10 and 11 show the results. We see a slight drop in performance, since a smaller amount of ground truth is used for evaluation.
Nevertheless, for TREC 3, 6 and 7, the performance in MAP, P@5 and P@10 using the top d relevance assessments remains significantly better than the best component run as well as CombSUM, CombMNZ, Z-Score, LC, LC2, RegFuse, MixModel and LambdaMerge at the 95% confidence level (p-value < 0.05) by the Wilcoxon signed-rank test. For TREC 5, however, the MAP values for GA@100rel, GA@50rel and GA@25rel are numerically comparable with the RegFuse MAP value, and these pairwise differences are not statistically significant at the 95% confidence level (p-value > 0.05). The TREC 5 GA@100rel, GA@50rel and GA@25rel MAP values are nevertheless numerically better than all the remaining baselines, viz., the best component run as well as CombSUM, CombMNZ, Z-Score, LC, LC2, MixModel and LambdaMerge, and these pairwise differences are statistically significant (p-value < 0.05). For P@5 and P@10, the TREC 5 GA@100rel, GA@50rel and GA@25rel values are numerically better than all the baselines, and these pairwise differences are statistically significant (p-value < 0.05) by the Wilcoxon signed-rank test. Table 11 shows that each of GA@100rel, GA@50rel and GA@25rel is better than ClustFuseCombSUM and ClustFuseCombMNZ in terms of MAP,
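The evaluation measures used throughout (MAP, P@5, P@10), including the variant where only the top d positions are scored against the partial qrels, can be sketched roughly as follows. This is an illustrative reading under our own assumptions, not the official trec_eval tool; the names are ours.

```python
def average_precision(ranking, relevant, depth=None):
    """AP of one ranked list; with `depth` set, only the top `depth`
    positions are scored, mimicking evaluation with top-d qrels only."""
    if depth is not None:
        ranking = ranking[:depth]
    hits, total = 0, 0.0
    for i, doc in enumerate(ranking, 1):
        if doc in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def precision_at_k(ranking, relevant, k):
    """P@k: fraction of the top k retrieved documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def mean_average_precision(rankings, qrels, depth=None):
    """MAP over queries: rankings[i] is evaluated against qrels[i]."""
    return sum(average_precision(r, q, depth)
               for r, q in zip(rankings, qrels)) / len(rankings)
```

Truncating the scored depth is one simple way to model the "top d qrels only" setting; it explains the slight performance drop noted above, since relevant documents below depth d no longer contribute.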
Table 9. Average number of documents per query in the fused run.

Dataset   Top 100   Top 50    Top 25
TREC 3    697.2     387.32    208.18
TREC 5    839       615.5     331
TREC 6    1000      552       291
TREC 7    349       175       78
Table 10. MAPs with GA using only a top few qrels. The best performance is indicated by boldfaced figures.

             TREC 3                  TREC 5                  TREC 6                  TREC 7
             MAP     P@5     P@10    MAP     P@5     P@10    MAP     P@5     P@10    MAP     P@5     P@10
Best comp    0.4226  0.7440  0.7220  0.3165  0.6360  0.5660  0.4491  0.6680  0.6540  0.3702  0.6920  0.6940
CombSUM      0.3193  0.6320  0.5940  0.3561  0.6120  0.5420  0.3993  0.6600  0.5560  0.4168  0.7400  0.6860
CombMNZ      0.3050  0.5560  0.5440  0.3132  0.5040  0.4400  0.3534  0.6240  0.5260  0.3843  0.6760  0.6120
Z-Score      0.4243  0.7320  0.7400  0.3613  0.6080  0.5420  0.4097  0.6640  0.5620  0.4012  0.7400  0.6700
LC           0.3169  0.6400  0.6000  0.3605  0.6200  0.5400  0.4354  0.6800  0.5880  0.4305  0.7480  0.7040
LC2          0.3153  0.6120  0.5980  0.3626  0.6160  0.5480  0.4793  0.7240  0.6220  0.4477  0.7600  0.7060
RegFuse      0.3249  0.6360  0.6060  0.3733  0.6360  0.5760  0.4343  0.6880  0.5880  0.4394  0.7560  0.7020
MixModel     0.3189  0.6360  0.5980  0.3573  0.6160  0.5460  0.4100  0.6665  0.5580  0.4190  0.7460  0.6800
LambdaMerge  0.3050  0.6250  0.5880  0.3460  0.6050  0.5360  0.3890  0.6550  0.5880  0.4120  0.7360  0.6720
GA           0.4736  0.8080  0.7700  0.3840  0.6833  0.6360  0.5483  0.7720  0.6980  0.5470  0.8050  0.7580
GA@100rel    0.4767  0.8200  0.7780  0.3715  0.6920  0.6080  0.5473  0.7680  0.6920  0.5358  0.7880  0.7422
GA@50rel     0.4762  0.8240  0.7740  0.3716  0.6960  0.6120  0.5449  0.7640  0.6980  0.5456  0.8200  0.7660
GA@25rel     0.4699  0.8160  0.7760  0.3747  0.7040  0.6100  0.5371  0.7800  0.7020  0.5404  0.8080  0.7540
Table 11. MAPs with GA using only a top few qrels (ClustFuse). The best performance is indicated by boldfaced figures.

                  TREC 3                  TREC 5                  TREC 6                  TREC 7
                  MAP     P@5     P@10    MAP     P@5     P@10    MAP     P@5     P@10    MAP     P@5     P@10
Best comp         0.1794  0.7440  0.7220  0.2190  0.6360  0.5660  0.3474  0.6760  0.6540  0.2560  0.6920  0.6940
ClustFuseCombSUM  0.1929  0.3930  0.5090  0.2093  0.4470  0.4023  0.2920  0.4650  0.4365  0.2935  0.5320  0.5873
ClustFuseCombMNZ  0.2169  0.6000  0.6170  0.2124  0.4410  0.4082  0.2950  0.4760  0.4410  0.2840  0.5412  0.5654
GA                0.3006  0.7680  0.7930  0.3262  0.6960  0.6269  0.4550  0.7240  0.6800  0.4572  0.8280  0.9000
GA@100rel         0.3040  0.7850  0.7980  0.3290  0.7120  0.6275  0.4570  0.7260  0.6780  0.4570  0.8260  0.9050
GA@50rel          0.3030  0.7780  0.8120  0.3258  0.7122  0.6180  0.4480  0.7180  0.6750  0.4420  0.8170  0.8980
GA@25rel          0.3010  0.7700  0.7910  0.3250  0.6650  0.6100  0.4463  0.7180  0.6715  0.4415  0.8058  0.8740
Table 12. MAPs with GA using train-test. The best performance is indicated by boldfaced figures.

             TREC 3                  TREC 5                  TREC 6                  TREC 7
             MAP     P@5     P@10    MAP     P@5     P@10    MAP     P@5     P@10    MAP     P@5     P@10
Best comp    0.4226  0.7440  0.7220  0.3165  0.6360  0.5660  0.4491  0.6680  0.6540  0.3702  0.6920  0.6940
CombSUM      0.3193  0.6320  0.5940  0.3561  0.6120  0.5420  0.3993  0.6600  0.5560  0.4168  0.7400  0.6860
CombMNZ      0.3050  0.5560  0.5440  0.3132  0.5040  0.4400  0.3534  0.6240  0.5260  0.3843  0.6760  0.6120
Z-Score      0.4243  0.7320  0.7400  0.3613  0.6080  0.5420  0.4097  0.6640  0.5620  0.4012  0.7400  0.6700
LC           0.3145  0.6390  0.6100  0.3590  0.6100  0.5350  0.4323  0.6710  0.5790  0.4265  0.7370  0.7020
LC2          0.3124  0.6110  0.5880  0.3614  0.6120  0.5490  0.4735  0.7250  0.6200  0.4445  0.7560  0.7020
RegFuse      0.3225  0.6360  0.6060  0.3699  0.6280  0.5760  0.4358  0.6880  0.5880  0.4400  0.7560  0.7020
MixModel     0.3167  0.6365  0.5970  0.3570  0.6210  0.5430  0.4110  0.6650  0.5560  0.4170  0.7410  0.6720
LambdaMerge  0.3030  0.6230  0.5885  0.3580  0.6070  0.5330  0.3890  0.6530  0.5790  0.4020  0.7390  0.6700
GA           0.4725  0.8060  0.7710  0.3780  0.6850  0.6340  0.5512  0.7680  0.6970  0.5403  0.7950  0.7500
Table 13. MAPs with GA using train-test (ClustFuse). The best performance is indicated by boldfaced figures.

                  TREC 3                  TREC 5                  TREC 6                  TREC 7
                  MAP     P@5     P@10    MAP     P@5     P@10    MAP     P@5     P@10    MAP     P@5     P@10
Best comp         0.1794  0.7440  0.7220  0.2190  0.6360  0.5660  0.3474  0.6760  0.6540  0.2560  0.6920  0.6940
ClustFuseCombSUM  0.1929  0.3930  0.5090  0.2093  0.4470  0.4023  0.2920  0.4650  0.4365  0.2935  0.5320  0.5873
ClustFuseCombMNZ  0.2169  0.6000  0.6170  0.2124  0.4410  0.4082  0.2950  0.4760  0.4410  0.2840  0.5412  0.5654
GA                0.3001  0.7660  0.7930  0.3253  0.6930  0.6213  0.4512  0.7220  0.6749  0.4556  0.8260  0.8976
Fig. 5. TREC 3: MAP.
Fig. 6. TREC 3: P@5.
Fig. 7. TREC 5: MAP.
Fig. 8. TREC 5: P@5.
P@5 and P@10, and these pairwise differences are statistically significant at the 95% confidence level (p-value < 0.05) by the Wilcoxon signed-rank test.

5.3. Learning GA weights using disjoint query sets

In the previous sections, we learned the optimal weights using either all the relevance judgments or the relevance judgments of the top-ranked documents; however, training was done using all the queries. In this sub-section, we divide the queries into two disjoint groups: even-numbered queries and odd-numbered queries. First, we train on the even-numbered queries only and test the learned optimal weights on the odd-numbered queries. Let the optimal MAP thus obtained be MAPTrainOnEvenTestOnOdd. Then, we swap the train and test sets: we train on the odd-numbered queries and test on the even-numbered queries to get the optimal MAP MAPTrainOnOddTestOnEven. Finally, we take
Fig. 9. TREC 6: MAP.
Fig. 10. TREC 6: P@5.
an average of the MAPTrainOnEvenTestOnOdd and MAPTrainOnOddTestOnEven values to get MAPTrainTestAverage. We report the MAPTrainTestAverage values in Tables 12 and 13; the P@5 and P@10 values reported there are obtained in the same way. Note that, apart from GA, the same train-test procedure was followed for LC, LC2, RegFuse, MixModel and LambdaMerge. The remaining methods are unsupervised, so the same values are reported for them as in the previous tables. The values are calculated for the top 10 performing runs in the case of the ClustFuse variants and the top 30 runs for all the remaining methods. We see that GA outperforms, in MAP, P@5 and P@10, the best component run as well as CombSUM, CombMNZ, Z-Score, LC, LC2, RegFuse, MixModel, LambdaMerge, ClustFuseCombSUM and ClustFuseCombMNZ, and all these pairwise differences are statistically significant at the 95% confidence level (p-value < 0.05) by the Wilcoxon signed-rank test.
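The twofold procedure above can be sketched as follows. This is a minimal illustration with our own assumed interfaces: `train_fn(queries)` stands for learning the fusion weights on a query subset, and `test_fn(weights, queries)` for evaluating them (e.g., returning MAP).

```python
def twofold_evaluate(query_ids, train_fn, test_fn):
    """Twofold scheme of Section 5.3: train on even-numbered queries,
    test on odd-numbered ones, swap, and average the two test scores."""
    even = [q for q in query_ids if q % 2 == 0]
    odd = [q for q in query_ids if q % 2 == 1]
    score_odd = test_fn(train_fn(even), odd)    # MAPTrainOnEvenTestOnOdd
    score_even = test_fn(train_fn(odd), even)   # MAPTrainOnOddTestOnEven
    return (score_odd + score_even) / 2.0       # MAPTrainTestAverage
```

The same skeleton applies unchanged to P@5 and P@10 by swapping the measure computed inside `test_fn`.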
Fig. 11. TREC 7: MAP.
Fig. 12. TREC 7: P@5.
5.4. Learning GA weights on randomly chosen runs

In the previous subsections, we obtained the evaluation measures for the top performing Ad Hoc runs submitted in TREC 3, 5, 6 and 7. In this subsection, for each year, we choose runs randomly from all the submitted Ad Hoc runs of that year and test our method with different numbers of runs (instead of 30 and 10) fused at a time. We choose k runs at random from all the submitted runs, where k = 2, 3, 5, 10, 20 and 30, and for each value of k the random selection is repeated 12 times. For a particular k value, the experiments were performed using the train-test procedure discussed in the last subsection. We then average the 12 values of each evaluation measure. Figs. 5 and 6 show the MAP and P@5 values for TREC 3 at different values of k (number of runs) for GA against the best component, CombSUM, CombMNZ, Z-Score, LC, LC2, RegFuse, MixModel and LambdaMerge. Figs. 7–12 depict the MAP and P@5 values for TREC 5, 6 and 7. In MAP, for each randomly chosen set, GA outperforms the best component, CombSUM,
Fig. 13. TREC 3 Clustfuse: MAP.
Fig. 14. TREC 3 Clustfuse: P@5.
Fig. 15. TREC 5 Clustfuse: MAP.
CombMNZ, Z-Score, LC, LC2, RegFuse, MixModel and LambdaMerge for all four years, and all these pairwise differences are statistically significant at the 95% confidence level (p-value < 0.05) by the Wilcoxon signed-rank test. In P@5, for three sets of size 3 (k = 3) in TREC 7, the RegFuse, LC and LC2 values are numerically comparable with GA, and these pairwise differences are not statistically significant at the 95% confidence level (p-value > 0.05). For all the remaining sets of
Fig. 16. TREC 5 Clustfuse: P@5.
Fig. 17. TREC 6 Clustfuse: MAP.
Fig. 18. TREC 6 Clustfuse: P@5.
runs, for all values of k and all four years, GA outperforms in P@5 the best component, CombSUM, CombMNZ, Z-Score, LC, LC2, RegFuse, MixModel and LambdaMerge, and all these pairwise differences are statistically significant at the 95% confidence level (p-value < 0.05) by the Wilcoxon signed-rank test. Figs. 13 and 14 show the MAP and P@5 comparisons of GA with ClustFuseCombSUM and ClustFuseCombMNZ for TREC 3; Figs. 15–20 show the same for TREC 5, 6 and 7. Note that here k = 2, 3, 5 and 10 due to the high resource requirements of ClustFuse.
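The sampling protocol of this subsection can be sketched as follows. This is an illustrative sketch under our own assumptions: `evaluate` is a placeholder for the full fuse-train-test pipeline that maps a list of runs to a score such as MAP.

```python
import random

def sampled_fusion_eval(all_runs, evaluate, ks=(2, 3, 5, 10, 20, 30),
                        repeats=12, seed=0):
    """For each k, draw k runs at random `repeats` times, evaluate each
    fused sample, and average the scores (Section 5.4)."""
    rng = random.Random(seed)
    averages = {}
    for k in ks:
        scores = [evaluate(rng.sample(all_runs, k)) for _ in range(repeats)]
        averages[k] = sum(scores) / repeats
    return averages
```

Fixing the seed makes the 12 random selections reproducible across the competing fusion methods, so every method is evaluated on the same sampled run sets.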
Fig. 19. TREC 7 Clustfuse: MAP.
Fig. 20. TREC 7 Clustfuse: P@5.
For all four years, GA is numerically better in MAP and P@5 than ClustFuseCombSUM and ClustFuseCombMNZ, and all these pairwise differences are statistically significant at the 95% confidence level (p-value < 0.05) by the Wilcoxon signed-rank test.

5.5. Time requirements of GA

The GA is a time-consuming method. The high time requirement can be ascribed to the extensive search that GA performs over the solution space to find the optimal performance. However, the time requirement can be reduced by using fewer generations. Fig. 21 shows the MAP values of GA for TREC 3, 5, 6 and 7 when the number of generations is reduced from 200 to 150, 100, 75, 50 and 25. For each year, we measured the time required when the maximum number of runs is considered for fusion; in our experiments, this number is 30. These runs are chosen randomly and the MAPs are obtained with the train-test procedure. We see that there is no notable drop in performance when the number of generations is reduced. This holds for all four years of data, and the pairwise differences over the different numbers of generations are not statistically significant at the 95% confidence level (p-value > 0.05) by the Wilcoxon signed-rank test. The performance achieved by GA for 200, 150, 100, 75, 50 and 25 generations remains statistically significantly better than all the baselines at the 95% confidence level (p-value < 0.05). Reducing the number of generations, however, lowers the time requirement considerably. Table 14 shows the time requirements of GA for different numbers of generations along with those of CombSUM, CombMNZ, Z-Score, LC, LC2, RegFuse, MixModel and LambdaMerge. The time was measured on an Intel Core i5 machine with 5 GB RAM; the values are averages over TREC 3, 5, 6 and 7.
We see that although the time taken by GA is very high when the number of generations is 200, it can be reduced considerably when fewer generations are used. For fewer than 100 generations, GA is less time-intensive than MixModel and LambdaMerge. Table 15 compares GA with ClustFuseCombSUM and ClustFuseCombMNZ in terms of time requirement. Here, 10 runs are fused and the top 50
Fig. 21. GA MAPs over generations.
Table 14. Time requirements.

Method          Time
CombSUM         12 s
CombMNZ         15 s
Z-Score         10 s
LC              5.5 min
LC2             6.5 min
RegFuse         9.5 min
MixModel        16 h
LambdaMerge     7 h 30 min
GA (200 gens)   15.75 h
GA (150 gens)   11.8 h
GA (100 gens)   8.17 h
GA (75 gens)    6.13 h
GA (50 gens)    4.1 h
GA (25 gens)    2.04 h
Table 15. Time requirements (ClustFuse).

Method            Time
ClustFuseCombSUM  9 h
ClustFuseCombMNZ  8.6 h
GA (200 gens)     33.5 min
Table 16. MAPs of top runs using GA.

Dataset   Top 30 MAP   Top 10 MAP
TREC 3    0.4736       0.4845
TREC 5    0.3840       0.3749
TREC 6    0.5483       0.5540
TREC 7    0.5462       0.5484
documents per query are considered for each run; GA was run for 200 generations. We see that GA is superior to both ClustFuseCombSUM and ClustFuseCombMNZ in terms of time requirement even when the number of generations is not reduced. So GA is not the most expensive algorithm in terms of computation time: it betters MixModel and LambdaMerge at a lower number of generations, and also the ClustFuse variants even when the number of generations is
not compromised. For the remaining methods, the higher time requirement can be justified by the consistent superiority of GA in terms of the standard evaluation measures.

5.6. Removal of redundant runs on the basis of MAP values

In the fusion task described above, we considered all 30 runs and achieved significantly better performance. Now we would like to see whether all the runs are indeed necessary for a fusion with enhanced performance. For this we
Table 17. Runs used for fusion (sorted in decreasing order of MAP).

TREC 3
input.inq102,input.citya1,input.brkly7,input.inq101, input.assctv2,input.assctv1,input.crnlea,input.citya2,input.crnlla, input.westp1,input.vtc2s2,input.pircs1,input.eth002,input.vtc5s2, input.pircs2,input.brkly6,input.eth001,input.nyuir2,input.nyuir1, input.topic4,input.clarta,input.dortd2,input.citri1,input.dortd1, input.lsia0mw2,input.rutfua1,input.lsia0mf,input.rutfua2,input.clartm, input.xerox3,input.siems1,input.citri2,input.erima1,input.siems2, input.padre2,input.xerox4,input.padre1,input.acqnt1,input.virtu1,input.topic3
TREC 5
input.ETHme1,input.uwgcx1,input.uwgcx0,input.LNmFull2, input.Cor5M2rf,input.LNmFull1,input.genrl3,input.LNaDesc2,input.LNaDesc1, input.genrl4,input.Cor5M1le,input.CLCLUS,input.CLTHES,input.pircsAAL, input.ETHal1,input.brkly17,input.pircsAM2,input.anu5man4,input.city96a1, input.gmu96ma1,input.gmu96ma2,input.Cor5A2cr,input.brkly16,input.Cor5A1se, input.vtwnB1,input.ibmge2,input.genrl2,input.INQ301,input.brkly18, input.colm4,input.vtwnA1,input.anu5man6,input.DCU961,input.pircsAM1, input.pircsAAS,input.DCU963,input.mds001,input.city96a2,input.ETHas1, input.colm1,input.ibms96b,input.ibms96a,input.fsclt4,input.anu5aut2, input.anu5aut1,input.mds003,input.genrl1,input.ibmge1,input.brkly15, input.fsclt3,input.DCU962,input.DCU964,input.gmu96au2,input.mds002, input.gmu96au1,input.KUSG2,input.KUSG3,input.ibmgd1,input.erliA1,input.ibmgd2,input.INQ302
TREC 6
input.uwmt6a0,input.CLAUG,input.CLREL,input.anu6min1, input.city6at,input.LNmShort,input.gerua1,input.anu6alo1, input.aiatB1,input.gerua2,input.pirc7At,input.iss97man, input.ibmg97b,input.uwmt6a1,input.Mercure3,input.Mercure1, input.att97as,input.LNaVryShort,input.city6al,input.pirc7Aa, input.Brkly23,input.Cor6A3cll,input.city6ad,input.gmu97ma1, input.mds602,input.iss97vs,input.Brkly22,input.LNaShort, input.uwmt6a2,input.att97ac,input.att97ae,input.Cor6A2qtcs, input.DCU97vs,input.Cor6A1cls,input.ibmg97a,input.ibms97a, input.glair61,input.Mercure2,input.VrtyAH6a,input.fsclt6r, input.fsclt6,input.anu6ash1,input.gmu97au1,input.mds603, input.aiatA1,input.INQ402,input.glair62,input.umcpa197, input.gmu97au2,input.VrtyAH6b,input.pirc7Ad,input.Brkly21, input.mds601,input.INQ401,input.csiro97a3,input.csiro97a1, input.gerua3,input.csiro97a2,input.DCU97snt,input.nsasg1, input.iss97s,input.fsclt6t,input.harris1,input.nsasg2, input.DCU97lt,input.glair64,input.gmu97ma2,input.DCU97lnt, input.nmsu2,input.ispa2,input.nmsu1,input.jalbse,input.jalbse0,input.ispa1
TREC 7
input.CLARIT98COMB,input.t7miti1,input.uwmt7a2,input.CLARIT98CLUS, input.CLARIT98RANK,input.iit98ma1,input.ok7ax,input.uwmt7a1,input.att98atdc, input.att98atde,input.Brkly26,input.ok7am,input.INQ502,input.mds98td,input.bbn1, input.tno7exp1,input.uoftimgr,input.INQ501,input.pirc8Aa2,input.LNmanual7, input.Cor7A3rrf,input.acsys7mi,input.acsys7al,input.ok7as,input.nectitechdes, input.nectitechall,input.Cor7A2rrd,input.pirc8Ad,input.INQ503,input.tno7tw4, input.att98atc,input.uoftimgu,input.pirc8At,input.harris1,input.LNaTitDesc7, input.LNaTit7,input.ibms98a,input.Cor7A1clt,input.tno7cbm25,input.FLab7ad, input.MerAdRbtnd,input.iowacuhk2,input.MerTetAdtnd,input.iowacuhk1,input.mds98t, input.mds98t2,input.FLab7at,input.nttdata7Al2,input.ibms98b,input.ibms98c,input.acsys7as, input.nttdata7Al0,input.FLab7atE,input.nttdata7At1,input.Brkly25,input.gersh2,input.fsclt7m, input.ibmg98b,input.iit98au1,input.MerAdRbtd,input.gersh1,input.nectitech,input.uwmt7a0, input.Brkly24,input.ibmg98a,input.APL985LC,input.LIArel2,input.gersh3,input.ETHAC0, input.unc7aal2,input.LIAClass,input.ETHAB0,input.ETHAR0,input.APL985L,input.nsasgrp4, input.LIAshort2,input.unc7aal1,input.ibmg98c,input.iit98au2,input.umd98a1,input.nsasgrp3, input.fub98a,input.fub98b,input.AntHoc01,input.APL985SC,input.ic98san4,input.nthu1,input.ic98san3, input.nthu2,input.fsclt7a,input.jalbse011,input.jalbse012,input.nthu3,input.jalbse013,input.lanl981, input.umd98a2,input.kslsV1,input.ScaiTrec7,input.KD71010s,input.KD70000,input.dsir07a02,input.KD71010q,input.dsir07a01
consider only the 10 runs having the 10 highest MAP values. We then separately run GA on these 10 runs and obtain the optimum fused run for each dataset. The MAP values of these fused runs are shown in Table 16. It is seen that removing the other 20 runs does not hurt the performance of the fused run in any significant way. Though theoretically the MAP value achieved by the fusion of 10 runs cannot be higher than that achieved by the fusion of 30 runs, it is in fact higher for the TREC 3, 6 and 7 data. This is because of the iterative nature of GA, which generates a solution that is close to the global optimum of an objective function, but not necessarily the global optimum itself. The lesson learnt here is that not all the runs are necessarily required for an effective fusion. Next we will see whether the number of runs to be fused can be reduced further.
5.7. Removal of redundant runs using correlation between retrieval scores

In the previous sections, we fused the runs and determined their optimal weights irrespective of how similar the runs are. There may be runs that are similar to each other in terms of the correlation coefficient between the retrieval scores (discussed in Section 4.3) of the same set of documents. Such similar runs add similar information regarding the relative superiority of a document common to them. Here we explore whether we can remove this redundant information by keeping only one run from each group of similar runs. Let S be the set of M runs (M = 10 in the present case). Our algorithm is as follows:

1. Calculate the correlation coefficient between the retrieval scores of each pair of runs in S. Let MAP1 be the MAP value of the fused run obtained by GA on the basis of the runs in S.
2. Take the pair of runs in S with the highest correlation coefficient. Drop the run in the pair with the lower MAP value from the set S. Let the new set be S'.
3. Run GA on the runs in S' and let MAP2 be the MAP value of the new fused run.
4. If MAP2 is lower than MAP1 by x% or more, stop. Otherwise set S = S' and MAP1 = MAP2, and go to Step 2.
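The four steps above can be sketched as a greedy loop. This is a minimal illustration under our own assumptions: `corr(a, b)` stands for the pairwise score correlation, `map_of(a)` for a run's own MAP, and `fused_map(runs)` for re-running the GA fusion on a run set and returning its MAP; runs are identified by hashable ids.

```python
def eliminate_redundant_runs(run_ids, corr, map_of, fused_map, x=0.03):
    """Greedy elimination of Section 5.7: repeatedly drop the lower-MAP
    run of the most correlated pair until fused MAP falls by x or more."""
    current = list(run_ids)
    map1 = fused_map(current)                 # Step 1
    while len(current) > 2:
        # Step 2: most correlated pair still in the set
        pairs = [(a, b) for i, a in enumerate(current)
                 for b in current[i + 1:]]
        a, b = max(pairs, key=lambda p: corr(*p))
        drop = a if map_of(a) < map_of(b) else b
        candidate = [r for r in current if r != drop]
        map2 = fused_map(candidate)           # Step 3: re-fuse without it
        if map2 < map1 * (1 - x):             # Step 4: drop of x or more
            break
        current, map1 = candidate, map2
    return current
```

With x = 0.03 this mirrors the 3% tolerance used in the experiments; the loop drops one run per iteration, so the pair with the next highest correlation is only considered once both its runs are confirmed to remain in the set.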
Fig. 22. Trend in reduction in MAP values when redundant runs are removed from fusion. (Line plot: x-axis, number of runs fused, from 10 down to 2; y-axis, MAP, approx. 0.36–0.56; one curve each for TREC 3, 5, 6 and 7.)
Fig. 23. Weights on runs. (Line plot: x-axis, top runs; y-axis, fusion weights, 0–0.6; one curve each for TREC 3, 5, 6 and 7.)
The algorithm is designed to drop one run at a time. Note that the pair with the next highest correlation is considered only if both of its runs remain in the current set S′. The threshold x measures the tolerance for the drop in performance that results when a run is barred from participating in the fusion. We chose x as 3%. The experiment was done on the top 10 runs of TREC 3, 5, 6 and 7. Fig. 22 shows the trend. We see that there is no serious drop in performance when some of the similar runs do not participate in fusion. However, the performance drop crosses our tolerance level at 3, 2, 4 and 5 runs for TREC 3, 5, 6 and 7 respectively. So, the optimum number of runs chosen by our algorithm is 4, 3, 5 and 6 for TREC 3, 5, 6 and 7 respectively. The MAP values produced by the optimum sets are significantly better than the best component run and all the baselines at the 95% confidence level (p-value < 0.05) by the Wilcoxon signed-rank test.

5.8. GA as a natural selector of the superior runs

Given a large set of runs, GA naturally selects the runs that contribute most in a fusion setup. High fusion weights are automatically assigned to the runs that have more relevant documents near the top and hence contribute more to the fused run. These runs are naturally the best runs in the set. Thus natural selection, through crossover and mutation, plays its part in identifying the superior candidates in a linear combination of runs. Fig. 23 shows the weights assigned to the top runs by GA. The highest fusion weights are assigned to the best candidate runs across the four TREC datasets.

6. Conclusion

We have proposed a Genetic Algorithm based approach to determining an efficient fusion of a set of existing runs. We show that considerable improvement can be obtained in fusion performance even when only partial relevance assessments are used for determining the optimal weights.
The proposed approach also identifies the redundant runs that can be removed from fusion without much harm. No ad hoc or manual settings are needed, either for selecting a subset of runs for fusion or for the fusion weights of the selected runs. All these parameters are learnt automatically and efficiently through optimization by the Genetic Algorithm.