Expert Systems with Applications 38 (2011) 12839–12844
Expert Systems with Applications journal homepage: www.elsevier.com/locate/eswa
Transductive learning to rank using association rules
Yan Pan a,⇑, Haixia Luo b, Hongrui Qi b, Yong Tang c
a School of Software, Sun Yat-sen University, Guangzhou 510275, China
b Department of Computer Science, Sun Yat-sen University, Guangzhou 510275, China
c Department of Computer Science, South China Normal University, Guangzhou 510631, China
Article info
Keywords: Information retrieval; Learning to rank; Transductive learning; Association rules; Loss function; Ranking SVM
Abstract
Learning to rank, the task of learning ranking functions that sort a set of entities using machine learning techniques, has recently attracted much interest in information retrieval and machine learning research. However, most existing work adopts a supervised learning setting. In this paper, we propose a transductive method which extracts paired preference information from the unlabeled test data. We then design a loss function that incorporates this preference data together with the labeled training data, and learn ranking functions by optimizing the loss function via a derived Ranking SVM framework. Experimental results on the LETOR 2.0 benchmark data collections show that our transductive method can significantly outperform the state-of-the-art supervised baseline.
© 2011 Elsevier Ltd. All rights reserved.
⇑ Corresponding author. Address: School of Software, Sun Yat-sen University, No. 135, XinGangXi Road, Guangzhou 510275, China. Tel.: +86 13631408696. E-mail addresses: [email protected], [email protected] (Y. Pan).
0957-4174/$ - see front matter © 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2011.04.076

1. Introduction
Learning to rank, which aims to learn ranking functions that properly sort a set of entities/documents by machine learning techniques, has received more and more attention in information retrieval and machine learning research (Burges et al., 2005; Cao, Qin, Liu, Tsai, & Li, 2007; Freund, Iyer, Schapire, & Singer, 2003; Joachims, 2002; Li, Burges, & Wu, 2007; Taylor, Guiver, Robertson, & Minka, 2008; Yue, Finley, Radlinski, & Joachims, 2007). Most research in learning to rank uses supervised learning settings, in which a ranking function is learned from training data that consists of queries, their corresponding retrieved documents and relevance levels annotated by humans.

Generally speaking, it is difficult to get large amounts of labeled data for training because annotation by experts is usually time-consuming and expensive. Supervised learning approaches for ranking may therefore fail when only a small number of training examples is available. Unlabeled data, such as query/document pairs without relevance judgments, can be an alternative source of training data which is relatively easy to get (e.g. from query logs (Joachims, 2002)).

Semi-supervised learning with both labeled and unlabeled data has been widely studied on classification and regression problems (Zhu, 2005). Recently, there have been some studies on semi-supervised (inductive or transductive) learning for document ranking (Amini, Truong, & Goutte, 2008; Duh & Kirchhoff, 2008). Precisely, considering the transductive case as an example, one is given a labeled training set $L = \{(q_i, D_i, Y_i)\}_{i=1,2,\ldots,n}$ and a test set $T = \{(q_i, D_i)\}_{i=n+1,n+2,\ldots,n+m}$, in which $q_i$ represents a query, $D_i$ represents the list of corresponding retrieved documents, and $Y_i$ is the list of manually annotated relevance judgments. The task of transductive learning is to learn a ranking function which sorts the examples in the test set, with both $L$ and $T$ available for training. In the supervised learning setting, by contrast, only $L$ is used in training. Semi-supervised learning to rank is a relatively new research direction and worth more exploration.

In this paper, we take a transductive approach to learning to rank. We argue that transductive learning for ranking can be expected to outperform the supervised case because additional data (the unlabeled test set) is used in training. One of the great challenges in the transductive case is how to extract reliable information (e.g. pair-wise preference data) from the unlabeled test set, since this is not as easy as with labeled data. Here we first use association rules to obtain pair-wise preference information from unlabeled data. Then we derive an optimization framework by extending Ranking SVM for transductive ranking, and incorporate the extracted preference data into the learning process. Experimental results show that our transductive learning approach significantly outperforms supervised baselines such as Ranking SVM.

The rest of this paper is organized as follows. We briefly review previous research in Section 2. In Section 3, we present our transductive learning approach. First, we describe the general framework. Then we describe how we use association rules to extract ordered pair relationships from unlabeled data in Section 3.1. After that, we present the loss function for our transductive learning framework based on pair-wise preference in Section 3.2. Moreover, we revisit the Ranking SVM model and illustrate our derived optimization framework for the transductive case in
Section 3.3. The experimental results are presented in Section 4. Finally, Section 5 concludes the paper with some remarks on future directions.

2. Related work
2.1. Learning to rank
In recent years many machine learning techniques have been studied for the task of learning to rank (Burges et al., 2005; Freund et al., 2003; Joachims, 2002; Nallapati, 2004). These approaches seek to train ranking functions by combining many kinds of features, and most of them work in a supervised learning setting.

The problem of learning to rank can be viewed as a classification task which aims to correctly classify the partial order relationships within instance pairs. Ranking SVM (Herbrich, Graepel, & Obermayer, 1999; Joachims, 2002), RankBoost (Freund et al., 2003) and RankNet (Burges et al., 2005) are typical examples in this category. Ranking SVM uses a maximum margin optimization framework similar to the traditional SVM (Vapnik, Golowich, & Smola, 1997) to minimize the number of incorrectly ordered document pairs. Several extensions of Ranking SVM have also been proposed to enhance the ranking performance (Cao et al., 2006; Qin, Zhang, Wang, Xiong, & Li, 2007). RankBoost is a boosting algorithm for ranking using pair-wise preference data. Its main advantages are that it is easy to implement and easy to adapt to run in parallel. RankNet is another well-known algorithm, which uses a neural network for ranking and cross entropy as its loss function.

In this paper, we also take a pair-wise approach and use preference data of document pairs in training.

2.2. Semi-supervised learning for ranking
Semi-supervised learning, which is an active area in machine learning research, addresses the learning problem by using unlabeled data, which is commonly easy to obtain, together with labeled data.
Since semi-supervised learning requires less human effort to collect training data and has the potential to provide higher accuracy, it has been widely studied for classification and regression problems (Zhu, 2005). However, there is not much previous work on semi-supervised learning for ranking, except Amini et al. (2008) and Duh and Kirchhoff (2008). Amini, Truong and Goutte presented a boosting-based inductive learning algorithm which learns a bipartite ranking function from partially labeled data. Duh and Kirchhoff proposed a framework for transductive learning of ranking functions which exploits unlabeled test data to improve ranking performance. Their approach uses an unsupervised learning algorithm to derive additional features adapted to the unlabeled test data; the training set is then projected onto the directions of the new features to obtain a better training set. In our paper, we also explore a transductive learning approach, but it differs from the work in Amini et al. (2008) and Duh and Kirchhoff (2008): instead of deriving new features, we extract pair-wise preference data from the test set and incorporate it into the training objective.

3. Transductive learning to rank
First, we fix the notation for the problem of transductive ranking. Let $L = \{(q_i, D_i, Y_i)\}_{i=1,2,\ldots,n}$ be the labeled training set and $T = \{(q_i, D_i)\}_{i=n+1,n+2,\ldots,n+m}$ be the test set, in which $q_i$ represents a query, $D_i$ represents the list of corresponding retrieved documents, and $Y_i$ is the list of manually annotated relevance judgments. For each query $q_i$ ($i = 1, 2, \ldots, n+m$), let $D_i = \{x_{i1}, x_{i2}, \ldots, x_{ik}\}$ be the documents to be sorted, in which $x_{ij}$ ($j = 1, 2, \ldots, k$) represents the feature vector of the $j$th document instance in $D_i$. The goal of transductive learning is to learn a ranking function $h$ to order document instances using both $L$ and $T$ during training.

Given two document instances $x_i$ and $x_j$ within a specific query, the notations $\succ$, $\prec$ and $=$ represent three ordered relations: (1) $x_i \succ x_j$ means that $x_i$ is more relevant than $x_j$; (2) $x_i \prec x_j$ means that $x_i$ is less relevant than $x_j$; (3) $x_i = x_j$ means that $x_i$ and $x_j$ have equal relevance judgments. When $x_i$ and $x_j$ have the same relevance judgment, also known as a tie, the order of $x_i$ and $x_j$ does not have any effect on the ranking scores with respect to the commonly used evaluation measures (e.g. MAP and NDCG). Therefore, we ignore ties hereafter. Let $P = \{(x_i, x_j) \mid x_i \succ x_j\}$ represent the paired preference information collected from the test set, where $x_i$ and $x_j$ are document instances which belong to the same query in the test set.

Following the Empirical Risk Minimization Principle, we need to find a loss function to quantify how well the function $h$ can predict rankings. The key idea of our approach is to collect some reliable paired preference information from the unlabeled test set by a learning algorithm (e.g. association rules inference, classification or regression), and then incorporate this information into the loss function to measure the effectiveness of the ranking function's prediction. Precisely, we define a loss function (see Section 3.2) by measuring the number of incorrectly ordered pairs in both the labeled data set $L$ and the preference data set $P$ (which is collected from the test set $T$) with respect to the ranking predicted by the function $h$. For clarity, the skeleton of our transductive learning algorithm is as follows:

Algorithm 1
Input: training data $L = \{(q_i, D_i, Y_i)\}_{i=1,2,\ldots,n}$ and test set $T = \{(q_i, D_i)\}_{i=n+1,n+2,\ldots,n+m}$
    EXTRACT(): a learning algorithm to extract reliable paired preference data from an example of $T$
    LEARN(): a pair-wise supervised learning algorithm for ranking
Output: ranking prediction for the test set $T = \{(q_i, D_i)\}_{i=n+1,n+2,\ldots,n+m}$
For $t = n+1, n+2, \ldots, n+m$ do
    // Collect a set of reliable paired preference data from the example $\langle q_t, D_t \rangle \in T$
    $P_t$ = EXTRACT($L$, $\langle q_t, D_t \rangle$)
End For
$P = P_{n+1} \cup P_{n+2} \cup \cdots \cup P_{n+m}$
// Learn a ranking function $h$ that minimizes the loss function defined in Section 3.2
// with the training set $L$ and the collected paired preference data $P$
$h$ = LEARN($L$, $P$)
For $t = n+1, n+2, \ldots, n+m$ do
    // Predict the ranking of the example $\langle q_t, D_t \rangle$
    $Y_t = h(\langle q_t, D_t \rangle)$
End For
In this paper, we use pair-wise association rules inference as the implementation of the EXTRACT() algorithm to discover ordered document pairs from the test data, and we derive an algorithm from Ranking SVM as the LEARN() algorithm. In principle, any learning algorithm can be used as EXTRACT and LEARN. Two issues are crucial: (1) the EXTRACT algorithm should find reliable paired preference information (e.g. by selecting pairs according to confidence or probability); (2) the LEARN algorithm needs a loss function that incorporates the paired information collected from the test examples together with the ordered document pairs in the training set. We will show that the derived Ranking SVM with pair-wise association rules inference is an effective combination; other combinations may also work well.
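The skeleton of Algorithm 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `transductive_rank`, the representation of preference data as (preferred, less relevant) tuples, and the two toy stand-ins `toy_extract` and `toy_learn` are hypothetical and are not the association-rule or Ranking SVM components used in this paper.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Tuple[float, ...]
Pair = Tuple[Vector, Vector]   # (preferred document, less relevant document)
Query = List[Vector]           # feature vectors of the documents of one query


def transductive_rank(
    labeled_pairs: List[Pair],
    test_queries: Sequence[Query],
    extract: Callable[[List[Pair], Query], List[Pair]],
    learn: Callable[[List[Pair], List[Pair]], Callable[[Vector], float]],
) -> List[List[int]]:
    """For each test query, return document indices sorted by predicted relevance."""
    # Step 1: collect preference pairs P_t from every unlabeled test query.
    extracted: List[Pair] = []
    for docs in test_queries:
        extracted.extend(extract(labeled_pairs, docs))
    # Step 2: learn one scoring function h from the labeled pairs and P.
    h = learn(labeled_pairs, extracted)
    # Step 3: rank the documents of each test query by descending score.
    return [sorted(range(len(docs)), key=lambda j: -h(docs[j])) for docs in test_queries]


def toy_extract(labeled: List[Pair], docs: Query) -> List[Pair]:
    # Hypothetical stand-in: keep only pairs whose first feature differs by more than 0.5.
    return [(a, b) for a in docs for b in docs if a[0] - b[0] > 0.5]


def toy_learn(labeled: List[Pair], extracted: List[Pair]) -> Callable[[Vector], float]:
    # Hypothetical stand-in: weight each feature by its mean difference over all pairs.
    pairs = labeled + extracted
    dim = len(pairs[0][0])
    w = [sum(a[k] - b[k] for a, b in pairs) / len(pairs) for k in range(dim)]
    return lambda x: sum(wk * xk for wk, xk in zip(w, x))


labeled = [((1.0, 0.0), (0.0, 0.0))]
queries = [[(0.1, 0.0), (0.9, 0.0), (0.5, 0.0)]]
ranking = transductive_rank(labeled, queries, toy_extract, toy_learn)
```

Passing EXTRACT and LEARN as arguments mirrors the remark that other combinations of the two components can be plugged into the same framework.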
3.1. Pair-wise association rules inference
Association rule learning (Agrawal, Imielinski, & Swami, 1993) is a well-studied method for discovering interesting relations between variables/features in large datasets. Recently there have been some studies on using association rules for ranking (Veloso, Almeida, Goncalves, & Wagner, 2008). In this subsection, we propose a different method to extract paired preference information from test data by using association rules.

Let $x_i = (f_{i1}, f_{i2}, \ldots, f_{is})$ and $x_j = (f_{j1}, f_{j2}, \ldots, f_{js})$ be two retrieved document instances of a query $q$ in the test data, where the components $f_{i1}, \ldots, f_{is}$ are query-document features. As a pair-wise approach, we construct a new training set $L_1$ and test set $T_1$ composed of document pairs. Specifically, given any document pair $(x_i, x_j)$ within the same query, we construct the new training set as $L_1 = \{\langle k_{ij}, r \rangle \mid k_{ij} = x_i - x_j,\ r \in \{-1, 1\}\}$ and the new test set as $T_1 = \{k_{ij} \mid k_{ij} = x_i - x_j\}$, where $r$ represents the preference relation of the document pair $\langle x_i, x_j \rangle$, $1$ and $-1$ stand for the relations $\succ$ and $\prec$ respectively, and $x_i$ and $x_j$ belong to the same query. We use $k_{ij}$ to represent the pair generated by $\langle x_i, x_j \rangle$ hereafter.

Following common practice in association rule learning, we use the new training set to generate a model composed of a set of rules of the form $f_1 \cap f_2 \cap \cdots \cap f_k \rightarrow \{1, -1\}$, each of which represents an association between a set of features $f_1, f_2, \ldots, f_k$ and one of the two preference relations $\succ$ and $\prec$. For simplicity, a rule can be rewritten as $F \rightarrow r$, where $F$ denotes the mixture of features and $r$ denotes the preference relation.

To select interesting rules from all possible rules, two well-known measures, support and confidence, are widely used in association rule learning. In our case, the support of $F \rightarrow r$, written $Supp(F \rightarrow r)$ or $Supp(F \cap r)$, is the proportion of pairs in the newly generated training set that contain both the set of features $F$ and the preference relation $r$. The confidence is defined as $Conf(F \rightarrow r) = Supp(F \cap r) / Supp(F)$, which denotes the conditional probability of $r$ given the evidence $F$. All selected association rules in the generated model must satisfy a minimum support and a minimum confidence at the same time. By tuning these two thresholds, we can select the most frequent rules, which ensure a strong implication between a mixture of features $F$ and the preference $r$, to compose the model.

Many efficient algorithms have been proposed for mining association rules with support and confidence. In this paper, we use CLOSET+ (Wang, Han, & Pei, 2003) to generate association rules from the new training set. Since the feature values of document pairs are continuous, we discretize the training set before generating rules.

After extracting the rules, we need to infer the preference of the document pairs in the test set. Before applying the rules to the test pairs, we discretize the test pairs as well. Inspired by the idea in Veloso et al. (2008), we adopt a multiple-rule voting strategy. Let $R$ denote the set of rules we obtain, and let $R_k$ denote the rules in $R$ which are applicable to the pair $k$. We then use the rules in $R_k$ to estimate the preference relation of $k$. Precisely, to estimate the association between a document pair from the test set and a preference $r_i$ ($r_i \in \{1, -1\}$), we sum and then average the weights of the applicable rules $F \rightarrow r_i$, where the confidence value of each rule is used as its weight. The combination can be interpreted as a vote in which each rule is a voter with its confidence as importance weight. For a given pair $k$, the score for a specific preference relation $r_i$ is computed as follows:

\[
Score(k, r_i) = \frac{\sum_{F \rightarrow r_i \in R_k} Conf(F \rightarrow r_i)}{|R_k^{r_i}|} \tag{1}
\]

where $|R_k^{r_i}|$ is the number of rules in $R_k$ that predict $r_i$. Given a pair $k$, if $Score(k, 1) > Score(k, -1)$, then the preference relation of $k$ is assigned $1$, and $-1$ otherwise. We also define the final score of $k$ as

\[
Score(k) = \max\big(Score(k, 1),\ Score(k, -1)\big) \tag{2}
\]

After that, we sort the pairs by their final scores, on the assumption that a higher final score implies more confidence in the assigned preference relation. When two pairs have the same final score, we sort them by the number of applicable rules for their assigned preference relations. Finally, we obtain a sorted list of pairs and select the top k% of pairs as required. The selected pairs and their assigned preference relations are then used as additional pair-wise data in our transductive learning approach.

For clarity, the procedure of extracting paired preference data by association rule learning can be summarized as follows.
1. Construct a new training/test set by pairing every two document instances within the same query.
2. Discretize the new training/test set.
3. Generate association rules from the new training set with appropriate minimum support and minimum confidence.
4. Use the generated rules to estimate the final scores and preference relations of all the pairs in the test set.
5. Sort the pairs by their final scores (or by the number of applicable rules when the final scores are equal).
6. Select the top k% of pairs as required.
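Steps 4 to 6 of this procedure (voting with Eqs. (1) and (2), sorting, and top-percentage selection) can be sketched as below. The sketch assumes the rules have already been mined and that each test pair has already been discretized into a set of string items; the `Rule` type, the item strings such as "bm25_diff=high", and the function names are hypothetical, and keeping at least one pair is an added assumption.

```python
from typing import FrozenSet, List, NamedTuple, Tuple


class Rule(NamedTuple):
    items: FrozenSet[str]   # discretized feature items F
    relation: int           # predicted preference r in {1, -1}
    confidence: float       # Conf(F -> r)


def score(pair_items: FrozenSet[str], rules: List[Rule], r: int) -> Tuple[float, int]:
    """Eq. (1): mean confidence of the applicable rules predicting r, plus their count."""
    confs = [ru.confidence for ru in rules
             if ru.relation == r and ru.items <= pair_items]
    return (sum(confs) / len(confs) if confs else 0.0), len(confs)


def vote(pair_items: FrozenSet[str], rules: List[Rule]) -> Tuple[float, int, int]:
    """Return (final score, number of supporting rules, assigned relation) for one pair."""
    pos, n_pos = score(pair_items, rules, 1)
    neg, n_neg = score(pair_items, rules, -1)
    # Eq. (2): the final score is the larger one; relation 1 only if strictly larger.
    return (pos, n_pos, 1) if pos > neg else (neg, n_neg, -1)


def select_top(pairs: List[Tuple[str, FrozenSet[str]]], rules: List[Rule],
               percent: int) -> List[Tuple[str, int]]:
    """Keep the top `percent`% of pairs by final score, ties broken by rule count."""
    voted = [(vote(items, rules), name) for name, items in pairs]
    voted.sort(key=lambda v: (v[0][0], v[0][1]), reverse=True)
    keep = max(1, len(voted) * percent // 100)   # assumption: keep at least one pair
    return [(name, rel) for (_, _, rel), name in voted[:keep]]


rules = [
    Rule(frozenset({"bm25_diff=high"}), 1, 0.9),
    Rule(frozenset({"bm25_diff=high", "tf_diff=high"}), 1, 0.8),
    Rule(frozenset({"bm25_diff=low"}), -1, 0.95),
    Rule(frozenset({"tf_diff=high"}), -1, 0.6),
]
pairs = [
    ("a", frozenset({"bm25_diff=high", "tf_diff=high"})),
    ("b", frozenset({"bm25_diff=low"})),
    ("c", frozenset({"tf_diff=low"})),
]
selected = select_top(pairs, rules, 67)
```

A pair to which no rule applies receives a final score of 0 and therefore sorts last, so it is dropped first when the percentage is small.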
3.2. Ranking function
Following the Empirical Risk Minimization Principle, we adopt the following loss function, a linear combination of two objectives, to measure how well the ranking function predicts rankings:

\[
loss(h) = \sum_{D_i \in L_1} \ \sum_{x_{ij} \in D_i^{+},\ x_{ik} \in D_i^{-}} I\big(h(x_{ij}) \le h(x_{ik})\big) \;+\; \beta \sum_{(x_i, x_j) \in P \text{ and } x_i \succ x_j} I\big(h(x_i) \le h(x_j)\big) \tag{3}
\]

in which $D_i^{+}$ and $D_i^{-}$ represent the set of relevant documents and the set of irrelevant documents for the $i$th query in $L_1$ respectively. $(x_i, x_j) \in P$ and $x_i \succ x_j$ mean that $(x_i, x_j)$ is a pair in the test data $T_1$ whose assigned preference relation is that $x_i$ is more relevant than $x_j$ to their corresponding query. $I(p)$ is an indicator function: $I(p) = 1$ when $p$ holds, and $I(p) = 0$ otherwise.

The first objective measures the number of incorrectly ordered pairs when ranking the queries in the training set $L_1$ by $h$. The second objective measures the number of document pairs whose predicted order is inconsistent with the set of pairs obtained from the test data $T_1$ by the EXTRACT algorithm. The parameter $\beta$ is non-negative and represents the trade-off between the two objectives. Since the pairs obtained from the unlabeled test data may also bring noise, we set $0 < \beta < 1$ to down-weight the second objective. As shown in our experiments, the performance of our transductive learning algorithm is not sensitive to the value of $\beta$.

The loss function has an intuitive explanation. When we set $\beta$ to 0, the loss function becomes

\[
loss(h) = \sum_{D_i \in L_1} \ \sum_{x_{ij} \in D_i^{+},\ x_{ik} \in D_i^{-}} I\big(h(x_{ij}) \le h(x_{ik})\big) \tag{4}
\]

which is the same loss function as that of conventional pair-wise supervised learning to rank, such as Ranking SVM. In our transductive case, we obtain pair-wise preference data from the unlabeled test data and incorporate this additional information into the training process. The whole loss function measures a trade-off between keeping the rankings predicted by $h$ consistent with the training data and keeping them consistent with the constraints of the paired preference information extracted from the test data.
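As a small illustration of the two-term loss in Eq. (3), the sketch below counts misordered pairs in the labeled and the extracted pair sets and combines them with the weight β. It assumes both sets are given as lists of (preferred, less relevant) feature vectors and that a tie in the predicted scores counts as an error; the function names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Pair = Tuple[Vector, Vector]   # (x_i, x_j) with x_i preferred over x_j


def misordered(h: Callable[[Vector], float], pairs: List[Pair]) -> int:
    """Number of pairs whose preferred document is not scored strictly higher by h."""
    return sum(1 for xi, xj in pairs if h(xi) <= h(xj))


def transductive_loss(h: Callable[[Vector], float], labeled_pairs: List[Pair],
                      extracted_pairs: List[Pair], beta: float) -> float:
    """Eq. (3): errors on labeled pairs plus beta times errors on extracted pairs."""
    return misordered(h, labeled_pairs) + beta * misordered(h, extracted_pairs)


h = lambda x: x[0]                                   # toy scoring function
labeled = [((2.0,), (1.0,)), ((1.0,), (3.0,))]       # second pair is misordered by h
extracted = [((0.0,), (0.0,)), ((5.0,), (1.0,))]     # first pair is a tie
loss_transductive = transductive_loss(h, labeled, extracted, beta=0.8)
loss_supervised = transductive_loss(h, labeled, extracted, beta=0.0)
```

With `beta=0.0` the second term vanishes and the value reduces to the supervised pair-wise loss of Eq. (4).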
3.3. Optimization based on derived Ranking SVM
The loss function proposed in the previous subsection can be optimized by different techniques such as SVM, boosting and neural networks. In this paper, we derive a variant of Ranking SVM to tackle our optimization task. We refer to it as Ranking SVM with association rules, or AR-RSVM for short. Following common practice in learning to rank, we only consider the linear case for the ranking function $h$, i.e. $h(x) = w^T x$.

We first briefly review the conventional Ranking SVM. Ranking SVM is one of the three typical pair-wise approaches for learning to rank. It tackles the ranking problem by solving the following quadratic programming problem:

\[
\begin{aligned}
\min_{w,\ \xi_{q,i,j}} \quad & \frac{1}{2}\|w\|^2 + c \sum_{q,i,j} \xi_{q,i,j} \\
\text{s.t.} \quad & w^T x_{q,i} \ge w^T x_{q,j} + 1 - \xi_{q,i,j}, \\
& \forall\, x_{q,i} \succ x_{q,j}, \quad \xi_{q,i,j} \ge 0
\end{aligned} \tag{5}
\]

where $x_{q,i} \succ x_{q,j}$ means that document $x_{q,i}$ is ranked higher than document $x_{q,j}$ for query $q$ in the training data, and $\xi_{q,i,j}$ are slack variables. $\|w\|^2$ is the squared $l_2$-norm used for regularization of the model, and $c$ is the trade-off parameter between the margin size and the training error in SVM.

Now we describe the derived Ranking SVM with association rules. The new optimization problem can be formulated as the following quadratic programming problem:

\[
\begin{aligned}
\min_{w,\ \xi_{q,i,j},\ \xi_{tq,i,j}} \quad & \frac{1}{2}\|w\|^2 + c \left( \sum_{q,i,j} \xi_{q,i,j} + \beta \sum_{tq,i,j} \xi_{tq,i,j} \right) \\
\text{s.t.} \quad & w^T x_{q,i} \ge w^T x_{q,j} + 1 - \xi_{q,i,j}, \\
& w^T x_{tq,i} \ge w^T x_{tq,j} + 1 - \xi_{tq,i,j}, \\
& \forall\, x_{q,i} \succ x_{q,j}, \quad \xi_{q,i,j} \ge 0, \\
& \forall\, x_{tq,i} \succ x_{tq,j} \text{ and } (x_{tq,i}, x_{tq,j}) \in P, \quad \xi_{tq,i,j} \ge 0
\end{aligned} \tag{6}
\]

The paired preference data, which is obtained from the unlabeled test data using association rules, is incorporated into the optimization problem by adding a new part $c\beta \sum_{tq,i,j} \xi_{tq,i,j}$ to the objective function, along with the corresponding set of constraints $w^T x_{tq,i} \ge w^T x_{tq,j} + 1 - \xi_{tq,i,j}$. Moreover, $(x_{tq,i}, x_{tq,j}) \in P$ means that $x_{tq,i}$ and $x_{tq,j}$ form a pair in $P$. The new objective is thus a combination of two parts (apart from the regularization part): one part for the pairs generated from the training data, the other for the pairs generated from the test data. The constraint set is likewise a combination of the constraints for the training data and those for the test data. As with the conventional Ranking SVM, the new optimization problem is a quadratic programming problem with only linear constraints, which can be solved by existing optimization methods.

In the following sections, we show the effectiveness of our transductive learning method and compare its ranking performance against the conventional Ranking SVM approach as a baseline.

4. Experiments
In this section, we evaluate the ranking performance of our proposed transductive learning algorithm. The goals of our evaluation are threefold: (a) to compare our transductive learning algorithm to an existing supervised learning algorithm, i.e. Ranking SVM; (b) to evaluate the ranking performance of our approach with different numbers of pairs extracted by association rule learning; (c) to examine the performance of our approach with different values of the trade-off parameter $\beta$.

4.1. Dataset
We perform our experiments on the LETOR data collection (Liu, Xu, Qin, Xiong, & Li, 2007). LETOR is a publicly available benchmark data collection for research on learning to rank, which contains three datasets for document retrieval: TREC 2003, TREC 2004 and OHSUMED.

The TREC datasets (Craswell, Hawking, Wilkinson, & Wu, 2003) contain many features extracted from query-document pairs in the topic distillation task of TREC 2003 and TREC 2004. There are 49,171 query-document pairs in TREC 2003 and 74,170 in TREC 2004. The documents are from the GOV collection, which is based on a January 2002 crawl of .gov web sites (Liu et al., 2007). There are 44 features in total in the TREC datasets, covering a wide range including low-level features (e.g. idf, tf) and high-level features (e.g. BM25, language models). In the TREC datasets, the relevance judgment for each query-document pair is binary: 1 for relevant and 0 for irrelevant. There are 50 queries in TREC 2003 and 75 queries in TREC 2004.

The OHSUMED dataset (Hersh, Buckley, Leone, & Hickam, 1994) consists of 16,140 query-document pairs extracted from the online medical information database MEDLINE, which is widely used in information retrieval research. There are 106 queries in total with 25 features in OHSUMED. The relevance judgment for each query-document pair has 3 levels: 2 for definitely relevant, 1 for partially relevant and 0 for irrelevant.

Each of the three datasets is divided into five folds. Each fold contains a training set, a test set and a validation set, which can be used to conduct cross validation.
4.2. Evaluation measures
To evaluate the performance of ranking models, we use Mean Average Precision (MAP) (Baeza-Yates & Ribeiro-Neto, 1999) and Normalized Discounted Cumulative Gain (NDCG) (Jarvelin & Kekalainen, 2002) as evaluation measures.

MAP is a standard evaluation measure widely used in Information Retrieval systems. It works for cases with binary relevance judgments: relevant and irrelevant. MAP is the mean of average precisions over a set of queries. Precision at position $j$ (P@j) (Baeza-Yates & Ribeiro-Neto, 1999) represents the proportion of relevant documents within the top $j$ retrieved documents, which can be calculated by

\[
P@j = \frac{N_{pos}(j)}{j} \tag{7}
\]

where $N_{pos}(j)$ denotes the number of relevant documents within the top $j$ documents. Given a query $q_i$, the average precision of $q_i$ is defined as the average of P@j over the positions $j$ of the relevant documents and can be calculated by the following equation:

\[
AvgP_i = \sum_{j=1}^{M} \frac{P(j) \cdot pos(j)}{N_{pos}} \tag{8}
\]

where $j$ is the position and $M$ is the number of retrieved documents. $pos(j)$ is an indicator function: if the document at position $j$ is relevant, then $pos(j)$ is 1, otherwise $pos(j)$ is 0. $N_{pos}$ represents the total number of relevant documents for query $q_i$, and $P(j)$ is the precision at the given position $j$.

NDCG is another popular evaluation criterion for comparing ranking performance in Information Retrieval. Unlike MAP, NDCG can deal with cases which have more than two levels of relevance judgments. Given a query $q_i$, the NDCG score at position $m$ in the ranking list of documents can be calculated as follows:

\[
N_i = Z_i \sum_{j=1}^{m} \frac{2^{r(j)} - 1}{\log(1 + j)} \tag{9}
\]

where $r(j)$ is the grade of the $j$th document and $Z_i$ is a constant used for normalization, which is chosen so that the NDCG score for a perfect ranking is 1.

4.3. Experiment procedure
To evaluate the effectiveness of our proposed transductive learning algorithm, we compare our approach to an existing supervised learning algorithm for ranking. Since our approach uses a derived Ranking SVM implementation as the optimization framework, we choose the conventional Ranking SVM as the supervised baseline.

We try various values in the range 0–1 for the parameter $\beta$, which is the trade-off between the training data and the extracted paired preference data. Empirical studies show that our transductive algorithm is not sensitive to the parameter $\beta$. Hence, we heuristically fix $\beta$ to 0.8 in our experiments.

To obtain a suitable number of association rules, we heuristically set the minimum support threshold to 0.005 and the minimum confidence threshold to 0.8 in the association rule mining process. We found that even a small number of association rules can generate a large number of ordered document pairs from the unlabeled data. Hence, in order to examine the effect of different sizes of paired preference data on the ranking performance, we conduct experiments by choosing the top 1%, top 10% and top 40% of pairs from the whole set of pairs selected by association rules inference.

The trade-off parameter $c$ for SVM is chosen by cross validation. Since the top 10 documents ranked by a ranking model are viewed as the most important ones in web search, we choose the parameters which achieve the best value of NDCG@10 on the validation set and use them on the test data.
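To make the roles of the parameters c and β concrete, the following sketch trains a linear scoring vector on labeled and extracted pairs. It is not the quadratic programming solver used in our experiments: it swaps in plain subgradient descent on the unconstrained hinge-loss form of Eq. (6), in which each slack variable equals max(0, 1 − w·(x_i − x_j)). The function name `train_ar_rsvm` and the step-size and epoch parameters `lr` and `epochs` are hypothetical.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[Sequence[float], Sequence[float]]   # (preferred x_i, less relevant x_j)


def train_ar_rsvm(labeled: List[Pair], extracted: List[Pair], c: float = 1.0,
                  beta: float = 0.8, lr: float = 0.01, epochs: int = 200) -> List[float]:
    """Subgradient descent on 0.5*||w||^2 + c*(hinge on labeled + beta*hinge on extracted)."""
    dim = len(labeled[0][0])
    # Each pair becomes a difference vector with weight 1 (labeled) or beta (extracted).
    diffs = [([a - b for a, b in zip(xi, xj)], weight)
             for pairs, weight in ((labeled, 1.0), (extracted, beta))
             for xi, xj in pairs]
    w = [0.0] * dim
    for _ in range(epochs):
        grad = list(w)                       # gradient of the regularizer 0.5*||w||^2
        for d, weight in diffs:
            margin = sum(wk * dk for wk, dk in zip(w, d))
            if margin < 1.0:                 # margin constraint violated: slack is positive
                for k in range(dim):
                    grad[k] -= c * weight * d[k]
        w = [wk - lr * gk for wk, gk in zip(w, grad)]
    return w


labeled = [((1.0, 0.0), (0.0, 0.0))]        # labeled pair: feature 0 separates the documents
extracted = [((0.0, 1.0), (0.0, 0.0))]      # extracted pair: feature 1 separates them
w_transductive = train_ar_rsvm(labeled, extracted, beta=0.8)
w_supervised = train_ar_rsvm(labeled, extracted, beta=0.0)
```

In this toy setting the extracted pair pulls the weight of the second feature up only when β is positive, and by less than the labeled pair pulls the first, which is the down-weighting effect intended by choosing β below 1.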
Fig. 2. (b) Comparison of NDCG between RSVM and AR-RSVM on the TREC 2004 dataset.
Fig. 3. (c) Comparison of NDCG between RSVM and AR-RSVM on the OHSUMED dataset.
4.4. Experimental results
Figs. 1–4 show the performance of the ranking algorithms on the TD2003, TD2004 and OHSUMED datasets with respect to NDCG and MAP. Here, RSVM stands for the conventional supervised Ranking SVM, and AR-RSVM stands for our transductive approach using the derived Ranking SVM with association rules inference. As mentioned above, we conduct experiments with AR-RSVM by choosing the top 1%, top 10% and top 40% of pairs from the whole set of pairs selected by association rules inference. The ranking performance results are given in Tables 1–3. The bold values in Tables 1–3 denote the best value of the corresponding evaluation measure in each column.
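For reference, the measures reported here (Eqs. (7)–(9)) can be computed along the following lines. This is a minimal sketch, not the official LETOR evaluation tool: it assumes each ranked list contains all relevant documents of its query, uses the natural logarithm (the base cancels in the normalization), and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def average_precision(rels: Sequence[int]) -> float:
    """Eq. (8) for one ranked list of binary relevance labels (1 = relevant)."""
    hits, total = 0, 0.0
    for j, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / j        # P@j = N_pos(j) / j, Eq. (7)
    return total / hits if hits else 0.0


def mean_average_precision(lists: Sequence[Sequence[int]]) -> float:
    """MAP: mean of the average precisions over a set of queries."""
    return sum(average_precision(r) for r in lists) / len(lists)


def ndcg_at(grades: List[int], m: int) -> float:
    """Eq. (9): DCG of the top m documents divided by the DCG of a perfect ranking."""
    def dcg(gs: List[int]) -> float:
        return sum((2 ** g - 1) / math.log(1 + j) for j, g in enumerate(gs[:m], start=1))
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0


ap = average_precision([1, 0, 1, 0])
map_value = mean_average_precision([[1, 0], [0, 1]])
perfect = ndcg_at([2, 1, 0], 3)
reversed_order = ndcg_at([0, 1, 2], 3)
```

Dividing by the ideal DCG plays the role of the constant $Z_i$, so a perfectly ordered list scores exactly 1.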
Fig. 4. Comparison of MAP between RSVM and AR-RSVM on three datasets.
Table 1
Comparison of performance of AR-RSVM with different percentages of selected pairs on TREC 2003.

TD2003     %      Pairs    N@1      N@3      N@5      N@10     MAP
RSVM       N/A    N/A      0.4200   0.3787   0.3473   0.3410   0.2564
AR-RSVM    1      2215     0.4200   0.3811   0.3564   0.350    0.2589
AR-RSVM    10     22150    0.4400   0.3871   0.3590   0.3581   0.2633
AR-RSVM    40     88600    0.4200   0.3708   0.3485   0.3356   0.2474
Fig. 1. (a) Comparison of NDCG between RSVM and AR-RSVM on the TREC 2003 dataset.

Table 2
Comparison of performance of AR-RSVM with different percentages of selected pairs on TREC 2004.

TD2004     %      Pairs    N@1      N@3      N@5      N@10     MAP
RSVM       N/A    N/A      0.4400   0.4092   0.3935   0.4201   0.3505
AR-RSVM    1      2483     0.4533   0.4098   0.3956   0.4205   0.3551
AR-RSVM    10     24830    0.5067   0.4183   0.4000   0.4308   0.3657
AR-RSVM    40     99320    0.3067   0.3349   0.3550   0.3753   0.3064
Table 3
Comparison of performance of AR-RSVM with different percentages of selected pairs on OHSUMED.

OHSUMED    %      Pairs    N@1      N@3      N@5      N@10     MAP
RSVM       N/A    N/A      0.4952   0.4649   0.4579   0.4411   0.4469
AR-RSVM    1      1270     0.5584   0.4755   0.4720   0.4530   0.4469
AR-RSVM    10     12700    0.5711   0.4785   0.4772   0.4557   0.4477
AR-RSVM    40     50800    0.5172   0.4623   0.4489   0.4348   0.4317
Two observations can be made from these results. The first observation is that our proposed transductive learning algorithm shows significant improvement over the supervised baseline on all three datasets with respect to the NDCG and MAP measures. For example, with the top 10% of pairs, NDCG@1 rises from 0.4400 to 0.5067 on TD2004 and from 0.4952 to 0.5711 on OHSUMED. This empirically shows that the additional paired preference information extracted from the unlabeled data helps to improve the ranking performance. The second observation is that the percentage of top pairs selected from all pairs generated by association rules inference has a clear impact on the ranking performance. When we choose the top pairs by a relatively small percentage (e.g. from a range of 1%–20%), the ranking predictions perform better than the baseline. But when we choose a larger percentage (e.g. 40% or more), the ranking performance drops markedly and on most measures falls below the baseline.

5. Conclusions
In this paper, we present a transductive approach to learning ranking functions. The main intuition is to obtain paired preference data from unlabeled test data using association rules inference. Our experiments with a derived Ranking SVM as the optimization framework show significant improvement on the LETOR data collections.

Our future work includes: (1) Investigating different methods to extract paired preference information from unlabeled data. (2) Designing new loss functions for transductive ranking, for example by combining the loss for the labeled data and the loss for the unlabeled data in a different way. (3) Speeding up the association rules inference process of scoring document pairs and selecting the top ones, for instance by heuristically reducing the number of pairs whose scores need to be computed.

Acknowledgements
This work was funded in part by the National Science Foundation of China (Grant No. 61003045), the Natural Science Foundation of Guangdong Province, China (Grant No. 10451027501005667), the Educational Commission of Guangdong Province, China, and the Fundamental Research Funds for the Central Universities.
References
Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. In Proceedings of the ACM conference on knowledge discovery and data mining (KDD) (pp. 207–216). ACM Press.
Amini, M. R., Truong, T. V., & Goutte, C. (2008). A boosting algorithm for learning bipartite ranking functions with partially labeled data. In Proceedings of the ACM SIGIR conference on research and development in information retrieval (pp. 99–106). ACM Press.
Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval. Addison Wesley.
Burges, C. J. C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., et al. (2005). Learning to rank using gradient descent. In Proceedings of the international conference on machine learning (ICML) (pp. 89–96). ACM Press.
Cao, Y., Xu, J., Liu, T. Y., Li, H., Huang, Y., & Hon, H. W. (2006). Adapting ranking SVM to document retrieval. In Proceedings of the ACM SIGIR conference on research and development in information retrieval (pp. 186–193). ACM Press.
Cao, Z., Qin, T., Liu, T. Y., Tsai, M. F., & Li, H. (2007). Learning to rank: From pairwise approach to listwise approach. In Proceedings of the international conference on machine learning (ICML) (pp. 129–136). ACM Press.
Craswell, N., Hawking, D., Wilkinson, R., & Wu, M. (2003). Overview of the TREC 2003 web track. In NIST special publication 500-255: The twelfth text retrieval conference (TREC 2003).
Duh, K., & Kirchhoff, K. (2008). Learning to rank with partially-labeled data. In Proceedings of the ACM SIGIR conference on research and development in information retrieval (pp. 251–258). ACM Press.
Freund, Y., Iyer, R., Schapire, R., & Singer, Y. (2003). An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4, 933–969.
Herbrich, R., Graepel, T., & Obermayer, K. (1999). Support vector learning for ordinal regression. In Proceedings of the international conference on artificial neural networks (pp. 97–102).
Hersh, W. R., Buckley, C., Leone, T. J., & Hickam, D. H. (1994). OHSUMED: An interactive retrieval evaluation and new large test collection for research. In Proceedings of the ACM SIGIR conference on research and development in information retrieval (pp. 192–201). ACM Press.
Jarvelin, K., & Kekalainen, J. (2002). Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4), 422–446.
Joachims, T. (2002). Optimizing search engines using clickthrough data. In Proceedings of the ACM conference on knowledge discovery and data mining (KDD) (pp. 133–142). ACM Press.
Li, P., Burges, C. J. C., & Wu, Q. (2007). McRank: Learning to rank using multiple classification and gradient boosting. In Proceedings of the neural information processing system (NIPS) (pp. 845–852).
Liu, T. Y., Xu, J., Qin, T., Xiong, W., & Li, H. (2007). LETOR: Benchmark dataset for research on learning to rank for information retrieval. In Proceedings of the learning to rank workshop in conjunction with the 30th ACM SIGIR conference on research and development in information retrieval.
Nallapati, R. (2004). Discriminative models for information retrieval. In Proceedings of the ACM SIGIR conference on research and development in information retrieval (pp. 64–71). ACM Press.
Qin, T., Zhang, X. D., Wang, D. S., Xiong, W. Y., & Li, H. (2007). Ranking with multiple hyperplanes. In Proceedings of the ACM SIGIR conference on research and development in information retrieval (pp. 279–286). ACM Press.
Taylor, M., Guiver, J., Robertson, S., & Minka, T. (2008). SoftRank: Optimising non-smooth rank metrics. In Proceedings of the international conference on web search and data mining (WSDM) (pp. 77–86). ACM Press.
Vapnik, V., Golowich, S., & Smola, A. J. (1997). Support vector method for function approximation, regression estimation, and signal processing. In Advances in neural information processing systems (pp. 281–287). MIT Press.
Veloso, A., Almeida, M., Goncalves, M., & Wagner, M., Jr. (2008). Learning to rank at query-time using association rules. In Proceedings of the ACM SIGIR conference on research and development in information retrieval (pp. 267–274). ACM Press.
Wang, J., Han, J., & Pei, J. (2003). CLOSET+: Searching for the best strategies for mining frequent closed itemsets. In Proceedings of the ACM conference on knowledge discovery and data mining (KDD) (pp. 236–245). ACM Press.
Yue, Y., Finley, T., Radlinski, F., & Joachims, T. (2007). A support vector method for optimizing average precision. In Proceedings of the ACM SIGIR conference on research and development in information retrieval (pp. 271–278). ACM Press.
Zhu, X. (2005). Semi-supervised learning literature survey. Technical Report 1530, University of Wisconsin, Madison, Computer Science Department.