Information Processing & Management, Vol. 27, Nos. 2/3, pp. 153-164, 1991
Printed in Great Britain.
Copyright © 1991 Pergamon Press plc
DETERMINING THE EFFECTIVENESS OF RETRIEVAL ALGORITHMS

H.P. FREI and P. SCHÄUBLE
Department of Computer Science, Swiss Federal Institute of Technology (ETH) Zurich, Switzerland

(Received 22 November 1989; accepted in final form 20 June 1990)
Abstract. A new effectiveness measure is proposed to circumvent the problems associated with the classical recall and precision measures. It is difficult to evaluate systems that filter extremely dynamic information; the determination of all relevant documents in a real life collection is hardly affordable, and the specification of binary relevance assessments is often problematic. The new measure relies on a statistical approach with which two retrieval algorithms are compared. In contrast to the classical recall and precision measures, the new measure requires only relative judgments, and the reply of the retrieval system is compared directly with the information need of the user rather than with the query. The new measure has the added ability to determine an error probability that indicates how stable the usefulness measure is. Using a test collection of abstracts from CACM, it is shown that our new measure is also capable of disclosing the effect of manually assigned descriptors and yields a result similar to that of the traditional recall and precision measures.
1. INTRODUCTION
Since the early days of Information Retrieval (IR), there has been a never-ending debate on the issue of how to evaluate IR systems. When mentioning the term evaluation, one is usually interested in whether a user is able to benefit from using an IR system versus working without such a system. In addition, one would like to express such a benefit exactly (e.g., in terms of money saved or time reduced). With this kind of evaluation in mind, some of the factors that become indispensable parts of the evaluation process are the coverage of the collection, the response time of the system, the presentation of the output, and the effort the user of the IR system has to invest (van Rijsbergen, 1979). Unfortunately, many of the above-mentioned factors are extremely difficult to quantify. For this reason the outcome of such an evaluation is usually debatable and vague.

This paper is not concerned with the benefit rendered to an individual user by the entire IR system, but rather with one part of the IR system, namely the retrieval algorithm, which determines the answer set when a query and a collection of information items (documents) are given. The effectiveness of a retrieval algorithm is usually measured by computing recall and precision values. According to Salton and McGill (1983), the recall value expresses the proportion of retrieved relevant information items with respect to all the relevant items contained in the entire collection, and the precision value expresses the proportion of retrieved relevant information items with respect to the number of retrieved (relevant and not relevant) information items. However, there are two inherent problems associated with these values, because they rely, among other things, on:
• relevance assessments, and
• the total number of relevant information items in the entire collection (recall only).
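In symbols, using the standard textbook formulation (with Rel denoting the set of relevant items in the collection and Ret the set of retrieved items; this notation is not taken from the paper itself), the two classical measures are

$$\mathrm{recall} = \frac{|\mathrm{Ret} \cap \mathrm{Rel}|}{|\mathrm{Rel}|}, \qquad \mathrm{precision} = \frac{|\mathrm{Ret} \cap \mathrm{Rel}|}{|\mathrm{Ret}|}.$$

Both require the relevance of items to be decided, and recall additionally requires |Rel| for the whole collection.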
These well-known problems arise because of the fuzziness of the term relevance. Saracevic (1975) says: "The distinction between information and relevant information . . . , although intuitively quite clear, became and has remained a major point of discord due to a lack of a consensus on meaning."
In other words, relevance is a subjective notion, and users often disagree when judging the relevance of an information item with respect to a specific query. Another author, van Rijsbergen (1979), says: "There has been much debate in the past as to whether precision and recall are in fact the appropriate quantities to use as measures of effectiveness. A popular alternative has been recall and fallout (the proportion of nonrelevant documents retrieved). However, all the alternatives still require the determination of relevance in some way."

In order to avoid discussions on the relevance of documents when quoting recall and precision figures, the IR research community started to agree on standardized test collections with standardized queries, and generally accepted answer sets belonging to these queries. New or improved retrieval algorithms were henceforth to be calibrated by running test queries against one or several of these test collections. The determined retrieval performance could then be compared with the performance of previously examined retrieval algorithms. In this way, the subjective relevance determination has been hidden in the standardization and at least everyone is making the same mistakes, in case there are questionable relevance assessments, by using the same test collections. Another drawback is that both the test collections and the number of test queries are usually very small, which hampers the transfer of results to real life collections.

However, the most important objective of all research and development in IR is to improve the quality of searching in real life collections. These collections differ drastically from the test collections used for measuring the performance of new retrieval algorithms:

• Real collections consist of millions of information items, while test collections contain only a few thousand or perhaps tens of thousands of items;
• Real collections sometimes contain a variety of different information items (texts, images, data, etc.), while test collections typically consist of uniformly structured bibliographic references;
• Many of the real collections are not very well indexed, while most of the test collections, on the other hand, are nicely indexed by outstanding information specialists according to well-defined rules.
For these reasons, it is questionable whether results achieved with test collections can be transferred to a wide range of real life collections. Furthermore, new kinds of information as well as novel ways of distributing information are emerging. We encountered the problem addressed in this paper when examining large amounts of information distributed via wide area networks (WAN). In contrast to the well-organized nature of the traditional “IR information,” the information available on WANs is dynamic, timely, often “noisy,” and thus difficult to index and analyze, as shown by Wyle and Frei (1989).
2. USEFULNESS OF EXTRACTED INFORMATION
As pointed out above, it is difficult to evaluate the performance of a retrieval algorithm in an IR environment. The retrieval of WAN information that is unique in its volume, distribution, audience, and duration becomes even more difficult to evaluate. First of all, it is harder to determine the relevance, as the information is less organized and not as consistently structured as in other information systems. Secondly, trying to get an indication of which proportion of the existing relevant information items was retrieved by a system (cf. recall) is a hopeless undertaking, as the “total number of relevant information items in the entire collection” is virtually impossible to determine. Due to these shortcomings, we decided to concentrate on developing a measure that expresses how useful the information items are to a user (similar to the traditional precision measure) and to abandon the idea of capturing the exhaustivity of the search (what the traditional recall measure attempts to do). At the same time, we try to avoid the binary decision on an information item’s relevance or nonrelevance with respect to a given query since, as we pointed out above, the validity of such decisions is rather debatable in many cases.
We have called this new indication of the quality of an IR system usefulness, and we contrast it here with the traditional precision measure, which depends on relevance judgments. Of course, the change of the name from 'relevant' to 'useful' is not yet a solution. Usefulness is a judgment that compares the material delivered with the information need of the user rather than with the query issued. In other words, not only is the IR system evaluated, but the query formulation is evaluated as well. That the query formulation is included in the evaluation is in contrast to the traditional precision measurement, where relevance has to be judged with respect to a query.

In addition, absolute judgments seem to be difficult to obtain from an end user. Rees and Schultz (1967) and Lesk and Salton (1971) reinforced this claim by showing that a human examiner makes relative judgments easily by comparing two or several items. This is why we ask a person to rank the information items of the set presented according to the usefulness brought by the items to this particular person. As we shall see in more detail later, the set of information items to be ranked has to consist of items delivered by at least two different retrieval algorithms. Only then can evidence of the performance be gained. In other words, we compare two (or several) retrieval algorithms in terms of their performance rather than determining the performance of a single algorithm.

At first glance, this seems to be a disadvantage of the proposed method. However, scrutinizing the utility of the traditional recall and precision measures reveals that these traditional values not only depend on the retrieval algorithm, but also on the document collection and the queries employed. Therefore, they can be used to compare two (or several) retrieval algorithms when the queries and the document collection are kept constant. Likewise, they may be used to compare different document collections when the queries and the retrieval algorithm are kept constant, as was done in Schauble (1989, p. 8). Hence, the absolute effectiveness of a single retrieval algorithm is difficult to ascertain. For this reason, recall and precision values are used for the most part to compare two (or several) retrieval algorithms in terms of their effectiveness.

3. COMPARING TWO RETRIEVAL ALGORITHMS
We shall propose an alternative measure to compare two retrieval algorithms. Given a retrieval algorithm A and a retrieval algorithm B, the value u_{A,B} indicates whether A is more effective than B or vice versa. The numerical value u_{A,B} is determined statistically. It expresses the relative effectiveness of A with respect to B. A positive value u_{A,B} signifies that B is more effective than A. Conversely, a negative value u_{A,B} signifies that B is less effective than A.

The following notation is based on a probability space (Van der Waerden, 1969). In our approach, it is not necessary to know this probability space precisely. We are interested only in the probability of certain events. The experimental environment of these events consists of a retrieval system and of a community of users who need information. The retrieval system provides access to a dynamic collection of documents. Furthermore, the information need of the users is also assumed to be dynamic. This is a realistic assumption, as the information need of a user is often time dependent. Without loss of generality, we assume that, at every moment, the retrieval system is used by one and only one user. We call her or him the current user. Given such an experimental environment, the value P(D, p, q, r) denotes the probability that (1) D is equal to the current document collection of the system, (2) p is the current user, (3) q is the query by which p expresses her or his current need of information, and (4) the system's answer set is restricted to at most 2r documents (each of the two retrieval algorithms A and B is contributing at most r documents). It is not important whether the restriction 2r is due to the user or to the retrieval system. We shall see that no restriction is imposed on the answer set if r is equal to |D|.

A retrieval algorithm is given by an indexing method and a retrieval function. We assume that the indexing method of the retrieval algorithm A assigns to every query q and to every document d the vectors U_q and U_d, respectively. Similarly, method B assigns to q and to d the vectors V_q and V_d, respectively. The components of these vectors correspond to the relevance of certain features (e.g., reduced words, n-grams, term phrases).
For instance, when comparing conventional vector space retrieval with retrieval based on information traces (Teufel, 1989), the components of U_q and U_d represent the weights of reduced (stemmed) words, and the components of V_q and V_d represent the weights of certain n-grams.

The retrieval functions of the retrieval algorithms A and B are denoted by RSV_A and RSV_B, respectively. The retrieval function RSV_A determines a so-called retrieval status value RSV_A(U_q, U_d) for every pair (U_q, U_d). This numerical value predicts how well a document satisfies the information need of the user. Likewise, the retrieval function RSV_B determines a retrieval status value RSV_B(V_q, V_d) for every pair (V_q, V_d). Given a document collection D, a query q, and a threshold r, the answer set R_A(D,q,r) of the algorithm A is defined as follows: If the document collection D contains more than r documents, the answer set R_A(D,q,r) consists of the r documents with the highest RSV_A values. If D contains r or fewer documents, the answer set R_A(D,q,r) is identical to D. The answer set R_B(D,q,r) is defined analogously, and it also contributes at most r documents. Since we are comparing the two algorithms A and B, the set of documents the system actually delivers to the user is the union of the answer sets of A and B:

$$R(D,q,r) := R_A(D,q,r) \cup R_B(D,q,r).$$

This set of documents R(D,q,r) contains at most 2r documents. If r ≤ |D|, the answer set R(D,q,r) contains at least r documents, and if r ≥ |D|, the answer set R(D,q,r) is equal to D.
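As a concrete illustration, a minimal Python sketch of this construction follows. Document identifiers are assumed to be hashable values (e.g., strings), and rsv_a and rsv_b stand for arbitrary scoring functions; the names are illustrative, not taken from an implementation of the paper.

```python
# Sketch of the delivered set R(D,q,r), the union of the top-r answer
# sets of two retrieval algorithms A and B.

def answer_set(rsv, docs, query, r):
    """Top-r documents by retrieval status value (all of them if |D| <= r)."""
    ranked = sorted(docs, key=lambda d: rsv(query, d), reverse=True)
    return set(ranked[:r]) if len(docs) > r else set(docs)

def delivered_set(rsv_a, rsv_b, docs, query, r):
    """Union of the answer sets of A and B; contains at most 2r documents."""
    return answer_set(rsv_a, docs, query, r) | answer_set(rsv_b, docs, query, r)
```

With r at least |D|, both answer sets coincide with D and the delivered set is the whole collection, as stated above.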
4. EFFECTIVENESS AND PREFERENCES
In the process of determining the effectiveness, a decision has to be made at some point as to the degree of usefulness of every retrieved item. As already mentioned, we propose that the user not divide the answer set of a query q into relevant and nonrelevant items as is done in classical relevance judgments. Instead, he or she should point out whether item d' is more useful than item d. When specifying classical relevance assessments, one has to identify relevant documents. Even when subject matter experts specify classical relevance assessments, they may disagree when determining the dividing line between relevant and nonrelevant documents; however, when specifying preference relations, the experts have the possibility of distinguishing between different degrees of relevance or usefulness. Such assessments specifying the relative positioning of information items seem to be quite reliable, as pointed out in Lesk and Salton (1971): "Rees and Schultz (1969) also find that the judgment groups used in their study agree substantially as to the relative positioning (i.e., ordering in decreasing order of relevance to a search request) of the documents, although the judges tend to assign different numerical ratings to the documents."

Formally, such preferences are specified by a preference relation <_p for every user p. The preference d <_p d' signifies that the user p judges d to be less useful than d'. When determining the usefulness, at different moments, the current user is asked to specify preferences for the documents contained in her or his answer set. More precisely, the current user p whose information need is represented by q specifies preferences between the items contained in the answer set R(D,q,r). This set of known preferences is equal to

$$\pi_p \cap R^2(D,q,r)$$

where

$$\pi_p := \{(d,d') \mid d <_p d'\}, \qquad R^2(D,q,r) := R(D,q,r) \times R(D,q,r).$$
The pairs of π_p represent the preferences of the user p. The pairs of π_p ∩ R²(D,q,r) represent the known preferences explicitly specified by the user p. In addition to dealing with π_p, we specify π_A and π_B determined by the RSV values of the retrieval algorithms A and B respectively:

$$\pi_A := \{(d,d') \mid RSV_A(U_q, U_d) < RSV_A(U_q, U_{d'})\},$$
$$\pi_B := \{(d,d') \mid RSV_B(V_q, V_d) < RSV_B(V_q, V_{d'})\}.$$
Furthermore, we introduce two random variables X(D,p,q,r) and Y(D,p,q,r). The former denotes the portion of preferences satisfied by A and the latter denotes the portion of preferences satisfied by B:

$$X(D,p,q,r) := \frac{|R^2(D,q,r) \cap \pi_p \cap \pi_A|}{|R^2(D,q,r) \cap \pi_p|}, \qquad Y(D,p,q,r) := \frac{|R^2(D,q,r) \cap \pi_p \cap \pi_B|}{|R^2(D,q,r) \cap \pi_p|}.$$
In contrast to the random variables X(D,p,q,r) and Y(D,p,q,r), the actual values obtained from an experiment are denoted by x(D,p,q,r) and y(D,p,q,r) respectively. The value u_{A,B} determining the usefulness of B relative to A is obtained with k experiments. Every experiment corresponds to an event (D_i, p_i, q_i, r_i) where 0 ≤ i < k. Informally, the value u_{A,B} indicates how often, on the average, the values Y(D_i,p_i,q_i,r_i) are greater than the values X(D_i,p_i,q_i,r_i). In what follows, we abbreviate these values by y_i and x_i respectively. Given a document collection D_i = {d_0, ..., d_{m-1}}, a current user p_i, a query q_i, and a threshold r_i, the values x_i and y_i are calculated in the following way.

1. Determine (j_0, ..., j_{m-1}), a permutation of (0, ..., m-1), and determine (k_0, ..., k_{m-1}), a permutation of (0, ..., m-1), such that

$$RSV_A(U_{q_i}, U_{d_{j_s}}) > RSV_A(U_{q_i}, U_{d_{j_t}}) \Rightarrow s < t, \qquad RSV_B(V_{q_i}, V_{d_{k_s}}) > RSV_B(V_{q_i}, V_{d_{k_t}}) \Rightarrow s < t.$$

The two permutations represent two ranked lists of documents that are in decreasing order of the RSV_A values and the RSV_B values, respectively. If two documents are assigned an identical retrieval status value, their ranks are chosen randomly.

2. Determine the answer set delivered to the user. If the document collection contains more than r_i documents, the answer set consists of the first r_i documents retrieved by A and the first r_i documents retrieved by B. If r_i ≥ m, the answer set is identical to the entire document collection.

3. Determine x_i (i.e., the portion of preferences satisfied by A) and determine y_i (i.e., the portion of preferences satisfied by B).
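The per-experiment computation can be summarized in a short sketch that follows the set definitions above (strict inequality on the RSV values). It assumes documents are identified by hashable ids, the retrieval functions are plain callables, and the user's preferences are given as a set of pairs (d, d') meaning d is less useful than d'; all names are placeholders, not taken from an implementation of the paper.

```python
import itertools

def preference_pairs(rsv, docs, query):
    """pi_A-style relation on the given documents: (d, d') iff RSV(q,d) < RSV(q,d')."""
    return {(d, d2) for d, d2 in itertools.permutations(docs, 2)
            if rsv(query, d) < rsv(query, d2)}

def satisfied_fraction(delivered, user_prefs, alg_prefs):
    """x_i or y_i: share of the user's preferences, restricted to pairs of the
    delivered set R(D,q,r), that the algorithm's ranking agrees with."""
    restricted = {(d, d2) for d, d2 in user_prefs
                  if d in delivered and d2 in delivered}
    if not restricted:
        return 0.0  # no known preferences for this experiment
    return len(restricted & alg_prefs) / len(restricted)

# Usage for one experiment i (building on the answer-set sketch above):
# delivered = delivered_set(rsv_a, rsv_b, docs, query, r)
# x_i = satisfied_fraction(delivered, user_prefs, preference_pairs(rsv_a, delivered, query))
# y_i = satisfied_fraction(delivered, user_prefs, preference_pairs(rsv_b, delivered, query))
```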
Given k samples of events (D_i, p_i, q_i, r_i), the values x_i and y_i are calculated as described above. Given the pairs {(x_i, y_i) | 0 ≤ i < k}, the sum of the positive ranks w_+ is computed as follows.

1. Calculate the differences y_i − x_i and discard the differences that are equal to zero. They do not contribute to the comparison of A and B (see comment below).
2. Rank the absolute values |y_i − x_i| of the differences y_i − x_i in increasing order. If there are ties (|y_i − x_i| = |y_j − x_j|), each |y_i − x_i| is assigned the average value of the ranks for which it is tied (see comment below).
3. Restore the signs of the |y_i − x_i| to the ranks, obtaining signed ranks.
4. Calculate w_+, the sum of those ranks that have positive signs.

Discarding zero differences (|y_i − x_i| = 0) reduces the power of the usefulness measure (i.e., the error probability is increased). If there are too many ties (|y_i − x_i| = |y_j − x_j|), this approach is not appropriate, as shown in Rice (1988), and a modified usefulness measure has to be developed, similar to the modified Wilcoxon signed rank test (Hollander & Wolfe, 1973; Lehmann, 1975).

If method B is consistently better than method A, many differences with positive signs obtain high ranks. This means that w_+ is high if B is consistently better than A. However, a high value w_+ does not indicate whether B is only slightly better than A or much better than A. The usefulness u_{A,B} is defined to be the normalized deviation of w_+ from μ, the expectation of W_+ that is obtained when X_i and Y_i have the same distribution. The value k is reduced to the number of valid experiments (without ties), which will be called k_0:

$$u_{A,B} = \frac{w_+ - \mu}{\mu} \qquad \text{where} \qquad \mu = \frac{k_0(k_0+1)}{4}.$$

The value u_{A,B} indicates how often, on the average, the values y_i are greater than the values x_i.
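The ranking and normalization steps translate directly into code. The following is a minimal sketch in plain Python (average ranks for ties as in step 2); the error probability uses the normal approximation that Proposition 1 below makes precise, and the function name is illustrative.

```python
import math

def usefulness(xs, ys):
    """Return (u_AB, error_probability) from paired samples x_i, y_i.

    Steps: drop zero differences, rank |y_i - x_i| with average ranks
    for ties, sum the positive ranks to get w_+, and normalize by
    mu = k0(k0+1)/4. The error probability follows Proposition 1.
    """
    diffs = [y - x for x, y in zip(xs, ys) if y != x]        # step 1
    k0 = len(diffs)
    if k0 == 0:
        raise ValueError("all differences y_i - x_i are zero")
    order = sorted(range(k0), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * k0
    i = 0
    while i < k0:                                            # step 2: average tied ranks
        j = i
        while j < k0 and abs(diffs[order[j]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j + 1) / 2.0                              # ranks are 1-based
        for t in range(i, j):
            ranks[order[t]] = avg
        i = j
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)   # steps 3-4
    mu = k0 * (k0 + 1) / 4.0
    sigma = math.sqrt(k0 * (k0 + 1) * (2 * k0 + 1) / 24.0)
    u_ab = (w_plus - mu) / mu
    # P_k(U_AB >= u_AB) = 1 - Phi((w_+ - mu)/sigma), Phi via the error function
    p_err = 1.0 - 0.5 * (1.0 + math.erf((w_plus - mu) / (sigma * math.sqrt(2.0))))
    return u_ab, p_err
```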
In order to obtain an indication of how much the y_i are greater than the x_i, we define an adjusted usefulness u*_{A,B} which also includes the ties. The adjusted usefulness u*_{A,B} is small if u_{A,B} is small. In this case, there are few queries for which B is more effective than A. On the other hand, if u_{A,B} is close to 1, we can distinguish two cases. First, if u*_{A,B} is also close to 1, then there are many queries for which B is considerably more effective than A. Second, if u*_{A,B} is small, then there are many queries for which B is only slightly more effective than A.

In addition to abandoning the absolute relevance judgment, another advantage of the proposed usefulness measure is that it is possible to determine how much we can depend on the values u_{A,B} and u*_{A,B}. We shall distinguish between u_{A,B} and U_{A,B}. The former value is obtained by performing k experiments. The latter denotes the random variable U_{A,B}, which is a function of the random variable W_+. If A is more effective than B, or, more precisely, if X is stochastically greater than Y (i.e., P(X < z) < P(Y < z) for all z ∈ R), the expectation of U_{A,B} is negative. Remember that a negative value u_{A,B} indicates that A is more effective than B. Because of purely chance fluctuations, a positive value u_{A,B} can be obtained even though A is more effective than B. Let u_{A,B} be a positive value obtained by k experiments which indicates that B performs better than A. Then, a small probability P_k(U_{A,B} ≥ u_{A,B})
means that it is unlikely that u_{A,B} indicates an improvement by B although B is less effective than A. In Van der Waerden (1969, p. 267), the value P_k(U_{A,B} ≥ u_{A,B}) is called an error probability (i.e., the probability of rejecting the hypothesis that A is more effective than B even though the hypothesis is true). Rejecting the hypothesis means that B is regarded as being more effective than A. The following proposition shows how the error probability can be computed.

PROPOSITION 1

$$P_k(U_{A,B} \ge u_{A,B}) = 1 - \Phi\!\left(\frac{w_+ - \mu}{\sigma}\right)$$

where

$$\mu = \frac{k_0(k_0+1)}{4}, \qquad \sigma^2 = \frac{k_0(k_0+1)(2k_0+1)}{24}.$$
Proof. From the definition of U_{A,B} it follows that

$$P_k(U_{A,B} \ge u_{A,B}) = P_k(W_+ \ge w_+).$$
According to Rice (1988), if k ≥ 20 the distribution of W_+ is close to a normal distribution with mean μ and variance σ². □

The next proposition shows a few properties of the random variable U_{A,B} and its expectation E[U_{A,B}]. From 1) it follows that U_{A,B} is normalized; 2) and 3) give the mathematical expectations of U_{A,A} and U_{B,A}, which are as one would expect intuitively.

PROPOSITION 2

1) −1 ≤ U_{A,B} ≤ +1
2) E[U_{A,A}] = 0
3) E[U_{A,B}] = −E[U_{B,A}]

Proof. The maximum value of W_+ is equal to k_0(k_0+1)/2 and the minimum value of W_+ is equal to zero. Hence, U_{A,B} is between −1 and +1. The expectation of W_+ is equal to k_0(k_0+1)/4, i.e., E[W_+] = μ = k_0(k_0+1)/4. This implies that

$$E[U_{A,A}] = \frac{1}{k_0(k_0+1)}\left(4E[W_+] - k_0(k_0+1)\right) = 0.$$

Let W_− be the sum of the negative ranks, i.e., W_− = k_0(k_0+1)/2 − W_+. From E[W_+] + E[W_−] = E[W_+ + W_−] = k_0(k_0+1)/2 it follows that E[U_{A,B}] + E[U_{B,A}] = 0. □

5. AN EXAMPLE
In order to explain how the formulae of the previous section are evaluated, a detailed example is given. Let q_i be the queries and d_j be the documents. Every query is evaluated by the method A and by the method B. Table 1 shows the ranked lists of documents obtained by A and B. The first ranked list (query q_0 and method A) denotes that RSV_A(q_0,d_3) ≥ RSV_A(q_0,d_2) ≥ RSV_A(q_0,d_0) ≥ ... . Note that the document collection is assumed to be dynamic. For instance, the document collection consists of only three documents when the query q_2 was evaluated. Assuming that the parameter r is always equal to 4, we obtain the answer sets given in Table 2.
Table 1. Ranked lists of documents
Furthermore, we assume that the users specified the preferences given in Table 3. The preferences satisfied by A and B are shown in Table 4. Table 5 shows the values x_i, y_i, and the differences y_i − x_i. Discarding the zero difference y_1 − x_1 = 0 and ranking the remaining three differences according to their absolute values yields the ranks given in Table 6. Since |y_2 − x_2| and |y_3 − x_3| are tied, they are assigned an average rank of 2.5 = (2 + 3)/2. From the ranked differences we obtain the sum of positive ranks, the usefulness u_{A,B}, and the error probability. Note that one experiment (query q_1) was discarded. Hence, the number of experiments k_0 is equal to 3 rather than 4.
Table 2. Answer sets

$$w_+ = \frac{7}{2}, \qquad \sigma^2 = \frac{7}{2}, \qquad u_{A,B} = \frac{1}{6},$$
$$P_k(U_{A,B} \ge u_{A,B}) = 1 - \Phi\!\left(\frac{1}{\sqrt{14}}\right) = 0.4.$$
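Spelling out the intermediate arithmetic behind these figures (with k_0 = 3 and the values as reconstructed above, where the positive signed ranks are 1 and 2.5):

$$\mu = \frac{k_0(k_0+1)}{4} = \frac{3\cdot 4}{4} = 3, \qquad \sigma^2 = \frac{k_0(k_0+1)(2k_0+1)}{24} = \frac{3\cdot 4\cdot 7}{24} = \frac{7}{2},$$
$$u_{A,B} = \frac{w_+ - \mu}{\mu} = \frac{7/2 - 3}{3} = \frac{1}{6}, \qquad \frac{w_+ - \mu}{\sigma} = \frac{1/2}{\sqrt{7/2}} = \frac{1}{\sqrt{14}} \approx 0.27,$$
$$P_k(U_{A,B} \ge u_{A,B}) = 1 - \Phi(0.27) \approx 0.4.$$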
Table 3. Preferences
These results show what we intuitively expected. The usefulness u_{A,B} is slightly positive, as in two queries B was more effective than A and in only one query B was less effective than A. However, we cannot conclude that B is performing better than A because of the high error probability, which is due to the small number of experiments.

6. EXPERIMENTS
Salton (1969) and other researchers have shown that descriptors assigned manually to describe the documents improve the effectiveness of the retrieval system. The increase of the effectiveness is usually low, although it is often statistically significant. In this section, we show that the new usefulness measure also discloses that manually assigned descriptors enhance retrieval effectiveness.

We first show the effect of descriptors assigned manually to the documents with recall-precision diagrams. In particular, a test collection of abstracts from CACM described in Fox (1983) is used to determine the recall-precision graphs of two retrieval methods denoted by A and B. Every document of the CACM collection consists of several fields (title, authors, abstract, descriptors, etc.). The indexing of method A was based upon the title, author, and abstract fields, whereas the indexing of method B was based on the descriptor field in addition to the title, author, and abstract fields. The retrieval method A determines the retrieval status value RSV_A(q,d) as follows:

A1) The tokens in the fields T, A, and W (i.e., title, authors, and abstract) are identified.
A2) The words that occur in van Rijsbergen's stop list (1979, p. 18) are removed.
A3) The term frequency tf(d,t) of every term t in every document d is determined.
A4) The document frequency df(t) (the number of documents d ∈ D containing t) and the inverse document frequency idf(t) are determined.
A5) The weights of the terms are determined by

$$d_i := tf(d, t_i) \cdot idf(t_i), \qquad q_i := tf(q, t_i) \cdot idf(t_i).$$

A6) The retrieval status value of d with respect to q is defined to be the cosine of q and d:

$$RSV_A(q,d) = \frac{\sum_i q_i d_i}{\sqrt{\sum_i q_i^2}\,\sqrt{\sum_i d_i^2}}.$$
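The weighting scheme A1)-A6) can be sketched in a few lines of Python. The sketch is illustrative only: the stop list is a tiny placeholder for van Rijsbergen's list, tokenization is simplified, and the formula idf(t) = log(|D|/df(t)) is the common textbook form, used here as an assumption rather than the paper's exact definition.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "of", "and", "a", "to", "in", "is", "for"}  # placeholder stop list

def tokenize(text):
    """A1/A2: identify tokens and remove stop words (simplified)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def build_idf(documents):
    """A4: document frequency and inverse document frequency (assumed log(|D|/df))."""
    df = Counter()
    for doc in documents:
        df.update(set(tokenize(doc)))
    n = len(documents)
    return {t: math.log(n / df_t) for t, df_t in df.items()}

def weight_vector(text, idf):
    """A3/A5: tf-idf weights for one document or query."""
    tf = Counter(tokenize(text))
    return {t: tf_t * idf.get(t, 0.0) for t, tf_t in tf.items()}

def rsv_cosine(q_vec, d_vec):
    """A6: retrieval status value = cosine of the two weight vectors."""
    dot = sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())
    nq = math.sqrt(sum(w * w for w in q_vec.values()))
    nd = math.sqrt(sum(w * w for w in d_vec.values()))
    return dot / (nq * nd) if nq and nd else 0.0

# Usage: idf = build_idf(docs)
# rsv = rsv_cosine(weight_vector(query, idf), weight_vector(docs[0], idf))
```

Method B differs only in the text that is fed to the tokenizer (it additionally includes the descriptor field), as described next.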
Table 4. Satisfied preferences

Table 5. Values x_i, y_i, and y_i − x_i
The retrieval method B is identical to method A except that the fields K containing the manually assigned descriptors are also included.

B1) The tokens in the fields T, A, W, and K (title, authors, abstract, and descriptors) are identified.
B2) ... B6) are identical to A2) ... A6).

Both retrieval methods were used to evaluate the queries of the CACM test collection. This test collection consists of 3204 documents and of 52 queries with the corresponding sets of relevant documents (the queries with no relevant documents were omitted). The recall and precision values are determined as described in Salton and McGill (1983, p. 164). Figure 1 shows the recall and precision curves of the methods A and B. We see that, according to the recall/precision effectiveness measure, method B performs better than A, particularly in the medium recall range.

Having determined the recall/precision graphs of A and B, the usefulness measure is evaluated (Table 7). In our experiments, the threshold r is equal to |D| and the preference relations π_p are defined in an obvious way: (d,d') ∈ π_p iff d is not relevant and d' is relevant. In other words, relevant documents are preferred to nonrelevant documents. Like the recall/precision measure, the new usefulness measure also discloses how the retrieval effectiveness is affected by descriptors assigned manually to the documents. In addition, an error probability P_k(U_{A,B} ≥ u_{A,B}) is given which expresses how much we can depend on the value u_{A,B}.

The figures shown in Table 7 can be interpreted as follows. First, method A produces significantly more pairs (X, Y) where X < Y. The inequality X < Y means that A satisfies fewer preferences than B does. Hence, the value u_{A,B} indicates a consistent improvement by B. Like the conventional recall/precision effectiveness measure, the new measure u*_{A,B} = 0.027 shows a small increase of the effectiveness when manually assigned descriptors are taken into account. In fact, the average increase of the portion of satisfied preferences is very small (3%). The figure for P_k(U_{A,B} ≥ u_{A,B}) can be interpreted as follows. There is a very small error probability that u_{A,B} indicates an improvement by B even though B is less effective than A. This probability is small because B performs slightly better than A for 80% of the queries.
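For this validation with binary relevance assessments, the preference relation can be generated mechanically from the relevance sets. A small sketch (document identifiers are hypothetical placeholders) of exactly the rule stated above, which can then be fed to the satisfied-fraction sketch of Section 4:

```python
# (d, d') is a preference iff d is not relevant and d' is relevant,
# i.e. relevant documents are preferred to nonrelevant ones.

def preferences_from_relevance(answer_set, relevant):
    """All pairs (d, d') of the delivered set with d nonrelevant and d' relevant."""
    relevant = set(relevant)
    return {(d, d2) for d in answer_set if d not in relevant
                    for d2 in answer_set if d2 in relevant}

# Usage with made-up identifiers:
# preferences_from_relevance({"d0", "d1", "d2"}, {"d2"})
# -> {("d0", "d2"), ("d1", "d2")}
```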
Table 6. Ranks of |y_i − x_i|
Table 7. Usefulness of B with respect to A

u_{A,B} = 0.79
u*_{A,B} = 0.027
P_k(U_{A,B} ≥ u_{A,B}) = 0.7 × 10^-5
7. CONCLUSIONS
In this paper we suggest an alternative to the traditional evaluation measures of recall and precision. There were three reasons for doing so:

• We encountered serious problems when trying to evaluate a system which filters the extremely dynamic information delivered by wide area networks. These problems are due to both the amount and the variable structure of the information items.
• There are no standardized test collections available for the kind and amount of information we are dealing with.
• We wanted to overcome the highly debatable absolute relevance judgment necessary when determining recall and precision values.
The new measure concentrates on the usefulness of the information items for the user to whom they are delivered. In addition, relative judgments are used rather than the more rigid absolute (relevant-nonrelevant) classification. In other words, the judge may indicate a rank order of the information items delivered. This ranking constitutes a significant advantage, as a human examiner makes relative judgments easily and consistently by comparing two or several items, as was shown by various authors.

The measure we propose determines which of two given retrieval algorithms delivers more useful results to its users. This measurement is done by evaluating the preferences defined by the user-specified ranked list of the information items delivered. Even though we applied our method to a standard test collection with binary relevance assessments, it is to be noted that this evaluation was done only in order to validate the new method. Many of the advantages of the method proposed cannot be fully realized in such a case. As can be seen from the algorithm, the method also works if users assign an identical rank to more than one information item, thus ranking groups of documents rather than individual documents. However, more documents are needed in order to get a stable value in the cases when preference judgments are omitted.

Another advantage of the proposed method is that it is possible to determine how stable a usefulness value is for a given environment.
Fig. 1. Recall and precision curves of A and B.
This calculation is accomplished with a statistical test that has been shown to be both applicable and useful in such experiments. An error probability thus determines the reliability of the usefulness value. Finally, it is to be noted that the usefulness measure yields a single numeric value indicating the increase in performance.

Acknowledgements. We are extremely grateful to B. Teufel, who pointed out a major mistake in an earlier version of this paper. In addition, we thank M. Wyle, who carried out the calibrating experiments with the CACM test collection, and two unknown reviewers who contributed valuable suggestions.
REFERENCES

Fox, E.A. (1983). Test results on CACM collection. Report 83-561, Dept. of CS, Cornell University.
Hollander, M., & Wolfe, D. (1973). Nonparametric statistical methods. New York: Wiley.
Lehmann, E.L. (1975). Nonparametrics: Statistical methods based on ranks. Oakland, CA: Holden-Day.
Lesk, M.E., & Salton, G. (1971). Relevance assessments and retrieval system evaluation. In G. Salton (Ed.), The SMART retrieval system: Experiments (Chap. 26). Englewood Cliffs, NJ: Prentice-Hall.
Rees, A.M., & Schultz, D.G. (1967). A field experimental approach to the study of relevance assessments in relation to document searching. Final report to the NSF. Cleveland, OH: Case Western Reserve University, Center for Doc. and Comm. Res.
Rice, J.A. (1988). Mathematical statistics and data analysis. Pacific Grove, CA: Wadsworth.
Salton, G. (1969). A comparison between manual and automatic indexing methods. American Documentation, 20(1), 61-71.
Salton, G., & McGill, M. (1983). Introduction to modern information retrieval. New York, NY: McGraw-Hill.
Saracevic, T. (1975). Relevance: A review of and a framework for the thinking on the notion in information science. Journal of the American Society for Information Science, 26(6), 321-343.
Schauble, P. (1989). Information retrieval based on information structures. D.Sc. Thesis ETH No. 8784. Zurich: Verlag der Fachvereine.
Teufel, B. (1989). Informationsspuren zum numerischen und graphischen Vergleich von reduzierten natürlichsprachlichen Texten. D.Sc. Thesis ETH No. 8782. Zurich: Verlag der Fachvereine.
Van der Waerden, B.L. (1969). Mathematical statistics. Berlin: Springer-Verlag.
van Rijsbergen, C.J. (1979). Information retrieval. London: Butterworths.
Wyle, M.F., & Frei, H.P. (1989, June). Retrieving highly dynamic, widely distributed information. Proc. of the 12th SIGIR Conference, Cambridge, MA.