DISTANCE BETWEEN SETS AS AN OBJECTIVE MEASURE OF RETRIEVAL EFFECTIVENESS

M. H. HEINE
Newcastle upon Tyne Polytechnic

Summary: A general measure of retrieval effectiveness having full metric properties, and treating the "retrieval system-arbiter of relevance" situation symmetrically, is the Marczewski-Steinhaus metric, D, measuring the distance between the set of relevant documents, A, and the set of retrieved documents, B, according to D = 1 − (n(A ∩ B)/n(A ∪ B)). D can be expressed as a function of Precision and Recall, or of Generality, Fallout and Recall, and of other sets of traditional measures. Acceptance of the measure allows criteria for retrieval optimality and degeneracy to be stated, defined by minimum and constant values of D respectively. Precision-Recall degeneracy curves for D are given and compared with those for another general measure: the probability that a document will be correctly identified by a retrieval system. Statistical extensions of D are examined, and these and other properties of the metric are illustrated with seven examples.

I. INTRODUCTION
THE problem of finding a satisfactory measure* of retrieval effectiveness is a persistent one, with many measures having been put forward over the years. Reviews of the problem have been given by BOURNE [1], REES [2], ROBERTSON [3] and SWETS [4], and in view of their completeness and currency (especially Robertson's) a further review of them will not be attempted here. Moreover we will concentrate on the basic problem: that of best expressing the relationships between the subsets of documents involved in information retrieval, and not on such variables as retrieval speed, or appropriateness and age-distribution of documents in the set of documents accessible to the retrieval system. The following observations will be taken as a starting point: (i) There is no consensus among information scientists as to the criteria that a general measure of retrieval effectiveness should meet. Without universally accepted criteria, a general measure cannot be recognized. (ii) It does not seem to be universally held that a general, or universal, measure of effectiveness is indeed desirable. Since a general measure (as a bare number independent of any co-ordinates) sacrifices much of the information in the sets of documents involved, a "paired measure" approach such as that employing Recall and Precision, Recall and Fallout, or Sensitivity and Specificity, which sacrifices less information, is sometimes thought to be preferable. Against this view one may argue that a general measure of effectiveness has several positive advantages: (a) It allows the particular act of retrieval† (in a set of acts of retrieval) that is most efficient to be recognized, thereby allowing experimental parameters to be varied in a way that optimizes the retrieval process; (b) It allows the notion of cost-effectiveness in a retrieval situation to be introduced in an unambiguous way (compared
with the ambiguous notion of cost-effectiveness introduced by the paired measure approach); and (c) It allows the notion of "degeneracy" to be introduced as a collective description of those acts of retrieval sharing a constant value of the general measure of effectiveness. (iii) Most approaches to the problem have been ad hoc in nature, with little attention given to the placing of the measures in a theoretical framework. (An important exception is the work of SWETS [4, 5] which identifies Recall and Fallout with normal distribution functions of an "index of pertinence", z, and on this basis defines a novel general measure of effectiveness: the area under the Recall vs Fallout graph, A. However, Swets and others (Robertson, BROOKES [6]) fail to distinguish between the effectiveness of a retrieval language (which is what the Swets measure A, or the Brookes measure S, concern themselves with) and the effectiveness of a retrieval system,‡ defined by a particular value of the index of pertinence.) There also appears to be insufficient control of variables in experiments, in which Precision-Recall data (for example) sometimes arise promiscuously from variations in query, system (operating level), or arbiter of relevance; and uncertainty in the application of measures to sets of acts of retrieval, rather than a particular act of retrieval. (iv) Most if not all of such measures are asymmetrical in their comparison of the documents identified by the user as relevant with those identified by the system as relevant. One would hope that a general measure of retrieval effectiveness would take these two sets of documents into account in a symmetrical manner, since it is these two sets, reflecting two assessments of "relevant document" (one by the arbiter of relevance, the other by the system), that the measure is attempting to compare. (v) It would be desirable too if the general measure chosen had metric, or measure (in the strict sense), properties. Just as the notion of probability (a measure in the strict sense) has a body of useful theory attaching to it, so the theory of metric spaces or graph theory would then be applicable to our general measure. The theory of metric spaces may prove to be of especial application to optimizing the dissemination process (e.g. S.D.I.) where many sets of documents are to be compared at once, and where at present there seems to be little satisfactory theory.

* The term "measure" is used in its loose everyday sense, not in the precise mathematical sense.
† The term "act of retrieval" is introduced to describe the retrieval situation in which there is one query, one document collection, one arbiter of relevance, and one retrieval system acting at a fixed level.
‡ The distinction surely is that a "language" is a device that allocates a z-value to each document (given a document collection and a query), and a "system" is a language plus a prescribed critical value for z that determines retrieval and non-retrieval.

II. CRITERIA FOR A GENERAL MEASURE OF RETRIEVAL EFFECTIVENESS
In view of the above, the following path seems to be suggested: A general measure should be sought observing the criteria that it is defined in any and every retrieval situation, that it relates symmetrically to the sets of documents identified as relevant by arbiter and system, and that it fits naturally into the appropriate mathematical framework of set theory. It should thus not be defined as a statistic (cf. the Swets measure A) but it should be capable of being generalized in this direction in simple ways. It should be able to be related to the classical measures of Recall and Precision etc., and as a further step, the distribution functions of Recall, Precision, Fallout, Generality, etc. might be examined with a view to inferring the distribution function of the general measure. We shall argue for the use of a function in set theory that meets all these requirements, while emphasizing that the acceptance of it does not rule out the use of such measures as Recall, Fallout etc. where the information conveyed by such measures is of peculiar interest in the experiment or working situation being examined. Finally, it may be of interest to
realize that the general measure to be described is directly applicable to the problem of assessing the effectiveness (accuracy) of diagnosis in medicine, and of pattern recognition in general.

III. THE MARCZEWSKI-STEINHAUS METRIC
The function to be described is referred to here as the MARCZEWSKI-STEINHAUS metric, or MZ-metric for short, after its discoverers [7]. In order to define it and describe its properties the following notation is introduced. Let S be a set of documents, U_i an arbiter of relevance (perhaps the user), E_j an information retrieval system, and Q_k a statement or query. In response to Q_k, the arbiter U_i proceeds to identify documents in S that in his view are relevant to it. Such documents define set A, a subset of S. Similarly E_j defines a set of documents B, a subset of S, that according to its logic and language match Q_k. The term "act of retrieval" introduced earlier may thus be taken as referring to particular sets A and B defined by U_i, E_j, Q_k and S, the altering of any of which will change either or both sets. More briefly, the term can be taken as synonymous with the set {U_i, E_j, Q_k, S}. These two subsets of S, A and B, will not in general coincide (be equal). When they do, retrieval is perfect, and when they are disjoint (have no documents in common) one has total retrieval failure. In practice A and B will usually overlap, or more rarely one may be a subset of the other. As a measure of the degree of overlap between A and B, we introduce the MZ-metric, defined as:

$$D(A, B) = \frac{n(A \,\Delta\, B)}{n(A \cup B)} = 1 - \frac{n(A \cap B)}{n(A \cup B)} \qquad (1)$$
where n(·) denotes the number of elements (documents) in the set concerned, and Δ denotes the symmetric difference of two sets: A Δ B = (A ∪ B) − (A ∩ B). The metric serves to assess the "nearness" of A and B, and thus the extent to which the sets of retrieved and relevant documents coincide. It has the following four properties which, taken together, justify its description as a metric:

(I) D(A, B) ≥ 0
(II) D(A, B) = 0 if A = B
(III) D(A, B) = D(B, A)
(IV) D(A, B) + D(B, C) ≥ D(A, C), for each A, B, C in S.

In addition it has the further property:

(V) D(A, B) ≤ 1

implying that sets A and B, along with D(A, B), define a bounded metric space of diameter 1 (using the definition: diameter(S) = sup{D(A, B) | A, B ∈ S}) and hence that D may be interpreted as a probability. To cover the case where A and B are both empty (no documents relevant, no documents retrieved) we arbitrarily define:

(VI) D(A, B) = 0, A = B = ∅.
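By way of illustration, the following is a minimal sketch in Python of the metric and a spot-check of properties (II), (IV) and (VI); the document identifiers and sets are invented:

```python
# A minimal sketch of the MZ-metric over sets of document identifiers.
# The sets A (relevant) and B (retrieved) below are invented examples.

def mz_distance(a: set, b: set) -> float:
    """D(A, B) = n(A symmetric-difference B) / n(A union B); 0 if both empty."""
    union = a | b
    if not union:
        return 0.0                  # property (VI): D = 0 when A = B = empty
    return len(a ^ b) / len(union)

A = {1, 2, 3, 4, 5}                 # documents the arbiter judges relevant
B = {4, 5, 6}                       # documents the system retrieves

print(mz_distance(A, B))            # 4/6 = 0.667 (overlap {4, 5})
print(mz_distance(A, A))            # 0.0: perfect retrieval, property (II)
print(mz_distance(A, {9, 10}))      # 1.0: disjoint sets, total failure

C = {2, 3, 9}                       # triangle inequality (IV), spot-checked
assert mz_distance(A, B) + mz_distance(B, C) >= mz_distance(A, C)
```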
The property of D of being affine invariant does not appear to be of interest in information retrieval, nor does the property that it is equivalent to certain other metrics (SHEPHARD and WEBSTER [8], TAYLOR [9]). From the retrieval point of view we may note that D is symmetrical in A and B (i.e. interchanging these sets produces the same value of D), and that it needs in its computation all elements of the familiar 2 × 2 table except that of "irrelevant documents not retrieved". In support of this one may argue that "correct rejection" is after all a by-product, an incidental feature of the information retrieval process, and that there does not seem to be a convincing reason why the population of documents correctly rejected by E_j should be allowed to affect a comparison between sets A and B. (It is proved in the next section that documents correctly rejected can nevertheless be introduced into the calculation of D provided the Generality of the question is also considered.) This approach is consistent with that of GOOD based on decision theory [10].

Before interpreting D in terms of the 2 × 2 table, and other retrieval measures, the following points should be stressed. The use of the MZ-metric has been established already in information science by RAJSKI [11], GOTLIEB and KUMAR [12], and SOERGEL [13], not in connection with the problem of assessing retrieval effectiveness as discussed here, but to measure the dependence between the transmitted and received signals in a Shannonian information channel (Rajski), and the closeness of clusters of "attributes" such as key-words (Gotlieb, Soergel). (FAIRTHORNE too in an early reference [14] has discussed the appropriateness of the symmetric distance of two sets as a description of the separation of sets of documents characterized by different "marks", and his clear reconciliation of this distance with our intuitive notion of distance is especially valuable. There is also relevant comment by HILLMAN [15] in the proceedings of the 1964 Elsinore conference.) To extend the metric to the retrieval effectiveness problem seems the obvious next step.

An objection that has been made to the use of it for this purpose (Farradane, pers. comm.) is that too much of the information in the 2 × 2 table is lost in its computation (e.g. because D is symmetrical it cannot distinguish the acts Recall_i = M, Precision_i = N, and Recall_j = N, Precision_j = M, as is proved in the next section (relation (5))). But against this one may argue that some loss of information is inevitable in any general measure, indeed any measure other than the 2 × 2 table itself, and a reasonable price to pay for its advantages; also that in practice there will only very rarely be a need to distinguish (M, N) and (N, M) pairs in terms of the general measure, since the Precision vs Recall graph is rarely perfectly symmetrical. A further criticism (also due to Farradane) is that, like the classical paired measures, the MZ-metric excludes the "probably relevant" or "possibly relevant" in its computation. The author's view here is that the difficulty may be overcome by the artifice of introducing an "arbiter of relevance" in the way described earlier. All documents may then be posted to set A or set S − A, if necessary by making an arbitrary decision. (That we are considering a static collection also makes a Brouwerian or Heytian approach irrelevant.) Of course in practice human limitations will militate against this exhaustive and exhausting procedure if n(S) is large. The problem then is to devise a sampling procedure that will allow D(A, B) to be inferred within agreed limits according to an agreed level of confidence.

A less easy to refute criticism is that the membership of set A will in practice be determined by the order in which the arbiter examines the documents in S, or in a sample of S. Since individuals'
relevance assessments change with time and/or knowledge (unlike the assessments of present-day retrieval systems, incidentally), the membership of set A will be so affected, and strictly we should attempt to describe A as a function of whatever variables achieve this. This philosophy has in fact been built into the theory of GOFFMAN [16], and a partial solution is perhaps replacing set A by the corresponding set defined by Goffman. Even here there are difficulties however, as the conditional probabilities of relevance assumed by Goffman relate only to the preceding document examined rather than the preceding sequence of documents examined. Further discussion on this point is beyond the scope of this article. Another criticism, more of the 2 × 2 table itself than of the MZ-metric, is implied by Good's use of a 3 × 2 table, or Robertson's use of an (n + 1) × 2 table, additional rows (or columns) being provided for documents of different degrees of relevance. But to subdivide the documents thus seems far too arbitrary. If a quantitative approach to the question of degrees of relevance is experimentally justified, it seems better sense to consider either a family of 2 × 2 tables defined by altering one's retrieval system (equivalently, one's criterion of relevance) and hence to infer the distribution(s) of whatever measure(s) is used; or to approach the matter through 2 × 2 × 2 ... (n times) tables. Finally, here, it should be mentioned that the MZ-metric meets all the properties prescribed of a general measure by SWETS [5].
IV. THE MZ-METRIC IN TERMS OF THE 2 × 2 TABLE AND RELATED MEASURES
Let us write the 2 × 2 contingency table as:

                Selected    Discarded
Relevant        n_RS        n_RD         n_R·
Irrelevant      n_IS        n_ID         n_I·
                n_·S        n_·D         N

(The notation partly follows that of Good.) In terms of these elements the following relationships may be identified:

n(S) = N
n(A) = n_R·
n(B) = n_·S
n(S − A) = n_I·
n(S − B) = n_·D
n(A ∩ B) = n_RS
n((S − A) ∩ B) = n_IS
n(A ∩ (S − B)) = n_RD
n((S − A) ∩ (S − B)) = n_ID
n(A ∪ B) = n_IS + n_RS + n_RD.

From the basic definition (1) we have immediately:

$$D = 1 - \frac{n_{RS}}{n_{IS} + n_{RS} + n_{RD}} = \frac{n_{IS} + n_{RD}}{n_{IS} + n_{RS} + n_{RD}}, \quad n_{IS}, n_{RS}, n_{RD} \text{ not all zero}; \qquad D = 0, \quad n_{IS}, n_{RS}, n_{RD} \text{ all equal to zero}. \qquad (2)$$
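In code form (a sketch with invented counts), relation (2) makes the point explicit that n_ID never enters the computation:

```python
# Sketch of relation (2): D from the 2 x 2 table counts; n_ID plays no part.

def mz_from_table(n_rs: int, n_is: int, n_rd: int) -> float:
    """D = (n_IS + n_RD)/(n_IS + n_RS + n_RD), or 0 when all three are zero."""
    denom = n_is + n_rs + n_rd
    return 0.0 if denom == 0 else (n_is + n_rd) / denom

# Invented counts: 20 relevant-and-selected, 30 irrelevant-and-selected,
# 80 relevant-and-discarded; any number of correct rejections gives the same D.
print(mz_from_table(n_rs=20, n_is=30, n_rd=80))   # 110/130 = 0.846
```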
Alternatively, in terms of the more familiar conditional probabilities of Recall and Precision defined by:

$$R(A, B) = \frac{n(A \cap B)}{n(A)} = \frac{n_{RS}}{n_{R\cdot}}, \quad n_{R\cdot} \neq 0 \qquad (3)$$

$$P(A, B) = \frac{n(A \cap B)}{n(B)} = \frac{n_{RS}}{n_{\cdot S}}, \quad n_{\cdot S} \neq 0 \qquad (4)$$

we have

$$D = \frac{R + P - 2RP}{R + P - RP}, \quad n_{RS} \neq 0; \qquad D = 1, \quad n_{RS} = 0 \text{ and } n_{IS}, n_{RD} \text{ not both zero}; \qquad D = 0, \quad n_{RS} = n_{IS} = n_{RD} = 0. \qquad (5)$$
Thus D can be expressed in terms of R and P (in a symmetrical way in fact). To emphasize that (5) is not a statistic, but like (3) and (4) a description of an individual act of retrieval, it may be written more clearly as:

$$D_k = \frac{R_k + P_k - 2R_kP_k}{R_k + P_k - R_kP_k}.$$

Further forms for D are obtained as follows. Using the following definitions for Fallout and Generality:

$$F(A, B) = \frac{n((S-A) \cap B)}{n(S-A)} = \frac{n_{IS}}{n_{I\cdot}}, \quad n_{I\cdot} \neq 0 \qquad (6)$$

$$G_k = \frac{n(A)}{n(S)} = \frac{n_{R\cdot}}{N} \qquad (7)$$

(G is not a measure of retrieval effectiveness but a consequence of our defining the user, the query and the document collection, i.e. of the set {U_i, Q_k, S}), along with the following relation* between Precision, Recall, Fallout and Generality:

$$P_k = \frac{R_k G_k}{F_k(1 - G_k) + R_k G_k} \qquad (8)$$

easily obtained using Bayes' Theorem, and verified by substitution in it of (3), (4), (6) and (7), we obtain D in terms of Fallout, Generality and Recall:

$$D_k = \frac{F_k(1 - G_k) + G_k(1 - R_k)}{F_k(1 - G_k) + G_k} \qquad (9)$$

* Apparently first noted by ROBERTSON [3].
This is a useful form for work with the Swets model. The form of D appropriate to the WRU measures [17] of Specificity and Sensitivity (equivalent to 1 − F, and R, respectively) is obtained from (9) as:

$$D_k = \frac{(S_p)_k(G_k - 1) + (1 - G_k(S_n)_k)}{(S_p)_k(G_k - 1) + 1}. \qquad (10)$$

Finally, defining the Retrievality, C, as

$$C = \frac{n(B)}{n(S)} = \frac{n_{\cdot S}}{N} \qquad (11)$$

(the retrieval analogue of Generality, and a consequence of our defining the retrieval system, the query and the document collection, i.e. of the set {E_j, Q_k, S}) we obtain the further relations:

$$D_k = \frac{C_k + G_k - 2R_kG_k}{C_k + G_k - R_kG_k} \qquad (12)$$

$$D_k = \frac{C_k + G_k - 2P_kC_k}{C_k + G_k - P_kC_k} \qquad (13)$$
upon substituting the further relation between the probabilities involved, R/P = C/G, in (5), and with suitable precautions against the denominator equalling zero. The subscript k again serves to emphasize that the relations are not statistical ones. We note incidentally that the Retrievality is approximately equal to the Fallout for small values of the Generality. The following examples illustrate the use of some of the above relations, and the fundamental property of D that retrieval is more effective the smaller D is.

Example 1. {U_i, E_j, Q_k, S} defines an act of retrieval in which R_k = 1.0 and P_k = 0.1. Using (5) the value of the metric is found to be 0.9. Compare: if R_k = 0.1, P_k = 1.0, D_k = 0.9; if R_k = 0.5 = P_k, D_k = 0.67; if R_k = 0.6 = P_k, D_k = 0.57.

Example 2. Three per cent of a set of documents are judged by an arbiter to be relevant to a query. Two per cent of the set is retrieved by a system in response to the same query with a Precision of 60 per cent. What is the effectiveness of the retrieval system in terms of the MZ-metric? From (13), after substituting C_k = 0.02, G_k = 0.03 and P_k = 0.60, we obtain D_k = 0.68.

Example 3. Prove that the expectation of D in an S.D.I. system is less than the expectation of D in a Current Awareness system. Clarifying the rather loose definitions of these terms used in practice as follows: An S.D.I. system provides subsets B_i to each of a set of users U_i, when it is confronted with separate queries Q_i and a document set S. A Current Awareness system provides a common subset B, of which the B_i above are subsets, when it is confronted with separate queries Q_i relating to U_i, and a document set S. (In each case S may be thought of as a set of new documents or document descriptions that has just entered the ambit of the system. Obviously the above definitions are a very stilted view of the actual processes, but the elemental features are brought out.) We have:
$$E_{SDI}(D) = \sum_{i \in I} D(A_i, B_i)\big/n(I) \qquad \text{and} \qquad E_{CA}(D) = \sum_{i \in I} D(A_i, B)\big/n(I),$$

where I is the index set. Since D(A_i, B_i) ≤ D(A_i, B), the result E_SDI ≤ E_CA follows, suggesting the familiar result that S.D.I. is the more effective process of the two. The difference between E_SDI and E_CA is:

$$E_{SDI} - E_{CA} = \Big(\sum_{i \in I} D(A_i, B_i) - \sum_{i \in I} D(A_i, B)\Big)\big/n(I)$$

so that we have the general theorem:

$$0 \ge E_{SDI} - E_{CA} = \frac{1}{n(I)}\sum_{i \in I}\left[\frac{n(A_i \cap B)}{n(A_i \cup B)} - \frac{n(A_i \cap B_i)}{n(A_i \cup B_i)}\right].$$
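The mutual consistency of forms (1), (5), (9), (12) and (13) is easily spot-checked numerically. The sketch below uses invented counts chosen to match Example 2 (G = 0.03, C = 0.02, P = 0.6), and recovers its value D = 0.68:

```python
# Spot-check that the equivalent forms (1), (5), (9), (12), (13) agree.
# Counts are invented: N documents, n_a relevant, n_b retrieved, n_ab in both.
N = 1000
n_a, n_b, n_ab = 30, 20, 12

R = n_ab / n_a                        # Recall, relation (3)
P = n_ab / n_b                        # Precision, relation (4)
G = n_a / N                           # Generality, relation (7)
C = n_b / N                           # Retrievality, relation (11)
F = (n_b - n_ab) / (N - n_a)          # Fallout, relation (6)

d_direct = 1 - n_ab / (n_a + n_b - n_ab)             # definition (1)
d_rp  = (R + P - 2*R*P) / (R + P - R*P)              # form (5)
d_fgr = (F*(1 - G) + G*(1 - R)) / (F*(1 - G) + G)    # form (9)
d_cgr = (C + G - 2*R*G) / (C + G - R*G)              # form (12)
d_cgp = (C + G - 2*P*C) / (C + G - P*C)              # form (13)

for d in (d_rp, d_fgr, d_cgr, d_cgp):
    assert abs(d - d_direct) < 1e-12
print(round(d_direct, 4))             # 0.6842, i.e. Example 2's D = 0.68
```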
V. SOME CONSEQUENCES OF USING THE MZ-METRIC
Criteria for optimality

The main benefit of using a general measure is that the optimum act of retrieval may be identified among a set of acts of retrieval. Three optima suggest themselves:

(i) The most effective act of retrieval, identified by k such that D_k is a minimum. (14)

(ii) The most cost-effective act of retrieval, identified by k such that D_k C_k is a minimum, (15) where C_k is the cost of the kth act of retrieval, and effectiveness is taken to be proportional to D_k.

(iii) The most cost-beneficial act of retrieval, identified by k such that D_k/(V_k − C_k) is a minimum, (16)
where V_k is the benefit (in monetary units) of the retrieved documents to the user of them (who may or may not be the arbiter of relevance), and C_k − V_k is thus the net cost of each act of retrieval. With the classical paired measures, each of these criteria is of course ambiguous.

Example 4. Consider a set of H acts of retrieval, each of which generates a pair of Precision, Recall values (P_h, R_h). The variable that has produced these pairs may be S or U_i or Q_k or E_j; let us assume arbitrarily that it is E_j, the different retrieval systems acting on Q_k corresponding perhaps to different levels of term co-ordination, or different exhaustivities of indexing. Then of the acts of retrieval we are examining, retrieval is most effective according to criterion (14) when D_h = (P_h + R_h − 2R_hP_h)/(P_h + R_h − R_hP_h) is a minimum. The value of h concerned identifies the most effective system (E_j)_h (i.e. the most effective level of co-ordination, or exhaustivity of indexing) for the data to hand. Alternatively, if the (P_h, R_h) pairs had arisen from varying the arbiter of relevance, then (14) would identify the arbiter whose assessment of relevance agreed most closely with that of the system. Similar comments apply to (P_h, R_h) pairs arising from varying the query or document collection.

Example 5. A line drawn through a set of (P, R) pairs corresponding to different acts of retrieval is found to have the form P = f(R). (This line might be thought of as a regression line when there are sufficient points to construct same.) The (discrete) pairs have arisen through variation in some retrieval parameter z. By assuming continuity between P and R, find the maximum effectiveness possible and infer the optimum value of z. After substituting the above expression for P in (5), differentiating D with respect to R, and putting dD/dR = 0, we obtain the requirement:

$$R = f(R)\sqrt{\frac{-1}{df(R)/dR}} \qquad (17)$$

implying that for a minimum of D to exist in [0, 1], f(R) must be a decreasing function of R over part of the interval at least. A more general expression can be derived if R = R(z), P = P(z), D = D(z), by differentiating with respect to z.
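A numerical sketch of Example 5, assuming the hypothetical fit f(R) = 1 − R (a decreasing function, as (17) requires), and locating R* by direct search rather than by solving (17) analytically:

```python
# Sketch for Example 5: find R* minimizing D along a fitted curve P = f(R).
# The fit f(R) = 1 - R is a hypothetical illustration, not data from the paper.

def d_of(r: float, p: float) -> float:
    return (r + p - 2*r*p) / (r + p - r*p)     # relation (5)

f = lambda r: 1.0 - r                          # decreasing, as (17) requires

rs = [k / 10000 for k in range(1, 10000)]      # grid over (0, 1)
r_star = min(rs, key=lambda r: d_of(r, f(r)))
d_star = d_of(r_star, f(r_star))
print(round(r_star, 3), round(d_star, 3))      # 0.5 0.667

# Cross-check with condition (17): with df/dR = -1 it reads R = f(R) = 1 - R,
# giving R* = 0.5 directly.
```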
If the value of R satisfying (17) is denoted by R*, then the minimum value of D (corresponding to maximum retrieval effectiveness) is:

$$D^* = \frac{R^* + f(R^*) - 2R^*f(R^*)}{R^* + f(R^*) - R^*f(R^*)}. \qquad (18)$$

Further, the range of possible retrieval effectiveness is (D*, 1), and the optimum value of z (by (14)) is that which yields a value of D(z) closest to D*.

Example 6. It is instructive to examine the variation in Precision, Recall and the MZ-metric when the effect of varying {S, U_i, E_j, Q_k} is to "slide the two sets A and B over each other" (analogous to the convolution of two functions). The variations in P, R and D are shown in the following table:
j        R_j               P_j               D_j
1        0                 0                 1
2        1/n_R·            1/n_·S            (n_·S + n_RD − 2)/(n_·S + n_RD − 1)
...      ...               ...               ...
m        (m − 1)/n_R·      (m − 1)/n_·S      (n_·S + n_RD − 2(m − 1))/(n_·S + n_RD − (m − 1))

(The parameter j here serves to index successive acts of retrieval and takes the maximum value j = m = inf(n_R·, n_·S) + 1.)
This example is especially interesting in that it proves that the notion of a retrieval language having the property that Recall and Precision both increase as successive retrieval systems E_j are defined, is not self-contradictory. If Recall and Precision are found to vary inversely, for constant {U_i, Q_k, S}, then that is a consequence of the retrieval language (and its systems) chosen, and the explanation for it must be sought in terms of psychology and linguistics. If it is a law, then it is an empirical one and not a logical necessity.

Criteria for degeneracy

Use of any general measure also implies criteria for "degeneracy" among sets of acts of retrieval, if we define this term to mean the property that D_k = constant, for all k. (Alternative criteria suggested by (15) and (16) are: D_k C_k = constant, and D_k/(V_k − C_k) = constant.) Since much of the literature contains P, R graphs, and since the P, R paired-value approach is likely to endure (containing as it does more information than the MZ-metric), we proceed to state a relation between Precision and Recall under degeneracy conditions. From (5) we have immediately:

$$P = \frac{R(1-D)}{R(2-D)+(D-1)}. \qquad (19)$$

Graphs of P vs R under degeneracy conditions, for various values of D, are shown in Fig. 1. (For D = 1, the curve becomes the P and R axes, and for D = 0, the single point P = 1 = R.) The continuity implied by the lines is strictly incorrect: the purpose of the lines is to identify sets of P, R pairs that, by lying on one or other of the lines shown or on a line parallel to them, reflect retrieval degeneracy. (There is no restriction on how such sets may have arisen.) The curves are thus only indicative. In practice to test for degeneracy, expression (19) should be plotted in the vicinity of the P, R data being examined. The similarity of the curves to some published P, R curves of best fit suggests that degeneracy of the "D-type" may be being met in practice.
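Relation (19) is easily tabulated; the short sketch below generates points on three such degeneracy curves (the D values chosen arbitrarily). Note that P ≤ 1 forces R ≥ 1 − D, so each curve occupies the square [1 − D, 1]²:

```python
# Sketch of the degeneracy relation (19): Precision as a function of Recall
# at constant D, as plotted in Fig. 1.  The D values are illustrative.

def p_at(r: float, d: float) -> float:
    return r * (1 - d) / (r * (2 - d) + (d - 1))   # relation (19)

for d in (0.3, 0.6, 0.9):
    rs = [1 - d + k * d / 4 for k in range(5)]     # five R values spanning [1-D, 1]
    print(d, [(round(r, 2), round(p_at(r, d), 3)) for r in rs])
# At R = 1-D each curve gives P = 1, and at R = 1 it gives P = 1-D,
# reflecting the symmetry of D in Recall and Precision.
```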
[Fig. 1. R versus P for various D.]

VI. COMPARISON WITH AN ALTERNATIVE MEASURE
A measure different from the Marczewski-Steinhaus metric and having a general character is: the probability that a document in S will be correctly identified (as relevant or irrelevant as appropriate) by the retrieval system. This probability, conditional on the set {S, Q_k, U_i}, we might call the Retrieval Power of the information retrieval system. Let us denote Retrieval Power by X, and a document in S by s. Since X is equal to the probability that s is (a) relevant and retrieved, or (b) irrelevant and discarded, and since events (a) and (b) are exhaustive and mutually exclusive, all we need do to evaluate X is add the probabilities of these two events:

$$X = \Pr\big(s \in (A \cap B) \cup ((S-A) \cap (S-B)) \mid s \in S\big) = \Pr\big(s \in (A \cap B) \mid s \in S\big) + \Pr\big(s \in ((S-A) \cap (S-B)) \mid s \in S\big) = \frac{n_{RS} + n_{ID}}{N} \qquad (20)$$

or, in terms of conditional probabilities:

$$X = \Pr\big(s \in (A \cap B) \mid s \in A\big)\Pr(s \in A \mid s \in S) + \Pr\big(s \in ((S-A) \cap (S-B)) \mid s \in (S-A)\big)\Pr\big(s \in (S-A) \mid s \in S\big) = GR + (1-G)(1-F) = G(S_n - S_p) + S_p. \qquad (21)$$

Thus the Generality of the question is introduced directly into the assessment of Retrieval Power, whereas with D it may be excluded.
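The difference in how the two measures treat correct rejections can be made concrete with invented counts; the sketch below computes X from (20) and D from (2) for two tables differing only in n_ID:

```python
# Sketch contrasting Retrieval Power X, relation (20), with D, relation (2).
# Counts invented; the two cases differ only in n_ID (correct rejections).

def scores(n_rs, n_is, n_rd, n_id):
    N = n_rs + n_is + n_rd + n_id
    x = (n_rs + n_id) / N                        # relation (20)
    d = (n_is + n_rd) / (n_is + n_rs + n_rd)     # relation (2)
    return round(x, 3), round(d, 3)

print(scores(n_rs=20, n_is=30, n_rd=80, n_id=870))    # (0.89, 0.846)
print(scores(n_rs=20, n_is=30, n_rd=80, n_id=8700))   # (0.988, 0.846)
# X climbs towards 1 as the pool of correctly rejected documents grows,
# although the retrieval itself is unchanged; D is unaffected.
```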
We may compare, in set theory terms, the Retrieval Power with the MZ-metric:

$$D = \Pr\big(s \in ((S-B) \cap A) \cup ((S-A) \cap B) \mid s \in A \cup B\big)$$

and the different probability introduced by SWETS [5, p. 18]: the probability that two documents in S will be correctly distinguished by the retrieval system, given that one is relevant, the other irrelevant. Denoting this probability by Y, we have:

$$Y = \Pr(s_i \in B \mid s_i \in A)\cdot\Pr\big(s_j \in (S-B) \mid s_j \in (S-A)\big) = R(1-F)$$

which, like D, is independent of the Generality of the question. This probability is related to the Swets measure A via A = |∫ R d(1 − F)| for a set of R, F data arising from the statistical decision theory model. This however appears to assume that in practice each value of z is as equally likely to engender R, F data as any other. The simple statistics of expectation and variance of Y thus seem to be more satisfactory. From (21) we note first that for small values of G, the Retrieval Power is approximately equal to the Specificity, suggesting that the latter is the more important of the WRU measures for most information retrieval applications. Secondly we may obtain P, R degeneracy curves appropriate to X, in a similar way to those obtained from D. From the Bayes' Theorem result (8), X may be expressed as a function of Generality, Precision and Recall:

$$X = \frac{P - G(R + P - 2RP)}{P} \qquad (22)$$
so that the required curves are:

$$P = \frac{GR}{2GR + (1-G) - X}, \qquad X = \text{constant}. \qquad (23)$$

[Fig. 2(a). R versus P for various X, G = 0.05.]

[Fig. 2(b). R versus P for various X, G = 0.50.]
These curves are illustrated in Fig. 2 (for X = 0.05, 0.10, ..., 0.95; (a) G = 0.05 and (b) G = 0.50), and Fig. 3 (for G = 0.05, 0.10, ..., 0.50; (a) X = 0.30 and (b) X = 0.90). As can be seen from the Figures, or by differentiating (23), the degeneracy curve has a slope ∂P/∂R < 0 for all R when:

X > 1 − G, G constant, or G > 1 − X, X constant (24)
so that we cannot completely support the hypothesis suggested by Fig. 1 that the degeneracy curves of all general measures show the "inverse variation of P with R law" (i.e. that the P, R curves are convex towards the origin). In fact since G is small in practice, the opposite will be the case with the X curves at hand. The introduction of X serves to illustrate the notion of degeneracy in terms of a general measure other than D, and in the author's opinion to emphasize the virtues of the latter: namely that D is not directly dependent on Generality, does not weight n_ID so heavily (indeed at all) in its assessment of retrieval effectiveness, and does not start from the biased position whereby relevance is taken solely from the viewpoint of the arbiter, rather than the retrieval system as well. The assessment of the situation offered by D is less arbitrary, and allows the systematic comparison of views of different arbiters.
[Fig. 3(a). R versus P for various G, X = 0.3.]

[Fig. 3(b). R versus P for various G, X = 0.9.]

VII. STATISTICS OF SETS OF ACTS OF RETRIEVAL
The extension of measures of retrieval effectiveness from one act of retrieval to a set of such acts has always presented difficulty. There is for example, with Recall and Precision, the long-standing debate of "average of ratios or ratio of averages?". The justification for the latter is said to be that it relates to a question having all the features of each of the questions for which the Recall etc. has been measured, but this does not appear to be logically sound: If we consider a question Q_i (with {U_i, E_j, S} held constant), the Recall is

$$R_i = \frac{n(A_i \cap B_i)}{n(A_i)}$$

and the Recall for another question Q_j (with the same {U_i, E_j, S}) is

$$R_j = \frac{n(A_j \cap B_j)}{n(A_j)}$$

so that given a question Q = Q_i ∨ Q_j (i.e. Q_i OR Q_j in the Boolean sense) the Recall is

$$R_{i\,\mathrm{OR}\,j} = \frac{n\big((A_i \cup A_j) \cap (B_i \cup B_j)\big)}{n(A_i \cup A_j)}$$

and given a question Q′ = Q_i ∧ Q_j (i.e. Q_i AND Q_j), the Recall is

$$R_{i\,\mathrm{AND}\,j} = \frac{n\big((A_i \cap A_j) \cap (B_i \cap B_j)\big)}{n(A_i \cap A_j)}.$$

The obvious generalizations of these expressions are:

$$R_Q = \frac{n\big((\bigcup_{i=1}^{n} A_i) \cap (\bigcup_{i=1}^{n} B_i)\big)}{n\big(\bigcup_{i=1}^{n} A_i\big)}, \quad \text{for } Q = Q_1 \vee Q_2 \vee \ldots \vee Q_n \qquad (25)$$

$$R_{Q'} = \frac{n\big((\bigcap_{i=1}^{n} A_i) \cap (\bigcap_{i=1}^{n} B_i)\big)}{n\big(\bigcap_{i=1}^{n} A_i\big)}, \quad \text{for } Q' = Q_1 \wedge Q_2 \wedge \ldots \wedge Q_n \qquad (26)$$

with analogous forms for Precision, Fallout, the MZ-metric etc. For example,

$$D_Q = \frac{n\big((\bigcup_{i=1}^{n} A_i) \,\Delta\, (\bigcup_{i=1}^{n} B_i)\big)}{n\big(\bigcup_{i=1}^{n} (A_i \cup B_i)\big)}, \quad \text{for } Q = Q_1 \vee Q_2 \vee \ldots \vee Q_n. \qquad (27)$$
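A sketch of (25) and (27) for a two-query OR-combination, with invented document sets; the "ratio of averages" figure is shown alongside for contrast:

```python
# Sketch of relations (25) and (27) for Q = Q1 OR Q2, with invented sets.
from functools import reduce

A = [{1, 2, 3}, {3, 4, 5, 6}]      # relevant sets A_i for the two queries
B = [{2, 3, 7}, {4, 5, 8}]         # retrieved sets B_i for the two queries

a_or = reduce(set.union, A)        # union of the A_i
b_or = reduce(set.union, B)        # union of the B_i

r_q = len(a_or & b_or) / len(a_or)            # relation (25): 4/6 = 0.667
d_q = len(a_or ^ b_or) / len(a_or | b_or)     # relation (27): 4/8 = 0.5
print(round(r_q, 3), round(d_q, 3))

# The "ratio of averages" figure, for contrast:
r_bar = sum(len(a & b) for a, b in zip(A, B)) / sum(len(a) for a in A)
print(round(r_bar, 3))                        # 4/7 = 0.571
```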
It is these expressions, and their further generalizations in the event that Q is a more complicated Boolean function of the Q_i, which form the extension of the measures at which the "ratio of averages" school seem to be aiming, rather than expressions of the form:

$$\bar{R} = \frac{\sum_{i=1}^{n} n(A_i \cap B_i)}{\sum_{i=1}^{n} n(A_i)}.$$
However in practice, effectiveness in the above sense will only rarely be meaningful, and in the author's view the simple statistics of mean (or expectation) and variance:

$$E(D) = \sum_{h \in H} D_h\,\Pr(D_h) \qquad \text{and} \qquad \mathrm{Var}(D) = \sum_{h \in H} \big[D_h - E(D)\big]^2\,\Pr(D_h)$$

(where H is the index set), based on system response to individual {U_i, E_j, Q_k, S} sets, are more appropriate.* By suitable control of variables, useful meanings can also be given to the following more restricted terms: "mean D (users)", "mean D (systems)", "mean D (queries)", "mean D (document sets)"; and "variance D (users)" etc.

* Most statistics texts give E and Var formulae for special distributions. For example if D_h values occur with probabilities given by the binomial formula with parameters H and p:

$$\Pr(D_h) = \binom{H}{h}\,p^h\,(1-p)^{H-h},$$

where 0 < p < 1, we have E(D) = Hp and Var(D) = Hp(1 − p).

A statistical problem of a different type arises as follows: Suppose we know the distributions of P and R, or of G, F and R, or of each of the members of any of the sets of variables of which D is a function: how may we infer the distribution of D itself? This theoretical problem is clearly important in work on models of the retrieval process. (It is not an experimental problem of course, since the distribution of D is given directly by experimental data.) The following is a description of a way of doing this but involves making the approximation that the variables involved are continuous. For (P, R) or (F, G, R) sets numbering about 50 or more, the error involved should not be serious. The approach given is described by APOSTOL [18] and ALEXANDER [19].

Let us define the sample space of the random variable R as r, and that of P as p. Let f(p, r) be the joint probability density of the two-dimensional random variable (P, R). (f(p, r) may be regarded either as a density function whose distribution approximates that of the (P, R) data obtained experimentally, or the density function of the true distribution of which the (P, R) data obtained experimentally is a sample.) Then from f(p, r) we can calculate the joint density g(u, v) of (U, V), where (U, V) is a two-dimensional random variable in the uv-plane defined as a function of P and R. One of these variables, U say, may be chosen to have the form of D; the other may be chosen arbitrarily. Once g(u, v) has been calculated on this basis, it is possible to calculate the density function of U alone, say φ(u), which solves our problem. Let us map the region R in the pr-plane into the region R′ in the uv-plane according to: u = (p + r − 2pr)/(p + r − pr), and v = r (chosen arbitrarily). The inverse mapping is: p = M(u, v) = v(1 − u)/(u(1 − v) + 2v − 1), and r = N(u, v) = v. Now by calculus we have the relation:
$$g(u, v) = f\big(M(u, v),\, N(u, v)\big)\left|\frac{\partial(M, N)}{\partial(u, v)}\right|$$
where |∂(M, N)/∂(u, v)| stands for the absolute value of the Jacobian of the transformation. (The above relation follows from the definitions of f and g:

$$\Pr\big((P, R) \in \mathcal{R}\big) = \iint_{\mathcal{R}} f(p, r)\,dp\,dr \quad \text{and} \quad \Pr\big((U, V) \in \mathcal{R}'\big) = \iint_{\mathcal{R}'} g(u, v)\,du\,dv,$$

from the equality of the two left-hand sides, and from the familiar formula for transforming double integrals.) Since the Jacobian, ∂(M, N)/∂(u, v) = (∂M/∂u)(∂N/∂v) − (∂M/∂v)(∂N/∂u), takes the form v²/(u(1 − v) + 2v − 1)² for the expressions for M and N above, we have the general relations:

$$g(u, v) = f\!\left(\frac{v(1-u)}{u(1-v)+2v-1},\, v\right)\frac{v^2}{(u(1-v)+2v-1)^2} \qquad (28)$$

and

$$\phi(u) = \int_{-\infty}^{\infty} g(u, v)\,dv = \int_{1-u}^{1} g(u, v)\,dv. \qquad (29)$$
The reason for choosing the latter integral's limits is that the region in which f(p, r) ≠ 0 in the pr-plane, namely the region 0 ≤ p ≤ 1, 0 ≤ r ≤ 1, maps into the triangular region in the uv-plane described by the lines u = 1, v = 1, and v = 1 − u. Expression (29) could, alternatively, have been deduced directly from M and ∂M/∂u using Theorem 36.2 given by Alexander. The reason for the above approach is to indicate the general method. In the "three-dimensional" case, for example, where one is dealing with "Fallout, Recall, Generality" data rather than "Recall, Precision" data, the density function of D, after the mapping u = (g(1 − r − f) + f)/(g(1 − f) + f), v = r, w = f, is:

$$\phi(u) = \iint h\!\left(\frac{w(1-u)}{u(1-w)+v+w-1},\, v,\, w\right)\frac{vw}{(u(1-w)+v+w-1)^2}\,dv\,dw$$
where h(g, f, r) is the joint density of the random variable (G, F, R), and is separable into the product h_1(g)·h_2(f, r) when (as is to be expected in practice) G is independent of (F, R).

Example 7. Assume P and R to be uniform random variables over the interval [0, 1]. (This is unrealistic but serves to illustrate the method in the simplest way.) Find and sketch the density function of D. The joint density is f(p, r) = 1 over the square 0 ≤ p ≤ 1, 0 ≤ r ≤ 1. Substitution of this in (29) gives:

$$\phi(u) = \int_{1-u}^{1} \frac{v^2}{(u(1-v)+2v-1)^2}\,dv$$

which simplifies after some calculus to

$$\phi(u) = \frac{2u}{(2-u)^2} - \frac{4(1-u)}{(2-u)^3}\,\log_e(1-u), \qquad 0 \le u < 1.$$

The meaning of this function is that the probability that D lies in the interval (a, b] is given by it according to:

$$\Pr(a < D \le b \mid 0 \le a < b \le 1) = \int_a^b \phi(u)\,du.$$

The expectation of D in this situation is obtained by evaluating

$$E(D) = \int_0^1 u\,\phi(u)\,du,$$

with the result E(D) = 0.71. A sketch of φ(u), obtained from sample values of φ(u) for u in [0, 1), and using lim_{u→1} φ(u) = 2, is shown in Fig. 4.
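The closed form, and the value E(D) = 0.71, can be checked by elementary numerical integration (a sketch in plain Python; step counts are arbitrary):

```python
# Numerical check of Example 7 in plain Python; step counts are arbitrary.
import math

def phi(u):
    """Closed form: 2u/(2-u)^2 - [4(1-u)/(2-u)^3] log(1-u), 0 <= u < 1."""
    return 2*u/(2 - u)**2 - 4*(1 - u)/(2 - u)**3 * math.log(1 - u)

def phi_direct(u, steps=20000):
    """Relation (29) with f(p, r) = 1: trapezoid rule over v in [1-u, 1]."""
    lo = 1 - u
    h = (1 - lo) / steps
    f = lambda v: v**2 / (u*(1 - v) + 2*v - 1)**2
    return h * (0.5*f(lo) + sum(f(lo + k*h) for k in range(1, steps)) + 0.5*f(1.0))

print(round(phi(0.5), 4), round(phi_direct(0.5), 4))   # 0.8552 0.8552

# E(D): the integrand u*phi(u) stays bounded (phi -> 2 as u -> 1), so a
# trapezoid sum stopping just short of u = 1 is adequate.
n = 100000
e_d = sum((k/n) * phi(k/n) for k in range(1, n)) / n
print(round(e_d, 2))                                   # 0.71, as in the text
```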
[Fig. 4. Sketch of φ(u).]
There is obviously considerable scope here for theorists and experimenters in the information retrieval field to attempt to set up density functions of the (P, R) variable, with a view to understanding the relationship between such functions and the (P, R) data arising from various retrieval systems used in practice or the subject of experiment. It would, for example, be interesting to test the hypothesis that a normal bivariate density of (P, R) was able to provide regression curves tallying with experimental P, R curves of best fit.

VIII. CONCLUSIONS
The Marczewski-Steinhaus metric provides what appears to be an objective general measure of retrieval effectiveness. It does so within the framework of set theory and the theory of metric spaces, and it is hoped that this will clear some of the way to a more satisfactory theoretical treatment of retrieval and dissemination. The metric can be related to the classical paired measures; indeed some faith is restored in the use of the Cranfield measures by virtue of its simple relationship to them. The metric is not necessarily explicitly a function of Generality (see (5)) though Generality may be used in its assessment along with Fallout (see (9)). But Generality will influence the metric implicitly, in (5), if, as in the Swets model, Precision itself is a function of Generality. Being a general measure, criteria for optimality and degeneracy can readily be derived from it, and it is hoped that in particular the latter notion will lead to an increased understanding of the Precision-Recall relationships met in practice, and of the retrieval languages which produce them. A further report on the relationship between the Marczewski-Steinhaus metric and the Swets model is planned.

Acknowledgements: I wish to thank Mr. J. E. FARRADANE, City University, for his constructive criticisms of the MZ-metric, and Mr. J. A. GARTSIDE of the Polytechnic's Computer Unit who did the programming for Figs. 1-3.
REFERENCES

[1] C. P. BOURNE: Evaluation of indexing systems. In: Annual Review of Information Science and Technology (Edited by C. A. CUADRA), Vol. 1, pp. 171–90, New York (1966).
[2] A. M. REES: Evaluation of indexing systems and services. In: Annual Review of Information Science and Technology (Edited by C. A. CUADRA), Vol. 2, pp. 63–86, New York (1967).
[3] S. E. ROBERTSON: Parametric description of retrieval tests. J. Docum. 1969, 25, 1–27, 93–107.
[4] J. A. SWETS: Information retrieval systems. Science 1963, 141, 245–50.
[5] J. A. SWETS: Effectiveness of Information Retrieval Methods. Air Force Cambridge Research Laboratories, Bedford, Mass. (1967). (Report AFCRL-67-0412; reprinted in Am. Docum. 1969, 20, 72–89.)
[6] B. C. BROOKES: The measures of information retrieval effectiveness proposed by Swets. J. Docum. 1968, 24, 41–54.
[7] E. MARCZEWSKI and H. STEINHAUS: On a certain distance of sets and the corresponding distance of functions. Colloquium math. (Warsaw) 1958, 6, 319–27.
[8] G. C. SHEPHARD and R. J. WEBSTER: Metrics for sets of convex bodies. Mathematika 1965, 12, 73–88.
[9] A. E. TAYLOR: General Theory of Functions and Integration, p. 111, Blaisdell, New York (1965).
[10] I. J. GOOD: The decision-theory approach to the evaluation of information retrieval systems. Inform. Stor. Retr. 1967, 3, 31–34.
[11] C. RAJSKI: A metric space of discrete probability distributions. Inform. Control 1961, 4, 371–77.
[12] C. C. GOTLIEB and S. KUMAR: Semantic clustering of index terms. J. Ass. comput. Mach. 1968, 15, 493–513.
[13] D. SOERGEL: Mathematical analysis of documentation systems. Inform. Stor. Retr. 1967, 3, 129–73.
[14] R. A. FAIRTHORNE: Delegation of classification. Am. Docum. 1958, 9, 159–64 (also as Chap. 12 of his Towards Information Retrieval, Butterworths, London (1961)).
[15] D. J. HILLMAN: Mathematical classification techniques for non-static document collections, with particular reference to the problem of relevance. In: Classification Research, Proceedings of the Second International Study Conference, Elsinore, 1964, pp. 177–209, Munksgaard, Copenhagen (1965) (see also the Discussion by FAIRTHORNE et al., pp. 210–19).
[16] W. GOFFMAN: A searching procedure for information retrieval. Inform. Stor. Retr. 1964, 2, 73–8.
[17] W. GOFFMAN and V. A. NEWILL: A methodology for test and evaluation of information-retrieval systems. Inform. Stor. Retr. 1966, 3, 19–25.
[18] T. M. APOSTOL: Calculus, Vol. 2, Section 3.12, Blaisdell, New York (1962).
[19] H. W. ALEXANDER: Elements of Mathematical Statistics, Section 5.36, Wiley, New York (1961).