RETRIEVAL PERFORMANCE AND INFORMATION THEORY

MAURO GUAZZO
EEC Computer Centre, Plateau de Kirchberg, Luxembourg
Abstract—This paper challenges the meaningfulness of precision and recall values as a measure of performance of a retrieval system. Instead, it advocates the use of a normalised form of Shannon's functions (entropy and mutual information). Shannon's four axioms are replaced by an equivalent set of five axioms which are more readily shown to be pertinent to document retrieval. The applicability of these axioms and the conceptual and operational advantages of Shannon's functions are the central points of the work. The applicability of the results to any automatic classification is also outlined.
1. PROBLEM DEFINITION
1.1. The overall performance of a document retrieval system is taken to be the agreement between a user's and a machine's judgement about the relevance of a given document to a given query. Such a performance depends at least on (a) indexing methods, (b) query formulation, (c) retrieval policy. A measure of performance that compares the performance of retrieval sessions using (a) different indexing methods and/or (b) different query formulations and/or (c) different retrieval policies and/or (d) different document collections and/or (e) different user queries would be extremely useful to the designers of documentary systems, because the choices involved in the design of (a), (b) and (c) are for the most part subjective.

The outcome of a retrieval session (relative to a given user query) is described by a set of four probabilities:

$P_{11}$: the probability that a document is rated relevant by both user and system,
$P_{12}$: the probability that a document is rated relevant by the user but not by the system,
$P_{21}$: the probability that a document is rated relevant by the system but not by the user,
$P_{22}$: the probability that a document is rated relevant by neither.

In table form the outcome may be described as

                        retrieved
                      yes       no
    relevant   yes   P11       P12
               no    P21       P22                    (1)
Whether each table entry represents a probability or a count of occurrences divided by the collection size is rather irrelevant to the argument that follows. Each entry will represent a count of occurrences whenever both user and machine process the whole collection. It will represent (estimated) probabilities if only a sample is processed or if we choose to regard the collection as a large sample drawn from an infinite document population. In most practical cases at least $P_{12}$ (relevant non-retrieved documents) must be estimated with some sampling technique.

1.2. This discussion will purposely neglect all qualities of a retrieval system (e.g. coverage and quality of the collection, response time, cost justification) other than retrieval performance.
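As a sketch of how table (1) might be estimated from parallel relevance judgements (the function name and list-based interface are illustrative assumptions, not from the paper):

```python
# Build the 2x2 outcome table (1) from parallel boolean judgements.
# user_relevant[d] and system_retrieved[d] refer to the same document d.
def outcome_table(user_relevant, system_retrieved):
    n = len(user_relevant)
    counts = [[0, 0], [0, 0]]  # rows: user yes/no; columns: system yes/no
    for u, s in zip(user_relevant, system_retrieved):
        counts[0 if u else 1][0 if s else 1] += 1
    # first row holds P11, P12; second row holds P21, P22
    return [[c / n for c in row] for row in counts]
```

Dividing by the collection size turns the counts into the (estimated) probabilities discussed above.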
In particular, it will ignore to what use the information about relevance is put, or what final decisions are based on it. It is well known that the user's requirements as to the four probabilities $P_{ij}$ depend extremely on his application: some applications will be very adversely affected by the existence of relevant non-retrieved documents (e.g. a search for pre-existing patents) while others will be sensitive to retrieved non-relevant documents (e.g. when retrieved documents have to be translated before their relevance can be assessed). If a user's profit function (or value function) is known for each of the four possible events:

                        retrieved
                      yes       no
    relevant   yes   V11       V12
               no    V21       V22                    (2)

then it is natural to choose the expectation

$$E(V) = \sum_{i}\sum_{j} V_{ij} P_{ij}$$                    (3)

of $V_{ij}$ as the performance measure
and to regard all system adjustments that produce an increase in $E(V)$ as improvements. This value-function approach contradicts the assumption that the performance measure should only measure the agreement between user's and machine's judgements, and also has the following disadvantages: (1) most users are unable to quantify their value function with any degree of accuracy; (2) the function $E(V)$ does not provide any insight into the functioning of the retrieval system, and system optimisations that are based on it will vary from user to user and possibly from query to query.

1.3. A myriad of performance measures, based on the probabilities $P_{ij}$, has been proposed [Refs. 1-12]. Their respective merits and their significance, however, are always debatable, to the point that it is unclear whether the results of several retrieval sessions should be averaged by averaging probabilities or performance measures! [see Ref. 4, Part II]. This is the case for the most widely accepted measures:

$$\text{precision} = \frac{P_{11}}{P_{11}+P_{21}}, \qquad \text{recall} = \frac{P_{11}}{P_{11}+P_{12}}$$                    (4)

which are normally quoted as a pair and plotted one vs the other, so that the desirability of a trade-off between them is left to the user.

1.4. The desirable properties of a performance measure are now briefly discussed. The best possible situation, as regards the probabilities $P_{ij}$, is easily seen to be

$$P_{12} = P_{21} = 0, \qquad P_{11} + P_{22} = 1$$                    (5)

and this is reflected in precision = recall = 1, and perfect retrieval. The worst possible situation is more debatable. I maintain that it occurs when the machine judgement is irrelevant to the user judgement:

$$\frac{P_{11}}{P_{12}} = \frac{P_{21}}{P_{22}}$$                    (6)
that is, when the retrieved set is just as rich in relevant documents as the whole collection. A seemingly worse situation, in which the retrieved set is poorer in relevant documents than the whole collection, leads to the obvious remedy of inverting the machine judgement and thus obtaining a useful system. An acceptable performance measure will have to provide some measure of the gradual change from situation (5) to situation (6). Joint precision/recall values fail to do this, not least because they do not depend on $P_{22}$. The following two examples both correspond to precision = recall = 0.9:

            retrieved                        retrieved
          yes      no                      yes      no
    yes   0.09    0.01               yes   0.81    0.09
    no    0.01    0.89               no    0.09    0.01

    (excellent performance            (useless performance
     on a rather stringent query)      on a very loose query)

My first requirement for an acceptable performance measure is then that it takes up its lowest value when (6) holds and its upper value when (5) holds. Joint precision/recall values seem to be intuitively satisfactory only when the relevant documents form a small fraction of the collection.
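The two examples can be checked directly. Taking the stringent-query table (0.09, 0.01, 0.01, 0.89) and the loose-query table (0.81, 0.09, 0.09, 0.01), both yield precision = recall = 0.9, while only the loose one satisfies the independence condition (6) exactly (a sketch; the table values are the ones quoted here):

```python
tight = [[0.09, 0.01], [0.01, 0.89]]   # stringent query
loose = [[0.81, 0.09], [0.09, 0.01]]   # very loose query

# Precision and recall as in (4), with P = [[P11, P12], [P21, P22]].
def precision(P): return P[0][0] / (P[0][0] + P[1][0])
def recall(P):    return P[0][0] / (P[0][0] + P[0][1])

for P in (tight, loose):
    assert abs(precision(P) - 0.9) < 1e-9 and abs(recall(P) - 0.9) < 1e-9

# Independence: P11 = (P11+P12)(P11+P21) holds for the loose query only.
assert abs(loose[0][0] - 0.9 * 0.9) < 1e-9
assert abs(tight[0][0] - 0.1 * 0.1) > 0.05
```

The assertions pass: the precision/recall pair cannot tell a useful system from a useless one here.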
1.5. This paper proposes a function that is preferable to precision/recall measures in the following respects:
(A) It takes up the value 0 when (6) holds, the value 1 when (5) holds, and intermediate values in all other cases. It is also intuitively acceptable over the whole range of $P_{11}$, $P_{12}$, $P_{21}$, $P_{22}$.
(B) It yields a single value (as opposed to a pair) so that comparison of two retrieval sessions is unequivocal and arbitrary system parameters (thresholds, weights etc.) can be adjusted empirically.
(C) It does not represent one more performance measure to be added to the existing ones but is derived uniquely from five intuitively acceptable axioms. These axioms do not postulate the mathematical form of the function but only its properties (continuity, symmetry, additivity, etc.).
(D) It produces numerical values that can be meaningfully compared even in the case of retrieval sessions that differ in respect of points (a), (b), (c), (d), (e) of Section 1.1.
(E) It can be generalised upwards to meet the case of more than two relevance classes (see Section 3) and downwards to apply to each entry of the probability table (1) (i.e. each of the events measured by $P_{11}$, $P_{12}$, $P_{21}$, $P_{22}$) as well as to the whole table (see Section 3). This latter property enables us to break down the overall performance (performance measure applied to the whole table) into constituent parts (performance measure applied to each entry $P_{ij}$) and analyse each. This analysis is roughly equivalent to discussing whether performance can be improved by increasing recall or by increasing precision.
2. AXIOMATIC BASIS FOR THE PERFORMANCE MEASURE

This section reviews Shannon's Theory of Information with a view to proving its applicability to performance measurements.
Shannon's basic function is derived from four axioms that are intuitively acceptable in the field of information measurement in which the theory was conceived. This paper discusses the axiomatic acceptability of five properties which are, as a whole, equivalent to Shannon's axioms but are more congenial to the field of documentary systems. The aim here is to show why we would discard a performance measure that does not possess one of the five properties. This may be done with limitative examples which do not show the full implications of the axioms, because the uniqueness of the function can be proved.

Shannon's theory applies to two sets of events such that one and only one event of each set must occur:

first set: events $x_k$, $k = 1, \dots, n$, with $P(x_k) \ge 0$ and $\sum_k P(x_k) = 1$;                    (7)

second set: events $y_i$, $i = 1, \dots, m$, with $P(y_i) \ge 0$ and $\sum_i P(y_i) = 1$.

The co-occurrence probabilities of the two sets can be described by a bidimensional table of joint probabilities:

$$\begin{pmatrix} P(x_1 y_1) & \cdots & P(x_1 y_m) \\ \vdots & & \vdots \\ P(x_n y_1) & \cdots & P(x_n y_m) \end{pmatrix}$$                    (8)
In the case of a retrieval system, the events may be:
$y_1$: the document is retrieved
$y_2$: the document is not retrieved
$x_1$: the document is rated relevant
$x_2$: the document is rated irrelevant
and the probability table is the same as defined in (1). Shannon's theory is concerned with measuring the quantity of information that each event $y_i$ provides on each event $x_k$. Since we are attempting to borrow Shannon's functions for use in another field, the reader who finds the phrase "quantity of information" misleading should simply replace it with "the function".

Property 1. (Arguments and continuity). The quantity of information $I(x_k, y_i)$ is a continuous function of the probabilities $P(x_k)$ and $P(x_k|y_i)$:

$$I(x_k, y_i) = F(P(x_k),\, P(x_k|y_i)).$$                    (9)

Applied to our example, this means that we are concerned with the probability of event $x_k$ (for example $x_1$: the document is relevant) evaluated before and after event $y_i$ (for example $y_1$: the document is retrieved) becomes known. In other words we are attempting to measure the evidence that event $y_i$ provides on event $x_k$.

Property 2. (Symmetry).

$$I(x_k, y_i) = I(y_i, x_k).$$                    (10)

In words, we require that x- and y-sets be treated symmetrically, in accordance with the fact that we intend to obtain a measure of agreement.

Property 3. (Zero-calibration). $I(x_k, y_i) = 0$ if and only if

$$P(x_k) = P(x_k|y_i)$$                    (11)
that is, if events $x_k$ and $y_i$ are statistically independent. The condition for independence can also be expressed as

$$P(x_k y_i) = P(x_k)\,P(y_i)$$

to show the symmetry between x- and y-events.

Property 4. (Conditioning event). Given any event $z$, the information provided by $y_i$ on $x_k$ given that $z$ has occurred is the same function $F(\cdot)$ of arguments $P(x_k|z)$ and $P(x_k|z y_i)$:*

$$F(P(x_k|z),\, P(x_k|z y_i)).$$                    (12)
This means that the quantity of information $I(x_k, y_i)$ can be measured on conditioned probabilities, that is, when event $z$ is known to have occurred. Property 4 is certainly necessary, because event $z$ may simply represent the operating conditions under which all probabilities are defined. As a less general example, imagine that $z$ is the event: "the document belongs to a given subset of the collection". We would certainly discard a performance measure that applies differently to a subset of the collection and to the whole collection.

Property 5. (Addition rule). The quantity of information provided by the joint event $y_i z$ on event $x_k$ is

$$I(x_k, y_i z) = I(x_k, y_i) + I(x_k, z|y_i),$$
that is, the quantity of information provided by $y_i$ on $x_k$ plus the quantity of information provided by $z$ on $x_k$ once $y_i$ is known to have occurred. In other words, we define how quantities of information (and performance measures) can be meaningfully added. The following example of cascaded systems is intended to show an application to retrieval systems. Let a retrieval system $S_1$ be applied to a collection $C$ and a given query $Q$, producing a subset $C_1$ of retrieved documents. Let a second retrieval system $S_2$ be applied only to $C_1$ and produce a subset $C_2$ of those documents that $S_2$ retrieves. We demand that the quantity of information on the relevance of a document be additive in the sense that it can be computed either separately for the two phases of the retrieval and then added, or for the overall retrieval (from $C$ to $C_2$), providing the same result. In other words, we want to be able to measure the additional information on relevance added by the second retrieval to that provided by the first. Apart from being attractive, this property is necessary if we want to obtain the same result irrespective of whether we regard the retrieval process as consisting of one or two phases.

It can be proved that a function that satisfies properties 1-5 is unique,† except for a scale factor, and is expressed as

$$I(x_k, y_i) = \log_a \frac{P(x_k|y_i)}{P(x_k)}$$                    (13)

where the base $a$ of the logarithm determines the scale factor. In most applications $a$ is chosen to be 2 and the unit of measure of $I$ is called a "bit".

*As a short notation we define $I(x_k, y_i|z) = F(P(x_k|z),\, P(x_k|z y_i))$.
†Shannon's proof is based on four axioms [2]. Stated as above, axioms 1, 2, 3 are equivalent to properties 1, 4, 5. Shannon's axiom 4 can be stated: given four events $x, y, a, b$ such that the joint event $ab$ is statistically independent of the joint event $xy$, then $I(xa, yb) = I(x, y) + I(a, b)$. This relationship can be proved on the basis of properties 2, 3, 5:

    I(xa, yb) = I(xa, y) + I(xa, b|y)                              (by property 5)
              = I(y, xa) + I(b, xa|y)                              (by property 2)
              = I(y, x) + I(y, a|x) + I(b, x|y) + I(b, a|xy)       (by property 5)
              = I(y, x) + I(b, a)             (by the postulated independence and property 3)
              = I(x, y) + I(a, b)                                  (by property 2)
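Formula (13) in base 2 is a one-liner over a joint probability table (a sketch; the nested-list representation of the table is an assumption of this example):

```python
import math

# I(x_k, y_i) = log2( P(x_k | y_i) / P(x_k) ), from a joint table P[k][i].
def pointwise_information(P, k, i):
    p_x = sum(P[k])                      # marginal P(x_k)
    p_y = sum(row[i] for row in P)       # marginal P(y_i)
    return math.log2((P[k][i] / p_y) / p_x)
```

For the stringent-query table of Section 1.4, `pointwise_information(P, 0, 0)` gives $\log_2 9 \approx 3.17$ bits: retrieving the document multiplies the odds of relevance ninefold.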
The function (13) depends on the pair $(k, i)$: for instance, all examples above have been dealing with the pair (1, 1):
$x_1$: the document is relevant
$y_1$: the document is retrieved.
The averaging of $I(x_k, y_i)$ over all pairs $(k, i)$ is now possible by taking the expectation:

$$I(X, Y) = \sum_k \sum_i P(x_k y_i)\, I(x_k, y_i).$$                    (14)

This is indicated as $I(X, Y)$ and represents the average quantity of information that an event of the y-set provides on an event of the x-set. In other words, it represents the performance measure of a retrieval session as a function of the whole probability table. It is a simple matter to show that properties 2-5 defined for $I(x_k, y_i)$ also apply to $I(X, Y)$. In the case of array (1) the function $I(X, Y)$ reduces to:
$$I(X, Y) = P_{11}\log\frac{P_{11}}{(P_{11}+P_{12})(P_{11}+P_{21})} + P_{12}\log\frac{P_{12}}{(P_{11}+P_{12})(P_{12}+P_{22})} + P_{21}\log\frac{P_{21}}{(P_{21}+P_{22})(P_{11}+P_{21})} + P_{22}\log\frac{P_{22}}{(P_{21}+P_{22})(P_{12}+P_{22})}.$$                    (15)

It is useful to consider the particular case of (13):

$$I(x_k) = I(x_k, x_k) = \log\frac{1}{P(x_k)}$$                    (16)

(quantity of information supplied by event $x_k$) and the definition of the entropy $H(X)$ of the x-set:

$$H(X) = \sum_k P(x_k)\log\frac{1}{P(x_k)}.$$                    (17)
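Formulas (14) and (17) admit a direct implementation for any joint table (a sketch in base 2; the $0 \cdot \log 0 = 0$ convention is made explicit in the code):

```python
import math

def mutual_information(P):
    """I(X, Y) of (14), in bits, for a joint probability table P[k][i]."""
    px = [sum(row) for row in P]
    py = [sum(row[i] for row in P) for i in range(len(P[0]))]
    total = 0.0
    for k, row in enumerate(P):
        for i, p in enumerate(row):
            if p > 0:                      # convention: 0 * log 0 = 0
                total += p * math.log2(p / (px[k] * py[i]))
    return total

def entropy(P):
    """H(X) of (17), in bits, from the row marginals of P."""
    return sum(-p * math.log2(p) for p in (sum(row) for row in P) if p > 0)
```

A statistically independent table gives `mutual_information` equal to zero, and no table exceeds its own `entropy`, in line with the inequality proved below.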
It can be proved [3] that $I(X, Y)$, unlike $I(x_k, y_i)$, is non-negative. For a given query on a given collection (i.e. given x-events) the best performance is the one that produces the maximum of $I(X, Y)$. It is easy to prove [3] that

$$I(X, Y) \le H(X)$$

so that the value $H(X)$, which depends on the query and not on the retrieval system, is the maximum that $I(X, Y)$ can take up. The function $I(X, Y)$ can be used to compare retrieval sessions using
(a) different indexing methods
(b) different query formulations
(c) different retrieval policies
(d) the same document collection
(e) the same query.
The limitation as to points (d) and (e) can be justified in an intuitive way with reference to
(1). As long as query and collection are constant, $(P_{11} + P_{12})$ and $(P_{21} + P_{22})$ are separately constant. The usefulness of the retrieval system lies in the splitting of $(P_{11} + P_{12})$ in favour of $P_{11}$ and of $(P_{21} + P_{22})$ in favour of $P_{22}$. The best that a system can do is to obtain $P_{12} = P_{21} = 0$ and $I(X, Y) = H(X)$. Therefore two searches having different marginal probabilities,
probability that a document is relevant = $P_{11} + P_{12}$
probability that a document is irrelevant = $P_{21} + P_{22}$
cannot be meaningfully compared by comparing $I(X, Y)$. For instance, consider the two retrieval cases below:
First case:

            y1       y2
    x1     0.09     0.01      P(x1) = 0.1
    x2     0.01     0.89      P(x2) = 0.9

    H(X) = 0.468 bits,   I(X, Y) = 0.34 bits.

Second case:

    P(x1) = 0.01,   P(x2) = 0.99,   H(X) = 0.080 bits.
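The quoted figures can be reproduced numerically (a sketch; the helper functions restate (14) and (17), and the results agree with the values in the text to rounding):

```python
import math

def mutual_information(P):
    px = [sum(row) for row in P]
    py = [sum(row[i] for row in P) for i in range(len(P[0]))]
    return sum(p * math.log2(p / (px[k] * py[i]))
               for k, row in enumerate(P) for i, p in enumerate(row) if p > 0)

def entropy_x(P):
    return sum(-p * math.log2(p) for p in (sum(row) for row in P) if p > 0)

first = [[0.09, 0.01], [0.01, 0.89]]
print(round(entropy_x(first), 3), round(mutual_information(first), 3))  # → 0.469 0.343

# For the second case only the marginals matter for the ceiling H(X).
second_px = [0.01, 0.99]
print(round(sum(-p * math.log2(p) for p in second_px), 3))  # → 0.081
```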
In the second case no system can possibly provide as much information on relevance as in the first, because events $x_1$ and $x_2$ only "contain" a quantity of information $H(X)$ (see (17)). In other words, in the first case the function $I(X, Y)$ ranges from zero to $H(X) = 0.468$ depending on the goodness of the system, and in the second case it ranges from zero to $H(X) = 0.080$. In order to overcome this limitation and to remove restrictions (d) and (e) it is sufficient to normalise the performance measure:

$$I^*(X, Y) = \frac{I(X, Y)}{H(X)}$$                    (18)

(information provided by the system judgement on the user judgement divided by the information contained in the user judgement).

3. GENERALISATIONS AND REMARKS

3.1. The function $I^*(X, Y)$ is defined by (18) for a probability table of dimensions $(n, m)$ and therefore applies to cases when the user classifies into $n$ relevance classes and the machine into $m$ classes, e.g.
user judgement: (1) very useful, (2) useful, (3) worth reading, (4) useless
machine judgement: (A) relevant, (B) marginally relevant, (C) irrelevant
[see for instance Ref. 7, p. 177].

3.2. In all previous cases a reference (human) classification is compared to a make-good (machine) classification. This explains the lack of symmetry of definition (18) with respect to x and y. If a measure of agreement is applied to two equally reliable classifications (for instance those of two experts) we demand that the function respect the symmetry between x- and y-events. Therefore a symmetric mean of the two entropies $H(X)$ and $H(Y)$ should replace $H(X)$ in definition (18). For instance

$$I'(X, Y) = \frac{I(X, Y)}{\sqrt{H(X)\, H(Y)}}.$$                    (19)
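Both normalisations, (18) and (19), are thin wrappers around the functions already shown (a sketch; the helpers are repeated so the block is self-contained):

```python
import math

def _mi(P):
    px = [sum(row) for row in P]
    py = [sum(row[i] for row in P) for i in range(len(P[0]))]
    return sum(p * math.log2(p / (px[k] * py[i]))
               for k, row in enumerate(P) for i, p in enumerate(row) if p > 0)

def _h(marginals):
    return sum(-p * math.log2(p) for p in marginals if p > 0)

def i_star(P):                    # (18): asymmetric, reference classes on rows
    return _mi(P) / _h([sum(row) for row in P])

def i_prime(P):                   # (19): symmetric, geometric mean of entropies
    hx = _h([sum(row) for row in P])
    hy = _h([sum(row[i] for row in P) for i in range(len(P[0]))])
    return _mi(P) / math.sqrt(hx * hy)
```

A perfect table such as `[[0.1, 0.0], [0.0, 0.9]]` yields `i_star` and `i_prime` equal to 1, the upper value required in Section 1.4.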
A function of this kind allows one to measure the consistency of the classification of experts by compiling a table of co-occurrences:

                           expert No. 2
                         relev.    irrel.
    expert No. 1 relev.
                 irrel.

The value $I'(X, Y)$ computed from such a table can be regarded as an upper limit to the best possible performance of an automatic system, in the sense that an automatic system cannot produce a relevance judgement in accordance with that of each user insomuch as there is disagreement between the users themselves. If the experts are asked to assess $N$ degrees of relevance, then $I'(X, Y)$ can be expected to increase with $N$ up to a saturation point where an expert becomes unable to differentiate between adjacent relevance classes. The saturation value of $I'(X, Y)$ then provides a rough measure of the resolution of the judgement. Such measures of user consistency and resolution prevent a system designer from developing sophisticated and expensive systems in an attempt to attain performance levels that are conceptually impossible. It can be noted in passing that resolution considerations discourage the use of all performance measures based on ranking (that is, on the agreement between the relevance rankings produced by the user and by the machine respectively) if we regard ranking as the classification into a very large number of relevance classes.

3.3. A byproduct of property 5 is that the contributions to performance of different design factors are additive. This can be useful: (A) to measure the performance of each phase of a multiphase retrieval system (e.g. a first filter based on keyword matching and a second filter based on semantic analysis); (B) to quantify the contribution of each tag or descriptor to the success of the retrieval. This differs from (A) in that such a contribution can be assessed independently of the retrieval policies, by computing the function $I(X, Y)$ with x-events defined on user relevance and y-events defined on the contents of the tags.

3.4.
Although precision/recall methods measure the retrieval performance with a pair of values and the function $I^*(X, Y)$ with a single value, the latter cannot be obtained as a function of a precision/recall pair. In fact, even when both precision and recall are held constant, the function $I^*(X, Y)$ still has one degree of freedom. Fig. 1, for instance, shows a plot of $I^*(X, Y)$ for precision = recall = 0.9.
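The residual degree of freedom is easy to exhibit: the two tables below (chosen for this illustration, not taken from the paper) share precision = recall = 0.9 yet give different values of $I^*(X, Y)$:

```python
import math

def i_star(P):
    """I*(X, Y) = I(X, Y) / H(X), in line with (18)."""
    px = [sum(row) for row in P]
    py = [sum(row[i] for row in P) for i in range(len(P[0]))]
    mi = sum(p * math.log2(p / (px[k] * py[i]))
             for k, row in enumerate(P) for i, p in enumerate(row) if p > 0)
    hx = sum(-p * math.log2(p) for p in px if p > 0)
    return mi / hx

# Both tables have precision = recall = 0.9 ...
A = [[0.09, 0.01], [0.01, 0.89]]
B = [[0.45, 0.05], [0.05, 0.45]]
# ... yet their normalised performances differ.
assert abs(i_star(A) - i_star(B)) > 0.01
```

So a precision/recall pair underdetermines the single-valued measure.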
4. APPLICATION TO AUTOMATIC CLASSIFICATION
4.1. The retrieval process discussed in Sections 1-3 can be regarded as a special case of automatic classification into two classes (retrieved and non-retrieved set). This section discusses the applicability of $I^*(X, Y)$ to measure the performance of any automatic classification system, thus unifying the two approaches. The case for using $I^*(X, Y)$ may follow the same lines as Sections 1-3, with the substitutions
(a) indexing methods → features describing an item
(b) query formulations → class definitions
(c) retrieval policies → classification rules
(d) document collections → item collections
(e) user queries → classification problems.
The probability table (or frequency count) is now an $(n, m)$ array with each probability $P_{ki} = P(x_k y_i)$ being the probability that an item is classified into class $x_k$ by the referee and into class $y_i$ by the machine.
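Nothing in the computation is specific to the 2×2 case; a 4×3 table like the user/machine example of Section 3.1 is handled identically (the numbers below are illustrative, not from the paper):

```python
import math

def normalised_information(P):
    """I*(X, Y) = I(X, Y) / H(X) for an (n, m) joint probability table."""
    px = [sum(row) for row in P]
    py = [sum(row[i] for row in P) for i in range(len(P[0]))]
    mi = sum(p * math.log2(p / (px[k] * py[i]))
             for k, row in enumerate(P) for i, p in enumerate(row) if p > 0)
    hx = sum(-p * math.log2(p) for p in px if p > 0)
    return mi / hx

# 4 user relevance classes (rows) x 3 machine classes (columns)
table = [[0.10, 0.04, 0.01],
         [0.05, 0.10, 0.05],
         [0.02, 0.08, 0.15],
         [0.01, 0.04, 0.35]]
assert abs(sum(map(sum, table)) - 1.0) < 1e-9
score = normalised_information(table)
assert 0.0 <= score <= 1.0
```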
[Fig. 1. $I^*(X, Y)$ for precision = recall = 0.9, plotted against $P_{11}$ on a logarithmic scale.]
The function $I^*(X, Y)$ can thus be applied in a straightforward fashion to measure the agreement between two classifications of items into classes, for instance to compare a human and a machine classification of
—readers into interest groups,
—documents into subject classes,
—enterprises into similarity clusters,
—personnel into profile classes
and so on, provided always that the validity of the five postulated properties is accepted and, above all, that one wishes to measure the goodness of the classification irrespective of the final decisions that might be based on it. Consider, for the sake of concreteness, the problem of the automatic classification of patents into the subject classes
Class C1: Production Techniques and Methods
Class C2: Manufacturing Machines
Class C3: Manufactured Products
Class C4: Instrumentation; Testing and Measuring Equipment
and consider the realisation of the co-occurrence table shown in Fig. 2.
[Fig. 2. Co-occurrence table: reference classification vs machine classification of patents into classes C1-C4.]
My claim is that the function $I(X, Y)$, which is a well-known measure of statistical dependence between discrete random variables, is also suitable as a measure of merit to be associated with a table like this.

4.2. The generality of the approach of Section 4.1 can be extended to the case of two unrelated classifications of the same items, no matter whether they are automatic or manual, whether one is regarded as the reference for the other, or whether their class definitions have anything to do with each other. Let the patents of our example be classified by a different system into the five classes
D1: Research
D2: Technology
D3: Production
D4: Engineering
D5: Others.
Then one can fill in a co-occurrence table to measure some agreement between this classification
and one of the classifications into classes (C1, C2, C3, C4). One property of $I(X, Y)$ that is relevant to such a case is the following:

Property A1. A permutation of the classes $x_k$ (or classes $y_i$) and of the corresponding rows (or columns) leaves the value $I(X, Y)$ unchanged.

This means that the function $I(X, Y)$ detects and measures the agreement between the two classifications on the basis of a probability table only, without any need to indicate which x-class corresponds to which y-class. Note that in our example a re-shuffling of the D-classes or of the C-classes is permitted but irrelevant to the problem, so that property A1 is certainly a requirement for any merit function one may wish to use. Also, the task of mapping one set of classes onto the other set is far from straightforward.

Property A1 also helps define best- and worst-cases in a more general form than (5) and (6). This is expressed by properties A2 and A3.

Property A2. $I(X, Y) = 0$ if and only if the events
$x_k$: one classification system assigns an item to class $x_k$
$y_i$: a second classification system assigns the same item to class $y_i$
are statistically independent for any pair $(k, i)$ or, equivalently, the rank of the probability matrix is one.

Property A3. The function $I(X, Y)$ takes up its maximum value $H(X)$ if and only if the probability array has only one non-zero element per column. For example:

    P1    0     0     0     P5
    0     0     0     P4    0
    0     P2    0     0     0
    0     0     P3    0     0
In this situation the x-class to which an item belongs can be unequivocally determined on the basis of the y-class to which it has been attributed (but not vice-versa). If the x-classes represent a reference classification and the y-classes a make-good one, the latter can replace the former without introducing any error.

4.3. The discussion on the desirability of properties 1 to 5 runs parallel to the one in Section 2, so that there is no need to rephrase it here.
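Properties A1 and A3 can be verified numerically on small tables (a sketch; the helpers restate (14) and (17) so the block is self-contained):

```python
import math

def mutual_information(P):
    px = [sum(row) for row in P]
    py = [sum(row[i] for row in P) for i in range(len(P[0]))]
    return sum(p * math.log2(p / (px[k] * py[i]))
               for k, row in enumerate(P) for i, p in enumerate(row) if p > 0)

def entropy_x(P):
    return sum(-p * math.log2(p) for p in (sum(row) for row in P) if p > 0)

# Property A1: permuting rows (classes x_k) leaves I(X, Y) unchanged.
P = [[0.2, 0.1, 0.0], [0.0, 0.3, 0.4]]
shuffled = [P[1], P[0]]
assert abs(mutual_information(P) - mutual_information(shuffled)) < 1e-12

# Property A3: one non-zero entry per column forces I(X, Y) = H(X).
Q = [[0.5, 0.2, 0.0], [0.0, 0.0, 0.3]]
assert abs(mutual_information(Q) - entropy_x(Q)) < 1e-12
```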
Normalisation can be achieved by applying (18) or (19), without any need to take the number of classes into account. The entropy $H(X)$ itself measures at the same time the number of classes and the size of each marginal probability $P(x_k)$. This can be illustrated by the following:

Property A4. If a new class $x_k$ is added to the x-classes and if $P(x_k) = 0$, then both $I(X, Y)$ and $H(X)$ remain unchanged.

For example, redundant classes can be defined in which there are no items, without jeopardising the validity of the performance measure.
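Property A4 is likewise easy to confirm: appending an empty class changes neither quantity (a sketch, relying on the $0 \cdot \log 0 = 0$ convention):

```python
import math

def mutual_information(P):
    px = [sum(row) for row in P]
    py = [sum(row[i] for row in P) for i in range(len(P[0]))]
    return sum(p * math.log2(p / (px[k] * py[i]))
               for k, row in enumerate(P) for i, p in enumerate(row) if p > 0)

def entropy_x(P):
    return sum(-p * math.log2(p) for p in (sum(row) for row in P) if p > 0)

P = [[0.09, 0.01], [0.01, 0.89]]
P_extra = P + [[0.0, 0.0]]        # a new x-class with P(x_k) = 0
assert abs(mutual_information(P) - mutual_information(P_extra)) < 1e-12
assert abs(entropy_x(P) - entropy_x(P_extra)) < 1e-12
```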
It is useful to stress that properties A1 to A4, attractive as they may be, are of the nature of boundary conditions and cannot replace properties 1 to 5 in defining the function $I(X, Y)$ uniquely.

REFERENCES
[1] C. E. Shannon, The mathematical theory of communication. Bell Syst. Tech. J. (1948).
[2] C. E. Shannon and W. Weaver, The Mathematical Theory of Communication. University of Illinois Press, Urbana (1949).
[3] A. I. Khinchin, Mathematical Foundations of Information Theory. Dover, New York (1957).
[4] G. Salton (Ed.), The SMART Retrieval System. Prentice-Hall, Englewood Cliffs, NJ (1971).
[5] G. Salton, Automatic Information Organization and Retrieval. McGraw-Hill, New York (1968).
[6] J. A. Swets, Effectiveness of information retrieval methods. Am. Docum. 1969, 20(1).
[7] F. W. Lancaster and E. G. Fayen, Information Retrieval On-Line. Melville, Los Angeles (1973).
[8] C. W. Cleverdon, Evaluation of Operational Retrieval Systems. College of Aeronautics, Cranfield, England (1964).
[9] G. Salton, Evaluation problems in interactive information retrieval. Inform. Stor. Retr. 1970, 6.
[10] P. Calingaert, System performance evaluation: survey and appraisal. Commun. Assoc. Computing Machinery 1967, 10.
[12] J. M. Smith, A review and comparison of certain methods of computer performance evaluation. Computer Bull. 1968, 12.