RETRIEVAL PERFORMANCE AND INFORMATION THEORY

MAURO GUAZZO
EEC Computer Centre, Plateau de Kirchberg, Luxembourg
Abstract: This paper challenges the meaningfulness of precision and recall values as a measure of performance of a retrieval system. Instead, it advocates the use of a normalised form of Shannon's functions (entropy and mutual information). Shannon's four axioms are replaced by an equivalent set of five axioms which are more readily shown to be pertinent to document retrieval. The applicability of these axioms and the conceptual and operational advantages of Shannon's functions are the central points of the work. The applicability of the results to any automatic classification is also outlined.

1. PROBLEM DEFINITION

1.1. The overall performance of a document retrieval system is taken to be the agreement between a user's and a machine's judgement about the relevance of a given document to a given query. Such a performance depends at least on (a) indexing methods, (b) query formulation, (c) retrieval policy. A measure of performance that compares the performance of retrieval sessions using (a) different indexing methods and/or (b) different query formulations and/or (c) different retrieval policies and/or (d) different document collections and/or (e) different user queries would be extremely useful to the designers of documentary systems, because the choices involved in the design of (a), (b) and (c) are for the most part subjective.

The outcome of a retrieval session (relative to a given user query) is described by a set of four probabilities:

P11: the probability that a document is rated relevant by both user and system,
P12: the probability that a document is rated relevant by the user but not by the system,
P21: the probability that a document is rated relevant by the system but not by the user,
P22: the probability that a document is rated relevant by neither.

In table form the outcome may be described as

                       retrieved
                      yes     no
relevant   yes        P11     P12
           no         P21     P22                    (1)
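As a concrete illustration (not part of the original paper), the sketch below assembles table (1) from raw occurrence counts; the function name and the counts are invented for the example.

```python
# Minimal sketch (assumed data): build the probability table (1) from counts.
# Rows: user judgement (relevant yes/no); columns: system judgement (retrieved yes/no).

def probability_table(n11, n12, n21, n22):
    """Return [[P11, P12], [P21, P22]] from occurrence counts."""
    total = n11 + n12 + n21 + n22
    return [[n11 / total, n12 / total],
            [n21 / total, n22 / total]]

# Hypothetical session: 90 documents relevant and retrieved, 10 relevant but
# missed, 10 retrieved but irrelevant, 890 correctly rejected.
P = probability_table(90, 10, 10, 890)
print(P)  # [[0.09, 0.01], [0.01, 0.89]]
```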

Whether each table entry represents a probability or a count of occurrences divided by the collection size is rather irrelevant to the argument that follows. Each entry will represent a count of occurrences whenever both user and machine process the whole collection. It will represent (estimated) probabilities if only a sample is processed, or if we like to regard the collection as a large sample drawn from an infinite document population. In most practical cases at least P12 (the number of relevant non-retrieved documents) must be estimated with some sampling technique.

1.2. This discussion will purposely neglect all qualities of a retrieval system (e.g. coverage and quality of the collection, response time, cost justification) other than retrieval performance.


In particular, it will ignore to what use the information about relevance is put, or what final decisions are based on it. It is well known that the user's requirements as to the four probabilities Pij depend strongly on his application: some applications will be very adversely affected by the existence of relevant non-retrieved documents (e.g. a search for pre-existing patents), while others will be sensitive to retrieved non-relevant documents (e.g. when retrieved documents have to be translated before their relevance can be assessed). If a user's profit function (or value function) is known for each of the four possible events,

                       retrieved
                      yes     no
relevant   yes        V11     V12
           no         V21     V22                    (2)

then it is natural to choose the expectation of Vij as the performance measure

$$E(V_{ij}) = \sum_{i,j} V_{ij} P_{ij}$$   (3)

and to regard all system adjustments that produce an increase in E(Vij) as improvements. This value-function approach contradicts the assumption that the performance measure should only measure the agreement between user's and machine's judgements, and also has the following disadvantages: (1) most users are unable to quantify their value function with any degree of accuracy; (2) the function E(Vij) does not provide any insight into the functioning of the retrieval system, and system optimisations that are based on it will vary from user to user and possibly from query to query.

1.3. A myriad of performance measures, based on the probabilities Pij, has been proposed [Refs. 1-12]. Their respective merits and their significance, however, are always debatable, to the point that it is unclear whether the results of several retrieval sessions should be averaged by averaging probabilities or performance measures! [see Ref. 4, Part II]. This is the case for the most widely accepted measures,

$$\text{precision} = \frac{P_{11}}{P_{11}+P_{21}}, \qquad \text{recall} = \frac{P_{11}}{P_{11}+P_{12}}$$   (4)

which are normally quoted as a pair and plotted one vs the other, so that the desirability of a trade-off between them is left to the user.

1.4. The desirable properties of a performance measure are now briefly discussed. The best possible situation, as regards the probabilities Pij, is easily seen to be

$$P_{12} = P_{21} = 0, \qquad P_{11} + P_{22} = 1$$   (5)

and this is reflected in precision = recall = 1, and perfect retrieval. The worst possible situation is more debatable. I maintain that it occurs when the machine judgement is statistically irrelevant to the user judgement:

$$\frac{P_{11}}{P_{12}} = \frac{P_{21}}{P_{22}}$$   (6)


that is, when the retrieved set is just as rich in relevant documents as the whole collection. A seemingly worse situation, in which the machine judgement is systematically opposed to the user judgement (P11 = P22 = 0), leads to the obvious remedy of inverting the machine judgement and thus obtaining a useful system. An acceptable performance measure will have to provide some measure of the gradual change from situation (5) to situation (6). Joint precision/recall values fail to do this, not least because they do not depend on P22. The following two examples both correspond to precision = recall = 0.9: the first describes excellent performance on a rather stringent query, the second useless performance on a very loose query. [The two probability tables are not legible in the source.]

My first requirement for an acceptable performance measure is then that it takes up its lowest value when (6) holds and its upper value when (5) holds. Joint precision/recall values seem to be intuitively satisfactory only when [condition not legible in the source].
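The paper's original numeric tables at this point are not recoverable, so the following sketch uses invented numbers of the kind described: two probability tables that both yield precision = recall = 0.9 yet describe very different situations.

```python
# Illustrative sketch (numbers invented; the paper's original tables are lost).
# Both tables have precision = recall = 0.9: one a stringent query on a
# mostly-irrelevant collection, the other a loose query where almost every
# document is relevant anyway, so the system adds nothing.

def precision(P):
    return P[0][0] / (P[0][0] + P[1][0])

def recall(P):
    return P[0][0] / (P[0][0] + P[0][1])

stringent = [[0.09, 0.01],    # [P11, P12]
             [0.01, 0.89]]    # [P21, P22]
loose     = [[0.81, 0.09],
             [0.09, 0.01]]

for P in (stringent, loose):
    print(round(precision(P), 3), round(recall(P), 3))  # 0.9 0.9 for both
```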

1.5. This paper proposes a function that is preferable to precision/recall measures in the following respects:
(A) It takes up the value 0 when (6) holds, the value 1 when (5) holds, and intermediate values in all other cases. It is also intuitively acceptable over the whole range of P11, P12, P21, P22.
(B) It yields a single value (as opposed to a pair), so that comparison of two retrieval sessions is unequivocal and arbitrary system parameters (thresholds, weights, etc.) can be adjusted empirically.
(C) It does not represent one more performance measure to be added to the existing ones, but is derived uniquely from five intuitively acceptable axioms. These axioms do not postulate the mathematical form of the function but only its properties (continuity, symmetry, additivity, etc.).
(D) It produces numerical values that can be meaningfully compared even in the case of retrieval sessions that differ in respect of points (a), (b), (c), (d), (e) of Section 1.1.
(E) It can be generalised upwards to meet the case of more than two relevance classes (see Section 3) and downwards to apply to each entry of the probability table (1) (i.e. each of the events measured by P11, P12, P21, P22) as well as to the whole table (see Section 3). This latter property enables us to break down the overall performance (performance measure applied to the whole table) into constituent parts (performance measure applied to each entry Pij) and analyse each. This analysis is roughly equivalent to discussing whether performance can be improved by increasing recall or by increasing precision.

2. AXIOMATIC BASIS FOR THE PERFORMANCE MEASURE

This section reviews Shannon's Theory of Information with a view to proving its applicability to performance measurements.


Shannon's basic function is derived from four axioms that are intuitively acceptable in the field of information measurement in which the theory was conceived. This paper discusses the axiomatic acceptability of five properties which are, as a whole, equivalent to Shannon's axioms but are more congenial to the field of documentary systems. The aim here is to show why we would discard a performance measure that does not possess one of the five properties. This may be done with limitative examples which do not show the full implications of the axioms, because the uniqueness of the function can be proved.

Shannon's theory applies to two sets of events such that one and only one event of each set must occur:

first set: events $x_k$, $k = 1, \ldots, n$, with $P(x_k) \ge 0$ and $\sum_k P(x_k) = 1$;
second set: events $y_i$, $i = 1, \ldots, m$, with $P(y_i) \ge 0$ and $\sum_i P(y_i) = 1$.   (7)

The co-occurrence of the two sets can be described by a bidimensional table of joint probabilities:

$$\begin{pmatrix} P(x_1 y_1) & \cdots & P(x_1 y_m) \\ \vdots & & \vdots \\ P(x_n y_1) & \cdots & P(x_n y_m) \end{pmatrix}$$   (8)

In the case of a retrieval system, the events may be:
y1: the document is retrieved
y2: the document is not retrieved
x1: the document is rated relevant
x2: the document is rated irrelevant
and the probability table is the same as defined in (1). Shannon's theory is concerned with measuring the quantity of information that each event $y_i$ provides on each event $x_k$. Since we are attempting to borrow Shannon's functions for use in another field, the reader who finds the phrase "quantity of information" misleading should simply replace it with "the function".

Property 1. (Arguments and Continuity). The quantity of information $I(x_k, y_i)$ is a continuous function of the probabilities $P(x_k)$ and $P(x_k|y_i)$:

$$I(x_k, y_i) = F(P(x_k), P(x_k|y_i))$$   (9)

Applied to our example, this means that we are concerned with the probability of event $x_k$ (for example x1: the document is relevant) evaluated before and after event $y_i$ (for example y1: the document is retrieved) becomes known. In other words, we are attempting to measure the evidence that event $y_i$ provides on event $x_k$.

Property 2. (Symmetry).

$$I(x_k, y_i) = I(y_i, x_k)$$   (10)

In words, we require that x- and y-sets be treated symmetrically, in accordance with the fact that we intend to obtain a measure of agreement.

Property 3. (Zero-calibration). $I(x_k, y_i) = 0$ if and only if

$$P(x_k) = P(x_k|y_i)$$   (11)

that is, if events $x_k$ and $y_i$ are statistically independent. The condition for independence can also be expressed as

$$P(x_k y_i) = P(x_k) P(y_i)$$

to show the symmetry between x- and y-events.

Property 4. (Conditioning event). Given any event z, the information provided by $y_i$ on $x_k$, given that z has occurred, is the same function $F(\cdot)$ of arguments $P(x_k|z)$ and $P(x_k|z y_i)$:*

$$I(x_k, y_i | z) = F(P(x_k|z), P(x_k|z y_i))$$   (12)

This means that the quantity of information $I(x_k, y_i)$ can be measured on conditioned probabilities, that is, when event z is known to have occurred. Property 4 is certainly necessary, because event z may simply represent the operating conditions under which all probabilities are defined. As a less general example, imagine that z is the event: "the document belongs to a given subset of the collection". We would certainly discard a performance measure that applies differently to a subset of the collection and to the whole collection.

Property 5. (Addition Rule). The quantity of information provided by the joint event $y_i z$ on event $x_k$ is

$$I(x_k, y_i z) = I(x_k, y_i) + I(x_k, z | y_i)$$

that is, the quantity of information provided by $y_i$ on $x_k$ plus the quantity of information provided by z on $x_k$ once $y_i$ is known to have occurred. In other words, we define how quantities of information (and performance measures) can be meaningfully added. The following example of cascaded systems is intended to show an application to retrieval systems. Let a retrieval system S1 be applied to a collection C and a given query Q, producing a subset C1 of retrieved documents. Let a second retrieval system S2 be applied only to C1 and produce a subset C2 of those documents that S2 retrieves. We demand that the quantity of information on the relevance of a document be additive in the sense that it can be computed either separately for the two phases of the retrieval and then added, or for the overall retrieval (from C to C2), providing the same result. In other words, we want to be able to measure the additional information on relevance added by the second retrieval to that provided by the first. Apart from being attractive, this property is necessary if we want to obtain the same result irrespective of whether we regard the retrieval process as consisting of one or two phases.

It can be proved that a function that satisfies properties 1-5 is unique‡, except for a scale factor, and is expressed as

$$I(x_k, y_i) = \log_a \frac{P(x_k|y_i)}{P(x_k)}$$   (13)

where the base a of the logarithm determines the scale factor. In most applications a is chosen to be 2 and the unit of measure of I is called a "bit".
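A minimal sketch of equation (13) with base a = 2 follows; the function name and the example numbers (taken from the hypothetical table used earlier) are illustrative, not from the paper.

```python
from math import log2

# Equation (13) with a = 2: the information (in bits) that an event y
# provides on an event x, computed from P(x) and P(x|y).
def pointwise_information(p_x, p_x_given_y):
    return log2(p_x_given_y / p_x)

# With P11 = 0.09, P21 = 0.01 as in the earlier example: P(x1) = 0.1 and
# P(x1|y1) = 0.9, so a retrieved document carries log2(9) bits about relevance.
print(pointwise_information(0.1, 0.9))  # 3.1699...
```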

*As a short notation we define $I(x_k, y_i | z) = F(P(x_k|z), P(x_k|z y_i))$.

‡Shannon's proof is based on four axioms [2], of which axioms 1, 2, 3 are equivalent to properties 1, 4, 5. Shannon's axiom 4 can be stated as follows: given four events x, y, a, b, and given that the joint event ab is statistically independent of the joint event xy, then $I(xa, yb) = I(x, y) + I(a, b)$. This relationship can be proved on the basis of properties 2, 3, 5:

$$\begin{aligned}
I(xa, yb) &= I(xa, y) + I(xa, b|y) && \text{(by property 5)} \\
&= I(y, xa) + I(b, xa|y) && \text{(by property 2)} \\
&= I(y, x) + I(y, a|x) + I(b, x|y) + I(b, a|yx) && \text{(by property 5)} \\
&= I(y, x) + I(b, a) && \text{(by the postulated independence and property 3)} \\
&= I(x, y) + I(a, b) && \text{(by property 2).}
\end{aligned}$$


The function (13) depends on the pair (k, i): for instance, all examples above have been dealing with the pair (1, 1):
x1: the document is relevant
y1: the document is retrieved.
The averaging of $I(x_k, y_i)$ over all pairs (k, i) is now possible by taking the expectation:

$$I(X, Y) = \sum_{k,i} P(x_k y_i) \, I(x_k, y_i)$$   (14)

This is indicated as I(X, Y) and represents the average quantity of information that an event of the y-set provides on an event of the x-set. In other words, it represents the performance measure of a retrieval session as a function of the whole probability table. It is a simple matter to show that properties 2-5 defined for $I(x_k, y_i)$ also apply to I(X, Y). In the case of array (1) the function I(X, Y) reduces to:

$$I(X,Y) = P_{11}\log\frac{P_{11}}{(P_{11}+P_{12})(P_{11}+P_{21})} + P_{12}\log\frac{P_{12}}{(P_{11}+P_{12})(P_{12}+P_{22})} + P_{21}\log\frac{P_{21}}{(P_{21}+P_{22})(P_{11}+P_{21})} + P_{22}\log\frac{P_{22}}{(P_{21}+P_{22})(P_{12}+P_{22})}$$   (15)

It is useful to consider the particular case of (13)

$$I(x_k) = I(x_k, x_k) = \log\frac{1}{P(x_k)}$$

(quantity of information supplied by event $x_k$) and the definition of the entropy H(X) of the x-set:

$$H(X) = \sum_k P(x_k) I(x_k) = -\sum_k P(x_k)\log P(x_k)$$   (16)
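The sketch below implements (15) and (16); it is a plausible rendering under the paper's definitions, not code from the source. It accepts any (n, m) table, with the 2x2 case of (15) as a special instance.

```python
from math import log2

# Sketch of equations (15) and (16) for a probability table
# P = [[P11, P12], [P21, P22]] (rows: user judgement, columns: system judgement).

def mutual_information(P):
    """I(X, Y) in bits; entries with zero joint probability contribute 0."""
    rows = [sum(r) for r in P]        # marginals P(x_k)
    cols = [sum(c) for c in zip(*P)]  # marginals P(y_i)
    return sum(p * log2(p / (rows[k] * cols[i]))
               for k, r in enumerate(P)
               for i, p in enumerate(r) if p > 0)

def entropy(marginals):
    """H in bits, equation (16)."""
    return -sum(p * log2(p) for p in marginals if p > 0)
```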

It can be proved [3] that I(X, Y), unlike $I(x_k, y_i)$, is non-negative. For a given query on a given collection (i.e. given x-events) the best performance is the one that produces the maximum of I(X, Y). It is easy to prove [3] that

$$I(X, Y) \le H(X)$$   (17)

so that the value H(X), which depends on the query and not on the retrieval system, is the maximum that I(X, Y) can take up.

The function I(X, Y) can be used to compare retrieval sessions using
(a) different indexing methods
(b) different query formulations
(c) different retrieval policies
(d) the same document collection
(e) the same query.
The limitation as to points (d) and (e) can be justified in an intuitive way with reference to


(1). As long as query and collection are constant, (P11 + P12) and (P21 + P22) are separately constant. The usefulness of the retrieval system lies in the splitting of (P11 + P12) in favour of P11 and of (P21 + P22) in favour of P22. The best that a system can do is to obtain P12 = P21 = 0 and I(X, Y) = H(X). Therefore two searches having different marginal probabilities,

probability that a document is relevant = P11 + P12
probability that a document is irrelevant = P21 + P22

cannot be meaningfully compared by comparing I(X, Y). For instance, consider the two retrieval tables below:

           y1      y2
x1        0.09    0.01      P(x1) = 0.1
x2        0.01    0.89      P(x2) = 0.9

H(X) = 0.468 bits
I(X, Y) = 0.343 bits

           y1      y2
x1         .       .        P(x1) = 0.01
x2         .       .        P(x2) = 0.99

H(X) = 0.080 bits
[the entries and I(X, Y) value of the second table are not legible in the source]

In the second case no system can possibly provide as much information on relevance as in the first, because events x1 and x2 only "contain" a quantity of information H(X) (see (17)). In other words, in the first case the function I(X, Y) ranges from zero to H(X) = 0.468 depending on the goodness of the system, and in the second case it ranges from zero to H(X) = 0.080. In order to overcome this limitation and to remove restrictions (d) and (e), it is sufficient to normalise the performance measure:

$$I^*(X, Y) = \frac{I(X, Y)}{H(X)}$$   (18)

(information provided by the system judgement on the user judgement, divided by the information contained in the user judgement).
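A sketch of (18), reusing the mutual_information() and entropy() functions above, and checked against the first worked table (the function name is invented):

```python
# Sketch of equation (18): normalised performance measure.
def normalised_performance(P):
    rows = [sum(r) for r in P]            # marginals of the user judgement
    return mutual_information(P) / entropy(rows)

# The paper's first worked table: H(X) = 0.468 bits, I(X, Y) = 0.343 bits.
P = [[0.09, 0.01],
     [0.01, 0.89]]
print(round(entropy([0.1, 0.9]), 3))      # 0.469 (the paper rounds to 0.468)
print(round(mutual_information(P), 3))    # 0.343
print(round(normalised_performance(P), 3))  # 0.731
```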

3. GENERALISATIONS AND REMARKS

3.1. The function I*(X, Y) is defined by (18) for a probability table of dimensions (n, m) and therefore applies to cases where the user classifies into n relevance classes and the machine into m classes, e.g.

user judgement: (1) very useful, (2) useful, (3) worth reading, (4) useless
machine judgement: (A) relevant, (B) marginally relevant, (C) irrelevant

[see for instance Ref. 7, p. 177].

3.2. In all previous cases a reference (human) classification is compared to a make-good (machine) classification. This explains the lack of symmetry of definition (18) with respect to x and y. If a measure of agreement is applied to two equally reliable classifications (for instance those of two experts), we demand that the function respect the symmetry between x- and y-events. Therefore a symmetric mean of the two entropies H(X) and H(Y) should replace H(X) in definition (18). For instance

$$I'(X, Y) = \frac{I(X, Y)}{\sqrt{H(X)\,H(Y)}}$$   (19)
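A sketch of the symmetric normalisation (19), again reusing the earlier functions; the function name is invented:

```python
# Sketch of equation (19): normalise by the geometric mean of the two
# entropies, for comparing two equally reliable classifications.
def symmetric_agreement(P):
    rows = [sum(r) for r in P]
    cols = [sum(c) for c in zip(*P)]
    return mutual_information(P) / (entropy(rows) * entropy(cols)) ** 0.5
```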


A function of this kind allows one to measure the consistency of the classification of experts by compiling a table of co-occurrences:

                         expert No. 2
                       relev.   irrel.
expert No. 1  relev.     .        .
              irrel.     .        .

The value I'(X, Y) computed from such a table can be regarded as an upper limit to the best possible performance of an automatic system, in the sense that an automatic system cannot produce a relevance judgement in accordance with that of each user insomuch as there is disagreement between the users themselves. If the experts are asked to assess N degrees of relevance, then I'(X, Y) can be expected to increase with N up to a saturation point, when an expert becomes unable to differentiate between adjacent relevance classes. The saturation value of I'(X, Y) then provides a rough measure of the resolution of the judgement. Such measures of user consistency and resolution prevent a system designer from developing sophisticated and expensive systems in an attempt to attain performance levels that are conceptually impossible. It can be noted in passing that resolution considerations discourage the use of all performance measures based on ranking (that is, on the agreement between the relevance rankings produced by the user and by the machine respectively) if we regard ranking as the classification into a very large number of relevance classes.

3.3. A byproduct of property 5 is that the contributions to performance of different design factors are additive (a numerical sketch follows at the end of this section). This can be useful:
(A) to measure the performance of each phase of a multiphase retrieval system (e.g. a first filter based on keyword matching and a second filter based on semantic analysis);
(B) to quantify the contribution of each tag or descriptor to the success of the retrieval. This differs from (A) in that such a contribution can be assessed independently of the retrieval policies, by computing the function I(X, Y) with x-events defined on user relevance and y-events defined on the contents of the tags.

3.4. Although precision/recall methods measure the retrieval performance with a pair of values and the function I*(X, Y) with a single value, the latter cannot be obtained as a function of a precision/recall pair. In fact, even when both precision and recall are held constant, the function I*(X, Y) still has one degree of freedom. Figure 1, for instance, shows a plot of I*(X, Y) for precision = recall = 0.9.
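The numerical sketch promised in 3.3: a check, on an invented three-variable distribution (relevance X, first-phase outcome Y, second-phase outcome Z), that the information supplied by two cascaded phases decomposes as I(X; YZ) = I(X; Y) + I(X; Z|Y). The distribution and all names are assumptions for the example.

```python
from math import log2
from collections import defaultdict

# Invented joint distribution over (x, y, z); z = 1 only where y = 1,
# since the second filter only sees what the first one retrieved.
joint = {
    (1, 1, 1): 0.08, (1, 1, 0): 0.01, (1, 0, 0): 0.01,
    (0, 1, 1): 0.01, (0, 1, 0): 0.04, (0, 0, 0): 0.85,
}

def mi(pairs):
    """Mutual information (bits) of a joint distribution over (a, b) pairs."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in pairs.items():
        pa[a] += p
        pb[b] += p
    return sum(p * log2(p / (pa[a] * pb[b]))
               for (a, b), p in pairs.items() if p > 0)

# I(X; YZ): treat the pair (y, z) as a single event.
i_x_yz = mi({(x, (y, z)): p for (x, y, z), p in joint.items()})

# I(X; Y): marginalise z out.
xy = defaultdict(float)
for (x, y, z), p in joint.items():
    xy[(x, y)] += p
i_x_y = mi(dict(xy))

# I(X; Z|Y): average over y of the conditional mutual information.
i_x_z_given_y = 0.0
for y0 in (0, 1):
    cond = {(x, z): p for (x, y, z), p in joint.items() if y == y0}
    py = sum(cond.values())
    if py > 0:
        i_x_z_given_y += py * mi({k: v / py for k, v in cond.items()})

print(round(i_x_yz, 6) == round(i_x_y + i_x_z_given_y, 6))  # True
```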

4. APPLICATION TO AUTOMATIC CLASSIFICATION

4.1. The retrieval process discussed in Sections 1-3 can be regarded as a special case of automatic classification into two classes (retrieved and non-retrieved set). This section discusses the applicability of I*(X, Y) to measure the performance of any automatic classification system, thus unifying the two approaches. The case for using I*(X, Y) may follow the same lines as Sections 1-3, with the substitutions:

(a) indexing methods → features describing an item
(b) query formulations → class definitions
(c) retrieval policies → classification rules
(d) document collections → item collections
(e) user queries → classification problems.

The probability table (or frequency count) is now an (n, m) array, with each probability $P_{ki} = P(x_k y_i)$ being the probability that an item is classified into class $x_k$ by the referee and into class $y_i$ by the machine.
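The mutual_information() and entropy() sketches from Section 2 already accept an (n, m) table, so the normalised measure extends to multi-class problems unchanged; here is a usage example with an invented 3x3 table (referee classes as rows, machine classes as columns):

```python
# Invented 3x3 co-occurrence probabilities; I*(X, Y) as in equation (18).
P = [[0.20, 0.03, 0.02],
     [0.02, 0.30, 0.03],
     [0.01, 0.04, 0.35]]
rows = [sum(r) for r in P]
print(round(mutual_information(P) / entropy(rows), 3))
```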

[Fig. 1. I*(X, Y) plotted for precision = recall = 0.9; horizontal axis on a log scale.]

The function I*(X, Y) can thus be applied in a straightforward fashion to measure the agreement between two classifications of items into classes, for instance to compare a human and a machine classification of
- readers into interest groups,
- documents into subject classes,
- enterprises into similarity clusters,
- personnel into profile classes,
and so on, provided always that the validity of the five postulated properties is accepted and, above all, that one wishes to measure the goodness of the classification irrespective of the final decisions that might be based on it.

Consider, for the sake of concreteness, the problem of the automatic classification of patents into the subject classes

Class C1: Production Techniques and Methods
Class C2: Manufacturing Machines
Class C3: Manufactured Products
Class C4: Instrumentation; Testing and Measuring Equipment

and consider the realisation of the co-occurrence table shown in Fig. 2.

[Fig. 2. Co-occurrence table: reference classification vs machine classification over classes C1-C4; entries not legible in the source.]


My claim is that the function I(X, Y), which is a well-known measure of statistical dependence between discrete random variables, is also suitable as a measure of merit to be associated with a table like this.

4.2. The generality of the approach of Section 4.1 can be extended to the case of two unrelated classifications of the same items, no matter whether they are automatic or manual, whether one is regarded as the reference for the other, or whether their class definitions have anything to do with each other. Let the patents of our example be classified by a different system into the five classes

D1: Research
D2: Technology
D3: Production
D4: Engineering
D5: Others.

Then one can fill in a co-occurrence table to measure the agreement between this classification and the classification into classes (C1, C2, C3, C4). One property of I(X, Y) that is relevant to such a case is the following:

Property A1. A permutation of the classes $x_k$ (or classes $y_i$) and of the corresponding rows (or columns) leaves the value I(X, Y) unchanged.

This means that the function I(X, Y) detects and measures the agreement between the two classifications on the basis of a probability table only, without any need to indicate which x-class corresponds to which y-class.

Note that in our example a re-shuffling of the D-classes or of the C-classes is permitted but irrelevant to the problem, so that Property A1 is certainly a requirement for any merit function one may wish to use. Also, the task of mapping one set of classes onto the other set is far from straightforward.

Property A1 also helps define best and worst cases in a more general form than (5) and (6). This is expressed by properties A2 and A3.

Property A2. I(X, Y) = 0 if and only if the events
$x_k$: one classification system assigns an item to class $x_k$
$y_i$: a second classification system assigns the same item to class $y_i$
are statistically independent for any pair (k, i) or, equivalently, the rank of the probability matrix is one.

Property A3. The function I(X, Y) takes up its maximum value H(X) if and only if the probability array has only one non-zero element per column. For example:

$$\begin{pmatrix} P_1 & 0 & 0 & 0 & P_5 \\ 0 & 0 & 0 & P_4 & 0 \\ 0 & P_2 & 0 & 0 & 0 \\ 0 & 0 & P_3 & 0 & 0 \end{pmatrix}$$
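A numerical check of Property A3, again with invented values and the earlier functions: with one non-zero entry per column, I(X, Y) equals the entropy of the row marginals.

```python
# One non-zero entry per column: the y-class determines the x-class,
# so I(X, Y) attains its maximum H(X).
P = [[0.30, 0.00, 0.00, 0.00, 0.10],
     [0.00, 0.00, 0.00, 0.25, 0.00],
     [0.00, 0.20, 0.00, 0.00, 0.00],
     [0.00, 0.00, 0.15, 0.00, 0.00]]
rows = [sum(r) for r in P]
print(round(mutual_information(P), 6) == round(entropy(rows), 6))  # True
```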

In this situation the x-class to which an item belongs can be unequivocally determined on the basis of the y-class to which it has been attributed (but not vice versa). If the x-classes represent a reference classification and the y-classes a make-good one, the latter can replace the former without introducing any error.

4.3. The discussion on the desirability of the five properties 1 to 5 runs parallel to the one in Section 2, so that there is no need to rephrase it here.


Normalisation can be achieved by applying (18) or (19), without any need to take the number of classes into account. The entropy H(X) itself measures at the same time the number of classes and the size of each marginal probability $P(x_k)$. This can be illustrated by the following:

Property A4. If a new class $x_s$ is added to the x-classes and if $P(x_s) = 0$, then both I(X, Y) and H(X) remain unchanged.

For example, redundant classes can be defined in which there are no items, without jeopardising the validity of the performance measure.

It is useful to stress that properties A1 to A4, attractive as they may be, are of the nature of boundary conditions and cannot replace properties 1 to 5 in defining the function I(X, Y) uniquely.

REFERENCES
[1] C. E. Shannon, The mathematical theory of communication. Bell Syst. Tech. J. (1948).
[2] C. E. Shannon and W. Weaver, The Mathematical Theory of Communication. University of Illinois Press, Urbana (1949).
[3] A. I. Khinchin, Mathematical Foundations of Information Theory. Dover, New York (1957).
[4] G. Salton (Ed.), The SMART Retrieval System. Prentice-Hall, Englewood Cliffs, NJ (1971).
[5] G. Salton, Automatic Information Organization and Retrieval. McGraw-Hill, New York (1968).
[6] J. A. Swets, Effectiveness of information retrieval methods. Am. Docum. 1969, 20(1).
[7] F. W. Lancaster and E. G. Fayen, Information Retrieval On-Line. Melville, Los Angeles (1973).
[8] C. W. Cleverdon, Evaluation of Operational Retrieval Systems. College of Aeronautics, Cranfield, England (1964).
[9] G. Salton, Evaluation problems in interactive information retrieval. Infor. Stor. Retr. 1970, 6.
[10] P. Calingaert, System performance evaluation: survey and appraisal. Commun. Assoc. Computing Machinery 1967, 10.
[12] J. M. Smith, A review and comparison of certain methods of computer performance evaluation. Computer Bull. 1968, 12.