NUCLEAR
INSTRUMENTS
AND
METHODS
56 (I967) 289-292; ©
NORTH-HOLLAND
PUBLISHING
CO.
EXACT T R E A T M E N T OF S E A R C H STATISTICS* D. A. E V A N S a n d W. H. B A R K A S
University of California, Riverside, California, U.S.A. Received 16 J u n e 1967
Exact formulae have been found for the likelihood of any combinations of search statistics to result from successive scans of the same "area". The formulae apply for similar events which
are found with arbitrary efficiencies. One obtains a best estimate of the actual numbers of events present and simultaneously the interval corresponding to any particular level of confidence.
The following statistical analysis was developed to enable more information to be obtained from the results of successive scans of nuclear track emulsion or bubble chamber film. The method, however, is equally applicable to any form of search process (with replacement) which is repeated one or more times and may be described by the statistical models assumed in the present treatment. Frequently in the field of experimental high energy physics employing track detection methods (bubble chamber, nuclear emulsion, spark chamber) the experimenter must rely on the work of scanners who search for specified types of events in the presence of large quantities of extraneous information. It is therefore to be expected that not all events of the required type will be found, i.e. the scanning process is not 100% efficient. Estimates of the efficiencies often have been made by performing a second completely independent scan. If a and b are the numbers of similar events found in the first and second scans respectively and 7 is the number of events in common to the two scans, then a simple argument establishes an estimate of N the total number of events present as N = ab/7
1. Seen by A and B, (?); 2. Seen by A not by B, (~); 3. Not seen by A, seen by B, (fl); 4. Not seen by either A or B, { n - ( ~ + f l + 7 ) }. The quantities in brackets represent the number of that type of event present. ~, fl, 7 are experimental quantities, n is, of course, unknown. Clearly, a = ~ + 7, b = fl + 7 where a and b were defined previously. A further convenient quantity is f = ~ + [3 +7, the total number of events found. One now postulates two quantites eA, ea which are the probabilities for A and B respectively to detect a required event. The important assumption is made that all events are equally likely to be seen. If one has reason to believe that certain event parameters such as track ionization, event geometry, etc. cause a variation in what might be termed the intrinsic probability of being seen, then events should be grouped into bins of the parameters within which this intrinsic probability is likely to remain constant. The postulated quantities eA, % are then the products of this intrinsic probability and the true scanning efficiencies. With these assumptions in mind, the double scan-
and the efficiencies of the two scans as el = a / N = 7/b, ~2 = b i N = 7 / a respectively. However, these formulae may only be reliably applied in the case of large statistics and reasonably high efficiencies. A closer statistical analysis of this scanning problem will be described and an attempt made to extract as much useful information as possible from the data obtained in performing two or more scans. The situation for two scans is represented diagrammatically in fig. 1. The main area represents the number of events actually present (n), the result of successive scans (called A and B) is to enclose areas representing events seen, the intersection of these areas are events seen in common, the outside area is the missed events. All events may be divided into four exclusive classes: * W o r k s p o n s o r e d by U. S. A t o m i c E n e r g y C o m m i s s i o n .
/ SCAN
A ~
SCAN
B ~
Z-ALL
EVENTS
Fig. 1. D i a g r a m m a t i c representation o f classes created by two scans. T h e s y m b o l s in each area are the n u m b e r o f events in that class.
289
290
D. A. E V A N S
AND
ning process may be described statistically as follows. In the first scan (A), a events are selected out of n present (n is unknown) with eA being the probability of selecting an event. The total probability is then described according to the binomial distribution as: p~(a In,cA) = [n F/{a !(n - a)! }]e~(l - CA)"- ". The subsequent scan (B) may be divided into two sections: 1. B's scan of events already seen by A, 2. B's scan of events not seen by A. In the first case, we have the probability given as
Pz(Tla,ea) = [a!l {y!(a--y)!}]e~(1 - - e . ) " - ' , since 7 events are found out of a, with the probability per event of eB. The second case yields the probability, p3(fl I n - a,eB) = [(n - a)!/{fl!(n - a - fl)t }]. •e ~ ( 1 - e B]~n-a-fl since out of ( n - a ) events fl are found. The joint probability of the whole sequence occurring, i.e. a events by A, b events by B and ~ in c o m m o n is given by the product
P(a,b,yl
t(a,b,y) = In ! / ( n - f ) ! ] . a b
-
,A)"-°0
-
0-1,
where ~, fl, y , f a r e as defined previously. The expression is seen to be symmetrical in a, b as it must since both scans are on an equal footing. The above expression may be interpreted as the likelihood of a given situation of known a, b, y with parameters n, ~A, ~B,
~(n,eA,en I a,b,y) = P(a,b,y). The values of these parameters which simultaneously maximize the likelihood are then the m a x i m u m likelihood estimates. It is quite easy to show that the estimates of eA and eB are a/n and b/n as expected, but the estimate of n cannot be obtained analytically. In most situations n is the parameter which is desired, both its estimate and the statistical error. Accordingly, the other two parameters (the efficiencies) may be treated as "nuisance parameters" and be integrated over a). This then yields a likelihood function of the single parameter n, viz., £F(n)= n!(n-a)!(n-b)!
a!b! ( n - f ) ! [ ( n + l ) ! ] 2 ~!fl!Y!"
This expression is not very suitable for numerical evaluation since it involves the products and quotients of very large numbers (the factorials)• The result may be recast in the form of the recurrence relation,
£~'(n) = [n(n - a) (n - b) / {(n - f ) (n + 1)2 }]5O(n _ l ). This relation is well-suited both to hand calculation and particularly to computer use. Clearly the lower limit of n is f, (the total number of events found). One may arbitrarily assign & o ( f ) = 1 and proceed stepwise to generate as much of the likelihood function as desired• A computer subroutine using this method has been written for use when many calculations must be made (e.g. each bin of an energy spectrum)• For individual problems the likelihood function may be conveniently generated with a hand calculator. The behavior of the function for typical experimental situations is shown in fig. 2. The examples include artificial cases of small statistics to demonstrate the information still available• A confidence interval for n may be established by quoting the value n = N for which
Z n=f
n,gA,gB ) = P l P 2 P 3 .
Simplification of the expression yields:
•
W . H. B A R K A S
= c n =f
where C is the confidence level desired. In practice ~,~=: ~°(n) may be approximated by summing until further terms become a predetermined small fraction of the sum. (The nature of the convergence of the sum ensures this to be a good approximation.) The confidence level as described is, of course, single-sided; for such a distribution as that of fig. 2d the interval may be made two-sided in a similar manner. The estimation of the individual efficiencies is best done independently. One notes that for scanner A, of the b events known to be present (found by B) only y are found. The probability for A to detect an event is CA; the statistical model which is appropriate is the binomial with parameter CA, where
p(y) = [ b ! / {y [(b - y)! }]el(1 - CA)b- '. It may be shown 2) that the r a n d o m variable sin- t~/(y/b) is more nearly normally distributed than y, with mean sin-l~/e A and variance (4b)-~. Hence the estimate of eA is y/b and the (two-sided) confidence interval limits are given by: sin 2 [sin - l x/(y/b ) +_ ~/(2x/b)], where e is the number of standard deviations each side of the mean for the desired confidence interval in a normal distribution (e.g. 95% conf., e = 1.96). It is clear that the calculation for e~ is identical in nature.
EXACT
1.0
-
TREATMENT
OF SEARCH
291
STATISTICS
(a)
(b) O=
I
0=3
¥=
I
f=
I
Y: 2 ~:4
b=3
b=
0.5-
TTT,,...
0.0 54
0
56
78910
0 l 2 3 4 5 6 7 8 9 I011 12 13 14 15
(c)
(d)
1.0
o : b= Y:
I00 80
O : 263 b:316
70 = Jlo
Y:
214
f = 565
c-
0.5
ooT II0
120
350
400
n
n
F i g . 2. Typical likelihood functions obtained using the method described. The a , b and ~ have the same meanings as in the text. The functions are normalised to ,,~max = 1.
Extension to m o r e than two scans: The approach leading to the expression for ~ ( n , CA, eB [ a, b, ?) may be extended to three or more scans o f the same area. The p r o o f is given in the appendix and the results are quoted here. Again, the recurrence relations are well suited to hand or computer calculation. 3 scans: If total number o f events found is f, and each scanner finds a, b, c respectively, then:
•LP3(n,eA,e.,ec l a,b,c,f) oc { n ! / ( n - f ) ! a
b
c ~n--
• gA~B~CAA
a~n--
b~n--
AB
AC
which gives on integrating out CA, ca, ec,
~3(n) oc
{n!/(n-f)!}(n-a)!(n -b)!(n-c)!/[(n+
ze3(n) =
{ . l(n
-f)}" n >f.
• ((n-a)(n--b)(n-c)/(n+l)3}~LP3(n--1),
N scans: If total number o f events found is f and the individual scans find ai, i = 1, 2, - N,
}. c
N
where 2 = 1 - e ,
I)!] 3
or
~q~N(n,eilai,f) oc { n ! / ( n - - f ) ! }
H e ia'2i"-a'. i=1
292
D. A. E V A N S
AND
W. H. B A R K A S
[ABC] + [ABC] + [ABC] + [ABC] = [C],
On integrating out e~, e2 , . . . , es, N
£,aN(n ) ~ {n!l(n--f)!} [I { ( n - - a i ) ! / ( n + 1)!} i=I
since all possible arrangements of A, B, A, B occur. Similarly, the exponent of 2c is [C]. Thus one obtains
or
5 N ( n ) = { n / ( n - - f ) } [ {i~l
Appendix In order to treat the general case of N scans it is useful to introduce the following notation. The events m a y be divided into classes characterized by whether or not the event was seen in successive scans. Thus A will denote the class seen by A, A the class not seen by A. Such combinations as AB, AB denote the intersections of the classes, viz. AB is the class of events seen by A and not by B. The n u m b e r of events in a given class is denoted by enclosing the class symbol in square brackets. In this notation the following correspondences with the quantities defined in the main text are noted, a = [A], b = [B], ~ = lAB], fi = lAB1, 7 = lAB1. It is clear that relations such as
e[A]e[BI~[AI~[B]
5 2 - [AB]![AB]![AB]![A-13]i °A ~B "~A "~, , since n - f = [ A B ] , i.e. the n u m b e r o f events found in neither scan. The form of the d e n o m i n a t o r should be noted, viz. the p r o d u c t of the factorials of the numbers in each of the four classes into which the two scans divided the events. A third scan may be considered as dividing each of these classes into two, seen or not seen by C, e.g. AB is split into A B C and ABC. The scanning o f each of the four classes gives rise to an extra binomial probability factor in the likelihood, thus" No. seen by C
AB
[ABC]
AB
[ABC]
where there are eight factors in the denominator, each corresponding to one of the eight classes into which the three scans divide the events. This should be compared with the similar expression for two scans. Since the likelihood is a function only of n, eA, ~a and ec, all but one of the denominator factors remain constant with varying n. The exception is [ A B C ] = n - J ~ the total n u m b e r of events missed. Thus, rewriting [ A ] = a etc. one obtains
53(n,~A,~,,~c) < {"!/(n -f)!}. • ~i(1
- ~.)° - ~4(1
- ~c)"-'
ei[i] 21[i] H [X].,T where the product
2M
1-[ [ X ] ! is taken over all the 2 M classes created by the M scans. If one now considers a further scan, this will divide each class into two. Take one class X where X denotes some combination of symbols such as A B C D . . . In the new scan, [X, M + 1], events are seen out of [X]. Thus this gives a probability factor of
[X]! otx,M + ~l~tx,M + u [ X , M + I ] ! [ x , M + I]! ~M+, ,~M+, • A similar expression will occur for each of the 2 M classes• When the p r o d u c t o f all these is taken with L-"M the product of 2 M factors in 5 M cancels with the terms like I X ] ! and is replaced by a new product over the new 2 M+I classes. The exponent of eM+a is symbolically expressed as ~ x [ X , M + 1] = [ M + 1]; similarly the exponent of 2M+l is [ M + 1]. Thus,
~1
o[ABCl~[ABCI ,,c
[ABC]![ABC]! ,-c [AB]!
- ~A)" - °4(1
The situation where M scans have been done is now considered. The extension from 5 2 to 5 3 suggests that:
Probability
[AB]t
e}il21il,
i=1
hold, and in particular [A] + [A1 = n, the total n u m b e r of events present. This notation is easily extended to cover any n u m b e r of scans. The expression for the likelihood derived in the main text for two scans may be written:
Class
l~ I-A,B,C
5 M = n!
[ A B ] + [A B] = [A],
n [
5 3 = { n ! / E A B C ] ! .... [ABC]!}
(n-al)}/(n+l)N] ~<(~N(n-l)"
Gc°[ABc]~,,~CtAacl
[ABC]![ABC]! and similarly for classes AB, AB. It is then noted that each o f the four expressions contains factors like [-AB]! in the numerator which cancel those present in the d e n o m i n a t o r of 5(' 2. Further, the exponent of ec is
5"M + 1 = n !
12M +1
~t~a~th/ n [ X ' ] ! i = 1 ~i Ai / 11
(where X ' are the 2 M+~ new classes) i.e. the same expression as for 5 M with M + 1 substituted for M. It has been shown that this general expression is true for M = 2, therefore, by induction it is universally true.
References 1) D. V. Lindley, Introduction to Probability and Statistics 2 (Cambridge U. P., 1965) p. 38. 2) Ibid. 1, p. 138.