NUCLEAR
INSTRUMENTS
65 (r968) I 5 7 - I 6 2 ;
AND METHODS
© NORTH-HOLLAND
PUBLISHING
CO.
S E A R C H S T A T I S T I C S OF RARE EVENTS* M.-L. S H E N Bartol Research Foundation o f The Franklin Institute, Swarthmore, Pennsylvania 19081, U.S.A. Received 4 J u n e 1968 T h e statistics associated with the search o f rare events is discussed for the simplest case in which only two s c a n n e r s are e m p l o y e d . It is f o u n d that the m a x i m u m likelihood e s t i m a t o r for the n u m b e r o f events in m r e p e a t e d s c a n n i n g processes is the solution o f a
( 2 m - 1) o r d e r algebraic equation. T w o practical a p p r o a c h e s are suggested a n d are briefly c o m p a r e d . T h e cases for three scanners, a n d m o r e generally t scanners, are briefly discussed.
In problems dealing with the estimation of a n u m b e r of rare events, for example, in nuclear emulsion or bubble chamber film, the efficiency of detection is often not perfect (i.e. less than 100%). To determine the overall uncertainty that should be associated with the final result for the n u m b e r of rare events, a knowledge of statistical distributions associated with the available efficiency of detection is needed. This field has been called search statistics. A simple form of search process is described below and the statistics for this process are developed. Let two scanners A and B scan a stack of nuclear emulsion plates, searching for events of some particular type. Let nA and nB represent the number of events of a given type found by A and B respectively, and naB be the n u m b e r of events which are found by both A and B. The true total n u m b e r of events, N, is usually estimated as
estimated by eq. (1)would show a statistical fluctuation since nA, nB and nAB have statistical distributions. In our case of two scanners, an event is either detected or overlooked by either A or B. Therefore, all events can be classified into four categories. The population and the probability o f occurrence associated with the event belonging to each category are defined as 1. nA~ = N u m b e r of events seen by A and B, PAB = eA~B= Probability of an event seen by A and B; 2. nAg = N u m b e r of events seen by A not by B, PAg = eA(1--eB) = Probability of an event seen by A not by B; 3. nXB = N u m b e r of events seen by B not by A, PXB = eB(1--eA) = Probability o f an event seen by B not by A; 4. nxg = N u m b e r of events not seen by either A or B. PXg = (1 -eA) (1 --e~) = Probability o f an event not seen by either A or B. The four categories are schematically shown in fig. 1. We note here that
n = nAnB/nAB,
(1)
simply because 11A = N,~A,
nB nAB
PAB q- PAB + PAB q- P~-~ = 1,
NeB,
(3a)
(2) N = nABq-nA-~q-n~B q- n ~ ,
NSASB,
n A = DAB "-{-F/A~,
is assumed to hold, where ~A and eB are the detection efficiencies of A and B respectively for this particular type o f event. The above treatment is justified only if na and riB, and thus naR, have the ideal form of a 6-function distribution. In the actual case, nA, for example, follows a binomial distribution with a m e a n = NeA, and a standard deviation, (7"A = {N~A(1 eA)}½ such that the r/a-distribution approaches a 6-function as N increases1). Here, the detection efficiency is assumed to be stationary. The estimation o f N by eq. (1) is reliable for large statistics and high efficiency. F o r rare events, the n u m b e r -
(3b)
n B = nAB ~- n x a .
-
* W o r k s u p p o r t e d by the N a t i o n a l Science F o u n d a t i o n .
Fig. l. D i a g r a m m a t i c representation o f categories (nAB, nAE, n~B, nXg) describing the case o f two scanners A a n d B.
157
158
M.-L.
SHEN
It is obvious that the probability of‘ the outcome of the present scanning process follows a multinomial distributionl).
nz~logN!/aN+mlog(l-&,)+Pllog(l-EB)-
P1(n AB,n.A,, %B, %6; NYpAB, pA,, pzB? pz)
Xx,+ Z yi = mNE,, Cx,+ Cz, = mN.s,.
N!
=
X
illog(N - xi- yi - Zi)!/2N = 0,
-i$
Interpreting
(8)
a log N!/aN as
llAB.In AE! n,,! n,,! x [pAB]n.4*. [pAB].AB. [pAB]G.
[p--]G*
(4)
Using the expressions for the various PAB’s and also eq. (3), one can express eq. (4) in terms of observed quantities
alogN!‘tiN = = {logN!-log@-l)!}/{N-(N-l)}=logN, then N is the solution
c
log lP,(nA, nB, nAB; N, &A,&B)=
YZXi+
Cyi
mN
(9)
of
1
+log
c 1-L’
cx.+cz. mN
=
1
= ; & log (1 - =$3). x
&;*&?(I -EA)N-“A(l
-EB)N-?
(5)
Since N is usually estimated by IZ of eq. (l), it is important to determine the statistical distribution of n under the condition that Ed, ~~ and N are known as parameters. Now, n = nAnB/nAB and the density distribution of nA, nB and nAB obeys eq. (5). We then have
(10)
An analytic expression of the maximum likelihood estimator for N can be obtained in general only for the simplest cases m = 1,2. For the case m = 2, it is the solution of a third order equation of N. For the case m = 1, it becomes N = (x+y)(x+z)/x
= n,n,jnA,,
(11)
where, by definition p(” ; N, &A,&B)= c
P2(nA,
nB,
nAB
; N,
EA, &B),
(6)
where the summation is carried out over nA, nB, nAB such that n = nAnB/nAB. [In the actual evaluation of P, we assume n as the integer part of {(nAnB/nAB) + 0.5}.] Since nA, nB 2 nAB and nAB 1 1 in order to have a meaningful solution for n, we have nzl p(n
; N?FAy
&B) =
1 -pP(nAB
=
O>,
(ha)
where P(” AB = 0) = (1 - EAEB)N is the probability for the case nAB = 0. Let us denote nAB, n,E and n-,, respectively by X, y and z for simplicity. The likelihood function of m sets of random variables (x1, Y,, z,), (x1. yz, 6. .(x,,,, Y”,, z,,,) is the joint distribution’),
nA= x+y, n ,=x+z, n AB = x. Comparing eq. (I) and the above equation, it appears that the usual way of estimating N by eq. (1) corresponds to the maximum likelihood estimate of N for the case of a single reading of the data (n,, nB, nAB). The estimate of N in such a case is given by N = (nAnB/flAB) k a,(N, EA,EB)7
(12)
where IS”is the standard deviation of n about N in the distribution P(n; N, zA, +), with N given by nAnB/nAB, and consequently EA and &Bgiven by &A= nA/N = nAB/nB, (13) [ &B= n,lN = nABin,.
The justification of eq. (12), i.e. the way of evaluating o,(N, Ed, cB), is presumably that nA, nB and nAB are respectively close enough to their expectation values so that N, &A and cB are also close enough to their _eB)mN-Z~,-Z~i_ x &yx’+xy’(l-EA) mN-Ix,-Z~i~~i+Di(l actual value to give an adequate evaluation of o,(N, (7) Ed, Ed). Such an assumption is typical for the case The maximum likelihood estimator for N, eA and &B where only one reading is available. This point can be understood from the following example. can be found from the solution of
159
S E A R C H S T A T I S T I C S OF R A R E E V E N T S
W e estimate the strength of a s t a t i o n a r y cosmic r a y source or r a d i o a c t i v e source by
N=IO 0.3
E =E, =E== 0 . 6
(a)
2 = n +_ x/n,
(14)
where n represents one reading of the observed c o u n t i n g rate and 2 is the true c o u n t i n g rate. The justification for eq. (14) again is t h a t n is close e n o u g h to 2 so t h a t we can o b t a i n an a d e q u a t e e v a l u a t i o n o f the s t a n d a r d d e v i a t i o n a f r o m ~/n, because n follows a Poisson distribution P(n ; ~.) = e - ~'. 2"In !, (15) with ~r =~/).. F o r our case o f two scanners, we have calculated s o m e distributions. I n figs. 2, 3 a n d 4 we p l o t a typical P(n; N, 0.6, 0.6) versus n for the cases N = 5, 10, 15 a n d also a d i s t r i b u t i o n o f eA (or ca) as given by eq. (13). W e discard those c o m b i n a t i o n s o f (nA, na, nAa) which include nAB = 0, because these sets o f values c a n n o t give any meaningful e s t i m a t i o n of n. A careful s t u d y of P(n; N, CA, ca), such as figs. 2, 3 and 4, reveals t h a t the e s t i m a t i o n o f N as well as eA (or eB) based on a single r e a d i n g of (nA,nU,nAB) c o u l d be erroneous, especially for the case where N i s less t h a n 10 a n d eA as well as ea are n o t close to 1. I f we let the two scanners r e p e a t the same scanning, i.e. for the case m > 2, the m a x i m u m l i k e l i h o o d estim a t o r can be o b t a i n e d , as was f o u n d previously, f r o m the solution of eq. (10), which is essentially a ( 2 r n - 1) '~ o r d e r algebraic equation. I n a d d i t i o n to the fact t h a t it is n o t easy a n d s o m e t i m e s impossible to o b t a i n an algebraic solution, it is also difficult to t r e a t the statistics o f the m a x i m u m l i k e l i h o o d e s t i m a t o r o f N. W e suggest therefore the following two a p p r o a c h e s .
P(N,.= O) = ( I - E , E . ) '°= 0 . 0 1 ; )
= 10.66
=
(:7, = 3.21
(b)
0.587
= 0.201
0.2
P (n:
0.1
4
.T 8
[TT1,T,,,
12
16
20
. . . .
24
28
0.2
0.4
0.6
0.8
1.0
n Fig. 3. a. P(n; N, eA, eB) VS n for the case N = I0, eA = en = 0.6; b. The probability distribution of detection efficiency, e, as estimated by eq. (13).
N=15 E=E= (0)
E=0.6
< n > = 15.74 0~, = 4 . 0 4
;
p (N==O);(I-EA~E ~)is= 0 . 0 0 1
(b)
0.386 (~, =0.169
=
0.2
P(n) 0.1
TITTT., ........
12
16
20
24
28
32
0.2
0.4
06
0.8
1.0
8
N=5
E= E,= e, = o.6 (0)
5.03
Fig. 4. a. P(n; N, ea, en) vs n for the case N = 15, eA = eB = 0.6; b. The probability distribution o f detection effÉciency, e, as estimated by eq. (13).
( I - E.E=)==O.107
).3
0.3
=
P 03,,-- o ) =
(b]l
=0.636 O; = 0 . 2 5 6
=1.49
0.2
1. O b t a i n , (riB> a n d ( r / A B ) f r o m the m sets o f d a t a a n d then evaluate N as
).2
N , = < hA> < n n > / < nAB > .
P(n)
;E)
0.1
O,I
The e r r o r associated with N1 is the c o m b i n e d e r r o r of < hA> , < riB> a n d < nAB> . D e n o t i n g the errors a s s o c i a t e d with , '/B> a n d by m-~GnA, m-lffnu a n d m - 3 O.ABrespectively1), we have
.~
(16)
=
T
2
4
6
IT
8
I0
0
0.2 0.4 0.6 0.8
(17)
1.0 - O',AB
Fig. 2. a. P(n; N, cA, en) vs n for the case N = 5, eA = EB = 0.6; b. The probability distribution of detection efficiency, e, as estimated by eq. (13).
{NeASB(1 -- 8AeB)} ~.
2. O b t a i n n f r o m each set o f d a t a a n d then evaluate N as
160
M . - L . SHEN
N2 = < n> = ( n A n d , a . ) .
(18)
The error associated with N 2 is m - } a ( N 2 , ( h A ) / N2,(riB) / N2). In the first approach, although ~,~ etc. can be obtained from eq. (17), evaluation of the combined error is not straightforward because nAn is not statistically independent of nA and nn, General consideration of both approaches (for example, comparing the variance of both cases) shows that these two approaches are approximately the same in practice, despite the fact that n is a biased estimator of N. It should be noted that the present a p p r o a c h to search statistics can easily be extended to the case of more than two scanners. The case for three scanners and that for t scanners are briefly discussed in the following: (A) Three scanners: Let us consider the case of the three scanners, A, B and C with efficiency gA, gn and ec respectively. The true n u m b e r of events N in a stack of nuclear emulsion or bubble chamber film, are usually estimated by any of the following four equations: N = nAnB/nAB,
(19a)
N = nBnc/nBo
(19b)
N=
ncnA/nCA ,
(19c)
N=
nAnBncnAac/(nAnnBcncA).
(19d)
Fig. 5. Diagrammatic representation o f categories (nAF,C, nA~-0, etc.) describing the case o f three scanners A, B and C.
The likelihood function can be written accordingly. The m a x i m u m likelihood estimators of eA, eB, eC and N for m sets ot r a n d o m observables n~ ),ng ), n (~) C ~ n (~) A B ) -(~) ~ t B C , -(~) t~CA~ and "(~) "'ABC (where i runs f r o m 1 to m), become the solution of eA = Z n ~ ) / ( m N ) , en = Y n ~ ) / ( m N ) ,
(22)
gc = Z n ~ ) / ( r n N ) ,
log {1 - Z n ~ ) / ( m N ) } + log { 1 - Z n ~ ) / ( m N ) } + L + l o g {1 -- Z n ( ~ ) / ( m N ) } m
As has been pointed out in the text, this procedure is valid only if we neglect the statistical distributions of nA,-..nABO due to the imperfect efficiency (less than 100%). Here nA and nBC etc. are defined as in the case of two scanners, and nABC is the n u m b e r of c o m m o n events detected by A, B and C. The expression for Pa in this case is given by
= ~
--
.
.
.
i__~1 log [-1 - {n(~>+ n~>+ n ~ ' •-(')
--("
-(')
IIA B --
ring--
~lCA~
Thus the maximum likelihood estimator of N is the solution of a ( 3 m - l ) th o r d e r algebraic equation for the case of m repeated scannings. I f there is only one set of the data, i.e. m -- 1, the m a x i m u m likelihood estimator becomes the solution of the quadratic equation
PI(nABC, nAB c .... , nAnc: N , eA, ca, % ) =
_
N!
"
nABC [ nAa~!. "'nA'Bc [ [ P A B c ]
....
(nAB -{- n a c d- nCA -- nABc)N 2
-~ rp---q ....
× - " × I_ ABCJ
(20)
ednA,,,.,nonA.,n.oncA,
=
(23)
C o m p a r i n g eqs. (19a)-(19d) and eq. (23), it is interesting to note that the simultaneous solution of eqs. (19a)-(19d), i.e. N = nAnB/nAB = rtBnC/nBC = n c n A / n C A -=
= nAnancnAac/(nABnncnck ), (24)
(21)
N! "ABC!(nA. -- nAnc)!"-(N
-
- - ( ' A ' . + , . n o + , c n A ) U + , , A , . , c = 0.
'
where nABC and PAnc etc. are defined as in the case of two scanners. For example, n A ~ and PAB~ are respectively the n u m b e r and probability of events seen by both A and B but not by C. They are shown in fig. 5. P2 becomes
-
- - h A - - n n -- n c + nAn + n . c + , C A - - h A . C ) !
nA nB nc
× cA e . e c
(l
?IB(
hA(1
-- eA, N
- - eB, N
l
t) C .
-- ec, N
161
S E A R C H S T A T I S T I C S OF R A R E E V E N T S
is exactly the solution of the quadratic equation (23). This is consistent with the fact that PI (also Pz) is maximum if eq. (24) is satisfied. In general, the maximum likelihood estimator of N for a single reading of the data is the quadratic solution which satisfies the conditions N > /'/A, /'/B, tiC;
hA, nB]> nAB, e t c . ;
nAB,/"/BC~ n C A ~ /lABC.
This implies that N = (q + (q2 _ 4pr)~ }/(2p),
(25)
where P = nAB+ nBc + nCA-
nAnBnc
The efficiency % can now be estimated as
ej = nj/N.
(29)
A more detailed discussion concerning the practical application of search statistics will be given elsewhere. A different approach to search statistics has recently been given by Evans and BarkasZ). Comments on their treatment are given in the appendix.
In the treatment of Evans and Barkas2), P2(nA,nB,nAB ; N, eA,~B) of eq. (5) was integrated over eA and eB to obtain
.
Thus the maximum likelihood estimator of m repeated scannings for the case of three scanners becomes more complicated than that for two scanners. (B) t scanners: In general, if t persons with efficiency ~j,j= 1,2,..., t, are involved in the search process, the number of expressions for N without consideration of the statistical distribution due to the efficiency is 2 t - t - 1. The maximum likelihood estimator for N, in the case of m repeated scannings, can be induced from those of two and three scanners as the solution of the following equation
±,o,[,_{;.;.}/(.,)]: = m-1
N > ni, n j, nk,...etc.
Appendix nABC,
q = n A n B -k- n B n c q- ncnA, r ~
However, the solution must satisfy the condition
log l V - 1 .(i)
nj
--
i=1
j=l
--
njk j=k
/
where njkt is the number of events seen b y j th, k th and 1'h persons. This equation is equivalent to a ( t i n - l ) th order algebraic equation. The maximum likelihood estimator for ej in this case can be written as
In the particular case of one scan by each scanner, the maximum likelihood estimator of N appears as the solution of the following ( t - 1) t" order algebraic equation.
FI ( N - n j) = N t - l ( N - - Z , j - - ~ -
~¢'(N; n A, n B, nAB) =
=
N!(N-nA)!(N-%)!
(2s)
(A1)
hA.!
This is then considered as a likelihood function of a single parameter N. The fluctuations of N are therefore given by
{N--(nAn./nAB)}2~(N; hA, HB,F/AB), N
f
where f
=
N A
-I-Nn -
NAB-
We wish to offer the following comment about their procedure. 1. Integrating P2 over eA and eB implies that scanners A and B are abnormal in the sense that their efficiencies are random. A practical distribution for eA or eB should be an approximate Gaussian distribution with a definite mean and variance. 2. 5¢ corresponds to a probability distribution of N for a particular scanning (nA, nB, nAB) by A and B among the random stacks of nuclear emulsion or bubble chamber films in which N__>hA+riB--nAB. Therefore, S cannot give the statistical distribution of the estimator of N as expressed by eq. (1) obtained by scanning a single stack of nuclear emulsion or bubble chamber film. If the procedure of the statistical analysis as proposed by Evans and Barkas is followed, in the example cited above, namely, the case of measuring the counting rate of a stationary cosmic ray source, the same expression as eq. (15) should be considered as a likelihood function of the parameter 2, written as P(2;n). Then,
~njk...-b(--1)t-lrl(i)2...t).
j=l
hA!%!
( x - hA- n . - hA.)! [(, + 1)!3' nA !
(r2(2) =
f-
0
P(2;n)(2-n)Zd2=n+2,
(A2)
162
M.-L. SHEN
so that a single measurement of the counting rate, n, should lead to the estimate of 2 as 2 = n _ ( n + 2 ) ~,
(A3)
instead of the usually accepted eq. (8). In fact, P(2;n) corresponds to the frequency of occurrence of ), for a particular measurement of n among the uniformly random sample of cosmic ray source. Again, it is not the kind of statistics to be used for estimating the strength of a particular cosmic ray source.
The author is grateful to Mr. G. G. Bell and Dr. S. P. Duggal of the Bartol Research Foundation for stimulating discussions, and Dr. R. Silberberg of the Naval Research Laboratory for reading this paper. He also wishes to thank Dr. M. A. Pomerantz for his interest. References 1) See for example: Mood and Graybill, Introduction to the theory o f statistics (McGraw-Hill, New York, 1963). 2) D. A. Evans and W. H. Barkas, Nucl. Instr. and Meth. 56 (1967) 289.