Statistics & Probability Letters 46 (2000) 411 – 419
New approximations for the distribution of the r -scan statistic Xiaoping Su, Sylvan Wallenstein ∗ Department of Biomathematical Sciences, Mount Sinai School of Medicine, 1 Gustave Levy Place, New York, NY 10029-6574, USA Received March 1999; received in revised form May 1999
Abstract New approximations are given for the distribution of the length of the ith (i¿1) smallest intervals containing r points, when N observations are randomly distributed on the unit interval, with primary emphasis on the case where N is a c 2000 Elsevier Science B.V. All rights reserved random variable. Keywords: Polya–Aeppli distribution; Compound Poisson approximation; Declumping
1. Introduction Given a Poisson process on (0; ∞), let X1 6X2 6 · · · 6XN be the times of occurrence of points in the interval (0; 1]. For r¿1, de ne the r-spacings as Sj(r) = Xr+j − Xj ;
j = 1; : : : ; N − r:
(1)
The associated order statistics of {Sj(r) }Nj =−r1 are denoted by (r) (r) ¡ · · · ¡ S(N S(1) −r) :
(2)
The ith minimum of these r-spacings is called the ith minimal r-scan statistic, for example, for i = 1, (r) = min{Sj(r) ; 16j6N − r}: S(1)
(3)
The r-scan statistic has recently (Dembo and Karlin, 1992; Karlin and Macken, 1991), received fairly extensive attention as a method to detect anomalies in the distribution of sites along a segment of DNA sequence. The r-scan process is a moving sum process, and the use of sum of r neighboring spacings between r + 1 consecutive points, rather than single (r = 1) spacings, provides greater sensitivity for detecting unusual spacings in the point array. ∗ Corresponding author. E-mail address:
[email protected] (S. Wallenstein)
c 2000 Elsevier Science B.V. All rights reserved 0167-7152/00/$ - see front matter PII: S 0 1 6 7 - 7 1 5 2 ( 9 9 ) 0 0 1 3 5 - 2
412
X. Su, S. Wallenstein / Statistics & Probability Letters 46 (2000) 411 – 419
In this paper we describe three cases. Our primary focus is on the unconditional scan statistic, in which we nd the distribution of the ordered r-spacings when N is itself a random variable with a Poisson distribution, and conditional on N = n, the n unordered points are each uniformly and independently distributed on [0; 1). Equivalently in this case, X1 ; X2 − X1 ; : : : ; XN − XN −1 are exponentially and independently distributed, and N has a Poisson distribution. For the second case, the conditional case, N is xed and the X ’s are the order statistics generated by N uniformly distributed random variables. A third problem, very closely linked to the rst (see Dembo and Karlin, p. 332), which we comment on in passing, is the case identical to the rst, where now N is xed. Glaz (1993) reviews some of the considerable research conducted over the past 15 years on approximating (r) for the conditional case. Approximations are less frequently discussed for the unconditional case which S(1) is the focus of this paper with some noteworthy references being Newell (1963), Naus (1982), and Alm (r) , 16i6N −r. Asymptotic (1983,1997). However, these methods do not easily extend to the distribution of S(i) (r) results and approximations for the distribution of S(i) for the conditional case have been studied by Dembo and Karlin (1992), Glaz et al. (1994), and Huer and Lin (1997). To the best of our knowledge, there is no work on the unconditional case, except for the approximation cited by Dembo and Karlin (1992). (r) which are accurate The purpose of this paper is to present new approximations for the distribution of S(i) and easy to compute under the three models described above. Section 2 uses declumping techniques to evaluate the number of clusters for the three cases. Section 3 translates the results to order statistics of r-scans and evaluates the adequacy of results using simulations. 2. Applying declumping techniques to nd the number of small r-spacings For r¿1, the r-spacings and the distance between r + 1 consecutive observations are de ned by Eq. (1). To nd the number of r-spacings less than w, de ne for each observation an indicator variable indicating whether the r-spacing, is less than w, that is for 0 ¡ w ¡ 1, let Ij (w) =1 =0
if Xr+j − Xj 6w; otherwise:
For all w, set Ij (w) = 0 for j = 0 and for j ¿ N − r. (In the sequel, we will often refer to Ij (w) as Ij .) Let Mr (w) =
N −r X
Ij
(4)
j=1
be the number of r-spacings less than w, and note that = E(Mr (w)) =
X
P(Ij = 1) = (E(N ) − r)
∞ X e−(wE(N )) (wE(N )) j j=r
j!
:
(5)
Thus a very simple Poisson approximation to the distribution of Mr (w) is P(Mr (w) = m) ' e− m =m!:
(6)
This approximation together with error bounds can be rigorously derived (Dembo and Karlin, 1992) as the limiting process as N increases and w goes to 0 as O(N −(1+1=r) ). However the speed in which w is required to approach 0, precludes its use in many application even when n is large, as evidenced in Glaz et al. (1994) for the conditional case, and as illustrated below for the unconditional case.
X. Su, S. Wallenstein / Statistics & Probability Letters 46 (2000) 411 – 419
413
Glaz et al. (1994) proposes a declumping technique for the conditional problem but then abandons it in favor of the compound Poisson approximation. We describe how to use another declumping technique, proposed in a dierent context by Arratia et al. (1990), and described in a broad context by Aldous (1989), yielding a new approximation. We then describe use of yet another declumping technique to provide an extension of Glaz’s compound Poisson approximation to the unconditional case. 2.1. Markovian declumping – the unconditional case We apply a Markovian de nition in which a new clump starts whenever the jth observation is associated with a r-spacing less than w, whereas the previous observation is not, so that the indicator variable for the start of a clump is Ij∗ (w) = Ij (w) [1 − Ij−1 (w)]:
(7)
Let Cr (w) =
N −r X
Ij∗ (w)
(8)
j=1
be the total number of clumps, and let Yc be the number of r-spacings less than w in the cth clump, c = 1; : : : ; Cr (w), i.e. Yc =
bX c −1
Ij (w);
(9)
j = ac
Pj Pj ∗ ∗ where ac = {min j: i = 1 Ii (w) = c}, and bc = {min j: i = 1 Ii (w) = c; Ij (w) = 0}; and bCr (w) = N − r if IN −r (w) = 1: We propose an approximation for Cr (w)
Mr (w) =
X
Yc
(10)
c=1
by assuming 1. A Poisson distribution for the number of clumps, Cr (w). 2. Independence of the clump sizes {Yc ; c = 1; : : : ; Cr (w)} and their independence from Cr (w). 3. A Markovian assumption: P[Ij = 1 | I1 ; : : : ; Ij−1 ] ' P[Ij = 1 | Ij−1 ]. As shown in the appendix, the second and third assumption imply that Yc can be approximated by a shifted geometric distribution P{Yc = h} ' (1 − )h−1 ;
h¿1;
(11)
where = P[I2 = 1 | I1 = 1] = 1 − P(I1 = 1; I2 = 0)=P(I1 = 1):
(12)
As shown in the appendix, P(I1 = 1; I2 = 0) is an alternating series in same terms as in (5), P(I1 = 1; I2 = 0) =
∞ X j=r
(−1) j−r
e−(wE(N )) (wE(N )) j : j!
A nite sum representation of P(I1 = 1; I2 = 0) is given in Eq. (25).
(13)
414
X. Su, S. Wallenstein / Statistics & Probability Letters 46 (2000) 411 – 419
Thus by the three assumptions, the distribution of Mr (w) in (10) can be approximated by a Poisson stopped sum distribution, where the summand, Yc , is a shifted geometric distribution as in (11) and the number of terms, Cr (w) has a Poisson distribution. The resulting distribution, the geometric–Poisson (Sherbrooke, 1968) or Polya–Aeppli is described (Johnson et al., 1992, p. 378) as one that “arises in a model formed by supposing that objects arise in clusters, the numbers of clusters having a Poisson distribution, while the numbers of objects per cluster has a shifted geometric distribution”. Speci cally, setting = E[Cr (w)], P{Mr (w) = 0} ' e− , and for m¿1, m −
P{Mr (w) = m} ' e
m X m−1 k=1
k −1
[(1 − )=]k =k!:
(14)
To complete the derivation note that = E(I1∗ ) + E
N −r X
Ij∗
j=2
∗ ∗ Ij N = E(I1 ) + E E j=2
N −r X
= P(I1 = 1) + (E(N ) − r − 1)P(I1 = 1; I2 = 0); = P(I1 = 1) [1 + (E(N ) − r − 1) (1 − )]:
(15)
One immediate consequence of this expression in terms of a known distribution, is the straightforward approximation of variance of Mr (w), speci cally that Var[Mr (w)] ' (1 + )=(1 − )2 :
(16)
Note that the derivation of the approximations in (14) and (16) is identical whether N is random or xed, except that the former case requires a conditioning on N in Eq. (15). 2.2. Declumping by run size In this subsection, we provide an alternative derivation and extend the compound Poisson approximation by Glaz et al. (1994) based on another way of declumping. De ne an indicator variable for a run of exactly length i starting at position j as Iji = (1 − Ij−1 )
j+i−1
Y
Ik (1 − Ij+i ):
k=j
Let Ni be the number of runs of exactly length i, i.e., Ni =
N −r X j=1
Iji :
X. Su, S. Wallenstein / Statistics & Probability Letters 46 (2000) 411 – 419
Thus, Mr (w) =
X
415
iNi :
i
If we assume that 1. the number Ni of clump with size exactly i is approximately Poisson, with E(Ni ) = i , 2. Ni are mutually independent, then, it follows that ( ) r X X Y P{Mr (w) = m} = P iNi = m ' [i i e−i = i !]; P i i=1
(17)
i i =m
where 1 ; 2 ; : : : ; r are non-negative integers. Finding values for 1 ; 2 ; : : : is non-trivial and has been considered in the conditional case by Roos (1993), Glaz et al. (1994), and Huer and Lin (1997). Analogous with the approximation of Glaz et al., an approximation to i which follows from the clump heuristic of Aldous (1989) can be given by i ' i∗ , where i = (E(N ) − r)P(I1 = 1) (1 − )2 i−1 i∗
for i = 2; : : : ; r;
=0
for i¿r + 1; r X ii 1∗ = E[Mr (w)] − i=2
= (E(N ) − r)P(I1 = 1) (1 − 2 + 2 − rr+1 + (r + 1)r ): Approximation (17) is thus an analogue to Glaz’s result for the conditional case based on a dierent de nition of P(I1 = 1) and of . 2.3. The conditional case This case diers from that described above in that now N = n, n xed and the unordered observations have a uniform distribution. Most of the de nitions given above go through with N replaced by n. In particular, de ne the indicator variable Ij∗ as in (7) and the number of clumps as in (8) with N replaced by n. We apply the same three assumptions as previously, and nd that the Polya–Aeppli approximation, is also valid for this case, with the proviso that we re-compute the parameters P(I1 = 1), and = P{I2 = 1 | I1 = 1} based on the binomial framework, i.e., n X n P(I1 = 1) = (18) w j (1 − w)n−j : j=r
j
A formula P(I1 = 1; I2 = 0) is given by Berman and Eagleson (1985) and is an alternating series sum of binomial probabilities, where n X n j−r P(I1 = 1; I2 = 0) = w j (1 − w)n−j : (−1) (19) j=r
j
Thus the same approximation is used as in (14), with based on (12) and on (18) and (19), and with in (15) derived from as described above, and from P(I1 = 1) in (18).
416
X. Su, S. Wallenstein / Statistics & Probability Letters 46 (2000) 411 – 419
3. Order statistics of r-spacings and numerical results As shown by Glaz et al. (1994), and Naus (1982), the results concerning the number of r-scans less than w can be translated to order statistics of r-spacings via the relationship (r) P{S(i) 6w} = P{Mr (w)¿i};
i = 1; : : : ; n − r:
(20)
In particular, our new approximation (14) based on Markovian declumping can be written as (r) P{S(i) 6w} = P{Mr (w)¿i} = 1 − P{Mr (w) ¡ i}
=1−
i−1 X
P{Mr (w) = m}
m=0
'
− 1−e 1 − e− − e−
Xi−1 Xm m=1
k =1
m−1 k−1
if i = 1; ((1 − ))k (m−k) k!
if i¿2:
(21)
Note that the same format in Eq. (21) is used for the unconditional and conditional cases, the dierence being the methods by which and are calculated. Note that similarly, Eq. (17) can be translated into an (r) approximation for S(i) , as in Eq. (3:4) of Glaz et al. with the ’s given in Section 2.2.
(r) We evaluate the approximations for P{S(i) 6w} under both the unconditional model with E(N ) = 100, and the conditional model with n = 100, for selected values of r, w, and i, which were used by Glaz et al. (1994). Each simulation is based on 40; 000 trials. We summarize the agreement between simulated and approximate values for the 32 cases in Glaz et al. (1994) by computing an overall value of agreement: qX (Approx − Sim)2 =32: s=
Table 1 compares simulated and approximate values for two closely related cases: the unconditional scan with N having a Poisson distribution with mean 100, and the r-scan applied to a sequence of 100 indepen1 . Approximations are based on (i) the new dently, identically distributed (i.i.d.) exponentials with mean 100 Markovian declumping procedure (21), (ii) the declumping technique based on exact run lengths as in (17), and (iii) the one-parameter approximation (6). Our results indicate that the one-parameter approximation is fairly crude with s=0:15 for both cases. The other two approximations are fairly accurate, with the Markovian better for both cases. For approximating the unconditional scan, s = 0:022 for the Markovian procedure and s = 0:030 for declumping based on exact run lengths; while for approximating the r-scan applied to 100 i.i.d. exponentials, s = 0:023 for the Markovian, while s = 0:028 for the exact run size approximation. The approximation by Naus (1982) is closer to the simulated values, especially for the case where N has a Poisson distribution, but this approximation is only valid for i = 1. A similar comparison is performed for evaluating the approximation for the conditional case, but the individual results are presented in Su and Wallenstein (1999). Our results show that the new Markovian procedure performs well (s = 0:020), but that greater accuracy is shown by the compound Poisson approximation of Glaz et al., Eq. (17) (s = 0:011), or the Huer–Lin approximation, CPG2 (s = 0:007). However, our new approximation in Eq. (21) is much simpler than that of Glaz et al. since no multiple partitions are required, and simpler than even the relatively simple CPG2 approximation of Huer and Lin (1997) which requires a four-way summation (double summmation of multinomials) to compute Var[Mr (w)]. Thus in summary, we present a new approximation for the distribution of r-scan statistics under both the unconditional case and the conditional case. For the unconditional case, the declumping procedure appears more accurate than any previously published approximation. For the conditional case, the Markovian declumping is
X. Su, S. Wallenstein / Statistics & Probability Letters 46 (2000) 411 – 419
417
Table 1 (r) Comparison of three approximations to P{S(i) 6w}; E(N ) = 100 under the unconditional model Simulations i 1
w
r
N ∼ POSN
100 exp.
(21)
(17)
(6)
0.01
3 4 5 6 7 15 17 19 21 23 3 4 5 6 7 15 17 19 21 23 3 4 5 15 17 19 21 3 4 15 17 19
0.9927 0.7085 0.2217 0.0430 0.0063 0.7889 0.4371 0.1622 0.0436 0.0096 0.9698 0.4701 0.0843 0.0117 0.0015 0.7102 0.3478 0.1152 0.0281 0.0058 0.7835 0.0992 0.0050 0.5360 0.1986 0.0495 0.0099 0.3190 0.0037 0.3192 0.0780 0.0130
0.9957 0.7113 0.2208 0.0397 0.0061 0.8011 0.4327 0.1526 0.0378 0.0085 0.9815 0.4661 0.0824 0.0104 0.0013 0.7215 0.3405 0.1077 0.0246 0.0055 0.8104 0.0882 0.0037 0.5279 0.1829 0.0458 0.0090 0.2955 0.0028 0.3018 0.0698 0.0122
0.9913 0.7066 0.2211 0.0412 0.0060 0.8446 0.5010 0.1855 0.0478 0.0096 0.9664 0.4658 0.0818 0.0109 0.0014 0.7712 0.3965 0.1279 0.0299 0.0056 0.7706 0.0952 0.0038 0.5637 0.1921 0.0418 0.0074 0.3098 0.0037 0.3018 0.0542 0.0064
0.9969 0.7235 0.2222 0.0411 0.0060 0.8892 0.5062 0.1827 0.0469 0.0095 0.9838 0.4756 0.0814 0.0109 0.0013 0.7918 0.3903 0.1252 0.0294 0.0055 0.8192 0.0870 0.0033 0.5677 0.1869 0.0407 0.0072 0.2915 0.0019 0.2917 0.0511 0.0061
0.9996 0.8384 0.2937 0.0543 0.0077 0.9992 0.8940 0.4413 0.1179 0.0225 0.9964 0.5439 0.0481 0.0015 0.0000 0.9933 0.6561 0.1160 0.0072 0.0003 0.8876 0.0381 0.0000 0.8354 0.0774 0.0003 0.0000 0.2576 0.0000 0.1792 0.0001 0.0000
0.10
2
0.01
0.10
5
0.01 0.10
10
Approximations based on
0.01 0.10
nearly as good as the compound Poisson approximation, but is simpler to program and requires less computing time.
Appendix Let Yc = 1 + Zc , where Zc denotes the extra number of r-spacings less than w in the cth clump for c = 1; : : : ; Cr (w). Thus P[Yc = h] = P[Zc = h − 1 | {Ij = 1; Ij−1 = 0}];
h¿1;
418
X. Su, S. Wallenstein / Statistics & Probability Letters 46 (2000) 411 – 419
where j = min{i :
Pi
∗ k=1 Ik (w)
= c}. By the de nition of the clump sizes, symmetry, and assumption 2
P[Yc = h] ' P[Z1 = h − 1 | {I2 = 1; I1 = 0}] h+1 Y {Ij = 1} ; {Ih+2 = 0} | {I2 = 1; I1 = 0} =P j=3
( ( j−1 ) # h+1 " ) # h+1 Y Y Y I‘ = 1 ; I1 = 0 P Ij = 1 I‘ = 1 ; I1 = 0 : = P Ih+2 = 0 "
j=3
‘=2
‘=2
By assumption 3, and the symmetry condition P[Ij = 1 | Ij−1 = 1] = P[I2 = 1 | I1 = 1] for all 36j6n − r, P[Yc = h] ' P[Ih+2 = 0 | Ih+1 = 1]
h+1 Y
P[Ij = 1 | Ij−1 = 1]
j=3
= P[I2 = 0 | I1 = 1]
h+1 Y
Pr[I2 = 1 | I1 = 1]
j=3
= (1 − )h−1 ;
h¿1:
(22)
Thus {Yc } can be approximated by a shifted geometric distribution. To obtain P(I1 = 1; I2 = 0), we set E(N ) = , and rst nd the following joint probability for 06w60:5, conditional on = X2 − X1 : P((I1 = 1; I2 = 0) | = u) = P(Xr+1 − X1 6w; Xr+2 − X2 ¿ w | = u) = P(Xr+1 − X2 6w − u; Xr+2 − Xr+1 ¿ u | = u) =
e−((w−u)) ((w − u))r−1 −(u) e : (r − 1)!
(23)
Since = X2 − X1 follows an exponential distribution under the Poisson process, its density function is easily found to be f (u) = e−u . Hence, removing the conditioning from Eq. (23), one obtains Z w P(Xr+1 − X1 6w; Xr+2 − X2 ¿ w | = u)f (u) du P(I1 = 1; I2 = 0) = 0
Z
w
e−((w−u)) ((w − u))r−1 −(u) −nu e e du (r − 1)! 0 Z w e−w e−u r (w − u)r−1 du: = (r − 1)! 0
=
By integrating above integral by partition, we have X j r−1 −w (j+1) (w) −w − (1 − e (−1) ) e j=1 j! P(I1 = 1; I2 = 0) = X r−1 (w) j e−w + (1 − e−w ) (−1) j j=1 j! If we expand e−w by Taylor expansion, then we get (13).
(24)
if r is even; (25) if r is odd:
X. Su, S. Wallenstein / Statistics & Probability Letters 46 (2000) 411 – 419
419
References Aldous, D., 1989. Probability Approximations via the Poisson Clumping Heuristic. Springer, New York. Alm, S.E., 1983. On the distribution of the scan statistic of a Poisson process. In: A. Gut, L. Holst (Eds.), Probability and Mathematical Statistics, Essays in honour of Carl-Gustave Esseen, Uppsalla University, pp. 1–10. Alm, S.E., 1997. On the distribution of scan statistics of a two-dimensional Poisson process. Adv. Appl. Probab. 29, 1–18. Arratia, R., Goldstein, L., Gordon, L., 1990. Poisson approximation and the Chen–Stein method. Statist. Sci. 5, 403– 434. Berman, M., Eagleson, G.K., 1985. A useful upper bound for the tail probabilities when the sample size is large. J. Amer. Statist. Assoc. 80, 886– 889. Dembo, A., Karlin, S., 1992. Poisson approximations for r-scan process. Ann. Appl. Probab. 2, 329–357. Glaz, J., 1993. Approximations for the tail probabilities and moments of the scan statistic. Statist. Med. 12, 1845–1852. Glaz, J., Naus, J., Roos, M., Wallenstein, S., 1994. Poisson approximations for the distribution and moments of ordered m-spacings. Ann. Appl. Probab. 4, 271–281. Huer, F.W., Lin, C.T., 1997. Approximating the distribution of the scan statistic using moments of the number of clumps. J. Amer. Statist. Assoc. 92, 1466–1475. Johnson, N.L., Kotz, S., Kemp, A.W., 1992. Univariate Discrete Distributions. Wiley, New York. Karlin, S., Macken, C., 1991. Some statistical problems in the assessment of inhomogeneities of DNA sequence data. J. Amer. Statist. Assoc. 86, 27–35. Naus, J.I., 1982. Approximations for distributions of scan statistics. J. Amer. Statist. Assoc. 77, 177–183. Newell, G.F., 1963. Distribution for the smallest distance between any pair of kth nearest-neighbor random points on a line. Time Series Analysis, Proceedings of a Conference, Brown Univ., pp. 89 –103. Roos, M., 1993. Compound Poisson approximations for the numbers of extreme spacings. Adv. Appl. Probab. 25, 847–874. Sherbrooke, C.C., 1968. Discrete compound Poisson processes and tables of the geometric Poisson distribution. Naval Res. Logist. Quart. 15, 189–203. Su, X., Wallenstein, S., 1999. New approximations for the distribution of the r-scan statistic. Technical Report, Mount Sinai School of Medicine.