
Journal of Statistical Planning and Inference 91 (2000) 91–105

www.elsevier.com/locate/jspi

Estimating the probability mass of unobserved support in random sampling

A. Almudevar^a, R.N. Bhattacharya^b, C.C.A. Sastri^c,*

a Department of Mathematics and Computer Science, St. Mary's University, Halifax, NS, Canada B3H 3C3
b Department of Mathematics, Indiana University, Bloomington, IN 47405, USA
c Department of Mathematics and Statistics, Dalhousie University, Halifax, NS, Canada B3H 3J5

Received 15 June 1999; received in revised form 1 February 2000; accepted 8 March 2000

Abstract

The problem of estimating the probability mass of the support of a distribution not observed in random sampling is considered in the case where the distribution is discrete. An example of a situation in which the problem arises is that of species sampling: suppose that one wishes to determine the species of fish native to a body of water and that, after repeated sampling, one identifies a certain number of species. The problem is to estimate the proportion of the fish population belonging to the unobserved species. Since it is a rare event, ideas from large deviation theory play a role in answering the question. The result depends on the underlying distribution, which is unknown in general. Methods similar to nonparametric bootstrapping are therefore used to prove a limit theorem and obtain a confidence interval for the rate function. © 2000 Elsevier Science B.V. All rights reserved.

MSC: primary 62E20; secondary 60F10

Keywords: Unobserved support; Large deviations; Bootstrapping

1. Introduction

The purpose of this paper is to estimate the probability mass of the support of a distribution not observed in random sampling, given that the underlying distribution is discrete. Such a situation arises, for example, in the problem of species sampling.



* Corresponding author.
E-mail addresses: [email protected] (A. Almudevar), [email protected] (R.N. Bhattacharya), [email protected] (C.C.A. Sastri).
0378-3758/00/$ - see front matter © 2000 Elsevier Science B.V. All rights reserved. PII: S0378-3758(00)00125-7


Suppose that one wishes to determine the number and types of species of fish that are native to a body of water, or of animals to a rainforest. Suppose that certain species have been identified after a number of trials, say n trials have been conducted. The problem is to estimate the probability, for large n, that one would observe a new species on the next trial.

The statistical problem of estimating unseen support was first considered by Good (1953). If z_n denotes the size of the unseen support after n trials, Good proposed that its expectation E(z_n) be estimated by a_1(n)/n, where a_1(n) is the number of sample points observed exactly once in the first n observations. Robbins (1968) showed that E(z_n − a_1(n)/n)² < (n + 1)⁻¹. The investigation was carried further by Starr (1979), Esty (1982), Clayton and Frees (1987), Nayak (1992), Yatracos (1995), and Nicoleris (preprint). Our approach is different in the sense that we obtain, by large deviation methods or otherwise, estimates not of the moments of z_n but of the tail probabilities of z_n. Also, using methods similar to bootstrapping, we prove a limit theorem for the rate function, defined below, and obtain a confidence region for it.

The actual number of species may be finite or infinite; we shall consider both cases. Thus, let S = {1, 2, …, N} be the support of a probability distribution π = {π_i, 1 ≤ i ≤ N}, with π_i = π({i}). Possibly N = ∞, in which case S = ℕ, where ℕ is the set of all natural numbers. By relabeling the states if necessary, one may assume that π_i > 0 for all i in S. One can think of π as the true distribution, with the whole of S for its support. Let M(S) be the set of all probability measures with support contained in S. Let M = M(ℕ). Let M̄ denote the set of all measures ν on ℕ such that 0 < ν(ℕ) ≤ 1. For ν ∈ M̄ write ν̄ = ν/ν(ℕ), so that ν̄ ∈ M. For any measure ν = (ν₁, ν₂, …) ∈ M̄, let S(ν) = {i ∈ ℕ : ν_i > 0} be the support of ν. Fixing the measure π, we define z(ν) = π({i : ν_i = 0}).
Let {x_i, i = 1, 2, …} be an i.i.d. sequence of random variables defined on a probability space (Ω, F, P) with common distribution π. Let the empirical distribution generated by x₁, x₂, …, x_n be L_n = (π̂₁, π̂₂, …, π̂_N), where π̂_j = (1/n) Σ_{i=1}^n δ_j(x_i), 1 ≤ j ≤ N, with δ denoting the Kronecker δ-function. Then z(L_n) may be interpreted as the conditional probability that x_{n+1} equals an element of S(π) not represented among x₁, x₂, …, x_n. Our aim is to estimate, for large n, the probability that z(L_n) ≥ s for any s > 0. This is P(L_n ∈ {ν ∈ M : z(ν) ≥ s}). Since the event we are interested in is rare, large deviation ideas play a role in estimating its probability. We shall carry out the estimation first by using Sanov's theorem for discrete spaces. Once the result is known, we perform a more direct calculation to obtain the same result. We believe it is instructive to present both methods. The asymptotic value of the above probability may be expressed as (1 − s*(s))^n, with the rate function s*(s) given by (4) below. We carry out the estimation of the rate function by methods similar to nonparametric bootstrapping. We prove a limit theorem and construct a confidence interval. Out of this work emerge several questions, of a practical as well as theoretical nature.
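As a concrete illustration of the quantities just introduced, the following Python sketch (ours, not part of the paper; the distribution and sample size are arbitrary choices) computes Good's estimate a_1(n)/n alongside the true unseen mass z(L_n) for one simulated sample:

```python
import random
from collections import Counter

def good_estimate(sample):
    """Good's (1953) estimate a_1(n)/n of E(z_n): the fraction of
    sample points observed exactly once."""
    counts = Counter(sample)
    return sum(1 for c in counts.values() if c == 1) / len(sample)

def unseen_mass(sample, pi):
    """z(L_n): the mass that pi assigns to support points not
    represented in the sample."""
    seen = set(sample)
    return sum(p for i, p in enumerate(pi) if i not in seen)

# A hypothetical geometric-type distribution on 25 points.
random.seed(1)
pi = [2.0 ** -(i + 1) for i in range(24)]
pi.append(1.0 - sum(pi))                    # masses sum to one
sample = random.choices(range(25), weights=pi, k=100)
print(good_estimate(sample), unseen_mass(sample, pi))
```

In repeated simulations the two quantities tend to be of comparable size, in keeping with Robbins' bound on E(z_n − a_1(n)/n)².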


2. Estimates

We recall Sanov's theorem for empirical measures on countable spaces: if Γ ⊂ M is measurable and π ∈ M, then

    −inf_{ν∈Γ⁰} H(ν | π) ≤ lim inf_{n→∞} (1/n) log P(L_n ∈ Γ)
                         ≤ lim sup_{n→∞} (1/n) log P(L_n ∈ Γ)
                         ≤ −inf_{ν∈Γ̄} H(ν | π),                              (1)

where Γ⁰ and Γ̄ are the interior and closure, respectively, of Γ in the weak topology, and H(ν | π) = Σ_{i∈ℕ} ν_i log(ν_i/π_i) is the conditional entropy of ν given π. In this definition, it is not necessary to have π_i > 0 for all i. Indeed, if ν_i = 0 = π_i the corresponding summand is taken to be zero. We wish to apply Sanov's theorem to the set Γ_s = {ν ∈ M : z(ν) ≥ s}, 0 ≤ s ≤ 1. We first verify that Γ_s is closed.

Lemma 2.1. The set Γ_s is closed for any s ∈ [0, 1].

Proof. Let {μ^(n)} be a convergent sequence in Γ_s with limit μ, convergence here meaning the weak convergence of probability measures, denoted by ⇒, as usual. Let B = {i : μ_i = 0} and A_n = {i : μ_i^(n) = 0} (n ≥ 1). Then A := ∩_{m=1}^∞ ∪_{n=m}^∞ A_n ≡ {i : μ_i^(n) = 0 for infinitely many n} ⊂ B, since μ^(n) ⇒ μ. Hence z(μ) = π(B) ≥ π(A) = lim_{m→∞} π(∪_{n=m}^∞ A_n) ≥ lim sup_{m→∞} π(A_m) ≥ s. Thus μ ∈ Γ_s, proving Γ_s is closed.

We will need the following lemma involving the minimum value of the conditional entropy conditioned on improper probability measures. We first extend the definition of H: let

    H*(μ | ν) = Σ_{i∈S(ν)} μ_i log(μ_i/ν_i),                                 (2)

where μ ∈ M and ν ∈ M̄. We then have

Lemma 2.2. For any ν = (ν₁, ν₂, …) ∈ M̄,

    inf_{μ∈M} H*(μ | ν) = −log Σ_{i∈ℕ} ν_i = H*(ν̄ | ν).

Proof. We have

    H*(μ | ν) + log Σ_{i∈ℕ} ν_i = Σ_i μ_i log(μ_i/ν̄_i) = H*(μ | ν̄).

Since ν̄ ∈ M,

    inf_{μ∈M} H*(μ | ν̄) = H*(ν̄ | ν̄) = 0,

and the result follows.
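Lemma 2.2 is easy to check numerically. The following Python sketch (illustrative only; the improper measure ν with total mass 0.8 is an arbitrary choice) verifies that the normalized measure ν̄ attains the infimum −log ν(ℕ), and that other probability measures do no better:

```python
import math
import random

def h_star(mu, nu):
    """Extended conditional entropy H*(mu | nu), summed over the
    support of nu; summands with mu_i = 0 are taken to be zero."""
    return sum(m * math.log(m / v) for m, v in zip(mu, nu) if v > 0 and m > 0)

nu = [0.4, 0.25, 0.1, 0.05]               # improper: total mass 0.8
total = sum(nu)
nu_bar = [v / total for v in nu]          # the normalized measure

# The infimum over probability measures equals -log(total mass),
# attained at nu_bar.
print(h_star(nu_bar, nu), -math.log(total))

# A randomly chosen probability measure on the same support is no smaller.
random.seed(0)
w = [random.random() for _ in nu]
mu = [x / sum(w) for x in w]
print(h_star(mu, nu) >= h_star(nu_bar, nu))
```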


It will be convenient to express Γ_s in terms of sets of a simpler structure. To do this, let

    A_s = {A ⊂ ℕ : π(A) ≥ s},  s ∈ [0, 1],
    G_A = {ν ∈ M : ν(A) = 0}.                                                (3)

A lower bound for P({L_n ∈ Γ_s}) is derived easily. We have {L_n ∈ Γ_s} = {z(L_n) ≥ s} ⊃ {A ⊂ S(L_n)^c for some A ∈ A_s}. Hence,

    P({L_n ∈ Γ_s}) ≥ sup_{A∈A_s} P({A ⊂ S(L_n)^c})
                   = sup_{A∈A_s} (1 − π(A))^n
                   = (1 − s*)^n,

where

    s* = s*(s) = inf{π(A) : A ∈ A_s}.                                        (4)

This gives lim inf_{n→∞} (1/n) log P({L_n ∈ Γ_s}) ≥ log(1 − s*). We now verify that this bound is the same as the upper bound given by Sanov's theorem. We first observe that

    Γ_s = ∪_{A∈A_s} G_A.

So

    inf_{ν∈Γ_s} H(ν | π) = inf_{A∈A_s} inf_{ν∈G_A} H(ν | π).

Let π_A be the improper probability measure obtained from π by setting the probabilities of all the elements of A equal to zero. Then

    inf_{ν∈G_A} H(ν | π) = inf_{ν∈M} H*(ν | π_A) = −log(1 − π(A))

by Lemma 2.2. We conclude that

    inf_{ν∈Γ_s} H(ν | π) = inf_{A∈A_s} (−log(1 − π(A))) = −log(1 − s*).

By Lemma 2.1, Γ_s is closed, so that the upper bound in Sanov's theorem applies. Thus, combining the lower bound obtained above and the upper bound in Sanov's theorem, we get

Theorem 2.3. lim_{n→∞} (1/n) log P({z(L_n) ≥ s}) = log(1 − s*).

We shall now derive the upper bound given by Sanov's theorem without using the theorem.
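Theorem 2.3 can be illustrated by simulation. The Python sketch below (our illustration with hypothetical parameters: a uniform distribution on four points, n = 20) checks that the simulated probability P(z(L_n) ≥ s) falls between the lower bound (1 − s*)^n derived above and the crude finite-support union bound 2^N(1 − s*)^n noted in the remarks below:

```python
import random

def rate(pi, s):
    """s*(s) = inf{pi(A) : pi(A) >= s}, computed by brute-force
    enumeration of all attainable subset masses (small supports only)."""
    sums = {0.0}
    for p in pi:
        sums |= {t + p for t in sums}
    return min(t for t in sums if t >= s - 1e-12)

random.seed(2)
pi = [0.25] * 4                      # uniform on four points
n, s, trials = 20, 0.2, 100_000
s_star = rate(pi, s)                 # here s* = 0.25

hits = 0
for _ in range(trials):
    sample = random.choices(range(4), weights=pi, k=n)
    unseen = sum(p for i, p in enumerate(pi) if i not in set(sample))
    hits += unseen >= s
prob = hits / trials                 # Monte Carlo estimate of P(z(L_n) >= s)

# Sandwich: (1 - s*)^n <= P(z(L_n) >= s) <= 2^N (1 - s*)^n.
low, high = (1 - s_star) ** n, 2 ** 4 * (1 - s_star) ** n
print(low <= prob <= high)
```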


Let S = {1, 2, …}, S_k = {1, 2, …, k}, and ε_k = Σ_{j≥k+1} π_j, k = 1, 2, …. Let s ∈ [0, 1] and, for given s and k, let A_{k,s} = {A ⊂ S_k : π(A) ≥ s*(s) − ε_k}. If s*(s) > 0, then s*(s) − ε_k > 0 for all sufficiently large k and, for such k, one has

    P({z(L_n) ≥ s}) = P({z(L_n) ≥ s*(s)}) = P({π(S(L_n)^c) ≥ s*(s)})
                    ≤ P({π(S(L_n)^c ∩ S_k) ≥ s*(s) − ε_k})
                    = P({S(L_n)^c ⊃ A for some A ∈ A_{k,s}})
                    ≤ Σ_{A∈A_{k,s}} P({S(L_n)^c ⊃ A})
                    ≤ 2^k (1 − s*(s) + ε_k)^n.

If s*(s) = 0, then the above relations hold trivially. Hence

    lim sup_{n→∞} (1/n) log P({z(L_n) ≥ s}) ≤ log(1 − s*(s) + ε_k)   ∀k ≥ 1.

Letting k ↑ ∞, we get

    lim sup_{n→∞} (1/n) log P({z(L_n) ≥ s}) ≤ log(1 − s*(s)).

Remarks. (1) Suppose S is finite, S = {1, 2, …, N}. Then ε_N = 0, and the upper bound becomes

    P({z(L_n) ≥ s}) ≤ 2^N (1 − s*(s))^n.

(2) From the inequalities derived above, one may write

    P({z(L_n) ≥ s}) ≤ inf_{k,S_k} 2^k (1 − s*(s) + ε_k)^n,

where the infimum is taken over all k, and for each k, over all subsets of S of size k.

(3) The lower bound given by Sanov's theorem presents a problem in our case, since the interior of Γ_s is empty when s > 0. This is the case since, for any ν ∈ M, there exists a sequence {ν^(n)} in M with ν as the limit and z(ν^(n)) = 0 for all n. The combinatorial proof of Sanov's theorem for distributions with finite support (as given, for example, in Dembo and Zeitouni, 1993) can be altered in a simple way to show that, in our case, the lower bound is the same as the upper bound. However, we have been unable to do the same in the case where the support is infinite. Fortunately, one does not need Sanov's theorem to show that the upper and lower bounds are the same, as has already been demonstrated.

(4) It is clear from the foregoing discussion that an understanding of the properties of s* as a function of s would be useful, in fact necessary, for carrying out estimation of the desired probabilities. From the definition, it follows that s* is an increasing (i.e., nondecreasing) function of s. It also follows that s*(0) = 0 and s*(1) = 1. Now, let S be finite, and π = (π₁, π₂, …, π_N). Let the values of π(A), as A varies over the power set 2^S, be arranged in increasing order and given by m = (m₀, m₁, …, m_p). Then m₀ = 0 and m_p = 1. If π is the degenerate measure with π_N = 1, then m = (0, 1), and if π


is the uniform distribution π = (1/N, 1/N, …, 1/N), then m = (0, 1/N, 2/N, …, 1). In the general case, it follows from the definition of s* that s*(m_i) = m_i ∀i. Moreover, if m_{i−1} < s < m_i, i = 1, 2, …, p, then s*(s) = m_i. Thus s* is a step function which is left-continuous in (0, 1] with jumps at precisely the points s = m₁, m₂, …, m_{p−1}. These are also the points at which the graph of s* intersects the line y = s.

(5) Observe that s* does not characterize a distribution, i.e., several distinct distributions may give rise to the same s* function. In fact, any two distributions with the same vector m have the same s* function. For example, π = (1/4, 1/4, 1/4, 1/4) and π′ = (1/4, 1/4, 1/2) have the same s* function, since for both of them m = (0, 1/4, 1/2, 3/4, 1). In general, m is a vector of order no greater than 2^N. However, m can be of order 2^N even if π is not degenerate. For example, if π = (1/4, 1/3, 5/12), then m = (0, 1/4, 1/3, 5/12, 7/12, 2/3, 3/4, 1).

In general, it is difficult to calculate s* if S is infinite. For instance, it is not clear what s* is if π is Poisson with parameter λ. It is not even clear whether s* is left-continuous for infinite S. We now give an interesting example where s* can be explicitly computed.

Example. Let n ≥ 2 be an integer, and let S = {1, 2, …}. On S, let π = (π₁, π₂, …) be a measure given by the following rule:

    π_i = 1/n^j   if i = (n−1)j − (n−2), (n−1)j − (n−3), …, (n−1)j − 1, or (n−1)j,   j = 1, 2, ….

Then

    π = (1/n, …, 1/n, 1/n², …, 1/n², …),

with each value 1/n^j repeated (n−1) times, so that Σ_{i=1}^∞ π_i = 1. Let s ∈ [0, 1]. Consider the n-ary expansion of s:

    s = ·a₁a₂… = Σ_{j=1}^∞ a_j/n^j,

where for each j, a_j = 0, 1, 2, …, n − 1. Let A be a set of natural numbers defined as follows: if a_j = 1, put exactly one of

    (n−1)j − (n−2), (n−1)j − (n−3), …, (n−1)j − 1, and (n−1)j

in A; if a_j = 2, put exactly two of the above integers in A, and so on; if a_j = n − 1, put all of the (n−1) above integers in A. Then, clearly, π(A) = s, and hence s*(s) = s. We observe that the geometric distribution for which p = q = 1/2, i.e., π_i = 1/2^i ∀i, corresponds to the case n = 2. Similarly, if π({r}) = 2/3^r (r = 1, 2, …), then s*(s) is the Cantor middle-third distribution function, which is singular (Billingsley, 1995).
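For finite supports, the step function s* of Remark (4) can be computed directly by enumerating the attainable values π(A). The following Python sketch (ours, not the authors'; exact arithmetic via fractions avoids rounding in the subset sums) reproduces the vectors m discussed in Remark (5):

```python
from fractions import Fraction as F

def attainable(pi):
    """The vector m of Remark (4): all attainable values of pi(A),
    built up one atom at a time as subset sums."""
    sums = {F(0)}
    for p in pi:
        sums |= {t + p for t in sums}
    return sorted(sums)

def s_star(pi, s):
    """s*(s) = inf{pi(A) : pi(A) >= s}: the smallest attainable value
    not below s, a left-continuous step function of s."""
    return min(t for t in attainable(pi) if t >= s)

pi1 = [F(1, 4)] * 4
pi2 = [F(1, 4), F(1, 4), F(1, 2)]
print(attainable(pi1) == attainable(pi2))   # the same vector m for both
print(s_star(pi1, F(3, 10)))                # jumps up to the next level, 1/2
```

Here π = (1/4, 1/4, 1/4, 1/4) and π′ = (1/4, 1/4, 1/2) produce the same m = (0, 1/4, 1/2, 3/4, 1), and hence the same s*, as noted in Remark (5).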


3. Estimation using empirical distributions

Since the underlying distribution π is unknown in general, we will estimate it by L_n, and denote by L_n* the empirical distribution based on a random sample of size n taken with replacement from L_n. That is, L_n and L_n* play the roles of π and L_n, respectively. Let s_n*(s) denote the corresponding rate function. Then s_n*, for each n, has the same properties as s*: s_n* is an increasing function, s_n*(0) = 0, s_n*(1) = 1, etc. In addition, we have

Theorem 3.1. lim_{n→∞} s_n*(s) = s*(s) a.s. at all points of continuity of s*.

Proof. Let n be a positive integer and s ∈ [0, 1]. Let s̄(s) = lim sup_{n→∞} s_n*(s) and s̲(s) = lim inf_{n→∞} s_n*(s). Let A_{s,n} = {A ⊂ S : L_n(A) ≥ s}. Recall that A_s = {A ⊂ S : π(A) ≥ s}. If π = (π₁, π₂, …) and L_n = (π̂₁, π̂₂, …), let |L_n − π| ≡ Σ_{k=1}^∞ |π̂_k − π_k| be the L¹-distance between L_n and π. Then it follows from the strong law of large numbers and Scheffé's theorem that lim_{n→∞} |L_n − π| = 0 a.s. Let ε be such that 0 < ε < s. Then, for large enough n, |L_n − π| ≤ ε a.s. Let E_{n,ε} = {|L_n − π| ≤ ε}. Then, on E_{n,ε},

    π(A) − ε ≤ L_n(A) ≤ π(A) + ε   ∀A.

This implies that, on E_{n,ε},

    A_{s+2ε} ⊂ A_{s+ε,n} ⊂ A_s ⊂ A_{s−ε,n}.

Thus, on E_{n,ε},

    inf_{A∈A_s} L_n(A) ≤ inf_{A∈A_s} (π(A) + ε) = s*(s) + ε

and so

    inf_{A∈A_{s−ε,n}} L_n(A) ≤ s*(s) + ε,

or s_n*(s−ε) ≤ s*(s) + ε. Letting n → ∞ first and then letting ε ↓ 0, we get

    s̄(s−) ≤ s*(s)   ∀s (a.s.).                                              (5)

Similarly, the inequalities (on E_{n,ε})

    inf_{A∈A_s} (π(A) − ε) ≤ inf_{A∈A_s} L_n(A) ≤ inf_{A∈A_{s+ε,n}} L_n(A)

yield s*(s) − ε ≤ s_n*(s+ε). Letting n → ∞ first and then letting ε ↓ 0, one has

    s*(s) ≤ s̲(s+)   ∀s (a.s.).                                              (6)

Using (5) and monotonicity, with s + ε in place of s, we get (with probability one)

    s̄(s+) ≤ s*(s+)   ∀s (a.s.).                                             (7)


Using (6) and (7) we get

    s*(s) ≤ s̲(s+) ≤ s̄(s+) ≤ s*(s+).                                        (8)

Therefore, if s is a point of continuity of s*, we get lim_{n→∞} s_n*(s) = s*(s), a.s.

We observe that on E_{n,ε} = {|L_n − π| ≤ ε} we have, ∀s,

    s_n*(s−ε) − ε ≤ s*(s) ≤ s_n*(s+ε) + ε.                                  (9)

This gives a confidence region for the rate function s*, of confidence level at least P(E_{n,ε}) = 1 − e^{−nI(1+o(1))}, where I is given by the Cramér–Chernoff theorem (see Dembo and Zeitouni, 1993). For finite S a more useful result is the following theorem. To state it, let S = {1, 2, …, k}. Then {√n(π̂_r − π(r)), 1 ≤ r ≤ k} converges in distribution to a k-variate normal. Let π̂*_{n,r} denote the proportion of r in a bootstrap sample (of size n) from the empirical L_n = (π̂₁, π̂₂, …), 1 ≤ r ≤ k. From the standard theory of the bootstrap for the sample mean (see Efron, 1979) it follows that the difference between the bootstrap probability P*(Σ_r √n|π̂*_{n,r} − π̂_r| ≤ ε) and P(E_{n,ε/√n}) goes to zero a.s. as n → ∞. In this sense we say that the bootstrap estimate of P(E_{n,ε/√n}) is consistent. We then have

Theorem 3.2. Suppose S is finite. Then, for any ε > 0, the coverage probability of the confidence interval for s* given by [s_n*(s − ε/√n) − ε/√n ≤ s*(s) ≤ s_n*(s + ε/√n) + ε/√n, 0 ≤ s ≤ 1] is at least P(E_{n,ε/√n}). A consistent bootstrap estimate of P(E_{n,ε/√n}) is given by P*(Σ_r √n|π̂*_{n,r} − π̂_r| ≤ ε).

4. Numerical examples

A numerical example of the calculations proposed in this article will proceed in two steps. The approximation suggested by Theorem 2.3 for the cumulative distribution function (cdf) F_n of z(L_n) is given by

    F_n(t) ≈ 1 − (1 − s*(t))^n.                                             (10)

As discussed above, an estimate of the cdf based on the rate function will necessarily overestimate the true cdf. We first calculate the rate function for a variety of distributions to see if (10) gives an adequate approximation. The geometric distribution with parameter p set to 0.5 and 0.9 will be used, as well as a uniform distribution on a set of size m = 25. The calculation of the rate function is straightforward for the uniform distribution since there are only m possible values for s*. The rate function for the geometric distribution with p = 0.5 is simply s*(t) = t, as discussed earlier. In the absence of a special form for the distribution, the rate function is calculated by constructing the set of all attainable values of the probability set function π(A). This is obviously not calculable exactly for distributions with infinite support. However, if we proceed by simply excluding all probabilities below a certain value and calculating s*


as though for a distribution with finite support, then the resulting rate function must be within ε of the true one, where ε is the sum of the excluded probabilities.

The enumeration of subsets is a computationally difficult problem. It can be performed more efficiently by using a recursive algorithm. If A is a set of real numbers, let H(A) be the set of all possible sums of elements of A. For any constant k let A + k denote the set formed by adding k to all elements of A. The following recursive relationship then holds for m ≥ 2:

    H({a₁, …, a_m}) = {a₁} ∪ H({a₂, …, a_m}) ∪ (H({a₂, …, a_m}) + a₁),

with H({a}) = {a} for any singleton {a}.

Fig. 1 gives the cdf of z(L_n), estimated by simulations of 100,000 trials for each of the test distributions for n = 100 and 200, as well as the cdf estimated using (10). The approximation is almost exact for the geometric (p = 0.9), within a reasonable neighborhood for the geometric (p = 0.5), but not representative for the uniform distributions, although the approximation is at least of the same order of magnitude. In particular, for the uniform cdf (n = 200) we have by the rate function P(z(L_n) ≥ 0.04) = 0.000285, whereas the true value is approximately P(z(L_n) ≥ 0.04) = 0.0073. The two values differ by a factor of about 25, as would be expected.

To evaluate the bootstrap, for each of the three test distributions a sample of size n = 100 was simulated, resulting in an empirical distribution function F̂. Then 10,000 replicates were constructed by simulating random samples of size n = 100 from F̂. For each replicate a rate function was calculated. Then upper and lower 95% confidence limits for z ranging from 0 to 0.1 were constructed from the replicate rate functions. The rate function calculated from F̂ was also calculated, serving as a point estimate. Figs. 2–4 show the resulting rate function estimates, superimposed on the true rate functions, as well as the resulting estimates of the cdf of z(L_n) for n = 100, 200, also superimposed on the true cdf. Table 1 gives numerical values for the tail probabilities P(z(L_n) ≥ z) for selected values of z.

Clearly, tail probabilities are not being estimated with the precision expected from well-behaved parametric estimation problems. However, the differences in the base 10 logarithms are generally within 1, and do not exceed 1.39 in magnitude, except for the geometric (p = 0.9). With respect to the geometric distribution with p = 0.9, if we examine the rate function in Fig. 3, we see that the crucial features for calculating the tail probabilities in the range of interest are the changes in value which occur at about z = 0.001 and 0.01. This is not detected in our estimate, so that any attempt at tail probability estimation will not succeed.

With respect to the bootstrap, one peculiar problem is that the rate function is bounded below by the identity function, which is achieved by certain distributions, in this section by the geometric (p = 0.5). This suggests that the usual practice of using upper and


Fig. 1. Cumulative distribution functions of z(Ln ) for six test distributions. The solid line is the true cdf and the dashed line is the estimate of the cdf obtained by using the rate function.
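A comparison of the kind shown in Fig. 1 can be reproduced in outline as follows (a simplified Python sketch of our own, using the uniform m = 25 case, far fewer trials than the 100,000 used for the figure, and arbitrarily chosen evaluation points):

```python
import math
import random

m, n, trials = 25, 100, 2000
pi = [1.0 / m] * m

def z_of_sample(sample):
    """z(L_n): mass of the support points missed by the sample."""
    seen = set(sample)
    return sum(p for i, p in enumerate(pi) if i not in seen)

def s_star(t):
    """For the uniform distribution the attainable masses are j/m,
    so s*(t) is t rounded up to the next multiple of 1/m."""
    return math.ceil(t * m - 1e-9) / m

random.seed(3)
zs = [z_of_sample(random.choices(range(m), k=n)) for _ in range(trials)]

for t in (0.02, 0.04, 0.08):
    empirical = sum(z >= t for z in zs) / trials   # simulated P(z(L_n) >= t)
    approx = (1 - s_star(t)) ** n                  # rate-function estimate
    print(t, empirical, approx)
```

Consistent with the discussion above, the rate-function values for the uniform case come out well below the simulated tail probabilities, though of a roughly comparable order of magnitude.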


Fig. 2. The top graph shows the estimate of the rate function of z(L_n) for the geometric distribution (p = 0.5) in the range [0.0, 0.1], with bootstrap upper and lower confidence limits based on a sample of size 100. The subsequent graphs show the resulting estimate of the cdf of z(L_n) for n = 100, 200. The estimate and bootstrap limits are shown as dashed lines. The true distribution is superimposed as a solid line.


Fig. 3. The top graph shows the estimate of the rate function of z(L_n) for the geometric distribution (p = 0.9) in the range [0.0, 0.1], with bootstrap upper and lower confidence limits based on a sample of size 100. The subsequent graphs show the resulting estimate of the cdf of z(L_n) for n = 100, 200. The estimate and bootstrap limits are shown as dashed lines. The true distribution is superimposed as a solid line.


Fig. 4. The top graph shows the estimate of the rate function of z(L_n) for the uniform distribution (m = 25) in the range [0.0, 0.1], with bootstrap upper and lower confidence limits based on a sample of size 100. The subsequent graphs show the resulting estimate of the cdf of z(L_n) for n = 100, 200. The estimate and bootstrap limits are shown as dashed lines. The true distribution is superimposed as a solid line.
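The bootstrap procedure used for the figures can be outlined in Python roughly as follows (a heavily simplified sketch of our own: a hypothetical three-point distribution, 200 replicates rather than 10,000, and percentile limits at a single point t rather than over a grid):

```python
import random
from collections import Counter

def rate_fn(pi, t):
    """s*(t) by subset-sum enumeration (small supports only)."""
    sums = {0.0}
    for p in pi:
        if p > 0:
            sums |= {x + p for x in sums}
    return min(x for x in sums if x >= t - 1e-9)

random.seed(4)
pi = [0.25, 0.25, 0.5]                      # illustrative distribution
n, B, t = 100, 200, 0.1

sample = random.choices(range(3), weights=pi, k=n)
counts = Counter(sample)
p_hat = [counts[i] / n for i in range(3)]   # empirical distribution L_n
point = rate_fn(p_hat, t)                   # point estimate of s*(t)

reps = []                                   # bootstrap replicate rate functions
for _ in range(B):
    boot = Counter(random.choices(range(3), weights=p_hat, k=n))
    reps.append(rate_fn([boot[i] / n for i in range(3)], t))
reps.sort()
lower, upper = reps[int(0.025 * B)], reps[int(0.975 * B) - 1]
print(lower, point, upper)
```

As the discussion of Table 1 notes, bias can push the point estimate toward (or onto) a percentile limit, so the usual percentile interval should be interpreted with caution here.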


Table 1
For six distributions, various tail probabilities are given, with the point estimate obtained from the empirical rate function and lower and upper 95% limits obtained from the bootstrap replicates. "Log diff." is the logarithm difference, defined as the difference between the base 10 logarithms of the true tail probability and the point estimate.

Distr.              z          P(z(L_n) ≥ z)   Lower 95% limit   Point estimate   Upper 95% limit   Log diff.
Geometric p = 0.5   3.13E−02   4.37E−02        2.39E−04          1.69E−02         1.69E−02           0.41
n = 100             4.69E−02   9.25E−03        8.02E−05          5.92E−03         5.92E−03           0.19
                    7.03E−02   9.20E−04        2.81E−06          2.39E−04         7.05E−04           0.59

Geometric p = 0.5   1.56E−02   4.56E−02        3.51E−05          1.76E−02         1.76E−02           0.41
n = 200             2.54E−02   8.89E−03        4.97E−07          2.26E−03         2.26E−03           0.59
                    3.52E−02   9.50E−04        5.72E−08          2.85E−04         2.85E−04           0.52

Geometric p = 0.9   1.00E−03   4.05E−01        2.68E−08          2.66E−05         5.92E−03           4.18
n = 100             9.10E−03   3.69E−01        2.68E−08          2.66E−05         5.92E−03           4.14
                    1.00E−02   5.00E−05        2.68E−08          2.66E−05         5.92E−03           0.27

Geometric p = 0.9   1.00E−03   1.64E−01        7.18E−16          7.06E−10         3.51E−05           8.37
n = 200             1.00E−02   1.11E−16        7.18E−16          7.06E−10         3.51E−05          −6.80

Uniform m = 25      4.00E−02   6.10E−02        1.69E−02          5.92E−03         1.69E−02           1.01
n = 100             8.00E−02   5.90E−03        2.39E−04          2.39E−04         2.39E−04           1.39

Uniform m = 25      4.00E−02   1.00E−05        2.85E−04          3.51E−05         2.85E−04          −1.30
n = 200

lower confidence limits will not always be appropriate. In general, bias plays a large role in this inference problem, and any complete solution of the problem must pay particular attention to this.

5. Concluding remarks

As indicated earlier, the problem of estimating unseen support has been considered by several authors. Our approach is different in the sense that previous authors considered mainly the estimation of E(z_n), the expectation of the unseen support, whereas we are focusing on the tail probabilities of z_n. It is worth noting that the rates at which E(z_n) → 0 as n → ∞ in the finite and infinite support cases are different. In fact, in the finite support case,

    E(z_n) = Σ_{i=1}^N π_i (1 − π_i)^n → 0 exponentially as n → ∞.

We have here used the relation z_n = Σ_{i=1}^N π_i 1{i ∈ S(L_n)^c}. In the case of infinite support, we have the following result:

Lemma 4.1. If the support of π is infinite, then there exists a sequence {k_i} of positive integers such that

    lim inf_{i→∞} k_i E(z_{k_i}) ≥ e⁻¹.


Proof. We shall assume that π_i > 0 ∀i. Let k_i denote the integral part of 1/π_i: k_i = [1/π_i]. Then

    E(z_{k_i}) ≥ π_i (1 − π_i)^{k_i}.

Since π is a distribution, lim_{i→∞} π_i = 0, so that k_i → ∞ as i → ∞. Now lim_{i→∞} k_i π_i = 1 and lim_{i→∞} (1 − π_i)^{k_i} = e⁻¹, so that

    lim inf_{i→∞} k_i E(z_{k_i}) ≥ e⁻¹.
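The contrast between the two rates can be seen numerically. The sketch below (our illustration; a long truncation of the geometric distribution π_i = 2^−i, for which k_i = [1/π_i] is a power of two) evaluates k_i E(z_{k_i}) using the exact formula E(z_n) = Σ_i π_i(1 − π_i)^n:

```python
import math

def expected_unseen(pi, n):
    """E(z_n) = sum_i pi_i (1 - pi_i)^n, the finite-support formula,
    applied here to a long truncation of an infinite support."""
    return sum(p * (1 - p) ** n for p in pi)

# Truncated geometric distribution pi_i = 2^-(i+1), i = 0, 1, ...
pi = [2.0 ** -(i + 1) for i in range(60)]
for i in (4, 8, 12):
    k = int(1 / pi[i])                      # k_i = 2^(i+1)
    print(k, k * expected_unseen(pi, k))    # stays above e^-1 ≈ 0.3679
```

So E(z_n) decays only like 1/n here, in contrast with the exponential decay available when the support is finite.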

In a numerical study it was found that the rate function technique was able to give rough estimates of tail probabilities for z(L_n). In many cases of practical interest, the motivation for these estimates is to know whether or not a sample of objects is relatively complete, so knowing the order of magnitude would suffice. In addition, this technique is able to forecast these probabilities for additional sampling. Certainly, before such a technique becomes practical, refinements are needed. In particular, conditions under which the estimates are reliable are needed. Also, the application of the bootstrap to this problem requires further study and refinement, particularly with respect to identifiable sources of bias.

Acknowledgements

One of us (C.C.A.S.) is pleased to take this opportunity to thank Prof. S.R.S. Varadhan and Mr. T. Chelluri for helpful discussions.

References

Billingsley, P., 1995. Probability and Measure, 3rd Edition. Wiley, New York.
Clayton, M.K., Frees, E.W., 1987. Nonparametric estimation of the probability of discovering a new species. J. Amer. Statist. Assoc. 82, 305–311.
Dembo, A., Zeitouni, O., 1993. Large Deviations Techniques and Applications. Jones and Bartlett.
Efron, B., 1979. Bootstrap methods: another look at the jackknife. Ann. Statist. 7, 1–26.
Esty, W., 1982. Confidence intervals for the coverage of low coverage samples. Ann. Statist. 10, 190–196.
Good, I.J., 1953. The population frequencies of species and the estimation of population parameters. Biometrika 40, 237–264.
Nayak, T.K., 1992. On statistical analysis of a sample from a population of unknown species. J. Statist. Plann. Inference 31, 187–198.
Nicoleris, T., 19XX. Inadmissibility of the predictor of the probability of observing a rare species. Preprint.
Robbins, H., 1968. Estimating the total probability of the unobserved outcomes of an experiment. Ann. Math. Statist. 30, 521–548.
Starr, N., 1979. Linear estimation of the probability of discovering a new species. Ann. Statist. 7, 644–652.
Yatracos, Y.G., 1995. On the rare species of a population. J. Statist. Plann. Inference 48, 321–329.