Prediction bands for the EDF of a future sample


Journal of Statistical Planning and Inference 142 (2012) 506–515


Jesse Frey, Department of Mathematics and Statistics, Villanova University, Villanova, PA 19085, United States

Article history: Received 20 January 2010; received in revised form 25 March 2011; accepted 15 August 2011; available online 22 August 2011.

Abstract

We develop both nonparametric and parametric methods for obtaining prediction bands for the empirical distribution function (EDF) of a future sample. These methods yield simultaneous prediction intervals for all order statistics of the future sample, and they also correspond to tests for the two-sample problem. The nonparametric prediction bands correspond to the two-sample Kolmogorov–Smirnov test and related nonparametric tests, but the parametric prediction bands correspond to entirely new parametric two-sample tests. The parametric prediction bands tend to outperform the nonparametric bands when the parametric assumptions hold, but they may have true coverage probabilities well below their nominal levels when the parametric assumptions fail. A new computational algorithm is used to obtain critical values in the nonparametric case.

Keywords: Confidence band; Empirical distribution function; Exponential distribution; Kolmogorov–Smirnov test; Prediction interval

1. Introduction

In prediction problems, we use an initial sample $X_1,\dots,X_n$ to predict the value of some statistic from a future sample. If the future sample contains only a single observation $Y$, then we might create a prediction interval for $Y$. If the future sample contains multiple observations $Y_1,\dots,Y_k$, then we might create a prediction interval for the sample mean $\bar{Y}$ or for some order statistic $Y_{(r)}$ from the future sample. There is an extensive literature on methods for obtaining prediction intervals in nonparametric and parametric settings, and numerous applications of prediction intervals were discussed by Hahn and Meeker (1991).

Standard nonparametric prediction intervals for a single observation $Y$ have the form $(X_{(r)}, X_{(s)})$, where $X_{(1)} \le \cdots \le X_{(n)}$ are the order statistics from the initial sample. Normal-distribution-based prediction intervals for $Y$ have the form $(\bar{X} - aS, \bar{X} + aS)$, where $\bar{X}$ and $S$ are the customary sample mean and sample standard deviation from the initial sample, and exponential-distribution-based prediction intervals for $Y$ have the form $(a\bar{X}, b\bar{X})$. These standard prediction intervals were discussed both by Hahn and Meeker (1991) and, in the nonparametric and normal cases, by Vardeman (1992). Fligner and Wolfe (1979) developed nonparametric prediction intervals for the median of a future sample, again using intervals of the form $(X_{(r)}, X_{(s)})$. Their method can be extended to obtain nonparametric prediction intervals for an arbitrary order statistic from the future sample. Fertig and Mann (1977) studied one-sided normal-distribution-based prediction intervals for an arbitrary order statistic in which the bounds are of the form $\bar{X} + bS$, and it is clear that one could also obtain two-sided prediction intervals of the form $(\bar{X} + aS, \bar{X} + bS)$. Thus, the difference between a prediction interval for a single observation and a prediction interval for an order statistic typically lies not in the form of the interval, but only in the critical values used to obtain the desired coverage probability.

In this paper, we wish to make predictions about multiple aspects of the future sample. In this case, a single prediction interval no longer suffices. Methods for creating simultaneous prediction intervals have been developed by authors such as

E-mail address: [email protected]. doi:10.1016/j.jspi.2011.08.008


Davis and McNichols (1987) and Shih (1988), but these methods are for making predictions about multiple future samples rather than about multiple aspects of a single future sample. To make predictions about multiple aspects of one future sample, we create prediction bands for the empirical distribution function (EDF) of the future sample. The prediction bands that we create are uniform rather than pointwise. As a result, they are equivalent to a set of simultaneous prediction intervals for all order statistics of the future sample. As is typical in prediction problems (Knüsel, 1994; LaMotte and Volaufová, 1999), the prediction bands also correspond to tests for the two-sample problem.

We describe the nonparametric prediction bands in Section 2, and we derive an algorithm for obtaining critical values in Section 3. We develop parametric prediction bands for normal and exponential data in Section 4, and we compare the performance of all of the prediction bands in Section 5. We give our conclusions in Section 6.

2. Nonparametric prediction bands

Suppose that $X_1,\dots,X_n$ is a simple random sample from a continuous distribution $F$, and suppose that we wish to create a prediction band for the EDF of the future sample $Y_1,\dots,Y_k \sim F$. Since this EDF is defined by $\hat{F}_Y(t) \equiv (1/k)\sum_{i=1}^{k} I(Y_i \le t)$, it is a nondecreasing step function that takes values in the set $\{0, 1/k, \dots, 1\}$. We explore two approaches to obtaining prediction bands for $\hat{F}_Y(t)$. The first approach uses the set of all EDFs $G$ that are uniformly close to $\hat{F}_X(t)$, the EDF for the initial sample, in terms of Euclidean distance, and the second approach uses the set of all EDFs $G$ that are uniformly close to $\hat{F}_X(t)$ in a likelihood-based sense. In what follows, we let $X_{(1)} \le \cdots \le X_{(n)}$ be the order statistics from the initial sample, and we define $X_{(0)} \equiv -\infty$ and $X_{(n+1)} \equiv \infty$ for notational convenience.

If the prediction band is the set of all EDFs $G$ that are always within $h$ of $\hat{F}_X(t)$ in terms of Euclidean distance, then the prediction band has lower and upper bounds $L(t)$ and $U(t)$, where $L(t) \equiv \max\{0, \hat{F}_X(t) - h\}$ and $U(t) \equiv \min\{1, \hat{F}_X(t) + h\}$. Thus, the prediction band has the same form as the Kolmogorov–Smirnov confidence band for $F$, and we will refer to it as the Kolmogorov–Smirnov prediction band. However, for a given desired confidence level, the prediction band will typically be much wider than the corresponding confidence band for $F$.

Let the initial and future sample sizes $n$ and $k$ be fixed, and suppose that we wish to find the coverage probability for the prediction band bounded by $L(t) = \max\{0, \hat{F}_X(t) - h\}$ and $U(t) = \min\{1, \hat{F}_X(t) + h\}$, where $h > 0$. This coverage probability is $P(L(t) \le \hat{F}_Y(t) \le U(t)$ for all $t)$, which may be rewritten as $P(|\hat{F}_X(t) - \hat{F}_Y(t)| \le h$ for all $t)$. Thus, the coverage probability is $P(\sup_t |\hat{F}_X(t) - \hat{F}_Y(t)| \le h)$, where the statistic $\sup_t |\hat{F}_X(t) - \hat{F}_Y(t)|$ may be recognized as the two-sample Kolmogorov–Smirnov test statistic. Tables of exact and asymptotic critical values for this test statistic are available in several places in the literature, including Hollander and Wolfe (1999). However, critical values may also be obtained using Algorithm 1 from Section 3.

Suppose that for fixed $n$, $k$, and $h > 0$, we wish to compute $P(L(t) \le \hat{F}_Y(t) \le U(t)$ for all $t)$. Since $L(t)$ is a step function with $L(t) = 0$ for $t < X_{(1)}$, $\hat{F}_Y(t)$ goes below $L(t)$ only if it goes below $L(t)$ at one of the points $X_{(1)},\dots,X_{(n)}$.
Thus, if we define $l_i \equiv \max\{0, i/n - h\}$, then $\hat{F}_Y(t)$ never goes below $L(t)$ if and only if $\hat{F}_Y(X_{(i)}) \ge l_i$ for $i = 1,\dots,n$. Similarly, $\hat{F}_Y(t)$ goes above $U(t)$ only if it goes above $U(t)$ in the limit as $t$ approaches one of the values $X_{(1)},\dots,X_{(n)}$ from the left. Since $F$ is continuous, there is zero probability of a tie between any of the $X$ and $Y$ values. Thus, if we define $u_i \equiv \min\{1, i/n + h\}$, then $\hat{F}_Y(t)$ never goes above $U(t)$ if and only if $\hat{F}_Y(X_{(i)}) \le u_{i-1}$ for $i = 1,\dots,n$. Putting both sets of constraints together, the coverage probability is $P(l_i \le \hat{F}_Y(X_{i:n}) \le u_{i-1},\ i = 1,\dots,n)$, where $l_i$ and $u_i$ are the lower and upper bounds for the prediction band on the interval $[X_{(i)}, X_{(i+1)})$. Such probabilities can be computed using Algorithm 1 from Section 3, and a root-finding algorithm can be used to find the critical value $h_\alpha$ that comes closest to yielding the desired coverage probability $1-\alpha$ without going below $1-\alpha$.

Using the method just described, we computed critical values $h_{0.05}$ appropriate for obtaining 95% Kolmogorov–Smirnov prediction bands for the EDF for different choices of $n$ and $k$. These critical values appear in Table 1. We see from the table that the critical values are unchanged when we switch the roles of $n$ and $k$. We also see that the critical values tend to decrease when either $n$ or $k$ is increased. The procedure for finding the prediction band is summarized next; a computational sketch for finding $h_\alpha$ follows Table 1.

Procedure 1. Find the Kolmogorov–Smirnov prediction band for the EDF as follows.
Step 1. Using either Table 1 or the general method described earlier, find the critical value $h_\alpha$ that is appropriate for the desired $n$, $k$, and coverage probability $1-\alpha$.
Step 2. The lower and upper bounds of the prediction band are the functions $L(t) \equiv \max\{0, \hat{F}_X(t) - h_\alpha\}$ and $U(t) \equiv \min\{1, \hat{F}_X(t) + h_\alpha\}$, where $\hat{F}_X(t)$ is the EDF based on the sample $X_1,\dots,X_n$.

Table 1
Critical values for 95% Kolmogorov–Smirnov prediction bands for the EDF of a future sample.

  n     k=10    k=20    k=30    k=40    k=50
  10    0.600   0.500   0.467   0.450   0.440
  20    0.500   0.400   0.367   0.350   0.340
  30    0.467   0.367   0.333   0.317   0.300
  40    0.450   0.350   0.317   0.300   0.280
  50    0.440   0.340   0.300   0.280   0.260
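As a concrete illustration, the Python sketch below (ours, not code from the paper) searches for $h_\alpha$ exactly. It assumes a routine coverage(a, b, n, k) implementing Algorithm 1 of Section 3 (a sketch of that routine appears there). Because the statistic $\sup_t |\hat{F}_X(t) - \hat{F}_Y(t)|$ only takes values of the form $m/(nk)$, scanning that grid with exact rational arithmetic suffices.

```python
from fractions import Fraction

def ks_critical(n, k, alpha=0.05):
    """Smallest h giving coverage >= 1 - alpha for the KS prediction band.
    Relies on coverage() from the Algorithm 1 sketch in Section 3."""
    for m in range(1, n * k + 1):
        h = Fraction(m, n * k)  # the achievable values of the sup statistic
        a = [max(Fraction(0), Fraction(i, n) - h) for i in range(1, n + 1)]
        b = [min(Fraction(1), Fraction(i, n) + h) for i in range(0, n)]
        if coverage(a, b, n, k) >= 1 - alpha:
            return h

# ks_critical(10, 10) should return Fraction(3, 5), matching Table 1's 0.600.
```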


In obtaining the Kolmogorov–Smirnov prediction band for the EDF, we measured the pointwise discrepancy between $\hat{F}_X(t)$ and $\hat{F}_Y(t)$ using Euclidean distance. If we use a different measure of discrepancy, we obtain a different prediction band. One natural alternative to Euclidean distance is likelihood. The likelihood ratio statistic for testing the pointwise hypothesis $H_{0t}: F_X(t) = F_Y(t)$ against $H_{1t}: F_X(t) \ne F_Y(t)$ is $L(\hat{F}_X(t), \hat{F}_Y(t))$, where

$$L(p_1, p_2) \equiv \frac{p_1^{np_1}(1-p_1)^{n-np_1}\, p_2^{kp_2}(1-p_2)^{k-kp_2}}{\bar{p}^{(n+k)\bar{p}}(1-\bar{p})^{n+k-(n+k)\bar{p}}}$$

and $\bar{p} = (np_1 + kp_2)/(n+k)$ is the pooled estimate. Taking logarithms, we find that an equivalent statistic is $l(\hat{F}_X(t), \hat{F}_Y(t))$, where

$$l(p_1, p_2) \equiv np_1\log(p_1) + (n - np_1)\log(1-p_1) + kp_2\log(p_2) + (k - kp_2)\log(1-p_2) - (n+k)\bar{p}\log(\bar{p}) - (n + k - (n+k)\bar{p})\log(1-\bar{p}) \qquad (1)$$

and $0\log(0)$ is understood to be 0. We reject $H_{0t}$ in favor of $H_{1t}$ when $l(\hat{F}_X(t), \hat{F}_Y(t))$ is large. Thus, it makes sense to reject the overall hypothesis $H_0: F_X = F_Y$ in favor of $H_1: F_X \ne F_Y$ when $\sup_t l(\hat{F}_X(t), \hat{F}_Y(t))$ is large. Following this same logic, we let the likelihood prediction band for $\hat{F}_Y(t)$ consist of all EDFs $G$ such that $\sup_t l(\hat{F}_X(t), G(t)) \le c$ for some appropriate value $c$.

Suppose that, for fixed $n$, $k$, and $c$, we wish to compute the coverage probability for the likelihood prediction band. The lower and upper bounds of the prediction band on the interval $[X_{(i)}, X_{(i+1)})$ are the smallest and largest values $s \in \{0, 1/k, \dots, 1\}$ such that $l(i/n, s) \le c$. Thus, we may find these bounds by computing all values $l(i/n, j/k)$ for $i = 0,\dots,n$ and $j = 0,\dots,k$. For $i = 0,\dots,n$, the lower bound is $l_i' \equiv \min\{s \in \{0, 1/k, \dots, 1\} : l(i/n, s) \le c\}$, and the upper bound is $u_i' \equiv \max\{s \in \{0, 1/k, \dots, 1\} : l(i/n, s) \le c\}$. Having determined $l_i'$ and $u_i'$ for $i = 0,\dots,n$, we then compute the coverage probability for the prediction band as $P(l_i' \le \hat{F}_Y(X_{i:n}) \le u_{i-1}',\ i = 1,\dots,n)$. Such probabilities can be computed using Algorithm 1 from Section 3, and a root-finding algorithm can be used to find the critical value $c_\alpha$ that comes closest to yielding the desired coverage probability $1-\alpha$ without going below $1-\alpha$ (a search sketch appears after Table 3).

Example 1. We compute the likelihood prediction band when $n = 10$, $k = 4$, and $c = 4.00$. Table 2 shows the full set of values $l(i/n, j/k)$ as $i$ runs from 0 to 10 and $j$ runs from 0 to 4, with values exceeding 4.00 marked by asterisks. For each $i = 0,\dots,n$, we find $l_i'$ and $u_i'$ by identifying the smallest and largest $j$ for which $l(i/n, j/k) \le 4.00$ and then dividing by $k = 4$. This gives lower bounds $(l_0',\dots,l_{10}') = (0,0,0,0,0,0,0,0,1/4,1/4,1/2)$ and upper bounds $(u_0',\dots,u_{10}') = (1/2,3/4,3/4,1,1,1,1,1,1,1,1)$. Computing the coverage probability then requires computing $P(l_i' \le \hat{F}_Y(X_{i:n}) \le u_{i-1}',\ i = 1,\dots,n)$. This can be done using Algorithm 1 from Section 3. (A computational sketch of $l(\cdot,\cdot)$ and these bounds appears after Table 2.)

Using the method just described, we computed critical values $c_{0.05}$ appropriate for obtaining 95% likelihood prediction bands for the EDF for different choices of $n$ and $k$. These critical values appear in Table 3. We see from the table that, as in Table 1, the critical values are unchanged when we switch the roles of $n$ and $k$. The procedure for finding the prediction band is summarized next.

Procedure 2. Find the likelihood prediction band for the EDF as follows.
Step 1. Using either Table 3 or the general method described earlier, find the critical value $c_\alpha$ that is appropriate for the desired $n$, $k$, and coverage probability $1-\alpha$.
Step 2. The lower and upper bounds of the prediction band on the interval $[X_{(i)}, X_{(i+1)})$ are the minimum and maximum values in the set $\{j/k : j \in \{0,1,\dots,k\}$ and $l(i/n, j/k) \le c_\alpha\}$, where $l(\cdot,\cdot)$ is defined in Eq. (1).

Table 2
All values of $l(i/n, j/k)$ when $n = 10$ and $k = 4$, as in Example 1. Values marked * exceed the critical value $c = 4.00$.

  j    i=0      i=1      i=2      i=3     i=4     i=5     i=6     i=7     i=8      i=9      i=10
  4    8.376*   5.874*   4.557*   3.595   2.831   2.193   1.646   1.165   0.738    0.352    0.000
  3    5.025*   2.876    1.871    1.203   0.725   0.380   0.145   0.018   0.021    0.241    1.353
  2    2.969    1.251    0.599    0.243   0.058   0.000   0.058   0.243   0.599    1.251    2.969
  1    1.353    0.241    0.021    0.018   0.145   0.380   0.725   1.203   1.871    2.876    5.025*
  0    0.000    0.352    0.738    1.165   1.646   2.193   2.831   3.595   4.557*   5.874*   8.376*
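The pointwise statistic in Eq. (1) and the interval-by-interval band bounds are simple to compute. The sketch below is ours, not the paper's; it returns the step heights as exact Fractions so that later ceiling/floor operations are safe, and it reproduces Example 1.

```python
from fractions import Fraction
from math import log

def xlogx(x):
    # Convention from Eq. (1): 0 * log(0) = 0.
    return 0.0 if x == 0 else x * log(x)

def loglik(p1, p2, n, k):
    """The statistic l(p1, p2) from Eq. (1)."""
    pbar = (n * p1 + k * p2) / (n + k)
    return (n * (xlogx(p1) + xlogx(1 - p1))
            + k * (xlogx(p2) + xlogx(1 - p2))
            - (n + k) * (xlogx(pbar) + xlogx(1 - pbar)))

def band_bounds(n, k, c):
    """Step heights l'_i and u'_i of the likelihood band, i = 0..n."""
    lo, up = [], []
    for i in range(n + 1):
        ok = [j for j in range(k + 1) if loglik(i / n, j / k, n, k) <= c]
        lo.append(Fraction(min(ok), k))
        up.append(Fraction(max(ok), k))
    return lo, up

# Example 1: n = 10, k = 4, c = 4.00 gives
# lo = (0,0,0,0,0,0,0,0,1/4,1/4,1/2) and up = (1/2,3/4,3/4,1,...,1).
lo, up = band_bounds(10, 4, 4.00)
```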

Table 3
Critical values for 95% likelihood prediction bands for the EDF of a future sample.

  n     k=10   k=20   k=30   k=40   k=50
  10    4.32   4.53   4.55   4.49   4.40
  20    4.53   4.70   4.66   4.69   4.68
  30    4.55   4.66   4.52   4.79   4.83
  40    4.49   4.69   4.79   4.87   4.97
  50    4.40   4.68   4.83   4.97   5.01
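The search for $c_\alpha$ can also be made exact rather than using generic root-finding, since the coverage probability can only jump at the finitely many achievable values of $l(i/n, j/k)$. A sketch (ours; it again assumes the coverage() routine from the Algorithm 1 sketch in Section 3, plus loglik() and band_bounds() above):

```python
def likelihood_critical(n, k, alpha=0.05):
    """Smallest c with coverage >= 1 - alpha for the likelihood band."""
    cs = sorted({loglik(i / n, j / k, n, k)
                 for i in range(n + 1) for j in range(k + 1)})
    for c in cs:
        lo, up = band_bounds(n, k, c)
        # a_i = l'_i for i = 1..n; b_{i-1} = u'_{i-1} for i = 1..n.
        if coverage(lo[1:], up[:-1], n, k) >= 1 - alpha:
            return c

# likelihood_critical(10, 10) should land near the 4.32 entry of Table 3.
```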


To compare the two nonparametric prediction bands, we computed 95% prediction bands using a variety of choices for $n$ and $k$. Due to the discreteness of the statistics $\sup_t |\hat{F}_X(t) - \hat{F}_Y(t)|$ and $\sup_t l(\hat{F}_X(t), \hat{F}_Y(t))$, it is rarely if ever possible to achieve exactly 95% coverage. Thus, in each case, we used the critical value that gives the smallest coverage probability greater than or equal to 95%. Fig. 1 shows the two 95% prediction bands for the EDF in the cases where $n$ is 50 or 100, $k$ is 10 or 20, and the initial sample consists of the equally spaced values $1,\dots,n$. We see from the figure that while the likelihood prediction band is narrower in the tails of the distribution, the Kolmogorov–Smirnov prediction band is narrower near the center. We also see that increasing either $n$ or $k$ tends to lead to a narrower prediction band. However, once $n$ is reasonably large, increasing $k$ has a bigger impact than increasing $n$.

[Fig. 1. Nonparametric 95% prediction bands for the EDF of a future sample of size k, based on a sample of size n, for (n, k) = (50, 10), (50, 20), (100, 10), and (100, 20). Each panel plots cumulative probability against the equally spaced data values 1, ..., n; the solid curve gives the Kolmogorov–Smirnov prediction band, and the dashed curve gives the likelihood prediction band.]

It is possible to create other nonparametric prediction bands, but the Kolmogorov–Smirnov and likelihood prediction bands are perhaps the most natural choices. Zhang (2006) studied a two-sample test based on a slightly modified version of the statistic $\sup_t l(\hat{F}_X(t), \hat{F}_Y(t))$, finding that the test tends to be more powerful than the two-sample Kolmogorov–Smirnov test. However, we see from Fig. 1 that the advantage in power for likelihood-based tests does not seem to lead to a substantially smaller prediction band for the EDF. Thus, it appears that both prediction bands have value. The likelihood prediction band would be preferable when it is important that the prediction intervals for the individual order statistics be two-sided rather than one-sided.

3. A computational algorithm

In this section, we describe an algorithm for computing coverage probabilities for the nonparametric prediction bands developed in Section 2. There are several existing algorithms for computing related probabilities (Gail and Green, 1976; Durbin, 1973), but this new algorithm is particularly simple and general. It can be applied to compute coverage probabilities for nonparametric prediction bands or to compute $\alpha$ levels for nonparametric two-sample tests, and the only operations required are addition and, at the very end, division by $\binom{n+k}{k}$.

Suppose that we wish to compute the coverage probability $P(a_i \le \hat{F}_Y(X_{i:n}) \le b_{i-1},\ i = 1,\dots,n)$, where $a_1,\dots,a_n$ and $b_0,\dots,b_{n-1}$ are known values from the interval $[0,1]$. If we let $K_i$ be the number of values from the future sample $Y_1,\dots,Y_k$ that are less than or equal to $X_{i:n}$, then $k\hat{F}_Y(X_{i:n}) = K_i$, and we may rewrite $P(a_i \le \hat{F}_Y(X_{i:n}) \le b_{i-1},\ i = 1,\dots,n)$ as

$P(ka_i \le K_i \le kb_{i-1},\ i = 1,\dots,n)$. Since $K_i$ must be an integer, we may then rewrite $P(ka_i \le K_i \le kb_{i-1},\ i = 1,\dots,n)$ as $P(c_i \le K_i \le d_i,\ i = 1,\dots,n)$, where $c_i \equiv \lceil ka_i \rceil$ and $d_i \equiv \lfloor kb_{i-1} \rfloor$. Thus, the probability has been expressed in terms of the random vector $(K_1,\dots,K_n)$.

The key fact about $(K_1,\dots,K_n)$ is that it is uniformly distributed on the set $S \equiv \{(s_1,\dots,s_n) \in \mathbb{Z}^n : 0 \le s_1 \le \cdots \le s_n \le k\}$. To see this, note that the initial sample has $n$ observations and the future sample has $k$ observations, all of which come from the same continuous distribution. Thus, there are $\binom{n+k}{k}$ equally likely choices of ranks for the values from the future sample within the full set of $n+k$ values. Each choice of ranks for the values from the future sample determines a unique vector $(K_1,\dots,K_n)$, and each vector $(s_1,\dots,s_n) \in S$ determines a unique set of ranks for the $k$ values from the future sample. Thus, $(K_1,\dots,K_n)$ is uniform on $S$, and we take advantage of this to compute the desired probability.

To compute $P(c_i \le K_i \le d_i,\ i = 1,\dots,n)$, we proceed recursively. For $i = 1,\dots,n$ and $j = 0,\dots,k$, we define $C_{ij}$ to be the number of integer vectors $(s_1,\dots,s_i)$ such that (i) $0 \le s_1 \le \cdots \le s_i = j$ and (ii) $c_t \le s_t \le d_t$ for $t = 1,\dots,i$. The values $C_{10},\dots,C_{1k}$ are given by

$$C_{1j} = \begin{cases} 1, & c_1 \le j \le d_1, \\ 0 & \text{otherwise}, \end{cases} \qquad (2)$$

and for $i > 0$, the values $\{C_{i+1,j},\ j = 0,\dots,k\}$ can be obtained from the values $\{C_{ij},\ j = 0,\dots,k\}$ via the recursion

$$C_{i+1,j} = \begin{cases} \displaystyle\sum_{t=0}^{j} C_{it}, & c_{i+1} \le j \le d_{i+1}, \\ 0 & \text{otherwise}. \end{cases} \qquad (3)$$

The recursion (3) holds because we obtain a sequence that contributes to $C_{i+1,j}$ only by taking a sequence $(s_1,\dots,s_i)$ that contributes to $C_{it}$ for some $t \le j$ and then appending a $j$ to the end of that sequence. Once we have obtained the values $\{C_{nj},\ j = 0,\dots,k\}$, we use the fact that $(K_1,\dots,K_n)$ is uniform on $S$ to obtain the desired probability as $P(c_i \le K_i \le d_i,\ i = 1,\dots,n) = \binom{n+k}{k}^{-1}\sum_{j=0}^{k} C_{nj}$. Thus, we may summarize the algorithm as follows.

Algorithm 1. Compute $P(a_i \le \hat{F}_Y(X_{i:n}) \le b_{i-1},\ i = 1,\dots,n)$ as follows.
Step 1. Set $c_i \equiv \lceil ka_i \rceil$ and $d_i \equiv \lfloor kb_{i-1} \rfloor$ for $i = 1,\dots,n$.
Step 2. Compute the values $\{C_{ij}\}$ for $i = 1$ via Eq. (2).
Step 3. For $i = 1,\dots,n-1$, compute $\{C_{i+1,j},\ j = 0,\dots,k\}$ from $\{C_{ij},\ j = 0,\dots,k\}$ using (3).
Step 4. Obtain $P(a_i \le \hat{F}_Y(X_{i:n}) \le b_{i-1},\ i = 1,\dots,n)$ as $\binom{n+k}{k}^{-1}\sum_{j=0}^{k} C_{nj}$.
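Algorithm 1 transcribes directly into Python. The sketch below is ours, not code from the paper; passing exact Fractions for the band bounds keeps the ceiling and floor in Step 1 safe from floating-point error.

```python
from fractions import Fraction
from math import ceil, comb, floor

def coverage(a, b, n, k):
    """Algorithm 1: P(a_i <= F^_Y(X_{i:n}) <= b_{i-1}, i = 1..n).
    a = (a_1, ..., a_n), b = (b_0, ..., b_{n-1})."""
    c = [ceil(k * a[i]) for i in range(n)]    # Step 1: c_i = ceil(k a_i)
    d = [floor(k * b[i]) for i in range(n)]   #         d_i = floor(k b_{i-1})
    # Step 2: C_{1j} via Eq. (2).
    C = [1 if c[0] <= j <= d[0] else 0 for j in range(k + 1)]
    # Step 3: the recursion (3); only addition is needed.
    for i in range(1, n):
        cum, new = 0, [0] * (k + 1)
        for j in range(k + 1):
            cum += C[j]                       # running sum of C_{i,0}..C_{i,j}
            if c[i] <= j <= d[i]:
                new[j] = cum
        C = new
    # Step 4: divide the total count by binom(n + k, k).
    return sum(C) / comb(n + k, k)
```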

To illustrate the algorithm just developed, we apply it to compute the coverage probability of the Kolmogorov–Smirnov prediction band when $n = 10$, $k = 4$, and $h = 0.7$.

Example 2. Set $n = 10$ and $k = 4$. Suppose that we wish to compute the coverage probability for the Kolmogorov–Smirnov prediction band with $h = 0.7$. On the intervals $(-\infty, X_{(1)}),\dots,[X_{(n)}, \infty)$, the EDF for the initial sample takes the values $0, 0.1,\dots,1.0$. Thus, the interval-by-interval lower bounds are $(l_0,\dots,l_{10}) = (0,0,0,0,0,0,0,0,0.1,0.2,0.3)$, and the upper bounds are $(u_0,\dots,u_{10}) = (0.7,0.8,0.9,1,1,1,1,1,1,1,1)$. Setting $c_i \equiv \lceil kl_i \rceil$ and $d_i \equiv \lfloor ku_{i-1} \rfloor$ for $i = 1,\dots,n$, we obtain the vectors $(c_1,\dots,c_{10}) = (0,0,0,0,0,0,0,1,1,2)$ and $(d_1,\dots,d_{10}) = (2,3,3,4,4,4,4,4,4,4)$. Table 4 shows the values $\{C_{ij}\}$ that arise in implementing Algorithm 1. Note that each nonzero value $C_{ij}$ is obtained as the sum of the values $C_{i-1,0}$ to $C_{i-1,j}$ from the column just to the left. The coverage probability is obtained as $(688+215+52)/\binom{14}{4} \approx 0.954$, the ratio between the sum of the right-most column and the total number of ways to choose which ordered values from the combined sample are the ones from the future sample. A short usage sketch after Table 4 reproduces this value.

Table 4
The values $\{C_{ij}\}$ used for computing the coverage probability of the Kolmogorov–Smirnov prediction band when $n = 10$, $k = 4$, and $h = 0.7$. The coverage probability is $(688+215+52)/\binom{14}{4} \approx 0.954$. Entries marked * are structural zeros determined by the constraints $c_i \le s_i \le d_i$ when $h = 0.7$.

  j    i=1   i=2   i=3   i=4   i=5   i=6   i=7   i=8   i=9   i=10
  4    0*    0*    0*    19    53    108   191   310   473   688
  3    0*    3     9     19    34    55    83    119   163   215
  2    1     3     6     10    15    21    28    36    44    52
  1    1     2     3     4     5     6     7     8     8     0*
  0    1     1     1     1     1     1     1     1     0*    0*
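Running the coverage() sketch on Example 2's inputs reproduces this value. (Exact Fractions matter here: in floating point, $\lfloor 4 \times (0.3 + 0.7) \rfloor$ evaluates to 3 rather than 4.)

```python
from fractions import Fraction

n, k, h = 10, 4, Fraction(7, 10)
l = [max(Fraction(0), Fraction(i, n) - h) for i in range(1, n + 1)]  # l_1..l_n
u = [min(Fraction(1), Fraction(i, n) + h) for i in range(0, n)]      # u_0..u_{n-1}
print(coverage(l, u, n, k))   # 0.954045..., i.e., 955/1001
```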


4. Parametric prediction bands

To obtain prediction intervals for the EDF in parametric settings, we adopt a two-step process. First, using the initial sample $X_1,\dots,X_n \sim F$, we estimate the distribution $F$ using parametric methods. We then compute $100(1-2\gamma)\%$ equal-tailed probability intervals for each order statistic $Y_{(1)},\dots,Y_{(k)}$ under the assumption that the estimated $F$ is the true $F$. To ensure that the entire procedure has the desired coverage probability $1-\alpha$, we choose $\gamma$ to be the value such that the simultaneous coverage probability for the $k$ prediction intervals is $1-\alpha$.

Consider the normal case first. Given the initial sample $X_1,\dots,X_n \sim N(\mu, \sigma)$, we estimate the parameters $\mu$ and $\sigma$ using the customary sample mean $\bar{X}$ and sample standard deviation $S$. When $\mu$ and $\sigma$ are known, standard results on order statistics (see David and Nagaraja, 2003) show that the $100(1-2\gamma)\%$ equal-tailed probability interval for $Y_{(r)}$ is the interval $(\mu + \sigma z_{Q(\gamma; r, k+1-r)},\ \mu + \sigma z_{Q(1-\gamma; r, k+1-r)})$, where $z_\eta$ is the lower $\eta$ quantile of the standard normal distribution and $Q(\eta; a, b)$ is the lower $\eta$ quantile of the Beta$(a, b)$ distribution. Thus, our prediction interval for $Y_{(r)}$ takes the form

$$(\bar{X} + S z_{Q(\gamma; r, k+1-r)},\ \bar{X} + S z_{Q(1-\gamma; r, k+1-r)}), \qquad (4)$$

and the only remaining task is to select $\gamma$ so that the simultaneous coverage probability exactly matches the desired level.

Though there does not seem to be an easy way to obtain the correct $\gamma$ values analytically, it is straightforward to obtain them via simulation. To do this, we simulate a large number of combined samples $\{X_1,\dots,X_n, Y_1,\dots,Y_k\}$. Given each initial sample $X_1,\dots,X_n$, we compute the statistics $\bar{X}$ and $S$. We then determine the largest $\gamma$ such that all of the order statistics $Y_{(1)},\dots,Y_{(k)}$ are contained in their respective prediction intervals, which are given by expression (4). This $\gamma$ value is obtained by looking at tail probabilities under the assumption that $\bar{X}$ and $S$ are the true mean and standard deviation. Thus, the maximum $\gamma$ is given by

$$\gamma^* = \min\left(\left\{ B\!\left(\Phi\!\left(\frac{Y_{(r)} - \bar{X}}{S}\right); r, k+1-r\right) : r = 1,\dots,k \right\} \cup \left\{ 1 - B\!\left(\Phi\!\left(\frac{Y_{(r)} - \bar{X}}{S}\right); r, k+1-r\right) : r = 1,\dots,k \right\}\right),$$

where $\Phi(\cdot)$ is the standard normal CDF and $B(\cdot; a, b)$ is the CDF of the Beta$(a, b)$ distribution. Each time we generate a combined sample, we obtain a $\gamma^*$. If we order these values as $\gamma^*_{(1)} \le \cdots \le \gamma^*_{(N)}$, where the number of runs $N$ is large, then $\gamma_\alpha \equiv \gamma^*_{(\lfloor N\alpha \rfloor)}$ is approximately the correct critical value to give coverage probability $1-\alpha$.

Table 5, constructed by generating $N = 100{,}000$ combined samples for each choice of $n$ and $k$ and then applying the method just described, gives critical values $\gamma_{0.05}$ appropriate for constructing 95% prediction bands for the EDF of a future sample in the case of normal data. We see from the table that for fixed $n$, $\gamma_{0.05}$ decreases as $k$ increases, while for fixed $k$, $\gamma_{0.05}$ increases as $n$ increases. When we let $n$ go to infinity so that $\mu$ and $\sigma$ become known, the $\gamma_{0.05}$ values are bounded below by the values $0.05/(2k)$ that we would obtain from the Bonferroni inequality. Note, however, that the values obtained from the Bonferroni inequality would be extremely conservative. We also see from Table 5 that convergence to the $n = \infty$ line is quite slow.
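The simulation just described is direct to code. A sketch (ours, not the paper's; it assumes NumPy and SciPy, with scipy.stats.norm and scipy.stats.beta supplying $\Phi$ and $B$; since $\gamma^*$ is location-scale invariant, simulating from N(0,1) suffices):

```python
import numpy as np
from scipy.stats import beta, norm

def normal_gamma(n, k, alpha=0.05, N=100_000, seed=1):
    """Simulate gamma_alpha = gamma*_(floor(N*alpha)) for the normal band."""
    rng = np.random.default_rng(seed)
    r = np.arange(1, k + 1)
    g = np.empty(N)
    for m in range(N):
        x = rng.standard_normal(n)
        y = np.sort(rng.standard_normal(k))
        # B(Phi((Y_(r) - Xbar)/S); r, k+1-r) for each order statistic
        u = beta.cdf(norm.cdf((y - x.mean()) / x.std(ddof=1)), r, k + 1 - r)
        g[m] = np.minimum(u, 1 - u).min()     # gamma* for this combined sample
    return np.sort(g)[int(N * alpha) - 1]     # the floor(N*alpha)-th smallest
```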

To plot the prediction band corresponding to a particular choice of $n$, $k$, and $\alpha$, we first obtain the prediction intervals for the individual order statistics $Y_{(1)},\dots,Y_{(k)}$ from (4). Once those intervals are given, we use the fact that the bounds on $Y_{(r)}$ are the left-hand and right-hand bounds for the prediction band in the region between the horizontal lines $y = (r-1)/k$ and $y = r/k$ to obtain the lower and upper bounds for the band. The procedure for finding the prediction band is summarized next, with a computational sketch after Table 5.

Procedure 3. Find the normal-distribution-based prediction band for the EDF as follows.
Step 1. Using either Table 5 or the general method described earlier, find the critical value $\gamma_\alpha$ that is appropriate for the desired $n$, $k$, and coverage probability $1-\alpha$.
Step 2. The prediction interval for $Y_{(r)}$ has lower bound $v_r \equiv \bar{X} + S z_{Q(\gamma_\alpha; r, k+1-r)}$ and upper bound $w_r \equiv \bar{X} + S z_{Q(1-\gamma_\alpha; r, k+1-r)}$ for $r = 1,\dots,k$.
Step 3. At the point $t$, the lower and upper bounds for the prediction band are $C_2(t)/k$ and $C_1(t)/k$, where $C_1(t) \equiv \#\{v_r \le t\}$ and $C_2(t) \equiv \#\{w_r \le t\}$.

Fig. 2 plots the normal-distribution-based prediction band for the case where $n = 50$, $k = 10$, $\bar{X} = 0$, and $S = 1$. We see from Fig. 2 that the prediction band is symmetric about the vertical line $x = \bar{X}$, with the lengths of the individual prediction intervals being shortest for order statistics near the median and longest for extreme order statistics.

Table 5
Simulated critical values for creating normal-distribution-based 95% prediction bands for the EDF of a future sample of size k, based on a sample of size n. Values are given in scientific notation, so that 3.16e-05, for example, means 0.0000316.

  n     k=10       k=20       k=30       k=40       k=50
  10    3.16e-05   6.85e-07   2.61e-08   9.30e-10   4.39e-11
  20    4.86e-04   7.66e-05   1.61e-05   4.02e-06   9.44e-07
  30    1.05e-03   2.74e-04   9.63e-05   3.47e-05   1.39e-05
  40    1.45e-03   5.00e-04   2.20e-04   9.49e-05   4.57e-05
  50    1.76e-03   6.68e-04   3.29e-04   1.72e-04   9.48e-05
  ∞     3.73e-03   2.38e-03   1.96e-03   1.64e-03   1.51e-03
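Steps 2 and 3 of Procedure 3 translate directly (our sketch, under the same SciPy assumptions as before):

```python
import numpy as np
from scipy.stats import beta, norm

def normal_band(xbar, s, n, k, gamma):
    """Interval endpoints (v_r, w_r) for Y_(r), r = 1..k (Procedure 3, Step 2)."""
    r = np.arange(1, k + 1)
    v = xbar + s * norm.ppf(beta.ppf(gamma, r, k + 1 - r))
    w = xbar + s * norm.ppf(beta.ppf(1 - gamma, r, k + 1 - r))
    return v, w

def band_at(t, v, w, k):
    """Lower and upper band heights at the point t (Procedure 3, Step 3)."""
    return np.sum(w <= t) / k, np.sum(v <= t) / k

# Fig. 2's band: v, w = normal_band(0.0, 1.0, 50, 10, 0.00176)
```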

[Fig. 2. The 95% prediction band for the EDF of a future normal sample of size k = 10, based on a sample of size n = 50 with $\bar{X} = 0$ and $S = 1$. The prediction band is determined by the critical value $\gamma_{0.05} = 0.00176$.]

Table 6
Simulated critical values for creating exponential-distribution-based 95% prediction bands for the EDF of a future sample of size k, based on a sample of size n. Values are given in scientific notation, so that 4.28e-04, for example, means 0.000428.

  n     k=10       k=20       k=30       k=40       k=50
  10    4.28e-04   4.50e-05   5.97e-06   9.03e-07   1.63e-07
  20    1.36e-03   3.88e-04   1.38e-04   5.31e-05   2.04e-05
  30    1.97e-03   7.21e-04   3.61e-04   1.71e-04   9.79e-05
  40    2.38e-03   1.01e-03   5.43e-04   3.24e-04   1.92e-04
  50    2.49e-03   1.18e-03   7.02e-04   4.25e-04   2.85e-04
  ∞     3.76e-03   2.37e-03   1.88e-03   1.65e-03   1.49e-03

Now consider the exponential case. Given the initial sample $X_1,\dots,X_n \sim \text{Exponential}(\lambda)$, where $\lambda = E[X_i]$, we estimate $\lambda$ using $\bar{X}$. When $\lambda$ is the true mean, standard results on order statistics (see David and Nagaraja, 2003) show that the equal-tailed $100(1-2\gamma)\%$ probability interval for $Y_{(r)}$ is given by

$$(-\lambda \log(1 - Q(\gamma; r, k+1-r)),\ -\lambda \log(1 - Q(1-\gamma; r, k+1-r)))$$

since $-\log(1-\eta)$ is the lower $\eta$ quantile of the Exponential(1) distribution. Thus, the prediction interval for $Y_{(r)}$ is $(-\bar{X}\log(1 - Q(\gamma; r, k+1-r)),\ -\bar{X}\log(1 - Q(1-\gamma; r, k+1-r)))$. As in the normal case, we obtain appropriate critical values $\gamma^*$ by simulation.

Table 6, constructed by generating $N = 100{,}000$ combined samples for each choice of $n$ and $k$ and then ordering the resulting $\gamma^*$ values as in the normal case, gives critical values $\gamma_{0.05}$ appropriate for constructing 95% prediction bands for the EDF of a future sample in the case of exponential data. We see from Table 6 that, as in the normal case, critical values decrease when $k$ increases while $n$ is kept fixed, but increase when $n$ is increased while $k$ is kept fixed. For a given $n$ and $k$, the critical value in the exponential case tends to exceed that for the normal case. This reflects the fact that it tends to be easier to estimate $F$ accurately in a one-parameter setting (exponential) than in a two-parameter setting (normal). When $n$ goes to infinity, the critical values $\gamma_{0.05}$ for the normal and exponential cases become the same. This would also hold true for other parametric families provided that the estimate of $F$ is consistent and that the prediction intervals for individual order statistics are the same equal-tailed intervals that we have used here. The procedure for finding the prediction band is summarized next.

Procedure 4. Find the exponential-distribution-based prediction band for the EDF as follows.
Step 1. Using either Table 6 or the general method described earlier, find the critical value $\gamma_\alpha$ that is appropriate for the desired $n$, $k$, and coverage probability $1-\alpha$.
Step 2. The prediction interval for $Y_{(r)}$ has lower bound $v_r' \equiv -\bar{X}\log(1 - Q(\gamma_\alpha; r, k+1-r))$ and upper bound $w_r' \equiv -\bar{X}\log(1 - Q(1-\gamma_\alpha; r, k+1-r))$ for $r = 1,\dots,k$.
Step 3. At the point $t$, the lower and upper bounds for the prediction band are $C_2'(t)/k$ and $C_1'(t)/k$, where $C_1'(t) \equiv \#\{v_r' \le t\}$ and $C_2'(t) \equiv \#\{w_r' \le t\}$.

Fig. 3 plots the exponential-distribution-based prediction band in the case where $n = 50$, $k = 10$, and $\bar{X} = 1$. We see that the exponential prediction band is highly nonsymmetric, with the prediction intervals for the upper-tail order statistics being much longer than the prediction intervals for the lower-tail order statistics.

[Fig. 3. The 95% prediction band for the EDF of a future exponential sample of size k = 10, based on a sample of size n = 50 with $\bar{X} = 1$. The prediction band is determined by the critical value $\gamma_{0.05} = 0.00249$.]
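Step 2 of Procedure 4, in the same style as the normal-case sketch (ours; scipy.stats.beta again supplies the quantile function $Q$):

```python
import numpy as np
from scipy.stats import beta

def exponential_band(xbar, k, gamma):
    """Interval endpoints (v'_r, w'_r) for Y_(r), r = 1..k (Procedure 4, Step 2)."""
    r = np.arange(1, k + 1)
    v = -xbar * np.log(1 - beta.ppf(gamma, r, k + 1 - r))
    w = -xbar * np.log(1 - beta.ppf(1 - gamma, r, k + 1 - r))
    return v, w

# Fig. 3's band: v, w = exponential_band(1.0, 10, 0.00249)
```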

Like the nonparametric prediction bands from Section 2, the parametric prediction bands correspond to tests for the two-sample problem. The tests are parametric tests of $H_0: F_X = F_Y$ against $H_1: F_X \ne F_Y$, and each test is conducted by computing the prediction band using $X_1,\dots,X_n$ and then noting whether or not each value $Y_{(1)},\dots,Y_{(k)}$ falls into its respective prediction interval. If any $Y_{(r)}$ falls outside of its prediction interval, then we reject $H_0$. Otherwise, we retain $H_0$. These tests are quite different from standard parametric two-sample tests. However, they are inappropriate for use since they lack a basic invariance property that is desirable for two-sample tests: the result of the test might depend on which sample is designated as the initial sample and which is designated as the future sample.

5. Comparisons

When we compare statistical intervals, we often compare them on the basis of length. This suggests that we might compare prediction bands on the basis of area. However, while the parametric bands always have finite area, the nonparametric bands usually have infinite area, which makes comparisons difficult. Thus, comparisons by area seem inappropriate, and we instead compare the prediction bands using the integrated expected width (IEW), which we define as

$$\mathrm{IEW} = \int_{-\infty}^{\infty} E[U(t) - L(t)]\, dF(t),$$

where $L(t)$ and $U(t)$ are the lower and upper bounds of the prediction band. The IEW is the $F$-weighted average width of the prediction band, and in the nonparametric case, its value does not depend on which continuous distribution is used.

In the nonparametric case, the expression for the IEW simplifies considerably. The difference $U(t) - L(t)$ is constant on each of the intervals $[X_{(0)}, X_{(1)}),\dots,[X_{(n)}, X_{(n+1)})$, and each of these intervals contains a fraction $1/(n+1)$ of the total probability from $F$ on average. Thus, we may write $\mathrm{IEW} = (1/(n+1))\sum_{i=0}^{n}(u_i - l_i)$, where $l_i$ and $u_i$ are the lower and upper bounds of the prediction band on the interval $[X_{(i)}, X_{(i+1)})$.

For the normal-distribution-based and exponential-distribution-based prediction bands, simplifications are also possible. By splitting the prediction band into slices via the horizontal lines $y = 0, y = 1/k,\dots,y = 1$, we find that $\mathrm{IEW} = (1/k)\sum_{r=1}^{k} P(Y$ is in the prediction interval for $Y_{(r)})$, where $Y$ is a single additional independent observation from $F$.

For the normal case, suppose that the prediction interval for $Y_{(r)}$ is given by $(\bar{X} + Sa_r, \bar{X} + Sb_r)$ for $r = 1,\dots,k$. We can then write the probability that $Y$ is in the prediction interval for $Y_{(r)}$ as

$$P(\bar{X} + Sa_r < Y < \bar{X} + Sb_r) = P\!\left(-b_r < \frac{\bar{X} - Y}{S} < -a_r\right).$$

The difference $\bar{X} - Y$ is normally distributed with mean 0 and variance $\sigma^2(1 + 1/n)$, and $(n-1)S^2/\sigma^2$ is independently distributed $\chi^2(n-1)$. Thus, the quantity $\sqrt{n/(n+1)}\,((\bar{X} - Y)/S)$ has a $t$ distribution with $n-1$ degrees of freedom. By the symmetry of the $t$ distribution, this means that the IEW is

$$\mathrm{IEW} = \frac{1}{k}\sum_{r=1}^{k}\left\{ T_{n-1}\!\left(\sqrt{\frac{n}{n+1}}\, b_r\right) - T_{n-1}\!\left(\sqrt{\frac{n}{n+1}}\, a_r\right) \right\},$$

where $T_\nu(\cdot)$ is the CDF of the $t$ distribution with $\nu$ degrees of freedom.

For the exponential case, suppose that the prediction interval for $Y_{(r)}$ is given by $(a_r\bar{X}, b_r\bar{X})$ for $r = 1,\dots,k$. We can then write the probability that $Y$ is in the prediction interval for $Y_{(r)}$ as

$$P(a_r\bar{X} < Y < b_r\bar{X}) = P(a_r < Y/\bar{X} < b_r) = P(a_r/n < Y/(n\bar{X}) < b_r/n) = P\!\left(\frac{a_r}{a_r + n} < \frac{Y}{Y + n\bar{X}} < \frac{b_r}{b_r + n}\right),$$

where the last equality follows from the fact that if $a$, $b$, $c$, and $d$ are all positive, then $a/b < c/d$ if and only if $a/(a+b) < c/(c+d)$. Since $Y/(Y + n\bar{X})$ has a Beta$(1, n)$ distribution, we can write the IEW as

$$\mathrm{IEW} = \frac{1}{k}\sum_{r=1}^{k}\left\{ B\!\left(\frac{b_r}{b_r + n}; 1, n\right) - B\!\left(\frac{a_r}{a_r + n}; 1, n\right) \right\},$$

where $B(\cdot; a, b)$ is the CDF of the Beta$(a, b)$ distribution.
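The three closed forms above are one-liners (our sketch; for the parametric cases, a and b are the length-k arrays of standardized interval endpoints):

```python
import numpy as np
from scipy.stats import beta, t as t_dist

def iew_nonparametric(lo, up, n):
    """IEW = (1/(n+1)) * sum(u_i - l_i) over the n + 1 band steps."""
    return sum(u - l for l, u in zip(lo, up)) / (n + 1)

def iew_normal(a, b, n):
    """Normal band with intervals (Xbar + S*a_r, Xbar + S*b_r)."""
    f = np.sqrt(n / (n + 1))
    return np.mean(t_dist.cdf(f * b, n - 1) - t_dist.cdf(f * a, n - 1))

def iew_exponential(a, b, n):
    """Exponential band with intervals (a_r*Xbar, b_r*Xbar)."""
    return np.mean(beta.cdf(b / (b + n), 1, n) - beta.cdf(a / (a + n), 1, n))
```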


Table 7
Integrated expected widths of 95% prediction bands for the EDF when the future sample has size k = 10. Critical values for the parametric prediction bands were obtained by simulation, using 100,000 runs in each case.

  n      K–S     Likelihood   Normal   Exponential
  10     0.818   0.782        0.754    0.687
  20     0.710   0.695        0.680    0.641
  40     0.649   0.649        0.640    0.620
  80     0.628   0.625        0.619    0.610
  160    0.605   0.606        0.607    0.602

Table 8
Integrated expected widths of 95% prediction bands for the EDF when the future sample is the same size as the initial sample. Critical values for the parametric prediction bands were obtained by simulation, using 100,000 runs in each case.

  n      K–S     Likelihood   Normal   Exponential   Like/norm ratio
  10     0.818   0.782        0.754    0.687         1.04
  20     0.629   0.628        0.583    0.538         1.08
  40     0.505   0.488        0.434    0.405         1.12
  80     0.378   0.366        0.318    0.298         1.15
  160    0.277   0.271        0.231    0.217         1.17
  320    0.201   0.200        0.166    0.157         1.21

Table 9
Simulated coverage probabilities for normal-distribution-based 95% prediction bands when the data follow various distributions. Each value is based on 100,000 simulated runs, using the critical values from Table 5.

  n     k     Normal   Exponential   Uniform   Cauchy   t(5)
  10    10    0.9502   0.8143        0.9553    0.6411   0.9007
  10    50    0.9490   0.7457        0.9387    0.1830   0.8965
  50    10    0.9505   0.7982        0.9401    0.5297   0.8927
  50    50    0.9503   0.3122        0.9036    0.0165   0.7941

Table 7 compares IEW values for the two nonparametric prediction bands and the two parametric prediction bands when $k$ is fixed at 10 and $n$ takes on progressively higher values. In each case, we assume that the assumptions needed for each prediction band are met. We see from the table that for all but the highest sample sizes, the parametric prediction bands tend to have smaller IEW than the nonparametric bands. We also see that the exponential bands tend to have smaller IEW than the normal bands. The two nonparametric bands have almost the same IEW for $n > 20$, but the likelihood band has an advantage in terms of IEW for smaller values of $n$. Once the sample size reaches $n = 160$, all of the prediction bands are essentially equivalent in terms of IEW. Thus, we see that when $n$ is much larger than $k$, the performance of nonparametric prediction bands is comparable to that of parametric prediction bands.

To compare the performance of the prediction bands when $n$ and $k$ increase together, we did a similar study in which we computed IEW values for 95% prediction bands when $k = n$ and $n$ takes on progressively higher values. The results appear in Table 8, where the right-most column shows the ratio between the IEW for the likelihood band and the IEW for the normal band. We see that while all of the bands narrow as $n$ increases, the advantage of the parametric bands over the nonparametric bands actually increases as $n$ goes from 10 to 320. This suggests that unless $n$ is much larger than $k$, parametric bands are preferable to nonparametric bands when the parametric assumptions hold.

One danger in using parametric statistical procedures is that the true coverage probability may deviate from the nominal coverage probability when the parametric assumptions fail. To explore the extent to which this is true for the parametric prediction bands from Section 4, we did a simulation study. We used $n = 10$ and $n = 50$, $k = 10$ and $k = 50$, and data drawn from the normal, exponential, uniform, Cauchy, and $t(5)$ distributions. For each choice of $n$, $k$, and a distribution, we generated 100,000 combined samples. Given each initial sample, we computed the 95% normal-distribution-based prediction band for the EDF of the future sample, and we noted whether the prediction band contained the true EDF $\hat{F}_Y(t)$. The results are given in Table 9. We see from the table that when the data are not normal, the coverage probability can differ substantially from the nominal level. The true coverage probabilities in the Cauchy and exponential cases are much lower than the nominal 95%, and the coverage probabilities even for the uniform and $t(5)$ data are several percentage points below the nominal 95% in the $n = 50$, $k = 50$ case. Thus, the parametric prediction bands are not robust to violations of the parametric assumptions.
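A sketch of one cell of this robustness check (ours; it reuses the hypothetical normal_band() helper from Section 4, and sampler is any function drawing m values from the data distribution):

```python
import numpy as np

def band_contains_edf(rng, sampler, n, k, gamma):
    """One run: does the normal-theory band contain the future sample's EDF?
    Equivalent to every Y_(r) landing in its interval (v_r, w_r)."""
    x, y = sampler(rng, n), np.sort(sampler(rng, k))
    v, w = normal_band(x.mean(), x.std(ddof=1), n, k, gamma)
    return bool(np.all((v < y) & (y < w)))

rng = np.random.default_rng(2)
runs = [band_contains_edf(rng, lambda r, m: r.exponential(size=m),
                          10, 10, 3.16e-5)          # gamma_0.05 from Table 5
        for _ in range(100_000)]
# np.mean(runs) should land near the 0.8143 entry of Table 9.
```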


6. Conclusions

We have developed both nonparametric and parametric prediction bands for the EDF of a future sample. The parametric prediction bands tend to be narrower when the parametric assumptions hold, but they have poor coverage probability properties when the parametric assumptions fail. Prediction bands of either type yield simultaneous prediction intervals for all order statistics from the future sample. Thus, they are appropriate for use when one wishes to use an initial sample to make predictions about multiple aspects of a single future sample.

Acknowledgments

The author thanks the referees for helpful comments that have improved the paper.

References

David, H.A., Nagaraja, H.N., 2003. Order Statistics, 3rd ed. Wiley, New York.
Davis, C.B., McNichols, R.J., 1987. One-sided intervals for at least p of m observations from a normal population on each of r future occasions. Technometrics 29, 359–370.
Durbin, J., 1973. Distribution Theory for Tests Based on the Sample Distribution Function. SIAM, Philadelphia.
Fertig, K.W., Mann, N.R., 1977. One-sided prediction intervals for at least p out of m future observations from a normal population. Technometrics 19, 167–177.
Fligner, M.A., Wolfe, D.A., 1979. Nonparametric prediction intervals for a future sample median. Journal of the American Statistical Association 74, 453–456.
Gail, M.H., Green, S.B., 1976. Critical values for the one-sided two-sample Kolmogorov–Smirnov statistic. Journal of the American Statistical Association 71, 757–760.
Hahn, G.J., Meeker, W.Q., 1991. Statistical Intervals: A Guide for Practitioners. Wiley, New York.
Hollander, M., Wolfe, D.A., 1999. Nonparametric Statistical Methods, 2nd ed. Wiley, New York.
Knüsel, L., 1994. The prediction problem as the dual form of the two-sample problem with applications to the Poisson and the binomial distribution. The American Statistician 48, 214–219.
LaMotte, L.R., Volaufová, J., 1999. Prediction intervals via consonance intervals. The Statistician 48, 419–424.
Shih, W.J., 1988. Non-parametric simultaneous prediction intervals for quality control. The Statistician 37, 383–386.
Vardeman, S., 1992. What about the other intervals? The American Statistician 46, 193–197.
Zhang, J., 2006. Powerful two-sample tests based on the likelihood ratio. Technometrics 48, 95–103.