Distribution-free prediction intervals for order statistics based on record coverage

Journal of the Korean Statistical Society 40 (2011) 181–192

Journal homepage: www.elsevier.com/locate/jkss

Jafar Ahmadi a,∗, N. Balakrishnan b

a Department of Statistics, Ordered and Spatial Data Center of Excellence, Ferdowsi University of Mashhad, P.O. Box 91775-1159, Mashhad, Iran

b Department of Mathematics and Statistics, McMaster University, Hamilton, Ontario, Canada L8S 4K1

Article info

Article history: Received 17 December 2009; accepted 12 September 2010; available online 18 October 2010

AMS 2000 subject classifications: primary 62G30; secondary 62E15

Keywords: Coverage probability; Current records; Prediction intervals; Record coverage; Order statistics

Abstract

In this paper, based on the largest and smallest observations at the times when a new record of either kind (upper or lower) occurs, we discuss the prediction of future order statistics. The proposed prediction intervals are distribution-free in that the corresponding coverage probabilities are known exactly without any assumption about the parent distribution other than that it is continuous. An exact expression for the prediction coefficient of these intervals is derived. Similarly, prediction intervals for future records based on observed order statistics are also obtained. Finally, two real-life data sets, one involving the average July temperatures in Neuenburg, Switzerland, and the other involving the amount of annual rainfall at the Los Angeles Civic Center, are used to illustrate the procedures developed here. © 2010 The Korean Statistical Society. Published by Elsevier B.V. All rights reserved.

1. Introduction

Let $\{X_i, i \ge 1\}$ be a sequence of independent and identically distributed (iid) random variables. An observation $X_j$ is said to be an upper (or lower) record if $X_j > X_i$ (or $X_j < X_i$) for every $i < j$. Let us denote the $n$th upper and lower usual records by $U_n$ and $L_n$ (with $U_1 = L_1 \equiv X_1$), respectively, as in Ahmadi and Balakrishnan (2010). For more details on the theory and applications of record values, one may refer to Arnold, Balakrishnan, and Nagaraja (1998) and the references contained therein.

Now, suppose $R_m^l$ and $R_m^s$ are the largest and smallest observations, respectively, at the time when the $m$th record of any kind (either upper or lower) occurs in the $X_k$-sequence. The superscripts $l$ and $s$ indicate the largest and smallest observations, respectively. We refer to them as the $m$th current upper record and the $m$th current lower record of the $X_k$-sequence (with $R_0^s = R_0^l \equiv X_1$). For $m \ge 1$, the interval $(R_m^s, R_m^l)$ is then referred to as the record coverage, and the difference $R_m = R_m^l - R_m^s$ is the $m$th record range.

To fix these concepts, let us consider the following sequence of observations: 3.0, 2.0, 2.5, 2.6, 1.7, 3.7, 2.2, 1.5, 2.7, 2.3, 1.2, 4.0, 2.5, 4.7, 4.1, . . . . The usual and current records extracted from the above sequence are then as follows:



∗ Corresponding author: J. Ahmadi.

1226-3192/$ – see front matter © 2010 The Korean Statistical Society. Published by Elsevier B.V. All rights reserved. doi:10.1016/j.jkss.2010.09.003


m        0     1     2     3     4     5     6     7
U_m      –    3.0   3.7   4.0   4.7    –     –     –
L_m      –    3.0   2.0   1.7   1.5   1.2    –     –
R_m^l   3.0   3.0   3.0   3.7   3.7   3.7   4.0   4.7
R_m^s   3.0   2.0   1.7   1.7   1.5   1.2   1.2   1.2
R_m     0.0   1.0   1.3   2.0   2.2   2.5   2.8   3.5
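The extraction of usual and current records from a sequence can be sketched in a few lines of Python. The function name `current_records` and its return layout are illustrative assumptions, not part of the paper; the logic simply tracks the running minimum and maximum and logs a new record coverage whenever either one changes.

```python
def current_records(xs):
    """Return (upper records, lower records, record coverages) for a sequence xs.

    The coverages are pairs (R^s_m, R^l_m) for m = 0, 1, 2, ...
    """
    upper, lower = [xs[0]], [xs[0]]      # U_1 = L_1 = X_1
    coverages = [(xs[0], xs[0])]         # R^s_0 = R^l_0 = X_1
    lo = hi = xs[0]
    for x in xs[1:]:
        if x > hi:                       # new upper record
            hi = x
            upper.append(x)
        elif x < lo:                     # new lower record
            lo = x
            lower.append(x)
        else:
            continue                     # no record of either kind
        coverages.append((lo, hi))
    return upper, lower, coverages


xs = [3.0, 2.0, 2.5, 2.6, 1.7, 3.7, 2.2, 1.5, 2.7, 2.3, 1.2, 4.0, 2.5, 4.7, 4.1]
upper, lower, coverages = current_records(xs)
for m, (r_s, r_l) in enumerate(coverages):
    # m, current upper record, current lower record, record range
    print(m, r_l, r_s, round(r_l - r_s, 1))
```

Applied to the sequence above, the loop prints one line per column of the table, with m running from 0 to 7.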

In fact, the current values of lower and upper records are the endpoints of the $m$th record range. A new record range is encountered whenever either a new lower or upper record is observed.

Basak (2000) proposed a stopping time based on the record range for model choice and outlier detection, and presented some characterization results for the exponential distribution in terms of record range. While Ahmadi and Balakrishnan (2005a) established some reliability properties of current records and record range, Ahmadi and Balakrishnan (2004, 2005b) proposed distribution-free confidence intervals for population quantiles and quantile intervals using current records and record ranges. Raqab (2007) obtained sharp bounds for the spacings between any two current upper records.

Independently of the $X_k$-sequence, let $Y_1, Y_2, \ldots, Y_n$ be a finite sample of size $n$ from the same distribution and denote the corresponding order statistics by $Y_{1:n} \le Y_{2:n} \le \cdots \le Y_{n:n}$. One may refer to Arnold, Balakrishnan, and Nagaraja (1992), Balakrishnan and Rao (1998a,b), and David and Nagaraja (2003) for elaborate details on the theory and applications of order statistics. In this paper, we consider the prediction of future order statistics based on current records from an iid sequence.

Prediction of future events is a problem of great interest. Several authors have considered prediction problems involving record values and order statistics. For example, Ahsanullah (1980) obtained three linear predictors for the $s$th record value based on the first $m$ record values ($s > m$) for the case when the parent distribution is two-parameter exponential. Dunsmore (1983) derived a mean coverage and guaranteed coverage tolerance region for the $(m + r)$th record value based on the first $m$ record values in the classical framework and also under a Bayesian model.
Hsieh (1997) studied the construction of prediction intervals for future Weibull order statistics for two cases: when only previous independent failure data are available, and when both previous independent failure data and early failure data in the current experiment are available. Kaminsky and Nelson (1998) considered the prediction of order statistics in one-sample as well as two-sample cases, and obtained linear point predictors and prediction intervals based on samples from location-scale families. Raqab and Balakrishnan (2008) obtained distribution-free prediction intervals for records from the $Y$-sequence based on record values from the $X$-sequence of iid random variables from the same distribution. Ahmadi, Jafari Jozani, Marchand, and Parsian (2009) considered a large class of distributions and obtained Bayesian predictors under balanced-type loss functions for future $k$-records based on observed $k$-records. Raqab (2009) obtained prediction intervals for the current records from a future iid sequence based on observed current records from an independent iid sequence of the same distribution.

Thus far, researchers have considered the prediction of records based on records, and similarly the prediction of order statistics based on order statistics. Recently, Ahmadi and Balakrishnan (2010) discussed how one can predict future usual records (order statistics) from an independent $Y$-sequence based on order statistics (usual records) from an independent $X$-sequence and developed nonparametric prediction intervals. Ahmadi and MirMostafaee (2009) and Ahmadi and Balakrishnan (2010) obtained prediction intervals for order statistics as well as for the mean lifetime from a future sample based on observed usual records from an exponential distribution using the classical and Bayesian approaches, respectively.
Here, along the lines of Ahmadi and Balakrishnan (2010), we consider the case of records and order statistics jointly and discuss the construction of prediction intervals for order statistics from a future independent $Y$-sample based on observed current records as well as record coverage from an independent $X$-sequence. We compare the results with those of Ahmadi, MirMostafaee, and Balakrishnan (in press) and show that the coverage probability can be increased by using current records instead of usual records only, while the two prediction intervals have the same expected width. The results are demonstrated in Section 2. In fact, in the process of obtaining the ordinary record values, one usually observes the current records, and so it is worthwhile to use them in the construction of prediction intervals for order statistics.

In Section 3, in a similar vein, we also discuss the reverse problem by taking the base sample as order statistics from an independent $Y$-sample and using them to construct prediction intervals for current records from a future independent $X$-sequence. Two real-life data sets, one involving the average July temperatures in Neuenburg, Switzerland, and the other involving the amount of annual rainfall at the Los Angeles Civic Center, are used to illustrate the proposed procedures in Section 4. Finally, a conclusion of this study is given in Section 5.

2. Prediction of order statistics based on current records

We are interested in two-sided prediction intervals of the form $(L, U)$ containing an order statistic from a future iid sample and in the determination of the coverage probability of these intervals. Here, we show that current records can be used in place of $L$ and $U$. Throughout this section, we assume that $\{X_i, i \ge 1\}$ is a sequence of iid continuous random variables with cdf $F(x)$ and pdf $f(x)$. Also, $Y_{1:n} \le Y_{2:n} \le \cdots \le Y_{n:n}$ denote the order statistics from a future random sample of size $n$ from the same cdf $F(x)$ and pdf $f(x)$. Then, the marginal density of $R_i^l$ is given by (see Arnold et al., 1998, p. 276)

$$ f_{R_i^l}(x) = 2^i f(x) \left\{ 1 - \bar F(x) \sum_{j=0}^{i-1} \frac{\{-\log \bar F(x)\}^j}{j!} \right\}; \tag{1} $$

for convenience in notation, we suppose the empty sum to be 0, i.e., $\sum_{j=0}^{-1} a_j = 0$. By replacing $\bar F$ in (1) by $F$, the pdf of $R_i^s$ is derived.

Intuitively, in a sample of size $n$, the upper order statistics, $Y_{k:n}$, $k \ge (n + 1)/2$, will be predicted well by current upper records and the lower ones by current lower records, and for constructing the prediction intervals for the middle order statistics it would be reasonable to use record coverage. With this in mind, we consider the following three schemes.

2.1. Prediction intervals (PIs) based on current upper records

The following lemma will be used to construct the prediction intervals in what follows.

Lemma 1. Let $V_1$, $V_2$ and $V_3$ be three continuous random variables such that $V_1 \le V_3$ with probability one and such that $V_2$ and $(V_1, V_3)$ are independent. Then

$$ P(V_1 \le V_2 \le V_3) = P(V_1 \le V_2) - P(V_3 \le V_2) = \int_x [F_{V_1}(x) - F_{V_3}(x)]\, dF_{V_2}(x). $$

Now, let $R_r^l$ be the $r$th current upper record from the $X$-sequence. Then, by the assumptions, $R_r^l$ (for $r \ge 0$) is a continuous random variable, and for $i < j$, $R_i^l \le R_j^l$ with probability one, and moreover $Y_{k:n}$ and $(R_i^l, R_j^l)$ are independent. Thus, from Lemma 1, we get

$$ P(R_i^l \le Y_{k:n} \le R_j^l) = \int_{-\infty}^{+\infty} [P(R_i^l \le y) - P(R_j^l \le y)]\, dF_{Y_{k:n}}(y). $$

We thus need to evaluate the above integral. First, notice that, for $a > 2$, the following identity holds:

$$ \sum_{r=0}^{i-1} \frac{2^i - 2^r}{a^{r+1}} = \frac{2^i}{a-1} - \frac{1}{a-2} + \frac{2^i}{(a-1)(a-2)\,a^i}. \tag{2} $$
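As a quick sanity check, identity (2) can be verified in exact rational arithmetic over a range of values of $a$ and $i$. This is only a minimal sketch: the helper names `lhs` and `rhs` are hypothetical, and integer values of $a$ are used purely for convenience (the identity itself follows from summing two geometric series, one with ratio $1/a$ and one with ratio $2/a$).

```python
from fractions import Fraction


def lhs(i, a):
    """Left-hand side of identity (2): sum over r of (2^i - 2^r) / a^(r+1)."""
    return sum(Fraction(2**i - 2**r, a**(r + 1)) for r in range(i))


def rhs(i, a):
    """Right-hand side of identity (2)."""
    a = Fraction(a)
    return 2**i / (a - 1) - 1 / (a - 2) + 2**i / ((a - 1) * (a - 2) * a**i)


# Compare both sides exactly for i = 0..7 and integer a = 3..11 (so that a > 2).
identity_holds = all(lhs(i, a) == rhs(i, a) for i in range(8) for a in range(3, 12))
print(identity_holds)
```

The case $i = 0$ is included on purpose: the empty sum on the left is 0, and the three terms on the right cancel.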

Lemma 2. Under the assumptions of this section, we have $P(R_i^l \ge Y_{k:n}) = 1 - \varphi(i, k, n)$, where

$$ \varphi(i, k, n) = k \binom{n}{k} \sum_{s=0}^{k-1} \binom{k-1}{s} (-1)^s \frac{2^i (n-k+s+3)^{-i}}{(n-k+s+1)(n-k+s+2)}. \tag{3} $$

Proof. From Lemma 2 of Ahmadi and Balakrishnan (2005a), the survival function of $R_i^l$ is given by

$$ \bar F_{R_i^l}(u) = \bar F(u) \left[ 2^i - \bar F(u) \sum_{r=0}^{i-1} (2^i - 2^r) \frac{\{-\log \bar F(u)\}^r}{r!} \right]. \tag{4} $$

Using the pdf of $Y_{k:n}$ (see, for example, David & Nagaraja, 2003, page 10) and the expression in (4), we obtain

$$ \begin{aligned} P(R_i^l \ge Y_{k:n}) &= \int_{-\infty}^{+\infty} P(R_i^l \ge y)\, k \binom{n}{k} [F(y)]^{k-1} [\bar F(y)]^{n-k} f(y)\, dy \\ &= \int_0^1 y \left[ 2^i - y \sum_{r=0}^{i-1} (2^i - 2^r) \frac{(-\log y)^r}{r!} \right] k \binom{n}{k} (1-y)^{k-1} y^{n-k}\, dy \\ &= 2^i\, \frac{n-k+1}{n+1} - k \binom{n}{k} \sum_{r=0}^{i-1} (2^i - 2^r) \int_0^1 \frac{(-\log y)^r}{r!} (1-y)^{k-1} y^{n-k+2}\, dy. \end{aligned} \tag{5} $$

Next, we have

$$ \begin{aligned} \int_0^1 \frac{(-\log y)^r}{r!}\, y^{n-k+2} (1-y)^{k-1}\, dy &= \sum_{s=0}^{k-1} (-1)^s \binom{k-1}{s} \int_0^1 \frac{(-\log y)^r}{r!}\, y^{n-k+s+2}\, dy \\ &= \sum_{s=0}^{k-1} \binom{k-1}{s} \frac{(-1)^s}{(n-k+s+3)^{r+1}}. \end{aligned} \tag{6} $$

Using (2) now, we can write

$$ \begin{aligned} \sum_{r=0}^{i-1} \sum_{s=0}^{k-1} (2^i - 2^r) (-1)^s \frac{\binom{k-1}{s}}{(n-k+s+3)^{r+1}} &= 2^i \sum_{s=0}^{k-1} \binom{k-1}{s} \frac{(-1)^s}{n-k+s+2} - \sum_{s=0}^{k-1} \binom{k-1}{s} \frac{(-1)^s}{n-k+s+1} \\ &\quad + \sum_{s=0}^{k-1} \binom{k-1}{s} \frac{(-1)^s\, 2^i (n-k+s+3)^{-i}}{(n-k+s+1)(n-k+s+2)} \\ &= 2^i\, \frac{(k-1)!\,(n-k+1)!}{(n+1)!} - \frac{(k-1)!\,(n-k)!}{n!} \\ &\quad + \sum_{s=0}^{k-1} \binom{k-1}{s} \frac{(-1)^s\, 2^i (n-k+s+3)^{-i}}{(n-k+s+1)(n-k+s+2)}. \end{aligned} \tag{7} $$

Substituting (6) and (7) into (5) and simplifying, the required result follows.

Hence, from Lemma 2, we find the coverage probability of the event $R_i^l \le Y_{k:n} \le R_j^l$ as

$$ \alpha_1(i, j; k, n) = P(R_i^l \le Y_{k:n} \le R_j^l) = \varphi(i, k, n) - \varphi(j, k, n), \tag{8} $$

where $\varphi(\cdot,\cdot,\cdot)$ is as given in (3). Thus, we have a prediction interval $(R_i^l, R_j^l)$, $j > i \ge 0$, for $Y_{k:n}$, $1 \le k \le n$, whose prediction coefficient, given by (8), is free of $F$. From (8), it is observed that for fixed $n$ and $k$, the prediction coefficient $\alpha_1(i, j; k, n)$ is decreasing in $i$ and increasing in $j$, as we would expect.

We may be interested in finding a one-sided prediction interval of the form $[R_i^l, +\infty)$ or $(-\infty, R_j^l]$ for future order statistics. The coverage probabilities of these intervals can be easily obtained from (8) and are given by $\varphi(i, k, n)$ and $1 - \varphi(j, k, n)$, respectively. In this case, $R_i^l$ is a lower prediction bound and $R_j^l$ is an upper prediction bound for the $k$th order statistic from the future sample.

Some special cases

We have the following simpler expressions for some special cases. It is well known that, for $n$ components with iid lifetimes, the times to failure of series and parallel systems are the first and last order statistics of those lifetimes.

• Prediction interval for minimum of the future sample: Taking $k = 1$, we find from (8) that

$$ P(R_i^l \le Y_{1:n} \le R_j^l) = \frac{1}{n+1} \left[ \left( \frac{2}{n+2} \right)^i - \left( \frac{2}{n+2} \right)^j \right]. $$

• Prediction interval for maximum of the future sample: With $k = n$, we find from (8) that

$$ P(R_i^l \le Y_{n:n} \le R_j^l) = n \sum_{s=0}^{n-1} \binom{n-1}{s} \frac{(-1)^s}{(s+1)(s+2)} \left[ \left( \frac{2}{s+3} \right)^i - \left( \frac{2}{s+3} \right)^j \right]. $$

• Prediction interval for median of the future sample: With $n$ odd and $k = (n + 1)/2$, we find from (8) that

$$ P(R_i^l \le Y_{k:2k-1} \le R_j^l) = k \binom{2k-1}{k} \sum_{s=0}^{k-1} \binom{k-1}{s} \frac{(-1)^s}{(k+s)(k+s+1)} \left[ \left( \frac{2}{k+s+2} \right)^i - \left( \frac{2}{k+s+2} \right)^j \right], $$

or equivalently

$$ P(R_i^l \le Y_{\frac{n+1}{2}:n} \le R_j^l) = \frac{n+1}{2} \binom{n}{\frac{n+1}{2}} \sum_{s=0}^{\frac{n-1}{2}} \binom{\frac{n-1}{2}}{s} \frac{4(-1)^s}{(n+2s+1)(n+2s+3)} \left[ \left( \frac{4}{n+2s+5} \right)^i - \left( \frac{4}{n+2s+5} \right)^j \right]. $$


Table 1
The values of α1(i, j; k, n) for n = 20, 50 and some selected choices of i, j and k.

 n    k   i    j=4     j=6     j=8     j=10    j=12    j=14    j=16    j=18    j=20
20   10   0   0.4654  0.4754  0.4761  0.4762  0.4762  0.4762  0.4762  0.4762  0.4762
          1   0.2273  0.2373  0.2380  0.2381  0.2381  0.2381  0.2381  0.2381  0.2381
          2   0.0869  0.0969  0.0976  0.0977  0.0977  0.0977  0.0977  0.0977  0.0977
     15   0   0.6210  0.6959  0.7115  0.7139  0.7142  0.7143  0.7143  0.7143  0.7143
          1   0.4262  0.5011  0.5167  0.5191  0.5194  0.5195  0.5195  0.5195  0.5195
          2   0.2362  0.3111  0.3267  0.3291  0.3295  0.3295  0.3295  0.3295  0.3295
     19   0   0.4731  0.6953  0.8203  0.8751  0.8954  0.9020  0.9040  0.9046  0.9047
          1   0.3909  0.6131  0.7380  0.7929  0.8131  0.8197  0.8217  0.8223  0.8225
          2   0.2746  0.4968  0.6218  0.6766  0.6969  0.7035  0.7055  0.7060  0.7062
     20   0   0.3079  0.5276  0.7053  0.8214  0.8873  0.9214  0.9380  0.9458  0.9494
          1   0.2646  0.4843  0.6620  0.7781  0.8440  0.8781  0.8947  0.9025  0.9061
          2   0.1970  0.4167  0.5944  0.7105  0.7764  0.8105  0.8271  0.8349  0.8385
          3   0.1064  0.3260  0.5037  0.6198  0.6857  0.7198  0.7365  0.7443  0.7479
50   35   0   0.6253  0.6780  0.6855  0.6862  0.6863  0.6863  0.6863  0.6863  0.6863
          1   0.4141  0.4668  0.4743  0.4751  0.4751  0.4751  0.4751  0.4751  0.4751
          2   0.2171  0.2698  0.2773  0.2781  0.2781  0.2781  0.2781  0.2781  0.2781
     40   0   0.6420  0.7537  0.7797  0.7838  0.7843  0.7843  0.7843  0.7843  0.7843
          1   0.4761  0.5878  0.6138  0.6179  0.6184  0.6184  0.6184  0.6184  0.6184
          2   0.2871  0.3988  0.4248  0.4289  0.4294  0.4294  0.4294  0.4294  0.4294
     45   0   0.5556  0.7643  0.8508  0.8758  0.8812  0.8822  0.8823  0.8824  0.8824
          1   0.4538  0.6625  0.7490  0.7740  0.7794  0.7804  0.7805  0.7805  0.7805
          2   0.3118  0.5205  0.6070  0.6320  0.6374  0.6384  0.6385  0.6385  0.6385
     48   0   0.3914  0.6473  0.8171  0.8984  0.9287  0.9380  0.9404  0.9410  0.9411
          1   0.3371  0.5930  0.7628  0.8441  0.8744  0.8837  0.8861  0.8867  0.8868
          2   0.2507  0.5066  0.6764  0.7577  0.7880  0.7973  0.7998  0.8003  0.8005
     50   0   0.1755  0.3644  0.5669  0.7332  0.8456  0.9118  0.9471  0.9648  0.9732
          1   0.1566  0.3456  0.5481  0.7143  0.8268  0.8929  0.9283  0.9459  0.9544
          2   0.1235  0.3124  0.5149  0.6812  0.7936  0.8598  0.8951  0.9128  0.9212
          3   0.0717  0.2607  0.4632  0.6294  0.7419  0.8080  0.8434  0.8610  0.8695

Notice that, for a given $\alpha_0$, $k$ and $n$, the two-sided prediction interval exists if and only if, for a large $m$, $P(R_0^l \le Y_{k:n} \le R_m^l) \ge \alpha_0$. From (8), this condition is equivalent to

$$ \max_{i,j} \alpha_1(i, j; k, n) = k \binom{n}{k} \sum_{s=0}^{k-1} \binom{k-1}{s} \frac{(-1)^s}{(n-k+s+1)(n-k+s+2)} = \frac{k}{n+1} \ge \alpha_0, $$

which means that we can construct a prediction interval for $Y_{k:n}$ with a coverage probability of at least $\alpha_0$ by using current upper record values if $k \ge (n + 1)\alpha_0$, which supports our initial intuition. It should be mentioned that Ahmadi and Balakrishnan (2010) found the same result on the basis of ordinary upper records. In the next subsection, we show that current lower records are appropriate for the construction of prediction intervals for lower order statistics.

If $n$, $k$ and the desired prediction level $\alpha_0$ are specified, we can choose $i$ and $j$ so that $\alpha_1(i, j; k, n)$ exceeds $\alpha_0$, and the prediction coefficients can be determined based on the computation of (8). Table 1 presents values of $\alpha_1(i, j; k, n)$ for some choices of $n$ and $k$. Thus, $(R_i^l, R_j^l)$ is a $100\alpha_1\%$ prediction interval for $Y_{k:n}$, the $k$th order statistic from a future sample of size $n$, with $\alpha_1$ given by (8), which does not depend on the parent distribution $F$.

For comparing the results with those of Ahmadi and Balakrishnan (2010), let us denote the $m$th usual upper record by $U_m$, with $U_1 \equiv X_1$. Then, $R_0^l = U_1$ and $R_j^l \le U_{j+1}$ with probability one for $j \ge 1$, and so the expected width of the prediction interval $(U_1, U_{j+1})$ is greater than the expected width of $(R_0^l, R_j^l)$. It can be shown that, for the Uniform(0, 1) distribution,

$$ E(U_j) = 1 - \frac{1}{2^j} \quad \text{and} \quad E(R_j^l) = 1 - \frac{1}{3} \left( \frac{2}{3} \right)^{j-1}. $$

Thus, the two prediction intervals $(U_1, U_{j_0+1})$ and $(R_0^l, R_j^l)$ would have the same expected width if $j = 1.71 j_0$, where $j$ and $j_0$ denote the number of current and usual records, respectively. For example, when $j_0 = 6$ we have $j \approx 10$; then, from Table 1 of Ahmadi and Balakrishnan (2010) and the results presented earlier, we have the following results:


k                                  10      15      19      20
P(R_0^l ≤ Y_{k:20} ≤ R_10^l)     0.4762  0.7136  0.8751  0.8214
P(U_1 ≤ Y_{k:20} ≤ U_6)          0.4760  0.7101  0.8377  0.7693

k                                  35      40      45      50
P(R_0^l ≤ Y_{k:50} ≤ R_10^l)     0.6862  0.7838  0.8758  0.7332
P(U_1 ≤ Y_{k:50} ≤ U_6)          0.6844  0.7774  0.8518  0.6772
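The factor 1.71 used to match the two kinds of interval follows from equating the expected widths under the Uniform(0, 1) distribution: $\tfrac12 - 2^{-(j_0+1)} = \tfrac12 - \tfrac12 (2/3)^j$ gives $j = j_0 \log 2 / \log 1.5$. A small numerical sketch of this step, with purely illustrative function names, is:

```python
import math


def expected_usual_upper(j):
    """E(U_j) for Uniform(0, 1), as stated in the text."""
    return 1 - 0.5**j


def expected_current_upper(j):
    """E(R^l_j) for Uniform(0, 1), as stated in the text."""
    return 1 - (1 / 3) * (2 / 3)**(j - 1)


def width_usual(j0):
    """Expected width of (U_1, U_{j0+1})."""
    return expected_usual_upper(j0 + 1) - expected_usual_upper(1)


def width_current(j):
    """Expected width of (R^l_0, R^l_j)."""
    return expected_current_upper(j) - expected_current_upper(0)


ratio = math.log(2) / math.log(1.5)     # j / j0 for equal expected widths
print(round(ratio, 2))
print(width_usual(6), width_current(10))
```

With $j_0 = 6$ and $j = 10$ the two expected widths agree to about three decimal places, which is the sense in which the intervals compared in the table are matched.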

From the tabulated values above, we observe that the coverage probability is increased by using current records. Raqab (2007) obtained sharp distribution-free bounds for the expected value of the gap between the current upper records and usual upper records, i.e., $E(U_{j+1} - R_j^l)$ for $j \ge 1$, and these results would be useful for comparing the two types of prediction intervals.

2.2. PIs for order statistics based on current lower records

As mentioned in the preceding subsection, for the construction of prediction intervals for lower order statistics, i.e., $Y_{k:n}$ with $k \le (n + 1)/2$, current upper records are not suitable and current lower records are better suited. Here, we present an outline of the corresponding results. Suppose the assumptions of this section hold, and that $R_i^s$ denotes the $i$th current lower record from the $X$-sequence. Then, it can be shown that

$$ P(Y_{k:n} \le R_i^s) = P(U_{k:n} \le U_i^s) = P(U_{n-k+1:n} \ge U_i^l) = P(Y_{n-k+1:n} \ge R_i^l), \tag{9} $$

where $U_{r:n}$ is the $r$th standard uniform order statistic from a sample of size $n$ and $U_i^l$ (and $U_i^s$) is the $i$th current upper (and lower) record from the standard uniform distribution. Hence, by using (9) and Lemma 2, we can construct prediction intervals for order statistics from a future sample on the basis of current lower records. For $i < j$, we have

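$$ \begin{aligned} \alpha_2(i, j; k, n) &= P(R_j^s \le Y_{k:n} \le R_i^s) = \varphi(i, n-k+1, n) - \varphi(j, n-k+1, n) \\ &= k \binom{n}{k} \sum_{s=0}^{n-k} \binom{n-k}{s} \frac{(-1)^s}{(k+s)(k+s+1)} \left[ \left( \frac{2}{k+s+2} \right)^i - \left( \frac{2}{k+s+2} \right)^j \right], \end{aligned} \tag{10} $$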

where $\varphi(\cdot,\cdot,\cdot)$ is as given in (3). Since $\alpha_2(i, j; k, n) = \alpha_1(i, j; n-k+1, n)$, we can use the values in Table 1 for the case of lower records simply by replacing $k$ by $n - k + 1$. For a given $\alpha_0$, $k$ and $n$, the two-sided prediction interval exists if and only if, for a large $m$, the inequality $P(R_m^s \le Y_{k:n} \le R_0^s) \ge \alpha_0$ holds. From (10), this is equivalent to the condition

$$ \max_{i,j} \alpha_2(i, j; k, n) = k \binom{n}{k} \sum_{s=0}^{n-k} \binom{n-k}{s} \frac{(-1)^s}{(k+s)(k+s+1)} = \frac{n-k+1}{n+1} \ge \alpha_0, $$

which means we can construct a prediction interval for $Y_{k:n}$ with a coverage probability of at least $\alpha_0$ based on current lower record values if $k \le (n + 1)(1 - \alpha_0)$, as we expected intuitively. For example, if $\alpha_0 = 0.70$ and $n = 29$, then a prediction interval with a coverage probability of at least 0.70 for $Y_{k:29}$ can be constructed on the basis of current lower record values only for $k \le 9$.

2.3. PIs for order statistics based on record coverage

In the last two subsections, we observed that the lower current records are suitable for predicting lower order statistics while the upper current records are suitable for predicting upper order statistics. Our intuition then suggests that record coverage will be suitable for predicting middle order statistics. Now, let $(R_m^s, R_m^l)$, $m \ge 1$, be the $m$th record coverage of the $X$-sequence when the $m$th record of any kind (either upper or lower) is observed. Then, by Lemmas 1 and 2, we find

$$ \begin{aligned} \alpha_3(m; k, n) &= P(R_m^s \le Y_{k:n} \le R_m^l) = 1 - \varphi(m, k, n) - \varphi(m, n-k+1, n) \\ &= 1 - k \binom{n}{k} 2^m \left\{ \sum_{s=0}^{n-k} \binom{n-k}{s} \frac{(-1)^s}{(k+s)(k+s+1)(k+s+2)^m} \right. \\ &\qquad \left. + \sum_{s=0}^{k-1} \binom{k-1}{s} \frac{(-1)^s}{(n-k+s+1)(n-k+s+2)(n-k+s+3)^m} \right\}, \end{aligned} \tag{11} $$

where $\varphi(\cdot,\cdot,\cdot)$ is given in (3).
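The distribution-free nature of (11) can be illustrated by simulation. The sketch below is an assumption-laden example rather than a procedure from the paper: the function names, the seed, the number of replications and the two parent distributions (exponential and normal) are all arbitrary choices. For $m = 1$ the record coverage is simply $(\min(X_1, X_2), \max(X_1, X_2))$, and (11) then gives $\alpha_3(1; 5, 10) = 5/11 \approx 0.4545$, the value listed in Table 3.

```python
import random


def record_coverage(m, draw):
    """Simulate (R^s_m, R^l_m): draw observations until the m-th record of either kind."""
    lo = hi = draw()
    count = 0
    while count < m:
        x = draw()
        if x > hi:
            hi, count = x, count + 1
        elif x < lo:
            lo, count = x, count + 1
    return lo, hi


def estimate_alpha3(m, k, n, draw, reps=20000):
    """Monte Carlo estimate of P(R^s_m <= Y_{k:n} <= R^l_m)."""
    hits = 0
    for _ in range(reps):
        lo, hi = record_coverage(m, draw)
        y_k = sorted(draw() for _ in range(n))[k - 1]   # k-th order statistic
        hits += lo <= y_k <= hi
    return hits / reps


rng = random.Random(1)                                   # fixed seed for repeatability
est_exp = estimate_alpha3(1, 5, 10, lambda: rng.expovariate(1.0))
est_norm = estimate_alpha3(1, 5, 10, lambda: rng.gauss(0.0, 1.0))
print(est_exp, est_norm, 5 / 11)
```

Both estimates should land close to the same value even though the parent distributions differ, which is the practical content of "free of $F$".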


Table 2
The minimum values of m which satisfy the inequality (12) for α0 = 0.90.

n = 10   k        1   2   3   4   5   6   7   8   9
         m_opt   10   6   5   4   4   4   4   5   6

n = 20   k        2   3   4   5   6   7   8   9  10
         m_opt    8   7   6   5   4   4   4   3   3

n = 25   k        2   3   5   7   8   9  10  11  12
         m_opt    9   7   6   5   4   4   4   3   3

n = 50   k        2   3   5   7  10  14  18  22  25
         m_opt   10   9   7   6   5   4   4   3   3

Thus, the record coverage $(R_m^s, R_m^l)$, $m \ge 1$, is a two-sided prediction interval for $Y_{k:n}$, $1 \le k \le n$, the $k$th order statistic from a future $Y$-sample, whose coverage probability is free of $F$ and is given by (11). For a given $\alpha_0$, $k$ and $n$, the two-sided prediction interval $(R_m^s, R_m^l)$, $m \ge 1$, exists if and only if

$$ 1 - \varphi(m, k, n) - \varphi(m, n-k+1, n) \ge \alpha_0. \tag{12} $$
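Inequality (12) can be turned into a simple search for the smallest $m$ that attains a given level. The sketch below is one possible implementation under stated assumptions: the names `phi`, `alpha3` and `m_opt` and the cap `m_max` are hypothetical, and exact fractions are used to avoid cancellation in the alternating sums.

```python
from fractions import Fraction
from math import comb


def phi(i, k, n):
    """phi(i, k, n) of Eq. (3) as an exact Fraction."""
    total = Fraction(0)
    for s in range(k):
        a = n - k + s + 1
        total += Fraction((-1)**s * comb(k - 1, s) * 2**i, a * (a + 1) * (a + 2)**i)
    return k * comb(n, k) * total


def alpha3(m, k, n):
    """Coverage probability of the record coverage (R^s_m, R^l_m) for Y_{k:n}, Eq. (11)."""
    return 1 - phi(m, k, n) - phi(m, n - k + 1, n)


def m_opt(k, n, alpha0, m_max=200):
    """Smallest m >= 1 satisfying inequality (12), or None if m_max is reached first."""
    for m in range(1, m_max + 1):
        if alpha3(m, k, n) >= alpha0:
            return m
    return None


# Table 2 lists m_opt = 4 for n = 10, k = 5 and alpha0 = 0.90.
print(m_opt(5, 10, Fraction(9, 10)))
print(float(alpha3(1, 5, 10)))
```

Passing the level as a `Fraction` keeps the comparison in (12) exact; a float level also works but reintroduces rounding at the boundary.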

To choose the optimal interval when $k$, $n$ and the desired prediction level $\alpha_0$ are specified, we start with $m = 1$ and gradually increase it until $\alpha_3(m; k, n)$ is at least $\alpha_0$, resulting in an optimal $m_{opt}$. We have determined in this manner the minimum $m$ needed for the construction of the prediction interval with shortest width for $\alpha_0 = 0.90$ and some selected values of $n$ and $k$, and these results are presented in Table 2. From Table 2, it is observed that $m$ is minimum for the construction of the prediction interval for the median of the future sample, which supports our intuition.

Some special cases

• Prediction interval for minimum of the future sample: Taking $k = 1$, we have from (11) that

$$ P(R_m^s \le Y_{1:n} \le R_m^l) = 1 - 2^m \left\{ \frac{(n+2)^{-m}}{n+1} + n \sum_{s=0}^{n-1} \binom{n-1}{s} (-1)^s \frac{(s+3)^{-m}}{(s+1)(s+2)} \right\}. $$

• Prediction interval for maximum of the future sample: With $k = n$, we have from (11), by using the symmetry property of $\alpha_3(m; k, n)$, that

$$ P(R_m^s \le Y_{n:n} \le R_m^l) = P(R_m^s \le Y_{1:n} \le R_m^l). $$

• Prediction interval for median of the future sample: With $n$ odd and $k = (n + 1)/2$, we have from (11) that

$$ P(R_m^s \le Y_{k:2k-1} \le R_m^l) = 1 - k\, 2^{m+1} \binom{2k-1}{k} \sum_{s=0}^{k-1} \binom{k-1}{s} (-1)^s \frac{(k+s+2)^{-m}}{(k+s)(k+s+1)}. $$

From (11), it is observed that $\alpha_3(m; k, n)$ is symmetric around $k \approx [n/2] + 1$ for fixed $n$ and $m$, and is increasing in $m$ for fixed $n$ and $k$. Table 3 presents values of $\alpha_3(m; k, n)$ for $m = 1$ up to 10, $n = 10, 20, 25, 50$, and some selected choices of $k$. Values of the prediction coefficient for upper order statistics can be obtained by using the symmetry property of $\alpha_3(m; k, n)$ with respect to $k$. For fixed $m$ and $n$, we have the maximum values of $\alpha_3(m; k, n)$ for the middle order statistics, as seen in Table 3.

For comparing these results with those of Ahmadi and Balakrishnan (2010), let us denote the $j$th usual lower record by $L_j$, with $L_1 \equiv X_1$. It is logical to expect the prediction intervals with minimum expected width for middle order statistics based on upper and lower usual records jointly to be of the form $(L_{j+1}, U_{j+1})$, for $j \ge 1$. Also, we have $R_0^l = U_1 = L_1$ and $R_j^l \le U_{j+1}$ with probability one for $j \ge 1$. For the case of the Uniform(0, 1) distribution, we have

$$ E(U_{j+1} - L_{j+1}) = 2E(U_{j+1}) - 1 = 1 - \frac{1}{2^j} \quad \text{and} \quad E(R_m^l - R_m^s) = 2E(R_m^l) - 1 = 1 - \left( \frac{2}{3} \right)^m. $$

Thus, the two prediction intervals $(L_{j+1}, U_{j+1})$ and $(R_m^s, R_m^l)$ have the same expected width if $m = 1.71j$, where $m$ and $j$ denote the number of current and usual records, respectively. For example, if $j = 3$, then $m = 5.13 \approx 5$, and then we have the following results:

(n, k)      P(L_4 ≤ Y_{k:n} ≤ U_4)   P(R_5^s ≤ Y_{k:n} ≤ R_5^l)
(10, 6)     0.9738                    0.9837
(20, 10)    0.9821                    0.9915
(25, 13)    0.9838                    0.9929
(50, 25)    0.9863                    0.9949


Table 3
The values of α3(m; k, n) for m = 1(1)10 and some selected choices of n and k.

 n    k    m=1     m=2     m=3     m=4     m=5     m=6     m=7     m=8     m=9     m=10
10    1   0.1515  0.2671  0.3876  0.5080  0.6186  0.7131  0.7896  0.8488  0.8931  0.9254
      2   0.2727  0.4651  0.6227  0.7477  0.8395  0.9022  0.9425  0.9671  0.9817  0.9900
      3   0.3636  0.6063  0.7673  0.8696  0.9306  0.9647  0.9827  0.9918  0.9962  0.9983
      5   0.4545  0.7420  0.8898  0.9563  0.9837  0.9942  0.9980  0.9993  0.9998  0.9999
      8   0.3636  0.6063  0.7673  0.8696  0.9306  0.9647  0.9827  0.9918  0.9962  0.9983
      9   0.2727  0.4651  0.6227  0.7477  0.8395  0.9022  0.9425  0.9671  0.9817  0.9900
     10   0.1515  0.2671  0.3876  0.5080  0.6186  0.7131  0.7896  0.8488  0.8931  0.9254
20    1   0.0866  0.1581  0.2491  0.3556  0.4675  0.5752  0.6715  0.7529  0.8183  0.8690
      3   0.2338  0.4060  0.5665  0.7062  0.8143  0.8898  0.9380  0.9668  0.9829  0.9915
      5   0.3463  0.5820  0.7498  0.8622  0.9299  0.9669  0.9853  0.9938  0.9975  0.9990
      7   0.4242  0.6981  0.8543  0.9352  0.9733  0.9897  0.9963  0.9987  0.9996  0.9999
      9   0.4675  0.7609  0.9056  0.9663  0.9889  0.9966  0.9990  0.9997  0.9999  1.0000
     10   0.4762  0.7733  0.9154  0.9718  0.9915  0.9976  0.9994  0.9998  1.0000  1.0000
     12   0.4675  0.7609  0.9056  0.9663  0.9889  0.9966  0.9990  0.9997  0.9999  1.0000
     15   0.3896  0.6470  0.8099  0.9056  0.9568  0.9816  0.9926  0.9972  0.9990  0.9997
25    1   0.0712  0.1315  0.2122  0.3111  0.4195  0.5279  0.6284  0.7158  0.7880  0.8452
      5   0.2991  0.5101  0.6809  0.8097  0.8957  0.9471  0.9749  0.9887  0.9952  0.9980
      7   0.3789  0.6315  0.7972  0.8980  0.9530  0.9800  0.9921  0.9971  0.9990  0.9997
      9   0.4359  0.7154  0.8693  0.9451  0.9788  0.9924  0.9975  0.9992  0.9998  0.9999
     11   0.4701  0.7646  0.9087  0.9682  0.9898  0.9970  0.9992  0.9998  0.9999  1.0000
     13   0.4815  0.7808  0.9212  0.9751  0.9929  0.9981  0.9995  0.9999  1.0000  1.0000
50    1   0.0377  0.0716  0.1233  0.1951  0.2840  0.3840  0.4873  0.5865  0.6761  0.7528
      5   0.1735  0.3099  0.4616  0.6128  0.7433  0.8423  0.9098  0.9517  0.9756  0.9883
     10   0.3092  0.5273  0.7027  0.8319  0.9143  0.9602  0.9830  0.9933  0.9975  0.9991
     15   0.4072  0.6743  0.8370  0.9268  0.9705  0.9892  0.9964  0.9989  0.9997  0.9999
     20   0.4676  0.7612  0.9066  0.9675  0.9898  0.9971  0.9992  0.9998  1.0000  1.0000
     25   0.4902  0.7932  0.9304  0.9799  0.9949  0.9988  0.9998  1.0000  1.0000  1.0000

We observe once again that the coverage probability increases with the use of current records instead of usual records. Moreover, it may be noted that if the numbers of lower and upper usual records extracted from the $X$-sequence equal $i$ and $j$, then the number of current records in the corresponding sequence equals $m = i + j - 2$, with $R_0^s = R_0^l = U_1 = L_1 \equiv X_1$.

3. Prediction of current records based on order statistics

As mentioned earlier in Section 1, the current record values can be used in a general sequential method for model choice and outlier detection; see Basak (2000). Prediction of current records also becomes quite pertinent in climatology, as one is often interested in any record that occurs irrespective of whether it is an upper or a lower record. Raqab (2009) obtained prediction intervals for the current records from a future iid sequence based on observed current records from an independent iid sequence of the same parent. Here, we consider the prediction of the current records from a future sequence in terms of order statistics from an informative sample. Throughout this section, we assume that $Y_{1:n} \le \cdots \le Y_{n:n}$ are the observed order statistics from the $Y$-sample of $n$ iid random variables from $F$, and $R_k^l$ (or $R_k^s$) is the $k$th upper (or lower) current record from a future sequence from the same distribution. By using Lemma 2, we obtain two-sided prediction intervals as well as lower and upper prediction bounds for $R_k^l$ and $R_k^s$ ($k \ge 0$), separately, whose endpoints are based on observed order statistics.

3.1. PIs for current upper records

Here, the conditions of Lemma 1 hold and so by using Lemma 2, we obtain

$$ \alpha_4(k; i, j, n) = P(Y_{i:n} \le R_k^l \le Y_{j:n}) = P(R_k^l \le Y_{j:n}) - P(R_k^l \le Y_{i:n}) = \varphi(k, j, n) - \varphi(k, i, n), \tag{13} $$

where $\varphi(\cdot,\cdot,\cdot)$ is given in (3). If we are interested in finding lower and upper bounds for $R_k^l$, $k \ge 0$, on the basis of $Y_{i:n}$, $1 \le i \le n$, the prediction coefficients of these one-sided prediction intervals can be deduced from (13); for example, $P(R_k^l \le Y_{j:n}) = \varphi(k, j, n)$.

We obtain another useful expression for $\alpha_4(k; i, j, n)$ by conditioning first on $R_k^l$ and using (1), as follows:

$$ \begin{aligned} \alpha_4(k; i, j, n) &= 2^k \sum_{t=i}^{j-1} \binom{n}{t} \int_{-\infty}^{+\infty} \{F(y)\}^t \{\bar F(y)\}^{n-t} f(y)\, dy \\ &\quad - 2^k \sum_{t=i}^{j-1} \sum_{r=0}^{k-1} \binom{n}{t} \int_{-\infty}^{+\infty} \frac{[-\log \bar F(y)]^r}{r!} \{F(y)\}^t \{\bar F(y)\}^{n-t+1} f(y)\, dy \\ &= 2^k\, \frac{j-i}{n+1} - 2^k \sum_{t=i}^{j-1} \sum_{r=0}^{k-1} \sum_{s=0}^{t} \binom{n}{t} \binom{t}{s} \frac{(-1)^s}{(n-t+s+2)^{r+1}} \\ &= \sum_{t=i}^{j-1} \sum_{s=0}^{t} \binom{n}{t} \binom{t}{s} \frac{2^k (-1)^s}{(n-t+s+1)(n-t+s+2)^k}. \end{aligned} \tag{14} $$
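Expressions (13) and (14) describe the same quantity, so evaluating both is a useful cross-check. The following sketch does this in exact arithmetic; the function names `alpha4_phi` and `alpha4_sum` are illustrative and not taken from the paper.

```python
from fractions import Fraction
from math import comb


def phi(i, k, n):
    """phi(i, k, n) of Eq. (3) as an exact Fraction."""
    total = Fraction(0)
    for s in range(k):
        a = n - k + s + 1
        total += Fraction((-1)**s * comb(k - 1, s) * 2**i, a * (a + 1) * (a + 2)**i)
    return k * comb(n, k) * total


def alpha4_phi(k, i, j, n):
    """alpha4(k; i, j, n) via Eq. (13)."""
    return phi(k, j, n) - phi(k, i, n)


def alpha4_sum(k, i, j, n):
    """alpha4(k; i, j, n) via the double sum in the last line of Eq. (14)."""
    total = Fraction(0)
    for t in range(i, j):
        for s in range(t + 1):
            a = n - t + s + 1
            total += Fraction((-1)**s * comb(n, t) * comb(t, s) * 2**k, a * (a + 1)**k)
    return total


# Table 4 lists 0.9502 for n = 50, i = 5, k = 1 with j = n.
print(float(alpha4_phi(1, 5, 50, 50)), float(alpha4_sum(1, 5, 50, 50)))
```

For $k = 0$ both forms collapse to $(j - i)/(n + 1)$, the probability that a single future observation falls between $Y_{i:n}$ and $Y_{j:n}$.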

The above expression is simpler for numerical computation. For the case of usual upper records, Ahmadi and Balakrishnan (2010) derived an expression similar to the one in (14). For a given $\alpha_0$ and $k$, the two-sided prediction interval for $R_k^l$ based on order statistics exists if and only if

$$ \max_{i,j} \alpha_4(k; i, j, n) \ge \alpha_0. \tag{15} $$

Using (13), the condition in (15) is then equivalent to

$$ \max_{i,j} \alpha_4(k; i, j, n) = 1 - \frac{2^k}{(n+1)(n+2)^k} - 2^k \sum_{s=0}^{n} \binom{n}{s} \frac{(-1)^s}{(s+1)(s+2)^k} \ge \alpha_0, $$

which means one can construct a prediction interval for $R_k^l$ with a coverage probability of at least $\alpha_0$ on the basis of the observed order statistics $Y_{1:n}, \ldots, Y_{n:n}$, if $n$ is such that

$$ \frac{1}{(n+1)(n+2)^k} + \sum_{s=0}^{n} \binom{n}{s} \frac{(-1)^s}{(s+1)(s+2)^k} \le \frac{1 - \alpha_0}{2^k}. $$

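We have determined the minimum sample size, $n_{\min}$, needed to construct prediction intervals for $R_k^l$ with prediction coefficient at least $\alpha_0 = 0.80$ and $0.95$, and these values are as follows:

| $\alpha_0$ | | $k = 0$ | $k = 1$ | $k = 2$ | $k = 3$ | $k = 4$ | $k = 5$ | $k = 6$ |
|---|---|---|---|---|---|---|---|---|
| 0.800 | $n_{\min}$ | 9 | 9 | 15 | 27 | 48 | 85 | 150 |
| 0.950 | $n_{\min}$ | 39 | 39 | 74 | 140 | 263 | 492 | 912 |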
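The search for $n_{\min}$ can be sketched as follows. This is an illustrative Python sketch under stated assumptions rather than the authors' procedure: the names `max_coverage` and `n_min` and the cap `n_max` are hypothetical, and $\alpha_0$ is passed as an exact fraction so that boundary cases such as $1 - 2/10 = 0.80$ are compared exactly.

```python
from fractions import Fraction
from math import comb


def max_coverage(k: int, n: int) -> Fraction:
    """alpha_4(k; 1, n, n), the largest coverage attainable from n order statistics."""
    tail = sum(
        Fraction((-1) ** s * comb(n, s), (s + 1) * (s + 2) ** k)
        for s in range(n + 1)
    )
    return 1 - Fraction(2 ** k, (n + 1) * (n + 2) ** k) - 2 ** k * tail


def n_min(k: int, alpha0: Fraction, n_max: int = 2000) -> int:
    """Smallest n >= 2 with max_coverage(k, n) >= alpha0 (n_max is an arbitrary cap)."""
    for n in range(2, n_max + 1):
        if max_coverage(k, n) >= alpha0:
            return n
    raise ValueError("no n <= n_max reaches the requested coverage")


if __name__ == "__main__":
    print(n_min(0, Fraction(4, 5)))     # 9
    print(n_min(0, Fraction(19, 20)))   # 39
```

For $k = 0$ and $k = 1$ the maximal coverage equals $(n - 1)/(n + 1)$, so the thresholds 9 and 39 are met with equality. When the coverage lies within rounding distance of $\alpha_0$, the result of an exact comparison is sensitive to that rounding, so a value found this way may differ by one from a tabulated entry.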

Logically, one would use upper order statistics for predicting current upper records. Therefore, we have considered prediction intervals of the form $(Y_{i:n}, Y_{n:n})$, and presented in Table 4 the values of $\alpha_4(k; i, n, n)$ for some selected choices of $i$, $n$ and $k$. From Table 4, it is observed that $\alpha_4(k; i, n, n)$ is decreasing in $i$ and increasing in $n$ when the other arguments are kept fixed, as we would expect. For $i = 1$ it is also non-increasing in $k$, whereas for larger $i$ (for example, $n = 50$ and $i = 30$) it first increases with $k$ before decreasing.

3.2. PIs for current lower records

Lower order statistics, $Y_{i:n}$, $i \le (n + 1)/2$, are suitable for predicting current lower records. Thus, analogous to the results in the preceding subsection, upon using (9) and Lemma 2 we have

$$\alpha_5(k; i, j, n) = P(Y_{i:n} \le R_k^s \le Y_{j:n}) = P(Y_{n-j+1:n} \le R_k^l \le Y_{n-i+1:n}) = \varphi(k, n-i+1, n) - \varphi(k, n-j+1, n), \tag{16}$$

where $\varphi(\cdot, \cdot, \cdot)$ is given in (3). We also obtain the following expression for $\alpha_5(k; i, j, n)$ by conditioning first on $R_k^s$:

$$\alpha_5(k; i, j, n) = 2^k \sum_{t=i}^{j-1} \sum_{s=0}^{n-t} \binom{n}{t} \binom{n-t}{s} \frac{(-1)^s}{(t+s+1)(t+s+2)^k}.$$

Since $\alpha_5(k; i, j, n) = \alpha_4(k; n-j+1, n-i+1, n)$, we can use the values presented in Table 4 for the construction of prediction intervals for current lower records by making a simple adjustment.
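The symmetry between the lower-record and upper-record coefficients can be checked numerically. The sketch below is illustrative only (the function names are not from the paper): it evaluates the double sum for $\alpha_5$ and the closed form (14) for $\alpha_4$ in exact rational arithmetic and compares them under the index reflection $i \mapsto n - j + 1$, $j \mapsto n - i + 1$.

```python
from fractions import Fraction
from math import comb


def alpha4(k: int, i: int, j: int, n: int) -> Fraction:
    """Closed form (14) for current upper records."""
    total = Fraction(0)
    for t in range(i, j):
        for s in range(t + 1):
            total += Fraction((-1) ** s * comb(n, t) * comb(t, s),
                              (n - t + s + 1) * (n - t + s + 2) ** k)
    return 2 ** k * total


def alpha5(k: int, i: int, j: int, n: int) -> Fraction:
    """Double-sum expression for current lower records."""
    total = Fraction(0)
    for t in range(i, j):
        for s in range(n - t + 1):
            total += Fraction((-1) ** s * comb(n, t) * comb(n - t, s),
                              (t + s + 1) * (t + s + 2) ** k)
    return 2 ** k * total


def symmetry_holds(n: int, k_max: int) -> bool:
    """Check alpha5(k; i, j, n) == alpha4(k; n-j+1, n-i+1, n) for all 1 <= i < j <= n."""
    return all(
        alpha5(k, i, j, n) == alpha4(k, n - j + 1, n - i + 1, n)
        for k in range(k_max + 1)
        for i in range(1, n)
        for j in range(i + 1, n + 1)
    )


if __name__ == "__main__":
    print(symmetry_holds(n=8, k_max=3))
```

In practice this means that the coefficient of $(Y_{1:n}, Y_{j:n})$ for $R_k^s$ is read from the Table 4 entry for $(Y_{n-j+1:n}, Y_{n:n})$ and $R_k^l$.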


Table 4. The values of $\alpha_4(k; i, n, n)$ for some selected choices of $i$, $n$ and $k$.

| $n$ | $i$ | $k = 0$ | $k = 1$ | $k = 2$ | $k = 3$ | $k = 4$ | $k = 5$ | $k = 6$ |
|---|---|---|---|---|---|---|---|---|
| 50 | 1 | 0.9608 | 0.9608 | 0.9284 | 0.8766 | 0.8049 | 0.7160 | 0.6160 |
| | 5 | 0.8824 | 0.9502 | 0.9274 | 0.8766 | 0.8049 | 0.7160 | 0.6160 |
| | 10 | 0.7843 | 0.9201 | 0.9217 | 0.8758 | 0.8048 | 0.7160 | 0.6160 |
| | 20 | 0.5882 | 0.8032 | 0.8789 | 0.8639 | 0.8021 | 0.7154 | 0.6159 |
| | 30 | 0.3922 | 0.6109 | 0.7577 | 0.8068 | 0.7803 | 0.7083 | 0.6138 |
| 100 | 1 | 0.9802 | 0.9802 | 0.9624 | 0.9320 | 0.8854 | 0.8211 | 0.7407 |
| | 10 | 0.8911 | 0.9697 | 0.9616 | 0.9319 | 0.8854 | 0.8211 | 0.7407 |
| | 20 | 0.7921 | 0.9396 | 0.9563 | 0.9312 | 0.8853 | 0.8211 | 0.7407 |
| | 30 | 0.6931 | 0.8901 | 0.9420 | 0.9283 | 0.8848 | 0.8210 | 0.7407 |
| | 50 | 0.4950 | 0.7329 | 0.8655 | 0.9010 | 0.8770 | 0.8191 | 0.7403 |
| | 70 | 0.2970 | 0.4980 | 0.6801 | 0.7916 | 0.8252 | 0.7985 | 0.7331 |
| 200 | 1 | 0.9900 | 0.9900 | 0.9807 | 0.9638 | 0.9359 | 0.8938 | 0.8359 |
| | 5 | 0.9701 | 0.9894 | 0.9807 | 0.9638 | 0.9359 | 0.8938 | 0.8359 |
| | 20 | 0.8955 | 0.9798 | 0.9799 | 0.9637 | 0.9359 | 0.8938 | 0.8359 |
| | 50 | 0.7463 | 0.9273 | 0.9692 | 0.9621 | 0.9357 | 0.8938 | 0.8359 |
| | 100 | 0.4975 | 0.7413 | 0.8839 | 0.9333 | 0.9278 | 0.8920 | 0.8356 |
| | 150 | 0.2488 | 0.4322 | 0.6233 | 0.7675 | 0.8425 | 0.8548 | 0.8215 |

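Table 5. The current records extracted from Arnold et al. (1998, pp. 49–50).

| $m$ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $L_m$ | – | 19.0 | 18.4 | 17.4 | 17.2 | 15.6 | 15.3 | – | – | – | – | – | – | – |
| $U_m$ | – | 19.0 | 20.1 | 21.0 | 21.4 | 21.7 | 22.0 | 22.1 | 22.6 | 23.4 | – | – | – | – |
| $R_m^s$ | 19.0 | 19.0 | 18.4 | 17.4 | 17.4 | 17.4 | 17.2 | 15.6 | 15.6 | 15.6 | 15.6 | 15.3 | 15.3 | 15.3 |
| $R_m^l$ | 19.0 | 20.1 | 20.1 | 20.1 | 21.0 | 21.4 | 21.4 | 21.4 | 21.7 | 22.0 | 22.1 | 22.1 | 22.6 | 23.4 |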
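How the current records in Table 5 relate to the usual records can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: `current_records` is a hypothetical helper, and the input list contains only the record-setting values, in the order implied by the $R_m^s$ and $R_m^l$ rows of Table 5, rather than the full temperature series.

```python
from typing import List, Sequence, Tuple


def current_records(xs: Sequence[float]) -> Tuple[List[float], List[float], int, int]:
    """Track the running minimum and maximum each time a record of either kind occurs.

    Returns (Rs, Rl, n_lower, n_upper): Rs[m] and Rl[m] are the smallest and largest
    observations when the m-th current record occurs (m = 0 is the first observation),
    and n_lower / n_upper count the usual lower / upper records, X_1 included in both.
    """
    lo = hi = xs[0]
    rs, rl = [lo], [hi]
    n_lower = n_upper = 1
    for x in xs[1:]:
        if x < lo:
            lo = x
            n_lower += 1
        elif x > hi:
            hi = x
            n_upper += 1
        else:
            continue            # not a record of either kind
        rs.append(lo)
        rl.append(hi)
    return rs, rl, n_lower, n_upper


# Record-setting values only, ordered as implied by Table 5 (an assumption of this sketch).
table5_sequence = [19.0, 20.1, 18.4, 17.4, 21.0, 21.4, 17.2,
                   15.6, 21.7, 22.0, 22.1, 15.3, 22.6, 23.4]

rs, rl, n_lower, n_upper = current_records(table5_sequence)
m = len(rs) - 1                 # index of the last current record
print(m, n_lower, n_upper)      # 13 6 9, so m = n_lower + n_upper - 2
```

The counts agree with the relation $m = i + j - 2$: six usual lower records and nine usual upper records give $m = 13$ current records.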

Table 6. Prediction intervals for future order statistics based on current records in Table 5.

| $(n, k)$ | $m$ | $(R_m^s, R_m^l)$ | $\alpha_3(m; k, n)$ |
|---|---|---|---|
| (10, 1) | 10 | (15.6, 22.1) | 0.9254 |
| (10, 2) | 8 | (15.6, 21.7) | 0.9671 |
| (10, 5) | 4 | (17.4, 21.0) | 0.9563 |
| (10, 6) | 5 | (17.4, 21.4) | 0.9837 |
| (25, 1) | 10 | (15.6, 22.1) | 0.8452 |
| (25, 5) | 6 | (17.2, 21.4) | 0.9471 |
| (25, 7) | 5 | (17.4, 21.4) | 0.9530 |
| (25, 13) | 4 | (14.4, 21.0) | 0.9751 |
| (50, 5) | 8 | (15.6, 21.7) | 0.9517 |
| (50, 10) | 6 | (17.2, 21.4) | 0.9602 |
| (50, 20) | 4 | (17.4, 21.0) | 0.9675 |
| (50, 25) | 3 | (17.2, 20.1) | 0.9304 |

4. Numerical examples

Here, to illustrate the procedures proposed in the preceding sections, we present two examples.

Example 1. First, we use the data set in Arnold et al. (1998, pp. 49–50), which represents the record values of the average July temperatures (in degrees centigrade) of Neurenburg, Switzerland, during the period 1864–1993; the extracted usual and current records are given in Table 5. Based on the observed current record values in Table 5, prediction intervals for future order statistics were obtained, and these intervals are presented in Table 6.

In order to compare the present results with those of Ahmadi and Balakrishnan (2010), we have $P(L_4 \le Y_{6:10} \le U_4) = 0.9738$ and $P(L_4 \le Y_{10:20} \le U_3) = 0.9536$. From Table 5, $(L_4, U_4) = (17.2, 21.4)$ and $(L_4, U_3) = (17.2, 21.0)$, which are the prediction intervals with the shortest width for $Y_{6:10}$ and $Y_{10:20}$, respectively, based on usual records with prediction coefficients of at least 0.95. On the basis of current records, on the other hand, $P(R_4^s \le Y_{6:10} \le R_4^l) = 0.9563$ and $P(R_4^s \le Y_{10:20} \le R_4^l) = 0.9718$. From Table 5, $(R_4^s, R_4^l) = (17.4, 21.0)$, which is the prediction interval with the shortest width for $Y_{6:10}$ and also for $Y_{10:20}$ based on current records with prediction coefficients of at least 0.95. Its width is 3.6, compared with widths of 4.2 and 3.8 for the two intervals based on usual records. Thus, we can construct prediction intervals for future order statistics with shorter width if we use current records instead of usual records.

Example 2. Next, let us consider the amount of annual rainfall at the Los Angeles Civic Center (LACC) during 1900–2000. By arranging the observations in ascending order, we obtain the order statistics, and some of these order statistics are presented in Table 7.


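Table 7. Order statistics from the amount of annual rainfall at LACC during 1900–2000.

| $r$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Year | 1960 | 1958 | 1923 | 1971 | 1975 | 1947 | 1989 | 1986 | 1969 | 1963 |
| $Y_{r:100}$ | 4.85 | 5.58 | 6.67 | 7.17 | 7.21 | 7.22 | 7.35 | 7.66 | 7.74 | 7.93 |

| $r$ | 20 | 30 | 50 | 70 | 80 | 90 | 95 | 98 | 99 | 100 |
|---|---|---|---|---|---|---|---|---|---|---|
| Year | 1980 | 1941 | 1928 | 1926 | 1921 | 1937 | 1992 | 1982 | 1940 | 1977 |
| $Y_{r:100}$ | 8.96 | 11.10 | 12.66 | 18.03 | 19.66 | 23.43 | 27.36 | 31.28 | 32.76 | 33.44 |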

Table 8. Prediction intervals for $R_k^l$ based on order statistics in Table 7.

| $k$ | $i$ | $(Y_{i:n}, Y_{n:n})$ | $\alpha_4(k; i, n, n)$ |
|---|---|---|---|
| 0 | 1 | (4.85, 33.44) | 0.9802 |
| 1 | 10 | (7.93, 33.44) | 0.9697 |
| 2 | 20 | (8.96, 33.44) | 0.9563 |
| 3 | 20 | (8.96, 33.44) | 0.9312 |
| 4 | 30 | (11.10, 33.44) | 0.8848 |
| 5 | 70 | (18.03, 33.44) | 0.7985 |
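One way to choose an interval of the form $(Y_{i:n}, Y_{n:n})$ for a target coverage is to take the largest rank $i$ whose coefficient still meets the target. The Python sketch below illustrates this with the ranks listed in Table 7. It is a sketch under stated assumptions: `alpha4` implements the closed form (14), `narrowest_interval` is a hypothetical helper, and only the ranks printed in Table 7 are considered because the remaining order statistics are not listed.

```python
from fractions import Fraction
from math import comb
from typing import Dict, Optional, Tuple

# Order statistics Y_{r:100} listed in Table 7 (rank -> rainfall amount).
TABLE7: Dict[int, float] = {
    1: 4.85, 2: 5.58, 3: 6.67, 4: 7.17, 5: 7.21, 6: 7.22, 7: 7.35, 8: 7.66,
    9: 7.74, 10: 7.93, 20: 8.96, 30: 11.10, 50: 12.66, 70: 18.03, 80: 19.66,
    90: 23.43, 95: 27.36, 98: 31.28, 99: 32.76, 100: 33.44,
}


def alpha4(k: int, i: int, j: int, n: int) -> Fraction:
    """Closed form (14) for P(Y_{i:n} <= R^l_k <= Y_{j:n})."""
    total = Fraction(0)
    for t in range(i, j):
        for s in range(t + 1):
            total += Fraction((-1) ** s * comb(n, t) * comb(t, s),
                              (n - t + s + 1) * (n - t + s + 2) ** k)
    return 2 ** k * total


def narrowest_interval(k: int, alpha0: Fraction, n: int = 100
                       ) -> Optional[Tuple[int, Tuple[float, float]]]:
    """Largest listed rank i < n with alpha4(k; i, n, n) >= alpha0, with its interval."""
    for i in sorted((r for r in TABLE7 if r < n), reverse=True):
        if alpha4(k, i, n, n) >= alpha0:
            return i, (TABLE7[i], TABLE7[n])
    return None


if __name__ == "__main__":
    print(narrowest_interval(2, Fraction(19, 20)))   # (20, (8.96, 33.44))
```

Scanning the ranks from the top works because each term in the sum over $t$ is a probability, so the coefficient can only decrease as $i$ grows; the first rank that meets the target therefore gives the shortest admissible interval among those listed.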

Using the data in Table 7, prediction intervals for future current upper records, $R_k^l$, were obtained, and these are presented in Table 8. Ahmadi and Balakrishnan (2010) also used these data for constructing prediction intervals for future ordinary upper records. From Table 8, we observe that the interval (8.96, 33.44) would contain $R_2^l$ with probability 0.9563, while from Ahmadi and Balakrishnan (2010), we know that the same interval would contain $U_3$ with probability 0.8564.

5. Conclusions

Recently, Ahmadi and Balakrishnan (2010) considered two-sample prediction in which the observed usual records (order statistics) form the informative sample, and discussed how prediction intervals can be constructed for future order statistics (usual records). In this paper, we developed similar results for current records instead of the usual records. We have shown that the coverage probabilities of the prediction intervals for future order statistics can be increased with the use of current records in place of usual records. We recall the fact that in the process of obtaining the usual records, one usually observes the current records. This approach may be adopted for other ordered data such as k-records, progressively censored data and generalized order statistics.

The expected width of the prediction interval can be considered as an optimality criterion when comparing different intervals. In this regard, the upper bounds for the expected values of the spacings of order statistics, $W_{i,j:n} = Y_{j:n} - Y_{i:n}$, $1 \le i < j \le n$ (see, for example, Chapter 4 of David and Nagaraja (2003)), and those for $E(R_j^l - R_i^l)$, $j > i \ge 0$, established recently by Raqab (2007, 2009), would be useful.

Acknowledgements

The authors thank the referees for their constructive comments and useful suggestions on the original version of this manuscript, which led to this improved version. The research was supported by a grant from Ferdowsi University of Mashhad; No. MS88074AHM.
References

Ahmadi, J., & Balakrishnan, N. (2004). Confidence intervals for quantiles in terms of record range. Statistics & Probability Letters, 68, 395–405.
Ahmadi, J., & Balakrishnan, N. (2005a). Preservation of some reliability properties by certain record statistics. Statistics, 39, 347–354.
Ahmadi, J., & Balakrishnan, N. (2005b). Distribution-free confidence intervals for quantile intervals based on current records. Statistics & Probability Letters, 75, 190–202.
Ahmadi, J., Jafari Jozani, M., Marchand, É., & Parsian, A. (2009). Prediction of k-records from a general class of distributions under balanced loss functions. Metrika, 70, 19–33.
Ahmadi, J., & Balakrishnan, N. (2010). Prediction of order statistics and record values from two independent sequences. Statistics, 44, 417–430.
Ahmadi, J., & MirMostafaee, S. M. T. K. (2009). Prediction intervals for future records and order statistics coming from two parameter exponential distribution. Statistics & Probability Letters, 79, 977–983.
Ahmadi, J., MirMostafaee, S. M. T. K., & Balakrishnan, N. (2010). Bayesian prediction of order statistics based on k-record values from exponential distribution. Statistics, in press (doi:10.1080/02331881003599718). First published on: 20 April 2010 (iFirst).
Ahsanullah, M. (1980). Linear prediction of record values for the two parameter exponential distribution. Annals of the Institute of Statistical Mathematics, 32, 363–368.
Arnold, B. C., Balakrishnan, N., & Nagaraja, H. N. (1998). Records. New York: John Wiley & Sons.
Arnold, B. C., Balakrishnan, N., & Nagaraja, H. N. (1992). A first course in order statistics. New York: John Wiley & Sons.
Balakrishnan, N., & Rao, C. R. (Eds.). (1998a). Handbook of statistics 16: Order statistics: theory and methods. Amsterdam: North-Holland.
Balakrishnan, N., & Rao, C. R. (Eds.). (1998b). Handbook of statistics 17: Order statistics: applications. Amsterdam: North-Holland.
Basak, P. (2000). An application of record range and some characterization results. In N. Balakrishnan (Ed.), Advances on theoretical and methodological aspects of probability and statistics (pp. 83–95). Newark, NJ: Gordon and Breach.
David, H. A., & Nagaraja, H. N. (2003). Order statistics (third ed.). Hoboken, NJ: John Wiley & Sons.
Dunsmore, I. R. (1983). The future occurrence of records. Annals of the Institute of Statistical Mathematics, 35, 267–277.


Hsieh, H. K. (1997). Prediction intervals for Weibull order statistics. Statistica Sinica, 7, 1039–1051.
Kaminsky, K. S., & Nelson, P. I. (1998). Prediction of order statistics. In N. Balakrishnan, & C. R. Rao (Eds.), Handbook of statistics 17: Order statistics: applications (pp. 431–450). Amsterdam: North-Holland.
Raqab, M. Z. (2007). Inequalities for expected current record statistics. Communications in Statistics. Theory and Methods, 36, 1367–1380.
Raqab, M. Z. (2009). Distribution-free prediction intervals for the future current record statistics. Statistical Papers, 53, 429–439.
Raqab, M., & Balakrishnan, N. (2008). Prediction intervals for future records. Statistics & Probability Letters, 78, 1955–1963.