Local information and the design of sequential hypothesis tests


Journal of Statistical Planning and Inference 130 (2005) 111–125

Robert W. Keener*
Department of Statistics, University of Michigan, Ann Arbor, MI 48103, USA

Received 7 July 2003; accepted 13 October 2003; available online 29 July 2004

Abstract

Various results on sequential hypothesis testing are reviewed. Optimal stopping rules are related to a local measure of statistical information. In some cases, local information can be approximated by L-numbers discovered by Lorden, and simple rules based on these approximations are asymptotically optimal to better order than the cost for a single observation.
© 2004 Elsevier B.V. All rights reserved.

Keywords: Optimal stopping; Sequential testing; Measures of information

1. Introduction

As my contribution to this celebration of Chernoff's eightieth birthday, I would like to explore connections between information and sequential testing, two of his interests. Most measures of information are linked to the asymptotic performance of tests or estimators. In large samples, Fisher information gives the limiting variance of maximum likelihood estimators. It is also the most relevant measure in large sample testing when power at alternatives contiguous to H0 is of central importance. But when large deviations play a central role, other measures may be of more interest. One example would be Neyman–Pearson testing in large samples, studied in Chernoff (1952). If α and β are Bayes error probabilities testing whether the data have density f0 or f1, then

    log(α + β) ∼ −nI_C


E-mail address: [email protected] (R.W. Keener). doi:10.1016/j.jspi.2003.10.011


as n → ∞, where

    I_C = −log inf_{t ∈ (0,1)} ∫ f0^t f1^{1−t} dμ,

commonly called Chernoff information in information theory. In a sequential context, approximations for expected sample sizes depend on Kullback–Leibler information numbers, so these play a central role in the asymptotic theory of sequential design, as in Chernoff (1959) or Keener (1984).

In a certain sense, optional stopping in statistics can be considered simply an exercise in measuring the value of information. Specifically, it is optimal to take another observation whenever the value of the information from the observation exceeds its cost. The break-even value (the highest amount you would be willing to spend for an additional observation), divided by the risk for stopping, can be viewed as a measure of the information from that observation. Unlike the measures mentioned above, this measure of information is local in character: if a researcher has collected a fair amount of data and is seriously considering terminating an experiment, the key issue is whether stopping is better than collecting a little more data. An appropriate measure of information in this context should depend on how risks or likelihood ratios evolve in the near future. In an asymptotic analysis with sampling costs tending to zero, considerations of this sort are generally necessary to achieve optimality to better order than the cost for a single observation.

The rest of the paper is organized as follows. The next section introduces notation and defines local information in a general setting. In Section 3 we review an interesting result from Lorden (1977b), which provides an exact solution for one-sided sequential probability ratio tests. Local information in this context is given by L-numbers related to the fluctuation theory of random walks. In Section 4, sequential tests with an indifference zone are considered. We sketch a result from Lorden (1977a) showing the role of his L-numbers as an approximation for local information. In Section 5 we derive a formula to compute Lorden's L-numbers. In the final section we explore potential uses of these ideas in testing without an indifference zone. This problem has a rich history, starting with a series of fundamental papers by Chernoff (1961, 1965a, b) and Breakwell and Chernoff (1961).

2. Local information

Let {Q_θ : θ ∈ Ω} be a parametric family of probability distributions, dominated by a measure μ with densities f_θ = dQ_θ/dμ. Under P_θ, potential observations X1, X2, ... will be i.i.d. from Q_θ. Local information will be defined in the context of testing H0: θ ∈ Ω0 versus H1: θ ∈ Ω1, where Ω0 and Ω1 partition the parameter space Ω. The formulation will be decision theoretic and Bayesian. The random parameter θ will have prior distribution ξ, each observation will cost c(θ), and there is a loss L(θ) for a wrong decision at the end of the experiment. The posterior risk if the experimenter stops at stage n and acts optimally is

    ρ_n = min{ E[L(θ)I(θ ∈ Ω0) | F_n], E[L(θ)I(θ ∈ Ω1) | F_n] },


where F_n = σ(X1, ..., Xn) and I(·) is an indicator function. The design goal is to choose a stopping time τ to minimize the integrated risk r(τ) = E[ρ_τ + c(θ)τ]. At stage n, a decision to stop or continue should be based in part on the risk if we stop, ρ_n, and the expected cost for the next observation, E[c(θ) | F_n]. Let

    γ_n = E[c(θ) | F_n] / ρ_n,

the ratio of these risks. If γ_n > 1 stopping must be optimal, and if γ_n is very close to zero we should take at least one more observation. If we consider a collection of design problems in which the sampling costs are scaled by a constant ε, there will be a critical value I_n such that stopping is optimal if the scaled ratio εγ_n exceeds I_n and continuation is optimal if it falls below I_n. This critical value I_n is our measure of the local information from an observation at stage n. To be more precise, let ε denote the scale factor for the sampling costs, and let

    r_0(ε) = r_0(ε, ξ) = inf_τ E[ρ_τ + εc(θ)τ],

the minimal risk as a function of ε. When τ = 0 is optimal, r_0(ε) = ρ_0. Local information for the first observation is defined as the smallest value of the scaled ratio εγ_0 where this happens,

    I(ξ) = inf{ εE_ξ c(θ)/ρ_0 : r_0(ε) = ρ_0 }.

The local information at stage n is I(ξ_n), where ξ_n denotes the posterior distribution of θ given F_n.

In most problems I_n is a complicated function of the model and the posterior distribution of θ. Brute force computation of I_n involves solving a collection of optimal stopping problems, a task harder than solving the single problem at hand. But there is one advantage to this approach. By continuity, when ε is exactly equal to the break-even scale ε̂ at which the infimum above is attained, τ = 0 will be optimal, and some stopping time greater than zero will also be optimal. This gives an equation

    ρ_0 = inf_{τ ≥ 1} E[ρ_τ + ε̂c(θ)τ]

for ε̂, and hence for I_n. In a few simple cases this equation can be solved explicitly.

Remark 2.1. A similar approach has deep significance in the study of bandit problems. Gittins and Jones (1974) use break-even values for a family of stopping problems to define dynamic allocation indices, used to characterize the optimal strategy in multi-armed bandit problems with discounting and independent arms.

3. One-sided sequential probability ratio tests

Sequential probability ratio tests are optimal for Bayesian simple versus simple testing with a constant cost per observation, but the end-points for the optimal continuation region cannot be specified explicitly. In a one-sided variant of this testing problem, however, the


optimal stopping time continues sampling until the likelihood ratio exceeds a single critical value, and an explicit expression for that value is available, given by Lorden (1977b). To describe this problem and Lorden's solution, assume that Ω = {θ0, θ1}, and let p denote the prior probability that θ = θ1 in our Bayesian model, p = ξ({θ1}). The losses and costs in this problem are chosen so that we will want to stop if θ = θ0 but continue indefinitely if θ = θ1. Specifically, if θ = θ0 observations are free but we lose 1 if τ < ∞, and if θ = θ1 there is no loss if we stop, but each observation costs ε. This gives

    r(τ) = r(τ, p) = E[ I{θ = θ0, τ < ∞} + ετI{θ = θ1} ] = (1 − p)P_{θ0}(τ < ∞) + pεE_{θ1}τ

as the risk for a stopping time τ. With L(θ0) = 1, L(θ1) = ∞, and c(θ) = I{θ = θ1}, this risk has the form introduced in the previous section. Also, note that if Q_{θ0} ≪ Q_{θ1} we can change measure using Wald's fundamental identity and write

    r(τ) = (1 − p)E_{θ1}[ ∏_{k=1}^{τ} f_{θ0}(X_k)/f_{θ1}(X_k) ] + pεE_{θ1}τ.    (3.1)

Let

    r̃_0(p) = inf_{τ ≥ 1} r(τ, p),

the minimal risk over all stopping times that take an initial observation, and

    r_0(p) = inf_τ r(τ, p),

the minimal risk over all stopping times. Since the risk for τ = 0 is 1 − p,

    r_0(p) = min{1 − p, r̃_0(p)}.

For a fixed stopping time τ, r(τ, p) is a linear function of p, and so r_0(p) and r̃_0(p), as infima of linear functions, are both concave functions of p. By this concavity and continuity, there will be a critical value p̂ with r_0(p) = 1 − p for p ≥ p̂ and r_0(p) = r̃_0(p) for p ≤ p̂. When p > p̂ we should stop immediately, when p < p̂ we should take at least one observation, and when p = p̂ either option, stopping or taking an initial observation, can be optimal. From the Markov character of this problem, one optimal stopping time is given by

    τ_opt = inf{n ≥ 0 : p_n > p̂},

where p_n = P(θ = θ1 | X1, ..., Xn). From this, τ_opt stops when the likelihood ratio exceeds a boundary. If we define

    Z_i = Z_i(θ1, θ0) = log[ f_{θ1}(X_i)/f_{θ0}(X_i) ],  i ≥ 1,

the log likelihood ratio for observation i, then the log likelihood ratio at stage n is

    S_n = S_n(θ1, θ0) = Z_1 + · · · + Z_n.


By Bayes theorem,

    p_n = p e^{S_n} / (1 − p + p e^{S_n}),

an increasing function of S_n, and so

    τ_opt = inf{n ≥ 0 : S_n > a},

with

    a = log[ p̂/(1 − p̂) ] − log[ p/(1 − p) ].
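The Bayes update behind p_n is easy to check numerically. The sketch below uses hypothetical N(θ, 1) densities and illustrative numbers (all my own choices, not an example from the paper) to compute the posterior probability of θ1 both through the random walk S_n and directly from Bayes theorem; the two computations agree.

```python
import math

# Posterior probability that theta = theta1 computed two ways for
# N(theta, 1) observations: via the log likelihood ratio walk S_n,
# and directly from products of densities.
def phi(x, theta):
    return math.exp(-(x - theta) ** 2 / 2) / math.sqrt(2 * math.pi)

def posterior_via_Sn(xs, p, theta0, theta1):
    S = sum(math.log(phi(x, theta1) / phi(x, theta0)) for x in xs)
    return p * math.exp(S) / (1 - p + p * math.exp(S))

def posterior_direct(xs, p, theta0, theta1):
    l0 = math.prod(phi(x, theta0) for x in xs)
    l1 = math.prod(phi(x, theta1) for x in xs)
    return p * l1 / ((1 - p) * l0 + p * l1)
```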

To solve for p̂, Lorden considers what happens if p = p̂. Then a = 0, and so the optimal time τ_opt is the first increasing (strict) ladder time T+ for the random walk S_n,

    T+ = inf{n ≥ 1 : S_n > 0}.

In this case, since p = p̂, τ = 0 is also optimal and must have the same risk as τ_opt = T+. Equating risks for these rules,

    (1 − p̂)P_{θ0}(T+ < ∞) + p̂εE_{θ1}T+ = 1 − p̂,

a linear equation for p̂. Solving,

    p̂ = P_{θ0}(T+ = ∞) / [ P_{θ0}(T+ = ∞) + εE_{θ1}T+ ].

By a duality result from the fluctuation theory of random walks,

    P_{θ0}(T+ = ∞) = 1/E_{θ0}T−,

where T− is the first (weak) descending ladder time

    T− = inf{n ≥ 1 : S_n ≤ 0}.

So

    p̂ = 1/(1 + ε/L),

where

    L = L(θ0, θ1) = 1/[ (E_{θ0}T−)(E_{θ1}T+) ].

Another derivation of this solution is possible by viewing (3.1) as a generalized parking problem, as in Woodroofe et al. (1994).


To relate this solution to the notion of information from Section 2, note that p ≤ p̂ if and only if

    εp/(1 − p) ≤ L.

Since εp is the expected cost for the first observation and 1 − p is the risk for stopping, the ratio on the left hand side here is the scaled ratio εγ_0. So in this case the local information I_n at any stage n equals L. The importance of these numbers L(θ0, θ1) in sequential testing is recognized in Lorden (1977a, b). They also arise in the exact solutions for simple bandit problems; see Keener (1985, 1986).

L-numbers are generally more difficult to compute than other measures of information, but exact results are possible in a few cases. For instance, if f_θ is the uniform density on the interval (θ, θ + 1), then T+(θ0, θ1) has a geometric distribution under P_{θ1}, and T−(θ0, θ1) = 1 almost surely under P_{θ0}. So

    L(θ0, θ1) = min{1, |θ0 − θ1|}.

Formulas for numerical calculation will be presented in Section 5.
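Before then, the defining ladder-time representation can already be checked by simulation. The sketch below, for hypothetical N(θ, 1) observations (my choice of family; the replication counts are illustrative), estimates E_{θ0}T− and E_{θ1}T+ by Monte Carlo and combines them into an estimate of L(θ0, θ1).

```python
import numpy as np

# Monte Carlo sketch of L(theta0, theta1) = 1 / (E_{theta0} T- * E_{theta1} T+)
# for N(theta, 1) observations.
def ladder_time(rng, sample_theta, theta0, theta1, strict_up, max_n=10**6):
    # Z_i = log f_{theta1}(X_i) - log f_{theta0}(X_i)
    #     = (theta1 - theta0) * X_i - (theta1**2 - theta0**2) / 2
    s = 0.0
    for n in range(1, max_n + 1):
        x = rng.normal(sample_theta, 1.0)
        s += (theta1 - theta0) * x - (theta1 ** 2 - theta0 ** 2) / 2
        if strict_up and s > 0:        # T+: first strict ascending ladder epoch
            return n
        if not strict_up and s <= 0:   # T-: first weak descending ladder epoch
            return n
    raise RuntimeError("ladder epoch not reached")

def L_estimate(theta0, theta1, reps=20000, seed=0):
    rng = np.random.default_rng(seed)
    # T+ is sampled under theta1 (positive drift), T- under theta0 (negative drift),
    # so both ladder epochs are almost surely finite.
    ETplus = np.mean([ladder_time(rng, theta1, theta0, theta1, True) for _ in range(reps)])
    ETminus = np.mean([ladder_time(rng, theta0, theta0, theta1, False) for _ in range(reps)])
    return 1.0 / (ETplus * ETminus)
```

Since each expected ladder time strictly exceeds one, any such estimate falls in (0, 1).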

4. Sequential testing with an indifference zone

Classically, there have been two major approaches in sequential testing. The first approach is based on a formulation by Schwarz (1962). Assume that the densities have canonical exponential family form,

    f_θ(x) = h(x)e^{θx − A(θ)},  θ ∈ Θ ⊂ R,

and consider testing H0: θ < 0 versus H1: θ > 0. The sampling cost c(θ) will be a constant c independent of θ, and the loss function L is zero if |θ| < δ, and is bounded, continuous, and positive otherwise. Here δ is a prespecified constant, and the indifference zone |θ| < δ, where either action is acceptable, is introduced with the practical consideration in mind that a correct decision is usually not crucial when θ is near zero. Finally, assume for now that the prior ξ and Lebesgue measure on Θ are mutually absolutely continuous. Under these conditions, the optimal stopping rule has the form

    τ_opt = inf{n : (W_n, n) ∈ B_c}

with an appropriate set B_c, where W_n = X1 + · · · + X_n. Schwarz (1962) shows that B_c/|log c| → B_0 as c → 0. To describe the limiting shape B_0, let θ̂(·) denote the inverse function of A′, the derivative of A, so that the maximum likelihood estimator of θ is θ̂(X̄_n), with X̄_n = W_n/n. Then

    B_0 = {(w, t) : (δw − tA(δ)) ∧ (−δw − tA(−δ)) ≤ θ̂(w/t)w − tA(θ̂(w/t)) − 1}.
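For the normal family A(θ) = θ²/2 (so θ̂(x) = x) the membership condition for B_0 can be evaluated directly, and a little algebra reduces it to |w| ≥ √(2t) − tδ. The sketch below (my specialization, not an example from the paper) checks the two forms against each other.

```python
import math

# Schwarz's limiting stop region B0 for the normal family A(theta) = theta**2/2.
def in_B0(w, t, delta):
    glr = (w / t) * w - t * (w / t) ** 2 / 2          # thetahat(w/t)*w - t*A(thetahat(w/t)) = w^2/(2t)
    lo = min(delta * w - t * delta ** 2 / 2,
             -delta * w - t * delta ** 2 / 2)         # smaller log likelihood at +-delta
    return lo <= glr - 1

def in_B0_closed_form(w, t, delta):
    # Completing the square: the condition is (|w| + t*delta)**2 >= 2t.
    return abs(w) >= math.sqrt(2 * t) - t * delta
```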


The asymptotic shape B_0 suggests an approximate stopping time

    τ* = τ*_c = inf{n : (W_n, n) ∈ B_0|log c|}.

Several authors (Wong, 1968; Woodroofe, 1980) have studied the performance of τ* as c → 0, and in this setting τ* is efficient: the integrated risk using τ* is asymptotic to the Bayes risk as c → 0. This result may fail, however, if the dimension of H1 exceeds the dimension of H0 by two or more; see Woodroofe (1980). The stopping time studied by Kiefer and Sacks (1963) and Lorden (1967), which stops when the posterior risk is less than c (or an arbitrary multiple of c), is also asymptotically efficient as c → 0.

Results more pertinent to our notion of information appear in Lorden (1977a). This paper gives stopping rules optimal to o(c) as c ↓ 0. Since the Bayes risk in this limit is of order c|log c|, this is a definite improvement over efficiency. This level of optimality can only be achieved with rules that take explicit account of the excess over the boundary. To describe a few of the main results in Lorden (1977a), define Kullback–Leibler information numbers

    I(θ0, θ1) = E_{θ0} log[ f_{θ0}(X)/f_{θ1}(X) ] = (θ0 − θ1)A′(θ0) − A(θ0) + A(θ1)

and take

    θ* = θ̂( [A(δ) − A(−δ)]/(2δ) ) ∈ (−δ, δ),

so that I(θ*, δ) = I(θ*, −δ). For regularity, Lorden makes the following assumptions:

A1: The prior ξ has support [a, b] ⊂ Θ°, the interior of Θ, with a < −δ and b > δ, and is absolutely continuous with density λ continuous and positive on [a, b].

A2: The loss function L is bounded, nonnegative, zero on (−δ, δ), continuous from the right at δ, continuous from the left at −δ, and bounded away from zero on [a, −δ] and [δ, b].

Since the prior has support [a, b] it is natural to truncate estimators to this interval. Accordingly, let us modify the definition of θ̂(·) so that θ̂(x) = b if x ≥ A′(b), θ̂(x) = a if x ≤ A′(a), and θ̂(x) solves x = A′(θ̂(x)) otherwise. Then θ̂_n = θ̂(X̄_n) maximizes the likelihood over [a, b]. By Laplace's method, if θ̂ = θ̂(x̄) ∈ (a, b) and g is continuous and bounded with g(θ̂) ≠ 0, then

    ∫_a^b e^{nx̄θ − nA(θ)} g(θ) dθ ∼ g(θ̂) e^{nx̄θ̂ − nA(θ̂)} √[ 2π/(nA″(θ̂)) ]    (4.1)

as n → ∞. Similarly, if g is bounded on [δ, b] and continuous at δ with g(δ) ≠ 0, and if θ̂ = θ̂(x̄) < δ, then

    ∫_δ^b e^{nx̄θ − nA(θ)} g(θ) dθ ∼ g(δ) e^{nx̄δ − nA(δ)} / ( n[A′(δ) − x̄] ).    (4.2)
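A quick numerical check of (4.1) is possible in the normal family A(θ) = θ²/2, with the illustrative choice g(θ) = 2 + sin(θ) (both choices mine, not from the paper): the ratio of the integral to its Laplace approximation should be close to one.

```python
import numpy as np

# Numerical check of Laplace approximation (4.1) for the normal family
# A(theta) = theta**2/2, so A'(theta) = theta, A''(theta) = 1, thetahat(x) = x.
def laplace_ratio(xbar=0.3, n=200, a=-2.0, b=2.0, npts=400001):
    theta = np.linspace(a, b, npts)
    g = 2 + np.sin(theta)
    lhs = np.trapz(g * np.exp(n * (xbar * theta - theta ** 2 / 2)), theta)
    thetahat = xbar
    rhs = ((2 + np.sin(thetahat))
           * np.exp(n * (xbar * thetahat - thetahat ** 2 / 2))
           * np.sqrt(2 * np.pi / n))
    return lhs / rhs
```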


(The derivation is similar to Laplace's method, but now most of the contribution to the integral comes from values of θ near δ, instead of from values near θ̂.) Combining these approximations, if θ̂ < δ,

    E[L(θ)I{θ > 0} | X̄_n = x̄] = ∫_δ^b e^{nx̄θ − nA(θ)} L(θ)λ(θ) dθ / ∫_a^b e^{nx̄θ − nA(θ)} λ(θ) dθ
                               ∼ [λ(δ)L(δ)/λ(θ̂)] √[A″(θ̂)] e^{−nI(θ̂,δ)} / ( √(2πn)[A′(δ) − A′(θ̂)] ).

Using this and an analogous approximation for E[L(θ)I{θ < 0} | X̄_n = x̄], we can approximate^1 the posterior stopping risk ρ_n by ρ̂_n(X̄_n) with

    ρ̂_n(x̄) = [λ(δ)L(δ)/λ(θ̂)] √[A″(θ̂)] e^{−nI(θ̂,δ)} / ( √(2πn)|A′(δ) − A′(θ̂)| ),       θ̂ ≤ θ*,
    ρ̂_n(x̄) = [λ(−δ)L(−δ)/λ(θ̂)] √[A″(θ̂)] e^{−nI(θ̂,−δ)} / ( √(2πn)|A′(−δ) − A′(θ̂)| ),   θ̂ > θ*.

Approximations (4.1) and (4.2) can also be used to study the conditional evolution of ρ_n given F_n. If θ̂ < δ,

    E[L(θ)I{θ > 0} | X̄_n = x̄, X_{n+1} = x_{n+1}, ..., X_{n+k} = x_{n+k}]
        = ∫_δ^b e^{nx̄θ − nA(θ)} L(θ) ∏_{i=1}^k f_θ(x_{n+i}) λ(θ) dθ / ∫_a^b e^{nx̄θ − nA(θ)} ∏_{i=1}^k f_θ(x_{n+i}) λ(θ) dθ
        ∼ [λ(δ)L(δ)/λ(θ̂)] √[A″(θ̂)] e^{−nI(θ̂,δ)} / ( √(2πn)[A′(δ) − A′(θ̂)] ) × ∏_{i=1}^k [ f_δ(x_{n+i})/f_{θ̂}(x_{n+i}) ].

Aside from the likelihood ratio factor ∏_{i=1}^k [f_δ(x_{n+i})/f_{θ̂}(x_{n+i})], this is the same as the approximation for E[L(θ)I{θ > 0} | X̄_n = x̄]. So if our key approximations hold with suitable uniformity, we should have

    ρ_{n+k} = ρ_n ∏_{i=1}^k [ f_δ(X_{n+i})/f_{θ̂_n}(X_{n+i}) ] (1 + o_p(1)),    on θ̂_n < θ*;
    ρ_{n+k} = ρ_n ∏_{i=1}^k [ f_{−δ}(X_{n+i})/f_{θ̂_n}(X_{n+i}) ] (1 + o_p(1)),  on θ̂_n > θ*.    (4.3)

To use these approximations, Lorden first shows that the optimal stopping rule will take an additional observation whenever the posterior risk ρ_n exceeds M*c for some constant M*, and so

    τ_opt ≥ σ = inf{n : ρ_n ≤ M*c}.

1 It might be slightly more natural to use the minimum of the two expressions as the approximation when θ̂ is very near θ*; this change would not impact subsequent developments.


When τ ≥ σ, τ − σ will be a stopping time with respect to the filtration {F_{σ+k} : k ≥ 0}. Conditioning on F_σ, our design goal will be to choose τ to minimize

    cEσ + cE[ E( (τ − σ) + ρ_τ/c | F_σ ) ].

The first term here is independent of τ. The second term is O(c) when τ − σ = 0, so we only need to consider rules with E(τ − σ | F_σ) bounded. Since this expectation is bounded and σ → ∞ as c ↓ 0, approximation using (4.3) is appropriate. Also, as c ↓ 0 the posterior distribution for θ given F_σ concentrates at θ̂_σ, and so, given F_σ, X_{σ+1}, X_{σ+2}, ... will be approximately i.i.d. from Q_{θ̂_σ}. Using these approximations, the difference τ̃ = τ − σ should approximately minimize

    E_{θ̂_σ} τ̃ + (ρ_σ/c) E_{θ̂_σ} ∏_{i=1}^{τ̃} [ f_δ(X_{σ+i})/f_{θ̂_σ}(X_{σ+i}) ],    on θ̂_σ < θ*,    (4.4)

and

    E_{θ̂_σ} τ̃ + (ρ_σ/c) E_{θ̂_σ} ∏_{i=1}^{τ̃} [ f_{−δ}(X_{σ+i})/f_{θ̂_σ}(X_{σ+i}) ],   on θ̂_σ > θ*.    (4.5)

Both of these expressions have the same form as (3.1), with local information given by L(θ̂_σ), where

    L(θ̂) = L(θ̂, δ),    θ̂ < θ*;
    L(θ̂) = L(θ̂, −δ),   θ̂ > θ*.

(Either number can be used if θ̂ = θ*.) This suggests that local information I_n in the original problem can be approximated by L(θ̂_n), with the corresponding stopping rule

    τ̂ = inf{n : ρ_n L(θ̂_n) ≤ c}.

Since θ̂_n is consistent, this rule minimizes (4.4) and (4.5) to o(1), at least if θ̂_σ is not too near θ*, and this level of performance leads to optimality to o(c) in the original stopping problem,

    r(τ̂) = r(τ_opt) + o(c),

as c ↓ 0. Optimality of τ̂ to o(c) follows from the main results of Lorden (1977a). He also considers similar stopping times that allow terminal decisions based on approximations for the risks of accepting or rejecting H0. For instance, the following procedure: stop and reject θ < 0 if θ̂_n ≥ θ* and

    e^{−nI(θ̂_n, −δ)} ≤ cλ(θ̂_n)√(2π|log c|) |A′(−δ) − A′(θ̂_n)| / ( λ(−δ)L(−δ)L(θ̂_n) √[ I(θ̂_n, −δ)A″(θ̂_n) ] );


stop and reject θ > 0 if θ̂_n ≤ θ* and

    e^{−nI(θ̂_n, δ)} ≤ cλ(θ̂_n)√(2π|log c|) |A′(δ) − A′(θ̂_n)| / ( λ(δ)L(δ)L(θ̂_n) √[ I(θ̂_n, δ)A″(θ̂_n) ] ),

is also optimal to o(c). When θ̂_n ∈ (a, b), the exponential quantities in these expressions are generalized likelihood ratio test statistics for θ = ±δ.

At a technical level, Lorden's proof turning the sketch above into precise mathematics is quite involved. In the approximations, uniformity needs to be considered carefully. This is delicate when θ̂ is near the dividing value θ* or near an end-point of the support, a or b. Also, the "starting point" for the argument, the bound τ_opt ≥ σ, though it seems natural from the prior research mentioned above, is difficult to prove.

Although the stopping time τ̂ is complicated, improvement beyond efficiency has considerable practical significance. L-numbers can be fairly small, even when the parameter values are fairly well separated. For instance, the binomial L-number for success probabilities 0.4 and 0.6 is about 1/15, so τ̂ will tend to stop much earlier than the Kiefer–Sacks rule, which continues until ρ_n ≤ c.

A more complicated normal one-sided testing problem is considered by Schwarz (1993) and Keener et al. (1995). The model is Bayesian with P(θ = 0) = p, and the distribution of θ given θ ≠ 0 is N(0, σ²). Given θ, data X1, X2, ... are conditionally i.i.d. from N(θ, 1). Sampling costs are cθ² per observation, with a unit loss if θ = 0 and τ < ∞. So the risk for a stopping time τ is

    r(τ) = P(θ = 0, τ < ∞) + cE[θ²τ].

Stopping rules optimal to o(c) are given in these papers. They are similar to

    τ̂ = inf{n : cE(θ² | F_n) ≥ P(θ = 0 | F_n)L(X̄_n, 0)},

obtained by approximating the local information I_n by L(X̄_n, 0). This problem was initially considered in continuous time by Lerche (1986b).

5. A quadrature formula for L

By Spitzer's identity (see Spitzer, 1966, or Theorem 3, Section XII.7 of Feller, 1971),

    L(θ0, θ1) = exp( −∑_{n=1}^∞ (1/n)[ P_{θ1}(S_n ≤ 0) + P_{θ0}(S_n > 0) ] ).

Although this formula is general, it is only convenient for calculation when distributions for S_n are available. In this section we give a quadrature formula for L(θ0, θ1) when Q_{θ0} and Q_{θ1} are mutually absolutely continuous. Let Z(·) denote the log-likelihood ratio function, Z = log(f_{θ1}/f_{θ0}), so that Z_i = Z(X_i); let h denote the Hellinger distance between Q_{θ0} and Q_{θ1},

    h² = ∫ (√f_{θ1} − √f_{θ0})² dμ = 2 − 2∫ √(f_{θ0} f_{θ1}) dμ;


and define a new density

    f̄ = √(f_{θ0} f_{θ1}) / (1 − h²/2).

When the densities {f_θ : θ ∈ Ω} form an exponential family, f̄ = f_θ̄ with θ̄ = (θ0 + θ1)/2. Next, introduce a probability measure P̄ under which X, X1, X2, ... are i.i.d. with common density f̄, and let Ē denote expectation under this measure. Restricted to F_n = σ(X1, ..., Xn), the likelihood ratios between the original measures and P̄ are

    dP_{θ1}/dP̄ |_{F_n} = exp{ S_n/2 − nΛ }  and  dP_{θ0}/dP̄ |_{F_n} = exp{ −S_n/2 − nΛ },

where

    Λ = −log(1 − h²/2).

So

    P_{θ1}(S_n ≤ 0) = Ē[ exp{S_n/2 − nΛ}; S_n ≤ 0 ]

and

    P_{θ0}(S_n > 0) = Ē[ exp{−S_n/2 − nΛ}; S_n > 0 ].

Here E[Y; A] denotes E[Y 1_A]. Using these,

    L(θ0, θ1) = exp( −∑_{n=1}^∞ (e^{−nΛ}/n) Ē e^{−|S_n|/2} ).

By Parseval's relation,

    Ē e^{−|S_n|/2} = (2/π) ∫ f̂^n(t)/(1 + 4t²) dt,

where f̂ is the characteristic function of Z_i under P̄,

    f̂(t) = Ē e^{itZ_i}.

Interchanging summation and integration,

    L(θ0, θ1) = exp( −∑_{n=1}^∞ (2e^{−nΛ}/(πn)) ∫ f̂^n(t)/(1 + 4t²) dt )
              = exp( (2/π) ∫ log(1 − e^{−Λ}f̂(t)) / (1 + 4t²) dt ).

The formula for L can be simplified when the densities form a canonical exponential family, f_θ(x) = h(x)e^{θT(x) − A(θ)}. Then

    Z = (θ1 − θ0)T − A(θ1) + A(θ0),    Λ = (A(θ1) + A(θ0))/2 − A(θ̄)


and

    f̂(t) = ∫ exp{ it(θ1 − θ0)T − itA(θ1) + itA(θ0) + θ̄T − A(θ̄) } h dμ
          = exp{ A(θ̄ + it(θ1 − θ0)) − itA(θ1) + itA(θ0) − A(θ̄) },

giving

    L(θ0, θ1) = exp( (2/π) ∫ log[ 1 − exp{ A(θ̄ + it(θ1 − θ0)) − itA(θ1) + itA(θ0) − (A(θ0) + A(θ1))/2 } ] / (1 + 4t²) dt ).
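For the normal family N(θ, 1), A(θ) = θ²/2 and the exponent in the last display reduces, after expanding the square, to −(θ1 − θ0)²(1/8 + t²/2), so the integrand is real. The sketch below (my specialization; the grid quadrature parameters are illustrative choices) evaluates the formula numerically.

```python
import numpy as np

# Quadrature formula for the L-number, specialized to N(theta, 1):
# the exponent simplifies to -(theta1 - theta0)**2 * (1/8 + t**2/2).
def L_normal(theta0, theta1, tmax=20.0, npts=200001):
    delta2 = (theta1 - theta0) ** 2
    t = np.linspace(-tmax, tmax, npts)
    integrand = np.log1p(-np.exp(-delta2 * (0.125 + t ** 2 / 2))) / (1 + 4 * t ** 2)
    return float(np.exp((2 / np.pi) * np.trapz(integrand, t)))
```

As expected, the value lies in (0, 1), increases with the separation of the hypotheses, and tends to one as the hypotheses separate.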

6. Sequential testing without indifference

The formulation for testing with an indifference zone can be criticized for various reasons. The limiting shape B_0 only depends on the model, the support of the prior distribution, and δ. Different values for δ lead to much different stopping rules τ*. Since the indifference zone was originally suggested for qualitative reasons, it is disconcerting that the exact specification of δ has such impact. It is also worrisome that the shape of the loss function and the prior ξ have very limited impact. In Lorden's refinements the situation is only slightly better: the shape of the loss function only matters for |θ| near δ, and although the entire prior density is taken into account, values of this density near |θ| = δ have the most impact.

Another approach to sequential testing, perhaps developed with some of these concerns in mind, was initiated in a series of papers by Chernoff (1961, 1965a, b) and Breakwell and Chernoff (1961). The original formulation involves testing whether the drift θ of a Wiener process {W_t}_{t ≥ 0} is positive or negative. If the prior distribution for the drift θ is N(μ0, σ0²), then the posterior distribution given F_t = σ(W_s, 0 ≤ s ≤ t) is N(μ_t, σ_t²), with

    μ_t = (W_t + μ0/σ0²) / (t + 1/σ0²)

and

    σ_t² = 1/(t + 1/σ0²).

In these papers sampling costs are constant per unit time, and the loss for an incorrect terminal action is a multiple of |θ|. After some normalization, constant factors in these losses can be taken to be one, so the risk to be minimized is

    E[τ] + E[|θ|; θμ_τ < 0] = E[ τ + σ_τ ψ(μ_τ/σ_τ) ],    (6.1)

where ψ(z) = φ(z) − |z|Φ(−|z|), with φ and Φ the standard normal density and distribution function. In this formulation there is no indifference zone, but a correct decision is still less important when |θ| is small. The stopping time minimizing (6.1) has the form

    τ_opt = inf{t : |W_t + μ0/σ0²| ≥ b(t + 1/σ0²)}.

The stopping boundary b(·) is characterized in Chernoff's work as the solution of a diffusion equation with a free boundary. A numerical approach to the solution is given in Chernoff and Petkau (1986).
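The function ψ is just the expected loss from deciding the sign of the drift according to the sign of the posterior mean: for X ~ N(μ, σ²) with μ > 0, E[|X|; X < 0] = σψ(μ/σ). The sketch below (illustrative numbers; my own check rather than a computation from the paper) verifies this identity by Monte Carlo.

```python
import math
import random

# psi(z) = phi(z) - |z| * Phi(-|z|), the normalized posterior stopping risk.
def psi(z):
    phi = math.exp(-z * z / 2) / math.sqrt(2 * math.pi)
    Phi_neg = 0.5 * math.erfc(abs(z) / math.sqrt(2))   # Phi(-|z|)
    return phi - abs(z) * Phi_neg

# Monte Carlo estimate of E[|X|; X < 0] for X ~ N(mu, sigma^2) with mu > 0,
# the loss from wrongly declaring a positive drift.
def mc_wrong_sign_loss(mu, sigma, reps=200000, seed=7):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        x = rng.gauss(mu, sigma)
        if x < 0:
            total += -x
    return total / reps
```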


Using invariance arguments, this solution provides asymptotic solutions to various testing problems in discrete time. The sampling costs will be a constant c per observation, and large samples are achieved letting c ↓ 0. Details in the normal case are given by Chernoff (1961), and an extension to exponential families is in Lai (1987). For regularity, Lai assumes that L is nonnegative, EL(θ) < ∞, L(θ) ∼ |θ| as θ → 0, and the prior density λ is positive and continuous in some neighborhood of zero. With these conditions,

    τ_eff = inf{ n : 2n²I(θ̂_n, 0) ≥ (λ²(0)A″(0)/c²)^{1/3} b²( n(c²/(λ²(0)A″(0)))^{1/3} ) }

is asymptotically efficient as c ↓ 0. Lai also has results when L(θ) ∼ |θ|^α as θ → 0 for values of α other than one. The zero-one case (α = 0) is considered in Lai (1988).

This formulation for testing has advantages and disadvantages relative to the formulation with an indifference zone. First, the loss function used seems much more natural. Behavior of the model and loss function near θ = 0 is crucial, but since θ = 0 separates the two hypotheses, this is natural and may even indicate some robustness in the formulation. Calculations to find the boundary function b are difficult and need to be done numerically, but b can be tabled once it is found, so this should not be a practical concern. As c ↓ 0, the values of θ that contribute substantially to the risk are of order c^{1/3}. Error probabilities and expected sample sizes for such θ can be approximated replacing the scaled discrete time process by a Wiener process, but explicit values can only be obtained solving associated partial differential equations numerically. Of more import, since these approximations take no account of the excess over the boundary, they may not be very accurate unless c is very small. Although there has been some progress on related problems by Siegmund (1979, 1985) and Hogan (1984a, b), improved approximations taking proper account of the excess have not been obtained.

In this setting, L-numbers might be used to approximate local information, but the approximation is inadequate in certain regions. One simple approximation is L(θ̂_n, 0). But since L(θ̂, 0) → 0 as θ̂ → 0, a modified approximation L(θ̂_n, θ̃_n), where θ̃_n maximizes

    λ(θ)L(θ) ∏_{i=1}^n f_θ(X_i)

over θ > 0 when θ̂_n < 0, and maximizes the same expression over θ < 0 when θ̂_n > 0, might be slightly better. Numerical work indicates that these expressions provide a reasonable approximation for I_n when θ̂_n is bounded away from zero. The stopping boundary for the corresponding rule,

    τ̂ = inf{n : ρ_n L(θ̂_n, θ̃_n) ≤ c},

is very close to the boundary for τ_opt unless θ̂_n is near zero, at least in the example considered below. Unfortunately, as c ↓ 0, values of θ near zero are increasingly important, and τ̂ is not efficient in this limit. Better approximations for I_n may need to take proper account of both the excess over the boundary and expectations associated with continuous time stopping problems.


Table 1
Risk comparison

  K        N      r(τ̂)     r(τ_eff)   r(τ_opt)   Increase
  500      7.9    10.34     10.19       9.49      9.0%
  1000     11.2   14.64     13.90      13.49      8.5%
  2500     17.7   22.03     20.87      20.63      6.8%
  10,000   35.4   39.38     37.47      37.31      5.5%
  50,000   79.1   74.12     71.13      70.68      4.9%

To study the performance of τ̂, consider testing H0: p < 1/2 versus H1: p > 1/2 based on data from a Bernoulli distribution with success probability p. Sampling costs are scaled so that there is a unit cost for each observation, and the loss for an incorrect decision is K|p − 1/2|. The prior distribution for p will be uniform on (0, 1). The optimal stopping time τ_opt can be found by backwards induction, and risks for τ_opt, τ_eff, or τ̂ can be found by similar recursions. Table 1 gives results of these calculations for various choices of K. Values in the second column, labeled N, are √(K/8), approximately the sample size for the best fixed sample procedure. The risk of this fixed sample procedure is roughly 2N. The last column in Table 1 is the increased risk for τ̂ as a percentage, 100[r(τ̂) − r(τ_opt)]/r(τ_opt). Even though τ̂ is not efficient as K → ∞, its risk is quite reasonable, even when K = 50,000. The main difference between τ̂ and τ_opt is that τ̂ stops much earlier when θ̂ is near zero. When θ̂_n and θ̃_n are close, L(θ̂_n, θ̃_n) will be near zero. This underestimates the true local information I_n for the next observation and accounts for early stopping by τ̂.

The performance of τ_eff in this example is surprisingly good: the risk difference r(τ_eff) − r(τ_opt) is less than the cost for a single observation in all cases. Since τ_eff takes no account of the excess over the boundary, this level of performance may not persist in similar problems.

The Kiefer–Sacks stopping time, which continues until the stopping risk is less than the cost of the next observation,

    inf{n : ρ_n ≤ 1},

is similar to τ̂ but estimates I_n as one instead of an L-number, in a sense ignoring excess over the boundary. In this example, this rule does quite poorly. With K = 500, its risk is 34.2, more than 3 times the risk of τ̂ and over twice the risk of the best fixed sample procedure. Stopping rules like τ̂ should work better if the sampling costs tend to zero as θ → 0.
The normal case with zero-one loss (L(θ) = 1 for θ ≠ 0) and sampling costs cθ² per observation is considered by Schwarz (1993). In this work a stopping rule similar to τ̂ is optimal to o(c). The corresponding problem in continuous time is treated by Lerche (1986a).

References

Breakwell, J.V., Chernoff, H., 1961. Sequential tests for the mean of a normal distribution II (large t). Ann. Math. Statist. 35, 162–173.
Chernoff, H., 1952. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Statist. 23, 493–507.
Chernoff, H., 1959. Sequential design of experiments. Ann. Math. Statist. 30, 755–770.
Chernoff, H., 1961. Sequential tests for the mean of a normal distribution. In: Proceedings of the Fourth Berkeley Symposium. University of California Press, Berkeley, pp. 79–91.


Chernoff, H., 1965a. Sequential tests for the mean of a normal distribution III (small t). Ann. Math. Statist. 36, 28–54.
Chernoff, H., 1965b. Sequential tests for the mean of a normal distribution IV (discrete case). Ann. Math. Statist. 36, 55–86.
Chernoff, H., Petkau, J., 1986. Numerical solutions for Bayes sequential decision problems. SIAM J. Sci. Statist. Comput. 7, 46–59.
Feller, W., 1971. An Introduction to Probability Theory and its Applications, vol. 2. Wiley, New York.
Gittins, J.C., Jones, D.M., 1974. A dynamic allocation index for the sequential design of experiments. In: Gani, J. (Ed.), Progress in Statistics. North-Holland, Amsterdam, pp. 241–266.
Hogan, M., 1984a. Comment on corrected diffusion approximations in certain random walk problems. J. Appl. Probab. 23, 89–96.
Hogan, M., 1984b. Corrected diffusion approximations to first passage times. Technical Report 25, Department of Statistics, Stanford University.
Keener, R., 1984. Second order efficiency in the sequential design of experiments. Ann. Statist. 12, 510–532.
Keener, R., 1985. Further contributions to the "two-armed bandit" problem. Ann. Statist. 13, 418–422.
Keener, R., 1986. Multi-armed bandits with simple arms. Adv. Appl. Math. 7, 199–203 (in honor of Herbert Robbins).
Keener, R., Lerche, H., Woodroofe, M., 1995. A nonlinear parking problem. Seq. Anal. 14, 247–272.
Kiefer, J., Sacks, J., 1963. Asymptotically optimal sequential inference and design. Ann. Math. Statist. 34, 705–750.
Lai, T.L., 1987. On Bayes sequential tests. In: Gupta, S.S., Berger, J.O. (Eds.), Statistical Decision Theory and Related Topics IV, vol. 2. Springer, New York, pp. 131–143.
Lai, T.L., 1988. Nearly optimal sequential tests of composite hypotheses. Ann. Statist. 16, 856–886.
Lerche, H.R., 1986a. An optimal property of the repeated significance test. Proc. Nat. Acad. Sci. 83, 1546–1548.
Lerche, H.R., 1986b. The shape of Bayes tests of power one. Ann. Statist. 14, 1030–1048.
Lorden, G., 1967. Integrated risk of asymptotically Bayes sequential tests. Ann. Math. Statist. 38, 1399–1422.
Lorden, G., 1977a. Nearly optimal sequential tests for exponential families. Unpublished manuscript.
Lorden, G., 1977b. Nearly optimal sequential tests for finitely many parameter values. Ann. Statist. 5, 1–21.
Schwarz, G., 1962. Asymptotic shapes of Bayes sequential testing regions. Ann. Math. Statist. 33, 224–236.
Schwarz, G., 1993. Tests mit Macht eins und Bayes-Optimalität. Ph.D. Thesis, University of Freiburg.
Siegmund, D., 1979. Corrected diffusion approximations in certain random walk problems. Adv. Appl. Probab. 11, 701–719.
Siegmund, D., 1985. Corrected diffusion approximations and their applications. In: LeCam, L., Olshen, R. (Eds.), Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer. Wadsworth, Belmont.
Spitzer, F., 1966. Principles of Random Walk. Van Nostrand, New York.
Wong, S.P., 1968. Asymptotically optimal properties of certain sequential tests. Ann. Math. Statist. 39, 1244–1263.
Woodroofe, M., 1980. On the Bayes risk incurred by using asymptotic shapes. Comm. Statist. A9, 1727–1748.
Woodroofe, M., Lerche, H.R., Keener, R., 1994. A generalized parking problem. In: Gupta, S.S., Berger, J.O. (Eds.), Statistical Decision Theory and Related Topics. Springer, New York, pp. 523–532.