Statistics & Probability Letters 35 (1997) 283-288
Bandit bounds from stochastic variability extrema

Stephen J. Herschkorn*

Faculty of Management and RUTCOR, Rutgers University, P.O. Box 5062, New Brunswick, NJ 08903-5062, USA
Received June 1996; received in revised form January 1997
Abstract

In the consideration of bandit problems with general rewards and discount sequences, we compare an arm to one whose reward distribution may be one of two degenerate distributions. For the general multi-armed case, the latter problem provides an upper bound on the optimal return. In the case of two arms with the second known and regular discounting, consideration of the two-point distribution provides a sufficient condition for stopping. We interpret these results in the context of the value of information. The results, and others in the literature, suggest that bandit thresholds (or indices) may be monotonic with respect to ordering of distributions in the convex sense. © 1997 Elsevier Science B.V.

AMS classification: 62C10; 60E15; 90C40

Keywords: Bandit problems; Stochastic variability ordering; Convex ordering; Value of information
0. Introduction

This note establishes inequalities for some very general bandit problems. These inequalities arise from the consideration of extremal distributions on the unknown parameters or distributions. Extremal here is meant in the sense of stochastic variability orderings. In particular, we will compare the value, optimal policy, and threshold (or index) for a general distribution to those for a two-point distribution.

For a concrete context, the reader may consider the special case of the two-armed Bernoulli bandit whose first arm has unknown parameter $\theta$ with distribution G and whose second arm has known parameter s. Discounting may be uniform or (possibly truncated) geometric. We compare this problem to another one: in this second problem, the unknown parameter has itself the Bernoulli distribution with parameter $p = E\theta$; all other features of the problem are the same as in the first. In this context, our results are as follows:
- $V(G) \le V(\mathrm{Ber}(p))$, where V denotes the optimal return of the problem.
* Tel.: +1 908 445 3126; fax: +1 908 445 5472; e-mail: [email protected].
- If it is optimal to choose the known arm (i.e., to stop) when the unknown parameter is Bernoulli, then it is optimal to choose the known arm when $\theta \sim G$. In terms of thresholds (indices in the infinite-horizon geometrically discounted case), $\Lambda(G) \le \Lambda(\mathrm{Ber}(p))$.

In each case, the weak inequality may be replaced by a strict one under obvious necessary conditions. We note later explicit formulae for the upper bounds. After the statement and proof of the general results, this note discusses an interpretation thereof in the context of the value of information. It then explores possible extensions (and an erroneous one) to more general inequalities with respect to stochastic variability orderings. This exploration includes a survey of some related results in the literature.
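To make the comparison concrete, both values can be computed by backward induction when G has finite support. The following sketch (an illustration only; the prior, known-arm parameters, horizon, and helper names are our choices, not part of the analysis above) compares $V(G)$ with $V(\mathrm{Ber}(p))$ under uniform discounting.

```python
# Illustration of V(G) <= V(Ber(p)) for a two-armed Bernoulli bandit under
# uniform discounting: the prior G on the unknown parameter theta is a finite
# mixture sum_i w_i * delta_{theta_i}; the second arm pays s per pull.

from functools import lru_cache

def value(prior, s, horizon):
    """Optimal expected total return over `horizon` pulls (uniform discounting)."""
    @lru_cache(maxsize=None)
    def V(state, pulls_left):
        if pulls_left == 0:
            return 0.0
        mean = sum(t * w for t, w in state)
        known = s + V(state, pulls_left - 1)              # known arm: state unchanged
        p_succ, p_fail = mean, 1.0 - mean
        succ = tuple((t, w * t / p_succ) for t, w in state) if p_succ > 0 else state
        fail = tuple((t, w * (1 - t) / p_fail) for t, w in state) if p_fail > 0 else state
        unknown = mean + p_succ * V(succ, pulls_left - 1) + p_fail * V(fail, pulls_left - 1)
        return max(known, unknown)
    return V(tuple(prior), horizon)

if __name__ == "__main__":
    G = ((0.2, 0.5), (0.9, 0.5))             # example prior: theta is 0.2 or 0.9
    p = sum(t * w for t, w in G)             # p = E[theta]
    Ber_p = ((0.0, 1.0 - p), (1.0, p))       # Bernoulli prior with the same mean
    for s in (0.3, 0.5, 0.7):
        vG, vB = value(G, s, 5), value(Ber_p, s, 5)
        print(f"s={s}: V(G)={vG:.4f}  V(Ber(p))={vB:.4f}  bound holds: {vG <= vB + 1e-12}")
```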
1. Results
The first result concerns a very general discrete-time (n + 1)-armed bandit problem. The discount sequence $A = (\alpha_1, \alpha_2, \ldots)$ (as defined by Berry and Fristedt, 1985) is completely general; we require only that it be deterministic, nonnegative, and summable. The returns from the first arm, called arm 0, come from an unknown probability distribution F on $\mathbb{R}$; the random distribution F has distribution Q (on $\mathscr{P}$, the space of probability distributions on $\mathbb{R}$), and we let M denote the random mean of F. As part of our hypothesis, we assume the existence of finite real numbers a and b such that $P_Q\{a \le M \le b\} = 1$. The unknown distributions for arms 1 to n are $G_1, \ldots, G_n$, and these have arbitrary joint distribution R. (Thus R is a distribution on $\mathscr{P}^n$.) F is independent of the collection $(G_1, \ldots, G_n)$ by hypothesis.

We compare this bandit problem to that where F has a different distribution, denoted by Q'. Q' puts weights $p \equiv (\mu - a)/(b - a)$ and $q \equiv (b - \mu)/(b - a)$ on the degenerate distributions $\delta_b$ and $\delta_a$, respectively, where $\mu = E_Q M$; that is, the unconditional expected return from a single pull of arm 0 is the same in both problems. Note that in the modified problem a single observation of arm 0 removes all uncertainty about this arm: either all returns will be a or all will be b. All other features of the second problem (viz., discount sequence, unknown distributions of remaining arms, and independence) are the same as in the first problem. Let V(Q, R) or V(Q, R, A) be the value (i.e., supremal expected total return) from the original problem.

Theorem 1. $V(Q, R) \le V(Q', R)$.
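Two elementary observations underlie the choice of Q'. First, it preserves the mean of M:
\[
  p\,b + q\,a \;=\; \frac{\mu - a}{b - a}\,b + \frac{b - \mu}{b - a}\,a \;=\; \mu.
\]
Second, the induced two-point law of the mean is maximal in the convex order among laws on [a, b] with mean $\mu$: for any convex f and any $M \in [a, b]$,
\[
  f(M) \;\le\; \frac{b - M}{b - a}\,f(a) + \frac{M - a}{b - a}\,f(b),
  \qquad\text{so}\qquad
  E_Q f(M) \;\le\; q\,f(a) + p\,f(b),
\]
the right-hand side being the corresponding expectation under Q'.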
Proof. Consider the following modified bandit problem (suggested by D.A. Berry in a personal communication): in this modified problem, after we observe arm 0 for the first time, we learn with certainty the value of M. Thus, after such an observation, the expected reward from arm 0 becomes known (deterministic). Let V' denote the optimal value function for this problem. It is clear that $V(Q,R) \le V'(Q,R)$, for one can always apply the same policy in both problems (i.e., one can ignore the additional information in the modified problem). Also, $V'(Q', R) = V(Q', R)$, since one does learn M exactly under Q' in the standard problem. We thus need to show that $V'(Q,R) \le V'(Q', R)$.

Fix an optimal policy for the modified problem when $F \sim Q$; let N be the first time one pulls arm 0 under this policy. Then
\[
  V'(Q,R) \;=\; E\Bigl[\,\sum_{i=1}^{N-1} \alpha_i Z_i + \alpha_N \mu\Bigr] + E_Q V\bigl(\delta_M, R^*, A^{(N)}\bigr),
\]
where $Z_i$ is the observation from the ith pull, $R^*$ is the random posterior distribution immediately before the first observation of arm 0, and $A^{(n)} = (\alpha_{n+1}, \alpha_{n+2}, \alpha_{n+3}, \ldots)$. The first summand above is the expected discounted return from the first N pulls, and the second is the discounted return from the remaining pulls: since F is independent of $(G_1, \ldots, G_n)$, the expected undiscounted return from the Nth pull is unconditionally $\mu$.
After the observation from F, M is known exactly, and $R^*$ is the current distribution of $(G_1, \ldots, G_n)$. Since $\alpha_n$ approaches 0, the formula holds when $N = \infty$ as well (where $A^{(\infty)} = 0$).

Now $E_Q V(\delta_M, R^*, A^{(N)}) = E\,E_Q[V(\delta_M, R^*, A^{(N)}) \mid N]$, and M is independent of N. Standard bandit arguments show that $E V(\delta_x, R, A)$ is convex in x. Thus, letting M' have the two-point law of the mean under Q' (i.e., $P\{M' = b\} = p$ and $P\{M' = a\} = q$),
\[
  E V(\delta_M, R, A) \;\le\; E V(\delta_{M'}, R, A),
\]
since M is stochastically smaller than M' in the convex sense. (By definition, $X \le_{\mathrm{cx}} Y$ if and only if $Ef(X) \le Ef(Y)$ for all convex functions f for which the expectations are defined. See, for example, Shaked and Shanthikumar, 1994, Ch. 2.) Thus, we have shown that by observing arm 0 at time N in the modified problem with $F \sim Q'$, we obtain a total expected return at least as great as the optimal return when $F \sim Q$. Hence, $V'(Q, R) \le V'(Q', R)$. $\square$

The remaining results apply to more restricted problems. Henceforth, there are but two arms, now called arm 1 and arm 2. Rewards from arm 2 have known expectation s; we shall let V(Q, A) or V(Q, s, A) denote the optimal return. We further assume that the discount sequence A is regular as defined by Berry and Fristedt (1985, p. 90): for each m,
\[
  \|A^{(m)}\|_1 \, \|A^{(m+2)}\|_1 \;\le\; \|A^{(m+1)}\|_1^2
\]
(where $\|\cdot\|_1$ denotes the $\ell^1$ norm). Berry and Fristedt (1985) show that regularity implies the following two properties (well-known for uniform and geometric discounting, which provide examples of regularity):
- The problem is a stopping problem, i.e., there exists an optimal policy in which an observation of arm 2 is followed only by further observations of arm 2.
- There exists a threshold $\Lambda(Q, A)$ such that it is optimal to pull arm 1 first if and only if $s \le \Lambda(Q, A)$.
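For example, both standard discount sequences satisfy this condition. For infinite-horizon geometric discounting with factor $\beta \in (0, 1)$ (so that $\alpha_i = \beta^{i-1}$) and for uniform discounting with horizon n (so that $\alpha_i = 1$ for $i \le n$ and $\alpha_i = 0$ otherwise, writing $x^+ = \max(x, 0)$),
\[
  \|A^{(m)}\|_1 = \frac{\beta^m}{1 - \beta}
  \;\Longrightarrow\;
  \|A^{(m)}\|_1\,\|A^{(m+2)}\|_1 = \frac{\beta^{2m+2}}{(1 - \beta)^2} = \|A^{(m+1)}\|_1^2,
\]
\[
  \|A^{(m)}\|_1 = (n - m)^+
  \;\Longrightarrow\;
  \|A^{(m)}\|_1\,\|A^{(m+2)}\|_1 = (n - m)^+ (n - m - 2)^+ \;\le\; \bigl((n - m - 1)^+\bigr)^2,
\]
respectively.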
A statement such as "it is optimal to pull arm i first" is our abbreviated expression of "there exists an optimal policy which observes arm i on the first trial". Note that when A denotes infinite-horizon geometric discounting, $\Lambda(Q, A)$ is the Gittins index.

Theorem 2. Assume regular discounting. If it is optimal to pull the known arm first in the problem with $F \sim Q'$, then it is optimal to pull the known arm first with $F \sim Q$. In terms of thresholds, $\Lambda(Q,A) \le \Lambda(Q',A)$.

Proof. We show the contrapositive: if it is not optimal to pull the known arm first under Q, then $s\,\|A\|_1 < V(Q,A) \le V(Q',A)$. Thus, it is not optimal to pull the known arm first under Q'. The ordering of thresholds follows immediately. $\square$

Corollary 1. If A is regular, then
\[
  V(Q,A) \;\le\; \max\bigl(pb\,\|A\|_1 + q(\alpha_1 a + s\,\|A^{(1)}\|_1),\; s\,\|A\|_1\bigr)
\]
and
\[
  \Lambda(Q,A) \;\le\; \frac{\alpha_1\mu + pb\,\|A^{(1)}\|_1}{\alpha_1 + p\,\|A^{(1)}\|_1}.
\]

Proof. The right-hand sides of the inequalities are the respective expressions when $F \sim Q'$. $\square$
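Under infinite-horizon geometric discounting with factor $\beta$ one has $\alpha_1 = 1$, $\|A\|_1 = 1/(1-\beta)$, and $\|A^{(1)}\|_1 = \beta/(1-\beta)$, so the two bounds are easy to evaluate numerically. The helper below is an illustrative computation of the two right-hand sides for given a, b, $\mu$, s, and $\beta$; the function and variable names, and the sample inputs, are ours.

```python
# Upper bounds of Corollary 1, specialized to infinite-horizon geometric
# discounting with factor beta: alpha_1 = 1, ||A||_1 = 1/(1-beta),
# ||A^(1)||_1 = beta/(1-beta).

def corollary1_bounds(a, b, mu, s, beta):
    assert a <= mu <= b and 0 < beta < 1
    p = (mu - a) / (b - a)          # weight of Q' on delta_b
    q = (b - mu) / (b - a)          # weight of Q' on delta_a
    norm_A = 1.0 / (1.0 - beta)     # ||A||_1
    norm_A1 = beta / (1.0 - beta)   # ||A^(1)||_1
    alpha1 = 1.0
    value_bound = max(p * b * norm_A + q * (alpha1 * a + s * norm_A1),
                      s * norm_A)
    index_bound = (alpha1 * mu + p * b * norm_A1) / (alpha1 + p * norm_A1)
    return value_bound, index_bound

if __name__ == "__main__":
    # Example: rewards in [0, 1], prior mean 0.6, known arm 0.5, beta = 0.9.
    v_bound, idx_bound = corollary1_bounds(a=0.0, b=1.0, mu=0.6, s=0.5, beta=0.9)
    print(f"V(Q,A)      <= {v_bound:.4f}")
    print(f"Lambda(Q,A) <= {idx_bound:.4f}")
```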
Corollary 2 (Strict inequalities). Assume A is regular, $P_Q\{a < M < b\} > 0$, and $A^{(1)} \ne 0$.
(i) If $a < s < \Lambda(Q',A)$, then $V(Q,A) < V(Q',A)$.
(ii) If $\alpha_1 > 0$, then $\Lambda(Q,A) < \Lambda(Q',A)$.
Proof. (i) By comparison to the modified problem described in the proof of Theorem 1,
\[
  V(Q, A) \;\le\; \max\bigl(\alpha_1\mu + E_Q \max(M, s)\,\|A^{(1)}\|_1,\; s\,\|A\|_1\bigr).
\]
Since $s < \Lambda(Q',A)$, $s\,\|A\|_1 < V(Q',A) = \alpha_1\mu + (pb + qs)\,\|A^{(1)}\|_1$. To verify the strict upper bound on $E_Q \max(M,s)$, note that $\max(\cdot\,,s)$ is convex and, on [a, b], lies below its chord through (a, s) and (b, b); because $P_Q\{a < M < b\} > 0$ and $a < s < b$,
\[
  P_Q\Bigl\{\max(M,s) \;<\; \frac{M-a}{b-a}\,b + \frac{b-M}{b-a}\,s\Bigr\} \;>\; 0
  \qquad\text{and}\qquad
  P_Q\Bigl\{\max(M,s) \;\le\; \frac{M-a}{b-a}\,b + \frac{b-M}{b-a}\,s\Bigr\} \;=\; 1,
\]
whence
\[
  E_Q \max(M,s) \;<\; E\Bigl[\frac{M-a}{b-a}\,b + \frac{b-M}{b-a}\,s\Bigr] \;=\; pb + qs.
\]
(ii) Considering the (standard) bandit problem where $F \sim Q'$ and $s = \Lambda(Q, A)$, we need to show
\[
  \Lambda(Q,A)\,\|A\|_1 \;<\; \alpha_1\mu + \bigl[pb + q\Lambda(Q,A)\bigr]\,\|A^{(1)}\|_1.
\]
We do so by considering the $(Q, \Lambda(Q,A), A)$-problem:
\[
\begin{aligned}
  V(Q,\Lambda(Q,A),A) \;=\; \Lambda(Q,A)\,\|A\|_1
  &= \alpha_1\mu + E\,V\bigl(Q^*,\Lambda(Q,A),A^{(1)}\bigr)
  && \text{($Q^*$ is the random posterior)}\\
  &\le \alpha_1\mu + E_Q \max\bigl(M,\Lambda(Q,A)\bigr)\,\|A^{(1)}\|_1
  && \text{(by comparison to the modified problem)}\\
  &< \alpha_1\mu + \bigl[pb + q\Lambda(Q,A)\bigr]\,\|A^{(1)}\|_1
\end{aligned}
\]
as in the proof of part (i). $\square$
2. Discussion
We interpret these results in terms of the decision-theoretic concept of the value of (sample) information. Consider the (standard) bandit problem where the second arm is known. We measure the expected value of an observation of the unknown arm in terms of the expected increment to the optimal value:
\[
  I(Q,A) \;\equiv\; E\,V(Q^*,A) - V(Q,A),
\]
where $Q^*$ is the random posterior distribution conditional on the first observation. This expression appears explicitly in the difference between the majorands in the optimality equation: as do Berry and Fristedt (1985), define the advantage function
\[
  \Delta(Q,A) \;\equiv\; \bigl[\alpha_1\mu + E\,V(Q^*,A^{(1)})\bigr] - \bigl[\alpha_1 s + V(Q,A^{(1)})\bigr]
  \;=\; \alpha_1(\mu - s) + I(Q, A^{(1)}).
\]
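For reference, the optimality equation in question is the usual dynamic-programming identity for the two-armed problem with one known arm (in the notation above):
\[
  V(Q, A) \;=\; \max\bigl\{\,\alpha_1\mu + E\,V(Q^*, A^{(1)}),\;\; \alpha_1 s + V(Q, A^{(1)})\,\bigr\},
\]
so that $\Delta(Q, A)$ is precisely the first majorand minus the second.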
It is optimal to select the unknown arm first if and only if $\Delta \ge 0$.
Note that
\[
  \Lambda(Q,A) \le \Lambda(Q',A) \;\text{(Theorem 2)}
  \;\Longrightarrow\; \Delta(Q,A) \le \Delta(Q',A)
  \;\Longrightarrow\; I(Q,A^{(1)}) \le I(Q',A^{(1)}).
\]
That is, Q' maximizes information in the class of all distributions Q such that $P_Q\{a \le M \le b\} = 1$ and $E_Q M = \mu$.

Berry (1972) makes a related conjecture for the two-armed Bernoulli bandit, in which $H(\cdot, u, f)$ denotes the posterior distribution of the unknown parameter given u successes and f failures from the first arm: if $u' + f' \ge u + f$ and the mean of $H' = H(\cdot, u', f')$ is no greater than that of $H = H(\cdot, u, f)$, then $\Delta(H', K) \le \Delta(H, K)$ (where $\Delta$ denotes the advantage of the first arm over the second). Berry refers to $u + f$ as a rough indicator of the information content of H; a higher sum indicates less information to be gained from an observation. (We would also say that the stochastic variability decreases as the sum increases.) Thus, the conjecture states that if we prefer K to H, then we prefer K to another distribution with smaller immediate reward and smaller information content. Berry notes the following implication of his conjecture: let G be a distribution on [0, 1] and let G' and G'' be, respectively, the Bernoulli and degenerate distributions with the same mean as G. Then $\Delta(G'', K) \le \Delta(G, K) \le \Delta(G', K)$ for any distribution K. Our results verify the case where K is degenerate.

Gittins and Wang (1992) discuss (titularly) "the learning component of dynamic allocation indices". That is, these authors consider the role of information under infinite-horizon geometric discounting. They show that thresholds (indices) decrease as n increases in the following cases: Bernoulli rewards where the unknown parameter has Beta$(n\mu, n(1 - \mu))$ distribution, normally distributed rewards where the unknown mean has Normal$(\mu, \sigma^2/n)$ distribution, and exponentially distributed rewards where the unknown parameter has Gamma$(n, n/\mu)$ distribution. Note that the first example implies our threshold bound for geometrically discounted Bernoulli bandits. As Gittins and Wang themselves note, their methods rely on index policies and hence do not apply to more general (in particular, uniform) discount sequences.

This observed effect of stochastic variability is likely to apply to more general sequential decision problems in addition to bandits. For example, Herschkorn (1993) establishes analogous inequalities for a less-studied problem (the processor reordering problem) that one may view as a variation on the bandit problem. We have noted that one common method to compare stochastic variability is by the ordering in the convex sense. Hence, all the results displayed in this article suggest collectively the following conjecture regarding bandits where distributions are parametrized by their means.
Conjecture. If $G \le_{\mathrm{cx}} H$ and A is regular, then $\Lambda(G,A) \le \Lambda(H,A)$.

The establishment or refutation of this conjecture is the subject of current research. We must note, however, that the corresponding implication for optimal value functions is false. For a counterexample, let
\[
  G = \tfrac{1}{10}\,\delta_{1/3} + \tfrac{9}{10}\,\delta_{11/12}
  \qquad\text{and}\qquad
  H = \tfrac{1}{10}\,\delta_{1/3} + \tfrac{9}{20}\,\delta_{5/6} + \tfrac{9}{20}\,\delta_{1}
\]
be distributions for an unknown Bernoulli parameter, and let A denote either uniform discounting with horizon 3 or infinite-horizon geometric discounting with discount factor $\tfrac{9}{10}$; let the known arm have parameter $\tfrac12$. Then $G \le_{\mathrm{cx}} H$ (H splits the mass of G at $\tfrac{11}{12}$ equally between $\tfrac56$ and 1, a mean-preserving spread), but $V(G, \tfrac12, A) > V(H, \tfrac12, A)$. Thus, at least for optimal value functions, ordering in the convex sense is not the appropriate generalization of stochastic variability in this application. The identification of an appropriate characterization of variability (for the ordering of both value and threshold) remains an open question.
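The uniform-horizon-3 case can be checked by direct backward induction over the finite-support priors; the sketch below (our illustration, reusing the same Bayesian-update recursion as in the Introduction) performs the comparison exactly with rational arithmetic.

```python
# Numerical check of the counterexample: V(G, 1/2, A) > V(H, 1/2, A)
# for uniform discounting with horizon 3, where the priors on the unknown
# Bernoulli parameter are the finite mixtures G and H above.  For these
# priors the posterior mean stays strictly in (0, 1), so no division by zero.

from functools import lru_cache
from fractions import Fraction as F

def value(prior, s, horizon):
    """Optimal expected total return over `horizon` undiscounted pulls."""
    @lru_cache(maxsize=None)
    def V(state, pulls_left):
        if pulls_left == 0:
            return F(0)
        mean = sum(t * w for t, w in state)
        known = s + V(state, pulls_left - 1)              # pull the known arm
        succ = tuple((t, w * t / mean) for t, w in state)
        fail = tuple((t, w * (1 - t) / (1 - mean)) for t, w in state)
        unknown = mean + mean * V(succ, pulls_left - 1) \
                       + (1 - mean) * V(fail, pulls_left - 1)
        return max(known, unknown)
    return V(tuple(prior), horizon)

G = ((F(1, 3), F(1, 10)), (F(11, 12), F(9, 10)))
H = ((F(1, 3), F(1, 10)), (F(5, 6), F(9, 20)), (F(1), F(9, 20)))

vG = value(G, F(1, 2), 3)
vH = value(H, F(1, 2), 3)
print("V(G, 1/2, A) =", vG, "=", float(vG))
print("V(H, 1/2, A) =", vH, "=", float(vH))
print("G <=_cx H but V(G) > V(H):", vG > vH)
```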
Acknowledgements

The author thanks Dr. Donald A. Berry for an early contribution to this analysis and an anonymous referee who provided helpful comments regarding clarification of the exposition.
References

Berry, D.A., 1972. A Bernoulli two-armed bandit. Ann. Math. Statist. 43, 871-897.
Berry, D.A., Fristedt, B., 1985. Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall, London.
Gittins, J., Wang, Y.-G., 1992. The learning component of dynamic allocation indices. Ann. Statist. 20, 1625-1636.
Herschkorn, S.J., 1993. A Bayesian approach to the processor reordering problem. Ph.D. Thesis, University of California, Berkeley.
Shaked, M., Shanthikumar, J.G., 1994. Stochastic Orders and their Applications. Academic Press, Boston.