Journal of Statistical Planning and Inference 21 (1989) 191-208
North-Holland

OPTIMAL BAYESIAN DESIGN APPLIED TO LOGISTIC REGRESSION EXPERIMENTS

Kathryn CHALONER and Kinley LARNTZ

Department of Applied Statistics, University of Minnesota, St. Paul, MN 55108, U.S.A.

Received 17 August 1987; revised manuscript received 24 January 1988
Recommended by A. Cohen

Abstract: A traditional way to design a binary response experiment is to design the experiment to be most efficient for a best guess of the parameter values. A design which is optimal for a best guess, however, may not be efficient for parameter values close to that best guess. We propose designs which formally account for the prior uncertainty in the parameter values. A design for a situation where the best guess has substantial uncertainty attached to it is very different from a design found for a situation where approximate values of the parameters are known. We derive a general theory for concave design criteria for non-linear models and then apply the theory to logistic regression. Designs found by numerical optimization are examined for a range of prior distributions and a range of criteria. The theoretical results are used to verify that the designs are indeed optimal.

AMS Subject Classification: Primary 62K05; Secondary 62F15.

Key words: A-optimality; Bayesian design; binary data; D-optimality; equivalence theorem; Nelder-Mead simplex algorithm; non-linear models; optimal design; reliability experiments; stress testing.
1. Introduction

Designing an experiment for a binary response is important in the biological sciences for bioassay experiments and in the engineering sciences for reliability testing. We will focus specifically on the logistic regression experiment, although the theory and methods apply to any generalized linear model. A logistic regression experiment is just one example of a non-linear design problem. The non-linear problem is intrinsically much harder than the linear problem as the efficiency of a design depends on the unknown parameters. A common approach is to design an experiment to be efficient for a best guess of the parameter values. This approach leads to what are termed 'locally optimal' designs, the term being introduced by Chernoff (1953). A natural generalization is to use a prior distribution on the unknown parameters rather than a single guess. We use this approach here and justify it using Bayesian arguments.
0378-3758/89/$3.50 © 1989, Elsevier Science Publishers B.V. (North-Holland)
192  K. Chaloner, K. Larntz / Optimal Bayesian design
Non-Bayesian design for logistic regression is discussed at length in Finney (1978), Abdelbasit and Plackett (1983) and Meeker and Hahn (1977). Various Bayesian approaches are given in Tsutakawa (1972, 1980), Zacks (1977), Freeman (1970), Leonard (1982) and Owen (1975). There is a large literature on sequential design for binary response models, see for example Wu (1985), but we will not consider the sequential approach in this paper. In many experimental situations sequential design is impractical or too expensive (see Anbar, 1984). We first present a unifying theory of optimal Bayesian design for non-linear problems. This theory parallels the theory for linear problems as presented in Silvey (1980) and Fedorov (1972, 1981). The theory is valid, under regularity conditions, for general concave optimality criteria. The concavity arises from the formulation of the criteria. The general theory developed in Section 2 discusses an equivalence theorem of Whittle (1973) and Dubov (1977) and shows how the theorem applies to Bayesian design problems. The theorem also provides a method of verifying whether or not a particular design is optimal, or close to optimal. In Section 3 we discuss the logistic regression model and give expressions for the directional derivatives needed to use the theorem. Two concave criteria will be used. The first criterion corresponds to maximizing the average over the prior distribution of the log of the determinant of the information matrix. This maximizes, approximately, the expected increase in Shannon information provided by the experiment as suggested by Lindley (1956) and given a decision theoretic basis by Bernardo (1979). It was used by Lauter (1974, 1976) for mixtures of linear models. This criterion is equivalent to D-optimality in linear design. The second criterion corresponds to minimizing the approximate expected posterior variance of the quantities of interest.
This criterion requires the experimenter to specify what is to be estimated or what is to be predicted and the relative importance of these predictions. A weighted trace of the inverse of the information matrix is then averaged over the prior distribution and minimized. In linear problems this criterion is called A-optimality. Both these criteria correspond to having a fixed number of observations to allocate. In Section 4 an experiment to investigate these designs for a range of prior distributions is described and the results discussed. One result is that as the uncertainty in the prior distribution increases so does the number of support points required. In Section 5 the designs from Section 4 are compared and contrasted to the corresponding locally optimal designs. We also examine whether designs that maximize the average log determinant of the information matrix are generally efficient in terms of minimizing average variances. In Section 6 we examine a few designs in more detail. Previous work in this area (Tsutakawa, 1972, 1980, and Abdelbasit and Plackett, 1983) is restricted to designs taking observations at equally spaced points with equal numbers of observations at each point. No such restrictions are made here. Extensions of the results of Abdelbasit and Plackett (1983) are given in Minkin (1987) and Khan and Yazdi (1988).
2. The general theory of optimal Bayesian design
2.1. The criteria

To introduce the notation, suppose that an experiment is to be designed by choosing n values of the design variable x from an experimental region X. At each of the n values x_i an independent observation y_i will be available. Let the data be denoted y^T = (y_1, ..., y_n) and the unknown parameters θ^T = (θ_1, ..., θ_p). The contribution to the likelihood for a single observation y_i made at x is p(y_i | θ, x). For most models using exact posterior distributions is intractable and asymptotic arguments must be used. Under easily satisfied assumptions the posterior distribution of θ is approximately a multivariate normal distribution with mean the maximum likelihood estimate, θ̂, and variance-covariance matrix the inverse of the observed Fisher information matrix, i.e. the negative of the matrix of second derivatives of the log likelihood function evaluated at the maximum likelihood estimate of θ. This approximation is given, for example, in Berger (1985), page 224. The expected Fisher information matrix could also be used, but for the logistic regression model they are identical. By expanding the definition of a design to include any probability measure η on X we define the normalized information matrix I(θ, η) by

I(θ, η) = ∫_X I(θ, η_x) η(dx),

where I(θ, η_x) denotes the expected Fisher information, for a single observation, of the design putting point mass at x.
Assuming that the information matrix is nonsingular, the posterior distribution of θ using a design measure η is approximately multivariate normal N_p(θ̂, (nI(θ̂, η))^{-1}). Preposterior expected losses can therefore be approximated using the prior distribution of θ. The first criterion we consider is to choose the design measure η maximizing Φ1(η) where

Φ1(η) = E_θ{log det I(θ, η)}.

For a design measure η for which I(θ, η) is singular for θ values of non-zero prior probability we define Φ1(η) to be −∞. The second criterion is Φ2(η), which requires that the parameters to be estimated are specified and possibly weighted. The criterion is

Φ2(η) = −E_θ{tr B(θ) I(θ, η)^{-1}},
where B(θ) is a symmetric p by p matrix. If linear combinations of the θ_i are of interest then B(θ) does not depend on θ and is a matrix of known weights. If non-linear combinations of the θ_i's are of interest then B(θ) has entries which are functions of θ.

2.2. General equivalence theorem
Define H to be the set of all probability measures on X and view the design problem as finding a measure in H that maximizes Φ(η). We will only consider concave functions Φ on H, and if we assume X is compact then H is compact in the weak convergence topology. Both Φ1 and Φ2 are concave on H for logistic regression experiments and many other design problems. In linear design problems the information matrix does not depend on θ and there is a compact set of possible information matrices to optimize over. The set is a finite dimensional subset of Euclidean space. The general equivalence theorems of optimal design are usually stated in terms of this set and directional derivatives defined on this set. This is the approach in Silvey (1980). For the non-linear problem we need to optimize over the set of measures H directly and define directional derivatives on H. Whittle (1973) uses this approach and we use his version of the general equivalence theorem. For two measures η1 and η2 in H the derivative at η1 in the direction of η2 is denoted by F(η1, η2) and is defined, when the limit exists, by

F(η1, η2) = lim_{ε→0+} ε^{-1}{Φ((1−ε)η1 + εη2) − Φ(η1)}.
The extreme points of H are the measures which put point mass at a single point x in X. We denote such a measure η_x. If F(η1, η2) is linear in η2 then Φ is said to be differentiable and

F(η1, η2) = ∫_X F(η1, η_x) η2(dx).
We further define d(η, x) to represent F(η, η_x). Theorem 1 of Whittle (1973) is usually used in the context of linear design but we use it here for non-linear design. This theorem forms the basis for checking whether particular designs are optimal and is reproduced below in our notation.

Theorem (Whittle, 1973). (a) If Φ is concave, then a Φ-optimal design η0 can be equivalently characterized by any of the three conditions:
(i) η0 maximizes Φ(η),
(ii) η0 minimizes sup_{x in X} d(η, x),
(iii) sup_{x in X} d(η0, x) = 0.
(b) The point (η0, η0) is a saddle point of F in that

F(η0, η1) ≤ 0 = F(η0, η0) ≤ F(η2, η0)  for all η1, η2 in H.

(c) If Φ is differentiable, then the support of η0 is contained in the set of x for which d(η0, x) = 0, almost everywhere in η0 measure.

The proof is given by Whittle. Whittle's proof is in the context of the information matrix for the linear design problem, but it applies in more general problems under some additional assumptions. Specifically, assume that X is compact, that the derivatives exist and are continuous in x, that there is at least one measure in H for which Φ is finite, and that Φ is such that if η_i → η in weak convergence then Φ(η_i) → Φ(η). Under these assumptions Whittle's proof applies to general non-linear problems. Dubov (1977) shows how this theorem applies to non-linear regression problems. The equivalence theorem proved by Lauter (1974) is a special case of Whittle's theorem for mixtures of linear models and is also used, for example, by Cook and Nachtsheim (1982) for mixtures of polynomial regression models. Mixtures of linear models are a special case since there is a bound on the number of support points in an optimal design. For models other than linear models, and specifically for logistic regression, there is no such bound. Some results and algorithms of linear design, as described in Wynn (1970) and Fedorov (1972), apply. Specifically, results apply if they do not require the existence of a finite dimensional set of information matrices not depending on θ. For local optimality all results of linear design apply (White, 1973) since for a fixed value of θ the set of information matrices is a finite dimensional set. Even though Whittle's theorem can be used to decide whether a particular design η0 is optimal, it does not appear to have been used elsewhere. For example, Pronzato and Walter (1985) use Φ1-optimality for some bioengineering design problems but do not use this theorem. These results could also be applied to the dynamic systems described in Mehra (1974) and Titterington (1980). Mehra does not use Φ1- or Φ2-optimality but uses a criterion which first computes the prior expectation of I(θ, η) and then optimizes the determinant or trace of that matrix.
2.3. The criteria Φ1 and Φ2

For the two criteria Φ1 and Φ2 the necessary directional derivatives are

(i) for Φ1,
d(η, x) = E_θ{tr I(θ, η)^{-1} I(θ, η_x)} − p,

(ii) for Φ2,
d(η, x) = E_θ{tr B(θ) I(θ, η)^{-1} I(θ, η_x) I(θ, η)^{-1}} + Φ2(η).
We now proceed to use these general results for designing logistic regression experiments.

3. The logistic regression design problem
The logistic regression model is such that, for an observation at a value x of the explanatory variable, a Bernoulli response y is observed. There are two parameters, θ^T = (μ, β), and the probability of success is

p(x, θ) = 1/(1 + exp{−β(x − μ)}).
The model is a generalized linear model with the logit transformation of the expected response being linear in x. The parameter μ is the value of x at which the probability of success is 0.5 and β is the slope on the logit scale. The parameters μ and β can be thought of as representing the location and scale of an underlying tolerance distribution. We will assume that X is closed and bounded and therefore H will be compact. We also suppose that the support of the prior measure is bounded. This requirement ensures non-singularity of I(θ, η) for all measures η except for those corresponding to point mass at a particular x. These conditions are sufficient for the results of Section 2 to be applied to logistic regression. For a design measure η on X putting weight n_i at each of k distinct design points x_i, i = 1, ..., k, with Σ n_i = 1, define

w_i = p(x_i, θ)(1 − p(x_i, θ)),   t = Σ_{i=1}^k n_i w_i,   x̄ = t^{-1} Σ_{i=1}^k n_i w_i x_i,   s = Σ_{i=1}^k n_i w_i (x_i − x̄)².

Note that w_i, t, x̄ and s all depend on θ, but this dependence has been dropped to simplify the notation. The normalized information matrix can be written as

I(θ, η) = ( β²t           −βt(x̄−μ)
            −βt(x̄−μ)      s + t(x̄−μ)² ).                    (2)

The inverse of the information matrix is therefore

I(θ, η)^{-1} = β^{-2} ( t^{-1} + (x̄−μ)²s^{-1}    β(x̄−μ)s^{-1}
                        β(x̄−μ)s^{-1}             β²s^{-1} ).

It can be shown that the expected Fisher information matrix for any generalized linear model with a single explanatory variable takes a form similar to (2). This result follows from the form of the likelihood function for any member of the exponential family. Details can be found in Zacks (1977) for models with no unknown scale parameter and Chaloner (1987) for models with a scale parameter. The form of the information matrix (2) makes the concavity of Φ1 and Φ2 straightforward to prove. For example, to show that Φ1 is concave, take any two measures η1 and η2 in H and any α where 0 < α < 1. It is easy to show that for any θ

log det I(θ, αη1 + (1−α)η2) ≥ α log det I(θ, η1) + (1−α) log det I(θ, η2).

Taking expectations over both sides of the above inequality gives the required concavity of Φ1 over H. A similar argument yields the concavity of Φ2. The Φ1-criterion and several versions of Φ2-optimality will now be examined and expressions will be found for the criteria and their derivatives.
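Equation (2) and its inverse are easy to sanity-check numerically. The following sketch is our own illustration, not code from the paper: it assembles I(θ, η) from the per-point contributions w(x, θ) g g^T, where g = (−β, x − μ) is the gradient of the linear predictor β(x − μ), and confirms both the closed form (2) and the identity det I(θ, η) = β²ts for an arbitrary three-point design.

```python
import math

def p_success(x, mu, beta):
    return 1.0 / (1.0 + math.exp(-beta * (x - mu)))

def info_matrix(points, weights, mu, beta):
    """Normalized information matrix built from per-point contributions."""
    I = [[0.0, 0.0], [0.0, 0.0]]
    for x, n_i in zip(points, weights):
        p = p_success(x, mu, beta)
        w = p * (1.0 - p)
        g = (-beta, x - mu)   # gradient of the linear predictor beta*(x - mu)
        for a in range(2):
            for b in range(2):
                I[a][b] += n_i * w * g[a] * g[b]
    return I

# arbitrary illustrative design and parameter values
mu, beta = 0.3, 7.0
points = [-0.4, 0.1, 0.6]
weights = [0.3, 0.4, 0.3]

# the summaries t, xbar, s defined in the text
ws = [p_success(x, mu, beta) * (1 - p_success(x, mu, beta)) for x in points]
t = sum(n * w for n, w in zip(weights, ws))
xbar = sum(n * w * x for n, w, x in zip(weights, ws, points)) / t
s = sum(n * w * (x - xbar) ** 2 for n, w, x in zip(weights, ws, points))

I = info_matrix(points, weights, mu, beta)
# equation (2), entry by entry
assert abs(I[0][0] - beta**2 * t) < 1e-9
assert abs(I[0][1] + beta * t * (xbar - mu)) < 1e-9
assert abs(I[1][1] - (s + t * (xbar - mu) ** 2)) < 1e-9
# the determinant equals beta^2 * t * s
det = I[0][0] * I[1][1] - I[0][1] * I[1][0]
assert abs(det - beta**2 * t * s) < 1e-9
print("equation (2) verified")
```

The decomposition into t, x̄ and s is what makes the determinant factor so cleanly; the same check works for any number of support points.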
3.1. Φ1-optimality

For Φ1-optimality, maximizing the average log determinant, the expressions simplify to give

Φ1(η) = E_θ{log(β²ts)}

and, if we define w(x, θ) = p(x, θ)(1 − p(x, θ)), we have

d(η, x) = E_θ{w(x, θ)[t^{-1} + (x̄ − x)²s^{-1}]} − 2.

For any design η, we see that if |x| → ∞ then w(x, θ) → 0 at an exponential rate, and therefore d(η, x) → −2. From Whittle's result, we know that at points of support of the optimal design this derivative must be exactly zero. For sufficiently wide X, the design points of the optimal design will therefore not be at the boundary of X. Rather, if η0 is the optimal design, the support points will be at the roots of the function d(η0, x). This is different from design for linear models, where optimal designs require that |x| is as large as possible.
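The behaviour of d(η, x) can be illustrated numerically. The sketch below is our own illustration, not the authors' code: it takes a point-mass prior at (μ, β) = (0, 7), so that Φ1-optimality reduces to local D-optimality, and evaluates d(η, x) = w(x, θ)[t^{-1} + (x̄ − x)²s^{-1}] − 2 for the two-point design with weight 0.5 at each of ±0.2205 quoted later in Section 4.6. As Whittle's theorem requires, the derivative is non-positive everywhere, vanishes at the support points, and tends to −2 far from the design region.

```python
import math

mu, beta = 0.0, 7.0   # point-mass "prior": a single best guess
c = 0.2205            # support points +-c of the locally D-optimal design (Section 4.6)

def w(x):
    p = 1.0 / (1.0 + math.exp(-beta * (x - mu)))
    return p * (1.0 - p)

# summaries of the design measure: weight 0.5 at each of -c and +c
t = 0.5 * w(-c) + 0.5 * w(c)
xbar = (0.5 * w(-c) * (-c) + 0.5 * w(c) * c) / t
s = 0.5 * w(-c) * (-c - xbar) ** 2 + 0.5 * w(c) * (c - xbar) ** 2

def d(x):
    # directional derivative for Phi_1 (degenerate prior, so no expectation needed)
    return w(x) * (1.0 / t + (xbar - x) ** 2 / s) - 2.0

grid = [i / 1000.0 for i in range(-2000, 2001)]   # x in [-2, 2]
assert max(d(x) for x in grid) < 1e-3   # d <= 0 everywhere (up to rounding of c)
assert abs(d(c)) < 1e-6 and abs(d(-c)) < 1e-6   # zero at the support points
assert abs(d(3.0) + 2.0) < 1e-5   # far from the design region, d -> -2
print("equivalence-theorem check passed")
```

For a genuine prior the same check applies with the expectation over θ computed by numerical integration.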
3.2. Φ2-optimality
The weighted trace criterion of Φ2-optimality is perhaps more appealing than Φ1-optimality as it corresponds to the familiar squared error loss function. It requires that the experimenter carefully specify exactly what is to be estimated or predicted. The specification of what is to be estimated provides the matrix B(θ). If, for example, the only parameter of interest is μ, then B(θ) can be written as cc^T with c = (1, 0)^T. If both μ and β are of equal interest then B(θ) is the identity matrix. It is often of interest, especially in reliability and bioassay experiments, to estimate the value of x at which the probability of success, p(x, θ), is a particular value. Suppose that we wish to estimate x0 where logit(p(x0, θ)) = γ; then x0 = μ + γβ^{-1}, which is a non-linear function of θ. Standard asymptotic arguments give B(θ) = c(θ)c(θ)^T where c(θ) is the vector of derivatives of the non-linear function. Specifically, c(θ)^T = (1, −γβ^{-2}). If several percentile response points are of interest then a distribution could be put on γ and B(θ) averaged over this distribution. This is a standard way of using the weighted trace criterion in a Bayesian framework (see, for example, Chaloner, 1984, for this approach in linear models). We implement the design algorithm for three choices of B(θ). The first choice is where μ is the only parameter of interest. The second is where the only parameter of interest is the value x such that p(x, θ) = 0.95. The third is when we are interested in estimating the value of x at which logit(p(x, θ)) = γ and γ has a uniform distribution over [−1, 1]. This range on γ gives probabilities between 0.27 and 0.73. These
three choices are by no means exhaustive of the many which might occur in real experimental situations, but they represent a range of criteria which might be appropriate in particular experiments. These three examples of Φ2-optimality will be examined and optimal designs found for a range of uniform and independent prior distributions for μ and β. The criterion functions and derivatives are now derived for these examples of Φ2-optimality. Recall that the derivative involves the weighted trace of the matrix I(θ, η)^{-1} I(θ, η_x) I(θ, η)^{-1}. Expressions for the entries of this matrix are given in the Appendix.
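To make the role of B(θ) concrete, the sketch below (our own illustration, with an arbitrary design and parameter values) checks the delta-method identity for the LD95: with c(θ)^T = (1, −γβ^{-2}) and γ = logit(0.95) = 2.944, the quadratic form c(θ)^T I(θ, η)^{-1} c(θ) agrees with the closed-form expression β^{-2}[t^{-1} + (γ − β(x̄ − μ))²β^{-2}s^{-1}] used in Section 3.4.

```python
import math

mu, beta, gamma = 0.1, 7.0, 2.944   # gamma = logit(0.95)
points = [-0.3, 0.0, 0.4]
weights = [0.25, 0.5, 0.25]

p = [1.0 / (1.0 + math.exp(-beta * (x - mu))) for x in points]
w = [pi * (1.0 - pi) for pi in p]
t = sum(n * wi for n, wi in zip(weights, w))
xbar = sum(n * wi * x for n, wi, x in zip(weights, w, points)) / t
s = sum(n * wi * (x - xbar) ** 2 for n, wi, x in zip(weights, w, points))

# inverse information matrix, from the closed form following equation (2)
inv = [[(1.0 / t + (xbar - mu) ** 2 / s) / beta**2, (xbar - mu) / (beta * s)],
       [(xbar - mu) / (beta * s), 1.0 / s]]

# delta method: x0 = mu + gamma/beta, so c = (dx0/dmu, dx0/dbeta) = (1, -gamma/beta^2)
c = (1.0, -gamma / beta**2)
var_delta = sum(c[a] * inv[a][b] * c[b] for a in range(2) for b in range(2))

closed_form = (1.0 / t + (gamma - beta * (xbar - mu)) ** 2 / (beta**2 * s)) / beta**2
assert abs(var_delta - closed_form) < 1e-12
print("delta-method identity verified")
```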
3.3. Φ2-optimality for estimation of μ
When interest is in estimation of μ alone then the (1,1) entry of B(θ) is 1 and all other entries are zero. For any design η with more than one support point,

Φ2(η) = −E_θ{β^{-2}[t^{-1} + (x̄ − μ)²s^{-1}]}

and

d(η, x) = E_θ{w(x, θ)(βst)^{-2}[t(x̄ − x)(x̄ − μ) + s]²} + Φ2(η).

Estimating μ alone is of interest in many biological applications; the parameter μ is often referred to as the LD50 as it is the 50th percentile of the underlying logistic tolerance distribution.

3.4. Φ2-optimality for any or several percentile response points
Following the earlier discussion in 3.2, to estimate μ + γβ^{-1} we have

Φ2(η) = −E_θ{β^{-2}[t^{-1} + (γ − β(x̄ − μ))²β^{-2}s^{-1}]}

and

d(η, x) = E_θ{w(x, θ)(β²st)^{-2}[t(x̄ − x)(β(x̄ − μ) − γ) + βs]²} + Φ2(η).

Substituting γ = 0 gives the expressions in 3.3 for the special case of estimating μ alone. Substituting γ = 2.944 gives the situation where we want to estimate the value of x at which p(x, θ) = 0.95. This value of x is referred to as the LD95 and is often of interest in reliability experiments (see Meeker and Hahn, 1977). If a distribution is put over γ to represent interest in several percentile response points then define

B(θ) = ( 1              −E(γ)β^{-2}
         −E(γ)β^{-2}     E(γ²)β^{-4} ).

The expectation is over the distribution on γ. An example where γ has a uniform distribution on [−1, 1] will later be used. In that case E(γ) = 0 and E(γ²) = 1/3 and the criterion and derivative become
Φ2(η) = −E_θ{β^{-2}[t^{-1} + (x̄ − μ)²s^{-1} + (3β²s)^{-1}]}

and

d(η, x) = E_θ{w(x, θ)(β²st)^{-2}[β²(t(x̄ − x)(x̄ − μ) + s)² + t²(x̄ − x)²3^{-1}]} + Φ2(η).
A uniform [−1, 1] distribution on γ represents interest in calibrating the central part of the logistic response curve.
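The averaging step can be checked numerically: integrating the single-γ variance expression of Section 3.4 over γ uniform on [−1, 1] must reproduce the closed form above, since E(γ) = 0 and E(γ²) = 1/3. The sketch below (our own illustration, with arbitrary values of the summaries t, x̄ − μ, s and of β) confirms this with a midpoint-rule quadrature.

```python
# arbitrary illustrative values of the design summaries (all depend on theta in general)
beta, t, dev, s = 7.0, 0.14, 0.05, 0.006   # dev stands for xbar - mu

def var_gamma(g):
    # asymptotic variance of the estimate of mu + g/beta (Section 3.4, single gamma)
    return (1.0 / t + (g - beta * dev) ** 2 / (beta**2 * s)) / beta**2

# midpoint-rule average over gamma ~ uniform[-1, 1]
m = 2000
avg = sum(var_gamma(-1.0 + (2 * i + 1) / m) for i in range(m)) / m

closed = (1.0 / t + dev**2 / s + 1.0 / (3 * beta**2 * s)) / beta**2
assert abs(avg - closed) < 1e-6
print("averaged criterion matches the closed form")
```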
4. Implementation

Finding optimal designs is a problem in numerical optimization. We have found the Nelder and Mead (1965) version of the simplex algorithm to be effective for these problems. This method is an unconstrained optimization and we transform the problem to satisfy this. This method also requires that the number of design points is specified and so we repeat the optimization, starting with 2 design points and increasing the number of support points steadily up to 11 or, if necessary, 20. We then choose the design which optimizes the criterion on the smallest number of design points. To verify that a design chosen this way does indeed correspond to a global optimum of the criterion, the derivative d(η, x) is evaluated over possible values of x. A double precision FORTRAN implementation of this procedure is available from the authors. This implementation uses an adaptation of O'Neill's (1971) subroutine and standard numerical integration methods. Algorithms of the type proposed by Wynn (1972) and Fedorov (1972) could also be used and versions of these were implemented. These were found to be very slow, especially for vague prior distributions. Our implementations of these algorithms could undoubtedly be improved, but we found the Nelder and Mead algorithm to be reliable and adequate, especially given the ability to verify that a global optimum had been found by examining the derivative function.

4.1. A numerical investigation
A numerical investigation using our implementation of the Nelder-Mead algorithm will now be described. Designs were found for a range of uniform and independent prior distributions for μ and β and a range of criteria. It was assumed that μ and β were independent in the prior distribution and each had a uniform distribution on an interval. Three different intervals were chosen for each parameter and all nine combinations were used to generate nine different prior distributions in a 3 x 3 factorial structure. The intervals for μ were all centered at zero and were [−0.1, 0.1], [−0.3, 0.3] and [−1.0, 1.0]. The intervals for β were [6.9, 7.1], [6.0, 8.0] and [4.0, 10.0], and so they were all centered at 7. These intervals were motivated by the example in Brown (1966) and represent a range of prior distributions from quite precise to quite vague but informative. Even the vaguest prior distribution is proper and recognizes that the sign of β is known and a range is known for μ. This knowledge would be reasonable to assume in many real experimental situations. Independent uniform prior distributions on an interval may not always be appropriate but they serve to illustrate our technique. The computer implementation may easily be modified to use any prior distribution. For each of the nine uniform prior distributions optimal designs were found using the Nelder-Mead algorithm. Four criteria were used, Φ1-optimality and the three versions of Φ2-optimality introduced in Sections 3.3 and 3.4.
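The optimization step can be sketched in a few lines. The code below is our own minimal illustration, not the authors' FORTRAN implementation: it runs a bare-bones Nelder-Mead simplex search for the Φ1-criterion with a degenerate (point-mass) prior at (μ, β) = (0, 7) and equal weights fixed on two support points, so the answer should reproduce the locally D-optimal design quoted in Section 4.6. For a genuine prior the expectation over θ would be computed by numerical integration, and the weights would also be free parameters.

```python
import math

MU, BETA = 0.0, 7.0   # degenerate prior: a single best guess for (mu, beta)

def p_success(x):
    z = BETA * (x - MU)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)   # overflow-safe branch for large negative z
    return e / (1.0 + e)

def neg_phi1(points):
    """Negative of log det I(theta, eta) = log(beta^2 * t * s), equal weights 1/k."""
    k = len(points)
    w = [p_success(x) * (1.0 - p_success(x)) for x in points]
    t = sum(w) / k
    if t <= 0.0:
        return float("inf")
    xbar = sum(wi * xi for wi, xi in zip(w, points)) / (k * t)
    s = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, points)) / k
    if s <= 0.0:
        return float("inf")   # singular design (coincident points)
    return -math.log(BETA**2 * t * s)

def nelder_mead(f, x0, step=0.1, tol=1e-12, max_iter=2000):
    """Bare-bones Nelder-Mead: reflect/expand/contract/shrink, standard coefficients."""
    n = len(x0)
    simplex = [list(x0)] + [[xj + (step if i == j else 0.0)
                             for j, xj in enumerate(x0)] for i in range(n)]
    for _ in range(max_iter):
        simplex.sort(key=f)
        if abs(f(simplex[-1]) - f(simplex[0])) < tol:
            break
        best, worst = simplex[0], simplex[-1]
        centroid = [sum(p[i] for p in simplex[:-1]) / n for i in range(n)]
        refl = [2.0 * centroid[i] - worst[i] for i in range(n)]
        if f(refl) < f(best):
            expd = [3.0 * centroid[i] - 2.0 * worst[i] for i in range(n)]
            simplex[-1] = expd if f(expd) < f(refl) else refl
        elif f(refl) < f(simplex[-2]):
            simplex[-1] = refl
        else:
            contr = [0.5 * (centroid[i] + worst[i]) for i in range(n)]
            if f(contr) < f(worst):
                simplex[-1] = contr
            else:   # shrink the whole simplex toward the best point
                simplex = [best] + [[0.5 * (best[i] + p[i]) for i in range(n)]
                                    for p in simplex[1:]]
    return sorted(simplex, key=f)[0]

design = sorted(nelder_mead(neg_phi1, [-0.5, 0.4]))
# the two support points should straddle mu = 0 at roughly +-0.2205 (Section 4.6)
assert design[0] < 0.0 < design[1]
assert abs(design[0] + 0.2205) < 0.01 and abs(design[1] - 0.2205) < 0.01
print("two-point design:", [round(x, 4) for x in design])
```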
4.2. Φ1-optimality
Figure 1 highlights some of the results for maximizing the average log determinant. The figure displays the designs as probability mass functions and gives the maximum value of the determinant. Examination of Figure 1 reveals that as our uncertainty increases so does the minimum number of support points required to attain the optimal value. For μ uniform on [−0.1, 0.1] optimal designs for all three prior distributions on β are on only k = 2 points. For μ uniform on [−0.3, 0.3] the optimal design is on k = 3 points and for μ uniform on [−1.0, 1.0] more than 7 points are needed. The designs appear to be much more sensitive to the prior distribution on μ than on β.
Fig. 1. Φ1-optimal designs plotted as probability mass functions on [−1, 1]. The numbers in parentheses are the maximum values attained by the criterion.
4.3. Φ2-optimality for the estimation of μ
Figure 2 gives the corresponding results for minimizing the expected variance of μ. Again, as for Φ1-optimality, the more uncertainty there is in the prior distribution the more support points are needed.
4.4. Φ2-optimality for the estimation of LD95
In engineering problems it may well be an extreme percentile, such as the point where the probability of response is 0.95, that is of interest. Unlike the two previous examples, the optimal design is not necessarily symmetric about the prior mean for μ. Figure 3 illustrates these designs. They are clearly not symmetric and can be quite different depending on the prior distribution. We noted that the criterion function to be optimized in this case does not appear to be as well behaved as for the other criteria studied.
4.5. Φ2-optimality for the average percentile response point
As described earlier we use a criterion designed to minimize the average variance of μ + γβ^{-1} where γ has a uniform distribution over [−1, 1]. The results are summarized in Figure 4. The designs are symmetric around zero and all are within the interval [−1, 1]. Again we see that as prior uncertainty increases so does the number of support points.

Fig. 2. Φ2-optimal designs for estimating the LD50, μ, plotted as probability mass functions on [−1, 1]. The numbers in parentheses are the minimum average variance.

Fig. 3. Φ2-optimal designs for estimating the LD95 plotted as probability mass functions on [−1.2, 1.2]. The numbers in parentheses are the minimum average variance.

Fig. 4. Φ2-optimal designs for estimating the average percentile response plotted as probability mass functions on [−1, 1]. The numbers in parentheses are the minimum average variance.

4.6. The locally optimal design
In these examples the best guess for (μ, β) would be (0, 7). The locally Φ1-optimal design is to put weight 0.5 at each of +0.2205 and −0.2205. These are the points at which the probability p(x, θ), for these values of the parameters θ, is 0.824 and 0.176 respectively.
4.7. General conclusions

In general, as the uncertainty in the prior distribution increases so does the number of support points. This result is in accord with the results of Tsutakawa (1972, 1980) and Abdelbasit and Plackett (1983) who, however, restricted their designs to be equally spaced and have equal weights. The designs we found make no such restriction and indeed the designs, in general, are neither equally spaced nor have equal weight on all design points, especially for estimating the LD95. A feature common to all of the four design criteria was that as the uncertainty in the prior distribution increased so did the length of computing time. Looking at the derivative functions we found that as the uncertainty in the prior distribution increases the derivative function d(η, x) of a design close to optimal is very flat compared to the derivative functions for a sharper prior distribution. Optimization for these cases is thus more difficult.
5. Efficiencies

Efficiency calculations were made for the nine examples used in Section 4. We calculated the improvement in using either the Φ1-optimal or Φ2-optimal design instead of the locally D-optimal design. The locally D-optimal design is most inefficient for the prior distribution where μ is uniform over [−1, 1] and, independently, β is uniform over [4, 10]. For this prior distribution, the locally D-optimal design requires 5.5 times as many observations to be as efficient as the Φ1-optimal design in maximizing the expected log of the determinant of the unnormalized information matrix. For minimizing the expected asymptotic variance of μ, the locally D-optimal design requires 480 times as many observations as the Φ2-optimal design. Details of these efficiency calculations and tables for all the examples of Section 4 are in a technical report available from the authors.
6. Two examples
In order to further understand these designs we take two particular prior distributions and study the optimal designs in detail. We take the prior distribution where β is between 6 and 8 and two choices of the prior distribution on μ, namely μ between −0.3 and 0.3 and between −1 and 1. These two prior distributions represent a moderate amount of prior information and more vague prior information. These two examples illustrate the difference caused by changing the prior distribution on μ. First consider the prior distribution representing a moderate amount of information, μ uniform on [−0.3, 0.3] and, independently, β uniform on [6, 8]. That the Φ1-optimal design in Figure 1 is indeed optimal can be verified by plotting the derivative function d(η, x). The plot is given in Figure 5(a) and it is clear that the function is negative almost everywhere and has exactly three roots at the design points. Plots are also given of the corresponding derivatives for the criteria for the other three optimal designs in 5(b), 5(c) and 5(d). Now consider vaguer prior information, where μ is uniform on [−1, 1] and β is again uniform on [6, 8]. Figure 6 gives the corresponding plot of the derivative function d(η, x), again for values of x between −1.2 and 1.2. In comparison to Figure 5 we see that the derivative is much flatter in all four cases.
Fig. 5. Derivative functions d(η, x) plotted against x for each of the four optimal designs found for the prior distribution with μ uniform on [−0.3, 0.3] and β uniform on [6, 8]. (a) Φ1-optimality; (b) Φ2-optimality for LD50; (c) Φ2-optimality for LD95; (d) Φ2-optimality for average percentile.

Fig. 6. Derivative functions d(η, x) plotted against x for each of the four optimal designs found for the prior distribution with μ uniform on [−1, 1] and β uniform on [6, 8]. (a) Φ1-optimality; (b) Φ2-optimality for LD50; (c) Φ2-optimality for LD95; (d) Φ2-optimality for average percentile.

7. Asymptotic approximations

A referee has pointed out that, as the justification of these Bayesian designs depends on asymptotic arguments, the effectiveness of the designs will depend heavily on the accuracy of the asymptotic approximations. This is clearly very important. There are really two separate issues here. First, whether the value of the criterion really does approximate the expected posterior utility well, and second, whether a design obtained by maximizing the approximate posterior utility gives an actual posterior utility close to the exact optimal value. We have done some preliminary numerical work to answer these questions for the prior distributions used in Section 4. We simulated 1500 replications of the experiment with a sample size of n = 100 for the Φ2-optimal design for estimating μ. We also simulated 1500 replications for eight other designs obtained by perturbing the Φ2-optimal design slightly. The actual posterior variance of μ was calculated for each replication of each experiment. The median of the actual posterior variance of μ was close to the asymptotic approximation −n^{-1}Φ2(η) for all the designs. The Φ2-optimal design, therefore, has reasonable empirical properties for n = 100. The details of this simulation are omitted to save space and are in a technical report available from the authors. This very preliminary investigation is clearly
K. Chaloner,
206
incomplete and sample
and we intend
K. Larntz / Optimal Bayesian design
to explore
sizes in future
this issue for other criteria,
prior distributions
research.
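The simulation described above can be sketched in a few lines. The following is an illustrative reconstruction, not the simulation from our technical report: the three-point design, the grid resolution and the 200 replications (rather than 1500) are all hypothetical choices, and the exact posterior variance of μ is computed by brute force on a grid over the uniform prior support.

```python
import numpy as np

rng = np.random.default_rng(0)

# illustrative stand-ins for the paper's setting (not its exact design)
mu_grid = np.linspace(-0.3, 0.3, 61)    # uniform prior support for mu
beta_grid = np.linspace(6.0, 8.0, 41)   # uniform prior support for beta
MU, BETA = np.meshgrid(mu_grid, beta_grid, indexing="ij")

xs = np.array([-0.2, 0.0, 0.2])         # hypothetical design points
ns = np.array([34, 32, 34])             # n = 100 observations in total

def posterior_var_mu(ys):
    """Exact posterior variance of mu on a grid, under the uniform prior."""
    loglik = np.zeros_like(MU)
    for x, n, y in zip(xs, ns, ys):
        p = 1.0 / (1.0 + np.exp(-BETA * (x - MU)))
        loglik += y * np.log(p) + (n - y) * np.log1p(-p)
    post = np.exp(loglik - loglik.max())
    post /= post.sum()
    m = (post * MU).sum()
    return (post * MU**2).sum() - m**2

# replicate the experiment at parameter values drawn from the prior and
# record the actual posterior variance of mu for each replication
variances = []
for _ in range(200):
    mu0, beta0 = rng.uniform(-0.3, 0.3), rng.uniform(6.0, 8.0)
    p0 = 1.0 / (1.0 + np.exp(-beta0 * (xs - mu0)))
    ys = rng.binomial(ns, p0)
    variances.append(posterior_var_mu(ys))
median_var = np.median(variances)
```

The median of `variances` plays the role of the "median actual posterior variance" in the comparison above; with a bounded prior the data can only tighten it below the prior variance of μ.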
8. Conclusion

Numerical optimization is efficient in finding good designs which are close to optimal under these Bayesian criteria. The results of Section 2 are useful in understanding the problem and in verifying that the designs obtained from the optimization algorithm are indeed close to optimal.

The prior distributions used in this paper, uniform and independent prior distributions for μ and β, clearly do not always apply. The methodology we have presented, however, applies to any situation where bounds are available on the parameter space. We believe that this methodology is preferable to the procedure of designing for a single best guess of the parameter values. We also believe that it provides a framework for the careful examination of any experimental procedure in which the data analysis will include fitting a generalized linear model.

Of course, in practice designs are required which have an integer number of observations. The approximate optimal designs derived here can be rounded to exact designs, and the experimenter can calculate exactly how much is lost by rounding. These approximate designs are easy to derive. Even if an experimenter cannot use one of these designs exactly, they provide a useful benchmark with which to calibrate other designs. Interestingly, for these designs the number of support points is generally greater than the number of parameters, and therefore checks can usually be made on the appropriateness of the model.
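The rounding step mentioned above can be made concrete. The sketch below uses a largest-remainder apportionment, which is one simple choice and not a procedure prescribed in the paper; the three-point design and run size are illustrative.

```python
import numpy as np

def round_design(weights, n):
    """Round an approximate design measure (weights summing to one over the
    support points) to an exact design with n observations, using a
    largest-remainder apportionment."""
    target = np.asarray(weights, dtype=float) * n
    counts = np.floor(target).astype(int)
    # give the leftover observations to the largest fractional parts
    leftovers = np.argsort(counts - target)  # most negative = largest remainder
    for i in leftovers[: n - counts.sum()]:
        counts[i] += 1
    return counts

# e.g. a three-point approximate design rounded to n = 25 runs
print(round_design([0.25, 0.5, 0.25], 25))  # allocates 6, 13 and 6 runs
```

The loss from rounding can then be assessed exactly by evaluating the design criterion at the rounded allocation and comparing it with the value at the approximate optimum.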
Appendix

Denote the matrix A by

A = I(θ, η)⁻¹ I(θ, η_x) I(θ, η)⁻¹;

then the derivative for the φ₂-criterion is

d(η, x) = E{tr B(θ)A} + φ₂(η).

The entries of the matrix A can be written as

A₁₁ = w(x, θ)(βst)⁻²[t(x̄ − x)(x̄ − μ) + s]²,
A₁₂ = w(x, θ)(βts²)⁻¹(x̄ − x)[t(x̄ − x)(x̄ − μ) + s],
A₂₁ = A₁₂,    A₂₂ = w(x, θ)(x̄ − x)²s⁻².

These expressions simplify the algebra required to find the derivatives for different choices of B(θ).
Acknowledgements We are grateful to the referees and our colleagues at the University of Minnesota for helpful suggestions. We are also grateful for the comments of the participants at the workshop, in March 1987, at Castle Wartburg, Eisenach, DDR. This research was supported in part by grant number DMS-8706754 from the National Science Foundation.
References

Abdelbasit, K.M. and R.L. Plackett (1983). Experimental design for binary data. J. Amer. Statist. Assoc. 78, 90-98.
Anbar, D. (1984). Stochastic approximation methods and their use in bioassay and phase I clinical trials. Commun. Statist. - Theory Methods 13 (19), 2451-2467.
Berger, J.O. (1985). Statistical Decision Theory and Bayesian Analysis (second edition). Springer-Verlag, New York.
Bernardo, J.M. (1979). Expected information as expected utility. Ann. Statist. 7, 686-690.
Brooks, R.J. (1987). Optimal allocation for Bayesian inference about an odds ratio. Biometrika 74, 196-199.
Brown, B.W., Jr. (1966). Planning a quantal assay of potency. Biometrics 22, 322-329.
Chaloner, K. (1984). Optimal Bayesian experimental design for linear models. Ann. Statist. 12, 283-300.
Chaloner, K. (1987). An approach to design for generalized linear models. In: Proceedings of the Workshop on Model-Oriented Data Analysis, Wartburg, March 1987. Lecture Notes in Economics and Mathematical Systems No. 297. Springer-Verlag, Berlin, 3-12.
Chernoff, H. (1953). Locally optimal designs for estimating parameters. Ann. Math. Statist. 24, 586-602.
Cook, R.D. and C.J. Nachtsheim (1982). Model robust, linear-optimal designs. Technometrics 24, 49-54.
Dubov, E.L. (1977). D-optimal designs for non-linear models under the Bayesian approach. In: Regression Experiments. Moscow University Press, Moscow, 103-111. (In Russian.)
Fedorov, V.V. (1972). The Theory of Optimum Experiments. Transl. and ed. by W.J. Studden and E.M. Klimko. Academic Press, New York.
Fedorov, V.V. (1981). Active regression experiments. In: Mathematical Methods of Experimental Design. Nauka Press, Siberian Section, 19-73. (In Russian.)
Finney, D.J. (1978). Statistical Methods in Biological Assay (third edition). Macmillan, New York.
Freeman, P.R. (1970). Optimal Bayesian sequential estimation of the median effective dose. Biometrika 57, 79-89.
Khan, M.K. and A.A. Yardi (1988). On D-optimal designs for binary data. J. Statist. Plann. Inference 18, 83-91.
Läuter, E. (1974). Experimental design in a class of models. Math. Operationsforsch. Statist. 5, 379-396.
Läuter, E. (1976). Optimal multipurpose designs for regression models. Math. Operationsforsch. Statist. 7, 51-68.
Leonard, T. (1982). An inferential approach to the bioassay design problem. Technical Summary Report 2416, Mathematics Research Center, University of Wisconsin-Madison.
Lindley, D.V. (1956). On a measure of the information provided by an experiment. Ann. Math. Statist. 27, 986-1005.
Meeker, W.Q. and G.J. Hahn (1977). Asymptotically optimum over-stress tests to estimate the survival probability at a condition with low expected failure probability. Technometrics 19, 381-399.
Mehra, R.K. (1974). Optimal input signals for parameter estimation in dynamic systems - survey and new results. IEEE Trans. Automat. Control 19, 753-768.
Minkin, S. (1987). On optimal design for binary data. J. Amer. Statist. Assoc. 82, 1098-1103.
Nelder, J.A. and R. Mead (1965). A simplex method for function minimization. Computer J. 7, 308-313.
O'Neill, R. (1971). Function minimization using a simplex procedure, Algorithm AS47. Appl. Statist. 20, 338-345.
Owen, R.J. (1975). A Bayesian sequential procedure for quantal response in the context of adaptive mental testing. J. Amer. Statist. Assoc. 70, 351-356.
Pronzato, L. and E. Walter (1985). Robust experiment design via stochastic approximation. Math. Biosci. 75, 103-120.
Silvey, S.D. (1980). Optimal Design. Chapman and Hall, London-New York.
Titterington, D.M. (1980). Aspects of optimal design in dynamic systems. Technometrics 22, 287-299.
Tsutakawa, R.K. (1972). Design of an experiment for bioassay. J. Amer. Statist. Assoc. 67, 584-590.
Tsutakawa, R.K. (1980). Selection of dose levels for estimating a percentage point of a logistic quantal response curve. Appl. Statist. 29, 25-33.
White, L.V. (1973). An extension of the general equivalence theorem to nonlinear models. Biometrika 60, 345-348.
Whittle, P. (1973). Some general points in the theory of optimal experimental design. J. Roy. Statist. Soc. Ser. B 35, 123-130.
Wu, C.F.J. (1985). Efficient sequential designs with binary data. J. Amer. Statist. Assoc. 80, 974-984.
Wynn, H.P. (1970). The sequential generation of D-optimal experimental designs. Ann. Math. Statist. 41, 1655-1664.
Wynn, H.P. (1972). Results in the theory and construction of D-optimum experimental designs. J. Roy. Statist. Soc. Ser. B 34, 133-147.
Zacks, S. (1977). Problems and approaches in design of experiments for estimation and testing in non-linear problems. In: P.R. Krishnaiah, Ed., Multivariate Analysis IV. North-Holland, Amsterdam, 209-223.