Journal of Statistical Planning and Inference 74 (1998) 177–192
First-order optimal designs for non-linear models

Paola Sebastiani a,∗, Raffaella Settimi b

a Department of Statistics, The Open University, Walton Hall, Milton Keynes MK7 6AA, UK
b Department of Statistics, University of Warwick, Coventry CV4 7AL, UK
Received 13 September 1996; received in revised form 14 January 1998; accepted 20 January 1998
Abstract

This paper presents D-optimal experimental designs for a variety of non-linear models which depend on an arbitrary number of covariates, under the assumption of a positive prior mean and of a Fisher information matrix satisfying particular properties. It is argued that these optimal designs can be regarded as a first-order approximation of the asymptotic increase of Shannon information. The efficiency of this approximation is compared in some examples, which show how the results can be further used to compute the Bayesian optimal design when the approximate solution is not accurate enough. © 1998 Elsevier Science B.V. All rights reserved.

AMS classification: primary 62K05; secondary 62F15

Keywords: D-optimal designs; Generalized linear models; Numerical integration; Shannon information
1. Introduction

Consider this problem: an experimenter needs to choose the values of some input variables, say $x$. The output of the experiment will be $Y = f(x, \theta, \varepsilon)$, where $f(\cdot)$ is a known function, $\theta$ is a vector of unknown parameters, and $\varepsilon$ is an error. An optimal experimental design, say $\xi$, is a set of trials $x_i$ in the design region $X$, with corresponding numbers of observations $n_i$, that maximize a utility function representing the aim of the experiment. Usually the optimization is subject to $\sum_i n_i = n$, with $n$ given. The constraint represents, for instance, costs.

Suppose that the aim of the experiment is to obtain accurate estimates of $\theta$. Classical optimal design theory (Silvey, 1980) provides us with several suitable design criteria. Among those, we limit attention to D-optimality, which has the property of being invariant under nonsingular linear transformations of the design region. Thus, if
∗ Corresponding author. Fax: +44 1908 652140; e-mail: [email protected].
$V = V(\xi)$ is the variance–covariance matrix of the maximum likelihood estimators (MLE) of $\theta$, a design $\xi^*$ maximizing $-\log\det V(\xi)$ is D-optimal. For non-linear problems, the asymptotic variance–covariance matrix of the MLE is approximated by the expected Fisher information matrix, which is defined as
$$I(\theta, \xi) = -E_Y\left\{\frac{\partial^2 l(\theta\,|\,Y)}{\partial\theta_i\,\partial\theta_j}\right\},$$
where $l(\theta\,|\,y)$ denotes the log-likelihood function. Since, usually, $I(\theta, \xi)$ will be a function of $\theta$, so will be the design criterion. A design $\xi^*$ that maximizes
$$\log\det I(\theta_0, \xi) \qquad (1)$$
for a best guess $\theta_0$ of $\theta$ is called locally D-optimal. It is well known that solutions in closed form are difficult to find, since maximization of Eq. (1) with respect to the design is often impracticable.

In this paper, we will show that if the Fisher information matrix depends on the unknown parameters in the model only via a weight function that satisfies particular conditions, then optimal designs in closed form exist, or at least the solution reduces to solving an equation in one unknown. The conditions on the weight function are general enough to be met by a class of models that includes some generalized linear models and heteroscedastic normal linear models whose variance is a function of the mean. The optimal solution is a balanced design with a minimal number of support points that belong to the boundary of the design region. Thus, the optimal strategy is a one-at-a-time experiment. A similar result was found by Burridge and Sebastiani (1994) for generalized linear models with variance proportional to the square of the mean. Our result holds under the assumption that the components of the best guess of the parameters are positive, but for more general variance functions.

Following Sebastiani (1996), we argue that a locally D-optimal design can be regarded as a first-order approximation of a Bayesian D-optimal design, when the best guess of $\theta$ is the prior mean of the parameters. Thus, a by-product of our results is that we provide Bayesian optimal designs under the assumption that the prior precision on the parameters is large enough for the first-order approximation to be accurate. If the prior uncertainty is large, then locally optimal designs can help the numerical search for Bayesian designs.

The remainder of the paper is structured as follows. Section 2 deals with the classical approach to optimal design for non-linear models. A description of models that yield an expected Fisher information matrix of the specific forms considered here is given in Section 2.1. The main result of this paper is given in Section 2.2, and examples of locally D-optimal designs for models of general interest are given in Section 2.3. In Section 3 we describe the Bayesian approach to optimal experimental design for non-linear problems based on the use of Shannon information, and we show how locally optimal designs can provide an accurate approximation of the Bayesian solution when the prior precision on the parameters is large. Robustness issues are discussed in Section 3.1, and some examples are presented. Conclusions are in Section 4.
2. Locally D-optimal designs

In this section we present the main result of this paper. We first describe the statistical model.

2.1. The model

Let $Y\,|\,\beta$ denote the response variable. We suppose that both the expectation and the variance of $Y$ are functions of the design variables $x = (1, x_1, \ldots, x_k)^T$, $x \in X \subset \mathbb{R}^{k+1}$, and of the parameter vector $\beta = (\beta_0, \beta_1, \ldots, \beta_k)^T \in \mathbb{R}^{k+1}$. We shall assume
$$E(Y\,|\,\beta) = \mu(\eta), \quad V(Y\,|\,\beta) = v(\eta), \quad \eta = \beta^T x, \quad v(\eta) > 0 \ \ \forall \eta, \qquad (2)$$
where $\mu(\cdot)$, $v(\cdot)$ are known functions and the distribution of $Y\,|\,\beta$ belongs to the exponential family.

Example 1 (Generalized linear models). In this case $v(\eta) = w(\eta)^{-1}(\mathrm{d}\mu/\mathrm{d}\eta)^2$ and the function $w(\eta)$ has special forms (Ford et al., 1992; Sitter and Torsney, 1995). Particular cases are: $w(\eta) = \exp(\eta)$ (log-linear Poisson regression); $w(\eta) = f(\eta)^2/[F(\eta)\{1 - F(\eta)\}]$, where $F(\cdot)$ is a tolerance distribution and $f(\eta)$ is the corresponding density function (a quantal response model). Some non-linear normal models are a special case of Eq. (2) with $v(\eta)$ constant and $\mu(\eta)$ a known function of $\eta$, e.g. a first-order decay $\mu(\eta) = \exp(\eta)$ (Atkinson and Donev, 1992).

Example 2 (Heteroscedastic linear normal models). Let $\mu(\eta) = \eta$ and $Y \sim N(\eta, v(\eta))$, for example $v(\eta) = \exp(-\eta)$ (Davidian and Carroll, 1987).

We suppose that the experimental conditions are such that only an upper bound (or lower bound) is given for any design variable, and without loss of generality we assume that $x_i \in (-\infty, 0]$, $i = 1, \ldots, k$. We shall also assume that, for a given value $\beta = \beta_0$, $\beta_{0i} > 0$, $i = 1, \ldots, k$; that is, we suppose that the design variables have a 'similar' effect on the response variable, and hence $\eta_0 = \beta_0^T x \in (-\infty, \beta_{00}]$. In this class, we shall focus on models which have an expected Fisher information of the form
$$I(\beta, \xi) = \sum_i n_i W(\eta_i)\, x_i x_i^T, \qquad \eta_i = \beta^T x_i.$$
It is easy to show that the models considered in both Examples 1 and 2 have an information matrix of this form, with $W(\eta) \equiv w(\eta)$ and $W(\eta) \equiv \{2v(\eta) + (\mathrm{d}v/\mathrm{d}\eta)^2\}/\{2v(\eta)^2\}$, respectively. Given a best guess $\beta = \beta_0$, we have
$$I(\beta_0, \xi) = \sum_i n_i W(\eta_{0i})\, x_i x_i^T, \qquad \eta_{0i} = \beta_0^T x_i. \qquad (3)$$
For convenience, we will consider the continuous setting, in which a design $\xi$ is regarded as a probability measure on the design region $X$ with weights $p_i$ at $x_i$.
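For concreteness, the weight functions named in Examples 1 and 2 can be written down directly. The sketch below is our own illustration, not code from the paper: it collects, in plain Python/NumPy, the Poisson and logistic weights of Example 1, the complementary log–log weight used later in Example 5, and the heteroscedastic-normal weight obtained from the formula $W(\eta)=\{2v(\eta)+v'(\eta)^2\}/\{2v(\eta)^2\}$ given above. All function names are ours.

```python
import numpy as np

# Weight functions W(eta); our transcription of the formulas in the text.

def W_poisson(eta):
    """Log-linear Poisson regression: w(eta) = exp(eta)."""
    return np.exp(eta)

def W_logistic(eta):
    """Quantal response, logistic tolerance distribution:
    w(eta) = f(eta)^2 / [F(eta){1 - F(eta)}] = exp(eta) / (1 + exp(eta))^2."""
    return np.exp(eta) / (1.0 + np.exp(eta)) ** 2

def W_cloglog(eta):
    """Complementary log-log link (used in Example 5):
    w(eta) = exp(2*eta - exp(eta)) / [1 - exp(-exp(eta))]."""
    return np.exp(2.0 * eta - np.exp(eta)) / (1.0 - np.exp(-np.exp(eta)))

def W_hetero_normal(eta, v, dv):
    """Heteroscedastic normal model Y ~ N(eta, v(eta)):
    W(eta) = {2 v(eta) + v'(eta)^2} / {2 v(eta)^2}."""
    return (2.0 * v(eta) + dv(eta) ** 2) / (2.0 * v(eta) ** 2)

# Example 2 with v(eta) = exp(-eta):
W_exp_var = lambda eta: W_hetero_normal(eta,
                                         lambda t: np.exp(-t),
                                         lambda t: -np.exp(-t))
```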
We thus replace Eq. (3) by the normalized information matrix
$$M(\beta_0, \xi) = \sum_i p_i W(\eta_{0i})\, x_i x_i^T, \qquad (4)$$
in which the $p_i$ are non-negative real numbers such that $\sum_i p_i = 1$. Thus, the locally D-optimal design maximizes $\log\det M(\beta_0, \xi)$ with respect to $x_i \in X$ and $p_i$, subject to $p_i \geq 0$ and $\sum_i p_i = 1$. Note that $M(\beta_0, \xi)$ will depend on $\beta_0$ only via the weight function $W(\cdot)$.
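The locally D-optimal criterion is straightforward to evaluate for any candidate design. The following sketch is our illustration only (the function and variable names are ours, not the paper's): it assembles the normalized information matrix (4) for given support points and weights and returns $\log\det M(\beta_0, \xi)$, the objective maximized by the designs of Section 2.2.

```python
import numpy as np

def normalized_info_matrix(X_support, p, beta0, W):
    """Normalized information matrix (4).

    X_support : (m, k+1) array, rows are design points (1, x_1, ..., x_k).
    p         : (m,) array of non-negative weights summing to one.
    beta0     : (k+1,) best guess of the parameters.
    W         : weight function W(eta).
    """
    eta0 = X_support @ beta0                      # eta_{0i} = beta_0^T x_i
    return (X_support * (p * W(eta0))[:, None]).T @ X_support

def log_det_criterion(X_support, p, beta0, W):
    """Locally D-optimal objective: log det M(beta0, xi)."""
    sign, logdet = np.linalg.slogdet(normalized_info_matrix(X_support, p, beta0, W))
    return logdet if sign > 0 else -np.inf
```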
2.2. D-optimal design

We first introduce some notation. Let $u^*$ be the unique solution (if it exists) of the equation
$$W'(\beta_{00} + u)\,u + 2W(\beta_{00} + u) = 0, \qquad (5)$$
where $W'(\beta_{00}+u)$ denotes the derivative $\mathrm{d}W(\beta_{00}+u)/\mathrm{d}u$. Let $\eta^*_a < \eta^*_b$ be the solutions of
$$W'(\eta_a)(\eta_b - \eta_a) - 2W(\eta_a) = 0, \qquad W'(\eta_b)(\eta_b - \eta_a) + 2W(\eta_b) = 0. \qquad (6)$$
Note that $\eta^*_a$ and $\eta^*_b$ are the support points of the D-optimal design in the one-dimensional case when the design region is unbounded (Ford et al., 1992).

Theorem 1. Consider a model which has an information matrix (4), with $\eta_0 = \beta_{00} + \sum_i \beta_{0i} x_i$, $x_i \in (-\infty, 0]$ and $\beta_{0i} > 0$, $i = 1, \ldots, k$. If $W(\beta_{00} + u)$ is
(i) either increasing in $u$, or has a unique maximum, $W(\beta_{00}) \geq W(\beta_{00} + u^*)$ and $\beta_{00} \leq \eta^*_b$;
(ii) log-concave, i.e. $W''(\beta_{00}+u)W(\beta_{00}+u) - W'(\beta_{00}+u)^2 \leq 0$;
(iii) $W(\eta_0)^{1/2}\eta_0$ is bounded for all $\eta_0 \in (-\infty, \beta_{00}]$;
then the design $\xi^*$ with the same number of observations at the $k+1$ points
$$x_0^* = (1, 0, \ldots, 0)^T, \quad x_1^* = (1, u^*/\beta_{01}, \ldots, 0)^T, \quad \ldots, \quad x_k^* = (1, 0, \ldots, u^*/\beta_{0k})^T,$$
with $u^*$ the unique solution of Eq. (5), is D-optimal.

Proof. See the appendix.

Let now $X$ be a bounded region, and without loss of generality assume that $x_i \in [-1, 0]$, $i = 1, \ldots, k$. Clearly, if the support points of the design $\xi^*$ given in Theorem 1 are in $X$, $\xi^*$ is still D-optimal. This is the message of the next corollary.
Corollary 1. Under the same hypotheses of Theorem 1, if $x_i \in [-1, 0]$,
$$\frac{W'\!\left(\beta_{00} - \sum_{i=1}^k \beta_{0i}\right)}{W\!\left(\beta_{00} - \sum_{i=1}^k \beta_{0i}\right)} \geq \frac{2}{\sum_{i=1}^k \beta_{0i}} \qquad (7)$$
and
$$-u^* \leq \min_i \beta_{0i}, \qquad (8)$$
where $u^*$ is the solution of Eq. (5), then the design $\xi^*$ is still D-optimal.

Proof. Condition (7) is necessary in order that Eq. (5) has a solution. Condition (8) ensures that the support points are in $X$.

Remark 1. If we suppose that $\beta_{0i} < 0$, $i = 1, \ldots, k$, then Theorem 1 holds if we replace (i)–(iii) by the hypotheses: $W(\beta_{00} + u)$ is (i) either decreasing, or has a unique maximum, $W(\beta_{00}) \geq W(\beta_{00} + u^*)$ and $\beta_{00} \geq \eta^*_a$; (ii) log-concave; (iii) $W(\eta_0)^{1/2}\eta_0$ is bounded for all $\eta_0 \in [\beta_{00}, +\infty)$.

Remark 2. Theorem 1 essentially states that locally D-optimal designs for model (2) are balanced minimal designs, since they have the same number of observations at each treatment and have support on a minimal number of points. The support points of the design are such that $\eta_0$ equals $u^* + \beta_{00}$ at each of them, and they lie on the boundary of the design region. Thus, a one-at-a-time experiment is the optimal strategy. A similar result was found by Burridge and Sebastiani (1992) when the information matrix is of the form (4), the weight function is $\eta_0^{-k}$, $0 \leq k \leq 2$, and the linear predictor is $\eta_0 = \beta_{00} + \beta_{01}x_1 + \beta_{02}x_2$. This result was further extended by Burridge and Sebastiani (1994) to generalized linear models with weight function $\eta_0^{-2}$, when the number of covariates is arbitrary. The D-optimal design we have found can be seen as a direct generalization of the one-dimensional case when the design region is bounded at one end (Ford et al., 1992; Sebastiani and Settimi, 1997). A rather surprising result is that the numerical complexity of the problem reduces to the solution of Eq. (5), which is in one unknown variable. Thus, increasing the number of covariates does not affect the complexity of the design problem. Sitter and Torsney (1995) propose a geometrical technique to find D-optimal designs in multi-dimensional spaces for generalized linear models when the design space is bounded. They obtain explicit D-optimal designs only for some simple bounded spaces; however, the designs may not have the smallest support. For instance, for the logistic model in three parameters they find D-optimal designs with 4 support points. The result in Corollary 1, although it is restricted to particular hypotheses on the model and on the design region, gives minimal D-optimal designs in closed form, which are easy to determine.
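Remark 2 reduces the whole design problem to solving Eq. (5) for a single unknown. As a hedged illustration of how little computation this requires, the sketch below (ours, not the authors' code; Python with SciPy) finds $u^*$ by bracketed root finding and assembles the balanced design of Theorem 1. The bracketing interval, the function names and the logistic example with $\beta_0 = (0, 1, 2)^T$ are our own choices and must be adapted to the weight function at hand.

```python
import numpy as np
from scipy.optimize import brentq

def solve_u_star(W, dW, beta00, lo=-50.0, hi=-1e-8):
    """Solve Eq. (5): W'(beta00 + u)*u + 2*W(beta00 + u) = 0 for u < 0.
    The bracket [lo, hi] is an assumption; it must contain a sign change."""
    g = lambda u: dW(beta00 + u) * u + 2.0 * W(beta00 + u)
    return brentq(g, lo, hi)

def theorem1_design(beta0, W, dW):
    """Support points of the balanced design xi* of Theorem 1
    (equal weight 1/(k+1) at each point)."""
    beta00, slopes = beta0[0], np.asarray(beta0[1:])
    u_star = solve_u_star(W, dW, beta00)
    k = len(slopes)
    points = [np.r_[1.0, np.zeros(k)]]            # x0* = (1, 0, ..., 0)
    for i in range(k):
        xi = np.zeros(k)
        xi[i] = u_star / slopes[i]                # xi* = (1, 0,.., u*/b0i,.., 0)
        points.append(np.r_[1.0, xi])
    return u_star, np.array(points)

# Logistic weight W(eta) = e^eta / (1 + e^eta)^2 and its derivative:
W  = lambda eta: np.exp(eta) / (1.0 + np.exp(eta)) ** 2
dW = lambda eta: W(eta) * (1.0 - np.exp(eta)) / (1.0 + np.exp(eta))
u_star, pts = theorem1_design(np.array([0.0, 1.0, 2.0]), W, dW)
# For beta00 = 0 this gives u* roughly -2.4, in line with Example 5's
# observation that u* < -2 for the logistic model.
```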
Some examples of optimal designs for models of special interest are given in the next section.

2.3. Examples

Example 3 (Non-linear normal model). Let $Y\,|\,\beta \sim N(\mu(\eta), \sigma^2)$, and $\mu(\eta) = \exp(\eta)$. Then $W(\eta) = \exp(2\eta)$, so that $W(\beta_{00}+u)$ is an increasing function of $u$, and log-concave. Eq. (5) is $u + 1 = 0$, whose solution is $u^* = -1$. We then have $x_i^* = (1, 0, \ldots, -1/\beta_{0i}, \ldots, 0)^T$, $i = 1, \ldots, k$. Thus, the optimal design puts $n/(k+1)$ observations at the corner point of the design region $X$, $x_0 = (1, 0, \ldots, 0)^T$, where $\mu(\eta)$ has its maximum value. The other points depend on the $\beta_{0i}$'s. As the $\beta_{0i}$'s increase, the $x_i^*$'s approach the corner point of the design region, and thus the optimal design tends to a one-point design.

Example 4 (Poisson regression). Suppose that $Y\,|\,\beta$ has a Poisson distribution and assume a log-link function; then $w(\eta) = \exp(\eta)$, so that the hypotheses of Theorem 1 are satisfied. Thus, Eq. (5) is $u + 2 = 0$, leading to $u^* = -2$. If $k = 1$, we have the same result found by Ford et al. (1992).

Example 5 (Binary data). With binary data, a link function $F^{-1}(\mu) = \eta$ yields $w(\eta) = f(\eta)^2/[F(\eta)\{1 - F(\eta)\}]$, where $F(\cdot)$ is a tolerance distribution and $f(\eta)$ is the corresponding density function. For logistic regression, we have $w(\eta) = \exp(\eta)/\{1 + \exp(\eta)\}^2$. Thus, $W(\beta_{00}+u)$ is log-concave, and it satisfies hypothesis (i) of Theorem 1 with $\eta^*_b = -\eta^*_a$ the solutions of $\exp(\eta) - (\eta+1)/(\eta-1) = 0$, so that $\eta^*_b \approx 1.5434$. Thus, the design given in Theorem 1 is optimal if $\beta_{00} \leq \eta^*_b$; that is, according to our best guess, the probability of response at the highest stimulus is smaller than about 0.82. Eq. (5) reduces to $\exp(\beta_{00}+u) - (u+2)/(u-2) = 0$, so that a numerical solution is needed for a given $\beta_{00}$. We note that this result generalizes that found by Sebastiani and Settimi (1997), which is limited to one-dimensional logistic regression. In the one-dimensional case, if $\beta_{00} > \eta^*_b$ then the design which puts the same weight at the points $(\pm\eta^*_b - \beta_{00})/\beta_{01}$ is D-optimal. It is an open problem to check whether this result extends to more general models.

For a bounded design region, condition (7) reduces to
$$\operatorname{logit} p_0 - \operatorname{logit} p_m > 2,$$
where $p_0$ is the probability of response at the highest treatment, $(1, 0, \ldots, 0)^T$, and $p_m$ is the probability of response at the lowest treatment, $(1, -1, \ldots, -1)^T$. We further note that the solution $u^*$ is such that $u^* < -2$. For large negative values of $\beta_{00}$, the solution $u^*$ of Eq. (5) approaches $-2$, so that condition (8) is equivalent to $\beta_{0i} > 2$ for $i = 1, \ldots, k$.

The complementary log–log link function leads to $w(\eta) = \exp\{2\eta - \exp(\eta)\}/[1 - \exp\{-\exp(\eta)\}]$. This is log-concave, and (i) in Theorem 1 holds for $\eta^*_b \approx 0.980$ and
$\eta^*_a \approx -1.338$, which yields $u^* \geq -2.3173$; that is, we limit consideration to probabilities of response smaller than about 0.930.

For illustration we consider the following numerical case. Suppose that we wish to design an experiment for estimating the effects of age and quantity of salt consumption on the probability of presence of high blood pressure. The response variable is binary, 'high blood pressure' or 'no high blood pressure'. Both covariates are likely to be positively correlated with the probability of presence of high blood pressure. We suppose that previous experiments suggest that the complementary log–log link function gives an appropriate model, and we also parameterize the linear predictor by taking the logarithm of the covariates, so that a further reparameterization leads to the model defined in Eq. (2). Suppose that the prior knowledge is $\beta_{00} = 0.3266$, $\beta_{01} = 0.5$ and $\beta_{02} = 0.8$, where $\beta_1$ and $\beta_2$ refer to age and quantity of salt consumption, respectively. Then $u^* \cong -2.1757$, so that the optimal design is a balanced three-point design with support at $(1, 0, 0)^T$, $(1, -4.3514, 0)^T$, $(1, 0, -2.7196)^T$.

3. Bayesian optimal designs

Suppose now that the experimenter can describe the information about the parameters, before the experiment, in a continuous prior distribution with density function $\pi(\beta)$, and that the aim of the experiment is to update this prior information in the posterior density $\pi(\beta\,|\,y, \xi)$. A measure of the gain of information that can be achieved from the experiment is the difference between the posterior and the prior Shannon information, that is
$$\int_B \pi(\beta\,|\,y,\xi)\log\pi(\beta\,|\,y,\xi)\,\mathrm{d}\beta - \int_B \pi(\beta)\log\pi(\beta)\,\mathrm{d}\beta. \qquad (9)$$
A Bayesian design criterion is then to choose the experiment which yields the maximum expected gain of information, the expectation being over the marginal distribution of $Y$ (Lindley, 1956). Excluding exceptional cases (Sebastiani and Wynn, 1997) in which exact solutions in closed form are found without need for approximations, the expected utility is typically approximated by using a normal approximation to the posterior distribution. In Berger (1985) several such approximations are presented, and a comprehensive discussion is given in Chaloner and Verdinelli (1995). We will limit attention to the most widely used approximation, in which the posterior distribution of $\beta$ is approximated as
$$\beta\,|\,y, \xi \sim N(\hat\beta,\ I(\hat\beta, \xi)^{-1}),$$
where $\hat\beta$ is the maximum likelihood estimate of $\beta$, and $I(\hat\beta, \xi)$ is the observed Fisher information matrix, that is, minus the Hessian matrix of the log-likelihood function evaluated at $\hat\beta$. Thus, if the prior distribution of $\beta$ is independent of the experiment, asymptotically the expected gain of information is (up to a constant)
$$\int \log\det I(\hat\beta, \xi)\, p(y\,|\,\xi)\,\mathrm{d}y.$$
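The passage from the normal approximation to this integral rests on a standard entropy calculation which the paper leaves implicit; our reconstruction, under the stated approximation, is as follows. The Shannon information of a $d$-dimensional normal $N(m, \Sigma)$ is $\int \phi\log\phi = -\tfrac{1}{2}\log\{(2\pi e)^d \det\Sigma\}$, so with $\Sigma = I(\hat\beta,\xi)^{-1}$ and $d = k+1$,
$$\int \pi(\beta\,|\,y,\xi)\log\pi(\beta\,|\,y,\xi)\,\mathrm{d}\beta \approx -\tfrac{1}{2}(k+1)\log(2\pi e) + \tfrac{1}{2}\log\det I(\hat\beta, \xi),$$
and, since the prior term in (9) does not depend on $\xi$, averaging over $Y$ gives the above criterion up to additive and multiplicative constants.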
A further approximation yields the design criterion
$$E\{\log\det I(\beta, \xi)\} = \int \log\det I(\beta, \xi)\,\pi(\beta)\,\mathrm{d}\beta, \qquad (10)$$
where $I(\beta, \xi)$ is the expected Fisher information matrix (Chaloner and Verdinelli, 1995). A design which maximizes Eq. (10) is usually defined as Bayesian D-optimal. More refined approximations take into account the prior precision $R$ of the parameters, and the posterior variance is approximately $(R + I(\hat\beta, \xi))^{-1}$. If the sample size allowed for the experiment is large, the contribution of the prior precision to the posterior variance, and hence to the design criterion, is usually negligible.

Maximizing (10) w.r.t. $\xi$ involves an integration step, which usually cannot be solved in closed form. Thus, the objective function is approximated by numerical integration routines and the optimization is then performed numerically. Examples are given, for instance, in Chaloner and Larntz (1989).

It is easily seen that (1) is a first-order approximation of Eq. (10) when the best guess $\beta_0$ is the prior mean of $\beta$, and the accuracy of the approximation increases as the prior precision on $\beta$ becomes large. Thus, locally D-optimal designs can be regarded as an approximation of Bayesian ones. The advantage of this re-interpretation is that the designs in Theorem 1 are Bayesian optimal when the prior precision is large. When, on the other hand, the prior precision is small, locally optimal designs can help the search for Bayesian optimal designs. In the next sections we will study the efficiency of this approximation in two examples.

3.1. Robustness issues

The design $\xi^*$ found in Theorem 1 optimizes a first-order approximation of the asymptotic Bayesian criterion. Clearly, the approximation is efficient as long as the prior precision on the parameters is large. A tool to verify the optimality of the approximate solution is the directional derivative of the Bayesian optimality criterion at $\xi^*$ in the direction of the extreme points of the design region, i.e.
$$F(\xi^*, x) = E\{W(\eta)\,x^T M(\xi^*, \beta)^{-1} x\} - (k+1), \qquad (11)$$
the expectation being over the prior distribution for $\beta$. The design $\xi^*$ is optimal in $X$ if $F(\xi^*, x) \leq 0$ for all $x \in X$ (Chaloner and Verdinelli, 1995). In practice, we can evaluate Eq. (11) to check whether the design $\xi^*$ found in Theorem 1 is Bayesian optimal in a given design region $X$, or to find the largest region in which $\xi^*$ is Bayesian optimal. If $\xi^*$ is not Bayesian optimal, because the prior precision is not large enough for the first-order approximation to be accurate, we expect the Bayesian D-optimal design to no longer be of minimal support. This is noted, for instance, by Chaloner and Larntz (1989) and by Chaloner and Verdinelli (1995): as the prior uncertainty increases, the support of the optimal design gets larger. If this is the case, $F(\xi^*, x)$ is positive in some subsets of $X$. Examining the behavior of $F(\xi^*, x)$ can then suggest candidate support points for the optimal design.
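To make the check concrete, the following sketch (ours, not the authors' MATLAB code) evaluates $F(\xi^*, x)$ of Eq. (11) on a grid of design points, using a plain Monte Carlo average over a normal prior in place of the quadrature formulas used in the paper. The function name, the sample size, the seed, and the illustrative choice $\tau = 2$ for the first-order decay model are all our assumptions.

```python
import numpy as np

def directional_derivative(x_grid, support, beta0, tau, W, n_mc=5000, seed=0):
    """Monte Carlo approximation of F(xi*, x) in Eq. (11) for the balanced
    design xi* with equal weights on the given support points.
    Prior: beta ~ N(beta0, I / tau^2). n_mc and seed are our choices."""
    rng = np.random.default_rng(seed)
    k1 = len(beta0)                                   # k + 1
    p = np.full(len(support), 1.0 / len(support))
    betas = beta0 + rng.standard_normal((n_mc, k1)) / tau
    vals = np.zeros(len(x_grid))
    for b in betas:
        eta_s = support @ b
        M = (support * (p * W(eta_s))[:, None]).T @ support
        Minv = np.linalg.inv(M)
        eta_x = x_grid @ b
        vals += W(eta_x) * np.einsum('ij,jk,ik->i', x_grid, Minv, x_grid)
    return vals / n_mc - k1

# First-order decay model of Example 6: W(eta) = exp(2*eta), beta0 = (1, 2).
W = lambda eta: np.exp(2.0 * eta)
support = np.array([[1.0, 0.0], [1.0, -0.5]])
x_grid = np.c_[np.ones(101), np.linspace(-1.0, 0.0, 101)]
F = directional_derivative(x_grid, support, np.array([1.0, 2.0]), tau=2.0, W=W)
print("max F over X:", F.max())   # a non-positive value signals xi* passes the check
```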
Once the Bayesian design, say $\xi^*_B$, has been found, we can also measure the efficiency of $\xi^*$ relative to $\xi^*_B$ using
$$e(\xi^*, \xi^*_B) = \exp\left(\frac{E\{\log\det M(\xi^*, \beta)\} - E\{\log\det M(\xi^*_B, \beta)\}}{k+1}\right). \qquad (12)$$
A rationale behind Eq. (12) is the following: the quantity
$$\frac{E\{\log\det M(\xi^*, \beta)\} - E\{\log\det M(\xi^*_B, \beta)\}}{k+1}$$
is the average amount of Shannon information which is not gained by using the design $\xi^*$ instead of the optimal one $\xi^*_B$. Using an exponential transformation yields a criterion in $(0, 1]$, and $e(\xi^*, \xi^*_B) = 1$ only if the design $\xi^*$ is Bayesian optimal. A similar measure of efficiency is used by Clyde (1995).

We will now show some examples in which the prior distribution is $\beta \sim N(\beta_0, I_{k+1}/\tau^2)$, $\tau^2$ being the prior precision. Thus, with $f$ denoting the density function of $\beta$, we have
$$F(\xi^*, x) = \int_{\mathbb{R}^{k+1}} W(\eta)\,x^T M(\xi^*, \beta)^{-1}x\, f(\beta)\,\mathrm{d}\beta - (k+1).$$
The integral can be approximated numerically over a grid of points, using quadrature formulas (Abramowitz and Stegun, 1964). This will then yield a function of $\tau$, $\beta_0$ and $x$ which can be studied.

The computation of the Bayesian optimal designs reported in the examples below was performed using numerical procedures. The Bayesian optimality criterion (10) is discretized using a quadrature formula, yielding a function of the design only. The design is then found using a numerical optimization technique which takes into account the constraints on the weights and the support points. We used the function constr in the Optimization Toolbox of the package MATLAB 3.5. (The codes of our MATLAB functions and the script files used in the examples can be found at http://www.city.ac.uk/~sn303/optimal-designs.) The approach we have followed is this: we start with an initial balanced design on $s$ support points which are guessed from the shape of the directional derivative of the locally D-optimal design in Theorem 1. We then increase the number of points of the initial design until the numerical procedure produces a design having support points with null weight or which appear more than once. A final check of the optimality of the design found is done by computing the directional derivative.

Example 6. Consider the first-order decay model of Example 3 with one covariate $x$ varying in $X = [-1, 0]$. If $\beta_0 = (1, 2)^T$, the D-optimal design found in Theorem 1 puts the same number of observations at the points
$$x_0^* = (1, 0)^T, \qquad x_1^* = (1, -0.5)^T.$$
The directional derivative $F(\xi^*, x)$, shown in Fig. 1, is negative for all $x \in X$ when the prior precision is greater than 1, but assumes positive values when $\tau \leq 1$, increasing remarkably as $x$ approaches $-1$. Hence, the design $\xi^*$ is Bayesian optimal only when the prior precision is fairly large. For $\tau^{-1} \geq 1$ the plot of the directional derivative suggests taking a support point at the left extreme of the design region.
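The efficiency comparison reported in Table 1 below follows directly from Eq. (12). The sketch that follows is our own illustration (Python, with Monte Carlo averaging in place of the paper's quadrature; the function names, sample size and seed are arbitrary choices of ours), written for the first-order decay model of Example 6 with the competing design taken from Table 1; its output should be compared with, not assumed equal to, the tabulated efficiency.

```python
import numpy as np

def expected_log_det(support, p, beta0, tau, W, n_mc=10000, seed=1):
    """Monte Carlo estimate of E{log det M(xi, beta)} under beta ~ N(beta0, I/tau^2)."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for b in beta0 + rng.standard_normal((n_mc, len(beta0))) / tau:
        eta = support @ b
        M = (support * (p * W(eta))[:, None]).T @ support
        total += np.linalg.slogdet(M)[1]
    return total / n_mc

def efficiency(support1, p1, support2, p2, beta0, tau, W):
    """Efficiency e(xi1, xi2) of Eq. (12), with xi2 playing the role of xi*_B."""
    d = expected_log_det(support1, p1, beta0, tau, W) - \
        expected_log_det(support2, p2, beta0, tau, W)
    return np.exp(d / len(beta0))

# Example 6: W(eta) = exp(2*eta), beta0 = (1, 2), tau^{-1} = 1.5.
W = lambda eta: np.exp(2.0 * eta)
xi_star  = (np.array([[1.0, 0.0], [1.0, -0.5]]), np.array([0.5, 0.5]))
xi_bayes = (np.array([[1.0, 0.0], [1.0, -0.3709], [1.0, -1.0]]),
            np.array([0.4713, 0.3771, 0.1516]))        # design from Table 1
print(efficiency(*xi_star, *xi_bayes, np.array([1.0, 2.0]), tau=1/1.5, W=W))
```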
Fig. 1. Directional derivative $F(\xi^*, x)$.

Table 1. Efficiency of the design $\xi^*$ with respect to the design $\xi^*_B$

  $\tau^{-1}$   Support of $\xi^*_B$                                              $e(\xi^*, \xi^*_B)$
  1             $x$: $(1, 0)^T$       $(1, -0.4956)^T$      $(1, -1)^T$           0.9986
                $p$: 0.4992           0.4943                0.0066
  1.5           $x$: $(1, 0)^T$       $(1, -0.3709)^T$      $(1, -1)^T$           0.9440
                $p$: 0.4713           0.3771                0.1516
Table 1 gives the Bayesian optimal designs for $\tau^{-1} = 1$ and $\tau^{-1} = 1.5$. Indeed, they put some observations at the point $x = -1$, in accordance with the shape of the directional derivative $F(\xi^*, x)$. As the precision decreases, the support of the optimal design moves appreciably away from the D-optimal one. The values of the efficiency $e(\xi^*, \xi^*_B)$, displayed in Table 1, show that the D-optimal design is still a good approximation for $\tau = 1$ and $\tau = 2/3$.

Example 7. We now consider a Poisson regression model with two covariates. We suppose $\beta_0 = (1, 2, 4)^T$. Hence, the D-optimal design $\xi^*$ puts the same number of observations at the points
$$x_0^* = (1, 0, 0)^T, \qquad x_1^* = (1, -1, 0)^T, \qquad x_2^* = (1, 0, -0.5)^T.$$
For $\tau^{-1} = 0.1$ the directional derivative of the D-optimal design (Fig. 2) is always negative in the design region. When $\tau^{-1} = 1.5$, $F(\xi^*, x)$ is positive for points $x$ close to the border $[-1, 0] \times \{0\}$ of $X$ (Fig. 3). Hence, the shape of the directional derivative for $\tau^{-1} = 1.5$ suggests that points on that side of the design region, that is, with $x_2 = 0$, might be informative.
Fig. 2. Directional derivative $F(\xi^*, x)$ for $\tau^{-1} = 0.1$.
Fig. 3. Directional derivative $F(\xi^*, x)$ for $\tau^{-1} = 1.5$.
Indeed, the Bayesian D-optimal design found numerically is

  $x$: $(1, 0, 0)^T$   $(1, 0, -0.5)^T$   $(1, -0.6928, 0)^T$   $(1, -1, 0)^T$
  $p$: 0.3321          0.3333             0.0337                0.3009
The value of the efficiency measure, $e(\xi^*, \xi^*_B) = 0.9996$, indicates that the locally D-optimal design can be regarded as a good approximation.

Examples 6 and 7 show how locally D-optimal designs can provide an efficient approximation of Bayesian D-optimal designs. For simplicity, we have limited attention to prior distributions $\beta \sim N(\beta_0, I_{k+1}/\tau^2)$ and to models that depend on one or two covariates. A direct investigation of more complex cases, in which parameters are a priori correlated or several covariates affect the response variable, is needed to evaluate the general accuracy of the approximation.
4. Conclusions

In this paper we have found locally D-optimal designs for a large class of non-linear models widely used in practice. The designs found are balanced minimal designs with support on the borders of the design region, and the numerical complexity is reduced to the solution of an equation in one unknown, independently of the number of covariates in the model.

We have also shown that locally D-optimal designs are a first-order approximation of Bayesian D-optimal ones. This argument has practical and computational relevance because, for non-linear problems, Bayesian D-optimal designs can only be determined as solutions of numerical procedures that are computationally demanding. A by-product of our results is to provide optimal Bayesian designs in closed form when the prior precision is large. The accuracy of the approximation depends on the precision of the prior distribution: as the latter decreases, the Bayesian D-optimal design departs considerably from the locally optimal one. Even in this case, locally D-optimal designs can be a starting point for finding Bayesian D-optimal ones. Examining the behavior of the directional derivative of the locally D-optimal design can also be useful to identify candidate support points of the optimal design.
Acknowledgements

The first author acknowledges support of the Nuffield Foundation, grant # SCI/180/95/493/G. The work was completed while the second author was visiting City University, with support from the University of Perugia (Italy). Special thanks go to the referee who helped improve a first version of this paper.
Appendix. Proof of Theorem 1

Theorem 1. Consider a model which has an information matrix (4), with $\eta_0 = \beta_{00} + \sum_i \beta_{0i} x_i$, $x_i \in (-\infty, 0]$ and $\beta_{0i} > 0$, $i = 1, \ldots, k$. If $W(\beta_{00} + u)$ is
(i) either increasing in $u$, or has a unique maximum, $W(\beta_{00}) \geq W(\beta_{00} + u^*)$ and $\beta_{00} \leq \eta^*_b$;
(ii) log-concave, i.e. $W''(\beta_{00}+u)W(\beta_{00}+u) - W'(\beta_{00}+u)^2 \leq 0$;
(iii) $W(\eta_0)^{1/2}\eta_0$ is bounded for all $\eta_0 \in (-\infty, \beta_{00}]$;
then the design $\xi^*$ with the same number of observations at the $k+1$ points $x_0^* = (1, 0, \ldots, 0)^T$, $x_1^* = (1, u^*/\beta_{01}, \ldots, 0)^T$, $\ldots$, $x_k^* = (1, 0, \ldots, u^*/\beta_{0k})^T$, with $u^*$ the unique solution of Eq. (5), is D-optimal.

Proof. We will prove that the design $\xi^*$ is D-optimal by showing that the sufficient condition of the General Equivalence Theorem for D-optimality (Silvey, 1980, p. 23) is satisfied. We will thus show that
$$\sup_{x \in X}\ W(\eta_0)\,x^T M(\beta_0, \xi^*)^{-1} x - (k+1) = 0, \qquad (13)$$
where $M(\beta_0, \xi^*)$ is the information matrix of the design $\xi^*$ given $\beta = \beta_0$.

Consider the set of designs $\Xi_u = \{\xi(u);\ u < 0\}$, which depend on $u$ and put equal weight $1/(k+1)$ at the $k+1$ points
$$t_0 = W(\beta_{00})^{1/2}(1, 0, 0, \ldots, 0)^T,$$
$$t_1(u) = W(\beta_{00}+u)^{1/2}(1, u/\beta_{01}, 0, \ldots, 0)^T,$$
$$\vdots$$
$$t_k(u) = W(\beta_{00}+u)^{1/2}(1, 0, 0, \ldots, u/\beta_{0k})^T, \qquad u \in (-\infty, 0),$$
and let $\mathcal{M}_u = \{M\{\beta_0, \xi(u)\};\ u < 0\}$ denote the corresponding set of information matrices. Put $W(\beta_{00}) = W_0$ and $W(\beta_{00}+u) = W_1$, and define
$$z = (u/\beta_{01}, \ldots, u/\beta_{0k})^T, \qquad D_u = \mathrm{diag}(u/\beta_{01}, \ldots, u/\beta_{0k}).$$
For fixed $u$ the information matrix of the design $\xi(u)$ is easily seen to be
$$(k+1)M\{\beta_0, \xi(u)\} = W_1\begin{pmatrix} W_0/W_1 + k & z^T \\ z & D_u^2 \end{pmatrix},$$
and hence
$$\det M\{\beta_0, \xi(u)\} \propto \frac{W_0 W_1^k u^{2k}}{\prod_i \beta_{0i}^2}.$$
We now find the design $\xi(u^*)$ which is D-optimal in the set $\Xi_u$, and then we will show that $\xi(u^*)$ is globally D-optimal. By differentiating w.r.t. the unknown $u$ we have
$$\frac{\mathrm{d}}{\mathrm{d}u}\det M\{\beta_0, \xi(u)\} \propto uW'(\beta_{00}+u) + 2W(\beta_{00}+u) = 0,$$
which is Eq. (5). Provided that $W$ is log-concave, this equation has a unique solution $u^* < 0$. In fact, the possible solutions of Eq. (5) are the points of intersection of the functions $W'(\beta_{00}+u)/W(\beta_{00}+u)$ and $-2/u$. The latter is a positive increasing function of $u$ for $u \in (-\infty, 0]$. The former is a decreasing function of $u$ by (ii), and is always positive on $(-\infty, 0]$ if $W(\cdot)$ is increasing. If instead $W(\cdot)$ has a unique maximum, say at $u_M < 0$, then $W'(\beta_{00}+u)/W(\beta_{00}+u)$ is positive for $u < u_M$ and negative for $u > u_M$. Thus, there exists a unique point of intersection of the two functions, and $u^* < u_M$.

Let $\xi^* = \xi(u^*)$ be the design which puts weight $1/(k+1)$ at the points $t_0^* = t_0$ and $t_i^* = t_i(u^*)$, $i = 1, \ldots, k$. Clearly $\xi(u^*)$ is D-optimal in $\Xi_u$, and Eq. (13) holds at the support points of the design. We now need to show that Eq. (13) holds for any $x \in X$.

Hypothesis (iii) implies that for any design point $x$ the corresponding vector $W(\eta_0)^{1/2}x$ can be written as the convex combination $\sum_{i=1}^k \lambda_i\, t_i(\eta_0 - \beta_{00})$, with $\lambda_i = \beta_{0i}x_i/(\eta_0 - \beta_{00})$, $\lambda_i \geq 0$ and $\sum_{i=1}^k \lambda_i = 1$. By convexity, it is then sufficient to prove that Eq. (13) holds for $t_1(u)$, $u < 0$. Let $W_1^*$ be $W(\beta_{00} + u^*)$. Simple algebra leads to
$$\{(k+1)M(\beta_0, \xi^*)\}^{-1} = \frac{1}{W_0}\begin{pmatrix} 1 & -\mathbf{1}_k^T (D_{u^*})^{-1} \\ -(D_{u^*})^{-1}\mathbf{1}_k & (D_{u^*})^{-1}(W_0 I_k + W_1^* J_k)(D_{u^*})^{-1}/W_1^* \end{pmatrix},$$
where $D_{u^*} = \mathrm{diag}(u^*/\beta_{01}, \ldots, u^*/\beta_{0k})$. Hence Eq. (13) holds if and only if
$$f(u) = (W_0 + W_1^*)u^2 - 2uu^*W_1^* + (u^*)^2 W_1^* - (u^*)^2 W_0 W_1^*/W_1 \leq 0.$$
Note first that the equation $f(u) = 0$ can have at most two solutions, which are the points of intersection of the functions $(W_0 + W_1^*)u^2 - 2uu^*W_1^* + (u^*)^2 W_1^*$ and $(u^*)^2 W_0 W_1^*/W_1$, the former being a convex function of $u$ with minimal turning point between $u^*$ and $0$, and the latter being convex decreasing by the log-concavity of $W$. It is easy to show that $f(0) = 0$ and $f(u^*) = 0$. We now consider the first derivative of $f(u)$, that is,
$$f'(u) = 2u(W_0 + W_1^*) - 2W_1^*u^* + (u^*)^2 W_0 W_1^* W_1'/W_1^2.$$
Let $W_1'^*$ and $W_1''^*$ denote the first and the second derivative of $W_1$ with respect to $u$ evaluated at $u^*$, and let $W_0'$ denote the first derivative of $W(\cdot)$ at $\beta_{00}$. We first show that $f'(0) \geq 0$. We have
$$f'(0) = -2W_1^*u^* + (u^*)^2 W_0' W_1^*/W_0,$$
and if $W(\cdot)$ is an increasing function then $f'(0) > 0$, because $-u^* > 0$ and $W_0' > 0$. Suppose now that $W(\beta_{00}+u)$ has the unique maximum $u_M$. We have that $W_1^* > 0$
and hence $f'(0) \geq 0$ if and only if $W'(\beta_{00})/W(\beta_{00}) \geq 2/u^*$. Since $u^*$ is the solution of Eq. (5) we have $2/u^* = -W'(\beta_{00}+u^*)/W(\beta_{00}+u^*)$, so that $f'(0) \geq 0$ if and only if
$$h(\beta_{00}) = \frac{W'(\beta_{00})}{W(\beta_{00})} + \frac{W'(\beta_{00}+u^*)}{W(\beta_{00}+u^*)} \geq 0.$$
By letting $\eta$ be $\beta_{00} + u$, Eq. (5) is
$$W'(\eta)(\eta - \beta_{00}) + 2W(\eta) = 0. \qquad (14)$$
Note that if $\beta_{00} = \eta^*_b$, the solution $\eta^*$ of (14) coincides with $\eta^*_a$. It can easily be shown, using Eq. (14) and the fact that $\mathrm{d}\eta^*/\mathrm{d}\beta_{00} = (\mathrm{d}\beta_{00}/\mathrm{d}\eta^*)^{-1} \geq 0$, that $\eta^*$ is an increasing function of $\beta_{00}$. We now show that $h(\beta_{00})$ is a non-increasing function of $\beta_{00}$ and that $h(\eta^*_b) = 0$, from which we conclude that $h(\beta_{00}) \geq 0$ for all $\beta_{00} \leq \eta^*_b$. This result then implies that $f'(0) \geq 0$ if $\beta_{00} \leq \eta^*_b$. The first derivative of $h(\cdot)$ is
$$h'(\beta_{00}) = \frac{W''(\beta_{00})W(\beta_{00}) - W'(\beta_{00})^2}{W(\beta_{00})^2} + \frac{W''(\eta^*)W(\eta^*) - W'(\eta^*)^2}{W(\eta^*)^2}\,\frac{\mathrm{d}\eta^*}{\mathrm{d}\beta_{00}},$$
which is non-positive for all $\beta_{00}$, since $W(\cdot)$ is log-concave and $\eta^*$ is an increasing function of $\beta_{00}$.

We now show that the point $u^*$ is a local maximum of $f(u)$, by proving that $f'(u^*) = 0$ and $f''(u^*) < 0$. Simple algebra leads to
$$f'(u^*) = 2u^*W_0 + (u^*)^2 W_0 W_1'^*/W_1^* = W_0 u^*\left[2 + u^* W_1'^*/W_1^*\right] = 0,$$
because $u^*$ is a solution of Eq. (5). Moreover,
$$f''(u^*) = 2(W_0 + W_1^*) + (u^*)^2 W_0 W_1^* (W_1''^* W_1^{*2} - 2W_1^* W_1'^{*2})/W_1^{*4}.$$
By Eq. (5) it follows that $(u^*)^2 = 4(W_1^*/W_1'^*)^2$, and thus
$$f''(u^*) = 2W_1^* - 6W_0 + 4W_0 W_1''^* W_1^*/W_1'^{*2},$$
and since $W_1^* < W_0$ it follows that $f''(u^*) < 4W_0(W_1''^* W_1^*/W_1'^{*2} - 1) < 0$, $W(\cdot)$ being log-concave. Hence, since the equation $f(u) = 0$ has only the two solutions $u = u^*$ and $u = 0$, $u^*$ is a local maximum of $f(u)$ and $f'(0) \geq 0$, we can conclude that $f(u) \leq 0$.

References

Abramowitz, M., Stegun, I.A., 1964. Handbook of Mathematical Functions. Dover, New York.
Atkinson, A.C., Donev, A.N., 1992. Optimum Experimental Design. Clarendon Press, Oxford.
Berger, J.O., 1985. Statistical Decision Theory and Bayesian Analysis, 2nd ed. Springer, New York (1st ed., 1980).
Burridge, J., Sebastiani, P., 1992. Optimal designs for generalized linear models. J. Ital. Statist. Soc. 1, 183–202.
Burridge, J., Sebastiani, P., 1994. D-optimal designs for generalized linear models with variance proportional to the square of the mean. Biometrika 81, 295–304.
Chaloner, K., Larntz, K., 1989. Optimal Bayesian design applied to logistic regression experiments. J. Statist. Plann. Inference 21, 191–208.
Chaloner, K., Verdinelli, I., 1995. Bayesian experimental design: a review. Statist. Sci. 10, 273–304.
Clyde, M.A., 1995. Bayesian designs for approximate normality. In: Kitsos, C.P., Muller, W.G. (Eds.), MODA4: Advances in Model Oriented Data Analysis, Spetses, Greece. Physica-Verlag, Heidelberg, pp. 25–35.
Davidian, M., Carroll, R.J., 1987. Variance function estimation. J. Amer. Statist. Assoc. 82, 1079–1090.
Ford, I., Torsney, B., Wu, C.F.J., 1992. The use of a canonical form in the construction of locally optimal designs for nonlinear problems. J. Roy. Statist. Soc. 54, 569–583.
Lindley, D.V., 1956. On a measure of information provided by an experiment. Ann. Math. Statist. 27, 986–1005.
Sebastiani, P., 1996. Discussion on papers by Atkinson and by Bates et al. J. Roy. Statist. Soc. 58, 96–97.
Sebastiani, P., Settimi, R., 1997. A note on D-optimal designs for a logistic regression model. J. Statist. Plann. Inference 59, 359–368.
Sebastiani, P., Wynn, H.P., 1997. Maximum entropy sampling and optimal Bayesian experimental design. J. Roy. Statist. Soc.
Silvey, S.D., 1980. Optimal Design. Chapman & Hall, London.
Sitter, R.R., Torsney, B., 1995. D-optimal designs for generalised linear models. In: Kitsos, C.P., Muller, W.G. (Eds.), MODA4: Advances in Model Oriented Data Analysis, Spetses, Greece. Physica-Verlag, Heidelberg, pp. 87–102.