Journal of Statistical Planning and Inference 144 (2014) 81–91
Elemental information matrices and optimal experimental design for generalized regression models

Anthony C. Atkinson (a), Valerii V. Fedorov (b,*), Agnes M. Herzberg (c), Rongmei Zhang (d)

(a) Department of Statistics, London School of Economics, London WC2A 2AE, UK
(b) Quintiles, 5927 South Miami Boulevard, Morrisville, NC 27560, USA
(c) Department of Mathematics and Statistics, Queen's University, Kingston, Ontario, Canada K7L 3N6
(d) Department of Biostatistics, Perelman School of Medicine, University of Pennsylvania, 423 Guardian Drive, Philadelphia, PA 19104, USA
Abstract
Available online 3 October 2012
The construction of optimal experimental designs for regression models requires knowledge of the information matrix of a single observation. The latter can be found if the elemental information matrix corresponding to the distribution of the response is known. We present tables of elemental information matrices for distributions that are often used in statistical work. The tables contain matrices for one- and two-parameter distributions. Additionally, we describe the multivariate normal and multinomial cases. The parameters of response distributions can themselves be parameterized to provide dependence on explanatory variables, thus leading to regression formulations for wide classes of models. We present essential results from optimal experimental design and illustrate our approach with a few examples, including bivariate binary responses and gamma regression.
Keywords: Adaptive design; Convex optimal design; Elemental information matrix; Equivalence theorem
1. Introduction

We are concerned with optimal experimental design for a rather general class of regression models. The observations need not be normally distributed, but might, for example, follow a beta or binomial distribution, or, indeed, any distribution for which the Fisher information matrix exists, together with regular maximum-likelihood estimators. By regression we intend that the parameters of these distributions are themselves functions of further parameters and of explanatory variables, at least some of which can be chosen and controlled in the experiment. Frequently a link function will be required, for example in the parameterization of the variance of a normal distribution or of the probability in a Bernoulli model.

We introduce the term "elemental information matrix" to emphasize that the information matrix is associated with a specific distribution in its standard form and is not related to the further parameterization that comes from the inclusion of explanatory variables. Our main objective (see Sections 2 and 4) is to provide a collection of elemental information matrices for specific distributions which are crucial for the solution of optimal design problems. In this way, we bring together results that are scattered throughout the statistical literature. These matrices are essential for constructing information matrices for "single" observations. The latter may in many cases consist of a series of observations on a single subject. A typical example is when, together with a controlled variable, for example the dose of a drug, the model includes the times at which this dose is administered. For applications to experimental design we require the possibility of independent
* Corresponding author. Tel.: +1 919 998 1855.
E-mail addresses: [email protected] (V.V. Fedorov), [email protected] (R. Zhang).
http://dx.doi.org/10.1016/j.jspi.2012.09.012
replications of these series. Combining the concept of a single observation (supported by tables of information matrices) with a collection of sensitivity functions for various popular optimality criteria (see Section 3), the reader can address numerous tasks in the optimal design of experiments.

We begin in Section 2 with an introduction to the generalized regression model, continuing in Section 3 with some basic ideas about optimal experimental design. The core of our paper is Section 4, where we provide tables of elemental information matrices for one- and two-parameter univariate distributions. The multivariate normal distribution, including a parameterized variance, and multinomial distributions receive special attention in Sections 4.2 and 4.3. Examples 1 and 2 in Section 5 include the equivalence theorem for D-optimality for normal observations with a specific parameterized variance. Example 3 covers the family of designs for univariate generalized linear models, whereas in Example 4 there are two binary responses. Example 5 is gamma regression when both parameters are functions of explanatory variables.

2. Generalized regression

2.1. Main model

We assume that the observed response Y is distributed as

    Y ~ p(y|η),    (1)
where Y and y are k-dimensional vectors, with η of dimension l. The parameters η depend on controls x ∈ X, where X is a design set (region). Usually X ⊂ R^s, but in general X could be a compact set of more complicated structure, for instance, part of a functional space (see Fedorov and Hackl, 1997, Section 5.7). In what follows we assume that X ⊂ R^s unless otherwise stated.

In the standard setting for regression models, it is assumed that the expected values of the responses Y are parameterized, as in Pázman (1986), Pukelsheim (1993), Fedorov and Hackl (1997) or Atkinson et al. (2007); see Example 1. The most popular examples are "normal" regression, binary regression and Poisson regression. Indeed, all regression models with responses belonging to a one-parameter family can be treated in a similar way. For multiparameter distributions, η is a vector, all components of which may again depend on controls and other unknown parameters. When the components of η are of the same type (e.g. expectations in a multivariate normal model) the generalization is straightforward. However, in the case of other multi-parameter distributions (such as the gamma or Weibull), the forms of the various dependencies are less obvious. As far as we know, in the experimental design literature it is only in the "normal" case that different types of components of η have been parameterized, namely, the expectation(s) of the response(s) and its variance (variance-covariance matrix); see, for instance, Atkinson and Cook (1995), Dette and Wong (1999) or Fedorov and Leonov (2004). A rare exception is the beta regression considered by Wu et al. (2005), where again the mean and variance are separately parameterized.

2.2. Likelihood estimators and Fisher information matrices

Before proceeding with the design problem for generalized regression, we recall a few commonly known facts and introduce the required notation. Let
    η = η(x, θ),    (2)

where η^T(x, θ) = {η_1(x, θ), ..., η_k(x, θ)} are given functions of the controls x and of the unknown regression parameters θ. Further, let n independent observations {Y_i}_1^n be made at {x_i}_1^n. Eqs. (1) and (2) constitute a generalized regression model. The maximum-likelihood estimator (MLE) θ̂_n of θ can be defined as
    θ̂_n = arg max_{θ∈Θ} Π_{i=1}^n p(y_i | η(x_i, θ)) = arg max_{θ∈Θ} L({y_i}_1^n, {x_i}_1^n, θ),    (3)
where Θ is compact and the true value θ_t of θ is an internal point of Θ. Under rather mild conditions on p(y|η), η(x, θ) and on the sequence {x_i}_1^n (see, for example, Lehmann and Casella, 1998, Chapter 2), the MLE θ̂_n is strongly consistent and its normalized asymptotic variance-covariance matrix is

    n D(θ_t, {x_i}_1^n) = M^{-1}(θ_t, {x_i}_1^n),    (4)
where

    M(θ, {x_i}_1^n) = Σ_{i=1}^n μ(x_i, θ),    (5)

and

    μ(x, θ) = Var[∂ ln p{y|η(x, θ)} / ∂θ].    (6)
We introduce

    F(x, θ) = ∂η^T(x, θ) / ∂θ

and observe that

    ∂ ln p(y|η(x, θ)) / ∂θ = [∂η^T(x, θ)/∂θ] · [∂ ln p(y|η)/∂η]|_{η = η(x, θ)}.
Lemma 1. From the preceding definitions and (6) it follows that

    μ(x, θ) = F(x, θ) ν(η) F^T(x, θ),    (7)

where

    ν(η) = Var[∂ ln p(y|η) / ∂η].    (8)
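As a concrete check of Lemma 1, the following sketch (ours, not from the paper; the logit parameterization and the regressor vector f(x) = (1, x)^T are illustrative assumptions) builds μ(x, θ) = F ν F^T for a Bernoulli response with η = p = 1/(1 + e^{−θ^T f(x)}) and compares it with the familiar logistic-regression information p(1−p) f f^T:

```python
import numpy as np

def mu_bernoulli_logit(x, theta):
    """mu(x, theta) = F(x, theta) nu(eta) F^T(x, theta), eq. (7), for a
    Bernoulli response with eta = p = 1/(1 + exp(-theta^T f(x)))."""
    f = np.array([1.0, x])                      # assumed regressor vector
    p = 1.0 / (1.0 + np.exp(-theta @ f))
    F = (p * (1.0 - p) * f).reshape(-1, 1)      # F = d eta / d theta
    nu = np.array([[1.0 / (p * (1.0 - p))]])    # elemental information of Bernoulli(p)
    return F @ nu @ F.T

theta = np.array([0.5, -1.0])
x = 0.3
f = np.array([1.0, x])
p = 1.0 / (1.0 + np.exp(-theta @ f))
direct = p * (1.0 - p) * np.outer(f, f)         # textbook logistic information
```

The two routes agree, illustrating that the elemental information of the response distribution, combined with the derivatives of the regression function, is all that is needed.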
Note that in (7) and (8) the vector η depends on x and θ; we have omitted the compound subscript η = η(x, θ) and will continue to do so if this does not lead to ambiguity. In what follows we call ν(η), defined by (8), the elemental information matrix for model (1). It plays a central role in this paper. If there are repeated observations, (5) should be replaced by

    M(θ, ξ_N) = Σ_{i=1}^n r_i μ(x_i, θ),    (9)

or, moving to the normalized information matrix,

    N^{-1} M(θ, ξ_N) = Σ_{i=1}^n w_i μ(x_i, θ),    (10)

where N = Σ_{i=1}^n r_i, w_i = r_i/N and ξ = {w_i, x_i}_1^n. If there are no repeated observations, then n and N coincide. More generally,

    M(θ, ξ) = ∫ μ(x, θ) ξ(dx) = ∫ F(x, θ) ν(η) F^T(x, θ) ξ(dx),    (11)
where ξ could be any probabilistic measure defined on X. Thus the information matrix (11) is completely defined by the design ξ, by the derivatives F(x, θ) of the regression functions η(x, θ) and by the elemental information matrix ν(η).

3. A few basic facts from optimal design theory

Optimality criteria play a fundamental role in the design of experiments. We follow the well-accepted paradigm of convex design theory (Fedorov, 1972; Silvey, 1980; Pázman, 1986; Pukelsheim, 1993; Fedorov and Hackl, 1997; Atkinson et al., 2007) and define an optimal (continuous/approximate) design as

    ξ* = arg min_ξ Ψ{M(ξ, θ)},    (12)
where Ψ[M] is a convex and homogeneous function of the matrix M (see Fedorov and Hackl, 1997, Section 2.2). An introduction to optimal design, emphasizing the importance of Fisher information, is Cox and Reid (2000, Chapter 7). For the sake of simplicity, we confine ourselves throughout this paper to cases when the optimal designs defined by (12) have regular information matrices, i.e. non-zero determinants.

3.1. Equivalence theorem

The following general theorem, which is often called the equivalence theorem after the result of Kiefer and Wolfowitz (1960) showing the equivalence of the D- and G-criteria, is the theoretical basis for the development of many numerical methods and for the construction of both sequential and adaptive designs.
Theorem 1. A necessary and sufficient condition for ξ* to be optimal is fulfillment of the inequality

    min_{x∈X} ψ(x, ξ*) ≥ 0,    (13)
where ψ(x, ξ*) is the directional derivative:

    ψ(x, ξ*) = lim_{α→+0} (∂/∂α) Ψ[M{(1−α)ξ* + αξ(x)}],    (14)
and ξ(x) is a design atomized at x. A "design atomized at x" is such that all observations are taken at the single point x, so that ξ(x) = {w = 1, x}. Mathematically, the measure ξ(x) can be viewed as the delta function: ∫_{x−Δ}^{x+Δ} δ(x − x′) dx′ = 1 for any Δ > 0.

Frequently the function ψ(x, ξ) is called a sensitivity function. A table of sensitivity functions for the most popular design criteria is presented in Table 1. All functions ψ(x, ξ) are defined by the information matrices of a single observation. These, in turn, are defined by the elemental information matrices (8) and by the derivatives F(x, θ) of the regression functions. Thus Table 1, together with tables from the next section which contain elemental matrices, leads directly to the specific versions of the equivalence theorem for a wide variety of models; the necessity of special calculations for each individual case is avoided.

Note that Table 1 is constructed under the assumption that the optimal design ξ* is regular. In some special cases in rows 2 and 4, for example, it can happen that ξ* is singular, that is rank M(ξ*) < m. Such cases are well treated in Pukelsheim (1993), but are beyond the scope of this paper.

All the criteria in Table 1 operate in parameter space, that is they provide a scalar measure of precision, or uncertainty, of the MLE of θ or of the linear transformation ϑ = A^T θ. In the case of arbitrary, but smooth, transformations ϑ(θ) the results of the table still hold with A^T replaced by the vector of derivatives ∂ϑ^T/∂θ. We observe that

    |D|^{1/m} = lim_{γ→0} (m^{-1} tr D^γ)^{1/γ} ≤ (m^{-1} tr D^γ)^{1/γ} ≤ lim_{γ→∞} (m^{-1} tr D^γ)^{1/γ} = λ_max(D),

where λ_max(D) is the largest eigenvalue of D. Thus the optimality criterion in the last line of Table 1 is somewhat flexible and, in particular, can be used for the construction of optimal designs approximating those minimizing λ_max(D).

Often a practitioner may be interested in estimation of the response functions η(x, θ) or of a utility function ζ(x, θ) on some set Z which may or may not coincide with X. The corresponding criteria are not explicitly included in Table 1. However, the table provides the requisite information for some criteria. For instance, let interest be in

    ∫_Z Var[ζ(x, θ̂)] dx ≅ ∫_Z [∂ζ(x, θ)/∂θ^T] Var(θ̂) [∂ζ(x, θ)/∂θ] dx = tr{ ∫_Z [∂ζ(x, θ)/∂θ][∂ζ(x, θ)/∂θ^T] dx · Var(θ̂) }.

Defining

    A = ∫_Z [∂ζ(x, θ)/∂θ][∂ζ(x, θ)/∂θ^T] dx,
one can use the fourth row of Table 1 to find the necessary sensitivity function.

3.2. Computing optimal designs

Many widely used numerical algorithms for the construction of optimal designs are based on directional derivatives (Wynn, 1970; Fedorov, 1972; Fedorov and Hackl, 1997). For instance, in first-order algorithms, forward excursions add weight to the design at

    x_{s+1}^+ = arg max_{x∈X} ψ(x; ξ_s, θ),    (15)

whereas backward excursions delete weight from

    x_{s+1}^- = arg min_{x∈X} ψ(x; ξ_s, θ).    (16)
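A minimal sketch of the forward step (15) for D-optimality (our illustration, not code from the paper): for binary regression with logit link, f(x) = (1, x)^T and an assumed "true" θ = (0, 1)^T, the single-observation information is μ(x, θ) = p(1−p) f f^T, and the quantity driving the forward step is p(1−p) f^T M^{-1} f. The locally D-optimal design is known to put equal weight on the two points with linear predictor ±1.5434 (see, e.g., Ford et al., 1992), which the iteration approaches:

```python
import numpy as np

# X = [-3, 3] discretized; p(x) = 1/(1 + exp(-theta^T f(x))), f(x) = (1, x)^T,
# theta = (0, 1)^T assumed known, so the design is locally optimal.
grid = np.linspace(-3.0, 3.0, 601)
F = np.vstack([np.ones_like(grid), grid]).T
p = 1.0 / (1.0 + np.exp(-F @ np.array([0.0, 1.0])))
lam = p * (1.0 - p)                       # weight in mu(x) = lam(x) f f^T
w = np.full(len(grid), 1.0 / len(grid))   # start from the uniform design

for s in range(3000):
    M = F.T @ ((w * lam)[:, None] * F)    # M(theta, xi), eq. (10)
    Minv = np.linalg.inv(M)
    d = lam * np.einsum('ij,jk,ik->i', F, Minv, F)  # lam(x) f^T M^{-1} f
    j = np.argmax(d)                                # forward excursion (15)
    alpha = 1.0 / (s + 2)                           # standard step length
    w *= (1.0 - alpha)
    w[j] += alpha

max_d = d.max()   # approaches m = 2 as the design approaches D-optimality
```

After a few thousand iterations almost all mass sits near the two theoretical support points, and the maximal sensitivity approaches m = 2, in accordance with the equivalence theorem.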
Again, as in the previous section, Tables 1–5 provide all the components that are needed, now for (15) and (16).

Table 1
Sensitivity functions ψ(x, ξ); D = M^{-1}(ξ); see (4).

    Ψ(ξ)                                               ψ(x, ξ)
    log|D|, |D|^{1/m}, Π_{a=1}^m λ_a(D)                tr ν(x)F^T(x)DF(x) − m
    log|A^T DA|, dim A = m × k, rank A = k < m         tr ν(x)F^T(x)DA(A^T DA)^{-1}A^T DF(x) − k
    tr D, Σ_{a=1}^m λ_a(D)                             tr ν(x)F^T(x)D^2 F(x) − tr D
    tr A^T DA                                          tr ν(x)F^T(x)DAA^T DF(x) − tr A^T DA
    tr D^γ, Σ_{a=1}^m λ_a^γ(D)                         tr ν(x)F^T(x)D^{γ+1}F(x) − tr D^γ
Table 2
Elemental information for single-parameter distributions. C(n, y) denotes the binomial coefficient.

    Distribution                      Density                                Mean; variance                     Information
    Bernoulli (p), 0 ≤ p ≤ 1          p^y (1−p)^{1−y}                        p; p(1−p)                          1/{p(1−p)}
    Geometric (p), 0 ≤ p ≤ 1          (1−p)^y p                              (1−p)/p; (1−p)/p^2                 1/{p^2(1−p)}
    Binomial (p, n), 0 ≤ p ≤ 1        C(n, y) p^y (1−p)^{n−y}                np; np(1−p)                        n/{p(1−p)}
    Neg. binomial (p, m), 0 ≤ p ≤ 1   C(m+y−1, y) p^m (1−p)^y                m(1−p)/p; m(1−p)/p^2               m/{p^2(1−p)}
    Hypergeometric (p, N, n), 0≤p≤1   C(Np, y) C(N−Np, n−y)/C(N, n)          np; np(1−p)(N−n)/(N−1)             (N−1)n/{p(1−p)(N−n)}
    Poisson (λ), λ > 0                λ^y e^{−λ}/y!                          λ; λ                               1/λ
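The entries of Table 2 are easy to use programmatically. The sketch below (our own; the container and names are hypothetical) collects them and checks the Bernoulli entry by Monte Carlo, using the fact that the elemental information equals the variance of the score:

```python
import numpy as np

# Elemental informations from Table 2 (single observation, natural parameter)
elemental_info = {
    "bernoulli": lambda p: 1.0 / (p * (1.0 - p)),
    "geometric": lambda p: 1.0 / (p**2 * (1.0 - p)),
    "binomial":  lambda p, n: n / (p * (1.0 - p)),
    "negbin":    lambda p, m: m / (p**2 * (1.0 - p)),
    "poisson":   lambda lam: 1.0 / lam,
}

# Monte Carlo check for Bernoulli(p): information = Var of the score
# d log p(y|p) / dp = y/p - (1 - y)/(1 - p)
rng = np.random.default_rng(0)
p = 0.3
y = rng.binomial(1, p, size=200_000)
score = y / p - (1 - y) / (1 - p)
mc_info = score.var()    # close to 1/{p(1-p)} = 4.7619...
```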
Table 3
Transformed elemental information for single-parameter distributions.

    Distribution          New parameter            Information
    Bernoulli (p)         ϑ = ln{p/(1−p)}          e^ϑ/(1+e^ϑ)^2
    Binomial (p, n)       ϑ = ln{p/(1−p)}          n e^ϑ/(1+e^ϑ)^2
    Poisson (λ)           ϑ = ln λ                 e^ϑ
    Geometric (p)         ϑ = ln{p/(1−p)}          1/(1+e^ϑ)
    Neg. binomial (p, m)  ϑ = ln{p/(1−p)}          m/(1+e^ϑ)
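Table 3 follows from the scalar reparameterization rule ν(ϑ) = (dη/dϑ)² ν(η). A quick check (our own sketch) for the Bernoulli logit entry:

```python
import numpy as np

def bernoulli_info(p):
    return 1.0 / (p * (1.0 - p))          # Table 2 entry

def transformed_info(t):
    """nu(vartheta) = (d eta / d vartheta)^2 * nu(eta) for the logit
    transform vartheta = ln{p/(1-p)}, i.e. p = e^t/(1+e^t)."""
    p = np.exp(t) / (1.0 + np.exp(t))
    dp_dt = p * (1.0 - p)                 # Jacobian of the inverse logit
    return dp_dt**2 * bernoulli_info(p)

t = 0.7
table_entry = np.exp(t) / (1.0 + np.exp(t))**2   # Table 3 entry
```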
Fedorov and Hackl (1997, Section 3.2) and Atkinson et al. (2007, Section 9.5) describe the use of general-purpose second-order algorithms for the construction of optimal designs. These algorithms require second-order directional derivatives. Again, for all criteria listed in Table 1, they are completely defined by the elemental information matrices and by F(x, θ) (Fedorov and Hackl, 1997, Eq. 3.2.16).

Adaptive designs in the optimal design setting are driven by taking the (n+1)-th observation at the point that is viewed as the most informative (with respect to the selected criterion) after n observations (Box and Hunter, 1965; Fedorov, 1972). The approximate location of this point is defined as

    x_{n+1} = arg max_{x∈X} ψ(x; ξ_n, θ̂_n),    (17)

where θ̂_n is the maximum likelihood estimate after n observations. Note that (17) is identical to (15) if, instead of the unknown θ, we substitute its estimate θ̂_n. Thus, knowledge of the sensitivity function – i.e. the knowledge of F(x, θ) together with the elemental matrix ν(η) – is sufficient to develop adaptive design procedures. The asymptotic properties of such schemes are discussed by Lai (1994).

4. Elemental information matrices

In this section, we provide a collection of elemental information matrices for some popular distributions. If the reader does not find the needed distribution in Table 2 or in Table 4, classical results on the location-scale or exponential families may help (cf. Lehmann and Casella, 1998). Results for the multivariate exponential family can be found in Kotz et al. (2000).

4.1. Univariate distributions

Almost all of the reported elemental information matrices, or at least the corresponding references, can be found in Johnson et al. (1994, 1995, 2005). We have also found useful results in Lehmann and Casella (1998) and in Bernardo and
Table 4
Elemental information matrices for two-parameter distributions.

Normal (a, σ²), −∞ < a < ∞, σ > 0.
  Density (2πσ²)^{−1/2} exp{−(y−a)²/(2σ²)}; support (−∞, ∞); mean a; variance σ².
      ( 1/σ²     0        )
      ( 0        1/(2σ⁴)  )

Beta (a, b), a > 0, b > 0.
  Density y^{a−1}(1−y)^{b−1}/B(a, b); support 0 < y < 1; mean a/(a+b); variance ab/{(a+b)²(a+b+1)}.
      ( ψ′(a) − ψ′(a+b)     −ψ′(a+b)          )
      ( −ψ′(a+b)            ψ′(b) − ψ′(a+b)   )

Gamma (a, b), a > 0, b > 0.
  Density y^{a−1} e^{−y/b}/{Γ(a) b^a}; support (0, ∞); mean ab; variance ab².
      ( ψ′(a)    1/b    )
      ( 1/b      a/b²   )

Logistic (a, b), −∞ < a < ∞, b > 0.
  Density e^{−(y−a)/b}/(b[1 + e^{−(y−a)/b}]²); support (−∞, ∞); mean a; variance b²π²/3.
      (1/(3b²)) · ( 1    0         )
                  ( 0    1 + π²/3  )

Cauchy (a, b), −∞ < a < ∞, b > 0.
  Density b/{π[b² + (y−a)²]}; support (−∞, ∞); mean and variance do not exist.
      (1/(2b²)) · I

Weibull (a, b), a > 0, b > 0.
  Density (a/b)(y/b)^{a−1} exp{−(y/b)^a}; support (0, ∞); mean bΓ(1 + a^{-1}); variance b²{Γ(1 + 2a^{-1}) − Γ²(1 + a^{-1})}.
      ( [π²/6 + (1−γ)²]/a²     (γ−1)/b   )
      ( (γ−1)/b                a²/b²     )

Pareto (a, s), a > 2 (for the variance), s > 0.
  Density a s^a/y^{a+1}; support (s, ∞); mean as/(a−1); variance as²/{(a−1)²(a−2)}.
      ( 1/a²             −1/{s(a+1)}   )
      ( −1/{s(a+1)}      a/{s²(a+2)}   )

Double exponential (Laplace) (a, b), −∞ < a < ∞, b > 0.
  Density e^{−|y−a|/b}/(2b); support (−∞, ∞); mean a; variance 2b².
      ( 1/b²    0      )
      ( 0       1/b²   )
ψ(a) = Γ′(a)/Γ(a) and ψ′(a) = dψ(a)/da are the digamma and trigamma functions; γ = 0.5772... is Euler's constant; see Abramowitz and Stegun (1965).

Table 5
Transformed elemental information matrices for two-parameter distributions after transformation.

N(a, σ²); new parameters ϑ_1 = a, ϑ_2 = ln σ².
      ( e^{−ϑ_2}    0    )
      ( 0           1/2  )

N(a, a²); new parameter ϑ = ln a; information 3.

N(a, k²a²); new parameters ϑ_1 = ln a, ϑ_2 = k².
      ( 2 + 1/ϑ_2     1/ϑ_2       )
      ( 1/ϑ_2         1/(2ϑ_2²)   )

Beta (a, b); new parameters ϑ_1 = ln a, ϑ_2 = ln b; writing a = e^{ϑ_1}, b = e^{ϑ_2}:
      ( a²[ψ′(a) − ψ′(a+b)]     −ab ψ′(a+b)           )
      ( −ab ψ′(a+b)             b²[ψ′(b) − ψ′(a+b)]   )

Gamma (a, b); new parameters ϑ_1 = ln a, ϑ_2 = ln b; with a = e^{ϑ_1}:
      ( a² ψ′(a)    a  )
      ( a           a  )

Logistic (a, b); new parameters ϑ_1 = a, ϑ_2 = ln b.
      ( e^{−2ϑ_2}/3    0               )
      ( 0              (1 + π²/3)/3    )

Cauchy (a, b); new parameters ϑ_1 = a, ϑ_2 = ln b.
      ( e^{−2ϑ_2}/2    0    )
      ( 0              1/2  )

Weibull (a, b); new parameters ϑ_1 = ln a, ϑ_2 = ln b; with a = e^{ϑ_1}:
      ( π²/6 + (1−γ)²     (γ−1)a   )
      ( (γ−1)a            a²       )
Smith (1994). Table 2 contains elemental information expressions for single-parameter distributions, with elemental information matrices for two-parameter distributions in Table 4.

The distributions in Table 4 include the Weibull and Pareto, which are often used for modelling survival or lifetime data. We note (Harris, 1968) that a mixture of exponential distributions with a gamma-distributed parameter leads to a form of Pareto distribution, which makes it appropriate for modelling heterogeneity in survival times. These distributions are frequently elaborated by the introduction of extra parameters to add flexibility in shape and to allow for times that do not start at zero. Brazauskas (2003) presents information matrices for these more complicated Pareto distributions. Escobar and Meeker (1994), Gertsbakh (1995) and Gupta and Kundu (2006) discuss information matrices for censored survival distributions. Information matrices for bivariate and multivariate Pareto distributions are presented by Yari and Jafari (2006), Gupta and Nadarajah (2007) and by Kotz (2008). Elaborations of the Laplace distribution are in Kotz et al. (2001). Ali and Nadarajah (2007) cover normal-Laplace mixtures, with the Dirichlet-multinomial distribution in Paul et al. (2005).

If one of the two parameters is assumed to be known, then the elemental information for the other (unknown) parameter equals the corresponding diagonal element of the elemental information matrix.

In many settings, it is helpful to work with parameters ϑ which are often log or logit functions of the original parameters η and can vary between −∞ and ∞. We use ν(ϑ) when the parameters ϑ, not η, are considered as functions of the controls x and the regression parameters θ. All results of Section 3 stay valid with the obvious replacement of ν(η) by ν(ϑ). Tables 3 and 5 contain the information or information matrices for popular choices of these new parameters.
The multivariate normal distribution and the multinomial distribution are described in the corresponding subsections. Many further references on multivariate distributions can be found in Johnson et al. (1997) and Kotz et al. (2000).

4.2. Multivariate normal distribution

Let Y ∈ R^l have a normal distribution, i.e.,

    p(y|a, Σ) = (2π)^{−l/2} |Σ|^{−1/2} exp{−(1/2)(y−a)^T Σ^{-1} (y−a)}.

We let θ represent the unknown parameters defining the mean a(x, θ) and the variance-covariance matrix Σ(x, θ). Then the (α, β) element of the information matrix for θ and for a single observation is (Magnus and Neudecker, 1988, p. 325)

    μ_{αβ} = (∂a^T/∂θ_α) Σ^{-1} (∂a/∂θ_β) + (1/2) tr[Σ^{-1} (∂Σ/∂θ_α) Σ^{-1} (∂Σ/∂θ_β)].    (18)

The information matrix for the l + l(l+1)/2 parameters a and Σ appears complicated. We therefore introduce notation to express it in a more compact form (Magnus and Neudecker, 1988, Section 2.4; Harville, 2002, Section 16.4). Let vec Σ be a vector that is constructed from Σ and consists of l² components. To build it, stack the columns of Σ beneath each other so that the first column of Σ is on top, followed by the second column of Σ, etc.; the l-th column of Σ is therefore at the bottom of the stack. Because Σ is symmetric, the vector vec Σ contains considerable redundancy. To obtain a parsimonious column vector with the same information as is in Σ, eliminate all elements that come from the super-diagonal elements of Σ. The resulting vector with only l(l+1)/2 elements is denoted vech Σ. The duplication matrix D_l (Magnus and Neudecker, 1988, Section 3.8) is the linear transform that links vec Σ and vech Σ:

    D_l vech Σ = vec Σ.

D_l is a unique matrix of dimension l² × l(l+1)/2. We use D_l to express the elemental information matrix for the parameters ϑ = {a^T, (vech Σ)^T}^T in a relatively compact format:

    μ(ϑ) = ( Σ^{-1}    0                                     )
           ( 0         (1/2) D_l^T (Σ^{-1} ⊗ Σ^{-1}) D_l    ).    (19)

For example, for the bivariate normal distribution with parameters
    θ = (a_1, a_2, σ_1², ρσ_1σ_2, σ_2²)^T,

the duplication matrix is

    D_2 = ( 1 0 0 )
          ( 0 1 0 )
          ( 0 1 0 )
          ( 0 0 1 ),

which, inserted in (19), yields the compact re-expression of the elemental information matrix:

    μ(θ) = ( Σ^{-1}    0 )
           ( 0         B ),    (20)
88
A.C. Atkinson et al. / Journal of Statistical Planning and Inference 144 (2014) 81–91
where

    Σ^{-1} = 1/(1−ρ²) · (  1/σ_1²          −ρ/(σ_1σ_2)  )
                         ( −ρ/(σ_1σ_2)      1/σ_2²      )

and B = (1/2) D_2^T (Σ^{-1} ⊗ Σ^{-1}) D_2, explicitly

    B = 1/(1−ρ²)² · (  1/(2σ_1⁴)           −ρ/(σ_1³σ_2)           ρ²/(2σ_1²σ_2²)  )
                    ( −ρ/(σ_1³σ_2)          (1+ρ²)/(σ_1²σ_2²)    −ρ/(σ_1σ_2³)     )
                    (  ρ²/(2σ_1²σ_2²)      −ρ/(σ_1σ_2³)           1/(2σ_2⁴)       ).
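The bivariate block above can be cross-checked mechanically (our sketch; the numerical values of σ_1, σ_2, ρ are arbitrary): build D_2, form the vech-Σ block of (19) as (1/2) D_2^T (Σ^{-1} ⊗ Σ^{-1}) D_2, and compare it with the explicit matrix B:

```python
import numpy as np

s1, s2, rho = 1.3, 0.8, 0.4
Sigma = np.array([[s1**2, rho * s1 * s2],
                  [rho * s1 * s2, s2**2]])
Si = np.linalg.inv(Sigma)

# Duplication matrix for l = 2: maps vech = (s11, s12, s22) to
# vec = (s11, s21, s12, s22)
D2 = np.array([[1, 0, 0],
               [0, 1, 0],
               [0, 1, 0],
               [0, 0, 1]], dtype=float)
block = 0.5 * D2.T @ np.kron(Si, Si) @ D2     # vech-Sigma block of (19)

c = 1.0 / (1.0 - rho**2)
B = c**2 * np.array([                          # explicit form of B in (20)
    [1 / (2 * s1**4), -rho / (s1**3 * s2), rho**2 / (2 * s1**2 * s2**2)],
    [-rho / (s1**3 * s2), (1 + rho**2) / (s1**2 * s2**2), -rho / (s1 * s2**3)],
    [rho**2 / (2 * s1**2 * s2**2), -rho / (s1 * s2**3), 1 / (2 * s2**4)],
])
```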
4.3. Multinomial distribution

For multinomial observations y = (y_1, ..., y_l, y_{l+1}) (Bernardo and Smith, 1994, Section 5.4)

    p(y|ϑ, n) = n!/(y_1! ⋯ y_l! y_{l+1}!) ϑ_1^{y_1} ⋯ ϑ_l^{y_l} ϑ_{l+1}^{y_{l+1}},

where

    Σ_{i=1}^{l+1} y_i = n   and   Σ_{i=1}^{l+1} ϑ_i = 1.

There are actually only l independent parameters. We take, as is common, the elemental parameters to be ϑ = (ϑ_1, ..., ϑ_l)^T, noting that ϑ_{l+1} = 1 − Σ_{i=1}^l ϑ_i. The elemental information matrix for ϑ is

    μ(ϑ) = (n/ϑ_{l+1}) · ( (ϑ_1+ϑ_{l+1})/ϑ_1    1                     ⋯    1                   )
                         ( 1                     (ϑ_2+ϑ_{l+1})/ϑ_2    ⋯    1                   )
                         ( ⋮                     ⋮                         ⋮                   )
                         ( 1                     1                    ⋯    (ϑ_l+ϑ_{l+1})/ϑ_l   ).    (21)

The latter can be rewritten as

    μ(ϑ) = n · diag(ϑ_1, ..., ϑ_l)^{-1} + (n/ϑ_{l+1}) 𝟙𝟙^T,    (22)

where 𝟙 = (1, ..., 1)^T. Recalling that

    (A + 𝟙𝟙^T)^{-1} = A^{-1} − A^{-1}𝟙𝟙^T A^{-1} / (1 + 𝟙^T A^{-1} 𝟙),

we obtain

    μ^{-1}(ϑ) = (1/n) [diag(ϑ_1, ..., ϑ_l) − ϑϑ^T].    (23)
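A direct numerical check of (22) and (23) (our own sketch) for l = 3, with probabilities drawn at random:

```python
import numpy as np

rng = np.random.default_rng(1)
theta_full = rng.dirichlet(np.ones(4))        # (theta_1,...,theta_{l+1}), l = 3
theta, theta_last = theta_full[:3], theta_full[3]
n = 25

ones = np.ones((3, 1))
m = n * (np.diag(1.0 / theta) + ones @ ones.T / theta_last)   # eq. (22)
m_inv = (np.diag(theta) - np.outer(theta, theta)) / n         # eq. (23)
```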
5. Examples
Example 1 (Linear regression with normal errors and constant variance). For independent normally distributed observations there are two elemental parameters, the mean a and the variance σ²; see Table 4. Let the variance be constant and

    η(x, θ) = ( a(x, ϑ) )  =  ( ϑ^T f(x) ),    θ = (ϑ^T, σ²)^T.    (24)
              ( σ²      )     ( σ²       )

Often it is assumed that σ² does not depend on x. Then from (7) and (11) and Table 1 it follows that

    μ(x) = (1/σ²) · ( f(x)f^T(x)    0         )
                    ( 0             1/(2σ²)   ).    (25)

Note that the information matrix is block diagonal, i.e. the first m components ϑ̂_p of the MLE are independent of the last one, ϑ̂_v = σ̂². The information matrix for the design ξ is

    M(ξ) = ( M_p(ξ)    0         )
           ( 0         1/(2σ⁴)   ),    (26)

where M_p(ξ) = ∫ f(x)f^T(x) ξ(dx). We now consider the D-criterion. From Theorem 1 and Table 1 it follows that the sensitivity function is ψ(x) = f^T(x) M_p^{-1}(ξ) f(x) − m, and a necessary and sufficient condition for ξ to be optimal is fulfillment of the inequality

    f^T(x) M_p^{-1}(ξ) f(x) ≤ m.

The latter statement is, of course, the major part of the Kiefer–Wolfowitz equivalence theorem. Note that this inequality does not contain any unknown parameters.

Example 2 (Normal linear regression with independently parameterized variance). Let us continue the previous example with

    η(x, θ) = ( a(x, ϑ_p)       )  =  ( ϑ_p^T f(x) ),    θ = (ϑ_p^T, ϑ_v^T)^T.    (27)
              ( ln σ²(x, ϑ_v)   )     ( ϑ_v^T φ(x) )
ð28Þ
R R T where M p ðxÞ ¼ f ðxÞf ðxÞxðdxÞ and M v ðxÞ ¼ jðxÞjT ðxÞxðdxÞ. Applying Theorem 1 and Table 1, we have for the D-criterion that (compare with Atkinson and Cook, 1995) T
T
1 T 1 eWv jðxÞ f ðxÞM1 p ðxÞf ðxÞ þ 2jðxÞM v ðxÞj ðxÞ r mp þ mv ,
where mp ¼ dim Wp and mv ¼ dim Wv . T
Example 3 (One-parameter families, linear predictor and D-optimality). Let η(x, θ) = h(θ^T f(x)), where the range of the inverse link function h coincides with the domain of the corresponding parameter (e.g. between 0 and 1 for p from Table 2). From (7) and Table 1, it can immediately be seen that a necessary and sufficient condition for ξ to be optimal is fulfillment of the inequality

    λ(θ^T f(x)) f^T(x) M^{-1}(ξ) f(x) ≤ m,

where λ(u) = ν(u) φ²(u), φ(u) = ∂h(u)/∂u and M(ξ) = ∫ λ(θ^T f(x)) f(x)f^T(x) ξ(dx). Compare our results with the special cases presented by Wu (1985) and Torsney and Gunduz (2001) for binary regression, or by Ford et al. (1992) and Atkinson et al. (2007, Chapter 22), who considered generalized linear models.

Example 4 (Bivariate binary response model). Consider two binary outcomes, efficacy and toxicity, from a clinical trial. The possible outcomes are y = (y_00, y_01, y_10, y_11)^T with probabilities ϑ = (ϑ_1, ..., ϑ_4)^T. It is more intuitive to re-express these probabilities respectively as p_00, p_01, p_10 and p_11. The interpretation of these probabilities is: "probability of no efficacy, no toxicity"; "probability of no efficacy, toxicity"; etc. Let a "single" observation be an observation performed on a cohort of size n. Then

    p(y|π, n) = n!/(y_00! y_01! y_10! y_11!) p_00^{y_00} p_01^{y_01} p_10^{y_10} p_11^{y_11},    (29)

where Σ_i Σ_j y_ij = n and Σ_i Σ_j p_ij = 1. Define π = (p_00, p_01, p_10)^T. From (22), it follows that the elemental information matrix for a bivariate binary random variable and a cohort of size n is

    μ(π) = n · diag(p_00, p_01, p_10)^{-1} + n𝟙𝟙^T / (1 − p_00 − p_01 − p_10).    (30)

This formula was derived and used in a number of publications on dose–response studies; see, for instance, Dragalin and Fedorov (2006).
Example 5 (Gamma regression). In the case of gamma-distributed observations there are two intuitively attractive ways to define the link functions:

1. Model the parameters a and b directly:

    ln a = f^T(x) ϑ_a,
    ln b = φ^T(x) ϑ_b.

2. Model the mean ā = ab and the variance b̄ = ab²:

    ln ā = ln(ab) = f̃^T(x) ϑ̃_a,
    ln b̄ = ln(ab²) = φ̃^T(x) ϑ̃_b.

The formulae are more compact for the first case, which is the reason why we proceed with it. Following Sections 2 and 3, we have

    F(x, θ) = ( a f(x)    0        )
              ( 0         b φ(x)   ),    (31)

where θ = (ϑ_a^T, ϑ_b^T)^T, and the information matrix of a single observation made at x is (see Lemma 1 and Table 5)

    μ(x, θ) = a · ( a ψ′(a) f(x)f^T(x)    f(x)φ^T(x)  )
                  ( φ(x)f^T(x)            φ(x)φ^T(x)  ),    (32)

where a = e^{f^T(x)ϑ_a} and b = e^{φ^T(x)ϑ_b}. The matrix (32) allows us to build the total information matrix M(θ, ξ). A necessary and sufficient condition for D-optimality follows immediately from (31) and Table 1:

    a² ψ′(a) f^T(x)(M^{-1})_{aa} f(x) + 2a f^T(x)(M^{-1})_{ab} φ(x) + a φ^T(x)(M^{-1})_{bb} φ(x) ≤ m_a + m_b,

where the matrices (M^{-1})_{aa}, (M^{-1})_{ab}, (M^{-1})_{bb} are respectively the blocks of the inverse information matrix corresponding to the parameters ϑ_a and ϑ_b. Of course, one has to remember that a and b are functions of x and of the unknown parameters.

6. Conclusions

We have provided a set of tools which makes the optimal design of experiments as routine as possible for the most popular distributions of responses. A key is that the parameters of these distributions may depend on controllable, and perhaps uncontrollable, variables. Once a model is selected, that is the distribution of the responses and the predictive functions relating θ and x, the design procedure consists of almost identical steps for all the alternatives enumerated and discussed in this paper. We trust that this collection of results will not only streamline the practical aspects of experimental design, but that it will also lead to the development of rather simple software which can incorporate all the cases we have considered (and, we hope, some we have not) in one menu-driven toolkit.

It should be clearly understood that there is a wealth of challenging problems which lies beyond the scope of one paper, even if we have surveyed much material. One such example is problems with transformed responses when the transformations depend on unknown parameters, as in the Box and Cox (1964) transformation; see Atkinson (2005) for some design aspects in this case. In various areas of biostatistics it may be challenging to build information matrices when the correlated multivariate responses consist of both continuous and discrete variables; see Tate (1955) (with a correction in Hannan and Tate, 1965) and Fedorov et al. (2012).
These remarks are an explicit call for joint efforts with other statisticians to build a collection of elemental information matrices to make experimental design more attractive and readily available for a wider population of practitioners.
Acknowledgments This paper was written at the Isaac Newton Institute for Mathematical Sciences in Cambridge, England, during the 2011 programme on the Design and Analysis of Experiments. We thank the referees for their comments that helped to improve our paper.
References

Abramowitz, M., Stegun, I.A., 1965. Handbook of Mathematical Functions. Dover Publications, New York.
Ali, M.M., Nadarajah, S., 2007. Information matrices for normal and Laplace mixtures. Information Sciences 117, 947–955.
Atkinson, A.C., 2005. Robust optimum designs for transformation of the response in a multivariate chemical kinetic model. Technometrics 47, 478–487.
Atkinson, A.C., Cook, R.D., 1995. D-optimum designs for heteroscedastic linear models. Journal of the American Statistical Association 90, 204–212.
Atkinson, A.C., Donev, A.N., Tobias, R.D., 2007. Optimum Experimental Designs, with SAS. Oxford University Press, Oxford.
Bernardo, J.M., Smith, A.F.M., 1994. Bayesian Theory. Wiley, New York.
Box, G.E.P., Cox, D.R., 1964. An analysis of transformations (with discussion). Journal of the Royal Statistical Society, Series B 26, 211–246.
Box, G.E.P., Hunter, W.G., 1965. Sequential design of experiments for nonlinear models. In: Proceedings IBM Scientific Computing Symposium: Statistics. IBM, New York, pp. 113–137.
Brazauskas, V., 2003. Information matrix for Pareto(IV), Burr, and related distributions. Communications in Statistics—Theory and Methods 32 (2), 315–325. http://dx.doi.org/10.1081/STA-120018188.
Cox, D.R., Reid, N., 2000. The Theory of the Design of Experiments. Chapman and Hall/CRC Press, Boca Raton.
Dette, H., Wong, W.K., 1999. Optimal designs when the variance is a function of the mean. Biometrics 55, 925–929.
Dragalin, V., Fedorov, V., 2006. Adaptive designs for dose-finding based on efficacy–toxicity response. Journal of Statistical Planning and Inference 136 (6), 1800–1823. http://dx.doi.org/10.1016/j.jspi.2005.08.005.
Escobar, L.A., Meeker, W.Q., 1994. Algorithm AS 292: Fisher information matrix for the extreme value, normal and logistic distributions and censored data. Applied Statistics 43, 533–540.
Fedorov, V.V., 1972. Theory of Optimal Experiments. Academic Press, New York.
Fedorov, V.V., Hackl, P., 1997. Model-Oriented Design of Experiments. Lecture Notes in Statistics, vol. 125. Springer-Verlag, New York.
Fedorov, V.V., Leonov, S.L., 2004. Parameter estimation for models with unknown parameters in variance. Communications in Statistics—Theory and Methods 33, 2627–2657. http://dx.doi.org/10.1081/STA-200037917.
Fedorov, V.V., Wu, Y., Zhang, R., 2012. Optimal dose-finding designs with correlated continuous and discrete responses. Statistics in Medicine 31, 217–234.
Ford, I., Torsney, B., Wu, C.F.J., 1992. The use of a canonical form in the construction of locally optimal designs for non-linear problems. Journal of the Royal Statistical Society, Series B 54, 569–583.
Gertsbakh, I., 1995. On the Fisher information in type-I censored and quantal response data. Statistics and Probability Letters 23, 297–306.
Gupta, A.K., Nadarajah, S., 2007. Information matrices for some bivariate Pareto distributions. Applied Mathematics and Computation 184, 1069–1079.
Gupta, R.D., Kundu, D., 2006. On the comparison of Fisher information of the Weibull and generalized exponential distributions. Journal of Statistical Planning and Inference 136, 3130–3144. http://dx.doi.org/10.1016/j.jspi.2004.11.013.
Hannan, J.F., Tate, R.F., 1965. Estimation of the parameters for a multivariate normal distribution when one variable is dichotomized. Biometrika 52, 664–668.
Harris, C.M., 1968. The Pareto distribution as a queue server discipline. Operations Research 16, 307–313.
Harville, D.A., 2002. Matrix Algebra From a Statistician's Perspective. Springer-Verlag, New York.
Johnson, N.L., Kotz, S., Balakrishnan, N., 1994. Continuous Univariate Distributions—1, 2nd edition. Wiley, New York.
Johnson, N.L., Kotz, S., Balakrishnan, N., 1995. Continuous Univariate Distributions—2, 2nd edition. Wiley, New York.
Johnson, N.L., Kotz, S., Balakrishnan, N., 1997. Discrete Multivariate Distributions. Wiley, New York.
Johnson, N.L., Kemp, A.W., Balakrishnan, N., 2005. Univariate Discrete Distributions, 3rd edition.
Wiley, New York. Kiefer, J., Wolfowitz, J., 1960. The equivalence of two extremum problems. Canadian Journal of Mathematics 12, 363–366. Kotz, S., 2008. Information matrices for some bivariate Pareto distributions. In: Betti, G., Lemmi, A. (Eds.), Advances on Income Inequality and Concentration Measures, Routledge, London, pp. 105–116. Kotz, S., Balakrishnan, N., Johnson, N.L., 2000. Continuous Multivariate Distributions—1, 2nd edition. Wiley, New York. Kotz, S., Kozubowski, T.J., Podgo´rski, K., 2001. The Laplace Distribution and Generalizations: A Revisit with Applications to Communications, Economics, ¨ Engineering and Finance. Birkhauser-Verlag, Boston. Lai, T.L., 1994. Asymptotic properties of nonlinear least squares estimates in stochastic regression models. Annals of Statistics 22, 1917–1930. Lehmann, E., Casella, G., 1998. Theory of Point Estimation, 2nd edition Springer-Verlag, New York. Magnus, J.R., Neudecker, H., 1988. Matrix Differential Calculus with Applications in Statistics and Econometrics. Wiley, Chichester. Paul, S.R., Balasooriya, U., Banerjee, T., 2005. Fisher information matrix of the Dirichlet-multinomial distribution. Biometrical Journal 47, 230–236, http:// dx.doi.org/10.1002/bimj.200410103. Pa´zman, A., 1986. Foundations of Optimum Experimental Design. Reidel, Dordrecht. Pukelsheim, F., 1993. Optimal Design of Experiments. Wiley, New York. Silvey, S.D., 1980. Optimum Design. Chapman and Hall, London. Tate, R.F., 1955. The theory of correlation between two continuous variables when one is dichotomized. Biometrika 42, 205–216. Torsney, B., Gunduz, N., 2001. On optimal designs for high dimensional binary regression models. In: Atkinson, A.C., Bogacka, B., Zhigljavsky, A. (Eds.), Optimal Design 2000, Kluwer, Dordrecht, pp. 275–285. Wu, C.F.J., 1985. Efficient sequential designs with binary data. Journal of the American Statistical Association 80, 974–984. Wu, Y.H., Fedorov, V.V., Propert, K.J., 2005. 
Optimal design for dose response using beta distributed responses. Journal of Biopharmaceutical Statistics 15, 753–771, http://dx.doi.org/10.1081/BIP-200067760. Wynn, H.P., 1970. The sequential generation of D-optimal experimental designs. Annals of Mathematical Statistics 41, 1055–1064. Yari, G., Jafari, A.M.D., 2006. Information and covariance matrices for multivariate Pareto (IV), Burr, and related distributions. International Journal of Engineering Science 17, 61–69.