Journal of Statistical Planning and Inference 21 (1989) 253-263
North-Holland
ON INFORMATION MATRICES IN NONLINEAR EXPERIMENTAL DESIGN

Andrej PÁZMAN

Mathematical Institute, Slovak Academy of Sciences, Bratislava, Czechoslovakia

Received 26 March 1986; revised manuscript received 5 February 1988

Recommended by F. Pukelsheim

Abstract: In the Gaussian nonlinear regression model we propose information matrices which can be interpreted as measuring the information about the parameters in the case when they are estimated by least squares. The whole argumentation is based on a nonasymptotical approximative probability density of the least squares estimates. Proposals for nonlinear experimental design are presented.

AMS Classification Numbers: Primary 62J02; Secondary 62K05, 62F11.

Key words: Nonlinear regression; nonlinear least squares; differential geometry; experimental design.
1. Introduction

We consider a nonlinear regression model of the form

$$y(x) = \eta(\theta, x) + \varepsilon(x) \qquad (x \in \mathcal{X},\ \theta \in \Theta), \tag{1}$$
$$\varepsilon(x) \sim N(0, \sigma^2(x)).$$

Here $x$ is a design point (= the vector of explanatory variables) taken from a design space $\mathcal{X}$, $y(x)$ is the observed variable (= the response variable), $\varepsilon(x)$ is the error, $\theta := (\theta_1, \dots, \theta_m)^T \in \Theta$ is the vector of unknown parameters, and $x \in \mathcal{X} \mapsto \eta(\theta, x) \in \mathbb{R}^1$ is the regression function (= the mean response).

Regularity assumptions: The parameter space $\Theta$ is a bounded open convex subset of $\mathbb{R}^m$. For every $x \in \mathcal{X}$ the function $\theta \in \Theta \mapsto \eta(\theta, x)$ has continuous second order derivatives $\partial^2_{ij}\eta(\theta, x) := \partial^2\eta(\theta, x)/\partial\theta_i\,\partial\theta_j$.

A design (exact, of size $N$) is an $N$-tuple $X := (x_1, \dots, x_N)$ of points of $\mathcal{X}$, which are not necessarily distinct. We use the notation $y_X := (y(x_1), \dots, y(x_N))^T$, $\Sigma_X := \mathrm{diag}(\sigma^2(x_1), \dots, \sigma^2(x_N))$, and similarly we write $\varepsilon_X$, $\eta_X(\theta)$. The random variables $y(x_1), \dots, y(x_N)$ are supposed to be observed independently, hence we obtain from (1)

$$y_X = \eta_X(\theta) + \varepsilon_X, \qquad \varepsilon_X \sim N(0, \Sigma_X). \tag{2}$$
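For readers who wish to experiment numerically, the following fragment sketches model (2) and the computation of the L.S. estimate in Python/NumPy; the regression function `eta`, the design points and the variances are illustrative assumptions, not taken from the paper.

```python
# A minimal numerical sketch of model (2); eta, the design X and sigma2 are
# hypothetical choices made only to make the fragment runnable.
import numpy as np
from scipy.optimize import least_squares

def eta(theta, x):
    # An assumed one-parameter regression function eta(theta, x).
    return np.exp(-theta[0] * x)

def eta_X(theta, X):
    # Mean-response vector eta_X(theta) over the design X = (x_1, ..., x_N).
    return np.array([eta(theta, x) for x in X])

rng = np.random.default_rng(0)
X = np.array([0.5, 1.0, 2.0, 4.0])       # design points x_1, ..., x_N
sigma2 = np.full(len(X), 0.05 ** 2)      # Sigma_X = diag(sigma^2(x_i))
theta_true = np.array([0.7])

# One realization of y_X = eta_X(theta) + eps_X with eps_X ~ N(0, Sigma_X):
y = eta_X(theta_true, X) + rng.normal(0.0, np.sqrt(sigma2))

# theta_hat = argmin || y_X - eta_X(theta) ||_X^2  (weighted least squares):
theta_hat = least_squares(lambda th: (y - eta_X(th, X)) / np.sqrt(sigma2),
                          x0=np.array([0.5])).x
```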
Regularity assumptions on the considered designs: (i) the mapping $\theta \in \Theta \mapsto \eta_X(\theta)$ is one-to-one (= the identifiability of the parameters), (ii) the vectors $\partial_i\eta_X(\theta) := \partial\eta_X(\theta)/\partial\theta_i$ $(i = 1, \dots, m)$ are linearly independent on the set $\Theta$ (= the nonsingularity of the design $X$). As a consequence of (ii), the expected (Fisher) information matrix

$$\{M_X(\theta)\}_{ij} := \partial_i\eta_X^T(\theta)\,\Sigma_X^{-1}\,\partial_j\eta_X(\theta) \tag{3}$$

is nonsingular. By $S_X(\theta, y_X)$ we denote the observed information matrix in the model (2),

$$\{S_X(\theta, y_X)\}_{ij} := -\partial^2_{ij}\,\ell(y_X, \theta), \tag{4}$$

where $\ell(y_X, \theta)$ is the log-likelihood of the model. (As known, $M_X(\theta) = E_\theta\{S_X(\theta, \cdot)\}$.)
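A numerical version of definition (3), continuing the sketch above; the finite-difference step is an assumption made for generality, an analytic Jacobian can of course be substituted.

```python
# Sketch of the expected information matrix (3), reusing eta_X, X and sigma2
# from the previous fragment.
import numpy as np

def jacobian_eta(theta, X, h=1e-6):
    # Columns are the vectors d_i eta_X(theta) = d eta_X / d theta_i,
    # approximated by central differences.
    theta = np.asarray(theta, dtype=float)
    m = theta.size
    J = np.empty((len(X), m))
    for i in range(m):
        e = np.zeros(m); e[i] = h
        J[:, i] = (eta_X(theta + e, X) - eta_X(theta - e, X)) / (2 * h)
    return J

def M_info(theta, X, sigma2):
    # {M_X(theta)}_ij = d_i eta_X^T Sigma_X^{-1} d_j eta_X, Eq. (3).
    J = jacobian_eta(theta, X)
    return J.T @ (J / sigma2[:, None])
```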
A recent survey of nonlinear experimental design by Ford et al. (1987) shows that most design methods in nonlinear models are based on the use of $M_X(\hat\theta)$ or $S_X(\hat\theta, y_X)$, where $\hat\theta$ is some preliminary L.S. estimate of $\theta$. The most popular methods are those based on $M_X(\hat\theta)$. This is justified asymptotically, since $M_X^{-1}(\theta)$ is the asymptotic variance matrix of the L.S. estimate. In nonasymptotical situations, the designs minimizing some $\Phi[M_X(\hat\theta)]$ (say $\Phi[M_X(\hat\theta)] = -\log\det M_X(\hat\theta)$) are often considered as locally optimal (locally = for the true $\theta$ being in the neighbourhood of the estimate $\hat\theta$). This approach may be incorrect in many experiments, since, nonasymptotically, the variance matrix of the L.S. estimates differs from $M_X^{-1}(\theta)$. Moreover, as shown in the example in Section 3, two designs giving the same and constant information matrix may not be equivalent, since the two probability distributions of the L.S. estimates of $\theta$ are different.

A simple explanation why $M_X(\theta)$ is not a correct nonasymptotical measure of information in nonlinear models is that only the first order derivatives of $\eta_X(\theta)$ enter into (3), while the nonlinearity of the model (2) is expressed by the second order derivatives. Corrections of linear confidence regions (Beale (1960)) and of the bias of the L.S. estimates (Box (1971)) have been done using the second order derivatives of $\eta_X(\theta)$. A deeper understanding of these corrections has been obtained using the notion of curvatures of the expectation surface

$$\mathcal{E}_X := \{\eta_X(\theta) : \theta \in \Theta\} \tag{5}$$

by Bates and Watts (1980) (cf. also Ratkovsky (1984), Ch. 1.4, and Hamilton et al. (1982)). A generalization of the D-optimality criterion from linear to nonlinear models has been obtained by Hamilton and Watts (1985), using a local quadratic approximation (by the Taylor formula) of the volume of a confidence region for $\theta$. The criterion given in Eq. (3.2) of their paper has the form

$$\det M_X(\hat\theta)\,\big(1 + b\,\mathrm{tr}\,H_X(\hat\theta)\big), \tag{6}$$

where $b$ is a constant and $H_X(\hat\theta)$ is a matrix depending on the parameter effect
nonlinearity of Bates and Watts (1980); hence the criterion (6) is not invariant to a change of parameters (cf. also Ford et al. (1987)). The criterion (6), as well as the criteria based on $M_X(\hat\theta)$ or on $S_X(\hat\theta, y_X)$, depends on the estimate $\hat\theta$, i.e. these criteria are local. The dependence on the locality of $\theta$ is natural, since different parts of the function $\theta \in \Theta \mapsto \eta_X(\theta)$ can clearly give a different amount of information about $\theta$. This dependence can be removed only by using some prior information about $\theta$.
2. The new approach

We start from a nonasymptotical (although approximative) probability density of the L.S. estimate $\hat\theta$ (Eq. (19) given below), presented in Pázman (1984). An analogical asymptotic formula has been obtained by independent considerations in Skovgaard (1985) and in Hougaard (1985). Details concerning the precision of the nonasymptotical approximation are in Pázman (1987a). The density $q_X(\hat\theta \mid \theta)$ given in (19) contains the determinant of the matrix

$$\{Q_X(\hat\theta, \theta)\}_{ij} := \{M_X(\hat\theta)\}_{ij} + [\eta_X(\hat\theta) - \eta_X(\theta)]^T\,\Sigma_X^{-1}\,[I - P_{\hat\theta}]\,\partial^2_{ij}\eta_X(\hat\theta), \tag{7}$$

where

$$P_\theta := \sum_{i,j=1}^m \partial_i\eta_X(\theta)\,\{M_X^{-1}(\theta)\}_{ij}\,\partial_j\eta_X^T(\theta)\,\Sigma_X^{-1} \tag{8}$$

is the projector onto the plane

$$T_X(\theta) := \Big\{z \in \mathbb{R}^N : z = \eta_X(\theta) + \sum_{i=1}^m c_i\,\partial_i\eta_X(\theta);\ (c_1, \dots, c_m)^T \in \mathbb{R}^m\Big\}, \tag{9}$$

which is the plane tangent to the expectation surface $\mathcal{E}_X$. In what follows we also use the matrix $W_X(\hat\theta, \theta) := Q_X(\hat\theta, \theta)\,M_X^{-1}(\hat\theta)\,Q_X(\hat\theta, \theta)$, so that $\det^{1/2} W_X(\hat\theta, \theta) = \det Q_X(\hat\theta, \theta)/\det^{1/2} M_X(\hat\theta)$ is exactly the factor appearing in (19).
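Definitions (7)-(9), together with the derived matrix $W_X = Q_X M_X^{-1} Q_X$, translate directly into code. A sketch reusing the fragments above, with second derivatives approximated by central differences:

```python
# Sketch of the projector (8), the matrix Q_X of (7), and W_X = Q_X M_X^{-1} Q_X,
# reusing eta_X, jacobian_eta and M_info from the previous fragments.
import numpy as np

def hessian_eta(theta, X, h=1e-4):
    # H[i, j, :] approximates the vector d2_{ij} eta_X(theta).
    theta = np.asarray(theta, dtype=float)
    m = theta.size
    H = np.empty((m, m, len(X)))
    for i in range(m):
        for j in range(m):
            ei = np.zeros(m); ei[i] = h
            ej = np.zeros(m); ej[j] = h
            H[i, j] = (eta_X(theta + ei + ej, X) - eta_X(theta + ei - ej, X)
                       - eta_X(theta - ei + ej, X) + eta_X(theta - ei - ej, X)) / (4 * h * h)
    return H

def projector(theta, X, sigma2):
    # P_theta of (8): the <.,.>_X-orthogonal projector onto the tangent plane (9).
    J = jacobian_eta(theta, X)
    Minv = np.linalg.inv(M_info(theta, X, sigma2))
    return J @ Minv @ (J / sigma2[:, None]).T

def Q_matrix(theta_hat, theta, X, sigma2):
    # Eq. (7): {Q}_ij = {M}_ij + [eta(th^)-eta(th)]^T Sigma^{-1} (I-P) d2_{ij} eta(th^).
    M = M_info(theta_hat, X, sigma2)
    u = eta_X(theta_hat, X) - eta_X(theta, X)
    IP = np.eye(len(X)) - projector(theta_hat, X, sigma2)
    H = hessian_eta(theta_hat, X)
    return M + np.einsum('n,ijn->ij', (u / sigma2) @ IP, H)

def W_matrix(theta_hat, theta, X, sigma2):
    Q = Q_matrix(theta_hat, theta, X, sigma2)
    return Q @ np.linalg.inv(M_info(theta_hat, X, sigma2)) @ Q
```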
The matrices $Q_X(\hat\theta, \theta)$ and $W_X(\hat\theta, \theta)$ have several interesting properties:

(a) $Q_X(\hat\theta, \theta) = M_X = W_X(\hat\theta, \theta)$ in linear models.

(b) Asymptotically, $Q_X(\hat\theta, \theta)$ and $W_X(\hat\theta, \theta)$ are equal to $M_X(\hat\theta)$ (Proposition 3).

(c) There is a formal similarity between (7) and $S_X(\hat\theta, y_X)$, the latter being expressed in the form

$$\{S_X(\hat\theta, y_X)\}_{ij} = \{M_X(\hat\theta)\}_{ij} + [\eta_X(\hat\theta) - y_X]^T\,\Sigma_X^{-1}\,\partial^2_{ij}\eta_X(\hat\theta). \tag{10}$$

The difference between $S_X(\hat\theta, y_X)$ and $Q_X(\hat\theta, \theta)$ can be explained by decomposing the residual vector

$$y_X - \eta_X(\hat\theta) = [I - P_{\hat\theta}][y_X - \eta_X(\theta)] + [I - P_{\hat\theta}][\eta_X(\theta) - \eta_X(\hat\theta)].$$

The first component is almost ancillary with respect to $\theta$, therefore it is reasonable that only the second component enters into (7).

(d) The matrix $Q_X(\hat\theta, \theta)$ is positive definite on that subset of $\Theta$ where the approximative density (19) is correct (Proposition 1).
(e) The differences $Q_X(\hat\theta, \theta) - M_X(\hat\theta)$ and $W_X(\hat\theta, \theta) - M_X(\hat\theta)$ are negative definite matrices iff the true mean $\eta_X(\theta)$ and the centres of curvature of $\mathcal{E}_X$ at $\eta_X(\hat\theta)$ are on the same side of the tangent plane $T_X(\hat\theta)$ (Proposition 2 and Corollary 2). The intuitive explanation is that in such a case it is more probable that sample points will be near some centre of curvature; such samples are not very informative as regards the estimation of $\theta$.

(f) The matrix $W_X(\hat\theta, \theta)$ behaves like the squared generalized Jacobi matrix of a reparametrization of the model (2) which leads to the (truncated) $N(0, I)$ distribution of the least squares estimates (Propositions 4 and 5).

(g) If the matrix $M_X(\theta)$ in the model (2) does not depend on $\theta$, then the density $q_X(\hat\theta \mid \theta)$ is specified uniquely from the function $\bar\theta \mapsto Q_X(\bar\theta, \theta)$ (Proposition 6). On the other hand, two designs giving the same expected information matrix may lead to different densities $q_X(\hat\theta \mid \theta)$ (Example 1).

(h) If $q_X(\hat\theta \mid \theta)$ is sufficiently concentrated round $\theta$ and if $M_X(\theta)$ does not depend on $\theta$, then the entropy of $q_X(\hat\theta \mid \theta)$ is nearly equal to

$$-\tfrac{1}{2}\,E_\theta[\log\det W_X(\cdot, \theta)] + \tfrac{1}{2}\,m\log 2\pi + \tfrac{1}{2}\,m \tag{11}$$

(Corollary 5).

(i) Like $M_X(\theta)$, the matrices $Q_X(\hat\theta, \theta)$ and $W_X(\hat\theta, \theta)$ are symmetric second order tensors, i.e. after a regular reparametrization $\beta = \beta(\theta)$ we obtain

$$\tilde W_X(\hat\beta, \beta) = \Big[\frac{\partial\theta}{\partial\beta^T}\Big]_{\hat\beta}^T\, W_X(\hat\theta, \theta)\, \Big[\frac{\partial\theta}{\partial\beta^T}\Big]_{\hat\beta}, \tag{12}$$

and similarly for $Q_X(\hat\theta, \theta)$.
According to (a)-(i) the matrix $W_X(\hat\theta, \theta)$ can be considered as a modification of the information matrix $M_X(\theta)$ in nonlinear models, and can be interpreted as the (multivariate) information gain about $\theta$ when $\hat\theta$ has been obtained by least squares estimation. We could interpret the matrix $Q_X(\hat\theta, \theta)$ similarly, but we shall not use it separately (only within the matrix $W_X(\hat\theta, \theta)$) because of the properties (f) and (h), which are arguments in favour of the use of $W_X(\hat\theta, \theta)$ in experimental design. Proofs and more exact formulations of the properties (a)-(i) are in Section 3.

The D-optimality criterion. In linear models the D-optimality criterion function $[-\log\det M_X]$ determines the entropy of the probability density of $\hat\theta$, which is equal to $\tfrac{1}{2}\{-\log\det M_X + m\log 2\pi + m\}$ (cf. Stone (1959)). Hence, according to (h), we propose to compare designs in nonlinear models according to the value of

$$-E[\log\det W_X(\hat\theta, \theta)] = -\int_\Theta \Big[\int_{U_\theta} \log\det W_X(\hat\theta, \theta)\, q_X(\hat\theta \mid \theta)\, d\hat\theta\Big]\,\pi(\theta)\, d\theta, \tag{13}$$

where $\pi(\theta)$ is some prior density, and where $U_\theta$ is that neighbourhood of $\theta$ where the approximation $q_X(\hat\theta \mid \theta)$ is correct (see Eq. (17) below). From (12) it follows that this criterion is invariant to the change of parameters (like in linear models).
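Criterion (13) can be approximated by Monte Carlo: draw $\theta$ from the prior, simulate L.S. estimates in place of integrating $q_X$ exactly, and average $-\log\det W_X$. A sketch under the assumptions of the previous fragments; the sample sizes and the screening rule for $U_\theta$ are crude illustrative choices.

```python
# Monte Carlo sketch of criterion (13): theta is drawn from the prior, the
# distribution of theta_hat is simulated by re-estimating on fresh data, and
# -log det W_X(theta_hat, theta) is averaged. Sample sizes are arbitrary.
import numpy as np
from scipy.optimize import least_squares

def d_criterion(X, sigma2, prior_sample, n_rep=200, seed=1):
    rng = np.random.default_rng(seed)
    vals = []
    for theta in prior_sample:                      # outer integral, pi(theta)
        for _ in range(n_rep):                      # inner integral, q_X(. | theta)
            y = eta_X(theta, X) + rng.normal(0.0, np.sqrt(sigma2))
            th_hat = least_squares(
                lambda th: (y - eta_X(th, X)) / np.sqrt(sigma2), x0=theta).x
            Q = Q_matrix(th_hat, theta, X, sigma2)
            if np.all(np.linalg.eigvalsh(Q) > 0):   # crude screen for U_theta (Prop. 1)
                W = W_matrix(th_hat, theta, X, sigma2)
                vals.append(-np.linalg.slogdet(W)[1])
    return np.mean(vals)
```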
This shows at once that (13) differs from the criterion (6) given by Hamilton and Watts (1985). They generalize another property of the linear D-optimality, namely the correspondence to the volume of the confidence ellipsoid for $\theta$.

Other optimality criteria. The property (f) (together with (h)) motivates the construction of an optimality criterion from any standard criterion function $\Phi$ (cf. Fedorov (1972) or Pázman (1986)), or from any information functional (cf. Pukelsheim (1980)), according to the formula

$$\int_\Theta \Big[\int_{U_\theta} \Phi[W_X(\hat\theta, \theta)]\, q_X(\hat\theta \mid \theta)\, d\hat\theta\Big]\,\pi(\theta)\, d\theta. \tag{14}$$

In linear models this criterion reduces simply to $\Phi[M_X]$. An example is given at the end of the paper.
3. Properties of $Q_X(\hat\theta, \theta)$ and $W_X(\hat\theta, \theta)$
We shall use the notation

$$\langle a, b\rangle_X := a^T\,\Sigma_X^{-1}\,b, \qquad \|a\|_X := \langle a, a\rangle_X^{1/2}.$$

By definition, a geodesic on the expectation surface $\mathcal{E}_X$ is a curve

$$\gamma : t \in (a, b) \mapsto \gamma(t) \in \mathcal{E}_X$$

such that

(i) $\gamma(t) = \eta_X \circ g(t)$, where $g : t \in (a, b) \mapsto g(t) \in \Theta$ is a mapping with continuous second order derivatives,

(ii) $\|\dot\gamma(t)\|_X = 1$, (15)

(iii) $\ddot\gamma^T(t)\,\Sigma_X^{-1}\,\partial_i\eta_X \circ g(t) = 0 \quad (i = 1, \dots, m)$. (16)

Here we denoted by a dot the derivative with respect to $t$. The radius of curvature of the geodesic $\gamma$ at the point $t$ is equal to

$$\rho_\gamma(t) := \|\ddot\gamma(t)\|_X^{-1}.$$

We shall denote by $e_\gamma(t) := \rho_\gamma(t)\,\ddot\gamma(t)$ the unit vector of curvature at $t$. Let $r$ be the minimal radius of curvature of the geodesics on $\mathcal{E}_X$, and let

$$\mathcal{G}(r) := \{z : z \in \mathbb{R}^N,\ \|z - \eta_X(\theta)\|_X < r\}$$

be a sphere centred in the true mean of $y_X$. Let $U_\theta$ be a neighbourhood of the true $\theta$ defined by

$$U_\theta := \Big\{\theta^* : \theta^* \in \Theta,\ \exists\, y \in \mathcal{G}(r) :\ \theta^* = \arg\min_{\bar\theta \in \Theta} \|y - \eta_X(\bar\theta)\|^2_X\Big\}. \tag{17}$$

Note that

$$\hat\theta := \hat\theta(y_X) := \arg\min_{\bar\theta \in \Theta} \|y_X - \eta_X(\bar\theta)\|^2_X$$
is the L.S. estimate, and that

$$\langle [y_X - \eta_X(\hat\theta)],\ \partial_i\eta_X(\hat\theta)\rangle_X = 0 \qquad (i = 1, \dots, m) \tag{18}$$

are the normal equations for the computation of $\hat\theta$. The function

$$q_X(\hat\theta \mid \theta) := \frac{\det Q_X(\hat\theta, \theta)}{(2\pi)^{m/2}\,\det^{1/2} M_X(\hat\theta)}\ \exp\big\{-\tfrac{1}{2}\,\|P_{\hat\theta}[\eta_X(\hat\theta) - \eta_X(\theta)]\|^2_X\big\} \tag{19}$$

is the approximative probability density of $\hat\theta$ on the set $U_\theta$. This density is invariant to a regular reparametrization $\beta = \beta(\theta)$ on $U_\theta$. The bounds for the difference between $q_X(\hat\theta \mid \theta)$ and the true density of $\hat\theta$ are given in Pázman (1987a). The formula (19) gives the exact probability density of $\hat\theta$ if the model (2) is linear (and $U_\theta = \Theta = \mathbb{R}^m$), and is 'as exact as possible' in the case when the expected information matrix $M_X(\theta)$ does not depend on $\theta$ (Pázman (1987b)).
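Formula (19) is straightforward to evaluate once $Q_X$, $M_X$ and $P_{\hat\theta}$ are available; a sketch reusing the fragments above:

```python
# Sketch of the approximate density (19), built from the fragments above;
# it is meaningful only for theta_hat in the neighbourhood U_theta of (17).
import numpy as np

def q_density(theta_hat, theta, X, sigma2):
    M = M_info(theta_hat, X, sigma2)
    Q = Q_matrix(theta_hat, theta, X, sigma2)
    P = projector(theta_hat, X, sigma2)
    u = P @ (eta_X(theta_hat, X) - eta_X(theta, X))   # P_th^ [eta(th^) - eta(th)]
    m = len(theta_hat)
    norm2 = u @ (u / sigma2)                          # squared ||.||_X norm
    return (np.linalg.det(Q) / ((2 * np.pi) ** (m / 2) * np.sqrt(np.linalg.det(M)))
            * np.exp(-0.5 * norm2))
```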
Proposition 1. The matrix $Q_X(\hat\theta, \theta)$ is positive definite on $U_\theta$.

Proof. We shall omit the subscript $X$ in expressions like $\eta_X(\theta)$, $\mathcal{E}_X$, etc. Take an arbitrary geodesic $\gamma$ on $\mathcal{E}$ such that $\gamma(\hat t) = \eta(\hat\theta)$ for some $\hat t \in (a, b)$, where $\hat\theta$ is generated by a point $y \in \mathcal{G}(r)$, cf. (17). From (16) it follows that $P_{\hat\theta}\,\ddot\gamma(\hat t) = 0$, and from (8) we obtain that $(I - P_{\hat\theta})\,\partial_i\eta(\hat\theta) = 0$. Therefore, using (15) and (16) we obtain

$$\dot g^T(\hat t)\, Q(\hat\theta, \theta)\,\dot g(\hat t) = \frac{1}{2}\,\frac{d^2}{dt^2}\,\|\gamma(t) - \eta(\theta)\|^2_X\Big|_{t=\hat t} = 1 - \frac{\langle [\eta(\theta) - \eta(\hat\theta)],\ e_\gamma(\hat t)\rangle_X}{\rho_\gamma(\hat t)}. \tag{20}$$

Hence $Q(\hat\theta, \theta)$ is proved to be positive definite if we show that

$$\langle [\eta(\theta) - \eta(\hat\theta)],\ e_\gamma(\hat t)\rangle_X < \rho_\gamma(\hat t). \tag{21}$$

From $(I - P_{\hat\theta})\, e_\gamma(\hat t) = e_\gamma(\hat t)$ and from the Schwarz inequality it follows that

$$\langle [\eta(\theta) - \eta(\hat\theta)],\ e_\gamma(\hat t)\rangle_X \le \|\psi_\theta(\hat\theta) - \eta(\hat\theta)\|_X, \tag{22}$$

where we used the notation

$$\psi_\theta(\hat\theta) = \eta(\hat\theta) + (I - P_{\hat\theta})[\eta(\theta) - \eta(\hat\theta)].$$

Geometrically, the point $\psi_\theta(\hat\theta)$ is obtained by projecting the point $\eta(\theta)$ onto the plane

$$A(\hat\theta) := \{z : \langle [z - \eta(\hat\theta)],\ \partial_i\eta(\hat\theta)\rangle_X = 0\ (i = 1, \dots, m)\}.$$

Suppose that

$$\|\psi_\theta(\hat\theta) - \eta(\hat\theta)\|_X \ge r. \tag{23}$$

Since $\psi_\theta(\hat\theta) \in A(\hat\theta)$ (cf. Eq. (18)), and since $A(\hat\theta)$ is orthogonal to the tangent plane (9), there is a sphere

$$\mathcal{S} := \{z : z \in \mathbb{R}^N,\ \|z - c\|_X < r\}$$

such that $\mathcal{S}$ is tangent to $\mathcal{E}$ at $\eta(\hat\theta)$, and that its centre $c$ lies on the straight line connecting $\psi_\theta(\hat\theta)$ with $\eta(\hat\theta)$. From

$$\|y - \eta(\hat\theta)\|_X \le \|y - \eta(\theta)\|_X < r \le \|\psi_\theta(\hat\theta) - \eta(\hat\theta)\|_X$$

it follows that $y \in \mathcal{S}$. However, this is not possible, because $r$ is the minimal radius of curvature, hence $\mathcal{S}$ cannot intersect $\mathcal{E}$. Consequently, (23) does not hold, which together with (22) implies (21). □
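The curvature quantities appearing in this proof can be approximated numerically. The following sketch treats a one-parameter model, where the geodesic is the unit-speed reparametrization of $\theta \mapsto \eta_X(\theta)$ and the curvature vector is the normal component of its second derivative; it reuses the earlier fragments.

```python
# Numerical sketch of the curvature quantities for a one-parameter model:
# the geodesic is the unit-speed reparametrization of theta -> eta_X(theta),
# and its curvature vector is the normal part of the second derivative.
import numpy as np

def curvature(theta, X, sigma2):
    th = np.atleast_1d(np.asarray(theta, dtype=float))
    d1 = jacobian_eta(th, X)[:, 0]                   # d eta / d theta
    d2 = hessian_eta(th, X)[0, 0]                    # d2 eta / d theta2
    ip = lambda a, b: a @ (b / sigma2)               # the inner product <.,.>_X
    speed2 = ip(d1, d1)                              # rescaling makes ||gamma_dot||_X = 1
    IP = np.eye(len(X)) - projector(th, X, sigma2)
    gdd_normal = IP @ (d2 / speed2)                  # (I - P) gamma_dotdot, cf. (16)
    rho = ip(gdd_normal, gdd_normal) ** -0.5         # radius of curvature rho_gamma
    e = rho * gdd_normal                             # unit curvature vector e_gamma
    return rho, e
```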
Proposition 2. The matrix $Q_X(\hat\theta, \theta) - M_X(\hat\theta)$ is positive definite (resp. negative definite) iff the centre of curvature of every geodesic through the point $\eta_X(\hat\theta)$ is on the opposite side (resp. on the same side) of the tangent plane $T_X(\hat\theta)$ as the true mean $\eta_X(\theta)$.

Proof. We use the notation of Proposition 1. From (15) and (20) it follows that

$$\dot g^T(\hat t)\,[Q(\hat\theta, \theta) - M(\hat\theta)]\,\dot g(\hat t) = \frac{\langle [\eta(\hat\theta) - \eta(\theta)],\ e_\gamma(\hat t)\rangle_X}{\rho_\gamma(\hat t)}. \tag{24}$$

The vector $e_\gamma(\hat t)$ is orthogonal to $T_X(\hat\theta)$ and it is pointing from the point $\eta_X(\hat\theta)$ towards the centre of curvature of the geodesic $\gamma$. The required statement follows from (24). □
Corollary 2. The statement in Proposition 2 remains true if we set the matrix $W_X(\hat\theta, \theta)$ instead of $Q_X(\hat\theta, \theta)$.

Proof. There is a nonsingular matrix $V$ such that $V^T M_X(\hat\theta)\,V = I$ and $V^T Q_X(\hat\theta, \theta)\,V = \mathrm{diag}(\lambda_1, \dots, \lambda_m) := \Lambda$ for some $\lambda_1, \dots, \lambda_m$. From Proposition 1 it follows that $\lambda_i > 0$ $(i = 1, \dots, m)$. Evidently, $Q_X(\hat\theta, \theta) - M_X(\hat\theta)$ is p.d. iff $\lambda_i > 1$ $(i = 1, \dots, m)$, i.e. iff

$$V^T[W_X(\hat\theta, \theta) - M_X(\hat\theta)]\,V = \mathrm{diag}(\lambda_1^2 - 1, \dots, \lambda_m^2 - 1)$$

is p.d. But this is true iff $W_X(\hat\theta, \theta) - M_X(\hat\theta)$ is p.d. Similarly, we prove that $Q_X(\hat\theta, \theta) - M_X(\hat\theta)$ is negative definite iff $W_X(\hat\theta, \theta) - M_X(\hat\theta)$ is negative definite. □
Denote by $\hat\theta^{(n)}$ the L.S. estimate of $\theta$ obtained from $n$ independently repeated observations of the model (2). Denote by $M_X^{(n)}$, $Q_X^{(n)}$, $W_X^{(n)}$ the corresponding matrices in this larger model.
Proposition 3. We have

$$\lim_{n\to\infty}\,\big\|[Q_X^{(n)}(\hat\theta^{(n)}, \theta) - M_X^{(n)}(\hat\theta^{(n)})]\,[M_X^{(n)}(\hat\theta^{(n)})]^{-1}\big\| = 0,$$
$$\lim_{n\to\infty}\,\big\|[W_X^{(n)}(\hat\theta^{(n)}, \theta) - M_X^{(n)}(\hat\theta^{(n)})]\,[M_X^{(n)}(\hat\theta^{(n)})]^{-1}\big\| = 0.$$

Proof. By $M_X(\hat\theta^{(n)})$ we denote the information matrix in the original model (2) when setting the new estimate $\hat\theta^{(n)}$ instead of $\hat\theta$, etc. We have

$$\frac{1}{n}\,Q_X^{(n)}(\hat\theta^{(n)}, \theta) = M_X(\hat\theta^{(n)}) + [\eta_X(\hat\theta^{(n)}) - \eta_X(\theta)]^T\,\Sigma_X^{-1}\,[I - P_{\hat\theta^{(n)}}]\,\partial^2\eta_X(\hat\theta^{(n)}).$$

From the strong consistency of $\hat\theta^{(n)}$ on $\Theta$ we obtain that $\lim_{n\to\infty}\,\eta_X(\hat\theta^{(n)}) - \eta_X(\theta) = 0$ (Jennrich (1969)). Thus we obtain both required limits. □
Denote $w(\hat\theta) := P_{\hat\theta}[\eta_X(\hat\theta) - \eta_X(\theta)]$. This vector is parallel to the tangent plane $T_X(\hat\theta)$. According to Thorpe (1979), Ch. 8, the expression

$$D_i\,w(\hat\theta) := P_{\hat\theta}\,\partial_i w(\hat\theta)$$

is the $i$-th partial covariant derivative (= the inner derivative in $\mathcal{E}_X$) of the vector $w(\hat\theta)$.
Proposition 4. We have

$$\{W_X(\hat\theta, \theta)\}_{ij} = \langle D_i\,w(\hat\theta),\ D_j\,w(\hat\theta)\rangle_X \tag{25}$$

and consequently

$$q_X(\hat\theta \mid \theta) = \frac{\det^{1/2}\big[\{\langle D_i\,w(\hat\theta),\ D_j\,w(\hat\theta)\rangle_X\}_{i,j=1}^m\big]}{(2\pi)^{m/2}}\ \exp\big\{-\tfrac{1}{2}\,\|w(\hat\theta)\|^2_X\big\}. \tag{26}$$

Proof. We have

$$\partial_i M_X^{-1}(\theta) = -M_X^{-1}(\theta)\,\partial_i M_X(\theta)\,M_X^{-1}(\theta).$$

Hence, using (3) and (8) we can verify that

$$\{Q_X(\hat\theta, \theta)\}_{ij} = \langle \partial_i\eta_X(\hat\theta),\ D_j\,w(\hat\theta)\rangle_X.$$

Eq. (25) then follows from (8), and (26) follows from (19). □
We can better understand the content of Proposition 4 in the following special case.

Proposition 5. If $M_X(\theta)$ does not depend on $\theta$ and if $\hat\theta$ is distributed according to $q_X(\hat\theta \mid \theta)$ on $U_\theta$, then the random vector

$$v(\hat\theta) := M_X^{-1/2}\,\partial\eta_X^T(\hat\theta)\,\Sigma_X^{-1}\,[\eta_X(\hat\theta) - \eta_X(\theta)]$$

has the truncated normal distribution $N(0, I)$. The Jacobian of the mapping $\hat\theta \in U_\theta \mapsto v(\hat\theta)$ is equal to $\det^{1/2} W_X(\hat\theta, \theta)$.

Proof. From (8) we obtain

$$v^T(\hat\theta)\,v(\hat\theta) = \|P_{\hat\theta}[\eta_X(\hat\theta) - \eta_X(\theta)]\|^2_X.$$

By differentiation of (3) we obtain $\partial^2_{ij}\eta_X^T(\theta)\,\Sigma_X^{-1}\,\partial_k\eta_X(\theta) = 0$. Hence, using (8), we can write

$$\frac{\partial v(\hat\theta)}{\partial\hat\theta^T} = M_X^{-1/2}\,Q_X(\hat\theta, \theta), \qquad \det\Big[\frac{\partial v(\hat\theta)}{\partial\hat\theta^T}\Big] = \det^{1/2} W_X(\hat\theta, \theta).$$

Thus the statement follows from (19) and from Proposition 1. □
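Proposition 5 suggests a direct numerical check: for a model with constant $M_X$, the determinant of the Jacobian of $\hat\theta \mapsto v(\hat\theta)$ should coincide with $\det^{1/2} W_X(\hat\theta, \theta)$. A sketch, again reusing the earlier fragments:

```python
# Numerical check of the Jacobian claim of Proposition 5 for a model with
# constant M_X: |det dv/dtheta_hat| should equal det^{1/2} W_X(theta_hat, theta).
import numpy as np
from scipy.linalg import sqrtm

def v_map(theta_hat, theta, X, sigma2):
    M_inv_sqrt = np.linalg.inv(sqrtm(M_info(theta_hat, X, sigma2)))
    J = jacobian_eta(theta_hat, X)
    u = eta_X(theta_hat, X) - eta_X(theta, X)
    return M_inv_sqrt @ (J.T @ (u / sigma2))

def jacobian_v(theta_hat, theta, X, sigma2, h=1e-6):
    m = len(theta_hat)
    Jv = np.empty((m, m))
    for j in range(m):
        e = np.zeros(m); e[j] = h
        Jv[:, j] = (v_map(theta_hat + e, theta, X, sigma2)
                    - v_map(theta_hat - e, theta, X, sigma2)) / (2 * h)
    return Jv

# abs(np.linalg.det(jacobian_v(...))) ~= np.sqrt(np.linalg.det(W_matrix(...)))
# on such models.
```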
Corollary 5. If $q_X(\hat\theta \mid \theta)$ is so concentrated round $\theta$ that the truncation of the normal distribution of $v(\hat\theta)$ can be neglected, then

$$\int_{U_\theta} q_X(\hat\theta \mid \theta)\,\log q_X(\hat\theta \mid \theta)\, d\hat\theta = \frac{1}{2}\Big[\int_{U_\theta} [\log\det W_X(\hat\theta, \theta)]\, q_X(\hat\theta \mid \theta)\, d\hat\theta - m\log 2\pi - m\Big].$$

The proof is obtained when the substitution $\hat\theta \mapsto v(\hat\theta)$ is used in the first integral. □

Proposition 6. Under the assumption of Proposition 5 we can write

$$q_X(\hat\theta \mid \theta) = \frac{\det^{1/2}\big[Q_X(\hat\theta, \theta)\,Q_X^{-1}(\hat\theta, \hat\theta)\,Q_X(\hat\theta, \theta)\big]}{(2\pi)^{m/2}}\ \exp\Big\{-\tfrac{1}{2}\,(\hat\theta - \theta)^T\Big[\int_0^1\!\!\int_0^1 Q_X^T[(1-t)\theta + t\hat\theta,\ \theta]\,Q_X^{-1}(\hat\theta, \hat\theta)\,Q_X[(1-u)\theta + u\hat\theta,\ \theta]\, dt\, du\Big](\hat\theta - \theta)\Big\}.$$

Proof. We have

$$v(\hat\theta) = \int_0^1 M_X^{-1/2}\,Q_X[(1-t)\theta + t\hat\theta,\ \theta]\, dt\ (\hat\theta - \theta).$$
Hence, the statement is a consequence of Proposition 5. □

Example 1. Take $\mathcal{X} = \{x_1, x_2, x_3, x_4\}$, $\Theta = (0, 2\pi)$, $\sigma(x_i) = 1$ $(i = 1, \dots, 4)$,

$$\eta(\theta, x_1) = \theta, \qquad \eta(\theta, x_2) = \sqrt{2}\,\theta, \qquad \eta(\theta, x_3) = \sqrt{3}\,\cos\theta, \qquad \eta(\theta, x_4) = \sqrt{3}\,\sin\theta.$$

Consider two designs, $X := (x_1, x_2)$ and $Z := (x_3, x_4)$. Here $\mathcal{E}_X$ is a segment of a straight line, and $\mathcal{E}_Z$ is a circle. From (3) we obtain

$$M_X(\theta) = 3 = M_Z(\theta) \qquad (\theta \in (0, 2\pi)).$$

When the design $X$ is used, we have, according to (19),

$$\hat\theta - \theta \sim N(0, \tfrac{1}{3}) \quad \text{on } (0, 2\pi).$$

When $Z$ is applied, we obtain from Proposition 5 that

$$\sin(\hat\theta - \theta) \sim N(0, \tfrac{1}{3}) \tag{27}$$

on the set $U_\theta \supset \{\hat\theta : |\hat\theta - \theta| < \tfrac{1}{2}\pi\}$. We see that $\hat\theta$ is more concentrated round $\theta$ in the first case than in the second case. Consequently, the design $X$ is better than $Z$, despite their having the same expected information matrix. We have

$$W_X(\hat\theta, \theta) = 3 \qquad (\hat\theta, \theta \in (0, 2\pi)),$$
$$W_Z(\hat\theta, \theta) = 3\,\cos^2(\hat\theta - \theta) \qquad (\hat\theta, \theta \in (0, 2\pi)).$$

Hence, $W_Z(\hat\theta, \theta) \le W_X(\hat\theta, \theta)$. Further

$$E_\theta[\log\det W_Z(\cdot, \theta)] < \log 3 = E_\theta[\log\det W_X(\cdot, \theta)].$$

It follows that $X$ is better than $Z$ with respect to the proposed generalized D-optimality criterion.
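Example 1 is easy to reproduce by simulation; the following self-contained sketch compares the empirical dispersion of $\hat\theta$ under the two designs (the seed and the replication count are arbitrary choices).

```python
# Self-contained simulation of Example 1: both designs have M = 3, but the
# L.S. estimate is visibly less concentrated under the circle design Z.
import numpy as np

rng = np.random.default_rng(2)
theta, n_sim = 1.0, 20000

# Design X = (x1, x2): eta_X(theta) = (theta, sqrt(2)*theta). The model is
# linear and theta_hat = (y1 + sqrt(2)*y2)/3, so theta_hat - theta ~ N(0, 1/3).
yX = np.array([theta, np.sqrt(2) * theta]) + rng.normal(size=(n_sim, 2))
th_X = (yX[:, 0] + np.sqrt(2) * yX[:, 1]) / 3.0

# Design Z = (x3, x4): eta_Z(theta) = sqrt(3)*(cos theta, sin theta). The L.S.
# estimate is the angle of the observed point (radial projection on the circle).
yZ = np.sqrt(3) * np.array([np.cos(theta), np.sin(theta)]) + rng.normal(size=(n_sim, 2))
th_Z = np.arctan2(yZ[:, 1], yZ[:, 0])

print("var under X:", np.var(th_X - theta))                           # ~ 1/3
print("var under Z:", np.var(np.angle(np.exp(1j * (th_Z - theta)))))  # > 1/3
```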
Acknowledgement

The author wishes to thank the Coordinating Editor and the referees for helpful remarks which led to an improvement of the earlier version of this paper.
References

Bates, D.M. and D.G. Watts (1980). Relative curvature measures of nonlinearity. J. Roy. Statist. Soc. Ser. B 42, 1-25.

Beale, E.M.L. (1960). Confidence regions in non-linear estimation. J. Roy. Statist. Soc. Ser. B 22, 41-88.

Box, M.J. (1971). Bias in nonlinear estimation. J. Roy. Statist. Soc. Ser. B 33, 171-201.

Fedorov, V.V. (1972). Theory of Optimal Experiments. Academic Press, New York.

Ford, I., C.P. Kitson and D.M. Titterington (1987). Recent advances in nonlinear experimental design. Technometrics, to appear.

Hamilton, D.C., D.G. Watts and D.M. Bates (1982). Accounting for intrinsic nonlinearity in nonlinear regression parameter inference regions. Ann. Statist. 10, 386-393.

Hamilton, D.C. and D.G. Watts (1985). A quadratic design criterion for precise estimation in nonlinear regression models. Technometrics 27, 241-250.

Hougaard, P. (1985). Saddlepoint approximations for curved exponential families. Statist. Probab. Lett. 3, 161-166.

Jennrich, R.J. (1969). Asymptotic properties of non-linear least-squares estimators. Ann. Math. Statist. 40, 633-643.

Pázman, A. (1984). Probability distribution of the multivariate nonlinear least squares estimates. Kybernetika 20, 209-230.

Pázman, A. (1986). Foundations of Optimal Experimental Design. Reidel, Dordrecht, and Veda, Bratislava.

Pázman, A. (1987a). On formulas for the distribution of nonlinear L.S. estimates. Statistics 18, 3-15.

Pázman, A. (1987b). Flat Gaussian nonlinear regression models. In: Proc. of the International Workshop "Model Oriented Data Analysis", Wartburg, March 1987. Lecture Notes in Economics and Mathematical Systems No. 297. Springer Verlag, Berlin, 120-124.

Pukelsheim, F. (1980). On linear regression designs which maximize information. J. Statist. Plann. Inference 4, 339-364.

Ratkovsky, D.A. (1984). Nonlinear Regression Modelling. Marcel Dekker, New York-Basel.

Skovgaard, I.M. (1985). Large deviation approximations for maximum likelihood estimators. Probab. Math. Statist. 6(2), 89-107.

Stone, M. (1959). Application of a measure of information to the design and comparison of regression experiments. Ann. Math. Statist. 30, 55-70.

Thorpe, J.A. (1979). Elementary Topics in Differential Geometry. Springer Verlag, New York-Heidelberg-Berlin.