ON INFORMATION MATRICES IN NONLINEAR EXPERIMENTAL DESIGN

Andrej PÁZMAN

Mathematical Institute, Slovak Academy of Sciences, Bratislava, Czechoslovakia

Received 26 March 1986; revised manuscript received 5 February 1988
Recommended by F. Pukelsheim

Journal of Statistical Planning and Inference 21 (1989) 253-263, North-Holland

Abstract: In the Gaussian nonlinear regression model we propose approximative information matrices which can be interpreted as measuring the information about the parameters in the case when they are estimated by least squares. The whole argumentation is based on a nonasymptotical approximative probability density of the least squares estimates. Proposals for nonlinear experimental design are presented.

AMS Classification Numbers: Primary 62J02; Secondary 62K05, 62F11.

Key words: Nonlinear regression; nonlinear least squares; differential geometry; experimental design.

1. Introduction

We consider a nonlinear regression model of the form

$$y(x) = \eta(\theta, x) + \varepsilon(x) \qquad (x \in \mathcal{X},\ \theta \in \Theta), \qquad \varepsilon(x) \sim N(0, \sigma^2(x)). \tag{1}$$

Here $x$ is a design point (= the vector of explanatory variables) taken from a design space $\mathcal{X}$, $y(x)$ is the observed variable (= the response variable), $\varepsilon(x)$ is the error, $\theta := (\theta_1, \dots, \theta_m)^T \in \Theta$ is the vector of unknown parameters, and $x \in \mathcal{X} \mapsto \eta(\theta, x) \in \mathbb{R}^1$ is the regression function (= the mean response).

Regularity assumptions: The parameter space $\Theta$ is a bounded open convex subset of $\mathbb{R}^m$. For every $x \in \mathcal{X}$ the function $\theta \in \Theta \mapsto \eta(\theta, x)$ has continuous second order derivatives $\partial^2_{ij}\eta(\theta, x) := \partial^2 \eta(\theta, x)/\partial\theta_i\,\partial\theta_j$.

A design (exact, of size $N$) is an $N$-tuple $X := (x_1, \dots, x_N)$ of points of $\mathcal{X}$, which are not necessarily distinct. We use the notation $y_X := (y(x_1), \dots, y(x_N))^T$, $\Sigma_X := \operatorname{diag}(\sigma^2(x_1), \dots, \sigma^2(x_N))$, and similarly we write $\varepsilon_X$, $\eta_X(\theta)$. The random variables $y(x_1), \dots, y(x_N)$ are supposed to be observed independently, hence we obtain from (1)

$$y_X = \eta_X(\theta) + \varepsilon_X, \qquad \varepsilon_X \sim N(0, \Sigma_X). \tag{2}$$

0378-3758/89/$3.50 © 1989, Elsevier Science Publishers B.V. (North-Holland)
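A minimal numerical sketch of model (2) and of least squares estimation in it. The mean response used below is not from the paper; it is an assumed toy function (exponential decay) chosen only to illustrate the notation $y_X = \eta_X(\theta) + \varepsilon_X$.

```python
import numpy as np

# Toy instance of model (2). eta(theta, x) is an ASSUMED illustration,
# not a model from the paper.
def eta(theta, x):
    return np.exp(-theta * x)

X = np.array([0.5, 1.0, 2.0])            # exact design of size N = 3
sigma2 = np.full(X.size, 0.04)           # diagonal of Sigma_X
theta_true = 1.2

y_X = eta(theta_true, X)                 # noiseless observation, for checking
# L.S. estimate: minimize ||y_X - eta_X(theta)||^2 in the Sigma_X^{-1} norm,
# here by a simple grid search over the (bounded) parameter space
grid = np.linspace(0.1, 3.0, 20001)
ss = (((y_X[None, :] - eta(grid[:, None], X)) ** 2) / sigma2).sum(axis=1)
theta_hat = grid[np.argmin(ss)]
print(theta_hat)                         # close to theta_true on noiseless data
```

On noiseless data the minimizer recovers the true parameter up to the grid resolution, which is all the sketch is meant to show.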

Regularity assumptions on the considered designs: (i) the mapping $\theta \in \Theta \mapsto \eta_X(\theta)$ is one-to-one (= the identifiability of the parameters); (ii) the vectors $\partial_i \eta_X(\theta) := \partial \eta_X(\theta)/\partial\theta_i$ $(i = 1, \dots, m)$ are linearly independent on the set $\Theta$ (= the nonsingularity of the design $X$). As a consequence of (ii), the expected (Fisher) information matrix

$$\{M_X(\theta)\}_{ij} := \partial_i \eta_X^T(\theta)\, \Sigma_X^{-1}\, \partial_j \eta_X(\theta) \tag{3}$$

is nonsingular. By $S_X(\theta, y_X)$ we denote the observed information matrix in the model (2),

$$\{S_X(\theta, y_X)\}_{ij} := -\partial^2_{ij}\, \ell(y_X, \theta), \tag{4}$$

where $\ell(y_X, \theta)$ is the log-likelihood of the model. (As known, $M_X(\theta) = E_\theta\{S_X(\theta, \cdot)\}$.)

A recent survey of nonlinear experimental design by Ford et al. (1987) shows that most design methods in nonlinear models are based on the use of $M_X(\hat\theta)$ or $S_X(\hat\theta, y_X)$, where $\hat\theta$ is some preliminary L.S. estimate of $\theta$. The most popular methods are those based on $M_X(\hat\theta)$. This is justified asymptotically, since $M_X^{-1}(\theta)$ is the asymptotic variance matrix of the L.S. estimate. In nonasymptotical situations, the designs minimizing some $\Phi[M_X(\hat\theta)]$ (say $\Phi[M_X(\hat\theta)] = -\log\det M_X(\hat\theta)$) are often considered as locally optimal (locally = for the true $\theta$ being in the neighbourhood of the estimate $\hat\theta$). This approach may be incorrect in many experiments, since, nonasymptotically, the variance matrix of the L.S. estimates differs from $M_X^{-1}(\theta)$. Moreover, as shown in the example in Section 3, two designs giving the same and constant information matrix may not be equivalent, since the two probability distributions of the L.S. estimates of $\theta$ are different.

A simple explanation why $M_X(\theta)$ is not a correct nonasymptotical measure of information in nonlinear models is that only the first order derivatives of $\eta_X(\theta)$ enter into (3), while the nonlinearity of the model (2) is expressed by the second order derivatives. Corrections of linear confidence regions (Beale (1960)) and of the bias of the L.S. estimates (Box (1971)) have been done using the second order derivatives of $\eta_X(\theta)$. A deeper understanding of these corrections has been obtained using the notion of curvatures of the expectation surface

$$\mathcal{E}_X := \{\eta_X(\theta) : \theta \in \Theta\} \tag{5}$$

by Bates and Watts (1980) (cf. also Ratkovsky (1984), Ch. 1.4, and Hamilton et al. (1982)). A generalization of the D-optimality criterion from linear to nonlinear models has been obtained by Hamilton and Watts (1985), using a local quadratic approximation (by the Taylor formula) of the volume of a confidence region for $\theta$. The criterion given in Eq. (3.2) of their paper has the form

$$\det M_X(\hat\theta)\,\big(1 + b\, \operatorname{tr} H_X(\hat\theta)\big), \tag{6}$$

where $b$ is a constant and $H_X(\hat\theta)$ is a matrix depending on the parameter effect nonlinearity of Bates and Watts (1980); hence the criterion (6) is not invariant to the change of parameters (cf. also Ford et al. (1987)). The criterion (6), as well as the criteria based on $M_X(\hat\theta)$ or on $S_X(\hat\theta, y_X)$, depends on the estimate $\hat\theta$, i.e. these criteria are local. The dependence on the locality of $\theta$ is natural, since different parts of the function $\theta \in \Theta \mapsto \eta_X(\theta)$ can clearly give a different amount of information about $\theta$. This dependence can be removed only using some prior information about $\theta$.
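The point that formula (3) uses only first order derivatives can be checked numerically on the two designs used later in Example 1 (Section 3): both share the same constant information matrix $M = 3$, even though their expectation surfaces are very different.

```python
import numpy as np

# Formula (3) for the two designs of Example 1 (Section 3):
# design X has eta_X(theta) = (theta, sqrt(2)*theta); design Z has
# eta_Z(theta) = (sqrt(3)*cos(theta), sqrt(3)*sin(theta)); sigma^2 = 1, m = 1.
d_eta_X = lambda th: np.array([1.0, np.sqrt(2.0)])
d_eta_Z = lambda th: np.array([-np.sqrt(3.0)*np.sin(th), np.sqrt(3.0)*np.cos(th)])

def M(d_eta, th):
    g = d_eta(th)          # only FIRST order derivatives enter (3)
    return g @ g           # Sigma_X = I and m = 1, so M is a scalar here

for th in (0.3, 1.0, 2.5):
    assert np.isclose(M(d_eta_X, th), 3.0)
    assert np.isclose(M(d_eta_Z, th), 3.0)
print("both designs have constant information matrix M = 3")
```

The two designs are nevertheless not equivalent for estimation, which is exactly the point of Example 1.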

2. The new approach

We start from a nonasymptotical (although approximative) probability density of the L.S. estimate $\hat\theta$ (Eq. (19) given below), presented in Pázman (1984). An analogical asymptotic formula has been obtained by independent considerations in Skovgaard (1985) and in Hougaard (1985). Details concerning the precision of the nonasymptotical approximation are in Pázman (1987a). The density $q_X(\hat\theta \mid \theta)$ given in (19) contains the determinant of

$$\{Q_X(\hat\theta, \theta)\}_{ij} := \{M_X(\hat\theta)\}_{ij} + [\eta_X(\hat\theta) - \eta_X(\theta)]^T\, \Sigma_X^{-1}\, [I - P_X^{\hat\theta}]\, \partial^2_{ij}\eta_X(\hat\theta), \tag{7}$$

where

$$P_X^{\hat\theta} := \sum_{i,j=1}^m \partial_i \eta_X(\hat\theta)\, \{M_X^{-1}(\hat\theta)\}_{ij}\, \partial_j \eta_X^T(\hat\theta)\, \Sigma_X^{-1} \tag{8}$$

is the projector onto the plane

$$T_X(\hat\theta) := \Big\{ z : z \in \mathbb{R}^N,\ z = \eta_X(\hat\theta) + \sum_{i=1}^m c_i\, \partial_i \eta_X(\hat\theta);\ (c_1, \dots, c_m)^T \in \mathbb{R}^m \Big\} \tag{9}$$

which is tangent to the expectation surface $\mathcal{E}_X$. Together with $Q_X(\hat\theta, \theta)$ we shall consider the matrix $W_X(\hat\theta, \theta) := Q_X(\hat\theta, \theta)\, M_X^{-1}(\hat\theta)\, Q_X(\hat\theta, \theta)$. The matrices $Q_X(\hat\theta, \theta)$ and $W_X(\hat\theta, \theta)$ have several interesting properties:

(a) $Q_X(\hat\theta, \theta) = M_X = W_X(\hat\theta, \theta)$ in linear models.

(b) Asymptotically, $Q_X(\hat\theta, \theta)$ and $W_X(\hat\theta, \theta)$ are equal to $M_X(\hat\theta)$ (Proposition 3).

(c) There is a formal similarity between (7) and $S_X(\hat\theta, y_X)$, the latter being expressed in the form

$$\{S_X(\hat\theta, y_X)\}_{ij} = \{M_X(\hat\theta)\}_{ij} + [\eta_X(\hat\theta) - y_X]^T\, \Sigma_X^{-1}\, \partial^2_{ij}\eta_X(\hat\theta). \tag{10}$$

The difference between $S_X(\hat\theta, y_X)$ and $Q_X(\hat\theta, \theta)$ can be explained by decomposing the residual vector

$$y_X - \eta_X(\hat\theta) = [I - P_X^{\hat\theta}][y_X - \eta_X(\theta)] + [I - P_X^{\hat\theta}][\eta_X(\theta) - \eta_X(\hat\theta)]. \tag{11}$$

The first component is almost ancillary with respect to $\theta$, therefore it is reasonable that only the second component enters into (7).

(d) The matrix $Q_X(\hat\theta, \theta)$ is positive definite on that subset $U_\theta$ of $\Theta$ (see Eq. (17)) where the approximative density (19) is correct (Proposition 1).

(e) The differences $Q_X(\hat\theta, \theta) - M_X(\hat\theta)$ and $W_X(\hat\theta, \theta) - M_X(\hat\theta)$ are negative definite matrices iff the true mean $\eta_X(\theta)$ and the centres of curvature of $\mathcal{E}_X$ at $\eta_X(\hat\theta)$ are on the same side of the tangent plane $T_X(\hat\theta)$ (Proposition 2 and Corollary 2). The intuitive explanation is that in such a case it is more probable that sample points will be near some centre of curvature; such samples are not very informative as regards the estimation of $\theta$.

(f) The matrix $W_X(\hat\theta, \theta)$ behaves like the squared generalized Jacobi matrix of a reparametrization of the model (2) which leads to the (truncated) $N(0, I)$ distribution of the least squares estimates (Propositions 4 and 5).

(g) If the matrix $M_X(\theta)$ in the model (2) does not depend on $\theta$, then the density $q_X(\hat\theta \mid \theta)$ is specified uniquely from the function $\hat\theta \mapsto Q_X(\hat\theta, \theta)$ (Proposition 6). On the other hand, two designs giving the same expected information matrix may lead to different densities $q_X(\hat\theta \mid \theta)$ (Example 1).

(h) If $q_X(\hat\theta \mid \theta)$ is sufficiently concentrated round $\theta$ and if $M_X(\theta)$ does not depend on $\theta$, then the entropy of $q_X(\hat\theta \mid \theta)$ is nearly equal to

$$-\tfrac{1}{2}\, E_\theta[\log\det W_X(\cdot, \theta)] + \tfrac{1}{2}\, m \log 2\pi + \tfrac{1}{2}\, m$$

(Corollary 5).

(i) Like $M_X(\theta)$, the matrices $Q_X(\hat\theta, \theta)$ and $W_X(\hat\theta, \theta)$ are symmetric second order tensors, i.e. after a regular reparametrization $\beta = \beta(\theta)$ we obtain

$$\tilde Q_X(\hat\beta, \beta) = \frac{\partial \theta^T}{\partial \beta}\, Q_X(\hat\theta, \theta)\, \frac{\partial \theta}{\partial \beta^T}, \tag{12}$$

and similarly for $W_X(\hat\theta, \theta)$.

According to (a)-(i) the matrix $W_X(\hat\theta, \theta)$ can be considered as a modification of the information matrix $M_X(\theta)$ in nonlinear models, and can be interpreted as the (multivariate) information gain about $\theta$ when $\hat\theta$ has been obtained by least squares estimation. We could interpret the matrix $Q_X(\hat\theta, \theta)$ similarly, but we shall not use it separately (only within the matrix $W_X(\hat\theta, \theta)$) because of the properties (f) and (h), which are arguments in favour of the use of $W_X(\hat\theta, \theta)$ in experimental design. Proofs and more exact formulations of the properties (a)-(i) are in Section 3.

The D-optimality criterion. In linear models the D-optimality criterion function $[-\log\det M_X]$ determines the entropy of the probability density of $\hat\theta$, which is equal to $\tfrac{1}{2}\{-\log\det M_X + m \log 2\pi + m\}$ (cf. Stone (1959)). Hence, according to (h), we propose to compare designs in nonlinear models according to the value of

$$-E[\log\det W_X(\hat\theta, \theta)] = -\int_\Theta \int_{U_\theta} \log\det W_X(\hat\theta, \theta)\, q_X(\hat\theta \mid \theta)\, d\hat\theta\; \pi(\theta)\, d\theta, \tag{13}$$

where $\pi(\theta)$ is some prior density, and where $U_\theta$ is that neighbourhood of $\theta$ where the approximation $q_X(\hat\theta \mid \theta)$ is correct (see Eq. (17) below). From (12) it follows that this criterion is invariant to the change of parameters (like in linear models).
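A brute-force Monte Carlo sketch of the criterion (13) for the two designs of Example 1 (Section 3), taking the prior $\pi$ as a point mass at the true $\theta$ so that the outer integral drops out. The distributions used below are the ones derived in Example 1: $W_X = 3$ identically, while $W_Z(\hat\theta, \theta) = 3\cos^2(\hat\theta - \theta)$ with $\sin(\hat\theta - \theta) \sim N(0, \tfrac{1}{3})$ truncated to $|\hat\theta - \theta| < \pi/2$.

```python
import numpy as np

# Monte Carlo evaluation of -E[log det W] for the designs X and Z of Example 1,
# with pi a point prior at the true theta (so only the inner integral remains).
rng = np.random.default_rng(0)
xi = rng.normal(0.0, np.sqrt(1.0/3.0), 200_000)
delta = np.arcsin(xi[np.abs(xi) < 1.0])        # th_hat - th under design Z (truncated)

crit_X = -np.log(3.0)                           # W_X = 3 identically
crit_Z = -np.mean(np.log(3.0 * np.cos(delta)**2))
print(crit_X, crit_Z)                           # crit_X is smaller: X is the better design
```

Smaller values of (13) are better, so the simulation reproduces the conclusion of Example 1 that $X$ beats $Z$ although both have the same $M$.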


This shows at once that (13) differs from the criterion (6) given by Hamilton and Watts (1985). They generalize another property of the linear D-optimality, namely the correspondence to the volume of the confidence ellipsoid for 8. Other optimality criteria. The property (f) (together with (h)) motivates to construct an optimality criterion from any standard criterion function C#J(cf. Fedorov (1972) or PBzman (1986)), or from any information functional (cf. Pukelsheim (1980)) according to the formula

In linear models this criterion reduces simply to @[M,]. An example is given at the end of the paper.
3. Properties of $Q_X(\hat\theta, \theta)$ and $W_X(\hat\theta, \theta)$

We shall use the notation

$$\langle a, b\rangle_X := a^T \Sigma_X^{-1} b, \qquad \|a\|_X := \langle a, a\rangle_X^{1/2}.$$

By definition, a geodesic on the expectation surface $\mathcal{E}_X$ is a curve $\gamma : t \in (a, b) \mapsto \gamma(t) \in \mathcal{E}_X$ such that

(i) $\gamma(t) = \eta_X \circ g(t)$, where $g : t \in (a, b) \mapsto g(t) \in \Theta$ is a mapping with continuous second order derivatives, \hfill (15)

(ii) $\|\dot\gamma(t)\|_X = 1$,

(iii) $\ddot\gamma^T(t)\, \Sigma_X^{-1}\, \partial_i \eta_X \circ g(t) = 0 \qquad (i = 1, \dots, m)$. \hfill (16)

Here we denoted by a dot the derivative with respect to $t$. The radius of curvature of the geodesic $\gamma$ at the point $t$ is equal to

$$\rho_\gamma(t) := \|\ddot\gamma(t)\|_X^{-1}.$$

We shall denote by $e_\gamma(t) := \rho_\gamma(t)\, \ddot\gamma(t)$ the unit vector of curvature at $t$. Let $r$ be the minimal radius of curvature of the geodesics on $\mathcal{E}_X$, and let

$$\mathcal{G}(r) := \{z : z \in \mathbb{R}^N,\ \|z - \eta_X(\theta)\|_X < r\}$$

be a sphere centred in the true mean of $y_X$. Let $U_\theta$ be a neighbourhood of the true $\theta$ defined by

$$U_\theta := \Big\{\theta^* : \theta^* \in \Theta,\ \exists\, y \in \mathcal{G}(r) :\ \theta^* = \arg\min_{\bar\theta \in \Theta} \|y - \eta_X(\bar\theta)\|_X^2 \Big\}. \tag{17}$$

Note that

$$\hat\theta := \hat\theta(y_X) := \arg\min_{\theta \in \Theta} \|y_X - \eta_X(\theta)\|_X^2$$

is the L.S. estimate, and that

$$\big\langle [y_X - \eta_X(\hat\theta)],\ \partial_i \eta_X(\hat\theta)\big\rangle_X = 0 \qquad (i = 1, \dots, m) \tag{18}$$

are the normal equations for the computation of $\hat\theta$.

The function

$$q_X(\hat\theta \mid \theta) := \frac{\det Q_X(\hat\theta, \theta)}{(2\pi)^{m/2}\, \det^{1/2} M_X(\hat\theta)}\ \exp\Big\{-\tfrac{1}{2}\, \big\| P_X^{\hat\theta}\,[\eta_X(\theta) - \eta_X(\hat\theta)]\big\|_X^2 \Big\} \tag{19}$$

is the approximative probability density of $\hat\theta$ on the set $U_\theta$. This density is invariant to a regular reparametrization $\beta = \beta(\theta)$ on $U_\theta$. The bounds for the difference between $q_X(\hat\theta \mid \theta)$ and the true density of $\hat\theta$ are given in Pázman (1987a). The formula (19) gives the exact probability density of $\hat\theta$ if the model (2) is linear (and $U_\theta = \Theta = \mathbb{R}^m$), and is 'as exact as possible' in the case when the expected information matrix $M_X(\theta)$ does not depend on $\theta$ (Pázman (1987b)).
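The density (19) can be evaluated directly from its ingredients (3), (7) and (8). The sketch below does this for the circle design $Z$ of Example 1 ($m = 1$, $\Sigma = I$) and compares the result with the closed form $\sqrt{3}\cos(d)\,e^{-\frac{3}{2}\sin^2 d}/\sqrt{2\pi}$, $d = \hat\theta - \theta$, $|d| < \pi/2$, which follows for this design by elementary algebra.

```python
import numpy as np

# Direct evaluation of density (19) on the circle design Z of Example 1,
# checked against the closed form for that design.
s3 = np.sqrt(3.0)
eta   = lambda t: s3 * np.array([np.cos(t), np.sin(t)])
deta  = lambda t: s3 * np.array([-np.sin(t), np.cos(t)])
d2eta = lambda t: -eta(t)

def q19(th_hat, th):
    M = deta(th_hat) @ deta(th_hat)                    # formula (3)
    P = np.outer(deta(th_hat), deta(th_hat)) / M       # projector (8)
    assert np.allclose(P @ P, P)                       # P is idempotent
    Q = M + (eta(th_hat) - eta(th)) @ (np.eye(2) - P) @ d2eta(th_hat)   # (7)
    w = P @ (eta(th) - eta(th_hat))                    # projected shift in (19)
    return Q / (np.sqrt(2.0*np.pi) * np.sqrt(M)) * np.exp(-0.5 * (w @ w))

th_hat, th = 1.1, 0.8
d = th_hat - th
closed = s3 * np.cos(d) * np.exp(-1.5 * np.sin(d)**2) / np.sqrt(2.0*np.pi)
print(q19(th_hat, th), closed)                         # the two values agree
```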
Proposition 1. The matrix $Q_X(\hat\theta, \theta)$ is positive definite on $U_\theta$.

Proof. We shall omit the subscript $X$ in expressions like $\eta_X(\theta)$, $\mathcal{E}_X$, etc. Take an arbitrary geodesic $\gamma$ on $\mathcal{E}$ such that $\gamma(\hat t) = \eta(\hat\theta)$ for some $\hat t \in (a, b)$. From (16) it follows that $P^{\hat\theta}\ddot\gamma(\hat t) = 0$, and from (8) we obtain that $(I - P^{\hat\theta})\, \partial_i \eta(\hat\theta) = 0$. Therefore, using (15) and (16) we obtain

$$\dot g^T(\hat t)\, Q(\hat\theta, \theta)\, \dot g(\hat t) = \frac{d^2}{dt^2}\, \tfrac{1}{2}\, \|\gamma(t) - \eta(\theta)\|_X^2\, \Big|_{\hat t} = 1 - \frac{\big\langle [\eta(\theta) - \eta(\hat\theta)],\ e_\gamma(\hat t)\big\rangle_X}{\rho_\gamma(\hat t)}. \tag{20}$$

Hence $Q(\hat\theta, \theta)$ is proved to be positive definite if we show that

$$\big\langle [\eta(\theta) - \eta(\hat\theta)],\ e_\gamma(\hat t)\big\rangle_X < \rho_\gamma(\hat t). \tag{21}$$

From $(I - P^{\hat\theta})\, e_\gamma(\hat t) = e_\gamma(\hat t)$ and from the Schwarz inequality it follows that

$$\big\langle [\eta(\theta) - \eta(\hat\theta)],\ e_\gamma(\hat t)\big\rangle_X \le \|\psi_{\hat\theta}(\theta) - \eta(\hat\theta)\|_X, \tag{22}$$

where we used the notation $\psi_{\hat\theta}(\theta) := \eta(\hat\theta) + (I - P^{\hat\theta})[\eta(\theta) - \eta(\hat\theta)]$. Geometrically, the point $\psi_{\hat\theta}(\theta)$ is obtained by projecting the point $\eta(\theta)$ onto the plane

$$A(\hat\theta) := \{z : \langle [z - \eta(\hat\theta)],\ \partial_i \eta(\hat\theta)\rangle_X = 0\ \ (i = 1, \dots, m)\}.$$

Suppose that

$$\|\psi_{\hat\theta}(\theta) - \eta(\hat\theta)\|_X \ge r. \tag{23}$$

Since $\psi_{\hat\theta}(\theta) \in A(\hat\theta)$ and $y \in A(\hat\theta)$ (Eq. (18)), and since $A(\hat\theta)$ is orthogonal to the tangent plane (9), there is a sphere $\mathcal{S} := \{z : z \in \mathbb{R}^N,\ \|z - c\|_X < r\}$ such that $\mathcal{S}$ is tangent to $\mathcal{E}$ at $\eta(\hat\theta)$, and that its centre $c$ lies on the straight line connecting $\eta(\hat\theta)$ with $\psi_{\hat\theta}(\theta)$. From $\|y - \eta(\hat\theta)\|_X \le \|y - \eta(\theta)\|_X < r$ it then follows that $y \in \mathcal{S}$. However, this is not possible, because $r$ is the minimal radius of curvature, hence $\mathcal{S}$ cannot intersect $\mathcal{E}$. Consequently, (23) does not hold, which together with (22) implies (21). □
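A small numerical companion to the geodesic machinery used in the proof above, on the circle design $Z$ of Example 1 (an assumption chosen because its geometry is explicit): the curve $\gamma(t) = \eta_Z(g(t))$ with $g(t) = t/\sqrt{3}$ satisfies (15)-(16) and has constant radius of curvature $\sqrt{3}$, which is therefore the minimal radius $r$ for this design.

```python
import numpy as np

# Checking conditions (15)-(16) for the circle design Z of Example 1
# (Sigma = I, so <.,.>_X is the Euclidean inner product).
s3 = np.sqrt(3.0)
g       = lambda t: t / s3                                   # unit-speed parametrization
dgamma  = lambda t: np.array([-np.sin(g(t)), np.cos(g(t))])            # gamma-dot
d2gamma = lambda t: -(1.0/s3) * np.array([np.cos(g(t)), np.sin(g(t))]) # gamma-double-dot
deta    = lambda th: s3 * np.array([-np.sin(th), np.cos(th)])          # tangent direction

t = 1.3
assert np.isclose(dgamma(t) @ dgamma(t), 1.0)       # (ii): unit speed
assert np.isclose(d2gamma(t) @ deta(g(t)), 0.0)     # (iii): orthogonal to the tangent
rho = 1.0 / np.linalg.norm(d2gamma(t))              # radius of curvature
print(rho)                                          # sqrt(3), constant in t
```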

Proposition 2. The matrix $Q_X(\hat\theta, \theta) - M_X(\hat\theta)$ is positive definite (resp. negative definite) iff the centre of curvature of every geodesic through the point $\eta_X(\hat\theta)$ is on the opposite side (resp. on the same side) of the tangent plane $T_X(\hat\theta)$ as the true mean $\eta_X(\theta)$.

Proof. We use the notation of Proposition 1. From (15) and (20) it follows that

$$\dot g^T(\hat t)\, \big[Q(\hat\theta, \theta) - M(\hat\theta)\big]\, \dot g(\hat t) = \frac{\big\langle [\eta(\hat\theta) - \eta(\theta)],\ e_\gamma(\hat t)\big\rangle_X}{\rho_\gamma(\hat t)}. \tag{24}$$

The vector $e_\gamma(\hat t)$ is orthogonal to $T_X(\hat\theta)$ and it is pointing from the point $\eta_X(\hat\theta)$ towards the centre of curvature of the geodesic $\gamma$. The required statement follows from (24). □

Corollary 2. The statement in Proposition 2 remains true if we set the matrix $W_X(\hat\theta, \theta)$ instead of $Q_X(\hat\theta, \theta)$.

Proof. There is a nonsingular matrix $V$ such that $V^T M_X(\hat\theta) V = I$ and $V^T Q_X(\hat\theta, \theta) V = \operatorname{diag}(\lambda_1, \dots, \lambda_m) =: \Lambda$ for some $\lambda_1, \dots, \lambda_m$. From Proposition 1 it follows that $\lambda_i > 0$ $(i = 1, \dots, m)$. Evidently, $Q_X(\hat\theta, \theta) - M_X(\hat\theta)$ is p.d. iff $\lambda_i > 1$ $(i = 1, \dots, m)$, i.e. iff $V^T[W_X(\hat\theta, \theta) - M_X(\hat\theta)]V = \operatorname{diag}(\lambda_1^2 - 1, \dots, \lambda_m^2 - 1)$ is p.d. But this is true iff $W_X(\hat\theta, \theta) - M_X(\hat\theta)$ is p.d. Similarly, we prove that $Q_X(\hat\theta, \theta) - M_X(\hat\theta)$ is negative definite iff $W_X(\hat\theta, \theta) - M_X(\hat\theta)$ is negative definite. □
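Proposition 2 and Corollary 2 can be illustrated on the circle design $Z$ of Example 1, where every centre of curvature is the origin: the true mean always lies on the same side of the tangent plane as the origin, so $Q - M$ and $W - M$ come out negative.

```python
import numpy as np

# Illustration of Proposition 2 / Corollary 2 on the circle design Z of
# Example 1: centre of curvature = origin, on the SAME side of T(th_hat)
# as the true mean eta(th), hence Q - M < 0 and W - M < 0.
s3 = np.sqrt(3.0)
eta = lambda t: s3 * np.array([np.cos(t), np.sin(t)])

th_hat, th = 0.2, 0.6
normal = -eta(th_hat) / s3                      # unit normal towards the centre of curvature
side_mean   = normal @ (eta(th) - eta(th_hat))  # side of the tangent plane: true mean
side_centre = normal @ (np.zeros(2) - eta(th_hat))  # side: centre of curvature
assert side_mean > 0 and side_centre > 0        # same side

Q_minus_M = 3.0*np.cos(th_hat - th) - 3.0       # Q - M for this design
W_minus_M = 3.0*np.cos(th_hat - th)**2 - 3.0    # W - M
print(Q_minus_M, W_minus_M)                     # both negative, as predicted
```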

Denote by $\hat\theta^{(n)}$ the L.S. estimate of $\theta$ obtained from $n$ independently repeated observations of the model (2). Denote by $M_X^{(n)}$, $Q_X^{(n)}$, $W_X^{(n)}$ the corresponding matrices in this larger model.

Proposition 3. We have

$$\lim_{n\to\infty} \big[Q_X^{(n)}(\hat\theta^{(n)}, \theta) - M_X^{(n)}(\hat\theta^{(n)})\big]\, \big[M_X^{(n)}(\hat\theta^{(n)})\big]^{-1} = 0,$$

$$\lim_{n\to\infty} \big[W_X^{(n)}(\hat\theta^{(n)}, \theta) - M_X^{(n)}(\hat\theta^{(n)})\big]\, \big[M_X^{(n)}(\hat\theta^{(n)})\big]^{-1} = 0.$$

Proof. By $M_X(\hat\theta^{(n)})$ we denote the information matrix in the original model (2) when setting the new estimate $\hat\theta^{(n)}$ instead of $\hat\theta$, etc. We have

$$\frac{1}{n}\, \{Q_X^{(n)}(\hat\theta^{(n)}, \theta)\}_{ij} = \{M_X(\hat\theta^{(n)})\}_{ij} + [\eta_X(\hat\theta^{(n)}) - \eta_X(\theta)]^T\, \Sigma_X^{-1}\, \big[I - P_X^{\hat\theta^{(n)}}\big]\, \partial^2_{ij}\eta_X(\hat\theta^{(n)}).$$

From the strong consistency of $\hat\theta^{(n)}$ on $\Theta$ (Jennrich (1969)) we obtain that $\lim_{n\to\infty} \eta_X(\hat\theta^{(n)}) - \eta_X(\theta) = 0$. Thus we obtain both required limits. □
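Proposition 3 in miniature, on the circle design $Z$ of Example 1 (where $Q_Z(\hat\theta, \theta) = 3\cos(\hat\theta - \theta)$ and $M = 3$): as the consistent estimate approaches the true value, $(Q - M)\,M^{-1}$ vanishes.

```python
import numpy as np

# As th_hat^{(n)} - th shrinks (consistency), the relative difference
# (Q - M)/M for the circle design of Example 1 tends to zero.
for delta in [0.5, 0.1, 0.01, 0.001]:
    rel = (3.0*np.cos(delta) - 3.0) / 3.0
    print(delta, rel)            # rel -> 0 as delta -> 0
```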

Denote $w(\hat\theta) := P_X^{\hat\theta}\, [\eta_X(\hat\theta) - \eta_X(\theta)]$. This vector is parallel to the tangent plane $T_X(\hat\theta)$. According to Thorpe (1979), Ch. 8, the expression

$$D_i w(\hat\theta) := P_X^{\hat\theta}\, \partial_i w(\hat\theta)$$

is the $i$-th partial covariant derivative (= the inner derivative in $\mathcal{E}_X$) of the vector field $w(\hat\theta)$.

Proposition 4. We have

$$\{W_X(\hat\theta, \theta)\}_{ij} = \big\langle D_i w(\hat\theta),\ D_j w(\hat\theta)\big\rangle_X \tag{25}$$

and consequently

$$q_X(\hat\theta \mid \theta) = \frac{\det^{1/2}\Big[\big\{\langle D_i w(\hat\theta), D_j w(\hat\theta)\rangle_X\big\}_{i,j=1}^m\Big]}{(2\pi)^{m/2}}\ \exp\big\{-\tfrac{1}{2}\, \|w(\hat\theta)\|_X^2\big\}. \tag{26}$$

Proof. We have

$$\partial_i M_X^{-1}(\theta) = -M_X^{-1}(\theta)\, \partial_i M_X(\theta)\, M_X^{-1}(\theta).$$

Hence, using (3) and (8) we can verify that

$$\{Q_X(\hat\theta, \theta)\}_{ij} = \big\langle \partial_i \eta_X(\hat\theta),\ D_j w(\hat\theta)\big\rangle_X.$$

Eq. (25) then follows from (8), and (26) follows from (19). □

We can better understand the content of Proposition 4 in the following special case.

Proposition 5. If $M_X(\theta)$ does not depend on $\theta$ and if $\hat\theta$ is distributed according to $q_X(\hat\theta \mid \theta)$ on $U_\theta$, then the random vector

$$u(\hat\theta) := M_X^{-1/2}\, \partial \eta_X^T(\hat\theta)\, \Sigma_X^{-1}\, [\eta_X(\hat\theta) - \eta_X(\theta)]$$

has the truncated normal distribution $N(0, I)$. The Jacobian of the mapping $\hat\theta \in U_\theta \mapsto u(\hat\theta)$ is equal to

$$\det^{1/2} W_X(\hat\theta, \theta).$$

Proof. From (8) we obtain

$$u^T(\hat\theta)\, u(\hat\theta) = \big\| P_X^{\hat\theta}\, [\eta_X(\hat\theta) - \eta_X(\theta)]\big\|_X^2.$$

Since $M_X$ does not depend on $\theta$, by differentiation of (3) we obtain $\partial^2_{ij}\eta_X^T(\theta)\, \Sigma_X^{-1}\, \partial_k \eta_X(\theta) = 0$. Hence, using (8), we can write

$$\frac{\partial u(\hat\theta)}{\partial \hat\theta^T} = M_X^{-1/2}\, Q_X(\hat\theta, \theta).$$

Thus the statement follows from (19) and from Proposition 1. □
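A numerical check of Proposition 5 on the circle design $Z$ of Example 1, where $M = 3$ is constant: there $u(\hat\theta) = \sqrt{3}\sin(\hat\theta - \theta)$, and the Jacobian $du/d\hat\theta$ should equal $\det^{1/2} W_Z(\hat\theta, \theta) = \sqrt{3}\,|\cos(\hat\theta - \theta)|$. The derivative is approximated by a central difference.

```python
import numpy as np

# u(th_hat) from Proposition 5 for the circle design Z of Example 1,
# and its Jacobian compared with det^{1/2} W_Z.
s3 = np.sqrt(3.0)
eta  = lambda t: s3 * np.array([np.cos(t), np.sin(t)])
deta = lambda t: s3 * np.array([-np.sin(t), np.cos(t)])

th = 0.4                                       # true parameter
def u(th_hat):                                 # M^{-1/2} = 3^{-1/2}, Sigma = I
    return (deta(th_hat) @ (eta(th_hat) - eta(th))) / s3

th_hat, h = 0.9, 1e-6
jac = (u(th_hat + h) - u(th_hat - h)) / (2.0*h)          # central difference
print(u(th_hat), s3*np.sin(th_hat - th))                 # u = sqrt(3) sin(d)
print(jac, s3*np.abs(np.cos(th_hat - th)))               # Jacobian = det^{1/2} W_Z
```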

Corollary 5. If $q_X(\hat\theta \mid \theta)$ is so concentrated round $\theta$ that the truncation of the normal distribution of $u(\hat\theta)$ can be neglected, then

$$\int_{U_\theta} q_X(\hat\theta \mid \theta)\, \log q_X(\hat\theta \mid \theta)\, d\hat\theta = \tfrac{1}{2}\Big[\int_{U_\theta} \big[\log\det W_X(\hat\theta, \theta)\big]\, q_X(\hat\theta \mid \theta)\, d\hat\theta - m \log 2\pi - m\Big].$$

The proof is obtained when the substitution $\hat\theta \mapsto u(\hat\theta)$ is used in the first integral.

Proposition 6. Under the assumptions of Proposition 5 we can write

$$q_X(\hat\theta \mid \theta) = \frac{\det^{1/2}\big[Q_X(\hat\theta, \theta)\, Q_X^{-1}(\theta, \theta)\, Q_X(\hat\theta, \theta)\big]}{(2\pi)^{m/2}}\ \times$$
$$\exp\Big\{-\tfrac{1}{2}\, (\hat\theta - \theta)^T \Big[\int_0^1\!\!\int_0^1 Q_X\big[(1 - t)\theta + t\hat\theta,\ \theta\big]\; Q_X^{-1}(\theta, \theta)\; Q_X\big[(1 - u)\theta + u\hat\theta,\ \theta\big]\, dt\, du\Big]\, (\hat\theta - \theta)\Big\}.$$

Proof. We have

$$u(\hat\theta) = \int_0^1 M_X^{-1/2}\, Q_X\big[(1 - t)\theta + t\hat\theta,\ \theta\big]\, dt\ \, (\hat\theta - \theta).$$

Hence, since $M_X = Q_X(\theta, \theta)$ by (7), the statement is a consequence of Proposition 5. □

Example 1. Take $\mathcal{X} = \{x_1, x_2, x_3, x_4\}$, $\Theta = (0, 2\pi)$, $\sigma(x_i) = 1$,

$$\eta(\theta, x_1) = \theta, \quad \eta(\theta, x_2) = \sqrt{2}\,\theta, \quad \eta(\theta, x_3) = \sqrt{3}\cos\theta, \quad \eta(\theta, x_4) = \sqrt{3}\sin\theta.$$

Consider two designs, $X := (x_1, x_2)$ and $Z := (x_3, x_4)$. $\mathcal{E}_X$ is a segment of a straight line, $\mathcal{E}_Z$ is a circle. From (3) we obtain

$$M_X(\theta) = 3 = M_Z(\theta) \qquad (\theta \in (0, 2\pi)).$$

When the design $X$ is used, we have, according to (19),

$$\hat\theta - \theta \sim N(0, \tfrac{1}{3}) \quad \text{on } (0, 2\pi).$$

When $Z$ is applied, we obtain from Proposition 5 that

$$\sin(\hat\theta - \theta) \sim N(0, \tfrac{1}{3}) \tag{27}$$

on the set $U_\theta \supset \{\hat\theta : |\hat\theta - \theta| < \tfrac{1}{2}\pi\}$. We see that $\hat\theta$ is more concentrated round $\theta$ in the first case than in the second case. Consequently, the design $X$ is better than $Z$, despite their having the same expected information matrix. We have

$$W_X(\hat\theta, \theta) = 3 \qquad (\hat\theta, \theta \in (0, 2\pi)),$$
$$W_Z(\hat\theta, \theta) = 3\cos^2(\hat\theta - \theta) \qquad (\hat\theta, \theta \in (0, 2\pi)).$$

Hence, $W_Z(\hat\theta, \theta) \le W_X(\hat\theta, \theta)$. Further

$$E_\theta[W_Z(\cdot, \theta)] = 3\big[1 - E_\theta \sin^2(\hat\theta - \theta)\big] = 2 < 3 = E_\theta[W_X(\cdot, \theta)].$$

It follows that $X$ is better than $Z$ with respect to the proposed generalized D-optimality criterion.
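Example 1 can also be replayed by direct simulation. The closed-form L.S. estimates used below follow from the geometry of the two expectation surfaces: for $X$ the minimizer is a linear projection, $\hat\theta = (y_1 + \sqrt{2}\,y_2)/3$; for $Z$ the nearest point of the circle to $y$ has the angle of $y$.

```python
import numpy as np

# Monte Carlo comparison of the two designs of Example 1: both have M = 3,
# but the L.S. estimate is more concentrated under X (line) than under Z (circle).
rng = np.random.default_rng(1)
n = 100_000
theta = 0.0        # true parameter (the interior point is immaterial here)

# design X: y = (theta, sqrt(2)*theta) + eps; closed-form L.S. estimate
eps = rng.normal(size=(n, 2))
th_X = theta + (eps[:, 0] + np.sqrt(2.0)*eps[:, 1]) / 3.0   # th_hat - theta ~ N(0, 1/3)

# design Z: y = sqrt(3)*(cos theta, sin theta) + eps; L.S. estimate = angle of y
y = np.sqrt(3.0)*np.array([np.cos(theta), np.sin(theta)]) + rng.normal(size=(n, 2))
th_Z = np.arctan2(y[:, 1], y[:, 0])                          # in (-pi, pi], true theta = 0

print(np.var(th_X), np.var(th_Z))   # the spread under X is clearly smaller
```

The empirical variance under $X$ is close to the theoretical value $\tfrac{1}{3}$, while under $Z$ it is visibly larger, in agreement with the ordering of the designs derived above.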

Acknowledgement

The author wishes to thank the Coordinating Editor and the referees for helpful remarks which led to an improvement of the earlier version of this paper.

References

Bates, D.M. and D.G. Watts (1980). Relative curvature measures of nonlinearity. J. Roy. Statist. Soc. Ser. B 42, 1-25.

Beale, E.M.L. (1960). Confidence regions in non-linear estimation. J. Roy. Statist. Soc. Ser. B 22, 41-88.

Box, M.J. (1971). Bias in nonlinear estimation. J. Roy. Statist. Soc. Ser. B 33, 171-201.

Fedorov, V.V. (1972). Theory of Optimal Experiments. Academic Press, New York.

Ford, I., C.P. Kitsos and D.M. Titterington (1987). Recent advances in nonlinear experimental design. Technometrics, to appear.

Hamilton, D.C., D.G. Watts and D.M. Bates (1982). Accounting for intrinsic nonlinearity in nonlinear regression parameter inference regions. Ann. Statist. 10, 386-393.

Hamilton, D.C. and D.G. Watts (1985). A quadratic design criterion for precise estimation in nonlinear regression models. Technometrics 27, 241-250.

Hougaard, P. (1985). Saddlepoint approximations for curved exponential families. Statist. Probab. Lett. 3, 161-166.

Jennrich, R.J. (1969). Asymptotic properties of non-linear least squares estimators. Ann. Math. Statist. 40, 633-643.

Pázman, A. (1984). Probability distribution of the multivariate nonlinear least squares estimates. Kybernetika 20, 209-230.

Pázman, A. (1986). Foundations of Optimal Experimental Design. Reidel, Dordrecht, and Veda, Bratislava.

Pázman, A. (1987a). On formulas for the distribution of nonlinear L.S. estimates. Statistics 18, 3-15.

Pázman, A. (1987b). Flat Gaussian nonlinear regression models. In: Proc. of the International Workshop "Model Oriented Data Analysis", Wartburg, March 1987. Lecture Notes in Economics and Mathematical Systems No. 297. Springer Verlag, Berlin, 120-124.

Pukelsheim, F. (1980). On linear regression designs which maximize information. J. Statist. Plann. Inference 4, 339-364.

Ratkovsky, D.A. (1984). Nonlinear Regression Modelling. Marcel Dekker, New York-Basel.

Skovgaard, I.M. (1985). Large deviation approximations for maximum likelihood estimators. Probab. Math. Statist. 6(2), 89-107.

Stone, M. (1959). Application of a measure of information to the design and comparison of regression experiments. Ann. Math. Statist. 30, 55-70.

Thorpe, J.A. (1979). Elementary Topics in Differential Geometry. Springer Verlag, New York-Heidelberg-Berlin.