Journal of Statistical Planning and Inference 103 (2002) 401–420
www.elsevier.com/locate/jspi
Results on nonlinear least squares estimators under nonlinear equality constraints

Andrej Pázman*,1

University of Augsburg, 86159 Augsburg, Germany

Received 15 February 1999; accepted 19 July 1999
Abstract

A finite sample approach to parametric nonlinear regression models with identifying and/or restrictive constraints (conditions) on the parameters is presented. Some results from the regular nonlinear regression without constraints are extended to models which may be singular and subject to constraints. The extension is done mainly 'parameter-free', with the help of some basic notions from the geometry of the model, among them the geodesic curves, which are related to the Rao distance measure. All results are expressed in the original parameters and in terms containing just the first and second order derivatives of the response function and of the functions defining the constraints. This leads to presentations of confidence regions, of a step in the Gauss–Newton algorithm, of the 'flat' or saddlepoint probability density of the least squares estimators, and of the conditions for 'flatness'. Special attention is given to the extension of the measures of nonlinearity of Bates and Watts to models with constraints. © 2002 Elsevier Science B.V. All rights reserved.

MSC: primary 62J02; secondary 62F11; 53A05

Keywords: Nonlinear regression; Expectation surface; Measures of nonlinearity; Confidence regions; Curvature tensor; Distribution of estimators
1. Introduction

1.1. The nonlinear model with constraints

We consider an experiment where the responses to the inputs x_i can be modeled by the equations

  y_i = η(θ, x_i) + ε_i,   i = 1, …, N,
* Corresponding author. Department of Probability and Statistics, Faculty of Mathematics, Physics and Informatics, Comenius University, Mlynská dolina, SK-842 48 Bratislava, Slovak Republic. E-mail address: [email protected] (A. Pázman).
1 On leave from Comenius University, Bratislava. Research supported by DAAD in Germany and by a grant of VEGA in Slovakia.
with the unknown parameters θ = (θ_1, …, θ_m)^T subject to q equality constraints

  γ_l(θ) = 0,   l = 1, …, q.

Here x_1, …, x_N is a given design of the experiment, ε = (ε_1, …, ε_N)^T is the error vector with mean E(ε) = 0 and variance Var(ε) = σ²I. We suppose that θ ∈ Θ, and that Θ is an open subset of R^m. (Otherwise, our investigation gives results for the interior of Θ.) The mappings θ → η(θ) = (η(θ, x_1), …, η(θ, x_N))^T and θ → γ(θ) = (γ_1(θ), …, γ_q(θ))^T are known, and are supposed to be continuous with continuous derivatives up to order 2. The model and the constraints can be written in the vector form

  y = η(θ) + ε,   E(ε) = 0,   Var(ε) = σ²I,   θ ∈ Θ,   γ(θ) = 0.    (1)
We note that the case Var(ε) = σ²W, with a known p.d. matrix W, can be linearly transformed to model (1), so it need not be considered separately.

By J(θ) and L(θ) we denote the matrices of first order derivatives of η(θ) and γ(θ),

  J_ij(θ) = ∂η(θ, x_i)/∂θ_j,   i = 1, …, N,  j = 1, …, m,
  L_lj(θ) = ∂γ_l(θ)/∂θ_j,   l = 1, …, q,  j = 1, …, m.

Their joint rank is supposed to be equal to the number of parameters m,

  rank [ J(θ); L(θ) ] = m    (2)

(the rank of the (N + q) × m matrix obtained by stacking J(θ) and L(θ)),
which is necessary for the uniqueness of the least squares (LS) estimator

  θ̂ = arg min_{θ ∈ Θ, γ(θ)=0} ||y − η(θ)||²

(see also Section 3.1).
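In practice this constrained minimization can be carried out numerically. The following minimal sketch (added here for illustration, not part of the paper; the exponential response η(θ, x) = θ_1 exp(θ_2 x) and the constraint θ_1 + θ_2 − 1 = 0 are hypothetical choices) computes θ̂ with a standard solver:

import numpy as np
from scipy.optimize import minimize

x = np.linspace(0.0, 1.0, 10)                   # design points x_1, ..., x_N
theta_true = np.array([0.7, 0.3])               # satisfies the illustrative constraint
rng = np.random.default_rng(0)
y = theta_true[0] * np.exp(theta_true[1] * x) + 0.05 * rng.standard_normal(x.size)

def eta(theta):                                 # response vector eta(theta)
    return theta[0] * np.exp(theta[1] * x)

def sse(theta):                                 # ||y - eta(theta)||^2
    r = y - eta(theta)
    return r @ r

constraint = {"type": "eq", "fun": lambda th: th[0] + th[1] - 1.0}
fit = minimize(sse, x0=np.array([0.5, 0.5]), constraints=[constraint], method="SLSQP")
theta_hat = fit.x                               # constrained LS estimate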
In standard situations one can suppose that the ranks of J(θ) and of L(θ) are constant, and that the dimension of the intersection of the column spaces of J(θ) and L(θ) is constant as well. This will be supposed also here. Notice that we call the model regular if J(θ) has full rank; otherwise it is called singular.

1.2. Examples

Example 1. The response function η(θ, t), t ∈ [a, b], is a solution of a differential equation with a priori fixed boundary values

  η(θ, a) = v_1,   η(θ, b) = v_2.

The parameters θ, related to the parameters of the differential equation, are unknown, and η(θ, t) is observed at times t_1, …, t_N ∈ [a, b]. Here the prescription of the boundary values gives nonlinear equality constraints on the parameters θ.
Example 2. The estimation of the position of a node in a spline regression. Let a < b < c < d be real numbers. Observations are taken at x_1, …, x_k ∈ [a, b] with means E(y_i) = η(α, x_i), and at x_{k+1}, …, x_N ∈ [c, d] with E(y_i) = η̄(β, x_i). One has to estimate the point (the node) x_0 ∈ [b, c] such that

  η(α, x_0) = η̄(β, x_0),   ∂η(α, x)/∂x |_{x_0} = ∂η̄(β, x)/∂x |_{x_0}.

These equalities are nonlinear constraints in this regression model. The vector of unknown parameters here is (α^T, β^T, x_0)^T. Even in the particular case that η(α, x) and η̄(β, x) are polynomials in x, the constraints are nonlinear in the additional parameter x_0.

Example 3. A biadditive model in the statistical analysis of genotypes (cf. Denis and Gower, 1994). The model is

  y_ijk = μ + α_i + β_j + λ_i δ_j + ε_ijk

with the linear and nonlinear constraints

  Σ_i α_i = 0,   Σ_j β_j = 0,   Σ_i λ_i = 0,   Σ_j δ_j = 0,   Σ_i λ_i² = Σ_j δ_j².
In general the constraints may appear from some 'physical reasons', as in Examples 1 and 2, but they may also be used as one of the possibilities of regularizing an overparametrized singular model, and so of obtaining unique least squares estimates. This is the case in ANOVA models, as in Example 3.

1.3. The applied approach

Asymptotic normality and other first order asymptotic properties of the estimator θ̂ are considered in the classical papers of Aitchison and Silvey (1958) and Silvey (1959). The results can be formulated in terms of the first order derivatives of η(θ) and γ(θ) taken at the true value of θ. In fact the asymptotic properties of θ̂ correspond to the properties of the least squares estimator in a linear model with linear constraints, as considered, e.g., in Rao (1963) and later in Silvey (1975). This model (see also (11)) is a linear approximation of model (1). However, as pointed out in Bates and Watts (1988), even in some simple models without constraints the approximations which follow from the asymptotic normality of the LS estimator are insufficient. Better approximations are known for regular models without constraints. For linear models a passage from these models to models with constraints has been presented in Rao (1963, Section 4a.9). In nonlinear models we must extend and modify the results, and essentially we have two possibilities:
(i) To reparametrize the model so as to exclude the constraints, for example by solving the q equations γ_l(θ) = 0, l = 1, …, q, so that we could express some of the parameters by the others. However, this is tedious, and even impossible in more complicated models. Besides, it may be important to maintain the original parameters for some reasons, e.g., because of the symmetry of the model with respect to these parameters. Nevertheless, in some parts of the paper we consider (just theoretically) some parameters β such that γ[θ(β)] = 0, and such that the Jacobian ∂θ(β)/∂β^T is a full rank matrix for every value of β. With this parametrization model (1) becomes regular, and the constraints hold automatically. As said, the mapping θ(β) expressing the old parameters by the new ones is difficult to obtain, but it is not necessary to work it out, since it will not appear in the final results. (A small illustration is given at the end of this subsection.)

(ii) The second possibility, avoiding the difficulties with the mapping θ(β), is to maintain the original parameters, and to use a parameter-free (geometrical) approach, which allows an easier comparison of models with and without constraints. For that purpose we describe in the paper the following geometrical concepts: the expectation surface E, the projector onto a tangent plane of E, and the geodesic curves (geodesics) on the surface E. All final results in the paper are formulated in the original parameters θ, and exclusively in terms of the mappings η(θ) and γ(θ) and of the first and second order derivatives of these mappings.
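As a small illustration of possibility (i) (an example added here for concreteness, not taken from the original text), consider m = 2 parameters with the single constraint γ(θ) = θ_1² + θ_2² − 1 = 0. The reparametrization

  θ(β) = (cos β, sin β)^T,   β ∈ (0, 2π),

satisfies γ[θ(β)] = 0 identically, and ∂θ(β)/∂β = (−sin β, cos β)^T never vanishes, so the reparametrized model is regular and free of constraints. Even in this toy case the mapping θ(β) is model-specific, which is one reason why the parameter-free approach (ii) is followed below.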
2. Some geometrical preliminaries

The nonlinearity of the model and of the constraints requires quite naturally the use of some differential-geometric concepts. Since, after the book of Amari (1985), many rather complicated differential-geometric studies of statistical models have appeared, it is necessary to emphasize that we shall use just the geometry of a surface in R^N induced by the Euclidean geometry of R^N, and no special geometric knowledge is required.

2.1. The expectation surface

The basic geometric object in our investigation is the expectation surface. It is the set of all possible means η(θ) of the observed vector y; hence in model (1) it is the set

  E = {η(θ): θ ∈ Θ, γ(θ) = 0}.

It forms a surface in R^N parametrized by θ and restricted by the constraints. A curve (more exactly a segment of a curve) on the surface E, and through the point η(θ*), can be defined in two steps. First we define a curve in Θ subject to the constraints γ(θ) = 0. It is a continuous mapping g: t ∈ (−δ, δ) → g(t) ∈ Θ with continuous second order derivatives, and such that g(0) = θ* and γ[g(t)] = 0 for every t.
Taking the derivative of γ[g(t)] with respect to t, we obtain an equivalent definition of a curve: a mapping g: t ∈ (−δ, δ) → g(t) ∈ Θ such that

  g(0) = θ*,   γ[g(0)] = 0   and   L_{g(t)} ġ(t) = 0 for every t.    (3)
Here one dot denotes the first order derivative with respect to t, and L_{g(t)} denotes the matrix L(θ) taken at θ = g(t). Although this notation is nontraditional, we use it to avoid confusion with too many brackets. The choice of δ is irrelevant, since we are interested only in derivatives at t = 0.

Finally, a curve in E through the point η(θ*), which corresponds to the curve g(t), is given by ζ(t) = η[g(t)]. Evidently, ζ(0) = η(θ*). We shall distinguish strictly between curves on E and curves in Θ by the notation: ζ(t) or η[g(t)] in the first case, and g(t) in the second case.

2.2. The projector onto the tangent space

By definition, the tangent space T_{θ*} at a point θ* is the set of all vectors which are tangent to curves through the point η(θ*) ∈ E. In symbols, T_{θ*} = {dη[g(t)]/dt |_{t=0}: g(·) is a curve through θ*}. We perform the indicated derivative and take into account that any vector v ∈ R^m such that L(θ*)v = 0 is equal to ġ(0) for some curve g(·). So we obtain from (3)

  T_{θ*} = {J(θ*)v: v ∈ R^m, L(θ*)v = 0}.

Another expression is obtained with the help of the projector

  P_L(θ*) = L^T(θ*)[L(θ*)L^T(θ*)]^- L(θ*),    (4)

which projects R^m orthogonally onto the column space of the matrix L^T(θ*). (We note that the indicated g-inverse is arbitrary.) It allows us to write

  T_{θ*} = {J(θ*)[I − P_L(θ*)]u: u ∈ R^m},    (5)

since [I − P_L(θ*)] is the projector orthogonal to P_L(θ*); hence L(θ*)v = 0 is equivalent to v = [I − P_L(θ*)]u for some u ∈ R^m. We see from (5) that the tangent space T_{θ*} is the column space of the matrix J(θ*)[I − P_L(θ*)], which allows us to write the orthogonal projector of R^N onto T_{θ*} in the form (cf. Rao and Mitra, 1971, p. 111)

  P(θ*) = J(θ*)[I − P_L(θ*)][M_L(θ*)]^- [I − P_L(θ*)]J^T(θ*),

where the g-inverse is arbitrary, and where

  M_L(θ*) = [I − P_L(θ*)] M(θ*) [I − P_L(θ*)].
The matrix M_L(θ*) can be interpreted as a certain restriction of the matrix M(θ*) = J^T(θ*)J(θ*), which is the Fisher information matrix in the model without constraints (for σ = 1). Correspondingly, the matrix M_L(θ*) can be considered as the information matrix of model (1) when the constraints are taken into account.

Since the projector P(θ*) is important, we present also alternative expressions for P(θ*) which hold in particular cases. The proofs are in the appendix.

2.2.1. The projector expressed through the asymptotic variance

Let us denote by V(θ*) the variance of the asymptotic normal approximation of θ̂ in the case when θ* is the true value of θ. In models with known V(θ*) one can express the projector P(θ*) in the simple form

  P(θ*) = σ^{-2} J(θ*) V(θ*) J^T(θ*).

2.2.2. The projector in singular models with identifying constraints

When we have a singular regression model without constraints, the parameter θ is not identifiable, and the LS estimator θ̂ is not unique. To make it unique one sometimes puts auxiliary constraints γ(θ) = 0 such that

  rk(J(θ*)) + rk(L(θ*)) = m    (6)

and such that equality (2) holds. In this particular case we have

  P(θ*) = J(θ*)[M(θ*) + L^T(θ*)L(θ*)]^{-1} J^T(θ*).

2.2.3. The projector in the case that J(θ*) and L(θ*) are of full rank

In this case the projector can be expressed in the form

  P(θ*) = J(θ*)[M^{-1}(θ*) − M^{-1}(θ*)L^T(θ*)R^{-1}(θ*)L(θ*)M^{-1}(θ*)]J^T(θ*)

with R(θ*) = L(θ*)M^{-1}(θ*)L^T(θ*).
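As a numerical aside (a minimal sketch added here, not from the original text; J and L are assumed to be numpy arrays evaluated at θ* and satisfying the rank condition (2)), the projector of Section 2.2 can be evaluated directly from its definition, with pseudoinverses playing the role of the arbitrary g-inverses:

import numpy as np

def projector(J, L):
    """Orthogonal projector of R^N onto the tangent space T, as in Section 2.2."""
    m = J.shape[1]
    PL = L.T @ np.linalg.pinv(L @ L.T) @ L               # projector P_L of eq. (4)
    IL = np.eye(m) - PL                                  # I - P_L
    ML = IL @ (J.T @ J) @ IL                             # M_L = (I - P_L) M (I - P_L)
    return J @ IL @ np.linalg.pinv(ML) @ IL @ J.T

# small random example: check that the result is a symmetric idempotent matrix
rng = np.random.default_rng(1)
J, L = rng.standard_normal((8, 3)), rng.standard_normal((1, 3))
P = projector(J, L)
assert np.allclose(P, P.T) and np.allclose(P @ P, P)

In the situation of Section 2.2.2 this routine should coincide with J[M + L^T L]^{-1} J^T, and in Section 2.2.3 with the expression involving R(θ*).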
2.3. The geodesics

A geodesic on the surface E (through the point η(θ*)) is a curve ζ(t) = η[g(t)] which is 'less curved' than any other curve through η(θ*) having the same initial direction dζ(t)/dt |_{t=0}. Formally, a geodesic through θ* is defined by the following properties:

(i) g(0) = θ*,
(ii) L_{g(t)} ġ(t) = 0 for every t,
(iii) ||dζ(t)/dt|| = 1 for every t,
(iv) P_{g(t)} d²ζ(t)/dt² = 0 for every t.
Properties (i) and (ii) come from the definition of a curve given above. Property (iii) means that the parameter t is the 'arc length' of the curve ζ (since then dt = ||dζ(t)|| = the Euclidean distance along ζ). Property (iv) means that the 'vector of curvature' d²ζ(t)/dt² always points orthogonally to the surface E, which is just the property ensuring the minimal curvature of the curve ζ. Note that the radius ρ of the circle which has 'the best touch' to the curve ζ(·) at t is ρ = ||d²ζ(t)/dt²||^{-1}. Its inverse ||d²ζ(t)/dt²|| is called the curvature of ζ at t.

It is useful to express properties (iii) and (iv) alternatively in terms of the curve g(t). For that we insert into (iii) and (iv) the straightforwardly obtained derivatives

  dζ(t)/dt = J_{g(t)} ġ(t),
  d²ζ_k(t)/dt² = Σ_{i,j=1}^m ġ_i(t) {H_{g(t)}}_{ij,k} ġ_j(t) + Σ_{l=1}^m {J_{g(t)}}_{kl} g̈_l(t),   k = 1, …, N.    (7)
Here two dots denote the second order derivative with respect to t, and H(θ) denotes the array with components

  H_{ij,k}[θ] = ∂²η_k(θ) / ∂θ_i ∂θ_j.

Property (iii) can then be written also in the form

  (iii*)   ġ^T(t) M_{g(t)} ġ(t) = 1   for every t.
Notice that besides (iv) we have another equation for the second order derivative of g(t), obtained by taking the derivative of (ii) with respect to t:

  Σ_{i,j=1}^m ġ_i(t) {S_{g(t)}}_{ij,k} ġ_j(t) + Σ_{l=1}^m {L_{g(t)}}_{kl} g̈_l(t) = 0,   k = 1, …, q.    (8)

Here we denoted

  S_{ij,k}[θ] = ∂²γ_k(θ) / ∂θ_i ∂θ_j.
Remark. In a regular model without constraints the projector P(θ) has the form

  P(θ) = J(θ) M^{-1}(θ) J^T(θ)    (9)

(compare with 2.2.2 when L(θ) = 0). By inserting this into (iv) one obtains

  g̈_k(t) + Σ_{i,j=1}^m ġ_i(t) {Γ_{g(t)}}_{ij,k} ġ_j(t) = 0    (10)

with Γ_{ij,k}(θ) = M^{-1}_{k·}(θ) J^T(θ) H_{ij,:}(θ), where M^{-1}_{k·}(θ) denotes the kth column of the matrix M^{-1}(θ). For readers familiar with differential geometry we note that this is the Christoffel symbol, or the affine connection.
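To make the remark concrete, here is a minimal numerical sketch (added for illustration, not from the original text): assuming user-supplied callables jac(θ), returning the N × m matrix J(θ), and hess(θ), returning the m × m × N array H(θ), the geodesic equation (10) can be integrated as a first order system.

import numpy as np
from scipy.integrate import solve_ivp

def geodesic_rhs(t, state, jac, hess, m):
    theta, dtheta = state[:m], state[m:]
    J, H = jac(theta), hess(theta)                     # J: N x m,  H: m x m x N
    Minv = np.linalg.inv(J.T @ J)
    a = np.einsum('i,ijk,j->k', dtheta, H, dtheta)     # sum_ij dtheta_i H_ij,: dtheta_j
    ddtheta = -Minv @ (J.T @ a)                        # eq. (10)
    return np.concatenate([dtheta, ddtheta])

def geodesic(theta0, v, jac, hess, t_max=1.0):
    # v should be normalized so that v^T M(theta0) v = 1, in accordance with (iii*)
    m = theta0.size
    state0 = np.concatenate([theta0, v])
    return solve_ivp(geodesic_rhs, (0.0, t_max), state0, args=(jac, hess, m), dense_output=True)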
Eq. (10) is the equation of geodesics as it is usually presented in textbooks of differential geometry (cf. also Amari, 1985), with M(θ) as the metric tensor. We note also that these geodesics are those defining the Rao distance, as proposed in Rao (1945, 1949) for more general statistical models.

3. The linear approximation and the confidence regions

3.1. The linear approximation of the model

A linear approximation of model (1) taken without constraints is the model

  y − η(θ*) = J(θ*)(θ − θ*) + ε.

Geometrically, its expectation surface is the plane which is tangent to the expectation surface of the nonlinear model. This shows us how to proceed with constraints. The linear approximation of (1) is given by the model

  y − η(θ*) = J(θ*)(θ − θ*) + ε,   L(θ*)(θ − θ*) = 0,   θ ∈ R^m,    (11)
since its expectation surface is indeed the tangent plane to E,

  T_{θ*} + η(θ*) = {J(θ*)(θ − θ*) + η(θ*): θ ∈ R^m, L(θ*)(θ − θ*) = 0}.

A direct application of (11) gives a suggestion for the iterative step of the Gauss–Newton algorithm in (1). The Gauss–Newton algorithm has been built for regular models without constraints (cf. Seber and Wild, 1989, p. 619). If θ^(n) is the evaluation of θ̂ after the nth step of the algorithm, one takes for θ^(n+1) the least squares estimate in the nonlinear model linearized at the point θ^(n). The generalization to models with constraints is straightforward, and can be used also for singular models: one takes for θ^(n+1) the least squares estimate in (11) with θ* = θ^(n). Then the vector J(θ^(n))(θ^(n+1) − θ^(n)) is the orthogonal projection of the vector y − η(θ^(n)) onto T_{θ^(n)}. In symbols

  J(θ^(n))(θ^(n+1) − θ^(n)) = P(θ^(n))[y − η(θ^(n))].

We add to this the constraints

  L(θ^(n))(θ^(n+1) − θ^(n)) = 0.

Since, according to (2), the matrix G^T(θ^(n)) = (J^T(θ^(n)), L^T(θ^(n))) has full rank, one can solve these equations to obtain θ^(n+1) explicitly:

  θ^(n+1) − θ^(n) = [G^T(θ^(n)) G(θ^(n))]^{-1} G^T(θ^(n)) ( P(θ^(n))[y − η(θ^(n))] ; 0 ).

Note that the right-hand side should still be multiplied by a correction factor (the step length) to ensure the convergence of the method (e.g., Hartley's correction).
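A minimal sketch of this iteration (added here for illustration; eta, jac and constr_jac are assumed to be user-supplied callables returning η(θ), J(θ) and L(θ)):

import numpy as np

def gauss_newton_step(theta, y, eta, jac, constr_jac, step_length=1.0):
    J, L = jac(theta), constr_jac(theta)
    m = J.shape[1]
    PL = L.T @ np.linalg.pinv(L @ L.T) @ L               # projector P_L of eq. (4)
    IL = np.eye(m) - PL
    ML = IL @ (J.T @ J) @ IL
    P = J @ IL @ np.linalg.pinv(ML) @ IL @ J.T           # projector of Section 2.2
    G = np.vstack([J, L])                                # G^T = (J^T, L^T), full column rank by (2)
    rhs = np.concatenate([P @ (y - eta(theta)), np.zeros(L.shape[0])])
    delta = np.linalg.solve(G.T @ G, G.T @ rhs)
    return theta + step_length * delta                   # step_length: Hartley-type correction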
If the matrix G^T(θ^(n)) G(θ^(n)), although regular, is ill-conditioned, one can modify the last equation in the spirit of the Levenberg–Marquardt modification of the Gauss–Newton method (cf. Seber and Wild, 1989, pp. 623–624, for the unconstrained regular case).

3.2. The linear confidence region

In this and the following three subsections we suppose that the errors are normal, ε ∼ N(0, σ²I). The linear confidence region is an approximate confidence region following from the asymptotic normal distribution of θ̂. Since this distribution is the same as the exact distribution of the least squares estimator in (11), the linear confidence region is obtained from this linear model. We take the usual linear confidence region (cf. Bates and Watts, 1988, or Seber and Wild, 1989, pp. 25, 194), but with the constraints L(θ̂)[θ − θ̂] = 0. So we obtain

  {θ: L(θ̂)[θ − θ̂] = 0,  [θ − θ̂]^T M(θ̂)[θ − θ̂] < k s² F_{k,N−k}(1 − α)}.
We denoted by F_{k,N−k}(1 − α) the (1 − α) quantile of the Fisher distribution with k and N − k degrees of freedom. Here k is the dimension of T_{θ̂}, and

  s² = ||y − η(θ̂)||² / (N − k)

is the estimator of σ². In fact, one should use M_L(θ̂) instead of M(θ̂) in the confidence region, but this would have no influence on the form of the region, since L(θ̂)[θ − θ̂] = 0. Note that k = rk[M_L(θ̂)]. Note also that this region is a planar cut of the ellipsoidal cylinder

  {θ: (θ − θ̂)^T M_L(θ̂)(θ − θ̂) < const}

by the plane {θ: L(θ̂)(θ − θ̂) = 0}. The result is not a cylinder, but a bounded ellipsoid localized in this plane, because of the rank condition (2). Of course, the criticism of the performance of the linear confidence region presented in Bates and Watts (1988, p. 65) holds also in the case of models with constraints.

3.3. The likelihood ratio confidence region

This region has the form (cf. Seber and Wild, 1989, p. 194, for the case without constraints)

  {θ: γ(θ) = 0,  ||y − η(θ)||² − ||y − η(θ̂)||² < k s² F_{k,N−k}(1 − α)}.

Since the region is defined only by geometric, i.e. parameter-free, terms, such as the points of the expectation surface, the sample point y, or the Euclidean norm, the passage from the case without constraints to the case with constraints is straightforward. The same holds also for the following regions. So the use of geometry is here evident, and elementary.
3.4. The 'flat' confidence region

It has the form (cf. Pázman (1993, 1999) for the case without constraints)

  {θ: γ(θ) = 0,  ||P(θ̂)[y − η(θ)]||² / ||[I − P(θ̂)][y − η(θ)]||² < [k/(N − k)] F_{k,N−k}(1 − α)}.

3.5. The exact sampling confidence region

In the terminology of Bates and Watts (1988), one calls the exact sampling confidence region the set

  {θ: γ(θ) = 0,  ||P(θ)[y − η(θ)]||² / ||[I − P(θ)][y − η(θ)]||² < [k/(N − k)] F_{k,N−k}(1 − α)}

(cf. also Seber and Wild, p. 236, where it appears as Hartley's confidence region). However, as is known, this region is not recommended (cf. Bates and Watts, 1988, pp. 223–225), although it is exact in the sense that its confidence level is exactly 1 − α. They show in examples that the region may be very large, sometimes not closed, even for moderate confidence levels.

3.6. A formal comparison of the regions

At first sight the exact sampling confidence region and the 'flat' region look very similar. However, the fact that in the 'flat' region the projector is not taken at the hypothetical parameter value θ but at the estimated value θ̂ is very important. As a consequence, one can rewrite the 'flat' region in a form which makes it close to the likelihood region, as follows. First, taking into account the definition of s², one obtains the likelihood region in the form

  {θ: γ(θ) = 0,  [||y − η(θ)||² − ||y − η(θ̂)||²] / ||y − η(θ̂)||² < [k/(N − k)] F_{k,N−k}(1 − α)}.

We shall show that the 'flat' region can be written in a similar form

  {θ: γ(θ) = 0,  [||y − η(θ)||² − ||y − η(θ̂, θ)||²] / ||y − η(θ̂, θ)||² < [k/(N − k)] F_{k,N−k}(1 − α)},

where η(θ̂, θ) is the point of intersection of the straight line L through the points y and η(θ̂) with the plane through η(θ) which is parallel to the tangent space T_{θ̂}. Since P(θ̂) projects orthogonally onto T_{θ̂}, we have

  y − η(θ̂, θ) = [I − P(θ̂)][y − η(θ)],

which, by the use of the Pythagorean relation, proves the correctness of the second form of the 'flat' confidence region. Another comparison of these confidence regions will be done at the end of Section 5, where a classification of nonlinear models with constraints is considered.
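For a numerical illustration (a sketch added here, not from the original text; eta, jac and constr_jac are hypothetical user-supplied callables, k is the dimension of the tangent space, and the candidate θ is assumed to satisfy γ(θ) = 0 already), the membership tests for the 'flat' region of Section 3.4 and for the exact sampling region of Section 3.5 differ only in the point at which the projector is evaluated:

import numpy as np
from scipy.stats import f

def _proj(J, L):
    # orthogonal projector onto the tangent space, Section 2.2
    m = J.shape[1]
    IL = np.eye(m) - L.T @ np.linalg.pinv(L @ L.T) @ L
    A = J @ IL
    return A @ np.linalg.pinv(A.T @ A) @ A.T

def in_flat_region(theta, theta_hat, y, eta, jac, constr_jac, k, alpha=0.05):
    N = y.size
    P_hat = _proj(jac(theta_hat), constr_jac(theta_hat))     # projector at theta_hat
    r = y - eta(theta)
    stat = np.sum((P_hat @ r) ** 2) / np.sum((r - P_hat @ r) ** 2)
    return stat < k / (N - k) * f.ppf(1.0 - alpha, k, N - k)

def in_exact_sampling_region(theta, y, eta, jac, constr_jac, k, alpha=0.05):
    N = y.size
    P = _proj(jac(theta), constr_jac(theta))                 # projector at theta itself
    r = y - eta(theta)
    stat = np.sum((P @ r) ** 2) / np.sum((r - P @ r) ** 2)
    return stat < k / (N - k) * f.ppf(1.0 - alpha, k, N - k)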
4. The measures of nonlinearity

4.1. A summary of the approach in regular models without constraints

Research on corrections of confidence regions presented in Beale (1960) stimulated Bates and Watts (1980) to consider measures of nonlinearity of a regular nonlinear model based on the array of second order derivatives of η(θ). (It is denoted by H(θ) here, but by V̈ in Bates and Watts, 1980.) They recommended to consider separately the orthogonal components P(θ)H(θ) and [I − P(θ)]H(θ), and to normalize them by components of the matrix J(θ) (denoted by V̇ in Bates and Watts, 1980). So they obtained what they called the parameter effect and intrinsic measures of nonlinearity. In Bates and Watts (1988, p. 233) they stressed the intuitive 'physical' interpretation, calling components of V̇ 'velocities' and components of V̈ 'accelerations'. However, in models with constraints one can use neither this interpretation nor the orthogonal decompositions of H(θ), and one has to go deeper into the geometry.

The relation of the intrinsic measure of nonlinearity to curvatures known from geometry is presented in Bates and Watts (1980); however, curvatures of straight lines in Θ are considered there. These of course are not geodesics (unless M(θ) does not depend on θ). But geodesics are the least curved among all curves, so at least the intrinsic measure of nonlinearity should be based on them. And in this section we show that also the parameter effect measure of nonlinearity can be related to geodesics.

In the model without constraints the curvature of a geodesic ζ(t) = η[g(t)] through η(θ*) can be expressed as

  K_int(θ*, v) = || d²η[g(t)]/dt² |_{t=0} || = ||[I − P(θ*)][Σ_{i,j} v_i H_{ij,:}(θ*) v_j]|| / (v^T M(θ*) v)    (12)

with v = ġ(0), and the expression

  K_int(θ*) = max_{v ∈ R^m \ {0}} K_int(θ*, v)

is shown to coincide with the intrinsic measure of nonlinearity of Bates and Watts (1980) (cf. Pázman, 1993, Chapters 4 and 5). The first expression in (12) is parameter-free, but not the second. So the first one should be used for extensions to models with constraints. Similarly, the parameter effect nonlinearity (at θ* and in the direction v) is

  K_par(θ*, v) = ||P(θ*)[Σ_{i,j} v_i H_{ij,:}(θ*) v_j]|| / (v^T M(θ*) v)    (13)

and the parameter effect nonlinearity at θ* can be expressed as

  K_par(θ*) = max_{v ∈ R^m \ {0}} K_par(θ*, v).
These expressions are not parameter-free. To make them so, we present the following proposition.

Proposition 1. In regular models without constraints we can write

  K_par(θ*, v) = [g̈^T(t) M_{g(t)} g̈(t)]^{1/2} |_{t=0},    (14)

where g(t) is the geodesic in Θ such that ġ(0) = v.

Proof. From (10) we obtain

  g̈(t) = −{M^{-1}(θ) J^T(θ) a(t)}_{θ=g(t)},   where   a(t) = {Σ_{i,j=1}^m ġ_i(t) H_{ij,:}(θ) ġ_j(t)}_{θ=g(t)}.

It follows that

  g̈^T(t) M_{g(t)} g̈(t) = {a^T(t) J(θ) M^{-1}(θ) J^T(θ) a(t)}_{θ=g(t)} = ||P(θ) a(t)||²_{θ=g(t)},

where we applied expression (9). From (iii) in 2.3 we then have

  ġ^T(t) M_{g(t)} ġ(t) = ||dη[g(t)]/dt||² = 1.

Hence

  g̈^T(t) M_{g(t)} g̈(t) = ||P(θ) a(t)||²_{θ=g(t)} / [ġ^T(t) M_{g(t)} ġ(t)]².
We take this at t = 0, insert v = ġ(0), and compare with (13) to obtain (14).

4.2. Measures of nonlinearity in models with constraints

Denote by B(θ) the following array:

  B_{ij,k}(θ) = H_{ij,k}(θ) − J_{k·}(θ) L^T(θ)[L(θ)L^T(θ)]^- S_{ij,:}(θ),

where the chosen g-inverse is arbitrary.

Proposition 2. In the model with constraints, the intrinsic and the parameter effect measures of nonlinearity at θ*, in a direction v with v ≠ 0 and L(θ*)v = 0, are

  K_int(θ*, v) = ||[I − P(θ*)][Σ_{i,j=1}^m v_i B_{ij,:}(θ*) v_j]|| / (v^T M(θ*) v),

  K_par(θ*, v) = ||P(θ*)[Σ_{i,j=1}^m v_i B_{ij,:}(θ*) v_j]|| / (v^T M(θ*) v).
The proof is based on the parameter-free expressions in (12) and (14), and it is presented in the appendix.

Remark 1. Notice that L(θ*)v = 0 implies M(θ*)v ≠ 0, since v ≠ 0 and the rank condition (2) holds.

Remark 2. The expressions

  K_int(θ*) = max{K_int(θ*, v): v ≠ 0, L(θ*)v = 0},
  K_par(θ*) = max{K_par(θ*, v): v ≠ 0, L(θ*)v = 0}

are the (maximal) intrinsic, respectively parameter effect, measures of nonlinearity at θ*.

Remark 3. Denote by B_L(θ) the array

  {B_L(θ)}_{ij,k} = Σ_{r,s=1}^m [I − P_L(θ)]_{ir} B_{rs,k}(θ) [I − P_L(θ)]_{sj}.    (15)
Proposition 2 still holds if in the expressions for K_int(θ*, v) and K_par(θ*, v) we put B_L(θ) instead of B(θ) and M_L(θ) instead of M(θ). Moreover, one can then drop the conditions v ≠ 0, L(θ*)v = 0 in Remark 2, but one has to put instead the condition M_L(θ*)v ≠ 0. So we have exactly the same expressions for the measures of nonlinearity as in the regular models without constraints, with the important difference that now B_L(θ) and M_L(θ) are used instead of H(θ) and M(θ). Consequently, we have

  K_int(θ*) = max_{M_L(θ*)v ≠ 0} ||[I − P(θ*)][Σ_{i,j=1}^m v_i {B_L(θ*)}_{ij,:} v_j]|| / (v^T M_L(θ*) v),

  K_par(θ*) = max_{M_L(θ*)v ≠ 0} ||P(θ*)[Σ_{i,j=1}^m v_i {B_L(θ*)}_{ij,:} v_j]|| / (v^T M_L(θ*) v).
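A direct, if crude, way to evaluate these maxima numerically is to scan random directions v (a minimal sketch added for illustration, not from the original text; J, L, H and S are assumed to be numpy arrays evaluated at θ*, of shapes N × m, q × m, m × m × N and m × m × q):

import numpy as np

def nonlinearity_measures(J, L, H, S, n_dir=2000, seed=0):
    N, m = J.shape
    PL = L.T @ np.linalg.pinv(L @ L.T) @ L
    IL = np.eye(m) - PL
    ML = IL @ (J.T @ J) @ IL
    P = J @ IL @ np.linalg.pinv(ML) @ IL @ J.T              # projector of Section 2.2
    A = L.T @ np.linalg.pinv(L @ L.T)                       # L^T [L L^T]^-
    B = H - np.einsum('ka,al,ijl->ijk', J, A, S)            # array B of Proposition 2
    BL = np.einsum('ir,rsk,sj->ijk', IL, B, IL)             # array B_L of (15)
    rng = np.random.default_rng(seed)
    k_int = k_par = 0.0
    for _ in range(n_dir):                                  # crude maximization over directions
        v = IL @ rng.standard_normal(m)
        denom = v @ ML @ v
        if denom < 1e-12:                                   # skip directions with M_L v ~ 0
            continue
        w = np.einsum('i,ijk,j->k', v, BL, v)
        k_int = max(k_int, np.linalg.norm(w - P @ w) / denom)
        k_par = max(k_par, np.linalg.norm(P @ w) / denom)
    return k_int, k_par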
5. Probability distribution of the LS estimator in the case of normal errors

In regular models without constraints a much better approximation than the asymptotic normal density of θ̂ is the 'flat' or 'saddle-point' approximation (cf. Pázman, 1993, or Hougaard, 1995, for a discussion). To consider its analogy in models with constraints, let us consider the parameters β which allow us to eliminate the constraints from model (1) (see the Introduction). By θ(β) we denote the mapping which relates the new parameters to the old ones, and k = dim(β) is equal to the rank of M_L(θ). The 'flat' approximation of the density of the LS estimator β̂ has the form

  q(β̂ | β) = { det Q̃(β̂, β) / [(2π)^{k/2} σ^k det^{1/2} M̃(β̂)] } exp{ −(1/(2σ²)) ||P̃(β̂)[η̄(β̂) − η̄(β)]||² },    (16)
where η̄(β) = η[θ(β)] is the reparametrized regression function, and P̃(β̂) = P[θ(β̂)] = P(θ̂) is the projector considered in Section 2.2, but for θ* = θ̂. (The fact that the projector is not influenced by the reparametrization is due to its geometric origin.) Further, in (16),

  M̃_{ij}(β) = (∂η̄^T(β)/∂β_i)(∂η̄(β)/∂β_j)

is the information matrix for the new parameters, and

  Q̃_{ij}(β̂, β) = M̃_{ij}(β̂) + [η̄(β̂) − η̄(β)]^T [I − P̃(β̂)] ∂²η̄(β)/∂β_i ∂β_j |_{β̂}.
Proposition 3. We can write

  q(β̂ | β) = { det[Z^T(β̂) Q_L(θ̂, θ) Z(β̂)] / [(2π)^{k/2} σ^k det^{1/2}[Z^T(β̂) M_L(θ̂) Z(β̂)]] } exp{ −(1/(2σ²)) ||P(θ̂)[η(θ̂) − η(θ)]||² },

where Z(β) = ∂θ(β)/∂β^T, θ̂ = θ(β̂), θ = θ(β), and

  {Q_L(θ̂, θ)}_{ij} = {M_L(θ̂)}_{ij} + [η(θ̂) − η(θ)]^T [I − P(θ̂)] {B_L(θ̂)}_{ij,:}

with B_L(θ) defined in (15). The proof is presented in the appendix.

5.1. The modified information matrix and the four-dimensional curvature array

The matrix Q_L(θ̂, θ) depends only on the original parameters θ and their estimators θ̂, and it can be interpreted as a measure of the information about θ contained in θ̂ (cf. Pázman, 1989, for a discussion of this interpretation in models without constraints). In contrast, the matrix M_L(θ) is the mean information about θ which is contained in the observed vector y in the model with constraints. However, in the general case the mean of Q_L(θ̂, θ) is not equal to M_L(θ), since there is a loss of information in the transformation y → θ̂(y).

The approximation of the density by q(β̂ | β) is especially good if the Riemannian curvature tensor of the model with the parametrization β is zero (cf. Pázman, 1993). The components of this tensor in the model parametrized by β are
  R_{ikjl}(β) = (∂²η̄^T(β)/∂β_i ∂β_j) [I − P̃(β)] (∂²η̄(β)/∂β_k ∂β_l) − (∂²η̄^T(β)/∂β_i ∂β_l) [I − P̃(β)] (∂²η̄(β)/∂β_k ∂β_j).
By performing the derivatives of θ(β), and proceeding similarly as for the density, we obtain that

  R_{ikjl}(β) = Σ_{s,t,u,v} {R_L(θ)}_{stuv} Z_{si}(β) Z_{tk}(β) Z_{uj}(β) Z_{vl}(β),

with

  {R_L(θ)}_{ikjl} = {B_L^T(θ)}_{ij,:} [I − P(θ)] {B_L(θ)}_{kl,:} − {B_L^T(θ)}_{il,:} [I − P(θ)] {B_L(θ)}_{kj,:}.

So we can call the four-dimensional array R_L(θ) the curvature array of the model with constraints. The components of R(β) are zero identically if and only if R_L(θ) is zero identically.
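For completeness, a small sketch of assembling R_L(θ) (added here for illustration; BL is assumed to be the m × m × N array of (15) and P the N × N projector, both evaluated at θ):

import numpy as np

def curvature_array(BL, P):
    """Curvature array R_L of Section 5.1; it vanishes iff the tensor R does."""
    N = P.shape[0]
    C = np.einsum('ijn,nm,klm->ikjl', BL, np.eye(N) - P, BL)   # {B_L}_ij^T (I-P) {B_L}_kl
    return C - np.transpose(C, (0, 1, 3, 2))                   # subtract the (j,l)-swapped term

A model can then be checked for the 'weakest' linear-like case of Section 5.2 by testing whether curvature_array(BL, P) is numerically zero over a grid of parameter points.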
5.2. A classification of models and confidence regions

Having expressions for the parameter and intrinsic measures of nonlinearity and for the curvature tensor, one can present briefly a classification of models. The classification is essentially the same as for regular models without constraints (cf. Pázman, 1992). The model (including the constraints) is linear in θ if and only if K_int(θ) and K_par(θ) are zero identically. Then all four confidence regions from Section 3 coincide, and are exact (i.e. their confidence level is exactly 1 − α). If only K_int(θ) is zero identically, then the model is intrinsically linear, and the likelihood ratio, the 'flat', and the exact sampling confidence regions coincide, and are exact. If only K_par(θ) is zero identically, then the model is linear in the parametric space, and the 'flat' region is 'almost exact' (in a sense described in Pázman, 1993). In all these cases the curvature array R_L(θ) is zero identically. The weakest case is when only R_L(θ) is zero identically, but K_int(θ) and K_par(θ) are nonzero. In this case the 'flat' confidence region is 'almost exact', and also other 'linear-like' properties of the least squares estimator can be found (cf. Pázman, 1999).

5.3. A summarizing remark and an example

One can see from the paper that the passage from models without constraints to models with constraints can be reduced formally to the passage

  from M(θ) to M_L(θ),
  from H(θ) to B_L(θ),
  from Q(θ̂, θ) to Q_L(θ̂, θ),
  from R(θ) to R_L(θ),

and to the use of the projector P(θ) from Section 2.2, while maintaining the original parameters, so that the resulting formulae are not much more complicated than in the case without constraints.

Example. We shall indicate the computation on a simpler version of Example 2, with

  E(y_i) = f^T(x_i)α,   i = 1, …, k,
  E(y_i) = h^T(x_i)β,   i = k + 1, …, N,

where f(x) = (1, x)^T, h(x) = (cos x, sin x)^T, and with just one constraint

  γ(α, β, x_0) = f^T(x_0)α − h^T(x_0)β = 0,

which is nonlinear in x_0. We have θ = (α^T, β^T, x_0)^T, and η(α, β, x_0) = (Fα; Cβ), the two blocks stacked, with

  F = ( 1 x_1; …; 1 x_k ),   C = ( cos x_{k+1} sin x_{k+1}; …; cos x_N sin x_N ).
So we can write

  J = ( F 0 0; 0 C 0 )

(an N × 5 matrix whose last column, corresponding to x_0, is zero), and

  L = (f^T, −h^T, c),   where f = f(x_0), h = h(x_0),
  c = c(α, β, x_0) = ḟ^T α − ḣ^T β,   ḟ = ∂f(x)/∂x |_{x=x_0},   ḣ = ∂h(x)/∂x |_{x=x_0}.

The design of the experiment is supposed to be such that F and C are full rank matrices, so the stacked matrix (J; L) has full rank, as required in (2). We have

  M = diag(F^T F, C^T C, 0),   L L^T = u   with u = ||f||² + ||h||² + c².

Consequently

  L^T L = ( f f^T  −f h^T  c f;  −h f^T  h h^T  −c h;  c f^T  −c h^T  c² )

and according to (4)

  I − P_L = I − u^{-1} L^T L.

By a multiplication of 5 × 5 matrices we obtain

  M_L(α, β, x_0) = [I − P_L] M [I − P_L].

Since η(α, β, x_0) is linear in (α, β) and does not depend on x_0, the array H(α, β, x_0) is zero. Further, the array S(α, β, x_0) can be described by a single matrix (because we have just one constraint), namely the matrix of second order derivatives of γ,

  S = S(α, β, x_0) = ( 0  0  ḟ;  0  0  −ḣ;  ḟ^T  −ḣ^T  −ḧ^T β ),

where ḧ = ∂²h(x)/∂x² |_{x=x_0} (note that ∂²f(x)/∂x² = 0).
So the tth face of the array B(α, β, x_0) is equal to

  B_{··,t}(α, β, x_0) = −u^{-1} [J_{t·} L^T] S,

that is,

  B_{··,t}(α, β, x_0) = −u^{-1} f^T(x_t) f(x_0) S,   t = 1, …, k,
  B_{··,t}(α, β, x_0) = u^{-1} h^T(x_t) h(x_0) S,   t = k + 1, …, N,

and B_L(α, β, x_0) is obtained according to (15). Since

  M + L^T L = ( F^T F + f f^T  −f h^T  c f;  −h f^T  C^T C + h h^T  −c h;  c f^T  −c h^T  c² )

is a regular matrix, we can express the projector P(α, β, x_0) in the form

  P(α, β, x_0) = P = J [M + L^T L]^{-1} J^T,

which requires again just some matrix multiplication.
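The assembly of these matrices is indeed just matrix multiplication; the following sketch (added for illustration, with a hypothetical design and parameter values chosen only so that the constraint is satisfied) carries it out numerically:

import numpy as np

xs = np.linspace(0.0, 1.0, 6)                      # observation points in [a, b]
zs = np.linspace(2.0, 3.0, 6)                      # observation points in [c, d]
x0 = np.pi / 2                                     # node in [b, c]
alpha = np.array([1.0, 0.5])
beta = np.array([0.8, alpha[0] + alpha[1] * x0])   # chosen so that f^T(x0) alpha = h^T(x0) beta

F = np.column_stack([np.ones_like(xs), xs])        # rows f^T(x_i) = (1, x_i)
C = np.column_stack([np.cos(zs), np.sin(zs)])      # rows h^T(x_i) = (cos x_i, sin x_i)
N = F.shape[0] + C.shape[0]

J = np.zeros((N, 5))                               # theta = (alpha, beta, x0), m = 5
J[:F.shape[0], 0:2] = F
J[F.shape[0]:, 2:4] = C

f0, h0 = np.array([1.0, x0]), np.array([np.cos(x0), np.sin(x0)])
fdot, hdot = np.array([0.0, 1.0]), np.array([-np.sin(x0), np.cos(x0)])
c = fdot @ alpha - hdot @ beta
L = np.concatenate([f0, -h0, [c]])[None, :]        # L = (f^T, -h^T, c), a 1 x 5 matrix

M = J.T @ J
P = J @ np.linalg.inv(M + L.T @ L) @ J.T           # projector, Section 2.2.2
assert np.allclose(P @ P, P) and np.allclose(P, P.T)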
Acknowledgements

The author gratefully thanks the Deutscher Akademischer Austauschdienst (DAAD) for the financial support of his stay in Augsburg, and the University of Augsburg, especially Professor Friedrich Pukelsheim, for the hospitality and support during the research and the preparation of this paper.
Appendix A

Proof of P(θ*) in Section 2.2.1. The asymptotic normal distribution of θ̂ is equal to the distribution of the least squares estimator in the approximate model (11), and the tangent plane of this model is just the tangent plane of the nonlinear model, i.e. T_{θ*} + η(θ*). So we can restrict our attention to model (11) to obtain the required relation between the variance V(θ*) of the estimator and the projector P(θ*). In model (11) the variance of J(θ*)(θ̂ − θ*) is equal to

  Var[J(θ*)(θ̂ − θ*)] = σ² P(θ*).
But evidently, from the elementary rules for computing variances, we have

  Var[J(θ*)(θ̂ − θ*)] = J(θ*) V(θ*) J^T(θ*),

which proves the required expression for P(θ*).

Proof of P(θ*) in Section 2.2.2. Denote by M[A] the column space of a matrix A. We have M[M(θ*)] = M[J^T(θ*)] and M[L^T(θ*)L(θ*)] = M[L^T(θ*)]. Further, according to (2) we have M[M(θ*)] + M[L^T(θ*)L(θ*)] = M[(M(θ*), L^T(θ*)L(θ*))] = R^m, but also M[M(θ*)] ∩ M[L^T(θ*)L(θ*)] = {0}, as follows from assumption (6) in 2.2.2. In this case the tangent space is T_{θ*} = {J(θ*)v: v ∈ R^m}, and hence the orthogonal projector onto it is P(θ*) = J(θ*)(J^T(θ*)J(θ*))^- J^T(θ*), with any choice of the indicated g-inverse (cf. Rao and Mitra, 1971). As proved in Pukelsheim (1993, p. 52), from the given properties of M[M(θ*)] and M[L^T(θ*)L(θ*)] it follows that the matrix [M(θ*) + L^T(θ*)L(θ*)]^{-1} is a g-inverse of M(θ*) = J^T(θ*)J(θ*).

Proof of P(θ*) in Section 2.2.3. We use the expression for P(θ*) in 2.2.1, and the equality

  σ^{-2} V(θ*) = M^{-1}(θ*) − M^{-1}(θ*)L^T(θ*)(L(θ*)M^{-1}(θ*)L^T(θ*))^{-1}L(θ*)M^{-1}(θ*)

given in Silvey (1975, p. 178).

Proof of Proposition 2 and of Remark 3. Take a geodesic ζ(t) = η[g(t)] such that g(0) = θ*, and take first ġ(0) = v. We start from (see (12) and (14))

  K_int(θ*, v) = ||ζ̈(0)||,   K_par(θ*, v) = {[g̈(0)]^T M_L(θ*) g̈(0)}^{1/2}.

From here on we omit writing the arguments θ = θ* and t = 0 in the proof. Consider the projector P_L given by (4). By multiplying (8) from the left by L^T(LL^T)^- we obtain

  P_L g̈ = −L^T(LL^T)^- [ġ^T S ġ].    (A.1)

Here we made the abbreviation ġ^T S ġ = Σ_{i,j} ġ_i S_{ij,:} ġ_j. Further from (5) we obtain

  [I − P] J [I − P_L] g̈ = 0.    (A.2)

From point (iv) in the definition of geodesics in Section 2.3 and from (7) we have 0 = P ζ̈ = P{[ġ^T H ġ] + J g̈}, hence

  P J g̈ = −P[ġ^T H ġ].    (A.3)

Again from (iv) and from (7) we have

  ζ̈ = [I − P] ζ̈ = [I − P]{[ġ^T H ġ] + J g̈} = [I − P]{[ġ^T H ġ] + J([I − P_L] g̈ + P_L g̈)}.
Setting here the expressions from (A.1) and (A.2) we obtain

  ζ̈ = [I − P]{[ġ^T H ġ] − J L^T(LL^T)^- [ġ^T S ġ]}.

We insert the expression for the array B, or for B_L from (15), take the norm, and divide the result by 1 = ġ^T M ġ = ġ^T M_L ġ to obtain the required expression for K_int(θ*, v) in model (1). Further, from the definition of M_L and from (A.2) we obtain

  {g̈^T M_L g̈}^{1/2} = ||J(I − P_L) g̈|| = ||P J (I − P_L) g̈||.

Hence from (A.1) and (A.3) we obtain

  {g̈^T M_L g̈}^{1/2} = ||P{[ġ^T H ġ] − J L^T(LL^T)^- [ġ^T S ġ]}||.

Dividing this by ġ^T M_L ġ = 1 and using (15) we obtain the required expression for K_par(θ*, v) in model (1). Note that we can change the magnitude of v without influencing this expression.

Proof of Proposition 3. By the definition of the information matrix we have

  M̃(β̂) = Z^T(β̂) M_L(θ̂) Z(β̂).    (A.4)

By differentiating γ[θ(β)] = 0 twice with respect to β one obtains (with θ = θ(β))

  L(θ) ∂²θ(β)/∂β_i ∂β_j = −T_{ij,:}(β),

where T_{ij,k}(β) = Σ_{s,t} {Z^T(β)}_{is} S_{st,k}(θ) {Z(β)}_{tj}. This gives, after a multiplication by L^T(θ)[L(θ)L^T(θ)]^-, that

  P_L(θ) ∂²θ(β)/∂β_i ∂β_j = −L^T(θ)[L(θ)L^T(θ)]^- T_{ij,:}(β).

Hence

  [I − P(θ)] J(θ) ∂²θ(β)/∂β_i ∂β_j = [I − P(θ)] J(θ) P_L(θ) ∂²θ(β)/∂β_i ∂β_j = −[I − P(θ)] J(θ) L^T(θ)[L(θ)L^T(θ)]^- T_{ij,:}(β).

The first equality holds since [I − P(θ)] J(θ)[I − P_L(θ)] = 0, as follows from the definition of P(θ) in Section 2.2. So

  [I − P̃(β̂)] ∂²η̄(β)/∂β_i ∂β_j |_{β̂} = [I − P(θ̂)] { J(θ̂) ∂²θ(β)/∂β_i ∂β_j |_{β̂} + Σ_{s,t} {Z^T(β̂)}_{is} H_{st,:}(θ̂) {Z(β̂)}_{tj} }
    = [I − P(θ̂)] Σ_{s,t} {Z^T(β̂)}_{is} {B_L(θ̂)}_{st,:} {Z(β̂)}_{tj},
and from this and (A.4) one obtains

  Q̃(β̂, β) = Z^T(β̂) Q_L(θ̂, θ) Z(β̂).

Finally, we put this and (A.4) into (16) to obtain the required result.

References

Aitchison, J., Silvey, S.D., 1958. Maximum-likelihood estimation of parameters subject to restraints. Ann. Math. Statist. 29, 813–823.
Amari, S.I., 1985. Differential-Geometrical Methods in Statistics. Lecture Notes in Statistics, No. 28. Springer, Berlin.
Bates, D.M., Watts, D.G., 1980. Relative curvature measures of nonlinearity. J. Roy. Statist. Soc. Ser. B 42, 1–25.
Bates, D.M., Watts, D.G., 1988. Nonlinear Regression Analysis and Its Applications. Wiley, New York.
Beale, E.M.I., 1960. Confidence regions in nonlinear estimation. J. Roy. Statist. Soc. Ser. B 32, 41–88.
Denis, J.-B., Gower, J.C., 1994. Biadditive models. Biometrics 50, 310–311.
Hougaard, P., 1995. Nonlinear regression and curved exponential families. Improvement of the approximation to asymptotic distribution. Metrika 42, 191–202.
Pázman, A., 1989. On information matrices in nonlinear experimental design. J. Statist. Plann. Inference 21, 253–263.
Pázman, A., 1992. A classification of nonlinear regression models and parameter confidence regions. Kybernetika 28, 444–453.
Pázman, A., 1993. Nonlinear Statistical Models. Kluwer Academic Publishers, Dordrecht.
Pázman, A., 1999. Some properties of the saddlepoint approximation in nonlinear regression. Ann. Inst. Statist. Math. (in print).
Pukelsheim, F., 1993. Optimal Design of Experiments. Wiley, New York.
Rao, C.R., 1945. Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 37, 81–91.
Rao, C.R., 1949. On the distance between two populations. Sankhya 9, 246–248.
Rao, C.R., 1963. Linear Statistical Inference and Its Applications. Wiley, New York.
Rao, C.R., Mitra, S.K., 1971. Generalized Inverse of Matrices and Its Applications. Wiley, New York.
Seber, G.A., Wild, C.J., 1989. Nonlinear Regression. Wiley, New York.
Silvey, S.D., 1959. The Lagrangian multiplier test. Ann. Math. Statist. 30, 389–407.
Silvey, S.D., 1975. Statistical Inference. Chapman & Hall, London.