Statistics & Probability Letters 11 (1991) 57-61
North-Holland

January 1991

Linear structural relations: Gradient and Hessian of the fitting function

Heinz Neudecker
Instituut voor Actuariaat en Econometrie, University of Amsterdam, Amsterdam, The Netherlands

Albert Satorra *
Department of Economics, Universitat Pompeu Fabra, Barcelona, Spain

Received September 1989
Revised November 1989
Abstract: We have derived the Hessian matrix of a general type of fitting function commonly used in the analysis of linear latent-variable models.

Keywords: Covariance structure, fitting function, Kronecker product, matrix differential, first and second derivatives.
1. Introduction

Latent-variable models are now frequently used in economic, social, and behavioral studies for analyzing structural relations among variables. Perhaps the most popular of these models is the so-called LISREL model (Jöreskog, 1977; Wiley, 1973), the analysis of which has become standard practice through Jöreskog and Sörbom's (1983) computer program, also named LISREL. The LISREL model integrates the classical simultaneous-equation model developed in econometrics with the factor-analysis model developed by psychometricians. The classical 'errors-in-variables' model is also a particular case of LISREL. Alternative formulations of latent-variable models, such as Bentler and Weeks' (1980) model implemented in computer programs such as EQS (Bentler, 1985), or the COSAN model of McDonald (1980), have been recognized to be equivalent specifications of the LISREL model. A particular type of analysis of the LISREL model suited for discrete data is implemented in Muthén's (1987) program LISCOMP. A review of latent-variable models can be found in Anderson (1984) and in Aigner, Hsiao, Kapteyn and Wansbeek (1984). A maximum-likelihood analysis, under the assumption that the vector of observed variables is normally distributed, as well as more general distribution-free methods, are implemented in most of the above-mentioned computer programs.

Covariance-structure analysis is a general framework for fitting latent-variable models. Within this framework, if θ denotes the vector containing all the unknown parameters of the model and Σ the population covariance matrix of the vector of all observed variables, then the covariance structure Σ = Σ(θ) implied by the specified model is fitted to the corresponding sample covariance matrix S. The model is fitted by minimizing with respect to θ a (non-negative) real-valued function F = F(S, Σ(θ)) of θ. Numerical iteration methods are required to achieve such a fit.
* The research of this author was made possible by a grant from the Netherlands Organization for Scientific Research (NWO).

0167-7152/91/$03.50 © 1991 - Elsevier Science Publishers B.V. (North-Holland)

As far as we know, all the above-mentioned programs use only the first derivatives of F; thus quasi-Newton optimization
methods and expected Hessians are used. Browne (1984), Shapiro (1986) and Satorra (1989) deal with different theoretical aspects of covariance-structure analysis. Lee and Jennrich (1979) study several algorithms which are used in the practice of covariance-structure analysis. This paper provides an expression for the second derivatives of a general type of fitting function used in the analysis of the LISREL model. Standard matrix differential calculus, as in Magnus and Neudecker (1986), will be used. Although the expressions for the first derivatives are known and widely used, they will also be given, since they are obtained on the way to the second derivatives. The expressions obtained for the second derivatives are a novelty and have practical implications. For instance, the expressions for the Hessian would be needed to implement true Newton fitting algorithms, or to compute statistics using observed Hessians instead of expected ones.

The paper is structured as follows. Section 2 describes the model. Sections 3 and 4 obtain the first and second derivatives, respectively, of the matrix-valued function Σ = Σ(δ). Finally, Section 5 derives the desired gradient and Hessian expressions.
2. The model

A general linear statistical relation that combines measurement and structural equations is the following:

z = Λη + ε,  η = B₀η + ξ,  (1)

where z represents a p-vector of observable variables, and η, ε and ξ are random vectors such that ε is uncorrelated with ξ. The matrices Λ and B₀, and the variance matrices of ε and ξ, Ψ and Φ respectively, contain the parameters of the model. The specific form of these parameter matrices gives rise to particular models. It can be shown that (1) is both a specific case and a generalization of the LISREL model (see, e.g., Bollen (1989), p. 396). Equations (1), with the assumption of zero correlation between ε and ξ, imply the following covariance structure for the matrix Σ of variances of z:

Σ = ΛB⁻¹ΦB⁻ᵀΛ′ + Ψ,  (2)

where B = I − B₀ is supposed to be non-singular. Prior information with regard to the form of B₀, Λ, Φ and Ψ yields more specific models. For instance, in LISREL one may restrict some of the elements of the matrices B₀, Λ, Φ and Ψ to have specific values, or to be equal among themselves. Given a specific model, the distinct and functionally unrelated unknown elements of the matrices B₀, Λ, Φ and Ψ will be assembled into, say, a q-dimensional parameter vector θ. We will also consider the following vector δ of unrestricted parameters of model (1):

δ = [(vec Λ)′, (vech Φ)′, (vec B)′, (vech Ψ)′]′,

where the 'vec' operator transforms a matrix into a vector by stacking the columns of the matrix one underneath the other, and the 'vech' operator does the same but eliminates the supradiagonal elements of the matrix. Usually, without further restrictions added to (1), the parameter vector δ will not be identified by Σ. A specific model will induce a function δ = δ(θ) expressing δ in terms of a parameter vector θ of smaller dimension. For model (1) with the type of restrictions considered, the function δ = δ(·) will be regular enough to be twice continuously differentiable. In fact, in LISREL the derivative matrix Δ = ∂δ/∂θ′ is a constant matrix of zeroes, ones and minus ones. Moreover, Eq. (2) implies that Σ is a function of δ, hence of θ.
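As a concrete illustration of the covariance structure (2), the following sketch (a minimal numerical example assuming NumPy; the dimensions p = 4, m = 2 and all parameter values are illustrative assumptions, not taken from the paper) builds Σ = ΛB⁻¹ΦB⁻ᵀΛ′ + Ψ with B = I − B₀, together with the vec and vech operators just defined:

```python
import numpy as np

rng = np.random.default_rng(0)

p, m = 4, 2                              # p observed variables, m latent
Lam = rng.normal(size=(p, m))            # Lambda: loadings of z on eta
B0 = np.array([[0.0, 0.3],
               [0.0, 0.0]])              # B0: structural coefficients
Phi = np.array([[1.0, 0.2],
                [0.2, 1.0]])             # Phi = Var(xi), symmetric p.d.
Psi = np.diag(rng.uniform(0.5, 1.0, p))  # Psi = Var(eps), diagonal here

B = np.eye(m) - B0                       # B = I - B0, assumed non-singular
Binv = np.linalg.inv(B)
Sigma = Lam @ Binv @ Phi @ Binv.T @ Lam.T + Psi   # the structure (2)

def vec(A):
    # stack the columns of A one underneath the other
    return A.reshape(-1, order="F")

def vech(A):
    # stack the columns of the lower triangle of A (supradiagonal dropped)
    return np.concatenate([A[j:, j] for j in range(A.shape[0])])
```

Because Ψ is taken diagonal and positive while ΛB⁻¹ΦB⁻ᵀΛ′ is positive semi-definite, the resulting Σ is symmetric positive definite, as a population covariance matrix must be; vech(Σ) collects its p(p + 1)/2 distinct elements.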
3. First derivatives of σ = σ(δ)

Vectorizing (2) and taking differentials, we obtain

d vec Σ = (I + K)(ΛB⁻¹ΦB⁻ᵀ ⊗ I) d vec Λ − (I + K)(ΛB⁻¹ΦB⁻ᵀ ⊗ ΛB⁻¹) d vec B + (ΛB⁻¹ ⊗ ΛB⁻¹) d vec Φ + d vec Ψ,

where 'd' denotes the differential, '⊗' denotes the (right) Kronecker product, I is the identity matrix of appropriate order, K is a commutation matrix, and '⁻ᵀ' denotes the transposed inverse of a matrix. Now using vec A = D vech A and vech A = L vec A for symmetric A, where D and L are respectively the duplication and elimination matrices of Magnus and Neudecker (1986), we get

d vech Σ = L(I + K)(ΛB⁻¹ΦB⁻ᵀ ⊗ I) d vec Λ − L(I + K)(ΛB⁻¹ΦB⁻ᵀ ⊗ ΛB⁻¹) d vec B + L(ΛB⁻¹ ⊗ ΛB⁻¹)D d vech Φ + d vech Ψ,

which implies that d vech Σ = G dδ, with G defined as the following partitioned matrix:

G = [L(I + K)(ΛB⁻¹ΦB⁻ᵀ ⊗ I), L(ΛB⁻¹ ⊗ ΛB⁻¹)D, −L(I + K)(ΛB⁻¹ΦB⁻ᵀ ⊗ ΛB⁻¹), I].  (3)
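The partitioned matrix G of (3) can be checked numerically. The sketch below (assuming NumPy; the dimensions and parameter values are illustrative assumptions) builds the duplication, elimination and commutation matrices in the column-wise vech convention, assembles G, and compares it with a central-difference Jacobian of vech Σ with respect to δ = [(vec Λ)′, (vech Φ)′, (vec B)′, (vech Ψ)′]′:

```python
import numpy as np

def vech(A):
    return np.concatenate([A[j:, j] for j in range(A.shape[0])])

def unvech(v, n):
    A = np.zeros((n, n)); k = 0
    for j in range(n):
        for i in range(j, n):
            A[i, j] = A[j, i] = v[k]; k += 1
    return A

def dup(n):   # duplication matrix D: vec A = D vech A for symmetric A
    D = np.zeros((n*n, n*(n+1)//2)); k = 0
    for j in range(n):
        for i in range(j, n):
            D[j*n + i, k] = D[i*n + j, k] = 1.0; k += 1
    return D

def elim(n):  # elimination matrix L: vech A = L vec A
    Lm = np.zeros((n*(n+1)//2, n*n)); k = 0
    for j in range(n):
        for i in range(j, n):
            Lm[k, j*n + i] = 1.0; k += 1
    return Lm

def comm(n):  # commutation matrix K: K vec A = vec A' for n-by-n A
    K = np.zeros((n*n, n*n))
    for i in range(n):
        for j in range(n):
            K[i*n + j, j*n + i] = 1.0
    return K

p, m = 4, 2
rng = np.random.default_rng(1)
Lam = rng.normal(size=(p, m))
Phi = np.array([[1.0, 0.2], [0.2, 1.0]])
B = np.eye(m) - np.array([[0.0, 0.3], [0.0, 0.0]])
Psi = np.diag(rng.uniform(0.5, 1.0, p))

def sigma_vech(delta):  # vech Sigma as a function of delta
    a = 0
    La = delta[a:a + p*m].reshape(p, m, order="F"); a += p*m
    Ph = unvech(delta[a:a + m*(m+1)//2], m); a += m*(m+1)//2
    Bm = delta[a:a + m*m].reshape(m, m, order="F"); a += m*m
    Ps = unvech(delta[a:], p)
    Bi = np.linalg.inv(Bm)
    return vech(La @ Bi @ Ph @ Bi.T @ La.T + Ps)

Binv = np.linalg.inv(B)
C = Lam @ Binv @ Phi @ Binv.T           # Lambda B^{-1} Phi B^{-T}, p x m
Lp, Dm, IK = elim(p), dup(m), np.eye(p*p) + comm(p)
G = np.hstack([Lp @ IK @ np.kron(C, np.eye(p)),
               Lp @ np.kron(Lam @ Binv, Lam @ Binv) @ Dm,
               -Lp @ IK @ np.kron(C, Lam @ Binv),
               np.eye(p*(p+1)//2)])

delta0 = np.concatenate([Lam.ravel(order="F"), vech(Phi),
                         B.ravel(order="F"), vech(Psi)])
J = np.zeros_like(G); h = 1e-5
for k in range(delta0.size):
    e = np.zeros_like(delta0); e[k] = h
    J[:, k] = (sigma_vech(delta0 + e) - sigma_vech(delta0 - e)) / (2*h)

assert np.max(np.abs(J - G)) < 1e-6
```

The agreement of G with the numerical Jacobian confirms, in particular, the sign of the vec B block and the placement of the duplication matrix D in the vech Φ block.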
4. Second derivatives of Σ = Σ(δ)

Differentiating the expression of d vec Σ obtained in the section above, we get the following partitioned matrix of second derivatives for the ijth element σᵢⱼ of Σ:

Hᵢⱼ = ∂²σᵢⱼ/∂δ ∂δ′ =
[ H₁₁   H₁₂   H₁₃   0 ]
[ H₁₂′  0     H₂₃   0 ]
[ H₁₃′  H₂₃′  H₃₃   0 ]
[ 0     0     0     0 ],  (4)

where

H₁₁ = B⁻¹ΦB⁻ᵀ ⊗ Tᵢⱼ,
H₁₂ = (B⁻¹ ⊗ TᵢⱼΛB⁻¹)D,
H₁₃ = −(B⁻¹ΦB⁻ᵀ ⊗ TᵢⱼΛB⁻¹) − K(TᵢⱼΛB⁻¹ΦB⁻ᵀ ⊗ B⁻¹),
H₂₃ = −D′(B⁻ᵀ ⊗ B⁻ᵀΛ′TᵢⱼΛB⁻¹),
H₃₃ = (B⁻¹ΦB⁻ᵀ ⊗ B⁻ᵀΛ′TᵢⱼΛB⁻¹) + K(B⁻ᵀΛ′TᵢⱼΛB⁻¹ΦB⁻ᵀ ⊗ B⁻¹) + (B⁻¹ΦB⁻ᵀΛ′TᵢⱼΛB⁻¹ ⊗ B⁻ᵀ)K,

where Eᵢⱼ = eᵢeⱼ′, with eᵢ the ith column of the identity matrix of dimension p × p, and Tᵢⱼ = Eᵢⱼ + Eⱼᵢ. The zero blocks arise because Σ is linear in Φ and in Ψ, so that all second derivatives involving vech Ψ, and the pure second derivative with respect to vech Φ, vanish.
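The blocks of (4) can likewise be verified numerically. The sketch below (assuming NumPy; the index pair (i, j) = (2, 0), the dimensions and the parameter values are illustrative assumptions) assembles the full symmetric matrix Hᵢⱼ and compares it with a central-difference Hessian of σᵢⱼ with respect to δ; the rows and columns corresponding to vech Ψ are left at zero:

```python
import numpy as np

def vech(A):
    return np.concatenate([A[j:, j] for j in range(A.shape[0])])

def unvech(v, n):
    A = np.zeros((n, n)); k = 0
    for j in range(n):
        for i in range(j, n):
            A[i, j] = A[j, i] = v[k]; k += 1
    return A

def dup(n):   # duplication matrix D: vec A = D vech A for symmetric A
    D = np.zeros((n*n, n*(n+1)//2)); k = 0
    for j in range(n):
        for i in range(j, n):
            D[j*n + i, k] = D[i*n + j, k] = 1.0; k += 1
    return D

def comm(n):  # commutation matrix K: K vec A = vec A' for n-by-n A
    K = np.zeros((n*n, n*n))
    for i in range(n):
        for j in range(n):
            K[i*n + j, j*n + i] = 1.0
    return K

p, m = 4, 2
rng = np.random.default_rng(1)
Lam = rng.normal(size=(p, m))
Phi = np.array([[1.0, 0.2], [0.2, 1.0]])
B = np.eye(m) - np.array([[0.0, 0.3], [0.0, 0.0]])
Psi = np.diag(rng.uniform(0.5, 1.0, p))

Binv = np.linalg.inv(B)
Cs = Binv @ Phi @ Binv.T               # B^{-1} Phi B^{-T}, m x m
Dm, Kmm = dup(m), comm(m)

i, j = 2, 0                            # illustrative index pair
Tij = np.zeros((p, p)); Tij[i, j] += 1.0; Tij[j, i] += 1.0
M = Binv.T @ Lam.T @ Tij @ Lam @ Binv  # B^{-T} Lam' T_ij Lam B^{-1}

q1, q2, q3, q4 = p*m, m*(m+1)//2, m*m, p*(p+1)//2
d = q1 + q2 + q3 + q4
s1, s2, s3 = slice(0, q1), slice(q1, q1+q2), slice(q1+q2, q1+q2+q3)
H = np.zeros((d, d))
H[s1, s1] = np.kron(Cs, Tij)
H12 = np.kron(Binv, Tij @ Lam @ Binv) @ Dm
H[s1, s2] = H12; H[s2, s1] = H12.T
H13 = -np.kron(Cs, Tij @ Lam @ Binv) - np.kron(Binv, Tij @ Lam @ Cs) @ Kmm
H[s1, s3] = H13; H[s3, s1] = H13.T
H23 = -Dm.T @ np.kron(Binv.T, M)
H[s2, s3] = H23; H[s3, s2] = H23.T
H[s3, s3] = (np.kron(Cs, M)
             + Kmm @ np.kron(Binv.T @ Lam.T @ Tij @ Lam @ Cs, Binv)
             + np.kron(Cs @ Lam.T @ Tij @ Lam @ Binv, Binv.T) @ Kmm)

def sigma_ij(delta):  # sigma_{ij} as a function of delta
    a = 0
    La = delta[a:a + p*m].reshape(p, m, order="F"); a += p*m
    Ph = unvech(delta[a:a + m*(m+1)//2], m); a += m*(m+1)//2
    Bm = delta[a:a + m*m].reshape(m, m, order="F"); a += m*m
    Ps = unvech(delta[a:], p)
    Bi = np.linalg.inv(Bm)
    return (La @ Bi @ Ph @ Bi.T @ La.T + Ps)[i, j]

delta0 = np.concatenate([Lam.ravel(order="F"), vech(Phi),
                         B.ravel(order="F"), vech(Psi)])
Hnum = np.zeros((d, d)); h = 1e-4
for a in range(d):
    for b in range(a, d):
        ea = np.zeros(d); ea[a] = h
        eb = np.zeros(d); eb[b] = h
        v = (sigma_ij(delta0 + ea + eb) - sigma_ij(delta0 + ea - eb)
             - sigma_ij(delta0 - ea + eb) + sigma_ij(delta0 - ea - eb)) / (4*h*h)
        Hnum[a, b] = Hnum[b, a] = v

assert np.allclose(Hnum, H, atol=1e-4)
```

In the sketch, the first term of H₁₃ and the K-free terms of H₃₃ are written with the Kronecker factors in the order that acts directly on vec dB; this is the same operator as in (4), merely with the commutation matrix moved across the Kronecker product.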
5. Gradient and Hessian of the fitting function

In the practice of covariance-structure analysis it is usual to minimize the (weighted) least-squares fitting function

F(δ) = (s − σ(δ))′W(s − σ(δ)),  (5)
where s = vech S, with S the sample covariance matrix, σ = vech Σ, and W is a possibly stochastic (positive semi-definite) matrix. This is the fitting function used in generalized least-squares estimation, implemented in most of the computer programs for covariance-structure analysis (EQS, LISREL, LISCOMP, etc.). Also, maximum-likelihood estimation (under the assumption that z is normally distributed) has been shown to yield the same estimators as a weighted least-squares analysis where W is updated after each iteration (see, e.g., Browne, 1974; Lee and Jennrich, 1979; Fuller, 1987, Section 4.2.2). The weighted least-squares function (5), with W supposed not to depend on δ, implies the following expression for the gradient of F:

∂F/∂δ′ = −2(s − σ)′W(∂σ/∂δ′) = −2(s − σ)′WG,  (6)

where G was defined in (3) above. Differentiating dF, we get

d²F = 2(dσ)′W dσ − 2(s − σ)′W d²σ = 2(dσ)′W dσ − 2 ∑_{r=1}^{p*} ((s − σ)′W)ᵣ (d²σ)ᵣ,

where the sum runs over the indexes ij corresponding to the p* = p(p + 1)/2 distinct elements of Σ. Now using dσ = G dδ and (d²σ)ᵣ = (dδ)′Hᵣ dδ, where G is the matrix of (3) and Hᵣ is the matrix of (4) with corresponding ij indexes, we get

d²F = 2(dδ)′[G′WG − ∑_{r=1}^{p*} ((s − σ)′W)ᵣ Hᵣ] dδ;

hence, the Hessian of F is expressed as

∂²F/∂δ ∂δ′ = 2[G′WG − ∑_{r=1}^{p*} {(vech(S − Σ))′W}ᵣ Hᵣ],  (7)
where {(vech(S − Σ))′W}ᵣ = ∑ₕ {vech(S − Σ)}ₕ wₕᵣ. Here wₕᵣ is the hrth element of W, and Hᵣ = ∂²σᵢⱼ/∂δ ∂δ′ for the corresponding index ij.

In the case of normality of z, maximum-likelihood (ML) estimation yields the following fitting function to be minimized:

F_ML(δ) = ln|Σ(δ)| + trace S(Σ(δ))⁻¹,  (8)

which is [−2/(n − 1)] times the log-likelihood function of the data, with n being the sample size. The first derivative of F_ML(δ) is the same as the right-hand side of (6), with the matrix W being defined as

W = ½D′(Σ⁻¹ ⊗ Σ⁻¹)D.  (9)
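That the ML fitting function (8) has the gradient form (6) with the weight matrix (9) is easy to verify numerically. The sketch below (assuming NumPy; the positive definite matrices Σ and S are randomly generated for illustration) compares the directional derivative of F_ML along a symmetric perturbation dΣ with the inner product −2(s − σ)′W vech(dΣ):

```python
import numpy as np

def vech(A):
    return np.concatenate([A[j:, j] for j in range(A.shape[0])])

def dup(n):  # duplication matrix D: vec A = D vech A for symmetric A
    D = np.zeros((n*n, n*(n+1)//2)); k = 0
    for j in range(n):
        for i in range(j, n):
            D[j*n + i, k] = D[i*n + j, k] = 1.0; k += 1
    return D

p = 4
rng = np.random.default_rng(2)
A = rng.normal(size=(p, p)); Sigma = A @ A.T + p * np.eye(p)  # model covariance
Bm = rng.normal(size=(p, p)); S = Bm @ Bm.T + p * np.eye(p)   # "sample" covariance

D = dup(p)
Sinv = np.linalg.inv(Sigma)
W = 0.5 * D.T @ np.kron(Sinv, Sinv) @ D       # the weight matrix of (9)

def F_ml(Sig):  # the ML fitting function (8)
    _, logdet = np.linalg.slogdet(Sig)
    return logdet + np.trace(S @ np.linalg.inv(Sig))

E = rng.normal(size=(p, p)); dSig = E + E.T   # a symmetric direction
h = 1e-5
num = (F_ml(Sigma + h * dSig) - F_ml(Sigma - h * dSig)) / (2 * h)
ana = -2.0 * (vech(S) - vech(Sigma)) @ W @ vech(dSig)
assert abs(num - ana) < 1e-6
```

The check works because −2(s − σ)′W vech(dΣ) = −vec(S − Σ)′(Σ⁻¹ ⊗ Σ⁻¹)vec(dΣ) = trace Σ⁻¹(Σ − S)Σ⁻¹dΣ, which is exactly the differential of (8).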
The Hessian of F_ML(δ) is obtained by adding to the right-hand side of (7), with W given in (9), the following term:

2G′D′(Σ⁻¹ ⊗ Σ⁻¹(S − Σ)Σ⁻¹)DG.
Note that from the gradient and Hessian given above, we immediately obtain the gradient and Hessian of a specific model. Effectively, such a specific model is associated with a function δ = δ(θ) with corresponding derivative matrix Δ = ∂δ(θ)/∂θ′. Linear restrictions imply that the matrix Δ is constant; hence the corresponding gradient and Hessian will be the product matrices [∂F/∂δ′]Δ and Δ′[∂²F/∂δ ∂δ′]Δ, respectively.
References

Aigner, D.J., C. Hsiao, A. Kapteyn and T. Wansbeek (1984), Latent variable models in econometrics, in: Z. Griliches and M.D. Intriligator, eds., Handbook of Econometrics, Vol. 2 (North-Holland, Amsterdam) pp. 1321-1393.
Anderson, T.W. (1984), Estimating linear statistical relationships, Ann. Statist. 12 (1), 1-45.
Bentler, P.M. (1985), Theory and Implementation of EQS, a Structural Equations Program (BMDP Statistical Software, Los Angeles).
Bentler, P.M. and D.G. Weeks (1980), Linear structural equations with latent variables, Psychometrika 45, 289-308.
Bollen, K.A. (1989), Structural Equations with Latent Variables (Wiley, New York).
Browne, M.W. (1974), Generalized least squares estimators in the analysis of covariance structures, South African Statist. J. 8, 1-24.
Browne, M.W. (1984), Asymptotically distribution-free methods for the analysis of covariance structures, British J. Math. Statist. Psych. 37, 62-83.
Fuller, W. (1987), Measurement Error Models (Wiley, New York).
Jöreskog, K.G. (1977), Structural equation models in the social sciences: Specification, estimation and testing, in: P.R. Krishnaiah, ed., Applications of Statistics (North-Holland, Amsterdam) pp. 265-287.
Jöreskog, K.G. and D. Sörbom (1983), LISREL V User's Guide. Analysis of Structural Equation Relationships by Maximum Likelihood and Least Squares Methods (International Educational Services, Chicago).
Lee, S.-Y. and R.I. Jennrich (1979), A study of algorithms for covariance structure analysis with specific comparisons using factor analysis, Psychometrika 44, 99-113.
Magnus, J.R. and H. Neudecker (1986), Symmetry, 0-1 matrices and Jacobians, Econometric Theory 2, 157-190.
McDonald, R.P. (1980), A simple comprehensive model for the analysis of covariance structures, British J. Math. Statist. Psych. 31, 59-72.
Muthén, B. (1987), LISCOMP: Software for advanced analysis of linear structural equations with a comprehensive measurement model (Scientific Software, Mooresville, IN).
Satorra, A. (1989), Alternative test criteria in covariance structure analysis: A unified approach, Psychometrika 54 (1), 131-151.
Shapiro, A. (1986), Asymptotic theory of overparameterized structural models, J. Amer. Statist. Assoc. 81, 142-149.
Wiley, D.E. (1973), The identification problem for structural equation models with unmeasured variables, in: A.S. Goldberger and O.D. Duncan, eds., Structural Equation Models in the Social Sciences (Seminar Press, New York) pp. 69-83.