DETECTION OF INFLUENTIAL POINTS IN GENERAL NONLINEAR PROBLEMS
Jiří Militký 1/ and Jaroslav Čáp 2/
Summary
A method is suggested for expressing the influence of individual experimental points on the parameter estimates of a general regression model with a nonlinear model function by means of the empirical influence function $\mathrm{EIF}_i$. The method is based on expressing the estimates $\hat{\mathbf{a}}(\boldsymbol{\varepsilon})$ as functions of the random errors $\boldsymbol{\varepsilon}$. As a measure of the influence, a normed distance is used between the unconstrained estimator $\hat{\mathbf{a}}(\boldsymbol{\varepsilon})$ and the constrained one $\hat{\mathbf{a}}(\boldsymbol{\varepsilon}_i)$, where the $i$-th component of the vector $\boldsymbol{\varepsilon}_i$ is equal to zero. A comparison of the empirical influence function with the well-known ones for linear models is demonstrated on the special case of nonlinear least squares.
Keywords:
general nonlinear regression, empirical influence function, Cook's distance
1. Introduction
One of the principal phases in constructing nonlinear regression models $f(x, \mathbf{a})$ is the estimation of the model parameters $\mathbf{a} = (a_1, \ldots, a_m)^T$. The starting point is a set of $n$ experimental points $\{x_i, y_i\}$, $i = 1, \ldots, n$, where $x_i$ are deterministic (input) variables and $y_i$ are realizations of random output variables (which are functionally dependent on the input ones). Generally, this regression problem may be expressed by the model relationship
$$ y_i = g_i(x_i, \mathbf{a}, \varepsilon_i), \quad i = 1, \ldots, n \tag{1} $$
where the function $g_i(\cdot)$ also comprises the type of measurement model. For the familiar additive measurement model it holds
$$ y_i = f(x_i, \mathbf{a}) + \varepsilon_i \tag{2} $$
In the above relations $\varepsilon_i$ ($i = 1, \ldots, n$) denote realizations of random variables that cannot be measured or otherwise assessed. These are considered as measurement errors. The random vector $\boldsymbol{\varepsilon}$ is characterized by a joint probability density $p(\boldsymbol{\varepsilon})$ which depends upon distributional parameters. In modelling in the natural and technical sciences it is often supposed that the random errors are independent and identically distributed, having probability density $p(\varepsilon_i, \sigma^2)$, where $\sigma^2$ is the variance. Under the assumption that the additive model of measurement is valid and $p(\varepsilon_i, \sigma^2)$ is the Gaussian probability density, maximum likelihood estimates $\hat{\mathbf{a}}$ may be obtained by minimizing the nonlinear least squares criterion, viz.
$$ S(\mathbf{a}) = \sum_{i=1}^{n} \left[ y_i - f(x_i, \mathbf{a}) \right]^2 \tag{3} $$
1/ Research Institute for Textile Finishing, 544 28 Dvůr Králové n.L., CS
2/ Headquarters of Cotton Industry, 501 63 Hradec Králové, CS
It is obvious that the quality of the input data plays the key role both for the parameter estimates $\hat{\mathbf{a}}$ and for the statistical inference about the regression model. The presence of even a single outlying observation $y_i$ can cause paradoxical estimates or rejection of the regression model in use. For this reason, methods enabling the identification of influential points have recently been introduced, i.e. the identification of points which strongly affect both the parameter estimates and further regression characteristics. Several techniques for the identification of influential points in linear models are described in /1/. In the case of nonlinear models, linearization is mostly performed and the Jacobian matrix $\mathbf{J}$ is used instead of the design matrix $\mathbf{X}$ /1,2/. In the present work a method for the detection of influential points, applicable in the general regression problem defined by equation 1, is described.
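2. Influential Points for General Nonlinear Regression
The influence of the $i$-th point on the parameters $\hat{\mathbf{a}}$ is commonly expressed as the change in the maximum likelihood estimate $\hat{\mathbf{a}}$ caused by deletion of this point /1,3,4/. In our work we define the empirical influence function $\mathrm{EIF}_i$ connected with the $i$-th point $\{x_i, y_i\}$ by the expression
$$ \mathrm{EIF}_i = \hat{\mathbf{a}}(\boldsymbol{\varepsilon}) - \hat{\mathbf{a}}(\boldsymbol{\varepsilon}_i) \tag{4} $$
where $\boldsymbol{\varepsilon}$ is the vector of error terms and $\boldsymbol{\varepsilon}_i$ differs from $\boldsymbol{\varepsilon}$ by having the $i$-th component equal to zero, $\varepsilon_i = 0$. The empirical influence function defined in this way expresses the difference between the maximum likelihood estimate $\hat{\mathbf{a}}(\boldsymbol{\varepsilon})$ and the estimate $\hat{\mathbf{a}}(\boldsymbol{\varepsilon}_i)$ which presumes accuracy of the $i$-th point (i.e. the realization of $y_i$ equals $f(x_i, \hat{\mathbf{a}}(\boldsymbol{\varepsilon}_i))$ and $\varepsilon_i = 0$). In order to express $\mathrm{EIF}_i$ in an explicit form, $\hat{\mathbf{a}}(\boldsymbol{\varepsilon})$ is expanded in a Taylor series. Provided that we confine ourselves to linear terms only, we may write
$$ \hat{\mathbf{a}}(\boldsymbol{\varepsilon}) \approx \hat{\mathbf{a}}(\mathbf{0}) + \mathbf{F} \boldsymbol{\varepsilon} \tag{5} $$
The matrix $\mathbf{F}$ ($m \times n$) has the elements $F_{ij} = \partial \hat{a}_i(\mathbf{0}) / \partial \varepsilon_j$, $i = 1, \ldots, m$, $j = 1, \ldots, n$ (see /5/).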
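Equations 4 and 5 can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes a hypothetical straight-line model (chosen only because the estimate is then exactly linear in the errors), simulated errors, and a central-difference approximation of the matrix of derivatives of the estimates with respect to the errors. All names (`a_hat`, `sensitivity_matrix`, `eif_exact`, `eif_linear`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear-in-parameters model y = a1 + a2*x, used only because
# the estimate a_hat(eps) is then exactly linear in the errors.
x = np.linspace(0.0, 1.0, 8)
X = np.column_stack([np.ones_like(x), x])    # design matrix (n x m)
a_true = np.array([1.0, 2.0])
eps = rng.normal(scale=0.1, size=x.size)     # simulated error vector


def a_hat(errors):
    """Least squares estimate as a function of the error vector."""
    y = X @ a_true + errors
    return np.linalg.lstsq(X, y, rcond=None)[0]


def sensitivity_matrix(h=1e-6):
    """F[k, j] = d a_hat_k / d eps_j at eps = 0, by central differences."""
    n = x.size
    F = np.empty((a_true.size, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = h
        F[:, j] = (a_hat(step) - a_hat(-step)) / (2.0 * h)
    return F


F = sensitivity_matrix()

i = 3
eps_i = eps.copy()
eps_i[i] = 0.0                               # i-th error set to zero
eif_exact = a_hat(eps) - a_hat(eps_i)        # equation 4
eif_linear = F[:, i] * eps[i]                # linear term of equation 5
```

For this linear model the two vectors agree up to rounding. For a nonlinear model function they would differ by the neglected higher-order Taylor terms.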
The estimates $\hat{\mathbf{a}}$ for the model defined in equation 1 are in general obtained by solving the system of normal (likelihood) equations
$$ H_j = \sum_{i=1}^{n} h_j(\hat{\mathbf{a}}, \varepsilon_i) = 0, \quad j = 1, \ldots, m \tag{6} $$
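Then the matrix $\mathbf{F}$ may be expressed in the form
$$ \mathbf{F} = -\mathbf{L}^{-1} \mathbf{R} \tag{7} $$
where $\mathbf{L}$ ($m \times m$) is the matrix with the elements $L_{ij} = \partial H_i / \partial a_j$ ($i, j = 1, \ldots, m$) and $\mathbf{R}$ ($m \times n$) the one with the elements $R_{ij} = \partial h_i / \partial \varepsilon_j$ ($i = 1, \ldots, m$, $j = 1, \ldots, n$). For determining $\hat{\mathbf{a}}(\boldsymbol{\varepsilon}_i)$ it suffices to substitute the vector $\boldsymbol{\varepsilon}_i$ for $\boldsymbol{\varepsilon}$ in equation 5. By means of equation 5 the empirical influence function is expressed in the form
$$ \mathrm{EIF}_i = \mathbf{F} \, \mathbf{i}_i \, \hat{\varepsilon}_i, \quad i = 1, \ldots, n \tag{8} $$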
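where $\mathbf{i}_i$ ($n \times 1$) is the unit vector having all elements except the $i$-th one equal to zero and $\hat{\varepsilon}_i = y_i - f(x_i, \hat{\mathbf{a}}(\boldsymbol{\varepsilon}))$ is the maximum likelihood estimate of $\varepsilon_i$. Since $\mathrm{EIF}_i$ is an ($m \times 1$) vector, it is useful to consider norms determined by a symmetric positive definite ($m \times m$) matrix (see /1/, chapter 3.5.1). As in linear models, it is useful to choose an invariant norm corresponding approximately to the $(1 - \alpha) \cdot 100\,\%$ confidence region, i.e.
$$ D_i = \mathrm{EIF}_i^T \, \mathbf{C}^{-1}(\hat{\mathbf{a}}) \, \mathrm{EIF}_i \tag{9} $$
In equation 9, $\mathbf{C}(\hat{\mathbf{a}})$ is the covariance matrix of the estimates $\hat{\mathbf{a}}(\boldsymbol{\varepsilon})$. From equation 5 it may easily be derived that for independent errors $\varepsilon_i$ with constant variance $\mathbf{C}(\hat{\mathbf{a}}) = \sigma^2 \mathbf{F} \mathbf{F}^T$. Consequently, equation 9 may be rearranged to the form
$$ D_i = \frac{\hat{\varepsilon}_i^2}{\sigma^2} \, \mathbf{i}_i^T \mathbf{F}^T \left( \mathbf{F} \mathbf{F}^T \right)^{-1} \mathbf{F} \, \mathbf{i}_i \tag{10} $$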
The quantity $D_i/m$ corresponds to Cook's distance in linear models (see /1/) and may be converted to a probability scale by comparing it with the quantile $F_\alpha(m, n - m)$ of the Fisher distribution. For example, if $D_i/m$ equals $F_{0.5}(m, n - m)$, then setting the $i$-th error $\varepsilon_i = 0$ would move the estimate $\hat{\mathbf{a}}(\boldsymbol{\varepsilon}_i)$ to the edge of the 50 % confidence region relative to $\hat{\mathbf{a}}(\boldsymbol{\varepsilon})$.
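The chain from the normal equations to the distance of equation 9 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the data are hypothetical (a straight line with one deliberately shifted response), the criterion is the sum of squares so that the derivative matrices of the normal equations are known in closed form, and the unknown variance is replaced by the usual residual estimate `sigma2`, which is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: straight line with one deliberately shifted response.
n, m = 10, 2
x = np.linspace(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.05, size=n)
y[-1] += 1.0                                   # contaminated point

a_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ a_hat                          # estimates of the errors
sigma2 = resid @ resid / (n - m)               # assumed variance estimate

# For S(a) = sum (y_i - x_i^T a)^2 the normal equations H = dS/da = 0 give
L = 2.0 * X.T @ X                              # L_ij = dH_i/da_j   (m x m)
R = -2.0 * X.T                                 # R_ij = dh_i/deps_j (m x n)
F = -np.linalg.solve(L, R)                     # equation 7

EIF = F * resid                                # column i is F i_i e_i (equation 8)
C = sigma2 * F @ F.T                           # covariance of the estimates
C_inv = np.linalg.inv(C)
D = np.einsum("ki,kl,li->i", EIF, C_inv, EIF)  # equation 9 for every point

most_influential = int(np.argmax(D))
```

To place `D / m` on the probability scale it can be compared with a Fisher quantile, for instance `scipy.stats.f.ppf(0.5, m, n - m)` if SciPy is available.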
3. Influential Points for Nonlinear Least Squares
In practice, the additive measurement model is mostly utilized together with the Gaussian error distribution $p(\boldsymbol{\varepsilon}) = N(\mathbf{0}, \sigma^2 \mathbf{I})$ with zero mean and diagonal covariance matrix. Maximum likelihood estimates $\hat{\mathbf{a}}$ are then obtained by solving the set of nonlinear equations
$$ H_j = \partial S(\mathbf{a}) / \partial a_j = 0, \quad j = 1, \ldots, m \tag{11} $$
where $S(\mathbf{a})$ is defined by equation 3. It may easily be derived that in this case the following holds
$$ \mathbf{F} = 2 \, \mathbf{H}^{-1} \mathbf{J}^T \tag{12} $$
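where $\mathbf{H}$ ($m \times m$) is the matrix having the elements $\partial^2 S(\mathbf{a}) / \partial a_i \partial a_j$ and $\mathbf{J}$ ($n \times m$) is the Jacobian matrix with the elements $\partial f(x_i, \mathbf{a}) / \partial a_j$ ($i = 1, \ldots, n$, $j = 1, \ldots, m$). The matrix $\mathbf{H}$ may be decomposed into the form $\mathbf{H} = 2 \, \mathbf{J}^T \mathbf{J} + \mathbf{B}$. The matrix $\mathbf{B}$ contains the second derivatives of the model function with respect to the regression parameters and the estimates $\hat{\boldsymbol{\varepsilon}}$ of the errors. In cases where equation 5 is valid approximately, $\mathbf{B}$ may be omitted. Then the empirical influence function results in the form
$$ \mathrm{EIF}_i = \left( \mathbf{J}^T \mathbf{J} \right)^{-1} \mathbf{J}_i \, \hat{\varepsilon}_i \tag{13} $$
where $\mathbf{J}_i = \left( \partial f(x_i, \mathbf{a}) / \partial a_1, \ldots, \partial f(x_i, \mathbf{a}) / \partial a_m \right)^T$ corresponds to the $i$-th row of the matrix $\mathbf{J}$. It may be proved with ease that equation 13 for linear regression models becomes the familiar empirical influence curve (see /1/) and corresponds to the infinitesimal jackknife. After substituting from equation 13 into equation 9 we may write
$$ D_i = \frac{\hat{\varepsilon}_i^2}{\sigma^2} \, \mathbf{J}_i^T \left( \mathbf{J}^T \mathbf{J} \right)^{-1} \mathbf{J}_i \tag{14} $$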
In order to identify highly influential points, $D_i/m$ is again compared with the quantiles $F_\alpha(m, n - m)$ of the Fisher distribution. For highly nonlinear functions and larger errors $\hat{\varepsilon}_i$, the calculation of $\mathrm{EIF}_i$ and/or $D_i$ may be refined by substituting from equation 12 directly into equation 8 and equation 10, respectively (which necessitates computation of the derivatives $\partial^2 f(x_i, \mathbf{a}) / \partial a_j \partial a_k$).
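The nonlinear least squares case of equations 13 and 14 may be sketched as follows. The exponential model, the simulated data with one shifted response, the plain Gauss-Newton iteration without step control, and the residual variance estimate are all assumptions made for illustration; none of them comes from the paper. The second-derivative matrix is omitted, as in equation 13.

```python
import numpy as np


def model(x, a):
    """Hypothetical model function f(x, a) = a1 * exp(-a2 * x)."""
    return a[0] * np.exp(-a[1] * x)


def jacobian(x, a):
    """Jacobian J (n x m) with elements df(x_i, a)/da_j."""
    e = np.exp(-a[1] * x)
    return np.column_stack([e, -a[0] * x * e])


def gauss_newton(x, y, a0, iters=50):
    """Plain Gauss-Newton iteration (no damping, fixed iteration count)."""
    a = np.asarray(a0, dtype=float)
    for _ in range(iters):
        J = jacobian(x, a)
        r = y - model(x, a)
        a = a + np.linalg.lstsq(J, r, rcond=None)[0]
    return a


def influence(x, y, a):
    """Return EIF (m x n, one column per point) and D (n,) at estimate a."""
    n, m = x.size, a.size
    J = jacobian(x, a)
    r = y - model(x, a)                        # residuals
    sigma2 = r @ r / (n - m)                   # assumed variance estimate
    JtJ_inv = np.linalg.inv(J.T @ J)
    EIF = (JtJ_inv @ J.T) * r                  # equation 13
    D = r**2 * np.einsum("ij,jk,ik->i", J, JtJ_inv, J) / sigma2  # equation 14
    return EIF, D


rng = np.random.default_rng(2)
x = np.linspace(0.0, 4.0, 12)
y = model(x, np.array([2.0, 0.7])) + rng.normal(scale=0.02, size=x.size)
y[2] += 0.3                                    # one shifted response

a_hat = gauss_newton(x, y, a0=[1.5, 0.5])
EIF, D = influence(x, y, a_hat)
```

The values `D / a_hat.size` can then be ranked or compared with Fisher quantiles; `np.argmax(D)` gives the first candidate for closer inspection.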
4. Literature
/1/ Cook R. D., Weisberg S.: Residuals and Influence in Regression, Chapman and Hall, New York 1982.
/2/ Moolgavkar S. H. et al.: Annals of Statistics 12 (1984), 816.
/3/ Pregibon D.: Annals of Statistics 9 (1981), 705.
/4/ Militký J., Čáp J.: Proc. 13th Annual Conference "Grundlagen der Modellierung und Simulationstechnik", Rostock, December 1984.
/5/ Ramsey J. B.: Econometrica 46 (1978), 901.