Statistics & Probability Letters 8 (1989) 171-174
North-Holland
June 1989
INFLUENCE OF INCOMPLETE OBSERVATIONS IN MULTIPLE LINEAR REGRESSION
Weichung Joseph SHIH
Investigative Research, Biostatistics and Research Data Systems, Merck, Sharp & Dohme Research Laboratories, WBD-216, P.O. Box 2000, Rahway, NJ 07065-0914, USA

Received April 1988
Revised August 1988
Abstract: This paper is concerned with the influence of incomplete data due to random missing values in the multiple linear regression problem. Using the idea of Hampel's influence function, a partial influence function is derived and shown to be useful in several ways. Comparisons with the complete data situation and with the empirical case-deletion distance measure are also given.

Keywords: missing values, EM algorithm, influence function, multiple linear regression.
1. Introduction

Let $y$ be the response variable and $x$ be a $p$-dimensional vector of explanatory variables, related according to the linear regression relationship

$$y = x'\beta + e, \tag{1}$$

where $\beta$ is a $p \times 1$ vector of unknown regression coefficients and $e$ is an error term which has mean 0 and is independent of $x$. Consider the situation when independent samples are taken but missing values occur at random in some of the components of $(y, x')$ (missing at random in the sense of Rubin, 1976). A common approach to estimating $\beta$ in this situation is maximum likelihood with a multinormal model on $z = (y, x')'$, using the well-known EM algorithm (see, e.g., Beale and Little, 1975; Dempster, Laird and Rubin, 1977; and Little, 1979).

An interesting question that arises frequently in practice with this approach is whether missing values in the response variable affect the estimation of $\beta$. The answer is "no", and this has been shown by many authors: Hocking and Smith (1972), Press and Scott (1974), Rubin (1974), and Zacks and Rodrigues (1986). However, little attention has been paid to the other, equally interesting question: what effect will missing explanatory variable values have? Shih and Weisberg (1986) studied this problem by a case-deletion method which gives a Cook Distance-like measure of the effect.

This note uses Hampel's (1974) influence function approach to examine the problem in a broader framework that includes both least squares and maximum likelihood estimation. It is interesting to note that this approach also covers the first situation of missing response variable values, and so provides a unified treatment of the two questions. Furthermore, the resulting partial influence function (see eq. (7)) provides a theoretical analogue of the empirical distance measure given in Shih and Weisberg (1986). By comparison with the complete data influence function, the partial influence function has an extra correction term for the regression coefficients corresponding to the explanatory variables with missing values. The expression of the partial influence function gives an explicit composition of the marginal and carry-over effects of the
observed variables. It also suggests a form of estimated residuals which can be useful in implementing model diagnostics (such as those proposed in Shih and Weisberg, 1986) or robust estimation for multiple regression with incomplete data.
2. Partial influence function of $\beta$ at an incomplete observation

Let $T(F)$ be some Gateaux differentiable statistical functional (cf. Huber, 1981) corresponding to a statistic (estimator) $T(\hat{F})$, where $\hat{F}$ is the empirical distribution function of $F$. The influence function (IF) of $T(F)$ at a point $z$ is defined by Hampel (1974) as

$$\mathrm{IF}(z; T, F) = \lim_{\varepsilon \to 0} \frac{T(F_\varepsilon) - T(F)}{\varepsilon}, \tag{2}$$

where $F_\varepsilon = (1 - \varepsilon)F + \varepsilon\delta_z$ is a mixture of the distribution function $F$ and the point mass distribution function $\delta_z$, $0 < \varepsilon < 1$. That is, the influence of $z$ on $T(F)$ is measured by the infinitesimal change in $T$ when a perturbation is induced by adding a point mass to the distribution.

Applying (2) to our problem, we first partition the vector point $z = (z_1', z_2')'$ and express $F = F_{z_2|z_1} F_{z_1}$, where $F_{z_2|z_1}$ and $F_{z_1}$ denote, respectively, the conditional distribution of $z_2$ given $z_1$ and the marginal distribution of $z_1$. The first part $z_1$ is the set of fully observed variables and the second part $z_2$ contains the variables with missing values. Next we let the perturbation occur on the marginal distribution $F_{z_1}$ at a lower dimensional point mass $z_1 = z_1^*$, and let the perturbation propagate along the domain of $F_{z_2|z_1 = z_1^*}$ to the joint distribution $F$. By writing

$$F_\varepsilon = F_{z_2|z_1}\bigl[(1 - \varepsilon)F_{z_1} + \varepsilon\delta_{z_1^*}\bigr] = (1 - \varepsilon)F + \varepsilon\,\delta_{z_1^*} F_{z_2|z_1 = z_1^*} \tag{3}$$

in this case, (2) becomes (following from the definition of Gateaux differentiation with (3) as the perturbed distribution function)

$$\mathrm{IF}_{z_1}(z; T, F) = \int \mathrm{IF}(z; T, F)\,\mathrm{d}F_{z_2|z_1 = z_1^*}, \tag{4}$$

which is referred to as the partial influence function at $z_1$.

We now consider the regression problem posed previously, where $z = (y, x')'$. Since the error term $e$ has mean 0 and is independent of $x$,

$$E_F(xy) = E_F\bigl(x(x'\beta + e)\bigr) = E_F(xx')\beta.$$

A functional representation of $\beta$ is

$$\beta(F) = \psi^{-1}(F)\gamma(F), \tag{5}$$

where

$$\psi(F) = E_F(xx') \quad\text{and}\quad \gamma(F) = E_F(xy),$$
which corresponds to the least-squares estimate of $\beta$ (also the maximum likelihood estimate when $F$ is multinormal) in the complete data situation by setting $F = \hat{F}$. Hinkley (1977) provides (5) and, without detailed proof, the following influence function:

$$\mathrm{IF}(z; \beta, F) = \psi^{-1}(F)\, x\bigl(y - x'\beta(F)\bigr). \tag{6}$$
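To illustrate definition (2) and formula (6) in the complete data case, the following sketch (all data and variable names are illustrative, not from the paper) perturbs the empirical distribution with a small point mass and compares the resulting finite-$\varepsilon$ difference quotient with the closed form (6):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(size=n)

# Functional beta(F) = psi^{-1}(F) gamma(F) at the empirical F:
# psi(F) = E(xx'), gamma(F) = E(xy).
psi = X.T @ X / n
gamma = X.T @ y / n
beta_hat = np.linalg.solve(psi, gamma)

# Influence of a point z = (y0, x0')' via the mixture in (2):
# F_eps = (1 - eps) F + eps * delta_z, IF ~ (beta(F_eps) - beta(F)) / eps.
x0 = np.array([2.0, 1.0, -1.0])
y0 = 4.0
eps = 1e-6
psi_eps = (1 - eps) * psi + eps * np.outer(x0, x0)
gamma_eps = (1 - eps) * gamma + eps * x0 * y0
beta_eps = np.linalg.solve(psi_eps, gamma_eps)
if_numeric = (beta_eps - beta_hat) / eps

# Closed form (6): IF(z; beta, F) = psi^{-1}(F) x (y - x' beta(F)).
if_closed = np.linalg.solve(psi, x0 * (y0 - x0 @ beta_hat))

print(np.allclose(if_numeric, if_closed, rtol=1e-3))  # expect True
```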
The proof of (6) will follow easily as a special case of the partial influence function shown in the following theorem.
Theorem. Let $z_1 = (x_1', y)'$ be the set of fully observed explanatory variables and the response variable, and let $z_2 = x_2$ contain the explanatory variables with missing values. Partition the regression coefficient $\beta = (\beta_1', \beta_2')'$ according to the partition of $x = (x_1', x_2')'$. Then the partial influence function of $\beta$ at $z_1$ is

$$\mathrm{IF}_{z_1}(z; \beta, F) = \psi^{-1}(F)\left\{
\begin{pmatrix} x_1 \\ E_F(x_2 \mid x_1, y) \end{pmatrix}
\bigl(y - x_1'\beta_1(F) - E_F'(x_2 \mid x_1, y)\beta_2(F)\bigr)
- \begin{pmatrix} 0 \\ \bigl[E_F(x_2 x_2' \mid x_1, y) - E_F(x_2 \mid x_1, y)E_F'(x_2 \mid x_1, y)\bigr]\beta_2(F) \end{pmatrix}
\right\}. \tag{7}$$
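Before turning to the proof, it may help to see (7) evaluated numerically. Under a multinormal model for $z$, the conditional moments $E_F(x_2 \mid x_1, y)$ and the conditional covariance of $x_2$ are available from the joint covariance matrix. The sketch below is illustrative only: simulated data, scalar $x_1$ and $x_2$, and moments estimated from the sample rather than from an EM fit.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
# Simulate z = (y, x1, x2)' jointly; x2 will be treated as missing.
X = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=n)
beta = np.array([1.5, -1.0])             # (beta1, beta2)
y = X @ beta + rng.normal(size=n)

psi = X.T @ X / n                        # psi(F) = E(xx')
beta_F = np.linalg.solve(psi, X.T @ y / n)

# Conditional moments of x2 given (x1, y) from the fitted joint normal.
z = np.column_stack([X[:, 0], y, X[:, 1]])       # order: (x1, y, x2)
S = np.cov(z, rowvar=False)
S11, S12, S22 = S[:2, :2], S[:2, 2], S[2, 2]
w = np.linalg.solve(S11, S12)                    # regression of x2 on (x1, y)
cond_var = S22 - S12 @ w                         # Var(x2 | x1, y)

def partial_influence(x1, y0):
    """Partial influence function (7) at z1 = (x1, y)' with x2 missing."""
    e_x2 = np.array([x1, y0]) @ w                # E(x2 | x1, y); data are mean-zero
    x_tilde = np.array([x1, e_x2])
    resid = y0 - x_tilde @ beta_F                # residual with x2 filled in
    correction = np.array([0.0, cond_var * beta_F[1]])
    return np.linalg.solve(psi, x_tilde * resid - correction)

print(partial_influence(x1=1.2, y0=2.0))
```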
Proof. (3) and (5) imply that

$$\psi(F_\varepsilon) = (1 - \varepsilon)E_F(xx') + \varepsilon E_{F_{z_2|z_1 = z_1^*}}(xx')
= (1 - \varepsilon)\psi(F) + \varepsilon
\begin{pmatrix} x_1 x_1' & x_1 E_F'(x_2 \mid x_1, y) \\ E_F(x_2 \mid x_1, y)x_1' & E_F(x_2 x_2' \mid x_1, y) \end{pmatrix}
= \psi(F) + \varepsilon\left\{
\begin{pmatrix} x_1 x_1' & x_1 E_F'(x_2 \mid x_1, y) \\ E_F(x_2 \mid x_1, y)x_1' & E_F(x_2 x_2' \mid x_1, y) \end{pmatrix}
- \psi(F)\right\}$$

and that

$$\gamma(F_\varepsilon) = (1 - \varepsilon)\gamma(F) + \varepsilon \begin{pmatrix} x_1 y \\ E_F(x_2 \mid x_1, y)\, y \end{pmatrix}.$$

The inverse of $\psi(F_\varepsilon)$ is

$$\psi^{-1}(F_\varepsilon) = \psi^{-1}(F) - \varepsilon\, \psi^{-1}(F)\left\{
\begin{pmatrix} x_1 x_1' & x_1 E_F'(x_2 \mid x_1, y) \\ E_F(x_2 \mid x_1, y)x_1' & E_F(x_2 x_2' \mid x_1, y) \end{pmatrix}
- \psi(F)\right\}\psi^{-1}(F) + O(\varepsilon^2),$$

so that

$$\beta(F_\varepsilon) = \psi^{-1}(F_\varepsilon)\gamma(F_\varepsilon)
= \beta(F) + \varepsilon\, \psi^{-1}(F)\left\{
\begin{pmatrix} x_1 y \\ E_F(x_2 \mid x_1, y)\, y \end{pmatrix}
- \begin{pmatrix} x_1 x_1' & x_1 E_F'(x_2 \mid x_1, y) \\ E_F(x_2 \mid x_1, y)x_1' & E_F(x_2 x_2' \mid x_1, y) \end{pmatrix}
\beta(F)\right\} + O(\varepsilon^2). \tag{8}$$

The theorem follows by substituting (8) into (2) with $F_\varepsilon$ as in (3) and then simplifying the result.
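As a check on the expansion (8), one can form $\psi(F_\varepsilon)$ and $\gamma(F_\varepsilon)$ exactly as in the proof for a small $\varepsilon$ and compare $(\beta(F_\varepsilon) - \beta(F))/\varepsilon$ with the closed form (7). A self-contained sketch with hypothetical population moments (scalar $x_1$ and $x_2$; all numbers illustrative):

```python
import numpy as np

# Hypothetical population quantities (for illustration only).
psi = np.array([[1.0, 0.6], [0.6, 1.0]])    # psi(F) = E(xx')
beta = np.array([1.5, -1.0])                # beta(F) = (beta1, beta2)
gamma = psi @ beta                           # gamma(F) = E(xy) = psi(F) beta(F)
x1, y0 = 1.2, 2.0                            # observed part z1 = (x1, y)'
e_x2, v_x2 = 0.8, 0.4                        # assumed E(x2|x1,y) and Var(x2|x1,y)

# Perturbed moments as in the proof: point mass at z1 propagated along F_{x2|z1}.
M = np.array([[x1 * x1,   x1 * e_x2],
              [x1 * e_x2, v_x2 + e_x2 ** 2]])   # E(xx') under the point mass
g = np.array([x1 * y0, e_x2 * y0])              # E(xy)  under the point mass

eps = 1e-8
beta_eps = np.linalg.solve((1 - eps) * psi + eps * M,
                           (1 - eps) * gamma + eps * g)
if_numeric = (beta_eps - beta) / eps

# Closed form (7).
x_tilde = np.array([x1, e_x2])
if_closed = np.linalg.solve(psi, x_tilde * (y0 - x_tilde @ beta)
                                 - np.array([0.0, v_x2 * beta[1]]))
print(np.allclose(if_numeric, if_closed, rtol=1e-3))   # expect True
```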
3. Remarks

(1) For the complete data case, $z_1 = z = (y, x')'$ and (7) reduces to (6).

(2) If missing values occur only in $y$, i.e. when $z_2 = y$ and $z_1 = x$, then following through the proof of the theorem leads to $\mathrm{IF}_{z_1}(z; \beta, F) = 0$. Therefore, we have proved here in a broader framework that, for any procedure which replaces the missing values by their conditional expectations and calculates the estimate of $\beta$ satisfying the functional form of (5), the incomplete observations with missing values in the
response variable and not in the explanatory variables have no influence on the estimation of $\beta$ and can be discarded without loss of information. This includes least-squares and maximum likelihood estimation using the EM algorithm.

(3) Expression (7) admits several significant interpretations. Firstly, the observed variables in $x_1$ have their marginal effect on $\beta_1$ as well as their carry-over effect on $\beta_2$. Secondly, by comparison with the complete data influence function (6), there is an extra correction term in the incomplete data case for $\beta_2$, and this correction is proportional to the conditional covariance of $x_2$ (the set of variables with missing values). Thirdly, residuals play an important role in (7) as they do in complete data regression, but in the incomplete data case the residuals need to be estimated by filling in the missing values with their conditional expectations. As has been pointed out by Beale and Little (1975) and by Shih and Weisberg (1986), this filling-in step is a useful characteristic of the EM algorithm for residual analyses in incomplete-data regression. Finally, (7) is a functional form of the empirical distance measure proposed in Shih and Weisberg (1986, eq. (3.3)) and thereby provides a theoretical analogue of the one-step case-deletion procedure.
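As a companion to remark (3), the following sketch illustrates the filling-in step: missing $x_2$ values are replaced by their conditional expectations and residuals are then formed for diagnostics. It assumes a joint-normal model; in practice the mean and covariance would come from the EM fit, whereas here, purely for illustration, they are estimated from the complete cases.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
# Complete data, then mask x2 at random to mimic missingness.
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(scale=0.8, size=n)
y = 1.5 * x1 - 1.0 * x2 + rng.normal(size=n)
miss = rng.random(n) < 0.3                   # about 30% of x2 missing at random

# In practice mu and Sigma come from the EM fit on (x1, y, x2);
# here they are estimated from the complete cases for illustration.
Z = np.column_stack([x1, y, x2])[~miss]
mu, Sigma = Z.mean(axis=0), np.cov(Z, rowvar=False)
w = np.linalg.solve(Sigma[:2, :2], Sigma[:2, 2])

# Fill in E(x2 | x1, y) for the incomplete cases, then form residuals.
x2_fill = np.where(miss, mu[2] + (np.column_stack([x1, y]) - mu[:2]) @ w, x2)
X = np.column_stack([x1, x2_fill])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # no intercept: data near mean 0
resid = y - X @ beta_hat                          # estimated residuals for diagnostics
```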
Acknowledgments The author thanks Sanford Weisberg for his valuable discussion of the problem, as well as A. Lawrence Gould, Associate Editor, and the referee for their helpful suggestions concerning the presentation.
References

Beale, E.M. and R.J.A. Little (1975), Missing values in multivariate analysis, J. Roy. Statist. Soc. Ser. B 37, 129-146.
Dempster, A.P., N.M. Laird and D.B. Rubin (1977), Maximum likelihood from incomplete data via the EM algorithm (with discussion), J. Roy. Statist. Soc. Ser. B 39, 1-39.
Hampel, F.R. (1974), The influence curve and its role in robust estimation, J. Amer. Statist. Assoc. 69, 383-393.
Hinkley, D.V. (1977), Jackknifing in unbalanced situations, Technometrics 19, 285-292.
Hocking, R.R. and W.B. Smith (1972), Optimum incomplete multinormal samples, Technometrics 14, 299-307.
Huber, P.J. (1981), Robust Statistics (Wiley, New York).
Little, R.J.A. (1979), Maximum likelihood inference for multiple regression with missing values: a simulation study, J. Roy. Statist. Soc. Ser. B 41, 76-88.
Press, S.J. and A. Scott (1974), Missing variables in Bayesian regression, in: S. Fienberg and A. Zellner, eds., Studies in Bayesian Econometrics and Statistics (North-Holland, Amsterdam) pp. 259-272.
Rubin, D.B. (1976), Inference and missing data, Biometrika 63, 581-592.
Shih, W.J. and S. Weisberg (1986), Assessing influence in multiple linear regression with incomplete data, Technometrics 28 (3), 231-239.
Zacks, S. and J. Rodrigues (1986), A note on the missing value principle and the EM algorithm for estimation and prediction in sampling from finite populations with a multinormal superpopulation model, Statist. Probab. Lett. 4, 35-37.