Computational Statistics & Data Analysis 14 (1992) 1-27
North-Holland

A new measure of overall potential influence in linear regression

Ali S. Hadi
Department of Economic and Social Statistics, Cornell University, Ithaca, NY, USA

Received September 1990
Revised February 1991

Abstract: In this article we introduce a new influence measure and a graphical display for the detection of influential observations in linear regression. We demonstrate the need for such a new influence measure by giving examples where existing influence measures, numerous as they may be, may fail to detect influential observations. The existing influence measures are procedures each of which is designed to detect the influence of observations on a specific regression result. The proposed diagnostic is a measure of overall potential influence. Fitting a regression model in practice can have many purposes, and the user may prefer to begin with an overall measure. This diagnostic possesses several desirable properties that many of the frequently used diagnostics do not generally possess: (a) invariance to location and scale in the response variable, (b) invariance to nonsingular transformations of the explanatory variables, (c) it is an additive function of measures of leverage and of residual error, and (d) it is monotonically increasing in the leverage values and in the squared residuals. The new influence measure is also supplemented by a graphical display which shows the source of influence. This graph is suitable for both single- and multiple-case influence. Analytical and empirical comparisons between the proposed measure and the existing ones indicate that our measure is superior to existing influence measures.

Keywords: Influence measure; Influential observations; Leverage values; L-R plot; P-R plot; Outliers; Regression diagnostics.
1. Introduction

The area of regression diagnostics has seen a great surge of research activities during the last 15 years or so. This is evident from the numerous articles that appeared in the literature starting, for example, with Cook (1977), Welsch and Kuh (1977), Andrews and Pregibon (1978), and Hoaglin and Welsch (1978). Many books deal with the problem either exclusively or in large part, for example, Belsley, Kuh, and Welsch (1980), Cook and Weisberg (1982), Atkinson (1985), and Chatterjee and Hadi (1988).

(Correspondence to: Dr. A.S. Hadi, Department of Economic and Social Statistics, Cornell University, 358 Ives Hall, Ithaca, NY, USA.)
(0167-9473/92/$05.00 © 1992 Elsevier Science Publishers B.V. All rights reserved.)

Consider the standard regression model Y = Xβ + ε, where Y is an n-vector of responses, X is an n × k matrix representing k explanatory variables with rank
k < n, β is a k-vector of unknown parameters, and ε is an n-vector of random disturbances (errors) whose conditional mean and variance are given by E(ε | X) = 0 and Var(ε | X) = σ²I_n, where σ² is an unknown parameter and I_n is the identity matrix of order n. The least-squares estimator of β is β̂ = (X'X)⁻¹X'Y, and the vector of fitted values is Ŷ = Xβ̂ = PY, where

P = X(X'X)⁻¹X'.   (1.1)

The residual vector is denoted by

e = Y - Ŷ = (I_n - P)Y,   (1.2)

and the least-squares estimate of σ² is the residual mean square,

σ̂² = e'e / (n - k).   (1.3)
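The quantities in (1.1)-(1.3) can be computed directly; the following is a minimal numerical sketch (the data here are simulated, not from the paper):

```python
import numpy as np

# Sketch of (1.1)-(1.3) on simulated data (illustrative only).
rng = np.random.default_rng(0)
n, k = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # least-squares estimator of beta
P = X @ np.linalg.solve(X.T @ X, X.T)          # projection (hat) matrix, (1.1)
e = Y - P @ Y                                  # residual vector, (1.2)
sigma2_hat = e @ e / (n - k)                   # residual mean square, (1.3)
```

Note that P is symmetric and idempotent, and e is orthogonal to the columns of X; these facts underlie many of the identities used below.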
Of interest is the assessment of the influence of observations, individually or in groups, on the above and other regression calculations. A "bewilderingly large number of statistical quantities have been proposed to study outliers and influence of individual observations in regression analysis" (Chatterjee and Hadi, 1986). Yet, here I propose a new measure of influence, raising the rather natural question, "With all existing measures of influence, do we still need a new one?" I am convinced that the answer to the question is in the affirmative for the following reasons:
1. Existing influence measures may highlight observations which influence a particular regression result but may fail to point out observations influential on other least-squares results. In many practical applications, fitting a regression model can have a number of different purposes, so interest may center on not one but several regression results.
2. There is no consensus among statisticians as to which measure(s) should be used, so a thorough analysis requires examining several of the existing diagnostic measures.
3. Many of the existing measures of influence do not possess the following desirable properties:
(a) invariance to location and scale in the response variable,
(b) invariance to nonsingular transformations of the explanatory variables,
(c) monotonicity with respect to the leverage values and the squared residuals,
(d) ability to identify all observations that are influential on several regression results.
The invariance to location and scale properties are desirable because a diagnostic measure should give the same rankings of observations when, for example, income is measured in dollars instead of thousands of dollars, or when the regression model is reparameterized, that is, we write Y = Xβ + ε as Y = Zγ + ε, where Z = XA⁻¹ and γ = Aβ, for any nonsingular matrix A.
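This reparameterization invariance is easy to verify numerically; in the sketch below, A is an arbitrary nonsingular matrix and the data are simulated:

```python
import numpy as np

# Sketch: the projection matrix (hence leverages and fitted values) is unchanged
# under the reparameterization Z = X A^{-1}, gamma = A beta, for nonsingular A.
rng = np.random.default_rng(1)
n, k = 15, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
A = rng.normal(size=(k, k)) + 5.0 * np.eye(k)   # nonsingular (diagonally dominant)
Z = X @ np.linalg.inv(A)

P_X = X @ np.linalg.solve(X.T @ X, X.T)
P_Z = Z @ np.linalg.solve(Z.T @ Z, Z.T)
same = np.allclose(P_X, P_Z)   # identical leverages p_ii in both parameterizations
```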
Table 1
Artificial data

Observation    X       Y
1              0     8.11
2              5    11.00
3             15     8.20
4             16     8.30
5             17     9.40
6             18     9.30
7             19     9.60
8             20    10.30
9             21    11.30
10            22    11.40
11            23    12.20
12            24    12.90
To illustrate some of the above points, consider several of the diagnostic measures most widely used to detect outliers and influential observations in linear regression. Cook's distance (C_i, defined in (2.6a)) measures the influence of an observation on the estimated regression coefficients; Welsch's distance (W_i², defined in (2.7a)), the Welsch-Kuh distance (WK_i², defined in (2.7c)), and the other measures defined in Section 2 are each designed to assess a specific aspect of influence. The diagnostic proposed here is

H_i² = [k / (1 - p_ii)] [d_i² / (1 - d_i²)] + p_ii / (1 - p_ii),   i = 1, ..., n,   (1.4)

where d_i² = e_i²/(e'e) is the square of the ith normalized residual and p_ii is the ith diagonal element of P defined in (1.1). The rationale and derivation of this measure are given in Section 3.

Now, consider fitting a simple regression model to the artificial data given in Table 1. For simple regression, the scatter plot of Y versus X (Figure 1) gives a complete picture about influence. It is clear from this plot that observations 1 and 2 do not conform to the remaining data points, yet the least-squares line passes very close to observation 1. Index plots for three of the first eight of the above measures are given in Figure 2. Those for the other five are similar and not given to save space. Each of these eight measures identifies only one of the two influential observations; some of the measures identify only observation 1 and the others only observation 2. Indeed, the residual as well as six of the above measures are identically zero for observation 1.
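Computing (1.4) for the Table 1 data confirms this behavior; a sketch:

```python
import numpy as np

# H_i^2 of (1.4) for the artificial data in Table 1.
x = np.array([0, 5, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24], dtype=float)
Y = np.array([8.11, 11.00, 8.20, 8.30, 9.40, 9.30, 9.60,
              10.30, 11.30, 11.40, 12.20, 12.90])
n, k = len(x), 2
X = np.column_stack([np.ones(n), x])

P = X @ np.linalg.solve(X.T @ X, X.T)
p = np.diag(P)                      # leverage values p_ii
e = Y - P @ Y                       # observation 1 has a near-zero residual
d2 = e**2 / (e @ e)                 # squared normalized residuals d_i^2
H2 = k / (1 - p) * d2 / (1 - d2) + p / (1 - p)

top_two = set(np.argsort(H2)[-2:])  # indices 0 and 1, i.e., observations 1 and 2
```

Observation 1 is flagged through the potential (leverage) term even though its residual is essentially zero, and observation 2 through the residual term.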
Fig. 1. A scatter plot of Y versus X for the artificial data in Table 1.
The fourth plot in Figure 2 is the index plot for the proposed measure. Whereas none of the above eight measures singly succeeds in detecting the two observations, our measure correctly indicates that two observations (1 and 2) are influential. The effectiveness of the proposed measure is clear in this artificial data set. It was also applied to many real-life data sets and was effective in each. The results for two of these data sets are given in Section 4. In this paper, we also suggest a graphical display to supplement the proposed influence measure, which highlights the source of influence and which is suitable for both single- and multiple-case influence. Section 2 defines the best known measures of influence. To save space, they are introduced without discussion of their derivation or respective merits. Readers interested in further details are referred to the above citations. Section 3 introduces the new measure and the associated graphical display. Section 4 compares all measures analytically and using two real-life data sets.
Fig. 2. Artificial data: Index plots of influence measures.
2. Measures of influence

One way to assess the influence of a group of observations on a given regression result is to compare the result computed using all n observations with the result computed after omitting the group of observations. This is known as the omission or deletion approach to influence. Let I = {i_1, ..., i_m}, m < (n - k), be the set containing the indices of the m observations whose influence we wish to monitor. Without loss of generality, assume that these m observations are the last m rows of X and Y, so that X, Y, and e of (1.2) can be partitioned as

X = (X_(I)' : X_I')',   Y = (Y_(I)' : Y_I')',   e = (e_(I)' : e_I')',

where X_(I) is (n - m) × k, X_I is m × k, and similarly for the other quantities. The corresponding m × m submatrix of P of (1.1) is denoted by P_I and is expressed as

P_I = X_I(X'X)⁻¹X_I'.   (2.1)

When m = 1 and I = {i}, (2.1) reduces to

p_ii = x_i'(X'X)⁻¹x_i,   (2.1a)

where x_i' is the ith row of X. Because ŷ_i can be expressed as

ŷ_i = p_ii y_i + Σ_{j ≠ i} p_ij y_j,   (2.1b)

p_ii is interpreted as the amount of leverage each value y_i has in determining ŷ_i. For this reason, the p_ii are called leverage values, and observations with large values of p_ii are called high-leverage points (Hoaglin and Welsch, 1978). Because high-leverage points are outliers in the X-space, some authors define leverage in terms of outlyingness in the X-space. However, leverage and outlyingness in the X-space are two different concepts and should not be confused. High-leverage points are outliers in the X-space, but the converse is not necessarily true. Some observations may be outlying in the X-space but have small values of p_ii (perhaps because of the presence of other neighboring observations). Thus, if x_i is an outlier in the X-space but p_ii is small then, as can be seen from (2.1b), y_i has a small leverage in determining ŷ_i.

The statistic e_I'(I_m - P_I)⁻¹e_I, which is equal to the reduction in the residual sum of squares due to the deletion of the m observations indexed by I, can be used as a basis for a test statistic for determining whether the subset of observations is an outlying subset. For details see, for example, Gentleman and Wilk (1975), John and Draper (1978), and Beckman and Cook (1983). A scaled version of this statistic is
+-- 4Kl -w’e, A.2 (r
.
(2.2)
For m = 1 and I = {i), (2.2) reduces to e;?
r,z = S(l
(2.24
-p,,)
’
which is the square of the ith internally studentized residual. f we replace G2 in (2.2) by c?&, where &(f, is the residual mean square of (1.3) but computed after omitting the observations indexed by I, we obtain
r_I*² = e_I'(I_m - P_I)⁻¹e_I / σ̂²_(I).   (2.3)

For m = 1 and I = {i}, (2.3) reduces to

r_i*² = e_i² / [σ̂²_(i)(1 - p_ii)],   (2.3a)

which is the square of the ith externally studentized residual. It should be noted here that r_i*² is a monotonically increasing function of r_i². Observations with large values of r_i² or r_i*² are called outliers.

Outliers and high-leverage points may influence the regression results and the conclusions based on them. Because high-leverage points tend to have small residuals, examination of the residuals alone may not detect influential observations.

In this section, we list nine of the most well known measures of influence. They are grouped according to the aspect of influence they are designed to assess. All measures are defined so that large absolute values correspond to potentially influential observations.

2.1. Measures based on volumes of confidence ellipsoids

1. Andrews-Pregibon statistic: Let Z = (X : Y). The influence of a subset of observations indexed by I can be measured by the relative change in det(Z'Z) when the subset is omitted. Andrews and Pregibon (1978) suggest the ratio

det(Z_(I)'Z_(I)) / det(Z'Z),

but here we define the Andrews-Pregibon statistic as

AP_I = 1 - det(Z_(I)'Z_(I)) / det(Z'Z) = 1 - [1 - e_I'(I_m - P_I)⁻¹e_I / e'e] det(I_m - P_I),   (2.4)

so that a large value of (2.4) indicates that the subset is influential. For m = 1 and I = {i}, (2.4) becomes

AP_i = p_ii + d_i².   (2.4a)
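Both the residual-sum-of-squares interpretation of e_I'(I_m - P_I)⁻¹e_I and the single-case form (2.4a) can be checked numerically; a sketch on simulated data:

```python
import numpy as np

# Sketch: check that e_i^2/(1 - p_ii) is the drop in the residual sum of squares
# when observation i is deleted, and that AP_i = p_ii + d_i^2 as in (2.4a).
rng = np.random.default_rng(2)
n, k = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
Y = X @ rng.normal(size=k) + rng.normal(size=n)

def rss(Xm, Ym):
    b = np.linalg.lstsq(Xm, Ym, rcond=None)[0]
    r = Ym - Xm @ b
    return r @ r

P = X @ np.linalg.solve(X.T @ X, X.T)
p = np.diag(P)
e = Y - P @ Y

i = 7                                           # any observation index
drop = rss(X, Y) - rss(np.delete(X, i, 0), np.delete(Y, i, 0))
ok_drop = np.isclose(drop, e[i]**2 / (1 - p[i]))

Z = np.column_stack([X, Y])
PZ = Z @ np.linalg.solve(Z.T @ Z, Z.T)          # AP_i is the ith leverage of Z = (X : Y)
ok_ap = np.isclose(PZ[i, i], p[i] + e[i]**2 / (e @ e))
```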
2. Variance ratio: The influence of a subset of observations indexed by I can be measured by comparing the estimated variance of β̂ to the estimated variance of β̂_(I), where β̂_(I) is the least-squares estimate of β obtained when the subset indexed by I is omitted. The variance ratio is defined by

VR_I = det[ σ̂²_(I)(X_(I)'X_(I))⁻¹ ] / det[ σ̂²(X'X)⁻¹ ].

It can be shown that

VR_I = [ (n - k - r_I²) / (n - k - m) ]^k [ det(X'X) / det(X_(I)'X_(I)) ] = [ (n - k - r_I²) / (n - k - m) ]^k [ 1 / det(I_m - P_I) ].   (2.5)

This measure is proposed by Belsley, Kuh, and Welsch (1980). For m = 1 and I = {i}, (2.5) becomes

VR_i = [ (n - k - r_i²) / (n - k - 1) ]^k [ 1 / (1 - p_ii) ].   (2.5a)
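The closed form (2.5a) can be checked against the definition of the variance ratio; a sketch on simulated data:

```python
import numpy as np

# Sketch: (2.5a) versus the defining ratio of determinants.
rng = np.random.default_rng(3)
n, k = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
Y = X @ rng.normal(size=k) + rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)
p = np.diag(P)
e = Y - P @ Y
s2 = e @ e / (n - k)
r2 = e**2 / (s2 * (1 - p))                    # squared internally studentized residuals

i = 4
Xi, Yi = np.delete(X, i, 0), np.delete(Y, i, 0)
bi = np.linalg.lstsq(Xi, Yi, rcond=None)[0]
ri = Yi - Xi @ bi
s2_i = ri @ ri / (n - 1 - k)

VR_def = np.linalg.det(s2_i * np.linalg.inv(Xi.T @ Xi)) \
       / np.linalg.det(s2 * np.linalg.inv(X.T @ X))
VR_closed = ((n - k - r2[i]) / (n - k - 1))**k / (1 - p[i])
agree = np.isclose(VR_def, VR_closed)
```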
3. Cook-Weisberg statistic: Let E^α be the 100(1 - α)% joint confidence ellipsoid for β based on all n observations. Similarly, let E^α_(I) be the joint confidence ellipsoid obtained with the m observations indexed by I omitted. The influence of the subset of observations indexed by I can be measured by the relative change in the volume of E^α to the volume of E^α_(I). Cook and Weisberg (1982) suggest measuring the relative change in the volumes by the log of the volumes' ratio, namely,

CW_I = log[ Volume(E^α) / Volume(E^α_(I)) ] = -(1/2) log(VR_I) + (k/2) log[ F_{α; k, n-k} / F_{α; k, n-m-k} ],   (2.5b)

where F_{α; k, ·} is the appropriate percentile from the F-distribution. For m = 1 and I = {i}, (2.5b) reduces to

CW_i = -(1/2) log(VR_i) + (k/2) log[ F_{α; k, n-k} / F_{α; k, n-k-1} ],   (2.5c)

where VR_i is defined in (2.5a).

2.2. Measures based on the influence curve

1. Cook's distance: The influence of a subset of observations indexed by I on the estimated regression coefficients can be measured by

C_I = e_I'(I_m - P_I)⁻¹P_I(I_m - P_I)⁻¹e_I / (kσ̂²).   (2.6)

This measure, suggested by Cook (1977), can be thought of as the relative distance between the center of the confidence ellipsoids for β based on the full data and the center of the confidence ellipsoids for β based on the reduced data, i.e., after omitting the m observations indexed by I. For m = 1 and I = {i}, (2.6) becomes

C_i = (r_i² / k) [p_ii / (1 - p_ii)].   (2.6a)
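The single-case form (2.6a) can be checked against Cook's defining distance between full-data and deleted estimates; a sketch on simulated data:

```python
import numpy as np

# Sketch: (2.6a) versus (beta_hat - beta_hat_(i))' X'X (beta_hat - beta_hat_(i)) / (k sigma2_hat).
rng = np.random.default_rng(4)
n, k = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
Y = X @ rng.normal(size=k) + rng.normal(size=n)

XtX = X.T @ X
beta = np.linalg.solve(XtX, X.T @ Y)
p = np.diag(X @ np.linalg.solve(XtX, X.T))
e = Y - X @ beta
s2 = e @ e / (n - k)
r2 = e**2 / (s2 * (1 - p))

i = 11
beta_i = np.linalg.lstsq(np.delete(X, i, 0), np.delete(Y, i, 0), rcond=None)[0]
diff = beta - beta_i
C_def = diff @ XtX @ diff / (k * s2)
C_closed = (r2[i] / k) * p[i] / (1 - p[i])
agree = np.isclose(C_def, C_closed)
```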
This and the following three measures can be derived using the influence curve.

2. Welsch's distance: A measure of the influence of the subset of observations indexed by I, which is based on the influence curve, is suggested by Welsch (1982). This measure is defined by

W_I² = (n - m) e_I'(I_m - P_I)⁻¹P_I(I_m - P_I)⁻²e_I / σ̂²_(I).   (2.7)

For m = 1 and I = {i}, (2.7) reduces to

W_i² = (n - 1) p_ii e_i² / [σ̂²_(i)(1 - p_ii)³].   (2.7a)
3. Welsch-Kuh distance: The influence of a subset of observations indexed by I on the predicted values Ŷ_I is measured by

WK_I² = (β̂ - β̂_(I))'X_I'P_I⁻¹X_I(β̂ - β̂_(I)) / σ̂²_(I) = e_I'(I_m - P_I)⁻¹P_I(I_m - P_I)⁻¹e_I / σ̂²_(I).   (2.7b)

This measure is suggested by Welsch and Kuh (1977). For m = 1 and I = {i}, (2.7b) reduces to

WK_i² = r_i*² [p_ii / (1 - p_ii)] = W_i² [(1 - p_ii) / (n - 1)],   (2.7c)

where W_i² is defined in (2.7a).
4. Modified Cook's distance: A modified version of (2.6) is given by

C_I* = [ ((n - k)/k) e_I'(I_m - P_I)⁻¹P_I(I_m - P_I)⁻¹e_I / σ̂²_(I) ]^{1/2}.   (2.7d)

This measure was originally suggested by Welsch and Kuh (1977) and subsequently by Welsch and Peters (1978), Belsley, Kuh, and Welsch (1980), and Atkinson (1981). For m = 1 and I = {i}, (2.7d) reduces to

C_i* = WK_i [(n - k)/k]^{1/2},   (2.7e)

where WK_i is defined in (2.7c).
2.3. Measures based on the likelihood function

1. The likelihood displacement for (β, σ²): Let σ̃² = ((n - k)/n)σ̂² and β̃ = β̂ be the maximum likelihood estimates (MLE) of σ² and β, respectively, computed using all n observations. Similarly, let σ̃²_(I) and β̃_(I) be the MLEs of σ² and β computed with the subset indexed by I omitted. Finally, let l(β̃; σ̃²) be the log likelihood function evaluated at β̃ and σ̃², and let l(β̃_(I); σ̃²_(I)) be the log likelihood function evaluated at β̃_(I) and σ̃²_(I). Cook and Weisberg (1982) measure the influence of the subset of observations indexed by I on the likelihood function for β and σ² by the likelihood displacement

LD_I(β; σ²) = 2{l(β̃; σ̃²) - l(β̃_(I); σ̃²_(I))} = n ln[ n(n - k - r_I²) / ((n - m)(n - k)) ] + (n - m)(n - k + kC_I) / (n - k - r_I²) - n.   (2.8)

For m = 1 and I = {i}, (2.8) reduces to

LD_i(β; σ²) = n ln[ n(n - k - r_i²) / ((n - 1)(n - k)) ] + (n - 1)(n - k + kC_i) / (n - k - r_i²) - n.   (2.8a)
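The closed form (2.8a), as reconstructed here, can be checked against a direct evaluation of the likelihood displacement; a sketch on simulated data:

```python
import numpy as np

# Sketch: direct evaluation of LD_i(beta; sigma^2) versus the closed form (2.8a).
rng = np.random.default_rng(5)
n, k = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
Y = X @ rng.normal(size=k) + rng.normal(size=n)

def loglik(beta, s2):
    r = Y - X @ beta                        # always evaluated on the full data
    return -0.5 * n * np.log(2 * np.pi * s2) - r @ r / (2 * s2)

beta = np.linalg.lstsq(X, Y, rcond=None)[0]
e = Y - X @ beta
s2_mle = e @ e / n                          # MLE of sigma^2, all n observations

i = 3
Xi, Yi = np.delete(X, i, 0), np.delete(Y, i, 0)
beta_i = np.linalg.lstsq(Xi, Yi, rcond=None)[0]
ri = Yi - Xi @ beta_i
s2_mle_i = ri @ ri / (n - 1)                # MLE of sigma^2 without observation i

LD_direct = 2 * (loglik(beta, s2_mle) - loglik(beta_i, s2_mle_i))

p_i = X[i] @ np.linalg.solve(X.T @ X, X[i])
s2 = e @ e / (n - k)
r2 = e[i]**2 / (s2 * (1 - p_i))
C = (r2 / k) * p_i / (1 - p_i)
LD_closed = (n * np.log(n * (n - k - r2) / ((n - 1) * (n - k)))
             + (n - 1) * (n - k + k * C) / (n - k - r2) - n)
agree = np.isclose(LD_direct, LD_closed)
```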
2. The likelihood displacement for (β | σ²): The LD_I(β; σ²) of (2.8) is appropriate if we are interested in the estimation of both β and σ². If interest is centered only on β, the influence of a subset of observations indexed by I on the likelihood function for β is measured by

LD_I(β | σ²) = n ln[ 1 + kC_I / (n - k) ].   (2.8b)

For m = 1 and I = {i}, (2.8b) becomes

LD_i(β | σ²) = n ln[ 1 + kC_i / (n - k) ],   (2.8c)

where C_i is defined in (2.6a).

The major omission from this list is the class of predictive influence measures based on the Bayesian predictive influence function. The reasons for this omission are: (i) the form of the predictive influence measure depends on the assumptions one is willing to make regarding the joint prior distribution of β and σ², and (ii) Cook and Weisberg (1982), p. 172, state that the two predictive measures suggested by Johnson and Geisser (1979, 1980) behave essentially the same as C_i of (2.6a).
3. The proposed measure of influence

In the previous section we defined nine of the most commonly known measures of influence. In this section we introduce a new measure of influence, and in the next section we compare all these measures.

3.1. Measuring influence

The new influence measure is based on the simple fact that potentially influential observations are outliers in either the X-space, the Y-space, or both. To measure outlyingness in the X-space we use the statistic

tr[ X_I(X_(I)'X_(I))⁻¹X_I' ],   (3.1)

which measures how far the subset is from the remaining n - m observations in the X-space. To simplify (3.1), we express (X_(I)'X_(I))⁻¹ as

(X_(I)'X_(I))⁻¹ = (X'X - X_I'X_I)⁻¹ = (X'X)⁻¹ + (X'X)⁻¹X_I'(I_m - X_I(X'X)⁻¹X_I')⁻¹X_I(X'X)⁻¹,

from which it follows that

X_I(X_(I)'X_(I))⁻¹X_I' = P_I(I_m - P_I)⁻¹.

Thus

tr[ X_I(X_(I)'X_(I))⁻¹X_I' ] = tr[ P_I(I_m - P_I)⁻¹ ].   (3.2)
Large values of (3.2) indicate that the m observations indexed by I are far in the X-space from the remaining n - m observations. On the other hand, when one attempts to predict the values for the m observations indexed by I using the remaining (n - m) observations, the best predictor is X_Iβ̂_(I), the prediction error is

e_I(I) = Y_I - X_Iβ̂_(I),   (3.3)

and the variance of this error is proportional to (I_m + P_I(I)), where

P_I(I) = X_I(X_(I)'X_(I))⁻¹X_I'.   (3.3a)
Thus a scaled version of the prediction error is

(I_m + P_I(I))^{-1/2} e_I(I).   (3.4)

A large absolute value of (3.4) indicates that the m observations indexed by I do not conform to the pattern suggested by the remaining (n - m) observations. A statistic which would be large if the m observations indexed by I have large prediction errors, relative to the remaining (n - m) observations, is then given by

e_I(I)'(I_m + P_I(I))⁻¹e_I(I) / (e'e - e_I'e_I),   (3.5)

where e_I(I) and P_I(I) are defined in (3.3) and (3.3a), respectively. Since e_I(I) = (I_m - P_I)⁻¹e_I and (I_m + P_I(I)) = (I_m - P_I)⁻¹, (3.5) becomes

e_I'(I_m - P_I)⁻¹e_I / (e'e - e_I'e_I).   (3.5a)
The quantities in (3.2) and (3.5a) can then be combined to provide a measure of overall potential influence. We propose

H_I² = (k/m) e_I'(I_m - P_I)⁻¹e_I / (e'e - e_I'e_I) + (1/m) tr[ P_I(I_m - P_I)⁻¹ ],   (3.6)

where k, we recall, is the number of columns in X. This is simply a weighted sum of (3.2) and (3.5a). For m = 1 and I = {i}, (3.6) reduces to

H_i² = [k / (1 - p_ii)] e_i² / (e'e - e_i²) + p_ii / (1 - p_ii) = [k / (1 - p_ii)] [d_i² / (1 - d_i²)] + p_ii / (1 - p_ii),   (3.7)

where d_i² = e_i²/e'e is the square of the ith normalized residual.

The rationale behind the weights used in (3.6) can be seen from examining the special case given in (3.7). Since Σp_ii = k while Σd_i² = 1, multiplying (3.5a) by k prevents (3.6) from being dominated by its second term. Indeed, if we replace p_ii and d_i² in (3.7) by their respective average values k/n and 1/n, the two terms in (3.7) have approximately the same magnitude.

The first term on the right-hand side of (3.6) is invariant under scale transformations of Y. The matrix (I_m - P_I) is invariant under nonsingular transformations of X. Therefore, H_I² has the nice property that it is invariant under nonsingular transformations of X and Y. Draper and John (1981) recommend the examination of e_I'(I_m - P_I)⁻¹e_I, which is the numerator in (3.5a), to determine whether the corresponding subset is deviant. This quantity, however, is not invariant under scale transformations of Y. Another property of H_I² is that it is monotonically increasing in both the leverage values and the squared residuals. Measures such as VR_I, CW_I, and the Johnson and Geisser predictive measure do not possess this property.
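The multiple-case measure (3.6) is straightforward to compute; a sketch on simulated data, checking that it reduces to (3.7) when m = 1:

```python
import numpy as np

# Sketch of the multiple-case measure (3.6).
rng = np.random.default_rng(6)
n, k = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
Y = X @ rng.normal(size=k) + rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)
e = Y - P @ Y

def H2(I):
    I = list(I)
    m = len(I)
    PI = P[np.ix_(I, I)]
    eI = e[I]
    M = np.linalg.inv(np.eye(m) - PI)                  # (I_m - P_I)^{-1}
    res = (k / m) * (eI @ M @ eI) / (e @ e - eI @ eI)  # residual component, (3.5a)
    pot = np.trace(PI @ M) / m                         # potential component, (3.2)
    return res + pot

i = 5
p_i, d2_i = P[i, i], e[i]**2 / (e @ e)
single = k / (1 - p_i) * d2_i / (1 - d2_i) + p_i / (1 - p_i)
agree = np.isclose(H2([i]), single)
joint = H2([2, 5])   # joint potential influence of two observations
```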
The diagnostic measure (3.7) is the sum of two components, each of which has a nice interpretation. The second component of (3.7), p_ii/(1 - p_ii), is the ratio of Var(ŷ_i) to Var(e_i). This quantity is also known as the "potential." The first component is a function of the ith normalized residual weighted by the leverage value. Let a_i = d_i/(1 - p_ii)^{1/2} be the ith (normalized) adjusted residual (Marasinghe, 1985; and Kianifard and Swallow, 1989). Then the first component of (3.7) can also be expressed as

[k / (1 - p_ii)] [d_i² / (1 - d_i²)] = k a_i² / [1 - (1 - p_ii)a_i²],

and can be interpreted as a function of the ith adjusted residual weighted by the ith leverage value. The square of the ith unnormalized adjusted residual, e_i²/(1 - p_ii), is the reduction in the residual sum of squares due to the omission of the ith observation.

As we mentioned above, a subset of observations may be influential because (a) it has a poor fit (a large prediction error) and/or (b) it is outlying in the X-space. The first source of influence would be indicated by a large value of (3.5a), which is proportional to the first component of H_I². The second source would be indicated by a large value of the second component of H_I². While a large value of H_I² indicates that the corresponding subset is influential, examining H_I² alone would not indicate the source of influence.
3.2. The potential-residual plot

To determine the source of influence, we suggest a scatter plot of one component of H_I² versus the other, that is, the scatter plot of

(1/m) tr[ P_I(I_m - P_I)⁻¹ ]   versus   (k/m) e_I'(I_m - P_I)⁻¹e_I / (e'e - e_I'e_I),

which reduces to the scatter plot of

p_ii / (1 - p_ii)   versus   [k / (1 - p_ii)] [d_i² / (1 - d_i²)]

for the case m = 1 and I = {i}. Since the first component is referred to as potential and the second component is a function of the residual, we refer to this plot as the Potential-Residual (P-R) plot. For the special case m = 1, the P-R plot is related (but not equivalent) to the Leverage-Residual (L-R) plot suggested by Gray (1983, 1986); see also McCulloch and Meeter (1983). The L-R plot is a scatter plot of p_ii versus d_i². In both of these two plots, high-leverage points are located in the upper area of the plot and observations with large prediction error are located in the area to the right. The P-R plot gives more emphasis to high-leverage observations.
An advantage of the P-R plot over the L-R plot is that the latter is suitable only for single-case influence, whereas the former is suitable for both single- and multiple-case influence.

3.3. Cut-off values for the diagnostic measure

As we mentioned above, a large value of H_I² indicates that the corresponding subset is influential. Now we address the question of "how large is large?" This is, in general, a difficult question. Under the usual general linear model assumptions, the second component of H_I² is nonrandom and the first is random, but its distribution is difficult to derive. However, it may be possible to suggest cut-off points for the special case, H_i², of (3.7), which is a function of d_i² and p_ii. Several cut-off points for p_ii are suggested in the literature. The first is the twice-the-mean rule suggested by Hoaglin and Welsch (1978). Since mean(p_ii) = k/n, p_ii is considered to be large if it satisfies

p_ii ≥ 2k/n.   (3.8)

Velleman and Welsch (1981) update this rule to

p_ii ≥ 3k/n.   (3.8a)

The second of these rules will fail to nominate any observation whenever n ≤ 3k, and both will fail whenever n ≤ 2k, since the right-hand side of (3.8) is then greater than or equal to 1 and p_ii < 1 for all i. Huber (1981, p. 162) suggests that values of p_ii < 0.2 appear to be safe whereas values between 0.2 and 0.5 are risky. A fourth cut-off point for p_ii, proposed in Hadi (1990), is

mean(p_ii) + c [Var(p_ii)]^{1/2},   (3.9)

where c is an appropriately chosen constant such as 2 or 3. This form is analogous to a confidence bound for a location parameter. The variance of p_ii is

Var(p_ii) = (ns - k²) / [n²(n - 1)],

where s = Σp_ii². Thus p_ii is considered to be large if

p_ii ≥ k/n + c [(ns - k²) / (n²(n - 1))]^{1/2}.

The rules (3.8) and (3.9) can also be applied to d_i² and H_i². For example, one may consider H_i² to be large if it exceeds mean(H_i²) + c [Var(H_i²)]^{1/2}. The problem with these cut-off points, however, is that both the mean and variance
are nonrobust. Extreme values inflate the mean and variance, yielding a high cut-off point. This problem can be avoided by replacing the mean and variance by more robust estimators such as the median and the median absolute deviation, respectively.

Cut-off points should be used with caution. Diagnostic methods are not designed to be (and should not be used as) formal tests of hypotheses. They are designed to detect observations which influence regression results more than other observations in a data set. Thus the values of a given diagnostic measure should be compared to each other. This can best be done using graphical displays such as a stem-and-leaf display, index plot, or P-R plot. What matters here is the pattern of points in a plot and the relative value of the measure.
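The classical and robust cut-offs can be contrasted numerically; a sketch on simulated data with one planted extreme point (the constant c = 2 is an illustrative choice):

```python
import numpy as np

# Sketch: classical (mean/SD) versus robust (median/MAD) cut-offs for p_ii.
rng = np.random.default_rng(7)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
X[0, 1] = 10.0                                   # one extreme point in the X-space

p = np.diag(X @ np.linalg.solve(X.T @ X, X.T))   # leverage values; their mean is k/n
c = 2.0

cut_classical = p.mean() + c * p.std()           # inflated by the extreme point
mad = np.median(np.abs(p - np.median(p)))
cut_robust = np.median(p) + c * mad              # resistant to the extreme point
flagged = np.where(p > cut_robust)[0]            # includes the extreme observation
```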
4. Comparisons of influence measures

As Cook and Weisberg (1982), p. 173, indicate, comparison among influence measures is much more complicated for the general case of multiple influential observations than for the special case of a single influential observation. We therefore confine ourselves here to comparing influence measures for the special case where m = 1 and I = {i}. We compare influence measures analytically and using two real-life data sets.
4.1. Analytical comparisons

As stated earlier, an observation may be influential because it has a large residual and/or because it is a point with high leverage. The proposed influence measure H_i² is similar to the other influence measures in the sense that it is a function of the residual, e_i, and the leverage value, p_ii. It differs from the other measures in the way it combines the residual and leverage value. The other measures are multiplicative functions of the residual and the leverage value, whereas H_i² is an additive function of the two. The disadvantage of multiplicative functions is that if one of the components is small (near zero), it annihilates the other; the multiplicative measure would be small if one component is too small. Additive measures, by contrast, are large if either or both components are large. For example, observations with large leverage values (near 1) tend to have small residuals (near 0), and, as a result, they will have small values for multiplicative measures (e.g., C_i, LD_i(β | σ²), W_i², WK_i², C_i*). Hence, they may leave us with the incorrect impression that these observations are not influential. But an additive function of the residual and leverage value will identify outliers in the X-space, in the Y-space, or in both. To illustrate this point further, consider the following three extreme cases: (i) p_ii = 0 and r_i² = 0; (ii) p_ii = 0 and r_i² = (n - k); (iii) p_ii = 1 and e_i² = 0.

(Note that whenever p_ii = 1, as in case (iii), e_i² = 0 but r_i² is undefined.) In case (i) the ith observation is clearly not influential; it is neither an outlier nor a point with high leverage. Case (ii) is highly undesirable because of poor fit (the observation lies in the center of the X-space but it is extreme in the Y-direction, where r_i² takes its maximum value of n - k). Case (iii) is a difficult situation in which ŷ_i is determined solely by y_i, and the ordinary residual, e_i, will always be zero no matter what the value of y_i is. In case (i) all measures, including H_i², are zero as they should be. In case (iii) they are all undefined. Note that in case (iii) the second term of H_i² is infinite but the first term is undefined. An observation is extremely influential if its residual is small only because it has a very large value of p_ii, as in case (iii). However, a disturbing fact about multiplicative measures is that in case (ii) they are exactly zero, whereas H_i² is large. To illustrate case (ii) further, consider fitting a simple linear regression model through the origin to a data set for which the first (n - 1) observations have mean zero for both X and Y and the last observation is x_n = 0 and y_n = c. For this data set, e_n increases as y_n deviates from zero. Consequently, H_n² increases as y_n deviates from zero. On the other hand, because the nth diagonal element of P, p_nn, is zero, we have C_n = LD_n(β | σ²) = W_n² = WK_n² = C_n* = 0, no matter how large y_n is! Note also that the observation (x_n, y_n) has no influence on the value of the slope of the fitted line; however, it has a substantial influence on its standard error (of course, it is not very meaningful to examine a point estimate without reference to its standard error). A good influence measure should take notice of observations which are similar to those in cases (ii) and (iii). These are exactly the observations that we would like to be able to detect.
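The case (ii) construction is easy to reproduce numerically; a sketch (the data are simulated, and the slope 0.5 and noise scale 0.1 are arbitrary illustrative choices):

```python
import numpy as np

# Sketch of case (ii): regression through the origin with x_n = 0, y_n = c.
# p_nn = 0, so Cook's distance is exactly zero no matter how large c is,
# while H_n^2 grows with c.
rng = np.random.default_rng(8)
n = 11
x = np.append(rng.normal(size=n - 1), 0.0)       # last observation at x_n = 0
noise = 0.1 * rng.normal(size=n - 1)

def last_point_measures(c):
    y = np.append(0.5 * x[:-1] + noise, c)
    X = x[:, None]                               # k = 1, no intercept
    P = X @ np.linalg.solve(X.T @ X, X.T)
    p = np.diag(P)
    e = y - P @ y
    s2 = e @ e / (n - 1)
    r2 = e**2 / (s2 * (1 - p))
    cook = r2 * p / (1 - p)                      # Cook's distance with k = 1
    d2 = e**2 / (e @ e)
    H2 = d2 / ((1 - p) * (1 - d2)) + p / (1 - p)
    return p[-1], cook[-1], H2[-1]

p_n, cook_n, H_small = last_point_measures(1.0)
_, _, H_big = last_point_measures(10.0)
```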
The measure H_i² meets this criterion whereas multiplicative measures do not.

4.2. Comparisons using real-life data sets

We now compare the various influence measures in the context of two real-life data sets commonly known as the New York Rivers and the Scottish Hill Races data.

Example 1. New York Rivers Data. In a study exploring the relationship between water quality and land use, Haith (1976) obtains the following measurements on 20 New York river basins:

X_1: percentage of land area used in agriculture
X_2: percentage of forest land area
X_3: percentage of land area in residential use
X_4: percentage of land area in commercial and/or industrial use
Y: a measure of water quality (mean nitrogen concentration (mg/liter) based on samples taken at regular intervals)
Table 2
New York rivers data

Name and number of river    X_1   X_2   X_3    X_4    Y
1. Olean                     26    63    1.2   0.29   1.10
2. Cassadaga                 29    57    0.7   0.09   1.01
3. Oatka                     54    26    1.8   0.58   1.90
4. Neversink                  2    84    1.9   1.98   1.00
5. Hackensack                 3    27   23.4   3.11   1.99
6. Wappinger                 19    61    3.4   0.56   1.42
7. Fishkill                  16    60    5.6   1.11   2.04
8. Honeoye                   40    43    1.3   0.24   1.65
9. Susquehanna               28    62    1.1   0.15   1.01
10. Chenango                 26    60    0.9   0.23   1.21
11. Tioughnioga              26    53    0.9   0.18   1.33
12. West Canada              15    75    0.7   0.16   0.75
13. East Canada               6    84    0.5   0.12   0.73
14. Saranac                   3    81    0.8   0.35   0.80
15. Ausable                   3    89    0.7   0.35   0.76
16. Black                     6    82    0.5   0.15   0.87
17. Schoharie                22    70    0.9   0.22   0.80
18. Raquette                  4    75    0.4   0.18   0.87
19. Oswegatchie              21    56    0.5   0.13   0.66
20. Cohocton                 40    49    1.1   0.13   1.25
The data are given in Table 2. Allen and Cady (1982) use this data set as an example to illustrate linear regression methodology. For comparison among the various influence measures we use Y as a response variable and only X_3 and X_4 as possible explanatory variables. The reason for this choice is that these variables comprise a subset of the data within which the various measures disagree sharply about which observations are influential. Now consider fitting the following models to the data:

(i) Y = β_0 + β_4X_4 + ε,   (4.1)
(ii) Y = α_0 + α_3X_3 + α_4X_4 + ε.   (4.2)

The scatter plot of Y versus X_4 and the least-squares line resulting from fitting model (4.1) are given in Figure 3. Observations 5 and 4 are located far from the bulk of the other data points in Figure 3. Also, observations 7, 3, 8, and 6 are somewhat sparse in the (Y, X_4) two-dimensional space. Several diagnostic measures resulting from fitting model (4.1) are shown in Table 3, and the corresponding index plots are shown in Figure 4. Some of the influence measures are equivalent to others, in the sense that they provide the same rankings of observations. The influence measures r_i*, CW_i, WK_i², C_i*, and LD_i(β | σ²) are not shown here since r_i* is equivalent to r_i (as noted below (2.3a)), CW_i is equivalent to VR_i (as can be seen from (2.5c)), and WK_i² and C_i* are
Fig. 3. New York rivers data: A scatter plot of Y versus X_4.
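The leverage and studentized-residual values quoted in the discussion of Table 3 can be reproduced from the Table 2 data (as transcribed above); a sketch:

```python
import numpy as np

# Model (4.1): Y regressed on X_4 for the New York rivers data of Table 2.
X4 = np.array([0.29, 0.09, 0.58, 1.98, 3.11, 0.56, 1.11, 0.24, 0.15, 0.23,
               0.18, 0.16, 0.12, 0.35, 0.35, 0.15, 0.22, 0.18, 0.13, 0.13])
Y = np.array([1.10, 1.01, 1.90, 1.00, 1.99, 1.42, 2.04, 1.65, 1.01, 1.21,
              1.33, 0.75, 0.73, 0.80, 0.76, 0.87, 0.80, 0.87, 0.66, 1.25])
n, k = 20, 2
X = np.column_stack([np.ones(n), X4])

P = X @ np.linalg.solve(X.T @ X, X.T)
p = np.diag(P)                          # p_55 = 0.67 and p_44 = 0.25 (rounded)
e = Y - P @ Y
s2 = e @ e / (n - k)
r = e / np.sqrt(s2 * (1 - p))           # r_3 = 1.95, r_4 = -1.85, r_5 = 0.16
d2 = e**2 / (e @ e)
H2 = k / (1 - p) * d2 / (1 - d2) + p / (1 - p)   # largest value at observation 5
```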
equivalent to W_i² (as seen from (2.7c) and (2.7e)); and LD_i(β | σ²) is equivalent to C_i (as seen from (2.8c)).

Examination of the index plots in Figure 4 indicates that the various influence measures do not agree. This is not unexpected because each measure is designed to detect influence on a specific regression result. The index plot for p_ii indicates that the largest leverage value (0.67) corresponds to observation 5 and the next largest (0.25) corresponds to observation 4. These two values of p_ii are separated from the bulk of the other leverage values. The reason for the small p_ii for observation 4 relative to that of observation 5 is
Table 3
New York rivers data: Various influence measures for model (4.1)

Row  p_ii   r_i    AP_i  VR_i  C_i²   W_i²  LD_i(β,σ²)  H_i²
 1   0.05   0.03   0.05  1.19  0.00   0.00     0.03     0.06
 2   0.07  -0.05   0.07  1.20  0.00   0.00     0.03     0.07
 3   0.05   1.95   0.25  0.73  0.10   4.85     0.64     0.58
 4   0.25  -1.85   0.39  0.98  0.56  33.10     1.73     0.77
 5   0.67   0.16   0.67  3.40  0.02   2.69     0.08     2.04
 6   0.05   0.67   0.07  1.12  0.01   0.46     0.03     0.10
 7   0.08   1.92   0.27  0.77  0.17   8.20     0.78     0.60
 8   0.06   1.57   0.19  0.89  0.07   3.26     0.27     0.37
 9   0.06  -0.10   0.06  1.19  0.00   0.01     0.03     0.07
10   0.06   0.38   0.07  1.17  0.00   0.17     0.03     0.08
11   0.06   0.75   0.09  1.12  0.02   0.71     0.04     0.13
12   0.06  -0.81   0.10  1.11  0.02   0.86     0.05     0.14
13   0.06  -0.83   0.10  1.11  0.02   0.95     0.05     0.15
14   0.05  -0.83   0.09  1.09  0.02   0.75     0.04     0.13
15   0.06  -0.94   0.10  1.07  0.02   0.97     0.05     0.16
16   0.06  -0.48   0.07  1.17  0.01   0.29     0.03     0.09
17   0.06  -0.72   0.09  1.12  0.02   0.63     0.04     0.12
18   0.06  -0.50   0.07  1.16  0.01   0.31     0.03     0.09
19   0.06  -1.03   0.12  1.06  0.04   1.47     0.08     0.19
20   0.06   0.57   0.08  1.15  0.01   0.44     0.03     0.11
A.S. Hadi / A new measure of potential influence

Fig. 4. New York rivers data: Index plots of influence measures for model (4.1).
that the leverage of the former is masked by the presence of the latter, as can be seen in the scatter plot in Figure 3. What is surprising is that the index plot of r_i shows that the internally studentized residuals are all less than 1.95 (observation 3) and that the values of r_i for observations 5 and 4 (which clearly do not belong with the rest of the points in Figure 3) are as small as 0.16 and -1.85, respectively. While a small value of the residual is highly desirable, the reason for the small value of r_i corresponding to observation 5 is not due to a good fit; it is due to the fact that observation 5 is a high-leverage point and, in collaboration with observation 4, they pull the regression line toward them. This situation is very similar to case (iii) of Section 4.1, where we have an observation with a high-leverage value and a small residual. This is an example where an influential observation would go undetected if the analysis were based only on the examination of residuals or on the examination of multiplicative influence measures. (Of course, if the observations
with high leverage and small residuals are "correct" observations, then we have a good fit at these observations.) The influence measures for model (4.1) are in great disarray (see Table 3 and Figure 4). The measures based on volumes of confidence ellipsoids (AP_i and VR_i) declare observation 5 to be the most influential observation, followed, in the case of AP_i, by four other observations (4, 7, 3, and 8). On the other hand, the index plots in Figure 4 show that C_i² and W_i² indicate that observation 4 is the only influential observation in the data. The index plot of LD_i(β, σ²) indicates that observation 4 is the most influential observation, followed by observations 7, 3, and 8. These three measures fail to detect observation 5, which is seen in Figure 3 to be the most extreme influential observation (C_i² = 0.02 as compared to the largest value of 0.56, W_i² = 2.69 as compared to the largest value of 33.10, and LD_i(β, σ²) = 0.08 as compared to the largest value of 1.73). These measures assign a small value to observation 5 precisely for the reason given in Section 4.1, that multiplicative measures can be small when either of their components is small. Here observation 5 has a small residual, thereby hiding the high-leverage value. Thus, of all the measures, only H_i² and AP_i correctly indicate that observations 5 and 4 are influential, followed by observations 7, 3, and 8. The P-R plot for model (4.1) is shown in Figure 5. Observation 5, which is a high-leverage point, is located by itself in the upper left corner of the plot. Four outlying observations (3, 7, 4, and 8) are located in the lower right area of the graph. So far, H_i² and AP_i seem to be equally preferable. Indeed, the index plots for H_i² and AP_i in Figure 4 are similar.

Fig. 5. New York rivers data: P-R plot for model (4.1).
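The way observations 4 and 5 pull the line toward themselves and thereby earn small residuals is easy to reproduce. The sketch below uses synthetic numbers (an assumption, not the rivers data): a pair of mutually supporting remote points each receives a large leverage p_ii but a deceptively small internally studentized residual r_i = e_i/(s·sqrt(1-p_ii)).

```python
import numpy as np

# Bulk of the data: ten unremarkable points with no trend in y,
# plus two mutually supporting remote points far from the bulk pattern.
x = np.append(np.arange(10.0), [100.0, 100.0])
y = np.append(np.full(10, 5.0), [20.0, 20.2])

X = np.column_stack([np.ones_like(x), x])
n, k = X.shape
p = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)     # leverages p_ii
e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]  # residuals
s = np.sqrt(e @ e / (n - k))
r = e / (s * np.sqrt(1 - p))                      # internally studentized residuals

# The remote pair drags the fitted line through itself: each of the two points
# has leverage near 0.5 yet a studentized residual well below the bulk maximum.
print(p[-2:].round(2), r.round(2))
```

As with observations 4 and 5 above, a residual-only analysis would pass over exactly the points that determine the fit.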
But this similarity may only be due to the fact that the model has only one explanatory variable, for generally H_i² and AP_i tend to differ, especially in models that include more than one explanatory variable. We now turn our attention to model (4.2). The influence measures obtained from fitting model (4.2) are shown in Table 4. The corresponding index plots are shown in Figure 6. The index plot of p_ii shows that observations 5, 3, and 4 have large values of p_ii relative to the remaining observations. The index plot of r_i shows that observations 7 and 4 have large residuals as compared to other observations. The residual for observation 5 is still near zero (r_5 = 0.08). Looking
Table 4
New York rivers data: Various influence measures for model (4.2)

Row  p_ii   r_i    AP_i  VR_i  C_i²   W_i²  LD_i(β,σ²)  H_i²
 1   0.06  -0.34   0.07  1.25  0.00   0.15     0.03     0.09
 2   0.08  -0.59   0.10  1.22  0.01   0.59     0.04     0.15
 3   0.39   0.16   0.39  1.95  0.01   0.46     0.04     0.64
 4   0.26  -2.03   0.44  0.71  0.49  47.22     2.73     1.24
 5   0.67   0.08   0.67  3.65  0.00   0.70     0.04     2.04
 6   0.05   0.97   0.10  1.06  0.02   1.00     0.06     0.23
 7   0.08   2.72   0.48  0.24  0.22  22.94     4.53     2.26
 8   0.15   0.86   0.19  1.24  0.04   2.94     0.16     0.32
 9   0.07  -0.61   0.09  1.21  0.01   0.58     0.04     0.15
10   0.06   0.19   0.07  1.27  0.00   0.05     0.03     0.07
11   0.07   0.75   0.10  1.16  0.01   0.77     0.05     0.17
12   0.07  -0.65   0.10  1.20  0.01   0.66     0.05     0.16
13   0.14   0.04   0.14  1.39  0.00   0.01     0.03     0.16
14   0.14   0.16   0.14  1.38  0.00   0.08     0.03     0.16
15   0.15   0.08   0.15  1.40  0.00   0.02     0.03     0.17
16   0.13   0.56   0.15  1.31  0.02   0.99     0.06     0.21
17   0.06  -1.09   0.12  1.03  0.02   1.49     0.09     0.28
18   0.15   0.67   0.17  1.30  0.03   1.66     0.09     0.25
19   0.06  -1.41   0.17  0.88  0.04   2.92     0.22     0.46
20   0.15  -0.59   0.17  1.33  0.02   1.36     0.08     0.24
at the index plots for the influence measures in Figure 6, we find that AP_i picks out observations 5, 7, 4, and 3. Each of the measures C_i², W_i², and LD_i(β, σ²) picks out only observations 4 and 7. None of these measures succeeds in pointing out observation 5.
Fitting model (4.3) to the Scottish hill races data in Table 5 reveals two clear outliers, observations 18, with r_i = 4.6, and 7, with
Fig. 6. New York rivers data: Index plots of influence measures for model (4.2).

Fig. 7. New York rivers data: P-R plot for model (4.2).
Table 5
Scottish hill races data

Name and number of race        Distance (miles)  Climb (feet)  Record time (seconds)
 1. Greenmantle New Year Dash        2.5             650              965
 2. Carnethy 5 Hill Race             6.0            2500             2901
 3. Craig Dunain                     6.0             900             2019
 4. Ben Rha                          7.5             800             2736
 5. Ben Lomond                       8.0            3070             3736
 6. Goatfell                         8.0            2866             4393
 7. Bens of Jura                    16.0            7500            12277
 8. Cairnpapple                      6.0             800             2182
 9. Scolty                           5.0             800             1785
10. Traprain Law                     6.0             650             2385
11. Lairig Ghru                     28.0            2100            11560
12. Dollar                           5.0            2000             2583
13. Lomonds of Fife                  9.5            2200             3900
14. Cairn Table                      6.0             500             2648
15. Eildon Two                       4.5            1500             1616
16. Cairngorm                       10.0            3000             4335
17. Seven Hills of Edinburgh        14.0            2200             5905
18. Knock Hill                       3.0             350             4719
19. Black Hill                       4.5            1000             1045
20. Creag Beag                       5.5             600             1954
21. Kildoon                          3.0             300              957
22. Meall Ant-Suidhe                 3.5            1500             1674
23. Half Ben Nevis                   6.0            2200             2859
24. Cow Hill                         2.0             900             1076
25. North Berwick Law                3.0             600             1121
26. Creag Dubh                       4.0            2000             1573
27. Burnswark                        6.0             800             2066
28. Largo                            5.0             950             1714
29. Criffel                          6.5            1750             3030
30. Achmony                          5.0             500             1257
31. Ben Nevis                       10.0            4400             5135
32. Knockfarrel                      6.0             600             1943
33. Two Breweries Fell              18.0            5200            10215
34. Cockleroi                        4.5             850             1686
35. Moffat Chase                    20.0            5000             9590
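Because Table 5 gives the complete data, this analysis can be reproduced. The sketch below assumes model (4.3) regresses record time on distance and climb with an intercept, which is consistent with the surrounding discussion (the model's definition is not shown in this excerpt).

```python
import numpy as np

# Scottish hill races data from Table 5: distance (miles), climb (feet),
# and record time (seconds), in race order 1-35.
dist = np.array([2.5, 6, 6, 7.5, 8, 8, 16, 6, 5, 6, 28, 5, 9.5, 6, 4.5, 10,
                 14, 3, 4.5, 5.5, 3, 3.5, 6, 2, 3, 4, 6, 5, 6.5, 5, 10, 6,
                 18, 4.5, 20.0])
climb = np.array([650, 2500, 900, 800, 3070, 2866, 7500, 800, 800, 650, 2100,
                  2000, 2200, 500, 1500, 3000, 2200, 350, 1000, 600, 300,
                  1500, 2200, 900, 600, 2000, 800, 950, 1750, 500, 4400, 600,
                  5200, 850, 5000.0])
time = np.array([965, 2901, 2019, 2736, 3736, 4393, 12277, 2182, 1785, 2385,
                 11560, 2583, 3900, 2648, 1616, 4335, 5905, 4719, 1045, 1954,
                 957, 1674, 2859, 1076, 1121, 1573, 2066, 1714, 3030, 1257,
                 5135, 1943, 10215, 1686, 9590.0])

def diagnostics(X, y):
    """Leverages p_ii and internally studentized residuals r_i for an OLS fit."""
    n, k = X.shape
    p = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
    e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    s = np.sqrt(e @ e / (n - k))
    return p, e / (s * np.sqrt(1 - p))

X = np.column_stack([np.ones(35), dist, climb])
p, r = diagnostics(X, time)
print(np.argmax(np.abs(r)) + 1)   # race 18 (Knock Hill): the largest residual

# Remove races 7 and 18 and refit, as in the text.
keep = np.delete(np.arange(35), [6, 17])
p2, r2 = diagnostics(X[keep], time[keep])
print(keep[np.argmax(p2)] + 1)    # race 11 (Lairig Ghru): the largest leverage
```

The refit on 33 races reproduces the pattern discussed below: race 11 stands out in leverage and race 33 in its residual.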
r_i = 2.8. For comparison purposes we remove these two observations and refit model (4.3) to the remaining 33 races. Several influence measures are shown in Table 6 and the corresponding index plots are shown in Figure 8. The index plots for p_ii and r_i show that observation 11 is a high-leverage point (p_ii = 0.7) and observation 33 is an outlier (r_i = 3.5). Examination of the other index plots shows that AP_i and H_i² succeed in detecting these two observations, whereas VR_i, C_i², W_i², and LD_i(β, σ²) highlight only observation 33. The two observations with large H_i² are clearly separated from the bulk of the other observations in the P-R plot in Figure 9.
Table 6
Scottish hill races data: Various influence measures for model (4.3) after deleting observations 7 and 18

Row  p_ii   r_i   AP_i  VR_i  C_i²    W_i²  LD_i(β,σ²)  H_i²
 1   0.06  0.76   0.08  1.11  0.01    1.21     0.04     0.12
 2   0.06  0.27   0.07  1.18  0.00    0.16     0.02     0.08
 3   0.04  0.57   0.05  1.12  0.00    0.47     0.02     0.08
 4   0.06  0.11   0.06  1.17  0.00    0.02     0.02     0.06
 5   0.08  0.97   0.11  1.10  0.03    2.92     0.00     0.19
 6   0.07  1.19   0.11  1.03  0.03    3.57     0.12     0.22
 8   0.05  0.02   0.05  1.16  0.00    0.00     0.02     0.05
 9   0.04  0.04   0.04  1.16  0.00    0.00     0.02     0.05
10   0.05  0.80   0.07  1.10  0.01    1.21     0.03     0.12
11   0.70  0.38   0.71  3.69  0.11   36.12     0.38     2.40
12   0.05  0.65   0.06  1.12  0.01    0.75     0.03     0.10
13   0.04  1.00   0.07  1.04  0.01    1.23     0.04     0.14
14   0.06  1.76   0.16  0.85  0.07    7.47     0.32     0.41
15   0.04  0.83   0.06  1.08  0.01    0.97     0.03     0.11
16   0.06  1.44   0.13  0.95  0.05    5.01     0.19     0.29
17   0.08  0.45   0.09  1.18  0.01    0.63     0.03     0.11
19   0.04  1.75   0.14  0.84  0.04    4.73     0.26     0.38
20   0.05  0.22   0.06  1.16  0.00    0.09     0.02     0.06
21   0.07  0.65   0.08  1.14  0.01    1.03     0.04     0.11
22   0.05  0.47   0.06  1.13  0.00    0.39     0.02     0.08
23   0.05  0.03   0.05  1.16  0.00    0.00     0.02     0.05
24   0.06  1.31   0.11  0.99  0.04    3.90     0.13     0.25
25   0.06  0.70   0.07  1.12  0.01    0.99     0.04     0.11
26   0.06  1.07   0.10        0.03    2.69     0.09     0.19
27   0.05  0.30   0.05        0.00    0.15     0.02     0.06
28   0.04  0.37   0.05        0.00    0.19     0.02     0.06
29   0.03  0.55   0.04        0.00    0.33     0.02     0.06
30   0.06  1.05   0.09        0.02    2.29     0.07     0.18
31   0.20  1.18   0.24        0.12   14.19     0.40     0.30
32   0.06  0.38   0.06        0.00    0.28     0.02     0.07
33   0.26  3.52   0.57        1.46  312.13    12.89     2.13
34   0.04  0.25   0.05        0.00    0.10     0.02     0.05
35   0.26  0.74   0.27        0.06    8.25     0.21     0.41
The P-R plot classifies observation 33 as an outlier and observation 11 as a high-leverage point. In both data sets, AP_i and H_i² give somewhat similar results. The reason for the similarity is that AP_i and H_i² are the only measures that are additive functions of the leverage value and the residual. In those cases where they give different results, the results based on H_i² are the more relevant because, unlike AP_i, it recognizes the special role that the response variable plays in linear regression.
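This additive-versus-multiplicative contrast can be checked numerically. The sketch below (synthetic numbers, not either data set) plants a high-leverage, small-residual pair and compares Cook's distance C_i² = r_i²p_ii/(k(1-p_ii)) with H_i², again assuming the additive form read from (3.7).

```python
import numpy as np

def influence(X, y):
    """Cook's distance and H_i^2 (assuming the additive form read from (3.7))."""
    n, k = X.shape
    p = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
    e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    s2 = e @ e / (n - k)
    r2 = e**2 / (s2 * (1 - p))                   # squared studentized residuals
    cook = r2 * p / (k * (1 - p))                # multiplicative in p_ii and r_i^2
    d2 = e**2 / (e @ e)
    H2 = p / (1 - p) + (k / (1 - p)) * d2 / (1 - d2)   # additive in the two parts
    return cook, H2

# Synthetic data: a mutually masking high-leverage pair with tiny residuals.
x = np.append(np.arange(10.0), [100.0, 100.0])
y = np.append(np.full(10, 5.0), [20.0, 20.2])
X = np.column_stack([np.ones_like(x), x])
cook, H2 = influence(X, y)
# Cook's distance stays small for the pair (a small residual multiplies the
# large leverage away), while the additive H_i^2 still ranks the pair on top.
print(cook.round(3), H2.round(2))
```

This is the numerical analogue of observation 5 in the rivers data: the multiplicative measure is quenched by the small residual, while the additive measure retains the leverage component.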
Fig. 8. Scottish hill races data: Index plots of influence measures for model (4.3) after deleting observations 7 and 18.

Fig. 9. Scottish hill races data: P-R plot for model (4.3) after deleting observations 7 and 18.
In Section 3 we propose a new measure H_I² for the overall potential influence of a subset of observations on regression results. This measure is defined in (3.6) for a subset containing m observations. Its special case H_i², for a subset containing a single observation, is defined in (3.7). The properties of the proposed measure are discussed, and the P-R plot is suggested as a supplementary graph to aid in classifying observations as high-leverage points, outliers, or combinations of both. In Section 4, we compare H_i² to several existing influence measures analytically and empirically. The results of Sections 3 and 4 lead us to recommend H_i² and the associated P-R plot for the following reasons:
1. Existing influence measures focus on the influence of observations on a single regression result, whereas H_i² highlights observations that are potentially influential on several regression results.
2. H_i² is an additive function of the residuals and of the leverage values and hence assumes large values for observations that have either large residuals or large leverage values, or both. By contrast, multiplicative measures would be small if one of these components (residual and leverage) is small and, hence, could fail to detect influential observations.
3. H_i² is a monotonically increasing function of both the leverage values and the squared residuals.
4. H_i² is invariant under nonsingular transformations of X and Y.
5. H_i² recognizes the special role that Y plays in regression, has nice interpretations, and is easy to compute.
6. The P-R plot is simple, easy to interpret, and useful for diagnostic purposes. It is also suitable for displaying both single- and multiple-case influence.
7. In every instance in which I compared H_i² to other existing influence measures using real-life data sets, this diagnostic, together with the P-R plot, provided complete information about influence in linear regression. The results for two of these data sets are given in Section 4.
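The invariance claimed in property 4 is easy to verify numerically. The check below again assumes the single-case form of H_i² read from (3.7); applying a nonsingular transformation to the columns of X and a location-scale change to Y leaves every H_i² unchanged.

```python
import numpy as np

def hadi_H2(X, y):
    """H_i^2 = p_ii/(1-p_ii) + (k/(1-p_ii)) d_i^2/(1-d_i^2), read from (3.7)."""
    n, k = X.shape
    p = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
    e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    d2 = e**2 / (e @ e)
    return p / (1 - p) + (k / (1 - p)) * d2 / (1 - d2)

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(15), rng.normal(size=(15, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=15)

A = rng.normal(size=(3, 3))               # almost surely nonsingular
H_orig = hadi_H2(X, y)
H_trans = hadi_H2(X @ A, 3.0 * y + 7.0)   # transform X; rescale and shift Y
print(np.allclose(H_orig, H_trans))       # True: H_i^2 is unchanged
```

The invariance holds because the hat matrix depends only on the column space of X, and d_i² is scale-free in the residuals.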
I thank J. Brian Gray and Robert F. Ling for their helpful comments on an earlier version of this manuscript. I am also grateful to the Associate Editor and a referee for providing me with valuable comments and suggestions that gave rise to a substantially improved manuscript.
References

Andrews, D.F. and D. Pregibon, Finding outliers that matter, Journal of the Royal Statistical Society B, 40 (1978) 85-93.
Atkinson, A.C., Two graphical displays for outlying and influential observations in regression, Biometrika, 68 (1981) 13-20.
Atkinson, A.C., Plots, transformations, and regression: An introduction to graphical methods of diagnostic regression analysis (Clarendon Press, Oxford, 1985).
Atkinson, A.C., Aspects of diagnostic regression analysis, Comment on "Influential observations, high-leverage points, and outliers in linear regression" by S. Chatterjee and A.S. Hadi, Statistical Science, 1 (3) (1986) 379-416.
Beckman, R.J. and R.D. Cook, Outlier..........s, Technometrics, 25 (1983) 119-163.
Belsley, D.A., E. Kuh and R.E. Welsch, Regression diagnostics: Identifying influential data and sources of collinearity (Wiley, New York, 1980).
Chatterjee, S. and A.S. Hadi, Influential observations, high-leverage points, and outliers in linear regression, Statistical Science, 1 (3) (1986) 379-416.
Chatterjee, S. and A.S. Hadi, Sensitivity analysis in linear regression (Wiley, New York, 1988).
Cook, R.D., Detection of influential observations in linear regression, Technometrics, 19 (1977) 15-18.
Cook, R.D. and S. Weisberg, Residuals and influence in regression (Chapman and Hall, London, 1982).
Draper, N.R. and J.A. John, Influential observations and outliers in regression, Technometrics, 23 (1981) 21-26.
Gentleman, J.F. and M.B. Wilk, Detecting outliers II: Supplementing the direct analysis of residuals, Biometrics, 31 (1975) 387-410.
Gray, J.B., The L-R plot: A graphical tool for assessing influence, Proceedings of the Statistical Computing Section, American Statistical Association (1983) 159-164.
Gray, J.B., A simple graphic for assessing influence in regression, Journal of Statistical Computation and Simulation, 24 (1986) 121-134.
Hadi, A.S., Two graphical displays for the detection of potentially influential subsets in regression, Journal of Applied Statistics, 17 (1990) 313-327.
Haith, D.A., Land use and water quality in New York rivers, Journal of the Environmental Engineering Division, ASCE, 102 (no. EE1, Proceedings paper 11902, 1976) 1-15.
Hoaglin, D.C. and R.E. Welsch, The hat matrix in regression and ANOVA, The American Statistician, 32 (1978) 17-22.
Huber, P.J., Robust statistics (Wiley, New York, 1981).
John, J.A. and N.R. Draper, On testing for two outliers or one outlier in two-way tables, Technometrics, 20 (1978) 69-78.
Johnson, W. and S. Geisser, Assessing the predictive influence of observations, Technical report no. 355 (University of Minnesota, School of Statistics, Minneapolis, MN, 1979).
Johnson, W. and S. Geisser, A predictive view of the detection and characterization of influential observations in regression analysis, Technical report no. 365 (University of Minnesota, School of Statistics, Minneapolis, MN, 1980).
Kianifard, F. and W.H. Swallow, Using recursive residuals, calculated on adaptively-ordered observations, to identify outliers in linear regression, Biometrics, 45 (1989) 571-585.
Lawrance, A.J., Local and deletion influence, Unpublished manuscript (1989).
Marasinghe, M.G., A multistage procedure for detecting several outliers in linear regression, Technometrics, 27 (1985) 395-399.
McCulloch, C.E. and D. Meeter, Discussion of "Outlier..........s" by R.J. Beckman and R.D. Cook, Technometrics, 25 (1983) 119-163.
Velleman, P.F. and R.E. Welsch, Efficient computing of regression diagnostics, The American Statistician, 35 (1981) 234-242.
Welsch, R.E., Influence functions and regression diagnostics, in: R.L. Launer and A.F. Siegel (Eds.), Modern data analysis (Academic Press, New York, 1982).
Welsch, R.E. and E. Kuh, Linear regression diagnostics, Technical report no. 923-77 (Sloan School of Management, Massachusetts Institute of Technology, Cambridge, MA, 1977).
Welsch, R.E. and S.C. Peters, Finding influential subsets of data in regression models, in: A.R. Gallant and T.M. Gerig (Eds.), Proceedings of the Eleventh Interface Symposium on Computing Science and Statistics (Institute of Statistics, North Carolina State University, Raleigh, NC, 1978).