G. S. Maddala and C. R. Rao, eds., Handbook of Statistics, Vol. 15 © 1997 Elsevier Science B.V. All rights reserved


Robust Inference: The Approach Based on Influence Functions

M. Markatou and E. Ronchetti

1. Introduction

Although statisticians have long been aware of the sensitivity of statistical procedures to slight changes in the assumptions under which those procedures were developed, the mathematical tools to describe these problems have been developed only in the last decades. This generated the field of robust statistics. Small deviations from the underlying assumptions on the model can have drastic effects on the performance of classical estimation and testing procedures. Robust statistics deals with such deviations and the dangers they pose for statistical procedures, and develops new techniques which are more stable and reliable in a neighborhood of the model.

We can view robustness as a collection of attributes of a technique, among which are resistance, smoothness, and breadth. Resistance refers to the property of being insensitive to the presence of a moderate number of "bad" values in the data and to inadequacies in the assumed model. Smoothness requires that the technique respond gradually to the introduction of a small proportion of gross errors, to small changes in the model, and to small data perturbations. Breadth refers to the attribute of being applicable in a wide variety of situations.

In the last decade the amount of statistical research devoted to robustness has increased considerably. Much of the effort has concentrated on robust estimation, though in the last years there has been a renewal of interest in robust inference. General references are the following books: Huber (1981), Hampel, Ronchetti, Rousseeuw, Stahel (1986), Rousseeuw and Leroy (1987), Staudte and Sheather (1990), Morgenthaler and Tukey (1991), and Stahel and Weisberg, eds. (1991).

In this paper we focus on robust inference. We concentrate on the approach based on influence functions, summarize other approaches, and give a partial survey of the results in this field. The paper is organized as follows.
In Section 2 we summarize the main approaches to robust statistics, show their connections, and discuss the basic tools which are used to formalize the robustness concept. We review two key tools, the influence function and the breakdown point, and apply them to the inference problem. They can be used to investigate the local stability
and the global reliability of a test or confidence interval. Moreover, they provide the basis for constructing new robust tests and confidence intervals. In Section 3 we present robust tests for general parametric models. Robust analogues of likelihood ratio, Wald, and scores (Rao) tests are derived. Robust inference procedures for linear and nonlinear models are discussed in Section 4. Some numerical results show the finite sample performance of these robust procedures. Section 5 is devoted to the discussion of other techniques including procedures based on robust likelihood. Finally, in Section 6 we discuss some open problems.

2. Approaches to robustness

2.1. An example of non-robustness

The purpose in robust testing is twofold. First, the level of a test should be stable under small, arbitrary departures from the null hypothesis (robustness of validity). Secondly, the test should still have good power under small, arbitrary departures from specified alternatives (robustness of efficiency). For confidence intervals, these criteria translate to stability of the coverage probability and of the length of the interval. Many classical inferential procedures do not satisfy these criteria.

An extreme case is the F-test for comparing two variances. Box (1953) investigated the stability of the level of this test and of its generalization to k samples (Bartlett's test). He embedded the normal distribution in the t-family and computed the actual level of these tests (in large samples) by varying the degrees of freedom. His results are discussed in Hampel et al. (1986), pp. 188-189, and are reported in Table 1. It is clear that the level of these tests increases drastically even when we move from the normal to a t10-distribution. The situation deteriorates as the number of samples k increases. These tests are extreme examples of non-robustness of validity. Actually, the F-test for comparing two variances is so sensitive to nonnormality that it should be used as a test for normality rather than as a test for equality of variances.

Other classical procedures show a less dramatic behavior, but the robustness problem remains. The classical F-test for linear models is relatively robust with respect to the level, but it loses power in the presence of departures from the

Table 1
Actual Level in % (in Large Samples) of the Bartlett Test When the Observations Come from a Slightly Nonnormal Distribution; from Box (1953)

Distribution   k = 2   k = 5   k = 10
Normal         5.0     5.0     5.0
t10            11.0    17.6    25.7
t7             16.6    31.5    48.9


normality assumption on the errors; cf. Hampel (1973), Schrader and Hettmansperger (1980), Ronchetti (1982a,b). Both the level and the power of the one-sample two-sided t-test can be distorted by small departures from normality; cf. Beran (1981). The Wilcoxon rank test is attractive since it combines good robustness of both level and power. Note, however, that this "nonparametric" stability is affected by asymmetric contamination in the one-sample problem, and by different contamination of the two samples in the two-sample problem; cf. Hampel et al. (1986), p. 201. Even randomization tests, which keep an exact level, are not robust with respect to the power if they are based on a non-robust test statistic.
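The fragility of the variance F-test described above is easy to check by simulation. The following sketch (ours, not from the chapter; the sample size, replication count, and seed are arbitrary choices) estimates the actual level of the nominal 5% two-sided F-test for equality of two variances under normal and under t7 observations:

```python
import numpy as np
from scipy import stats

def f_test_level(draw, n=50, reps=20000, alpha=0.05, seed=12345):
    """Monte Carlo estimate of the actual level of the nominal-alpha
    two-sided F-test for equality of two variances."""
    rng = np.random.default_rng(seed)
    x = draw(rng, (reps, n))
    y = draw(rng, (reps, n))
    ratio = x.var(axis=1, ddof=1) / y.var(axis=1, ddof=1)
    lo = stats.f.ppf(alpha / 2, n - 1, n - 1)
    hi = stats.f.ppf(1 - alpha / 2, n - 1, n - 1)
    return float(np.mean((ratio < lo) | (ratio > hi)))

level_normal = f_test_level(lambda rng, s: rng.standard_normal(s))
level_t7 = f_test_level(lambda rng, s: rng.standard_t(7, s))
# level_normal stays close to the nominal 5%; level_t7 is inflated well above it,
# in the direction predicted by Box's asymptotic calculation in Table 1.
```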

2.2. Overview of different approaches

Robust statistics in its largest sense develops methods that are applicable in a broad range of circumstances. An important part of robust statistics deals with the stability of statistical procedures in the presence of deviations from a parametric model.

The first natural step is to study the performance of a statistical procedure at a small finite number of (extreme) alternatives ("pencils") outside the parametric model. For instance, in a regression model, we can consider several alternative error distributions to the normal, such as a contaminated normal with larger variance, a slash (that is, the distribution of the ratio of a normal to a uniform random variable), etc. Then we can optimize over these alternatives and construct statistical procedures whose performance cannot be improved simultaneously at all of them. Since the procedures are constructed to have a good performance at these selected (extreme) situations, one can hope that their behavior will not deteriorate at intermediate cases. Such robust estimators were derived in location/scale and regression models, where the invariance properties of the models were used to condition on ancillary statistics. Morgenthaler (1986) used this idea to derive robust confidence intervals for a location parameter. The key points are (i) the coverage probability should hold conditionally given an ancillary statistic, and (ii) an optimal compromise between two extreme distributional shapes (such as normal and slash) can be achieved by letting the data choose which of the situations they lean towards. The advantage of this approach is that it gives exact finite sample results. Unfortunately, the selection of the alternatives seems difficult beyond the location/scale model, and the approach does not seem to carry over beyond regression. A good account of the theory, under the name "Configural Polysampling", can be found in Morgenthaler and Tukey (1991).
An alternative is to enlarge the model by embedding it in a "supermodel". Typically this is achieved by adding a shape parameter or a transformation parameter. For instance, in a regression model, we can extend the normal model by considering the family of t-distributions for the errors; by varying the degrees of freedom, we move from the normal to heavy-tailed distributions. In a recent paper, Perez (1993) derived robust tests for a location/scale model by defining a "supermodel" based on a tail-ordering of distributions. A potential
disadvantage of this approach is that the statistical procedures derived under the supermodel are not necessarily robust in a broad sense, the supermodel being too thin in the space of all distributions.

The basic element of Huber's (1964, 1981) minimax theory is a full neighborhood of the parametric model. The estimation problem is viewed as a game between Nature (which chooses a distribution in the neighborhood) and the statistician (who chooses an estimator in a given class). The statistician achieves robustness by constructing a minimax procedure which minimizes a loss criterion (such as the bias or the variance) at the worst possible case in the full neighborhood. Huber (1965; 1981, Chapter 10) extended his minimax approach to testing. In the problem of testing a simple hypothesis against a simple alternative, he found the test which maximizes the minimum power over a neighborhood of the alternative, under the side condition that the maximum level over a neighborhood of the hypothesis is bounded. The solution to this problem can be interpreted in the framework of capacities (see Huber and Strassen, 1973) and leads to exact finite sample minimax confidence intervals for a location parameter (Huber, 1968). While Huber's minimax theory is one of the key ideas in robust statistics and leads to elegant and exact finite sample results, it seems difficult to extend it to general parametric models, where no invariance structure is available. In the univariate case, Huber's idea led to an approach using shrinking neighborhoods, where the neighborhoods around the hypothesis and the alternative, as well as the distance between them, shrink at rate n^{-1/2} as the sample size n increases; cf. Huber-Carol (1970), Rieder (1978), Bednarski (1982).

Finally, one can extrapolate from the parametric model to the neighborhood by means of the influence function.
The basic idea is to linearize the functional of interest (for instance the bias and the variance of an estimator, or the level and the power of a test) and use this extrapolation to study its properties in the neighborhood. This approximation will hold only in a neighborhood of the model and at most up to the first singularity of the functional. The distance to the first singularity is described by the breakdown point. Therefore, the influence function describes the effects of small deviations (the local stability of a statistical procedure), whereas the breakdown point takes into account the global reliability and describes the effects of large deviations. This approach was originated by Hampel (1968, 1974) and is fully developed in Hampel et al. (1986). Its advantage seems to be its generality: it can be used to construct robust estimators and tests for general parametric models. We will discuss the place and the role of the influence function and the breakdown point in the inference problem in subsections 2.3-2.5.

A final remark concerns the interplay of robustness and diagnostics. Both are concerned with deviations from the assumed model, but from a different perspective. The basic approaches discussed above show clearly that "the purpose of robustness is to safeguard against deviations from the assumptions whereas the purpose of diagnostics is to find and identify deviations from the assumptions"; Huber (1991), p. 121.


2.3. Local stability of a test: The influence function

In this subsection we focus on robust testing procedures in the univariate case. In particular, we review the concept of local stability of a test by means of the influence function. We will relate it to the global reliability as expressed by the breakdown point in subsection 2.5. The multivariate case will be treated in Section 3.

Ronchetti (1979, 1982a,b) and Rousseeuw and Ronchetti (1979, 1981) extended Hampel's influence function to testing a null hypothesis about a one-dimensional parameter; cf. Hampel et al. (1986), Chapter 3. An essential result of this approach is the approximation of the asymptotic level and the asymptotic power under a contaminated distribution in a neighborhood of the hypothesis.

Consider a parametric model {F_θ}, where θ is a real parameter, a sample z1, z2, ..., zn of n i.i.d. observations, and a test statistic T_n which can be written as a functional T(F^(n)) of the empirical distribution function F^(n). Let H0: θ = θ0 be the null hypothesis and θn = θ0 + δ/√n a sequence of alternatives. The idea is first to compute the asymptotic level of the test under contamination and compare it with its nominal level α0. We consider the ε-contamination

F_{ε,n} = (1 − ε/√n) F_{θ0} + (ε/√n) G ,

where G is an arbitrary distribution. It is necessary to choose a contamination which converges to zero at the same rate as the sequence of contiguous alternatives converges to the hypothesis, in order to avoid overlapping between the neighborhood of the hypothesis and that of the alternative; see Huber-Carol (1970), Rieder (1978), Bednarski (1982), Hampel et al. (1986), Chapter 3. It turns out that by a von Mises expansion (von Mises, 1947; Fernholz, 1983), the asymptotic level and the asymptotic power under contamination can be expressed as (see Remark 2.1 for the conditions)

aslevel(ε) = α0 + ε ∫ LIF(z; T, F_{θ0}) dG(z) + o(ε) ,   (2.1)

and

aspower(ε) = β0 + ε ∫ PIF(z; T, F_{θ0}) dG(z) + o(ε) ,   (2.2)

where

LIF(z; T, F_{θ0}) = φ(Φ^{-1}(1 − α0)) IF(z; T, F_{θ0}) / [V(T, F_{θ0})]^{1/2} ,   (2.3)

PIF(z; T, F_{θ0}) = φ(Φ^{-1}(1 − α0) − δ√E) IF(z; T, F_{θ0}) / [V(T, F_{θ0})]^{1/2} ,   (2.4)

α0 is the nominal asymptotic level, β0 = 1 − Φ(Φ^{-1}(1 − α0) − δ√E) is the nominal asymptotic power, E = [ξ'(θ0)]² / V(T, F_{θ0}) is Pitman's efficacy of the test, ξ(θ) = T(F_θ), V(T, F_{θ0}) = ∫ IF(z; T, F_{θ0})² dF_{θ0}(z) is the asymptotic variance of T, Φ^{-1}(1 − α0) is the 1 − α0 quantile of the standard normal distribution Φ, and
φ is its density; see Hampel et al. (1986), equations (3.2.13), (3.2.14), (3.2.3) and Property 2, p. 195. It follows from (2.3) and (2.4) that the level influence function and the power influence function are proportional to the self-standardized influence function of the test statistic T, i.e. IF_s(z; T, F_{θ0}) = IF(z; T, F_{θ0}) / [V(T, F_{θ0})]^{1/2}. Notice that in the notation of Hampel et al. (1986), p. 199, IF_s = √E · IF_test. The functions given by (2.3) and (2.4) describe the influence of a small amount of contamination at some point z on the asymptotic level and power of the test. By means of (2.1)-(2.4) we can obtain the maximal level and the minimal power over the neighborhood:

sup_G aslevel(ε) ≈ α0 + ε φ(Φ^{-1}(1 − α0)) sup_z IF_s(z; T, F_{θ0}) ,   (2.5)

inf_G aspower(ε) ≈ β0 + ε φ(Φ^{-1}(1 − α0) − δ√E) inf_z IF_s(z; T, F_{θ0}) .   (2.6)

Therefore, bounding the self-standardized influence function of the test statistic from above will ensure robustness of validity. On the other hand, bounding it from below will ensure robustness of efficiency. This will not generally guarantee that the level and the power remain stable in the presence of large deviations. The effect of large deviations is described by the breakdown point; see subsection 2.5.

REMARK 2.1. Conditions for the validity of the approximations of the level and the power are given in Heritier and Ronchetti (1994). A key assumption is the Fréchet differentiability of the test statistic T. It ensures the uniform convergence to normality in the neighborhood of the model. This condition is satisfied, for instance, by M-functionals with a bounded ψ-function; see Clarke (1986), Bednarski (1993), and Heritier and Ronchetti (1994).

REMARK 2.2. Lambert (1981) developed a similar approach by computing the influence function of the asymptotic log P-value (i.e. the Bahadur half-slope) of a test viewed as a functional of the underlying distribution of the observations. Since this influence function is proportional to the influence function of the test statistic, the same results as those obtained above apply; cf. Hampel et al. (1986), subsection 3.6a.
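As a numerical illustration of (2.5) and (2.6) (our sketch, not from the chapter), consider a test based on a Huber M-estimator of location with ψ_c, c = 1.345, at the normal model. Under Φ, E[ψ_c'] = 2Φ(c) − 1 and E[ψ_c²] = E[min(|Z|, c)²] have closed forms, sup_z IF_s = c/(E[ψ_c²])^{1/2}, and E = (E[ψ_c'])²/E[ψ_c²]. The resulting values closely match (to about one unit in the last printed digit) the entries of Table 2 in Section 2.4 for the bounded normal scores test with b = 1.345, since both tests have, up to scaling, influence function ψ_b(z) at the normal model:

```python
from math import sqrt
from scipy.stats import norm

c, alpha0 = 1.345, 0.05

# Moments of Huber's psi_c under the standard normal distribution
Epsi_prime = 2 * norm.cdf(c) - 1                       # E[psi_c'] = P(|Z| < c)
Epsi_sq = (2 * norm.cdf(c) - 1 - 2 * c * norm.pdf(c)   # E[psi_c^2]
           + 2 * c**2 * norm.sf(c))

sup_IFs = c / sqrt(Epsi_sq)        # sup_z of the self-standardized IF
E = Epsi_prime**2 / Epsi_sq        # Pitman efficacy, Fisher-consistent scaling
z_a = norm.ppf(1 - alpha0)

def max_level(eps):
    """(2.5): maximal asymptotic level over the eps-neighborhood."""
    return alpha0 + eps * norm.pdf(z_a) * sup_IFs

def min_power(eps, delta):
    """(2.6): minimal asymptotic power; here inf_z IF_s = -sup_z IF_s."""
    shift = z_a - delta * sqrt(E)
    return 1 - norm.cdf(shift) - eps * norm.pdf(shift) * sup_IFs

for eps in (0.0, 0.01, 0.05, 0.10):
    print(f"eps={eps:4.2f}  max level {100 * max_level(eps):.1f}%  "
          f"min power {100 * min_power(eps, 3.0):.1f}% (delta=3)")
```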

2.4. Optimal robust tests

As in the case of robust estimation, one can try to achieve an optimal compromise between robustness and efficiency by finding an optimal robust test which maximizes the asymptotic power at the model, subject to a bound on the level and power influence functions; see Hampel et al. (1986), Chapter 3. Typically this is achieved by bounding the likelihood scores. It follows from (2.3) that the level influence function is the product of a factor depending only on the level and the self-standardized influence function of the test statistic IF_s(z; T, F_{θ0}) =
IF(z; T, F_{θ0}) / [V(T, F_{θ0})]^{1/2}. Therefore, the class of solutions for different bounds is the same for all levels. Moreover, the solution does not depend on the "distance" of the alternative δ, and the optimal robust test is uniformly most powerful. This is due to the fact that the asymptotic power can be expressed as 1 − Φ(Φ^{-1}(1 − α0) − δ√E). If the test statistic is scaled such that it is Fisher-consistent, that is ξ'(θ0) = 1, we obtain E^{-1} = V(T, F_{θ0}), the asymptotic variance of the test statistic T. Thus, finding the test which maximizes the asymptotic power subject to a bound on the level and power influence functions is equivalent to finding an estimator T which minimizes the asymptotic variance subject to a bound on its self-standardized influence function, that is, to finding the "optimal" B_s-robust estimator given in Hampel et al. (1986), Chapter 4. A similar result for the general multivariate case will be presented in Section 3.

Hampel et al. (1986), p. 200, show the influence functions of three nonparametric tests, the sign, the Wilcoxon, and the normal scores test at the normal model. The normal scores test is the asymptotically most efficient one-sample rank test at the normal model, but its influence function is unbounded. By imposing a bound b on the influence function and by solving the optimality problem mentioned above, we obtain a new optimal robust one-sample rank test defined by the scores function J(u) = min{Φ^{-1}(1/2 + u/2), b}; cf. Ronchetti (1979), Rousseeuw and Ronchetti (1979). Table 2 gives the maximal level and minimal power of this test in a neighborhood of the normal model. The table is computed by means of (2.5) and (2.6) and shows the stability of the level and the power of this test in the presence of contamination. Other examples can be found in Victoria-Feser (1993) and Heritier and Victoria-Feser (1997).

REMARK 2.3. The optimality problem discussed above can be viewed as an insurance problem.
Table 2
Maximal Asymptotic Level and Minimal Asymptotic Power (%) of the Bounded Normal Scores Test (b = 1.345)

ε      Max. Level (δ = 0)   Min. Power (δ = 0.5)   Min. Power (δ = 3.0)
0      5.0                  12.3                   90.0
0.01   5.2                  12.0                   89.7
0.05   5.8                  10.7                   88.6
0.10   6.6                  9.1                    87.2

One can buy robustness by bounding the influence function and
paying a small insurance premium in terms of efficiency at the model. In this sense robustness and full efficiency at the parametric model are not compatible. In an effort to reconcile these somewhat conflicting requirements, Beran (1977) used minimum Hellinger distance estimators to obtain robustness and first order efficiency. However, one should notice that Hellinger neighborhoods seem "too small" to describe the deviations observed in real data. These ideas were picked up again and extended by Simpson (1989), Lindsay (1994), and Basu, Markatou and Lindsay (1995); cf. Section 5.

2.5. Global reliability of a test: The breakdown point

A further step in the robustness analysis is the computation of the breakdown point of a test. This quantity describes the global reliability of a test and gives the maximum amount of contamination which can be tolerated by the test. Although we generally believe that the breakdown point is an important component of a robustness analysis, we feel that in the framework of inference it should be taken "cum grano salis". In fact, as pointed out by He, Simpson and Portnoy (1990), p. 447, "small deviations are relevant for inference, which is more meaningful in small neighborhoods of the assumed models". This implies that the notion of high breakdown point is less important at the inference stage than in point estimation.

The concept of breakdown point for estimators was introduced by Hampel (1968, 1971) in an asymptotic version. Donoho and Huber (1983) gave a finite sample version. An early finite sample definition of the breakdown point of a test was given by Ylvisaker (1977) and adapted to rank statistics by Hettmansperger (1984). Consider a test with critical region {T_n > c_n}. The resistance to acceptance ρ_A [resistance to rejection ρ_R] of the test is defined as the smallest proportion m0/n for which, no matter what z_{m0+1}, ..., z_n are, there are values z_1, ..., z_{m0} in the sample with T_n < c_n [T_n ≥ c_n]. In other words, given ρ_A there is at least one sample of size n − (nρ_A − 1) which suggests rejection so strongly that this decision cannot be overruled by the remaining nρ_A − 1 observations. For instance, the test defined by the critical region x̄_n ≥ c_n has a resistance to acceptance and a resistance to rejection of 1/n.

He, Simpson and Portnoy (1990) and He (1991) developed the notion of level and power breakdown functions. Formally, the level breakdown function of a test based on a test statistic T, viewed as a functional, is defined as

ε*(T) = inf {ε > 0 : T((1 − ε)F_{θ0} + εG) = T(F_θ), for some G} ,   (2.7)

and the power breakdown function as

ε̃(T) = inf {ε > 0 : T((1 − ε)F_θ + εG) = T(F_{θ0}), for some G} .   (2.8)

The breakdown functions of a test measure its ability to distinguish contamination neighborhoods of the null hypothesis and of the alternative. Approximations to these functions are obtained by inverting (2.5) and (2.6). Then the level breakdown point is the minimum value of ε such that the supremum of
the asymptotic level is 1. Analogously, the power breakdown point is the minimum value of ε such that the infimum of the asymptotic power is 0. Related papers are Rubio and Visek (1993) and He and Shao (1995).
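A small sketch of the level breakdown function (2.7) (ours, not from the chapter). Take T to be a location functional and contaminate with a point mass at z → +∞. The sample mean functional satisfies T((1 − ε)Φ + εΔ_z) = εz, so any ε > 0 can push it to any alternative value: its level breakdown is 0. A Huber location functional with bound c instead satisfies (1 − ε) E_Φ[ψ_c(X − t)] + εc = 0, so the smallest ε that drags it to a given alternative value t is strictly positive:

```python
from scipy.stats import norm

c = 1.345

def g(t):
    """E_Phi[psi_c(X - t)] for Huber's psi_c and X ~ N(0,1), in closed form."""
    a, b = t - c, t + c
    return (-c * norm.cdf(a) + c * norm.sf(b)
            + norm.pdf(a) - norm.pdf(b) - t * (norm.cdf(b) - norm.cdf(a)))

def level_breakdown_huber(t_alt):
    """Smallest eps with T((1-eps) Phi + eps Delta_{+inf}) = t_alt for the
    Huber location functional: solve (1 - eps) g(t_alt) + eps * c = 0."""
    return -g(t_alt) / (c - g(t_alt))

eps_star = level_breakdown_huber(0.5)   # contamination needed to fake t = 0.5

# The mean reaches 0.5 already with eps = 0.001 and a point mass at z = 500:
eps_mean, z = 0.001, 500.0
mean_contaminated = (1 - eps_mean) * 0.0 + eps_mean * z
```

The nearer the alternative is to the hypothesis, the smaller the contamination needed to confuse the two neighborhoods, consistent with the remark above that inference is most meaningful in small neighborhoods.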

3. Robust tests for general parametric models

In this section, we discuss the extension of robust testing to general multivariate parametric models, as developed by Heritier and Ronchetti (1994). We present robust versions of the Wald, scores (Rao), and likelihood ratio tests based on M-estimators, which are the natural counterparts of the corresponding classical tests (see e.g. Rao, 1973 and Silvey, 1979).

We consider a general parametric model {F_θ}, where θ lies in Ω, an open convex subset of R^p, and a sample z1, z2, ..., zn of n i.i.d. random vectors. Let f_θ be the density of F_θ and s(z, θ) = ∂ log f_θ(z)/∂θ the scores function. We are interested in testing the null hypothesis that q (< p) linearly estimable functions of θ are zero. Denote by a^t = (a_{(1)}^t, a_{(2)}^t) the partition of a vector a into p − q and q components, and by A_{(ij)}, i, j = 1, 2, the corresponding partition of p × p matrices. Through a linear transformation of the parameter, this problem can be reformulated as testing H0: θ = θ0, where θ0_{(2)} = 0 and θ0_{(1)} is unspecified, against the alternative H1: θ_{(2)} ≠ 0, θ_{(1)} unspecified. The tests we consider rely on M-estimators T_n of θ defined by

Σ_{i=1}^n ψ(z_i, T_n) = 0 .   (3.1)

Under quite general conditions, M-estimators are consistent at the model and asymptotically normally distributed; see Huber (1967, 1981) and Clarke (1983, 1986) as basic references. Their influence function and asymptotic covariance matrix V are given by IF(z; ψ, F_θ) = M(ψ, F_θ)^{-1} ψ(z, θ) and V(ψ, F_θ) = M(ψ, F_θ)^{-1} Q(ψ, F_θ) M(ψ, F_θ)^{-t}, where M(ψ, F_θ) = − ∫ (∂ψ/∂θ)(z, θ) dF_θ(z) and Q(ψ, F_θ) = ∫ ψ(z, θ) ψ(z, θ)^t dF_θ(z).

As in classical likelihood inference, the following classes of tests are defined.

i) A Wald type test statistic is a quadratic form in the second component (T_n)_{(2)} of an M-estimator of θ:

W_n² = (T_n)_{(2)}^t [V(ψ, F_θ)_{(22)}]^{-1} (T_n)_{(2)} .   (3.2)

In practice, V(ψ, F_θ)_{(22)} is estimated by replacing θ by T_n in V.

ii) A scores (or Rao) type test is based on the test statistic

R_n² = Z_n^t C^{-1} Z_n ,   (3.3)
where Z_n = n^{-1} Σ_{i=1}^n ψ(z_i, T̃_n)_{(2)}, and T̃_n is the M-estimator in the reduced model, i.e. the solution of the equation

Σ_{i=1}^n ψ(z_i, T̃_n)_{(1)} = 0 with (T̃_n)_{(2)} = 0 ,   (3.4)

and

C = M_{(22.1)} V_{(22)} M_{(22.1)}^t   (3.5)

is a q × q positive definite matrix, where M_{(22.1)} = M_{(22)} − M_{(21)} M_{(11)}^{-1} M_{(12)}. The matrix C = C(ψ, F_θ) is the asymptotic covariance matrix of Z_n and can be estimated consistently. When ψ(z, θ) = s(z, θ), the scores function, the test is the classical scores or Rao test; cf. Rao (1973).

iii) A likelihood ratio type test (or drop-in-dispersion test) is given by a test statistic of the form

S_n² = 2 Σ_{i=1}^n [ρ(z_i, T_n) − ρ(z_i, T̃_n)] ,   (3.6)

where (∂ρ/∂θ)(z, θ) = ψ(z, θ), and T_n and T̃_n are the M-estimators in the full and reduced model, defined by (3.1) and (3.4) respectively. When ρ is the log-likelihood function, (3.6) gives the classical likelihood ratio test. One can define even more general tests by (3.6), (3.1), and (3.4) with (∂ρ/∂θ)(z, θ) ≠ ψ(z, θ); cf. Richardson (1995).

The test statistics (3.2), (3.3), and (3.6) can be written as functionals of the empirical distribution F^(n), e.g. W_n² = W²(F^(n)), where W²(F) = T(F)_{(2)}^t (V_{(22)})^{-1} T(F)_{(2)} and T(F) is the functional defining the corresponding M-estimator. These functionals can in turn be written as quadratic forms U(F)^t U(F); for the likelihood ratio statistic this is true asymptotically. The asymptotic distribution and the robustness properties of these tests are driven by the functional U(F).

Heritier and Ronchetti (1994) show that the Wald and scores type tests have asymptotically a χ_q² distribution, where q ≤ p is the dimension of the hypothesis to be tested. This distribution turns out to be a central χ_q² under the null hypothesis and a noncentral one under a sequence of contiguous alternatives, with the same noncentrality parameter for the two classes. Under an additional condition, the asymptotic distribution of the likelihood ratio type test is a linear combination of χ² variables. Therefore robust Wald and scores tests have the same asymptotic distribution as their classical counterparts, whereas likelihood ratio type tests have in general a more complicated asymptotic distribution.

Gardiol (1995) and Holly and Gardiol (1995) use computer algebra to obtain higher order expansions of the distribution of the classical tests, both under the hypothesis and under a sequence of contiguous alternatives. We believe that their results (and programs) can be used for the robust tests defined above. This would provide a better understanding of the differences among the three tests.


To investigate the local stability of these tests, the concept of influence function for tests discussed in Section 2.3 was extended by Heritier and Ronchetti (1994) to Wald and scores type tests in the multivariate case. As in the univariate situation, the idea is to compute the level of the test under contamination and compare it with its nominal level. We consider again the ε-contamination F_{ε,n} = (1 − ε/√n) F_{θ0} + (ε/√n) G, where G is an arbitrary distribution. Then, if we denote by α(F_{ε,n}) the level of the test under the ε-contamination F_{ε,n} and by α0 the nominal level α(F_{θ0}), we have

lim_{n→∞} α(F_{ε,n}) = α0 + ε² μ ‖ ∫ IF(z; U, F_{θ0}) dG(z) ‖² + o(ε²) ,   (3.7)

where ‖·‖ is the Euclidean norm, μ = −(∂/∂δ) H_q(q_{1−α0}; δ)|_{δ=0}, H_q(·; δ) is the cumulative distribution function of a χ_q²(δ) distribution, q_{1−α0} is the 1 − α0 quantile of the central χ_q² distribution, and U is the functional defining the quadratic forms of the Wald and scores type test statistics. A similar result can be obtained for the power. From (3.7) it is clear that the proper quantity to bound, in order to have a stable level in a neighborhood around the hypothesis, is the influence function of the functional U(F). In particular, if we choose a point mass contamination G = Δ_z, i.e. F_{ε,n} = (1 − ε/√n) F_{θ0} + (ε/√n) Δ_z, we obtain

lim_{n→∞} α(F_{ε,n}) = α0 + ε² μ IF(z; T_{(2)}, F_{θ0})^t (V_{(22)})^{-1} IF(z; T_{(2)}, F_{θ0}) + o(ε²) ,   (3.8)

where T_{(2)} is the second component of an M-estimator T defined by (3.1). Notice that [IF(z; T_{(2)}, F_{θ0})^t (V_{(22)})^{-1} IF(z; T_{(2)}, F_{θ0})]^{1/2} is the self-standardized influence function of the estimator T_{(2)}; see Krasker and Welsch (1982) and Hampel et al. (1986), p. 228. Therefore, in order to obtain a robust Wald or scores type test, we have to bound the self-standardized influence function of the estimator T_{(2)}. Moreover, we can obtain optimally bounded-influence (Wald and scores) tests by using an optimally robust self-standardized estimator T_{(2)}; see Hampel et al. (1986), Section 4.4.b, Heritier (1993), and Heritier and Ronchetti (1994). Examples of optimal robust tests can be found in Heritier and Ronchetti (1993), Victoria-Feser and Ronchetti (1994), and Heritier and Victoria-Feser (1997).

4. Applications

4.1. Linear models

We consider the following model: {(x_i, y_i) : i = 1, 2, ..., n} are independent random variables such that y_i = x_i^t β + e_i, where the x_i's are independent of the e_i's and e_1, ..., e_n are independent errors from a continuous distribution symmetric around 0. The basic estimation technique is least squares, and it is well known that
the theory of least squares regression relies heavily on a number of distributional assumptions, in particular on the normality of the errors. A slight violation of one or more of these assumptions can seriously damage the reliability of the least squares estimates of the model parameters, as well as the estimates of the standard errors. The testing procedures based on these estimates exhibit similar problems. For example, the conventional F test loses power rapidly in the presence of departures from the normality assumption on the errors (see Schrader and Hettmansperger, 1980). This sensitivity has led to various proposals for robust methods of estimation and testing. Among the estimation proposals are the classical M-estimates of Huber (1973) and the generalized M-estimates (GM-estimates). Huber's M-estimators have a bounded influence of residuals, and the GM-estimators have a totally bounded influence, that is, the influence of residuals and the influence of position in factor space are both bounded; cf. Hampel et al. (1986), Ch. 6. These estimators of the regression parameters can be obtained from the minimization problem

min_β Σ_{i=1}^n τ(x_i, (y_i − x_i^t β)/σ) ,   (4.1)

where τ : R^p × R → R^+ is a function such that τ(x, r) ≢ 0, τ(x, r) ≥ 0 for all x ∈ R^p, r ∈ R, and τ(x, 0) = 0 for all x ∈ R^p. Let η(x, r) = (∂/∂r) τ(x, r). Then the first order conditions give

Σ_{i=1}^n η(x_i, (y_i − x_i^t β)/σ) x_i = 0 .   (4.2)
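Equations (4.1)-(4.2) are usually solved by iteratively reweighted least squares. The following minimal sketch (ours; the toy data and all constants are illustrative) does this for Huber's choice η(x, r) = ψ_c(r), which gives weights w_i = min{1, cσ/|r_i|}, with the scale σ re-estimated by the normalized MAD at each iteration:

```python
import numpy as np

def huber_irls(X, y, c=1.345, n_iter=50):
    """Huber regression M-estimator via iteratively reweighted least squares.
    Weights w_i = min(1, c*sigma/|r_i|); sigma = normalized MAD of residuals."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]        # least squares start
    for _ in range(n_iter):
        r = y - X @ beta
        sigma = max(np.median(np.abs(r - np.median(r))) / 0.6745, 1e-8)
        w = np.minimum(1.0, c * sigma / np.maximum(np.abs(r), 1e-12))
        Xw = X * w[:, None]
        beta = np.linalg.solve(X.T @ Xw, Xw.T @ y)     # weighted normal equations
    return beta

# Toy data: y = 2x with one gross outlier in the response
x = np.arange(10.0)
X = np.column_stack([np.ones(10), x])
y = 2.0 * x
y[-1] += 50.0
b_ls = np.linalg.lstsq(X, y, rcond=None)[0]   # slope badly distorted
b_huber = huber_irls(X, y)                    # slope close to 2
```

The outlier's weight shrinks toward zero as the iterations proceed, so the fit is driven by the nine clean observations.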

This is a special case of the M-estimator defined in (3.1) and can be viewed as an iteratively reweighted least squares estimator by rewriting (4.2) as ~;"--1 wirix~ = O, where wi = q ( x i , - gFi) / Fri and ri y i - x ~ f l is the residual. The most important choices of the q-function are of the form t/(x, r) = v(x)6c(rs(x))

(4.3)

where v : R p --+ R + , s : R p --* R + and ~c is Huber's psi-function, i.e. tp~(r) = max ( - c , min(c, r)). To obtain Huber's M-estimators as weighted least squares we use weights of the form wi = min { 1 , c a / [ r d } where r~ is the ith residual and c is a positive constant. If c = 1.345 then the corresponding location estimator has 95% efficiency at the normal model.The scale a can be estimated by means of the M A D estimator or Huber's proposal 2; cf. Huber(1981). To obtain the GM-estimators as weighted least squares we use weights of the form wi = m i n { 1 , e a u ( x i ) / I r i ] } , where u : R P ~ [0, 1] is a weight function that depends only on the design points xi, i = 1, 2 , . . . , n. An example of such a weight function is u(xi) = { 1 - x ~ ( X t x ) - l x i } l/2 = ( 1 - hi) 1/2 where hi = x ~ ( X t X ) - l x i is the ith leverage. This form of u(x) was proposed by Schweppe; cf. Handschin et al.


(1975). As h_i → 1 the weight u(x_i) → 0 and hence outlying points in the factor space are downweighted. More robust forms of u(x) can be found in Hampel et al. (1986). Extensions to more complex models such as simultaneous equations models can be found in Krishnakumar and Ronchetti (1997).

The bounded influence procedures give good stability and good efficiency against infinitesimal contaminations; therefore these procedures pertain to the local stability of the estimators and tests. A high breakdown point procedure is capable of handling data with multiple outliers; thus, it refers to the global stability properties of the estimators and test statistics. It is therefore desirable to combine these two properties. For coefficient estimation He (1991) showed that it is possible to construct high breakdown point and bounded influence estimators. Simpson, Ruppert and Carroll (1992) studied the one-step GM-estimators of Mallows type starting from a high breakdown point (HBP) fit. Coakley and Hettmansperger (1993) studied the one-step estimators of Schweppe type. Both estimates are of the form β̂ = β̂_0 + H_0^{-1} g_0, where β̂_0 is an HBP estimator. For the estimator defined by Simpson, Ruppert and Carroll (1992) we have g_0 = σ̂_0 Σ_{i=1}^n ψ(r_i/σ̂_0) w_i x_i and σ̂_0 = med_i{|y_i − x_i^t β̂_0|}/k, where k is a properly chosen standardizing constant. There are two choices for the matrix H_0:

Newton-Raphson: H_0 = Σ_{i=1}^n w_i ψ'(r_i/σ̂_0) x_i x_i^t

Scoring: H_0 = (n^{-1} Σ_{i=1}^n ψ'(r_i/σ̂_0)) Σ_{j=1}^n w_j x_j x_j^t

where ψ'(·) is the derivative of ψ and w_i = min{1, {b/[(x_i − m_x)^t C_x^{-1} (x_i − m_x)]}^{γ/2}} are Mallows-type weights with a high breakdown point multivariate location m_x and scatter C_x for the design points. If γ = 0, then the estimate specializes to the one-step Huber estimate discussed by Bickel (1975). As multivariate (m_x, C_x) the minimum volume ellipsoid (MVE) estimators are used; cf. Rousseeuw and Leroy (1987). Algorithms for computing this type of estimators have been given by Ruppert (1992).

Parameter estimation is the first step in data analysis. We are often interested in testing whether various linearly independent estimable functions are equal to 0. Through a transformation in the parameter space, these hypotheses reduce to the hypothesis that a certain subvector of the vector of unknown parameters equals 0, treating the remaining parameters as nuisance parameters. Here, we will discuss the tests based on the robust estimators discussed above. To this end, let β^t = (β_{(1)}^t, β_{(2)}^t), where β_{(1)} is a (p − q) × 1 vector and β_{(2)} is a q × 1 vector, q < p. We would like to test the hypothesis H_0: β_{(2)} = 0, β_{(1)} unspecified, versus the alternative H_1: β_{(2)} ≠ 0, β_{(1)} unspecified. Testing procedures based on M-estimators and GM-estimators can achieve robustness of validity and robustness of efficiency. Such testing procedures for linear models have been studied by Schrader and Hettmansperger (1980), Sen (1982), Ronchetti (1982a,b), Hampel et al. (1986) Ch. 7, Markatou and Hettmansperger (1990), Markatou, Stahel and Ronchetti (1991), Silvapulle (1992), and Müller (1995). As an alternative, a review of rank based procedures can be found in Hettmansperger and Naranjo (1991).
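The weighted least squares interpretation above translates directly into an iterative algorithm. The following sketch (Python with NumPy; the function names are ours, not from the literature) computes a Huber M-estimate by iteratively reweighted least squares, re-estimating the scale at each step by the MAD; setting `schweppe=True` multiplies the weights by u(x_i) = (1 − h_i)^{1/2} to obtain a GM-estimate of Schweppe type.

```python
import numpy as np

def irls_huber(X, y, c=1.345, schweppe=False, tol=1e-8, max_iter=100):
    """Huber M-estimate of regression by iteratively reweighted least squares.

    Weights are w_i = min{1, c*sigma/|r_i|}; with schweppe=True they are
    multiplied by u(x_i) = (1 - h_i)^(1/2), which downweights high-leverage
    design points (a GM-estimate of Schweppe type)."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]        # least squares start
    h = np.einsum('ij,jk,ik->i', X, np.linalg.inv(X.T @ X), X)  # leverages h_i
    u = np.sqrt(1.0 - h) if schweppe else np.ones(n)
    sigma = 1.0
    for _ in range(max_iter):
        r = y - X @ beta
        # MAD scale estimate (0.6745 makes it consistent at the normal)
        sigma = np.median(np.abs(r - np.median(r))) / 0.6745
        w = np.minimum(1.0, c * sigma * u / np.maximum(np.abs(r), 1e-12))
        WX = X * w[:, None]
        beta_new = np.linalg.solve(WX.T @ X, WX.T @ y)  # solve X'WX b = X'Wy
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta, sigma
```

With c = 1.345 this reproduces the 95% efficiency tuning mentioned above; a single gross residual receives a weight proportional to 1/|r_i|, so its influence on the fit stays bounded.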


Robust tests based on M-estimators for linear models are special cases of the general tests defined in Section 3.

i) A Wald-type test for regression is defined by the test statistic

W_n^2 = n β̂_{(2)}^t V̂_{(22)}^{-1} β̂_{(2)}   (4.4)

where β̂_{(2)} is the M-estimate of β_{(2)} obtained under the full model and V̂_{(22)} is the q × q submatrix of V̂ = M̂^{-1} Q̂ M̂^{-1}, where M̂, Q̂ are estimates of the matrices M = E{η'(x, r/σ) x x^t}, Q = E{η²(x, r/σ) x x^t}. Note that if η(x, r/σ) = ψ_c(r/σ) then we obtain the corresponding matrices for Huber's M-estimators. We can estimate M, Q by M̂ = n^{-1} Σ_{i=1}^n η'(x_i, r_i/σ̂) x_i x_i^t and Q̂ = n^{-1} Σ_{i=1}^n η²(x_i, r_i/σ̂) x_i x_i^t. The asymptotic distribution of the test statistic W_n^2 is a central χ² under H_0 and a noncentral χ² under a sequence of contiguous alternatives.
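A minimal sketch of the Wald-type statistic (4.4), assuming the full-model estimate β̂ and the scale σ̂ are supplied by some M-fit; taking the last q coordinates as β_(2) and inserting a σ̂² factor in V̂ are our conventions for the illustration. For q = 1 the χ²₁ tail probability can be computed with the standard library alone.

```python
import numpy as np
from math import erfc, sqrt

def psi(r, c=1.345):
    """Huber's psi function psi_c(r) = max(-c, min(c, r))."""
    return np.clip(r, -c, c)

def psi_prime(r, c=1.345):
    """Derivative of Huber's psi: 1 on [-c, c], 0 outside."""
    return (np.abs(r) <= c).astype(float)

def wald_type_test(X, y, beta, sigma, q=1, c=1.345):
    """W_n^2 = n * b_(2)' V_(22)^{-1} b_(2) for H0: beta_(2) = 0, where
    beta_(2) holds the last q coefficients and V = sigma^2 M^{-1} Q M^{-1}
    is built from the empirical M-hat and Q-hat matrices."""
    n = X.shape[0]
    r = (y - X @ beta) / sigma               # standardized residuals
    M = (X * psi_prime(r, c)[:, None]).T @ X / n
    Q = (X * (psi(r, c) ** 2)[:, None]).T @ X / n
    Minv = np.linalg.inv(M)
    V = sigma ** 2 * Minv @ Q @ Minv
    b2 = beta[-q:]
    W2 = float(n * b2 @ np.linalg.solve(V[-q:, -q:], b2))
    # chi^2_1 tail probability: P(chi^2_1 > w) = erfc(sqrt(w/2))
    pval = erfc(sqrt(W2 / 2.0)) if q == 1 else None
    return W2, pval
```

In the demonstration below a least squares fit and the MAD of its residuals stand in for the M-estimate and scale, purely for brevity.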

ii) Score (or Rao)-type tests for regression were developed by Sen (1982) and Markatou and Hettmansperger (1990). The test statistic is defined by

R_n^2 = (n^{-1/2} Σ_{i=1}^n η(x_i, (y_i − x_i^t β̂_0)/σ̂) x_{(2)i})^t Ĉ^{-1} (n^{-1/2} Σ_{i=1}^n η(x_i, (y_i − x_i^t β̂_0)/σ̂) x_{(2)i})   (4.5)

where x_i^t = (x_{(1)i}^t, x_{(2)i}^t), Ĉ = Q̂_{(22)} − M̂_{(21)} M̂_{(11)}^{-1} Q̂_{(12)} − Q̂_{(21)} M̂_{(11)}^{-1} M̂_{(12)} + M̂_{(21)} M̂_{(11)}^{-1} Q̂_{(11)} M̂_{(11)}^{-1} M̂_{(12)}, and β̂_0 is the reduced model M-estimate. The asymptotic distribution of the test under H_0 is a central χ². Under contiguous alternatives the asymptotic distribution is a noncentral χ².

iii) A likelihood ratio-type (or drop-in-dispersion) test for regression is a direct generalization of the F-test. This test was discussed by Schrader and Hettmansperger (1980) and Ronchetti (1982a,b). The test statistic is

S_n^2 = 2 Σ_{i=1}^n {τ(x_i, r_{0i}/σ̂) − τ(x_i, r_i/σ̂)}

where r_{0i} = y_i − x_i^t β̂_0 and r_i = y_i − x_i^t β̂ are the residuals from the reduced and full model respectively, and τ has been defined in (4.1). Large values of S_n^2 are significant. The asymptotic distribution of S_n^2, under H_0, is determined by the distribution of L = Σ_{j=p−q+1}^p λ_j N_j^2, where N_j, j = p − q + 1, ..., p, are independent standard normal random variables and λ_j, j = p − q + 1, ..., p, are the q positive eigenvalues of the matrix

Q { M^{-1} − ( M_{(11)}^{-1}  0 ; 0  0 ) }


where M_{(11)} is the upper (p − q) × (p − q) part of the M-matrix defined before. Algorithms to compute the p-values associated with the statistic S_n^2 were given, for example, by Solomon and Stephens (1977) and Farebrother (1974), and have been implemented in the package ROBETH-ROBSYS; see Marazzi (1993). If the function τ is selected to be Huber's ρ_c function, then the asymptotic distribution of the test statistic is simply λχ², where λ = E[ψ_c^2(r/σ)]/E[ψ_c'(r/σ)]. This class of tests was proposed by Schrader and Hettmansperger (1980). Another special case is when q = 1. In this case the asymptotic distribution is λχ², where λ = E{η²(x, r)[(B^{-1}x)_p]^2}, with M = BB^t, and (B^{-1}x)_p denotes the last component of the vector B^{-1}x. Another special case is when the density of x is spherically symmetric around x_{(2)}, the component of x that corresponds to β_{(2)}. Markatou and Hettmansperger (1992) have shown that when there are no extreme leverage cases all eigenvalues λ_j are equal to E[ψ_c^2(r/σ)]/E[ψ_c'(r/σ)]. They develop expansions of the λ_j's as functions of the leverages and show that noticeable separation between the eigenvalues occurs in the presence of moderate and high leverage.

The test statistics presented above have a bounded influence function, but they may fail completely if the data contain a large number of outliers. This is because the tests do not have high breakdown points. For the notion of breakdown point for tests, cf. Section 2.5. Markatou and He (1994) studied the breakdown properties of the three classes of tests in regression models and constructed tests based on the one-step, bounded influence and high breakdown point estimates.
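For Huber's ρ_c the λχ² calibration of the drop-in-dispersion statistic is easy to carry out. The sketch below (our own illustration) takes reduced- and full-model residuals — in practice these would come from M-fits of the two models — and estimates λ empirically by the ratio of the sample moments of ψ_c² and ψ_c′.

```python
import numpy as np
from math import erfc, sqrt

def huber_rho(r, c=1.345):
    """Huber's rho function: r^2/2 on [-c, c], linear tails outside."""
    a = np.abs(r)
    return np.where(a <= c, 0.5 * r ** 2, c * a - 0.5 * c ** 2)

def drop_in_dispersion(r_red, r_full, sigma, q=1, c=1.345):
    """S_n^2 = 2 * sum[rho_c(r0i/sigma) - rho_c(ri/sigma)], referred to a
    lambda * chi^2_q distribution with lambda = E psi_c^2 / E psi_c'
    estimated from the standardized full-model residuals."""
    z = r_full / sigma
    S2 = 2.0 * np.sum(huber_rho(r_red / sigma, c) - huber_rho(z, c))
    lam = np.mean(np.clip(z, -c, c) ** 2) / np.mean((np.abs(z) <= c).astype(float))
    stat = max(S2, 0.0) / lam
    # chi^2_1 tail probability for q = 1 (stdlib only)
    pval = erfc(sqrt(stat / 2.0)) if q == 1 else None
    return S2, lam, pval
```

In the demonstration the reduced and full fits are least squares fits, again only as a stand-in for the robust fits described in the text.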

4.2. Numerical results

There are not many numerical studies examining the finite sample properties of robust testing procedures. Markatou and Manos (1995) present a simulation study for nonlinear models. Schrader and Hettmansperger (1980) present another small study for the drop-in-dispersion test in linear models. Stahel and Hartmann (1992) presented a more extensive simulation for the three classes of tests. They studied the finite sample behavior of the three classes of tests based on M-estimators. The designs were the two-sample problem and simple linear regression, with sample sizes 10 and 20 and with error distributions N(0,1), t_3, and 0.90 N(0,1) + 0.10 N(0,10^2). The simulation results show that the differences among the tests are mostly small. However, when the p-values of the test statistics are determined on the basis of the asymptotic distributions there are remarkable differences among the tests. The Wald-type test is generally fitted well by an appropriate F-distribution; in the case of simple linear regression this distribution has 1 and n − 2 degrees of freedom. It was also noticed that estimating the scale by Huber's proposal 2 is typically preferable to the MAD.

Here we present a different simulation study. The design consists of two regressions with common intercepts and x's placed at 1, 2, 3, 4, 5, 10 for the first sample and at 7, 8, 9, 10, 11, 12 for the second sample. It contains a point of moderate leverage corresponding to the point 10 in the first sample. This design has


been used by Hettmansperger and McKean (1983) to study the finite sample properties of the rank based tests in linear models. The dimension of the unknown parameter vector is 3, so the model is y_i = β_0 + β_1 x_{i1} + β_2 x_{i2} + ε_i. We wish to test H_0: β_2 = 0, β_0, β_1 unspecified versus H_1: β_2 ≠ 0, β_0, β_1 unspecified. We simulated the three classes of tests for N(0,1), contaminated normal, and Cauchy samples. The contaminated distribution is 0.90 N(0,1) + 0.10 N(0,10^2). The number of replications was 500 and the nominal level 0.05. The results are reported in Table 3. Notice that the Wald-type test is quite liberal for the Cauchy and the contaminated normal distribution, while it exhibits a level of .066 in the case of the N(0,1) distribution. To obtain the p-values associated with the Wald test we used the F_{3,9} distribution; generally, the distribution used to fit the quantiles of the distribution of the Wald-type test is F_{p,n−p}. The score-type test exhibits a conservative behavior. The drop-in-dispersion test has the best performance in terms of level. We note here that Hettmansperger and McKean (1983) found similar results to hold when the estimators used are rank estimators.

The performance of the tests in terms of power is shown in Table 4. We see that the drop-in-dispersion test has a good performance in terms of power as well. Again the behavior of the score test is very conservative. Simple small sample corrections, such as using F-critical values instead of the asymptotic chi-squared values, do not help here. The small sample corrections needed for this test appear to be quite complicated, as it is not a matter of using Huber's proposal 2 for scale estimation. Here we estimated the scale using the MAD of the residuals from the last iteration; experimentation with scale estimated by Huber's proposal 2 did not improve the behavior of the score-type test. Overall the drop-in-dispersion test exhibited the best performance in this design.
This is in line with the findings for the same type of test based on rank estimates; see Hettmansperger and McKean (1983). It also qualifies the recommendation of Stahel and Hartmann (1992): if there are no moderate or high leverage points in the design, the Wald-type test can be recommended, but when moderate leverage points are present the drop-in-dispersion test has the best performance and is therefore recommended. The performance of the tests based on GM-estimates remains yet to be studied.

Table 3
Levels (%) of drop-in-dispersion, Wald and score-type tests

            Drop-in-dispersion   Wald test   Score test
N(0,1)             6.5              6.6          2.4
Contam             6.8             10.0          2.4
Cauchy             5.8             14.0          3.8


Table 4
Powers (%) of drop-in-dispersion, Wald and score-type tests

                    Drop-in-dispersion   Wald test   Score test
δ = 0.5   N(0,1)          34.0             39.4          8.2
          Contam          22.4             38.4          7.4
          Cauchy          11.2             33.4         10.6
δ = 0.7   N(0,1)          55.0             62.8         13.4
          Contam          40.2             62.2         14.0
          Cauchy          19.0             53.2         18.2
δ = 1     N(0,1)          83.8             92.4         20.2
          Contam          65.2             88.4         22.2
          Cauchy          33.8             79.4         27.2
δ = 1.2   N(0,1)          94.0             96.6         24.0
          Contam          76.8             92.4         27.2
          Cauchy          43.8             85.8         30.4
δ = 1.7   N(0,1)          99.4             99.8         38.2
          Contam          89.0             97.8         38.2
          Cauchy          63.0             94.0         44.4

4.3. Nonlinear models

Nonlinear models play an important role in many fields. The estimation techniques and the error assumptions are usually analogous to those made for linear models. As a result the estimators and tests are sensitive to outliers, high leverage points and departures from the error assumptions. The model is specified by y_i = h(x_i, β) + ε_i, i = 1, ..., n, where y_i is the observed response, x_i a vector of explanatory variables, h(x_i, β) is a response function, β is a p-dimensional vector of unknown parameters, and ε_i, i = 1, ..., n, are the unobservable errors. In this setting M-estimators are defined by (3.1) with ψ(x_i, y_i; β) = η(x_i, (y_i − h(x_i, β))/σ) (∂/∂β) h(x_i, β). The associated tests are defined according to Section 3. M-estimators in a nonlinear regression setting were studied by Fraiman (1983). Welsh, Carroll and Ruppert (1994) obtained estimators for the parameters of the nonlinear model in the presence of heteroscedasticity. The associated robust tests were studied by Heritier and Ronchetti (1994) and Markatou and Manos (1995).

To illustrate the behavior of these tests, consider the following example taken from Heritier and Ronchetti (1994). The response function is h(x, β) = β_1 x_1 + β_2 x_2 + β_4 e^{β_3 x_3}, where x = (x_1, x_2, x_3)^t and β = (β_1, β_2, β_3, β_4)^t. We simulated a sample of 60 observations similar to Gallant's (1975) data. Here, the inputs correspond to a one-way design with two treatments for experimental material whose age (x_3) affects the response exponentially. The least squares estimates of β and σ are (−.004, 1.01, −.95, −.50)^t and .038 respectively. For this model the relevant hypotheses in an empirical study are of the form H_0: γ = γ_0 against H_1:


[Figure 1. Left panel: y versus x_3. Right panel: sensitivity curves of the log p-values of the classical and optimal robust scores tests under H_0, as observation y_15 is moved along the y-axis.]


H_1: γ ≠ γ_0, where γ is some subvector of β. As in Gallant (1975) we chose γ = (β_1, β_3)^t and tested H_0: β_1 = 0, β_3 = −1.5. We used both the classical and the optimal robust scores tests and computed their p-values when moving one observation (y_15) along the y-axis; see Figure 1. We performed the same experiment with other observations and obtained comparable results. We used the least squares estimator of scale for the classical Rao test and the MAD (median absolute deviation of the residuals) obtained from a preliminary Huber-type fit for the robust alternative. The left part of Figure 1 shows y versus x_3; as x_1 is a dummy variable, one plotting symbol is used for x_1 = 0 and another for x_1 = 1. The right part shows the sensitivity curve of the log p-values under H_0 for the classical and optimal robust scores tests. By moving one single observation out of 60, the level of the classical scores test is unstable and switches from significance to nonsignificance when y_15 is greater than 1.1. On the other hand, the robust scores-type test (c = 5 and c = 7.5) is not influenced by the outlying observation and is always significant. A similar result is obtained with a robust Wald-type test.

4.4. Generalized linear models

Robust estimation for generalized linear models has been studied by Stefanski, Carroll and Ruppert (1986) and Künsch, Stefanski and Carroll (1989). Markatou, Basu and Lindsay (1996) propose robust estimators for logistic regression by means of the approach outlined in Section 5.2. Robust tests can be derived by using the general procedures described in Section 3. Applications to logistic regression can be found in Heritier and Victoria-Feser (1997).

5. Other techniques

5.1. Methods based on the use of the empirical characteristic function

A different approach to constructing robust procedures is based on the use of the empirical characteristic function. Empirical characteristic function based methods take advantage of the boundedness of the cosine and sine functions and thus exhibit robustness properties. The empirical characteristic function (ECF) has been used in robust estimation of regression slope parameters by Chambers and Heathcote (1981), Heathcote (1982) and Welsh (1985). More recently, and in relation to applications to panel data (error component) problems, Markatou, Horowitz and Lenth (1995) and Markatou and Horowitz (1995) made use of the ECF. Methods based on the ECF generally have a bounded influence. Here we will review briefly these methods and discuss their applications.

Chambers and Heathcote (1981) and Heathcote (1982) introduced a method for estimating the regression vector β in a linear regression model which is called functional least squares. Functional least squares corresponds to minimizing the empirical version of the alternative measure of scale L(t) = −t^{-2} log{U^2(t) + V^2(t)}


where U(t) and V(t) are the real and imaginary parts of the characteristic function of the errors respectively. For each t belonging to a set near the origin, the functional least squares estimator β̂(t) of the slope parameters at t is the statistic minimizing

L_n(β, t) = −t^{-2} log{U_n^2(t) + V_n^2(t)} ,   (5.1)

where

U_n(t) = n^{-1} Σ_{j=1}^n cos{t(y_j − x_j^t β)} and V_n(t) = n^{-1} Σ_{j=1}^n sin{t(y_j − x_j^t β)} ,

subject to the additional constraint tan(tα̂) = V_n(β̂(t))/U_n(β̂(t)) on the intercept estimate α̂; this removes the indeterminacy resulting from the fact that L_n(β + c, t) = L_n(β, t) for any shift c of the intercept (Welsh, 1985). Minimization of L_n(β, t) leads to a family of estimators indexed by t. Under certain conditions the asymptotic variance of suitably standardized β̂(t) is a scalar, even, location invariant function σ^2(t) multiplied by a constant matrix; cf. Chambers and Heathcote (1981) and Heathcote (1982). Welsh (1985) showed that functional least squares and regression analysis of angular variables are the same technique. Moreover, functional least squares estimators are M-estimators defined by ψ(r) = t^{-1} sin(tr).

One of the advantages of the methodology based on the ECF is that it sometimes applies to problems that are not treated easily by conventional methods. In the robustness context an example is the estimation of scale in the panel data (error components) model. Panel data consist of a time series of observations on each of several individuals (the panel). A distinguishing feature of such data is that they may depend on unobserved attributes of individuals that are constant over time (individual specific effects). Panel data arise frequently in econometrics, and there is a large literature on their analysis; see Hsiao (1986). To focus the discussion, it is useful to consider a simple model for the analysis of panel data

y_{jt} = β^t x_{jt} + α_j + ε_{jt} ,   (5.2)

where j = 1, 2, ..., n, t = 1, 2, ..., T, and y_{jt} is the observed value of the dependent variable for individual j in time period t, x_{jt} is a vector of observed explanatory variables for individual j in time period t, β is a vector of parameters to be estimated, α_j is an unobserved individual effect that may or may not be correlated with one or more components of x_{jt}, and ε_{jt} is an unobserved random variable that is independently and identically distributed across both time periods and individuals. In certain applications the number of individuals n is large and T is small. If T remains constant then the α_j cannot be estimated consistently. Thus, it is desirable to have robust methods of estimation of scale that avoid the need of estimating the permanent effects α_j. Estimates of the relative scales of α and ε are important in certain econometric applications of models for panel data because the relative scales influence the


extent to which a realization of y is 'permanent' or 'transitory'. For example, suppose that y_{jt} is the income of the jth individual in year t, ε ~ N(0, σ_ε^2) and α ~ N(0, σ_α^2). Then, given a fixed variance of the income distribution, the probability that a person whose income is below the poverty line in year 1 is still in poverty in year 2 is an increasing function of σ_α^2/σ_ε^2. Existing robust measures of scale cannot deal with the above problem easily because there is no simple relationship between the scales of two independent random variables and the scale of their sum. Markatou and Horowitz (1995) used a scale estimator based on the modulus of the empirical characteristic function of appropriately constructed residuals to obtain estimators of scale for α and ε. Testing procedures based upon this scale estimator were developed by Lenth, Markatou and Tsimikas (1995).
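The criterion (5.1) underlying these ECF methods is only a few lines of code. The sketch below (ours, not from the papers cited) evaluates L_n(β, t); because cos and sin are bounded, each observation has bounded influence on the criterion, and a grid minimizer at a fixed t resists a gross outlier that badly biases least squares.

```python
import numpy as np

def fls_objective(beta, t, X, y):
    """Empirical functional least squares criterion
    L_n(beta, t) = -t^{-2} log{U_n(t)^2 + V_n(t)^2},
    where U_n and V_n are the real and imaginary parts of the empirical
    characteristic function of the residuals at frequency t."""
    r = t * (y - X @ beta)
    U = np.mean(np.cos(r))   # real part of the ECF of the residuals
    V = np.mean(np.sin(r))   # imaginary part
    return -np.log(U ** 2 + V ** 2) / t ** 2
```

As t → 0, t^{-1} sin(tr) → r, so the criterion approaches ordinary least squares; moderate values of t trade efficiency for robustness.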

5.2. Procedures based on robust likelihoods

The robust estimators and tests presented in the previous sections can be viewed as modifications of likelihood based procedures. In this section we discuss other modifications of likelihood methods which have been proposed in the literature. Let z_1, z_2, ..., z_n be a random sample from a density f(z; θ) corresponding to a distribution F(z; θ), and let s(z_i, θ) = (∂/∂θ) log f(z_i; θ) be the score function. The maximum likelihood estimator of θ, under appropriate regularity conditions, is a solution of the system of equations Σ_{i=1}^n s(z_i, θ) = 0. Green (1984), while discussing the iteratively reweighted least squares technique for maximum likelihood estimation, suggested the use of weighted score equations. The weights w_i are functions of the discrepancy between the data and the fitted model. This discrepancy is measured in terms of the deviance, defined as Δ(z; θ) = −2{log f(z; θ) − sup_τ log f(z; τ)}. Pregibon (1982) discusses resistant fits for the binomial/logistic model, again in terms of deviances. Lenth and Green (1987) solve the set of estimating equations

Σ_{i=1}^n w{Δ^{1/2}(z_i; θ)} s(z_i, θ) = 0 ,   (5.3)

where w(t) = ψ(t)/t, for a given function ψ like, for instance, Huber's function ψ_c. The least squares case in location estimation corresponds to ψ(t) = t, for which w(t) = 1. For a general ψ, a deviance based M-estimator (DBME) can be viewed as an adaptively weighted maximum likelihood estimator. Lenth and Green (1987) show that the estimators obtained by solving equations (5.3) are consistent in regular location-scale problems where the distribution of the sample is continuous. They are not consistent for most discrete distributions. Ironically, the discrete case is one of the earliest settings in which deviance based M-estimators were used (Pregibon 1981, 1982).

Other suggestions about the selection of the weights came from Field and Smith (1995). They suggest the following two forms of weights. The first weight function creates a bounded score function similar to classical M-estimators. The


idea is to consider the supremum of each score function over the central (1 − 2p)% of the distribution as determined by the current value of θ. The score function outside this range is downweighted so that it does not exceed the supremum of the score function over this region. Let A(θ, p) = {z : p ≤ F(z; θ) ≤ 1 − p}. The jth component of the weight function is given by

w_j(z; θ) = min{1, sup_{y ∈ A(θ,p)} |s(y, θ)_j| / |s(z, θ)_j|} .   (5.4)

The weights have the same effect as in the classical M-estimates in that the score functions remain bounded, but the method used to carry out the downweighting differs. In the case of optimal robust M-estimation, the bounding is carried out in a Euclidean scale with the metric determined by the Fisher information matrix, while here the bounding is carried out using the central (1 − 2p)% portion of the distribution function to determine upper bounds on each score function. In the case of location with normal density this leads to a Huber estimate with tuning constant c = Φ^{-1}(1 − p). The second weight function uses the same weight for each component of the score function, with the weight given by

w(z; θ) = F(z; θ)/p           if F(z; θ) < p
        = 1                   if p ≤ F(z; θ) ≤ 1 − p   (5.5)
        = (1 − F(z; θ))/p     if F(z; θ) > 1 − p

This function has the property of downweighting any points which do not lie within the central (1 − 2p)% of the distribution as determined by the value of θ. In contrast, (5.4) downweights only those components of the score which exceed the supremum (of the same component) over the central portion of the distribution. Both weight functions are invariant under monotone transformations of the data.

Basu, Markatou and Lindsay (1995) and Markatou, Basu and Lindsay (1996) take the approach based on the idea of modifying the usual likelihood equations to achieve efficient estimates with good breakdown properties. Thus, the score equations are replaced by weighted score equations, in which the weights are functions of appropriately defined residuals. The role of the weights is to reduce the influence of outlying observations on the score equations. Given a point z in the sample space, we construct a weight depending on z, on the assumed probability model F_θ, and on the empirical cumulative distribution function F^(n), say w(z; F_θ, F^(n)). The estimates for θ are then obtained by solving the equations

Σ_i w(z_i; F_θ, F^(n)) s(z_i, θ) = 0 .   (5.6)

To describe an observation as an outlier we need to describe an appropriate set of residuals. Since the relationship between parameters and observations here is probabilistic, it seems more appropriate to take into account in the definition of


residuals this probabilistic structure. When the underlying probability distribution is discrete, define the Pearson residual as δ(t) = d(t)/f(t; θ) − 1, where d(t) is the proportion of observations with value t. If the underlying probability model is continuous, then the Pearson residual is defined as δ*(t) = g*(t)/f*(t; θ) − 1, where g*(t) is a kernel density estimator defined as g*(t) = ∫ k(t; z, h) dF^(n)(z) and f*(t; θ) is the model smoothed by the same kernel, that is, f*(t; θ) = ∫ k(t; z, h) dF_θ(z). The Pearson residuals take values in the interval [−1, ∞). In this context, an observation is described as an outlier if its corresponding Pearson residual is large. Large Pearson residuals correspond to observations that occur in locations t with small probabilities. This type of probabilistic outlier is called by Lindsay (1994) a surprising observation. The weights used are functions of the Pearson residuals and are defined as

w(δ) = [A(δ) + 1]^+ / (δ + 1) ,   (5.7)

where A is an increasing function defined on [−1, ∞). This function is called the residual adjustment function (Lindsay, 1994). It operates on the Pearson residuals in a fashion analogous to the way the ψ_c-function operates on regular residuals. A linear residual adjustment function corresponds to the maximum likelihood estimator, while A(δ) = 2{(δ + 1)^{1/2} − 1} corresponds to the Hellinger distance. Basu, Markatou and Lindsay (1995) and Markatou, Basu and Lindsay (1996) describe the connection between the weighted likelihood equations and the minimum disparity estimators. Here, we note that the estimates obtained by using equations (5.6) with weights (5.7) are consistent, asymptotically normal and fully efficient. Thus, they have the same influence function as the maximum likelihood estimator. However, these estimates have a positive breakdown point, which can be as high as 50%.
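To make the weighted likelihood equations (5.6)-(5.7) concrete, here is a sketch for the simplest discrete case, estimation of a Poisson mean (the model choice and function names are ours). Pearson residuals are computed cell by cell, converted to weights through the Hellinger residual adjustment function, and the weighted score equation is solved by fixed-point iteration.

```python
import math
import numpy as np

def hellinger_raf(delta):
    """Residual adjustment function for the Hellinger distance:
    A(delta) = 2{(delta + 1)^{1/2} - 1}."""
    return 2.0 * (np.sqrt(delta + 1.0) - 1.0)

def wle_poisson(z, tol=1e-8, max_iter=200):
    """Weighted likelihood estimate of a Poisson mean theta.

    Pearson residuals delta(t) = d(t)/f(t; theta) - 1 compare the sample
    proportions d(t) with the model probabilities f(t; theta); the weights
    w(delta) = [A(delta) + 1]_+ / (delta + 1) downweight surprising cells.
    The Poisson score is s(z, theta) = (z - theta)/theta, so the weighted
    score equation reduces to theta = sum(w z)/sum(w)."""
    z = np.asarray(z)
    theta = float(np.median(z))                      # robust starting value
    vals, counts = np.unique(z, return_counts=True)
    d = dict(zip(vals, counts / len(z)))             # sample proportions d(t)
    for _ in range(max_iter):
        f = {t: math.exp(-theta) * theta ** int(t) / math.factorial(int(t))
             for t in vals}                          # model probabilities
        delta = {t: d[t] / f[t] - 1.0 for t in vals} # Pearson residuals
        w = {t: max(hellinger_raf(delta[t]) + 1.0, 0.0) / (delta[t] + 1.0)
             for t in vals}                          # weights (5.7)
        wi = np.array([w[t] for t in z])
        theta_new = float(np.sum(wi * z) / np.sum(wi))
        if abs(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta
```

An observation far in the tail makes δ enormous and its weight ≈ 2/√(δ + 1) negligible, while at the model all δ ≈ 0 and all weights ≈ 1, which is the source of the full efficiency noted above.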

6. Conclusions

In this paper we reviewed the basic elements of robust inference by means of the approach based on influence functions. The theory is based on the asymptotic distribution of test statistics over neighborhoods of the model. Two directions seem important in order to improve the results for finite samples. The first and most natural direction is to use the optimal robust procedures based on the asymptotic theory and to derive better finite sample approximations for their distributions. Edgeworth and (empirical) saddlepoint approximations, and the bootstrap, play an important role here; cf. Field and Ronchetti (1990). In particular, accurate approximations of marginal distributions are now available. Recent results include Tingley and Field (1990), Stahel and Hartmann (1992), Ronchetti and Welsh (1994), Fan and Field (1995), and Gatto and Ronchetti (1996). A survey is given in Field and Tingley (1997). The second and more challenging direction is to start with more refined (2nd order) tail area approximations such as those provided by saddlepoint techniques, and to develop results similar to those presented in Sections 2, 3, and 4. A first attempt in this direction for a simple situation can be found in Field and Ronchetti (1985).

Acknowledgment

The work of Marianthi Markatou was funded by NSF Grant DMS-9208820 and RGK Foundation Grant RGK CU01583501.

References

Basu, A., M. Markatou and B. G. Lindsay (1995). Weighted likelihood estimating equations: The continuous case. Technical Report, Columbia University.
Bednarski, T. (1982). Binary experiments, minimax tests and 2-alternating capacities. Ann. Statist. 10, 226-232.
Bednarski, T. (1993). Fréchet differentiability of statistical functionals and implications to robust statistics. In: S. Morgenthaler, E. Ronchetti and W. A. Stahel, eds., New Directions in Statistical Data Analysis and Robustness. Birkhäuser, Basel, 26-34.
Beran, R. (1977). Minimum Hellinger distance estimates for parametric models. Ann. Statist. 5, 445-463.
Beran, R. (1981). Efficient robust tests in parametric models. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 57, 73-86.
Bickel, P. J. (1975). One-step Huber estimates in linear models. J. Amer. Statist. Assoc. 70, 428-434.
Box, G. E. P. (1953). Non-normality and tests on variances. Biometrika 40, 318-335.
Chambers, R. L. and C. R. Heathcote (1981). On the estimation of slope and the identification of outliers in linear regression. Biometrika 68, 21-33.
Clarke, B. R. (1983). Uniqueness and Fréchet differentiability of functional solutions to maximum likelihood type equations. Ann. Statist. 11, 1196-1205.
Clarke, B. R. (1986). Nonsmooth analysis and Fréchet differentiability of M functionals. Probability Theory and Related Fields 73, 197-209.
Coakley, C. W. and T. P. Hettmansperger (1993). A bounded influence, high breakdown, efficient regression estimator. J. Amer. Statist. Assoc. 88, 872-880.
Donoho, D. L. and P. J. Huber (1983). The notion of breakdown point. In: P. J. Bickel, K. A. Doksum and J. L. Hodges, eds., A Festschrift for Erich L. Lehmann. Wadsworth, Belmont (CA), 157-184.
Fan, R. Y. K. and C. A. Field (1995). Marginal densities of M-estimators. Canad. J. Statist. 23, 185-197.
Farebrother, R. W. (1974). The distribution of a positive linear combination of chi-squared random variables. Appl. Statist. 33, 332-339.
Fernholz, L. T. (1983). Von Mises Calculus for Statistical Functionals. Lecture Notes in Statistics 19, Springer-Verlag, New York.
Field, C. A. and E. Ronchetti (1985). A tail area influence function and its application to testing. Communications in Statistics C 4, 19-41.
Field, C. A. and E. Ronchetti (1990). Small Sample Asymptotics. Institute of Mathematical Statistics Monograph Series, Hayward (CA).
Field, C. A. and B. Smith (1995). Robust estimation - a weighted maximum likelihood approach. Internat. Statist. Rev. 62, 405-424.
Field, C. A. and M. Tingley (1997). Small sample asymptotics: Applications in robustness. This volume.
Fraiman, R. (1983). General M-estimators and applications to bounded influence estimation for nonlinear regression. Communications in Statistics, Theory and Methods 12(22), 2617-2631.
Gallant, A. R. (1975). Nonlinear regression. Amer. Statist. 29, 73-81.


Gardiol, L. (1995). Développement asymptotique de tests statistiques suivant asymptotiquement une loi du chi-carré sous une suite d'alternatives locales: résultats théoriques et modules informatiques. Ph.D. thesis, HEC, University of Lausanne, Switzerland.
Gatto, R. and E. Ronchetti (1996). General saddlepoint approximations of marginal densities and tail probabilities. J. Amer. Statist. Assoc. 91, 666-673.
Green, P. J. (1984). Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. J. Roy. Statist. Soc. B 46, 149-192.
Hampel, F. R. (1968). Contribution to the theory of robust estimation. Ph.D. thesis, University of California, Berkeley.
Hampel, F. R. (1971). A general qualitative definition of robustness. Ann. Math. Statist. 42, 1887-1896.
Hampel, F. R. (1973). Robust estimation: A condensed partial survey. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 27, 87-104.
Hampel, F. R. (1974). The influence curve and its role in robust estimation. J. Amer. Statist. Assoc. 69, 383-393.
Hampel, F. R., E. M. Ronchetti, P. J. Rousseeuw and W. A. Stahel (1986). Robust Statistics: The Approach Based on Influence Functions. Wiley, New York.
Handschin, E., F. C. Schweppe, J. Kohlas and A. Fiechter (1975). Bad data analysis for power system state estimation (with discussion). IEEE Transactions on Power Apparatus and Systems 94, 329-337.
He, X. (1991). A local breakdown property of robust tests in linear regression. J. Multivar. Anal. 38, 294-305.
He, X. and Q. M. Shao (1995). Bahadur efficiency and robustness of Studentized score tests. Manuscript.
He, X., D. G. Simpson and S. L. Portnoy (1990). Breakdown robustness of tests. J. Amer. Statist. Assoc. 85, 446-452.
Heathcote, C. R. (1982). Linear regression by functional least squares. J. Appl. Prob. 19A, 225-239.
Heritier, S. (1993). Contribution to robustness in nonlinear models: application to economic data. Ph.D. thesis, Faculty of Economic and Social Sciences, University of Geneva, Switzerland.
Heritier, S. and E. Ronchetti (1993). Robust estimation and inference in the market model. Manuscript.
Heritier, S. and E. Ronchetti (1994). Robust bounded-influence tests in general parametric models. J. Amer. Statist. Assoc. 89, 897-904.
Heritier, S. and M. P. Victoria-Feser (1997). Practical applications of bounded-influence tests. This volume.
Hettmansperger, T. P. (1984). Statistical Inference Based on Ranks. Wiley, New York.
Hettmansperger, T. P. and J. W. McKean (1983). A geometric interpretation of inferences based on ranks in the linear model. J. Amer. Statist. Assoc. 78, 885-893.
Hettmansperger, T. P. and J. D. Naranjo (1991). Some research directions in rank-based inference. In: W. Stahel and S. Weisberg, eds., Directions in Robust Statistics and Diagnostics (Part I), Springer-Verlag, New York, 113-120.
Holly, A. and L. Gardiol (1995). An asymptotic expansion for the distribution of test criteria which are asymptotically distributed as chi-squared under contiguous alternatives. In: G. S. Maddala, ed., Advances in Econometrics and Quantitative Economics, 123-145.
Hsiao, C. (1986). Analysis of Panel Data. Econometric Society Monographs.
Huber, P. J. (1964). Robust estimation of a location parameter. Ann. Math. Statist. 35, 73-101.
Huber, P. J. (1965). A robust version of the probability ratio test. Ann. Math. Statist. 36, 1753-1758.
Huber, P. J. (1967). The behavior of the maximum likelihood estimates under nonstandard conditions. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability 1, 221-233.
Huber, P. J. (1968). Robust confidence limits. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 10, 269-278.
Huber, P. J. (1973). Robust regression: Asymptotics, conjectures and Monte Carlo. Ann. Statist. 1, 799-821.
Huber, P. J. (1981). Robust Statistics. Wiley, New York.
Huber, P. J. (1991). Between robustness and diagnostics. In: W. Stahel and S. Weisberg, eds., Directions in Robust Statistics and Diagnostics (Part I), Springer-Verlag, New York, 121-130.

Huber, P. J. and V. Strassen (1973). Minimax tests and the Neyman-Pearson lemma for capacities. Ann. Statist. 1, 251-263; 2, 223-224.
Huber-Carol, C. (1970). Etude asymptotique de tests robustes. Ph.D. thesis, ETH, Zürich, Switzerland.
Krasker, W. S. and R. E. Welsch (1982). Efficient bounded-influence regression estimation. J. Amer. Statist. Assoc. 77, 595-604.
Krishnakumar, J. and E. Ronchetti (1997). Robust estimators for simultaneous equations models. J. Econometrics, to appear.
Künsch, H. R., L. A. Stefanski and R. J. Carroll (1989). Conditionally unbiased bounded-influence estimation in general regression models, with applications to generalized linear models. J. Amer. Statist. Assoc. 84, 460-466.
Lambert, D. (1981). Influence functions for testing. J. Amer. Statist. Assoc. 76, 649-657.
Lenth, R. V. and P. J. Green (1987). Consistency of deviance-based M-estimators. J. Roy. Statist. Soc. B 49, 326-330.
Lenth, R. V., M. Markatou and J. Tsimikas (1995). Robust analysis of variance based on the sample characteristic function. Australian Journal of Statistics 37, 1-16.
Lindsay, B. G. (1994). Efficiency versus robustness: the case for minimum Hellinger distance and related methods. Ann. Statist. 22, 1081-1114.
Marazzi, A. (1993). Algorithms, Routines, and S Functions for Robust Statistics. Wadsworth & Brooks/Cole, Belmont (CA).
Markatou, M., A. Basu and B. G. Lindsay (1996). Weighted likelihood estimating equations: the discrete case with applications to logistic regression. J. Statist. Plan. Infer., to appear.
Markatou, M. and X. He (1994). Bounded influence and high breakdown point testing procedures in linear models. J. Amer. Statist. Assoc. 89, 543-549.
Markatou, M. and T. P. Hettmansperger (1990). Robust bounded influence tests in linear models. J. Amer. Statist. Assoc. 85, 187-190.
Markatou, M. and T. P. Hettmansperger (1992). Applications of the asymmetric eigenvalue problem techniques to robust testing. J. Statist. Plan. Infer. 31, 51-65.
Markatou, M. and J. L. Horowitz (1995). Robust scale estimation in the error components model. Canad. J. Statist. 23, 369-381.
Markatou, M., J. L. Horowitz and R. V. Lenth (1995). A robust scale estimator based on the empirical characteristic function. Statist. Prob. Lett. 25, 185-192.
Markatou, M. and G. Manos (1995). Robust tests in nonlinear regression models. J. Statist. Plan. Infer., to appear.
Markatou, M., W. A. Stahel and E. Ronchetti (1991). Robust M-type testing procedures for linear models. In: W. Stahel and S. Weisberg, eds., Directions in Robust Statistics and Diagnostics (Part I), Springer-Verlag, New York, 201-220.
Morgenthaler, S. (1986). Robust confidence intervals for a location parameter: The configural approach. J. Amer. Statist. Assoc. 81, 518-525.
Morgenthaler, S. and J. W. Tukey (1991). Configural Polysampling. Wiley, New York.
Müller, C. (1995). Outlier robust inference for planned experiments. Habilitationsschrift, Fachbereich Mathematik und Informatik, Freie Universität Berlin.
Perez, A. G. (1993). On robustness for hypotheses testing. Internat. Statist. Rev. 61, 369-385.
Pregibon, D. (1981). Logistic regression diagnostics. Ann. Statist. 9, 705-724.
Pregibon, D. (1982). Score tests in GLIM with applications. In: R. Gilchrist, ed., GLIM 82: Proceedings of the International Conference on Generalized Linear Models, Springer-Verlag, New York.
Rao, C. R. (1973). Linear Statistical Inference and its Applications, 2nd ed., Wiley, New York.
Richardson, A. M. (1995). Some problems in estimation in mixed linear models. Ph.D. thesis, Australian National University, Canberra.
Rieder, H. (1978). A robust asymptotic testing model. Ann. Statist. 6, 1080-1094.
Ronchetti, E. (1979). Robustheitseigenschaften von Tests. Diploma thesis, ETH, Zürich, Switzerland.
Ronchetti, E. (1982a). Robust alternatives to the F-test for the linear model. In: W. Grossmann, C. Pflug and W. Wertz, eds., Probability and Statistical Inference, Reidel, Dordrecht, 329-342.

Ronchetti, E. (1982b). Robust testing in linear models: The infinitesimal approach. Ph.D. thesis, ETH, Zürich, Switzerland.
Ronchetti, E. and A. H. Welsh (1994). Empirical saddlepoint approximations for multivariate M-estimators. J. Roy. Statist. Soc. B 56, 313-326.
Rousseeuw, P. J. and A. M. Leroy (1987). Robust Regression and Outlier Detection. Wiley, New York.
Rousseeuw, P. J. and E. Ronchetti (1979). The influence curve for tests. Research Report 21, Fachgruppe für Statistik, ETH, Zürich.
Rousseeuw, P. J. and E. Ronchetti (1981). Influence curves for general statistics. J. Comput. Appl. Math. 7, 161-166.
Rubio, A. M. and J. A. Visek (1993). Discriminability of robust tests under heavy contamination. Kybernetika 29, 372-388.
Ruppert, D. (1992). Computing S estimators for regression and multivariate location/dispersion. J. Comput. Graph. Statist. 1, 253-270.
Schrader, R. M. and T. P. Hettmansperger (1980). Robust analysis of variance based upon a likelihood ratio criterion. Biometrika 67, 93-101.
Sen, P. K. (1982). On M-tests in linear models. Biometrika 69, 245-248.
Silvapulle, M. J. (1992). Robust Wald-type tests of one-sided hypothesis in the linear model. J. Amer. Statist. Assoc. 87, 156-161.
Silvey, S. D. (1979). Statistical Inference, 3rd ed., Chapman and Hall, London.
Simpson, D. G. (1989). Hellinger deviance tests: efficiency, breakdown points and examples. J. Amer. Statist. Assoc. 84, 107-113.
Simpson, D. G., D. Ruppert and R. J. Carroll (1992). On one-step GM-estimates and stability of inferences in linear regression. J. Amer. Statist. Assoc. 87, 439-449.
Solomon, H. and M. A. Stephens (1977). Distribution of a sum of weighted chi-square variables. J. Amer. Statist. Assoc. 72, 881-885.
Stahel, W. A. and P. Hartmann (1992). M-type tests in linear models: some comments on Studentizing and a small simulation. In: Y. Dodge, ed., L1-Statistical Analysis and Related Methods, North-Holland, 93-102.
Stahel, W. A. and S. Weisberg, eds. (1991). Directions in Robust Statistics and Diagnostics. Springer-Verlag, New York.
Staudte, R. G. and S. J. Sheather (1990). Robust Estimation and Testing. Wiley, New York.
Stefanski, L. A., R. J. Carroll and D. Ruppert (1986). Optimally bounded score functions for generalized linear models with application to logistic regression. Biometrika 73, 413-424.
Tingley, M. and C. A. Field (1990). Small-sample confidence intervals. J. Amer. Statist. Assoc. 85, 427-434.
Victoria-Feser, M. P. (1993). Robust methods for personal income distribution models. Ph.D. thesis, Faculty of Economic and Social Sciences, University of Geneva, Switzerland.
Victoria-Feser, M. P. and E. Ronchetti (1994). Robust methods for personal-income distribution models. Canad. J. Statist. 22, 247-258.
von Mises, R. (1947). On the asymptotic distribution of differentiable statistical functions. Ann. Math. Statist. 18, 309-348.
Welsh, A. H. (1985). An angular approach for linear data. Biometrika 72, 441-450.
Welsh, A. H., R. J. Carroll and D. Ruppert (1994). Fitting heteroscedastic regression models. J. Amer. Statist. Assoc. 89, 100-116.
Ylvisaker, D. (1977). Test resistance. J. Amer. Statist. Assoc. 72, 551-556.