Robust Box–Cox transformations based on minimum residual autocorrelation

Alfio Marazzi (a,*), Victor J. Yohai (b)

(a) Institute for Social and Preventive Medicine, University of Lausanne, Bugnon 17, CH 1005 Lausanne, Switzerland
(b) Departamento de Matematica, Universidad de Buenos Aires, Ciudad Universitaria, Pabellon 1, 1428 Buenos Aires, Argentina

Computational Statistics & Data Analysis 50 (2006) 2752–2768, www.elsevier.com/locate/csda
Received 25 August 2004; accepted 12 April 2005; available online 10 May 2005

Abstract

Response transformations are a popular approach to adapt data to a linear regression model. The regression coefficients, as well as the parameter defining the transformation, are often estimated by maximum likelihood assuming homoscedastic normal errors. Unfortunately, these estimates are consistent for the true parameters only if the assumptions of normality and homoscedasticity are satisfied; in addition, they are nonrobust in the presence of outliers. New estimates are proposed that are robust and consistent even if the assumptions of normality and homoscedasticity do not hold. These estimates are based on the minimization of a robust measure of residual autocorrelation.

MSC: Primary 62J05; Secondary 62F35
Keywords: Box–Cox transformation; Heteroscedasticity; Robust estimation

1. Introduction

Response transformations are a widely used tool to make data conform to a linear regression model. Sakia (1992) provides a review of transformations, the most common example

being the Box–Cox transformation. Transformation techniques are thoroughly discussed in Carroll and Ruppert (1988) and Atkinson and Riani (2000, Chapter 4). Often, the transformed response g(y, λ) is assumed to be linearly related to the covariates, with normally distributed errors of constant variance; the regression coefficients, as well as the parameter λ defining the transformation, are then estimated by maximum likelihood (ML) (Box and Cox, 1964; Bickel and Doksum, 1981; Hinkley and Runger, 1984). Unfortunately, it can be difficult to attain both normality and homoscedasticity with a single transformation. Moreover, the ML-estimate is not consistent under non-normal or heteroscedastic errors, and it is nonrobust in the presence of outliers.

Several diagnostic techniques for assessing the contribution of individual observations to the evidence for transformations have been developed (e.g., Atkinson and Riani, 2000; Cheng, 2005). In addition, semiparametric and nonparametric approaches that relax the parametric structure of the response distribution have been studied (Han, 1987; Foster et al., 2001). A few robust estimates have also been proposed: Carroll and Ruppert (1988) described bounded influence estimates based on the normal homoscedastic model, and Parker (1988) used minimum sum of absolute errors regression. In general, these nonparametric and robust estimates do not provide effective protection against heavy contamination and heteroscedasticity.

For the simple regression case, Marazzi and Yohai (2004) proposed a new nonparametric estimate of λ based on minimization (with respect to λ) of a robust measure of autocorrelation among the residuals from a robust fit of the transformed responses. To compute the autocorrelation, the residuals have to be ordered according to the values of the regressor. It was proved that this estimate and the corresponding coefficient estimates are consistent even if the assumptions of normality and homoscedasticity do not hold. The consistency proof was, however, based on rather strong assumptions about the behavior of the underlying robust coefficient estimates. Monte Carlo results showed that the new estimates outperform available methods (Carroll and Ruppert, 1988; Foster et al., 2001) when the errors are heteroscedastic and have long-tailed distributions.

In this paper we extend the method based on minimization of robust residual autocorrelation to the multiple regression case. We describe a computational algorithm for a particular implementation of the method, where S- and MM-estimates (Rousseeuw and Yohai, 1984; Yohai, 1987) are used to determine the robust fit. Monte Carlo results show that the performance (bias and mean squared error) of the new procedure for large sample sizes (200 or more) is very appealing in comparison with previously published methods. In addition, we provide a consistency result for the simple regression case based on a new set of assumptions, which are satisfied by the implementation based on S- and MM-estimates. This result was not available in Marazzi and Yohai (2004); a consistency proof for the multiple regression case is still outstanding. Several references to Marazzi and Yohai (2004) are made in order to minimize overlap. The consistency proof and the auxiliary results concerning S- and MM-estimates can be found in Marazzi and Yohai (2005). S-Plus programs can be found at the first author's website (www.iumsp.ch), together with the data set considered in the example of Section 8.


2. The model

We consider a random sample (x_1, y_1), ..., (x_n, y_n) of the random vector (x, y), where x is a vector of p explanatory variables and y is a positive response variable. They are assumed to be linked by the linear relationship

  g(y, λ0) = x^T β0 + q(x) u,                                            (1)

where g(y, λ0) denotes the response transformation, λ0 is a parameter that takes values in Λ ⊂ R, and β0 ∈ R^p is a parameter vector (the first component of β0 being an intercept term). In the case of the Box–Cox transformation we have

  g(y, λ) = (y^λ − 1)/λ  if λ ≠ 0,
  g(y, λ) = ln(y)        if λ = 0.                                       (2)

We assume that the error u is symmetrically distributed and independent of x, and that q(x) is an unknown scale function.
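For concreteness, here is a minimal sketch of the family (2) and of its inverse h (used in Section 3). The function names are ours; this is not the authors' S-Plus code.

```python
import numpy as np

def boxcox_g(y, lam):
    """Box-Cox transform g(y, lambda) of (2), for positive y."""
    y = np.asarray(y, dtype=float)
    return np.log(y) if lam == 0.0 else (y**lam - 1.0) / lam

def boxcox_inv(z, lam):
    """Inverse h = g(., lambda)^{-1}, defined where 1 + lambda*z > 0."""
    z = np.asarray(z, dtype=float)
    return np.exp(z) if lam == 0.0 else (1.0 + lam * z) ** (1.0 / lam)
```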

3. The robust residual autocorrelation estimate

In the first place, we consider a sample (r_1, v_1), ..., (r_n, v_n) of a random pair (r, v), where r is a centered variable. We can measure how strongly E(r|v) depends on v with the help of a robust measure of autocorrelation as follows. Suppose that all the values v_i, 1 ≤ i ≤ n, are distinct, sort them in ascending order, and let j_1, ..., j_n be the corresponding permuted indices, i.e., v_{j_1} < v_{j_2} < ... < v_{j_n}. Then, define the robust autocorrelation of r ordered according to v by

  ρ_{n,r,v} = (1/(n−1)) Σ_{i=1}^{n−1} ψ(r_{j_i}/S_n) ψ(r_{j_{i+1}}/S_n),

where ψ is an odd, continuous, nondecreasing, and bounded function, and S_n is a robust estimate of the scale of r. If E(r|v) is a nonconstant smooth function, we expect this coefficient to be positive: since v_{j_i} is generally close to v_{j_{i+1}}, the values of ψ(r_{j_i}/S_n) and ψ(r_{j_{i+1}}/S_n) tend to be similar, hence ρ_{n,r,v} > 0. Since ψ is bounded, this measure of autocorrelation is robust.

We now turn to the problem of estimating the transformation parameter λ0 in model (1). Let β_n be a consistent robust estimator of the coefficients of a multiple linear regression model with homoscedastic errors and S_n a measure of error scale; specific examples are considered in Section 5. Let β_n(λ) and S_n(λ) be the result of applying these estimators to the transformed sample (x_1, g(y_1, λ)), ..., (x_n, g(y_n, λ)), and let β(λ), S(λ) be their asymptotic values. Since (x_1, g(y_1, λ0)), ..., (x_n, g(y_n, λ0)) is a sample of a linear model with coefficient vector β0, we have β(λ0) = β0. We define the residuals by r(λ, β, x, y) = g(y, λ) − x^T β and denote by h the inverse of g(·, λ0).
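A direct transcription of ρ_{n,r,v} as a sketch: ψ is Huber's function (the choice later made in Section 6) and S_n is the normalized MAD, one admissible robust scale; both choices and all names are ours.

```python
import numpy as np

def huber_psi(z, c=1.34):
    """A bounded, odd, nondecreasing psi (Huber's function)."""
    return np.clip(z, -c, c)

def robust_autocorr(r, v, c=1.34):
    """rho_{n,r,v}: average of psi(r_{j_i}/S_n) * psi(r_{j_{i+1}}/S_n)."""
    r, v = np.asarray(r, float), np.asarray(v, float)
    order = np.argsort(v)                 # j_1, ..., j_n with v sorted ascending
    s_n = 1.4826 * np.median(np.abs(r))   # normalized MAD as robust scale of r
    z = huber_psi(r[order] / s_n, c)
    return np.mean(z[:-1] * z[1:])        # lag-1 products of psi-scores
```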


Suppose that μ is any vector in R^p and that all the values x_1^T μ, ..., x_n^T μ are distinct. In this case, we sort x_i^T μ, i = 1, ..., n, in ascending order and define j_1(μ), ..., j_n(μ) as the corresponding permuted indices, i.e., x_{j_1(μ)}^T μ < x_{j_2(μ)}^T μ < ... < x_{j_n(μ)}^T μ. Then, for given λ ∈ Λ, β ∈ R^p, μ ∈ R^p, s > 0, we define

  ρ_n(λ, β, s, μ) = (1/(n−1)) Σ_{i=1}^{n−1} ψ(r(λ, β, x_{j_i(μ)}, y_{j_i(μ)})/s) ψ(r(λ, β, x_{j_{i+1}(μ)}, y_{j_{i+1}(μ)})/s).   (3)

We observe that ρ_n(λ, β, S_n(λ), μ) is the robust autocorrelation estimate of r(λ, β, x, y) ordered according to μ^T x. In the case of ties among the values x_1^T μ, ..., x_n^T μ, one can modify (3) by arbitrarily permuting the tied values and computing the correlations by averaging over the permutations; details of this modification can be found in Marazzi and Yohai (2004). Finally, we define

  ρ*_n(λ) = sup_{γ∈Λ} ρ_n(λ, β_n(λ), S_n(λ), β_n(γ)).                    (4)
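In code, (3) only adds the ordering by the projections x_i^T μ. The sketch below restates the Huber ψ for self-containment and leaves the robust fit (β, s) to the caller; all names are assumptions of ours.

```python
import numpy as np

def huber_psi(z, c=1.34):
    return np.clip(z, -c, c)

def rho_n(lam, beta, s, mu, X, y, g, c=1.34):
    """rho_n(lambda, beta, s, mu) of (3); X carries a leading column of ones,
    g is the transformation, and ties among the x_i^T mu are not handled."""
    resid = g(y, lam) - X @ beta       # residuals r(lambda, beta, x_i, y_i)
    order = np.argsort(X @ mu)         # sort by the projections x_i^T mu
    z = huber_psi(resid[order] / s, c)
    return np.mean(z[:-1] * z[1:])
```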

Heuristic arguments given below show that

  lim_{n→∞} ρ*_n(λ0) = 0                                                 (5)

and that, for λ ≠ λ0,

  lim_{n→∞} ρ*_n(λ) > 0.                                                 (6)

This suggests defining the robust residual autocorrelation estimate (RAC-estimate) λ̂_n of λ0 by

  λ̂_n = arg min_{λ∈Λ} ρ*_n(λ).                                           (7)

The corresponding estimate of β0 is β_n(λ̂_n).

Heuristic justification: For (5) to hold, it is enough that

  lim_{n→∞} ρ_n(λ0, β_n(λ0), S_n(λ0), μ) = 0                             (8)

for all μ. Since lim_{n→∞} β_n(λ0) = β0 and lim_{n→∞} S_n(λ0) = s_0, for (8) to hold it is enough that

  lim_{n→∞} ρ_n(λ0, β0, s_0, μ) = 0.

This follows from

  E[ ψ(r(λ0, β0, x_{j_{i+1}(μ)}, y_{j_{i+1}(μ)})/s_0) ψ(r(λ0, β0, x_{j_i(μ)}, y_{j_i(μ)})/s_0) ] = 0






and, since r(λ0, β0, x_{j_i(μ)}, y_{j_i(μ)}) = q(x_{j_i(μ)}) u_{j_i(μ)}, this equality holds because, conditional on x_{j_i(μ)} and x_{j_{i+1}(μ)}, the variables u_{j_i(μ)} and u_{j_{i+1}(μ)} are independent, ψ is odd, and the distribution of u is symmetric. For (6) to hold, it is enough that

  lim_{n→∞} ρ_n(λ, β_n(λ), S_n(λ), β_n(λ0)) > 0                          (9)

for λ ≠ λ0, and for (9) to hold it is enough that

  lim_{n→∞} ρ_n(λ, β(λ), S(λ), β0) > 0.

This follows from

  E[ ψ(r(λ, β(λ), x_{j_i(β0)}, y_{j_i(β0)})/S(λ)) ψ(r(λ, β(λ), x_{j_{i+1}(β0)}, y_{j_{i+1}(β0)})/S(λ)) ] > 0

and, since r(λ, β, x, y) = g(h(x^T β0 + u), λ) − x^T β, this inequality follows from the fact that u_{j_i(β0)} and u_{j_{i+1}(β0)} are independent and that, for large n (and under appropriate assumptions), the values x_{j_i(β0)}^T β0 and x_{j_{i+1}(β0)}^T β0 are very close.

Remarks. (1) Model (1) is affine invariant and, in the case of the Box–Cox family of transformations, it is also scale and power invariant. It may be proved that the proposed estimates of λ0 and β0 are always affine equivariant and, for the case of the Box–Cox family of transformations, they are scale and power equivariant. Since model (1) is not regression invariant, the proposed estimates are not regression equivariant, although the MM-estimates have this property when no transformation is used.

(2) A variant of the RAC-estimate can be defined by (7) putting

  ρ*_n(λ) = sup_{μ∈R^p} ρ_n(λ, β_n(λ), S_n(λ), μ).                       (10)

The heuristic arguments given above still hold, and thus this estimate is consistent. However, its computation is much harder because (10) involves a supremum over R^p while (4) involves a supremum over a bounded subset of R.

(3) A different variant of the RAC-estimate was proposed in Marazzi and Yohai (2004). They defined the estimate of λ0 by (7), where ρ*_n(λ) is given by

  ρ*_n(λ) = ρ_n(λ, β_n(λ), S_n(λ), β_n(λ)).                              (11)

This estimate (denoted as the Rac-estimate in the following) is very simple because it avoids taking the supremum in the definition of ρ*_n(λ). The heuristic argument for (5) still holds with this definition, but the argument for (6) is no longer valid; the reason is that, in general, β(λ) ≠ β0 for λ ≠ λ0. This explains the poor performance of this variant in the Monte Carlo study presented in Section 7.

(4) In the simple regression case

  g(y, λ0) = β01 + β02 x + q(x) u,                                       (12)


we only need to sort the residuals according to the values of the x_i's. The permutation j_i, 1 ≤ i ≤ n, is then defined by x_{j_1} < x_{j_2} < ... < x_{j_n}, and for any λ ∈ Λ, β = (β1, β2)^T, and s > 0 we put

  ρ_n(λ, β, s) = (1/(n−1)) Σ_{i=1}^{n−1} ψ(r(λ, β, x_{j_i}, y_{j_i})/s) ψ(r(λ, β, x_{j_{i+1}}, y_{j_{i+1}})/s)   (13)

and

  ρ*_n(λ) = ρ_n(λ, β_n(λ), S_n(λ)).                                      (14)

The estimate λ̂_n is still defined by (7) and the coefficient estimate is β_n(λ̂_n); a code sketch of this simple regression case is given after these remarks.

(5) The definitions of ρ_{n,r,v} and ρ_n given above use lag 1 autocorrelation. One could think that taking autocorrelations at other lags into account would provide useful information; we therefore tried to add a term with lag 2 correlations to the definitions. A simulation study showed, however, that the improvement was negligible.
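The simple regression case admits a compact end-to-end sketch: for each λ on a grid, fit the transformed response robustly, then keep the λ whose residuals are least autocorrelated along x. For brevity, the robust fit below is a Theil–Sen line with a MAD residual scale, standing in for the S- and MM-estimates the paper actually uses; all names and the grid are ours.

```python
import numpy as np

def boxcox_g(y, lam):
    return np.log(y) if lam == 0.0 else (y**lam - 1.0) / lam

def theil_sen(x, z):
    """Robust stand-in fit: median of pairwise slopes, median intercept."""
    dx, dz = x[:, None] - x[None, :], z[:, None] - z[None, :]
    slope = np.median(dz[dx != 0] / dx[dx != 0])
    return np.median(z - slope * x), slope

def rac_simple(x, y, grid=np.linspace(0.05, 1.25, 101), c=1.34):
    """Minimize rho*_n(lambda) of (14) over a lambda-grid; returns lambda-hat."""
    order = np.argsort(x)                  # x_{j_1} < ... < x_{j_n}, fixed once
    best_rho, best_lam = np.inf, None
    for lam in grid:
        z = boxcox_g(y, lam)
        a, b = theil_sen(x, z)
        r = z - a - b * x                  # residuals r(lambda, beta_n(lambda), .)
        s = 1.4826 * np.median(np.abs(r))  # MAD scale S_n(lambda)
        p = np.clip(r[order] / s, -c, c)   # Huber psi-scores
        rho = np.mean(p[:-1] * p[1:])      # rho*_n(lambda), eqs. (13)-(14)
        if rho < best_rho:
            best_rho, best_lam = rho, lam
    return best_lam
```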

4. Consistency

Theorem 1 of this section provides a consistency result for the RAC-estimates in the simple regression model; the proof can be found in Marazzi and Yohai (2005). The following assumptions are required:

A1. The distribution H of x_i in model (12) is continuous.
A2. The distribution F of u_i is continuous and symmetric.
A3. For each λ, the function g(y, λ) is continuous and strictly monotone with respect to y, and Λ ⊂ R is compact.
A4. q is continuous.
A5. ψ is continuous, odd, monotone nondecreasing, and bounded.
A6. The estimates β_n(λ) and S_n(λ) have the following properties:
(i) Given ε > 0, there exists δ > 0 such that
  lim sup_{n→∞} sup_{|λ−λ0|≤δ} ||β_n(λ) − β0|| ≤ ε  a.s.,
where ||·|| denotes the Euclidean norm.
(ii) There exists K > 0 such that
  lim sup_{n→∞} sup_{λ∈Λ} ||β_n(λ)|| ≤ K  a.s..
(iii) There exists s_2 > 0 such that
  lim sup_{n→∞} sup_{λ∈Λ} S_n(λ) ≤ s_2  a.s..
(iv) There exists s_1 > 0 such that
  lim inf_{n→∞} inf_{λ∈Λ} S_n(λ) ≥ s_1  a.s..
A7. Robust identifiability condition. In model (12), let

  d(λ, β, s, x) = E[ ψ(r(λ, β, x, y)/s) | x ].                           (15)

Then, for any λ ≠ λ0, β ∈ R^p, and s > 0, P(d(λ, β, s, x) ≠ 0) > 0.

Theorem 1. Let (x_1, y_1), ..., (x_n, y_n) be a random sample of model (12) and assume A1–A7. Then, (i) λ̂_n →_P λ0 and (ii) β_n(λ̂_n) →_P β0, where →_P denotes convergence in probability.

Remarks. (1) As noted in Marazzi and Yohai (2004), assumption A7 can be interpreted as an identifiability condition for model (12). In fact, suppose that A7 does not hold. Then there exist values λ1 ≠ λ0, β11, β12, and s such that, with probability one, g(y, λ1) = β11 + β12 x + u, where u satisfies the condition E(ψ(u/s)|x) = 0; therefore the parameter λ is not identifiable. When q(x) = 1 and β02 = 0, A7 does not hold, but in this case λ0 is clearly not identifiable.

(2) For the transformation family g(y, λ) = y^λ (which is equivalent to the Box–Cox family when λ > 0) we can give the following sufficient set of conditions for A7 to hold:
(i) ψ'(x) > 0 for all x.
(ii) If β02 > 0, the distribution of x is unbounded to the right; if β02 < 0, the distribution of x is unbounded to the left.
(iii) There exists M > 0 such that u > −M.
(iv) q(x) = 1.
To prove this, take λ ≠ λ0 and any β = (β1, β2)^T. Then we have

  d(λ, β, s, x) = E[ ψ( ((β01 + β02 x + u)^γ − β1 − β2 x)/s ) | x ],

with γ = λ/λ0, and

  ∂d(λ, β, s, x)/∂x = E[ (1/s) ψ'( ((β01 + β02 x + u)^γ − β1 − β2 x)/s ) ( γ β02 (β01 + β02 x + u)^{γ−1} − β2 ) | x ].

Suppose that β02 > 0. Then, if γ > 1 and x is large enough, ∂d(λ, β, s, x)/∂x > 0; if γ < 1 and x is large enough, ∂d(λ, β, s, x)/∂x < 0. Therefore, it is not possible to have d(λ, β, s, x) = 0 for all x. A similar argument can be used when β02 < 0. Note, however, that conditions (i)–(iv)


are much stronger than what is strictly required for the validity of A7. In addition, observe that (iii) is a necessary condition for g(y, λ) = y^λ to be defined for all values of y.

(3) The assumption that the error distribution is symmetric could be removed by modifying the residuals in (3). More precisely, define r_loc by

  Σ_{i=1}^{n} ψ( (r(λ, β, x_i, y_i) − r_loc)/s ) = 0

and replace r(λ, β, x_{j_i(μ)}, y_{j_i(μ)}) by the centred residuals r(λ, β, x_{j_i(μ)}, y_{j_i(μ)}) − r_loc in (3). However, numerical experiments suggest that the convergence of the correspondingly modified estimate of λ can be slow.

5. S- and MM-estimates

We consider a sample u = (u_1, u_2, ..., u_n) of size n of a univariate distribution. Huber (1964) defines the M-estimate of scale as a solution s_n(u) of the equation

  (1/n) Σ_{i=1}^{n} ρ1(u_i/s_n(u)) = b,                                  (16)

where b is a given positive real number and ρ1 a given function R → R+. In the following, we require that ρ1 satisfies the following properties:

A8. (i) ρ1(0) = 0; (ii) ρ1 is even; (iii) if |u| < |v|, then ρ1(u) ≤ ρ1(v); (iv) ρ1 is bounded; (v) ρ1 is continuous at 0.

We now consider a sample Z = {(x_1, z_1), ..., (x_n, z_n)} of a regression model z = x^T β + u. Rousseeuw and Yohai (1984) define the S-estimate of β by

  β̃_n = arg min_β s_n(r(β)),                                             (17)

where r(β) = (r_1(β), ..., r_n(β))^T and r_i(β) = z_i − x_i^T β. An associated S-estimate of the error scale is then given by

  S̃_n = s_n(r(β̃_n)).                                                     (18)

Rousseeuw and Yohai (1984) show that the asymptotic breakdown point of the S-estimates is ε* = min(b/a, 1 − b/a), where a = max ρ1. Therefore, if b = a/2 we have ε* = 1/2. Unfortunately, the S-estimate of β cannot simultaneously attain a breakdown point of 0.5 and a high efficiency under normal errors (Hossjer, 1992). In order to obtain both properties, Yohai (1987) proposed the MM-estimate of β defined by

  β̂_n = arg min_β Σ_{i=1}^{n} ρ2(r_i(β)/S̃_n),                            (19)


where S̃_n is the error scale estimate (18) and ρ2 is a second function satisfying A8 and such that:

A9. ρ2(u) ≤ ρ1(u) for all u.

The MM-estimate β̂_n has the same breakdown point as the S-estimate β̃_n. In addition, by conveniently choosing ρ2, its asymptotic efficiency under normal errors can be made as close as desired to 100%. We now define β̃_n(λ), S̃_n(λ), and β̂_n(λ) as the S-estimate of β, the associated S-estimate of scale, and the MM-estimate of β when z_i = g(y_i, λ) and y_i is distributed according to (1). Sufficient conditions for these estimators to satisfy assumptions A6(i)–(iv) of Section 4 can be found in Marazzi and Yohai (2005). A sketch of these building blocks is given below.
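As an illustration of these definitions (not the authors' resampling-based implementation), the sketch below computes the biweight M-scale of (16) by a standard fixed-point iteration and performs one reweighted least-squares step for the MM-problem (19) at fixed scale; names and the damping scheme are ours.

```python
import numpy as np

def rho_biweight(z, h):
    """Tukey biweight rho of Section 6: rises from 0 to 1, flat beyond h."""
    u = np.minimum(np.abs(z) / h, 1.0)
    return 3*u**2 - 3*u**4 + u**6

def m_scale(r, h=1.548, b=0.5, n_iter=50):
    """Solve (1/n) sum rho1(r_i/s) = b of (16) by fixed-point iteration."""
    s = 1.4826 * np.median(np.abs(r)) + 1e-12   # robust starting value
    for _ in range(n_iter):
        s *= np.sqrt(np.mean(rho_biweight(r / s, h)) / b)
    return s

def irwls_step(X, z, beta, s, h=4.687):
    """One reweighted LS step for (19) at fixed scale s; the biweight weights
    are w(u) proportional to rho2'(u)/u, i.e. (1 - (u/h)^2)^2 on [-h, h]."""
    u = (z - X @ beta) / s
    w = np.where(np.abs(u) <= h, (1.0 - (u / h)**2)**2, 0.0)
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(X * sw[:, None], z * sw, rcond=None)
    return coef
```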

6. Computational aspects

In this section, we describe a particular implementation of the RAC-estimate based on conventional versions of S- and MM-estimates, used in the Monte Carlo study described in Section 7. The functions ρ1 and ρ2 are taken from Tukey's biweight family

  ρ(z, h) = 3(z/h)^2 − 3(z/h)^4 + (z/h)^6  if |z| ≤ h,
  ρ(z, h) = 1                              if |z| > h,

where h is a given constant. To define the S-estimate we take ρ1(z) = ρ(z, 1.548) and b = 0.5 in (16). This choice makes the asymptotic breakdown point of the S-estimates β̃_n and S̃_n equal to 0.5. In addition, we take ρ2(z) = ρ(z, 4.687) in (19). With this choice, the MM-estimate β̂_n attains an asymptotic efficiency of 0.95 under normal errors while preserving the breakdown point of 0.5. In order to define the robust autocorrelation function we use Huber's function ψ(z, c) = max(−c, min(c, z)) with tuning constant c = 1.34.

We obtain λ̂_n by minimizing ρ*_n(λ) over an equally spaced grid L = {λ_1, ..., λ_k} of λ-values using the following algorithm; a code skeleton is given after Step 6.

Step 1: For each λ ∈ L, compute and store (approximations of) the S-estimates β̃_n(λ) and S̃_n(λ). This is the most computer-intensive step because it requires a resampling scheme (e.g., Marazzi, 1992) for each λ ∈ L. An efficient procedure computes the projection matrix of each subsample just once (e.g., using Householder transformations); the projection matrix is then simultaneously applied to the k vectors of transformed responses g(y, λ_1), ..., g(y, λ_k), providing k coefficient estimates (one for each λ_j). Keeping track of the k best current coefficient estimates over the successive subsamples yields k parallel resampling schemes. Eventually, the procedure provides resampling approximations of β̃_n(λ) and S̃_n(λ) for each λ ∈ L.

Step 2: For each λ ∈ L, compute and store the MM-estimate β̂_n(λ). Here, we use an iteratively reweighted least squares algorithm with fixed scale S̃_n(λ) (e.g., the W-algorithm in Marazzi, 1992) starting at β̃_n(λ).

Step 3: For each λ ∈ L, sort x_i^T β̂_n(λ), i = 1, ..., n, in ascending order and store the corresponding permuted indices j_1, ..., j_n, which depend on λ.

Step 4: For each pair (λ, γ) ∈ L × L, compute ρ_n(λ, β̂_n(λ), S̃_n(λ), β̂_n(γ)) according to (3) and store these values.

Step 5: For each λ ∈ L, compute and store

  ρ*_n(λ) = sup_{γ∈L} ρ_n(λ, β̂_n(λ), S̃_n(λ), β̂_n(γ)).

Step 6: Compute the approximation λ*_n = arg min_{λ∈L} ρ*_n(λ) of the RAC-estimate λ̂_n.
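The following skeleton mirrors Steps 1–6 under our naming: robust fits are precomputed once per grid point (Steps 1–2, delegated here to a user-supplied robust_fit(X, z) -> (beta, scale), e.g. the S/MM machinery sketched above), fitted values are sorted (Step 3), the matrix of autocorrelations over pairs (λ, γ) is filled (Step 4), and its row-wise suprema are minimized (Steps 5–6).

```python
import numpy as np

def boxcox_g(y, lam):
    return np.log(y) if lam == 0.0 else (y**lam - 1.0) / lam

def rac_multiple(X, y, grid, robust_fit, c=1.34):
    """Grid approximation of the RAC-estimate; robust_fit(X, z) -> (beta, scale)."""
    fits = [robust_fit(X, boxcox_g(y, lam)) for lam in grid]   # Steps 1-2
    orders = [np.argsort(X @ beta) for beta, _ in fits]        # Step 3
    k = len(grid)
    rho = np.empty((k, k))
    for i, lam in enumerate(grid):                             # Step 4
        beta, scale = fits[i]
        r = boxcox_g(y, lam) - X @ beta                        # residuals at lambda_i
        for j in range(k):                                     # order by x^T beta(lambda_j)
            p = np.clip(r[orders[j]] / scale, -c, c)           # Huber psi-scores
            rho[i, j] = np.mean(p[:-1] * p[1:])
    rho_star = rho.max(axis=1)                                 # Step 5: sup over the grid
    i_best = int(np.argmin(rho_star))                          # Step 6: argmin of rho*_n
    return grid[i_best], fits[i_best][0]
```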

7. Monte Carlo results

Simulation results for simple regression have been reported in Marazzi and Yohai (2004); here, we report simulation studies for multiple regression. We compare the following estimators: RAC is the estimator defined in Section 3, and Rac is the robust residual autocorrelation estimator for multiple regression initially suggested in Marazzi and Yohai (2004; see Section 3, Remarks, (3)). F is the estimator defined in Foster et al. (2001), and RF is a modification of F that uses the MM-estimates described in Section 6 in place of the least squares estimates. BI denotes the bounded influence estimator as defined in Carroll and Ruppert (1988, p. 186, with tuning constant a = 1.5 and based on 10 iterations). ML denotes the maximum likelihood estimator.

Observations (x_i, y_i) were generated according to the model

  y_i^{λ0} = x_i^T β0 + q(x_i) u_i,   i = 1, ..., n,

where x_i = (x_{i0}, x_{i1}, ..., x_{ip})^T denotes a vector of p + 1 covariates; the first component equals 1 and the others are independently drawn from uniform distributions on [0.20, 20]. The errors u_i were i.i.d., with u_i = t e_i, where e_i was normal N(0, 1) or contaminated normal 0.9 N(0, 1) + 0.1 N(0, 25), and the factor t was chosen so that MAD(u_i) = 1. The parameter vector was β0 = (10, 1, ..., 1)^T and λ0 = 0.5. Two options for q(x) were used: q(x) ≡ 1 (homoscedastic case) and q(x) = (x_1 + ... + x_p)/(6p) (heteroscedastic case). The simulation parameters (in particular β01) were chosen in order to have positive responses, clear identifiability, and perceptible homo-/heteroscedasticity. The choice of λ0 was not relevant because of the equivariance of λ̂_n, i.e., λ̂_n(y_1^a, ..., y_n^a) = λ̂_n(y_1, ..., y_n)/a. Each experiment was based on 1000 samples; a sketch of this design is given below.

The results for n = 50, 100, 200 are reported in Tables 1–3 and include the average bias of the simulated estimates λ̂ of λ0 (bias = 1000 × mean(λ̂ − λ0)) and the root mean squared error (√mse = 1000 × (mean((λ̂ − λ0)²))^{1/2}).
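A sketch of one simulated data set, under our reading of the garbled source (in particular, that the transformed scale is y^{λ0}): the exact factor t is replaced by a sample-MAD normalization, and all names are ours.

```python
import numpy as np

def simulate(n, p, contaminated=False, heteroscedastic=False, seed=None):
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones(n), rng.uniform(0.20, 20.0, size=(n, p))])
    e = rng.standard_normal(n)
    if contaminated:                              # 0.9 N(0,1) + 0.1 N(0,25)
        swap = rng.random(n) < 0.1
        e = np.where(swap, 5.0 * rng.standard_normal(n), e)
    u = e / np.median(np.abs(e))                  # stand-in for t: sample MAD(u) = 1
    beta0 = np.concatenate(([10.0], np.ones(p)))  # beta_0 = (10, 1, ..., 1)^T
    q = X[:, 1:].sum(axis=1) / (6 * p) if heteroscedastic else 1.0
    z = X @ beta0 + q * u                         # y_i^{lambda_0} with lambda_0 = 0.5
    return X, z ** 2.0                            # z > 0 w.h.p. here; invert y^{0.5}
```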

Table 1. Simulation results: n = 50.

Homoscedastic case
            RAC   Rac     F    RF    BI    ML
G4  bias     −1     2     0     0     0     0
    √mse    108   177    84    92    74    68
C4  bias     −6    −5    −1     0     0    −3
    √mse    119   191    97   102    84   123
G8  bias     −1    −1    −1    −5    −1    −3
    √mse    125   189    78   108    67    67
C8  bias      2     5     5     1     2     1
    √mse    135   213    96   116    81   109

Heteroscedastic case
            RAC   Rac     F    RF    BI    ML
G4  bias      0    19   −35   −31   −88  −102
    √mse    133   213   144   152   143   148
C4  bias     −9    19   −46   −27   −97  −120
    √mse    146   225   165   167   159   231
G8  bias     −2    −5   −19   −16   −43   −48
    √mse    154   217   129   164   118   118
C8  bias      4    18   −30   −13   −52   −77
    √mse    159   233   158   173   139   199

G4: Gauss, p = 4; C4: contam. Gauss, p = 4; G8: Gauss, p = 8; C8: contam. Gauss, p = 8.


Table 2. Simulation results: n = 100.

Homoscedastic case
            RAC   Rac     F    RF    BI    ML
G4  bias     −1    −2    −1    −1    −3    −1
    √mse     79    57    56    57    49    43
C4  bias      0     0     1     1     0    −3
    √mse     87    60    62    61    52    89
G8  bias      4     2     1     0     3     2
    √mse     83    56    50    54    43    41
C8  bias      0    −1     3     1     2     1
    √mse     90    61    59    58    45    76

Heteroscedastic case
            RAC   Rac     F    RF    BI    ML
G4  bias     −5     8   −31   −28   −88  −101
    √mse     96   145    99    98   116   122
C4  bias     −1    15   −36   −22   −90  −134
    √mse     99   160   110   101   122   210
G8  bias      1     5   −16   −14   −36   −45
    √mse    100   151    85    91    80    82
C8  bias     −3     8   −24   −12   −44   −87
    √mse    102   159   100    97    88   165

G4: Gauss, p = 4; C4: contam. Gauss, p = 4; G8: Gauss, p = 8; C8: contam. Gauss, p = 8.


Table 3. Simulation results: n = 200.

Homoscedastic case
            RAC   Rac     F    RF    BI    ML
G4  bias     −1    −3     0     0    −2    −1
    √mse     35    59    39    39    33    30
C4  bias     −1    −3    −1    −1    −3     1
    √mse     39    62    43    42    34    66
G8  bias     −1    −3     0     0    −2    −1
    √mse     33    53    36    37    29    27
C8  bias      0    −2     3     3     2     0
    √mse     37    59    40    39    30    54

Heteroscedastic case
            RAC   Rac     F    RF    BI    ML
G4  bias      0     5   −28   −26   −86  −101
    √mse     56   110    72    69   100   112
C4  bias     −2     8   −32   −27   −89  −137
    √mse     62   114    79    73   104   184
G8  bias     −2    −1   −16   −13   −40   −50
    √mse     55   101    62    62    60    67
C8  bias      1     1   −17   −11   −44   −98
    √mse     61   110    69    68    68   143

G4: Gauss, p = 4; C4: contam. Gauss, p = 4; G8: Gauss, p = 8; C8: contam. Gauss, p = 8.


Table 4. Computing times (seconds).

              p = 4                       p = 8
            n = 100  n = 200  n = 300   n = 100  n = 200  n = 300
Step 1         2.92     5.68     8.63      5.72    11.00    16.28
Step 2         0.08     0.14     0.21      0.19     0.30     0.44
Steps 3–6      0.44     0.86     1.44      0.47     0.94     1.47

As expected, for homoscedastic Gaussian errors, ML has no bias and the lowest mean squared error; however, for contaminated data, the robust estimates (except Rac for n = 50) outperform ML. In the homoscedastic cases, the behaviour of all estimates is satisfactory with regard to bias; with regard to mean squared error, BI seems to be the best (close to ML for Gaussian errors and stable under contamination). The performance of RAC depends on the sample size: for n = 50, its mean squared error is rather large; however, with increasing sample size it approaches the mean squared error of ML for Gaussian data, whereas for contaminated data it comes close to the mean squared error of BI. In the heteroscedastic case, the bias of ML, BI, F, and RF clearly becomes the dominant component of the mean squared error, and all estimates are outperformed by RAC, especially for sample sizes of 200 or more. (Note: it can be shown that both F and BI are asymptotically biased when the errors are heteroscedastic.) Finally, we notice that RAC works much better than Rac; an explanation is given in Section 3, Remarks, (3).

We used FORTRAN programs loaded into S-Plus to optimize the robust residual autocorrelation on a grid of 101 equally spaced points between 0.05 and 1.25, following the algorithm described in Section 6. Some computing times (for a single estimation problem) on a 2.53 GHz Pentium 4 are reported in Table 4.

8. Example

We consider a sample of 78 patients hospitalized in a Swiss hospital during 2000 for "valve cardiac surgery with major complications". We study the relationship between the cost of stay (Cost, in Swiss francs) and two explanatory variables that are available in administrative files: length of stay (LOS, in days) and admission type (0 = planned, 1 = emergency). This relationship is often used as a basis to determine reimbursement rates. We use the abbreviations y = Cost/1000, x_1 = log(LOS), x_2 = admission type. The log(LOS)/Cost plot in Fig. 1, panel (a), suggests that the costs of two cases were exceptionally high; however, model (1)–(2) with x^T = (1, x_1, x_2) and β0^T = (β01, β02, β03) seems a reasonable description for the majority of these data. The ML-estimates of λ0 and β0 are λ̄_n = −0.33 and β̄_n(λ̄_n) = (1.61, 0.19, 0.02)^T, whereas the RAC-estimates are λ̂_n = 0.65 and β̂_n(λ̂_n) = (−4.94, 7.82, 1.03)^T. Since several stays have the same length, the autocorrelation function has been computed using the modification of (3) for tied observations described in Marazzi and Yohai (2004). Fig. 1, panel (b), displays the function ρ*_n(λ), which is minimized at λ = 0.65. Fig. 2, panel (a), displays the data on the transformed response scale for λ = λ̄_n together with the estimated ML prediction lines; Fig. 2, panel (b), is a normal probability plot of the ML residuals. The corresponding plots for the RAC-estimate are given in Fig. 3, panels (a) and (b). No strong outliers are visible in Fig. 2; on the contrary, two exceptional cases clearly show up with very large residuals in Fig. 3. Removing these cases from the data, the ML-estimate of λ0 becomes λ̄_n = 0.32, and the classical 95% confidence interval for λ0 based on the likelihood ratio test and the reduced data set is (−0.13, 0.76).



Fig. 1. Panel (a): data log(LOS) and Cost/1000 of 78 hospital patients; circles denote planned admissions, bullets emergency admissions. Panel (b): the function ρ*_n(λ), minimized at λ = 0.65.


Fig. 2. Panel (a): data on the transformed response scale for λ = −0.33 (ML-estimate) and fitted lines for planned/emergency admissions. Panel (b): normal probability plot of the ML residuals.

This interval contains the RAC-estimate but not the ML-estimate based on the full data set! Finally, in order to investigate which estimate better succeeds in transforming the data, we provide inverse scatter plots according to Cook and Weisberg (1994). Fig. 4, panel (a), is the plot of the fitted values ȳ_i = β̄_n(λ̄_n)^T x_i (vertical axis) against y_i. A transformation can be estimated by fitting a curve to this graph. For example, the solid curve shown on the diagram was obtained by computing an ordinary least squares regression of ȳ_i on g(y_i, −0.33) and then displaying the fitted values from this regression as a curve on the inverse response plot of (y_i, ȳ_i). The broken line was obtained with the help of a nonparametric smoother (S-Plus lowess) based on the reduced data set. The two curves disagree, casting doubt on the transformation g(y, −0.33). The corresponding plots for the RAC-estimate, i.e., the plot of ŷ_i = β̂_n(λ̂_n)^T x_i against y_i for λ̂_n = 0.65, are given in Fig. 4, panel (b). A sketch of this diagnostic follows.
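Below is a sketch of the diagnostic, with statsmodels' lowess standing in for the S-Plus smoother; function and variable names are ours.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def inverse_plot_curves(y, fitted, lam):
    """Parametric and nonparametric curves for the inverse plot of (y_i, fitted_i)."""
    g = np.log(y) if lam == 0.0 else (y**lam - 1.0) / lam
    A = np.column_stack([np.ones_like(g), g])
    coef, *_ = np.linalg.lstsq(A, fitted, rcond=None)  # LS regression of fitted on g(y, lam)
    order = np.argsort(y)
    parametric = (A @ coef)[order]                     # solid curve over sorted y
    smooth = lowess(fitted, y, frac=2/3)               # broken line: (y, smooth) pairs
    return y[order], parametric, smooth
```

If the two curves agree, the data support the transformation g(y, λ), which is the comparison made in Fig. 4.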



Fig. 3. Panel (a): data on the transformed response scale for λ = 0.65 (RAC-estimate) and fitted lines for planned/emergency admissions. Panel (b): normal probability plot of the RAC residuals.

Fig. 4. Panel (a): inverse scatter plot of (y_i, ȳ_i) with ȳ_i = β̄_n(−0.33)^T x_i (ML). Panel (b): inverse scatter plot of (y_i, ŷ_i) with ŷ_i = β̂_n(0.65)^T x_i (RAC).

Here, the parametric curve (an MM-regression of ŷ_i on g(y_i, 0.65)) and the nonparametric curve (S-Plus lowess) are very close, confirming that g(y, 0.65) is a sound transformation. For the sake of simplicity, the user may obviously prefer the value λ = 0.5 in place of λ = 0.65; in this example, the two values of λ provide almost the same results.

Remarks. (1) A robust confidence interval for λ could be based on a bootstrap simulation of the RAC-estimate. However, the standard bootstrap simulation of a robust estimate does not provide robust confidence intervals (Salibian-Barrera and Zamar, 2002). The development of a more sophisticated robust bootstrap algorithm for the RAC-estimate is necessary.

(2) Despite assumption A1 in Section 4, the RAC procedure for multiple regression does not require the distribution of x to be continuous. In particular, dummy explanatory variables, such as the admission type in the example, can be used. The procedure works as soon as the ordered fitted values are sufficiently close to provide large residual autocorrelations for bad values of λ. This may happen even when only one explanatory variable is continuous.

Acknowledgements

This work was completed with the support of Grant 2053-066895.01 from the Swiss National Science Foundation, Grant PICT-99 03-06277 from the Agencia Nacional de Promoción de la Ciencia y la Tecnología, Grant X-094 from the Universidad de Buenos Aires, Argentina, and a grant from Fundación Antorchas, Argentina. The authors are grateful to Danielle Meylan and Alex Randriamiharisoa for their help in programming. They also thank the referees for the numerous questions and remarks that helped clarify the paper.

References

Atkinson, A., Riani, M., 2000. Robust Diagnostic Regression Analysis. Springer, New York.
Bickel, P.J., Doksum, K.A., 1981. An analysis of transformations revisited. J. Amer. Statist. Assoc. 76, 296–311.
Box, G.E.P., Cox, D.R., 1964. An analysis of transformations. J. Roy. Statist. Soc. Ser. B 26, 211–252.
Carroll, R.J., Ruppert, D., 1988. Transformation and Weighting in Regression. Chapman & Hall, New York.
Cheng, T.-C., 2005. Robust regression diagnostics with data transformations. Comput. Statist. Data Anal., in press.
Cook, R.D., Weisberg, S., 1994. Transforming a response variable for linearity. Biometrika 81 (4), 731–737.
Foster, A.M., Tian, L., Wei, L.W., 2001. Estimation for the Box–Cox transformation model without assuming parametric error distribution. J. Amer. Statist. Assoc. 96, 1097–1101.
Han, A.K., 1987. A non-parametric analysis of transformations. J. Econometrics 35, 191–209.
Hinkley, D.V., Runger, G., 1984. Analysis of transformed data. J. Amer. Statist. Assoc. 79, 302–309.
Hossjer, O., 1992. On the optimality of S-estimators. Statist. Probab. Lett. 14, 413–419.
Huber, P.J., 1964. Robust estimation of a location parameter. Ann. Math. Statist. 35, 73–101.
Marazzi, A., 1992. Algorithms, Subroutines, and S-Functions for Robust Statistics. Wadsworth & Brooks/Cole, Pacific Grove, CA.
Marazzi, A., Yohai, V., 2004. Robust Box–Cox transformations for simple regression. In: Hubert, M., Pison, G., Struyf, A., Van Aelst, S. (Eds.), Theory and Applications of Recent Robust Methods. Statistics for Industry and Technology, pp. 173–182.
Marazzi, A., Yohai, V., 2005. Consistency of the robust residual autocorrelation estimate of a transformation parameter. Technical Report, available at www.iumsp.ch.
Parker, I., 1988. Transformations and influential observations in minimum sum of absolute errors regression. Technometrics 30, 215–220.
Rousseeuw, P., Yohai, V.J., 1984. Robust regression by means of S-estimators. In: Franke, J., Härdle, W., Martin, D. (Eds.), Robust and Nonlinear Time Series Analysis. Lecture Notes in Statistics, vol. 26. Springer, Berlin, pp. 256–272.
Sakia, R.M., 1992. The Box–Cox transformation technique: a review. Statistician 41, 169–178.
Salibian-Barrera, M., Zamar, R.H., 2002. Bootstrapping robust estimates of regression. Ann. Statist. 30, 556–582.
Yohai, V.J., 1987. High breakdown-point and high efficiency robust estimates for regression. Ann. Statist. 15, 642–656.