Computational Statistics & Data Analysis 50 (2006) 2752–2768

Robust Box–Cox transformations based on minimum residual autocorrelation

Alfio Marazzi (Institute for Social and Preventive Medicine, University of Lausanne, Bugnon 17, CH 1005 Lausanne, Switzerland)
Victor J. Yohai (Departamento de Matematica, Universidad de Buenos Aires, Ciudad Universitaria, Pabellon 1, 1428 Buenos Aires, Argentina)

Received 25 August 2004; accepted 12 April 2005; available online 10 May 2005
Abstract

Response transformations are a popular approach to adapt data to a linear regression model. The regression coefficients, as well as the parameter defining the transformation, are often estimated by maximum likelihood assuming homoscedastic normal errors. Unfortunately, these estimates are consistent for the true parameters only if the assumptions of normality and homoscedasticity are satisfied; in addition, they are nonrobust in the presence of outliers. New estimates are proposed that are robust and consistent even if the assumptions of normality and homoscedasticity do not hold. These estimates are based on the minimization of a robust measure of residual autocorrelation.

MSC: Primary 62J05; Secondary 62F35

Keywords: Box–Cox transformation; Heteroscedasticity; Robust estimation
1. Introduction

Response transformations are a widely used tool to make data conform to a linear regression model. Sakia (1992) provides a review of transformations, the most common example
being the Box–Cox transformation. Transformation techniques are thoroughly discussed in Carroll and Ruppert (1988) and Atkinson and Riani (2000, Chapter 4). Often, the transformed response g(y, λ) is assumed to be linearly related to the covariates, with normally distributed errors of constant variance; the regression coefficients, as well as the parameter λ defining the transformation, are then estimated by maximum likelihood (ML) (Box and Cox, 1964; Bickel and Doksum, 1981; Hinkley and Runger, 1984). Unfortunately, it can be difficult to attain both normality and homoscedasticity with a single transformation. Moreover, the ML-estimates are not consistent under non-normal or heteroscedastic errors, and they are nonrobust in the presence of outliers. Several diagnostic techniques for assessing the contribution of individual observations to the evidence for transformations have been developed (e.g., Atkinson and Riani, 2000; Cheng, 2005). In addition, semiparametric and nonparametric approaches that relax the parametric structure of the response distribution have been studied (Han, 1987; Foster et al., 2001). A few robust estimates have also been proposed: Carroll and Ruppert (1988) described bounded influence estimates based on the normal homoscedastic model, and Parker (1988) used minimum sum of absolute errors regression. In general, these nonparametric and robust estimates do not provide effective protection against heavy contamination and heteroscedasticity.

For the simple regression case, Marazzi and Yohai (2004) proposed a new nonparametric estimate of λ₀ based on the minimization (with respect to λ) of a robust measure of autocorrelation among the residuals from a robust fit of the transformed responses. To compute the autocorrelation, the residuals have to be ordered according to the values of the regressor. It was proved that this estimate, as well as the corresponding coefficient estimates, is consistent even if the assumptions of normality and homoscedasticity do not hold. The consistency proof was, however, based on rather strong assumptions about the behavior of the underlying robust coefficient estimates. Monte Carlo results showed that the new estimates outperform available methods (Carroll and Ruppert, 1988; Foster et al., 2001) when the errors are heteroscedastic and have long-tailed distributions.

In this paper we extend the method based on minimization of robust residual autocorrelation to the multiple regression case. We describe a computational algorithm for a particular implementation of the method, where S- and MM-estimates (Rousseeuw and Yohai, 1984; Yohai, 1987) are used to determine the robust fit. Monte Carlo results show that the performance (bias and mean squared error) of the new procedure for large sample sizes (larger than or equal to 200) is very appealing in comparison with previously published methods. In addition, we provide a consistency result for the simple regression case based on a new set of assumptions, which are satisfied for the implementation based on S- and MM-estimates. This result was not available in Marazzi and Yohai (2004); a consistency proof for the multiple regression case is still outstanding. Several references to Marazzi and Yohai (2004) are made in order to minimize overlap. The consistency proof and the auxiliary results concerning S- and MM-estimates can be found in Marazzi and Yohai (2005). S-Plus programs, together with the data set considered in the example of Section 8, can be found at the first author's website (www.iumsp.ch).
2. The model

We consider a random sample (x₁, y₁), ..., (x_n, y_n) of the random vector (x, y), where x is a vector of p explanatory variables and y is a positive response variable. They are assumed to be linked by the linear relationship

g(y, λ₀) = xᵀβ₀ + q(x)u,   (1)

where g(y, λ₀) denotes the response transformation, λ₀ is a parameter that takes values in Λ ⊂ R, and β₀ ∈ R^p is a parameter vector (the first component of β₀ being an intercept term). In the case of the Box–Cox transformation we have

g(y, λ) = (y^λ − 1)/λ if λ ≠ 0,  g(y, λ) = ln(y) if λ = 0.   (2)

We assume that the error u is symmetrically distributed and independent of x, and that q(x) is an unknown scale function.
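For concreteness, here is a minimal Python sketch of the transformation (2); it is an illustration added for this presentation, not part of the original implementation.

```python
import numpy as np

def box_cox(y, lam):
    """Box-Cox transformation g(y, lambda) of a positive response, eq. (2)."""
    y = np.asarray(y, dtype=float)
    if lam == 0.0:
        return np.log(y)
    return (y ** lam - 1.0) / lam
```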
3. The robust residual autocorrelation estimate

In the first place, we consider a sample (r₁, v₁), ..., (r_n, v_n) of a random pair (r, v), where r is a centered variable. We can measure how strongly E(r|v) depends on v with the help of a robust measure of autocorrelation as follows. Suppose that all the values v_i, 1 ≤ i ≤ n, are distinct, sort them in ascending order, and let j₁, ..., j_n be the corresponding permuted indices, i.e., v_{j₁} < v_{j₂} < ... < v_{j_n}. Then, define the robust autocorrelation of r ordered according to v by

ρ_{n,r,v} = (1/(n−1)) Σ_{i=1}^{n−1} ψ(r_{j_i}/S_n) ψ(r_{j_{i+1}}/S_n),

where ψ is an odd, continuous, nondecreasing and bounded function, and S_n is a robust estimate of the scale of r. If E(r|v) is a nonconstant smooth function, we expect this coefficient to be positive. In fact, in this case, since v_{j_i} is generally close to v_{j_{i+1}}, we expect the values of ψ(r_{j_i}/S_n) and ψ(r_{j_{i+1}}/S_n) to be similar, hence ρ_{n,r,v} > 0. Since ψ is bounded, this measure of autocorrelation is robust.
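As an illustration, a numpy sketch of ρ_{n,r,v}; the choice of Huber's ψ with c = 1.34 anticipates Section 6, and the normalized MAD is used here as the robust scale S_n (an assumption of this sketch, not a prescription of the paper).

```python
import numpy as np

def huber_psi(z, c=1.34):
    """Huber's psi function: odd, continuous, nondecreasing and bounded."""
    return np.clip(z, -c, c)

def robust_autocorrelation(r, v, c=1.34):
    """Robust lag-1 autocorrelation of r ordered according to v."""
    r, v = np.asarray(r, float), np.asarray(v, float)
    s = 1.4826 * np.median(np.abs(r - np.median(r)))  # normalized MAD as S_n
    z = huber_psi(r[np.argsort(v)] / s, c)            # psi(r_{j_i}/S_n), sorted by v
    return np.mean(z[:-1] * z[1:])                    # average of lag-1 products
```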
We now turn to the problem of estimating the transformation parameter λ₀ in model (1). Let β_n be a consistent robust estimator of the coefficients for a multiple linear regression model with homoscedastic errors and S_n a measure of error scale. Specific examples are considered in Section 5. Let β_n(λ) and S_n(λ) be the result of applying these estimators to the transformed sample (x₁, g(y₁, λ)), ..., (x_n, g(y_n, λ)), and β(λ), S(λ) their asymptotic values. Since (x₁, g(y₁, λ₀)), ..., (x_n, g(y_n, λ₀)) is a sample of a linear model, where the coefficient vector is β₀, we have β(λ₀) = β₀. We define the residuals by r(λ, β, x, y) = g(y, λ) − xᵀβ and denote by h the inverse of g(y, λ₀) with respect to y.

Suppose that η is any vector in R^p and that all the values x₁ᵀη, ..., x_nᵀη are distinct. In this case, we sort x_iᵀη, i = 1, ..., n, in ascending order and define j₁(η), ..., j_n(η) as the corresponding permuted indices, i.e., x_{j₁(η)}ᵀη < x_{j₂(η)}ᵀη < ... < x_{j_n(η)}ᵀη. Then, for given λ ∈ Λ, β ∈ R^p, η ∈ R^p, s > 0, we define
ρ_n(λ, β, s, η) = (1/(n−1)) Σ_{i=1}^{n−1} ψ(r(λ, β, x_{j_i(η)}, y_{j_i(η)})/s) ψ(r(λ, β, x_{j_{i+1}(η)}, y_{j_{i+1}(η)})/s).   (3)

We observe that ρ_n(λ, β, S_n(λ), η) is the robust autocorrelation estimate of r(λ, β, x, y) ordered according to ηᵀx. In the case of ties among the values x₁ᵀη, ..., x_nᵀη, one can modify (3) by arbitrarily permuting the tied values and computing the correlations by averaging over the permutations. Details of this modification can be found in Marazzi and Yohai (2004). Finally, we define

ρ*_n(λ) = sup_{µ∈Λ} ρ_n(λ, β_n(λ), S_n(λ), β_n(µ)).   (4)

Heuristic arguments given below show that

lim_{n→∞} ρ*_n(λ₀) = 0   (5)

and that for λ ≠ λ₀

lim_{n→∞} ρ*_n(λ) > 0.   (6)

This suggests defining the robust residual autocorrelation estimate (RAC-estimate) λ̂_n of λ₀ by

λ̂_n = arg min_{λ∈Λ} ρ*_n(λ).   (7)
The corresponding estimate of β₀ is β_n(λ̂_n).

Heuristic justification: For (5) to hold, it is enough that

lim_{n→∞} ρ_n(λ₀, β_n(λ₀), S_n(λ₀), η) = 0   (8)

for all η. Since lim_{n→∞} β_n(λ₀) = β₀ and lim_{n→∞} S_n(λ₀) = s₀, for (8) to hold it is enough that

lim_{n→∞} ρ_n(λ₀, β₀, s₀, η) = 0.

This follows from

E[ψ(r(λ₀, β₀, x_{j_i(η)}, y_{j_i(η)})/s₀) ψ(r(λ₀, β₀, x_{j_{i+1}(η)}, y_{j_{i+1}(η)})/s₀)] = 0
and, since r(λ₀, β₀, x_{j_i(η)}, y_{j_i(η)}) = q(x_{j_i(η)}) u_{j_i(η)}, this equality holds because, conditional on x_{j_i(η)} and x_{j_{i+1}(η)}, the variables u_{j_i(η)} and u_{j_{i+1}(η)} are independent, ψ is odd, and the error distribution is symmetric.

For (6) to hold, it is enough that

lim_{n→∞} ρ_n(λ, β_n(λ), S_n(λ), β_n(λ₀)) > 0   (9)

for λ ≠ λ₀, and for (9) to hold it is enough that

lim_{n→∞} ρ_n(λ, β(λ), S(λ), β₀) > 0.

This follows from

E[ψ(r(λ, β(λ), x_{j_i(β₀)}, y_{j_i(β₀)})/S(λ)) ψ(r(λ, β(λ), x_{j_{i+1}(β₀)}, y_{j_{i+1}(β₀)})/S(λ))] > 0

and, since r(λ, β, x, y) = g(h(xᵀβ₀ + u), λ) − xᵀβ, this inequality follows from the fact that u_{j_i(β₀)} and u_{j_{i+1}(β₀)} are independent and that, for large n (and under appropriate assumptions), the values x_{j_i(β₀)}ᵀβ₀ and x_{j_{i+1}(β₀)}ᵀβ₀ are very close.

Remarks. (1) Model (1) is affine invariant and, in the case of the Box–Cox family of transformations, it is also scale and power invariant. It may be proved that the proposed estimates of λ₀ and β₀ are always affine equivariant and, for the case of the Box–Cox family of transformations, they are scale and power equivariant. Since model (1) is not regression invariant, the proposed estimates are not regression equivariant, even though the MM-estimates have this property when no transformation is used.

(2) A variant of the RAC-estimate can be defined by (7) putting

ρ*_n(λ) = sup_{η∈R^p} ρ_n(λ, β_n(λ), S_n(λ), η).   (10)

The heuristic arguments given above still hold, and thus this estimate is consistent. However, its computation is much harder because (10) involves a supremum over R^p while (4) involves a supremum over a bounded subset of R.

(3) A different variant of the RAC-estimate was proposed in Marazzi and Yohai (2004). They defined the estimate of λ₀ by (7), where ρ*_n(λ) is given by

ρ*_n(λ) = ρ_n(λ, β_n(λ), S_n(λ), β_n(λ)).   (11)
This estimate (denoted the Rac-estimate in the following) is very simple because it avoids taking the supremum in the definition of ρ*_n(λ). The heuristic argument for (5) still holds with this definition, but the argument for (6) is no longer valid. The reason is that in general β(λ) ≠ β₀ for λ ≠ λ₀, so the ordering used in (11) does not approximate the ordering by xᵀβ₀ on which the heuristic argument relies. This explains the poor performance of this variant in the Monte Carlo study presented in Section 7.

(4) In the simple regression case

g(y, λ₀) = β₀₁ + β₀₂ x + q(x)u,   (12)
we only need to sort the residuals according to the values of the x_i's. Then, the permutation j_i, 1 ≤ i ≤ n, is defined by x_{j₁} < x_{j₂} < ... < x_{j_n}. In this case, for any λ ∈ Λ, β = (β₁, β₂)ᵀ and s > 0, we put

ρ_n(λ, β, s) = (1/(n−1)) Σ_{i=1}^{n−1} ψ(r(λ, β, x_{j_i}, y_{j_i})/s) ψ(r(λ, β, x_{j_{i+1}}, y_{j_{i+1}})/s)   (13)

and

ρ*_n(λ) = ρ_n(λ, β_n(λ), S_n(λ)).   (14)

The estimate λ̂_n is still defined by (7) and the coefficient estimate is β_n(λ̂_n); a code sketch of this simple-regression estimate is given after these remarks.

(5) The definitions of ρ_{n,r,v} and ρ_n given above use lag 1 autocorrelations. One could think that taking into account autocorrelations of other lags would provide useful information. We therefore tried adding a term with lag 2 correlations to the definitions; a simulation study showed, however, that the improvement was negligible.
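As referenced in Remark (4), a minimal sketch of (13), (14) and (7) for simple regression, reusing box_cox and huber_psi from the sketches above. The robust fit is passed in as a user-supplied function (e.g., an S-estimate followed by an MM-step, as in Sections 5 and 6); this interface, like the grid search, is an assumption of the sketch.

```python
import numpy as np

def rac_simple(x, y, grid, robust_fit, c=1.34):
    """RAC-estimate of lambda for simple regression.

    robust_fit(x, z) -> (b1, b2, s): robust intercept, slope and residual scale.
    """
    order = np.argsort(x)            # sort once: residuals are ordered by x
    rho_star = []
    for lam in grid:
        z = box_cox(y, lam)          # transformed response g(y, lambda)
        b1, b2, s = robust_fit(x, z)
        res = z - b1 - b2 * x        # residuals r(lambda, beta_n(lambda), x, y)
        psi = huber_psi(res[order] / s, c)
        rho_star.append(np.mean(psi[:-1] * psi[1:]))   # eqs. (13)-(14)
    return grid[int(np.argmin(rho_star))]              # eq. (7)
```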
4. Consistency

Theorem 1 of this section provides a consistency result for the RAC-estimates in the simple regression model. The proof can be found in Marazzi and Yohai (2005). The following assumptions are required:

A1. The distribution H of x_i in model (12) is continuous.

A2. The distribution F of u_i is continuous and symmetric.

A3. For each λ, the function g(y, λ) is continuous and strictly monotone with respect to y, and Λ ⊂ R is compact.

A4. q is continuous.

A5. ψ is continuous, odd, monotone nondecreasing and bounded.

A6. The estimates β_n(λ) and S_n(λ) have the following properties:

(i) Given ε > 0, there exists δ > 0 such that

lim sup_{n→∞} sup_{|λ−λ₀|≤δ} ‖β_n(λ) − β₀‖ ≤ ε a.s.,

where ‖·‖ denotes the Euclidean norm.

(ii) There exists K > 0 such that

lim sup_{n→∞} sup_{λ∈Λ} ‖β_n(λ)‖ ≤ K a.s.

(iii) There exists s₂ > 0 such that

lim sup_{n→∞} sup_{λ∈Λ} S_n(λ) ≤ s₂ a.s.
(iv) There exists s₁ > 0 such that

lim inf_{n→∞} inf_{λ∈Λ} S_n(λ) ≥ s₁ a.s.

A7. Robust identifiability condition. In model (12) let

d(λ, β, s, x) = E[ψ(r(λ, β, x, y)/s) | x].   (15)

Then, for any λ ≠ λ₀, β ∈ R^p and s > 0, P(d(λ, β, s, x) ≠ 0) > 0.

Theorem 1. Let (x₁, y₁), ..., (x_n, y_n) be a random sample of model (12). Assume A1–A7. Then,

(i) λ̂_n →_P λ₀,
(ii) β_n(λ̂_n) →_P β₀,

where →_P denotes convergence in probability.

Remarks. (1) As noted in Marazzi and Yohai (2004), assumption A7 can be interpreted as an identifiability condition for model (12). In fact, suppose that A7 does not hold. Then there exist values λ₁ ≠ λ₀, β₁₁, β₁₂, and s such that, with probability one, g(y, λ₁) = β₁₁ + β₁₂x + u, where u satisfies the condition E(ψ(u/s)|x) = 0. Therefore the parameter λ is not identifiable. When q(x) = 1 and β₀₂ = 0, A7 does not hold, but in this case λ₀ is clearly not identifiable.

(2) For the transformation family g(y, λ) = y^λ (which is equivalent to the Box–Cox family when λ > 0) we can give the following sufficient set of conditions for A7 to hold: (i) ψ′(x) > 0 for all x; (ii) if β₀₂ > 0, the distribution of x is unbounded to the right; if β₀₂ < 0, the distribution of x is unbounded to the left; (iii) there exists M > 0 such that u > −M; (iv) q(x) = 1. To prove this, take λ ≠ λ₀ and any β = (β₁, β₂)ᵀ. Then we have

d(λ, β, s, x) = E[ψ(((β₀₁ + β₀₂x + u)^γ − β₁ − β₂x)/s) | x],

with γ = λ/λ₀, and

∂d(λ, β, s, x)/∂x = (1/s) E[ψ′(((β₀₁ + β₀₂x + u)^γ − β₁ − β₂x)/s)(γβ₀₂(β₀₁ + β₀₂x + u)^{γ−1} − β₂) | x].

Suppose that β₀₂ > 0. Then, if γ > 1 and x is large enough, ∂d(λ, β, s, x)/∂x > 0; if γ < 1 and x is large enough, ∂d(λ, β, s, x)/∂x < 0. Therefore, it is not possible to have d(λ, β, s, x) = 0 for all x. A similar argument can be used when β₀₂ < 0. Note however that conditions (i)–(iv)
are much stronger than what is strictly required for the validity of A7. In addition, observe that (iii) is a necessary condition to have g(y, λ) = y^λ defined for all values of y.

(3) The assumption that the error distribution is symmetric could be removed by modifying the residuals in (3). More precisely, define r_loc by

Σ_{i=1}^{n} ψ((r(λ, β, x_i, y_i) − r_loc)/s) = 0
and replace r(λ, β, x_{j_i(η)}, y_{j_i(η)}) by the centred residuals r(λ, β, x_{j_i(η)}, y_{j_i(η)}) − r_loc in (3). However, numerical experiments suggest that the convergence of the corresponding modified estimate of λ can be slow.

5. S- and MM-estimates

We consider a sample u = (u₁, u₂, ..., u_n) of size n of a univariate distribution. Huber (1964) defines the M-estimate of scale as a solution s_n(u) of the equation

(1/n) Σ_{i=1}^{n} ρ₁(u_i/s_n(u)) = b,   (16)
where b is a given positive real number and ρ₁ a given function R → R⁺. In the following, we require that ρ₁ satisfies the following properties:

A8. (i) ρ₁(0) = 0; (ii) ρ₁ is even; (iii) if |u| < |v|, then ρ₁(u) ≤ ρ₁(v); (iv) ρ₁ is bounded; (v) ρ₁ is continuous at 0.

We now consider a sample Z = {(x₁, z₁), ..., (x_n, z_n)} of a regression model z = xᵀβ + u. Rousseeuw and Yohai (1984) define the S-estimate of β by

β̃_n = arg min_β s_n(r(β)),   (17)

where r(β) = (r₁(β), ..., r_n(β))ᵀ and r_i(β) = z_i − x_iᵀβ. An associated S-estimate of the error scale is then given by

S̃_n = s_n(r(β̃_n)).   (18)

Rousseeuw and Yohai (1984) show that the asymptotic breakdown point of the S-estimates is ε* = min(b/a, 1 − b/a), where a = max ρ₁. Therefore, if b = a/2 we have ε* = 1/2. Unfortunately, the S-estimate of β cannot simultaneously attain a breakdown point of 0.5 and a high efficiency under normal errors (Hössjer, 1992). In order to obtain both these properties, Yohai (1987) proposed the MM-estimate of β defined by

β̂_n = arg min_β Σ_{i=1}^{n} ρ₂(r_i(β)/S̃_n),   (19)
where S̃_n is the error scale estimate (18) and ρ₂ is a second function satisfying A8 and such that:

A9. ρ₂(u) ≤ ρ₁(u) for all u.

The MM-estimate β̂_n has the same breakdown point as the S-estimate β̃_n. In addition, by conveniently choosing ρ₂, its asymptotic efficiency under normal errors can be made as close as desired to 100%. We now define β̃_n(λ), S̃_n(λ), and β̂_n(λ) as the S-estimate of β, the associated S-estimate of scale, and the MM-estimate of β when z_i = g(y_i, λ) and y_i is distributed according to (1). Sufficient conditions for these estimators to satisfy assumptions A6(i)–(iv) in Section 4 can be found in Marazzi and Yohai (2005).
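To illustrate the building blocks, a numpy sketch of the M-scale equation (16) and of an IRLS step for the MM-fit (19), using the biweight ρ and tuning constants anticipated from Section 6. The S-estimate β̃_n itself requires the resampling scheme of Section 6 and is omitted; this is a sketch under those assumptions, not the authors' implementation.

```python
import numpy as np

def rho_biweight(z, h):
    """Tukey biweight rho of Section 6: 3t^2 - 3t^4 + t^6 for t = |z|/h <= 1, else 1."""
    t = np.minimum(np.abs(np.asarray(z, float)) / h, 1.0)
    return 3 * t**2 - 3 * t**4 + t**6

def m_scale(u, h=1.548, b=0.5, tol=1e-8, max_iter=200):
    """Solve (1/n) sum rho1(u_i/s) = b for s, eq. (16), by fixed-point iteration."""
    s = 1.4826 * np.median(np.abs(u))        # MAD starting value
    for _ in range(max_iter):
        s_new = s * np.sqrt(np.mean(rho_biweight(u / s, h)) / b)
        if abs(s_new - s) <= tol * s:
            return s_new
        s = s_new
    return s

def mm_fit(X, z, beta_start, scale, h=4.687, max_iter=100, tol=1e-10):
    """IRLS for the MM-estimate (19) with fixed scale, started at the S-estimate."""
    beta = np.asarray(beta_start, float)
    for _ in range(max_iter):
        t = (z - X @ beta) / (scale * h)
        w = np.where(np.abs(t) < 1.0, (1.0 - t**2) ** 2, 0.0)  # biweight weights
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```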
6. Computational aspects

In this section, we describe a particular implementation of the RAC-estimate based on conventional versions of S-estimates and MM-estimates, used in the Monte Carlo study described in Section 7. The functions ρ₁ and ρ₂ are taken in Tukey's biweight family

ρ(z, h) = 3(z/h)² − 3(z/h)⁴ + (z/h)⁶ if |z| ≤ h,  ρ(z, h) = 1 if |z| > h,

where h is a given constant. To define the S-estimate we take ρ₁(z) = ρ(z, 1.548) and b = 0.5 in (16). This choice makes the asymptotic breakdown point of the S-estimates β̃_n and S̃_n equal to 0.5. In addition, we take ρ₂(z) = ρ(z, 4.687) in (19). With this choice, the MM-estimate β̂_n attains an asymptotic efficiency of 0.95 under normal errors while preserving breakdown point 0.5. In order to define the robust autocorrelation function we use Huber's function ψ(z, c) = max(−c, min(c, z)) with tuning constant c = 1.34.

We obtain λ̂_n by minimizing ρ*_n(λ) over an equally spaced grid L = {λ₁, ..., λ_k} of λ-values using the following algorithm (a code skeleton follows Step 6).

Step 1: For each λ ∈ L, compute and store (approximations of) the S-estimates β̃_n(λ) and S̃_n(λ). This is the most computer-intensive step because it requires a resampling scheme (e.g., Marazzi, 1992) for each λ ∈ L. An efficient procedure computes the projection matrix of each subsample just once (e.g., using Householder transformations); the projection matrix is then simultaneously applied to the k vectors of transformed responses g(y, λ₁), ..., g(y, λ_k), providing k coefficient estimates (one for each λ_j). Keeping track of the k best current coefficient estimates over the successive subsamples yields k parallel resampling schemes. Eventually, the procedure provides resampling approximations of β̃_n(λ) and S̃_n(λ) for each λ ∈ L.

Step 2: For each λ ∈ L, compute and store the MM-estimate β̂_n(λ). Here, we use an iteratively reweighted least squares algorithm with fixed scale S̃_n(λ) (e.g., the W-algorithm in Marazzi, 1992) starting at β̃_n(λ).

Step 3: For each λ ∈ L, sort x_iᵀβ̂_n(λ), i = 1, ..., n, in ascending order and store the corresponding permuted indices j₁, ..., j_n, which depend on λ.
Step 4: For each pair (λ, µ) ∈ L × L, compute ρ_n(λ, β̂_n(λ), S̃_n(λ), β̂_n(µ)) according to (3) and store these values.

Step 5: For each λ ∈ L, compute and store

ρ*_n(λ) = sup_{µ∈L} ρ_n(λ, β̂_n(λ), S̃_n(λ), β̂_n(µ)).

Step 6: Compute the approximation λ*_n = arg min_{λ∈L} ρ*_n(λ) of the RAC-estimate λ̂_n.
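The following numpy skeleton mirrors Steps 1–6; s_estimate stands for the resampling S-estimate of Step 1, and mm_fit may be taken from the sketch at the end of Section 5. It is an illustrative outline under those assumptions, not the authors' S-Plus/FORTRAN code (in particular, the parallel resampling trick of Step 1 is not reproduced).

```python
import numpy as np

def rac_estimate(X, y, grid, s_estimate, mm_fit, c=1.34):
    """Grid approximation of the RAC-estimate (Steps 1-6).

    s_estimate(X, z) -> (beta_tilde, s_tilde): S-estimate and S-scale (Step 1).
    """
    fits = []
    for lam in grid:                                   # Steps 1-2
        z = box_cox(y, lam)
        beta_t, s_t = s_estimate(X, z)
        fits.append((mm_fit(X, z, beta_t, s_t), s_t))
    orders = [np.argsort(X @ b) for b, _ in fits]      # Step 3: order by x'beta_hat(mu)
    rho_star = np.empty(len(grid))
    for i, lam in enumerate(grid):                     # Steps 4-5
        b_lam, s_lam = fits[i]
        res = box_cox(y, lam) - X @ b_lam
        rho = []
        for o in orders:                               # eta = beta_hat(mu), mu in grid
            psi = np.clip(res[o] / s_lam, -c, c)       # Huber psi
            rho.append(np.mean(psi[:-1] * psi[1:]))    # eq. (3)
        rho_star[i] = max(rho)                         # rho*_n(lambda), Step 5
    return grid[int(np.argmin(rho_star))]              # Step 6
```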
7. Monte Carlo results

Simulation results for simple regression have been reported in Marazzi and Yohai (2004); here, we report simulation studies for multiple regression. We compare various estimators: RAC is the estimator defined in Section 3 and Rac the robust residual autocorrelation estimator for multiple regression initially suggested in Marazzi and Yohai (2004; see Section 3, Remarks, (3)). F is the estimator defined in Foster et al. (2001) and RF is a modification of F that uses the MM-estimates described in Section 6 in place of the least squares estimates. BI denotes the bounded influence estimator as defined in Carroll and Ruppert (1988, p. 186, with tuning constant a = 1.5 and based on 10 iterations). ML denotes the maximum likelihood estimator.

Observations (x_i, y_i) were generated according to the model

y_i^{λ₀} = x_iᵀβ₀ + q(x_i)u_i,   i = 1, ..., n,

where x_i = (x_{i0}, x_{i1}, ..., x_{ip})ᵀ denotes a vector of p + 1 covariates; the first component equals 1 and the others are independently drawn from uniform distributions on [0.20, 20]. The errors u_i were i.i.d., such that u_i = t e_i, where e_i was normal N(0, 1) or contaminated normal 0.9 N(0, 1) + 0.1 N(0, 25), and the factor t was chosen so that MAD(u_i) = 1. The parameter vector was β₀ = (10, 1, ..., 1)ᵀ and λ₀ = 0.5. Two options for q(x) were used: q(x) ≡ 1 (homoscedastic case) and q(x) = (x₁ + ... + x_p)/(6p) (heteroscedastic case). The simulation parameters (in particular β₀₁) were chosen in order to have positive responses, clear identifiability, and perceptible homo/heteroscedasticity. The choice of λ₀ was not relevant because of the equivariance of λ̂_n, i.e., λ̂_n(y₁^a, ..., y_n^a) = λ̂_n(y₁, ..., y_n)/a. A sketch of this design follows.
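A sketch of one replication of the design above; the scaling factor t is approximated here by normalizing with the empirical MAD of the generated errors, and the function name is an assumption of the sketch.

```python
import numpy as np

def simulate_sample(n, p, contaminated=False, heterosc=False, lam0=0.5, seed=None):
    """Generate one (X, y) sample following the Monte Carlo design of Section 7."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones(n), rng.uniform(0.20, 20.0, size=(n, p))])
    beta0 = np.concatenate(([10.0], np.ones(p)))
    e = rng.standard_normal(n)
    if contaminated:                       # 0.9 N(0,1) + 0.1 N(0,25)
        bad = rng.random(n) < 0.1
        e[bad] *= 5.0
    u = e / np.median(np.abs(e))           # rescale so that MAD(u) = 1 (empirically)
    q = X[:, 1:].sum(axis=1) / (6 * p) if heterosc else 1.0
    g = X @ beta0 + q * u                  # g(y, lam0) = x'beta0 + q(x) u
    return X, g ** (1.0 / lam0)            # invert y^lam0 = g
```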
Each experiment was based on 1000 samples. The results for n = 50, 100, 200 are reported in Tables 1–3 and include the average bias of the simulated estimates of λ₀, bias = 1000 × (mean(λ̂) − λ₀), and the root mean squared error, √mse = 1000 × (mean((λ̂ − λ₀)²))^{1/2}. As expected, for homoscedastic Gaussian errors, ML has no bias and the lowest mean squared error; however, for contaminated data, the robust estimates (except Rac for n = 50) outperform ML. In the homoscedastic cases, the behaviour of all estimates is satisfactory in relation to bias; with regard to mean squared error, BI seems to be the best (close to ML for Gaussian errors and stable under contamination). The performance of RAC depends on the sample size: for n = 50, its mean squared error is rather large; however, for increasing sample size and Gaussian data it approaches the mean squared error of ML, whereas for contaminated data it comes close to the mean squared error of BI.
bias √ mse bias √ mse bias √ mse bias √ mse
0 84 −1 97 −1 78 5 96
0 92 0 102 −5 108 1 116
0 74 0 84 −1 67 2 81
0 68 −3 123 −3 67 1 109
2 177 −5 191 −1 189 5 213
0 133 −9 146 −2 154 4 159
19 213 19 225 −5 217 18 233
Rac
−1 108 −6 119 −1 125 2 135
ML
4: auss, p = 4; C4: contam. Gauss, p = 4; G8: Gauss, p = 8; C8: contam. Gauss, p = 8.
C8
G8
C4
4
BI
RAC
RF
Rac
RAC
F
Heteroscedastic case
Homoscedastic case
Table 1 Simulation results: n = 50
−35 144 −46 165 −19 129 −30 158
F
−31 152 −27 167 −16 164 −13 173
RF
−88 143 −97 159 −43 118 −52 139
BI
−102 148 −120 231 −48 118 −77 199
ML
2762 A. Marazzi, V.J. Yohai / Computational Statistics & Data Analysis 50 (2006) 2752 – 2768
bias √ mse bias √ mse bias √ mse bias √ mse
−1 79 0 87 4 83 0 90
−2 57 0 60 2 56 −1 61
−1 56 1 62 1 50 3 59
F −1 57 1 61 0 54 1 58
RF −3 49 0 52 3 43 2 45
BI −1 43 −3 89 2 41 1 76
ML
G4: Gauss, p = 4; C4: contam. Gauss, p = 4; G8: Gauss, p = 8; C8: contam. Gauss, p = 8.
C8
G8
C4
G4
Rac
RAC
Homoscedastic case
Table 2 Simulation results: n = 100
−5 96 −1 99 1 100 −3 102
RAC 8 145 15 160 5 151 8 159
Rac
Heteroscedastic case
−31 99 −36 110 −16 85 −24 100
F
−28 98 −22 101 −14 91 −12 97
RF
−88 116 −90 122 −36 80 −44 88
BI
−101 122 −134 210 −45 82 −87 165
ML
A. Marazzi, V.J. Yohai / Computational Statistics & Data Analysis 50 (2006) 2752 – 2768 2763
bias √ mse bias √ mse bias √ mse bias √ mse
0 39 −1 43 0 36 3 40
0 39 −1 42 0 37 3 39
−2 33 −3 34 −2 29 2 30
−1 30 1 66 −1 27 0 54
−3 59 −3 62 −3 53 −2 59
0 56 −2 62 −2 55 1 61
5 110 8 114 −1 101 1 110
Rac
−1 35 −1 39 −1 33 0 37
ML
4: auss, p = 4; C4: contam. Gauss, p = 4; G8: Gauss, p = 8; C8: contam. Gauss, p = 8.
C8
G8
C4
4
BI
RAC
RF
Rac
RAC
F
Heteroscedastic case
Homoscedastic case
Table 3 Simulation results: n = 200
−28 72 −32 79 −16 62 −17 69
F
−26 69 −27 73 −13 62 −11 68
RF
−86 100 −89 104 −40 60 −44 68
BI
−101 112 −137 184 −50 67 −98 143
ML
2764 A. Marazzi, V.J. Yohai / Computational Statistics & Data Analysis 50 (2006) 2752 – 2768
Table 4
Computing times (seconds)

             p = 4      p = 4      p = 4      p = 8      p = 8      p = 8
             n = 100    n = 200    n = 300    n = 100    n = 200    n = 300
Step 1         2.92       5.68       8.63       5.72      11.00      16.28
Step 2         0.08       0.14       0.21       0.19       0.30       0.44
Steps 3–6      0.44       0.86       1.44       0.47       0.94       1.47
In the heteroscedastic case, the bias of ML, BI, F, and RF clearly becomes the dominant component of the mean squared error, and all estimates are outperformed by RAC, especially for sample sizes larger than or equal to 200. (Note: it can be shown that both F and BI are asymptotically biased when the errors are heteroscedastic.) Finally, we notice that RAC works much better than Rac; an explanation is given in Section 3, Remarks, (3).

We used FORTRAN programs loaded into S-Plus to optimize the robust residual autocorrelation on a grid of 101 equally spaced points between 0.05 and 1.25 according to the algorithm described in Section 6. Some computing times (for a single estimation problem) on a 2.53 GHz Pentium 4 are reported in Table 4.

8. Example

We consider a sample of 78 patients hospitalized in a Swiss hospital during 2000 for “valve cardiac surgery with major complications”. We study the relationship between the cost of stay (Cost, in Swiss francs) and two explanatory variables that are available in administrative files: length of stay (LOS, in days) and admission type (0 = planned, 1 = emergency). This relationship is often used as a basis to determine reimbursement rates. We use the abbreviations y = Cost/1000, x₁ = log(LOS), x₂ = admission type. The log(LOS)/Cost plot in Fig. 1, panel (a), suggests that the costs of two cases were exceptionally high; however, model (1)–(2) with xᵀ = (1, x₁, x₂) and β₀ᵀ = (β₀₁, β₀₂, β₀₃) seems a reasonable description for the majority of these data. The ML-estimates of λ₀ and β₀ are λ̄_n = −0.33 and β̄_n(λ̄_n) = (1.61, 0.19, 0.02)ᵀ, whereas the RAC-estimates are λ̂_n = 0.65 and β̂_n(λ̂_n) = (−4.94, 7.82, 1.03)ᵀ. Since several stays have the same length, the autocorrelation function has been computed using the modification of (3) for tied observations described in Marazzi and Yohai (2004). Fig. 1, panel (b), displays the function ρ*_n(λ), minimized at λ = 0.65. Fig. 2, panel (a), displays the data on the transformed response scale for λ = λ̄_n together with the estimated ML prediction lines; Fig. 2, panel (b), is a normal probability plot of the ML residuals. The corresponding plots for the RAC-estimate are given in Fig. 3, panels (a) and (b). No strong outliers are visible in Fig. 2; on the contrary, two exceptional cases clearly show up with very large residuals in Fig. 3. Removing these cases from the data, the ML-estimate of λ₀ becomes λ̄_n = 0.32, and the classical 95% confidence interval for λ₀, based on the likelihood ratio test and the reduced data set, is (−0.13, 0.76).
Fig. 1. Panel (a): Data log(LOS) and log(Cost/1000) of 78 hospital patients; circles denote planned admissions, bullets emergency admissions. Panel (b): function ρ*_n(λ), minimized at λ = 0.65.
Fig. 2. Panel (a): data on the transformed response scale for λ = −0.33 (ML-estimate) and fitted lines for planned/emergency admissions. Panel (b): normal probability plot of the ML residuals.
This interval contains the RAC-estimate but not the ML-estimate on the full data set! Finally, in order to investigate which estimate better succeeds in transforming the data, we provide inverse scatter plots according to Cook and Weisberg (1994). Fig. 4, panel (a), is the plot of the fitted values ȳ_i = β̄_n(λ̄_n)ᵀx_i (vertical axis) against y_i. A transformation can be estimated by fitting a curve to this graph. For example, the solid curve shown on the diagram was obtained by computing the ordinary least squares regression of ȳ_i on g(y_i, −0.33) and then displaying the fitted values from this regression as a curve on the inverse response plot of (y_i, ȳ_i). The broken line was obtained with the help of a nonparametric smoother (S-Plus lowess) based on the reduced data set. The two curves disagree, casting doubt on the transformation g(y, −0.33). The corresponding plot for the RAC-estimate, i.e., the plot of ŷ_i = β̂_n(λ̂_n)ᵀx_i against y_i for λ̂_n = 0.65, is given in Fig. 4, panel (b).
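A sketch of how the two curves of such an inverse scatter plot can be computed. OLS is used for the parametric curve, as in panel (a); for panel (b) the paper uses an MM-regression instead, so the OLS call is a simplifying assumption here, and lowess is taken from statsmodels rather than S-Plus.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.nonparametric.smoothers_lowess import lowess

def inverse_plot_curves(y, fitted, lam):
    """Parametric and nonparametric curves for the inverse plot of (y, fitted)."""
    gy = box_cox(y, lam)                              # transformed response g(y, lam)
    parametric = sm.OLS(fitted, sm.add_constant(gy)).fit().fittedvalues
    nonparametric = lowess(fitted, y, return_sorted=False)
    return parametric, nonparametric   # close agreement supports g(y, lam)
```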
Fig. 3. Panel (a): data on the transformed response scale for λ = 0.65 (RAC-estimate) and fitted lines for planned/emergency admissions. Panel (b): normal probability plot of the RAC residuals.
Fig. 4. Panel (a): inverse scatter plot of (y_i, ȳ_i) with ȳ_i = β̄_n(−0.33)ᵀx_i (ML). Panel (b): inverse scatter plot of (y_i, ŷ_i) with ŷ_i = β̂_n(0.65)ᵀx_i (RAC).
Here, the parametric (MM-regression of ŷ_i on g(y_i, 0.65)) and the nonparametric (S-Plus lowess) curves are very close, confirming that g(y, 0.65) is a sound transformation. For the sake of simplicity, the user may of course prefer a value of λ = 0.5 in place of λ = 0.65; in this example, these two values of λ provide almost the same results.

Remarks. (1) A robust confidence interval for λ could be based on a bootstrap simulation of the RAC-estimate. However, the standard bootstrap simulation of a robust estimate does not provide robust confidence intervals (Salibian-Barrera and Zamar, 2002). The development of a more sophisticated robust bootstrap algorithm for the RAC-estimate is necessary.

(2) Despite assumption A1 in Section 4, the RAC procedure for multiple regression does not require the distribution of x to be continuous. In particular, dummy explanatory
variables—such as the admission type in the example—can be used. The procedure works as soon as the ordered fitted values are sufficiently close to provide large residual autocorrelations for bad values of λ. This may happen even when only one explanatory variable is continuous.

Acknowledgements

This work was completed with the support of Grant 2053-066895.01 from the Swiss National Science Foundation, Grant PICT–99 03–06277 from the Agencia Nacional de Promoción de la Ciencia y la Tecnología, Grant X-094 from the Universidad de Buenos Aires, Argentina, and a grant from Fundación Antorchas, Argentina. The authors are grateful to Danielle Meylan and Alex Randriamiharisoa for their help in programming. They also thank the referees for the numerous questions and remarks that helped clarify the paper.

References

Atkinson, A., Riani, M., 2000. Robust Diagnostic Regression Analysis. Springer, New York.
Bickel, P.J., Doksum, K.A., 1981. An analysis of transformations revisited. J. Amer. Statist. Assoc. 76, 296–311.
Box, G.E.P., Cox, D.R., 1964. An analysis of transformations. J. Roy. Statist. Soc. Ser. B 26, 211–252.
Carroll, R.J., Ruppert, D., 1988. Transformation and Weighting in Regression. Chapman & Hall, New York.
Cheng, T.-C., 2005. Robust regression diagnostics with data transformations. Comput. Statist. Data Anal., in press.
Cook, R.D., Weisberg, S., 1994. Transforming a response variable for linearity. Biometrika 81 (4), 731–737.
Foster, A.M., Tian, L., Wei, L.J., 2001. Estimation for the Box–Cox transformation model without assuming parametric error distribution. J. Amer. Statist. Assoc. 96, 1097–1101.
Han, A.K., 1987. A non-parametric analysis of transformations. J. Econometrics 35, 191–209.
Hinkley, D.V., Runger, G., 1984. Analysis of transformed data. J. Amer. Statist. Assoc. 79, 302–309.
Hössjer, O., 1992. On the optimality of S-estimators. Statist. Probab. Lett. 14, 413–419.
Huber, P.J., 1964. Robust estimation of a location parameter. Ann. Math. Statist. 35, 73–101.
Marazzi, A., 1992. Algorithms, Subroutines, and S-Functions for Robust Statistics. Wadsworth & Brooks/Cole, Pacific Grove, CA.
Marazzi, A., Yohai, V., 2004. Robust Box–Cox transformations for simple regression. In: Hubert, M., Pison, G., Struyf, A., Van Aelst, S. (Eds.), Theory and Applications of Recent Robust Methods. Statistics for Industry and Technology, pp. 173–182.
Marazzi, A., Yohai, V., 2005. Consistency of the robust residual autocorrelation estimate of a transformation parameter. Technical Report, available at www.iumsp.ch.
Parker, I., 1988. Transformations and influential observations in minimum sum of absolute errors regression. Technometrics 30, 215–220.
Rousseeuw, P., Yohai, V.J., 1984. Robust regression by means of S-estimators. In: Franke, J., Härdle, W., Martin, D. (Eds.), Robust and Nonlinear Time Series. Lecture Notes in Statistics, vol. 26. Springer, Berlin, pp. 256–272.
Sakia, R.M., 1992. The Box–Cox transformation technique: a review. Statistician 41, 169–178.
Salibian-Barrera, M., Zamar, R.H., 2002. Bootstrapping robust estimates of regression. Ann. Statist. 30, 556–582.
Yohai, V.J., 1987. High breakdown-point and high efficiency robust estimates for regression. Ann. Statist. 15, 642–656.