Journal of Statistical Planning and Inference 143 (2013) 1988–2001
Regression with outlier shrinkage

Shifeng Xiong (Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China)
V. Roshan Joseph (H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA 30332-0205, United States)
Article history: Received 29 August 2012; Received in revised form 6 March 2013; Accepted 8 June 2013; Available online 26 June 2013.

Abstract

We propose a robust regression method called regression with outlier shrinkage (ROS) for the traditional n > p cases. It improves over other robust regression methods such as least trimmed squares (LTS) in the sense that it can achieve maximum breakdown value and full asymptotic efficiency simultaneously. Moreover, its computational complexity is no more than that of LTS. We also propose a sparse estimator, called sparse regression with outlier shrinkage (SROS), for robust variable selection and estimation. It is proven that SROS can not only give consistent selection but also estimate the nonzero coefficients with full asymptotic efficiency under the normal model. In addition, we introduce a concept of nearly regression equivariant estimator for understanding the breakdown properties of sparse estimators, and prove that SROS achieves the maximum breakdown value of nearly regression equivariant estimators. Numerical examples are presented to illustrate our methods.

Keywords: Breakdown value; Penalized regression; Robust regression; Variable selection; Weighted least squares
1. Introduction

Consider a linear regression model

y = Xβ + ε,    (1)

where X = (x_1, …, x_n)′ = (x_ij)_{i=1,…,n; j=1,…,p} is an n × p regression matrix, y = (y_1, …, y_n)′ is the response vector, β = (β_1, …, β_p)′ is the vector of coefficients, and the random error

ε = (ε_1, …, ε_n)′ ∼ N(0, σ²I_n).    (2)

In this paper we always assume X to have full column rank, which implies n > p. The ordinary least squares (OLS) estimator of β is β̂_LS = (X′X)^{-1}X′y, which is obtained from

min_β ∥y − Xβ∥².    (3)
The OLS estimator corresponds to the maximum likelihood estimate, and has full efficiency since its covariance matrix attains the Rao–Cramér lower bound. However, it is well known that β̂_LS is extremely sensitive to outliers, i.e., observations that do not obey the linear pattern formed by the majority of the data. Various robust regression methods have been proposed to improve β̂_LS when datasets contain outliers (Hampel et al., 1986; Rousseeuw and Leroy, 1987; Maronna et al., 2006).
A natural estimator to improve β̂_LS is the least absolute deviation (LAD) estimator (Rousseeuw and Leroy, 1987), which is defined by replacing the ℓ₂ norm in (3) with the ℓ₁ norm. However, the LAD estimator does not have the ability to resist the influence of leverage points, i.e., points that deviate from the majority in x-space. Its robustness can be quantified using the (finite-sample) breakdown value (Donoho and Huber, 1983), which measures the smallest amount of contamination that can have an arbitrarily large effect on an estimator. In fact, the breakdown value of LAD is 1/n, i.e., only one bad observation can make it very poor. There are a number of robust regression estimators with high breakdown value, including the repeated median estimate (Siegel, 1982), the least median of squares (LMS) (Rousseeuw, 1984), the least trimmed squares (LTS) (Rousseeuw, 1984), and the S-estimate (Rousseeuw and Yohai, 1984). Since these high-breakdown estimates lose efficiency under the normal model (2) as the price of robustness, one-step M-estimators (Jureckova and Portnoy, 1987), MM-estimators (Yohai, 1987), τ-estimators (Yohai and Zamar, 1988), and one-step weighted estimators (Agostinelli and Markatou, 1998) have been proposed to improve the efficiency. Gervini and Yohai (2002) introduced REWLSE (robust and efficient weighted least squares estimators), which can achieve maximum breakdown value and full asymptotic efficiency with fixed p simultaneously.

We propose a new method for robust estimation. The main idea is to introduce n mean shift parameters corresponding to the n observations, and then estimate them along with β by penalized least squares using a quadratic penalty. We call this method regression with outlier shrinkage (ROS). This approach is inspired by penalized regression methods, but such studies are limited in the literature (Lee et al., 2007; She and Owen, 2011). McCann and Welsch (2007) and Menjoge and Welsch (2010) also considered mean shift parameters for robust variable selection, but they used forward selection approaches. We show that ROS has a closed form of weighted least squares (Holland and Welsch, 1977; Gervini and Yohai, 2002; Čížek, 2010). Like REWLSE, ROS can simultaneously achieve maximum breakdown value and full asymptotic efficiency with fixed p.

Outliers can also affect the performance of variable selection procedures in regression analysis. A common approach used in practice is to detect outliers using residual plots, remove them from the dataset, and repeat the variable selection. This procedure may be continued until no more outliers remain in the dataset. However, this iterative procedure can often fail due to the confounding between outliers and model parameters. Procedures that perform simultaneous variable selection and estimation are needed for this purpose. Wang et al. (2007) proposed such a procedure using LAD regression with the lasso penalty, called LAD-lasso. Other robust sparse estimation methods have been proposed by Zou and Yuan (2008), Wang and Li (2009), Li and Yu (2009), Chen et al. (2010), and Bradic et al. (2011). Although these estimators are consistent in variable selection under some conditions, the corresponding estimators of the nonzero coefficients in β cannot achieve full asymptotic efficiency with fixed p under the normal model. Based on ROS, we propose a method called sparse regression with outlier shrinkage (SROS) for robust variable selection and estimation.
We prove that it can not only give consistent selection but also estimate the nonzero coefficients with full asymptotic efficiency under the normal model. SROS is also computationally much more efficient than the other sparse robust regression methods. In addition, we introduce the concept of nearly regression equivariant estimator to discuss the optimal breakdown value of sparse estimators, and prove that SROS achieves the optimal breakdown value of nearly regression equivariant estimators.

The remainder of the paper is organized as follows. Section 2 presents the ROS estimator and studies its robustness and asymptotic properties. Section 3 introduces the concept of nearly regression equivariant estimator and derives the maximum breakdown value of such an estimator. Section 4 presents the SROS estimator and its properties. Section 5 gives numerical examples, and we conclude with some remarks in Section 6. All proofs are presented in the Appendix.

2. ROS estimator

Suppose that the ith observation (y_i, x_i) is an outlier, i.e., it does not obey the linear pattern E(y|x) = x′β formed by the majority of the data. Then we can assume that the expected response is shifted by an amount δ_i, i.e., E(y_i|x_i) = x_i′β + δ_i. For all observations, replace (2) by

ε = (ε_1, …, ε_n)′ ∼ N(δ, σ²I_n),    (4)
where δ = (δ_1, …, δ_n)′ are called mean shift parameters. Of course, the mean shift model with the added parameter δ is not identifiable. Note, however, that the outliers are usually far fewer than the other "good" observations. To make both β and δ estimable, the estimator of δ should be shrunk towards zero. Penalized methods can be used here by assigning a penalty to δ. Lee et al. (2007) considered the mean shift model and proposed the following robust lasso for estimating β and δ simultaneously:

min_{β,δ} ∥y − Xβ − δ∥² + 2λ_1 ∑_{j=1}^p |β_j| + 2λ_2 ∑_{i=1}^n |δ_i|,    (5)
where λ_1 and λ_2 > 0 are tuning parameters. Without the purpose of variable selection, we consider the estimator of β from

min_{β,δ} ∥y − Xβ − δ∥² + 2λ ∑_{i=1}^n |δ_i|,    (6)

where λ > 0 is a tuning parameter. Interestingly, the estimator from (6) is equivalent to an M-estimator (She and Owen, 2011), and its breakdown value is very low. Similarly, the robust lasso in (5) does not produce a high-breakdown estimator.
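To make the mean shift formulation concrete, the following sketch computes the estimator in (6) by alternating a least squares step for β with a componentwise soft-thresholding step for δ. This is only a minimal illustration under our own naming and iteration choices, not a method proposed in this paper; as noted above, the resulting estimator inherits the low breakdown value of a monotone M-estimator.

    import numpy as np

    def soft_threshold(r, lam):
        # componentwise soft-thresholding: sign(r) * max(|r| - lam, 0)
        return np.sign(r) * np.maximum(np.abs(r) - lam, 0.0)

    def mean_shift_l1(X, y, lam, n_iter=100):
        # alternating minimization of ||y - X b - d||^2 + 2*lam*sum(|d_i|), as in (6)
        XtX_inv_Xt = np.linalg.solve(X.T @ X, X.T)      # (X'X)^{-1} X'
        d = np.zeros(len(y))
        for _ in range(n_iter):
            b = XtX_inv_Xt @ (y - d)                    # beta-step: OLS fit to y - delta
            d = soft_threshold(y - X @ b, lam)          # delta-step: shrink the residuals
        return b, d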
Now we present our method. First we consider the robust estimation of β, and in a later section we extend this method for simultaneous variable selection and robust estimation. An adaptive penalty is used to improve (6). Let β̃_0 be a robust initial estimator of β. Under the mean shift model in (4), the initial estimator of δ is δ̃_0 = (δ̃_{01}, …, δ̃_{0n})′ = y − Xβ̃_0. The final estimators β̃ = (β̃_1, …, β̃_p)′ and δ̃ = (δ̃_1, …, δ̃_n)′ are specified by

min_{β,δ} ∥y − Xβ − δ∥² + μ ∑_{i=1}^n δ_i²/δ̃_{0i}²,    (7)
and μ > 0 is a tuning parameter. We call β̃ the regression with outlier shrinkage (ROS) estimator. The choice of the initial estimator β̃_0 plays an important role in ROS. Here we use the least trimmed squares (LTS) estimator (Rousseeuw, 1984), since it has maximum breakdown value and an efficient algorithm (Rousseeuw and Van Driessen, 2006) is available for its computation. The LTS can be described as follows. For any β in (1), denote the corresponding residuals by r_i(β) = y_i − x_i′β for i = 1, …, n. The LTS estimator β̂_LTS is the solution to

min_β ∑_{i=1}^h r²_{i:n}(β),    (8)
where r²_{1:n}(β) ≤ ⋯ ≤ r²_{n:n}(β) are the ordered squared residuals. An estimator of σ based on LTS can be given by

σ̂ = c_{h,n} ( ∑_{i=1}^h r²_{i:n} / h )^{1/2},    (9)

where r²_{1:n} ≤ ⋯ ≤ r²_{n:n} are the ordered squared residuals from the LTS fit and c_{h,n} is chosen to make σ̂ consistent and unbiased under Gaussian error distributions (Pison et al., 2002). If h = [n/2] + [(p+1)/2] in (8), the breakdown value of β̂_LTS achieves the maximum breakdown value of any regression equivariant estimator, defined in Section 2.1 (Rousseeuw and Leroy, 1987). In ROS, we let β̃_0 = β̂_LTS with h = [n/2] + [(p+1)/2]. For high-dimensional data, Alfons et al. (2013) provided a sparse LTS estimator, which can also be used as the initial estimator.

The adaptive quadratic penalty function in (7) gives ROS a closed form of weighted least squares. Denote the diagonal matrix

W = diag(w_1, …, w_n) = diag(μ/(δ̃_{01}² + μ), …, μ/(δ̃_{0n}² + μ)).    (10)
By minimizing (7) with respect to δ, we obtain δ̃ = (I_n − W)(y − Xβ̃). Plugging this into (7), we can obtain β̃ by solving

min_β (y − Xβ)′W(y − Xβ).    (11)

Hence, the ROS estimator has the following closed form:

β̃ = (X′WX)^{-1}X′Wy.    (12)
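A minimal numerical sketch of the weights (10) and the closed form (12), assuming an initial robust fit beta0 (for example, the LTS fit described above) is already available; the function name and arguments are ours.

    import numpy as np

    def ros_estimate(X, y, beta0, mu):
        # ROS: weighted least squares (12) with weights (10) built from the initial residuals
        delta0 = y - X @ beta0                    # initial mean shift estimates delta0_i
        w = mu / (delta0**2 + mu)                 # Eq. (10): large residual -> small weight
        XtW = X.T * w                             # X'W with W = diag(w)
        beta = np.linalg.solve(XtW @ X, XtW @ y)  # Eq. (12): (X'WX)^{-1} X'W y
        return beta, w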
The closed form (12) makes ROS similar to weighted least squares regression (Holland and Welsch, 1977; Gervini and Yohai, 2002; Čížek, 2010). We can see that ROS is easy to understand and implement. The weight w_i on the ith observation (x_i, y_i) depends on the magnitude of the initial residual δ̃_{0i}. If δ̃_{0i}² is large, then w_i will be small and therefore the ith observation (x_i, y_i) is almost ignored in the estimation. As μ → ∞, β̃ tends to the OLS estimator, and as μ → 0, β̃ tends to the LTS estimator (by noting that (β̃_0, δ̃_0) makes the objective function in (7) zero when μ = 0). Thus, the value of μ controls the trade-off between efficiency and robustness. We will propose a new criterion for its careful selection in Section 2.3.

2.1. Breakdown value of ROS

For the regression model (1), the breakdown value ε_n(β̂) of an estimator β̂ for the dataset X = {(x_1, y_1), …, (x_n, y_n)} is the smallest amount of contamination that can have an arbitrarily large effect on β̂. Precisely, consider all possible contaminated datasets X*, obtained by replacing any m of the original observations by arbitrary points. Then the breakdown value of β̂ is defined as

ε_n(β̂) = min{ m/n : sup_{X*} ∥β̂(X) − β̂(X*)∥ is infinite }.    (13)

She and Owen (2011) showed that the estimator given by (6) is actually an M-estimator with Huber's loss function. By the general result on breakdown properties of monotone M-estimates (Section 5.16.1 of Maronna et al., 2006), its breakdown value is only 1/n. Similarly, the robust lasso defined in (5) cannot have a high breakdown value either.

Unlike the robust lasso estimator, ROS has a high breakdown value. An estimator β̂ = β̂(X, y) is called regression equivariant if for any X and y it holds that

β̂(X, y + Xα) = β̂(X, y) + α  for any α ∈ R^p.

Rousseeuw and Leroy (1987) proved that the breakdown value of any regression equivariant estimator cannot be larger than ([(n−p)/2] + 1)/n. Note that the LTS estimator β̃_0 used in ROS is regression equivariant and achieves this maximum breakdown value.
Next we prove that the ROS estimator β̃ is also regression equivariant and has the same breakdown value as β̃_0.

Theorem 1. The ROS estimator β̃ is regression equivariant.

Theorem 2. The ROS estimator β̃ has the same breakdown value as β̃_0.

2.2. Asymptotics of ROS

In this subsection we show that ROS can achieve full asymptotic efficiency with fixed p under the normal model. We make several assumptions.

Assumption 1. As n → ∞, X′X/n → Σ > 0.

Assumption 2. As n → ∞, max_{1≤i≤n} x_i′x_i/n → 0.

Assumption 3. As n → ∞, μ = μ_n satisfies μ_n^{-1} = O(n^{-3/2}).
Two lemmas needed in the proof of Theorem 3 are presented in the Appendix.

Theorem 3. If Assumptions 1–3 hold, then as n → ∞,

√n (β̃ − β) →_d N(0, σ²Σ^{-1}).
Theorem 3 also holds when we relax the normal distribution assumption (2) by assuming only that ε_1, …, ε_n are independently and identically distributed with zero mean and finite variance σ². From Theorem 3, a consistent estimator of σ² based on the ROS estimator β̃ can be given by

σ̂² = (y − Xβ̃)′W(y − Xβ̃) / ∑_{i=1}^n w_i.    (14)
2.3. Selection of μ

As mentioned before, the value of μ controls the trade-off between efficiency and robustness of ROS. Therefore, it is important to choose this value carefully.

For μ > 0, denote the residuals of the ROS estimator β̃ = β̃(μ) by r_i(μ) = y_i − x_i′β̃(μ). Let h = [n/2] + [(p+1)/2]. The multiple least trimmed squares criterion is defined as

MC(μ) = min_{k=h,h+1,…,n} (1/a_k) ∑_{i=1}^k r²_{i:n}(μ),    (15)

where r²_{1:n}(μ) ≤ ⋯ ≤ r²_{n:n}(μ) are the ordered squared residuals and a_k = C(k+8, 9), the binomial coefficient "k+8 choose 9". We choose the μ in (7) that minimizes MC(μ).

A justification for the choice of a_k can be given as follows. Usually the observation corresponding to the squared residual r²_{k:n} is considered an outlier if

r²_{k:n} > 9σ²,    (16)
where σ² is the error variance in (2). By (15), it is desirable that (16) be equivalent to

(1/a_k) ∑_{i=1}^k r²_{i:n} > (1/a_{k−1}) ∑_{i=1}^{k−1} r²_{i:n},

which implies

r²_{k:n} > (a_k/a_{k−1} − 1) ∑_{i=1}^{k−1} r²_{i:n} ≈ (k−1)(a_k/a_{k−1} − 1) σ².

Letting a_1 = 1 and (k−1)(a_k/a_{k−1} − 1) = 9, we have a_k = C(k+8, 9).
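As an illustration of the criterion (15), the sketch below evaluates MC(μ) from a vector of ROS residuals, with a_k computed exactly as the binomial coefficient C(k+8, 9); the grid of candidate μ values is an implementation choice of ours and is not specified in the paper.

    import numpy as np
    from math import comb

    def mc_criterion(resid, h):
        # multiple least trimmed squares criterion (15) for one value of mu;
        # resid are the ROS residuals r_i(mu) and h = [n/2] + [(p+1)/2]
        r2 = np.sort(resid**2)                 # ordered squared residuals
        csum = np.cumsum(r2)                   # csum[k-1] = sum of the k smallest squared residuals
        n = len(r2)
        return min(csum[k - 1] / comb(k + 8, 9) for k in range(h, n + 1))

    # mu is then chosen by minimizing mc_criterion over the ROS residuals
    # computed for each candidate mu in a user-supplied grid.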
3. Optimal breakdown value for sparse estimation

Before extending ROS for simultaneous variable selection and robust estimation, we present some general results on sparse estimation. Note that a sparse estimator cannot be regression equivariant. Here we present a weaker condition and show that the maximum breakdown value of estimators under this condition is the same as that for regression equivariant estimators. Recall that we always assume the regression matrix X in (1) to have full column rank, which implies n > p.

Definition. An estimator β̂ = β̂(X, y) of β in (1) is called nearly regression equivariant if there exists a regression equivariant estimator β̃ such that for any X and y,

∥β̂(X, y) − β̃(X, y)∥ ≤ c(X),    (17)

where c(X) does not rely on y. The following theorem is an extension of Theorem 4 in Chapter 3 of Rousseeuw and Leroy (1987).

Theorem 4. Any nearly regression equivariant estimator β̂ satisfies

ε_n(β̂) ≤ ([(n−p)/2] + 1)/n.

Many popular sparse estimators, like the lasso (Tibshirani, 1996), are nearly regression equivariant. We will show that the lasso and adaptive lasso (Zou, 2006) have this property. The lasso and adaptive lasso estimators are defined as the solutions of

min_{β_0, β} ∥y − Xβ∥² + 2λ ∑_{j=1}^p |β_j|    (18)

and

min_{β_0, β} ∥y − Xβ∥² + 2λ ∑_{j=1}^p |β_j|/|β̃_j|,    (19)
respectively, where β̃ = (β̃_1, …, β̃_p)′ is the OLS estimator.

Proposition 1. For each λ ≥ 0 in (18), the lasso estimator β̂ defined by (18) is nearly regression equivariant.

Proposition 2. For each λ ≥ 0 in (19), the adaptive lasso estimator β̂ defined by (19) is nearly regression equivariant.

Similar to the proof of Proposition 1, other sparse estimators, including the SCAD (Fan and Li, 2001), the elastic net (Zou and Hastie, 2005), and the MCP (Zhang, 2010), can be proven to be nearly regression equivariant.

4. SROS estimator

For fitting a sparse model, consider the ROS estimator β̃ given by (12) as the initial estimator of β. The initial estimator of δ is δ̃_0 = (δ̃_{01}, …, δ̃_{0n})′ = y − Xβ̃_0, where β̃_0 is the LTS estimator with h = [n/2] + [(p+1)/2] in (8). The final estimators β̂ = (β̂_1, …, β̂_p)′ and δ̂ = (δ̂_1, …, δ̂_n)′ are obtained by solving

min_{β_0, β, δ} ∥y − Xβ − δ∥² + 2λ ∑_{j=1}^p |β_j|/|β̃_j| + μ ∑_{i=1}^n δ_i²/δ̃_{0i}²,    (20)
where λ and μ > 0 are tuning parameters. We call this method sparse regression with outlier shrinkage (SROS). The difference between SROS and the robust lasso in (5) is that we use adaptive penalties on both β and δ. Moreover, as shown below, the quadratic penalty on δ makes the computation in our procedure feasible for large n. Similar to (11), the SROS estimator β̂ can be specified by

min_{β_0, β} (y − Xβ)′W(y − Xβ) + 2λ ∑_{j=1}^p |β_j|/|β̃_j|,    (21)
where W is the same as in (10). By (21), algorithms for computing the lasso, such as LARS (Efron et al., 2004) and coordinate descent (Friedman et al., 2007), can be used to compute SROS. Thus, unlike (5), SROS can be computed efficiently for large datasets.

4.1. Breakdown value of SROS

We can prove that the SROS estimator β̂ satisfies

∥β̂ − β̃∥ ≤ c,    (22)

where β̃ is the ROS estimator and c does not rely on y. This indicates that the SROS estimator β̂ is nearly regression equivariant. By (21), the proof of (22) is almost the same as that of Proposition 2, and we omit it.

Theorem 5. For each λ ≥ 0 and μ > 0 in (20), the SROS estimator β̂ is nearly regression equivariant.

Next consider the breakdown property of SROS.

Theorem 6. The SROS estimator β̂ has the same breakdown value as β̃.

Theorem 6 shows that SROS achieves the maximum breakdown value of nearly regression equivariant estimators.
4.2. Asymptotics of SROS

This subsection investigates the asymptotics of SROS with fixed p. Let the number of nonzero coefficients of β in (1) be p_1 with p_1 ≤ p. Partition β as

β = (β_1′, β_2′)′,    (23)

where β_2 = 0 and every component of β_1 is nonzero. According to (23), partition the SROS estimator β̂ in (21) as β̂ = (β̂_1′, β̂_2′)′. Fan and Li (2001) introduced the concept of the oracle property. A penalized regression estimator of β has the oracle property if it can not only select the correct submodel asymptotically, but also estimate the nonzero coefficients β_1 in (23) as efficiently as if the correct submodel were known in advance. Next we prove that SROS has the oracle property. Since the SROS estimator β̂ is a weighted least squares estimator, the asymptotic efficiency of β̂_1 achieves the optimum under the normal model. We need the following assumptions.

Assumption 4. As n → ∞,

X′X/n → Σ = ( Σ_1  Σ_12 ; Σ_21  Σ_2 ) > 0,

where Σ_1 is a p_1 × p_1 matrix.

Assumption 5. The tuning parameter λ = λ_n in (20) satisfies that, as n → ∞, λ_n → ∞ and λ_n/√n → 0.

Theorem 7. If Assumptions 2–5 hold, then as n → ∞,

(i) P(β̂_2 = 0) → 1;
(ii) √n (β̂_1 − β_1) →_d N(0, σ²Σ_1^{-1}).
4.3. Details of computing SROS

Note that the adaptive lasso we use in (20) is very close to Breiman (1995)'s nonnegative garrote (Zou, 2006). For selecting the tuning parameter λ in (20), we suggest using λ = log(n)σ̂²/2, where σ̂² is the estimator of σ² computed from (14). This selection, which was proposed by Xiong (2010) for the tuning parameter in the nonnegative garrote according to the BIC criterion (Schwarz, 1978), satisfies Assumption 5. The tuning parameter μ in (20) is selected in the same way as in Section 2.3.

For convenient use of SROS, we summarize its computation in the following steps.

Step 1. Begin with the LTS estimator β̃_0 in (8) with h = [n/2] + [(p+1)/2].
Step 2. Compute W in (10) and the ROS estimator β̃ in (12), with μ selected by minimizing MC in (15).
Step 3. Compute σ̂² in (14).
Step 4. Compute the SROS estimator β̂ in (21) with W and λ = log(n)σ̂²/2.
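To make Steps 1–4 concrete, here is a compact sketch (the function names are ours; the LTS fit of Step 1 and the grid search for μ of Step 2 are assumed to be supplied, and the intercept is omitted by assuming centered data). The adaptive, weighted problem (21) is reduced to an ordinary lasso by rescaling rows by √w_i and columns by |β̃_j|, and is then solved by coordinate descent in the spirit of Friedman et al. (2007).

    import numpy as np

    def weighted_adaptive_lasso(X, y, w, beta_ros, lam, n_iter=200):
        # Solve (21): min (y - Xb)'W(y - Xb) + 2*lam*sum(|b_j|/|beta_ros_j|)
        # via the substitution b = D*theta, D = diag(|beta_ros|), giving a plain lasso in theta.
        sw = np.sqrt(w)
        Z = (X * sw[:, None]) * np.abs(beta_ros)    # sqrt(W) X D
        yw = y * sw                                 # sqrt(W) y
        theta = np.zeros(X.shape[1])
        col_sq = (Z**2).sum(axis=0)                 # assumes all ROS coefficients are nonzero
        r = yw.copy()                               # residual for theta = 0
        for _ in range(n_iter):                     # cyclic coordinate descent
            for j in range(Z.shape[1]):
                r += Z[:, j] * theta[j]
                rho = Z[:, j] @ r
                theta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
                r -= Z[:, j] * theta[j]
        return np.abs(beta_ros) * theta             # back-transform to beta-hat

    def sros(X, y, beta_lts, mu):
        # Steps 2-4, given the LTS fit of Step 1 and a value of mu chosen by (15)
        n = len(y)
        w = mu / ((y - X @ beta_lts)**2 + mu)              # Eq. (10)
        XtW = X.T * w
        beta_ros = np.linalg.solve(XtW @ X, XtW @ y)       # Eq. (12), Step 2
        res = y - X @ beta_ros
        sigma2 = (w * res**2).sum() / w.sum()              # Eq. (14), Step 3
        lam = np.log(n) * sigma2 / 2.0                     # BIC-type choice of lambda
        return weighted_adaptive_lasso(X, y, w, beta_ros, lam)   # Eq. (21), Step 4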
5. Numerical examples

5.1. Simulations

We consider four different models. In each case, we generate data from the model

y = Xβ + ε,    (24)

where ε = (ε_1, …, ε_n)′ are random errors, and the p-dimensional predictors x_1, …, x_n are independently generated from N(0, C). Here the (i, j) entry of C is ρ^{|i−j|} for i, j = 1, …, p. In our simulation, we fix n = 50, p = 8, and ρ = 0.5, and consider two different β's in (24), β = (3, 2, 1.5, 1, 1, 1, 1, 1)′ and β = (3, 2, 1.5, 0, 0, 0, 0, 0)′.

Model A: Generate data from (24), where the error vector ε ∼ N(0, 4 I_n). There are no outliers under Model A.

Model B: Generate data from (24), where the errors ε_1, …, ε_n are independently and identically distributed from Student's t distribution with 2 degrees of freedom. There are no outliers, but the error distribution has a much heavier tail than the normal distribution.

Model C: Generate data from (24) with the same error distribution as Model A. Then replace y_i by y_i + (−1)^{I(U_1 < 1/2)}(20 + 10U_2) for i = 1, …, n/10, where U_1 and U_2 are independently generated from the U[0, 1] distribution and I is the indicator function. These n/10 pairs of (x, y) are outliers that deviate from the majority of the observations only in y-space.

Model D: Generate data from (24) with the same error distribution as Model A. Then replace x_{i1} and y_i by x_{i1} + 10 and y_i + (−1)^{I(U_1 < 1/2)}(20 + 10U_2) for i = 1, …, n/5, respectively, where U_1 and U_2 are independently generated from the U[0, 1] distribution. These n/5 pairs of (x, y) are leverage points.
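For reference, a sketch of how data from Models A–D can be generated following the description above; the function name and the random number generator handling are ours.

    import numpy as np

    def simulate(model, beta, n=50, rho=0.5, rng=None):
        # generate (X, y) from (24) under Model "A", "B", "C", or "D"
        rng = np.random.default_rng(0) if rng is None else rng
        beta = np.asarray(beta, dtype=float)
        p = beta.size
        idx = np.arange(p)
        C = rho ** np.abs(np.subtract.outer(idx, idx))        # C_ij = rho^|i-j|
        X = rng.multivariate_normal(np.zeros(p), C, size=n)
        if model == "B":
            eps = rng.standard_t(df=2, size=n)                # heavy-tailed t(2) errors
        else:
            eps = rng.normal(0.0, 2.0, size=n)                # N(0, 4) errors
        y = X @ beta + eps
        if model in ("C", "D"):
            m = n // 10 if model == "C" else n // 5
            sign = np.where(rng.uniform(size=m) < 0.5, -1.0, 1.0)
            y[:m] += sign * (20 + 10 * rng.uniform(size=m))   # contaminate the first m responses
            if model == "D":
                X[:m, 0] += 10                                # leverage points in x-space
        return X, y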
Seven estimators of β are compared: (i) LTS with maximum breakdown value, defined by (8); (ii) REWLSE (Gervini and Yohai, 2002); (iii) ΘIPOD (She and Owen, 2011); (iv) ROS, defined by (12); (v) LAD-lasso (Wang et al., 2007); (vi) ACQR (Zou and Yuan, 2008); (vii) SROS, defined by (21). For an estimator β̂, the squared error (SE) is defined as ∥β̂ − β∥². For 100 replicated datasets generated from the four models, the SE's of the seven estimators are compared in Figs. 1 and 2. We also show the performance of LAD-lasso, ACQR, and SROS in variable selection when β = (3, 2, 1.5, 0, 0, 0, 0, 0)′ in Table 1. In Table 1, for an estimator β̂, CS denotes the rate of correct selection (i.e., {j : β̂_j ≠ 0} = {j : β_j ≠ 0}), CR denotes the rate of correct variable reduction (i.e., {j : β̂_j ≠ 0} ⊃ {j : β_j ≠ 0}), and AN denotes the average number of indices in {j : β̂_j ≠ 0}. We repeated all the simulations for n = 100 and obtained essentially the same results.

It can be observed from the simulation results that: (i) REWLSE, ΘIPOD, and ROS all improve much on their initial estimator LTS in terms of estimation accuracy; (ii) ROS has at least comparable performance with REWLSE and ΘIPOD, and performs much better than them under Models A and B; (iii) as expected, LAD-lasso and ACQR, like LAD, cannot resist leverage points in terms of estimation accuracy, while SROS estimates β quite well under all models when β is sparse; (iv) the selection performance of LAD-lasso and ACQR is also influenced by leverage points, but SROS yields good results in variable selection under all models. We conclude that the proposed ROS and SROS estimators are quite competitive in robust (sparse) regression.

5.2. A real example

We now study the relationship between per-capita GDP (gross domestic product) and some indices of development using linear regression. All data used in this example can be found on the internet (available from the authors). The response and predictors are as follows:

y: per-capita GDP (hundred RMB)
x1: per-capita export–import volume (hundred US dollar)
x2: per-capita grain output (ton)
x3: per-capita retail sales of consumer goods (hundred RMB)
x4: per-capita investment in the fixed assets (hundred RMB)
x5: per-capita actually utilized foreign capital (thousand US dollar)

Fig. 1. Box plots of SE's when β = (3, 2, 1.5, 1, 1, 1, 1, 1)′ and n = 50. Panels: Models A–D; estimators: LTS, REWLSE, ΘIPOD, ROS, LAD-lasso, ACQR, SROS.
Fig. 2. Box plots of SE's when β = (3, 2, 1.5, 0, 0, 0, 0, 0)′ and n = 50. Panels: Models A–D; estimators: LTS, REWLSE, ΘIPOD, ROS, LAD-lasso, ACQR, SROS.
Table 1
Comparisons of LAD-lasso, ACQR, and SROS in variable selection.

Method        CS      CR      AN
Model A
  LAD-lasso   0.50    0.99    3.57
  ACQR        0.45    0.99    5.23
  SROS        0.47    1.00    3.61
Model B
  LAD-lasso   0.77    1.00    3.25
  ACQR        0.45    1.00    5.24
  SROS        0.56    1.00    3.51
Model C
  LAD-lasso   0.52    0.97    3.57
  ACQR        0.22    0.99    4.57
  SROS        0.73    0.98    3.29
Model D
  LAD-lasso   0.23    0.57    3.60
  ACQR        0.49    0.64    5.09
  SROS        0.61    0.98    3.42

CS: rate of correct selection. CR: rate of correct variable reduction. AN: average number of selected variables.
Table 2
Estimation results in the real example.

Variable   OLS        Adaptive lasso   LTS        LAD-lasso   SROS
x1         6.6917     3.2522           5.3146     5.8830      4.5862
x2         2.2027     0                2.2742     1.8978      1.5301
x3         18.5197    17.3389          14.7314    16.8831     15.5426
x4         2.4721     0                11.7606    4.5111      6.9681
x5         −3.6577    −1.3480          −3.0310    −3.2862     −2.2498
x6         23.1105    24.1088          24.0391    23.9643     23.1845
x7         −0.8621    0                1.7798     0           0
x8         −10.9205   −1.8635          −3.6210    0           0
x9         5.8976     0                8.3127     0           1.8159

Fig. 3. Residual plots for the OLS, adaptive lasso, LTS, and SROS estimators, where the dotted lines denote ±3σ̂. Here σ̂² used for the OLS and adaptive lasso is ∥y − Xβ̂_OLS∥²/(n − p − 1), σ̂ used for the LTS is given by (9), and σ̂ used for the SROS is given by (14).
x6: per-capita value-added of industry (hundred RMB)
x7: birthrate (‰)
x8: per-capita net income of rural people (RMB)
x9: per-capita disposable income of urban residents (RMB)

The data for the year 2008 come from 50 Chinese cities. After standardizing the data, we use the OLS, the adaptive lasso, the LTS, the LAD-lasso, and the SROS to estimate the coefficients β in (1); the results are shown in Table 2. The residual plots are shown in Fig. 3.

We first observe that there is at least one outlier, the 35th observation, from the residual plots of the OLS and adaptive lasso. Therefore, the performance of these two estimators may suffer because of the outliers. When using the LTS, the residuals involved in (8) are very close to zero, while the other residuals are much farther from zero. This leads to the phenomenon (see Fig. 3) that too many observations are flagged as outliers, and hence there is some loss in efficiency. Now turn to the residual plot of the SROS. Four observations are detected as outliers by the SROS, and the residuals appear to be distributed more reasonably. Thus, we believe that the SROS improves on the LTS in terms of efficiency.

From Table 2, the SROS indicates that x7 and x8 are the least important predictors of the per-capita GDP. It is worth noting that the adaptive lasso did not select x4, which is shown to be an active predictor by the LAD-lasso and SROS. This is also an example of how the variable selection performance of a penalized least squares method can be affected by outliers.

6. Concluding remarks

In this paper, we have developed a new robust regression method, known as regression with outlier shrinkage (ROS). It has optimal breakdown value and optimal efficiency. Although REWLSE is the first method that achieves these two aspects of optimality, ROS has a simpler form and better performance. We have also extended ROS for sparse estimation. This method, known as sparse regression with outlier shrinkage (SROS), avoids the confounding between the model parameters and outliers through penalized estimation.
To the best of our knowledge, this is the first sparse estimator with optimal breakdown value and optimal efficiency. Moreover, SROS has better performance and is computationally more efficient than the competing methods.

The SROS can be modified to apply to other variable selection problems, including selection of grouped variables (Yuan and Lin, 2006; Huang et al., 2009; Zhao et al., 2009; Xiong, 2010) and ordered variables (Tibshirani et al., 2005), selection under multicollinearity (Zou and Hastie, 2005; Xiong, 2010), and selection with multiple responses (Turlach et al., 2005). This paper only considers the case of p < n. Recently some authors have discussed robust variable selection for the case of p ≥ n (Khan et al., 2007; Gather and Guddat, 2008). For this case, a two-stage procedure (Fan and Lv, 2008) can be used to estimate the sparse β. In the first stage, the full model {1, …, p} is shrunk down to a submodel of size d < n by one of these robust selection techniques. SROS can be used in the second stage to give a robust and efficient estimator.
Acknowledgments

The research of Xiong was supported by the National Natural Science Foundation of China (Grant no. 11271355). The research of Joseph was supported by the U.S. National Science Foundation Grant CMMI-0448774.

Appendix

Proof of Theorem 1. Since the LTS estimator β̃_0 is regression equivariant, we have

δ̃_0(X, y + Xα) = y + Xα − Xβ̃_0(X, y + Xα) = y + Xα − X(β̃_0(X, y) + α) = δ̃_0(X, y).

Therefore W(X, y + Xα) = W(X, y). It follows that

β̃(X, y + Xα) = (X′WX)^{-1}X′W(y + Xα) = β̃(X, y) + α.

This completes the proof. □
Proof of Theorem 2. Let X* be obtained by replacing the first m = [(n−p)/2] observations by arbitrary points. By (13), we need to show that the solution of (7), (β̃, δ̃), is bounded over X*. Let the objective function in (7) be f. If ∑_{i=1}^n δ̃_i²/δ̃_{0i}² > n, then

f(β̃, δ̃) ≥ μ ∑_{i=1}^n δ̃_i²/δ̃_{0i}² > μn = f(β̃_0, δ̃_0).

This is a contradiction. Therefore ∑_{i=1}^n δ̃_i²/δ̃_{0i}² ≤ n. Since δ̃_{0i} and δ̃_i, i = m+1, …, n, are all bounded over X*, we have

f(β̃, δ̃) ≥ ∑_{i=m+1}^n (y_i − x_i′β̃ − δ̃_i)² = ∑_{i=m+1}^n (y_i − δ̃_i)² + β̃′(∑_{i=m+1}^n x_i x_i′)β̃ − 2(∑_{i=m+1}^n (y_i − δ̃_i)x_i)′β̃
    ≥ ∑_{i=m+1}^n (y_i − δ̃_i)² + λ_min ∥β̃∥² − 2c∥β̃∥,

where λ_min > 0 is the smallest eigenvalue of ∑_{i=m+1}^n x_i x_i′ and c > 0 is bounded over X*. Therefore β̃ lies in a bounded region:

λ_min ∥β̃∥² − 2c∥β̃∥ ≤ f(β̃_0, δ̃_0) − ∑_{i=m+1}^n (y_i − δ̃_i)².

This completes the proof. □
Lemma 1. Under Assumptions 1 and 3, ∑_{i=1}^n (1 − w_i) = O_p(n^{-1/2}).

Proof. We have

∑_{i=1}^n (1 − w_i) = ∑_{i=1}^n (y_i − x_i′β̃_0)²/(μ + (y_i − x_i′β̃_0)²) = ∑_{i=1}^n [x_i′(β − β̃_0) + ε_i]²/(μ + [x_i′(β − β̃_0) + ε_i]²).

Since the function g(x) = x/(μ + x) is concave on [0, +∞), by Jensen's inequality,

∑_{i=1}^n (1 − w_i) ≤ n (∑_{i=1}^n [x_i′(β − β̃_0) + ε_i]²/n) / (μ + ∑_{i=1}^n [x_i′(β − β̃_0) + ε_i]²/n).    (25)

Note that

(1/n) ∑_{i=1}^n [x_i′(β − β̃_0) + ε_i]² ≤ (2/n) ∑_{i=1}^n [x_i′(β − β̃_0)]² + (2/n) ∑_{i=1}^n ε_i² = 2(β − β̃_0)′(X′X/n)(β − β̃_0) + (2/n) ∑_{i=1}^n ε_i² = O_p(1).    (26)

By (25) and (26),

∑_{i=1}^n (1 − w_i) = O_p(n/μ),

and this completes the proof. □

Lemma 2. Let {A_n} and {B_n} be two sequences of random positive definite matrices satisfying B_n →_P B > 0 and A_n = B_n + o_p(a_n^{-1}), where {a_n} is a sequence of positive numbers tending to infinity. Then A_n^{-1} = B_n^{-1} + o_p(a_n^{-1}).

Proof. Since B_n^{-1} →_P B^{-1}, we have

a_n(A_n^{-1} − B_n^{-1})A_n = a_n(I − B_n^{-1}A_n) = B_n^{-1}[a_n(B_n − A_n)] →_P 0.

It follows from A_n^{-1} →_P B^{-1} that a_n(A_n^{-1} − B_n^{-1}) →_P 0. □
Proof of Theorem 3. If we can prove that

[(X′WX)^{-1}X′W − (X′X)^{-1}X′]ε = o_p(n^{-1/2}),    (27)

which implies β̃ = (X′X)^{-1}X′y + o_p(n^{-1/2}), then the proof is completed. Let

K = (X′WX)^{-1}X′W − (X′X)^{-1}X′ = [(X′WX)^{-1} − (X′X)^{-1}]X′ + (X′WX)^{-1}X′(W − I_n) = K_1 + K_2.

We have

K_1 ε = [(X′WX/n)^{-1} − (X′X/n)^{-1}] (X′ε/n).    (28)

By Lemma 1 and Assumption 2,

tr(X′X/n − X′WX/n) = ∑_{i=1}^n (1 − w_i) tr(x_i x_i′)/n = ∑_{i=1}^n (1 − w_i) x_i′x_i/n ≤ (max_{1≤i≤n} x_i′x_i/n) ∑_{i=1}^n (1 − w_i) = o_p(n^{-1/2}),

where "tr" denotes the trace. By the equivalence of matrix norms,

X′X/n − X′WX/n = o_p(n^{-1/2}).    (29)

By Lemma 2,

(X′WX/n)^{-1} − (X′X/n)^{-1} = o_p(n^{-1/2}).    (30)

Note that for j = 1, …, p,

|x_{1j}ε_1 + ⋯ + x_{nj}ε_n|/n ≤ ((x_{1j}² + ⋯ + x_{nj}²)/n)^{1/2} ((ε_1² + ⋯ + ε_n²)/n)^{1/2} = O_p(1),

which implies X′ε/n = O_p(1). By (28) and (30),

K_1 ε = o_p(n^{-1/2}).    (31)

Consider

K_2 ε = (X′WX)^{-1}X′(W − I_n)ε = (X′WX/n)^{-1} X′(W − I_n)ε/n.

For j = 1, …, p, by Lemma 1,

|(1 − w_1)x_{1j}ε_1 + ⋯ + (1 − w_n)x_{nj}ε_n|/n ≤ ((x_{1j}² + ⋯ + x_{nj}²)/n)^{1/2} ((ε_1² + ⋯ + ε_n²)/n)^{1/2} ((1 − w_1)² + ⋯ + (1 − w_n)²)^{1/2}
    ≤ ((x_{1j}² + ⋯ + x_{nj}²)/n)^{1/2} ((ε_1² + ⋯ + ε_n²)/n)^{1/2} ((1 − w_1) + ⋯ + (1 − w_n)) = o_p(n^{-1/2}).

Therefore

X′(W − I_n)ε/n = o_p(n^{-1/2}),    (32)

and then

K_2 ε = o_p(n^{-1/2}).    (33)

Combining (31) and (33), we get (27). □

Proof of Theorem 4. We need to show that β̂(X, y) will go to infinity if [(n−p)/2] + 1 observations can take arbitrary values. Denote m = [(n−p)/2] + 1 and q = n − m. Consider all X* that contain q observations of X. Note that there exists a p-dimensional vector α ≠ 0 such that x_1′α = ⋯ = x_{p−1}′α = 0. Since 2q − p + 1 ≤ n, the first 2q − p + 1 observations of X can be replaced by

(x_1, y_1), …, (x_{p−1}, y_{p−1}), (x_p, y_p), …, (x_q, y_q), (x_p, y_p + r x_p′α), …, (x_q, y_q + r x_q′α)

for any r > 0. Denote the new sample and its corresponding response vector by X_1* and y_1*, respectively. Consider another new sample X_2* constructed by replacing y_1* by y_2* = y_1* − rXα. Note that both X_1* and X_2* contain q observations of X. Let β̃ be the regression equivariant estimator that satisfies (17). We have

∥β̂(X_1*) − β̂(X_2*)∥ ≥ ∥β̃(X_1*) − β̃(X_2*)∥ − ∥β̂(X_1*) − β̃(X_1*)∥ − ∥β̂(X_2*) − β̃(X_2*)∥ = r∥α∥ − ∥β̂(X_1*) − β̃(X_1*)∥ − ∥β̂(X_2*) − β̃(X_2*)∥.    (34)

By (17), the last two terms on the right side of (34) do not rely on r. It follows that

∥β̂(X_1*) − β̂(X_2*)∥ → ∞

as r → ∞. Then it is impossible that β̂(X_1*) and β̂(X_2*) are both bounded. □
Proof of Proposition 1. By the iterative formula of the OEM algorithm (Xiong et al., 2011), the lasso estimator β̂ with tuning parameter λ from (18) can be written as

β̂ = (1/γ_max) S(X′y + (γ_max I_{p+1} − X′X)β̂, λ),    (35)

where γ_max is the largest eigenvalue of X′X and

S(u_1, …, u_p; λ) = (S(u_1, λ), …, S(u_p, λ))′    (36)

is a p-vector valued function with S(u, λ) = sign(u)(|u| − λ)_+. The OLS estimator β̃ = (X′X)^{-1}X′y satisfies (35) with λ = 0. Since |S(u, λ) − u| ≤ λ for any u ∈ R, we have

β̂ − β̃ = (1/γ_max) S(X′y + (γ_max I_{p+1} − X′X)β̂, λ) − (1/γ_max)(X′y + (γ_max I_{p+1} − X′X)β̃) = (I_{p+1} − γ_max^{-1}X′X)(β̂ − β̃) + γ_max^{-1}c,    (37)

where c = (c_1, …, c_p)′ satisfies |c_j| ≤ λ for j = 1, …, p. By (37),

∥β̂ − β̃∥ ≤ γ_min^{-1}∥c∥ ≤ γ_min^{-1} p^{1/2} λ,    (38)

where γ_min is the smallest eigenvalue of X′X. Note that the right-hand side of (38) does not rely on y and that β̃ is regression equivariant. This completes the proof. □

Proof of Proposition 2. Let β̃ = (β̃_0, β̃′)′ denote the OLS estimator. If ∑_{j=1}^p |β̂_j|/|β̃_j| > p, then

∥y − 1_n β̂_0 − Xβ̂∥² + 2λ ∑_{j=1}^p |β̂_j|/|β̃_j| > ∥y − 1_n β̂_0 − Xβ̂∥² + 2λp ≥ ∥y − 1_n β̃_0 − Xβ̃∥² + 2λp,

which leads to a contradiction. Therefore,

∑_{j=1}^p |β̂_j|/|β̃_j| ≤ p.    (39)

Denote I = {j = 1, …, p : |β̃_j| ≤ 1} and J = {j = 1, …, p : |β̃_j| > 1}. For j ∈ I, by (39) we have |β̂_j| ≤ p|β̃_j| ≤ p and |β̃_j| ≤ 1, so

|β̂_j − β̃_j| ≤ p + 1.    (40)

For j = 1, …, p, similar to (35),

β̂_j = (1/γ_max) S(x_j′y + α_j′β̂, λ/|β̃_j|),

where S and γ_max are the same as in (36), and x_j and α_j are the (j+1)th columns of X and A = γ_max I_{p+1} − X′X, respectively. For j ∈ J, |β̂_j − γ_max^{-1}(x_j′y + α_j′β̂)| ≤ γ_max^{-1}λ/|β̃_j| ≤ γ_max^{-1}λ, and then, using β̃_j = γ_max^{-1}(x_j′y + α_j′β̃),

β̂_j − β̃_j = [β̂_j − γ_max^{-1}(x_j′y + α_j′β̂)] + γ_max^{-1}(x_j′y + α_j′β̂) − β̃_j = γ_max^{-1}α_j′(β̂ − β̃) + [β̂_j − γ_max^{-1}(x_j′y + α_j′β̂)].    (41)

For j = 0, noting that β̂_0 = ∑_{i=1}^n (y_i − x_i′β̂)/n and β̃_0 = ∑_{i=1}^n (y_i − x_i′β̃)/n, we have

β̂_0 − β̃_0 = q′(β̂ − β̃),    (42)

where q′ = −(0, ∑_{i=1}^n x_i′/n). Combining (40)–(42), we have for j = 0, 1, …, p,

β̂_j − β̃_j = b_j′(β̂ − β̃) + c_j,    (43)

where b_0 = q, b_j = 0 for j ∈ I, b_j = γ_max^{-1}α_j for j ∈ J, and |c_j| ≤ max{p + 1, γ_max^{-1}λ} for j = 0, 1, …, p. Consider the matrix B′ = (b_0, b_1, …, b_p). Note that I_{p+1} − γ_max^{-1}A is invertible, which implies that I_{p+1} − B is invertible. By (43),

∥β̂ − β̃∥ ≤ |μ_min|^{-1}∥c∥,

where μ_min² is the smallest eigenvalue of (I_{p+1} − B)′(I_{p+1} − B) and c = (c_0, c_1, …, c_p)′. This completes the proof. □

Proof of Theorem 6. Let X* be obtained by replacing the first m = [(n−p)/2] observations by arbitrary points. We need to show that the solution of (20), (β̂, δ̂), is bounded over X*. The objective function in (20) is

f(β, δ) = ∥y − Xβ − δ∥² + 2λ ∑_{j=1}^p |β_j|/|β̃_j| + μ ∑_{i=1}^n δ_i²/δ̃_{0i}² = ∑_{i=1}^n (y_i − x_i′β − δ_i)² + 2λ ∑_{j=1}^p |β_j|/|β̃_j| + μ ∑_{i=1}^n δ_i²/δ̃_{0i}²,

where (x_1, y_1), …, (x_m, y_m) can take arbitrary values but (x_{m+1}, y_{m+1}), …, (x_n, y_n) are fixed. If 2λ ∑_{j=1}^p |β_j|/|β̃_j| + μ ∑_{i=1}^n δ_i²/δ̃_{0i}² > 2λp + μn, then f(β, δ) > f(β̃_0, δ̃_0). Therefore, for all X*,

2λ ∑_{j=1}^p |β̂_j|/|β̃_j| + μ ∑_{i=1}^n δ̂_i²/δ̃_{0i}² ≤ 2λp + μn.

This indicates that β̂ and δ̂ both lie in a bounded set. □

Proof of Theorem 7. By the proof of Theorem 2 in Zou (2006), we only need to show that

X′WX/n →_P Σ  and  X′Wε/√n →_d N(0, σ²Σ),

which follow from (29) and (32), respectively. □
References

Agostinelli, C., Markatou, M., 1998. A one-step robust estimator for regression based on the weighted likelihood reweighting scheme. Statistics & Probability Letters 37, 341–350.
Alfons, A., Croux, C., Gelper, S., 2013. Sparse least trimmed squares regression for analyzing high-dimensional large data sets. Annals of Applied Statistics 7, 226–248.
Bradic, J., Fan, J., Wang, W., 2011. Penalized composite quasi-likelihood for ultrahigh-dimensional variable selection. Journal of the Royal Statistical Society: Series B 73, 325–349.
Breiman, L., 1995. Better subset regression using the nonnegative garrote. Technometrics 37, 373–384.
Chen, X., Wang, Z.J., McKeown, M.J., 2010. Asymptotic analysis of robust LASSOs in the presence of noise with large variance. IEEE Transactions on Information Theory 56, 5131–5149.
Čížek, P., 2010. Reweighted least trimmed squares: an alternative to one-step estimators. CentER Discussion Paper Series No. 2010-100. Available at SSRN: 〈http://ssrn.com/abstract=1685903〉.
Donoho, D.L., Huber, P.J., 1983. The notion of breakdown point. In: Bickel, P., Doksum, K., Hodges, J. (Eds.), A Festschrift for Erich Lehmann. Wadsworth, Belmont, CA, pp. 157–184.
Efron, B., Hastie, T., Johnstone, I., Tibshirani, R., 2004. Least angle regression. Annals of Statistics 32, 407–451.
Fan, J., Li, R., 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96, 1348–1360.
Fan, J., Lv, J., 2008. Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society: Series B 70, 849–911.
Friedman, J., Hastie, T., Hofling, H., Tibshirani, R., 2007. Pathwise coordinate optimization. Annals of Applied Statistics 1, 302–332.
Gather, U., Guddat, C., 2008. Discussion on 'Sure independence screening for ultrahigh dimensional feature space'. Journal of the Royal Statistical Society: Series B 70, 893–895.
Gervini, D., Yohai, V.J., 2002. A class of robust and fully efficient regression estimators. Annals of Statistics 30, 583–616.
Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., Stahel, W.A., 1986. Robust Statistics: The Approach Based on Influence Functions. Wiley, New York.
Holland, P.W., Welsch, R.E., 1977. Robust regression using iteratively reweighted least-squares. Communications in Statistics—Theory and Methods 6, 813–827.
Huang, J., Ma, S., Xie, H., Zhang, C.-H., 2009. A group bridge approach for variable selection. Biometrika 96, 339–355.
Jureckova, J., Portnoy, S., 1987. Asymptotics for one-step M-estimators in regression with application to combining efficiency and high breakdown point. Communications in Statistics—Theory and Methods 16, 2187–2199.
Khan, J.A., Van Aelst, S., Zamar, R.H., 2007. Robust linear model selection based on least angle regression. Journal of the American Statistical Association 102, 1289–1299.
Lee, Y., MacEachern, S.N., Jung, Y., 2007. Regularization of Case-specific Parameters for Robustness and Efficiency. Technical Report No. 799, The Ohio State University.
Li, B., Yu, Q., 2009. Robust and sparse bridge regression. Statistics and its Interface 2, 481–491.
McCann, L., Welsch, R.E., 2007. Robust variable selection using least angle regression and elemental set sampling. Computational Statistics & Data Analysis 52, 249–257.
Maronna, R.A., Martin, R.D., Yohai, V.J., 2006. Robust Statistics: Theory and Methods. Wiley, New York.
Menjoge, R.S., Welsch, R.E., 2010. A diagnostic method for simultaneous feature selection and outlier identification in linear regression. Computational Statistics & Data Analysis 54, 3181–3193.
Pison, G., Van Aelst, S., Willems, G., 2002. Small sample corrections for LTS and MCD. Metrika 55, 111–123.
Rousseeuw, P.J., 1984. Least median of squares regression. Journal of the American Statistical Association 79, 871–880.
Rousseeuw, P.J., Leroy, A.M., 1987. Robust Regression and Outlier Detection. Wiley, New York.
Rousseeuw, P.J., Van Driessen, K., 2006. Computing LTS regression for large data sets. Data Mining and Knowledge Discovery 12, 29–45.
Rousseeuw, P.J., Yohai, V.J., 1984. Robust regression by means of S-estimators. In: Franke, J., Hardle, W., Martin, R. (Eds.), Robust and Nonlinear Time Series Analysis. Lecture Notes in Statistics, vol. 26. Springer, New York, pp. 256–272.
Schwarz, G., 1978. Estimating the dimension of a model. Annals of Statistics 6, 461–464.
She, Y., Owen, A.B., 2011. Outlier detection using nonconvex penalized regression. Journal of the American Statistical Association 106, 626–639.
Siegel, A.F., 1982. Robust regression using repeated medians. Biometrika 69, 242–244.
Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B 58, 267–288.
Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., Knight, K., 2005. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B 67, 91–108.
Turlach, B.A., Venables, W.N., Wright, S.J., 2005. Simultaneous variable selection. Technometrics 47, 349–363.
Wang, H., Li, G., Jiang, G., 2007. Robust regression shrinkage and consistent variable selection through the LAD-lasso. Journal of Business & Economic Statistics 25, 347–355.
Wang, L., Li, R., 2009. Weighted Wilcoxon-type smoothly clipped absolute deviation method. Biometrics 65, 564–571.
Xiong, S., 2010. Some notes on the nonnegative garrote. Technometrics 52, 349–361.
Xiong, S., Dai, B., Qian, P.Z.G., 2011. Orthogonalizing Penalized Regression. Technical Report. Available at: 〈http://arxiv.org/abs/1108.0185〉.
Yohai, V.J., 1987. High breakdown point and high efficiency robust estimates for regression. Annals of Statistics 15, 642–656.
Yohai, V.J., Zamar, R.H., 1988. High breakdown point estimates of regression by means of the minimization of an efficient scale. Journal of the American Statistical Association 83, 406–413.
Yuan, M., Lin, Y., 2006. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B 68, 49–68.
Zhang, C.-H., 2010. Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics 38, 894–942.
Zhao, P., Rocha, G., Yu, B., 2009. The composite absolute penalties family for grouped and hierarchical variable selection. Annals of Statistics 37, 3468–3497.
Zou, H., 2006. The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101, 1418–1429.
Zou, H., Hastie, T., 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B 67, 301–320.
Zou, H., Yuan, M., 2008. Composite quantile regression and the oracle model selection theory. Annals of Statistics 36, 1108–1126.