A note on the one-step estimator for ultrahigh dimensionality

Journal of Computational and Applied Mathematics 260 (2014) 91–98. http://dx.doi.org/10.1016/j.cam.2013.09.037

Mingqiu Wang [a], Xiuli Wang [a], Xiaoguang Wang [b,*]

[a] School of Mathematical Sciences, Qufu Normal University, Shandong, 273165, PR China
[b] School of Mathematical Sciences, Dalian University of Technology, Dalian, 116023, PR China
[*] Corresponding author. E-mail addresses: [email protected], [email protected] (X. Wang).

Article history: Received 2 March 2013; received in revised form 23 July 2013.

Abstract

The one-step estimator, which covers various penalty functions, enjoys the oracle property provided a good initial estimator is available. In low-dimensional settings the initial estimator can be the least squares or maximum likelihood estimator; neither is available under ultrahigh dimensionality. In this paper, we study the one-step estimator whose initial values are the marginal ordinary least squares estimates in the ultrahigh-dimensional linear model. Under appropriate conditions, we show that the one-step estimator is selection consistent. Finite sample performance of the proposed procedure is assessed by Monte Carlo simulation studies.

© 2013 Elsevier B.V. All rights reserved.

MSC: 33B15; 26D15; 60E15

Keywords: Cross validation; High-dimensional data; Model selection consistency; One-step estimator; Partial orthogonality

1. Introduction

Consider the linear regression model
$$Y_i = \beta_0 + \sum_{j=1}^{p_n} X_{ij}\beta_j + \varepsilon_i, \qquad i = 1, \ldots, n,$$
where $Y_i$ is the response variable, $X_{ij}$ is the covariate or design variable and $\varepsilon_i$ is the error term. In many applications, such as studies involving microarray or mass spectrum data, the total number of covariates $p_n$ can be large or even much larger than $n$, but the number of important covariates is typically smaller than $n$. Without loss of generality, assume that the outcome is centered and the predictors are standardized, i.e. $\sum_{i=1}^n Y_i = 0$, $\sum_{i=1}^n X_{ij} = 0$ and $n^{-1}\sum_{i=1}^n X_{ij}^2 = 1$, $j = 1, \ldots, p_n$, so that the intercept $\beta_0$ is not included in the regression function.

Zou and Li [1] proposed the one-step sparse estimator, which is defined by minimizing
$$\frac{1}{n}\sum_{i=1}^n \left(Y_i - \sum_{j=1}^{p_n} X_{ij}\beta_j\right)^2 + \sum_{j=1}^{p_n} p'_{\lambda_n}(|\tilde{\beta}_j|)\,|\beta_j|, \qquad (1.1)$$

where $\tilde{\beta} = (\tilde{\beta}_1, \ldots, \tilde{\beta}_{p_n})^T$ is the initial value and $p'_{\lambda_n}(\cdot)$ is the derivative of the penalty function. Several important penalty functions have been proposed, including the bridge [2] with $p_{\lambda_n}(|t|) = \lambda_n|t|^q$, the least absolute shrinkage and selection operator (Lasso, [3]) with $p_{\lambda_n}(|t|) = \lambda_n|t|$, the smoothly clipped absolute deviation (SCAD) penalty [4] with
$$p_{\lambda_n}(|t|, a) = \lambda_n|t|\,I(|t| < \lambda_n) + \frac{(a^2-1)\lambda_n^2 - [(a\lambda_n - |t|)_+]^2}{2(a-1)}\,I(|t| \ge \lambda_n),$$
and the minimax concave penalty (MCP, [5]) with $p_{\lambda_n}(|t|, \gamma) = [\lambda_n|t| - t^2/(2\gamma)]\,I(|t| < \gamma\lambda_n) + (\gamma\lambda_n^2/2)\,I(|t| \ge \gamma\lambda_n)$, and so on.

High-dimensional data analysis has become increasingly frequent and important in diverse fields of science, engineering, and the humanities, and much progress has been made in the ultrahigh-dimensional linear model. Meinshausen and Bühlmann [6] and Zhao and Yu [7] studied the variable selection consistency of the Lasso when the number of covariates is much larger than the sample size. Huang, Horowitz and Ma [8] studied the bridge estimator in the sparse high-dimensional linear model. Huang, Ma and Zhang [9] studied the asymptotic properties of the high-dimensional adaptive Lasso estimator. Fan and Lv [10] proposed sure independence screening for high-dimensional regression problems. However, all of the foregoing results apply only to a specific penalty; there is no general framework that covers various penalties. As suggested by Zou and Li [1], the initial value $\tilde{\beta}$ in the objective function (1.1) is often chosen as the ordinary least squares estimate or the maximum likelihood estimate. However, these estimates cannot be obtained in the ultrahigh-dimensional case, so it is a challenge to study the theoretical properties of the one-step estimator under ultrahigh dimensionality. Huang, Ma and Zhang [9] suggested that the marginal ordinary least squares estimates can be chosen as the initial values even though they are not $\sqrt{n}$-consistent estimators of the parameters. In this paper, we study the one-step estimator with the initial estimator being the marginal ordinary least squares estimates in the ultrahigh-dimensional linear model. Under appropriate conditions, we show that the one-step estimator is selection consistent. Finite sample performance of the proposed procedure is assessed by Monte Carlo simulation studies.

The rest of the article is organized as follows. Section 2 states the results on model selection under the partial orthogonality condition and some other appropriate conditions. In Section 3, we give some simulation studies to assess the performance of the proposed method. Section 4 gives some conclusions. Technical proofs are relegated to the Appendix.

2. Model selection consistency

Let the true parameter be $\beta_0 = (\beta_{01}, \ldots, \beta_{0p_n})^T$. Denote $A = \{j : \beta_{0j} \ne 0,\ j = 1, \ldots, p_n\}$, the index set of nonzero coefficients in the underlying model. Let $Y = (Y_1, \ldots, Y_n)^T$, $X_j = (X_{1j}, \ldots, X_{nj})^T$, $X = (X_1, \ldots, X_{p_n})$ and $\varepsilon = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n)^T$. The cardinality of $A$ is denoted by $|A|$, and $|A| = k_n$. Let $X_A = (X_j, j \in A)$, $\Sigma = n^{-1}X^TX$ and $\Sigma_A = n^{-1}X_A^TX_A$. Let $\beta_n^* = \min\{|\beta_{0j}|, j \in A\}$ and $\max_{j\in A}|\beta_{0j}| = O(1)$.

Assumption 1. There exists a constant $c_0$ such that the covariates with zero coefficients and those with nonzero coefficients are weakly correlated, i.e. $\left|n^{-1/2}\sum_{i=1}^n X_{ij}X_{ik}\right| \le c_0$ for $j \in A$, $k \notin A$.

The estimated marginal regression coefficient is

$$\tilde{\beta}_{nj} = \frac{\sum_{i=1}^n X_{ij}Y_i}{\sum_{i=1}^n X_{ij}^2} = \frac{1}{n}X_j^TY.$$
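As an illustration (ours, not the authors' code), the marginal estimates are componentwise least-squares fits and reduce to $n^{-1}X_j^TY$ once $Y$ is centred and the columns of $X$ are standardised; the NumPy sketch below assumes exactly that preprocessing.

```python
import numpy as np

def marginal_ols(X, Y):
    """Componentwise least-squares fits tilde(beta)_nj = X_j^T Y / sum_i X_ij^2.

    If Y is centred and each column satisfies n^{-1} * sum_i X_ij^2 = 1,
    this reduces to n^{-1} X_j^T Y, as in the text."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    return (X.T @ Y) / np.sum(X ** 2, axis=0)
```

These values serve only as initial estimates; as noted in the Introduction, they are not themselves $\sqrt{n}$-consistent estimators of $\beta_{0j}$.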

Take $\eta_{nj} = E(\tilde{\beta}_{nj}) = n^{-1}\sum_{l\in A}\left(\sum_{i=1}^n X_{ij}X_{il}\right)\beta_{0l}$. According to Assumption 1, we have
$$|\eta_{nj}| \le \frac{1}{\sqrt{n}}\max_{l\in A}|\beta_{0l}|\sum_{l\in A}\left|\frac{1}{\sqrt{n}}\sum_{i=1}^n X_{ij}X_{il}\right| =
\begin{cases}
O(1), & j \in A,\\[1mm]
O\!\left(\dfrac{k_n}{\sqrt{n}}\right) = o(1), & j \notin A.
\end{cases}$$
For simplicity, we take $\eta_{nj} = 0$ for $j \notin A$. It is easy to see that
$$\max_{1\le j\le p_n}|\tilde{\beta}_{nj} - \eta_{nj}| = O_P(n^{-k}) \qquad \text{for } k < \tfrac{1}{2}.$$

This bound will be used in the proof of Theorem 2.1. Consider the penalized objective function
$$Q_n(\beta_n) = \frac{1}{2n}\|Y - X\beta_n\|^2 + \sum_{j=1}^{p_n} p'_{\lambda_n}(|\tilde{\beta}_{nj}|)\cdot|\beta_{nj}|. \qquad (2.1)$$

The one-step estimator is $\hat{\beta}_n = \arg\min\{Q_n(\beta_n)\}$. For any vector $x = (x_1, x_2, \ldots)^T$, denote its sign vector by $\mathrm{sgn}(x) = (\mathrm{sgn}(x_1), \mathrm{sgn}(x_2), \ldots)^T$, with the convention $\mathrm{sgn}(0) = 0$. Following Zhao and Yu [7], we write $\hat{\beta}_n =_s \beta_0$ if and only if $\mathrm{sgn}(\hat{\beta}_n) = \mathrm{sgn}(\beta_0)$.
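Computationally, (2.1) is a weighted Lasso problem with fixed weights $w_j = p'_{\lambda_n}(|\tilde{\beta}_{nj}|)$, so it can be solved by cyclic coordinate descent with soft-thresholding in the spirit of [11–13]. The sketch below is a minimal illustration of that idea under the standardisation of Section 1 (centred $Y$, $n^{-1}\sum_i X_{ij}^2 = 1$); it is not the authors' implementation, and the function and argument names are our own.

```python
import numpy as np

def one_step_estimator(X, Y, weights, max_iter=500, tol=1e-7):
    """Minimise (2n)^{-1} ||Y - X beta||^2 + sum_j weights[j] * |beta_j|
    by cyclic coordinate descent (weights[j] = p'_lambda(|beta_init_j|))."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n, p = X.shape
    beta = np.zeros(p)
    resid = Y.copy()                      # residual Y - X beta (beta = 0 initially)
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(p):
            resid += X[:, j] * beta[j]    # drop coordinate j from the fit
            z = X[:, j] @ resid / n       # univariate least-squares coefficient
            new = np.sign(z) * max(abs(z) - weights[j], 0.0)  # soft-thresholding
            resid -= X[:, j] * new
            max_change = max(max_change, abs(new - beta[j]))
            beta[j] = new
        if max_change < tol:
            break
    return beta
```

In practice the weights would be computed from the marginal estimates, e.g. weights = scad_deriv(marginal_ols(X, Y), lam) with the helpers sketched earlier, and $\lambda_n$ would be chosen by cross-validation as in Section 3.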

The following conditions are needed for the selection consistency.

(A1) The $\varepsilon_i$ are i.i.d. random variables with $E(\varepsilon_i) = 0$ and $\mathrm{Var}(\varepsilon_i) = \sigma^2$, $i = 1, \ldots, n$. Furthermore, their tail probabilities satisfy $P(|\varepsilon_i| > x) \le K\exp(-Cx^d)$ for constants $1 \le d \le 2$, $C$ and $K$.


(A2) Let $\tau_{n1}$ be the smallest eigenvalue of $\Sigma_A$. There exists a constant $\tau > 0$ such that $\tau_{n1} \ge \tau$ for all $n$.

(A3) $(\log n)^{I(d=1)}\left[\dfrac{(\log k_n)^{1/d}}{\sqrt{n}\,\beta_n^*} + \dfrac{(\log p_n)^{1/d}}{\sqrt{n}\,\lambda_n} + \dfrac{\lambda_n\sqrt{k_n}}{\beta_n^*}\right] \to 0$ as $n \to \infty$.

Condition (A1) is a standard assumption for variable selection in the high-dimensional linear model; it allows non-normal and heavy-tailed error distributions. Condition (A2) requires the matrix $\Sigma_A$ to be strictly positive definite; since the number of nonzero coefficients is small in a sparse model, this condition is reasonable. Condition (A3) restricts the tuning parameter, the numbers of nonzero and zero coefficients, and the smallest nonzero coefficient.

Theorem 2.1. Suppose that conditions (A1)–(A3) hold and let $p'_{\lambda_n}(\cdot) = \lambda_n p'(\cdot)$. If $p'(\cdot)$ is continuous on $(0, +\infty)$, there is some $s > 0$ such that $p'(\theta) = O(\theta^{-s})$ as $\theta \to 0^+$, and $\sqrt{k_n}\,n^{-ks} \to 0$, then
$$\Pr(\hat{\beta}_n =_s \beta_0) \to 1. \qquad (2.2)$$
In addition, if $p_{\lambda_n}(\cdot)$ is the SCAD penalty and $\lambda_n n^k \to \infty$, then (2.2) holds.

This theorem indicates that the one-step estimator can correctly identify the covariates with nonzero and zero coefficients with probability tending to one in high-dimensional settings. Note that the bridge penalty belongs to the first case. Take $d = 2$, suppose the bridge index is $q = 1/2$, the number of nonzero coefficients is $k_n = o(n^k)$, and $\lambda_n = n^{-c_1}$ for some $0 \le c_1 < 1/2$. Then $\log(p_n)$ can be as large as $o(n^{1-2c_1})$, and the smallest nonzero coefficient must be larger than $\max\{n^{k/2-c_1}, (\log(o(n^k))/n)^{1/2}\}$. If the penalty is the SCAD function, we set $\lambda_n = n^{-k+c_2}$ for some $k > c_2 > 0$; then $\log(p_n)$ can be as large as $o(n^{1-2k+c_2})$. Suppose $k_n = n^{c_3}$ for some $2(k-c_2) > c_3 > 0$; then the smallest nonzero coefficient must be larger than $\max\{n^{-k+c_2+c_3/2}, (\log(o(n^{c_3}))/n)^{1/2}\}$.

3. Numerical studies

In this section, simulation studies are conducted to illustrate the finite sample performance of the one-step estimate. We focus on the case when $p_n > n$.

3.1. Simulations

In the simulations, we generate 500 data sets, each consisting of $n$ observations from the linear model
$$Y_i = X_i^T\beta + \varepsilon_i, \qquad i = 1, \ldots, n,$$

where $\beta$ is a $p_n \times 1$ vector and $\varepsilon_i \sim N(0, \sigma^2)$ with $\sigma = 1.5$. The sample size in the simulations is $n = 100$. Similar to Huang, Ma and Zhang [9], we consider three cases, exhibited in the following three examples.

Example 1 (Partial Orthogonality). For the $i$-th sample, the first 15 covariates $(X_{i,1}, \ldots, X_{i,15})$ and the remaining $p_n - 15$ covariates $(X_{i,16}, \ldots, X_{i,p_n})$ are independent. The pairwise correlation between the $j$-th and the $k$-th components of $(X_{i,1}, \ldots, X_{i,15})$ is $r^{|j-k|}$, and the pairwise correlation between the $j$-th and the $k$-th components of $(X_{i,16}, \ldots, X_{i,p_n})$ is also $r^{|j-k|}$. In this example, we consider $p_n = 200, 400$ and $r = 0.5, 0.95$. The true value of $\beta$: $\beta_j = 2.5$ for $j = 1, \ldots, 5$, $\beta_j = 1.5$ for $j = 6, \ldots, 10$, $\beta_j = 0.5$ for $j = 11, \ldots, 15$, and the others are zero.

Example 2 (Group Structure). The covariates are generated as $X_{ij} = Z_{1i} + e_{ij}$ for $j = 1, \ldots, 5$, $X_{ij} = Z_{2i} + e_{ij}$ for $j = 6, \ldots, 10$, $X_{ij} = Z_{3i} + e_{ij}$ for $j = 11, \ldots, 15$, and $X_{ij} = Z_{ij}$ otherwise, where the $Z$'s are i.i.d. $N(0, 1)$ and the $e_{ij}$ are i.i.d. $N(0, 0.1^2)$. The true parameter is $\beta_j = 1.5$ for $j = 1, \ldots, 15$ and $\beta_j = 0$ for $j = 16, \ldots, p_n$. We consider $p_n = 200$ and $p_n = 400$.

Example 3 (General). The covariate vector $X$ is generated from a $p_n$-dimensional multivariate normal distribution with zero mean and covariance between the $i$-th and $j$-th elements equal to $r^{|i-j|}$ with $r = 0.5, 0.95$. Let $p_n = 200$. The true parameter $\beta$: $\beta_j = 2.5$ for $j = 1, \ldots, 5$, $\beta_j = 1.5$ for $j = 11, \ldots, 15$, $\beta_j = 0.5$ for $j = 21, \ldots, 25$, and the rest are zero.

In each example, we consider six methods: Lasso, adaptive Lasso (ALasso), Bridge, SCAD, MCP and Oracle. All estimates are computed by the coordinate descent algorithm [11–13], and the marginal regression estimates are used as the initial estimates. Tuning parameters are selected by $V$-fold cross validation with $V = 5$. To evaluate the performance of the estimates, we report four summary statistics: the average number of selected nonzero covariates (NN), the average number of selected covariates that are truly nonzero (NTN), the median of the $L_2$ loss $\|\hat{\beta} - \beta\|$, and the median of the prediction mean squared error (MPMSE). The simulated data consist of a training set and a testing set. Let $\hat{Y}_i$ be the fitted value based on the training data, and let $Y_i$ be the outcome in the testing data; the prediction mean squared error for a data set is $n^{-1}\sum_{i=1}^n(\hat{Y}_i - Y_i)^2$ with $n = 100$.
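To make the design of Example 1 concrete, the following sketch (our own illustration, not the authors' simulation code) generates one data set with the two independent blocks described above, drawing each block from a multivariate normal distribution with correlation $r^{|j-k|}$ via its Cholesky factor; the function names and the seed argument are hypothetical conveniences.

```python
import numpy as np

def ar1_block(n, size, r, rng):
    """Draw n rows from N(0, Sigma) with Sigma[j, k] = r ** |j - k|."""
    idx = np.arange(size)
    sigma = r ** np.abs(idx[:, None] - idx[None, :])
    return rng.standard_normal((n, size)) @ np.linalg.cholesky(sigma).T

def example1_data(n=100, p=200, r=0.5, sigma=1.5, seed=0):
    """One data set from Example 1: two mutually independent correlated blocks."""
    rng = np.random.default_rng(seed)
    X = np.hstack([ar1_block(n, 15, r, rng),        # first 15 covariates
                   ar1_block(n, p - 15, r, rng)])   # remaining p - 15, independent of the first block
    beta = np.zeros(p)
    beta[0:5], beta[5:10], beta[10:15] = 2.5, 1.5, 0.5
    Y = X @ beta + sigma * rng.standard_normal(n)
    return X, Y, beta
```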

Table 1. Simulation results in Example 1.

p    r     Method   NTN      NN       L2 loss        MPMSE
200  0.5   Lasso    14.716   41.334   1.128 (0.295)  3.534 (0.839)
200  0.5   ALasso   13.846   26.856   1.155 (0.258)  3.389 (0.759)
200  0.5   Bridge   14.298   30.322   1.114 (0.242)  3.321 (0.646)
200  0.5   SCAD     13.178   20.396   1.280 (0.260)  3.643 (0.909)
200  0.5   MCP      13.186   20.002   1.274 (0.252)  3.594 (0.848)
200  0.5   Oracle   15.000   15.000   0.776 (0.188)  2.596 (0.441)
200  0.95  Lasso    13.734   20.098   2.267 (0.504)  2.751 (0.455)
200  0.95  ALasso   13.762   16.158   2.300 (0.527)  2.627 (0.429)
200  0.95  Bridge   13.742   16.856   2.266 (0.520)  2.652 (0.426)
200  0.95  SCAD     14.880   15.384   2.506 (0.630)  2.610 (0.424)
200  0.95  MCP      14.892   15.332   2.501 (0.610)  2.616 (0.445)
200  0.95  Oracle   15.000   15.000   2.603 (0.683)  2.596 (0.441)
400  0.5   Lasso    14.658   63.162   1.283 (0.254)  3.907 (0.846)
400  0.5   ALasso   13.650   29.942   1.211 (0.260)  3.616 (1.009)
400  0.5   Bridge   14.114   37.654   1.216 (0.255)  3.622 (0.907)
400  0.5   SCAD     12.790   20.062   1.348 (0.286)  3.976 (1.126)
400  0.5   MCP      12.836   20.362   1.346 (0.262)  3.947 (0.974)
400  0.5   Oracle   15.000   15.000   0.767 (0.178)  2.626 (0.388)
400  0.95  Lasso    13.722   21.790   2.252 (0.503)  2.768 (0.440)
400  0.95  ALasso   13.742   16.094   2.271 (0.520)  2.655 (0.447)
400  0.95  Bridge   13.732   17.498   2.268 (0.518)  2.653 (0.416)
400  0.95  SCAD     14.938   15.460   2.516 (0.571)  2.621 (0.422)
400  0.95  MCP      14.866   15.240   2.510 (0.587)  2.690 (0.408)
400  0.95  Oracle   15.000   15.000   2.589 (0.642)  2.626 (0.388)

Table 2. Simulation results in Example 2.

p    Method   NTN      NN       L2 loss        MPMSE
200  Lasso    11.378   22.064   6.411 (0.816)  3.224 (0.556)
200  ALasso   11.570   14.600   6.305 (0.748)  2.804 (0.410)
200  Bridge   11.522   16.576   6.303 (0.772)  2.856 (0.480)
200  SCAD     14.204   14.774   7.450 (0.952)  2.870 (0.446)
200  MCP      13.842   14.404   7.325 (0.947)  2.915 (0.455)
200  Oracle   15.000   15.000   5.448 (1.295)  2.644 (0.395)
400  Lasso    11.238   25.064   6.436 (0.919)  3.362 (0.618)
400  ALasso   11.504   15.050   6.345 (0.712)  2.821 (0.488)
400  Bridge   11.420   18.100   6.408 (0.778)  2.970 (0.497)
400  SCAD     14.016   14.596   7.491 (0.968)  2.925 (0.492)
400  MCP      13.850   14.580   7.387 (1.020)  2.953 (0.473)
400  Oracle   15.000   15.000   5.606 (1.219)  2.632 (0.432)

Table 3. Simulation results in Example 3.

r     Method   NTN      NN       L2 loss        MPMSE
0.5   Lasso    14.662   43.222   1.226 (0.242)  3.763 (0.832)
0.5   ALasso   13.578   31.208   1.281 (0.272)  3.755 (0.989)
0.5   Bridge   14.114   34.282   1.227 (0.256)  3.629 (0.835)
0.5   SCAD     12.970   22.548   1.424 (0.282)  4.051 (1.119)
0.5   MCP      12.956   22.900   1.418 (0.279)  4.155 (1.074)
0.5   Oracle   15.000   15.000   0.743 (0.171)  2.613 (0.430)
0.95  Lasso    13.398   24.260   2.428 (0.511)  2.890 (0.489)
0.95  ALasso   13.384   19.686   2.425 (0.516)  2.769 (0.461)
0.95  Bridge   13.370   20.790   2.410 (0.524)  2.759 (0.451)
0.95  SCAD     14.298   27.616   3.612 (0.737)  3.034 (0.571)
0.95  MCP      14.362   27.484   3.588 (0.718)  3.052 (0.585)
0.95  Oracle   15.000   15.000   2.447 (0.595)  2.641 (0.431)

The simulation results are given in Tables 1–3. The Lasso selects the largest model among all the methods. When the partial orthogonality condition is satisfied, the SCAD and MCP perform very similarly in terms of prediction and model selection. When the model has the group structure, the adaptive Lasso, SCAD and MCP miss some nonzero coefficients, and the Lasso again yields a large model. When the partial orthogonality condition is violated, all methods identify a large number of important covariates, and the bridge performs best in $L_2$ loss and prediction.
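For completeness, the summary statistics reported in Tables 1–3 are easy to compute per replication from a fitted coefficient vector and an independent test set. The sketch below is our own illustration with hypothetical argument names; averaging NN and NTN and taking medians of the $L_2$ loss and prediction MSE over the 500 replications gives the quantities in the tables.

```python
import numpy as np

def summarize_fit(beta_hat, beta_true, X_test, Y_test):
    """Per-replication versions of the quantities reported in Tables 1-3."""
    selected = beta_hat != 0
    truly_nonzero = beta_true != 0
    nn = int(np.sum(selected))                        # number of selected covariates (NN)
    ntn = int(np.sum(selected & truly_nonzero))       # truly nonzero among them (NTN)
    l2_loss = float(np.linalg.norm(beta_hat - beta_true))
    pmse = float(np.mean((X_test @ beta_hat - Y_test) ** 2))  # prediction MSE on test data
    return {"NN": nn, "NTN": ntn, "L2": l2_loss, "PMSE": pmse}
```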


4. Discussion

This paper studies model selection for the ultrahigh-dimensional linear model. The marginal regression estimates are used as the initial estimates of the one-step estimator, and under certain conditions the model selection consistency is established. The performance of the proposed method is demonstrated by simulation studies. The one-step estimator can also be applied to the high-dimensional generalized linear model with the initial values being marginal maximum likelihood estimates; we will report results for such more complex models in future work.

Acknowledgments

The authors sincerely thank the editor and referees for reading the paper very carefully and for their valuable suggestions. Mingqiu Wang's research was supported by the Foundation of Qufu Normal University (xkj201304). Xiaoguang Wang's research was supported by the National Natural Science Foundation of China (NSFC) grant (11101063).

Appendix

Here, we prove Theorem 2.1. To facilitate the proof, we need the following lemma, whose proof can be found in Huang, Horowitz and Ma [8] and Huang, Ma and Zhang [9].

Lemma A.1. Suppose that condition (A1) holds. Then for all constants $a_i$ satisfying $\sum_{i=1}^n a_i^2 = 1$, we have
$$\Pr\left(\sum_{i=1}^n a_i\varepsilon_i > t\right) \le \exp\left(-\frac{t^d}{M(1+\log n)^{I(d=1)}}\right)$$
for a certain constant $M$ depending on $\{d, K, C\}$ only.

Proof of Theorem 2.1. By the KKT conditions, $\beta_n$ is a solution of (2.1) if and only if
$$\begin{cases}
\dfrac{1}{n}X_j^T(Y - X\beta_n) = p'_{\lambda_n}(|\tilde{\beta}_{nj}|)\,\mathrm{sgn}(\beta_{nj}), & \beta_{nj} \ne 0,\\[2mm]
\left|\dfrac{1}{n}X_j^T(Y - X\beta_n)\right| \le p'_{\lambda_n}(|\tilde{\beta}_{nj}|), & \beta_{nj} = 0.
\end{cases} \qquad (A.1)$$
Let $\hat{\beta}_{1n} = n^{-1}\Sigma_A^{-1}(X_A^TY - n\tilde{S}_n) = \beta_{10} + \Sigma_A^{-1}(n^{-1}X_A^T\varepsilon - \tilde{S}_n)$, where $\tilde{S}_n = (p'_{\lambda_n}(|\tilde{\beta}_{nj}|)\,\mathrm{sgn}(\beta_{0j}),\ j \in A)^T$. If $\hat{\beta}_{1n} =_s \beta_{10}$, then $\hat{\beta}_n = (\hat{\beta}_{1n}^T, 0^T)^T$ is a solution of (A.1). Thus, since $X\hat{\beta}_n = X_A\hat{\beta}_{1n}$ for this $\hat{\beta}_n$ and $\{X_j, j \in A\}$ are linearly independent, $\hat{\beta}_n =_s \beta_0$ if
$$\begin{cases}
\hat{\beta}_{1n} =_s \beta_{10},\\[1mm]
\left|\dfrac{1}{n}X_j^T(Y - X_A\hat{\beta}_{1n})\right| \le p'_{\lambda_n}(|\tilde{\beta}_{nj}|), & j \notin A.
\end{cases}$$
By some calculations, $Y - X_A\hat{\beta}_{1n} = \varepsilon - X_A\Sigma_A^{-1}(n^{-1}X_A^T\varepsilon - \tilde{S}_n) = H_n\varepsilon + X_A\Sigma_A^{-1}\tilde{S}_n$, where $H_n = I_n - n^{-1}X_A\Sigma_A^{-1}X_A^T$. So $\hat{\beta}_n =_s \beta_0$ if
$$\begin{cases}
\mathrm{sgn}(\beta_{0j})\cdot(\beta_{0j} - \hat{\beta}_{nj}) < |\beta_{0j}|, & j \in A,\\[1mm]
\left|\dfrac{1}{n}X_j^T(H_n\varepsilon + X_A\Sigma_A^{-1}\tilde{S}_n)\right| \le p'_{\lambda_n}(|\tilde{\beta}_{nj}|), & j \notin A.
\end{cases}$$

Thus
$$\begin{aligned}
\Pr(\hat{\beta}_n \ne_s \beta_0) &\le \Pr\left(\exists j \in A,\ |\beta_{0j} - \hat{\beta}_{nj}| \ge |\beta_{0j}|\right) + \Pr\left(\exists j \notin A,\ \left|\frac{1}{n}X_j^T(H_n\varepsilon + X_A\Sigma_A^{-1}\tilde{S}_n)\right| > p'_{\lambda_n}(|\tilde{\beta}_{nj}|)\right)\\
&\le \Pr\left(\exists j \in A,\ \left|\frac{1}{n}e_j^T\Sigma_A^{-1}X_A^T\varepsilon\right| \ge \frac{1}{2}|\beta_{0j}|\right) + \Pr\left(\exists j \in A,\ |e_j^T\Sigma_A^{-1}\tilde{S}_n| \ge \frac{1}{2}|\beta_{0j}|\right)\\
&\quad + \Pr\left(\exists j \notin A,\ \left|\frac{1}{n}X_j^TH_n\varepsilon\right| > \frac{1}{2}p'_{\lambda_n}(|\tilde{\beta}_{nj}|)\right) + \Pr\left(\exists j \notin A,\ \left|\frac{1}{n}X_j^TX_A\Sigma_A^{-1}\tilde{S}_n\right| > \frac{1}{2}p'_{\lambda_n}(|\tilde{\beta}_{nj}|)\right)\\
&=: I_{n1} + I_{n2} + I_{n3} + I_{n4},
\end{aligned}$$
where $e_j$ is the $j$-th unit vector.


Let $a_j^T = n^{-1/2}e_j^T\Sigma_A^{-1}X_A^T$. By condition (A2), we have
$$\|a_j\| = \left(\frac{1}{n}e_j^T\Sigma_A^{-1}X_A^TX_A\Sigma_A^{-1}e_j\right)^{1/2} \le \tau_{n1}^{-1/2} \le \tau^{-1/2}.$$
By Lemma A.1 and condition (A3), we have
$$\begin{aligned}
I_{n1} &\le \Pr\left(\exists j \in A,\ \left|\frac{a_j^T}{\|a_j\|}\varepsilon\right| \ge \frac{\sqrt{n}\,\tau^{1/2}\beta_n^*}{2}\right)
\le k_n\exp\left\{-\frac{(\sqrt{n}\,\tau^{1/2}\beta_n^*/2)^d}{M(\log n + 1)^{I(d=1)}}\right\}\\
&= \exp\left\{-\log(k_n)\left[\frac{1}{M(\log n + 1)^{I(d=1)}}\left(\frac{\sqrt{n}\,\tau^{1/2}\beta_n^*}{2(\log k_n)^{1/d}}\right)^d - 1\right]\right\} \to 0.
\end{aligned}$$
First, we consider the case where $p'_{\lambda_n}(|\tilde{\beta}_{nj}|) = \lambda_n p'(|\tilde{\beta}_{nj}|)$, with $p'(\cdot)$ continuous on $(0, +\infty)$ and $p'(\theta) = O(\theta^{-s})$ as $\theta \to 0^+$ for some $s > 0$. In addition, since $\eta_{nj} = O(1)$ for $j \in A$, we have

$$\|\tilde{S}_n\| = \left\{\sum_{j\in A}\lambda_n^2[p'(|\tilde{\beta}_{nj}|)]^2\right\}^{1/2}
= \lambda_n\left\{\sum_{j\in A}\left[\frac{p'(|\tilde{\beta}_{nj}|)}{p'(|\eta_{nj}|)}\right]^2[p'(|\eta_{nj}|)]^2\right\}^{1/2}
= \lambda_n\sqrt{k_n}\,O_P(1).$$
Therefore, by conditions (A2) and (A3), we have
$$|e_j^T\Sigma_A^{-1}\tilde{S}_n| \le \tau_{n1}^{-1}\|\tilde{S}_n\| = \lambda_n\sqrt{k_n}\,O_P(1) = o_P(\beta_n^*),$$
so $I_{n2} \to 0$.

For $j \notin A$, $|\tilde{\beta}_{nj}| \stackrel{P}{\to} 0$, and thus $p'_{\lambda_n}(|\tilde{\beta}_{nj}|) = \lambda_n O_P(|\tilde{\beta}_{nj}|^{-s})$. Since $\|n^{-1}(X_j^TH_n)^T\| \le n^{-1/2}$, by condition (A3), for a large constant $C > 0$ we have
$$\begin{aligned}
I_{n3} &\le p_n\Pr\left(\left|\frac{1}{\sqrt{n}}X_j^TH_n\varepsilon\right| > \frac{C\lambda_n\sqrt{n}\,n^{ks}}{2}\right)
\le p_n\exp\left\{-\frac{(\sqrt{n}\,C\lambda_n n^{ks}/2)^d}{M(\log n + 1)^{I(d=1)}}\right\}\\
&= \exp\left\{-\log p_n\left[\frac{1}{M(\log n + 1)^{I(d=1)}}\left(\frac{\sqrt{n}\,C\lambda_n n^{ks}}{2(\log p_n)^{1/d}}\right)^d - 1\right]\right\} \to 0.
\end{aligned}$$
For $I_{n4}$, since
$$\left|\frac{1}{n}X_j^TX_A\Sigma_A^{-1}\tilde{S}_n\right| \le \left\|\frac{1}{n}(X_j^TX_A\Sigma_A^{-1})^T\right\|\cdot\|\tilde{S}_n\| \le \tau^{-1/2}O_P(\lambda_n\sqrt{k_n}),$$
we have
$$\lambda_n^{-1}|\tilde{\beta}_{nj}|^s\left|\frac{1}{n}X_j^TX_A\Sigma_A^{-1}\tilde{S}_n\right| \le \lambda_n^{-1}Cn^{-ks}O_P(\lambda_n\sqrt{k_n}) = O_P(\sqrt{k_n}\,n^{-ks}) \to 0.$$
Hence $I_{n4} \to 0$.


Second, for the SCAD penalty, $p'_{\lambda_n}(|\tilde{\beta}_{nj}|) = p'_{\lambda_n}(|\tilde{\beta}_{nj}|)\,I(|\tilde{\beta}_{nj}| \le a\lambda_n)$. By the condition $n^{-k}/\lambda_n \to 0$, for $j \in A$ we have
$$\Pr\left(\sqrt{n}\,p'_{\lambda_n}(|\tilde{\beta}_{nj}|) > \epsilon\right) \le \Pr\left(|\tilde{\beta}_{nj}| \le a\lambda_n\right) \to 0.$$

Therefore,
$$\|\tilde{S}_n\| = \left\{\sum_{j\in A}\lambda_n^2\,O_P(n^{-1})\right\}^{1/2} = \sqrt{\frac{k_n}{n}}\,\lambda_n O_P(1).$$
By condition (A3), we have $|e_j^T\Sigma_A^{-1}\tilde{S}_n| \le \tau^{-1}\lambda_n\sqrt{k_n/n}\,O_P(1) = o_P(\beta_n^*)$, so $I_{n2} \to 0$.

For $j \notin A$,
$$\frac{p'_{\lambda_n}(|\tilde{\beta}_{nj}|)}{\lambda_n} = 1 + \left[\frac{p'_{\lambda_n}(|\tilde{\beta}_{nj}|)}{\lambda_n} - 1\right]I(|\tilde{\beta}_{nj}| > \lambda_n).$$
Since $\max_{j\notin A}|\tilde{\beta}_{nj}| = O_P(n^{-k})$ and $\lambda_n n^k \to \infty$, we have
$$\Pr\left(\left|\frac{p'_{\lambda_n}(|\tilde{\beta}_{nj}|)}{\lambda_n} - 1\right| > \epsilon\right) \le \Pr\left(|\tilde{\beta}_{nj}| > \lambda_n\right) \le \Pr\left(n^k|\tilde{\beta}_{nj}| > \lambda_n n^k\right) \to 0.$$
Note that $(n\lambda_n)^{-1}|X_j^TX_A\Sigma_A^{-1}\tilde{S}_n| = O_P\!\left(\sqrt{k_n/n}\right)$, so we have
$$I_{n4} = \Pr\left(\exists j \notin A,\ \lambda_n^{-1}\left|\frac{1}{n}X_j^TX_A\Sigma_A^{-1}\tilde{S}_n\right| > \frac{1}{2}\,\frac{p'_{\lambda_n}(|\tilde{\beta}_{nj}|)}{\lambda_n}\right) \to 0.$$

For $I_{n3}$, we have
$$\begin{aligned}
I_{n3} &= \Pr\left(\exists j \notin A,\ \left|\frac{1}{\sqrt{n}}X_j^TH_n\varepsilon\right| > \frac{\sqrt{n}\,\lambda_n}{2}\,\frac{p'_{\lambda_n}(|\tilde{\beta}_{nj}|)}{\lambda_n}\right)\\
&= \Pr\left(\exists j \notin A,\ \left|\frac{1}{\sqrt{n}}X_j^TH_n\varepsilon\right| > \frac{\sqrt{n}\,\lambda_n}{2}\,\frac{p'_{\lambda_n}(|\tilde{\beta}_{nj}|)}{\lambda_n},\ \frac{p'_{\lambda_n}(|\tilde{\beta}_{nj}|)}{\lambda_n} \ge \frac{1}{2}\right)\\
&\quad + \Pr\left(\exists j \notin A,\ \left|\frac{1}{\sqrt{n}}X_j^TH_n\varepsilon\right| > \frac{\sqrt{n}\,\lambda_n}{2}\,\frac{p'_{\lambda_n}(|\tilde{\beta}_{nj}|)}{\lambda_n},\ \frac{p'_{\lambda_n}(|\tilde{\beta}_{nj}|)}{\lambda_n} < \frac{1}{2}\right)\\
&\le \Pr\left(\exists j \notin A,\ \left|\frac{1}{\sqrt{n}}X_j^TH_n\varepsilon\right| > \frac{\sqrt{n}\,\lambda_n}{4}\right) + o(1)\\
&\le p_n\exp\left\{-\frac{(\sqrt{n}\,\lambda_n/4)^d}{M(\log n + 1)^{I(d=1)}}\right\} + o(1)\\
&= \exp\left\{-\log p_n\left[\frac{1}{M(\log n + 1)^{I(d=1)}}\left(\frac{\sqrt{n}\,\lambda_n}{4(\log p_n)^{1/d}}\right)^d - 1\right]\right\} + o(1)\\
&\to 0.
\end{aligned}$$
The proof is completed. □

References

[1] H. Zou, R. Li, One-step sparse estimates in nonconcave penalized likelihood models, Ann. Statist. 36 (2008) 1509–1533.
[2] I.E. Frank, J.H. Friedman, A statistical view of some chemometrics regression tools (with discussion), Technometrics 35 (1993) 109–148.
[3] R.J. Tibshirani, Regression shrinkage and selection via the Lasso, J. R. Stat. Soc. Ser. B Stat. Methodol. 58 (1996) 267–288.
[4] J. Fan, R. Li, Variable selection via nonconcave penalized likelihood and its oracle properties, J. Amer. Statist. Assoc. 96 (2001) 1348–1360.
[5] C.-H. Zhang, Nearly unbiased variable selection under minimax concave penalty, Ann. Statist. 38 (2010) 894–942.
[6] N. Meinshausen, P. Bühlmann, High-dimensional graphs and variable selection with the Lasso, Ann. Statist. 34 (2006) 1436–1462.
[7] P. Zhao, B. Yu, On model selection consistency of Lasso, J. Mach. Learn. Res. 7 (2006) 2541–2567.
[8] J. Huang, J.L. Horowitz, S. Ma, Asymptotic properties of bridge estimators in sparse high-dimensional regression models, Ann. Statist. 36 (2008) 587–613.
[9] J. Huang, S. Ma, C.-H. Zhang, Adaptive Lasso for sparse high-dimensional regression models, Statist. Sinica 18 (2008) 1603–1618.
[10] J. Fan, J. Lv, Sure independence screening for ultrahigh-dimensional feature space, J. R. Stat. Soc. Ser. B Stat. Methodol. 70 (2008) 849–911.
[11] J. Friedman, T. Hastie, H. Höfling, R. Tibshirani, Pathwise coordinate optimization, Ann. Appl. Stat. 1 (2007) 302–332.
[12] J. Friedman, T. Hastie, R. Tibshirani, Regularization paths for generalized linear models via coordinate descent, J. Statist. Softw. 33 (2010) 1–22.
[13] P. Breheny, J. Huang, Coordinate descent algorithms for nonconvex penalized regression, with application to biological feature selection, Ann. Appl. Stat. 5 (2011) 232–253.