Statistics and Probability Letters 80 (2010) 254–261
Least squares approximation with a diverging number of parameters

Chenlei Leng a, Bo Li b,∗

a Department of Statistics and Applied Probability, National University of Singapore, Singapore
b 440 Weilun Hall, School of Economics and Management, Tsinghua University, Beijing, 100084, China

∗ Corresponding author. Fax: +86 10 6277 1647.
E-mail addresses: [email protected] (C. Leng), [email protected], [email protected] (B. Li).
Article history: Received 1 June 2009; received in revised form 21 October 2009; accepted 21 October 2009; available online 29 October 2009.
Abstract

Regularized regression with the ℓ1 penalty is a popular approach to variable selection and coefficient estimation. For a unified treatment of ℓ1-constrained model selection, Wang and Leng (2007) proposed the least squares approximation method (LSA) for a fixed dimension. LSA makes use of a quadratic expansion of the loss function and takes full advantage of the fast Lasso algorithm in Efron et al. (2004). In this paper, we extend the fixed dimension LSA to the situation with a diverging number of parameters. We show that LSA possesses the oracle properties under appropriate conditions when the number of variables grows with the sample size. We also propose a new tuning parameter selection method that achieves the oracle properties. Extensive simulation studies confirm the theoretical results.
1. Introduction

The method of regularization with a sparsity inducing penalty is a popular approach for producing parsimonious models and accurate estimates (Fan and Li, 2006). For linear regression, penalized least squares with an ℓ1 penalty, also known as the Lasso (Tibshirani, 1996), has been the focus of considerable research in recent years. As the Lasso is equivalent to an ℓ1-constrained quadratic programming problem, Efron et al. (2004) developed an efficient algorithm such that the entire solution path of the Lasso, as a function of the regularization parameter, can be computed at a computational complexity similar to that of a least squares fit. Besides its superior numerical properties, the Lasso is also attractive theoretically. For example, with an adaptive ℓ1 penalty, Zou (2006) and Wang et al. (2007a) showed that the Lasso enjoys the so-called oracle properties (Fan and Li, 2001). More precisely, any procedure possessing the oracle properties allows the model to be fitted as if the true sparsity pattern of the model were known.

The ℓ1 penalty has also been applied to model selection in other models, such as the generalized linear model (GLM) and the Cox model. However, in contrast to the Lasso, efficient numerical algorithms still need to be developed for other ℓ1-constrained loss functions; for example, Park and Hastie (2007) developed a predictor–corrector algorithm for the GLM and the Cox model, and see also Shevade and Keerthi (2003) and Zhang and Lu (2007). Clearly, developing efficient algorithms for loss-based ℓ1-regularization problems other than least squares is more difficult. Furthermore, oracle properties for a general linear model need to be studied on a case-by-case basis; see Wang and Leng (2007) for a general discussion.

As a simple solution, Wang and Leng (2007) proposed the method of least squares approximation (LSA) for a unified treatment of various ℓ1-constrained estimation problems, including linear models, generalized linear models, quantile regression and many others as special cases. The real thrust of LSA is to recast many different types of ℓ1-constrained objective functions as asymptotically equivalent least squares problems. This facilitates both the development of standard asymptotic theory parallel to that of the Lasso and the application of the fast Lasso algorithm (Efron et al., 2004). However, the LSA in Wang and Leng (2007) was studied with fixed dimensionality. Recent years have witnessed an emerging need to
analyze large dimensional data sets, which are routinely collected in microarray data analysis, neuroscience and financial time series (Fan and Li, 2006). A question thus arises naturally: does LSA still apply with a diverging number of parameters? We offer an affirmative answer to this question under proper assumptions. In particular, we show that under appropriate conditions on the dimension, the sample size and the strength of the signal, LSA has the oracle properties when the regularization parameter is suitably chosen. We further propose using a penalized quadratic function criterion to choose the regularization parameter. The result shows that we can take advantage of the fast Lasso algorithm in many general linear models, without the need to re-derive the asymptotic theory or to re-develop the computational algorithms. Thus, ℓ1-constrained model selection for many general linear models can be easily implemented and analyzed, even when the dimensionality diverges.

There is related work. Fan and Peng (2004) studied the smoothly clipped absolute deviation (SCAD) estimator for likelihood based problems and showed that the oracle properties still hold when p3/n → 0. Note that throughout the paper we use p to denote the dimensionality and n the sample size; p may depend on n, but this dependence is suppressed in the notation. When the signal is conspicuous asymptotically, we allow the dimensionality p to diverge at a sub-root-n rate, i.e., p2/n → 0. There is also a large literature dealing with large p for the Lasso and its variants in linear regression; see, for example, Huang et al. (2008a,b), Zhang and Huang (2008) and Meinshausen and Yu (2009).

The rest of the paper is organized as follows. The LSA method is introduced in Section 2. We spell out the main theoretical results in Section 3; these results illuminate the conditions that ensure the consistency and asymptotic normality of the LSA estimator, and, under the Covariance Assumption specified later, we further show that LSA is an oracle estimator. A penalized quadratic function criterion for choosing the tuning parameter is proposed in Section 4. We illustrate the performance of LSA via extensive simulations in Section 5. Section 6 makes a few concluding remarks.

2. The least squares approximation method

Let (x_1, y_1), ..., (x_n, y_n) be n independent and identically distributed random vectors, where y_i ∈ R^1 is the response of interest and x_i = (x_{i1}, ..., x_{ip})ᵀ ∈ R^p is the p-dimensional predictor. Assume that β = (β_1, ..., β_p)ᵀ ∈ R^p is the parameter of interest, and denote the number of nonzero elements in β by d. Similar to the dimension p, we allow d to diverge with n. Assume further that L_n(β) is a plausible loss function, whose global minimizer β̃ = argmin L_n(β) is a natural estimate of β. For example, if L_n(β) is the least squares function, β̃ is the usual least squares estimate; if L_n(β) is the negative log-likelihood function, β̃ is the maximum likelihood estimator (MLE).

The penalized likelihood method with an adaptive ℓ1 penalty (Zou, 2006; Wang et al., 2007a) can be written as

    n⁻¹ L_n(β) + Σ_{j=1}^{p} λ_j |β_j|.    (1)

For fixed p, Wang and Leng (2007) proposed to use a second order Taylor expansion to approximate n⁻¹ L_n(β). Thus, the LSA objective function in Wang and Leng (2007) can be written as

    (β − β̃)ᵀ Σ̂⁻¹ (β − β̃) + Σ_{j=1}^{p} λ_j |β_j|,    (2)

whose global minimizer is denoted by β̂_λ = (β̂_{λ,1}, ..., β̂_{λ,p})ᵀ for λ = (λ_1, ..., λ_p)ᵀ. Here Σ̂⁻¹ is the sample version of Σ⁻¹, where Σ is the asymptotic covariance of β̃. Note that Σ⁻¹ is routinely available for many parametric models. In general, β̂_λ is different from the adaptive Lasso estimator, which is obtained by minimizing (1). Hereafter, we refer to β̂_λ as the LSA estimator and to the minimizer of (1) (denoted by β̃_λ) as the adaptive Lasso (aLasso) estimator.
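To make this reduction concrete, the following sketch (ours, not the authors' implementation) shows one way to minimize (2): factor Σ̂⁻¹ = LLᵀ, so that the quadratic form becomes an ordinary least squares criterion with pseudo-response Lᵀβ̃ and pseudo-design Lᵀ, absorb coordinate-wise penalties of the form λ_j = λ·w_j into the columns, and run the LARS/Lasso algorithm of Efron et al. (2004). The function name lsa_lasso_path and the use of scikit-learn's lars_path are illustrative assumptions on our part.

```python
import numpy as np
from sklearn.linear_model import lars_path

def lsa_lasso_path(beta_tilde, sigma_inv, penalty_weights):
    """Solution path of the LSA objective (2) with lambda_j = lambda * penalty_weights[j].

    beta_tilde      : unpenalized estimator (p,), the beta-tilde of the paper
    sigma_inv       : plug-in inverse covariance appearing in (2), p x p positive definite
    penalty_weights : relative penalties w_j > 0, e.g. 1 / |beta_tilde| ** gamma
    """
    L = np.linalg.cholesky(sigma_inv)             # sigma_inv = L @ L.T
    y_star = L.T @ beta_tilde                     # pseudo-response
    X_star = L.T / penalty_weights                # divide column j by w_j to absorb the penalty
    # Standard Lasso path on the pseudo-data, computed by the LARS algorithm
    alphas, _, coefs = lars_path(X_star, y_star, method='lasso')
    beta_path = coefs / penalty_weights[:, None]  # undo the rescaling: beta_j = theta_j / w_j
    return alphas, beta_path
```

Setting penalty_weights to 1/|β̃_j|^γ recovers the adaptive penalty used in Section 4; each column of beta_path is then an LSA estimate β̂_λ at one value of the common tuning parameter λ (up to the internal scaling of the regularization parameter in lars_path).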
3. Theoretical properties

Let A_m = {a ∈ R^m : ‖a‖ = 1} be the set of m-dimensional unit vectors, where ‖·‖ stands for the usual Euclidean norm. Denote by S the set of indices of a candidate model and by |S| its size. Without loss of generality, write S_T = {1, ..., d} = {j : β_j ≠ 0} for the support of β and S_F for the full model. Let β_0 = (β_{0a}ᵀ, β_{0b}ᵀ)ᵀ, where β_{0a} = (β_{01}, ..., β_{0d})ᵀ and β_{0b} = (β_{0(d+1)}, ..., β_{0p})ᵀ. Write a_n = max{λ_1, ..., λ_d} and b_n = min{λ_{d+1}, ..., λ_p}. An arbitrary candidate model S is an underfitted model if S ⊉ S_T, and an overfitted model if S ⊃ S_T and S ≠ S_T. Lastly, denote by β̃_S ∈ R^p the corresponding unpenalized estimator, i.e.,

    β̃_S = argmin_{β ∈ R^p : β_j = 0, ∀ j ∉ S} L_n(β).

This implies that β̃ = β̃_{S_F}. On the other hand, by minimizing the objective function (2) with λ = 0, another estimator

    β̂_S = argmin_{β ∈ R^p : β_j = 0, ∀ j ∉ S} (β − β̃)ᵀ Σ̂⁻¹ (β − β̃)    (3)

can be obtained.
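For concreteness, the constrained quadratic problem (3) has an explicit solution (this display is our addition): writing M = Σ̂⁻¹ with blocks M_{SS} and M_{SS^c} indexed by S and by S versus its complement, setting the gradient of (3) with respect to β^{(S)} to zero gives

    β̂_S^{(S)} = β̃^{(S)} + M_{SS}⁻¹ M_{SS^c} β̃^{(S^c)},    β̂_{S,j} = 0 for j ∉ S.

The same block-matrix structure reappears in equation (9) in the proof of Theorem 3 below.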
Note that here Σ̂ is possibly a diverging p × p matrix. In general, β̃_S ≠ β̂_S: the estimator β̃_S is obtained by minimizing the loss function L_n(·), while β̂_S is produced by minimizing the LSA objective function without regularization. Although they are closely related, they are unlikely to be exactly the same. A similar difference also exists between β̃_λ and β̂_λ.

For an arbitrary vector b ∈ R^p, let b^{(S)} ∈ R^{|S|} denote the subvector of b associated with the candidate model S. Thus, we have β̃ = β̃^{(S_F)}, β_0^{(S_T)} ≠ 0, and the oracle estimator β̃_{S_T}. In addition, assume that for any S ⊃ S_T and α ∈ A_{|S|},

    √n αᵀ(β̃_S^{(S)} − β_0^{(S)}) →d N(0, αᵀ Σ_S α) = N(0, αᵀ Ω_S⁻¹ α)    (4)

for some positive definite matrix Σ_S ∈ R^{|S|×|S|}, whose inverse is denoted by Ω_S = Σ_S⁻¹. The notation implies that αᵀ(Σ_{S_F} − Σ)α = 0 for any α ∈ A_p. Furthermore, the asymptotic covariance matrix of the oracle estimator β̃_{S_T}, in the direction of any unit vector α ∈ R^{|S_T|}, is given by αᵀ Σ_{S_T} α, i.e.,

    √n αᵀ(β̃_{S_T}^{(S_T)} − β_0^{(S_T)}) →d N(0, αᵀ Σ_{S_T} α) = N(0, αᵀ Ω_{S_T}⁻¹ α).    (5)
Lastly, for an arbitrary p × p matrix A, let A^{(S)} denote the submatrix associated with S. We now state formally the following important covariance assumption, which is the key to establishing the oracle property.

The Covariance Assumption. For any S ⊃ S_T, we have Ω_S = Ω_{S_F}^{(S)}.

Intuitively, the Covariance Assumption indicates that there is a close relationship between the inverse asymptotic covariance matrix of an overfitted model (i.e., Ω_S) and that of the full model (i.e., Ω_{S_F}). This assumption is satisfied in many common regression problems, for example linear models, generalized linear models, quantile regression models with homoscedastic errors and the Cox model (Wang and Leng, 2007). In what follows, we will denote Ω = Ω_{S_F} for brevity, and we recall that Ω̂ = Σ̂⁻¹ is the sample version of Ω.
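As a quick illustration of the first of these examples (this verification is ours, not part of the original text): in the linear model y = xᵀβ + ε with homoscedastic error variance σ², the restricted least squares estimator for any S ⊃ S_T satisfies √n(β̃_S^{(S)} − β_0^{(S)}) →d N(0, σ² (E[x^{(S)} x^{(S)ᵀ}])⁻¹), so that

    Ω_S = σ⁻² E[x^{(S)} x^{(S)ᵀ}] = (σ⁻² E[x xᵀ])^{(S)} = Ω_{S_F}^{(S)},

because the submatrix of E[x xᵀ] indexed by S is exactly E[x^{(S)} x^{(S)ᵀ}].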
To study the properties of the LSA under diverging dimensionality, we make the following assumptions:

(i) √(n/p) ‖β̃ − β_0‖ = O_p(1).
(ii) The maximum and minimum eigenvalues of Ω, denoted by τ_max and τ_min, are uniformly bounded from above and away from 0, respectively; i.e., there is a positive constant κ such that κ⁻¹ ≤ τ_min ≤ τ_max ≤ κ.
(iii) τ̂_max →p τ_max and τ̂_min →p τ_min, where τ̂_max and τ̂_min stand for the maximum and minimum eigenvalues of Ω̂, respectively.
(iv) ‖Ω̂ − Ω‖_F →p 0, where ‖·‖_F stands for the Frobenius norm.

Remark 1. He and Shao (2000) and Fan and Peng (2004), among many others, have provided conditions for establishing (i) for the usual regression, logistic regression, LAD regression, etc. For instance, in robust estimation of the usual linear regression, He and Shao (2000, Corollary 2.1) showed that, when proper conditions on the design matrix and the robust loss function are met, (i) holds as long as p(log p)³/n → 0; their results encompass a wide range of robust loss functions, including the L1 loss of least absolute deviation (LAD) regression. He and Shao (2000, Example 3) also demonstrated that, if p log p/n → 0, (i) is satisfied for the widely used logistic regression under mild conditions on the design matrix. For fairly general likelihood based models, (i) has been proved by Fan and Peng (2004, Theorem 1) under regularity conditions when p⁴/n → 0.

Remark 2. Assumption (ii) is standard; see Fan and Peng (2004) for instance. It is equivalent to requiring that the maximum and minimum eigenvalues of Σ be bounded from above and away from 0.

Remark 3. Assumption (iii) is weaker than (iv), since convergence in the Frobenius norm implies convergence in the operator norm, which in turn implies consistency of the sample eigenvalues. Assumption (iii) is typically satisfied when p/n → 0 under regularity conditions, since in this circumstance the sample covariance matrix behaves asymptotically like its population counterpart; singular behavior usually occurs when p diverges faster, say p/n → c > 0. See, e.g., Bai and Silverstein (2006, Chapter 5) for a comprehensive account. Assumption (iv) is usually satisfied when p²/n → 0; see, e.g., the proof of Lemma 3.1 in Ledoit and Wolf (2004).
Theorem 1 (Consistency). Under assumptions (i), (ii) and (iii), if a_n = O_p(1/√n) and p/n → 0, then ‖β̂_λ − β_0‖ = O_p(√(p/n)).

Proof. Denote by Q_n(β) the expression in (2). As stated in Wang and Leng (2007), the LSA objective function Q_n(β) is strictly convex in β, so a local minimum must also be the global minimum. We then follow the ideas of Fan and Peng (2004) and Wang and Leng (2007) and show that for any given ε > 0 there is a sufficiently large constant C such that

    lim inf_n P{ inf_{u ∈ R^p, ‖u‖ = C} Q_n(β_0 + √(p/n) u) > Q_n(β_0) } > 1 − ε.    (6)
To this end, we consider the following difference:

    (n/p) [Q_n(β_0 + √(p/n) u) − Q_n(β_0)]
        = uᵀ Σ̂⁻¹ u + 2 uᵀ Σ̂⁻¹ [√(n/p)(β_0 − β̃)] + (n/p) Σ_{j=1}^{p} λ_j (|β_{0j} + (p/n)^{1/2} u_j| − |β_{0j}|),    (7)

where u = (u_1, ..., u_p)ᵀ ∈ R^p. In light of assumptions (i), (ii) and (iii), we see that for an arbitrary δ > 0 the first term on the right hand side of (7) is no less than (κ + δ)⁻¹ ‖u‖² with probability tending to 1, and the second term is of order O_p(1)‖u‖. The third term is no less than −√(n/p) a_n Σ_{j=1}^{d} |u_j| ≥ −√n a_n ‖u‖ = O_p(1)‖u‖, where the second inequality is due to the Cauchy–Schwarz inequality. Therefore, when ‖u‖ = C is sufficiently large, the right hand side of (7) is positive with probability tending to 1, and (6) follows immediately. This completes the proof.
Theorem 2 (Sparsistency). Under conditions (i)–(iv) stated earlier, if we further assume √(n/p) b_n → ∞, then P(β̂_{λ,j} = 0 for all j = d+1, ..., p) → 1.

Proof. Denote by A_n the event that β̂_{λ,j} ≠ 0 for some j = d+1, ..., p. We will prove the theorem by showing that P(A_n) → 0. If A_n occurs, then ∂Q_n(β)/∂β_j evaluated at β = β̂_λ equals 0 for some j = d+1, ..., p. For this particular j, it is legitimate to write

    √(n/p) ∂Q_n(β)/∂β_j |_{β = β̂_λ} = 2 Ω̂^{(j)} √(n/p) (β̂_λ − β̃) + √(n/p) λ_j sgn(β̂_{λ,j}),    (8)

where Ω̂^{(j)} represents the jth row of Ω̂. We also denote by Ω^{(j)} the jth row of Ω. Since all the eigenvalues of Ω are bounded by κ, those of Ω² are bounded by κ². Therefore ‖Ω^{(j)}‖ = √(e_jᵀ Ω² e_j) ≤ κ, where e_j is the p-vector with jth entry equal to 1 and all other entries equal to 0. Hence ‖Ω̂^{(j)}‖ ≤ ‖Ω^{(j)}‖ + ‖Ω̂ − Ω‖_F ≤ κ + ‖Ω̂ − Ω‖_F for all j = d+1, ..., p. On the other hand, it follows from assumption (i) and Theorem 1 that √(n/p)(β̂_λ − β̃) = O_p(1). Taken together, we see that the first term on the right hand side of (8) is O_p(1) simultaneously across j = d+1, ..., p. The second term satisfies √(n/p) λ_j ≥ √(n/p) b_n → ∞ uniformly over j = d+1, ..., p. In summary, we have shown that, with probability tending to 1, the right hand side of (8) diverges to either +∞ or −∞ for every j = d+1, ..., p. This implies that P(A_n) → 0, and the proof is complete.

Theorem 3 (Oracle Property). Under assumptions (i)–(iv) and the Covariance Assumption, if √(nd) a_n → 0 and √(n/p) b_n → ∞, then for any α ∈ A_d we have

    √n αᵀ(β̂_{λ,a} − β_{0a}) →d N(0, αᵀ Σ_{S_T} α),

where β̂_{λ,a} = (β̂_{λ,1}, ..., β̂_{λ,d})ᵀ denotes the first d components of β̂_λ.
Proof. The same arguments as in Wang and Leng (2007) lead to

    √n αᵀ(β̂_{λ,a} − β_{0a}) = √n αᵀ(β̃_a − β_{0a}) + αᵀ Ω̂_{11}⁻¹ Ω̂_{12} (√n β̃_b) − αᵀ Ω̂_{11}⁻¹ × √n D(β̂_{λ,a}),    (9)

where Ω̂_{11} and Ω̂_{12} are the blocks of Ω̂ corresponding to the partition of β into (β_aᵀ, β_bᵀ)ᵀ, and D(β̂_{λ,a}) is a d-dimensional vector whose jth component is 0.5 λ_j sgn(β̂_{λ,j}). Since the maximum eigenvalue of Ω̂_{11}⁻¹ is bounded with probability tending to 1, an application of the Cauchy–Schwarz inequality shows that the third term on the right hand side is of order √(nd) a_n = o_p(1). It then follows from (9) that

    √n αᵀ(β̂_{λ,a} − β_{0a}) = √n αᵀ (I_{d×d}, Ω_{11}⁻¹ Ω_{12}) (β̃ − β_0) + o_p(1).

The proof can be completed by using (4) (treating αᵀ(I_{d×d}, Ω_{11}⁻¹ Ω_{12}) as the weight vector) and following the same route as the proof of Theorem 3 in Wang and Leng (2007).
Remark 4. Wang and Leng (2007) verified that the Covariance Assumption holds in a variety of typical scenarios, such as the usual linear regression model, the LAD regression model and the logistic regression model. Nevertheless, as also noted in Wang and Leng (2007), the Covariance Assumption is not required for asymptotic normality; its validity is merely used to derive the oracle efficiency.

Remarks 1–4 imply that the results stated in Theorems 1–3 hold in many typical applications, as long as the tuning parameters are appropriately set. We tackle this issue of tuning parameter selection in the next section.

4. Tuning parameter selection

Note that the consistency, the sparsistency and the possible oracle properties of the LSA estimator all depend on a judicious choice of the tuning parameters. Following Zou (2006) and Wang et al. (2007a), we propose to use the adaptive penalty

    λ_j = λ / |β̃_j|^γ,
where γ > 0 is a pre-specified constant. This formulation reduces the p-dimensional optimization problem to a one-dimensional one. Furthermore, the form of the LSA objective allows us to utilize the fast Lars–Lasso algorithm developed by Efron et al. (2004) to compute the solution path of β as a function of λ, at a computational complexity equivalent to that of a single least squares fit. That is, the computational complexity is of the same order as np², and is thus quite efficient. The only additional computation needed is to obtain a preliminary estimator β̃.

When γ = 1 is used, as in the following numerical studies, the conditions needed by Theorems 1–3 will be satisfied as long as the true model size d is bounded and

    √n λ → 0,    nλ/p → ∞,    p²/n → 0.    (10)

This can be verified by noting that √n a_n is bounded by √n λ times a term of order O_p(1), and that

    √(n/p) b_n = √(n/p) λ / max_{d<j≤p} |β̃_j| ≥ √(n/p) λ / ‖β̃ − β_0‖,

which is of order nλ/p by assumption (i). Now, if p²/n → 0, one can easily find a sequence of tuning parameters λ_n satisfying condition (10): for instance, define λ_n = (λ_{n,1}, ..., λ_{n,p})ᵀ with λ_{n,j} = n^{-1/2} (p/√n)^{1−δ} |β̃_j|⁻¹ for some δ ∈ (0, 1); it is evident that this choice fulfills (10).

In practice, we propose minimizing the following penalized quadratic function (PQF) to select the tuning parameter λ:

    PQF_λ = (β̂_λ − β̃)ᵀ Σ̂⁻¹ (β̂_λ − β̃) + C_n p × df_λ / n,    (11)

where df_λ, the degrees of freedom, is the number of nonzero coefficients in β̂_λ, and C_n is required to diverge; in this work we use C_n = max{log log n, 1}. In contrast to the usual criteria such as AIC or BIC, this criterion imposes a heavier penalty, which allows one to cope with the possible large inflation of the quadratic term caused by the diverging dimension and to safeguard against overfitting. Similar ideas have been applied in Chen and Chen (2008) and Wang et al. (2009).
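As a rough sketch of how (11) can be used in practice (again ours; beta_tilde, sigma_inv and beta_path refer to the hypothetical quantities of the earlier code sketch), one evaluates PQF_λ at every solution on the computed path, or on a fine grid of λ values, and keeps the minimizer:

```python
import numpy as np

def select_by_pqf(beta_tilde, sigma_inv, beta_path, n):
    """Return the column of beta_path minimizing the PQF criterion (11)."""
    p = beta_tilde.shape[0]
    Cn = max(np.log(np.log(n)), 1.0)                         # C_n = max{log log n, 1}
    diff = beta_path - beta_tilde[:, None]                   # beta_hat_lambda - beta_tilde, one column per lambda
    quad = np.einsum('ik,ij,jk->k', diff, sigma_inv, diff)   # quadratic form in (11)
    df = (beta_path != 0).sum(axis=0)                        # number of nonzero coefficients
    pqf = quad + Cn * p * df / n
    return beta_path[:, np.argmin(pqf)]
```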
Following Wang et al. (2007a,b) and Wang and Leng (2007), we can partition R_+^p (the set of nonnegative tuning parameter vectors) into the following three mutually exclusive regions, where S_λ denotes the index set of the nonzero components of β̂_λ:

    R_-^p = {λ ∈ R_+^p : S_λ ⊉ S_T},
    R_0^p = {λ ∈ R_+^p : S_λ = S_T},
    R_+^p = {λ ∈ R_+^p : S_λ ⊃ S_T, S_λ ≠ S_T}.

We will need an extra assumption characterizing the strength of the signals (the nonzero coefficients):

(v) √(n/(p² C_n)) × lim inf_{n→∞} { min_{j∈S_T} |β_{0j}| } → ∞.
We can then establish the following theorem, which shows the optimality of the PQF criterion.

Theorem 4 (PQF Optimality). Suppose there is a tuning parameter sequence λ_n satisfying the conditions needed for Theorems 1–3. Then, under the assumptions made for Theorems 1–3 together with the extra assumption (v), minimizing (11) enables one to choose a tuning parameter such that

    P( inf_{λ ∈ R_-^p ∪ R_+^p} PQF_λ > PQF_{λ_n} ) → 1.

Proof. Denote by τ̂_max and τ̂_min the maximum and minimum eigenvalues of Σ̂⁻¹, respectively. Under the assumed conditions, it follows from the spectral theory of large sample covariance matrices (Bai and Silverstein, 2006) that, with probability tending to 1, both τ̂_max and τ̂_min are bounded away from 0 and from above. We can then use the same techniques as in Wang et al. (2007a,b) or Wang and Leng (2007) to show the optimality of PQF.

Case 1 (Underfitted model). Since ‖β̂_S − β_S‖² = O_p(p/n) = o_p(1) and ‖β̃ − β_0‖² = O_p(p/n) = o_p(1), and noting that λ ∈ R_-^p is equivalent to S_λ ⊉ S_T, one can show that

    inf_{λ∈R_-^p} PQF_λ ≥ inf_{λ∈R_-^p} (β̂_{S_λ} − β̃)ᵀ Σ̂⁻¹ (β̂_{S_λ} − β̃)
                       ≥ τ̂_min inf_{λ∈R_-^p} ‖β̂_{S_λ} − β̃‖²
                       ≥ τ̂_min min_{S ⊉ S_T} [ ‖β̂_S − β_0‖² − ‖β̃ − β_0‖² ]
                       ≥ τ̂_min [ min_{j∈S_T} |β_{0j}|² − ‖β̃ − β_0‖² ].

Under condition (v), the right hand side is of higher order than C_n p²/n with probability tending to 1, whereas it is apparent that PQF_{λ_n} = O_p(C_n p²/n). Hence P(inf_{λ∈R_-^p} PQF_λ > PQF_{λ_n}) → 1.
Case 2 (Overfitted model). On the other hand, it is straightforward to show that

    inf_{λ∈R_+^p} (n/p)(PQF_λ − PQF_{λ_n}) ≥ −τ̂_max (n/p) ‖β̂_{λ_n} − β̃‖² + C_n.

Since (n/p)‖β̂_{λ_n} − β̃‖² ≤ (n/p)[‖β̂_{λ_n} − β_0‖² + ‖β̃ − β_0‖²] = O_p(1) and C_n → ∞ as assumed, the right hand side diverges to ∞ with probability tending to 1, i.e., P(inf_{λ∈R_+^p} PQF_λ > PQF_{λ_n}) → 1. This completes the proof.

Table 1
Simulation results for Poisson regression.

n    p   rho    Method   RME: median (SE)   MS: mean (SE)    CM (%)
100  18  0.25   AIC      0.315 (0.008)      5.364 (0.099)    24.0
                BIC      0.221 (0.010)      3.472 (0.039)    68.2
                PQF      0.243 (0.015)      2.506 (0.041)    71.8
                EBIC     0.221 (0.011)      3.036 (0.014)    89.0
         0.5    AIC      0.306 (0.015)      5.390 (0.105)    25.8
                BIC      0.207 (0.012)      3.442 (0.039)    67.0
                PQF      0.247 (0.019)      2.604 (0.029)    69.6
                EBIC     0.213 (0.013)      3.012 (0.016)    85.0
         0.75   AIC      0.319 (0.015)      5.396 (0.102)    24.0
                BIC      0.249 (0.013)      3.414 (0.043)    53.6
                PQF      0.695 (0.058)      2.248 (0.031)    36.0
                EBIC     0.295 (0.029)      2.786 (0.022)    58.6
200  23  0.25   AIC      0.287 (0.011)      6.138 (0.126)    19.4
                BIC      0.131 (0.007)      3.242 (0.025)    81.6
                PQF      0.120 (0.006)      2.996 (0.003)    99.6
                EBIC     0.122 (0.006)      3.024 (0.007)    97.8
         0.5    AIC      0.252 (0.010)      5.714 (0.119)    26.0
                BIC      0.147 (0.004)      3.216 (0.026)    84.6
                PQF      0.142 (0.005)      2.972 (0.008)    97.6
                EBIC     0.142 (0.007)      3.022 (0.009)    97.0
         0.75   AIC      0.260 (0.013)      5.808 (0.132)    27.8
                BIC      0.152 (0.007)      3.278 (0.027)    77.4
                PQF      0.163 (0.011)      2.748 (0.023)    78.0
                EBIC     0.151 (0.009)      3.002 (0.012)    91.8
400  29  0.25   AIC      0.270 (0.007)      7.374 (0.136)    21.0
                BIC      0.139 (0.006)      4.194 (0.024)    85.4
                PQF      0.129 (0.006)      4.000 (0.000)    100
                EBIC     0.129 (0.006)      4.010 (0.004)    99.0
         0.5    AIC      0.236 (0.010)      7.118 (0.141)    24.0
                BIC      0.122 (0.005)      4.112 (0.016)    89.8
                PQF      0.113 (0.005)      4.000 (0.000)    100
                EBIC     0.113 (0.005)      4.008 (0.004)    99.2
         0.75   AIC      0.205 (0.010)      7.290 (0.161)    30.0
                BIC      0.106 (0.005)      4.126 (0.021)    91.0
                PQF      0.102 (0.005)      3.996 (0.003)    99.6
                EBIC     0.102 (0.005)      4.008 (0.004)    99.2
The implication of Theorem 4 is that any λ failing to identify the true model will not be selected by the PQF criterion with probability tending to one. In other words, PQF is capable of identifying the right model with probability tending to one.

Remark 5. It is conceptually straightforward to apply the EBIC criterion proposed by Chen and Chen (2008) for the purpose of tuning parameter selection, but its theoretical justification in our current context is not trivial, although the empirical performance of EBIC seems to be superior; see the following simulations for details.

5. Simulations

We conduct extensive simulation studies in this section to assess the finite sample performance of the LSA in terms of variable selection and estimation accuracy. For comparison, we use AIC, BIC (Wang and Leng, 2007), EBIC (Chen and Chen, 2008) and PQF for variable selection. For each simulation setup, we present the results of 500 simulation replications, summarized by the median relative model error (RME) compared with the full unpenalized model, the average model size (MS) and the percentage of correctly identified models (CM).

Example 1 (Poisson Regression). Independent responses are generated from a Poisson distribution with the mean function µ(x) such that

    µ(x) = xᵀβ.

The components of x_i are standard normal and the correlation between x_{ij1} and x_{ij2} is fixed to be ρ^{|j1−j2|}. We consider ρ = 0.25, 0.5, 0.75 and three sample sizes n = 100, 200, 400. The true coefficient vector is taken to be β = (0.5·1_d, 0_{p−d}), where p = [4n^{1/3}] and d = [p/6]. The results are summarized in Table 1, from which we make the following comments. Firstly, BIC, EBIC and PQF perform better than AIC in terms of variable selection and coefficient estimation, a result to be expected from Wang et al. (2007b). Secondly, for a small sample size (n = 100) and high correlation (ρ = 0.75), BIC performs best; in other cases, however, PQF and EBIC based LSA are the better estimators. The EBIC proposed by Chen and Chen (2008), on average, seems to be the preferred estimator in terms of model selection and estimation accuracy. Indeed, Wang et al. (2009) showed that BIC is not likely to be consistent for variable selection when the number of parameters diverges; our simulation results confirm this point empirically for the generalized linear model.
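For completeness, here is one way (a sketch of ours, not the authors' script) to generate the correlated design and the sparse coefficient vector described above; the same helper with scale=8 and coef=2 produces the setup of Example 2 below.

```python
import numpy as np

def make_design(n, rho, scale=4, coef=0.5, seed=0):
    """Covariates with corr(x_j1, x_j2) = rho**|j1 - j2| and the sparse truth of Example 1."""
    rng = np.random.default_rng(seed)
    p = int(scale * n ** (1 / 3))       # p = [4 n^{1/3}] here; scale=8 gives Example 2
    d = p // 6                          # d = [p / 6]
    idx = np.arange(p)
    cov = rho ** np.abs(np.subtract.outer(idx, idx))              # AR(1)-type correlation matrix
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)         # n correlated standard normal rows
    beta = np.concatenate([np.full(d, coef), np.zeros(p - d)])    # beta = (coef * 1_d, 0_{p-d})
    return X, beta
```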
Table 2
Simulation results for robust regression.

n    p   rho    Method   RME: median (SE)   MS: mean (SE)     CM (%)
100  37  0.25   AIC      0.492 (0.009)      18.35 (0.194)     6.6
                BIC      0.371 (0.008)      13.83 (0.098)     34.4
                PQF      0.461 (0.018)      10.72 (0.133)     74.2
         0.5    AIC      0.494 (0.010)      18.13 (0.195)     6.2
                BIC      0.405 (0.012)      13.59 (0.095)     35.2
                PQF      0.742 (0.084)      10.63 (0.086)     51.2
         0.75   AIC      0.573 (0.013)      18.67 (0.020)     2.6
                BIC      0.524 (0.013)      13.39 (0.115)     16.6
                PQF      2.005 (0.075)      8.316 (0.087)     8.4
200  46  0.25   AIC      0.441 (0.007)      20.188 (0.197)    12.6
                BIC      0.342 (0.008)      15.68 (0.050)     62.6
                PQF      0.347 (0.008)      15.00 (0.000)     100
         0.5    AIC      0.435 (0.007)      20.22 (0.202)     14.4
                BIC      0.346 (0.008)      15.50 (0.047)     71.8
                PQF      0.354 (0.008)      15.00 (0.003)     99.6
         0.75   AIC      0.467 (0.006)      20.31 (0.204)     15.2
                BIC      0.401 (0.008)      15.58 (0.047)     65.0
                PQF      0.588 (0.135)      13.68 (0.079)     52.0
400  58  0.25   AIC      0.441 (0.007)      25.20 (0.206)     10.2
                BIC      0.328 (0.003)      19.34 (0.034)     77.2
                PQF      0.325 (0.004)      19.00 (0.000)     100
         0.5    AIC      0.431 (0.006)      24.73 (0.225)     13.6
                BIC      0.331 (0.005)      19.27 (0.029)     80.8
                PQF      0.329 (0.005)      19.00 (0.000)     100
         0.75   AIC      0.434 (0.008)      24.58 (0.242)     18.8
                BIC      0.359 (0.005)      19.21 (0.026)     85.0
                PQF      0.364 (0.006)      18.98 (0.008)     98.6
Example 2 (Robust Regression). Independent responses are generated from the linear regression model y = xᵀβ + ε, where ε follows the standard Cauchy distribution. The components of x_i are generated as in Example 1. The true coefficient vector is taken to be β = (2·1_d, 0_{p−d}), where p = [8n^{1/3}] and d = [p/6]. To fit the model, we use Huber's robust linear regression with the default settings of the function rlm in the R library MASS. The results are summarized in Table 2. We see from Table 2 that when the sample size is large (n = 200, 400), PQF based LSA is the most successful in recovering the true model. In addition, BIC performs relatively better than AIC and PQF for a small sample size (n = 100).
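The preliminary estimator in this example is obtained in the paper with rlm from the R package MASS. A rough Python stand-in (statsmodels' RLM with a Huber norm, which is our substitution rather than the authors' code) produces the two ingredients that the LSA step of Section 2 and the PQF criterion of Section 4 require:

```python
import numpy as np
import statsmodels.api as sm

def robust_preliminary(X, y):
    """Huber-type robust fit: returns beta-tilde and a plug-in inverse covariance for (2)."""
    fit = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()
    beta_tilde = fit.params                       # unpenalized robust estimate
    sigma_inv = np.linalg.inv(fit.cov_params())   # inverse of the estimated covariance of beta-tilde
    return beta_tilde, sigma_inv
```

With these in hand, the adaptive weights 1/|β̃_j| and the path and PQF sketches given earlier apply unchanged.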
6. Concluding remarks

We have studied an extension of the least squares approximation to the case where a diverging number of covariates is available. We show that the oracle properties in Wang and Leng (2007) continue to apply under appropriate regularity conditions. Furthermore, a novel tuning parameter selection criterion is proposed to deal with the diverging dimensionality problem. These efforts extend the applicability of LSA to growing-dimensional problems and are practically relevant in contemporary statistics.

Acknowledgements

We are grateful to the referees and the Co-editor-in-Chief for their very useful comments, which led to an improvement of the paper. Chenlei Leng is supported by National University of Singapore academic research grants and a grant from the National University of Singapore Risk Management Institute. Bo Li is supported by the National Natural Science Foundation of China (No. 10801086, No. 70621061, and No. 70831003).

References

Bai, Z.D., Silverstein, J.W., 2006. Spectral Analysis of Large Dimensional Random Matrices. China Science Press.
Chen, J., Chen, Z., 2008. Extended Bayesian information criterion for model selection with large model spaces. Biometrika 95, 759–771.
Efron, B., Hastie, T., Johnstone, I., Tibshirani, R., 2004. Least angle regression (with discussion). The Annals of Statistics 32, 407–451.
Fan, J., Li, R., 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96, 1348–1360.
Fan, J., Li, R., 2006. Statistical challenges with high dimensionality: feature selection in knowledge discovery. In: Sanz-Sole, M., Soria, J., Varona, J.L., Verdera, J. (Eds.), Proceedings of the International Congress of Mathematicians, vol. III. European Mathematical Society, Zurich, pp. 595–622.
Fan, J., Peng, H., 2004. On non-concave penalized likelihood with diverging number of parameters. The Annals of Statistics 32, 928–961.
He, X., Shao, Q., 2000. On parameters of increasing dimensions. Journal of Multivariate Analysis 73, 120–135.
Huang, J., Horowitz, J., Ma, S., 2008a. Asymptotic properties of bridge estimators in sparse high-dimensional regression models. The Annals of Statistics 36, 587–613.
Huang, J., Ma, S., Zhang, C.H., 2008b. Adaptive LASSO for sparse high-dimensional regression. Statistica Sinica 18, 1603–1618.
Ledoit, O., Wolf, M., 2004. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis 88, 365–411.
Meinshausen, N., Yu, B., 2009. Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics 37, 246–270.
Park, M.Y., Hastie, T., 2007. An L1 regularization-path algorithm for generalized linear models. Journal of the Royal Statistical Society, Series B 69, 659–677.
Shevade, S., Keerthi, S., 2003. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics 19, 2246–2253.
Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58, 267–288.
Wang, H., Leng, C., 2007. Unified LASSO estimation by least squares approximation. Journal of the American Statistical Association 102, 1039–1048.
Wang, H., Li, G., Tsai, C.L., 2007a. Regression coefficient and autoregressive order shrinkage and selection via lasso. Journal of the Royal Statistical Society, Series B 69, 63–78.
Wang, H., Li, R., Tsai, C.L., 2007b. On the consistency of SCAD tuning parameter selector. Biometrika 94, 553–558.
Wang, H., Li, B., Leng, C., 2009. Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society, Series B 71, 671–683.
Zhang, C., Huang, J., 2008. The sparsity and bias of the Lasso selection in high-dimensional linear regression. The Annals of Statistics 36, 1567–1594.
Zhang, H.H., Lu, W., 2007. Adaptive lasso for Cox's proportional hazards model. Biometrika 94, 691–703.
Zou, H., 2006. The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101, 1418–1429.