Journal of Complexity 32 (2016) 20–39
Stability of the elastic net estimator✩

Yi Shen a,b,c,∗, Bin Han b, Elena Braverman c

a Department of Mathematics, Zhejiang Sci–Tech University, Hangzhou, 310028, China
b Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, Alberta T6G 2G1, Canada
c Department of Mathematics and Statistics, University of Calgary, 2500 University Drive N.W., Calgary, Alberta T2N 1N4, Canada

Article history: Received 30 August 2014; Accepted 16 July 2015; Available online 30 July 2015

Keywords: Compressed sensing; Elastic net; Stable recovery; Correlated random matrix; Grouping effect
Abstract: The elastic net is a regularized least squares regression method that has been widely used in learning and variable selection. The elastic net regularization linearly combines an l1 penalty term (like the lasso) and an l2 penalty term (like ridge regression). The l1 penalty term enforces sparsity of the elastic net estimator, whereas the l2 penalty term ensures democracy among groups of correlated variables. Compressed sensing is currently an extensively studied technique for efficiently reconstructing a sparse vector from much fewer samples/observations. In this paper we study the elastic net in the setting of sparse vector recovery. For recovering sparse vectors from few observations by the elastic net regression, we prove that the elastic net estimator is stable provided that the underlying measurement/design matrix satisfies the commonly required restricted isometry property or the sparse approximation property. It is well known that many independent random measurement matrices satisfy the restricted isometry property, while random measurement matrices generated by highly correlated Gaussian random variables satisfy the sparse approximation property. As a byproduct, we establish a uniform bound for the grouping effect of the elastic net. Some numerical experiments are provided to illustrate our theoretical results on the stability and grouping effect of the elastic net estimator. © 2015 Elsevier Inc. All rights reserved.

✩ Research was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC Canada) under grants 05865 and 261351, and a Pacific Institute for the Mathematical Sciences (PIMS) CRG grant. Research of Yi Shen was also supported in part by a PIMS postdoctoral fellowship, the Zhejiang Provincial Natural Science Foundation of China under grant No. LY15A010020 and the Key Laboratory of Oceanographic Big Data Mining & Application of Zhejiang Province under grant No. OBDMA201505.
∗ Corresponding author at: Department of Mathematics, Zhejiang Sci–Tech University, Hangzhou, 310028, China.
E-mail addresses: [email protected] (Y. Shen), [email protected] (B. Han), [email protected] (E. Braverman). URL: http://www.ualberta.ca/~bhan (B. Han).
http://dx.doi.org/10.1016/j.jco.2015.07.002
0885-064X/© 2015 Elsevier Inc. All rights reserved.
1. Introduction

The standard model of linear regression can be stated as follows:

y = X β∗ + z,   (1.1)

where y ∈ Rn is an observed signal/vector, X ∈ Rn×p is a given measurement/design matrix, β∗ ∈ Rp is an unknown true signal/vector to be recovered, and z is a white Gaussian noise term, i.e., a vector of independent normal/Gaussian random variables. In the setting of compressed sensing, one is interested in the linear regression model (1.1) for the particular case that n is much smaller than p; that is, the dimension p of the ambient space of the unknown signal β∗ is large but the number n of observations in y is extremely small. If the unknown signal β∗ is sparse, the central task of compressed sensing is to recover in a stable way the unknown sparse true signal β∗ from its few noisy observations/measurements y (see e.g. [8,15]). To provide some necessary background for compressed sensing and the elastic net, we first recall some definitions. Let β = (β[1], . . . , β[p])^T denote a vector in Rp. For 0 < q < ∞, the lq norm of
β is ∥β∥q = (|β[1]|^q + · · · + |β[p]|^q)^{1/q}, and the l∞ norm of β is ∥β∥∞ = max{|β[1]|, . . . , |β[p]|}. By supp(β) we denote the support of β, i.e., the set {1 ≤ j ≤ p : β[j] ≠ 0}. Moreover, # supp(β) is the cardinality of the set supp(β), that is, the number of nonzero entries in the vector β. We further define the l0 ''norm'' of β to be ∥β∥0 = # supp(β). For a nonnegative integer s, we say that a vector β ∈ Rp is s-sparse if the number of its nonzero entries is no more than s, that is, ∥β∥0 ≤ s. For a positive integer s, we say that a measurement matrix X ∈ Rn×p has the restricted isometry property (RIP) with RIP constant 0 < δs < 1 if

(1 − δs) ∥β∥2^2 ≤ ∥X β∥2^2 ≤ (1 + δs) ∥β∥2^2,  for all β ∈ Σs,   (1.2)

where the set Σs of all s-sparse vectors in Rp is defined to be

Σs := {γ ∈ Rp : ∥γ∥0 ≤ s}.   (1.3)
The restricted isometry property in (1.2) with a small RIP constant δs implies that every group of s column vectors from the measurement matrix X must be nearly orthogonal. It is well known [8] that, with overwhelming probability, a measurement matrix X generated by many known independent random variables has the restricted isometry property with a small RIP constant. If the measurement matrix X has the restricted isometry property with RIP constants satisfying δs + δ2s + δ3s < 1, [8, Theorem 1.4] shows that one can always recover an s-sparse signal β∗ from the few measurements y in (1.1). Even for the case n = O(s ln(p/s)), so that the number n of measurements in y is much less than the number p of all entries in the unknown true signal β∗, a random measurement matrix can satisfy the restricted isometry property with RIP constants satisfying δs + δ2s + δ3s < 1 with overwhelming probability. Compressed sensing employs the l1 regularization technique, which has been adopted in the basis pursuit [15] and in the lasso [37]. The lasso is an l1 penalized least squares regression method that can fit the observation data well while simultaneously seeking a sparse solution. However, it is known that the lasso may not be an ideal method if a group of columns of a measurement matrix is highly correlated, for example, in microarray data analysis [40]. To avoid this limitation of the lasso, the elastic net was proposed by Zou and Hastie [40] by linearly combining an l1 penalty term (like the lasso) and an l2 penalty term (like ridge regression). In experiments, the elastic net exhibits the ''grouping effect'': it automatically includes all the highly correlated variables in a group. The theoretical properties of the grouping effect of the elastic net have been studied in [40] and further improved in [39].
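As background for the l1 regularization technique mentioned above, the following Python sketch solves the noiseless basis pursuit problem min ∥β∥1 subject to Xβ = y as a linear program via the standard split β = u − v with u, v ≥ 0. It is an illustration only; the dimensions and the random seed are arbitrary choices, not values from the paper.

```python
# Noiseless basis pursuit: min ||beta||_1 s.t. X beta = y, as a linear program.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, p, s = 40, 80, 5

X = rng.standard_normal((n, p)) / np.sqrt(n)   # Gaussian measurement matrix
beta_true = np.zeros(p)
support = rng.choice(p, size=s, replace=False)
beta_true[support] = rng.standard_normal(s)
y = X @ beta_true                               # noiseless observations

# minimize sum(u) + sum(v) subject to [X, -X][u; v] = y and u, v >= 0
c = np.ones(2 * p)
A_eq = np.hstack([X, -X])
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * p), method="highs")
beta_hat = res.x[:p] - res.x[p:]

print(np.max(np.abs(beta_hat - beta_true)))     # near 0: exact recovery
```

With n well above s ln(p/s), exact recovery of the s-sparse vector is expected with overwhelming probability.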
We now provide some necessary details on the elastic net. The elastic net estimator is the solution to the following minimization problem:

min_{β∈Rp} λ1 ∥β∥1 + (λ2/2) ∥β∥2^2 + (1/2) ∥X β − y∥2^2,   (1.4)

where λ1 ≥ 0 and λ2 ≥ 0 are regularization parameters. Two special cases of the elastic net regression in (1.4) are the lasso in [37] with λ2 = 0 and the ridge regression in [22] with λ1 = 0. By βs ∈ Σs we denote the best s-sparse approximation of a vector β ∈ Rp in the sense of the l1 norm, i.e.,

∥β − βs∥1 = min_{γ∈Σs} ∥β − γ∥1,
where the set Σs is defined in (1.3). For a subset S ⊆ {1, . . . , p}, by #S we denote the number of elements in S (that is, #S is the cardinality of the set S) and by S^c we denote the complement of S, that is, S^c := {1, . . . , p}\S. For S ⊆ {1, . . . , p}, we define βS to be the vector which equals β on the set S and is zero elsewhere. Recall that βs denotes the best s-sparse approximation of a vector β. If S records the locations of the s largest entries in magnitude of the vector β, then it is trivial to see that βS is the best s-sparse approximation of β, that is, βS = βs satisfying ∥β − βS∥1 = min_{γ∈Σs} ∥β − γ∥1. Let y, X, β∗ and z be as in (1.1). Consider the linear regression model (1.1). For any given p and s, the elastic net estimator β̂ is said to be stable if there exist positive constants C1 and C2 such that

∥β̂ − β∗∥2 ≤ C1 √s λ1 + C2 ∥β∗ − β∗s∥1/√s,  for all β∗ ∈ Rp.   (1.5)
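The minimization (1.4) can be solved with off-the-shelf software. The sketch below uses scikit-learn's ElasticNet, whose objective equals (1.4) divided by n once alpha = (λ1 + λ2)/n and l1_ratio = λ1/(λ1 + λ2); the synthetic data and the values of λ1, λ2 are illustrative choices, not values from the paper.

```python
# Solving the elastic net (1.4) via scikit-learn. ElasticNet minimizes
#   (1/(2n))||y - X b||_2^2 + alpha*l1_ratio*||b||_1 + (alpha/2)*(1-l1_ratio)*||b||_2^2,
# which is (1.4)/n when alpha = (lam1+lam2)/n and l1_ratio = lam1/(lam1+lam2).
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(1)
n, p, s = 100, 200, 5
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)        # unit l2-norm columns, as assumed in the paper
beta_true = np.zeros(p)
beta_true[:s] = 1.0
y = X @ beta_true + 0.01 * rng.standard_normal(n)

lam1, lam2 = 0.1, 0.01                # illustrative regularization parameters
model = ElasticNet(alpha=(lam1 + lam2) / n, l1_ratio=lam1 / (lam1 + lam2),
                   fit_intercept=False, max_iter=50000)
model.fit(X, y)
beta_hat = model.coef_

print(np.linalg.norm(beta_hat - beta_true))   # l2 error, as in (1.5)
```

The l2 error printed here is the quantity bounded by (1.5).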
Compared with the stability of the constrained l1 minimization method in Corollary 5 (also see [7, Theorem 2] and [34]), the estimate (1.5) is often called nearly optimal in the compressed sensing literature due to the factor √s. When the measurement matrices X are Gaussian random matrices or random partial Fourier matrices, numerical experiments in [23,24] show that the elastic net can recover sparse vectors as effectively as the lasso or other l1 minimization methods do. Suppose that the measurement matrix satisfies the finite basis injectivity property (see [5]) and the unknown true vector β∗ satisfies the source condition (see [19,23]). As the regularization parameter λ2 in (1.4) tends to zero, it has been shown in [23] that the elastic net estimator β̂ in (1.4) converges to the solution of the following augmented l1 minimization method:

min_{β∈Rp} λ3 ∥β∥1 + (1/2) ∥β∥2^2,  subject to y = X β,   (1.6)

with the choice λ3 = λ1/λ2. Also see [28, Proposition 8] for a similar convergence analysis of the elastic net estimator within the framework of statistical learning theory. Asymptotic properties of the elastic net estimator can be found in [38]. Under the condition that the regularization parameter λ3 > 10∥β∗∥∞, if the measurement matrix X satisfies the restricted isometry property with RIP constant δ2s < 0.4404, then the estimator provided by the augmented l1 minimization method in (1.6) has been proved to be stable in [25]. A weaker notion than the restricted isometry property with a small RIP constant δs in (1.2) is the sparse approximation property. If a measurement matrix X satisfies the restricted isometry property with a small RIP constant δs in (1.2), then X has the following sparse approximation property: there exist positive constants c1 and c2 such that
∥β_{S0}∥2 ≤ c1 ∥X β∥2 + c2 ∥β − β_{S0}∥1/√s,  for all β ∈ Rp and S0 ⊆ {1, . . . , p} with #S0 = s.   (1.7)
Therefore, the sparse approximation property is weaker than the commonly adopted restricted isometry property (see [17,36]).
Under the condition that the measurement matrix X satisfies the restricted isometry property with RIP constant δ2s < 4/(5√7), or X has the sparse approximation property with c1 > 0 and 0 < c2 < 1/3, we prove in Theorem 1 and Corollary 3 of this paper that the elastic net estimator β̂ in (1.4) has the desired stability in (1.5) if the regularization parameters λ1 and λ2 in (1.4) satisfy the following condition:

λ1/2 ≥ ∥X^T z∥∞ + λ2 ∥β∗∥∞,
where X^T denotes the transpose of the matrix X. By Xj we denote the jth column vector of a matrix X ∈ Rn×p; in other words, X = [X1, . . . , Xp]. Throughout the paper, we assume that all the column vectors of a measurement matrix X are normalized to have unit l2 norm, that is, ∥Xj∥2 = 1 for all j = 1, . . . , p. Define ⟨Xj, Xk⟩ := Xj^T Xk. Since Xj and Xk have unit l2 norm, we have ⟨Xj, Xk⟩ = cos θj,k, where θj,k is the angle between the two unit vectors Xj and Xk. Therefore, if ⟨Xj, Xk⟩ is close to 1, then the vectors Xj and Xk are said to be highly correlated, since they are very close to each other on the unit sphere. Since measurement matrices with highly correlated columns often appear in machine learning and statistics [31,40], it is natural to study the elastic net with a measurement matrix having highly correlated columns. For example, the restricted eigenvalue property of correlated Gaussian random matrices has been addressed in [31], and this property was then used in [29] to establish sufficient conditions for stable recovery by the lasso under weak sparsity assumptions on the unknown signal β∗. Instead of following [3] by using the restricted eigenvalue property, in this paper we prove in Proposition 4 that a correlated Gaussian random matrix satisfies the sparse approximation property with high probability. We then prove in Theorem 6 that the stability of the elastic net estimator in (1.4) with correlated Gaussian random measurement matrices holds for all β∗ ∈ Rp with high probability. This paper is organized as follows. In Section 2, we establish the stability and grouping effect of the elastic net estimator under the sparse approximation property. As a byproduct, we improve the result on the grouping effect in [39,40] by establishing a uniform bound for the grouping effect. In Section 3, we first study the sparse approximation property for correlated Gaussian random measurement matrices.
Then the elastic net estimator is proved to be stable when the measurement matrix has some highly correlated columns. In Section 4, we provide some numerical experiments to illustrate the performance of the elastic net in recovering sparse vectors.

2. Stability and grouping effect of the elastic net estimator under the sparse approximation property

Before presenting our main results on the stability and grouping effect of the elastic net estimator in (1.4) under the sparse approximation property, let us first discuss measurement/design matrices. A measurement matrix X satisfying the restricted isometry property is often obtained through random matrices. In the following we discuss measurement matrices which are built from some well-known random matrices and satisfy the restricted isometry property. A random variable ξ is called a subgaussian random variable if there exist positive constants c1 and c2 such that

P(|ξ| ≥ t) ≤ c1 e^{−c2 t^2},  for all t > 0.
A subgaussian random matrix X = (Xj,k)_{1≤j≤n,1≤k≤p} ∈ Rn×p has all its entries Xj,k generated by independent subgaussian random variables with zero mean and variance 1/n. For a subgaussian random matrix X ∈ Rn×p satisfying

n ≥ C δ^{−2} s ln(ep/s),   (2.1)

where C is a universal constant and e is the Euler number, the measurement matrix X satisfies the restricted isometry property in (1.2) with RIP constant 0 < δs ≤ δ with probability at least 1 − 2 exp(−δ^2 n/(2C)), e.g., see [18, Theorem 9.2]. Due to the well-known result on Gelfand widths in approximation theory, the number n of measurements in (2.1) is optimal for stable recovery via any l1 minimization method, e.g., see [2,9].
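A quick Monte Carlo illustration of the near-isometry (1.2) for a Gaussian matrix with i.i.d. N(0, 1/n) entries follows. Since it samples only a few sparse vectors, it illustrates, but does not certify, the restricted isometry property; the parameters are arbitrary.

```python
# Empirical check: a Gaussian matrix with N(0, 1/n) entries nearly preserves
# the squared l2 norm of random s-sparse vectors.
import numpy as np

rng = np.random.default_rng(2)
n, p, s = 200, 400, 10
X = rng.standard_normal((n, p)) / np.sqrt(n)   # subgaussian (Gaussian) matrix

ratios = []
for _ in range(100):
    beta = np.zeros(p)
    idx = rng.choice(p, size=s, replace=False)
    beta[idx] = rng.standard_normal(s)
    ratios.append(np.linalg.norm(X @ beta) ** 2 / np.linalg.norm(beta) ** 2)

lo, hi = min(ratios), max(ratios)
print(lo, hi)   # both close to 1 for these parameters
```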
Another family of random matrices is the random partial Fourier matrices. A matrix X ∈ Rn×p is called a random partial Fourier matrix if all its row vectors are randomly chosen from the row vectors of the p × p discrete Fourier transform matrix. Due to the fast Fourier transform, a random partial Fourier matrix leads to fast algorithms and is of interest in many applications. It has been proved in [9] that for any τ > 1 and for all sufficiently large p (such a lower bound on p depends on τ), if

n ≥ C′(δ) τ s log^6 p,   (2.2)

then the normalized n × p random partial Fourier matrix (1/√n) X satisfies the restricted isometry property with RIP constant 0 < δs ≤ δ with probability at least 1 − p^{−c′(δ)τ}, where C′(δ) and c′(δ) are positive constants depending only on δ. The number n of measurements in (2.2) has been further improved in [33]. In comparison with (2.1), the number n of measurements in (2.2) is nearly optimal. Other structured random matrices, such as random partial Toeplitz matrices, random partial circulant matrices and time-frequency structured random matrices, also satisfy the restricted isometry property with a nearly optimal number of measurements [21,30,32]. Deterministic Fourier matrices which satisfy the restricted isometry property are discussed in [4] and the references therein. For other random matrices with the optimal number of measurements (for instance, matrices generated by independent Laplace random variables), the restricted isometry property fails [1], but the sparse approximation property holds [17]. We now have the following result on the stability of the elastic net estimator in (1.4), assuming that the measurement matrix satisfies the sparse approximation property.
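A random partial Fourier matrix is easy to form explicitly. The sketch below draws n rows of the p × p DFT matrix and checks the near-isometry on a single sparse vector; this is an illustration only, and the parameters are arbitrary.

```python
# Random partial Fourier matrix: n rows of the unitary p x p DFT matrix,
# rescaled so that the columns have unit l2 norm in expectation.
import numpy as np

rng = np.random.default_rng(3)
n, p, s = 128, 512, 5
F = np.fft.fft(np.eye(p)) / np.sqrt(p)          # unitary p x p DFT matrix
rows = rng.choice(p, size=n, replace=False)
X = F[rows, :] * np.sqrt(p / n)                 # normalized partial Fourier matrix

beta = np.zeros(p)
beta[rng.choice(p, size=s, replace=False)] = 1.0
ratio = np.linalg.norm(X @ beta) ** 2 / np.linalg.norm(beta) ** 2
print(ratio)   # close to 1
```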
Theorem 1. Let y, X, β∗ and z be as in the model (1.1); that is, X ∈ Rn×p is a measurement matrix, β∗ ∈ Rp is an unknown true signal to be recovered, and y ∈ Rn is the observation. If the regularization parameters λ1 and λ2 of the elastic net regression in (1.4) satisfy

λ1 ≥ ρ (∥X^T z∥∞ + λ2 ∥β∗∥∞)  for some ρ > 1   (2.3)

(a commonly adopted choice of ρ in (2.3) is ρ = 2) and if the measurement matrix X satisfies the sparse approximation property with positive constants c1 > 0 and 0 < c2 < (ρ−1)/(ρ+1), then the elastic net estimator β̂ obtained by the elastic net minimization method in (1.4) satisfies

∥X β̂ − X β∗∥2 ≤ 2(1 + 1/ρ) c1 √s λ1 + (2ρ/((ρ+1) c1)) ∥β∗ − β∗s∥1/√s,   (2.4)

∥β̂ − β∗∥2 ≤ C3 √s λ1 + C4 ∥β∗ − β∗s∥1/√s,   (2.5)

where β∗s is the best s-sparse approximation of β∗ in terms of the l1 norm and

C3 = c1^2 ((ρ+1)/ρ) ((5ρ − 3) + (3ρ − 1) c2) / ((ρ − 1) − (ρ + 1) c2),
C4 = ρ/(ρ − 1) + (((5ρ − 3) + (3ρ − 1) c2) / ((ρ − 1) − (ρ + 1) c2)) (ρ/(ρ + 1) + ρ c2/(ρ − 1)).   (2.6)
Proof. Set h := β̂ − β∗. Let S denote the locations of the s largest entries in magnitude of the vector β∗. Recall that S^c is the complement of S, that is, S^c = {1, . . . , p}\S. By S1 we denote the locations of the s largest entries in magnitude of h in S^c, by S2 the locations of the s largest entries in magnitude of h in (S ∪ S1)^c, and so on. The stability of the elastic net estimator can be established under two requirements on the measurement matrix X. The first requirement is the sparse approximation property in (1.7) and the other is the following condition:

∥X h∥2^2 + 2(1 − 1/ρ) λ1 ∥h_{S^c}∥1 ≤ 2(1 + 1/ρ) λ1 ∥h_S∥1 + 4 λ1 ∥β∗_{S^c}∥1.   (2.7)
We first prove that the measurement matrix X satisfies the condition in (2.7). Employing the widely used technique and writing β̂ = β∗ + h, we have

∥β̂∥1 − ∥β∗∥1 = ∥β∗ + h∥1 − ∥β∗∥1
= (∥β∗_S + h_S∥1 + ∥β∗_{S^c} + h_{S^c}∥1) − (∥β∗_S∥1 + ∥β∗_{S^c}∥1)
≥ ∥β∗_S∥1 − ∥h_S∥1 + ∥h_{S^c}∥1 − ∥β∗_{S^c}∥1 − ∥β∗_S∥1 − ∥β∗_{S^c}∥1
= ∥h_{S^c}∥1 − ∥h_S∥1 − 2∥β∗_{S^c}∥1.   (2.8)
By a simple computation with β̂ = β∗ + h, we deduce that

(1/2)∥X β̂ − y∥2^2 + (λ2/2)∥β̂∥2^2 − (1/2)∥X β∗ − y∥2^2 − (λ2/2)∥β∗∥2^2
= (1/2)∥X(β∗ + h) − y∥2^2 + (λ2/2)∥β∗ + h∥2^2 − (1/2)∥X β∗ − y∥2^2 − (λ2/2)∥β∗∥2^2
= ⟨X^T(X β∗ − y) + λ2 β∗, h⟩ + (λ2/2)∥h∥2^2 + (1/2)∥X h∥2^2
≥ −∥λ2 β∗ − X^T z∥∞ ∥h∥1 + (1/2)∥X h∥2^2
≥ −(λ1/ρ)(∥h_S∥1 + ∥h_{S^c}∥1) + (1/2)∥X h∥2^2,   (2.9)
where in the second-to-last inequality we used X β∗ − y = −z from (1.1), and in the last inequality our assumption in (2.3). Since the vector β̂ minimizes (1.4), the inequality in (2.9) leads to

(1/2)∥X h∥2^2 − (λ1/ρ)(∥h_S∥1 + ∥h_{S^c}∥1) + λ1(∥β̂∥1 − ∥β∗∥1)
≤ (1/2)∥X β̂ − y∥2^2 + (λ2/2)∥β̂∥2^2 + λ1∥β̂∥1 − (1/2)∥X β∗ − y∥2^2 − (λ2/2)∥β∗∥2^2 − λ1∥β∗∥1 ≤ 0.

That is, we proved

∥X h∥2^2 − (2λ1/ρ)(∥h_S∥1 + ∥h_{S^c}∥1) ≤ −2λ1(∥β̂∥1 − ∥β∗∥1).   (2.10)
It follows from (2.8) and (2.10) that

∥X h∥2^2 − (2λ1/ρ)(∥h_S∥1 + ∥h_{S^c}∥1) ≤ −2λ1(∥h_{S^c}∥1 − ∥h_S∥1 − 2∥β∗_{S^c}∥1).

The above inequality can be rewritten as

∥X h∥2^2 + 2(1 − 1/ρ)λ1∥h_{S^c}∥1 ≤ 2(1 + 1/ρ)λ1∥h_S∥1 + 4λ1∥β∗_{S^c}∥1,

which proves (2.7). By (2.7), we trivially have

∥h_{S^c}∥1 ≤ ((ρ+1)/(ρ−1))∥h_S∥1 + (2ρ/(ρ−1))∥β∗_{S^c}∥1.   (2.11)
By the definition of the subsets Sj for j ≥ 1, we have

∥h_{S^c}∥2 ≤ ∥h_{S1}∥2 + (√2 − 1) Σ_{j≥2} ∥h_{Sj}∥2,   (2.12)

which can be easily proved by taking squares on both sides. By the definition of the subsets Sj for j ≥ 1 and the Cauchy–Schwarz inequality, we also have the following inequality:

∥h_{Sj+1}∥2 ≤ ∥h_{Sj}∥1/√s,  for all j ≥ 1.

From the above two inequalities, we now have the following well-known inequality (e.g., see [8]):

Σ_{j≥2} ∥h_{Sj}∥2 ≤ ∥h_{S^c}∥1/√s.

This, together with (2.12) and √2 − 1 < 1/2, implies that

∥h_{S^c}∥2 ≤ ∥h_{S1}∥2 + ∥h_{S^c}∥1/(2√s).   (2.13)
Since h = h_S + h_{S^c}, it follows from (2.11) and (2.13) that

∥h∥2 ≤ ∥h_S∥2 + ∥h_{S^c}∥2 ≤ ∥h_S∥2 + ∥h_{S1}∥2 + ∥h_{S^c}∥1/(2√s)
≤ ∥h_S∥2 + ((3ρ − 1)/(2(ρ − 1)))∥h_{S1}∥2 + ρ∥β∗_{S^c}∥1/((ρ − 1)√s),   (2.14)

where we used ∥h_{S1}∥1 ≤ √s ∥h_{S1}∥2 in the last inequality. It remains to bound the quantities ∥h_S∥2 and ∥h_{S1}∥2 on the right-hand side of the above inequality. Since the measurement matrix X satisfies the sparse approximation property and #S = s, we have

∥h_S∥1 ≤ √s ∥h_S∥2 ≤ √s c1∥X h∥2 + c2∥h_{S^c}∥1.   (2.15)
It follows from (2.7) and (2.15) that

∥X h∥2^2 + 2λ1(1 − 1/ρ − (1 + 1/ρ)c2)∥h_{S^c}∥1
≤ 2(1 + 1/ρ)λ1∥h_S∥1 + 4λ1∥β∗_{S^c}∥1 − 2(1 + 1/ρ)λ1 c2∥h_{S^c}∥1
≤ 2(1 + 1/ρ)λ1 c1 √s ∥X h∥2 + 4λ1∥β∗_{S^c}∥1,

where we used (2.7) for the first inequality and (2.15) for the second inequality. By our assumption 0 < c2 < (ρ−1)/(ρ+1), we have (1 − 1/ρ) − (1 + 1/ρ)c2 > 0 and the above inequality trivially implies

∥X h∥2^2 ≤ 2(1 + 1/ρ)λ1 c1 √s ∥X h∥2 + 4λ1∥β∗_{S^c}∥1,

from which we conclude that

∥X h∥2 ≤ 2(1 + 1/ρ)λ1 c1 √s + (2ρ/((ρ+1)c1)) ∥β∗_{S^c}∥1/√s.   (2.16)
Since h = β̂ − β∗ and β∗s is the best s-sparse approximation of β∗, we have ∥β∗_{S^c}∥1 = ∥β∗ − β∗_S∥1 = ∥β∗ − β∗s∥1. Consequently, (2.16) is simply (2.4). This proves (2.4). By the sparse approximation property in (1.7) and the estimate of ∥X h∥2 in (2.16), we deduce that

∥h_S∥2 ≤ c1∥X h∥2 + c2∥h_{S^c}∥1/√s
≤ c1 (2(1 + 1/ρ)λ1 c1 √s + (2ρ/((ρ+1)c1))∥β∗_{S^c}∥1/√s) + c2∥h_{S^c}∥1/√s
≤ 2(1 + 1/ρ)c1^2 λ1 √s + (2ρ/(ρ+1))∥β∗_{S^c}∥1/√s + (c2/√s)(((ρ+1)/(ρ−1))∥h_S∥1 + (2ρ/(ρ−1))∥β∗_{S^c}∥1)
≤ 2(1 + 1/ρ)c1^2 λ1 √s + ((ρ+1)c2/(ρ−1))∥h_S∥2 + (2ρ/(ρ+1) + 2ρc2/(ρ−1))∥β∗_{S^c}∥1/√s,   (2.17)

where we used (2.11) in the third inequality and ∥h_S∥1 ≤ √s∥h_S∥2 (by the Cauchy–Schwarz inequality) in the last inequality. Since 0 < c2 < (ρ−1)/(ρ+1), we have (ρ+1)c2/(ρ−1) < 1. From (2.17), moving the term ((ρ+1)c2/(ρ−1))∥h_S∥2 from the right-hand side to the left-hand side and then dividing both sides by 1 − (ρ+1)c2/(ρ−1) > 0, we have

∥h_S∥2 ≤ (2c1^2(ρ^2 − 1)/ρ) / ((ρ − 1) − (ρ + 1)c2) λ1 √s + (2ρ/(ρ+1)) ((ρ − 1) + (ρ + 1)c2)/((ρ − 1) − (ρ + 1)c2) ∥β∗_{S^c}∥1/√s.   (2.18)
To evaluate ∥h_{S1}∥2, by the sparse approximation property in (1.7) and h = h_S + h_{S^c}, we obtain

∥h_{S1}∥2 ≤ c1∥X h∥2 + c2∥h − h_{S1}∥1/√s ≤ c1∥X h∥2 + c2∥h∥1/√s ≤ c1∥X h∥2 + c2(∥h_S∥1 + ∥h_{S^c}∥1)/√s.

Therefore, an upper bound for ∥h_{S1}∥2 can be estimated as follows:

∥h_{S1}∥2 ≤ c1∥X h∥2 + c2(∥h_S∥1 + ∥h_{S^c}∥1)/√s
≤ c1∥X h∥2 + (c2/√s)∥h_S∥1 + ((ρ+1)c2/((ρ−1)√s))∥h_S∥1 + (2ρc2/((ρ−1)√s))∥β∗_{S^c}∥1
≤ c1∥X h∥2 + (2ρc2/(ρ−1))∥h_S∥2 + (2ρc2/(ρ−1))∥β∗_{S^c}∥1/√s,

where we used (2.11) in the second inequality and ∥h_S∥1 ≤ √s∥h_S∥2 in the third inequality. By (2.16) and (2.18), we deduce from the above inequality that

∥h_{S1}∥2 ≤ (2ρc2/(ρ−1))∥β∗_{S^c}∥1/√s + c1 (2(1 + 1/ρ)λ1 c1 √s + (2ρ/((ρ+1)c1))∥β∗_{S^c}∥1/√s)
+ (2ρc2/(ρ−1)) ((2c1^2(ρ^2 − 1)/ρ)/((ρ−1) − (ρ+1)c2) λ1 √s + (2ρ/(ρ+1)) ((ρ−1) + (ρ+1)c2)/((ρ−1) − (ρ+1)c2) ∥β∗_{S^c}∥1/√s)
= 2c1^2(1 + c2) ((ρ^2 − 1)/ρ)/((ρ − 1) − (ρ + 1)c2) λ1 √s + (2(1 + c2)ρ/(ρ+1)) ((ρ − 1) + (ρ + 1)c2)/((ρ − 1) − (ρ + 1)c2) ∥β∗_{S^c}∥1/√s.   (2.19)

Plugging the estimate of ∥h_S∥2 in (2.18) and the estimate of ∥h_{S1}∥2 in (2.19) into (2.14), we conclude that the bound in (2.5) holds, since h = β̂ − β∗. □
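As a quick numerical sanity check, the constants in (2.6) can be evaluated directly. The sketch below sets c2 = (ρ − 1)/(2(ρ + 1)), the choice made in the discussion of Fig. 1 below, and c1 = 1 (an assumption of this sketch; c1 only rescales C3 quadratically).

```python
# Evaluate the constants C3 and C4 of (2.6) for a given rho,
# with c2 = (rho-1)/(2(rho+1)) and c1 = 1.
def constants(rho, c1=1.0):
    c2 = (rho - 1.0) / (2.0 * (rho + 1.0))
    D = (rho - 1.0) - (rho + 1.0) * c2          # positive since c2 < (rho-1)/(rho+1)
    num = (5.0 * rho - 3.0) + (3.0 * rho - 1.0) * c2
    C3 = c1 ** 2 * (rho + 1.0) / rho * num / D
    C4 = rho / (rho - 1.0) + (num / D) * (rho / (rho + 1.0) + rho * c2 / (rho - 1.0))
    return C3, C4

C3, C4 = constants(2.0)
print(C3 * 2.0, C4)   # C3*rho = 47.0 and C4 = 53/3 ≈ 17.67 at rho = 2
```

The values C3 ρ = 47 and C4 ≈ 17.7 at ρ = 2 are consistent with the ranges shown in Fig. 1.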
The common choice for ρ in (2.3) in the literature is ρ = 2, e.g., see [11,29]. Now we consider the error bound (2.5) with the constants C3 and C4 given by (2.6). Let λ1 = ρ(∥X^T z∥∞ + λ2∥β∗∥∞). Then we rewrite the first term C3 √s λ1 as (C3 ρ) √s (λ1/ρ). In this way, λ1/ρ is independent of ρ. Since the constant c2 in Theorem 1 cannot be zero and 0 < c2 < (ρ−1)/(ρ+1), we choose c2 = (ρ−1)/(2(ρ+1)) and calculate the constants C3 ρ and C4 numerically. Both of them are functions of ρ. It is observed from Fig. 1 that C3 ρ and C4 almost achieve their minimal values around ρ = 2. Hence, ρ ≈ 2 seems to be a good choice.

As illustrated in [40] for some statistical applications, columns of a measurement matrix can be highly correlated, i.e., ∥Xj − Xk∥2 can be close to 0. A regression scheme is said to exhibit the grouping effect [40] if the corresponding regression coefficients satisfy β[j] ≈ β[k] whenever ∥Xj − Xk∥2 ≈ 0. It has been proved in [39,40] that the elastic net estimator β̂ = (β̂[1], . . . , β̂[p])^T has the following grouping effect:

|β̂[j] − β̂[k]| ≤ |⟨Xj − Xk, y − X β̂⟩| / λ2,  for all 1 ≤ j, k ≤ p.   (2.20)
Fig. 1. Let c2 = (ρ−1)/(2(ρ+1)) and λ1 = ρ(∥X^T z∥∞ + λ2∥β∗∥∞). Then the constants C3 ρ and C4 which are defined in (2.6) depend only on ρ. The number ρ ranges from 1.2 to 4 in the graphs.

If ∥Xj − Xk∥2 tends to zero, the inequality in (2.20) guarantees that the difference between the regression coefficients β̂[j] and β̂[k] must be small. Thus the inequality (2.20) gives a quantitative
description of the grouping effect of the elastic net. For frame-based image restoration problems, due to the redundancy of a tight frame, elements of a tight frame are often highly correlated. Therefore, the corresponding frame coefficients of the recovered image are expected to be close to each other as well. The grouping effect property for frame-based convex minimization models using the balanced approach was established in [35]. The inequality in (2.20) shows that an upper bound of |β̂[j] − β̂[k]| depends on the recovered vector β̂. As a consequence, the upper bound in (2.20) for the grouping effect cannot be verified directly.

The lq ball with radius Rq is defined in [29, Section 4.3] as follows:

Bq(Rq) := {β ∈ Rp : ∥β∥q ≤ Rq}  with q ∈ (0, 1].

For a vector β = (β[1], . . . , β[p])^T ∈ Rp, the vector (β(1), . . . , β(p))^T is the rearrangement vector with the same entries as the vector β, but such that β(j) has the jth largest magnitude among β[1], . . . , β[p], with |β(1)| ≥ |β(2)| ≥ · · · ≥ |β(p)|. If

|β(j)| ≤ M/j^{1/q},  for all j = 1, . . . , p,   (2.21)

then the smallest such constant M in (2.21) is called the weak lq norm of the vector β ∈ Rp. Under the conditions of Theorem 1, in the following we improve the result on the grouping effect in (2.20) by establishing a uniform bound using only the unknown vector β∗ instead of a recovered vector β̂. Under the additional assumption that the unknown vector β∗ belongs to some weak lq space, we establish a universal bound for the grouping effect as follows:

Theorem 2. Under the same assumptions as in Theorem 1, if the noise vector satisfies ∥z∥2 ≤ ε with ε > 0, then the elastic net estimator β̂ = (β̂[1], . . . , β̂[p])^T in (1.4) satisfies
|β̂[j] − β̂[k]| ≤ (∥Xj − Xk∥2/λ2) (2(1 + 1/ρ) c1 λ1 √s + (2ρ/((ρ+1)c1)) ∥β∗ − β∗s∥1/√s + ε).   (2.22)

Moreover, if β∗ belongs to the weak lq space with weak lq norm M for some 0 < q < 1, then

|β̂[j] − β̂[k]| ≤ (∥Xj − Xk∥2/λ2) (2(1 + 1/ρ) c1 λ1 √s + (2ρ qM/((ρ+1)c1(1 − q))) s^{−r} + ε),   (2.23)

where r := 1/q − 1/2 > 0 and ρ > 1 is as in (2.3).
Proof. It follows from (2.20) and (2.4) that

|β̂[j] − β̂[k]| ≤ |⟨Xj − Xk, y − X β̂⟩|/λ2 ≤ ∥Xj − Xk∥2 ∥y − X β̂∥2/λ2
≤ (∥Xj − Xk∥2/λ2) (∥X β∗ − X β̂∥2 + ∥y − X β∗∥2)
≤ (∥Xj − Xk∥2/λ2) (2(1 + 1/ρ) c1 λ1 √s + (2ρ/((ρ+1)c1)) ∥β∗ − β∗s∥1/√s + ε).   (2.24)

If in addition β∗ belongs to the weak lq space with weak lq norm M, then

∥β∗ − β∗s∥1/√s = (1/√s) Σ_{j=s+1}^p |β∗(j)| ≤ (1/√s) Σ_{j=s+1}^p M j^{−1/q} ≤ (M/√s) ∫_s^∞ t^{−1/q} dt = (qM/(1 − q)) s^{1−1/q}/√s = (qM/(1 − q)) s^{−r},   (2.25)

where we used r = 1/q − 1/2 > 0. Substituting (2.25) into (2.24), we conclude that the inequality in (2.23) holds. □
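The tail estimate (2.25) is easy to verify numerically for a concrete weak lq vector; the parameter values below are arbitrary test choices.

```python
# Check (2.25): for beta with |beta_(j)| = M * j^(-1/q), the scaled tail
# ||beta - beta_s||_1 / sqrt(s) is at most q*M/(1-q) * s^(-r), r = 1/q - 1/2.
import numpy as np

q, M, p, s = 0.5, 1.0, 10000, 20
j = np.arange(1, p + 1)
beta = M * j ** (-1.0 / q)                     # already sorted by magnitude

tail = np.sum(beta[s:]) / np.sqrt(s)           # ||beta - beta_s||_1 / sqrt(s)
bound = q * M / (1.0 - q) * s ** (-(1.0 / q - 0.5))
print(tail <= bound)                            # True
```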
If all the vectors Xj are normalized in Theorem 2 so that ∥Xj∥2 = 1 for all j = 1, . . . , p, then

∥Xj − Xk∥2 = √(2(1 − Xj^T Xk)).
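The grouping effect can also be observed numerically. In the sketch below, two columns of the design matrix are nearly identical, and the elastic net (solved with scikit-learn; alpha and l1_ratio are arbitrary illustrative values, not from the paper) assigns them nearly equal coefficients.

```python
# Grouping effect (2.20)/(2.22): nearly identical columns receive nearly
# equal elastic net coefficients.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(4)
n, p = 100, 10
X = rng.standard_normal((n, p))
X[:, 1] = X[:, 0] + 1e-3 * rng.standard_normal(n)   # columns 0 and 1 highly correlated
X /= np.linalg.norm(X, axis=0)                      # unit-norm columns
y = X[:, 0] + X[:, 1] + 0.01 * rng.standard_normal(n)

model = ElasticNet(alpha=0.01, l1_ratio=0.5, fit_intercept=False, max_iter=50000)
model.fit(X, y)
c = model.coef_
print(abs(c[0] - c[1]))     # small: the correlated pair is treated as a group
```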
Since the restricted isometry property implies the sparse approximation property, we have the following result on stable recovery of the elastic net estimator using the restricted isometry property.

Corollary 3. Suppose that the measurement matrix X satisfies the restricted isometry property with RIP constant

δ2s < 4/√41 ≈ 0.6247.

Then 1 < c1 < 2.0405 and 0 < c2 < 1, where

c1 = √(1 + δ2s) / (√(1 − δ2s^2) − δ2s/4)  and  c2 = δ2s / (√(1 − δ2s^2) − δ2s/4).   (2.26)

If the regularization parameters λ1 and λ2 in the elastic net in (1.4) satisfy (2.3) for some ρ > (1 + c2)/(1 − c2), then the elastic net estimator β̂ in (1.4) satisfies (2.5) with positive constants C3 and C4 depending on δ2s.

Proof. Suppose that the measurement matrix X satisfies the restricted isometry property with RIP constant δ2s. Then [18, Theorem 6.13] shows that the measurement matrix X satisfies the sparse approximation property with the constants c1 and c2 in (2.26). It is straightforward to check that δ2s < 4/√41 implies 1 < c1 < 2.0405 and 0 < c2 < 1, where the constants c1 and c2 are defined in (2.26). The claim now follows from Theorem 1. □
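A short numerical check of the constants c1 and c2 in (2.26) and the thresholds stated in Corollary 3 (an interior value of δ2s, and the endpoint δ2s = 4/√41 where c2 reaches 1):

```python
# Constants of (2.26) as functions of delta_2s.
import math

def sap_constants(delta):
    den = math.sqrt(1.0 - delta ** 2) - delta / 4.0
    return math.sqrt(1.0 + delta) / den, delta / den   # (c1, c2)

c1, c2 = sap_constants(0.5)
print(c1, c2)                                   # c1 in (1, 2.0405), c2 in (0, 1)

c1_max, c2_max = sap_constants(4.0 / math.sqrt(41.0))
print(c1_max, c2_max)                           # c2 = 1 exactly at the threshold
```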
The elastic net can be equivalently rewritten in terms of the lasso as follows:

min_{β∈Rp} λ1∥β∥1 + (1/2)∥X̃ β − ỹ∥2^2,

where

ỹ = [y; 0]  and  X̃ = [X; √λ2 I],

that is, ỹ appends p zeros to y and X̃ stacks √λ2 I below X.
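The stated equivalence can be verified pointwise: for every β, the augmented lasso objective agrees with the elastic net objective (1.4). A quick numerical check with random data and arbitrary λ1, λ2:

```python
# Verify: lam1*||b||_1 + 0.5*||X_tilde b - y_tilde||^2
#       = lam1*||b||_1 + (lam2/2)*||b||^2 + 0.5*||X b - y||^2, for any b.
import numpy as np

rng = np.random.default_rng(5)
n, p = 30, 50
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
lam1, lam2 = 0.3, 0.7
b = rng.standard_normal(p)

X_tilde = np.vstack([X, np.sqrt(lam2) * np.eye(p)])
y_tilde = np.concatenate([y, np.zeros(p)])

lhs = lam1 * np.sum(np.abs(b)) + 0.5 * np.linalg.norm(X_tilde @ b - y_tilde) ** 2
rhs = (lam1 * np.sum(np.abs(b)) + 0.5 * lam2 * np.linalg.norm(b) ** 2
       + 0.5 * np.linalg.norm(X @ b - y) ** 2)
print(np.isclose(lhs, rhs))   # True
```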
To establish the stability of the elastic net estimator as in Corollary 3, one may hope that [6, Theorem 2.8] can be directly applied if the augmented measurement matrix X̃ satisfies the restricted isometry property. However, if X̃ satisfies the restricted isometry property with a small RIP constant 0 < δ2s < 1, then we must have 1 − δ2s ≤ λ2 ≤ 1 + δ2s; in other words, λ2 ≈ 1. The quantity λ2 can be away from 1 in some applications (see [40]). Consequently, the restricted isometry property for X̃ can fail, and thus [6, Theorem 2.8] cannot be directly applied to prove Corollary 3.

To set the regularization parameter λ1 in (2.3), ∥β∗∥∞ should be available before solving the elastic net minimization program in (1.4). Although β∗ is unknown, it is still possible to estimate ∥β∗∥∞ in applications. For example, ∥β∗∥∞ is the maximum intensity of the underlying signal or the maximum sensor reading (see [25]). Similar estimates for the regularization parameter λ1 can be found in [25,29].

3. Correlated Gaussian random matrices and the sparse approximation property

In this section, we first study the sparse approximation property for correlated Gaussian random matrices. Then we shall establish the stability of the elastic net estimator using a correlated Gaussian random measurement matrix. Let us first recall the correlated Gaussian random matrices proposed in [31]. A correlated Gaussian random matrix is a matrix XΣ ∈ Rn×p with independent and identically distributed (i.i.d.) rows obeying the distribution N(0, Σ) with zero mean and a covariance matrix Σ. If Σ is a diagonal matrix, then all the entries of the random matrix XΣ are independent; nevertheless, for simplicity, we still call it a correlated Gaussian random matrix.
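Rows i.i.d. from N(0, Σ) can be generated from a Cholesky factor of Σ. The sketch below uses an AR(1)-type covariance Σj,k = a^{|j−k|} as an illustrative example of correlated columns; this particular Σ is our choice, not one taken from the paper.

```python
# Draw the rows of X_Sigma i.i.d. from N(0, Sigma): if Z has i.i.d. N(0,1)
# entries and Sigma = L L^T, then Z @ L.T has rows distributed as N(0, Sigma).
import numpy as np

rng = np.random.default_rng(6)
n, p, a = 500, 50, 0.9
Sigma = a ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # Sigma_jk = a^|j-k|
L = np.linalg.cholesky(Sigma)
X_Sigma = rng.standard_normal((n, p)) @ L.T     # rows i.i.d. ~ N(0, Sigma)
X = X_Sigma / np.sqrt(n)                        # normalization used in the paper

emp = X_Sigma.T @ X_Sigma / n                   # empirical covariance of the rows
print(np.max(np.abs(emp - Sigma)))              # entrywise close to Sigma
```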
In particular, if Σ is a diagonal matrix with all diagonal entries equal to 1, then the correlated Gaussian random matrix (1/√n) XΣ simply becomes the standard Gaussian random matrix, which has been extensively employed and investigated in compressed sensing [8,14]. Suppose that Σ is a positive definite matrix. Let ρ1(Σ) denote the square root of the smallest eigenvalue of Σ = (Σj,k)_{1≤j≤p,1≤k≤p} and let ρ2^2(Σ) = max_{j=1,...,p} Σj,j denote its maximal variance. Suppose that the correlated Gaussian random matrix XΣ is generated by drawing each row of XΣ in Rp i.i.d. from an N(0, Σ) distribution. Let X = (1/√n) XΣ. Then there exist two strictly positive constants κ1 = ρ1(Σ)/4 and κ2 = 9ρ2(Σ) such that

∥X β∥2 ≥ κ1 ∥β∥2 − κ2 √(log p / n) ∥β∥1,  for all β ∈ Rp,   (3.1)

with probability at least 1 − c3 exp(−c4 n) for some positive constants c3 and c4 (see [31, Theorem 1]). Let us first discuss the relationship between the sparse approximation property and the inequality in (3.1).

Proposition 4. Suppose that a matrix XΣ ∈ Rn×p is generated by drawing each row of XΣ in Rp i.i.d. from an N(0, Σ) distribution, where Σ is a p × p positive definite matrix. If the number n of measurements satisfies

n > (κ2/κ1)^2 s log p,   (3.2)

then with probability at least 1 − c3 exp(−c4 n), the normalized measurement matrix X = (1/√n) XΣ satisfies the sparse approximation property with the positive constants c1 and c2 given in (3.4).

Proof. Let S0 be any subset of {1, . . . , p} with #S0 ≤ s. Since the measurement matrix X ∈ Rn×p satisfies (3.1) with probability at least 1 − c3 exp(−c4 n), we have
κ1 ∥β_{S0}∥2 ≤ κ1 ∥β∥2 ≤ ∥X β∥2 + κ2 √(log(p)/n) ∥β∥1
  = ∥X β∥2 + κ2 √(log(p)/n) ( ∥β_{S0}∥1 + ∥β_{S0^c}∥1 )
  = ∥X β∥2 + κ2 √(s log(p)/n) ( ∥β_{S0}∥1/√s + ∥β_{S0^c}∥1/√s )
  ≤ ∥X β∥2 + κ2 √(s log(p)/n) ( ∥β_{S0}∥2 + ∥β_{S0^c}∥1/√s ).   (3.3)
It follows from (3.3) that the measurement matrix X satisfies the sparse approximation property with

c1 = 1 / (κ1 − κ2 √(s log(p)/n))   and   c2 = κ2 √(s log(p)/n) / (κ1 − κ2 √(s log(p)/n)).   (3.4)

The condition in (3.2) guarantees that the constants c1 and c2 defined in (3.4) are positive.
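The constants in (3.4) are straightforward to evaluate numerically. The sketch below uses hypothetical values κ1 = 1/4 and κ2 = 9 (the standard Gaussian case ρ1(Σ) = ρ2(Σ) = 1) and checks that once n exceeds the strengthened threshold (2κ2/κ1)² s log p, the constants satisfy c1 > 0 and 0 < c2 < 1.

```python
import math

def sap_constants(kappa1, kappa2, s, n, p):
    """Constants c1, c2 of the sparse approximation property, as in (3.4)."""
    t = kappa2 * math.sqrt(s * math.log(p) / n)  # kappa2 sqrt(s log(p)/n)
    denom = kappa1 - t
    if denom <= 0:
        raise ValueError("condition (3.2) fails: need n > (kappa2/kappa1)^2 s log p")
    return 1.0 / denom, t / denom

# Hypothetical values: standard Gaussian case rho1 = rho2 = 1.
kappa1, kappa2 = 1.0 / 4.0, 9.0
s, p = 8, 256
n = int((2 * kappa2 / kappa1) ** 2 * s * math.log(p)) + 1  # strengthened condition
c1, c2 = sap_constants(kappa1, kappa2, s, n, p)
```

With the strengthened condition n > (2κ2/κ1)² s log p, the term κ2√(s log(p)/n) drops below κ1/2, which is exactly what forces c2 < 1.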
The sparse approximation property is a sufficient condition for the stable recovery of the unknown vector β∗ from a noisy observation by solving the basis pursuit linear program in [17,36]. If n > (2κ2/κ1)² s log p, then one can easily check that the constants c1 and c2 defined in (3.4) satisfy 0 < c1 and 0 < c2 < 1. As a direct consequence of Proposition 4 and Theorem 1, we have

Corollary 5. Suppose that a matrix XΣ ∈ Rn×p is generated by drawing each row of XΣ in Rp i.i.d. from an N(0, Σ) distribution. Let X = (1/√n) XΣ be the normalized measurement matrix. Consider the linear regression model (1.1) with n > (2κ2/κ1)² s log p. Let β∗s be the best s-sparse approximation of β∗. If the noise vector z is bounded by ∥z∥2 ≤ ε, then the minimizer β̂bp of the following basis pursuit linear program:

min_{β∈Rp} ∥β∥1   subject to   ∥X β − y∥2 ≤ ε,   (3.5)

satisfies

∥β̂bp − β∗∥2 ≤ C5 ε + C6 ∥β∗ − β∗s∥1/√s   (3.6)
with probability at least 1 − c3 exp(−c4 n), where C5 and C6 are positive constants depending on κ1, κ2, s and p.

Proof. By Proposition 4 and the condition n > (2κ2/κ1)² s log p, the measurement matrix X satisfies the sparse approximation property with the positive constants c1 and c2 given in (3.4) with probability at least 1 − c3 exp(−c4 n). [36, Theorem 1.1] implies that the minimizer β̂bp satisfies

∥β̂bp − β∗∥2 ≤ 2 (3 + c2) c1 / (1 − c2) ε + 2 (1 + c2²) / (1 − c2) · ∥β∗ − β∗s∥1/√s.

The condition n > (2κ2/κ1)² s log p implies 0 < c2 < 1. We conclude that (3.6) holds with positive constants C5 and C6 depending on κ1, κ2, s and p.

To illustrate the features of the sparse approximation property, we discuss two classes of random matrices in the following. Example 1 shows that the sparse approximation property can hold with high probability even when the restricted isometry property fails. Example 2 shows that the sparse approximation property can hold with fewer measurements than the restricted isometry property requires.

Example 1 ([13]). Let Σ be a diagonal matrix with decreasing positive entries Σj,j, j = 1, . . . , p. It is easy to check that the normalized measurement matrix X = (1/√n) XΣ satisfies the restricted isometry property with a RIP constant δs = δ > 0 if 1 − δ ≤ Σj,j ≤ 1 + δ for all j = 1, . . . , p. If Σ1,1 > 1, then the
restricted isometry property for X fails, while the random matrix X satisfies the sparse approximation property with the positive constants c1 and c2 given in (3.4) and

κ1 = √Σp,p / 4   and   κ2 = 9 √Σ1,1.
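Example 1 can be checked numerically. The sketch below takes a concrete decreasing diagonal (the values are hypothetical) in which one large variance makes any RIP-style bracketing 1 − δ ≤ Σj,j ≤ 1 + δ with δ < 1 impossible, while the constants κ1 = √Σp,p/4 and κ2 = 9√Σ1,1 remain perfectly well defined.

```python
import math

# Hypothetical decreasing diagonal of Sigma (not from the paper):
diag = [4.0, 2.5, 1.0, 0.9, 0.5]

# An RIP-style bracketing 1 - delta <= Sigma_jj <= 1 + delta requires
# delta >= max |Sigma_jj - 1|; here that forces delta >= 3, so no delta < 1 works.
delta_needed = max(abs(d - 1.0) for d in diag)

# The sparse approximation constants of Example 1 are still available:
kappa1 = math.sqrt(diag[-1]) / 4.0  # sqrt(Sigma_pp) / 4
kappa2 = 9.0 * math.sqrt(diag[0])   # 9 sqrt(Sigma_11)
```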
Example 2 ([31]). The covariance matrix Σ of a matrix XΣ is given by

Σ = (1 − c) Ip×p + c 1⃗ (1⃗)^T,   c ∈ (0, 1),

where Ip×p is the identity matrix and 1⃗ ∈ Rp is the column vector with all ones. It was shown in [31, Example 2] that the normalized measurement matrix X = (1/√n) XΣ satisfies the restricted isometry property with a RIP constant 0 < δs = c < 1/(s − 1). Since the minimum eigenvalue of Σ is 1 − c and the maximum variance is ρ2²(Σ) = 1, we have κ1 = √(1 − c)/4 and κ2 = 9. If the number of measurements obeys n = O(s log p), then Proposition 4 implies that with high probability the matrix X satisfies the sparse approximation property with the positive constants c1 and c2 given in (3.4).

Now we derive conditions for the stable recovery if a measurement matrix X satisfies the inequality in (3.1).

Theorem 6. Suppose that a matrix XΣ ∈ Rn×p is generated by drawing each row of XΣ in Rp i.i.d. from an N(0, Σ) distribution. Let X = (1/√n) XΣ be the normalized measurement matrix. If the measurement matrix X satisfies n > (2κ2/κ1)² s log p and if the regularization parameters λ1 and λ2 satisfy (2.3) for some

1 < ρ < (1 + c2)/(1 − c2)

with c2 being defined in (3.4), then with probability at least 1 − c3 exp(−c4 n) the elastic net estimator β̂ in (1.4) satisfies (2.4) and (2.5), where C3 and C4 are constants given in (2.6) with c1 and c2 being defined in (3.4) using κ1, κ2, s and p.

Proof. It follows from (3.4) that n > (2κ2/κ1)² s log p implies that the measurement matrix X satisfies the sparse approximation property with 0 < c1 and 0 < c2 < 1, where the constants c1 and c2 are defined in (3.4). The rest of the proof follows from Theorem 1.

The elastic net in (1.4) with λ2 = 0 reduces to the lasso. The following result follows directly from Theorem 6.

Corollary 7. Consider the linear regression model y = X β∗ + z, where the z[j] are i.i.d. N(0, σ²). Suppose that a measurement matrix X satisfies the condition in (3.1). Given the lasso with the regularization parameter λ1 = 4σ √(log p), then the lasso estimator β̂lasso satisfies
∥β̂lasso − β∗∥2 ≤ C3 √s λ1 + C4 ∥β∗ − β∗s∥1/√s

with probability at least 1 − c3 exp(−c4 n). Assume that the unknown vector β∗ belongs to some given weak lq space with weak lq norm M for some 0 < q < 1. Then the lasso estimator β̂lasso obeys

∥β̂lasso − β∗∥2² ≤ C7 (log p)(s σ² + M² s^{−2r}).   (3.7)

If #{1 ≤ j ≤ p : |β∗[j]| > σ} ≤ (M/σ)^q, then with the choice s ≥ (M/σ)^q, the estimate (3.7) becomes

∥β̂lasso − β∗∥2² ≤ C7 (log p) M^{2/(2r+1)} (σ²)^{2r/(2r+1)},   (3.8)

which is the minimax rate for the vector β∗ in the weak lq space.
Proof. The stability follows from Theorem 6 and (4.3). The stability, together with the estimate (2.25), implies (3.7) directly. The bound (3.8) follows from (3.7) and the choice s ≥ (M/σ)^q. It is proved in [10] that the bound (3.8) is the minimax rate for the vector β∗ in the weak lq space.

Suppose that a measurement matrix X satisfies the assumptions in Theorem 6. Then an error bound of the lasso estimator was given in [29, Section 4.3, Lemma 2]. In [29, Section 4.3, Lemma 2], the unknown vector β∗ is assumed to belong to some given lq ball with radius Rq such that Rq ((log p)/n)^{1/2 − q/4} ≤ 1. Corollary 7 shows that the stability of the lasso estimator holds for all vectors β∗ in Rp. A similar bound for the Dantzig selector is given in [26].

4. Numerical experiments

In this section we provide two numerical experiments to demonstrate some relationships between the two regularization parameters λ1, λ2 in the elastic net and their impact on stability. Since there are two parameters λ1, λ2 in the elastic net, we have to find good choices of these two parameters on the two-dimensional region [0, ∞)² for a good approximation to the true vector/signal β∗. Many well-established methods for choosing such tuning regularization parameters can be found in [20]. We now outline some main principles for choosing the regularization parameters λ1 and λ2 according to our theoretical results in Theorem 1 on stability and in Theorem 2 on the grouping effect of the elastic net (1.4). According to our stability results in Theorem 1, the two regularization parameters λ1 and λ2 should satisfy the condition in (2.3), which can be equivalently expressed as
λ1 = c (∥X T z ∥∞ + λ2 ∥β∗ ∥∞ ) with c > 1,
(4.1)
where c = ρ > 1. On one hand, it is natural that the regularization parameter λ1 should also be selected according to the sparsity level s of the true vector/signal β∗. More precisely, we need to choose a larger regularization parameter λ1 in the elastic net (1.4) when the true vector/signal β∗ is sparser. In the form of (4.1), this simply means that it is not a good idea to choose the constant c extremely small for a sparse vector/signal β∗. On the other hand, our stability results for the elastic net in (2.4) and (2.5) in Theorem 1 favor a small parameter λ1, since the estimated error in our stability result grows with λ1. Since a common choice in the literature for ρ in (2.3) is ρ = 2, a natural choice for us is to take c = 2. As we shall see later in our two numerical experiments, setting the theoretically predicted value c = 2 for the choice of λ1 in (4.1) is often good enough. According to our theoretical result in (2.22) on the grouping effect of the elastic net in Theorem 2, plugging in the choice of λ1 in (4.1), which is predicted by our stability results for the elastic net, we see that (2.22) becomes
|β̂[j] − β̂[k]| ≤ ∥Xj − Xk∥2 [ 2(1 + 1/ρ) c c1 √s ∥β∗∥∞ + 2(1 + 1/ρ) c c1 ( √s ∥X^T z∥∞ + (2ρ/((ρ + 1) c1)) ( ∥β∗ − β∗s∥1/√s + ε ) ) / λ2 ].   (4.2)
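The right-hand side of (4.2) is easy to tabulate as a function of λ2. The sketch below plugs in hypothetical values for c, ρ, c1, s, the noise term and the approximation error, and exhibits the behavior discussed next: the bound decreases monotonically in λ2 but levels off at the λ2-independent first term.

```python
import math

def grouping_bound(lam2, c=2.0, rho=2.0, c1=8.0, s=8,
                   beta_inf=1.0, xtz_inf=0.1665, tail=0.0, eps=0.0,
                   col_gap=1.0):
    """Right-hand side of (4.2); all numeric values are hypothetical."""
    a = 2.0 * (1.0 + 1.0 / rho) * c * c1
    first = a * math.sqrt(s) * beta_inf  # lambda2-independent term
    second = a * (math.sqrt(s) * xtz_inf
                  + (2.0 * rho / ((rho + 1.0) * c1)) * (tail / math.sqrt(s) + eps))
    return col_gap * (first + second / lam2)

bounds = [grouping_bound(l2) for l2 in (0.01, 0.1, 1.0, 10.0, 100.0)]
# The bound decreases in lambda2 but cannot go below the first term:
floor = 2.0 * (1.0 + 1.0 / 2.0) * 2.0 * 8.0 * math.sqrt(8)  # a sqrt(s) beta_inf
```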
If many columns of the measurement matrix X are highly correlated (that is, ∥Xj − Xk∥2 ≈ 0 for many pairs (j, k)), it is natural to choose a large regularization parameter λ2 to emphasize the grouping effect of the elastic net (1.4). Note that the term 2(1 + 1/ρ) c c1 √s ∥β∗∥∞ on the right-hand side of (4.2) does not depend on the regularization parameter λ2, and

lim_{λ2→∞} [ 2(1 + 1/ρ) c c1 √s ∥β∗∥∞ + 2(1 + 1/ρ) c c1 ( √s ∥X^T z∥∞ + (2ρ/((ρ + 1) c1)) ( ∥β∗ − β∗s∥1/√s + ε ) ) / λ2 ]
  = 2(1 + 1/ρ) c c1 √s ∥β∗∥∞ ≠ 0.

The above limit indicates that a very large λ2 does not significantly improve the grouping effect, while (4.1) tells us that a very large λ2 forces the parameter λ1 to be too large regardless of the sparsity of the true vector/signal β∗. As a consequence, even for a measurement matrix X having many highly correlated columns, it is more appropriate to choose a moderately large λ2 for the grouping effect of the elastic net (1.4). If a measurement matrix X has very few highly correlated columns (that is, most ∥Xj − Xk∥2 are relatively large), it is more meaningful to choose a regularization parameter λ2 ≈ 0 so that we have a wide range for the parameter λ1. The estimate for the grouping effect of the elastic net in (4.2) also indicates that the choice of λ2 should be linked to the sparsity level s of the true vector/signal β∗. For properties of the elastic net compared to the lasso, we refer to [40].

To demonstrate the impact of the two regularization parameters λ1 and λ2, we first pick one parameter from a set of a few values ranging from extremely small to relatively large. Then we let the other parameter vary in an interval and observe the change in the performance of the elastic net, measured by the error defined below:
Err = log10( ∥β̂ − β∗∥2 / ∥β∗∥2 ),

where β̂ is the elastic net estimator and β∗ is the true vector. In our two numerical experiments, we consider measurement matrices X which contain highly correlated columns. Therefore, the parameter λ2 is always positive. The structure of the measurement matrix X is similar to the example in [40]. We choose eight indices j1, j2, . . . , j8 from the index set {1, . . . , 256} uniformly at random. An 8-sparse vector β∗ is generated as follows:

β∗[j] = 1 if j ∈ {j1, . . . , j4},   β∗[j] = 1 if j ∈ {j5, . . . , j8},   β∗[j] = 0 if j ∈ {1, . . . , 256}\{j1, . . . , j8}.
The measurement matrix X of size 72 × 256 is generated as follows:

Xj1 = Z1 + η1,  Xj2 = Z1 + η2,  Xj3 = Z1 + η3,  Xj4 = Z1 + η4,  with Z1 ∼ N(0, 1),
Xj5 = Z2 + η5,  Xj6 = Z2 + η6,  Xj7 = Z2 + η7,  Xj8 = Z2 + η8,  with Z2 ∼ N(0, 1),

and Xj, j ∈ {1, . . . , 256}\{j1, . . . , j8}, are independent and identically distributed random variables obeying N(0, 1), where η1, . . . , η8 are independent and identically distributed random variables obeying N(0, 0.01). The columns of the measurement matrix X are normalized to have unit l2 norm. For this measurement matrix X, there are two equally important groups. Each group has four highly correlated columns. We recover an 8-sparse vector β∗ from y = X β∗ + z, where z[j] ∼ N(0, σ²) for j = 1, . . . , n with standard deviation σ. All the figures are obtained as an average over 50 simulations.
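The measurement design above can be reproduced with the standard library alone. The sketch below is a minimal re-implementation (the seed and the column-wise storage are our own choices): it builds the 72 × 256 matrix with two correlated groups, normalizes each column to unit l2 norm, and forms noiseless observations y = X β∗.

```python
import math
import random

rng = random.Random(0)
n, p, sigma_eta = 72, 256, 0.1  # eta_i ~ N(0, 0.01), i.e. standard deviation 0.1

idx = rng.sample(range(p), 8)   # j1, ..., j8
beta_star = [0.0] * p
for j in idx:
    beta_star[j] = 1.0          # 8-sparse true vector

Z1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
Z2 = [rng.gauss(0.0, 1.0) for _ in range(n)]

cols = []
for j in range(p):
    if j in idx[:4]:
        col = [z + rng.gauss(0.0, sigma_eta) for z in Z1]  # first correlated group
    elif j in idx[4:]:
        col = [z + rng.gauss(0.0, sigma_eta) for z in Z2]  # second correlated group
    else:
        col = [rng.gauss(0.0, 1.0) for _ in range(n)]      # independent column
    norm = math.sqrt(sum(v * v for v in col))
    cols.append([v / norm for v in col])                   # unit l2 norm

# Noiseless observations y = X beta_star (X stored column-wise):
y = [sum(cols[j][i] * beta_star[j] for j in range(p)) for i in range(n)]
```

With unit-norm columns, the inner product of two columns in the same group is close to 1, which is the source of the grouping effect studied here.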
4.1. Numerical experiment without noise

For this numerical experiment, we consider the noiseless case (that is, z = 0) by selecting the parameter λ1 first. We first pick λ1 from the set {0.0005, 0.005, 0.05}. Since z = 0 and ∥β∗∥∞ = 1, according to (4.1), we have λ2 = λ1/c for some c > 1. As a consequence, for better comparison, we select λ2 from an interval [0, 4λ1] centering around the theoretically predicted value λ2 = λ1/2 with the commonly adopted choice c = ρ = 2. The errors in terms of Err with different choices of λ2 are presented in Fig. 2. The first graph in Fig. 2 shows that when λ1 is extremely small, the elastic net (1.4) fails to recover the
[Fig. 2 appears here: three panels plotting Err against λ2 for λ1 = 0.0005, λ1 = 0.005 and λ1 = 0.05.]

Fig. 2. Noiseless case. The error Err is measured by log10(∥β̂ − β∗∥2/∥β∗∥2). Green squares indicate the special theoretical value λ2 = λ1/2 with the commonly adopted c = ρ = 2 in (4.1). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
true sparse vector/signal β∗ at all, regardless of the choice of the parameter λ2. The second and third graphs in Fig. 2 show that when the parameter λ1 is not that small but the parameter λ2 is extremely small (that is, λ2 ≈ 0), the elastic net (1.4) does not recover the true sparse vector/signal β∗ very well. This is mainly because there are two groups of highly correlated columns in the measurement matrix X, and consequently the parameter λ2 should not be extremely small. The formula in (4.1) with the constant c = 2 gives us the theoretical choice λ2 = λ1/2. That is, (4.1) with the constant c = 2 provides the theoretical choices of the parameters λ1 and λ2 as follows:

λ2 = λ1/2 = 0.0025 if λ1 = 0.005,   and   λ2 = λ1/2 = 0.025 if λ1 = 0.05.

Both the second graph with the theoretically predicted choice (λ1 = 0.005, λ2 = 0.0025) and the third graph with the theoretically predicted choice (λ1 = 0.05, λ2 = 0.025) in Fig. 2 indeed show that their corresponding errors in terms of Err are quite close to the smallest optimal error, which is obtained by an exhaustive search over all values of the parameter λ2 in the interval [0, 4λ1]. The second and third graphs in Fig. 2 also show that as λ2 becomes larger and larger, the error in terms of Err becomes slightly larger and larger. This is mainly because, as we explained before, too large a λ2 no longer significantly improves the grouping effect of the elastic net, while according to (4.1) a large λ2 requires a large λ1 to recover a sparse vector/signal β∗. In other words, since λ1 is fixed here while λ2 varies, too large a λ2 no longer contributes much to the grouping effect but makes the parameter λ1 relatively less important (that is, relatively small), which deteriorates its ability for
recovering sparse vectors/signals. A similar phenomenon can be found in the numerical experiments in [23].

4.2. Numerical experiment with noise

In our second numerical experiment, the standard deviation σ of the white Gaussian noise is set to be 0.05. We first pick λ2 from the set {0.001, 0.01, 0.1} of values ranging from extremely small to relatively large. Note that the columns of the 72 × 256 measurement matrix X are normalized to have unit l2 norm. Since the entries of the noise vector z are Gaussian random variables, a standard bound on the Gaussian tail probability shows that
∥X^T z∥∞ ≤ σ √(2 log p)   (4.3)

holds with overwhelming probability [10]. Together with ∥β∗∥∞ = 1, the theoretical choice of λ1 in (4.1) is set to be

λ1 = c (σ √(2 log(256)) + λ2)   with c > 1.   (4.4)
For better comparison, we test the performance for the parameter λ1 varying in the interval (0, 4(σ√(2 log(256)) + λ2)]. All three graphs in Fig. 3 indicate that when the parameter λ1 is extremely small (that is, λ1 ≈ 0), the elastic net (1.4) fails to find the sparse vector β∗, regardless of the choice of the parameter λ2 for the grouping effect. Since the true vector/signal β∗ is sparse, the parameter λ1 cannot be too small for recovering a sparse vector/signal β∗. According to the theoretical choice of the parameter λ1 in (4.4) with the commonly adopted constant c = 2 and with σ = 0.05, we have

λ1 = 0.335 if λ2 = 0.001,   λ1 = 0.353 if λ2 = 0.01,   λ1 = 0.533 if λ2 = 0.1.   (4.5)

All three graphs in Fig. 3 indicate that their corresponding errors in terms of Err for the above theoretical choices of (λ1, λ2) are quite close to the smallest optimal error, which is obtained by an exhaustive search over all values of the parameter λ1 in the interval (0, 4(σ√(2 log(256)) + λ2)]. As shown by the first graph of Fig. 3, if the parameter λ2 is extremely small, the performance of the elastic net is not good enough regardless of the choice of the parameter λ1 (even though the theoretical choice of λ1 still achieves an error reasonably close to the heuristically obtained smallest error). This is mainly because there are two groups of highly correlated columns in the measurement matrix X. We lose the grouping effect of the elastic net by using an extremely small λ2, and for extremely small λ2, the elastic net (1.4) is close to the lasso. It is a known phenomenon that the lasso does not handle highly correlated measurement matrices well (see [40]). The third graph in Fig. 3 is quite similar to the second graph in Fig. 3 (up to an appropriate scaling of the parameter λ1). As we explained before, too large a parameter λ2 no longer contributes much to the grouping effect of the elastic net but deteriorates its ability to recover sparse vectors/signals by making the parameter λ1 relatively less important (that is, relatively small).
It is reasonable to see gaps between the optimal λ1 and the theoretical choices (4.5) in Fig. 3. It was suggested in [40] that we can find the parameter λ1 by cross validation once the parameter λ2 is given. Therefore, we can focus on finding an appropriate parameter λ1 in an interval instead of [0, ∞). From all the graphs in Figs. 2 and 3 (except the first graph of Fig. 2), we see that the errors in terms of Err first decrease and then increase. When λ1 is extremely small, the condition in (2.3) fails and the elastic net fails to recover the sparse vector/signal β∗. Since the true vector/signal β∗ is indeed sparse in our experiments, it is reasonable that the error decreases when λ1 increases in a small interval near zero, so that the elastic net starts to capture the sparse vector/signal β∗. As λ1 becomes larger and larger, the condition in (2.3) is surely satisfied. Now, according to the stability results on the elastic net, the error bounds in (2.4) and (2.5) grow linearly with λ1. Consequently, as predicted by our theoretical result on stability in Theorem 1, the errors will indeed increase as
[Fig. 3 appears here: three panels plotting Err against λ1 for λ2 = 0.001, λ2 = 0.01 and λ2 = 0.1.]

Fig. 3. Noisy case. The error Err is measured by log10(∥β̂ − β∗∥2/∥β∗∥2). Green squares indicate the special theoretical value λ1 = c(σ√(2 log 256) + λ2) given by (4.5) with the commonly adopted c = ρ = 2 in (4.1). Note that here we used (4.3) (i.e., ∥X^T z∥∞ ≤ σ√(2 log p) holds with overwhelming probability). However, it may be more appropriate to use E(∥X^T z∥∞), which could be much smaller than σ√(2 log p) with p = 256, leading to a much smaller λ1. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
λ1 → ∞. Another interpretation of such increasing behavior of the errors as λ1 → ∞ is that the elastic net estimator is biased by the soft thresholding operator [16]. This explains the decreasing-then-increasing behavior of all the graphs in Figs. 2 and 3.

Since the two regularization parameters λ1 and λ2 live in the two-dimensional region [0, ∞)², the error of the elastic net measured by Err as a function of (λ1, λ2) may have many local minima. Currently there is no known method that can find the optimal parameters (λ1, λ2) yielding the smallest possible error in terms of Err. Despite the fact that finding the optimal parameters λ1 and λ2 with the smallest possible error is generally known to be a very challenging task (e.g., see [12,41]), our numerical experiments show that our theoretical results provide a reasonably good choice for the two regularization parameters λ1 and λ2, according to (4.1) for stability and (4.2) for the grouping effect of the elastic net (1.4).

In summary, we suggest the following way of choosing the regularization parameters λ1 and λ2. If a measurement matrix X does not have many correlated columns, we simply take λ2 to be zero (this reduces the elastic net to the lasso) or a very small value. If a measurement matrix X does have many highly correlated columns, then it is a good idea to first choose λ2 to be neither too small nor too large. More precisely, λ2 should be moderately large, depending on the magnitude of the numerator above λ2 in (4.2), which involves the sparsity level s, the noise term ∥X^T z∥∞ and the best s-sparse approximation error ∥β∗ − β∗s∥1 in the l1 norm. Then we can find an appropriate parameter λ1 in an interval such as

(0, 4(∥X^T z∥∞ + λ2 ∥β∗∥∞)]
by cross validation. Though the true vector/signal β∗, the sparsity level s and the quantity ∥X^T z∥∞ are unknown, as we discussed before, there are some numerical ways to roughly estimate such quantities in advance [10,25,27]. Those estimates can then be used to set the regularization parameters λ1 and λ2, according to (4.1), based on our theoretical result for stability, and (4.2), based on our theoretical result for the grouping effect of the elastic net (1.4).

Acknowledgments

The authors would like to thank the anonymous reviewers for providing useful comments and additional references, which helped them to improve this revised manuscript.

References

[1] R. Adamczak, A.E. Litvak, A. Pajor, N. Tomczak-Jaegermann, Restricted isometry property of matrices with independent columns and neighborly polytopes by random sampling, Constr. Approx. 34 (2011) 61–88.
[2] R. Baraniuk, M. Davenport, R. DeVore, M. Wakin, A simple proof of the restricted isometry property for random matrices, Constr. Approx. 28 (2008) 253–263.
[3] P.J. Bickel, Y. Ritov, A.B. Tsybakov, Simultaneous analysis of lasso and Dantzig selector, Ann. Statist. (2009) 1705–1732.
[4] J. Bourgain, S.J. Dilworth, K. Ford, S. Konyagin, D. Kutzarova, Explicit constructions of RIP matrices and related problems, Duke Math. J. 159 (2011) 145–185.
[5] K. Bredies, D.A. Lorenz, Linear convergence of iterative soft-thresholding, J. Fourier Anal. Appl. 14 (2008) 813–837.
[6] E.J. Candès, Y. Plan, Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements, IEEE Trans. Inform. Theory 57 (2011) 2342–2359.
[7] E. Candès, J. Romberg, T. Tao, Stable signal recovery from incomplete and inaccurate measurements, Comm. Pure Appl. Math. 59 (2005) 1207–1223.
[8] E.J. Candès, T. Tao, Decoding by linear programming, IEEE Trans. Inform. Theory 51 (2005) 4203–4215.
[9] E.J. Candès, T. Tao, Near-optimal signal recovery from random projections: universal encoding strategies?, IEEE Trans. Inform. Theory 52 (2006) 5406–5425.
[10] E. Candès, T. Tao, The Dantzig selector: statistical estimation when p is much larger than n, Ann. Statist. (2007) 2313–2351.
[11] Y. De Castro, A remark on the lasso and the Dantzig selector, Statist. Probab. Lett. 83 (1) (2013) 304–314.
[12] C.-A. Deledalle, S. Vaiter, J. Fadili, G. Peyré, Stein Unbiased GrAdient estimator of the Risk (SUGAR) for multiple parameter selection, 2014.
[13] E. Dobriban, J. Fan, Regularity properties of high-dimensional covariate matrices, 2013, arXiv preprint arXiv:1305.5198.
[14] D.L. Donoho, Compressed sensing, IEEE Trans. Inform. Theory 52 (2006) 1289–1306.
[15] D.L. Donoho, X. Huo, Uncertainty principles and ideal atomic decomposition, IEEE Trans. Inform. Theory 47 (2001) 2845–2862.
[16] J. Fan, R. Li, Variable selection via nonconcave penalized likelihood and its oracle properties, J. Amer. Statist. Assoc. 96 (456) (2001) 1348–1360.
[17] S. Foucart, Stability and robustness of l1-minimizations with Weibull matrices and redundant dictionaries, Linear Algebra Appl. 441 (2014) 4–21.
[18] S. Foucart, H. Rauhut, A Mathematical Introduction to Compressive Sensing, Springer, 2013.
[19] M. Grasmair, M. Haltmeier, O. Scherzer, Sparse regularization with lq penalty term, Inverse Problems 24 (5) (2008) 055020.
[20] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer, New York, 2001.
[21] J. Haupt, W.U. Bajwa, G. Raz, R. Nowak, Toeplitz compressed sensing matrices with applications to sparse channel estimation, IEEE Trans. Inform. Theory 56 (2010) 5862–5875.
[22] A.E. Hoerl, R.W. Kennard, Ridge regression: applications to nonorthogonal problems, Technometrics 12 (1970) 69–82.
[23] B. Jin, D.A. Lorenz, S. Schiffler, Elastic-net regularization: error estimates and active set methods, Inverse Problems 25 (2009) 115022.
[24] M. Kang, S. Yun, H. Woo, M. Kang, Accelerated Bregman method for linearly constrained l1–l2 minimization, J. Sci. Comput. 56 (2013) 515–534.
[25] M.-J. Lai, W. Yin, Augmented l1 and nuclear-norm models with a globally linearly convergent algorithm, SIAM J. Imaging Sci. 6 (2013) 1059–1091.
[26] T.L. Li, Y. Shen, Z. Qin, Stable signals recovery from a minimal number of noisy Weibull random measurements, manuscript.
[27] M.E. Lopes, Estimating unknown sparsity in compressed sensing, 2012, arXiv:1204.4227.
[28] C. De Mol, E. De Vito, L. Rosasco, Elastic-net regularization in learning theory, J. Complexity 25 (2009) 201–230.
[29] S.N. Negahban, P. Ravikumar, M.J. Wainwright, B. Yu, A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers, Statist. Sci. 27 (2012) 538–557.
[30] G.E. Pfander, H. Rauhut, J.A. Tropp, The restricted isometry property for time–frequency structured random matrices, Probab. Theory Related Fields 156 (2013) 707–737.
[31] G. Raskutti, M.J. Wainwright, B. Yu, Restricted eigenvalue properties for correlated Gaussian designs, J. Mach. Learn. Res. 11 (2010) 2241–2259.
[32] J. Romberg, Compressive sensing by random convolution, SIAM J. Imaging Sci. 2 (2009) 1098–1128.
[33] M. Rudelson, R. Vershynin, On sparse reconstruction from Fourier and Gaussian measurements, Comm. Pure Appl. Math. 61 (2008) 1025–1045.
[34] Y. Shen, B. Han, E. Braverman, Stable recovery of analysis based approaches, Appl. Comput. Harmon. Anal. 39 (2015) 161–172.
[35] Y. Shen, B. Han, E. Braverman, Image inpainting using directional tensor product complex tight framelets, 2014, arXiv preprint arXiv:1407.3234.
[36] Q. Sun, Sparse approximation property and stable recovery of sparse signals from noisy measurements, IEEE Trans. Signal Process. 59 (2011) 5086–5090.
[37] R. Tibshirani, Regression shrinkage and selection via the lasso, J. R. Stat. Soc. Ser. B Stat. Methodol. (1996) 267–288.
[38] V. Umanità, S. Villa, Elastic-net regularization: iterative algorithms and asymptotic behavior of solutions, Numer. Funct. Anal. Optim. 31 (12) (2010) 1406–1432.
[39] D.X. Zhou, On grouping effect of elastic net, Statist. Probab. Lett. 83 (2013) 2108–2112.
[40] H. Zou, T. Hastie, Regularization and variable selection via the elastic net, J. R. Stat. Soc. Ser. B Stat. Methodol. 67 (2005) 301–320.
[41] H. Zou, T. Hastie, R. Tibshirani, On the degrees of freedom of the lasso, Ann. Statist. 35 (2007) 2173–2192.