Digital Signal Processing 83 (2018) 165–179
Vector minimax concave penalty for sparse representation

Shibin Wang a,b, Xuefeng Chen a,*, Weiwei Dai b, Ivan W. Selesnick b, Gaigai Cai b,c, Benjamin Cowen b

a State Key Laboratory for Manufacturing Systems Engineering, Xi'an Jiaotong University, Xi'an 710049, PR China
b Department of Electrical and Computer Engineering, Tandon School of Engineering, New York University, New York 11201, USA
c Key Laboratory of Ministry of Education for Electronic Equipment Structure Design, Xidian University, Xi'an 710071, PR China
* Corresponding author.
Article history: Available online 10 September 2018.
Keywords: Nonconvex sparsity regularization; Minimax-concave penalty; CEL0; Sparse representation.
https://doi.org/10.1016/j.dsp.2018.08.021
Abstract

This paper proposes the vector minimax concave (VMC) penalty for sparse representation using tools of the Moreau envelope. The VMC penalty is a weighted MC function; by fine tuning the weight of the VMC penalty with a given strategy, the VMC regularized least squares problem shares the same global minimizers with the L0 regularization problem but has fewer local minima. Facilitated by the alternating direction method of multipliers (ADMM), the VMC regularization problem can be tackled as a sequence of convex sub-problems, each of which can be solved fast. Theoretical analysis of ADMM shows that the convergence of solving the VMC regularization problem is guaranteed. We present a series of numerical experiments demonstrating the superior performance of the VMC penalty and the ADMM algorithm in broad applications for sparse representation, including sparse denoising, sparse deconvolution, and missing data estimation.
1. Introduction

Formulations of sparsity regularization have attracted a great deal of attention in recent years, aiming to find sparse solutions or sparse approximate solutions of linear systems. In a typical setup, an unknown sparse vector x ∈ R^N is reconstructed from measurements y = Ax, or more generally, from noisy measurements y = Ax + ε, where A ∈ R^{M×N} (commonly M < N) is a measurement matrix and ε represents the observation noise. A general way to model the sparse regularized least squares problem is
F(x) = (1/2)‖y − Ax‖_2^2 + λψ(x)   (1)
where λ > 0 is a regularization parameter to balance the data fidelity and sparsity, and ψ is a sparsity-inducing penalty, including the L0 quasi-norm ‖x‖_0 (the number of nonzero elements of x), the L1 norm ‖x‖_1 [1–4], and other nonconvex penalties. Compared to convex regularization, nonconvex sparse regularization can further enhance sparsity, and can provide more accurate solutions than convex regularization [5,6]. Thus, nonconvex sparse regularization has been widely used for compressive sensing and signal processing [7,8]. Some typical examples of nonconvex penalties include
the log penalty [9,10], the atan penalty [11,12], the Lp quasi-norm (0 < p < 1) [13,14], the minimax concave (MC) penalty [15–17], the continuous exact L0 (CEL0) penalty [18,19], and the seamless L0 (SEL0) penalty [20]. This paper proposes a parameterized nonconvex penalty named the vector minimax concave (VMC) penalty for sparse representation, which is realized by introducing a weight vector into the original MC penalty. The main contributions of the paper are threefold. First, the VMC penalty is a weighted MC penalty; inspired by the CEL0 penalty (a recently proposed nonconvex penalty we will review later in this section), it is easy to tune the weight vector such that the VMC regularized least squares problem shares the same global minimizers with the following L0 regularization problem but has fewer local minima:
min_x F_L0(x) = (1/2)‖y − Ax‖_2^2 + λ‖x‖_0.   (2)
Second, facilitated by the alternating direction method of multipliers (ADMM), the nonconvex VMC regularization problem can be tackled as a sequence of convex sub-problems, each of which can be solved fast. Theoretical analysis of ADMM shows that convergence to a stationary point is guaranteed for solving the nonconvex VMC regularization problem. Third, we numerically demonstrate the superior performance of the VMC penalty and the ADMM algorithm in broad applications for sparse representation, i.e.,
sparse denoising, sparse deconvolution, and missing data estimation. This paper is organized as follows. Section 2 reviews the MC penalty and the CEL0 penalty, and unveils the relationship between these two penalty functions. In Section 3, we define the VMC penalty and provide some properties. We also describe the connection and difference between the VMC penalty and the CEL0 penalty for sharing the global and local minima of the VMC regularization problem and the L 0 regularization problem. In Section 4, we introduce an algorithm to solve the VMC regularized least squares problem, and analyze the convergence of the algorithm. In Section 5, we introduce an algorithm for minimizing the VMC penalty with equality constraint, which is useful for the noise-free case. Section 6 presents simulation experiments wherein the VMC penalty is used for sparse representation.
1.1. Related work

A direct way to model the sparse prior is the L0 quasi-norm ‖x‖_0, and the L0 regularized least squares problem can be modelled as (2) [1,2]. In general, the L0 problem is intractable to solve, especially for large dimension N, because this problem is known to be NP-hard. One effective alternative is to relax the L0 problem by other sparsity-inducing functions, and a widely used approach is to relax the L0 quasi-norm by the L1 norm, i.e.,

min_x F_L1(x) = (1/2)‖y − Ax‖_2^2 + λ‖x‖_1.   (3)

This is known as basis pursuit denoising [21] and LASSO [22]. The L1 problem (3) is a convex optimization problem, and therefore we can take advantage of extensive theories dedicated to convex problems [23]. Thus, the L1 relaxation became popular and has been widely used in various applications [24,25]. However, researchers have shown that there is a difference between the L1 relaxation and the L0 quasi-norm for high-amplitude components of x ∈ R^N, which usually leads to a biased estimate; thus L1 solutions are suboptimal when applied to certain applications [9,26].

In order to enhance the sparsity, many nonconvex regularization methods have been proposed. Among nonconvex penalties, we are particularly interested in those that provide sparse, stable and nearly unbiased solutions [18]. These penalties are singular at the origin (to produce sparse solutions), continuous (to produce stable solutions) and bounded by a constant (to produce nearly unbiased estimates for large coefficients). The early attempt by Fan and Li defined necessary conditions to satisfy singularity, continuity, and unbiasedness, and proposed the smoothly clipped absolute deviation (SCAD) penalty [27]. In the same spirit, Zhang proposed the minimax concave (MC) penalty [15,16], whose proximal operator is firm thresholding [28], which is a continuous, piecewise-linear approximation of hard thresholding [29]. CEL0 is a nonconvex penalty proposed recently by Soubies et al.; it is also nonconvex and satisfies these conditions [18]. The main advantage of the CEL0 regularization problem is that it shares the same global minimizers with the L0 regularization problem (2) but has fewer local minima. Smoothed L0 (SL0) proposed by Mohimani et al. is another nonconvex penalty satisfying these important conditions, which uses a class of Gaussian functions to approximate the L0 quasi-norm [30–32]. Experimental results show that SL0 is about two to three orders of magnitude faster than interior-point linear programming solvers while providing the same accuracy.

Extensive theory has been developed to study the problem (1) when F is convex, such as the forward-backward splitting (FBS) algorithm, ADMM, and majorization-minimization (MM) methods. Unfortunately, for a general matrix A, a separable nonconvex penalty ψ generally leads to a nonconvex regularized function F, thus necessitating nonconvex optimization methods. However, theoretical analysis has shown that FBS algorithms can still converge to a stationary point, including variants such as iterative half thresholding [15,16,33] and iterative p-shrinkage [34,35]. The convergence of the iterative thresholding algorithm has also been studied for a special class of nonconvex regularization problems, i.e., weakly convex penalties [17,36]. Nonconvex ADMM is another important algorithm for solving nonconvex regularization problems, and it has been found to not only work in some applications but often exhibit exemplary performance [37–39]. Theoretical analysis of nonconvex ADMM has shown that convergence to a stationary point can be guaranteed for some special nonconvex problems [40–44].

1.2. Notation

The Lp norm of x ∈ R^N is defined as ‖x‖_p = (Σ_n |x_n|^p)^{1/p} for p > 0 (including the quasi p-norm for 0 < p < 1). The vector a_n ∈ R^M is the n-th column of the matrix A ∈ R^{M×N}. The symbol λmax(A) (resp. λmin(A)) denotes the maximum (resp. minimum) eigenvalue of the matrix A. The symbol diag(A) for a matrix A ∈ R^{N×N} denotes the vector a = (α_11, α_22, …, α_NN)^T composed of the diagonal elements of the matrix A (α_ii being the i-th diagonal element of A), and the symbol diag(a) for a vector a = (α_1, α_2, …, α_N)^T denotes the diagonal matrix A ∈ R^{N×N} with the elements α_1, α_2, …, α_N on the diagonal and zeroes elsewhere.

Let T > 0. The hard thresholding function is defined as
hard(x; T) = { 0,   |x| ≤ T
               x,   |x| > T       (4)
The soft thresholding function is defined as

soft(x; T) = { 0,                    |x| ≤ T
               sign(x)(|x| − T),     |x| > T       (5)
where sign(·) is the signum function for a real input and sign(x) = x/|x| for a complex input. The firm thresholding function is a generalization of hard and soft thresholding [29]. Let T2 > T1 > 0. The firm thresholding function is defined as
firm(x; T1, T2) = { 0,                                  |x| ≤ T1
                    sign(x) T2(|x| − T1)/(T2 − T1),     T1 < |x| ≤ T2
                    x,                                  |x| > T2       (6)
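These three threshold functions are simple elementwise maps. The following sketch (an illustration added here, not part of the original paper; real-valued inputs are assumed) implements (4)–(6) directly with NumPy.

```python
import numpy as np

def hard(x, T):
    """Hard thresholding (4): keep entries with |x| > T, zero the rest."""
    return np.where(np.abs(x) > T, x, 0.0)

def soft(x, T):
    """Soft thresholding (5): shrink magnitudes by T, zero entries with |x| <= T."""
    return np.sign(x) * np.maximum(np.abs(x) - T, 0.0)

def firm(x, T1, T2):
    """Firm thresholding (6), T2 > T1 > 0: zero below T1, identity above T2,
    linear interpolation in between."""
    out = np.where(np.abs(x) <= T2,
                   np.sign(x) * T2 * (np.abs(x) - T1) / (T2 - T1),
                   x)
    return np.where(np.abs(x) <= T1, 0.0, out)

x = np.linspace(-3, 3, 13)
print(hard(x, 1.0))
print(soft(x, 1.0))
print(firm(x, 1.0, 2.0))
```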
The proximal operator prox_f : R^N → R^N of a function f is defined by prox_f(x) = arg min_v ( f(v) + (1/2)‖x − v‖_2^2 ).

2. Review of MC penalty and CEL0 penalty

In this section, we review the definitions and properties of the MC penalty and the CEL0 penalty.

2.1. MC penalty

The MC penalty is a nonconvex penalty proposed for variable selection in high-dimensional linear regression.

Definition 2.1 (MC penalty). The MC penalty φMC with parameters
λ ∈ R+ and γ ∈ R+ is defined as

φMC(x; λ, γ) = λ ∫_0^{|x|} (1 − z/(γλ))_+ dz = { λ|x| − x^2/(2γ),   |x| ≤ λγ
                                                 (1/2)γλ^2,         |x| > λγ       (7)
with (u)_+ = max{0, u}. The multivariable MC penalty ψMC is defined as

ψMC(x; λ, γ) = Σ_{n=1}^N φMC(x_n; λ, γ).   (8)

For the MC penalty regularized least squares problem

min_x F(x) = (1/2)‖y − x‖_2^2 + ψMC(x; λ, γ),   (9)

the function F is convex if γ ≥ 1, and the proximal operator is the firm thresholding:

proxψMC(y) = arg min_x { (1/2)‖y − x‖_2^2 + ψMC(x; λ, γ) } = firm(y; λ, λγ).   (10)

2.2. CEL0 penalty

The CEL0 penalty was proposed to continuously approximate the L0 quasi-norm, leading to a tight continuous relaxation of the L0 regularization problem (2).

Definition 2.2 (CEL0 penalty). For the sparse formulation (1), the CEL0 penalty ψCEL0 with parameters λ ∈ R+ and A ∈ R^{M×N} is defined as

ψCEL0(x; λ, A) = Σ_{n=1}^N φCEL0(x_n; λ, ‖a_n‖_2),   (11)

where the scalar CEL0 penalty is defined as

φCEL0(x; λ, α) = { λ − (1/2)α^2 (|x| − √(2λ)/α)^2,   |x| ≤ √(2λ)/α
                   λ,                                 |x| > √(2λ)/α

And the CEL0 regularization problem is

min_x F_CEL0(x) = (1/2)‖y − Ax‖_2^2 + ψCEL0(x; λ, A).   (12)

Theorem 2.1 ([18]). Let y ∈ R^M, λ ∈ R+, and a_λ = (√(2λ)/‖a_1‖_2, √(2λ)/‖a_2‖_2, …, √(2λ)/‖a_N‖_2)^T. (i) The set of global minimizers of F_L0(x) defined in (2) is included in the set of global minimizers of F_CEL0(x) defined in (12), i.e., arg min_x{F_L0} ⊆ arg min_x{F_CEL0}. (ii) Conversely, if x̂ ∈ R^N is a global minimizer of F_CEL0(x), the hard thresholding result x̂_0 = hard(x̂, a_λ) is a global minimizer of F_L0(x), and F_CEL0(x̂) = F_CEL0(x̂_0) = F_L0(x̂_0). (iii) If F_CEL0(x) has a local (not necessarily global) minimum at x̂ ∈ R^N, the hard thresholding result x̂_0 = hard(x̂, a_λ) is a local minimizer of F_L0(x), and F_CEL0(x̂) = F_CEL0(x̂_0) = F_L0(x̂_0).

As stated in Ref. [18], the main advantage of the CEL0 regularization is that the function F_CEL0(x) shares the global and local minima with the function F_L0(x) via the hard thresholding. However, not all the local minima of the function F_L0(x) are local minima of the function F_CEL0(x). Thus, if we minimize F_CEL0(x) instead of F_L0(x), part of the local minima can be eliminated and the global minimum of F_L0(x) is preserved [18].

We consider a specific matrix A whose columns are orthogonal, i.e., A^T A = α^2 I. In this case, we set the MC penalty ψMC(x; λ, γ) with the parameters λ = √(2λ_0) α (λ_0 being the regularization parameter of F_L0(x) in the model (2)) and γ = 1/α^2. For the MC penalty with these parameters, we have

ψMC(x; √(2λ_0) α, 1/α^2) = ψCEL0(x; λ_0, A).   (13)

Thus, the MC penalty regularized least squares problem also shares the global minimum and local minima with the L0 problem (2) but has fewer local minima. However, for a general matrix A, we cannot set the scalar parameters λ and γ such that MC and CEL0 are equal to each other unless we use two non-uniform parameter vectors to replace the scalar parameters λ and γ in the original MC penalty. However, this freedom makes it more complex to set the parameters.

3. Vector minimax concave penalty

In this section, we define the VMC penalty as a generalization of the MC penalty and the CEL0 penalty using the Moreau envelope. The VMC penalty is a weighted MC function, and inspired by the CEL0 penalty, the VMC regularization problem shares the same global minimizers with the L0 regularization problem (2) but has fewer local minima after tuning the weight vector of the VMC penalty.

3.1. Moreau envelope and MC penalty

The Moreau envelope was recently used to define a new multivariable Huber function, leading to new penalties for sparse representation and optimization [45–48]. The VMC penalty will also be defined using the Moreau envelope.

Definition 3.1 (Moreau envelope). The Moreau envelope of the function f with parameter γ is defined by

f_M(x) = inf_v { f(v) + (1/(2γ))‖x − v‖_2^2 }.   (14)

Definition 3.2 (Huber function). The Huber function s with parameter γ ∈ R+ is defined as

s(x; γ) = { (1/(2γ)) x^2,    |x| ≤ γ
            |x| − (1/2)γ,    |x| > γ       (15)

According to the definition, the Huber function s can be written as

s(x; γ) = min_v { |v| + (1/(2γ))(x − v)^2 },   (16)

and the scalar MC penalty function φMC can be written as

φMC(x; λ, γ) = λ|x| − λ s(x; λγ) = λ|x| − min_v { λ|v| + (1/(2γ))(x − v)^2 }.   (17)

The multivariable Huber function S is

S(x; γ) = Σ_{n=1}^N s(x_n; γ) = min_v { ‖v‖_1 + (1/(2γ))‖x − v‖_2^2 },   (18)

i.e., the function S with parameter γ is the Moreau envelope of the function ‖·‖_1 with parameter γ. The multivariable MC penalty ψMC can be written as

ψMC(x; λ, γ) = λ‖x‖_1 − λ S(x; λγ) = λ‖x‖_1 − min_v { λ‖v‖_1 + (1/(2γ))‖x − v‖_2^2 }.   (19)
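As a quick numerical check of the identity underlying (7) and (17): the scalar MC penalty can be evaluated either from its closed form or as λ|x| minus λ times the Huber function with parameter λγ. The sketch below (added for illustration; not from the paper) confirms the two expressions agree.

```python
import numpy as np

def huber(x, gamma):
    """Huber function s(x; gamma) of (15)."""
    ax = np.abs(x)
    return np.where(ax <= gamma, x**2 / (2 * gamma), ax - gamma / 2)

def mc_closed_form(x, lam, gamma):
    """Scalar MC penalty phi_MC(x; lam, gamma) from (7)."""
    ax = np.abs(x)
    return np.where(ax <= lam * gamma,
                    lam * ax - x**2 / (2 * gamma),
                    0.5 * gamma * lam**2)

def mc_via_moreau(x, lam, gamma):
    """Same penalty written as lam*|x| - lam*s(x; lam*gamma), cf. (17)."""
    return lam * np.abs(x) - lam * huber(x, lam * gamma)

x = np.linspace(-5, 5, 1001)
lam, gamma = 1.5, 2.0
assert np.allclose(mc_closed_form(x, lam, gamma), mc_via_moreau(x, lam, gamma))
```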
3.2. VMC penalty

Definition 3.3 (VMC penalty). Let the regularization parameter λ ∈ R+, the convexity parameter γ ∈ R+, and the weight vector w = (ω_1, ω_2, …, ω_N)^T ∈ R+^N. The VMC penalty in the scalar form is defined as

φVMC(x_n; λ, γ, ω_n) = λ|ω_n x_n| − λ s(ω_n x_n; λγ)
                     = λ|ω_n x_n| − min_v { λ|v| + (1/(2γ))(ω_n x_n − v)^2 },

and the multivariable VMC penalty ψVMC is defined as

ψVMC(x; λ, γ, w) = Σ_{n=1}^N φVMC(x_n; λ, γ, ω_n)
                 = λ‖Wx‖_1 − min_v { λ‖v‖_1 + (1/(2γ))‖Wx − v‖_2^2 },   (20)

where W = diag(w).

Remark. According to the definition of the VMC penalty, the relation between the VMC and MC penalties is

ψVMC(x; λ, γ, w) = ψMC(Wx; λ, γ) = λ‖Wx‖_1 − λ S(Wx; λγ).   (21)

Thus, the VMC penalty is a weighted MC penalty. Compared to the original MC penalty, the VMC penalty has two scalar parameters (the regularization parameter λ and the convexity parameter γ) and a weight parameter vector w. The advantage of introducing the weight parameter vector w will be discussed in the next subsection; in this subsection, we first discuss the properties of the VMC penalty.

The next proposition shows the relation between the VMC penalty and the L0 quasi-norm ‖x‖_0. In order to simplify the presentation, we focus on a special VMC penalty, which has a uniform weight ω for each element x_n ∈ R, i.e., W = ωI.

Proposition 3.1. For ω ∈ R+, the VMC penalty satisfies

lim_{ω→∞} ψVMC(x; λ, γ, ω) = (1/2)γλ^2 ‖x‖_0,
lim_{ω→∞} ψVMC(x; √2, 1, ω) = ‖x‖_0.

The next proposition shows that the VMC function is bounded.

Proposition 3.2. For w = (ω_1, ω_2, …, ω_N)^T ∈ R+^N and W = diag(w), the VMC penalty satisfies

0 ≤ ψVMC(x; λ, γ, w) ≤ min{ λ‖Wx‖_1, (1/2)γλ^2 N },   (22)
lim_{γ→∞} ψVMC(x; λ, γ, w) = λ‖Wx‖_1.   (23)

3.3. Vector minimax concave regularization

In this subsection, we will use the VMC penalty as the sparse regularizer in the problem (1).

Definition 3.4 (VMC regularization). Let y ∈ R^M, A ∈ R^{M×N}, λ ∈ R+, γ ∈ R+, and w = (ω_1, ω_2, …, ω_N)^T ∈ R+^N. Define F : R^N → R as

F_VMC(x) = (1/2)‖y − Ax‖_2^2 + ψVMC(x; λ, γ, w),   (24)

where ψVMC : R^N → R is the VMC penalty defined by (20).

The following proposition focuses on the convexity condition of the VMC regularization and a special case where the columns of the matrix A are orthogonal, i.e., A^T A = (diag(w))^2 with w = (ω_1, ω_2, …, ω_N)^T.

Proposition 3.3. The cost function F_VMC in (24) is convex if

A^T A − (1/γ) W^T W ⪰ 0.

In particular, when A^T A = (diag(w))^2 with w = (ω_1, ω_2, …, ω_N)^T, this condition is equivalent to γ ≥ 1.

Remark. For a matrix A with orthogonal columns, the function F_VMC is convex if γ ≥ 1. However, for the underdetermined problem, since λmin(A^T A) = 0, the matrix A^T A − (1/γ)W^T W is not positive semidefinite for γ ∈ R+ and the function F_VMC in (24) is nonconvex. Then nonconvex optimization methods can be used to solve the VMC regularization problem, which will be analyzed in the next section.

Now, we discuss the connection and difference between the VMC penalty and the CEL0 penalty. Inspired by the CEL0 penalty, we consider w_A = (diag(A^T A))^{1/2}; then we distinguish three cases concerning the choice of the parameter γ.

(i) When γ = 1 and λ = √(2λ_0) (with λ_0 being the regularization parameter of F_L0 in (2)), the VMC penalty reduces to the CEL0 penalty,

ψVMC(x; √(2λ_0), 1, w_A) = ψCEL0(x; λ_0, A),   (25)

and thus the associated relaxation F_VMC shares the global minimum and local minima with the function F_L0(x) via the hard thresholding x̂_0 = hard(x̂, √(2λ_0)/w_A). It means that minimizing F_VMC(x) instead of F_L0(x) can eliminate part of the local minima and preserve the global minimum of F_L0 [18,19].

(ii) When γ < 1 and λ = √(2λ_0/γ), according to the necessary and sufficient conditions (Equation (46)) derived by Soubies et al. in the reference [19], we get that in order to make the VMC relaxation an exact relaxation (i.e., eliminate part of the local minima and preserve the global minimum of F_L0), we have to set

λ^2 γ / 2 = λ_0,  and  λγ/ω_i ≤ √(2λ_0)/α_i,  and  −ω_i^2/γ < −α_i^2,  for i = 1, 2, …, N,   (26)

where α_i > 0 denotes the norm of the i-th column of A associated to the variable x_i. The choice w = (diag(A^T A))^{1/2} makes ω_i = α_i, and the above conditions reduce to

λ = √(2λ_0/γ),  and  γ < 1,   (27)

which means that when γ < 1 and λ = √(2λ_0/γ), the VMC regularization function F_VMC is also an exact relaxation of F_L0, i.e., it preserves global minima and may eliminate part of the local minima.

(iii) When γ > 1, the VMC regularization function F_VMC is not an exact relaxation of F_L0, i.e., it does not preserve global minima but introduces new minima.

From this point of view, the CEL0 penalty is a special case of the VMC penalty (i.e., when VMC sets the parameters as γ = 1, λ = √(2λ_0), and w_A = (diag(A^T A))^{1/2}). We now consider two toy examples to illustrate these situations for the L0 problem and the VMC regularization problem.
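For concreteness, a minimal sketch (added for illustration; function names are ours) of evaluating the VMC penalty (20) with the CEL0-inspired weight w_A = (diag(A^T A))^{1/2}, using the identity ψVMC(x; λ, γ, w) = λ‖Wx‖_1 − λS(Wx; λγ) from (21); the matrix A and λ_0 below are those of the first toy example.

```python
import numpy as np

def huber(x, gamma):
    """Elementwise Huber function s(.; gamma) of (15)."""
    ax = np.abs(x)
    return np.where(ax <= gamma, x**2 / (2 * gamma), ax - gamma / 2)

def vmc_penalty(x, lam, gamma, w):
    """VMC penalty (20)-(21): psi_VMC(x) = lam*||Wx||_1 - lam*S(Wx; lam*gamma)."""
    wx = w * x                       # W = diag(w), so Wx is an elementwise product
    return lam * np.sum(np.abs(wx)) - lam * np.sum(huber(wx, lam * gamma))

# CEL0-inspired weight: w_A = (diag(A^T A))^{1/2}, i.e. the column norms of A
A = np.array([[1.0, 0.5], [1.0, -0.5]])
w_A = np.sqrt(np.diag(A.T @ A))

lam0 = 1.0                           # regularization parameter of the L0 problem (2)
lam, gamma = np.sqrt(2 * lam0), 1.0  # case (i): VMC reduces to the CEL0 penalty
x = np.array([0.3, -2.0])
print(vmc_penalty(x, lam, gamma, w_A))
```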
Fig. 1. Contour plots of the functions F_L0 and F_VMC for two examples. (a)–(d) A = [1, 0.5; 1, −0.5], y = [1, 2], and λ_0 = 1; (e)–(h) A = [3, 2; 1, 3], y = [1, 1], and λ_0 = 0.5. Circle points are local minimizers, and square points are global minimizers.
Fig. 1 presents the contours of both F_VMC in (24) and F_L0 in (2) for different A, y, and λ_0 > 0 (the regularization parameter for the L0 problem). For the first case, i.e., A = [1, 0.5; 1, −0.5], y = [1, 2], and λ_0 = 1, when we set γ = 1 and λ = √(2λ_0) for VMC, because the columns of the matrix A are orthogonal, as stated by Proposition 3.3, the function F_VMC with γ = 1 is convex. Moreover, in this example, F_L0 has a unique global minimizer, which is thus also the unique global minimizer of the convex F_VMC (γ = 1), as shown in Fig. 1(b). When we set γ < 1 (in this case γ = 0.5) and λ = √(2λ_0/γ) for VMC, as shown in Fig. 1(c), the global minimum of the function F_L0 can also be preserved by F_VMC; this phenomenon was studied by Soubies et al. to discuss the property of the MC penalty [19]. When we set γ > 1 (in this case γ = 3), as shown in Fig. 1(d), as stated by Proposition 3.3, the function F_VMC with γ = 3 is also convex. However, the global minima of F_VMC and F_L0 have different positions.

For the second case, i.e., A = [3, 2; 1, 3], y = [1, 1], and λ_0 = 0.5, when we set γ = 1 and λ = √(2λ_0) for VMC, even though the columns of the matrix A are not orthogonal and the function F_VMC with γ = 1 is not convex, the global minimum and local minima of F_L0 are also preserved by F_VMC, as shown in Fig. 1(f). When we set γ < 1 (in this case γ = 0.2) and λ = √(2λ_0/γ) for VMC, the global minimum and local minima of the function F_L0 can also be preserved by F_VMC, as shown in Fig. 1(g). It should be noted that there exists a local maximum around (0.1, 0.1) in Fig. 1(g). When we set γ > 1 (in this case γ = 3), as shown in Fig. 1(h), the global minima of F_VMC and F_L0 have different positions.

Even though the VMC regularized function F_VMC with γ = 1 (λ = √(2λ_0)) and γ < 1 (λ = √(2λ_0/γ)) can preserve the global minima and may eliminate part of the local minima of F_L0, comparing the results for γ = 1 and γ < 1, it can be found that a smaller value of γ (toward γ = 0) leads to more local minima being preserved, as shown in Fig. 1(c) and Fig. 1(f). If γ > 1, the global minimum and local minima of the function F_L0 are not preserved by F_VMC, as shown in Fig. 1(d) and Fig. 1(h).

4. VMC regularization for the noisy case and its ADMM solver

In this section, we focus on using ADMM to solve the VMC regularization problem for noisy cases.
For the nonconvex optimization problem

min_x F_VMC(x) = (1/2)‖y − Ax‖_2^2 + ψVMC(x; λ, γ, w),   (28)

applying variable splitting to (28) yields

min_{x,z} { f_1(x) + f_2(z) }   s.t.   x = z   (29)

with

f_1(x) = ψVMC(x; λ, γ, w),   f_2(z) = (1/2)‖y − Az‖_2^2.   (30)

The augmented Lagrangian function for the optimization problem (29) is given by

L_β(x, z, u) = f_1(x) + f_2(z) + ⟨u, x − z⟩ + (β/2)‖x − z‖_2^2,   (31)
where β > 0 is a constant representing the primal penalty parameter. Then ADMM can be used to solve the nonconvex optimization problem (29), as detailed in Algorithm 1.

Algorithm 1 ADMM.
1: Initialize z^(0), u^(0), x^(0), w = (diag(A^T A))^{1/2}.
2: Convergence condition: β > √2 λmax(A^T A).
3: Convexity condition: γ > (max{w})^2 / β.
4: while the stopping criterion is not satisfied do
5:   Update the primal variables:
6:     x^(k+1) = arg min_x L_β(x, z^(k), u^(k)),
7:     z^(k+1) = arg min_z L_β(x^(k+1), z, u^(k)),
8:   Update the dual variables:
9:     u^(k+1) = u^(k) + β(x^(k+1) − z^(k+1)),
10:  k ← k + 1,
11: end while
12: return x^(k)

The initialization condition β > √2 λmax(A^T A) is set for the convergence of ADMM, which will be analyzed later. Before we discuss the algorithm and its convergence, we focus on a special sub-problem min_x { (β/2)‖y − x‖_2^2 + ψVMC(x; λ, γ, w) } and the convexity condition γ > (max{w})^2/β.
Proposition 4.1 (Weakly convex). For w = (ω_1, ω_2, …, ω_N)^T ∈ R+^N, let w_max = max{w}. The VMC penalty ψVMC : R^N → R is (w_max^2/γ)-weakly convex, where a function g : R^N → R is said to be ρ-weakly convex if f(x) = (β/2)‖x‖_2^2 + g(x) is convex when β ≥ ρ ≥ 0 [36].

Proposition 4.2. Let y ∈ R^N, λ > 0, and w = (ω_1, ω_2, …, ω_N)^T ∈ R+^N, w_max = max{w}. Define f : R^N → R as

f(x) = (β/2)‖y − x‖_2^2 + ψVMC(x; λ, γ, w),   (32)

where ψVMC is the VMC function defined by (20). The function f is convex if γ ≥ w_max^2/β. Moreover, the minimizer of the convex function in (32) is given by

x_opt = arg min_x { (β/2)‖y − x‖_2^2 + ψVMC(x; λ, γ, w) } = firm(y; λw/β, λγ/w).   (33)

Proposition 4.2 shows that although the cost function in (28) is nonconvex, both functions to be minimized in the primal variable updates (i.e., the x-update and the z-update) in Algorithm 1 are convex if the convexity condition is satisfied. According to Proposition 4.2, the x-update in Algorithm 1 can be calculated as follows:

x^(k+1) = arg min_x L_β(x, z^(k), u^(k))
        = arg min_x { ψVMC(x; λ, γ, w) + (β/2)‖x − z^(k) + (1/β)u^(k)‖_2^2 }
        = firm(z^(k) − (1/β)u^(k); λw/β, λγ/w).   (34)

The z-update in Algorithm 1 is as follows:

z^(k+1) = arg min_z L_β(x^(k+1), z, u^(k))
        = arg min_z { (1/2)‖y − Az‖_2^2 + (β/2)‖x^(k+1) − z + (1/β)u^(k)‖_2^2 },   (35)

which is a strictly convex quadratic function, thus the minimization can be expressed as

z^(k+1) = (A^T A + βI)^{-1} (A^T y + βx^(k+1) + u^(k)).   (36)

According to the matrix inverse lemma, we have

(A^T A + βI)^{-1} = (1/β)( I − A^T (A A^T + βI)^{-1} A ).   (37)

Then the z-update equation in Algorithm 1 becomes:

z^(k+1) = (x^(k+1) + (1/β)u^(k)) + A^T (A A^T + βI)^{-1} [ y − A(x^(k+1) + (1/β)u^(k)) ].   (38)

Moreover, if the columns of A form a tight frame, i.e., A satisfies A A^T = αI with α > 0, then

z^(k+1) = (x^(k+1) + (1/β)u^(k)) + (1/(α + β)) A^T [ y − A(x^(k+1) + (1/β)u^(k)) ].   (39)

The stopping criterion in Algorithm 1 can be set as in the standard ADMM [49]. That is, the primal and dual residuals are

r^(k+1) = x^(k+1) − z^(k+1),   s^(k+1) = z^(k+1) − z^(k),   (40)

respectively, which must be small, i.e.,

‖r^(k)‖_2 ≤ ε^prim,  and  ‖s^(k)‖_2 ≤ ε^dual.   (41)

We next study the convergence of Algorithm 1. The convergence analysis for the nonconvex ADMM has been studied recently, and theoretical analysis has shown that convergence to a stationary point can be guaranteed for some special nonconvex problems [40–44]. Our main result is then the following:

Theorem 4.1. For the nonconvex optimization problem (28), Algorithm 1 converges for any β satisfying β > √2 λmax(A^T A). That is, starting from x^(0), z^(0), u^(0), it generates a sequence that is bounded, has at least one limit point (x*, z*, u*), and any limit point (x*, z*, u*) is a stationary point of L_β(x, z, u), namely 0 ∈ ∂L_β(x*, z*, u*). In particular, x* is a stationary point of the problem (28).

Remark. The proof of the theorem basically follows the procedure used in [40,42]. Thus, we only provide the main lemmas for the proof.

Lemma 4.1 (Sufficient descent). The sequence (x^(k), z^(k), u^(k)) in Algorithm 1 satisfies

L_β(x^(k+1), z^(k+1), u^(k+1)) − L_β(x^(k), z^(k), u^(k)) ≤ ( λmax^2(A^T A)/β − β/2 ) ‖z^(k+1) − z^(k)‖_2^2 ≤ 0   (42)

if β satisfies β > √2 λmax(A^T A).

Lemma 4.2 (Boundedness). The sequence (x^(k), z^(k), u^(k)) in Algorithm 1 satisfies that L_β(x^(k), z^(k), u^(k)) is bounded and (x^(k), z^(k), u^(k)) is bounded.

Lemma 4.3 (Subgradient boundedness). There exist C(β) > 0 and d^(k+1) ∈ ∂L_β(x^(k+1), z^(k+1), u^(k+1)) such that

‖d^(k+1)‖_2 ≤ C(β) ‖z^(k+1) − z^(k)‖_2.   (43)

Lemma 4.4 (Residual convergence). The sequence (x^(k), z^(k), u^(k)) in Algorithm 1 satisfies

lim_{k→∞} ‖s^(k)‖_2 = 0,   lim_{k→∞} ‖r^(k)‖_2 = 0,   (44)

lim_{k→∞} ( ‖x^(k+1) − x^(k)‖_2 + ‖z^(k+1) − z^(k)‖_2 + ‖u^(k+1) − u^(k)‖_2 ) = 0,   (45)

where the primal residual r^(k) and the dual residual s^(k) are defined in (40).

Lemma 4.5 (Global subsequence convergence). Suppose that Algorithm 1 with β > √2 λmax(A^T A) is applied to the problem (28). Then, its sequence (x^(k), z^(k), u^(k)) has at least one limit point (x*, z*, u*), and any limit point (x*, z*, u*) is a stationary point of L_β(x, z, u). That is, 0 ∈ ∂L_β(x*, z*, u*), or equivalently,

−u* ∈ ∂f_1(x*),   ∇f_2(z*) = u*,   x* = z*.   (46)

In particular, x* is a stationary point of the problem (28).

5. A variational approach for the equality constraint and its ADMM solver

In this section, we introduce the VMC penalty for the equality constraint:

min_x ψVMC(x; λ, γ, w)   s.t.   Ax = y.   (47)
This problem is similar to basis pursuit [21]. By applying the variable splitting technique, we obtain an equivalent optimization problem:
min_{x,z} ψVMC(x; λ, γ, w)   s.t.   Az = y,  x = z.   (48)
Using the "partly" augmented Lagrangian

L_β(x, z, u) = ψVMC(x; λ, γ, w) + ⟨u, x − z⟩ + (β/2)‖x − z‖_2^2 + ⟨u, Az − y⟩,   (49)

and using ADMM to solve the problem, we obtain Algorithm 2.

Algorithm 2 ADMM for equality constraint.
1: Initialize z^(0), u^(0), x^(0).
2: Convexity condition: γ > (max{w})^2 / β.
3: while the stopping criterion is not satisfied do
4:   x^(k+1) = arg min_x { ψVMC(x; λ, γ, w) + (β/2)‖x − z^(k) + (1/β)u^(k)‖_2^2 },
5:   z^(k+1) = arg min_z ‖x^(k+1) − z + (1/β)u^(k)‖_2^2   s.t.   Az = y,
6:   u^(k+1) = u^(k) + β(x^(k+1) − z^(k+1)),
7:   k ← k + 1,
8: end while
9: return z^(k)

Similar to Algorithm 1, the minimization with respect to x (the x-update) can be written explicitly in terms of the firm thresholding in (34). The minimization with respect to z (the z-update) is a constrained least squares problem which admits an explicit solution in terms of matrix inverses if A A^T is invertible:

z^(k+1) = (x^(k+1) + (1/β)u^(k)) + A^T (A A^T)^{-1} [ y − A(x^(k+1) + (1/β)u^(k)) ].   (50)

Note that, because of the equality constraint, the variable z satisfies Az = y at each iteration in Algorithm 2.
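The z-update of Algorithm 2 is an affine projection onto {z : Az = y}; a small sketch of the closed form (50) follows (illustrative, assuming A A^T is invertible; names are ours).

```python
import numpy as np

def project_onto_Az_eq_y(v, A, y):
    """Closed form (50): the point closest to v satisfying Az = y,
    z = v + A^T (A A^T)^{-1} (y - A v), assuming A A^T is invertible."""
    return v + A.T @ np.linalg.solve(A @ A.T, y - A @ v)

# inside Algorithm 2 the argument is v = x^(k+1) + u^(k)/beta
rng = np.random.default_rng(1)
A = rng.standard_normal((5, 12)); y = rng.standard_normal(5)
z = project_onto_Az_eq_y(rng.standard_normal(12), A, y)
print(np.allclose(A @ z, y))   # True: the constraint holds exactly
```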
6. Numerical simulation

In this section, we apply the VMC penalty and the ADMM algorithm to sparse denoising, sparse deconvolution, and missing data estimation. The first two cases are considered for the noisy situation, in which sparse denoising is for underdetermined systems (i.e., A ∈ R^{M×N} with M < N is a wide matrix) and sparse deconvolution is for overdetermined systems (i.e., A ∈ R^{M×N} with M > N is a tall matrix). In these two cases, the ADMM algorithm in Section 4 is used to solve the nonconvex optimization problems. The last case is considered for the noise-free situation, which is also for underdetermined systems; in this case, the ADMM algorithm in Section 5 is used.

For comparison purposes, we consider both convex (L1 norm) and nonconvex sparsity regularizers for these three kinds of numerical simulations, such as the iterative half thresholding algorithm (Half) [13,14] and iterative p-shrinkage (IPS) [34,35]. The Half algorithm uses the L1/2 norm regularization to enhance the sparsity, which is also a nonconvex optimization problem. The IPS algorithm is an iterative thresholding algorithm designed to locally minimize a nonconvex objective function, which was found to be particularly effective in comparison to several other algorithms [45,50]. Moreover, for sparse denoising and deconvolution, we consider the multivariate sparse regularization (MUSR), a recently proposed nonconvex regularization solved by convex optimization. For missing data estimation, we also consider the smoothed-L0 (SL0) [30–32] and the reweighted L1 (RwL1) algorithm [9]. SL0 approximates the L0 quasi-norm using a class of Gaussian functions, and it can provide an effective sparse approximate solution in the noise-free or low-noise situation. The RwL1 algorithm consists of solving a sequence of weighted L1-minimization problems where the weights used for the next iteration are computed from the value of the previous solution. In this paper, the reweighted strategy is

ω_i^(l+1) = { 1,   |x_i^(l)| < T_1
              0,   |x_i^(l)| ≥ T_1       (51)
which corresponds to the minimization of the capped-ℓ1 penalty instead of the minimization of the log-sum penalty used in Ref. [9]. Moreover, in this paper, we use the root-mean-squared error (RMSE) to evaluate the performance of different methods, which is defined as
RMSE = √( (1/M) Σ_{m=1}^M ( y(m) − ŷ(m) )^2 ),
where y is the noise-free signal, ŷ is its estimation, and M is its length.

6.1. Sparse representation for denoising

This example illustrates the use of the VMC penalty and the ADMM algorithm for sparse denoising. In this case, we consider the discrete-time signal
y_0(m) = 2cos(2π f_1 m) + sin(2π f_2 m),  m = 0, …, M − 1,

of length M = 100 with frequencies f_1 = 0.1 and f_2 = 0.22. The signal y_0 is sparse in the frequency domain, so we model the signal as y_0 = Ax, where A is an overcomplete inverse discrete Fourier transform and x ∈ C^N is a sparse vector of Fourier coefficients with N > M. Specifically, we define the matrix A ∈ C^{M×N} as A_{m,n} = (1/√N) exp(j(2π/N)mn), m = 0, …, M − 1, n = 0, …, N − 1, and we use N = 256. The columns of A form a normalized tight frame, i.e., A A^H = I. The operator A is implemented as a truncated inverse FFT and the operator A^H is implemented as a zero-padded FFT. For this denoising experiment, we corrupt the signal with additive white Gaussian noise (AWGN) with standard deviation σ = 1, i.e., y = y_0 + ε with ε ∼ N(0, 1). Fig. 2 shows the waveform of the noise-free signal y_0 and the noisy signal y, together with the solutions obtained by the different methods for one noise realization, and Fig. 3 shows the optimized sparse Fourier coefficients of these methods.

In this case, we set the weight parameter w of the VMC function as w_A = (ω_{a1}, ω_{a2}, …, ω_{aN})^T = (diag(A^T A))^{1/2}. Thus, the weight parameter for each element of x is uniform, i.e., ω_n = √(M/N) for n = 1, …, N. Moreover, we set the parameter γ = 1; thus, the VMC penalty in this case is equivalent to the CEL0 penalty. Furthermore, we set the parameter λ for each method so as to minimize the average root-mean-square error (RMSE) over 200 realizations (the detailed information for the parameter setting will be discussed later). This leads to the values λ = 0.9 for the L1 norm, λ = 3.4 for Half, λ = 1.6 for IPS (an additional parameter for IPS is p, which is set to p = −0.5 as recommended in the IPS reference [34]), λ = 1.4 for MUSR, and λ = 2.8 for VMC (β = 10). The comparison among the solutions demonstrates that the nonconvex regularization provides sparser Fourier coefficients and causes less underestimation than the L1 regularization, and the VMC regularization provides a more accurate estimation than the other nonconvex regularization methods.

Second, we discuss the parameter selection for the different methods. In this case, we vary λ from 0.5 to 6 (with increment 0.1), and evaluate the average RMSE over 200 noise realizations for the different methods. The average RMSE is shown in Fig. 4 as a function of λ for the different methods, and the aforementioned parameters for those methods correspond to the minimal average RMSE values here. Moreover, the average RMSE values corresponding to the chosen parameters are listed in Table 1. The comparison results verify that the VMC regularization compares favorably with these
Fig. 2. Sparse denoising using different methods (the light dashed line is the noise-free signal).
Fig. 3. Optimized Fourier coefficients by different methods. The plot shows only the non-zero values.
Fig. 5. RMSE values of 200 realizations of sparse denoising obtained by different methods.
Fig. 4. Average RMSE values of 200 realizations of different methods for sparse denoising.
Table 1
Average RMSE values and improvement by VMC regularization for sparse denoising.

                        L1 norm   MUSR     IPS      Half     VMC
Average RMSE            0.4044    0.3486   0.3465   0.3185   0.2762
Improvement by VMC      31.7%     20.8%    20.3%    13.3%    –
methods, including the two convex optimization methods, i.e., L1 and MUSR, and the two nonconvex methods, i.e., IPS and Half.

In order to further illustrate the robustness of VMC to noise, the RMSE of the VMC solution versus the RMSE of the other methods in each realization of noise is plotted as a single point in the scatter plots shown in Fig. 5. The scatter points below the diagonal line represent realizations where VMC provides a more accurate solution than the compared method, and the scatter points above the diagonal line represent realizations where VMC performs unfavorably. In each subfigure, the numbers K1 and K2 denote the number of scatter points below and above the diagonal line, respectively. For example, in the VMC versus L1 norm result, the counter K1 = 192 means that in 192 of 200 realizations VMC performs better than the L1 norm. The scatter plots show that VMC provides appreciable improvements over the other methods for sparse denoising.

The favorable behavior of VMC is maintained over a range of noise levels, as shown in Fig. 6, which shows the average RMSE as a function of σ for 0.25 ≤ σ ≤ 2.0. We calculate the average RMSE here using 200 noise realizations for each noise level. As shown in Fig. 6, the VMC regularization compares favorably to the other methods in this experiment, including the L1 norm, MUSR, IPS, and Half.
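For reference, a sketch (added here; not from the paper) of the dictionary and test signal used in this denoising experiment: an M × N truncated inverse DFT with M = 100 and N = 256, so that A A^H = I and every column norm equals √(M/N).

```python
import numpy as np

M, N = 100, 256
f1, f2, sigma = 0.1, 0.22, 1.0

m = np.arange(M)[:, None]; n = np.arange(N)[None, :]
A = np.exp(1j * 2 * np.pi * m * n / N) / np.sqrt(N)   # truncated inverse DFT, A A^H = I

t = np.arange(M)
y0 = 2 * np.cos(2 * np.pi * f1 * t) + np.sin(2 * np.pi * f2 * t)
rng = np.random.default_rng(0)
y = y0 + sigma * rng.standard_normal(M)               # noisy observation

w = np.sqrt(np.real(np.diag(A.conj().T @ A)))         # column norms of A
print(np.allclose(A @ A.conj().T, np.eye(M)))         # True (normalized tight frame)
print(np.allclose(w, np.sqrt(M / N)))                 # True (uniform weight)
```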
Fig. 6. Average RMSE values for different methods.
Fig. 7. Average RMSE values of 200 realizations of VMC regularization for sparse denoising.
Now, we analyze the effect of the convexity parameter γ on the performance of the method for sparse denoising. We consider the noisy signal with σ = 1. We vary λ from 0.5 to 6 (with increment 0.1) and γ from 0.25 to 2 (with increment 0.25), and evaluate the average RMSE over 200 noise realizations for the VMC regularization, for each λ and γ. The average RMSE is shown in Fig. 7 as a function of λ for each γ. It can be found that the change of the parameter γ hardly affects the minimal average RMSE and only affects the value of the parameter λ that attains the minimal average RMSE when γ ≤ 1. This is consistent with the phenomenon shown in Fig. 1 that the VMC regularization problem with γ ≤ 1 preserves the global minimum of the L0 problem but has fewer of its local minima, which was also studied by Soubies et al. to discuss the property of the MC penalty [19]. In particular, according to the relation λ = √(2λ_0/γ) when γ ≤ 1 discussed in Section 3, the smaller the value of the parameter γ, the larger the value of the RMSE-optimal parameter λ. Moreover, according to the relation lim_{γ→∞} ψVMC(x; λ, γ, w) = λ‖Wx‖_1 in Eq. (23), increasing the parameter γ leads to the VMC solutions behaving similarly to the L1 norm solution with the weight matrix W, as shown in Fig. 7. Furthermore, according to the relation in Eq. (25), we know that the CEL0 penalty (i.e., a special case of the VMC penalty when γ = 1) provides the same performance as the VMC penalty here.

6.2. Sparse deconvolution

This example illustrates the VMC regularization as applied to the sparse deconvolution problem, wherein the unknown sparse signal x is to be determined from data y = Ax + ε, where A represents convolution and ε is AWGN. We generate a sparse signal of length N = 200 with 10 non-zero values (uniformly distributed in value between 0 and 100, located at random positions). Fig. 8 illustrates one realization. We set the convolution operator A to be a 10-point moving-average filter, i.e., y = h ∗ x + ε where h(n) = 0.1 for n = 0, …, 9 and h(n) = 0 otherwise. We set the AWGN standard deviation to σ = 2. The observed signal y is shown in Fig. 8.
Fig. 8. Sparse deconvolution example.
Fig. 9. Average RMSE values of 200 realizations of different methods for sparse deconvolution.

Table 2
Average RMSE values and improvement by VMC regularization for sparse deconvolution.

                        L1 norm   MUSR   IPS    Half   VMC
Average RMSE            4.69      4.00   4.01   3.81   3.78
Improvement by VMC      19.4%     5.4%   5.7%   0.6%   –
In this case, we also set the weight parameter w of the VMC function as w_A = (ω_{a1}, ω_{a2}, …, ω_{aN})^T = (diag(A^T A))^{1/2}, and set the other parameters in each case so as to minimize the average RMSE over 200 realizations. The sparse signal estimated using VMC is illustrated in Fig. 8. The estimation results show that the VMC regularization largely avoids the underestimation of the L1 norm regularization.

Then we discuss the parameter selection for the different methods. In this case, we vary λ from 0.2 to 14 (with increment 0.2), and evaluate the average RMSE over 200 noise realizations for the different methods. The average RMSE is shown in Fig. 9, and the average RMSE values and the improvements by VMC over the other methods are shown in Table 2. The scatter plots of the RMSE of the 200 realizations are shown in Fig. 10. The comparison results show that VMC provides appreciable improvements over the L1 norm convex regularization, and also works better than the other nonconvex regularization methods, even though the improvement over them is not as large as that over the L1 norm regularization.

Now, we analyze the effect of the convexity parameter γ on the performance of the method for sparse deconvolution. We vary λ from 0.2 to 14 (with increment 0.2) and γ from 0.5 to 5, and evaluate the average RMSE over 200 noise realizations for the VMC regularization. The average RMSE is shown in Fig. 11. It can be found that the effect of the parameter γ is not similar to the case of sparse denoising in Fig. 7. In this case, the change of the parameter γ affects the minimal average RMSE, and γ = 3 behaves better than the other values. Moreover, increasing the parameter γ also leads the performance of VMC to behave similarly to the L1 norm solution with the weight matrix W. Furthermore, the CEL0 penalty (i.e., a special case of the VMC penalty when γ = 1) is not the best choice for sparse deconvolution, and the VMC penalty provides more accurate deconvolution results.
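A sketch of the deconvolution setup described above (illustrative; variable names are ours): a length-200 spike train with 10 nonzero values and a 10-point moving-average filter, stacked into a tall convolution matrix.

```python
import numpy as np

N, K, sigma = 200, 10, 2.0                     # signal length, number of spikes, noise level
h = np.full(10, 0.1)                           # 10-point moving-average filter, h(n) = 0.1
M = N + len(h) - 1                             # full convolution length (tall A, M > N)

# convolution matrix: column n of A is h shifted down by n samples
A = np.zeros((M, N))
for n in range(N):
    A[n:n + len(h), n] = h

rng = np.random.default_rng(0)
x0 = np.zeros(N)
pos = rng.choice(N, size=K, replace=False)     # random spike positions
x0[pos] = rng.uniform(0, 100, size=K)          # amplitudes uniform in (0, 100)
y = A @ x0 + sigma * rng.standard_normal(M)    # observation y = h * x0 + noise
```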
Fig. 10. RMSE values of 200 realizations of sparse deconvolution.
Fig. 11. Average RMSE values of 200 realizations of VMC regularization for sparse deconvolution.
6.3. Sparse representation for estimating missing signal samples

In this subsection, we consider the problem of estimating missing samples from incomplete signals. Suppose that of the N values of the signal x, only M are observed, M < N. The M-point incomplete signal y can be written as y = Sx, where S is a "selection" (or "sampling") matrix. Suppose that the signal x has a sparse representation with respect to F, i.e., x = Fc, and set A = SF; then the incomplete signal y can be modeled as y = Ac, and the problem of estimating the unknown signal x can be expressed as

ĉ = arg min_c ψ(c)   s.t.   y = Ac,   (52)

where ψ is a sparsity-inducing function. The estimated signal is then given by x̂ = Fĉ. In this case, since the missing data are noise-free, we use the equality-constrained optimization algorithm in Section 5 for estimating the missing data.

In this example, we consider sparse representation for estimating special missing samples from a speech signal due to clipping. Clipping occurs during audio recording when the amplitude of the source is too high for the recording equipment (i.e., saturation) [51]. Fig. 12 shows the waveform of the full speech signal with N = 512. We first consider the clipped signal with the threshold T being 0.15, for which there are 191 missing samples. Applying VMC and the other methods yields the results shown in Fig. 12, including the L1 norm, Half, IPS, SL0, and RwL1 (with the reweighted strategy in (51)). In this case, we also set the weight parameter w of the VMC function as w_A = (ω_{a1}, ω_{a2}, …, ω_{aN})^T = (diag(A^T A))^{1/2}, and choose the other parameters in each method so as to minimize the RMSE. The results of these methods show that only the VMC regularization and RwL1 can provide a reasonably accurate solution for estimating the clipped missing samples, and the other methods cannot
Fig. 12. Estimating of clipping data using different methods.
Fig. 13. RMSE values corresponding to different clipping threshold.
recover the missing samples. It is noted that, for the clipped missing data estimation, the L1 regularization can only "recover" missing samples lying between the clipping thresholds, and the other nonconvex regularization methods, such as IPS, Half, and SL0, cannot provide an accurate solution for this problem. In order to further test the effectiveness of these methods for clipped missing data estimation, different thresholds T are considered. Fig. 13 shows the RMSE as a function of T. It can be concluded that the VMC method compares favorably to the other methods in this experiment, including the L1 norm, Half, SL0, IPS, and RwL1.
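A sketch of the clipped-sample model used here (illustrative; the paper does not restate the dictionary F for this example, so an overcomplete inverse DFT as in Section 6.1 is assumed, and a synthetic stand-in signal replaces the speech waveform). The clipped positions define the missing samples, and the rows of S select the remaining observed samples.

```python
import numpy as np

N = 512                                       # full frame length
T = 0.15                                      # clipping threshold
rng = np.random.default_rng(0)
x_full = np.convolve(rng.standard_normal(N), np.ones(8) / 8, mode="same")  # stand-in signal
x_clip = np.clip(x_full, -T, T)               # what the recorder stores

observed = np.abs(x_full) < T                 # samples that were not clipped
S = np.eye(N)[observed]                       # selection matrix: y = S x_full
y = S @ x_full

# sparse model x = F c with an overcomplete inverse DFT (assumption, as in Section 6.1)
P = 1024
F = np.exp(1j * 2 * np.pi * np.arange(N)[:, None] * np.arange(P)[None, :] / P) / np.sqrt(P)
A = S @ F                                     # so y = A c, solved with Algorithm 2
print(A.shape, int(N - observed.sum()), "missing samples")
```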
7. Conclusion

In this paper, we propose the VMC penalty for sparse representation. After tuning the weight of the VMC penalty, the VMC regularized least squares problem shares the same global minimizers with the L0 regularization problem but has fewer local minima. Facilitated by ADMM, the VMC regularization problem can be tackled as a sequence of convex sub-problems, each of which can be solved fast. Theoretical analysis of ADMM shows that convergence to a stationary point is guaranteed for solving the VMC regularization problem. We present three kinds of experiments to demonstrate the superior performance and broad applicability of the VMC penalty and the algorithm for sparse representation, i.e., sparse denoising, sparse deconvolution, and missing data estimation. The proposed method compares favorably with the other methods, including the L1 norm, MUSR, IPS, Half, and SL0.

In this paper, we mainly focus on the performance of the penalty. For all the methods discussed in this paper, the regularization parameter λ is set based on the RMSE. However, the selection of a proper regularization parameter λ is a common and hard problem for sparse representation. In most general cases, a "trial and error" method, also called the cross-validation method, is still an accepted or even the only choice. Nevertheless, when some prior information (e.g., sparsity) is known for a problem, it is realistic to set the regularization parameter more reasonably [13,52]. We will pay more attention to adaptive parameter selection in future work.

Acknowledgments

This work was partly supported by the National Natural Science Foundation of China under Grants 51605366 and 51835009, the National Key Basic Research Program of China under Grant 2015CB057400, the China Postdoctoral Science Foundation under Grants 2016M590937 and 2017T100740, the Fundamental Research Funds for the Central Universities, and the open fund of the Zhejiang Provincial Key Laboratory of Laser Processing Robot / Key Laboratory of Laser Precision Processing and Detection.

Appendix A. Proof

A.1. Proof of Proposition 3.1

For the n-th element, since
φVMC(x_n; λ, γ, ω) = λ|ωx_n| − min_v { λ|v| + (1/(2γ))(ωx_n − v)^2 }
                   = { (1/2)γλ^2 − (1/(2γ))ω^2 (|x_n| − γλ/ω)^2,   |x_n| ≤ γλ/ω
                       (1/2)γλ^2,                                   |x_n| ≥ γλ/ω       (A.1)

we have

lim_{ω→∞} φVMC(x_n; λ, γ, ω) = { 0,             x_n = 0
                                 (1/2)γλ^2,     x_n ≠ 0

Summing the N elements, we have

lim_{ω→∞} ψVMC(x; λ, γ, ω) = (1/2)γλ^2 ‖x‖_0.

And the special case is obtained by setting λ = √2 and γ = 1.

A.2. Proof of Proposition 3.2

According to the definition of the Huber function, we have

0 ≤ S(Wx; γ) = min_v { ‖v‖_1 + (1/(2γ))‖Wx − v‖_2^2 } ≤ [ ‖v‖_1 + (1/(2γ))‖Wx − v‖_2^2 ]_{v = Wx} = ‖Wx‖_1,

thus

0 ≤ ψVMC(x; λ, γ, w) ≤ λ‖Wx‖_1.   (A.2)

Moreover, according to (A.1), we have 0 ≤ φVMC(x_n; λ, γ, ω_n) ≤ (1/2)γλ^2. Thus, summing the N elements together, we have

0 ≤ ψVMC(x; λ, γ, w) = Σ_{n=1}^N φVMC(x_n; λ, γ, ω_n) ≤ (1/2)γλ^2 N.   (A.3)

The lower bound and upper bound of the VMC function ψVMC(x; λ, γ, w) in (22) can be easily obtained by combining (A.2) and (A.3). Moreover, according to the definition of the Huber function, we have lim_{γ→∞} S(Wx; γ) = 0. Thus, we get the limit in (23).

A.3. Proof of Proposition 3.3

We write the cost function as

F_VMC(x) = (1/2)‖y − Ax‖_2^2 + ψVMC(x; λ, γ, w)
         = (1/2)‖y − Ax‖_2^2 + λ‖Wx‖_1 − λ S(Wx; λγ)
         = (1/2)‖y − Ax‖_2^2 + λ‖Wx‖_1 − min_v { λ‖v‖_1 + (1/(2γ))‖Wx − v‖_2^2 }
         = max_v { (1/2)‖y − Ax‖_2^2 + λ‖Wx‖_1 − λ‖v‖_1 − (1/(2γ))‖Wx − v‖_2^2 }
         = max_v { (1/2) x^T ( A^T A − (1/γ) W^T W ) x + λ‖Wx‖_1 + g(x, v) }
         = (1/2) x^T ( A^T A − (1/γ) W^T W ) x + λ‖Wx‖_1 + max_v { g(x, v) },   (A.4)

where W = diag(w) and the function g is affine in x. The last term is convex as it is the pointwise supremum of a set of convex functions. Thus, the cost function F_VMC is convex if A^T A − (1/γ) W^T W ⪰ 0. When the matrix A satisfies A^T A = (diag(w))^2, we have A^T A − (1/γ) W^T W ⪰ 0 if γ ≥ 1, so the convexity condition of the function F in (24) is equivalent to γ ≥ 1.

A.4. Proof of Proposition 4.1

Similarly to (A.4), for the VMC penalty, we have

f(x) = (β/2)‖x‖_2^2 + ψVMC(x; λ, γ, w) = (β/2) x^T ( I − (1/(γβ)) W^2 ) x + λ‖Wx‖_1 + max_v { g(x, v) },   (A.5)

where g is affine in x. The last term is convex as it is the pointwise supremum of a set of convex functions. When β ≥ w_max^2/γ, we have I − (1/(γβ)) W^2 ⪰ 0, and then the function f in (A.5) is convex.

A.5. Proof of Proposition 4.2

According to Proposition 4.1 and its proof, we know that f is convex if I − (1/(γβ)) W^2 ⪰ 0 is satisfied, which is sufficient to require γ ≥ w_max^2/β. We then consider the gradient of the multivariable Huber function defined in (18). Since S with parameter γ is the Moreau envelope of the function ‖·‖_1 with parameter γ, it follows by Proposition 12.29 in Ref. [53] that

∇S(x; γ) = (1/γ)( x − prox_{γ‖·‖_1}(x) ) = (1/γ)( x − soft(x, γ) ).   (A.6)

Thus, according to the definition of the VMC penalty in (20), we have

∂f(x) = β(x − y) + λ∂[ ‖Wx‖_1 ] − λ∇[ S(Wx; γλ) ]
      = β(x − y) + λ W^T [ ∂‖x‖_1 ] − λ W^T ∇S(Wx; γλ)
      = β(x − y) + λ W [ ∂‖x‖_1 ] − (1/γ) W ( Wx − soft(Wx; γλ) ),   (A.7)

where the fact that W is a diagonal matrix with positive entries is considered. For the n-th element,

[∂f(x)]_n = β(x_n − y_n) + λω_n [ ∂|x_n| ] − (1/γ)ω_n ( ω_n x_n − soft(ω_n x_n; γλ) )
          = { (β − ω_n^2/γ) x_n − β y_n + λω_n sign(x_n),   |x_n| ≤ γλ/ω_n
              β(x_n − y_n),                                  |x_n| ≥ γλ/ω_n       (A.8)

Setting 0 ∈ [∂f(x)]_n, we get the n-th element of the optimal solution, x_n^opt = firm(y_n; λω_n/β, λγ/ω_n). Combining the N elements provides the minimizer in (33).

A.6. Proof of Lemma 4.1

Before the proof of Lemma 4.1, we notice from the updates in Algorithm 1 that the sequence (x^(k+1), z^(k+1), u^(k+1)) satisfies

−u^(k+1) − β( z^(k+1) − z^(k) ) ∈ ∂f_1(x^(k+1)),   (A.9)
u^(k+1) = ∇f_2(z^(k+1)),   (A.10)
x^(k+1) − z^(k+1) = ( u^(k+1) − u^(k) )/β.   (A.11)

These three relations will be used for the proof. We notice that L_β(x^(k+1), z^(k+1), u^(k+1)) − L_β(x^(k), z^(k), u^(k)) is the sum of three terms: L_β(x^(k+1), z^(k+1), u^(k+1)) − L_β(x^(k+1), z^(k+1), u^(k)), L_β(x^(k+1), z^(k+1), u^(k)) − L_β(x^(k+1), z^(k), u^(k)), and L_β(x^(k+1), z^(k), u^(k)) − L_β(x^(k), z^(k), u^(k)). For the first term, according to the definition in (31) and combining (A.11), we have

L_β(x^(k+1), z^(k+1), u^(k+1)) − L_β(x^(k+1), z^(k+1), u^(k)) = ( u^(k+1) − u^(k) )^T ( x^(k+1) − z^(k+1) ) = (1/β)‖u^(k+1) − u^(k)‖_2^2.   (A.12)
According to (A.10), we have

‖u^(k+1) − u^(k)‖_2^2 = ‖∇f_2(z^(k+1)) − ∇f_2(z^(k))‖_2^2 = ‖A^T (Az^(k+1) − y) − A^T (Az^(k) − y)‖_2^2 = ‖A^T A ( z^(k+1) − z^(k) )‖_2^2 ≤ λmax^2(A^T A) ‖z^(k+1) − z^(k)‖_2^2,   (A.13)

so

L_β(x^(k+1), z^(k+1), u^(k+1)) − L_β(x^(k+1), z^(k+1), u^(k)) ≤ ( λmax^2(A^T A)/β ) ‖z^(k+1) − z^(k)‖_2^2.   (A.14)

For the second term L_β(x^(k+1), z^(k+1), u^(k)) − L_β(x^(k+1), z^(k), u^(k)), we see that the function z → L_β(x^(k+1), z, u^(k)) is strongly convex with modulus at least β. Using this and the definition of z^(k+1) as a minimizer, we have

L_β(x^(k+1), z^(k+1), u^(k)) − L_β(x^(k+1), z^(k), u^(k)) ≤ −(β/2) ‖z^(k+1) − z^(k)‖_2^2.   (A.15)

Moreover, for the third term L_β(x^(k+1), z^(k), u^(k)) − L_β(x^(k), z^(k), u^(k)), using the definition of x^(k+1) as a minimizer, we have

L_β(x^(k+1), z^(k), u^(k)) − L_β(x^(k), z^(k), u^(k)) ≤ 0.   (A.16)

Summing the three terms together, we have

L_β(x^(k+1), z^(k+1), u^(k+1)) − L_β(x^(k), z^(k), u^(k)) ≤ ( λmax^2(A^T A)/β − β/2 ) ‖z^(k+1) − z^(k)‖_2^2 ≤ 0,   (A.17)

where the last inequality follows from the initialization condition β > √2 λmax(A^T A).

A.7. Proof of Lemma 4.2

According to the definition in (31), we know

L_β(x^(k), z^(k), u^(k)) = f_1(x^(k)) + f_2(z^(k)) + ⟨u^(k), x^(k) − z^(k)⟩ + (β/2)‖x^(k) − z^(k)‖_2^2
= f_1(x^(k)) + f_2(z^(k)) + (β/2)‖x^(k) − z^(k) + (1/β)u^(k)‖_2^2 − (1/(2β))‖u^(k)‖_2^2
= f_1(x^(k)) + f_2(z^(k)) − (1/(2β))‖∇f_2(z^(k))‖_2^2 + (β/2)‖x^(k) − z^(k) + (1/β)u^(k)‖_2^2
≥ f_1(x^(k)) + (1/2)( 1 − 1/√2 ) ‖Az^(k) − y‖_2^2 + (β/2)‖x^(k) − z^(k) + (1/β)u^(k)‖_2^2
> −∞,

where the last equality follows from (A.10), and the last inequality follows from the fact that the function f_1(x) is bounded below (Proposition 3.2). Moreover, according to Lemma 4.1, L_β(x^(k), z^(k), u^(k)) is upper bounded by L_β(x^(0), z^(0), u^(0)). Thus, L_β(x^(k), z^(k), u^(k)) is bounded. Because the function f_2(z) is coercive, the sequence {z^(k)} is bounded. According to the relations (A.10) and (A.11), the sequences {u^(k)} and {x^(k)} are also bounded. Thus, the sequence (x^(k), z^(k), u^(k)) in Algorithm 1 is bounded.

A.8. Proof of Lemma 4.3

In order to prove (43), we only need to show that each block of ∂L_β can be controlled by some constant depending on β. For the first block ∂_x L_β, since

∂_x L_β(x^(k+1), z^(k+1), u^(k+1)) = ∂f_1(x^(k+1)) + u^(k+1) + β( x^(k+1) − z^(k+1) )
= ∂f_1(x^(k+1)) + u^(k) + β( x^(k+1) − z^(k) ) + ( u^(k+1) − u^(k) ) + β( z^(k) − z^(k+1) ).

According to the x-update in Algorithm 1, we have

0 ∈ ∂f_1(x^(k+1)) + u^(k) + β( x^(k+1) − z^(k) ).

Thus ( u^(k) − u^(k+1) ) + β( z^(k+1) − z^(k) ) ∈ ∂_x L_β(x^(k+1), z^(k+1), u^(k+1)). Then we have

‖( u^(k) − u^(k+1) ) + β( z^(k+1) − z^(k) )‖_2 ≤ ‖u^(k+1) − u^(k)‖_2 + β‖z^(k+1) − z^(k)‖_2 ≤ ( λmax(A^T A) + β ) ‖z^(k+1) − z^(k)‖_2.   (A.18)

For the second block ∇_z L_β, since

∇_z L_β(x^(k+1), z^(k+1), u^(k+1)) = ∇f_2(z^(k+1)) − u^(k+1) + β( z^(k+1) − x^(k+1) ) = u^(k) − u^(k+1),

and combining with (A.13), we have

‖∇_z L_β(x^(k+1), z^(k+1), u^(k+1))‖_2 ≤ λmax(A^T A) ‖z^(k+1) − z^(k)‖_2.   (A.19)

For the third block ∇_u L_β, since

∇_u L_β(x^(k+1), z^(k+1), u^(k+1)) = x^(k+1) − z^(k+1) = ( u^(k+1) − u^(k) )/β,

we have

‖∇_u L_β(x^(k+1), z^(k+1), u^(k+1))‖_2 ≤ ( λmax(A^T A)/β ) ‖z^(k+1) − z^(k)‖_2.   (A.20)

According to the definition in (31), we have

∂L_β(x^(k+1), z^(k+1), u^(k+1)) = ( ∂_x L_β, ∇_z L_β, ∇_u L_β ).

Combining (A.18), (A.19), and (A.20), we obtain the boundedness in (43).

A.9. Proof of Lemma 4.4

According to Lemma 4.1 and Lemma 4.2, L_β(x^(k), z^(k), u^(k)) is monotonically nonincreasing and lower bounded, and therefore

lim_{k→∞} ‖z^(k+1) − z^(k)‖_2 = 0.   (A.21)

Then we have lim_{k→∞} ‖s^(k)‖_2 = 0. Moreover, according to (A.13), we have

lim_{k→∞} ‖u^(k+1) − u^(k)‖_2 = 0.   (A.22)

Thus lim_{k→∞} ‖r^(k)‖_2 = 0. Furthermore, according to the relation in (A.11), we have

‖x^(k+1) − x^(k)‖_2 ≤ ‖z^(k+1) − z^(k)‖_2 + (1/β)‖u^(k+1) − u^(k)‖_2 + (1/β)‖u^(k) − u^(k−1)‖_2,

and therefore

lim_{k→∞} ‖x^(k+1) − x^(k)‖_2 = 0.   (A.23)

Consequently, combining (A.21), (A.22), and (A.23), (45) holds.
A.10. Proof of Lemma 4.5

According to Lemma 4.2, the sequence (x^(k), z^(k), u^(k)) is bounded, so a convergent subsequence and a limit point exist, denoted by

lim_{i→∞} ( x^(k_i), z^(k_i), u^(k_i) ) = ( x*, z*, u* ).   (A.24)

Based on Lemma 4.3, there exists d^(k) ∈ ∂L_β(x^(k), z^(k), u^(k)) such that

lim_{i→∞} ‖d^(k_i)‖_2 = 0.   (A.25)

By the definition of the subgradient [28,36], we have 0 ∈ ∂L_β(x*, z*, u*). Moreover, from the definition of x^(k_i) as a minimizer, we have

L_β(x^(k_i+1), z^(k_i), u^(k_i)) ≤ L_β(x*, z^(k_i), u^(k_i)).   (A.26)

Taking the limit and using (A.24), we have

lim sup_{i→∞} L_β(x^(k_i+1), z^(k_i), u^(k_i)) ≤ f_1(x*) + f_2(z*) + ⟨u*, x* − z*⟩ + (β/2)‖x* − z*‖_2^2.   (A.27)

On the other hand, from the lower semicontinuity of L_β, (A.24) and (45), we have

lim inf_{i→∞} L_β(x^(k_i+1), z^(k_i), u^(k_i)) ≥ f_1(x*) + f_2(z*) + ⟨u*, x* − z*⟩ + (β/2)‖x* − z*‖_2^2.   (A.28)
The above two relations show that lim_{i→∞} f_1(x^(k_i+1)) = f_1(x*). This, together with (A.22), shows that x* is a stationary point of the problem (28) and that (46) holds.

References

[1] D.L. Donoho, Compressed sensing, IEEE Trans. Inf. Theory 52 (4) (2006) 1289–1306.
[2] E.J. Candès, J. Romberg, T. Tao, Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information, IEEE Trans. Inf. Theory 52 (2) (2006) 489–509.
[3] E. Sejdić, I. Orović, S. Stanković, Compressive sensing meets time-frequency: an overview of recent advances in time-frequency processing of sparse signals, Digit. Signal Process. 27 (2018) 22–35.
[4] P. Ruiz, X. Zhou, J. Mateos, R. Molina, A.K. Katsaggelos, Variational Bayesian blind image deconvolution: a review, Digit. Signal Process. 47 (2015) 116–127.
[5] R. Chartrand, W. Yin, Nonconvex sparse regularization and splitting algorithms, in: Splitting Methods in Communication, Imaging, Science, and Engineering, Springer, 2016, pp. 237–249.
[6] Z.-X. Cui, Q. Fan, A nonconvex nonsmooth regularization method for compressed sensing and low rank matrix completion, Digit. Signal Process. 62 (2017) 101–111.
[7] L. Chen, Y. Gu, The convergence guarantees of a non-convex approach for sparse recovery, IEEE Trans. Signal Process. 62 (15) (2014) 3754–3767.
[8] Q. Liu, H.C. So, Y. Gu, Off-grid DOA estimation with nonconvex regularization via joint sparse representation, Signal Process. 140 (2017) 171–176.
[9] E.J. Candès, M.B. Wakin, S.P. Boyd, Enhancing sparsity by reweighted ℓ1 minimization, J. Fourier Anal. Appl. 14 (5) (2008) 877–905.
[10] Z. Zhao, S. Wu, B. Qiao, S. Wang, X. Chen, Enhanced sparse period-group lasso for bearing fault diagnosis, IEEE Trans. Ind. Inform. (2018), http://doi.org/10.1109/TIE.2018.2838070.
[11] W. He, Y. Ding, Y. Zi, I.W. Selesnick, Sparsity-based algorithm for detecting faults in rotating machines, Mech. Syst. Signal Process. 72 (2016) 46–64.
[12] Y. Ding, W. He, B. Chen, Y. Zi, I.W. Selesnick, Detection of faults in rotating machinery using periodic time-frequency sparsity, J. Sound Vib. 382 (2016) 357–378.
[13] Z. Xu, X. Chang, F. Xu, H. Zhang, L1/2 regularization: a thresholding representation theory and a fast solver, IEEE Trans. Neural Netw. Learn. Syst. 23 (7) (2012) 1013–1027.
[14] J. Zeng, S. Lin, Y. Wang, Z. Xu, L1/2 regularization: convergence of iterative half thresholding algorithm, IEEE Trans. Signal Process. 62 (9) (2014) 2317–2329.
[15] C.-H. Zhang, et al., Nearly unbiased variable selection under minimax concave penalty, Ann. Stat. 38 (2) (2010) 894–942.
[16] C.-H. Zhang, T. Zhang, A general theory of concave regularization for high-dimensional sparse estimation problems, Stat. Sci. (2012) 576–593.
[17] X. Shen, Y. Gu, Nonconvex sparse logistic regression with weakly convex regularization, arXiv preprint, arXiv:1708.02059.
[18] E. Soubies, L. Blanc-Féraud, G. Aubert, A continuous exact ℓ0 penalty (CEL0) for least squares regularized problem, SIAM J. Imaging Sci. 8 (3) (2015) 1607–1639.
[19] E. Soubies, L. Blanc-Féraud, G. Aubert, A unified view of exact continuous penalties for ℓ2–ℓ0 minimization, SIAM J. Optim. 27 (3) (2017) 2034–2060.
[20] L. Dicker, B. Huang, X. Lin, Variable selection and estimation with the seamless-L0 penalty, Stat. Sin. 23 (2) (2013) 929–962.
[21] S.S. Chen, D.L. Donoho, M.A. Saunders, Atomic decomposition by basis pursuit, SIAM Rev. 43 (1) (2001) 129–159.
[22] R. Tibshirani, Regression shrinkage and selection via the lasso, J. R. Stat. Soc., Ser. B, Methodol. (1996) 267–288.
[23] S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.
[24] A.M. Bruckstein, D.L. Donoho, M. Elad, From sparse solutions of systems of equations to sparse modeling of signals and images, SIAM Rev. 51 (1) (2009) 34–81.
[25] F. Bach, R. Jenatton, J. Mairal, G. Obozinski, et al., Optimization with sparsity-inducing penalties, Found. Trends Mach. Learn. 4 (1) (2012) 1–106.
[26] R. Chartrand, Exact reconstruction of sparse signals via nonconvex minimization, IEEE Signal Process. Lett. 14 (10) (2007) 707–710.
[27] J. Fan, R. Li, Variable selection via nonconcave penalized likelihood and its oracle properties, J. Am. Stat. Assoc. 96 (456) (2001) 1348–1360.
[28] R. Mazumder, J.H. Friedman, T. Hastie, SparseNet: coordinate descent with nonconvex penalties, J. Am. Stat. Assoc. 106 (495) (2011) 1125–1138.
[29] H.-Y. Gao, A.G. Bruce, Waveshrink with firm shrinkage, Stat. Sin. (1997) 855–874.
[30] A. Eftekhari, M. Babaie-Zadeh, C. Jutten, H.A. Moghaddam, Robust-SL0 for stable sparse representation in noisy settings, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2009), IEEE, 2009, pp. 3433–3436.
[31] H. Mohimani, M. Babaie-Zadeh, C. Jutten, A fast approach for overcomplete sparse decomposition based on smoothed ℓ0 norm, IEEE Trans. Signal Process. 57 (1) (2009) 289–301.
[32] M.M. Hyder, K. Mahata, An improved smoothed ℓ0 approximation algorithm for sparse representation, IEEE Trans. Signal Process. 58 (4) (2010) 2194–2205.
[33] Y. Wang, J. Zeng, Z. Peng, X. Chang, Z. Xu, Linear convergence of adaptively iterative thresholding algorithms for compressed sensing, IEEE Trans. Signal Process. 63 (11) (2015) 2957–2971.
[34] J. Woodworth, R. Chartrand, Compressed sensing recovery via nonconvex shrinkage penalties, Inverse Probl. 32 (7) (2016) 75004–75028.
[35] S. Voronin, R. Chartrand, A new generalized thresholding algorithm for inverse problems with sparsity constraints, in: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2013, pp. 1636–1640.
[36] I. Bayram, On the convergence of the iterative shrinkage/thresholding algorithm with a weakly convex penalty, IEEE Trans. Signal Process. 64 (6) (2016) 1597–1608.
[37] A.P. Liavas, N.D. Sidiropoulos, Parallel algorithms for constrained tensor factorization via alternating direction method of multipliers, IEEE Trans. Signal Process. 63 (20) (2015) 5450–5463.
[38] Y. Shen, Z. Wen, Y. Zhang, Augmented Lagrangian alternating direction method for matrix separation based on low-rank factorization, Optim. Methods Softw. 29 (2) (2014) 239–263.
[39] J. Zeng, S. Lin, Z. Xu, Sparse regularization: convergence of iterative jumping thresholding algorithm, IEEE Trans. Signal Process. 64 (19) (2016) 5106–5118.
[40] Y. Wang, W. Yin, J. Zeng, Global convergence of ADMM in nonconvex nonsmooth optimization, arXiv preprint, arXiv:1511.06324.
[41] M. Hong, Z.-Q. Luo, M. Razaviyayn, Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems, SIAM J. Optim. 26 (1) (2016) 337–364.
[42] G. Li, T.K. Pong, Global convergence of splitting methods for nonconvex composite optimization, SIAM J. Optim. 25 (4) (2015) 2434–2460.
[43] F. Wang, Z. Xu, H.-K. Xu, Convergence of Bregman alternating direction method with multipliers for nonconvex composite problems, arXiv preprint, arXiv:1410.8625.
[44] Q. Liu, X. Shen, Y. Gu, Linearized ADMM for non-convex non-smooth optimization with convergence analysis, arXiv preprint, arXiv:1705.02502.
[45] I. Selesnick, M. Farshchian, Sparse signal approximation via nonseparable regularization, IEEE Trans. Signal Process. 65 (10) (2017) 2561–2575.
[46] M. Carlsson, On convexification/optimization of functionals including an ℓ2-misfit term, arXiv preprint, arXiv:1609.09378.
[47] I. Selesnick, Total variation denoising via the Moreau envelope, IEEE Signal Process. Lett. 24 (2) (2017) 216–220.
[48] I. Selesnick, Sparse regularization via convex analysis, IEEE Trans. Signal Process. 65 (17) (2017) 4481–4494.
[49] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers, Found. Trends Mach. Learn. 3 (1) (2011) 1–122.
[50] I.W. Selesnick, I. Bayram, Enhanced sparsity by non-separable regularization, IEEE Trans. Signal Process. 64 (9) (2016) 2298–2313.
[51] A. Adler, V. Emiya, M.G. Jafari, M. Elad, R. Gribonval, M.D. Plumbley, Audio inpainting, IEEE Trans. Audio Speech Lang. Process. 20 (3) (2012) 922–932.
[52] S. Wang, I. Selesnick, G. Cai, Y. Feng, X. Sui, X. Chen, Nonconvex sparse regularization and convex optimization for bearing fault diagnosis, IEEE Trans. Ind. Inform. 65 (9) (2018) 7332–7342.
[53] H.H. Bauschke, P.L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, vol. 408, Springer, 2011.
Shibin Wang received the B.S. and M.S. degrees in electrical engineering from Soochow University, Suzhou, China, in 2008 and 2011, respectively, and the Ph.D. degree in mechanical engineering from Xi'an Jiaotong University, Xi'an, China, in 2015. He then joined the School of Mechanical Engineering, Xi'an Jiaotong University, where he is an Associate Professor. In 2017, he was a Visiting Scholar at the Tandon School of Engineering, New York University, NY, USA. His research interests include time-frequency analysis and sparsity-assisted signal processing for machine condition monitoring and fault diagnosis.

Xuefeng Chen is a full Professor and Dean of the School of Mechanical Engineering at Xi'an Jiaotong University, PR China, where he received his Ph.D. degree in 2004. He serves as the executive director of the Fault Diagnosis Branch of the China Mechanical Engineering Society. He is also a member of ASME and IEEE, and the chair of the IEEE Xi'an and Chengdu Joint Section Instrumentation and Measurement Society Chapter. He has authored over 100 SCI publications in areas including composite structures, aero-engines, and wind power equipment. He won the National Excellent Doctoral Thesis Award in 2007, the First Technological Invention Award of the Ministry of Education in 2008, the Second National Technological Invention Award in 2009, the First Provincial Teaching Achievement Award in 2013, and the First Technological Invention Award of the Ministry of Education in 2015, and he received the Science & Technology Award for Chinese Youth in 2013. Additionally, he hosted a National Key 973 Research Program of China as principal scientist in 2015.
Weiwei Dai received the B.E. degree in Mechanical Engineering & Automation from Shanghai University, Shanghai, China, in 2013, and the M.S. degree in Electrical Engineering from New York University, New York, NY, USA, in 2015, where he is currently working toward the Ph.D. degree in Electrical Engineering. His research interests include signal and image processing, sparsity techniques, and eye tracking.

Ivan W. Selesnick received the B.S., M.E.E., and Ph.D. degrees in electrical engineering from Rice University, Houston, TX, USA, in 1990, 1991, and 1996, respectively. In 1997, he was a Visiting Professor at the University of Erlangen–Nürnberg, Germany. He then joined the Department of Electrical and Computer Engineering, Polytechnic University, Brooklyn, NY, USA (now the Tandon School of Engineering, New York University), where he is currently a Professor. His research interests include signal and image processing, wavelet-based signal processing, sparsity techniques, and biomedical signal processing. He has been an Associate Editor for the IEEE Transactions on Image Processing, IEEE Signal Processing Letters, and IEEE Transactions on Signal Processing.

Gaigai Cai received the B.S. degree in mechanical engineering from Xidian University, Xi'an, China, in 2008, and the Ph.D. degree in mechanical engineering from Xi'an Jiaotong University, Xi'an, China, in 2013. She then joined the School of Rail Transportation, Soochow University, Suzhou, China, where she is currently an Associate Professor. In 2017, she was a Visiting Scholar at the Tandon School of Engineering, New York University, NY, USA. Her current research interests include machinery fault diagnosis, sparse techniques for machinery signal processing, and machinery failure evolution mechanisms.

Benjamin Cowen received the B.S. and M.S. degrees in Applied Mathematics from Case Western Reserve University, Cleveland, Ohio, USA, in 2014 and 2015, respectively. He is now a Ph.D. candidate in the Electrical and Computer Engineering Department at the New York University Tandon School of Engineering in Brooklyn, NY, USA. His research interests include sparsity-based signal and image processing and the intersection of iterative optimization and machine learning.