A gradient descent based algorithm for ℓp minimization



Highlights
• Introduce a scaled KKT condition with no relaxation to identify candidate solutions
• Design a gradient-descent-based algorithm on positive entries to avoid smoothing
• Provide a convergence proof with worst-case iteration complexity analysis
• Exhibit dominating performance in sparsity recovery rate and computing time


A gradient descent based algorithm for ℓp minimization

Shan Jiang^a, Shu-Cherng Fang^a, Tiantian Nie^b,∗, Wenxun Xing^c

^a Edward P. Fitts Department of Industrial and Systems Engineering, North Carolina State University, Raleigh, NC 27695-7906, USA
^b University of North Carolina at Charlotte, Charlotte, NC, USA
^c Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China
∗Corresponding author.

Abstract

In this paper, we study the linearly constrained ℓp minimization problem with p ∈ (0, 1). Unlike known works in the literature that propose solving relaxed ε-KKT conditions, we introduce a scaled KKT condition that involves no relaxation of the optimality conditions. A gradient-descent-based algorithm that works only on the positive entries of variables is then proposed to find solutions satisfying the scaled KKT condition without invoking the nondifferentiability issue. A convergence proof and complexity analysis of the proposed algorithm are provided. Computational experiments support that the proposed algorithm achieves much better sparse recovery in reasonable computational time compared to state-of-the-art interior-point based algorithms.

Keywords: Global optimization, Nonsmooth optimization, Nonconvex optimization, Gradient descent algorithm, KKT condition

1. Introduction

In this paper, we consider the following "ℓp minimization" problem studied in Ge et al. (2011):

    min  g(x) := Σ_{i=1}^n x_i^p
    s.t. x ∈ F := {x : Ax = b, x ≥ 0},                    (ℓp)

where A ∈ R^{m×n}, b ∈ R^m and p ∈ (0, 1). Notice that g(x) = ‖x‖_p^p, where ‖x‖_p := (Σ_{i=1}^n |x_i|^p)^{1/p} is a quasi-norm function for p ∈ (0, 1), and we essentially consider ‖x‖_p as the objective function. Actually, for p ∈ (0, 1), ‖x‖_p is not

subadditive and is thus called an ℓp quasi-norm function, whereas g(x) is subadditive and homogeneous of degree p.

The ℓp minimization related problems have attracted much attention in recent years. They have been used in matrix completion (Nie et al., 2012), proximal support vector machines (Chen & Tian, 2010), vector reconstruction (Kabashima et al., 2009; Mairal et al., 2008; Donoho, 2006a) and recursive parameter estimation (Ma et al., 2015) in the fields of machine learning and compressed sensing. An ℓp quasi-norm minimization problem is in general difficult to solve due to its nonconvexity, nonsmoothness and lack of Lipschitz continuity. Ge et al. (2011) proved that problem (ℓp) is NP-hard, with all basic feasible solutions of the polyhedral set F being local minimizers. One major challenge of solving problem (ℓp) lies in finding a global minimizer without being trapped at a local one. Various studies have been done on designing algorithms for solving ℓp quasi-norm minimization related problems in different forms (Ge et al., 2017; Liu et al., 2016; Haeser et al., 2018). In this study, we aim at designing an efficient algorithm for solving a linearly constrained ℓp minimization problem in the form of (ℓp).

In the literature, there are two widely adopted techniques that may help handle ℓp quasi-norm minimization related problems. The first technique is to find candidates of global optima satisfying certain optimality conditions. Due to the nonsmoothness of an ℓp quasi-norm function, efforts have been made in establishing relaxed KKT conditions (Bian et al., 2015; Ge et al., 2017, 2011; Liu et al., 2016; Haeser et al., 2018). Ge et al. (2011) studied problem (ℓp) and defined an ε-KKT condition in which a relative complementarity gap is allowed. Liu et al. (2016) studied a composite ℓp (0 < p < 1) minimization over a polyhedron, for which problem (ℓp) can be seen as a special case; a perturbed ε-KKT condition was introduced there. Haeser et al. (2018) studied linearly constrained optimization without differentiability on the boundary, which is a generalization of problem (ℓp). To solve problems of this kind, they proposed a different ε-KKT condition in which both the stationarity condition and the complementary slackness condition are relaxed. Although the relaxed KKT conditions cannot guarantee global optimality, they still describe some properties of a global optimum and enable a solution algorithm to proceed in practice. In this sense, introducing relaxed optimality conditions is valuable.

The second technique is the descent approach that iteratively reduces the objective value. To adopt this technique, finding a good descent direction is the key in algorithm design. However, the conventional gradient direction is not directly applicable to the nonsmooth ℓp quasi-norm minimization problems. Hence different smoothing schemes were adopted in the literature. For a fixed constant δ > 0, the function Σ_{i=1}^n x_i^p is substituted by Σ_{i=1}^n (x_i + δ)^p for x_i ≥ 0 in Chen & Zhou (2014), Lai & Wang (2011) and Lai et al. (2013), and by Σ_{i=1}^n x_i^2/(x_i^2 + δ_i)^{1−p/2} or Σ_{i=1}^n |x_i|/(|x_i| + δ_i)^{1−p} for x_i ∈ R in Candès et al. (2008) and Nikolova et al. (2008). Smoothing the g(x) function does not change the complexity of problem (ℓp) (Chen et al., 2014; Ge et al., 2011), but it enables us


to derive the gradient of the objective function in designing a gradient descent algorithm.

Although the relaxed ε-KKT conditions and smoothing techniques are of significant importance for practical computation, they may cause errors in solving an ℓp quasi-norm minimization problem. The ε-KKT conditions adopted in both Ge et al. (2011) and Haeser et al. (2018) relax the complementary slackness condition; hence an ε-KKT point may still exhibit a duality gap. Bian et al. (2015) relaxed the subgradient optimality condition for an unconstrained optimization problem, where an ε-KKT point may still possess a descent direction. As for the smoothing technique, although for a given δ > 0 the functions Σ_{i=1}^n (x_i + δ)^p, Σ_{i=1}^n x_i^2/(x_i^2 + δ_i)^{1−p/2} and Σ_{i=1}^n x_i/(x_i + δ_i)^{1−p} are all differentiable at the origin, the gradient of each smoothing function at the origin is either 0 or depends heavily on the value of δ. It can be shown that the Taylor expansions of these smoothed ℓp functions are not good approximations of the function g(x) at the origin, which may degrade the accuracy of a gradient-descent-based algorithm.

In this study, to deal with the issues mentioned above, we propose a scaled KKT condition and a gradient-descent-based algorithm for solving problem (ℓp). The proposed scaled KKT condition is a necessary condition for any local and global optimal solutions of problem (ℓp). It does not relax the KKT condition on strictly positive entries as other works in the literature do (Ge et al., 2011; Liu et al., 2016; Haeser et al., 2018). We show that, compared with other ε-KKT conditions, the scaled KKT condition is a stronger condition that leads to a smaller set of candidates. To find points satisfying the scaled KKT condition, instead of adopting a smoothing technique, we propose a gradient-descent-based algorithm that solves problem (ℓp) in a simplex-method-like approach. The proposed algorithm consists of two loops. The inner loop proceeds within a given subset of positive variables to find a scaled KKT point. The outer loop proceeds with all variables to find a better scaled KKT point if possible. The convergence proof and complexity analysis show that this algorithm terminates at a scaled KKT point of problem (ℓp) within O(δ^{−1} n) iterations, where δ > 0 is a given parameter and n is the size of problem (ℓp). A worst-case complexity comparison with the algorithms of Ge et al. (2011), Haeser et al. (2018) and Liu et al. (2016) is provided to show the superiority of the proposed algorithm. Computational experiments are also conducted to show the effectiveness and efficiency of the proposed algorithm.

The remainder of this paper is organized as follows. In Section 2, we first review some ε-KKT conditions in the literature for problem (ℓp) and then propose a scaled KKT condition. To locate scaled KKT points, a gradient-descent-based algorithm is developed, with a convergence proof and complexity analysis, in Section 3. In Section 4, computational experiments on synthetic and real-life data sets are conducted to validate the effectiveness and efficiency of the proposed algorithm for sparse recovery. Section 5 concludes the paper.

Notations: Here are some notations to be used throughout the paper. For x ∈ R^n_+, X := [x] is the diagonal matrix formed by the elements of x, P(x) := {i | x_i > 0} is the positive index set of x, Z(x) := {i | x_i = 0} is the zero

index set of x, and n(x) := ‖x‖_0 is the number of nonzero entries of x. Vector e ∈ R^n denotes the vector with all entries being 1, and vector e_i ∈ R^n, i = 1, 2, ..., n, denotes the ith unit vector whose ith entry is 1 while all other entries are 0. Moreover, F := {x | Ax = b, x ≥ 0} is the feasible domain of problem (ℓp), and P_F(·) is the projection mapping onto the set F.

2. Scaled KKT Condition

In this section, we propose a scaled Karush-Kuhn-Tucker (KKT) condition for problem (ℓp). First, for the convenience of comparison, we state the formal KKT condition for problem (ℓp) as below.

Definition 1. A point x ∈ R^n_{++} is a KKT point of problem (ℓp) if there exist λ ∈ R^m and µ ∈ R^n_+ such that the following KKT condition holds:

    Ax = b, x ≥ 0,
    ∇g(x) + A^T λ − µ = 0,                                (1)
    x^T µ = 0,

where ∇g(x) = (p x_1^{p−1}, p x_2^{p−1}, ..., p x_n^{p−1})^T for x > 0.

Notice that when x ∉ R^n_{++}, system (1) is not well-defined due to the nonsmoothness of the function g(x). To handle this situation, several relaxed KKT conditions of problem (ℓp) have been studied in the literature. Given any ε > 0, Ge et al. (2011) proposed an interior-point based potential reduction algorithm to find ε-KKT points of problem (ℓp) defined as below.

Definition 2 (Ge et al., 2011). Given any ε > 0, a point x ∈ R^n_+ is an ε-KKT point of problem (ℓp) if there exist λ ∈ R^m and µ ∈ R^{n(x)}_+ such that the following ε-KKT condition holds:

    Ax = b, x ≥ 0,
    p x_j^{p−1} + (A^T λ)_j − µ_j = 0,  ∀ j ∈ P(x),        (ε-KKT 1)
    µ^T x / (z̄ − z) ≤ ε,

where P(x) is the positive index set of x, and z and z̄ are the minimum and maximum values of g(x) over F, respectively.

Liu et al. (2016) studied a class of composite ℓp (0 < p < 1) minimization over a polyhedral set. A smoothing sequential quadratic programming framework was proposed to locate ε-KKT points of the problem studied. Since (ℓp) can be seen as a special case of their problems, an ε-KKT point for problem (ℓp) can be defined accordingly.

Definition 3 (Liu et al., 2016). Given any ε > 0, a point x ∈ R^n_+ is an ε-KKT point of problem (ℓp) if there exists λ ∈ R^{|K_x|}_+ such that the following ε-KKT


condition holds:

    Ax = b, x ≥ 0,
    |λ_i x_i| ≤ εp,  i ∈ K_x,                              (ε-KKT 2)
    ‖x − P_F(x − ∇L_ε(x, λ))‖_2 ≤ ε,

where P_F(x) is the projection of a point x onto the convex set F and

    ∇L_ε(x, λ) = Σ_{i ∈ J_x} p x_i^{p−1} e_i + Σ_{i ∈ K_x} λ_i e_i

with J_x = {i | x_i > ε} and K_x = {i | 0 ≤ x_i ≤ ε}.

Furthermore, Haeser et al. (2018) studied a linearly constrained optimization problem without differentiability on the boundary. Given any ε > 0, they proposed using some interior-point trust region algorithms to locate points satisfying certain ε-relaxed first order and second order necessary conditions. Since problem (ℓp) is a special case of linearly constrained optimization without differentiability on the boundary, an ε-KKT condition can be derived from Haeser et al. (2018) as below.

Definition 4 (Haeser et al., 2018). Given any ε > 0, a point x ∈ R^n_{++} is an ε-KKT point of problem (ℓp) if there exist λ ∈ R^m and s ∈ R^n_+ such that the following ε-KKT condition holds:

    Ax = b, x > 0,
    ‖∇g(x) + A^T λ − Σ_{i=1}^n s_i e_i‖_∞ ≤ ε,             (ε-KKT 3)
    |x_i s_i| ≤ ε,  i = 1, 2, ..., n.

Notice that, compared to the KKT condition (1), conditions (ε-KKT 1), (ε-KKT 2) and (ε-KKT 3) further relax the necessary optimality condition and expand the candidate set for optimal solutions. To address this issue, instead of searching for points satisfying relaxed KKT conditions, in this paper we consider a first order necessary condition that involves no relaxation of the KKT condition (1). For simplicity, we assume throughout the study that, in problem (ℓp), F ≠ ∅ and A is of full row rank. Since the KKT condition (1) is not well-defined on R^n_+ \ R^n_{++}, we define the scaled KKT condition for (ℓp) as below.

Definition 5. A point x ∈ R^n_+ is a scaled KKT point of problem (ℓp) if there exist λ ∈ R^m and µ ∈ R^n_+ such that the following scaled KKT condition holds:

    Ax = b, x ≥ 0,
    [p x_1^p; ...; p x_n^p] + X A^T λ − X µ = 0,           (Scaled KKT)
    x^T µ = 0, µ ≥ 0,

where X = [x] is the diagonal matrix with x_j being its jth diagonal entry.
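For a concrete reading of Definition 5, the following MATLAB fragment is a minimal numeric check of the (Scaled KKT) system at a candidate point; it is our own illustration rather than part of the paper, and it assumes that x, lambda, mu, A, b, p and a tolerance tol are already in the workspace (the tolerance is an assumption, not part of the definition).

    X    = diag(x);                                  % X = [x]
    stat = p * x.^p + X * (A' * lambda) - X * mu;    % scaled stationarity residual
    is_scaled_kkt = norm(A * x - b) <= tol && all(x >= -tol) && ...
                    norm(stat, inf) <= tol && abs(x' * mu) <= tol && all(mu >= -tol)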


We now show that (Scaled KKT) is a necessary condition for any local minimizer of problem (ℓp).

Theorem 1. If x* is a local minimizer of problem (ℓp) satisfying the linear independence constraint qualification condition on (ℓp), then x* is a scaled KKT point.

Proof. Without loss of generality, we may assume that x* = (x*_B, x*_N), where x*_B ∈ R^{n(x*)}_{++} (n(x*) ≤ n) contains all the strictly positive components of x* and x*_N = (0, 0, ..., 0)^T ∈ R^{n−n(x*)}. We group the columns of A as A = [B, N] accordingly, with B being an m × n(x*) matrix and N an m × (n − n(x*)) matrix. It is not difficult to see that x*_B is a local minimizer of the following problem:

    min  g_p(z) := Σ_{j=1}^{n(x*)} z_j^p
    s.t. z ∈ F̄ := {z : Bz = b, z ≥ 0}.                    (2)

Under the linear independence constraint qualification (LICQ) condition, the first order necessary condition guarantees that there exists λ_B ∈ R^m such that [p(x*_1)^{p−1}, ..., p(x*_{n(x*)})^{p−1}]^T + B^T λ_B = 0. Therefore, we have [p(x*_1)^p, ..., p(x*_{n(x*)})^p]^T + diag(x*_B) B^T λ_B = 0. Taking any µ ∈ R^n_+ with µ_B = 0, we have

    [p(x*_1)^p; ...; p(x*_{n(x*)})^p; p(x*_{n(x*)+1})^p; ...; p(x*_n)^p]
      + diag([x*_B; x*_N]) [B, N]^T λ_B − diag([x*_B; x*_N]) µ = 0.

Then the desired result follows.

Remark 1. Notice that, for x ∈ R^n_{++}, the scaled KKT condition (Scaled KKT) is equivalent to (1), and (Scaled KKT) does not hold at the origin unless the origin is a global minimizer of problem (ℓp). Besides, any point satisfying (Scaled KKT) satisfies the relaxed KKT conditions (ε-KKT 1), (ε-KKT 2) and (ε-KKT 3).

Remark 2. Given any ε > 0 and x ∈ F, the ε-KKT conditions (ε-KKT 1) and (ε-KKT 3) are defined only on the positive index set P(x), and (ε-KKT 2) adopts a relaxation of the stationary point condition. As ε → 0, they all reduce to the (Scaled KKT) condition.

The scaled KKT condition is necessary for any local or global minimizer of problem (ℓp). Compared to the relaxed KKT conditions (Ge et al., 2011; Liu et al., 2016; Haeser et al., 2018), the scaled KKT condition can be seen as an unrelaxed extension of the formal KKT condition (1). The optimal solution candidate set corresponding to the scaled KKT condition is a subset

of those corresponding to the relaxed ε-KKT conditions, which may increase the chance of finding a global minimizer of problem (ℓp). In other words, the relaxation may reduce the chance of finding a global optimal solution using these optimality conditions. In the next section, we provide a gradient-descent-based algorithm to find scaled KKT points.

3. Gradient-descent-based Algorithm

As far as we know, Ge et al.'s interior-point potential reduction (IPPR) algorithm (Ge et al., 2011) is the only one in the literature that directly solves the ℓp minimization problem (ℓp) by finding ε-KKT points. More recent works, namely Liu et al.'s smoothing sequential quadratic programming (SSQP) algorithm (Liu et al., 2016) and Haeser et al.'s interior-point trust region (IPTR) algorithm (Haeser et al., 2018), can also be customized to find different ε-KKT points of problem (ℓp). In this section, we design a gradient-descent-based algorithm to solve problem (ℓp) by finding its scaled KKT points, which have a higher chance of being a global minimizer or a better local minimizer.

For a gradient-descent-based algorithm, the non-differentiability of the objective function g(x) poses a challenge to its direct application. To address this issue, smoothing techniques have been widely used in the literature. Given any δ > 0, the following differentiable functions have been adopted in Chen & Zhou (2014), Lai & Wang (2011), Lai et al. (2013), Candès et al. (2008) and Nikolova et al. (2008) as alternatives for g(x):

    h1(x) = Σ_{i=1}^n (x_i + δ)^p,
    h2(x) = Σ_{i=1}^n x_i^2 / (x_i^2 + δ)^{1−p/2},
    h3(x) = Σ_{i=1}^n x_i / (x_i + δ)^{1−p}.

Notice that ∇h1(0) = [pδ^{p−1}, pδ^{p−1}, ..., pδ^{p−1}]^T, ∇h2(0) = 0 and ∇h3(0) = [δ^{p−1}, δ^{p−1}, ..., δ^{p−1}]^T. Hence, ∇h1(0) and ∇h3(0) depend heavily on the value of δ. In order to make h1(x) or h3(x) a good approximation of g(x), δ has to be sufficiently small. However, in this situation, δ^{p−1} becomes extremely large for p ∈ (0, 1). As for h2(x), since ∇h2(0) is the zero vector, it does not provide much information. Therefore, the Taylor expansions of h1(x), h2(x) and h3(x) at any x with x_i = 0 for some i ∈ {1, 2, ..., n} are not good approximations of g(x) around x. For a better illustration, we show the values of the Taylor expansions of h1(x) and h3(x) with n = 1 in Figures 1 and 2.


Figure 1: Taylor expansion of h1(x) and h3(x) at x = 0.

Figure 2: Taylor expansion of h1(x) and h3(x) at x = 0.

As shown in Figures 1 and 2, the difference between g(0 + d) and h1(0) + ∇h1(0)d (or h3(0) + ∇h3(0)d) is already apparent when 0 < d ≤ 1.0E-4, and it becomes more drastic when d ≥ 1.0E-2. Meanwhile, the Taylor expansion of h2(x) is 0 around the origin. Thus, the Taylor expansions of the smooth functions h1(x), h2(x) and h3(x) do not provide good approximations of g(x) when some entries of x are 0. This may degrade the performance of a gradient descent framework using the smoothing techniques for solving problem (ℓp), in particular when the objective of problem (ℓp) is used to seek the sparsest solution involving as many zero entries as possible.
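A quick numeric illustration of this behaviour (ours, not the paper's) for a moderately small δ:

    p = 0.5;  delta = 1e-4;
    grad_h1_at_0 = p * delta^(p - 1)   % = 50;  grows without bound as delta -> 0
    grad_h3_at_0 = delta^(p - 1)       % = 100
    grad_h2_at_0 = 0                   % h2 carries no first-order information at 0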


In this study, we take a totally different way to handle the non-differentiability issue. For any x ∈ F, recall that the positive index set is P(x) = {i | x_i > 0} and the zero index set is Z(x) = {i | x_i = 0}. We further denote A_{P(x)} as the submatrix of A consisting of all columns of A corresponding to indices in P(x), and likewise A_{Z(x)} corresponding to Z(x). Without loss of generality, we write x = [x_{P(x)}; x_{Z(x)}], g(x) = g_{P(x)}(x_{P(x)}) + g_{Z(x)}(x_{Z(x)}) and A = [A_{P(x)}, A_{Z(x)}]. With an abuse of notation, for any x feasible to problem (ℓp), if there is no further specification, we write [x_{P(x)}; x_{Z(x)}] as [x_p; x_z], g_{P(x)}(x_{P(x)}) + g_{Z(x)}(x_{Z(x)}) as g_p(x_p) + g_z(x_z), and [A_{P(x)}, A_{Z(x)}] as [A_p, A_z] in our discussions. Also, we use n(x) = ‖x‖_0 to denote the number of nonzero entries of x.

Given any x ∈ F, since ∇g_p(x_p) = (p(x_{p1})^{p−1}, p(x_{p2})^{p−1}, ..., p(x_{p n(x)})^{p−1})^T is well defined at x_p > 0, we consider the following linear programming problem to find a descent direction d_p ∈ R^{n(x)} at x:

    min_{d_p}  ∇g_p(x_p)^T d_p
    s.t.       A_p d_p = 0,  d_p ≥ −x_p.                  (Inner Dir)
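As a sketch of how (Inner Dir) might be set up numerically (our own illustration; the paper's experiments solve such subproblems with CPLEX), MATLAB's linprog can be used at the current feasible point x:

    P  = find(x > 0);                        % positive index set P(x)
    xp = x(P);  Ap = A(:, P);
    c  = p * xp.^(p - 1);                    % gradient of g_p at x_p
    [dp, nu] = linprog(c, [], [], Ap, zeros(size(Ap, 1), 1), -xp, []);
    % An optimal value nu = 0 signals a scaled KKT point, as shown in Theorem 2
    % below; nu < 0 yields a descent direction d = (dp, 0) for problem (ell_p).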

It is obvious that the optimal value of problem (Inner Dir) is bounded above by 0. Furthermore, at any x ∈ F, if the optimal value of the corresponding problem (Inner Dir) is negative, we can find a descent direction d = (d_p, d_z)^T ∈ R^n with d_z = 0 for problem (ℓp) at x. Moreover, if the optimal value of the corresponding problem (Inner Dir) is 0, we can prove that x satisfies the scaled KKT condition as proposed.

Theorem 2. For a given x* ∈ F, let ν* be the optimal value of the corresponding problem (Inner Dir). Then ν* ≤ 0, and x* is a scaled KKT point of problem (ℓp) if ν* = 0.

Proof. Given any x* ∈ F, first notice that ∇g_p(x*_p) is well-defined. It is clear that 0 ∈ R^{n(x*)} is feasible to the corresponding problem (Inner Dir), and hence ν* ≤ 0. If ν* = 0, then 0 ∈ R^{n(x*)} must be an optimal solution to the linear program (Inner Dir) satisfying the KKT condition:

    ∇g_p(x*_p) + A_p^T λ − Σ_{i=1}^{n(x*)} e_i µ_i = 0,
    µ_i x*_{p_i} = 0,  i = 1, 2, ..., n(x*).               (3)

Let A = [A_p, A_z] and x* = (x*_p, 0, 0, ..., 0); then the first equation in (3) is equivalent to

    [p(x*_{p1})^p; ...; p(x*_{p n(x*)})^p] + X*_p A_p^T λ − X*_p µ = 0.


Moreover, since x*_i = 0 for any i ∈ Z(x*), we have that

    [p(x*_1)^p; ...; p(x*_n)^p] + X* [A_p, A_z]^T λ − X* µ = 0,
    Ax* = b, x* ≥ 0, x*^T µ = 0,                           (4)

hold at x*.

Notice that the scaled KKT condition is only a necessary condition for the global optimality of problem (ℓp). To find a scaled KKT point with a better objective value, it is necessary to take the variables with value 0 into consideration in the search for a descent direction. To deal with the non-differentiability of the variables indexed by Z(x) for any x ∈ F, we first consider the following fact:

Theorem 3. For any w, x ∈ F, if ‖w − x‖_2 ≤ 1 and g(w) < g(x), then g_{P(x)}(w_{P(x)}) + Σ_{i∈Z(x)} w_i < g(x).

Proof. Since g(w) < g(x), we have g_{P(x)}(w_{P(x)}) + g_{Z(x)}(w_{Z(x)}) = g(w) < g(x). Since ‖w − x‖_2 ≤ 1 and x_i = 0 for all i ∈ Z(x), we have w_i ≤ 1 for all i ∈ Z(x). Thus, Σ_{i∈Z(x)} w_i ≤ Σ_{i∈Z(x)} w_i^p = g_{Z(x)}(w_{Z(x)}) for p ∈ (0, 1). Consequently, we have g_{P(x)}(w_{P(x)}) + Σ_{i∈Z(x)} w_i < g(x).

For any x ∈ F, to find a better descent direction d = [d_{P(x)}; d_{Z(x)}] ∈ R^n at x, Theorem 3 inspires us to adopt the vector Dg(x) = [∇g_p(x_p); e_{n−n(x)}], where ∇g_p(x_p) ∈ R^{n(x)} is the gradient of g_p(x_p) at x_p and e_{n−n(x)} ∈ R^{n−n(x)} can be seen as a subgradient of g_z(x_z) in the unit ball around 0, to approximate the gradient of g(x). Thus, to find a good descent direction d = (d_p, d_z)^T ∈ R^n at x, we may solve the following simple convex quadratic programming problem:

    min_d  Dg(x)^T d
    s.t.   Ad = 0,  d^T d ≤ 1,  d_{Z(x)} ≥ 0,  d_{P(x)} ≥ −x_p,     (Outer-Dir)

where Dg(x) = (p x_1^{p−1}, p x_2^{p−1}, ..., p x_{n(x)}^{p−1}, 1, 1, ..., 1)^T.

Here we propose a gradient-descent-based algorithm. The proposed algorithm has two loops. The inner loop proceeds within a given subset of positive variables to find a scaled KKT point by solving (Inner Dir). The outer loop proceeds with all variables in order to find a scaled KKT point with a reduced objective value, if possible, by solving (Outer-Dir). The detailed algorithm is given below.
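As with (Inner Dir), one way to solve (Outer-Dir) in practice — a sketch under our own assumptions, not the authors' implementation, which relies on CPLEX — is to pass the ball constraint d'*d ≤ 1 to fmincon as a nonlinear constraint, keeping the indices in their original order rather than in the permuted [x_p; x_z] blocks:

    [m, n] = size(A);
    P  = find(x > 0);
    Dg = ones(n, 1);  Dg(P) = p * x(P).^(p - 1);   % the vector Dg(x) defined above
    lb = zeros(n, 1); lb(P) = -x(P);               % d_Z >= 0 and d_P >= -x_p
    ball = @(d) deal(d' * d - 1, []);              % inequality d'd - 1 <= 0, no equalities
    dk = fmincon(@(d) Dg' * d, zeros(n, 1), [], [], A, zeros(m, 1), lb, [], ball);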


Algorithm 1 Gradient-Descent-Based Algorithm
1: Initialization: Input A, b and δ > 0; find an initial solution x^0 ∈ F; set k = 0.
2: while Outer Loop Not Stop do
3:   while Inner Loop Not Stop do
4:     Set P(x^k) = {i | x^k_i > 0} and Z(x^k) = {1, 2, ..., n} \ P(x^k).
5:     Obtain x^k_p and A_p according to P(x^k).
6:     Solve problem (Inner Dir) at x^k_p with an optimal solution d^k_p of optimal value ν^k.
7:     if ν^k = 0 then
8:       Stop the inner loop at x^k (a scaled KKT point is found).
9:     else
10:      Set d^k = (d^k_p, d^k_z = 0)^T and x^{k+1} = x^k + d^k.
11:    end if
12:    if ‖x^{k+1}‖_0 = 0 then
13:      Terminate the algorithm and output x^{k+1} = 0 as the global minimizer.
14:    end if
15:    Reset k ← k + 1.
16:  end while
17:  Solve problem (Outer-Dir) at x = x^k with an optimal solution d^k.
18:  x_tem = x^k + d^k.
19:  if g(x^k) − g(x_tem) > δ then
20:    Reset x^k ← x_tem and go to the inner loop with x^k.
21:  else
22:    Terminate the algorithm and output x^k as a scaled KKT point.
23:  end if
24: end while

Remark 3. The input value of δ only affects the maximum number of iterations of Algorithm 1, while the output of Algorithm 1 is always a scaled KKT point no matter what value δ takes.

Remark 4. Algorithm 1 is designed for ℓp minimization with p ∈ (0, 1). However, when p = 1, ∇g_p(x_p) = e ∈ R^{n(x)} in the inner loop iteration, and the corresponding inner loop procedure reduces to solving a linear programming problem.

Now we provide a convergence proof of Algorithm 1. First we show that each iteration of the inner loop of Algorithm 1 reduces the cardinality of the incumbent point by at least one, i.e., ‖x^k‖_0 − ‖x^{k+1}‖_0 ≥ 1.

Lemma 1. Given an x and the corresponding x_p, if the optimal value of problem (Inner Dir) is strictly less than 0 with d_p being an optimal solution, then there is at least one i ∈ P(x) such that d_{p_i} = −x_{p_i}.

Proof. Assume that d_p is an optimal solution of problem (Inner Dir) at x_p with ∇g_p(x_p)^T d_p < 0. If d_{p_i} > −x_{p_i} for every i ∈ P(x), then we can always find

an α > 1 such that A_p(αd_p) = 0, αd_p ≥ −x_p and α∇g_p(x_p)^T d_p < ∇g_p(x_p)^T d_p. This contradicts the optimality of d_p.

Lemma 2. In each iteration of the inner loop, Algorithm 1 either finds x^k as a scaled KKT point of problem (ℓp) or finds an x^{k+1} such that ‖x^k‖_0 − ‖x^{k+1}‖_0 ≥ 1.

Proof. At x^k, if the inner loop of Algorithm 1 stops, x^k is a scaled KKT point of problem (ℓp) according to Theorem 2. If it does not stop, we find a d_p such that ∇g_p(x^k_p)^T d_p < 0. From Lemma 1, we know that ‖x^k‖_0 − ‖x^{k+1}‖_0 = ‖x^k‖_0 − ‖x^k + d‖_0 ≥ 1.

We can further prove that the inner loop of Algorithm 1 takes at most n iterations to stop.

Theorem 4. The inner loop of Algorithm 1 either stops at a scaled KKT point or outputs 0 as a global optimal solution of problem (ℓp) in at most n iterations.

Proof. From Lemma 2 we know that, at each iteration of its inner loop, Algorithm 1 either stops or generates an x^{k+1} such that ‖x^k‖_0 − ‖x^{k+1}‖_0 ≥ 1. Since ‖x^k‖_0 ≤ n for any k, the inner loop of Algorithm 1 stops in at most n iterations. Let x^k denote the stopping point of the inner loop. If ‖x^k‖_0 ≥ 1, it is a scaled KKT point of problem (ℓp) by Theorem 2. Otherwise, x^k = 0 is a global minimizer of problem (ℓp).

Next we show that, for any δ > 0, Algorithm 1 terminates at either a scaled KKT point or a global minimizer of problem (ℓp) in O(δ^{−1} n) iterations.

Lemma 3. If the inner loop of Algorithm 1 does not stop at x^k, we have g(x^{k+1}) < g(x^k).

Proof. At each iteration of its inner loop, when Algorithm 1 continues, we find a direction d_p with ∇g_p(x_p)^T d_p < 0. Furthermore, since g_p(x_p) is concave, we have g(x^{k+1}) = g_p(x^k_p + d_p) ≤ g_p(x^k_p) + ∇g_p(x^k_p)^T d_p < g_p(x^k_p) = g(x^k).

Theorem 5. Given any δ > 0, Algorithm 1 terminates at either a scaled KKT point or a global minimizer of problem (ℓp) in O(δ^{−1} n) iterations.

Proof. In each iteration of its inner loop, if Algorithm 1 does not stop, Lemma 3 shows that it reduces the objective value of problem (ℓp) by a positive amount, i.e., g(x^{k+1}) < g(x^k). Moreover, Theorem 4 indicates that this reduction procedure in the inner loop takes at most n iterations and finds a scaled KKT point or a global minimizer of problem (ℓp). Then Algorithm 1 jumps to the outer loop.

The outer loop of Algorithm 1 either updates the incumbent solution to reduce the objective value by at least a given value δ > 0 and jumps back to the inner loop, or outputs the incumbent solution as a scaled KKT point or a global minimizer of problem (ℓp). Since {g(x^k)} is decreasing and bounded below by 0, Algorithm 1 terminates at a scaled KKT point or a global minimizer of problem (ℓp) in at most O(δ^{−1} n) iterations.

To better evaluate the proposed algorithm, we compare its worst-case iteration complexity with that of the interior-point potential reduction (IPPR) algorithm of Ge et al. (2011), the smoothing sequential quadratic programming (SSQP) algorithm of Liu et al. (2016) and the interior-point trust region (IPTR) algorithm of Haeser et al. (2018) in Table 1.

Table 1: Comparison of Worst-Case Complexity

Reference             Algorithm                                            Condition    Iteration Complexity   Parameter
Ge et al. (2011)      Interior-Point Potential Reduction (IPPR)            ε-KKT        O((n/ε) log(1/ε))      ε
Liu et al. (2016)     Smoothing Sequential Quadratic Programming (SSQP)    ε-KKT        O(p^{-4} ε^{-4})       ε
Haeser et al. (2018)  Interior-Point Trust Region (IPTR)                   ε²-KKT       O(ε^{-2})              ε
This study            Algorithm 1                                          scaled KKT   O(n/δ)                 δ

Remark 5. Here we compare the worst-case iteration complexity of each algorithm. As for the complexity within each single iteration, the proposed algorithm solves one linear programming problem of size n; both the IPPR and IPTR algorithms minimize one n-dimensional linear objective function subject to several linear and ball constraints; and the SSQP algorithm projects an n-dimensional vector onto a linear domain, which can be done by solving one convex quadratic programming problem of size n.

Remark 6. Each of the IPPR, SSQP and IPTR algorithms searches for a "weaker" solution of a relaxed KKT condition, and the quality of the solution depends on the given value of ε > 0. On the other hand, the proposed Algorithm 1 finds a "stronger" solution of a non-relaxed KKT condition, and the quality of the solution is independent of the given value of δ > 0.

4. Computational Experiments

In this section, we conduct computational experiments to validate the effectiveness and efficiency of Algorithm 1. Notice that (i) Algorithm 1 finds a non-relaxed (scaled) KKT point while IPPR, SSQP and IPTR find relaxed ε-KKT points; (ii) the complexity of IPPR has an advantage over SSQP and IPTR; (iii) both IPPR and IPTR are interior-point based algorithms;

(iv) IPPR is the only algorithm in the literature that directly solves problem (ℓp) by finding ε-KKT points. However, as will be shown, the IPPR algorithm is not computationally efficient enough for solving large size problems. Therefore, our computational experiments are mainly conducted in comparison with the IPPR algorithm of Ge et al. (2011) on problems of lower dimensions. For problems of higher dimensions, we compare Algorithm 1 with another state-of-the-art interior-point based algorithm, MFIPMCS (matrix-free interior point method for compressed sensing) of Fountoulakis et al. (2014). All experiments are conducted on a desktop with a 2.50 GHz CPU and 8.0 GB RAM, and all related convex quadratic programming problems are solved on the MATLAB platform with the CPLEX solver.

To validate the effectiveness and efficiency of Algorithm 1 in sparse recovery, we conduct numerical experiments with two groups of data. In Section 4.1, we conduct experiments using synthetic data generated by the commonly adopted scheme shown in Ge et al. (2017). In Section 4.2, we work on some practical instances of the Sparco dataset (van den Berg et al., 2007), which consists of a collection of instances for benchmarking algorithms for sparse signal reconstruction.

4.1. Sparse Recovery with Synthetic Data

In this section, we validate the effectiveness and efficiency of Algorithm 1 in sparse recovery and compare it with the IPPR and MFIPMCS algorithms using randomly generated sparse recovery instances. The random instances are generated following a commonly used procedure as described in Ge et al. (2017). Given a sensing matrix A ∈ R^{m×n} and an observation b ∈ R^m, the goal is to recover a known sparse solution x0 ∈ R^n with sparsity T (number of nonzero elements) to the system Ax = b. The parameters A, b and x0 are generated as in Ge et al. (2017) and Chen et al. (2010) by the following code:

x0 = zeros(n, 1); P = randperm(n);        % random support of size T
x0(P(1:T)) = abs(2 * randn(T, 1));        % positive entries on the support
A = randn(m, n); A = orth(A')';           % sensing matrix with orthonormal rows
b = A * x0;                               % observation
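The following sketch (our own, not the paper's script) shows how this generator could feed one run of the experiments described in Section 4.1.1 below; solve_lp_min is a hypothetical placeholder for whichever solver is being evaluated (Algorithm 1, IPPR or MFIPMCS), and the success threshold 1.0E-3 is the one used there.

    m = 30; n = 120; p = 0.5; T = 5; num_trials = 500; hits = 0;
    for trial = 1:num_trials
        x0 = zeros(n, 1); P = randperm(n); x0(P(1:T)) = abs(2 * randn(T, 1));
        A = randn(m, n); A = orth(A')'; b = A * x0;
        xstar = solve_lp_min(A, b, p);                 % hypothetical solver call
        hits = hits + (norm(xstar - x0, 2) <= 1e-3);   % success criterion
    end
    recovery_rate = hits / num_trials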

4.1.1. Sparse Recovery of Lower Dimensional Problems

In this subsection, we first apply Algorithm 1 and the IPPR algorithm to a randomly generated sparse recovery problem to highlight their rates of successful sparse recovery. In the experiment, we take p = 0.5 and set m = 30, n = 120 and T ∈ {1, 2, ..., 20}. For each T value, Algorithm 1 and the IPPR algorithm are applied to 500 randomly generated instances to compare their successful sparse recovery rates. For a randomly generated sparse solution x0, let x* denote the output point of Algorithm 1 or IPPR; we take ‖x* − x0‖_2 ≤ 1.0E-3 (-4, -5, -6)


as the precision requirement for a successful recovery. The results are reported in Tables 2 and 3 below.


Table 2: Sparse recovery rates on instances with different sparsity T

Precision   Alg     T=1     T=2     T=3     T=4     T=5     T=6     T=7     T=8     T=9     T=10
≤1.0E-3     Alg 1   100.0%  100.0%  100.0%  100.0%  100.0%  100.0%  100.0%  99.0%   94.8%   90.4%
            IPPR    99.0%   99.6%   100.0%  100.0%  100.0%  100.0%  100.0%  99.0%   94.8%   87.4%
≤1.0E-4     Alg 1   100.0%  100.0%  100.0%  100.0%  100.0%  100.0%  99.8%   99.0%   94.2%   90.2%
            IPPR    97.8%   99.2%   100.0%  100.0%  100.0%  100.0%  100.0%  99.0%   94.8%   87.4%
≤1.0E-5     Alg 1   100.0%  100.0%  100.0%  99.6%   100.0%  100.0%  99.8%   99.0%   94.0%   90.0%
            IPPR    97.0%   98.4%   0.0%    0.0%    0.2%    2.8%    2.6%    8.2%    12.0%   13.0%
≤1.0E-6     Alg 1   100.0%  100.0%  100.0%  99.6%   100.0%  100.0%  99.8%   98.2%   94.0%   90.0%
            IPPR    93.8%   96.6%   0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%


Table 3: Sparse recovery rates on instances with different sparsity T

Precision   Alg     T=11    T=12    T=13    T=14    T=15    T=16    T=17    T=18    T=19    T=20
≤1.0E-3     Alg 1   83.0%   65.4%   53.4%   37.2%   23.2%   13.6%   6.6%    3.0%    0.4%    0.2%
            IPPR    82.6%   66.2%   53.2%   35.0%   23.0%   13.2%   6.6%    2.6%    0.4%    0.0%
≤1.0E-4     Alg 1   83.0%   65.4%   52.8%   37.2%   23.2%   13.6%   6.4%    2.8%    0.4%    0.2%
            IPPR    82.6%   66.2%   53.2%   35.0%   22.8%   13.2%   6.6%    2.6%    0.4%    0.0%
≤1.0E-5     Alg 1   83.0%   65.4%   52.6%   37.2%   22.6%   13.6%   6.4%    2.8%    0.4%    0.2%
            IPPR    18.4%   16.4%   14.0%   12.0%   7.6%    5.4%    3.6%    1.0%    0.4%    0.0%
≤1.0E-6     Alg 1   83.0%   65.4%   52.6%   37.2%   22.6%   13.6%   6.4%    2.8%    0.4%    0.2%
            IPPR    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%

Tables 2 and 3 show that: (i) The successful recovery rates of both algorithms decrease as T increases, which is consistent with the results shown in Theorem 2.4 of Donoho (2006b). (ii) Algorithm 1 successfully recovers more instances than the IPPR algorithm in general. This computational advantage becomes more apparent as the precision requirement gets stronger and T becomes larger. (iii) The successful recovery rate of IPPR depends heavily on the precision requirement, while Algorithm 1 is rather stable even as the requirement gets stronger. Notice that the IPPR algorithm proceeds with interior-point solutions of problem (ℓp), which may lead to a terminating point with multiple entries having small positive values instead of 0. Thus, the recovery rate of IPPR is low when the precision requirement becomes 1.0E-5 or 1.0E-6, while Algorithm 1 avoids this drawback.

Next, we show a more detailed comparison of Algorithm 1 and the IPPR algorithm using randomly generated problems of lower dimensions. Here, we generate 4 groups of instances with different size (m, n) and sparsity (T). For each group, 100 randomly generated instances are tested. We report the successful recovery rate (Rec rate), where an instance is considered a successful recovery if ‖x* − x0‖_2 / ‖x0‖_2 ≤ 1.0E-3, the average deviation ‖x* − x0‖_2, the average infeasibility ‖Ax* − b‖_2, the average running time in seconds (ave-t) and the standard deviation of running time (std-t) over the successfully recovered instances in Table 4. Testing problems are denoted by the size and sparsity of the instances; e.g., m50n256T10 means the group of instances with A ∈ R^{50×256} and ‖x0‖_0 = 10 (10 nonzero entries).

Table 4: Algorithm 1 vs. IPPR

              |                  Algorithm 1                     |                     IPPR
Instance      | Rec rate  ‖x*−x0‖2  ‖Ax*−b‖2  ave-t   std-t      | Rec rate  ‖x*−x0‖2  ‖Ax*−b‖2  ave-t    std-t
m50n256T10    | 100%      1.60E-15  7.20E-16  0.116   0.001      | 100%      1.05E-04  3.34E-04  0.544    0.041
m50n256T15    | 100%      6.66E-15  2.74E-15  0.118   0.001      | 97%       2.81E-05  4.18E-04  0.505    0.032
m50n256T20    | 64%       1.28E-06  7.24E-06  0.116   0.030      | 48%       1.28E-03  5.69E-04  0.581    0.053
m50n256T25    | 9%        2.24E-14  6.44E-14  0.145   0.042      | 7%        9.24E-04  4.19E-04  0.562    0.049
m100n512T20   | 100%      1.39E-14  3.44E-14  0.111   0.007      | 100%      5.23E-04  9.72E-05  5.729    0.155
m100n512T25   | 100%      2.24E-14  5.52E-14  0.135   0.011      | 100%      5.76E-04  1.41E-04  5.760    0.186
m100n512T30   | 100%      2.06E-06  8.97E-06  0.164   0.046      | 100%      5.90E-04  1.51E-04  5.789    0.149
m100n512T35   | 94%       1.34E-06  8.46E-06  0.224   0.083      | 89%       6.82E-04  1.44E-04  5.808    0.128
m150n1024T25  | 100%      2.74E-14  7.50E-14  0.295   0.020      | 100%      7.15E-04  2.31E-04  42.849   5.544
m150n1024T30  | 100%      4.52E-14  1.25E-13  0.307   0.015      | 100%      8.15E-04  2.79E-04  43.919   1.467
m150n1024T35  | 100%      6.74E-14  1.89E-13  0.312   0.012      | 100%      9.58E-04  3.43E-04  44.109   2.121
m150n1024T40  | 100%      4.14E-06  3.21E-05  0.469   0.198      | 99%       9.78E-04  3.44E-04  43.170   0.919
m200n2048T30  | 100%      1.63E-13  4.89E-14  0.844   0.055      | 100%      8.22E-04  2.24E-05  315.645  15.447
m200n2048T35  | 100%      2.65E-13  7.72E-14  0.856   0.032      | 100%      8.40E-04  2.64E-05  308.346  3.068
m200n2048T40  | 100%      1.09E-13  3.76E-13  0.803   0.061      | 100%      9.19E-04  6.53E-05  315.553  14.666
m200n2048T45  | 100%      1.90E-13  6.83E-13  0.935   0.258      | 100%      9.21E-04  1.51E-07  300.380  3.705

Table 4 shows that: (i) The successful recovery rate of Algorithm 1 is better than that of IPPR for every group of instances. (ii) Algorithm 1 achieves much smaller numerical deviation ‖x* − x0‖_2 and infeasibility ‖Ax* − b‖_2 in much shorter running time. (iii) The computational dominance of Algorithm 1 becomes much more apparent as the size of the problems increases.

For example, for instances with m = 200 and n = 2048, Algorithm 1 runs 300 times faster than IPPR. (iv) IPPR significantly slows down as the problem size increases, which may limit its capability of solving large size problems for further comparisons. Our numerical results clearly support that, compared with the IPPR algorithm, the proposed Algorithm 1 achieves better sparsity recovery in a much more efficient manner. Moreover, Algorithm 1 does not require starting with an interior point of problem (ℓp), which makes it more flexible for pre-processing.

4.1.2. Sparse Recovery of Higher Dimensional Problems

In the previous subsection, we have shown that Algorithm 1 dominates the IPPR algorithm in both sparse recovery rate and computational efficiency for lower dimensional problems. For testing problems of higher dimensions, since the IPPR algorithm reaches its computational limit, we replace it with another state-of-the-art interior-point based algorithm, MFIPMCS (matrix-free interior point method for compressed sensing) of Fountoulakis et al. (2014), to conduct the experiments in this section. Here we take p = 0.5, m = 300, n ∈ {10,000, 20,000, 60,000, 80,000, 100,000} and T ∈ {10, 15, 20, 25} to generate instances. For each T value, as before, 100 instances are randomly generated. Let x* denote the output of either algorithm; an instance is considered a successful recovery if ‖x*‖_0 ≤ ‖x0‖_0 = T. For each group of instances, we report the successful recovery rate (Rec rate), the average deviation ‖x* − x0‖_2, the average infeasibility ‖Ax* − b‖_2 and the average running time in seconds of the successfully recovered instances in Table 5.

Table 5 shows that (i) Algorithm 1 clearly has a much higher successful sparse recovery rate compared to the MFIPMCS algorithm. (ii) The quality of solutions obtained by Algorithm 1 is in general much higher in terms of feasibility and numerical accuracy. (iii) The computation time of MFIPMCS dominates that of Algorithm 1, because the MFIPMCS algorithm solves a convex linearly constrained ℓ1 minimization problem, while Algorithm 1 solves a nonconvex programming problem; thus, Algorithm 1 takes longer running time than the MFIPMCS algorithm on average. (iv) Although Algorithm 1 takes longer running time than MFIPMCS does, it is still very efficient in general, e.g., Algorithm 1 solves (ℓp) with A ∈ R^{300×100,000} in less than 4 CPU minutes on average. Here we plot the average running time of Algorithm 1 for solving problem (ℓp) of different size (m × n) in Figure 4.1.2 to show its potential for solving large size problems in real applications.


Table 5: Algorithm 1 vs. MFIPMCS

                 |               Algorithm 1                  |                 MFIPMCS
Instance         | Rec rate  ‖x*−x0‖2  ‖Ax*−b‖2  time(s)      | Rec rate  ‖x*−x0‖2  ‖Ax*−b‖2  time(s)
m300n10000T10    | 100%      1.46E-14  2.50E-15  8.80         | 100%      1.09E-01  1.84E-02  0.68
m300n10000T15    | 100%      3.87E-14  6.62E-15  10.84        | 98%       1.39E-01  2.28E-02  0.77
m300n10000T20    | 100%      1.18E-13  2.01E-14  12.70        | 73%       1.59E-01  2.62E-02  0.90
m300n10000T25    | 100%      3.45E-13  5.80E-14  13.89        | 12%       1.97E-01  3.04E-02  1.01
m300n20000T10    | 100%      1.63E-14  1.98E-15  28.01        | 96%       2.19E-01  2.60E-02  1.05
m300n20000T15    | 100%      3.91E-14  4.64E-15  29.51        | 90%       2.79E-01  3.23E-02  1.25
m300n20000T20    | 100%      7.15E-14  8.50E-15  29.90        | 34%       3.11E-01  3.63E-02  1.32
m300n20000T25    | 100%      1.46E-13  1.74E-14  30.15        | 0%        NA        NA        NA
m300n60000T10    | 100%      1.47E-14  1.02E-15  87.48        | 100%      6.70E-01  4.53E-02  2.55
m300n60000T15    | 100%      4.75E-14  3.23E-15  75.33        | 86%       8.00E-01  5.42E-02  3.12
m300n60000T20    | 100%      7.39E-14  5.12E-15  72.02        | 27%       9.21E-01  6.14E-02  4.35
m300n60000T25    | 100%      2.13E-13  1.43E-14  67.37        | 0%        NA        NA        NA
m300n80000T10    | 100%      1.84E-14  1.11E-15  141.90       | 100%      8.65E-01  5.06E-02  4.63
m300n80000T15    | 100%      4.73E-14  2.83E-15  141.59       | 86%       1.09E+00  6.26E-02  4.78
m300n80000T20    | 100%      1.41E-13  8.45E-15  148.22       | 33%       1.15E+00  6.84E-02  4.69
m300n80000T25    | 100%      1.42E-13  8.42E-15  136.37       | 6%        1.19E+00  7.22E-02  4.84
m300n100000T10   | 100%      1.89E-14  1.03E-15  224.65       | 98%       1.06E+00  5.61E-02  4.94
m300n100000T15   | 100%      3.84E-14  2.06E-15  220.19       | 77%       1.30E+00  6.78E-02  5.24
m300n100000T20   | 100%      1.04E-13  5.57E-15  223.48       | 25%       1.40E+00  7.54E-02  6.14
m300n100000T25   | 100%      1.66E-13  8.91E-15  220.19       | 3%        1.60E+00  8.11E-02  7.28

NA denotes “no successful recovery in all of 100 instances”

4.2. Sparse Signal Recovery with the Sparco Database

To further compare Algorithm 1 with the IPPR and MFIPMCS algorithms, we conduct experiments on some testing problems from the Sparco collection (van den Berg et al., 2007). Sparco has a suite of 26 problems for sparse signal reconstruction. Like van den Berg & Friedlander (2009), we choose 9 real-valued ones with m ≤ 3200 and n ≤ 4096, which is the limit for the IPPR algorithm to solve within 2 CPU hours on our desktop machine. Each problem has a linear operator A and a right-hand-side vector b. Table 6 provides the information about each selected problem, including its name, Sparco ID, and the number of rows (m) and columns (n) of A.

Table 6: Selected testing problems from Sparco

Name       ID    m      n
blocksig   2     1024   1024
cosspike   3     1024   2048
gcosspike  5     300    2048
p3poly     6     600    2048
sgnspike   7     600    2560
soccer1    601   3200   4096
yinyang    603   1024   4096
jitter     902   200    1000
spiketrn   903   1024   1024

We apply Algorithm 1, IPPR and MFIPMCS to the sparse signal recovery problems listed in Table 6. Each attempt to solve a problem is limited to two hours of CPU time. Let x* denote the output of each algorithm; we report the number of nonzero entries of x*, the infeasibility ‖Ax* − b‖_2 and the total running time in seconds (denoted as time). For a clear comparison with the results in van den Berg & Friedlander (2009), we measure the number of nonzero entries in two ways: (i) For any x*, ‖x*‖_0 denotes the number of entries whose absolute value is no less than 1.0E-03. (ii) We adopt the same criterion as van den Berg & Friedlander (2009), where the number of nonzero entries of x* is the minimum number of entries that carry 99.9% of its one-norm, i.e.,

    nz(x*) = argmin{ k ∈ N | Σ_{i=1}^k |x*_{[i]}| ≥ 0.999 ‖x*‖_1 },          (5)

where |x*_{[1]}| ≥ |x*_{[2]}| ≥ ... ≥ |x*_{[n]}| are the n elements of x* sorted by absolute value.
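A direct computation of (5) (our own illustration; xstar denotes the solver output) is a sorted cumulative sum:

    s  = sort(abs(xstar), 'descend');
    nz = find(cumsum(s) >= 0.999 * norm(xstar, 1), 1, 'first');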

Table 7: Sparse signal recovery results

     |            Algorithm 1              |                IPPR                 |              MFIPMCS
ID   | nz(x*)  ‖x*‖0  ‖Ax*−b‖2  time       | nz(x*)  ‖x*‖0  ‖Ax*−b‖2  time       | nz(x*)  ‖x*‖0  ‖Ax*−b‖2  time
2    | 71      71     3.75E-11  0.10       | 71      71     8.16E-10  443.51     | 71      71     8.43E-03  0.29
3    | 115     121    9.21E-14  1.89       | 115     121    1.37E-12  1132.47    | 115     121    1.10E-02  0.42
5    | 59      63     2.11E-07  5.68       | 59      63     2.07E-07  857.53     | 59      80     9.39E-03  1.02
6    | 166     196    3.52E-05  54.09      | 467     596    2.02E-11  1788.12    | 618     776    8.81E-04  23.29
7    | 20      20     1.34E-13  8.35       | 20      20     2.96E-13  1499.82    | 20      20     9.37E-03  0.22
601  | 2853    2049   5.90E-03  1905.11    | t       t      t         t          | 3062    3247   4.02E-02  164.29
603  | 767     904    9.30E-07  62.91      | 768     904    3.06E-06  7198.42    | 850     942    4.70E-02  22.77
902  | 3       3      1.09E-11  0.48       | 3       3      1.05E-11  162.59     | 3       3      3.84E-03  0.42
903  | 12      12     5.48E-14  0.78       | 12      12     5.71E-05  1074.62    | 12      12     5.83E-04  1.08

t denotes that execution exceeded the 2 hour CPU time limit.

Table 7 shows that: (i) As for the quality of sparse signal recovery, Algorithm 1 dominates the IPPR and MFIPMCS algorithms in terms of nz(x) and ‖x‖_0. To be more specific, Algorithm 1 finds a clearly sparser solution for the instances with ID 6, 601 and 603. (ii) Algorithm 1 again dominates the IPPR algorithm in terms of running time; it runs in general two orders of magnitude (about 100 times) faster

than the IPPR algorithm does. (iii) The MFIPMCS algorithm takes shorter running time than Algorithm 1, while Algorithm 1 is able to find sparser representations of the signals in all cases.

5. Conclusion

In this paper, we have studied the linearly constrained ℓp minimization problem with p ∈ (0, 1). Such a problem has wide applications in image processing, signal recovery and machine learning, but it is NP-hard due to its nonconvexity, nonsmoothness and lack of Lipschitz continuity. Rather than relying on the commonly adopted relaxed KKT condition approach, we introduced a scaled KKT condition that involves no relaxation to identify better candidate optimal solutions. To avoid the errors caused by using a smoothed ℓp function, we proposed a gradient-descent-based algorithm that works only on the positive entries of variables to find solutions satisfying the scaled KKT condition without invoking the nondifferentiability issue. We have provided a convergence proof and complexity analysis of the proposed algorithm and compared its worst-case complexity with that of state-of-the-art algorithms searching for ε-KKT points in the literature. The theoretical analysis showed the improved computational complexity of the proposed algorithm. Extensive computational experiments with synthetic and real-life datasets have been conducted to show that the proposed algorithm is capable of achieving much better sparse recovery in reasonable computational time compared to state-of-the-art interior-point based algorithms.

Acknowledgments

This work has been sponsored by the US Army Research Office Grant #W911NF15-1-0223 and the Chinese National Natural Science Foundation Grants #11771243 and #11571029.

References

van den Berg, E., & Friedlander, M. (2009). Probing the Pareto frontier for basis pursuit solutions. SIAM Journal on Scientific Computing, 31, 890–912. https://doi.org/10.1137/080714488.

van den Berg, E., Friedlander, M., Hennenfent, G., Herrmann, F., Saab, R., & Yılmaz, Ö. (2007). Sparco: A testing framework for sparse reconstruction. Technical Report TR-2007-20, Dept. of Computer Science, University of British Columbia, Vancouver.

Bian, W., Chen, X., & Ye, Y. (2015). Complexity analysis of interior point algorithms for non-Lipschitz and nonconvex minimization. Mathematical Programming, 149, 301–327. https://doi.org/10.1007/s10107-014-0753-5.

Candès, E. J., Wakin, M. B., & Boyd, S. P. (2008). Enhancing sparsity by reweighted ℓ1 minimization. Journal of Fourier Analysis and Applications, 14, 877–905. https://doi.org/10.1007/s00041-008-9045-x.

Chen, W. J., & Tian, Y. J. (2010). Lp-norm proximal support vector machine and its applications. Procedia Computer Science, 1, 2417–2423. https://doi.org/10.1016/j.procs.2010.04.272.

Chen, X., Ge, D., Wang, Z., & Ye, Y. (2014). Complexity of unconstrained L2-Lp minimization. Mathematical Programming, 143, 371–383. https://doi.org/10.1007/s10107-012-0613-0.

Chen, X., Xu, F., & Ye, Y. (2010). Lower bound theory of nonzero entries in solutions of ℓ2-ℓp minimization. SIAM Journal on Scientific Computing, 32, 2832–2852. https://doi.org/10.1137/090761471.

Chen, X., & Zhou, W. (2014). Convergence of the reweighted ℓ1 minimization algorithm for ℓ2-ℓp minimization. Computational Optimization and Applications, 59, 47–61. https://doi.org/10.1007/s10589-013-9553-8.

Donoho, D. L. (2006a). Compressed sensing. IEEE Transactions on Information Theory, 52, 1289–1306. https://doi.org/10.1109/TIT.2006.871582.

Donoho, D. L. (2006b). For most large underdetermined systems of linear equations the minimal ℓ1-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics, 59, 797–829. https://doi.org/10.1002/cpa.20132.

Fountoulakis, K., Gondzio, J., & Zhlobich, P. (2014). Matrix-free interior point method for compressed sensing problems. Mathematical Programming Computation, 6, 1–31. https://doi.org/10.1007/s12532-013-0063-6.

Ge, D., He, R., & He, S. (2017). An improved algorithm for the L2-Lp minimization problem. Mathematical Programming, 166, 131–158. https://doi.org/10.1007/s10107-016-1107-2.

Ge, D., Jiang, X., & Ye, Y. (2011). A note on the complexity of Lp minimization. Mathematical Programming, 129, 285–299.


Haeser, G., Liu, H., & Ye, Y. (2018). Optimality condition and complexity analysis for linearly-constrained optimization without differentiability on the boundary. Mathematical Programming, online first. https://doi.org/10.1007/s10107-018-1290-4.

Kabashima, Y., Wadayama, T., & Tanaka, T. (2009). A typical reconstruction limit for compressed sensing based on Lp-norm minimization. Journal of Statistical Mechanics: Theory and Experiment, 2009, L09003.

Lai, M.-J., & Wang, J. (2011). An unconstrained ℓq minimization with 0 < q < 1 for sparse solution of underdetermined linear systems. SIAM Journal on Optimization, 21, 82–101. https://doi.org/10.1137/090775397.

Lai, M.-J., Xu, Y., & Yin, W. (2013). Improved iteratively reweighted least squares for unconstrained smoothed ℓq minimization. SIAM Journal on Numerical Analysis, 51, 927–957. https://doi.org/10.1137/110840364.

Liu, Y.-F., Ma, S., Dai, Y.-H., & Zhang, S. (2016). A smoothing SQP framework for a class of composite Lq minimization over polyhedron. Mathematical Programming, 158, 467–500. https://doi.org/10.1007/s10107-015-0939-5.

Ma, W., Qu, H., Gui, G., Xu, L., Zhao, J., & Chen, B. (2015). Maximum correntropy criterion based sparse adaptive filtering algorithms for robust channel estimation under non-Gaussian environments. Journal of the Franklin Institute, 352, 2708–2727. https://doi.org/10.1016/j.jfranklin.2015.03.039.

Mairal, J., Bach, F., Ponce, J., Sapiro, G., & Zisserman, A. (2008). Discriminative learned dictionaries for local image analysis. In 2008 IEEE Conference on Computer Vision and Pattern Recognition (pp. 1–8). https://doi.org/10.1109/CVPR.2008.4587652.

Nie, F., Wang, H., Cai, X., Huang, H., & Ding, C. (2012). Robust matrix completion via joint Schatten p-norm and ℓp-norm minimization. In 2012 IEEE 12th International Conference on Data Mining (pp. 566–574). https://doi.org/10.1109/ICDM.2012.160.

Nikolova, M., Ng, M. K., Zhang, S., & Ching, W.-K. (2008). Efficient reconstruction of piecewise constant images using nonsmooth nonconvex minimization. SIAM Journal on Imaging Sciences, 1, 2–25. https://doi.org/10.1137/070692285.
