Neurocomputing 367 (2019) 226–235
Sparse dictionary learning by block proximal gradient with global convergence

Tao Zhu
School of Electronic and Information Engineering, South China University of Technology, Guangzhou 510640, China

Article history: Received 3 February 2019; Revised 1 July 2019; Accepted 9 August 2019; Available online 13 August 2019. Communicated by Dr. Jie Wang.

Keywords: Sparse dictionary learning; Double sparsity; Sparse representation; Sparse approximation; Block proximal gradient; Convergence

Abstract: The present paper focuses on dictionary learning in the double sparsity model for sparse representation (or approximation). Due to the high non-convexity and discontinuity of the l0 pseudo-norm, the proofs of convergence for sparse K-singular value decomposition (Sparse-KSVD) and online sparse dictionary learning (OSDL) are challenging. To theoretically analyze the convergence, we relax the l0 pseudo-norm to the convex entry-wise l1 norm and reformulate the constrained optimization problem as an unconstrained one leveraging the double-l1-regularizer. The problem therefore becomes bi-convex. To train a sparse dictionary, an algorithm based on the block proximal gradient (BPG) framework with global convergence is derived. Several experiments are presented to show that our algorithm outperforms Sparse-KSVD and OSDL in some cases.

© 2019 Elsevier B.V. All rights reserved.
1. Introduction

During the last decade, sparse representations [1] of real-world signals over redundant dictionaries have attracted considerable research interest in the signal processing community. The basic model supposes that natural signals can be represented as a linear combination of a few atoms chosen from a dictionary. Formally, a signal y ∈ R^n can be described by y = Dx, where the dictionary D ∈ R^{n×m} (n < m) is redundant (or over-complete) and contains the atoms d_i ∈ R^n (1 ≤ i ≤ m) as its columns, and x ∈ R^m is the sparse coefficient vector. In practice, a deviation is permitted in the representation accuracy, so we can also use the term approximation instead of representation, meaning y ≈ Dx. Generally, finding x can be formulated as
\[ \min_{x \in \mathbb{R}^{m}} \ \phi(x) \quad \text{subject to} \quad \| y - Dx \|_{2} \le \epsilon, \tag{1} \]
where φ(·) is a sparsity-promoting function and ε is an error upper bound. One crucial problem is how to choose a dictionary. Originally, analytically-defined dictionaries, such as the discrete cosine transform (DCT) dictionary, the Gabor dictionary and wavelet dictionaries, were considered. Though generic, these dictionaries cannot effectively represent some natural signals due to their non-adaptivity. To attack this issue, dictionary learning [2], which leverages machine learning techniques to infer a dictionary from a set of examples, has been widely investigated. Dictionary learning can provide adaptive dictionaries that lead sparsity-inspired algorithms to achieve state-of-the-art results. Although learned dictionaries dramatically outperform analytically-defined ones in many applications, this comes at the cost of unstructured and explicit dictionaries that have no fast numerical implementations. Moreover, the sizes of learned dictionaries and the dimensions of signals that can be processed are limited by complexity constraints. Considering these issues, Rubinstein et al. [3] proposed the double sparsity model, which imposes some structure on dictionaries and thereby allows bigger ones. Based on this model, a dictionary is factorized into a known base dictionary and a trainable sparse dictionary. Dictionary learning in the double sparsity model is known as sparse dictionary learning. Sparsity constraints on the coefficient matrix (or coefficient vectors) and on the sparse dictionary (or sparse atoms) are required. In [3] and [4], sparse dictionary learning is formulated as a constrained optimization problem that leverages the l0 pseudo-norm to promote the sparsities of coefficient vectors and sparse atoms. Currently, sparse K-singular value decomposition (Sparse-KSVD) [3] and online sparse dictionary learning (OSDL) [4] are two popular algorithms to approximately solve this problem. However, due to the high non-convexity and discontinuity of the l0 pseudo-norm, the proofs of convergence for Sparse-KSVD and OSDL are challenging, and thus absent in previous works.
Contribution: The present paper focuses on sparse dictionary learning for sparse representation (or approximation). The main contributions are summarized as follows:
• To theoretically analyze the convergence, we model sparse dictionary learning as a bi-convex problem. More specifically, we relax the l0 pseudo-norm to the convex entry-wise l1 norm and reformulate the constrained optimization problem as an unconstrained one leveraging the double-l1-regularizer.
• To learn the sparse dictionary with the proposed objective, we derive an algorithm based on the block proximal gradient (BPG) framework [5], which is tailored for solving a class of multi-convex (a generalization of bi-convex) problems. Moreover, since BPG does not guarantee monotonically non-increasing objective values, our algorithm includes the restarting technique [6,7] to eliminate this drawback.
• The global convergence of the derived algorithm is analyzed leveraging the BPG framework. A complexity analysis of the algorithm is also given.
• Performance comparisons with Sparse-KSVD and OSDL are presented via investigating capabilities of dictionary recovery and image patch representation.
We would like to state the differences between some previous works and this work. Although regularization techniques are used in unstructured dictionary learning [8–14], the double-l1-regularizer has never been applied to sparse dictionary learning. Recently, Lin et al. [15] used the double-l1-regularizer to formulate their optimization problem, which was then solved by the split Bregman iteration (SBI) algorithm [16]. However, their purpose is to reconstruct a single image from multiple blurry measured images and estimate the blurs, not to train a sparse dictionary. In previous works, the BPG framework has been employed to design algorithms for unstructured dictionary learning [14] and convolutional dictionary learning [17]. Nevertheless, as far as we know, sparse dictionary learning has not yet been considered. It is worth noting that Xu et al. also used the terminology "sparse dictionary learning" in subsection 1.2 of [5]; however, this is totally different from the case in the present paper. Therein, they discussed the sparse blind source separation (BSS) model [18,19] and called the unknown mixing matrix to be estimated the sparse dictionary. In [14], the restarting is triggered by comparing the objective values. Differently, considering the computational burden of evaluating the objective values, we chose the gradient mapping [6] as the restarting criterion.

Structure: The rest of this paper is organized as follows. In Section 2, we briefly review the dictionary learning problem, the double sparsity model, and the BPG framework. Section 3 presents our main work. We first reformulate the optimization problem for sparse dictionary learning, and then derive an algorithm to solve it. Convergence analysis and complexity analysis of the algorithm are also given. Several experiments are presented in Section 4 to demonstrate the superior performance of our algorithm over Sparse-KSVD and OSDL in some cases. In Section 5, we draw conclusions.

Notation: Throughout this paper, capital and small bold-face letters denote matrices and vectors, respectively. R^n denotes the set of all n-dimensional real vectors, and R^{n×m} denotes the set of all real matrices of size n × m. (·)^T denotes transpose. k denotes the iteration index. For any vector a ∈ R^n, a_i is the ith element, \|a\|_2 = (\sum_{i=1}^{n} |a_i|^2)^{1/2} is the l2 norm, and \|a\|_0 = \sum_{i=1}^{n} 1_{\{a_i \neq 0\}} is the l0 pseudo-norm. ⟨a, c⟩ = a^T c denotes the vector inner product of a and c ∈ R^n. For any matrix A ∈ R^{n×m}, a_{i,j} is the (i, j)th element, a_i ∈ R^n is the ith column, \|A\|_{1,1} = \sum_{j=1}^{m} \sum_{i=1}^{n} |a_{i,j}| is the entry-wise l1 norm¹, \|A\|_F = (\sum_{i=1}^{n} \sum_{j=1}^{m} |a_{i,j}|^2)^{1/2} is the Frobenius norm, and \|A\|_2 = (\lambda_{\max}(A^T A))^{1/2} is the spectral norm, where λ_max(·) is the maximal eigenvalue of its argument. The matrix inner product of A and B is denoted by ⟨A, B⟩ = trace(A^T B), where trace(·) is the trace of its argument. For any function f : dom(f) → R, its gradient and Hessian matrix (if they exist) are denoted by ∇f and ∇²f, respectively.

¹ Note that this entry-wise l1 norm is different from the l1 norm \|A\|_1 = \max_{1 \le j \le m} \sum_{i=1}^{n} |a_{i,j}|.

2. Preliminaries

This section provides a short review of the dictionary learning problem, the double sparsity model, and the BPG framework.

2.1. Dictionary learning

Essentially, dictionary learning is a practical matrix factorization problem which has also found applications in many other areas such as blind source separation [20] and feature selection and clustering [21,22], to name just a few. Formally, dictionary learning is formulated as

\[ \min_{D \in \mathcal{D},\, X \in \mathcal{X}} \ \| Y - DX \|_F^2, \tag{2} \]

where Y ∈ R^{n×N} is a training data matrix that contains N examples as its columns, and X ∈ R^{m×N} is a coefficient matrix that contains N sparse vectors corresponding to the training examples. 𝒟 and 𝒳 are the admissible sets of the dictionary and the coefficient matrix, respectively. Generally, 𝒟 constrains the l2 norm of the atoms to prevent D from being arbitrarily large, and 𝒳 constrains the coefficient matrix to be sparse. Many algorithms solve the dictionary learning problem by alternately performing two stages: sparse coding and dictionary update. In order to obtain good results, this procedure needs to be repeated several times.

2.2. Double sparsity model

In the double sparsity model, each atom of the dictionary is assumed to be a sparse linear combination of some pre-specified base atoms that are chosen from a collection called the base dictionary. Formally, D can be expressed as D = ΦA, where Φ ∈ R^{n×s} is the base dictionary and A ∈ R^{s×m} is the sparse dictionary. D is thus called the effective dictionary. The number of columns in Φ might differ from that of A, which allows flexibility in the redundancy of D. In [3], the over-complete DCT dictionary is used as the base dictionary, but its applicability to larger signal sizes is weak, thus [4] proposes the cropped wavelets dictionary. Clearly, sparse dictionary learning amounts to inferring A, since Φ is known. Based on such a model, problem (2) can be reformulated as

\[ \min_{A \in \mathcal{A},\, X \in \mathcal{X}} \ \| Y - \Phi A X \|_F^2, \tag{3} \]

where 𝒜 is the admissible set of the sparse dictionary. Besides the sparsity, 𝒜 also constrains the l2 norm of the sparse atoms. In [3] and [4], problem (3) is instantiated as

\[ \min_{A, X} \ \| Y - \Phi A X \|_F^2 \quad \text{subject to} \quad \| x_i \|_0 \le p, \ \forall i, \quad \| a_j \|_0 \le q, \ \| a_j \|_2 = 1, \ \forall j, \tag{4} \]

where p and q are the given sparsity levels. Sparse-KSVD and OSDL are two popular algorithms for approximately solving (4).
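To make the model concrete, the following NumPy sketch builds a small effective dictionary D = ΦA from a base dictionary and a column-sparse A, and synthesizes a doubly sparse signal. It is only an illustration under our own assumptions: a normalized random matrix stands in for the base dictionary (the paper uses an overcomplete DCT or cropped wavelets), and the sizes and sparsity levels are arbitrary, not the paper's experimental settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n, s, m = 64, 96, 96          # signal dim, base-dictionary atoms, sparse-dictionary atoms
p, q = 4, 5                   # sparsity of the coefficient vector and of each sparse atom

# Base dictionary Phi (a normalized random matrix stands in for an overcomplete DCT here).
Phi = rng.standard_normal((n, s))
Phi /= np.linalg.norm(Phi, axis=0, keepdims=True)

# Sparse dictionary A: each column (sparse atom) has q non-zeros and unit l2 norm.
A = np.zeros((s, m))
for j in range(m):
    support = rng.choice(s, size=q, replace=False)
    A[support, j] = rng.standard_normal(q)
    A[:, j] /= np.linalg.norm(A[:, j])

D = Phi @ A                   # effective dictionary; in practice it is applied implicitly as Phi @ (A @ x)

# A doubly sparse signal y = Phi A x with a p-sparse coefficient vector x.
x = np.zeros(m)
x[rng.choice(m, size=p, replace=False)] = rng.standard_normal(p)
y = Phi @ (A @ x)

print(D.shape, np.count_nonzero(A, axis=0).max())   # (64, 96), at most q non-zeros per atom
```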
2.3. Block proximal gradient

Consider a general optimization problem

\[ \min_{u \in \mathcal{U}} \ F(u_1, \ldots, u_B) := f(u_1, \ldots, u_B) + \sum_{b=1}^{B} r_b(u_b), \tag{5} \]

where the variable u is decomposed into B blocks u_1, ..., u_B, i.e., u = [u_1, ..., u_B], the set 𝒰 of feasible points is assumed to be a closed and block multi-convex subset of R^n, f is assumed to be a differentiable and block multi-convex function, and r_b, b = 1, ..., B, are extended-value² convex (and possibly non-smooth) functions that include the individual constraints of u_1, ..., u_B when they are present. The set 𝒰 and the function f can be non-convex over u.

² Extended value means r_b(u_b) = ∞ if u_b ∉ dom(r_b).

A set 𝒰 is block multi-convex if its projection onto each block u_b ∈ R^{n_b}, b = 1, ..., B, is convex, i.e., for each b and any fixed B − 1 blocks u_1, ..., u_{b−1}, u_{b+1}, ..., u_B, the set

\[ \mathcal{U}_b(u_1, \ldots, u_{b-1}, u_{b+1}, \ldots, u_B) := \{ u_b \in \mathbb{R}^{n_b} : (u_1, \ldots, u_{b-1}, u_b, u_{b+1}, \ldots, u_B) \in \mathcal{U} \} \]

is convex. A function f is block multi-convex if, for each b, f is a convex function of u_b while all the other blocks are fixed. Therefore, when all but one block are fixed, problem (5) over the free block is convex. Let u_b^k denote the value of u_b after the kth iteration. Define

\[ f_b^{k-1}(u_b) := f(u_1^k, \ldots, u_{b-1}^k, u_b, u_{b+1}^{k-1}, \ldots, u_B^{k-1}), \quad \forall b = 1, \ldots, B. \]

To solve (5), BPG minimizes F cyclically over each of u_1, ..., u_B while fixing the remaining blocks at their last updated values. Formally, at the kth iteration of BPG, the update of u_b can be written as

\[ u_b^k = \arg\min_{u_b \in \mathcal{U}_b^{k-1}} \ \langle \nabla f_b^{k-1}(\hat{u}_b^{k-1}), u_b - \hat{u}_b^{k-1} \rangle + \frac{L_b^{k-1}}{2} \| u_b - \hat{u}_b^{k-1} \|_2^2 + r_b(u_b), \tag{6} \]

where the set 𝒰_b^{k−1} := 𝒰_b(u_1^k, ..., u_{b−1}^k, u_{b+1}^{k−1}, ..., u_B^{k−1}), L_b^{k−1} > 0 is a Lipschitz constant of ∇f_b^{k−1}(u_b), and

\[ \hat{u}_b^{k-1} := u_b^{k-1} + \omega_b^{k-1} (u_b^{k-1} - u_b^{k-2}) \tag{7} \]

denotes an extrapolated point with extrapolation weight 0 ≤ ω_b^{k−1} ≤ 1. Using the definition of the proximal operator [23,24] and ignoring the terms that are constant with respect to u_b, formula (6) is simplified to

\[ u_b^k = \mathrm{Prox}_{\frac{1}{L_b^{k-1}} r_b} \left( \hat{u}_b^{k-1} - \frac{1}{L_b^{k-1}} \nabla f_b^{k-1}(\hat{u}_b^{k-1}) \right), \tag{8} \]

where Prox_r(·) denotes the proximal operator of a function r(·).
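As a concrete illustration of the update (6)–(8), the following Python sketch performs one prox-linear block update given black-box callables for the partial gradient and the proximal operator; the function and argument names are ours, not from [5], and the toy usage at the end is a single-block lasso just to show the calling pattern.

```python
import numpy as np

def bpg_block_update(u_prev, u_prev2, grad_fb, prox_rb, L, w):
    """One BPG update of a single block, cf. (6)-(8).

    u_prev, u_prev2 : block values at iterations k-1 and k-2
    grad_fb         : callable returning the partial gradient of f at a point
    prox_rb         : callable prox_rb(v, t) evaluating Prox_{t * r_b}(v)
    L               : Lipschitz constant of the partial gradient (L_b^{k-1})
    w               : extrapolation weight in [0, 1]
    """
    u_hat = u_prev + w * (u_prev - u_prev2)               # extrapolated point, eq. (7)
    return prox_rb(u_hat - grad_fb(u_hat) / L, 1.0 / L)   # prox-gradient step, eq. (8)

# Toy usage: minimize 0.5*||M u - y||^2 + lam*||u||_1 over a single block.
rng = np.random.default_rng(1)
M, y, lam = rng.standard_normal((30, 50)), rng.standard_normal(30), 0.1
grad = lambda u: M.T @ (M @ u - y)
prox = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - lam * t, 0.0)  # soft shrinkage
L = np.linalg.norm(M.T @ M, 2)

u_old = u = np.zeros(50)
for k in range(200):
    u, u_old = bpg_block_update(u, u_old, grad, prox, L, w=0.5), u
```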
3. Sparse dictionary learning by block proximal gradient

In this section, we first reformulate the optimization problem for sparse dictionary learning, and then derive an algorithm based on the BPG framework to solve it. Convergence analysis and complexity analysis of the algorithm are also provided.

3.1. Problem reformulation

Different from [3,4], which instantiate (3) as (4), we first reformulate (3) leveraging the regularization technique as

\[ \min_{A, X} \ \frac{1}{2} \| Y - \Phi A X \|_F^2 + \tau_X R_X(X) + \mu_A R_A(A), \tag{9} \]

where R_X(·) and R_A(·) are the regularizers on X and A, respectively, and τ_X > 0 and μ_A > 0 are the regularization parameters. Then, we need to choose two feasible regularizers for (9). Naturally, the sparsities of X and A are required. Moreover, a constraint on the l2 norm of the sparse atoms is needed to prevent the dictionary from being arbitrarily large. It is well known that the entry-wise l1 norm encourages a matrix to be sparse by serving as a convex surrogate for the number of non-zero entries in the matrix [23]. On the other hand, projecting the atoms onto a convex set is usually considered in dictionary learning. Hence, we use the following regularizers:

\[ R_X(X) = \| X \|_{1,1} \tag{10} \]

and

\[ R_A(A) = \| A \|_{1,1} + \delta_{\mathcal{A}}(A), \tag{11} \]

where the set 𝒜 := {A ∈ R^{s×m} : \|a_j\|_2 ≤ 1, j = 1, ..., m} is convex, and δ_𝒜(A) is the indicator function of the set 𝒜, defined by [25]

\[ \delta_{\mathcal{A}}(A) := \begin{cases} 0, & A \in \mathcal{A}, \\ +\infty, & A \notin \mathcal{A}. \end{cases} \]

Armed with (10) and (11), problem (9) is instantiated as

\[ \min_{A, X} \ \frac{1}{2} \| Y - \Phi A X \|_F^2 + \tau \| X \|_{1,1} + \mu \| A \|_{1,1} + \delta_{\mathcal{A}}(A), \tag{12} \]

where τ > 0 and μ > 0 are tunable parameters that control the trade-off between data fidelity and sparsity levels. Since two l1 norm terms appear in (12), we call them the double-l1-regularizer. Obviously, problem (12) is bi-convex. Define the data fidelity term in (12) as

\[ f(X, A) := \frac{1}{2} \| Y - \Phi A X \|_F^2; \]

the gradients with respect to X and A are

\[ \nabla_X f(X, A) = A^T \Phi^T (\Phi A X - Y) \tag{13} \]

and

\[ \nabla_A f(X, A) = \Phi^T (\Phi A X - Y) X^T, \tag{14} \]

respectively. We straightforwardly obtain the Lipschitz constant of ∇_X f(X, A) as

\[ L_X = \| \nabla_X^2 f(X, A) \|_2 = \| A^T \Phi^T \Phi A \|_2. \tag{15} \]

Since for any A_1 and A_2 we have

\[ \| \nabla_A f(X, A_1) - \nabla_A f(X, A_2) \|_F = \| \Phi^T \Phi A_1 X X^T - \Phi^T \Phi A_2 X X^T \|_F = \| \Phi^T \Phi (A_1 - A_2) X X^T \|_F \le \| \Phi^T \Phi \|_2 \, \| X X^T \|_2 \, \| A_1 - A_2 \|_F, \]

the Lipschitz constant of ∇_A f(X, A) can be set as

\[ L_A = \| \Phi^T \Phi \|_2 \, \| X X^T \|_2. \tag{16} \]
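A minimal NumPy sketch of the data-fidelity term, its partial gradients (13)–(14) and the Lipschitz constants (15)–(16); the function names are ours, and the spectral norms are computed densely, which is only practical at small sizes.

```python
import numpy as np

def f_val(Y, Phi, A, X):
    """Data fidelity term f(X, A) = 0.5 * ||Y - Phi A X||_F^2."""
    R = Y - Phi @ (A @ X)
    return 0.5 * np.sum(R * R)

def grad_X(Y, Phi, A, X):
    """Partial gradient (13): A^T Phi^T (Phi A X - Y)."""
    return A.T @ (Phi.T @ (Phi @ (A @ X) - Y))

def grad_A(Y, Phi, A, X):
    """Partial gradient (14): Phi^T (Phi A X - Y) X^T."""
    return (Phi.T @ (Phi @ (A @ X) - Y)) @ X.T

def lipschitz_X(Phi, A):
    """L_X = ||A^T Phi^T Phi A||_2, cf. (15)."""
    return np.linalg.norm(A.T @ (Phi.T @ Phi) @ A, 2)

def lipschitz_A(Phi, X):
    """L_A = ||Phi^T Phi||_2 * ||X X^T||_2, cf. (16)."""
    return np.linalg.norm(Phi.T @ Phi, 2) * np.linalg.norm(X @ X.T, 2)
```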
3.2. Algorithm

In what follows, we derive an algorithm for sparse dictionary learning leveraging the BPG framework. Applying (8) to problem (12), we alternately update X and A via

\[ X^k = \mathrm{Prox}_{\frac{\tau}{L_X^{k-1}} \|\cdot\|_{1,1}} \left( \hat{X}^{k-1} - \frac{1}{L_X^{k-1}} \nabla_X f(\hat{X}^{k-1}, A^{k-1}) \right) \tag{17} \]

with an extrapolated point

\[ \hat{X}^{k-1} := X^{k-1} + \omega_X^{k-1} (X^{k-1} - X^{k-2}), \tag{18} \]

and

\[ A^k = \mathrm{Prox}_{\frac{\mu}{L_A^{k}} \|\cdot\|_{1,1} + \delta_{\mathcal{A}}(\cdot)} \left( \hat{A}^{k-1} - \frac{1}{L_A^{k}} \nabla_A f(X^k, \hat{A}^{k-1}) \right) \tag{19} \]

with an extrapolated point

\[ \hat{A}^{k-1} := A^{k-1} + \omega_A^{k-1} (A^{k-1} - A^{k-2}). \tag{20} \]

From [23], we know that the proximal operator of the function h(Z) = T·\|Z\|_{1,1} with T > 0 is the well-known soft shrinkage operator [26]

\[ (\mathcal{S}_T(Z))_{i,j} := \mathrm{sgn}(z_{i,j}) \cdot \max\{ |z_{i,j}| - T, 0 \}, \quad \forall i, j. \]

Therefore, (17) can be rewritten as

\[ X^k = \mathcal{S}_{\tau / L_X^{k-1}} \left( \hat{X}^{k-1} - \frac{1}{L_X^{k-1}} \nabla_X f(\hat{X}^{k-1}, A^{k-1}) \right), \tag{21} \]

and (19) can be rewritten as

\[ A^k = P_{\mathcal{A}} \left( \mathcal{S}_{\mu / L_A^{k}} \left( \hat{A}^{k-1} - \frac{1}{L_A^{k}} \nabla_A f(X^k, \hat{A}^{k-1}) \right) \right), \tag{22} \]

where P_𝒜(·) denotes the Euclidean projection onto 𝒜. For an input Z, P_𝒜(Z) is defined column-wise by

\[ (P_{\mathcal{A}}(Z))_j := \frac{z_j}{\max(1, \| z_j \|_2)}, \quad \forall j. \]
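The two operators can be written in a few lines of NumPy; the function names are ours, and the small sanity check merely confirms that the projected columns have norm at most one.

```python
import numpy as np

def soft_threshold(Z, t):
    """Entry-wise soft shrinkage S_t(Z) = sgn(z) * max(|z| - t, 0)."""
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def project_columns(Z):
    """Euclidean projection onto {A : ||a_j||_2 <= 1 for all j} (column-wise)."""
    norms = np.maximum(1.0, np.linalg.norm(Z, axis=0, keepdims=True))
    return Z / norms

Z = np.random.default_rng(2).standard_normal((8, 5)) * 3.0
A_new = project_columns(soft_threshold(Z, 0.5))
assert np.all(np.linalg.norm(A_new, axis=0) <= 1.0 + 1e-12)
```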
From (15) and (16), we have

\[ L_X^{k-1} = \| (A^{k-1})^T \Phi^T \Phi A^{k-1} \|_2 \tag{23} \]

and

\[ L_A^{k} = \| \Phi^T \Phi \|_2 \, \| X^k (X^k)^T \|_2 \tag{24} \]

for (21) and (22), respectively. From [14], the extrapolation weights are taken as

\[ \omega_X^{k} = \delta \cdot \min \left( \frac{t_{k-1} - 1}{t_k}, \ \sqrt{\frac{L_X^{k-1}}{L_X^{k}}} \right) \tag{25} \]

and

\[ \omega_A^{k} = \delta \cdot \min \left( \frac{t_{k-1} - 1}{t_k}, \ \sqrt{\frac{L_A^{k-1}}{L_A^{k}}} \right), \tag{26} \]

where 0 < δ < 1. Herein, t_k can be chosen as

\[ t_0 = 1, \quad t_k = \frac{1}{2} \left( 1 + \sqrt{1 + 4 t_{k-1}^2} \right), \tag{27} \]

which is used in the fast iterative shrinkage-thresholding algorithm (FISTA) [27,28] and monotone FISTA (MFISTA) [29,30], or

\[ t_k = \frac{k + 2}{2}, \tag{28} \]

which is used in [31]. It is well known that the monotonic non-increasing property is not only required for global convergence, but is also important to make the algorithm perform stably and converge rapidly. However, the extrapolation technique destroys this property in the objective values. In order to keep the monotonic non-increasing property, the restarting technique is employed in the algorithm. Restarting means that the algorithm sets the extrapolation weight to zero (i.e., no extrapolated point) and re-updates the variable when a pre-specified restarting criterion is satisfied. The most direct restarting criterion is exact non-monotonicity, which compares the objective value of the current iteration with that of the previous one and triggers the re-update when the former is greater than the latter. However, evaluating the objective value in each iteration is computationally expensive, thus we use another criterion called gradient mapping [6]. Specifically, in our algorithm, the restarting is triggered for the update of X when

\[ \langle \hat{X}^{k-1} - X^k, \ X^k - X^{k-1} \rangle > 0 \tag{29} \]

is satisfied, and for the update of A when

\[ \langle \hat{A}^{k-1} - A^k, \ A^k - A^{k-1} \rangle > 0 \tag{30} \]

is satisfied. The algorithm, referred to as block proximal gradient sparse dictionary learning (BPGSDL), is summarized in Algorithm 1.

Algorithm 1 BPGSDL.
1: Input: training data matrix Y; regularization parameters τ > 0, μ > 0; initial points (X^{-1}, A^{-1}) = (X^0, A^0); number of iterations K.
2: for k = 1, 2, ..., K do
3:   Set the Lipschitz constant L_X^{k-1} by (23);
4:   Set the extrapolation weight ω_X^{k-1} by (25);
5:   Set the extrapolated point X̂^{k-1} by (18);
6:   Compute X^k by (21);
7:   if (29) is satisfied
8:     Re-compute X^k by (21) with X̂^{k-1} = X^{k-1};
9:   end if
10:  Set the Lipschitz constant L_A^{k} by (24);
11:  Set the extrapolation weight ω_A^{k-1} by (26);
12:  Set the extrapolated point Â^{k-1} by (20);
13:  Compute A^k by (22);
14:  if (30) is satisfied
15:    Re-compute A^k by (22) with Â^{k-1} = A^{k-1};
16:  end if
17: end for
18: Output: A^K.
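Putting the pieces together, the following Python sketch mirrors the structure of Algorithm 1: Lipschitz constants (23)–(24), extrapolation (25)–(27), updates (21)–(22) and the gradient-mapping restarts (29)–(30). It is a compact re-implementation under our own naming with dense linear algebra, not the author's MATLAB code; in particular, the index bookkeeping of the extrapolation weights is simplified, and practical accelerations (e.g., pre-computing with the Gram matrix as in Section 3.4) are only partially reflected.

```python
import numpy as np

def soft_threshold(Z, t):
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def project_columns(Z):
    return Z / np.maximum(1.0, np.linalg.norm(Z, axis=0, keepdims=True))

def bpgsdl(Y, Phi, A0, X0, tau, mu, K, delta=0.9999):
    """Sketch of Algorithm 1 (BPGSDL). A0, X0 should be nonzero initial guesses."""
    G = Phi.T @ Phi                                   # pre-computed Gram matrix of the base dictionary
    nrm_G = np.linalg.norm(G, 2)
    A_prev, A = A0.copy(), A0.copy()                  # (A^{-1}, A^0) = (A^0, A^0)
    X_prev, X = X0.copy(), X0.copy()
    LX_old = max(np.linalg.norm(A.T @ G @ A, 2), 1e-12)
    LA_old = max(nrm_G * np.linalg.norm(X @ X.T, 2), 1e-12)
    t_old, t = 1.0, 0.5 * (1.0 + np.sqrt(5.0))        # t_0 = 1 and t_1 from (27)

    for k in range(1, K + 1):
        # ----- X update -----
        LX = max(np.linalg.norm(A.T @ G @ A, 2), 1e-12)            # (23)
        wX = delta * min((t_old - 1.0) / t, np.sqrt(LX_old / LX))  # (25)
        X_hat = X + wX * (X - X_prev)                              # (18)
        GX = A.T @ (Phi.T @ (Phi @ (A @ X_hat) - Y))               # gradient (13) at X_hat
        X_new = soft_threshold(X_hat - GX / LX, tau / LX)          # (21)
        if np.sum((X_hat - X_new) * (X_new - X)) > 0:              # restart test (29)
            GX = A.T @ (Phi.T @ (Phi @ (A @ X) - Y))
            X_new = soft_threshold(X - GX / LX, tau / LX)
        X_prev, X, LX_old = X, X_new, LX

        # ----- A update -----
        LA = max(nrm_G * np.linalg.norm(X @ X.T, 2), 1e-12)        # (24)
        wA = delta * min((t_old - 1.0) / t, np.sqrt(LA_old / LA))  # (26)
        A_hat = A + wA * (A - A_prev)                              # (20)
        GA = (Phi.T @ (Phi @ (A_hat @ X) - Y)) @ X.T               # gradient (14) at A_hat
        A_new = project_columns(soft_threshold(A_hat - GA / LA, mu / LA))  # (22)
        if np.sum((A_hat - A_new) * (A_new - A)) > 0:              # restart test (30)
            GA = (Phi.T @ (Phi @ (A @ X) - Y)) @ X.T
            A_new = project_columns(soft_threshold(A - GA / LA, mu / LA))
        A_prev, A, LA_old = A, A_new, LA

        t_old, t = t, 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))     # (27)
    return A
```

A typical call would look like `A_learned = bpgsdl(Y_train, Phi, A_init, X_init, tau, mu, K)`; the choice of initialization and of τ, μ is discussed in Section 4 and is not fixed by this sketch.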
3.3. Convergence analysis

The convergence analysis is based on the following assumptions and Theorem 2.8 of [5].

Assumption 1. The function F defined in (5) is continuous in dom(F) and lower-bounded (i.e., inf_{u∈dom(F)} F(u) > −∞), and problem (5) has a Nash point [5].

Assumption 2. For all b = 1, ..., B, ∇f_b^{k−1}(u_b) is Lipschitz continuous, the parameter L̂_b^{k−1} obeys l̂_b ≤ L̂_b^{k−1} ≤ L̂_b, where 0 < l̂_b ≤ L̂_b < ∞ are constants, and

\[ f_b^{k-1}(u_b^k) \le f_b^{k-1}(\hat{u}_b^{k-1}) + \langle \nabla f_b^{k-1}(\hat{u}_b^{k-1}), u_b^k - \hat{u}_b^{k-1} \rangle + \frac{\hat{L}_b^{k-1}}{2} \| u_b^k - \hat{u}_b^{k-1} \|_2^2. \tag{31} \]

Assumption 3. The function F defined in (5) satisfies the Kurdyka–Łojasiewicz (KL) property [5].

Theorem 1 (Theorem 2.8 of [5], Global Convergence). Under Assumptions 1–3, and provided that the sequence {u^k} has a finite limit point u*, the sequence {u^k} converges to u*, which is a critical point of (5).

The objective function of (12) straightforwardly satisfies the continuity and the lower-boundedness in Assumption 1. The positive regularization parameter τ ensures that the sequence {X^k} is bounded (otherwise the objective function of (12) would blow up). Similarly, the positive regularization parameter μ guarantees that
the sequence {A^k} is bounded. Therefore, {(X^k, A^k)} has a finite limit point. Naturally, the gradients defined in (13) and (14) are Lipschitz continuous. Taking the parameter in Assumption 2 as the Lipschitz constant, inequality (31) is always satisfied according to the well-known Descent Lemma (Proposition A.24 in [32]). Moreover, since {(X^k, A^k)} has a finite limit point, the Lipschitz constants specified in (23) and (24) must be upper bounded. On the other hand, as long as {X^k} and {A^k} are uniformly away from the origin, L_X^k and L_A^k are uniformly above zero, and thus the lower-boundedness condition is satisfied. According to the following theorem, any limit point of {(X^k, A^k)} is a Nash point.

Theorem 2 (Theorem 2.3 of [5], A Limit Point Is a Nash Point). Under Assumptions 1–2, any limit point of {u^k} is a Nash point.

From [33,34], we know that semi-algebraic functions [35], defined below, are sub-analytic and thus satisfy the KL property.

Definition 1. A set S ⊂ R^n is called semi-algebraic if it can be represented as

\[ S = \bigcup_{j=1}^{J} \bigcap_{i=1}^{I} \{ s \in \mathbb{R}^n : g_{ij}(s) = 0, \ h_{ij}(s) > 0 \}, \]

where g_{ij}, h_{ij} are real polynomial functions for i = 1, ..., I, j = 1, ..., J. A function ψ is called semi-algebraic if its graph Graph(ψ) := {(s, ψ(s)) : s ∈ dom(ψ)} is a semi-algebraic set.

We first provide some known examples and elementary properties of semi-algebraic sets and functions as follows [5,36]:
1. Real polynomial functions are semi-algebraic.
2. The absolute value function is semi-algebraic.
3. Indicator functions of semi-algebraic sets are semi-algebraic.
4. Finite sums and products of semi-algebraic functions are semi-algebraic.
5. If a set S is semi-algebraic, so is its closure cl(S).

Next, we explain that the objective function of (12) is semi-algebraic. Clearly, the data fidelity term f(X, A) = ½‖Y − ΦAX‖_F² is a real polynomial function, thus it is semi-algebraic according to item 1. From items 2 and 4, we know that the regularizers ‖X‖_{1,1} and ‖A‖_{1,1} are semi-algebraic functions. It is easy to notice that the set

\[ \mathcal{A}_0 := \{ A_0 \in \mathbb{R}^{s \times m} : \| a_{0j} \|_2 < 1, \ j = 1, \ldots, m \} = \bigcap_{j=1}^{m} \{ A_0 : 1 - \| a_{0j} \|_2 > 0 \} \]

is a semi-algebraic set. From item 5, its closure cl(𝒜_0) = 𝒜 is also a semi-algebraic set. Hence, δ_𝒜(A) is a semi-algebraic function according to item 3. Finally, the objective function of (12) is semi-algebraic according to item 4. Based on Theorem 1 and the analyses above, we provide a theorem for the global convergence of Algorithm 1.

Theorem 3. Let {(X^k, A^k)} be the sequence generated by Algorithm 1. Suppose that {X^k} and {A^k} are uniformly away from the origin. Then {(X^k, A^k)} converges to a critical point of (12).

Additionally, according to Theorem 2.9 of [5], we can obtain the asymptotic convergence rate of Algorithm 1.

Theorem 4. Let {(X^k, A^k)} be the sequence generated by Algorithm 1. Suppose that {X^k} and {A^k} are uniformly away from the origin. Then {(X^k, A^k)} converges to a critical point V* = (X*, A*) with the following asymptotic convergence rates:
1. If θ ∈ (0, 1/2], then there exist ω > 0 and ρ ∈ [0, 1) such that ‖V^k − V*‖_F ≤ ωρ^k;
2. If θ ∈ (1/2, 1), then there exists ω > 0 such that ‖V^k − V*‖_F ≤ ωk^{−(1−θ)/(2θ−1)};
where θ ∈ [0, 1)³ corresponds to the desingularizing function φ(s) = cs^{1−θ} with c > 0.

³ According to Theorem 2.9 of [5], if θ = 0, V^k converges to V* in finitely many iterations.

From Theorem 4, we can see that

\[ \lim_{k \to +\infty} \frac{\rho^k}{k^{-(1-\theta)/(2\theta-1)}} = 0, \]

where ρ ∈ [0, 1) and θ ∈ (1/2, 1). Hence, V^k converges to V* at least at the rate O(k^{−(1−θ)/(2θ−1)}).

3.4. Complexity analysis

Before proceeding with the complexity analysis of BPGSDL, we make some statements as follows:
• Let p be the average sparsity of the coefficient vectors and q be that of the sparse atoms.
• As in [4], we also assume that the sparse dictionary A is square (i.e., s = m) for simplicity.
• As in [3], we also consider the asymptotic case where p ∼ q ≪ n ∼ m ≪ N.
• Let T_Φ be the complexity of applying the base dictionary Φ. As in [4], we only consider the separable case where T_Φ = O(n√m).

Clearly, the complexity of matrix–matrix multiplication is the dominant term. According to [3] and [4], any matrix–vector multiplication with D = ΦA or D^T = A^T Φ^T has a complexity of T_D = O(qm + n√m). Calculating the gradients ∇_X f and ∇_A f has a complexity of T_{∇_X f} = O(Nqm + Nn√m) and T_{∇_A f} = O(Nqm + Npn + Nn√m), respectively. Under the asymptotic case, Nqm ∼ Npn, thus T_{∇_A f} = O(Nqm + Nn√m). Calculating the Lipschitz constant L_X has a complexity of T_{L_X} = O(qm² + nm√m), and calculating the other Lipschitz constant L_A has a complexity of T_{L_A} = O(Np²) since we can pre-compute the Gram matrix Φ^T Φ. Putting these elements together, BPGSDL has a complexity of

\[ T_{\mathrm{BPGSDL}} = O(Nn\sqrt{m} + Nqm + Np^2 + qm^2 + nm\sqrt{m}). \]

From [4], we know that OSDL has a complexity of T_{OSDL} = O(Nn²(√m + k) + (N/N₀)m² log(m)), where N₀ is the number of examples in a mini-batch. Therefore, BPGSDL provides a computational advantage over OSDL under the asymptotic case. In the next section, we will report the run-time to demonstrate that BPGSDL is faster than Sparse-KSVD and OSDL.
4. Experiments

In this section, we present several experiments to demonstrate the superior performance of BPGSDL over Sparse-KSVD and OSDL in some cases. Experiments on synthetic data and on images are reported. We start with a numerical experiment that investigates the capability of dictionary recovery. We then learn image-specific dictionaries on an image. Finally, we train universal dictionaries on some image databases. The last two experiments are devoted to investigating the capability of representation.

We make some settings in these experiments as follows. For BPGSDL, the value of δ in (25) and (26) is set as 0.9999, since a value closer to 1 makes the algorithm converge faster, and the sequence {t_k}, ∀k ≥ 1, is computed according to (27). The base dictionaries are set as separable dictionaries that are the Kronecker product of several 1-D dictionaries⁴. For simplicity, the sparse matrix A is set to be square, thus the redundancy is determined by the base dictionary. All the experiments are implemented in MATLAB 9.4.0 on a machine equipped with a 3.0 GHz Intel Core i5-7400 CPU, 8 GB memory and the Microsoft Windows 10 operating system. Our codes are entirely written in MATLAB, while the sparse coding codes for Sparse-KSVD and OSDL were written in C⁵, giving them somewhat of an advantage in run-time.

⁴ For example, a 2-D base dictionary can be constructed as Φ = Φ₀ ⊗ Φ₀, where Φ₀ is a 1-D base dictionary and the notation "⊗" denotes the Kronecker product. Examples of separable dictionaries include the DCT (Fourier), overcomplete DCT (Fourier), and wavelet dictionaries, among others.
⁵ We use the publicly available software packages. Sparse-KSVD is available at: http://www.cs.technion.ac.il/~ronrubin/Software/ksvdsbox11.zip. OSDL is available at: https://github.com/jsulam/OSDL. Actually, the sparse coding for them is implemented by orthogonal matching pursuit (OMP); the codes can be found at: http://www.cs.technion.ac.il/~ronrubin/Software/ompsbox1.zip.

4.1. Dictionary recovery

For dictionary learning algorithms, investigating the capability of dictionary recovery is rather a difficult task unless the true dictionary is known. Therefore, we use synthetic data built from a known dictionary. We construct a 2-D base dictionary using two 1-D overcomplete DCT dictionaries in two setups of size, 8 × 12 and 12 × 14. This means that the dimension of the signals in the training set will be 64 and 144, respectively, and the dimension of the corresponding sparse vectors will be 144 and 196, respectively. For the sparse dictionary A, each atom has 5% non-zero elements, whose values are drawn i.i.d. from the Gaussian distribution N(0, 1) and whose locations are selected uniformly at random. The true dictionary, denoted D#, is obtained by D# = ΦA. We then generate the true sparse coefficient matrix X# as follows: the locations of the non-zero elements in each column are selected uniformly at random, and the values are drawn i.i.d. from a Gaussian distribution with zero mean and unit variance. We investigate several sparsity levels of the coefficient vectors. The training set Y is generated as sparse linear combinations of dictionary atoms according to the model Y = D# X#.

For the training set with signal dimension 64, we generate 2 × 10^4 examples, and the sparsity levels of the coefficient vectors are set as {5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40} (i.e., the number of atoms of D# composing each training signal). For the training set with signal dimension 144, we generate 6 × 10^4 examples, and the sparsity levels of the coefficient vectors are set as {5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 50, 60, 70, 80}. Sparse-KSVD is run for 200 iterations, OSDL for 100, and BPGSDL for 2000. The sparsity levels of the coefficient vectors and sparse atoms required by Sparse-KSVD and OSDL are the true ones. For BPGSDL, the regularization parameters are set as τ = 0.1/√r, where r is the sparsity level of the coefficient vectors, and μ = 0.2. The sizes of the mini-batch for OSDL are set as 0.5 × 10^4 and 1.5 × 10^4 in the two setups, respectively. The recovery of each atom d# of the true dictionary D# is regarded as successful if

\[ \max_{1 \le j \le m} \frac{ | (d^{\#})^T \hat{d}_j | }{ \| d^{\#} \|_2 \, \| \hat{d}_j \|_2 } \ge 0.99, \]

where d̂_j is the jth atom of the recovered dictionary.

Fig. 1 depicts the average recovery rates of 10 independent trials. Generally speaking, we see from Fig. 1 that BPGSDL gets higher atom recovery rates than Sparse-KSVD and OSDL. As the sparsity levels of the coefficient matrix increase, the recovery rates of Sparse-KSVD and OSDL decrease faster than that of BPGSDL. More specifically, as the sparsity level increases to 30 in Fig. 1a and to 60 in Fig. 1b, Sparse-KSVD and OSDL hardly recover the atoms, while BPGSDL still holds high recovery rates. Table 1 shows the average run-time (in seconds) of 10 independent trials. We find that BPGSDL spends much less time than Sparse-KSVD and OSDL.

Fig. 1. Average recovery rates of 10 independent trials by implementing Sparse-KSVD, OSDL and BPGSDL. (a) For the training set with signal dimension 64; (b) for the training set with signal dimension 144. (Plot panels omitted; x-axis: number of atoms of D# composing each training signal; y-axis: recovery rate.)

Table 1. Average run-time (in seconds) of 10 independent trials by implementing Sparse-KSVD, OSDL and BPGSDL.

Algorithm      Signal dimension 64   Signal dimension 144
Sparse-KSVD    3363.8                49035.5
OSDL           3444.2                21044.8
BPGSDL         1471.5                13823
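The recovery criterion above is a normalized maximum correlation between each true atom and the recovered dictionary; a small NumPy sketch (our own naming, not the paper's evaluation script) is:

```python
import numpy as np

def recovery_rate(D_true, D_hat, thresh=0.99):
    """Fraction of true atoms matched by some recovered atom, up to sign and scale."""
    Dt = D_true / np.linalg.norm(D_true, axis=0, keepdims=True)
    Dh = D_hat / np.linalg.norm(D_hat, axis=0, keepdims=True)
    corr = np.abs(Dt.T @ Dh)              # |cosine| between every true/recovered atom pair
    return np.mean(corr.max(axis=1) >= thresh)

# Usage sketch: rate = recovery_rate(Phi @ A_true, Phi @ A_learned)
```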
4.2. Image-specific dictionary learning

In this experiment, we learn image-specific dictionaries on patches that are extracted with overlap from the image "Barbara" of size 512 × 512, with pixels scaled between 0 and 255. We uniformly and randomly select 1 × 10^5 patches as training examples, and 1 × 10^4 patches, different from the training ones, as test examples. Three setups of patch size are used for training and testing: 12 × 12, 20 × 20 and 32 × 32. The corresponding sparsity levels of the sparse dictionary are set as 25%, 15% and 8% non-zero elements, respectively. The sparsity level of the coefficient matrix is 20% non-zero elements in all setups. We construct a 2-D base dictionary using two 1-D Haar cropped wavelets dictionaries. All the algorithms are run for the same length of time. For BPGSDL, we experimentally set the regularization parameters τ and μ to yield approximately the same sparsity levels as those set for Sparse-KSVD and OSDL. In the three setups, τ is set as 4.5, 4.2 and 6.5, respectively; μ is set as 1 × 10^5, 0.95 × 10^5 and 2.2 × 10^5, respectively. The size of the mini-batch for OSDL is set as 1 × 10^3.

During the training process, orthogonal matching pursuit (OMP) is employed to test the capability of representation of the dictionaries trained by Sparse-KSVD, OSDL and BPGSDL. The root mean square error (RMSE) is used to evaluate the representation error, which is defined by

\[ \mathrm{RMSE} = \frac{1}{\sqrt{nN}} \| Y - \Phi A X \|_F. \tag{32} \]

Note that here Y is the test set that contains N test examples of dimension n, and X is the coefficient matrix of the test examples obtained by OMP. Fig. 2 depicts the RMSEs of the representations versus the training time (in seconds) on the test set. From the figure, we can see that BPGSDL achieves smaller RMSEs than Sparse-KSVD and OSDL, which means that the "Barbara"-specific dictionary trained by BPGSDL has a stronger capability of representation than those trained by Sparse-KSVD and OSDL. Moreover, the convergence speed of BPGSDL is faster than those of Sparse-KSVD and OSDL.

Fig. 2. The RMSEs of representations versus the training time (in seconds) on the test set. (a) Patch size 12 × 12, sparsity level of the sparse dictionary 25% non-zero elements; (b) patch size 20 × 20, sparsity level 15% non-zero elements; (c) patch size 32 × 32, sparsity level 8% non-zero elements. (Plot panels omitted.)
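The evaluation in (32) pairs a sparse coder with the learned pair (Φ, A); below is a schematic batch OMP plus RMSE computation in NumPy. It is a plain textbook OMP under our own naming, not the ompsbox implementation used in the paper, and it forms and column-normalizes the effective dictionary densely for simplicity.

```python
import numpy as np

def omp(D, y, n_nonzero):
    """Basic orthogonal matching pursuit for a single signal y ≈ D x."""
    m = D.shape[1]
    x = np.zeros(m)
    support, residual = [], y.copy()
    for _ in range(n_nonzero):
        j = int(np.argmax(np.abs(D.T @ residual)))   # most correlated atom
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x[support] = coef
    return x

def representation_rmse(Y_test, Phi, A, n_nonzero):
    """RMSE as in (32), coding each test example with OMP over D = Phi A."""
    D = Phi @ A
    D = D / np.maximum(np.linalg.norm(D, axis=0, keepdims=True), 1e-12)  # unit-norm atoms; coefficients absorb the scaling
    X = np.column_stack([omp(D, y, n_nonzero) for y in Y_test.T])
    n, N = Y_test.shape
    return np.linalg.norm(Y_test - D @ X) / np.sqrt(n * N)
```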
4.3. Universal dictionary learning

In this experiment, we learn universal dictionaries on patches extracted from non-specific natural images. Such general-purpose dictionaries can be used in many image restoration tasks. Natural images taken from the Berkeley Segmentation Database 500 (BSD500)⁶ [37] and the USC-SIPI image database⁷ [38] are used for training and testing; pixels are scaled between 0 and 255, and the patch size is set as 20 × 20. For BSD500, we uniformly and randomly extract 2 × 10^5 overlapping patches from the training images as training examples, and 2 × 10^4 patches from the test images as test examples. We chose two volumes of USC-SIPI: for Aerials (high-altitude aerial images), we uniformly and randomly extract 1.9 × 10^5 overlapping patches as training examples and 1.9 × 10^4 overlapping patches, different from the training ones, as test examples; for Textures (Brodatz textures, texture mosaics, etc.), we use the same manner to extract 1.92 × 10^5 overlapping patches as training examples and 1.92 × 10^4 overlapping patches as test examples.

⁶ Available at: http://www.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/resources.html.
⁷ Available at: http://sipi.usc.edu/database/.

We use two 1-D Haar cropped wavelets dictionaries to construct a 2-D base dictionary as in Experiment 4.2. The sparsity levels of the sparse dictionary and the coefficient matrix are both set as 20% non-zero elements. Both Sparse-KSVD and OSDL are run for 60 iterations, and BPGSDL for 4000. For BPGSDL, we experimentally set the regularization parameters to yield approximately the same sparsity levels as those set for Sparse-KSVD and OSDL: τ = 4.4 and μ = 7.5 × 10^4 for BSD500, τ = 3.8 and μ = 5.5 × 10^4 for USC-SIPI-Aerials, and τ = 8.2 and μ = 20.6 × 10^4 for USC-SIPI-Textures. The size of the mini-batch for OSDL is set as 2 × 10^3. OMP and FISTA are employed to test the capability of representation of the dictionaries trained by these algorithms. The representation error is measured by the RMSE defined in (32).

Fig. 3 depicts the RMSEs of the representations versus the sparsity levels of the coefficient matrix (percentage of non-zero elements) on BSD500, and Figs. 4 and 5 do so on USC-SIPI-Aerials and USC-SIPI-Textures, respectively. Generally speaking, the RMSEs of the representations become smaller when more atoms are used to compose the patches approximating the true ones in the test set. From the results tested by OMP, one can see that the RMSEs of BPGSDL are even worse than those of Sparse-KSVD and OSDL in some ranges of sparsity levels, especially from 10% to 25% on BSD500 and from 5% to 25% on USC-SIPI-Textures. We briefly give some reasons for this phenomenon as follows. Clearly, the sparse coding subproblem appears during learning and also after it, when the designed dictionary is used in applications and tests. As stated in [2], the same representation algorithm should be used in learning and after it, since the learning result is related to the choice of the representation algorithm. In other words, it is better to match the sparse representation method used in learning with that used in the application and test. Sparse-KSVD and OSDL employ OMP in the sparse coding step during learning, while BPGSDL uses (21) to deal with the sparse coding subproblem, giving Sparse-KSVD and OSDL somewhat of an advantage when tested by OMP. We also test these algorithms employing FISTA, and find that BPGSDL can achieve smaller RMSEs than Sparse-KSVD and OSDL; see the figures for details. Table 2 shows the training time of Sparse-KSVD, OSDL and BPGSDL on different databases. From Table 2 and the figures, one can see that BPGSDL spends much less time to train universal dictionaries and obtains competitive representation results.

Fig. 3. The RMSEs of representations versus the sparsity levels of the coefficient matrix (percentage of non-zero elements) on the BSD500 test set. (a) Tested by OMP; (b) tested by FISTA. (Plot panels omitted.)

Fig. 4. The RMSEs of representations versus the sparsity levels of the coefficient matrix (percentage of non-zero elements) on the USC-SIPI-Aerials test data. (a) Tested by OMP; (b) tested by FISTA. (Plot panels omitted.)

Fig. 5. The RMSEs of representations versus the sparsity levels of the coefficient matrix (percentage of non-zero elements) on the USC-SIPI-Textures test data. (a) Tested by OMP; (b) tested by FISTA. (Plot panels omitted.)

Table 2. The training time (in seconds) by implementing Sparse-KSVD, OSDL and BPGSDL on different databases.

Algorithm      BSD500     USC-SIPI-Aerials   USC-SIPI-Textures
Sparse-KSVD    80126.3    75020.8            76356.4
OSDL           69255.4    64616.2            64373.7
BPGSDL         33858.6    32578.3            33636.2
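For the FISTA-based test, each batch of test patches is sparse-coded by solving a lasso problem over the learned effective dictionary; a compact FISTA sketch (the standard algorithm of [27], with our own naming, a fixed iteration budget and an illustrative λ) is:

```python
import numpy as np

def fista_lasso(D, Y, lam, n_iter=200):
    """Solve min_X 0.5*||Y - D X||_F^2 + lam*||X||_{1,1} with FISTA."""
    L = np.linalg.norm(D.T @ D, 2)            # Lipschitz constant of the gradient
    X = Z = np.zeros((D.shape[1], Y.shape[1]))
    t = 1.0
    for _ in range(n_iter):
        G = Z - (D.T @ (D @ Z - Y)) / L       # gradient step at the extrapolated point
        X_new = np.sign(G) * np.maximum(np.abs(G) - lam / L, 0.0)   # soft shrinkage
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        Z = X_new + ((t - 1.0) / t_new) * (X_new - X)               # extrapolation
        X, t = X_new, t_new
    return X

# Usage sketch: X_test = fista_lasso(Phi @ A_learned, Y_test, lam=0.1)
# then the RMSE of (32) is np.linalg.norm(Y_test - Phi @ A_learned @ X_test) / np.sqrt(n * N).
```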
5. Conclusion

In previous works, sparse dictionary learning is formulated as a constrained optimization problem leveraging the l0 pseudo-norm to encourage the sparsities of coefficient vectors and sparse atoms, and Sparse-KSVD and OSDL were proposed for approximately solving this tricky problem. The high non-convexity and discontinuity of the l0 pseudo-norm make the proofs of convergence for Sparse-KSVD and OSDL challenging, and no works are devoted to theoretically analyzing their convergence. In contrast, we relax the l0 pseudo-norm to the convex entry-wise l1 norm and reformulate the constrained optimization problem as an unconstrained one leveraging the double-l1-regularizer, which makes the sparse dictionary learning problem bi-convex. Based on the BPG framework, an algorithm that effectively solves this bi-convex problem is derived, and its global convergence can be theoretically analyzed. Experimentally, BPGSDL outperforms Sparse-KSVD and OSDL in some cases. However, due to the batch processing, BPGSDL, like Sparse-KSVD, might fail to deal with a very large number of training examples if memory is limited. An online version of BPGSDL is left to future work.

Declaration of Competing Interest

The authors declare that they have no conflicts of interest regarding this work. We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.

References
[1] M. Elad, Sparse and Redundant Representations—From Theory to Applications in Signal and Image Processing, Springer, New York, 2010.
[2] B. Dumitrescu, P. Irofti, Dictionary Learning Algorithms and Applications, Springer, Gewerbestrasse, Switzerland, 2018.
[3] R. Rubinstein, M. Zibulevsky, M. Elad, Double sparsity: learning sparse dictionaries for sparse signal approximation, IEEE Trans. Signal Process. 58 (3) (2010) 1553–1564.
[4] J. Sulam, B. Ophir, M. Zibulevsky, M. Elad, Trainlets: dictionary learning in high dimensions, IEEE Trans. Signal Process. 64 (12) (2016) 3180–3193.
[5] Y. Xu, W. Yin, A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion, SIAM J. Imaging Sci. 6 (3) (2013) 1758–1789.
[6] P. Giselsson, S. Boyd, Monotonicity and restart in fast gradient methods, in: IEEE 53rd Annual Conference on Decision and Control (CDC), IEEE, 2014, pp. 5058–5063.
[7] B. O'Donoghue, E. Candes, Adaptive restart for accelerated gradient schemes, Found. Comput. Math. 15 (3) (2015) 715–732.
[8] H. Lee, A. Battle, R. Raina, A.Y. Ng, Efficient sparse coding algorithms, in: Advances in Neural Information Processing Systems, 2007, pp. 801–808.
[9] J. Mairal, F. Bach, J. Ponce, G. Sapiro, Online dictionary learning for sparse coding, in: Proceedings of the 26th Annual International Conference on Machine Learning, ACM, 2009, pp. 689–696.
[10] M. Yaghoobi, T. Blumensath, M.E. Davies, Dictionary learning for sparse approximations with the majorization method, IEEE Trans. Signal Process. 57 (6) (2009) 2178–2191.
[11] A. Rakotomamonjy, Direct optimization of the dictionary learning problem, IEEE Trans. Signal Process. 61 (22) (2013) 5495–5506.
[12] M. Sadeghi, M. Babaie-Zadeh, C. Jutten, Learning overcomplete dictionaries based on atom-by-atom updating, IEEE Trans. Signal Process. 62 (4) (2014) 883–891.
[13] C. Bao, H. Ji, Y. Quan, Z. Shen, Dictionary learning for sparse coding: algorithms and convergence analysis, IEEE Trans. Pattern Anal. Mach. Intell. 38 (7) (2016) 1356–1369.
[14] Y. Xu, W. Yin, A fast patch-dictionary method for whole image recovery, Inverse Probl. Imaging 10 (2) (2016) 563–583.
[15] T.-C. Lin, L. Hou, H. Liu, Y. Li, T.-K. Truong, Reconstruction of single image from multiple blurry measured images, IEEE Trans. Image Process. 27 (6) (2018) 2762–2776.
[16] T. Goldstein, S. Osher, The split Bregman method for l1-regularized problems, SIAM J. Imaging Sci. 2 (2) (2009) 323–343.
[17] I.Y. Chun, J.A. Fessler, Convolutional dictionary learning: acceleration and convergence, IEEE Trans. Image Process. 27 (4) (2018) 1697–1712.
[18] P. Bofill, M. Zibulevsky, Underdetermined blind source separation using sparse representations, Signal Process. 81 (11) (2001) 2353–2362.
[19] M. Zibulevsky, B.A. Pearlmutter, Blind source separation by sparse decomposition in a signal dictionary, Neural Comput. 13 (4) (2001) 863–882.
[20] P. Comon, C. Jutten, Handbook of Blind Source Separation: Independent Component Analysis and Applications, Academic Press, 2010.
[21] R. Shang, W. Wang, R. Stolkin, L. Jiao, Non-negative spectral learning and sparse regression-based dual-graph regularized feature selection, IEEE Trans. Cybernet. 48 (2) (2018) 793–806.
[22] Y. Meng, R. Shang, L. Jiao, W. Zhang, Y. Yuan, S. Yang, Feature selection based dual-graph sparse non-negative matrix factorization for local discriminative clustering, Neurocomputing 290 (2018) 87–99.
[23] N. Parikh, S. Boyd, Proximal algorithms, Found. Trends Optim. 1 (3) (2014) 127–239.
[24] P.L. Combettes, J.C. Pesquet, Proximal splitting methods in signal processing, in: Fixed-Point Algorithms for Inverse Problems in Science and Engineering, Springer, 2011, pp. 185–212.
[25] B.S. Mordukhovich, Variational Analysis and Generalized Differentiation I: Basic Theory, vol. 330, Springer Science & Business Media, 2006.
[26] D.L. Donoho, De-noising by soft-thresholding, IEEE Trans. Inf. Theory 41 (3) (1995) 613–627.
[27] A. Beck, M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci. 2 (1) (2009) 183–202.
[28] D. Kim, J.A. Fessler, Another look at the fast iterative shrinkage/thresholding algorithm (FISTA), SIAM J. Optim. 28 (1) (2018) 223–250.
[29] A. Beck, M. Teboulle, Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems, IEEE Trans. Image Process. 18 (11) (2009) 2419–2434.
[30] M.V. Zibetti, E.S. Helou, D.R. Pipa, Accelerating over-relaxed and monotone fast iterative shrinkage-thresholding algorithms with line search for sparse reconstructions, IEEE Trans. Image Process. 26 (7) (2017) 3569–3578.
[31] P. Tseng, On accelerated proximal gradient methods for convex–concave optimization, SIAM J. Optim. (2008). [Online]. Available: http://www.mit.edu/~dimitrib/PTseng/papers/apgm.pdf.
[32] D.P. Bertsekas, Nonlinear Programming, Athena Scientific, Belmont, 1999.
[33] J. Bolte, A. Daniilidis, A. Lewis, The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems, SIAM J. Optim. 17 (4) (2007) 1205–1223.
[34] J. Bolte, A. Daniilidis, A. Lewis, M. Shiota, Clarke subgradients of stratifiable functions, SIAM J. Optim. 18 (2) (2007) 556–572.
[35] J. Bochnak, M. Coste, M.F. Roy, Real Algebraic Geometry, vol. 36, Springer Science & Business Media, 2013.
[36] J. Bolte, S. Sabach, M. Teboulle, Proximal alternating linearized minimization for nonconvex and nonsmooth problems, Math. Program. 146 (1–2) (2014) 459–494.
[37] D. Martin, C. Fowlkes, D. Tal, J. Malik, A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics, in: Proceedings of the 8th International Conference on Computer Vision (ICCV), vol. 2, Vancouver, 2001, pp. 416–423.
[38] A.G. Weber, The USC-SIPI image database version 5, USC-SIPI Report, 1997.

Tao Zhu received the M.S. degree in 2014 from the College of Computer Science and Electronic Engineering, Hunan University, Changsha, China. In 2017, he began to pursue the Ph.D. degree at the School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China. His research interests include compressed sensing, sparse representation, inverse problems, and dictionary learning.