Use of PALM for ℓ1 sparse matrix factorization: Difficulty and rationalization of a two-step approach


Digital Signal Processing 97 (2020) 102611


Christophe Kervazo a,∗, Jérôme Bobin a, Cécile Chenot b, Florent Sureau a

a Gif-sur-Yvette, 91191 Cedex, France
b Institute for Digital Communications, University of Edinburgh, United Kingdom

Article history: Available online 14 November 2019

Keywords: Blind source separation; Sparsity; Minimization strategy; Large-scale; Regularization parameter choice

Abstract: Blind Source Separation (BSS) is a key machine learning method, which has been successfully applied to analyze multichannel data in various domains ranging from medical imaging to astrophysics. Since BSS is an ill-posed matrix factorization problem, it is necessary to introduce extra regularizing priors on the sources. While using sparsity has led to improved factorization results, the quality of the separation process turns out to depend dramatically on the minimization strategy and the regularization parameters. In this scope, the Proximal Alternating Linearized Minimization (PALM) algorithm has recently attracted a lot of interest as a generic, fast and highly flexible algorithm. Using PALM for sparse BSS is theoretically well grounded, but getting good empirical results requires a fine tuning of the involved regularization parameters, which might be too computationally expensive with real-world large-scale data, therefore mandating automatic parameter choice strategies. In this article, we first investigate the empirical limitations of using the PALM algorithm to perform sparse BSS and we explain their origin. Based on this, we further study and justify an alternative two-step algorithmic framework combining PALM with a heuristic approach, namely the Generalized Morphological Component Analysis (GMCA). This method enables an automatic parameter choice for the PALM step. Numerical experiments with comparisons to standard algorithms are carried out on two realistic experiments in spectroscopy and astrophysics. © 2019 Elsevier Inc. All rights reserved.

1. Introduction

1.1. Blind source separation problem

Blind source separation (BSS) has become a key analysis tool to learn meaningful decompositions of multivalued data. It has been shown to be very successful in a wide variety of scientific fields such as audio processing [1–5], biomedical data processing [6–8] or astrophysics [9], to cite only three. Assuming the standard linear mixture model holds [10], the m observations are supposed to be linear combinations of n sources, each of them having t samples. In matrix form, this can be written as X = A∗S∗ + N, where X (of size m × t) is the observation matrix corrupted with some unknown noise N, and S∗ (of size n × t) and A∗ (of size m × n, with n ≤ m in our setting) are called respectively the sources and the mixing matrix. The objective of BSS is to retrieve the mixing coefficients A∗ as well as the sources S∗ up to a scaling and permutation indeterminacy, which requires tackling an ill-posed unsupervised

* Corresponding author. E-mail address: [email protected] (C. Kervazo).


matrix factorization problem. Therefore, additional assumptions are mandatory to identify the mixture parameters. Classical constraints include the statistical independence of the sources (ICA [10]) or the non-negativity of A∗ and S∗ [11,12]. In this work, we will focus on the sparsity of the sources, which has been shown to enable enhanced separation quality in various matrix factorization problems [13–16]. Generally speaking, sparse BSS aims at minimizing a cost function of the following form:

$$\underset{\mathbf{A}\in\mathbb{R}^{m\times n},\,\mathbf{S}\in\mathbb{R}^{n\times t}}{\text{minimize}}\ \frac{1}{2}\left\|\mathbf{X}-\mathbf{A}\mathbf{S}\right\|_F^2+\left\|\mathbf{R}_S\odot\left(\mathbf{S}\mathbf{T}_S^{\top}\right)\right\|_1+\iota_{\{\forall i\in[1,n];\,\|\mathbf{A}^i\|_2\le 1\}}(\mathbf{A})\qquad(1)$$

• The $\frac{1}{2}\left\|\mathbf{X}-\mathbf{A}\mathbf{S}\right\|_F^2$ term promotes a faithful reconstruction of the data. We will further assume that the noise N is white Gaussian, which induces the use of the Frobenius norm $\|\cdot\|_F$.
• The $\left\|\mathbf{R}_S\odot(\mathbf{S}\mathbf{T}_S^{\top})\right\|_1$ term promotes the ℓ1 sparsity of the sources, with ⊙ denoting the Hadamard product. The matrix $\mathbf{T}_S$ (of size T × t, with T ≥ t) corresponds to the transform enforcing the sparsity. While in this article it will first be taken equal to the identity matrix (in which case the sparsity is enforced in the direct domain), the conclusions would be exactly the same for any orthogonal transform $\mathbf{T}_S$ [17]. We will further show empirically that the studied algorithm works well on a realistic experiment when using the non-orthogonal starlet transform [18]. The regularization parameters $\mathbf{R}_S$ (size n × T) control the trade-off between the data-fidelity and sparsity terms. $\mathbf{R}_S$ can be decomposed as $\mathbf{R}_S = \mathbf{\Lambda}_S\mathbf{W}$, where $\mathbf{\Lambda}_S$ (n × n) is a diagonal matrix of the regularization parameters λ1, λ2, ..., λn, and W (n × T) is a matrix used to introduce individual penalization coefficients in the context of reweighted ℓ1 [19] (when no reweighting is used, W is a matrix with all coefficients equal to 1). This work focuses on the ℓ1 norm, which has been shown to help the separation of the sources in the context of sparse matrix factorization [20].
• To avoid degenerate A and S matrices where $\|\mathbf{A}\|_F \to \infty$ and $\|\mathbf{S}\|_F \to 0$ due to the sparsity penalty, the last term enforces the oblique constraint, ensuring that for all i the ith column $\mathbf{A}^i$ of A lies in the ℓ2 unit ball: $\mathcal{J}(\mathbf{A}) = \iota_{\{\forall i\in[1,n];\,\|\mathbf{A}^i\|_2\le 1\}}(\mathbf{A})$, with ι being the characteristic function of the corresponding convex set.
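For concreteness, the full objective of Eq. (1) can be evaluated as follows. This is a minimal NumPy sketch (not from the paper), assuming $\mathbf{T}_S = \mathbf{Id}$; the function name `sparse_bss_objective` is illustrative:

```python
import numpy as np

def sparse_bss_objective(X, A, S, R_S):
    # Eq. (1) with T_S = Id: quadratic data fidelity plus entrywise-weighted
    # l1 penalty; the oblique constraint is encoded as a characteristic function.
    if np.any(np.linalg.norm(A, axis=0) > 1.0 + 1e-10):
        return np.inf
    return 0.5 * np.linalg.norm(X - A @ S, "fro") ** 2 + np.sum(R_S * np.abs(S))
```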

1.2. Sparse BSS, a practitioner perspective

So far, different algorithms have been designed for tackling sparse BSS problems. Among them, we can highlight the two following categories:

• Algorithms minimizing exactly problem (1) with fixed parameters: The problem described by Eq. (1) is multi-convex [21]. In sparse BSS, its exact form with fixed R_S parameters is usually tackled using one of the two following main algorithms: Block-Coordinate Descent (BCD [22]) and PALM [23]. Interestingly, both algorithms can provably converge to a minimizer of (1). However, using this approach, any BSS practitioner can draw the same conclusion: the regularization parameters R_S are generally difficult to tune without any first guess of the solution, and the output of sparse BSS methods is highly sensitive to them. Furthermore, in many practical cases it is not possible to consider re-launching the same algorithm several times with different parameters and initializations until decent results are obtained, especially when:
– the data are large, prohibiting multi-start strategies due to the computational burden. This is for instance the case in astrophysics (e.g. Planck¹ data: 450 × 10⁶ pixels; Chandra² data: up to 10⁹ pixels);
– no information is available to assess the quality of a solution, and therefore whether parameter and initialization choices were relevant or not.
In real-world applications, this strongly hinders the application of PALM, BCD or other such algorithms minimizing problem (1) with fixed parameters.
• Algorithms minimizing an approximation of problem (1) with changing parameters: Among them, Generalized Morphological Component Analysis (GMCA [24]) has been applied with success in astrophysics [25,9] and biomedical data processing [26]. GMCA is very convenient since it circumvents the cumbersome fixed parameter choice by proposing a heuristic technique to automatically tune the regularization parameters (which, in this approach, change during the iterations – cf. Sec. 4.1.2). It is noticeable that without this heuristic, the GMCA algorithm is neither reliable (e.g. it becomes highly dependent on the initial point) nor yields even decent separation results. Furthermore, while the convergence of GMCA is not guaranteed because it is built on projected Alternating Least-Squares (ALS [11]), in practice the heuristic helps it to stabilize numerically.

1.3. Contributions

Since problem (1) is a non-convex matrix factorization problem, the minimization strategy plays a decisive role. More formally, any sparse BSS algorithm yields an estimate Θ̂ of the true (denoted with an asterisk) mixture parameters Θ∗ = {A∗, S∗}. This estimate depends on the data X: through Θ∗ of course, but also through the noise N. To find a good estimate Θ̂, a practitioner has access to a given data set X but can choose a solver 𝒜, an initial point Θ^(0) and a set of regularization parameters R_S, which can be formulated as follows:

$$\hat{\Theta} = \mathcal{A}\left(\Theta^{(0)}, \mathbf{R}_S;\ \mathbf{X}\right)$$

More specifically, the goal of this article is to study the PALM minimization strategy in sparse BSS. Indeed, while it has become one of the most attractive algorithms for tackling generic matrix factorization problems [27–31], its application to this domain raises open questions:

• What are the major limitations of performing sparse BSS through minimizing Eq. (1) with PALM? The investigations presented in the following highlight the high sensitivity of PALM to the initial point (i.e. its low reliability³: the final estimate Θ̂ depends strongly on Θ^(0)), while performing sparse BSS by minimizing Eq. (1) with it requires a cumbersome choice of the regularization parameters R_S (due to a lack of efficiency – for a given X, the difficulty of finding some R_S yielding estimates Θ̂ close to Θ∗ – and of versatility – the difficulty of generalizing good R_S values for a given X to other datasets). Owing to this difficulty for the practitioner of setting the regularization parameters "by hand", PALM – and potentially any other algorithm performing sparse BSS through the minimization of Eq. (1) – is impractical if not associated with an automatic regularization parameter strategy. In particular, the choice of the regularization parameters is not always well discussed in the literature (despite being a difficult problem [32]; in many works it is either not addressed, or sums up to a grid search, or uses the true factorization [27–30]).
• Is it possible to improve the estimation yielded by the minimization of Eq. (1) by combining PALM with heuristic techniques? The hope is then twofold: i) reach the potentially accurate results of PALM when adapted regularization parameters are used; ii) benefit from the mathematical guarantees of PALM. It has to be emphasized that using heuristics in other matrix factorization algorithms has already met a wide success, both for sparse BSS [24] and for NMF [12]. We show how combining PALM with a heuristic algorithm (namely GMCA) inside a sequential two-step approach can enhance the separation quality by providing adapted initializations and regularization parameters.

1.4. Outline

Section 2 presents the PALM algorithm in the context of sparse BSS. In Section 3, we investigate the practical limitations of performing sparse BSS using it to minimize Eq. (1). In Section 4, we

1 https://www.cosmos.esa.int/web/planck/.
2 http://chandra.harvard.edu/.

3 The terms reliability, efficiency and versatility, while briefly explained here, will be detailed in Section 3.


show and explain the difficulty of straightforwardly extending existing heuristic techniques to PALM to make it easier to use. In Section 5, we re-use the results of Sections 3 and 4 to handle each element hindering the applicability of PALM in sparse BSS and, building on them, to rationalize a 2-step approach (with an initialization stage followed by a refinement procedure). In Section 6, the quality of the 2-step approach is demonstrated on realistic astrophysical data, and its limitations are studied.

1.5. Notations of the different matrices

In the following, matrices are written in bold upper case, e.g. A and S for the mixing matrix and the sources respectively. The true underlying physical matrices, i.e. the ones we are looking for, are denoted with a star index, e.g. A∗. The final estimate yielded by an algorithm is written with a hat, e.g. Â, while the one corresponding to the kth iteration is denoted A^(k). A tilde is used for the temporary estimate obtained before applying the proximal operator corresponding to the regularization: Ã. The notation Diag(λ1, ..., λn) stands for a square diagonal matrix of size (n × n) having (λ1, ..., λn) as diagonal values.

2. Algorithm minimizing problem (1) with fixed parameters: the PALM algorithm

In the following, we discuss the PALM algorithm. BCD is not studied in this work, the main reason being that BCD has in general a much higher computational cost (while providing, for given fixed regularization parameters, empirical results similar to PALM in many usual settings [33]). Indeed, at each iteration the subproblems in A and S need to be tackled exactly and fully. PALM was introduced in [23,34] and has driven intensive research, with a large number of applications and extensions in the scope of matrix factorization [35,36]. It is theoretically guaranteed to converge to a minimizer of (1) under conditions [23] that are verified in the sparse BSS setting studied in this article. Furthermore, provided that the proximal operators (cf. Appendix for the definition of a proximal operator) of the regularization terms are explicit and the gradients easy to compute, PALM iterations have a low computational cost.

More precisely, PALM is an iterative algorithm alternating between proximal gradient steps on A and S. In the context of sparse BSS corresponding to Eq. (1), PALM can be written as:

PALM(X, A^(0), S^(0)) – while the stopping criterion ε_k has not reached the desired value, iterate over (k):

1 - Update of S using the current version of A^(k−1):

$$\tilde{\mathbf{S}} = \mathbf{S}^{(k-1)} - \frac{\gamma}{\left\|\mathbf{A}^{(k-1)\top}\mathbf{A}^{(k-1)}\right\|_2}\,\mathbf{A}^{(k-1)\top}\left(\mathbf{A}^{(k-1)}\mathbf{S}^{(k-1)}-\mathbf{X}\right)\qquad(2)$$

$$\mathbf{S}^{(k)} = \mathcal{S}_{\frac{\gamma\mathbf{R}_S}{\left\|\mathbf{A}^{(k-1)\top}\mathbf{A}^{(k-1)}\right\|_2}}\left(\tilde{\mathbf{S}}\,\mathbf{T}_S^{\top}\right)\mathbf{T}_S\qquad(3)$$

2 - Update of A using the current version of S^(k):

$$\tilde{\mathbf{A}} = \mathbf{A}^{(k-1)} - \frac{\delta}{\left\|\mathbf{S}^{(k)}\mathbf{S}^{(k)\top}\right\|_2}\left(\mathbf{A}^{(k-1)}\mathbf{S}^{(k)}-\mathbf{X}\right)\mathbf{S}^{(k)\top}\qquad(4)$$

$$\mathbf{A}^{(k)} = \Pi_{\|\cdot\|_2\le 1}\left(\tilde{\mathbf{A}}\right)\qquad(5)$$

3 - Update of the stopping criterion:

$$\epsilon_k = \min_{j\in[1,n]}\left\langle \frac{\mathbf{A}_j^{(k)}}{\left\|\mathbf{A}_j^{(k)}\right\|_2},\ \frac{\mathbf{A}_j^{(k-1)}}{\left\|\mathbf{A}_j^{(k-1)}\right\|_2}\right\rangle$$

The adaptation of PALM to our cost function is detailed below:

• Update of S: The soft-thresholding operator $\mathcal{S}$ (see definition in the Appendix) can be shown to be the proximal operator corresponding to the ℓ1 norm in the case where – for the sake of simplicity – $\mathbf{T}_S$ is supposed to be an orthogonal transform. Furthermore, $\mathbf{A}^{(k-1)\top}(\mathbf{A}^{(k-1)}\mathbf{S}^{(k-1)}-\mathbf{X})$ is the gradient of the data-fidelity term with respect to S, and a Lipschitz modulus upper bound can be chosen as $\|\mathbf{A}^{(k-1)\top}\mathbf{A}^{(k-1)}\|_2$, where $\|\cdot\|_2$ is the spectral norm. The parameter γ is chosen in the range (0, 1), which is required to ensure the convergence.
• Update of A: The proximal operator corresponding to the oblique constraint is the projection of each column onto the Euclidean ℓ2 unit ball, which shall be denoted $\Pi_{\|\cdot\|_2\le 1}$ (cf. Appendix). Furthermore, $(\mathbf{A}^{(k-1)}\mathbf{S}^{(k)}-\mathbf{X})\mathbf{S}^{(k)\top}$ is the gradient of the data-fidelity term with respect to A, and a Lipschitz modulus upper bound can be chosen as $\|\mathbf{S}^{(k)}\mathbf{S}^{(k)\top}\|_2$. The parameter δ is chosen in the range (0, 1), which is required to ensure the convergence.
• Stopping criterion: The stopping criterion ε_k was chosen in this work as the cosine of the maximum angle between the columns of A^(k) and those of the previous estimate of the mixing matrix A^(k−1). The algorithm stops when ε_k becomes higher than a threshold τ fixed by the user, that is, when the changes in A become very small.
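To make the steps above concrete, here is a minimal NumPy sketch of these PALM iterations (a sketch under the assumptions $\mathbf{T}_S = \mathbf{Id}$ and no reweighting, not the authors' implementation; `palm_sparse_bss` and its defaults are illustrative):

```python
import numpy as np

def soft_threshold(U, thresholds):
    # Entrywise soft-thresholding; `thresholds` broadcasts against U.
    return np.sign(U) * np.maximum(np.abs(U) - thresholds, 0.0)

def palm_sparse_bss(X, A0, S0, R_S, gamma=0.9, delta=0.9, tau=1.0 - 1e-6,
                    max_iter=10000):
    A, S = A0.copy(), S0.copy()
    for _ in range(max_iter):
        A_prev = A.copy()
        # 1 - proximal gradient step on S (Eqs. (2)-(3))
        L_S = np.linalg.norm(A.T @ A, 2)               # Lipschitz upper bound
        S = soft_threshold(S - (gamma / L_S) * A.T @ (A @ S - X),
                           gamma * R_S / L_S)
        # 2 - projected gradient step on A (Eqs. (4)-(5))
        L_A = np.linalg.norm(S @ S.T, 2)
        A = A - (delta / L_A) * (A @ S - X) @ S.T
        A /= np.maximum(np.linalg.norm(A, axis=0, keepdims=True), 1.0)
        # 3 - stop when the largest angle between matching columns is small
        cos = np.sum((A / np.linalg.norm(A, axis=0)) *
                     (A_prev / np.linalg.norm(A_prev, axis=0)), axis=0)
        if cos.min() > tau:
            break
    return A, S
```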

3. Limitations of minimizing Eq. (1) with PALM to perform sparse BSS, an empirical study

The objective of this section is to empirically shed light on the limitations of using the PALM algorithm along with Eq. (1) to handle the sparse BSS problem. For that purpose, we will evaluate the accuracy of the final point estimate Θ̂ = {Â, Ŝ} yielded by PALM in terms of an estimator (which can be computed here only because the true matrices are simulated and therefore known). More specifically, we will highlight the limitations of using PALM to minimize problem (1) to perform sparse BSS in terms of efficiency, reliability and versatility. At this point, it might be important to highlight two elements:

• We study the results in terms of separation quality measured by an estimator (see Sec. 3.3), and not only in terms of the cost function (1). That is, we try to find a good critical point, i.e. one close to a perfect factorization knowing the forward model.
• The phenomena we study here could be inherent to the use of Eq. (1) for performing BSS, since the role of PALM is only to perform its minimization. However, in light of the previous remark and since (1) is non-convex, the use of another minimization scheme could lead to a different (local) minimum with a different quality in terms of the estimator. Therefore, another minimization algorithm could for instance be less sensitive to the regularization parameter values. In this work, we study the separation quality when minimizing Eq. (1) with PALM only (and therefore not problem (1) in general).

3.1. Gist of the experiments: what is to be studied?

The goal of this subsection is to determine which factors have a potential influence on the final separation quality and must therefore be taken into account in the study of PALM for sparse BSS. This can be inferred by detailing the first update step of the sources in the PALM algorithm:


$$\tilde{\mathbf{S}} = \mathbf{S}^{(0)} - \frac{\gamma}{\left\|\mathbf{A}^{(0)\top}\mathbf{A}^{(0)}\right\|_2}\,\mathbf{A}^{(0)\top}\left(\mathbf{A}^{(0)}\mathbf{S}^{(0)}-\mathbf{X}\right) = \mathbf{S}^{(0)} + \frac{\gamma}{\left\|\mathbf{A}^{(0)\top}\mathbf{A}^{(0)}\right\|_2}\left(\mathbf{A}^{(0)\top}\mathbf{A}^{*}\mathbf{S}^{*} - \mathbf{A}^{(0)\top}\mathbf{A}^{(0)}\mathbf{S}^{(0)} + \mathbf{A}^{(0)\top}\mathbf{N}\right)\qquad(6)$$

The gradient descent step is followed by the application of the proximal operator of the penalization term:

$$\mathbf{S}^{(1)} = \mathcal{S}_{\frac{\gamma\mathbf{R}_S}{\left\|\mathbf{A}^{(0)\top}\mathbf{A}^{(0)}\right\|_2}}\left(\tilde{\mathbf{S}}\right)\qquad(7)$$

Therefore, the best estimation of S∗ will potentially depend on the starting point Θ^(0) = {A^(0), S^(0)}, the true mixture parameters Θ∗ = {A∗, S∗}, the noise N and the regularization parameters R_S. In the following, after presenting the experimental data we will use and the evaluation protocol, we will explore in the following order:

• The impact of R_S for a given X on the quality of the estimate Θ̂;
• The impact of the true mixture parameters Θ∗ = {A∗, S∗} and N on the quality of the estimate Θ̂ and on the choice of R_S. More specifically, we will look at:
– the impact of a change in the statistical distribution of the mixture parameters Θ∗ and N;
– the impact of working on a new realization (but keeping the same statistical distribution) of the mixture parameters Θ∗ and N;
• The impact of the initial point Θ^(0), and more specifically A^(0), on the quality of the estimate Θ̂ and on the choice of R_S.

3.2. Description of the data

To bring out the mechanisms at stake, we will perform experiments using 4 different data sets. In each case, there are n = 2 sources.

• Case 1: The sources are assumed to be exactly sparse in the direct domain and to follow a Bernoulli-Gaussian distribution: a proportion p = 0.1 of the t = 500 samples is non-zero and drawn according to a standard normal distribution. The sources are equilibrated, i.e. they have equal variances. Their dynamic range (maximum minus minimum value) is circa 0.6. This example therefore corresponds to very simple synthetic data, but is a good model for spectroscopy experiments (e.g. Liquid Chromatography / Mass Spectroscopy [37] or γ-ray spectroscopy [38]). The mixing is performed through a matrix A∗ drawn randomly following a standard normal distribution and modified to have unit columns. Its condition number is C_d = 10 and we focus on the exactly determined case: there is an equal number of observations and sources, m = n. To complete the creation of the data X, Gaussian noise is added to the mixture, such that the Signal-to-Noise Ratio is SNR = 60 dB.
• Case 2: The sources are assumed to be approximately sparse in the direct domain and to follow a generalized Gaussian distribution of parameter α = 0.25. There are again t = 500 samples and the sources are still equilibrated, with a dynamic range of circa 0.6. The mixing is performed with the same A∗ matrix as in case 1 and an additive noise N is added in a similar way. Compared to case 1, case 2 corresponds to a more realistic setting, since the wavelet coefficients of natural images would have a similar distribution [17,33].
• Case 3: The sources are constructed in a similar way as in case 2. However, the noise energy of the first observation is twice that of the second observation. This could for instance be the case in realistic experiments in which the detectors do not have the same quality. Furthermore, A∗ is taken orthogonal (that is, A∗⊤A∗ = A∗A∗⊤ = Id) to ensure that the noise projection on the source space has different amplitudes. While this setting might be simpler than that of case 2, it allows us to study the impact of the noise on the parameter choice.
• Case 4: The sources come from simulations obtained from real data of the Cassiopeia A supernova remnant. These data originate from the Chandra X-ray observatory (cf. footnote 2). The sources at these wavelengths correspond to the thermal emission and the iron emission. As displayed in Fig. 1, each of them consists of a 2D image of resolution t = 128 × 128 pixels, supposed to be approximately sparse in the starlet domain (in which their dynamic ranges are circa 4 × 10⁻³ and 6 × 10⁻⁴ respectively). The mixing is performed with the same A∗ matrix as in case 1. Gaussian additive noise is added to the mixture, such that the SNR = 30 dB. Beyond being still more realistic than the other cases, case 4 involves non-equilibrated sources (that is, sources with different energies).

Fig. 1. True Chandra sources used in case 4.

3.3. Evaluation protocol

For each experimental scenario, a large number of values for the regularization parameters R_S are tested. More specifically, since no ℓ1 reweighting is used in this part of problem (1), the work will be done on Λ_S = R_S. As we focus on the case of n = 2 sources, there are 2 parameters to test, λ1 and λ2. For each (λ1, λ2) value, a criterion measuring the separation quality is evaluated. Since a change in the regularization parameters directly impacts the source estimation, we will use a global criterion on the mixing matrix [39]:

$$C_A = -10\log_{10}\left(\mathrm{mean}\left(\left|\mathbf{P}\hat{\mathbf{A}}^{\dagger}\mathbf{A}^{*}-\mathbf{Id}\right|\right)\right)\qquad(8)$$

where A∗ is the true mixing matrix and Â† the pseudo-inverse of the solution found by the algorithm, corrected through P for the scale and permutation indeterminacies. The mean is the average of all the elements inside the matrix. The higher C_A, the better the separation.
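In practice, Eq. (8) can be computed as below. This is one possible NumPy implementation: the exact scale/permutation correction P is not detailed in the text, so the brute-force search over permutations and the row rescaling are assumptions (fine for small n):

```python
import numpy as np
from itertools import permutations

def mixing_criterion(A_hat, A_true):
    # C_A of Eq. (8): the higher, the better the separation.
    n = A_true.shape[1]
    best = -np.inf
    for perm in permutations(range(n)):
        M = np.linalg.pinv(A_hat[:, list(perm)]) @ A_true
        M = M / np.diag(M)[:, None]      # fix the residual scale indeterminacy
        best = max(best, -10.0 * np.log10(np.mean(np.abs(M - np.eye(n)))))
    return best
```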

3.4. Results and interpretation

3.4.1. Sensitivity to the regularization parameters (efficiency)

In this subsection, we want to get an idea of the sensitivity of the PALM estimate to the regularization parameters Λ_S = R_S for a fixed dataset X. Therefore, we study one fixed experiment of each case and we try several (λ1, λ2) values. For each of them, the mixing matrix criterion C_A is computed from the estimate and reported in a 2D plot, displayed in Fig. 2. To separate as much as possible the sensitivity to R_S studied here from the sensitivity to the initial point, we launched the algorithm with 5 different random initializations A^(0) and the median of C_A is displayed.

Fig. 2. Median of C_A for 5 initializations of the PALM algorithm as a 2D function of the 2 thresholds corresponding to the n = 2 sources. (For interpretation of the colors in the figure(s), the reader is referred to the web version of this article.)

Fig. 3. C_A of the output of the PALM algorithm as a 2D function of the thresholds.

The first observation is the high variation induced by changes of the regularization parameters, with a dynamic range of more than 40 dB for case 1, 30 dB for case 2 and 20 dB for case 4. The results change from good to catastrophic, making the choice

of the regularization parameters a crucial point, especially as these high variations occur very quickly across the plot. In case 1, since the sources have identical distributions and the noise is supposed to be white, the best regularization parameters λ1 and λ2 would be expected to have similar values. However, while staying close to the diagonal could restrict the set of possible good parameters, it is unfortunately dangerous to fully restrict the search to the diagonal, as demonstrated in Fig. 3. In case 1, just using the parameters on the diagonal closest to the best ones of


Fig. 3a yields a drop of more than 10 dB (which might be linked to a too small number of samples for the noise to be perfectly white). In the non-equilibrated case 4, while not very visible in Fig. 2 due to the use of the median over different initializations, very good parameter settings appear far from the diagonal for specific initializations, as testified by Fig. 3b. Restricting the parameter search to the diagonal would hide these high-quality results. Even if we assume that it is possible to restrict the regularization parameter search to the diagonal, the variations are still quick in comparison to the dynamic range of the sources. In the simple case 1, a shift in the regularization parameters of 0.1% of the dynamic range of the sources, in a direction parallel to the diagonal, yields a 7 dB loss. In case 2, a shift of 5% yields an 8 dB drop, and for case 3, a 1% change implies an 8 dB drop. This highlights that, for a single experiment (i.e. for a given A∗, S∗ and N), the separation performances are highly sensitive to the choice of the regularization parameters. This is what we will call a lack of efficiency. Re-using the general notations of Section 1.3, it means that for a given dataset X (and initialization Θ^(0)), obtaining a solution Θ̂ close to the true Θ∗ requires a careful (cumbersome) choice of the regularization parameters, as the variance of C_A over R_S is high.

3.4.2. Impact of A∗, S∗ and N on the regularization parameters (versatility)

3.4.2.1. Inter-case versatility. The lack of efficiency observed in the last section could be of relatively limited impact if it were possible to generalize a good set of regularization parameters for a given dataset to new ones (as such, the cumbersome R_S choice could be done once by a costly grid-search approach, but re-used in other experiments). In this first subsection, we will study whether specific parameter choices for given distributions of the matrices A∗, S∗ and N remain relevant for other distributions. To do that, and for computational reasons, we will merely compare and explain the results displayed in Fig. 2 to highlight the difficulties implied by a change in the distributions of S∗ or of the back-projected noise A∗⊤N.

• Impact of S∗: The shapes of the plots of the 4 different cases strongly differ. In particular, the observations made in the previous subsection show that the best regularization parameter values are highly dependent on the distribution of the true sources S∗:
– In addition to removing the noise, the regularization parameters should limit the presence of remixing or interferences between the sources, which depend on the distribution of S∗, the mixing matrix A∗ as well as the initialization.
– The resulting thresholding induces a bias, called artifacts, which also depends on the distribution of S∗. Artifacts are also problematic as they can eventually convert into interferences between the sources during the iterative optimization process.
• Impact of the back-projection of N: In the vicinity of the true mixture parameters A∗ and S∗, the only source of error that contributes to the estimate of S originates from propagated noise. The role of the regularization parameters is then merely to avoid a deterioration of the estimate by the noise, that is, to threshold the update A∗⊤(X − A∗S∗) = A∗⊤N. Consequently, an intuitive choice of the regularization parameter for each source i would be $\lambda_i = \left\|(\mathbf{A}^{*\top}\mathbf{N})_i\right\|_\infty$, which implies a clear dependency between the thresholds and the back-projection of N through A∗⊤. This can be empirically confirmed through case 3. In this setting, A∗ is orthogonal and the energy of the noise N differs over the

observations. The thresholds enabling the convergence should therefore be different for the two sources: λ1 ≠ λ2. This is confirmed by Fig. 2c, which shows that, contrary to the other cases, the highest separation quality lies far from the diagonal. Thus, unbalanced back-projected observation noise (that is, with different variances on each source) can make the threshold choice much more difficult⁵ than in settings where the noise is equilibrated, due to optimal parameters lying further from the diagonal.

3.4.2.2. Intra-case versatility. The previous subsection has highlighted the difficulty of generalizing regularization parameters R_S found for given data distributions of S∗ and A∗⊤N to new distributions. A more restricted but nevertheless interesting property would still be the possibility of re-using some R_S found on a given realization of S∗ and A∗⊤N for a new realization, provided that the new and previous realizations follow the same distribution. Thus, we now study the impact of changing the realization of the data for a fixed case – more specifically, we will focus on cases 1 and 2. To do that, we draw for both cases 10 new random A∗, S∗ and N matrices and create new data. For each of these datasets, and to separate the question at hand from the sensitivity to the initialization, we try 10 initializations and take the median. We therefore get 10 plots similar to the ones of Fig. 2, each of them corresponding to a new random A∗, S∗ and N. To study the possibility of generalizing R_S choices to new data realizations, we look at the diversity among these plots through their dynamic range: for each (λ1, λ2) value, we plot the mixing criterion of the best estimation minus the mixing criterion of the worst estimation. This gives an idea of the intra-case variability for given parameters.

The results are plotted in Fig. 4. In the exactly sparse case 1, the dynamic range of circa 40 dB is huge compared to the results yielded by the best values of some settings (which are, in some realizations, of 24 dB). For the approximately sparse case 2, the dynamic range is relatively similar. As a result, even without changing the distribution of the random matrices (said differently, even if we know their distributions), choosing good (λ1, λ2) values is very difficult, because this choice is extremely dependent on the specific data realization that must be handled.

The conclusion of the two previous subsections is the low versatility of performing sparse BSS by minimizing Eq. (1) using PALM without any automatic regularization parameter choice. Indeed, generalizing good R_S to new datasets X is difficult, as the variance of C_A over different datasets X for given Θ^(0) and R_S is high. In particular, the inter-case versatility study has pointed out that the regularization parameters should be chosen based on S∗ as well as on the back-projected noise.

3.4.3. Impact of the initialization A^(0), S^(0) (reliability)

The impact of the initialization has two theoretical groundings:
– As problem (1) is not convex but multi-convex, the algorithm performing its minimization can be trapped in spurious critical points depending on the initial matrices A^(0) and S^(0).
– The initialization directly impacts the quality of the estimate yielded by specific thresholds through Eq. (7), as well as the threshold choice. Said differently, for a given initialization, it might be desirable to change the cost function via the choice of thresholds to avoid a spurious critical point that would not be an issue for another initialization.

5 Introducing a weighted Frobenius norm – a squared Mahalanobis distance – would avoid this situation, but requires a good estimate of the covariance matrix of the noise N.


Fig. 4. Dynamic range of C_A over different random realizations of A∗, S∗ and N.

Fig. 5. Dynamic range (maximum value minus minimum value) of C_A as a 2D function of the thresholds for 5 different initializations of the PALM algorithm.

To quantify this impact, the dynamic range of C_A over 5 different initializations A^(0) is plotted for case 4 in Fig. 5. The high values, up to more than 13 dB, are to be compared with the best results of Fig. 3b, that is, slightly less than 18 dB. This is mainly due to regularization parameters outside the diagonal, which unfortunately correspond to a high separation quality for some experiments. This means that choosing such parameters in these experiments can lead to almost the best performances, as well as to very bad ones (3 dB). Therefore, beyond the fact that the PALM results are highly connected to the initialization, the choice of good regularization parameters in terms of C_A is also connected to the initial point. The conclusion is thus that PALM suffers from a low reliability: the solution Θ̂ has a high dependence on the initial point Θ^(0), which is detrimental when little is initially known about the solution Θ∗. More specifically, for given R_S and X, the variance of C_A over Θ^(0) is high. In practice, to avoid re-launching the algorithm with different initializations for given regularization parameter values, it is therefore important to make the PALM estimate, as well as the regularization parameter choice, more robust to the initial conditions.

4. Enhancing PALM with heuristic techniques

To bypass the cumbersome parameter choice described in the previous section, the goal of this section is to show the issues of a straightforward adaptation within PALM of the heuristic parameter choice used in GMCA. In the first subsection, GMCA and its heuristic are presented, as well as some general remarks about the performances of GMCA. In Section 4.2, the heuristic parameter choice is adapted to the PALM framework. In Section 4.3, the heuristic is tested and shown to work badly within PALM, while in Section 4.4 these empirical results are explained.

4.1. A heuristic optimization method: GMCA

4.1.1. GMCA as a pALS algorithm

GMCA [24] builds upon projected ALS, which is described by iterating on the following steps (for the sake of simplicity, $\mathbf{T}_S$ is supposed to be the identity matrix):

GMCA(X, A^(0), S^(0)) – iterate over (k):

1 - Update S assuming a fixed A, with an automatic choice of M_S (see next subsection):

$$\mathbf{S}^{(k)} = \mathcal{S}_{\mathbf{M}_S}\left(\mathbf{A}^{(k-1)\dagger}\mathbf{X}\right)\qquad(9)$$

where A^(k−1)† is the pseudo-inverse of A^(k−1).

2 - Update A assuming a fixed S:

$$\mathbf{A}^{(k)} = \Pi_{\|\cdot\|_2\le 1}\left(\mathbf{X}\mathbf{S}^{(k)\dagger}\right)\qquad(10)$$

It has to be emphasized that GMCA does not rigorously minimize Eq. (1). Furthermore, updating the sources is performed by thresholding the least-squares estimate of S instead of taking a gradient step. As a result, the regularization parameters do not fully correspond to the ones in PALM when the estimate of the mixing matrix is not orthogonal. To further highlight this point, the parameters in GMCA will be denoted by M_S (size n × T) in Eq. (9), to be compared with R_S in Eq. (1) and in Eq. (3) for PALM. Note that here we will not use any reweighting (without loss of generality) when using GMCA, that is, M_S = Diag(μ1, μ2, ..., μn) 1_{n×T}, where 1_{n×T} is a matrix of size n × T filled with ones. An important difference between GMCA and PALM is that GMCA cannot be proved to converge (while it stabilizes in most practical cases after some iterations).
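For illustration, here is a minimal NumPy sketch of these projected-ALS iterations, using the MAD-based threshold rule introduced in the next subsection (not the authors' implementation; the fixed iteration count and p value are assumptions):

```python
import numpy as np

def gmca(X, A0, n_iter=100, p=3.0):
    A = A0 / np.linalg.norm(A0, axis=0)
    S = np.zeros((A.shape[1], X.shape[1]))
    for _ in range(n_iter):
        # 1 - least-squares source estimate, soft-thresholded (Eq. (9));
        #     the thresholds follow the MAD heuristic of Eq. (11).
        S_ls = np.linalg.pinv(A) @ X
        mad = np.median(np.abs(S_ls - np.median(S_ls, axis=1, keepdims=True)),
                        axis=1, keepdims=True)
        S = np.sign(S_ls) * np.maximum(np.abs(S_ls) - p * mad, 0.0)
        # 2 - least-squares mixing matrix, columns projected onto the
        #     l2 unit ball (Eq. (10))
        A = X @ np.linalg.pinv(S)
        A /= np.maximum(np.linalg.norm(A, axis=0, keepdims=True), 1.0)
    return A, S
```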

4.1.2. Heuristic parameter choice in GMCA

In GMCA, the thresholding applies to the least-squares estimate of the sources. Said differently, it is performed after the inversion of the mixing matrix. This leads to a simple interpretation of


the parameters (μ1, μ2, ..., μn) of GMCA, allowing for a heuristic thresholding strategy that yields good practical results and robustness. In GMCA, the regularization parameter choice is performed so that the thresholding removes a dense signal partially coming from the back-projection into the source space of the data noise N. In practice, this dense signal removal is performed by computing the Median Absolute Deviation⁶ (MAD) of each source and setting the corresponding threshold to a multiple p of this value:

$$(\mu_1, \mu_2, \ldots, \mu_n)^{\top} = p \times \mathrm{MAD}\left(\mathbf{A}^{(k-1)\dagger}\mathbf{X}\right)\qquad(11)$$

In this equation and in the remainder of this section, $\mathbf{T}_S$ is supposed to be the identity matrix without loss of generality, and $\mathbf{A}^{(k-1)\dagger}\mathbf{X}$ corresponds to the current source estimate (cf. Eq. (9)). The parameter p is a positive number (generally taken between 1 and 5) and the MAD operator is computed row-by-row. In GMCA, the regularization parameters are not fixed: they vary during the iterations, thus defining a whole path of values, contrary to PALM where they do not change.
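The key property behind this choice – the robustness of the MAD to a sparse contamination, used in Eq. (12) below and in footnote 7 – is easy to check numerically (illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
dense = 0.01 * rng.standard_normal(500)   # stand-in for back-projected noise
sparse = np.zeros(500)
sparse[:50] = rng.standard_normal(50)     # 10% active, sparse source samples

def mad(u):
    return np.median(np.abs(u - np.median(u)))

# Both values are close to 0.6745 * 0.01: the sparse part barely moves the MAD.
print(mad(dense), mad(dense + sparse))
```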

If we suppose that, during the iterations, the estimate found by GMCA becomes close to the true mixing matrix A∗, it can be shown that:

$$(\mu_1, \mu_2, \ldots, \mu_n)^{\top} = p \times \mathrm{MAD}\left(\mathbf{A}^{*\dagger}\mathbf{X}\right) = p \times \mathrm{MAD}\left(\mathbf{A}^{*\dagger}(\mathbf{A}^{*}\mathbf{S}^{*} + \mathbf{N})\right) = p \times \mathrm{MAD}\left(\mathbf{S}^{*} + \mathbf{A}^{*\dagger}\mathbf{N}\right) \simeq p \times \mathrm{MAD}\left(\mathbf{A}^{*\dagger}\mathbf{N}\right)\qquad(12)$$

where the last line is obtained because of the sparsity assumption on S∗, as the MAD operator is robust to a sparse contamination⁷ [33]. It has however to be emphasized that this noise-removal interpretation of the thresholding is limited to the case where the estimate of the mixing matrix is close to A∗. In the opposite case, the imperfect demixing results in an increase of the thresholds due to extra noise coming from interferences between the sources. In GMCA, this initial increase of the thresholds enables a coarse-to-fine approach exploiting the morphological diversity hypothesis among the sources [17]. More specifically, high thresholds during the first iterations of GMCA keep only the most significant coefficients of the sources, which are assumed to enable a better estimation of the mixing matrix [24].

4.1.3. Remark about the performances of GMCA

The interest of GMCA is that it benefits from the automatic choice of M_S described in the previous subsection. It has been empirically shown to improve the robustness with respect to the initialization [26]. This automatic parameter tuning strategy also finds parameter values that provide reasonable to accurate estimates in general. Therefore, the practical success of the GMCA algorithm relies on its good reliability as well as on the heuristic automatically tuning M_S, enabling both efficiency and versatility. To give an idea of the performances of GMCA in the experimental setting of Sec. 3, we launched it on cases 1 and 2, for 10 different initializations and 10 settings of A∗, S∗ and N. The results are summarized in Table 1. Compared to PALM, the dynamic range over different realizations of A∗, S∗ and N is much lower than that of PALM

6 For a vector $\mathbf{u} \in \mathbb{R}^t$, $\mathrm{MAD}(\mathbf{u}) = \mathrm{median}_{i\in[1,t]}\,|u_i - \mathrm{median}_{i\in[1,t]}(u_i)|$. We further extend this definition to matrices by taking their row-wise MAD: $\mathrm{MAD}: \mathbb{R}^{n\times t} \to \mathbb{R}^n$, such that $\forall i \in [1,n]$, $\forall \mathbf{U} \in \mathbb{R}^{n\times t}$, $\mathrm{MAD}(\mathbf{U})_i = \mathrm{MAD}(\mathbf{U}_i)$.
7 Indeed, a contamination of a dense signal by a sparse one changes very few entries, and thus the MAD value is not much affected.

Table 1
GMCA results in terms of C_A.

Value displayed                        Case 1         Case 2
Average                                36.3           20.0
Dynamic range over A∗, S∗ and N        8.6            14.2
Minimum value over A∗, S∗ and N        29.7           15.8
Variance over initializations          3.2 × 10⁻¹²    5.6 × 10⁻¹³

(at least for regularization parameter values for which PALM works decently), and the worst values are still good, meaning a much higher versatility. The automatic parameter choice described in the previous subsection yields a high mean of C_A, showing the efficiency of GMCA. The reliability seems to be extremely high, as shown by the very low variance over the initializations. However, the results of GMCA are not always as good as the ones yielded by PALM with the best parameters. This might be due to the fact that the use of the pseudo-inverse A^(k)† in GMCA can strongly increase the noise back-projection when A^(k) is badly conditioned. As an example of such a difference between GMCA and PALM, the result of GMCA is 36.3 dB in the experiment of case 1 displayed in Fig. 3, to be compared with a little more than 40 dB for PALM. This partially justifies the will to make PALM easy to use: to benefit from its potentially high accuracy when it is well tuned.

4.2. Adaptation of the GMCA heuristic parameter choice to PALM: a first idea

The goal here is to explain a straightforward adaptation to PALM of the automatic parameter choice of GMCA. For the sake of simplicity, let us assume that there is no reweighting: R_S = Λ_S. The update of S by PALM in Eq. (3) is then:



$$\mathbf{S}^{(k)} = \mathcal{S}_{\frac{\gamma\mathbf{\Lambda}_S}{\left\|\mathbf{A}^{(k-1)\top}\mathbf{A}^{(k-1)}\right\|_2}}\left(\tilde{\mathbf{S}}_{\mathrm{PALM}}\right)\qquad(13)$$

Under the same assumptions, the update of GMCA is the following (with $\tilde{\mathbf{S}}_{\mathrm{GMCA}} = \mathbf{A}^{(k-1)\dagger}\mathbf{X}$):



$$\mathbf{S}^{(k)} = \mathcal{S}_{\mathbf{M}_S}\left(\tilde{\mathbf{S}}_{\mathrm{GMCA}}\right)\qquad(14)$$

Taking into account the differences between $\tilde{\mathbf{S}}_{\mathrm{PALM}}$ and $\tilde{\mathbf{S}}_{\mathrm{GMCA}}$, the parallel between Eqs. (13) and (14) suggests that Λ_S could be chosen similarly to the parameters in GMCA (in which case the parameters change over the iterations, but are fixed at the end to ensure the convergence of the algorithm):

$$\frac{\gamma}{\left\|\mathbf{A}^{*\top}\mathbf{A}^{*}\right\|_2}\,(\lambda_1, \lambda_2, \ldots, \lambda_n)^{\top} = p \times \mathrm{MAD}\left(\tilde{\mathbf{S}}_{\mathrm{PALM}}\right)\qquad(15)$$

If so, and if PALM has converged to both the true matrices A∗ and S∗, then:

$$\frac{\gamma}{\left\|\mathbf{A}^{*\top}\mathbf{A}^{*}\right\|_2}(\lambda_1, \lambda_2, \ldots, \lambda_n)^{\top} = p \times \mathrm{MAD}\left(\mathbf{S}^{*} - \frac{\gamma}{\left\|\mathbf{A}^{*\top}\mathbf{A}^{*}\right\|_2}\mathbf{A}^{*\top}\left(\mathbf{A}^{*}\mathbf{S}^{*} - \mathbf{X}\right)\right) \simeq p \times \frac{\gamma}{\left\|\mathbf{A}^{*\top}\mathbf{A}^{*}\right\|_2}\,\mathrm{MAD}\left(\mathbf{A}^{*\top}\mathbf{N}\right)\qquad(16)$$

where the last line is obtained because S∗ is assumed sparse and the MAD is robust to outliers. Therefore, using the MAD enables a thresholding of a projection of the noise N, which yields a similar interpretation as in GMCA. This parallel must however be tempered:

• It only holds when and if PALM has converged towards good A and S;

• The projection is not performed through the pseudo-inverse as in GMCA, but rather through A∗⊤. Both projections however coincide when A∗ is orthogonal.

Remark. Choosing the regularization parameters in general inverse problems has been the subject of several studies, which propose various ways of setting them. Among them, one can highlight the Stein's Unbiased Risk Estimator (SURE) method (and its extended versions [40–43]), the generalized cross-validation [44,45], the L-curve [46], the discrepancy principle (or some variants [47]) and some Bayesian methods [48,49]. However, most of these are not directly tractable in the case of sparse BSS, for several reasons: i) we are dealing with the blind setting, in which the linear operator A must also be evaluated, making the use of these methods more complicated; ii) some of these methods can be computationally expensive in the large-scale setting, since they require trying several regularization parameters and computing a criterion to decide which one to choose; iii) some of them, such as SURE, use the estimated MSE as the criterion for the parameter choice, and it is not clear whether this criterion is the most relevant one in sparse BSS (cf. [50]); iv) some of them have mostly been applied in the case of linear solutions (e.g. by using an ℓ2 regularization instead of ℓ1 – this has been emphasized in the context of deconvolution in [47]). Therefore, choosing the regularization parameters using the MAD seems one of the most interesting solutions, since it has already led to enhanced separation quality within the context of sparse BSS when using GMCA.

4.3. Illustration

Fig. 6. Mixing matrix criterion for PALM and the 2-step strategy (cf. Sec. 5) on case 1. The results of GMCA are also plotted as a baseline, with a fixed p = 3, which is a classical value due to the corresponding hypothesis in terms of Gaussian noise removal. Left: the dashed line is the average of C_A over the different A∗, S∗ and N, and the error bars correspond to the quartiles of the criterion over the initializations; center: the dashed line corresponds to the average of C_A for one specific A∗, S∗ and N, and the error bars to the quartiles of the criterion over the initializations; right: the dashed line corresponds to the average of C_A over the initializations, and the error bars to the quartiles of the criterion over the realizations of A∗, S∗ and N.

The automatic parameter setting based on the MAD described in the previous subsection is tried inside a PALM algorithm for different p values. The parameter p is the same for both sources, since the MAD is supposed to be adaptive enough to the noise. The metric on the separation quality, C_A, is the same as defined in Sec. 3.3 and the data are the same as in case 1: two exactly sparse sources with a square mixing matrix having a condition number of 10. The experiments are conducted with 5 realizations of A∗, S∗ and N and 10 different initializations.

The results are displayed in Fig. 6 (red curve). The low efficiency is highlighted by two points: i) the low average values of C_A, reaching at most 1 dB in the leftmost plot of Fig. 6; ii) even in the experiments for which the strategy works, it is very difficult to choose a good p value, as seen with the upper quartile of the rightmost plot. Furthermore, for one specific experiment (see the plot in the center of Fig. 6), the standard deviation over the different initializations is very high compared to the average value of C_A, which denotes a lack of reliability. Finally, the lack of versatility is shown by the high error bars in the rightmost plot. The goal of the following subsection is to explain these empirical results.

4.4. Understanding the limitations of the heuristic in the scope of PALM

We previously pointed out that the thresholds are not identical in GMCA and PALM. A key difference is that the thresholding applies to different quantities:

• In GMCA, it is applied to a least-squares estimate of the sources, after a direct inversion of the mixing matrix;
• In PALM, it is applied to a single gradient descent step, that is, during the inversion.

Therefore, the estimation errors that originate from both an imperfect unmixing and the noise are different in the two algorithms. Since it is the role of the thresholds to filter out these estimation errors, they do not have the same impact, and the optimal ones may differ from one algorithm to the other. To better understand this role, let us assume that both the GMCA and PALM algorithms are initialized with the true mixing matrix A∗ only. Let us further assume, on the contrary, that the initialization of the sources is not perfect and can be written as $\mathbf{S} = \mathbf{S}^* + \boldsymbol{\epsilon}_S$, with $\boldsymbol{\epsilon}_S$ the error made on the sources. For GMCA, the thresholds given by the MAD heuristic are then:

$$(\mu_1, \mu_2, \ldots, \mu_n)^{\top} = \mathrm{MAD}\left(\mathbf{A}^{*\dagger}\mathbf{X}\right) = \mathrm{MAD}\left(\mathbf{S}^{*} + \mathbf{A}^{*\dagger}\mathbf{N}\right) \simeq \mathrm{MAD}\left(\mathbf{A}^{*\dagger}\mathbf{N}\right)\qquad(17)$$

where the last equality is obtained because the MAD operator is not sensitive to sparse signals. In this case, the thresholds are thus set according to the back-projection of the noise only, and an imperfect estimation of S∗ does not affect the thresholding strategy as long as the mixing matrix is well estimated. In contrast, when using PALM the thresholds are given by

$$\begin{aligned}
\frac{\gamma}{\left\|\mathbf{A}^{*\top}\mathbf{A}^{*}\right\|_2}(\lambda_1, \lambda_2, \ldots, \lambda_n)^{\top}
&= p \times \mathrm{MAD}\left(\mathbf{S} - \frac{\gamma}{\left\|\mathbf{A}^{*\top}\mathbf{A}^{*}\right\|_2}\mathbf{A}^{*\top}\left(\mathbf{A}^{*}\mathbf{S} - \mathbf{X}\right)\right)\\
&= p \times \mathrm{MAD}\left(\mathbf{S} - \frac{\gamma}{\left\|\mathbf{A}^{*\top}\mathbf{A}^{*}\right\|_2}\mathbf{A}^{*\top}\left(\mathbf{A}^{*}(\mathbf{S}^{*}+\boldsymbol{\epsilon}_S) - \mathbf{X}\right)\right)\\
&\simeq p \times \frac{\gamma}{\left\|\mathbf{A}^{*\top}\mathbf{A}^{*}\right\|_2}\,\mathrm{MAD}\left(\mathbf{A}^{*\top}\mathbf{N} - \mathbf{A}^{*\top}\mathbf{A}^{*}\boldsymbol{\epsilon}_S\right)
\end{aligned}\qquad(18)$$


The previous equation highlights the following (and unfortunately complementary) issues of using the MAD within PALM:

• An unadapted threshold choice: Compared to GMCA, the thresholds are calculated with the additional detrimental interference term $\mathbf{A}^{*\top}\mathbf{A}^{*}\boldsymbol{\epsilon}_S$. Indeed, they are computed after only one gradient step, and therefore when the interferences are still high. Another interpretation is that the MAD is computed on an approximation of the minimization of the data-fidelity term, namely a single gradient step update, and its results are therefore not as accurate as in GMCA. This hinders the interpretation of the use of the MAD in terms of noise removal, particularly when the remixing term $\mathbf{A}^{*\top}\mathbf{A}^{*}\boldsymbol{\epsilon}_S$ has more energy than the noise term $\mathbf{A}^{*\top}\mathbf{N}$. More specifically, this is the case when the columns of A∗ are strongly correlated, as testified by the product A∗⊤A∗, or when the error $\boldsymbol{\epsilon}_S$ on the sources is high (which is in particular the case if the initialization is random or if there are many artifacts).
• Higher interferences in the estimation process: The gradient update on the sources creates at each iteration a remixing of the error $\boldsymbol{\epsilon}_S$ due to the $\mathbf{A}^{*\top}\mathbf{A}^{*}\boldsymbol{\epsilon}_S$ term. While this could be a limited issue if A were fixed, since in PALM only one proximal gradient step is performed on S before the update of the mixing matrix, A will be computed from sources with high interferences. In particular, this issue is very important with high thresholds, which induce artifacts and therefore a large $\boldsymbol{\epsilon}_S$, creating more interferences (which in turn feeds the problem of the previous point, creating more artifacts, and so on). High thresholds are therefore not that valuable in PALM. Consequently, contrary to GMCA in which the $\mathbf{A}^{*\top}\mathbf{A}^{*}\boldsymbol{\epsilon}_S$ term is not present, it is harder to exploit the morphological diversity assumption. Indeed, while the most significant coefficients are still assumed to be the most interesting for the separation, selecting only them will create interferences counterbalancing their positive effect.

To conclude, in PALM high thresholds are less interesting than in GMCA, and using the MAD heuristic does not result in good separations in practice if the level of the interferences is not negligible in comparison to the noise contribution.

5. Combining GMCA and PALM: a hybrid strategy

The previous part has demonstrated that the introduction of a heuristic based only on the back-projection of the noise does not yield satisfactory results within a single PALM. We therefore rationalize an approach tackling the other elements (cf. Sec. 3) hindering the applicability of PALM in sparse BSS by combining the best of GMCA and PALM in a two-step approach. The first, warm-up stage consists in a GMCA. It is followed by a refinement stage during which PALM is performed, retaining as much information as possible from the warm-up stage. Inside the refinement stage, the MAD heuristic is used to choose the PALM parameters. The main idea of the two-step approach is to reduce the interferences through a good initialization of the refinement stage, and the artifacts through reweighting information coming from the warm-up stage. It has to be emphasized that, beyond a preliminary version of the present work [51], this 2-step approach has already been empirically used in two of our previous works [52,53]. Therefore, the novelty of the following is to provide an in-depth justification of this approach, while the previous works were only empirical. More specifically, this justification describes in detail all the elements required for this approach to work, with theoretical arguments along with experiments confirming them on new datasets.

5.1. Motivation of the two-step approach

Our 2-step approach has several motivations:

• Dealing with the factors deteriorating the results of sparse BSS (in terms of versatility, efficiency and reliability) when using PALM to minimize Eq. (1): While the MAD heuristic enables the refinement stage to handle the noise back-projection, using some reweighting information from the warm-up stage makes it possible to take into account the distribution of S∗, which was the second factor in the lack of versatility of using PALM in the context of sparse BSS (cf. Sec. 3). The automatic parameter choice then circumvents, in the refinement stage, the difficult parameter choice and the lack of efficiency. Finally, using a reliable algorithm such as GMCA as a warm-up stage yields a reliable global 2-step algorithm. Furthermore, it provides a good initialization of PALM (which is reached using decreasing regularization parameters and benefiting from the morphological diversity, in contrast to what is usually done when using a single PALM with fixed regularization parameters).
• Enabling the use of the MAD heuristic: One of the main advantages of combining GMCA and PALM is to initialize PALM with the outputs Â_GMCA and Ŝ_GMCA of GMCA. Since both Â_GMCA and Ŝ_GMCA are close to A∗ and S∗, the level of the interferences is relatively low in comparison to the noise. Therefore, following the conclusion of Sec. 4.4, the MAD is still relatively accurate for deriving the thresholds once at the beginning of the refinement step, and the interpretation in terms of noise removal is more accurate than with a poor initialization, due to smaller interferences. In addition, the interferences are further indirectly reduced by the use of the reweighting, which reduces the artifacts that would be transformed into interferences by the gradient step.
• Keeping mathematical guarantees and the high potential accuracy of PALM: While GMCA is only a proxy, the 2-step algorithm attempts to solve Eq. (1) exactly, as PALM does. It further benefits from the potentially high accuracy of PALM (cf. Sec. 4.1.3). Moreover, since PALM is proved to converge under mild assumptions [23], so is the 2-step algorithm.

5.2. Use information on S∗ from GMCA: reweighted ℓ1

As evoked in the previous subsection, it is possible to take into account in the refinement stage the information on S∗ coming from the warm-up stage. To that end, minimizing the artifacts can be carried out by improving the sparse regularizer. This is done by resorting to a reweighted ℓ1 regularizer [19]. In this setting, one can benefit from the first-guess estimate Ŝ_GMCA to compute the reweighting matrix W in problem (1). This generally leads to a significant decrease of the artifacts, which eventually reduces the importance of the interferences introduced by the $\mathbf{A}^{*\top}\mathbf{A}^{*}\boldsymbol{\epsilon}_S$ term. This will enhance the efficiency of the proposed hybrid heuristic, whose interpretation is based on noise removal only. In this work, we use the following reweighting scheme:

$$W_i^j = \frac{\epsilon}{\epsilon + \frac{\left|\hat{\mathbf{S}}_i^j\right|}{\left\|\hat{\mathbf{S}}_i\right\|_\infty}}\qquad(19)$$

where ε is a small constant (here 10⁻³), $W_i^j$ is the coefficient of W corresponding to the ith row and jth column, and $\hat{\mathbf{S}}_i$ is the ith row of Ŝ. In brief, the thresholds are lowered for the largest samples of the estimated sources, reducing the bias introduced by the soft-thresholding and therefore the error $\boldsymbol{\epsilon}_S$ and the interferences.
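A minimal sketch of this reweighting, assuming the form of Eq. (19) reconstructed above (`reweighting_matrix` is an illustrative name):

```python
import numpy as np

def reweighting_matrix(S_hat, eps=1e-3):
    # Eq. (19): weights in (0, 1], close to 1 for negligible samples and small
    # for the largest samples of each source, so that their thresholds (and
    # hence the l1 bias) are lowered.
    scale = np.maximum(np.abs(S_hat).max(axis=1, keepdims=True), 1e-12)
    return eps / (eps + np.abs(S_hat) / scale)
```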


5.3. Final algorithm

The previous remarks lead to the following algorithm, using a smart initialization and thresholding strategy:

Input: X (data matrix)

• Warm-up stage:
Random initialization of A^(0) and S^(0)
Â_GMCA, Ŝ_GMCA = GMCA(X, A^(0), S^(0))
• Refinement stage:
Update of the reweighting information in Eq. (1):

$$W_i^j = \frac{\epsilon}{\epsilon + \frac{\left|\hat{\mathbf{S}}_{\mathrm{GMCA},i}^j\right|}{\left\|\hat{\mathbf{S}}_{\mathrm{GMCA},i}\right\|_\infty}}\qquad(20)$$

Update of the parameters Λ_S in Eq. (1):

$$\frac{\gamma}{\left\|\hat{\mathbf{A}}_{\mathrm{GMCA}}^{\top}\hat{\mathbf{A}}_{\mathrm{GMCA}}\right\|_2}(\lambda_1, \lambda_2, \ldots, \lambda_n)^{\top} = p \times \mathrm{MAD}\left(\hat{\mathbf{S}}_{\mathrm{GMCA}} - \frac{\gamma}{\left\|\hat{\mathbf{A}}_{\mathrm{GMCA}}^{\top}\hat{\mathbf{A}}_{\mathrm{GMCA}}\right\|_2}\hat{\mathbf{A}}_{\mathrm{GMCA}}^{\top}\left(\hat{\mathbf{A}}_{\mathrm{GMCA}}\hat{\mathbf{S}}_{\mathrm{GMCA}} - \mathbf{X}\right)\right)\qquad(21)$$

PALM step, with the initialization coming from GMCA:
Â_PALM, Ŝ_PALM = PALM(X, Â_GMCA, Ŝ_GMCA)
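Putting the pieces together, the two-step procedure can be sketched as follows (illustrative code reusing the `gmca`, `mad`, `reweighting_matrix` and `palm_sparse_bss` sketches above; not the authors' implementation):

```python
import numpy as np

def two_step(X, n, p=3.0, gamma=0.9):
    # Warm-up stage: GMCA from a random initialization.
    rng = np.random.default_rng(0)
    A_gmca, S_gmca = gmca(X, rng.standard_normal((X.shape[0], n)))
    # Refinement stage: reweighting (Eq. (20)) and MAD thresholds (Eq. (21)).
    W = reweighting_matrix(S_gmca)
    L = np.linalg.norm(A_gmca.T @ A_gmca, 2)
    S_tilde = S_gmca - (gamma / L) * A_gmca.T @ (A_gmca @ S_gmca - X)
    lambdas = (L / gamma) * p * np.apply_along_axis(mad, 1, S_tilde)[:, None]
    # PALM step initialized with the GMCA outputs, with R_S = Lambda_S W.
    return palm_sparse_bss(X, A_gmca, S_gmca, lambdas * W, gamma=gamma)
```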

5.4. Complexity of the algorithm

The goal of this subsection is to derive the complexity of one iteration of each of the two steps of the algorithm.

5.4.1. Initialization stage

Each iteration of the warm-up stage can be decomposed into the following elementary steps: i) the pseudo-inverse is computed using the singular value decomposition of an n × n matrix, which yields an overall complexity of O(n³ + m²n + nmt); ii) the thresholding strategy first requires the evaluation of the threshold values, which has a complexity of O(nt); iii) the soft-thresholding step itself has a complexity of O(nt); and iv) the update of A is finally performed using a conjugate gradient algorithm, whose complexity is known to depend on the number of non-zero entries in S and on the condition number C_d(S) of this matrix. An upper bound for this complexity is $O(nt\sqrt{C_d(\mathbf{S})})$. The final estimate of the complexity of a single iteration is thus given by

m2n + n3 + nmt + nt C d (S)

(22)

with C d (S) the condition number of S. 5.4.2. Refinement stage For the refinement stage, i) the update of S is dominated by the multiplication required for A T (X − AS),which has a O (nmt ) complexity, and the computation of A T A2 having an overall com-

plexity of O (n2 m) (for instance, using the power method). The thresholding and weight computation complexities are negligible; ii) similarly, for A update the complexity is O (n2 t + nmt ). Therefore, the final estimate of the complexity of a single iteration of the refinement stage is given by

nmt + n2 (m + t )

(23)
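As a rough order-of-magnitude check, the per-iteration costs of Eqs. (22) and (23) can be evaluated for the dimensions of the experiment of Sec. 6.1; the condition number below is an arbitrary assumption, and the O(·) constants are ignored.

```python
# Per-iteration operation counts from Eqs. (22)-(23), constants ignored.
m, n, t = 6, 3, 1024          # dimensions of the LC-1H NMR experiment (Sec. 6.1)
cond_S = 10.0                 # assumed condition number C_d(S)
warmup = m**2 * n + n**3 + n * m * t + n * t * cond_S**0.5   # Eq. (22)
refinement = n * m * t + n**2 * (m + t)                      # Eq. (23)
print(f"warm-up: ~{warmup:.2e} ops/iter, refinement: ~{refinement:.2e} ops/iter")
```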

5.5. Illustration

The experimental protocol is the same as described in Sec. 4.3 (two exactly sparse sources with a square mixing matrix having a condition number of 10 and a SNR of 60 dB), except that the 2-step algorithm is used instead of PALM. The results are plotted in Fig. 6. With values of C_A higher than 33 dB, the demixing is close to the best ones obtained with the exhaustive search in Sec. 3. Compared to PALM alone, the variance of the results over different initializations is also largely improved, being closer to zero, which shows the increased reliability of the algorithm with regard to the initialization.

6. Application to two realistic data separation problems

In this section, we provide two realistic BSS experiments to demonstrate the relevance of the 2-step algorithm on practical BSS problems. Its limitations are then studied in the last section.

6.1. Application to simulated LC/¹H NMR data

6.1.1. Description of the data

The data come from a simulated LC-¹H NMR (Liquid Chromatography - ¹H Nuclear Magnetic Resonance) experiment, whose goal is to identify each of the chemical compounds present in a fluid, as well as their concentrations. The LC-¹H NMR experiment enables a first, imperfect physical separation during which the fluid goes through a chromatography column and its chemicals are separated according to their speeds (which depend on their physical properties). The spectrum of the output of the column is then measured at given times. These measurements of the spectra at different times can be used to feed a BSS algorithm to refine the imperfect physical separation. The n = 3 sources (with t = 1024 samples each) are composed of elementary sparse theoretical spectra of chemical compounds taken from the SDBS database,⁸ which are further convolved with a Laplacian having a width of 3 samples to simulate a given spectral resolution. The mixing matrix A∗ of size (m, n) = (6, 3) is pictured in Fig. 7, the objective being to have a matrix that could be consistent with the first imperfect physical separation: it would correspond to two chemicals having similar speeds / elution times, and one being better isolated through the physical separation process. Some additive Gaussian noise with various SNRs is eventually added (note that relatively high SNRs can generally be expected in such experiments). For each noise level, the BSS algorithms are launched with 10 different mixing matrix initializations.

Fig. 7. Example of A∗ matrix with 3 sources and 6 observations.
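For illustration, a hypothetical sketch of how one such source could be simulated follows; only the number of samples and the Laplacian width of 3 samples come from the text, while the peak positions and amplitudes are our own assumptions.

```python
import numpy as np

# Illustrative simulation of one LC-1H NMR source: a sparse theoretical
# spectrum convolved with a Laplacian of width 3 samples to mimic a
# finite spectral resolution. Peak positions/amplitudes are assumed.
rng = np.random.default_rng(0)
t = 1024
spectrum = np.zeros(t)
peaks = rng.choice(t, size=20, replace=False)        # assumed peak positions
spectrum[peaks] = rng.uniform(0.2, 1.0, size=20)     # assumed peak amplitudes
lag = np.arange(-10, 11)
kernel = np.exp(-np.abs(lag) / 3.0)                  # Laplacian kernel, width 3
kernel /= kernel.sum()
source = np.convolve(spectrum, kernel, mode="same")  # simulated source, t samples
```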

8 National Institute of Advanced Industrial Science and Technology (AIST), Spectral database for organic compounds: http://sdbs.db.aist.go.jp.


Table 2
C_A for 5 SNR values and different algorithms.

                                               10 dB   15 dB   20 dB   30 dB   60 dB
2-step                                         18.0    21.6    21.7    32.2    45.9
PALM                                           3.14    3.01    3.13    3.20    3.08
GMCA                                           17.8    20.0    20.8    31.7    29.9
EFICA                                          9.93    13.8    16.7    24.5    25.2
RNA                                            1.88    3.53    11.9    21.9    13.7
2-step with random refinement initialization   18.2    21.2    21.5    32.2    44.8

Fig. 8. Example of source retrieved by the 2-step algorithm, superimposed on the true source.

6.1.2. Results

The results of the 2-step algorithm are presented in Table 2, along with those of other algorithms [54,55] (the median over the initializations is displayed). The "PALM" line corresponds to a PALM algorithm equipped with the MAD heuristic described in Section 4 (which is already a potential improvement over an exhaustive search on the regularization parameters when the interferences are low compared to the noise level). An example of a source detail retrieved by the 2-step algorithm for a SNR of 30 dB is displayed in Fig. 8. As can be seen, the source estimation is both qualitatively and quantitatively very good. In particular, it is better than those of EFICA, RNA, isolated PALM or GMCA.

The last line of Table 2 is a variant of the 2-step approach, whose objective is to better understand which elements of the 2-step approach are the most crucial to make it work well. The algorithm is the same as the previous 2-step approach, but the initialization of the refinement step is random instead of being the one given by GMCA. That is, the goal of the warm-up stage is here only to find good Λ_S values (including the reweighting). As the results are on average almost as good as those of the full 2-step approach, we can conjecture that in such an experiment the choice of the regularization parameters is much more crucial than that of the initialization.⁹ It has however to be emphasized that

⁹ Note that the regularization parameter choice and the initialization are nevertheless coupled, since this choice is made using the good initialization yielded by GMCA. Therefore, it is not easy to remove the need for a good first estimate of S∗.

such a conclusion could be different in experiments in which more local minima are present (such as when n is higher [52]).

6.2. Application to simulated astrophysical data

6.2.1. Description of the data

The second experiment is carried out on simulated astrophysical data, generated from real Chandra¹⁰ observations of the Cassiopeia A supernova remnant (this experiment is thus similar to case 4 of part 3). The observations are made of a linear combination of three components: the synchrotron emission and two redshifted iron (Fe) emission lines. More precisely, the emission lines correspond to the telescope impulse response to Dirac-shaped emission lines, which are modeled as Gaussian-shaped spectra. These lines are centered about different energy values, which depend on the relative speed of propagation of each iron component due to the Doppler effect. The synchrotron component has a power emission law. These components are representative of typical supernova remnants in the energy band 5000–6000 eV (electron-volt). Compared to case 4, there are now n = 3 sources (which creates more interferences and makes the unmixing more difficult). These are displayed in Fig. 10. Furthermore, contrary to case 4 where A∗ (of size (m, n) = (12, 3)) was taken equal to the one of cases 1 and 2 to enable simple comparisons, A∗ is now obtained from realistic simulations (cf. Fig. 9). Finally, 5 different levels of additional noise are tested: SNR = 10, 15, 20, 30 and 60 dB (in practice, realistic SNR levels lie between 10 and 35 dB).
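As a purely illustrative sketch, a Chandra-like mixing matrix with this structure could be built as follows; the power-law index, the line centers and the line width are our own assumptions, not those of the actual simulations.

```python
import numpy as np

# Hypothetical mixing matrix in the 5000-6000 eV band (m = 12 channels):
# one power-law synchrotron spectrum and two Gaussian-shaped,
# Doppler-shifted Fe emission lines. All numerical values are assumed.
energies = np.linspace(5000.0, 6000.0, 12)
synchrotron = energies ** -2.0                   # power emission law (index assumed)

def gauss_line(e0, width=60.0):
    """Gaussian-shaped emission line centered at energy e0 (assumed width)."""
    return np.exp(-0.5 * ((energies - e0) / width) ** 2)

A = np.stack([synchrotron, gauss_line(5350.0), gauss_line(5550.0)], axis=1)
A /= np.linalg.norm(A, axis=0, keepdims=True)    # unit l2-norm columns
```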

¹⁰ http://chandra.harvard.edu/.


Fig. 9. Realistic Chandra A matrix with 12 channels.

6.2.2. Results

Results for different SNR values in terms of C_A are displayed in Table 3. The original sources, the ones estimated for a SNR of 30 dB, and the residual between the two are shown in Fig. 10. To make the comparison fair with RNA [54] and EFICA [55], the data were pre-whitened in this subsection.

Table 3
C_A for 5 SNR values and 5 algorithms.

         10 dB   15 dB   20 dB   30 dB   60 dB
2-step   15.0    16.3    17.4    19.7    20.9
PALM     11.9    13.3    13.5    14.2    14.5
GMCA     13.2    14.8    15.1    17.1    18.6
EFICA    8.8     10.3    14.0    18.9    19.4
RNA      9.8     12.6    15.6    18.3    18.4

The results of the 2-step algorithm are here again good, both for the estimation of A∗ (cf. Table 3) and of S∗ (cf. Fig. 10). In this experiment, the 2-step algorithm always obtains better results than both PALM and GMCA, with a gain of about 2 dB for all tested SNRs.

This again highlights the advantage provided by the 2-step approach compared to either GMCA or PALM alone. Interestingly, the PALM algorithm alone provides rather reasonable separation results in these experiments. It has to be noticed that in this setting the condition number of the mixing matrix is low, i.e. 1.8. Therefore, since A∗ is close to orthogonal, the A∗T A∗ε_s error term of Eq. (18) is quite low. This is precisely the regime in which the MAD heuristic can perform correctly within the PALM algorithm. In agreement with the results of the previous subsection, the standard sparse ICA-based methods, namely RNA and EFICA, perform rather poorly at low SNR, which highlights their higher sensitivity to noise. In contrast, they perform similarly to (or even better than) GMCA in the low-noise regime.

Fig. 10. Top: true sources S∗; middle: sources estimated by the 2-step algorithm (for a mixing with 30 dB of noise); bottom: residual between the top and middle images.


Fig. 11. Left: C_A as a function of the SNR (in dB), with a condition number of A∗ of 10. To plot the figure, the mean over the realizations has been used; the dashed line corresponds to the mean over the different A∗ and the error bars to the corresponding quartiles. Right: practical influence of the two terms of Eq. (18) in the experiment.

6.3. Discussion about the limitations of the 2-step strategy

While the 2-step approach has been shown to perform very well in the two above realistic experiments, its limitations and applicability are now discussed in more detail. To further understand in which regimes the 2-step approach is valuable or not, let us have a closer look at Eq. (18), where two important terms must be highlighted: i) the back-projected error on the first guess sources, A∗T A∗ε_s, and ii) the noise back-projection, A∗T N. For a fixed starting point (i.e. fixed error ε_s), both the condition number of the mixing matrix A∗ and the SNR define different regimes.

As an illustration, experiments are performed with the astrophysical data described in the previous subsection, but with random non-negative mixing matrices having a larger condition number of 10. We carry out 10 Monte-Carlo simulations over A∗, with 10 different random initializations each. The results are displayed in Fig. 11a, in which the average over the initializations is used; the error bars correspond to the quartiles of the results for the different A∗. For larger condition numbers, the re-mixing effect in the PALM iterations plays a prominent role, since it tends to concentrate the first guess errors in the subspace spanned by the eigenvectors associated with the largest eigenvalues of the Gram matrix of A∗. Consequently, the MAD heuristic is more likely to yield badly estimated regularization parameters, and the results of the 2-step approach compared to GMCA alone are worse than in the previous subsection. The gain will likely depend on the relative levels of A∗T A∗ε_s and A∗T N, which are displayed in Fig. 11b.
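On simulations, where A∗, N and the warm-up error ε_s are all accessible, the two regimes can be told apart with a simple diagnostic such as the sketch below; the function name and the choice of the Frobenius norm are ours.

```python
import numpy as np

def dominant_term(A_star, N, eps_S):
    """Compare the two error terms of Eq. (18): the back-projected first
    guess error A*^T A* eps_S against the back-projected noise A*^T N."""
    remix = np.linalg.norm(A_star.T @ A_star @ eps_S)  # interference term
    noise = np.linalg.norm(A_star.T @ N)               # noise term
    return "noise-dominated" if noise >= remix else "interference-dominated"
```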

When the noise level is large enough, the term A∗T N dominates and the proposed MAD heuristic performs correctly, which is favorable for the 2-step approach. This can be observed when the SNR is below 25 dB but not too small: for still smaller SNRs, GMCA itself does not work well, so that the 2-step approach shows deteriorated performance due to a bad initialization (this is probably due to the pseudo-inverse in GMCA of a badly conditioned A in the presence of strong noise). For larger SNRs (in the range 25–50 dB in Fig. 11a), the term A∗T A∗ε_s becomes dominant. In that case, the 2-step approach might degrade the GMCA solution. Finally, for even larger SNRs, the contribution of the noise to the thresholds is almost zero. While the MAD therefore yields a bad estimation of the optimal parameters, the thresholds are nevertheless low because of the low interferences due to the relatively good initial point yielded by GMCA. As a result, the solution yielded by the refinement step does not deviate significantly from a simple minimization of the data-fidelity term. Since in GMCA the data-fidelity term is minimized at each iteration, the solution of the warm-up stage is close to one of its minimizers, and therefore the solution of the 2-step algorithm does not change much from the one yielded by GMCA. Said differently, in this regime, even if the thresholds are bad, they are low enough for the refinement stage estimate to stay close to the one yielded by the warm-up stage, which is already quite good because GMCA performs well with low noise.

7. Conclusion

The applicability of sparse BSS methods to real-world data, especially large-scale ones, largely depends on the design of reliable, efficient and versatile methods. Methods such as GMCA, which present these properties, rely on heuristics that do not however guarantee any optimality of the estimate. Conversely, the estimation using the PALM algorithm to minimize an ℓ1-regularized data-fidelity term, which is now a standard approach to tackle generic matrix factorization problems due to its mathematical guarantees, does not always exhibit such characteristics. To develop an optimization framework merging the best of the two worlds, we first investigated the behavior of PALM in the context of sparse BSS. These investigations show and explain the dramatic sensitivity of the estimate with respect to the regularization parameters (lack of efficiency and versatility) and to the initialization (lack of reliability). To mitigate these limitations, we rationalized a hybrid approach that circumvents the robustness issue of PALM with respect to the initialization and provides a proxy to automatically tune the regularization parameters for various data. Numerical experiments on both simulated and realistic data demonstrate the quality of the proposed approach with respect to the three desired characteristics. They also exhibit an improved separation accuracy with respect to state-of-the-art methods.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work is supported by the European Community through the grant LENA (ERC StG - contract no. 678282) and by the MOD University Defense Research Collaboration (UDRC) for signal processing in the networked battlespace, number P/K014277/1. JB would like to thank Fabio Acero for the Chandra simulations.


Appendix A

A.1. Definition of proximal operators

The proximal operator of an extended-valued, proper and lower semi-continuous convex function f : ℝⁿ → (−∞, ∞] is defined as:

$$\mathrm{prox}_f(v) = \operatorname*{argmin}_x \left( f(x) + \frac{1}{2}\|x - v\|_2^2 \right) \qquad (24)$$

where v ∈ ℝⁿ.
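As a classical bridge between definitions (24) and (25), the proximal operator of f = λ| · | is exactly the soft-thresholding operator; a short derivation in this standard one-dimensional setting:

$$\mathrm{prox}_{\lambda|\cdot|}(v) = \operatorname*{argmin}_{x \in \mathbb{R}} \left( \lambda|x| + \frac{1}{2}(x - v)^2 \right) \quad\Longleftrightarrow\quad 0 \in \lambda\,\partial|x| + (x - v),$$

so that x = v − λ sign(v) when |v| ≥ λ, and x = 0 otherwise, i.e. prox_{λ|·|} = S_λ of Eq. (25).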

where v ∈ Rn . A.2. Definition of the soft thresholding operator For two scalar λ and b, the soft thresholding operator Sλ (.) is defined as:

$$\forall b \in \mathbb{R},\ \forall \lambda \in \mathbb{R}^+,\quad \mathcal{S}_\lambda(b) = \begin{cases} b - \lambda \times \mathrm{sign}(b) & \text{if } |b| \geq \lambda \\ 0 & \text{otherwise} \end{cases} \qquad (25)$$

For two matrices Λ and B, we extend our notation to:

$$\forall B \in \mathbb{R}^{n \times m},\ \forall \Lambda \in \mathbb{R}_+^{n \times m},\ \forall i \in [1, n],\ \forall j \in [1, m],\quad \{\mathcal{S}_\Lambda(B)\}_{ij} = \mathcal{S}_{\Lambda_{ij}}(B_{ij}) \qquad (26)$$
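A minimal NumPy implementation of Eqs. (25)-(26) follows; the closed form sign(b) max(|b| − λ, 0) used below is equivalent to Eq. (25).

```python
import numpy as np

def soft_thresholding(B, Lam):
    """Entry-wise soft-thresholding of Eqs. (25)-(26). Lam can be a
    scalar or an array of non-negative thresholds broadcastable to B."""
    return np.sign(B) * np.maximum(np.abs(B) - Lam, 0.0)
```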

A.3. Definition of the projection of the columns of a matrix B on the ℓ2 unit ball

We define the projection of a column vector b on the ℓ2 unit ball as:

$$\forall b \in \mathbb{R}^m,\quad \pi_{\|\cdot\|_2 \leq 1}(b) = \begin{cases} b & \text{if } \|b\|_2 \leq 1 \\ b \,/\, \|b\|_2 & \text{otherwise} \end{cases} \qquad (27)$$

We extend the notation to a matrix B by projecting all its columns on the ℓ2 unit ball:

$$\forall B \in \mathbb{R}^{m \times n},\ \forall j \in [1, n],\quad \{\Pi_{\|\cdot\|_2 \leq 1}(B)\}^j = \pi_{\|\cdot\|_2 \leq 1}(B^j) \qquad (28)$$
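A minimal NumPy implementation of Eqs. (27)-(28):

```python
import numpy as np

def project_columns_l2_ball(B):
    """Column-wise projection on the l2 unit ball: columns with a norm
    smaller than 1 are left unchanged, the others are rescaled."""
    norms = np.linalg.norm(B, axis=0, keepdims=True)
    return B / np.maximum(norms, 1.0)
```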

References

[1] E. Vincent, M.G. Jafari, S.A. Abdallah, M.D. Plumbley, M.E. Davies, Probabilistic modeling paradigms for audio source separation, in: Machine Audition: Principles, Algorithms and Systems, IGI Global, 2011, pp. 162–185.
[2] E. Vincent, C. Févotte, R. Gribonval, L. Benaroya, X. Rodet, A. Röbel, E. Le Carpentier, F. Bimbot, A tentative typology of audio source separation tasks, in: 4th Int. Symp. on Independent Component Analysis and Blind Signal Separation (ICA), 2003, pp. 715–720.
[3] A. Ozerov, C. Févotte, Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation, IEEE Trans. Audio Speech Lang. Process. 18 (3) (2010) 550–563.
[4] N.Q. Duong, E. Vincent, R. Gribonval, Under-determined reverberant audio source separation using a full-rank spatial covariance model, IEEE Trans. Audio Speech Lang. Process. 18 (7) (2010) 1830–1840.
[5] C. Févotte, N. Bertin, J.-L. Durrieu, Nonnegative matrix factorization with the Itakura-Saito divergence: with application to music analysis, Neural Comput. 21 (3) (2009) 793–830.
[6] T.-P. Jung, S. Makeig, C. Humphries, T.-W. Lee, M.J. Mckeown, V. Iragui, T.J. Sejnowski, Removing electroencephalographic artifacts by blind source separation, Psychophysiology 37 (2) (2000) 163–178.
[7] F. Negro, S. Muceli, A.M. Castronovo, A. Holobar, D. Farina, Multi-channel intramuscular and surface EMG decomposition by convolutive blind source separation, J. Neural Eng. 13 (2) (2016) 026027.
[8] M.-Z. Poh, D.J. McDuff, R.W. Picard, Non-contact, automated cardiac pulse measurements using video imaging and blind source separation, Opt. Express 18 (10) (2010) 10762–10774.
[9] J. Bobin, F. Sureau, J.-L. Starck, A. Rassat, P. Paykari, Joint PLANCK and WMAP CMB map reconstruction, Astron. Astrophys. 563 (2014) A105.
[10] P. Comon, C. Jutten, Handbook of Blind Source Separation: Independent Component Analysis and Applications, Academic Press, 2010.
[11] N. Gillis, F. Glineur, Accelerated multiplicative updates and hierarchical ALS algorithms for nonnegative matrix factorization, Neural Comput. 24 (4) (2012) 1085–1105.
[12] A. Vandaele, N. Gillis, F. Glineur, D. Tuyttens, Heuristics for exact nonnegative matrix factorization, J. Glob. Optim. 65 (2) (2016) 369–400.
[13] J. Bobin, J.-L. Starck, Y. Moudden, M.J. Fadili, Blind source separation: the sparsity revolution, Adv. Imaging Electron Phys. 152 (1) (2008) 221–302.
[14] M. Zibulevsky, B.A. Pearlmutter, Blind source separation by sparse decomposition in a signal dictionary, Neural Comput. 13 (4) (2001) 863–882.
[15] Y. Li, S.-I. Amari, A. Cichocki, D.W. Ho, S. Xie, Underdetermined blind source separation based on sparse representation, IEEE Trans. Signal Process. 54 (2) (2006) 423–437.
[16] J. Le Roux, F.J. Weninger, J.R. Hershey, Sparse NMF – half-baked or well done?, Mitsubishi Electric Research Labs (MERL), Cambridge, MA, USA, Tech. Rep. TR2015-023.
[17] J.-L. Starck, F. Murtagh, J.M. Fadili, Sparse Image and Signal Processing: Wavelets, Curvelets, Morphological Diversity, Cambridge University Press, 2010.
[18] J.-L. Starck, J. Fadili, F. Murtagh, The undecimated wavelet decomposition and its reconstruction, IEEE Trans. Image Process. 16 (2) (2007) 297–309.
[19] E.J. Candes, M.B. Wakin, S.P. Boyd, Enhancing sparsity by reweighted ℓ1 minimization, J. Fourier Anal. Appl. 14 (5–6) (2008) 877–905.
[20] J. Rapin, J. Bobin, A. Larue, J.-L. Starck, Sparse and non-negative BSS for noisy data, IEEE Trans. Signal Process. 61 (22) (2013) 5620–5632.
[21] Y. Xu, W. Yin, A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion, SIAM J. Imaging Sci. 6 (3) (2013) 1758–1789.
[22] P. Tseng, Convergence of a block coordinate descent method for nondifferentiable minimization, J. Optim. Theory Appl. 109 (3) (2001) 475–494.
[23] J. Bolte, S. Sabach, M. Teboulle, Proximal alternating linearized minimization for nonconvex and nonsmooth problems, Math. Program. 146 (1–2) (2014) 459–494.
[24] J. Bobin, J.-L. Starck, J.M. Fadili, Y. Moudden, Sparsity and morphological diversity in blind source separation, IEEE Trans. Signal Process. 16 (11) (2007) 2662–2674.
[25] E. Chapman, F.B. Abdalla, J. Bobin, J.-L. Starck, G. Harker, V. Jelić, P. Labropoulos, S. Zaroubi, M.A. Brentjens, A. de Bruyn, et al., The scale of the problem: recovering images of reionization with generalized morphological component analysis, Mon. Not. R. Astron. Soc. 429 (1) (2012) 165–176.
[26] J. Rapin, A. Souloumiac, J. Bobin, A. Larue, C. Junot, M. Ouethrani, J.-L. Starck, Application of non-negative matrix factorization to LC/MS data, Signal Process. 123 (2016) 75–83.
[27] A. Repetti, M.Q. Pham, L. Duval, E. Chouzenoux, J.-C. Pesquet, Euclid in a taxicab: sparse blind deconvolution with smoothed ℓ1/ℓ2 regularization, IEEE Signal Process. Lett. 22 (5) (2015) 539–543.
[28] C. Lanaras, E. Baltsavias, K. Schindler, Hyperspectral super-resolution by coupled spectral unmixing, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3586–3594.
[29] P.-A. Thouvenin, N. Dobigeon, J.-Y. Tourneret, Estimation de variabilité pour le démélange non-supervisé d'images hyperspectrales, in: 25eme Colloque Groupe de Recherche et d'Etudes du Traitement du Signal et des Images (GRETSI 2015), 2015, p. 1.
[30] C. Bao, H. Ji, Y. Quan, Z. Shen, Dictionary learning for sparse coding: algorithms and convergence analysis, IEEE Trans. Pattern Anal. Mach. Intell. 38 (7) (2016) 1356–1369.
[31] F. Pierre, J.-F. Aujol, A. Bugeau, N. Papadakis, V.-T. Ta, Luminance-chrominance model for image colorization, SIAM J. Imaging Sci. 8 (1) (2015) 536–563.
[32] A. Mensch, J. Mairal, B. Thirion, G. Varoquaux, Stochastic subsampling for factorizing huge matrices, IEEE Trans. Signal Process. 66 (1) (2018) 113–128.
[33] C. Chenot, Parcimonie, Diversité Morphologique et Séparation Robuste de Sources, Ph.D. thesis, Université Paris-Saclay, 2017.
[34] Y. Xu, W. Yin, A globally convergent algorithm for nonconvex optimization based on block coordinate update, preprint, arXiv:1410.1386.
[35] E. Chouzenoux, J.-C. Pesquet, A. Repetti, Variable metric forward–backward algorithm for minimizing the sum of a differentiable function and a convex function, J. Optim. Theory Appl. 162 (1) (2014) 107–132.
[36] E. Chouzenoux, J.-C. Pesquet, A. Repetti, A block coordinate variable metric forward–backward algorithm, J. Glob. Optim. 66 (3) (2016) 457–485.
[37] J. Rapin, J. Bobin, A. Larue, J.-L. Starck, NMF with sparse regularizations in transformed domains, SIAM J. Imaging Sci. 7 (4) (2014) 2020–2047.
[38] J. Xu, J. Bobin, A. de Vismes Ott, C. Bobin, Spectral unmixing for activity estimation in Gamma-Ray Spectrometry, https://hal.archives-ouvertes.fr/hal-02060476/document.
[39] J. Bobin, J. Rapin, A. Larue, J.-L. Starck, Sparsity and adaptivity for the blind separation of partially correlated sources, IEEE Trans. Signal Process. 63 (5) (2015) 1199–1213.
[40] C.M. Stein, Estimation of the mean of a multivariate normal distribution, Ann. Stat. 9 (6) (1981) 1135–1151.
[41] Y.C. Eldar, Generalized SURE for exponential families: applications to regularization, IEEE Trans. Signal Process. 57 (2) (2009) 471–481.


[42] R. Giryes, M. Elad, Y.C. Eldar, The projected GSURE for automatic parameter tuning in iterative shrinkage methods, Appl. Comput. Harmon. Anal. 30 (3) (2011) 407–422.
[43] C.-A. Deledalle, S. Vaiter, J. Fadili, G. Peyré, Stein unbiased gradient estimator of the risk (SUGAR) for multiple parameter selection, SIAM J. Imaging Sci. 7 (4) (2014) 2448–2487.
[44] G.H. Golub, M. Heath, G. Wahba, Generalized cross-validation as a method for choosing a good ridge parameter, Technometrics 21 (2) (1979) 215–223.
[45] M.A. Lukas, Robust generalized cross-validation for choosing the regularization parameter, Inverse Probl. 22 (5) (2006) 1883.
[46] P.C. Hansen, D.P. O'Leary, The use of the L-curve in the regularization of discrete ill-posed problems, SIAM J. Sci. Comput. 14 (6) (1993) 1487–1503.
[47] M.S. Almeida, M.A. Figueiredo, Parameter estimation for blind and non-blind deblurring using residual whiteness measures, IEEE Trans. Image Process. 22 (7) (2013) 2751–2763.
[48] M. Pereyra, J.M. Bioucas-Dias, M.A. Figueiredo, Maximum-a-posteriori estimation with unknown regularisation parameters, in: 2015 23rd European Signal Processing Conference (EUSIPCO), IEEE, 2015, pp. 230–234.
[49] A.F. Vidal, M. Pereyra, Maximum likelihood estimation of regularisation parameters, in: 2018 25th IEEE International Conference on Image Processing (ICIP), IEEE, 2018, pp. 1742–1746.
[50] F. Feng, M. Kowalski, Revisiting sparse ICA from a synthesis point of view: blind source separation for over and underdetermined mixtures, Signal Process. 152 (2018) 165–177.
[51] C. Kervazo, J. Bobin, C. Chenot, Heuristics for efficient sparse blind source separation, in: 2018 26th European Signal Processing Conference (EUSIPCO), 2018, pp. 489–493.
[52] C. Kervazo, J. Bobin, C. Chenot, Blind separation of a large number of sparse sources, Signal Process. 150 (2018) 157–165.
[53] C. Chenot, J. Bobin, Blind source separation with outliers in transformed domains, SIAM J. Imaging Sci. 11 (2) (2018) 1524–1559.
[54] M. Zibulevsky, Blind source separation with relative Newton method, in: Proceedings of the 4th International Symposium on Independent Component Analysis and Blind Source Separation (ICA 2003), Nara, Japan, 2003, pp. 897–902.
[55] Z. Koldovsky, P. Tichavsky, E. Oja, Efficient variant of algorithm FastICA for independent component analysis attaining the Cramér-Rao lower bound, IEEE Trans. Neural Netw. 17 (5) (2006) 1265–1277.

Christophe Kervazo received the engineering degree from Supelec in 2015, and the master of science in Electrical and Computer Engineering from the Georgia Institute of Technology in 2016. From 2016 to 2019, he was a PhD student at CEA Saclay, France, in the CosmoStat group, where he worked on sparse Blind Source Separation. His main interests are signal and image processing, linear and nonlinear component separation, multi-convex optimization, sparse representations and machine learning.

Jérôme Bobin graduated from the Ecole Normale Superieure (ENS) de Cachan, France, in 2005, and received the M.Sc. degree in signal and image processing from ENS Cachan and Paris-Sud University. He received the Agregation de Physique in 2004, and a Ph.D. in Electrical Engineering at CEA in 2008. From 2008 to 2009, he was a post-doctoral scholar in Applied Mathematics at Caltech, and from 2009 to 2010 in the mathematics department at Stanford University. Since 2010, he has been a researcher at CEA Saclay in the CosmoStat laboratory. In 2015, he was awarded an ERC starting grant. His research interests include statistical signal processing, machine learning, multiscale methods and sparse representations in signal and image processing, and their applications in astrophysics.

Cécile Chenot received the M.Sc. in Electrical Engineering and Information Technology from ETHZ in 2014, the engineering degree from Supelec in 2014, and the Ph.D. degree in Signal and Image Processing from the University of Paris-Sud in 2017. From October 2017, she was a Research Associate at the Institute for Digital Communications, University of Edinburgh School of Engineering. She is now working at Microwave Vision Group. Her primary research interests lie in signal and image processing, sparse representations, component separation and matrix factorization.

Florent Sureau received a PhD in Physics from the Université Paris 11 in 2008, for his work on PET image reconstruction at the Service Hospitalier Frédéric Joliot (CEA), and held a post-doctoral position at the Vrije Universiteit Brussel in 2008-2009, where he worked with Michel Defrise on truncated data reconstruction. Since then, he has been working at the Laboratoire de Cosmologie et Statistiques (CEA) on solving inverse problems in astrophysics, in particular for the Planck and Euclid space missions. His research interests include sparse approximation and dictionary learning in signal and image processing.