A separation–aggregation network for image denoising



Applied Soft Computing Journal 83 (2019) 105603

Contents lists available at ScienceDirect

Applied Soft Computing Journal journal homepage: www.elsevier.com/locate/asoc

A separation–aggregation network for image denoising ✩

Lei Zhang a,d, Yong Li a, Peng Wang c, Wei Wei a,b,∗, Songzheng Xu a, Yanning Zhang a,b

a School of Computer Science, Northwestern Polytechnical University, Xi’an, 710072, China
b National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, Northwestern Polytechnical University, Xi’an 710072, China
c School of Computer Science, The University of Adelaide, Australia
d Inception Institute of Artificial Intelligence, United Arab Emirates

Highlights

• A separation–aggregation strategy is proposed for easier noise removal.

• A novel separation–aggregation network is proposed for image denoising.

• The proposed network generalizes well across challenging datasets.

Article info

Article history: Received 9 August 2018; received in revised form 19 June 2019; accepted 27 June 2019; available online 3 July 2019

Keywords: Separation–aggregation; Convolutional neural network; Image denoising

Abstract

Image denoising aims at recovering a clean image from its noisy counterpart. A promising solution is to employ an appropriate deep neural network to learn a hierarchical mapping function from the noisy image to its clean counterpart. This mapping function, however, is generally difficult to learn since the potential feature space of the noisy patterns can be huge. To overcome this difficulty, we propose a separation–aggregation strategy that decomposes the noisy image into multiple bands, each of which exhibits one kind of pattern. A deep mapping function is then learned for each band, and the mapping results are ultimately assembled into the clean image. By doing so, the network only needs to deal with the constituent components of the noisy image, which makes it easier to learn an effective mapping function. Moreover, as any image can be viewed as a composition of some basic patterns, our strategy is expected to generalize better to unseen images. Inspired by this idea, we develop a separation–aggregation network. The proposed network consists of three blocks, namely a convolutional separation block that decomposes the input into multiple bands, a deep mapping block that learns the mapping function for each band, and a band aggregation block that assembles the mapping results. Experimental results demonstrate the superiority of our strategy over counterparts without image decomposition. © 2019 Elsevier B.V. All rights reserved.

✩ In this work, Lei Zhang and Yong Li contributed equally.
∗ Corresponding author. E-mail address: [email protected] (W. Wei).
https://doi.org/10.1016/j.asoc.2019.105603
1568-4946/© 2019 Elsevier B.V. All rights reserved.

1. Introduction

Image denoising is a classical image restoration task, which aims at predicting a clean image from a noisy observation [1–4]. Under the additive-noise-corruption assumption, the observation model can be formulated as

$$\mathbf{y} = \mathbf{x} + \mathbf{n}, \tag{1}$$

where a clean image $\mathbf{x} \in \mathbb{R}^n$ is corrupted with noise $\mathbf{n}$ and the resulting noisy observation is $\mathbf{y}$. Since it is inherently ill-posed to infer the latent image $\mathbf{x}$ from the observation $\mathbf{y}$ based on Eq. (1) [2,5], extra regularization on $\mathbf{x}$ is needed to seek a better solution.
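The observation model in Eq. (1) can be simulated with a few lines of NumPy. The following is only a minimal sketch: it assumes the noise is i.i.d. zero-mean Gaussian with standard deviation `sigma` on a 0–255 intensity scale, and the helper name `add_gaussian_noise` and the constant test image are illustrative choices, not part of the paper.

```python
import numpy as np

def add_gaussian_noise(clean, sigma, seed=0):
    """Return y = x + n with n ~ N(0, sigma^2) drawn independently per pixel."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(loc=0.0, scale=sigma, size=clean.shape)
    return clean.astype(np.float64) + noise

# Hypothetical flat grey image used only to exercise the function.
clean = np.full((64, 64), 128.0)
noisy = add_gaussian_noise(clean, sigma=25.0)
```

No clipping to [0, 255] is applied here, so the empirical standard deviation of `noisy - clean` stays close to `sigma`.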

Before the success of deep learning, most works [1,2,6–9] focused on designing heuristic regularizations on image structures and devising corresponding models to exploit this information. However, these methods show limitations in generalizing to challenging cases, due to the heuristic nature of the assumptions, e.g., sparsity [10,11] or low rank [8,12], as well as the limited representation capacity of shallow models. Deep neural networks [13–18] have revolutionized the paradigm for the denoising task. Without explicitly considering image priors, they resort to a deep model to map the noisy image to its clean counterpart and expect this highly non-linear mapping function to generalize better. However, the complicated noisy patterns may result in a huge potential feature space for these mapping functions, which consequently poses great challenges for learning such mappings successfully.


L. Zhang, Y. Li, P. Wang et al. / Applied Soft Computing Journal 83 (2019) 105603

To overcome this difficulty, we propose a separation–aggregation strategy that decomposes the noisy image into multiple bands. Each band exhibits only one kind of basic pattern, which is much simpler than the original noisy image. For each band, a deep neural network is employed to learn a mapping function that identifies the noise in that band. By assembling the mapping results from all bands, the clean image can finally be recovered. Following this idea, we develop a separation–aggregation network (SANet), which consists of a convolutional separation block, a deep mapping block and a band aggregation block. The convolutional separation block decomposes the noisy image into multiple bands by separating each channel from the feature maps obtained by convolution. Each band is then fed into the subsequent deep mapping block, which identifies the noise by learning a hierarchical mapping. The band aggregation block convolves the concatenation of all the mapping results to predict the residual image. In contrast to previous counterparts [13,16] without image decomposition, SANet only needs to deal with the constituent components (i.e., bands) of the noisy image, each of which exhibits a much reduced feature space, thus making it easier to learn an effective mapping function that tells noise apart. This is similar to the principle behind the non-local patch (NLP) model [1,2] (a brief review is given in Section 2), which decomposes the noisy image into a collection of similar-patch groups and then handles each group instead of the whole image. Unlike NLP, where the similar-patch matching is sensitive to noise corruption and only hand-crafted regularization is adopted, SANet is able to generate the image bands robustly under noise and to learn a complicated mapping function. The detailed differences and connections are discussed in Section 4.2.

More importantly, it has been demonstrated that any image can be viewed as a composition of some basic patterns [19,20]. The separation–aggregation strategy therefore helps SANet discover these basic patterns and assemble them to deal with unseen images. Experimental results on standard image denoising benchmarks demonstrate the effectiveness of SANet. In summary, our contributions are threefold:

• We propose a separation–aggregation strategy to decompose the noisy image for easier noise separation.

• We develop a novel separation–aggregation network for image denoising.

• We demonstrate that the proposed SANet generalizes better than counterparts without image decomposition across challenging datasets.

2. Related work

In this section, we briefly review the literature on image denoising from the following two aspects.

NLP-based methods depict the correlation among non-local patches with hand-crafted regularization for image denoising. According to the scheme used to model the correlation among patches, they can be roughly divided into two categories. The first comprises explicit patch-model-based methods, which directly define the correlation between non-local patches as a regularization. For example, Buades et al. propose a non-local means [1] method to define the linear relationship between similar patches. Dong et al. [2] embed the non-local means scheme into the sparse representation of patches to obtain a nonlocally centralized sparse representation model for denoising. The second comprises implicit patch-model-based methods, which impose a specific constraint on the data or representation space of non-local patches to implicitly depict their correlation. For example, Zoran et al. [21] model the patch distribution with a mixture of Gaussians. Mairal et al. [22] utilize an ℓp,q norm to constrain the group sparsity of the decompositions of non-local patches on a given dictionary. Dabov et al. [6] propose the well-known block-matching and 3D filtering (BM3D) method, which further promotes the group sparsity of similar patches with a collaborative filtering scheme. Gu et al. [8] and Chen et al. [9] attempt to exploit the low-rank characteristics of the non-local patches. In addition to these patch-model-based methods, some researchers also introduce tensor decomposition techniques into image denoising. For example, Rajwade et al. [23] utilize higher-order singular value decomposition (HOSVD) to restore the noisy image by manipulating the HOSVD coefficients with hard thresholding. Gomes et al. [24] propose a tensor-based multiple denoising approach that successively conducts spatial smoothing, denoising and reconstruction on the noisy data. Wu et al. [25] develop a new weighted tensor rank-1 decomposition method to capture the non-local similarity in images for denoising. In contrast to these methods, which often employ hand-crafted regularizations to depict the inherent image characteristics, we attempt to learn a more complicated regularization with a deep neural network.

Deep neural network based methods learn the complicated image regularization for denoising in a task-driven, end-to-end manner. On one hand, some works [13,14,26] utilize the deep neural network framework to directly learn a non-linear mapping from the noisy observation to the latent clean image. For example, Mao et al. [13] develop a very deep convolutional encoder–decoder network with symmetric skip connections. Concurrently, Zhang et al. [14] propose a residual deep convolutional neural network for denoising. On the other hand, some works [15,16,26–28] employ a deep neural network to learn an appropriate image regularization and embed it into the traditional regularized regression model. For example, Zhang et al. [26] learn the image regularization with a dilated-convolution-based deep neural network and integrate it with the half quadratic splitting method [5]. Kim et al. [15] incorporate a deep aggregation network into the alternating minimization method for denoising. Lefkimmiatis [16] proposes to model the gradient of the NLP-based regularization in the proximal gradient method with a deep neural network. Koziarski et al. [29] further combine image denoising and recognition into a joint end-to-end framework to learn a noise-robust image recognition model. The principle of these deep neural network based methods is to learn a single mapping function that deals with the whole noisy image. In contrast, we decompose the input noisy image into multiple bands with simpler patterns, and then employ the deep neural network to handle each band separately.

Recently, Liu et al. [30] and Koziarski et al. [29] also introduced image decomposition into deep neural networks for image denoising. However, they differ substantially from the proposed method. In [30], the discrete wavelet transform (DWT) is employed to decompose the noisy image into multiple subimages with different scales and frequencies. From the viewpoint of dictionary-based representation, the method in [30] represents the noisy image on a dictionary of fixed wavelet functions. In contrast, in this study, we decompose the noisy image by feeding it into a convolutional layer, and the output features are considered as the decomposed subimages. From the same viewpoint, the proposed method represents the input noisy image on a dictionary that consists of learnable convolutional kernels. In addition, the method in [30] adopts the inverse discrete wavelet transform with fixed parameters to reconstruct the image from the obtained deep features, while we introduce a convolution-based band aggregation block that reconstructs the image from the deep features using learnable weights.
In [29], the input color image is decomposed into three color channels (i.e., red, green and blue) and each channel is then mapped by a deep neural network independently. In contrast, the proposed method decomposes the input image into many more bands, each of which contains a specific kind of basic pattern found in natural images (e.g., edges, textures or flat areas). In addition, the method in [29] mainly focuses on improving the recognition accuracy on noisy input images and thus further introduces a deep recognition model at the end of the image denoising network for joint training. In the proposed method, we mainly focus on removing the noise corruption and improving the image quality. It is worth noting that both the proposed method and other deep neural network based methods [13,26,30] can also be easily extended to solve the image recognition task in the presence of noise by introducing an extra recognition model and jointly training the denoising and recognition models as in [29].

3. Separation–aggregation network

In this section, we present the proposed separation–aggregation network (SANet) for image denoising in detail. SANet consists of three blocks: a convolutional separation block, a deep mapping block and a band aggregation block. The architecture of SANet is sketched in Fig. 1. In SANet, the input noisy image is first convolved by the convolutional separation block. Then, each channel in the generated feature map is considered as an individual image band and separately fed into the subsequent deep mapping block for non-linear transformation. Finally, the band aggregation block concatenates all mapping results into a new feature map and convolves it to produce the output.

3.1. Convolutional separation block

In SANet, the role of the convolutional separation block is to decompose the input noisy image into multiple bands, each of which exhibits a simpler pattern. In a deep convolutional network, each filter in a convolution layer can be viewed as a detector for a specific local pattern [31,32]. During convolution, pixels exhibiting a specific pattern produce a high response to the corresponding detector (i.e., filter) in the resulting feature map, while pixels with different patterns are suppressed accordingly. Thus, each channel in the resulting feature map only activates pixels exhibiting a specific kind of pattern, which is much simpler than the original noisy image. Inspired by this, we adopt the first convolution layer in SANet as the convolutional separation block and separate each channel from the generated feature map as an individual image band, as shown in Fig. 1. Thus, the number of bands equals the number of channels in the feature map produced by the convolution. Denoting the input noisy image as $\mathbf{y}$, the output of the convolutional separation block can be formulated as

$$\{\mathbf{f}_1, \ldots, \mathbf{f}_n\} = S_C\left[\sigma\left(\mathcal{F}(\mathbf{y}, \mathbf{W})\right)\right], \tag{2}$$

where $\mathcal{F}(\mathbf{y}, \mathbf{W})$ denotes the convolution operation on $\mathbf{y}$ with filter weight $\mathbf{W}$, and the function $\sigma(\cdot)$ indicates the rectified linear unit (ReLU) [33]. Thus, $\sigma(\mathcal{F}(\mathbf{y}, \mathbf{W}))$ denotes a traditional convolution layer [34]. $S_C[\cdot]$ denotes the function that separates the image bands (i.e., channels) from the produced feature map, and $\mathbf{f}_1, \ldots, \mathbf{f}_n$ represent the obtained bands. The combination of the $S_C[\cdot]$ function and the traditional convolution layer implements the expected image decomposition. Without the $S_C$ function, the convolutional separation block reduces to a traditional convolution layer. To demonstrate the effectiveness of the convolutional separation block, we depict some visual examples of the decomposed bands in Fig. 2. It can be seen that each noisy image can be decomposed into multiple bands with some basic patterns that are shared across images. More details are discussed in Section 4.1.

3.2. Deep mapping block

Given the bands generated by the convolutional separation block, we propose to leverage a deep mapping block to map each band into a clean latent band for final assembly. To this end, it is crucial to employ a complicated non-linear function to model the ill-posed denoising procedure. It has been shown that sufficiently many hierarchical convolution layers are able to fit any continuous function [35]. Inspired by this observation, we adopt the prevalent "Convolution + ReLU" unit of deep convolutional neural networks as the basic unit to construct the deep mapping block in a hierarchical structure, as shown in Fig. 1. Specifically, given the $i$th image band, the output of this block can be formulated as

$$\mathbf{d}_i = G_D(G_{D-1}(\ldots(G_1(\mathbf{f}_i))\ldots)), \tag{3}$$

where $G_d = \sigma(\mathcal{F}(\cdot, \mathbf{W}_d))$ with $d = 1, \ldots, D$ denotes the operation in the $d$th layer of the deep mapping block, and $\mathbf{W}_d$ is the corresponding filter weight. $D$ denotes the number of layers in this block. For each input image band $\mathbf{f}_i$, this block outputs an individual $\mathbf{d}_i$ of the same size as $\mathbf{f}_i$. The deep mapping block learns a non-linear mapping function to predict each pixel in $\mathbf{d}_i$ according to the pattern within a fixed-size receptive field in $\mathbf{f}_i$. Owing to its deep hierarchical structure, the deep mapping block is able to learn a flexible mapping function with enough capacity to deal with challenging noisy patterns (e.g., heavy noise). In contrast to previous counterparts [13,14] without image decomposition, where all pixels exhibiting various patterns in the receptive field are activated in prediction, only pixels with a specific kind of pattern are activated in the receptive field of the image band in SANet. This makes it easier for the learned mapping function to tell the noise apart. Figs. 6 and 8 provide some visual examples. In implementing SANet, one plausible option is to feed each band into an individual deep mapping block. In this study, to demonstrate the effectiveness of our separation–aggregation strategy as well as to reduce model complexity, all bands share the same deep mapping block, as in Fig. 1.

3.3. Band aggregation block

After passing through the convolutional separation block and the subsequent deep mapping block, the input noisy image $\mathbf{y}$ is projected to $n$ mapping results $\mathbf{d}_i$. To predict the final output, we develop a band aggregation block, which first concatenates all $n$ mapping results into a new $n$-channel feature map, and then convolves this feature map to give the output. This procedure is illustrated in Fig. 1. The corresponding mathematical formulation is

$$\mathbf{x}_r = \sigma\left(\mathcal{F}(\mathbf{F}, \mathbf{W}_r)\right), \tag{4}$$

where $\mathbf{x}_r$ indicates the final output, $\mathbf{F} = [\mathbf{d}_1, \ldots, \mathbf{d}_n]$ represents the $n$-channel feature map and $\mathbf{W}_r$ denotes the corresponding filter weight. It is worth noting that the architecture in Fig. 1 integrates the noisy image decomposition, the deep mapping and the aggregation of mapping results into a single feed-forward neural network. This allows SANet to be trained in an end-to-end manner, and thus enables data-driven discovery of the basic patterns exhibited in each band and automatic assembly of their mapping results. More importantly, since any image can be viewed as a composition of some basic patterns [19,20], as shown in Fig. 2, our separation–aggregation strategy enables SANet to generalize well to complex noisy patterns that are unseen during training.
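The three blocks of Eqs. (2)–(4) can be traced in a small NumPy sketch. This is an illustration under stated assumptions rather than the authors' implementation: the helper names (`conv2d`, `sanet_residual`), the toy sizes, the random weights and the zero "same" padding are all hypothetical, and the last layer of the deep mapping block is assumed to output one channel so that each mapped band has the same size as its input band. Eq. (4) applies σ to the aggregation output, whereas the implementation details in Section 5.1 state that the band aggregation block is not followed by a ReLU, so the sketch exposes this as a flag.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv2d(x, w):
    """Zero-padded 'same' convolution layer (cross-correlation form).
    x: (C_in, H, W); w: (C_out, C_in, K, K) with odd K."""
    c_out, c_in, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, wd = x.shape[1], x.shape[2]
    out = np.zeros((c_out, h, wd))
    for i in range(k):
        for j in range(k):
            # Accumulate the contribution of kernel tap (i, j) over all channels.
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], xp[:, i:i + h, j:j + wd])
    return out

def sanet_residual(y, w_sep, w_map, w_agg, final_relu=False):
    feats = relu(conv2d(y, w_sep))             # sigma(F(y, W)) in Eq. (2)
    bands = [f[None, :, :] for f in feats]     # S_C[.]: one single-channel band per channel
    mapped = []
    for band in bands:
        d = band
        for w_d in w_map:                      # G_1 ... G_D of Eq. (3), shared by all bands
            d = relu(conv2d(d, w_d))
        mapped.append(d)
    stacked = np.concatenate(mapped, axis=0)   # F = [d_1, ..., d_n]
    x_r = conv2d(stacked, w_agg)               # aggregation convolution of Eq. (4)
    return relu(x_r) if final_relu else x_r

# Toy configuration (hypothetical sizes, random untrained weights).
rng = np.random.default_rng(0)
P, M, K, H, W = 4, 4, 3, 8, 8
w_sep = rng.normal(scale=0.1, size=(P, 1, K, K))
w_map = [rng.normal(scale=0.1, size=(M, 1, K, K)),
         rng.normal(scale=0.1, size=(M, M, K, K)),
         rng.normal(scale=0.1, size=(1, M, K, K))]
w_agg = rng.normal(scale=0.1, size=(1, P, K, K))
y = rng.normal(size=(1, H, W))
x_r = sanet_residual(y, w_sep, w_map, w_agg)
```

Because every band passes through the same `w_map` list, the parameter count of the deep mapping block does not grow with the number of bands, which mirrors the weight sharing described for Fig. 1.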


Fig. 1. The architecture of the proposed separation–aggregation network (SANet), which consists of a convolutional separation block, a deep mapping block and a band aggregation block. The red dashed lines denote separating each channel from the generated feature map as an individual band. The blue dashed lines denote concatenating all mapping results from the deep mapping block into a new feature map. In the deep mapping block, all branches share the same weights in each convolution layer. In the figure, C denotes the number of feature maps, while H and W denote the height and width of the feature map.

Fig. 2. Four image bands obtained from the decomposition of the images ‘C.Man’ (top) and ‘Monar.’ (bottom) with the proposed SANet when the standard deviation of the noise corruption is σ = 25. From left to right: (a) clean image, (b) low-intensity flat areas, (c) high-intensity flat areas, (d) sharp edges, (e) less-sharp edges. These decomposed bands not only highlight different basic patterns (e.g., flat areas and edges), but also distinguish edges of different sharpness.

3.4. Residual learning with SANet

Training very deep neural networks often suffers from the vanishing gradient problem [36], which limits the performance of the network. To alleviate this problem, He et al. [36] develop a residual learning scheme that enables training deeper networks. Recently, it has been demonstrated that the residual learning scheme can also help train a better deep neural network for image denoising [13,14]. Inspired by these works, we also employ SANet to predict the residual image (i.e., the noise). Therefore, $\mathbf{x}_r$ in Eq. (4) denotes the residual image, as shown in Fig. 1, and the training problem for SANet can be given as

$$\min_{\Theta} \; \frac{1}{2N} \sum_{i=1}^{N} \left\| \mathbf{x}_i - \left(\mathbf{y}_i - R(\mathbf{y}_i, \Theta)\right) \right\|_2^2, \tag{5}$$

where $R(\cdot, \Theta)$ denotes the output of SANet parametrized by $\Theta = \{\mathbf{W}, \mathbf{W}_r\} \cup \{\mathbf{W}_d\}_{d=1,\ldots,D}$, and $\{\mathbf{y}_i, \mathbf{x}_i\}_{i=1,\ldots,N}$ denotes the $N$ noisy–clean image pairs in the training dataset.
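The objective in Eq. (5) can be written out directly. The sketch below is only illustrative: `residual_loss` and the `predict_residual` callable (standing in for $R(\cdot, \Theta)$) are hypothetical names, and the two tiny image pairs are made up to show the two extreme cases.

```python
import numpy as np

def residual_loss(clean, noisy, predict_residual):
    """Eq. (5): (1 / 2N) * sum_i || x_i - (y_i - R(y_i)) ||_2^2."""
    n = len(clean)
    total = 0.0
    for x, y in zip(clean, noisy):
        denoised = y - predict_residual(y)   # subtract the predicted noise
        total += np.sum((x - denoised) ** 2)
    return total / (2.0 * n)

# Two 2x2 pairs: clean images are all zeros, noisy images are all ones.
clean = [np.zeros((2, 2)), np.zeros((2, 2))]
noisy = [np.ones((2, 2)), np.ones((2, 2))]

# A predictor that removes nothing leaves the full noise energy in the loss.
loss_identity = residual_loss(clean, noisy, lambda y: np.zeros_like(y))
# A predictor that returns the exact noise drives the loss to zero.
loss_oracle = residual_loss(clean, noisy, lambda y: y)
```

With these inputs the first loss is (4 + 4) / (2 · 2) = 2.0 and the second is 0.0, which is the behaviour the residual formulation is meant to reward.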

Fig. 3. Two patterns discovered by SANet with different levels of noise, e.g., flat areas (top) and edges (bottom).

4. Discussion on some related problems

In this section, we discuss three problems related to SANet: the image bands in decomposition, the connection with the non-local patch model, and model complexity.

4.1. Image bands in decomposition

As mentioned above, when passing through the convolutional separation block, the input noisy image is decomposed into multiple image bands. Each band exhibits only one specific pattern.

In other words, in each band the pixels exhibiting the specific pattern are highlighted, while pixels with other patterns are suppressed accordingly. To clarify this point, we give some visual examples of the produced bands in Fig. 2, where we select four image bands for the images ‘C.Man’ and ‘Monar.’ produced by SANet when the standard deviation of the noise corruption is σ = 25. In Fig. 2(b), the bands for both images highlight low-intensity pixels from flat areas while suppressing pixels with other patterns. The bands in Fig. 2(c) and (d) exhibit high-intensity flat areas and edges, respectively. Moreover, from Fig. 2(b)–(d), we can see that the convolutional separation block discovers basic patterns that are shared among images in spite of their different content. In addition, the convolutional separation block can distinguish edges of different sharpness, as shown in Fig. 2(d) and (e). We further find that the convolutional separation block is robust to noise corruption. For example, in Fig. 3, we provide two kinds of patterns obtained for the image ‘C.Man’ with different levels of noise. It can be seen that both patterns are discovered reliably despite the noise corruption.

4.2. Connection with non-local patch model

The proposed SANet can be interpreted as a flexible generalization of the non-local patch (NLP) model in image denoising [6,8,9]. Typically, an NLP-model-based denoising method consists of three steps: patch grouping, transformation and patch aggregation, as shown in Fig. 4. In patch grouping, a specific matching scheme (e.g., block matching [37] or clustering [38]) is utilized to collect similar non-local patches from the noisy image into groups. Then, the transformation step maps each patch group to its clean counterpart with a specific regularization. Finally, all patch groups are aggregated to construct the clean image.

Fig. 4. Comparison between the flow charts of the proposed SANet and the non-local patch (NLP) model in image denoising.

In SANet, the convolutional separation block, which highlights pixels with similar patterns in a specific band, can be viewed as a general patch grouping step. Owing to the task-driven end-to-end learning scheme, the convolutional separation block is robust to noise corruption, as shown in Fig. 3, while the patch matching in the NLP model is prone to being misled by noise corruption. The deep mapping block in SANet plays a role similar to the transformation step in the NLP model. The difference is that the deep mapping block can learn a complicated mapping function to deal with each band, while the NLP model often adopts hand-crafted regularization to depict the structure of each group. Finally, the band aggregation block in SANet corresponds to the patch aggregation step in the NLP model. This block adopts convolution to assemble all mapping results with learned weights, while the NLP model often employs a weighted average with heuristic weights [2]. Therefore, SANet can be viewed as a generalized NLP model for image denoising. Tables 3 and 4 give the quantitative comparison between SANet and the NLP-model-based methods.

4.3. Model complexity

In practice, we can implement the proposed SANet by slightly modifying the architecture of an existing DCNN-based baseline method. Given a DCNN-based baseline that stacks multiple convolutional layers as in [14], we can implement the proposed network by simply replacing the first convolution layer with a convolutional separation block and appending a band aggregation block at the end. Considering a $(D + 1)$-layer baseline network where each convolution layer consists of $M$ filters of size $K \times K$, the number of additional parameters introduced by SANet with $P$ image bands is given by

$$\left(2P - M^2\right) \cdot K^2. \tag{6}$$
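As a quick numerical check of Eq. (6), the expression can be evaluated for a concrete configuration. The values below (64 bands, 64 filters, 3 × 3 kernels) are chosen to match the settings reported in Section 5.1; the function name `extra_parameters` is a hypothetical helper introduced only for this sketch.

```python
def extra_parameters(P, M, K):
    """Eq. (6): parameters SANet adds relative to the baseline network."""
    return (2 * P - M ** 2) * K ** 2

# P = M = 64 bands/filters with 3x3 kernels.
delta = extra_parameters(P=64, M=64, K=3)
print(delta)  # -35712
```

A negative value means that, under Eq. (6), this configuration uses fewer parameters than the baseline.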

When $P < M^2/2$ (e.g., $P = M > 2$), the number of parameters in SANet is even smaller than that of the baseline network. This is mainly because the convolutional separation block reduces the original convolution operation to a channel-wise convolution operation. In general, the model complexity of SANet is comparable to that of the baseline network.

4.4. Noise removal

Similar to most existing image denoising methods [5,6,8,16,21], in this study the proposed method mainly focuses on removing Gaussian white noise from images. However, this does not imply that the proposed method can handle only one type of noise corruption. The denoising ability of the proposed method is determined by the noise type involved in the training samples. In theory, given appropriately corrupted training samples and their clean counterparts, the proposed method can handle any type of noise through end-to-end training. This is similar to other learning-based image denoising methods [5,16,21].

5. Experiment

In this section, we conduct extensive experiments to evaluate the performance of the proposed SANet. First, we introduce the detailed experimental setting. Then, we use experimental results to demonstrate the effectiveness of the proposed separation–aggregation strategy and the superior denoising performance of the proposed SANet over existing state-of-the-art competitors. Finally, we discuss the training convergence as well as the runtime complexity of the proposed SANet.

5.1. Experimental setting

Dataset. In the following experiments, we adopt two datasets, the Berkeley segmentation dataset (BSD) [39] and Set12. BSD contains 500 natural images, while Set12 consists of 12 images. To train the proposed SANet, we split BSD into two separate parts, namely a training set with 400 images and a test set with the remaining 100 images. Since most existing denoising methods choose 68 images from the test set for evaluation, for convenient comparison we select the same 68 images for testing. For simplicity, this test set with 68 images is termed BSD68. Since Set12 contains too few images for training, the whole Set12 dataset is utilized only for testing.

Comparison methods. In this study, we compare SANet with 7 state-of-the-art methods, including BM3D [6], WNNM [8], EPLL [21], CSF [5], TNRD [28], DeepAM [15] and NLNet [16]. BM3D, WNNM and EPLL are NLP-model-based methods, while CSF, TNRD, DeepAM and NLNet are learning-based methods. In particular, DeepAM and NLNet are based on deep convolutional neural


Table 1. PSNR (dB) of SANet and the baseline network on the 12 images from the Set12 dataset with different noise levels.

| Noise level | Method | C.man | House | Peppers | Starfish | Monar. | Airpl. | Parrot | Lena | Barbara | Boat | Man | Couple | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| σ = 15 | Baseline | 32.29 | 34.87 | 33.07 | 32.08 | 33.14 | 31.62 | 31.76 | 34.42 | 32.31 | 32.21 | 32.33 | 32.31 | 32.70 |
| σ = 15 | SANet | 32.38 | 35.03 | 33.18 | 32.14 | 33.20 | 31.71 | 31.89 | 34.54 | 32.61 | 32.36 | 32.38 | 32.41 | 32.82 |
| σ = 25 | Baseline | 29.96 | 32.84 | 30.72 | 29.24 | 30.24 | 29.00 | 29.26 | 32.23 | 29.59 | 30.00 | 29.92 | 29.93 | 30.25 |
| σ = 25 | SANet | 30.04 | 33.05 | 30.83 | 29.31 | 30.27 | 29.08 | 29.34 | 32.35 | 30.00 | 30.12 | 30.00 | 30.05 | 30.37 |
| σ = 50 | Baseline | 26.89 | 29.78 | 27.15 | 25.50 | 26.65 | 25.72 | 26.15 | 29.13 | 25.62 | 27.16 | 27.14 | 26.74 | 26.97 |
| σ = 50 | SANet | 26.92 | 29.93 | 27.27 | 25.52 | 26.64 | 25.71 | 26.18 | 29.22 | 26.37 | 27.20 | 27.11 | 26.80 | 27.09 |

Table 2. SSIM of SANet and the baseline network on the 12 images from the Set12 dataset with different noise levels.

| Noise level | Method | C.man | House | Peppers | Starfish | Monar. | Airpl. | Parrot | Lena | Barbara | Boat | Man | Couple | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| σ = 15 | Baseline | 0.9061 | 0.8889 | 0.9129 | 0.9080 | 0.9460 | 0.9071 | 0.9041 | 0.8967 | 0.9139 | 0.8541 | 0.8767 | 0.8770 | 0.8993 |
| σ = 15 | SANet | 0.9078 | 0.8920 | 0.9149 | 0.9094 | 0.9474 | 0.9091 | 0.9065 | 0.8996 | 0.9190 | 0.8588 | 0.8789 | 0.8802 | 0.9020 |
| σ = 25 | Baseline | 0.8635 | 0.8605 | 0.8822 | 0.8576 | 0.9150 | 0.8716 | 0.8600 | 0.8641 | 0.8620 | 0.8000 | 0.8118 | 0.8176 | 0.8555 |
| σ = 25 | SANet | 0.8665 | 0.8642 | 0.8848 | 0.8602 | 0.9159 | 0.8737 | 0.8637 | 0.8679 | 0.8758 | 0.8070 | 0.8178 | 0.8249 | 0.8602 |
| σ = 50 | Baseline | 0.7900 | 0.8122 | 0.8105 | 0.7580 | 0.8364 | 0.7955 | 0.7929 | 0.8014 | 0.7400 | 0.7137 | 0.7153 | 0.7164 | 0.7735 |
| σ = 50 | SANet | 0.7856 | 0.8139 | 0.8100 | 0.7568 | 0.8382 | 0.7928 | 0.7884 | 0.8024 | 0.7713 | 0.7143 | 0.7115 | 0.7173 | 0.7752 |

network, and we adopt all their variants for comparison, including DeepAM, NLNet55×5 and NLNet57×7 . Details of these variants can be found in [15,16]. In addition to these existing denoising methods, we also implement a DCNN based baseline network for comparison. The baseline network has the same architecture as the deep mapping block in the proposed method. We implement the baseline network as Section 4.3 to make the model complexity comparable to SANet. In this way, the only difference between the proposed method and the baseline network is that the proposed method adopts the separation–aggregation strategy. Implementation details. In SANet, we construct the deep mapping block by 16 convolution layers each of which consists of 64 filters of size 3 × 3 and is followed by a ReLu unit. Both the convolutional separation block and the band aggregation block are implemented by an individual convolution layer with 64 filters, viz., we set C = 64. Each filter is of size 3 × 3 in spatial domain. The only difference is that the convolutional separation block is followed by a ReLu unit, while the band aggregation block not. In the training time, each training example is corrupted by Gaussian white noise with standard deviation σ = 10, 25, 50. In this study, we train an individual SANet as well as the baseline network for a specific noise level in the TensorFlow framework [40]. We use the Adam algorithm [41] with initial learning rate 1e−3 , batch size 64 and β = (0.9, 0.999) for optimization. In testing, peak signal-to-noise ratio (PSNR) and structure similarity (SSIM) [42] index are adopted to quantitatively evaluate all methods. 5.2. Effectiveness of image decomposition In this study, our core idea is the separation-aggregation strategy to decompose the input noisy image into multiple bands and model each of them separately. To verify the effectiveness of this idea, we compare SANet with the DCNN based baseline network mentioned in Section 5.1. 
On two benchmark datasets, we employ these two methods to recover the clean image from its noisy observation at different noise levels. The average PSNR and SSIM values produced by these two methods are summarized as bar charts in Fig. 5. It can be seen that both the PSNR and SSIM produced by SANet are higher than those of the baseline network in most cases. For example, when σ = 15, SANet improves the average PSNR on Set12 by 0.12 dB over the baseline network. Since the only difference between these two methods is the separation–aggregation strategy adopted in SANet, these results demonstrate that such a strategy can further improve denoising performance.

The reason is twofold. On one hand, with the image decomposition, SANet only needs to deal with the generated bands, each exhibiting some basic patterns. These bands are much simpler than the original noisy input, which makes it easier to tell noise apart. On the other hand, the band aggregation block in SANet is able to flexibly assemble the mapping results from the bands, which enables SANet to generalize to noisy patterns that are unseen during training. To clarify this point, we show the PSNR and SSIM produced by SANet and the baseline network on each image from the Set12 dataset in Tables 1 and 2. Since the images in Set12 exhibit totally different content from those in BSD, noisy images generated from this dataset can be viewed as patterns unseen during training. We can see that SANet outperforms the baseline network on every image when σ = 15, 25, and on most images when σ = 50. Since the only difference between the proposed method and the baseline is whether the separation–aggregation strategy is used, the superior performance of the proposed method demonstrates that the separation–aggregation strategy is effective for image denoising.

5.3. Comparison with state-of-the-art methods

In this part, we compare SANet with several state-of-the-art denoising methods. With the same experimental setting as above, the average PSNR and SSIM results of all methods on the Set12 dataset are shown in Table 3. We find that SANet outperforms all the other methods in PSNR at every noise level. For example, when σ = 25, the PSNR produced by SANet is higher than that of WNNM (the best-performing NLP model based method) by 0.11 dB, and higher than that of TNRD by 0.31 dB. Additionally, SANet also produces the highest SSIM scores in most cases. The visual results in Figs. 6 and 7 also demonstrate this point.
These results demonstrate the effectiveness of SANet in image denoising. To further clarify this point, we evaluate all methods on the BSD68 dataset. The numerical results and some visual results are provided in Table 4 and Fig. 8. It can be seen that SANet still clearly outperforms all NLP model based methods at every noise level. For example, the PSNR of SANet is higher than that of WNNM by 0.31 dB when σ = 15. Compared with the learning based methods, SANet also obtains a clear improvement. For example, when σ = 15, compared with NLNet$^{5}_{5\times5}$ and NLNet$^{5}_{7\times7}$, SANet improves the PSNR by at least 0.16 dB. It is worth noting that NLNet$^{5}_{5\times5}$ and NLNet$^{5}_{7\times7}$ employ a deep neural network to model


Fig. 5. Bar charts of the average PSNR and SSIM produced by the proposed SANet and the baseline network on two benchmark datasets with different noise levels.

Fig. 6. Visual results of different methods on image ‘Parrot’ in Set12 when σ = 25. From left to right, (a) Noisy image, (b) BM3D [6] (28.92/0.8537), (c) WNNM [8] (29.10/0.8578), (d) EPLL [21] (28.86/0.8529), (e) CSF [5] (28.84/0.8493), (f) TNRD [28] (29.13/0.8575), (g) Baseline (29.26/0.8601), (h) SANet (29.34/0.8637).

Table 3
Average PSNR/SSIM of all methods on Set12 dataset with different noise levels.

Method        σ = 15          σ = 25          σ = 50
BM3D [6]      32.37/0.8950    29.97/0.8507    26.72/0.7670
WNNM [8]      32.70/0.8978    30.26/0.8558    27.05/0.7780
EPLL [21]     32.14/0.8937    29.69/0.8446    26.47/0.7487
CSF [5]       32.32/0.8914    29.84/0.8432    –/–
TNRD [28]     32.50/0.8964    30.06/0.8519    26.81/0.7673
SANet         32.82/0.9020    30.37/0.8602    27.09/0.7752

Table 4
Average PSNR/SSIM of all methods on BSD68 dataset with different noise levels.

Method                          σ = 15          σ = 25          σ = 50
BM3D [6]                        31.07/0.8721    28.57/0.8013    25.62/0.6862
WNNM [8]                        31.37/0.8766    28.83/0.8086    25.87/0.6981
EPLL [21]                       31.21/0.8823    28.68/0.8121    25.67/0.6875
CSF [5]                         31.24/0.8730    28.74/0.8030    –/–
TNRD [28]                       31.42/0.8824    28.92/0.8152    25.97/0.7018
NLNet$^{5}_{5\times5}$ [16]     31.49/–         28.98/–         25.99/–
NLNet$^{5}_{7\times7}$ [16]     31.52/–         29.03/–         26.07/–
DeepAM [15]                     31.40/0.8820    28.95/0.8160    25.94/0.7010
SANet                           31.68/0.8896    29.13/0.8244    26.10/0.7100

the similarity among non-local patches, which is similar to SANet (see Section 4.2). However, these two methods collect non-local patches from the noisy image with a block-matching scheme, which is prone to being misled by the noise corruption. In contrast, SANet embeds the image decomposition into the end-to-end learning scheme and thus performs more robustly, as shown in Table 4. Furthermore, the separation–aggregation strategy enables SANet to generalize better to unseen images than the counterpart without image decomposition adopted in those two methods. Both of these contribute to the improvement of SANet.

In contrast to SANet, which casts the denoising problem in a single neural network, DeepAM embeds the deep neural network into an alternating minimization scheme for denoising. It has


Fig. 7. Visual results of different methods on image ‘Couple’ in Set12 when σ = 25. From left to right, (a) Noisy image, (b) BM3D [6] (29.72/0.8175), (c) WNNM [8] (29.84/0.8183), (d) EPLL [21] (29.49/0.8087), (e) CSF [5] (29.55/0.8037), (f) TNRD [28] (29.73/0.8136), (g) Baseline (29.93/0.8176), (h) SANet (30.05/0.8249).

Fig. 8. Visual results of different methods on one image from BSD68 when σ = 50. From left to right, (a) Noisy image, (b) BM3D [6] (20.45/0.4928), (c) WNNM [8] (20.80/0.5396), (d) EPLL [21] (20.90/0.5598), (e) TNRD [28] (21.07/0.5772), (f) Baseline (21.25/0.6006), (g) SANet (21.31/0.6158).

been found [43] that the latter scheme can further improve the performance. The reason is intuitive. In the latter scheme, the deep neural network only needs to learn the image regularization, while the former needs to learn extra image encoding and decoding. For a fair comparison, we only compare the proposed method with the basic version of DeepAM, which learns network parameters within one single iteration of minimization. It can be seen that the proposed method outperforms DeepAM by a clear margin. For example, when σ = 15, the improvement in PSNR reaches 0.28 dB.
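To put such margins in perspective, a PSNR gap can be converted into a ratio of mean squared errors. The conversion below is a standard identity and is not stated by the authors. The symbols $\mathrm{MSE}_{\mathrm{ref}}$ (the competing method) and $\mathrm{MAX}$ (the peak intensity) are introduced here for illustration. The identity holds exactly for a single image only, since an average of per-image PSNR values does not correspond exactly to an average MSE.

```latex
\Delta\mathrm{PSNR}
  = 10\log_{10}\frac{\mathrm{MAX}^2}{\mathrm{MSE}_{\mathrm{SANet}}}
  - 10\log_{10}\frac{\mathrm{MAX}^2}{\mathrm{MSE}_{\mathrm{ref}}}
  = 10\log_{10}\frac{\mathrm{MSE}_{\mathrm{ref}}}{\mathrm{MSE}_{\mathrm{SANet}}}
\quad\Longrightarrow\quad
\frac{\mathrm{MSE}_{\mathrm{SANet}}}{\mathrm{MSE}_{\mathrm{ref}}}
  = 10^{-\Delta\mathrm{PSNR}/10}.
```

For $\Delta\mathrm{PSNR} = 0.28$ dB this gives $10^{-0.028} \approx 0.938$, i.e. roughly a 6% lower mean squared error on an image where the gap is exactly 0.28 dB.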

5.4. Convergence analysis

To demonstrate the convergence of the proposed method, we plot the training loss and the restoration PSNR on the Set12 dataset against training iterations in Fig. 9. It can be seen that the restoration PSNR gradually increases as the loss decreases and ultimately stabilizes. This demonstrates that the proposed SANet converges well.


Fig. 9. Curves of loss and restoration PSNR during training.

5.5. Computational cost analysis

Intuitively, compared with the baseline method without decomposition, the computational complexity of the proposed method increases by a factor of C when the input image is decomposed into C bands, since each band is processed independently before assembling. In this study, we set C = 64. However, the independence among those C bands allows the operations on them to be computed fully in parallel. Thus, the practical runtime does not increase much. To illustrate this point, we report the average runtime of all methods on the Set12 dataset. The runtime is measured on a workstation with 2 E5-2650v4 CPUs, 256 GB RAM and 8 GTX1080Ti GPUs. It can be seen that the proposed method is much more efficient than most competitors. Although the proposed method requires C times more computation in theory than the baseline method, their runtimes are comparable due to the parallel computation on GPU (see Table 5).

Table 5
Average runtime of different methods on Set12.

Method      BM3D    EPLL     WNNM      CSF     TNRD    Baseline    SANet
Time (s)    1.38    58.54    530.36    5.44    2.96    0.10        0.40

6. Conclusion

In this study, we present a separation–aggregation strategy to overcome the difficulty brought by the huge feature space of noisy patterns in learning a noisy-to-clean mapping function. To implement this strategy in an end-to-end manner, we develop a separation–aggregation network, which generalizes better than deep neural network counterparts without image decomposition. This work opens a new direction for addressing image denoising with deep neural networks. Moreover, the proposed separation–aggregation strategy has the potential to help solve other image restoration problems. In the future, we plan to embed the proposed network into an optimization method for linear inverse problems, e.g., ADMM, to further exploit the potential of our separation–aggregation strategy in solving image restoration problems.

Declaration of competing interest

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.asoc.2019.105603.

Acknowledgments

This work is supported in part by the National Natural Science Foundation of China (No. 61671385, 61571354, 61571362), Natural Science Basis Research Plan in Shaanxi Province of China (No. 2017JM6021, 2017JM6001) and China Postdoctoral Science Foundation under Grant (No. 158201).

References

[1] A. Buades, B. Coll, J.-M. Morel, A non-local algorithm for image denoising, in: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), Vol. 2, IEEE, 2005, pp. 60–65.
[2] W. Dong, L. Zhang, G. Shi, X. Li, Nonlocally centralized sparse representation for image restoration, IEEE Trans. Image Process. 22 (4) (2013) 1620–1630.
[3] W. Wei, L. Zhang, C. Tian, A. Plaza, Y. Zhang, Structured sparse coding-based hyperspectral imagery denoising with intracluster filtering, IEEE Trans. Geosci. Remote Sens. 55 (12) (2017) 6860–6876.
[4] L. Zhang, W. Wei, Y. Zhang, C. Shen, A. van den Hengel, Q. Shi, Cluster sparsity field: An internal hyperspectral imagery prior for reconstruction, Int. J. Comput. Vis. (2018) 1–25.
[5] U. Schmidt, S. Roth, Shrinkage fields for effective image restoration, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2774–2781.
[6] K. Dabov, A. Foi, V. Katkovnik, K. Egiazarian, Image denoising by sparse 3-D transform-domain collaborative filtering, IEEE Trans. Image Process. 16 (8) (2007) 2080–2095.
[7] B. Du, L. Zhang, Target detection based on a dynamic subspace, Pattern Recognit. 47 (1) (2014) 344–358.
[8] S. Gu, L. Zhang, W. Zuo, X. Feng, Weighted nuclear norm minimization with application to image denoising, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2862–2869.
[9] F. Chen, L. Zhang, H. Yu, External patch prior guided internal clustering for image denoising, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 603–611.
[10] L. Zhang, W. Wei, C. Tian, F. Li, Y. Zhang, Exploring structured sparsity by a reweighted Laplace prior for hyperspectral compressive sensing, IEEE Trans. Image Process. 25 (10) (2016) 4974–4988.
[11] Y. Zhang, B. Du, L. Zhang, A sparse representation-based binary hypothesis model for target detection in hyperspectral images, IEEE Trans. Geosci. Remote Sens. 53 (3) (2015) 1346–1354.
[12] L. Zhang, W. Wei, Q. Shi, C. Shen, A.v.d. Hengel, Y. Zhang, Beyond low rank: A data-adaptive tensor completion method, arXiv preprint arXiv:1708.01008.
[13] X. Mao, C. Shen, Y.-B. Yang, Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections, in: Advances in Neural Information Processing Systems, 2016, pp. 2802–2810.
[14] K. Zhang, W. Zuo, Y. Chen, D. Meng, L. Zhang, Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising, IEEE Trans. Image Process. 26 (7) (2017) 3142–3155.
[15] Y. Kim, H. Jung, D. Min, K. Sohn, Deeply aggregated alternating minimization for image restoration, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6419–6427.
[16] S. Lefkimmiatis, Non-local color image denoising with convolutional neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3587–3596.
[17] L. Zhang, P. Wang, C. Shen, L. Liu, W. Wei, Y. Zhang, A.v.d. Hengel, Adaptive importance learning for improving lightweight image super-resolution network, arXiv preprint arXiv:1806.01576.
[18] Y. Dong, B. Du, L. Zhang, Target detection based on random forest metric learning, IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 8 (4) (2015) 1830–1838.
[19] H. Bristow, A. Eriksson, S. Lucey, Fast convolutional sparse coding, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 391–398.
[20] F. Heide, W. Heidrich, G. Wetzstein, Fast and flexible convolutional sparse coding, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5135–5143.
[21] D. Zoran, Y. Weiss, From learning models of natural image patches to whole image restoration, in: 2011 International Conference on Computer Vision, IEEE, 2011, pp. 479–486.
[22] J. Mairal, F. Bach, J. Ponce, G. Sapiro, A. Zisserman, Non-local sparse models for image restoration, in: 2009 IEEE 12th International Conference on Computer Vision, IEEE, 2009, pp. 2272–2279.
[23] A. Rajwade, A. Rangarajan, A. Banerjee, Image denoising using the higher order singular value decomposition, IEEE Trans. Pattern Anal. Mach. Intell. 35 (4) (2012) 849–862.
[24] P.R. Gomes, J.P.C. da Costa, A.L. de Almeida, R.T. de Sousa Jr, Tensor-based multiple denoising via successive spatial smoothing, low-rank approximation and reconstruction for R-D sensor array processing, Digit. Signal Process. (2019).
[25] Y. Wu, L. Fang, S. Li, Weighted tensor rank-1 decomposition for nonlocal image denoising, IEEE Trans. Image Process. 28 (6) (2019) 2719–2730.
[26] K. Zhang, W. Zuo, S. Gu, L. Zhang, Learning deep CNN denoiser prior for image restoration, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3929–3938.
[27] B. Du, L. Zhang, A discriminative metric learning based anomaly detection method, IEEE Trans. Geosci. Remote Sens. 52 (11) (2014) 6844–6857.
[28] Y. Chen, T. Pock, Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration, IEEE Trans. Pattern Anal. Mach. Intell. 39 (6) (2017) 1256–1272.
[29] M. Koziarski, B. Cyganek, Image recognition with deep neural networks in presence of noise–dealing with and taking advantage of distortions, Integr. Comput.-Aided Eng. 24 (4) (2017) 337–349.
[30] P. Liu, H. Zhang, K. Zhang, L. Lin, W. Zuo, Multi-level wavelet-CNN for image restoration, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 773–782.
[31] L. Liu, C. Shen, A. van den Hengel, The treasure beneath convolutional layers: Cross-convolutional-layer pooling for image classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4749–4757.
[32] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, A. Torralba, Learning deep features for discriminative localization, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2921–2929.
[33] V. Nair, G.E. Hinton, Rectified linear units improve restricted Boltzmann machines, in: Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
[34] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[35] C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals, Deep learning requires rethinking generalization, in: International Conference on Learning Representations, 2017, pp. 1–14.
[36] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[37] K. Dabov, A. Foi, V. Katkovnik, K. Egiazarian, et al., Image denoising with block-matching and 3D filtering, in: Proceedings of SPIE, Vol. 6064, 2006, p. 606414.
[38] L. Zhang, W. Wei, C. Bai, Y. Gao, Y. Zhang, Exploiting clustering manifold structure for hyperspectral imagery super-resolution, IEEE Trans. Image Process. 27 (12) (2018) 5969–5982.
[39] D. Martin, C. Fowlkes, D. Tal, J. Malik, A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics, in: IEEE International Conference on Computer Vision (ICCV), 2001.
[40] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G.S. Corrado, A. Davis, J. Dean, M. Devin, et al., TensorFlow: Large-scale machine learning on heterogeneous distributed systems, arXiv preprint arXiv:1603.04467.
[41] D. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980.
[42] Z. Wang, A.C. Bovik, H.R. Sheikh, E.P. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE Trans. Image Process. 13 (4) (2004) 600–612.
[43] J. Rick Chang, C.-L. Li, B. Poczos, B. Vijaya Kumar, A.C. Sankaranarayanan, One network to solve them all–solving linear inverse problems using deep projection models, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5888–5897.