Low-level structure feature extraction for image processing via stacked sparse denoising autoencoder


Accepted Manuscript

Low-level Structure Feature Extraction for Image Processing via stacked sparse denoising autoencoder
Zunlin Fan, Duyan Bi, Linyuan He, Ma Shiping, Shan Gao, Cheng Li

PII: S0925-2312(17)30390-9
DOI: 10.1016/j.neucom.2017.02.066
Reference: NEUCOM 18140
To appear in: Neurocomputing
Received date: 2 January 2017
Revised date: 23 February 2017
Accepted date: 25 February 2017

Please cite this article as: Zunlin Fan, Duyan Bi, Linyuan He, Ma Shiping, Shan Gao, Cheng Li, Low-level Structure Feature Extraction for Image Processing via stacked sparse denoising autoencoder, Neurocomputing (2017), doi: 10.1016/j.neucom.2017.02.066

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


Low-level Structure Feature Extraction for Image Processing via stacked sparse denoising autoencoder

Zunlin Fan a,*, Duyan Bi a, Linyuan He a, Ma Shiping a, Shan Gao a, Cheng Li b

a Air Force Engineering University, Aeronautics and Astronautics Engineering College, No. 1 Baling Road, Baqiao District, Xi'an, Shaanxi 710038, China
b Air Force Aviation University, Nanhu Road, Nanguan District, Changchun, Jilin 130022, China

*E-mail: [email protected]

Highlights:
• We use deep learning for image processing by extracting image features.
• The extraction performs well in noisy and low-light circumstances.
• We optimize the TV and L0 smoothing filters with the proposed feature extraction.
• The trained features extract image structure features directly, regardless of the input type.

Abstract: In this paper, we propose a novel low-level structure feature extraction for image processing based on a deep neural network, the stacked sparse denoising autoencoder (SSDA). Current image processing methods via deep learning directly build and learn end-to-end mappings between input and output. Instead, we advocate analyzing the features that the first layer learns from the input data. With the learned low-level structure features, we improve two edge-preserving filters that are key to image processing tasks such as denoising, High Dynamic Range (HDR) compression and detail enhancement. Owing to the validity and superiority of the proposed feature extraction, the results computed by the two improved filters do not suffer from drawbacks such as halos, edge blurring, noise amplification and over-enhancement. More importantly, we demonstrate that the features trained on natural images are not domain-specific and can extract the structure features of infrared images. Hence, it is feasible to handle such tasks using the trained features directly.

Keywords: Image processing; Pattern recognition; Spatial filter; Image enhancement; Deep learning


1. Introduction and motivation

Deep learning has swept the field of artificial intelligence like a storm. It has made great breakthroughs in speech recognition, pattern recognition and natural language processing. Meanwhile, it has tremendously expanded the study area of machine learning and driven the rapid development of artificial intelligence technology [1-3].

Many scholars have introduced convolutional neural networks (CNNs) and autoencoders (AEs) into image processing [1]. CNNs, inspired by the organization of the animal visual cortex, have shown satisfactory performance in processing two-dimensional data like images and videos [1, 4-7]. To extract whole saliency maps, Li et al. propose a multi-task deep saliency model based on a fully convolutional network with global input and global output, in which the underlying saliency prior information is encoded by a data-driven strategy [4]. With a CNN containing three layers, depth image denoising and enhancement are achieved by pre-processing the gray image and regenerating the depth image; however, the performance of this learning framework relies heavily on the capacity of the training data [5]. To construct the connections between clean images and noisy ones, Sun et al. present a deep convolutional-pairs network for image denoising: they use deeper networks trained on large data to build end-to-end mappings directly from a noisy image to its noise-free counterpart, based on the observation that deeper networks improve denoising performance [6]. A method for single-image super-resolution is proposed based on a deep CNN that takes the low-resolution image as input and outputs the high-resolution one [7]. Recently, AEs have been employed to learn generative models of data, with the key advantage that the model continuously extracts useful features during propagation and filters out useless information [1, 8-14]. To retrieve the inherently non-linear relationships between 2D images and human 3D poses, Hong and Yu propose novel pose recovery methods using non-linear mapping with a multimodal deep AE [10, 11]. Agostinelli et al. present a technique that computes optimal column weights by solving a nonlinear optimization program and trains a separate network to predict the optimal weights [12]. Xu et al. connect traditional optimization-based schemes with a neural network architecture to outperform previous ones, especially when the blurred input images are partially saturated [13]. A novel SSDA for low-light images, called the Low-light Net (LLNet), is trained to enhance natural low-light images by synthetically modifying images available in Internet databases to simulate low-light environments [14].

Summarizing the above image processing methods based on CNNs and AEs, two commonalities emerge. One is that huge amounts of unlabeled training data are imperative to train deep networks suitable for image processing tasks and to relieve the problem of over-fitting; however, when only a limited amount of training data is available, more powerful models are required to achieve an enhanced learning ability. The other is that scholars, bound to end-to-end mappings, pay great attention to the input and output, so the input data are usually generated by modifying clean images to simulate image quality degradation. With data-driven training and appropriate data simulation, these networks have achieved great success in image processing tasks. Nevertheless, they are not so effective on image degradations that were not simulated.

Image processing focuses on low-level features, such as color, edges, textures and details. Considering image features represents the current research direction in image processing. Moreover, many image processing tasks depend on the non-linear neighborhood filter [15]. To take full advantage of SSDA's ability to learn features, we need to analyze the semantics of each neuron in each layer.


In this paper, we aim to move beyond the end-to-end learning paradigm and improve the non-linear neighborhood filter based on the low-level features learned by SSDA.

Contributions: (1) We present a novel application of SSDA that extracts low-level structure features well even in degraded environments. (2) We demonstrate that the two improved edge-preserving filters achieve more compelling results for different image processing tasks. (3) We find that the features trained on natural images are also effective for other images at test time, such as infrared images and medical images.

2. Related work

Low-level image processing tasks include edge detection, interpolation and deconvolution. These tasks are useful both in themselves and as a front-end for high-level visual tasks like object recognition.

This paper focuses on image denoising, high dynamic range (HDR) compression and detail enhancement. These tasks depend on the non-linear neighborhood filter, which restores a pixel by taking an average of the values of neighboring pixels with similar grey levels while preserving the edges. In the spatial domain, this filter is called an edge-preserving filter.

Preserving edges is the fundamental motivation in denoising. In HDR compression, images are often decomposed into a piecewise smooth base layer and one or more detail layers. The base layer captures the larger-scale variations in intensity and is the output of an image coarsening process. The coarsening must be done carefully to prevent artifacts that might arise once the base and the detail layers are manipulated separately and recombined. For example, to reduce the dynamic range of an HDR image, the base layer is typically subjected to a non-linear compressive mapping after edge-preserving smoothing, and then recombined with the detail layers [15]. The edge-preserving filter is crucial for boosting the detail component, the difference between the original image and the smoothing result, without artifacts like halos. Hence, an edge-preserving filter is the foundation for these image processing tasks.

Several edge-preserving filters have been proposed in computational photography over the last few decades. The bilateral filter (BF) smooths an image at a pixel by outputting the average of the neighboring pixels, weighted by Gaussians of both the spatial and the intensity distances [15]. Although BF is suitable for many situations, it produces unexpected gradient reversal artifacts at edges. Total variation (TV), which optimizes an L1-regularized loss function, is able to smooth drastically while preserving crisp edges [16]; however, the gradient alone is not enough to distinguish edges from noise, so blocking artifacts appear in TV results. The weighted least squares (WLS) filter is an alternative edge-preserving smoothing operator based on image gradients and produces halo-free smoothing [17], but it, too, smooths the image according to gradient magnitudes. The L0 smoothing filter removes low-amplitude structures and globally preserves salient edges, yet in challenging circumstances it causes over-sharpening while removing details and leaves speckle residuals in denoising results [18]. Derived from a local linear model, the guided image filter (GIF) works well near edges by considering the content of a guidance image, which can be the input image itself or a different image. However, its output is a linear filter based on the statistics of local patches, and such statistics, mean and variance, can hardly represent image structures in full [19].

Unlike the Gaussian filter, which uses only a linear shift-invariant kernel, the above edge-preserving filters all try to define criteria that differentiate image structures from noise. These criteria include the weighted sum of spatial and intensity distances in BF, gradient magnitudes in TV and WLS, the sparsity of edges in the L0 smoothing filter, and the edge information of the guidance image in GIF.


Hence, distinguishing noise from useful information is key to eliminating noise while preventing edge blurring.

3. Model description

In this section, we first briefly give preliminaries on SSDA. Then we analyze the significance of the first layer in SSDA. Finally, we verify the validity of the proposed low-level feature extraction.

3.1 SSDA

The denoising autoencoder (DA) is usually used as the first layer in a deep neural network to pre-train on the input data [1]. To introduce the DA clearly, we assume $\mathbf{y} \in \mathbb{R}^D$ is the original data and $\mathbf{x} \in \mathbb{R}^D$ is the corrupted version of the corresponding $\mathbf{y}$, with $N$ training examples. The DA is defined as below:

$$\mathbf{h}(\mathbf{x}) = f(\mathbf{W}\mathbf{x} + \mathbf{b}) \tag{1}$$

$$\hat{\mathbf{y}}(\mathbf{x}) = g(\mathbf{W}'\mathbf{h}(\mathbf{x}) + \mathbf{b}') \tag{2}$$

where $f(\cdot)$ and $g(\cdot)$ are activation functions (the sigmoid function $\sigma(s) = (1 + \exp(-s))^{-1}$ is often used). Eq. (1) and Eq. (2) express the encoding and decoding processes, respectively. $\Theta = \{\mathbf{W} \in \mathbb{R}^{K \times D},\ \mathbf{b} \in \mathbb{R}^{K},\ \mathbf{W}' \in \mathbb{R}^{D \times K},\ \mathbf{b}' \in \mathbb{R}^{D}\}$ contains the weights and biases of the encoding and decoding.

Inspired by the virtues of sparse coding, a sparse denoising autoencoder (SDA) is trained to minimize the reconstruction loss function with a sparsity regularization term [5, 10]:

$$L_{SDA}(\Theta) = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{2}\left\| \mathbf{y}_i - \hat{\mathbf{y}}(\mathbf{x}_i) \right\|_2^2 + \beta \sum_{j=1}^{K} KL(\rho \,\|\, \hat{\rho}_j) + \frac{\lambda}{2}\left( \|\mathbf{W}\|_F^2 + \|\mathbf{W}'\|_F^2 \right) \tag{3}$$

$$KL(\rho \,\|\, \hat{\rho}_j) = \rho \log\frac{\rho}{\hat{\rho}_j} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j}, \qquad \hat{\rho}_j = \frac{1}{N}\sum_{i=1}^{N} h_j(\mathbf{x}_i) \tag{4}$$

The sparsity regularization term is defined using the Kullback-Leibler divergence $KL(\rho \,\|\, \hat{\rho}_j)$ shown in Eq. (4), where $\hat{\rho}_j$ is the average activation of the $j$-th hidden unit. The target activation $\rho$ is set to 0.1 in this paper to drive the mean activation of the hidden units to be small. The weight decay term $\frac{\lambda}{2}(\|\mathbf{W}\|_F^2 + \|\mathbf{W}'\|_F^2)$ prevents over-fitting. $\beta$ and $\lambda$ are parameters controlling the weights of the penalty terms: increasing $\beta$ or $\lambda$ raises the significance of the sparsity and weight decay terms, respectively.

Figure 1(a) displays the SSDA architecture stacked from two SDAs; namely, the activation of the first SDA's hidden layer is the input of the second SDA. The entire network is then trained in a fine-tuning stage to minimize the following loss function:

$$L_{SSDA}(\Theta) = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{2}\left\| \mathbf{y}_i - \hat{\mathbf{y}}(\mathbf{x}_i) \right\|_2^2 + \frac{\lambda}{2} \sum_{m=1}^{4} \|\mathbf{W}^{(m)}\|_F^2 \tag{5}$$

where $\mathbf{W}^{(m)}$ denotes the weights of the $m$-th layer in the SSDA. After the above pre-training process, $\mathbf{W}^{(1)}$ and $\mathbf{W}^{(4)}$ come from the encoding and decoding weights of the first SDA, and $\mathbf{W}^{(2)}$ and $\mathbf{W}^{(3)}$ come from the encoding and decoding weights of the second SDA. Since the sparsity regularization has already been imposed in the pre-training stage, a new sparsity regularization term is not necessary for the fine-tuning stage [20]. In both the pre-training and fine-tuning stages, we optimize the loss functions with the L-BFGS algorithm.
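As a sketch of how Eqs. (3) and (4) combine, the following function evaluates the SDA loss on a batch of patches, reusing the parameters from the previous sketch; the values of beta and lam are illustrative placeholders, while rho = 0.1 matches the text.

```python
def sda_loss(Y, X, beta=3.0, lam=1e-4, rho=0.1):
    # Y: clean patches (N x D); X: corrupted patches (N x D)
    H = sigmoid(X @ W.T + b)                 # hidden activations (N x K)
    Y_hat = sigmoid(H @ Wp.T + bp)           # reconstructions    (N x D)
    recon = 0.5 * np.mean(np.sum((Y - Y_hat) ** 2, axis=1))
    rho_hat = H.mean(axis=0)                 # rho_hat_j of Eq. (4)
    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))
    decay = 0.5 * (np.sum(W ** 2) + np.sum(Wp ** 2))
    return recon + beta * kl + lam * decay   # Eq. (3)
```

For fine-tuning, the same reconstruction and weight decay terms of Eq. (5) are kept while the KL term is dropped, and the flattened parameter vector can be handed to an off-the-shelf L-BFGS routine, e.g. scipy.optimize.minimize with method='L-BFGS-B'.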

Fig. 1. Model architectures. (a) Stacked sparse denoising autoencoder built from two SDAs: the encoder-decoder chain runs from the input layer x (64×1) through three hidden layers h(1), h(2), h(3) (each 200×1) to the output layer ŷ (64×1), with layer parameters W(1), b(1) through W(4), b(4). (b) Input and first hidden layer.

3.2 Significance analysis of SSDA


As mentioned in [20], many deep neural networks trained on natural images exhibit a curious phenomenon in common: the first layer learns features similar to Gabor filters. It is therefore valuable and feasible to analyze each neuron in the first layer. Reviewing the loss function of the SDA in Eq. (3), the sparsity-inducing term is key to training the optimal parameters $\Theta$. To describe the significance of SSDA in detail, let us first introduce one element of frame theory, the optimally sparse approximation [21-23]. A sequence $\{\Phi_k\}_{k \in \mathbb{Z}}$ in Hardy space is called a frame if there exist constants $0 < A \le B < \infty$ such that

$$A\|f\|^2 \le \sum_{k \in \mathbb{Z}} \left| \langle f, \Phi_k \rangle \right|^2 \le B\|f\|^2 \tag{6}$$

If A and B can be chosen with A = B, the frame is called A-tight, and if A = B = 1 is possible, $\{\Phi_k\}_{k \in \mathbb{Z}}$ is a Parseval frame [24]. There exists the following nonlinear signal approximation $f_M$ for $f$:

$$f_M = \sum_{k \in \mathbb{Z}} \langle f, \Phi_k \rangle\, \Phi_k \tag{7}$$

where $\Phi_k$ represents one of the basis functions and $\langle f, \Phi_k \rangle$, computed by an inner product, is the coefficient corresponding to that basis function. This coefficient indicates the similarity between the input signal and the basis function $\Phi_k$: the greater the coefficient value, the stronger the similarity. In an optimally sparse approximation, the coefficients $\{\langle f, \Phi_1 \rangle, \langle f, \Phi_2 \rangle, \ldots, \langle f, \Phi_z \rangle\}$ are sparse.

As shown in Figure 1(b), we select one arbitrary neuron in the first layer and rewrite Eq. (1) as follows:

$$h_i(\mathbf{x}) = \sigma\!\left(\mathbf{W}_i^{(1)}\mathbf{x} + b_i^{(1)}\right) \tag{8}$$

where $i$ indexes the $i$-th neuron in the first layer. Compared with the nonlinear signal approximation in Eq. (7), $\mathbf{W}_i^{(1)}\mathbf{x}$ and $\langle f, \Phi_k \rangle$ are both inner products indicating the similarity of the two involved factors. As the mean activation of the hidden units is encouraged to be small, just as $\langle f, \Phi_k \rangle$ is optimally sparse, $h_i(\mathbf{x}) = \sigma(\mathbf{W}_i^{(1)}\mathbf{x} + b_i^{(1)})$ can be considered an alternative form of sparse representation. Hence, the encoding weights $\mathbf{W}^{(1)}$ capture the structure features of the input data; they are shown in Fig. 2. The input dimension is 8×8 (D = 64) and the number of hidden neurons is 200 (K = 200).

3.3 Low-level feature extraction by SSDA


If the input data belong to edges, they must have a large similarity with one of the trained features. On the contrary, if the input data are sampled from flat regions, they are isolated from all trained features. Fortunately, the values of the hidden neuron activations on the input data evaluate these similarities. Given an input vector $\mathbf{z} \in \mathbb{R}^{64}$, the activations by the trained features can be described as follows:

$$\mathbf{h}(\mathbf{z}) = \sigma\!\left(\mathbf{W}^{(1)}\mathbf{z} + \mathbf{b}^{(1)}\right) \tag{9}$$

where $\mathbf{W}^{(1)}$ and $\mathbf{b}^{(1)}$ are the parameters trained by Eq. (5). $\mathbf{h}(\mathbf{z}) \in \mathbb{R}^{200}$ is a vector of similarity values and represents the relationships between the input and the 200 trained features.

Fig. 2. Visualization of the encoding weights in the first layer


Fig. 3. Comparisons of the proposed structure feature maps and the edge feature maps from a noisy image and a low-light image

We analyze the distribution of the similarity vector $\mathbf{h}(\mathbf{z})$ by gathering statistics including its mean, standard deviation and maximum. The activations on a flat region, close to $\hat{\rho}_j$, are small and homogeneous, whereas the activations on edges are mutated and inhomogeneous. Therefore, the standard deviation and the maximum can well represent the characteristics of the input data. The maximum of the similarity vector is chosen and expressed as below:

$$S(\mathbf{z}) = \max_{1 \le i \le 200} \sigma\!\left(\mathbf{W}_i^{(1)}\mathbf{z} + b_i\right) \tag{10}$$

where $S(\mathbf{z})$ is defined as the structure index of the input data $\mathbf{z}$; a larger index indicates a sharper edge. To extract the low-level structure features from a whole M×N image, we first extend the input image and sample MN 64-dimensional vectors by per-pixel scanning. Then MN structure indexes are computed by the operation in Eq. (10). Finally, the low-level structure feature map is obtained by reshaping the vector of structure indexes into a matrix.
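A direct, unoptimized sketch of this per-pixel extraction is given below. W1 (200×64) and b1 stand for the trained first-layer parameters; the reflective padding is an assumption, since the paper only states that the image is extended for the per-pixel scan.

```python
def structure_feature_map(img, W1, b1, patch=8):
    # img: grayscale image (M x N); W1: 200 x 64; b1: 200
    M, N = img.shape
    padded = np.pad(img, ((0, patch - 1), (0, patch - 1)), mode='reflect')
    S = np.empty((M, N))
    for r in range(M):
        for c in range(N):
            z = padded[r:r + patch, c:c + patch].reshape(-1)  # 64-vector
            h = sigmoid(W1 @ z + b1)                          # Eq. (9)
            S[r, c] = h.max()                                 # Eq. (10)
    return S
```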

In Figure 3, (a) is the clean image; (b) is the low-level structure feature map from the proposed extraction and displays distinct edges and details; (e) is the histogram of (b) and demonstrates that the structure features closely follow a heavy-tailed distribution. To test the superiority of the proposed structure feature extraction, a noisy image (corrupted by additive white Gaussian noise with standard deviation 25, shown in (c)) and a low-light image (darkened nonlinearly by gamma correction with γ = 3.5 as in [14], shown in (d)) are chosen. The precise intensities and locations of the clean image are preserved in (f) and (h), which are formed by the proposed feature extraction. On the contrary, most of the edges and details vanish in (g) and (i), which are mapped from edge feature extraction. (j) shows partial intensity variations of the five feature maps in (b) and (f)-(i), for which we choose the pixels in the 20th row, columns 418-442. The salient intensities, marked by the red dotted box, indicate pixels that are declared to be on edges but actually lie in a flat region. The flat intensities, marked by the black dotted box, show that the edge features from the low-light image are faint.

4. Applications to image processing

This section demonstrates the effectiveness of the proposed low-level structure feature extraction by conducting image processing tasks.

4.1 Edge-preserving filter

As described in Section 2, the traditional edge-preserving filters suffer from edge blurring and noise residual. The reason is that their regularization terms cannot distinguish well between noise and useful information. So, we aim to optimize TV [16] and the L0 smoothing filter [18] using the proposed low-level structure features.

We denote the input image by I and the computed result by U. The loss function in TV and the L0 smoothing filter can be described as below:

$$\min_{U}\left\{\sum_p (U_p - I_p)^2 + \mu\, C(U)\right\} \tag{11}$$

where $p$ indexes the image pixels, $C(U)$ is the regularization term, and the term $\sum_p (U_p - I_p)^2$ constrains image structure similarity. $\mu$ is the regularization parameter balancing the two terms; a large $\mu$ makes the result have few edges. The gradient $\nabla U_p = (\partial_x U_p, \partial_y U_p)^T$ for each pixel $p$ is calculated as the difference between neighboring pixels along the x and y directions. The regularization terms $C(U) = \|\nabla U_p\|_1$ and $C(U) = \|\nabla U_p\|_0$ in TV and the L0 smoothing filter are respectively described as below:

$$\|\nabla U_p\|_1 = \sqrt{(\partial_x U_p)^2 + (\partial_y U_p)^2} \tag{12}$$

$$\|\nabla U_p\|_0 = \#\left\{p \;\middle|\; |\partial_x U_p| + |\partial_y U_p| \neq 0\right\} \tag{13}$$

The one-norm L1 computes the gradient magnitude at pixel $p$, and the zero-norm L0 counts the pixels $p$ whose magnitude $|\partial_x U_p| + |\partial_y U_p|$ is non-zero. To solve the optimization problem in Eq. (11), we adopt half-quadratic splitting with auxiliary variables to expand the original terms and update them iteratively. It is worth noting that L0-norm regularized optimization is a typically non-convex problem known to be computationally intractable; we use the approximation of [18] to make this optimization easier to tackle. We introduce an auxiliary variable $g$ and rewrite the objective function:

$$\min_{U,g}\left\{\sum_p \left[(U_p - I_p)^2 + \delta\,(g_p - \nabla U_p)^2\right] + \mu\, C(g)\right\} \tag{14}$$

where $\delta$ controls the difference between the auxiliary variable and the gradient. As $\delta \to \infty$, the solution of (14) converges to that of (11). Then we split Eq. (14) into two sub-problems:

$$\min_{U}\left\{\sum_p \left[(U_p - I_p)^2 + \delta\,(g_p - \nabla U_p)^2\right]\right\} \tag{15}$$

$$\min_{g}\left\{\sum_p \delta\,(g_p - \nabla U_p)^2 + \mu\, C(g)\right\} \tag{16}$$

The function in Eq. (15) is quadratic and has a global minimum obtained by differentiation. To speed up the operation, we solve it via the Fast Fourier Transform (FFT):

$$U = \mathcal{F}^{-1}\!\left(\frac{\mathcal{F}(I) + \delta\,\mathcal{F}(\nabla^{T} g)}{\mathcal{F}(1) + \delta\,\mathcal{F}(\nabla^{T}\nabla)}\right) \tag{17}$$

where $\mathcal{F}$ is the FFT operator and $\mathcal{F}^{-1}$ is the inverse FFT operator. The addition, division and multiplication are all component-wise.
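A sketch of the FFT solution in Eq. (17) is given below, assuming forward differences under periodic boundary conditions (the usual setting for this solver [18]); gx and gy denote the two components of the auxiliary variable g.

```python
def solve_U(I, gx, gy, delta):
    # Closed-form minimizer of Eq. (15), evaluated in the Fourier domain.
    M, N = I.shape
    fx = np.zeros((M, N)); fx[0, 0] = -1.0; fx[0, 1] = 1.0   # d/dx kernel
    fy = np.zeros((M, N)); fy[0, 0] = -1.0; fy[1, 0] = 1.0   # d/dy kernel
    Fx, Fy = np.fft.fft2(fx), np.fft.fft2(fy)
    # Denominator: F(1) + delta * F(grad^T grad)
    denom = 1.0 + delta * (np.abs(Fx) ** 2 + np.abs(Fy) ** 2)
    # Numerator: F(I) + delta * F(grad^T g), the adjoint applied to g
    numer = np.fft.fft2(I) + delta * (np.conj(Fx) * np.fft.fft2(gx)
                                      + np.conj(Fy) * np.fft.fft2(gy))
    return np.real(np.fft.ifft2(numer / denom))
```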

To solve Eq. (16), the employed method depends on the regularization term. When the regularization term is $C(U) = \|\nabla U_p\|_1$ as in TV, we use the shrinkage technique [25] to compute the auxiliary variable $g$:

$$g = \max\!\left(\|M\|_1 - \frac{\mu}{2\delta},\, 0\right)\frac{M}{\|M\|_1}, \qquad M = \nabla U \tag{18}$$

If the regularization term $C(U)$ is $\|\nabla U_p\|_0$, Eq. (16) can be rewritten with the splitting scheme:

$$\min_{g}\left\{\sum_p \delta\,(g_p - \nabla U_p)^2 + \mu\, L(g)\right\} \tag{19}$$

where $L(g)$ is a binary function returning 1 if $|g_x| + |g_y| \neq 0$ and 0 otherwise. Then we apply the strategy in [18] to solve the above problem:

$$g_p = \begin{cases} (0,0)^{T}, & (\partial_x U_p)^2 + (\partial_y U_p)^2 \le \mu/\delta \\ \nabla U_p, & \text{otherwise} \end{cases} \tag{20}$$
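The two g-updates can be written as small vectorized routines. Note that, by Eq. (12), the paper's $\|\cdot\|_1$ on the gradient is the Euclidean magnitude $\sqrt{g_x^2 + g_y^2}$; the small epsilon guarding the division is an implementation detail, not part of Eq. (18).

```python
def update_g_tv(gx, gy, mu, delta, eps=1e-12):
    # Eq. (18): soft shrinkage of the gradient magnitude (TV case).
    mag = np.sqrt(gx ** 2 + gy ** 2)
    scale = np.maximum(mag - mu / (2.0 * delta), 0.0) / (mag + eps)
    return gx * scale, gy * scale

def update_g_l0(gx, gy, mu, delta):
    # Eq. (20): keep the gradient only where its energy exceeds mu/delta.
    keep = (gx ** 2 + gy ** 2) > mu / delta
    return gx * keep, gy * keep
```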

The L1-norm and L0-norm regularizations are both based on edge features mapped from the gradient $\nabla U$. Taking the superiority of the proposed low-level structure feature into consideration, we rewrite the loss function in Eq. (11):

$$\min_{U}\left\{\sum_p (U_p - I_p)^2 + \mu\, S(U)\, C(U)\right\} \tag{21}$$

where $S(U)$ is the proposed low-level structure feature map computed by Eq. (10). The objective function is redefined with the auxiliary variable $g$:

$$\min_{U,g}\left\{\sum_p \left[(U_p - I_p)^2 + \delta\left(g_p - S(U)\,\nabla U_p\right)^2\right] + \mu\, S(U)\, C(g)\right\} \tag{22}$$

We solve for U and g via half-quadratic splitting and rewrite their solutions corresponding to Eqs. (17), (18) and (20):

$$U = \mathcal{F}^{-1}\!\left(\frac{\mathcal{F}(I) + \delta\, S(U)\,\mathcal{F}(\nabla^{T} g)}{\mathcal{F}(1) + \delta\,(S(U))^{2}\,\mathcal{F}(\nabla^{T}\nabla)}\right) \tag{23}$$

$$g = \max\!\left(\|M\|_1 - \frac{\mu}{2\delta},\, 0\right)\frac{M}{\|M\|_1}, \qquad M = S(U)\,\nabla U \tag{24}$$

$$g_p = \begin{cases} (0,0)^{T}, & S(U)\left((\partial_x U_p)^2 + (\partial_y U_p)^2\right) \le \mu/\delta \\ \nabla U_p, & \text{otherwise} \end{cases} \tag{25}$$

$\delta = 2^k$ controls the iteration of U and g, where $k$ denotes the $k$-th iteration; usually 20-30 iterations are performed in our paper. These two improved filters are called the Weighted-TV and the Weighted-L0 smoothing filter, respectively.
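Putting the pieces together, the sketch below runs the Weighted-L0 smoothing iteration with the helpers from the previous sketches. For simplicity it computes the structure map S once from the input and folds the weighting into the thresholding test of Eq. (25), reusing the plain solver of Eq. (17) instead of the weighted form of Eq. (23); this is a simplification of the scheme described above, not the paper's exact update.

```python
def weighted_l0_smooth(I, S, mu, n_iter=25):
    # Half-quadratic splitting: alternate the g-update and the FFT U-update.
    # delta = 2^k grows with the iteration index k, as stated in the text.
    U = I.astype(float).copy()
    for k in range(1, n_iter + 1):
        delta = 2.0 ** k
        gx = np.roll(U, -1, axis=1) - U               # forward differences,
        gy = np.roll(U, -1, axis=0) - U               # periodic boundaries
        keep = S * (gx ** 2 + gy ** 2) > mu / delta   # weighted Eq. (25)
        gx, gy = gx * keep, gy * keep
        U = solve_U(I, gx, gy, delta)                 # Eq. (17)
    return U
```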

4.2 Image denoising

To illustrate the image denoising performance of the two improved edge-preserving filters, we use the color noisy image from [17] and the grayscale noisy image shown in Fig. 3(c). Figure 4 shows the denoising results by BLF, WLS, TV, Weighted-TV, the L0 smoothing filter and the Weighted-L0 smoothing filter.

Fig. 4. Denoising results. (a) Color noisy image, (b) result of BLF on (a) (σs=12, σr=0.35), (c) result of WLS on (a) (α=3, λ=1.8), (d) result of TV on (a) (μ=0.5), (e) result of Weighted-TV on (a) (μ=0.75), (f) result of the L0 smoothing filter on (a) (μ=0.009), (g) result of the Weighted-L0 smoothing filter on (a) (μ=0.045); (h) grayscale noisy image, (i) result of BLF on (h) (σs=6, σr=0.45), (j) result of WLS on (h) (α=3, λ=0.5), (k) result of TV on (h) (μ=0.25), (l) result of Weighted-TV on (h) (μ=0.4), (m) result of the L0 smoothing filter on (h) (μ=0.0045) and (n) result of the Weighted-L0 smoothing filter on (h) (μ=0.025). The images in the last row are the enlarged regions.

In the first row, BLF and WLS manage to smooth the large noise but meanwhile blur edges and leave noise residuals in the flat regions. TV eliminates most of the noise and preserves the primary edges; unfortunately, it brings serious block artifacts. The Weighted-TV result prevents edge blurring and is free from block artifacts. With their L0-norm regularizations, the L0 smoothing and Weighted-L0 smoothing filters excel at edge preservation. As some large noise is regarded as edges, sporadic noise appears in (f) and (g); still, (g) has fewer noise residuals than (f), because the proposed feature extraction reduces certain disturbances from noise. In the second row, the results of WLS and TV suffer from noise residual and block artifacts, respectively. Sporadic noise is preserved in the results of the L0 smoothing and Weighted-L0 smoothing filters. To demonstrate the edge-preserving performance more clearly, enlarged regions are shown in the last row; Weighted-TV yields clean regions and distinct edges.

To evaluate the denoising performance with quantitative measures, we choose the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM) [5, 12, 18]. PSNR and SSIM measure the intensity and structure similarities, respectively, between the target image and the reference one. From Table 1, we see that Weighted-TV achieves the highest PSNR and SSIM values.

Compared with TV and the L0 smoothing filter, their improvements based on the low-level structure feature extraction, Weighted-TV and the Weighted-L0 smoothing filter, attain more outstanding performance.

Table 1. SSIM/PSNR (dB) values of the denoised images under the different schemes. In each entry, the leading value is the SSIM and the trailing value is the PSNR.

Method            BLF            WLS            TV             Weighted-TV    L0             Weighted-L0
Color image       0.8123/26.25   0.8174/26.98   0.8551/27.90   0.8768/28.75   0.8498/27.56   0.8557/28.08
Grayscale image   0.8412/24.86   0.8427/25.23   0.8592/26.85   0.8772/27.92   0.8484/25.96   0.8571/26.79

In summary, BLF and WLS trade off their edge-preservation ability against their smoothing ability: as the scale of the noise vanishes, BLF and WLS tend to blur over more edges. TV introduces block artifacts because the L1-norm regularization confuses edges and noise. Weighted-TV prevents edge vanishing and, based on the proposed low-level structure feature extraction, keeps the flat regions free from block artifact disturbance. The L0 smoothing and Weighted-L0 smoothing filters take great advantage of edge preservation but bring unexpected artifacts like pseudo edges.

4.3 HDR compression

HDR tone mapping is a popular application of edge-preserving smoothing operators. Our decomposition comparisons are easily harnessed to perform detail-preserving compression of HDR images by simply replacing BLF with other edge-preserving filters in the tone mapping algorithm [15], which decomposes an HDR image into a piecewise-smooth base layer conveying the energy of the image and a detail layer.
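The tone mapping pipeline referred to here follows [15]: compress the base layer of the log-luminance and add the detail layer back unchanged. A sketch is below; smooth_filter stands for any of the edge-preserving filters compared in Figure 5, and the target contrast of two orders of magnitude is an illustrative choice, not a value from the paper.

```python
def tone_map(lum, smooth_filter, target_log_range=np.log10(100.0)):
    # Decompose log-luminance into base + detail, compress only the base [15].
    log_lum = np.log10(lum + 1e-6)
    base = smooth_filter(log_lum)            # piecewise-smooth base layer
    detail = log_lum - base                  # detail layer, kept as is
    gamma = target_log_range / (base.max() - base.min())
    out_log = gamma * (base - base.max()) + detail   # compress, re-anchor
    return 10.0 ** out_log
```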

Figure 5 shows the comparisons of HDR compression with seven edge-preserving filters: BLF [15], WLS [17], the guided filter [19], TV [16], Weighted-TV, the L0 smoothing filter [18] and the Weighted-L0 smoothing filter. As noticed in [17], the tone mapping result by BLF exhibits halos visible around the picture frames and the light fixture; the halos, pointed out by green arrows, are shown distinctly in the enlarged images. The tone mapping results of all filters except BLF and the guided filter are free from halos. Compared with Weighted-TV, TV cannot recover the details in the light region pointed to by the red arrow. The Weighted-L0 smoothing filter yields more unambiguous details, like the windows of the buildings, than the L0 smoothing filter. The HDR compressions by the two improved filters recover the details in the dark and light regions and are free from halos, because these two filters prevent edge blurring.

4.4 Detail enhancement

Given the input image shown in Fig. 6(a), we compute the detail enhancement by magnifying only the gradients in the detail layer (5×) [18]. To enhance details adaptively and prevent noise amplification in the results of the two improved filters, the magnification is not constant but a map W = 5·(S(U))^0.5, where S(U) is obtained by Eq. (10).
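A sketch of this adaptive magnification follows: the detail layer is the difference between the input and the edge-preserving smoothing result, and the map W = 5·(S(U))^0.5 replaces the constant 5× gain. Note the paper magnifies gradients in the detail layer; boosting the detail layer itself, as below, is a simplification under that assumption.

```python
def enhance_details(I, smooth_filter, S):
    # Adaptive boost: strong structures (large S) receive close to the full
    # 5x gain, while flat, noise-prone regions are amplified much less.
    base = smooth_filter(I)
    detail = I - base
    W = 5.0 * np.sqrt(S)                 # W = 5 * (S(U))^0.5
    return base + W * detail
```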

We apply the seven edge-preserving filters to enhance details based on the algorithm used in [18]. This detail magnification easily introduces halo artifacts, over-sharpens strong edges and amplifies noise; an excellent edge-preserving filter avoids halos. As a result, Weighted-TV and the Weighted-L0 smoothing filter are free from halos and contain edges that are more compelling than the others. Since the magnification of the details relies on image structures, the noise marked with red boxes is suppressed in the results of Weighted-TV and the Weighted-L0 smoothing filter and amplified in the others.

Fig. 5. Tone mapping results. (a) Original HDR image, (b) a tone-mapped image by BLF, taken directly from [15] (σs=15, σr=0.12), (c) a tone-mapped image by WLS (α=1.2, λ=20), (d) a tone-mapped image by the guided image filter (r=12, ε=0.12²), (e) a tone-mapped image by TV (μ=0.01), (f) a tone-mapped image by Weighted-TV (μ=0.015), (g) a tone-mapped image by the L0 smoothing filter (μ=0.0075) and (h) a tone-mapped image by the Weighted-L0 smoothing filter (μ=0.035). The third and last rows show the enlarged images.


Fig. 6. Detail enhancement results. (a) Original image, (b) BLF-based enhancement result (σs=12, σr=0.15), (c) WLS-based enhancement result (α=0.2, λ=1.2), (d) guided image filter-based enhancement result (r=15, ε=0.03²), (e) TV-based enhancement result (μ=0.1), (f) Weighted-TV-based enhancement result (μ=0.15), (g) L0 smoothing filter-based enhancement result (μ=0.005) and (h) Weighted-L0 smoothing filter-based enhancement result (μ=0.02).

5. Extending applications to infrared image processing

The training data for the SSDA are taken from natural images on the Internet. The features trained by the SSDA are low-level features and appear not to be specific to a particular dataset. Hence, we try to use the features trained on natural images to extract features from infrared images.

We feed infrared images into the trained SSDA to obtain the low-level structure feature maps and then use them to smooth noise and enhance details. In Figure 7, (a) and (e) are the infrared images that need to be smoothed and enhanced, respectively. The low-level structure feature images, (b) and (f), are distinct and indicate the locations and intensities of the infrared image structures. The smoothed results, (c) and (d), are clean and edge-preserving. In the results (g) and (h), the details are enhanced adaptively and the noise is suppressed. Thus the features trained by the SSDA in this paper are suitable for other kinds of images, not just natural images; a domain-matched dataset is preferable for training low-level features, but it is not a prerequisite.


Fig. 7. Denoising and enhancement performance on infrared images by the two improved filters. (a) Noisy infrared image, (b) low-level structure feature image of (a) by the trained features, (c) smoothed image of (a) by Weighted-TV, (d) smoothed image of (a) by the Weighted-L0 smoothing filter, (e) infrared image to be enhanced, (f) low-level structure feature image of (e) by the trained features, (g) enhancement of (e) by Weighted-TV and (h) enhancement of (e) by the Weighted-L0 smoothing filter.

6. Conclusion and limitations

We have presented a novel low-level structure feature extraction by SSDA. Unlike the recent trend of designing end-to-end mappings, we use the features learned by SSDA to extract image features, and we improve two edge-preserving filters with the proposed low-level structure feature extraction. Our results on a variety of applications, including denoising, HDR tone mapping and detail manipulation, show that these two filters are robust and versatile. More importantly, the features trained on natural images can extract image structure features regardless of the type of input.

Limitations. The size of the proposed low-level structure features is determined by the input size of the SSDA, and the proposed feature extraction does not consider the multiscale problem. In future work, we would like to take multiple scales into account and investigate more sophisticated deep neural networks for extracting image features.


Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grants 61372167, 61379104 and 61301233).

References

[1] W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, and F. E. Alsaadi, "A survey of deep neural network architectures and their applications," Neurocomputing, 234, 11-26 (2017).
[2] Y. Tan, P. Tang, Y. Zhou, W. Luo, Y. Kang, and G. Li, "Photograph aesthetical evaluation and classification with deep convolutional neural networks," Neurocomputing, 228, 165-175 (2017).
[3] J. Yu, X. Yang, F. Gao, and D. Tao, "Deep multimodal distance metric learning using click constraints for image ranking," IEEE Transactions on Cybernetics, PP(99), 1-11 (2016).
[4] X. Li, L. Zhao, M. Yang, F. Wu, Y. Zhuang, H. Ling, and J. Wang, "DeepSaliency: Multi-task deep neural network model for salient object detection," IEEE Transactions on Image Processing, 25(8), 3919-3930 (2016).
[5] X. Zhang, and R. Wu, "Fast depth image denoising and enhancement using a deep convolutional network," IEEE International Conference on Acoustics, Speech and Signal Processing, 2499-2503 (2016).
[6] L. Sun, Y. Zhang, W. An, J. Fan, J. Zhang, H. Wang, and Q. Dai, "Fast and accurate image denoising via a deep convolutional-pairs network," Advances in Multimedia Information Processing - PCM 2016, Lecture Notes in Computer Science, 9916 (2016). [DOI: 10.1007/978-3-319-48890-5_19]
[7] C. Dong, C. C. Loy, K. He, and X. Tang, "Image super-resolution using deep convolutional networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2), 295-307 (2016).
[8] K. Sun, J. Zhang, C. Zhang, and J. Hu, "Generalized extreme learning machine autoencoder and a new deep neural network," Neurocomputing, 230, 374-381 (2017).
[9] Z. Zhu, X. Wang, S. Bai, C. Yao, and X. Bai, "Deep learning representation using autoencoder for 3D shape retrieval," Neurocomputing, 204, 41-50 (2016).
[10] C. Hong, J. Yu, J. Wan, D. Tao, and M. Wang, "Multimodal deep autoencoder for human pose recovery," IEEE Transactions on Image Processing, 24(12), 5659-5670 (2015).
[11] C. Hong, J. Yu, D. Tao, and M. Wang, "Image-based 3D human pose recovery by multi-view locality sensitive sparse retrieval," IEEE Transactions on Industrial Electronics, 62(6), 3742-3751 (2015).
[12] F. Agostinelli, M. R. Anderson, and H. Lee, "Adaptive multi-column deep neural networks with application to robust image denoising," in Advances in Neural Information Processing Systems 26, pp. 1493-1501 (2013).
[13] L. Xu, J. Ren, C. Liu, and J. Jia, "Deep convolutional neural network for image deconvolution," in Advances in Neural Information Processing Systems 27, pp. 1790-1798 (2014).
[14] K. G. Lore, A. Akintayo, and S. Sarkar, "LLNet: A deep autoencoder approach to natural low-light image enhancement," Pattern Recognition, 61, 650-662 (2016).
[15] F. Durand, and J. Dorsey, "Fast bilateral filtering for the display of high-dynamic-range images," ACM Transactions on Graphics, 21(3), 257-266 (2002).
[16] L. Rudin, S. Osher, and E. Fatemi, "Nonlinear total variation based noise removal algorithms," Physica D: Nonlinear Phenomena, 60, 259-268 (1992).
[17] Z. Farbman, R. Fattal, D. Lischinski, and R. Szeliski, "Edge-preserving decompositions for multi-scale tone and detail manipulation," ACM Transactions on Graphics, 27(3) (2008).
[18] L. Xu, C. Lu, Y. Xu, and J. Jia, "Image smoothing via L0 gradient minimization," ACM Transactions on Graphics, 30(6), 174 (2011).
[19] K. He, J. Sun, and X. Tang, "Guided image filtering," IEEE Transactions on Pattern Analysis and Machine Intelligence, 35, 1397-1409 (2013).
[20] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?" in Advances in Neural Information Processing Systems 27, pp. 3320-3328 (2014).
[21] K. Guo, and D. Labate, "Optimally sparse multidimensional representation using shearlets," SIAM Journal on Mathematical Analysis, 39(1), 298-318 (2007).
[22] G. Easley, D. Labate, and W. Q. Lim, "Sparse directional image representation using the discrete shearlet transform," Applied and Computational Harmonic Analysis, 25(1), 25-46 (2008).
[23] Z. Fan, D. Bi, S. Gao, L. He, and W. Ding, "Adaptive enhancement for infrared image using shearlet frame," Journal of Optics, 18(8), 085706 (2016).
[24] R. J. Duffin, and A. C. Schaeffer, "A class of nonharmonic Fourier series," Transactions of the American Mathematical Society, 72, 341-366 (1952).
[25] Z. Fan, D. Bi, L. He, and S. Ma, "Noise suppression and details enhancement for infrared image via novel prior," Infrared Physics and Technology, 74, 44-52 (2016).

Appendix:

Table 2. Important notations and definitions in this paper.

Important notations and definitions in SSDA
  y        the original data
  x        the input data
  W        the encoding weights
  W'       the decoding weights
  b        the encoding biases
  b'       the decoding biases
  Θ        the collection of network parameters
  ŷ(x)     the output data
  λ        weight of the weight decay term
  β        weight of the sparsity penalty term
  ρ̂_j      the average activation of the j-th hidden unit
  ρ        the target activation
  W(m)     the weights of the m-th layer
  N        the number of training examples

Important notations and definitions in frame theory
  H            Hardy space
  {Φk}k∈Z      a sequence in Hardy space
  A, B         the frame bounds
  f            a function
  f_M          the nonlinear approximation of f
  ⟨·,·⟩        the inner product

Important notations and definitions in the two improved filters
  p        the pixel index
  I        the input image
  U        the computed result
  C(U)     the regularization term
  μ        weight of the regularization term
  ∇U_p     the gradient at pixel p
  g        the auxiliary variable
  δ        controls the difference between ∇U_p and g
  L(g)     a binary function of g
  F        the FFT operator


Zunlin Fan received the B.S. in electrical engineering from Air Force Engineering University, Xi'an, China, in 2013. He is currently pursuing the Ph.D. degree in electrical engineering at Air Force Engineering University. His current research interests include statistical image processing, image denoising, image enhancement, and pattern recognition.

Duyan Bi graduated from the Department of Electrical Engineering, National University of Defense Technology, Changsha, China, in 1983, and received the M.S. degree in signal processing from National University of Defense Technology in 1987 and the Ph.D. degree in electrical engineering from Tours University, Tours, France, in 1997. He joined the Department of Aeronautics and Astronautics, Air Force Engineering University in 1987, and is currently a Professor and the Director of the Laboratory of Field Reconnaissance and Surveillance Technology, Air Force Engineering University. His research interests include computer vision, pattern recognition and image processing.

Linyuan He received his M.S. degree from Air Force Engineering University in 2007. He joined the Department of Aeronautics and Astronautics, Air Force Engineering University in 2005, and is currently pursuing the Ph.D. degree in electrical engineering at Xi'an Jiaotong University. His research interests cover machine learning and computer vision.

Shiping Ma received the Ph.D. degree from Air Force Engineering University, Xi'an, China, in 2003. He joined the Department of Aeronautics and Astronautics, Air Force Engineering University in 2003, and is currently an associate professor. His research interests cover machine learning and computer vision.

Shan Gao received the Ph.D. degree from Air Force Engineering University, Xi'an, China, in 2010. She is currently a Lecturer at Air Force Engineering University, Xi'an, China. Her current research interests include image fusion, image processing and pattern recognition.

Cheng Li received the Ph.D. degree from Air Force Engineering University, Xi'an, China, in 2011. He is currently a Lecturer at the Air Force Aviation University, Changchun, China. His current research interests include image processing, visual interpretation, and machine learning.