Multi-stream attentive generative adversarial network for dynamic scene deblurring

Jinkai Cui, Weihong Li*, Weiguo Gong

Key Lab of Optoelectronic Technology and Systems of Education Ministry, College of Optoelectronic Engineering, Chongqing University, Chongqing, 400044, China

Neurocomputing (2019), https://doi.org/10.1016/j.neucom.2019.11.063

Article info

Article history: Received 27 June 2019; Revised 11 October 2019; Accepted 27 November 2019; Available online xxx. Communicated by Gaofeng Meng.

Keywords: Attentive guidance module; Multi-stream; Attention map; Multi-scale fusion; Gate structure; Dynamic scene deblurring

Abstract

Image deblurring methods based on deep learning have proved effective for dynamic scene deblurring. Nevertheless, most existing methods fail to deal with severe and complex dynamic scene blurs. In this paper, we propose a novel multi-stream attentive generative adversarial network (MSA-GAN) to address these problems. First, we design a novel attentive guidance module that generates an attention map, which guides the feature extraction and deblurring process to pay more attention to severely blurred regions or objects and their edge or texture structures. Second, we propose a multi-stream and multi-scale feature extraction strategy and design an attentive and multi-scale feature extraction module, in which a multi-scale residual block is designed for extracting multi-scale features and capturing more information about complex blurs. Third, we propose a multi-scale feature fusion strategy, implemented in the multi-stream and multi-scale feature fusion module through a designed gate structure, which adaptively learns and fuses multi-scale features from different streams. Finally, extensive experiments are performed on the GoPro dynamic scene dataset and real image data, and the quantitative and qualitative results demonstrate that the proposed method outperforms recent state-of-the-art methods. © 2019 Elsevier B.V. All rights reserved.

1. Introduction

Dynamic scene blur is one of the most common image degradation problems in photography. Its formation factors are extremely complex, including camera shake, multiple object motions, scene depth variations and occlusion at motion boundaries during the exposure time, as presented in Fig. 1. The blurred structures not only seriously degrade the visual quality of images but also directly affect practical applications in various fields. Although uniform [1–3] and non-uniform [4–8] deblurring methods have made remarkable progress, they still have considerable difficulty with dynamic scene blur. Hence, dynamic scene deblurring has become one of the most challenging tasks in the field of image processing. Recently, some methods [9–11] have been applied to dynamic scene deblurring, but they depend strongly on accurate image segmentation. Moreover, these methods require solving a highly non-convex optimization problem, which may result in a high computational cost. Learning-based methods [7,8,12–16] were presented to avoid accurate



Corresponding author. E-mail address: [email protected] (W. Li).

image segmentation and to reduce the computational cost, but they need to estimate blur kernels pixel by pixel, and the dynamic scene images they restore usually contain a large number of blur residuals. Moreover, owing to the lack of ground-truth clean images, these methods use synthetic blurry images generated by convolving clean natural images with synthetic motion kernels. Consequently, the above-mentioned methods either tackle only a few specific types of dynamic scene blur or simplify the blur model, and they have limitations for more complex dynamic scene blurs. In the past few years, deep learning, especially the deep convolutional neural network (CNN) [17], has proven superior in the field of image processing. Since a CNN can be trained end to end, some state-of-the-art works [18,19] exploit temporal information for video deblurring, and other works [12–15] address dynamic scene deblurring with single-frame information. Nah et al. [13] proposed a multi-scale CNN in which a coarse-to-fine manner is adopted to improve the performance of dynamic scene deblurring. However, the method still cannot obtain high-quality restored images and requires an exceedingly deep network structure. Tao et al. [15] adopted a similar multi-scale strategy and presented a scale-recurrent network that deblurs at different scales with shared parameters, but the parameter-sharing scheme neglects the crucial scale-variant property of features at different scales.




Fig. 1. Several severe and complex dynamic scene blur examples where the blur is caused by the camera shake (as shown in Fig. 1 (a, b, c, d)), multiple object motions (as shown in Fig. 1 (a, b, c)), scene depth variations (as shown in Fig. 1 (a, b, c, d)) and occlusion in motion boundaries (as shown in Fig. 1 (a, b, c)).

features in different scale. Though Kupyn et al. [14] had achieved some promising results by applying conditional generative adversarial network (GAN), their method can only restore blurry image with smaller blur and specific type dynamic scene blurs. In this work, we propose a novel MSA-GAN for severe and complex dynamic scene blurs. As we all know, the blurs of different regions or objects usually vary from pixel to pixel in the same dynamic scene image. Our one of work is motivated by the above characteristics. We simply subtract the image that degraded by blur with its corresponding sharp image and find that there are large differences in pixel values between different regions. For example, the pixel values of some regions differ very little, while the pixel values of other regions vary greatly. Therefore, we design an attentive guidance module for generating an attention map, which can guide to pay more attention on the severe blurring regions or objects and their edge or texture structures in the feature extraction and deblurring process. In addition, the above mentioned methods by using single-stream network usually lose some important information when feature extraction. Thus, we propose the multi-stream and multi-scale feature extraction strategy and design a multi-scale residual block for extracting the multi-scale features and capturing more information of complex blurs. In each stream, we apply mean squared error (MSE) loss to improve the ability of the feature extraction. Obtaining more efficient features and better representation of an object in the image as well as its surrounding complicated context are very effective for image restoration. Inspired by Ren et al. [20], we propose a multi-scale fusion strategy and design a learning gate structure to adaptively fuse the multi-scale features. Meanwhile, this strategy is able to facilitate the convergence of the training process. Finally, we use perceptual loss [21] and Wasserstein GAN [22] with Gradient Penalty [23] for preserving the high texture details and looking perceptually more convincing. The main contributions of this work are summarized as follows: (1) We propose a novel MSA-GAN for severe and complex dynamic scene blurs. (2) We design an attentive guidance module for generating an attention map, which can guide to pay more attention on the severe blur regions or objects and their edge or texture

structures of dynamic scene images in the feature extraction and deblurring process.
(3) We propose a multi-stream and multi-scale feature extraction strategy and design an attentive and multi-scale feature extraction module, in which a multi-scale residual block is designed for extracting multi-scale features and capturing more information about complex blurs.
(4) We propose a multi-scale feature fusion strategy, implemented in the multi-stream and multi-scale feature fusion module through the designed gate structure, which adaptively learns and fuses multi-scale features from different streams.

The remainder of this work is organized as follows. Section 2 reviews the most related works. In Section 3, we detail the design methodology of the proposed network and the loss functions. Section 4 details the experiments and results. In Section 5, we analyze and discuss the effectiveness of the different strategies and components of our network. Finally, Section 6 gives the conclusions of this work.

2. Related works

Existing conventional dynamic scene deblurring methods [9–11] usually include a segmentation process to help the deblurring process. Kim et al. [9] proposed a method to jointly estimate motion segments, the latent image and the blur kernels. The method can deal with multiple moving objects and camera shake in dynamic scenes, but it fails in certain scenarios such as large depth variations and forward motions. Pan et al. [10] proposed a method to jointly estimate object segmentation and camera motion based on soft segmentation. To deal with severe blur, the method relies on a segmentation confidence map and requires user input to initialize the segmentation. Kim and Lee [11] proposed a segmentation-free method for dynamic scene deblurring, but the method assumes that motion can be approximated as locally linear, that is, that blur kernels can be modeled by a locally linear optical flow field. This assumption does not always hold for complex dynamic scene blurs. The above-mentioned methods either deal with only a few



specific types of dynamic scene blur or simplify the blur model, and their deblurring process is time-consuming. Recently, single-frame and video dynamic scene deblurring methods based on deep learning have achieved great progress. Some works blend temporal information in spatio-temporal recurrent networks for video deblurring. Su et al. [18] introduced an encoder-decoder network that exploits multiple video frames to accumulate information across frames. Zhang et al. [19] presented a deblurring network for spatio-temporal learning by applying 3D convolutions to both the spatial and temporal domains. Other works address single-frame dynamic scene deblurring without temporal information. These methods can be roughly divided into three categories.

The first category: the early CNN-based methods [7,8] usually require estimating non-uniform blur kernels and then restore clear images with a non-blind deblurring method [24]. Sun et al. [7] presented a learning method that addresses non-uniform motion blur by estimating the motion blur of every patch and adopting a Markov random field model to obtain a dense non-uniform motion blur field. This method may lose some high-level information in severely blurred regions, since the training is performed at the patch level. Gong et al. [8] presented a deeper CNN model for estimating motion flow and removing pixel-wise heterogeneous motion blur. However, it was only trained on linear blur examples and is limited to a few simple types of blur.

The second category: end-to-end trainable methods [12,15,16] have been extended to dynamic scenes. Noroozi et al. [12] proposed a multi-scale CNN with pyramid schemes and skip connections to obtain a larger receptive field; each segment of the network only needs to produce a residual image to help image reconstruction. Tao et al. [15] adopted a similar multi-scale strategy and proposed a scale-recurrent network with convolutional long short-term memory (LSTM) for dynamic scene deblurring. They also proposed sharing network weights across scales to reduce the training difficulty, but this parameter-sharing scheme neglects the scale-variant property of features, which is crucial for restoration at different scales. Zhang et al. [16] proposed a spatially variant neural network consisting of three deep CNNs and a recurrent neural network; its deblurring step resembles an infinite impulse response model, which can be approximated by a recurrent neural network. The methods mentioned above require an exceedingly deep network structure and usually fail to handle severe image blur.

The third category: the recent GAN architecture has proven superior in the field of image deblurring [13,14,25] and has also shown remarkable results in various computer vision and image processing tasks, such as image deraining [26,27], image inpainting [28], style transfer [21], image dehazing [29] and super-resolution [30,31]. Qian et al. [26] proposed an attentive recurrent module for raindrop removal; it is the most critical part of their generative network and generates an attention map that guides the network to focus on raindrop regions. Ramakrishnan et al. [25] proposed a GAN-based deep filter with a dense architecture and a global skip connection to remove blur caused by camera shake. Nah et al. [13] proposed a multi-scale CNN with 40 convolution layers in each scale for dynamic scene deblurring.
The network requires 120 convolution layers in total, and an adversarial loss is added to encourage sharp, realistic restorations. Furthermore, Kupyn et al. [14] exploited a conditional GAN and applied a multi-component loss function to preserve fine texture details; except for the last layer of their network, all convolution layers are followed by an instance normalization (InstanceNorm) layer [32] and a Leaky Rectified Linear Unit (LeakyReLU) [33] with α = 0.2. On the whole, since dynamic scene blur varies from pixel to pixel


and from image to image, it is inefficient for the above methods to use a single CNN structure to tackle all cases, and they are not always effective, especially for severe and complex dynamic scene blur.

3. Proposed method

We propose a novel MSA-GAN, which includes four major modules. The architecture of the proposed method is illustrated in Fig. 2.
(1) The attentive guidance module: it aims to focus on local severely blurred regions or objects and their surrounding structures by means of an attention map in the feature extraction and deblurring process.
(2) The attentive and multi-scale feature extraction module: it contains the attentive feature extraction module and the multi-scale feature extraction module. The former is designed for extracting the main features of local severely blurred regions or objects and their surrounding structures; the latter aims to efficiently extract multi-scale features and capture information about complex blurs that the attentive feature extraction module misses.
(3) The multi-stream and multi-scale feature fusion module: it reconstructs the final clear image via the multi-stream and multi-scale fusion strategy and a learned gate structure.
(4) The discriminative module: its main function is to differentiate whether a generated image is real or fake and to ensure that the outputs look more like real images.

3.1. Attentive guidance module

The proposed attentive guidance module is composed of N attentive blocks. Each attentive block consists of an input layer, four residual blocks, a convolutional long short-term memory (ConvLSTM) unit [34] and an output layer, as shown in Fig. 3(a). The designed residual blocks are used for extracting more relevant features from the input blurry image and the mask generated by the previous block, as shown in Fig. 3(b). Moreover, because the results strongly depend on the batch size, we add group normalization (GN) [35], which avoids the batch-size limitation and ensures that the attention maps are more stable and effective. The proposed attention map represents increasing attention from clean or slightly blurred regions to severely blurred regions, and the values even vary inside blur regions. In practice, the attention map is a matrix whose values range from 0 to 1; the larger a value is, the more attention that location receives. The visualization of the final attention map generated by the designed attentive guidance module is shown in Fig. 4. When the blur is severe, more focus is put on the severely blurred region; when the blur is slight, more attention is paid to edge and texture structures. The ConvLSTM unit is the most important part of our attentive guidance module; it comprises an input gate i_t, a forget gate f_t, an output gate o_t and a cell state g_t. The interaction between the gates and states along the time dimension is defined as follows:

$$
\begin{aligned}
i_t &= \sigma\!\left(W_{xi} * X_t + W_{hi} * h_{t-1} + W_{ci} \odot g_{t-1} + b_i\right)\\
f_t &= \sigma\!\left(W_{xf} * X_t + W_{hf} * h_{t-1} + W_{cf} \odot g_{t-1} + b_f\right)\\
g_t &= f_t \odot g_{t-1} + i_t \odot \tanh\!\left(W_{xg} * X_t + W_{hg} * h_{t-1} + b_g\right)\\
o_t &= \sigma\!\left(W_{xo} * X_t + W_{ho} * h_{t-1} + W_{go} \odot g_t + b_o\right)
\end{aligned}
\tag{1}
$$

where $X_t$ is the feature generated by the residual blocks, $g_t$ encodes the cell state that is fed into the next LSTM step, and $h_t$ is the output feature of the LSTM unit. The operators $\odot$ and $*$ denote element-wise multiplication and convolution, respectively.
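For concreteness, the following is a minimal PyTorch sketch of a ConvLSTM cell consistent with Eq. (1). The class name, channel arguments and the per-channel peephole parameters are our own illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell following Eq. (1): gates are computed by
    convolutions over [X_t, h_{t-1}], plus Hadamard (peephole) terms on
    the cell state g."""

    def __init__(self, in_ch, hid_ch, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.conv_i = nn.Conv2d(in_ch + hid_ch, hid_ch, kernel_size, padding=pad)
        self.conv_f = nn.Conv2d(in_ch + hid_ch, hid_ch, kernel_size, padding=pad)
        self.conv_g = nn.Conv2d(in_ch + hid_ch, hid_ch, kernel_size, padding=pad)
        self.conv_o = nn.Conv2d(in_ch + hid_ch, hid_ch, kernel_size, padding=pad)
        # Peephole weights W_ci, W_cf, W_go (one scalar per channel, broadcast).
        self.w_ci = nn.Parameter(torch.zeros(1, hid_ch, 1, 1))
        self.w_cf = nn.Parameter(torch.zeros(1, hid_ch, 1, 1))
        self.w_go = nn.Parameter(torch.zeros(1, hid_ch, 1, 1))

    def forward(self, x, h_prev, g_prev):
        xh = torch.cat([x, h_prev], dim=1)
        i = torch.sigmoid(self.conv_i(xh) + self.w_ci * g_prev)   # input gate
        f = torch.sigmoid(self.conv_f(xh) + self.w_cf * g_prev)   # forget gate
        g = f * g_prev + i * torch.tanh(self.conv_g(xh))          # cell state
        o = torch.sigmoid(self.conv_o(xh) + self.w_go * g)        # output gate
        h = o * torch.tanh(g)                                     # output feature
        return h, g
```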



Fig. 2. The architecture of the proposed MSA-GAN. Dashed lines represent skip connections, the red line represents the fusion connection at the original scale, the dark blue lines represent the fusion connection at 1/4 of original size and the light blue lines represent the fusion connection at 1/2 of original size.

Fig. 3. (a) The attentive block of our attentive guidance module, (b) the residual block of attentive block. Curved dotted and solid lines represent skip connections.

The output feature of the LSTM unit is fed into the following convolutional layers to produce an attention map. During training, we initialize the attention map to 0.5 [26]. At every time step, we first concatenate the input blurry image with the current attention map and then feed them into the next block of the attentive guidance module.
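A schematic of this recurrence is sketched below (N attentive blocks, attention map initialized to 0.5, input re-concatenated with the current map at every step). `AttentiveGuidance` and the block interface are placeholders for the block of Fig. 3(a), not the authors' code.

```python
import torch
import torch.nn as nn

class AttentiveGuidance(nn.Module):
    """Recurrent attention estimation: at each of the N time steps the blurry
    input is concatenated with the current attention map and passed through
    an attentive block (residual blocks + ConvLSTM + output layer)."""

    def __init__(self, attentive_block, steps=4):
        super().__init__()
        self.block = attentive_block   # e.g. the block of Fig. 3(a)
        self.steps = steps

    def forward(self, blurry):
        b, _, h, w = blurry.shape
        attn = torch.full((b, 1, h, w), 0.5, device=blurry.device)  # init 0.5
        state = None
        maps = []
        for _ in range(self.steps):
            x = torch.cat([blurry, attn], dim=1)
            # assumed interface: block returns the new map and the LSTM state
            attn, state = self.block(x, state)
            maps.append(attn)
        return maps   # all maps are supervised by the attentive loss, Eq. (2)
```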

The blur of different regions or objects usually varies from pixel to pixel within the same dynamic scene image. To make the attentive guidance module learn better and focus on severely blurred regions, we create a binary mask M when training our generative network. First, we subtract the blurred image from its corresponding sharp image. Then, for all images in our training dataset, we use a threshold of 20 to determine whether a pixel belongs to a severely blurred region. Finally, pixels whose difference is greater than 20 are set to 1 and pixels whose difference is less than 20 are set to 0, which gives the binary mask M. In each attentive block, the attentive loss is defined as the MSE between the output attention map A_t at time step t and the binary mask M.
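A minimal sketch of the mask construction just described and of the time-weighted attentive loss formalized in Eq. (2) below; the function names, the per-channel handling of the difference and the tensor conventions are our assumptions.

```python
import torch
import torch.nn.functional as F

def make_blur_mask(blurred, sharp, threshold=20):
    """Binary mask M: 1 where the blurred image deviates from the sharp image
    by more than `threshold` (severe blur), 0 elsewhere. Inputs are assumed to
    be uint8-range tensors of shape (C, H, W)."""
    diff = (blurred.float() - sharp.float()).abs()
    # One possible convention: mark a location as severely blurred if any
    # color channel exceeds the threshold.
    return (diff.max(dim=0, keepdim=True).values > threshold).float()

def attentive_loss(attention_maps, mask, theta=0.9):
    """Eq. (2): theta^(N-t)-weighted MSE, so later maps get larger weights."""
    n = len(attention_maps)
    return sum(theta ** (n - t) * F.mse_loss(a, mask)
               for t, a in enumerate(attention_maps, start=1))
```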



Fig. 4. Visualization of the final attention map generated by our attentive guidance module. The learning process is performed recursively over the time steps. Our attentive guidance module focuses on severe blur regions and edge/texture structures more steadily.

The attentive loss function is expressed as:

$$
L_{ATT}(\{A\}, M) = \sum_{t=1}^{N} \theta^{\,N-t}\, L_{MSE}(A_t, M), \qquad A_t = ATT_t\!\left(f_{t-1}, h_{t-1}, g_{t-1}\right)
\tag{2}
$$

where $A_t$ is the attention map generated by the attentive guidance module at time step $t$, and $ATT_t$ denotes the attentive guidance module at time step $t$. $f_{t-1}$ is the concatenation of the input image and the attention map from the previous step; when $t = 1$, it is the initial attention map concatenated with the input blurry image. $\theta^{N-t}$ is a time-step-dependent weight, and $N$ is the number of attentive blocks, which is set to 4.

3.2. Attentive and multi-scale feature extraction module

Due to the complexity of dynamic scene blurs, a single-stream network may not be able to capture all the information of different blurs. Therefore, we propose a multi-stream and multi-scale feature extraction strategy and design an attentive and multi-scale feature extraction module, which contains the attentive feature extraction module and the multi-scale feature extraction module; the details are described as follows.

3.2.1. Attentive feature extraction module

Encoder-decoder networks have been demonstrated to produce good results for various generative tasks [18,36]. Because of the variability of the blur degree in dynamic scenes, a large receptive field is needed to obtain more relevant information. We therefore adopt a simple asymmetric residual encoder-decoder structure in this module to enlarge the receptive field. The encoder comprises three scales, where each scale contains three residual blocks [37] and two strided convolutional layers with stride 1/2 to downsample the feature maps. The residual block structure is shown in Fig. 5(a). The decoder includes two transposed convolution layers to enlarge the spatial resolution of the feature maps. In addition, skip connections are used to accelerate convergence and generate better outputs. The input of this module is the concatenation of the input blurry image and the generated attention map.

3.2.2. Multi-scale feature extraction module

The asymmetric residual encoder-decoder structure is also adopted in this module. Apart from the residual blocks, all components are the same as in the attentive feature extraction module. We observe that a network with a single convolution kernel size has difficulty extracting features of severe and spatially variant blur.

Fig. 5. (a) Residual block structure of the attentive feature extraction module and (b) multi-scale residual block structure of the multi-scale feature extraction module.

Therefore, we design a multi-scale residual block with different convolution kernels (i.e., different receptive fields) to solve this problem, which also reduces the training error. The multi-scale residual block consists of four convolution layers: 1 × 1, 3 × 3 and 5 × 5 convolutions for feature extraction and a 1 × 1 output layer for multi-scale feature fusion, as shown in Fig. 5(b). We employ LeakyReLU instead of ReLU as the activation function of our residual block, which avoids the "dead feature" problem caused by the zero gradient of ReLU. In addition, we adopt local residual learning in each multi-scale residual block to make the network more efficient and reduce its complexity.
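The following PyTorch sketch illustrates one way such a multi-scale residual block could be built under the assumptions above (parallel 1×1/3×3/5×5 convolutions, a 1×1 fusion layer, LeakyReLU and a local residual connection); the channel arrangement is illustrative, not the authors' exact design.

```python
import torch
import torch.nn as nn

class MultiScaleResidualBlock(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 convolutions, a 1x1 fusion layer,
    LeakyReLU activations and a local residual (skip) connection."""

    def __init__(self, channels):
        super().__init__()
        self.branch1 = nn.Conv2d(channels, channels, 1)
        self.branch3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.branch5 = nn.Conv2d(channels, channels, 5, padding=2)
        self.fuse = nn.Conv2d(3 * channels, channels, 1)
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        feats = torch.cat([
            self.act(self.branch1(x)),
            self.act(self.branch3(x)),
            self.act(self.branch5(x)),
        ], dim=1)
        # 1x1 fusion of the multi-scale features, then local residual learning.
        return x + self.fuse(feats)
```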

Multi-scale losses at each scale have been shown to achieve good deblurring results [13,15]. We therefore introduce a multi-stream loss in the attentive and multi-scale feature extraction module to help extract more useful features and texture details from the dynamic scene blurred image; an MSE loss is applied between each stream output and the ground truth. The proposed multi-stream loss is calculated as:

$$
L_{MS} = \sum_{i=1}^{3} L_{MSE}\!\left(I^{B_i}, I^{S}\right)
\tag{3}
$$

where $I^{S}$ is the sharp dynamic scene image and $I^{B_i}$ is the latent sharp image produced by the corresponding stream.

3.3. Multi-stream and multi-scale feature fusion module

A single dynamic scene image may contain many similar objects, such as people, cars and plants, as well as complex context. Several works [20,38–40] have found that fusing convolutional features leads to a better representation of an object in the image and of its surrounding complex context. Inspired by Ren et al. [20], we propose a multi-scale feature fusion strategy and design a



Fig. 6. Visualization of the features $\psi_{f1_3}$ and $\psi_{f2_3}$ from the attentive feature extraction module and the multi-scale feature extraction module. The features $\psi_{f1_3}$ and $\psi_{f2_3}$ are fused by the designed gate structure $G_{gate}$. The fused features $\psi_{fusion_3}$ have a higher response on large blur regions and clearer details.

multi-stream and multi-scale feature fusion module (MS-MSF) for obtaining more effective features by fusing the multi-scale features of corresponding dimensions from different streams. The fusion strategy plays the most important role in this module. The designed MS-MSF module contains gate structures at three scales, two transposed convolution layers, a convolution layer with ReLU and a convolution layer with Tanh. The designed gate structure can adaptively fuse the multi-scale features and enables the fusion of local and contextual features from different streams. The gate structure mainly consists of two convolution layers with filter sizes of 3 × 3 and 1 × 1 and three residual blocks. The gate structure $G_{gate}$ generates pixel-wise weight maps by blending $\psi_{f1_i}$, which denotes the features at three different scales from the multi-scale feature extraction module, and $\psi_{f2_i}$, which denotes the features at three different scales from the attentive feature extraction module. The gate structure takes the set of $\psi_{f1_i}$, $\psi_{f2_i}$ and $\psi_{fusion_{i-1}}$ as input, where $\psi_{fusion_{i-1}}$ is the fused feature from the previous scale; when $i = 1$, $G_{gate}$ takes only $\psi_{f1_i}$ and $\psi_{f2_i}$ as input. The fusion is expressed as:

$$
\psi_{fusion_i} = G_{gate_i}\!\big(\mathrm{cat}(\psi_{f1_i}, \psi_{f2_i}, \psi_{fusion_{i-1}})\big) \odot \psi_{f1_i} \oplus \psi_{f2_i} \oplus \psi_{fusion_{i-1}}
\tag{4}
$$

where cat denotes concatenation, $\odot$ denotes element-wise multiplication and $\oplus$ denotes element-wise addition. This is equivalent to local skip connections, which accelerate convergence and yield better fused features. The values of $i$ vary from 1 to 3. As shown in Fig. 6, we visualize the features $\psi_{f1_3}$, $\psi_{f2_3}$ and $\psi_{fusion_3}$. The $\psi_{fusion_3}$ features not only have a high response on large blur regions but also contain the spatial details of the input image. The key point of fusing the two kinds of features in our fusion module is that the attentive features can guide the multi-scale features, while the multi-scale features complement the attentive features, especially in blur regions.
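A minimal sketch of how the gated fusion of Eq. (4) could be implemented in PyTorch. The gate here is simplified to two convolutions with a sigmoid (the three residual blocks are omitted), and all names are ours rather than the paper's.

```python
import torch
import torch.nn as nn

class GateFusion(nn.Module):
    """Gated fusion of Eq. (4): a small gate network maps the concatenated
    features to pixel-wise weights that modulate the first stream; the second
    stream and the previous-scale fusion are then added."""

    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(3 * channels, channels, 3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid(),  # pixel-wise weight maps in [0, 1] (our assumption)
        )

    def forward(self, f1, f2, fusion_prev=None):
        if fusion_prev is None:          # i == 1: no previous-scale fusion
            fusion_prev = torch.zeros_like(f1)
        weights = self.gate(torch.cat([f1, f2, fusion_prev], dim=1))
        return weights * f1 + f2 + fusion_prev
```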

Finally, to further reconstruct the deblurred image and preserve finer details, we concatenate the fused features from the final gate structure with the two coarse deblurred images and feed them into a convolutional layer with LeakyReLU and a convolutional layer with Tanh for the final refinement. We introduce the perceptual loss [21] and an adversarial loss to constrain the final output of the multi-stream and multi-scale feature fusion module. The perceptual loss measures the global difference between features of the generated image and those of the corresponding real clean image; these features are extracted from a pre-trained VGG19. The perceptual loss function is defined as:

$$
L_{p} = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \Big( \phi_{i,j}\!\left(I^{S}\right)_{x,y} - \phi_{i,j}\!\big(G_{\theta_G}\!\left(I^{B}\right)\big)_{x,y} \Big)^{2}
\tag{5}
$$

where $I^{B}$ is the blurry image and $\phi_{i,j}$ denotes the feature map obtained from the $j$th convolution layer before the $i$th max-pooling layer of the VGG19 network pre-trained on ImageNet [41]. $H_{i,j}$ and $W_{i,j}$ are the height and width of the feature maps, respectively. We use the activations of the VGG$_{3,3}$ convolution layer. Meanwhile, to make the generated images indistinguishable from real images, we adopt WGAN-GP [23] as the adversarial loss, which has been shown to be robust to the choice of generator architecture [22]. Our discriminator is a PatchGAN [11,30]. With the exception of the last layer, all convolutional layers are followed by an InstanceNorm layer and LeakyReLU with α = 0.2. The generative adversarial objective function is given by

$$
\min_{G}\max_{D} V(G, D) = \mathbb{E}_{I^{S}\sim P_{\mathrm{train}}(I^{S})}\!\left[\log D\!\left(I^{S}\right)\right] + \mathbb{E}_{I^{B}\sim P_{\mathrm{train}}(I^{B})}\!\left[\log\!\left(1 - D\!\left(G\!\left(I^{B}\right)\right)\right)\right],
\qquad
L_{GAN} = \sum_{n=1}^{N} -D\!\left(G\!\left(I^{B}\right)\right)
\tag{6}
$$

where $G$ represents the generator and $D$ represents the discriminator; $I^{S}$ is the input clear image and $I^{B}$ is the input blurry image. Finally, we train our network by jointly optimizing $L_{ATT}$, $L_{MS}$, $L_{p}$ and $L_{GAN}$.



The loss function of the entire proposed network is designed as:

$$
L = L_{ATT} + L_{MS} + L_{p} + L_{GAN}
\tag{7}
$$
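Assuming PyTorch and torchvision, the sketch below shows one way the perceptual term of Eq. (5) (VGG19 features, approximately the conv3_3 activations) and the unweighted sum of Eq. (7) could be assembled; the feature-layer index and all names follow common practice, not the authors' released code.

```python
import torch
import torch.nn as nn
from torchvision import models

class PerceptualLoss(nn.Module):
    """MSE between VGG19 features of the restored and sharp images (Eq. (5)).
    The slice [:15] approximately corresponds to the conv3_3 output."""

    def __init__(self):
        super().__init__()
        vgg = models.vgg19(pretrained=True).features
        self.extractor = nn.Sequential(*list(vgg.children())[:15]).eval()
        for p in self.extractor.parameters():
            p.requires_grad = False
        self.mse = nn.MSELoss()

    def forward(self, restored, sharp):
        # NOTE: in practice inputs are usually normalized with ImageNet
        # statistics before being fed to VGG.
        return self.mse(self.extractor(restored), self.extractor(sharp))

def total_loss(l_att, l_ms, l_p, l_gan):
    """Eq. (7): plain sum of the four terms (the paper states no extra weights)."""
    return l_att + l_ms + l_p + l_gan
```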

4. Experiments and results

In this section, we perform comparative experiments on the GoPro dynamic scene dataset [13], a synthetic dynamic scene dataset [13] and some real-world blurred images from [7,14,18], comparing with state-of-the-art methods [3,5,13–15,42] both quantitatively and qualitatively. These datasets contain complex dynamic scene blur as well as globally and locally severe blur. The GoPro dynamic scene dataset includes 3214 pairs of clear and blurry images. Following [13], we use 2103 pairs for training and the remaining 1111 pairs for testing. The dataset was captured with a high-speed GoPro4 Hero Black camera, and a varying number (7–13) of consecutive sharp frames from 240 fps videos were averaged to generate blur of different degrees. Notably, datasets generated in this way are more realistic, because they imitate the complicated camera shake and the spatially varying blur caused by object motion against a static background that are common in real scenarios. For the synthetic dynamic scene dataset, Nah et al. [13] used a gamma curve to synthesize the corresponding dynamic scene blurred images; we take 1111 pairs of synthetic images as an additional test set for evaluating the generalization capability of the proposed method. All experiments are conducted with the PyTorch [44] deep learning framework on a single NVidia Titan GPU.

4.1. Experimental settings

For training, we use image patches of size 320 × 320 instead of 256 × 256 as input, which is beneficial for learning more effective information about the blur. For optimization we follow the strategy of [14] and perform five gradient descent steps on D, then one step on G. We use the ADAM [43] optimizer with β1 = 0.5, β2 = 0.999 and ε = 10^−8. The learning rate is initially set to 10^−4 for both G and D. All trainable variables are initialized in the same way as in [14]. After the first 40 epochs, we linearly decay the learning rate to zero over the next 40 epochs. In addition, our model is trained with a batch size of 1, which gives better results on the test dataset. Some basic parameters of the proposed method are set as follows, based on preliminary experiments. In the attentive guidance module, the threshold for obtaining the binary mask M is set to 20, for two main reasons: first, after subtracting the blurred image from its corresponding clean image we observe that most pixel values in severely blurred regions are greater than 20; second, smaller values such as 10 fail to focus on the severely blurred regions, while larger values such as 30 or 40 may lose some important details. Moreover, we set the number of time steps N to 4 and the hyper-parameter θ to 0.9; a larger N can generate a better attention map but requires more memory, and θ = 0.9 was found to be optimal. In the multi-scale feature extraction module, we also tried building the feature extraction part of the multi-scale residual block from three convolution layers with filter sizes of 1 × 1, 3 × 3, 7 × 7 or 1 × 1, 5 × 5, 7 × 7. We finally chose 1 × 1, 3 × 3 and 5 × 5, since the other choices bring only marginal improvements and are computationally more expensive.
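The optimization schedule described above could be set up as follows; the WGAN-GP penalty weight of 10 and the simplified single-output generator interface are assumptions on our part, not values stated in the paper.

```python
import torch
from torch import optim

def wgan_gp_penalty(netD, real, fake, lambda_gp=10.0):
    """Gradient penalty of WGAN-GP [23] on random interpolates."""
    alpha = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    inter = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    grads = torch.autograd.grad(netD(inter).sum(), inter, create_graph=True)[0]
    return lambda_gp * ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

def make_optimizers(netG, netD, lr=1e-4):
    """ADAM with beta1=0.5, beta2=0.999, eps=1e-8 for both G and D."""
    optG = optim.Adam(netG.parameters(), lr=lr, betas=(0.5, 0.999), eps=1e-8)
    optD = optim.Adam(netD.parameters(), lr=lr, betas=(0.5, 0.999), eps=1e-8)
    return optG, optD

def train_step(netG, netD, optG, optD, blurry, sharp, d_iters=5):
    """Five critic (D) updates followed by one generator (G) update."""
    for _ in range(d_iters):
        fake = netG(blurry).detach()
        d_loss = netD(fake).mean() - netD(sharp).mean() \
                 + wgan_gp_penalty(netD, sharp, fake)
        optD.zero_grad()
        d_loss.backward()
        optD.step()
    fake = netG(blurry)
    # Adversarial term only; the content losses of Eq. (7) would be added here.
    g_loss = -netD(fake).mean()
    optG.zero_grad()
    g_loss.backward()
    optG.step()
```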
4.2. Quantitative evaluations

Table 1 reports the average peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) of the images recovered by the proposed method and the state-of-the-art methods on the GoPro dynamic scene dataset and the synthetic dynamic scene dataset. From Table 1, we can see that the method of [42] gives the lowest performance and the method of [15] performs best among the compared methods. The presented method is superior to [15] in both PSNR and SSIM. These results suggest that the proposed method effectively improves the quality of the restored images.

4.3. Qualitative evaluations

4.3.1. Results on the GoPro dynamic scene dataset
Fig. 7 shows three visual comparison examples generated by the presented method and the state-of-the-art methods on the GoPro dynamic scene test dataset. These blurry examples suffer from complex blur caused by scene depth variations, object motions and camera shake, and contain locally severe blur. From Fig. 7, we can see that the deblurring methods [3,5,42] cannot restore clear results, since forward/backward motion and scene depth variations play important roles in real blurry images. The CNN-based methods [13–15] are designed for dynamic scene deblurring, but they cannot remove severe blur because of the limited receptive fields of their networks. In addition to achieving the highest PSNR and SSIM values, the proposed method restores images with much clearer structures and finer texture details than the existing state-of-the-art methods, especially for the severely blurred car in the first image.

4.3.2. Results on the synthetic dynamic scene dataset
Since some deep-learning-based methods are trained on synthesized blurred images, for a fair comparison we also show two examples of deblurring results on the synthetic dynamic scene test dataset in Fig. 8. Although the methods [3,5,42] target blur caused by simple camera shake, they fail to generate clear images. The final results of the CNN-based methods [13–15] contain some blurry structures and artifacts. As presented in Fig. 8(h), aside from achieving much higher PSNR and SSIM values, our result is visually closer to the real scene, with sharper structures, clearer details and fewer artifacts. Hence, the results indicate that the proposed method generalizes well.

4.3.3. Results on real-world images
We further evaluate the presented method on some real-world blurry images. Fig. 9 shows two real deblurring examples on a dynamic scene dataset [14], in which the blur is caused by both car movement and camera shake. Fig. 10 shows an example of deblurring a non-uniform motion blur image from [7]. Fig. 11 shows other real-world examples [18] processed by our method and the state-of-the-art methods. The methods [3,5,42] are not able to restore sufficiently clear images. In contrast to the methods [13–15], the presented method generates much clearer details and texture structures.

4.4. Run time
We compare the presented method with the state-of-the-art methods in terms of run time on the same computer with an Intel(R) Core(TM) i5-8400 CPU and a single NVidia Titan GPU. The average run time for 1111 images of size 1280 × 720 pixels is presented in Table 2. The traditional image deblurring methods [3,5] are time-consuming because they require solving highly non-convex optimization problems. Although the method [42] uses a CNN to estimate the blur kernel, it still needs a traditional non-blind deblurring algorithm to restore the sharp image, which increases the computational cost. Compared with the traditional methods, the recent end-to-end trainable networks [13–15] take much less time. In contrast to the method of



Fig. 7. Deblurring results on the GoPro dynamic scene dataset. The proposed method generates much clearer images with higher PSNR and SSIM values.



Table 1
Quantitative results comparison with state-of-the-art methods.

| Method | GoPro PSNR (dB) | GoPro SSIM | Synthetic PSNR (dB) | Synthetic SSIM |
|---|---|---|---|---|
| Pan et al. [3] | 23.5049 | 0.8336 | 22.4361 | 0.8208 |
| Chakrabarti [42] | 23.4623 | 0.8216 | 22.3516 | 0.8113 |
| Sun et al. [7] | 25.3098 | 0.8511 | 24.5019 | 0.8385 |
| Whyte et al. [5] | 24.5312 | 0.8458 | 23.6843 | 0.8268 |
| Nah et al. [13] | 29.5762 | 0.8708 | 27.9581 | 0.8540 |
| Kupyn et al. [14] | 28.8248 | 0.8507 | 27.7888 | 0.8405 |
| Tao et al. [15] | 31.0659 | 0.9085 | 29.3401 | 0.8961 |
| Ours | 31.9498 | 0.9177 | 29.4408 | 0.9003 |
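For reference, per-image scores such as those averaged in Table 1 can be computed, for example, with scikit-image; the following sketch assumes uint8 RGB arrays and is not the authors' evaluation code.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(restored, sharp):
    """PSNR (dB) and SSIM for one restored/sharp pair of uint8 RGB arrays.
    (On older scikit-image versions use multichannel=True instead of channel_axis.)"""
    psnr = peak_signal_noise_ratio(sharp, restored, data_range=255)
    ssim = structural_similarity(sharp, restored, channel_axis=-1, data_range=255)
    return psnr, ssim
```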

Fig. 8. Deblurring results on the synthetic dynamic scene dataset with many complex details.



Fig. 9. Deblurring results on a dynamic scene dataset.

Table 2
Average run time (seconds) on the GoPro dynamic scene dataset.

| Method | Pan et al. [3] | Chakrabarti [42] | Sun et al. [7] | Whyte et al. [5] | Nah et al. [13] | Kupyn et al. [14] | Tao et al. [15] | Ours |
|---|---|---|---|---|---|---|---|---|
| Time (s) | 2500 | 2100 | 1200 | 720 | 6.59 | 0.85 | 1.55 | 1.35 |



Fig. 10. Deblurring results on a non-uniform motion blur image.

Fig. 11. Deblurring results on a real-world dataset.



Fig. 12. Visual comparison of the proposed method without AGM, with ARM and with AGM on the dynamic scene dataset. (i, k) and (j, l) are attention maps generated by the ARM [26] and by our AGM, respectively.

[14], the presented method takes slightly more time, but it produces clearer images with higher PSNR and SSIM values.

5. Analysis and discussions

5.1. Effectiveness of the attentive guidance module

To demonstrate the effectiveness of the designed attentive guidance module (AGM), we perform three groups of experiments on the dynamic scene datasets [13]: one uses the proposed method with the attentive recurrent module (ARM) from [26] directly, and the other two use the proposed method with and without our AGM. We use the same training strategy as in Section 4.1 to train these networks. The deblurring results and corresponding attention maps generated by these networks are shown in Fig. 12. As shown in Fig. 12(b, f), the deblurring results without the AGM contain blur residuals, especially in the larger blur regions. From Fig. 12(c, g), the results obtained by directly using the ARM from [26] still include blur residuals. By adding the AGM to generate an attention map, a sharper image is generated, as presented in Fig. 12(d, h). Fig. 12(j, l) and Fig. 12(i, k) show the attention maps generated by our AGM and by the ARM from [26], respectively. We observe that the attention maps generated by directly using the ARM

from [26] are disorganized and cannot focus well on the desired regions, whereas the attention maps generated by our designed AGM are more stable and effective and focus on the severely blurred regions. Moreover, the average PSNR and SSIM of these three variants on the GoPro and synthetic dynamic scene datasets are reported in Table 3, which clearly indicates that the proposed method outperforms both the variant without the AGM and the variant with the ARM [26]. Correspondingly, the average PSNR and SSIM gains of the presented method over the variant without the AGM on the real and synthetic dynamic scene datasets are 1.4409 dB / 0.0217 and 0.8597 dB / 0.0199, respectively. Besides, the average PSNR and SSIM gains of the presented method with the AGM over the variant with the ARM are 0.9242 dB / 0.0154 and 0.5757 dB / 0.0132, respectively. These quantitative results show that the proposed AGM yields the best performance gain, since the module pays more attention to severely blurred regions and effectively improves their quality.

5.2. Effectiveness of the multi-stream and multi-scale feature extraction strategy

Two experiments are performed to demonstrate the effectiveness of the presented multi-stream and multi-scale



Table 3
Quantitative results comparison of the proposed method without AGM, with ARM and with AGM.

| Method | GoPro PSNR (dB) | GoPro SSIM | Synthetic PSNR (dB) | Synthetic SSIM |
|---|---|---|---|---|
| Without AGM | 30.5089 | 0.8960 | 28.5811 | 0.8804 |
| With ARM [26] | 31.0256 | 0.9023 | 28.8649 | 0.8861 |
| Ours | 31.9498 | 0.9177 | 29.4408 | 0.9003 |

Fig. 13. Visualization of features (a) and (b) from the attentive feature extraction module and the multi-scale feature extraction module, and (c) the features fused by the designed gate structure. The proposed MS-MS strategy fuses the different features better.

Fig. 14. Visual comparison with the proposed method without MS-MS and with MS-MS on GoPro dynamic scene dataset in the first row and on the synthetic dynamic scene dataset in the second row.

feature extraction strategy (MS-MS). In the first experiment, the proposed method is used without the attentive guidance module and the attentive feature extraction module; the second experiment uses the full proposed method. We use the same training strategy as in Section 4.1. First, we show the visual features extracted by the attentive feature extraction module and the multi-scale feature extraction module in Fig. 13(a) and (b); the features fused by the designed gate structure are shown in Fig. 13(c). From Fig. 13(a), we can see that it mainly contains features of the main blur regions, while Fig. 13(b) contains the spatially variant information and texture details. Thus, the

two features can complement each other, especially in blur regions. We also show the visual results generated by these two methods in Fig. 14. As shown in Fig. 14(b, e), the deblurring results without MS-MS still contain blur residuals, and many details remain unclear. Compared with the variant without MS-MS, the proposed method recovers a clearer image with sharper details, as shown in Fig. 14(c, f). Correspondingly, Table 4 shows the quantitative results of the proposed method with and without MS-MS on the GoPro and synthetic dynamic scene datasets. It is obvious that the proposed



Table 4
Quantitative results comparison of the proposed method with and without MS-MS.

| Method | GoPro PSNR (dB) | GoPro SSIM | Synthetic PSNR (dB) | Synthetic SSIM |
|---|---|---|---|---|
| Without MS-MS | 30.3030 | 0.8902 | 28.2775 | 0.8724 |
| Ours | 31.9498 | 0.9177 | 29.4408 | 0.9003 |

Fig. 15. Visual comparison with the output feature maps of the residual block [37] and our MSRB.

Fig. 16. Visual comparison with the proposed method with residual block [37] and with MSRB on GoPro dynamic scene dataset.

method outperforms the variant without MS-MS. Concretely, the average PSNR and SSIM gains of the proposed method over the variant without MS-MS are 1.6468 dB / 0.0275 and 1.1633 dB / 0.0279 on the two datasets, respectively. The experimental results thus confirm the effectiveness of the multi-stream and multi-scale feature extraction strategy for images with different types of blur and rich details. From the quantitative results, this strategy brings a larger performance gain than the attentive guidance module alone; however, it is evaluated as the combination of the attentive guidance module and the attentive feature extraction module, so the final gain results from this combination. Therefore, among single modules, the attentive guidance module still provides the best gain.

5.3. Effectiveness of the multi-scale residual block

To illustrate the effectiveness of the presented multi-scale residual block (MSRB), we design an experiment comparing the multi-scale feature extraction module built with the residual block of [37] against the one built with the MSRB. We follow the same training strategy as in Section 4.1. Fig. 15 shows some feature maps from the two kinds of residual blocks. It is clear that the output of the MSRB includes more detailed information and more effective activation maps. Moreover, we show one deblurring result generated by the two variants in Fig. 16. Table 5 shows the

average quantitative results of the two approaches on the test datasets. As shown in Fig. 16, the deblurring results with the proposed MSRB are clearer in detailed regions. Correspondingly, compared with the variant without the MSRB, the average PSNR and SSIM with the MSRB increase by 0.6266 dB / 0.0094 and 0.3468 dB / 0.0065 on the GoPro and synthetic dynamic scene datasets, respectively. The experimental results indicate the effectiveness of the presented MSRB.

5.4. Effectiveness of the multi-scale feature fusion strategy

To verify the effectiveness of the proposed multi-scale fusion strategy (MSFS), we perform two groups of experiments on the dynamic scene datasets [13]: the proposed method without the MSFS and the proposed method with the MSFS. We use the same training strategy as in Section 4.1 to train the two networks. First, we visualize 64 feature maps with and without the MSFS in Fig. 17. From Fig. 17(c), it is obvious that the proposed MSFS obtains more effective features as well as their surrounding complex context. Besides, we show the deblurring results generated by the two networks in Fig. 18. From Fig. 18(b, f), the deblurring results without the MSFS still contain blur residuals, especially in regions with abundant detail. With the MSFS, a sharper image can be generated, as presented in Fig. 18(c, g). These results demonstrate the effectiveness of the proposed MSFS.



Fig. 17. (a) and (b) are the feature maps of the attentive feature extraction module and the multi-scale feature extraction module without the MSFS, respectively; (c) is the feature map with the MSFS.

Fig. 18. Visual comparison of the proposed method without MSFS and with MSFS on the GoPro dynamic scene dataset (first row) and on the synthetic dynamic scene dataset (second row).

Table 5
Quantitative results comparison of the proposed method with the MSRB and with the residual block.

| Method | GoPro PSNR (dB) | GoPro SSIM | Synthetic PSNR (dB) | Synthetic SSIM |
|---|---|---|---|---|
| Without MSRB | 31.3232 | 0.9083 | 29.0940 | 0.8938 |
| Ours | 31.9498 | 0.9177 | 29.4408 | 0.9003 |

Table 6
Quantitative results comparison of the proposed method with and without MSFS.

| Method | GoPro PSNR (dB) | GoPro SSIM | Synthetic PSNR (dB) | Synthetic SSIM |
|---|---|---|---|---|
| Without MSFS | 30.5309 | 0.8872 | 28.0775 | 0.8624 |
| Ours | 31.9498 | 0.9177 | 29.4408 | 0.9003 |

Table 6 shows the quantitative results of the proposed method with and without the MSFS on the GoPro and synthetic dynamic scene datasets. It is obvious that the proposed method outperforms the variant without the MSFS. Concretely, the average PSNR and SSIM gains of the proposed method over the variant without the MSFS are 1.4187 dB / 0.0305 and 1.3633 dB / 0.0379 on the two datasets, respectively. The experimental results thus prove the effectiveness of the

MSFS for images with complex scenes and rich details. The gain of the MSFS is only slightly lower than that of the attentive guidance module.

5.5. Effectiveness of the gate structure

In this subsection, to further illustrate the effectiveness of the gate structure (GS), we remove the GS from the



Fig. 19. (a) Blurry image; (b) and (c) features from the different streams before fusion by the GS; (d) the feature after fusion by the GS.

Fig. 20. Visual comparison of the proposed method without GS and with GS on the GoPro dynamic scene dataset (first row) and on the synthetic dynamic scene dataset (second row).

Table 7
Quantitative results comparison of the proposed method with and without GS.

| Method | GoPro PSNR (dB) | GoPro SSIM | Synthetic PSNR (dB) | Synthetic SSIM |
|---|---|---|---|---|
| Without GS | 31.0200 | 0.9053 | 28.5940 | 0.8867 |
| Ours | 31.9498 | 0.9177 | 29.4408 | 0.9003 |

proposed method and compare the two variants. We adopt the same training strategy as in Section 4.1. In Fig. 19, we visualize the features before and after the fusion performed by the GS. Fig. 19(b) and (c) contain different spatial features and show no response in the blur regions, while Fig. 19(d) contains more information and shows a high response in severely blurred regions after the fusion. We also show the deblurring results generated by the two methods in Fig. 20. As shown in Fig. 20(b, e), the deblurring results without the GS are still not clear enough in detailed regions. In contrast, the proposed method restores clearer details, as shown in Fig. 20(c, f). Correspondingly, Table 7 shows the average quantitative results of the method without the GS on the GoPro and synthetic dynamic scene datasets. The average PSNR and SSIM gains of the presented method over the

method without the GS are 0.9298 dB / 0.0124 and 0.8468 dB / 0.0136 on the two dynamic scene datasets, respectively. The experimental results suggest that the designed GS can further improve the quality of the reconstructed images, especially for images with severe blur and rich details. On the whole, all the modules are designed according to the characteristics of dynamic scene blur, and the attentive guidance and attentive feature extraction modules yield the best performance gain.

6. Conclusion

In this paper, we proposed a novel MSA-GAN for severe and complex dynamic scene blurs. The proposed generative net-



work first generates an attention map by an attentive guidance module, which is exceedingly important part of the proposed network, because it can guide the next feature extraction and deblurring process to focus on different degree blur regions or objects and their surrounding structures. Moreover, we propose the multistream and multi-scale feature extraction strategy and design an attentive and multi-scale feature extraction module, in which we design a multi-scale residual block for extracting the multi-scale features and capturing more information of complex blurs. Notably, we propose a multi-scale feature fusion strategy and the strategy is implemented in the multi-stream and multi-scale feature fusion module through the designed gate structure, which can adaptively learn and fuse multi-scale features from different streams. We also design a multi-components loss function to jointly optimize the presented network. Extensive experiments are performed on GoPro dynamic scene dataset, synthetic dynamic scene dataset and realworld images. The experimental results have proved the presented network significantly outperforms state-of-the-art methods. In addition, the effectiveness of the significant components of the MSAGAN are also proved in detail. In the future, our work will focus on video deblurring and improving the problem of time-consuming. Author Contributions Weihong Li is the corresponding author. Jinkai Cui proposed the whole network framework and designed and conducted the experiments. Weiguo Gong was involved in the writing and argumentation of the manuscript. Declaration of Competing Interests The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. Acknowledgments This work was supported by the National Science and Technology Key Program of China (2013GS500303); Key Projects of Science and Technology Agency of Guangxi province, China (Guike AA 17129002); and the Municipal Science and Technology Project of CQMMC, China (2017030502). The authors would like to thank the editors and reviewers for their valuable comments and suggestions. References [1] Y. Bai, G. Cheung, X. Liu, W. Gao, Graph-based blind image deblurring from a single photograph, IEEE Trans. Image Process. 28 (3) (2018) 1404–1418. [2] A. Levin, Y. Weiss, F. Durand, W.T. Freeman, Understanding blind deconvolution algorithms, IEEE Trans. Pattern Anal. Mach. Intell. 33 (12) (2011) 2354–2367. [3] J. Pan, D. Sun, H. Pfister, Deblurring Images via Dark Channel Prior, IEEE Trans. Pattern Anal. Mach. Intell. 40 (10) (2018) 2315–2328. [4] M. Hirsch, C.J. Schuler, S. Harmeling, B. Scholkopf, Fast removal of non-uniform camera shake, in: Proceedings of the IEEE International Conference on Computer Vision, 2011, pp. 463–470. [5] O. Whyte, J. Sivic, A. Zisserman, Non-uniform Deblurring for Shaken Images, in: Proceedings of the International Journal of Computer Vision, 2012, pp. 168–186. [6] Y. Xu, X. Hu, S.J.N. Peng, Sharp image estimation from a depth-involved motion-blurred image, Neurocomputing 171 (2016) 1185–1192. [7] J. Sun, W. Cao, Z. Xu, Learning a convolutional neural network for non-uniform motion blur removal, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 769–777. [8] D. Gong, J. Yang, L. Liu, Y. 
Author Contributions

Weihong Li is the corresponding author. Jinkai Cui proposed the whole network framework and designed and conducted the experiments. Weiguo Gong was involved in the writing and argumentation of the manuscript.

Declaration of Competing Interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Science and Technology Key Program of China (2013GS500303); the Key Projects of the Science and Technology Agency of Guangxi Province, China (Guike AA 17129002); and the Municipal Science and Technology Project of CQMMC, China (2017030502). The authors would like to thank the editors and reviewers for their valuable comments and suggestions.




Jinkai Cui is currently a Ph.D. candidate in the College of Optoelectronic Engineering, Chongqing University, Chongqing, China. His research interests include deep learning, computer vision and image processing.

Weiguo Gong received his doctoral degree in computer science from the Tokyo Institute of Technology, Japan, in March 1996 as a scholarship recipient of the Japanese Government. From April 1996 to March 2002, he served as a researcher or senior researcher at NEC Labs of Japan. He is now a professor at Chongqing University, China. He has authored or coauthored over 130 research papers in international journals and conferences as well as two books. His current research interests are in the areas of pattern recognition and image processing.

Weihong Li received her doctoral degree from Chongqing University in 2006. She is now a professor at Chongqing University, China. Her current research interests are in the areas of pattern recognition and image processing.
