Single image super-resolution with multi-level feature fusion recursive network

Xin Jin a, Qiming Xiong c, Chengyi Xiong a,b,1,∗, Zhibang Li a, Zhirong Gao d

a College of Electronic and Information Engineering, South-Central University for Nationalities, Wuhan 430074, China
b Hubei Key Laboratory of Intelligent Wireless Communication, South-Central University for Nationalities, Wuhan 430074, China
c School of Mathematics and Computer Science, Wuhan Textile University, Wuhan 430200, China
d School of Computer Science, South-Central University for Nationalities, Wuhan 430074, China

Article info

Article history: Received 16 May 2019; Revised 20 June 2019; Accepted 28 June 2019; Available online xxx. Communicated by Dr. Ma Jiayi.

Keywords: Single image super-resolution; Deep convolutional neural network; Recursive learning; Multi-level feature fusion

Abstract

Recursive learning can widen the receptive field of a deep convolutional neural network without increasing the number of model parameters, thanks to parameter sharing. Dense skip connections, in turn, can strengthen deep feature representations by reusing deep features from different receptive fields. Both techniques are highly beneficial to image restoration tasks. In this paper, we propose a new end-to-end deep network for single image super-resolution (SISR) that combines recursive residual feature extraction with multi-level feature fusion: multi-level deep features are first produced from the input low-resolution (LR) image by recursive convolution units and then fused to reconstruct the high-resolution (HR) image. The proposed scheme achieves good super-resolution performance with relatively low complexity. Extensive experiments on benchmark datasets verify the effectiveness of the proposed method.

1. Introduction

Single image super-resolution (SISR) refers to estimating a high-resolution (HR) output from a given low-resolution (LR) image by numerical computation. SISR can overcome hardware limitations to obtain HR images with high pixel density, and has been widely used in many fields, including remote sensing, medical image enhancement, intelligent surveillance, and telecommunication [1–3]. Over the past years, SISR research has received considerable attention. Because the relationship between an LR image and its HR counterparts is one-to-many, SISR is a highly ill-posed inverse problem that is difficult to solve. Traditional SISR methods fall into three main categories: interpolation methods [4–6], reconstruction-based methods [7,8], and learning-based methods [9–13]. As deep learning (DL) became widely used in image classification tasks [14,15], deep network-based SISR approaches have been extensively studied in recent years. Dong et al. first designed a three-layer convolutional neural network

∗ Corresponding author at: College of Electronic and Information Engineering, South-Central University for Nationalities, Wuhan 430074, China. E-mail address: [email protected] (C. Xiong).
1 This work was supported by the National Natural Science Foundation of China under Grant 61471400 and the Fundamental Research Funds for the Central Universities of South-Central University for Nationalities under Grant CZY19016.


(CNN) model, named SRCNN [16], to solve SISR and establish a connection between deep learning and traditional sparse coding [13]; it has much lower computational complexity than traditional methods. FSRCNN [17] then adopted smaller filters with more mapping layers and introduced a deconvolution layer at the end of the network for upsampling, which is faster and delivers superior restoration performance. ESPCNN [18] proposed using a sub-pixel convolutional layer to upsample LR images, which removes the checkerboard artifacts caused by the deconvolution layer. However, these early DL-based methods settled for shallow networks because of training difficulties, and thus achieved limited performance. To fully harness the advantages of deeper networks, VDSR [19] proposed a 20-layer CNN model with residual learning [20], and EDSR [21] proposed an enhanced deep residual network through module optimization; their reconstruction performance improved dramatically. Although increasing the depth of a deep network generally yields significant performance gains, it also increases system complexity with more parameters, which is not conducive to practical applications. To address this problem, DRCN [22] and DRRN [23] proposed recursive convolution networks that feed inputs forward into all unfolded layers, effectively reducing model complexity by sharing parameters. Moreover, WDSR [24] proposed an


EDSR [21]-based improvement that introduces a wide activation residual block with the same number of parameters. Among these methods, various residual connection techniques, e.g., global [19], local [22], and joint [23], have been proposed to address the gradient vanishing and/or explosion problems that occur during training, and they make the training procedure much easier. In addition, to make full use of hierarchical features, especially low-level features, MemNet [25], SRDenseNet [26] and RDN [27] proposed dense skip connection methods [28], which fuse low-level and high-level features to enhance reconstructed image quality.

In this paper, we propose a new recursive residual network with multi-level feature fusion, named MFFRnet, to reconstruct HR images with fewer parameters and less computation time. Multi-level recursive residual features are extracted and fused to enhance the representation ability of deep features for the subsequent image reconstruction. The proposed MFFRnet fully exploits all the hierarchical features of the original LR image by adopting a recursive feature extraction block (RFEB). After extracting multi-level recursive residual features, we further conduct global feature fusion with a multi-level feature fusion block (MFFB) to adaptively preserve the hierarchical features in a global way. It is worth noting that all RFEBs have direct access to the initial LR convolution feature output, leading to implicit deep supervision. In summary, our main contributions are three-fold:

• We design a new multi-level feature fusion recursive network (MFFRnet) for SISR without any image scaling preprocessing. It takes the low-resolution (LR) image as input and outputs the high-resolution (HR) image, shortening the inference time of the model.

• We propose a recursive feature extraction block (RFEB), which uses a wide activated residual block (WARB) to learn the spatial correlation of features, and combines LR residual features with LR coarse features to enhance deep representations while reducing redundant model parameters.

• We propose a multi-level feature fusion block (MFFB) to further boost performance by fusing features of different levels through dense skip connections.

2. Related work

2.1. Recursive learning for SISR

In recent years, models based on deep convolutional neural networks have achieved great success on SISR by effectively learning the complex non-linear mapping between LR and HR images. As networks deepen, however, the computing cost and the number of parameters increase dramatically. Recursive learning, which applies the same modules multiple times in a recursive manner, was therefore introduced into the super-resolution field to reduce model parameters [22,23,29,30]. DRCN [22] introduces recursive layers as the inference net, so that model parameters do not increase even when more recursions are used (up to 16). To counter the gradient problems introduced by recursive learning, DRCN [22] applies multi-supervision to the recursion units: each output of the recursive units is fed into a reconstruction module to generate an intermediate HR image, and all these images are combined through a weighted average, with the weights learned during training, to construct the final prediction. In addition, DRRN [23] uses a residual architecture as the recursive unit (up to 25 recursions) and obtains better performance with fewer parameters. DRRN [23] uses very deep networks to compensate for the performance loss and handles SISR in the high-resolution space; however, this requires heavy computing resources and increases inference time.
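To make the parameter-sharing idea concrete, the following minimal PyTorch sketch (our illustration, not code from the cited papers; layer sizes are assumptions) applies a single shared convolution T times, so the unfolded depth grows while the parameter count stays fixed:

```python
import torch.nn as nn

class RecursiveBlock(nn.Module):
    """DRCN-style recursion sketch: one shared conv applied T times."""
    def __init__(self, channels=64, recursions=16):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)  # shared weights
        self.relu = nn.ReLU(inplace=True)
        self.recursions = recursions

    def forward(self, x):
        out = x
        for _ in range(self.recursions):  # T unfolded layers, one parameter set
            out = self.relu(self.conv(out))
        return out
```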

2.2. Dense skip connections for SISR

Skip connections were first proposed in DenseNet [28] to alleviate gradient problems, enhance signal propagation, and encourage feature reuse, and dense skip connections have since become increasingly popular in vision tasks [25–27,31]. To fuse multiple low-level and high-level features and provide richer information for reconstructing high-quality details, SRDenseNet [26] introduced dense skip connections into SISR. It designs 8 dense blocks, and for every dense block in [26], the feature maps of all preceding blocks, concatenated with those of the current block, serve as inputs to all subsequent blocks; dense skip connections then combine all features at the output of each dense block. In MemNet [25], the current memory block learns short-term memory from features, while long-term memory comes from previous memory blocks; a gate unit then adaptively selects between the long-term and short-term memory. However, the memory block in MemNet [25] does not receive information from preceding layers or memory blocks. Hence, RDN [27] proposes the residual dense block (RDB) to extract abundant local features via dense skip connections. The RDB fuses the features from the preceding layers into the current one, and its output has direct access to each layer of the next RDB, leading to a contiguous memory (CM) mechanism.

3. Multi-level feature fusion recursive network

In this section, we describe our proposed MFFRnet. We first introduce the overall framework and then present the details of each block.

3.1. Overall network structure

The architecture of our MFFRnet consists of four parts: the coarse feature extraction block (CFEB), the recursive feature extraction blocks (RFEBs), the multi-level feature fusion block (MFFB), and the reconstruction block (RecB), as shown in Fig. 1. We use x and y to represent the input (LR image) and the output (HR image) of the network, respectively. Two convolutional layers are used in the CFEB to extract initial features of the input LR image: the first convolutional layer extracts coarse features (CF), and the second extracts coarse residual features (CRF) of the LR image. The CF and CRF extraction in the CFEB are formulated as follows:

F_{-1} = f_{CF1}(x)    (1)

F_0 = f_{CF2}(F_{-1})    (2)

where f_{CF1}(·) denotes the first convolutional operation; F_{-1} is used as an input to all following RFEBs, to the MFFB, and to the global residual learning unit. f_{CF2}(·) denotes the convolutional operation of the second, coarse residual feature extraction layer, and F_0 is the input of the first RFEB and of the MFFB.
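A minimal PyTorch sketch of the CFEB in Eqs. (1)–(2): two 3 × 3 convolutions producing F_{-1} and F_0. The 64-channel width follows Section 4.2 and the single input channel follows the Y-channel patches of Section 4.1; other details are assumptions.

```python
import torch.nn as nn

class CFEB(nn.Module):
    """Coarse feature extraction block: f_CF1 yields F_{-1}, f_CF2 yields F_0."""
    def __init__(self, in_channels=1, feats=64):
        super().__init__()
        self.cf1 = nn.Conv2d(in_channels, feats, 3, padding=1)  # coarse features (CF)
        self.cf2 = nn.Conv2d(feats, feats, 3, padding=1)        # coarse residual features (CRF)

    def forward(self, x):
        f_m1 = self.cf1(x)    # F_{-1}, Eq. (1)
        f_0 = self.cf2(f_m1)  # F_0, Eq. (2)
        return f_m1, f_0
```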

In MFFRnet, supposing N RFEBs are cascaded to recursively extract residual features, the outputs of all RFEBs can be represented as

F_1 = f_{RFEB_1}(F_{-1}, F_0)
F_2 = f_{RFEB_2}(F_{-1}, F_1)
...
F_N = f_{RFEB_N}(F_{-1}, F_{N-1})    (3)


Fig. 1. The overall architecture of the proposed multi-level feature fusion recursive network (MFFRnet).

Fig. 2. Internal architecture of recursive feature extraction block (RFEB).

where f_{RFEB_n}(·) denotes the function of the nth RFEB. In particular, F_{-1} and F_{n-1} are the inputs of the corresponding RFEB, as illustrated in Fig. 1. We introduce the MFFB to reuse hierarchical features and produce the final residual features; the output of the MFFB can be formulated as

F_{N+1} = f_{MFFB}(F_{-1}, F_0, F_1, ..., F_N)    (4)

where f_{MFFB}(·) denotes the MFFB function. The coarse features F_{-1} are then added to F_{N+1} through a long-term skip connection (LTSC) to obtain the deep features that are input to the final image reconstruction block RecB. Experimental results detailed in the following section show that the LTSC is helpful for performance and training stability. The RecB contains a 3 × 3 convolutional layer and a sub-pixel layer [18], which finally produce the HR image. The output of the MFFRnet can be formulated as

y = f_{MFFRnet}(x) = f_{RecB}(F_{-1} + F_{N+1})    (5)

where f_{RecB}(·) denotes the RecB function and f_{MFFRnet}(·) denotes the function of our MFFRnet.

It should be noted that our framework is inspired by RDN [27], but there are at least two differences between the two. First, to build a cleaner model, we use RFEBs with recursive learning instead of the RDBs of RDN [27], which reduces parameters. Second, we use the coarse features as an input to every following recursive feature extraction block, where they serve as an adaptive supplement to the recursive features; this is not considered in RDN [27]. These choices yield a concise and effective model.

3.2. Recursive feature extraction block

This section presents the details of the proposed recursive feature extraction block (RFEB). As shown in Fig. 2, the RFEB contains a gate unit, a 3 × 3 convolutional layer, and a local residual block (WARB [24] in our implementation). Multiple RFEBs are cascaded to extract multi-level recursive residual features with different receptive fields using few parameters. Unlike the recursive layer in DRCN [22], which accepts only the features of the previous recursive layer as input and thus ignores shallower features, the proposed RFEB uses a gate unit to learn the channel correlation of the previous recursive output F_{n-1} and the coarse features F_{-1}, and feeds the result into the 3 × 3 convolutional layer.

Fig. 3. (a) Internal architecture of the residual block of EDSR (RB); (b) internal architecture of the wide activated residual block of WDSR (WARB).

Here, the coarse features serve as an adaptive supplement to the recursive features, as mentioned above. WDSR [24] conjectures that non-linear ReLUs impede the information flow from shallow layers to deeper ones, and therefore proposes the wide activated residual block (WARB) based on the residual block of EDSR [21], which performs better with the same number of parameters. We use the WARB as our local residual block; its internal architecture is shown in Fig. 3(b). The WARB factorizes a large convolution kernel into two low-rank convolution kernels. Compared with the traditional local residual block (RB) shown in Fig. 3(a), the first kernel of the WARB expands the features at a ratio of 6 before the ReLU activation, so more information can be delivered to the next layer; the second kernel then reduces the number of channels, and one 3 × 3 convolutional layer performs spatial-wise feature extraction. However, the WARB of WDSR may reduce receptive fields, which is not conducive to extracting local structural feature information. Therefore, in the RFEB we cascade a 3 × 3 convolutional layer with the WARB after the gate unit, which guarantees a certain receptive field size and prevents the loss of local structural feature information. Hence, the output of the nth RFEB can be obtained by

F_n = f_{WARB_R}(W_{RC} ∗ f_{RG}(F_{-1}, F_{n-1})), 1 ≤ n ≤ N    (6)

where f_{RG}(·) and f_{WARB_R}(·) denote the gate unit function and the WARB function, respectively; W_{RC} denotes the weights of the 3 × 3 convolutional layer in the RFEB, and ∗ denotes the convolutional operation.
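A minimal PyTorch sketch of Eq. (6), assuming the gate unit is a 1 × 1 convolution over the concatenated inputs shown in Fig. 2 and the WARB follows the wide-activation structure described above (expansion ratio 6); exact channel counts inside the WARB are assumptions.

```python
import torch
import torch.nn as nn

class WARB(nn.Module):
    """Wide activated residual block sketch: 1x1 expansion before ReLU,
    1x1 low-rank reduction, then a 3x3 spatial conv, with identity skip."""
    def __init__(self, feats=64, ratio=6):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(feats, feats * ratio, 1),     # expand at ratio 6 before activation
            nn.ReLU(inplace=True),
            nn.Conv2d(feats * ratio, feats, 1),     # low-rank reduction
            nn.Conv2d(feats, feats, 3, padding=1),  # spatial-wise feature extraction
        )

    def forward(self, x):
        return x + self.body(x)                     # local residual learning

class RFEB(nn.Module):
    """Recursive feature extraction block, Eq. (6): gate unit f_RG,
    3x3 conv W_RC, then the WARB f_WARB_R."""
    def __init__(self, feats=64):
        super().__init__()
        self.gate = nn.Conv2d(2 * feats, feats, 1)         # f_RG over [F_{-1}, F_{n-1}]
        self.conv = nn.Conv2d(feats, feats, 3, padding=1)  # W_RC
        self.warb = WARB(feats)

    def forward(self, f_m1, f_prev):
        gated = self.gate(torch.cat([f_m1, f_prev], dim=1))
        return self.warb(self.conv(gated))                 # F_n
```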


Fig. 4. Internal architecture of multi-level feature fusion block (MFFB).

3.3. Multi-level feature fusion block

This section presents the details of the proposed multi-level feature fusion block (MFFB), whose internal architecture is shown in Fig. 4. The architecture of the MFFB is the same as that of the RFEB, but it reuses the coarse features of the LR image and the multi-level recursive residual features from different receptive fields as inputs, and outputs the final deep residual features. The output of the MFFB can be formulated as

F_{N+1} = f_{WARB_M}(W_{MC} ∗ f_{MG}(F_{-1}, F_0, F_1, ..., F_N))    (7)

where f_{MG}(·) and f_{WARB_M}(·) denote the gate unit function and the WARB function, respectively; W_{MC} denotes the weights of the 3 × 3 convolutional layer in the MFFB, and ∗ denotes the convolutional operation.
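Under the same assumptions, a sketch of Eq. (7), reusing the WARB class from the RFEB sketch in Section 3.2; the gate unit is again taken to be a 1 × 1 convolution over the N + 2 concatenated feature maps:

```python
import torch
import torch.nn as nn

class MFFB(nn.Module):
    """Multi-level feature fusion block, Eq. (7): gate unit f_MG over
    [F_{-1}, F_0, ..., F_N], 3x3 conv W_MC, then the WARB f_WARB_M."""
    def __init__(self, feats=64, n_rfeb=8):
        super().__init__()
        self.gate = nn.Conv2d((n_rfeb + 2) * feats, feats, 1)  # f_MG
        self.conv = nn.Conv2d(feats, feats, 3, padding=1)      # W_MC
        self.warb = WARB(feats)  # WARB as sketched in Section 3.2

    def forward(self, features):
        # features: list [F_{-1}, F_0, F_1, ..., F_N], each B x feats x H x W
        gated = self.gate(torch.cat(features, dim=1))
        return self.warb(self.conv(gated))                     # F_{N+1}
```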

Our proposed MFFRnet also differs greatly from DRCN [22]. The latter feeds each output of its recursive units into a reconstruction module to generate an intermediate HR image, and constructs the final prediction by weighted averaging of all these intermediate HR images, where the additional weights are learned during training. In contrast, our MFFRnet reuses hierarchical features without additional weights. First, dense skip connections combine the multi-level residual features from all RFEBs with the features of the CFEB. Then, a gate unit in the MFFB adaptively selects among these hierarchical features. Finally, a 3 × 3 convolutional layer and a WARB extract the contextual information of the features to improve performance.

4. Experiments

4.1. Datasets and metrics

Our training set comprises 291 images, as in DRRN [23]: 91 samples are from Yang et al. [13], and the remaining 200 images are from the Berkeley Segmentation Dataset [32]. The standard benchmark datasets Set5 [33], Set14 [34], BSD100 [13], and Urban100 [35] are used for testing. For comparison, SISR results at three scale factors are evaluated with the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) [36] on the luminance (Y) channel of the YCbCr color space. Each original image is augmented with 7 additional versions obtained by rotating it by 90°, 180°, and 270° and by flipping horizontally. Training images are split into 48 × 48 × 1 HR patches and the corresponding LR patches at the different scale factors (×2, ×3, and ×4).

4.2. Training details

In our MFFRnet, the kernel size of the convolutional layers is 3 × 3, except for the gate units and the linear low-rank convolutional layer in the WARB, which use 1 × 1 kernels. We set the number of feature channels to 64 and use zero padding to keep feature sizes fixed. Given training pairs {img_i^{lr}, img_i^{hr}}_{i=1}^{M}, where img_i^{lr} and img_i^{hr} denote the LR input images and the HR output images, respectively, and M denotes the number of image patches, we choose the mean square error (MSE) as the loss function for network training:

Loss = (1/M) Σ_{i=1}^{M} ‖img_i^{hr} − f_{MFFRnet}(img_i^{lr})‖^2    (8)
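A direct PyTorch rendering of Eq. (8) (our sketch; `model` stands for the MFFRnet mapping of Eq. (5)). Note that the built-in nn.MSELoss averages over all pixels rather than over images, which differs from Eq. (8) only by a constant factor.

```python
def mse_loss(model, lr_batch, hr_batch):
    """Eq. (8): squared error between reconstructed and ground-truth HR
    patches, averaged over the M patches in the mini-batch."""
    sr_batch = model(lr_batch)
    per_image = ((hr_batch - sr_batch) ** 2).flatten(1).sum(dim=1)  # ||.||^2 per patch
    return per_image.mean()                                         # average over M
```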

Fig. 5. The number of RFEBs N versus performance for scale factor ×3 on the Set5 dataset.

We implement our MFFRnet in the PyTorch framework and minimize the loss function with the Adam optimizer [37], setting β1 = 0.9, β2 = 0.999, and ε = 10^{-8}. The mini-batch size is 32. For scale factor ×3, the learning rate is initialized to 0.0001 for all layers, then updated to 0.00005 for epochs 10 to 20, 0.00001 for epochs 20 to 30, 0.000005 for epochs 30 to 40, and 0.000001 for epochs 40 to 50. Training our model takes about one day on two GTX 1080 Ti GPUs.

4.3. Study of N

Fig. 5 shows the performance of the MFFRnet for different values of N for scale factor ×3 on the Set5 dataset [33]. The results indicate that the average PSNR of the MFFRnet gradually increases with N at first, peaks at N = 8, and then begins to decline. We therefore set N = 8 in our implementation of the MFFRnet.

4.4. Benchmark results

This section presents qualitative and quantitative comparisons. We compare our proposed MFFRnet with similar SISR methods under the same settings: Bicubic [5], SRCNN [16], VDSR [19], DRCN [22], and DRRN [23]. Table 1 shows the average PSNR and SSIM [36] values on the standard benchmark datasets at scale factors ×2, ×3, and ×4. In detail, for Set5 [33] at scale factors ×2 and ×3, BSD100 [13] at ×3 and ×4, and Urban100 [35] at all scale factors, the proposed method achieves the second-best results in both PSNR and SSIM [36]; for Set5 [33] at ×4, it achieves the third-best results in both metrics. For Set14 [34] at all scale factors, our method achieves the second-best SSIM [36]. For BSD100 [13] at ×2, our method achieves the second-best PSNR and the best SSIM [36].

To evaluate visual quality, Figs. 6–8 compare images reconstructed by the proposed method with those of the related methods. From Fig. 6, we can see that the image details reconstructed by the MFFRnet are much sharper and more precise than those of SRCNN [16], VDSR [19], and DRCN [22]. From Figs. 7 and 8, we can see that the images reconstructed by the MFFRnet have contour structures closer to the originals. Remarkably, the reconstruction quality of the MFFRnet is competitive with that of DRRN [23].
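For reference, the PSNR values in Tables 1 and 2 are computed on the 8-bit luminance channel, as stated in Section 4.1. A minimal sketch of the metric follows; border-cropping conventions vary across papers, so the `shave` parameter is an assumption.

```python
import numpy as np

def psnr_y(ref_y, test_y, shave=0):
    """PSNR on the Y channel, 8-bit range; optionally shave borders."""
    ref = ref_y.astype(np.float64)
    test = test_y.astype(np.float64)
    if shave > 0:
        ref = ref[shave:-shave, shave:-shave]
        test = test[shave:-shave, shave:-shave]
    mse = np.mean((ref - test) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```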


Table 1. Average PSNR/SSIM for scale factors ×2, ×3, and ×4. The maximal values are bold, and the second ones are underlined.

Dataset    Scale  Bicubic        SRCNN          VDSR           DRCN           DRRN           MFFRnet
Set5       ×2     33.66/0.9299   36.66/0.9542   37.53/0.9587   37.63/0.9588   37.74/0.9591   37.68/0.9590
Set5       ×3     30.39/0.8682   32.75/0.9090   33.66/0.9213   33.82/0.9226   34.03/0.9244   33.91/0.9226
Set5       ×4     28.42/0.8104   30.48/0.8628   31.35/0.8838   31.53/0.8854   31.68/0.8888   31.39/0.8843
Set14      ×2     30.24/0.8688   32.42/0.9063   33.03/0.9124   33.04/0.9118   33.23/0.9136   32.99/0.9134
Set14      ×3     27.55/0.7742   29.28/0.8209   29.77/0.8314   29.76/0.8311   29.96/0.8349   29.60/0.8327
Set14      ×4     26.00/0.7027   27.49/0.7503   28.01/0.7674   28.02/0.7670   28.21/0.7721   27.85/0.7683
BSD100     ×2     29.56/0.8431   31.36/0.8879   31.90/0.8960   31.85/0.8942   32.05/0.8973   32.03/0.8975
BSD100     ×3     27.21/0.7385   28.41/0.7863   28.82/0.7976   28.80/0.7963   28.95/0.8004   28.87/0.7984
BSD100     ×4     25.96/0.6675   25.96/0.6675   27.29/0.7251   27.23/0.7233   27.38/0.7284   27.33/0.7260
Urban100   ×2     26.88/0.8403   29.50/0.8946   30.76/0.9140   30.75/0.9133   31.02/0.9164   30.89/0.9156
Urban100   ×3     24.46/0.7349   26.24/0.7989   27.14/0.8279   27.15/0.8276   27.38/0.8331   27.16/0.8288
Urban100   ×4     23.14/0.6577   24.52/0.7221   25.18/0.7524   25.14/0.7510   25.35/0.7576   25.20/0.7541

Fig. 6. Image 37073 from the BSD100 dataset with scale factor ×2: (a) Ground truth (PSNR/SSIM), (b) SRCNN (33.39/0.8937), (c) VDSR (35.44/0.9158), (d) DRCN (35.43/0.9154), (e) DRRN (35.75/0.9179), (f) MFFRnet (35.73/0.9169).

Fig. 7. The head image from the Set14 dataset with scale factor ×3: (a) Ground truth (PSNR/SSIM), (b) SRCNN (33.73/0.8274), (c) VDSR (33.98/0.8339), (d) DRCN (34.03/0.8333), (e) DRRN (34.04/0.8344), (f) MFFRnet (34.06/0.8350).

We further compare our MFFRnet against the other methods in four aspects: the number of parameters, the average inference time, the depth in 3 × 3 conv layers, and the corresponding average PSNR, all on the Set5 [33] dataset with scale factor ×3. Table 2 lists the comparison results. Here, 1 × 1 conv layers are not counted toward the depth, since they have less impact on inference speed than 3 × 3 ones. Table 2 shows that our model has the third-smallest parameter count (paras) and the second-fastest inference time among all five methods, and takes second place in PSNR. We make full use of the shallow-layer features and use no image preprocessing before training, which results in a faster inference time than VDSR [19], DRCN [22], and DRRN [23].

Table 2. Speed, parameters, depth, and accuracy trade-off: the average time, the model parameters, the depth in 3 × 3 conv layers, and the corresponding average PSNR for scale factor ×3 on the Set5 dataset. The best results are bold, and the second ones are underlined.

Methods    SRCNN   VDSR    DRCN    DRRN    Ours
Time/s     0.02    0.08    0.87    1.89    0.05
Paras/K    57      665     1774    297     339
Depth      3       20      20      52      21
PSNR/dB    32.75   33.66   33.82   34.03   33.91


Fig. 8. The butterfly image from the Set5 dataset with scale factor ×3: (a) Ground truth (PSNR/SSIM), (b) SRCNN (27.49/0.8984), (c) VDSR (29.79/0.9401), (d) DRCN (29.89/0.9263), (e) DRRN (30.37/0.9355), (f) MFFRnet (30.16/0.9403).

Table 3. Average PSNR when different MFFRnet components are turned on or off, for scale factor ×3 on the Set5 dataset. The best results are bold.

Methods   MFF_NRL  NMFF_RL  MFFRnet_NLTSC  MFFRnet_NCF  MFFRnet_BRB  MFFRnet_BDB  MFFRnet
RL        ×        ✓        ✓              ✓            ✓            ✓            ✓
MFFB      ✓        ×        ✓              ✓            ✓            ✓            ✓
LTSC      ✓        ✓        ×              ✓            ✓            ✓            ✓
CFR       ✓        ✓        ✓              ×            ✓            ✓            ✓
WARB      ✓        ✓        ✓              ✓            ×            ×            ✓
BRB       ×        ×        ×              ×            ✓            ×            ×
RDB       ×        ×        ×              ×            ×            ✓            ×
PSNR/dB   33.78    33.84    33.81          33.64        33.71        33.61        33.91
Paras/K   1261     174      339            314          642          336          339
Time/s    0.0605   0.0508   0.0517         0.0575       0.0604       0.0494       0.0524

The best PSNR is achieved by DRRN [23] at 34.03 dB, but we obtain a very close 33.91 dB. Our model has about 42 K more parameters than DRRN [23], but achieves a much shorter inference time of 0.05 s, nearly 38 times faster than DRRN [23]. Although the proposed method does not surpass the PSNR of DRRN [23], it is extremely competitive when several evaluation indexes are considered together, and is thus well suited to practical applications.

4.5. Ablation study

In this section, we conduct an ablation study to demonstrate the effectiveness of the different components of MFFRnet by turning them on or off. This is important for understanding how these strategies, namely recursive learning, multi-level feature fusion, global residual learning, coarse feature reusing, and local residual learning, affect network performance.

Recursive learning. The recursive learning (RL) strategy reduces the number of parameters and the storage demand, and yields a cleaner model as well. To validate it, we test multi-level feature fusion without recursive learning, called MFF_NRL; the results are listed in the 2nd column of Table 3. The MFFRnet with recursive learning gains 0.13 dB PSNR over MFF_NRL, while needing only about a quarter of the parameters of MFF_NRL. In fact, recursive learning is also

effective in preventing over-fitting to the same structure and training set in our implementation.

Multi-level feature fusion. To demonstrate the effectiveness of the multi-level feature fusion block (MFFB), we compare the MFFRnet with a variant that removes multi-level feature fusion, called NMFF_RL. As shown in the 3rd column of Table 3, the former clearly outperforms the latter, with 33.91 dB vs. 33.84 dB PSNR. This indicates that the MFFB is an efficient architecture for further improving the quality of reconstructed images, owing to its full reuse of hierarchical features. It is worth noting that removing the MFFB drastically reduces the number of parameters, yet NMFF_RL still achieves a PSNR 0.02 dB higher than that of DRCN [22].

Global residual learning. Our MFFRnet uses global residual learning (GRL) to further increase training stability and reconstruction performance, where the long-term skip connection (LTSC) adds the original LR features to the deep global residual features. To demonstrate the effectiveness of GRL, we also test the MFFRnet without the LTSC, called MFFRnet_NLTSC. The results, shown in the 4th column of Table 3, indicate that this global residual architecture is helpful to performance, yielding a 0.1 dB PSNR gain.

Coarse feature reusing. As mentioned before, coarse feature reusing (CFR) means that all the following recursive feature extraction blocks use the features of the first convolutional layer, which serve as an adaptive supplement to the recursive features. To demonstrate the effectiveness of CFR, we test


the MFFRnet without CFR, called MFFRnet_NCF, in which the input of the nth RFEB is only the previous recursive output F_{n-1}. As shown in the 5th column of Table 3, the model's performance degrades significantly, by 0.3 dB, when coarse feature reusing is removed.

Local residual learning. We compare the performance of the MFFRnet when different local residual learning structures are used: the MFFRnet with the wide activated residual block (WARB), the MFFRnet_BRB with the basic residual block (BRB) [21], and the MFFRnet_BDB with the residual dense block (RDB) [27]. The results are listed in the 6th, 7th, and 8th columns of Table 3. From Table 3, the MFFRnet gains 0.2 dB and 0.3 dB over the MFFRnet_BRB and the MFFRnet_BDB, respectively. In addition, the MFFRnet has a faster inference time than the MFFRnet_BRB and the second-smallest parameter count of the three. Consequently, the WARB is more competitive than the other two local residual structures.

5. Conclusions

In this paper, we propose a recursive convolutional neural network with multi-level feature fusion (MFFRnet) for single image super-resolution (SISR). The proposed recursive feature extraction block (RFEB) learns residual feature representations over different receptive fields with few parameters. At the same time, the multi-level feature fusion block (MFFB) reuses the extracted hierarchical features to further improve reconstruction quality. Experimental results demonstrate that the proposed architecture is highly competitive in balancing speed, model parameters, and reconstruction quality, making it well suited to practical applications.

Declaration of Competing Interest

None.

References

[1] S.C. Park, M.K. Park, M.G. Kang, Super-resolution image reconstruction: a technical overview, IEEE Signal Process. Mag. 20 (3) (2003) 21–36.
[2] J. Ma, J. Zhao, J. Jiang, H. Zhou, X. Guo, Locality preserving matching, Int. J. Comput. Vis. 127 (5) (2019) 512–531.
[3] J. Ma, W. Yu, P. Liang, C. Li, J. Jiang, FusionGAN: a generative adversarial network for infrared and visible image fusion, Inf. Fusion 48 (2019) 11–26.
[4] R. Fattal, Image upsampling via imposed edge statistics, ACM Trans. Graph. (TOG) 26 (3) (2007) 95.
[5] R. Keys, Cubic convolution interpolation for digital image processing, IEEE Trans. Acoust. 29 (6) (1981) 1153–1160.
[6] F. Aràndiga, A nonlinear algorithm for monotone piecewise bicubic interpolation, Appl. Math. Comput. 272 (2016) 100–113.
[7] X. Yang, Y. Zhang, D. Zhou, R. Yang, An improved iterative back projection algorithm based on ringing artifacts suppression, Neurocomputing 162 (2015) 171–179.
[8] H. Stark, P. Oskoui, High-resolution image recovery from image-plane arrays, using convex projections, J. Opt. Soc. Am. A Opt. Image Sci. 6 (11) (1989) 1715.
[9] H. Chang, D.-Y. Yeung, Y. Xiong, Super-resolution through neighbor embedding, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, IEEE, 2004, pp. I–I.
[10] R. Timofte, V. De Smet, L. Van Gool, Anchored neighborhood regression for fast example-based super-resolution, in: Proceedings of the IEEE International Conference on Computer Vision, IEEE, 2013, pp. 1920–1927.
[11] J. Jiang, Y. Yu, Z. Wang, S. Tang, R. Hu, J. Ma, Ensemble super-resolution with a reference dataset, IEEE Trans. Cybern. (2019).
[12] J. Jiang, X. Ma, C. Chen, T. Lu, Z. Wang, J. Ma, Single image super-resolution via locally regularized anchored neighborhood regression and nonlocal means, IEEE Trans. Multimed. 19 (1) (2016) 15–26.
[13] J. Yang, J. Wright, T.S. Huang, Y. Ma, Image super-resolution via sparse representation, IEEE Trans. Image Process. 19 (11) (2010) 2861–2873.
[14] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in: Proceedings of the Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.


[15] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: International Conference on Learning Representations, 2015, pp. 1–14.
[16] C. Dong, C.C. Loy, K. He, X. Tang, Image super-resolution using deep convolutional networks, IEEE Trans. Pattern Anal. Mach. Intell. 38 (2) (2016) 295–307.
[17] C. Dong, C.C. Loy, X. Tang, Accelerating the super-resolution convolutional neural network, in: Proceedings of the European Conference on Computer Vision, Springer, 2016, pp. 391–407.
[18] W. Shi, J. Caballero, F. Huszár, J. Totz, A.P. Aitken, R. Bishop, D. Rueckert, Z. Wang, Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2016, pp. 1874–1883.
[19] J. Kim, J. Kwon Lee, K. Mu Lee, Accurate image super-resolution using very deep convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2016, pp. 1646–1654.
[20] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2016, pp. 770–778.
[21] B. Lim, S. Son, H. Kim, S. Nah, K. Mu Lee, Enhanced deep residual networks for single image super-resolution, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, IEEE, 2017, pp. 136–144.
[22] J. Kim, J. Kwon Lee, K. Mu Lee, Deeply-recursive convolutional network for image super-resolution, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2016, pp. 1637–1645.
[23] Y. Tai, J. Yang, X. Liu, Image super-resolution via deep recursive residual network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2017, pp. 3147–3155.
[24] J. Yu, Y. Fan, J. Yang, N. Xu, Z. Wang, X. Wang, T. Huang, Wide activation for efficient and accurate image super-resolution, CoRR abs/1808.08718 (2018).
[25] Y. Tai, J. Yang, X. Liu, C. Xu, MemNet: a persistent memory network for image restoration, in: Proceedings of the IEEE International Conference on Computer Vision, IEEE, 2017, pp. 4539–4547.
[26] T. Tong, G. Li, X. Liu, Q. Gao, Image super-resolution using dense skip connections, in: Proceedings of the IEEE International Conference on Computer Vision, IEEE, 2017, pp. 4799–4807.
[27] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, Y. Fu, Residual dense network for image super-resolution, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2018, pp. 2472–2481.
[28] G. Huang, Z. Liu, L. Van Der Maaten, K.Q. Weinberger, Densely connected convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2017, pp. 4700–4708.
[29] Z. Feng, J. Lai, X. Xie, J. Zhu, Image super-resolution via a densely connected recursive network, Neurocomputing 316 (2018) 270–276.
[30] N. Ahn, B. Kang, K.-A. Sohn, Fast, accurate, and lightweight super-resolution with cascading residual network, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 252–268.
[31] Z. Wang, P. Yi, K. Jiang, J. Jiang, Z. Han, T. Lu, J. Ma, Multi-memory convolutional neural network for video super-resolution, IEEE Trans. Image Process. 28 (5) (2018) 2530–2544.
[32] D. Martin, C. Fowlkes, D. Tal, J. Malik, et al., A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics, in: Proceedings of the International Conference on Computer Vision (ICCV), Vancouver, 2001, pp. 416–423.
[33] M. Bevilacqua, A. Roumy, C. Guillemot, M.L. Alberi-Morel, Low-complexity single-image super-resolution based on nonnegative neighbor embedding, in: Proceedings of the British Machine Vision Conference, 2012.
[34] R. Zeyde, M. Elad, M. Protter, On single image scale-up using sparse-representations, in: Proceedings of the International Conference on Curves and Surfaces, Springer, 2010, pp. 711–730.
[35] J.-B. Huang, A. Singh, N. Ahuja, Single image super-resolution from transformed self-exemplars, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2015, pp. 5197–5206.
[36] Z. Wang, A.C. Bovik, H.R. Sheikh, E.P. Simoncelli, et al., Image quality assessment: from error visibility to structural similarity, IEEE Trans. Image Process. 13 (4) (2004) 600–612.
[37] D.P. Kingma, J. Ba, Adam: a method for stochastic optimization, in: International Conference on Learning Representations, 2015, pp. 1–15.

Xin Jin received the B.S. degree in electronic and information engineering from Hubei University of Technology, Wuhan, China, in 2017. She is currently pursuing the Master's degree in electronic and information engineering at South-Central University for Nationalities, Wuhan, China. Her research interests include super-resolution, image restoration, and deep learning.

Qiming Xiong is currently pursuing the B.S. degree in the School of Mathematics and Computer Science, Wuhan Textile University, Wuhan 430200, China. He is also an intern in the College of Electronic and Information Engineering, South-Central University for Nationalities. His research interests include image super-resolution and deep learning.

Zhibang Li received the B.S. degree from the College of Electronic and Information Engineering, South-Central University for Nationalities, Wuhan, China, in 2016. He is currently pursuing the Master's degree in signal and information processing at South-Central University for Nationalities, Wuhan, China. His research interests include deep learning, compressive sensing, and image restoration.

Chengyi Xiong received the B.S. degree in radio technology from the University of Electronic Science and Technology of China in 1992, and the Ph.D. degree in control science and engineering from the School of Automation, Huazhong University of Science and Technology, in 2006. He is now a professor at South-Central University for Nationalities. His research interests include image restoration, image compression, compressive sensing, machine learning, and deep learning.

Zhirong Gao received the M.S. degree from the School of Computer Science, Huazhong University of Science and Technology, in 2002. She is now an associate professor in the College of Computer Science, South-Central University for Nationalities. Her research interests include image super-resolution, compressive sensing, and machine learning.
