A joint residual network with paired ReLUs activation for image super-resolution


Communicated by Jun Yu


PII: S0925-2312(17)31384-X
DOI: 10.1016/j.neucom.2017.07.061
Reference: NEUCOM 18774

To appear in: Neurocomputing

Received date: 9 March 2017
Revised date: 29 July 2017
Accepted date: 31 July 2017

Please cite this article as: Zhimin Tang, Linkai Luo, Hong Peng, Shaohui Li, A Joint Residual Network with Paired ReLUs activation for Image Super-Resolution, Neurocomputing (2017), doi: 10.1016/j.neucom.2017.07.061



A Joint Residual Network with Paired ReLUs activation for Image Super-Resolution

Zhimin Tang, Linkai Luo∗, Hong Peng, Shaohui Li
Department of Automation, Xiamen University, Xiamen 361005, China

∗ Corresponding author. Email address: [email protected] (Linkai Luo)

Abstract


Recently, single image super-resolution (SR) models based on deep convolutional neural networks (CNN) have achieved significant advances in accuracy and speed. However, these models are not efficient enough for the image SR task. Firstly, we find that generic deep CNNs learn the low frequency features in all layers, which is redundant and leads to slow training. Secondly, the rectified linear unit (ReLU) only activates the positive response of a neuron, while the negative response is also worth activating. In this paper, we propose a novel joint residual network (JRN) with three subnetworks, in which two shallow subnetworks learn the low frequency information and one deep subnetwork learns the high frequency information. In order to activate the negative part of the neurons while preserving the sparsity of the activation function, we propose a paired ReLUs activation scheme: one of the ReLUs performs positive activation and the other performs negative activation. These two innovations lead to much faster training, as well as a more efficient local structure. The proposed JRN achieves the same accuracy as a generic CNN with only 10.5% of the training iterations. Experiments on a wide range of images show that JRN is superior to state-of-the-art methods in both accuracy and computational efficiency.

Keywords: Image super-resolution, Deep learning, Image restoration, Convolutional neural network


1. Introduction


Single image super-resolution (SR) aims to recover a high-resolution (HR) image from a given low-resolution (LR) image [40]. It is widely used in many fields such as surveillance, medical imaging, satellite imaging and face recognition. SR is typically a highly ill-posed problem, since many solutions exist; in general, strong prior information is needed for an ill-posed problem.

Recently, learning-based methods have achieved great success in the SR task. Methods based on neighbor embedding [2, 4] use local linear embedding to generate HR patches, under the assumption that LR patches and their HR counterparts lie on low-dimensional nonlinear manifolds with similar local geometry. Sparse-coding-based methods [41, 42, 49] jointly train two dictionaries for the LR-HR patch pairs by enforcing the similarity of sparse representations, so that an LR patch and its corresponding HR patch can share the same sparse representation over their own dictionaries. Based on the learned iterative shrinkage and thresholding algorithm [10], researchers [39, 25] extended the conventional sparse coding model [42] to a sparse-coding-based network, so that the sparse prior can be encoded in the network. By combining sparse coding and neighbor embedding, Timofte et al. proposed an anchored neighborhood regression (ANR) method [34], in which the learned dictionary atoms are used as anchor points and a regressor is learned for each anchor point. Instead of learning the regressors on the dictionary, ANR's improved version A+ [35] reaches state-of-the-art quality by using the full training data.

More recently, neural networks [32, 26, 29, 21] have been widely used in many applications, such as image recognition [20], speech recognition, natural language processing [21], human pose recovery [14], image privacy protection [45], big multimedia analysis [43], image ranking [44] and image super-resolution [6]. In particular, convolutional neural networks (CNN) [20, 21] have achieved great success in machine learning and computer vision. Dong et al. [6, 7] proposed a super-resolution convolutional neural network (SRCNN) based on a fully convolutional architecture. SRCNN and other CNN-based methods [17, 18, 8, 16, 31, 38] have shown impressive performance.

Although CNN-based image super-resolution methods achieve great success, we find some limitations. Firstly, these CNNs are built on generic deep architectures, which were designed for image recognition and other high-level vision problems, not specialized for the image SR problem. In the SR task, the low frequency information needs to be propagated from the input layer to the output layer, so each middle layer must learn the low frequency features. We find that generic deep CNNs learn the low frequency features in all layers; we think this is redundant and leads to slow training. Similar to a Taylor expansion, we can decompose an image into different components, some of which are low frequency components and some high frequency components.


The low frequency and high frequency components correspond to the low-order and high-order terms of a Taylor expansion, respectively. The low-order term is a coarse approximation of the real signal; if we want a more accurate approximation, more high-order terms are needed. In a generic deep CNN, the low frequency information has to propagate along a long path from the input layer to the output layer, although a shallow network can learn this coarse approximation well, so a deep network is not necessary for it. Besides, researchers usually initialize the filters randomly with zero mean, so the initial states of all the feature maps are random values distributed around 0. Since the low frequency feature maps, which have large values, need to be learned in all layers, this low frequency redundancy leads to slow training. If we decompose the model into different subnetworks, in which some learn the low frequency information and some learn the high frequency information, we can build an extremely small subnetwork that learns a coarse approximation of the images. Furthermore, with a correct understanding of the physical meaning of the low frequency features, we can initialize the low frequency subnetwork with low-pass filters to further accelerate training and improve training quality.

Secondly, the rectified linear unit (ReLU) [30, 20, 9], the most popular activation function in deep learning, only activates the positive phase and ignores the negative phase. We argue that neurons with strong responses are worth activating, including both positive and negative responses. In addition, sparsity can regularize the solution space, and previous sparse coding based works [41, 42, 11, 49] have shown striking results on SR. We therefore want to retain the sparsity of the activation function while activating both the negative and positive responses simultaneously.

Contributions: (1) We propose a novel joint residual network (JRN) to reduce the low frequency redundancy in deep networks, which dramatically accelerates the training stage. (2) We propose a paired rectified linear units (paired ReLUs) activation function that sparsely activates both the positive and negative responses of the neurons simultaneously; the paired ReLUs lead to a more efficient local structure. (3) Our proposed JRN outperforms previous methods by a large margin with less computational complexity.

In the following, we first review CNN-based methods in Section 2. The JRN with paired ReLUs activation is introduced in Section 3. In Section 4, experimental details are described and extensive experimental results are reported. Conclusions are drawn in Section 5.

Figure 1: Output of the proposed JRN and outputs of its subnetworks with an upscaling factor of 3: (a) subnet-1, (b) subnet-2, (c) subnet-3, (d) subnet-(1+2), (e) JRN, (f) ground truth. Subnet-1 learns a coarse approximation of the image, i.e., the low frequency information; subnet-3 mainly learns the high frequency information. Synthesizing more high-order subnetworks yields a more accurate approximation of the ground truth image.

2. Convolutional Neural Network based Methods for Super-Resolution

In recent years, deep learning, and especially CNNs [20, 21], have led to great success in various fields. In a CNN layer, the input feature F_{l−1} is convolved with learnable filters W_l, where l denotes the l-th layer. A non-linear activation function is then applied to the sum of the obtained feature map W_l ∗ F_{l−1} and a bias term b_l. The most popular activation function is ReLU [30, 9], formally defined as ReLU(a) = max(a, 0). A basic CNN layer is thus denoted as F_l = ReLU(W_l ∗ F_{l−1} + b_l). The obtained result is then fed to another layer, and a multi-layered CNN architecture is created by cascading such basic layers.

Dong et al. [6, 7] proposed a CNN model for single image SR, termed SRCNN, which consists of three convolutional layers. They interpreted a relationship between SRCNN and the sparse-coding-based SR approaches, where the three convolutional layers in SRCNN respectively correspond to patch extraction, non-linear mapping and reconstruction. The forward pass of SRCNN can be formulated as

F(X; {W_i}_{i=1}^{3}, {b_i}_{i=1}^{3}) = ReLU(W_3 ∗ ReLU(W_2 ∗ ReLU(W_1 ∗ X + b_1) + b_2) + b_3)    (1)

The model is then trained by the backpropagation algorithm.

SRCNN relies on the context of small image regions. Collecting and analyzing more contextual information may give more priors for SR, and cascading small filters many times in a deep CNN architecture can exploit contextual information over large image regions. Kim et al. [17] integrate more contextual information by increasing the depth of the model and significantly boost the performance. In [18], the authors designed their model in a similar way; differently, their model shares parameters between layers to decrease the number of learnable parameters.

The above CNN based approaches [6, 7, 17, 18] aim to learn a mapping function F between the bicubicly interpolated LR image and the HR image. The input image is first interpolated to the HR image size, i.e., all operations are performed in HR space. The computational complexity of a CNN can be calculated as

O( Σ_l n_{l−1} k_l² n_l w_l h_l ),    (2)

where k_l is the filter size, n_l is the number of feature maps, and w_l and h_l are the width and height of the l-th layer, respectively.

Obviously, convolutional operations in LR space are much more efficient, since w_l and h_l in LR space are much smaller than in HR space. Shi et al. [31] and Dong et al. [8] build convolutional layers in LR space before an upscaling layer at the end of the network; they use a sub-pixel convolution layer [31] and a deconvolutional layer [47, 48, 46], respectively, to upscale the LR feature maps. The upscaling operation is performed only in the last layer of the network to avoid expensive computations on HR feature maps. Instead of interpolation, learning the upscaling filters leads to a real end-to-end CNN, which boosts the performance of CNNs in terms of accuracy and speed [31, 38, 8].
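As a rough illustration of Equation (2), the sketch below compares the cost of the same three-layer network evaluated in HR space (on a bicubic-upscaled input) versus in LR space (with upscaling deferred to a final sub-pixel layer). The layer shapes are hypothetical, chosen only for illustration, not the exact JRN configuration.

```python
def conv_complexity(layers):
    """Multiply-accumulate count of Equation (2):
    sum over layers of n_{l-1} * k_l^2 * n_l * w_l * h_l."""
    return sum(n_prev * k * k * n * w * h for (n_prev, k, n, w, h) in layers)

S, w, h = 3, 96, 96  # upscaling factor and LR size; the HR image is 288 x 288
# Each layer is described as (n_{l-1}, k_l, n_l, w_l, h_l).
hr_net = [(1, 5, 64, w * S, h * S), (64, 3, 64, w * S, h * S), (64, 3, 1, w * S, h * S)]
lr_net = [(1, 5, 64, w, h), (64, 3, 64, w, h), (64, 3, S * S, w, h)]  # last layer feeds the shuffle
print(conv_complexity(hr_net) / conv_complexity(lr_net))  # ~8x cheaper in LR space for S = 3
```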


3. Joint Residual Network with Paired ReLUs Activation

In a specialized framework, models can be built on a more efficient architecture and deliver better performance at less cost. We aim to reduce the low frequency redundancy and to sparsely activate both the positive and negative responses. In this section, we present a joint residual architecture and a paired ReLUs activation function to build efficient and effective models for image SR.

3.1. Architecture of Residual Networks

Researchers need to train a very deep network to learn high-level structure information for an accurate SR model. Through our observation of the feature maps in deep networks, we find that there are several low frequency feature maps in each layer, as shown in Fig. 4. We think that it is redundant to learn low frequency information in all layers. Secondly, filters are usually initialized with zero-mean random values, which leads to zero-mean feature maps, whereas low frequency information always takes large values. Thus low frequency redundancy gives networks bad initial values and severely slows down training.

Learning the residue of a low frequency component is more effective than learning the HR image directly; it makes the learning process more tractable. In most patch-based approaches [4, 41, 42, 34, 35, 49] and some convolutional methods [17, 11], images are decomposed into one bicubicly interpolated component (low frequency component) and its residue (high frequency component). In [3], the authors decomposed the input image patches into a low frequency content and a high frequency residual content. In [17], the authors found that a residual architecture can ease the training of networks and enhance the accuracy. We think that these methods all benefit from the reduction of low frequency redundancy.

For the purpose of reducing the low frequency redundancy in a deep CNN, we propose a novel regression CNN structure named residual architecture, which is also inspired by the matching pursuit algorithm [27]. Given input X and ground truth Y, we denote this residual architecture as

min_{g_i} (1/2) ‖ g_i(X) − r_i ‖²₂ ,   i = 1, 2, 3, …    (3)

r_i = Y, i = 1;   r_i = Y − Σ_{j=1}^{i−1} g_j(X), i > 1

where r_i is the residue and g_i(X) is the i-th subnetwork. The i-th subnetwork is shallower and smaller than the (i+1)-th subnetwork. Low frequency information is learned by the shallow and small subnetworks, so low frequency redundancy is removed from the deeper subnetworks, which no longer need to learn a set of redundant feature maps. As a result, we can optimize the networks easily and enhance the final accuracy.

3.2. Joint Residual Network

The subnetworks g_i in the primitive residual architecture described in Equation (3) need to be trained in a complicated multi-stage way; moreover, subnetwork g_i is trained individually based on all the subnetworks g_j, j < i. We always prefer an end-to-end model in machine learning. In this subsection, we aim to build an end-to-end residual network in which all the parameters of our architecture can be jointly optimized through backpropagation. We name this novel architecture the Joint Residual Network (JRN). JRN provides a powerful framework to accelerate the training of networks as well as to enhance performance. In addition, JRN does not need to be trained in a multi-stage manner, which is too complicated and incurs a huge cost in the optimization process.

In the general case, shallow and small networks converge faster than deeper and bigger ones. The distribution of the inputs of the current layer changes when the parameters of the previous layers change; as a result, deep networks converge more slowly than shallow networks. Besides, more parameters bring about more saddle points, which also slows down convergence. We build our architecture on the principle C(g_{i−1}) ≫ C(g_i), where C(g_i) is the convergence rate of g_i. Hence, the depth of the subnetworks D(g) should conform to D(g_{i−1}) < D(g_i) and the number of parameters N(g) should satisfy N(g_i) ≫ N(g_{i−1}). We synthesize the results of all the subnetworks together and jointly learn all the subnetworks by minimizing the mean square error (MSE). The optimization model can be expressed as

min_g (1/2) ‖ Σ_i g_i(X) − Y ‖²₂    (4)

Using objective functions that are more correlated with human visual perception than MSE gives more visually pleasing results [22, 16]. However, there is no ideal objective metric that quantifies the human visual difference between the ground truth and a super-resolved image. Since we will use Peak Signal to Noise Ratio (PSNR) as the main metric in the comparisons, we choose MSE as our objective function, because it is directly related to PSNR and easy to compute.
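To make Equations (3) and (4) concrete, here is a minimal NumPy sketch, with the subnetworks assumed to be arbitrary callables (e.g., small CNNs): the stage-wise residues of Eq. (3) versus the joint objective of Eq. (4) that JRN minimizes end-to-end by backpropagation.

```python
import numpy as np

def residues(subnets, x, y):
    """Stage-wise targets of Eq. (3): r_1 = Y and r_i = Y - sum_{j<i} g_j(X)."""
    r, partial = [], np.zeros_like(y)
    for g in subnets:
        r.append(y - partial)     # what the i-th subnetwork should regress
        partial = partial + g(x)  # accumulate the previous predictions
    return r

def jrn_loss(subnets, x, y):
    """Joint objective of Eq. (4): 0.5 * || sum_i g_i(X) - Y ||_2^2."""
    pred = sum(g(x) for g in subnets)  # JRN output is the sum of all subnetworks
    return 0.5 * np.sum((pred - y) ** 2)
```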

3.3. Paired ReLUs Activation Scheme

In the SR problem, a sparse prior can regularize the solution space, and previous sparse coding based works [41, 42, 11, 49] have shown striking results on SR. The ReLU [30] is extensively used as the activation function in deep learning [20, 21]; Glorot et al. [9] explained that ReLU allows a network to obtain sparse representations easily. There is an intimate connection between ReLU and multi-layered convolutional sparse coding, which was well studied in [37]: the analysis in [37] showed that the forward pass of a CNN is in fact identical to a layered thresholding algorithm.

Figure 2: Generic CNN. A generic CNN for SR which is mainly processed in LR space; the input LR image is super-resolved with an upscaling factor of S.

Figure 3: The architecture of the proposed Joint Residual Network.

However, ReLU still has some limitations. We believe that neurons with strong responses are worth activating, while ReLU only activates strong responses of the positive phase. In order to activate the negative part of the neuron as well as to preserve sparsity, we propose a paired ReLUs activation scheme, defined as

h(x) = ( max(s · x − θ, 0), max(s_p · x − θ_p, 0) )    (5)

where s and s_p are a pair of scale parameters; s is initialized to 0.5 and s_p to −0.5. θ and θ_p are a pair of trainable thresholds. A multi-threshold activation function offers more optional visual patterns; Li et al. [23] have shown that activation with multiple thresholds can obtain better performance than a single threshold. In the paired ReLUs activation scheme, one of the ReLUs activates the positive part of the neuron and the other activates the negative part. By activating both the positive and the negative phase of the neurons, twice as many feature maps are produced. Therefore, compared to the ReLU activation scheme, a more efficient local structure can be built.
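A minimal NumPy sketch of Equation (5) might look as follows. In the paper s, s_p, θ and θ_p are trainable per-layer parameters; here they are plain arguments fixed at the stated initial values, and the two branches are concatenated along the channel axis.

```python
import numpy as np

def paired_relu(x, s=0.5, s_p=-0.5, theta=0.0, theta_p=0.0):
    """Paired ReLUs of Eq. (5): h(x) = (max(s*x - theta, 0), max(s_p*x - theta_p, 0))."""
    pos = np.maximum(s * x - theta, 0.0)      # activates the positive phase
    neg = np.maximum(s_p * x - theta_p, 0.0)  # activates the negative phase
    return np.concatenate([pos, neg], axis=-1)  # doubles the number of feature maps
```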

Figure 4: Feature maps of the 7-th layer. (a) Generic CNN; (b) Subnetwork-3 in the JRN. Each middle layer of a generic deep CNN contains several low frequency feature maps, which are coarse approximations of the input LR image; there is no need to use a deep network to reconstruct the low frequency component of the HR image. As shown in (a), a middle layer of a generic CNN contains several low frequency feature maps; as shown in (b), only high frequency maps exist in the deep subnetwork of JRN.



3.4. Design of Joint Residual Network


3.4.1. Residual Block
To handle the degradation problem and ease the training of deep networks, residual learning was proposed by He et al. [13]. Inspired by [13], shortcut connections are introduced into our deep subnetwork, in which we cascade many residual blocks to construct a deep network.

Figure 5: Residual block in the deep subnetwork.

Note that although both JRN and ResNet [13] are called residual networks, they learn residuals in different senses: our JRN jointly learns the residue of the ground truth image using several subnetworks, while ResNet learns residual features within one network. Fig. 5 shows a residual block in our deep subnetwork. In each residual block, there is a 3 × 3 convolutional layer activated by our proposed paired ReLUs and an efficient 1 × 1 convolutional layer used to compress the output feature maps. Although the paired activations double the number of feature maps, the 1 × 1 convolutional layer reduces the feature maps of the current residual block at little cost.
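As a sketch of this block, assuming channels-last feature maps, the paired_relu helper above, and hypothetical learnable convolutions conv3x3 (64 → 64 maps) and conv1x1 (128 → 64 maps) passed in as callables:

```python
def residual_block(x, conv3x3, conv1x1):
    """Residual block of Fig. 5: 3x3 conv + paired ReLUs, 1x1 compression, shortcut."""
    f = paired_relu(conv3x3(x))  # 64 -> 128 feature maps after the paired activation
    f = conv1x1(f)               # 128 -> 64: cheap channel compression
    return x + f                 # shortcut connection as in [13]
```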

3.4.2. Upscaling Module
In [6, 7, 41, 42, 17, 18], the authors upscale the input image to the HR image size with bicubic interpolation, which is an empirical choice. Instead of using a hand-crafted interpolation filter, we employ a learnable sub-pixel convolutional layer [31] to upscale from LR size to HR size. Sub-pixel convolution is a convolutional layer followed by a periodic shuffling operator that rearranges a tensor of shape (n × S²) × w × h into a tensor of shape n × (w × S) × (h × S), where S is the upscale factor and (n × S²) and n are the numbers of feature maps. Therefore, the low resolution feature maps of size w × h are resized to (w × S) × (h × S). The upscaling module connects the LR feature space and the HR feature space and plays a key role in our architecture. It allows the CNN to learn all the filters from data, leading to a real end-to-end architecture, and it allows us to process in LR space. Processing in LR space has several benefits. Firstly, it is more computationally efficient according to Equation (2). Secondly, it provides larger effective receptive fields, which allows networks to learn more contextual information. Besides, it has been shown that enabling the network to learn the upscaling filters directly can further improve accuracy and speed [8, 31, 38]. For an upscaling factor of 4, we upscale the feature space twice with a factor of 2.
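The periodic shuffling step can be written in a few lines of NumPy (a sketch for a single channels-first feature tensor; real implementations such as the sub-pixel layer of [31] also handle batches and learn the preceding convolution):

```python
import numpy as np

def pixel_shuffle(t, S):
    """Rearrange a (n*S^2, w, h) tensor into (n, w*S, h*S), as in sub-pixel convolution [31]."""
    c, w, h = t.shape
    n = c // (S * S)
    t = t.reshape(n, S, S, w, h)    # split channels into an S x S sub-pixel grid
    t = t.transpose(0, 3, 1, 4, 2)  # interleave grid with spatial axes: (n, w, S, h, S)
    return t.reshape(n, w * S, h * S)
```

For the ×4 models, applying this with S = 2 twice realizes the two-step upscaling described above.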

3.4.3. Details of the Subnetworks in JRN
As illustrated in Fig. 3, the proposed JRN consists of three subnetworks.

Subnetwork-1: Based on the principles explained in Section 3.2, we design g_1(X) as an extremely simple network, which contains only one upscaling module followed by one convolutional layer with one feature map. Only the luminance channel (Y channel in YCbCr color space) is processed, so there is only one channel in the two layers of g_1(X). Since g_1(X) mainly reconstructs the low frequency content of the HR image, it learns a set of low frequency filters without bias terms in our architecture. We initialize these filters as low-pass filters with value 1/k² for all parameters, where k is the kernel size; this filter is extensively used in low-pass filtering and is also known as an averaging filter.

Subnetwork-2: Subnetwork g_2(X) is composed of one LR convolutional layer with 8 feature maps, one upsampling module with 4 feature maps and one HR convolutional layer. In this way, g_2(X) learns the medium-frequency information.

Subnetwork-3: We use g_3(X) to predict the complex image structure information, i.e., the high frequency information. g_3(X) is a deep subnetwork that contains one LR convolutional layer, 10 residual blocks, one upscaling module and one HR convolutional layer.

All the convolutional layers in g_1(X) and g_2(X) are linearly activated. We use the method described in He et al. [12] to initialize the weights in g_2(X) and g_3(X): the weights are randomly initialized from a zero-mean normal distribution with standard deviation √(2/c), where c is the number of weights in the current layer. If a value falls more than 2 standard deviations from the mean, it is dropped and regenerated. All the bias terms are initialized to 0.
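The two initialization schemes described above can be sketched as follows — a NumPy reading of the paper's description, with c passed in explicitly as "the number of weights in the current layer":

```python
import numpy as np

def averaging_filter(k):
    """Low-pass initialization for subnetwork-1: a k x k kernel with all weights 1/k^2."""
    return np.full((k, k), 1.0 / (k * k))

def truncated_he_init(shape, c, rng=np.random.default_rng(0)):
    """He et al. [12] initialization with 2-sigma truncation: draw from a zero-mean
    normal with std sqrt(2/c) and redraw any value beyond 2 standard deviations."""
    std = np.sqrt(2.0 / c)
    w = rng.normal(0.0, std, size=shape)
    out_of_range = np.abs(w) > 2 * std
    while out_of_range.any():  # resample out-of-range values until all lie within 2 sigma
        w[out_of_range] = rng.normal(0.0, std, size=out_of_range.sum())
        out_of_range = np.abs(w) > 2 * std
    return w
```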


4. Experiments

In this section, we evaluate the performance of our models on several commonly used datasets. We first briefly introduce the datasets used for training and testing in our work. Then, we conduct several experiments to investigate the properties of our model. Finally, we compare our models with several state-of-the-art methods.

Figure 6: SR results. 'comic.bmp' is upscaled by scale factor ×4: (a) ground truth, (b) bicubic, (c) JRN. JRN gives an SR image of good quality.



4.1. Datasets


Training dataset: MS-COCO [24]. We randomly sample 80,000 patches of size 288 × 288 from MS-COCO as the training dataset. These images are distinct from the validation and testing datasets. We choose MS-COCO simply because it supplies large images.

Validation datasets: Two validation datasets are used. (1) B200: B200 contains 200 images from the Berkeley Segmentation Dataset (B500) [28]. (2) General-100 [8]: for more diversified validation images, 100 patches of size 96 × 96 are randomly cropped from General-100.

Testing datasets: Four testing datasets are used in this paper: (1) Set5 [2], (2) Set14 [49], (3) Urban100 [15], and (4) B100. B100 contains 100 images from B500; B100 and B200 are different subsets of B500.


4.2. Experimental Setup


We evaluate the performance for upscaling factors 2, 3 and 4. For each upscaling factor, the LR image is obtained by downsampling the HR image with a bicubic kernel and the corresponding downsampling factor (2, 3 or 4). Ten pairs of images are used as a mini-batch for stochastic gradient descent. For optimization, we use Adam [19] with a learning rate of 1e−4. We do not use weight decay or dropout, since no over-fitting phenomenon occurs. All models are trained for 240,000 iterations using TensorFlow [1] on a TITAN X (Maxwell) GPU and an Intel E5-2650 2.3 GHz CPU.

We use PSNR and the Structural SIMilarity (SSIM) index for quantitative evaluation, as both metrics are widely used in the image SR literature. We shave the image border in the same way as [6, 17] for a fair comparison. Since human vision is more sensitive to details in intensity than in color, only the luminance channel is super-resolved with our method; for display purposes, the other two channels are obtained by bicubically interpolating the chrominance channels of the LR input image. We evaluate our model every 500 iterations, and the model with the best mean PSNR on the two validation sets is saved and chosen as our testing model. Our models are designed in a fully convolutional manner: although they are trained on 288 × 288 images, they can take an original LR image of arbitrary size as input and directly predict the corresponding HR image.
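For reference, a minimal PSNR routine of the kind used in these evaluations (a sketch assuming 8-bit luminance images; the shave parameter trims the border as in [6, 17]):

```python
import numpy as np

def psnr(hr, sr, shave=0):
    """Peak Signal to Noise Ratio between ground-truth and super-resolved images."""
    if shave > 0:
        hr = hr[shave:-shave, shave:-shave]
        sr = sr[shave:-shave, shave:-shave]
    mse = np.mean((hr.astype(np.float64) - sr.astype(np.float64)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)  # 255 is the peak value of 8-bit images
```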

Figure 7: Convergence of different architectures on B200 with an upscaling factor of 3 (PSNR in dB versus training iterations, for JRN, Subnet-1 + Subnet-3, JRN rand init, and Subnet-3). More subnetworks give a faster training speed. 'JRN rand init' denotes a JRN in which subnetwork-1 is randomly initialized; initializing subnetwork-1 with low-pass filters gives better initial weights and a faster convergence rate.

4.3. Convergence Rate

In order to clearly investigate how different network structures and initialization schemes influence the convergence rate, we conduct a series of comparative experiments. To compare JRN with a generic CNN, we implement a generic CNN with a structure similar to subnetwork-3; the hyper-parameter settings for JRN and the generic CNN are illustrated in Fig. 2 and Fig. 3. The influence of shortcut connections in the SR task is also tested: we implement a network with shortcuts and call it ResNet. ResNet and our subnetwork-3 are almost identical except for the 10 residual blocks. In ResNet, the two 3 × 3 convolutional layers of each residual block are activated by ReLU; in subnetwork-3, the 3 × 3 convolutional layer is activated by paired ReLUs and the 1 × 1 layer is linearly activated. Because we evaluate PSNR on the validation dataset without shaving the image border, the reported PSNR is slightly lower than the actual value.

As shown in Table 1 and Fig. 7, models with more subnetworks achieve a satisfactory result in fewer training iterations. JRN obtains the same accuracy as the generic CNN with only 10.5% of the training iterations, which shows that decomposing images into different components leads to a faster convergence rate. The curves in Fig. 7 show that a JRN with three subnetworks is a little faster than one with two subnetworks. In addition, we can see from Fig. 1 that the sum of subnetwork-1 and subnetwork-2 is very close to the ground truth; we conjecture that subnetwork-1 and subnetwork-2 can learn most of the low frequency information. Hence, we use three subnetworks to constitute our JRN.

As illustrated in Fig. 7, curve 'JRN' converges much faster than curve 'JRN rand init', and JRN achieves a satisfactory result within 50,000 iterations. Initializing subnetwork-1 with low frequency filters gives better initial values and significantly accelerates training. Furthermore, observing the weights in subnetwork-1 after training, we find that these filters are very close to Gaussian filters; initializing these weights with low-pass filter values makes the initial weights of JRN closer to the optimal values than random weights with zero mean. Besides, the performance of ResNet shows that shortcut connections play a significant role in convergence rate and accuracy.

4.4. Properties of Different Architectures
To investigate the properties of different architectures, we evaluate different CNN architectures on dataset B200 with different upscaling factors. We run the experiments several times for each architecture and find that the deviation of the best PSNR for the same architecture is within 0.005 dB.

Table 1: Number of iterations to achieve 28.70 dB (and the proportion relative to the generic CNN), number of parameters, best PSNR, and training time of the different architectures. All experiments are evaluated on dataset B200 with scale factor ×3.

Model               | Iterations to 28.70 dB | Rate  | Parameters (k) | PSNR (dB) | Time to 28.70 dB (h) | Total time (h)
Generic             | 22.9 × 10⁴             | —     | 1,073.2        | 28.70     | 13.1                 | 15.6
ResNet              | 11.9 × 10⁴             | 52.0% | 1,073.2        | 28.87     | —                    | —
subnet-3            | 7.2 × 10⁴              | 31.4% | 786.4          | 28.86     | —                    | —
subnet-1 + subnet-3 | 4.1 × 10⁴              | 17.9% | 786.5          | 28.91     | —                    | —
JRN rand init       | 4.5 × 10⁴              | 19.7% | 789.3          | 28.93     | —                    | —
JRN                 | 2.4 × 10⁴              | 10.5% | 789.3          | 28.93     | 1.8                  | 17.5

In [5], the authors empirically verified that most local minima are equivalent and yield similar performance on a test set for large-size networks.


4.4.1. Model size and computational complexity
Although JRN has three subnetworks, the number of parameters in subnetwork-1 and subnetwork-2 is very small compared to subnetwork-3, and their computational cost is negligible. Computing the complexity with Equation (2), we find that the residual blocks in JRN have only 11/18 of the computational complexity of the generic CNN and ResNet. In addition, Table 1 shows that JRN significantly reduces the number of parameters without loss of performance compared to ResNet. We believe these excellent properties derive from the paired ReLUs activation.


4.4.2. Different activation schemes
In Table 2, we compare the paired ReLUs scheme with popular activation functions, including ReLU [30], leaky ReLU [12], sigmoid and tanh, in terms of model size and performance. In the proposed models, the 3 × 3 convolutional layers activated by paired ReLUs are followed by a 1 × 1 convolutional layer in the residual blocks; for the other activation schemes, the residual blocks consist of two 3 × 3 convolutional layers. We find that the proposed paired ReLUs activation achieves the best result with the fewest parameters. Although leaky ReLU and tanh also activate the negative part and have more trainable parameters than paired ReLUs, their accuracy is still inferior. We attribute this to the sparsity of the paired ReLUs.

Table 2: Model size and performance using different activation schemes in the JRN architecture.

Activation scheme | Scale | Parameters (k) | PSNR (dB)
sigmoid           | ×3    | 1073.5         | 28.81
tanh              | ×3    | 1073.5         | 28.90
ReLU              | ×3    | 1073.5         | 28.89
leaky ReLU        | ×3    | 1073.5         | 28.90
paired ReLU       | ×3    | 789.3          | 28.93

4.4.3. The bigger the more accurate
We implement subnetwork-3 with different hyper-parameters to investigate the relation between performance and model size. The relation between depth and accuracy is shown in Fig. 8, and the relation between width and accuracy in Fig. 9. We find that bigger models always lead to more accurate results; the results of the NTIRE challenge [33] have verified the same conclusion. To obtain a reasonable trade-off between performance and model size, we chose the hyper-parameters with 64 feature maps and 12 residual blocks.

Figure 8: Relation between the depth of subnet-3 and the accuracy on B200 with an upscaling factor of 3. The depths of subnet-1 and subnet-2 are fixed; we progressively modify the depth of subnet-3 to investigate the trade-off between depth and performance.

Figure 9: Relation between the width of subnet-3 and the accuracy on B200 with an upscaling factor of 3. The widths of subnet-1 and subnet-2 are fixed; we progressively modify the width of subnet-3 to investigate the trade-off between width and performance.

4.4.4. Testing time
Table 3 lists the execution times of different models on one 288 × 288 image. Although JRN has a reduced model size and is more efficient than the generic CNN, the test time of JRN is slightly longer in GPU mode. We attribute this to the difference in speedup ratio between 3 × 3 and 1 × 1 convolutions in GPU mode. In CPU mode, JRN consumes less time, because CPUs have far fewer cores and poorer parallel computing performance. Table 3 also shows that JRN has a lower computational cost than the very deep super-resolution network (VDSR) [17]. Because the deeply-recursive convolutional network (DRCN) [18] is evaluated as an ensemble of several models, DRCN is much more time-consuming than VDSR; for this reason, we do not test the computational efficiency of DRCN [18]. As reported in [18], DRCN takes up to one second to process a 288 × 288 image on a TITAN X GPU. In contrast, our JRN takes only about 9.8 ms to 15.8 ms to handle an image with an HR size of 288 × 288 on a TITAN X GPU. In addition, VDSR and DRCN need extra time to upscale the input image to HR size using bicubic interpolation.

Table 3: Testing time on a 288 × 288 image.

Model       | Scale | GPU time (ms) | CPU time (ms)
VDSR [17]   | —     | 33.2          | 1070
Generic CNN | ×2    | 14.9          | 495
            | ×3    | 9.9           | 312
            | ×4    | 8.9           | 262
JRN         | ×2    | 15.8          | 475
            | ×3    | 10.8          | 274
            | ×4    | 9.8           | 240

Figure 10: SR results. The '8023' image from B100 [28] is upscaled by factor ×4 using different state-of-the-art algorithms with a single model: (a) ground truth, (b) bicubic, (c) A+, (d) SRCNN, (e) VDSR, (f) JRN (ours).

4.5. Comparisons with State-of-the-art Approaches


Finally, we compare the performance of JRN with bicubic interpolation and four state-of-the-art approaches: A+ [35], SRCNN [7], DRCN [18] and VDSR [17]. We evaluate JRN as a single model, without any of the post-processing used in [17, 18] or the enhancement techniques reported in [36]. In Table 4, we provide a summary of quantitative results on four popular testing datasets. Our JRN beats all these state-of-the-art methods on all datasets and upscaling factors, except that its SSIM is slightly inferior on Set5 with an upscaling factor of 2. The quantitative results summarized in Table 4 show that our JRN sets a new state of the art on a wide range of images.

In Fig. 10 and Fig. 11, we compare the visual results of our method with bicubic interpolation, A+ [35], SRCNN [7] and VDSR [17]. Bicubic interpolation is significantly worse than the other SR methods. Since the proposed JRN obtains a large gain over the state-of-the-art methods in PSNR, it usually gives much more visually pleasing results. In Fig. 10, only our JRN recovers the correct texture of the bird's wing. In Fig. 11, the letters are clear in our result, whereas they are severely blurred or distorted by most state-of-the-art methods. JRN achieves an average gain of +0.38 dB over VDSR.

5. Conclusions

In this work, we propose a novel architecture, JRN, for single image super-resolution. JRN reduces the low frequency redundancy that often exists in generic deep CNNs by introducing a residual network composed of three subnetworks, in which two shallow subnetworks learn the low frequency information of the input and one deep subnetwork learns the high frequency information. In addition, compared with the traditional ReLU activation, JRN obtains more features from a network of the same size by introducing a paired ReLUs activation. Experiments on a wide range of images show that JRN exceeds the state-of-the-art methods in both computational efficiency and accuracy. In future work, we will apply JRN to other image restoration problems, such as denoising and colorization; further lightening JRN is also left for future work.

Acknowledgement

This work was partially supported by the National Natural Science Foundation of China under Grants No. 61271337 and 61503313, and by the Jiangsu Key Laboratory of Image and Video Understanding for Social Safety (Nanjing University of Science and Technology) under Grant No. 30916014107. We also thank Cuihong Wen and Zuoliang He for their contributions to this work.

Table 4: Comparison with the state-of-the-art methods: average PSNR/SSIM on four popular test datasets. JRN achieves an average gain of +0.38 dB compared to VDSR.

Data     | Scale | Bicubic      | A+ [35]      | SRCNN [7]    | DRCN [18]    | VDSR [17]    | JRN          | Gain (PSNR)
Set5     | ×2    | 33.66/0.9299 | 36.54/0.9544 | 36.66/0.9542 | 37.63/0.9588 | 37.53/0.9587 | 37.67/0.9583 | +0.14
         | ×3    | 30.39/0.8682 | 32.58/0.9088 | 32.75/0.9090 | 33.82/0.9226 | 33.66/0.9213 | 33.99/0.9228 | +0.33
         | ×4    | 28.42/0.8104 | 30.28/0.8603 | 30.48/0.8628 | 31.53/0.8854 | 31.35/0.8838 | 31.82/0.8872 | +0.47
Set14    | ×2    | 30.24/0.8688 | 32.28/0.9056 | 32.42/0.9063 | 33.04/0.9118 | 33.03/0.9124 | 33.90/0.9179 | +0.87
         | ×3    | 27.55/0.7742 | 29.13/0.8188 | 29.28/0.8209 | 29.76/0.8311 | 29.77/0.8314 | 30.46/0.8388 | +0.69
         | ×4    | 26.00/0.7027 | 27.32/0.7491 | 27.49/0.7503 | 28.02/0.7670 | 28.01/0.7674 | 28.71/0.7769 | +0.70
B100     | ×2    | 29.56/0.8431 | 31.21/0.8863 | 31.36/0.8879 | 31.85/0.8942 | 31.90/0.8960 | 32.09/0.8974 | +0.19
         | ×3    | 27.21/0.7385 | 28.29/0.7835 | 28.41/0.7863 | 28.80/0.7963 | 28.82/0.7976 | 28.99/0.8010 | +0.17
         | ×4    | 25.96/0.6675 | 26.82/0.7087 | 26.90/0.7101 | 27.23/0.7233 | 27.29/0.7251 | 27.47/0.7294 | +0.18
Urban100 | ×2    | 26.88/0.8403 | 29.20/0.8938 | 29.50/0.8946 | 30.75/0.9133 | 30.76/0.9140 | 31.28/0.9190 | +0.52
         | ×3    | 24.46/0.7349 | 26.03/0.7973 | 26.24/0.7989 | 27.15/0.8276 | 27.14/0.8279 | 27.37/0.8338 | +0.23
         | ×4    | 23.14/0.6577 | 24.32/0.7183 | 24.52/0.7221 | 25.14/0.7510 | 25.18/0.7524 | 25.27/0.7566 | +0.09

Figure 11: SR results. The 'ppt3' image from Set14 [49] is upscaled by factor ×4 using different state-of-the-art algorithms with a single model: (a) ground truth, (b) bicubic, (c) A+, (d) SRCNN, (e) VDSR, (f) JRN (ours).

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[2] Marco Bevilacqua, Aline Roumy, Christine Guillemot, and Marie-Line Alberi Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In BMVC, 2012.
[3] Marco Bevilacqua, Aline Roumy, Christine Guillemot, and Marie-Line Alberi Morel. Super-resolution using neighbor embedding of back-projection residuals. In 18th International Conference on Digital Signal Processing (DSP), pages 1–8. IEEE, 2013.
[4] Hong Chang, Dit-Yan Yeung, and Yimin Xiong. Super-resolution through neighbor embedding. In CVPR, 2004.
[5] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192–204, 2015.
[6] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Learning a deep convolutional network for image super-resolution. In ECCV, pages 184–199, 2014.
[7] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):295–307, 2016.
[8] Chao Dong, Chen Change Loy, and Xiaoou Tang. Accelerating the super-resolution convolutional neural network. In ECCV, 2016.
[9] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In AISTATS, volume 15, page 275, 2011.
[10] Karol Gregor and Yann LeCun. Learning fast approximations of sparse coding. In ICML, pages 399–406, 2010.
[11] Shuhang Gu, Wangmeng Zuo, Qi Xie, Deyu Meng, Xiangchu Feng, and Lei Zhang. Convolutional sparse coding for image super-resolution. In ICCV, pages 1823–1831, 2015.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, pages 1026–1034, 2015.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[14] Chaoqun Hong, Jun Yu, Jian Wan, Dacheng Tao, and Meng Wang. Multimodal deep autoencoder for human pose recovery. IEEE Transactions on Image Processing, 24(12):5659–5670, 2015.
[15] Jia-Bin Huang, Abhishek Singh, and Narendra Ahuja. Single image super-resolution from transformed self-exemplars. In CVPR, pages 5197–5206, 2015.
[16] Justin Johnson, Alexandre Alahi, and Fei-Fei Li. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
[17] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In CVPR, 2016.
[18] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Deeply-recursive convolutional network for image super-resolution. In CVPR, 2016.
[19] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[20] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
[21] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[22] Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint arXiv:1609.04802, 2016.
[23] Hongyang Li, Wanli Ouyang, and Xiaogang Wang. Multi-bias non-linear activation in deep neural networks. In ICML, 2016.
[24] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755, 2014.
[25] Ding Liu, Zhaowen Wang, Bihan Wen, Jianchao Yang, Wei Han, and Thomas S. Huang. Robust single image super-resolution via deep networks with sparse prior. IEEE Transactions on Image Processing, 25(7):3194–3207, 2016.
[26] Weibo Liu, Zidong Wang, Xiaohui Liu, Nianyin Zeng, Yurong Liu, and Fuad E. Alsaadi. A survey of deep neural network architectures and their applications. Neurocomputing, 234:11–26, 2017.
[27] Stéphane G. Mallat and Zhifeng Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993.
[28] David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, volume 2, pages 416–423, 2001.
[29] Mirza M. Baig, Mian M. Awais, and El-Sayed M. El-Alfy. AdaBoost-based artificial neural network learning. Neurocomputing, 248:120–126, 2017.
[30] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, pages 807–814, 2010.
[31] Wenzhe Shi, Jose Caballero, Ferenc Huszar, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In CVPR, 2016.
[32] Kai Sun, Jiangshe Zhang, Chunxia Zhang, and Junying Hu. Generalized extreme learning machine autoencoder and a new deep neural network. Neurocomputing, 230:374–381, 2017.
[33] Radu Timofte, Eirikur Agustsson, Luc Van Gool, Ming-Hsuan Yang, Lei Zhang, Zhimin Tang, et al. NTIRE 2017 challenge on single image super-resolution: Methods and results. In CVPR Workshops, 2017.
[34] Radu Timofte, Vincent De Smet, and Luc Van Gool. Anchored neighborhood regression for fast example-based super-resolution. In ICCV, pages 1920–1927, 2013.
[35] Radu Timofte, Vincent De Smet, and Luc Van Gool. A+: Adjusted anchored neighborhood regression for fast super-resolution. In ACCV, pages 111–126, 2014.
[36] Radu Timofte, Rasmus Rothe, and Luc Van Gool. Seven ways to improve example-based single image super resolution. In CVPR, 2016.
[37] Vardan Papyan, Yaniv Romano, and Michael Elad. Convolutional neural networks analyzed via convolutional sparse coding. arXiv preprint arXiv:1607.08194, 2016.
[38] Yifan Wang, Lijun Wang, Hongyu Wang, and Peihua Li. End-to-end image super-resolution via deep and shallow convolutional networks. arXiv preprint arXiv:1607.07680, 2016.
[39] Zhaowen Wang, Ding Liu, Jianchao Yang, Wei Han, and Thomas Huang. Deep networks for image super-resolution with sparse prior. In ICCV, pages 370–378, 2015.
[40] Chih-Yuan Yang, Chao Ma, and Ming-Hsuan Yang. Single-image super-resolution: A benchmark. In ECCV, 2014.
[41] Jianchao Yang, John Wright, Thomas Huang, and Yi Ma. Image super-resolution as sparse representation of raw image patches. In CVPR, pages 1–8, 2008.
[42] Jianchao Yang, John Wright, Thomas S. Huang, and Yi Ma. Image super-resolution via sparse representation. IEEE Transactions on Image Processing, 19(11):2861–2873, 2010.
[43] Jun Yu, Jitao Sang, and Xinbo Gao. Machine learning and signal processing for big multimedia analysis. Neurocomputing, 257:1–4, 2017.
[44] Jun Yu, Xiaokang Yang, Fei Gao, and Dacheng Tao. Deep multimodal distance metric learning using click constraints for image ranking. IEEE Transactions on Cybernetics, 2017.
[45] Jun Yu, Baopeng Zhang, Zhengzhong Kuang, Dan Lin, and Jianping Fan. iPrivacy: Image privacy protection by identifying sensitive objects via deep multi-task learning. IEEE Transactions on Information Forensics and Security, 12(5):1005–1016, 2017.
[46] Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In ECCV, pages 818–833, 2014.
[47] Matthew D. Zeiler, Dilip Krishnan, Graham W. Taylor, and Rob Fergus. Deconvolutional networks. In CVPR, pages 2528–2535. IEEE, 2010.
[48] Matthew D. Zeiler, Graham W. Taylor, and Rob Fergus. Adaptive deconvolutional networks for mid and high level feature learning. In ICCV, pages 2018–2025. IEEE, 2011.
[49] Roman Zeyde, Michael Elad, and Matan Protter. On single image scale-up using sparse-representations. In International Conference on Curves and Surfaces, pages 711–730, 2010.

Zhimin Tang is a Ph.D. student in the Department of Automation at Xiamen University, People's Republic of China. He received the M.S. degree from Hunan University in 2015 and the B.S. degree from Hunan Normal University in 2012. His research interests include image processing, machine learning and computer vision.


Hong Peng is an Assistant Professor in the Department of Automation, Xiamen University, People’s Republic of China. His research interests include pattern recognition, intelligent video processing and related areas.


Linkai Luo is a Professor in the Department of Automation, Xiamen University, People’s Republic of China. His research interests include machine learning, pattern recognition, data mining, computer vision, biological information processing and financial data analysis.

Shaohui Li is a postgraduate student in the Department of Automation, Xiamen University, People’s Republic of China. His research interests include image processing, machine learning and computer vision.
