
Journal pre-proof. To appear in: Information Sciences.
DOI: https://doi.org/10.1016/j.ins.2019.12.013 (PII: S0020-0255(19)31121-1, Reference: INS 15056).
Received 23 July 2019; revised 2 December 2019; accepted 9 December 2019.
© 2019 Published by Elsevier Inc.

Attention-aware Perceptual Enhancement Nets for Low-Resolution Image Classification

Xiaobin Zhu (a,1), Zhuangzi Li (b,1,*), Xianbo Li (b), Shanshan Li (b), Feng Dai (b)

a School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing, China
b School of Computer and Information Engineering, Beijing Technology and Business University, Beijing, China

Abstract

Classifying low-resolution (LR) images is notoriously challenging because of their noisy representation and limited information. Existing approaches mainly solve this challenge by training carefully designed architectures on LR datasets or by employing an image-resizing algorithm in a straightforward manner. However, the performance improvements of these methods are usually limited or even trivial in the case of LR images. In this work, we address the LR image classification problem by developing an end-to-end architecture that internally elevates representations of an LR image to "super-resolved" ones. This approach imparts characteristics similar to those of high-resolution (HR) images and is thus more discriminative and representative for image classification. For this purpose, we propose an innovative unified framework, named Attention-aware Perceptual Enhancement Nets (APEN), which integrates perceptual enhancement and an attention mechanism in an end-to-end manner for LR image classification. Specifically, the framework includes a perceptual enhancement network to generate super-resolved images from LR images. In addition, a novel attention mechanism is presented to highlight informative regions, while restricting the semantic deviation of super-resolved images. Additionally, we design a feature rectification strategy to promote the adaptability of category decision. Experiments conducted on publicly available datasets demonstrate the superiority of our method against state-of-the-art methods on both LR and HR datasets.

Keywords: Image classification, Image super-resolution, Deep learning, Convolutional neural networks, Perceptual enhancement.

2010 MSC: 00-01, 99-00

* Corresponding author. Email address: [email protected] (Zhuangzi Li).
1 These authors contributed equally to this study and share the first authorship.
This work was supported by the National Key R&D Program of China (2018YFB0803700) and the National Natural Science Foundation of China (61806044, 61602517, 61871378).

1. Introduction

The rapid development of deep learning architectures [1, 2] has dramatically improved image classification performance, and many evolutionary approaches have been proposed. These approaches are indeed effective on images with a high resolution (HR), a clear appearance, and an obvious structure, from which discriminative and representative features can be effectively learned. However, as shown in Figure 1 (a), performance may degrade drastically on low-resolution (LR) images with a loss of detail. Although LR images are common in real-world scenarios, LR image classification is much more challenging and satisfactory solutions remain rare.

Previous efforts [3, 4, 5, 6, 7] were devoted to addressing LR image classification and related problems. Some researchers [3, 7] concentrated on developing network variants to extract rich representations, enhancing high-level small-scale features with multiple low-level features. However, these approaches attempt to improve performance by exhaustively augmenting data, using skillful fine-tuning, or naively increasing the feature dimensionality, thereby increasing the computational cost while still obtaining unsatisfactory performance. Other researchers [4, 6] tried to directly upscale the input images to resolve LR objects and produce HR feature maps. Cai et al. [4] proposed a resolution-aware convolutional deep model that combines the super-resolution and classification of images. Zou et al. [8] adopted super-resolution to improve facial recognition performance on LR images, showing that learning a super-resolution and facial recognition model concurrently allows for increased recognition performance.


[Figure 1 shows two pipelines: (a) conventional low-resolution image classification (LR image, resize, classification network); (b) the proposed low-resolution image classification (LR image, enhance, classification network).]

Figure 1: Comparison of the performance of conventional image classification methods with that of our method on an LR image. Conventional methods usually adopt upsampling methods such as interpolation, which tend to generate a blurry image; this is adverse to classification because of the absence of semantic information. Our method adopts a perceptual enhancement strategy to super-resolve an LR image, promoting its visual quality and semantic information. Consequently, our method can accurately classify the "penguin" LR image.

Yu et al. [9] proposed to exploit the spatial pyramid structure of images to enhance the vector of locally aggregated descriptors so that it reflects the structural information of the images for place recognition. Na et al. [6] introduced a super-resolution method on cropped regions or candidates for improving object detection and classification. However, these methods still separate the procedures of super-resolution and the specific tasks, and the intrinsic structural and semantic gap between LR and HR images remains to be bridged.

To address the above-mentioned problems in a practical way, we propose an innovative unified framework, named Attention-aware Perceptual Enhancement Nets (APEN), for LR image classification. Our APEN aims to enhance the representations of LR images to approximate those of their HR counterparts by fully exploiting the structural and semantic correlations between LR images and their HR counterparts while the network is learning. As shown in Figure 1 (b), our method is dedicated to achieving higher accuracy by producing high-quality HR counterparts. Specifically, a novel deep residual learning-based perceptual


enhancement network is designed to upscale LR images to super-resolved ones by refining them and providing additional textural details. An attention generation network is proposed to highlight the informative regions of super-resolved images. Hence, it can supervise the perceptual enhancement network and guarantee the quality of super-resolved images, restricting the generation of semantically inconsistent content for classification. Furthermore, we present a feature rectification strategy, which estimates rectification feature vectors from attentive super-resolved images to improve the feature representation accuracy of their categories. In addition, three types of loss, i.e., MSE loss, perceptual loss, and classification loss, are combined for network optimization. Concretely, optimizing the MSE and perceptual losses drives super-resolved images toward their HR counterparts at the pixel (superficial) and perceptual levels, while the classification loss is optimized to provide semantic information for updating the attention generation network and the feature rectification network.

In summary, the main contributions of this paper are four-fold:

• We propose the innovative Attention-aware Perceptual Enhancement Nets (APEN), specially designed to effectively boost the performance of LR image classification by improving perceptual representation.

• To the best of our knowledge, our work is the first attempt to apply a perceptual enhancement network in an end-to-end image classification framework.

• We propose a novel attention mechanism to highlight informative regions and restrict the semantic deviation of super-resolved images.

• We present extensive experiments conducted on publicly available datasets to demonstrate the state-of-the-art performance of our method.

We have published our code and model at an anonymous webpage: https://github.com/lizhuangzi/APEN . The remainder of the paper is organized as follows. In Section 2 we overview related work. The method is described


in Section 3. Section 4 presents and discusses the experimental results. In Section 5, we draw our conclusions.

2. Related Work

2.1. Image Classification

In the paradigm of image classification, the mainstream of pioneering work [10, 11] concentrated on carefully designing handcrafted features to improve performance. With the prosperity of deep learning, recent work usually adopts convolutional neural networks (CNNs) as feature extractors, which have shown significant improvements in many applications [12, 13, 14]. Krizhevsky et al. [15] first demonstrated the success of a CNN in an image classification task. Afterwards, Simonyan et al. [1] modeled a deeper network with smaller convolution kernels for improved performance. Other researchers [16] proved that CNNs can extract more effective and representative features by increasing the network depth. Building a deeper network can generally enhance performance, but can result in training problems such as vanishing or exploding gradients.

He et al. [2] proposed the residual network (ResNet) to alleviate the above-mentioned problems. ResNet facilitates the training of networks that are substantially deeper than before. Motivated by such short connections, an inception network was developed and achieved promising performance on classification tasks [17]. Targ et al. [18] proposed a series of ResNet variants with generalized residual blocks. However, ResNets with random weight dropout can achieve improved performance, demonstrating that a great amount of redundancy can be found in residual networks. Zhang et al. [19] presented an unsupervised local deep-feature alignment network for dimension reduction, which can benefit tasks that rely on high-dimensional feature representation. Iandola et al. [20] proposed a small CNN architecture named SqueezeNet, achieving accuracy comparable to AlexNet on ImageNet with only 2% of the parameters. To reduce the redundancy of ResNet, Huang et al. [21] proposed a novel densely connected network, in which each layer connects with all the other layers in a feed-forward fashion. Despite these great improvements in deep learning-based image classification networks, accurately classifying LR images remains greatly challenging and good solutions have rarely been reported.

2.2. Perceptual Enhancement

Image super-resolution aims at recovering an HR image from its corresponding LR image, and it is an effective way to achieve perceptual enhancement. Conventional image super-resolution methods can be roughly divided into two categories, namely internal and external example-based methods. Internal example-based methods [22, 23] exploit a self-similarity property and generate exemplar patches from the input images. External example-based methods [24, 25] learn a mapping function from LR to HR patches from external datasets. However, it is inefficient to establish a complex mapping by using massive amounts of raw data.

As pioneering work, SRCNN [26] was the first super-resolution deep neural network, achieving promising performance compared with conventional methods. Inspired by [27], and considering that a CNN with a much deeper architecture can learn more complicated representations, Kim et al. [28] increased the depth of the network by stacking more convolutional layers with residual learning for image super-resolution. However, in [28], bicubic interpolation was used to upscale the LR observation before feeding images into the CNN, which is computationally highly expensive. Later, deconvolution-based super-resolution methods [29, 30] upscaled the LR features in the last part of the network. In addition, shortcut connection strategies [31, 32] were adopted to provide hierarchical features and promote performance. DRCN [33] and DRRN [31] are deep networks that were modeled by stacking recursive blocks. Specifically, dense connections [21] were employed to provide an effective way to build compact models. The shortcut connection architecture not only alleviates the vanishing gradient problem in deep network training, but also integrates low-level and high-level features for improved HR reconstruction. Cai et al. [4] proposed a resolution-aware deep model, which combines convolutional image super-resolution and convolutional classification into a single end-to-end model. Nevertheless, the procedures of super-resolution and the specific tasks are not well integrated, and the intrinsic connections between super-resolution and classification need to be further investigated.

2.3. Attention Mechanism

An attention mechanism can be viewed as a strategy to bias the allocation of available processing resources towards the most informative components of the input [34]. Attention modules have been widely applied in the field of Natural Language Processing (NLP), in applications involving machine translation and sentence generation, with remarkable performance. In addition, the attention mechanism has also demonstrated powerful performance in the image vision field. For example, Hu et al. [35] adopted an attention mechanism to propose an object relation module, which models the relationships among a set of objects and improves object recognition. Zhang et al. [36] proposed a channel-wise attention mechanism for an image super-resolution task. Wang et al. [7] proposed a residual attention network that stacks attention modules to generate attention-aware features for image classification. In our work, the attention-aware network aims to concentrate on informative spatial regions, to ensure that super-resolved images conform to their real categories.

3. Our Method

3.1. Overview

As shown in Figure 2, the five main components of our framework are a perceptual enhancement network, an attention generation network, a feature extraction network, a feature rectification network, and a classifier. The perceptual enhancement network generates super-resolved (SR) images from their low-resolution (LR) counterparts. A pixel-wise Mean Square Error (MSE) loss is adopted to maintain pixel similarity with the high-resolution (HR) image.



Figure 2: Framework of our Attention-aware Perceptual Enhancement Nets (APEN).

HR and SR images are both fed into the feature extraction network to compute the perceptual loss at the feature level. The hierarchical features of the perceptual enhancement network are fed into the attention generation network to produce an attention map. The attention map indicates informative regions of SR images and promotes their semantic information. The attentive SR image is then fed into the feature extraction network to obtain a one-dimensional feature representation. We adopt the feature rectification network to rectify this one-dimensional feature by element-wise addition. Then, the rectified features are fed into the classifier, yielding the classification loss. Using the classification loss, we fix the weights of the feature extraction network and feature rectification network to optimize the attention generation network.

The classification model in our method can be substituted by any other classification network. We adopt two classical networks, i.e., VGG [1] and ResNet [2], for comprehensive evaluations. Section 3.2 introduces the perceptual enhancement network. Section 3.3 and Section 3.4 describe the attention generation network and the feature rectification network, respectively. We elaborate on the loss functions and training algorithm in Section 3.5.
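To make the data flow described above concrete, the following is a minimal PyTorch-style sketch of how the five components could be composed at inference time. It is our own illustration under stated assumptions, not the authors' released code: the module names (enhancer, attention_net, feature_extractor, rectifier, classifier) are placeholders, the enhancement network is assumed to also return its hierarchical features, and the rectification feature is assumed to match the dimensionality of the flattened extracted feature.

```python
import torch
import torch.nn as nn

class APEN(nn.Module):
    """Sketch of the APEN inference path (placeholder modules, not the official code)."""
    def __init__(self, enhancer, attention_net, feature_extractor, rectifier, classifier):
        super().__init__()
        self.enhancer = enhancer                    # perceptual enhancement network (LR -> SR + hierarchical feats)
        self.attention_net = attention_net          # attention generation network (hierarchical feats -> A)
        self.feature_extractor = feature_extractor  # convolutional layers of VGG/ResNet
        self.rectifier = rectifier                  # feature rectification network
        self.classifier = classifier                # final classifier

    def forward(self, lr_image):
        # 1. Super-resolve the LR input and collect hierarchical features.
        sr_image, hierarchical_feats = self.enhancer(lr_image)
        # 2. Predict a spatial attention map and weight the SR image with it.
        attention = self.attention_net(hierarchical_feats)   # (B, 1, rH, rW), values in (0, 1)
        attentive_sr = sr_image * attention                  # Hadamard product, broadcast over RGB channels
        # 3. Extract a global feature and rectify it by element-wise addition (Eq. 6).
        feat = torch.flatten(self.feature_extractor(attentive_sr), 1)
        rectified = feat + self.rectifier(attentive_sr)      # assumes matching dimensionality
        # 4. Classify the rectified feature.
        return self.classifier(rectified)
```

During training, these components are updated under different losses, as detailed in Section 3.5 and Algorithm 1.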



Figure 3: Architecture of our perceptual enhancement network.

3.2. Perceptual Enhancement Network

Given a low-resolution image $I^{LR} \in \mathbb{R}^{3\times h\times w}$ (where $h$ and $w$ denote the height and width of the image, respectively), we design a perceptual enhancement network to enhance the HR properties of $I^{LR}$, generating the corresponding enhanced image. For clarity of illustration, we refer to the enhanced image as the SR image $I^{SR} \in \mathbb{R}^{3\times rh\times rw}$, where $r$ is the upscale factor. As shown in Figure 3, the perceptual enhancement network is composed of two sub-networks: a feature embedding network $G_E$ and an image reconstruction network $G_R$. The feature embedding network maps the LR image to $c$-channel deep features:

\[
F_e = G_E(I^{LR}), \qquad (1)
\]

where $F_e \in \mathbb{R}^{c\times h\times w}$. $G_E$ first adopts a $9\times 9$ convolutional layer to extract primary features with large receptive fields. To strengthen the learning capability of the embedding network, 16 residual blocks are stacked for learning deep representations. Notably, we adopt the Swish activation [37] ($\mathrm{Swish}(x) = x\cdot\mathrm{Sigmoid}(x)$) instead of ReLU [15] in our network for improved performance. We alleviate the gradient vanishing and exploding problems by transmitting gradients through local residual connections. In each block, convolutional layers with $3\times 3$ kernels, Batch Normalization (BN), and a Sigmoid-gated (Swish) activation are combined to learn residual information. High-frequency learning of details is emphasized by letting $G_E$ use a global residual connection that adds both high- and low-level features.
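For illustration only, a single residual block of the embedding network might look like the following minimal PyTorch sketch; it assumes the $3\times 3$ convolution + BN layout described above together with the Swish activation, and the 64-channel width reported in Section 4.6.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Sketch of one residual block of G_E (assumed layout, 64 channels)."""
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.SiLU(),                                   # Swish: x * sigmoid(x)
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        # Local residual connection: the block learns a residual on top of its input.
        return x + self.body(x)
```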


Figure 4: Diagram of feature upsampling. The channel-increasing convolution increases the number of channels; the sub-pixel convolution shuffles those pixels to obtain a 2× magnified feature map.
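As a concrete illustration of the upsampling block in Figure 4, here is a minimal PyTorch-style sketch (module and parameter names are ours): a channel-increasing $3\times 3$ convolution from $c$ to $4c$ channels, a sub-pixel (pixel-shuffle) rearrangement back to $c$ channels at twice the spatial size, and a Swish non-linearity.

```python
import torch.nn as nn

class UpsampleBlock(nn.Module):
    """Sketch of one 2x upsampling block (channel-increasing conv + pixel shuffle + Swish)."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Conv2d(channels, 4 * channels, kernel_size=3, padding=1)  # W_ci, b_ci
        self.shuffle = nn.PixelShuffle(2)   # (B, 4c, H, W) -> (B, c, 2H, 2W)
        self.act = nn.SiLU()                # Swish activation

    def forward(self, f_e):
        return self.act(self.shuffle(self.conv(f_e)))
```

In the reconstruction network two such blocks are stacked (4× in total) before the final 9×9 convolution, as described in the text that follows.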

The image reconstruction network $G_R$ converts the embedded features to an SR image in RGB space:

\[
I^{SR} = G_R(F_e). \qquad (2)
\]

It consists of two upsampling blocks and a $9\times 9$ convolutional layer. Specifically, each upsampling block expands the feature maps by $2\times$, as shown in Figure 4. The upsampled features $F_e^{\uparrow}$ can be formulated as

\[
F_e^{\uparrow} = \sigma\big(S(W_{ci} * F_e + b_{ci})\big), \qquad (3)
\]

where $W_{ci} \in \mathbb{R}^{c\times 4c\times 3\times 3}$ is a convolution kernel that increases the number of channels, $b_{ci}$ is the bias vector of this channel-increasing convolution, $S$ is a sub-pixel convolution operation [38] that generates $c$-channel $2\times$ upscaled features, and $\sigma$ denotes the Swish activation function [37] for non-linear mapping. After two such upsampling layers, a last $9\times 9$ convolutional layer converts the upsampled feature maps into a 3-channel SR image.

3.3. Attention Generation

Our attention generation network aims to highlight the informative regions of SR images to further ensure consistency with their real categories.


Figure 5: Architecture of our attention generation network. The bright yellow regions on the attention map have higher attention values.
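The architecture in Figure 5 can be sketched as follows; this is an assumed PyTorch-style realization in which $k$ hierarchical 64-channel inputs are concatenated, $c_a = 64$, and the two 2× upscaling steps use stride-2 transposed convolutions (the exact kernel sizes and strides are our assumptions, not specified in the paper).

```python
import torch
import torch.nn as nn

class AttentionGenerator(nn.Module):
    """Sketch of the attention generation network of Figure 5 (assumed hyper-parameters)."""
    def __init__(self, k=16, c=64, c_a=64):
        super().__init__()
        self.fuse = nn.Conv2d(k * c, c_a, kernel_size=3, padding=1)   # W_fu in Eq. (4)
        self.head = nn.Sequential(
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(c_a, c_a, kernel_size=4, stride=2, padding=1),  # 2x deconv
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(c_a, c_a, kernel_size=4, stride=2, padding=1),  # 2x deconv
            nn.ReLU(inplace=True),
            nn.Conv2d(c_a, 1, kernel_size=1),      # linear combination of features
            nn.Sigmoid(),                          # attention values in (0, 1)
        )

    def forward(self, hierarchical_feats):
        # hierarchical_feats: list of k tensors, each of shape (B, c, h, w)
        h_fused = self.fuse(torch.cat(hierarchical_feats, dim=1))
        return self.head(h_fused)                  # attention map A of shape (B, 1, 4h, 4w)
```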

The architecture of the attention generation network is shown in Figure 5. The network generates an attention map from a group of input feature maps. To exploit hierarchical information, the input feature maps are taken from the $k$ hierarchical outputs $H_1, H_2, \ldots, H_k$ of the blocks in the perceptual enhancement network. We fuse the inputs to obtain the representative feature maps as

\[
H_{fu} = W_{fu} * H_t = W_{fu} * [H_1, H_2, \ldots, H_k], \qquad (4)
\]

where $W_{fu} \in \mathbb{R}^{3\times 3\times k\cdot c\times c_a}$ is a convolutional kernel that outputs $c_a$-channel features, and $[\cdot]$ denotes channel-wise concatenation. After a $3\times 3$ convolutional layer and a ReLU function, we stack two deconvolution layers to upscale the feature maps. This guarantees that the size of the generated attention map is identical to that of the SR images. Then, we adopt a $1\times 1$ convolutional layer to linearly combine these features. Finally, a Sigmoid function is adopted to squash the attention values between 0 and 1. We define the final attention map as $A \in \mathbb{R}^{1\times rh\times rw}$. According to Figure 2, we allocate attention values to regions of the SR image as

\[
I^{SR}_a = I^{SR} \otimes A = \big(I^{SR}_R \odot A,\; I^{SR}_G \odot A,\; I^{SR}_B \odot A\big), \qquad (5)
\]

where $\odot$ denotes the Hadamard product, and $I^{SR}_R$, $I^{SR}_G$, and $I^{SR}_B$ denote the red, green, and blue channels of the SR image, respectively.

3.4. Feature Rectification

Our feature extraction network is trained on HR images; hence, it cannot reliably discriminate among the features of SR images.


Figure 6: Structure of the feature rectification network.

Thus, we design a feature rectification network to rectify the global feature representations and enhance their discriminative ability. The flowchart of the feature rectification network, which contains four convolutional layers, is shown in Figure 6. The first convolutional layer adopts a kernel size of $11\times 11$ with a stride of 2 to capture large receptive fields. We attach max-pooling layers to the first and second convolutional layers to reduce the computational cost; each max-pooling layer, with a $2\times 2$ kernel and a stride of 2, downscales the feature maps by $1/2$. A global average pooling then computes the mean value of each feature map, followed by a flattening operation that yields a one-dimensional feature vector $z$. We refer to this one-dimensional feature vector as the rectification feature, which is used to rectify the feature difference between the input SR image and its corresponding HR image. The rectification process can be formulated as

\[
X_{rec} = z + \mathcal{F}\big(\phi(I^{SR}_a)\big), \qquad (6)
\]

where $X_{rec}$ are the rectified features, $\mathcal{F}$ denotes the flattening operation, and $\phi$ denotes the feature extraction network.
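A minimal sketch of the feature rectification network of Figure 6 and of the rectification step in Eq. (6) is given below. The channel widths (32, 192, 384, 512) follow the implementation details in Section 4.6; the kernel sizes and paddings of the later layers are our assumptions, and the sketch assumes that $z$ and the flattened output of $\phi$ have the same dimensionality.

```python
import torch
import torch.nn as nn

class FeatureRectifier(nn.Module):
    """Sketch of the feature rectification network (Figure 6, assumed details)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=11, stride=2, padding=5), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(32, 192, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)            # global average pooling

    def forward(self, attentive_sr):
        z = self.pool(self.features(attentive_sr))
        return torch.flatten(z, 1)                     # rectification feature z, shape (B, 512)

def rectify(z, phi_feat):
    # Eq. (6): element-wise addition of z and the flattened feature-extractor output.
    return z + torch.flatten(phi_feat, 1)
```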

3.5. Loss and Training Algorithm

We propose three types of loss to train our APEN. First, the MSE loss and the perceptual loss are combined to optimize the perceptual enhancement network. Given an LR image, the MSE loss provides a fundamental restriction that forces the SR image to approximate the HR image as closely as possible at the pixel level. For an input LR image $I^{LR}$ and the corresponding ground-truth HR image $I^{HR}$, the MSE loss is formulated as

\[
\mathcal{L}_{MSE} = \frac{1}{WHC}\sum_{m=1}^{C}\sum_{q=1}^{H}\sum_{p=1}^{W} \big\| G_{\theta}(I^{LR})_{m,q,p} - I^{HR}_{m,q,p} \big\|_2^2, \qquad (7)
\]

where $G_{\theta}$ denotes the whole perceptual enhancement network.

The majority of previous studies on super-resolution and denoising adopt the MSE loss alone for training. However, optimizing the SR image with only the MSE loss often does not conform to the perception required for classification, resulting in unsatisfactory performance. Inspired by [39], we view the convolutional layers of the classification network (VGG or ResNet) as the feature extraction network. The content of the $i$-th convolutional layer of the feature extraction network is then used to compute the perceptual loss:

\[
\mathcal{L}_{c_i} = \frac{1}{w^{*}h^{*}c^{*}} \sum_{m=1}^{c^{*}}\sum_{q=1}^{h^{*}}\sum_{p=1}^{w^{*}} \big\| \phi_i\big(G_{\theta}(I^{LR})\big)_{m,q,p} - \phi_i\big(I^{HR}\big)_{m,q,p} \big\|_2^2, \qquad (8)
\]

where $w^{*}$, $h^{*}$, and $c^{*}$ are the width, height, and number of channels of the content feature maps, respectively. We found a multi-level content loss to be more effective for capturing multi-scale information. Therefore, we compute three perceptual losses ($\mathcal{L}_{c\text{-}1}$, $\mathcal{L}_{c\text{-}2}$, and $\mathcal{L}_{c\text{-}3}$) from the last three convolutional outputs, and the final perceptual loss is formulated as

\[
\mathcal{L}_{P} = \mathcal{L}_{c\text{-}1} + \mathcal{L}_{c\text{-}2} + \mathcal{L}_{c\text{-}3}. \qquad (9)
\]

Finally, the total loss for optimizing the perceptual enhancement network is a linear combination of $\mathcal{L}_{MSE}$ and $\mathcal{L}_{P}$:

\[
\mathcal{L}_{Total} = \alpha\,\mathcal{L}_{MSE} + \beta\,\mathcal{L}_{P}. \qquad (10)
\]
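The training objective of Eqs. (7)-(10) can be sketched as follows; multi_level_features is an assumed callable that returns the last three convolutional feature maps of the feature extraction network, and the weights alpha = 0.01 and beta = 1 follow Section 4.6.

```python
import torch
import torch.nn.functional as F

def total_loss(sr, hr, multi_level_features, alpha=0.01, beta=1.0):
    """Sketch of L_Total = alpha * L_MSE + beta * L_P (Eqs. 7-10)."""
    l_mse = F.mse_loss(sr, hr)                       # Eq. (7), pixel-wise MSE
    with torch.no_grad():
        hr_feats = multi_level_features(hr)          # HR targets are not back-propagated
    sr_feats = multi_level_features(sr)              # last three conv outputs of phi
    l_perc = sum(F.mse_loss(s, h) for s, h in zip(sr_feats, hr_feats))  # Eqs. (8)-(9)
    return alpha * l_mse + beta * l_perc             # Eq. (10)
```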

According to Section 3.3, the attention generation network should be optimized with category-level semantic information. As shown in Figure 2, the attentive SR image $I^{SR}_a$ is fed into the feature extraction network and the feature rectification network to obtain a rectified feature. The rectified feature is fed into the classifier to obtain an estimated category. Thus, the classification loss can be formulated as

\[
\mathcal{L}^{sr}_{Cls} = -\big(y\log \hat{y}^{sr} + (1-y)\log(1-\hat{y}^{sr})\big), \qquad (11)
\]

where $y$ is a ground-truth label and $\hat{y}^{sr}$ is the label estimated from the attentive SR image $I^{SR}_a$. We use the back-propagation algorithm to update the parameters $\omega_{Att}$ of the attention generation network with the gradient $\nabla_{\omega_{Att}}\mathcal{L}^{sr}_{Cls}$:

\[
\nabla_{\omega_{Att}}\mathcal{L}^{sr}_{Cls} = \nabla_{I^{SR}_a}\mathcal{L}^{sr}_{Cls} \cdot \nabla_{\omega_{Att}} I^{SR}_a
= \nabla_{X_{rec}}\mathcal{L}^{sr}_{Cls} \cdot \nabla_{\omega_{Att}} I^{SR}_a \cdot \big(\nabla_{z} X_{rec} \cdot \nabla_{I^{SR}_a} z + \nabla_{\phi(I^{SR}_a)} X_{rec} \cdot \nabla_{I^{SR}_a}\phi(I^{SR}_a)\big), \qquad (12)
\]

where $I^{SR}_a$ is the attentive SR image and $\cdot$ denotes the element-wise product. Notably, the parameters of the classifier, the feature extraction network, and the feature rectification network are fixed during the training of the attention network. In addition, the gradient at the input of the attention network, $\nabla_{H_{fu}}\mathcal{L}_{Cls}$, is not propagated back to the perceptual enhancement network.

The feature extraction network and the classifier are jointly optimized by the classification loss on the HR images:

\[
\mathcal{L}^{hr}_{Cls} = -\big(y\log \hat{y}^{hr} + (1-y)\log(1-\hat{y}^{hr})\big), \qquad (13)
\]

where $\hat{y}^{hr}$ is the label estimated from the HR image. The feature rectification network should improve the adaptability of the classification model to SR images. Thus, the feature rectification network is also optimized by the classification loss on SR images (Equation 11). This loss ensures that the rectified feature satisfies the decision boundary of the classifier. Finally, the parameters $\omega_{rec}$ of the feature rectification network are updated according to the chain rule:

\[
\nabla_{\omega_{rec}}\mathcal{L}^{sr}_{Cls} = \nabla_{X_{rec}}\mathcal{L}^{sr}_{Cls} \cdot \big(\nabla_{z} X_{rec} + \nabla_{\phi(I^{SR}_a)} X_{rec}\big)
= \nabla_{X_{rec}}\mathcal{L}^{sr}_{Cls} \cdot \nabla_{z} X_{rec} \cdot \nabla_{\omega_{rec}} z, \qquad (14)
\]

where $\cdot$ denotes the element-wise product. The detailed training procedure is provided in Algorithm 1. Our training strategy enables all networks to converge easily. Notably, the classification network is trained on HR images, as in Equation (13). Consequently, our classification network maintains favorable performance in classifying the original HR images.

Algorithm 1 Training algorithm of APEN.
1:  for number of training epochs do
2:    for k steps do
3:      Sample a batch of HR images I^HR_1, ..., I^HR_n and corresponding labels y_1, ..., y_n from the training dataset.
4:      Randomly crop 224 × 224 patches of I^HR and zoom them to 1/4, thus obtaining a batch of LR images I^LR_1, ..., I^LR_n.
5:      Input I^LR_1, ..., I^LR_n into the perceptual enhancement network G_θ to obtain the SR images I^SR_1, ..., I^SR_n, and update it by ∇_{θ_G} (1/n) Σ_{i=1}^{n} L_Total.
6:      Input the hierarchical features H^t_1, ..., H^t_n into the attention generation network and perform attention as in Equation (5); compute the classification loss (1/n) Σ_{i=1}^{n} L^sr_Cls and update its parameters according to Equation (12).
7:      Input I^HR_1, ..., I^HR_n into the feature extraction network φ and the classifier Cls; compute the loss (1/n) Σ_{i=1}^{n} L^hr_Cls with the labels y_1, ..., y_n to update their parameters.
8:      Input I^SR_1, ..., I^SR_n into the feature extraction network φ and the feature rectification network τ; compute the loss (1/n) Σ_{i=1}^{n} L^sr_Cls with the labels y_1, ..., y_n; update the parameters of τ according to Equation (14).
9:    end for
10: end for
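A condensed, hedged sketch of one training iteration following Algorithm 1 is shown below. The module names match the earlier overview sketch, opts is an assumed dictionary of per-network optimizers, sr_criterion implements L_Total of Eq. (10), and cls_criterion is a standard classification loss. Keeping the other networks fixed while updating the attention network, and blocking its gradient from reaching the enhancement network, is realized here with detach(), which is one possible implementation of the gradient isolation described in Section 3.5.

```python
import torch

def train_step(lr_batch, hr_batch, labels, apen, opts, cls_criterion, sr_criterion):
    # Step 5: update the perceptual enhancement network with L_Total (Eq. 10).
    sr, feats = apen.enhancer(lr_batch)
    loss_sr = sr_criterion(sr, hr_batch)
    opts["enhancer"].zero_grad(); loss_sr.backward(); opts["enhancer"].step()

    # Step 6: update the attention generation network with the SR classification
    # loss (Eqs. 11-12); detach() keeps the other networks fixed here.
    attention = apen.attention_net([f.detach() for f in feats])
    attentive_sr = sr.detach() * attention
    feat = torch.flatten(apen.feature_extractor(attentive_sr), 1)
    logits = apen.classifier(feat + apen.rectifier(attentive_sr))
    loss_att = cls_criterion(logits, labels)
    opts["attention"].zero_grad(); loss_att.backward(); opts["attention"].step()

    # Step 7: update the feature extraction network and the classifier on HR images (Eq. 13).
    logits_hr = apen.classifier(torch.flatten(apen.feature_extractor(hr_batch), 1))
    loss_hr = cls_criterion(logits_hr, labels)
    opts["classifier"].zero_grad(); loss_hr.backward(); opts["classifier"].step()

    # Step 8: update the feature rectification network on attentive SR images (Eq. 14).
    with torch.no_grad():
        base_feat = torch.flatten(apen.feature_extractor(attentive_sr.detach()), 1)
    logits_sr = apen.classifier(base_feat + apen.rectifier(attentive_sr.detach()))
    loss_rec = cls_criterion(logits_sr, labels)
    opts["rectifier"].zero_grad(); loss_rec.backward(); opts["rectifier"].step()
```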

4. Experiments

4.1. Datasets and Comparison Methods

We evaluated the proposed method on three benchmark datasets, i.e., Caltech-256 [40], Stanford Dogs [41], and Food-101 [42]:

• Caltech-256 [40]: it includes 257 categories (256 general categories and one background category) and contains a total of 30,607 images of different sizes. Each category contains a different number of images (80 to 827), and most categories have approximately 100 images. Following the popular evaluation protocol [43], we randomly selected 60 training images from each class and used the rest for testing.

• Stanford Dogs [41]: it is a popular fine-grained dataset containing 120 categories of dogs. Each category has more than 150 images. Following the standard protocol [41], we selected 100 samples from each category for training and used the rest for testing. Some algorithms additionally adopt prior localization information; for a fair comparison, we did not use such priors to classify images.

• Food-101 [42]: it is one of the largest datasets of classified food images, covering the 101 most popular dishes. Each category has 750 training images and 250 testing images. The side lengths of the images are approximately 300 to 350 pixels. Classification on Food-101 is a challenging task because the food images are easily affected by deformation, color change, and packaging, and the feature space differs greatly after cooking.

To acquire LR images, we followed the typical strategy used in many studies [29, 30, 44] and downscaled images by 1/4 (e.g., a 224 × 224 image is resized to a 56 × 56 image). The downscaled images were considered the LR images, and the original versions were viewed as the HR images. Detailed settings are provided in Section 4.6.

For extensive comparisons, we introduce a series of competing methods:


• HR-LR: the classification network is trained on HR images but tested on LR images. This setting can be viewed as the baseline for comparison.

• LR-LR: the classification network is trained on LR images and tested on LR images. This setting requires training an additional network for LR images.

• D-DBPN [45]: a state-of-the-art image super-resolution network based on a pixel-wise MSE loss. We adopted its well-trained model to enhance LR images for classification.

• SRGAN [39]: a state-of-the-art image super-resolution network based on optimizing perception. We adopted its well-trained generator to enhance LR images for classification.

• PEN: the perceptual enhancement network trained on the above training datasets with the MSE loss only. We used this network directly to enhance the LR testing images for classification.

• Li et al. [43]: a state-of-the-art image classification method, which combines a CNN and a kernel extreme learning machine (KELM). We used the VGG network trained as in LR-LR to extract the LR features, and then adopted KELM for classification.

• Iandola et al. [20]: a lightweight image classification network that achieves performance comparable to AlexNet. We trained it according to the LR-LR setting.

• Jacobsen et al. [46]: a state-of-the-art image classification network based on an invertible architecture. We trained it according to the LR-LR setting.

• Wang et al. [47]: a state-of-the-art image classification method that is simple but very effective. We trained it according to the LR-LR setting.

• HR-HR: the classification network is trained on HR images and tested on HR images.

In our evaluation, "-VGG" denotes the adoption of the VGG-16 network as the classification model, and "-ResNet" denotes the ResNet-34 network. Notably, the super-resolution methods, i.e., SRGAN and D-DBPN, are universal for all natural images, and their well-trained models can be directly applied to unseen LR images.

4.2. Experiment on Caltech-256

4.2.1. State-of-the-art comparison


We compared our methods with state-of-the-art methods to verify their effectiveness. The results in Table 1 show that APEN-ResNet

Table 1: Accuracy comparisons of different LR image classification methods on Caltech-256.

Model               Top-1 (%)  Top-5 (%)  | Model                 Top-1 (%)  Top-5 (%)
HR-HR-VGG           79.48      92.14      | HR-HR-ResNet          83.49      94.77
HR-LR-VGG           58.46      79.11      | HR-LR-ResNet          69.99      87.99
LR-LR-VGG           73.42      88.57      | LR-LR-ResNet          78.57      92.60
PEN-VGG             70.01      86.53      | PEN-ResNet            72.61      89.70
D-DBPN-VGG          69.29      86.31      | D-DBPN-ResNet         70.53      88.46
SRGAN-VGG           72.14      88.31      | SRGAN-ResNet          77.54      91.98
Li et al. [43]      73.51      88.69      | Iandola et al. [20]   65.19      83.45
Wang et al. [47]    73.68      88.66      | Jacobsen et al. [46]  73.08      87.68
APEN-VGG            75.62      90.21      | APEN-ResNet           79.95      92.99

achieves the highest classification accuracy on both Top-1 and Top-5. APEN-VGG outperforms HR-LR-VGG and LR-LR-VGG by 17.06% and 2.1% in Top-1 accuracy, demonstrating the weakness and bottleneck of conventional LR image classification methods. Although Li et al. [43] boost performance by approximately 0.1% and 0.12% compared with LR-LR, this increase is still not substantial. In contrast, our APEN-VGG outperforms the state-of-the-art methods of Li et al. [43], Wang et al. [47], and Jacobsen et al. [46] by 2.01%, 1.84%, and 2.24% in Top-1 accuracy, respectively, showing the superior performance of our method on LR image classification. In addition, APEN-ResNet outperforms the VGG-based APEN by 4.33% and 2.78% in Top-1 and Top-5 accuracy, respectively. APEN-ResNet is only less accurate than HR-HR-ResNet by 3.54% and 2.78% in the Top-1 and Top-5 evaluations, respectively, whereas HR-LR-ResNet greatly degrades the performance. These comparisons enable us to conclude that our APEN is not only an effective framework, but also achieves state-of-the-art performance for general image classification.

As shown in Figure 7, we further visualize the feature maps of the feature extraction network for inputs of different resolutions, i.e., an LR image, an SR image, and an HR image. The feature maps of the SR image are noticeably close to those of the HR image, whereas the feature maps of the LR image are obviously distinct


Table 2: Ablation study on Caltech-256; the classification network is VGG-16.

Model              Top-1 (%)   Top-5 (%)
no-MSE             75.61       90.10
no-Perceptual      73.70       88.85
one-Perceptual     75.51       90.10
no-Attention       75.61       90.09
no-Rectification   75.13       89.91

Table 3: Ablation study on Caltech-256; the classification network is ResNet-34.

Model              Top-1 (%)   Top-5 (%)
no-MSE             79.17       92.59
no-Perceptual      78.03       92.35
one-Perceptual     78.92       92.76
no-Attention       78.86       92.46
no-Rectification   78.66       92.58

from those of the HR image.

4.2.2. Ablation study

We conducted ablation studies on the Caltech-256 dataset to verify the effectiveness of each component, using five degraded models designed for comparison: 1) no-MSE: removing the MSE loss; 2) no-Perceptual: removing the perceptual loss; 3) one-Perceptual: using only the content of the last convolutional layer to model the perceptual loss; 4) no-Attention: removing the attention mechanism; 5) no-Rectification: removing the feature rectification strategy. The results based on the VGG-16 and ResNet-34 classification networks are presented in Table 2 and Table 3, respectively. A comparison of the degraded model one-Perceptual with APEN indicates that one-Perceptual decreases the Top-1 accuracy by 1.92% and 1.92% for the VGG- and ResNet-based classification models, respectively. This shows the necessity of taking the perceptual loss into account. A comparison of APEN with the no-Rectification



Figure 7: Example of the visualization of feature maps extracted by the last convolutional layer of VGG on the Caltech-256 dataset.


model on ResNet reveals that the feature rectification strategy improves the accuracy by 1.29% and 0.41% in Top-1 and Top-5, respectively, demonstrating the effectiveness of the feature rectification. The other comparisons also prove the effectiveness of each component for LR image classification.

4.3. Experiment on the Stanford Dogs dataset

4.3.1. State-of-the-art comparison

We compared the performance of our methods with that of state-of-the-art methods on the fine-grained Stanford Dogs dataset. The results in Table 4 indicate that VGG trained on HR images leads to poor performance when processing LR images, achieving only 35.83% Top-1 accuracy. Utilizing D-DBPN [45] and SRGAN [39] for image enhancement drastically improves the classification performance, which verifies that image super-resolution is valuable for LR image classification. APEN-VGG further improves the performance, achieving 70.63% Top-1 accuracy and thereby outperforming SRGAN and LR-LR by 1.54% and 2.44% Top-1 accuracy, respectively. This shows the performance bottleneck of retraining a network for LR images, and proves the advantage of our methods. The results of PEN-VGG and PEN-ResNet are not promising, revealing the shortcomings of a single perceptual enhancement


Table 4: Comparison of the accuracy of different LR image classification methods on the Stanford Dogs dataset.

Model               Top-1 (%)  Top-5 (%)  | Model                 Top-1 (%)  Top-5 (%)
HR-HR-VGG           76.92      95.78      | HR-HR-ResNet          83.12      98.02
HR-LR-VGG           35.83      66.59      | HR-LR-ResNet          56.40      86.48
LR-LR-VGG           68.19      91.90      | LR-LR-ResNet          76.62      95.47
PEN-VGG             51.63      82.05      | PEN-ResNet            60.81      87.52
D-DBPN-VGG          66.68      91.22      | D-DBPN-ResNet         71.45      93.09
SRGAN-VGG           69.09      92.81      | SRGAN-ResNet          77.04      95.84
Li et al. [43]      67.51      90.11      | Iandola et al. [20]   57.21      85.71
Wang et al. [47]    67.07      90.37      | Jacobsen et al. [46]  68.29      90.97
APEN-VGG            70.63      93.17      | APEN-ResNet           77.46      95.94

network. APEN-ResNet also achieves the highest classification performance compared with all the state-of-the-art methods. It outperforms the methods of Wang et al. [47] and Jacobsen et al. [46] by 10.39% and 9.17% Top-1 accuracy, respectively, showing the superiority of our approach.

4.3.2. Ablation study

We conducted ablation studies on the Stanford Dogs dataset. The methods used for comparison are identical to those in Section 4.2.2. The experimental results based on VGG and ResNet are provided in Table 5 and Table 6, respectively. A comparison of the no-Perceptual model with APEN indicates that the no-Perceptual model decreases the Top-1 accuracy by 5.0% and 3.0% for VGG and ResNet, respectively. This demonstrates that the perceptual loss plays an important role in our method. The multi-level perceptual loss is clearly superior to the one-level perceptual loss, outperforming the one-Perceptual model by 0.6% in Top-1 accuracy. The effectiveness of our attention mechanism also becomes evident in that APEN-ResNet outperforms the no-Attention model by nearly 1.0% in Top-1 accuracy. The feature rectification strategy is also effective, outperforming the no-Rectification model by approximately 0.19% and

Table 5: Ablation study on the Stanford Dogs dataset using the VGG-16 classification network.

Model              Top-1 (%)   Top-5 (%)
no-MSE             70.56       93.44
no-Perceptual      65.10       90.27
one-Perceptual     70.03       93.05
no-Attention       70.44       93.17
no-Rectification   70.55       93.30

Table 6: Ablation study on the Stanford Dogs dataset using the ResNet-34 classification network.

Model              Top-1 (%)   Top-5 (%)
no-MSE             76.75       95.22
no-Perceptual      74.42       94.51
one-Perceptual     77.43       96.00
no-Attention       76.51       95.34
no-Rectification   77.02       96.01

0.44% in Top-1 accuracy based on the VGG and ResNet classification networks, respectively. Our investigation of the ablations enables us to conclude that each component of our method is important, and demonstrates its effectiveness for fine-grained LR image classification.

Figure 8 visualizes the training curves of the different degraded models. The performance of the no-Perceptual model is poor in every epoch. Although the performance of the no-MSE model is higher than that of APEN based on VGG in early epochs, APEN (the red line) overtakes it after 30 epochs. Figure 8 (c) shows that APEN consistently outperforms the other methods. Figure 8 (d) indicates that the performance of APEN is comparable to that of the one-Perceptual model. This is because optimizing the high-level perceptual loss during fine-grained classification can promote detailed classification. The no-Attention model underperforms APEN in all the curves, confirming the superiority of our attention mechanism.


[Figure 8 panels: (a) Top-1 accuracy on VGG; (b) Top-5 accuracy on VGG; (c) Top-1 accuracy on ResNet; (d) Top-5 accuracy on ResNet. Each panel plots accuracy versus epoch for APEN, one-Perceptual, no-MSE, no-Attention, no-Rectification, and no-Perceptual.]

Figure 8: Training curves of the ablation studies on the Stanford Dogs dataset.


4.4. Experiment on the Food-101 dataset

We conducted experiments on the Food-101 dataset to further demonstrate the effectiveness of our method. The results, listed in Table 7, indicate that our methods again achieve the best performance. LR-LR outperforms SRGAN by more than 2% in Top-1 accuracy, because LR-LR can be adequately trained on the current dataset, whereas SRGAN is a general model. A comparison of D-DBPN with SRGAN reveals the advantage of optimizing the perceptual loss. APEN-VGG outperforms LR-LR-VGG by 0.7% and 0.19% in Top-1 and Top-5 accuracy, respectively. In comparison, APEN-ResNet slightly outperforms LR-LR-ResNet by 0.64% and 0.06% in Top-1 and Top-5 accuracy, respectively. These results show that our method still achieves the best performance on the large-scale Food-101 dataset. The scores of PEN-VGG and PEN-ResNet are over 10% lower than those of our APEN, indicating the drawback of a single enhancement network. Compared with the methods of Wang et al. [47] and Jacobsen et al. [46], the performance of


Table 7: Comparison of the accuracy of different LR image classification methods on the Food-101 dataset.

Model               Top-1 (%)  Top-5 (%)  | Model                 Top-1 (%)  Top-5 (%)
HR-HR-VGG           82.47      95.73      | HR-HR-ResNet          82.25      95.76
HR-LR-VGG           54.78      79.18      | HR-LR-ResNet          63.11      85.98
LR-LR-VGG           76.47      93.12      | LR-LR-ResNet          77.50      93.86
PEN-VGG             61.80      84.12      | PEN-ResNet            63.42      85.79
D-DBPN-VGG          73.60      91.30      | D-DBPN-ResNet         70.33      89.97
SRGAN-VGG           74.07      91.66      | SRGAN-ResNet          74.47      92.14
Li et al. [43]      64.02      85.02      | Iandola et al. [20]   69.52      89.96
Wang et al. [47]    77.04      93.29      | Jacobsen et al. [46]  78.01      93.61
APEN-VGG            77.22      93.43      | APEN-ResNet           78.30      94.06

APEN-ResNet is again superior. Specifically, APEN-ResNet outperforms their methods by 1.26% and 0.29% in Top-1 accuracy, and by 0.77% and 0.45% in Top-5 accuracy, respectively.

4.5. Visual Results

Figure 9 visualizes the intermediate results of our method. The resized results are blurry, especially in certain local details; thus, textural and edge information is difficult to extract with the feature extraction network. Our SR results are clearer than the resized images, and the attention maps successfully locate the key objects. Although applying the attention maps to the SR images yields unnatural-looking results, this is expected, because our training algorithm does not constrain the superficial representation of the attention map. Specifically, instead of a pixel-wise MSE loss, our attention generation network is only optimized by the classification loss at the semantic level. Even though the pixel-wise information (superficial representation) is disrupted, the classification network can still discriminate among attentive SR images, because the attention generation network is trained jointly with the classification network, and the back-propagated gradients enable the attentive results to serve the classification task satisfactorily.

Figure 9: Visualization of LR images, resized images, SR images, attention maps, and attentive SR images.

4.6. Implementation Details

Network setting. The proposed framework is trained specifically for 4× resolution enhancement. The perceptual enhancement network has 16 residual blocks in total, and each residual block has 64-channel feature maps. The hyper-parameter c_a in the attention generation network is set to 64; except for the last 1 × 1 convolutional layer, all of its layers have 64 channels. The numbers of channels of the feature rectification network are 32, 192, 384, and 512. The flattened vector obtained after global average pooling has 512 dimensions.


Training details. We resized the images to 256 × 256 pixels and conducted data augmentation by randomly cropping them to 224 × 224 pixels. We then adopted bicubic 4× downsampling to obtain the corresponding 56 × 56 LR images for training; the 224 × 224 images were viewed as the HR images. The classification network was trained by Stochastic Gradient Descent (SGD) with an initial learning rate of 0.01, decayed by a factor of 0.1 every 20 epochs during training. The attention generation network, the feature rectification network, and the perceptual enhancement network were all optimized by the Adam optimizer [48] with an initial learning rate of 0.0001, decayed by a factor of 0.1 every 25 epochs. A grid search revealed that a relatively larger perceptual-loss weight tends to improve the results, and α = 0.01 and β = 1 give the best performance under most conditions. In APEN-VGG and APEN-ResNet, the batch sizes were set to 16 and 32, respectively. Training APEN-VGG required approximately 12 hours, 13 hours, and 120 hours, and training APEN-ResNet required approximately 12 hours, 13 hours, and 60 hours, on the Caltech-256, Stanford Dogs, and Food-101 datasets, respectively. The average inference times of APEN-VGG and APEN-ResNet on Caltech-256 were 0.012 s and 0.016 s per image, respectively. We additionally measured the inference time of each sub-network, namely the Perceptual Enhancement Network (PEN), the Attention Generation Network (AGN), the Feature Rectification Network (FRN), and the Classification Network (CN); details appear in Table 8. Experiments were performed on a machine with an Intel(R) Xeon(R) CPU E5-2609 v4 @ 1.70 GHz, 12 GB RAM, and NVIDIA Titan Xp GPUs for training and testing.
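For reference, the data preparation and optimizer settings described above could be configured as in the following sketch; dataset handling is omitted, the apen object and its attribute names follow the earlier overview sketch, and grouping the Adam-optimized networks under separate optimizers with identical settings is our own choice.

```python
import torch
from torchvision import transforms
from torchvision.transforms import functional as TF

# HR pipeline: resize to 256x256, then a random 224x224 crop.
hr_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomCrop(224),
    transforms.ToTensor(),
])

def make_lr(hr_tensor):
    # Bicubic 4x downsampling: 224x224 HR crop -> 56x56 LR input.
    return TF.resize(hr_tensor, [56, 56],
                     interpolation=transforms.InterpolationMode.BICUBIC)

def build_optimizers(apen):
    # SGD for the classification network, Adam for the other networks,
    # with the step decays reported above.
    opts = {
        "classifier": torch.optim.SGD(
            list(apen.feature_extractor.parameters()) + list(apen.classifier.parameters()),
            lr=0.01),
        "enhancer": torch.optim.Adam(apen.enhancer.parameters(), lr=1e-4),
        "attention": torch.optim.Adam(apen.attention_net.parameters(), lr=1e-4),
        "rectifier": torch.optim.Adam(apen.rectifier.parameters(), lr=1e-4),
    }
    schedulers = {
        name: torch.optim.lr_scheduler.StepLR(
            opt, step_size=20 if name == "classifier" else 25, gamma=0.1)
        for name, opt in opts.items()
    }
    return opts, schedulers
```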

Table 8: Inference time of different sub-networks (milliseconds).

Model         PEN    AGN    FRN    CN     Total
APEN-VGG      7.15   0.61   0.55   4.00   12.25
APEN-ResNet   7.40   0.61   0.59   8.14   16.69



4.7. Discussion and Future Work

Our work was largely inspired by a previous study [39], which integrated a perceptual loss with adversarial learning to generate realistic SR images from LR images. However, our method differs from theirs in four respects. First, the learning task is different: our work focuses on image classification, and the weights of the classification network are trainable. Second, our method does not employ adversarial learning, because this form of learning disturbs the internal distribution of an image, leading to semantic deviation in classification; according to our observations, adversarial learning indeed results in poor performance and also incurs considerable costs in training time and memory. Third, contrary to the generator of SRGAN, we design an attention generation network, which is independent of the PEN, to highlight SR images for accurate recognition; this network is optimized by the classification loss in an end-to-end manner. Last, we effectively discriminate SR images by introducing a feature rectification strategy that specifically aims to improve the adaptability of the classification network.

However, our super-resolution network is mainly designed for accurate classification, and the generated SR images do not always fulfill human visual expectations. This is mainly caused by the large perceptual-loss weight. In the future, we plan to alleviate this problem to produce more perceptually realistic images. We also consider utilizing the feature super-resolution method [5], which, because it can map an LR image to a high-level feature space, would enable the classification task to benefit from high-level representations. In addition, because existing super-resolution technologies are redundant for certain large-scale images, we aim to focus on local-region super-resolution in the future.


5. Conclusion

In this paper, we propose a novel unified framework, named Attention-aware Perceptual Enhancement Nets (APEN), for low-resolution image classification. Our method adopts end-to-end super-resolution technologies to boost low-resolution classification performance. Specifically, a novel perceptual enhancement network based on deep residual learning is designed to upscale LR images to super-resolved images by refining them and providing additional textural details. An attention generation network is proposed to highlight informative regions of the super-resolved images; this network supervises the perceptual enhancement network and guarantees the quality of super-resolved images, thereby restricting the generation of semantically inconsistent content that would adversely affect the subsequent classification. In addition, a feature rectification strategy is presented to promote the adaptability of the classification network to enhanced images. The optimization algorithm of our framework is carefully designed in an end-to-end fashion. Extensive experiments conducted on three benchmark datasets demonstrate the state-of-the-art performance of our method for both low-resolution and high-resolution image classification.

References

[1] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 2015.
[2] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 2016, pp. 770–778.
[3] Y. Lu, Z. Lai, X. Li, D. Zhang, W. K. Wong, C. Yuan, Learning parts-based and global representation for image classification, IEEE Trans. Circuits Syst. Video Techn. 28 (12) (2018) 3345–3360.
[4] D. Cai, K. Chen, Y. Qian, J. Kämäräinen, Convolutional low-resolution fine-grained classification, Pattern Recognition Letters 119 (2019) 166–171.
[5] W. Tan, B. Yan, B. Bare, Feature super-resolution: Make machine see more clearly, in: Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, 2018, pp. 3994–4002.
[6] B. Na, G. C. Fox, Object detection by a super-resolution method and a convolutional neural networks, in: International Conference on Big Data, Seattle, WA, USA, 2018, pp. 2263–2269.
[7] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, X. Tang, Residual attention network for image classification, in: Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 2017, pp. 6450–6458.
[8] W. W. W. Zou, P. C. Yuen, Very low resolution face recognition problem, IEEE Trans. Image Processing 21 (1) (2012) 327–340.
[9] J. Yu, C. Zhu, J. Zhang, Q. Huang, D. Tao, Spatial pyramid-enhanced netvlad with weighted triplet loss for place recognition, IEEE Transactions on Neural Networks and Learning Systems (2019) 1–14.
[10] J. Yan, M. Zhu, H. Liu, Y. Liu, Visual saliency detection via sparsity pursuit, IEEE Signal Process. Lett. 17 (8) (2010) 739–742.
[11] X. Zhu, J. Liu, J. Wang, C. Li, H. Lu, Sparse representation for robust abnormality detection in crowded scenes, Pattern Recognition 47 (5) (2014) 1791–1799.
[12] C. Hong, J. Yu, J. Zhang, X. Jin, K.-H. Lee, Multimodal face-pose estimation with multitask manifold deep learning, IEEE Transactions on Industrial Informatics 15 (7) (2019) 3952–3961.
[13] J. Yan, C. Li, Y. Li, G. Cao, Adaptive discrete hypergraph matching, IEEE Trans. Cybernetics 48 (2) (2018) 765–779.
[14] J. Yu, M. Tan, H. Zhang, D. Tao, Hierarchical deep click feature prediction for fine-grained image recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence (2019).
[15] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, Commun. ACM 60 (6) (2017) 84–90.
[16] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, in: Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 2016, pp. 2818–2826.
[17] C. Szegedy, S. Ioffe, V. Vanhoucke, A. A. Alemi, Inception-v4, inception-resnet and the impact of residual connections on learning, in: Conference on Artificial Intelligence, AAAI 2017, San Francisco, California, USA, 2017, pp. 4278–4284.
[18] S. Targ, D. Almeida, K. Lyman, Resnet in resnet: Generalizing residual architectures, CoRR abs/1603.08029.
[19] J. Zhang, J. Yu, D. Tao, Local deep-feature alignment for unsupervised dimension reduction, IEEE Transactions on Image Processing 27 (5) (2018) 2420–2432.
[20] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, K. Keutzer, Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1MB model size, CoRR abs/1602.07360.
[21] G. Huang, Z. Liu, L. van der Maaten, K. Q. Weinberger, Densely connected convolutional networks, in: Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 2017, pp. 2261–2269.
[22] R. Zeyde, M. Elad, M. Protter, On single image scale-up using sparse-representations, in: International Conference on Curves and Surfaces, Avignon, France, 2010, pp. 711–730.
[23] G. Freedman, R. Fattal, Image and video upscaling from local self-examples, ACM Trans. Graph. 30 (2) (2011) 12:1–12:11.
[24] K. I. Kim, Y. Kwon, Single-image super-resolution using sparse regression and natural image prior, IEEE Trans. Pattern Anal. Mach. Intell. 32 (6) (2010) 1127–1133.
[25] R. Timofte, V. D. Smet, L. J. V. Gool, Anchored neighborhood regression for fast example-based super-resolution, in: International Conference on Computer Vision, ICCV 2013, Sydney, Australia, 2013, pp. 1920–1927.
[26] C. Dong, C. C. Loy, K. He, X. Tang, Image super-resolution using deep convolutional networks, IEEE Trans. Pattern Anal. Mach. Intell. 38 (2) (2016) 295–307.
[27] Y. LeCun, Y. Bengio, G. E. Hinton, Deep learning, Nature 521 (7553) (2015) 436–444.
[28] J. Kim, J. K. Lee, K. M. Lee, Accurate image super-resolution using very deep convolutional networks, in: Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 2016, pp. 1646–1654.
[29] X. Mao, C. Shen, Y. Yang, Image restoration using convolutional auto-encoders with symmetric skip connections, CoRR abs/1606.08921.
[30] T. Tong, G. Li, X. Liu, Q. Gao, Image super-resolution using dense skip connections, in: International Conference on Computer Vision, ICCV 2017, Venice, Italy, 2017, pp. 4809–4817.
[31] Y. Tai, J. Yang, X. Liu, Image super-resolution via deep recursive residual network, in: Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 2017, pp. 2790–2798.
[32] J. Li, F. Fang, K. Mei, G. Zhang, Multi-scale residual network for image super-resolution, in: European Conference on Computer Vision, ECCV 2018, Munich, Germany, 2018, pp. 527–542.
[33] J. Kim, J. K. Lee, K. M. Lee, Deeply-recursive convolutional network for image super-resolution, in: Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 2016, pp. 1637–1645.
[34] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, 2018, pp. 7132–7141.
[35] H. Hu, J. Gu, Z. Zhang, J. Dai, Y. Wei, Relation networks for object detection, in: Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, 2018, pp. 3588–3597.
[36] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, Y. Fu, Image super-resolution using very deep residual channel attention networks, in: European Conference on Computer Vision, ECCV 2018, Munich, Germany, 2018, pp. 294–310.
[37] P. Ramachandran, B. Zoph, Q. V. Le, Searching for activation functions, in: International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, 2018.
[38] W. Shi, J. Caballero, F. Huszar, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, Z. Wang, Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network, in: Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 2016, pp. 1874–1883.
[39] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, W. Shi, Photo-realistic single image super-resolution using a generative adversarial network, in: Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 2017, pp. 105–114.
[40] G. Griffin, A. Holub, P. Perona, Caltech-256 object category dataset.
[41] A. Khosla, N. Jayadevaprakash, B. Yao, F.-F. Li, Novel dataset for fine-grained image categorization: Stanford dogs, in: Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2011, 2011, pp. 1–2.
[42] L. Bossard, M. Guillaumin, L. Van Gool, Food-101 – mining discriminative components with random forests, in: European Conference on Computer Vision, ECCV 2014, 2014, pp. 446–461.
[43] Z. Li, X. Zhu, L. Wang, P. Guo, Image classification using convolutional neural networks and kernel extreme learning machines, in: International Conference on Image Processing, ICIP 2018, Athens, Greece, 2018, pp. 3009–3013.
[44] X. Zhu, Z. Li, X. Zhang, H. Li, Z. Xue, L. Wang, Generative adversarial image super-resolution through deep dense skip connections, Comput. Graph. Forum 37 (7) (2018) 289–300.
[45] M. Haris, G. Shakhnarovich, N. Ukita, Deep back-projection networks for super-resolution, in: Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, 2018, pp. 1664–1673.
[46] J. Jacobsen, A. W. M. Smeulders, E. Oyallon, i-revnet: Deep invertible networks, in: International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, 2018.
[47] Y. Wang, V. I. Morariu, L. S. Davis, Learning a discriminative filter bank within a CNN for fine-grained recognition, in: Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, 2018, pp. 4148–4157.
[48] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 2015.


Author Contributions: Xiaobin Zhu and Zhuangzi Li contributed the central idea, analysed most of the data, and wrote the initial draft of the paper. The remaining authors contributed to refining the ideas, carrying out additional analyses and finalizing this paper.

Declaration of interests The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. ☐The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: