Highlights

• We propose a progressive patch localization module (PPL) that addresses the problem that the selected patches with lower ranks are very likely to contain noisy information, while guaranteeing a diversity of fine-grained features.
• A feature calibration module (FCM) is proposed to calibrate patch-level features, strengthening their discriminative information and suppressing useless information by employing global information, which further benefits the final classification performance.
• We evaluate our method on three challenging datasets (CUB, Cars and Aircraft), and achieve state-of-the-art results on all of them.


Progressive Learning for Weakly Supervised Fine-grained Classification

Tiantian Yan^a, Shijie Wang^a, Zhihui Wang^b,*, Haojie Li^b, Zhongxuan Luo^a

^a School of Software Technology, Dalian University of Technology, Dalian, China
^b International School of Information Science & Engineering, Dalian University of Technology, China

Abstract

Although fine-grained image classification has made considerable progress, it remains a challenging task due to the difficulty of finding subtle distinctions. Most existing methods address this problem by selecting the top-N highest-scoring discriminative patches from candidate patches at one time. However, since the classification network often highlights small and sparse regions, the selected patches with lower ranks may contain noisy information. To address this problem and ensure the diversity of fine-grained features, we propose a progressive patch localization module (PPL) to find the discriminative patches more accurately. Specifically, this work employs the classification model to find the most discriminative patch first, then removes the most salient region to help localize the next most discriminative patch; the top-K discriminative patches can be found by repeating this procedure. In addition, to further improve the representational power of patch-level features, we propose a feature calibration module (FCM). This module employs global information to selectively emphasize discriminative features and suppress useless information, which yields more robust and discriminative local feature representations and in turn helps the classification network achieve better performance. Extensive experiments show the substantial improvements of our method on three benchmark datasets.

* Corresponding author. Email address: [email protected] (Zhihui Wang)

Keywords: Fine-grained image classification, Progressive patch localization module, Feature calibration module

1. Introduction

Figure 1: Illustration of challenges in fine-grained image classification. Large differences within the same subordinate class are shown in the first row, and small distinctions among different subordinate classes are shown in the second row. The images in (a) Birds and (b) Cars are from CUB-200-2011 [1] and Cars-196 [2], respectively.

Fine-grained image classification has been a hot research topic in computer vision, pattern recognition and related fields [3], [4] in recent years, because it has huge application requirements in both academia and industry. The goal of fine-grained image classification is to divide coarse-grained categories into subcategories, such as hundreds of subordinate classes of birds [1], automobiles [2], etc. As shown in Figure 1(a), fine-grained image classification needs to classify the birds in the second row as the subcategories "California Gull", "Glaucous-winged Gull" and "Western Gull", respectively. Fine-grained image classification remains challenging because of large intra-class differences and inter-class similarities. In the first row of Figure 1(a), the three images belonging to the same subcategory are affected by pose, shooting angle and growth period, resulting in large visual differences among them. In the second row of Figure 1(a), the three images belonging to different subcategories have similar global features (such as gray wings and white bellies), so they can only be distinguished with the aid of subtle local distinctions in tails, legs, etc. Therefore, how to accurately find and effectively use local discriminative information is the key to the success of fine-grained image classification.

Figure 2: The motivation for our patch localization module. (a): The lower-ranked patches selected by anchor-based methods contain non-discriminative information, such as background information. (b): The proposed progressive patch localization module.

Some previous works [5, 6, 7, 8, 9] address this problem by making use of fine-grained annotations, like annotations for bird parts in bird classification.


However, human-defined parts may not be optimal for fine-grained image classification, since they depend entirely on the annotator's cognitive level. In addition, fine-grained annotations not only require massive manual labor but also lack practicality and scalability, because much of the actual data is not annotated [10]. Therefore, recent works mainly focus on weakly supervised frameworks that only use image-level labels.

The existing weakly supervised methods [11, 12, 13, 14, 15] search for discriminative patches based on candidate anchors. He et al. [11] design a discriminative localization method based on Faster R-CNN [16] to simultaneously localize discriminative patches under the guidance of saliency information and extract discriminative features. This work only employs one level of attention, whereas different levels of attention describe different visual features, carrying multi-grained and multi-scale information. A weakly supervised discriminative localization method [12] extracts multiple different discriminative regions with an n-pathway localization module whose supervision is provided by a multi-level attention extraction network. Yao et al. [14] propose a graph analysis algorithm to estimate object patches and select the distinctive local patches. These methods [11, 12, 14] ignore the spatial relationships among patches and therefore have difficulty capturing diverse fine-grained features.

Peng et al. [13] and He et al. [15] design an object-part spatial constraint module to solve this problem by selecting the top-N patches as discriminative patches according to their scores. Although these methods [13], [15] consider the spatial constraint between patches to guarantee that the selected patches contain diverse features, the selected patches with lower ranks may contain very little useful information and a lot of noise (as shown in Figure 2(a)). Our proposed progressive patch localization module (PPL) can effectively solve both problems. First, we employ a classification network to locate the first discriminative patch. Next, the most salient region is removed from the original image, that is, the most salient region is set to zero. To remedy the resulting drop in classification performance, the classification network then finds the next discriminative patch. Finally, the top-K discriminative patches can be extracted by repeating this procedure (as shown in Figure 2(b)).

After the discriminative patches are found, the patches (sharing the same category label as the corresponding original image) and the original image are used to train the fine-grained classification network. To obtain the final classification result for each image, some previous methods [17, 13] adopt the average or weighted sum of the predicted scores of the discriminative patches and the original image. Other methods [18, 14] adopt the concatenation of patch-level features and the image-level feature as the final feature descriptor for classification prediction. In order to further improve the representational power of patch-level features, we propose a feature calibration module (FCM). In this module, we employ the global information of the image-level feature to selectively emphasize discriminative information in the patch-level features and suppress less useful information. In other words, each element of a patch-level feature is weighted by the global information, which increases the useful feature values and decreases the useless ones. In addition, considering the complementarity between patch-level features of different scales, we fuse local features of different scales to obtain a more powerful and sufficient representation of discriminative features. We feed the concatenation of the calibrated patch-level features and the image-level feature into a fully connected layer to obtain the final prediction result.

To summarize, the contributions are as follows:

• We propose a progressive patch localization module (PPL) to pinpoint the discriminative patches for the fine-grained image classification network. This module solves the problem that the selected patches with lower ranks very likely contain noisy information, while guaranteeing a diversity of fine-grained features.
• A feature calibration module (FCM) is proposed to calibrate patch-level features, strengthening their discriminative information and suppressing useless information by employing global information, which further benefits the final classification performance.
• We evaluate our method on three challenging datasets (CUB-200-2011 [1], Stanford Cars [2] and FGVC-Aircraft [19]), and achieve state-of-the-art results on all of these datasets.

The rest of this paper is organized as follows: Section 2 briefly reviews related works on patch localization and feature aggregation. Section 3 presents our proposed method in detail, and Section 4 introduces the experimental results as well as ablation analyses. Finally, Section 5 concludes this work.

2. Related Works

The part-based methods for fine-grained image classification can be summarized into three categories [20]: ensemble of networks based methods, attention based methods and candidate patch based methods.

2.1. Ensemble of Networks Based Methods

Ensemble of networks based methods employ neural networks to learn the representation of discriminative features for fine-grained image classification. Lin et al. [21] propose a bilinear model which contains two CNNs whose outputs are multiplied using the outer product and pooled to construct a discriminative image descriptor. Considering that the previous methods [21], [22] lack spatial invariance to the input data, Jaderberg et al. [23] propose a spatial transformer module that captures the discriminative parts with 2 or 4 parallel spatial transformers. The concatenation of the part representations is fed into a classifier for the final prediction. However, these methods ignore the spatial relation information between parts. Qi et al. [24] propose a part selection module that picks out part pairs with high discriminative ability by utilizing the spatial relation information between parts, and then constructs a discriminative image representation from the interaction between parts. He et al. [25] use deep reinforcement learning to hierarchically find discriminative patches at different granularities and adaptively determine how many patches to extract.

2.2. Attention Based Methods

Some recent works [26, 27, 18, 28] are based on attention to extract the discriminative patches. Zhang et al. [26] learn a set of part detectors by alternately iterating between new positive sample mining and part model retraining, finding filters that have significant and consistent responses to specific parts. Fu et al. [27] recursively learn more fine-grained parts and multi-scale representations of part features in a mutually reinforcing manner. Zheng et al. [18] generate multiple patches with a channel grouping module which applies a series of clustering, weighting, and pooling operations to spatially-correlated channels. Sun et al. [28] extract multiple parts of different objects with a one-squeeze multi-excitation module and pull positive features closer to the anchor with a multi-attention multi-class constraint loss. Zheng et al. [29] learn fine-grained details from hundreds of detail-preserved images which are generated by an attention-based sampler.

2.3. Candidate Patch Based Methods

Other existing works [11, 12, 13, 15, 14] are based on candidate anchors to pick out the discriminative patches. He et al. [11] take the bounding box generated by a saliency-guided localization learning strategy as pseudo ground truth, and then use an object detection framework to simultaneously predict the discriminative patches and extract the discriminative features. He et al. [12] further explore a multi-level attention extraction network whose output helps the n-pathway localization module extract multiple different discriminative patches. Yao et al. [14] extract hypothetical bounding boxes by applying different binary thresholds to the saliency map, then calculate the overlap score between each candidate patch and the hypothetical bounding box and retain the candidate patches with the top-10 scores for each threshold. They then propose a co-localization algorithm to further select the five most accurate object-level patches, and they select the part-level patches with the highest score under each threshold. However, the above methods overlook the spatial relationships between an object-level patch and its part-level patches as well as among part-level patches, resulting in a limited ability to capture diverse fine-grained features.

To make up for this deficiency, Peng et al. [13] put forward an object-part spatial constraint module to extract the top-N patches. This module consists of an object spatial constraint and a part spatial constraint: the object spatial constraint makes the selected patches have high representation ability, and the part spatial constraint eliminates redundancy between the selected patches. In accordance with the scores of the candidate patches, the top-N patches are selected as discriminative patches, and then the cluster patterns of the neural network are utilized to align patches with the same semantic information, improving the classification performance. It is worth noting that the saliency map usually highlights small and sparse regions, which makes the selected patches with lower ranks contain a lot of noise and fewer useful features. In this case, we propose the PPL to progressively find the discriminative patches. After finding the first discriminative patch with the classification network, we remove the most salient region from the original image to help localize the next most discriminative patch. Finally, the top-K discriminative patches can be extracted for learning local discriminative features. In this way, we can find the discriminative patches more accurately and reduce the overlap between selected patches. In addition, the learned local discriminative features are made more representative and robust by the feature calibration module, which emphasizes the discriminative information of patch-level features under the guidance of the global information of image-level features.

3. Proposed Method

In this section, we present the progressive patch localization module (PPL) and the feature calibration module (FCM) for our fine-grained image classification system. As shown in Figure 3, the framework of our method is composed of two modules. The first module, PPL, aims to progressively pinpoint the discriminative patches in the original images. To utilize the comprehensive information of the image, this module takes multi-scale images as inputs. The second module, the fine-grained image classification network with FCM, aims to extract the features of images as well as patches and improve the representation of patch-level discriminative features.

Figure 3: Overview of our framework, which consists of two modules: the progressive patch localization module (PPL) and the fine-grained image classification network with the feature calibration module (FCM). The PPL takes multi-scale images as input and finds the first discriminative patch with the classification network; the image with the most salient region removed is then fed into the classification network to extract the next discriminative patch. The top-K discriminative patches can be extracted by repeating this procedure. Under the guidance of the global information, the FCM performs a feature calibration operation on the patch features selected by the PPL, that is, it strengthens the discriminative information in the local features while suppressing invalid information.

3.1. Progressive Patch Localization Module

In this module, motivated by AE [30], the training image set is denoted as I = {(I_i, y_i)}_{i=0}^{N-1}, where y_i is the label of image I_i and N is the number of images.

Step 1: We first fine-tune the classification network on the fine-grained dataset, then use the fine-tuned model together with the gradients flowing back [31] to localize the discriminative patches and extract the attention mask. In this way we can find the first most discriminative patch, which contains the most discriminative information. The attention map L^c_{first} is computed as

L^c_{first} = G(I) = \mathrm{ReLU}\Big( \sum_h a_c^h * M^h \Big),    (1)

where G(·) denotes the attention-map generation operation and a_c^h is the neuron importance weight obtained by global average pooling of the gradients of y^c (the predicted score for class c) with respect to the feature map M^h:

a_c^h = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial M^h_{i,j}}.    (2)
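For illustration, the attention map of Eqs. (1)-(2) can be computed in the Grad-CAM [31] style sketched below. This is a minimal PyTorch sketch under our own assumptions: `backbone` and `classifier` are hypothetical handles to the convolutional feature extractor and the classification head of the fine-tuned network, not the authors' actual interfaces.

```python
import torch
import torch.nn.functional as F

def attention_map(backbone, classifier, image, target_class):
    """Sketch of Eqs. (1)-(2): Grad-CAM style attention map for `target_class`.

    `backbone` maps an image (1, 3, S, S) to feature maps (1, C, h, w);
    `classifier` maps pooled features (1, C) to class scores. Both names are
    assumptions made for illustration.
    """
    feats = backbone(image)                                 # M^h: (1, C, h, w)
    feats.retain_grad()                                     # keep gradients on this non-leaf tensor
    scores = classifier(F.adaptive_avg_pool2d(feats, 1).flatten(1))
    scores[0, target_class].backward()                      # gradients of y^c flow back (Eq. 2)
    weights = feats.grad.mean(dim=(2, 3), keepdim=True)     # a_c^h: global average pooling of gradients
    cam = F.relu((weights * feats).sum(dim=1))              # Eq. (1): ReLU of the weighted sum over h
    return cam.detach()                                     # (1, h, w)
```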

Step 2: We binarize the attention map L^c_{first}, shown in the second row of Figure 4, to obtain the attention mask L_1: pixels of the attention map larger than a threshold δ are set to 1 and the rest are set to 0. We extract the smallest bounding box covering the largest connected region of the attention mask as the first discriminative patch.

Step 3: We take the original image I multiplied by the reversed attention mask \hat{L}_1 = 1 - L_1 as input to obtain the next attention map L_{next} (shown in the third row of Figure 4), which highlights the next discriminative region:

L_{next} = G(I · \hat{L}_1).    (3)

The attention map L_{next} is binarized to obtain the attention mask L_2, and then we can extract the next discriminative patch.

Step 4: Repeating Step 3, multiple effective discriminative patches can be found, until the discriminative information contained in the newly extracted patch is insufficient to improve the performance of the fine-grained image classification network.

For the sake of comprehensive information, we apply multi-scale images (resized to 224 × 224 and 448 × 448) to find discriminative patches. To facilitate the follow-up work, we select the same number of patches for images of different scales. We use R_i = {R_i^1, R_i^2, ..., R_i^K, ..., R_i^{2K}} to denote all discriminative patches, where the first K patches are from the image at the small scale and the last K patches are from the same image at the large scale.

Figure 4: Visualization of attention maps of images at scale 448 × 448 from different stages of the PPL module. The first row shows the original images, the second row the first discriminative regions, and the third row the second discriminative regions.
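Building on the previous sketch, the erase-and-relocalize loop of Steps 1-4 can be outlined as follows. This is again a simplified illustration under stated assumptions: it reuses the hypothetical `attention_map` helper defined above, and it takes a bounding box over all pixels above the threshold rather than over the largest connected region described in Step 2.

```python
import torch
import torch.nn.functional as F

def progressive_patches(backbone, classifier, image, target_class, K=2, delta=0.7):
    """Sketch of the PPL loop for one input scale.

    Returns K bounding boxes (x1, y1, x2, y2) found progressively on `image`
    of shape (1, 3, S, S).
    """
    masked = image.clone()
    boxes = []
    for _ in range(K):
        cam = attention_map(backbone, classifier, masked, target_class)
        cam = F.interpolate(cam.unsqueeze(1), size=image.shape[-2:],
                            mode='bilinear', align_corners=False)[0, 0]
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalise to [0, 1]
        mask = cam > delta                                         # binarisation with threshold delta (Step 2)
        ys, xs = torch.nonzero(mask, as_tuple=True)                # simplification: box over all active pixels
        boxes.append((xs.min().item(), ys.min().item(),
                      xs.max().item(), ys.max().item()))
        masked = masked * (~mask).float()                          # erase the salient region before the next pass (Eq. 3)
    return boxes
```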

3.2. Fine-grained Classification Network with Feature Calibration Module

To emphasize the discriminative information of the patch-level features and suppress their noise, we design a feature calibration module. First, we need to extract the patch-level and image-level features. We use the top-K patches of each scale combined with the original images to train the classification network, which generates K patch-level feature vectors per scale, P^k = [p_1^k, p_2^k, ..., p_H^k] (k ∈ {1, ..., K}), and one corresponding image-level feature P^0 = [p_1^0, p_2^0, ..., p_H^0], each of length H (H = 2048 if ResNet50 is adopted). In this module, the classification network can directly load the fine-tuned parameters of the baseline of the patch localization module, because it adopts the same CNN as the localization module. These patch-level and image-level features are fed into a fully connected layer with a softmax function, and their labels are consistent with those of the corresponding images. We adopt the cross-entropy loss as the classification loss:

L_{image} = -\sum_{i=0}^{N-1} y_i \log(T(P_i^0)),    (4)

L_{patch_l} = -\sum_{i=0}^{N-1} y_i \log(T(P_i^l)),    (5)

where T denotes the fully connected layer with the softmax function, L_{image} denotes the image-level classification loss and L_{patch_l} (l = 1, ..., 2K) denotes the patch-level classification losses.
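In code, Eqs. (4) and (5) are plain cross-entropy terms on the image-level and patch-level predictions of the shared network. A minimal sketch is given below; the variable and function names are ours, and note that PyTorch's cross-entropy averages over the batch rather than summing over all images as written in the equations.

```python
import torch
import torch.nn.functional as F

def classification_losses(image_logits, patch_logits, labels):
    """Eqs. (4)-(5): cross-entropy on the image-level and the 2K patch-level predictions.

    image_logits: (B, num_classes); patch_logits: list of 2K tensors (B, num_classes);
    labels: (B,) image-level labels, shared by the patches of each image.
    """
    loss_image = F.cross_entropy(image_logits, labels)                 # Eq. (4)
    loss_patches = [F.cross_entropy(p, labels) for p in patch_logits]  # Eq. (5), one term per patch l
    return loss_image, loss_patches
```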

In order to further improve the discriminability of the patch-level features by exploiting the global information of the image-level feature and the information of different scales, we propose a feature calibration module. In this module, each patch-level feature is multiplied element-wise by the image-level feature after a sigmoid function, giving the processed patch-level features Q^l as shown in Eq. (6) and Eq. (7). The calibrated patch-level features O^k = [o_1^k, o_2^k, ..., o_H^k] are then generated by adding the corresponding elements of the processed patch-level features of the two scales, as shown in Eq. (8):

Q^0 = [q_1^0, q_2^0, ..., q_H^0]^T = \mathrm{sigmoid}([p_1^0, p_2^0, ..., p_H^0]^T),    (6)

Q^l = [q_1^l, q_2^l, ..., q_H^l]^T = Q^0 * [p_1^l, p_2^l, ..., p_H^l]^T = [q_1^0 p_1^l, q_2^0 p_2^l, ..., q_H^0 p_H^l]^T, \quad (l = 1, ..., K, ..., 2K),    (7)

O^k = Q^k + Q^{k+K} = [q_1^k + q_1^{k+K}, q_2^k + q_2^{k+K}, ..., q_H^k + q_H^{k+K}]^T, \quad (k = 1, ..., K).    (8)

The sigmoid function aims to suppress the irrelevant information of the image-level feature, and the image-level feature after the sigmoid function provides the feature-importance weights for the patch-level features. Moreover, the multiplication operation emphasizes the discriminative information of the patch-level features and suppresses less useful information. In view of the complementary information contained in the patch-level features of different scales, the addition operation is adopted to further enrich the discriminative information of the patch-level features. In the end, we concatenate the image-level feature with the K calibrated patch-level features for fine-grained image classification:

g = T'(P^0, O^1, ..., O^K),    (9)

where T' is a fully connected layer with a softmax function whose input dimension is (K + 1) * H. We employ the cross-entropy loss as the joint classification loss:

L_{con} = -\sum_{i=0}^{N-1} y_i \log(g_i).    (10)

Thus, the overall loss is

L = L_{image} + \lambda \sum_{l=1}^{2K} L_{patch_l} + \mu L_{con},    (11)

where λ and μ are hyper-parameters. In our experiments, we set λ = μ = 1.
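As a reference reading of Eqs. (6)-(11), the sketch below reimplements the calibration, the joint prediction and the overall loss in PyTorch. It assumes K patches per scale and H = 2048 ResNet50 features; the module and function names are ours, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureCalibration(nn.Module):
    """Feature calibration module (FCM): Eqs. (6)-(9)."""

    def __init__(self, K=2, H=2048, num_classes=200):
        super().__init__()
        self.K = K
        self.joint_fc = nn.Linear((K + 1) * H, num_classes)    # T' with input size (K+1)*H

    def forward(self, img_feat, patch_feats):
        # img_feat: (B, H); patch_feats: list of 2K tensors (B, H),
        # the first K from the small scale, the last K from the large scale.
        gate = torch.sigmoid(img_feat)                          # Eq. (6)
        q = [gate * p for p in patch_feats]                     # Eq. (7): element-wise re-weighting
        o = [q[k] + q[k + self.K] for k in range(self.K)]       # Eq. (8): fuse the two scales
        return self.joint_fc(torch.cat([img_feat] + o, dim=1))  # Eq. (9): joint prediction g

def overall_loss(image_logits, patch_logits, joint_logits, labels, lam=1.0, mu=1.0):
    """Eq. (11) with lambda = mu = 1, using batch-mean cross-entropy terms."""
    l_image = F.cross_entropy(image_logits, labels)
    l_patch = sum(F.cross_entropy(p, labels) for p in patch_logits)
    l_con = F.cross_entropy(joint_logits, labels)               # Eq. (10)
    return l_image + lam * l_patch + mu * l_con
```

With K = 2 and H = 2048, the joint classifier T' takes a (K + 1) * H = 6144-dimensional input, matching the dimension stated above.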

4. Experiments

4.1. Datasets and Evaluation Metric

We evaluate our algorithm on CUB-200-2011 [1], Stanford Cars [2] and FGVC-Aircraft [19], which are widely used benchmarks for fine-grained image classification. We only use image class labels in our experiments, and compare our method with other state-of-the-art approaches to prove its effectiveness. The three datasets are described below:

• CUB-200-2011 [1]: It is the most representative dataset and includes 11788 images of 200 different subclasses, split into 5994 training images and 5794 testing images. There are about 30 images per subclass in the training set and 11 ∼ 30 images per subclass in the testing set.

• Stanford Cars [2]: It has 16185 images of 196 subclasses, divided into 8144 training images and 8041 testing images. There are about 24 ∼ 84 images per subclass in the training set and 24 ∼ 83 images per subclass in the testing set.

• FGVC-Aircraft [19]: It contains 10000 images of 100 different subclasses, divided into 6667 training images and 3333 testing images. There are about 66 ∼ 67 images per subclass in the training set and 33 ∼ 34 images per subclass in the testing set.

We adopt top-1 accuracy as the evaluation metric, which is the most commonly used evaluation measure for fine-grained image classification and is defined as

accuracy = \frac{R_{correct}}{R},    (12)

where R represents the total number of testing images and R_{correct} is the number of images predicted to be the correct category.

4.2. Implementation Details

Our experiments adopt ResNet50 [32] as the baseline CNN. This architecture has been used for many other tasks [33], [34], [35] with good results. Note that it can be replaced by other CNNs, and the PPL and the classification module employ the same CNN architecture. The PPL and the classification network with FCM are trained separately. The PPL takes images at scales 224 × 224 and 448 × 448 as input. In the testing phase, both the discriminative patches and the original images are fed into the network to extract the discriminative features, and the robust feature representations obtained after FCM are used for the final classification. To keep the training and testing stages consistent, we use the predicted category probability c to pick out the attention map. We resize and normalize the selected patches and images to 448 × 448. All experiments are based on the open-source toolbox PyTorch. The classification module is trained with a batch size of 16 using momentum SGD with an initial learning rate of 0.001, which is multiplied by 0.1 after 60 epochs, and a weight decay of 1e-4. The thresholds of the localization module are set to 0.7, 0.6 and 0.6, respectively.
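For reference, the reported optimisation settings translate into the following minimal PyTorch sketch; the momentum coefficient of 0.9 is our assumption, since only "momentum SGD" is stated, and `model` stands for the ResNet50-based classification module.

```python
import torch

def build_optimizer(model):
    """Optimiser and schedule matching the reported settings; momentum 0.9 is assumed."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                                momentum=0.9, weight_decay=1e-4)
    # learning rate multiplied by 0.1 after 60 epochs
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=60, gamma=0.1)
    return optimizer, scheduler
```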

4.3. Quantitative Results

In this subsection, we compare the experimental results of our method with existing methods on the CUB-200-2011, Stanford Cars and FGVC-Aircraft datasets, as shown in Table 1, Table 2 and Table 3, respectively. The compared methods can be divided into four groups: (1) supervised methods, (2) ensemble of networks based methods, (3) attention based methods and (4) candidate patch based methods.

Table 1 displays the comparison of results on the CUB-200-2011 dataset. As we can see, our method achieves the best accuracy of 88.3% among all the methods. Our method exceeds the best compared result, TASN [29], by 0.4%; TASN learns fine-grained knowledge from hundreds of detail-preserved images.

The detail-preserved images are generated by re-sampling the original images under the guidance of a trilinear attention map. Our method is also superior to other methods that extract discriminative patches based on attention maps. Our method performs better than PA-CNN [36], which introduces a part rectification mechanism to ensure highly accurate parts; compared with PA-CNN, our method employs FCM to selectively strengthen discriminative features and suppress less useful ones under the guidance of global information, obtaining more discriminative features. Our method obtains 2.1% higher accuracy than MAMC [28], which learns the correlation of features from multiple attention regions to benefit the fine-grained image classification task.

Our method also outperforms M2DRL [25], StackDRL [37] and DT-RAM [38], which are based on reinforcement learning. Moreover, our method outperforms other approaches that focus on the CNN architecture itself, such as DFL [39] and ESR [24]. DFL [39] designs an asymmetric multi-stream architecture to learn a group of convolutional filters that respond to a certain discriminative region in the original image and achieves an accuracy of 87.4%. ESR [24] picks out patch pairs by utilizing the spatial distance between the spatial coordinates of the corresponding activation vectors on the feature map, and constructs a discriminative image representation on the basis of the interaction between patches. The accuracy of ESR is 85.5%, which is lower than our method by 2.8%. These comparisons show that our method can find the discriminative patches more accurately, which helps the classification network learn subtle and robust discriminative features.

The accuracy of our method is higher than OPAM [13] by 2.5%, and our method is superior to the other candidate patch based methods, which demonstrates

that our PPL can better find the patches containing more discriminative information. Compared with NTS [40], our method brings a 0.8% accuracy improvement. This result indicates that the discriminability of the recalibrated features is enhanced by FCM.

Furthermore, we also compare our method with works using annotations, such as T-CNN [41] and Coarse-to-Fine [42]; our method, with only image-level annotations, outperforms all the methods that use either part annotations or object annotations. These results demonstrate that discriminative patches automatically mined via progressive learning under image-level supervision are superior to human-designed patches.

In addition, the comparison results on the Stanford Cars and FGVC-Aircraft datasets are shown in Tables 2 and 3, respectively. Our method beats most state-of-the-art methods, achieving accuracies of 94.0% on Stanford Cars and 92.6% on FGVC-Aircraft. These results indicate that the proposed progressive learning can not only mine discriminative patches of birds but also performs well in patch localization for cars and aircraft.

4.4. Ablation Experiments

To analyse the effectiveness of each component in our framework, we design the following experiments. In this work, we adopt ResNet50 as the baseline for all ablation studies, and the baseline achieves 84.5% accuracy. Ablation experiments are carried out on the CUB-200-2011, Stanford Cars and FGVC-Aircraft datasets from the following two aspects:

Table 1: Comparison results on the CUB-200-2011 dataset

Methods               | Baseline       | Anno. | Accuracy (%)
Part-RCNN [9]         | AlexNet        | Parts | 76.4
Webly-supervised [43] | AlexNet        | Parts | 78.6
PG Alignment [44]     | VGG-19         | Bboxs | 82.8
Coarse-to-Fine [42]   | VGG-19         | Bboxs | 82.9
T-CNN [41]            | AlexNet+ResNet | Bboxs | 87.3
Bilinear-CNN [21]     | VGGNet         | n/a   | 84.1
ST-CNN [23]           | InceptionNet   | n/a   | 84.1
ESR [24]              | VGGNet         | n/a   | 85.5
DT-RAM [38]           | ResNet50       | n/a   | 86.0
StackDRL [37]         | VGG-16         | n/a   | 86.6
M2DRL [25]            | VGG-19         | n/a   | 87.2
DFL [39]              | ResNet50       | n/a   | 87.4
SCDA [45]             | VGG-16         | n/a   | 80.5
RA-CNN [27]           | VGG-19         | n/a   | 85.3
MAMC [28]             | ResNet50       | n/a   | 86.2
MA-CNN [18]           | VGG-19         | n/a   | 86.5
PA-CNN [36]           | VGG-19         | n/a   | 87.8
TASN [29]             | ResNet50       | n/a   | 87.9
AutoBD [14]           | VGG-19         | n/a   | 81.6
TSC [15]              | VGG-16         | n/a   | 84.7
WSDL [12]             | VGG-16         | n/a   | 85.7
OPAM [13]             | VGG-16         | n/a   | 85.8
NTS [40]              | ResNet50       | n/a   | 87.5
Our Method            | ResNet50       | n/a   | 88.3

1) The effectiveness of the progressive patch localization module

Table 2: Comparison results on the Stanford Cars dataset

Methods           | Baseline | Anno. | Accuracy (%)
PG Alignment [44] | VGG-19   | Bboxs | 92.8
Bilinear-CNN [21] | VGGNet   | n/a   | 91.3
DFL [39]          | ResNet50 | n/a   | 93.1
DT-RAM [38]       | ResNet50 | n/a   | 93.1
M2DRL [25]        | VGG-19   | n/a   | 93.2
TASN [29]         | ResNet50 | n/a   | 93.8
SCDA [45]         | VGG-16   | n/a   | 85.9
RA-CNN [27]       | VGG-19   | n/a   | 92.5
MA-CNN [18]       | VGG-19   | n/a   | 92.8
MAMC [28]         | ResNet50 | n/a   | 92.8
PA-CNN [36]       | VGG-19   | n/a   | 93.3
AutoBD [14]       | VGG-19   | n/a   | 88.9
OPAM [13]         | VGG-16   | n/a   | 92.2
WSDL [12]         | VGG-16   | n/a   | 92.3
NTS [40]          | ResNet50 | n/a   | 93.9
Our method        | ResNet50 | n/a   | 94.0

The purpose of extracting discriminative patches with the PPL is to learn sufficient discriminative information for the fine-grained image classification network, so as to improve its classification performance. Compared with the accuracy of the baseline, our PPL brings a 3.0% improvement on the CUB-200-2011 dataset, as shown in Table 4. This result means that our PPL can effectively find the discriminative patches, and some discriminative patches are presented in Figure 5. We can observe that the patches from the large-scale image are more concerned with detailed information, while the patches from the small-scale image focus on more general information. This shows that the discriminative information of patches at different scales is complementary.

Table 3: Comparison results on the FGVC-Aircraft dataset

Methods           | Baseline | Anno. | Accuracy (%)
Bilinear-CNN [21] | VGGNet   | n/a   | 84.1
ESR [24]          | VGGNet   | n/a   | 86.9
DFL [39]          | ResNet50 | n/a   | 91.7
SCDA [45]         | VGG-16   | n/a   | 79.5
RA-CNN [27]       | VGG-19   | n/a   | 88.2
MA-CNN [18]       | VGG-19   | n/a   | 89.9
PA-CNN [36]       | VGG-19   | n/a   | 91.0
NTS [40]          | ResNet50 | n/a   | 91.4
Our method        | ResNet50 | n/a   | 92.6

As to the number of selected patches, we can see from Table 5 that the accuracy increases from 84.5% to 86.7% when two patches are extracted from the large-scale image (448 × 448) and used in training the classification network. When three patches participate in training, there is no obvious improvement, or even a slight decrease, in accuracy. Since most of the discriminative information is included in the first two patches, the remaining regions in the image are not informative enough for the network to classify; in this case, the classification network must rely on background regions to identify the categories of images, which is usually harmful to fine-grained image classification. Thus we only select the first two patches of each scale for the subsequent fine-grained image classification module. In addition, the first two patches from the 224 × 224 image lift the accuracy from 84.5% to 86.2%, and adding a third patch from the 224 × 224 image to the training does not help improve the performance of the classification network. When two patches of each scale are used to train the classification network, the accuracy reaches 87.5%. Therefore, we select the first two patches for each scale on the CUB-200-2011 dataset.

However, because objects occupy a large proportion of the whole image in the Stanford Cars and FGVC-Aircraft datasets, we found that the first three patches of each scale should be selected for good performance, as shown in Table 6. When four patches are learned from the original images of each scale, the performance of the classification network declines. This indicates that the discriminative information contained in the fourth patch is not enough to improve the performance of the classification network; at the same time, the noise contained in the fourth patch interferes with the discriminative features learned by the network. Figure 6 displays some patches on the Stanford Cars and FGVC-Aircraft datasets.

Figure 5: Some results of the PPL on CUB-200-2011 [1]. The first row shows the original images, the second and third rows display the first and second discriminative patches of images at scale 448 × 448, and the fourth and fifth rows display the first and second discriminative patches of images at scale 224 × 224.

Table 4: Effectiveness of each component in our method on the CUB-200-2011, Stanford Cars and FGVC-Aircraft datasets

Methods               | CUB  | Car  | Aircraft
ResNet50              | 84.5 | 91.5 | 87.8
ResNet50 + PPL        | 87.5 | 93.0 | 91.6
ResNet50 + PPL + FCM  | 88.3 | 94.0 | 92.6


Figure 6: Some results of the PPL on Cars [2] and FGVC-Aircraft [19]. The first row shows the original images, and the second to fourth rows show the top-3 discriminative patches of images at scale 448 × 448.

Table 5: Effectiveness of the multi-scale representation on the CUB-200-2011 dataset

Methods                                        | Accuracy (%)
ResNet50                                       | 84.5
ResNet50 + 2 patches (448 × 448)               | 86.7
ResNet50 + 3 patches (448 × 448)               | 86.6
ResNet50 + 2 patches (224 × 224)               | 86.2
ResNet50 + 3 patches (224 × 224)               | 86.2
ResNet50 + 4 patches (224 × 224 & 448 × 448)   | 87.5

Table 6: Ablation experiments on the PPL module with different numbers of patches from the original images of each scale on the Stanford Cars and FGVC-Aircraft datasets

Methods                                        | Car  | Aircraft
ResNet50                                       | 92.5 | 88.3
ResNet50 + 4 patches (224 × 224 & 448 × 448)   | 93.6 | 92.1
ResNet50 + 6 patches (224 × 224 & 448 × 448)   | 94.0 | 92.6
ResNet50 + 8 patches (224 × 224 & 448 × 448)   | 92.6 | 91.0

2) The effectiveness of the feature calibration module

In Table 4, we can observe that the accuracy of the classification network on the CUB-200-2011 dataset increases from 87.5% to 88.3% after the FCM is added. This result suggests that the discriminative information of the calibrated features is more effective. Besides, feeding the concatenation of the four processed patch-level features and the image-level feature into the classification network obtains an improvement of 0.5% over the baseline, as Table 7 shows. This result means that the global information of the image-level feature can calibrate the patch-level features, strengthening their distinctive information and suppressing useless information. The addition operation between patch-level features of different scales further makes the discriminative information contained in the patch-level features more sufficient, which helps to enhance the classification accuracy.

Table 7: Results of using different design choices of FCM on the CUB-200-2011 dataset. P^{img} and P^{pat} denote the image-level feature and the patch-level feature, respectively. P^{pat}_{small} denotes the patch-level feature from the patch of the small-scale image and P^{pat}_{large} denotes the patch-level feature from the patch of the large-scale image.

Setting                                                      | Accuracy (%)
a) direct concatenation                                      | 87.5
b) sigmoid(P^{img}) * P^{pat}                                | 88.0
c) sigmoid(P^{img}) * (P^{pat}_{small} + P^{pat}_{large})    | 88.3

5. Conclusion

To address the problem that the selected patches with lower ranks may contain noisy information, and to ensure a diversity of fine-grained features, we propose a progressive patch localization module for fine-grained image classification. This module first finds the most discriminative patch with the classification network, then removes the most salient region to prompt the localization of the next most discriminative patch; the top-K discriminative patches can be found by repeating this step iteratively. The discovered patches, sharing the same label as the original image, are used to train the classification network so as to help it learn discriminative features. In addition, we propose a feature calibration module that calibrates the discriminative information of patch-level features by using the global information to re-weight them. Furthermore, the processed patch-level features, fused with the multi-scale complementary information, can further improve the representation of patch-level features. Extensive experiments on the CUB-200-2011, Stanford Cars and FGVC-Aircraft datasets have proved the effectiveness of our method, which achieves significant improvements.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grants No. 61772108, No. 61572096 and No. 61733002.

References

[1] C. Wah, S. Branson, P. Welinder, P. Perona, S. Belongie, The Caltech-UCSD Birds-200-2011 Dataset, Tech. rep. (2011).

[2] J. Krause, M. Stark, J. Deng, L. Fei-Fei, 3d object representations for fine-grained categorization, in: 2013 IEEE International Conference on Computer Vision Workshops, ICCV Workshops 2013, Sydney, Australia, December 1-8, 2013, pp. 554-561. doi:10.1109/ICCVW.2013.77.

[3] M. Liu, L. Nie, X. Wang, Q. Tian, B. Chen, Online data organizer: Micro-video categorization by structure-guided multimodal dictionary learning, IEEE Trans. Image Processing 28 (3) (2019) 1235-1247. doi:10.1109/TIP.2018.2875363.

[4] Y. Wei, X. Wang, W. Guan, L. Nie, Z. Lin, B. Chen, Neural multimodal cooperative learning toward micro-video understanding, IEEE Trans. Image Processing 29 (2020) 1-14. doi:10.1109/TIP.2019.2923608.

[5] L. Xie, Q. Tian, R. Hong, S. Yan, B. Zhang, Hierarchical part matching for fine-grained visual categorization, in: IEEE International Conference on Computer Vision, ICCV 2013, Sydney, Australia, December 1-8, 2013, pp. 1641-1648. doi:10.1109/ICCV.2013.206.

[6] T. Berg, P. N. Belhumeur, POOF: part-based one-vs.-one features for fine-grained categorization, face verification, and attribute estimation, in: 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, June 23-28, 2013, pp. 955-962. doi:10.1109/CVPR.2013.128.

[7] D. Cossock, T. Zhang, Statistical analysis of bayes optimal subset ranking, IEEE Trans. Information Theory 54 (11) (2008) 5140-5154. doi:10.1109/TIT.2008.929939.

[8] E. Gavves, B. Fernando, C. G. M. Snoek, A. W. M. Smeulders, T. Tuytelaars, Fine-grained categorization by alignments, in: IEEE International Conference on Computer Vision, ICCV 2013, Sydney, Australia, December 1-8, 2013, pp. 1713-1720. doi:10.1109/ICCV.2013.215.

[9] N. Zhang, J. Donahue, R. B. Girshick, T. Darrell, Part-based r-cnns for fine-grained category detection, in: Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I, 2014, pp. 834-849. doi:10.1007/978-3-319-10590-1_54.

[10] R. Hong, M. Wang, Y. Gao, D. Tao, X. Li, X. Wu, Image annotation by multiple-instance learning with discriminative feature mapping and selection, IEEE Trans. Cybernetics 44 (5) (2014) 669-680. doi:10.1109/TCYB.2013.2265601.

[11] X. He, Y. Peng, J. Zhao, Fine-grained discriminative localization via saliency-guided faster R-CNN, in: Proceedings of the 2017 ACM on Multimedia Conference, MM 2017, Mountain View, CA, USA, October 23-27, 2017, pp. 627-635. doi:10.1145/3123266.3123319.

[12] X. He, Y. Peng, J. Zhao, Fast fine-grained image classification via weakly supervised discriminative localization, IEEE Trans. Circuits Syst. Video Techn. 29 (5) (2019) 1394-1407. doi:10.1109/TCSVT.2018.2834480.

[13] Y. Peng, X. He, J. Zhao, Object-part attention model for fine-grained image classification, IEEE Trans. Image Processing 27 (3) (2018) 1487-1500. doi:10.1109/TIP.2017.2774041.

[14] H. Yao, S. Zhang, C. Yan, Y. Zhang, J. Li, Q. Tian, Autobd: Automated bilevel description for scalable fine-grained visual categorization, IEEE Trans. Image Processing 27 (1) (2018) 10-23. doi:10.1109/TIP.2017.2751960.

[15] X. He, Y. Peng, Weakly supervised learning of part selection model with spatial constraints for fine-grained image classification, in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, 2017, pp. 4075-4081. URL http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14629

[16] S. Ren, K. He, R. B. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell. 39 (6) (2017) 1137-1149. doi:10.1109/TPAMI.2016.2577031.

[17] B. Zhao, X. Wu, J. Feng, Q. Peng, S. Yan, Diversified visual attention networks for fine-grained object classification, IEEE Trans. Multimedia 19 (6) (2017) 1245-1256. doi:10.1109/TMM.2017.2648498.

[18] H. Zheng, J. Fu, T. Mei, J. Luo, Learning multi-attention convolutional neural network for fine-grained image recognition, in: IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 5219-5227. doi:10.1109/ICCV.2017.557.

[19] S. Maji, E. Rahtu, J. Kannala, M. B. Blaschko, A. Vedaldi, Fine-grained visual classification of aircraft, CoRR abs/1306.5151. arXiv:1306.5151. URL http://arxiv.org/abs/1306.5151

[20] B. Zhao, J. Feng, X. Wu, S. Yan, A survey on deep learning-based fine-grained object classification and semantic segmentation, International Journal of Automation and Computing 14 (2) (2017) 119-135. doi:10.1007/s11633-017-1053-3.

[21] T. Lin, A. Roy Chowdhury, S. Maji, Bilinear CNN models for fine-grained visual recognition, in: 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pp. 1449-1457. doi:10.1109/ICCV.2015.170.

[22] D. Wang, Z. Shen, J. Shao, W. Zhang, X. Xue, Z. Zhang, Multiple granularity descriptors for fine-grained categorization, in: 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pp. 2399-2406. doi:10.1109/ICCV.2015.276.

[23] M. Jaderberg, K. Simonyan, A. Zisserman, K. Kavukcuoglu, Spatial transformer networks, in: Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, 2015, pp. 2017-2025. URL http://papers.nips.cc/paper/5854-spatial-transformer-networks

[24] L. Qi, X. Lu, X. Li, Exploiting spatial relation for fine-grained image classification, Pattern Recognition 91 (2019) 47-55. doi:10.1016/j.patcog.2019.02.007.

[25] X. He, Y. Peng, J. Zhao, Which and how many regions to gaze: Focus discriminative regions for fine-grained visual categorization, International Journal of Computer Vision 127 (9) (2019) 1235-1255. doi:10.1007/s11263-019-01176-2.

[26] X. Zhang, H. Xiong, W. Zhou, W. Lin, Q. Tian, Picking deep filter responses for fine-grained image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 1134-1142. doi:10.1109/CVPR.2016.128.

[27] J. Fu, H. Zheng, T. Mei, Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 4476-4484. doi:10.1109/CVPR.2017.476.

[28] M. Sun, Y. Yuan, F. Zhou, E. Ding, Multi-attention multi-class constraint for fine-grained image recognition, in: Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XVI, 2018, pp. 834-850. doi:10.1007/978-3-030-01270-0_49.

[29] H. Zheng, J. Fu, Z. Zha, J. Luo, Looking for the devil in the details: Learning trilinear attention sampling network for fine-grained image recognition, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 5012-5021. URL http://openaccess.thecvf.com/content_CVPR_2019/html/Zheng_Looking_for_the_Devil_in_the_Details_Learning_Trilinear_Attention_CVPR_2019_paper.html

[30] Y. Wei, J. Feng, X. Liang, M. Cheng, Y. Zhao, S. Yan, Object region mining with adversarial erasing: A simple classification to semantic segmentation approach, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 6488-6496. doi:10.1109/CVPR.2017.687.

[31] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-cam: Visual explanations from deep networks via gradient-based localization, in: IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 618-626. doi:10.1109/ICCV.2017.74.

[32] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 770-778. doi:10.1109/CVPR.2016.90.

[33] W. Huang, H. Ding, G. Chen, A novel deep multi-channel residual networks-based metric learning method for moving human localization in video surveillance, Signal Processing 142 (2018) 104-113. doi:10.1016/j.sigpro.2017.07.015.

[34] Y. Cao, Z. He, Z. Ye, X. Li, Y. Cao, J. Yang, Fast and accurate single image super-resolution via an energy-aware improved deep residual network, Signal Processing 162 (2019) 115-125. doi:10.1016/j.sigpro.2019.03.018.

[35] R. Hong, L. Li, J. Cai, D. Tao, M. Wang, Q. Tian, Coherent semantic-visual indexing for large-scale image retrieval in the cloud, IEEE Trans. Image Processing 26 (9) (2017) 4128-4138. doi:10.1109/TIP.2017.2710635.

[36] H. Zheng, J. Fu, Z. Zha, J. Luo, T. Mei, Learning rich part hierarchies with progressive attention networks for fine-grained image recognition, IEEE Trans. Image Processing 29 (2020) 476-488. doi:10.1109/TIP.2019.2921876.

[37] X. He, Y. Peng, J. Zhao, Stackdrl: Stacked deep reinforcement learning for fine-grained visual categorization, in: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, 2018, pp. 741-747. doi:10.24963/ijcai.2018/103.

[38] Z. Li, Y. Yang, X. Liu, F. Zhou, S. Wen, W. Xu, Dynamic computational time for visual attention, in: 2017 IEEE International Conference on Computer Vision Workshops, ICCV Workshops 2017, Venice, Italy, October 22-29, 2017, pp. 1199-1209. doi:10.1109/ICCVW.2017.145.

[39] Y. Wang, V. I. Morariu, L. S. Davis, Learning a discriminative filter bank within a CNN for fine-grained recognition, in: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 4148-4157. doi:10.1109/CVPR.2018.00436.

[40] Z. Yang, T. Luo, D. Wang, Z. Hu, J. Gao, L. Wang, Learning to navigate for fine-grained classification, in: Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIV, 2018, pp. 438-454. doi:10.1007/978-3-030-01264-9_26.

[41] H. Xu, G. Qi, J. Li, M. Wang, K. Xu, H. Gao, Fine-grained image classification by visual-semantic embedding, in: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, 2018, pp. 1043-1049. doi:10.24963/ijcai.2018/145.

[42] H. Yao, S. Zhang, Y. Zhang, J. Li, Q. Tian, Coarse-to-fine description for fine-grained visual categorization, IEEE Trans. Image Processing 25 (10) (2016) 4858-4872. doi:10.1109/TIP.2016.2599102.

[43] Z. Xu, S. Huang, Y. Zhang, D. Tao, Webly-supervised fine-grained visual categorization via deep domain adaptation, IEEE Trans. Pattern Anal. Mach. Intell. 40 (5) (2018) 1100-1113. doi:10.1109/TPAMI.2016.2637331.

[44] J. Krause, H. Jin, J. Yang, F. Li, Fine-grained recognition without part annotations, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pp. 5546-5555. doi:10.1109/CVPR.2015.7299194.

[45] X. Wei, J. Luo, J. Wu, Z. Zhou, Selective convolutional descriptor aggregation for fine-grained image retrieval, IEEE Trans. Image Processing 26 (6) (2017) 2868-2881. doi:10.1109/TIP.2017.2688133.


Conflict of Interest

The authors declare that no conflict of interest exists in the submission of this manuscript, and the manuscript is approved by all authors for publication. The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Credit Author Statement

Tiantian Yan: Conceptualization, Methodology, Software, Writing - Original Draft.
Shijie Wang: Validation, Software.
Zhihui Wang: Conceptualization, Writing - Review & Editing.
Haojie Li: Conceptualization, Funding acquisition.
Zhongxuan Luo: Supervision.