Highlights

• We propose a progressive patch localization module that guarantees the diversity of fine-grained features while avoiding the noise that the lower-ranked selected patches are likely to contain.
• A feature calibration module is proposed to calibrate patch-level features, strengthening their discriminative information and suppressing useless information with the help of global information, which further benefits the final classification performance.
• We evaluate our method on three challenging datasets (CUB, Cars and Aircraft) and achieve state-of-the-art results on all of them.
Progressive Learning for Weakly Supervised Fine-grained Classification

Tiantian Yan(a), Shijie Wang(a), Zhihui Wang(b,*), Haojie Li(b), Zhongxuan Luo(a)

(a) School of Software Technology, Dalian University of Technology, Dalian, China
(b) International School of Information Science & Engineering, Dalian University of Technology, China
Abstract

Although fine-grained image classification has made considerable progress, it remains a challenging task because of the difficulty of finding subtle distinctions. Most existing methods address this problem by selecting the top-N highest-scoring discriminative patches from candidate patches in one pass. However, since the classification network often highlights small and sparse regions, the selected patches with lower rank may contain noise. To address this problem and ensure the diversity of fine-grained features, we propose a progressive patch localization module (PPL) to find the discriminative patches more accurately. Specifically, this work employs the classification model to find the most discriminative patch, then removes the most salient region to help localize the next most discriminative patch; the top-K discriminative patches are found by repeating this procedure. In addition, to further improve the representational power of patch-level features, we propose a feature calibration module (FCM). This module employs global information to selectively emphasize discriminative features and suppress useless information, which yields more robust and discriminative local feature representations and helps the classification network achieve better performance. Extensive experiments demonstrate the substantial improvements of our method on three benchmark datasets.

Keywords: Fine-grained image classification, Progressive patch localization module, Feature calibration module

* Corresponding author. Email address: [email protected] (Zhihui Wang)
1. Introduction
Figure 1: Illustration of challenges in fine-grained image classification. Large differences within the same subordinate class are shown in the first row, and small distinctions among different subordinate classes are shown in the second row. The images in (a) Birds and (b) Cars are from CUB-200-2011 [1] and Cars-196 [2], respectively.
Fine-grained image classification has been a hot research topic in computer vision, pattern recognition and related fields [3], [4] in recent years, because it has broad application demands in both academia and industry. The goal of fine-grained image classification is to divide coarse-grained categories into subcategories, such as hundreds of subordinate classes of birds [1], automobiles [2], etc. As the example in Figure 1(a) shows, fine-grained image classification needs to classify the birds in the second row into the subcategories "California Gull", "Glaucous-winged Gull" and "Western Gull", respectively. Fine-grained image classification remains challenging because of large intra-class differences and high inter-class similarities. In the first row of Figure 1(a), the three images belonging to the same subcategory are affected by pose, shooting angle and growth period, resulting in large visual differences among them. In the second row of Figure 1(a), the three images belonging to different subcategories share similar global features (such as gray wings and white bellies), so they can only be distinguished with the aid of subtle local distinctions in the tails, legs, etc. Therefore, how to accurately find and effectively use local discriminative information is the key to the success of fine-grained image classification.

Figure 2: The motivation for our patch localization module. (a): The lower-ranked patches selected by anchor-based methods contain non-discriminative information, such as background. (b): The proposed progressive patch localization module.

Some previous works [5, 6, 7, 8, 9] address this problem by making use of fine-grained annotations, like annotations for bird parts in bird classification.
However, human-defined parts may not be optimal for fine-grained image classification, since they depend entirely on the annotator's cognitive level. In addition, fine-grained annotations not only require massive manual labor but also lack practicality and scalability, because much real-world data is not annotated [10]. Therefore, recent works mainly focus on weakly supervised frameworks that only use image-level labels.

The existing weakly supervised methods [11, 12, 13, 14], [15] search for discriminative patches based on candidate anchors. He et al. [11] design a discriminative localization method based on Faster R-CNN [16] to simultaneously localize discriminative patches with the guidance of saliency information and extract discriminative features. This work only employs one level of attention, while attention at different levels describes different visual features and carries multi-grained and multi-scale information. A weakly supervised discriminative localization method [12] extracts multiple different discriminative regions with an n-pathway localization module whose supervision is provided by a multi-level attention extraction network. Yao et al. [14] propose a graph analysis algorithm to estimate object patches and select the distinctive local patches. These methods [11, 12, 14] ignore the spatial relationships among patches and therefore have difficulty in capturing diverse fine-grained features.

Peng et al. [13] and He et al. [15] design an object-part spatial constraint module to solve this problem by selecting the top-N patches as discriminative patches according to their scores. Although these methods [13], [15] consider the spatial constraint between patches to guarantee that the selected patches contain diverse features, the selected patches with lower rank may contain very little useful information and a lot of noise (as shown in Figure 2(a)).

Our proposed localization module (PPL) can effectively solve the above two problems. First, we employ a classification network to locate the first discriminative patch. Next, the most salient region is removed from the original image, that is, it is set to zero. To remedy the resulting drop in classification performance, the network then reveals the next discriminative patch. Finally, the top-K discriminative patches can be extracted by repeating the above
procedure (as shown in Figure 2(b)).

After the discriminative patches are found, the patches (which share the category label of the corresponding original image) and the original image are used to train the fine-grained classification network. To obtain the final classification result for each image, some previous methods [17, 13] average or weight the predicted scores of the discriminative patches and the original image, while other methods [18, 14] concatenate the patch-level features with the image-level feature as the final descriptor for classification. To further improve the representational power of patch-level features, we propose a feature calibration module (FCM). In this module, we employ the global information of the image-level feature to selectively emphasize the discriminative information of patch-level features and suppress less useful information. In other words, each element of a patch-level feature is weighted by the global information, so that useful feature values are increased and useless feature values are decreased. In addition, considering the complementarity between patch-level features of different scales, we fuse local features of different scales to obtain a more powerful and sufficient representation of discriminative features. We feed the concatenation of the calibrated patch-level features and the image-level feature into a fully connected layer to obtain the final prediction.

To summarize, our contributions are as follows:

• We propose a progressive patch localization module (PPL) to pinpoint discriminative patches for the fine-grained image classification network. This module avoids the noise that the lower-ranked selected patches are likely to contain while guaranteeing the diversity of fine-grained features.

• A feature calibration module (FCM) is proposed to calibrate patch-level features, strengthening their discriminative information and suppressing useless information by exploiting global information, which further benefits the final classification performance.

• We evaluate our method on three challenging datasets (CUB-200-2011 [1], Stanford Cars [2] and FGVC-Aircraft [19]) and achieve state-of-the-art results on all of them.
The rest of this paper is organized as follows. Section 2 briefly reviews related works on patch localization and feature aggregation. Section 3 presents our proposed method in detail, and Section 4 reports the experimental results as well as ablation analyses. Finally, Section 5 concludes this work.
2. Related Works

The part-based methods for fine-grained image classification can be summarized into three categories [20]: ensemble-of-networks based methods, attention based methods and candidate patch based methods.

2.1. Ensemble of Networks Based Methods

Ensemble-of-networks based methods employ several neural networks to learn the representation of discriminative features for fine-grained image classification. Lin et al. [21] propose a bilinear model consisting of two CNNs whose outputs are multiplied using the outer product and pooled to construct a discriminative image descriptor. Considering that previous methods [21], [22] lack spatial invariance to the input data, Jaderberg et al. [23] propose a spatial transformer module that captures the discriminative parts with 2 or 4 parallel spatial transformers; the concatenation of part representations is fed into a classifier for the final prediction. However, these methods ignore the spatial relation information between parts. Qi et al. [24] propose a part selection module that picks out part pairs with high discriminative ability by utilizing the spatial relations between parts, and then constructs a discriminative image representation from the interaction between parts. He et al. [25] use deep reinforcement learning to hierarchically find discriminative patches at different granularities and adaptively determine how many patches to extract.
2.2. Attention Based Methods

Some recent works [26, 27, 18, 28] rely on attention to extract discriminative patches. Zhang et al. [26] learn a set of part detectors by alternately iterating between mining new positive samples and retraining the part models, finding filters that have significant and consistent responses to specific parts. Fu et al. [27] recursively learn more fine-grained parts and multi-scale representations of part features in a mutually reinforcing manner. Zheng et al. [18] generate multiple patches with a channel grouping module that applies a series of clustering, weighting and pooling operations to spatially correlated channels. Sun et al. [28] extract multiple parts of different objects with a one-squeeze multi-excitation module and pull positive features closer to the anchor with a multi-attention multi-class constraint loss. Zheng et al. [29] learn fine-grained details from hundreds of detail-preserved images generated by an attention-based sampler.

2.3. Candidate Patch Based Methods

Other existing works [11, 12, 13, 15, 14] rely on candidate anchors
to pick out the discriminative patches. He et al. [11] take the bounding box generated by a saliency-guided localization learning strategy as pseudo ground truth, and then use an object detection framework to simultaneously predict the discriminative patches and extract the discriminative features. He et al. [12] further explore a multi-level attention extraction network whose output helps an n-pathway localization module extract multiple different discriminative patches. Yao et al. [14] extract hypothetical bounding boxes by applying different binary thresholds to the saliency map, compute the overlap score between each candidate patch and the hypothetical bounding box, and retain the candidate patches with the top-10 scores for each threshold. They then propose a co-localization algorithm to further select the five most accurate object-level patches, and select the part-level patch with the highest score under each threshold. However, the above methods overlook the spatial relationships between an object-level patch and its part-level patches, as well as among part-level patches, and therefore lack the ability to capture diverse fine-grained features.

To make up for this deficiency, Peng et al. [13] put forward an object-part spatial constraint module to extract the top-N patches. This module consists of an object spatial constraint and a part spatial constraint: the object spatial constraint makes the selected patches have high representation ability, and the part spatial constraint eliminates redundancy between selected patches. According to the scores of candidate patches, the top-N patches are selected as discriminative patches, and the cluster patterns of the neural network are then utilized to align patches with the same semantic information for better classification performance. It is worth noting that the saliency map usually highlights small and sparse regions, which makes the selected patches with lower rank contain a lot of noise and few useful features.

In this case, we propose the PPL to progressively find the discriminative patches. After finding the first discriminative patch with the classification network, we remove the most salient region from the original image to help localize the next most discriminative patch. Finally, the top-K discriminative patches can be extracted for learning local discriminative features. In this way, we can find the discriminative patches more accurately and reduce the overlap between selected patches. In addition, the learned local discriminative features are made more representative and robust by the feature calibration module, which emphasizes the discriminative information of patch-level features under the guidance of the global information of the image-level feature.
3. Proposed Method

In this section, we present the progressive patch localization module (PPL) and the feature calibration module (FCM) of our fine-grained image classification system. As shown in Figure 3, the framework of our method is composed of two modules. The first module, the PPL, aims to progressively pinpoint the discriminative patches in the original images; to exploit comprehensive information about the image, it takes multi-scale images as input. The second module, the fine-grained image classification network with the FCM, aims to extract the features of images as well as patches and to improve the representation of patch-level discriminative features.
Figure 3: Overview of our framework, which consists of two modules: the progressive patch localization module (PPL) and the fine-grained image classification network with the feature calibration module (FCM). The PPL takes the multi-scale images as input and finds the first discriminative patch with the classification network; the image with the most salient region removed is then fed into the classification network to extract the next discriminative patch, and the top-K discriminative patches are obtained by repeating this procedure. Under the guidance of the global information, the FCM performs a feature calibration operation on the patch features selected by the PPL, that is, it strengthens the discriminative information in the local features while suppressing invalid information.
3.1. Progressive Patch Localization Module

In this module, motivated by AE [30], the training image set is denoted as I = \{(I_{i}, y_{i})\}_{i=0}^{N-1}, where y_{i} is the label of image I_{i} and N is the number of images.

Step 1: We first fine-tune the classification network on the fine-grained dataset, then combine the fine-tuned model with the gradients flowing back [31] to localize the discriminative patches and extract the attention mask. In this way we can find the first most discriminative patch, which contains the most discriminative information. The attention map L^{c}_{first} is computed as

L^{c}_{first} = G(I) = \mathrm{ReLU}\Big( \sum_{h} a^{h}_{c} \, M^{h} \Big),   (1)

where G(·) denotes the attention map generation operation and a^{h}_{c} is the neuron importance weight obtained by global average pooling of the gradients of y^{c} (the predicted score for class c) flowing back to the feature maps M^{h}:

a^{h}_{c} = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^{c}}{\partial M^{h}_{i,j}}.   (2)
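Eqs. (1)-(2) follow the Grad-CAM formulation of [31]. The snippet below is a minimal PyTorch sketch of this computation under our own assumptions (a torchvision-style ResNet50 whose last convolutional block is `model.layer4`, and our own helper name); it only illustrates the step, it is not the authors' code.

```python
import torch
import torch.nn.functional as F

def gradcam_attention_map(model, image, target_class=None):
    """Sketch of Eqs. (1)-(2): weight the last conv feature maps M^h by the
    gradient-derived importance a_c^h, sum over channels and apply ReLU."""
    feats, grads = [], []

    def fwd_hook(_, __, output):
        feats.append(output)                    # M: (1, C, h, w) feature maps

    def bwd_hook(_, grad_in, grad_out):
        grads.append(grad_out[0])               # dy^c / dM

    h1 = model.layer4.register_forward_hook(fwd_hook)
    h2 = model.layer4.register_full_backward_hook(bwd_hook)

    logits = model(image)                       # image: (1, 3, H, W)
    if target_class is None:
        target_class = logits.argmax(dim=1).item()  # predicted class c
    model.zero_grad()
    logits[0, target_class].backward()          # gradients of y^c flow back to M
    h1.remove(); h2.remove()

    M, dM = feats[0], grads[0]
    a = dM.mean(dim=(2, 3), keepdim=True)       # a_c^h = (1/Z) sum_ij dy^c / dM^h_ij
    cam = F.relu((a * M).sum(dim=1, keepdim=True))   # Eq. (1)
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return cam / (cam.max() + 1e-8)             # normalize to [0, 1] before thresholding
```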
Step 2: We binarize the attention map L^{c}_{first}, shown in the second row of Figure 4, to obtain the attention mask L_{1}: pixels of the attention map larger than a threshold δ are set to 1 and the rest are set to 0. We extract the smallest bounding box covering the largest connected region of the attention mask as the first discriminative patch.

Step 3: We take the original image I multiplied by the reversed attention mask \hat{L}_{1} = 1 - L_{1} as input to obtain the next attention map L_{next} (shown in the third row of Figure 4), which highlights the next discriminative region. The attention map L_{next} in Eq. (3) is binarized to obtain the attention mask L_{2}, from which we extract the next discriminative patch:

L_{next} = G(I \cdot \hat{L}_{1}).   (3)
Step 4: By repeating Step 3, multiple effective discriminative patches can be found, until the discriminative information contained in the extracted patch is insufficient to improve the performance of the fine-grained image classification network.

To exploit comprehensive information, we apply the multi-scale images (resized to 224 × 224 and 448 × 448) to find discriminative patches. To facilitate the follow-up work, we select the same number of patches for each scale. We use R_{i} = \{R_{i}^{1}, R_{i}^{2}, ..., R_{i}^{K}, ..., R_{i}^{2K}\} to denote all discriminative patches, where the first K patches are from the small-scale image and the last K patches are from the large-scale image. A sketch of this procedure is given below.
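As a concrete illustration of Steps 1-4, the erase-and-relocalize loop can be sketched as follows. The helper names (`gradcam_attention_map` from the previous snippet, the SciPy connected-component call) are our own assumptions; the paper specifies only the overall procedure, which is run once per input scale.

```python
import torch
import numpy as np
from scipy import ndimage

def largest_region_bbox(mask):
    """Smallest bounding box covering the largest connected region of a binary mask."""
    labels, num = ndimage.label(mask)
    if num == 0:
        return None
    sizes = [(labels == i + 1).sum() for i in range(num)]
    ys, xs = np.where(labels == int(np.argmax(sizes)) + 1)
    return xs.min(), ys.min(), xs.max(), ys.max()       # (x1, y1, x2, y2)

def progressive_patches(model, image, num_patches=2, delta=0.7):
    """PPL sketch: repeatedly localize a patch, then erase the most salient
    region (Eq. (3): I * (1 - L)) so the next pass highlights a new region."""
    erased = image.clone()                               # image: (1, 3, H, W)
    patches = []
    for _ in range(num_patches):
        cam = gradcam_attention_map(model, erased)       # attention map of the (partially erased) image
        mask = (cam[0, 0] > delta).cpu().numpy()         # binarize with threshold delta
        box = largest_region_bbox(mask)
        if box is None:
            break
        x1, y1, x2, y2 = box
        patches.append(image[:, :, y1:y2 + 1, x1:x2 + 1])    # crop the patch from the original image
        keep = (~torch.from_numpy(mask).to(erased.device)).float()
        erased = erased * keep                           # zero out the most salient region
    return patches                                       # run once for 224x224 and once for 448x448 inputs
```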
Figure 4: Visualization of attention maps of images with scale 448 × 448 at different stages of the PPL module. The first row shows the original images, the second row the first discriminative regions, and the third row the second discriminative regions.
3.2. Fine-grained Classification Network with Feature Calibration Module

To emphasize the discriminative information of the patch-level features and suppress their noise, we design a feature calibration module. First, we need to extract the patch-level and image-level features. We use the top-K patches of each scale, combined with the original images, to train the classification network, which generates K patch-level feature vectors P^{k} = [p^{k}_{1}, p^{k}_{2}, ..., p^{k}_{H}] (k ∈ {1, ..., K}) for each scale and one corresponding image-level feature P^{0} = [p^{0}_{1}, p^{0}_{2}, ..., p^{0}_{H}], each of length H (H = 2048 if ResNet50 is adopted). In this module, the classification network can directly load the fine-tuned parameters of the patch localization module's baseline, because both modules adopt the same CNN. These patch-level and image-level features are fed into a fully connected layer with a softmax function, and their labels are consistent with those of the corresponding images. We adopt the cross-entropy loss as the classification loss:

L_{image} = - \sum_{i=0}^{N-1} y_{i} \log\big(T(P_{i}^{0})\big),   (4)

L_{patch_{l}} = - \sum_{i=0}^{N-1} y_{i} \log\big(T(P_{i}^{l})\big),   (5)

where T denotes the fully connected layer with the softmax function, L_{image} denotes the image-level classification loss, and L_{patch_{l}} (l = 1, ..., 2K) denotes the patch-level classification losses.
To further improve the discriminability of the patch-level features by exploiting the global information of the image-level feature and the information from different scales, we propose a feature calibration module. In this module, each patch-level feature is multiplied by the image-level feature after a sigmoid function to obtain the processed patch-level features Q^{l}, as shown in Eq. (6) and Eq. (7). The calibrated patch-level features O^{k} = [o^{k}_{1}, o^{k}_{2}, ..., o^{k}_{H}] are then generated by adding the corresponding elements of the processed patch-level features of different scales, as shown in Eq. (8):

Q^{0} = [q^{0}_{1}, q^{0}_{2}, ..., q^{0}_{H}] = \mathrm{sigmoid}(P^{0}),   (6)

Q^{l} = [q^{l}_{1}, q^{l}_{2}, ..., q^{l}_{H}] = Q^{0} * P^{l} = [q^{0}_{1} p^{l}_{1}, q^{0}_{2} p^{l}_{2}, ..., q^{0}_{H} p^{l}_{H}],  (l = 1, ..., 2K),   (7)

O^{k} = Q^{k} + Q^{k+K} = [q^{k}_{1} + q^{k+K}_{1}, q^{k}_{2} + q^{k+K}_{2}, ..., q^{k}_{H} + q^{k+K}_{H}],  (k = 1, ..., K).   (8)
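Concretely, Eqs. (6)-(8) reduce to a sigmoid gate computed from the image-level feature, an element-wise product with every patch-level feature, and an element-wise sum over the two scales. A minimal sketch (batch dimension and variable names are our assumptions):

```python
import torch

def calibrate_features(p_img, p_patches):
    """Feature calibration (Eqs. (6)-(8)).
    p_img:     (B, H) image-level feature P^0
    p_patches: list of 2K tensors of shape (B, H); the first K come from the
               small-scale image, the last K from the large-scale image."""
    gate = torch.sigmoid(p_img)                       # Eq. (6): Q^0 = sigmoid(P^0)
    q = [gate * p for p in p_patches]                 # Eq. (7): Q^l = Q^0 * P^l (element-wise)
    K = len(p_patches) // 2
    return [q[k] + q[k + K] for k in range(K)]        # Eq. (8): O^k = Q^k + Q^{k+K}
```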
The sigmoid function aims to suppress the irrelevant information in the image-level feature, and the image-level feature after the sigmoid function provides feature-importance weights for the patch-level features. The multiplication operation then emphasizes the discriminative information of the patch-level features and suppresses less useful information. In view of the complementary information contained in patch-level features of different scales, the additive operation is adopted to further enrich the discriminative information of the patch-level features. In the end, we concatenate the image-level feature with the K calibrated patch-level features for fine-grained image classification:

g = T'(P^{0}, O^{1}, ..., O^{K}),   (9)

where T' is a fully connected layer with a softmax function whose input dimension is (K + 1) * H. We employ the cross-entropy loss as the joint classification loss:

L_{con} = - \sum_{i=0}^{N-1} y_{i} \log(g_{i}).   (10)

Thus, the overall loss is

L = L_{image} + \lambda \sum_{l=1}^{2K} L_{patch_{l}} + \mu L_{con},   (11)

where λ and μ are hyper-parameters. In our experiments, we set λ = μ = 1.
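Combining Eqs. (4), (5) and (9)-(11), one possible way to assemble the training objective is sketched below. Whether the head T is shared between the image and all patches, and the exact module layout, are our assumptions rather than details given in the paper; with the reported setting, lam = mu = 1.

```python
import torch
import torch.nn as nn

class JointLoss(nn.Module):
    """Sketch of the overall objective L = L_image + lam * sum_l L_patch_l + mu * L_con (Eq. (11))."""

    def __init__(self, feat_dim=2048, num_classes=200, K=2, lam=1.0, mu=1.0):
        super().__init__()
        self.fc_single = nn.Linear(feat_dim, num_classes)             # T: head for image / patch features
        self.fc_concat = nn.Linear((K + 1) * feat_dim, num_classes)   # T': head on the concatenation
        self.ce = nn.CrossEntropyLoss()                               # softmax + cross entropy
        self.lam, self.mu = lam, mu

    def forward(self, p_img, p_patches, calibrated, labels):
        # p_img: (B, H); p_patches: list of 2K (B, H); calibrated: list of K (B, H) from the FCM
        l_image = self.ce(self.fc_single(p_img), labels)                           # Eq. (4)
        l_patch = sum(self.ce(self.fc_single(p), labels) for p in p_patches)       # Eq. (5), l = 1..2K
        g = self.fc_concat(torch.cat([p_img] + calibrated, dim=1))                 # Eq. (9)
        l_con = self.ce(g, labels)                                                 # Eq. (10)
        return l_image + self.lam * l_patch + self.mu * l_con                      # Eq. (11)
```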
4. Experiments

4.1. Datasets and Evaluation Metric

We evaluate our algorithm on CUB-200-2011 [1], Stanford Cars [2] and FGVC-Aircraft [19], which are widely used benchmarks for fine-grained image classification. We only use image class labels in our experiments, and we compare our method with other state-of-the-art approaches to demonstrate its effectiveness. The three datasets are described below:

• CUB-200-2011 [1]: the most representative dataset, with 11788 images of 200 subclasses, split into 5994 training images and 5794 testing images. There are about 30 images per subclass in the training set and 11-30 images per subclass in the testing set.

• Stanford Cars [2]: 16185 images of 196 subclasses, divided into 8144 training images and 8041 testing images. There are about 24-84 images per subclass in the training set and 24-83 images per subclass in the testing set.

• FGVC-Aircraft [19]: 10000 images of 100 subclasses, divided into 6667 training images and 3333 testing images. There are about 66-67 images per subclass in the training set and 33-34 images per subclass in the testing set.

We adopt top-1 accuracy as the evaluation metric, the most commonly used measure for fine-grained image classification, defined as

accuracy = R_{correct} / R,   (12)
where R is the total number of testing images and R_{correct} is the number of images predicted as the correct category.
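Eq. (12) is the usual top-1 accuracy; a one-line sketch for reference:

```python
import torch

def top1_accuracy(logits, labels):
    """Eq. (12): R_correct / R over the test set."""
    return (logits.argmax(dim=1) == labels).float().mean().item()
```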
4.2. Implementation Details

Our experiments adopt ResNet50 [32] as the baseline CNN. This architecture has been used for many other tasks [33], [34], [35] with good results; note that it can be replaced by other CNNs. The PPL and the classification module employ the same CNN architecture, and the PPL and the classification network with the FCM are trained separately. The PPL takes images at scales 224 × 224 and 448 × 448 as input. In the testing phase, both the discriminative patches and the original images are fed into the network to extract discriminative features, and the robust feature representations obtained after the FCM are used for the final classification. To keep the training and testing stages consistent, we adopt the predicted category probability c to pick out the attention map. We resize and normalize all selected patches and images to 448 × 448. All experiments are based on the open-source toolbox PyTorch. The classification module is trained with a batch size of 16 using momentum SGD with an initial learning rate of 0.001, which is multiplied by 0.1 after 60 epochs, and a weight decay of 1e-4. The thresholds of the localization module are set to 0.7, 0.6 and 0.6, respectively.
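The reported optimization setting can be written down as the following sketch; the momentum value, the total number of epochs and the dummy model construction are our assumptions, while the batch size, learning rate, decay schedule and weight decay come from the text above.

```python
import torch
import torchvision

# Sketch of the reported schedule: batch size 16, momentum SGD, lr 0.001
# multiplied by 0.1 after 60 epochs, weight decay 1e-4.
model = torchvision.models.resnet50(num_classes=200)          # e.g. CUB-200-2011
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.9, weight_decay=1e-4)  # momentum value assumed
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=60, gamma=0.1)

for epoch in range(100):                                      # total epoch count assumed
    # a train_one_epoch(model, optimizer) step would iterate 448x448 inputs in batches of 16
    scheduler.step()
```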
4.3. Quantitative Results

In this subsection, we compare the experimental results of our method with existing methods on the CUB-200-2011, Stanford Cars and FGVC-Aircraft datasets, as shown in Table 1, Table 2 and Table 3, respectively. The compared methods can be divided into four groups: (1) supervised methods, (2) ensemble-of-networks based methods, (3) attention based methods and (4) candidate patch based methods.

Table 1 reports the comparison on the CUB-200-2011 dataset. Our method achieves the best accuracy of 88.3% among all the methods. It exceeds the best compared result, TASN [29], by 0.4%; TASN learns fine-grained knowledge from hundreds of detail-preserved images generated by re-sampling the original images under the guidance of a trilinear attention map. Our method is also superior to the other methods that extract discriminative patches based on attention maps. It performs better than PA-CNN [36], which introduces a part rectification mechanism to ensure highly accurate parts; in contrast, our method employs the FCM to selectively strengthen discriminative features and suppress less useful ones under the guidance of global information, obtaining more discriminative features. Our method obtains 2.1% higher accuracy than MAMC [28], which learns the correlation of features from multiple attention regions to benefit the fine-grained classification task.

Our method also outperforms M2DRL [25], StackDRL [37] and DT-RAM [38], which are based on reinforcement learning. Moreover, it outperforms other approaches that focus on the CNN framework itself, such as DFL [39] and ESR [24]. DFL [39] designs an asymmetric multi-stream architecture to learn a group of convolutional filters that respond to specific discriminative regions in the original image and achieves an accuracy of 87.4%. ESR [24] picks out patch pairs by utilizing the spatial distance between
the spatial coordinates of the corresponding activation vectors on the feature map and constructs a discriminative image representation from the interaction between patches. The accuracy of ESR is 85.5%, which is 2.8% lower than ours. These comparisons show that our method can find the discriminative patches more accurately, which helps the classification network learn subtle and robust discriminative features.

The accuracy of our method is 2.5% higher than OPAM [13], and our method is superior to the other candidate patch based methods, which demonstrates that our PPL can better find the patches containing more discriminative information. Compared with NTS [40], our method brings a 0.8% accuracy improvement, indicating that the discriminability of the recalibrated features is enhanced by the FCM.

Furthermore, we also compare our method with works using annotations, such as T-CNN [41] and Coarse-to-Fine [42]; our method, using only image-level annotations, outperforms all the methods that use either part or object annotations. These results demonstrate that discriminative patches automatically mined via progressive learning under image-level supervision are superior to human-designed patches.
In addition, the comparison results on the Stanford Cars and FGVC-Aircraft datasets are shown in Tables 2 and 3, respectively. Our method beats most state-of-the-art methods and achieves accuracies of 94.0% on Stanford Cars and 92.6% on FGVC-Aircraft. These results indicate that the proposed progressive learning can not only mine discriminative patches of birds but also localize patches of cars and aircraft well.

4.4. Ablation Experiments

To analyse the effectiveness of each component of our framework, we design the following experiments. We adopt ResNet50 as the baseline for all ablation studies; the baseline achieves 84.5% accuracy. Ablation experiments are carried out on the CUB-200-2011, Stanford Cars and FGVC-Aircraft datasets from the following two aspects.
Table 1: Comparison results on the CUB-200-2011 dataset

Methods                 Baseline         Anno.   Accuracy (%)
Part-RCNN [9]           AlexNet          Parts   76.4
Webly-supervised [43]   AlexNet          Parts   78.6
PG Alignment [44]       VGG-19           Bboxs   82.8
Coarse-to-Fine [42]     VGG-19           Bboxs   82.9
T-CNN [41]              AlexNet+ResNet   Bboxs   87.3
Bilinear-CNN [21]       VGGNet           n/a     84.1
ST-CNN [23]             InceptionNet     n/a     84.1
ESR [24]                VGGNet           n/a     85.5
DT-RAM [38]             ResNet50         n/a     86.0
StackDRL [37]           VGG-16           n/a     86.6
M2DRL [25]              VGG-19           n/a     87.2
DFL [39]                ResNet50         n/a     87.4
SCDA [45]               VGG-16           n/a     80.5
RA-CNN [27]             VGG-19           n/a     85.3
MAMC [28]               ResNet50         n/a     86.2
MA-CNN [18]             VGG-19           n/a     86.5
PA-CNN [36]             VGG-19           n/a     87.8
TASN [29]               ResNet50         n/a     87.9
AutoBD [14]             VGG-19           n/a     81.6
TSC [15]                VGG-16           n/a     84.7
WSDL [12]               VGG-16           n/a     85.7
OPAM [13]               VGG-16           n/a     85.8
NTS [40]                ResNet50         n/a     87.5
Our method              ResNet50         n/a     88.3
1) The effectiveness of the progressive patch localization module. The purpose of extracting discriminative patches with the PPL is to learn sufficient
Table 2: Comparison results on the Stanford Cars dataset

Methods             Baseline   Anno.   Accuracy (%)
PG Alignment [44]   VGG-19     Bboxs   92.8
Bilinear-CNN [21]   VGGNet     n/a     91.3
DFL [39]            ResNet50   n/a     93.1
DT-RAM [38]         ResNet50   n/a     93.1
M2DRL [25]          VGG-19     n/a     93.2
TASN [29]           ResNet50   n/a     93.8
SCDA [45]           VGG-16     n/a     85.9
RA-CNN [27]         VGG-19     n/a     92.5
MA-CNN [18]         VGG-19     n/a     92.8
MAMC [28]           ResNet50   n/a     92.8
PA-CNN [36]         VGG-19     n/a     93.3
AutoBD [14]         VGG-19     n/a     88.9
OPAM [13]           VGG-16     n/a     92.2
WSDL [12]           VGG-16     n/a     92.3
NTS [40]            ResNet50   n/a     93.9
Our method          ResNet50   n/a     94.0
discriminative information for the fine-grained image classification network so as to improve its performance. Compared with the baseline accuracy, our PPL brings a 3.0% improvement on the CUB-200-2011 dataset, as shown in Table 4. This result means that our PPL can effectively find the discriminative patches; some discriminative patches are presented in Figure 5. We can observe that the patches from the large-scale image are more concerned with detailed information, while the patches from the small-scale image focus on more general information, which shows that the discriminative information of patches at different scales is complementary.
Table 3: Comparison results on the FGVC-Aircraft dataset

Methods             Baseline   Anno.   Accuracy (%)
Bilinear-CNN [21]   VGGNet     n/a     84.1
ESR [24]            VGGNet     n/a     86.9
DFL [39]            ResNet50   n/a     91.7
SCDA [45]           VGG-16     n/a     79.5
RA-CNN [27]         VGG-19     n/a     88.2
MA-CNN [18]         VGG-19     n/a     89.9
PA-CNN [36]         VGG-19     n/a     91.0
NTS [40]            ResNet50   n/a     91.4
Our method          ResNet50   n/a     92.6
As to the number of selected patches, we can see from Table 5 that the accuracy increases from 84.5% to 86.7% when two patches extracted from the large-scale image (448 × 448) participate in training the classification network. When three patches participate in training, there is no obvious improvement and even a slight decrease in accuracy. Since most of the discriminative information is contained in the first two patches, the remaining regions of the image are not informative enough for the network to classify; the classification network then has to rely on background regions to identify the category, which is usually harmful to fine-grained image classification. Thus we only select the first two patches from each scale for the later fine-grained image classification module. In addition, the first two patches from the 224 × 224 image lift the accuracy from 84.5% to 86.2%, whereas the first three patches from the 224 × 224 image do not help improve the performance of the classification network. When two patches of each scale are used to train the classification network, the accuracy reaches 87.5%. Therefore, we select the first two patches for each scale on the CUB-200-2011 dataset.

However, because objects occupy a large proportion of the whole image in the Stanford Cars and FGVC-Aircraft datasets, we found that the first three patches of each scale should be selected for good performance, as shown in Table 6. When four patches are learned from the original images of each scale, the performance of the classification network declines. This indicates that the discriminative information contained in the fourth patch is not enough to improve the classification network, while the noise it contains interferes with the discriminative features being learned. Figure 6 displays some patches on the Stanford Cars and FGVC-Aircraft datasets.
Figure 5: Some results of the PPL on CUB-200-2011 [1]. The first row shows the original images; the second and third rows show the first and second discriminative patches of images with scale 448 × 448; the fourth and fifth rows show the first and second discriminative patches of images with scale 224 × 224.
Table 4: Effectiveness of each component of our method on the CUB-200-2011, Stanford Cars and FGVC-Aircraft datasets (accuracy, %)

Methods                 CUB    Car    Aircraft
ResNet50                84.5   91.5   87.8
ResNet50 + PPL          87.5   93.0   91.6
ResNet50 + PPL + FCM    88.3   94.0   92.6
Figure 6: Some results of the PPL on Stanford Cars [2] and FGVC-Aircraft [19]. The first row shows the original images, and the second to fourth rows show the top-3 discriminative patches of images with scale 448 × 448.

Table 5: Effectiveness of the multi-scale representation on the CUB-200-2011 dataset

Methods                                          Accuracy (%)
ResNet50                                         84.5
ResNet50 + 2 patches (448 × 448)                 86.7
ResNet50 + 3 patches (448 × 448)                 86.6
ResNet50 + 2 patches (224 × 224)                 86.2
ResNet50 + 3 patches (224 × 224)                 86.2
ResNet50 + 4 patches (224 × 224 & 448 × 448)     87.5
Table 6: Ablation experiments on the PPL module with different numbers of patches from the original images of each scale on the Stanford Cars and FGVC-Aircraft datasets (accuracy, %)

Methods                                          Car    Aircraft
ResNet50                                         92.5   88.3
ResNet50 + 4 patches (224 × 224 & 448 × 448)     93.6   92.1
ResNet50 + 6 patches (224 × 224 & 448 × 448)     94.0   92.6
ResNet50 + 8 patches (224 × 224 & 448 × 448)     92.6   91.0
2) The effectiveness of the feature calibration module. In Table 4, we can observe that the accuracy of the classification network on the CUB-200-2011 dataset increases from 87.5% to 88.3% after the FCM is added. This result suggests that the discriminative information of the calibrated features is more effective. Besides, feeding the concatenation of the four processed patch-level features and the image-level feature into the classification network obtains an improvement of 0.5% over the direct concatenation, as Table 7 shows. This result means that the global information of the image-level feature can calibrate the patch-level features, strengthening their distinctive information and suppressing useless information. The addition operation between patch-level features of different scales further enriches the discriminative information contained in the patch-level features, which helps to improve the classification accuracy.

Table 7: Results of different design choices of the FCM on the CUB-200-2011 dataset. P^img and P^pat denote the image-level and patch-level features, respectively; P^pat_small denotes the patch-level feature from the patch of the small-scale image and P^pat_large the patch-level feature from the patch of the large-scale image.

Setting                                             Accuracy (%)
a) direct concatenation                             87.5
b) sigmoid(P^img) * P^pat                           88.0
c) sigmoid(P^img) * (P^pat_small + P^pat_large)     88.3
88.3
5. Conclusion To address the problem that the selected patches with the lower rank may contain noise information and ensure a diversity of fine-grained features, we 355
propose a progressive patch localization module for fine-grained image classification. This module firstly finds the most discriminative patch by classification network, then removes the most salient region to prompt the localization of the next most discriminative patch, and the top-K discriminative patches can be found by repeating this step iteratively. The discovered patches sharing the same
23
360
label as the original image are used to train the classification network so as to help it learn discriminative features. In addition, we propose an feature calibration module to calibrate the discriminative information of patch-level features by using the global information to re-weight patch-level features. Furthermore, the processed patch-level features fused the multi-scale complementary information
365
can further improve the representaion of patch-level features. Extensive experiments have proved the effectiveness of our method on CUB-200-2011, Stanford Cars and FGVC-Aircraft datasets, which achieves significant improvements.
Acknowledgments This work was supported in part by the National Natural Science Foundation 370
of China (NSFC) under Grants No. 61772108, No. 61572096 and No. 61733002.
References [1] C. Wah, S. Branson, P. Welinder, P. Perona, S. Belongie, The CaltechUCSD Birds-200-2011 Dataset, Tech. rep. (2011). [2] J. Krause, M. Stark, J. Deng, L. Fei-Fei, 3d object representations for 375
fine-grained categorization, in: 2013 IEEE International Conference on Computer Vision Workshops, ICCV Workshops 2013, Sydney, Australia, December 1-8, 2013, 2013, pp. 554–561. doi:10.1109/ICCVW.2013.77. URL https://doi.org/10.1109/ICCVW.2013.77 [3] M. Liu, L. Nie, X. Wang, Q. Tian, B. Chen, Online data organizer: Micro-
380
video categorization by structure-guided multimodal dictionary learning, IEEE Trans. Image Processing 28 (3) (2019) 1235–1247. doi:10.1109/ TIP.2018.2875363. URL https://doi.org/10.1109/TIP.2018.2875363 [4] Y. Wei, X. Wang, W. Guan, L. Nie, Z. Lin, B. Chen, Neural multimodal
385
cooperative learning toward micro-video understanding, IEEE Trans. Im-
24
age Processing 29 (2020) 1–14. doi:10.1109/TIP.2019.2923608. URL https://doi.org/10.1109/TIP.2019.2923608 [5] L. Xie, Q. Tian, R. Hong, S. Yan, B. Zhang, Hierarchical part matching for fine-grained visual categorization, in: IEEE International Conference 390
on Computer Vision, ICCV 2013, Sydney, Australia, December 1-8, 2013, 2013, pp. 1641–1648. doi:10.1109/ICCV.2013.206. URL https://doi.org/10.1109/ICCV.2013.206 [6] T. Berg, P. N. Belhumeur, POOF: part-based one-vs.-one features for finegrained categorization, face verification, and attribute estimation, in: 2013
395
IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, June 23-28, 2013, 2013, pp. 955–962. doi:10.1109/CVPR.2013. 128. URL https://doi.org/10.1109/CVPR.2013.128 [7] D. Cossock, T. Zhang, Statistical analysis of bayes optimal subset ranking,
400
IEEE Trans. Information Theory 54 (11) (2008) 5140–5154. doi:10.1109/ TIT.2008.929939. URL https://doi.org/10.1109/TIT.2008.929939 [8] E. Gavves, B. Fernando, C. G. M. Snoek, A. W. M. Smeulders, T. Tuytelaars, Fine-grained categorization by alignments, in: IEEE International
405
Conference on Computer Vision, ICCV 2013, Sydney, Australia, December 1-8, 2013, 2013, pp. 1713–1720. doi:10.1109/ICCV.2013.215. URL https://doi.org/10.1109/ICCV.2013.215 [9] N. Zhang, J. Donahue, R. B. Girshick, T. Darrell, Part-based r-cnns for fine-grained category detection, in: Computer Vision - ECCV 2014 - 13th
410
European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I, 2014, pp. 834–849. doi:10.1007/978-3-319-10590-1\_54. URL https://doi.org/10.1007/978-3-319-10590-1_54 [10] R. Hong, M. Wang, Y. Gao, D. Tao, X. Li, X. Wu, Image annotation by multiple-instance learning with discriminative feature mapping 25
415
and selection, IEEE Trans. Cybernetics 44 (5) (2014) 669–680.
doi:
10.1109/TCYB.2013.2265601. URL https://doi.org/10.1109/TCYB.2013.2265601 [11] X. He, Y. Peng, J. Zhao, Fine-grained discriminative localization via saliency-guided faster R-CNN, in: Proceedings of the 2017 ACM on Mul420
timedia Conference, MM 2017, Mountain View, CA, USA, October 23-27, 2017, 2017, pp. 627–635. doi:10.1145/3123266.3123319. URL https://doi.org/10.1145/3123266.3123319 [12] X. He, Y. Peng, J. Zhao, Fast fine-grained image classification via weakly supervised discriminative localization, IEEE Trans. Circuits Syst. Video
425
Techn. 29 (5) (2019) 1394–1407. doi:10.1109/TCSVT.2018.2834480. URL https://doi.org/10.1109/TCSVT.2018.2834480 [13] Y. Peng, X. He, J. Zhao, Object-part attention model for fine-grained image classification, IEEE Trans. Image Processing 27 (3) (2018) 1487–1500. doi: 10.1109/TIP.2017.2774041.
430
URL https://doi.org/10.1109/TIP.2017.2774041 [14] H. Yao, S. Zhang, C. Yan, Y. Zhang, J. Li, Q. Tian, Autobd: Automated bilevel description for scalable fine-grained visual categorization, IEEE Trans. Image Processing 27 (1) (2018) 10–23. doi:10.1109/TIP.2017.2751960. URL https://doi.org/10.1109/TIP.2017.2751960
435
[15] X. He, Y. Peng, Weakly supervised learning of part selection model with spatial constraints for fine-grained image classification, in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA., 2017, pp. 4075–4081. URL
440
http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/
14629 [16] S. Ren, K. He, R. B. Girshick, J. Sun, Faster R-CNN: towards realtime object detection with region proposal networks, IEEE Trans. Pattern
26
Anal. Mach. Intell. 39 (6) (2017) 1137–1149. doi:10.1109/TPAMI.2016. 2577031. 445
URL https://doi.org/10.1109/TPAMI.2016.2577031 [17] B. Zhao, X. Wu, J. Feng, Q. Peng, S. Yan, Diversified visual attention networks for fine-grained object classification, IEEE Trans. Multimedia 19 (6) (2017) 1245–1256. doi:10.1109/TMM.2017.2648498. URL https://doi.org/10.1109/TMM.2017.2648498
450
[18] H. Zheng, J. Fu, T. Mei, J. Luo, Learning multi-attention convolutional neural network for fine-grained image recognition, in: IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 2017, pp. 5219–5227. doi:10.1109/ICCV.2017.557. URL https://doi.org/10.1109/ICCV.2017.557
455
[19] S. Maji, E. Rahtu, J. Kannala, M. B. Blaschko, A. Vedaldi, Fine-grained visual classification of aircraft, CoRR abs/1306.5151. arXiv:1306.5151. URL http://arxiv.org/abs/1306.5151 [20] B. Zhao, J. Feng, X. Wu, S. Yan, A survey on deep learning-based fine-grained object classification and semantic segmentation, International
460
Journal of Automation and Computing 14 (2) (2017) 119–135.
doi:
10.1007/s11633-017-1053-3. URL https://doi.org/10.1007/s11633-017-1053-3 [21] T. Lin, A. Roy Chowdhury, S. Maji, Bilinear CNN models for fine-grained visual recognition, in: 2015 IEEE International Conference on Computer 465
Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, 2015, pp. 1449– 1457. doi:10.1109/ICCV.2015.170. URL https://doi.org/10.1109/ICCV.2015.170 [22] D. Wang, Z. Shen, J. Shao, W. Zhang, X. Xue, Z. Zhang, Multiple granularity descriptors for fine-grained categorization, in: 2015 IEEE International
470
Conference on Computer Vision, ICCV 2015, Santiago, Chile, December
27
7-13, 2015, 2015, pp. 2399–2406. doi:10.1109/ICCV.2015.276. URL https://doi.org/10.1109/ICCV.2015.276 [23] M. Jaderberg, K. Simonyan, A. Zisserman, K. Kavukcuoglu, Spatial transformer networks, in: Advances in Neural Information Processing Systems 475
28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, 2015, pp. 2017–2025. URL http://papers.nips.cc/paper/5854-spatial-transformer-networks [24] L. Qi, X. Lu, X. Li, Exploiting spatial relation for fine-grained image classification, Pattern Recognition 91 (2019) 47–55. doi:10.1016/j.patcog.
480
2019.02.007. URL https://doi.org/10.1016/j.patcog.2019.02.007 [25] X. He, Y. Peng, J. Zhao, Which and how many regions to gaze: Focus discriminative regions for fine-grained visual categorization, International Journal of Computer Vision 127 (9) (2019) 1235–1255.
485
doi:
10.1007/s11263-019-01176-2. URL https://doi.org/10.1007/s11263-019-01176-2 [26] X. Zhang, H. Xiong, W. Zhou, W. Lin, Q. Tian, Picking deep filter responses for fine-grained image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV,
490
USA, June 27-30, 2016, 2016, pp. 1134–1142. doi:10.1109/CVPR.2016. 128. URL https://doi.org/10.1109/CVPR.2016.128 [27] J. Fu, H. Zheng, T. Mei, Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition, in: 2017
495
IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, 2017, pp. 4476–4484. doi: 10.1109/CVPR.2017.476. URL https://doi.org/10.1109/CVPR.2017.476
28
[28] M. Sun, Y. Yuan, F. Zhou, E. Ding, Multi-attention multi-class constraint 500
for fine-grained image recognition, in: Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XVI, 2018, pp. 834–850. doi:10.1007/978-3-030-01270-0\ _49. URL https://doi.org/10.1007/978-3-030-01270-0_49
505
[29] H. Zheng, J. Fu, Z. Zha, J. Luo, Looking for the devil in the details: Learning trilinear attention sampling network for fine-grained image recognition, in:
IEEE Conference on Computer Vision and Pattern
Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, 2019, pp. 5012–5021. 510
URL
http://openaccess.thecvf.com/content_CVPR_2019/
html/Zheng_Looking_for_the_Devil_in_the_Details_Learning_ Trilinear_Attention_CVPR_2019_paper.html [30] Y. Wei, J. Feng, X. Liang, M. Cheng, Y. Zhao, S. Yan, Object region mining with adversarial erasing: A simple classification to semantic segmentation 515
approach, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, 2017, pp. 6488–6496. doi:10.1109/CVPR.2017.687. URL https://doi.org/10.1109/CVPR.2017.687 [31] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Ba-
520
tra, Grad-cam: Visual explanations from deep networks via gradientbased localization, in: IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 2017, pp. 618–626. doi:10.1109/ICCV.2017.74. URL https://doi.org/10.1109/ICCV.2017.74
525
[32] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016, pp. 29
770–778. doi:10.1109/CVPR.2016.90. URL https://doi.org/10.1109/CVPR.2016.90 530
[33] W. Huang, H. Ding, G. Chen, A novel deep multi-channel residual networks-based metric learning method for moving human localization in video surveillance, Signal Processing 142 (2018) 104–113. doi:10.1016/j. sigpro.2017.07.015. URL https://doi.org/10.1016/j.sigpro.2017.07.015
535
[34] Y. Cao, Z. He, Z. Ye, X. Li, Y. Cao, J. Yang, Fast and accurate single image super-resolution via an energy-aware improved deep residual network, Signal Processing 162 (2019) 115–125. doi:10.1016/j.sigpro.2019.03.018. URL https://doi.org/10.1016/j.sigpro.2019.03.018 [35] R. Hong, L. Li, J. Cai, D. Tao, M. Wang, Q. Tian, Coherent semantic-visual
540
indexing for large-scale image retrieval in the cloud, IEEE Trans. Image Processing 26 (9) (2017) 4128–4138. doi:10.1109/TIP.2017.2710635. URL https://doi.org/10.1109/TIP.2017.2710635 [36] H. Zheng, J. Fu, Z. Zha, J. Luo, T. Mei, Learning rich part hierarchies with progressive attention networks for fine-grained image recognition, IEEE
545
Trans. Image Processing 29 (2020) 476–488.
doi:10.1109/TIP.2019.
2921876. URL https://doi.org/10.1109/TIP.2019.2921876 [37] X. He, Y. Peng, J. Zhao, Stackdrl: Stacked deep reinforcement learning for fine-grained visual categorization, in: Proceedings of the Twenty550
Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden., 2018, pp. 741–747. doi: 10.24963/ijcai.2018/103. URL https://doi.org/10.24963/ijcai.2018/103 [38] Z. Li, Y. Yang, X. Liu, F. Zhou, S. Wen, W. Xu, Dynamic computational
555
time for visual attention, in: 2017 IEEE International Conference on Computer Vision Workshops, ICCV Workshops 2017, Venice, Italy, October 30
22-29, 2017, 2017, pp. 1199–1209. doi:10.1109/ICCVW.2017.145. URL https://doi.org/10.1109/ICCVW.2017.145 [39] Y. Wang, V. I. Morariu, L. S. Davis, Learning a discriminative filter bank 560
within a CNN for fine-grained recognition, in: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, 2018, pp. 4148–4157. doi:10.1109/CVPR. 2018.00436. URL http://openaccess.thecvf.com/content_cvpr_2018/html/Wang_
565
Learning_a_Discriminative_CVPR_2018_paper.html [40] Z. Yang, T. Luo, D. Wang, Z. Hu, J. Gao, L. Wang, Learning to navigate for fine-grained classification, in: Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIV, 2018, pp. 438–454. doi:10.1007/978-3-030-01264-9\_26.
570
URL https://doi.org/10.1007/978-3-030-01264-9_26 [41] H. Xu, G. Qi, J. Li, M. Wang, K. Xu, H. Gao, Fine-grained image classification by visual-semantic embedding, in: Proceedings of the TwentySeventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden., 2018, pp. 1043–1049. doi:
575
10.24963/ijcai.2018/145. URL https://doi.org/10.24963/ijcai.2018/145 [42] H. Yao, S. Zhang, Y. Zhang, J. Li, Q. Tian, Coarse-to-fine description for fine-grained visual categorization, IEEE Trans. Image Processing 25 (10) (2016) 4858–4872. doi:10.1109/TIP.2016.2599102.
580
URL https://doi.org/10.1109/TIP.2016.2599102 [43] Z. Xu, S. Huang, Y. Zhang, D. Tao, Webly-supervised fine-grained visual categorization via deep domain adaptation, IEEE Trans. Pattern Anal. Mach. Intell. 40 (5) (2018) 1100–1113. doi:10.1109/TPAMI.2016. 2637331.
585
URL https://doi.org/10.1109/TPAMI.2016.2637331 31
[44] J. Krause, H. Jin, J. Yang, F. Li, Fine-grained recognition without part annotations, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, 2015, pp. 5546–5555. doi:10.1109/CVPR.2015.7299194. 590
URL https://doi.org/10.1109/CVPR.2015.7299194 [45] X. Wei, J. Luo, J. Wu, Z. Zhou, Selective convolutional descriptor aggregation for fine-grained image retrieval, IEEE Trans. Image Processing 26 (6) (2017) 2868–2881. doi:10.1109/TIP.2017.2688133. URL https://doi.org/10.1109/TIP.2017.2688133
Conflict of Interest

The authors declare that no conflict of interest exists in the submission of this manuscript, and the manuscript is approved by all authors for publication. The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Credit Author Statement

Tiantian Yan: Conceptualization, Methodology, Software, Writing - Original Draft. Shijie Wang: Validation, Software. Zhihui Wang: Conceptualization, Writing - Review & Editing. Haojie Li: Conceptualization, Funding acquisition. Zhongxuan Luo: Supervision.