Improving classification with semi-supervised and fine-grained learning


Accepted Manuscript

Improving Classification with Semi-Supervised and Fine-grained Learning

Danyu Lai, Wei Tian, Long Chen

PII: S0031-3203(18)30423-0
DOI: https://doi.org/10.1016/j.patcog.2018.12.002
Reference: PR 6732

To appear in: Pattern Recognition

Received date: 26 March 2018
Revised date: 29 September 2018
Accepted date: 6 December 2018

Please cite this article as: Danyu Lai, Wei Tian, Long Chen, Improving Classification with Semi-Supervised and Fine-grained Learning, Pattern Recognition (2018), doi: https://doi.org/10.1016/j.patcog.2018.12.002

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


Highlights

• We propose the voted pseudo label (VPL) for semi-supervised learning;

• We propose a new method for fine-grained feature learning;

• Our model can be applied to domain adaptation quickly and effectively;

• Our approach can combine almost all deep neural network models and training methods;

• Extensive experiments on two challenging datasets demonstrate the effectiveness of our approach.


Improving Classification with Semi-Supervised and Fine-grained Learning


Danyu Lai (a), Wei Tian (b), Long Chen (a,*)

(a) School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China. [email protected], *[email protected]
(b) Institute of Measurement and Control Systems, Karlsruhe Institute of Technology, Karlsruhe, Germany. [email protected]

Abstract

In this paper, we propose a novel and efficient multi-stage approach which combines semi-supervised learning and fine-grained learning to improve the performance of a classification model learned from only a few samples. The fine-grained category recognition process utilized in our method is dubbed MSR. In this process, we cut images into multi-scaled parts and feed them into the network to learn more fine-grained features. By assigning these image cuts dynamic weights, we can reduce the negative impact of background information and thus achieve a more accurate prediction. Furthermore, we present the voted pseudo label (VPL), an efficient method of semi-supervised learning. In this approach, for unlabeled data, VPL picks up the classes with non-confused labels verified by the consensus prediction of different classification models. These two methods can be applied to most neural network models and training methods. Inspired by classifier-based adaptation, we also propose a mix deep CNN architecture (MixDCNN). Both the VPL and MSR are integrated with the MixDCNN. Comprehensive experiments demonstrate the effectiveness of VPL and MSR. Without bells and whistles, we achieve state-of-the-art or better performance on two fine-grained recognition tasks, the Stanford Dogs and CUB Birds datasets, with accuracies of 95.6% and 85.2%, respectively.

Keywords: Semi-supervised learning; fine-grained feature learning; mixture of DCNNs; image classification

* Corresponding author

Preprint submitted to Journal of Pattern Recognition

December 6, 2018


1. Introduction


In recent years, deep neural networks have achieved significant success, especially in tough vision-based perception tasks [1, 2, 3, 4, 5]. Deep learning allows computational models composed of multiple processing layers to learn data representations at multiple levels of abstraction [6]. In classical machine learning, a basic assumption of statistical learning theory is that the training and test data are drawn from the same distribution. However, this assumption does not hold in many applications. Hence, to cope with this problem, a realistic strategy, the transfer learning approach, is used to employ prior knowledge from similar domains or tasks [7], e.g., instance-based adaptation, feature representation adaptation, and classifier-based adaptation [8, 9, 10].

Depending on the availability of labeled data in the target domain, domain adaptation can generally be divided into semi-supervised and unsupervised domain adaptation [11]. A semi-supervised method only requires a certain amount of labeled training samples, whereas supervised learning usually requires a large amount of labeled data. As an effective approach to the problem of insufficient data, semi-supervised methods draw increasing attention from the research community [12, 13, 14, 15]. For instance, Lee [12] proposed to train the neural network in a semi-supervised fashion by adding unlabeled data with the maximum predicted probability to the labeled dataset, yielding a high score on the MNIST dataset. However, since a high predicted probability cannot guarantee a correct label (e.g., from a poorly trained model), this approach may introduce additional noise into the training set, deteriorating

the classification performance.

Figure 1: Two species of woodpecker. The subtle visual differences are derived from two differently scaled local regions (1/2 and 1/3 respectively indicate the ratio of the cropped area to the original object image), i.e., the color of the heads in the yellow boxes, which are important features to distinguish between these two species.

As another hotspot in the deep learning area, fine-grained category recognition has also become attractive, especially in the classification of bird species [16, 17], flower types [18, 19], car models [20, 21], etc. Among these tasks, a common learning approach first conducts an exact region localization and thereafter utilizes fine-grained features learned from those regions to distinguish the visual differences between two images. While this approach benefits a variety of applications such as expert-level image recognition [22, 23] and rich image captioning [24, 25], challenges also emerge,

mainly in two folds: localization of discriminative regions and effective learning of fine-grained features. In previous works, part-based recognition frameworks and the Recurrent Attention Convolutional Neural Network [26] were introduced to address these problems and achieved impressive progress. In these approaches, the convolutional neural network (CNN) is trained on image characteristics from extra human-defined regions, which can be selected in an unsupervised or semi-supervised manner. However, defining these regions is rather tedious and takes a lot of effort. Moreover, regions selected by existing unsupervised methods may not be optimal for training classifiers.

To deal with the above challenges, we propose a novel semi-supervised learning method, named voted pseudo label (VPL), and an approach (i.e., MSR) to learn multi-scale fine-grained features. Both the VPL and MSR are used in MixDCNNs for feature extraction and classification. The VPL cautiously adds unlabeled data, which are consistently verified by three different experts in the MixDCNNs, to expand the dataset. For training, the MSR randomly crops each image from the training set into a fixed number of patches within two scales. For testing, the MSR crops an image in a fixed way as shown in Fig. 5 and assigns each cropped patch a dynamic weight. In the MixDCNNs, features extracted by each expert are concatenated and then fed into a fully connected layer for classification (shown in Fig. 2). The proposed approach is tested on the complex datasets of Stanford Dogs [27] and


Figure 2: The workflow of our approach. We use random cropping on the detected image data in the training set and fixed cropping on the testing set. All images and cropped patches form a large dataset, and features are extracted by three different models pre-trained on ImageNet. These features are concatenated and fed into two fully connected layers. The object prediction is based on the input image cuts, which are assigned dynamic weights. By analyzing the prediction errors on the validation set, we pick out unlabeled classes associated with voted pseudo labels and add them to the training set.


CUB Birds [28], achieving accuracies of 95.6% and 85.2%, respectively. Our contributions are summarized as follows:

• We propose the voted pseudo label (VPL) approach, which extracts precise pseudo labels based on the consensus judgment of different models.

• We propose a method to improve the classification model by fine-grained learning, dubbed MSR. Without complex human-defined annotations or extra time to train the CNN, the method learns more discriminative features.

• We propose a MixDCNN built from three pre-trained networks for feature extraction and classification.

• Extensive experiments on two challenging datasets, i.e., CUB Birds [28] and Stanford Dogs [27], demonstrate the effectiveness of our approach, which is superior to the state of the art.

In the rest of the paper, we review related works in Section 3. The proposed method is introduced in Section 4. Section 5 provides the experimental results, and the paper is concluded in Section 6.

2. Background

In this section, we briefly describe the main required concepts as well as background information related to our approach.

2.1. Mixture of DCNNs

Fine-grained categorization has been a challenging vision problem because of small inter-class variation and large intra-class variation. To overcome these problems, dividing the fine-grained dataset into multiple visually similar subsets, or directly using multiple neural networks, is a widely used way to improve classification performance [29]. In [30], the authors proposed mixtures of deep convolutional neural networks (MixDCNNs) for fine-grained image classification. This mixture system shows state-of-the-art results on the Birdsnap [31] and PlantCLEF Flower [32] datasets. In this paper, we use a simplified MixDCNN architecture, which is composed of three classification models and two additional fully connected layers. The features passing through the final convolutional layer of each model are sent to a global pooling layer and then concatenated together. Fig. 3 shows the architecture. During training, the MixDCNNs freeze the convolution layers and only train the fully connected layers. This can save a lot of time compared to training the whole network. In the MSR proposal, the mixture model is utilized to extract and classify the image features.
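To make the data flow concrete, the following is a minimal PyTorch sketch of such a mixture: three frozen pre-trained backbones produce globally pooled features, which are concatenated and classified by two trainable fully connected layers. The backbone choice (ResNet-152 for all three slots) and the layer sizes are illustrative assumptions, not the authors' exact configuration, which combines three different networks.

```python
# Sketch of a simplified MixDCNN: frozen CNN experts + trainable FC classifier.
import torch
import torch.nn as nn
from torchvision import models


class MixDCNN(nn.Module):
    def __init__(self, num_classes, hidden_dim=1024, dropout_p=0.5):
        super().__init__()

        def make_extractor():
            net = models.resnet152(weights="IMAGENET1K_V1")
            # Keep everything up to (and including) the global pooling layer.
            extractor = nn.Sequential(*list(net.children())[:-1])
            for p in extractor.parameters():
                p.requires_grad = False          # freeze convolutional weights
            return extractor

        self.experts = nn.ModuleList([make_extractor() for _ in range(3)])
        feat_dim = 3 * 2048                      # concatenated pooled features
        self.classifier = nn.Sequential(         # only these layers are trained
            nn.Linear(feat_dim, hidden_dim), nn.Tanh(), nn.Dropout(dropout_p),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x):
        feats = [e(x).flatten(1) for e in self.experts]   # global-pooled features
        return self.classifier(torch.cat(feats, dim=1))   # logits over categories
```

Because the convolutional weights stay frozen, only the small classifier is optimized, which is what makes the mixture cheap to retrain when pseudo labels are added later.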


2.2. Voted Pseudo Label

In [12], the author proposed a simple semi-supervised learning method for deep neural networks, which picks up the class with the maximum predicted probability as a pseudo label and uses it as if it were a true label. In this paper, we change this strategy into a voting operation. Each model in the MixDCNNs predicts on the validation set. Only when an image receives a consistently predicted label from all three models, and that label does not belong to the confused labels, is it selected as a pseudo label and added to extend the training set. A confused label is a class which is hard to distinguish

by the neural network. To be more specific, we define the class label as

C = \begin{cases} l, & \text{for } l \notin CF \ \text{and}\ f_1(x) = f_2(x) = f_3(x) \\ 0, & \text{otherwise} \end{cases}    (1)

where label l is obtained by the consensus of the three model responses f_i(x), i = 1, ..., 3, and CF stands for the set of confused labels.

2.3. Multi-Scale Recognition

Fine-grained recognition aims to distinguish objects from different subordinate-level categories within a general category [29]. It is a challenging task because of the subtle differences in overall appearance between various classes (low inter-class variation) and the large pose and appearance variations within the same class (large intra-class variation). Much of the work on fine-grained image classification deals with this issue by detecting and modeling local parts [33, 34, 35]. In our paper, we propose a multi-scale recognition (MSR) approach to force the model to focus more on fine-grained features.


For training, the MSR randomly crops the input image into a fixed number of patches within two scales. For testing, the MSR crops the image in a fixed way, as shown in Fig. 5, and assigns each cropped patch a dynamic weight. We do it this way so that, during training, the diverse image inputs let the model learn more generalized features, while the weighted output reduces the influence of the background in the image.

3. Related Work

In this section, we introduce related research mainly from three aspects: transfer learning for image classification, semi-supervised learning, and fine-grained image recognition.

recognition.

3.1. Transfer Learning for Image Classification

105

In fact, transfer learning has been proposed with a relative long history in the ma-

chine learning domain. Domain Adaptation is a representative method of transfer learning, which refers to using information of rich source domain samples to improve the 7

ACCEPTED MANUSCRIPT

performance of the target domain model. In [36], Afridi et al. rank CNNs by reliability in a zero-shot manner to select the most suitable model from the source task to the 110

target task. According to different types of target domain and source domain, domain

CR IP T

adaptive problems can be divided to four different types: unsupervised, supervised, heterogeneous distribution and multiple source domain problems.

Existing approaches predominantly solve image classification problem by training CNN models in the end-to-end manner [26, 37, 21, 38]. Although this fashion may 115

yield good results, it is inapplicable in use cases where the size of available dataset is insufficient or the training time is strictly limited.

AN US

Therefore, it is reasonable to utilize the prior knowledge from other domain. Feature adaptation and classifier-based adaptation can help to solve the problem of insufficient data. The pre-trained model, which is directly adopted as a feature extractor [39], 120

can save lots of training time [40]. By utilizing such transfer learning fashion, impressive performance gain has been achieved in vision-based classification tasks [41].

M

3.2. Semi-supervised Learning

Semi-supervised learning is a subclass of the supervised learning approach yet tak-

125

ED

ing unlabeled data into consideration, especially when the volume of annotated data is insufficient for training networks. Normally, unsupervised learning is treated as an auxiliary task to supervised learning in research works. For instance, Hinton et al. learn

PT

a stack of unsupervised restricted Boltzmann machines to pre-train the model [42]. Ranzato et al. reconstruct the input at each level of network for a compact representation [43], in which the auxiliary task of ladder networks is utilized for denoising. In [44], the labeled and unlabeled samples are learned together by a multi-manifold

CE 130

Isomap learning framework.

AC

In contrast, other works are focusing on how to assign labels to unlabeled data.

Representatively, Papandreou et al. [14] combine both strong and weak labels using an expectation-maximization (EM) process for image segmentation. In [13, 15], the

135

samples generated from the generator of a GAN model are packed into one category and fed into the discriminator. In [45], labels for the unlabeled data are gained by the Dirichlet process-based clustering algorithm. 8

ACCEPTED MANUSCRIPT

The approach most related to ours is [12], in which Lee et al. proposed the pseudo label which picks up unlabeled data with the maximum predicted probability. However, 140

such selection approach may not ensure the accuracy. In comparison, we propose the

CR IP T

voted pseudo label with a better validity. 3.3. Fine-grained Image Recognition

Discriminative region localization and fine-grained feature learning are the two main challenges of fine-grained object recognition (e.g., on bird species). Research on the recognition of fine-grained images can be mainly categorized into two groups: sophisticated region localization and learning of discriminative features. In the first group, several previous works leverage extra bounding boxes (e.g., for part annotation) in fine-grained image classification [33, 34, 35, 28, 38, 46, 47]. As the manual work involved in the annotation task is heavy, such an approach is impractical for large-scale problems. In other works, part detectors are trained in an unsupervised manner, e.g., by analyzing CNN filter responses [17] and deploying multi-grained descriptors [48].

LG-CNN [49] select and filter image feature through two CNNs which share weights. The most relevant work to ours is [26], which utilizes recurrent attention convolutional

155

ED

network to combine features from discriminative local regions in three scales. In this paper, we obtain discriminative regions by a random selection based approach. In the second group, learning of powerful feature representation is the main task.

PT

Several works cast this task into learning deeper networks. For instance, deep residual network upscaled to a depth of 152 is utilized in [37], reducing the error rate to 3.75% on the ImageNet test set [4]. For better modeling the subtle difference among finegrained categories, a bilinear structure [21] is proposed to capture local differences of

CE 160

image and yields state-of-the-art results on the CUB birds dataset [28]. Besides, by

AC

deploying a unified framework incorporating two filter response picking steps, Zhang et al. [17] achieve superior results on both the bird [28] and dog datasets [27].

165

Different from these aforementioned methods, we adopt the pseudo label along

with a multi-scale recognition approach. The network is trained to learn differently scaled fine-grained features on a train set extended by voted pseudo label.


Figure 3: The framework of MixDCNNs. Our idea is to crop images and send them into the MixDCNNs to extract features for classification. The symbol ⊕ represents the “crop” and “select” operation. We select images at all scales (from top to bottom) as inputs. Every image or image patch is resized to the same size and sent into three different models to extract features. “concat” denotes the concatenation of those three feature vectors into a compact vector (marked in red). Thereafter, the new vector is sent to the “fc” layer (marked in blue) to make the prediction via the classification function L_cls.

4. Approach

In this section, we introduce the proposed approaches of the voted pseudo label (VPL) and multi-scale recognition (MSR). The voted pseudo labels are selected by the consensus judgment of three different expert networks and further utilized in the training procedure. In the MSR method, we use images tailored in two different scales to train the model so that it learns finer features during training. When testing, we divide one image into six parts, which are weighted dynamically, and feed them into the network to make the final prediction.

175

4.1. The MixDCNNs Inspired by [30], to better deal with the problem of fine-grained categorization. We

propose a new MixDCNN which is made up of three different networks by stacking their convolutional features. We also use the idea of classifier-based adaptation and the three CNNs in the MixDCNN is transfered from the ImageNet. We also freeze 10

ACCEPTED MANUSCRIPT

180

their convolution layer to speed up training. Finally, we use two fully connected layers to process these features. The whole architecture is illustrated in Fig. 3.

Given

an input image X, we first extract region-based deep features by feeding the images

CR IP T

into pre-trained models. The extracted features are represented as g (X, θ), with g (·) denoting operations such as convolution, pooling and activation. And θ stands for the 185

corresponding network parameters.

Here we prefer stacking because the difference between model’s structures is normally large, which is resulted by the fact that different models usually emphasize different information of the image. Thus, stacking them together can enable the model to

AN US

have a stronger generalization ability.

The task of the connected network is to generate a probability distribution p over all categories, interpreted as:

p(X) = f (Wc ∗ X), 190

(2)

where f (·) includes the two fully connected layers as well as the post-processing step

M

by softmax operation. The fully connected layers map the input features into a compact feature vector, which is consistent with the category entries, while the softmax function

ED

further converts the vectorized features into probabilities. The application of softmax function instead of a Support Vector Machine (SVM) [50] is mainly for technical con195

sistency of feature classification, so that we can integrate the multi-scale descriptors.

PT

In the training procedure, the fitting errors of category label over the samples are measured via a cross entropy loss function. Furthermore, we select the tanh function

CE

as the activation of our fully connected layer, which squashes a real-valued number x to the range of [−1, 1]. In experiments, we also compare the performance of other

200

activation functions such as Sigmoid, ReLU, etc. Although they can yield comparable

AC

precision on the test dataset, they are not so stable as the tanh function especially when

images are cropped into significantly different scales. In our network, each fully connected layer is also followed by a dropout layer. Dur-

ing the training of network, the hidden units are randomly omitted with a probability

205

of P = 0.5, which yields the best results in our experiment. Besides, we found that a dropout of 40% or 45% of the visible units is also helpful. By this technique, we can reduce overfitting effectively.

Figure 4: Probability distribution of false predictions by the MixDCNNs on the dataset of Stanford Dogs. In this example, classes such as Eskimo dog, Lhasa, Siberian husky, and Staffordshire bullterrier have a high error proportion. Thus, these classes are assigned confused labels because the model cannot recognize them well.

4.2. Voted Pseudo Labels

Pseudo labeling regards predicted target classes as real labels for unlabeled samples. By analyzing the predictions of the model on the validation set, we can assess the performance of our network on different categories. Confused labels represent categories whose accuracy is far below the average. The boundary between confused and non-confused labels is thus determined by the variation of prediction errors. In our work, we analyze the error distribution, i.e., in the form of a histogram, over all classes on the validation set, which is a 20% portion of the training data. As shown in Fig. 4, for those not-well-recognized classes (the red bars), we assign confused labels because they are far below the average class precision. The unlabeled samples for the voted pseudo label are extracted from those categories with non-confused labels (green bars in Fig. 4), which have a high prediction accuracy based on the test results on the validation set. In comparison, the voted pseudo labels are associated with the classes of the unlabeled

data, under the assumption that they are true labels. In this approach, we pick up the images which are not assigned confused labels and add them to the set of pseudo labels C if the predictions of all three models f_i(x), 1 ≤ i ≤ 3, are in consensus, formulated as in Eq. (1). The three models utilized in the MixDCNN are InceptionV4 [51], InceptionResNetV2 [51] and Xception [52], respectively. After adding pseudo

labels, the fully connected layers are retrained in a supervised fashion on the data with both true labels and pseudo labels. The same loss function as in the supervised learning task is deployed, which is given by

E = -\sum_{j=1}^{n'} L'_j \log p'_j,    (3)

where the variable n' replaces n (the number of images in the training set before adding unlabeled data) and indicates the increased number of images after adding voted pseudo labels. Adding voted pseudo labels augments the training data. Although the added unlabeled data may contain noise, a well-trained CNN can still handle these errors. Since the selection process is stricter in our method, we obtain more correct pseudo labels.
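The selection step can be summarized in a small sketch. It assumes that per-sample class predictions of the three experts on the unlabeled pool are available as integer arrays and that confused classes are derived from per-class validation accuracy; all names and the 90% threshold are illustrative assumptions, not the paper's exact values.

```python
# Voted pseudo label selection: three-model consensus on non-confused classes.
import numpy as np

def find_confused_classes(val_labels, val_preds, num_classes, threshold=0.90):
    """Classes whose validation accuracy falls below the threshold are 'confused'."""
    val_labels, val_preds = np.asarray(val_labels), np.asarray(val_preds)
    confused = set()
    for c in range(num_classes):
        mask = val_labels == c
        if mask.any() and (val_preds[mask] == c).mean() < threshold:
            confused.add(c)
    return confused

def select_voted_pseudo_labels(preds_a, preds_b, preds_c, confused_classes):
    """Keep unlabeled samples on which all three experts agree on a non-confused class."""
    a, b, c = map(np.asarray, (preds_a, preds_b, preds_c))
    keep = (a == b) & (b == c) & ~np.isin(a, list(confused_classes))
    idx = np.flatnonzero(keep)
    return idx, a[idx]   # indices into the unlabeled pool and their pseudo labels
```

The accepted samples are then appended to the labeled training set and the fully connected layers are retrained with the usual cross-entropy loss of Eq. (3).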


4.3. Multi-scale Recognition

Inspired by the recent success of the Recurrent Attention Convolutional Neural Network (RA-CNN) [26], in this paper we propose an efficient fine-grained image learning approach which does not require amending the construction of the deep network while improving the prediction performance. Given an input image X selected from the training set, we first send it to the Single Shot MultiBox Detector (SSD) [53] to get the positive image X1, which is bounded by the detection window. Thereafter, we crop 9 and 8 small square patches on X1, covering 1/9 and 1/4 of X1 respectively. The cropping is randomly conducted in 8 equidistant directions outgoing from the image center, and the patch locations are evenly distributed over the whole image. These 17 image cuts, along with the image X1, are then fed into the MixDCNNs described in Sec. 4.1 to extract features and to train the fully connected layers. As these image patches include discriminative features, especially of small object parts, the model is able to learn more fine-grained features from a single image.
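A rough sketch of the two-scale training crops is given below, assuming a PIL image of the detected object region. The patch counts and relative sizes follow the text above; the uniform random placement (rather than the 8-direction scheme) and all names are illustrative assumptions.

```python
# Two-scale random square crops used to expose the model to fine-grained parts.
import random
from PIL import Image

def two_scale_crops(img: Image.Image, n_small: int = 9, n_large: int = 8):
    """Return 17 random square crops: n_small covering ~1/9 of the object image
    (1/3 side length) and n_large covering ~1/4 (1/2 side length)."""
    w, h = img.size
    crops = []
    for count, frac in ((n_small, 1 / 3), (n_large, 1 / 2)):
        side = int(min(w, h) * frac)
        for _ in range(count):
            left = random.randint(0, w - side)
            top = random.randint(0, h - side)
            crops.append(img.crop((left, top, left + side, top + side)))
    return crops
```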


Figure 5: During testing, we always divide an image into five fixed parts, with their locations evenly distributed over the whole image. These image cuts are 1/4 of the original image size. The five patches, along with the original image, are fed into the network to extract features and to make the prediction.
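The fixed test-time cropping can be sketched as follows, assuming the five patches are the four corners plus the center; the exact placement is an assumption based on the caption above.

```python
# Fixed five-crop scheme used at test time (each crop is 1/4 of the image area).
from PIL import Image

def fixed_five_crops(img: Image.Image):
    w, h = img.size
    cw, ch = w // 2, h // 2
    anchors = [(0, 0), (w - cw, 0), (0, h - ch), (w - cw, h - ch),
               ((w - cw) // 2, (h - ch) // 2)]    # four corners and the center
    return [img.crop((x, y, x + cw, y + ch)) for x, y in anchors]
```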

For images in the test set, analogously, we first send an image Y into the SSD [53] to extract the detected image region Y1. We then crop Y1 into five small parts of the same size, as shown in Fig. 5. These small parts, along with the image Y1, are fed into the trained model to make the prediction. To reduce the negative impact of background blocks, we propose a solution called dynamic weighting, interpreted as

W_i^Y = P_{i,\mathrm{Max}}^Y - P_{i,\mathrm{Sec}}^Y,    (4)

where P_{i,\mathrm{Max}}^Y and P_{i,\mathrm{Sec}}^Y are respectively the first and the second largest probabilities of image cut i predicted by the softmax function. The weight W_i^Y is assigned to image cut i and multiplied with its probability P_{i,l}^Y for class l in the equation

P_l^Y = \sum_{i=1}^{6} W_i^Y \cdot P_{i,l}^Y,    (5)

AC

the weight of a background block is usually small. Thus, we can reduce the negative impact of background blocks. The final label of image Y is determined by equation lY = arg max(PlY ), l

(6)

where label lY is associated with the class of maximum probability predicted on image Y. 14

ACCEPTED MANUSCRIPT

Datasets | #Category number | #Training images | #Testing images
CIFAR-10 [54] | 10 | 50000 | 10000
CIFAR-100 [54] | 100 | 50000 | 10000
CUB-200-2011 [28] | 200 | 5994 | 5794
Stanford Dogs [27] | 120 | 12000 | 8580

Table 1: The statistics of the datasets used in this paper.

5. Experiments

Datasets: We divide our experiments into two parts. The first experiment is conducted on two famous image classification datasets, CIFAR-10 and CIFAR-100 [54], while the second one is on two challenging fine-grained image recognition datasets, Caltech-UCSD Birds (CUB-200-2011) [28] and Stanford Dogs [27]. The detailed statistics about their category numbers and data splits are summarized in Table 1.

5.1. SSD Network Training

We initialize our SSD model based on the architecture of VGG16 [55] with weights trained for classification on the ILSVRC CLS-LOC dataset [39]. We fine-tuned the network on the training set of PASCAL VOC2007 + VOC2012 for 21 classes.

During training, we apply the L2 normalization technique and randomly sample each training image by one of the following options (see the sketch after this list):

- retaining the complete original input image,
- sampling a patch which has an overlap with the objects in a portion of 0.1, 0.3, 0.5, 0.7, or 0.9, and
- randomly sampling a patch with its center aligned to the center of the original image.

We post-process sampled images by horizontal flipping with a probability of 0.5. Then all the images are resized to a fixed size of 300×300 pixels. Finally, we apply some photo-metric distortions referred from [53].
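An illustrative sketch of the per-image sampling choice is given below; the overlap test and the two crop helpers are simplified assumptions, not the exact SSD implementation from [53].

```python
# Per-image sampling among the three augmentation options listed above.
import random

def sample_training_patch(image, boxes):
    option = random.choice(["original", "min_overlap", "center"])
    if option == "original":
        return image                                             # keep the full image
    if option == "min_overlap":
        min_iou = random.choice([0.1, 0.3, 0.5, 0.7, 0.9])
        return crop_with_min_object_overlap(image, boxes, min_iou)   # assumed helper
    return crop_centered_patch(image)                            # assumed helper
```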


Figure 6: The classification precision of the GoogleNet on the validation set. The horizontal axis indicates the class ID. We mark a category in orange when its accuracy falls within the range of 90%-91% and denote it as a light-confused label (i.e., the bird and dog classes with IDs 2 and 5, respectively). The color red denotes an accuracy of less than 90% and corresponds to a confused label (here the cat class with ID 3).

For optimization, we used the standard adaptive moment estimation (Adam) with a learning rate of 1×10^-3. The learning rate decay factor is set to 0.94 and the weight decay is equal to 5×10^-4. We start with a fixed learning rate to train the network. Once the network converges to a good result (0.5 mAP, for instance), we change the learning rate and fine-tune the complete network. We use a mini-batch size of 32 and train the network for a maximum of 500 epochs on an Nvidia GTX Titan X GPU. We achieve an mAP of 0.778 on the VOC2007 test set.

PT

5.2. Experiments on VPL

In this part, we conduct experiments based on a variable-controlling method to prove the effectiveness of our semi-supervised approach. We take 20% of the training set as the validation set and another 20% of the training set as unlabeled data. The hyperparameters are also determined w.r.t. the performance of the neural network on the validation set. During training, we apply data augmentation for all datasets by a random crop (with a size of 32 pixels and a padding of 4 pixels) and a random horizontal flip with a probability of 0.5.

For optimization, we apply the stochastic gradient descent approach with a learning rate of 0.1, a momentum of 0.9, and a weight decay of 5×10^-4. We started with a

16

Approach | Additional Class | Accuracy (%)
ResNet18 [37] | - | 94.32
ResNet18 [PL] | 0-9 | 94.4
ResNet18 | 0,1,2,4,6,7,8,9 | 94.58
ResNet18 | 4,6,7,8,9 | 94.85
VGG19 [55] | - | 93.05
VGG19 [PL] | 0-9 | 93.36
VGG19 | 0,1,2,4,6,7,8,9 | 93.74
GoogleNet [56] | - | 94.33
GoogleNet [PL] | 0-9 | 94.37
GoogleNet | 0,1,2,4,5,6,7,8,9 | 94.54
GoogleNet | 0,1,4,6,7,8,9 | 94.73
ResNet18 | (0-2)*, (4-9)* | 95.14
VGG19 | (0-2)*, (4-9)* | 94.05
GoogleNet | (0-2)*, (4-9)* | 95.88

Table 2: Comparison of the test results on the Cifar-10 dataset among different approaches. “PL” indicates that the pseudo label is used. “Additional Class” indicates the classes of unlabeled data added to the training set. The symbol * denotes unlabeled data selected with VPL, which have high consensus among the ResNet18, the VGG19 and the GoogleNet and thus do not carry confused labels.

280

M

fixed learning rate to train the network until the classification precision stabilizes on the validation set. Then we use a lower learning rate (e.g., shrunk by 0.1) to perform further training. We use a mini-batch size of 128 and train the network for a maximum of 300 epochs on the same GPU.
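The optimizer and schedule described above could be set up as in the following sketch; the plateau criterion, patience value, and the helper functions are assumptions for illustration.

```python
# SGD setup for the VPL experiments: lr 0.1, momentum 0.9, weight decay 5e-4,
# batch size 128, lr shrunk by 0.1 once validation precision stops improving.
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.1,       # `model` assumed defined
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=10)

for epoch in range(300):                     # maximum of 300 epochs
    train_one_epoch(model, optimizer)        # assumed helper
    val_acc = evaluate(model)                # assumed helper
    scheduler.step(val_acc)
```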

PT

5.2.1. Experiments on Cifar-10

In this experiment, we investigate the performance of our classification model in 285

dependent on the number of selected unlabeled data, with the results shown in Table 2.

CE

Each model is trained respectively with four datasets: the original labeled data, the data augmented with pseudo labels [12], the data augmented with non-confused classes of

AC

the unlabeled data, and the data augmented with VPL (i.e., unlabeled data with high consensus among all three models on the same label). The confused label is gained by

290

checking the model prediction on the validation set and it differs for different utilized CNN models. Taking the GoogleNet [56] as an example, as shown in Fig. 6, the classes are ranked by the accuracy. We define the red bar as confused label whose accuracy is lower than 90% and yellow bar as light-confused label (with accuracy in the range of 17

ACCEPTED MANUSCRIPT

Non-confused Label

Accuracy (%) 74.78

ResNet18 [PL [12]]

73.60 √

ResNet18 VGG19 [55]

73.99 67.82

VGG19 [PL]

67.24 √

VGG19 GoogleNet [56]

67.73 77.52

GoogleNet [PL]

76.18

GoogleNet



77.2

ResNet18*



75.52

VGG19*



69.16

GoogleNet*



78.81

CR IP T

Approach ResNet18 [37]

Table 3: Comparison results on Cifar-100 dataset. Check mark X indicates using unlabeled data. * denotes

AN US

using the voted pseudo label.

90% to 91%). By removing them from the data with pseudo label, GoogleNet achieves 295

an accuracy of 94.54% and 94.73%. Compared with the approach of original pseudo label [12], we obtain a gain of 0.17% and 0.36% respectively.

On the initial training set, the ResNet18 [37] and VGG19 [55] respectively achieve

M

an accuracy of 94.32% and 93.05%. After adding unlabeled data which pick up the maximum predicted probability as their true labels [12], they are improved to an accuracy of 94.4% and 93.36% respectively. By getting rid of confused labels, they achieve

ED

300

another gain of 0.14% and 0.38% respectively. After removing the light-confused labels, the ResNet achieves its top accuracy of 94.85%. As shown in the last row in

PT

Table 2, we choose the class cat with ID 3 as the confused label because all utilized models show relative poor prediction on it. Thus, we only add unlabeled data which exclude this class and are in high-consensus among three models. In such way, the

CE

305

ResNet18, GoogleNet and VGG19 achieve an improvement on the recognition accu-

AC

racy of 95.14%, 94.05% and 95.88% respectively. 5.2.2. Experiments on Cifar-100

310

The classification accuracies on Cifar-100 are summarized in Table 3. When using

the pseudo label [12], the performance of all three models decreases (e.g., from 74.78% to 73.60% for the ResNet18). This fact is due to the poor recognition on the initial train set. Thus, the adding of unlabeled data brings too much noise. After removing the 18

ACCEPTED MANUSCRIPT

confused label, the accuracy decline has been reduced (e.g., from 74.78% improved to 73.99% for the ResNet18). Finally, after adding the voted pseudo labels which exclude 315

the confused labels and in high-consensus among three model, the ResNet18 is boosted

CR IP T

to an accuracy of 75.52% and the GoogleNet and VGG19 respectively improves to 78.81% and 69.16% from 77.52% and 67.82%. From the above phenomenon, we can see that the voted pseudo label can ensure that the added unlabeled data do not have too much noise so that it doesn’t reduce the performance of the model. 320

5.2.3. Iterations Experiments on Cifar-10

AN US

We conduct an experiment to explore the effect of iterations. The iterative adding here means repeatedly adding voted pseudo label which is selected from the same unlabeled dataset. The experimental result is shown in Table 4 We respectively adopt 40%, 40%, 20% of the whole data as the training, unlabeled and validation set. In the experiment, selecting voted pseudo labels repeatedly from the same unla-

325

beled data set does not bring persistent improvement. Concretely,, after the first itera-

M

tion, the performance of ResNet18 and MobileNetV2 is improved by 1.52% and 1.27% respectively. In the next iteration, we reuse the network to make predictions and to se-

330

ED

lect voted pseudo labels again. The data with the new pseudo labels will be added into the initial labeled training set and the network is retrained again. However, in our experiments, we found that after the second adding the performance gain of the network

PT

was small. In the third iteration, the model’s performance was hardly changed and the number of pseudo labels was merely increased. So we conclude that it is because the network is already fitted with the voted pseudo label data after the first iteration so that

CE

the difference of each selected pseudo label is small in further iterations.

AC

Iterations | 0 | 1 | 2 | 3
ResNet18 [37] | 92.02% | 93.64% | 93.66% | 93.65%
MobileNetV2 [57] | 92.93% | 94.20% | 94.18% | 94.17%

Table 4: Comparison of the accuracies after different numbers of iterations on the Cifar-10 dataset. Tested approaches are ResNet18 [37] and MobileNetV2 [57].

335

19

ACCEPTED MANUSCRIPT

5.3. Experiments on VPL + MSR Baselines: We list some excellent approaches as baseline, which bases on deep learning and yields state-of-the-art results on both datasets. These methods are mainly

340

CR IP T

chosen from two categories, depending on whether human-defined bounding boxes or part annotations are utilized. All of them are based on the VGG16 or VGG19. All the baselines are listed as below, with the first five working by human supervision while the last eight in unsupervised part learning manner.

• DeepLAC [34]: a deep localization method using a pose-aligned part image for

345

AN US

classification.

• SPDA-CNN [38]: a network extracting features from candidates generated by an approach of semantic part detection and abstraction.

• Part-RCNN [46]: extension of the framework R-CNN [33] by part annotations.

M

• PA-CNN [20]: a method generating aligned parts with the help of co-segmentation. • PN-CNN [16]: a CNN model computing local features from estimated normalized object pose.

ED

350

• B-CNN [21]: a bilinear CNN model classifying objects by pairwise feature interaction.

PT

• PDFR [17]: an approach learns part detectors by analyzing deep filter responses. • MG-CNN [48]: a multi-region learning method for all grained levels by multiple granularity descriptors.

CE

355

AC

• FCAN [58]: a fully convolutional attention network adapted to selection of mul-

360

tiple task-driven visual attentions by the reinforcement learning.

• NAC [59]: a part localization method by computing constellated neural activation patterns.

• DVAN [60] : a diverse attention network classifying objects from coarse to fine by multi-region proposals.

20

ACCEPTED MANUSCRIPT

• Improved B-CNN [61] : an improved B-CNN architecture that uses matrixnormalization layers.

• RA-CNN [26]: a recurrent attention convolutional neural network recursively

CR IP T

learning discriminative attentions and feature representations for multi-scaled

365

regions.

In our approach, we mainly use transfer learning instead of deep learning. Concretely, we utilize the models of InceptionV4 [51], InceptionResNetV2 [51] and Xception [52] in the MixDCNNs for recognition of dog images while for the bird dataset, we adopt the networks of ResNet152 [37], DenseNet161 [62] and Xception [52]. We

AN US

370

choose different models for each dataset, because the visual features of birds are generally smaller and less distinguishable than that in the image of dogs. Therefore, the second configuration of models yields better results in experiments on the bird dataset. Additionally, we resize input images to a size of 224 × 224 and 299 × 299 pixels. The

375

first size is adapted for feature extraction in ResNet152 and DenseNet161 while the

M

second size is utilized for other networks. All the deployed models are pre-trained on the ImageNet. We also found that different values of P in dropout make the network

ED

more robust. Therefore, we empirically set P = 0.4 in the first fully connected layer and 0.45 in the second layer.

5.3.1. Experiments on the CUB-200-2011

PT

380

In this experiment, we compare our approach with the baselines whose results are publicized on this dataset. Moreover, we evaluate the performance of each single model

CE

deployed in our MixDCNNs, i.e., the ResNet [37], the DenseNet [62] and the Xception [52]. As shown in Table 5, without stacking, the accuracy of ResNet, DenseNet and Xception is relative low and equals 68.6%, 72.0% and 67.5% respectively. By con-

AC

385

catenating their features in three scales, the score of MixDCNNs improves to 77.7%. This accuracy is further increased by about 7% when the MSR is integrated. When this approach is further incorporated with voted pseudo label, we obtain the highest recognition accuracy of 85.2%. This value is comparable to the best results achieved

390

by the baseline methods, considering that most of the baseline methods utilize strictly 21

ACCEPTED MANUSCRIPT

DeepLAC [34]

Train with Anno. √

Accuracy (%)

Part-RCNN [46]



81.6

MG-CNN [48]



83.0

FCAN [58]



84.3

B-CNN (250k) [21]



85.1

SPDA-CNN [38]



85.1

PN-CNN [20]



82.8

PN-CNN [16]



85.4

80.3

82.0

Improved B-CNN [61]

85.8

RA-CNN [26]

85.3

ResNet152 [37]

68.6

DenseNet161 [62]

72.0

Xception [52]

67.5

MixDCNNs

77.7

MixDCNNs [MSR]

AN US

FCAN [58]

CR IP T

Approach

84.3

MixDCNNs [VPL+MSR]

85.2

Table 5: Comparison of the test results on the CUB-200-2011 dataset among different approaches. Check mark X indicates that strictly human-defined bounding boxes or part annotations are utilize during training.

M

human-defined bounding boxes or part annotations. For instance, PN-CNN [16] is trained by strong supervision of both human-defined bounding boxes and part ground-

ED

truths. And B-CNN [21] utilizes bounding boxes with a high dimensional feature representation (250k). However, our network only relies on multi-scale image cuts with 395

dynamic weighting as well as voted pseudo label, which makes our approach more

PT

flexible and thus it is capable to be applied in general classification tasks. Additionally, we display several examples with discriminative regions of two scales by the proposed approach MSR in Fig. 7. From these images, we can observe that most of the localized

CE

regions are exactly consistent with the human perception, which is helpful to make

400

model recognize better and further verifies that our approach is effective in selection of

AC

discriminative fine-grained features. 5.3.2. Experiments on the Stanford Dogs Analogously, we compare our approach with the methods whose results on this

dataset are publicly available. Their classification accuracy on the dataset of Stanford

405

Dogs are summarized in Table 6. The single model utilized in our MixDCNNs, i.e.,

22

CR IP T

ACCEPTED MANUSCRIPT

Figure 7: Five bird species displayed with discriminative regions in different scales. We select and show the original image and some cropped patches from 17 patches. Significant visual cues are captured in these

AN US

regions and can thus improve the classification precision.

the InceptionV4, the InceptionResNetV2 [51] and the Xception[34] [52] respectively achieves a recognition accuracy of 92.1%, 92.0% and 90.6% at the original images. Relying on the feature concatenation, the MixDCNNs achieves an improvement on the recognition accuracy of 92.8%. By combining image parts from different scales and the weighted prediction, we boost the performance to 94.5%. Unlike on the CUB, the

M

410

pretrained models are well preformed on this dataset. We owe this phenomenon to the

ED

pretraining on the ImageNet dataset which contains far more training images of dogs. The result also reflects the importance of transfer learning. However, if the source and target task do not have much correlation, maybe we should consider fine turning or retraining the pretrained model instead of just extracting feature. Compared with the

PT

415

test results on the dataset of birds, such improvement is relative small. This is due to two reasons. On the one hand, unlike the dogs, the similarity between different bird

CE

species is relative high and the main visual difference is mainly derived from the area of head or wings. On the other hand, the difference between models of the series of InceptionNet is not as obvious as among ResNet, DenseNet and Xception. However,

AC

420

this precision value still outperforms the unsupervised methods. E.g., compared with DVAN [60], FCAN [58] and RA-CNN [26], the relative accuracy gains are 13.0%, 10.3% and 7.2%. After adding the voted pseudo labels, the performance of our approach is further boosted to 95.6%.

23

Approach | Accuracy (%)
NAC(AlexNet) [59] | 68.3
PDFR(AlexNet) [17] | 71.9
VGG-16 [55] | 76.7
DVAN [60] | 81.5
FCAN [58] | 84.2
RA-CNN [26] | 87.3
InceptionV4 [51] | 92.1
InceptionResNetV2 [51] | 92.0
Xception [52] | 90.6
MixDCNNs | 92.8
MixDCNNs [MSR] | 94.5
MixDCNNs [VPL+MSR] | 95.6

Table 6: Comparison of the test results on the Stanford Dogs dataset among different approaches without extra bounding box or part annotation.

Table 6: Comparison of the test results on the Stanford dogs dataset among different approaches without extra bounding box or part annotation.

425

6. Conclusion

M

In this paper, we proposed two techniques to improve an existing classification model: the voted pseudo label and a novel fine-grained feature learning method. These two approaches can be widely applied to most pre-trained neural network models and training methods. Without a complex training scheme, both of them achieve improvements over the original model. By domain adaptation and parameter freezing, our MixDCNN runs fast and is memory-saving. The MixDCNN is also suitable for tasks which have high requirements on both accuracy and speed. The semi-supervised method is suitable for areas lacking sufficient datasets, e.g., some biomedical areas. Extensive experiments demonstrate the superior performance of our approach on fine-grained recognition tasks of dogs and birds. In the future, our research will focus on two directions: a more effective way of utilizing image features, and network archi-

AC

tecture designing. References [1] J. Fu, T. Mei, K. Yang, H. Lu, Y. Rui, Tagging personal photos with transfer deep

440

learning, in: Proceedings of the 24th International Conference on World Wide 24

ACCEPTED MANUSCRIPT

Web, International World Wide Web Conferences Steering Committee, 2015, pp. 344–354. [2] J. Fu, J. Wang, Y. Rui, X.-J. Wang, T. Mei, H. Lu, Image tag refinement with viewfor Video Technology 25 (8) (2015) 1409–1422.

445

CR IP T

dependent concept representations, IEEE Transactions on Circuits and Systems

[3] J. Fu, Y. Wu, T. Mei, J. Wang, H. Lu, Y. Rui, Relaxing from vocabulary: Robust weakly-supervised deep learning for vocabulary-free image tagging, in: Pro-

ceedings of the IEEE International Conference on Computer Vision, 2015, pp.

450

AN US

1985–1993.

[4] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in neural information processing systems, 2012, pp. 1097–1105.

[5] J. Wang, J. Fu, Y. Xu, T. Mei, Beyond object recognition: Visual sentiment anal3484–3490.

455

M

ysis with deep coupled adjective and noun neural networks., in: IJCAI, 2016, pp.

ED

[6] Y. Lecun, Y. Bengio, G. Hinton, Deep learning., Nature 521 (7553) (2015) 436. [7] A. Niculescu-Mizil, R. Caruana, Inductive transfer for bayesian network structure

PT

learning, in: Artificial Intelligence and Statistics, 2007, pp. 339–346. [8] S. J. Pan, Q. Yang, et al., A survey on transfer learning, IEEE Transactions on knowledge and data engineering 22 (10) (2010) 1345–1359.

CE

460

[9] L. Shao, F. Zhu, X. Li, Transfer learning for visual categorization: A survey, IEEE transactions on neural networks and learning systems 26 (5) (2015) 1019–1034.

AC

[10] D. H. Svendsen, L. Martino, M. Campos-Taberner, F. J. Garc´ıa-Haro, G. Camps-

465

Valls, Joint gaussian processes for biophysical parameter retrieval, IEEE Transactions on Geoscience and Remote Sensing 56 (3) (2018) 1718–1727.

[11] J. Zhang, W. Li, P. Ogunbona, Joint geometrical and statistical alignment for visual domain adaptation, arXiv preprint arXiv:1705.05498. 25

ACCEPTED MANUSCRIPT

[12] D.-H. Lee, Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks, in: Workshop on Challenges in Representation Learning, ICML, Vol. 3, 2013, p. 2.

470

CR IP T

[13] A. Odena, Semi-supervised learning with generative adversarial networks, arXiv preprint arXiv:1606.01583.

[14] G. Papandreou, L.-C. Chen, K. P. Murphy, A. L. Yuille, Weakly-and semi-

supervised learning of a deep convolutional network for semantic image segmentation, in: Proceedings of the IEEE International Conference on Computer Vision,

475

AN US

2015, pp. 1742–1750.

[15] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, Improved techniques for training gans, in: Advances in Neural Information Processing Systems, 2016, pp. 2234–2242. 480

[16] S. Branson, G. Van Horn, S. Belongie, P. Perona, Bird species categorization

M

using pose normalized deep convolutional nets, arXiv preprint arXiv:1406.2952. [17] X. Zhang, H. Xiong, W. Zhou, W. Lin, Q. Tian, Picking deep filter responses

ED

for fine-grained image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1134–1142. [18] M.-E. Nilsback, A. Zisserman, A visual vocabulary for flower classification, in:

PT

485

IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2, 2006, pp.

CE

1447–1454.

[19] S. Reed, Z. Akata, H. Lee, B. Schiele, Learning deep representations of finegrained visual descriptions, in: Proceedings of the IEEE Conference on Computer

AC

490

Vision and Pattern Recognition, 2016, pp. 49–58.

[20] J. Krause, H. Jin, J. Yang, L. Fei-Fei, Fine-grained recognition without part annotations, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5546–5555.

26

ACCEPTED MANUSCRIPT

[21] T.-Y. Lin, A. RoyChowdhury, S. Maji, Bilinear cnn models for fine-grained visual recognition, in: Proceedings of the IEEE International Conference on Computer

495

Vision, 2015, pp. 1449–1457.

CR IP T

[22] J. Krause, B. Sapp, A. Howard, H. Zhou, A. Toshev, T. Duerig, J. Philbin, L. Fei-

Fei, The unreasonable effectiveness of noisy data for fine-grained recognition, in: European Conference on Computer Vision, Springer, 2016, pp. 301–320. 500

[23] T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, Z. Zhang, The application of two-

level attention models in deep convolutional neural network for fine-grained im-

AN US

age classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 842–850.

[24] L. Anne Hendricks, S. Venugopalan, M. Rohrbach, R. Mooney, K. Saenko, T. Darrell, Deep compositional captioning: Describing novel object categories

505

without paired training data, in: Proceedings of the IEEE Conference on Com-

M

puter Vision and Pattern Recognition, 2016, pp. 1–10.

[25] J. Johnson, A. Karpathy, L. Fei-Fei, Densecap: Fully convolutional localization networks for dense captioning, in: Proceedings of the IEEE Conference on Com-

ED

puter Vision and Pattern Recognition, 2016, pp. 4565–4574.

510

[26] J. Fu, H. Zheng, T. Mei, Look closer to see better: recurrent attention convolu-

PT

tional neural network for fine-grained image recognition, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

CE

[27] A. Khosla, N. Jayadevaprakash, B. Yao, F.-F. Li, Novel dataset for fine-grained image categorization: Stanford dogs, in: Proc. CVPR Workshop on Fine-Grained

515

Visual Categorization (FGVC), Vol. 2, 2011, p. 1.

AC

[28] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, P. Perona, Caltech-ucsd birds 200.

[29] B. Zhao, J. Feng, X. Wu, S. Yan, A survey on deep learning-based fine-grained

520

object classification and semantic segmentation, International Journal of Automation and Computing 14 (2) (2017) 119–135. 27

ACCEPTED MANUSCRIPT

[30] Z. Ge, A. Bewley, C. Mccool, P. Corke, B. Upcroft, C. Sanderson, Fine-grained classification via mixture of deep convolutional neural networks, workshop on applications of computer vision (2016) 1–6. [31] T. Berg, J. Liu, S. W. Lee, M. L. Alexander, D. W. Jacobs, P. N. Belhumeur,

CR IP T

525

Birdsnap: Large-scale fine-grained visual categorization of birds (2014) 2019– 2026.

[32] H. Goeau, P. Bonnet, A. Joly, Lifeclef plant identification task 2014 1391 (2014) 598–615.

[33] S. Huang, Z. Xu, D. Tao, Y. Zhang, Part-stacked cnn for fine-grained visual cat-

AN US

530

egorization, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1173–1182.

[34] D. Lin, X. Shen, C. Lu, J. Jia, Deep lac: Deep localization, alignment and classification for fine-grained recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1666–1674.

M

535

[35] O. M. Parkhi, A. Vedaldi, C. Jawahar, A. Zisserman, The truth about cats and

ED

dogs, in: Computer Vision (ICCV), 2011 IEEE International Conference on, IEEE, 2011, pp. 1427–1434.

PT

[36] M. J. Afridi, A. Ross, E. M. Shapiro, On automated source selection for transfer learning in convolutional neural networks, Pattern Recognition 73 (2018) 65 – 75.

540

CE

[37] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.

AC

[38] H. Zhang, T. Xu, M. Elhoseiny, X. Huang, S. Zhang, A. Elgammal, D. Metaxas,

545

Spda-cnn: Unifying semantic part detection and abstraction for fine-grained recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1143–1152.

28

ACCEPTED MANUSCRIPT

[39] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., Imagenet large scale visual recognition challenge, International Journal of Computer Vision 115 (3) (2015) 211–252.

550

CR IP T

[40] M. D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in: European conference on computer vision, Springer, 2014, pp. 818–833.

[41] K. Weiss, T. M. Khoshgoftaar, D. Wang, A survey of transfer learning, Journal of Big Data 3 (1) (2016) 9. 555

[42] G. E. Hinton, R. R. Salakhutdinov, Reducing the dimensionality of data with

AN US

neural networks, science 313 (5786) (2006) 504–507.

[43] M. Ranzato, M. Szummer, Semi-supervised learning of compact document representations with deep networks, in: Proceedings of the 25th international conference on Machine learning, ACM, 2008, pp. 792–799. 560

[44] Y. Zhang, Z. Zhang, J. Qin, L. Zhang, B. Li, F. Li, Semi-supervised local multi76 (2018) 662–678.

M

manifold isomap by linear embedding for feature extraction, Pattern Recognition

ED

[45] H. Wu, S. Prasad, Semi-supervised dimensionality reduction of hyperspectral imagery using pseudo-labels, Pattern Recognition 74 (2018) 212–224. [46] N. Zhang, J. Donahue, R. Girshick, T. Darrell, Part-based r-cnns for fine-grained

PT

565

category detection, in: European conference on computer vision, Springer, 2014,

CE

pp. 834–849.

[47] X.-S. Wei, C.-W. Xie, J. Wu, C. Shen, Mask-cnn: Localizing parts and selecting

AC

descriptors for fine-grained bird species categorization, Pattern Recognition 76

570

(2018) 704–714.

[48] D. Wang, Z. Shen, J. Shao, W. Zhang, X. Xue, Z. Zhang, Multiple granularity descriptors for fine-grained categorization, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2399–2406.

29

ACCEPTED MANUSCRIPT

[49] G.-S. Xie, X.-Y. Zhang, W. Yang, M. Xu, S. Yan, C.-L. Liu, Lg-cnn: From local parts to global discrimination for fine-grained recognition, Pattern Recognition

575

71 (2017) 118–131.

CR IP T

[50] C. Cortes, V. Vapnik, Support-vector networks, Machine learning 20 (3) (1995) 273–297.

[51] C. Szegedy, S. Ioffe, V. Vanhoucke, A. A. Alemi, Inception-v4, inception-resnet

and the impact of residual connections on learning., in: AAAI, 2017, pp. 4278–

580

4284.

preprint arXiv:1610.02357.

AN US

[52] F. Chollet, Xception: Deep learning with depthwise separable convolutions, arXiv

[53] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A. C. Berg, Ssd: Single shot multibox detector, in: European conference on computer vision,

585

Springer, 2016, pp. 21–37.

M

[54] A. Krizhevsky, G. Hinton, Learning multiple layers of features from tiny images. [55] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale

590

ED

image recognition, arXiv preprint arXiv:1409.1556. [56] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Van-

PT

houcke, A. Rabinovich, Going deeper with convolutions (2014) 1–9. [57] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, L. Chen, Inverted residuals

CE

and linear bottlenecks: Mobile networks for classification, detection and segmentation, CoRR abs/1801.04381. arXiv:1801.04381. URL http://arxiv.org/abs/1801.04381

AC

595

[58] X. Liu, T. Xia, J. Wang, Y. Lin, Fully convolutional attention localization networks: Efficient attention localization for fine-grained recognition, arXiv preprint arXiv:1603.06765.

30

ACCEPTED MANUSCRIPT

[59] M. Simon, E. Rodner, Neural activation constellations: Unsupervised part model discovery with convolutional networks, in: Proceedings of the IEEE International

600

Conference on Computer Vision, 2015, pp. 1143–1151.

CR IP T

[60] B. Zhao, X. Wu, J. Feng, Q. Peng, S. Yan, Diversified visual attention networks for fine-grained object classification, arXiv preprint arXiv:1606.08572. [61] T. Y. Lin, S. Maji, Improved bilinear pooling with cnns. 605

[62] G. Huang, Z. Liu, K. Q. Weinberger, L. van der Maaten, Densely connected convolutional networks, arXiv preprint arXiv:1608.06993.

31

CR IP T

ACCEPTED MANUSCRIPT

Danyu Lai is with School of Data and Computer Science, Sun Yat-sen University,

AN US

Guangzhou, Guangdong, P.R.China. He is a postgraduate student and studies machine

learning and deep learning in applications of both pattern recognition and computer vision currently. His areas of interest include Image Classification, Semantic Segmen-

PT

ED

M

tation and Object Detection.

CE

Wei Tian received the B.Sc degree in mechatronics engineering from Tongji University, Shanghai, China, in 2010. From October 2010, he was with the Department of

AC

Electrical Engineering and Information Technology at KIT, Karlsruhe, Germany, and received the M.Sc. degree in May 2013. He is currently working toward the Ph.D. degree at the Institute of Measurement and Control Systems at KIT. He is interested in research areas of robust object detection and tracking.

32

AN US

CR IP T

ACCEPTED MANUSCRIPT

Long Chen received the B.Sc. degree in communication engineering and the Ph.D.

M

degree in signal and information processing from Wuhan University, Wuhan, China, in 2007 and in 2013, respectively. From October 2010 to November 2012, he was co-

ED

trained PhD Student at National University of Singapore. From 2008 to 2013, he was in charge of environmental perception system for autonomous vehicle SmartV-II with the Intelligent Vehicle Group, Wuhan University. He is currently an Associate Professor

PT

with the School of Data and Computer Science, Sun Yat-sen University, Guangzhou,

AC

CE

China. His areas of interest include perception system of intelligent vehicle.

33