Highlights

• We propose the voted pseudo label (VPL) for semi-supervised learning.

• We propose a new method for fine-grained feature learning.

• Our model can be applied to domain adaptation quickly and effectively.

• Our approach can be combined with almost all deep neural network models and training methods.

• Extensive experiments on two challenging datasets demonstrate the effectiveness of our approach.
Improving Classification with Semi-Supervised and Fine-grained Learning
Danyu Lai^a, Wei Tian^b, Long Chen^a,*

^a School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China
[email protected], *[email protected]
^b Institute of Measurement and Control Systems, Karlsruhe Institute of Technology, Karlsruhe, Germany
[email protected]
Abstract

In this paper, we propose a novel and efficient multi-stage approach, which combines semi-supervised learning and fine-grained learning to improve the performance of a classification model learned from only a few samples. The fine-grained category recognition process used in our method is dubbed MSR. In this process, we cut images into multi-scale parts and feed them into the network to learn more fine-grained features. By assigning these image cuts dynamic weights, we can reduce the negative impact of background information and thus achieve a more accurate prediction. Furthermore, we present the voted pseudo label (VPL), an efficient method for semi-supervised learning. For unlabeled data, VPL picks up the classes with non-confused labels verified by the consensus prediction of different classification models. These two methods can be applied to most neural network models and training methods. Inspired by classifier-based adaptation, we also propose a mixed deep CNN architecture (MixDCNN). Both the VPL and MSR are integrated with the MixDCNN. Comprehensive experiments demonstrate the effectiveness of VPL and MSR. Without bells and whistles, we achieve state-of-the-art or even better performance in two fine-grained recognition tasks on the Stanford Dogs and CUB Birds datasets, with accuracies of 95.6% and 85.2% respectively.

Keywords: Semi-supervised learning; fine-grained feature learning; mixture of DCNNs; image classification

∗ Corresponding author
1. Introduction
In recent years, deep neural networks have achieved significant success, especially in tough vision-based perception tasks [1, 2, 3, 4, 5]. Deep learning allows models composed of multiple processing layers to learn data representations with multiple levels of abstraction [6]. In classical machine learning, a basic assumption of statistical learning theory is that the training and test data are drawn from the same distribution. However, this assumption does not hold in many applications. Hence, to cope with this problem, a realistic strategy is transfer learning, which employs prior knowledge from similar domains or tasks [7], e.g., instance-based adaptation, feature representation adaptation, and classifier-based adaptation [8, 9, 10].
Depending on the availability of labeled data in the target domain, domain adaptation can be generally divided into semi-supervised and unsupervised domain adaptation [11]. The semi-supervised method only requires a certain amount of labeled training samples, whereas supervised learning usually requires a large amount of labeled data. As an effective approach to the problem of insufficient data, the semi-supervised method draws increasing attention from the research community [12, 13, 14, 15]. For instance, Lee [12] proposed to train the neural network in a semi-supervised fashion by adding unlabeled data with the maximum predicted probability to the labeled dataset, yielding a high score on the MNIST dataset. However, a high predicted probability cannot guarantee high accuracy (e.g., for a not well-trained model), so this approach may introduce additional noise into the training set and deteriorate the classification performance.
As another hotspot in the deep learning area, fine-grained category recognition also becomes attractive, especially in the classification of bird species [16, 17], flower types [18, 19], car models [20, 21], etc. Among these tasks, a common learning approach is to first conduct an exact region localization and thereafter utilize fine-grained features learned from those regions to distinguish visual differences between two images. While this approach benefits a variety of applications such as expert-level image recognition [22, 23] and rich image captioning [24, 25], challenges also emerge,
Figure 1: Two species of woodpecker, cropped at two scales. The subtle visual differences are derived from two differently scaled local regions (the fractions 1/2 and 1/3 indicate the proportion of the cropped area to the original object image), i.e., the color of the heads in the yellow boxes, which are important features for distinguishing between the two species.
mainly in two respects: localization of discriminative regions and effective learning of fine-grained features. In previous works, part-based recognition frameworks and the Recurrent Attention Convolutional Neural Network [26] were introduced to address these problems and achieved impressive progress. In these approaches, the convolutional neural network (CNN) is trained on image characteristics from extra human-defined regions, which can be selected in an unsupervised or semi-supervised manner. However, defining these regions is tedious and labor-intensive. Moreover, regions selected by existing unsupervised methods may not be optimal for training classifiers.
To deal with the above challenges, we propose a novel semi-supervised learning method, named voted pseudo label (VPL), and an approach (MSR) to learn multi-scale fine-grained features. Both the VPL and MSR are used in MixDCNNs for feature extraction and classification. The VPL cautiously adds unlabeled data, which are consistently checked by three different experts in the MixDCNNs, to expand the dataset. For training, the MSR randomly crops each image from the training set into a fixed number of patches at two scales. For testing, the MSR crops an image in a fixed way as shown in Fig. 5 and assigns each cropped patch a dynamic weight. In the MixDCNNs, the features extracted by each expert are concatenated and then fed into a fully connected layer for classification (shown in Fig. 2). The proposed approach is tested on the challenging datasets of Stanford Dogs [27] and
Figure 2: The workflow of our approach. We use random cropping for the detected image data in the training set and fixed cropping on the testing set. All images and cropped patches form a large dataset, and features are extracted by three different models pre-trained on ImageNet. These features are concatenated and fed into two fully connected layers. The object prediction is based on the input image cuts, which are assigned dynamic weights. By analyzing the prediction errors on the validation set, we pick out unlabeled data associated with voted pseudo labels and add them to the training set.
CUB Birds [28], achieving accuracies of 95.6% and 85.2% respectively. Our contributions are summarized as follows:

• We propose the voted pseudo label approach, named VPL, which is able to extract precise pseudo labels based on the consensus judgment of different models.

• We propose a method to improve the classification model by fine-grained learning, dubbed MSR. Without complex human-defined annotations or extra time to train the CNN, the method can learn more discriminative features.

• We propose a MixDCNN, which is built from three pre-trained networks, for feature extraction and classification.

• Extensive experiments on two challenging datasets, i.e., the CUB Birds [28] and Stanford Dogs [27], demonstrate the effectiveness of our approach, which is superior to the state of the art.
In the rest of the paper, Section 2 describes the required background, Section 3 reviews related work, and Section 4 introduces the proposed method. Section 5 provides the experimental results, and the paper is concluded in Section 6.

2. Background

In this section, we briefly describe the main required concepts as well as background information related to our approach.

2.1. Mixture of DCNNs

Fine-grained categorization has been a challenging vision problem because of small inter-class variation and large intra-class variation. To overcome these problems, dividing the fine-grained dataset into multiple visually similar subsets or directly using multiple neural networks to improve the classification performance is a widely used strategy [29]. In [30], the authors proposed mixed deep convolutional neural networks (MixDCNNs) for fine-grained image classification. This mixture system shows state-of-the-art results on the Birdsnap [31] and PlantCLEF Flower [32] datasets. In this paper, we use a simplified MixDCNN architecture, which is composed of three classification models and two additional fully connected layers. The features passing through the final convolutional layer of each model are sent to a global pooling layer and then concatenated. Fig. 3 shows the architecture. During training, the MixDCNNs freeze the convolutional layers and only train the fully connected layers, which saves a lot of time compared to training the whole network. In the MSR proposal, the mixture model is utilized to extract and classify the image features.

2.2. Voted Pseudo Label

In [12], the author proposed a simple semi-supervised learning method for deep neural networks, picking up the classes with the maximum predicted probability as pseudo labels and using them as if they were true labels. In this paper, we change the strategy into a voting operation. Each model in the MixDCNNs predicts on the validation set. Only when an image is assigned a consistent label by all three models and this label does not belong to the confused labels can it be selected as a pseudo label and added to extend the training set. The confused label is a class that is hard to distinguish
by the neural network. To be more specific, we define the determination of the class label as

C = { l,  for l ∉ CF and f_1(x) = f_2(x) = f_3(x)
    { 0,  otherwise,                                        (1)

where label l is obtained by the consensus determination of the three model responses f_i(x), i = 1, ..., 3, and CF stands for the set of confused labels.

2.3. Multi-Scale Recognition

Fine-grained recognition aims to distinguish objects from different subordinate-level categories within a general category [29]. It is a challenging task because of the subtle differences in the overall appearance between various classes (low inter-class variation) and the large pose and appearance variations within the same class (large intra-class variation). Much of the work on fine-grained image classification addresses this issue by detecting and modeling local parts [33, 34, 35]. In this paper, we propose a multi-scale recognition (MSR) approach to force the model to focus more on fine-grained features. For training, the MSR randomly crops the input image into a fixed number of patches at two scales. For testing, the MSR crops the image in a fixed way as shown in Fig. 5 and assigns each cropped patch a dynamic weight. In this way, the diverse image inputs during training let the model learn more generalized features, and the weighted output reduces the influence of the background in the image.

3. Related Work
In this section, we introduce related research mainly from three aspects: transfer learning for image classification, semi-supervised learning, and fine-grained image recognition.
3.1. Transfer Learning for Image Classification

Transfer learning has a relatively long history in the machine learning domain. Domain adaptation is a representative transfer learning method, which refers to using information from rich source-domain samples to improve the performance of the target-domain model. In [36], Afridi et al. rank CNNs by reliability in a zero-shot manner to select the most suitable model from the source task for the target task. According to the different types of target and source domains, domain adaptation problems can be divided into four types: unsupervised, supervised, heterogeneous-distribution, and multiple-source-domain problems.
Existing approaches predominantly solve the image classification problem by training CNN models in an end-to-end manner [26, 37, 21, 38]. Although this fashion may yield good results, it is inapplicable in use cases where the available dataset is insufficient or the training time is strictly limited.
Therefore, it is reasonable to utilize prior knowledge from other domains. Feature adaptation and classifier-based adaptation can help to solve the problem of insufficient data. A pre-trained model, directly adopted as a feature extractor [39], can save a lot of training time [40]. By utilizing this transfer learning fashion, impressive performance gains have been achieved in vision-based classification tasks [41].
3.2. Semi-supervised Learning

Semi-supervised learning is a subclass of the supervised learning approach that additionally takes unlabeled data into consideration, especially when the volume of annotated data is insufficient for training networks. Normally, unsupervised learning is treated as an auxiliary task to supervised learning. For instance, Hinton et al. learn a stack of unsupervised restricted Boltzmann machines to pre-train the model [42]. Ranzato et al. reconstruct the input at each level of the network for a compact representation [43], in which the auxiliary task of ladder networks is utilized for denoising. In [44], the labeled and unlabeled samples are learned together by a multi-manifold Isomap learning framework.
In contrast, other works focus on how to assign labels to unlabeled data. Representatively, Papandreou et al. [14] combine both strong and weak labels using an expectation-maximization (EM) process for image segmentation. In [13, 15], the samples generated by the generator of a GAN model are packed into one category and fed into the discriminator. In [45], labels for the unlabeled data are obtained by a Dirichlet-process-based clustering algorithm.
The approach most related to ours is [12], in which Lee proposed the pseudo label, which picks up unlabeled data with the maximum predicted probability. However, such a selection approach may not ensure accuracy. In comparison, we propose the voted pseudo label with better validity.
recognition of fine-grained images can be mainly categorized into two groups: so-
AN US
phisticate region localization and learning of discriminative features. In the first group,
several previous works leverage extra bounding boxes (e.g., for part annotation) in finegrained image classification [33, 34, 35, 28, 38, 46, 47]. As the manual work involved on annotation task is heavy, such approach is unpractical to solve large-scaled prob150
lems. In other works, part detectors are trained by unsupervised learning approach, e.g., by analyzing CNN filter responses [17] and deploying multi-grained descriptors [48].
M
LG-CNN [49] select and filter image feature through two CNNs which share weights. The most relevant work to ours is [26], which utilizes recurrent attention convolutional
155
ED
network to combine features from discriminative local regions in three scales. In this paper, we obtain discriminative regions by a random selection based approach. In the second group, learning of powerful feature representation is the main task.
PT
Several works cast this task into learning deeper networks. For instance, deep residual network upscaled to a depth of 152 is utilized in [37], reducing the error rate to 3.75% on the ImageNet test set [4]. For better modeling the subtle difference among finegrained categories, a bilinear structure [21] is proposed to capture local differences of
CE 160
image and yields state-of-the-art results on the CUB birds dataset [28]. Besides, by
AC
deploying a unified framework incorporating two filter response picking steps, Zhang et al. [17] achieve superior results on both the bird [28] and dog datasets [27].
165
Different from these aforementioned methods, we adopt the pseudo label along
with a multi-scale recognition approach. The network is trained to learn differently scaled fine-grained features on a train set extended by voted pseudo label.
9
AN US
CR IP T
ACCEPTED MANUSCRIPT
Figure 3: The framework of MixDCNNs. Our idea is to crop images and send them into the MixDCNNs to extract features for classification. Symbol ⊕ represents “crop” and “select” operation. We select images in
all scales (from top to bottom) as inputs. Every image or image patch will be resized to the same size and sent into three different models to extract features. “concat” denotes concatenation of those three feature vectors into a compact vector (marked in red). Thereafter, the new vector will be sent to the “fc” layer (marked in
M
blue) to make prediction by the classification function Lcls .
ED
4. Approach
In this section, we introduce the proposed voted pseudo label (VPL) and multi-scale recognition (MSR) approaches. The voted pseudo labels are selected by the consensus judgment of three different expert networks and further utilized in the training procedure. In the MSR method, we use images tailored at two different scales to train the model so that it learns finer features during training. When testing, we divide one image into six parts, which are weighted dynamically, and feed them into the network to make the final prediction.
4.1. The MixDCNNs

Inspired by [30], to better deal with the problem of fine-grained categorization, we propose a new MixDCNN that is made up of three different networks by stacking their convolutional features. We also use the idea of classifier-based adaptation: the three CNNs in the MixDCNN are transferred from ImageNet. We freeze their convolutional layers to speed up training. Finally, we use two fully connected layers to process these features. The whole architecture is illustrated in Fig. 3.
Given an input image X, we first extract region-based deep features by feeding the images into the pre-trained models. The extracted features are represented as g(X, θ), with g(·) denoting operations such as convolution, pooling and activation, and θ standing for the corresponding network parameters.
Here we prefer stacking because the difference between the models' structures is normally large, resulting from the fact that different models usually emphasize different information in the image. Thus, stacking them together enables the model to have a stronger generalization ability.
The task of the connected network is to generate a probability distribution p over all categories, interpreted as

p(X) = f(W_c ∗ X),                                        (2)

where f(·) includes the two fully connected layers as well as the post-processing step by the softmax operation. The fully connected layers map the input features into a compact feature vector, which is consistent with the category entries, while the softmax function further converts the vectorized features into probabilities. The softmax function is applied instead of a Support Vector Machine (SVM) [50] mainly for technical consistency of the feature classification, so that we can integrate the multi-scale descriptors.
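To make this concrete, the following PyTorch-style sketch stacks the pooled features of three frozen backbones and classifies them with two fully connected layers, as in Eq. (2). It is only an illustration under simplifying assumptions: the backbones are assumed to return convolutional feature maps with the listed channel dimensions, and the hidden width is a placeholder rather than a value reported in this paper.

import torch
import torch.nn as nn

class MixDCNN(nn.Module):
    # Sketch of the mixture: three frozen feature extractors, global pooling,
    # concatenation, and two trainable fully connected layers (cf. Eq. (2)).
    def __init__(self, backbones, feat_dims, num_classes, hidden=1024):
        super().__init__()
        self.backbones = nn.ModuleList(backbones)
        for net in self.backbones:              # freeze the convolutional parts
            for p in net.parameters():
                p.requires_grad = False
        self.pool = nn.AdaptiveAvgPool2d(1)     # global pooling per backbone
        self.fc1 = nn.Linear(sum(feat_dims), hidden)
        self.fc2 = nn.Linear(hidden, num_classes)
        self.act = nn.Tanh()                    # tanh activation (see below)
        self.drop = nn.Dropout(p=0.5)           # dropout between the fc layers

    def forward(self, x):
        feats = [self.pool(net(x)).flatten(1) for net in self.backbones]
        z = torch.cat(feats, dim=1)             # "concat" of the three feature vectors
        z = self.drop(self.act(self.fc1(z)))
        logits = self.fc2(z)
        return torch.softmax(logits, dim=1)     # probability distribution p(X)

Only the parameters of fc1 and fc2 receive gradients here, which mirrors the time saving obtained by freezing the convolutional layers.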
In the training procedure, the fitting errors of the category labels over the samples are measured via a cross-entropy loss function. Furthermore, we select the tanh function as the activation of our fully connected layers, which squashes a real-valued number x to the range [−1, 1]. In experiments, we also compare the performance of other activation functions such as Sigmoid, ReLU, etc. Although they can yield comparable precision on the test dataset, they are not as stable as the tanh function, especially when images are cropped into significantly different scales.
In our network, each fully connected layer is also followed by a dropout layer. During the training of the network, the hidden units are randomly omitted with a probability
of P = 0.5, which yields the best results in our experiments. Besides, we found that a dropout of 40% or 45% of visible units is also helpful. By this technique, we can reduce overfitting effectively.

Figure 4: Probability distribution of false predictions by the MixDCNNs on the Stanford Dogs dataset. In this example, classes such as Eskimo dog, Lhasa, Siberian husky and Staffordshire bullterrier have a high error proportion. These classes are therefore assigned confused labels, because the model cannot recognize them well.

4.2. Voted Pseudo Labels
Pseudo labeling regards target classes as real labels for unlabeled samples. By analyzing the predictions of the model on the validation set, we can assess the performance of our network on different categories. Confused labels represent categories with an accuracy far below the average. The boundary between confused and non-confused labels is thus determined by the variation of the prediction errors. In our work, we analyze the error distribution, i.e., in the form of a histogram, among all classes on the validation set, which is a 20% portion of the training data. As shown in Fig. 4, the not-well-recognized classes (the red bars) are assigned confused labels because they are far below the average class precision. The unlabeled samples for voted pseudo labels are extracted from the categories with non-confused labels (green bars in Fig. 4), which show a high prediction accuracy on the validation set. The voted pseudo labels are then associated with the classes of the unlabeled data under the assumption that they are true labels. In this approach, we pick up the images that are not assigned confused labels and add them to the set of pseudo labels C if the predictions of the three models f_i(x), 1 ≤ i ≤ 3, are in consensus, as formulated in Eq. (1). The three models utilized in the MixDCNN are InceptionV4 [51], InceptionResNetV2 [51] and Xception [52], respectively. After adding pseudo
labels, the fully connected layers are retrained in a supervised fashion with the data of both true labels and pseudo labels. The same loss function as in the supervised learning task is deployed, given by

E = − Σ_{j=1}^{n′} L_j log p_j,                             (3)

where the variable n′ replaces n (the number of images in the training set before adding unlabeled data) and indicates the increased number of images after adding voted pseudo labels. Adding voted pseudo labels augments the training data. Although the added unlabeled data may contain noise, a well-trained CNN can still handle these errors. Since the selection process is stricter in our method, we obtain more correct pseudo labels.
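As a minimal sketch of this selection rule (Eq. (1)), the helper below keeps an unlabeled sample only if all three experts agree and the agreed class is not a confused label; the prediction functions and data containers are illustrative placeholders rather than the exact implementation.

def voted_pseudo_labels(unlabeled_images, predict_fns, confused_labels):
    # predict_fns: one callable per expert model, each returning a class index.
    # A sample receives a pseudo label only if all experts vote for the same
    # class and that class is not in the set of confused labels (Eq. (1)).
    selected = []
    for image in unlabeled_images:
        votes = [predict(image) for predict in predict_fns]
        if len(set(votes)) == 1 and votes[0] not in confused_labels:
            selected.append((image, votes[0]))
    return selected

# The selected pairs are appended to the labeled training set, and the fully
# connected layers are retrained with the cross-entropy loss of Eq. (3).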
4.3. Multi-scale Recognition
Inspired by the recent success of the Recurrent Attention Convolutional Neural Network (RA-CNN) [26], we propose an efficient fine-grained image learning approach that improves the prediction performance without requiring changes to the deep network architecture. Given an input image X selected from the training set, we first send it to the Single Shot MultiBox Detector (SSD) [53] to obtain the positive image X_1, which is bounded by the detection window. Thereafter, we crop 9 and 8 small square patches from X_1, covering portions of 1/9 and 1/4 of X_1, respectively. The cropping is conducted randomly in 8 equidistant directions outgoing from the image center, and the patch locations are evenly distributed over the whole image. These 17 image cuts, along with the image X_1, are then fed into the MixDCNNs described in Sec. 4.1 to extract features and to train the fully connected layers. As these image patches include discriminative features, especially of small object parts, the model is able to learn more fine-grained features from a single image.
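The random two-scale cropping used during training can be sketched as follows. The sketch assumes the detected region X_1 is available as a PIL image; the helper name and the uniform sampling of positions are illustrative, while the patch counts and area fractions follow the description above.

import random

def random_two_scale_crops(region, n_small=9, n_large=8):
    # region: PIL image of the SSD-detected object window X1.
    # Returns the region itself plus square patches at two scales, covering
    # roughly 1/9 and 1/4 of the region (side lengths of 1/3 and 1/2).
    w, h = region.size
    crops = [region]
    for count, divisor in ((n_small, 3), (n_large, 2)):
        side = min(w, h) // divisor
        for _ in range(count):
            x = random.randint(0, w - side)
            y = random.randint(0, h - side)
            crops.append(region.crop((x, y, x + side, y + side)))
    return crops   # 18 network inputs in total: X1 plus 17 patches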
Figure 5: During testing, we always divide an image into five fixed parts, with their locations evenly distributed over the whole image. These image cuts cover 1/4 of the original image size. The five patches, along with the original image, are fed into the network to extract features and make the prediction.
For images in the test set, analogously, we first send an image Y into the SSD [53] to extract the detected image region Y_1. We then crop Y_1 into five small parts of the same size as shown in Fig. 5. These small parts, along with the image Y_1, are fed into the trained model to make the prediction. To reduce the negative impact of background blocks, we propose a solution called dynamic weighting, interpreted as

W_i^Y = P_{i,Max}^Y − P_{i,Sec}^Y,                           (4)

where P_{i,Max}^Y and P_{i,Sec}^Y are respectively the first and the second largest probability for image cut i predicted by the softmax function. The weight W_i^Y is assigned to image cut i and multiplied with its probability P_{i,l}^Y for class l in

P_l^Y = Σ_{i=1}^{6} W_i^Y · P_{i,l}^Y,                        (5)

where the sum P_l^Y indicates the accumulated probability over all input image cuts for class l. The assumption is that a large weight W_i^Y implies high prediction confidence, and the weight of a background block is usually small. Thus, we can reduce the negative impact of background blocks. The final label of image Y is determined by

l^Y = arg max_l (P_l^Y),                                      (6)

where label l^Y is associated with the class of maximum probability predicted for image Y.
Datasets              #Category number   #Training images   #Testing images
CIFAR-10 [54]         10                 50000              10000
CIFAR-100 [54]        100                50000              10000
CUB-200-2011 [28]     200                5994               5794
Stanford Dogs [27]    120                12000              8580

Table 1: The statistics of the datasets used in this paper.
5. Experiments
Datasets: We divide the experiments into two parts. The first experiment is conducted on two famous image classification datasets, CIFAR-10 and CIFAR-100 [54], while the second one is on two challenging fine-grained image recognition datasets, Caltech-UCSD Birds (CUB-200-2011) [28] and Stanford Dogs [27]. The detailed statistics about their category numbers and data splits are summarized in Table 1.

5.1. SSD Network Training
statistics about their category numbers and data splits are summarized in Table 1.
We initialize our SSD model based on the architecture of VGG16 [55] with weights
250
ED
trained for classification on the ILSVRC CLS-LOC dataset [39]. We fine tuned the network on the training set of PASCAL VOC2007 + VOC2012 for 21 classes. During training, we apply L2 normalization technique and randomly sample each
PT
training image by one of following options: - retaining the complete original input image,
CE
255
- sampling a patch which has an overlap with the objects in a portion of 0.1,
AC
0.3,0.5, 0.7, or 0.9 and
- randomly sampling a patch with its center aligned to center of the original image.
We post-process sampled images by horizontal flipping with probability of 0.5. Then
260
all the images are resized to a fixed size of 300×300 pixels. Finally, we apply some photo-metric distortions referred from [53].
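The second option, for example, can be realized by sampling patches until a minimum overlap with the annotated object box is reached, as in the hedged sketch below; the box format (x1, y1, x2, y2), the size limits and the helper names are assumptions for illustration, not the exact implementation of [53].

import random

def iou(a, b):
    # Intersection over union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def sample_patch(img_w, img_h, obj_box, min_overlap, max_tries=50):
    # Randomly sample a patch until its overlap with the object box
    # reaches the requested threshold (0.1, 0.3, 0.5, 0.7 or 0.9).
    for _ in range(max_tries):
        pw = random.randint(int(0.3 * img_w), img_w)
        ph = random.randint(int(0.3 * img_h), img_h)
        px = random.randint(0, img_w - pw)
        py = random.randint(0, img_h - ph)
        patch = (px, py, px + pw, py + ph)
        if iou(patch, obj_box) >= min_overlap:
            return patch
    return (0, 0, img_w, img_h)   # fall back to the complete image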
Figure 6: The classification precision of the GoogleNet on the validation set of Cifar-10. The horizontal axis indicates the class ID (0: airplane, 1: car, 2: bird, 3: cat, 4: deer, 5: dog, 6: frog, 7: horse, 8: ship, 9: truck) and the vertical axis the accuracy. We mark a category in orange when its accuracy falls in the range of 90%-91% and denote it as a light-confused label (i.e., the bird and dog classes with IDs 2 and 5). Red denotes an accuracy of less than 90% and corresponds to a confused label (here the cat class with ID 3).
For optimization, we used the standard adaptive moment estimation with a learning rate of 1×10−3. The learning rate decay factor is set to 0.94 and the weight decay to 5×10−4. We start with a fixed learning rate to train the network. Once the network converges to a good result (e.g., 0.5 mAP), we change the learning rate and fine-tune the complete network. We use a mini-batch size of 32 and train the network for a maximum of 500 epochs on an Nvidia GTX Titan X GPU. We achieve an mAP of 0.778 on the VOC2007 test set.
5.2. Experiments on VPL
In this part, we conduct controlled experiments to prove the effectiveness of our semi-supervised approach. We take 20% of the training set as the validation set and another 20% of the training set as unlabeled data. The hyperparameters are also determined w.r.t. the performance of the neural network on the validation set.
During training, we apply data augmentation for all datasets by a random crop (with a size of 32 pixels and a padding of 4 pixels) and a random horizontal flip with a probability of 0.5. For optimization, we apply stochastic gradient descent with a learning rate of 0.1, a momentum of 0.9, and a weight decay of 5×10−4.
Approach          Additional Class           Accuracy (%)
ResNet18 [37]     -                          94.32
ResNet18 [PL]     0-9                        94.4
ResNet18          0,1,2,4,6,7,8,9            94.58
ResNet18          4,6,7,8,9                  94.85
VGG19 [55]        -                          93.05
VGG19 [PL]        0-9                        93.36
VGG19             0,1,2,4,6,7,8,9            93.74
GoogleNet [56]    -                          94.33
GoogleNet [PL]    0-9                        94.37
GoogleNet         0,1,2,4,5,6,7,8,9          94.54
GoogleNet         0,1,4,6,7,8,9              94.73
ResNet18          (0-2)*, (4-9)*             95.14
VGG19             (0-2)*, (4-9)*             94.05
GoogleNet         (0-2)*, (4-9)*             95.88

Table 2: Comparison of the test results on the Cifar-10 dataset among different approaches. "PL" indicates that the pseudo label is used. "Additional Class" indicates the classes of unlabeled data that are added to the training set. The symbol * denotes unlabeled data selected with VPL, which are in high consensus among the ResNet18, the VGG19 and the GoogleNet and thus do not have confused labels.
We start with a fixed learning rate to train the network until the classification precision stabilizes on the validation set. Then we use a lower learning rate (e.g., shrunk by a factor of 0.1) for further training. We use a mini-batch size of 128 and train the network for a maximum of 300 epochs on the same GPU.
5.2.1. Experiments on Cifar-10
In this experiment, we investigate the performance of our classification model in dependence on the number of selected unlabeled data, with the results shown in Table 2. Each model is trained with four datasets, respectively: the original labeled data, the data augmented with pseudo labels [12], the data augmented with the non-confused classes of the unlabeled data, and the data augmented with VPL (i.e., unlabeled data with high consensus among all three models on the same label). The confused labels are obtained by checking the model predictions on the validation set and differ among the utilized CNN models. Taking the GoogleNet [56] as an example, as shown in Fig. 6, the classes are ranked by accuracy. We define a red bar as a confused label, whose accuracy is lower than 90%, and an orange bar as a light-confused label (with accuracy in the range of
Approach             Non-confused Label    Accuracy (%)
ResNet18 [37]                              74.78
ResNet18 [PL [12]]                         73.60
ResNet18             ✓                     73.99
VGG19 [55]                                 67.82
VGG19 [PL]                                 67.24
VGG19                ✓                     67.73
GoogleNet [56]                             77.52
GoogleNet [PL]                             76.18
GoogleNet            ✓                     77.2
ResNet18*            ✓                     75.52
VGG19*               ✓                     69.16
GoogleNet*           ✓                     78.81

Table 3: Comparison results on the Cifar-100 dataset. The check mark ✓ indicates using unlabeled data. * denotes using the voted pseudo label.
90% to 91%). By removing these classes from the data with pseudo labels, the GoogleNet achieves accuracies of 94.54% and 94.73%. Compared with the original pseudo label approach [12], we obtain gains of 0.17% and 0.36%, respectively.
On the initial training set, the ResNet18 [37] and VGG19 [55] achieve accuracies of 94.32% and 93.05%, respectively. After adding unlabeled data that take the maximum predicted probability as their true labels [12], they are improved to 94.4% and 93.36%, respectively. By removing the confused labels, they achieve further gains of 0.14% and 0.38%, respectively. After removing the light-confused labels, the ResNet18 reaches its top accuracy of 94.85%. As shown in the last rows of Table 2, we choose the cat class with ID 3 as the confused label because all utilized models show relatively poor predictions on it. Thus, we only add unlabeled data that exclude this class and are in high consensus among the three models. In this way, the ResNet18, VGG19 and GoogleNet achieve improved recognition accuracies of 95.14%, 94.05% and 95.88%, respectively.

5.2.2. Experiments on Cifar-100
AC
racy of 95.14%, 94.05% and 95.88% respectively. 5.2.2. Experiments on Cifar-100
310
The classification accuracies on Cifar-100 are summarized in Table 3. When using
the pseudo label [12], the performance of all three models decreases (e.g., from 74.78% to 73.60% for the ResNet18). This fact is due to the poor recognition on the initial train set. Thus, the adding of unlabeled data brings too much noise. After removing the 18
ACCEPTED MANUSCRIPT
confused label, the accuracy decline has been reduced (e.g., from 74.78% improved to 73.99% for the ResNet18). Finally, after adding the voted pseudo labels which exclude 315
the confused labels and in high-consensus among three model, the ResNet18 is boosted
CR IP T
to an accuracy of 75.52% and the GoogleNet and VGG19 respectively improves to 78.81% and 69.16% from 77.52% and 67.82%. From the above phenomenon, we can see that the voted pseudo label can ensure that the added unlabeled data do not have too much noise so that it doesn’t reduce the performance of the model. 320
5.2.3. Iterations Experiments on Cifar-10
We conduct an experiment to explore the effect of iterations. Iterative adding here means repeatedly adding voted pseudo labels selected from the same unlabeled dataset. The experimental results are shown in Table 4. We respectively adopt 40%, 40% and 20% of the whole data as the training, unlabeled and validation sets.
In this experiment, selecting voted pseudo labels repeatedly from the same unlabeled dataset does not bring persistent improvement. Concretely, after the first iteration, the performance of ResNet18 and MobileNetV2 is improved by 1.52% and 1.27%, respectively. In the next iteration, we reuse the network to make predictions and to select voted pseudo labels again. The data with the new pseudo labels are added to the initial labeled training set and the network is retrained. However, in our experiments we found that after the second round the performance gain of the network was small. In the third iteration, the model's performance hardly changed and the number of pseudo labels barely increased. We conclude that the network is already fitted to the voted pseudo label data after the first iteration, so that the difference of the selected pseudo labels is small in further iterations.
Iterations          0        1        2        3
ResNet18 [37]       92.02%   93.64%   93.66%   93.65%
MobileNetV2 [57]    92.93%   94.20%   94.18%   94.17%

Table 4: Comparison of the accuracies after different numbers of iterations on the Cifar-10 dataset. The tested approaches are ResNet18 [37] and MobileNetV2 [57].
5.3. Experiments on VPL + MSR Baselines: We list some excellent approaches as baseline, which bases on deep learning and yields state-of-the-art results on both datasets. These methods are mainly
340
CR IP T
chosen from two categories, depending on whether human-defined bounding boxes or part annotations are utilized. All of them are based on the VGG16 or VGG19. All the baselines are listed as below, with the first five working by human supervision while the last eight in unsupervised part learning manner.
• DeepLAC [34]: a deep localization method using a pose-aligned part image for
345
AN US
classification.
• SPDA-CNN [38]: a network extracting features from candidates generated by an approach of semantic part detection and abstraction.
• Part-RCNN [46]: extension of the framework R-CNN [33] by part annotations.
M
• PA-CNN [20]: a method generating aligned parts with the help of co-segmentation. • PN-CNN [16]: a CNN model computing local features from estimated normalized object pose.
ED
350
• B-CNN [21]: a bilinear CNN model classifying objects by pairwise feature interaction.
PT
• PDFR [17]: an approach learns part detectors by analyzing deep filter responses. • MG-CNN [48]: a multi-region learning method for all grained levels by multiple granularity descriptors.
CE
355
AC
• FCAN [58]: a fully convolutional attention network adapted to selection of mul-
360
tiple task-driven visual attentions by the reinforcement learning.
• NAC [59]: a part localization method by computing constellated neural activation patterns.
• DVAN [60] : a diverse attention network classifying objects from coarse to fine by multi-region proposals.
20
ACCEPTED MANUSCRIPT
• Improved B-CNN [61] : an improved B-CNN architecture that uses matrixnormalization layers.
• RA-CNN [26]: a recurrent attention convolutional neural network recursively
CR IP T
learning discriminative attentions and feature representations for multi-scaled
365
regions.
In our approach, we mainly rely on transfer learning instead of training deep networks end to end. Concretely, we utilize InceptionV4 [51], InceptionResNetV2 [51] and Xception [52] in the MixDCNNs for the recognition of dog images, while for the bird dataset we adopt ResNet152 [37], DenseNet161 [62] and Xception [52]. We choose different models for each dataset because the visual features of birds are generally smaller and less distinguishable than those in dog images; the second configuration therefore yields better results in the experiments on the bird dataset. Additionally, we resize input images to 224 × 224 and 299 × 299 pixels: the first size is used for feature extraction in ResNet152 and DenseNet161, while the second size is used for the other networks. All the deployed models are pre-trained on ImageNet. We also found that different values of P in dropout make the network more robust. Therefore, we empirically set P = 0.4 in the first fully connected layer and 0.45 in the second layer.
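For illustration, the two backbone configurations can be assembled as follows. This sketch assumes the timm package, which provides all five architectures, and is not necessarily the toolchain used by the authors.

import timm

DOG_BACKBONES  = ["inception_v4", "inception_resnet_v2", "xception"]  # 299x299 inputs
BIRD_BACKBONES = ["resnet152", "densenet161", "xception"]             # 224x224 / 299x299

def build_feature_extractors(names):
    # num_classes=0 and global_pool="" strip the classifier so that each
    # network returns its final convolutional feature map, which can then be
    # pooled and concatenated in the MixDCNN.
    return [timm.create_model(n, pretrained=True, num_classes=0, global_pool="")
            for n in names]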
5.3.1. Experiments on the CUB-200-2011
In this experiment, we compare our approach with the baselines whose results are published for this dataset. Moreover, we evaluate the performance of each single model deployed in our MixDCNNs, i.e., the ResNet [37], the DenseNet [62] and the Xception [52]. As shown in Table 5, without stacking, the accuracies of ResNet, DenseNet and Xception are relatively low, equaling 68.6%, 72.0% and 67.5%, respectively. By concatenating their features at three scales, the score of the MixDCNNs improves to 77.7%. This accuracy is further increased by about 7% when the MSR is integrated. When this approach is further combined with the voted pseudo label, we obtain the highest recognition accuracy of 85.2%. This value is comparable to the best results achieved by the baseline methods, considering that most of the baseline methods utilize strictly
by the baseline methods, considering that most of the baseline methods utilize strictly 21
ACCEPTED MANUSCRIPT
DeepLAC [34]
Train with Anno. √
Accuracy (%)
Part-RCNN [46]
√
81.6
MG-CNN [48]
√
83.0
FCAN [58]
√
84.3
B-CNN (250k) [21]
√
85.1
SPDA-CNN [38]
√
85.1
PN-CNN [20]
√
82.8
PN-CNN [16]
√
85.4
80.3
82.0
Improved B-CNN [61]
85.8
RA-CNN [26]
85.3
ResNet152 [37]
68.6
DenseNet161 [62]
72.0
Xception [52]
67.5
MixDCNNs
77.7
MixDCNNs [MSR]
AN US
FCAN [58]
CR IP T
Approach
84.3
MixDCNNs [VPL+MSR]
85.2
Table 5: Comparison of the test results on the CUB-200-2011 dataset among different approaches. Check mark X indicates that strictly human-defined bounding boxes or part annotations are utilize during training.
M
human-defined bounding boxes or part annotations. For instance, PN-CNN [16] is trained by strong supervision of both human-defined bounding boxes and part ground-
ED
truths. And B-CNN [21] utilizes bounding boxes with a high dimensional feature representation (250k). However, our network only relies on multi-scale image cuts with 395
dynamic weighting as well as voted pseudo label, which makes our approach more
PT
flexible and thus it is capable to be applied in general classification tasks. Additionally, we display several examples with discriminative regions of two scales by the proposed approach MSR in Fig. 7. From these images, we can observe that most of the localized
CE
regions are exactly consistent with the human perception, which is helpful to make
400
model recognize better and further verifies that our approach is effective in selection of
AC
discriminative fine-grained features. 5.3.2. Experiments on the Stanford Dogs Analogously, we compare our approach with the methods whose results on this
dataset are publicly available. Their classification accuracy on the dataset of Stanford
405
Dogs are summarized in Table 6. The single model utilized in our MixDCNNs, i.e.,
22
CR IP T
ACCEPTED MANUSCRIPT
Figure 7: Five bird species displayed with discriminative regions in different scales. We select and show the original image and some cropped patches from 17 patches. Significant visual cues are captured in these
AN US
regions and can thus improve the classification precision.
the InceptionV4 [51], the InceptionResNetV2 [51] and the Xception [52], respectively achieve recognition accuracies of 92.1%, 92.0% and 90.6% on the original images. Relying on the feature concatenation, the MixDCNNs achieve an improved recognition accuracy of 92.8%. By combining image parts from different scales with the weighted prediction, we boost the performance to 94.5%. Unlike on the CUB dataset, the pretrained models already perform well on this dataset. We attribute this to the pretraining on ImageNet, which contains far more training images of dogs. The result also reflects the importance of transfer learning. However, if the source and target tasks do not have much correlation, one should consider fine-tuning or retraining the pretrained model instead of just extracting features. Compared with the test results on the bird dataset, the improvement here is relatively small. This is due to two reasons. On the one hand, unlike for dogs, the similarity between different bird species is relatively high and the main visual difference is mainly derived from the area of the head or the wings. On the other hand, the difference between the models of the Inception series is not as pronounced as among ResNet, DenseNet and Xception. Nevertheless, this precision still outperforms the unsupervised methods; e.g., compared with DVAN [60], FCAN [58] and RA-CNN [26], the relative accuracy gains are 13.0%, 10.3% and 7.2%. After adding the voted pseudo labels, the performance of our approach is further boosted to 95.6%.
Approach                  Accuracy (%)
NAC (AlexNet) [59]        68.3
PDFR (AlexNet) [17]       71.9
VGG-16 [55]               76.7
DVAN [60]                 81.5
FCAN [58]                 84.2
RA-CNN [26]               87.3
InceptionV4 [51]          92.1
InceptionResNetV2 [51]    92.0
Xception [52]             90.6
MixDCNNs                  92.8
MixDCNNs [MSR]            94.5
MixDCNNs [VPL+MSR]        95.6

Table 6: Comparison of the test results on the Stanford Dogs dataset among different approaches without extra bounding boxes or part annotations.
6. Conclusion
In this paper, we proposed two techniques to improve existing classification models: the voted pseudo label and a novel fine-grained feature learning method. These two approaches can be widely applied to most pre-trained neural network models and training methods. Without a complex training scheme, both of them achieve consistent improvements over the original model. Thanks to domain adaptation and parameter freezing, our MixDCNN runs fast and is memory-saving, which also makes it suitable for tasks with high requirements on both accuracy and speed. The semi-supervised method is suitable for areas lacking sufficient data, e.g., some biomedical applications. Extensive experiments demonstrate the superior performance of our approach on the fine-grained recognition tasks of dogs and birds. In the future, our research will focus on two directions: a more effective way of utilizing image features and network architecture design.

References
440
learning, in: Proceedings of the 24th International Conference on World Wide 24
ACCEPTED MANUSCRIPT
Web, International World Wide Web Conferences Steering Committee, 2015, pp. 344–354. [2] J. Fu, J. Wang, Y. Rui, X.-J. Wang, T. Mei, H. Lu, Image tag refinement with viewfor Video Technology 25 (8) (2015) 1409–1422.
445
CR IP T
dependent concept representations, IEEE Transactions on Circuits and Systems
[3] J. Fu, Y. Wu, T. Mei, J. Wang, H. Lu, Y. Rui, Relaxing from vocabulary: Robust weakly-supervised deep learning for vocabulary-free image tagging, in: Pro-
ceedings of the IEEE International Conference on Computer Vision, 2015, pp.
450
AN US
1985–1993.
[4] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in neural information processing systems, 2012, pp. 1097–1105.
[5] J. Wang, J. Fu, Y. Xu, T. Mei, Beyond object recognition: Visual sentiment anal3484–3490.
455
M
ysis with deep coupled adjective and noun neural networks., in: IJCAI, 2016, pp.
ED
[6] Y. Lecun, Y. Bengio, G. Hinton, Deep learning., Nature 521 (7553) (2015) 436. [7] A. Niculescu-Mizil, R. Caruana, Inductive transfer for bayesian network structure
PT
learning, in: Artificial Intelligence and Statistics, 2007, pp. 339–346. [8] S. J. Pan, Q. Yang, et al., A survey on transfer learning, IEEE Transactions on knowledge and data engineering 22 (10) (2010) 1345–1359.
CE
460
[9] L. Shao, F. Zhu, X. Li, Transfer learning for visual categorization: A survey, IEEE transactions on neural networks and learning systems 26 (5) (2015) 1019–1034.
AC
[10] D. H. Svendsen, L. Martino, M. Campos-Taberner, F. J. Garc´ıa-Haro, G. Camps-
465
Valls, Joint gaussian processes for biophysical parameter retrieval, IEEE Transactions on Geoscience and Remote Sensing 56 (3) (2018) 1718–1727.
[11] J. Zhang, W. Li, P. Ogunbona, Joint geometrical and statistical alignment for visual domain adaptation, arXiv preprint arXiv:1705.05498. 25
ACCEPTED MANUSCRIPT
[12] D.-H. Lee, Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks, in: Workshop on Challenges in Representation Learning, ICML, Vol. 3, 2013, p. 2.
470
CR IP T
[13] A. Odena, Semi-supervised learning with generative adversarial networks, arXiv preprint arXiv:1606.01583.
[14] G. Papandreou, L.-C. Chen, K. P. Murphy, A. L. Yuille, Weakly-and semi-
supervised learning of a deep convolutional network for semantic image segmentation, in: Proceedings of the IEEE International Conference on Computer Vision,
475
AN US
2015, pp. 1742–1750.
[15] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, Improved techniques for training gans, in: Advances in Neural Information Processing Systems, 2016, pp. 2234–2242. 480
[16] S. Branson, G. Van Horn, S. Belongie, P. Perona, Bird species categorization
M
using pose normalized deep convolutional nets, arXiv preprint arXiv:1406.2952. [17] X. Zhang, H. Xiong, W. Zhou, W. Lin, Q. Tian, Picking deep filter responses
ED
for fine-grained image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1134–1142. [18] M.-E. Nilsback, A. Zisserman, A visual vocabulary for flower classification, in:
PT
485
IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2, 2006, pp.
CE
1447–1454.
[19] S. Reed, Z. Akata, H. Lee, B. Schiele, Learning deep representations of finegrained visual descriptions, in: Proceedings of the IEEE Conference on Computer
AC
490
Vision and Pattern Recognition, 2016, pp. 49–58.
[20] J. Krause, H. Jin, J. Yang, L. Fei-Fei, Fine-grained recognition without part annotations, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5546–5555.
26
ACCEPTED MANUSCRIPT
[21] T.-Y. Lin, A. RoyChowdhury, S. Maji, Bilinear cnn models for fine-grained visual recognition, in: Proceedings of the IEEE International Conference on Computer
495
Vision, 2015, pp. 1449–1457.
CR IP T
[22] J. Krause, B. Sapp, A. Howard, H. Zhou, A. Toshev, T. Duerig, J. Philbin, L. Fei-
Fei, The unreasonable effectiveness of noisy data for fine-grained recognition, in: European Conference on Computer Vision, Springer, 2016, pp. 301–320. 500
[23] T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, Z. Zhang, The application of two-
level attention models in deep convolutional neural network for fine-grained im-
AN US
age classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 842–850.
[24] L. Anne Hendricks, S. Venugopalan, M. Rohrbach, R. Mooney, K. Saenko, T. Darrell, Deep compositional captioning: Describing novel object categories
505
without paired training data, in: Proceedings of the IEEE Conference on Com-
M
puter Vision and Pattern Recognition, 2016, pp. 1–10.
[25] J. Johnson, A. Karpathy, L. Fei-Fei, Densecap: Fully convolutional localization networks for dense captioning, in: Proceedings of the IEEE Conference on Com-
ED
puter Vision and Pattern Recognition, 2016, pp. 4565–4574.
510
[26] J. Fu, H. Zheng, T. Mei, Look closer to see better: recurrent attention convolu-
PT
tional neural network for fine-grained image recognition, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
CE
[27] A. Khosla, N. Jayadevaprakash, B. Yao, F.-F. Li, Novel dataset for fine-grained image categorization: Stanford dogs, in: Proc. CVPR Workshop on Fine-Grained
515
Visual Categorization (FGVC), Vol. 2, 2011, p. 1.
AC
[28] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, P. Perona, Caltech-ucsd birds 200.
[29] B. Zhao, J. Feng, X. Wu, S. Yan, A survey on deep learning-based fine-grained
520
object classification and semantic segmentation, International Journal of Automation and Computing 14 (2) (2017) 119–135. 27
ACCEPTED MANUSCRIPT
[30] Z. Ge, A. Bewley, C. Mccool, P. Corke, B. Upcroft, C. Sanderson, Fine-grained classification via mixture of deep convolutional neural networks, workshop on applications of computer vision (2016) 1–6. [31] T. Berg, J. Liu, S. W. Lee, M. L. Alexander, D. W. Jacobs, P. N. Belhumeur,
CR IP T
525
Birdsnap: Large-scale fine-grained visual categorization of birds (2014) 2019– 2026.
[32] H. Goeau, P. Bonnet, A. Joly, Lifeclef plant identification task 2014 1391 (2014) 598–615.
[33] S. Huang, Z. Xu, D. Tao, Y. Zhang, Part-stacked cnn for fine-grained visual cat-
AN US
530
egorization, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1173–1182.
[34] D. Lin, X. Shen, C. Lu, J. Jia, Deep lac: Deep localization, alignment and classification for fine-grained recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1666–1674.
M
535
[35] O. M. Parkhi, A. Vedaldi, C. Jawahar, A. Zisserman, The truth about cats and
ED
dogs, in: Computer Vision (ICCV), 2011 IEEE International Conference on, IEEE, 2011, pp. 1427–1434.
PT
[36] M. J. Afridi, A. Ross, E. M. Shapiro, On automated source selection for transfer learning in convolutional neural networks, Pattern Recognition 73 (2018) 65 – 75.
540
CE
[37] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
AC
[38] H. Zhang, T. Xu, M. Elhoseiny, X. Huang, S. Zhang, A. Elgammal, D. Metaxas,
545
Spda-cnn: Unifying semantic part detection and abstraction for fine-grained recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1143–1152.
28
ACCEPTED MANUSCRIPT
[39] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., Imagenet large scale visual recognition challenge, International Journal of Computer Vision 115 (3) (2015) 211–252.
550
CR IP T
[40] M. D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in: European conference on computer vision, Springer, 2014, pp. 818–833.
[41] K. Weiss, T. M. Khoshgoftaar, D. Wang, A survey of transfer learning, Journal of Big Data 3 (1) (2016) 9. 555
[42] G. E. Hinton, R. R. Salakhutdinov, Reducing the dimensionality of data with
AN US
neural networks, science 313 (5786) (2006) 504–507.
[43] M. Ranzato, M. Szummer, Semi-supervised learning of compact document representations with deep networks, in: Proceedings of the 25th international conference on Machine learning, ACM, 2008, pp. 792–799. 560
[44] Y. Zhang, Z. Zhang, J. Qin, L. Zhang, B. Li, F. Li, Semi-supervised local multi76 (2018) 662–678.
M
manifold isomap by linear embedding for feature extraction, Pattern Recognition
ED
[45] H. Wu, S. Prasad, Semi-supervised dimensionality reduction of hyperspectral imagery using pseudo-labels, Pattern Recognition 74 (2018) 212–224. [46] N. Zhang, J. Donahue, R. Girshick, T. Darrell, Part-based r-cnns for fine-grained
PT
565
category detection, in: European conference on computer vision, Springer, 2014,
CE
pp. 834–849.
[47] X.-S. Wei, C.-W. Xie, J. Wu, C. Shen, Mask-cnn: Localizing parts and selecting
AC
descriptors for fine-grained bird species categorization, Pattern Recognition 76
570
(2018) 704–714.
[48] D. Wang, Z. Shen, J. Shao, W. Zhang, X. Xue, Z. Zhang, Multiple granularity descriptors for fine-grained categorization, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2399–2406.
29
ACCEPTED MANUSCRIPT
[49] G.-S. Xie, X.-Y. Zhang, W. Yang, M. Xu, S. Yan, C.-L. Liu, Lg-cnn: From local parts to global discrimination for fine-grained recognition, Pattern Recognition
575
71 (2017) 118–131.
CR IP T
[50] C. Cortes, V. Vapnik, Support-vector networks, Machine learning 20 (3) (1995) 273–297.
[51] C. Szegedy, S. Ioffe, V. Vanhoucke, A. A. Alemi, Inception-v4, inception-resnet
and the impact of residual connections on learning., in: AAAI, 2017, pp. 4278–
580
4284.
preprint arXiv:1610.02357.
AN US
[52] F. Chollet, Xception: Deep learning with depthwise separable convolutions, arXiv
[53] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A. C. Berg, Ssd: Single shot multibox detector, in: European conference on computer vision,
585
Springer, 2016, pp. 21–37.
M
[54] A. Krizhevsky, G. Hinton, Learning multiple layers of features from tiny images. [55] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale
590
ED
image recognition, arXiv preprint arXiv:1409.1556. [56] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Van-
PT
houcke, A. Rabinovich, Going deeper with convolutions (2014) 1–9. [57] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, L. Chen, Inverted residuals
CE
and linear bottlenecks: Mobile networks for classification, detection and segmentation, CoRR abs/1801.04381. arXiv:1801.04381. URL http://arxiv.org/abs/1801.04381
AC
595
[58] X. Liu, T. Xia, J. Wang, Y. Lin, Fully convolutional attention localization networks: Efficient attention localization for fine-grained recognition, arXiv preprint arXiv:1603.06765.
30
ACCEPTED MANUSCRIPT
[59] M. Simon, E. Rodner, Neural activation constellations: Unsupervised part model discovery with convolutional networks, in: Proceedings of the IEEE International
600
Conference on Computer Vision, 2015, pp. 1143–1151.
CR IP T
[60] B. Zhao, X. Wu, J. Feng, Q. Peng, S. Yan, Diversified visual attention networks for fine-grained object classification, arXiv preprint arXiv:1606.08572. [61] T. Y. Lin, S. Maji, Improved bilinear pooling with cnns. 605
[62] G. Huang, Z. Liu, K. Q. Weinberger, L. van der Maaten, Densely connected con-
AC
CE
PT
ED
M
AN US
volutional networks, arXiv preprint arXiv:1608.06993.
31
Danyu Lai is with the School of Data and Computer Science, Sun Yat-sen University, Guangzhou, Guangdong, P.R. China. He is a postgraduate student and currently studies machine learning and deep learning in applications of both pattern recognition and computer vision. His areas of interest include image classification, semantic segmentation and object detection.
Wei Tian received the B.Sc. degree in mechatronics engineering from Tongji University, Shanghai, China, in 2010. From October 2010, he was with the Department of Electrical Engineering and Information Technology at KIT, Karlsruhe, Germany, and received the M.Sc. degree in May 2013. He is currently working toward the Ph.D. degree at the Institute of Measurement and Control Systems at KIT. He is interested in the research areas of robust object detection and tracking.
Long Chen received the B.Sc. degree in communication engineering and the Ph.D. degree in signal and information processing from Wuhan University, Wuhan, China, in 2007 and 2013, respectively. From October 2010 to November 2012, he was a co-trained Ph.D. student at the National University of Singapore. From 2008 to 2013, he was in charge of the environmental perception system for the autonomous vehicle SmartV-II with the Intelligent Vehicle Group, Wuhan University. He is currently an Associate Professor with the School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China. His areas of interest include perception systems for intelligent vehicles.