Improving deep ensemble vehicle classification by using selected adversarial samples


Accepted Manuscript

Wei Liu, Zhiming Luo, Shaozi Li

PII: S0950-7051(18)30321-6
DOI: 10.1016/j.knosys.2018.06.035
Reference: KNOSYS 4415

To appear in: Knowledge-Based Systems

Received date: 3 January 2018
Revised date: 16 June 2018
Accepted date: 19 June 2018

Please cite this article as: Wei Liu, Zhiming Luo, Shaozi Li, Improving Deep Ensemble Vehicle Classification by Using Selected Adversarial Samples, Knowledge-Based Systems (2018), doi: 10.1016/j.knosys.2018.06.035

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


Wei Liu (a,d), Zhiming Luo (b,e,∗), Shaozi Li (c)

a Virtual Reality and Interactive Techniques Institute, East China Jiaotong University, Jiangxi, China
b Postdoc Center of Information and Communication Engineering, Xiamen University, China
c Cognitive Science Department, Xiamen University, Fujian, China
d Hubei Key Laboratory of Intelligent Vision Based Monitoring for Hydroelectric Engineering, Yi Chang, China
e The Key Laboratory of Cognitive Computing and Intelligent Information Processing of Fujian Education Institutions, Wuyishan, China

Abstract


Most image classification algorithms aim to maximize the percentage of class labels that are predicted correctly. As a result, these algorithms often misclassify images from minority categories as belonging to the dominant categories. To overcome the issue of unbalanced data when classifying vehicles in traffic surveillance images, we propose a semi-supervised pipeline that integrates deep neural networks with data augmentation based on generative adversarial nets (GANs). The proposed approach consists of three main stages. In the first stage, we train several GANs on the original dataset to generate adversarial samples for the rare classes. In the second stage, an ensemble of CNN models with different architectures is trained on the original imbalanced dataset, and a sample selection step is then performed to filter out low-quality adversarial samples. In the final stage, the ensemble model is refined on the augmented dataset obtained by adding the selected adversarial samples. Experiments on the large and highly imbalanced “MIOvision Traffic Camera Dataset (MIO-TCD)” classification challenge benchmark demonstrate that, compared with the baseline, the proposed framework improves the classification performance on rare categories to some extent while maintaining high overall accuracy.

Keywords: Imbalanced classification, image classification, generative adversarial nets, ensemble learning

2010 MSC: 00-01, 99-00

1. Introduction

In the last decade, the widespread use of visual traffic systems has led to rapid growth in the amount of video data that must be processed. Parsing surveillance video content is repetitive, and computers are well suited to tedious tasks that require a long attention span. Therefore, automated video surveillance is attracting more attention than manual surveillance. With the increasing number of available images, image processing techniques such as image classification have become a hot topic in the field of artificial intelligence.

Although image classification has been widely studied in academia and applied in various fields, it remains an open problem. For example, many practical image classification problems are imbalanced, i.e., at least one of the classes is represented by only a few samples, while other categories make up the majority. It is difficult to classify images with multiple labels using only a small number of labeled samples, and this difficulty is compounded by an unbalanced distribution.

Moreover, image classification plays an important role in visual intelligent transport systems. It is a prerequisite for the semantic analysis of visual traffic surveillance systems. In the field of traffic surveillance, a visual traffic surveillance system needs to detect vehicles or pedestrians and classify them if possible. In practice, Pedestrians, Bicycles and Motorcycles often constitute a minority of the dataset, in contrast with Cars and Buses. Consequently, to avoid the misclassification of images from rare categories as majority classes, it is not appropriate to assume that the misclassification error costs for all samples are equal. If these costs are implicitly assumed to be equal, images from minority categories are prone to be misclassified as belonging to dominant categories. Therefore, to effectively reduce the number of fatalities, it is reasonable to focus on enhancing the mean accuracy over all categories while maintaining high overall accuracy.

In the last decade, deep neural networks have led to a series of breakthroughs on a variety of machine learning tasks, such as computer vision, text analysis and voice recognition. Large-scale labeled training datasets are becoming increasingly important with the rise in the capacity of deep learning methods. However, such datasets are not always available. In such cases, data augmentation techniques that enlarge labeled training image datasets become a viable solution. Generative adversarial networks (GANs) [1] have been used to generate synthetic images, owing to the relative sharpness of the samples generated by these models compared to other approaches. However, learning from training data enlarged by GANs may not achieve the desired performance, owing to a gap between the synthetic and real image distributions for rare classes.

The contributions of this work are as follows:

• To reduce the gap between synthetic and real image distributions, we devised a supervised synthetic sample refiner.
• To reduce the bias of single models, an ensemble method based on convolutional neural network (CNN) models was proposed to improve the overall classification performance. We showed that the proposed ensemble method is superior to any single model in the ensemble.
• Inspired by cost-sensitive learning [2, 3], a class-dependent weighted loss function was devised to emphasize the misclassification cost of training samples from rare classes.
• It was demonstrated that the proposed semi-supervised scheme is able to improve performance to some extent, compared with the baselines, for imbalanced vehicle type classification.

∗ Corresponding author. Email address: [email protected] (Zhiming Luo)
Preprint submitted to Journal of LaTeX Templates, July 5, 2018

2. Related work

Classification has been an active research topic in machine learning for a long period of time [4, 5, 6, 7, 8, 9]. It has been the subject of many papers, workshops, special sessions, and dissertations [10, 11, 12, 13, 14, 2, 15, 16, 17, 18].

2.1. Deep learning

In the past few years, deep-learning-based methods have achieved many breakthroughs on a variety of machine learning tasks, including computer vision, machine translation, and speech recognition. Among the different kinds of deep neural network architectures, the CNN is one of the most widely used. Since the proposal of AlexNet by Krizhevsky et al. [19] for image classification, deep methods have shown superior performance compared with conventional “shallow” models in many computer vision tasks, such as object detection [20], video classification [21], and image segmentation [22].

These successes inspired a new line of research focused on developing more accurate and higher-performance CNN models. The quality of CNN models has been significantly enhanced by using deeper and wider architectures. Simonyan et al. [23] proposed VGGNet, which promoted the use of small convolutional filters to construct deep models. Szegedy et al. [24] presented GoogLeNet, which uses the Inception module to construct a wider model and set a new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014¹. However, deeper neural networks are much more difficult to train. When deeper networks are about to converge, a degradation problem arises: as the network depth increases, the accuracy saturates and then decreases rapidly. To solve this issue, He et al. [5] proposed a residual learning framework named ResNet, which uses a shortcut-connection architecture.

¹ http://image-net.org/challenges/LSVRC/2014/

Although deep learning has been successful for a variety of tasks, to the best of our knowledge, very few studies in the literature [2, 25, 26, 15, 27] have tackled the imbalanced classification problem by utilizing deep learning. Most of these methods can be considered natural extensions of traditional algorithms for imbalanced data classification.

2.2. Data augmentation

Large-scale labeled image training datasets are becoming increasingly important with the recent rise in the capacity of deep learning methods. However, such datasets are not always easy to obtain. In this case, data augmentation techniques are often employed to enlarge the training datasets. Existing data augmentation techniques can be grouped into two main categories [28]. The first category consists of geometric-transformation-based methods, and the second includes methods that synthesize images with certain algorithms to balance the datasets.

Geometric transformation focuses on generating image data through label-preserving transformations such as affine, cropping, and mirror operations. It greatly improves training performance on certain datasets with inadequate data. However, it contributes little to supplementing the data manifold, as image-level transformation through depth and scale does not extend the data in the true sense.

In traditional image synthesis techniques, minority instances are generated by algorithms such as oversampling to balance the dataset. The simplest approach to oversampling is to randomly replicate minority instances to increase their population. The benefit of replication-based random oversampling is that it multiplies the number of errors counted for minority instances. However, it tends to overfit, because the duplicated data bring no new information and give variables the appearance of low variance. To address this, Chawla et al. [29] proposed the synthetic minority oversampling technique (SMOTE) to generate new non-replicated minority examples. Several improved variants of SMOTE are presented in [30, 31, 32]. However, these methods can still lead to overfitting, because they may synthesize noisy and borderline instances, which leaves their decision regions error-prone.

Recently, GANs have been used to generate synthetic images such as house numbers [33], hand-drawn sketches [34], bedrooms [35], and a variety of other image categories [36, 37], owing to the relative sharpness of the samples generated by these models compared to other approaches. The GAN was first proposed by Goodfellow et al. [1] to generate image samples by training a generative model and a discriminative model simultaneously with backpropagation. Then, Radford et al. presented DCGANs [35] to improve the stability of training, taking the discriminative model as a robust feature extractor. Salimans et al. [36] proposed improved techniques for training GANs and achieved a state-of-the-art result in semi-supervised classification. Arjovsky et al. [38] defined a new form of GAN called the Wasserstein GAN and provided a theoretical analysis of how the earth mover's distance behaves in comparison to previously popular probability distances. However, learning from training data enlarged by GANs may not achieve the desired
performance, owing to a gap between the synthetic and real image distributions for rare classes. To reduce the gap, Zheng et al. [39] proposed a semi-supervised method based on GANs for person re-identification. The synthetic samples generated in their work are unlabeled, whereas the synthetic samples generated in our scheme are labeled and refined by a supervised classifier trained on the original training data.

Figure 1: MIO-TCD classification challenge dataset acquired from visual traffic surveillance sensors.

2.3. Ensemble learning

Ensemble learning is also widely studied in the field of machine learning. Its basic idea is to construct an ensemble model by combining the predictions of a set of weak models. For the classification task, multiple weak classifiers are learned from the same original dataset, and these weak classifiers are then combined to form an ensemble model that is used to classify unseen data. A single model may achieve high classification performance, but it relies on one fixed set of parameters, which introduces bias. This bias can be reduced through ensemble learning, a comprehensive review of which can be found in [40].

The generalization ability of an ensemble is usually much stronger than that of its base learners. Ensemble learning can boost weak learners into strong learners that make very accurate predictions, which makes it appealing for machine learning. An ensemble is essentially a supervised learning algorithm, because it must be trained before it is used to make predictions for unknown data. The performance of ensemble learning depends on the precision of the constituent classifiers. Although most theoretical analyses consider weak learners, the base learners used in practice are not necessarily weak, as using not-so-weak base learners often results in better prediction performance.
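
As an illustration of how base classifiers can be combined, the following minimal Python sketch implements plain majority voting over predicted labels. It is not taken from the authors' code: the function name `majority_vote` and the optional `tie_priority` argument are assumptions introduced here, and the tie-breaking rule is only one possible choice.

```python
from collections import Counter


def majority_vote(predictions, tie_priority=None):
    """Return the label predicted by the largest number of base classifiers.

    predictions: one predicted label per base classifier.
    tie_priority: optional dict mapping label -> rank (hypothetical helper);
        among tied labels, the one with the lowest rank wins.
    """
    counts = Counter(predictions)
    top = max(counts.values())
    tied = [label for label, n in counts.items() if n == top]
    if tie_priority is None:
        # Deterministic fallback: smallest label among the tied ones.
        return min(tied)
    return min(tied, key=lambda label: tie_priority.get(label, float("inf")))


# Three base classifiers vote on one image.
winner = majority_vote(["car", "bus", "car"])
# Two classifiers disagree; the tie is resolved by the supplied ranking.
tie_winner = majority_vote(["car", "bicycle"], {"bicycle": 0, "car": 1})
```

Passing a ranking that puts rare classes first is one way to resolve ties in favor of minority categories.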

3. Proposed method

In this section, we describe the pipeline of our proposed method. To tackle the imbalance problem, we focus on integrating deep neural networks with data augmentation based on GANs. As shown in Fig. 2, the proposed approach consists of three stages. In the first stage, we train several GANs on the original dataset to generate adversarial samples for the rare classes. In the second stage, an ensemble of CNN models with different architectures is trained on the original imbalanced dataset, and a sample selection step is then performed to filter out low-quality adversarial samples. In the final stage, the ensemble model is refined on the augmented dataset obtained by adding the selected adversarial samples. In the following subsections, we describe the details of the ensemble model training procedure, the adversarial sample generator, the adversarial sample selector, and the final ensemble model refining step.

Figure 2: Pipeline of our proposed method. In the first step, a GAN model is used to generate adversarial samples of rare classes. Then, we train an ensemble model on the original dataset to select a subset of high-quality adversarial samples. Finally, the original dataset augmented with these selected adversarial samples is utilized to refine the ensemble model.

3.1. Ensemble model learning

To reduce the bias caused by a single model in vehicle type classification, we propose a deep CNN ensemble model. As mentioned above, the proposed ensemble model contains multiple CNNs, which are trained independently on the training dataset, all starting from a good initialization (pre-trained on ImageNet). Each single model in the ensemble produces its own prediction for a given image, so the outputs of the single models must be combined into a single output. Two voting schemes are considered in this study. The first is maximum majority voting with one-hot encoding, which takes the majority vote of the one-hot encoded predictions of the single models. In practice, we found that when two candidate classes receive the same number of votes, it is better to choose the rarer class. The second is maximum majority voting by averaging the predicted probabilities, in which the ensemble model is treated as a uniformly weighted mixture model and the final prediction is computed by averaging the predicted probabilities.

Formally, let C = {1, ..., K} be the predefined classes of vehicle types, where K is the number of vehicle types. The probability for a testing sample x to be classified as class y can be formulated as

$$p(y \mid x) = \frac{1}{M} \sum_{m=1}^{M} p_{w_m}(y \mid x, w_m), \tag{1}$$
where $w_m$ are the parameters of the m-th model, and M is the number of single models constructing the ensemble. Then, the predicted label of the ensemble is given by maximum majority voting:

$$k = \arg\max_{y \in C} p(y \mid x). \tag{2}$$

Comparisons of the two maximum majority voting schemes above are presented in Section 4 of this paper.

3.1.1. Label smoothing regularization

Label smoothing was recently rediscovered by Szegedy et al. [41] in deep learning. During training, the non-ground-truth classes are assigned small values instead of 0, which discourages the network from being tuned too closely toward the ground truth and thus reduces the chance of overfitting. Label smoothing is used to regularize the classifier layer by estimating the marginalized effect of label-dropout during training.

For each training sample x, let $p(k \mid x) \in [0, 1]$ denote the predicted probability of the input belonging to class k, which can be obtained by CNNs. Let $q(k \mid x)$ be the ground truth distribution. Then, the cross-entropy loss for the training sample x can be formulated as:

$$L(x) = -\sum_{k=1}^{K} \log\big(p(k \mid x)\big)\, q(k \mid x). \tag{3}$$

Consider the case of a single ground-truth label y, such that $q(k \mid x)$ can be defined as:

$$q(k \mid x) = \begin{cases} 0 & \text{if } k \neq y \\ 1 & \text{otherwise.} \end{cases} \tag{4}$$

In this case, minimizing the cross-entropy loss is equivalent to maximizing the log-likelihood of the correct label. To avoid over-fitting, label smoothing regularization is introduced to encourage the network not to be too confident towards the ground truth. For each training sample x with ground-truth label y, the label distribution $q'(k \mid x)$ is rewritten as:

$$q'(k \mid x) = \begin{cases} \dfrac{\epsilon}{K} & \text{if } k \neq y \\ 1 - \epsilon + \dfrac{\epsilon}{K} & \text{otherwise,} \end{cases} \tag{5}$$

where $\epsilon \in [0, 1]$ is a hyper-parameter of the CNNs. If $\epsilon = 0$, then Equation 5 reduces to Equation 4. Now, the cross-entropy loss for label smoothing regularization can be rewritten as:

$$L'(x) = -\sum_{k=1}^{K} \log\big(p(k \mid x)\big)\, q'(k \mid x). \tag{6}$$

In this work, the hyper-parameter $\epsilon$ is set by cross-validation, with a value of 0.1 for all of the networks.

3.1.2. Weighted loss function

To compensate for the class-imbalance problem, we devised a weighted softmax cross-entropy loss function. Given the estimated probability $p_l(x)$ for an image x to belong to the class l and the ground truth probability $g_l(x)$, the weighted softmax cross-entropy loss function is

$$L = -\sum_{l} w_l\, g_l(x) \log\big(p_l(x)\big), \tag{7}$$

where $w_l$ denotes the weight related to class l. In this study, the weight $w_l$ is computed from the number of training samples $s_l$ of class l as

$$w_l = \frac{\mathrm{median}(s)}{s_l}. \tag{8}$$
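
The label smoothing targets of Equation 5 and the class weights of Equations 7 and 8 can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names and the toy class counts are hypothetical, and feeding the smoothed targets into the weighted loss as $g_l(x)$ is an assumption about how the two techniques are combined.

```python
import math
from statistics import median


def class_weights(counts):
    """Equation 8: w_l = median(s) / s_l for per-class sample counts s."""
    med = median(counts)
    return [med / c for c in counts]


def smoothed_targets(y, num_classes, eps=0.1):
    """Equation 5: eps/K for k != y, and 1 - eps + eps/K for k == y."""
    off = eps / num_classes
    return [1.0 - eps + off if k == y else off for k in range(num_classes)]


def weighted_cross_entropy(probs, targets, weights):
    """Equation 7: L = -sum_l w_l * g_l(x) * log(p_l(x))."""
    return -sum(w * g * math.log(p)
                for w, g, p in zip(weights, targets, probs))


# Toy class counts (hypothetical, not the MIO-TCD statistics).
counts = [1000, 100, 10]
weights = class_weights(counts)  # the median count is 100
# Assumption: the smoothed targets play the role of g_l(x) in Equation 7.
targets = smoothed_targets(y=2, num_classes=3, eps=0.1)
loss = weighted_cross_entropy([0.2, 0.3, 0.5], targets, weights)
```

With these toy counts, the rarest class receives a weight 100 times larger than the most frequent one, so an error on a rare-class sample contributes far more to the loss.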

Here, $s = [s_1, s_2, \cdots, s_N]$ contains the number of training samples for each category. In this way, our loss function compensates for the class imbalance problem by putting higher weights on training samples from rare classes than on those from the dominant classes.

3.2. Adversarial sample generator

3.2.1. Generative adversarial network (GAN)

The GAN is implemented by a system of two sub-networks: a generator and a discriminator. These two sub-networks compete in a zero-sum game framework. The generator produces synthetic data from random noise given as input, and the discriminator tries to discriminate the synthetic data from the real data. The game between the generator G and the discriminator D can be formulated as a minimax objective:

$$\min_{G} \max_{D} \; \mathbb{E}_{x \in P_r}[\log(D(x))] + \mathbb{E}_{\tilde{x} \in P_g}[\log(1 - D(\tilde{x}))], \tag{9}$$

where $P_r$ is the real data distribution and $P_g$ is the model distribution. $P_g$ is implicitly defined by $\tilde{x} = G(z)$ and $z \sim p(z)$, where z is the input to the generator, randomly sampled from a uniform distribution or a Gaussian distribution.

The authors of [42] argue that the divergence issue of GANs is caused by minimizing a non-continuous function with respect to the generator's parameters. Instead, they proposed to use the earth mover's distance $W(q, p)$ (also known as Wasserstein-1) to measure the disagreement between the real and generated data distributions. $W(q, p)$ can be considered as the minimum cost of transforming the distribution q into the distribution p. They named the proposed GAN architecture WGAN, whose objective is:

$$\min_{G} \max_{D \in \mathcal{D}} \; \mathbb{E}_{x \in P_r}[D(x)] - \mathbb{E}_{\tilde{x} \in P_g}[D(\tilde{x})], \tag{10}$$

where $\mathcal{D}$ is the set of 1-Lipschitz functions. In this case, under an optimal discriminator, minimizing Equation 10 with respect to the generator parameters also minimizes the distance $W(P_r, P_g)$.

The WGAN enforces the Lipschitz constraint by clipping the weights to the range $[-c, c]$. As this weight clipping procedure can lead to low-quality generated samples or a failure to converge under certain settings, the authors of [43] proposed an improved version of WGAN that penalizes the norm of the gradient of the discriminator with respect to its input instead of clipping the weights. The new objective function is

$$L = \mathbb{E}_{\tilde{x} \in P_g}[D(\tilde{x})] - \mathbb{E}_{x \in P_r}[D(x)] + \lambda\, \mathbb{E}_{\hat{x} \in P_{\hat{x}}}\big[(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1)^2\big], \tag{11}$$

where $P_{\hat{x}}$ is implicitly defined as a uniform distribution along the straight line between pairs of points sampled from the data distribution $P_r$ and the generator distribution $P_g$. In this work, we use this new objective function to train our synthetic sample generators.

3.2.2. GAN architecture

For the generator, we start with a 128-dim random noise vector and convert it to a tensor of size 4 × 4 × 512 by a linear transformation and a reshape operation. After that, four residual convolutional blocks and a final convolutional layer are used to enlarge the tensor to a size of 64 × 64 × 3. Each residual convolutional block has an identity mapping architecture similar to the one discussed in [4], which adds the outputs of a weighted path and a shortcut-connection path. The weighted path has the structure “BN-ReLU-UP-Conv1-BN-ReLU-Conv2”, where “BN” is batch normalization, “ReLU” is the rectified linear unit activation function, “UP” is an upsampling of the feature maps by a factor of 2 using nearest-neighbor interpolation, and “Conv1” and “Conv2” are two convolutional layers with a kernel size of 3 × 3. The shortcut-connection path first upsamples the feature maps by a factor of two and then performs a 1 × 1 convolution. The number of output feature channels of each residual convolutional block is half the number of its input channels. Before the final convolutional layer, we add a “BN” and a “ReLU”. The final convolutional layer has a kernel size of 3 × 3 and three output channels corresponding to the “RGB” color channels.

The input to the discriminator consists of the generated images and the real images. A convolutional layer, four residual convolutional blocks, and a fully connected layer are used to classify whether an image is generated or not. The first convolutional layer has a kernel size of 3 × 3 and 64 output channels. The weighted path in the residual blocks has the structure “BN-ReLU-Conv1-BN-ReLU-Conv2-POOL”. The kernel size of “Conv1” and “Conv2” is 3 × 3, and “POOL” is an average pooling with a stride of 2. The shortcut-connection path contains a convolutional layer with a kernel size of 1 × 1 and an average pooling with a stride of 2. The numbers of output channels of the four residual blocks are 128, 256, 512, and 512, respectively.

After defining the network architecture, we trained individual synthetic sample generators separately for each category by using the loss function in Equation 11. Finally, we use bilinear interpolation to resize all the generated images from 64 × 64 to 256 × 256.

3.3. Adversarial sample selector

The input to the generator of the GAN is a random vector drawn from a Gaussian distribution, so it is possible to generate an infinite number of adversarial samples. Moreover, because we trained the generator for each rare category separately, each generated sample has the same category label as its generator. However, in reality, the number of images available for training the generators of rare classes is very limited, which is insufficient to obtain a well-trained generator. As a consequence, the generator produces many poor samples that look like random noise.

It is always preferable to train the CNN models on high-quality samples rather than on random noise. To deal with the issue of poor synthetic samples, we performed a selection procedure that keeps only the high-quality generated samples by using the ensemble model trained on the original dataset. For each rare category, we first generate 20,000 synthetic samples from random inputs, and then compute the probability of each sample by using Equation 1. Finally, we keep only the 5,000 synthetic samples with the highest probabilities for each category. Some of the selected and unselected samples are shown in Figure 3.
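
The selection step can be sketched as follows, assuming that each synthetic sample is scored by the ensemble probability of Equation 1 evaluated at the class of its generator. The function names and the toy outputs are hypothetical, and the counts are scaled down from the 20,000 generated and 5,000 kept samples used in the paper.

```python
def ensemble_probability(per_model_probs):
    """Equation 1: average the class probabilities of the M single models."""
    num_models = len(per_model_probs)
    num_classes = len(per_model_probs[0])
    return [sum(p[k] for p in per_model_probs) / num_models
            for k in range(num_classes)]


def select_top_samples(scores, keep):
    """Return the indices of the `keep` highest-scoring samples, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:keep]


# Three synthetic samples of rare class 1, each scored by a two-model ensemble.
rare_class = 1
outputs = [
    [[0.9, 0.1], [0.7, 0.3]],  # sample 0: both models favor class 0
    [[0.2, 0.8], [0.4, 0.6]],  # sample 1
    [[0.1, 0.9], [0.1, 0.9]],  # sample 2
]
# Assumption: the score is the averaged probability of the generator's class.
scores = [ensemble_probability(o)[rare_class] for o in outputs]
selected = select_top_samples(scores, keep=2)
```

Sample 0, which the ensemble assigns mostly to the other class, is the one filtered out in this toy run.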

Figure 3: Some of the synthetic samples of the categories bicycle, motorcycle, non-motorized vehicle and single-unit truck. The left five columns show selected high-quality generated adversarial samples, and the right five columns show filtered-out low-quality samples.

3.4. Ensemble model refining step

As discussed in the section on related work, upsampling the rare classes by duplicating their samples is often used to deal with the imbalanced data issue. Similarly, we use the GAN to generate new synthetic samples for the rare classes to balance the training dataset. After selecting the high-quality generated images, our final step is to refine the previously trained CNN models to obtain a new ensemble model. In this work, we simply retrained all of these CNN models on a mixture of the generated images and the original training set. During training, we also used label smoothing regularization and the weighted loss function.

4. Experiments and results

4.1. The MIO-TCD classification challenge dataset

To demonstrate the effectiveness of the proposed scheme, we conducted experiments on a large benchmark traffic camera dataset, the MIO-TCD classification challenge dataset². This classification dataset consists of 648,959 vehicle images acquired at different times of the day and different periods of the year by traffic cameras deployed all over Canada and the United States. These images were selected to cover a wide range of challenges and are representative of typical visual data captured in urban traffic scenarios.

The classification challenge dataset is divided into 11 categories: Articulated truck, Background, Bicycle, Bus, Car, Motorcycle, Non-motorized vehicle, Pedestrian, Pickup truck, Single-unit truck and Work van. The training and testing sets contain 519,164 and 129,795 images, respectively, and both have a highly imbalanced data distribution. The number of images in each category is given in Figure 4 and ranges between 1,751 and 260,518. Background and Car are the dominant classes, taking up to 81% of the training set. In contrast, Bicycle, Motorcycle, Non-motorized vehicle and Single-unit truck are represented by only a few samples and are considered rare classes in this dataset.

² http://tcd.miovision.com/challenge/dataset/

Figure 4: Class distribution of the MIO-TCD classification challenge dataset.

4.2. Baselines

To indicate the effect of the proposed scheme, five state-of-the-art deep learning methods, namely ResNet-50, ResNet-101, ResNet-152, Inception V3 [5], and Inception V4 [44], are used as baselines. The ResNet-50 trained with label smoothing regularization is denoted as ResNet-50-LS, that trained with the proposed weighted loss function is denoted as ResNet-50-W, and that trained with our synthetic data generated by GANs is denoted as ResNet-50-GAN. A similar naming convention is used for the other single models. We name the proposed GAN-based deep ensemble model GEM, which is an ensemble of the mentioned 20 models. We tried two different voting schemes,
ACCEPTED MANUSCRIPT

Model size 307.4 MB 536.1 MB 724.6 MB 108.8 MB 184.4 MB 237.2 MB

Batch size 128 64 32 64 32 32

# of steps 80,000 160,000 320,000 160,000 320,000 320,000

• Mean Recall • Mean Precision

mPre = mean(Prei ) • Cohen Kappa Score

Table 1: The GPU memory cost, batch size, and the number of training steps for different models.

360

365

GEM-OE and GEM-AP, which refer to the proposed ensemble deep learning method with maximum majority voting by one-hot encoding and by averaging the predicted probabilities, respectively. 4.3. Implementation details 385 We tested all of methods in the experiments with a TITAN X Pascal GPU and an Intel i7 core on the deep learning framework TensorFlow 1.3.0. All of the pre-trained networks were downloaded from the TensorFlow-Slim image classification model library 3 , and started with a base learning rate of η = 0.001. The390 learning rate decay type was set to exponential, with a value of 0.94 every two epochs. The single models trained with GANs or the modified loss function are not fully fine-tuned. The training speed for the single models mentioned above is approximately 0.8 s/step. The number of single models constructing395 the deep ensemble is M = 20. More efficiency comparisons and implementation details are presented in Table 1. The code for our models will be available at github.4

ED

PT

CE

380

Network ResNet-50 ResNet-101 ResNet-152 Inception V3 Inception V4 Inception-ResNet-V2

4.4. Evaluation criterion

The prime goal of this study is to classify vehicle types in images acquired from visual traffic surveillance sensors. Let TP denote true positives, TN true negatives, FP false positives, and FN false negatives; the subscript i refers to the i-th category. To objectively evaluate the performance of the proposed method and the baselines, we use the following six metrics.

• Precision of each category

$$\mathrm{Pre}_i = \frac{TP_i}{TP_i + FP_i}$$

• Recall of each category

$$\mathrm{Rec}_i = \frac{TP_i}{TP_i + FN_i}$$

• Accuracy

$$\mathrm{Acc} = \frac{TP}{\#\,\text{of testing images}}$$

• Mean recall

$$\mathrm{mRe} = \mathrm{mean}(\mathrm{Rec}_i)$$

• Mean precision, defined analogously as the mean of $\mathrm{Pre}_i$ over all categories

• Cohen Kappa Score

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

where $p_o$ is the empirical probability of agreement on the label assigned to any sample (the observed agreement ratio) and $p_e$ is the expected agreement when both annotators assign labels randomly [38].

3 https://github.com/tensorflow/models/tree/master/research/slim#Pretrained
4 https://github.com/weiliuxm/miotcd_classification

4.5. Results

4.5.1. Overall performance

Table 2 shows the overall performance of the proposed scheme and the baseline algorithms. It can be observed that the proposed deep learning ensemble method GEM-AP performs better than the individual networks. GEM-AP achieves a classification accuracy of 97.80% (listed as Precision in Table 2), a mean recall of 90.74%, a mean precision of 93.55%, and a Cohen Kappa Score of 96.57%. Table 2 further shows that the proposed vehicle type classification method improves the mean precision (93.55%, against at most 89.98% for any single model) while maintaining high overall accuracy. Additionally, GEM-AP performs slightly better than GEM-OE, which suggests that fusing the models by averaging the predicted probabilities is preferable to majority voting on one-hot predictions. Moreover, the single models trained with semi-supervised GANs or the modified loss function have similar or better performance, compared to the baseline single models, in terms of mean recall and mean precision.
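The difference between the two fusion rules compared in the overall results, GEM-OE and GEM-AP, can be illustrated with a small sketch. It assumes that each of the M single models outputs a probability vector over the classes for a given image. The function names `vote_one_hot` and `vote_average` and the toy probabilities are hypothetical; they are not taken from the released code and only mirror the verbal description of the two rules.

```python
def argmax(values):
    """Index of the largest value (the first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def vote_one_hot(prob_vectors):
    """GEM-OE-style fusion: every model casts one vote for its top class."""
    num_classes = len(prob_vectors[0])
    votes = [0] * num_classes
    for probs in prob_vectors:
        votes[argmax(probs)] += 1
    return argmax(votes)


def vote_average(prob_vectors):
    """GEM-AP-style fusion: average the probabilities, then take the argmax."""
    num_classes = len(prob_vectors[0])
    num_models = len(prob_vectors)
    mean_probs = [
        sum(probs[c] for probs in prob_vectors) / num_models
        for c in range(num_classes)
    ]
    return argmax(mean_probs)


# Toy example with three models and three classes.
predictions = [
    [0.51, 0.49, 0.00],  # weakly prefers class 0
    [0.52, 0.48, 0.00],  # weakly prefers class 0
    [0.05, 0.95, 0.00],  # strongly prefers class 1
]
print(vote_one_hot(predictions))  # hard votes: two for class 0, one for class 1
print(vote_average(predictions))  # mean probabilities favour class 1
```

In this toy case the two rules disagree: one-hot voting discards each model's confidence, whereas averaging lets one confident model outweigh two marginal ones. This is one possible reading of why averaging may help, not a result established by the experiments.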

4.5.2. Ablation study

To evaluate how much the weighted loss function, label smoothing regularization, and the proposed GAN pipeline each influence the final performance, we averaged every evaluation criterion over the five architectures (ResNet-50, ResNet-101, ResNet-152, Inception V3, and Inception V4) for each of the training methods listed in Table 2, and report the results in Table 3. As can be seen, the weighted loss function improves the "Mean Recall" while decreasing the "Mean Precision", and label smoothing regularization increases the "Precision" and the "Cohen Kappa Score". The proposed GAN pipeline improves all of the evaluation criteria over the baseline and outperforms the "weighted loss function" and "label smoothing regularization".

4.5.3. Precision and recall

The precisions of the various methods for each category on the MIO-TCD Classification Challenge Dataset are presented in Table 4. It can be observed that the proposed ensemble schemes achieve precisions above 90% for rare classes such as Bicycle, Motorcycle, and Non-motorized vehicle. Moreover, for Non-motorized vehicle and Single-unit truck, GEM-AP achieves precisions of 91.22% and 83.18%, which are more than 5 and 3 percentage points higher, respectively, than those of any single model in the ensemble. Table 5 presents the recalls of all the methods. As seen in Table 5, the proposed ensemble methods GEM-OE and GEM-AP achieve the best or near-best recall rates for all categories, compared with the baseline algorithms.

| Method | Precision | Mean Recall | Mean Precision | Cohen Kappa Score |
|---|---|---|---|---|
| ResNet-50 | 0.9538 | 0.8586 | 0.8364 | 0.9283 |
| ResNet-50-W | 0.9623 | 0.8501 | 0.8775 | 0.9412 |
| ResNet-50-LS | 0.9620 | 0.8417 | 0.8363 | 0.9408 |
| ResNet-50-GAN | 0.9592 | 0.8539 | 0.8551 | 0.9368 |
| ResNet-101 | 0.9648 | 0.8708 | 0.8881 | 0.9454 |
| ResNet-101-W | 0.9635 | 0.8634 | 0.8825 | 0.9434 |
| ResNet-101-LS | 0.9678 | 0.8622 | 0.8998 | 0.9498 |
| ResNet-101-GAN | 0.9652 | 0.8742 | 0.8880 | 0.9461 |
| ResNet-152 | 0.9625 | 0.8767 | 0.8847 | 0.9421 |
| ResNet-152-W | 0.9577 | 0.8804 | 0.8325 | 0.9347 |
| ResNet-152-LS | 0.9614 | 0.8667 | 0.8685 | 0.9404 |
| ResNet-152-GAN | 0.9657 | 0.8724 | 0.8913 | 0.9468 |
| Inception V3 | 0.9637 | 0.8573 | 0.8993 | 0.9437 |
| Inception V3-W | 0.9619 | 0.8923 | 0.8314 | 0.9409 |
| Inception V3-LS | 0.9651 | 0.8842 | 0.8574 | 0.9459 |
| Inception V3-GAN | 0.9679 | 0.8756 | 0.8830 | 0.9502 |
| Inception V4 | 0.9644 | 0.8792 | 0.8844 | 0.9448 |
| Inception V4-W | 0.9640 | 0.8929 | 0.8583 | 0.9444 |
| Inception V4-LS | 0.9723 | 0.8824 | 0.8971 | 0.9568 |
| Inception V4-GAN | 0.9625 | 0.8911 | 0.8865 | 0.9545 |
| GEM-OE (ours) | 0.9770 | 0.9043 | 0.9326 | 0.9642 |
| GEM-AP (ours) | 0.9780 | 0.9074 | 0.9355 | 0.9657 |

Table 2: Overall results on the MIO-TCD Dataset. GEM-OE and GEM-AP refer to the proposed ensemble deep learning method with maximum majority voting by one-hot encoding and by averaging the predicted probabilities, respectively.

| Method | Precision | Mean Recall | Mean Precision | Cohen Kappa Score |
|---|---|---|---|---|
| Baseline | 0.9618 | 0.8685 | 0.8786 | 0.9409 |
| Weighted loss function | 0.9619 | 0.8758 | 0.8564 | 0.9411 |
| Label smoothing regularization | 0.9657 | 0.8674 | 0.8718 | 0.9467 |
| GAN | 0.9641 | 0.8734 | 0.8808 | 0.9469 |

Table 3: The average evaluation criteria of the different models trained with the baseline method, the weighted loss function, label smoothing regularization, and the proposed GAN pipeline.
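The criteria reported in Tables 2 and 3 can all be derived from a single confusion matrix. The following sketch shows one way to do so, assuming a matrix whose rows are true classes and whose columns are predicted classes. The function name `evaluate` and the three-class toy matrix are hypothetical, and the expected agreement is computed from the row and column marginals, as is usual for Cohen's kappa; the text does not spell out that detail.

```python
def evaluate(confusion):
    """Compute the evaluation criteria from a square confusion matrix.

    confusion[i][j] is the number of samples of true class i that were
    predicted as class j (an assumed layout).
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    row_sums = [sum(row) for row in confusion]                        # TP_i + FN_i
    col_sums = [sum(row[j] for row in confusion) for j in range(n)]  # TP_i + FP_i
    diag = [confusion[i][i] for i in range(n)]                        # TP_i

    precision = [diag[i] / col_sums[i] if col_sums[i] else 0.0 for i in range(n)]
    recall = [diag[i] / row_sums[i] if row_sums[i] else 0.0 for i in range(n)]
    accuracy = sum(diag) / total

    # Cohen's kappa: observed agreement against chance agreement.
    p_o = accuracy
    p_e = sum(row_sums[i] * col_sums[i] for i in range(n)) / (total * total)
    kappa = (p_o - p_e) / (1 - p_e)

    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "mean_precision": sum(precision) / n,
        "mean_recall": sum(recall) / n,
        "kappa": kappa,
    }


# Hypothetical 3-class confusion matrix with 25 test samples.
toy = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 1, 4],
]
scores = evaluate(toy)
print(scores["accuracy"], scores["mean_recall"], scores["kappa"])
```

Because the mean recall and mean precision weight every class equally, a rare class with poor recall lowers them far more than it lowers the accuracy, which is why they are informative on an imbalanced dataset.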

4.5.4. Confusion matrix

A confusion matrix of the proposed scheme on the unseen test set of the MIO-TCD challenge is shown in Fig. 5. Bus, Car, and Background have classification precision rates of approximately 0.99, as Car and Background are dominant classes and Bus has a small intra-class difference. Because Work van and Car are similar in appearance and Car is a dominant class in the training set, Work van is prone to being misclassified as Car. Single-unit truck is prone to being misclassified as Articulated truck, owing to their similarity in appearance and the highly imbalanced distribution of the training data. Compared with the other classes, Non-motorized vehicle has a low classification precision rate of 0.62, and is apt to be misclassified as Articulated truck and Single-unit truck.

| Model | AT | Bicycle | Bus | Car | MC | NV | Pedestrian | PT | SUT | WV | BG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 0.7782 | 0.8628 | 0.8573 | 0.9736 | 0.8779 | 0.5272 | 0.8986 | 0.8832 | 0.6642 | 0.8886 | 0.9885 |
| ResNet-50-W | 0.8115 | 0.8270 | 0.9293 | 0.9757 | 0.9264 | 0.7200 | 0.9377 | 0.8886 | 0.7513 | 0.8925 | 0.9931 |
| ResNet-50-LS | 0.7405 | 0.8401 | 0.9307 | 0.9720 | 0.8533 | 0.7120 | 0.9272 | 0.8801 | 0.7419 | 0.8972 | 0.9929 |
| ResNet-50-GAN | 0.7610 | 0.8162 | 0.9490 | 0.9798 | 0.8778 | 0.6299 | 0.9264 | 0.8651 | 0.7514 | 0.8558 | 0.9940 |
| ResNet-101 | 0.8768 | 0.8483 | 0.9518 | 0.9840 | 0.9775 | 0.7010 | 0.9384 | 0.8582 | 0.7178 | 0.9216 | 0.9937 |
| ResNet-101-W | 0.8308 | 0.8499 | 0.9533 | 0.9835 | 0.9287 | 0.7511 | 0.9378 | 0.8568 | 0.7239 | 0.8969 | 0.9952 |
| ResNet-101-LS | 0.8210 | 0.8947 | 0.9473 | 0.9785 | 0.9547 | 0.7910 | 0.9253 | 0.9044 | 0.7457 | 0.9391 | 0.9957 |
| ResNet-101-GAN | 0.8225 | 0.8305 | 0.9447 | 0.9842 | 0.9280 | 0.7468 | 0.9457 | 0.8683 | 0.7889 | 0.9150 | 0.9931 |
| ResNet-152 | 0.8449 | 0.8542 | 0.9424 | 0.9881 | 0.9140 | 0.8032 | 0.9426 | 0.8354 | 0.7141 | 0.8981 | 0.9946 |
| ResNet-152-W | 0.7515 | 0.8514 | 0.8639 | 0.9831 | 0.8775 | 0.4865 | 0.8745 | 0.8531 | 0.6975 | 0.7667 | 0.9960 |
| ResNet-152-LS | 0.8221 | 0.7620 | 0.9752 | 0.9851 | 0.9071 | 0.7443 | 0.9523 | 0.8711 | 0.7532 | 0.7932 | 0.9882 |
| ResNet-152-GAN | 0.8001 | 0.8724 | 0.9561 | 0.9841 | 0.9360 | 0.7883 | 0.9375 | 0.8731 | 0.7413 | 0.9204 | 0.9953 |
| Inception V3 | 0.8840 | 0.9027 | 0.9225 | 0.9823 | 0.8874 | 0.8609 | 0.9573 | 0.8510 | 0.7135 | 0.9360 | 0.9943 |
| Inception V3-W | 0.7858 | 0.8211 | 0.9022 | 0.9818 | 0.9016 | 0.4483 | 0.8854 | 0.9090 | 0.6931 | 0.8195 | 0.9981 |
| Inception V3-LS | 0.7973 | 0.7659 | 0.9752 | 0.9831 | 0.9100 | 0.6333 | 0.9376 | 0.8972 | 0.6815 | 0.8526 | 0.9973 |
| Inception V3-GAN | 0.8898 | 0.9000 | 0.9175 | 0.9849 | 0.9458 | 0.7005 | 0.9051 | 0.8962 | 0.7418 | 0.8358 | 0.9956 |
| Inception V4 | 0.8553 | 0.8189 | 0.9057 | 0.9828 | 0.8540 | 0.7684 | 0.9638 | 0.8624 | 0.7788 | 0.9431 | 0.9947 |
| Inception V4-W | 0.7496 | 0.8325 | 0.907 | 0.9840 | 0.8980 | 0.5951 | 0.9394 | 0.8944 | 0.7962 | 0.8469 | 0.9981 |
| Inception V4-LS | 0.8681 | 0.8126 | 0.9483 | 0.9825 | 0.9054 | 0.7838 | 0.9692 | 0.9324 | 0.7757 | 0.8969 | 0.9928 |
| Inception V4-GAN | 0.8527 | 0.8426 | 0.9586 | 0.9854 | 0.8855 | 0.7405 | 0.9477 | 0.9060 | 0.7267 | 0.9103 | 0.9958 |
| GEM-OE (ours) | 0.8707 | 0.9072 | 0.9669 | 0.9860 | 0.9624 | 0.9072 | 0.9558 | 0.9227 | 0.8331 | 0.9503 | 0.9967 |
| GEM-AP (ours) | 0.8803 | 0.9121 | 0.9659 | 0.9862 | 0.9625 | 0.9122 | 0.9622 | 0.9280 | 0.8318 | 0.9530 | 0.9967 |

Table 4: Comparisons of precision for each category on the MIO-TCD Classification Challenge Dataset. AT denotes Articulated Truck, MC denotes Motorcycle, NV denotes Non-motorized Vehicle, PT denotes Pickup Truck, SUT denotes Single-unit Truck, WV denotes Work Van, and BG denotes Background. GEM-OE and GEM-AP refer to the proposed ensemble deep learning method with maximum majority voting by one-hot encoding and by averaging the predicted probabilities, respectively.

| Model | AT | Bicycle | Bus | Car | MC | NV | Pedestrian | PT | SUT | WV | BG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 0.8732 | 0.8371 | 0.9736 | 0.9675 | 0.9152 | 0.5525 | 0.8831 | 0.9078 | 0.7664 | 0.7936 | 0.9750 |
| ResNet-50-W | 0.8983 | 0.8704 | 0.9729 | 0.9768 | 0.8646 | 0.4110 | 0.8658 | 0.9004 | 0.7672 | 0.8361 | 0.9879 |
| ResNet-50-LS | 0.8914 | 0.8371 | 0.9628 | 0.9746 | 0.9051 | 0.3105 | 0.8377 | 0.8936 | 0.6984 | 0.7931 | 0.9864 |
| ResNet-50-GAN | 0.9130 | 0.8634 | 0.9663 | 0.9670 | 0.8707 | 0.4429 | 0.8684 | 0.9178 | 0.7391 | 0.8576 | 0.9872 |
| ResNet-101 | 0.8806 | 0.8914 | 0.9725 | 0.9699 | 0.8768 | 0.4977 | 0.8856 | 0.9421 | 0.8086 | 0.8642 | 0.9899 |
| ResNet-101-W | 0.9034 | 0.8827 | 0.9740 | 0.9697 | 0.8949 | 0.3858 | 0.8952 | 0.9347 | 0.8070 | 0.8621 | 0.9879 |
| ResNet-101-LS | 0.9150 | 0.8634 | 0.9752 | 0.9824 | 0.8949 | 0.4406 | 0.9099 | 0.9195 | 0.7742 | 0.8216 | 0.9878 |
| ResNet-101-GAN | 0.9258 | 0.8669 | 0.9798 | 0.9711 | 0.9111 | 0.5320 | 0.8792 | 0.9380 | 0.7531 | 0.8708 | 0.9887 |
| ResNet-152 | 0.8825 | 0.8827 | 0.9837 | 0.9626 | 0.9232 | 0.4566 | 0.8920 | 0.9544 | 0.8234 | 0.8951 | 0.9872 |
| ResNet-152-W | 0.9246 | 0.8827 | 0.9849 | 0.9602 | 0.8970 | 0.4521 | 0.9125 | 0.9234 | 0.8195 | 0.8902 | 0.9668 |
| ResNet-152-LS | 0.9358 | 0.8914 | 0.9589 | 0.9644 | 0.9273 | 0.4452 | 0.8173 | 0.9350 | 0.7703 | 0.8980 | 0.9896 |
| ResNet-152-GAN | 0.9219 | 0.8739 | 0.9802 | 0.9732 | 0.8869 | 0.4932 | 0.9016 | 0.9397 | 0.7992 | 0.8402 | 0.9866 |
| Inception V3 | 0.8338 | 0.8126 | 0.9779 | 0.9706 | 0.9394 | 0.4521 | 0.8888 | 0.9459 | 0.7977 | 0.8212 | 0.9906 |
| Inception V3-W | 0.8748 | 0.8844 | 0.9833 | 0.9793 | 0.9253 | 0.6233 | 0.9380 | 0.9194 | 0.8063 | 0.9166 | 0.9652 |
| Inception V3-LS | 0.9153 | 0.8651 | 0.9318 | 0.9772 | 0.9192 | 0.6347 | 0.8645 | 0.9287 | 0.8125 | 0.8955 | 0.9811 |
| Inception V3-GAN | 0.8114 | 0.8039 | 0.9837 | 0.9777 | 0.8808 | 0.5822 | 0.9450 | 0.9417 | 0.8148 | 0.9038 | 0.9869 |
| Inception V4 | 0.8570 | 0.8792 | 0.9794 | 0.9731 | 0.9333 | 0.6438 | 0.8173 | 0.9461 | 0.8086 | 0.8481 | 0.9849 |
| Inception V4-W | 0.9258 | 0.8792 | 0.9872 | 0.9761 | 0.9071 | 0.6142 | 0.9016 | 0.9327 | 0.7937 | 0.9339 | 0.9709 |
| Inception V4-LS | 0.8933 | 0.8809 | 0.9752 | 0.9856 | 0.9091 | 0.5959 | 0.8454 | 0.9294 | 0.7969 | 0.9050 | 0.9899 |
| Inception V4-GAN | 0.8550 | 0.8529 | 0.9690 | 0.9806 | 0.9374 | 0.6256 | 0.9029 | 0.9455 | 0.8453 | 0.9013 | 0.9870 |
| GEM-OE (ours) | 0.9370 | 0.9072 | 0.9868 | 0.9850 | 0.9313 | 0.6027 | 0.9252 | 0.9455 | 0.8266 | 0.9071 | 0.9926 |
| GEM-AP (ours) | 0.9324 | 0.9089 | 0.9891 | 0.9862 | 0.9333 | 0.6164 | 0.9259 | 0.9455 | 0.8383 | 0.9116 | 0.9932 |

Table 5: Comparisons of recall for each category on the MIO-TCD Classification Challenge Dataset. GEM-OE and GEM-AP refer to the proposed ensemble deep learning method with maximum majority voting by one-hot encoding and by averaging the predicted probabilities, respectively.

5. Conclusion

To correctly classify the type of vehicle in images acquired from visual traffic surveillance sensors, we proposed an image classification scheme based on ensemble deep learning and GANs. Experiments on the MIO-TCD classification challenge dataset demonstrate that the proposed ensemble method improves the overall performance for imbalanced vehicle type classification compared with the baseline algorithms. Nevertheless, the proposed method has some limitations. For instance, the only networks we selected are ResNets and Inceptions. Moreover, if label smoothing regularization, synthetic data generated by GANs, and weighted loss functions are integrated into a single model, it is difficult to train the model to

achieve good performance. Furthermore, our single models are not fully fine-tuned in this study, and the parameters of the single models trained with GANs and the modified loss function were set to almost the same values.

Therefore, as part of future research to achieve better performance, we plan to fully fine-tune our single models and research how to integrate a weighted loss function, label smoothing regularization, and GANs well into a single model.

Figure 5: Vehicle classification confusion matrix of the proposed scheme on the MIO-TCD Classification Challenge Dataset.

6. Acknowledgement

This work is supported by the National Natural Science Foundation of China (No. 61662024, No. 61572409, No. U1705286 & No. 61571188); the Natural Science Foundation of Jiangxi Province (20171BAB212013); Science and Technology Research Project of Jiangxi Provincial Education Department (GJJ170423); The Fujian Province 2011 Collaborative Innovation Center of TCM Health Management; Collaborative Innovation Center of Chinese Oolong Tea Industry Collaborative Innovation Center (2011) of Fujian Province; Fund for Integration of Cloud Computing and Big Data, Innovation of Science and Education; and the Open Fund of Hubei Province Key Laboratory (2016KLA03).

References

[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in neural information processing systems, 2014, pp. 2672–2680.
[2] Z.-H. Zhou, X.-Y. Liu, Training cost-sensitive neural networks with methods addressing the class imbalance problem, IEEE Transactions on Knowledge and Data Engineering 18 (1) (2006) 63–77.
[3] C. X. Ling, V. S. Sheng, Cost-sensitive learning, in: Encyclopedia of machine learning, Springer, 2011, pp. 231–235.
[4] K. He, X. Zhang, S. Ren, J. Sun, Identity mappings in deep residual networks, in: European Conference on Computer Vision, Springer, 2016, pp. 630–645.
[5] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2016, pp. 770–778.
[6] L. Yang, S. Yang, S. Li, R. Zhang, F. Liu, L. Jiao, Coupled compressed sensing inspired sparse spatial-spectral lssvm for hyperspectral image classification, Knowledge-Based Systems 79 (2015) 80–89.
[7] R. Shang, W. Wang, R. Stolkin, L. Jiao, Subspace learning-based graph regularized feature selection, Knowledge-Based Systems 112 (2016) 152–165.
[8] R. Shang, Z. Zhang, L. Jiao, W. Wang, S. Yang, Global discriminative-based nonnegative spectral clustering, Pattern Recognition 55 (2016) 172–182.
[9] R. Shang, W. Wang, R. Stolkin, L. Jiao, Non-negative spectral learning and sparse regression-based dual-graph regularized feature selection, IEEE transactions on cybernetics 48 (2) (2018) 793–806.
[10] W. Liu, R. Ji, S. Li, Towards 3d object detection with bimodal deep boltzmann machines over rgbd imagery, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3013–3021.
[11] W. Liu, S. Li, X. Lin, Y. Wu, R. Ji, Spectral–spatial co-clustering of hyperspectral image data based on bipartite graph, Multimedia Systems 22 (3) (2016) 355–366.
[12] W. Liu, S. Li, D. Cao, S. Su, R. Ji, Detection based object labeling of 3d point cloud for indoor scenes, Neurocomputing 174 (2016) 1101–1106.
[13] B. Zadrozny, J. Langford, N. Abe, Cost-sensitive learning by cost-proportionate example weighting, in: Data Mining, 2003. ICDM 2003. Third IEEE International Conference on, IEEE, 2003, pp. 435–442.
[14] C. Unsworth, G. Coghill, Excessive noise injection training of neural networks for markerless tracking in obscured and segmented environments, Neural computation 18 (9) (2006) 2122–2145.
[15] C. Huang, Y. Li, C. Change Loy, X. Tang, Learning deep representation for imbalanced classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5375–5384.
[16] C. Chen, A. Liaw, L. Breiman, Using random forest to learn imbalanced data, University of California, Berkeley 110.
[17] Y. Tang, Y.-Q. Zhang, N. V. Chawla, S. Krasser, Svms modeling for highly imbalanced classification, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39 (1) (2009) 281–288.
[18] W. Liu, M. Zhang, Z. Luo, Y. Cai, An ensemble deep learning method for vehicle type classification on visual traffic surveillance sensors, IEEE Access.
[19] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in neural information processing systems, 2012, pp. 1097–1105.
[20] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.
[21] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, L. Fei-Fei, Large-scale video classification with convolutional neural networks, in: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2014, pp. 1725–1732.
[22] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[23] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556.
[24] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[25] P. Jeatrakul, K. Wong, C. Fung, Classification of imbalanced data by combining the complementary neural network and smote algorithm, Neural Information Processing. Models and Applications (2010) 152–159.
[26] S. H. Khan, M. Bennamoun, F. Sohel, R. Togneri, Cost sensitive learning of deep feature representations from imbalanced data, arXiv preprint arXiv:1508.03422.
[27] Y. Yan, M. Chen, M.-L. Shyu, S.-C. Chen, Deep learning for imbalanced multimedia data classification, in: 2015 IEEE International Symposium on Multimedia (ISM), IEEE, 2015, pp. 483–488.
[28] X. Zhu, Y. Liu, Z. Qin, Data augmentation in classification using gan, arXiv preprint arXiv:1711.00648.
[29] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, Smote: synthetic minority over-sampling technique, Journal of artificial intelligence research 16 (2002) 321–357.
[30] L. Zhang, W. Wang, A re-sampling method for class imbalance learning with credit data, in: Information Technology, Computer Engineering and Management Sciences (ICM), 2011 International Conference on, Vol. 1, IEEE, 2011, pp. 393–397.
[31] H. Han, W.-Y. Wang, B.-H. Mao, Borderline-smote: a new over-sampling method in imbalanced data sets learning, Advances in intelligent computing (2005) 878–887.
[32] H. He, E. A. Garcia, Learning from imbalanced data, IEEE Transactions on knowledge and data engineering 21 (9) (2009) 1263–1284.
[33] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, P. Abbeel, Infogan: Interpretable representation learning by information maximizing generative adversarial nets, in: Advances in Neural Information Processing Systems, 2016, pp. 2172–2180.
[34] S. Gurumurthy, R. K. Sarvadevabhatla, V. B. Radhakrishnan, Deligan: Generative adversarial networks for diverse and limited data, arXiv preprint arXiv:1706.02071.
[35] A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, arXiv preprint arXiv:1511.06434.
[36] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, Improved techniques for training gans, in: Advances in Neural Information Processing Systems, 2016, pp. 2234–2242.
[37] S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, H. Lee, Learning what and where to draw, in: Advances in Neural Information Processing Systems, 2016, pp. 217–225.
[38] R. Artstein, M. Poesio, Inter-coder agreement for computational linguistics, Computational Linguistics 34 (4) (2008) 555–596.
[39] Z. Zheng, L. Zheng, Y. Yang, Unlabeled samples generated by gan improve the person re-identification baseline in vitro, arXiv preprint arXiv:1701.07717.
[40] Z.-H. Zhou, Ensemble learning, Encyclopedia of biometrics (2015) 411–416.
[41] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[42] M. Arjovsky, S. Chintala, L. Bottou, Wasserstein gan, arXiv preprint arXiv:1701.07875.
[43] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, A. Courville, Improved training of wasserstein gans, arXiv preprint arXiv:1704.00028.
[44] C. Szegedy, S. Ioffe, V. Vanhoucke, A. A. Alemi, Inception-v4, inception-resnet and the impact of residual connections on learning, in: AAAI, 2017, pp. 4278–4284.