MCFF-CNN: Multiscale comprehensive feature fusion convolutional neural network for vehicle color recognition based on residual learning


Huiyuan Fu*, Huadong Ma, Gaoya Wang, Xiaomou Zhang, Yifan Zhang
Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecommunications, Beijing 100876, China
*Corresponding author. E-mail address: [email protected] (H. Fu).

Article history: Received 5 August 2017; Revised 10 December 2017; Accepted 4 February 2018.

Keywords: Vehicle color recognition; Convolutional neural network; Residual learning; Video surveillance; Intelligent transportation system

https://doi.org/10.1016/j.neucom.2018.02.111

Abstract: Automatic vehicle color recognition is very important for video surveillance, especially for intelligent transportation systems. Several approaches have been proposed, but it remains difficult to recognize vehicle color correctly in complex traffic scenes with constantly changing illumination. To solve this problem, we propose a new network structure, the Multiscale Comprehensive Feature Fusion Convolutional Neural Network (MCFF-CNN), based on residual learning for color feature extraction. First, we use the MCFF-CNN network to extract deep color features of the vehicles. Then, we employ a support vector machine (SVM) classifier to obtain the final color recognition results. Based on the proposed approach, we have built a system for robust vehicle color recognition in practical traffic scenes. Extensive experimental results show that our solution is effective.

1. Introduction

Robust vehicle color recognition is one of the most important tasks in video surveillance, especially for intelligent transportation systems. Color is an important component of visual information and has a wide range of applications in daily life. For example, authorities can track a criminal's vehicle by color when its license plate cannot be captured. As another example, effective vehicle color recognition can improve the precision of vehicle re-identification in urban surveillance videos [26]. Recently, several approaches have been proposed for vehicle color recognition [1–8]. They fall roughly into two categories: methods based on hand-crafted features [1–6] and methods based on automatic feature learning with convolutional neural networks [7,8]. Dule et al. [1] try to improve vehicle color recognition by constructing different feature sets, considering various color spaces and classification methods. Li et al. [2] use the HSI color space and a relative-error-distance matching algorithm. Yang et al. [3] propose a vehicle color recognition method employing an H-S two-dimensional histogram.




Sam et al. [4] present a vehicle color recognition solution based on fuzzy set theory. Hu et al. [5] propose an approach that estimates the RGB value of the whole vehicle. Chen et al. [6] use a method based on region-of-interest (ROI) selection. Rachmadi et al. [7] demonstrate that a convolutional neural network (CNN) can be used for vehicle color recognition. Hu et al. [8] propose a deep-learning-based algorithm that fuses a spatial pyramid strategy with the original CNN, adaptively learning the color features of the vehicles. However, their work lacks a deeper study of the network structure of the different CNN layers, so it does not handle parameter overfitting well. Almost all current methods perform poorly in practical traffic scenes with constantly changing illumination.

In this paper, we propose a new method to overcome this difficulty. The approach is based on a new network structure, MCFF-CNN, for automatic deep color feature extraction. In general, a large-scale network often suffers from the problem that the deeper network's loss is no lower than that of its shallower counterpart. MCFF-CNN reformulates the learning function of the network layers via residual mapping, and driving the residual toward zero effectively solves this problem. At the same time, MCFF-CNN achieves multi-scale fusion of image features by combining the output features of network layers of different sizes.


Fig. 1. Overview of our method. We use the proposed MCFF-CNN as a feature extractor after it has been trained on the training data. After preprocessing of the input vehicle images, MCFF-CNN extracts the color features; an SVM classifier then predicts the color types of the vehicles.

To make the network learn deeper characteristics of vehicle images and realize the fusion of local and global features, the output features of the deep and shallow layers are merged. After extracting color features with MCFF-CNN, we use an SVM [9] classifier to obtain the final color recognition results. Extensive experimental results demonstrate that our method achieves significant improvement compared with current state-of-the-art approaches.

The rest of this paper is organized as follows. Section 2 presents our proposed method in detail. Section 3 describes the experimental results and analysis. Finally, Section 4 concludes our work.

2. Our method

Our approach for vehicle color recognition consists of three steps: preprocessing, vehicle color feature extraction with MCFF-CNN, and color classification with SVM. The pipeline of the proposed algorithm is depicted in Fig. 1. We introduce each step in turn below.

2.1. Preprocessing

In practical traffic scenes, changing illumination degrades vehicle color recognition performance. Each channel value of an image is typically low at night. To enhance image brightness, we adjust the three channels jointly and linearly:

$$\mathrm{Val}_i = \mathrm{Col}_i \times \alpha + \beta, \qquad i = r, g, b \tag{1}$$

where $\mathrm{Col}_i$ is the RGB value of the original image and $\mathrm{Val}_i$ is the RGB value after the adjustment; $\alpha$ and $\beta$ are parameters, usually set to 1.5 and 0, respectively. In this way, the feature gap between different colors becomes more pronounced, which helps the final color recognition with deep learning. To reduce image exposure, we adjust the three channels jointly and linearly:

$$\mathrm{Val}_i = \mathrm{Col}_i \times \alpha - \beta, \qquad i = r, g, b \tag{2}$$

where $\alpha$ and $\beta$ are usually set to 0.7 and 20, respectively. Although this cannot remove the highlights on the vehicle body, it reduces the exposure of the whole image, making it more useful for the following feature learning step. Fig. 2 shows a sample result of our preprocessing procedure.
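As a concrete illustration, the following is a minimal NumPy sketch of the linear adjustments in Eqs. (1) and (2). The function names and the clipping back to the 8-bit range are our own assumptions; the paper does not specify how overflow is handled.

```python
import numpy as np

def adjust_channels(img, alpha, beta):
    """Apply Val_i = Col_i * alpha + beta jointly to the R, G, B channels.

    img: H x W x 3 uint8 array. Values are clipped back to [0, 255]
    (our assumption; the paper does not state the overflow behavior).
    """
    out = img.astype(np.float32) * alpha + beta
    return np.clip(out, 0, 255).astype(np.uint8)

def enhance_brightness(img):
    # Eq. (1): alpha = 1.5, beta = 0, for dark (night) images.
    return adjust_channels(img, alpha=1.5, beta=0.0)

def reduce_exposure(img):
    # Eq. (2): Val_i = Col_i * alpha - beta, with alpha = 0.7, beta = 20,
    # for over-exposed images.
    return adjust_channels(img, alpha=0.7, beta=-20.0)
```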

2.2. MCFF-CNN for vehicle color feature extraction

After the preprocessing procedure, we use the proposed MCFF-CNN, shown in Fig. 1, for vehicle color feature extraction. In the design of MCFF-CNN, we take full advantage of the Inception architecture [13], which improves representation ability when designing a large network [12–14] and has been demonstrated to be extremely effective in basic networks [11]. MCFF-CNN uses the Inception module with dimension reductions [13], which allows a large portion of each layer's final outputs to pass to the next layer. To maintain a sparse structure and low computation, we apply 1 × 1 convolution kernels before the computationally expensive 3 × 3 and 5 × 5 convolution kernels. The residual module in Fig. 1 adopts three-layer residual learning building blocks, and the Concat module represents the fusion of local and global features, which propagates back through all network layers during learning. The total number of layers in MCFF-CNN, including the independent multi-scale feature fusion modules, is about 100.
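The paper trains MCFF-CNN in Caffe; purely for illustration, here is a sketch of an Inception module with 1 × 1 dimension reductions in PyTorch. The module structure follows [13]; the class and parameter names are ours, not the authors'.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Inception module with 1x1 dimension reductions before the costly
    3x3 and 5x5 convolutions, plus a pooled branch with a 1x1 projection."""

    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU(inplace=True))
        self.b2 = nn.Sequential(
            nn.Conv2d(in_ch, c3_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c3_red, c3, 3, padding=1), nn.ReLU(inplace=True))
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, c5_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c5_red, c5, 5, padding=2), nn.ReLU(inplace=True))
        self.b4 = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        # Concatenate the four branches along the channel dimension.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)
```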


Fig. 2. (a) Enhancing the brightness of the vehicle image. (b) Reducing the exposure of the vehicle image. The top row shows the original images and the bottom row the results after processing.

We use mean-pooling [15] before the final classification to improve the adaptive performance of the network. The filters in the convolution layers are basically of size 1 × 1, 3 × 3 or 5 × 5. The MCFF-CNN design not only considers computational efficiency and practicality, but also lowers memory utilization and improves the learning performance of the network. Multi-scale feature fusion can also be applied to other problems, such as scene recognition [16,17]. With the MCFF-CNN network we can robustly extract the color characteristics of vehicles.

To avoid the problem that, in general large networks, a deep network's loss is no lower than a shallow network's, the multi-scale feature fusion layers, which play an important role in feature extraction, are reconstructed in our proposed MCFF-CNN to learn residual functions. In this way, there is no need to learn a new function from scratch. Combining the multi-scale feature fusion layers with the residual learning model guarantees that MCFF-CNN can learn image characteristics at different scales. To ensure size consistency of the output features, we add the residual learning module only to the multi-scale feature fusion layers (Inception modules). We consider a residual learning module defined as follows [18]:

$$y = F(x, \{W_i\}) + x \tag{3}$$

where $x$ is the input vector of the Inception layers and $y$ is the output vector of the fusion result after adding residual learning modules to the Inception layers. The function $F(x, \{W_i\})$ is the residual mapping to be learned. We adopt the three-layer residual learning module of [18], so it can be defined as:

$$F = W_3\,\sigma(W_2\,\sigma(W_1 x)) \tag{4}$$

where $\sigma$ denotes the ReLU [21]. The final output is therefore:

$$y = W_3\,\sigma(W_2\,\sigma(W_1 x)) + x \tag{5}$$

Adding the residual learning module to the multi-scale feature fusion layer effectively reduces the loss value. However, residual learning has been found to become unstable when the number of filters in a residual module exceeds 1000. Therefore, the maximum number of filters in our MCFF-CNN is 1024. The residual learning module is added to the multi-scale feature fusion layers with 256, 512 and 1024 filters, respectively. Fig. 3 shows the fusion of the Inception module and the residual module.
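Continuing the illustrative PyTorch sketch, a three-layer residual learning block implementing Eqs. (3)–(5) might look as follows; the 256 → 64 → 64 → 256 channel widths follow the worked example in Section 3.2, and the class name is ours.

```python
import torch.nn as nn

class BottleneckResidual(nn.Module):
    """Three-layer residual block: y = W3·σ(W2·σ(W1·x)) + x (Eqs. (3)-(5)).
    Input and output channel counts match so the identity can be added."""

    def __init__(self, channels=256, mid=64):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.ReLU(inplace=True),        # W1
            nn.Conv2d(mid, mid, 3, padding=1), nn.ReLU(inplace=True),  # W2
            nn.Conv2d(mid, channels, 1))                               # W3

    def forward(self, x):
        # Residual mapping F(x, {Wi}) plus the identity shortcut, per Eq. (5).
        # ([18] additionally applies a ReLU after the addition.)
        return self.f(x) + x
```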


To further learn the deep characteristics of the vehicle image and realize the integration of local and global features, the deep and shallow networks are merged. The number of channels, the numerical scale and the norm of the features differ across the multi-scale feature fusion layers, and the scale shrinks in deeper layers, so it is not reasonable to simply flatten the features of different multi-scale feature fusion layers into one-dimensional vectors and connect them: the scale differences are too large, and directly connecting the three layers of different depths without readjusting the weights is not robust. Therefore, before fusing local features of different sizes with global features, the output features of each layer should be normalized. The network can then learn the value of a scaling factor in each layer, which stabilizes the network and improves accuracy. We normalize the feature vector of each output layer as follows:

$$X' = \frac{X}{\sum_{i=1}^{c} |x_i|} \tag{6}$$

where $X$ and $X'$ denote the original and normalized pixel vectors, respectively, and $c$ is the number of channels of each vector. We then apply a scaling factor $\alpha_i$ to each channel of the vector:

$$y_i = \alpha_i \cdot x_i' \tag{7}$$
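A sketch of Eqs. (6) and (7), again in PyTorch for illustration: channel-wise L1 normalization of each output layer followed by a learnable per-channel scaling factor. The class name and the initialization value are our assumptions.

```python
import torch
import torch.nn as nn

class L1NormScale(nn.Module):
    """Normalize each spatial position's channel vector by its L1 norm
    (Eq. (6)), then scale each channel by a learnable factor alpha_i (Eq. (7))."""

    def __init__(self, channels, init_scale=1.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1, channels, 1, 1), init_scale))

    def forward(self, x):  # x: N x C x H x W
        # Small floor on the norm avoids division by zero (our addition).
        norm = x.abs().sum(dim=1, keepdim=True).clamp_min(1e-12)
        return self.alpha * (x / norm)
```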

After normalization, we adjust the size of the obtained features. MCFF-CNN uses a mean-pooling operation for dimension reduction:

$$\beta(X_{(i,j)}) = \frac{\sum_{i,j}^{\,i+s,\,j+s} X_{(i,j)}}{s \cdot s} \tag{8}$$

where $i$ and $j$ are the coordinates of the features and $s$ is the kernel size. The fusion of local and global features by combining the shallow and deep networks can then be defined as:

$$y = \beta(\beta(x_1) \oplus x_2) \oplus x_3 \tag{9}$$

where $\oplus$ denotes concatenating the features at corresponding coordinates, and $x_1$, $x_2$ and $x_3$ denote the outputs of Inception(3), Inception(4) and Inception(5), respectively. The output sizes of Inception(3), Inception(4) and Inception(5) are 28 × 28 × 256, 14 × 14 × 512 and 7 × 7 × 1024, so most of the information would be lost if Inception(3) were reduced directly from 28 × 28 to 7 × 7 by mean pooling. We therefore first mean-pool Inception(3) down to 14 × 14. Mean pooling reduces the spatial resolution while retaining more background information, though some information is still lost; after mean pooling, the number of filters is doubled. The pooled Inception(3) and Inception(4) are merged into concat_1, and the same mean-pooling treatment of the concat_1 layer, combined with Inception(5), yields concat_2. Fig. 4 shows this fusion of local and global features by combining the shallow and deep networks. By merging the different multi-scale feature layers, we combine local and global image features, and the gradients propagate back effectively to all network layers for learning.
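A sketch of the fusion in Eq. (9), under a direct-concatenation reading: Inception(3) is mean-pooled from 28 × 28 to 14 × 14 and concatenated with Inception(4) (concat_1), and the result is pooled again to 7 × 7 and concatenated with Inception(5) (concat_2). The exact channel bookkeeping around the concat layers is not fully specified in the paper, so the channel counts in the comments are assumptions; names are ours.

```python
import torch
import torch.nn.functional as F

def fuse_multiscale(x1, x2, x3):
    """Eq. (9): y = beta(beta(x1) (+) x2) (+) x3, where beta is 2x2 mean
    pooling (Eq. (8)) and (+) is channel-wise concatenation.

    x1: N x 256 x 28 x 28   (Inception(3))
    x2: N x 512 x 14 x 14   (Inception(4))
    x3: N x 1024 x 7 x 7    (Inception(5))
    """
    concat_1 = torch.cat([F.avg_pool2d(x1, 2), x2], dim=1)        # N x 768 x 14 x 14
    concat_2 = torch.cat([F.avg_pool2d(concat_1, 2), x3], dim=1)  # N x 1792 x 7 x 7
    return concat_2
```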


Fig. 3. The fusion of the Inception module and the residual module in MCFF-CNN. The residual learning model is added to the multi-scale feature fusion layers with 256, 512 and 1024 filters, respectively.

Fig. 4. The fusion of local features and global features by the combination of shallow network and deep network in our proposed MCFF-CNN.

2.3. Color classification with SVM

In the training phase, we use a Softmax [19] classifier to obtain the loss for extracting the features of each layer:

$$L_i = -\log \frac{e^{f_{y_i}}}{\sum_j e^{f_j}} \tag{10}$$

where $f_j$ denotes the $j$th element of the classification vector $f$. However, the softmax loss alone cannot choose the most effective layer for color classification, so we adopt an SVM [9] classifier to find the best effective layer. Algorithm 1 details our training procedure.

Algorithm 1: Training procedure for vehicle color feature extraction.
Require: labeled images in the training data set
Ensure: best effective layer for color classification
1: Enhance the image brightness according to Eq. (1);
2: Reduce the image exposure according to Eq. (2);
3: while loss has not converged do
4:   Forward MCFF-CNN according to Eqs. (3)–(9);
5:   Generate the loss from the Softmax layer according to Eq. (10);
6:   Backward MCFF-CNN;
7: end while
8: Generate the output features of each layer;
9: Use SVM [9] to choose the best effective layer for color classification.

In the testing phase, we use the features from the best effective layer found during training to recognize the final color types with the SVM classifier. Algorithm 2 details our testing procedure.

Algorithm 2: Testing procedure for vehicle color recognition.
Require: a testing vehicle image, the best effective layer from the training phase
Ensure: color recognition result
1: Enhance the image brightness according to Eq. (1);
2: Reduce the image exposure according to Eq. (2);
3: Forward MCFF-CNN according to Eqs. (3)–(9);
4: Obtain the features from the best effective layer;
5: Use SVM [9] to recognize the color type of the vehicle in the image.
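A minimal sketch of step 9 of Algorithm 1, assuming feature matrices have already been extracted from each candidate layer. scikit-learn's LinearSVC stands in for the SVM [9], the inner 5-fold validation is our choice, and all names are ours.

```python
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def choose_best_layer(features_by_layer, labels):
    """features_by_layer: dict mapping layer name -> (N, D) feature matrix
    extracted from the trained MCFF-CNN on the training set.
    Returns the layer whose features give the best SVM accuracy."""
    best_layer, best_acc = None, -1.0
    for name, feats in features_by_layer.items():
        acc = cross_val_score(LinearSVC(), feats, labels, cv=5).mean()
        if acc > best_acc:
            best_layer, best_acc = name, acc
    return best_layer, best_acc
```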

3. Experiments

3.1. Experiment setup

We evaluated the proposed MCFF-CNN-based algorithm and other methods on the Vehicle Color data set released in [6]. The data set contains 15,601 vehicle images covering eight color types: white, yellow, cyan, red, gray, green, blue and black. Each image contains exactly one vehicle. The data set is challenging because of its variability in illumination, among other factors. Since the data are not fully labeled, we manually reselect the images and label the categories. Meanwhile, we apply the preprocessing procedure of Section 2 for lighting enhancement of each image. Samples of the data set after this processing are shown in Fig. 5. We divide the data set into a training set and a testing set; the training set contains about 90% of the images. Note that all images with labeled categories in the training and testing sets are shuffled randomly.


Fig. 5. Some samples of the data set [6] for vehicle color recognition.

Fig. 6. Output data visualization with different network depth.

All experiments are conducted on a regular PC (Dell Precision T1700 with a 4-core Intel E3-1220 v3 at 3.1 GHz, 8 GB RAM, and Ubuntu 14.04). We also use an NVIDIA K80 GPU card in our server for training.

3.2. Implementation details

We adopt the Caffe implementation [20] to train the MCFF-CNN architecture. Each image in our dataset is first adjusted to 256 × 256. In the feature learning stage, the images are resized to 224 × 224 × 3.

The images are then fed to the first convolutional layer, conv1, which has 64 filters of size 7 × 7 with a stride of two pixels and padding of 3, so the output feature is 112 × 112 × 64. The outputs of conv1 pass through a rectified linear unit (ReLU) [21]. Max pooling with 3 × 3 kernels and a stride of two pixels reduces the output to 56 × 56 × 64. The output is then normalized via a norm layer to keep it within a smaller range, and sent to the second convolutional layer, conv2, which has 192 filters of size 3 × 3 with a stride of one pixel and padding of 1, so the output feature is 56 × 56 × 192.
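These stem dimensions can be verified with a short PyTorch sketch; the layer hyperparameters are taken from the text above, while the pooling padding of 1 is our assumption (the paper does not state it, but it reproduces the reported sizes).

```python
import torch
import torch.nn as nn

stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),    # conv1 -> 112x112x64
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),        # pool1 -> 56x56x64
    nn.Conv2d(64, 192, kernel_size=3, stride=1, padding=1),  # conv2 -> 56x56x192
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),        # pool2 -> 28x28x192
)
out = stem(torch.zeros(1, 3, 224, 224))
assert out.shape == (1, 192, 28, 28)  # matches the sizes reported in the text
```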


Fig. 7. Convolution layers’ weights visualization with different network depth.

The outputs of conv2 also pass through a ReLU and, after normalization, are fed to the pool2 layer for max pooling with 3 × 3 kernels and a stride of two pixels; the output feature is 28 × 28 × 192. It is then put into the Inception module, where the features are divided into four parts handled by different convolutional kernels. The outputs of the four parts are as follows. The first part passes through 64 convolutional kernels of size 1 × 1, producing a 28 × 28 × 64 output. The second part passes through 96 convolutional kernels of size 1 × 1 (output 28 × 28 × 96) and then, after ReLUs, through 128 3 × 3 convolutions (output 28 × 28 × 128). The third part passes through 16 convolutional kernels of size 1 × 1 (output 28 × 28 × 16) and then, after ReLUs, through 32 5 × 5 convolutions (output 28 × 28 × 32). The fourth part passes through a pooling layer with 3 × 3 kernels and padding of 1 (output 28 × 28 × 192) and then through 32 convolutional kernels of size 1 × 1 (output 28 × 28 × 32). We then concatenate the outputs of the four parts; the final output feature is 28 × 28 × 256.

The obtained feature is put into the residual learning module. First it passes through 64 convolutional kernels of size 1 × 1 (output 28 × 28 × 64), then through 64 convolutional kernels of size 3 × 3 (output again 28 × 28 × 64), and finally through 256 convolutional kernels of size 1 × 1 (output 28 × 28 × 256). We regard the output feature, obtained from the upper-layer feature after residual learning, as the input feature of the next Inception layer; in this way, the output of Inception_3a/output is transformed into the output of Inception_3a. The combination of the subsequent Inception modules and residual learning modules is similar.

The output feature vectors are then normalized, and the output feature of Inception_3a undergoes a mean-pooling operation that reduces it to 14 × 14. The pooled Inception_3a and Inception_4a are merged into concat_1, and the same mean-pooling treatment of the concat_1 layer, combined with Inception_5b, yields concat_2. After a mean-pooling operation of size 7 × 7, the combined feature has size 1 × 1 × 1024. A dropout layer then randomly drops 40% of the outputs, and the result is finally put into a linear classifier with a softmax loss function.
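Reusing the hypothetical InceptionModule sketched in Section 2.2 (and assuming torch is imported), the Inception_3a branch widths given above (64; 96 → 128; 16 → 32; pool → 32) reproduce the stated 28 × 28 × 256 output:

```python
inc_3a = InceptionModule(in_ch=192, c1=64, c3_red=96, c3=128,
                         c5_red=16, c5=32, pool_proj=32)
y = inc_3a(torch.zeros(1, 192, 28, 28))
assert y.shape == (1, 256, 28, 28)  # 64 + 128 + 32 + 32 = 256 channels
```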

The final output is a vector of size 8 × 1 because we have eight categories. We set the learning rate to 0.0001, updated with a step policy with a step size of 320,000. The maximum number of iterations is 2,000,000 and the weight decay is set to 0.0002. We extract the features of the training and testing sets from the trained Caffe model, train a linear SVM on the training set, and use the trained model on the testing set for the final color classification. We evaluate our method with a 10-fold cross-validation scheme. Figs. 6 and 7 show visualizations of the intermediate network layer outputs for a better understanding of the proposed MCFF-CNN.
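In outline, the final 10-fold evaluation can be reproduced with scikit-learn; this is a sketch under our own assumptions, with hypothetical file names standing in for the extracted features and labels.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# feats: (N, D) features extracted from the best layer of the trained model;
# labels: (N,) integer color labels for the eight categories.
feats = np.load("mcff_features.npy")   # hypothetical file names
labels = np.load("color_labels.npy")
scores = cross_val_score(LinearSVC(), feats, labels, cv=10)
print("10-fold accuracy: %.4f +/- %.4f" % (scores.mean(), scores.std()))
```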

3.3. Experimental results

We first compare the proposed method with current state-of-the-art methods from several different views, and then evaluate it in practical traffic scenes for vehicle color recognition.

We compare our approach with state-of-the-art methods that use hand-crafted features, in terms of the recognition precision of each color category and the mean precision over all categories. Most existing work on vehicle color recognition is based on hand-crafted features, such as Hue Hist [22], Normalized RG Hist [23], Opponent Hist [24], RGB Hist, Transformed Color Hist [24], and Combined Color Hist [6]. Fig. 8 shows the comparison between our MCFF-CNN-based method and these methods. Our approach clearly outperforms all of the others in precision for every one of the eight color categories, which demonstrates that feature extraction with a deep neural network better fits the varied conditions of practical complex environments. We also compare our approach with the newest deep learning method [8] for vehicle color recognition, which is based on a deep convolutional neural network, and with a VGG-based method [25]. As shown in Fig. 9, our approach achieves a better mean precision over all color categories than both.

Fig. 8. Comparison of vehicle color recognition precision of our MCFF-CNN-based method with several hand-crafted feature based methods.

Table 1
Comparison of vehicle color recognition precision with different color feature extraction approaches (layer name with precision).

GoogleNet    Pool2 (96.29%)         Inception_3b (94.95%)   Pool3 (96.55%)
GoogleNet    Inception_4a (95.25%)  Inception_4d (95.19%)   Inception_4e (94.80%)
GoogleNet    Pool4 (95.12%)         Inception_5b (93.69%)   Pool5 (89.73%)
GoogleNet    Loss3 (89.00%)         –                       –
ResNet-50    Conv1 (95.50%)         Pool1 (95.45%)          Res2 (95.95%)
ResNet-50    Res3 (97.14%)          Res4 (97.45%)           Res5 (96.75%)
ResNet-50    Pool5 (94.47%)         –                       –
ResNet-101   Conv1 (95.65%)         Pool1 (96.10%)          Res2 (95.50%)
ResNet-101   Res3 (97.00%)          Res4 (96.85%)           Res5 (95.10%)
ResNet-101   Pool5 (95.20%)         –                       –
ResNet-152   Conv1 (95.65%)         Pool1 (96.10%)          Res2 (95.50%)
ResNet-152   Res3 (97.15%)          Res4 (97.00%)           Res5 (96.25%)
ResNet-152   Pool5 (96.40%)         –                       –
MCFF-CNN     Pool2 (93.87%)         Inception_3a (97.27%)   Pool3 (97.40%)
MCFF-CNN     Inception_4a (97.98%)  Inception_4b (97.85%)   Inception_4c (97.40%)
MCFF-CNN     Pool4 (96.75%)         Inception_4d (97.14%)   Inception_4e (97.33%)
MCFF-CNN     Inception_5b (96.16%)  Pool5 (96.36%)          Concat_1 (97.46%)
MCFF-CNN     Concat_2 (97.53%)      –                       –

Furthermore, we compare our proposed MCFF-CNN with the state-of-the-art GoogleNet [13] and ResNet [18] approaches for vehicle color feature extraction. The results are given in Table 1. The Inception_4a feature of our MCFF-CNN achieves the highest precision, 97.98%, which is better than the best GoogleNet result (96.55%, pool3 layer) and the best ResNet result (97.45%, res4 layer of ResNet-50).

To further demonstrate the performance of MCFF-CNN, we evaluate it on a dataset from Peking University [26] that is well known for vehicle re-identification and contains more than 220k images. Color attributes were labeled for a random subset among the first 18,861 sample images, and we take these roughly 18k samples for our evaluation after some pre-processing. Intersecting the color-labeled samples with the original samples yields 6884 images, which we relabel with six color types: black, yellow, white, blue, gray, and red (we do not consider silver because it may be confused with gray and white). We additionally add a green type by labeling the whole 18k samples manually, giving a final evaluation set of 6975 color-labeled images. We use this dataset to fine-tune our previously trained MCFF-CNN; the fine-tuned network serves as a feature extractor for the pool5 layer features, which we then classify with SVM. We evaluate the model with 4-fold cross-validation. The final result, shown in Fig. 10, confirms that the proposed MCFF-CNN is effective on this dataset.

Fig. 9. Comparison of vehicle color recognition precision of our MCFF-CNN+SVM approach with the current state-of-the-art SP-CNN+SVM approach [8] and the VGG+SVM approach [25].


Fig. 10. 4-fold cross validation result of our MCFF-CNN+SVM approach on the dataset of Peking University [26] for vehicle color recognition.

Fig. 11. Results of our system for vehicle color recognition in the practical traffic scenes.

We then present experimental results on surveillance images from practical complex traffic scenes. We first use the state-of-the-art object detection approach [27] to detect each vehicle in the surveillance images, and then apply our proposed method to the localized and cropped vehicle images. The final results of our system for vehicle color recognition can be seen in Fig. 11: all detected vehicles are labeled in red, and their recognized color types are labeled in yellow. Although the proposed algorithm works well under various challenging illuminations, it is far from perfect and makes mistakes in some cases. Fig. 12 shows some typical incorrect predictions.


Fig. 12. Some failure cases. The color types in brackets are our predicted colors; the color types outside the brackets are the ground truth.

4. Conclusions

In this paper, we propose a new method based on MCFF-CNN for automatic vehicle color recognition. We evaluate our algorithm against current state-of-the-art methods from several different views, and the experimental results demonstrate that our proposed approach obtains convincing performance. Based on the proposed algorithm, we have built a practical system for robust vehicle classification in complex traffic scenes. However, there is still room for improvement in vehicle color recognition because of the many complicated cases encountered in practical environments. We will continue to address these influences in our future work.

Declarations of interest

None.

Acknowledgments

The research reported in this paper is supported by the NSFC-Guangdong Joint Fund under No. U1501254, the Natural Science Foundation of China under Grant No. 61402048, the Funds for Creative Research Groups of China under Grant No. 61421061, and the Beijing Training Project for the Leading Talents in S&T (ljrc201502).

References

[1] E. Dule, M. Gökmen, M.S. Beratoğlu, et al., A convenient feature vector construction for vehicle color recognition, Fuzzy Syst. Evolut. Comput. (2010) 250–255.
[2] X. Li, G. Zhang, J. Fang, et al., Vehicle color recognition using vector matching of template, in: Proceedings of the International Symposium on Electronic Commerce & Security, 2010, pp. 189–193.
[3] M. Yang, G. Han, X. Li, et al., Vehicle color recognition using monocular camera, in: Proceedings of the International Conference on Wireless Communications and Signal Processing, 2011, pp. 1–5.
[4] K.T. Sam, X.L. Tian, Vehicle color recognition using fuzzy rules and center of maximum defuzzification, Energy Procedia 13 (2011) 1006–1010.
[5] W. Hu, J. Yang, L. Bai, L. Yao, A new approach for vehicle color recognition based on specular-free image, in: Proc. SPIE 9067, Sixth International Conference on Machine Vision (ICMV 2013), 2013, 90671Q.
[6] P. Chen, X. Bai, W. Liu, Vehicle color recognition on urban road by feature context, IEEE Trans. Intell. Transp. Syst. 15 (5) (2014) 2340–2346.

[7] R.F. Rachmadi, I.K.E. Purnama, Vehicle color recognition using convolutional neural network, arXiv:1510.07391, 2015.
[8] C. Hu, X. Bai, L. Qi, et al., Vehicle color recognition with spatial pyramid deep learning, IEEE Trans. Intell. Transp. Syst. 16 (5) (2015) 1–10.
[9] B. Boser, I. Guyon, V. Vapnik, A training algorithm for optimal margin classifiers, in: Proceedings of the Annual Conference on Learning Theory, 1992, pp. 144–152.
[10] R.T. Tan, K. Ikeuchi, Separating reflection components of textured surfaces using a single image, IEEE Trans. Pattern Anal. Mach. Intell. 27 (2) (2005) 78–93.
[11] S. Arora, A. Bhaskara, R. Ge, et al., Provable bounds for learning some deep representations, in: Proceedings of the International Conference on Machine Learning, 2013, pp. 584–592.
[12] D. Erhan, C. Szegedy, A. Toshev, et al., Scalable object detection using deep neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2155–2162.
[13] C. Szegedy, W. Liu, Y. Jia, et al., Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[14] R. Girshick, J. Donahue, T. Darrell, et al., Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
[15] M. Lin, Q. Chen, S. Yan, Network in network, in: International Conference on Learning Representations, 2014.
[16] X.H. Song, S.Q. Jiang, L. Herranz, Multi-scale multi-feature context modeling for scene recognition in the semantic manifold, IEEE Trans. Image Process. 26 (6) (2017) 2721–2735.
[17] X.H. Song, S.Q. Jiang, L. Herranz, Joint multi-feature spatial context for scene recognition in the semantic manifold, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1312–1320.
[18] K. He, X. Zhang, S. Ren, et al., Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[19] K. Gimpel, N. Smith, Softmax-margin CRFs: training log-linear models with cost functions, in: Proceedings of Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2010.
[20] Y. Jia, Caffe: an open source convolutional architecture for fast feature embedding, 2013. Available: http://caffe.berkeleyvision.org.
[21] V. Nair, G.E. Hinton, Rectified linear units improve restricted Boltzmann machines, in: Proceedings of the International Conference on Machine Learning, 2010, pp. 807–814.
[22] J. van de Weijer, T. Gevers, A.D. Bagdanov, Boosting color saliency in image feature detection, IEEE Trans. Pattern Anal. Mach. Intell. 28 (1) (2006) 150–156.
[23] M.F. Carlsohn, B.H. Menze, B.M. Kelm, et al., Color Image Processing: Methods and Applications, CRC Press, 2007.
[24] K.E.A. van de Sande, T. Gevers, C.G.M. Snoek, Evaluating color descriptors for object and scene recognition, IEEE Trans. Pattern Anal. Mach. Intell. 32 (9) (2010) 1582–1596.
[25] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: International Conference on Learning Representations, 2015.
[26] H. Liu, Y. Tian, Y. Wang, L. Pang, T. Huang, Deep relative distance learning: tell the difference between similar vehicles, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2167–2175.
[27] R. Girshick, Fast R-CNN, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.

Huiyuan Fu received the Ph.D. degree in computer science from Beijing University of Posts and Telecommunications, China, in 2014, and the B.S. degree in computer science from Xi'an University of Posts and Telecommunications, China, in 2008. He is an associate professor at the School of Computer Science, Beijing University of Posts and Telecommunications, China. His research areas include visual big data, machine learning and pattern recognition, and multimedia systems.

Huadong Ma received the Ph.D. degree in computer science from the Institute of Computing Technology, Chinese Academy of Sciences (CAS), in 1995, the M.S. degree in computer science from the Shenyang Institute of Computing Technology, CAS, in 1990, and the B.S. degree in mathematics from Henan Normal University, China, in 1984. He is a professor at the School of Computer Science, Beijing University of Posts and Telecommunications, China. His research interests include multimedia networks and systems, the Internet of Things, and sensor networks. He has published over 100 papers in these fields.

Gaoya Wang is currently working towards the M.S. degree in computer science at the Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecommunications, China. Her research interests include image processing and visual understanding.

Xiaomou Zhang is currently working towards the M.S. degree in computer science at the Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecommunications, China. Her research interests include computer vision and pattern recognition.

Yifan Zhang is currently working towards the M.S. degree in computer science at the Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecommunications, China. His research interests include computer vision and pattern recognition.
