Neurocomputing 347 (2019) 24–33
Towards accurate tiny vehicle detection in complex scenes

Wei Liu a,b,1, Shengcai Liao c,2,∗, Weidong Hu a

a National Key Laboratory of Science and Technology on ATR, School of Electronic Science, National University of Defense Technology, Changsha, China
b Center for Biometrics and Security Research, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China
c Inception Institute of Artificial Intelligence (IIAI), Abu Dhabi, UAE
Article info

Article history:
Received 3 May 2018
Revised 14 February 2019
Accepted 2 March 2019
Available online 6 March 2019
Communicated by Dr. H. Yu

Keywords:
Vehicle detection
Object proposal
Deep neural network
Abstract

Large variance of scales is still a notable challenge for vehicle detection in complex traffic scenes. Though Faster RCNN based two-stage detectors have demonstrated superiority on generic object detection, their effectiveness on vehicle detection in real applications, especially for vehicles of tiny scales, remains unclear. In this paper, a novel two-stage detector is proposed to perform tiny vehicle detection with high recall. It consists of a backward feature enhancement network (BFEN) and a spatial layout preserving network (SLPN). At the first stage, less attention has previously been paid to generating proposals with high recall, which, however, plays a significant role in recalling vehicles of tiny scales. By fully exploiting the feature representation power of a backbone network, the BFEN aims at generating high-quality region proposals for vehicles of various scales. Even with only 100 proposals, the proposed BFEN achieves an encouraging recall rate over 99%. For better localization of tiny vehicles, we argue that the spatial layout of the ROI features also plays a significant role in the second stage. Accordingly, a light-weight detection sub-network called SLPN is designed to progressively integrate ROI features while preserving their spatial layouts. Experiments on the challenging DETRAC vehicle detection dataset show that the proposed method significantly improves a competitive baseline (ResNet-50 based Faster RCNN) by 16.5% mAP. Competitive performance with the state of the art is also achieved on both the DETRAC and KITTI benchmarks. © 2019 Published by Elsevier B.V.
∗ Corresponding author at: Inception Institute of Artificial Intelligence (IIAI), Abu Dhabi, UAE.
E-mail addresses: [email protected] (W. Liu), [email protected], [email protected] (S. Liao), [email protected] (W. Hu).
1 Wei Liu finished his part of the work during his visit to CASIA.
2 Shengcai Liao finished his work at CASIA.

1. Introduction

Recently, generic object detection has gained great success, especially driven by deep Convolutional Neural Networks (CNNs). However, vehicle detection in complex traffic scenes still encounters a number of challenges, such as varying lighting conditions, occlusions, and low resolutions. Beyond early studies that focused on hand-crafted features [1] and cascade classifier structures [2], RCNN [3] first introduced CNNs into object detection, followed by Fast RCNN [4], Faster RCNN [5] and R-FCN [6]. These methods are called two-stage detectors, where proposals are generated in the first stage and refined in the second stage. Although considerably good performance has been achieved by two-stage detectors on vehicle detection [7,8], less attention has been paid to the quality of the proposals generated in the first stage. A high recall of proposals is critical for the
overall vehicle detection performance because missed objects cannot be retrieved in the second stage, especially for tiny vehicles (< 20 pixels) in traffic scenes. In this paper, following the two-stage detection pipeline, we focus on performing vehicle detection with high recall. Specifically, we first answer the question of how to generate high-quality proposals, and then how to refine these proposals for better localization accuracy. Thanks to the strong feature representation of CNNs, proposal generation can be embedded in a network, termed the Region Proposal Network (RPN) in Faster RCNN [5]. Although multi-scale anchors can be assigned on the last layer of a strong backbone network (e.g., ResNet-50), RPN has suboptimal performance in recalling vehicles of various scales, especially the tiny ones, as depicted in Fig. 1(a). This is because the last layer, with its large receptive field, is ill-matched to small objects. One solution is to utilize separate multi-layers to generate proposals (e.g., SSD [9] and MSCNN [10]). However, it is well known that feature representations of lower layers are less discriminative and abstract than those of higher layers. To address this problem, we design a backward feature enhancement strategy to transport more semantic information from high layers to low layers, so that features at low layers are fine-grained in spatial resolution as well as discriminative
Fig. 1. Top 10 proposals of (a) RPN [5] and (b) BFEN. Visibly, BFEN generates proposals with better accuracy, and experimentally, it has a higher recall for tiny vehicles (< 20 pixels) than RPN (89.68% vs. 63.32% with top 300 proposals), as shown in Table 2.
in representation. This is denoted as the Backward Feature Enhancement Network (BFEN). As shown in Fig. 1(b), our BFEN performs considerably well on tiny vehicles and achieves an encouraging recall rate over 99% with only 100 proposals on the DETRAC validation set.

With these high-quality proposals, we aim at further improving localization accuracy in the second stage. To improve the discrimination of the proposals' features, fully-connected (FC) layers are commonly adopted in existing two-stage detectors (e.g., [5] and [11]), but FC layers intrinsically corrupt the spatial layout of features, which is fairly harmful for detecting tiny vehicles. To address this, a spatial layout preserving network is proposed, the core of which is the split-transform-merge building block proposed in [12]. We simply stack two such building blocks as an alternative to the FC layers in the original Faster RCNN [5], leading to a computationally friendly detection sub-network. It evolves the region-wise features progressively while preserving the spatial information, resulting in a prominent detection accuracy improvement, which will be demonstrated in our experiments in Section 4.2.

With the above two network designs, a novel detector is proposed, specifically for improved tiny vehicle detection, which achieves a new state of the art on the DETRAC benchmark [8] compared with all published and unpublished results, and also competitive performance on the KITTI [13] vehicle detection benchmark. Notably, the proposed method significantly improves a competitive baseline (ResNet-50 based Faster RCNN) by 16.5% mAP on the DETRAC validation dataset.

To sum up, we focus on the problem of tiny vehicle detection, where current detection frameworks struggle. The contribution of this paper is two-fold, both parts focusing on tiny vehicle detection. First, we show that representation of small objects requires a relatively large feature map, which is available from low levels of a deep network. However, large feature maps in lower levels generally contain limited discriminative information. Therefore, we specifically design the BFEN for proposal generation, which absorbs higher-level information step by step to enhance the base feature map. With this backward phase, feature maps in lower levels of BFEN are enhanced with more discriminative information and thus are beneficial for small object detection. Second, we demonstrate through comprehensive experiments that the motivation discussed above is indeed realized; together with a method to keep the spatial layout of the enhanced feature
map, we obtain a large performance boost over a basic network for tiny vehicle detection.

This paper is an extension of our conference paper recently accepted by the IEEE International Conference on Multimedia and Expo (ICME) 2018. The main changes in this paper are as follows: (1) The proposed method was further evaluated on the KITTI [13] vehicle detection benchmark, and experimental results compared to the state of the art on both the validation set and the test set are reported. (2) More ablation studies and analyses on the DETRAC validation set are included to demonstrate the effectiveness of the proposed method. For region proposal evaluation, extended experiments are performed under a stricter IoU threshold of 0.7. For detection evaluation, comparisons of the parameter scales of different networks are given to demonstrate the practical applicability of the proposed method, apart from its superior detection accuracy. (3) Some examples of detection results on the DETRAC and KITTI test sets are presented to qualitatively evaluate the proposed method. (4) The preliminaries and motivation of the proposed method are first given; then we provide a more detailed formulation of the proposed network designs along with more details of network training.

2. Related work

Previous works on vehicle detection are mainly based on hand-crafted features [1,14] and the cascade classifier framework [2]. Another line of research introduced the deformable part based model (DPM) [15], which explicitly explored spatial structure information for vehicle detection and thus yielded good results [16]. Recent works achieved remarkably better performance thanks to the rich feature representation of CNNs. Various CNN detection methods can be roughly classified into two categories. The first category comprises the two-stage methods [4–6] pioneered by RCNN [3]. To generate proposals in a unified framework, Faster RCNN [5] devises a Region Proposal Network sharing the base network with the detection network, so that the two can be trained jointly. Following that, numerous methods have tried to improve detection performance by focusing on network architecture [6,11,17,18], training strategy [19,20], auxiliary context mining [21–23], and so on. The second category of methods [9,24–26], called single-stage methods, aims at speeding up detection by removing the region proposal generation stage. These single-stage detectors
directly regress object locations from multiple layers of the base network and thus are more computationally efficient, but yield less satisfactory results than two-stage methods.

To achieve strong feature representation, fusing feature maps from different layers has been extensively studied in the literature [11,17,26,27]. In particular, [11] proposes a feature pyramid network for object detection, boosting feature representation at various levels, which has been demonstrated effective for generic object detection, while its effectiveness on tiny objects in real applications remains unclear. Motivated by this, we design a backward feature enhancement network, focusing on a low-level, large, but boosted feature map for tiny vehicle representation.

In two-stage detectors, much progress has been made with better CNN architectures in the first stage, while less effort has been devoted to the second stage. In [28], it is demonstrated that the design of the detection network is as important as that of the proposal network. As a common practice, FC layers are adopted in two-stage detectors [4,5,11] for classification and regression after ROI pooling, which is suboptimal, as FC is unable to preserve the spatial layout of its input, and the large number of parameters in FC layers also incurs a heavy computational burden. Motivated by the simplicity of the block design in ResNeXt [12], we simply stack two split-transform-merge blocks to evolve the proposals' features progressively, with preserved spatial layouts and boosted discrimination, while using only 1.4% of the parameters of a two-FC counterpart.

There are already some works addressing the problem of tiny object detection. For example, [29] focused on the specific task of tiny traffic sign detection, but based on hand-crafted features. GETNET [30] proposed a general end-to-end 2-D CNN framework for hyperspectral image change detection, while focusing on the task of per-pixel classification. Similarly, [31] proposed a joint method of a prior CNN at the superpixel level and soft restricted context transfer for street scene labeling, which is also a per-pixel classification task. The proposed method is mainly designed for tiny vehicle detection in street scenes, and it would be interesting to incorporate the strategies proposed in the above methods into our work to further improve detection performance. In terms of complex scene modelling, [32] proposed to discover the categories of a scene image in an unsupervised manner. Furthermore, [33] proposed a weakly-supervised learning method to parse a scene image into a structured configuration by its descriptive sentences. It would also be interesting to further improve detection performance by utilizing the above techniques adopted in scene understanding. In terms of model design for object detection, [34] proposed an And-Or graph model with a novel structural optimization algorithm to recognize object shapes in images. Most recently, [35] proposed Switchable Normalization (SN) for deep neural networks, which outperforms other normalization methods on various tasks, e.g., image classification, object detection and semantic segmentation. Both lines of work are parallel to ours and can be incorporated into the proposed method for better performance.
3. Proposed method

3.1. Preliminary

Since our method stems from the Faster RCNN [5] detection framework, here we give a brief review of this method. In Faster RCNN [5], detection mainly contains two stages: the Region Proposal Network (RPN) and the detection sub-network. RPN is built upon a backbone network (e.g., VGG [36], ResNet [37]), in which multiple feature maps with different resolutions are progressively downsampled. These multi-scale feature maps can be defined as follows:

\Phi_k = f_k(\Phi_{k-1}) = f_k(f_{k-1}(\cdots f_1(I))),   (1)

where $I$ represents the input image, $f(\cdot)$ is an existing layer from a base network or an added feature extraction layer, and $\Phi_k$ is the feature map generated by the $k$th layer. On top of the feature maps of the last layer, a convolutional predictor $p(\cdot)$ is applied over anchors of different scales and aspect ratios (denoted as $B$), that is,

P = p(\Phi_k, B), \quad k > 0,   (2)

where $P$ are the generated proposals. Generally, $p(\cdot)$ contains two elements: $cls(\cdot)$, which predicts the classification scores, and $regr(\cdot)$, which predicts the scaling and offsets of the default anchor boxes. Given the region proposals $P$ from Eq. (2), an ROI-pooling layer is attached on the feature maps $\Phi_k$, and the ROI-pooled features are then fed into a detection sub-network for further classification and regression, which is formulated as

D = H(\mathrm{ROIPool}(\Phi_k, P)), \quad k > 0,   (3)

where $D$ is the detection result, $\mathrm{ROIPool}(\cdot)$ is the ROI-pooling function that projects the object proposals $P$ onto the $k$th feature map $\Phi_k$ to extract region-wise features, and $H(\cdot)$ denotes the detection sub-network, in which two fully-connected layers are applied to further evolve the region-wise features for classification and regression in Faster RCNN [5].
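To make the two-stage formulation of Eqs. (1)–(3) concrete, the following is a minimal PyTorch-style sketch; the backbone, rpn_head and detection_head modules are illustrative placeholders rather than the authors' released implementation, and torchvision's roi_pool stands in for the ROIPool(·) operator.

import torch.nn as nn
from torchvision.ops import roi_pool

class TwoStageDetector(nn.Module):
    """Schematic of Eqs. (1)-(3): backbone features -> proposals -> ROI pooling -> detection head."""
    def __init__(self, backbone, rpn_head, detection_head, pooled_size=7, spatial_scale=1.0 / 16):
        super().__init__()
        self.backbone = backbone              # f_1 ... f_k in Eq. (1)
        self.rpn_head = rpn_head              # p(.) in Eq. (2): returns proposal boxes
        self.detection_head = detection_head  # H(.) in Eq. (3)
        self.pooled_size = pooled_size
        self.spatial_scale = spatial_scale

    def forward(self, image):
        feat = self.backbone(image)                   # Phi_k = f_k(f_{k-1}(... f_1(I)))
        proposals = self.rpn_head(feat)               # P: (K, 5) tensor of (batch_idx, x1, y1, x2, y2)
        rois = roi_pool(feat, proposals, self.pooled_size, self.spatial_scale)
        return self.detection_head(rois)              # D = H(ROIPool(Phi_k, P))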
3.2. Backward feature enhancement network (BFEN)

Feature matters. We argue that how to fully exploit the feature representation of a backbone network plays an important role in generating high-quality proposals, especially for tiny vehicles. In Faster RCNN [5], only the final layer is utilized for proposal generation, which is not good for tiny vehicles. In this paper, as depicted in Fig. 2(a), we first emanate branches from the last layers of stages 3, 4 and 5 of ResNet-50, respectively, and then deliver more semantic information from higher layers to lower layers by the following strategy:

\hat{\Phi}_n = c_n(\Phi_n) \oplus d_{n+1}(c_{n+1}(\hat{\Phi}_{n+1})), \quad n = 3, 4,   (4)

where $\Phi_n$ is the feature map of the last layer of stage $n$, $c_n(\cdot)$ represents a convolution layer that reduces the original feature maps' channel dimensions, $d_n(\cdot)$ represents a deconvolution layer that upsamples feature maps by a factor of 2, and $\oplus$ denotes element-wise addition. With this backward transmission, the feature maps from the first two stages are enhanced and thus are more competent for small object detection. For large-scale objects, we attach two convolution layers with strides of 1 and 2 on $\Phi_5$, generating two branches $\hat{\Phi}_5$ and $\hat{\Phi}_6$, respectively. The final set of feature maps is then $\{\hat{\Phi}_3, \hat{\Phi}_4, \hat{\Phi}_5, \hat{\Phi}_6\}$, with 256 channels and resolutions of 1/8, 1/16, 1/32 and 1/64 of the input image, respectively. For proposal generation, anchors of $\{16^2, (32^2, 64^2), 128^2, 256^2\}$ pixels, with aspect ratios $\{1, 1, (0.5, 1, 2), (0.5, 1, 2)\}$, respectively, are assigned to each level of feature map. Then, we append the same design as in Faster RCNN [5] for bounding box classification and regression. In this way, Eq. (2) can be reformulated as

P = F(p_n(\hat{\Phi}_n, B_n), p_{n-1}(\hat{\Phi}_{n-1}, B_{n-1}), \ldots, p_{n-m}(\hat{\Phi}_{n-m}, B_{n-m})), \quad n > m > 0,   (5)
where $B_n$ denotes the anchor boxes pre-defined on the $n$th layer's feature map cells, and $p_n(\cdot)$ is the corresponding convolutional predictor applied on the $n$th feature map $\hat{\Phi}_n$. $F(\cdot)$ is the function that gathers the object proposals from all layers and outputs the final proposal set $P$ by applying Non-Maximum Suppression (NMS).
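As a concrete illustration of Eq. (4), the PyTorch sketch below builds the enhanced maps from ResNet-50 stage outputs. The stage channel numbers (512, 1024, 2048), the deconvolution kernel size and the exact placement of the extra stride-1/stride-2 branches are assumptions on our part; only the lateral-convolution / deconvolution / element-wise-addition structure follows the description above.

import torch.nn as nn

class BackwardFeatureEnhancement(nn.Module):
    """Sketch of Eq. (4): a 1x1 lateral convolution c_n(.) reduces each stage output to 256
    channels, a deconvolution d_{n+1}(.) upsamples the higher-level map by 2, and the two
    are merged by element-wise addition. Stage resolutions are assumed to differ by exactly 2x."""
    def __init__(self, in_channels=(512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels])          # c_n(.)
        self.upsample = nn.ModuleList(
            [nn.ConvTranspose2d(out_channels, out_channels, kernel_size=4, stride=2,
                                padding=1) for _ in in_channels[:-1]])                  # d_{n+1}(.)
        # Extra stride-1 and stride-2 branches for large objects (Phi_5_hat and Phi_6_hat).
        self.extra5 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1, stride=1)
        self.extra6 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1, stride=2)

    def forward(self, c3, c4, c5):
        p5 = self.lateral[2](c5)                          # reduced stage-5 map (1/32)
        p4 = self.lateral[1](c4) + self.upsample[1](p5)   # Eq. (4), n = 4  (1/16)
        p3 = self.lateral[0](c3) + self.upsample[0](p4)   # Eq. (4), n = 3  (1/8)
        return p3, p4, self.extra5(p5), self.extra6(p5)   # {Phi_3_hat, ..., Phi_6_hat}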
Fig. 2. (a) The detailed architecture of the proposed method. It contains a backward feature enhancement network (BFEN) and a spatial layout preserving network (SLPN), with the former responsible for accurate proposal generation and the latter refining these proposals. (b) The split-transform-merge block adopted in SLPN, where (C_in, k × k, C_out) represents a convolutional layer with an input of C_in channels, an output of C_out channels and a kernel size of k × k.
3.3. Spatial layout preserving network (SLPN)

Given the region proposals $P$ generated from the BFEN, an ROI-pooling layer is attached on the feature maps $\hat{\Phi}_n$, and the ROI-pooled features are then fed into the detection sub-network defined in Eq. (3). Unlike the common practice of flattening ROI-pooled features with FC layers in Faster RCNN [5], in this paper a light-weight yet powerful detection sub-network is designed to further increase localization accuracy. Specifically, the ROI-pooling layer is attached after $\hat{\Phi}_3$ to extract features of a fixed dimension (i.e., 7 × 7 × 256) to represent proposals. We chose $\hat{\Phi}_3$ primarily because feature maps with higher resolution provide more information for small objects, as suggested in [10]. However, a major difference to MSCNN [10] is that the $\hat{\Phi}_3$ feature maps are enhanced by higher layers with more semantic information.

Despite the strong representation power of the enhanced feature map, the spatial layout of the ROI-pooled features is also critical for vehicle classification and regression, and thus should be preserved when evolving the features progressively in the second stage. Motivated by the superior classification performance of ResNeXt [12] at a considerably lower computational complexity, we choose the split-transform-merge building block (illustrated in Fig. 2(b)) as an alternative to FC. As depicted in Fig. 2(b), 32 paths in total are appended to independently process the input features; their outputs are then concatenated in the channel dimension and further transformed by a 1 × 1 convolutional layer to match the output channel number. A skip connection is also utilized to directly add the original input to the evolved features by element-wise summation. With this combination of splitting, transformation and aggregation, features are evolved progressively while preserving the spatial layout. We simply stack two such blocks, with 512 and 1024 output channels, respectively. The final outputs are then fed to a global average pooling layer for bounding box classification and regression.

To sum up, a backward feature enhancement network is designed to generate object proposals with high recall on multi-scale enhanced feature maps, and the generated proposals are then fed into a light-weight multi-path detection sub-network for further classification and regression. The overall detection pipeline is pictorially illustrated in Fig. 2.
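A sketch of the SLPN building block is given below; following common ResNeXt practice, the 32 parallel paths are realised with a single grouped 3 × 3 convolution, while the bottleneck width, the use of batch normalization and the projection shortcut are our assumptions rather than details taken from the paper.

import torch.nn as nn

class SplitTransformMergeBlock(nn.Module):
    """ResNeXt-style block (Fig. 2(b)): split into 32 paths (grouped convolution),
    transform, merge with a 1x1 transition, and add a skip connection."""
    def __init__(self, in_channels, out_channels, cardinality=32, bottleneck_width=4):
        super().__init__()
        mid = cardinality * bottleneck_width
        self.transform = nn.Sequential(
            nn.Conv2d(in_channels, mid, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3, padding=1, groups=cardinality, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, out_channels, kernel_size=1, bias=False),   # 1x1 transition to the output width
            nn.BatchNorm2d(out_channels))
        self.shortcut = (nn.Identity() if in_channels == out_channels
                         else nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.transform(x) + self.shortcut(x))

# Two stacked blocks on the 7x7x256 ROI features, followed by global average pooling;
# the final classification and regression layers are omitted here.
slpn = nn.Sequential(SplitTransformMergeBlock(256, 512),
                     SplitTransformMergeBlock(512, 1024),
                     nn.AdaptiveAvgPool2d(1))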
3.4. Soft-style hard mining

Positive-negative imbalance is a common problem in classification. Faster RCNN [5] randomly selects a fixed set of samples (e.g., 256) based on the Intersection over Union (IoU) values between ground-truth bounding boxes and anchor boxes or object proposals. The losses from these positives and negatives share equal weights during training, while [19] performs a hard-style hard mining by choosing the samples with the top loss values. In this section, we exploit a soft-style hard mining, which shares a similar idea to [26]. A difference is that our hard mining is applied on sampled positives $S_+$ and negatives $S_-$, with $|S_-| = \alpha|S_+|$. The soft-style hard mining is formulated as

l_{cls} = -\frac{1}{1+\alpha}\sum_{i \in S_+}(1-p_i)^2\log(p_i) - \frac{\alpha}{1+\alpha}\sum_{i \in S_-}p_i^2\log(1-p_i),   (6)

where $p_i$ is the positive probability score of sample $i$ and $l_{cls}$ is the classification loss over all samples in a batch. With Eq. (6), the contributions of easy samples are down-weighted, thus acting as a kind of hard mining.
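A direct transcription of Eq. (6) is sketched below; pos_scores and neg_scores are assumed to hold the predicted positive-class probabilities of the sampled sets S+ and S−, the default value of alpha is an assumption, and the epsilon clamp is added only for numerical stability.

import torch

def soft_hard_mining_loss(pos_scores, neg_scores, alpha=3.0, eps=1e-6):
    """Eq. (6): easy positives (p close to 1) and easy negatives (p close to 0) are
    down-weighted by the (1 - p)^2 and p^2 factors, respectively."""
    p_pos = pos_scores.clamp(eps, 1.0 - eps)
    p_neg = neg_scores.clamp(eps, 1.0 - eps)
    pos_term = ((1.0 - p_pos) ** 2 * torch.log(p_pos)).sum()
    neg_term = (p_neg ** 2 * torch.log(1.0 - p_neg)).sum()
    return -(pos_term + alpha * neg_term) / (1.0 + alpha)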
3.5. Training

Loss function. We choose ResNet-50 [37] as the base network, pretrained on ImageNet [38]. All other layers are randomly initialized with the 'xavier' method. Both BFEN and SLPN have a multi-task loss combining two objectives:

L = l_{cls} + \lambda[y = 1]\, l_{loc},   (7)

where the classification loss $l_{cls}$ is defined in Eq. (6), and the regression loss $l_{loc}$ is the same smooth L1 loss adopted in Faster RCNN [5]. The regression targets $t = (t_x, t_y, t_w, t_h)$ are the bounding box offsets relative to an object proposal, defined as

t_x = (x - \hat{x})/\hat{w}, \quad t_y = (y - \hat{y})/\hat{h}, \quad t_w = \log(w/\hat{w}), \quad t_h = \log(h/\hat{h}),   (8)

where $(\hat{x}, \hat{y}, \hat{w}, \hat{h})$ are the locations of an object proposal and $(x, y, w, h)$ are the locations of the corresponding matched ground-truth bounding box. Given the network prediction $s = (s_x, s_y, s_w, s_h)$, the smooth L1 loss is defined as

l_{loc}(t, s) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L1}(t_i - s_i),   (9)

where

\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise.} \end{cases}   (10)
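The target encoding of Eq. (8) and the smooth L1 loss of Eqs. (9)–(10) translate directly into code; the sketch below assumes (N, 4) tensors in the centre/size parameterisation.

import torch

def bbox_targets(proposals, gt_boxes):
    """Eq. (8): encode matched ground-truth boxes (x, y, w, h) relative to proposals
    (x_hat, y_hat, w_hat, h_hat); both inputs are (N, 4) tensors."""
    x_hat, y_hat, w_hat, h_hat = proposals.unbind(dim=1)
    x, y, w, h = gt_boxes.unbind(dim=1)
    return torch.stack([(x - x_hat) / w_hat,
                        (y - y_hat) / h_hat,
                        torch.log(w / w_hat),
                        torch.log(h / h_hat)], dim=1)

def smooth_l1_loss(t, s):
    """Eqs. (9)-(10): elementwise smooth L1 between targets t and predictions s,
    summed over the four coordinates of each proposal."""
    diff = (t - s).abs()
    per_coord = torch.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)
    return per_coord.sum(dim=1)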
The two models can then be jointly trained with the following two-stage loss function:

L_{total} = L_{BFEN} + L_{SLPN},   (11)

where $L_{BFEN}$ and $L_{SLPN}$ are the objectives of BFEN and SLPN, respectively.

Proposal network warm-up. Jointly optimizing Eq. (11) from scratch makes training unstable in early iterations, as we found in experiments. Thus we adopt a proposal network warm-up strategy.
Specifically, we first train BFEN for 20k iterations with a learning rate of 10^{-4}; the resulting model is then used as the starting point for jointly optimizing $L_{total}$. For joint learning, we set the initial learning rates to 10^{-5} and 10^{-4} for BFEN and SLPN, respectively, and then decrease them by a factor of 10 after 60k iterations, with a maximum of 110k iterations.
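This joint-training schedule can be expressed, for example, with two parameter groups and a step decay; the optimizer type (SGD with momentum) and the momentum value are assumptions, as the paper does not state them, and bfen/slpn stand for the two sub-networks after the 20k-iteration warm-up of BFEN.

import torch

def build_joint_optimizer(bfen, slpn):
    """Sketch of the joint-training phase: lr 1e-5 for the warm-started BFEN, 1e-4 for the
    randomly initialised SLPN, both decreased by a factor of 10 after 60k iterations."""
    optimizer = torch.optim.SGD(
        [{"params": bfen.parameters(), "lr": 1e-5},
         {"params": slpn.parameters(), "lr": 1e-4}],
        momentum=0.9)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60000], gamma=0.1)
    return optimizer, scheduler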
Table 1
Proposal recall rates of different methods on the validation set with different numbers of proposals, ranked by confidence scores.

# proposals       10       100      300      500      800      1000
RPN               0.6733   0.9789   0.9879   0.9898   0.9905   0.9908
BFEN w/o SHM      0.8053   0.9868   0.9915   0.9928   0.9936   0.9940
BFEN with SHM     0.8097   0.9920   0.9962   0.9972   0.9978   0.9980
4. Experiments

4.1. Experiment settings

To evaluate the effectiveness of the proposed method, we conduct experiments on the recent DETRAC benchmark [8] and the KITTI benchmark [13]. The DETRAC dataset [8] contains a series of video sequences captured in surveillance scenes under different weather conditions. It labels a substantial number of challenging vehicles, such as tiny cars and severely overlapped or truncated buses. The entire dataset has 84K images for training and 56K for testing. As the ground truth of the test set is not publicly available, we split the training set into 39 sequences (containing 56K images) for training and 21 sequences (containing 28K images) for validation, following EB [7]. We sample every fifth frame of each training sequence into the training data, in which only images that have vehicle annotations are included, and the ignored areas are handled in a similar way to [39] during training. KITTI [13] is one of the most representative vehicle detection datasets for autonomous driving. It contains vehicles of various scales and different levels of occlusion and truncation, based on which three difficulty levels (i.e., easy, moderate and hard) are adopted for evaluation. In total, 7481 images are available for training and 7518 for testing. Since no ground truth is available for the test set, following [10], we also split the training set into training (3712 images) and validation (3769 images) sets.

Ablation experiments were conducted on the DETRAC validation dataset, and comparisons to the state of the art are reported on both the DETRAC and KITTI test datasets. For evaluation, we used the mean average precision (mAP) under the IoU threshold of 0.7, following the benchmark protocols [8,13]. For testing, the method involves a feed forward of the network followed by NMS. We run the proposed method on a single Nvidia GTX 1080Ti GPU and report the running time averaged over all images in the test set. The input size is 540 × 960 pixels for DETRAC and 576 × 1920 pixels for KITTI. Code is available at https://github.com/VideoObjectSearch/ITVD_icme/.

4.2. Ablation analysis

4.2.1. Region proposal evaluation

In this part, the results of experiments examining the quality of region proposals are reported. For a fair comparison, we reimplemented the RPN in Faster RCNN [5] with the same backbone network (ResNet-50); the scales and aspect ratios of the default anchors are also the same as in our method. To better understand the effectiveness of the soft-style hard mining proposed in Section 3.4, we also trained a variant of BFEN without soft-style hard mining, denoted as BFEN w/o SHM. Following [40], an object is recalled if any proposal has an IoU higher than 0.5 with it. Results are listed in Table 1 and Fig. 3(a). With only 10 proposals, our BFEN achieves a recall above 80%, significantly outperforming RPN by 20.3%. With the top 100 proposals, our BFEN already achieves a recall over 99%, while RPN needs 800 proposals to reach an equal recall rate. When the number of proposals goes up to 1000, the recall rate rises to 99.80%, which demonstrates the high-recall capability of the proposed BFEN.

To further validate the effectiveness on objects of various scales, Table 2 compares these methods. The scales are defined as the square root of the bounding box area in pixels, following the DETRAC benchmark (https://detrac-db.rit.albany.edu/). Additionally, we add a 'tiny' subset, which includes objects with scales less than 20 pixels, to underline the superiority of our method on tiny vehicles. As can be seen in Table 2, with the top 300 proposals, the proposed BFEN performs better than RPN across all subsets under the IoU threshold of 0.5. Notably, BFEN outperforms RPN on the recall of tiny vehicles by a large margin (63.32% to 89.68%), indicating that BFEN is significantly superior to RPN, especially for tiny vehicles (< 20 pixels). Trained with the proposed SHM strategy, BFEN with SHM outperforms BFEN w/o SHM everywhere except for a slightly inferior performance on the large subset (99.82% vs. 99.98%), which means that the proposed SHM strategy is capable of mining hard examples (particularly objects of smaller size).

Table 2
Recall rates of vehicles of different scales on the validation set. Scales are defined as the square root of the box area in pixels. Results of the top 300 proposals under IoU thresholds of 0.5 and 0.7 are reported.

IoU   Method            Tiny (0,20]   Small (20,50]   Medium (50,150]   Large (150,)
0.5   RPN               0.6332        0.9924          0.9965            0.9955
      BFEN w/o SHM      0.7730        0.9930          0.9973            0.9998
      BFEN with SHM     0.8968        0.9986          0.9980            0.9982
      MSRPN with SHM    0.8646        0.9977          0.9975            0.9996
0.7   RPN               0.1138        0.8714          0.9599            0.9763
      BFEN w/o SHM      0.2927        0.9383          0.9779            0.9609
      BFEN with SHM     0.3787        0.9576          0.9783            0.9671
      MSRPN with SHM    0.3011        0.9491          0.9753            0.9808

We further evaluated the methods under a stricter IoU threshold of 0.7, which requires higher-quality object proposals. The comparisons are reported in the lower half of Table 2. It is clear that the recall rates of all methods drop drastically, which indicates that the second stage is necessary to refine these proposals for better localization. However, the gain of BFEN over RPN in recalling tiny vehicles is consistent, which further demonstrates the superiority of the proposed BFEN over RPN. Furthermore, we also report the results of proposals generated from the original (unenhanced) feature maps, denoted as MSRPN with SHM in Table 2. It shows that with the enhanced features, BFEN achieves a higher recall, especially on the tiny and small subsets, which further demonstrates the effectiveness of the enhanced features for tiny vehicle proposal generation.

To better understand the high quality of the proposals generated by BFEN, Fig. 3(b) depicts the recall under different IoU thresholds. It highlights the growing performance gap between RPN and BFEN as the IoU threshold increases, which gives evidence that BFEN generates proposals more accurately than RPN. Intriguingly, when trained with the soft-style hard mining, BFEN performs slightly better at higher IoU thresholds.
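The recall protocol used in Tables 1 and 2 (an object counts as recalled if any of the top-k, confidence-ranked proposals overlaps it with IoU above the threshold) can be computed as in the NumPy sketch below; the (x1, y1, x2, y2) box format and score-sorted proposals are assumptions.

import numpy as np

def iou_matrix(gt, proposals):
    """Pairwise IoU between (G, 4) ground-truth boxes and (K, 4) proposals."""
    x1 = np.maximum(gt[:, None, 0], proposals[None, :, 0])
    y1 = np.maximum(gt[:, None, 1], proposals[None, :, 1])
    x2 = np.minimum(gt[:, None, 2], proposals[None, :, 2])
    y2 = np.minimum(gt[:, None, 3], proposals[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_gt = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    area_pr = (proposals[:, 2] - proposals[:, 0]) * (proposals[:, 3] - proposals[:, 1])
    return inter / (area_gt[:, None] + area_pr[None, :] - inter)

def recall_at_k(gt, proposals, k=300, iou_thr=0.5):
    """Fraction of ground-truth boxes matched by at least one of the top-k proposals."""
    ious = iou_matrix(gt, proposals[:k])
    return float((ious.max(axis=1) >= iou_thr).mean())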
Fig. 3. Proposal recall rate comparison on the validation set. (a) Recall vs. number of proposals at IoU threshold 0.5. (b) Recall vs. IoU threshold for 300 proposals.
Table 3
Mean average precision (mAP) of different detectors on the validation set as well as on different subsets. '∗' represents the variant of Faster RCNN reimplemented by us as a competitive baseline.

                       # Parameters                     mAP (%)
Method                 RPN       DetNet    Total        Overall   Sunny    Cloudy   Rainy    Night
Faster RCNN [5]        –         –         –            68.58     63.64    70.04    81.56    60.53
EB [7]                 –         –         –            84.43     87.48    85.88    85.65    70.86
Faster RCNN∗           10.96M    15.01M    25.97M       72.72     88.00    73.96    66.26    68.33
BFEN                   37.24M    0M        37.24M       82.22     85.38    83.56    80.43    72.69
BFEN+2FC               37.24M    68.19M    105.43M      83.00     83.90    82.73    90.21    74.98
BFEN+SLPN              37.24M    0.98M     38.22M       87.19     89.13    87.35    92.21    76.84
BFEN+SLPN (PNW)        37.24M    0.98M     38.22M       88.71     90.01    89.46    92.64    78.48
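For orientation, the '# Parameters' entries in Table 3 can be approximately reproduced by counting the weights of the corresponding detection heads; the sketch below does this for the two-FC head (two 4096-node fully-connected layers on the flattened 7 × 7 × 256 ROI features, as described in Section 4.2.2 below). Whether the final classification/regression layers are included in the reported counts is not specified, so the figure is approximate.

import torch.nn as nn

def count_params(module):
    """Number of trainable parameters in a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# Two-FC detection head of the BFEN+2FC variant, operating on flattened 7x7x256 ROI features.
two_fc_head = nn.Sequential(nn.Flatten(),
                            nn.Linear(7 * 7 * 256, 4096), nn.ReLU(inplace=True),
                            nn.Linear(4096, 4096), nn.ReLU(inplace=True))
print(count_params(two_fc_head))   # ~68.2M, close to the 'DetNet' entry of BFEN+2FC in Table 3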
4.2.2. Detection evaluation

In this part, we analyze each component's contribution to the final detection performance. For detection, we fed the top 300 proposals from BFEN to the detection network, and boxes with detection scores less than 0.1 were then filtered out. The final results were obtained by applying NMS with a threshold of 0.7.
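This post-processing amounts to a score filter followed by NMS; a minimal sketch using torchvision's nms operator is given below, with the (x1, y1, x2, y2) box format assumed.

from torchvision.ops import nms

def postprocess(boxes, scores, score_thr=0.1, nms_thr=0.7):
    """Drop boxes with detection score below 0.1, then apply NMS with IoU threshold 0.7."""
    keep = scores >= score_thr
    boxes, scores = boxes[keep], scores[keep]
    kept = nms(boxes, scores, nms_thr)   # indices of the retained boxes, highest scores first
    return boxes[kept], scores[kept]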
Table 3 reports the results on the DETRAC validation set; the first two rows are results from the EB paper [7]. We also reimplemented a variant of Faster RCNN according to [37], with the same backbone network (ResNet-50) and default anchors as ours, denoted as Faster RCNN∗ in Table 3. Firstly, we tested BFEN alone as a detector; secondly, we replaced the STM blocks in SLPN with two fully-connected layers with 4096 nodes (commonly adopted in two-stage detectors), which was jointly trained with BFEN (denoted as BFEN+2FC in the fifth row of Table 3). Then we jointly trained BFEN and SLPN from scratch (denoted as BFEN+SLPN) and reported the results in row 6; finally, we jointly trained our model with the proposal network warm-up strategy described in Section 3.5, and the results are reported in the last row.

The results indicate that the proposed BFEN can already work well as a detector, with a competitive mAP of 82.22% on the whole validation set, beating the two-stage Faster RCNN∗ baseline by approximately 10% in mAP even without a second stage for proposal refinement, which shows the high quality of the generated proposals. Combining BFEN with a detection sub-network consisting of the two fully-connected layers adopted in the original Faster RCNN [5] slightly improves the detection performance of the single-stage BFEN by less than 1% mAP, but introduces a considerably heavy burden on parameters (approximately two times larger than the BFEN). In contrast, jointly training BFEN with the proposed SLPN achieves an mAP of 87.19%, bringing a performance gain of approximately 5% in mAP with merely 0.98M additional parameters and leading to a highly light-weight yet effective detection sub-network compared to the two-FC counterpart; this indicates that the spatial layout of features is equally important and thus should be preserved in the detection sub-network. Finally, the proposal network warm-up strategy further boosts the performance to an mAP of 88.71%, significantly outperforming the Faster RCNN∗ baseline by 16.5% in mAP and performing the best across all the subsets under different weather conditions.

The proposed SLPN helps capture the spatial layout of the ROI features, which is crucial for tiny vehicle localization. To demonstrate this, we report comparisons of BFEN+2FC and BFEN+SLPN evaluated by the F-score in Table 4. The proposed SLPN performs especially well on the Tiny subset, outperforming its 2FC counterpart by a large margin.

4.3. Comparisons to the state of the arts

4.3.1. DETRAC [8]

The results of the proposed method trained with both the training and validation sets were submitted to the DETRAC benchmark. On the DETRAC test set, comparisons to previously published and unpublished methods are reported in Table 5. The proposed method achieves a new state of the art on the leaderboard, and ranks first on almost all subsets except for the easy and night subsets. Notably, we achieved significant overall improvements of 8.3% and 6.4% mAP over the state-of-the-art EB [7] and RFCN [6], respectively.
Table 4
Comparisons of the detection networks evaluated with the F-score. 'P' and 'R' represent the precision and recall of the different methods.

            Tiny                      Small                     Medium                    Large
Method      P       R       F-score   P       R       F-score   P       R       F-score   P       R       F-score
BFEN+2FC    0.0222  0.2545  0.0409    0.3346  0.9484  0.4947    0.5196  0.9791  0.6789    0.7302  0.9636  0.8308
BFEN+SLPN   0.0354  0.2400  0.0617    0.3322  0.9306  0.4896    0.5070  0.9729  0.6666    0.7452  0.9695  0.8427
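As a reminder of the metric used in Table 4, the F-score is the harmonic mean of precision and recall; a one-line helper and a check against a table entry:

def f_score(precision, recall):
    """Harmonic mean of precision and recall, as reported in Table 4."""
    return 2 * precision * recall / (precision + recall)

# Example: f_score(0.0354, 0.2400) ≈ 0.0617, matching the BFEN+SLPN 'Tiny' entry in Table 4.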
Table 5
mAP (%) results on the DETRAC leaderboard ('∗' represents unpublished works). Red and green indicate the best and second best performance [41].
Fig. 4. Precision-recall curves of state-of-the-art methods on the full, easy, medium and hard subsets of the DETRAC test set.

Table 6
Comparisons to MSCNN [10] on the KITTI validation dataset. 'Feat. Up.' means the feature maps are upsampled before ROI pooling. 'Context' means the region-wise features are enhanced with context. 'Dim. Reduce' means the dimension of the region-wise features is reduced. The input size at '× 1' is 384 × 1280. Red and green indicate the best and second best performance.
Table 7
Comparisons to previously published results on the KITTI benchmark test set. Methods are ranked by mAP on the 'Moderate' set.

Method               Time         Environment            mAP (%)
                                                          Easy     Moderate   Hard
Faster RCNN [5]      2.0 s/img    [email protected] Ghz    87.80    79.11      70.19
RefineNet [42]       0.2 s/img    [email protected] Ghz    90.16    79.21      65.71
MV3D (LIDAR) [43]    0.3 s/img    [email protected] Ghz    87.00    79.24      78.16
SDP+CRF (ft) [44]    0.6 s/img    [email protected] Ghz    90.39    81.33      70.33
YOLOv2 [25]          0.05 s/img   [email protected] Ghz    88.01    85.65      74.16
Mono3D [45]          4.2 s/img    [email protected] Ghz    90.27    87.86      78.09
MM-MRFC [46]         0.05 s/img   [email protected] Ghz    90.93    88.20      78.02
3DOP [47]            3.0 s/img    [email protected] Ghz    90.09    88.34      78.39
DeepStereoOP [48]    3.4 s/img    [email protected] Ghz    90.34    88.75      79.39
MSCNN [10]           0.4 s/img    [email protected] Ghz    90.46    88.83      74.76
SubCNN [49]          2.0 s/img    [email protected] Ghz    90.75    88.86      79.24
Deep3DBox [50]       1.5 s/img    [email protected] Ghz    90.47    88.86      77.60
MV3D [43]            0.4 s/img    [email protected] Ghz    90.53    89.17      80.16
PRC [51]             3.6 s/img    [email protected] Ghz    90.61    90.22      87.44
Ours                 0.3 s/img    [email protected] Ghz    90.57    89.23      79.31
Fig. 5. Examples of detection results by the proposed method on the DETRAC and KITTI test sets. In both (a) and (b), the results of the first row are from DETRAC and the last two rows are from KITTI. Best viewed in color.
Fig. 4 further shows the precision-recall curves of state-of-the-art methods; it can also be observed that the proposed method achieves the best performance across all difficulty settings and under different weather conditions.

4.3.2. KITTI [13]

Experiments were first conducted on the same validation split as in MSCNN [10]. As can be seen from Table 6, without any auxiliary techniques such as the feature map upsampling or context encoding adopted in MSCNN [10], our method achieves the best performance on all three difficulty levels compared with MSCNN [10]. Notably, our method performs particularly well on the 'Hard' subset. Even without input upsampling, our method achieves an mAP of 78.10%, which is competitive with MSCNN [10] with input upsampling (75.54% in mAP). This demonstrates the discriminative ability of our method in handling objects with small size, heavy occlusion and truncation.

The results of our method trained on the entire training dataset were submitted to the KITTI leaderboard. A comparison to 14 recently published approaches is given in Table 7. It is clear that
our method achieves accuracy (89.23%) comparable to the state of the art on the moderate case while running at a relatively fast speed (0.3 s/img). Our method underperforms the best competitor (PRC [51]) by approximately 1% in mAP but runs almost 12 times faster than PRC [51].

In Fig. 5, we visualize some examples of detection results on the DETRAC and KITTI test sets, where both successful detections and failure cases are presented to qualitatively evaluate the proposed method. It is clear that the proposed method is strong enough to detect vehicles with a large variance of scales, especially the tiny ones that are far away from the cameras. It can also robustly localize vehicles of various orientations and truncation levels under different situations such as blur, rain, night and so on. However, the detection results are sometimes unsatisfactory in crowded scenes, and 'close but not accurate' false positives are likely to happen where vehicles are adjacent to each other. Some additional techniques like spatial context encoding [10] may be useful to address these problems. Generally speaking, the proposed method performs reasonably well on vehicles of various scales and can be flexibly extended and enhanced with other techniques, which is left for our future work.

5. Conclusions

Towards improving the performance of tiny vehicle detection, a backward feature enhancement network is proposed and demonstrated to be particularly effective in generating high-recall proposals. Combined with the proposed spatial layout preserving network, it provides a large performance boost over a basic network for tiny vehicle detection. Though competitive in terms of detection accuracy, the proposed method runs at 0.2 s per image of 540 × 960 pixels, which is unsatisfactory for real-time requirements. In future work, we would like to incorporate network acceleration strategies (e.g., network pruning, teacher-student mimicking) into the proposed method to achieve real-time speed while maintaining the detection accuracy.

Acknowledgment

This work was supported by the National Key Research and Development Plan (Grant No. 2016YFC0801003) and the Chinese National Natural Science Foundation Project #61672521.

References

[1] S. Liao, A. Jain, S. Li, A fast and accurate unconstrained face detector, IEEE Trans. Pattern Anal. Mach. Intell. 38 (2) (2016) 211–223.
[2] L. Bourdev, J. Brandt, Robust object detection via soft cascade, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), 2, IEEE, 2005, pp. 236–243.
[3] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
[4] R. Girshick, Fast R-CNN, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
[5] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, in: Proceedings of the Advances in Neural Information Processing Systems, 2015, pp. 91–99.
[6] J. Dai, Y. Li, K. He, J. Sun, R-FCN: object detection via region-based fully convolutional networks, in: Proceedings of the Advances in Neural Information Processing Systems, 2016, pp. 379–387.
[7] L. Wang, Y. Lu, H. Wang, et al., Evolving boxes for fast vehicle detection, in: International Conference on Multimedia and Expo, 2017, pp. 1135–1140.
[8] L. Wen, D. Du, Z. Cai, Z. Lei, M.-C. Chang, H. Qi, J. Lim, M.-H. Yang, S. Lyu, UA-DETRAC: a new benchmark and protocol for multi-object detection and tracking, 2015. https://arxiv.org/abs/1511.04136.
[9] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A. Berg, SSD: single shot multibox detector, in: Proceedings of the European Conference on Computer Vision, Springer, 2016, pp. 21–37.
[10] Z. Cai, Q. Fan, R.S. Feris, N. Vasconcelos, A unified multi-scale deep convolutional neural network for fast object detection, in: Proceedings of the European Conference on Computer Vision, Springer, 2016, pp. 354–370.
[11] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, S. Belongie, Feature pyramid networks for object detection, in: IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 936–944.
[12] S. Xie, R. Girshick, P. Dollár, Z. Tu, K. He, Aggregated residual transformations for deep neural networks, in: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2017, pp. 5987–5995.
[13] A. Geiger, P. Lenz, R. Urtasun, Are we ready for autonomous driving? The KITTI vision benchmark suite, in: Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2012, pp. 3354–3361.
[14] P. Dollár, R. Appel, S. Belongie, P. Perona, Fast feature pyramids for object detection, IEEE Trans. Pattern Anal. Mach. Intell. 36 (8) (2014) 1532–1545.
[15] P. Felzenszwalb, R. Girshick, D. McAllester, D. Ramanan, Object detection with discriminatively trained part-based models, IEEE Trans. Pattern Anal. Mach. Intell. 32 (9) (2010) 1627–1645.
[16] X. Song, T. Wu, Y. Jia, S.-C. Zhu, Discriminatively trained and-or tree models for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3278–3285.
[17] T. Kong, A. Yao, Y. Chen, F. Sun, Hypernet: towards accurate region proposal generation and joint object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 845–853.
[18] H. Lee, S. Eum, H. Kwon, ME R-CNN: multi-expert region-based CNN for object detection, 2017. https://arxiv.org/abs/1704.01069v1.
[19] A. Shrivastava, A. Gupta, R. Girshick, Training region-based object detectors with online hard example mining, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 761–769.
[20] X. Wang, A. Shrivastava, A. Gupta, A-Fast-RCNN: hard positive generation via adversary for object detection, in: IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3039–3048.
[21] S. Bell, C. Lawrence Zitnick, K. Bala, R. Girshick, Inside-outside net: detecting objects in context with skip pooling and recurrent neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2874–2883.
[22] S. Gidaris, N. Komodakis, Object detection via a multi-region and semantic segmentation-aware CNN model, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1134–1142.
[23] A. Shrivastava, A. Gupta, Contextual priming and feedback for Faster R-CNN, in: Proceedings of the European Conference on Computer Vision, Springer, 2016, pp. 330–348.
[24] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: unified, real-time object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[25] J. Redmon, A. Farhadi, YOLO9000: better, faster, stronger, in: IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6517–6525.
[26] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: International Conference on Computer Vision, 2017, pp. 2999–3007.
[27] T. Kong, F. Sun, A. Yao, H. Liu, M. Lu, Y. Chen, RON: reverse connection with objectness prior networks for object detection, in: IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5244–5252.
[28] S. Ren, K. He, R. Girshick, X. Zhang, J. Sun, Object detection networks on convolutional feature maps, IEEE Trans. Pattern Anal. Mach. Intell. 39 (7) (2017) 1476–1481.
[29] Y. Yuan, Z. Xiong, Q. Wang, An incremental framework for video-based traffic sign detection, tracking, and recognition, IEEE Trans. Intell. Transp. Syst. 18 (7) (2017) 1918–1929.
[30] Q. Wang, Z. Yuan, Q. Du, X. Li, GETNET: a general end-to-end 2-D CNN framework for hyperspectral image change detection, IEEE Trans. Geosci. Remote Sens. 57 (1) (2019) 3–13.
[31] Q. Wang, J. Gao, Y. Yuan, A joint convolutional neural networks and context transfer for street scenes labeling, IEEE Trans. Intell. Transp. Syst. 19 (5) (2018) 1457–1470.
[32] L. Lin, R. Zhang, X. Duan, Adaptive scene category discovery with generative learning and compositional sampling, IEEE Trans. Circuits Syst. Video Technol. 25 (2) (2015) 251–260.
[33] R. Zhang, L. Lin, G. Wang, M. Wang, W. Zuo, Hierarchical scene parsing by weakly supervised learning with image descriptions, IEEE Trans. Pattern Anal. Mach. Intell. 41 (3) (2019) 596–610.
[34] L. Lin, X. Wang, W. Yang, J. Lai, Discriminatively trained and-or graph models for object shape detection, IEEE Trans. Pattern Anal. Mach. Intell. 37 (5) (2015) 959–972.
[35] P. Luo, J. Ren, Z. Peng, Differentiable learning-to-normalize via switchable normalization, 2018. https://arxiv.org/abs/1806.10779.
[36] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, 2014. https://arxiv.org/abs/1409.1556.
[37] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[38] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, F.-F. Li, ImageNet: a large-scale hierarchical image database, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), IEEE, 2009, pp. 248–255.
[39] S. Zhang, R. Benenson, B. Schiele, CityPersons: a diverse dataset for pedestrian detection, in: IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4457–4465.
[40] J. Hosang, R. Benenson, P. Dollár, B. Schiele, What makes for effective detection proposals? IEEE Trans. Pattern Anal. Mach. Intell. 38 (4) (2016) 814–830.
[41] Z. Cai, M. Saberian, N. Vasconcelos, Learning complexity-aware cascades for deep pedestrian detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3361–3369.
[42] R.N. Rajaram, E. Ohn-Bar, M.M. Trivedi, RefineNet: iterative refinement for accurate object localization, in: Proceedings of the IEEE 19th International Conference on Intelligent Transportation Systems (ITSC), IEEE, 2016, pp. 1528–1533.
[43] X. Chen, H. Ma, J. Wan, B. Li, T. Xia, Multi-view 3D object detection network for autonomous driving, in: Proceedings of the IEEE CVPR, 1, 2017, p. 3.
[44] F. Yang, W. Choi, Y. Lin, Exploit all the layers: fast and accurate CNN object detector with scale dependent pooling and cascaded rejection classifiers, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2129–2137.
[45] X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, R. Urtasun, Monocular 3D object detection for autonomous driving, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2147–2156.
[46] A.D. Costea, R. Varga, S. Nedevschi, Fast boosting based detection using scale invariant multimodal multiresolution filtered features, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2017, pp. 993–1002.
[47] X. Chen, K. Kundu, Y. Zhu, A.G. Berneshawi, H. Ma, S. Fidler, R. Urtasun, 3D object proposals for accurate object class detection, in: Proceedings of the Advances in Neural Information Processing Systems, 2015, pp. 424–432.
[48] C.C. Pham, J.W. Jeon, Robust object proposals re-ranking for object detection in autonomous driving using convolutional neural networks, Signal Process.: Image Commun. 53 (2017) 110–122.
[49] Y. Xiang, W. Choi, Y. Lin, S. Savarese, Subcategory-aware convolutional neural networks for object proposals and detection, in: Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2017, pp. 924–933.
[50] A. Mousavian, D. Anguelov, J. Flynn, J. Košecká, 3D bounding box estimation using deep learning and geometry, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2017, pp. 5632–5640.
[51] J. Ren, X. Chen, J. Liu, W. Sun, J. Pang, Q. Yan, Y.-W. Tai, L. Xu, Accurate single stage detector using recurrent rolling convolution, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2017, pp. 752–760.
Wei Liu received the B.S. and M.S. degrees from Aviation University of Air Force, Changchun, China, in 2013 and 2015, respectively. He is currently a Ph.D. student in the College of Electronic Science, National University of Defense Technology, Changsha, China, supervised by Professor Hu. Since 2016 he has been working as an intern in the Center for Biometrics and Security Research & National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China, supervised by Professor Liao. His research interests include object detection, visual object tracking and domain adaptation.

Shengcai Liao received the B.S. degree in mathematics and applied mathematics from Sun Yat-sen University in 2005 and the Ph.D. degree from CASIA in 2010. He was a Post-Doctoral Fellow in the Department of Computer Science and Engineering, Michigan State University, during 2010–2012. He is a Senior Member of IEEE and a Member of the Technical Committee on Computer Vision, CCF. His research interests include computer vision and pattern recognition, with a focus on face recognition, object detection and person re-identification. He was awarded the Best Student Paper Award at ICB 2006, ICB 2015 and CCBR 2016, and the Best Paper Award at ICB 2007. He was also awarded the Best Reviewer Award at IJCB 2014. He served as an Area Chair for ICPR 2016, ICB 2016 and ICB 2018, and as a PC member for ICCV, CVPR, ECCV, etc.

Weidong Hu received the B.S. degree in microwave technology and the M.S. and Ph.D. degrees in communication and electronic systems from the National University of Defense Technology, Changsha, China, in 1990, 1994 and 1997, respectively. He is currently a professor in the College of Electronic Science, National University of Defense Technology. His research interests include radar signal and data processing.