Using multi-label classification to improve object detection


Tao Gong, Bin Liu∗, Qi Chu, Nenghai Yu

CAS Key Laboratory of Electromagnetic Space Information, University of Science and Technology of China, China

Article history: Received 20 September 2018; Revised 20 July 2019; Accepted 15 August 2019; Available online xxx. Communicated by Dr. Ma Jiayi.

Keywords: Object detection; Multi-label classification; Feature fusion; Deep learning

Abstract

In this paper, a novel multi-task framework for object detection is proposed. The framework uses multi-label classification as an auxiliary task to improve object detection, and can be trained and tested end-to-end. The object detection branch adopts the R-FCN method to solve the object detection task. The multi-label branch uses an attention mechanism to solve the multi-label classification task. The features generated by the attention mechanism in the multi-label branch contain rough localization information of the objects. Thus, the features can be useful for object detection. Both the box-level features and the image-level features of the multi-label branch are fused to improve the accuracy of object detection. The proposed framework does not require any extra annotation, since the ground truth of the multi-label classification can be directly obtained from the bounding box annotations. This is different from other multi-task frameworks such as StuffNet and Mask R-CNN, which need extra semantic segmentation and instance segmentation annotations. Extensive experiments on PASCAL VOC 2007, PASCAL VOC 2012 and MS COCO demonstrate the effectiveness of the proposed approach. Code has been made publicly available at: https://github.com/GT9505/MONet.

1. Introduction

Object detection, a fundamental and important task in the computer vision field, aims at automatically localizing instances of real-world objects within an image. It has many important applications. For example, image analysis and intelligent video surveillance are two of the major research topics in the computer vision field, and object detection plays a key role in both, since objects must first be detected before any further processing. Over the last decade, remarkable progress has been made in object detection and recognition [1–7], especially benefiting from the rapid development of deep neural network based methods [8–16]. Among them, one of the most influential methods is R-CNN [8]. R-CNN uses deep convolutional neural networks (ConvNets) to classify region proposals generated by Selective Search [17] or Edge Boxes [18]. As a major improvement over R-CNN, in order to reduce the computational time, Fast R-CNN [9] adopts region of interest pooling (ROI pooling) to extract the features of region proposals from the feature map of the entire image. Faster R-CNN [10] proposes the region proposal network (RPN) to generate region proposals and combines it with Fast R-CNN into a single network.




R-FCN [11] replaces the ROI pooling in Faster R-CNN with position-sensitive ROI pooling (PS ROI pooling) to acquire the scores of the proposals. Specifically, R-FCN uses a shared fully convolutional network to extract a shared object detection feature map, and then places two sibling convolutional layers, followed by two PS ROI pooling layers, on top of the object detection feature map for region classification and bounding box regression respectively. Most of the above methods treat object detection as a multi-task problem with region classification and bounding box regression. However, all of these methods ignore the multi-label supervision information provided in the object detection annotations. In fact, if the location information in the object detection annotation files is ignored and only the category information is kept, the multi-label annotations of the images can be directly acquired. It has been shown in the literature that the features learned for classification can contain rough localization information of the objects, and that these features can be useful for object detection. For example, it is shown in [19] that by revisiting the global average pooling layer [20] to generate the class activation mapping (CAM), the features of CAM contain rough localization information of objects, despite the network only being trained for a classification task. It is also shown in SRN [21] that using an attention mechanism [22] with only image-level labels makes the network pay more attention to the object-related regions. The attention mechanism can enrich the features of classification with more localization information than CAM.


Fig. 1. The visualization of the object detection feature map and the multi-label feature map. Best viewed in color.

In order to clearly show the benefits of fusing the multi-label information, the results generated in Section 3 are borrowed to visualize the object detection feature map and the multi-label feature map. In Fig. 1, the pixels of the feature map are averaged across channels to obtain a one-channel feature map, which is used to plot the heat map. As the color gradually changes from blue to red, the activation value increases. Column (a) shows input images picked from PASCAL VOC 2007. Column (b) is the feature map of the object detection branch. Column (c) is the feature map of the multi-label classification branch. As shown in Fig. 1, compared with the object detection feature map, the multi-label feature map also has high activation values inside and around the object areas and low activation values in the background areas. Therefore, a multi-task framework that uses multi-label classification as an auxiliary task to improve object detection can be constructed. In fact, multi-task learning is not new at all: many recent papers [15,23–25] handle semantic segmentation or instance segmentation in an object detection framework, where each sub-task helps the others. However, semantic segmentation annotations and instance segmentation annotations are much harder to acquire than bounding box annotations. Since the multi-label annotations can be directly obtained from the object detection annotations, the proposed method does not require any extra annotation. Although a weakly supervised semantic segmentation method, which uses the object detection annotations to generate weak semantic segmentation annotations, is proposed in [15], the labels in those weak annotations are noisy and are not suitable for training the network. In this paper, a novel deep convolutional neural network with two branches, an object detection branch and a multi-label branch, is proposed. The object detection branch adopts the R-FCN [11] algorithm as mentioned before. The multi-label branch solves multi-label classification as an auxiliary task to improve the accuracy of object detection. Specifically, an attention mechanism is used in the multi-label branch in order to generate rough localization information of the objects. Since the multi-label feature map contains rough localization information of the objects, ROI pooling is used to fuse the box-level features, and a gate module is designed to fuse the image-level features, from the multi-label feature map into the object detection feature map. Since Multi-label classification is used to improve Object detection, the proposed approach is named MONet.

In summary, the main contributions of this paper are as follows:
1. A novel multi-task framework, which contains the object detection task and the multi-label classification task, is proposed for object detection. To the best of our knowledge, it is the first framework that solves the multi-label classification task and the object detection task simultaneously in one deep convolutional neural network.
2. The features of the multi-label branch are fused into the features of the object detection branch in order to improve the accuracy of object detection. Specifically, ROI pooling is used to fuse the box-level features, and a gate module is designed to fuse the image-level features, from the multi-label features into the object detection features.
3. The proposed method outperforms other state-of-the-art methods with an mAP of 83.6% on PASCAL VOC 2007 and an mAP of 81.0% on PASCAL VOC 2012. The proposed method outperforms R-FCN by 5.5 points with an AP of 35.4% on MS COCO.

The rest of the paper is organized as follows. Section 2 describes the related works. Section 3 introduces the technical details of the proposed MONet: an overview of the proposed method is provided in Section 3.1, the object detection branch is briefly introduced in Section 3.2, and the multi-label branch is described in Section 3.3. Section 3.4 describes how to fuse the box-level features from the multi-label feature map into the object detection branch. The gate module, which fuses the image-level features from the multi-label feature map into the object detection feature map, is described in Section 3.5. Section 3.6 introduces the multi-task loss. Extensive experiments are conducted in Section 4 to show the effectiveness of the proposed method. Section 5 concludes the paper and summarizes the characteristics of the proposed MONet.

2. Related work

2.1. Deep ConvNet for object detection

With the development of deep ConvNets, object detectors using deep ConvNets [8–13] have improved the accuracy of object detection dramatically. R-CNN [8] feeds the scale-normalized region proposals generated by Selective Search [17] or Edge Boxes [18] through a ConvNet to classify the region proposals.


SPPNet [26] adopts a spatial pyramid pooling (SPP) layer to extract a fixed-length feature for proposals of different scales from feature maps computed at a single scale. Fast R-CNN [9] combines region classification and bounding box regression as a multi-task problem in a single network. Faster R-CNN [10] integrates the generation of region proposals into the Fast R-CNN framework and proposes an object detection framework that can be trained and tested end-to-end. R-FCN [11] replaces ROI pooling [10] with PS ROI pooling to acquire the scores of the region proposals. Cascade R-CNN [27] builds multi-stage box classification and multi-stage box regression to obtain better localization of the objects. The above algorithms first require many proposals to be generated, which slows them down. Thus, YOLO [13] abandons the step of generating proposals and simultaneously regresses and classifies the boxes that contain the objects. SSD [12] uses multi-layer feature maps to detect objects of different scales. RetinaNet [28] proposes the focal loss to deal with the extreme imbalance between positives and negatives in SSD. The CNN models proposed in [29–31] focus on enriching the representation power of CNNs by introducing additional constraints. All of the above approaches ignore the multi-label supervision information provided in the object detection annotation files, while the proposed method constructs a multi-label branch as an auxiliary task to improve the accuracy of object detection.

2.2. Multi-task for improving object detection

The combination of object detection with similar tasks can improve the accuracy of object detection. For example, StuffNet [23] combines semantic segmentation with object detection, while Mask R-CNN [24] and MNC [25] combine instance segmentation with object detection. These methods usually need extra pixel-level annotations, while the proposed approach does not require any extra annotation. A weakly supervised semantic segmentation method, which uses the object detection annotations to generate weak semantic segmentation annotations, is developed in [15] to improve the accuracy of object detection. However, the labels in the weak semantic segmentation annotations are noisy and are not suitable for training the network.


2.3. Feature fusing for improving object detection

Recent works have shown that fusing different semantic features as input to the final classification layer of an object detector can improve the accuracy of object detection. For example, HyperNet [32], RON [33], FPN [34] and RefineDet [35] fuse feature maps from different levels to provide more useful feature maps for object detection. HyperLearner [36] explores the possibility of aggregating extra features such as edge, segmentation and heatmap into a CNN-based pedestrian detection framework to improve the accuracy of pedestrian detection. PFPNet [37] uses an SPP layer to generate different levels of semantic features and fuses them to boost object detection performance. StuffNet [23] places an ROI pooling layer on top of the semantic segmentation feature map to extract features of each proposal, and then element-wise sums these features with per-region features extracted from the object detection feature map. MR-CNN [38], CC-Net [39] and MultiPath-Net [40] fuse context information using regions with different resolutions for each proposal. GBD-NET [41] combines features with different resolutions and support regions for proposals. CoupleNet [42] combines both the global information and the local information of the proposals to predict the object category. An object relation module is proposed in [43] to fuse the relation features among boxes. The above approaches fuse either image-level features or box-level features into the features of the object detection, while the proposed MONet fuses both the box-level features and the image-level features.

3. The proposed multi-task approach

This section describes the technical details of the proposed MONet.

3.1. Framework overview

The architecture of the proposed approach is shown in Fig. 2. ResNet-101 is a well-known architecture in the classification field. It has 5 convolutional blocks, named Conv1 to Conv5 (Conv1-5), and shows excellent feature extraction ability. Thus, many methods adopt Conv1-5 of ResNet-101 as the backbone network for feature extraction in object detection, such as Faster R-CNN [10], R-FCN [11], Mask R-CNN [24] and so on. Here, Conv1-5 of ResNet-101 [44] pretrained on ImageNet [45] is also used as the backbone network for extracting features of the input image. The same dilation strategy as in R-FCN [11] is used to reduce the effective stride of Conv5 in ResNet-101; thus, the stride of Conv5 is 16. The RPN [10] is placed on top of Conv4 to generate the region proposals and is omitted in Fig. 2 for convenience. On top of Conv5, the network is split into two different branches by two 1 × 1 convolutional layers with 1024 channels: one for object detection and the other for multi-label classification. The multi-label branch uses the attention mechanism to acquire rough localization information with only image-level label supervision. Thus, ROI pooling is used to extract the box-level features from the multi-label feature map in order to acquire the score of each proposal. A gate module is also designed to integrate the multi-label feature map ML into the object detection feature map D in order to generate the new object detection feature map Dnew. The gate module fuses the image-level features from the multi-label classification into the object detection. The object detection branch uses PS ROI pooling to get the score of each proposal, with the new object detection feature map Dnew as the input. Finally, after element-wise summing the scores from PS ROI pooling and ROI pooling, the scores are passed through a softmax function to get the final prediction of each proposal.
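To make the data flow concrete, a minimal shape walk-through of this forward pass is sketched below in PyTorch-style Python. It is purely illustrative (the released code is built on Caffe): random tensors stand in for real layer outputs, and the sizes assume a roughly 600 × 1000 input, the Conv5 stride of 16, and C = 20 classes.

```python
import torch

# Illustrative shape walk-through of the MONet forward pass in Fig. 2.
N, C, H, W = 1, 20, 38, 63                       # Conv5 spatial size ~ input size / 16
x_conv5 = torch.randn(N, 2048, H, W)             # shared ResNet-101 Conv5 feature map
d  = torch.randn(N, 1024, H, W)                  # D:  1x1 conv, object detection branch
ml = torch.randn(N, 1024, H, W)                  # ML: 1x1 conv, multi-label branch
g  = torch.sigmoid(torch.randn(N, 1024, H, W))   # G:  1x1 conv + sigmoid on Conv5 (gate module)
d_new = d + g * ml                               # image-level fusion: D_new = D + G (*) ML

# Box-level path: PS ROI pooling on d_new and ROI pooling on ml each produce
# per-proposal (C + 1)-way scores; their element-wise sum goes through a softmax.
R = 300                                          # number of region proposals from the RPN
ps_scores = torch.randn(R, C + 1)                # from the R-FCN head on d_new
ml_scores = torch.randn(R, C + 1)                # from the ROI-pooling head on ml
final_probs = torch.softmax(ps_scores + ml_scores, dim=1)
print(d_new.shape, final_probs.shape)            # (1, 1024, 38, 63) (300, 21)
```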

3.2. Object detection branch

The R-FCN [11] architecture is adopted as the object detection branch. Specifically, a 1 × 1 convolutional layer with k²(C + 1) channels is placed on top of the new object detection feature map Dnew. Here k means that a proposal is divided into k × k bins (k is set to 7 as suggested by R-FCN), and C + 1 is the number of object categories plus background. Dnew contains the image-level features of the multi-label classification via the gate module. For each category there are in total k² channels, and each channel is responsible for encoding one part of the proposal. Then, PS ROI pooling is used to extract the features of the proposal from the feature map with k²(C + 1) channels; the size of the proposal features is k × k × (C + 1). The scores of each category are determined by voting over the k² values, where average pooling is performed for voting to acquire one part of the scores of the proposal. The bounding box regression is processed in a similar way. Specifically, a sibling 1 × 1 convolutional layer with 4k² channels is appended on top of Dnew. Then, PS ROI pooling is used to produce a 4k²-dimensional vector for each proposal based on the 4k² maps. The 4k²-dimensional vector is voted into a 4-dimensional vector by average pooling, which is used to predict one part of the coordinates of the proposal. This procedure is omitted in Fig. 2 for convenience.
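As a concrete illustration of the shapes involved, the sketch below re-states this R-FCN-style head in PyTorch using torchvision's ps_roi_pool. It is a minimal example under the stated settings (k = 7, stride-16 features), not the authors' Caffe implementation.

```python
import torch.nn as nn
from torchvision.ops import ps_roi_pool

class RFCNHead(nn.Module):
    """Position-sensitive classification and regression head on top of D_new."""
    def __init__(self, in_channels=1024, num_classes=20, k=7):
        super().__init__()
        self.k = k
        # k^2 * (C + 1) position-sensitive score maps for classification
        self.cls_conv = nn.Conv2d(in_channels, k * k * (num_classes + 1), kernel_size=1)
        # 4 * k^2 position-sensitive maps for class-agnostic box regression
        self.reg_conv = nn.Conv2d(in_channels, 4 * k * k, kernel_size=1)

    def forward(self, d_new, rois):
        # d_new: (N, 1024, H, W); rois: (R, 5) = (batch_index, x1, y1, x2, y2) in image coords
        k = self.k
        cls = ps_roi_pool(self.cls_conv(d_new), rois, (k, k), spatial_scale=1 / 16.0)  # (R, C+1, k, k)
        reg = ps_roi_pool(self.reg_conv(d_new), rois, (k, k), spatial_scale=1 / 16.0)  # (R, 4, k, k)
        cls_scores = cls.mean(dim=(2, 3))   # vote over the k^2 bins by average pooling -> (R, C+1)
        box_deltas = reg.mean(dim=(2, 3))   # -> (R, 4)
        return cls_scores, box_deltas
```

The (R, C + 1) scores produced here are later summed element-wise with the scores from the multi-label branch (Section 3.4) before the softmax.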


Fig. 2. The framework of the proposed MONet. Best viewed in color.

3.3. Multi-label branch

The multi-label annotation is defined as follows. Let I denote an image with the ground truth labels $Y = [y_1, y_2, \ldots, y_C]^T$, where $y_l$ is a binary indicator and $l \in \{1, 2, \ldots, C\}$; $y_l = 1$ if image I is tagged with label l, and $y_l = 0$ otherwise. The ground truth labels Y are the multi-label annotation of the image I. If there are one or more objects belonging to label l in the object detection annotation, $y_l$ is set to 1, and otherwise $y_l$ equals 0. Thus, the labels Y can be directly obtained from the object detection annotation. Given image I as the input, the multi-label branch outputs $Y_{pred} = [y^1_{pred}, y^2_{pred}, \ldots, y^C_{pred}]^T$. The multi-label loss $L_{ml}$ is calculated based on $Y_{pred}$ and Y. The multi-label branch is built on top of Conv5 with a 1024-channel 1 × 1 convolutional layer to generate the multi-label feature map ML. Based on $ML \in \mathbb{R}^{W \times H \times 1024}$, where W and H represent the width and the height of the feature map respectively, the label attention values for each label are generated automatically,

$$Z = f_{att}(ML; \omega_{att}), \quad Z \in \mathbb{R}^{W' \times H' \times C}, \qquad (1)$$

where Z is the non-normalized attention map, with each channel corresponding to one label. Z is generated by the function $f_{att}(\cdot)$, which uses ML as the input, and $\omega_{att}$ represents the parameters of $f_{att}(\cdot)$. $W'$ and $H'$ represent the width and the height of Z respectively. Following [21,22], Z is spatially normalized with the softmax function to obtain the normalized attention map $A \in \mathbb{R}^{W' \times H' \times C}$,

$$a^l_{i,j} = \frac{\exp(z^l_{i,j})}{\sum_{i,j} \exp(z^l_{i,j})}, \quad a^l_{i,j} \in [0, 1], \qquad (2)$$

where $z^l_{i,j}$ and $a^l_{i,j}$ represent the non-normalized and normalized attention values at position (i, j) in Z and A for label l, respectively. Intuitively, if label l is tagged to the input image, the image regions related to it in A should be assigned higher attention values. A convolutional layer with C channels is used to model the function $f_{att}$. The kernel size of the convolutional layer is 7 × 7 in order to capture a large enough receptive field. However, since the resolution of the input image is usually around 600 × 1000 pixels and the stride of Conv5 in ResNet-101 is 16, the resolution of Z would be around 40 × 60 pixels, which is too large for performing the softmax function in Eq. (2). Thus, an average pooling layer with kernel size 4 × 4 and stride 4 is added in front of the 7 × 7 convolutional layer to downsample ML. Finally, the function $f_{att}$ is modeled as the average pooling layer followed by the convolutional layer described above. Then, the confidence map $CM \in \mathbb{R}^{W' \times H' \times C}$, with each channel corresponding to one label, is generated by the function $f_{conf}(\cdot)$, which also uses ML as the input,

$$CM = f_{conf}(ML; \omega_{conf}), \quad CM \in \mathbb{R}^{W' \times H' \times C}, \qquad (3)$$

where $\omega_{conf}$ represents the parameters of the function $f_{conf}(\cdot)$. Similarly, the function $f_{conf}$ is modeled as an average pooling layer (kernel size 4 × 4 and stride 4) followed by a convolutional layer (kernel size 7 × 7 and C channels). Since the attention map A has higher values at the object-related regions and $\sum_{i,j} a^l_{i,j} = 1$ for all l, the attention map A can be used to weight the confidence map CM. The weighted confidence map is summed to get the final score $y^l_{score}$ for each label l,

$$y^l_{score} = \sum_{i,j} cm^l_{i,j}\, a^l_{i,j}, \quad y^l_{score} \in \mathbb{R}, \qquad (4)$$

where $cm^l_{i,j}$ is the confidence value at (i, j) in CM for label l. The score $y^l_{score}$ is normalized by the sigmoid function to generate the final probability $y^l_{pred}$ for each label l.
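For illustration, Eqs. (1)–(4) can be written compactly as the PyTorch-style module below. This is not the authors' Caffe code, and the 7 × 7 convolution padding as well as the small target-building helper are assumptions made for the example.

```python
import torch
import torch.nn as nn

# Sketch of the attention-based multi-label head of Eqs. (1)-(4). f_att and
# f_conf are each modeled as a 4x4/stride-4 average pooling followed by a 7x7
# convolution with C output channels; padding=3 (to preserve spatial size) is
# an assumption, as is the target-building helper below.
class MultiLabelHead(nn.Module):
    def __init__(self, in_channels=1024, num_labels=20):
        super().__init__()
        self.f_att = nn.Sequential(nn.AvgPool2d(4, stride=4),
                                   nn.Conv2d(in_channels, num_labels, 7, padding=3))
        self.f_conf = nn.Sequential(nn.AvgPool2d(4, stride=4),
                                    nn.Conv2d(in_channels, num_labels, 7, padding=3))

    def forward(self, ml):
        z = self.f_att(ml)                               # Eq. (1): non-normalized attention, (N, C, H', W')
        n, c = z.shape[:2]
        a = torch.softmax(z.reshape(n, c, -1), dim=-1)   # Eq. (2): spatial softmax per label
        cm = self.f_conf(ml).reshape(n, c, -1)           # Eq. (3): confidence map
        y_score = (cm * a).sum(dim=-1)                   # Eq. (4): attention-weighted sum, (N, C)
        return torch.sigmoid(y_score)                    # y_pred: per-label probabilities

def multilabel_targets(box_classes, num_labels):
    # Image-level ground truth Y derived from the detection annotation:
    # y_l = 1 if at least one box of class l is present in the image.
    y = torch.zeros(num_labels)
    y[torch.as_tensor(box_classes, dtype=torch.long)] = 1.0
    return y
```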


Since the attention mechanism is used in the multi-label branch, the object-related regions should be assigned higher attention values, i.e., the multi-label feature map ML contains rough localization information. The multi-label feature map ML and the object detection feature map D are visualized to show this. As shown in Fig. 1, ML has higher activation values in the object areas than D. For example, in the bottom row of Fig. 1, the person areas to the left of the airplane have low activation values (painted with blue) in the object detection feature map, while the corresponding areas in the multi-label feature map have high activation values (painted with red). Therefore, the features of the person areas in the multi-label feature map offer complementary localization information for the object detection feature map and can be helpful for detecting the person in the image. The activation values in the background areas of ML are lower than those of D. For example, in the top row of Fig. 1, the grass areas (background) have high activation values (painted with cyan) in the object detection feature map, while the corresponding areas in the multi-label feature map have low activation values (painted with dark blue). Thus, the features of the background areas and the object areas are more discriminative in the multi-label feature map than in the object detection feature map, and the features in the multi-label feature map can help localize the objects in the image. Therefore, ML contains rough localization information of the objects and may be helpful for object detection.

3.4. Fusing box-level features

Since the multi-label feature map contains rough localization information, ROI pooling is used to extract it for each proposal. Specifically, a proposal is divided into m × m bins, and an ROI pooling layer is placed on top of the multi-label feature map to extract an m × m × 1024 spatial grid of features for the proposal. Here m is set to 7 following [9]. Then, two 1024-dimensional fully connected layers are used to further extract the features of the proposal. Finally, a (C+1)-dimensional fully connected layer is used to get the classification scores of the proposal. The scores from the object detection branch and the multi-label branch are element-wise summed, and then passed through a softmax function to obtain the final predictions of the proposal. Similarly, a sibling 4-dimensional fully connected layer is used to predict another part of the coordinates of the proposal. The predicted coordinates from the object detection branch and the multi-label branch are also element-wise summed to get the final predicted coordinates of the proposal. The procedure of the bounding box regression is omitted in Fig. 2 for convenience.
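A PyTorch-style sketch of this box-level fusion is given below: ROI pooling on ML, two 1024-dimensional fully connected layers, and sibling (C + 1)-way classification and 4-dimensional regression outputs that are later summed element-wise with the R-FCN outputs. It is only an illustration; in particular, the ReLU activations are an assumption, since the text does not specify them.

```python
import torch.nn as nn
from torchvision.ops import roi_pool

class BoxLevelFusion(nn.Module):
    """Per-proposal scores extracted from the multi-label feature map ML."""
    def __init__(self, in_channels=1024, num_classes=20, m=7):
        super().__init__()
        self.m = m
        self.fc = nn.Sequential(
            nn.Linear(in_channels * m * m, 1024), nn.ReLU(inplace=True),  # ReLUs assumed
            nn.Linear(1024, 1024), nn.ReLU(inplace=True))
        self.cls = nn.Linear(1024, num_classes + 1)   # (C + 1)-way classification scores
        self.reg = nn.Linear(1024, 4)                 # sibling 4-d box regression output

    def forward(self, ml, rois):
        # ml: (N, 1024, H, W); rois: (R, 5) boxes with batch index, image coordinates
        feats = roi_pool(ml, rois, (self.m, self.m), spatial_scale=1 / 16.0)  # (R, 1024, m, m)
        feats = self.fc(feats.flatten(1))
        return self.cls(feats), self.reg(feats)
```

The classification scores and box outputs returned here are added element-wise to those from the PS ROI pooling head of Section 3.2; the summed scores then pass through the softmax.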

Fig. 3. The designed gate module.

3.5. Fusing image-level features

The multi-label classification and the object detection are two similar but different tasks. Thus, not all of the features in the multi-label feature map may be useful, and only the useful features from the multi-label feature map need to be fused into the object detection feature map. Therefore, a gate module is designed, as shown in Fig. 3. Firstly, the gate feature map G is generated as follows,

$$G = \mathrm{sig}(X_{conv5} * w_{gate} + b_{gate}), \quad G \in \mathbb{R}^{W \times H \times 1024}, \qquad (5)$$

where $X_{conv5}$ represents the feature map of Conv5, sig represents the sigmoid function, ∗ is the convolution operation, and $w_{gate}$, $b_{gate}$ are the learnable parameters of the convolution layer. G is implemented with a 1024-channel 1 × 1 convolutional layer followed by a sigmoid layer. With this learnable gate feature map, the useful features can be controlled to pass from ML to D:

$$f_{useful} = G \odot ML, \quad f_{useful} \in \mathbb{R}^{W \times H \times 1024}, \qquad (6)$$

where ⊙ is the element-wise product operation. $f_{useful}$ represents the useful information that will be integrated into D as:

$$D_{new} = D + f_{useful}, \quad D_{new} \in \mathbb{R}^{W \times H \times 1024}, \qquad (7)$$

where $D_{new}$ represents the new object detection feature map that contains useful features from ML.
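A minimal PyTorch-style sketch of Eqs. (5)–(7) follows; the 2048-channel Conv5 input corresponds to ResNet-101, and the code is an illustration rather than the released implementation.

```python
import torch
import torch.nn as nn

class GateModule(nn.Module):
    """Gate the multi-label feature map ML before adding it to the detection map D."""
    def __init__(self, conv5_channels=2048, channels=1024):
        super().__init__()
        self.gate_conv = nn.Conv2d(conv5_channels, channels, kernel_size=1)

    def forward(self, x_conv5, d, ml):
        g = torch.sigmoid(self.gate_conv(x_conv5))   # Eq. (5): G in (0, 1), per position and channel
        f_useful = g * ml                            # Eq. (6): element-wise product G (*) ML
        return d + f_useful                          # Eq. (7): D_new = D + f_useful
```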

3.6. Multi-task loss

A multi-task loss is used to optimize the proposed MONet. It contains the region classification loss, the bounding box regression loss and the multi-label classification loss,

$$L = L_{cls} + \lambda_1 L_{bbox} + \lambda_2 L_{ml}, \qquad (8)$$

where $L_{cls}$, $L_{bbox}$ and $L_{ml}$ represent the classification loss, the bounding box loss and the multi-label classification loss respectively, and $\lambda_1$ and $\lambda_2$ are the weighting factors. The softmax loss is used for $L_{cls}$ following [9]:

$$L_{cls} = -\frac{1}{N}\sum_{i=1}^{N} \log(p_{u_i}), \quad u_i \in \{0, 1, 2, \ldots, C\}, \qquad (9)$$

where N is the number of proposals used in training, $p_{u_i}$ denotes the predicted probability for the true class $u_i$ of proposal i, and $u_i = 0$ denotes that proposal i belongs to the background. The smooth L1 loss is used for $L_{bbox}$ following [9]:

$$L_{bbox} = \frac{1}{N}\sum_{i=1}^{N} T(u_i) \sum_{j \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t_{ij} - t^{*}_{ij}), \qquad (10)$$

in which

$$T(u_i) = \begin{cases} 0, & u_i = 0 \\ 1, & u_i \in \{1, 2, \ldots, C\} \end{cases} \qquad (11)$$

$$\mathrm{smooth}_{L_1}(s) = \begin{cases} 0.5 s^2, & |s| < 1 \\ |s| - 0.5, & |s| \ge 1 \end{cases} \qquad (12)$$

$$\begin{aligned} t_{ix} &= (x_i - x_{ia})/w_{ia}, & t_{iy} &= (y_i - y_{ia})/h_{ia}, & t_{iw} &= \log(w_i/w_{ia}), & t_{ih} &= \log(h_i/h_{ia}), \\ t^{*}_{ix} &= (x^{*}_i - x_{ia})/w_{ia}, & t^{*}_{iy} &= (y^{*}_i - y_{ia})/h_{ia}, & t^{*}_{iw} &= \log(w^{*}_i/w_{ia}), & t^{*}_{ih} &= \log(h^{*}_i/h_{ia}), \end{aligned} \qquad (13)$$

where $x_i$ and $y_i$ denote the center coordinates of the predicted box i, and $w_i$ and $h_i$ denote its width and height respectively. Variables $x_i$, $x_{ia}$ and $x^{*}_i$ are for the predicted box, the proposal box and the ground-truth box, respectively (likewise for y, w, h). This can be thought of as bounding box regression from a proposal box to a nearby ground-truth box. $L_{bbox}$ is omitted in Fig. 2 for convenience. The standard cross-entropy is used to calculate the multi-label classification loss $L_{ml}$ as

$$L_{ml} = -\frac{1}{C}\sum_{l=1}^{C} \left[ y_l \log(y^l_{pred}) + (1 - y_l) \log(1 - y^l_{pred}) \right]. \qquad (14)$$

Equation (8) is optimized by stochastic gradient descent (SGD) [46].
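The sketch below expresses Eqs. (8)–(14) as a single PyTorch-style function for reference; it is a minimal illustration (e.g., the foreground mask plays the role of T(u_i)) rather than the authors' Caffe/OHEM training code.

```python
import torch.nn.functional as F

def monet_loss(cls_scores, labels, box_deltas, box_targets,
               ml_logits, ml_targets, lambda1=1.0, lambda2=0.1):
    # cls_scores: (N, C+1) summed proposal scores; labels: (N,) with 0 = background
    # box_deltas / box_targets: (N, 4) parameterized as in Eq. (13)
    # ml_logits / ml_targets: (C,) image-level scores y_score and binary labels y
    l_cls = F.cross_entropy(cls_scores, labels)                       # Eq. (9)
    fg = labels > 0                                                   # T(u_i) of Eq. (11)
    if fg.any():                                                      # Eqs. (10) and (12)
        l_bbox = F.smooth_l1_loss(box_deltas[fg], box_targets[fg],
                                  reduction="sum") / labels.numel()
    else:
        l_bbox = box_deltas.sum() * 0.0
    l_ml = F.binary_cross_entropy_with_logits(ml_logits, ml_targets)  # Eq. (14)
    return l_cls + lambda1 * l_bbox + lambda2 * l_ml                  # Eq. (8), optimized by SGD
```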

4. Experiments

The proposed method is evaluated on the PASCAL VOC 2007 [47], the PASCAL VOC 2012 and the MS COCO [48] detection benchmarks to make a fair comparison with the state-of-the-art methods. Extensive experiments demonstrate the effectiveness of the proposed approach.


4.1. Datasets and evaluation metrics

4.1.1. PASCAL VOC 2007 and PASCAL VOC 2012
The PASCAL VOC 2007 and PASCAL VOC 2012 datasets contain 20 object categories and consist of 9963 and 22,531 images respectively. Each dataset is divided into train, validation and test subsets. The model evaluated on the PASCAL VOC 2007 test set is trained on the trainval split of PASCAL VOC 2007 (5011 images) and the trainval split of PASCAL VOC 2012 (11,540 images). The model evaluated on the PASCAL VOC 2012 test set is trained on all images of PASCAL VOC 2007 (9963 images) and the trainval split of PASCAL VOC 2012. The mean average precision (mAP), evaluated at IOU = 0.5 following the PASCAL detection challenge protocol, is adopted as the evaluation metric.

4.1.2. MS COCO
The MS COCO dataset has 80 object categories. It consists of a train set with 82,783 images, a validation set with 40,504 images and a test-dev set with 20,288 images. The model evaluated on the MS COCO test-dev set is trained on the train and validation sets of MS COCO (123,287 images). The average precision (AP), evaluated at IOU ∈ [0.5, 0.55, 0.6, . . . , 0.95] following the MS COCO detection challenge protocol, is used as the evaluation metric.

4.2. Implementation details

ResNet-101 [44] pretrained on ImageNet [45] is used to initialize the MONet. All the other newly added layers are initialized by drawing weights from a zero-mean Gaussian distribution with standard deviation 0.01. The code is based on the R-FCN [11] framework built on the Caffe library [49]. Horizontal image flipping is used to augment the training data, and online hard example mining (OHEM) [50] is used to train the proposed model following R-FCN. For PASCAL VOC 2007, a 1-GPU implementation with an effective mini-batch size of 2 (by setting iter_size to 2) is adopted; the whole network is trained for 80k iterations with a learning rate of 0.001 and then for 30k iterations with a learning rate of 0.0001. For PASCAL VOC 2012, 4 GPUs are used to train the proposed model with an effective mini-batch size of 4 (1 per GPU); the network is trained for 60k iterations with a learning rate of 0.001 and then for 20k iterations with a learning rate of 0.0001. For MS COCO, the learning rate is set to 0.002 for 255k iterations and 0.0002 for the next 35k iterations, with an effective mini-batch size of 8. Furthermore, since the MS COCO dataset has more small objects, ROI pooling is replaced by ROI Align [24] to extract more precise features. All models are trained and tested on NVIDIA GeForce Titan Xp GPUs. λ1 is set to 1 and λ2 is set to 0.1 in Eq. (8) for all experiments. Multi-scale training, with the shorter side of images randomly resized within 300–900, is also conducted.
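For convenience, the schedule above can be summarized in one place; the dictionary below simply collects the values stated in the text (the original training uses Caffe solver files, so the layout itself is illustrative, and the GPU count for MS COCO is not stated in the text).

```python
# Training settings of Section 4.2, collected from the text for reference.
TRAIN_CONFIG = {
    "common": {"backbone": "ResNet-101 (ImageNet pretrained)", "init_std": 0.01,
               "ohem": True, "hflip": True, "lambda1": 1.0, "lambda2": 0.1,
               "multi_scale_short_side": (300, 900)},
    "voc07": {"gpus": 1, "effective_batch": 2,                # iter_size = 2
              "lr_schedule": [(80_000, 1e-3), (30_000, 1e-4)]},
    "voc12": {"gpus": 4, "effective_batch": 4,                # 1 image per GPU
              "lr_schedule": [(60_000, 1e-3), (20_000, 1e-4)]},
    "coco":  {"gpus": None, "effective_batch": 8,             # GPU count not stated
              "lr_schedule": [(255_000, 2e-3), (35_000, 2e-4)],
              "roi_op": "roi_align"},                         # ROI pooling replaced by ROI Align
}
```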

4.3. Ablation experiments

Ablation experiments are performed on the PASCAL VOC 2007 dataset to show the effectiveness of the proposed MONet detector. All experiments in this section use single-scale training and testing.

4.3.1. Multi-task
The effectiveness of the proposed multi-task framework is shown first. R-FCN is used as the base model, and the multi-label branch described in Section 3.3 is added to R-FCN, denoted by R-FCN∗.

Table 1. Effects of the different methods described in Section 3, evaluated on the PASCAL VOC 2007 test set.
Model       Multi-label   ROI pooling   Gate module   mAP@0.5 (%)
R-FCN [11]                                            79.5
R-FCN∗      √                                         80.3
R-FCN∗∗     √             √                           81.9
MONet       √             √             √             83.0

Table 2. Effects of different gate values in MONet, evaluated on the PASCAL VOC 2007 test set.
Method                             mAP@0.5 (%)
MONet (gate values equal to 0)     81.9
MONet (gate values equal to 1)     82.1
MONet                              83.0

As shown in Table 1, R-FCN achieves 79.5% mAP on the PASCAL VOC 2007 test set, while R-FCN∗ achieves 80.3% mAP, i.e., a 0.8-point increase compared to R-FCN. This shows that the object detection task can benefit from the multi-label classification task.

4.3.2. ROI pooling
ROI pooling is then added to extract features for each proposal based on the new multi-task framework R-FCN∗, denoted by R-FCN∗∗. Table 1 shows that R-FCN∗∗ achieves 81.9% mAP on the PASCAL VOC 2007 test set, which means that placing ROI pooling on top of the multi-label feature map increases the mAP by 1.6 points compared to R-FCN∗ (81.9% vs 80.3%). This demonstrates that the multi-label feature map contains rough localization information and that placing ROI pooling on the multi-label feature map can extract this localization information for each proposal.

4.3.3. Gate module
The gate module is added to pass the useful features from the multi-label feature map into the object detection feature map, yielding MONet. As shown in Table 1, MONet achieves 83.0% mAP on the PASCAL VOC 2007 test set and increases the mAP by 1.1 points compared to R-FCN∗∗. This shows that the proposed gate module can pass the useful features from the multi-label feature map into the object detection feature map. In order to further show the effectiveness of the gate module, two experiments are conducted in which all gate values are set to 0 and 1, respectively. Setting the gate values to 0 means that the multi-label feature map is not fused into the object detection feature map; setting them to 1 means that the multi-label feature map is simply summed with the object detection feature map. As shown in Table 2, the results are 81.9% and 82.1% mAP on the PASCAL VOC 2007 test set, respectively. This demonstrates that the features in the multi-label feature map are useful for object detection, since simply summing the multi-label feature map with the object detection feature map already yields a 0.2-point gain (82.1% vs 81.9%). However, updating the gate values during training achieves 83.0% mAP, i.e., using the gate module obtains a 0.9-point gain compared with simply summing the two feature maps (83.0% vs 82.1%). This verifies that not all features in the multi-label feature map are useful for object detection and that the gate module can extract the useful ones. Therefore, using the gate module achieves the best result compared to setting all gate values to 0 or 1. Two further experiments are conducted to explore different implementations of the proposed gate module.


Table 3. Different implementations of the proposed gate module, evaluated on the PASCAL VOC 2007 test set.
Method                                    mAP@0.5 (%)
1 × 1 convolutional layer (MONet)         83.0
3 × 3 convolutional layer                 82.9
Stacking two 1 × 1 convolutional layers   82.7


As shown in Fig. 3, the gate module takes three feature maps as input, i.e. the multi-label feature map ML, the object detection feature map D and the gate feature map G. Each of the three feature maps is generated by a 1 × 1 convolutional layer following the Conv5 feature map of ResNet-101. In the first experiment, the 1 × 1 convolutional layer is changed to a 3 × 3 convolutional layer with appropriate padding. As shown in Table 3, the result is 82.9% mAP on the PASCAL VOC 2007 test set. Compared with MONet without the gate module (i.e. R-FCN∗∗ in Table 1), the 3 × 3 convolutional layer obtains a 1.0% mAP gain, while the 1 × 1 convolutional layer (i.e. MONet) obtains a 1.1% mAP gain. Both obtain a similar performance gain, which shows that the key to boosting object detection performance is the architecture of the proposed gate module rather than simply changing the kernel size of the convolutional layer. In the second experiment, two successive 1 × 1 convolutional layers are stacked to generate each feature map. As shown in Table 3, the result is 82.7% mAP on the PASCAL VOC 2007 test set. Compared with MONet without the gate module (i.e. R-FCN∗∗ in Table 1), the stacked convolutional layers obtain a 0.8% mAP gain, while a single convolutional layer (i.e. the proposed MONet) obtains a 1.1% mAP gain. The gain obtained from a single convolutional layer is slightly higher than that from stacking two convolutional layers, which further shows that the key to boosting object detection performance is the architecture of the proposed gate module rather than simply stacking more convolutional layers. Therefore, the key to boosting object detection performance is the architecture of the gate module rather than simply adding more parameters, which shows the superiority of the proposed design.

4.3.4. Parameters, architecture choices, multi-label loss
Two experiments are conducted to show the effectiveness of the multi-label loss, since the proposed MONet has slightly more parameters and different architecture choices compared to R-FCN. In the first experiment, two extra convolutional layers, with kernel sizes 7 × 7 × 1024 and 3 × 3 × 1024 respectively, are added to the prediction sub-network of R-FCN in order to match the number of parameters of MONet. As shown in Table 4, the result is 79.8% mAP on the PASCAL VOC 2007 test set, which demonstrates that simply adding more parameters only yields a very limited gain (0.3 points compared to R-FCN). In the second experiment, the multi-label loss is removed while the rest of the MONet architecture is kept unchanged, so the network is trained only with the region classification loss and the bounding box regression loss. As shown in Table 4, the performance is 81.0% mAP. This shows that the proposed architecture choices bring a 1.2-point gain compared to R-FCN with more parameters, since the major difference between R-FCN with more parameters and MONet without the multi-label loss is the architecture choices. However, MONet obtains 83.0% mAP, which means that the multi-label loss brings a 2.0-point gain compared to MONet without the multi-label loss. Therefore, the multi-label loss brings the largest performance gain compared to more parameters or architecture choices alone.

Table 4. Effects of parameters, architecture choices and the multi-label loss, evaluated on the PASCAL VOC 2007 test set.
Method                        mAP@0.5 (%)
R-FCN [11]                    79.5
R-FCN (more parameters)       79.8
MONet (no multi-label loss)   81.0
MONet                         83.0

4.4. Detection error analysis

The detection errors of the proposed MONet are also analysed on the PASCAL VOC 2007 test set using the tool proposed in [51]. Fig. 4 shows pie charts with the percentage of detections that are correct (Correct), or false positives due to poor localization (Localization), confusion with similar objects (Similar), confusion with other VOC objects (Others), or confusion with background or unlabeled objects (Background). Due to space limitations, the graphs are only shown for challenging classes, i.e. boat, bottle, chair, diningtable and pottedplant. It can be observed that MONet achieves a considerable reduction in the percentage of false positives due to bad localization for these challenging categories. This shows that the multi-label feature map contains rough localization information, and that ROI pooling and the gate module can extract this rough localization information from the multi-label feature map into the object detection branch.

4.5. Visualization and analysis

The feature map of an image is clearly very important for the result of object detection. Since the features of the multi-label classification contain rough localization information of objects and are fused into the features of object detection, the object detection feature map of MONet should be more useful than that of R-FCN. Thus, the object detection feature map is visualized to further show the effectiveness of the proposed method. In Fig. 5, column (a) shows input images picked from PASCAL VOC 2007, and columns (b) and (c) show the object detection feature maps of R-FCN and MONet, respectively. As shown in column (b) of Fig. 5, the spatial locations covered by an object in the object detection feature map of R-FCN usually have high activation values (painted with yellow or red), but these locations are usually concentrated at the centers of objects. In contrast, high activation values in the object areas and low activation values in the background areas can be directly seen in the object detection feature map of MONet in column (c); for example, even the shape of the horse can be observed in the top two rows for MONet. Compared to R-FCN, the features of the object areas and the background areas in the object detection feature map of the proposed MONet are more discriminative. With a more useful object detection feature map, MONet can achieve higher object detection accuracy. Next, the proposed approach is quantitatively evaluated on the PASCAL VOC 2007, PASCAL VOC 2012 and MS COCO detection benchmarks.

4.6. Performance comparisons

The proposed MONet is compared with other state-of-the-art methods on the PASCAL VOC 2007 test set, the PASCAL VOC 2012 test set and the MS COCO test-dev set.

4.6.1. PASCAL VOC 2007
Table 5 shows the detailed comparisons among the proposed MONet, Faster R-CNN and R-FCN. In Table 5, all models use ResNet-101 as the backbone network and are trained on the PASCAL VOC 2007 trainval set union with the PASCAL VOC 2012 trainval set.


Fig. 4. Analysis of top ranked false positives on the PASCAL VOC 2007 test set. Top row: the results of R-FCN. Bottom row: the results of MONet.

Fig. 5. The visualization of the object detection feature map. Best viewed in color.

Table 5. Comparison with Faster R-CNN and R-FCN on the PASCAL VOC 2007 test set. †: multi-scale training.
Method              mAP@0.5 (%)   GPU        Test time (ms/image)
Faster R-CNN [44]   76.4          K40        420
R-FCN [11]          79.5          TITAN Xp   81
R-FCN† [11]         80.5          TITAN Xp   81
MONet               83.0          TITAN Xp   99
MONet†              83.6          TITAN Xp   99

The single MONet model without multi-scale training achieves an mAP of 83.0%, which outperforms R-FCN by 3.5 points. To the best of our knowledge, the proposed MONet achieves the state-of-the-art result on the PASCAL VOC 2007 test set. The inference time of the proposed network is also evaluated using an NVIDIA TITAN Xp GPU (Pascal) along with CUDA 8.0 and cuDNN v5.1. As shown in Table 5, MONet costs 99 ms per image, while R-FCN costs 81 ms per image. Though MONet is slightly slower than R-FCN, it still reaches real-time speed (i.e., 10.1 fps) and achieves a better trade-off between accuracy and speed. In Table 6, the performance of MONet is compared with other state-of-the-art methods. All models are trained on the PASCAL VOC 2007 trainval set union with the PASCAL VOC 2012 trainval set.


Table 6. Results on the PASCAL VOC 2007 test set. †: multi-scale training.
Method            mAP   aero  bike  bird  boat  bottle bus   car   cat   chair cow   table dog   horse mbike persn plant sheep sofa  train tv
YOLO [13]         63.4  –     –     –     –     –      –     –     –     –     –     –     –     –     –     –     –     –     –     –     –
YOLOv2 544 [14]   78.6  –     –     –     –     –      –     –     –     –     –     –     –     –     –     –     –     –     –     –     –
SSD512 [12]       76.8  82.4  84.7  78.4  73.8  53.2   86.2  87.5  86.0  57.8  83.1  70.2  84.9  85.2  83.9  79.7  50.3  77.9  73.9  82.5  75.3
DSSD513 [52]      81.5  86.6  86.2  82.6  74.9  62.5   89.0  88.7  88.8  65.2  87.0  78.7  88.2  89.0  87.5  83.7  51.1  86.3  81.6  85.7  83.7
PFPNet-R512 [37]  82.3  –     –     –     –     –      –     –     –     –     –     –     –     –     –     –     –     –     –     –     –
Faster [44]       76.4  79.8  80.7  76.2  68.3  55.9   85.1  85.3  89.8  56.7  87.8  69.4  88.3  88.9  80.9  78.4  41.7  78.6  79.8  85.3  72.0
R-FCN† [11]       80.5  79.9  87.2  81.5  72.0  69.8   86.8  88.5  89.8  67.0  88.1  74.5  89.8  90.6  79.9  81.2  53.7  81.8  81.5  85.9  79.9
D-R-FCN [53]      82.6  –     –     –     –     –      –     –     –     –     –     –     –     –     –     –     –     –     –     –     –
CoupleNet† [42]   82.7  85.7  87.0  84.8  75.5  73.3   88.8  89.2  89.6  69.8  87.5  76.1  88.9  89.0  87.2  86.2  59.1  83.6  83.4  87.6  80.7
MONet†            83.6  87.2  88.8  82.0  79.8  72.2   88.2  89.0  89.4  70.9  89.0  76.5  89.5  89.3  89.4  86.9  62.0  85.6  85.2  87.9  84.4

Table 7. Results on the PASCAL VOC 2012 test set. †: multi-scale training. §: http://host.robots.ox.ac.uk:8080/anonymous/V0UJHQ.html.
Method            mAP    aero  bike  bird  boat  bottle bus   car   cat   chair cow   table dog   horse mbike persn plant sheep sofa  train tv
YOLO [13]         57.9   77.0  67.2  57.7  38.3  22.7   68.3  55.9  81.4  36.2  60.8  48.5  77.2  72.3  71.3  63.5  28.9  52.2  54.8  73.9  50.8
YOLOv2 544 [14]   73.4   86.3  82.0  74.8  59.2  51.8   79.8  76.5  90.6  52.1  78.2  58.5  89.3  82.5  83.4  81.3  49.1  77.2  62.4  83.8  68.7
SSD512 [12]       74.9   87.4  82.3  75.8  59.0  52.6   81.7  81.5  90.0  55.4  79.0  59.8  88.4  84.3  84.7  83.3  50.2  78.0  66.3  86.3  72.0
DSSD513 [52]      80.0   92.1  86.6  80.3  68.7  58.2   84.3  85.0  94.6  63.3  85.9  65.6  93.0  88.5  87.8  86.4  57.4  85.2  73.4  87.8  76.8
PFPNet-R512 [37]  80.3   –     –     –     –     –      –     –     –     –     –     –     –     –     –     –     –     –     –     –     –
Faster [44]       73.8   86.5  81.6  77.2  58.0  51.0   78.6  76.6  93.2  48.6  80.4  59.0  92.1  85.3  84.8  80.7  48.1  77.3  66.5  84.7  65.6
R-FCN† [11]       77.6   86.9  83.4  81.5  63.8  62.4   81.6  81.1  93.1  58.0  83.8  60.8  92.7  86.0  84.6  84.4  59.0  80.8  68.6  86.1  72.9
CoupleNet† [42]   80.4   89.1  86.7  81.6  71.0  64.4   83.7  83.7  94.0  62.2  84.6  65.6  92.7  89.1  87.3  87.7  64.3  84.1  72.5  88.4  75.3
MONet†            81.0§  90.2  87.0  81.4  71.6  64.1   85.1  84.0  94.7  63.1  85.2  64.8  92.8  89.1  87.6  87.8  66.1  85.9  73.7  89.5  76.6

Table 8. Results on the MS COCO 2015 test-dev set. +++: box refinement [38], context information [44] and multi-scale testing [54]. †: multi-scale training.
Method               Backbone network   AP    AP50  AP75  APS   APM   APL   AR1   AR10  AR100 ARS   ARM   ARL
YOLOv2 [14]          DarkNet19          21.6  44.0  19.2  5.0   22.4  35.5  20.7  31.6  33.3  9.8   36.5  54.4
SSD512 [12]          VGG16              27.7  46.4  26.7  –     –     –     –     –     –     –     –     –
DSSD513 [52]         ResNet-101         33.2  53.3  35.2  13.0  35.4  51.1  28.9  43.5  46.2  21.8  49.1  66.4
RetinaNet† [28]      ResNet-101-FPN     39.1  59.1  42.3  21.8  42.7  50.2  –     –     –     –     –     –
PFPNet-R512 [37]     VGG16              35.2  57.6  37.9  18.7  38.6  45.9  –     –     –     –     –     –
Faster+++ [44]       ResNet-101         34.9  55.7  –     15.6  38.7  50.9  –     –     –     –     –     –
R-FCN [11]           ResNet-101         29.2  51.5  –     10.3  32.4  43.4  –     –     –     –     –     –
R-FCN† [11]          ResNet-101         29.9  51.9  –     10.8  32.8  45.0  –     –     –     –     –     –
CoupleNet† [42]      ResNet-101         34.4  54.8  37.2  13.4  38.1  50.8  30.0  45.0  46.4  20.7  53.1  68.5
D-R-FCN [53]         ResNet-101         34.5  55.0  –     14.0  37.7  50.3  –     –     –     –     –     –
Cascade R-CNN [27]   ResNet-101-FPN     42.8  62.1  46.3  23.7  45.5  55.2  –     –     –     –     –     –
MONet                ResNet-101         34.7  54.4  37.6  14.4  38.7  50.4  30.5  46.1  48.0  23.1  55.6  68.9
MONet†               ResNet-101         35.4  55.0  38.6  16.1  39.5  49.4  30.7  47.0  48.9  25.3  56.3  69.0

All models are tested on the PASCAL VOC 2007 test set. The base networks of YOLO and YOLOv2 are GoogleNet [55] and DarkNet19 respectively, the base networks of SSD512 and PFPNet-R512 are VGG16, and the base networks of the other models are ResNet-101. By performing multi-scale training, MONet achieves 83.6% mAP. MONet with multi-scale training outperforms all other state-of-the-art single models, such as DSSD [52], PFPNet-R512 [37], D-R-FCN [53] and CoupleNet [42], on the PASCAL VOC 2007 test set.

4.6.2. PASCAL VOC 2012
In Table 7, all models are trained on the union set of the PASCAL VOC 2007 trainval, the PASCAL VOC 2007 test and the PASCAL VOC 2012 trainval, and tested on the PASCAL VOC 2012 test set. The base networks of YOLO and YOLOv2 are GoogleNet [55] and DarkNet19 respectively, the base networks of SSD512 and PFPNet-R512 are VGG16, and the base networks of the other models are ResNet-101. As shown in Table 7, R-FCN achieves 77.6% mAP, while MONet obtains an mAP of 81.0%, outperforming R-FCN by 3.4 points. To the best of our knowledge, the proposed model outperforms all other state-of-the-art single models, such as YOLOv2 [14], DSSD [52], PFPNet-R512 [37] and CoupleNet [42], on the PASCAL VOC 2012 test set.

4.6.3. MS COCO
The results are shown in Table 8. The MS COCO-style AP is evaluated at IOU ∈ [0.5, 0.55, 0.6, . . . , 0.95], and AP50 is the PASCAL-style AP evaluated at IOU = 0.5. YOLOv2, SSD512, DSSD513 and PFPNet-R512 are trained with the MS COCO trainval35k set; the other models are trained with the MS COCO trainval set. Since the proposed method fuses multi-label information into R-FCN, the baseline of MONet is R-FCN. R-FCN obtains 29.2% AP on the MS COCO 2015 test-dev set, while the single-scale trained MONet reaches 34.7% AP. MONet thus outperforms R-FCN by 5.5 points, which validates the effectiveness of the proposed method. When training with multiple scales, the AP is further improved to 35.4%. The proposed MONet achieves better results than DSSD [52], PFPNet-R512 [37], D-R-FCN [53] and CoupleNet [42].

4.7. Applied to instance segmentation

The proposed method is also applied to the instance segmentation task in order to show the generalization ability of the proposed MONet. Since the proposed MONet uses the R-FCN method to solve the object detection task, the mask branch is also implemented on top of R-FCN in order to make a fair comparison between R-FCN and MONet. R-FCN with the mask branch and the proposed MONet with the mask branch are named Mask R-FCN and Mask MONet, respectively. The mask branch is built following Mask R-CNN [24] as follows.


Table 9. Results on the MS COCO 2015 test-dev set.
Method       box AP              mask AP
             AP    AP50  AP75    AP    AP50  AP75
R-FCN-ReIm   31.6  51.6  34.3    –     –     –
MONet        34.7  54.4  37.6    –     –     –
Mask R-FCN   32.1  52.5  35.0    29.5  49.5  31.2
Mask MONet   35.2  54.8  38.3    30.9  50.8  32.3

Firstly, for R-FCN, the object detection feature map D is used to extract features of size 14 × 14 × 1024 for each proposal by an ROI Align operation. For MONet, since the new object detection feature map Dnew contains rough localization information from the multi-label feature map ML, Dnew is used to extract features of the same size 14 × 14 × 1024 for each proposal by the same ROI Align operation. Secondly, each proposal's features are passed through four successive 3 × 3 convolutional layers with 256 channels for further feature extraction and a deconvolutional layer that enlarges the features to 28 × 28 × 256. Finally, the enlarged features are used to generate the predicted mask of each proposal by a 7 × 7 convolutional layer with C channels (C = 80 for the MS COCO dataset) and a sigmoid function. The loss of the mask branch, Lmask, is the binary cross-entropy loss, and the loss weight of the mask loss is set to 1. The results on the MS COCO 2015 test-dev set are shown in Table 9, where box AP denotes the object detection results and mask AP denotes the instance segmentation results. The object detection results of R-FCN-ReIm and MONet are 31.6% AP and 34.7% AP respectively. Note that R-FCN-ReIm is the reimplementation of R-FCN, and its result is higher than that of R-FCN [11] (29.2% AP). The proposed MONet outperforms R-FCN-ReIm by 3.1 points. The results of Mask R-FCN on object detection and instance segmentation are 32.1% AP and 29.5% AP, respectively, and the results of Mask MONet are 35.2% AP and 30.9% AP respectively. Compared with Mask R-FCN, the proposed Mask MONet fuses the multi-label features into the object detection features and the mask features, and brings a 3.1% AP gain on object detection and a 1.4% AP gain on instance segmentation. This demonstrates that the multi-label classification branch can be applied along with the instance segmentation task, and that multi-label classification consistently boosts the performance of both object detection and instance segmentation. Compared with the instance segmentation task, the object detection task obtains a higher performance gain from the multi-label features. The reason is that instance segmentation needs more accurate localization information than object detection, while the multi-label features only contain rough localization information.


As shown in Table 9, the object detection results of R-FCN-ReIm, Mask R-FCN and MONet are 31.6% AP, 32.1% AP and 34.7% AP respectively. Compared with R-FCN-ReIm, Mask R-FCN adds the instance segmentation task and brings a 0.5% AP gain (32.1% vs 31.6%) to object detection, while MONet adds the multi-label classification task and brings a 3.1% AP gain. This shows that, compared with the instance segmentation task, the multi-label classification task brings a higher performance gain for object detection. Although the instance segmentation task uses more annotations than the multi-label classification task, the multi-label classification task brings a higher performance gain. The reason is that the multi-label features, which contain rough localization information, are fused into the object detection features by the gate module and ROI pooling, whereas the instance segmentation features are not fused into the object detection features. As shown in Table 1, simply adding the multi-label branch to R-FCN (i.e. R-FCN∗) only boosts performance by 0.8% mAP (from 79.5% to 80.3%) on the PASCAL VOC 2007 test set, while fusing the multi-label features into the object detection features boosts performance by 3.5% mAP (from 79.5% to 83.0%). This further demonstrates that the key to boosting object detection performance is the feature fusion by ROI pooling and the gate module rather than simply solving multiple tasks in one framework.

5. Conclusion

In this paper, a novel multi-task framework for object detection, named MONet, is proposed. The framework uses multi-label classification as an auxiliary task to improve the accuracy of object detection. Compared with other multi-task frameworks, the proposed approach does not need extra annotations. Since a gate module is designed and ROI pooling is used to fuse the multi-label features into the object detection features, MONet achieves the best single-model results on the challenging PASCAL VOC 2007 and PASCAL VOC 2012 detection benchmarks and outperforms R-FCN by 5.5 points on the challenging MS COCO detection benchmark. Since slightly more parameters are introduced to build the multi-label branch, the proposed MONet (10.1 fps) is slightly slower than R-FCN (12.3 fps). However, MONet still reaches real-time speed and achieves a better trade-off between accuracy and speed compared with R-FCN.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work is supported by the National Natural Science Foundation of China (Grant No. 61371192), the Key Laboratory Foundation of the Chinese Academy of Sciences (CXJJ-17S044) and the Fundamental Research Funds for the Central Universities (WK2100330002, WK3480000005).

References

[1] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2005, pp. 886–893.
[2] P.F. Felzenszwalb, D.A. McAllester, D. Ramanan, et al., A discriminatively trained, multiscale, deformable part model, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.

5. Conclusion

In this paper, a novel multi-task framework for object detection, named MONet, is proposed. The framework uses multi-label classification as an auxiliary task to improve the accuracy of object detection and, unlike other multi-task frameworks, does not need extra annotations. Because a gate module is designed and ROI pooling is used to fuse the multi-label features into the object detection features, MONet achieves the best single-model result on the challenging PASCAL VOC 2007 and PASCAL VOC 2012 detection benchmarks and outperforms R-FCN by 5.5 points on the challenging MS COCO detection benchmark. Since the multi-label branch introduces a few extra parameters, MONet (10.1 fps) is slightly slower than R-FCN (12.3 fps); nevertheless, it still reaches real-time speed and achieves a better trade-off between accuracy and speed than R-FCN.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work is supported by the National Natural Science Foundation of China (Grant No. 61371192), the Key Laboratory Foundation of the Chinese Academy of Sciences (CXJJ-17S044) and the Fundamental Research Funds for the Central Universities (WK2100330002, WK3480000005).

References

[1] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2005, pp. 886–893.
[2] P.F. Felzenszwalb, D.A. McAllester, D. Ramanan, et al., A discriminatively trained, multiscale, deformable part model, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
[3] M. Guo, Y. Zhao, C. Zhang, Z. Chen, Fast object detection based on selective visual attention, Neurocomputing 144 (2014) 184–197.
[4] Y. Liu, L. Huang, X. Liu, B. Lang, A novel rotation adaptive object detection method based on pair Hough model, Neurocomputing 194 (2016) 246–259.
[5] D. Cheng, J. Wang, X. Wei, Y. Gong, Training mixture of weighted SVM for object detection using EM algorithm, Neurocomputing 149 (2015) 473–482.
[6] M. Tan, G. Pan, Y. Wang, Y. Zhang, Z. Wu, L1-norm latent SVM for compact features in object detection, Neurocomputing 139 (2014) 56–64.
[7] J. Shen, C. Sun, W. Yang, Z. Wang, Z. Sun, A novel distribution-based feature for rapid object detection, Neurocomputing 74 (17) (2011) 2767–2779.
[8] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
[9] R. Girshick, Fast R-CNN, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
[10] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, in: Proceedings of the Advances in Neural Information Processing Systems, 2015, pp. 91–99.
[11] J. Dai, Y. Li, K. He, J. Sun, R-FCN: Object detection via region-based fully convolutional networks, in: Proceedings of the Advances in Neural Information Processing Systems, 2016, pp. 379–387.
[12] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A.C. Berg, SSD: Single shot multibox detector, in: Proceedings of the European Conference on Computer Vision, 2016, pp. 21–37.
[13] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[14] J. Redmon, A. Farhadi, YOLO9000: Better, faster, stronger, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7263–7271.
[15] J. Li, X. Liang, J. Li, Y. Wei, T. Xu, J. Feng, S. Yan, Multistage object detection with group recursive learning, IEEE Trans. Multimed. 20 (7) (2018) 1645–1655.
[16] W. Chu, D. Cai, Deep feature based contextual model for object detection, Neurocomputing 275 (2018) 1035–1042.
[17] J.R. Uijlings, K.E. Van De Sande, T. Gevers, A.W. Smeulders, Selective search for object recognition, Int. J. Comput. Vis. 104 (2) (2013) 154–171.
[18] C.L. Zitnick, P. Dollár, Edge boxes: Locating object proposals from edges, in: Proceedings of the European Conference on Computer Vision, 2014, pp. 391–405.
[19] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, A. Torralba, Learning deep features for discriminative localization, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2921–2929.
[20] M. Lin, Q. Chen, S. Yan, Network in network, in: Proceedings of the International Conference on Learning Representations, 2014.
[21] F. Zhu, H. Li, W. Ouyang, N. Yu, X. Wang, Learning spatial regularization with image-level supervisions for multi-label image classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5513–5522.
[22] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, Y. Bengio, Show, attend and tell: Neural image caption generation with visual attention, in: Proceedings of the International Conference on Machine Learning, 2015, pp. 2048–2057.
[23] S. Brahmbhatt, H.I. Christensen, J. Hays, StuffNet: Using stuff to improve object detection, in: Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2017, pp. 934–943.
[24] K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask R-CNN, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.
[25] J. Dai, K. He, J. Sun, Instance-aware semantic segmentation via multi-task network cascades, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3150–3158.
[26] K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE Trans. Pattern Anal. Mach. Intell. 37 (9) (2015) 1904–1916.
[27] Z. Cai, N. Vasconcelos, Cascade R-CNN: Delving into high quality object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6154–6162.
[28] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.
[29] G. Cheng, P. Zhou, J. Han, Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images, IEEE Trans. Geosci. Remote Sens. 54 (12) (2016) 7405–7415.
[30] K. Li, G. Cheng, S. Bu, X. You, Rotation-insensitive and context-augmented object detection in remote sensing images, IEEE Trans. Geosci. Remote Sens. 56 (4) (2018) 2337–2348.
[31] G. Cheng, J. Han, P. Zhou, D. Xu, Learning rotation-invariant and Fisher discriminative convolutional neural networks for object detection, IEEE Trans. Image Process. 28 (1) (2019) 265–278.
[32] T. Kong, A. Yao, Y. Chen, F. Sun, HyperNet: Towards accurate region proposal generation and joint object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 845–853.
[33] T. Kong, F. Sun, A. Yao, H. Liu, M. Lu, Y. Chen, RON: Reverse connection with objectness prior networks for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5936–5944.


[34] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, S. Belongie, Feature pyramid networks for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125.
[35] S. Zhang, L. Wen, X. Bian, Z. Lei, S.Z. Li, Single-shot refinement neural network for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4203–4212.
[36] J. Mao, T. Xiao, Y. Jiang, Z. Cao, What can help pedestrian detection? in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3127–3136.
[37] S.-W. Kim, H.-K. Kook, J.-Y. Sun, M.-C. Kang, S.-J. Ko, Parallel feature pyramid network for object detection, in: Proceedings of the European Conference on Computer Vision, 2018, pp. 234–250.
[38] S. Gidaris, N. Komodakis, Object detection via a multi-region and semantic segmentation-aware CNN model, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1134–1142.
[39] W. Ouyang, K. Wang, X. Zhu, X. Wang, Chained cascade network for object detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1938–1946.
[40] S. Zagoruyko, A. Lerer, T.-Y. Lin, P.H. Pinheiro, S. Gross, S. Chintala, P. Dollár, A multipath network for object detection, in: Proceedings of the British Machine Vision Conference, 2016, pp. 15.1–15.12.
[41] X. Zeng, W. Ouyang, B. Yang, J. Yan, X. Wang, Gated bi-directional CNN for object detection, in: Proceedings of the European Conference on Computer Vision, 2016, pp. 354–369.
[42] Y. Zhu, C. Zhao, J. Wang, X. Zhao, Y. Wu, H. Lu, CoupleNet: Coupling global structure with local parts for object detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4126–4134.
[43] H. Hu, J. Gu, Z. Zhang, J. Dai, Y. Wei, Relation networks for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3588–3597.
[44] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[45] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
[46] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, L.D. Jackel, Backpropagation applied to handwritten zip code recognition, Neural Comput. 1 (4) (1989) 541–551.
[47] M. Everingham, L. Van Gool, C.K. Williams, J. Winn, A. Zisserman, The PASCAL visual object classes (VOC) challenge, Int. J. Comput. Vis. 88 (2) (2010) 303–338.
[48] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C.L. Zitnick, Microsoft COCO: Common objects in context, in: Proceedings of the European Conference on Computer Vision, 2014, pp. 740–755.
[49] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: Convolutional architecture for fast feature embedding, in: Proceedings of the ACM International Conference on Multimedia, 2014, pp. 675–678.
[50] A. Shrivastava, A. Gupta, R. Girshick, Training region-based object detectors with online hard example mining, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 761–769.
[51] D. Hoiem, Y. Chodpathumwan, Q. Dai, Diagnosing error in object detectors, in: Proceedings of the European Conference on Computer Vision, 2012, pp. 340–353.
[52] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, A.C. Berg, DSSD: Deconvolutional single shot detector, arXiv:1701.06659, 2017.
[53] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, Y. Wei, Deformable convolutional networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 764–773.

[54] S. Ren, K. He, R. Girshick, X. Zhang, J. Sun, Object detection networks on convolutional feature maps, IEEE Trans. Pattern Anal. Mach. Intell. 39 (7) (2017) 1476–1481.
[55] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.

Tao Gong received the B.S. degree in electronic engineering and information science from the University of Science and Technology of China in 2016. He is currently working toward the M.S. degree in the Department of Electronic Engineering and Information Science at the University of Science and Technology of China. His research interests include computer vision and deep learning.

Bin Liu received the B.S. and M.S. degrees from the University of Science and Technology of China in 1998 and 2001, respectively, and the Ph.D. degree in electrical engineering from Syracuse University in 2006. He is currently an associate professor with the School of Information Science and Technology, University of Science and Technology of China. His research interests include signal processing, communications in wireless sensor and body area networks, and computer vision. He is a member of the IEEE.

Qi Chu received the B.S. degree in electronic engineering and information science from the University of Science and Technology of China in 2014. He is currently working toward the Ph.D. degree in the Department of Electronic Engineering and Information Science at the University of Science and Technology of China. His research interests include computer vision and deep learning.

Nenghai Yu received the B.S. degree from Nanjing University of Posts and Telecommunications in 1987, the M.E. degree from Tsinghua University in 1992, and the Ph.D. degree from the University of Science and Technology of China in 2004, where he is currently a professor. His research interests include multimedia security, multimedia information retrieval, video processing, and information hiding. He is a member of the IEEE.
