BshapeNet: Object Detection and Instance Segmentation with Bounding Shape Masks
Highlights
• We propose a module to predict boundary shapes and boxes with a new masking scheme.
• We comprehensively investigate the proposed frameworks with various masking methods.
• The proposed modularizable component can easily be applied to existing models.
• BshapeNet+ notably improves accuracy over the baselines, Faster R-CNN and Mask R-CNN.
• We achieve highly competitive results with SOTA detection and segmentation models.
Pattern Recognition Letters journal homepage: www.elsevier.com
BshapeNet: Object Detection and Instance Segmentation with Bounding Shape Masks

Ba Rom Kang (a), Hyunku Lee (b), Keunju Park (b), Hyunsurk Ryu (b), Ha Young Kim (c,∗∗)

a Department of Data Science, Ajou University, Worldcupro 206, Yeongtong-gu, Suwon 16499, Republic of Korea
b Samsung Electronics Co., Ltd, Samsungro 129, Yeongtong-gu, Suwon 16677, Republic of Korea
c Department of Information, Yonsei University, Yonsei-ro 50, Seodaemun-gu, Seoul 03722, Republic of Korea
ABSTRACT We propose a modularizable component that can predict the boundary shapes and boxes of an image, along with a new masking scheme, for improving object detection and instance segmentation. Specifically, we introduce two types of novel masks: a bounding box (bbox) mask and a bounding shape (bshape) mask. For each of these types, we consider two variants, the “Thick” model and the “Scored” model, both of which have the same morphology but differ in how their boundaries are thickened. To evaluate our masks, we design extended frameworks by adding a bshape mask (or a bbox mask) branch to a Faster R-CNN, and call this BshapeNet (or BboxNet). Furthermore, we propose BshapeNet+, a network that combines a bshape mask branch with a Mask R-CNN. Among our various models, BshapeNet+ demonstrates the best performance in both tasks. In addition, BshapeNet+ markedly outperforms the baseline models on MS COCO and Cityscapes and achieves highly competitive results with state-of-the-art models. In particular, the experimental results show that our branch works well on small objects and is easily applicable to various models, such as PANet as well as Faster R-CNN and Mask R-CNN. © 2020 Elsevier Ltd. All rights reserved.
1. Introduction
An object detection algorithm determines the class and location of each object in an image. Deep learning-based approaches have achieved notable success recently; such approaches include Faster R-CNN [26], SNIPER [28], and PANet [20]. These approaches use a bounding box (bbox) regressor to predict object locations defined by four-dimensional coordinates. However, it is still challenging to learn continuous variables from images. Thus, if we define a new target that allows the detector to learn the position of an object more efficiently and that is easily applicable as a module to existing frameworks, performance should improve. In other words, the algorithm can predict the position more accurately by learning not only the coordinates but also a different form of location information. This is analogous to how people learn better when they study the same material in different ways.
∗∗ Corresponding author. Tel.: +82-2-2123-4194; Fax: +82-2-2123-8654; e-mail: [email protected] (Ha Young Kim).
In this study, we define the location of an object in the form of a mask. We do so because we perceive that spatial information can be learned more efficiently than coordinates, and it can facilitate not only object detection but also instance segmentation, as in Mask R-CNN [13]. In addition, because the object’s boundary separates the foreground and background, we consider it to be more crucial than the object’s interior. Thus, we transform the complex task of learning both the interior and the boundary into a simpler task by focusing only on the boundaries. In particular, we consider this approach to be highly effective for small objects and occlusions. Accordingly, we propose two types of masks, a bbox mask and a bshape mask, to indicate the location of an object. Figures 1(a) and (e) show masks with only true boundaries; however, an imbalance problem occurs owing to excessive zeros. Therefore, it is necessary to create a thick boundary. For each of the two types, we consider two variants: the “Thick” model (Figures 1(b) and (f)) and the “Scored” model (Figures 1(c) and (g)), both of which exhibit the same morphology but differ in how their boundaries are thickened. Furthermore, as shown in Figures 1(d) and (h), masks of various thicknesses are used.
Fig. 1. Proposed bounding shape masks and bounding box masks: (a) bounding shape mask, (b) Thick bounding shape mask, (c) Scored bounding shape mask, (d) k-px bounding shape mask, (e) bounding box mask, (f) Thick bounding box mask, (g) Scored bounding box mask, and (h) k-px bounding box mask.
To verify the newly defined masks for object detection and instance segmentation, we propose three frameworks (Figure 2): BboxNet, BshapeNet, and BshapeNet+. BshapeNet (or BboxNet) extends Faster R-CNN with RoIAlign [13], hereafter called Faster R-CNN‡; specifically, BshapeNet adds a bshape mask (or a bbox mask) branch to Faster R-CNN‡. BshapeNet+ is a network in which a bshape mask branch and an instance mask branch (adopted from Mask R-CNN [13]) are added to Faster R-CNN‡; it is equivalent to combining a bshape mask branch with Mask R-CNN. With this network, the effect of the bshape mask can be analyzed for both instance segmentation and object detection when the bshape mask and instance mask are used simultaneously. In other words, BshapeNet(+) allows both object detection and instance segmentation, whereas BboxNet allows only object detection, because its mask predicts bounding boxes. Furthermore, each network has two variants (the Thick model and the Scored model). We evaluate our approaches on the MS COCO [19] and Cityscapes [6] datasets.
The main contributions of this study are summarized as follows.
1. We propose a novel modularizable component with two newly introduced types of masks and their variants (Scored mask, Thick mask, and the degree of boundary dilation) for object detection and instance segmentation.
2. We propose efficient frameworks, BshapeNet and BshapeNet+, by adding our bshape mask branch to Faster R-CNN‡ and Mask R-CNN, respectively. Our branch is easily applicable to existing algorithms, including PANet.
3. Comprehensive experiments on two benchmarks show that BshapeNet+ improves performance significantly compared with the baseline models and achieves highly competitive results with state-of-the-art (SOTA) methods.
Fig. 2. The proposed BshapeNet+ framework for object detection and instance segmentation: a backbone with an FPN and RPN feeds RoIAlign features to (a) an instance mask branch (FCN), (b) a bshape mask branch (FCN), and (c) a regressor and classifier branch (RCN) with a bbox regressor and classifier.
2. Related Work
Object detection: Many studies have recently presented excellent results in object detection. The remarkable improvements in object detection began with Overfeat [27] and region-based CNN (R-CNN) [11]. With the success of R-CNN, two-stage detection has continued to evolve. To reduce the heavy CNN computation of R-CNN, Fast R-CNN [10] and Faster R-CNN were developed. Fast R-CNN uses selective search [29] to obtain proposals from the features of the CNN backbone and a region of interest (RoI) pooling layer [10] to avoid repeatedly feeding RoIs back into the CNN. Faster R-CNN uses a region proposal network (RPN) to detect RoIs within the structure of Fast R-CNN. Recently, many approaches have extended or modified this framework. For example, Cascade R-CNN [4] achieves superior performance with a cascade framework that refines region proposals into high-quality ones in a step-wise manner. SNIPER [28] proposes an efficient multi-scale training method that uses chips to provide an efficient, greedy input size to the model.
Instance segmentation: DeepMask [23] and SharpMask [24] predict the location of an object by segmenting it to propose candidates with a CNN. Dai et al. [7] proposed a cascading-stage model that uses shared features to segment the proposed instance(s) from a bbox. Fully convolutional instance-aware semantic segmentation [17] additionally predicts a position-sensitive map on top of the cascading-stage approach. Unlike these proposal-based methods, Mask R-CNN is an instance-based method; it uses RoIAlign [13] to obtain precise RoIs and has three branches (classification, bbox detection, and instance segmentation). PANet [20] and MS R-CNN [15] improve performance by modifying the architecture of Mask R-CNN. In addition, recent approaches, including Deep Coloring [2], MaskLab [5], and GMIS [21], use semantic segmentation results for instance segmentation.
3. The proposed method
3.1. Proposed Mask Representations
We define two types of masks, the bshape mask and the bbox mask, which represent the boundary shapes and boxes, as in Figure 1. We introduce a modularizable branch with these novel masks for object detection and instance segmentation. As the names of the proposed masks imply, the bshape mask labels the boundary pixels of the object as one, whereas the bbox mask labels the bbox pixels as one, and both label all other pixels as zero (see Figures 1(a) and (e)). It is difficult to learn with these masks because of the excessive number of zeros compared to ones. Thus, for each of the bshape mask and the bbox mask, we define two variants, the Thick mask and the Scored mask. They exhibit the same morphology but differ in how their boundaries (or bboxes) are thickened. As shown in Figures 1(d) and (h), we create thicker boundaries by extending the true boundaries inward and outward by k pixels, and we prefix the names of these masks with “k-px” to indicate the boundary thickness. We call the expanded boundary pixels false boundary pixels.
We explain the Thick mask and the Scored mask in terms of the bshape mask in more detail in this part; the bbox mask is the same except that the bounding box pixels are considered. In the Thick mask, false boundary pixels are filled with ones, as is the case for the true boundary pixels. The mathematical expression of this mask is as follows. Let $M = (m_{ij}) \in \mathbb{R}^{v \times w}$ be a true boundary mask matrix, $B = \{(i, j) : m_{ij} = 1,\ 1 \le i \le v,\ 1 \le j \le w\}$ be the true boundary index set, and $X = (x_{pq}) \in \mathbb{R}^{v \times w}$ be the k-px Thick bshape mask matrix. For all $(i, j) \in B$,
$$
\begin{cases} x_{pj} = 1, & \text{if } 1 \le i - k \le p \le i + k \le v \\ x_{iq} = 1, & \text{if } 1 \le j - k \le q \le j + k \le w \end{cases} \tag{1}
$$
All remaining values are filled with zeros.
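As a concrete illustration of Eq. (1), the following is a minimal NumPy sketch of how a k-px Thick mask could be generated from a binary true-boundary mask; the function name and the clipping behavior at the image border are our own assumptions, not the authors' code.

```python
import numpy as np

def thick_mask(true_boundary: np.ndarray, k: int) -> np.ndarray:
    """Build a k-px Thick mask (Eq. 1): every true boundary pixel (i, j)
    is extended by k pixels vertically and horizontally, clipped to the image."""
    v, w = true_boundary.shape
    x = np.zeros((v, w), dtype=np.float32)
    for i, j in zip(*np.nonzero(true_boundary)):
        x[max(i - k, 0): min(i + k, v - 1) + 1, j] = 1.0  # vertical extension
        x[i, max(j - k, 0): min(j + k, w - 1) + 1] = 1.0  # horizontal extension
    return x
```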
In this case, the loss is the same whether the model's error is very close to the boundary or very far away. Thus, we assign higher scores to predictions that are closer to the true boundary to aid learning. We therefore develop the Scored mask, because the boundaries can be learned more effectively by assigning different values to false and true boundary pixels. The value of a false boundary pixel is reduced at a constant rate in proportion to its distance from the actual boundary. Thus, for the Scored bounding shape mask, false boundary pixels are filled with distance-based scores, which are positive numbers less than one. Let $Y = (y_{pq}) \in \mathbb{R}^{v \times w}$ be the k-px Scored bounding shape (or box) mask matrix. We use the same matrices $M$ and $B$ as in Eq. (1), and $s$ is a predetermined positive constant (less than one) that controls how quickly the value decreases; we set $s = 0.05$. Then, for all $(i, j) \in B$,
$$
\begin{cases} y_{pj} = 1 - d_1 s, & \text{if } 1 \le i - k \le p \le i + k \le v \\ y_{iq} = 1 - d_2 s, & \text{if } 1 \le j - k \le q \le j + k \le w, \end{cases} \tag{2}
$$
where $d_1 = |p - i|$ and $d_2 = |q - j|$ are the distances from the true boundary, and the remaining pixels are zero.
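Analogously, below is a minimal sketch of Eq. (2); where the dilated regions of neighboring boundary pixels overlap we keep the maximum score, which is our assumption rather than a detail stated in the paper.

```python
import numpy as np

def scored_mask(true_boundary: np.ndarray, k: int, s: float = 0.05) -> np.ndarray:
    """Build a k-px Scored mask (Eq. 2): false boundary pixels get 1 - d*s,
    where d is the distance to the generating true boundary pixel."""
    v, w = true_boundary.shape
    y = np.zeros((v, w), dtype=np.float32)
    for i, j in zip(*np.nonzero(true_boundary)):
        for d in range(k + 1):
            score = 1.0 - d * s
            for p in (i - d, i + d):          # vertical extension
                if 0 <= p < v:
                    y[p, j] = max(y[p, j], score)
            for q in (j - d, j + d):          # horizontal extension
                if 0 <= q < w:
                    y[i, q] = max(y[i, q], score)
    return y
```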
Fig. 3. The detailed network architecture of BshapeNet. The backbone (ResNet stages C2-C5) feeds an FPN (P2-P5, each W x H x 256), followed by the RPN (Conv(3, 512), Conv(1, 512), FC(1), FC(4)) and RoIAlign. The RCN head applies Conv(7, 1024) and Conv(1, 1024) to a 7x7x512 RoI feature and predicts class scores (FC(k), output 1 x k) and box offsets (FC(4k), output 4 x k). The FCN mask head applies four Conv(3, 256) layers, a Deconv(2, 256), and a Conv(1, k) to a 14x14x512 RoI feature, producing a 28 x 28 x k mask. Notation: Conv(f, m) is a convolutional layer with filter size f and stride 1 (zero padding 1 if f = 3); Deconv(f, m) is a deconvolutional layer with stride 2; m is the number of channels and k the number of classes.
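For readers who prefer code, below is a rough PyTorch sketch of the FCN mask head summarized in Fig. 3 (four 3x3 convolutions, one stride-2 deconvolution, and a 1x1 convolution to k channels). The ReLU activations are our assumption, and the input channel count is left as a parameter (the figure lists a 14x14x512 RoI feature, while FPN levels are 256-channel); this is not a verified reproduction of the authors' implementation.

```python
import torch
import torch.nn as nn

class BshapeMaskHead(nn.Module):
    """FCN head of the bshape (or bbox) mask branch, per Fig. 3:
    a 14x14 RoI feature is mapped to a 28x28xk mask prediction."""
    def __init__(self, in_channels: int = 256, num_classes: int = 81):
        super().__init__()
        layers = []
        c = in_channels
        for _ in range(4):  # four Conv(3, 256) blocks
            layers += [nn.Conv2d(c, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
            c = 256
        layers += [nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2),  # Deconv(2, 256)
                   nn.ReLU(inplace=True),
                   nn.Conv2d(256, num_classes, kernel_size=1)]             # Conv(1, k)
        self.head = nn.Sequential(*layers)

    def forward(self, roi_feat: torch.Tensor) -> torch.Tensor:
        # roi_feat: [N, in_channels, 14, 14] -> [N, num_classes, 28, 28]
        return self.head(roi_feat)
```

A sigmoid (for the Thick, classification-style branch) or no activation (for the Scored, regression-style branch) would sit on top, matching the losses in Section 3.3.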
3.2. Proposed Frameworks
We propose three frameworks, BboxNet, BshapeNet, and BshapeNet+, to verify our masks (Figure 2). BshapeNet (or BboxNet) is a framework that combines Faster R-CNN‡ with a bshape (or bbox) mask branch. BshapeNet+ is a framework that adds both our bshape mask branch and an instance mask branch to Faster R-CNN‡. The bshape (or bbox) mask branch segments the boundaries (bboxes) of the instances in the RoIs. The instance mask branch performs instance segmentation; that is, both the interior of the instance and its boundary are segmented [13]. The regressor and classifier branch performs bbox regression and classification of the RoIs, as in [26].
BshapeNet+ consists primarily of a backbone (as in [13]), the RPN, the region classification network (RCN), and the bshape mask branch and instance mask branch based on the FCN [22], as Figure 2 shows. In addition, for better performance, we use a feature pyramid network (FPN) [18]. Specifically, the flow of BshapeNet+ through each component is as follows. First, the backbone extracts features; the FPN then combines multi-resolution features with high-level features from multiple layers of the backbone [18] and forwards the combined features to the RPN. Subsequently, the classifier and bbox regressor of the RPN propose the RoIs. For the final predictions, both the bshape segmentor and the instance segmentor use the RoIs simultaneously, as does the RCN. Through this process, all predictions are made. Figure 3 shows the detailed architecture of BshapeNet; it is exactly the same as the architecture of Mask R-CNN, and BboxNet uses the same architecture. Since our branch is identical to the instance mask branch except for the masking scheme, BshapeNet+ contains two FCNs with the same architecture. We investigate our models with ResNet [14] and ResNeXt [30] as the backbone in all experiments.
3.3. Training loss functions
We use the same loss for both BshapeNet and BboxNet, because only the morphology of the defined mask differs. However, the Scored model and the Thick model use different losses. First, the loss function for the Scored model is defined as
$$
L_{total} = \alpha L_{RPN} + \beta L_{RCN} + \gamma L_{Smask}, \tag{3}
$$
where $\alpha$, $\beta$, and $\gamma$ are predetermined positive constants, and $L_{RPN}$, $L_{RCN}$, and $L_{Smask}$ denote the loss functions of the RPN, the RCN, and the Scored bshape (or bbox) mask branch, respectively. The loss functions of the RPN and RCN are the same as those of Faster R-CNN. We use a Euclidean loss for the Scored mask branch, $L_{Smask}$: for each RoI,
$$
L_{Smask} = \frac{1}{2HW} \sum_{i}^{W} \sum_{j}^{H} (t_{ij} - \hat{t}_{ij})^2,
$$
where $H$ and $W$ are the height and width of the mask, and $t_{ij}$ is the ground-truth label of pixel $(i, j)$ in the mask with predicted value $\hat{t}_{ij}$. The Thick mask branch solves a pixel-wise classification problem, whereas the Scored mask branch solves a pixel-wise regression problem. Thus, for the Thick mask branch we use the binary cross-entropy loss $L_{Tmask}$, with the same notation as in $L_{Smask}$:
$$
L_{Tmask} = -\frac{1}{HW} \sum_{i}^{W} \sum_{j}^{H} \left\{ t_{ij} \log \hat{t}_{ij} + (1 - t_{ij}) \log(1 - \hat{t}_{ij}) \right\}.
$$
The total loss function of the Thick model is obtained by replacing $L_{Smask}$ with $L_{Tmask}$ in Eq. (3). Finally, the loss function of BshapeNet+ is obtained by adding $\delta L_{Imask}$ to Eq. (3), where $L_{Imask}$ is the loss of the instance mask branch, the same as that used in Mask R-CNN, and $\delta$ is a preset positive constant.
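To make the mask losses concrete, here is a small PyTorch sketch under the definitions above; the per-RoI averaging and the default branch weights (alpha, beta, gamma, delta) are placeholders chosen for illustration, not the authors' settings.

```python
import torch

def scored_mask_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Euclidean loss L_Smask, averaged over RoIs; pred/target: [N, H, W]."""
    n, h, w = pred.shape
    return ((pred - target) ** 2).sum(dim=(1, 2)).div(2 * h * w).mean()

def thick_mask_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy loss L_Tmask; logits/target: [N, H, W]."""
    return torch.nn.functional.binary_cross_entropy_with_logits(logits, target)

def total_loss(l_rpn, l_rcn, l_mask, l_inst=None,
               alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
    """Eq. (3); pass l_inst for BshapeNet+ (adds the instance mask loss)."""
    loss = alpha * l_rpn + beta * l_rcn + gamma * l_mask
    if l_inst is not None:
        loss = loss + delta * l_inst
    return loss
```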
3.4. Inference
To analyze the effect of adding our mask branch to the existing model, at inference we evaluate object detection performance using the bbox regressor; the accuracy of the bbox mask branch itself is evaluated in Table 8. For instance segmentation, post-processing is required for the bshape mask branch, because bshape masks segment only the boundaries of objects. Thus, we perform a simple two-step post-processing to compute the instance segmentation performance of BshapeNet: the first step connects the predicted boundaries with Prim's algorithm [25], and the second step fills the enclosed region. However, BshapeNet+ uses the result of the instance mask branch, for a fair comparison with Mask R-CNN that avoids any post-processing effect.
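As a rough illustration of this two-step post-processing, the sketch below replaces the Prim's-algorithm connection step with a simple morphological closing (an assumption made for brevity) and then fills the enclosed region; it is not the authors' implementation.

```python
import numpy as np
from scipy import ndimage

def boundary_to_instance(boundary_prob: np.ndarray, thresh: float = 0.5,
                         close_iters: int = 2) -> np.ndarray:
    """Turn a predicted bshape boundary map into a filled instance mask.
    Step 1 (simplified): close small gaps in the thresholded boundary.
    Step 2: fill the region enclosed by the boundary."""
    boundary = boundary_prob > thresh
    closed = ndimage.binary_closing(boundary, iterations=close_iters)  # connect gaps
    filled = ndimage.binary_fill_holes(closed)                         # fill interior
    return filled.astype(np.uint8)
```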
Table 1. Object detection results on the MS COCO minival dataset with bbox AP (%). Numbers in parentheses indicate the thickness of the boundary (k-px model). ResNet101 (R-101) is used as a backbone.

| Model (R-101)         | APbb | AP50bb | AP75bb |
|-----------------------|------|--------|--------|
| Thick BboxNet (3)     | 37.9 | 59.9   | 40.6   |
| Thick BboxNet (5)     | 37.9 | 59.0   | 40.0   |
| Thick BboxNet (7)     | 38.0 | 59.9   | 39.4   |
| Thick BboxNet (11)    | 37.8 | 58.0   | 39.8   |
| Scored BboxNet (3)    | 37.8 | 59.5   | 40.5   |
| Scored BboxNet (5)    | 38.1 | 59.8   | 40.7   |
| Scored BboxNet (7)    | 38.1 | 59.7   | 40.9   |
| Scored BboxNet (11)   | 37.9 | 59.5   | 40.5   |
| Thick BshapeNet (3)   | 38.1 | 59.9   | 41.0   |
| Thick BshapeNet (5)   | 38.2 | 60.9   | 40.5   |
| Thick BshapeNet (7)   | 38.4 | 61.5   | 41.8   |
| Thick BshapeNet (11)  | 38.2 | 60.9   | 41.2   |
| Scored BshapeNet (3)  | 41.5 | 63.4   | 44.0   |
| Scored BshapeNet (5)  | 41.4 | 63.5   | 45.9   |
| Scored BshapeNet (7)  | 42.1 | 64.1   | 46.2   |
| Scored BshapeNet (11) | 41.7 | 63.9   | 46.2   |
| BshapeNet+ (3)        | 41.7 | 63.3   | 44.6   |
| BshapeNet+ (5)        | 41.8 | 63.8   | 46.2   |
| BshapeNet+ (7)        | 42.3 | 64.5   | 46.4   |
| BshapeNet+ (11)       | 42.1 | 64.1   | 46.3   |
| Faster R-CNN‡ [12]    | 39.8 | 61.2   | 43.1   |
| Mask R-CNN [12]       | 40.9 | 62.9   | 44.8   |
Table 2. Instance segmentation results on the MS COCO minival dataset with mask AP (%). ResNet101 (R-101) is used as a backbone.

| Model (R-101)         | APmk | AP50mk | AP75mk |
|-----------------------|------|--------|--------|
| Thick BshapeNet (3)   | 30.7 | 48.6   | 27.9   |
| Thick BshapeNet (5)   | 31.2 | 52.2   | 31.4   |
| Thick BshapeNet (7)   | 31.5 | 51.6   | 31.9   |
| Thick BshapeNet (11)  | 31.6 | 50.4   | 29.8   |
| Scored BshapeNet (3)  | 33.2 | 49.8   | 30.3   |
| Scored BshapeNet (5)  | 34.8 | 57.1   | 37.6   |
| Scored BshapeNet (7)  | 36.7 | 57.7   | 38.0   |
| Scored BshapeNet (11) | 36.4 | 57.2   | 37.9   |
| BshapeNet+ (3)        | 33.6 | 50.1   | 32.6   |
| BshapeNet+ (5)        | 35.4 | 57.4   | 37.9   |
| BshapeNet+ (7)        | 37.1 | 58.9   | 39.3   |
| BshapeNet+ (11)       | 36.9 | 57.7   | 38.9   |
| Mask R-CNN [12]       | 36.4 | 57.8   | 38.8   |
Table 3. Ablation study of BshapeNet on COCO test-dev. We denote ResNeXt by “X” and ResNet by “R” for brevity. “Inst. mask” refers to an instance mask branch; the 7-px Scored BshapeNet is used.

| Model                      | backbone | APbb | AP50bb | AP75bb | APmk | AP50mk | AP75mk |
|----------------------------|----------|------|--------|--------|------|--------|--------|
| Faster R-CNN‡ [12]         | R-101    | 40.0 | 61.7   | 43.5   | -    | -      | -      |
| BshapeNet                  | R-101    | 42.3 | 64.5   | 46.4   | 37.0 | 58.1   | 38.2   |
| + ResNeXt                  | X-101    | 42.5 | 64.8   | 46.7   | 37.2 | 58.9   | 38.9   |
| + Inst. mask (BshapeNet+)  | X-101    | 42.8 | 64.9   | 46.9   | 37.9 | 61.3   | 40.2   |
4. Experiment
We compare the results of the various proposed models and perform ablations. We also compare these results with the baseline models, Faster R-CNN‡ and Mask R-CNN [13], and with SOTA algorithms. We present the results of Detectron [12] for comparison; when values are not available in Detectron, we use the results reported in the Mask R-CNN paper, and when they are not in the paper, we use the results of reproducing them with the Detectron code. We use two benchmark datasets, MS COCO and Cityscapes. Specifically, we use the MS COCO Detection 2017 dataset containing 81 classes and 123,287 images (trainval), with 118,287 images for the training set and 5,000 images for the validation set (minival). We also report object detection and instance segmentation results on MS COCO test-dev [19]. For Cityscapes, we use the dataset with fine annotations comprising nine object categories for instance-level semantic labeling, with a 2,975-image training set, a 500-image validation set, and a 1,525-image test set [6]. Furthermore, we evaluate our results on the Cityscapes test-server. We evaluate performance using the standard COCO-style metrics [19]. In the Mask R-CNN paper, models are trained using 8 NVIDIA Tesla P100 GPUs; we instead use 2 NVIDIA GTX 1080Ti GPUs. Owing to the limited experimental environment,
we use a smaller minibatch. However, for a fair comparison, we match the hyperparameters to those of the Mask R-CNN study, except for the minibatch size.
4.1. Metric
We follow the standard MS COCO evaluation metrics, including AP (average precision averaged over intersection-over-union (IoU) thresholds from 0.5 to 0.95), AP50 (IoU = 0.5), AP75 (IoU = 0.75), and APS, APM, APL, which are the APs for small, medium, and large objects, respectively. We denote the bbox average precision by APbb and the instance segmentation (mask) average precision by APmk. These metrics apply to both datasets.
4.2. Implementation details
We use the officially released Detectron code to implement our models and to test the performance of Faster R-CNN‡ and Mask R-CNN. The detailed architectures of our models are described in Subsection 3.2. For the MS COCO dataset, we resize the images such that their shorter edge is 800 pixels [18]. We use two GPUs and a minibatch of four images (two images per GPU) and train the model for 640K iterations.
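For reference, the COCO-style metrics of Section 4.1 can be computed with the official pycocotools API roughly as follows; the file names are placeholders.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Ground truth and detection results in standard COCO JSON format (placeholder paths).
coco_gt = COCO("annotations/instances_minival2017.json")
coco_dt = coco_gt.loadRes("bshapenet_bbox_results.json")

# Use iouType="segm" for mask AP (APmk) instead of box AP (APbb).
coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()  # prints AP, AP50, AP75, APS, APM, APL
```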
Table 4. Object detection and instance segmentation results of applying our mask branch to PANet on COCO test-dev. ResNet50 is used as a backbone.

| Object Detection                  | APbb | AP50bb | AP75bb |
|-----------------------------------|------|--------|--------|
| PANet [20]                        | 42.5 | 62.3   | 46.4   |
| Ours (PANet + bshape mask branch) | 44.2 | 63.5   | 46.7   |

| Instance Segmentation             | APmk | AP50mk | AP75mk |
|-----------------------------------|------|--------|--------|
| PANet [20]                        | 36.6 | 58.0   | 39.3   |
| Ours (PANet + bshape mask branch) | 37.4 | 58.7   | 40.1   |
Fig. 4. Comparison of the results of the 7-px Scored BboxNet (top) and the 7-px Scored BshapeNet (bottom) on the MS COCO minival dataset. No post-processing is performed.
We set the initial learning rate to 0.02 and divide it by 10 (to 0.002) at 480K iterations. For the backbone, we use ResNet50, ResNet101, and ResNeXt101. On Cityscapes, we train using only the fine-annotation dataset. Although the raw image size is 2048x1024, we reduce it to 1024x800 to fit our resources. The models are trained with two GPUs and a minibatch of two images (one image per GPU) for 96K iterations. We set the initial learning rate to 0.01 and divide it by 10 (to 0.001) at 56K iterations, and we use only ResNet50 as the backbone, because the amount of data in Cityscapes is extremely small and our model does not improve significantly with ResNet101, as is also the case for Mask R-CNN [13]. The hyperparameters common to both datasets are as follows. Each image contains 512 sampled RoIs for the FPN with a positive-to-negative ratio of 1:3 [18]. Anchors of five scales and three aspect ratios are used, as in [26]. The RPN proposes 2000 RoIs per image for training and 1000 for testing. We set the weight decay to 0.0001 and the momentum to 0.9. We use pretrained ImageNet1k [8] weights for all backbones.
4.3. Comprehensive analysis of all proposed methods
We compare and analyze the proposed models and their variants. In summary, the experimental results show that BshapeNet is better than BboxNet, the Scored model is better than the Thick model, and Scored BshapeNet+ achieves the best performance.
The degree of boundary dilation: As shown in Tables 1 and 2, the 7-px models give the best results for both object detection and instance segmentation on MS COCO. Unlike on COCO, the 3-px model, rather than the 7-px model, is the best model on Cityscapes (Tables 6 and 7). For MS COCO, we consider that good performance is achieved with relatively thick boundary masks owing to the variety of objects (81 classes), sizes, and backgrounds; in other words, the boundary must be thick enough to cover the various scales of the objects. Meanwhile, the objects in Cityscapes are simpler than those in COCO, and a fairly thick boundary affects the model negatively, because the boundary dilation introduces noise through false boundaries.
BshapeNet vs. BboxNet: BshapeNet with the same variant condition is significantly more accurate in object detection (Tables 1 and 6). The best object detection model of BshapeNet achieves 42.1 (32.0) APbb, whereas BboxNet's best model achieves 38.1 (29.7) APbb, as shown in Tables 1 and 6, respectively.
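For convenience, the training settings described in Section 4.2 can be summarized as a plain Python dictionary; this is only a readable recap of the values quoted above, not a Detectron configuration file.

```python
# Training settings from Section 4.2 (illustrative summary, not a Detectron config).
TRAIN_CFG = {
    "coco":       {"short_edge": 800, "gpus": 2, "minibatch": 4,
                   "iterations": 640_000, "base_lr": 0.02, "lr_drop_at": 480_000},
    "cityscapes": {"image_size": (1024, 800), "gpus": 2, "minibatch": 2,
                   "iterations": 96_000, "base_lr": 0.01, "lr_drop_at": 56_000},
    "common":     {"rois_per_image": 512, "pos_neg_ratio": "1:3",
                   "anchor_scales": 5, "anchor_aspect_ratios": 3,
                   "rpn_proposals_train": 2000, "rpn_proposals_test": 1000,
                   "weight_decay": 1e-4, "momentum": 0.9},
}
```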
Table 5. Comparison of object detection and instance segmentation results of the 7-px Scored BshapeNet(+) with the SOTA models on COCO test-dev. ResNet101 is used as a backbone. Inference time per image in milliseconds is measured on a GTX 1080Ti GPU.

| Object Det.        | Time | APbb | AP50bb | AP75bb | APLbb | APMbb | APSbb |
|--------------------|------|------|--------|--------|-------|-------|-------|
| SNIPER [28]        | 342  | 46.1 | 67.0   | 51.6   | 58.1  | 48.9  | 29.6  |
| Cascade R-CNN [4]  | 203  | 42.8 | 62.1   | 46.3   | 55.2  | 45.5  | 23.7  |
| MaskLab+ [5]       | 205  | 41.9 | 62.6   | 46.0   | 54.2  | 45.5  | 23.8  |
| Mask R-CNN [12]    | 193  | 41.9 | 63.8   | 45.2   | 53.7  | 44.5  | 24.1  |
| SOD-MTGAN [3]      | -    | 41.4 | 63.2   | 45.4   | 52.6  | 44.2  | 24.7  |
| BshapeNet          | 193  | 42.3 | 64.5   | 46.4   | 53.0  | 45.7  | 24.9  |
| BshapeNet+         | 198  | 42.4 | 64.7   | 46.6   | 53.1  | 45.9  | 24.9  |

| Instance Seg.      | Time | APmk | AP50mk | AP75mk | APLmk | APMmk | APSmk |
|--------------------|------|------|--------|--------|-------|-------|-------|
| MS R-CNN [15]      | 193  | 38.3 | 58.8   | 41.5   | 54.4  | 40.4  | 17.8  |
| Mask R-CNN [12]    | 193  | 37.1 | 58.2   | 39.2   | 53.9  | 38.2  | 16.5  |
| MaskLab [5]        | 205  | 35.4 | 57.4   | 37.4   | 49.2  | 38.3  | 16.9  |
| RetinaMask [9]     | 166  | 34.7 | 55.4   | 36.9   | 50.5  | 36.7  | 14.3  |
| BshapeNet          | 193  | 37.0 | 58.1   | 38.2   | 51.9  | 38.5  | 16.7  |
| BshapeNet+         | 198  | 37.5 | 59.2   | 39.5   | 53.1  | 39.9  | 17.3  |
These findings demonstrate that the boundary shape information of an object facilitates detecting the object much more effectively than the bbox information does.
Scored masks vs. Thick masks: The accuracies of the Scored models with the same variant condition surpass those of the Thick models on both MS COCO and Cityscapes, as shown in Tables 1, 2, 6, and 7. In particular, the Scored models are much better than the Thick models in instance segmentation (Tables 2 and 7). For example, the 7-px Scored BshapeNet (36.7 APmk) is 5.2 points higher than the 7-px Thick BshapeNet (31.5 APmk) on MS COCO (Table 2). This finding confirms that filling the false boundary pixels with values (distance-based scores) different from the true boundary value improves both object detection and instance segmentation performance.
BshapeNet vs. BshapeNet+: Because Scored BshapeNet performs best among all BshapeNet, BboxNet, and their variant models, we experiment with BshapeNet+ using only Scored models. The results are shown in Tables 1, 7, and 9. BshapeNet+ performs better than BshapeNet in all experiments under the same conditions. In our opinion, adding the instance mask branch allows the model to learn more patterns of object locations. The results of this model are analyzed in the next two subsections.
Ease of application: Our modularizable component, the bshape (or bbox) mask branch, is easily applicable to various models, such as PANet as well as Mask R-CNN and Faster R-CNN, as shown in Table 4. When our bshape mask branch is applied to PANet, object detection improves by 1.7 AP and instance segmentation improves by 0.8 AP.
Table 6. Object detection results on the Cityscapes val dataset with bbox AP (%). "Ours" means results obtained in our experimental environment.

| Model (R-50)          | APbb | AP50bb | AP75bb |
|-----------------------|------|--------|--------|
| Thick BboxNet (3)     | 29.3 | 48.4   | 28.6   |
| Thick BboxNet (5)     | 29.0 | 48.0   | 28.5   |
| Thick BboxNet (7)     | 29.2 | 48.1   | 28.3   |
| Thick BboxNet (11)    | 29.1 | 48.2   | 28.1   |
| Scored BboxNet (3)    | 29.7 | 48.9   | 28.9   |
| Scored BboxNet (5)    | 29.2 | 48.1   | 28.8   |
| Scored BboxNet (7)    | 29.4 | 48.2   | 28.3   |
| Scored BboxNet (11)   | 29.4 | 48.2   | 28.5   |
| Thick BshapeNet (3)   | 30.3 | 49.7   | 28.7   |
| Thick BshapeNet (5)   | 30.0 | 49.4   | 28.6   |
| Thick BshapeNet (7)   | 29.9 | 49.2   | 28.6   |
| Thick BshapeNet (11)  | 29.9 | 49.2   | 28.7   |
| Scored BshapeNet (3)  | 32.0 | 52.0   | 32.1   |
| Scored BshapeNet (5)  | 31.4 | 50.5   | 29.9   |
| Scored BshapeNet (7)  | 30.9 | 50.1   | 29.2   |
| Scored BshapeNet (11) | 30.7 | 50.3   | 29.1   |
| BshapeNet+ (3)        | 32.3 | 52.4   | 32.5   |
| BshapeNet+ (5)        | 31.8 | 51.7   | 31.6   |
| BshapeNet+ (7)        | 31.3 | 50.8   | 29.7   |
| BshapeNet+ (11)       | 31.2 | 51.0   | 29.7   |
| Faster R-CNN‡ (Ours)  | 28.9 | 48.4   | 28.1   |
| Mask R-CNN (Ours)     | 29.6 | 49.1   | 29.2   |
4.4. Object Detection
Main Results: All BshapeNet and BshapeNet+ models show better detection performance than the baseline models, Faster R-CNN‡ and Mask R-CNN, on both the MS COCO dataset (minival and test-dev) and Cityscapes (val), as shown in Tables 1, 3, 5, and 6. In particular, the best BshapeNet result on COCO (42.1 AP) is 2.3 points (1.2 points) higher than Faster R-CNN‡ (Mask R-CNN), as shown in Table 1. Similarly, on Cityscapes, our best BshapeNet result (32.0 AP) is 3.1 points (2.4 points) higher than Faster R-CNN‡ (Mask R-CNN), as shown in Table 6. These results demonstrate that our mask branches help improve object detection performance. In addition, the results show that the Scored bshape mask branch is more effective than the instance mask branch for object detection. Furthermore, BshapeNet+ obtains better results than BshapeNet on both datasets; specifically, it achieves 42.3 AP (32.3 AP) on COCO minival (Cityscapes val). This shows that multi-task learning via both boundary-focused and instance-focused learning effectively helps object detection. Table 5 shows that our model achieves very competitive results against recent SOTA models on MS COCO test-dev; BshapeNet+ performs at a high level among the SOTA models presented. In particular, all proposed models are excellent for small objects.
Ablation Studies: We also perform ablations on COCO test-dev, as in Table 3. We compare Faster R-CNN‡ with BshapeNet to check the effect of the bshape mask branch: the result of BshapeNet (42.3 AP) is significantly higher than that of Faster R-CNN‡ (40.0 AP). When we change the backbone to ResNeXt101, BshapeNet reaches 42.5 AP on test-dev, which is 0.2 points higher. Adding an instance mask branch to this model, that is, BshapeNet+, improves detection performance by a further 0.3 points to 42.8 AP.
Results of the Bbox Mask Branch: Bboxes can be predicted with the bbox mask of BboxNet alone, and bbox AP can be calculated from the coordinates of the top-left and bottom-right corners of the predicted box. The results of the bbox mask branch are similar to, or slightly higher than, those of the bbox regressor, as shown in Table 8. In addition, the intersection of the two results further improves accuracy.
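A minimal sketch of how a box could be read off a predicted bbox mask (threshold, then take the extreme coordinates of the positive pixels) is given below; the threshold value and the handling of empty masks are our assumptions.

```python
import numpy as np

def box_from_bbox_mask(mask_prob: np.ndarray, thresh: float = 0.5):
    """Return (x1, y1, x2, y2) from a predicted bbox-mask probability map,
    using the top-left and bottom-right corners of the thresholded pixels."""
    ys, xs = np.nonzero(mask_prob > thresh)
    if len(xs) == 0:
        return None  # no confident bbox-mask pixels for this RoI
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```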
Table 7. Instance segmentation results on the Cityscapes val set with mask AP (%).

| Model (R-50)          | APmk | AP50mk | AP75mk |
|-----------------------|------|--------|--------|
| Thick BshapeNet (3)   | 29.4 | 48.2   | 29.0   |
| Thick BshapeNet (5)   | 29.4 | 48.0   | 29.1   |
| Thick BshapeNet (7)   | 29.1 | 47.5   | 29.0   |
| Thick BshapeNet (11)  | 28.9 | 47.6   | 28.8   |
| Scored BshapeNet (3)  | 32.1 | 49.8   | 30.2   |
| Scored BshapeNet (5)  | 31.9 | 49.5   | 29.9   |
| Scored BshapeNet (7)  | 31.7 | 49.0   | 29.6   |
| Scored BshapeNet (11) | 31.4 | 48.7   | 29.6   |
| BshapeNet+ (3)        | 33.5 | 50.7   | 30.7   |
| BshapeNet+ (5)        | 32.3 | 50.7   | 30.2   |
| BshapeNet+ (7)        | 32.0 | 50.2   | 29.7   |
| BshapeNet+ (11)       | 31.9 | 48.8   | 29.5   |
| Mask R-CNN [13]       | 31.5 | -      | -      |
| Mask R-CNN (Ours)     | 31.2 | 49.7   | 29.6   |
Table 8. Object detection results of Scored BboxNet (ResNet101) on COCO minival and of Scored BboxNet (ResNet50) on Cityscapes val. BBR is the result from the bbox regressor, BM the result from the bbox mask, and BBR ∩ BM the result from the intersection of BM and BBR.

| Model     | COCO APbb | COCO AP50bb | COCO AP75bb | Cityscapes APbb | Cityscapes AP50bb | Cityscapes AP75bb |
|-----------|-----------|-------------|-------------|-----------------|-------------------|-------------------|
| BBR       | 38.1      | 59.7        | 40.9        | 29.7            | 48.9              | 28.9              |
| BM        | 38.2      | 59.9        | 40.8        | 29.7            | 49.1              | 28.8              |
| BBR ∩ BM  | 38.4      | 60.1        | 42.4        | 29.9            | 49.2              | 28.8              |
4.5. Instance Segmentation
Main Results: We present the instance segmentation results of BshapeNet+ and BshapeNet on COCO and Cityscapes in Tables 2, 3, 5, 7, and 9. Tables 2 and 7 show that all BshapeNet+ models (except one model) outperform the baseline Mask R-CNN on both datasets. In particular, the best-performing BshapeNet+ achieves 37.1 AP and 33.5 AP on COCO and Cityscapes, respectively. Our models achieve highly competitive results with the SOTA models on COCO test-dev and the Cityscapes test-server, as shown in Tables 5 and 9. As mentioned earlier, because our branch module is easily applicable to existing models, it can be used to further improve them for instance segmentation: although the accuracy of our model in Table 9 is lower than those of PANet and UPSNet, our branch can be applied to both models to improve their performance, as in Table 4. In addition, BshapeNet+ performs well on small objects, as shown in Table 5 and Figure 5; it reaches 17.3 AP for small objects, which is superior to the current SOTA models except for one model. However, the performance of BshapeNet on large objects is slightly lower than that of BshapeNet+ owing to false boundaries. Therefore, post-processing that reduces the thickness of the false boundary, such as a conditional random field, could improve performance.
Ablation Studies: We also perform an ablation analysis for instance segmentation, as in Table 3. With ResNet101 as the backbone, BshapeNet achieves 37.0 AP, and with ResNeXt101 it achieves 37.2 AP on test-dev, which is 0.2 points higher. Adding the instance mask branch to BshapeNet yields 37.9 AP on test-dev, 0.7 points higher than BshapeNet.
Table 9. Comparison of instance segmentation results of the proposed models with the SOTA models on the Cityscapes test-server. The 3-px models are used. ResNet50 is used as a backbone. Inference time per image in milliseconds is measured on a GTX 1080Ti GPU.

| Model                   | Train dataset | Time | APmk | AP50mk |
|-------------------------|---------------|------|------|--------|
| PANet [COCO] [20]       | fine          | 165  | 36.4 | 63.1   |
| UPSNet [COCO] [31]      | fine          | 169  | 33.0 | 59.6   |
| Mask R-CNN [COCO] [13]  | fine          | 159  | 31.9 | 58.1   |
| PolygonRNN++ [1]        | fine          | 295  | 25.4 | 45.5   |
| Deep Coloring [16]      | fine          | 285  | 24.9 | 46.2   |
| BshapeNet               | fine          | 159  | 27.1 | 50.3   |
| BshapeNet+              | fine          | 159  | 27.3 | 50.5   |
| BshapeNet+ [COCO]       | fine          | 163  | 32.9 | 58.8   |
Fig. 5. Comparison of the instance segmentation results of the proposed model (3-px Scored BshapeNet+, top) and Mask R-CNN (bottom) on Cityscapes images. Both models are trained with a ResNet50 backbone.
5. Conclusion
We demonstrated significantly improved object detection performance by additionally providing the locations of objects in a different format, with a novel masking scheme, alongside the coordinates used by object detection algorithms. Doing so particularly improves performance on small objects. In addition, we showed that the proposed module notably improves the accuracy of instance segmentation. Furthermore, our module can easily be applied to existing algorithms.

Acknowledgments
This research was supported by System LSI Business, Samsung Electronics Co., Ltd.

References
[1] Acuna, D., Ling, H., Kar, A., Fidler, S., 2018. Efficient interactive annotation of segmentation datasets with polygon-rnn++, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 859–868.
[2] Bai, M., Urtasun, R., 2017. Deep watershed transform for instance segmentation, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, pp. 2858–2866.
[3] Bai, Y., Zhang, Y., Ding, M., Ghanem, B., 2018. Sod-mtgan: Small object detection via multi-task generative adversarial network, in: Proceedings of the European Conference on Computer Vision (ECCV), pp. 206–221.
[4] Cai, Z., Vasconcelos, N., 2018. Cascade r-cnn: Delving into high quality object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6154–6162.
[5] Chen, L.C., Hermans, A., Papandreou, G., Schroff, F., Wang, P., Adam, H., 2018. Masklab: Instance segmentation by refining object detection with semantic and direction features, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4013–4022.
[6] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B., 2016. The cityscapes dataset for semantic urban scene understanding, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223.
[7] Dai, J., He, K., Sun, J., 2016. Instance-aware semantic segmentation via multi-task network cascades, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3150–3158.
[8] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L., 2009. Imagenet: A large-scale hierarchical image database, in: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, IEEE, pp. 248–255.
[9] Fu, C.Y., Shvets, M., Berg, A.C., 2019. Retinamask: Learning to predict masks improves state-of-the-art single-shot detection for free. arXiv preprint arXiv:1901.03353.
[10] Girshick, R., 2015. Fast r-cnn, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448.
[11] Girshick, R., Donahue, J., Darrell, T., Malik, J., 2014. Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587.
[12] Girshick, R., Radosavovic, I., Gkioxari, G., Dollár, P., He, K., 2018. Detectron.
[13] He, K., Gkioxari, G., Dollár, P., Girshick, R., 2017. Mask r-cnn, in: Computer Vision (ICCV), 2017 IEEE International Conference on, IEEE, pp. 2980–2988.
[14] He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[15] Huang, Z., Huang, L., Gong, Y., Huang, C., Wang, X., 2019. Mask scoring r-cnn, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6409–6418.
[16] Kulikov, V., Yurchenko, V., Lempitsky, V., 2018. Instance segmentation by deep coloring. arXiv preprint arXiv:1807.10007.
[17] Li, Y., Qi, H., Dai, J., Ji, X., Wei, Y., 2016. Fully convolutional instance-aware semantic segmentation. arXiv preprint arXiv:1611.07709.
[18] Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S., 2017. Feature pyramid networks for object detection, in: CVPR, p. 4.
[19] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L., 2014. Microsoft coco: Common objects in context, in: European Conference on Computer Vision, Springer, pp. 740–755.
[20] Liu, S., Qi, L., Qin, H., Shi, J., Jia, J., 2018a. Path aggregation network for instance segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8759–8768.
[21] Liu, Y., Yang, S., Li, B., Zhou, W., Xu, J., Li, H., Lu, Y., 2018b. Affinity derivation and graph merge for instance segmentation, in: Proceedings of the European Conference on Computer Vision (ECCV), pp. 686–703.
[22] Long, J., Shelhamer, E., Darrell, T., 2015. Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.
[23] Pinheiro, P.O., Collobert, R., Dollár, P., 2015. Learning to segment object candidates, in: Advances in Neural Information Processing Systems, pp. 1990–1998.
[24] Pinheiro, P.O., Lin, T.Y., Collobert, R., Dollár, P., 2016. Learning to refine object segments, in: European Conference on Computer Vision, Springer, pp. 75–91.
[25] Prim, R.C., 1957. Shortest connection networks and some generalizations. Bell System Technical Journal 36, 1389–1401.
[26] Ren, S., He, K., Girshick, R., Sun, J., 2015. Faster r-cnn: Towards real-time object detection with region proposal networks, in: Advances in Neural Information Processing Systems, pp. 91–99.
[27] Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., LeCun, Y., 2013. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229.
[28] Singh, B., Najibi, M., Davis, L.S., 2018. Sniper: Efficient multi-scale training, in: Advances in Neural Information Processing Systems, pp. 9310–9320.
[29] Uijlings, J.R., Van De Sande, K.E., Gevers, T., Smeulders, A.W., 2013. Selective search for object recognition. International Journal of Computer Vision 104, 154–171.
[30] Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K., 2017. Aggregated residual transformations for deep neural networks, in: Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, IEEE, pp. 5987–5995.
[31] Xiong, Y., Liao, R., Zhao, H., Hu, R., Bai, M., Yumer, E., Urtasun, R., 2019. Upsnet: A unified panoptic segmentation network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8818–8826.
Declaration of interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.