
Journal Pre-proof

Segmentation Based Rotated Bounding Boxes Prediction and Image Synthesizing for Object Detection of High Resolution Aerial Images

Yingming Wang, Lijun Wang, Huchuan Lu, You He

PII: S0925-2312(20)30083-7
DOI: https://doi.org/10.1016/j.neucom.2020.01.039
Reference: NEUCOM 21795
To appear in: Neurocomputing
Communicated by: Dr. Ruiping Wang

Received date: 3 August 2019
Revised date: 10 December 2019
Accepted date: 6 January 2020

Please cite this article as: Yingming Wang, Lijun Wang, Huchuan Lu, You He, Segmentation Based Rotated Bounding Boxes Prediction and Image Synthesizing for Object Detection of High Resolution Aerial Images, Neurocomputing (2020), doi: https://doi.org/10.1016/j.neucom.2020.01.039

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2020 Published by Elsevier B.V.

Segmentation Based Rotated Bounding Boxes Prediction and Image Synthesizing for Object Detection of High Resolution Aerial Images

Yingming Wang a, Lijun Wang b, Huchuan Lu a,∗∗, You He a,c

a School of Information and Communication Engineering, Dalian University of Technology, Dalian 116023, China
b School of Computer Science and Technology, Dalian University of Technology, Dalian 116023, China
c Institute of Information Fusion, Naval Aviation University, Yantai 264001, China

Abstract

Object detection for aerial images is becoming an active topic in computer vision with many real-world applications. It is a very challenging task due to many factors such as highly complex backgrounds, arbitrary object orientations, high input resolutions, etc. In this paper, we develop a new training and inference mechanism, which is shown to significantly improve the detection accuracy for high resolution aerial images. Instead of estimating the orientations of objects through direct regression as in previous methods, we propose to predict the rotated bounding boxes by leveraging a segmentation task, which is easier to train and yields more accurate detection results. In addition, an image synthesizing based data augmentation strategy is presented to address the data imbalance issues in aerial object detection. Extensive experiments have been conducted to verify our contribution. The proposed method sets new state-of-the-art performance on the challenging DOTA dataset.

Keywords: Object detection, arbitrary orientations, aerial images, high resolution images

∗∗ Corresponding author. Email address: [email protected]; Tel./Fax: 86-411-84708971


1. Introduction

Deep neural networks have achieved significant progress [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11] in computer vision tasks for general images. With the fast development of unmanned aerial vehicles and satellites, object detection for aerial images has attracted wide attention from both the research and industrial communities. Although recent years have witnessed significant progress in object detection [12, 13, 14, 15, 16, 17, 18, 19, 20, 21], detection accuracy for aerial images is still unsatisfactory.

Compared to conventional object detection, aerial images are mostly of high resolution (e.g., more than 4k × 4k pixels as shown in Figure 1), making aerial detection more challenging in two respects. Firstly, since the objects are often small, their appearance information is significantly lost when the input images are directly downsampled, leading to inferior detection performance. Secondly, training detectors in a patch-by-patch manner suffers from the imbalance between positive and negative samples, as objects in aerial images are mostly sparsely distributed and vast image areas are background regions. Another challenge in aerial object detection is that objects may appear in various orientations, so their locations cannot be well characterized by horizontally/vertically aligned bounding boxes as in conventional object detection.

Figure 1: Examples of aerial images that are very large in size while the objects they contain are small.

Previous works [22, 23, 24] address this issue by explicitly learning to regress the angles of bounding boxes. Since angles are periodic, a small rotation (e.g., from 1◦ to 359◦) can result in a significant difference in the regression target, making direct angle regression challenging. An alternative way to circumvent the periodicity issue is to cast bounding box angle prediction as a classification task. However, this is still not a desirable solution, since continuous variations of the angles are ignored and differences between discrete angles are treated equally.

This paper proposes an end-to-end network for aerial object detection based on the two-stage detection framework [12, 13, 14, 15].

In order to better handle input images with high resolutions, we split input images into image patches and perform training and evaluation in a patch-by-patch manner. A training strategy with data augmentation is developed, which increases the variety of training patches with synthetic images, and is shown to effectively mitigate the imbalance issue between positive and negative samples. In addition, a new bounding box angle prediction approach is also presented. As opposed to angle estimation through direct regression, our method infers the bounding box angles from segmentation results, which lowers the difficulty of network training while allowing more accurate angle estimation. In summary, the main contributions of our work are as follows:

(1) We present a new paradigm for training and testing deep learning based aerial object detection for high resolution images. Through careful design, our method can effectively operate in a patch-by-patch mode and alleviates sample imbalance through data augmentation.

(2) We propose a new bounding box angle prediction method based on segmentation results, delivering more precise location representations for objects of various orientations.

(3) We achieve new state-of-the-art performance on the popular DOTA [23] dataset. Ablation studies have also been conducted to verify the effectiveness of the proposed algorithms.

2. Related Work

There are many remarkable object detection algorithms based on deep convolutional neural networks (CNNs) [12, 13, 14, 15, 16, 17, 18, 19, 20, 22, 25, 26, 27, 28, 29].

The series of two-stage region-based frameworks [12, 13, 14, 15] achieves impressive performance on object detection. They first generate regions of interest (RoIs) as proposals, and then turn each proposal into a final detection with classification and regression tasks. Fast R-CNN [13] proposes the RoI-pooling layer to pool the features of each RoI, so features need to be extracted only once per image, which reduces computation and runs faster. Faster R-CNN [14] generates RoIs using a region proposal network (RPN), which is faster than traditional proposal algorithms. Mask R-CNN [15] replaces the RoI-pooling layer with an RoI-align layer to remove the misalignment caused by rounding, and additionally trains the network with segmentation masks, which brings more information. On the other hand, one-stage methods [16, 17, 18], which directly predict the final detections without the RoI-pooling operation, achieve very fast detection speeds. All of these methods predict only horizontal bounding boxes for the instances.

In the field of detection in aerial images, RoI Transformer [22] proposes to transform horizontal RoIs into rotated RoIs and to pool features based on the rotated RoIs to predict better rotated bounding boxes. ICN [27] proposes a joint image cascade and feature pyramid network that allows extracting information over a wide range of scales. These methods suffer from predicting the angles of instances through regression tasks. In order to handle high resolution images, they simply split the image by sliding windows, but we believe the performance can be further improved by more efficient image processing.

There are also many algorithms that predict rotated bounding boxes in the field of scene text detection.


These algorithms achieve great progress in scene text detection based on regression tasks [30, 31, 32, 33, 34] or segmentation tasks [35, 36, 37, 38, 39]. However, different from scene text detection, object detection in aerial images is more challenging in several respects. First, we would like to perform multi-category object detection rather than detecting text only. Second, aerial images are very large while the objects are often very small and densely aggregated.

Gao et al. [40] propose a dynamic zoom-in network for high resolution images based on reinforcement learning. They first zoom out the images for coarse detection and then dynamically zoom in on some areas to perform detection at a higher resolution based on the coarse detection results. However, this method is not suitable for high resolution aerial images: the size of aerial images can be much larger than those in their experiments while the objects are much smaller, so the coarse detection would only detect a few large instances. Kisantal et al. [41] propose to oversample the images with small objects and augment each of those images by copy-pasting small objects many times. Their data augmentation and ours both paste image patches onto background regions, but our method is quite different. Their idea is to increase the quantity of small objects, while our motivation is to increase the variety of the background regions and to effectively mitigate the imbalance issue between positive and negative samples. Hence, for each positive image patch we sample a background image patch and paste the positive patch onto it, while the method in [41] copies and pastes the small objects onto the background region of each image.


In the field of detection in aerial images, the most popular dataset is DOTA [23], because it has a large number of images and many instances of different classes with oriented object annotations. Each of its images is about 4000 × 4000 pixels in size. Other datasets [42, 43, 44, 45] are not as large as DOTA in terms of the numbers of images and instances. Some of them focus on a few categories, e.g., UCAS-AOD [43] contains vehicles and planes, VEDAI [42] focuses on various kinds of vehicles, while HRSC2016 [44] contains various kinds of ships. The image width of these datasets is about 1000 pixels, which can be regarded as having been split from high resolution images in advance, so they can be used directly for training without splitting. Multi-category, arbitrary-oriented object detection in high resolution aerial images has only begun to attract attention in recent years and algorithms are still rare. We would like to provide a general algorithm for it and hope the ideas in this paper can be used in other algorithms.

3. Algorithm

3.1. Overview

As shown in Figure 2, our network is based on the two-stage object detection framework [12, 13, 14, 15]. We adopt ResNet-50 [1] as the backbone network. The feature pyramid network (FPN) [46] is leveraged to obtain multi-scale features of the input image, which has been shown to be beneficial for handling scale variations of objects. The RPN [14] is built on top of the feature pyramid and is followed by an RoI-align layer [15]; the former proposes regions of interest (RoIs) and the latter extracts a fixed-size feature for each RoI.


Figure 2: The pipeline of our framework. The patch in the red dashed box is a sampled patch of the training image. The red boxes are the rotated bounding box annotations while the blue ones are the horizontal bounding boxes generated from the rotated bounding boxes. The green dashed boxes are partially included targets; the anchors or proposals matched to them are ignored in training.

Finally, the extracted features of RoIs are fed into the multi-task prediction heads, which perform category classification, bounding box regression and mask prediction. In the following, we will first introduce the proposed rotated bounding box prediction algorithm in Section 3.2, then present our data augmentation techniques based on synthetic images to address the training data imbalance issues in Section 3.3. In Section 3.4, model training and inference details are described.
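To make the pipeline above concrete, the following sketch builds a comparable backbone / RPN / RoI-Align / multi-head model from torchvision's Mask R-CNN implementation. It is only a rough stand-in for our network, not the exact model: the standard instance segmentation head takes the place of the box mask head described in Section 3.2, and the class count of 16 (15 DOTA categories plus background) is an assumption.

```python
# A rough, torchvision-based stand-in for the backbone / RPN / RoI-Align /
# multi-head pipeline sketched in Figure 2. It is not the exact model used
# in this paper: the standard instance-segmentation mask head takes the
# place of our box mask head, and 16 classes (15 DOTA categories plus
# background) is an assumption.
import torch
import torchvision

model = torchvision.models.detection.maskrcnn_resnet50_fpn(num_classes=16)
model.eval()

# One 800 x 800 RGB patch, as used for training patches in this paper.
dummy_patch = [torch.rand(3, 800, 800)]
with torch.no_grad():
    outputs = model(dummy_patch)  # boxes, labels, scores, masks per patch
```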


3.2. Rotated Bounding Box Prediction

The bounding box regression head in most conventional two-stage object detection frameworks locates each object candidate with a horizontally aligned bounding box, indicating the coordinates of the top left corner point and the size of the box. In the aerial object detection task, since the images are captured from a bird's-eye view, the objects may undergo various rotations. To accurately locate the objects, one can modify the bounding box regression head to predict an additional rotation angle for each bounding box. As explained in Section 1, estimating the rotation angle through direct regression or classification suffers from the angle periodicity problem, giving rise to suboptimal detection results.

To address the above issue, we propose to infer a rotated bounding box by predicting a binary bounding box mask, since spatial masks are more flexible for representing oriented object locations and easier to learn than rotation angle regression. For this purpose, we add a box mask head in parallel to the bounding box regression head, where the box regression head is used to predict a rough horizontal box location, and the box mask head aims to identify the more precise rotated box location. During training, we compute a horizontal box that tightly encloses the ground truth rotated box and serves as the target output for the box regression head. A binary mask located in the horizontal box region is also generated and used to train the box mask head, where 0 represents a background pixel and 1 indicates a foreground pixel within the ground truth rotated box region (see Figure 3 for an example). During testing, the rotated box location can be inferred by combining the predictions of the bounding box regression and box mask heads (see Figure 4 for an example).

Figure 3: The process of converting the rotated bounding boxes into horizontal boxes and binary mask ground truth. The red boxes are the rotated bounding box annotations, the blue ones are the horizontal bounding boxes generated from the rotated bounding boxes, and the green dashed boxes are partially included targets.

The core idea is to identify a minimum-sized bounding box that encloses the predicted foreground pixels. We achieve this using the off-the-shelf Rotating Calipers algorithm [47] as implemented in OpenCV [48]. To our knowledge, we are the first to explore mask based rotated bounding box prediction for the aerial object detection task. Our proposed method incorporates horizontal bounding box regression with box mask prediction in a coarse-to-fine manner, further boosting the detection accuracy. It should also be noted that our method is inspired by Mask R-CNN [15]. Nonetheless, ours differs in the sense that our predicted box mask aims to detect rotated boxes, hence the name, and therefore does not require precise per-pixel segmentation annotations for training.

In practice, we find our rotated box prediction method can be further improved by leveraging confidence measurement techniques. We use the probabilities predicted by the category classification head as the initial confidence scores for the detected bounding boxes.
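As an illustration of this step, the following sketch recovers a rotated box from a predicted box mask. It assumes OpenCV 4's findContours return signature; the mask/threshold interface is hypothetical rather than taken from our implementation.

```python
import cv2
import numpy as np

def rotated_box_from_mask(box_mask, threshold=0.5):
    """Recover a rotated bounding box from a predicted box mask.

    box_mask is an H x W array of foreground probabilities inside the
    predicted horizontal box. Returns the 4 x 2 corner coordinates of the
    minimum-area enclosing rectangle, or None if nothing is foreground.
    """
    binary = (box_mask >= threshold).astype(np.uint8)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    # Keep the largest connected foreground region; minAreaRect performs
    # the rotating-calipers search for the minimum-area rectangle.
    largest = max(contours, key=cv2.contourArea)
    rect = cv2.minAreaRect(largest)   # ((cx, cy), (w, h), angle)
    return cv2.boxPoints(rect)        # corner points of the rotated box
```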


Figure 4: The process of inference. The testing image is divided into several patches by a sliding window of size 1200 × 1200 with a stride of 800. Each patch is fed into the network and the results are combined by NMS. The blue boxes are the inferred horizontal bounding boxes and each of them contains a rotated box mask. The red boxes are the final rotated bounding boxes generated from the box masks. The green dashed boxes are targets near the split boundaries, which may be partially included and should be ignored, or may be too small to be suppressed by NMS; the target in the red dashed circle is an example of that.

The following two schemes are designed to further update the confidence scores. The final object detection results are obtained through non-maximum suppression according to the updated confidence scores.

Area Connectivity. Since occlusion rarely occurs in aerial images, accurate box masks mostly contain a single connected foreground region. If a box mask prediction contains multiple connected foreground regions, the prediction is most likely unreliable. Based on this observation, we use the following operation to update the confidence score of the detection result:

s ← s × A_m / A,    (1)

where s denotes the confidence score of a detected object, A_m represents the area of the largest connected foreground region in the predicted box mask, and A denotes the overall foreground area of the corresponding box mask. For a detected object with only one connected foreground region in its box mask, the confidence score is unchanged, since A_m equals A.

Bounding Box Consistency. For each detected bounding box, which may be subject to rotation, we generate a horizontally aligned bounding box that tightly encloses it. For a reliable detection result, the generated horizontal box should have a strong overlap with the horizontal box predicted by the bounding box regression head. Therefore, we compute the Intersection over Union (IoU) of the two boxes. If their IoU is less than a pre-defined threshold τ, we decrease the confidence score as follows:

s ← s × IoU / τ.    (2)

In our experiments, we find that the threshold τ = 0.75 performs well.
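A minimal sketch of the two score updates in Eqs. (1) and (2) follows; the interface, in which the connected-region areas and the box IoU are assumed to be precomputed, is hypothetical.

```python
def update_confidence(score, region_areas, box_iou, tau=0.75):
    """Apply the Area Connectivity and Bounding Box Consistency rules.

    region_areas: areas of the connected foreground regions of the box mask.
    box_iou: IoU between the horizontal box enclosing the rotated prediction
             and the horizontal box from the regression head.
    """
    # Area Connectivity (Eq. 1): penalize masks split into several regions.
    if region_areas:
        score *= max(region_areas) / sum(region_areas)
    # Bounding Box Consistency (Eq. 2): penalize inconsistent predictions.
    if box_iou < tau:
        score *= box_iou / tau
    return score
```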

3.3. Image Synthesizing based Training Data Augmentation

The maximum resolution of the input image is restricted by the GPU memory. To tackle high resolution images during training, we split each training image into sub-image patches using sliding windows. In our experiments, we use three sliding windows of sizes 800 × 800, 400 × 400, and 200 × 200 to better handle object scale variations. To ensure that each object can be entirely included in at least one image patch, we set the stride of the sliding windows to 32 pixels, allowing the patches to sufficiently overlap.


Figure 5: Example of positive patch sampling. The patches in red dashed boxes are examples of candidate patches generated by the sliding windows. The blue boxes are the horizontal bounding boxes generated from the rotated bounding boxes. The patches in solid red lines are the selected positive patches.

Since most patches do not contain any intact objects at all, we use a greedy mechanism, similar to SNIPER [20], to sample positive patches containing as many objects as possible to improve training efficiency. Specifically, we associate each training image with an object set O containing all objects within the image. During each iteration of positive patch sampling, we sample without replacement the patch containing the maximum number of objects within O and then remove the corresponding objects from O. The above procedure is conducted recursively until the object set O is empty, as sketched in the code below. Figure 5 shows an example of the positive patch sampling. The sampled image patches constitute the positive set, while the remaining image patches that do not contain any intact objects serve as the negative training patches.
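The greedy sampling loop can be summarized as follows. This is a sketch with a hypothetical patch representation (each candidate patch carries the set of object ids it fully contains), not the exact implementation.

```python
def sample_positive_patches(candidates, all_object_ids):
    """Greedily select positive patches covering all objects of an image.

    candidates: list of (patch, object_id_set) pairs, where each set holds
    the ids of the objects entirely contained in that candidate patch.
    """
    remaining = set(all_object_ids)
    selected = []
    while remaining:
        # Pick the patch that contains the most not-yet-covered objects.
        patch, covered = max(candidates, key=lambda c: len(c[1] & remaining))
        if not (covered & remaining):
            break  # no candidate covers any remaining object
        selected.append(patch)
        remaining -= covered
    return selected
```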

Figure 6: Examples of the synthetic training patches generated by synthesizing positive patches from existing positive and negative patches. The red boxes are the rotated bounding box annotations while the blue ones are the horizontal bounding boxes generated from the rotated bounding boxes.

However, directly using the above positive and negative sets for training will lead to severe data imbalance, since objects are very rare and sparsely distributed in aerial images. Although data resampling can partially address this issue, we propose a data augmentation technique with synthetic images as an alternative solution, which not only rebalances positive and negative samples, but also improves the training data diversity, enabling more effective network training.


To this end, we augment the positive image patches by synthesizing new positive patches from existing positive and negative patches. More precisely, for each positive patch of a training image, if its size is less than 800 pixels (e.g., 400 or 200 pixels), we randomly select a negative patch from the same image whose size is larger than that of the positive patch. The positive patch is then randomly pasted onto the larger negative patch, generating a new positive patch with a different background, as shown in Figure 6. The above data synthesizing procedure is conducted for each positive patch with a probability p, which can be tuned to adjust the positive and negative sample ratio. In our experiments, we empirically set p = 0.5, which delivers the best performance.

3.4. Training and Inference Details

We train and evaluate the proposed method on the DOTA aerial detection dataset [23], which consists of 1411 training images, 458 validation images, and 937 test images. During training, we follow most of the settings of Mask R-CNN [15]. Specifically, we resize each input image patch to 800 × 800 pixels. In addition to the image synthesizing based augmentation method, the popular data augmentation strategies of random horizontal flipping and random vertical flipping are also used. The network is trained for 440k iterations with a batch size of 2. The initial learning rate is 0.0025 with 500 iterations of warm-up, and is decreased by a factor of 10 at the 240k and 360k iterations. We use a weight decay of 0.0001 and a momentum of 0.9. Any anchors or proposals matched to partially included objects (the objects in the green dashed boxes shown in Figure 2), whose semantic information is not clear (as shown in Figure 7) and may confuse the network, are ignored at the training stage, as sketched below.
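The effect of ignoring partially included objects on label assignment can be sketched as follows; the IoU thresholds and the convention of marking ignored anchors with -1 are assumptions borrowed from common RPN implementations rather than details of our code.

```python
import numpy as np

def assign_rpn_labels(anchor_gt_iou, gt_is_partial,
                      pos_thresh=0.7, neg_thresh=0.3):
    """Assign RPN labels while ignoring partially included objects.

    anchor_gt_iou: (num_anchors, num_gt) IoU matrix.
    gt_is_partial: boolean array, True for ground-truth boxes that are only
    partially contained in the current patch.
    Returns labels with 1 = positive, 0 = negative, -1 = ignored in the loss.
    """
    labels = np.full(anchor_gt_iou.shape[0], -1, dtype=np.int64)
    max_iou = anchor_gt_iou.max(axis=1)
    best_gt = anchor_gt_iou.argmax(axis=1)
    labels[max_iou < neg_thresh] = 0
    labels[max_iou >= pos_thresh] = 1
    # Anchors matched to a partially included object are ignored so that
    # its unclear semantics do not confuse the network.
    labels[gt_is_partial[best_gt] & (max_iou >= neg_thresh)] = -1
    return labels
```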

Figure 7: Examples of training samples that contain partially included targets with unclear semantic information. The red boxes are the rotated bounding box annotations while the blue ones are the horizontal bounding boxes generated from the rotated bounding boxes.

During testing, we also adopt the patch-by-patch mode, where image patches are extracted using a sliding window of size 1200 × 1200 with a stride of 800. The threshold for the mask is 0.5 and the NMS threshold is 0.5. To improve the robustness against object scale variations, we perform test-time input image rescaling as is done in [49] with factors {×0.5, ×1.0, ×2.0}. We then aggregate the multi-scale detection results using non-maximum suppression to obtain the final output. Detections within 20 pixels of the split boundaries, which may correspond to partially included objects and may be too small to be suppressed by NMS, are ignored to avoid false positives (FP). For example, the detections in the green dashed boxes in Figure 4 are ignored.
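For reference, the test-time window positions described above can be enumerated as in the following sketch, which assumes the image is at least as large as one window and clamps the last window along each axis to the image border.

```python
def sliding_window_origins(image_h, image_w, window=1200, stride=800):
    """Top-left corners of the 1200 x 1200 test windows with stride 800."""
    ys = list(range(0, image_h - window + 1, stride))
    xs = list(range(0, image_w - window + 1, stride))
    # Clamp a final window to the border so the whole image is covered.
    if not ys or ys[-1] + window < image_h:
        ys.append(max(image_h - window, 0))
    if not xs or xs[-1] + window < image_w:
        xs.append(max(image_w - window, 0))
    return [(y, x) for y in ys for x in xs]
```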

4. Experiment

4.1. Datasets

We use the DOTA dataset [23] for our experiments. DOTA is a large-scale dataset for object detection in aerial images with both horizontal and oriented bounding box annotations. It contains 2806 large aerial images from different sensors and platforms. The fully annotated DOTA images contain 188,282 instances in 15 categories: plane (PL), baseball diamond (BD), bridge (BR), ground track field (GTF), small vehicle (SV), large vehicle (LV), ship (SH), tennis court (TC), basketball court (BC), storage tank (ST), soccer-ball field (SBF), roundabout (RA), harbor (HA), swimming pool (SP) and helicopter (HC).

4.2. Comparison with State-of-the-art

In order to validate the effectiveness of our proposed algorithm, we compare the proposed segmentation-based rotated bounding box detector (SegmRDet) with state-of-the-art algorithms, including FR-O, RRPN [34], R2CNN [24], R-DFPN [50], ICN [27], RoI Transformer [22], and the algorithm of Yang et al. [51]. Note that FR-O is the Faster R-CNN OBB detector, which is the baseline provided by DOTA [23]. RRPN and R2CNN were originally designed for scene text detection and the reported results are from a third-party re-implementation for DOTA.

Table 1: Comparisons with state-of-the-art methods on the DOTA dataset. Note that our detector (SegmRDet) is only trained on the training set of DOTA and tested on the testing set through the online evaluation server provided by DOTA, while others may be trained on both the training set and the validation set. SegmRDet-ss means the proposed detector is tested only on the single ×1 scale, while SegmRDet-ms means an ensemble of the multiple scales ×0.5, 1, 2. Our detector is tested without any other test augmentations, e.g., flipping or multiple models. The best score of each category is shown in red color.

Method | mAP | PL | BD | BR | GTF | SV | LV | SH | TC | BC | ST | SBF | RA | HA | SP | HC
FR-O | 54.13 | 79.42 | 77.13 | 17.7 | 64.05 | 35.3 | 38.02 | 37.16 | 89.41 | 69.64 | 59.28 | 50.3 | 52.91 | 47.89 | 47.4 | 46.3
RRPN | 61.01 | 80.94 | 65.75 | 35.34 | 67.44 | 59.92 | 50.91 | 55.81 | 90.67 | 66.92 | 72.39 | 55.06 | 52.23 | 55.14 | 53.35 | 48.22
R2CNN | 60.67 | 88.52 | 71.2 | 31.66 | 59.3 | 51.85 | 56.19 | 57.25 | 90.81 | 72.84 | 67.38 | 56.69 | 52.84 | 53.08 | 51.94 | 53.58
R-DFPN | 57.94 | 80.92 | 65.82 | 33.77 | 58.94 | 55.77 | 50.94 | 54.78 | 90.33 | 66.34 | 68.66 | 48.73 | 51.76 | 55.1 | 51.32 | 35.88
Yang et al. | 62.26 | 81.25 | 71.41 | 36.53 | 67.44 | 61.16 | 50.91 | 56.6 | 90.67 | 68.09 | 72.39 | 55.06 | 55.6 | 62.44 | 53.35 | 51.47
ICN | 68.2 | 81.40 | 74.30 | 47.70 | 70.30 | 64.90 | 67.80 | 70.00 | 90.80 | 79.10 | 78.20 | 53.60 | 62.90 | 67.00 | 64.20 | 50.20
RoI Transformer | 69.56 | 88.64 | 78.52 | 43.44 | 75.92 | 68.81 | 73.68 | 83.59 | 90.74 | 77.27 | 81.46 | 58.39 | 53.54 | 62.83 | 58.93 | 47.67
SegmRDet-ss | 72.64 | 89.97 | 79.23 | 50.61 | 69.38 | 74.32 | 74.62 | 85.51 | 90.86 | 80.26 | 86.23 | 49.02 | 64.41 | 74.66 | 71.51 | 48.75
SegmRDet-ms | 74.14 | 89.91 | 80.31 | 52.85 | 70.72 | 76.72 | 79.43 | 86.33 | 90.65 | 82.91 | 87.1 | 48.41 | 69.29 | 74.68 | 72.36 | 50.48

As shown in Table 1, our detector achieves better performance than the state-of-the-art algorithms by a considerable margin. The state-of-the-art detector RoI Transformer [22] is trained on both the training set and the validation set of the DOTA dataset and combines the results of multiple scales. Our detector is only trained on the training set of DOTA, which is a subset of the data used by RoI Transformer [22]. As shown in Table 1, even the mean average precision (mAP) of our single-scale results outperforms RoI Transformer by 3.08 percent, while our multi-scale ensemble outperforms RoI Transformer by 4.58 percent.

4.3. Qualitative Results

The detection results are shown in Figure 8. Our detector handles the objects well even when they are densely distributed and small, and it predicts a tight rotated bounding box for objects of arbitrary orientation.

Figure 8: Detection results on the DOTA dataset.

4.4. Ablation Studies

To evaluate the influence of each part of the algorithm, we conduct a series of ablation studies. Our ablation studies are evaluated on the validation set of DOTA. Note that we only train the network on the training set of DOTA. The results on the validation set are better than those on the test set. For example, the mAP of our detector with all components at a single scale of ×1 is 72.64 on the test set and 74.15 on the validation set, as shown in Table 1 and Table 2 respectively. This is mainly because we stop training when the performance on the validation set stops improving.


Table 2: Comparisons of ablation study for the Area Connectivity and the Bounding Box Consistency. These results are tested on the validation set of DOTA for only a single scale of ×1. Note that we only train the network on the training set of DOTA. AC means the Area Connectivity while BBC means the Bounding Box Consistency.

Method | mAP | PL | BD | BR | GTF | SV | LV | SH | TC | BC | ST | SBF | RA | HA | SP | HC
without AC and BBC | 73.95 | 89.87 | 73.42 | 48.32 | 72.59 | 65.98 | 80.51 | 86.84 | 90.77 | 70.57 | 87.95 | 77.76 | 66.27 | 77.56 | 66.82 | 54.08
+AC | 74.08 | 89.87 | 73.43 | 48.03 | 72.59 | 65.99 | 82.48 | 86.95 | 90.77 | 70.57 | 87.95 | 77.76 | 66.27 | 77.61 | 66.82 | 54.08
+BBC | 74.14 | 89.87 | 73.28 | 48.41 | 72.66 | 66.06 | 82.63 | 87.1 | 90.77 | 70.74 | 87.98 | 77.92 | 66.28 | 77.64 | 66.9 | 53.9
+AC+BBC | 74.15 | 89.87 | 73.28 | 48.42 | 72.66 | 66.06 | 82.73 | 87.09 | 90.77 | 70.74 | 87.98 | 77.92 | 66.28 | 77.64 | 66.9 | 53.9

4.4.1. Area Connectivity and the Bounding Box Consistency

We update the confidence scores by the rules of Area Connectivity (AC) and Bounding Box Consistency (BBC) to improve the performance. As shown in Table 2, the mAP of the detector without AC and BBC is 73.95, the mAP with AC is 74.08, and the mAP with BBC is 74.14. When combining AC and BBC, the mAP of the detector is 74.15. Thus, Area Connectivity improves the mAP by 0.13 percent, Bounding Box Consistency by 0.19 percent, and combining both of them by 0.2 percent. These two methods improve the performance for almost all categories, but the improvement is not very significant. This is because the two methods improve performance by decreasing the confidence scores of unreliable rotated bounding boxes, yet most of the rotated bounding boxes are reliable and their confidence scores remain unchanged. As both methods have the same effect, i.e., decreasing the confidence scores of unreliable detections, the improvement from combining them is marginal, e.g., 0.2 < 0.13 + 0.19.


4.4.2. Rotated Bounding Box Prediction

To evaluate the effectiveness of the segmentation based rotated bounding box prediction, we compare it to several regression-based methods with all the other components unchanged. For these methods, we use the vector (x, y, s, l, θ) to represent the rotated bounding box, where (x, y) are the center coordinates of the rotated bounding box, s and l are the lengths of the short edge and the long edge respectively, and θ is the orientation angle, which is in the range [0, π). Similar to [12, 13], we adopt the following parameterizations of (x, y, s, l):

s_p = min(w_p, h_p),    l_p = max(w_p, h_p),
t_x = (x − x_p)/w_p,    t_y = (y − y_p)/h_p,
t_s = log(s/s_p),    t_l = log(l/l_p),
t_x^∗ = (x^∗ − x_p)/w_p,    t_y^∗ = (y^∗ − y_p)/h_p,
t_s^∗ = log(s^∗/s_p),    t_l^∗ = log(l^∗/l_p),    (3)

where the variables x, x_p, x^∗ are for the predicted box, the proposal box of the RPN, and the ground truth box respectively (likewise for y, w, h, s, l, θ). For (x, y, s, l), we use the loss

L_xysl = Σ_{i ∈ {x, y, s, l}} smooth_L1(t_i − t_i^∗),    (4)

in which

smooth_L1(x) = 0.5 x^2 if |x| < 1, and |x| − 0.5 otherwise.    (5)
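For reference, Eq. (5) corresponds to the following elementwise function (a direct transcription, not code from our implementation):

```python
def smooth_l1(x):
    """Smooth L1 of Eq. (5), applied to each residual t_i - t_i*."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5
```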

Once (x, y, s, l) are determined, we need to predict the angle θ of the box. Here we consider three methods for the angle: (1) training the network to predict the angle with a smooth-L1 loss, (2) training the network to predict the angle with a cosine loss, and (3) training the network to predict the sine and cosine values of the angle with a smooth-L1 loss.

If we train the network to predict the angle directly, we adopt the following parameterization of θ:

t_θ = θ − π/2,    t_θ^∗ = θ^∗ − π/2.    (6)

As θ is defined in the range [0, π), t_θ and t_θ^∗ are in the range [−π/2, π/2). The smooth-L1 loss is then L_θ = smooth_L1(t_θ − t_θ^∗), as shown in Equation 5, while the cosine loss is L_θ = 1 − cos 2(t_θ − t_θ^∗). Finally, θ is recovered by θ = t_θ + π/2.

We also train the network to predict the sine and cosine values of the angle, and we adopt the following parameterizations:

t_sinθ = sin θ,    t_cosθ = 2 cos θ − 1,
t_sinθ^∗ = sin θ^∗,    t_cosθ^∗ = 2 cos θ^∗ − 1.    (7)

The linear scaling of the cosine value maps it into the range [−1, 1). The smooth-L1 loss for this parameterization is similar to Equation 5:

L_θ = Σ_{i ∈ {sinθ, cosθ}} smooth_L1(t_i − t_i^∗).    (8)

Finally, θ is recovered by

θ = arctan( t_sinθ / ((t_cosθ + 1)/2 + 10^−6) ).    (9)
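A small sketch of how the angle can be recovered from the regressed targets of Eq. (7), following Eq. (9); using atan2 and a final modulo to map the result into [0, π) is a choice made here for the sketch, not necessarily how the original implementation resolves the quadrant.

```python
import math

def decode_angle(t_sin, t_cos, eps=1e-6):
    """Recover theta in [0, pi) from the targets of Eq. (7) via Eq. (9)."""
    cos_theta = (t_cos + 1.0) / 2.0          # undo the linear scaling
    theta = math.atan2(t_sin, cos_theta + eps)
    return theta % math.pi                   # map into [0, pi)
```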

Since the Area Connectivity cannot be used for the regression-based methods, the segmentation based method also does not use it in Table 3, so that all the other components are kept the same for a fair comparison.

Table 3: Comparisons of the ablation study for the segmentation-based rotated bounding box prediction and several regression-based methods. These results are tested on the validation set of DOTA for only a single scale of ×1. Note that we only train the network on the training set of DOTA. The target θ means that the network is trained to directly predict the orientation angle. The target sinθ, cosθ means that the network is trained to predict the sine and cosine values of the orientation. θ is defined in [0, π). The target mask means the segmentation based method proposed in this paper.

target | loss type | mAP | PL | BD | BR | GTF | SV | LV | SH | TC | BC | ST | SBF | RA | HA | SP | HC
θ | smooth l1 | 68.9 | 89.33 | 72.39 | 42.66 | 68.41 | 60.34 | 73.93 | 77.92 | 90.68 | 69.23 | 86.8 | 74.31 | 60.36 | 74.22 | 62.53 | 30.42
θ | 1 − cos 2∆θ | 71.05 | 89.71 | 73.76 | 45.18 | 72.82 | 66.09 | 74.53 | 87.1 | 90.77 | 71.32 | 85.49 | 74.83 | 58.65 | 75.46 | 63.46 | 36.61
sinθ, cosθ | smooth l1 | 71.18 | 89.61 | 71.47 | 44.99 | 72.75 | 65.29 | 73.84 | 87.08 | 90.77 | 70.19 | 85.72 | 73.94 | 62.43 | 75.95 | 62.81 | 40.9
mask | cross entropy | 74.14 | 89.87 | 73.28 | 48.41 | 72.66 | 66.06 | 82.63 | 87.1 | 90.77 | 70.74 | 87.98 | 77.92 | 66.28 | 77.64 | 66.9 | 53.9

These results are tested on the validation set of DOTA for only a single scale of ×1. Note that we only train the network on the training set of DOTA. As shown in Table 3, the target θ means that the network is trained to directly predict the orientation angle, the target sinθ, cosθ means that the network is trained to predict the sine and cosine values of the orientation, and the target mask means the proposed segmentation-based method. The mAP of the method that trains the network to predict the angle with the smooth-L1 loss is 68.9, versus 71.05 with the cosine loss, and the mAP of the method that predicts the sine and cosine values of the angle with the smooth-L1 loss is 71.18. The mAP of the segmentation based method in this paper is 74.14, which is much better than the regression-based methods, exceeding them by 2.96 to 5.24 percent. Training the network to predict the angle directly suffers from the periodicity of the angle, i.e., a small rotation may result in a significant difference in the network output t_θ, as explained in Section 1, so the mAP of predicting the angle with the smooth-L1 loss is only 68.9, which is not very high.

Table 4: Comparisons of the ablation study for IPIO and IS. These results are tested on the validation set of DOTA for only a single scale of ×1. Note that we only train the network on the training set of DOTA. IPIO means that the partially included objects, whose semantic information is not clear, are ignored for training or testing. IS means the image synthesizing based training data augmentation.

Method | mAP | PL | BD | BR | GTF | SV | LV | SH | TC | BC | ST | SBF | RA | HA | SP | HC
without IPIO and IS | 70.56 | 86.41 | 68.70 | 46.68 | 67.31 | 64.87 | 80.48 | 86.22 | 89.53 | 65.32 | 86.57 | 62.05 | 65.84 | 74.2 | 61.06 | 53.16
+IPIO (train) | 72.5 | 89.59 | 71.77 | 46.08 | 73.52 | 64.85 | 81.54 | 86.54 | 90.74 | 67.79 | 87.44 | 71.77 | 69.16 | 74.72 | 62.79 | 49.15
+IPIO (train test) | 72.63 | 89.91 | 72.35 | 46.62 | 71.76 | 64.88 | 81.83 | 86.82 | 90.79 | 68.72 | 87.26 | 69.95 | 69.38 | 77.04 | 62.67 | 48.47
+IPIO (train test)+IS | 74.15 | 89.87 | 73.28 | 48.42 | 72.66 | 66.06 | 82.73 | 87.09 | 90.77 | 70.74 | 87.98 | 77.92 | 66.28 | 77.64 | 66.9 | 53.9

Training to predict the angle with the cosine loss elegantly handles the periodicity of the angle when computing the loss, but it still cannot solve the problem that a small rotation may cause a significant difference in t_θ; as a result, its mAP is 2.15 percent better than that with the smooth-L1 loss but 3.09 percent worse than the segmentation based method. Training to predict the sine and cosine values of the angle with the smooth-L1 loss alleviates the problem caused by the periodicity of the angle, so it is better than training to predict the angle directly by 0.13 to 2.28 percent, but it is still worse than the segmentation based method by 2.96 percent. This is mainly because the segmentation task is easier to train than regressing the sine and cosine values of the angle. In summary, the segmentation based angle prediction in this paper is much better than the other regression-based methods.

4.4.3. Ignoring the Partially Included Objects and Image Synthesizing Augmentation

To evaluate the effectiveness of the proposed mechanisms of ignoring the partially included objects (IPIO) and the image synthesizing based training data augmentation (IS), we conduct an ablation study for them as shown in Table 4.

These results are tested on the validation set of DOTA for only a single scale of ×1. Note that we only train the network on the training set of DOTA. The mAP of the detector without IPIO and IS is 70.56. When we adopt IPIO for training, the mAP of the detector is 72.5, an improvement of 1.94 percent. We attribute this significant improvement mainly to the fact that the network is no longer confused by the partially included objects, whose semantic information is not clear, so the network can be trained better. When we additionally adopt IPIO for testing, the mAP improves slightly, i.e., from 72.5 to 72.63. This is because the number of partially included objects during testing is much smaller than that of intact objects, so their influence is not significant. When we add the IS mechanism, the mAP further improves significantly, by 1.52 percent, reaching 74.15. The IS mechanism is effective because the image synthesizing based training data augmentation exposes objects to more varied background areas, which improves the training data diversity and leads to fewer false positives in background areas.

5. Conclusion

In this paper, we proposed an algorithm for the detection of objects in large aerial images with rotated bounding boxes. We predict the rotated bounding boxes through a box segmentation task to avoid the problems of predicting angles directly. We also address the problems arising from the large size of the images through the image synthesizing based training data augmentation and by ignoring the partially included objects. The experiments show that the proposed algorithm achieves better performance than other state-of-the-art methods and demonstrate that our algorithm is effective.

Author Contribution

Yingming Wang: Conceptualization, Methodology, Software, Validation, Investigation, Writing - Original Draft. Lijun Wang: Conceptualization, Methodology, Validation, Investigation, Writing - Review & Editing. Huchuan Lu: Conceptualization, Methodology, Resources, Writing - Review & Editing. You He: Conceptualization, Methodology, Writing - Review & Editing.

Declaration of Competing Interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.

[2] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L.-C. Chen, Mobilenetv2: Inverted residuals and linear bottlenecks, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.


[3] N. Ma, X. Zhang, H.-T. Zheng, J. Sun, Shufflenet v2: Practical guidelines for efficient cnn architecture design, in: The European Conference on Computer Vision (ECCV), 2018.

[4] L. Wang, W. Ouyang, X. Wang, H. Lu, Visual tracking with fully convolutional networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3119–3127.

[5] M. Danelljan, G. Bhat, F. S. Khan, M. Felsberg, Atom: Accurate tracking by overlap maximization, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[6] B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, J. Yan, Siamrpn++: Evolution of siamese visual tracking with very deep networks, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[7] M. Jian, Q. Qi, H. Yu, J. Dong, C. Cui, X. Nie, H. Zhang, Y. Yin, K.-M. Lam, The extended marine underwater environment database and baseline evaluations, Applied Soft Computing 80 (2019) 425–437.

[8] M. Jian, W. Zhang, H. Yu, C. Cui, X. Nie, H. Zhang, Y. Yin, Saliency detection based on directional patches extraction and principal local color contrast, Journal of Visual Communication and Image Representation 57 (2018) 1–11.

[9] M. Jian, Q. Qi, J. Dong, Y. Yin, K.-M. Lam, Integrating qdwd with pattern distinctness and local contrast for underwater saliency detection, Journal of Visual Communication and Image Representation 53 (2018) 31–41.

[10] M. Jian, K.-M. Lam, J. Dong, L. Shen, Visual-patch-attention-aware saliency detection, IEEE Transactions on Cybernetics 45 (8) (2014) 1575–1586.

[11] Q. Wang, S. Tang, D. Zhai, X. Hu, Salience based object tracking in complex scenes, Neurocomputing 314 (2018) 132–142.

[12] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.

[13] R. Girshick, Fast r-cnn, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.

[14] S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time object detection with region proposal networks, in: Advances in neural information processing systems, 2015, pp. 91–99.

[15] K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask r-cnn, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969.

[16] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.

[17] J. Redmon, A. Farhadi, Yolo9000: better, faster, stronger, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 7263–7271.

[18] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A. C. Berg, Ssd: Single shot multibox detector, in: European conference on computer vision, Springer, 2016, pp. 21–37.

[19] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.

[20] B. Singh, M. Najibi, L. S. Davis, Sniper: Efficient multi-scale training, in: Advances in Neural Information Processing Systems, 2018, pp. 9310–9320.

[21] K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Shi, W. Ouyang, et al., Hybrid task cascade for instance segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4974–4983.

[22] J. Ding, N. Xue, Y. Long, G.-S. Xia, Q. Lu, Learning roi transformer for detecting oriented objects in aerial images, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[23] G.-S. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu, M. Pelillo, L. Zhang, Dota: A large-scale dataset for object detection in aerial images, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[24] Y. Jiang, X. Zhu, X. Wang, S. Yang, W. Li, H. Wang, P. Fu, Z. Luo, R2cnn: Rotational region cnn for orientation robust scene text detection, arXiv preprint arXiv:1706.09579.

[25] Y. Li, H. Zheng, Z. Yan, L. Chen, Detail preservation and feature refinement for object detection, Neurocomputing.

[26] Q. Zhong, C. Li, Y. Zhang, D. Xie, S. Yang, S. Pu, Cascade region proposal and global context for deep object detection, Neurocomputing.

[27] S. M. Azimi, E. Vig, R. Bahmanyar, M. Körner, P. Reinartz, Towards multi-class object detection in unconstrained remote sensing imagery, in: Asian Conference on Computer Vision, Springer, 2018, pp. 150–165.

[28] P. Tang, X. Wang, S. Bai, W. Shen, X. Bai, W. Liu, A. L. Yuille, Pcl: Proposal cluster learning for weakly supervised object detection, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[29] Z. Huang, L. Huang, Y. Gong, C. Huang, X. Wang, Mask scoring r-cnn, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 6409–6418.

[30] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, J. Liang, East: An efficient and accurate scene text detector, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[31] Z. Tian, W. Huang, T. He, P. He, Y. Qiao, Detecting text in natural image with connectionist text proposal network, in: European conference on computer vision, Springer, 2016, pp. 56–72.

[32] M. Liao, B. Shi, X. Bai, X. Wang, W. Liu, Textboxes: A fast text detector with a single deep neural network, in: Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[33] B. Shi, X. Bai, S. Belongie, Detecting oriented text in natural images by linking segments, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2550–2558.

[34] J. Ma, W. Shao, H. Ye, L. Wang, H. Wang, Y. Zheng, X. Xue, Arbitrary-oriented scene text detection via rotation proposals, IEEE Transactions on Multimedia 20 (11) (2018) 3111–3122.

[35] D. Deng, H. Liu, X. Li, D. Cai, Pixellink: Detecting scene text via instance segmentation, in: Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[36] Z. Zhang, C. Zhang, W. Shen, C. Yao, W. Liu, X. Bai, Multi-oriented text detection with fully convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4159–4167.

[37] T. He, W. Huang, Y. Qiao, J. Yao, Accurate text localization in natural image with cascaded convolutional text network, arXiv preprint arXiv:1603.09423.

[38] C. Yao, X. Bai, N. Sang, X. Zhou, S. Zhou, Z. Cao, Scene text detection via holistic, multi-channel prediction, arXiv preprint arXiv:1606.09002.

[39] Y. Zhang, J. Lai, P. C. Yuen, Text string detection for loosely constructed characters with arbitrary orientations, Neurocomputing 168 (2015) 970–978.

[40] M. Gao, R. Yu, A. Li, V. I. Morariu, L. S. Davis, Dynamic zoom-in network for fast object detection in large images, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6926–6935.

[41] M. Kisantal, Z. Wojna, J. Murawski, J. Naruniec, K. Cho, Augmentation for small object detection (2019). arXiv:1902.07296.

[42] S. Razakarivony, F. Jurie, Vehicle detection in aerial imagery: A small target detection benchmark, Journal of Visual Communication and Image Representation 34 (2016) 187–203.

[43] H. Zhu, X. Chen, W. Dai, K. Fu, Q. Ye, J. Jiao, Orientation robust object detection in aerial images using deep convolutional neural network, in: 2015 IEEE International Conference on Image Processing (ICIP), IEEE, 2015, pp. 3735–3739.

[44] Z. Liu, H. Wang, L. Weng, Y. Yang, Ship rotated bounding box space for ship extraction from high-resolution optical satellite images with complex backgrounds, IEEE Geoscience and Remote Sensing Letters 13 (8) (2016) 1074–1078.

[45] G. Cheng, P. Zhou, J. Han, Learning rotation-invariant convolutional neural networks for object detection in vhr optical remote sensing images, IEEE Transactions on Geoscience and Remote Sensing 54 (12) (2016) 7405–7415.

[46] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, S. Belongie, Feature pyramid networks for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125.

[47] G. Toussaint, Solving geometric problems with the rotating calipers, in: Proceedings of IEEE MELECON'83, 1983.

[48] G. Bradski, The OpenCV Library, Dr. Dobb's Journal of Software Tools.

[49] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A. L. Yuille, Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs, IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4) (2018) 834–848.

[50] X. Yang, H. Sun, K. Fu, J. Yang, X. Sun, M. Yan, Z. Guo, Automatic ship detection in remote sensing images from google earth of complex scenes based on multiscale rotation dense feature pyramid networks, Remote Sensing 10 (1) (2018) 132.

[51] X. Yang, H. Sun, X. Sun, M. Yan, Z. Guo, K. Fu, Position detection and direction prediction for arbitrary-oriented ships via multitask rotation region convolutional neural network, IEEE Access 6 (2018) 50839–50849.


Biography

Yingming Wang received the B.E. degree from the Dalian University of Technology in 2017, where he is currently pursuing the master's degree with the School of Information and Communication Engineering, supervised by Prof. H. Lu. His research interests include deep learning, visual tracking and object detection.

Lijun Wang received the B.E. degree and Ph.D. degree from the Dalian University of Technology, Dalian, China, in 2013 and 2019 respectively, where he is currently doing postdoctoral research.

His current research interests include deep learning, visual saliency, object tracking and depth estimation.

Huchuan Lu (SM'12) received the M.Sc. degree in Signal and Information Processing and the Ph.D. degree in System Engineering from the Dalian University of Technology (DUT), China, in 1998 and 2008 respectively. He has been a faculty member since 1998 and a professor since 2012 in the School of Information and Communication Engineering of DUT. His research interests are in the areas of computer vision and pattern recognition. In recent years, he has focused on visual tracking and segmentation. He serves as an associate editor of the IEEE Transactions on Systems, Man, and Cybernetics: Part B.

You He received the Ph.D. degree from the Department of Electronic Engineering, Tsinghua University. He is a member of the Chinese Academy of Engineering and a Fellow of IET. His research interests include radar signal processing and multi-sensor information fusion.
