SAANet: Spatial Adaptive Alignment Network for Object Detection in Automatic Driving

Image and Vision Computing (2020), https://doi.org/10.1016/j.imavis.2020.103873
Received 27 October 2019; revised 18 December 2019; accepted 29 December 2019.

Junying Chen*, Tongyao Bai*

School of Information and Control Engineering, Xi'an University of Architecture and Technology, Xi’an 710055, China

ABSTRACT


Both images and point clouds are beneficial for object detection in a visual navigation module for autonomous driving. The spatial relationships between different objects at different times in a bimodal space can vary significantly, which makes it difficult to combine the two modal descriptions into a unified model that detects objects both effectively and efficiently. In addition, conventional voxelization methods resolve point clouds into voxels at a global level, and often overlook local attributes of the voxels. To address these problems, we propose a novel fusion-based deep framework named SAANet. SAANet utilizes a spatial adaptive alignment (SAA) module to align point cloud features and image features by automatically discovering the complementary information between point clouds and images. Specifically, we transform the point clouds into 3D voxels and introduce local orientation encoding to represent the point clouds. Then, we use a sparse convolutional neural network to learn a point cloud feature. Simultaneously, a ResNet-like 2D convolutional neural network is used to extract an image feature. Next, the point cloud feature and image feature are fused by our SAA block to derive a comprehensive feature. The labels and 3D boxes for objects are then learned using a multi-task learning network. Finally, an experimental evaluation on the KITTI benchmark demonstrates the advantages of our method in terms of average precision and inference time, as compared with previous state-of-the-art results for 3D object detection.


Keywords: object detection; fusion-based deep framework; local orientation encoding; spatial adaptive alignment; autonomous driving

1. Introduction

Object detection in 3D scenes is a key task in a visual navigation module for autonomous driving. The visual navigation module utilizes road scene information acquired by various sensors, such as a millimeter wave radar, camera, and/or laser radar (LiDAR), and then processes the obtained information to provide a drivable area to an automatic driving system.

* Corresponding author at: School of Information and Control Engineering, Xi'an University of Architecture and Technology, Xi'an 710055, China. Tel.: +8618691822599. E-mail addresses: [email protected] (J. Chen), [email protected] (T. Bai)

Image data and LiDAR data have been widely studied for object detection in visual navigation systems in recent years. Image data provide a much more detailed reflection of the scene, but are susceptible to environmental factors and lack depth information. In contrast, point clouds are 3D space coordinates of points scanned by LiDAR. Such data have rich depth information and are not susceptible to environmental impacts. It is evident that there may be complementary information between images and point clouds in object detection for automatic driving. For example, when objects are small or far away, the LiDAR beam scans only a small surface area of them; in such situations, cameras can provide an order of magnitude more measurements than LiDAR. The other extreme case is object occlusion, where no point cloud information can be obtained for the occluded objects. In such cases, image information can be incorporated to enable detection of such objects. There is therefore a need for a better approach to fusing images and point clouds so that they complement each other.

Some typical methods apply 2D convolutional neural networks (CNNs) to fuse point cloud projections and images. For instance, a multi-view 3D network[1] projects and discretizes point clouds onto a front view (FV) map and a bird view (BV) map, and then uses a 2D CNN to fuse the maps with RGB images. Other methods directly fuse point cloud features and image features. One representative work is PointFusion[2], which uses PointNet[3] and ResNet[4] to extract features from raw point clouds and images, respectively, and then fuses the features using a fully-connected network. These fusion-based methods essentially add up the features of two different modalities, i.e., the point cloud modality and the image modality, and ignore the spatial alignment of the two types of features. Point clouds and images can be regarded as different descriptions of the same 3D scene, and the spatial alignment of their features is the key to object recognition. To better align the features of 2D RGB images and 3D point clouds for a detection task, we propose a deep fusion network, and introduce a spatial adaptive alignment (SAA) module for modeling the relationships between image features and point cloud features.

To derive a point cloud feature, the studies in [5-9] acquire images from multiple perspectives based on point clouds, and then apply image-processing technology to obtain point cloud representations. This type of method suffers from a complicated preprocessing process to obtain the multiple perspectives, which can easily cause information loss from the raw point clouds to the projection images. Directly operating on raw point clouds[3, 10-12] avoids information loss during the projection, as this approach applies effective operations on the raw point clouds to acquire features. However, the amount of calculation is usually large owing to the large amount of raw point cloud data, and is far from real-time detection when executing convolution or similar operations on the raw point clouds. To balance the information loss and the amount of calculation, the methods in [13, 14] partition point clouds into voxels and extract feature representations on the voxels. Voxelization methods divide an entire point cloud space into uniform voxels at a global level, which often leads to an uneven distribution among voxels and missing local structure information in each voxel.
To resolve these problems, we adopt random sampling of a fixed number of points in each voxel, and represent each point based on its spatial coordinates, reflection intensity, and local orientation encoding (LOE). These data are simple, yet are beneficial to eliminating the uneven distribution and retaining local information. Overall, we propose a novel fusion-based deep architecture (SAANet) to learn point cloud features and image features, and to align them adaptively in a feature fusion process. Specifically, we transform point clouds into regular 3D voxels and introduce the LOE of each voxel to obtain a comprehensive point cloud representation, which is helpful for determining the orientation and location of 3D objects during detection. Subsequently, we propose an SAA module to align the point cloud modal feature and the image modal feature, which is beneficial to accurate 3D object detection. The main contributions of this study are summarized as follows:

• A deep fusion network architecture that combines a point cloud feature and an image feature for complementary 3D object detection.
• An SAA module that takes advantage of an attention mechanism and cross-correlation to adaptively align the point cloud feature and image feature.
• A representation of a point cloud including the LOE of each voxel, to assist the network in determining the orientation and location of 3D objects during detection.

2. Related Work

Object detection on 2D images. In the image field, initial object detection ordinarily used geometric features (such as edges, keypoints, or templates) to align a 2D/3D model of an object on an image. The advent of machine learning (ML) caused a qualitative leap in image-based object detection technology. Earlier ML algorithms used in object detection include boosting[15] and support vector machines[16]. The emergence of the deep CNN (DCNN) has led to another qualitative leap in 2D object detection. DCNN-based object detectors initially apply a fine-tuned classifier at each possible position of an image in a sliding-window manner[17], or at certain regions of interest[18-20] obtained through a regional proposal mechanism. These DCNN-based methods comprise multi-stage pipeline training. To reduce the computation time and network complexity, Ren et al.[21] proposed a region proposal network (RPN), creating a fully convolutional network that completes classification and regression tasks through an end-to-end structure. This network framework is currently widely used[22, 23]. Thus, we also adopt an RPN to execute the object classification and localization tasks in our proposed method.

Object detection on 3D point clouds. Point cloud-based 3D object detection has received significant attention in the field of automatic driving. Some previous works represent point clouds using hand-crafted features[9, 24], which yield enhanced results in some cases. However, such statistical properties of point clouds cannot be easily transferred to a wide range of application scenarios, such as 3D object detection tasks. Several precursory works[13, 14] voxelize point clouds and apply 3D CNNs to handle them. Nevertheless, the computation cost of a 3D CNN constrains its performance in large application scenarios. VoxelNet[7] groups raw point clouds into voxels and stacks multiple voxel feature encoding (VFE) layers to learn complex features for characterizing local 3D shape information. Then, the local voxel features are further aggregated into a high-dimensional volumetric representation for object detection. PointNet[3] proposed a new approach aimed at directly processing raw point clouds for 3D object recognition and pointwise semantic segmentation tasks. PointNet aims to preserve global structure information, yet pays little attention to local structure information, limiting its ability to recognize fine-grained patterns and its generalizability to complex scenes. PointNet++[12] captures local structures at different scales to make up for the local information lost in PointNet. It achieves better results than PointNet, in view of the introduced local features. However, it is not easily implemented for RPN-based detection tasks in an automatic driving scene. F-PointNet[25] utilizes mature 2D detectors to generate 2D proposals from image regions, and then proposes 3D frustum proposals to perform 3D object instance segmentation and amodal 3D bounding box localization. In view of the achievements in object detection in the 2D image field, some research studies project point clouds onto 2D images, and then use 2D detection technology to detect objects. PIXOR[26] exploits a bird's eye view (BEV) of 3D point clouds, and obtains a compact representation by using a 2D CNN to detect objects in the BEV. Although such a method achieves promising results for 3D object detection, it fails to explore the local orientation information of the point clouds, and has a limited ability to accurately localize 3D objects.

Object detection based on image and point cloud fusion. Some existing methods employ a fusion-based architecture to improve detection performance. One typical approach[27] is to project point clouds onto 2D maps, use 2D CNNs to fuse these projections with images, and then train an RPN to estimate 3D bounding boxes. Multi-view 3D networks (MV3D)[1] project point clouds to different 2D perspectives, and then use a 2D CNN to fuse them with images. ContFuse[28] proposes a continuous CNN to fuse features of images and point cloud BEV maps at different levels. Another approach is to process point cloud features and image features separately. PointFusion[2] extracts the point cloud feature and image feature using PointNet[3] and ResNet[4], respectively, and then utilizes fully-connected layers to fuse them.


3. Spatial Adaptive Alignment Network


This section describes the proposed end-to-end detection framework, SAANet. Figure 1 illustrates an overview containing three components: (1) feature learning, which includes two branches, i.e., a ResNet-like 2D image feature extractor and a 3D point cloud feature extractor; (2) feature fusion, which processes the features from the two branches into a high-level fused representation; and (3) 3D box estimation, which localizes the detected objects. In the following subsections, we first introduce a voxel representation with LOE for point clouds. Next, we present the feature-learning modules for the LiDAR stream and camera stream. Then, we describe the SAA module for bimodal feature fusion. Finally, we present the RPN for 3D box estimation, and a loss function for multi-task learning.

Fig. 1: Our proposed SAANet framework. It can be divided into three components. The left includes the feature learning modules, which extract the point cloud feature and image feature. The middle is the feature fusion module, which aligns and adaptively fuses the point cloud feature and image feature. The right is the 3D box estimation module, which generates 3D bounding boxes based on the fused feature. In the two dotted-line boxes below, the left gray box represents convolutional layers, the right green boxes represent 2D convolutional layers, and the blue and brown boxes indicate the batch normalization and ReLU layers, respectively.
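To make the data flow in Figure 1 concrete, the following is a minimal PyTorch sketch of how the three components compose. The module names (point_backbone, image_backbone, saa, rpn) are placeholders of ours and do not come from a released implementation.

```python
import torch.nn as nn

class SAANet(nn.Module):
    """High-level composition of the three components in Figure 1 (sketch)."""

    def __init__(self, point_backbone, image_backbone, saa_block, rpn):
        super().__init__()
        self.point_backbone = point_backbone  # sparse 3D CNN over voxelized points (Sec. 3.1-3.2)
        self.image_backbone = image_backbone  # ResNet-like 2D CNN over RGB images (Sec. 3.2)
        self.saa = saa_block                  # spatial adaptive alignment fusion (Sec. 3.3)
        self.rpn = rpn                        # 3D box estimation head (Sec. 3.4)

    def forward(self, voxels, image):
        g_p = self.point_backbone(voxels)   # point cloud feature map
        g_i = self.image_backbone(image)    # image feature map of matching spatial size
        g_f = self.saa(g_p, g_i)            # fused, spatially aligned feature
        return self.rpn(g_f)                # class scores and 3D box regression
```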

3.1 Voxel representation with local orientation encoding




We propose a generic encoding method for mining the potential orientation information of point clouds, as well as for effectively aligning the multi-modal features in the fusion phase. Suppose that the point clouds acquired by LiDAR are $P = \{p_i = (x_i, y_i, z_i, r_i)\}_{i=1}^{N}$, where $(x_i, y_i, z_i)$ are the space coordinates of the i-th point, $r_i$ is the corresponding received reflectance, and $N$ is the number of points. We divide the raw point clouds into uniform voxels, and then compute the centroid of each voxel. The centroid $(\bar{x}_j, \bar{y}_j, \bar{z}_j)$ of the j-th voxel can be computed as:

$(\bar{x}_j, \bar{y}_j, \bar{z}_j) = \frac{1}{M} \sum_{i=1}^{M} (x_i, y_i, z_i)$    (1)


Here, M is the number of randomly sampled points in the j-th voxel. After voxelization, the point densities among voxels are highly variable, which might bias the detection. Thus, we randomly sample a fixed number of points in each voxel to eliminate the influence of the highly variable point density. When the number of points in a voxel is smaller than the defined number, all of the points in the voxel are retained, and M is the actual number of points in that voxel. To capture local orientation information beneficial to the estimation of heading angles for the 3D boxes, we design the LOE of the j-th voxel, denoted as $o_j$, calculated by the following equation:

$o_j = \frac{1}{M} \sum_{i=1}^{M} \arctan\left(\frac{y_i - \bar{y}_j}{x_i - \bar{x}_j}\right)$    (2)
We employ arctan to compute the orientation angle, and expect it to serve as local information for object detection. The final representation of the i-th point in the j-th voxel can thus be rewritten as $\{x_i, y_i, z_i, r_i, o_j\}$.
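The following is a minimal NumPy sketch of the per-voxel sampling and LOE computation described above. The sampling cap of 35 points matches the car setting reported in Section 4.2; the use of arctan2 (instead of arctan, for numerical stability when a point sits directly above the centroid) and the exact form of Eq. (2) follow our reconstruction above and are assumptions rather than the authors' released code.

```python
import numpy as np

def encode_voxel(points, max_points=35):
    """Per-voxel point representation with LOE (sketch).

    points: (K, 4) array of (x, y, z, reflectance) for the points falling into
    one voxel. Returns an (M, 5) array of (x, y, z, r, o), where o is the
    voxel's local orientation encoding and M <= max_points.
    """
    # Randomly sample a fixed number of points to even out the voxel density.
    if points.shape[0] > max_points:
        idx = np.random.choice(points.shape[0], max_points, replace=False)
        points = points[idx]

    # Voxel centroid of the sampled points (Eq. 1).
    cx, cy, _ = points[:, :3].mean(axis=0)

    # Local orientation encoding: mean orientation of the in-plane displacement
    # from the centroid (our reconstruction of Eq. 2).
    loe = np.arctan2(points[:, 1] - cy, points[:, 0] - cx).mean()

    # Append the shared LOE value to every sampled point -> 5-D representation.
    o = np.full((points.shape[0], 1), loe)
    return np.hstack([points, o])
```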

3.2 Feature learning

Feature learning mainly includes two parts: one part is point cloud feature learning based on the above-derived point cloud representation, and the other is feature learning based on camera images.

In the point cloud stream, the point clouds (with LOE) are input to a sparse CNN to obtain point cloud features. The sparse CNN consists of four phases of sparse convolution. Each phase contains several submanifold convolutional layers and one normal sparse convolution, and each convolutional layer is followed by a BatchNorm layer and a ReLU layer. Within each phase, the submanifold convolutional layers have $c$ output feature maps (where $c$ ranges from 16 to 128 across the phases), a kernel size of (3, 3, 3), and a stride of (1, 1, 1). At the last layer of each of the first three phases, we perform downsampling with a stride of (2, 2, 2). At the last layer of the last phase, we change the kernel size to (3, 1, 1) and the stride to (2, 1, 1). After this downsampling, the output is reshaped to 256 × 400 × 352.

For the camera stream, we design a ResNet-like 2D CNN to extract image features, which can capture image texture information for better modality fusion. We use Conv2D(c, k, s) to represent a Conv2D-BatchNorm-ReLU layer, where c is the number of output channels, k is the kernel size, and s is the stride. Similar to the default bottleneck block proposed in ResNet[4], our bottleneck block contains three such convolutional layers. First, three bottleneck blocks are stacked, followed by a max-pooling layer with a pooling size of 1×1 and a stride of 2×2. Thereafter, four bottleneck blocks are stacked, also followed by a max-pooling layer with a kernel size of 1 and a stride of 2, to make the dimension of the image feature the same as that of the point cloud feature.
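Below is a minimal PyTorch sketch of the Conv2D(c, k, s) unit and a ResNet-like bottleneck block of the kind described above. The specific channel widths and the 1×1-3×3-1×1 layout are assumptions; the pre-proof does not preserve the exact arguments of the three convolutional layers.

```python
import torch.nn as nn

def conv2d_bn_relu(c_in, c_out, k, s):
    """Conv2D(c_out, k, s) as defined in the text: Conv-BatchNorm-ReLU."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=k, stride=s, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class Bottleneck(nn.Module):
    """ResNet-like bottleneck block for the image branch (sketch).

    The usual 1x1 -> 3x3 -> 1x1 ResNet pattern is assumed here, since the
    exact channel widths are not recoverable from this version of the paper.
    """
    def __init__(self, c_in, c_mid, c_out):
        super().__init__()
        self.body = nn.Sequential(
            conv2d_bn_relu(c_in, c_mid, k=1, s=1),
            conv2d_bn_relu(c_mid, c_mid, k=3, s=1),
            conv2d_bn_relu(c_mid, c_out, k=1, s=1),
        )
        # 1x1 projection so the residual addition is valid when c_in != c_out.
        self.proj = nn.Conv2d(c_in, c_out, kernel_size=1) if c_in != c_out else nn.Identity()

    def forward(self, x):
        return self.body(x) + self.proj(x)
```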

3.3 Feature fusion

There are already some combination methods for fusing point cloud features and 2D image features. These methods can be roughly classified into two types. The first type projects the point clouds to 2D maps, and uses a 2D CNN to fuse these projections with images[1, 27, 28]. The differences among models of this type include deriving different 2D perspectives from the point clouds or utilizing different fusion methods, such as a CNN or a fully connected network (FCN), at the same or different levels. A representative work in this direction is MV3D[1], which previously achieved state-of-the-art results. The other type of combination method directly fuses the point cloud feature and image feature, such as in PointFusion[2]. Our method belongs to the latter type. However, our method is different from PointFusion, in that our fusion method utilizes an SAA block, which considers the cross-correlation between the two modality features, and our point cloud representation includes the local information of voxels. In contrast, PointFusion extracts image features using a CNN and derives point cloud features using PointNet[3], which disregards local information, and then applies a number of fully connected layers to the concatenation of the two features.

In our proposed feature fusion method, we design an SAA block to learn the spatial alignment between the point cloud feature and the image feature, as the same object must be matched in the point cloud feature space and the image feature space to achieve accurate object detection. The detailed structure of the proposed SAA block is shown in the bottom right corner of Figure 1. Our SAA block can be classified as "inter-attention", in contrast to self-attention or intra-attention, which are widely used in the natural language processing and image processing fields. As our inter-attention mechanism relates different types of spatial features (i.e., point cloud features and image features), it is different from previous attention mechanisms over positions of a single sequence. Thus, the design of an inter-attention block that effectively combines the two types of features is an innovative contribution. The two most commonly used attention functions are additive attention[29] and dot-product attention[30]. The two are similar in theoretical complexity; additive attention computes a compatibility function using a feed-forward network with a single hidden layer, whereas dot-product attention is much faster and more space-efficient in practice. Based on the two attention mechanisms and the cross-correlation of the two types of features, and through theoretical analysis and some tentative experiments, we design our SAA block as follows:

$t = \mathrm{ReLU}\big(\mathrm{BN}\big(W_t (g_p \oplus g_i)\big)\big)$    (3)

$w_p = \mathrm{Softsign}\big(W_p (t \odot g_p)\big)$    (4)

$w_i = \mathrm{Softsign}\big(W_i (t \odot g_i)\big)$    (5)

In the above, $g_p$ and $g_i$ are the point cloud feature and image feature, respectively; $\oplus$ indicates element-wise addition and $\odot$ indicates the element-wise product; $W_t$, $W_p$, and $W_i$ are learned parameters; and $w_p$ and $w_i$ are the attention weights of the point cloud feature and image feature, respectively. First, the element-wise addition result of $g_p$ and $g_i$ is spatially transformed by multiplication with a weight matrix; batch normalization is then performed, and a ReLU function is applied to obtain the additive attention result $t$ of the two modality features. Then, the cross-correlation between this result and each modality feature is calculated by the element-wise product and spatially transformed by multiplication with a weight matrix, and the attention weight is calculated using Softsign as the activation function. Finally, the fused feature $g_f$ is obtained as the weighted sum of $g_p$ and $g_i$. Overall, the entire output of the SAA block can be formulated as:

$g_f = w_p \odot g_p \oplus w_i \odot g_i$    (6)
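The following is a minimal PyTorch sketch of the SAA block as reconstructed in Eqs. (3)-(6). Using 1×1 convolutions for the learned transforms $W_t$, $W_p$, and $W_i$, and assuming that both feature maps share the shape (B, C, H, W), are our assumptions, not details taken from the authors' code.

```python
import torch.nn as nn

class SAABlock(nn.Module):
    """Spatial adaptive alignment block (sketch of Eqs. 3-6)."""

    def __init__(self, channels):
        super().__init__()
        # 1x1 convolutions play the role of the learned transforms W_t, W_p, W_i.
        self.w_t = nn.Conv2d(channels, channels, kernel_size=1)
        self.bn = nn.BatchNorm2d(channels)
        self.w_p = nn.Conv2d(channels, channels, kernel_size=1)
        self.w_i = nn.Conv2d(channels, channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)
        self.softsign = nn.Softsign()

    def forward(self, g_p, g_i):
        # Additive attention: element-wise sum, transform, BN, ReLU (Eq. 3).
        t = self.relu(self.bn(self.w_t(g_p + g_i)))
        # Cross-correlation via element-wise product, then Softsign weights (Eqs. 4-5).
        w_p = self.softsign(self.w_p(t * g_p))
        w_i = self.softsign(self.w_i(t * g_i))
        # Fused feature as the attention-weighted sum of the two modalities (Eq. 6).
        return w_p * g_p + w_i * g_i
```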


3.4 3D box estimation


In this step, we construct a simple RPN for 3D bounding box estimation, which is similar to the RPN in VoxelNet[7]. The architecture of this RPN comprises three stages, and each stage combines several convolutional layers, each followed by batch normalization and rectified linear unit layers. The fused features generated from the SAA block are input into the RPN. The first convolutional layer of every stage downsamples the feature map by a stride of 2, followed by convolutions of stride 1. The output of every stage is upsampled to the same size as the input. These feature maps are then concatenated into one feature map. Finally, three convolutions are applied for the prediction of labels and the estimation of 3D boxes. After that, a non-maximum suppression layer is added to generate the final object boxes based on the output of the RPN.
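A hedged PyTorch sketch of a multi-scale RPN in the spirit described above follows. The stage depths, channel widths, upsampling factors, and the two prediction heads shown here (class scores and the seven box regression targets) are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, stride):
    """3x3 Conv-BatchNorm-ReLU unit."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class RPN(nn.Module):
    """Three-stage downsample / upsample / concatenate RPN head (sketch)."""

    def __init__(self, c_in=256, c=128, num_anchors=2, num_classes=1):
        super().__init__()
        def stage(cin, cout, n_extra):
            layers = [conv_block(cin, cout, stride=2)]  # stride-2 downsampling
            layers += [conv_block(cout, cout, stride=1) for _ in range(n_extra)]
            return nn.Sequential(*layers)

        self.stage1 = stage(c_in, c, 3)
        self.stage2 = stage(c, c, 5)
        self.stage3 = stage(c, c, 5)
        # Upsample each stage output back to the input resolution before concatenation.
        self.up1 = nn.ConvTranspose2d(c, c, kernel_size=2, stride=2)
        self.up2 = nn.ConvTranspose2d(c, c, kernel_size=4, stride=4)
        self.up3 = nn.ConvTranspose2d(c, c, kernel_size=8, stride=8)
        # Heads: per-anchor class scores and the seven box regression targets.
        self.cls_head = nn.Conv2d(3 * c, num_anchors * num_classes, kernel_size=1)
        self.reg_head = nn.Conv2d(3 * c, num_anchors * 7, kernel_size=1)

    def forward(self, x):
        s1 = self.stage1(x)
        s2 = self.stage2(s1)
        s3 = self.stage3(s2)
        feat = torch.cat([self.up1(s1), self.up2(s2), self.up3(s3)], dim=1)
        return self.cls_head(feat), self.reg_head(feat)
```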


3.5 Loss function


Our tasks are to predict the labels of objects and to localize their 3D boxes. Predicting the labels is a classification task, whereas localizing the 3D boxes is a regression task. To better realize both tasks, we use a multi-task loss function to train our network. Following common practice in object detection[1, 2, 7], we define the loss function for the classification task as:

$L_{\mathrm{cls}} = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$    (7)

Here, $L_{\mathrm{cls}}$ is the focal loss introduced by RetinaNet[23]; $p_t$ is the model's estimated probability; and $\alpha_t$ and $\gamma$ are the parameters of the focal loss. We set $\alpha_t = 0.25$ in our training process. The loss function for the regression task is:

$L_{\mathrm{reg}} = \sum_{i \in \{x, y, z, l, w, h, \theta\}} \mathrm{SL}(\hat{\Delta}_i, \Delta_i)$    (8)

Here, $\hat{\Delta}_i$ and $\Delta_i$ are the predicted and ground truth offsets, respectively. SL stands for the SmoothL1 function, defined as:

$\mathrm{SL}(\hat{\Delta}_i, \Delta_i) = \begin{cases} 0.5\,(\hat{\Delta}_i - \Delta_i)^2, & \text{if } |\hat{\Delta}_i - \Delta_i| < 1 \\ |\hat{\Delta}_i - \Delta_i| - 0.5, & \text{otherwise} \end{cases}$    (9)

For 3D detection, $L_{\mathrm{reg}}$ is the sum of seven terms. We parameterize a 3D ground truth box as $(x_g, y_g, z_g, l_g, w_g, h_g, \theta_g)$, where $(x_g, y_g, z_g)$ represents the center location, $(l_g, w_g, h_g)$ are the length, width, and height of the box, respectively, and $\theta_g$ is the yaw rotation around the Z-axis. A positive anchor is parameterized as $(x_a, y_a, z_a, l_a, w_a, h_a, \theta_a)$. Then, we define the residual vector as seven regression targets, containing the offsets of the center location $(\Delta x, \Delta y, \Delta z)$, the three dimensions $(\Delta l, \Delta w, \Delta h)$, and the rotation $\Delta\theta$, which are computed as:

$\Delta x = \frac{x_g - x_a}{d_a}, \quad \Delta y = \frac{y_g - y_a}{d_a}, \quad \Delta z = \frac{z_g - z_a}{h_a}$    (10)

$\Delta l = \log\frac{l_g}{l_a}, \quad \Delta w = \log\frac{w_g}{w_a}, \quad \Delta h = \log\frac{h_g}{h_a}$    (11)

$\Delta\theta = \theta_g - \theta_a$    (12)

In the above, $d_a = \sqrt{(l_a)^2 + (w_a)^2}$ is the diagonal of the base of the anchor. The total loss function is the summation of the classification and regression losses, and is denoted as follows:

$L_{\mathrm{total}} = L_{\mathrm{cls}} + L_{\mathrm{reg}}$    (13)
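The following sketch computes the seven regression targets of Eqs. (10)-(12) and the Smooth L1 regression loss of Eqs. (8)-(9), assuming the VoxelNet-style encoding reconstructed above (in particular, normalizing the vertical offset by the anchor height is our assumption).

```python
import torch

def encode_box_targets(gt, anchors):
    """Seven regression targets of Eqs. (10)-(12) (sketch).

    gt, anchors: (N, 7) tensors of (x, y, z, l, w, h, theta).
    """
    d_a = torch.sqrt(anchors[:, 3] ** 2 + anchors[:, 4] ** 2)  # diagonal of the anchor base
    dx = (gt[:, 0] - anchors[:, 0]) / d_a
    dy = (gt[:, 1] - anchors[:, 1]) / d_a
    dz = (gt[:, 2] - anchors[:, 2]) / anchors[:, 5]            # normalized by anchor height (assumed)
    dl = torch.log(gt[:, 3] / anchors[:, 3])
    dw = torch.log(gt[:, 4] / anchors[:, 4])
    dh = torch.log(gt[:, 5] / anchors[:, 5])
    dtheta = gt[:, 6] - anchors[:, 6]
    return torch.stack([dx, dy, dz, dl, dw, dh, dtheta], dim=1)

def regression_loss(pred_deltas, target_deltas):
    """Smooth L1 loss of Eqs. (8)-(9), summed over the seven targets."""
    diff = torch.abs(pred_deltas - target_deltas)
    loss = torch.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)
    return loss.sum(dim=1).mean()
```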


4. Experiments


We evaluate our SAANet on the KITTI 3D object detection benchmark dataset[31]. Similar to previous works, such as [26] and [7], we examine three tasks: car, pedestrian, and cyclist detection, which are challenging and can demonstrate the effectiveness of our method. We compare our method with previous state-of-the-art methods on both the 3D object detection and the BEV object detection tasks. Then, we report the results obtained on the KITTI validation set and test set. Finally, we show qualitative results.

4.1 Dataset

The KITTI dataset[31] contains 7481 training image/point cloud samples and 7518 test image/point cloud samples in three categories: Car, Pedestrian and Cyclist. For each sample, 3D LiDAR point clouds are captured by a laser scanner (Velodyne HDL-64E) mounted on top of a vehicle, and an RGB camera image is captured by a front-facing camera mounted on top of the vehicle. For each category, the detector is evaluated for three levels of difficulty: easy, moderate, and hard. The difficulty assessment is based on the object height in the 2D results, occlusion, and truncation. As the ground truth for the test set is not available and the access to the test server is limited, we follow the approach proposed in MV3D, i.e., splitting the 7481 training examples into a training set of 3712 samples and a validation set of 3769 samples. Finally, we present the test results using the KITTI server.


4.2 Implementation details

For car detection, we follow the procedure described in[7] to obtain a voxel representation of the point clouds. We consider point clouds within a fixed range along the Z, Y, and X axes, and points that are projected outside of the image boundaries are removed. Similar to VoxelNet[7], we set M = 35 as the maximum number of randomly sampled points in each non-empty voxel. The feature learning phase takes the sampled points in each voxel as inputs to the sparse CNN to obtain the point cloud feature. For the image branch, a ResNet-like 2D CNN is utilized to extract image features from the input camera images. These two branch features are fused by means of the SAA module. Finally, the output feature map enters the RPN. We use only one anchor size for the car class, with two rotations, at 0 and 90 degrees. The anchors are assigned to ground-truth objects using an intersection-over-union (IoU) threshold of 0.6, and are assigned to the background (negative) if their IoUs are less than 0.45. Anchors with IoUs between 0.45 and 0.6 are ignored during training.

For pedestrian and cyclist detection, the range of the input point clouds is cropped along the Z, Y, and X axes, and the voxel size is the same as in the configuration for the above-mentioned car detection. For each of the pedestrian and cyclist classes, we likewise use a single anchor size with two rotations, at 0 and 90 degrees. We use a value of 0.35 for the mismatching threshold and 0.5 for the matching threshold. Relatively small objects such as pedestrians and cyclists need more points for feature extraction, so we set M = 45.

The entire training procedure of SAANet stops after 100 epochs. The initial learning rate is set to 0.0002, with an exponential decay factor of 0.8 and a decay every 20 epochs. We train our model on one NVIDIA Tesla P4 GPU, and all of the experiments are implemented under PyTorch.
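The learning-rate schedule described above (0.0002 decayed by a factor of 0.8 every 20 epochs, for 100 epochs) corresponds to a step schedule; a minimal PyTorch sketch follows, with the optimizer type (Adam) and the placeholder model being assumptions.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # placeholder standing in for the full SAANet model
# The optimizer type is not stated in the pre-proof; Adam is assumed here.
optimizer = torch.optim.Adam(model.parameters(), lr=0.0002)
# Multiply the learning rate by 0.8 every 20 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.8)

for epoch in range(100):
    # ... forward / backward / optimizer.step() over all training batches ...
    scheduler.step()  # apply the 0.8 decay at epochs 20, 40, 60, 80
```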

4.3 Metrics

The evaluation metrics follow the official KITTI evaluation protocol. Specifically, the IoU threshold is 0.7 for the car category, and 0.5 for the pedestrian and cyclist categories. The IoU threshold is the same for the evaluation of both BEV detection and 3D detection. As mentioned above, KITTI divides labels into three levels: easy, moderate, and hard. We use the average precision (AP) metric at all three levels.


4.4 Ablation study


To justify the effectiveness of our proposed LOE and SAA block, we conduct an ablation study. To the best of our knowledge, a state-of-the-art image-based 3D object detection method, MonoPSR[32], achieved AP values of 10.85%, 12.57%, and 9.06% for car detection, 10.66%, 12.65%, and 10.08% for pedestrian detection, and 11.01%, 13.43%, and 9.93% for cyclist detection across the three levels, respectively. The performance achieved by image-based 3D object detection is much worse than that achieved by LiDAR-based 3D object detection methods. Therefore, our work mainly utilizes image information to complement LiDAR information, to improve the performance of object detection. Accordingly, we first build a plain baseline model in which object detection is based only on point clouds, with the representation $\{x_i, y_i, z_i, r_i\}$. This baseline model does not include our proposed LOE or SAA block. We denote this baseline model as "Base1". Then, different configurations are tested, including: 1) the baseline model with LOE, denoted by "Base1+LOE"; 2) the above configuration plus an image feature, where a commonly used element-wise summation operation is used to combine the point cloud feature and image feature, denoted by "Base1+LOE+IF"; 3) the baseline model used to learn the point cloud feature, which is then fused with the image feature via a commonly used element-wise summation operation, denoted by "Base1+IF"; 4) the baseline model used to learn the point cloud feature, which is then fused with the image feature by our proposed SAA block, denoted by "Base1+IF+SAA"; and 5) our full model, denoted by "Base1+LOE+IF+SAA", or simply "SAANet" for short. The experimental results from the KITTI validation set for all configurations are reported in Table 1.

Table 1. Performance comparison of our models with different configurations in 3D detection on the KITTI validation set (Car: AP0.7; Pedestrian/Cyclist: AP0.5). LOE: local orientation encoding; IF: image feature; SAA: spatial adaptive alignment.

| Method | Car (Easy) | Car (Mod.) | Car (Hard) | Ped. (Easy) | Ped. (Mod.) | Ped. (Hard) | Cyc. (Easy) | Cyc. (Mod.) | Cyc. (Hard) |
|---|---|---|---|---|---|---|---|---|---|
| Base1 | 84.42 | 68.46 | 66.27 | 55.29 | 48.64 | 42.69 | 74.79 | 57.33 | 52.48 |
| Base1+LOE | 86.28 | 74.83 | 68.57 | 54.28 | 49.23 | 44.44 | 78.75 | 59.43 | 54.36 |
| Base1+LOE+IF | 86.24 | 75.45 | 68.93 | 55.14 | 49.40 | 46.33 | 75.13 | 57.66 | 52.15 |
| Base1+IF | 84.45 | 68.95 | 66.35 | 54.25 | 48.50 | 43.25 | 74.58 | 57.58 | 53.21 |
| Base1+IF+SAA | 85.49 | 75.57 | 69.03 | 54.26 | 48.54 | 45.81 | 78.52 | 58.34 | 53.82 |
| Base1+LOE+IF+SAA (SAANet) | 88.23 | 77.74 | 76.26 | 56.56 | 50.30 | 47.30 | 77.49 | 60.54 | 54.91 |

It can be seen from Table 1 that the LOE and the SAA block are both beneficial to improving the AP value. When we add the LOE technique to the Base1 or Base1+IF+SAA methods, higher AP values are achieved on average by Base1+LOE and Base1+LOE+IF+SAA, respectively. When we replace the commonly used element-wise summation operation with the SAA module in the Base1+IF and Base1+LOE+IF methods, higher AP values are achieved by Base1+IF+SAA and Base1+LOE+IF+SAA, respectively. As compared with our other models, SAANet achieves a higher AP on all three detection tasks across all three levels, except for cyclist detection at the easy level. This illustrates that both the SAA block and the point cloud representation with LOE are helpful for performance improvements. Notably, the Base1+LOE+IF method did not achieve better results than Base1+LOE. This indicates that the element-wise summation of two modality features cannot unearth the complementary characteristics between the point cloud feature and image feature, and this was the motivation for designing the SAA block, i.e., to better fuse these two types of features. The results demonstrate that the proposed LOE and SAA block are complementary to each other, and our proposed deep fusion network achieves the best performance.


4.5 Evaluation in bird’s eye view


For BEV evaluation, we compare the proposed method with several top-performing algorithms, including LiDAR-only approaches as well as fusion-based approaches, e.g., VeloFCN[33], MV3D(BV+FV)[1], VoxelNet[7], MV3D(BV+FV+Img)[1], PointFusion[2], and ContFuse[28]. VeloFCN[33] proposed the first 3D FCN framework for end-to-end 3D object detection based on point clouds. MV3D[1] is a former state-of-the-art 3D object detection method with two forms: LiDAR-only and fusion. VoxelNet[7] produces state-of-the-art results on the LiDAR-based car, pedestrian, and cyclist detection benchmarks. PointFusion[2] utilizes FCNs for the concatenation of point cloud features and image features, which are derived using PointNet[3] and ResNet[4], respectively. ContFuse[28] exploits continuous convolutions to fuse image and LiDAR feature maps at different levels of resolution, and shows significant improvements over state-of-the-art methods.

Table 2 reports the performance of BEV detection on the KITTI validation set. As no currently published method except VoxelNet presents validation results for pedestrian and cyclist detection, we use N/A for unsupplied results. The results show that SAANet outperforms the other methods across all levels of difficulty for car and cyclist detection. For pedestrian detection, only VoxelNet yields a higher AP than our method. VoxelNet combines point-wise features with a locally aggregated feature derived by a VFE layer, which is composed of several parts, i.e., representing each point with the seven-dimensional input $\{x_i, y_i, z_i, r_i, x_i - \bar{x}, y_i - \bar{y}, z_i - \bar{z}\}$, encoding the feature with an FCN, and then concatenating the learned point-wise feature and locally aggregated feature. In contrast, our method derives the local information mainly by introducing the LOE. In VoxelNet, the representation dimension of the point clouds is seven and more memory is required for computing, whereas our method reduces the input dimensions to five through introducing the LOE.

Table 2. Performance comparison in bird's eye view (BEV) detection on the KITTI validation set (Car: AP0.7; Pedestrian/Cyclist: AP0.5). MV3D: multi-view 3D networks. LiDAR: laser radar.

| Method | Modality | Car (Easy) | Car (Mod.) | Car (Hard) | Ped. (Easy) | Ped. (Mod.) | Ped. (Hard) | Cyc. (Easy) | Cyc. (Mod.) | Cyc. (Hard) |
|---|---|---|---|---|---|---|---|---|---|---|
| VeloFCN | LiDAR | 40.14 | 32.08 | 30.47 | N/A | N/A | N/A | N/A | N/A | N/A |
| MV3D | LiDAR | 86.18 | 77.32 | 76.33 | N/A | N/A | N/A | N/A | N/A | N/A |
| VoxelNet | LiDAR | 89.60 | 84.81 | 78.57 | 65.95 | 61.05 | 56.98 | 74.41 | 52.18 | 50.49 |
| MV3D | LiDAR+Img | 86.55 | 78.10 | 76.67 | N/A | N/A | N/A | N/A | N/A | N/A |
| PointFusion | LiDAR+Img | 88.16 | 84.02 | 76.44 | N/A | N/A | N/A | N/A | N/A | N/A |
| ContFuse | LiDAR+Img | 88.81 | 85.83 | 77.33 | N/A | N/A | N/A | N/A | N/A | N/A |
| SAANet | LiDAR+Img | 90.37 | 87.08 | 79.66 | 62.15 | 56.08 | 49.85 | 80.28 | 62.29 | 56.82 |

4.6 Evaluation in 3D


3D detection is a more challenging task, as it requires more accurate localization of shapes in 3D space. We compare our SAANet with the above-mentioned models, as well as with aggregate view object detection (AVOD)[27], which produces state-of-the-art results on the KITTI 3D object detection benchmark while running in real time. Table 3 summarizes the comparison results. It can be seen from Table 3 that SAANet outperforms the LiDAR-based methods for the car and cyclist categories. Specifically, our SAANet significantly outperforms the state-of-the-art LiDAR-based method VoxelNet by 6.26%, 12.28%, and 13.41% at the easy, moderate, and hard levels for car detection, respectively. For cyclist detection, SAANet achieves a higher AP than VoxelNet by 10.32%, 12.89%, and 9.80% at the easy, moderate, and hard levels, respectively. For pedestrian detection, VoxelNet achieves the best results, and SAANet achieves a lower AP than VoxelNet by 1.30%, 3.12%, and 1.57% at the easy, moderate, and hard levels, respectively.

Table 3. Performance comparison in 3D detection on the KITTI validation set (Car: AP0.7; Pedestrian/Cyclist: AP0.5). AVOD: aggregate view object detection.

| Method | Modality | Car (Easy) | Car (Mod.) | Car (Hard) | Ped. (Easy) | Ped. (Mod.) | Ped. (Hard) | Cyc. (Easy) | Cyc. (Mod.) | Cyc. (Hard) |
|---|---|---|---|---|---|---|---|---|---|---|
| VeloFCN | LiDAR | 15.20 | 13.66 | 15.98 | N/A | N/A | N/A | N/A | N/A | N/A |
| MV3D | LiDAR | 71.19 | 56.60 | 55.30 | N/A | N/A | N/A | N/A | N/A | N/A |
| VoxelNet | LiDAR | 81.97 | 65.46 | 62.85 | 57.86 | 53.42 | 48.87 | 67.17 | 47.65 | 45.11 |
| MV3D | LiDAR+Img | 71.29 | 62.68 | 56.56 | N/A | N/A | N/A | N/A | N/A | N/A |
| PointFusion | LiDAR+Img | 77.92 | 63.00 | 53.27 | 33.36 | 28.04 | 23.38 | 49.34 | 29.42 | 26.98 |
| ContFuse | LiDAR+Img | 82.54 | 66.22 | 64.04 | N/A | N/A | N/A | N/A | N/A | N/A |
| AVOD | LiDAR+Img | 84.41 | 74.44 | 68.65 | N/A | N/A | N/A | N/A | N/A | N/A |
| SAANet | LiDAR+Img | 88.23 | 77.74 | 76.26 | 56.56 | 50.30 | 47.30 | 77.49 | 60.54 | 54.91 |


When compared with LiDAR+Img-based methods, SAANet achieved a higher AP on all three detection tasks, across all three levels. SAANet outperforms the state-of-the-art AVOD by 3.82%, 3.30%, and 7.61% at the easy, moderate, and hard levels, respectively, for car detection. Moreover, SAANet outperforms MV3D(BV+FV+Img)[1], PointFusion[2], and ContFuse[28] by a large margin. We also visualize the localization results of several examples in Figure 2.


Fig. 2: Visualizations of our SAANet results on KITTI validation set. The predicted detection bounding boxes are in green, while the ground truth bounding boxes are in red. The predicted confidence is in blue and shown on the upper right of the detection bounding boxes.

4.7 Evaluation on KITTI test set

We evaluated our SAANet on the KITTI test set by submitting the object detection results on the test set to the official server. The results generated by KITTI's evaluation server are summarized in Table 4. We compared SAANet with the above-mentioned methods that currently have results on the test server. It can be seen from Tables 3 and 4 that the performance on the test set is consistent with that on the validation set. We would like to note that our SAANet is effective in the object detection task, as the proposed LOE can capture the local structure information of point clouds, and the SAA can adaptively align the point cloud feature and image feature. In addition, we evaluated the inference time of our method and other methods. As can be seen from Table 5, the inference time of our method is advantageous compared with most methods, which shows that our method does not sacrifice speed when improving performance. ContFuse has a shorter inference time, but it sacrifices detection performance on small objects, such as pedestrians and cyclists. The inference time of the AVOD method is less than that of SAANet; however, SAANet exceeds AVOD by 5.46% on average over the nine AP indicator values.

Table 4. Performance evaluation in 3D detection on the KITTI test set (Car: AP0.7; Pedestrian/Cyclist: AP0.5).

| Method | Modality | Car (Easy) | Car (Mod.) | Car (Hard) | Ped. (Easy) | Ped. (Mod.) | Ped. (Hard) | Cyc. (Easy) | Cyc. (Mod.) | Cyc. (Hard) |
|---|---|---|---|---|---|---|---|---|---|---|
| VeloFCN | LiDAR | 15.20 | 13.66 | 15.98 | N/A | N/A | N/A | N/A | N/A | N/A |
| MV3D | LiDAR | 66.77 | 52.73 | 51.31 | N/A | N/A | N/A | N/A | N/A | N/A |
| VoxelNet | LiDAR | 77.47 | 65.11 | 57.73 | 39.48 | 33.69 | 31.51 | 61.22 | 48.36 | 44.37 |
| MV3D | LiDAR+Img | 71.09 | 62.35 | 55.12 | N/A | N/A | N/A | N/A | N/A | N/A |
| ContFuse | LiDAR+Img | 82.54 | 66.22 | 64.04 | N/A | N/A | N/A | N/A | N/A | N/A |
| AVOD | LiDAR+Img | 73.59 | 65.78 | 58.38 | 38.28 | 31.51 | 26.98 | 60.11 | 44.90 | 38.80 |
| SAANet | LiDAR+Img | 83.72 | 73.92 | 66.80 | 38.64 | 32.66 | 31.47 | 64.33 | 50.89 | 45.06 |

Table 5. Comparison of inference time with other methods on the KITTI dataset.

| Method | Modality | Runtime (s) |
|---|---|---|
| MV3D | LiDAR | 0.24 |
| VoxelNet | LiDAR | 0.23 |
| MV3D | LiDAR+Img | 0.36 |
| ContFuse | LiDAR+Img | 0.06 |
| AVOD | LiDAR+Img | 0.08 |
| SAANet | LiDAR+Img | 0.10 |

5. Conclusion


In this paper, we proposed a novel fusion-based deep architecture for 3D object detection, named SAANet. SAANet utilizes a spatial adaptive alignment (SAA) block to fuse image features and point cloud features, and the LOE of each voxel is introduced into the point cloud representation to improve the point cloud features. Ablation studies show the effectiveness of the LOE and the SAA fusion block. Comprehensive experimental results on the popular benchmark dataset KITTI indicate that our method achieves a higher AP with a more efficient inference time than several state-of-the-art methods.

Acknowledgements

This work was supported by the Foundation of the China Scholarship Council under grant no. 201808615030.

References

[1] X. Chen, H. Ma, J. Wan, B. Li, T. Xia, Multi-view 3d object detection network for autonomous driving, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1907-1915.
[2] D. Xu, D. Anguelov, A. Jain, Pointfusion: Deep sensor fusion for 3d bounding box estimation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 244-253.
[3] C.R. Qi, H. Su, K. Mo, L.J. Guibas, Pointnet: Deep learning on point sets for 3d classification and segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 652-660.
[4] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
[5] A. González, G. Villalonga, J. Xu, D. Vázquez, J. Amores, A.M. López, Multiview random forest of local experts combining rgb and lidar data for pedestrian detection, in: 2015 IEEE Intelligent Vehicles Symposium (IV), IEEE, 2015, pp. 356-361.
[6] C. Premebida, J. Carreira, J. Batista, U. Nunes, Pedestrian detection combining rgb and dense lidar data, in: 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, 2014, pp. 4112-4117.
[7] Y. Zhou, O. Tuzel, Voxelnet: End-to-end learning for point cloud based 3d object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4490-4499.
[8] M. Engelcke, D. Rao, D.Z. Wang, C.H. Tong, I. Posner, Vote3deep: Fast object detection in 3d point clouds using efficient convolutional neural networks, in: 2017 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2017, pp. 1355-1361.
[9] O. Tuzel, M.-Y. Liu, Y. Taguchi, A. Raghunathan, Learning to rank 3d features, in: European Conference on Computer Vision, Springer, 2014, pp. 520-535.
[10] D.Z. Wang, I. Posner, Voting for voting in online point cloud object detection, in: Robotics: Science and Systems, 2015.
[11] B. Li, 3d fully convolutional network for vehicle detection in point cloud, in: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2017, pp. 1513-1518.
[12] C.R. Qi, L. Yi, H. Su, L.J. Guibas, Pointnet++: Deep hierarchical feature learning on point sets in a metric space, in: Advances in Neural Information Processing Systems, 2017, pp. 5099-5108.
[13] C.R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, L.J. Guibas, Volumetric and multi-view cnns for object classification on 3d data, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5648-5656.
[14] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, J. Xiao, 3d shapenets: A deep representation for volumetric shapes, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1912-1920.
[15] H. Schneiderman, T. Kanade, Object detection using the statistics of parts, International Journal of Computer Vision, 56 (2004) 151-177.
[16] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE, 2005.
[17] M. Oquab, L. Bottou, I. Laptev, J. Sivic, Is object localization for free? Weakly-supervised learning with convolutional neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 685-694.
[18] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580-587.
[19] K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 37 (2015) 1904-1916.
[20] R. Girshick, Fast r-cnn, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440-1448.
[21] S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time object detection with region proposal networks, in: Advances in Neural Information Processing Systems, 2015, pp. 91-99.
[22] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A.C. Berg, Ssd: Single shot multibox detector, in: European Conference on Computer Vision, Springer, 2016, pp. 21-37.
[23] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980-2988.
[24] M.M. Bronstein, I. Kokkinos, Scale-invariant heat kernel signatures for non-rigid shape recognition, in: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE, 2010, pp. 1704-1711.
[25] C.R. Qi, W. Liu, C. Wu, H. Su, L.J. Guibas, Frustum pointnets for 3d object detection from rgb-d data, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 918-927.
[26] B. Yang, W. Luo, R. Urtasun, Pixor: Real-time 3d object detection from point clouds, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7652-7660.
[27] J. Ku, M. Mozifian, J. Lee, A. Harakeh, S.L. Waslander, Joint 3d proposal generation and object detection from view aggregation, in: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2018, pp. 1-8.
[28] M. Liang, B. Yang, S. Wang, R. Urtasun, Deep continuous fusion for multi-sensor 3d object detection, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 641-656.
[29] X. Wang, R. Girshick, A. Gupta, K. He, Non-local neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794-7803.
[30] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998-6008.
[31] A. Geiger, P. Lenz, R. Urtasun, Are we ready for autonomous driving? The kitti vision benchmark suite, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2012, pp. 3354-3361.
[32] J. Ku, A.D. Pon, S.L. Waslander, Monocular 3d object detection leveraging accurate proposals and shape reconstruction, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 11867-11876.
[33] B. Li, T. Zhang, T. Xia, Vehicle detection from 3d lidar using fully convolutional network, in: Proceedings of Robotics: Science and Systems, 2016.

Highlights

• A deep fusion network architecture for complementary 3D object detection.
• An SAA module to adaptively align the point cloud feature and image feature.
• A local orientation encoding introduced to the representation of a point cloud.