Fusion of images and point clouds for the semantic segmentation of large-scale 3D scenes based on deep learning
Rui Zhang a,b,⁎, Guangyun Li a, Minglei Li a, Li Wang a
a Information Engineering University, 450001 Zhengzhou, China
b North China University of Water Resources and Electric Power, 450045 Zhengzhou, China
⁎ Corresponding author at: Information Engineering University, 450001 Zhengzhou, China. E-mail address: [email protected] (R. Zhang).
ARTICLE INFO

Keywords: 3D scene segmentation; 2D image; 3D point cloud; Large-scale; High-resolution

ABSTRACT
We address the semantic segmentation of large-scale 3D scenes by fusing 2D images and 3D point clouds. First, a DeepLab-VGG16 based Large-Scale and High-Resolution model (DVLSHR), built on the deep Visual Geometry Group network (VGG16), is created and fine-tuned by training seven deep convolutional neural networks on four benchmark datasets. On the CityScapes val set, DVLSHR achieves 74.98% mean Pixel Accuracy (mPA) and 64.17% mean Intersection over Union (mIoU), and it can be adapted to segment our captured images (image resolution 2832 ∗ 4256 pixels). Second, the preliminary segmentation results obtained on the 2D images are mapped to the 3D point clouds according to the coordinate relationships between the images and the point clouds. Third, based on the mapping results, fine features of buildings are further extracted directly from the 3D point clouds. Our experiments show that the proposed fusion method can segment local and global features efficiently and effectively.
1. Introduction

Compared with object classification, object detection and object recognition, semantic segmentation is a higher-level task that paves the way towards complete scene understanding in computer vision. It is the pixel-level classification of different objects against a complex background. The importance of semantic segmentation as a core computer vision problem is highlighted by the increasing number of applications that adopt it to infer different types of information, including remote-sensing mapping, autonomous driving, indoor navigation, robotics, augmented reality, human-computer interaction, city planning, etc.

Recently, laser scanners have become popular equipment for 3D scene perception due to their stable 3D data capturing capability both day and night. In combination with digital cameras, 2D images, 2.5D depth images and 3D point clouds can all be captured quickly and efficiently and used to understand and infer the nature of the 3D world. Meanwhile, these data present great challenges for the quick and accurate segmentation of 3D scenes.

In addition, with the development of Graphics Processing Units (GPUs) and machine learning, and with the appearance of public 3D point cloud datasets, deep learning has started to be applied to 3D scene segmentation, which breaks through the technical barriers of traditional 3D point cloud segmentation, namely: (1) the data need to be preprocessed to remove
ground points; (2) only one type of object can be extracted at a time; (3) 3D objects need to be extracted using hand-designed features, which depends on the professional knowledge of the researchers; and (4) the processing speed is slow and the combination with the Compute Unified Device Architecture (CUDA) is difficult.

The application of deep learning in the surveying field has only just begun, and 3D objects are extracted mainly from 2D projective images. Our study is inspired by the success of deep learning on 2D images. We first tried seven well-known semantic segmentation models, compared their performance and evaluated their suitability for different scenes. On this basis, to obtain a Deep Convolutional Neural Network (DCNN) suitable for large-scale 3D scenes and high-resolution images (2832 ∗ 4256 pixels), we modified and fine-tuned the weights of the publicly available ImageNet-pretrained DeepLabV2-VGG16 and adapted it to large-scale outdoor scenes and high-resolution images. To ensure the validity of the DVLSHR model, four benchmark datasets (PASCAL VOC12, SIFT-Flow, CamVid and CityScapes) were utilized in the training and validation stages. Two test datasets were captured with a Nikon D700 digital camera and a Riegl VZ-400 laser scanner.

Then, the segmentation results of the 2D images were mapped to their corresponding 3D point clouds according to their coordinate transformations, and features of the 3D objects were coarsely extracted from the 3D point clouds. So far, not all 3D objects can be segmented well; for example, buildings are difficult to segment. Due to the limitations of the image labels, only the outlines of
buildings were labeled, while local structures such as windows, balconies and doors were not. To address this problem, the coarsely segmented 3D objects were further refined with Fuzzy Clustering combined with the Generalized Hough Transformation algorithm (named the FC-GHT algorithm).

The key contributions of our work are as follows:

(1) 3D point cloud descriptors face many challenges compared with 2D images at present. Among semantic segmentation methods based on deep learning, studies that directly use 3D point clouds as input to implement scene segmentation are rare, apart from Stanford's PointNet and its extended version PointNet++. The only publicly available 3D dataset is Stanford 2D-3D-S, which consists of indoor scenes rather than urban scenes. Deep neural network models based on 2D image segmentation, however, are more mature, and there are many more available datasets that can greatly benefit model training. As such, in this work we synchronously acquired 2D images and 3D point clouds. The 2D images were then input into the convolutional neural network to obtain their segmentation results, as discussed in Section 3. Following this style of 3D point cloud segmentation assisted by 2D images, we successfully modified and fine-tuned a DCNN for large-scale scenes and high-resolution images, as discussed in Section 3.1.

(2) The segmentation results of the 2D images were then mapped to the 3D point clouds according to their coordinate transformation relationships. We derived the mapping process suitable for our proposed segmentation method, as discussed in Section 3.2.

(3) In the mapping results, only the outlines of each class were segmented, without local features. For buildings, the main structures in 3D urban scenes, the extraction was insufficient and we merely obtained building outlines. Therefore, based on the mapping results, we further extracted the physical planes of the buildings from the 3D point clouds using the FC-GHT algorithm, as discussed in Section 3.3. The segmentation results and the plane extraction validate the effectiveness of the proposed method.

2. Related works

2.1. Point cloud descriptors

In what manner should 3D point clouds be represented? A large corpus of shape descriptors has been developed for drawing inferences about 3D objects in deep learning. These descriptors can be classified into four broad categories: methods based on hand-extracted features, 2D projection maps, voxel-based representations and raw point clouds (Guo, 2017).

Previously, 3D point cloud descriptors were largely "hand-designed" according to the particular geometric properties of a shape's surface or volume, such as length, width, height, area, reflected intensity, normal vector or curvature (Berthold, 1984; Bu et al., 2014). Hand-designed features first need to be extracted and then input into a Deep Neural Network (DNN) to learn high-layer features, which still depends on the selected hand-designed features and on parameter optimization; thus, the advantages of deep learning are lost to some extent, and the problem of automatic learning is not solved.

The second category is view-based descriptors, which describe the shape of a 3D object by "how it looks" in a collection of 2D projections. Murase and Nayar (1995) recognized objects by matching their appearances in parametric Eigen-spaces formed by large sets of 2D renderings of 3D models under varying poses and illuminations. Su et al. (2015) rendered a 3D shape from 12 different views and passed these views through a Convolutional Neural Network (CNN) to extract view-based features. Shi et al. (2015) converted 3D shapes into a panoramic view, namely, a cylindrical projection around the principal axis. Sinha et al. (2016) created geometric images using authalic parametrization in a spherical domain. Kalogerakis et al. (2016) obtained shaded and
depth images from different viewpoints and at different scales. The flaw of these methods is that the local and global structures are changed, which reduces their ability to identify various scene features.

The third category is voxel-based representation. Wu et al. (2015) represented a geometric 3D shape as a probability distribution of binary variables on a 3D voxel grid. Xu et al. (2016) created binary images based on different rotations along the x-, y- and z-axes. Li et al. (2016b) represented 3D shapes as volumetric fields. Qi et al. (2016b) compared CNNs based upon volumetric representations with those based on multi-view representations. Wu et al. (2016) generated 3D objects from a probability space by leveraging recent advances in volumetric convolutional networks and generative adversarial nets. Although these methods completely preserve the 3D shape information, they face new challenges: (1) the 3D voxel resolution cannot be too high if training is to remain tractable, so a resolution of 30 ∗ 30 ∗ 30 voxels is usually used; however, too low a resolution limits the segmentation performance. (2) The proportion of voxels lying on 3D surfaces is not high, so the voxel representation is sparse; it is therefore necessary to design a reasonable network structure that avoids zero or void operations.

The methods in the fourth category adopt the raw point clouds directly (Vinyals et al., 2015; Qi et al., 2016a). According to the scattered and unstructured characteristics of 3D point clouds, these methods design special network input layers. However, developing classifiers and other supervised machine learning algorithms on top of such 3D shape descriptors poses a number of challenges (Su et al., 2015). First, the number of organized databases with annotated 3D models is rather limited compared with image datasets, because generating 3D datasets for segmentation is costly and difficult. Second, 3D shape descriptors (including voxel-based representations) tend to be very high-dimensional, and few deep learning methods can process such data directly; thus, 3D point cloud datasets are unpopular at present. Third, real point clouds are big: a single laser scan possesses tens of millions of unstructured points, which constitutes a large computational burden. The main bottleneck is the large number of 3D nearest-neighbor queries, which significantly slows down processing. Instead of computing exact neighborhoods for each point, Hackel et al. (2016) down-sampled the entire point cloud to generate a multi-scale pyramid and computed a separate search structure per scale level.

From the above summary, we can see that 3D point cloud descriptors still face many challenges compared with 2D images. The first category requires more prior knowledge, while a considerable amount of 3D information is lost or distorted in the second category. The third category bears a higher computational burden in data preprocessing, while the fourth category contains a major bottleneck, i.e. the large number of 3D nearest-neighbor queries, which significantly slows down the processing. Image-based DNNs, in contrast, do not need to project data into another dimensional space; in addition, their data preprocessing is much simpler and there are many more shared datasets, which together greatly benefit model training.

In this work, a novel large-scale point cloud segmentation method is proposed, in which 2D images synchronously acquired with 3D point clouds are input into a CNN to obtain preliminary segmentation results.
2.2. Deep CNNs

To apply deep learning to the semantic segmentation of images, the generic framework has so far included three main parts: (1) 2D images are input; (2) a Fully Convolutional Network (FCN) is used at the front-end of the model to coarsely extract features; and (3) the output of the front-end is optimized by the back-end with a Conditional Random Field/Markov Random Field (CRF/MRF) to obtain the segmentation results. At present, many of the classical semantic segmentation methods for complex urban scenes based on deep learning adopt this framework, such as FCN (Long et al., 2015), SegNet
(Badrinarayanan et al., 2016; Kendall et al., 2016) and DeepLab (Chen et al., 2014, 2017). They have several variants, for example, FCN-AlexNet, FCN-VGG16, FCN-GoogLeNet, SegNet-VGG16, Bayesian-SegNet-VGG16, DeepLab-VGG16 and DeepLab-ResNet101. The key insight of FCN is to build a fully convolutional network that takes input of arbitrary size and produces a correspondingly sized output. The novelty of SegNet lies in the manner in which the decoder up-samples the lower-resolution input feature maps: the decoder uses the pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear up-sampling. DeepLab has three main contributions: atrous convolution, atrous spatial pyramid pooling (ASPP) and the CRF; among the three classical models (FCN, SegNet and DeepLab), only DeepLab uses a CRF. The backbone architectures of these state-of-the-art methods are AlexNet (Krizhevsky et al., 2012), GoogLeNet (Szegedy et al., 2015), VGG16 (Simonyan and Zisserman, 2014) and ResNet (He et al., 2016), which are currently used as building blocks for many segmentation frameworks. The most frequently used one is VGG16, which is composed of 16 layers and was introduced by the Visual Geometry Group of the University of Oxford. The main difference between VGG16 and its predecessors is the use of a stack of convolutional layers with smaller receptive fields in the first few layers. This leads to fewer parameters and more non-linearity, thereby making the decision function more discriminative and the model easier to train (Garcia-Garcia et al., 2017). Through a series of trials that included training, validation and testing with these DCNN architectures, we found that FCN is suitable for the segmentation of lower-resolution images, which are usually 256 ∗ 256 pixels; SegNet performs better when indoor scenes or simple outdoor scenes, such as freeways, are segmented to delineate boundaries; and DeepLab is suitable for complex urban scenes.
2.3. Mapping relationships between 2D images and 3D point clouds

The solution to the mapping relationships between 2D images and 3D point clouds is to determine the interior orientation elements of the cameras and their exterior orientation elements relative to the scanner, i.e. the registration of 2D and 3D data (Wang and Hu, 2012). Some scholars matched images with 2D intensity images (Fang, 2014) or distance images (Xu et al., 2013) generated from point clouds, so that the mapping between point clouds and images became a registration between 2D and 2D data. Other authors matched the 3D point clouds generated from stereo images (Deng et al., 2007; Yan, 2014) with the laser-scanned point clouds, thus transforming the problem into a registration between 3D and 3D data. The existing methods used to compute the intrinsic and extrinsic parameters of cameras are: the pyramid principle (Xu et al., 2000), the direct adjustment method based on the collinearity equation (Yao and Zhang, 2005), a method based on the Rodrigues matrix (Yao et al., 2006), the direct linear transformation (DLT) (Abdel-Aziz and Karara, 2015), etc. The pyramid principle and the direct adjustment method compute the parameters by iteration, which requires either assuming that the image plane and the xoy plane of the physical coordinate system are approximately parallel, or a good initial value as input. The method based on the Rodrigues matrix cannot solve the initial value problem effectively, and its model transformation is not rigorous enough. The DLT method does not require initial values, and it is suitable for determining exterior orientations at large rotation angles in close-range photogrammetry. Therefore, in this paper, the exterior orientation elements of the cameras are calculated based on the DLT method.

2.4. Physical plane extraction methods

Among the methods used to extract the fine features of buildings, model matching (Yang et al., 2013a) is the classical approach, which requires a large amount of prior knowledge. Dimensionality reduction methods (Yang et al., 2013b; Yang et al., 2012; Wei et al., 2012) project the entire 3D point cloud onto a two-dimensional space and then adopt 2D image processing methods to extract features, which results in a massive loss of spatial information. In addition, clustering segmentation algorithms (Nalani et al., 2012) and statistical analysis algorithms (Li et al., 2015a) have been widely used, but their processing speed is slow because of iterative computations. Accordingly, to improve the processing speed and automation, this study proposes a novel algorithm to refine the segmentation of buildings that combines fuzzy clustering (Nalani and Maas, 2012; Nalani et al., 2012; Josep and Jose, 2008; Burochin et al., 2014; Petar et al., 2015) and the Generalized Hough Transformation (GHT) (Zhou et al., 2013; Song et al., 2014).

3. Methodology

This study deals with the semantic segmentation of complex 3D scenes based on deep learning. The main process is as follows: (1) 2D images are used as training sets to train the DVLSHR model, with which the preliminary segmentation results are obtained; (2) the segmentation results are mapped from the 2D images to the 3D point clouds according to their coordinate relationships; and (3) the physical plane extraction of buildings is performed directly on the 3D point clouds.
3.1. DVLSHR model

According to Section 2.1, feature descriptors based on 2D projective images are relatively low-dimensional and efficient to evaluate. They have been widely used because many 2D image datasets and high-performance network structures are publicly available. However, methods of this kind change the local and global structures of 3D shapes. Therefore, in our study, we acquired the 2D images synchronously with the 3D point clouds, and then input the 2D image feature descriptors into DVLSHR to extract preliminary features.

CNNs are widely used in many fields because of their feature expression abilities (Lecun et al., 1998, 2015). Their application to surveying and mapping has just begun. Herein, we describe how we modified and fine-tuned the weights of the publicly available DeepLabV2-VGG16 model to adapt it to large-scale outdoor scenes and high-resolution images (see Fig. 1). As shown in Fig. 2, the size of the initial images is 2048 ∗ 1024 pixels. Limited by a single NVIDIA Titan X Pascal 12 GB GPU, we first down-sampled these images to 1024 ∗ 512 pixels (stride = 2) and then input them into the neural network to train the DVLSHR model
Fig. 1. Flowchart of the proposed algorithm.
(kernel = 3, atrous convolution with rate = 12). The concrete fine-tuning process will be discussed later in Section 4. Finally, we up-sampled the output of the last layer to the original size of 2048 ∗ 1024 pixels (stride = 2).

Fig. 2. DVLSHR model illustration.

3.1.1. Loss function and optimized solver algorithm
The loss function we used is the cross-entropy term. Since the cross-entropy loss is a convex function, we employed standard stochastic gradient descent (SGD) as the solver optimization algorithm. Let us suppose that the input data consist of a single sample with multiple classes, $(x, y)$, where $x$ is the feature vector of the input sample and $y$ is the corresponding label. Vector $y$ is expressed as an n-dimensional column vector, $y = (y_1, \ldots, y_k, \ldots, y_n)^T$, where $n$ is the number of classes. Therefore, the probability of pixel $i$ in sample $x$ for each class is:

$$P(y = y_1 \mid x, w) = g_w(x^{(i)})_1,\ \ldots,\ P(y = y_k \mid x, w) = g_w(x^{(i)})_k,\ \ldots,\ P(y = y_n \mid x, w) = g_w(x^{(i)})_n \quad (1)$$

where $g_w(x^{(i)}) = \dfrac{1}{1 + e^{-f_w(x^{(i)})}}$, $f_w(x^{(i)}) = w^T x^{(i)}$, $w = w_0 + w_1 + w_2 + \cdots + w_n$, $w_0 = b$, and $g_w(x^{(i)})_1 + \cdots + g_w(x^{(i)})_k + \cdots + g_w(x^{(i)})_n = 1$. The errors of the input sample produced during the transformation include both a loss term and a regularization term. The loss function is defined as:

$$J(w) = L(w) + \lambda R(w) = -\sum_{k=1}^{n} \left[ y_k \log g_w(x)_k + (1 - y_k) \log(1 - g_w(x)_k) \right] + \lambda \|w\|^2 \quad (2)$$

Then, we infer the loss function with $m$ samples:

$$J(w) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{n} \left[ y_k^{(i)} \log g_w(x^{(i)})_k + (1 - y_k^{(i)}) \log(1 - g_w(x^{(i)})_k) \right] + \lambda \|w\|^2 \quad (3)$$

$$\|w\|^2 = \frac{1}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{S_l} \sum_{j=1}^{S_{l+1}} \left(\theta_j^{(l)}\right)^2 \quad (4)$$

where $L$ denotes the total number of layers in the network and $S_l$ is the number of units in layer $l$. This loss function can solve the problem of slow gradient updating. The neural network adopts forward propagation to obtain the loss of the current parameters, then back-propagates the errors and uses a standard SGD algorithm to iteratively modify the weights of each layer. Each parameter is stochastically initialized to obtain the input, output and loss function values, and the partial derivative of the loss function $J(w)$ with respect to every parameter is computed as:

$$\frac{\partial J(w)}{\partial w_{i,j}^{(l)}} = \frac{1}{m} \sum_{i=1}^{m} a_j^{(i)(l)} \delta_j^{(i)(l+1)} \quad (5)$$

where $\delta_j^{(l)}$ denotes the error of node $j$ in layer $l$. Then, the optimal parameters are calculated according to the gradient descent algorithm with a limited number of iterations. The gradient descent update is:

$$\delta = \alpha \frac{\partial J(w)}{\partial w}, \qquad w = w - \delta \quad (6)$$

where $\alpha$ denotes the update rate and $w$ is updated in the direction of the negative gradient.
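To make Eqs. (1)-(6) concrete, the following is a minimal NumPy sketch of a softmax cross-entropy loss with an L2 term and a plain SGD update for a toy linear classifier. All shapes, names and hyper-parameters are illustrative and do not correspond to the actual convolutional DVLSHR implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)               # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def loss_and_grad(w, x, y_onehot, lam=1e-4):
    """Cross-entropy loss with L2 regularization (cf. Eqs. (3)-(4)) and its gradient (cf. Eq. (5))."""
    m = x.shape[0]
    probs = softmax(x @ w)                              # class probabilities, cf. Eq. (1)
    loss = -np.sum(y_onehot * np.log(probs + 1e-12)) / m + lam * np.sum(w ** 2)
    grad = x.T @ (probs - y_onehot) / m + 2 * lam * w
    return loss, grad

def sgd_step(w, grad, alpha=0.1):
    """Gradient descent update of Eq. (6): w <- w - alpha * dJ/dw."""
    return w - alpha * grad

# toy usage: 8 samples, 5 features, 3 classes
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5))
y = np.eye(3)[rng.integers(0, 3, size=8)]
w = np.zeros((5, 3))
for _ in range(100):
    loss, grad = loss_and_grad(w, x, y)
    w = sgd_step(w, grad)
```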
3.1.2. Down-sampling
The resolution of the initial images of CityScapes is 2048 ∗ 1024 pixels, which makes training deeper networks with our limited GPU memory challenging. To solve this problem, we first down-sampled the images, including the original and ground-truth ones, and then up-sampled them to the original resolution after segmentation to simulate the best results obtainable with the particular down-sampling factor. In our experiments, we down-sampled the images by factors of 2 and 4, respectively. The selection of the appropriate down-sampling factor is discussed in Section 4.3.1.

3.1.3. Effect of batch_size on cross-entropy loss and accuracy
Suppose $P_{correct}$ is the probability of accurate classification of a single sample; then the cross-entropy loss is $loss = -\ln P_{correct}$. When $batch\_size = m$, the loss of the whole batch is:

$$loss = \frac{1}{m} \sum_{i=1}^{m} \left(-\ln P_{correct}^{(i)}\right) \quad (7)$$

For a single sample, $P_{correct}$ denotes the accuracy. Regarding $-\ln P_{correct}^{(i)}$ as a stochastic variable $C_i$, when there is an infinite number of samples in the batch ($m \to \infty$), Eq. (7) changes to:

$$loss = E(C) = E(-\ln P_{correct}) \quad (8)$$

$$loss = E(-\ln accuracy) \quad (9)$$

Namely, when $batch\_size \to \infty$,

$$accuracy \approx e^{-loss}$$

That is to say, when we use batch-based cross-entropy losses to train the machine-learning algorithm, the accuracy can be roughly calculated from the loss, and the error decreases as the batch_size increases. When batch_size = 100, $e^{-loss}$ approaches the accuracy, and the error is usually less than 0.01.
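As a quick numerical illustration of the approximation $accuracy \approx e^{-loss}$, the following sketch compares the two quantities on a synthetic batch of per-sample probabilities; the values are made up and only meant to show the size of the gap.

```python
import numpy as np

rng = np.random.default_rng(1)
p_correct = rng.uniform(0.6, 0.99, size=100)     # hypothetical probabilities of the correct class

loss = np.mean(-np.log(p_correct))               # batch cross-entropy loss, Eq. (7)
print(f"e^-loss        = {np.exp(-loss):.4f}")   # approximation of the accuracy
print(f"mean p_correct = {p_correct.mean():.4f}")
```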
3.2. Mapping 2D images to 3D point clouds

3.2.1. Exterior azimuth element calibration of a single image
The installation parameters of the externally installed digital camera relative to the 3D terrestrial laser scanner are the exterior azimuth elements of the digital camera relative to the laser scanner coordinate system. According to the collinearity condition and taking into account the camera's internal parameters, the relationships between the image point coordinates, the object point coordinates, the camera's internal parameters and the exterior azimuth elements can be established as:

$$
\begin{cases}
x - x_0 + \Delta x + f \dfrac{a_1 (X - X_S) + b_1 (Y - Y_S) + c_1 (Z - Z_S)}{a_3 (X - X_S) + b_3 (Y - Y_S) + c_3 (Z - Z_S)} = 0 \\[2ex]
y - y_0 + \Delta y + f \dfrac{a_2 (X - X_S) + b_2 (Y - Y_S) + c_2 (Z - Z_S)}{a_3 (X - X_S) + b_3 (Y - Y_S) + c_3 (Z - Z_S)} = 0
\end{cases} \quad (10)
$$

where $(x, y)$ denotes the coordinates of a control point in the 2D image physical coordinate system, $(X, Y, Z)$ denotes the coordinates of this control point in the scanner coordinate system, $(x_0, y_0)$ denotes the principal point coordinates of the photograph, $f$ denotes the focal length, $(\Delta x, \Delta y)$ denotes the distortion correction composed of seven distortion correction parameters, $\{a_j, b_j, c_j\ (j \in \{1, 2, 3\})\}$ denotes the nine corresponding direction cosines when the exterior azimuth elements
are converted into a coordinate rotation matrix, and $(X_S, Y_S, Z_S)$ denotes the exterior line elements. The camera was calibrated before the experiments using the Video-Simultaneous Triangulation and Resection System (V-STARS); therefore, $(x_0, y_0, f, \Delta x, \Delta y)$ are known values. With the object and image control point coordinates as input values, the exterior azimuth elements of the camera at its initial shooting angle are derived by DLT.

3.2.2. Exterior azimuth element calibration of multiple images
When the 3D laser scanner works at the same station, its instrumental coordinate system remains the same, but the camera rotates around the device, so the exterior azimuth elements of each image are different. Calibrating each image with the above method would be difficult to implement in practice. Since the externally installed camera is fixed on the laser scanner and rotates only around the z-axis of the laser scanner (rotation angle $\xi$), our strategy is to perform the calibration only once, with the scanner in its initial position; the exterior azimuth elements of the images at the other positions are then calculated from the rotation angle relative to the coordinates of the initial position (Li et al., 2016a). Suppose the number of images obtained while the camera rotates with the scanner is $n$; then the angle between two neighboring images is $\xi = 360°/n$. Let the coordinates of point $P$ in the image physical coordinate system at the initial position be $(x, y)$, its corresponding coordinates in the image-space coordinate system be $[x\ \ y\ \ {-f}]^T$, and the coordinates of the corresponding object point in the scanner coordinate system be $[X_1\ \ Y_1\ \ Z_1]^T$; then:

$$
\begin{bmatrix} X_1 \\ Y_1 \\ Z_1 \end{bmatrix}
= \begin{bmatrix} a_1 & a_2 & a_3 \\ b_1 & b_2 & b_3 \\ c_1 & c_2 & c_3 \end{bmatrix}
\begin{bmatrix} x \\ y \\ -f \end{bmatrix}
+ \begin{bmatrix} X_S \\ Y_S \\ Z_S \end{bmatrix}
= R \begin{bmatrix} x \\ y \\ -f \end{bmatrix} + T \quad (11)
$$

The coordinates of the image point on the $i$th image in the scanner coordinate system, $[X_i\ \ Y_i\ \ Z_i]^T$, are obtained by rotating the first image by an angle $(i-1)\xi$ about the z-axis of the scanner coordinate system:

$$
\begin{bmatrix} X_i \\ Y_i \\ Z_i \end{bmatrix}
= \begin{bmatrix} \cos((i-1)\xi) & -\sin((i-1)\xi) & 0 \\ \sin((i-1)\xi) & \cos((i-1)\xi) & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} X_1 \\ Y_1 \\ Z_1 \end{bmatrix}
= R_{1,i} R \begin{bmatrix} x \\ y \\ -f \end{bmatrix} + R_{1,i} T
= R_i \begin{bmatrix} x \\ y \\ -f \end{bmatrix} + T_i \quad (12)
$$

where $R_i$ and $T_i$ denote the rotation matrix and translation vector, respectively, derived from the exterior orientation elements of the $i$th image. The image is segmented with the DCNN to obtain the label of each pixel in the pixel coordinate system. In light of the relationship between the pixel coordinate system and the image physical coordinate system, Eq. (13), $(u, v)$ can be converted to $(x, y)$; then, according to Eq. (12), the corresponding point coordinates in the scanner coordinate system are obtained, and the initial segmentation results of the 3D point clouds are derived.

$$
\begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
= \begin{bmatrix} \Delta y^{-1} & 0 & u_0 \\ 0 & \Delta x^{-1} & v_0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} y \\ x \\ 1 \end{bmatrix} \quad (13)
$$
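The following is a minimal sketch of Eqs. (11)-(13): a labelled pixel of the $i$th image is converted to image physical coordinates and then expressed in the scanner coordinate system. The calibration result $R$, $T$ from the DLT step, the pixel pitches, the principal point and the focal length are assumed to be known; all numerical values in the usage example are placeholders, and the association of scanned points with the resulting viewing ray is omitted.

```python
import numpy as np

def pixel_to_physical(u, v, u0, v0, dx, dy):
    """Invert Eq. (13): pixel coordinates (u, v) -> image physical coordinates (x, y)."""
    y = (u - u0) * dy
    x = (v - v0) * dx
    return x, y

def image_point_in_scanner_frame(u, v, i, xi_deg, R, T, cam):
    """Eqs. (11)-(12): rotate the initial-position solution by (i - 1) * xi about the z-axis."""
    x, y = pixel_to_physical(u, v, cam["u0"], cam["v0"], cam["dx"], cam["dy"])
    p_img = np.array([x, y, -cam["f"]])
    a = np.radians((i - 1) * xi_deg)
    R1i = np.array([[np.cos(a), -np.sin(a), 0.0],
                    [np.sin(a),  np.cos(a), 0.0],
                    [0.0,        0.0,       1.0]])
    Ri, Ti = R1i @ R, R1i @ T
    return Ri @ p_img + Ti            # [X_i, Y_i, Z_i] in the scanner frame

# hypothetical usage with placeholder calibration values
cam = {"u0": 2128.0, "v0": 1416.0, "dx": 8.45e-6, "dy": 8.45e-6, "f": 0.020}
R, T = np.eye(3), np.zeros(3)          # stand-ins for the DLT calibration result
print(image_point_in_scanner_frame(1000, 700, i=3, xi_deg=360.0 / 7, R=R, T=T, cam=cam))
```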
3.3. Physical plane extraction with 3D point clouds

In this section, further refinement of the feature extraction of the constructed 3D point clouds is performed.

3.3.1. 3D point cloud representation
Determining how to represent and model 3D laser point clouds is a fundamental prerequisite for refining the segmentation of a large-scale 3D scene; a more suitable model for laser points may lead to faster and more accurate scene segmentation (Liu et al., 2017). To improve the efficiency of data organization and management, we adopted the Kd-OcTree index, which was introduced in detail in our earlier work (Zhang et al., 2017). On the basis of this mixed index, to calculate the normal vectors, we set a distance threshold to pre-sample the point clouds from the processing point sets and conducted a K-Nearest-Neighbor (KNN) search within the pending point clouds. A pending point and its neighboring points are then used to calculate the initial value of the normal vector with a Principal Component Analysis (PCA) algorithm. According to the constraint that the dot product of $\vec{n} = (n_x, n_y, n_z)$ and $\vec{r} = (x, y, z)$ must be less than zero ($\vec{n} \cdot \vec{r} < 0$), we adjusted the orientation of the normal vector to obtain the final normal vector, where $\vec{n}$ and $\vec{r}$ denote the normal vector of each point and the incidence direction of the laser, respectively.
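A minimal sketch of the KNN + PCA normal estimation described above, with the orientation constraint $\vec{n} \cdot \vec{r} < 0$; a plain k-d tree stands in for the Kd-OcTree index, and the neighborhood size is illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def estimate_normals(points, k=20):
    """KNN search + PCA per point; flip the normal so that n . r < 0 (towards the scanner)."""
    tree = cKDTree(points)
    normals = np.empty_like(points)
    for i, p in enumerate(points):
        _, idx = tree.query(p, k=k)
        nbrs = points[idx] - points[idx].mean(axis=0)
        _, _, vt = np.linalg.svd(nbrs, full_matrices=False)
        n = vt[-1]                       # direction of least variance = surface normal
        if np.dot(n, p) > 0:             # p approximates the laser incidence direction r
            n = -n
        normals[i] = n
    return normals

# usage on a random toy cloud
pts = np.random.default_rng(2).normal(size=(500, 3))
nrm = estimate_normals(pts, k=15)
```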
3.3.2. Physical plane extraction
Fuzzy clustering: To segment each facade into independent semantic units effectively and efficiently, a fuzzy clustering algorithm combining the normal vector angle and the Euclidean distance is used to coarsely segment the physical planes before they are accurately extracted. For the segmentation of boundary points, this method is more accurate than the traditional Euclidean distance method because both the distance and the normal vector are considered; it can therefore divide points that are close in distance but belong to two different planes into different facades.

FC-GHT algorithm: This work takes advantage of the Random Hough Transformation (Zhou et al., 2013) and the GHT, and sets a sampling interval in the image space according to the number of points and their density. This approach avoids the accumulation of invalid points and improves the efficiency of the algorithm while reducing the computation time and memory usage. The GHT algorithm includes the following two steps:

(1) Transformation between spatial planar coordinates and polar coordinates. The equation of the plane in the polar coordinate system is:

$$\rho = x \cos\theta \cos\varphi + y \sin\theta \cos\varphi + z \sin\varphi \quad (14)$$

where $\theta$ represents the angle between the projection of the normal vector $n$ on the plane $xoy$ and the x-axis, $\varphi$ represents the angle between the normal vector $n$ and its projection on the plane $xoy$, and $\rho$ represents the distance from the origin O to the plane. The equations for converting the unit normal vector $n = \{n_x, n_y, n_z\}$ of each point to the angles $\theta$, $\varphi$ and the distance $\rho$ are $\theta = \arctan(n_y / n_x)$, $\theta \in [0°, 360°]$; $\varphi = \arcsin(n_z)$, $\varphi \in [-90°, 90°]$; and $\rho = d$.

(2) Division of the voting space. First, we set the segment widths of $\rho$, $\theta$ and $\varphi$ as Width, Theta and Phi, respectively, and divide the ranges of $\rho$, $\theta$ and $\varphi$ in the parameter space evenly as $D = (d_{max} - d_{min}) / Width$, $T = 360 / Theta$, and $P = 180 / Phi$, where $d_{max}$ and $d_{min}$ represent the maximum and minimum distances from the points to the plane, respectively. Then, a 3D cumulative array $Vote(\rho, \theta, \varphi)$ of size $D ∗ T ∗ P$ is set up, and all array elements are initialized to 0. Third, the $(\rho, \theta, \varphi)$ values converted from the points of the point cluster are voted, and a threshold on the minimum vote is set to detect local peak values. Inverse mapping of a detected peak back to the image space yields the preliminary plane parameters $a$, $b$, $c$ and $d$. Given two planes $plane_1$ and $plane_2$, if $\vec{NV}_{plane_1} \cdot \vec{NV}_{plane_2} > AngleThreshold$ and $|Dist_{plane_1} - Dist_{plane_2}| < Width$, the two planes are combined to generate better planarity.
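A hedged sketch of the GHT voting step: each point votes with the $(\rho, \theta, \varphi)$ computed from its unit normal (Eq. (14)), and accumulator peaks yield candidate plane parameters. The bin widths and the vote threshold are illustrative, and the fuzzy pre-clustering and patch merging of the full FC-GHT algorithm are not shown.

```python
import numpy as np

def ght_vote(points, normals, width=0.1, theta_step=2.0, phi_step=2.0, min_votes=50):
    rho = np.einsum("ij,ij->i", points, normals)            # rho = p . n, Eq. (14)
    theta = np.degrees(np.arctan2(normals[:, 1], normals[:, 0])) % 360.0
    phi = np.degrees(np.arcsin(np.clip(normals[:, 2], -1.0, 1.0)))

    d_min, d_max = rho.min(), rho.max()
    D = max(1, int(np.ceil((d_max - d_min) / width)))        # D = (d_max - d_min) / Width
    T = int(360 / theta_step)                                # T = 360 / Theta
    P = int(180 / phi_step)                                  # P = 180 / Phi

    votes = np.zeros((D, T, P), dtype=int)
    bi = np.minimum(((rho - d_min) / width).astype(int), D - 1)
    bj = np.minimum((theta / theta_step).astype(int), T - 1)
    bk = np.minimum(((phi + 90.0) / phi_step).astype(int), P - 1)
    np.add.at(votes, (bi, bj, bk), 1)

    planes = []
    for i, j, k in zip(*np.nonzero(votes >= min_votes)):     # simplified peak detection
        t = np.radians(j * theta_step)
        p = np.radians(k * phi_step - 90.0)
        n = np.array([np.cos(t) * np.cos(p), np.sin(t) * np.cos(p), np.sin(p)])
        planes.append((n, d_min + i * width))                # (unit normal, rho)
    return planes

# toy usage: a horizontal plane z = 1 with downward-facing normals
pts = np.random.default_rng(3).uniform(size=(2000, 3)); pts[:, 2] = 1.0
nrm = np.tile([0.0, 0.0, -1.0], (len(pts), 1))
print(len(ght_vote(pts, nrm, min_votes=100)))
```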
3.3.3. Extracted plane optimization
Patch merging: The fuzzy clustering and FC-GHT algorithms were both implemented on the global KD leaf nodes, which are influenced by the threshold on the number of points in the global KD leaf nodes, as well as by the angle and distance thresholds used during the preliminary clustering. These thresholds may cause points that originally belonged to the same plane to be classified into different leaf nodes or into different clusters. When FC-GHT is implemented to extract a plane from each cluster, over-segmentation may occur; therefore, it is necessary to merge identical planes and then delete the repeated planes.

Re-judgment of coplanar points: There are two types of coplanar points. One type is the intersection points of two planes, which are real coplanar points and are also called boundary points. The other is pseudo coplanar points, which are optical illusions due to the overlap of two or more planes. Since the normal vectors of coplanar points are emanative and disordered, many coplanar points are misclassified. Therefore, the distances from the points to the planes are used rather than the normal vectors in the re-judgment process. The distance of any point $P(x_p, y_p, z_p)$ to plane A is defined as

$$D_A = |n_x x_p + n_y y_p + n_z z_p - (n_x x_0 + n_y y_0 + n_z z_0)| \quad (15)$$

where $\{n_x, n_y, n_z\}$ represents the unit normal vector of plane A, and $(x_0, y_0, z_0)$ represents the coordinates of the center of the plane. According to Eq. (15), the distances from a coplanar point C to the n planes, $D_1, D_2, \ldots, D_n$, are calculated; the two minimum distances $D_i, D_j\ (i, j \in (1, n))$ are selected, and the distance threshold between the point and the plane is set as $\xi$. If $|D_i - D_j| \geq \xi$, the distance from the point to one of the two nearest planes is relatively large, i.e., $|D_i| \gg |D_j|$ or $|D_i| \ll |D_j|$, which indicates that the point is a pseudo coplanar point belonging to the closer plane. If $|D_i - D_j| < \xi$, the distances of the point to the two closest planes are approximately equal, that is to say, this point is a boundary point, and the neighbor-point auxiliary judgment method is used to re-determine it. When judging neighboring points, we conduct a neighborhood search by taking an undetermined point as the center and R as the radius to obtain all neighboring points. Then, we count the number of neighbor points belonging to each plane according to planeflag, which indicates to which plane a point belongs, and finally classify the point into the plane with the maximum number of points and change the planeflag of this point.
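The re-judgment rule above can be summarized by the following sketch; the plane representation, the threshold $\xi$ and the sample point are illustrative, and the neighborhood vote for boundary points is only indicated.

```python
import numpy as np

def point_plane_distance(p, plane):
    n, c = plane["normal"], plane["center"]    # unit normal and plane center, Eq. (15)
    return abs(np.dot(n, p - c))

def rejudge(p, planes, xi=0.05):
    """Pseudo coplanar points go to the nearer plane; boundary points are resolved later by a neighborhood vote."""
    d = np.array([point_plane_distance(p, pl) for pl in planes])
    i, j = np.argsort(d)[:2]                   # the two closest planes
    if abs(d[i] - d[j]) >= xi:
        return ("pseudo", i)
    return ("boundary", None)

planes = [{"normal": np.array([0.0, 0.0, 1.0]), "center": np.zeros(3)},
          {"normal": np.array([1.0, 0.0, 0.0]), "center": np.array([0.5, 0.0, 0.0])}]
print(rejudge(np.array([0.2, 0.1, 0.02]), planes))
```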
4. Results and discussion

4.1. Dataset

4.1.1. Training and validation set
To validate the performance of the DVLSHR model, we used a subset of PASCAL VOC12 (Everingham et al., 2015), where training and validation were performed with 1464 and 1449 images, respectively. In the process of scene segmentation, we used the CityScapes (Cordts et al., 2016) training dataset. Classes were selected based on their frequency and relevance from our application standpoint. Classes that were too rare or too irrelevant were excluded, leaving 19 classes for evaluation: roads, sidewalks, buildings, walls, fences, poles, traffic lights, traffic signs, vegetation, terrain, sky, persons, riders, cars, trucks, buses, trains, motorcycles and bicycles. The other 14 object classes were considered as background.

4.1.2. Testing sets
The laser point clouds used in this experiment were collected from two different scenes: dataset 1 is an urban street scene in a western suburb of Zhengzhou city, and dataset 2 is a corner of our university, as shown in Table 1. The 2D images were captured with a Nikon D700 digital camera and a Nikon Nikkor 20 mm/F2.8D fixed-focus lens. The laser point clouds were collected with a Riegl VZ-400, which is a terrestrial laser scanner. The 2D images and 3D point clouds were captured synchronously.

Table 1. Description of the two testing sets.

Dataset     Number of images   Image resolution   Number of points   Size (MB)
Dataset 1   7                  2832 ∗ 4256        7,098,159          236
Dataset 2   7                  2832 ∗ 4256        5,956,090          200

4.2. Evaluation criteria
To assess the segmentation effect of each class, deep learning methods usually rely on Eqs. (16) and (17), where TP, FP and FN are the numbers of true positive, false positive and false negative pixels, respectively, determined over the whole test set. When comparison with traditional segmentation methods is required, we additionally use Eq. (18).

$$IoU = \frac{TP}{TP + FP + FN} \quad (16)$$

$$PA = \frac{TP}{TP + FP} \quad (17)$$

$$Recall = \frac{TP}{TP + FN} \quad (18)$$

The evaluation criteria used to assess the integral segmentation effect are mPA and mIoU. The mPA is the average, over all classes, of the per-class pixel accuracy, as shown in Eq. (19). The mIoU is the average, over all classes, of the ratio of the number of true positives (intersection) to the sum of true positives, false negatives and false positives (union), as shown in Eq. (20).

$$mPA = \frac{1}{k+1} \sum_{i=0}^{k} PA_i = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij}} \quad (19)$$

$$mIoU = \frac{1}{k+1} \sum_{i=0}^{k} IoU_i = \frac{1}{k+1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}} \quad (20)$$

where $k + 1$ represents the number of classes, including the background, and $p_{ij}$ represents the probability of identifying the three-dimensional target $i$ as target $j$.
4.3. DVLSHR model fine-tuning

Through a series of fine-tuning experiments with a single NVIDIA Titan X Pascal 12 GB GPU, we finally employed the CityScapes dataset down-sampled by a factor of 2 (image resolution 1024 ∗ 512 pixels). The network prototxt was set as: batch_size = 2, crop_size = 713, 20 K iterations, the "poly" learning rate policy, step_size = 2000, the cross-entropy loss function and the standard SGD optimization algorithm. The initial model was DeepLab-VGG16, a basic DeepLab-LargeFOV version that was not pre-trained on MS-COCO. The model was fine-tuned twice. Surprisingly, this yielded a performance of 74.98% mPA and 64.17% mIoU on the val set. The concrete experimental process is as follows.

4.3.1. Down-sampling factor
During the training, we randomly selected the training and validation sets from CityScapes. Then, the DVLSHR model was trained twice. For the first run, the training and validation sets contained 2975 and 100 images, respectively. For the second run, which was based on the results of the first training run, all images that were identified as poor segmentation classes constituted the new training and validation sets, which contained 2000 and 100 images, respectively. The experimental
results showed that the down-sampling factor exerted a great influence on the segmentation performance. As shown in Table 2, the performance with a down-sampling factor of 2 was significantly better than that with a factor of 4; it reached 74.98% mPA and 64.17% mIoU, an improvement of more than 10 percentage points over the factor of 4. Although we adopted a down-sampling factor, the experimental results were almost the same as those reported in the literature (Long et al., 2015; Badrinarayanan et al., 2016; Kendall et al., 2016), where the images were processed at their original resolution. After segmentation, the results were up-sampled to their original size and then mapped to their corresponding 3D point clouds.

Table 2. val set results on CityScapes for models trained with images down-sampled by factors of 2 and 4.

No.   Method   Batch_size   Iteration   Factor   mPA     mIoU
1     DVLSHR   1            20 K        2        72.18   61.26
2     DVLSHR   1            20 K        4        61.95   50.18
3     DVLSHR   1            20 K        2        74.98   64.17
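A minimal sketch of the down-/up-sampling protocol of Section 4.3.1, assuming OpenCV is available: images are resized bilinearly, while label maps use nearest-neighbor interpolation so that class ids are preserved; file handling is omitted and the target sizes are those of CityScapes.

```python
import cv2  # assumed available; any image library with a resize function would do

def downsample_pair(image, label, factor=2):
    """Down-sample an image and its ground-truth label map by the same factor."""
    h, w = image.shape[:2]
    size = (w // factor, h // factor)
    img_small = cv2.resize(image, size, interpolation=cv2.INTER_LINEAR)
    lbl_small = cv2.resize(label, size, interpolation=cv2.INTER_NEAREST)  # keep class ids intact
    return img_small, lbl_small

def upsample_prediction(pred, original_size=(2048, 1024)):
    """Up-sample a predicted label map back to the original CityScapes resolution."""
    return cv2.resize(pred, original_size, interpolation=cv2.INTER_NEAREST)
```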
4.3.2. Batch_size
As shown in rows 3 and 1 of Table 2, when the batch_size was increased from 1 to 2, with the same number of iterations and a down-sampling factor of 2, the segmentation results improved by 0.9% in mPA and 1.1% in mIoU.

4.3.3. Step_size
As long as the gradient descent method is used to solve the optimization, there is a learning rate, also called the step length. The base_lr parameter sets the basic learning rate, which is adjusted during the iterations according to the policy set by lr_policy. Because lr_policy was set to "step", we also needed to set the step_size. We explored different step_size values when fine-tuning the model. Our experiments showed that the step_size had a direct influence on the stability of the accuracy. Fig. 3 shows the results of the training process for different values of step_size on PASCAL VOC12, using DeepLabV1-VGG16. The results show that the accuracy is better with step_size = 2000 than with step_size = 200: the loss converges to a smaller and steadier value with step_size = 2000, while the loss with step_size = 200 oscillates constantly.

Fig. 3. Performance results for different step_size values. Top row: step_size = 2000. Bottom row: step_size = 200.
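For clarity, the "step" learning-rate policy referred to above is assumed here to follow the usual Caffe-style semantics: the base learning rate is multiplied by a factor gamma every step_size iterations. The values below are illustrative only.

```python
def step_lr(iteration, base_lr=0.001, gamma=0.1, step_size=2000):
    """Caffe-style "step" policy (assumed): lr = base_lr * gamma ** floor(iteration / step_size)."""
    return base_lr * (gamma ** (iteration // step_size))

for it in (0, 1999, 2000, 4000, 6000):
    print(it, step_lr(it))
```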
Table 3. Quantitative results on Cityscapes. The first block lists results from the benchmark models on the CityScapes web site and indicates the training data used for each model (train fine, val fine and external data). The second block lists the results of our experiments, where the fine-tuned DVLSHR model is compared with the original DeepLabV2-VGG16 model. Factor: down-sampling factor; Overall: percentage of pixels correctly labeled overall.

Method                        Train fine   Val fine   External data              Factor   mIoU    mPA     Overall
FCN-8s                        ✓            ✓          ImageNet, Pascal Context   No       65.3    n/a     n/a
FCN-8s                        ✓            ✓          ImageNet, Pascal Context   2        61.9    n/a     n/a
FCN-8s                        ✓            ✓          ImageNet                   4        57.0    n/a     n/a
DeepLabV2-VGG16               ✓            no         no                         2        59.08   71.79   88.65
DVLSHR (first fine-tuning)    ✓            no         no                         2        60.68   72.29   89.03
DVLSHR (second fine-tuning)   ✓            no         no                         2        64.17   74.98   89.91
Table 4. PA of each class (%). val set results on Cityscapes (100 images), comparing our fine-tuned DVLSHR model with the original DeepLabV2-VGG16. Side: sidewalk, Buil: building, Tral: traffic light, Tras: traffic sign, Vege: vegetation, Terr: terrain, Pers: person, Motc: motorcycle, Bicy: bicycle; first: first fine-tuned results, second: second fine-tuned results.

Method          Road    Side    Buil    Wall    Fence   Pole    Tral    Tras    Vege    Terr    Sky     Pers    Ride    Car     Truc    Bus     Train   Motc    Bicy
Original        97.56   72.98   92.38   47.56   64.32   44.32   54.87   63.55   94.38   78.62   93.05   77.75   55.07   94.55   69.92   72.15   59.07   66.19   65.70
Ours (first)    97.65   73.31   92.85   46.51   66.67   45.10   56.47   66.22   94.61   81.83   93.05   79.69   57.72   94.53   78.34   71.02   45.83   64.73   67.32
Ours (second)   97.95   77.68   92.97   48.24   67.15   46.03   59.33   69.13   94.75   80.86   93.84   79.78   60.63   94.67   63.97   71.42   88.56   71.69   66.02
Table 5. IoU of each class (%). val set results of Cityscapes (100 images), comparing our fine-tuned DVLSHR model with the original DeepLabV2-VGG16.

Method          Road    Side    Buil    Wall    Fence   Pole    Tral    Tras    Vege    Terr    Sky     Pers    Ride    Car     Truc    Bus     Train   Motc    Bicy
Original        90.80   62.67   83.44   31.85   47.91   34.60   36.12   50.93   87.44   60.88   86.19   60.84   41.66   86.40   52.81   55.18   55.37   50.21   47.25
Ours (first)    90.99   64.04   84.08   32.95   50.25   35.49   41.97   54.78   87.97   62.16   86.72   63.62   44.89   86.79   61.95   59.64   43.91   50.68   50.08
Ours (second)   92.27   68.19   84.62   34.48   53.18   35.75   44.80   57.23   88.37   64.57   87.83   64.60   46.96   87.51   55.55   66.98   80.81   53.72   51.75
4.3.4. Qualitative results
We compared the DVLSHR model with FCN-8s, SegNet and DeepLabV2-VGG16 on the CityScapes dataset, as listed in Table 3. To obtain further baseline results, we list the results from the benchmark models on the CityScapes web site in the first block of Table 3. The web site asked the selected groups that proposed state-of-the-art semantic labeling approaches to optimize their methods, and evaluated their predictions on the CityScapes test set. In order to optimize the segmentation results, FCN-8s and SegNet were trained twice, first on the CityScapes training set (train) with fine annotations until the performance on the validation set (val) saturated, and then on train + val and external data, i.e. ImageNet and Pascal Context. For comparison, the first row lists the results without down-sampling, while the second row lists those with a down-sampling factor of 2. We found that the mIoU obtained with full-resolution training was better than that with the down-sampled images: 65.3% and 61.9%, respectively. In our experiments, it was a challenge to train deeper networks with our limited GPU memory, so we down-sampled the images and their corresponding labels by a factor of 2, and then by a factor of 4. We compared the DVLSHR model (including the first and second fine-tuned results) with the original DeepLabV2-VGG16 on the val set, as shown in the second block of Table 3. During the training process, we only used the train set, without the val set and external data. Our DVLSHR model brought improvements of 1.60% and 5.09% in mIoU after the first and the second fine-tuning runs, and of 0.50% and 3.09% in mPA, respectively. Compared with SegNet, our model improved the accuracy by almost 3%, which is similar to FCN-8s trained with train + val + external data.

Table 4 shows the pixel accuracy of each class. After the two fine-tuning runs, the pixel accuracy of almost all classes improved. The PA values of the sidewalks, traffic lights, traffic signs, riders and trains increased dramatically, from 72.98%, 54.87%, 63.55%, 55.07% and 59.07% to 77.68%, 59.33%, 69.13%, 60.63% and 88.56%, respectively. The same dramatic increase in the IoU values of each class is shown in Table 5: the IoU values of the sidewalks, fences, traffic lights, traffic signs, riders and trains increased from 62.67%, 47.91%, 36.12%, 50.93%, 41.66% and 55.37% to 68.19%, 53.18%, 44.80%, 57.23%, 46.96% and 80.81%, respectively. After fine-tuning with a down-sampling factor of 2 and batch_size = 2, the model yielded 74.98% mPA and 64.17% mIoU on the CityScapes val set, without pre-training on MS-COCO and without a CRF. We visualized the results in Fig. 4.

4.4. Segmentation results on the test sets
In this section, we visualize the segmentation results on the two testing sets. Fig. 5 shows the test results on dataset 1, while Fig. 6 shows the results on dataset 2. The segmentation results show that the 3D scenes are segmented well by the proposed model, where roads, trees, cars and buildings are extracted accurately with clear boundaries. In Fig. 6, purple indicates roads, fuchsia sidewalks, blue cars, red persons, olive drab trees, pale green lawns, light gray traffic lights, and dark gray buildings, dark sky, manhole covers and other un-segmented objects. The results show that most of the classes are segmented well, such as roads, sidewalks, trees, lawns, traffic lights, cars, sky and buildings. However, for buildings, only the outlines are segmented; the local structures need to be further refined using the algorithm described in Section 3.3.
Fig. 4. Segmentation results on the CityScapes val set. First row: initial images; second row: ground truth; third row: segmentation results.
Fig. 5. Partial segmentation results on dataset 1.

4.5. Mapping results
During the mapping process, we used the two testing sets listed in Table 1. According to the mapping method presented in Section 3.2, the semantic segmentation results were mapped to the 3D point clouds. The mapping results are shown in Figs. 7 and 8. The mapping results from the 2D images to the 3D point clouds show that most classes are segmented well, such as roads, sidewalks, trees, cars and buildings. However, only the outlines of each class are segmented, without fine features. For buildings, the main structures in 3D outdoor scenes, it is insufficient to merely extract outlines. Therefore, based on the mapping results, we refined the features of the buildings directly with the 3D point clouds. By enlarging the mapping results, shown in Fig. 9(c), it can be seen that some buildings in dataset 1 are occluded by dense and tall vegetation; hence, the building points are very sparse. For the physical plane extraction of buildings, we therefore used dataset 2. As shown by the red rectangles in Fig. 8(b), there are several buildings with dense points; in these cases, low and sparse vegetation covers the buildings lightly or only partially.

4.6. Physical plane extraction of buildings based on FC-GHT
To further extract sophisticated features of buildings, thresholds are needed. For example, in the process of preliminary plane extraction, we set the intersection angle threshold of the normal vectors and the distance threshold from a point to a plane. For the GHT, the threshold of the minimum vote was set to detect the local peak value. In Section 3.3.3, the intersection angle threshold of the normal vectors and the distance threshold were set again for similar patch merging. Threshold selection relies heavily on the prior knowledge of the researchers, so its robustness is poor. The extraction results of the buildings labeled 1 and 2 in Fig. 8 are shown in Fig. 10. Note that the color of every part of the buildings is produced randomly. Due to the different distances between the terrestrial laser scanner and the different buildings, the feature extraction effects also differ; for short distances, the effect is better. Building 1 is nearest, so its windows are extracted cleanly. Building 2 is farther away and its points are much sparser than those of building 1, so some windows are extracted well while others are not. The whole effect of the 3D scene is shown in Fig. 11. To assess the performance of the FC-GHT algorithm, we relied on Recall, which was introduced in Section 4.2. For dataset 2, according to the values of PA and IoU listed in Section 4.3.4, we computed a recall of 90.4%; this value increased to 99.7% after refining, as shown in Table 6. Additionally, we compared this method with traditional segmentation methods for building facades, reproduced from the corresponding papers (Li et al., 2011, 2014, 2015b).
Fig. 6. Semantic segmentation results with dataset 2.
Fig. 7. Comparison between the mapping result and the 3D scene (Dataset 1).
Fig. 8. Comparison between the mapping result and the 3D scene (Dataset 2).
Fig. 9. Panorama of mapping results.
Table 6. Performance comparison with traditional segmentation methods.

Method                              Recall (%)
RANSAC (Li et al., 2011)            79.93
Region growing (Li, 2014)           68.66
Dynamic clustering (Li, 2014)       74.86
Standard HT (Li et al., 2015b)      65.27
Ours (preliminary segmentation)     90.40
Ours (refined segmentation)         99.71
Fig. 10. Feature extraction based on point clouds.
4.7. Discussion
Compared with 2D images, 3D point clouds can provide much more information. For example, semantic scene comprehension with point clouds can naturally localize the 3D coordinates of objects in the scene, which provides crucial information for subsequent tasks like navigation or manipulation. However, if semantic scene segmentation algorithms are performed directly on raw 3D point clouds, the computational burden is large because of the massive size of the point clouds, which can easily exceed the memory limit of desktop computers; as the number of points increases, memory occupancy and time consumption both grow rapidly. Our proposed fusion algorithm can solve this problem effectively. During the mapping process, the main tasks are matrix computation and coordinate transformation. The algorithm complexity is O(n), where n denotes the number of points. For dataset 2 used in Section 4 (which consists of seven images and 5,956,090 points), the mapping took only 2.6 s. In the process of physical plane extraction, based on the preliminary segmentation results, the FC-GHT algorithm is performed only on the 3D point clouds of the buildings, which greatly reduces the computational burden; taking dataset 2 as an example, 8.678 s were spent. The segmentation was run on a PC with an Intel Core i5 2.6 GHz CPU and 8.0 GB of RAM.
Fig. 11. The whole semantic segmentation result of a 3D scene.
5. Conclusions

For large-scale 3D urban scene segmentation, most classical models based on deep convolutional neural networks are mainly used for 2D images and can only segment global outlines. Traditional refinement approaches are mostly performed directly on raw 3D point clouds, which can extract local features but carry a high computational burden. In addition, digital cameras can easily be used to capture large-scale scene data, while, compared with cameras, 3D laser scanners provide a wider field of view and are insensitive to changes in lighting conditions. Motivated by these two data capturing styles and two kinds of segmentation methods, this study fused 2D images and 3D point clouds to segment complex 3D urban scenes. There were five steps: (1) data preparation, which included benchmark datasets and our own datasets; the former were used to train and fine-tune our DVLSHR model, and the latter were used as testing sets to evaluate the efficiency of the proposed models. (2) Data preprocessing, where we down-sampled the images to 1024 ∗ 512 pixels; if the GPU memory were sufficient, the images could retain their initial resolution. (3) DVLSHR model fine-tuning, where, according to the methodology in Section 3.1, the network parameters were set and the model was fine-tuned step by step. (4) Mapping from the 2D segmentation results to the 3D point clouds, where, based on DLT, we first calculated the exterior orientation of the cameras and then mapped the multi-view images to their corresponding point clouds. (5) Physical plane extraction, where the mapping results of the buildings were refined with our proposed FC-GHT algorithm. The semantic segmentation results and experimental analyses show that the DVLSHR model is well suited to large-scale scenes and high-resolution images, and that the FC-GHT algorithm used for refinement further extracts the fine features of buildings efficiently.

An important issue with our proposed method is that the interior orientation of the camera must be known in order to map the 2D images to the 3D point clouds; the exterior orientation of the camera is then calculated based on the DLT algorithm to achieve the overall mapping between the multi-view images and their corresponding point clouds. However, the only currently available benchmark dataset for semantic segmentation that includes both 2D and 3D data is the Stanford 2D-3D-S dataset, which is an indoor dataset in the form of 360-degree panoramas and does not provide interior orientation elements; therefore, the mapping cannot yet be realized on this dataset. Moreover, the testing datasets we captured do not have label information, so in this work we compared the performance of the DVLSHR model with other state-of-the-art methods only on the CityScapes val set. In addition, the mapping also affects the refinement results. Based on the above analysis, it is worthwhile in the future to propose a novel 3D semantic segmentation model that directly consumes raw 3D point clouds, which would effectively avoid the mapping process.

Acknowledgements

This study was undertaken with financial support of the National Natural Science Foundation of China (NSFC) (Grant Nos. 41501491 and 61601184) and the Key Science Research Program of Higher Education of Henan Province, China (No. 16A520062).

References

Abdel-Aziz, Y.I., Karara, H.M., 2015. Direct linear transformation from comparator coordinates into object space coordinates in close-range photogrammetry. Photogram. Eng. Rem. Sens. 81 (1), 103–107.
Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L., 2014. Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFS, arXiv preprint arXiv:1412.7062. Chen, L.C. Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L., 2017. Deeplab: Semantic Image Segmentation with Deep Convolutional Nets, ATROUS Convolution, and Fully Connected CRFS, arXiv preprint arXiv:1606.00915v2. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B., 2016. The cityscapes data set for semantic urban scene understanding. Comp. Vis. Pattern Recog. IEEE 3213–3223. Deng, F., Zhang, Z.X., Zhang, J.Q., 2007. A method of registration between laser scanning data and digital images. Geomat. Inf. Sci. Wuhan Univ. 32 (4), 290–292. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A., 2015. Int. J. Comp. Vis. 111 (1), 98–136. Fang, W., 2014. Research on Automatic Texture Mapping of Terrestrial Laser Scanning Data Combining Photogrammetry Technique. Wuhan University, Wuhan. Garcia-Garcia, A., Orts-Escolano, S., Oprea, S.O., Villena-Martinez, V., Garcia-Rodriguez, J., 2017. A Review on Deep Learning Techniques Applied to Semantic Segmentation. arXiv:1704.06857v1. Guo, Y.L, 2017. Deep Feature Expression of 3D Data. Classroom of Deep Learning < http://mp.weixin.qq.com/s/g9ANliOMLJalJtpt4YCZVw > . Hackel, T., Wegner, J.D., Schindler, K., 2016. Fast semantic segmentation of 3d point clouds with strongly varying density. ISPRS Ann. Photogram., Rem. Sens. Spat. Inf. Sci. III-3, 177–184. http://dx.doi.org/10.5194/isprsannals-III-3-177-2016. He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Josep, M.B., Jose, L.L., 2008. Unsupervised robust planar segmentation of terrestrial laser scanner point clouds based on fuzzy clustering method. ISPRS J. Photogram. Rem. Sens. 63, 84–98. Kalogerakis, E., Averkiou, M., Maji, S., Chaudhuri, S., 2016. 3D Shape Segmentation with Projective Convolutional Networks. Computer Vision and Pattern Recognition (CVPR 2015). Boston, MA, 8–10 June, arXiv preprint arXiv:1612.02808. Kendall, A., Badrinarayanan, V., Cipolla, R., 2016. Bayesian SEGNET: Model Uncertainty in Deep Convolutional Encoderdecoder Architectures for Scene Understanding, arXiv preprint arXiv:1511.02680v2. Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105. Lecun, Y., Bottou, L., Bengio, Y., Haffner, P., 1998. Gradient-based learning applied to document recognition. In: Proceedings of the IEEE, (on CD-ROM). Lecun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature. 521, 436–444. http://dx. doi.org/10.1038/Nature14539. Li, M.D., Jiang, S.P., Wang, H.P., 2015a. A RANSCA-based stable plane fitting method of point clouds. Sci. Survey. Map. 40, 102–106. Li, M.L., 2014. Technology of Preprocessing on 3D Laser Scanned Point Clouds. Master D. Dissertation of PLA Information Engineering University. Li, M.L., Li, G.Y., Wang, L., Li, H.B., Fan, Z.R., 2015b. Automatic feature detecting from point clouds using 3D hough transform. Bull. Survey. Map. (2), 29–33. Li, M.L., Gao, X.Y., Li, G.Y., Wang, L., Liu, S.L., 2016a. High accuracy calibration of installation parameters between 3D terrestrial laser scanner and external-installed digital camera. Optical Prec. Eng. 24, 2158–2166. 
The semantic segmentation results and the experimental analyses show that the DVLSHR model is well suited to large-scale scenes and high-resolution images, and that the FC-GHT algorithm used for refinement further extracts the fine features of buildings efficiently.
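The FC-GHT refinement itself is specific to this work and is not reproduced here. Purely as an illustration of plane-based refinement in the spirit of step (5), the sketch below fits a dominant plane to the building points returned by the mapping with a generic RANSAC estimator; it is not the FC-GHT algorithm, and the function ransac_plane and its thresholds are assumptions for the example.

```python
import numpy as np

def ransac_plane(points, n_iter=500, dist_thresh=0.05, rng=None):
    """Fit one dominant plane n.x + d = 0 to a point set with RANSAC.
    Returns (normal, d, inlier_mask). A stand-in for illustration only;
    the paper's FC-GHT refinement is a different algorithm."""
    rng = np.random.default_rng(rng)
    best_mask, best_model = None, None
    for _ in range(n_iter):
        sample = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-9:                      # degenerate (collinear) sample
            continue
        normal /= norm
        d = -normal @ sample[0]
        dist = np.abs(points @ normal + d)   # point-to-plane distances
        mask = dist < dist_thresh
        if best_mask is None or mask.sum() > best_mask.sum():
            best_mask, best_model = mask, (normal, d)
    if best_model is None:
        raise ValueError("no valid plane hypothesis found")
    return best_model[0], best_model[1], best_mask
```

Running such an estimator repeatedly on the remaining points yields the main planar faces of a building.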
An important issue with the proposed method is that the interior orientation of the camera must be known before the 2D images can be mapped to the 3D point clouds; the exterior orientation of the camera is then calculated with the DLT algorithm to achieve the overall mapping between the multi-view images and their corresponding point clouds.
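The exterior orientation step can be illustrated with the classic DLT. A minimal sketch, assuming at least six well-distributed, non-coplanar control points with known scanner coordinates and image coordinates, recovers the 3 ∗ 4 projection matrix by solving the homogeneous DLT system with an SVD; with the interior orientation known, the exterior orientation then follows from the projection matrix up to scale. The function dlt_projection_matrix is illustrative only, not the paper's implementation.

```python
import numpy as np

def dlt_projection_matrix(points_3d, points_2d):
    """Classic DLT: recover a 3x4 projection matrix (up to scale) from
    >= 6 non-coplanar 3D control points and their image coordinates by
    solving the homogeneous system A p = 0 with an SVD."""
    assert len(points_3d) == len(points_2d) >= 6
    rows = []
    for (X, Y, Z), (u, v) in zip(points_3d, points_2d):
        rows.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z, -u])
        rows.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z, -v])
    _, _, vt = np.linalg.svd(np.asarray(rows))
    P = vt[-1].reshape(3, 4)              # right singular vector of the smallest singular value
    return P / np.linalg.norm(P[2, :3])   # fix the scale so the last row has a unit rotation part
```

In practice the linear estimate would be followed by a nonlinear refinement of the orientation parameters, and the control points should be measured on well-defined features visible in both the image and the point cloud.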
However, the only currently available benchmark dataset for semantic segmentation that includes both 2D and 3D data is the Stanford 2D-3D-S dataset. This dataset, however, covers indoor spaces in the form of 360-degree panoramas and does not provide interior orientation elements, so the mapping cannot yet be realized for it. Moreover, the test datasets we captured do not have label information; in this work, we therefore compared the performance of the DVLSHR model with other state-of-the-art methods only on the CityScapes val set. In addition, the quality of the mapping also affects the refinement results. Based on the above analysis, a worthwhile direction for future work is a 3D semantic segmentation model that directly consumes raw 3D point clouds, which would avoid the mapping process altogether.
Acknowledgements
This study was undertaken with the financial support of the National Natural Science Foundation of China (NSFC) (Grant Nos. 41501491 and 61601184) and the Key Science Research Program of Higher Education of Henan Province, China (No. 16A520062).
References
Abdel-Aziz, Y.I., Karara, H.M., 2015. Direct linear transformation from comparator coordinates into object space coordinates in close-range photogrammetry. Photogram. Eng. Rem. Sens. 81 (1), 103–107.
Badrinarayanan, V., Kendall, A., Cipolla, R., 2016. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. arXiv preprint arXiv:1511.00561v3.
Berthold, K.P.H., 1984. Extended Gaussian images. Proc. IEEE 72 (12), 1671–1686.
Bu, S., Liu, Z., Han, J., Wu, J., Ji, R., 2014. Learning high-level feature by deep belief networks for 3-D model retrieval and recognition. IEEE TMM 16 (8), 2154–2167.
Burochin, J.P., Vallet, B., Bredif, M., Paparoditis, N., 2014. Detecting blind building facades from highly overlapping wide angle aerial imagery. ISPRS J. Photogram. Rem. Sens. 96, 193–209. http://dx.doi.org/10.1016/j.isprsjprs.2014.07.011.
Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L., 2014. Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. arXiv preprint arXiv:1412.7062.
Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L., 2017. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. arXiv preprint arXiv:1606.00915v2.
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B., 2016. The Cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223.
Deng, F., Zhang, Z.X., Zhang, J.Q., 2007. A method of registration between laser scanning data and digital images. Geomat. Inf. Sci. Wuhan Univ. 32 (4), 290–292.
Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A., 2015. The PASCAL Visual Object Classes Challenge: a retrospective. Int. J. Comp. Vis. 111 (1), 98–136.
Fang, W., 2014. Research on Automatic Texture Mapping of Terrestrial Laser Scanning Data Combining Photogrammetry Technique. Wuhan University, Wuhan.
Garcia-Garcia, A., Orts-Escolano, S., Oprea, S.O., Villena-Martinez, V., Garcia-Rodriguez, J., 2017. A Review on Deep Learning Techniques Applied to Semantic Segmentation. arXiv preprint arXiv:1704.06857v1.
Guo, Y.L., 2017. Deep Feature Expression of 3D Data. Classroom of Deep Learning. <http://mp.weixin.qq.com/s/g9ANliOMLJalJtpt4YCZVw>.
Hackel, T., Wegner, J.D., Schindler, K., 2016. Fast semantic segmentation of 3D point clouds with strongly varying density. ISPRS Ann. Photogram., Rem. Sens. Spat. Inf. Sci. III-3, 177–184. http://dx.doi.org/10.5194/isprsannals-III-3-177-2016.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
Josep, M.B., Jose, L.L., 2008. Unsupervised robust planar segmentation of terrestrial laser scanner point clouds based on fuzzy clustering method. ISPRS J. Photogram. Rem. Sens. 63, 84–98.
Kalogerakis, E., Averkiou, M., Maji, S., Chaudhuri, S., 2016. 3D Shape Segmentation with Projective Convolutional Networks. arXiv preprint arXiv:1612.02808.
Kendall, A., Badrinarayanan, V., Cipolla, R., 2016. Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding. arXiv preprint arXiv:1511.02680v2.
Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105.
Lecun, Y., Bottou, L., Bengio, Y., Haffner, P., 1998. Gradient-based learning applied to document recognition. In: Proceedings of the IEEE (on CD-ROM).
Lecun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature 521, 436–444. http://dx.doi.org/10.1038/Nature14539.
Li, M.D., Jiang, S.P., Wang, H.P., 2015a. A RANSAC-based stable plane fitting method of point clouds. Sci. Survey. Map. 40, 102–106.
Li, M.L., 2014. Technology of Preprocessing on 3D Laser Scanned Point Clouds. Master's Dissertation, PLA Information Engineering University.
Li, M.L., Li, G.Y., Wang, L., Li, H.B., Fan, Z.R., 2015b. Automatic feature detecting from point clouds using 3D Hough transform. Bull. Survey. Map. (2), 29–33.
Li, M.L., Gao, X.Y., Li, G.Y., Wang, L., Liu, S.L., 2016a. High accuracy calibration of installation parameters between 3D terrestrial laser scanner and external-installed digital camera. Optical Prec. Eng. 24, 2158–2166.
Li, N., Ma, Y.W., Tang, Y., Gao, S.L., 2011. Segmentation of building facade point cloud using RANSAC. Sci. Survey. Map. 36 (5), 144–146.
Li, Y., Pirk, S., Su, H., Qi, C.R., Guibas, L.J., 2016b. FPNN: field probing neural networks for 3D data. NIPS 307–315.
Liu, Y., Wang, F., Dobaie, A.M., 2017. Comparison of 2D image models in segmentation performance for 3D laser point clouds. Neurocomputing (in press). http://dx.doi.org/10.1016/j.neucom.2017.04.030.
Long, J., Shelhamer, E., Darrell, T., 2015. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.
Murase, H., Nayar, S.K., 1995. Visual learning and recognition of 3-D objects from appearance. Int. J. Comp. Vis. 14 (1), 5–24.
Nalani, H.A., Maas, H.G., 2012. Automatic building facade detection in mobile laser scanner point clouds. In: The German Society for Photogrammetry, Remote Sensing and Geoinformation (DGPF), Potsdam, Germany (on CD-ROM).
Nalani, H.A., Sanka Nirodha, P., Maas, H.G., 2012. Automatic processing of mobile laser scanner point clouds for building facade detection. In: International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. XXXIX-B5, XXII ISPRS Congress, 25 August–01 September, Melbourne, Australia.
Petar, V., Igor, K., Marko, R.S., 2015. Obtaining structural descriptions of building facades. Comp. Sci. Inf. Syst. 13 (1), 23–43.
Qi, C.R., Su, H., Mo, K., Guibas, L.J., 2016a. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. arXiv preprint arXiv:1612.00593.
Qi, C.R., Su, H., Niessner, M., Dai, A., Yan, M., Guibas, L.J., 2016b. Volumetric and multi-view CNNs for object classification on 3D data. In: Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, Nevada, 26 June–1 July, pp. 5648–5656. http://dx.doi.org/10.1109/CVPR.2016.609.
Shi, B., Bai, S., Zhou, Z., Bai, X., 2015. DeepPano: deep panoramic representation for 3-D shape recognition. IEEE Sig. Process. Lett. 22 (12), 2339–2343. http://dx.doi.org/10.1109/LSP.2015.2480802.
Simonyan, K., Zisserman, A., 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556.
Sinha, A., Bai, J., Ramani, K., 2016. Deep learning 3D shape surfaces using geometry images. In: Computer Vision – ECCV 2016. Springer International Publishing, pp. 223–240.
Song, X.Y., Yuan, S., Guo, H.B., Liu, J.F., 2014. Pattern identification algorithm with adaptive threshold interval based extended Hough transform. Chin. J. Scient. Instrum. 35 (5), 1109–1117.
Su, H., Maji, S., Kalogerakis, E., Learned-Miller, E., 2015. Multi-view convolutional neural networks for 3D shape recognition. In: Proceedings of ICCV, pp. 945–953. http://dx.doi.org/10.1109/ICCV.2015.114.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., 2015. Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9.
Vinyals, O., Bengio, S., Kudlur, M., 2015. Order Matters: Sequence to Sequence for Sets. arXiv preprint arXiv:1511.06393.
Wang, Y.M., Hu, C.M., 2012. A robust registration method for terrestrial LiDAR point clouds and texture image. Acta Geodaet. Cartograph. Sin. 41 (2), 266–272.
Wei, Z., Yang, B.S., Li, Q.Q., 2012. Automated extraction of building footprints from mobile LiDAR point clouds. J. Rem. Sens. 16 (2), 286–296.
Wu, J., Zhang, C., Xue, T., Freeman, W.T., Tenenbaum, J.B., 2016. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In: Advances in Neural Information Processing Systems.
Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., 2015. 3D ShapeNets: a deep representation for volumetric shapes. In: Computer Vision and Pattern Recognition (CVPR 2015), Boston, MA, 8–10 June, pp. 1912–1920.
Xu, J.Z., Kou, Y., Yuan, F., Zhang, W., 2013. Auto-registration of aerial imagery and airborne LiDAR data based on structure feature. Infrared Laser Eng. 42 (12), 3502–3508.
Xu, Q., Shou, H., Zhu, S.L., 2000. Modern Photogrammetry. The People's Liberation Army Press, Beijing.
Xu, X., Corrigan, D., Dehghani, A., Caulfield, S., Moloney, D., 2016. 3D object recognition based on volumetric representation using convolutional neural networks. In: International Conference on Articulated Motion and Deformable Objects. Springer International Publishing, pp. 147–156. http://dx.doi.org/10.1007/978-3-319-41778-3_15.
Yan, J.F., 2014. Research on Terrestrial LiDAR Point Cloud Data Registration and Fusion Method of Point Cloud and Images. China University of Mining & Technology, Xuzhou.
Yang, B.S., Wei, Z., Li, Q.Q., Li, J., 2012. Automated extraction of street-scene objects from mobile lidar point clouds. Int. J. Rem. Sens. 33 (18), 5839–5861.
Yang, B.S., Dong, Z., Wei, Z., Fang, L.N., Li, H.W., 2013a. Extracting complex building facades from mobile laser scanning data. Acta Geodaet. Cartograph. Sin. 42 (3), 411–417.
Yang, B.S., Wei, Z., Li, Q.Q., Li, J., 2013b. Semiautomated building facade footprint extraction from mobile LiDAR point clouds. IEEE Geosci. Rem. Sens. Lett. 10 (4), 766–770.
Yao, J.L., Zhang, D.F., 2005. Improvement of the direct-solution of spatial-resection. J. Shandong Univ. Technol. (Sci & Tech) 19 (2), 6–9.
Yao, J.L., Sun, Y.T., Wang, S.G., 2006. The direct-solution of spatial-resection based on Rodrigues matrix. J. Shandong Univ. Technol. (Sci & Tech) 20 (2), 36–39.
Zhang, R., Li, G.Y., Wang, L., Zhou, Y.L., 2017. New method of hybrid index for mobile LiDAR point cloud data. Geomat. Inf. Sci. Wuhan Univ. (11), 1–7. http://dx.doi.org/10.13203/j.whugis20160441. http://kns.cnki.net/kcms/detail/42.1676.TN.20171109.1531.002.html.
Zhou, F., Yang, C., Wang, C.G., Wang, B.Q., Liu, J., 2013. Circle detection and its number identification in complex condition based on random Hough transform. Chin. J. Scient. Instrum. 34 (3), 622–628.