
Journal Pre-proof

Effective Piecewise Planar Modeling Based on Sparse 3D Points and Convolutional Neural Network
Wei Wang, Wei Gao, Zhanyi Hu

PII: S0925-2312(19)31402-X
DOI: https://doi.org/10.1016/j.neucom.2019.10.026
Reference: NEUCOM 21371
To appear in: Neurocomputing
Received date: 28 June 2018
Revised date: 31 May 2019
Accepted date: 13 October 2019

Please cite this article as: Wei Wang , Wei Gao , Zhanyi Hu , Effective Piecewise Planar Modeling Based on Sparse 3D Points and Convolutional Neural Network, Neurocomputing (2019), doi: https://doi.org/10.1016/j.neucom.2019.10.026

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2019 Elsevier B.V. All rights reserved.

Highlights
1. An effective plane assignment cost based on sparse 3D points, scene structure priors and high-level image features obtained by a Convolutional Neural Network for modeling urban scenes is proposed.
2. A new piecewise planar stereo method that jointly optimizes image regions and their associated planes is proposed. The method can effectively reconstruct urban scenes from only sparse 3D points.
3. The problems (inaccurate image over-segmentation, incomplete candidate planes, and unreliable regularization) commonly existing in traditional piecewise planar stereo methods are effectively resolved.

Effective Piecewise Planar Modeling Based on Sparse 3D Points and Convolutional Neural Network

Wei Wang*, Wei Gao, Zhanyi Hu
* Corresponding author. E-mail: [email protected]

Abstract: Piecewise planar stereo methods can approximately reconstruct the complete structures of a scene by overcoming challenging difficulties (e.g., poorly textured regions) that pixel-level stereo methods cannot resolve. In this paper, a novel plane assignment cost is first constructed by incorporating scene structure priors and high-level image features obtained by a Convolutional Neural Network (CNN). Then, the piecewise planar scene structures are reconstructed in a progressive manner that jointly optimizes image regions (or superpixels) and their associated planes, followed by a global plane assignment optimization under a Markov Random Field (MRF) framework. Experimental results on a variety of urban scenes confirm that the proposed method can effectively reconstruct the complete structures of a scene from only sparse three-dimensional (3D) points with high efficiency and accuracy, and can achieve superior results compared with state-of-the-art methods.

Keywords: urban scene, piecewise planar stereo, Markov Random Field, image over-segmentation, Convolutional Neural Network

1 Introduction

Piecewise planar stereo methods can approximately reconstruct the complete structures of a scene, where higher-level planarity priors significantly help overcome the challenging difficulties (e.g., poorly textured regions) that pixel-level stereo methods cannot resolve. In general, piecewise planar stereo methods have three basic steps: (1) over-segmenting the image into several non-overlapping regions (i.e., superpixels); (2) generating candidate planes from initial data (e.g., three-dimensional (3D) points); (3) assigning the optimal plane to each superpixel using a global method to produce the piecewise planar model of the scene. In fact, such methods can be unreliable and inefficient for the following reasons.

(1) It can be difficult to produce complete candidate planes from initial sparse or dense 3D points, which can lead to larger errors in modeling scenes. As indicated in Figure 1(c), only three candidate planes (including two reliable planes) are generated from the initial sparse 3D points using the state-of-the-art multi-model fitting method, and they are not sufficient to describe the initial scene structures. As a result, as indicated in Figures 1(d) and 1(f), the current superpixel indicated in Figures 1(a) and 1(b) is assigned a false plane because the real plane is not among the candidate planes.

(2) In assigning the optimal plane to the current superpixel, the plane assignment cost is frequently constructed based on low-level image features (e.g., gray values), 3D point visibility constraints and the assumption that two neighboring superpixels with similar features lie on the same plane. However, in specific cases, low-level image features are not sufficiently robust to overcome interferences (e.g., matching ambiguity). Moreover, two planes associated with two superpixels with similar features are not necessarily the same. These factors can also incur an unreliable scene reconstruction. For example, as indicated in Figures 1(d)-1(f), the errors are caused by forcing the two superpixels indicated in Figures 1(a) and 1(b) (whose corresponding scene patches actually lie in different planes) to be assigned the same plane from the incomplete candidate planes.

(3) It is frequently difficult to determine the optimal parameters for image over-segmentation methods based only on low-level features so as to produce superpixels consistent with scene structures. Actually, as indicated in Figures 1(g)-1(i), superpixels of larger size can straddle two or more planes with large depth changes, and cannot be reasonably modeled by single planes. However, superpixels of smaller size (consider the extreme case in which a superpixel contains only one pixel) can lead to the matching ambiguity commonly existing in traditional pixel-level stereo. Admittedly, the accuracy of superpixels can be improved using some existing optimization methods; for example, Feng et al.[1] used a split-and-merge strategy to automatically produce spatially coherent image segmentations, and Li et al.[2] regularized arbitrary superpixels into a maximum cohesive grid structure via cascade dynamic programming. However, these methods cannot guarantee consistent boundaries between superpixels and scene structures due to the lack of the corresponding spatial information (e.g., 3D points and planes).

(4) Unrelated regions (e.g., sky, ground) cannot be effectively detected and filtered out, which frequently reduces the efficiency of the reconstruction.

To overcome the above problems, in this paper we extended our investigation, started in [3], of modeling piecewise planar urban scenes based on scene structure priors and a Convolutional Neural Network (CNN). We discussed in detail the performance of guiding the reconstruction process using scene structure priors and of improving the plane assignment reliability using high-level image features obtained by the CNN. Finally, by jointly optimizing the superpixels and their associated planes, and globally optimizing the resulting plane assignments under the Markov Random Field (MRF) framework, we analyzed the overall performance of the proposed method on a wide variety of urban scene data sets, and confirmed that it can completely model a scene from only sparse 3D points with high efficiency and accuracy.

Figure 1. Problems existing in traditional methods. (a) 2D projected points (black) from initial 3D points, current superpixel s (red) and its neighboring superpixels with reliable planes (green); (b) close-up of superpixels in the white rectangle in (a); (c) incomplete candidate planes (black: top-view of initial 3D points; red: reliable planes; white: unreliable planes); (d) plane assignment for superpixel s and its neighboring superpixels using a hard regularization; (e)-(f) top-view and close-up (two neighboring superpixels corresponding to different real planes are assigned the same plane); (g)-(h) superpixels produced by the Mean-shift[4] method are basically consistent with scene structures; however, specific superpixels straddle two or more planes. In contrast, although superpixels produced by the SLIC[5] method have a uniform size, they can incur matching ambiguity between superpixels (e.g., superpixels of small size in the sky region have similar color features); (i) sample superpixels and close-ups (solid and dashed rectangles denote superpixels produced by the Mean-shift and SLIC methods, respectively).

2 Related work

Our method is related to multi-view piecewise planar stereo and CNN-based stereo matching; a short review of each is given below.

2.1 Multi-view piecewise planar stereo

Traditional exhaustive plane sweeping methods[6] tend to directly determine the optimal planes associated with superpixels in a larger search space, and usually lead to high computational complexity and low reliability. Furukawa et al.[7] obtained a set of candidate planes along three orthogonal scene directions (e.g., the Manhattan-world model) based on initial 3D-oriented points obtained from the PMVS (Patch-based Multi-view Stereo)[8] method, and then assigned each pixel to an optimal plane by pixel-wise plane labeling under the MRF framework. As a result, the method is not suitable for complex scenes with more than three scene directions. Similarly, Gallup et al.[9] extended the traditional plane sweeping methods to perform multiple plane sweeps, where the sweeping directions were aligned to the expected surface normals of the scene. Clearly, such a method is not robust to complex scenes, because only a few sweeping directions are involved. Mičušík et al.[10] restricted scene directions through vanishing points and performed a superpixel-based dense reconstruction of urban scenes. However, the method can erroneously suppress a slanted plane whose normal vector is not consistent with the predefined main directions. In general, simply restricting or specifying scene structures (e.g., the number of scene directions) in advance is unsuitable for reconstructing complex scenes. Sinha et al.[11] used sparse 3D points and lines to generate candidate planes, and then recovered a piecewise planar depth map under the MRF framework. However, this method may ignore some real planes because of the sparsity of the initial 3D points and lines. Chauve et al.[12] extracted all possible planes from unstructured 3D points using a region growing approach, and then formulated the problem of piecewise planar reconstruction as a labeling problem of 3D space into empty or occupied regions.
In fact, this method may not be robust to noisy 3D points, as region growing can easily be trapped in wrong solutions. Jiao et al.[13] first generated candidate planes from quasi-dense 3D points, and then assigned the optimal plane to each superpixel under the MRF framework, where the contour of each superpixel is modified in advance to be consistent with scene structures. This method is related to ours; however, it may not be robust to sparse 3D points that are not sufficient to generate complete candidate planes. Bodis-Szomoru et al.[14] proposed a piecewise planar modeling method based on sparse 3D points and superpixels to generate an approximate model of the scene. Although the method is fast, it may be unreliable because it assumes that each superpixel is sufficiently large to contain an adequate number of 3D points for plane fitting. In fact, a scene patch with large depth discontinuities corresponding to a superpixel of large size cannot be reasonably modeled by a single plane. Verleysen et al.[15] generated dense 3D points by matching DAISY[16] descriptors across two wide-baseline images, and then extracted candidate planes from the dense 3D points to perform a piecewise planar reconstruction under the MRF framework. In general, the method can produce better results because dense 3D points implicitly contain more information about scene structures (e.g., they allow generating relatively complete candidate planes and constructing stronger constraints for the plane inference). However, because of the time-consuming stereo matching, the method usually leads to higher computational complexity. In our previous work[17], we proposed to cooperatively optimize the image regions and their associated planes based on scene structure priors (e.g., plane intersection angles); however, such an optimization frequently fails to reach a globally optimal solution, and thus results in some errors in reconstructing small plane patches.

2.2 CNN-based stereo matching

Stereo matching (i.e., establishing point correspondences between different images) is one of the most fundamental tasks in image-based 3D reconstruction, and has been a continuing research focus in the last few years. For example, Feng et al.[18] proposed a spectral-multiplicity-tolerant method for attributed graph matching by posing the general graph matching problem as alternately optimizing a multiplicity matrix and a vertex permutation matrix. Such a method can be applied to the point matching problem, but may not be robust for region matching when noise and illumination variation are involved. In order to improve the accuracy of stereo matching, Liang et al.[19] proposed a new network architecture to seamlessly integrate the four steps of stereo matching (i.e., matching cost calculation, matching cost aggregation, disparity calculation and disparity refinement). In this network, the feature constancy (feature correlation and reconstruction error) constraint is introduced to refine the initial disparity by a notable margin. Similarly, Jie et al.[20] proposed a novel left-right comparative recurrent model to perform left-right consistency checking jointly with disparity estimation. Such a method employs a soft attention mechanism to guide the model to selectively focus on refining the unreliable regions at each recurrent step (i.e., disparity estimation and online left-right comparison). Actually, the left-right consistency measure may be unreliable because consistently matched pixels are not always correct, and thus it cannot detect all the incorrect matches. To address this issue, Park et al.[21] used the random forest framework to select effective confidence measures depending on the characteristics of the training data and matching strategies, and then adopted the selected confidence measures to build a better confidence prediction model to improve the robustness and accuracy of traditional stereo matching methods. In general, these methods are effective for estimating the disparity from a rectified stereo pair of images; however, they may be unreliable for large-scale wide-baseline images. Recently, with the improvement of the theories and methods of deep learning, CNN-based stereo matching[22,23] has increasingly become a hot research topic in the field of computer vision. Fischer et al.[24] found that high-level image features obtained by a CNN in a supervised, and especially an unsupervised, learning manner have higher performance than traditional feature descriptors (e.g., Scale Invariant Feature Transform, SIFT) in stereo matching. Zbontar et al.[22] utilized a CNN to evaluate the visual similarity relationships between a pair of image patches, and then obtained an accurate disparity/depth map in a global optimization manner. However, the above method can lead to high computational complexity because of the time-consuming convolution computation. To address this problem, Chen et al.[23] first extracted the features of a pair of image patches at different scales, and then obtained the matching scores by an inner product. The scores from different scales are then merged for an ensemble. In addition, the method groups multiple inner product operations into a single matrix operation for further acceleration. Luo et al.[25] designed a product layer which simply computes the inner product between the two representations of a siamese architecture, and produced better matching results in less than a second of GPU (Graphics Processing Unit) computation. Zagoruyko et al.[26] exploited methods which directly learn a general similarity function for comparing image patches, and proposed multiple neural network architectures for this purpose. Shi et al.[27] learned an RGB-D patch descriptor using a deep convolutional neural network that takes in color, depth, normal and multi-scale context information of a planar patch in an image. Such descriptors can be used to predict whether or not two RGB-D patches from different frames are coplanar in SLAM reconstruction. Recently, Shrestha et al.[28] employed a generative neural network to predict unknown regions of a partially explored 2D map in indoor environments, and used the resulting prediction to enhance the exploration in an information-theoretic manner. Our work is closely related to the two-channel networks in [26]. However, in comparison to the depth map estimation in [26], our method mainly focuses on measuring the reliability of a plane in multi-view piecewise planar stereo.

3 Overview and contributions

In some cases, as discussed in Section 1, traditional piecewise planar stereo methods can be unreliable and inefficient in reconstructing complex urban scenes due to four factors: inaccurate image over-segmentation, incomplete candidate planes, unreliable plane optimization and unnecessary reconstruction.

Input
1. Sparse 3D points;
2. Images with calibrated camera parameters.

Pre-processing
1. Over-segment images to generate superpixels (or image regions).
2. Detect line segments and refine superpixels using line segments.
3. Generate initial candidate planes via multi-plane fitting.
4. Estimate the vertical scene direction to detect unrelated regions (e.g., the ground).

Jointly optimize superpixels and their related planes
1. Generate initial reliable planes from initial candidate planes.
2. Resegment superpixels at a smaller threshold.
3. Generate candidate planes based on structure priors for each superpixel.
4. Assign the optimal plane to each superpixel.

Globally optimize plane structures and output
1. Globally optimize the planes associated with superpixels under the MRF framework.
2. Output the optimized piecewise planar model.

Figure 2. Flowchart of the proposed method
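The stages in Figure 2 can be sketched as a pipeline skeleton. Every function body below is an illustrative stub of our own (the real system plugs in Mean-shift over-segmentation, the multi-model fitting of [29], Algorithm 1, and an MRF solver at the marked steps); only the call structure mirrors the flowchart.

```python
def oversegment(image_size):
    # Stub: tile the image into 2x2 blocks as stand-in superpixels;
    # the paper uses Mean-shift over-segmentation here.
    h, w = image_size
    return [(r, c) for r in range(0, h, 2) for c in range(0, w, 2)]

def fit_candidate_planes(points3d):
    # Stub: one "plane" id per three 3D points; the paper uses the
    # multi-model fitting method [29] on the sparse points instead.
    return list(range(max(1, len(points3d) // 3)))

def jointly_optimize(superpixels, planes):
    # Stub for Algorithm 1: assign every superpixel the first plane.
    return {s: planes[0] for s in superpixels}

def mrf_refine(assignments):
    # Stub for the global MRF optimization of the plane labels.
    return assignments

def reconstruct_scene(points3d, image_size):
    superpixels = oversegment(image_size)                # pre-processing
    planes = fit_candidate_planes(points3d)              # pre-processing
    assignments = jointly_optimize(superpixels, planes)  # Section 5
    return mrf_refine(assignments)                       # global optimization
```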

In general, inaccurate superpixels can be further over-segmented or simply regularized using the line segments detected in the current image; the goal is to make them consistent with scene structures. Moreover, incomplete candidate planes and unreliable plane optimization can be improved by incorporating scene structure priors and high-level image features. In fact, urban scenes contain multi-plane structures (i.e., the piecewise planar prior that pixels of similar appearance are more likely to belong to the same plane), and the component planes have strong structural regularities (i.e., the angle prior that the angles between planes are usually fixed values such as 90°). These priors significantly help to guide the reconstruction process to achieve better results and to improve the overall efficiency of the reconstruction (e.g., by filtering out unrelated regions to avoid unnecessary reconstruction). Based on the analysis above, given sparse 3D points reconstructed from calibrated images (i.e., their corresponding extrinsic and intrinsic camera parameters have been obtained by some existing methods such as Structure from Motion pipelines), this paper presents an effective piecewise planar stereo method based on structure priors and CNN to reliably model an urban scene with high efficiency and accuracy. The flowchart is outlined in Figure 2 and each component is elaborated in the subsequent sections.

The main contributions of the proposed work can be summarized as follows:
(1) We utilize scene structure priors and high-level image features to overcome the influence of inaccurate image over-segmentation, incomplete candidate planes, and unreliable regularization, and thus enhance the reliability of the plane assignment.
(2) We propose a new piecewise planar stereo method that jointly optimizes superpixels and their associated planes by incorporating low-level and high-level image features, 3D point visibility constraints, and scene structure priors. The method can effectively reconstruct the scene from only sparse 3D points.
(3) We propose an effective method to detect and filter out unrelated regions (e.g., sky, ground) in the reconstruction process. This can significantly improve the efficiency of the entire reconstruction.

4 Preprocessing

According to the piecewise planar prior, the current image is first over-segmented into a set of superpixels (denoted as R_0) using the Mean-shift method (or other similar methods). Then, as indicated in Figures 3(a) and 3(b), the initial superpixels are further resegmented into sub-superpixels according to the lines determined by the line segments detected in the current image[15]. Further, initial candidate planes (denoted as H_0) are generated from the initial 3D points using the multi-model fitting method[29]. The method[29] aims at finding a small number of planes (i.e., dominant planes) that best explain all the 3D points, and adopts a mutually reinforcing manner to alternately perform candidate plane generation and fitting optimization (instead of exhaustively and randomly sampling 3D points to generate candidate planes). In general, the method can generate relatively reliable planes from the initial 3D points; however, when the initial 3D points are too sparse, the fitted planes are usually not sufficient to depict the complete scene structures, and also contain many outliers (e.g., several 3D points lying in two or more real planes are wrongly fitted by a single plane). Thus, the set H_0 is incomplete and typically contains many plane outliers. In the proposed method, the scene's vertical direction is also estimated using vanishing point detection methods[30] based on the detected line segments. Moreover, for a hand-held camera with calibrated parameters, as indicated in Figure 3(d), the ground can be estimated according to the height of the camera (e.g., 1.7 m). Note that the vertical direction and the ground are used to generate candidate planes and to eliminate the unrelated regions (see Section 5.2), respectively.
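As a small illustration of the ground estimation above: given the camera center, the estimated vertical direction, and an assumed camera height (1.7 m in the paper's example), the ground plane can be computed directly. The (n, d) plane representation with n·x + d = 0 is our choice for this sketch, not something the paper prescribes.

```python
import numpy as np

def ground_plane(camera_center, vertical_dir, camera_height=1.7):
    """Estimate the ground plane as (n, d) with n·x + d = 0, assuming the
    camera sits camera_height metres above the ground along the scene's
    vertical direction (cf. Figure 3(d))."""
    n = np.asarray(vertical_dir, dtype=float)
    n /= np.linalg.norm(n)                        # unit normal = vertical direction
    foot = np.asarray(camera_center, dtype=float) - camera_height * n
    d = -float(np.dot(n, foot))                   # the foot point lies on the plane
    return n, d
```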

Figure 3. Superpixel resegmentation. (a) Line segment (red) detection; (b) resegmenting a superpixel using line segments; (c) resegmenting a superpixel at a smaller threshold: for the current superpixel (left), the scene patches corresponding to its two parts on both sides of the dotted line are actually in different planes; the resegmented sub-superpixels (right) will be respectively assigned to more reliable planes (see Section 5.2), instead of assigning one plane to the current superpixel (left); (d) vertical direction (white arrow), ground (grid) and camera (red point).
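The line-based resegmentation of Figure 3(b) reduces, per line segment, to classifying each pixel of a superpixel by the side of the (extended) line it falls on. A minimal sketch, in which representing a superpixel as a plain list of pixel coordinates is our simplification:

```python
def split_by_line(pixels, p0, p1):
    """Split a superpixel (a list of (x, y) pixels) by the infinite line
    through p0 and p1. Returns the two sub-superpixels on either side."""
    (x0, y0), (x1, y1) = p0, p1
    left, right = [], []
    for (x, y) in pixels:
        # The sign of the 2D cross product tells which side of the
        # directed line p0 -> p1 the pixel (x, y) lies on.
        side = (x1 - x0) * (y - y0) - (y1 - y0) * (x - x0)
        (left if side >= 0 else right).append((x, y))
    return left, right
```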

5 Jointly optimizing superpixels and their associated planes

In this section, we first introduce the plane assignment cost incorporating low-level and high-level image features, and then discuss the method that jointly optimizes superpixels and their associated planes.

5.1 Plane assignment cost

Given the current image I_r and its neighboring images {I_i} (i = 1, 2, ..., k), we define the following cost of assigning a plane H_s to a superpixel s ∈ I_r:

E(s, H_s) = E_data(s, H_s) + γ · Σ_{t∈ℕ(s)} E_regular(H_s, H_t),   (1)

where E_data(s, H_s) and E_regular(H_s, H_t) denote the data and regularization terms, respectively, and ℕ(s) denotes the set of reliable superpixels (i.e., superpixels that have been assigned reliable planes); γ is the weight of the regularization term.

(1) Data term

The data term E_data(s, H_s) is evaluated by incorporating low-level and high-level image features, and is formally defined as follows:

E_data(s, H_s) = E_pho(s, H_s) + ρ · E_cnn(s, H_s).   (2)

In Eq.(2), E_pho(s, H_s) encodes low-level image features and 3D point visibility constraints, namely,

E_pho(s, H_s) = (1 / (k·|s|)) · Σ_{i=1}^{k} Σ_{p∈s} C_s(p, H_s, I_i),   (3)

where |s| and k denote the total number of pixels belonging to superpixel s and the number of neighboring images, respectively, and C_s(p, H_s, I_i) is defined as

C_s(p, H_s, I_i) = { min(‖F_{I_r}(p) − F_{I_i}(H_s(p))‖, δ)   if D(H_s(p)) = NULL,
                     λ_occ                                    if d(H_s(p)) > D(H_s(p)),   (4)
                     λ_err                                    if d(H_s(p)) ≤ D(H_s(p)),

where H_s(p) ∈ I_i denotes the corresponding point in the image I_i induced by the plane H_s with respect to the pixel p ∈ s; F_x(y) denotes the normalized color (i.e., the value is between zero and one) of the point y in the image x, and ‖F_{I_r}(p) − F_{I_i}(H_s(p))‖ denotes the absolute difference of the normalized colors; d(x) and D(x) denote the depth estimated from the current plane and the reliable depth from the initial 3D points, respectively; the parameter δ is a truncation threshold, and the constants λ_occ and λ_err are the occlusion penalty and the free-space violation penalty, respectively. In Eq.(4), the first case implies that if D(H_s(p)) = NULL, plane H_s is more likely to be a real plane and the photo-consistency cost is measured by the dissimilarity of the color distributions. Otherwise, the intersection point of the back-projection ray of pixel p with plane H_s may be occluded if d(H_s(p)) > D(H_s(p)), or it violates the 3D point visibility if d(H_s(p)) ≤ D(H_s(p)), because a reliable 3D point is unlikely to be occluded. Therefore, different penalties must be assigned in these two cases.

For the high-level image features, we first extract three image patches that appropriately contain superpixel s ∈ I_r and the corresponding projected regions {s_i} (i = 1, 2, ..., k) in the images {I_i} (i = 1, 2, ..., k), and resize them to 224×224. Then, we simply consider these image patches as a multi-channel image and adopt the VGG-M architecture proposed in [31] to extract the features of the multi-channel image. Finally, we directly feed the features to a logistic regression layer and use the output as the plane assignment cost E_cnn(s, H_s) based on the high-level image features. As training data, we sampled image patches from 13 scenes of the DTU datasets[32] and the CASIA datasets[33]. In collecting positive samples, for a superpixel s in the current image, we first fitted a plane to its associated ground-truth 3D points (i.e., the 3D points that project into the superpixel). Then, if the fitted plane is reliable (i.e., the average distance between the 3D points and the fitted plane is smaller than a pre-defined threshold), we projected the ground-truth 3D points into the neighboring images of the current image to select the corresponding image patches {s_i}. Finally, we took the image patches {s, s_1, s_2, ..., s_k} and the fitted plane as a positive sample.
Meanwhile, we picked image patches far away from {s_i} to produce negative samples (i.e., the fitted plane is inconsistent with the image patches {s_i}). In total, we sampled 230K positive and 210K negative examples. Finally, learning is done by minimizing the cross-entropy loss with 50K iterations of standard Stochastic Gradient Descent (SGD) (batch size and learning rate are set to 512 and 0.001, respectively). Based on the definitions of E_pho(s, H_s) and E_cnn(s, H_s), we conducted comparison experiments to evaluate their performance using two neighboring images (i.e., k = 2) (more experimental results for different k values are shown

in Section 7.1). More specifically, for a superpixel containing initial 3D points (i.e., initial 3D points that project into the superpixel), its reliable assigned plane is first determined according to the minimal average distance between these 3D points and the initial candidate planes (see Section 4). Then, taking these superpixels and their assigned planes as ground truth, the accuracy ℱ is defined as the ratio of the number of superpixels that are assigned the correct planes using E_pho(s, H_s) or E_cnn(s, H_s) to the total number of superpixels.

Table 1. ℱ values on different data sets (ρ = 0.2)

Data set | E_pho(s, H_s) | E_cnn(s, H_s) | E_data(s, H_s)
Valbonne |     0.6415    |     0.4821    |     0.7166
Wadham   |     0.6106    |     0.5357    |     0.6949
LSB      |     0.5110    |     0.4158    |     0.7075
TS       |     0.6317    |     0.5756    |     0.7271
City#1   |     0.4901    |     0.4003    |     0.6112
City#2   |     0.5177    |     0.4821    |     0.6398

Figure 4. Accuracy changes with weight ρ
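The accuracy ℱ used in Table 1 is a plain fraction of correctly assigned superpixels; a minimal sketch, where mapping superpixel ids to plane ids is a hypothetical encoding of the assignment:

```python
def accuracy_f(assigned, ground_truth):
    """Accuracy F: the fraction of ground-truth superpixels whose assigned
    plane matches the ground-truth plane. Both arguments map a superpixel
    id to a plane id."""
    correct = sum(1 for s, plane in ground_truth.items() if assigned.get(s) == plane)
    return correct / len(ground_truth)
```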

Based on ℱ, Table 1 displays the corresponding results on different data sets (see Section 7). Clearly, compared with E_cnn(s, H_s), E_pho(s, H_s) appears to be more effective because it can quantitatively compute the feature similarity at the pixel level. Furthermore, we combine E_cnn(s, H_s) with E_pho(s, H_s) using the weight ρ to generate E_data(s, H_s). As indicated in Figure 4 and Table 1, the accuracy ℱ of E_data(s, H_s) approaches its maximum when ρ = 0.2.

(2) Regularization term

In traditional methods, the angle prior (see Section 3) is frequently ignored or simply formulated as a hard regularization that forces two neighboring superpixels with similar appearances to be assigned the same plane. In this paper, such a hard regularization term is relaxed through the angle prior and defined as

E_regular(H_s, H_t) = { C_sim       if H_s = H_t,
                        μ · C_sim   if A(H_s, H_t) ∈ A_prior,   (5)
                        λ_dis       otherwise,

where A(H_s, H_t) denotes the intersection angle between the planes H_s and H_t corresponding to superpixels s and t, respectively, and A_prior is the angle prior, set to [30°, 45°, 60°, 90°, −60°, −45°, −30°] (more angles assist the reconstruction of detailed structures; however, they also incur high computational complexity). The constants λ_dis and μ are the plane discontinuity penalty and the relaxation parameter, respectively. In Eq.(5), C_sim measures the color dissimilarity of the superpixels and is defined as

C_sim = 1 / (1 + e^(−‖c(s) − c(t)‖)),   (6)

where ‖c(s) − c(t)‖ denotes the difference between the mean colors (normalized to a range of zero to one) of superpixels s and t, respectively. In fact, high-level image features can also be used in Eq.(6); however, they do not significantly improve performance and come at the cost of higher computational complexity.
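Pulling Eqs.(2)-(6) together, the assignment cost of Eq.(1) can be sketched as below. All constants (δ, λ_occ, λ_err, ρ, γ, μ, λ_dis) are illustrative placeholders rather than the paper's tuned values (only ρ = 0.2 is reported above), and scalar intensities stand in for the color vectors and for the CNN output E_cnn.

```python
import math

DELTA, LAM_OCC, LAM_ERR = 0.5, 0.3, 0.6        # Eq.(4), illustrative
RHO, GAMMA, MU, LAM_DIS = 0.2, 0.4, 0.5, 1.0   # Eqs.(1), (2), (5), illustrative
A_PRIOR = {30, 45, 60, 90, -60, -45, -30}      # angle prior (degrees)

def pixel_cost(f_ref, f_nbr, d_plane, d_reliable):
    """C_s(p, H_s, I_i) of Eq.(4); colors are scalars in [0, 1] and
    d_reliable is None when no reliable depth exists (the NULL case)."""
    if d_reliable is None:
        return min(abs(f_ref - f_nbr), DELTA)  # truncated color difference
    return LAM_OCC if d_plane > d_reliable else LAM_ERR

def e_pho(per_pixel_costs):
    """E_pho(s, H_s) of Eq.(3): mean over a k x |s| nested list of costs."""
    flat = [c for image_costs in per_pixel_costs for c in image_costs]
    return sum(flat) / len(flat)

def e_data(per_pixel_costs, cnn_cost):
    """E_data(s, H_s) of Eq.(2); cnn_cost stands in for E_cnn(s, H_s)."""
    return e_pho(per_pixel_costs) + RHO * cnn_cost

def c_sim(color_s, color_t):
    """Color dissimilarity of Eq.(6) from mean superpixel colors in [0, 1]."""
    return 1.0 / (1.0 + math.exp(-abs(color_s - color_t)))

def e_regular(same_plane, angle_deg, color_s, color_t):
    """Relaxed regularization term of Eq.(5)."""
    sim = c_sim(color_s, color_t)
    if same_plane:
        return sim
    if round(angle_deg) in A_PRIOR:            # intersection angle in the prior
        return MU * sim
    return LAM_DIS                             # plane discontinuity penalty

def assignment_cost(per_pixel_costs, cnn_cost, neighbor_terms):
    """E(s, H_s) of Eq.(1); neighbor_terms lists (same_plane, angle, c_s, c_t)
    for each reliable neighboring superpixel t in N(s)."""
    reg = sum(e_regular(*t) for t in neighbor_terms)
    return e_data(per_pixel_costs, cnn_cost) + GAMMA * reg
```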

5.2 Jointly optimizing superpixels and their associated planes

According to the definition of the plane assignment cost, superpixels and their associated planes are jointly optimized using the method described in Algorithm 1. Next, we introduce several implementation details.

(1) Initial reliable planes

Essentially, Algorithm 1 performs in a progressive manner. In this process, as indicated in Figure 5(a), initial reliable planes can provide strong constraints for inferring other planes. Hence, for each superpixel s containing initial 3D points, we select plane H_s from the set H_0 (see Section 4) as its reliable plane according to the following condition:

$$T(s) = \{H_s \in H_0 : (E_{data}(s, H_s) < \bar{E}) \wedge (N(P_s, H_s) < \bar{N})\}, \quad (7)$$

where P_s denotes the 3D points that are projected into superpixel s, and N(P_s, H_s) denotes the average orthogonal distance between the 3D points P_s and the plane H_s; EΜ„ and NΜ„ are the averages of the minimal E_data(s, H_s) values and the minimal N(P_s, H_s) values over all superpixels containing 3D points, respectively.

Algorithm 1. Jointly optimizing superpixels and their associated planes
Input: Initial 3D points and three calibrated images.
Output: Sets of superpixels β„› and associated planes β„‹.
Initialization: The sets of initial superpixels R_0 and initial candidate planes H_0.
1. Determine initial reliable planes for the superpixels containing 3D points (let β„›Μ… denote the set of other superpixels) from H_0 and R_0, and save them to β„‹ and β„›, respectively.
2. Compute the plane assignment priority for the superpixels in β„›Μ….
3. Select and remove the superpixel s with the highest priority from β„›Μ….
   3.1 If superpixel s is verified as sky or ground, discard it.
   3.2 Otherwise, generate candidate planes and compute the minimal E(s, H_s) value.
   3.3 If E(s, H_s) ≀ EΜ„, assign plane H_s (i.e., a reliable plane) to superpixel s, and save them to β„‹ and β„›, respectively.
   3.4 Otherwise, resegment superpixel s and save the resulting sub-superpixels to β„›Μ….
4. Go to Step 2 until β„›Μ… = βˆ….
5. Output β„› and β„‹.
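The progressive loop of Steps 2-4 in Algorithm 1 can be sketched as follows. The dictionary-based superpixel layout and the callbacks (`candidates`, `resegment`) are assumptions standing in for the paper's actual components, not a real API:

```python
def joint_optimize(pending, E_bar):
    """Sketch of Algorithm 1, Steps 2-5.

    pending: list of dicts with keys
      'name'       - identifier of the superpixel,
      'priority'   - plane assignment priority (Eq. 8),
      'unrelated'  - True for sky/ground regions (Step 3.1),
      'candidates' - list of (plane, assignment_cost) pairs (Step 3.2),
      'resegment'  - callable returning sub-superpixel dicts (Step 3.4).
    """
    R, H = [], []                      # accepted superpixels and planes
    while pending:                     # Step 4: loop until empty
        # Steps 2-3: pick the superpixel with the highest priority.
        pending.sort(key=lambda sp: sp['priority'], reverse=True)
        s = pending.pop(0)
        if s['unrelated']:
            continue                   # Step 3.1: discard sky/ground
        plane, cost = min(s['candidates'], key=lambda pc: pc[1])
        if cost <= E_bar:              # Step 3.3: accept a reliable plane
            R.append(s['name'])
            H.append(plane)
        else:                          # Step 3.4: resegment and requeue
            pending.extend(s['resegment']())
    return R, H
```

A superpixel whose best candidate cost exceeds EΜ„ is split, and its sub-superpixels re-enter the queue with their own priorities, which is what makes the process progressive.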

(2) Plane assignment priority

For a superpixel s ∈ β„›Μ…, the reliable planes associated with its neighboring superpixels typically have an important effect on inferring its optimal plane. To measure this influence, we define the plane assignment priority as follows:

$$\rho_s = N(s) \cdot B(s), \quad (8)$$

where N(s) is the number of neighboring superpixels of s that have reliable planes, and B(s) is the total number of pixels on the edges of superpixel s that are adjacent to those neighboring superpixels.
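Eq.(8) can be computed directly from a superpixel adjacency structure. A small sketch, where `neighbors` maps each neighboring superpixel id to the shared boundary length in pixels and `reliable` is the set of superpixels already assigned a reliable plane (this data layout is an assumption):

```python
def assignment_priority(neighbors, reliable):
    """Eq. (8): rho_s = N(s) * B(s), where N(s) counts neighbors that
    already have a reliable plane and B(s) sums the boundary pixels
    shared with exactly those neighbors."""
    planed = [t for t in neighbors if t in reliable]
    return len(planed) * sum(neighbors[t] for t in planed)
```

A superpixel surrounded by many already-reconstructed neighbors along long boundaries thus gets processed first, since its plane is the most strongly constrained.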


Figure 5. Plane assignment based on the angle prior. (a) Initial reliable planes extracted from the set 𝐻0 ; (b) top-view of candidate planes (white); (c) plane assignment; (d)-(e) top-view and close-up.
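The candidate planes shown in Figure 5(b) are generated by rotating a reliable neighboring plane about the vertical axis through a 3D point on the shared boundary, once per angle in the prior (see item (4) below). A hedged geometric sketch, assuming planes are stored as a unit normal n and offset d with nΒ·x + d = 0 and that the vertical direction is the y axis:

```python
import math

A_PRIOR = [30, 45, 60, 90, -60, -45, -30]   # rotation angles (degrees)

def rotate_about_y(n, deg):
    """Rotate normal n = (nx, ny, nz) about the vertical (y) axis."""
    r = math.radians(deg)
    c, s = math.cos(r), math.sin(r)
    nx, ny, nz = n
    return (c * nx + s * nz, ny, -s * nx + c * nz)

def candidate_planes(n, boundary_pt):
    """One candidate plane per prior angle, each constrained to pass
    through the boundary 3D point (so d = -n . p)."""
    planes = []
    for deg in A_PRIOR:
        m = rotate_about_y(n, deg)
        d = -sum(mi * pi for mi, pi in zip(m, boundary_pt))
        planes.append((m, d))
    return planes
```

Anchoring every rotated plane at the boundary point keeps neighboring planes intersecting exactly on the shared boundary, which is why the reconstructed plane boundaries stay consistent.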

Eq.(8) indicates that, for superpixel s, when more of its neighboring superpixels have reliable planes and the corresponding boundary is longer, the constraints for assigning its optimal plane are stronger and more reliable, and the plane associated with superpixel s should therefore be inferred with higher priority. Note that if superpixel s is sub-segmented, only the plane assignment priorities of the resulting sub-superpixels are computed in Step 2, which improves efficiency.

(3) Unrelated region detection

Given the ground plane of the scene (see Section 4), if the intersection points of the rays back-projected from superpixel s with the building planes lie below the ground, we consider superpixel s an unrelated ground region. Superpixels in the sky region are detected according to the following condition:

π‘‡π‘ π‘˜π‘¦ (𝑠) = (π‘ƒπ‘ π‘˜π‘¦ (𝑠) > πœ–)β‹€ (

1 βˆ‘ πΈπ‘‘π‘Žπ‘‘π‘Ž (𝑠, 𝐻𝑠 ) > 𝐸̅ ), |β„‹ |

(9)

𝐻𝑠 βˆˆβ„‹

where π‘ƒπ‘ π‘˜π‘¦ (𝑠) is the probability that the superpixel 𝑠 belongs to the sky and produced by the semantic labeling algorithm [34]; Ο΅ is the corresponding threshold. 𝐸̅ is defined in Eq.(7) and β„‹ is the set of the current reconstructed planes. (4) Candidate plane generation According to the structure characteristics of urban scenes, the plane associated with the current superpixels 𝑠 ∈ β„›Μ… frequently has specified angles with its neighboring planes. Therefore, for generating candidate planes for superpixel 𝑠, as indicated in Figures 1(a) and 1(b), we first detect the set 𝛱 of its neighboring superpixels that have been assigned reliable planes, and then rotate the plane with the axis by the vertical direction (see Section 4) and a 3D point that is projected in the boundary between superpixels 𝑠 and 𝑑 ∈ 𝛱. Finally, as indicated in Figure 5(b), we consider each plane produced at each rotating angle belonging to π΄π‘π‘Ÿπ‘–π‘œπ‘Ÿ as candidate planes of superpixel 𝑠. Consequently, in contrast to incomplete candidate planes as indicated in Figure 1(c), the extended candidate planes are sufficient and reliable for reconstructing more detailed scene structures. Note that, for reconstructing some slanted planes, we also consider the axis that are perpendicular to both the vertical direction and the normal vector of the current reliable plane. (5) Plane assignment cost computation

In fact, computing E_cnn(s, H_s) is relatively time-consuming. As indicated in Table 2, E_pc(part) denotes the E_data(s, H_s) in which the component E_cnn(s, H_s) is used only for superpixels with low discrimination, and E_pc(all) denotes the E_data(s, H_s) in which the component E_cnn(s, H_s) is always used for every superpixel. Here, low discrimination means that it is unreliable to assign a plane to a superpixel using only E_pho(s, H_s). In this case, the ratio of the minimal E_pho(s, H_s) value to the second smallest E_pho(s, H_s) value over all candidate planes is typically larger, and it is thus used to identify whether a superpixel has low discrimination by setting a pre-given threshold (set to 0.8 in this paper). Moreover, PS denotes the percentage of superpixels with low discrimination, and M1(Sini) is defined in Section 7.3.

Table 2. Accuracy and computational time using different types of data terms.

Data sets   PS     M1(Sini)              Time (seconds)
                   Epc(part)  Epc(all)   Epc(part)  Epc(all)
Valbonne    63.4   0.5614     0.5753     17.8       26.2
Wadham      71.4   0.7629     0.7648     37.2       46.6
LSB         76.9   0.6546     0.6629     56.8       69.3
TS          80.3   0.5967     0.6189     68.1       76.5
City#1      67.7   0.4842     0.5164     61.8       84.4
City#2      59.8   0.5889     0.6087     74.4       91.7

From Table 2, we can see that the accuracy of E_pc(part) is comparable to that of E_pc(all), while its computational time is relatively shorter. Therefore, in our experiments, when a superpixel can be assigned a reliable plane using E_pho(s, H_s), we do not compute E_cnn(s, H_s), in order to improve the efficiency of Algorithm 1 (i.e., E_cnn(s, H_s) is only utilized to improve the reliability of E_data(s, H_s) for superpixels with low discrimination).

(6) Superpixel resegmentation

It is frequently difficult to determine the optimal plane for an inaccurate superpixel with unreliable features. Therefore, we resegment a superpixel with the Mean-shift method at a smaller threshold when the corresponding plane assignment cost is larger than EΜ„. As indicated in Figure 3(c), after resegmenting the current superpixel (left), the resulting sub-superpixels (right) are more consistent with the scene structures (e.g., edges). Note that sub-superpixels that are too small (e.g., fewer than 10 component pixels) are merged into other larger superpixels or sub-superpixels to improve the efficiency and reliability of Algorithm 1.

In general, by incorporating the angle prior, Algorithm 1 can achieve superior results compared to traditional methods that adopt hard regularizations. Using the superpixel (red) in Figures 1(a) and 1(b) as an example, as indicated in Figures 5(c)-5(e), traditional hard regularizations tend to assign the same plane to two neighboring superpixels because they have similar appearances. Algorithm 1 effectively resolves this problem using the angle prior.
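The low-discrimination test from item (5) can be sketched as follows, assuming strictly positive photo-consistency costs (the 0.8 threshold is the paper's value):

```python
def has_low_discrimination(pho_costs, ratio_threshold=0.8):
    """True when E_pho alone cannot reliably separate the best candidate
    plane from the runner-up, i.e. when min / second-min > threshold."""
    if len(pho_costs) < 2:
        return True                      # a single candidate cannot discriminate
    best, second = sorted(pho_costs)[:2]
    return best / second > ratio_threshold
```

Only superpixels flagged by this test pay the extra cost of evaluating E_cnn, which is what makes E_pc(part) nearly as accurate as E_pc(all) at a lower runtime in Table 2.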

6 Global plane assignment optimization

After obtaining the plane assignment using Algorithm 1, the following three common problems of traditional methods are effectively solved: (1) inaccurate superpixels can be resegmented according to the plane assignment cost, and the resulting sub-superpixels can be reasonably modeled by appropriate planes; (2) both the reliability and the efficiency of the plane assignment are improved under the guidance of scene structure priors (e.g., the angle prior); (3) unrelated regions (e.g., sky, ground) are filtered out and unnecessary plane assignments are thus avoided, which further improves the efficiency of the reconstruction. To produce more reliable results (e.g., to eliminate calculation deviations between two planes), the plane assignment obtained by Algorithm 1 can be optimized under the MRF framework [35]. The energy function is defined as:

𝐸(β„‹) = βˆ‘ (πΈπ‘β„Žπ‘œ (𝑠, 𝐻𝑠 ) + πœ” βˆ™ βˆ‘ πΈπ‘Ÿπ‘’π‘”π‘’π‘™π‘Žπ‘Ÿ (𝐻𝑠 , 𝐻𝑑 )), π‘ βˆˆβ„›

(10)

π‘‘βˆˆπ’©(𝑠)

where β„› and β„‹ are respectively the set of superpixels and the set of their associated planes obtained by Algorithm 1, and 𝒩(s) is the set of all neighboring superpixels of superpixel s. The constant Ο‰ is the weight of the regularization term. Note that the data term constructed using only low-level image features is more efficient and produces almost the same results as the one constructed using both low-level and high-level image features.

Eq.(10) can be minimized using the Ξ±-expansion algorithm [35]. As indicated in Figure 6(a), compared to the initial plane assignment, which contains outliers, the optimized assignments indicated in Figures 6(b) and 6(c) appear satisfactory. Figure 6(d) indicates the superpixels corresponding to the reconstructed reliable planes. Clearly, the boundary between two regions can also be reliably reconstructed (e.g., the boundary in the rectangle). Conversely, as indicated in Figures 6(e) and 6(f), the traditional methods [14] and [15] suffer from incomplete reconstruction and inaccurately reconstructed boundaries between the planes. More experimental results are presented and analyzed in the next section.
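A minimal sketch of evaluating the energy in Eq.(10) for a fixed assignment; the actual minimization uses the Ξ±-expansion algorithm [35], which is omitted here, and the callbacks `e_pho` and `e_regular` are hypothetical stand-ins for the paper's terms:

```python
def mrf_energy(assignment, neighbors, e_pho, e_regular, omega=0.5):
    """Eq. (10): sum over superpixels of the data term plus the weighted
    regularization over each superpixel's neighborhood.

    assignment: dict superpixel -> plane
    neighbors:  dict superpixel -> list of neighboring superpixels
    """
    total = 0.0
    for s, H_s in assignment.items():
        total += e_pho(s, H_s)                       # data term
        total += omega * sum(e_regular(H_s, assignment[t])
                             for t in neighbors[s])  # smoothness term
    return total
```

An optimizer such as Ξ±-expansion repeatedly proposes one plane label to all superpixels at once and keeps the move whenever this energy decreases.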


Figure 6. Plane assignment optimization (different colors denote different reliable planes). (a) Plane assignment produced by Algorithm 1; (b) plane assignment optimized under the MRF framework; (c) top-view; (d) superpixels corresponding to reliable planes; (e) results produced by method [14]; (f) results produced by method [15].

7 Experiments

To evaluate the performance of the proposed method, we conducted experiments on several data sets of urban scenes in which planar structures dominate. Figure 7 presents the current images (the LSB scene is displayed in Figure 1) and the corresponding two-dimensional (2D) points projected from the initial 3D points.


Figure 7. Sample images. (a) Valbonne; (b) Wadham; (c) TS; (d) City#1; (e) City#2.

(1) Oxford VGG data sets [36]: Valbonne and Wadham. For these two data sets, the image resolutions are 512Γ—768 and 1024Γ—768, respectively. As displayed in Figures 7(a) and 7(b), the corresponding scene structures are relatively simple; however, it is frequently difficult to obtain good results for slanted surfaces (e.g., the roof in the Wadham scene) and other details.

(2) CASIA data sets [33]: Life science building (LSB) and Tsinghua school (TS). For these two data sets, the image resolutions are 728Γ—1072 and 2184Γ—1456, respectively (the camera parameters can be obtained using Structure from Motion pipelines). As displayed in Figure 1 and Figure 7(c), the corresponding scenes contain small plane patches (e.g., the windows in the TS scene) and various plane intersection angles (e.g., 90Β°, 135Β°); it is therefore challenging to effectively reconstruct their complete structures.

(3) Our own data sets: City#1 and City#2. The corresponding image resolution is 1884Γ—1224. As displayed in Figures 7(d) and 7(e), the structures of these two scenes are more complex and more difficult to reconstruct because of interference factors such as illumination variations, repetitive textures, and long distances between the camera and the buildings. In addition, in contrast to the other scenes, there are more unrelated regions (e.g., sky and ground) in the current images, which frequently reduces the efficiency of the reconstruction.

All experiments were conducted on a desktop PC with an Intel Core 4 Duo 4.0 GHz CPU and 32 GB RAM. Each algorithm was implemented in parallel C++.

7.1 Parameter settings

The proposed method appeared to be less sensitive to parameter settings, and the majority of the parameters were fixed. Specifically, to determine the number of neighboring images k in Eq.(3), we conducted experiments with different k values. As shown in Figure 8, the reconstruction accuracy M1(Sopt) (see Section 7.3) approximately reaches its maximum when k = 2 and generally decreases when k > 2. The reasons are as follows: (1) two neighboring images provide enough information for reconstructing the scene structures corresponding to the current image, by virtue of the higher reliability of the plane assignment cost incorporating image features, structure priors (plane and angle priors) and 3D point visibility constraints; (2) the areas of the reconstructed scene patches corresponding to the regions overlapping between more images decrease. Moreover, the computation time increases significantly when k > 2. In summary, we set k to 2.

Figure 8. Accuracy and computation time with different k values. (a) accuracy; (b) computation time.

Moreover, for the data term in the plane assignment cost, the truncation threshold Ξ΄ addresses robustness in occlusion regions, and the occlusion penalty Ξ»_occ should be set smaller than the visibility violation penalty Ξ»_err. With respect to the difference between the normalized colors of two pixels, the proposed method achieves superior results when Ξ΄ = 0.5, Ξ»_occ = 2 and Ξ»_err = 4. For the regularization term, the plane discontinuity penalty Ξ»_dis is mainly used to enhance the consistency of two neighboring planes and is set to 2 with respect to the difference between the mean colors of two superpixels; moreover, because larger ΞΌ values weaken the relaxation of the hard regularization by the predefined angle priors, ΞΌ is set to 0.6.
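To illustrate the roles of Ξ΄, Ξ»_occ and Ξ»_err only, here is a hypothetical per-pixel cost sketch; the actual data term is Eq.(3), defined earlier in the paper, and this is not that definition:

```python
# Parameter values from Section 7.1 / Table 3.
DELTA, LAMBDA_OCC, LAMBDA_ERR = 0.5, 2.0, 4.0

def pixel_cost(color_diff, occluded=False, violates_visibility=False):
    """Illustrative per-pixel matching cost: color differences are
    truncated at delta, occluded pixels pay lambda_occ, and free-space
    (visibility) violations pay the larger penalty lambda_err."""
    if violates_visibility:
        return LAMBDA_ERR
    if occluded:
        return LAMBDA_OCC
    return min(color_diff, DELTA)
```

Truncation caps the influence of any single badly matched pixel, and Ξ»_occ < Ξ»_err reflects that an occlusion is a milder inconsistency than a visibility violation.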

Figure 9. Accuracy changes with different weights. (a) weight Ξ³; (b) weight Ο‰.

For the weight Ξ³ of the regularization term, similarly to the weight ρ of the high-level image features (see Section 5.1), we searched for the optimal Ξ³ value in [0, 1] according to the reconstruction results produced by Algorithm 1. As shown in Figure 9(a), larger Ξ³ values typically force a superpixel and its neighbors to be assigned the same plane, which is not conducive to reconstructing detailed structures. Conversely, smaller values reduce the effect of the regularization term and thus lead to more outliers. The proposed method performed well when Ξ³ = 0.6. Similarly, we selected the optimal Ο‰ value from [0, 1] by comparing the corresponding accuracy of the reconstruction results produced by the proposed method. From Figure 9(b), it can be seen that the proposed method achieves superior results when Ο‰ = 0.5. In addition, the threshold Ο΅ is used to identify the sky regions in images using the semantic labeling algorithm [34]. In our experiments, the algorithm had high accuracy in detecting the sky regions, and we thus set Ο΅ to 0.9. The parameter settings are summarized in Table 3.

Table 3. Parameter settings

ID  Name    Default value  Function                                    Section
1   k       2              Number of neighboring images                5.1
2   Ξ³       0.6            Weight of regularization term in Eq.(1)     5.1
3   Ξ»_occ   2              Occlusion penalty                           5.1
4   Ξ»_err   4              Free-space violation penalty                5.1
5   Ξ»_dis   2              Plane discontinuity penalty                 5.1
6   ΞΌ       0.6            Relaxation parameter of structure priors    5.1
7   ρ       0.2            Weight of high-level image features         5.1
8   Ξ΄       0.5            Truncation threshold of color difference    5.1
9   Ο΅       0.9            Threshold of semantic regions               5.2
10  Ο‰       0.5            Weight of regularization term in Eq.(10)    6

7.2 Evaluation criteria

(1) Reconstruction accuracy

We adopt the following criteria to evaluate the reliability of the reconstructed 3D points and planes.

1) Reliable 3D points: the 3D points P_m and P_n corresponding to pixels m ∈ I_r and n ∈ I_i (i = 1, 2, β‹―, k), respectively, are considered the same 3D point, reliable for pixel m ∈ I_r, only when the relative difference (d(P_m) βˆ’ d(P_n))/d(P_m) between the depths d(P_m) and d(P_n) with respect to image I_r is less than a prespecified threshold (set to 0.2 in this paper).

2) Reliable planes: considering the reconstructed 3D points of all pixels in superpixel s ∈ I_r, the plane associated with superpixel s is considered reliable only when the percentage of reliable 3D points is greater than a prespecified threshold (set to 0.8 in this paper).

Based on the above definitions, we adopt the point accuracy M1 and the plane accuracy M2 to comprehensively measure the accuracy of the scene reconstruction. Here, M1 denotes the ratio of reliable reconstructed 3D points to all reconstructed 3D points, and M2 denotes the number of reliable planes.

(2) Method comparison

To further evaluate the performance of the proposed method, we also conducted comparison experiments with the state-of-the-art methods [14] and [15]. These two methods have similar pipelines, including image over-segmentation, candidate plane generation and scene structure inference. The main differences are the density of the initial 3D points, the methods of generating candidate planes, and the construction of the energy function used to infer the scene structures. For more details, please refer to the related papers. For the convenience of experimental comparison, we marginally adjusted some implementation details of the above two methods: (1) for the current image, five groups of superpixels were generated using the Mean-shift method with different parameters and used to reconstruct the corresponding scene structures, and the best results were used for comparison with the other methods; (2) the sky and ground regions were detected and filtered out (see Section 5.2); and (3) unreliable planes were filtered out for visual comparison.
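The point accuracy M1 and the reliable-plane count M2 described above can be sketched as follows; here the absolute relative depth difference is used, and the list-of-depth-pairs layout is an assumption for illustration:

```python
def is_reliable_point(d_m, d_n, tau=0.2):
    """A 3D point is reliable when the relative depth difference between
    its reconstructions in the reference and a neighboring image is
    below the threshold tau (0.2 in the paper)."""
    return abs(d_m - d_n) / d_m < tau

def m1_m2(planes, tau=0.2, plane_tau=0.8):
    """planes: list of per-plane lists of (d_m, d_n) depth pairs.
    Returns M1 (fraction of reliable points over all points) and
    M2 (number of planes whose reliable-point ratio exceeds plane_tau)."""
    flags = [[is_reliable_point(dm, dn, tau) for dm, dn in pts]
             for pts in planes]
    total = sum(len(f) for f in flags)
    m1 = sum(map(sum, flags)) / total
    m2 = sum(1 for f in flags if sum(f) / len(f) > plane_tau)
    return m1, m2
```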

7.3 Results and analysis

The proposed method focuses on jointly optimizing superpixels and their associated planes by incorporating scene structure priors. The initializations for the different data sets are presented in Table 4. In general, it is difficult to over-segment the current image into regions (or superpixels) appropriately consistent with the scene structures. Therefore, as indicated in Figure 10(a), we over-segmented the current image at a larger threshold. Consequently, as indicated in Figure 10(b), only a small number of reliable planes associated with the superpixels could be determined, because the majority of superpixels straddled two or more planes and could not be modeled as single planes.

Table 4. Initializations

Data sets   3D points  Superpixels  Lines  Planes
LSB         6636       3788         1704   28
TS          9265       3706         2694   102
Valbonne    561        360          362    17
Wadham      2120       1243         838    38
City#1      2234       2793         1588   11
City#2      1503       2643         1297   7

Based on these initial reliable planes, as indicated in Figure 10(c), Algorithm 1 resegmented inaccurate superpixels according to the plane assignment cost and simultaneously optimized their associated planes. In the meantime, unrelated regions (e.g., sky, ground) were reliably filtered out, which significantly improved the overall efficiency and the visual results.

Table 5. Results of different methods

Data sets   Proposed method                                                       Method [14]   Method [15]
            SRP  SP    PL    M1(Hini)  M1(Hopt)  M2(H)  M1(Sini)  M1(Sopt)  M2(S) M1      M2    M1      M2
LSB         292  6592  1107  0.5509    0.7888    15     0.6629    0.8913    18    0.4319  3     0.6135  5
TS          182  9896  2112  0.5250    0.6991    19     0.6189    0.7991    27    0.3196  11    0.5001  13
Valbonne    30   1940  156   0.3987    0.5738    5      0.5753    0.8256    9     0.5067  7     0.6418  6
Wadham      85   7113  409   0.6711    0.7893    11     0.7648    0.8796    12    0.3991  7     0.6791  11
City#1      27   8608  3761  0.3675    0.7306    5      0.5164    0.7829    7     0.3284  7     0.4398  6
City#2      41   7473  2537  0.4849    0.6756    6      0.6087    0.7591    6     0.2987  5     0.5736  6

Table 5 shows the corresponding quantitative results. Here, SRP denotes the number of initial superpixels with a reliable plane, and SP and PL denote the numbers of superpixels (including superpixels and sub-superpixels) and planes produced by Algorithm 1, respectively. To compare the performance of traditional hard regularizations with the soft regularization defined in Eq.(5), we conducted experiments using both regularizations (the hard regularization is constructed using the first and third terms in Eq.(5), with other conditions being equal). For the hard regularization, M1(Hini), M1(Hopt) and M2(H) denote the accuracy of the initial scene structures produced by Algorithm 1, the accuracy of the scene structures optimized under the MRF framework, and the number of reconstructed planes, respectively. Similarly, M1(Sini), M1(Sopt) and M2(S) denote the results produced by the soft regularization.

From Table 5, it can be clearly seen that, because the initial plane assignments are basically reliable, the global plane assignment optimization produces superior results, as indicated in Figures 10(d) and 11(a). Moreover, the soft regularization defined in Eq.(5) achieves higher accuracy than the hard regularization because it describes the scene structures more effectively. Figure 11(b) displays the superpixels corresponding to the optimized planes; it can be observed that the proposed method not only reconstructs the major structures of the scenes but also performs well in reconstructing details (e.g., the windows of the Wadham scene) and the boundaries between different planes (e.g., the boundaries in the rectangles).


Figure 10. Results on standard data sets. (a) Initial superpixels; (b) initial reliable planes; (c) textured structures corresponding to plane assignments produced by Algorithm 1; (d) textured structures corresponding to globally optimized plane assignments.

Method [14] assumes that the initial superpixels contain sufficient 3D points to generate the corresponding candidate planes; however, this assumption frequently leads to larger errors in regions (e.g., poorly textured regions) where the initial 3D points are not evenly distributed. For example, the specific superpixels indicated in Figure 11(c) cannot be assigned reliable planes. Method [15] first produces dense 3D points by matching DAISY feature descriptors. This preprocessing not only generates relatively complete candidate planes but also helps construct stronger constraints that enhance the reliability of the subsequent plane inference and optimization; the method can therefore achieve improved results. However, it cannot reliably reconstruct the boundaries between the planes (e.g., the boundaries in the rectangles in Figure 11(d)).

Regarding the efficiency of each method, as listed in Table 6, the proposed method performed relatively quickly at the multi-plane fitting and line segment detection stages; however, it consumed considerable time in extracting the high-level image features and resegmenting the superpixels. Nevertheless, the plane assignment process was faster because of the reliable candidate planes generated under the guidance of the angle priors. Further, the optimization process required less time owing to the improved initialization derived from the initial plane assignment produced by Algorithm 1. Method [14] performed at a speed similar to the proposed method, except for its time-consuming candidate plane generation, because it avoids computing photo-consistency measurements across multiple images. Method [15] performed slowly owing to per-pixel feature matching.


Figure 11. Results of different methods (different colors denote different reliable planes). (a) Top view of Figure 10(d); (b) results produced by proposed method; (c) results produced by method [14]; (d) results produced by method [15].

Our own data sets were used to evaluate the adaptability of the proposed method. For the scenes displayed in Figure 7, because the cameras were far from the buildings, the captured images typically contained multiple different types of building regions that were relatively small compared with the unrelated regions (e.g., sky, ground). In our experiments, when the unrelated regions were not detected and filtered out, method [14] failed unexpectedly and method [15] produced fewer reliable planes.


Figure 12. Results on our own data sets. (a) Initial superpixels; (b) initial reliable planes; (c) textured structures corresponding to plane assignments produced by Algorithm 1; (d) textured structures corresponding to globally optimized plane assignments.


Figure 13. Results of different methods (different colors denote different reliable planes). (a) Top view of Figure 12(d); (b) results produced by proposed method; (c) results produced by method [14]; (d) results produced by method [15]; (e) close-ups of the rectangles in (b), (c) and (d) (City#1); (f) close-ups of the rectangles in (b), (c) and (d) (City#2).

In this experiment, the proposed method also produced only a few reliable planes during the initial phase. However, Algorithm 1 performed well because structure priors also commonly exist between different buildings in urban scenes; this guarantees the reliability of the subsequent plane assignment optimization (see Figures 12(d) and 13(a)). In particular, slanted surfaces were also reliably reconstructed (e.g., the planes in the rectangles in Figure 13(b)). Methods [14] and [15] also performed well after the unrelated regions were filtered out. However, they failed to solve common problems such as incomplete reconstruction and inaccurately reconstructed boundaries (e.g., the boundaries in the rectangles in Figures 13(c) and 13(d)). Note that, as shown in Table 5 and Table 6, compared with the standard data sets, all three methods generally achieved lower accuracy and efficiency on our own data sets. The main reasons are: (1) more interference (e.g., illumination variations, a larger distance between the camera and the buildings) reduces the reliability of the reconstructed 3D points and planes; (2) repetitive textures and structures lead to more small sub-superpixels when resegmenting inaccurate superpixels, and these sub-superpixels influence the reliability and efficiency of the plane assignment cost and of the entire scene reconstruction because of the matching ambiguity of the superpixels.

Table 6. Computation time (seconds) of different methods

Data sets   Proposed method                                                          Method [14]  Method [15]
            Initial      Line       Initial  Initial     Global        Total
            superpixels  detection  planes   structures  optimization  time
LSB         4.7          12.7       4.3      56.8        2.9           81.4    94.4     682.6
TS          3.9          22.1       6.9      68.1        3.7           105.7   130.8    883.4
Valbonne    1.1          4.9        2.5      17.8        0.8           27.1    34.9     243.2
Wadham      2.4          7.6        4.1      34.2        1.1           49.4    71.2     341.0
City#1      3.7          8.7        10.7     61.8        2.4           87.3    98.7     577.2
City#2      4.2          5.5        9.7      74.4        3.1           96.9    107.5    498.9

Given the initial sparse 3D points of a scene, by incorporating scene structure priors and high-level image features, the proposed method can effectively overcome the influence of inaccurate superpixels, incomplete candidate planes and unreliable regularizations on the reconstruction, and completely reconstruct the piecewise planar structures of the scene with high accuracy and efficiency.

8 Conclusions

For the piecewise planar reconstruction of urban scenes, traditional methods frequently suffer from uncertain factors such as sparse initial 3D points, inaccurate image over-segmentation, incomplete candidate planes and unreliable regularization terms. To address these problems, this paper constructed an effective plane assignment cost based on scene structure priors and high-level image features obtained by a CNN. It then jointly optimized the superpixels and their associated planes, followed by a global optimization of the initial scene structures under the MRF framework. Experimental results confirm that the proposed method performs well on both the standard and our own data sets, with high accuracy and efficiency.

The limitations of the proposed method are as follows. (1) Although high-level image features help to improve the reliability of the reconstruction, they also incur a high computational load. (2) Curved surfaces with poor textures cannot be reliably reconstructed because they are beyond the scope of the prespecified scene structure priors. In future work, we will further explore efficient image patch matching methods and extract more scene structure priors (e.g., specific objects and geometric shapes) from the initial 3D points and images based on CNNs. Finally, the accuracy and efficiency of the scene reconstruction are expected to improve under a higher-order MRF framework that incorporates more scene structure priors.

Acknowledgements This work is supported in part by the National Key R&D Program of China (2016YFB0502002), and in part by the National Natural Science Foundation of China (61772444, 61421004, 61873264), the Natural Science Foundation of Henan Province (162300410347), the Key Scientific and Technological Project of Henan Province (162102310589, 192102210279).

References

[1] Feng W, Jia J, Liu Z. Self-validated labeling of Markov random fields for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010, 32(10):1871-1887.
[2] Li L, Feng W, Wan L, Zhang J. Maximum cohesive grid of superpixels for fast object localization. In: Proceedings of Computer Vision and Pattern Recognition, 2013, pp.3174-3181.
[3] Wang W, Gao W, Hu Z Y. Effectively modeling piecewise planar urban scenes based on structure priors and CNN. SCIENCE CHINA Information Sciences, 2019, 62(2):029102.
[4] Comaniciu D, Meer P. Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002, 24(5):603-619.
[5] Achanta R, Shaji A, Smith K, Lucchi A, Fua P, SΓΌsstrunk S. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 34(11):2274-2282.
[6] Γ‡Δ±ΔŸla C, Zabulis X, Alatan A A. Region-based dense depth extraction from multi-view video. In: Proceedings of IEEE ICIP, 2007, pp.213-216.
[7] Furukawa Y, Curless B, Seitz S M. Manhattan-world stereo. In: Proceedings of Computer Vision and Pattern Recognition, 2009, pp.1422-1429.
[8] Furukawa Y, Ponce J. Accurate, dense, and robust multi-view stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, 32(8):1362-1376.
[9] Gallup D, Frahm J M, Mordohai P. Real-time plane-sweeping stereo with multiple sweeping directions. In: Proceedings of Computer Vision and Pattern Recognition, 2007, pp.1-8.
[10] Mičuőík B, KoőeckÑ J. Multi-view superpixel stereo in urban environments. International Journal of Computer Vision, 2010, 89(1):106-119.
[11] Sinha S N, Steedly D, Szeliski R. Piecewise planar stereo for image-based rendering. In: Proceedings of the 12th International Conference on Computer Vision, 2009, pp.1881-1888.
[12] Chauve A L, Labatut P, Pons J P. Robust piecewise-planar 3D reconstruction and completion from large-scale unstructured point data. In: Proceedings of Computer Vision and Pattern Recognition, 2010, pp.1261-1268.
[13] Jiao Z, Liu T, Zhu X. Robust piecewise planar stereo with modified segmentation cues in urban scenes. In: Proceedings of International Conference on Multimedia Technology, 2011, pp.698-701.
[14] Bodis-Szomoru A, Riemenschneider H, Van Gool L. Fast, approximate piecewise-planar modeling based on sparse structure-from-motion and superpixels. In: Proceedings of Computer Vision and Pattern Recognition, 2014, pp.469-476.
[15] Verleysen C, Vleeschouwer C D. Piecewise-planar 3D approximation from wide-baseline stereo. In: Proceedings of Computer Vision and Pattern Recognition, 2016, pp.3327-3336.
[16] Tola E, Lepetit V, Fua P. DAISY: an efficient dense descriptor applied to wide-baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010, 32(5):815-830.
[17] Wang W, Ren G, Chen L, Zhang X. Piecewise planar urban scene reconstruction based on structure priors and cooperative optimization. http://kns.cnki.net/kcms/detail/11.2109.TP.20181007.2353.012.html, 2018.
[18] Feng W, Liu Z, Wan L. A spectral-multiplicity-tolerant approach to robust graph matching. Pattern Recognition, 2013, 46(10):2819-2829.
[19] Liang Z, Feng Y, Guo Y, Liu H, Chen W, Qiao L, Zhou L, Zhang J. Learning for disparity estimation through feature constancy. In: Proceedings of Computer Vision and Pattern Recognition, 2018.
[20] Jie Z, Wang P, Ling Y, Zhao B, Wei Y, Feng J, Liu W. Left-right comparative recurrent model for stereo matching. In: Proceedings of Computer Vision and Pattern Recognition, 2018.
[21] Park M G, Yoon K J. Learning and selecting confidence measures for robust stereo matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[22] Zbontar J, LeCun Y. Computing the stereo matching cost with a convolutional neural network. In: Proceedings of Computer Vision and Pattern Recognition, 2015, pp.1592-1599.
[23] Chen Z, Sun X, Wang L, Yu Y, Huang C. A deep visual correspondence embedding model for stereo matching costs. In: Proceedings of Computer Vision and Pattern Recognition, 2015, pp.972-980.
[24] Fischer P, Dosovitskiy A, Brox T. Descriptor matching with convolutional neural networks: a comparison to SIFT. Computer Science, 2015.
[25] Luo W, Schwing A G, Urtasun R. Efficient deep learning for stereo matching. In: Proceedings of Computer Vision and Pattern Recognition, 2016, pp.5695-5703.
[26] Zagoruyko S, Komodakis N. Learning to compare image patches via convolutional neural networks. In: Proceedings of Computer Vision and Pattern Recognition, 2015, pp.4353-4361.
[27] Shi Y, Xu K, Niessner M, Rusinkiewicz S, Funkhouser T. PlaneMatch: patch coplanarity prediction for robust RGB-D reconstruction. In: Proceedings of European Conference on Computer Vision, 2018.
[28] Shrestha R, Tian F, Feng W, Tan P, Vaughan R. Learned map prediction for enhanced mobile robot exploration. In: Proceedings of International Conference on Robotics and Automation, 2019.
[29] Pham T T, Chin T J, Yu J, Suter D. The random cluster model for robust geometric fitting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 36(8):1658-1671.
[30] Hartley R, Zisserman A. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
[31] Chatfield K, Simonyan K, Vedaldi A, Zisserman A. Return of the devil in the details: delving deep into convolutional nets. In: Proceedings of British Machine Vision Conference, 2014.
[32] Jensen R, Dahl A, Vogiatzis G, Tola E, AanΓ¦s H. Large scale multi-view stereopsis evaluation. In: Proceedings of Computer Vision and Pattern Recognition, 2014.
33 [Online]: http://vision.ia.ac.cn/data/index.html. 34 Zhao H, Shi J, Qi X, Wang X, Jia J. Pyramid scene parsing network. In: Proceedings of Computer Vision and Pattern Recognition, 2017. 35 Gorelick L, Boykov Y, Veksler O, Ayed I B, Delong A. Submodularization for binary pairwise energies. In Proc. of Computer Vision and Pattern Recogition, 2016, pp(99):1-1. 36 [Online]: http://www.robots.ox.ac.uk/~vgg/data/data-mview.html.

Wei Wang received his Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences in 2015. He is currently an associate professor at the School of Network Engineering, Zhoukou Normal University. His research interests include computer vision, machine learning and 3D reconstruction.

Wei Gao received his Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences in 2008. He is currently an associate professor at the Institute of Automation, Chinese Academy of Sciences. His research interests include computer vision and 3D reconstruction.

Zhanyi Hu was born in 1961. He received his B.S. degree in automation from the North China University of Technology in 1985, and the Ph.D. degree (Docteur d'Etat) in computer vision from the University of Liege, Belgium, in 1993. He is currently a professor at the Institute of Automation, Chinese Academy of Sciences.