Highlights
1. An effective plane assignment cost for modeling urban scenes is proposed, based on sparse 3D points, scene structure priors, and high-level image features obtained by a Convolutional Neural Network.
2. A new piecewise planar stereo method that jointly optimizes image regions and their associated planes is proposed. The method can effectively reconstruct urban scenes from only sparse 3D points.
3. The problems commonly existing in traditional piecewise planar stereo methods (inaccurate image over-segmentation, incomplete candidate planes, and unreliable regularization) are effectively resolved.
Effective Piecewise Planar Modeling Based on Sparse 3D Points and Convolutional Neural Network

Wei Wang*, Wei Gao, Zhanyi Hu
* Corresponding author. E-mail: [email protected]

Abstract: Piecewise planar stereo methods can approximately reconstruct the complete structures of a scene by overcoming challenging difficulties (e.g., poorly textured regions) that pixel-level stereo methods cannot resolve. In this paper, a novel plane assignment cost is first constructed by incorporating scene structure priors and high-level image features obtained by a Convolutional Neural Network (CNN). Then, the piecewise planar scene structures are reconstructed in a progressive manner that jointly optimizes image regions (or superpixels) and their associated planes, followed by a global plane assignment optimization under a Markov Random Field (MRF) framework. Experimental results on a variety of urban scenes confirm that the proposed method can effectively reconstruct the complete structures of a scene from only sparse three-dimensional (3D) points with high efficiency and accuracy, and can achieve superior results compared with state-of-the-art methods.

Keywords: urban scene, piecewise planar stereo, Markov Random Field, image over-segmentation, Convolutional Neural Network
1 Introduction
Piecewise planar stereo methods can approximately reconstruct the complete structures of a scene, where higher-level planarity priors significantly help overcome the challenging difficulties (e.g., poorly textured regions) that pixel-level stereo methods cannot resolve. In general, piecewise planar stereo methods have three basic steps: (1) over-segmenting the image into several non-overlapping regions (i.e., superpixels); (2) generating candidate planes from initial data (e.g., three-dimensional (3D) points); (3) assigning the optimal plane to each superpixel using a global method to produce the piecewise planar model of the scene. In fact, such methods can be unreliable and inefficient for the following reasons. (1) It can be difficult to produce complete candidate planes from initial sparse or dense 3D points, which can lead to larger errors in modeling scenes. As indicated in Figure 1(c), only three candidate planes (including two reliable planes) are generated from the initial sparse 3D points using a state-of-the-art multi-model fitting method, and they are not sufficient to describe the initial scene structures. As a result, as indicated in Figures 1(d) and 1(f), the current superpixel indicated in Figures 1(a) and 1(b) is assigned to a false plane because the real plane is not among the candidate planes. (2) In assigning the optimal plane to the current superpixel, the plane assignment cost is frequently constructed from low-level image features (e.g., gray values), 3D point visibility constraints, and the assumption that two neighboring superpixels with similar features lie in the same plane. However, in specific cases, low-level image features are not sufficiently robust to overcome interferences (e.g., matching ambiguity). Moreover, two planes associated with two superpixels with similar features are not necessarily the same. These factors can also incur an unreliable scene reconstruction. For example, as indicated in Figures 1(d)-1(f), errors are caused by forcing the two superpixels indicated in Figures 1(a) and 1(b) (whose corresponding scene patches actually lie in different planes) to take the same plane from the incomplete candidate set. (3) It is frequently difficult to determine the optimal parameters for image over-segmentation methods based only on low-level features so that the resulting superpixels are consistent with scene structures. As indicated in Figures 1(g)-1(i), superpixels of larger size can straddle two or more planes with larger depth changes and cannot be reasonably modeled by single planes, whereas superpixels of smaller size (consider the extreme case of a superpixel containing only one pixel) can lead to the matching ambiguity commonly existing in traditional pixel-level stereo methods. The accuracy of superpixels can be improved using existing optimization methods; for example, Feng et al. [1] used a split-and-merge strategy to automatically produce spatially coherent image segmentations, and Li et al. [2] regularized arbitrary superpixels into a maximum cohesive grid structure via cascade dynamic programming. However, these methods cannot guarantee boundaries consistent with scene structures because they lack the corresponding spatial information (e.g., 3D points and planes). (4) Unrelated regions (e.g., sky, ground) cannot be effectively detected and filtered out, which frequently reduces the efficiency of the reconstruction.
To overcome the above problems, in this paper we extend our investigation, started in [3], of modeling piecewise planar urban scenes based on scene structure priors and a Convolutional Neural Network (CNN). We discuss in detail the performance of guiding the reconstruction process using scene structure priors and of improving the plane assignment reliability using high-level image features obtained by the CNN. Finally, by jointly optimizing the superpixels and their associated planes, and globally optimizing the resulting plane assignments under the Markov Random Field (MRF) framework, we analyze the overall performance of the proposed method on a wide variety of urban scene data sets, and confirm that it can completely model a scene from only sparse 3D points with high efficiency and accuracy.
Figure 1. Problems existing in traditional methods. (a) 2D projected points (black) from the initial 3D points, current superpixel s (red), and its neighboring superpixels with reliable planes (green); (b) close-up of the superpixels in the white rectangle in (a); (c) incomplete candidate planes (black: top view of initial 3D points; red: reliable planes; white: unreliable planes); (d) plane assignment for superpixel s and its neighboring superpixels using a hard regularization; (e)-(f) top view and close-up (two neighboring superpixels corresponding to different real planes are assigned the same plane); (g)-(h) superpixels produced by the Mean-shift method [4] are basically consistent with scene structures, but specific superpixels straddle two or more planes; by contrast, superpixels produced by the SLIC method [5] have a uniform size but can incur matching ambiguity (e.g., small superpixels in the sky region have similar color features); (i) sample superpixels and close-ups (solid and dashed rectangles denote superpixels produced by the Mean-shift and SLIC methods, respectively).
2 Related work
Our method is related to multi-view piecewise planar stereo and CNN-based stereo matching; a short review of each follows.
2.1 Multi-view piecewise planar stereo
Traditional exhaustive plane sweeping methods [6] tend to directly determine the optimal planes associated with superpixels in a large search space, which usually leads to high computational complexity and low reliability. Furukawa et al. [7] obtained a set of candidate planes along three orthogonal scene directions (i.e., the Manhattan-world model) based on initial oriented 3D points obtained from the PMVS (Patch-based Multi-view Stereo) method [8], and then assigned each pixel to an optimal plane by pixel-wise plane labeling under the MRF framework. As a result, the method is not suitable for complex scenes with more than three scene directions. Similarly, Gallup et al. [9] extended traditional plane sweeping to perform multiple plane sweeps, where the sweeping directions were aligned to the expected surface normals of the scene. Clearly, such a method is not robust to complex scenes, because only a few sweeping directions are involved. Mičušík et al. [10] restricted scene directions through vanishing points and performed a superpixel-based dense reconstruction of urban scenes. However, the method can erroneously suppress a slanted plane whose normal vector is not consistent with the predefined main directions. In general, simply restricting or specifying scene structures (e.g., the number of scene directions) in advance is unsuitable for reconstructing complex scenes. Sinha et al. [11] used sparse 3D points and lines to generate candidate planes and then recovered a piecewise planar depth map under the MRF framework. However, this method may ignore some real planes because of the sparsity of the initial 3D points and lines. Chauve et al. [12] extracted all possible planes from unstructured 3D points using a region growing approach and then formulated piecewise planar reconstruction as a labeling of 3D space into empty or occupied regions. In fact, this method may not be robust to noisy 3D points, as region growing can easily be trapped in wrong solutions. Jiao et al. [13] first generated candidate planes from quasi-dense 3D points and then assigned the optimal plane to each superpixel under the MRF framework, where the contour of each superpixel is modified in advance to be consistent with scene structures. This method is related to ours; however, it may not be robust to sparse 3D points that are insufficient to generate complete candidate planes. Bodis-Szomoru et al. [14] proposed a piecewise planar modeling method based on sparse 3D points and superpixels to generate an approximate model of the scene. Although the method is fast, it may be unreliable because it assumes that each superpixel is large enough to contain an adequate number of 3D points for plane fitting. In fact, a scene patch with larger depth discontinuities corresponding to a superpixel of large size cannot be reasonably modeled by a single plane. Verleysen et al. [15] generated dense 3D points by matching DAISY descriptors [16] across two wide-baseline images and then extracted candidate planes from the dense 3D points to perform a piecewise planar reconstruction under the MRF framework. In general, that method can produce better results because dense 3D points implicitly contain more information about scene structures (e.g., they allow generating relatively complete candidate planes and constructing stronger constraints for plane inference). However, because of the time-consuming stereo matching, the method usually incurs high computational complexity. In our previous work [17], we proposed to cooperatively optimize image regions and their associated planes based on scene structure priors (e.g., plane intersection angles); however, such an optimization frequently fails to reach a globally optimal solution and thus produces errors in reconstructing small plane patches.

2.2 CNN-based stereo matching
Stereo matching (i.e., establishing point correspondences between different images) is one of the most fundamental tasks in image-based 3D reconstruction and has remained an active research topic in recent years. For example, Feng et al. [18] proposed a spectral-multiplicity-tolerant method for attributed graph matching by posing the general graph matching problem as alternately optimizing a multiplicity matrix and a vertex permutation matrix. Such a method can be applied to the point matching problem but may not be robust for region matching when noise and illumination variation are involved. To improve the accuracy of stereo matching, Liang et al. [19] proposed a new network architecture that seamlessly integrates the four steps of stereo matching (i.e., matching cost calculation, matching cost aggregation, disparity calculation, and disparity refinement). In this network, a feature constancy (feature correlation and reconstruction error) constraint is introduced to refine the initial disparity by a
notable margin. Similarly, Jie et al. [20] proposed a novel left-right comparative recurrent model that performs left-right consistency checking jointly with disparity estimation. Such a method employs a soft attention mechanism to guide the model to selectively focus on refining the unreliable regions at each recurrent step (i.e., disparity estimation and online left-right comparison). In fact, the left-right consistency measure may be unreliable because consistently matched pixels are not always correct, and thus it cannot detect all incorrect matches. To address this issue, Park et al. [21] used the random forest framework to select effective confidence measures depending on the characteristics of the training data and matching strategies, and then adopted the selected confidence measures to build a better confidence prediction model that improves the robustness and accuracy of traditional stereo matching methods. In general, these methods are effective for estimating the disparity from a rectified stereo pair of images; however, they may be unreliable for large-scale wide-baseline images. Recently, with the development of deep learning theories and methods, CNN-based stereo matching [22,23] has increasingly become a hot research topic in computer vision. Fischer et al. [24] found that high-level image features obtained by a CNN in a supervised, and especially in an unsupervised, learning manner outperform traditional feature descriptors (e.g., Scale Invariant Feature Transform, SIFT) in stereo matching. Zbontar et al. [22] utilized a CNN to evaluate the visual similarity between a pair of image patches and then obtained an accurate disparity/depth map in a global optimization manner. However, this method can incur high computational complexity because of the time-consuming convolution computation. To address this problem, Chen et al. [23] first extracted the features of a pair of image patches at different scales and then obtained the matching scores by an inner product; the scores from different scales are merged into an ensemble. In addition, the method groups multiple inner product operations into a matrix operation for further acceleration. Luo et al. [25] designed a product layer that simply computes the inner product between the two representations of a Siamese architecture and produced better matching results in less than a second of GPU (Graphics Processing Unit) computation. Zagoruyko et al. [26] explored methods that directly learn a general similarity function for comparing image patches, and
proposed multiple neural network architectures for this purpose. Shi et al. [27] learned an RGB-D patch descriptor using a deep convolutional neural network that takes in the color, depth, normal, and multi-scale context information of a planar patch in an image. Such descriptors can be used to predict whether two RGB-D patches from different frames are coplanar in SLAM reconstruction. Recently, Shrestha et al. [28] employed a generative neural network to predict unknown regions of a partially explored 2D map in indoor environments and used the resulting prediction to enhance exploration in an information-theoretic manner. Our work is closely related to the two-channel networks in [26]. However, in comparison with the depth map estimation in [26], our method mainly focuses on measuring the reliability of a plane in multi-view piecewise planar stereo.
3 Overview and contributions
In some cases, as discussed in Section 1, traditional piecewise planar stereo methods can be unreliable and inefficient in reconstructing complex urban scenes due to four factors: inaccurate image over-segmentation, incomplete candidate planes, unreliable plane optimization, and unnecessary reconstruction.

Input: 1. Sparse 3D points; 2. Images with calibrated camera parameters.
Pre-processing: 1. Over-segment images to generate superpixels (or image regions). 2. Detect line segments and refine superpixels using line segments. 3. Generate initial candidate planes via multi-plane fitting. 4. Estimate the vertical scene direction to detect unrelated regions (e.g., the ground).
Jointly optimize superpixels and their related planes: 1. Generate initial reliable planes from the initial candidate planes. 2. Resegment superpixels at a smaller threshold. 3. Generate candidate planes based on structure priors for each superpixel. 4. Assign the optimal plane to each superpixel.
Globally optimize plane structures and output: 1. Globally optimize the planes associated with superpixels under the MRF framework. 2. Output the optimized piecewise planar model.
Figure 2. Flowchart of the proposed method
In general, inaccurate superpixels can be further over-segmented or simply regularized using the line segments detected in the current image; the goal is to make them consistent with scene structures. Moreover, incomplete candidate planes and unreliable plane optimization can be improved by incorporating scene structure priors and high-level image features. In fact, urban scenes contain multi-plane structures (i.e., the piecewise planar prior: pixels of similar appearance are more likely to belong to the same plane), and the component planes have strong structural regularities (i.e., the angle prior: the angles between planes usually take fixed values such as 90°). These priors significantly help to guide the reconstruction process toward better results and to improve the overall efficiency of the reconstruction (e.g., by filtering out unrelated regions to avoid unnecessary reconstruction). Based on the above analysis, given sparse 3D points reconstructed from calibrated images (i.e., the corresponding extrinsic and intrinsic camera parameters have been obtained by existing methods such as Structure from Motion pipelines), this paper presents an effective piecewise planar stereo method based on structure priors and CNN to reliably model an urban scene with high efficiency and accuracy. The flowchart is outlined in Figure 2, and each component is elaborated in the subsequent sections. The main contributions of this work can be summarized as follows: (1) We utilize scene structure priors and high-level image features to overcome the influence of inaccurate image over-segmentation, incomplete candidate planes, and unreliable regularization, and thus enhance the reliability of the plane assignment. (2) We propose a new piecewise planar stereo method that jointly optimizes superpixels and their associated planes by incorporating low-level and high-level image features, 3D point visibility constraints, and scene structure priors. The method can effectively reconstruct a scene from only sparse 3D points. (3) We propose an effective method to detect and filter out unrelated regions (e.g., sky, ground) during the reconstruction process, which significantly improves the efficiency of the entire reconstruction.
4 Preprocessing
According to the piecewise planar prior, the current image is first over-segmented into a set of superpixels (denoted as $S_0$) using the Mean-shift method (or other similar methods). Then, as indicated in Figures 3(a) and 3(b), the initial superpixels are further resegmented into sub-superpixels according to the lines determined by the line segments detected in the current image [15]. Further, initial candidate planes (denoted as $H_0$) are generated from the initial 3D points using the multi-model fitting method [29]. The method [29] aims to find a small number of planes (i.e., dominant planes) that best explain the whole set of 3D points, and adopts a mutually reinforcing manner that alternately performs candidate plane generation and fitting optimization (instead of exhaustively sampling random 3D points to generate candidate planes). In general, the method can generate relatively reliable planes from the initial 3D points; however, when the initial 3D points are too sparse, the fitted planes are usually insufficient to depict the complete scene structures and also contain many outliers (e.g., several 3D points lying in two or more real planes are wrongly fitted together). Thus, the set $H_0$ is incomplete and typically contains many plane outliers. In the proposed method, the scene's vertical direction is also estimated using vanishing point detection methods [30] based on the detected line segments. Moreover, for a hand-held camera with calibrated parameters, as indicated in Figure 3(d), the ground can be estimated according to the height of the camera (e.g., 1.7 m). Note that the vertical direction and the ground are used to generate candidate planes and to eliminate unrelated regions (see Section 5.2), respectively. A simplified sketch of the multi-plane fitting step is given below.
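To make the multi-plane fitting step concrete, a minimal NumPy sketch follows (Python is used here only for brevity; the experiments in Section 7 are implemented in parallel C++). The method of [29] alternates candidate generation and fitting optimization; the greedy sequential RANSAC below is only a simplified stand-in for it, and every name and threshold (fit_plane, greedy_multi_plane_fitting, dist_thresh, min_inliers) is an illustrative assumption, not the interface of [29].

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane through 3D points: returns (n, d) with n a unit
    normal and n . X + d = 0 for points X on the plane."""
    centroid = points.mean(axis=0)
    # The singular vector of the centered points with the smallest singular
    # value is the plane normal.
    _, _, vt = np.linalg.svd(points - centroid)
    n = vt[-1]
    return n, -float(n @ centroid)

def greedy_multi_plane_fitting(points, dist_thresh=0.05, min_inliers=30,
                               n_trials=500, seed=0):
    """Simplified stand-in for the multi-model fitting of [29]: repeatedly
    RANSAC one dominant plane, remove its inliers, and continue. With very
    sparse points the returned set H0 is incomplete, as noted in the text."""
    rng = np.random.default_rng(seed)
    remaining = points.copy()
    planes = []
    while len(remaining) >= min_inliers:
        best_mask, best_count = None, 0
        for _ in range(n_trials):
            sample = remaining[rng.choice(len(remaining), 3, replace=False)]
            n, d = fit_plane(sample)
            mask = np.abs(remaining @ n + d) < dist_thresh
            if mask.sum() > best_count:
                best_mask, best_count = mask, int(mask.sum())
        if best_count < min_inliers:
            break
        planes.append(fit_plane(remaining[best_mask]))  # refit on all inliers
        remaining = remaining[~best_mask]
    return planes  # the initial candidate set H0
```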
Figure 3. Superpixel resegmentation. (a) Line segment (red) detection; (b) superpixel resegmentation using line segments; (c) superpixel resegmentation at a smaller threshold; for the current superpixel (left), the scene patches corresponding to its two parts on either side of the dotted line actually lie in different planes, so the resegmented sub-superpixels (right) will be assigned to more reliable planes separately (see Section 5.2), instead of one plane being assigned to the whole superpixel (left); (d) vertical direction (white arrow), ground (grid), and camera (red point).
5 Jointly optimizing superpixels and their associated planes
In this section, we first introduce the plane assignment cost incorporating low-level and high-level image features, and then discuss the method that jointly optimizes superpixels and their associated planes.
5.1 Plane assignment cost
Given the current image $I_c$ and its neighboring images $\{I_j\}\ (j = 1, 2, \cdots, m)$, we define the following cost of assigning a plane $H_s$ to a superpixel $s \in I_c$:

$$E(s, H_s) = E_{data}(s, H_s) + \gamma \cdot \sum_{t \in \mathcal{R}(s)} E_{regular}(H_s, H_t), \tag{1}$$
where $E_{data}(s, H_s)$ and $E_{regular}(H_s, H_t)$ denote the data and regularization terms, respectively, and $\mathcal{R}(s)$ denotes the set of reliable superpixels (i.e., superpixels that have already been assigned reliable planes); $\gamma$ is the weight of the regularization term.

(1) Data term
The data term $E_{data}(s, H_s)$ is evaluated by incorporating low-level and high-level image features, and is formally defined as

$$E_{data}(s, H_s) = E_{pho}(s, H_s) + \omega \cdot E_{cnn}(s, H_s). \tag{2}$$
In Eq. (2), $E_{pho}(s, H_s)$ encodes low-level image features and 3D point visibility constraints, namely,

$$E_{pho}(s, H_s) = \frac{1}{m \cdot |s|} \sum_{j=1}^{m} \sum_{p \in s} C_j(p, H_s, I_j), \tag{3}$$
where $|s|$ and $m$ denote the total number of pixels belonging to superpixel $s$ and the number of neighboring images, respectively, and $C_j(p, H_s, I_j)$ is defined as

$$C_j(p, H_s, I_j) = \begin{cases} \min(\|F_{I_c}(p) - F_{I_j}(H_s(p))\|, \delta), & D(H_s(p)) = NULL \\ \lambda_{occ}, & d(H_s(p)) > D(H_s(p)) \\ \lambda_{vio}, & d(H_s(p)) \le D(H_s(p)) \end{cases} \tag{4}$$

where $H_s(p) \in I_j$ denotes the corresponding point in image $I_j$ induced by the plane $H_s$ with respect to the pixel $p \in s$; $F_x(y)$ denotes the normalized color (i.e., a value between zero and one) of the point $y$ in the image $x$, and $\|F_{I_c}(p) - F_{I_j}(H_s(p))\|$ denotes the absolute difference of the normalized colors; $d(\cdot)$ and $D(\cdot)$ denote the depth estimated from the current plane and the reliable depth from the initial 3D points, respectively; the parameter $\delta$ is a truncation threshold, and the constants $\lambda_{occ}$ and $\lambda_{vio}$ are the occlusion penalty and free-space violation penalty, respectively.
In Eq. (4), the first case implies that if $D(H_s(p)) = NULL$, plane $H_s$ is more likely to be a real plane and the photo-consistency cost is measured by the dissimilarity of the color distribution. Otherwise, the intersection point of the back-projection ray of pixel $p$ with plane $H_s$ is occluded if $d(H_s(p)) > D(H_s(p))$, or violates the 3D point visibility if $d(H_s(p)) \le D(H_s(p))$, because a reliable 3D point is unlikely to be occluded. Therefore, different penalties must be assigned in these two cases.

For the high-level image features, we first extract the three image patches that appropriately contain superpixel $s \in I_c$ and the corresponding projected regions $\{s_j\}\ (j = 1, 2, \cdots, m)$ in the images $\{I_j\}$, and resize them to 224Γ224. Then, we simply treat these image patches as a multi-channel image and adopt the VGG-M architecture proposed in [31] to extract features of the multi-channel image. Finally, we directly feed the features to a logistic regression layer and use the output as the plane assignment cost $E_{cnn}(s, H_s)$ based on the high-level image features. As training data, we sampled image patches from 13 scenes of the DTU data sets [32] and the CASIA data sets [33]. In collecting positive samples, for a superpixel $s$ in the current image, we first fitted a plane to its associated ground-truth 3D points (i.e., the 3D points that project into the superpixel). Then, if the fitted plane is reliable (i.e., the average distance between the 3D points and the fitted plane is smaller than a predefined threshold), we projected the ground-truth 3D points into the neighboring images of the current image to select the corresponding image patches $\{s_j\}$. Finally, we took the image patches $\{s, s_1, s_2, \cdots, s_m\}$ and the fitted plane as a positive sample. Meanwhile, we picked image patches far away from $\{s_j\}$ to produce negative samples (i.e., the fitted plane is inconsistent with the image patches $\{s_j\}$). In total, we sampled 230K positive and 210K negative examples. Learning is performed by minimizing the cross-entropy loss with 50K iterations of standard Stochastic Gradient Descent (SGD), with batch size and learning rate set to 512 and 0.001, respectively.

Based on the definitions of $E_{pho}(s, H_s)$ and $E_{cnn}(s, H_s)$, we conducted comparison experiments to evaluate their performance using two neighboring images (i.e., $m = 2$); more experimental results for different $m$ values are given in Section 7.1. More specifically, for a superpixel containing initial 3D points (i.e., initial 3D points project into the superpixel), its reliable assigned plane is first determined according to the minimal average distance between these 3D points and the initial candidate planes (see Section 4). Then, taking these superpixels and their assigned planes as ground truth, the accuracy $\mathcal{F}$ is defined as the ratio of the number of superpixels that are assigned the correct planes using $E_{pho}(s, H_s)$ or $E_{cnn}(s, H_s)$ to the total number of superpixels.

Table 1. $\mathcal{F}$ values on different data sets ($\omega = 0.2$)

Data set | $E_{pho}(s, H_s)$ | $E_{cnn}(s, H_s)$ | $E_{data}(s, H_s)$
Valbonne | 0.6415 | 0.4821 | 0.7166
Wadham | 0.6106 | 0.5357 | 0.6949
LSB | 0.5110 | 0.4158 | 0.7075
TS | 0.6317 | 0.5756 | 0.7271
City#1 | 0.4901 | 0.4003 | 0.6112
City#2 | 0.5177 | 0.4821 | 0.6398
Figure 4. Accuracy changes with weight $\omega$.
Based on $\mathcal{F}$, Table 1 displays the corresponding results on the different data sets (see Section 7). Clearly, compared with $E_{cnn}(s, H_s)$, $E_{pho}(s, H_s)$ appears to be more effective because it can quantitatively compute the feature similarity at the pixel level. Furthermore, we combine $E_{cnn}(s, H_s)$ with $E_{pho}(s, H_s)$ using the weight $\omega$ to generate $E_{data}(s, H_s)$. As indicated in Figure 4 and Table 1, the accuracy $\mathcal{F}$ of $E_{data}(s, H_s)$ approaches its maximum when $\omega = 0.2$.
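For reference, a minimal sketch of the low-level term $E_{pho}$ of Eqs. (3)-(4) follows, assuming normalized single-channel images and planes stored as $(n, d)$ with $n \cdot X + d = 0$ in reference-camera coordinates. The plane-induced correspondence $H_s(p)$ is realized by the standard plane homography; the $d(\cdot)/D(\cdot)$ visibility cases of Eq. (4) are omitted for brevity (they would return $\lambda_{occ}$ or $\lambda_{vio}$), and all function names are illustrative.

```python
import numpy as np

def plane_homography(K_c, K_j, R, t, n, d):
    """Plane-induced homography from the reference image to neighboring image j
    for the plane n . X + d = 0 in reference-camera coordinates; (R, t) maps
    reference-camera points to camera j (standard two-view formula)."""
    return K_j @ (R - np.outer(t, n) / d) @ np.linalg.inv(K_c)

def e_pho(pixels, ref_gray, neighbors, delta=0.5, lam_occ=2.0):
    """Eqs. (3)-(4) sketch: mean truncated color difference over the pixels of
    one superpixel and its m neighboring images. `pixels` holds (x, y) in the
    reference image, `ref_gray` is the normalized reference image, and
    `neighbors` is a list of (H, gray_image) pairs, H being the plane-induced
    homography of the tested plane. Out-of-view points are penalized like
    occlusions here."""
    cost = 0.0
    for H, img in neighbors:
        for x, y in pixels:
            q = H @ np.array([x, y, 1.0])
            u, v = int(round(q[0] / q[2])), int(round(q[1] / q[2]))
            if 0 <= v < img.shape[0] and 0 <= u < img.shape[1]:
                cost += min(abs(float(ref_gray[y, x]) - float(img[v, u])), delta)
            else:
                cost += lam_occ
    return cost / (len(neighbors) * len(pixels))
```

The full data term of Eq. (2) would add $\omega$ times the CNN score of the stacked patches to this value.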
(2) Regularization term
In traditional methods, the angle prior (see Section 3) is frequently ignored or simply formulated as a hard regularization that forces two neighboring superpixels with similar appearances to take the same plane. In this paper, such a hard regularization is relaxed through the angle prior and defined as

$$E_{regular}(H_s, H_t) = \begin{cases} 0, & H_s = H_t \\ \rho \cdot C_{sim} \cdot \lambda_{dis}, & A(H_s, H_t) \in A_{prior} \\ C_{sim} \cdot \lambda_{dis}, & \text{otherwise} \end{cases} \tag{5}$$
where $A(H_s, H_t)$ denotes the intersection angle between the planes $H_s$ and $H_t$ corresponding to superpixels $s$ and $t$, respectively, and $A_{prior}$ is the angle prior, set to $[30^\circ, 45^\circ, 60^\circ, 90^\circ, -60^\circ, -45^\circ, -30^\circ]$ (more angles assist the reconstruction of detailed structures but also incur higher computational complexity). The constants $\lambda_{dis}$ and $\rho$ are the plane discontinuity penalty and the relaxation parameter, respectively. In Eq. (5), $C_{sim}$ measures the color dissimilarity of the superpixels and is defined as

$$C_{sim} = \frac{1}{1 + e^{-\|c(s) - c(t)\|}}, \tag{6}$$

where $\|c(s) - c(t)\|$ denotes the difference between the mean colors (normalized to the range from zero to one) of superpixels $s$ and $t$. In fact, high-level image features could also be used in Eq. (6); however, they do not significantly improve performance and come at the cost of higher computational complexity.
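A minimal sketch of the regularization term of Eqs. (5)-(6) follows, under the reconstruction above (zero cost for identical planes and an unsigned intersection angle tested against the magnitudes of $A_{prior}$); the angle tolerance and the $(n, d)$ plane representation are illustrative assumptions.

```python
import numpy as np

A_PRIOR = np.deg2rad([30, 45, 60, 90])  # magnitudes of the angle prior

def c_sim(color_s, color_t):
    """Eq. (6): sigmoid of the mean-color difference, a dissimilarity measure
    that grows from 0.5 toward 1 as the colors diverge."""
    diff = np.linalg.norm(np.asarray(color_s) - np.asarray(color_t))
    return 1.0 / (1.0 + np.exp(-diff))

def e_regular(plane_s, plane_t, color_s, color_t,
              lam_dis=2.0, rho=0.6, tol=np.deg2rad(2.0)):
    """Eq. (5) sketch: no penalty for identical planes, a rho-relaxed penalty
    when the intersection angle matches the prior, the full penalty otherwise."""
    (n_s, d_s), (n_t, d_t) = plane_s, plane_t
    if np.allclose(n_s, n_t) and np.isclose(d_s, d_t):
        return 0.0
    angle = np.arccos(np.clip(abs(float(n_s @ n_t)), 0.0, 1.0))
    penalty = c_sim(color_s, color_t) * lam_dis
    return rho * penalty if np.any(np.abs(angle - A_PRIOR) < tol) else penalty
```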
5.2 Jointly optimizing superpixels and their associated planes
According to the definition of the plane assignment cost, superpixels and their associated planes are jointly optimized using the method described in Algorithm 1. Next, we introduce several implementation details.

(1) Initial reliable planes
Essentially, Algorithm 1 performs in a progressive manner. In this process, as indicated in Figure 5(a), initial reliable planes can provide strong constraints for inferring other planes. Hence, for a superpixel $s$ containing initial 3D points, we select a plane $H_s$ from the set $H_0$ (see Section 4) as its reliable plane according to the following condition:

$$R(s) = \{H_s \in H_0 : (E_{data}(s, H_s) < \bar{E}) \wedge (D(X_s, H_s) < \bar{D})\}, \tag{7}$$

where $X_s$ denotes the 3D points that project into superpixel $s$, and $D(X_s, H_s)$ denotes the average orthogonal distance between the 3D points $X_s$ and the plane $H_s$; $\bar{E}$ and $\bar{D}$ are the averages of the minimal $E_{data}(s, H_s)$ values and of the minimal $D(X_s, H_s)$ values over all superpixels containing 3D points, respectively.

Algorithm 1. Jointly optimizing superpixels and their associated planes
Input: Initial 3D points and three calibrated images.
Output: Sets of superpixels $\mathcal{S}$ and associated planes $\mathcal{H}$.
Initialization: The sets of initial superpixels $S_0$ and initial candidate planes $H_0$.
1. Determine initial reliable planes for the superpixels containing 3D points from $H_0$ and $S_0$, and save them to $\mathcal{S}$ and $\mathcal{H}$, respectively (let $\bar{S}$ denote the set of the other superpixels).
2. Compute the plane assignment priority for the superpixels in $\bar{S}$.
3. Select and remove the superpixel $s$ with the highest priority from $\bar{S}$.
 3.1 If superpixel $s$ is verified as sky or ground, discard it.
 3.2 Otherwise, generate candidate planes and compute the minimal $E(s, H_s)$ value.
 3.3 If $E(s, H_s) \le \bar{E}$, assign plane $H_s$ (i.e., a reliable plane) to superpixel $s$, and save them to $\mathcal{S}$ and $\mathcal{H}$, respectively.
 3.4 Otherwise, resegment superpixel $s$ and save the resulting sub-superpixels to $\bar{S}$.
4. Go to Step 2 until $\bar{S} = \emptyset$.
5. Output $\mathcal{S}$ and $\mathcal{H}$.
(2) Plane assignment priority
For a superpixel $s \in \bar{S}$, the reliable planes associated with its neighboring superpixels typically have an important effect on inferring its optimal plane. To measure this influence, we define the plane assignment priority as

$$P_s = N(s) \cdot B(s), \tag{8}$$

where $N(s)$ is the number of neighboring superpixels of $s$ with reliable planes, and $B(s)$ is the total number of pixels on the edges of superpixel $s$ that are adjacent to those neighboring superpixels.
Figure 5. Plane assignment based on the angle prior. (a) Initial reliable planes extracted from the set $H_0$; (b) top view of candidate planes (white); (c) plane assignment; (d)-(e) top view and close-up.
Eq. (8) indicates that, for a superpixel $s$, when the number of its neighboring superpixels with reliable planes is larger and the corresponding shared boundary is longer, the constraints for assigning its optimal plane are stronger and more reliable, so the plane associated with superpixel $s$ should be inferred with priority. Note that if superpixel $s$ is resegmented, only the plane assignment priorities of the resulting sub-superpixels are recomputed in Step 2, to improve efficiency.

(3) Unrelated region detection
Given the ground plane of the scene (see Section 4), if the intersection points of the back-projection rays of superpixel $s$ with the building planes are below the ground, we consider superpixel $s$ an unrelated ground region. Superpixels in the sky region are detected according to the following condition:

$$L_{sky}(s) = (P_{sky}(s) > \epsilon) \wedge \left( \frac{1}{|\mathcal{H}|} \sum_{H_s \in \mathcal{H}} E_{data}(s, H_s) > \bar{E} \right), \tag{9}$$

where $P_{sky}(s)$ is the probability that superpixel $s$ belongs to the sky, produced by the semantic labeling algorithm [34], and $\epsilon$ is the corresponding threshold; $\bar{E}$ is defined in Eq. (7), and $\mathcal{H}$ is the set of currently reconstructed planes.

(4) Candidate plane generation
According to the structural characteristics of urban scenes, the plane associated with the current superpixel $s \in \bar{S}$ frequently forms specific angles with its neighboring planes. Therefore, to generate candidate planes for superpixel $s$, as indicated in Figures 1(a) and 1(b), we first detect the set $\mathcal{T}$ of its neighboring superpixels that have been assigned reliable planes, and then rotate each such plane about the axis given by the vertical direction (see Section 4) through a 3D point that projects onto the boundary between superpixels $s$ and $t \in \mathcal{T}$. Finally, as indicated in Figure 5(b), we take each plane produced at each rotation angle belonging to $A_{prior}$ as a candidate plane of superpixel $s$. Consequently, in contrast to the incomplete candidate planes indicated in Figure 1(c), the extended candidate planes are sufficient and reliable for reconstructing more detailed scene structures. Note that, for reconstructing some slanted planes, we also consider the axis that is perpendicular to both the vertical direction and the normal vector of the current reliable plane; a sketch of this rotation follows.
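A minimal sketch of this rotation-based candidate generation is given below, with planes as $(n, d)$ pairs and the rotation built with Rodrigues' formula; the function and parameter names are illustrative assumptions.

```python
import numpy as np

def rotation_matrix(axis, angle):
    """Rodrigues' formula for a rotation about a unit axis."""
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def rotated_candidates(neighbor_plane, boundary_point, vertical,
                       prior_deg=(30, 45, 60, 90, -60, -45, -30)):
    """Rotate a neighboring reliable plane about the vertical axis through a
    boundary 3D point by each prior angle; every rotated plane is re-anchored
    so that it still passes through that boundary point."""
    n, _ = neighbor_plane
    candidates = []
    for deg in prior_deg:
        n_rot = rotation_matrix(vertical, np.deg2rad(deg)) @ n
        d_rot = -float(n_rot @ boundary_point)   # plane n_rot . X + d_rot = 0
        candidates.append((n_rot, d_rot))
    return candidates
```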
(5) Plane assignment cost computation
In fact, computing $E_{cnn}(s, H_s)$ is relatively time-consuming. In Table 2, $E_{da}(part)$ denotes the variant of $E_{data}(s, H_s)$ in which the component $E_{cnn}(s, H_s)$ is evaluated only for superpixels with low discrimination, and $E_{da}(all)$ denotes the variant in which $E_{cnn}(s, H_s)$ is always evaluated for every superpixel. Here, low discrimination means that it is unreliable to assign a plane to a superpixel using $E_{pho}(s, H_s)$ alone. In this case, the ratio of the minimal $E_{pho}(s, H_s)$ value to the second smallest $E_{pho}(s, H_s)$ value over all candidate planes is typically large, and it is thus used to identify whether a superpixel has low discrimination by comparison against a pre-given threshold (set to 0.8 in this paper). Moreover, PS denotes the percentage of superpixels with low discrimination, and $M_1(S_{ini})$ is defined in Section 7.3.

Table 2. Accuracy and computational time using different types of data terms

Data sets | PS (%) | $M_1(S_{ini})$, $E_{da}(part)$ | $M_1(S_{ini})$, $E_{da}(all)$ | Time (s), $E_{da}(part)$ | Time (s), $E_{da}(all)$
Valbonne | 63.4 | 0.5614 | 0.5753 | 17.8 | 26.2
Wadham | 71.4 | 0.7629 | 0.7648 | 37.2 | 46.6
LSB | 76.9 | 0.6546 | 0.6629 | 56.8 | 69.3
TS | 80.3 | 0.5967 | 0.6189 | 68.1 | 76.5
City#1 | 67.7 | 0.4842 | 0.5164 | 61.8 | 84.4
City#2 | 59.8 | 0.5889 | 0.6087 | 74.4 | 91.7
From Table 2, we can see that the accuracy of $E_{da}(part)$ is comparable with that of $E_{da}(all)$, but the corresponding computational time is considerably shorter. Therefore, in our experiments, when a superpixel can be assigned a reliable plane using $E_{pho}(s, H_s)$ alone, we do not compute $E_{cnn}(s, H_s)$, in order to improve the efficiency of Algorithm 1 (that is, $E_{cnn}(s, H_s)$ is only utilized to improve the reliability of $E_{data}(s, H_s)$ for superpixels with low discrimination).

(6) Superpixel resegmentation
It is frequently difficult to determine the optimal plane for inaccurate superpixels with unreliable features. Therefore, we resegment such a superpixel with the Mean-shift method at a smaller threshold when the corresponding plane assignment cost is larger than $\bar{E}$. As indicated in Figure 3(c), after resegmenting the current superpixel (left), the resulting sub-superpixels (right) are more consistent with scene structures (e.g., edges). Note that sub-superpixels that are too small (e.g., containing fewer than 10 pixels) are merged into larger neighboring superpixels or sub-superpixels to improve the efficiency and reliability of Algorithm 1. A minimal resegmentation sketch follows.
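The sketch below resegments one superpixel's bounding patch at a smaller bandwidth, using scikit-learn's MeanShift on joint position-color features as a stand-in for the Mean-shift segmentation of [4]; the feature scaling and bandwidth values are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import MeanShift

def resegment_patch(patch, bandwidth=0.1, spatial_weight=0.5):
    """Resegment an (h, w, 3) patch with normalized colors: cluster joint
    position-color features with a small Mean-shift bandwidth and return a
    per-pixel sub-superpixel label map."""
    h, w = patch.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    feats = np.column_stack([
        spatial_weight * xs.ravel() / max(w, 1),   # normalized x position
        spatial_weight * ys.ravel() / max(h, 1),   # normalized y position
        patch.reshape(-1, patch.shape[2]),         # color channels
    ])
    labels = MeanShift(bandwidth=bandwidth).fit_predict(feats)
    return labels.reshape(h, w)
```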
In general, by incorporating the angle prior, Algorithm 1 achieves superior results compared with traditional methods that adopt hard regularizations. Taking the superpixel (red) in Figures 1(a) and 1(b) as an example, as indicated in Figures 5(c)-5(e), traditional hard regularizations tend to assign the same plane to two neighboring superpixels because they have similar appearances, whereas Algorithm 1 effectively resolves this problem using the angle prior.
6 Global plane assignment optimization
After obtaining the plane assignment using Algorithm 1, the following three common problems of traditional methods have been effectively addressed: (1) inaccurate superpixels are resegmented according to the plane assignment cost, and the resulting sub-superpixels can be reasonably modeled by appropriate planes; (2) both the reliability and the efficiency of the plane assignment are effectively improved under the guidance of scene structure priors (e.g., the angle prior); (3) unrelated regions (e.g., sky, ground) are filtered out and unnecessary plane assignments are thus avoided, which further improves the efficiency of the reconstruction. To produce more reliable results (e.g., to eliminate calculation deviations between two planes), the plane assignment obtained by Algorithm 1 is optimized under the MRF framework [35]. The energy function is defined as

$$E(\mathcal{H}) = \sum_{s \in \mathcal{S}} \left( E_{pho}(s, H_s) + \psi \cdot \sum_{t \in \mathcal{N}(s)} E_{regular}(H_s, H_t) \right), \tag{10}$$
where $\mathcal{S}$ and $\mathcal{H}$ are respectively the sets of superpixels and their associated planes obtained by Algorithm 1, and $\mathcal{N}(s)$ is the set of all neighboring superpixels of superpixel $s$. The constant $\psi$ is the weight of the regularization term. Note that the data term constructed using only low-level image features has higher efficiency and yields almost the same results as the one constructed using both low-level and high-level image features. Eq. (10) can be minimized using the Ξ±-expansion algorithm [35]; a sketch of the energy is given below. As indicated in Figure 6(a), compared with the initial plane assignment, which contains outliers, the optimized assignment indicated in Figures 6(b) and 6(c) appears satisfactory. Figure 6(d) shows the superpixels corresponding to the reconstructed reliable planes. Clearly, the boundary between two regions can also be reliably reconstructed (e.g., the boundary in the rectangle). Conversely, as indicated in Figures 6(e) and 6(f), the traditional methods [14] and [15] suffer from incomplete reconstruction and inaccurately reconstructed boundaries between planes. More experimental results are presented and analyzed in the next section.
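The sketch below scores a candidate labeling with the energy of Eq. (10); since a full Ξ±-expansion implementation [35] exceeds a short listing, a simple ICM-style local refinement is shown as a stand-in optimizer over the same energy (a weaker optimizer than Ξ±-expansion), with all names illustrative.

```python
import numpy as np

def mrf_energy(labels, unary, edges, pairwise, psi=0.5):
    """Eq. (10) sketch: total energy of a plane labeling. unary[s, h] caches
    E_pho(s, H_h); `edges` lists neighboring superpixel pairs (s, t); and
    pairwise(h_s, h_t) evaluates E_regular of Eq. (5)."""
    e = sum(unary[s, labels[s]] for s in range(len(labels)))
    return e + psi * sum(pairwise(labels[s], labels[t]) for s, t in edges)

def icm_refine(labels, unary, edges, pairwise, n_labels, psi=0.5, sweeps=5):
    """ICM stand-in for the alpha-expansion of [35]: greedily move each
    superpixel to its locally best plane label for a few sweeps."""
    nbrs = {}
    for s, t in edges:
        nbrs.setdefault(s, []).append(t)
        nbrs.setdefault(t, []).append(s)
    for _ in range(sweeps):
        for s in range(len(labels)):
            costs = [unary[s, h] +
                     psi * sum(pairwise(h, labels[t]) for t in nbrs.get(s, []))
                     for h in range(n_labels)]
            labels[s] = int(np.argmin(costs))
    return labels
```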
Figure 6. Plane assignment optimization (different colors denote different reliable planes). (a) Plane assignment produced by Algorithm 1; (b) plane assignment optimized under the MRF framework; (c) top view; (d) superpixels corresponding to reliable planes; (e) results produced by method [14]; (f) results produced by method [15].
7 Experiments
To evaluate the performance of the proposed method, we conducted experiments on several data sets of urban scenes where planar structures dominate. Figure 7 presents the current images (the LSB scene is displayed in Figure 1) and the corresponding two-dimensional (2D) points projected from the initial 3D points.

Figure 7. Sample images. (a) Valbonne; (b) Wadham; (c) TS; (d) City#1; (e) City#2.
(1) Oxford VGG data sets [36]: Valbonne and Wadham. The image resolutions are 512Γ768 and 1024Γ768, respectively. As displayed in Figures 7(a) and 7(b), the corresponding scene structures are relatively simple; nevertheless, it is frequently difficult to obtain good results for slanted surfaces (e.g., the roof in the Wadham scene) and other details. (2) CASIA data sets [33]: Life science building (LSB) and Tsinghua school (TS). The image resolutions are 728Γ1072 and 2184Γ1456, respectively (the camera parameters can be obtained using Structure from Motion pipelines). As displayed in Figure 1 and Figure 7(c), the corresponding scenes contain some small plane patches (e.g., the windows in the TS scene) and various plane intersection angles (e.g., 90°, 135°), so it is challenging to effectively reconstruct their complete structures. (3) Our own data sets: City#1 and City#2. The image resolutions are 1884Γ1224. As displayed in Figures 7(d) and 7(e), the structures of the two scenes are more complex and more difficult to reconstruct because of more interference factors, such as illumination variations, repetitive textures, and long distances between the camera and the buildings. In addition, in contrast to the other scenes, there are more unrelated regions (e.g., sky and ground) in the current image, which frequently reduces the efficiency of the reconstruction. All experiments were conducted on a desktop PC with an Intel Core 4 Duo 4.0 GHz CPU and 32 GB RAM. All methods were implemented in parallel C++.

7.1 Parameter settings
The proposed method appears to be insensitive to parameter settings, and the majority of the parameters were fixed. Specifically, to determine the number of neighboring images $m$ in Eq. (3), we conducted experiments with different $m$ values. As shown in Figure 8, the reconstruction accuracy $M_1(S_{opt})$ (see Section 7.3) approaches its maximum when $m = 2$ and essentially decreases when $m > 2$. The reasons are: (1) two neighboring images already provide enough information for reconstructing the scene structures corresponding to the current image, by virtue of the high reliability of the plane assignment cost incorporating image features, structure priors (plane and angle priors), and 3D point visibility constraints; (2) the areas of the reconstructed scene patches corresponding to the overlapping regions shrink as more images are used. Moreover, the computation time increases significantly when $m > 2$. In summary, we set $m$ to 2.
Figure 8. Accuracy and computation time with different $m$ values. (a) accuracy; (b) computation time.
Moreover, for the data term in the plane assignment cost, the truncation threshold $\delta$ addresses robustness with respect to occluded regions, and the occlusion penalty $\lambda_{occ}$ should be set to a smaller value than the visibility violation penalty $\lambda_{vio}$. With respect to the difference between the normalized colors of two pixels, the proposed method achieves superior results when $\delta = 0.5$, $\lambda_{occ} = 2$, and $\lambda_{vio} = 4$. For the regularization term, the plane discontinuity penalty $\lambda_{dis}$ is mainly used to enhance the consistency of two neighboring planes and is set to 2 with respect to the difference between the mean colors of two superpixels; moreover, larger $\rho$ values cannot effectively incorporate the predefined angle priors to relax the hard regularization, so $\rho$ is set to 0.6.
Figure 9. Accuracy changes with different weights. (a) weight $\gamma$; (b) weight $\psi$.
For the weight $\gamma$ of the regularization term, similarly to the weight $\omega$ of the high-level image features (see Section 5.1), we searched for the optimal $\gamma$ value in $[0, 1]$ according to the reconstruction results produced by Algorithm 1. As shown in Figure 9(a), larger $\gamma$ values typically force a superpixel and its neighboring superpixels to take the same plane, which is not conducive to reconstructing detailed structures. Conversely, smaller values reduce the effect of the regularization term and thus lead to more outliers. The proposed method performed well when $\gamma = 0.6$. Similarly, we selected the optimal $\psi$ value from $[0, 1]$ by comparing the corresponding accuracy of the reconstruction results produced by the proposed method. From Figure 9(b), it can be seen that the proposed method achieves superior results when $\psi = 0.5$. In addition, the threshold $\epsilon$ is used to identify the sky regions in images using the semantic labeling algorithm [34]. In our experiments, the algorithm had high accuracy in detecting sky regions, so we set $\epsilon$ to 0.9. The parameter settings are summarized in Table 3.

Table 3. Parameter settings

ID | Name | Default value | Function | Section
1 | $m$ | 2 | Number of neighboring images | 5.1
2 | $\gamma$ | 0.6 | Weight of the regularization term in Eq. (1) | 5.1
3 | $\lambda_{occ}$ | 2 | Occlusion penalty | 5.1
4 | $\lambda_{vio}$ | 4 | Free-space violation penalty | 5.1
5 | $\lambda_{dis}$ | 2 | Plane discontinuity penalty | 5.1
6 | $\rho$ | 0.6 | Relaxation parameter of structure priors | 5.1
7 | $\omega$ | 0.2 | Weight of high-level image features | 5.1
8 | $\delta$ | 0.5 | Truncation threshold of color difference | 5.1
9 | $\epsilon$ | 0.9 | Threshold of semantic regions | 5.2
10 | $\psi$ | 0.5 | Weight of the regularization term in Eq. (10) | 6
7.2 Evaluation criteria
(1) Reconstruction accuracy
We adopt the following criteria to evaluate the reliability of the reconstructed 3D points and planes.
1) Reliable 3D points: the 3D points $X_p$ and $X_q$ corresponding to pixels $p \in I_c$ and $q \in I_j\ (j = 1, 2, \cdots, m)$, respectively, are considered the same, and the 3D point is regarded as reliable for pixel $p \in I_c$, only when the relative difference (i.e., $(d(X_p) - d(X_q))/d(X_p)$) between the depths $d(X_p)$ and $d(X_q)$ with respect to the image $I_c$ is less than a prespecified threshold (set to 0.2 in this paper).
2) Reliable planes: for the reconstructed 3D points of all pixels in superpixel $s \in I_c$, the plane associated with superpixel $s$ is considered reliable only when the percentage of reliable 3D points is greater than a prespecified threshold (set to 0.8 in this paper).
Based on the above definitions, we adopt the point accuracy $M_1$ and the plane accuracy $M_2$ to comprehensively measure the accuracy of the scene reconstruction; a sketch of both metrics is given at the end of this subsection. Here, $M_1$ denotes the ratio of reliable reconstructed 3D points to all reconstructed 3D points, and $M_2$ denotes the number of reliable planes.
(2) Method comparison
To further evaluate the performance of the proposed method, we also conducted comparison experiments with the state-of-the-art methods [14] and [15]. These two methods have similar pipelines, including image over-segmentation, candidate plane generation, and scene structure inference. The main differences are the density of the initial 3D points, the methods of generating candidate planes, and the construction of the energy function used to infer the scene structures. For more details, please refer to the related papers. For the convenience of experimental comparison, we marginally adjusted some details in implementing the above two methods: (1) for the current image, five groups of superpixels were generated using the Mean-shift method with different parameters and used to reconstruct the corresponding scene structures, and the optimal results were used for comparison with the other methods; (2) the sky and ground regions were detected and filtered out (see Section 5.2); and (3) unreliable planes were filtered out for visualization comparison.
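For completeness, a small sketch of the two metrics under the above definitions is given below; the array names and shapes are illustrative assumptions.

```python
import numpy as np

def m1_point_accuracy(depth_ref, depth_neighbor, rel_thresh=0.2):
    """M1 sketch: a reconstructed 3D point is reliable when the relative depth
    difference against the point from a neighboring view, measured in the
    reference image, is below the threshold (0.2 in the paper)."""
    valid = depth_ref > 0
    rel = np.abs(depth_ref[valid] - depth_neighbor[valid]) / depth_ref[valid]
    return float(np.mean(rel < rel_thresh))

def m2_plane_accuracy(per_plane_reliable_ratio, plane_thresh=0.8):
    """M2 sketch: count the planes whose share of reliable per-pixel 3D points
    exceeds the threshold (0.8 in the paper)."""
    return int(np.sum(np.asarray(per_plane_reliable_ratio) > plane_thresh))
```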
7.3 Results and analysis
The proposed method focuses on jointly optimizing superpixels and their associated planes by incorporating scene structure priors. The initializations for the different data sets are presented in Table 4. In general, it is difficult to over-segment the current image into regions (or superpixels) appropriately consistent with the scene structures. Therefore, as indicated in Figure 10(a), we over-segmented the current image at a larger threshold. Consequently, as indicated in Figure 10(b), only a small number of reliable planes associated with the superpixels could be determined, because the majority of superpixels straddled two or more planes and could not be modeled as single planes.

Table 4. Initializations

Data sets | 3D points | Superpixels | Lines | Planes
LSB | 6636 | 3788 | 1704 | 28
TS | 9265 | 3706 | 2694 | 102
Valbonne | 561 | 360 | 362 | 17
Wadham | 2120 | 1243 | 838 | 38
City#1 | 2234 | 2793 | 1588 | 11
City#2 | 1503 | 2643 | 1297 | 7
Based on these initial reliable planes, as indicated in Figure 10(c), Algorithm 1 resegmented inaccurate superpixels according to the plane assignment cost and simultaneously optimized their associated planes. In the meantime, unrelated regions (e.g., sky, ground) were reliably filtered out, which significantly improved the overall efficiency and the visual results.

Table 5. Results of different methods (columns SRP through $M_2(S)$ refer to the proposed method; the last four columns give $M_1$ and $M_2$ for methods [14] and [15])

Data sets | SRP | SP | PL | $M_1(H_{ini})$ | $M_1(H_{opt})$ | $M_2(H)$ | $M_1(S_{ini})$ | $M_1(S_{opt})$ | $M_2(S)$ | [14] $M_1$ | [14] $M_2$ | [15] $M_1$ | [15] $M_2$
LSB | 292 | 6592 | 1107 | 0.5509 | 0.7888 | 15 | 0.6629 | 0.8913 | 18 | 0.4319 | 3 | 0.6135 | 5
TS | 182 | 9896 | 2112 | 0.5250 | 0.6991 | 19 | 0.6189 | 0.7991 | 27 | 0.3196 | 11 | 0.5001 | 13
Valbonne | 30 | 1940 | 156 | 0.3987 | 0.5738 | 5 | 0.5753 | 0.8256 | 9 | 0.5067 | 7 | 0.6418 | 6
Wadham | 85 | 7113 | 409 | 0.6711 | 0.7893 | 11 | 0.7648 | 0.8796 | 12 | 0.3991 | 7 | 0.6791 | 11
City#1 | 27 | 8608 | 3761 | 0.3675 | 0.7306 | 5 | 0.5164 | 0.7829 | 7 | 0.3284 | 7 | 0.4398 | 6
City#2 | 41 | 7473 | 2537 | 0.4849 | 0.6756 | 6 | 0.6087 | 0.7591 | 6 | 0.2987 | 5 | 0.5736 | 6
Table 5 shows the corresponding quantitative results. Here, SRP denotes the number of initial superpixels with a reliable plane, and SP and PL denote the numbers of superpixels (including sub-superpixels) and planes produced by Algorithm 1, respectively. To compare the performance of traditional hard regularizations with the soft regularization defined in Eq. (5), we conducted the corresponding experiments using both regularizations (the hard regularization is constructed using the first and third cases of Eq. (5), all other conditions being equal). For the hard regularization, $M_1(H_{ini})$, $M_1(H_{opt})$, and $M_2(H)$ denote the accuracy of the initial scene structures produced by Algorithm 1, the accuracy of the scene structures optimized under the MRF framework, and the number of reconstructed planes, respectively. Similarly, $M_1(S_{ini})$, $M_1(S_{opt})$, and $M_2(S)$ denote the results produced by the soft regularization. From Table 5, it can be clearly seen that, because the initial plane assignments are basically reliable, the global plane assignment optimization produces superior results, as indicated in Figures 10(d) and 11(a). Moreover, the soft regularization defined in Eq. (5) achieves higher accuracy than the hard regularization because it describes the scene structures more effectively. Figure 11(b) displays the superpixels corresponding to the optimized planes; it can be observed that the proposed method not only reconstructs the major structures of the scenes but also performs well in reconstructing details (e.g., the windows of the Wadham scene) and the boundaries between different planes (e.g., the boundaries in the rectangles).
Figure 10. Results on standard data sets. (a) Initial superpixels; (b) initial reliable planes; (c) textured structures corresponding to the plane assignments produced by Algorithm 1; (d) textured structures corresponding to the globally optimized plane assignments.
Method [14] assumes that the initial superpixels contain sufficient 3D points to generate the corresponding candidate planes; however, this assumption frequently leads to larger errors in regions (e.g., poorly textured regions) where the initial 3D points are not evenly distributed. For example, the specific superpixels indicated in Figure 11(c) cannot be assigned reliable planes. Method [15] first produces dense 3D points by matching DAISY feature descriptors. This preprocessing not only generates relatively complete candidate planes but also helps to construct stronger constraints that enhance the reliability of the subsequent plane inference and optimization. Therefore, the method achieves improved results. However, it cannot reliably reconstruct the boundaries between planes (e.g., the boundaries in the rectangles in Figure 11(d)). Regarding the efficiency of each method, as listed in Table 6, the proposed method performed relatively quickly at the multi-plane fitting and line segment detection stages; however, it consumed considerable time in extracting the high-level image features and resegmenting the superpixels. Nevertheless, the plane assignment process was faster because of the reliable candidate planes generated under the guidance of the angle priors. Further, the optimization process required less time owing to the improved initialization derived from the initial plane assignment produced by Algorithm 1. Method [14] ran at a speed similar to that of the proposed method, except for its time-consuming candidate plane generation, because it avoids computing photo-consistency measurements across multiple images. Method [15] ran slowly owing to the per-pixel feature matching.
Figure 11. Results of different methods (different colors denote different reliable planes). (a) Top view of Figure 10(d); (b) results produced by the proposed method; (c) results produced by method [14]; (d) results produced by method [15].
Our own data sets were used to evaluate the adaptability of the proposed method. For the scenes displayed in Figure 7, because the cameras were far from the buildings, the captured images typically contained multiple different types of building regions that were relatively small compared with the unrelated regions (e.g., sky, ground). In our experiments, when the unrelated regions were not detected and filtered out, method [14] failed unexpectedly and method [15] produced fewer reliable planes.
Figure 12. Results on our own data sets. (a) Initial superpixels; (b) initial reliable planes; (c) textured structures corresponding to the plane assignments produced by Algorithm 1; (d) textured structures corresponding to the globally optimized plane assignments.
Figure 13. Results of different methods (different colors denote different reliable planes). (a) Top view of Figure 12(d); (b) results produced by the proposed method; (c) results produced by method [14]; (d) results produced by method [15]; (e) close-ups of the rectangles in (b), (c), and (d) (City#1); (f) close-ups of the rectangles in (b), (c), and (d) (City#2).
In this experiment, the proposed method also only produced fewer reliable planes during the initial phase. However, Algorithm 1 performed well because structure priors also commonly exist between different buildings in urban scenes. This guarantees the reliability of the next plane assignment optimization (see Figures 12(d) and 13(a)). In particular, slanted surfaces were also reliably reconstructed (e.g., the planes in the rectangles in Figure 13(b)). Methods [14] and
[15] also performed well after filtering out the unrelated regions. However, they failed to solve the common problems such as incomplete reconstruction and inaccurately reconstructed boundaries (e.g., the boundaries in the rectangles in Figures 13(c) and 13(d)). Note that, as shown in Table 5 and Table 6, compared to the standard data sets, three methods generally had less accuracy and efficiency, the main reasons lie in: (1) more interference (e.g., illumination variations, far distance between the camera and the buildings) reduces the reliability of the reconstructed 3D points and planes; (2) repetitive textures and structures lead to more sub-superpixels with small size in resegmenting inaccurate superpixels. These sub-superpixels can also influence the reliability and efficiency of the plane assignment cost and the entire scene reconstruction because of the match ambiguity of the superpixels. Table 6. Computation time (seconds) of different methods Proposed method Data sets
Data sets   Proposed method                                                                Method [14]   Method [15]
            Initial       Line        Initial   Initial      Global         Total
            superpixels   detection   planes    structures   optimization   time
LSB         4.7           12.7        4.3       56.8         2.9            81.4           94.4          682.6
TS          3.9           22.1        6.9       68.1         3.7            105.7          130.8         883.4
Valbonne    1.1           4.9         2.5       17.8         0.8            27.1           34.9          243.2
Wadham      2.4           7.6         4.1       34.2         1.1            49.4           71.2          341.0
City#1      3.7           8.7         10.7      61.8         2.4            87.3           98.7          577.2
City#2      4.2           5.5         9.7       74.4         3.1            96.9           107.5         498.9
In summary, given the initial sparse 3D points of a scene, by incorporating scene structure priors and high-level image features, the proposed method effectively overcomes the influence of inaccurate superpixels, incomplete candidate planes, and unreliable regularization on the reconstruction, and completely reconstructs the piecewise planar structures of the scene with high accuracy and efficiency.
8 Conclusions
For the piecewise planar reconstruction of urban scenes, traditional methods frequently suffer from uncertain factors such as sparse initial 3D points, inaccurate image over-segmentation, incomplete candidate planes, and unreliable regularization terms. To address these problems, this paper constructed an effective plane assignment cost based on scene structure priors and high-level image features obtained by CNN. It then jointly optimized the superpixels and their associated planes, followed by globally optimizing the initial scene structures under the MRF framework. Experimental results confirm that the proposed method performs well on both standard and our own data sets with high accuracy and efficiency. The limitations of the proposed method include the following. (1) Although high-level image features help to improve the reliability of the reconstruction, they also incur a high computational load. (2) Curved surfaces with poor textures cannot be reliably reconstructed because they are beyond the scope of the prespecified scene structure priors. In future work, we will explore more efficient image patch matching methods and extract more scene structure priors (e.g., specific objects and geometric shapes) from the initial 3D points and images based on CNN. Finally, the accuracy and efficiency of the scene reconstruction are expected to improve under a higher-order MRF framework that incorporates more scene structure priors.
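As a rough illustration of the global stage only, the following sketch minimizes a standard unary-plus-Potts MRF energy over superpixels using iterated conditional modes (ICM). The unary costs here are simplified stand-ins for the paper's plane assignment cost, and ICM is used merely because it is self-contained; MRF energies of this form are more commonly minimized with graph-cut style solvers.

```python
import numpy as np

def icm_plane_assignment(unary, edges, smooth_weight=1.0, iters=10):
    """Assign one candidate plane label per superpixel.

    unary: (n_superpixels, n_planes) array; unary[i, p] is a (simplified)
           cost of assigning plane p to superpixel i.
    edges: list of (i, j) pairs of adjacent superpixels.
    The pairwise term is a Potts penalty charging `smooth_weight` whenever
    neighbors take different planes. ICM greedily updates each label given
    its neighbors until no label changes.
    """
    n, k = unary.shape
    labels = unary.argmin(axis=1)                 # initialization (cf. Algorithm 1)
    neighbors = [[] for _ in range(n)]
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    for _ in range(iters):
        changed = False
        for i in range(n):
            cost = unary[i].copy()
            for j in neighbors[i]:                # Potts smoothness with neighbors
                cost += smooth_weight * (np.arange(k) != labels[j])
            best = cost.argmin()
            if best != labels[i]:
                labels[i], changed = best, True
        if not changed:
            break
    return labels

# Toy example: 3 superpixels in a chain, 2 candidate planes.
unary = np.array([[0.1, 1.0], [0.6, 0.5], [1.0, 0.1]])
print(icm_plane_assignment(unary, edges=[(0, 1), (1, 2)], smooth_weight=0.3))
```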
Acknowledgements
This work was supported in part by the National Key R&D Program of China (2016YFB0502002), in part by the National Natural Science Foundation of China (61772444, 61421004, 61873264), in part by the Natural Science Foundation of Henan Province (162300410347), and in part by the Key Scientific and Technological Project of Henan Province (162102310589, 192102210279).
References
1. Feng W, Jia J, Liu Z. Self-validated labeling of Markov random fields for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010, 32(10): 1871-1887.
2. Li L, Feng W, Wan L, Zhang J. Maximum cohesive grid of superpixels for fast object localization. In: Proceedings of Computer Vision and Pattern Recognition, 2013, pp. 3174-3181.
3. Wang W, Gao W, Hu Z Y. Effectively modeling piecewise planar urban scenes based on structure priors and CNN. SCIENCE CHINA Information Sciences, 2019, 62(2): 029102.
4. Comaniciu D, Meer P. Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002, 24(5): 603-619.
5. Achanta R, Shaji A, Smith K, Lucchi A, Fua P, SΓΌsstrunk S. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 34(11): 2274-2282.
6. ΓiΔla C, Zabulis X, Alatan A A. Region-based dense depth extraction from multi-view video. In: Proceedings of IEEE ICIP, 2007, pp. 213-216.
7. Furukawa Y, Curless B, Seitz S M. Manhattan-world stereo. In: Proceedings of Computer Vision and Pattern Recognition, 2009, pp. 1422-1429.
8. Furukawa Y, Ponce J. Accurate, dense, and robust multi-view stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009, 32(8): 1362-1376.
9. Gallup D, Frahm J M, Mordohai P. Real-time plane-sweeping stereo with multiple sweeping directions. In: Proceedings of Computer Vision and Pattern Recognition, 2007, pp. 1-8.
10. MiΔuΕ‘Γ­k B, KoΕ‘eckΓ‘ J. Multi-view superpixel stereo in urban environments. International Journal of Computer Vision, 2010, 89(1): 106-119.
11. Sinha S N, Steedly D, Szeliski R. Piecewise planar stereo for image-based rendering. In: Proceedings of the 12th International Conference on Computer Vision, 2009, pp. 1881-1888.
12. Chauve A L, Labatut P, Pons J P. Robust piecewise-planar 3D reconstruction and completion from large-scale unstructured point data. In: Proceedings of Computer Vision and Pattern Recognition, 2010, pp. 1261-1268.
13. Jiao Z, Liu T, Zhu X. Robust piecewise planar stereo with modified segmentation cues in urban scenes. In: Proceedings of International Conference on Multimedia Technology, 2011, pp. 698-701.
14. Bodis-Szomoru A, Riemenschneider H, Van Gool L. Fast, approximate piecewise-planar modeling based on sparse structure-from-motion and superpixels. In: Proceedings of Computer Vision and Pattern Recognition, 2014, pp. 469-476.
15. Verleysen C, Vleeschouwer C D. Piecewise-planar 3D approximation from wide-baseline stereo. In: Proceedings of Computer Vision and Pattern Recognition, 2016, pp. 3327-3336.
16. Tola E, Lepetit V, Fua P. DAISY: an efficient dense descriptor applied to wide-baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010, 32(5): 815-830.
17. Wang W, Ren G, Chen L, Zhang X. Piecewise planar urban scene reconstruction based on structure priors and cooperative optimization. http://kns.cnki.net/kcms/detail/11.2109.TP.20181007.2353.012.html, 2018.
18. Feng W, Liu Z, Wan L. A spectral-multiplicity-tolerant approach to robust graph matching. Pattern Recognition, 2013, 46(10): 2819-2829.
19. Liang Z, Feng Y, Guo Y, Liu H, Chen W, Qiao L, Zhou L, Zhang J. Learning for disparity estimation through feature constancy. In: Proceedings of Computer Vision and Pattern Recognition, 2018.
20. Jie Z, Wang P, Ling Y, Zhao B, Wei Y, Feng J, Liu W. Left-right comparative recurrent model for stereo matching. In: Proceedings of Computer Vision and Pattern Recognition, 2018.
21. Park M G, Yoon K J. Learning and selecting confidence measures for robust stereo matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
22. Zbontar J, LeCun Y. Computing the stereo matching cost with a convolutional neural network. In: Proceedings of Computer Vision and Pattern Recognition, 2015, pp. 1592-1599.
23. Chen Z, Sun X, Wang L, Yu Y, Huang C. A deep visual correspondence embedding model for stereo matching costs. In: Proceedings of Computer Vision and Pattern Recognition, 2015, pp. 972-980.
24. Fischer P, Dosovitskiy A, Brox T. Descriptor matching with convolutional neural networks: a comparison to SIFT. Computer Science, 2015.
25. Luo W, Schwing A G, Urtasun R. Efficient deep learning for stereo matching. In: Proceedings of Computer Vision and Pattern Recognition, 2016, pp. 5695-5703.
26. Zagoruyko S, Komodakis N. Learning to compare image patches via convolutional neural networks. In: Proceedings of Computer Vision and Pattern Recognition, 2015, pp. 4353-4361.
27. Shi Y, Xu K, Niessner M, Rusinkiewicz S, Funkhouser T. PlaneMatch: patch coplanarity prediction for robust RGB-D reconstruction. In: Proceedings of European Conference on Computer Vision, 2018.
28. Shrestha R, Tian F, Feng W, Tan P, Vaughan R. Learned map prediction for enhanced mobile robot exploration. In: Proceedings of International Conference on Robotics and Automation, 2019.
29. Pham T T, Chin T J, Yu J, Suter D. The random cluster model for robust geometric fitting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 36(8): 1658-1671.
30. Hartley R, Zisserman A. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
31. Chatfield K, Simonyan K, Vedaldi A, Zisserman A. Return of the devil in the details: delving deep into convolutional nets. In: Proceedings of British Machine Vision Conference, 2014.
32. Jensen R, Dahl A, Vogiatzis G, Tola E, AanΓ¦s H. Large scale multi-view stereopsis evaluation. In: Proceedings of Computer Vision and Pattern Recognition, 2014.
33. [Online]: http://vision.ia.ac.cn/data/index.html.
34. Zhao H, Shi J, Qi X, Wang X, Jia J. Pyramid scene parsing network. In: Proceedings of Computer Vision and Pattern Recognition, 2017.
35. Gorelick L, Boykov Y, Veksler O, Ayed I B, Delong A. Submodularization for binary pairwise energies. In: Proceedings of Computer Vision and Pattern Recognition, 2016.
36. [Online]: http://www.robots.ox.ac.uk/~vgg/data/data-mview.html.
Wei Wang received his Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences in 2015. He is currently an associate professor at the School of Network Engineering, Zhoukou Normal University. His research interests include computer vision, machine learning and 3D reconstruction.
Wei Gao received his Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences in 2008. He is currently an associate professor at the Institute of Automation, Chinese Academy of Sciences. His research interests include computer vision and 3D reconstruction.
Zhanyi Hu was born in 1961. He received his B.S. degree in automation from the North China University of Technology in 1985 and his Ph.D. degree (Docteur d'Etat) in computer vision from the University of Liège, Belgium, in 1993. He is now a professor at the Institute of Automation, Chinese Academy of Sciences.