Accurate and occlusion-robust multi-view stereo


ISPRS Journal of Photogrammetry and Remote Sensing 109 (2015) 47–61


Zhaokun Zhu a,b,*, Christos Stamatopoulos c, Clive S. Fraser c

a College of Aerospace Science and Engineering, National University of Defense Technology, Changsha 410073, China
b Hunan Provincial Key Laboratory of Image Measurement and Vision Navigation, National University of Defense Technology, Changsha 410073, China
c CRCSI, Dept. of Infrastructure Engineering, University of Melbourne, Vic. 3010, Australia
* Corresponding author at: College of Aerospace Science and Engineering, National University of Defense Technology, Changsha 410073, China. Tel.: +86 186 7087 7752.

Article history: Received 23 January 2015; Received in revised form 25 August 2015; Accepted 27 August 2015

Keywords: Multi-view stereo; Support window model; Visibility estimation; PatchMatch; Markov Random Field

Abstract

This paper proposes an accurate multi-view stereo method for image-based 3D reconstruction that features robustness in the presence of occlusions. The new method offers improvements in dealing with two fundamental image matching problems. The first concerns the selection of the support window model, while the second centers upon accurate visibility estimation for each pixel. The support window model is based on an approximate 3D support plane described by a depth and two per-pixel depth offsets. For the visibility estimation, the multi-view constraint is initially relaxed by generating separate support plane maps for each support image using a modified PatchMatch algorithm. Then the most likely visible support image, which represents the minimum visibility of each pixel, is extracted via a discrete Markov Random Field model, and this visibility is further augmented by parameter clustering. Once the visibility is estimated, multi-view optimization taking into account all redundant observations is conducted to achieve optimal accuracy in the 3D surface generation for both depth and surface normal estimates. Finally, multi-view consistency is utilized to eliminate any remaining observational outliers. The proposed method is experimentally evaluated using well-known Middlebury datasets, and the results obtained demonstrate that it is amongst the most accurate of the methods thus far reported via the Middlebury MVS website. Moreover, the new method exhibits a high completeness rate.

© 2015 International Society for Photogrammetry and Remote Sensing, Inc. (ISPRS). Published by Elsevier B.V. All rights reserved.

1. Introduction

Image-based modeling and reconstruction is the process of extracting accurate three-dimensional (3D) information about unknown objects or scenes from imagery. It is essential for many photogrammetric measurement applications, such as product quality inspection, vision metrology, cultural heritage documentation, traffic accident reconstruction and mobile mapping. 3D reconstruction from stereo and multi-view imagery has been given impetus over recent years, due in large part to the decreasing cost and higher resolution of consumer-grade digital cameras. A more important factor in the growth of applications of image-based modeling, however, has been the development of image matching algorithms that allow faster, more robust and more accurate reconstruction.

3D reconstruction can be grouped into two major categories, namely sparse and dense reconstruction. Sparse reconstruction is

usually performed via detector-based matching of common scale invariant features (Bay et al., 2008) between images. This facilitates the determination of both the image orientation and a sparse reconstruction, desirably via established photogrammetric methodologies. It is noteworthy that 'sparse' may nevertheless imply thousands of extracted 3D object points. This technique is commonly referred to as Structure-from-Motion (SfM) (Crandall et al., 2013), and it has been successfully applied to large sets of both organized and unorganized images.

On the other hand, dense reconstruction methods aim to create a complete model of a scene by extracting a 3D coordinate for every pixel of an image, thus leading to object point clouds of millions of points. A prerequisite for practical dense reconstruction is the provision of images with known interior and exterior orientation, and thus it is not uncommon nowadays to have sparse and dense reconstruction performed as sequential operations. Dense reconstruction methods can be further divided into binocular stereo and multi-view stereo (MVS) methods. Binocular stereo (Scharstein and Szeliski, 2002) aims to produce a dense disparity map from a pair of images, whereas MVS methods utilize multitudes of images. According to the taxonomy of Seitz


et al. (2006), MVS methods can be further categorized in terms of how the object is represented, for example via voxels, level-sets, polygon meshes or depth maps. MVS methods in which the object is represented by depth maps are also referred to as multi-view depth map estimation (MVDE) methods by Zheng et al. (2013).

One fundamental image matching problem that both local binocular stereo and MVDE methods have to deal with is how to optimally select the support window along the epipolar line of the support (matching) image to ensure photo-consistency, given a reference window in the reference image. The distinction for MVDE methods is that support windows have to be selected for all possible support images. For the sake of simplicity, many methods use the same window size as the reference window, assuming that the reconstructed surface is locally fronto-parallel. Such an assumption is easily violated in many situations, so various local binocular stereo methods have proposed alternative methodologies. These include the methods of adaptive-window (Veksler, 2003), multiple-window (Bobick and Intille, 1999), segmentation-based window (Wang et al., 2004), adaptive support-weight (Hosni et al., 2013) and Disparity Space Image (DSI) plane induced support window (Bleyer et al., 2011). Many MVDE methods, e.g., that proposed by Goesele et al. (2006), use a support window of the same size as the reference window, but further problems can arise with this approach, since many pixels of the support window do not necessarily correspond to the same object point at all, resulting in a deviation of the computed photo-consistency measure. Thus, more sophisticated support window calculation is also warranted for MVDE. However, the complexity increases because, instead of only one depth parameter along the line of sight, there are now multiple parameters involved in the calculation of the depth estimate.

Another challenge for MVDE methods is the determination of the visibility of each pixel of the reference image with respect to the support images. More specifically, this is the process of finding which support images satisfy the photo-consistency measure for every pixel of the reference image, so that this information can be taken into account during the depth estimation process. This step is quite important, as the introduction of erroneous matches will lead to incorrect depth calculation and consequently to outliers. Conversely, use of only a small number of estimations in order to minimize the outliers, e.g., only the nearest support images, is also not desirable, as redundant observations are essential in providing better geometry and consequently increasing the accuracy and reliability of the depth estimation. Although outliers can be detected and filtered by enforcing depth-consistency with other depth maps, the final completeness of the reconstruction can be adversely affected by the presence of outliers. Thus, visibility determination is crucial to achieving both high completeness and accuracy in the 3D reconstruction.

The problems of support window selection, visibility determination and outlier detection in image-based 3D reconstruction are addressed in this paper, where a novel MVDE-based method for the estimation of both depth and surface normal for each pixel is proposed. The reported approach relies on an accurate visibility estimation method that is initially calculated with the purpose of efficient outlier filtering.
Thus, high accuracy depth and surface normal estimates are anticipated, as all possible redundant observations can be correctly recognized and utilized. In addition, so as to allow for faster computation times, the proposed methodology is developed in such a way that parallel processing of each image is possible. The proposed MVDE approach is evaluated using well-known Middlebury MVS benchmark datasets, and the results obtained show the method to be amongst the most accurate reported for occlusion-challenging scenes. It is also capable of producing a 3D reconstruction with a very high completeness rate.

2. Related work

Selected MVDE approaches, of which there are many in the computer vision literature, will first be discussed to provide background on the MVDE-type approach adopted here. A classic MVDE approach is that reported by Goesele et al. (2006), in which normalized cross-correlation (NCC) is used as the photo-consistency measure. The support window model is simply a fixed-size square window without any surface normal information for the associated 3D object points, while the visibility of each pixel is determined based on both the computed NCC and a comparison against a fixed threshold. Such an approach can often lead to erroneous results, as in some cases a pixel can have a high NCC value even if it is occluded.

In aiming to handle large online 'community' photo collections, Goesele et al. (2007) further proposed a region growing MVDE method that takes reconstructed sparse feature points from SfM as seeds and iteratively grows surfaces from them. Additionally, for the first time, a support window of non-fixed size was introduced. This window is derived from a 3D support plane modeled in the image domain by a depth and two depth offsets. Two photo-consistency measures are used at the same time, namely the sum of squared differences (SSD) and NCC. The SSD is employed only for parameter optimization, while NCC is used for calculating both confidence and convergence, as well as for determining visibility in a similar way to that reported in Goesele et al. (2006). Optimization of both the depth and the two depth offsets is solved via a multiphoto geometrically constrained least-squares matching (MPGC) method (Baltsavias, 1991), one of the drawbacks of which is that it requires good initial parameter estimates in order to achieve fast convergence. Otherwise MPGC may converge slowly, oscillate or even diverge (Gruen, 1996). To deal with these issues, a specific yet complicated mechanism was proposed, yet the completeness and accuracy of the reconstruction remain heavily dependent upon the SfM seed points.

In the method of Strecha et al. (2006), the depth and visibility of each pixel of the reference image are jointly modeled as latent variables in a hidden Markov Random Field (MRF), where smoothness is incorporated on neighboring pixels for both depth and visibility. Their inference, along with the stochastic model parameters, is calculated by an Expectation-Maximization (EM) algorithm. Unlike other MVDE methods, no support window is necessary, as the algorithm is able to consider only pixel-wise color discrepancies due to the incorporation of smoothness terms. However, it is assumed both that every matching pixel has the same color, without error, and that the color discrepancy follows a zero-mean Gaussian distribution, assumptions that are usually violated in practical situations. Further drawbacks of this method include high memory consumption due to the need to store the visibility configuration for each pixel, with memory requirements increasing exponentially as the number of images increases.

The use of several locally optimal depth hypotheses for each pixel has been proposed by both Campbell et al. (2008) and Hu and Mordohai (2012). Campbell et al. (2008) select final depth estimates based on a discrete MRF model without the presence of any other depth maps, whereas Hu and Mordohai (2012) adopt a depth estimation based on depth consistency with other depth maps.
While the former of these two approaches uses a fixed-size square support window, as with the method of Goesele et al. (2006), the latter uses a plane-sweeping stereo method similar to that reported by Gallup et al. (2007) where four different support windows with corresponding 3D object surface normals are considered. Both methods involve finding several local extrema of the photo-consistency measure, which requires sampling the entire


possible depth range at equal intervals. Consequently, only a limited number of discrete depths can be considered.

A parameter propagation and optimization method named PatchMatch was first proposed for use in MVDE by Bailer et al. (2012) and Shen (2013). Their PatchMatch methods do not require sparse SfM feature points as seeds, and they use a support window based on a 3D support plane, unlike the region growing method of Goesele et al. (2007). The 3D support plane is modeled by a depth and two spherical coordinates, i.e., the surface normal is represented by an azimuth and an elevation. Additionally, the method of Shen (2013) deduces the homography relationship between the reference and support windows. In the method of Bailer et al. (2012), a self-weighted average score is used to integrate NCC values computed with different support images, whereby smaller NCC values are assigned a smaller weight in the computation of this weighted average score. Although multiple neighboring images can be selected for the reference image in the PatchMatch method of Shen (2013), only one image supports the depth map generation, while the others are used only to filter outliers by enforcing depth-consistency once their own depth maps have been generated. Thus, redundant observations that could achieve better accuracy and reliability are not utilized.

PatchMatch was originally proposed as a means to quickly find approximate nearest-neighbor matches between image patches for image hole filling (Barnes et al., 2009). It was subsequently generalized to find more than just one nearest neighbor for each image patch (Barnes et al., 2010). PatchMatch was then introduced to binocular stereo vision by Bleyer et al. (2011) as a means to efficiently find an approximate optimal support plane in a disparity space image (DSI) using three parameters for each pixel in the rectified stereo images. Besse et al. (2013) further analyzed the similarities and differences between PatchMatch and Max-Product Particle Belief Propagation (MP-PBP) and proposed a new PatchMatch Belief Propagation (PMBP) method by combining the two. This represented the first attempt to add pairwise smoothness regularization to the PatchMatch algorithm, and the method was verified on binocular stereo data based on the support window model of Bleyer et al. (2011). Heise et al. (2013) also attempted to add pairwise smoothness regularization to PatchMatch-based binocular stereo, with the support window model of Bleyer et al. (2011) again being used. Their method, which differs from that of Besse et al. (2013), was mainly inspired by the computational approach for large displacement optical flow proposed by Steinbrucker et al. (2009).

Zheng et al. (2013) also applied PatchMatch to MVDE, their method being similar to that of Strecha et al. (2006) in that it is also based on a probabilistic model and solves the inference problem using an EM algorithm. A difference between the two approaches is that the probabilistic model used by Strecha et al. (2006) is a hidden MRF, basically an undirected graphical model in which the smoothness term is enforced on both depth and visibility within the 2D neighborhood, whereas the hidden Markov Chain used by Zheng et al. (2013) is a type of directed graphical model in which the smoothness term is enforced only on the visibility, within the 1D neighborhood along the current propagation direction. Unfortunately, and unlike the approach of Bailer et al. (2012) and Shen (2013), the support window model used by Zheng et al.
(2013) is prone to accuracy loss in depth estimation because it does not consider any surface normal information.

3. Methodology

As with other MVDE methods, the method proposed in this paper seeks to generate, from a given set of N images $I = \{I_i\}$, a depth map for each reference image $R \in I$ with respect to its M support images $S = \{S_k\}$, $S \subset I \setminus \{R\}$.

The proposed five-stage methodology for dense surface reconstruction is laid out in Fig. 1, with the stages being:

i. Generation of support plane maps using PatchMatch.
ii. Extraction of minimum visibility.
iii. Visibility augmentation.
iv. Multi-view optimization.
v. Outlier elimination.

The core idea of the proposed method is based on two observations: (1) depth estimates of occluded, mismatched pixels are usually far less smooth than those of non-occluded, correctly matched pixels when they are estimated in a Winner-Takes-All (WTA) manner without any smoothness regularization; and (2) pixels occluded in some support images may be visible in other support images, so that their depths can still be correctly calculated. Based on these two observations, instead of trying to utilize all support images blindly without any knowledge of visibility, the multi-view constraint can be initially relaxed. The process starts with the generation of separate depth maps for the reference image R with each of the support images $S_k$ (see Section 3.2). Then the most likely visible support image is extracted, i.e., the minimum visibility of each pixel is determined (see Section 3.3). The image configuration with minimum visibility is then further augmented to complete the set of all visible support images (see Section 3.4). Once the visibility is estimated, multi-view optimization taking into account all redundant observations is conducted to achieve better accuracy (see Section 3.5). Finally, multi-view consistency is utilized to eliminate any remaining outliers (see Section 3.6).

3.1. Support window model

The support window model used in the developed methodology follows closely that proposed by Goesele et al. (2007), i.e., the 3D support plane is a tangent plane to the object surface that is modeled by three parameters, namely a depth and two per-pixel depth offsets along the x- and y-axes of the image. Goesele et al. (2007) acknowledge that this 3D support plane only approximates the actual object surface; however, the associated error is assumed negligible when the window size is small. A mathematical derivation follows, both to supplement the statements of previous researchers and to derive the relationship between the reference window, the 3D support plane and the support images.

Fig. 2 shows the relationship between a pixel $p = [x, y]^T$ of the reference image R and the corresponding 3D object point $P_R = [X_R, Y_R, Z_R]^T$. The $w \times w$ reference window centered on pixel p corresponds to a theoretical object surface that is defined as a tangent plane at the object point $P_R$ with surface normal $n_R = [a, b, c]^T$. This tangent plane is a surface estimate of the reference window centered at p in 3D object space. If the $Z_R$ coordinate of $P_R$ is referred to as the depth d with respect to R, then this 3D plane can be described as

$n_R^T (P'_R - P_R) = n_R^T (d'\hat{p}' - d\hat{p}) = 0$    (1)

where $P'_R$ is an arbitrary point on the plane and $d'$ is its depth with respect to R. $\hat{p}'$ and $\hat{p}$ correspond to the vector directions of $P'_R$ and $P_R$, respectively, and they are the normalized image points defined as

$\hat{p} = \begin{bmatrix} \hat{x} \\ \hat{y} \\ 1 \end{bmatrix} = \begin{bmatrix} \frac{x - x_p}{f} \\ \frac{y - y_p}{f} \\ 1 \end{bmatrix}, \qquad \hat{p}' = \begin{bmatrix} \hat{x}' \\ \hat{y}' \\ 1 \end{bmatrix} = \begin{bmatrix} \frac{x' - x_p}{f} \\ \frac{y' - y_p}{f} \\ 1 \end{bmatrix}$    (2)

where $p' = [x', y']^T$ is the projection of $P'_R$ in R, f is the focal length of R, and $x_p$ and $y_p$ are the principal point offsets of R.


Fig. 1. Flow chart of the proposed MVDE method. Depth maps are displayed as gray-value images, and normal maps are displayed as color images, with normalized normal vector [X, Y, Z]T mapped to R, G and B respectively. Red color in the estimated visibility maps indicates pixels in the reference image that are not reliably visible in corresponding support images.
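The color coding described in the caption can be reproduced with a few lines of code. The sketch below assumes the common convention of rescaling each normal component from [−1, 1] to [0, 255]; the paper does not specify its exact mapping, so this is illustrative only.

```python
import numpy as np

def normal_to_rgb(n):
    """Map a unit surface normal [X, Y, Z] to an RGB triple for display.
    Assumes the common convention of rescaling [-1, 1] to [0, 255]; the
    paper does not specify its exact mapping."""
    n = np.asarray(n, dtype=float)
    n = n / np.linalg.norm(n)
    return np.round((n + 1.0) * 127.5).astype(int)

print(normal_to_rgb([0.0, 0.0, -1.0]))  # a normal facing the camera
```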

After expansion, Eq. (1) becomes

$a\left(\frac{x'-x_p}{f}d' - \frac{x-x_p}{f}d\right) + b\left(\frac{y'-y_p}{f}d' - \frac{y-y_p}{f}d\right) + c(d'-d) = 0$    (3)

which can be further simplified to

$\frac{ad}{f}u + \frac{bd}{f}v + a\hat{x}'\Delta d + b\hat{y}'\Delta d + c\Delta d = 0$    (4)

where

$u = x' - x, \qquad v = y' - y, \qquad \Delta d = d' - d$

Thus, the depth increment $\Delta d$ can be described as a combination of the image point increments u and v as follows:

$\Delta d = -\frac{ad}{f(a\hat{x}' + b\hat{y}' + c)}u - \frac{bd}{f(a\hat{x}' + b\hat{y}' + c)}v = -\frac{ad}{f\, n_R^T \hat{p}'}u - \frac{bd}{f\, n_R^T \hat{p}'}v$    (5)

Eq. (5) indicates that $\Delta d$ is not strictly a linear combination of u and v. However, when the reference window centered on p is small, the difference between the directions $\hat{p}'$ and $\hat{p}$ is negligible, so that $\hat{p}'$ can be replaced with $\hat{p}$. Then $\Delta d$ approximates a linear combination of u and v:

$\Delta d \approx -\frac{ad}{f\, n_R^T \hat{p}}u - \frac{bd}{f\, n_R^T \hat{p}}v = \alpha u + \beta v$    (6)

Fig. 2. 3D support plane in object space, shown by a gray patch.


The depth of any pixel $p'$ within the reference window can now be given as

$d(x+u, y+v) = d(x, y) + \Delta d = d + \alpha u + \beta v$    (7)

Thus, the 3D support plane associated with the reference window of p can now be modeled in R using three parameters $\mathbf{x} = [d, \alpha, \beta]^T$, which are the depth and the two per-pixel depth offsets. The parameter set $\mathbf{x}$ is referred to in this paper as a support plane. Considering that the surface normal in object space should always point toward the camera, the 3rd element of $n_R$ can be set to $-1$. Then $n_R$ can be uniquely reconstructed from $\mathbf{x}$ as

$n_R = \begin{bmatrix} a \\ b \\ -1 \end{bmatrix} = \begin{bmatrix} \frac{f\alpha}{f\alpha\hat{x} + f\beta\hat{y} + d} \\ \frac{f\beta}{f\alpha\hat{x} + f\beta\hat{y} + d} \\ -1 \end{bmatrix}$    (8)
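To make the parameterization concrete, the following minimal Python sketch evaluates Eqs. (2), (7) and (8): the normalized image point, the per-pixel depths implied by a support plane [d, α, β], and the normal n_R reconstructed from it. All function and variable names are illustrative, not the authors' code.

```python
import numpy as np

def normalized_point(x, y, xp, yp, f):
    """Normalized image point p_hat of Eq. (2) for pixel (x, y), given the
    principal point (xp, yp) and focal length f."""
    return np.array([(x - xp) / f, (y - yp) / f, 1.0])

def window_depths(d, alpha, beta, w):
    """Depths over a w x w reference window centered on p, from the support
    plane [d, alpha, beta] via Eq. (7): d(x+u, y+v) = d + alpha*u + beta*v."""
    r = w // 2
    v, u = np.meshgrid(np.arange(-r, r + 1), np.arange(-r, r + 1), indexing="ij")
    return d + alpha * u + beta * v

def normal_from_plane(d, alpha, beta, x_hat, y_hat, f):
    """Reconstruct n_R = [a, b, -1]^T from [d, alpha, beta] via Eq. (8)."""
    denom = f * alpha * x_hat + f * beta * y_hat + d
    return np.array([f * alpha / denom, f * beta / denom, -1.0])
```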

When moving to the world coordinate system, the surface normal n can be calculated as $n = R_R^T n_R$, with $R_R$ being the rotation matrix of R. Given the depth d(x, y) and the direction $\hat{p}(x, y)$ of a pixel $[x, y]^T$ in R, the corresponding 3D coordinates of its object point in the world coordinate system can then be determined as

$P(x, y) = O_R + R_R^T P_R(x, y) = O_R + d(x, y)\, R_R^T \hat{p}(x, y)$    (9)

where $O_R$ is the optical center of R in the world coordinate system. According to Eq. (7), the coordinates of the object points corresponding to any other pixels $[x+u, y+v]^T$ within the reference window in R are given as

$P(x+u, y+v) = O_R + d(x+u, y+v)\, R_R^T \hat{p}(x+u, y+v) = O_R + \left(d(x,y) + \alpha(x,y)u + \beta(x,y)v\right) R_R^T \hat{p}(x+u, y+v)$    (10)

Their projections on a support image $S_k$ are

$p_k(x+u, y+v) = \frac{1}{Z_k}\left(K_k R_k P(x+u, y+v) + K_k t_k\right)$    (11)

Here, $K_k$ and $R_k$ are the calibration and rotation matrices of $S_k$, respectively, $t_k$ is the translation vector, and $Z_k$ is the 3rd element of the vector $K_k R_k P(x+u, y+v) + K_k t_k$. Once the support plane is calculated for the reference window in R, matching positions in any of the support images can be directly determined without epipolar resampling, while the complete support window can be constructed using interpolation.

3.2. Generation of support plane maps using PatchMatch

During this stage, a support plane map is calculated for the reference image with respect to each of the support images $S_k$. For each image pair R and $S_k$, a support plane map $\mathbf{x}^{(k)}: \Omega \subset \mathbb{R}^2 \to \mathbb{R}^3$ is first generated, with the parameters comprising one depth map $d^{(k)}: \Omega \to \mathbb{R}$ and two depth offset maps $\alpha^{(k)}: \Omega \to \mathbb{R}$ and $\beta^{(k)}: \Omega \to \mathbb{R}$. The adopted photo-consistency measure between the reference window centered on a pixel $[x, y]^T$ in R and the support window in $S_k$ given $\mathbf{x}$ is defined as

$C(x, y, \mathbf{x}, S_k) = 1 - \mathrm{NCC}(x, y, \mathbf{x}, S_k)$    (12)

In this step, the PatchMatch algorithm is employed, as it allows for an efficient calculation of an approximately optimal support plane, in the sense of minimization of the photo-consistency measure, out of the infinite number possible for each pixel:

$\mathbf{x}^{(k)}(x, y) = \arg\min_{\mathbf{x}}\, C(x, y, \mathbf{x}, S_k)$    (13)

As explained by Bleyer et al. (2011), the PatchMatch algorithm comprises an initialization stage followed by a specified number T of iterations. In order to achieve faster convergence, a modification in the way that the propagation and random sampling are performed during the iterative process is proposed here. In the original method, the spatial propagation was carried out only in the 2D image domain. The proposed approach, however, calculates the support plane maps $\mathbf{x}^{(k)}$ sequentially, so that spatial propagation estimates can be propagated not only within $\mathbf{x}^{(k)}$ but also from the other maps $\mathbf{x}^{(i \neq k)}$. This can be thought of as a 3D spatial propagation and should not be confused with the view propagation mentioned in the original paper, as will be explained later. View propagation is not included in this modified approach, in order to ensure process independence and parallelism.

3.2.1. Initialization

During the initialization phase, each pixel is first given a random uniform value:

$\mathbf{x}^{(k)}(x, y) = \begin{bmatrix} d^{(k)}(x, y) \sim U(d_{min}, d_{max}) \\ \alpha^{(k)}(x, y) \sim U(\alpha_{min}, \alpha_{max}) \\ \beta^{(k)}(x, y) \sim U(\beta_{min}, \beta_{max}) \end{bmatrix}$    (14)

Here, $d_{min}$ and $d_{max}$ are the minimum and maximum depths to be considered. These values can either be set manually or automatically, by utilizing a user-provided bounding box or the SfM reconstruction result in cases where it is available. The remaining required parameters are defined as

$\alpha_{max} = \beta_{max} = \frac{d_{max} \tan\theta_{max}}{f}$    (15)

$\alpha_{min} = -\alpha_{max}, \qquad \beta_{min} = -\beta_{max}$    (16)

with $\theta_{max}$ being the angle between the allowed maximum depth increment vector and the image plane.

3.2.2. Iterative process and 3D spatial propagation

As shown in Fig. 3(a), in each odd iteration $\mathbf{x}^{(k)}(x, y)$ is initially compared to its horizontal neighbors $\mathbf{x}^{(k)}(x-1, y)$ and $\mathbf{x}^{(k)}(x, y-1)$, in case they have been updated. This comparison aims to find out whether there is a better estimate than the current value $\mathbf{x}'^{(k)}(x, y)$. Secondly, all the updated estimates along the vertical direction, $\{\mathbf{x}^{(i \in [1,k))}(x, y)\}$, are also checked for the possibility of their being a better estimate. The parameters of $\mathbf{x}^{(k)}(x, y)$ are temporarily updated to any better estimate found during this procedure, which can be described by

$\bar{\mathbf{x}}^{(k)}(x,y) = \arg\min_{\mathbf{x}} \left\{ C(x,y,\mathbf{x}^{(k)}(x-1,y),S_k),\ C(x,y,\mathbf{x}^{(k)}(x,y-1),S_k),\ C(x,y,\mathbf{x}'^{(k)}(x,y),S_k),\ C(x,y,\mathbf{x}^{(i)}(x,y),S_k),\ i \in [1,k) \right\}$    (17)

The 3D spatial propagation in even iterations is similar to that in odd iterations, except that it is performed in the opposite directions, as shown in Fig. 3(b), and it can be described as

$\bar{\mathbf{x}}^{(k)}(x,y) = \arg\min_{\mathbf{x}} \left\{ C(x,y,\mathbf{x}^{(k)}(x+1,y),S_k),\ C(x,y,\mathbf{x}^{(k)}(x,y+1),S_k),\ C(x,y,\mathbf{x}'^{(k)}(x,y),S_k),\ C(x,y,\mathbf{x}^{(i)}(x,y),S_k),\ i \in (k,M] \right\}$    (18)
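A minimal Python sketch of one 3D spatial propagation step of Eqs. (17) and (18) follows. The map layout, cost callback and zero-based layer indexing are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def propagate_pixel(x_maps, cost, k, x, y, odd):
    """One 3D spatial propagation step for pixel (x, y) of map k, sketching
    Eqs. (17)-(18). x_maps is the list of M support plane maps, each an
    H x W x 3 array holding [d, alpha, beta]; cost(plane, x, y, k) evaluates
    the photo-consistency C(x, y, x, S_k)."""
    M = len(x_maps)
    H, W, _ = x_maps[k].shape
    step = -1 if odd else 1                        # raster order in odd, reverse in even
    candidates = [x_maps[k][y, x]]                 # current estimate x'^(k)(x, y)
    if 0 <= x + step < W:
        candidates.append(x_maps[k][y, x + step])  # neighbor along the row
    if 0 <= y + step < H:
        candidates.append(x_maps[k][y + step, x])  # neighbor along the column
    # vertical (cross-layer) candidates: i in [1, k) for odd, (k, M] for even,
    # written here with zero-based indices
    layers = range(0, k) if odd else range(k + 1, M)
    candidates += [x_maps[i][y, x] for i in layers]
    costs = [cost(c, x, y, k) for c in candidates]
    x_maps[k][y, x] = candidates[int(np.argmin(costs))]
```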

Note that while the vertical propagation proposed here looks similar to the view propagation proposed by Bleyer et al. (2011), there are fundamental differences. The view propagation method of Bleyer et al. (2011) involves propagating estimates from the support plane map of R to support plane maps created from different reference images, which essentially constitutes propagation between different support plane maps. In contrast, the proposed approach utilizes only support plane maps related to the support images {Sk} of one single reference image, and thus the vertical


Fig. 3. 3D spatial propagation. The white arrows represent horizontal propagation while the gray arrow represents vertical propagation from different layers.

propagation is only performed from $\mathbf{x}_R^{(i)}$ to $\mathbf{x}_R^{(k)}$. That way the independence and parallelism of the processing of each image is not compromised.

Fig. 4 showcases the efficiency of the developed approach. It can be seen that the 3D spatial propagation requires half the number of iterations when compared to propagation limited to the 2D domain. The vertical propagation significantly accelerates the convergence of the PatchMatch operation, yet most importantly no extra memory is required, since the M support plane maps $\{\mathbf{x}^{(k)}\}$ are generated regardless of the vertical propagation.

In order to avoid convergence to a local minimum, a further L uniform random samples are extracted once the temporary optimal support plane estimate $\bar{\mathbf{x}}^{(k)}(x, y)$ is found. The random samples are extracted from the parameter space centered on $\bar{\mathbf{x}}^{(k)}(x, y)$ with radius $r_j$, which decreases exponentially from the maximum radius $r_{max}$ by a factor of $\gamma \in (0, 1)$, as shown below:

$\mathbf{x}_j^{(k)}(x, y) \sim U_N\!\left(\bar{\mathbf{x}}^{(k)}(x, y),\ r_j\right), \qquad r_j = \gamma^{\,j-1}\, r_{max}$    (19)

where $r_{max}$ is defined as

$r_{max} = \begin{bmatrix} (d_{max} - d_{min})/2 \\ (\alpha_{max} - \alpha_{min})/2 \\ (\beta_{max} - \beta_{min})/2 \end{bmatrix}$    (20)

Finally, $\mathbf{x}^{(k)}(x, y)$ is updated in the current iteration to the best estimate among $\bar{\mathbf{x}}^{(k)}(x, y)$ and $\{\mathbf{x}_j^{(k)}(x, y)\}$:

$\mathbf{x}^{(k)}(x, y) = \arg\min_{\mathbf{x}} \left\{ C(x, y, \bar{\mathbf{x}}^{(k)}(x, y), S_k),\ C(x, y, \mathbf{x}_j^{(k)}(x, y), S_k),\ j \in [1, L] \right\}$    (21)
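The random sampling and update of Eqs. (19)–(21) can be sketched as follows; the cost callback and the vector layout [d, α, β] are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng()

def random_refinement(x_bar, r_max, L, gamma, cost):
    """Random sampling around the temporary optimum x_bar, sketching
    Eqs. (19)-(21). x_bar and r_max are 3-vectors over [d, alpha, beta];
    cost(x) evaluates the photo-consistency C for a candidate plane."""
    best, best_cost = x_bar, cost(x_bar)
    for j in range(1, L + 1):
        r_j = gamma ** (j - 1) * r_max             # exponentially shrinking radius
        candidate = rng.uniform(x_bar - r_j, x_bar + r_j)
        c = cost(candidate)
        if c < best_cost:                          # Eq. (21): keep the best estimate
            best, best_cost = candidate, c
    return best
```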

Fig. 4. Maps of estimated surface normals generated with and without vertical propagation.

It can easily be calculated that in each iteration $(2 + \frac{M-1}{2} + L)M$ NCC computations are conducted per pixel. Thus, given the image width W, height H and number of iterations T, the total number of NCC computations required in the whole PatchMatch is $(2 + \frac{M-1}{2} + L)MWHT$. For comparison, the computational complexities of two closely related MVDE methods that also use PatchMatch, i.e. the methods of Bailer et al. (2012) and Shen (2013), are also listed in Table 1. While the computations involved in the proposed method are similar in number to those in the method of Bailer et al. (2012), the new method does require more computations than the method of Shen (2013). This is because, unlike the other two methods, Shen (2013) generates a depth map using only one support image (as discussed in Section 2); no redundant observations are utilized to achieve better accuracy and reliability. To cope with this computational load, Bailer et al. (2012) propose the use of parallel processing lines, whereby all pixels in the same line can be processed in parallel using a GPU. As for the proposed method, although no parallelism is currently exploited, and considering the significant acceleration of the convergence of PatchMatch brought about by the vertical propagation (see Fig. 4), the actual number of required iterations T can be small. Empirically, T = 2 iterations is normally sufficient.

Table 1. PatchMatch computational complexity comparison.

Method | NCC computations required in PatchMatch
Proposed | (2 + (M−1)/2 + L)MWHT
Bailer et al. (2012) | (3 + L)MWHT
Shen (2013) | (3 + L)WHT
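As a quick illustration of Table 1, the snippet below evaluates the three counts for a hypothetical configuration; the image size and parameter values are chosen arbitrarily.

```python
def ncc_counts(M, W, H, T, L):
    """Total NCC computations per Table 1 for the three PatchMatch variants,
    using the paper's symbols: M support images, W x H pixels, T iterations,
    L random samples."""
    proposed = (2 + (M - 1) / 2 + L) * M * W * H * T
    bailer = (3 + L) * M * W * H * T
    shen = (3 + L) * W * H * T
    return proposed, bailer, shen

# e.g. with M = 4, L = 5: (2 + 1.5 + 5) * 4 = 34 NCC evaluations per pixel per
# iteration for the proposed method, versus 32 for Bailer et al. (2012)
print(ncc_counts(M=4, W=640, H=480, T=2, L=5))
```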

3.3. Extraction of minimum visibility

Fig. 5 shows the results obtained from the previous step. In the first row, 4 support images and the reference image are displayed. The first 4 columns of the second and third rows show the corresponding depth maps {d^(k)} along with the normalized surface normal maps {n^(k): Ω → R³} of R with respect to each support image. The normal maps are computed using Eq. (8), and their display is achieved by normalizing the 3D coordinates and mapping them to red, green and blue, respectively. Fig. 5 highlights certain areas in each d^(k) where the depth and surface normal estimates are not smooth. This is often caused by occlusions between the reference image R and the corresponding support image S_k, and becomes more frequent as the baseline between the images increases. However, it can be seen that although some areas are occluded in certain support images, which results in non-optimal estimates, other pairs have successfully produced correct estimates. It is advantageous, then, to know which pixels result


in correct estimates, since that defines the visibility of a pixel of image R with respect to a support image S_k. To achieve this, the most likely support plane for each pixel of R is initially selected out of the M candidates based on a discrete MRF model.

3.3.1. MRF model

The adopted MRF model contains both a data and a smoothness term:

$E(\mathbf{x}) = E_{data}(\mathbf{x}) + E_{smooth}(\mathbf{x})$    (22)

The smoothness term incorporates smoothness parameters for both the depth and surface normal, such that

$E_{smooth}(\mathbf{x}) = \lambda_1 E_{smooth}(d) + \lambda_2 E_{smooth}(n)$    (23)

where $\lambda_1$ and $\lambda_2$ are the weights of the depth and normal smoothness terms, respectively, while no weight is applied to the data term. The three energy terms are defined as

$E_{data}(\mathbf{x}) = \sum_{p \in \Omega} C(p, \mathbf{x}(p))$    (24)

$E_{smooth}(d) = \sum_{p \in \Omega} \sum_{q \in N(p)} C_d(d(p), d(q))$    (25)

$E_{smooth}(n) = \sum_{p \in \Omega} \sum_{q \in N(p)} C_n(n(p), n(q))$    (26)

Fig. 5. Extraction of the most likely support plane. Highlighted areas indicate where the depth and surface normal estimates are not smooth.


where $p \in \Omega$ denotes any pixel in R and $q \in N(p)$ denotes any neighboring pixel of p. The cost functions $C_d$ and $C_n$ are defined as

$C_d(d_1, d_2) = \frac{2\,|d_1 - d_2|}{d_1 + d_2}$    (27)

$C_n(n_1, n_2) = \frac{\angle(n_1, n_2)}{\pi} = \frac{\cos^{-1}\left(n_1^T n_2 / \|n_1\|\|n_2\|\right)}{\pi}$    (28)

Eq. (22) can then be expanded to

$E(\mathbf{x}) = \sum_{p \in \Omega} \left\{ C(p, \mathbf{x}(p)) + \sum_{q \in N(p)} \left[ \lambda_1 C_d(d(p), d(q)) + \lambda_2 C_n(n(p), n(q)) \right] \right\}$    (29)

The most likely support plane map $\mathbf{x}_{ML}$ is extracted by minimizing this energy function:

$\mathbf{x}_{ML} = \arg\min_{\mathbf{x}} E(\mathbf{x})$    (30)

3.3.2. Energy minimization

A dynamic programming (DP) method similar to that proposed by Hirschmuller (2008) has been adopted to solve the energy minimization problem. More specifically, 1D DP is conducted 8 times, to aggregate costs from 8 different directions equally at each pixel. For every direction l a 3D cost matrix $L_l: (\Omega \times \mathbb{R}^3) \to \mathbb{R}$ is generated, and the cost at level k of pixel p is updated during the traversal according to the following equation:

$L_l(p, k) = C\left(p, \mathbf{x}^{(k)}(p), S_k\right) + \min_{i \in [1,M]} \left\{ L_l(p - l, i) + \lambda_1 C_d\left(d^{(k)}(p), d^{(i)}(p - l)\right) + \lambda_2 C_n\left(n^{(k)}(p), n^{(i)}(p - l)\right) \right\} - \min_{j \in [1,M]} L_l(p - l, j)$    (31)

The most likely visible support image of a pixel p is then selected by finding the level with the minimum aggregated cost over all $\{L_l(p, k)\}$, which is

$k_{ML}(p) = \arg\min_k \left\{ \sum_l L_l(p, k) \right\}$    (32)

The most likely support plane of p will then be

$\mathbf{x}_{ML}(p) = \mathbf{x}^{(k_{ML}(p))}(p)$    (33)

which corresponds to approximately minimizing Eq. (29). For a more detailed description of the energy minimization method, readers are referred to Hirschmuller (2008).

Extraction of the support plane map $\mathbf{x}_{ML}$ based on the proposed discrete MRF model is illustrated graphically in Fig. 6, which shows seven consecutive pixels, each having 4 candidate support plane estimates. These support planes are represented as numbered circles with attached normal vectors. The number inside each circle denotes the support image from which the support plane originates, while the location of the circle itself graphically represents the corresponding depth estimate. The dark gray circles are selected as the most likely support planes for each pixel, derived via the proposed MRF model, as they have the smoothest estimates for both depth and surface normal.

Fig. 6. Visibility estimation.

The extracted most likely depth map $d_{ML}$ and surface normal map $n_{ML}$ for the MVDE example of Fig. 5 are shown in the last column of the second and third rows, respectively, of that same figure. Both $d_{ML}$ and $n_{ML}$ are smoother, and display better completeness, than any of the individual $d^{(k)}$ and $n^{(k)}$ maps. It should be noted that, during the extraction of the minimum visibility and during the visibility augmentation performed in the next stage, not all M candidate support plane estimates need to be considered for each pixel. A more selective process can be employed, such that depth or surface normal estimates with erroneous results are filtered out and not taken into account.

While the proposed minimum visibility extraction might appear similar to the method proposed by Campbell et al. (2008), since multiple hypotheses are maintained for each pixel and a discrete MRF model is employed, the motivation is different. Campbell et al. (2008) attempted to output a complete final depth map by fusing multiple depth hypotheses for each pixel, whereas the proposed method attempts only to infer the visibility of each pixel. Also, in the proposed method unlimited combinations of continuous depths and surface normals are considered and no discretization is required. Moreover, the acquired multiple hypotheses can come from matching with the same support image in the method of Campbell et al. (2008), whereas in the proposed approach all hypotheses come from matching the reference image with each of the support images; thus, the number of hypotheses equals the number of support images. Finally, since PatchMatch is used in the proposed method, each hypothesis corresponds to an approximate global extremum of the photo-consistency measure instead of to a local extremum, as in the method of Campbell et al. (2008).
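Returning to the energy minimization of Section 3.3.2, a simplified sketch of the 1D aggregation of Eq. (31) is given below. For brevity it omits the normal term λ2·C_n, and the array layout along the traversal is an assumption.

```python
import numpy as np

def aggregate_1d(C, D, lam1):
    """1D dynamic-programming aggregation along one direction l, a simplified
    sketch of Eq. (31). C[t, k] is the data cost of level k at the t-th pixel
    of the traversal; D[t, k] is the corresponding depth d^(k); the normal
    term lambda2 * C_n is omitted for brevity."""
    T, M = C.shape
    Ll = np.empty_like(C)
    Ll[0] = C[0]
    for t in range(1, T):
        prev = Ll[t - 1]
        prev_min = prev.min()            # subtracted to keep costs bounded
        for k in range(M):
            # C_d of Eq. (27) between level k at p and each level i at p - l
            cd = 2.0 * np.abs(D[t, k] - D[t - 1]) / (D[t, k] + D[t - 1])
            Ll[t, k] = C[t, k] + np.min(prev + lam1 * cd) - prev_min
    return Ll

# k_ML(p) of Eq. (32) is then the argmin, over k, of the sum of the eight
# directional aggregates at p.
```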

3.4. Visibility augmentation

The visibility augmentation phase developed in this section is based on the method of Rothermel et al. (2012), where it is stated that the matching position of a pixel p of image R in the support image $S_k$ can only be estimated with a confidence interval of $2\sigma$ along the epipolar line. This means that there is an uncertainty range of $[d_{min}^{(k)}\hat{p},\ d_{max}^{(k)}\hat{p}]$ on the line of sight of p, where $\hat{p}$ is p's normalized image coordinate vector representing the direction of its line of sight. As shown in Fig. 7, the dark and light gray circles represent the support planes $\mathbf{x}^{(1)}$ and $\mathbf{x}^{(2)}$ of p, respectively, with the dark gray circle being the most likely selected support plane estimate. The uncertainty ranges $[d_{min}^{(1)}\hat{p},\ d_{max}^{(1)}\hat{p}]$ and $[d_{min}^{(2)}\hat{p},\ d_{max}^{(2)}\hat{p}]$ are the relevant confidence intervals for $S_1$ and $S_2$, respectively, and their graphical representation is provided in Fig. 7.

Similar to the method of Rothermel et al. (2012), $[d_{min}^{(k)}\hat{p},\ d_{max}^{(k)}\hat{p}]$ and $[d_{min}^{(k_{ML})}\hat{p},\ d_{max}^{(k_{ML})}\hat{p}]$ will overlap if $\mathbf{x}^{(k)}$ can be clustered together with $\mathbf{x}_{ML}$ or, in other words, if p is visible in both support images. However, this is not sufficient for the proposed method, since the surface normal of the support plane is also estimated, and the similarity between $n^{(k)}$ and $n_{ML}$ should therefore also be considered in the clustering. As a consequence, an additional check is added so that the angle between the two surface normals does not exceed 90°. According to this criterion, $\mathbf{x}^{(2)}$ in Fig. 7 can obviously be clustered together with $\mathbf{x}^{(1)}$.
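The clustering test just described can be sketched as follows, assuming the per-view depth confidence intervals are available; all names are illustrative.

```python
import numpy as np

def can_cluster(interval_k, interval_ml, n_k, n_ml):
    """Visibility-augmentation test of Section 3.4: cluster x^(k) with x_ML
    when the depth confidence intervals along the line of sight overlap and
    the two surface normals differ by less than 90 degrees.
    interval_* are (d_min, d_max) pairs; n_* are surface normal vectors."""
    overlap = interval_k[0] <= interval_ml[1] and interval_ml[0] <= interval_k[1]
    acute = float(np.dot(n_k, n_ml)) > 0.0   # angle between normals < 90 deg
    return overlap and acute
```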


Fig. 7. Visibility augmentation by clustering.

For the example in Fig. 6, all support plane estimates that can be clustered together with $\mathbf{x}_{ML}$ are denoted by light gray circles. An essential difference between the clustering proposed here and that of Rothermel et al. (2012) is that in the present method the $\mathbf{x}^{(k)}$ parameter sets can only be clustered together with $\mathbf{x}_{ML}$. This is because the outlier elimination of Rothermel et al. (2012) relies on the number of consistent depth estimates that are clustered together being larger than a required minimum threshold. In contrast, with the proposed method the minimum visibility, which is the most likely visible support image, has already been determined (Section 3.3) based on the MRF model. Thus, even in the case that all other estimates are clustered together, they will be eliminated as outliers, as they are considered occlusions, so long as $\mathbf{x}_{ML}$ is not in that cluster. Additionally, no minimum number of support planes is required for a cluster in the proposed method, as there is at least one $\mathbf{x}_{ML}$ for each pixel. In this way, the proposed method is able to preserve the estimates of those pixels that are only visible in a single support image. Applied to the example shown in Fig. 5, the generated augmented visibility maps $\{u^{(k)}: \Omega \to \mathbb{R}\}$ of R with respect to each support image $\{S_k\}$ are shown in Fig. 8. The visible pixels p, with $u^{(k)}(p) = 1$, are shown in their original RGB color, while occluded pixels, with $u^{(k)}(p) = 0$, are shown in red. It can be seen that the visibility of most pixels is correctly estimated.

3.5. Multi-view optimization

Although the extracted most likely support plane map $\mathbf{x}_{ML}$ (Section 3.3) shows a significant improvement compared with each single $\mathbf{x}^{(k)}$, as shown in Fig. 5, it is essentially estimated using only one visible support image for each pixel, and thus there are no redundant observations. The proposed methodology addresses the need for observational redundancy, which is warranted if a further improvement in accuracy of the derived support plane map $\mathbf{x}$ for the reference image R is to be achieved. One possible approach is to fix the matching image points in all visible support images and adjust only the depth with the purpose of minimizing the re-projection errors, as has been proposed by Rothermel et al. (2012). However, such an approach will only increase the accuracy of the depth estimate and not the accuracy of the estimated surface normal, which is essential for a better surface reconstruction. Surface normal estimation can be achieved by also optimizing the location of the matching image points along with the depth estimates. The MPGC method used by Goesele et al. (2007), which optimizes both the depth and surface normal parameters, has been adopted for this step.

3.5.1. MPGC

Given the projection $p_k(u, v)$ in $S_k$ of a pixel $[u, v]^T$ in the reference window, the residual vector to be minimized in MPGC is

$f_k(u, v) = c_k I_k(p_k(u, v)) - I_R(u, v)$    (34)

where $c_k$ is the color scale between the reference and support windows in $S_k$. As shown below, the derivatives of this residual vector with respect to the support plane can be easily computed using the chain rule according to Eqs. (10) and (11):

$\underbrace{\frac{\partial f_k(u,v)}{\partial \mathbf{x}}}_{3\times3} = c_k \underbrace{\frac{\partial I_k}{\partial p_k(u,v)}}_{3\times2}\, \underbrace{\frac{\partial p_k(u,v)}{\partial P(u,v)}}_{2\times3}\, \underbrace{\frac{\partial P(u,v)}{\partial \mathbf{x}}}_{3\times3}$    (35)

$\underbrace{\frac{\partial f_k(u,v)}{\partial c_k}}_{3\times1} = I_k(p_k(u,v))$    (36)

The derivatives with respect to $c_k$ are the projected colors, as per Eq. (36). For a more detailed description of MPGC, readers are referred to Goesele et al. (2007).

3.5.2. Outlier handling in MPGC

The MPGC algorithm is carried out within an image window and is not limited to a single pixel. Theoretically, all the pixels within the reference window, as well as those within the support windows of all the visible support images, should correspond to the same object points; otherwise, outliers are introduced. More specifically, outliers can arise in two different circumstances: firstly, when there are depth discontinuities in the reference window, and secondly, when pixels in the reference window are occluded in the support window. These two cases are highlighted in Fig. 9, where pixels in the first column of the reference window, along with their projected pixels in the support window of $S_k$, do not correspond to the same object surface due to a depth discontinuity in the reference window. Additionally, pixels in the last column of the reference window, along with their projected pixels in the support window, also do not correspond, because the former are occluded in the support window. Detection and proper handling of such cases in MPGC is necessary to avoid any accuracy loss near depth or visibility discontinuities. Such outliers are shown in Fig. 10(b), where erroneously estimated surface normals appear after MPGC, being distinct due to their wave-like form.

In traditional Least-Squares Matching (LSM), outliers are usually handled by adaptive weights, but in the proposed method good preliminary estimates are readily available for both depth and visibility. These initial estimates can be used to efficiently detect outliers. Outliers caused by depth discontinuities can be readily detected by local depth clustering using the depth map $d_{ML}$. A depth discontinuity between two neighboring depths can be assumed if the absolute difference $|\Delta d|$ between them is larger than a threshold, as defined by the following equation:

$|\Delta d| = |d_1 - d_2| > \frac{(d_1 + d_2)\tan\theta_{max}}{2f}$    (37)

Otherwise, they are clustered together. The clustering result for the pixels in the reference window is denoted by $n(u, v)$. If the pixel $[u, v]^T$ is in the same cluster as the window center, $n(u, v) = 1$, meaning that the pixel has a similar depth and is therefore an inlier. Otherwise, it is considered an outlier and its projection $p_k(u, v)$ in $S_k$ is not considered in MPGC.
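A minimal sketch of the local depth-clustering test of Eq. (37) follows; it assumes θ_max is given in radians.

```python
import math

def same_depth_cluster(d1, d2, f, theta_max):
    """Eq. (37): neighboring depths d1 and d2 are clustered together unless
    their absolute difference exceeds the slant-dependent threshold.
    theta_max is in radians; a minimal sketch."""
    return abs(d1 - d2) <= (d1 + d2) * math.tan(theta_max) / (2.0 * f)
```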


Fig. 8. Augmented visibility with respect to each support image.

On the other hand, outliers caused by occlusions are directly indicated by their visibility: a pixel $[u, v]^T$ is denoted as visible when $u^{(k)}(u, v) = 1$, whereas $u^{(k)}(u, v) = 0$ denotes a pixel occlusion in $S_k$, and consequently its projection $p_k(u, v)$ in $S_k$ is also omitted from MPGC. This process can be extended to accommodate the full-size support window by introducing a matrix $mask^{(k)}$ to handle outliers:

$mask^{(k)}(u, v) = n(u, v)\, u^{(k)}(u, v)$    (38)

If the images of Fig. 9 are used as an example, the generated $mask^{(k)}$ is

$mask^{(k)} = n \circ u^{(k)} = \begin{bmatrix} 0&1&1&1&1 \\ 0&1&1&1&1 \\ 0&1&1&1&1 \\ 0&1&1&1&1 \\ 0&1&1&1&1 \end{bmatrix} \circ \begin{bmatrix} 1&1&1&1&0 \\ 1&1&1&1&0 \\ 1&1&1&1&0 \\ 1&1&1&1&0 \\ 1&1&1&1&0 \end{bmatrix} = \begin{bmatrix} 0&1&1&1&0 \\ 0&1&1&1&0 \\ 0&1&1&1&0 \\ 0&1&1&1&0 \\ 0&1&1&1&0 \end{bmatrix}$

Fig. 9. Two types of outliers in MPGC.
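In code, the mask of Eq. (38) is a simple element-wise product; the sketch below reproduces the 5 × 5 example of Fig. 9 with illustrative array names.

```python
import numpy as np

# Element-wise mask of Eq. (38) for the 5x5 window of Fig. 9: the first column
# fails the depth clustering (n = 0) and the last column is occluded in S_k
# (u_k = 0); only pixels with mask == 1 contribute to MPGC.
n = np.tile([0, 1, 1, 1, 1], (5, 1))      # depth-clustering indicator n(u, v)
u_k = np.tile([1, 1, 1, 1, 0], (5, 1))    # visibility indicator u^(k)(u, v)
mask_k = n * u_k
print(mask_k)                             # first and last columns are zeroed
```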

Projections of the pixels whose mask(k)(u, v) = 1 are used for MPGC. After applying this outlier handling methodology, the accuracy at pixels near depth or visibility discontinuities is maintained, as can be seen in Fig. 10(c), where there is a clear improvement in the estimation of surface normal after MPGC. MPGC is conducted for every pixel in the reference image, so given the image width W and height H, WH MPGC computations in total are performed. Fortunately, they are mutually independent and they can thus be performed in parallel. Besides, the total computational cost is also dependent on the number of support images M, since the more observations are involved the more resampling operations are required in each MPGC computation. Although it

Fig. 10. Surface normal maps before and after MPGC.


is not linearly dependent on M, the total computational cost becomes roughly s (1 < s < 2) times larger if, e.g., M doubles.

3.6. Outlier elimination

The final step of the proposed methodology is outlier elimination. After the parallel generation of the support plane maps for all images, a two-step outlier elimination procedure is implemented. First, in a similar manner to the depth map refinement proposed by Shen (2013), a depth consistency check is employed among multiple images to eliminate outliers. The depth estimate d is compared with the depth estimate $d_k$, which is generated when $S_k$ is set as the reference image. The estimated depth d is considered consistent only when $d(P(p), S_k)$ and $d_k(p_k(P(p)))$ are sufficiently close in value:

$\frac{\left| d(P(p), S_k) - d_k(p_k(P(p))) \right|}{d(P(p), S_k)} < \tau$    (39)

Here, $\tau$ is a ratio threshold. The main difference between this approach and that of Shen (2013) is that Shen's depth consistency check is carried out over all neighboring images, whereas here it is carried out only between corresponding visible support images. Additionally, Shen's method requires that a depth be consistent with at least 2 neighboring images in order to be valid, whereas only a single visible support image is required in the proposed method. While it might appear that this constraint is rather relaxed, this is in fact not the case. During the visibility augmentation phase, a clustering of the support plane estimates is performed. This can be interpreted as an initial outlier elimination, and thus estimates with more visible support images are more likely to be inliers than those with fewer.
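The consistency test of Eq. (39) reduces to a few lines; the relative-difference form below follows the description of τ in Table 3 as a ratio threshold of the relative depth difference, and is a sketch rather than the authors' code.

```python
def depth_consistent(d_p, d_k, tau=0.01):
    """Relative depth-consistency check of Eq. (39): d_p is the depth
    estimate d(P(p), S_k) in the reference image, d_k the depth
    d_k(p_k(P(p))) obtained when S_k is the reference image;
    tau = 0.01 as in Table 3."""
    return abs(d_p - d_k) / d_p < tau
```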

Table 2. Pseudo-code of the proposed method.

Input: all input images {I_i} and their orientations
Output: estimated {x_i}

// Estimation of {x_i}
for each input image R = I_i do
    for each support image S_k of R do
        estimate x^(k) of R with S_k using PatchMatch    // see Section 3.2
    end for
    extract k_ML and x_ML from {x^(k)} based on the MRF    // see Section 3.3
    augment visibility based on x_ML by clustering    // see Section 3.4
    MPGC optimization given x_ML and visibility    // see Section 3.5
end for

// Outlier elimination by enforcing multi-view consistency
for each x_i do
    eliminate estimates in x_i that are not consistent with any of its visible support images    // see Section 3.6
end for
return {x_i}


The proposed method is thus able to preserve as many inliers as possible, while eliminating outliers where necessary by taking the visibility information into account.

The second stage of the outlier elimination procedure aims to detect any remaining outliers by applying the depth clustering method of Section 3.5.2 to the whole image domain, such that estimates in clusters smaller than a threshold $c_{min}$ are eliminated as outliers, with $c_{min}$ defined as

$c_{min} = \rho^2 WH$    (40)

where $\rho$ is a ratio factor with respect to the image size and W, H denote the image width and height, respectively. In addition, when silhouettes of objects in images can be easily extracted, e.g., as in indoor images taken under controlled illumination, their visual hull can be constructed to further filter outliers. To summarize, the proposed method is depicted in pseudo-code in Table 2.

4. Experimental results

The proposed method has been evaluated on four Middlebury MVS evaluation datasets (Seitz et al.), namely the templeRing (47 images), templeSparseRing (16 images), dinoRing (48 images) and dinoSparseRing (16 images) datasets. The first two datasets comprise images of a plaster temple, shown in Fig. 11(a) and (b), while the last two comprise images of a plaster dinosaur, shown in Fig. 11(c) and (d). The main difference between the non-Sparse and Sparse datasets is the number of images involved, along with the larger baseline separation between the exposure stations of the Sparse datasets. The temple datasets present severe occlusions, especially the sparse dataset, whereas the biggest challenge of the dinosaur images comes from their extremely low texture. The interior and exterior orientation values for all test images were provided, along with the bounding boxes of each test scene. Since the datasets are indoor images captured under controlled illumination, object features can be easily separated from the background, so silhouettes were extracted in the experiments to filter outliers. The parameter settings used in the proposed algorithm are listed in Table 3 for all experimental datasets.

The experimental results have been evaluated by comparing the generated triangular meshes with the known 'ground truth' meshes that were generated using a laser scanner. In order to generate triangular meshes for each dataset, all resulting depth maps are back projected and merged into a complete 3D point

Fig. 11. Sample images from the temple (a, b) and dinosaur (c, d) datasets.


Table 3. Parameter settings for experiments.

Parameter | Section | Description | Value
M | Section 3 | Number of support images | 2 for templeSparseRing and dinoSparseRing, 4 for templeRing and dinoRing
w | Section 3.1 | Reference image window size, w × w | 5 pixels
T | Section 3.2 | Number of iterations in PatchMatch | 2
θ_max | Section 3.2.1 | Angle between the allowed maximum depth increment vector and the image plane | 80°
L | Section 3.2.2 | Number of times of random sampling | 5
γ | Section 3.2.2 | Decreasing factor in random sampling | 0.25
λ1 | Section 3.3.1 | Weight of depth smoothness term | 1.5
λ2 | Section 3.3.1 | Weight of normal smoothness term | 0.01
σ | Section 3.4 | Assumed image matching accuracy | 0.5 pixel
τ | Section 3.6 | Ratio threshold of relative depth difference | 0.01
ρ | Section 3.6 | Ratio factor with respect to image size | 0.02

cloud, with each point being associated with its own resulting surface normal estimate. This complete point cloud is then randomly subsampled into a smaller point cloud with only 500,000 points. The final triangular mesh is generated from this smaller point cloud using the Screened Poisson Surface Reconstruction method (Kazhdan and Hoppe, 2013) with an octree depth setting of 10. The surface reconstruction outputs only a closed mesh with holes filled automatically, and thus in order to reflect the actual completeness, parts of the mesh with low vertex density are removed.

The ground truth and reconstructed meshes for all datasets are shown in Fig. 12. As shown in Table 4, the experimental results have been evaluated in terms of both accuracy and completeness. Accuracy was quantified in terms of point-to-mesh discrepancies between the generated and ground-truth 3D surfaces; further details on the evaluation methodology are provided in Seitz et al. (2006). The accuracy and completeness are reported at thresholds of 90% and 1.25 mm, respectively; readers are again referred to Seitz et al. (2006) for further explanation of these thresholds. Finally, the runtimes per image are listed in the right-most column.

For the two temple datasets, the proposed method is ranked as the most accurate among reported Middlebury evaluations, there being 52 such evaluations for the first dataset and 42 for the second. For the two dinosaur datasets, the accuracy rankings drop to 9th out of 49 and 14th out of 43 due to the poor texture, but the surface-to-surface accuracies are still within 0.50 mm. It is evident that the number of images directly affects the accuracy, with fewer images leading to lower accuracy and reliability. The histograms of algebraically signed errors for each dataset are shown in Fig. 13, where it is clear that all approach Gaussian distributions with approximately zero mean. The accuracy rankings of Table 4 are current at the time of writing (November 2014); they can be anticipated to change as new methods are submitted for evaluation. For more detailed information on the evaluation results and comparisons with other methods, readers are referred to the Middlebury MVS evaluation web page (Seitz et al.).

Visual inspection of Fig. 12 shows that the results of the two Sparse datasets and the dinoRing dataset present holes, which correspond to the areas deleted due to low vertex density. The low density arises because most reconstructed points within these areas are mismatched and are therefore eliminated as outliers. A wide baseline is the main factor responsible for the holes in templeSparseRing, as it makes the image content vary significantly between images, especially for slanted planes such as the temple roof and the ground. This also explains why most of the holes disappear in templeRing when the baseline decreases.

Fig. 12. Ground truth and reconstructed meshes (Seitz et al.).

Table 4. Evaluation results.

Dataset | Accuracy (mm) & ranking | Completeness (%) | Runtime (min per image)
templeRing | 0.40 (1st of 52) | 99.2 | 12.5
templeSparseRing | 0.45 (1st of 42) | 95.7 | 6.0
dinoRing | 0.38 (9th of 49) | 98.3 | 12.5
dinoSparseRing | 0.48 (14th of 43) | 95.4 | 6.0
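For reference, the two Table 4 metrics can be computed from point-to-mesh distance samples as follows; this is a sketch of the Seitz et al. (2006) definitions, with illustrative names.

```python
import numpy as np

def middlebury_metrics(rec_to_gt, gt_to_rec, q=0.9, tol=1.25):
    """Accuracy and completeness in the sense of Seitz et al. (2006):
    accuracy is the distance (mm) within which 90% of the reconstructed
    surface lies from the ground truth; completeness is the fraction of the
    ground truth lying within tol = 1.25 mm of the reconstruction.
    Inputs are arrays of point-to-mesh distances."""
    accuracy = float(np.quantile(np.asarray(rec_to_gt), q))
    completeness = float((np.asarray(gt_to_rec) <= tol).mean())
    return accuracy, completeness
```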

Poor image texture is the main factor responsible for the holes in the two dino datasets. It results in insufficient information being present to confidently locate the match in an image window, e.g., at the left and right shoulder areas of the dinosaur, where the image content is very flat. This also explains why there are still a considerable number of holes in dinoRing even when the baseline decreases.

Returning to the quantitative results in Table 4, the completeness percentages for all four datasets are above 95%, even for the Sparse datasets. As expected, higher completeness is achieved with a larger number of images. Nevertheless, considering both the severe occlusions presented by the structure of the


Nevertheless, considering both the severe occlusions presented by the structure of the temple and the fact that the Sparse datasets (16 images) are only one third the size of the non-Sparse datasets (47 images for temple, 48 for dino), so that the average baseline in the Sparse datasets is three times as long, the 3.5% completeness drop for the temple (from 99.2% to 95.7%) and the 2.9% drop for the dino (from 98.3% to 95.4%) are relatively insignificant, although the non-Sparse results do appear visually more complete. Since most methods evaluated on Middlebury submit meshes directly, without removing parts of low vertex density, the reported completeness numbers do not generally reflect actual completeness; for this reason, no completeness rankings are given in Table 4. The most time-consuming steps per image are PatchMatch and the MPGC optimization. In accordance with the earlier computational complexity analysis, since the number of support images (M = 4) used in the non-Sparse datasets is double that in the Sparse datasets (M = 2), the NCC and MPGC computations required per image of the non-Sparse datasets will be, respectively, 2.27 and s (1 < s < 2) times as many as in the Sparse datasets. Overall, the runtime per image of the non-Sparse datasets is therefore expected to be approximately double that of the Sparse datasets when all computations are executed in sequence, as borne out by the recorded runtimes listed in Table 4.
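As a quick worked check of this scaling argument against the recorded values in Table 4:

\[
\frac{t_{\text{non-Sparse}}}{t_{\text{Sparse}}} = \frac{12.5\ \text{min}}{6.0\ \text{min}} \approx 2.08,
\]

which lies between the MPGC factor s (1 < s < 2) and the NCC factor 2.27, consistent with the expected approximate doubling of the per-image runtime.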

Fig. 13. Histograms of signed errors for (a) templeRing, (b) templeSparseRing, (c) dinoRing and (d) dinoSparseRing (Seitz et al.).



All datasets were processed on a laptop with an Intel Core i5-3210M CPU running at 2.50 GHz. Thus far, no effort has been made to optimize runtime performance, as this investigation was aimed at delivering a proof of concept. The runtimes are expected to be significantly reduced through further optimization and the parallel processing of images. It should also be clarified that, for comparison purposes, runtimes given on the Middlebury MVS evaluation web page (Seitz et al.) are all normalized to a 3.0 GHz processor. For such a processor, the proposed method is expected to take 10.42 rather than 12.5 min per image for the non-Sparse datasets, and 5 rather than 6 min per image for the Sparse datasets. The four total normalized runtimes (H:M:S) given on the website, with all images processed in sequence, are 08:09:10 (templeRing), 01:20:00 (templeSparseRing), 08:20:00 (dinoRing) and 01:20:00 (dinoSparseRing). While the research conducted to date indicates that the proposed method is reasonably insensitive to changes in the parameter settings related to PatchMatch (i.e., T, hmax, L and c), further sensitivity analysis remains to be conducted. However, given that each initial estimate is ultimately optimized via MPGC, the proposed method is expected to be reasonably insensitive to parameter changes in general, at least in terms of accuracy.
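The clock-speed normalization is a simple linear scaling of the measured runtime; for the non-Sparse case, for example:

\[
t_{3.0\,\text{GHz}} = t_{2.5\,\text{GHz}} \times \frac{2.5\ \text{GHz}}{3.0\ \text{GHz}} = 12.5 \times \frac{2.5}{3.0} \approx 10.42\ \text{min per image}.
\]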

5. Conclusions

When compared to other state-of-the-art MVS approaches, the proposed method, in which depth, surface normal and visibility are estimated for each image pixel, has been shown to produce high-accuracy 3D reconstruction from multi-image networks exhibiting challenging occlusions. Moreover, satisfactory completeness is obtained for all evaluated test datasets, with minimal differences between the Sparse and non-Sparse cases. Finally, as the processing of each image is independent, the proposed method can be easily parallelized to efficiently handle datasets comprising a large number of images. However, the runtime per image is still unsatisfactory, and the new approach also faces challenges when dealing with images of poor texture. Future research will concentrate mainly on these two aspects.

The most time-consuming steps in processing a given image are PatchMatch and the MPGC optimization, and exploitation of the parallelism in these two steps will be the key to any reduction in runtime per image. Four viable areas of improvement are:

a. Inspired by the parallel processing line idea of Bailer et al. (2012), the parallelism in the proposed PatchMatch can be exploited in a similar manner. This is expected to yield a significant reduction in runtime.
b. As described in Section 3.5, the MPGC computation for each pixel can also be performed in parallel.
c. A coarse-to-fine strategy can be adopted by incorporating the commonly used pyramid mechanism.
d. As this investigation was aimed at delivering a proof of concept and general feasibility, unnecessary overheads are inevitable in the current implementation, so there remains considerable room for code-level optimization.

Although the most likely estimate is extracted in the proposed method via an MRF model incorporating a smoothness term, each candidate estimate is generated in PatchMatch from the image content alone, i.e., only the data term

is considered. Thus, when image texture is poor, the information contained in an image window may not be sufficient to locate the match confidently. Possible future improvements in this respect include:

a. Since image content alone is insufficient, incorporating smoothness regularization (i.e., priors from neighboring pixels) explicitly into PatchMatch could be an effective way of dealing with poor texture. However, recalling the two observations upon which the proposed method is based, this can be expected to compromise, to some extent, both the robustness against occlusion and the accuracy of visibility estimation. Nevertheless, the approach will be trialled in future research to ascertain how it affects the overall quality of reconstruction.
b. An adaptive-size image window may be considered, i.e., a large image window where texture is poor and a small one elsewhere; this requires a measure of texture richness to be established.
c. Inspired by the method of Campbell et al. (2008), extracting multiple local minimizers as candidates for each support image, instead of only the single global minimizer, could be beneficial.
d. Beyond its runtime benefits, incorporation of the commonly used pyramid mechanism could also help in dealing with poor texture.
e. When object silhouettes are readily available, the visual hull can be used directly as an explicit constraint in generating and optimizing the depth and normal maps, rather than being employed only to filter outliers.

References

Bailer, C., Finckh, M., Lensch, H.P., 2012. Scale robust multi view stereo. In: Computer Vision – ECCV 2012. Springer, pp. 398–411.
Baltsavias, E.P., 1991. Multiphoto Geometrically Constrained Matching. Institut für Geodäsie und Photogrammetrie, ETH Zurich, Zurich, Switzerland.
Barnes, C., Shechtman, E., Finkelstein, A., Goldman, D., 2009. PatchMatch: a randomized correspondence algorithm for structural image editing. ACM Trans. Graph. (TOG) 28, 24.
Barnes, C., Shechtman, E., Goldman, D.B., Finkelstein, A., 2010. The generalized PatchMatch correspondence algorithm. In: Computer Vision – ECCV 2010. Springer, pp. 29–43.
Bay, H., Ess, A., Tuytelaars, T., Van Gool, L., 2008. Speeded-up robust features (SURF). Comput. Vis. Image Understand. 110, 346–359.
Besse, F., Rother, C., Fitzgibbon, A., Kautz, J., 2013. PMBP: PatchMatch belief propagation for correspondence field estimation. Int. J. Comput. Vis., 1–12.
Bleyer, M., Rhemann, C., Rother, C., 2011. PatchMatch stereo – stereo matching with slanted support windows. In: BMVC, pp. 1–11.
Bobick, A.F., Intille, S.S., 1999. Large occlusion stereo. Int. J. Comput. Vis. 33, 181–200.
Campbell, N.D., Vogiatzis, G., Hernández, C., Cipolla, R., 2008. Using multiple hypotheses to improve depth-maps for multi-view stereo. In: Computer Vision – ECCV 2008. Springer, pp. 766–779.
Crandall, D.J., Owens, A., Snavely, N., Huttenlocher, D.P., 2013. SfM with MRFs: discrete–continuous optimization for large-scale structure from motion. IEEE Trans. Pattern Anal. Mach. Intell. 35, 2841–2853.
Gallup, D., Frahm, J.-M., Mordohai, P., Yang, Q., Pollefeys, M., 2007. Real-time plane-sweeping stereo with multiple sweeping directions. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2007). IEEE, pp. 1–8.
Goesele, M., Curless, B., Seitz, S.M., 2006. Multi-view stereo revisited. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, pp. 2402–2409.
Goesele, M., Snavely, N., Curless, B., Hoppe, H., Seitz, S.M., 2007. Multi-view stereo for community photo collections. In: IEEE 11th International Conference on Computer Vision (ICCV 2007). IEEE, pp. 1–8.
Gruen, A., 1996. Least squares matching: a fundamental measurement algorithm. Close Range Photogramm. Mach. Vis., 217–255.
Heise, P., Klose, S., Jensen, B., Knoll, A., 2013. PM-Huber: PatchMatch with Huber regularization for stereo matching. In: 2013 IEEE International Conference on Computer Vision (ICCV). IEEE, pp. 2360–2367.
Hirschmuller, H., 2008. Stereo processing by semiglobal matching and mutual information. IEEE Trans. Pattern Anal. Mach. Intell. 30, 328–341.
Hosni, A., Rhemann, C., Bleyer, M., Rother, C., Gelautz, M., 2013. Fast cost-volume filtering for visual correspondence and beyond. IEEE Trans. Pattern Anal. Mach. Intell. 35, 504–511.

Hu, X., Mordohai, P., 2012. Least commitment, viewpoint-based, multi-view stereo. In: 2012 Second International Conference on 3D Imaging, Modeling, Processing, Visualization and Transmission (3DIMPVT). IEEE, pp. 531–538.
Kazhdan, M., Hoppe, H., 2013. Screened Poisson surface reconstruction. ACM Trans. Graph. (TOG) 32, 29.
Rothermel, M., Wenzel, K., Fritsch, D., Haala, N., 2012. SURE: photogrammetric surface reconstruction from imagery. In: Proceedings LC3D Workshop, Berlin.
Scharstein, D., Szeliski, R., 2002. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comput. Vis. 47, 7–42.
Seitz, S.M., Curless, B., Diebel, J., Scharstein, D., Szeliski, R. Middlebury Multi-View Stereo Evaluation Web Page.
Seitz, S.M., Curless, B., Diebel, J., Scharstein, D., Szeliski, R., 2006. A comparison and evaluation of multi-view stereo reconstruction algorithms. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, pp. 519–528.


Shen, S., 2013. Accurate multiple view 3D reconstruction using patch-based stereo for large-scale scenes. IEEE Trans. Image Process. 22, 1901–1914.
Steinbrucker, F., Pock, T., Cremers, D., 2009. Large displacement optical flow computation without warping. In: 2009 IEEE 12th International Conference on Computer Vision. IEEE, pp. 1609–1614.
Strecha, C., Fransens, R., Van Gool, L., 2006. Combined depth and outlier estimation in multi-view stereo. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, pp. 2394–2401.
Veksler, O., 2003. Fast variable window for stereo correspondence using integral images. In: 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1. IEEE, pp. I-556–I-561.
Wang, L., Kang, S.B., Shum, H.-Y., Xu, G., 2004. Cooperative segmentation and stereo using perspective space search. In: Asian Conference on Computer Vision, Jeju Island, South Korea, pp. 366–371.
Zheng, E., Dunn, E., Jojic, V., Frahm, J.-M., 2013. PatchMatch based joint view selection and depthmap estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1510–1517.