Exemplar-based depth inpainting with arbitrary-shape patches and cross-modal matching

Accepted manuscript, to appear in: Signal Processing: Image Communication
PII: S0923-5965(18)30363-1
DOI: https://doi.org/10.1016/j.image.2018.07.005
Reference: IMAGE 15420
Received 7 December 2017; Revised 5 July 2018; Accepted 5 July 2018

Sen Xiang a,b, Huiping Deng a,b,∗, Lei Zhu a,b, Jin Wu a,b, Li Yu c

a School of Inform. Sci. & Engn., Wuhan Univ. of Sci. & Tech., Wuhan 430081, China
b Engin. Research Center of Metallurgical Auto. and Measurement Tech., Ministry of Education, Wuhan 430081, China
c School of Electron. Inform. & Commun., Huazhong Univ. of Sci. & Tech., Wuhan 430074, China

Abstract: Commodity RGB-D cameras provide texture and depth maps in real time, and have thus facilitated the booming development of various depth-dependent applications. However, depth maps suffer from the loss of valid values, which leads to holes and impairs both research and applications. In this paper, we propose a novel exemplar-based method to fill depth holes and thus improve depth quality. The method is based on the fact that a depth map has many similar, even identical, parts, so lost depth values can be restored by referring to valid ones. Considering an intrinsic property of depth maps, i.e., the sharpness of object boundaries, we propose to use arbitrary-shape matching patches, instead of fixed squares, to avoid inter-depth-layer distortion and thus improve the boundaries. In addition, since depth values do not have distinct features, cross-modal matching, in which both depth and texture are involved, is utilized. Moreover, we investigate the similarity criteria used in cross-modal matching in order to improve the matching accuracy between source and target patches. Experimental results demonstrate that the proposed method accurately recovers lost depth information, especially at boundaries, and outperforms state-of-the-art exemplar-based inpainting methods.

Keywords: depth map, inpainting, exemplar, edge-preserving, 3D video

∗ Corresponding author. Email address: [email protected] (Huiping Deng)

Preprint submitted to Signal Processing: Image Communication, July 5, 2018

1. Introduction

Depth information is a fundamental element in various applications, such as free-viewpoint video [1], 3D reconstruction [2] and face recognition [3]. In recent years, commodity RGB-D cameras, based on structured light [4] or the time-of-flight technique [5], have made depth acquisition easy and convenient. However, due to the limitations of depth generation principles and hardware, the reported depth maps have many holes, which impairs further research and applications. To solve this problem, many researchers have studied depth inpainting. In general, these methods can be categorized into filtering-based and exemplar-based ones.


In filtering-based methods, special filters are designed to diffuse valid depth values to invalid ones. Min et al. [6] proposed the weighted mode filter: instead of spatial and intensity similarity, this filter uses statistics of the valid depth values, and it yields sharp depth edges. Yang et al. [7] proposed an auto-regressive model to estimate the filter coefficients, so that the filtering adapts to the local context. Miao et al. [8] and Xiang et al. [9] considered the homogeneity of depth gradients, obtaining the depth values under gradient constraints by solving partial differential equations. Milani et al. [10] proposed a set of local differential equations to interpolate the missing depth samples. Xue et al. [11] introduced a low-gradient regularization method in which gradual depth changes are allowed by reducing the penalty for small gradients while penalizing non-zero gradients. Zhao et al. [12] proposed two-stage filtering of blurred depth maps: the distorted depth maps are successively processed with binary segmentation-based depth filtering and MRF-based reconstruction.


Owing to the consistency between texture and depth, texture-guided filtering is also quite popular. These filters incorporate texture similarity, so that different objects can be distinguished and depth boundaries are better preserved. The simplest examples are the joint bilateral filter [13][14] and the trilateral filter [15], whose weighting kernels contain a texture-similarity term. Kim et al. [16] modified the color weights by considering texture-depth consistency. Bapat et al. [17] proposed a novel iterative median filter that also takes the RGB components into account; the color similarity is measured with the absolute difference between the neighboring pixels and their median value. Chang et al. [18] proposed adaptive texture-similarity-based hole filling, where luminance, instead of RGB, is used as guidance. Bhattacharya et al. [19] focused on removing depth edge distortions, using a contour-guided partial differential equation to inpaint the contour points. Buyssens et al. [20] proposed to inpaint virtual depth maps in view synthesis, using super-pixel segmentation to detect occlusions followed by linear filtering. Chen et al. [21] fused texture edges and depth edges, with which the inpainting accuracy at object boundaries can be improved. Wang et al. [22] proposed an MRF-based optimization framework for depth map enhancement, with the coupled high-quality color images as guidance.

Filtering, in general, is conducted pixel by pixel, while exemplar inpainting is implemented in patches, so it recovers the information of many pixels at once. The algorithm was first proposed by Criminisi [23] for object removal and texture synthesis.


It is based on the fact that signals tend to exhibit high auto-correlation and repeated patterns at different locations within the data [24], and thus missing regions can be restored by referring to the valid parts. In depth maps, it is even more common for similar and identical blocks to repeat across a map, because of its homogeneity and smoothness. Therefore, exemplar-based approaches are also applicable to depth maps. In the simplest case, classic inpainting [23] can be directly applied. Beyond that, researchers have tried to improve the performance from two aspects: updating the priority and finding a better source patch. Daribo et al. [25] weighted the classic priority with a level-regularity term, so that background pixels are assigned higher priorities and are filled first; this method is effective in restoring dis-occlusions in virtual view synthesis. Ram et al. [26] improved the inpainting order by reordering overlapped patches. Viacheslav et al. [27] introduced texture and depth gradients into the priority term, and also improved the details with an adaptive median filter. More recently, Zhang et al. [28] added a level-set term to the priority and used joint trilateral filtering as post-processing. On the other hand, searching for an optimal source patch in the source region is another concern, and some efforts have been made so far. Huang et al. [29] searched for the most similar patches based on structure consistency. Qi et al. [30] incorporated non-local filtering into depth inpainting, with the structure of the corresponding texture map deciding the similarity. Fan et al. [31] introduced adaptive patch sizes and image rotation into the source-patch search, with the structural similarity (SSIM) [32] of texture as the matching criterion. In addition, methods based on special properties of depth maps have also been proposed. Li et al. [33] used sparse features of texture and depth blocks for learning- and exemplar-based depth super-resolution. Wang et al. [34] utilized the correlation between stereo views, based on which color and depth holes can be filled simultaneously. Ou et al. [35] used patch features in the DCT domain to match source and target patches.


However, due to the intrinsic properties of depth maps, two main challenges remain when exemplar-based inpainting is applied. First, depth boundaries are quite important but vulnerable, so they should be specially preserved. In practice, depth boundaries are irregular, while conventional inpainting usually uses rectangular patches, which cannot fit boundaries well and thus lead to distortions. Second, given a patch with holes, accurately finding the optimal source patch is another issue: since depth maps are smooth and lack distinct spatial features, there exist many similar candidates, which causes ambiguity. Facing these challenges, we propose to use arbitrary-shape matching patches that adapt to complex boundaries. Moreover, instead of measuring only depth similarity, we propose a cross-modal metric that jointly considers texture and depth information, which removes the ambiguity. In patch matching, we also analyze and test the commonly used similarity metrics to choose the optimal one. As a result, the proposed method improves the quality of the restored depth maps, especially at object boundaries.

The remainder of this paper is organized as follows. Section 2 presents the proposed method in detail. In Section 3, experiments are conducted to verify the performance of the proposed method. Finally, Section 4 concludes the paper.

Figure 1: Image model of the classic exemplar-based inpainting [23]. (a) Binarized sketch B indicating region Φ and region Ω; np is a normalized vector orthogonal to the wavefront of the boundary; Bx and By are the horizontal and vertical gradients of the binary map B, respectively. (b) Sketch of the intensity gradient ∇Ip; Ix and Iy are the horizontal and vertical gradients, respectively, and ∇Ip⊥ is orthogonal to ∇Ip. (c) Sketch of matching the target patch Ψp and candidate source patches Ψq.

2. Proposed method

2.1. Motivation

As shown in Fig. 1, an image I consists of a source region Φ with valid values and a target region Ω with invalid ones. ∂Ω is the boundary between Φ and Ω. Because similar patterns (objects, textures, intensities) often repeat at different locations, both in Φ and Ω, exemplar-based inpainting can recover the information in Ω by referring to the source region Φ.

Classic exemplar-based inpainting [23] consists of three steps. (1) Computing patch priorities. For any patch Ψp centered at a border pixel p, its priority P(p) is calculated with a confidence term C(p) and a data term D(p):

    P(p) = C(p) · D(p),    C(p) = ( Σ_{p′∈Ψp} C(p′) ) / #(Ψp),    D(p) = |∇Ip⊥ · np| / α    (1)

C(p) measures the amount of reliable information around p. For any pixel p′ in the patch Ψp, it is initialized as C(p′) = 0 if p′ ∈ Ω and C(p′) = 1 if p′ ∈ I − Ω. D(p) measures the continuation of strong edges, and it is computed from two vectors, np and ∇Ip⊥. np = [Bx, By] / √(Bx² + By²) is the normalized gradient of the binary map B indicating Ω and Φ, as shown in Fig. 1(a); ∇Ip⊥ = [−Iy, Ix] is orthogonal to the intensity gradient, as shown in Fig. 1(b); α is a constant scalar.

Figure 2: An example of inter-depth-layer distortion. Black pixels are invalid and white dashed lines indicate the boundaries between foreground (FG) and background (BG). (a) When the edge is explicit and accurate, correct matching can be achieved. (b) When the real edge is latent and unavailable, a rectangular patch introduces inter-depth-layer errors: the shaded area in Ψp, which belongs to the BG, will be filled with FG pixels from Ψ∗q.

With Eq. 1, the patch Ψp with the highest priority is found. (2) Searching for the optimal source patch. As shown in Fig. 1(c), patch matching is conducted between the target patch Ψp and all candidate source patches Ψq in the source region by measuring their similarity; the matching reports the optimal source patch Ψ∗q with the highest similarity. (3) Hole filling with the exemplar. Once Ψ∗q has been found, the invalid pixels in Ψp are filled with the corresponding pixels in Ψ∗q. After that, the confidence term is updated for the newly inpainted pixels.
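For concreteness, the sketch below shows one way the priority of Eq. 1 can be evaluated over the fill front ∂Ω. It is a minimal illustration under our own naming (the paper's implementation is in C/Matlab and is not reproduced here); the patch radius and the normalization α are arbitrary choices, and near the hole the image gradients are only indicative, since [23] estimates them from valid pixels.

```python
import numpy as np
from scipy import ndimage

def priorities(gray, hole, confidence, half=4, alpha=255.0):
    """Priority P(p) = C(p) * D(p) of Eq. 1 for every pixel on the fill front.

    gray:       2-D float array, the intensity image I
    hole:       2-D bool array, True inside the target region Omega
    confidence: 2-D float array, C initialized to 0 in Omega and 1 elsewhere
    """
    # Fill front: hole pixels touching the source region (the boundary of Omega).
    front = hole & ~ndimage.binary_erosion(hole)

    # n_p: normalized gradient of the binary map B (Fig. 1(a)).
    By, Bx = np.gradient(hole.astype(float))
    norm = np.hypot(Bx, By) + 1e-8
    nx, ny = Bx / norm, By / norm

    # Isophote direction: intensity gradient rotated by 90 degrees (Fig. 1(b)).
    Iy, Ix = np.gradient(gray)
    iso_x, iso_y = -Iy, Ix

    P = np.zeros_like(gray)
    for y, x in zip(*np.nonzero(front)):
        # Confidence term: mean confidence over the (border-clipped) patch.
        patch = confidence[max(0, y - half):y + half + 1,
                           max(0, x - half):x + half + 1]
        C = patch.sum() / patch.size
        # Data term: |isophote . n_p| / alpha.
        D = abs(iso_x[y, x] * nx[y, x] + iso_y[y, x] * ny[y, x]) / alpha
        P[y, x] = C * D
    return P, front
```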

When applied to depth inpainting, the aforementioned framework faces problems from two aspects. On one hand, conventional exemplar-based algorithms [23, 27, 28] use rectangular matching patches, which are error-prone for depth. Take Fig. 2 as an example: the map consists of two layers, a foreground (FG) and a background (BG). An explicit edge, as shown in Fig. 2(a), is an important feature and improves the matching accuracy. Unfortunately, when holes exist in the FG-BG transition region, the real edge is usually not available, as shown in Fig. 2(b). In such a case, patch matching is conducted only on the valid FG pixels in Ψp, and thus the optimal source patch Ψ∗q belongs only to the FG. Copying Ψ∗q to Ψp then causes edge distortions, i.e., the shaded area in Ψp belongs to the BG but is filled with FG values. To solve this problem, arbitrary-shape patches that adapt to irregular boundaries are necessary. On the other hand, the source patch Ψ∗q should be the 'most similar' one to the target patch Ψp, where the definition of 'most similar' depends on a similarity metric.


Figure 3: Sketch map of patch modification. (a) Judging pixels in the rectangular patch; most edges are caused by color change inside objects. (b) The generated arbitrary-shape patch; the edges correspond to real object boundaries.

Common metrics such as MSE, Pearson correlation and SSIM [32] can yield quite different results. For example, SSIM coincides better with visual perception than MSE for texture maps, but depth values are smooth and sparse, so the relative performance can differ. Therefore, a proper similarity metric must be determined by analysis and experiment. In addition, depth maps do not have many distinct features, so many candidate source patches are quite similar; incorporating texture information, which has more details, into the matching should therefore benefit the matching accuracy.

2.2. Arbitrary-shape patch matching

The proposed inpainting is inspired by [23], and a similar framework, including priority computation and patch matching, is adopted. The priority is first calculated with Eq. 1. After that, we introduce arbitrary-shape patches, instead of rectangular ones, for the subsequent patch matching. An enlarged target patch Ψp is shown in Fig. 3, with its center p marked in red. For every pixel qi in the patch, we choose a bounding box Bpqi in which p and qi are the diagonal/anti-diagonal corners. Afterward, we check the pixels in Bpqi. The existence of edge pixels in Bpqi indicates that qi and p are located on different sides of an edge and thus belong to different objects and depth layers; in this case, qi is removed from the matching patch.


Figure 4: Edges of Teddy. (a) Texture map. (b) Result of the Canny detector with automatic threshold. (c) Result of the structured forest [36] and Otsu's thresholding.

In contrast, if no edge pixels exist in Bpqi, qi is qualified for matching. Mathematically, pixel qi is judged as follows:

    qi = qualified, if edge ∉ Bpqi;    qi = removed, if edge ∈ Bpqi    (2)

In Fig. 3, box Bpq1 is filled with non-edge pixels, so q1 is adopted in the patch. In contrast, there are edge pixels in box Bpq2, and thus q2 is removed. Checking all pixels in the patch in this manner generates an arbitrary-shape patch that belongs to a single depth layer, as in Fig. 3(b), where the qualified pixels are marked in green.
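A possible realization of this bounding-box test, following Eq. 2, is sketched below. The function and variable names are ours; the edge map is assumed to be a boolean array of the same size as the depth map, obtained as described next.

```python
import numpy as np

def arbitrary_shape_patch(edge, p, half=3):
    """Eq. 2: a pixel q_i of the square patch around p is qualified only if the
    bounding box B_pq_i (p and q_i as opposite corners) contains no edge pixel.

    edge: 2-D bool array, True at single-pixel-width texture edges
    p:    (row, col) center of the target patch
    half: patch radius; the paper's 7x7 patches correspond to half=3
    """
    py, px = p
    h, w = edge.shape
    mask = np.zeros((2 * half + 1, 2 * half + 1), dtype=bool)
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            qy, qx = py + dy, px + dx
            if not (0 <= qy < h and 0 <= qx < w):
                continue                                # q_i outside the image
            y0, y1 = sorted((py, qy))
            x0, x1 = sorted((px, qx))
            if not edge[y0:y1 + 1, x0:x1 + 1].any():    # B_pq_i is edge-free
                mask[dy + half, dx + half] = True       # q_i is qualified
    return mask
```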

In this method, pixel identification depends on reliable object boundaries. Ideally, the depth edge should be used as the criterion for the proposed patch generation. However, as shown in Fig. 2(b), true depth edges are often unavailable because of the holes. Fortunately, texture edges are highly correlated with depth edges and are thus helpful [37]. In general, the texture and depth edge distribution of a local region can be classified into three cases. Case 1: both texture and depth edges exist, which occurs at real object boundaries. In this case, texture and depth edges are equivalent, and texture edges can replace depth edges in the proposed arbitrary-shape patch generation. Case 2: only texture edges exist, which often occurs inside objects with smooth depth but varying colors. In this case, using texture edges splits a rectangular patch into several smaller, irregular ones, but this does not impair the final inpainting results because the depth is smooth inside objects. Case 3: only depth edges exist, i.e., the FG and BG depth layers have the same color, so texture and depth edges are not equivalent for arbitrary-shape patch generation. Nevertheless, this situation is special and rarely happens, because in reality objects usually have different colors. From another point of view, if texture edges are to function as alternatives to depth edges in the proposed method, they should resemble the depth edges as closely as possible. This requires a high-quality edge detection method that reports real object boundaries in case 1 while suppressing fake edges in case 2. To achieve this goal, we use the structured forest [36], a learning-based detection method, to detect semantic boundary intensities, followed by Otsu's adaptive thresholding and morphological thinning, which yields single-pixel-width texture edges. Fig. 4 shows an example of the detected edges. It demonstrates that the structured forest [36] reports more accurate edges than traditional detectors; more importantly, it reports real boundaries between objects while excluding false alarms within objects, which makes the texture and depth edges quite similar. Above all, with the aforementioned considerations, texture edges can work as alternatives to depth edges for arbitrary-shape patch generation.
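This edge-extraction pipeline can be approximated with off-the-shelf tools. The sketch below uses the structured-edge detector from OpenCV's ximgproc module (which requires the opencv-contrib package and a pretrained model; the path 'model.yml' is a placeholder), followed by Otsu thresholding and morphological thinning. It is our reconstruction of the pipeline described above, not the authors' code.

```python
import cv2
import numpy as np

def texture_edges(bgr, model_path="model.yml"):
    """Structured forest [36] + Otsu's thresholding + thinning, yielding the
    single-pixel-width texture edge map used for patch generation."""
    detector = cv2.ximgproc.createStructuredEdgeDetection(model_path)
    rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    soft = detector.detectEdges(rgb)            # boundary intensities in [0, 1]
    soft8 = (soft * 255).astype(np.uint8)
    _, binary = cv2.threshold(soft8, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    thin = cv2.ximgproc.thinning(binary)        # single-pixel-width edges
    return thin > 0
```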


2.3. Cross-modal matching

Once a target patch is determined, searching for the optimal source patch that minimizes the difference between Ψq and Ψp is the most important task. The optimal source patch Ψ∗q is given by

    Ψ∗q = arg max_{Ψq} S(Ψp, Ψq)    (3)

In [23], Ψ∗q is found by minimizing the mean absolute difference between the source patch and the target patch. This criterion is straightforward and convincing because [23] involves only texture maps. However, depth values are smooth and often identical, and many candidate source patches Ψq are similar to the target patch Ψp, which leads to ambiguous matching results. Fortunately, the texture and depth maps of the same viewpoint are highly correlated, and thus the texture view, which in general has more details, can be applied to improve the matching accuracy. More specifically, cross-modal matching, which takes both texture and depth similarity into consideration, should be used.

Here, we consider three similarity metrics: mean squared error (MSE), structural similarity (SSIM) and Pearson correlation (PC).

Figure 5: Example of patch matching using different similarity metrics. H is a hole. B, C and D are candidate source patches.

These metrics are widely adopted in signal processing, especially in image processing and computer vision. MSE is the second moment of the error between the distorted image and the reference. It is simple to calculate and has a clear physical meaning: the power of the error between the compared patches. SSIM, on the other hand, is an overall evaluation of luminance, contrast and structure, and it coincides well with human visual perception. Last but not least, Pearson correlation measures the linearity, or gradient similarity, of two related datasets. The principles of the three metrics are quite different, and we would like to investigate which one better suits the proposed inpainting.

Depth maps, without loss of generality, can be regarded as piece-wise planar, as shown in Fig. 5, where region H is a hole and B, C and D are three candidate source regions. Mathematically, the values can be formulated as

    D(x, y) = −a(x − x0) + b, if x < x0;    D(x, y) = a(x − x0) + b, if x ≥ x0    (4)

Theoretically, the performance of the three metrics is as follows.

• Pearson correlation (PC) requires identical gradients of the two blocks, but not identical absolute values. In depth maps, this condition is often satisfied even when the patches are quite different. For example, region B in Fig. 5 has the same gradient values as region H, but it is obviously not a good match.

• SSIM evaluates the similarity by multiplying a luminance term l, a contrast term c and a structure term s [32]. l and c require the source patch and the target patch to have similar average intensities and variances, but not pixel-wise correspondence, and this causes ambiguities. For example, in Fig. 5, region C is similar to the hole H in terms of l and c, but the two patches are quite different. The structure term s is related to the covariance and standard deviations of the source patch and the target patch, which, like Pearson correlation, does not check the absolute intensities; in Fig. 5, region B has the same structure index as H. Overall, all three terms of SSIM face the ambiguity problem in depth patch matching.

• MSE is the average of the squared intensity differences between the two patches. The best case, MSE(Ψp, Ψq) = 0, strictly demands that corresponding pixels in the matching patches have identical intensities. For a surface as in Eq. 4, when MSE is adopted as the similarity metric, D will be the reported source patch, which is the optimal solution.

Overall, for depth maps, MSE strictly requires the source patch and the target patch to be identical at pixel level, and thus yields more accurate matching results than PC and SSIM, as the numerical check below illustrates.
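The toy example below makes this concrete on 7×7 patches shaped like the ramp of Eq. 4. It is our own construction (slopes and offsets chosen only for illustration, SSIM taken from scikit-image): Pearson correlation scores the level-shifted patch B as a perfect match, while MSE separates it cleanly from the identical patch D.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

x = np.arange(7, dtype=float)
H = np.tile(10 + 2 * x, (7, 1))   # target region H: slope +2, offset 10
B = np.tile(60 + 2 * x, (7, 1))   # candidate B: same gradient, wrong depth level
D = np.tile(10 + 2 * x, (7, 1))   # candidate D: pixel-wise identical to H

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def pc(a, b):
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

print(pc(H, B), pc(H, D))    # 1.0 and 1.0 -> PC cannot tell B from D
print(mse(H, B), mse(H, D))  # 2500.0 and 0.0 -> MSE clearly prefers D
print(ssim(H, B, data_range=255.0), ssim(H, D, data_range=255.0))
```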


For texture maps, MSE and SSIM are more widely used than PC, and we discuss their applicability to texture patch matching. As aforementioned, the two metrics are based on different principles, but the general trend between MSE/SSIM and patch similarity is definite: a smaller MSE or a larger SSIM value indicates less distortion and higher similarity. More specifically, Chai [38] has proven that the two metrics are inversely related. From this point of view, both are qualified to measure texture similarity. However, MSE has more advantages in this depth inpainting task for the following reasons. On one hand, SSIM is designed to simulate human visual perception, but in depth inpainting the texture maps serve as guidance rather than as output presented to users, so this advantage of SSIM is weakened; in contrast, MSE requires the pixel-to-pixel error between the matched patches to be minimized, which is stricter than SSIM. On the other hand, given that MSE better suits depth similarity, choosing MSE as the texture similarity as well makes it reasonable and easy to couple the two types of similarity: the overall joint similarity has a clear physical meaning, i.e., the combined error power of the texture and depth patches, and also a simple form. In addition, in practical applications, MSE is still widely used in a range of patch/block matching tasks such as image inpainting [39, 40], video coding [41, 42] and denoising [43]. For these reasons, MSE is also adopted as the texture similarity metric. Note that a smaller MSE value means a smaller block difference, so maximizing the similarity equals minimizing the MSE. Taking both the texture and depth maps into consideration, the overall similarity is defined as

    Ψ∗q = arg min_{Ψq} { β·MSE_D(Ψp, Ψq) + (1 − β)·MSE_T(Ψp, Ψq) }    (5)

Here, the subscripts T and D indicate texture and depth, respectively, and β balances the contributions of the two terms.

Please note that, in patch matching, the source patch Ψq has the same shape as the target patch Ψp, and all pixels of Ψq must lie in the source region Φ. Moreover, only valid pixels contribute to the similarity calculation. In this manner, the optimal source patch Ψ∗q is found for the target patch Ψp. Finally, the hole pixels in Ψp take their depth values from the corresponding pixels in Ψ∗q.
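A compact sketch of the resulting search, our paraphrase of Eqs. 3 and 5, is given below. The weight β = 0.6 and the search-window size are taken from Section 3.1; all function names are ours. The target patch is assumed to lie in the image interior, and the fill front guarantees it contains at least one valid pixel.

```python
import numpy as np

def masked_mse(a, b, m):
    """Mean squared error over the pixels selected by boolean mask m."""
    return float(np.mean((a[m] - b[m]) ** 2))

def best_source_patch(depth, tex, hole, p, shape_mask, half=3, win=20, beta=0.6):
    """Eq. 5: source patch minimizing the combined depth/texture error.

    shape_mask: boolean arbitrary-shape mask of the target patch (Section 2.2);
                the source patch must be fully valid at these positions.
    """
    py, px = p
    sl = np.s_[py - half:py + half + 1, px - half:px + half + 1]
    tgt_d, tgt_t = depth[sl], tex[sl]
    tgt_valid = shape_mask & ~hole[sl]        # only valid pixels enter the cost

    h, w = depth.shape
    best, best_cost = None, np.inf
    for qy in range(max(half, py - win), min(h - half, py + win + 1)):
        for qx in range(max(half, px - win), min(w - half, px + win + 1)):
            qs = np.s_[qy - half:qy + half + 1, qx - half:qx + half + 1]
            if hole[qs][shape_mask].any():    # Psi_q must lie inside Phi
                continue
            cost = beta * masked_mse(tgt_d, depth[qs], tgt_valid) \
                 + (1 - beta) * masked_mse(tgt_t, tex[qs], tgt_valid)
            if cost < best_cost:
                best, best_cost = (qy, qx), cost
    return best   # hole pixels of Psi_p are then copied from this patch
```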


3. Experimental results

3.1. Experiment setup

We have conducted extensive experiments to verify the proposed method. Three scenes, 'wood', 'teddy' and 'monopoly', are from the Middlebury dataset [44], and another four scenes, 'desk', 'meeting room', 'bookcase' and 'bedroom', are from the NYU dataset [45]. The texture and depth maps of the two datasets are shown in Figs. 6 and 7, respectively, where the pixel values are in the range [0, 255]. In Middlebury, the depth maps were obtained with a structured-light technique, and the raw ground-truth depth maps contain holes; we further generate rectangular holes, with randomly determined positions and width-height ratios, in the depth maps. In the NYU dataset, the depth maps were captured with an MS Kinect and aligned with the texture view; since there are already many irregular holes, we do not add artificial ones.

In the experiments, the important parameters are selected based on the following considerations. (1) The search region of Ψq: theoretically, patch matching searches the whole source region Φ of the depth map. However, spatial similarity decreases as distance increases; in other words, the optimal source patch Ψ∗q is more likely to be near the target patch Ψp, while regions far away contribute little. Meanwhile, a larger search region brings a much heavier computational load. Therefore, a limited range around Ψp is sufficient. The search range is also related to the resolution of the depth maps, i.e., a larger search range is needed for an image with a higher resolution and vice versa. With the above concerns, we use a 40×40 search window for the Middlebury images and a 60×60 search window for the NYU dataset. (2) The weighting factor β: β balances the contributions of MSE_T and MSE_D. If β approaches 1, the upper bound, MSE_D dominates over MSE_T and the texture guidance no longer works. In contrast, if β approaches 0, the lower bound, MSE_T dominates over MSE_D, and the matching is performed based only on texture similarity, which is unreasonable for depth inpainting. After several preliminary attempts, we set β to 0.6; Section 3.5 presents a detailed analysis of the effect of β. In the implementation, the original matching patches are 7×7 squares, but the modified patches have irregular shapes; more details about the patch size are given in Section 3.6.

Besides, the results of two state-of-the-art exemplar-based depth inpainting methods, proposed by Viacheslav [27] and Zhang [28], as well as the classic inpainting [23], are reproduced for performance comparison.

3.2. Results of inpainted depth maps

The resulting depth maps of the Middlebury and NYU datasets are shown in Figs. 8 and 9, respectively. For each depth map, details are zoomed in for better illustration. The methods of [23, 27, 28] do not perform depth-layer identification, and thus object boundaries are quite inaccurate in their results; more specifically, foreground (FG) pixels may be filled with background (BG) values and vice versa. In addition, the edges of [28] are slightly blurred, because the trilateral filter is essentially a low-pass filter, which indeed blurs edges.

The proposed method improves the inpainted depth maps from two aspects.

Figure 6: Middlebury dataset. From left to right are color maps, the original ground-truth depth maps, and depth maps with randomly generated holes. Note that the ground-truth depth maps have invalid pixels. (a)-(c) wood, with resolution 457×370. (d)-(f) teddy, with resolution 450×375. (g)-(i) monopoly, with resolution 443×370.

Due to the guidance of texture edges and the use of arbitrary-shape matching patches, the source patch and the target patch always belong to the same depth layer. This prevents inter-depth-layer errors and generates clear, sharp object boundaries. In addition, the cross-modal matching utilizes the more detailed textures, which allows the optimal source patches to be detected with higher accuracy. Therefore, in Figs. 8 and 9, the results of the proposed method, illustrated in the last rows, are more accurate than the others. In particular, in the bedroom scene of the NYU dataset, the bottom of the lamp holder is missing in the original depth map; in the inpainting results shown in the last column of Fig. 9, only the proposed method recovers its shape correctly.

(a)

(b)

(c)

(d)

(e)

(f)

(g)

(h)



   

    



 

Figure 7: NYU dataset: Color map and depth map from MS Kinect. The upper row and lower row present texture and depth maps respectively. (a)(e) Desk. (b)(f) Meeting room. (c)(g) Bookcase. (d)(h) Bedroom. All images are in a resolution of 560×425.





  

Figure 8: Results on Middlebury dataset. Rows from top to bottom are results obtained with classic inpainting [23], Viacheslav [27], Zhang [28] and the proposed method. Columns from left to right are scenes of wood, teddy and monopoly, respectively.

We would also like to point out that, although the artificially generated holes in Middlebury are rectangles, the shape of the holes does not affect the results. First, the inpainting order is determined by the priority in Eq. 1 rather than by any specified path; with the priority, the inpainting is performed in an 'onion-peel' manner.

Figure 9: Results on the NYU dataset. Rows from top to bottom are results obtained with classic inpainting [23], Viacheslav [27], Zhang [28] and the proposed method. Columns from left to right are the scenes desk, meeting room, bookcase and bedroom, respectively.


More specifically, although the holes are rectangles at first, they become irregular during inpainting. Moreover, the ground-truth maps of Fig. 6 themselves already contain irregular holes, and the depth maps of the NYU dataset in Fig. 7 have even more irregular invalid regions. The inpainted results demonstrate that the proposed method can accurately fill the holes regardless of their shapes.

We also assess the inpainted depth maps with the peak signal-to-noise ratio (PSNR). As shown in Fig. 6, invalid pixels exist in the original depth maps, and these pixels cannot be used as ground truth. Therefore, we exclude them when computing the PSNR values, i.e.,

    PSNR = 10 log₁₀ ( 255² Σ_{(x,y)} f(x, y) / Σ_{(x,y)} [f(x, y)(I(x, y) − I_GT(x, y))]² )    (6)

where f(x, y) = 1 only when (x, y) is a valid pixel in the ground truth, and f(x, y) = 0 otherwise.
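Eq. 6 translates directly into a few lines; the sketch below is our rendering, with the indicator f supplied as a boolean validity mask derived from the ground truth.

```python
import numpy as np

def masked_psnr(img, gt, valid):
    """PSNR of Eq. 6, computed only over pixels where the ground truth is valid.

    img, gt: 2-D arrays with values in [0, 255]
    valid:   bool mask, the indicator f(x, y) of Eq. 6
    """
    f = valid.astype(np.float64)
    err2 = (f * (img.astype(np.float64) - gt.astype(np.float64))) ** 2
    return 10.0 * np.log10(255.0 ** 2 * f.sum() / err2.sum())
```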


Table 1: PSNR (dB) of the inpainted Middlebury depth maps

Scene      Criminisi [23]   Viacheslav [27]   Zhang [28]   Proposed
teddy      39.98            39.30             39.60        41.90
monopoly   36.61            39.41             38.09        41.53
wood       36.62            51.26             50.10        44.74

Figure 10: Influence of the shape of matching patches. The upper row and lower row present depth maps inpainted with rectangle patches and arbitrary-shape patches, respectively.

The PSNR values are presented in Table 1. For teddy and monopoly, our method achieves the highest PSNR values, proving its effectiveness. For wood, although our method yields the most accurate edges (see the first column of Fig. 8), the improved region is invalid in the ground truth of Fig. 6(b); according to Eq. 6, this quality improvement therefore makes no contribution to the PSNR value. In general, the proposed method improves the depth quality over conventional methods, especially at boundaries.

3.3. Influence of patch shape

As aforementioned, the shape of the matching patches greatly affects the results, especially the boundary distortion. At boundary regions, a rectangular patch can cover two or even more depth layers, while the source patch often belongs to a single layer. As a result, some pixels are filled with values from incorrect layers, which leads to shape distortions, i.e., the distortions in the upper row of Fig. 10. In contrast, using adaptive-shape patches prevents cross-layer matching, and thus more accurate boundaries are presented in the lower row of Fig. 10.

3.4. Influence of similarity metric

The performance of three widely used similarity metrics, MSE, SSIM and Pearson correlation (PC), is also tested in our experiments. The results are shown in Fig. 11.

Figure 11: Inpainting results using different similarity metrics. (a) Depth map. (b) Results obtained with PC. (c) Results obtained with SSIM. (d) Results obtained with MSE.

Due to the homogeneity and smoothness of depth maps, SSIM and PC cannot achieve accurate patch matching; as a result, many incorrect patches appear in the results shown in Figs. 11(b) and 11(c). MSE, in contrast, requires equal absolute values between the source patch and the target patch; as expected, it generates the best results, shown in Fig. 11(d).

3.5. Influence of texture-depth weights

The weighting parameter β balances the contributions of texture similarity and depth similarity, and it affects the matching results as well as the inpainting results. We study the influence of β, and the results are shown in Fig. 12. When β is small, MSE_T dominates over MSE_D, which means that the inpainting is performed almost entirely on color similarity; therefore, many hole regions are filled with incorrect depth values. As β increases, depth becomes more important and the results improve. However, when β approaches 1, MSE_D dominates over MSE_T in the matching; in such a case, the guidance of the texture map is no longer available, and consequently cross-boundary errors appear between different depth layers. Above all, involving both texture and depth information benefits the matching accuracy, and we take β = 0.6 in our implementation.

3.6. Complexity analysis

The proposed method modifies patch shapes so that they adapt to object boundaries. A matching patch is initialized as a 7×7 square, and as pixels in different depth layers are removed following Section 2.2, the patch becomes smaller and irregular.

Figure 12: Inpainting results with different weighting factors β.

A statistic of the sizes of the modified matching blocks is collected, and the results are illustrated in Fig. 13. Since the patches are irregular, patch sizes are described by their pixel counts. The average sizes for monopoly, wood and teddy are 11.28, 11.72 and 9.82 pixels, respectively. These sizes are smaller than the original ones, indicating that the proposed inpainting is more precise. The histograms also show that, in general, teddy has smaller patches than wood and monopoly; the reason is that the edge map of teddy is more complex, which yields smaller and more numerous patches.

We also investigate the efficiency of the proposed method on a laptop with an Intel Core i5 3250M CPU and 12 GB RAM. The two major steps, arbitrary-shape patch generation (APG) and patch matching (PM), are programmed in C, compiled as '.mex' files, and called from Matlab. The time consumed by APG and PM is shown in Table 2. Both APG and PM take hundreds of milliseconds. For scenes with simple texture and edges, such as monopoly and wood, APG takes less time than PM, but for complex scenes such as teddy, APG takes more time than PM. Compared with the classic inpainting [23], the proposed method takes about four times as long. The reasons are two-fold: on one hand, the additional APG step is introduced; on the other hand, as the patches get smaller, PM must be conducted more times to fill the same hole. With this increased computational load, the proposed method precisely handles the boundaries and improves the depth quality. Note that no acceleration is used in our implementation; in practice, parallel operation can improve the efficiency. For example, different isolated holes can be inpainted simultaneously, which will accelerate the process (a sketch of this hole-level parallelism follows Table 2).

Figure 13: Histograms of the size (in pixels) of the modified patches. (a)-(c) are the patch-size histograms of monopoly, wood and teddy, respectively. The average sizes of the three scenes are 11.28, 11.72 and 9.82 pixels, respectively.

Table 2: Time consumption (seconds) of the proposed method. APG: arbitrary-shape patch generation; PM: patch matching.

Scene      APG     PM      Total   Criminisi [23]
monopoly   0.165   0.240   0.405   0.090
wood       0.132   0.383   0.515   0.117
teddy      0.594   0.372   0.966   0.156
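The hole-level parallelism suggested above can be organized as sketched below; this is our illustration only, with `inpaint_region` standing in for the per-hole priority/matching/filling loop of Section 2.

```python
import numpy as np
from scipy import ndimage
from concurrent.futures import ProcessPoolExecutor

def inpaint_region(args):
    """Stand-in for the proposed per-hole inpainting loop (Section 2)."""
    depth, tex, hole = args
    # ... run priority computation, patch matching and filling on this hole ...
    return depth

def parallel_inpaint(depth, tex, hole_mask, workers=4):
    """Label isolated holes and inpaint each connected component independently."""
    labels, n = ndimage.label(hole_mask)
    jobs = [(depth, tex, labels == k) for k in range(1, n + 1)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(inpaint_region, jobs))
```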

3.7. Failure cases

Although the proposed method generates good results, it still faces problems. The method relies on correct texture edges, but in some situations this condition is not satisfied. For example, in the red rectangle of Fig. 14, the foreground and background have similar colors, and thus no texture edges exist. As a result, the patch contains both depth layers, and the foreground pixels are filled with background depth values. This problem corresponds to case 3 of the discussion of texture and depth edge distributions in Section 2.2. On the other hand, in some cases the original depth map, i.e., the source region, is itself inaccurate, which also impairs the inpainting. Under such circumstances, pre-processing the depth edges [37, 46] and removing the errors will benefit the matching.



Figure 14: A failure case of the proposed method. (a) Color map. (b) Edges detected from the texture map. (c) Depth map with holes. (d) Inpainted result of the proposed method. Since the edge map in (b) failed to indicate foreground and background, distortions appear in the final inpainted depth map in (d).

4. Conclusion

This paper proposed a novel cross-modal method to enhance depth maps from RGB-D cameras; specifically, the main concern is to restore lost depth with accurate object boundaries. Based on the exemplar-based inpainting framework, the proposed method uses arbitrary-shape matching patches to adapt to irregular object boundaries. Moreover, a cross-modal matching energy involving both texture and depth information is proposed to achieve accurate matching. Experimental results show that the proposed method indeed improves the depth quality, especially the boundary accuracy. In future work, the proposed method will be extended in two directions. On one hand, more representations of the similarity criteria can be investigated to further improve the matching accuracy. On the other hand, the generation of arbitrary-shape patches can be accelerated and the patch matching can be conducted in parallel, which will improve the efficiency of the proposed method.


Acknowledgment

This work was partly supported by the National Natural Science Foundation of China (Nos. 61702384, 61502357, 61502358, 61231010), the Natural Science Foundation of Hubei Province (2017CFB348), the Research Foundation of the Hubei Educational Committee (Q20171106), and the Research Foundation for Young Scholars of WUST (2017xz008).

References

[1] A. Smolic, 3D Video and Free Viewpoint Video - From Capture to Display, Elsevier Science Inc., 2011.
[2] Y. Yang, X. Wang, Q. Liu, M. Xu, L. Yu, A bundled-optimization model of multiview dense depth map synthesis for dynamic scene reconstruction, Inform. Sci. 320 (2014) 306–319.
[3] S. Naveen, R. S. Moni, A robust novel method for face recognition from 2D depth images using DWT and DFT score fusion, in: Intl. Conf. on Comput. Syst. & Commun., 2015, pp. 1–6.
[4] S. Xiang, H. Deng, L. Yu, J. Wu, Y. Yang, Q. Liu, Z. Yuan, Hybrid profilometry using a single monochromatic multi-frequency pattern, Opt. Express 25 (22) (2017) 27195–27209.
[5] Y. S. Kang, Y. S. Ho, Disparity map generation for color image using ToF depth camera, in: 3DTV Conf.: The True Vision - Capture, Transmission and Display of 3D Video, 2011, pp. 1–4.
[6] D. Min, J. Lu, M. N. Do, Depth video enhancement based on weighted mode filtering, IEEE Trans. Image Process. 21 (3) (2012) 1176–1190.
[7] J. Yang, X. Ye, K. Li, C. Hou, Y. Wang, Color-guided depth recovery from RGB-D data using an adaptive autoregressive model, IEEE Trans. Image Process. 23 (8) (2014) 3443–3458.
[8] D. Miao, J. Fu, Y. Lu, S. Li, C. W. Chen, Texture-assisted Kinect depth inpainting, in: Int. Symp. on Circuits and Syst., IEEE, 2012, pp. 604–607.
[9] S. Xiang, L. Yu, Y. Yang, Q. Liu, J. Zhou, Interfered depth map recovery with texture guidance for multiple structured light depth cameras, Signal Process.: Image Commun. 31 (2015) 34–46.
[10] S. Milani, G. Calvagno, Correction and interpolation of depth maps from structured light infrared sensors, Signal Process.: Image Commun. 41 (2016) 28–39.
[11] H. Xue, S. Zhang, C. Deng, Depth image inpainting: Improving low rank matrix completion with low gradient regularization, IEEE Trans. Image Process. (99) (2016) 1–1.
[12] L. Zhao, H. Bai, A. Wang, Y. Zhao, B. Zeng, Two-stage filtering of compressed depth images with Markov random field, Signal Process.: Image Commun. 54 (2017) 11–22.
[13] G. Petschnigg, R. Szeliski, M. Agrawala, M. Cohen, H. Hoppe, K. Toyama, Digital photography with flash and no-flash image pairs, ACM Trans. Graph. 23 (3) (2004) 664–672.
[14] J. Kopf, M. F. Cohen, D. Lischinski, M. Uyttendaele, Joint bilateral upsampling, ACM Trans. Graph. 26 (3) (2007) 96.
[15] S. Liu, P. Lai, D. Tian, C. W. Chen, New depth coding techniques with utilization of corresponding video, IEEE Trans. Broadcast. 57 (2) (2011) 551–561.
[16] J. Kim, G. Jeon, J. Jeong, Joint-adaptive bilateral depth map upsampling, Signal Process.: Image Commun. 29 (4) (2014) 506–513.
[17] A. Bapat, A. Ravi, S. Raman, An iterative, non-local approach for restoring depth maps in RGB-D images, in: National Conf. on Commun., IEEE, 2015, pp. 1–6.
[18] T. A. Chang, J. P. Kuo, J. F. Yang, Efficient hole filling and depth enhancement based on texture image and depth map consistency, in: IEEE Asia Pacific Conf. on Circuits and Syst., 2016, pp. 192–195.
[19] S. Bhattacharya, S. Gupta, K. Venkatesh, High accuracy depth filtering for Kinect using edge guided inpainting, in: Intl. Conf. on Advances in Computing, Commun. and Inform., IEEE, 2014, pp. 868–874.
[20] P. Buyssens, M. Daisy, D. Tschumperlé, O. Lézoray, Superpixel-based depth map inpainting for RGB-D view synthesis, in: Intl. Conf. on Image Process., IEEE, 2015, pp. 4332–4336.
[21] W. Chen, H. Yue, J. Wang, X. Wu, An improved edge detection algorithm for depth map inpainting, Optics & Lasers in Engn. 55 (7) (2014) 69–77.
[22] Y. Wang, F. Zhong, Q. Peng, X. Qin, Depth map enhancement based on color and depth consistency, Vis. Comput. 30 (10) (2014) 1157–1168.
[23] A. Criminisi, P. Pérez, K. Toyama, Region filling and object removal by exemplar-based image inpainting, IEEE Trans. Image Process. 13 (9) (2004) 1200–1212.
[24] E. P. Simoncelli, B. A. Olshausen, Natural image statistics and neural representation, Annual Review of Neuroscience 24 (2001) 1193–1216.
[25] I. Daribo, H. Saito, A novel inpainting-based layered depth video for 3DTV, IEEE Trans. Broadcast. 57 (2) (2011) 533–541.
[26] I. Ram, M. Elad, I. Cohen, Image processing using smooth ordering of its patches, IEEE Trans. Image Process. 22 (7) (2013) 2764–2774.
[27] V. Viacheslav, F. Alexander, M. Vladimir, T. Svetlana, L. Oksana, Kinect depth map restoration using modified exemplar-based inpainting, in: Int. Conf. on Signal Process., IEEE, 2014, pp. 1175–1179.
[28] L. Zhang, P. Shen, S. Zhang, J. Song, G. Zhu, Depth enhancement with improved exemplar-based inpainting and joint trilateral guided filtering, in: Int. Conf. on Image Process., IEEE, 2016, pp. 4102–4106.
[29] H.-Y. Huang, C.-N. Hsiao, A patch-based image inpainting based on structure consistence, in: Intl. Comput. Symp., IEEE, 2010, pp. 165–170.
[30] F. Qi, J. Han, P. Wang, G. Shi, F. Li, Structure guided fusion for depth map inpainting, Pattern Recognition Lett. 34 (1) (2013) 70–76.
[31] Q. Fan, L. Zhang, A novel patch matching algorithm for exemplar-based image inpainting, Multim. Tools and Applicat. (2017) 1–15.
[32] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE Trans. Image Process. 13 (4) (2004) 600–612.
[33] Y. Li, T. Xue, L. Sun, J. Liu, Joint example-based depth map super-resolution, in: Intl. Conf. on Multimedia and Expo, IEEE, 2012, pp. 152–157.
[34] L. Wang, H. Jin, R. Yang, M. Gong, Stereoscopic inpainting: Joint color and depth completion from stereo images, in: IEEE Conf. on Comput. Vis. and Pattern Recognition, IEEE, 2008, pp. 1–8.
[35] J. Ou, W. Chen, B. Pan, Y. Li, A new image inpainting algorithm based on DCT similar patches features, in: Intl. Conf. on Comput. Intell. and Security, IEEE, 2016, pp. 152–155.
[36] P. Dollár, C. L. Zitnick, Structured forests for fast edge detection, in: Int. Conf. on Comput. Vis., IEEE, 2013, pp. 1841–1848.
[37] S. Xiang, L. Yu, C. W. Chen, No-reference depth assessment based on edge misalignment errors for T+D images, IEEE Trans. Image Process. 25 (3) (2016) 1479–1494.
[38] L. Chai, Y. Sheng, Optimal design of multichannel equalizers for the structural similarity index, IEEE Trans. Image Process. 23 (12) (2014) 5626–5637.
[39] R. Martínez-Noriega, A. Roumy, G. Blanchard, Exemplar-based image inpainting: Fast priority and coherent nearest neighbor search, in: IEEE Intl. Workshop on Machine Learning for Signal Process., 2012, pp. 1–6.
[40] B. Chung, C. Yim, Hybrid error concealment method combining exemplar-based image inpainting and spatial interpolation, Signal Process.: Image Commun. 29 (10) (2014) 1121–1137.
[41] G. J. Sullivan, J. Ohm, W.-J. Han, T. Wiegand, Overview of the high efficiency video coding (HEVC) standard, IEEE Trans. Circuits and Syst. for Video Tech. 22 (12) (2012) 1649–1668.
[42] T. Wiegand, G. J. Sullivan, G. Bjontegaard, A. Luthra, Overview of the H.264/AVC video coding standard, IEEE Trans. Circuits and Syst. for Video Tech. 13 (7) (2003) 560–576.
[43] M. Maggioni, A. Foi, Joint removal of random and fixed-pattern noise through spatiotemporal video filtering, IEEE Trans. Image Process. 23 (10) (2014) 4282–4296.
[44] D. Scharstein, R. Szeliski, A taxonomy and evaluation of dense two-frame stereo correspondence algorithms, Intl. J. Comput. Vis. 47 (1-3) (2002) 7–42.
[45] N. Silberman, R. Fergus, Indoor scene segmentation using a structured light sensor, in: Intl. Conf. on Comput. Vis. Workshops, IEEE, 2011, pp. 601–608.
[46] H. Deng, J. Wu, L. Zhu, Z. Yan, L. Yu, Texture edge-guided depth recovery for structured light-based depth sensor, Multimed. Tools and Appl. (2016) 1–16.


Highlights

• We propose an exemplar-based depth inpainting method with arbitrary-shape matching patches and a cross-modal matching criterion.
• Using arbitrary-shape matching blocks preserves sharp depth boundaries.
• Cross-modal matching with both texture and depth maps improves the matching accuracy.
• MSE outperforms Pearson correlation and SSIM in depth block matching.
