Depth map inpainting via sparse distortion model

Digital Signal Processing 58 (2016) 93–101

Fen Chen, Tianyou Hu, Liwen Zuo, Zongju Peng*, Gangyi Jiang, Mei Yu
Faculty of Information Science and Engineering, Ningbo University, Ningbo 315211, China

* Corresponding author. E-mail addresses: [email protected] (F. Chen), [email protected] (T. Hu), [email protected] (L. Zuo), [email protected] (Z. Peng), [email protected] (G. Jiang), [email protected] (M. Yu).

Article history: Available online 1 August 2016

Keywords: Depth map; Hole filling; Sparse representation; Kinect

Abstract: The depth map captured from a real scene by the Kinect motion sensor is always influenced by noise and other environmental factors, so some depth information is missing from the map. This distortion of the depth map directly deteriorates the quality of the virtual viewpoints rendered in 3D video systems. We propose a depth map inpainting algorithm based on a sparse distortion model. First, we train the sparse distortion model on distorted and real depth maps to obtain two learning dictionaries: one for the distorted maps and one for the real maps. Second, the sparse coefficients of the distorted and real depth maps are calculated by orthogonal matching pursuit. We obtain the approximate features of the distortion from the relationship between the distortion learning dictionary and its sparse coefficients. The noisy images are filtered by the joint space structure filter, and the extraction factor is obtained from the resulting image by an extraction factor judgment method. Finally, we combine the learning dictionary and sparse coefficients of the real depth map with the extraction factor to repair the distortion in the depth map. A quality evaluation method is also proposed for real depth maps that contain missing pixels. The proposed method achieves better results than comparable methods in terms of depth inpainting and the subjective quality of the rendered virtual viewpoints.

1. Introduction

Currently, 3D video technologies have a wide range of commercial applications, including virtual reality simulations and entertainment. The multiview video plus depth (MVD) format presents users with a rich perception of depth [1,2]. Notably, the free viewpoint video (FVV) system based on this format provides a broader visual perspective and richer 3D gradations, enabling users to immerse themselves in a multi-angle 3D visual experience [3].

Depth camera capture and software estimation are the two primary methods of obtaining depth information. Recent technological advances have seen depth cameras such as Time-of-Flight (TOF) devices and Microsoft's Kinect become dynamic tools with a variety of applications. TOF cameras [4] are very effective in certain scenarios, but the image resolution of these fairly expensive cameras is relatively low. In comparison, images captured by the lower-cost Kinect [5] have a relatively high resolution, and the Kinect offers better alignment between color images and depth maps. However, as the Kinect uses infrared rays to measure distance and capture depth information, it is subject to interference from many environmental factors.


Multiple reflections from smooth objects, refraction through transparent objects, and additional scattering and occlusions can all result in a loss of depth information and the occurrence of so-called depth holes. In addition, the noise introduced by the non-uniform sensitivities of the optoelectronic sensors often leads to inaccuracies in the depth map. Therefore, restoration is often required for depth maps.

Many methods have been proposed for depth map inpainting; they can be roughly categorized into two types: reconstruction-based methods and filtering-based methods. Reconstruction-based methods restore depth maps using image inpainting techniques [6]. Many image inpainting methods have been proposed and applied in a range of fields. In geophysics, spline approximation and the finite element method can be used to fit surfaces from surface patches, rapidly varying data, or curve sets [7–11]. Spline approximation can also be used for depth image inpainting [12–14]. Liu et al. [15] presented a new energy minimization method to restore the missing regions and remove noise in a depth map; their TV21 regularization preserves sharp edges and achieves better depth map inpainting results. In a study reported by Viacheslav et al. [16], the texture and structural characteristics of the sample were used to solve partial differential equations in order to perform inpainting.

Many depth image techniques belong to the filtering-based category. Camplani et al. [17] proposed a hole-filling strategy for the depth maps obtained with the Microsoft Kinect device.


The depth map is repaired by joint bilateral filtering that incorporates spatial and temporal information. Telea [18] developed a fast marching method (FMM) for image restoration, and Shen et al. [19] utilized joint bilateral filtering to perform rapid inpainting of the depth map. However, their research did not consider the structural information of the corresponding color images, so missing depth map edges could not be fully repaired. Jonna et al. [20] proposed an optical flow technique to remove fences/occlusions from depth maps, but their algorithm was not suitable for handling dynamic data associated with rapidly moving scenes. Bhattacharya et al. [21] used a guided method based on texture image edge information to repair distortions and reconstruct the depth map. However, their method did not account for the complexity of the color-texture information. Qi et al. [22] fused depth and color structure information for depth map inpainting. Schmeing et al. [23] utilized color segmentation to repair depth map edges, and also introduced an edge evaluation model. Although this method utilized texture-color information, its handling of inaccurate depth information was too simplistic.

Using learning-based methods to restore depth maps is an emerging trend. Sparse representation theory has been successfully applied to image inpainting. Aharon et al. [24] proposed an overcomplete dictionary for sparse representation that achieves good results. Xie et al. [25] proposed a robust coupled dictionary learning method with consistency constraints to reconstruct the corresponding depth map. Incorporating an adaptively regularized shock filter to simultaneously reduce jagged noise and sharpen edges, this method effectively reduces data uncertainty and prevents the dictionary from being overfitted. Fan et al. [26] proposed a high-resolution depth map reconstruction method using a sparse linear model. Although this method improves the quality of high-resolution images reconstructed from low-resolution distorted depth maps, it requires prior training on high-resolution image characteristics, and the results obtained from training on high-resolution noisy images were not ideal. In short, although several learning-based approaches have been proposed, they are not effective for processing high-resolution, noisy depth maps.

Therefore, we developed a depth map inpainting algorithm based on a sparse distortion model. The proposed method uses an overcomplete dictionary learning method to inpaint depth maps containing holes and noise. Most conventional methods rely on color image characteristics, but there are occasions when the corresponding color images of distorted depth maps are not available as a reference; in such cases, the proposed method is very effective for depth map inpainting. Our approach can be divided into several stages. First, a sparse distortion model is constructed. K-means singular value decomposition (K-SVD) is used to train on the distorted depth map, as well as on multiple undistorted depth maps. This procedure gives learning dictionaries for the distorted depth map and the real depth maps. K-SVD is a high-precision training method that identifies each sample block using a relatively small number of trained characteristics. Second, an orthogonal matching pursuit (OMP) algorithm is used to obtain the sparse coding coefficients of the distorted depth map and the undistorted depth maps.
Using the relationship between the learning dictionary and the sparse coding coefficients of the distorted depth map, we then obtain the approximate depth values and depth map features. Subsequently, the joint space structure filter is employed for noise reduction, which preserves the edge information of the image. The results are passed to the threshold determination method to obtain the extraction factor. Finally, the relationship between the learning dictionary and the sparse coding coefficients of the undistorted depth map is used to obtain the undistorted depth map features, which are combined with the extraction factor to perform the inpainting of the depth map.

We also propose a preprocessing-based evaluation method for examining the quality of inpainting in depth maps. Experimental results demonstrate that the proposed method improves both the subjective and objective quality of depth map inpainting compared to a variety of existing methods. Furthermore, the inpainted depth maps yield better virtual-viewpoint subjective quality than those given by other techniques. The main contributions of this paper are 1) a depth map inpainting algorithm based on a sparse distortion model, and 2) a joint space structure filter that reduces the noise of the distorted depth map and improves the accuracy of mask extraction.

The remainder of this paper is organized as follows. In Section 2, we describe the framework of our depth map inpainting algorithm. Section 3 presents a series of experimental results to evaluate the proposed method, and discusses its performance against that of other inpainting techniques. Our conclusions are summarized in Section 4.

2. Framework of the depth map inpainting algorithm

The framework of the proposed method is shown in Fig. 1, where E_1 and E_2 are the undistorted and distorted depth maps, D_x and D_z are the undistorted and distorted learning dictionaries, a_x and a_z are sparse coefficient vectors, and P is the extraction factor. The method comprises three parts: sparse distortion model building, training, and denoising and reconstruction. In the training process, the K-SVD algorithm is first used to perform dictionary training on E_1 and E_2 to obtain D_x and D_z, respectively. Then, the OMP sparse coding algorithm is applied to obtain a_x and a_z. In the denoising and reconstruction part, D_z and a_z are first combined with the joint space structure filter to denoise the distorted depth map. The results are then fed into the extraction factor judgment algorithm to obtain P. Finally, D_x, a_x, and P are used to reconstruct the depth map.

2.1. Sparse distortion model building and depth map training

The distortion of the depth map can be described using an input-output model. A depth map distortion matrix can be expressed as the sum of a noise matrix and the product of an extraction factor matrix and the undistorted depth map matrix, i.e.,

$$Z = PX + v \tag{1}$$

where Z represents the distorted image, X is the original image, P is the extraction factor (i.e., the hole mask), and v represents Gaussian noise. If Z is known and v can be removed, P can be obtained from the specific characteristics of Z and X. Based on the relationship between Z and P, the model can then yield the repaired depth map, thereby realizing depth map hole filling and noise removal. Therefore, an adaptive dictionary obtained in the training stage is used to remove noise and inpaint depth holes. Let us assume that, in Eq. (1), X ∈ R^n and Z ∈ R^n are the original image and the distorted image, respectively, where n represents the number of pixels. X and Z can each be expressed as a linear combination of l n-dimensional atoms. Therefore, the sparse representation can be expressed as:

$$X = D_x a_x + \varepsilon_x \tag{2}$$

$$Z = D_z a_z + \varepsilon_z \tag{3}$$

where D_x ∈ R^{n×l} and D_z ∈ R^{n×l} denote training dictionaries whose column vectors represent dictionary atoms; a_x ∈ R^l and a_z ∈ R^l represent sparse coefficient vectors; and ε_x, ε_z represent errors. When ε_x and ε_z are negligible, the estimated training samples can be represented by X_E and Z_E, which can be expressed as:


Fig. 1. Flowchart of the proposed depth map inpainting method.

$$X \approx X_E = D_x a_x \tag{4}$$

$$Z \approx Z_E = D_z a_z \tag{5}$$
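As a concrete illustration of Eqs. (2)–(5), the sketch below learns an overcomplete dictionary from depth map patches and sparse-codes them with OMP. It is a minimal sketch under stated assumptions, not the paper's implementation: scikit-learn has no K-SVD, so MiniBatchDictionaryLearning stands in for the K-SVD training used here, and the function names, the sparsity level, and the patch handling are our own choices. The 512-atom, 8 × 8-patch configuration mirrors the experimental settings of Section 3.

```python
# Minimal sketch of dictionary training and OMP sparse coding (Eqs. (2)-(5)).
# MiniBatchDictionaryLearning is a stand-in for K-SVD [29]; OMP follows [30].
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.feature_extraction.image import extract_patches_2d
from sklearn.linear_model import orthogonal_mp

def train_dictionary(depth_map, n_atoms=512, patch_size=8, sparsity=5):
    """Learn an overcomplete dictionary from overlapping depth map patches."""
    patches = extract_patches_2d(depth_map.astype(np.float64),
                                 (patch_size, patch_size), max_patches=10000)
    signals = patches.reshape(patches.shape[0], -1)     # one row per patch
    learner = MiniBatchDictionaryLearning(n_components=n_atoms,
                                          transform_algorithm='omp',
                                          transform_n_nonzero_coefs=sparsity)
    learner.fit(signals)
    return learner.components_.T                        # columns are atoms

def sparse_code(D, signals, sparsity=5):
    """OMP coefficients a such that signals ~= D a (cf. Eqs. (4)-(5))."""
    return orthogonal_mp(D, signals.T, n_nonzero_coefs=sparsity)
```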

To ensure that the samples provide sufficiently accurate information during dictionary training, the sampled dictionary blocks overlap, and the overlap between adjacent blocks must be accounted for during sample reconstruction; the number of overlaps at each position of the image matrix is used as a weighting factor in the reconstruction. The dictionary capacity affects the quality of the reconstructed depth map. Starck et al. [27] explored various image shapes for different dictionary capacities, and Rubinstein et al. [28] discussed mathematical models of dictionaries. In our framework, an adaptive overcomplete dictionary is used, and dictionary training is carried out using K-SVD [29]. Under the same training error conditions, this dictionary requires fewer atoms, and its training accuracy is superior to that of discrete cosine transform (DCT) dictionaries. The OMP algorithm [30] produces a residual vector that is perpendicular to the current projection, which greatly reduces the matching error and complexity. The noise removal and reconstruction strategies are described in the next section.

2.2. Joint space structure filter

Based on the sparse distortion model, we aim to preserve the edges of the image during noise removal. Bilateral filtering uses a weighted convolution for noise removal, but it is not ideal for removing high-level noise because it is difficult to tune the Gaussian function coefficients optimally. Conventional denoising methods such as median, mean, and Gaussian filtering have low structural complexity, but they can damage the edges of the image. Therefore, we propose a joint space structure filtering method. This method uses a standard convolution structure, via dictionary training on the distorted image, to perform denoising based on the structural weights of the image itself. The original image is compared with the denoising result from the first step to obtain the residual noise, which reflects the noise level in the flat areas of the image. A second filter is then used to remove this noise from the flat regions, and the filtered residual is added back to the previously denoised image. Thus, the method removes noise from flat regions while preserving image edges. The filtering calculation for the first step can be expressed as:

$$Z_{dn}(i,j) = \frac{Z(i,j) + k\lambda Z_E(i,j)}{1 + k\lambda Z_{Weight}(i,j)} \tag{6}$$

where k is a constant, λ represents the noise factor, Z is the distorted sample matrix, Z_Weight(i, j) denotes the weight given by the number of overlapping training sample blocks at image coordinates (i, j), Z_E(i, j) is the distorted image estimate obtained after training with the K-SVD algorithm, and Z_dn is the result of the first filtering step. Z_Weight is calculated by the following cumulative process:

$$Z_{Weight}(n,m) = \sum_{n=l}^{l+size} \sum_{m=l}^{l+size} \left[ Z_{Weight}(n,m) + B(n,m) \right] \tag{7}$$

where m and n represent the row and column indices of the matrix, size indicates the size of the sample block, l represents the column index of the reconstructed matrix in which all pixels of the distorted sample matrix Z are arranged according to the sample block size, and B is a matrix of the same dimensions as the sample block whose elements are all equal to 1.

The noise in non-edge regions of the depth map is not completely removed by the first step of the joint space structure filter. Therefore, a second filtering step is applied to remove the residual noise. This step uses the spatial-similarity-based anisotropically foveated nonlocal means (AFNL-means) filtering method [31], which is mainly used to remove noise from non-edge regions of the image. In comparison to non-local means filtering (NL-means [32]), this method removes the noise more efficiently, in a shorter period of time. The principle of AFNL-means filtering is to replace the Euclidean distance of NL-means with a foveated distance, along with a foveation operator that improves the accuracy of the search through similar blocks. The method can be expressed as:

$$y(p) = \sum_{q \in X} w(p,q)\, z(q), \quad \forall p \in Q \tag{8}$$

where {w(p, q)}_{q∈Q} are the adaptive weights of point p in the image domain Q. Each weight represents the similarity between the blocks around p and q. The specific expression is:

$$w(p,q) = e^{-d(p,q)/h^2} \Big/ \sum_{q \in Q} e^{-d(p,q)/h^2} \tag{9}$$

where d(p, q) denotes the foveated distance between points p and q, and h is the filter parameter. The foveated distance can be expressed as:



$$d(p,q) = \left\| F[z,p] - F[z,q] \right\|_2^2 \tag{10}$$

where F represents the foveation operator. For an image z and a fixed point x, the output foveated patch z_x^FOV can be expressed as:

$$F[z,x] = z_x^{FOV}(u), \quad u \in U \tag{11}$$


Fig. 2. Ratio of missing pixels to valid pixels in real depth maps.

where U represents a neighborhood in the original image and u is a pixel in U.

The second step of the joint space structure filter can be described as follows. The difference between the noisy map and the result of the first round of denoising is Z_diff, which is calculated as:

$$Z_{diff} = Z - Z_{dn} \tag{12}$$

We use AFNL-means to filter this residual noise, and the result is added to the output of the first denoising round. The final denoised result is calculated as:

$$Z_r = Z_{dn} + \mathrm{AFNLM}\{Z_{diff}\} \tag{13}$$

where Z_r is the denoised result of the joint space structure filtering, and AFNLM{·} represents the anisotropically foveated NL-means filtering operator.
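A minimal sketch of the two-step filter follows, assuming the distorted map Z, its trained K-SVD estimate Z_E, and the overlap-count weights Z_weight are available as arrays. The AFNL-means filter [31] has no common library implementation, so scikit-image's plain NL-means is used here as a stand-in for the residual-filtering operator of Eq. (13); the parameter defaults follow the experimental settings reported in Section 3.

```python
# Sketch of the joint space structure filter (Eqs. (6), (12), (13)).
# skimage's plain NL-means replaces the AFNL-means operator [31] here.
import numpy as np
from skimage.restoration import denoise_nl_means

def joint_space_structure_filter(Z, Z_E, Z_weight, k=0.031, lam=10.0, h=10.0):
    # Step 1: structure-weighted blend of Z and its trained estimate (Eq. (6)).
    Z_dn = (Z + k * lam * Z_E) / (1.0 + k * lam * Z_weight)
    # Step 2: denoise the residual (Eq. (12)) and add it back (Eq. (13)).
    Z_diff = Z - Z_dn
    return Z_dn + denoise_nl_means(Z_diff, h=h)
```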

2.3. Extraction factor identification and depth map reconstruction

The extraction factor P is an operand for locating the depth map holes. Its function is to mark missing pixels at certain locations of the original depth map, i.e., hole generation; accordingly, the extraction factor P serves as the depth map hole mask. If the corresponding position of the extraction factor is 0, it is assumed to be a depth hole; a value of 1 indicates no loss of depth information. Therefore, Z can be used to obtain P with a threshold value T for the determination. The specific expression is:

$$P(x,y) = \begin{cases} 0 & \text{if } Z_r(x,y) < T \\ 1 & \text{otherwise} \end{cases} \tag{14}$$
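Eq. (14) amounts to a single per-pixel threshold comparison; a one-line sketch (with T = 20, the value used in Section 3):

```python
import numpy as np

def extraction_factor(Z_r, T=20):
    """Hole mask P of Eq. (14): 0 marks a depth hole, 1 marks valid depth."""
    return (Z_r >= T).astype(np.uint8)
```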

In accordance with the sparse distortion model, after removing the noise v, the denoised image can be represented as X_hole, i.e., the image with missing pixels, using the expression:

$$Z_r = X_{hole} \approx P X_E \tag{15}$$

After P has been obtained, the final reconstruction result X_rp can be expressed as:

$$X_{rp} = \mathrm{Extra}\{X_{hole}\} = \mathrm{Extra}\{Z_r\} \tag{16}$$

where Extra{·} is the recovery operator associated with P. For the specific calculation, given in Eq. (17), the OMP algorithm is used to match the undistorted depth map X_k against the learning dictionary D_x to obtain the sparse coefficients a_x. As the error becomes negligible, X_k ≈ X_ke = D_x · a_x. P can then be used to obtain the reconstructed depth map X_rp according to:



$$X_{rp}(x,y) = \begin{cases} X_{ke}(x,y) & \text{if } P(x,y) = 0 \\ Z_r(x,y) & \text{otherwise} \end{cases} \tag{17}$$

where X_ke denotes the kth estimated undistorted depth map after training.
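Given the mask P, the trained estimate X_ke, and the denoised map Z_r, the final reconstruction of Eq. (17) reduces to a per-pixel selection. A minimal sketch:

```python
import numpy as np

def reconstruct(Z_r, X_ke, P):
    """Eq. (17): use the trained estimate at hole positions (P = 0) and
    keep the denoised measurement elsewhere."""
    return np.where(P == 0, X_ke, Z_r)
```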

3. Results and discussion

No undistorted image is available as a reference for the distorted images captured by the Kinect camera. Thus, we use the Middlebury dataset, which provides ground-truth depth maps, together with a 3D stereoscopic video sequence library to evaluate the quality of inpainting. However, the real depth maps from the stereoscopic image library contain some missing pixels, called holes, such as the dark regions shown in Fig. 3(a). The deviation between the inpainted pixel value and the original pixel value cannot be determined in the presence of these small holes, which greatly impacts the quality evaluation of the entire depth map inpainting process. Relatively few images are free of these small holes, and restricting the tests to them is not the best approach. Typically, the quality of a depth map is evaluated by rendering virtual viewpoints from the surrounding viewpoints, so the depth map inpainting quality is related to the quality of the virtual viewpoints. Within the FVV system, real color intermediate viewpoints are frequently unavailable as references; thus, most quality evaluations are subjective, as objective evaluations are limited by various conditions.

Therefore, we propose a method for the objective evaluation of depth map inpainting. To remove the impact of the existing holes on the objective evaluation, preprocessing is carried out to fill the existing holes in the real depth map. In this article, areas with missing pixels in the real depth map are described as holes, whereas the inpainted real depth map without holes is referred to as the standard depth map. Fig. 2 shows the ratio of M_hole to M_valid, where M_hole is the number of missing pixels in the real depth map and M_valid is the number of valid pixels. We define C_P as:

$$C_P = M_{hole} / M_{valid} \tag{18}$$

Fig. 2 indicates that depth maps from the Middlebury dataset have relatively small areas of missing pixels. A relatively accurate standard depth map can therefore be obtained if the spatial correlation of the images is utilized to fill the holes in advance. Hence, the FMM algorithm is used to preprocess the real depth maps; this approach efficiently and accurately repairs small hole regions. Fig. 3(b) shows the result of the FMM preprocessing, from which it can be observed that all holes have been filled in the treated depth map (the standard depth map). The standard depth maps are then used to objectively evaluate the quality of depth map hole inpainting, as sketched below.
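Telea's fast marching inpainting [18] is available in OpenCV as the cv2.INPAINT_TELEA flag, so this preprocessing step can be sketched as follows; the zero-valued hole convention, the 8-bit depth format, and the inpainting radius are our assumptions, not specified by the paper.

```python
# Sketch of the FMM preprocessing that turns a real depth map into a
# "standard" depth map; assumes 8-bit depth with 0 marking missing pixels.
import cv2
import numpy as np

def make_standard_depth_map(real_depth):
    hole_mask = (real_depth == 0).astype(np.uint8)   # 1 where depth is missing
    return cv2.inpaint(real_depth, hole_mask, 3, cv2.INPAINT_TELEA)
```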


To test the performance of the proposed algorithm, 10 real depth maps were selected from the Middlebury dataset as training samples. From the remaining images, 12 preprocessed standard depth maps were selected for testing. To quantitatively evaluate the performance of the algorithm, artificial depth map distortion was added to the test sample set to create a distorted sample set. The distortions in this set represent the noise and holes introduced during the capture of depth maps.

Fig. 3. Result of preprocessing the real depth map.

Table 1. Comparison of PSNR of the denoising methods.

Method      Structure weight filter   AFNL-means   Joint space structure filter
PSNR (dB)   43.8741                   42.8403      44.8445

The distorted sample set was obtained by introducing Gaussian noise with a noise factor of λ = 10 and random holes with a proportion of 30% to the depth maps of the test samples. Furthermore, for the "Door Flower" depth sequence from the 3D stereoscopic video library, Gaussian noise with a noise factor of λ = 10 and random holes with a proportion of 50% were introduced to create a distorted sample set for analyzing the quality of the virtual viewpoint rendering process. An overcomplete dictionary with 512 atoms and a block size of 8 × 8 was adopted, and the training process used 20 iterations. The joint structure weighted filter coefficient k was set to 0.031, the filtering parameter h was set to 10, and the threshold value T was set to 20. The results of the subjective and objective tests are discussed in the following subsections.
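A sketch of how such a distorted sample set can be synthesized per Eq. (1) follows; treating the hole proportion as a fraction of pixels is our reading of the setup, not an explicit statement of the paper.

```python
# Synthesizing distortion per Eq. (1): multiplicative hole mask P plus
# Gaussian noise v with noise factor lambda (10 in the experiments).
import numpy as np

def distort(X, lam=10.0, hole_ratio=0.3, seed=0):
    rng = np.random.default_rng(seed)
    P = (rng.random(X.shape) >= hole_ratio).astype(np.float64)  # hole mask
    v = rng.normal(0.0, lam, X.shape)                           # noise term
    Z = P * X.astype(np.float64) + v                            # Eq. (1)
    return np.clip(Z, 0, 255), P
```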

3.1. Comparison of denoising methods

The denoising method described in this paper combines an AFNL-means filter with the training-based structure weighted filter. Individual tests were carried out to determine the average peak signal-to-noise ratio (PSNR) of the structure weight filter, the AFNL-means filter, and the joint space structure filter. For a single-channel image I of size m × n and a reconstructed image L to be measured, the mean squared error (MSE) is

$$\mathrm{MSE} = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \left( I_{ij} - L_{ij} \right)^2 \tag{19}$$

where i and j represent the pixel coordinates. Based on the MSE, the PSNR is defined as



$$\mathrm{PSNR} = 10 \log_{10} \frac{I_{max}^2}{\mathrm{MSE}} \tag{20}$$

where I_max is the maximum possible value in I (e.g., I_max = 255). The reconstructed image L is closer to the original image I when the PSNR is larger.
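A direct transcription of Eqs. (19)–(20):

```python
import numpy as np

def psnr(I, L, I_max=255.0):
    """PSNR of reconstruction L against reference I (Eqs. (19)-(20))."""
    mse = np.mean((I.astype(np.float64) - L.astype(np.float64)) ** 2)
    return 10.0 * np.log10(I_max ** 2 / mse)
```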

The test results are shown in Table 1. They indicate that the joint space structure filter achieves a higher PSNR than the structure weight filter (by 0.9704 dB) and the AFNL-means filter (by 2.0042 dB). Therefore, the joint space structure filter effectively improves the noise removal.

3.2. Depth map inpainting results

The proposed method was compared to the DCT, BF [34], FMM, and Ram's [33] methods. The results obtained from distorted depth maps of the Middlebury dataset were categorized as subjective and objective experimental results. For the objective evaluation, we compared the PSNR and the structural similarity (SSIM) [35]. The standard depth maps were also used to analyze the quality of the inpainting.

Table 2. Comparison of PSNR (dB) of the enhancement methods.

Image        PSNR_DCT   PSNR_Ram   PSNR_BF    PSNR_FMM   PSNR_pro
Moebius      41.5400    41.1473    40.2854    40.0849    41.9163
Baby         45.4858    41.1554    41.5820    41.6340    45.9104
Dolls        42.4854    41.0705    41.7031    41.2637    42.7357
Aloe         41.5857    40.5777    38.2536    38.0875    41.8374
Bowling      44.0258    40.9218    41.0471    40.9687    44.2345
Cloth        48.0983    41.3157    45.8579    46.2992    48.2760
Flowerpots   42.4265    40.6530    38.9212    38.7347    42.9067
Lampshade1   43.1560    41.0096    39.9332    39.9313    43.4839
Midd1        43.4538    41.1103    41.3583    41.3417    44.0106
Plastic      45.9657    41.3163    43.1877    43.2285    46.2140
Art          40.1165    40.0641    35.7757    35.8986    40.6265
Books        42.6781    41.1643    40.6845    40.8723    43.0323
Average      43.4181    40.9588    40.7158    40.6954    43.7654

Table 3. Comparison of SSIM of the depth enhancement methods.

Image        SSIM_DCT   SSIM_Ram   SSIM_BF    SSIM_FMM   SSIM_Pro
Moebius      0.9715     0.9794     0.9749     0.9745     0.9745
Baby         0.9857     0.9812     0.9861     0.9865     0.9881
Dolls        0.9723     0.9759     0.9781     0.9776     0.9751
Aloe         0.9746     0.9812     0.9730     0.9728     0.9780
Bowling      0.9798     0.9782     0.9828     0.9835     0.9833
Cloth        0.9848     0.9774     0.9921     0.9926     0.9884
Flowerpots   0.9765     0.9789     0.9748     0.9748     0.9804
Lampshade1   0.9802     0.9811     0.9787     0.9790     0.9836
Midd1        0.9809     0.9797     0.9837     0.9838     0.9838
Plastic      0.9845     0.9793     0.9885     0.9889     0.9872
Art          0.9731     0.9826     0.9586     0.9593     0.9758
Books        0.9740     0.9786     0.9786     0.9789     0.9775
Average      0.9782     0.9794     0.9792     0.9794     0.9813

(1) Objective test results

In Tables 2 and 3, the columns give the performance of the DCT, Ram's, BF, FMM, and proposed methods, as denoted by the subscripts. Table 2 and Fig. 4 indicate that the proposed method achieves a higher average PSNR than DCT (by 0.3473 dB), Ram's method (2.8066 dB), BF (3.0496 dB), and FMM (3.0700 dB). In addition, each image repaired by the proposed method has a higher PSNR than that given by the other methods. As the trained overcomplete dictionaries are updated column by column, the training accuracy is higher than with the DCT dictionary. The overall depth map features are less complex than those of color images, so a high level of training accuracy is achieved by the proposed method. The only difference between DCT and our framework is the choice of dictionary; therefore, there is little difference in the overall objective quality of the two methods. The denoising algorithm described in this paper performs better than Ram's method in flat areas, and repaired depth maps with flat regions become even smoother under the proposed method. Therefore, the proposed method achieves a significantly better PSNR than Ram's method when used to repair images with large flat areas, e.g., sequences such as "Baby" and "Cloth."

Table 3 and Fig. 5 show that, for some sequences, the proposed method gives a slightly lower SSIM than Ram's and FMM methods. However, the overall average SSIM is still higher than that of the comparative methods. As this approach is an inpainting strategy based on dictionary feature learning, it attains higher training accuracy for flat images than for images with complex textures. Hence, flatter depth maps have a higher PSNR and SSIM than complex texture images. For example, the "Cloth" image has a flat texture with a simple structure, whereas the "Art" image has


Fig. 4. Comparison of PSNR of the five depth enhancement methods.

Fig. 5. Comparison of SSIM of the five depth enhancement methods.

a complex texture. As a result, the repaired "Cloth" image exhibits a higher PSNR than the repaired "Art" image. Overall, the objective quality of the proposed method is higher than that of the other methods.

(2) Subjective quality test results

Subjective quality test results are shown in Fig. 6, where the proposed method clearly shows some improvement in subjective quality over the other methods. For example, the proposed method repairs the edges of the books in row L more accurately than BF and FMM, and gives a smoother appearance than Ram's method and DCT in the flat areas of the repaired image.

3.3. Virtual viewpoint synthesis results

The subjective quality of images repaired by the proposed method is slightly superior to that of the other methods when used for depth map inpainting. To further determine whether the results of this inpainting method can be used to synthesize virtual viewpoints in the FVV system, we conducted an experiment using the multi-view depth video sequence "Door Flower." Gaussian noise and random holes were introduced to the sequence, and the quality of the virtual viewpoint was observed after applying each method. Fig. 7 shows the results of these repairs, and Fig. 8

shows the virtual viewpoints generated using the 24th frame of the 7th viewpoint and the 24th frame of the 9th viewpoint. It is evident from Fig. 8 that the edge of the poster in the uppermost red box is smoother in the results obtained via the proposed method than in the images given by the other methods; the same is true of the red box next to the chair leg. Fig. 7 shows that the depth map inpainted by the proposed method has a smoother background contour. The red box in the middle contains the lion statue; the proposed method gives a significantly clearer image of the statue than the BF method, as do DCT, FMM, and Ram's method. Fig. 7(c) shows the depth map inpainted by the BF method, in which the region corresponding to the lion statue suffers sudden depth value changes. This leads to incorrectly mapped coordinates during view synthesis, and thus distortions are observed on the lion statue. The above results and analysis show that the overall performance of the proposed method is superior to that of the other methods.

4. Conclusions

The depth maps captured by the Kinect sensor often contain missing pixels and noise interference, which tend to result in inaccurate depth information. This paper has described a depth map inpainting algorithm based on a sparse distortion model. Once the sparse distortion model has been constructed to represent the


Fig. 6. Comparison of repaired images given by the enhancement methods. Rows A–L show depth map inpainting results for “Moebius,” “Baby,” “Dolls,” “Aloe,” “Bowling,” “Cloth,” “Flowerpots,” “Lampshade1,” “Midd1,” “Plastic,” “Art,” and “Books,” respectively.

relationship between the distorted depth map and the real depth map matrices, K-SVD is used to train on the distorted depth maps and the true depth maps, yielding learning dictionaries for the distorted maps and the true maps, respectively. The sparse coefficients of the distorted and real depth maps are then obtained by sparse coding.

The real depth map training dictionary and the sparse coefficients are combined to obtain the approximate features of the depth map. Noise in the depth map is removed using joint space structure filtering. The extraction factor is then obtained, and the real depth map learning dictionary and sparse coefficients are used to


Fig. 7. Results of repairing the “Door Flower” sequence. (For interpretation of the colors in this figure, the reader is referred to the web version of this article.)

Fig. 8. Results of virtual viewpoint rendering. (For interpretation of the colors in this figure, the reader is referred to the web version of this article.)

determine the undistorted depth map features, based on which inpainting is carried out at the locations identified by the extraction factor. We have also proposed a preprocessing-based quality assessment method that fills the missing pixels in the real depth maps. The experimental results described in this paper demonstrate that the depth maps repaired using the proposed method exhibit improved subjective and objective quality compared to those given by DCT, FMM, BF, and Ram's method. Furthermore, the inpainted depth maps exhibit improved virtual-viewpoint subjective quality compared to views synthesized using the other methods. Although the dictionary of real depth maps is trained offline, the image denoising performed by the joint space structure filter remains computationally expensive; improving this process is the focus of future research.

Acknowledgments

This work was supported by the Natural Science Foundation of China (61271270, 61111140392, U1301257), the National High-tech R&D Program of China (863 Program, 2015AA015901), the Natural Science Foundation of Zhejiang Province (LY15F010005, LY16F010002), and the Natural Science Foundation of Ningbo (2015A610127, 2015A610124). It was also sponsored by the K.C. Wong Magna Fund of Ningbo University.

References

[1] C. Zhu, S. Li, Depth image based view synthesis: new insights and perspectives on hole generation and filling, IEEE Trans. Broadcast. 62 (1) (2016) 82–92.
[2] A. Purica, E.G. Mora, B. Pesquet-Popescu, et al., Multiview plus depth video coding with temporal prediction view synthesis, IEEE Trans. Circuits Syst. Video Technol. 26 (1) (2015) 1–14.


[3] C.C. Lee, A. Tabatabai, K. Tashiro, Free viewpoint video (FVV) survey and future research direction, APSIPA Trans. Signal Inf. Process. 4 (2015).
[4] A. Colaco, A. Kirmani, N.W. Gong, et al., 3dim: compact and low power time-of-flight sensor for 3d capture using parametric signal processing, in: Image Sensor Workshop, 2013, pp. 349–352.
[5] Y. Jingyu, Y. Xinchen, L. Kun, et al., Color-guided depth recovery from RGB-D data using an adaptive autoregressive model, IEEE Trans. Image Process. 23 (8) (2014) 3443–3458.
[6] M. Bertalmio, G. Sapiro, V. Caselles, et al., Image inpainting, in: Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, 2000, pp. 417–424.
[7] C. Gout, Ck surface approximation from surface patches, Comput. Math. Appl. 44 (3) (2002) 389–406.
[8] C. Gout, C. Le Guyader, L. Romani, et al., Approximation of surfaces with fault(s) and/or rapidly varying data, using a segmentation process, Dm-splines and the finite element method, Numer. Algorithms 48 (1–3) (2008) 67–92.
[9] D. Apprato, C. Gout, P. Sénéchal, Ck reconstruction of surfaces from partial data, Math. Geol. 32 (8) (2000) 969–983.
[10] D. Apprato, C. Gout, A result about scale transformation families in approximation: application to surface fitting from rapidly varying data, Numer. Algorithms 23 (2–3) (2000) 263–279.
[11] D. Apprato, C. Gout, D. Komatitsch, Surface fitting from ship track data: application to the bathymetry of the Marianas trench, Math. Geol. 34 (7) (2002) 831–843.
[12] S. Masnou, Disocclusion: a variational approach using level lines, IEEE Trans. Image Process. 11 (2) (2002) 68–76.
[13] S. Masnou, J.M. Morel, Level lines based disocclusion, in: International Conference on Image Processing, 1998, pp. 259–263.
[14] D. Brazey, C. Gout, Kinect depth map inpainting using spline approximation, in: European Workshop on Visual Information Processing (EUVIP), 2015, pp. 1–6.
[15] S. Liu, Y. Wang, J. Wang, et al., Kinect depth restoration via energy minimization with TV regularization, in: International Conference on Image Processing, 2013, pp. 724–727.
[16] V. Voronin, A. Fisunov, V. Marchuk, et al., Kinect depth map restoration using modified exemplar-based inpainting, in: IEEE International Conference on Signal Processing, 2014, pp. 1175–1179.
[17] M. Camplani, L. Salgado, Efficient spatio-temporal hole filling strategy for Kinect depth maps, in: Three-Dimensional Image Processing and Applications II, vol. 8290, 2012, pp. 841–845.
[18] A. Telea, An image inpainting technique based on the fast marching method, J. Graph. GPU Game Tools 9 (1) (2004) 23–34.
[19] Y. Shen, J. Li, C. Lu, Depth map enhancement method based on joint bilateral filter, in: IEEE International Congress on Image and Signal Processing, 2014, pp. 153–158.
[20] S. Jonna, V.S. Voleti, R.R. Sahay, et al., A multimodal approach for image de-fencing and depth inpainting, in: IEEE Eighth International Conference on Advances in Pattern Recognition, 2015, pp. 1–6.
[21] S. Bhattacharya, S. Gupta, K.S. Venkatesh, High accuracy depth filtering for Kinect using edge guided inpainting, in: International Conference on Advances in Computing, Communications and Informatics, 2014, pp. 868–874.
[22] F. Qi, J. Han, P. Wang, et al., Structure guided fusion for depth map inpainting, Pattern Recognit. Lett. 34 (1) (2013) 70–76.
[23] M. Schmeing, X. Jiang, Edge-aware depth image filtering using color segmentation, Pattern Recognit. Lett. 50 (2014) 63–71.
[24] M. Aharon, M. Elad, A. Bruckstein, K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation, IEEE Trans. Signal Process. 54 (11) (2006) 4311–4322.
[25] J. Xie, R. Feris, S.S. Yu, et al., Joint super resolution and denoising from a single depth image, IEEE Trans. Multimed. 17 (9) (2015) 1525–1537.
[26] H. Fan, D. Kong, J. Li, Reconstruction of high-resolution depth map using sparse linear model, in: International Conference on Intelligent Systems Research and Mechatronics Engineering, 2015, pp. 283–292.
[27] J.L. Starck, M. Elad, D.L. Donoho, Image decomposition via the combination of sparse representations and a variational approach, IEEE Trans. Image Process. 14 (10) (2005) 1570–1582.
[28] R. Rubinstein, A.M. Bruckstein, M. Elad, Dictionaries for sparse representation modeling, Proc. IEEE 98 (6) (2010) 1045–1057.
[29] M. Elad, M. Aharon, Image denoising via learned dictionaries and sparse representation, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, 2006, pp. 895–900.
[30] J. Tropp, A.C. Gilbert, Signal recovery from random measurements via orthogonal matching pursuit, IEEE Trans. Inf. Theory 53 (12) (2007) 4655–4666.


[31] A. Foi, G. Boracchi, Anisotropically foveated nonlocal image denoising, in: 20th IEEE International Conference on Image Processing, 2013, pp. 464–468.
[32] A. Buades, B. Coll, J.M. Morel, A non-local algorithm for image denoising, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, 2005, pp. 60–65.
[33] I. Ram, M. Elad, I. Cohen, Image processing using smooth ordering of its patches, IEEE Trans. Image Process. 22 (7) (2013) 2764–2774.
[34] M. Zhang, B.K. Gunturk, Multiresolution bilateral filtering for image denoising, IEEE Trans. Image Process. 17 (12) (2008) 2324–2333.
[35] Z. Wang, A.C. Bovik, H.R. Sheikh, et al., Image quality assessment: from error visibility to structural similarity, IEEE Trans. Image Process. 13 (4) (2004) 600–612.

Fen Chen received her B.S. degree from Sichuan Normal College, China, in 1996, and her M.S. degree from the Institute of Optics and Electronics, Chinese Academy of Sciences, in 1999. She is now an associate professor at the Faculty of Information Science and Engineering, Ningbo University, China. Her research interests mainly include digital signal processing, image processing and communications, and multiview video processing.

Tianyou Hu received his B.E. degree in communications from Ningbo University in 2014. He is currently a graduate student in the Department of Information Science and Engineering, Ningbo University. His research interests include image and video processing as well as RGBD computer vision.

Liwen Zuo received her B.E. degree in communications from the Jinling Institute of Technology in 2013. She is currently a graduate student in the Department of Information Science and Engineering, Ningbo University. Her research interests include multimedia signal processing, RGBD computer vision, and their applications.

Zongju Peng received his B.S. degree from Sichuan Normal College, China, in 1995, his M.S. degree from Sichuan University, China, in 1998, and his Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences, in 2010. He is now a professor at the Faculty of Information Science and Engineering, Ningbo University, China. His research interests mainly include image/video compression, multi-view video coding, and video perception.

Gangyi Jiang received his M.S. degree from Hangzhou University in 1992 and his Ph.D. degree from Ajou University, Korea, in 2000. He is now a professor at the Faculty of Information Science and Engineering, Ningbo University, China. His research interests mainly include digital video compression and communications, multi-view video coding, and image processing.

Mei Yu received her B.S. and M.S. degrees from the Hangzhou Institute of Electronics Engineering, China, in 1990 and 1993, respectively, and her Ph.D. degree from Ajou University, Korea, in 2000. She is now with the Faculty of Information Science and Engineering, Ningbo University, China. Her research interests include image/video coding and video perception.