Pattern Recognition Letters 34 (2013) 70–76
Structure guided fusion for depth map inpainting

Fei Qi, Junyu Han, Pengjin Wang, Guangming Shi, Fu Li

School of Electronic Engineering, Xidian University, Xi'an, Shaanxi 710071, PR China
Article info

Article history: Available online 27 June 2012

Keywords: Depth map; Inpainting; Information fusion; Non-local means; Kinect
Abstract

Depth acquisition has become inexpensive since the revolutionary invention of the Kinect. For computer vision applications, depth maps captured by the Kinect require additional processing to fill in missing parts. However, conventional inpainting methods for color images cannot be applied directly to depth maps, as there are not enough cues to make accurate inferences about scene structures. In this paper, we propose a novel fusion based inpainting method to improve depth maps. The proposed fusion strategy integrates conventional inpainting with the recently developed non-local filtering scheme. The good balance between depth and color information guarantees an accurate inpainting result. Experimental results show that the mean absolute error of the proposed method is about 20 mm, which is comparable to the precision of the Kinect sensor.

© 2012 Elsevier B.V. All rights reserved.
1. Introduction

Depth is very important in computer vision because it is one of the key cues for understanding real-world scenes. Conventional methods for capturing depth information are based on multiple view geometry, inspired by the human binocular system. In recent years, many new economical devices, including time-of-flight sensors (Iddan and Yahav, 2001), structured light (Scharstein and Szeliski, 2003; Wong et al., 2005), and the revolutionary Kinect, have been introduced for depth acquisition. The Kinect captures pairs of synchronized depth-color images of a scene within a range of several meters. However, the depth map cannot be used directly in scene reconstruction because it has deficiencies such as holes due to occlusion, reflection, and other optical factors. In this paper, we focus on filling these vacancies to improve the quality of the depth map.

Chiu et al. (2011) proposed a cross-modal approach to improve the Kinect depth map. Their basic assumption is that a linear combination of RGB channels can synthesize an IR image. The synthesized IR image and the one captured by the depth sensor form a stereo pair. This approach encounters difficulties when an object surface has extremely different colors, which results in largely different depth values. Matyunin et al. (2011) proposed a filtering method to fill occlusions and improve the temporal stability of the Kinect depth map. Depth maps captured by time-of-flight sensors have similar deficiencies. Kim et al. (2010) proposed a bilateral filtering based approach to perform spatial enhancement on depth images
employing both color and depth information. Zhu et al. (2008) endeavored to obtain high accuracy depth maps by fusing active time-of-flight depth data with conventional passive stereo estimation. Structured light based methods can provide reliable dense stereo correspondences, where the missing-match problem is solved by using multiple illumination sources (Wong et al., 2005) or a multi-pass dynamic programming (Scharstein and Szeliski, 2003).

Image inpainting (Bertalmio et al., 2000), which was developed to recover damaged regions in a color image, is another related research field. Early works on image inpainting are formulated as intensity diffusion (Bertalmio et al., 2000; Chan and Shen, 2001). Taking the redundancy of image contents into account, texture synthesis based approaches were proposed (Bertalmio et al., 2003; Criminisi et al., 2004). Recently, Bugeau et al. (2010) proposed a comprehensive framework for inpainting which considers diffusion, texture synthesis, and content coherence.

Filling holes in a depth map can be treated as an inpainting problem. However, unlike visual image inpainting, it is difficult to determine the stopping positions corresponding to object boundaries, since there are few or even no textures on depth maps. Furthermore, texture synthesis is also hard to perform within the depth map itself. On the Kinect, a visual sensor is available for capturing high resolution color images synchronized with the depth camera. In this paper, we propose a novel fusion based inpainting method that completes depth maps by incorporating color information. Though we focus on the Kinect depth-color sensor pair, the method can easily be extended to other multi-spectrum systems using array sensors, since it is inexpensive for other devices to provide visual images at the same time as depth capture.
The rest of this paper is organized as follows. Section 2 describes the proposed information fusion based approach for depth map inpainting. Experimental results are presented in Section 3 to show the validity, effectiveness, and accuracy of the proposed method. Conclusions are drawn in Section 4.

2. Fusion based inpainting

In this section, we introduce a fusion based framework for depth map inpainting which incorporates the structural information of a scene provided by both the depth map and the color image. The fusion based inpainting method is proposed in Section 2.1. Then, we explain in detail how the information fusion works based on structural information in Section 2.2.

For the convenience of discussion, we first introduce the notations used in this paper. We use bold face letters to denote the positions of pixels on both the depth map and the color image. To discriminate the two coordinate systems, we add a prime to the bold letters denoting pixels on the depth image. For example, p′ is a pixel on the depth map, while p is a pixel on the color image. We use d(p′) to denote the depth of pixel p′, while J(p) is the color of pixel p. The structural information of the color image is denoted as Ĵ(·). In addition, we use Ω to denote the region where depth information is unavailable.

Based on the geometric relationship between the depth and color cameras, a mapping function can be deduced according to multiple view geometry (Hartley and Zisserman, 2000; Khoshelham and Elberink, 2012). Since the alignment is somewhat complex, we simply denote the vectorial mapping function as
$p = m(p', d(p'))$,    (1)

which projects the pixel p′ on the depth map onto the color image as p. To achieve a precise alignment, both cameras are calibrated using Zhang's method (Zhang, 2000).
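The paper does not spell out the analytic form of m(·), but a mapping of this kind is typically implemented with a pinhole model once the calibration is known. The following sketch illustrates that standard construction; the intrinsic matrices K_d, K_c and the depth-to-color extrinsics (R, t) are hypothetical names for quantities obtained from calibration, not values given in the paper.

```python
import numpy as np

def map_depth_to_color(p_depth, d, K_d, K_c, R, t):
    """Project a depth-map pixel p' = (u, v) with depth d into the color
    image with a standard pinhole model.  K_d and K_c are the 3x3
    intrinsic matrices of the depth and color cameras; (R, t) transforms
    points from the depth camera frame to the color camera frame."""
    u, v = p_depth
    # Back-project the depth pixel into a 3-D point in the depth camera frame.
    ray = np.linalg.inv(K_d) @ np.array([u, v, 1.0])
    X_d = d * ray
    # Move the point into the color camera frame.
    X_c = R @ X_d + t
    # Project onto the color image plane and dehomogenize.
    x = K_c @ X_c
    return x[:2] / x[2]          # pixel coordinates p on the color image
```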
2.1. Fusion based depth map inpainting

With the mapping function (1), a depth pixel can be associated with a color pixel to form an augmented data array {d(p), J(p)}, with the projected depth map resampled on the color image coordinates. On the augmented array, the color information is available for all pixels, but the depth information is missing for pixels p ∈ Ω. Thus, depth map inpainting is to recover the missing depth values given the known depth and color pixels. As conventional image inpainting works with pixels without additional channel information, the inpainting of Kinect captured depth maps can be seen as a partial inpainting. After the projection, we work in the color image coordinate system only.

In a general inpainting configuration, the unknown depth d(p) of a pixel p can be predicted from the known depth d(q) of a nearby pixel q. Assuming the first order Taylor approximation is accurate enough, the prediction is formulated as
$d(p) = d(q) + \nabla_{qp} d(q)\,\|p - q\|_2$,    (2)
where the operator ∇_qp denotes the directional derivative along qp at point q, and ‖·‖₂ the standard ℓ₂ norm. Since the directional derivative can be obtained from the gradient, (2) can be rewritten as
$d(p) = d(q) + \langle \nabla d(q),\; p - q \rangle$,    (3)
where ⟨·, ·⟩ denotes the inner product between two vectors, and ∇ the gradient operator.

Due to capture noise, prediction with only one reference pixel is not robust enough. Recent developments in denoising show that the non-local means scheme (Buades et al., 2005, 2007) provides an accurate estimation by exploiting the redundancy of image textures. The recent visual image inpainting framework of Bugeau et al. (2010) also employs a non-local means like scheme in its texture synthesis part. On the basis of these methods, we propose the following depth map inpainting framework,
$d(p) = \sum_{q \in N(p)} w(p, q)\,\big[\, d(q) + \langle \nabla d(q),\; p - q \rangle \,\big]$,    (4)
where N(p) denotes a set of pixels near p ∈ Ω that have observed depth information, and the weight w(p, q) is assigned according to the geometrical distance, color structure coherence, and depth smoothness. After inpainting, a simple median filter is applied to further improve the quality of the depth map.

The inpainting framework (4) is also similar to the scheme employed by the fast marching based image inpainting method (Telea, 2004). The differences are in the selection of the set N(p) of neighboring pixels and in the assignment of weights. In fast marching based inpainting, neighbor pixels are chosen from those within the narrow band of the propagation frontier. In the proposed method, starting from the given pixel p, the set of near pixels N(p) is formed by pixels along rays in eight directions. Fig. 1 illustrates the construction of the near pixel set N(p). The reference pixel is at the center. Pixels on the eight rays are searched, and a pixel is added to the set if its depth information is available. In both Fig. 1a and b, chosen pixels are annotated by squares. The grid is regular; pixels without annotations are ignored because their depth information is missing.

A depth map is generally smoother than its corresponding color image, because object surfaces often carry abundant textures while the depth map remains continuous across texture edges, especially for indoor scenes. Thus, the first order Taylor expansion in the prediction (4) holds for a larger neighboring region on the depth map than on the color image. So, in the search for near pixels to form N(p), a large region can be used as the neighborhood of the pixel p, and for a narrow inpainting region the set of near pixels jumps over the region. The difficulty of determining where depth edges lie also originates from the blob continuity of object surfaces, and can hardly be resolved based on depth cues only. The weights w(p, q) are designed to evaluate the contribution of each pixel within N(p) to the inpainting of p by considering several factors, especially the structural information provided by the color image. Detailed discussions on how to assign the weights w(p, q) are provided in the next subsection.
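Before turning to the weights, the following sketch illustrates the neighbor search of Fig. 1 and the prediction (4), with the weight function left abstract. It is a minimal reading of the method, not the authors' code: the gradients are assumed to come from finite differences on the observed depth map, the per-ray limit of five pixels mirrors the setting reported later in Section 3, and names such as `steps_per_ray` and `max_range` are ours.

```python
import numpy as np

# The eight search directions of Fig. 1 (axis-aligned and diagonal rays).
DIRECTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1),
              (-1, -1), (-1, 1), (1, -1), (1, 1)]

def near_pixels(p, valid, steps_per_ray=5, max_range=30):
    """Collect N(p): walk along each of the eight rays starting at p and
    keep the first `steps_per_ray` pixels whose depth is observed
    (valid[q] is True).  `max_range` bounds the walk on large holes."""
    h, w = valid.shape
    neighbors = []
    for dy, dx in DIRECTIONS:
        found = 0
        for step in range(1, max_range + 1):
            q = (p[0] + step * dy, p[1] + step * dx)
            if not (0 <= q[0] < h and 0 <= q[1] < w):
                break
            if valid[q]:
                neighbors.append(q)
                found += 1
                if found == steps_per_ray:
                    break
    return neighbors

def inpaint_pixel(p, depth, valid, grad_y, grad_x, weight_fn):
    """First-order non-local prediction of d(p) as in (4): a weighted
    average of d(q) + <grad d(q), p - q> over q in N(p)."""
    preds, weights = [], []
    for q in near_pixels(p, valid):
        offset = np.array([p[0] - q[0], p[1] - q[1]], dtype=float)
        grad = np.array([grad_y[q], grad_x[q]])
        preds.append(depth[q] + grad @ offset)
        weights.append(weight_fn(p, q))
    weights = np.asarray(weights)
    return float(np.dot(weights, preds) / weights.sum())
```

In a full pipeline this prediction would be evaluated for every p ∈ Ω and followed by the median filter mentioned above; `weight_fn` corresponds to the normalized weight defined in Section 2.2.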
2.2. Structure guided fusion

Since the depth map is less textural than the corresponding color image, propagating depth information from the outside to the inside of the inpainting region generally produces better results than diffusion does on color images. However, the difficulty is how to determine where to stop the propagation. To resolve this problem, we design a weighting function that improves the inpainting by considering the geometrical distance, depth similarity, and the structure information provided by the color image. The weighting function has the following form,

$w'(p, q) = \prod_{i=1}^{3} \exp\!\left( -\frac{s_i(p, q)}{2 h_i^2} \right)$,    (5)

where s_i(p, q) are the three factors being considered, and h_i are the corresponding bandwidths. The weights in (4) are the normalized version of (5), with the normalization
Fig. 1. Construction of the set of near pixels. (a) A schematic illustration. (b) Sketch for near pixel selection.
$w(p, q) = \frac{w'(p, q)}{\sum_{q \in N(p)} w'(p, q)}$.    (6)
The first factor in (5) is the geometric distance between p and q, which is

$s_1(p, q) = \|p - q\|_2^2$.    (7)
This factor indicates that a nearer pixel is more reliable in the inpainting than a farther one; it models the reliability of the Taylor expansion used in the prediction with respect to the distance between the reference pixel and the pixel being recovered. The second factor in (5) considers the depth induced structure. Since a large part of the inpainting regions is caused by occlusion, these regions tend to have large depth values. To inpaint such shadow regions, the following function is evaluated,
$s_2(p, q) = \left( 1 - \frac{d(q)}{d_{\max}(p)} \right)^2$,    (8)
where d_max(p) is the maximum depth within the selected neighbors N(p), i.e.,

$d_{\max}(p) = \max_{t \in N(p)} d(t)$.    (9)
The last factor takes into account the coherence of the structural component of the color image,

$s_3(p, q) = \sum_{t \in B} \| \hat{J}(p + t) - \hat{J}(q + t) \|_2^2$,    (10)
where Ĵ denotes the structural component of the color image and B = {−l, …, l} × {−l, …, l} is the set of offset vectors of a (2l + 1) × (2l + 1) rectangular patch. The structural component of the color image is extracted according to
$\hat{J} = \arg\min_{J_S} \int \big\{ |\nabla J_S| + \lambda\, (J_S - J)^2 \big\}\, \mathrm{d}p$,    (11)
which can be accomplished with the total variation denoising model (Rudin et al., 1992; Wedel et al., 2009). Fig. 2 shows an example of the structural component extraction. Fig. 2a is the original color image, Fig. 2b is its structural component, and Fig. 2c shows the texture component, obtained by removing the structural component from the original image, i.e., J − Ĵ.

In the non-local means method, principal component analysis (PCA) has been shown to be effective in improving denoising performance (Tasdizen, 2009). Extracting the structural component has an effect similar to PCA in that it suppresses the high frequency components of the image. As our inpainting scheme has a non-local means architecture, the structure extraction does improve the inpainting performance.
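To make the weighting concrete, the sketch below evaluates the unnormalized weight (5) from the factors (7), (8), and (10). It is an illustrative approximation under stated assumptions: the structural component Ĵ of (11) is replaced here by scikit-image's total variation denoiser `denoise_tv_chambolle` (our substitution, with an arbitrary regularization weight), the bandwidth defaults mirror the values given later in Section 3, and border handling for the patch comparison is omitted.

```python
import numpy as np
from skimage.restoration import denoise_tv_chambolle

def structural_component(color, weight=0.1):
    """Approximate the structural component J_hat of (11) with TV
    denoising; `weight` is a regularization trade-off chosen by us."""
    return denoise_tv_chambolle(color, weight=weight, channel_axis=-1)

def fusion_weight(p, q, depth, J_hat, neighbors, h=(1.0, 0.112, 0.045), l=1):
    """Unnormalized weight w'(p, q) of (5), combining geometric distance
    (7), depth-induced structure (8), and color-structure coherence (10)
    over a (2l+1) x (2l+1) patch.  `neighbors` is the set N(p)."""
    h1, h2, h3 = h
    # s1: squared geometric distance between p and q.
    s1 = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    # s2: preference for large depths relative to the maximum depth in N(p).
    d_max = max(depth[t] for t in neighbors)
    s2 = (1.0 - depth[q] / d_max) ** 2
    # s3: sum of squared structural differences over the patch B
    #     (patches assumed to lie fully inside the image).
    pp = J_hat[p[0] - l:p[0] + l + 1, p[1] - l:p[1] + l + 1]
    qq = J_hat[q[0] - l:q[0] + l + 1, q[1] - l:q[1] + l + 1]
    s3 = float(np.sum((pp - qq) ** 2))
    return np.exp(-s1 / (2 * h1 ** 2)) * \
           np.exp(-s2 / (2 * h2 ** 2)) * \
           np.exp(-s3 / (2 * h3 ** 2))
```

The normalization (6) is then obtained by dividing each weight by the sum of the weights over N(p), after which the prediction sketch of Section 2.1 can be applied.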
3. Experimental results

In this section, we design several experiments to evaluate the performance of the proposed depth map inpainting method. To estimate its accuracy, we build scenes with rigid boxes only, for the convenience of retrieving ground truth depth data. For scenes with non-rigid foreground objects, we perform only subjective evaluations due to the difficulty of obtaining ground truth. The subjective evaluation and the quantitative accuracy analysis are provided in Sections 3.1 and 3.2, respectively.

Some configurations are fixed throughout all experiments in this section. Unless otherwise specified, the bandwidths of the three factors in (5) are set to h₁ = 1.0, h₂ = 0.112, and h₃ = 0.045. The set N(p), used in (6) and (9), contains 8 × 5 pixels: for a given pixel p, as shown in Fig. 1, five pixels are chosen along each of the eight directions. The size of the matching rectangle B in (10) is set to 3 × 3. The depth map is a 13 bit data array; to show it effectively, we render the data with pseudo colors, with white denoting the regions where depth information is missing. Since the color camera has a larger view angle than the depth camera, the projected depth map is smaller than the original one. Focusing on the inpainting of internal regions only, outer regions without depth information are ignored; in other words, we do not perform extrapolation with the proposed inpainting framework.

3.1. Subjective evaluation

Fig. 3 shows depth map inpainting results for the human body scene. Fig. 3a and b are the original depth map and the map projected onto the color image, respectively. Fig. 3c and d are inpainting results without and with post median filtering, respectively. In this scene, the post median filtering improves the visual effect. As shown by both figures, the depth structure of the background to the right of the human's head is well recovered, and the hand is recovered clearly with sharp edges.

Fig. 3e and f show inpainting results produced by different fusion strategies. Fig. 3e inhibits the use of the depth induced structure, i.e., h₂ = ∞ in (5). The edges in this result are blurred, especially around the hand.
Fig. 2. The structural component of a color image. (a) The original color image. (b) The structural component of the original image. (c) The texture component of the original image.
Fig. 3. Depth map inpainting results for the human body scene. (a) The original depth map. (b) The original depth map projected onto the color image. (c) The inpainting result without post median filtering. (d) The inpainting result with post median filtering. (e) The result of a fusion strategy that does not use the depth channel, i.e., h₂ = ∞. (f) The result of a fusion strategy that does not use the color structural information, i.e., h₃ = ∞. The fusion parameters are set to h₁ = 2.24, h₂ = 0.22, and h₃ = 0.08 in this experiment.
This result indicates that the depth factor contributes much to generating sharp edges in the inpainting results. Fig. 3f inhibits the use of the color structural information, i.e., h₃ = ∞ in (5). This strategy degrades regions that depend on color cues, such as the background region to the right of the human's head.

Fig. 4 illustrates the results for the hand-cloth scene, which is designed to evaluate the performance of inpainting non-rigid foreground objects in front of a textural background. In this experiment, the background includes a cloth with cluttered textures, and the foreground object is a non-rigid human hand. As shown by Fig. 4c and d, the regions near the arm, the thumb, and the cloth are all inpainted fairly well. This experiment verifies that the proposed method works well when the background contains textures. The inpainting result for the small region between the middle finger and the third finger, shown in Fig. 4b, is not good enough. The main reason is that this region is too small to be matched to the correct background region. In fact, this region is occupied by some inaccurate depth measurements which make the two fingers appear connected. To improve the depth information in this region, a mechanism to compensate for such positions is expected.
3.2. Quantitative accuracy analysis

To evaluate the quantitative accuracy of the proposed method, we need the ground truth depth of all pixels in the scene. We construct special scenes with three rigid boxes, because the ground truth depth of planar surfaces is easy to obtain in such scenes.

On a Kinect depth map, pixels that fail to capture depth information can be categorized into two types. Pixels of the first type are background pixels in the shadows of the infrared light projected by the Kinect. To obtain ground truth for these pixels, we only need to move the foreground objects casting shadows out of the scene while keeping the background objects static. As shown in Figs. 5 and 6, we need ground truth for the three planar box surfaces. This is done by moving the three boxes out of the scene one by one, from the nearest to the farthest. For each surface, we take the average of ten frames as the ground truth. Then, by manually combining these depth maps, we obtain the ground truth for pixels related to the first type of depth acquisition failure.
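As a sketch of this bookkeeping step, the code below averages the ten frames recorded per exposed surface (ignoring samples where the sensor returned no value) and composites the per-surface averages into one ground-truth map. The compositing rule, where nearer surfaces take precedence, is our assumption about the manual combination described above.

```python
import numpy as np

def average_frames(frames, invalid=0):
    """Average a stack of depth frames pixel-wise, ignoring samples where
    the sensor reported no depth (value `invalid`)."""
    stack = np.stack(frames).astype(float)      # shape: (n_frames, H, W)
    mask = stack != invalid
    counts = np.maximum(mask.sum(axis=0), 1)
    return (stack * mask).sum(axis=0) / counts

def composite_ground_truth(surface_maps):
    """Combine per-surface averaged maps, captured as the boxes are removed
    from nearest to farthest, into a single ground-truth depth map; nearer
    surfaces take precedence where they have valid depth."""
    gt = np.zeros_like(surface_maps[0])
    for depth in surface_maps:
        fill = (gt == 0) & (depth > 0)
        gt[fill] = depth[fill]
    return gt
```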
Fig. 4. The hand-cloth scene for evaluating the inpainting performance of non-rigid foreground objects in front of textural background. (a) The color image of the hand-cloth scene. (b) The depth map of the scene to be inpainted. (c) The inpainted depth map aligned with the color image. (d) The comparison image generated by blending the inpainted depth map with the color image.
Fig. 5. The color boxes scene for quantitative accuracy evaluation. (a) The color image. (b) The depth map to be inpainted. (c) The inpainted depth map aligned to the color image. (d) The comparison image generated by blending the inpainted depth map with the color image.
Fig. 6. The textural boxes scene for quantitative accuracy evaluation. (a) The color image. (b) The depth map to be inpainted. (c) The inpainted depth map aligned to the color image. (d) The comparison image generated by blending the inpainted depth map with the color image.
Pixels of the second type are foreground pixels near object boundaries where the infrared speckles are corrupted. The ground truth for these pixels is estimated by linear regression. We first manually segment the foreground rectangular surface of each box. Then we perform a linear regression on the available depth values of each surface, and use the fitted model to predict the ground truth of the boundary pixels. Since the surfaces of the boxes are planar, such an approach is reliable. In addition, foreground pixels are determined by checking the differences between the two color images captured with and without the foreground boxes.
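A minimal version of this regression step, under our own mask conventions, is sketched below: for one manually segmented box surface, a plane d ≈ a·u + b·v + c is fitted to the pixels with valid depth by least squares and then evaluated at the boundary pixels whose depth is missing.

```python
import numpy as np

def extrapolate_plane(depth, surface_mask, missing_mask):
    """Fit a plane d = a*u + b*v + c to the valid depth samples of one
    planar box surface and predict the depth of its missing boundary
    pixels, as used to build the ground truth."""
    vs, us = np.nonzero(surface_mask & (depth > 0))
    A = np.column_stack([us, vs, np.ones_like(us)]).astype(float)
    coeff, *_ = np.linalg.lstsq(A, depth[vs, us].astype(float), rcond=None)
    out = depth.astype(float).copy()
    mv, mu = np.nonzero(surface_mask & missing_mask)
    out[mv, mu] = coeff[0] * mu + coeff[1] * mv + coeff[2]
    return out
```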
Once the ground truth depth map is obtained, we can evaluate the accuracy of the proposed approach quantitatively. To achieve a comprehensive evaluation, we design two experiments, each with a set of boxes. One experiment uses boxes whose surfaces each have a single color, while the other uses boxes with complex textures, as shown in Figs. 5 and 6. The accuracy is measured by the mean absolute error (MAE) of the inpainted pixels with respect to the ground truth.

Table 1 provides the quantitative accuracy of the proposed method under several configurations. It can be seen that, based on the fusion of depth and the color structural component, the inpainting results are more accurate than those using only one cue, on both the color and the textural boxes. There is only a slight degradation in accuracy when the surfaces to be inpainted contain textures. In both experiments, the accuracy obtained by the color-depth fusion strategy is comparable to the precision of the Kinect itself, which is around 20 mm according to the investigation of Khoshelham and Elberink (2012). In addition, for planar surfaces with a single color, the color cue alone is strong enough to achieve accurate inpainting; for textural planar surfaces, however, the color cue is far from enough. The MAEs are very large when the inpainting is based on the depth cue only, because large errors occur at the boundaries of foreground objects, where the missing depth can hardly be compensated with depth cues alone. Furthermore, the coherence of color and depth does provide a strong and reliable cue for depth map inpainting.
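For reference, the MAE reported in Table 1 is taken only over the pixels that were filled in by the inpainting; a small sketch under that assumption:

```python
import numpy as np

def mean_absolute_error(inpainted, ground_truth, inpainted_mask):
    """MAE (in the unit of the depth maps, here millimeters) over the
    pixels recovered by the inpainting algorithm."""
    diff = np.abs(inpainted[inpainted_mask] - ground_truth[inpainted_mask])
    return float(diff.mean())
```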
Table 1
The quantitative accuracy under different configurations. The unit of the MAEs is millimeter (mm). A bandwidth of ∞ disables the corresponding factor.

         Color boxes                     Textural boxes
h₁       1.0      1.0      1.0           1.0      1.0      1.0
h₂       0.112    ∞        0.112         0.112    ∞        0.112
h₃       0.045    0.045    ∞             0.045    0.045    ∞
MAE      19.03    19.97    97.26         20.67    45.73    121.98
Fig. 7. Distributions of errors in experiments shown by Figs. 5 and 6.
Fig. 7 illustrates the distributions of errors for the two experiments shown in Figs. 5 and 6. It can be seen that a large portion of the absolute errors is less than 10 mm. The extremely large errors contribute greatly to the bias of the MAEs, which means that determining the type of depth capture failure is important for improving the inpainting accuracy. According to the quantitative accuracy analysis, the proposed framework balances the depth and color cues well and provides fairly accurate inpainting results.

4. Conclusions

In this paper we propose a fusion based framework for depth map inpainting which takes advantage of the available color and depth information. With the corresponding color image available, this framework produces a very reliable inpainting result for regions missing depth information. Though only depth-color image pairs captured by the Kinect are tested, the proposed method can be applied to depth maps captured by other types of sensors, including time-of-flight sensors, if a corresponding color image is available. In the future, we expect to further improve the proposed inpainting method in several aspects: first, how to use the temporal information of both the depth and color channels; second, how to adjust the order in which pixels are recovered to make the inpainting more accurate; and finally, how to extend the proposed method to scenes with multiple objects or humans in order to recover pixels where both depth and color are lost.

Acknowledgments

This work is supported in part by the National Natural Science Foundation of China under grants No. 60805012, 61033004, 61070138, and 61100155.

References

Bertalmio, M., Sapiro, G., Caselles, V., Ballester, C., 2000. Image inpainting. In: Proc. 27th Annual Conf. on Computer Graphics and Interactive Techniques, SIGGRAPH '00. ACM Press, New York, NY, USA.
Bertalmio, M., Vese, L., Sapiro, G., Osher, S., 2003. Simultaneous structure and texture image inpainting. IEEE Trans. Image Process. 12 (8), 882–889.
Buades, A., Coll, B., Morel, J., 2005. A non-local algorithm for image denoising. In: IEEE Comp. Soc. Conf. on Computer Vision and Pattern Recognition, CVPR, vol. 2, pp. 60–65.
Buades, A., Coll, B., Morel, J., 2007. Nonlocal image and movie denoising. Internat. J. Comput. Vision 76 (2), 123–139.
Bugeau, A., Bertalmio, M., Caselles, V., Sapiro, G., 2010. A comprehensive framework for image inpainting. IEEE Trans. Image Process. 19 (10), 2634–2645.
Chan, T., Shen, J., 2001. Mathematical models for local nontexture inpaintings. SIAM J. Appl. Math., 1019–1043.
Chiu, W., Blanke, U., Fritz, M., 2011. Improving the Kinect by cross-modal stereo. In: British Machine Vision Conf., BMVA, pp. 116.1–116.10.
Criminisi, A., Perez, P., Toyama, K., 2004. Region filling and object removal by exemplar-based image inpainting. IEEE Trans. Image Process. 13 (9), 1200–1212.
Hartley, R., Zisserman, A., 2000. Multiple View Geometry in Computer Vision. Cambridge Univ. Press.
Iddan, G.J., Yahav, G., 2001. Three-dimensional imaging in the studio and elsewhere. In: Three-Dimensional Image Capture and Applications IV, vol. 4298 of Proc. SPIE, pp. 48–55.
Khoshelham, K., Elberink, S.O., 2012. Accuracy and resolution of Kinect depth data for indoor mapping applications. Sensors 12 (2), 1437–1454.
Kim, S., Cho, J., Koschan, A., Abidi, M.A., 2010. Spatial and temporal enhancement of depth images captured by a time-of-flight depth sensor. In: 20th Internat. Conf. on Pattern Recognition, ICPR, pp. 2358–2361.
Matyunin, S., Vatolin, D., Berdnikov, Y., Smirnov, M., 2011. Temporal filtering for depth maps generated by Kinect depth camera. In: 3DTV Conf.: The True Vision – Capture, Transmission and Display of 3D Video (3DTV-CON), p. 14.
Rudin, L., Osher, S., Fatemi, E., 1992. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenom. 60 (1–4), 259–268.
Scharstein, D., Szeliski, R., 2003. High-accuracy stereo depth maps using structured light. In: IEEE Comp. Soc. Conf. on Computer Vision and Pattern Recognition, CVPR, vol. 1, pp. 195–202.
Tasdizen, T., 2009. Principal neighborhood dictionaries for nonlocal means image denoising. IEEE Trans. Image Process. 18 (12), 2649–2660.
Telea, A., 2004. An image inpainting technique based on the fast marching method. J. Graphics Tools 9 (1), 23–34.
Wedel, A., Pock, T., Zach, C., Bischof, H., Cremers, D., 2009. An improved algorithm for TV-L1 optical flow. In: Cremers, D., Rosenhahn, B., Yuille, A., Schmidt, F. (Eds.), Statistical and Geometrical Approaches to Visual Motion Analysis, vol. 5064 of LNCS. Springer, pp. 23–45.
Wong, A.K., Niu, P., He, X., 2005. Fast acquisition of dense depth data by a new structured light scheme. Comput. Vision Image Understanding 98 (3), 398–422.
Zhang, Z., 2000. A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Machine Intell. 22 (11), 1330–1334.
Zhu, J., Wang, L., Yang, R., Davis, J., 2008. Fusion of time-of-flight depth and stereo for high accuracy depth maps. In: IEEE Conf. on Computer Vision and Pattern Recognition, CVPR, pp. 1–8.