Computer Aided Geometric Design 64 (2018) 15–26
Reconstructing non-rigid object with large movement using a single depth camera ✩

Feixiang Lu *, Bin Zhou, Feng Lu, Yu Zhang, Xiaowu Chen, Qinping Zhao

State Key Laboratory of Virtual Reality Technology & Systems, Beihang University, China
Article history: Received 1 March 2018; Received in revised form 18 April 2018; Accepted 11 June 2018; Available online 25 June 2018.

Keywords: Non-rigid object reconstruction; Large movement; A single depth camera; Canonical frame identification.
Abstract

Non-rigid detailed 3D reconstruction of real-world scenes has witnessed great success in recent years. However, most existing methods take the first frame as the canonical model, and the topological structure of the input scene is kept fixed during the reconstruction process, an assumption that may not hold in practice for highly non-rigid scenes. Regarding this issue, this work proposes a novel approach to reconstruct non-rigid objects with large movement, which often results in changes of topological structure. We first introduce an adaptive strategy that can effectively identify the most fine-grained scene topology as the canonical model. This model is then deformed to each depth map, constrained by robust inter-frame correspondences established from object contours and scene flows. After deformation, we further fuse the depth map into the canonical model via a novel adaptive selection scheme, so as to remove spurious noise without smoothing model details. Experimental results show that the proposed approach can effectively handle various input scenes with large movement and generate models with high-fidelity details.

© 2018 Elsevier B.V. All rights reserved.
1. Introduction

3D reconstruction of real-world scenes from depth cameras is a widely studied problem in the fields of computer vision and computer graphics. After long-term efforts, the 3D model of a scene can now be accurately built by fusing its depth maps captured from multiple views, as long as the scene is static (e.g., KinectFusion (Newcombe et al., 2011; Izadi et al., 2011)). However, reconstructing non-rigid scenes with a single depth camera is still largely unsolved due to a number of challenges, such as non-rigid deformation, incomplete scans, and large movement, which might cause inconsistency in the topological structure of the scene.

In recent years, the challenges of handling non-rigid deformation and incomplete scans have been well studied and addressed by various previous works (Sumner et al., 2007; Xu et al., 2007; Li et al., 2009; Liao et al., 2009; Zhou et al., 2010; Oikonomidis et al., 2011; Taylor et al., 2012; Li et al., 2013; Yang et al., 2013; Zollhöfer et al., 2014; Dou et al., 2015; Zhang et al., 2015a; Yang et al., 2015; Dou et al., 2016). However, these methods rely on strong priors such as pre-designed templates, direct user manipulation, multiple depth sensors, or pre-learned statistical models. Moreover, some techniques need seconds to minutes to compute a single frame, which makes them impractical for reconstruction. Newcombe et al. (2015) proposed the first system for densely reconstructing general dynamic scenes, which can generate high-quality results from a single camera in real time. Although significant successes were achieved by these approaches, most of them do not
✩ This paper has been recommended for acceptance by Ligang Liu.
* Corresponding author. E-mail address: [email protected] (F. Lu).
Fig. 1. We present a new approach to reconstruct non-rigid objects with large movement. Our method only requires a single depth camera (e.g., Kinect v2.0) to capture the depth maps and generates plausible results rapidly (≈0.1 s/frame).
Fig. 2. An overview of the system pipeline. (For interpretation of the colors in the figure(s), the reader is referred to the web version of this article.)
explicitly consider the challenging problem of large movement (i.e., potential topological change) of the input scene, which frequently happens for non-rigid objects. As illustrated in Fig. 1, when the arms of the person stretch away from the body, the topology of the person becomes inconsistent. In this case, previous methods using a fixed topological structure cannot reconstruct the input scene consistently. Slavcheva et al. (2017) made an attempt at this problem with a level-set evolution approach. However, reconstructing without correspondences makes the results somewhat unstable in appearance and lose essential details.

To address this problem, this paper presents a novel approach to reconstructing non-rigid scenes with large movement from a single depth camera. As summarized in Fig. 2, the proposed approach takes the depth sequence captured by a Kinect v2.0 sensor as input, and incrementally fuses the depth maps to generate a canonical model that can best fit the scene on each frame under certain deformations. To this end, we propose a novel adaptive strategy that identifies the most fine-grained scene topology as the canonical model by analyzing the topological structure. Given the canonical model, we then deform it to each depth map, constrained by robust inter-frame correspondences established from object contours and scene flows. Finally, we fuse the depth maps onto the deformed canonical models through a novel scheme that adaptively selects the appropriate interval of frames for fusion, which generates high-quality reconstruction results without over-smoothing model details. Experimental results demonstrate that our approach can effectively handle various input scenes with topological structure changes due to large movement.

The contributions of this paper are summarized as follows: 1) we present a novel approach that identifies the canonical frame to reconstruct non-rigid scenes with large movement; 2) we efficiently deform the canonical model to fit each depth map using contour and scene flow cues; 3) we propose an adaptive fusion algorithm which can largely suppress noise during fusion while preserving model details.

2. Related work

There have been various previous works on 3D scene reconstruction based on consumer-level depth cameras. While a large group of them focused on static scenes (Newcombe et al., 2011; Izadi et al., 2011; Roth and Vona, 2012; Whelan et al., 2012;
Fig. 3. The reconstruction quality with different canonical frames. For each column, the canonical frame is shown on the left while the target frame is on the right.
Shao et al., 2012; Lin et al., 2013; Steinbrucker et al., 2013; Chen et al., 2013; Nießner et al., 2013; Kahler et al., 2015; Zhang et al., 2015b), this section mainly reviews recent advances on non-rigid scene reconstruction that are tightly correlated with our approach.

Multi-view non-rigid reconstruction. Depth maps from multiple views of the scene provide complementary visual information and thus facilitate the reconstruction process. For example, Tong et al. (2012) use three Kinect cameras to scan full 3D human bodies through global non-rigid registration. Dou et al. (2013) use eight Kinect cameras to reconstruct a complete 3D model, and then track the model to match later observations. Wang et al. (2016) present an end-to-end system for reconstructing complete, water-tight and textured models of moving subjects using three or four handheld sensors. Dou et al. (2016) proposed the Fusion4D system for live multi-view performance capture, generating temporally coherent high-quality reconstructions in real time from 24 cameras. Although multi-view reconstruction techniques can generate much more delicate models than a single camera, the capture setups are complex and not easy for novice users.

Prior-based non-rigid reconstruction. Several approaches employ prior knowledge of the scene to aid the reconstruction process. In this context, various technical improvements were made on reconstructing 3D human bodies, hands and faces (Weiss et al., 2011; Oikonomidis et al., 2011; Cao et al., 2013). However, these rely on strong priors such as pre-learned statistical models, articulated skeletons, or morphable shape models, while capturing non-rigid scenes of general categories still remains challenging.

Template-based non-rigid reconstruction. The template-based approach has recently proven effective for modeling general non-rigid scenes (Li et al., 2009). Typically, a 3D template model of the scene is acquired, which is then deformed to match the visual information of each frame. For example, Zollhöfer et al. (2014) first acquire a template of the scene using KinectFusion, and then non-rigidly deform the template to the captured sequences.

Templateless non-rigid reconstruction. Newcombe et al. (2015) present the first dense non-rigid reconstruction system, DynamicFusion, which fuses the depth maps to incrementally generate a "canonical model" and, at each time instant, deforms it to each frame in real time. Innmann et al. (2016) extract sparse color features to enable accurate tracking and effectively handle the drift problem compared with DynamicFusion. Guo et al. (2017) simultaneously fuse object geometry and surface albedo for a non-rigid scene in real time. Yu et al. (2017) take advantage of an internal articulated skeleton prior and propose a real-time skeleton-embedded surface fusion approach. Wang et al. (2017) propose an effective local-to-global hierarchical optimization framework to reconstruct and track non-rigid objects with an RGB-D camera. All of the above methods assume the topological structure of the scanned object to be fixed. Once the object movement is large, the structural connections may be broken, and the object appearance on several frames may not be accurately aligned by any possible deformation, leading to failed reconstructions.

3. Overview

We aim to reconstruct a non-rigid dynamic object in a real-world scene using a single depth camera, where the object movement is large and topologies may change significantly in the depth video.
For example, as shown in the first row of Fig. 3(a, b), the person's hands and head touch each other at first, and then gradually separate over the next few frames. This kind of large movement often happens in daily life, while most state-of-the-art 3D reconstruction methods, such as the DynamicFusion approach (Newcombe et al., 2015), fail in this situation. This is because these methods take the first frame as the canonical model and directly warp the first frame to fit the other frames.

A straightforward solution to this problem is to design a new canonical model updating algorithm, like the 'Key Volume' technique proposed in Fusion4D (Dou et al., 2016). If the misalignments are drastic, 'Key Volume' would refresh
Fig. 4. Canonical frame identification. (a) The input depth map. (b) The mutual distances of the point cloud: for each 3D point (red), we calculate the Euclidean distances to the other 3D points (blue). (c) The mutual distances of the object contour points: for each contour point, we calculate the Euclidean distances to the other contour points (blue). (d) The forward and (e) backward scene flows.
the canonical model using the current data. However, since there then exist several canonical models, the motion field is not consistent and the results will be unstable (shown in the video file). Instead, we look for an alternative lightweight solution to this problem.

Our solution is based on the observation that the frame selected to initialize the canonical model immensely affects the reconstruction result. As shown in the second row of Fig. 3(c), a well-selected frame yields a good result, while improperly selected frames usually result in low-quality reconstructions (the second row of Fig. 3(a, b)). That is, the key of our solution to dealing with large movement is to scan the whole sequence and identify the most fine-grained frame to construct the canonical model.

As shown in Fig. 2, we take the recorded depth sequence as input and perform canonical frame identification, which automatically selects the most fine-grained scene topology as the canonical model. Then we extract the polygonal mesh from the volume, which is further deformed to each depth map. The non-rigid deformation is constrained by robust inter-frame correspondences established from object contours and scene flows. After deformation, we fuse the depth map to the canonical model via a novel adaptive selection scheme, so as to reduce noise. As a result, we can generate a detailed canonical model and a deformed model sequence.

4. Method

In this section, we describe the components of our approach for reconstructing non-rigid objects with large movement. First, we introduce how to efficiently identify the most fine-grained scene topology as the canonical model (Fig. 4). Second, we deform the canonical model to each depth map, constrained by object contours and scene flows. Finally, we present a novel fusion strategy yielding compelling detailed models.

4.1. Canonical frame identification

To capture dynamic scenes, recent approaches often apply a deformable canonical model. In the recent DynamicFusion approach (Newcombe et al., 2015), such a model is simply initialized using the first input frame. In this paper, we follow the usage of a canonical model, but differ from previous approaches by adaptively initializing it with the frame on which the scene parts are most separated. Intuitively, a canonical model initialized in this manner has the most fine-grained topology. To identify such a frame, we compute a "separability" score at frame level:
E = \sum_{u, w \in D_k} \| d_u^k - d_w^k \|_2^2 + \lambda_c \cdot \sum_{u, w \in C_k} \| d_u^k - d_w^k \|_2^2 + \lambda_f \cdot \sum_{u \in D_k} \| f_{k-1}(d_u^k) + f_{k+1}(d_u^k) \|_2^2,    (1)

where

d_u^k = K^{-1} D_k(u) [u^T, 1]^T.    (2)

Here, D_k \subset \mathbb{R}^2 is the depth image domain on the kth frame and C_k \subset \mathbb{R}^2 is the set of contour pixels extracted by Zhou and Koltun (2015) on the depth map. We back-project each depth pixel u to acquire the 3D point d_u^k, where K indicates the depth camera intrinsic matrix and D_k(u) is the corresponding depth value. The 'separability' is thus modeled by the mutual distances of the points in these sets. In the third term, f denotes the displacement vector of the point d_u^k at the kth frame to the previous (next) frame. This displacement field is obtained via the scene flow algorithm (Jaimez et al., 2015) computed on the depth maps. This term models sudden topological change, where the moving patterns of pixels differ greatly. Empirically, the non-negative weights \lambda_c and \lambda_f are set to 10 and 100, respectively.
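To make the scoring concrete, the following minimal sketch (in Python/NumPy, not the authors' implementation) evaluates Eq. (1) for a single frame. The back-projection helper, the subsampling of the quadratic pairwise terms, and the final argmax selection rule are our own illustrative assumptions based on the description above.

```python
# Minimal sketch of the "separability" score of Eq. (1), under the stated assumptions.
import numpy as np

def backproject(depth, K):
    """Back-project a depth map (H x W, meters) to 3D points via Eq. (2)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T   # 3 x (H*W)
    pts = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)               # 3 x (H*W)
    return pts.T                                                        # (H*W) x 3

def pairwise_sq_dist_sum(points, max_samples=1000, rng=np.random.default_rng(0)):
    """Sum of squared mutual distances; subsampled to keep the O(N^2) term tractable."""
    if len(points) > max_samples:
        points = points[rng.choice(len(points), max_samples, replace=False)]
    return (((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)).sum()

def separability(depth, contour_mask, flow_prev, flow_next, K,
                 lambda_c=10.0, lambda_f=100.0):
    """Score of Eq. (1): pairwise point distances, contour distances, and flow term."""
    valid = depth.reshape(-1) > 0
    pts = backproject(depth, K)[valid]
    contour_pts = backproject(depth, K)[contour_mask.reshape(-1) & valid]
    # Third term: forward and backward displacements roughly cancel for smooth motion,
    # but become large under sudden topological change.
    flow_term = ((flow_prev + flow_next) ** 2).sum(-1).reshape(-1)[valid].sum()
    return (pairwise_sq_dist_sum(pts)
            + lambda_c * pairwise_sq_dist_sum(contour_pts)
            + lambda_f * flow_term)

# The canonical frame is the one whose scene parts are most separated, i.e. (in this
# reading of Section 4.1) the frame maximizing the score over the recorded sequence:
# k_star = int(np.argmax([separability(*frame_data[k]) for k in range(num_frames)]))
```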
4.2. Model-to-frame deformation

After identifying the canonical frame, we integrate the corresponding depth map into a volume represented by a truncated signed distance function (TSDF) (Curless and Levoy, 1996), which serves as the canonical model. In our implementation, we set the resolution of the TSDF volume to 640³ voxels at 384 voxels per meter. Thus, we can capture scenes up to about 1.67 m along each axis, and each voxel spans about 2.6 mm in the real world. Then we compute the deformation field that warps the model so as to match the input depth on each frame. This field is further used to guide the fusion process.
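For concreteness, the volume parameters above translate into the following small sketch (our own, in Python; the volume origin and the helper name are assumptions), which derives the voxel size and maps a canonical-space point to its voxel index:

```python
# Sketch of the TSDF volume configuration described above: 640^3 voxels at 384 voxels/m.
import numpy as np

VOXELS_PER_AXIS = 640
VOXELS_PER_METER = 384.0
VOXEL_SIZE = 1.0 / VOXELS_PER_METER            # ~2.6 mm per voxel edge
VOLUME_EXTENT = VOXELS_PER_AXIS * VOXEL_SIZE   # ~1.67 m per axis

def point_to_voxel(p, volume_origin=np.zeros(3)):
    """Map a 3D point (meters, canonical coordinates) to an integer voxel index.
    The volume origin is an assumed parameter; it is not specified in the text."""
    idx = np.floor((p - volume_origin) / VOXEL_SIZE).astype(int)
    if np.any(idx < 0) or np.any(idx >= VOXELS_PER_AXIS):
        return None  # point falls outside the captured volume
    return tuple(idx)
```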
Fig. 5. A smaller square slides under a larger one, which produces a new region (red). In this region, the forward flow f_k^{k-1} would yield a large residual.
4.2.1. Deformation field

We follow DynamicFusion (Newcombe et al., 2015) to construct the deformation field W_k with a hierarchical deformation graph, which not only effectively aligns the non-rigid model surface to each depth map but can also be applied to deform the volume. Specifically, at frame k, we consider the parameters W_k = {g_i, σ_i, T_i}, where i denotes the index of a control node in the deformation graph. Each node has a position g_i ∈ ℝ³ in the canonical model, and σ_i is a radius parameter that controls the extent to which the ith node influences a voxel x, through the radial weight w_i(x) = exp(−‖g_i − x‖²₂ / (2σ_i²)). Each node is associated with a 6-DoF transformation T_i ∈ SE(3); specifically, T_i consists of a 3D rotation R_i ∈ SO(3) and a translation t_i ∈ ℝ³. We follow the routines proposed in Newcombe et al. (2015) to generate and update the deformation graph.
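To make the node weighting concrete, the following minimal sketch (our own simplification, not the authors' code) evaluates w_i(x) and warps a canonical point by a normalized blend of the per-node rigid transforms. Note that DynamicFusion itself blends transforms with dual quaternions; the plain linear blending here is only an illustrative assumption.

```python
# Sketch of node influence weights and a simple (linear) blend of node transforms.
import numpy as np

def node_weight(x, g_i, sigma_i):
    """w_i(x) = exp(-||g_i - x||^2 / (2 sigma_i^2))."""
    return np.exp(-np.sum((g_i - x) ** 2) / (2.0 * sigma_i ** 2))

def warp_point(x, nodes):
    """Warp canonical point x by the deformation field.
    `nodes` is a list of dicts with keys 'g' (3,), 'sigma', 'R' (3x3), 't' (3,)."""
    weights = np.array([node_weight(x, n['g'], n['sigma']) for n in nodes])
    if weights.sum() < 1e-8:
        return x.copy()                      # far from every node: leave unchanged
    weights = weights / weights.sum()
    warped = np.zeros(3)
    for w, n in zip(weights, nodes):
        warped += w * (n['R'] @ x + n['t'])  # blend the rigidly transformed point
    return warped
```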
4.2.2. Energy function

After defining the deformation graph, the key step is to non-rigidly deform the canonical model to the current depth map. To estimate the parameters of the deformation field W_k, we formulate an energy function as follows:

E(W_k) = E_{data} + \omega_r E_{reg} + \omega_c E_{contour} + \omega_f E_{flow}.    (3)

The data term E_{data} measures the dense distances between the canonical model and the closest data points in the depth map, while the regularization term E_{reg} promotes smooth deformations. These two terms are the same as in Newcombe et al. (2015). The contour constraint term E_{contour} preserves contour consistency during the reconstruction and is defined as

E_{contour} = \sum_{u \in C_{k-1}, w \in C_k} \psi_{contour}(\hat{v}_u - d_w).    (4)

Here, we render the deformed canonical model at time (k − 1) to obtain the initial visible vertex map and extract its contour C_{k−1}; \psi_{contour} is the robust Tukey penalty function, and the model vertex v_u is non-rigidly deformed by \hat{v}_u = W_k(v(u)) \cdot v(u) towards the closest contour point d_w on the kth depth map. The dense scene flow term E_{flow} forces each point to be matched to its position in the adjacent frame and is defined as

E_{flow} = \sum_{u \in D_{k-1}} \psi_{flow}(\hat{v}_u - d_{\tilde{u}}),    (5)

\tilde{u} = \pi\big(K (v_u + f_k^{k-1}(v_u))\big),    (6)
where D_{k−1} is the map raycast by rendering the deformed model at time (k − 1), f_k^{k−1}(v_u) indicates the scene flow from D_k to D_{k−1}, and \pi(\cdot) performs the perspective projection that maps a 3D point to the image plane.

4.2.3. Optimization

To minimize this energy function, which has the form of a sum of squared residuals, we use the Gauss–Newton algorithm. We define the vector x to represent the unknown parameters of W_k, so the energy can be rewritten as E(x) = \sum_i f_i(x)^2 = \mathbf{f}(x)^T \mathbf{f}(x). The Gauss–Newton algorithm linearizes the non-linear problem with a Taylor expansion about x: \mathbf{f}(x + \delta) = \mathbf{f}(x) + J\delta, where J indicates the Jacobian matrix of \mathbf{f}(x). Each Gauss–Newton iteration improves the parameter vector x_i as x_{i+1} = x_i − h with J^T J h = J^T \mathbf{f}. The main computational cost lies in constructing and factorizing J^T J = J_d^T J_d + \omega_r J_r^T J_r + \omega_c J_c^T J_c + \omega_f J_f^T J_f. We follow the DynamicFusion approach to build and solve the linear system on the GPU. After that, we can non-rigidly deform the canonical mesh to each depth map so as to generate the non-rigid sequential models.

4.3. Adaptive fusion

After calculating the deformation field, each depth map can be deformed to the canonical model and incrementally fused into one single 3D reconstruction using the volumetric truncated signed distance function (Curless and Levoy, 1996; Curless, 1997). Previous approaches usually implement this step by fusing the information provided by all frames. Due to the noise and errors introduced in registration and deformation, however, accumulating across the full video blurs the reconstruction results along the object boundaries. Here we propose a post-processing strategy that selects the "confident" depth maps for fusion, while discarding the unreliable ones.
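As a concrete illustration of this selection strategy (detailed in the next paragraph, where a depth map is taken into fusion once its forward scene-flow residual exceeds a threshold), the following sketch is our own simplification, not the authors' implementation; the residual map, the averaging over valid pixels, and the threshold value are assumptions.

```python
# Minimal sketch of adaptive frame selection for fusion, under the stated assumptions.
import numpy as np

def select_frames_for_fusion(num_frames, forward_flow_residual, valid_mask, tau):
    """Return indices of depth maps to fuse into the canonical model.
    `forward_flow_residual(k)` is assumed to return an H x W residual map of the
    forward scene flow f_k^{k-1}; `valid_mask(k)` marks valid depth pixels; `tau`
    is an assumed threshold."""
    selected = []
    for k in range(1, num_frames):
        residual = forward_flow_residual(k)
        mask = valid_mask(k)
        mean_residual = residual[mask].mean() if mask.any() else 0.0
        # A large forward-flow residual signals a newly emerged region (cf. Fig. 5),
        # so the corresponding depth map carries new geometry and is fused.
        if mean_residual > tau:
            selected.append(k)
    return selected
```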
Fig. 6. Visual comparisons between the proposed approach and state-of-the-art approaches (Newcombe et al., 2011; Newcombe et al., 2015).
This strategy is based on the observation that such accumulation errors often occur when a part of the object emerges or becomes occluded. Thus, the magnitude of the scene flow error can be regarded as a reliable estimator of frame quality. As shown in Fig. 5, a new region emerging in the current frame k is occluded in the previous frame. In this region, the forward flow f_k^{k−1} would yield a large residual. We take the current depth map into fusion once the residual of the forward flow is larger than a threshold.

5. Experiments

5.1. Experimental settings

We implemented our method on a 64-bit desktop machine with a 12-core 3.6 GHz Intel Xeon CPU, 64 GB of memory and an Nvidia TITAN X graphics card. We use a single depth camera (e.g., Microsoft Kinect v2.0) to capture the depth sequence. At each time step, a depth map is recorded at 512 × 424 resolution. To evaluate the proposed approach, we captured several scenes with different actors behaving in real-life scenarios: opening hands, opening arms, pushing a pillow, stretching hands, playing with plush toys, crossing hands, and so on.

5.2. Comparison with state-of-the-art methods

We first compare the proposed approach with two state-of-the-art reconstruction approaches: KinectFusion (Newcombe et al., 2011) and DynamicFusion (Newcombe et al., 2015). The results are summarized in Fig. 6. The first row of Fig. 6 shows that the KinectFusion approach reconstructs the static body well, but may fail to correctly handle the deformable parts, such as the pillow and the actor's hands. In contrast, the proposed approach successfully handles both cases. DynamicFusion achieves reasonable results in the first several frames, but fails in the rest due to the changing topology of the scene (i.e., the pillow and the actor touch each other at first and then separate). In this case, DynamicFusion simply merges the pillow with the actor since the first frame is chosen as the canonical model. In contrast, our approach automatically identifies the optimal frame to generate the canonical model and achieves much better results. More results generated by the proposed approach are shown in Fig. 14 and in the supplementary materials.

In addition, we compare the proposed approach with the BodyFusion approach (Yu et al., 2017), which takes a skeleton prior into the reconstruction and can generate plausible results. Its fusion procedure applies the 'Key Volume' technique explored in the work of Dou et al. (2016). If the misalignments between the model and the data are drastic, they refresh the misaligned voxels using the corresponding voxels from the data volume and deform the canonical volume to the current frame so as to generate a 'Key Volume'. However, since there then exist several canonical models, the motion field is not
Fig. 7. Visual comparisons between the proposed approach and the BodyFusion approach (Yu et al., 2017). (a) The BodyFusion approach without the 'Key Volume' strategy. (b) The BodyFusion approach with the 'Key Volume' technique. (c) The reconstruction results of our approach.
Fig. 8. Evaluating the reconstruction quality of the proposed approach initialized by different canonical frames. For each column, the canonical frame is shown on the left while the target frame is on the right.
consistent and the results will be unstable (shown in the video file). We compare our approach with BodyFusion and demonstrate the results in Fig. 7. The first row of Fig. 7 shows that tracking and reconstructing the topology-changing object fails without the 'Key Volume' technique. With it, the BodyFusion approach can track the body motion and update the geometry model. However, it does not properly handle the topological change and produces artifacts such as over-smoothed hands and a stretched wrist. In this case, our method identifies the most fine-grained topological structure and generates a detailed model sequence.

5.3. Performance analysis

Impact of the canonical frame. A basic assumption of the proposed approach is that selecting a good canonical frame is important to achieve high-quality results. We examine this assumption in Fig. 8. Fig. 8(a, b) show reconstruction results using two randomly selected canonical frames. It can be seen that the human body cannot be easily reconstructed when it is not in an open pose. Our approach identifies the optimal pose, which is important to make the reconstruction successful; see Fig. 8(c).

Impact of constrained non-rigid deformation. A core aspect of the proposed approach is the use of contour and scene flow cues to guide the non-rigid deformation. To illustrate the impact of these cues, we conduct an ablation study, shown in Fig. 9. We note that constraining the objective with object contour and scene flow cues effectively removes the noise on the object boundaries and corrects the artifacts on the actor's arm caused by error accumulation over time. We further evaluate how the proposed approach improves over the previous state-of-the-art approach (Newcombe et al., 2015) and investigate a challenging scenario, shown in Fig. 10. In this example, the actor kicks his leg suddenly, which
Fig. 9. The results of model-to-frame non-rigid deformation with and without contour and flow constraints.
Fig. 10. Visual comparisons between the proposed approach and DynamicFusion (Newcombe et al., 2015) in case of sudden object motion.
Fig. 11. The reconstruction results in case of significant appearance change.
is not successfully handled by Newcombe et al. (2015). Incorporating the proposed cues mitigates this negative effect dramatically. In addition, contours can be view dependent, especially when the target being scanned moves quickly. In this case, the contour cue can even harm the deformation due to incorrect correspondences. However, the scene flow cue provides frame-to-frame motion constraints that still yield correctly deformed models (shown in Fig. 11).

Impact of adaptive fusion. The long-term accumulated errors can be further reduced by the proposed adaptive fusion scheme. Recently, Fusion4D (Dou et al., 2016) and BodyFusion (Yu et al., 2017) demonstrated impressive results on dynamic reconstruction, so we compare our method with the 'Key Volume' strategy applied in their approaches (shown in Fig. 12). From these comparisons, we see that the 'Key Volume' technique can fix small-scale tracking failures by refreshing the
Fig. 12. Visual comparisons between our approach and the 'Key Volume' technique applied in BodyFusion (Yu et al., 2017).

Table 1
Running time of the proposed approach with different sampling strategies.

Sampling radius    0.020 m    0.030 m    0.040 m    0.050 m    0.060 m
Node number        2169       832        388        224        134
Time (second)      0.470      0.135      0.052      0.041      0.035
Fig. 13. The deformation graphs and corresponding error maps generated by the proposed approach with different sampling radii of nodes.
misaligned voxels using the corresponding depth data, but the reconstruction results are blurred. In contrast, our method identifies the most fine-grained topology as the canonical model and adaptively selects depth maps for fusion. As a result, the generated models are of high fidelity.

5.4. Computation

With the current GPU implementation, processing a video with 288 frames (Fig. 9) takes around 0.5 minutes: 10 seconds for canonical frame selection, 16 seconds for deformation, and 4 seconds for adaptive fusion. Our approach is thus much faster than existing offline approaches (e.g., Dou et al., 2015 take around 400 minutes to process the same video). Furthermore, we find that the sampling radius of the nodes significantly influences the overall running time of the system. In Table 1 and Fig. 13, we report the time cost and reconstruction quality with respect to the sampling strategy, respectively. We find that a smaller sampling radius improves the reconstruction quality but introduces more time cost. Interestingly, the proposed system is not very sensitive to the number of sampled nodes. In practice, we find that setting the sampling radius to 0.04 m is a good compromise between efficiency and accuracy.

6. Conclusion

In this paper, we present a novel approach to achieve non-rigid reconstruction with large movement. In contrast to previous methods, we identify the most fine-grained scene topology as the canonical model, then perform "model-to-frame"
Fig. 14. Reconstruction results of the proposed approach on more sequences. Gold color indicates the canonical model.
deformation and adaptive fusion. Comparisons on several challenging real-world examples suggest that the proposed approach achieves smooth results with less noise. Furthermore, we are interested in developing efficient algorithms that jointly reconstruct the sequence as a whole, instead of using a single-frame canonical model. In the future, we will address this problem by incorporating the whole sequence to reconstruct the precise topological structure rather than identifying a single canonical frame.

Limitations. While our approach can handle many scenes with large movement during non-rigid reconstruction, there are still several limitations, which point out directions for future study. Firstly, it is not a real-time system compared with DynamicFusion (Newcombe et al., 2015), since we have to record the depth sequence for bidirectional processing. However, the proposed approach performs better than Newcombe et al. (2015) with reasonable time cost (0.1 s/frame), and runs much faster than existing offline approaches (i.e., 0.1 s/frame vs. 60 s/frame (Dou et al., 2015)). This time cost can still be greatly reduced, since we adopt a rather simple implementation of the non-rigid deformation solver that is not well optimized on the GPU. Secondly, if the sequence is long, then one canonical frame may not be sufficient. Lastly, it is somewhat limiting to assume that there always exists a proper canonical frame across the video, especially for extremely complex motions. This is a common problem suffered by most existing approaches that still requires further investigation.

Acknowledgements

We would like to thank the anonymous reviewers for their valuable comments and suggestions. We are also grateful to Tao Yu for running BodyFusion (Yu et al., 2017) on our data and for helpful discussions. This work was supported by the National Natural Science Foundation of China (Grant Nos. 61502023 & U1736217).

Appendix A. Supplementary material

Supplementary material related to this article can be found online at https://doi.org/10.1016/j.cagd.2018.06.002.

References

Cao, C., Weng, Y., Lin, S., Zhou, K., 2013. 3D shape regression for real-time facial animation. ACM Trans. Graph. (TOG) 32 (4), 41.
Chen, J., Bautembach, D., Izadi, S., 2013. Scalable real-time volumetric surface reconstruction. ACM Trans. Graph. (TOG) 32 (4), 113.
Curless, B., Levoy, M., 1996. A volumetric method for building complex models from range images. In: Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques. ACM, pp. 303–312.
Curless, B.L., 1997. New Methods for Surface Reconstruction from Range Images. PhD thesis. Stanford University.
Dou, M., Fuchs, H., Frahm, J.-M., 2013. Scanning and tracking dynamic objects with commodity depth cameras. In: 2013 IEEE International Symposium on Mixed and Augmented Reality. ISMAR. IEEE, pp. 99–106.
Dou, M., Khamis, S., Degtyarev, Y., Davidson, P., Fanello, S.R., Kowdle, A., Escolano, S.O., Rhemann, C., Kim, D., Taylor, J., et al., 2016. Fusion4D: real-time performance capture of challenging scenes. ACM Trans. Graph. (TOG) 35 (4), 114.
Dou, M., Taylor, J., Fuchs, H., Fitzgibbon, A., Izadi, S., 2015. 3D scanning deformable objects with a single RGBD sensor. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition. CVPR. IEEE, pp. 493–501.
Guo, K., Xu, F., Yu, T., Liu, X., Dai, Q., Liu, Y., 2017. Real-time geometry, albedo, and motion reconstruction using a single RGB-D camera. ACM Trans. Graph. (TOG) 36 (3), 32.
Innmann, M., Zollhöfer, M., Nießner, M., Theobalt, C., Stamminger, M., 2016. VolumeDeform: real-time volumetric non-rigid reconstruction. In: European Conference on Computer Vision. Springer, pp. 362–379.
Izadi, S., Kim, D., Hilliges, O., Molyneaux, D., Newcombe, R., Kohli, P., Shotton, J., Hodges, S., Freeman, D., Davison, A., et al., 2011. KinectFusion: real-time 3D reconstruction and interaction using a moving depth camera. In: Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology. ACM, pp. 559–568.
Jaimez, M., Souiai, M., Gonzalez-Jimenez, J., Cremers, D., 2015. A primal-dual framework for real-time dense RGB-D scene flow. In: 2015 IEEE International Conference on Robotics and Automation. ICRA. IEEE, pp. 98–104.
Kahler, O., Prisacariu, V.A., Ren, C.Y., Sun, X., Torr, P., Murray, D., 2015. Very high frame rate volumetric integration of depth images on mobile devices. IEEE Trans. Vis. Comput. Graph. 21 (11), 1241–1250.
Li, H., Adams, B., Guibas, L.J., Pauly, M., 2009. Robust single-view geometry and motion reconstruction. ACM Trans. Graph., vol. 28. ACM, p. 175.
Li, H., Vouga, E., Gudym, A., Luo, L., Barron, J.T., Gusev, G., 2013. 3D self-portraits. ACM Trans. Graph. (TOG) 32 (6), 187.
Liao, M., Zhang, Q., Wang, H., Yang, R., Gong, M., 2009. Modeling deformable objects from a single depth camera. In: 2009 IEEE 12th International Conference on Computer Vision. IEEE, pp. 167–174.
Lin, H., Gao, J., Zhou, Y., Lu, G., Ye, M., Zhang, C., Liu, L., Yang, R., 2013. Semantic decomposition and reconstruction of residential scenes from Lidar data. ACM Trans. Graph. (TOG) 32 (4), 66.
Newcombe, R.A., Fox, D., Seitz, S.M., 2015. DynamicFusion: reconstruction and tracking of non-rigid scenes in real-time. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 343–352.
Newcombe, R.A., Izadi, S., Hilliges, O., Molyneaux, D., Kim, D., Davison, A.J., Kohli, P., Shotton, J., Hodges, S., Fitzgibbon, A., 2011. KinectFusion: real-time dense surface mapping and tracking. In: 2011 10th IEEE International Symposium on Mixed and Augmented Reality. ISMAR. IEEE, pp. 127–136.
Nießner, M., Zollhöfer, M., Izadi, S., Stamminger, M., 2013. Real-time 3D reconstruction at scale using voxel hashing. ACM Trans. Graph. (TOG) 32 (6), 169.
Oikonomidis, I., Kyriazis, N., Argyros, A.A., 2011. Efficient model-based 3D tracking of hand articulations using Kinect. In: BMVC, vol. 1, p. 3.
Roth, H., Vona, M., 2012. Moving volume KinectFusion. In: BMVC, pp. 1–11.
Shao, T., Xu, W., Zhou, K., Wang, J., Li, D., Guo, B., 2012. An interactive approach to semantic modeling of indoor scenes with an RGBD camera. ACM Trans. Graph. (TOG) 31 (6), 136.
Slavcheva, M., Baust, M., Cremers, D., Ilic, S., 2017. KillingFusion: non-rigid 3D reconstruction without correspondences. In: IEEE Conference on Computer Vision and Pattern Recognition. CVPR, vol. 3, p. 7.
Steinbrucker, F., Kerl, C., Cremers, D., 2013. Large-scale multi-resolution surface reconstruction from RGB-D sequences. In: The IEEE International Conference on Computer Vision. ICCV.
Sumner, R.W., Schmid, J., Pauly, M., 2007. Embedded deformation for shape manipulation. ACM Trans. Graph. (TOG), vol. 26. ACM, p. 80.
Taylor, J., Shotton, J., Sharp, T., Fitzgibbon, A., 2012. The Vitruvian manifold: inferring dense correspondences for one-shot human pose estimation. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition. CVPR. IEEE, pp. 103–110.
Tong, J., Zhou, J., Liu, L., Pan, Z., Yan, H., 2012. Scanning 3D full human bodies using Kinects. IEEE Trans. Vis. Comput. Graph. 18 (4), 643–650.
Wang, K., Zhang, G., Xia, S., 2017. Templateless non-rigid reconstruction and motion tracking with a single RGB-D camera. IEEE Trans. Image Process. 26 (12), 5966–5979.
Wang, R., Wei, L., Vouga, E., Huang, Q., Ceylan, D., Medioni, G., Li, H., 2016. Capturing dynamic textured surfaces of moving targets. In: European Conference on Computer Vision. Springer, pp. 271–288.
Weiss, A., Hirshberg, D., Black, M.J., 2011. Home 3D body scans from noisy image and range data. In: International Conference on Computer Vision. IEEE, pp. 1951–1958.
Whelan, T., Kaess, M., Fallon, M., Johannsson, H., Leonard, J., McDonald, J., 2012. Kintinuous: spatially extended KinectFusion.
Xu, W., Zhou, K., Yu, Y., Tan, Q., Peng, Q., Guo, B., 2007. Gradient domain editing of deforming mesh sequences. ACM Trans. Graph. (TOG) 26 (3), 84.
Yang, J., Li, K., Li, K., Lai, Y.-K., 2015. Sparse non-rigid registration of 3D shapes. Computer Graphics Forum, vol. 34. Wiley Online Library, pp. 89–99.
Yang, Y., Xu, W., Guo, X., Zhou, K., Guo, B., 2013. Boundary-aware multidomain subspace deformation. IEEE Trans. Vis. Comput. Graph. 19 (10), 1633–1645.
Yu, T., Guo, K., Xu, F., Dong, Y., Su, Z., Zhao, J., Li, J., Dai, Q., Liu, Y., 2017. BodyFusion: real-time capture of human motion and surface geometry using a single depth camera. In: The IEEE International Conference on Computer Vision. ICCV. ACM.
Zhang, R., Chen, X., Shiratori, T., Tong, X., Liu, L., 2015a. An efficient volumetric method for non-rigid registration. Graph. Models 79, 1–11.
Zhang, Y., Xu, W., Tong, Y., Zhou, K., 2015b. Online structure analysis for real-time indoor scene reconstruction. ACM Trans. Graph. (TOG) 34 (5), 159.
Zhou, K., Xu, W., Tong, Y., Desbrun, M., 2010. Deformation transfer to multi-component objects. Computer Graphics Forum, vol. 29. Wiley Online Library, pp. 319–325.
Zhou, Q.-Y., Koltun, V., 2015. Depth camera tracking with contour cues. In: Proceedings of Computer Vision and Pattern Recognition, pp. 632–638.
Zollhöfer, M., Nießner, M., Izadi, S., Rehmann, C., Zach, C., Fisher, M., Wu, C., Fitzgibbon, A., Loop, C., Theobalt, C., et al., 2014. Real-time non-rigid reconstruction using an RGB-D camera. ACM Trans. Graph. 33 (4), 156.