Computer Aided Geometric Design 64 (2018) 15–26
Reconstructing non-rigid object with large movement using a single depth camera ✩

Feixiang Lu *, Bin Zhou, Feng Lu, Yu Zhang, Xiaowu Chen, Qinping Zhao

State Key Laboratory of Virtual Reality Technology & Systems, Beihang University, China
Article history: Received 1 March 2018; Received in revised form 18 April 2018; Accepted 11 June 2018; Available online 25 June 2018.

Keywords: Non-rigid object reconstruction; Large movement; A single depth camera; Canonical frame identification.
Abstract

Non-rigid detailed 3D reconstruction of real-world scenes has witnessed great success in recent years. However, most existing methods take the first frame as the canonical model, and the topological structure of the input scene is kept fixed during the reconstruction process, an assumption that may not hold in practice for highly non-rigid scenes. Regarding this issue, this work proposes a novel approach to reconstruct non-rigid objects with large movement, which often results in changes of topological structure. We first introduce an adaptive strategy that can effectively identify the most fine-grained scene topology as the canonical model. This model is then deformed to each depth map, constrained by robust inter-frame correspondences established from object contours and scene flows. After deformation, we further fuse the depth map into the canonical model via a novel adaptive selection scheme, so as to remove spurious noise without smoothing model details. Experimental results show that the proposed approach can effectively handle various input scenes with large movement and generate models with high-fidelity details.

© 2018 Elsevier B.V. All rights reserved.
1. Introduction

3D reconstruction of real-world scenes from depth cameras is a widely studied problem in the fields of computer vision and computer graphics. After long-term efforts, the 3D model of a scene can now be accurately built by fusing its depth maps captured from multiple views, as long as the scene is static (e.g., KinectFusion (Newcombe et al., 2011; Izadi et al., 2011)). However, reconstructing non-rigid scenes with a single depth camera is still largely unsolved due to a number of challenges, such as non-rigid deformation, incomplete scans, and large movement, which might cause inconsistency in the topological structure of the scene.

In recent years, the challenges of handling non-rigid deformation and incomplete scans have been well studied and addressed by various previous works (Sumner et al., 2007; Xu et al., 2007; Li et al., 2009; Liao et al., 2009; Zhou et al., 2010; Oikonomidis et al., 2011; Taylor et al., 2012; Li et al., 2013; Yang et al., 2013; Zollhöfer et al., 2014; Dou et al., 2015; Zhang et al., 2015a; Yang et al., 2015; Dou et al., 2016). However, these methods rely on strong priors such as pre-designed templates, direct user manipulation, multiple depth sensors, or pre-learned statistical models. Moreover, some techniques need seconds to minutes to compute a single frame, which makes them impractical for reconstruction. Newcombe et al. (2015) proposed the first system for densely reconstructing general dynamic scenes, which can generate high-quality results from a single camera in real time. Although significant successes were achieved by these approaches, most of them do not
✩ This paper has been recommended for acceptance by Ligang Liu.
* Corresponding author. E-mail address: [email protected] (F. Lu).
Fig. 1. We present a new approach to reconstruct non-rigid objects with large movement. Our method only requires a single depth camera (e.g., Kinect v2.0) to capture the depth maps and generates plausible results rapidly (≈0.1 s/frame).
Fig. 2. An overview of the system pipeline. (For interpretation of the colors in the figure(s), the reader is referred to the web version of this article.)
explicitly consider the challenging problem of large movement (i.e., potential topological change) of the input scene, which frequently happens for non-rigid objects. As illustrated in Fig. 1, when the arms of the person stretch away from the body, the topology of the person becomes inconsistent. In this case, previous methods using a fixed topological structure cannot reconstruct the input scene consistently. Slavcheva et al. (2017) made an attempt at this problem with a level-set evolution approach. However, reconstructing without correspondences makes the results somewhat unstable in appearance and lose essential details.

To address this problem, this paper presents a novel approach to reconstructing non-rigid scenes with large movement from a single depth camera. As summarized in Fig. 2, the proposed approach takes the depth sequence captured by a Kinect v2.0 sensor as input, and incrementally fuses the depth maps to generate a canonical model that can best fit the scene on each frame under certain deformations. To this end, we propose a novel adaptive strategy that identifies the most fine-grained scene topology as the canonical model by analyzing the topological structure. Given the canonical model, we then deform it to each depth map, constrained by robust inter-frame correspondences established from object contours and scene flows. Finally, we fuse the depth maps onto the deformed canonical models through a novel scheme that adaptively selects the appropriate interval of frames for fusion, which generates high-quality reconstruction results without over-smoothing model details. Experimental results demonstrate that our approach can effectively handle various input scenes with topological structure changes due to large movement.

The contributions of this paper are summarized as follows: 1) we present a novel approach that identifies the canonical frame to reconstruct non-rigid scenes with large movement; 2) we efficiently deform the canonical model to fit each depth map using contour and scene flow cues; 3) we propose an adaptive fusion algorithm which can largely suppress noise during fusion while preserving model details.

2. Related work

There have been various previous works on 3D scene reconstruction based on consumer-level depth cameras. While a large group of them focused on static scenes (Newcombe et al., 2011; Izadi et al., 2011; Roth and Vona, 2012; Whelan et al., 2012;
Fig. 3. The reconstruction quality with different canonical frames. For each column, the canonical frame is shown on the left while the target frame is on the right.
Shao et al., 2012; Lin et al., 2013; Steinbrucker et al., 2013; Chen et al., 2013; Nießner et al., 2013; Kahler et al., 2015; Zhang et al., 2015b), this section mainly reviews recent advances on non-rigid scene reconstruction that are tightly correlated with our approach.

Multi-view non-rigid reconstruction. Depth maps from multiple views of the scene provide complementary visual information and thus facilitate the reconstruction process. For example, Tong et al. (2012) use three Kinect cameras to scan full 3D human bodies through global non-rigid registration. Dou et al. (2013) use eight Kinect cameras to reconstruct a complete 3D model, and then track the model to match later observations. Wang et al. (2016) present an end-to-end system for reconstructing complete, water-tight and textured models of moving subjects using three or four handheld sensors. Dou et al. (2016) proposed the Fusion4D system for live multi-view performance capture, generating temporally coherent high-quality reconstructions in real time from 24 cameras. Although multi-view reconstruction techniques can generate much more delicate models than a single camera, the capture setups are complex and not easy for novice users.

Prior-based non-rigid reconstruction. Several approaches employ prior knowledge of the scene to aid the reconstruction process. In this context, various technical improvements were made on reconstructing 3D human bodies, hands and faces (Weiss et al., 2011; Oikonomidis et al., 2011; Cao et al., 2013). However, these rely on strong priors such as pre-learned statistical models, articulated skeletons, or morphable shape models, while capturing non-rigid scenes of general categories still remains challenging.

Template-based non-rigid reconstruction. The template-based approach has recently proven effective for modeling general non-rigid scenes (Li et al., 2009). Typically, a 3D template model of the scene is acquired, which is then deformed to match the visual information of each frame. For example, Zollhöfer et al. (2014) first acquire a template of the scene using KinectFusion, and then non-rigidly deform the template to the captured sequences.

Templateless non-rigid reconstruction. Newcombe et al. (2015) present the first dense non-rigid reconstruction system, DynamicFusion, which fuses the depth maps to incrementally generate a "canonical model" and, at each time instant, deforms it to each frame in real time. Innmann et al. (2016) extract sparse color features to enable accurate tracking and effectively handle the drift problem compared with DynamicFusion. Guo et al. (2017) simultaneously fuse object geometry and surface albedo for a non-rigid scene in real time. Yu et al. (2017) take advantage of an internal articulated skeleton prior and propose a real-time skeleton-embedded surface fusion approach. Wang et al. (2017) propose an effective local-to-global hierarchical optimization framework to reconstruct and track non-rigid objects with an RGB-D camera. All of the above methods assume the topological structure of the scanned object to be fixed. Once the object movement is large, the structural connections may be broken, and the object appearance on several frames may not be accurately aligned by any possible deformation, leading to failed reconstructions.

3. Overview

We aim to reconstruct a non-rigid dynamic object in a real-world scene using a single depth camera, where the object movement is large and topologies may change significantly in the depth video.
For example, as shown in the first row of Fig. 3(a, b), the person's hands and head touch each other at first, and then gradually separate over the next few frames. This kind of large movement often happens in daily life, while most state-of-the-art 3D reconstruction methods, such as the DynamicFusion approach (Newcombe et al., 2015), fail in this situation. This is because these methods take the first frame as the canonical model and directly warp the first frame to fit the other frames.

A straightforward solution to this problem is to design a new canonical model updating algorithm, like the 'Key Volume' technique proposed in Fusion4D (Dou et al., 2016). If the misalignments are drastic, 'Key Volume' would refresh
Fig. 4. Canonical frame identification. (a) The input depth map. (b) The mutual distances of the point cloud: for each 3D point (red), we calculate the Euclidean distances to the other 3D points (blue). (c) The mutual distances of the object contour points: for each contour point, we calculate the Euclidean distances to the other contour points (blue). (d) The forward and (e) backward scene flows.
the canonical model using the current data. However, since there then exist several canonical models, the motion field is not consistent and the results will be unstable (shown in the video file). Instead, we look for an alternative lightweight solution to this problem.

Our solution is based on the observation that the frame selected to initialize the canonical model immensely affects the reconstruction result. As shown in the second row of Fig. 3(c), a well-selected frame yields a good result, while improperly selected frames usually result in low-quality reconstructions (the second row of Fig. 3(a, b)). That is, the key of our solution to dealing with large movement is to scan the whole sequence and identify the most fine-grained frame to construct the canonical model.

As shown in Fig. 2, we take the recorded depth sequence as input and perform canonical frame identification, which automatically selects the most fine-grained scene topology as the canonical model. Then we extract the polygonal mesh from the volume, which is further deformed to each depth map. The non-rigid deformation is constrained by robust inter-frame correspondences established from object contours and scene flows. After deformation, we fuse the depth map to the canonical model via a novel adaptive selection scheme, so as to reduce noise. As a result, we can generate a detailed canonical model and a deformed model sequence.

4. Method

In this section, we describe the components of our approach for reconstructing non-rigid objects with large movement. First, we introduce how to efficiently identify the most fine-grained scene topology as the canonical model (Fig. 4). Second, we deform the canonical model to each depth map, constrained by object contours and scene flows. Finally, we present a novel fusion strategy yielding compelling detailed models.

4.1. Canonical frame identification

To capture dynamic scenes, recent approaches often apply a deformable canonical model. In the recent DynamicFusion approach (Newcombe et al., 2015), such a model is simply initialized using the first input frame. In this paper, we follow the usage of a canonical model, but differ from previous approaches by adaptively initializing it with the frame on which the scene parts are most separated. Intuitively, a canonical model initialized in this manner has the most fine-grained topology. To identify such a frame, we compute a "separability" score at frame level:
E = \sum_{u, w \in D_k} \| d_u^k - d_w^k \|_2^2 + \lambda_c \cdot \sum_{u, w \in C_k} \| d_u^k - d_w^k \|_2^2 + \lambda_f \cdot \sum_{u \in D_k} \| f_{k-1}(d_u^k) + f_{k+1}(d_u^k) \|_2^2,    (1)

where

d_u^k = K^{-1} D_k(u) [u^T, 1]^T.    (2)

Here, D_k \subset \mathbb{R}^2 is the depth image domain on the kth frame and C_k \subset \mathbb{R}^2 is the set of contour pixels extracted by Zhou and Koltun (2015) on the depth map. We back-project each depth pixel u to acquire the 3D point d_u^k, where K indicates the depth camera intrinsic matrix and D_k(u) is the corresponding depth value. The 'separability' is thus modeled by the mutual distances of the points in these sets. In the third term, f denotes the displacement vector of the point d_u^k at the kth frame to the previous (next) frame. This displacement field is obtained via the scene flow algorithm (Jaimez et al., 2015) computed on the depth maps. This term models sudden topological change, where the moving patterns of pixels differ greatly. Empirically, the non-negative weights \lambda_c and \lambda_f are set to 10 and 100, respectively.
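To make the scoring concrete, the following minimal sketch (in Python/NumPy, not the authors' implementation) evaluates Eq. (1) for a single frame. The back-projection helper, the subsampling of the quadratic pairwise terms, and the final argmax selection rule are our own illustrative assumptions based on the description above.

```python
# Minimal sketch of the "separability" score of Eq. (1), under the stated assumptions.
import numpy as np

def backproject(depth, K):
    """Back-project a depth map (H x W, meters) to 3D points via Eq. (2)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T   # 3 x (H*W)
    pts = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)               # 3 x (H*W)
    return pts.T                                                        # (H*W) x 3

def pairwise_sq_dist_sum(points, max_samples=1000, rng=np.random.default_rng(0)):
    """Sum of squared mutual distances; subsampled to keep the O(N^2) term tractable."""
    if len(points) > max_samples:
        points = points[rng.choice(len(points), max_samples, replace=False)]
    return (((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)).sum()

def separability(depth, contour_mask, flow_prev, flow_next, K,
                 lambda_c=10.0, lambda_f=100.0):
    """Score of Eq. (1): pairwise point distances, contour distances, and flow term."""
    valid = depth.reshape(-1) > 0
    pts = backproject(depth, K)[valid]
    contour_pts = backproject(depth, K)[contour_mask.reshape(-1) & valid]
    # Third term: forward and backward displacements roughly cancel for smooth motion,
    # but become large under sudden topological change.
    flow_term = ((flow_prev + flow_next) ** 2).sum(-1).reshape(-1)[valid].sum()
    return (pairwise_sq_dist_sum(pts)
            + lambda_c * pairwise_sq_dist_sum(contour_pts)
            + lambda_f * flow_term)

# The canonical frame is the one whose scene parts are most separated, i.e. (in this
# reading of Section 4.1) the frame maximizing the score over the recorded sequence:
# k_star = int(np.argmax([separability(*frame_data[k]) for k in range(num_frames)]))
```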
4.2. Model-to-frame deformation

After identifying the canonical frame, we integrate the corresponding depth map into a volume represented by a truncated signed distance function (TSDF) (Curless and Levoy, 1996), which serves as the canonical model. In our implementation, we set the resolution of the TSDF volume to 640³ voxels at 384 voxels per meter. Thus, we can capture scenes up to about 1.67 m along each axis, and each voxel spans about 2.6 mm in the real world. Then we compute the deformation field that warps the model so as to match the input depth on each frame. This field is further used to guide the fusion process.
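For concreteness, the volume parameters above translate into the following small sketch (our own, in Python; the volume origin and the helper name are assumptions), which derives the voxel size and maps a canonical-space point to its voxel index:

```python
# Sketch of the TSDF volume configuration described above: 640^3 voxels at 384 voxels/m.
import numpy as np

VOXELS_PER_AXIS = 640
VOXELS_PER_METER = 384.0
VOXEL_SIZE = 1.0 / VOXELS_PER_METER            # ~2.6 mm per voxel edge
VOLUME_EXTENT = VOXELS_PER_AXIS * VOXEL_SIZE   # ~1.67 m per axis

def point_to_voxel(p, volume_origin=np.zeros(3)):
    """Map a 3D point (meters, canonical coordinates) to an integer voxel index.
    The volume origin is an assumed parameter; it is not specified in the text."""
    idx = np.floor((p - volume_origin) / VOXEL_SIZE).astype(int)
    if np.any(idx < 0) or np.any(idx >= VOXELS_PER_AXIS):
        return None  # point falls outside the captured volume
    return tuple(idx)
```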
Fig. 5. A smaller square slides under a larger one, which produces a new region (red). In this region, the forward flow f_k^{k-1} would yield a large residual.
4.2.1. Deformation field

We follow DynamicFusion (Newcombe et al., 2015) to construct the deformation field W_k with a hierarchical deformation graph, which not only effectively aligns the non-rigid model surface to each depth map but can also be applied to deform the volume. Specifically, at frame k, we consider the parameters W_k = {g_i, σ_i, T_i}, where i denotes the index of a control node in the deformation graph. Each node has a position g_i ∈ ℝ³ in the canonical model, and σ_i is a radius parameter that controls the extent to which the ith node influences a voxel x, through the radial weight w_i(x) = exp(−‖g_i − x‖²₂ / (2σ_i²)). Each node is associated with a 6-DoF transformation T_i ∈ SE(3); specifically, T_i consists of a 3D rotation R_i ∈ SO(3) and a translation t_i ∈ ℝ³. We follow the routines proposed in Newcombe et al. (2015) to generate and update the deformation graph.
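To make the node weighting concrete, the following minimal sketch (our own simplification, not the authors' code) evaluates w_i(x) and warps a canonical point by a normalized blend of the per-node rigid transforms. Note that DynamicFusion itself blends transforms with dual quaternions; the plain linear blending here is only an illustrative assumption.

```python
# Sketch of node influence weights and a simple (linear) blend of node transforms.
import numpy as np

def node_weight(x, g_i, sigma_i):
    """w_i(x) = exp(-||g_i - x||^2 / (2 sigma_i^2))."""
    return np.exp(-np.sum((g_i - x) ** 2) / (2.0 * sigma_i ** 2))

def warp_point(x, nodes):
    """Warp canonical point x by the deformation field.
    `nodes` is a list of dicts with keys 'g' (3,), 'sigma', 'R' (3x3), 't' (3,)."""
    weights = np.array([node_weight(x, n['g'], n['sigma']) for n in nodes])
    if weights.sum() < 1e-8:
        return x.copy()                      # far from every node: leave unchanged
    weights = weights / weights.sum()
    warped = np.zeros(3)
    for w, n in zip(weights, nodes):
        warped += w * (n['R'] @ x + n['t'])  # blend the rigidly transformed point
    return warped
```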
4.2.2. Energy function

After defining the deformation graph, the key step is to non-rigidly deform the canonical model to the current depth map. To estimate the parameters of the deformation field W_k, we formulate an energy function as follows:

E(W_k) = E_{data} + \omega_r E_{reg} + \omega_c E_{contour} + \omega_f E_{flow}.    (3)

The data term E_{data} measures the dense distances between the canonical model and the closest data points in the depth map, while the regularization term E_{reg} promotes smooth deformations. These two terms are the same as in Newcombe et al. (2015). The contour constraint term E_{contour} preserves contour consistency during the reconstruction and is defined as

E_{contour} = \sum_{u \in C_{k-1}, w \in C_k} \psi_{contour}(\hat{v}_u - d_w).    (4)

Here, we render the deformed canonical model at time (k − 1) to obtain the initial visible vertex map and extract its contour C_{k−1}; \psi_{contour} is the robust Tukey penalty function, and the model vertex v_u is non-rigidly deformed by \hat{v}_u = W_k(v(u)) \cdot v(u) towards the closest contour point d_w on the kth depth map. The dense scene flow term E_{flow} forces each point to be matched to its position in the adjacent frame and is defined as

E_{flow} = \sum_{u \in D_{k-1}} \psi_{flow}(\hat{v}_u - d_{\tilde{u}}),    (5)

\tilde{u} = \pi\big(K (v_u + f_k^{k-1}(v_u))\big),    (6)
where D_{k−1} is the map raycast by rendering the deformed model at time (k − 1), f_k^{k−1}(v_u) indicates the scene flow from D_k to D_{k−1}, and \pi(\cdot) performs the perspective projection that maps a 3D point to the image plane.

4.2.3. Optimization

To minimize this energy function, which has the form of a sum of squared residuals, we use the Gauss–Newton algorithm. We define the vector x to represent the unknown parameters of W_k, so the energy can be rewritten as E(x) = \sum_i f_i(x)^2 = \mathbf{f}(x)^T \mathbf{f}(x). The Gauss–Newton algorithm linearizes the non-linear problem with a Taylor expansion about x: \mathbf{f}(x + \delta) = \mathbf{f}(x) + J\delta, where J indicates the Jacobian matrix of \mathbf{f}(x). Each Gauss–Newton iteration improves the parameter vector x_i as x_{i+1} = x_i − h with J^T J h = J^T \mathbf{f}. The main computational cost lies in constructing and factorizing J^T J = J_d^T J_d + \omega_r J_r^T J_r + \omega_c J_c^T J_c + \omega_f J_f^T J_f. We follow the DynamicFusion approach to build and solve the linear system on the GPU. After that, we can non-rigidly deform the canonical mesh to each depth map so as to generate the non-rigid sequential models.

4.3. Adaptive fusion

After calculating the deformation field, each depth map can be deformed to the canonical model and incrementally fused into one single 3D reconstruction using the volumetric truncated signed distance function (Curless and Levoy, 1996; Curless, 1997). Previous approaches usually implement this step by fusing the information provided by all frames. Due to the noise and errors introduced in registration and deformation, however, accumulating across the full video blurs the reconstruction results along the object boundaries. Here we propose a post-processing strategy that selects the "confident" depth maps for fusion, while discarding the unreliable ones.
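As a concrete illustration of this selection strategy (detailed in the next paragraph, where a depth map is taken into fusion once its forward scene-flow residual exceeds a threshold), the following sketch is our own simplification, not the authors' implementation; the residual map, the averaging over valid pixels, and the threshold value are assumptions.

```python
# Minimal sketch of adaptive frame selection for fusion, under the stated assumptions.
import numpy as np

def select_frames_for_fusion(num_frames, forward_flow_residual, valid_mask, tau):
    """Return indices of depth maps to fuse into the canonical model.
    `forward_flow_residual(k)` is assumed to return an H x W residual map of the
    forward scene flow f_k^{k-1}; `valid_mask(k)` marks valid depth pixels; `tau`
    is an assumed threshold."""
    selected = []
    for k in range(1, num_frames):
        residual = forward_flow_residual(k)
        mask = valid_mask(k)
        mean_residual = residual[mask].mean() if mask.any() else 0.0
        # A large forward-flow residual signals a newly emerged region (cf. Fig. 5),
        # so the corresponding depth map carries new geometry and is fused.
        if mean_residual > tau:
            selected.append(k)
    return selected
```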
Fig. 6. Visual comparisons between the proposed approach and state-of-the-art approaches (Newcombe et al., 2011; Newcombe et al., 2015).
This strategy is based on the observation that such accumulation errors often occur when a part of the object emerges or becomes occluded. Thus, the magnitude of the scene flow error can be regarded as a reliable estimator of frame quality. As shown in Fig. 5, a new region emerging in the current frame k is occluded in the previous frame. In this region, the forward flow f_k^{k−1} would yield a large residual. We take the current depth map into fusion once the residual of the forward flow is larger than a threshold.

5. Experiments

5.1. Experimental settings

We implemented our method on a 64-bit desktop machine with a 12-core 3.6 GHz Intel Xeon CPU, 64 GB of memory and an Nvidia TITAN X graphics card. We use a single depth camera (e.g., Microsoft Kinect v2.0) to capture the depth sequence. At each time step, a depth map is recorded at 512 × 424 resolution. To evaluate the proposed approach, we captured several scenes with different actors behaving in real-life scenarios: opening hands, opening arms, pushing a pillow, stretching hands, playing with plush toys, crossing hands, and so on.

5.2. Comparison with state-of-the-art methods

We first compare the proposed approach with two state-of-the-art reconstruction approaches: KinectFusion (Newcombe et al., 2011) and DynamicFusion (Newcombe et al., 2015). The results are summarized in Fig. 6. The first row of Fig. 6 shows that the KinectFusion approach reconstructs the static body well, but may fail to correctly handle the deformable parts, such as the pillow and the actor's hands. In contrast, the proposed approach successfully handles both cases. DynamicFusion achieves reasonable results in the first several frames, but fails in the rest due to the changing topology of the scene (i.e., the pillow and the actor touch each other at first and then separate). In this case, DynamicFusion simply merges the pillow with the actor since the first frame is chosen as the canonical model. In contrast, our approach automatically identifies the optimal frame to generate the canonical model and achieves much better results. More results generated by the proposed approach are shown in Fig. 14 and in the supplementary materials.

In addition, we compare the proposed approach with the BodyFusion approach (Yu et al., 2017), which takes a skeleton prior into the reconstruction and can generate plausible results. Its fusion procedure applies the 'Key Volume' technique explored in the work of Dou et al. (2016). If the misalignments between the model and the data are drastic, they refresh the misaligned voxels using the corresponding voxels from the data volume and deform the canonical volume to the current frame so as to generate a 'Key Volume'. However, since there then exist several canonical models, the motion field is not
Fig. 7. Visual comparisons between the proposed approach and the BodyFusion approach (Yu et al., 2017). (a) The BodyFusion approach without the 'Key Volume' strategy. (b) The BodyFusion approach with the 'Key Volume' technique. (c) The reconstruction results of our approach.
Fig. 8. Evaluating the reconstruction quality of the proposed approach initialized by different canonical frames. For each column, the canonical frame is shown on the left while the target frame is on the right.
consistent and the results will be unstable (shown in the video file). We compare our approach with BodyFusion and demonstrate the results in Fig. 7. The first row of Fig. 7 shows that tracking and reconstructing the topology-changing object fails without the 'Key Volume' technique. With it, the BodyFusion approach can track the body motion and update the geometry model. However, it does not properly handle the topological change and produces artifacts such as over-smoothed hands and a stretched wrist. In this case, our method identifies the most fine-grained topological structure and generates a detailed model sequence.

5.3. Performance analysis

Impact of the canonical frame. A basic assumption of the proposed approach is that selecting a good canonical frame is important to achieve high-quality results. We examine this assumption in Fig. 8. Fig. 8(a, b) show reconstruction results using two randomly selected canonical frames. It can be seen that the human body cannot be easily reconstructed when it is not in an open pose. Our approach identifies the optimal pose, which is important to make the reconstruction successful; see Fig. 8(c).

Impact of constrained non-rigid deformation. A core aspect of the proposed approach is the use of contour and scene flow cues to guide the non-rigid deformation. To illustrate the impact of these cues, we conduct an ablation study, shown in Fig. 9. We note that constraining the objective with object contour and scene flow cues effectively removes the noise on the object boundaries and corrects the artifacts on the actor's arm caused by error accumulation over time. We further evaluate how the proposed approach improves over the previous state-of-the-art approach (Newcombe et al., 2015) and investigate a challenging scenario, shown in Fig. 10. In this example, the actor kicks his leg suddenly, which
Fig. 9. The results of model-to-frame non-rigid deformation with and without contour and flow constraints.
Fig. 10. Visual comparisons between the proposed approach and DynamicFusion (Newcombe et al., 2015) in case of sudden object motion.
Fig. 11. The reconstruction results in case of significant appearance change.
is not successfully handled by Newcombe et al. (2015). Incorporating the proposed cues mitigates this negative effect dramatically. In addition, contours can be view dependent, especially when the target being scanned moves quickly. In this case, the contour cue can even harm the deformation due to incorrect correspondences. However, the scene flow cue provides frame-to-frame motion constraints that still yield correctly deformed models (shown in Fig. 11).

Impact of adaptive fusion. The long-term accumulated errors can be further reduced by the proposed adaptive fusion scheme. Recently, Fusion4D (Dou et al., 2016) and BodyFusion (Yu et al., 2017) demonstrated impressive results on dynamic reconstruction, so we compare our method with the 'Key Volume' strategy applied in their approaches (shown in Fig. 12). From these comparisons, we see that the 'Key Volume' technique can fix small-scale tracking failures by refreshing the
Fig. 12. Visual comparisons between our approach and the 'Key Volume' technique applied in BodyFusion (Yu et al., 2017).

Table 1
Running time of the proposed approach with different sampling strategies.

Sampling radius    0.020 m    0.030 m    0.040 m    0.050 m    0.060 m
Node number        2169       832        388        224        134
Time (second)      0.470      0.135      0.052      0.041      0.035
Fig. 13. The deformation graphs and corresponding error maps generated by the proposed approach with different sampling radii of nodes.
misaligned voxels using the corresponding depth data, but the reconstruction results are blurred. In contrast, our method identifies the most fine-grained topology as the canonical model and adaptively selects depth maps for fusion. As a result, the generated models are of high fidelity.

5.4. Computation

With the current GPU implementation, processing a video with 288 frames (Fig. 9) takes around 0.5 minutes: 10 seconds for canonical frame selection, 16 seconds for deformation, and 4 seconds for adaptive fusion. Our approach is thus much faster than existing offline approaches (e.g., Dou et al., 2015 take around 400 minutes to process the same video). Furthermore, we find that the sampling radius of the nodes significantly influences the overall running time of the system. In Table 1 and Fig. 13, we report the time cost and reconstruction quality with respect to the sampling strategy, respectively. We find that a smaller sampling radius improves the reconstruction quality but introduces more time cost. Interestingly, the proposed system is not very sensitive to the number of sampled nodes. In practice, we find that setting the sampling radius to 0.04 m is a good compromise between efficiency and accuracy.

6. Conclusion

In this paper, we present a novel approach to achieve non-rigid reconstruction with large movement. In contrast to previous methods, we identify the most fine-grained scene topology as the canonical model, then perform "model-to-frame"
Fig. 14. Reconstruction results of the proposed approach on more sequences. Gold color indicates the canonical model.
deformation and adaptive fusion. Comparisons on several challenging real-world examples suggest that the proposed approach achieves smooth results with less noise. Furthermore, we are interested in developing efficient algorithms that jointly reconstruct the sequence as a whole, instead of using a single-frame canonical model. In the future, we will address this problem by incorporating the whole sequence to reconstruct the precise topological structure rather than identifying a single canonical frame.

Limitations. While our approach can handle many scenes with large movement during non-rigid reconstruction, there are still several limitations, which point out directions for future study. Firstly, it is not a real-time system compared with DynamicFusion (Newcombe et al., 2015), since we have to record the depth sequence for bidirectional processing. However, the proposed approach performs better than Newcombe et al. (2015) with reasonable time cost (0.1 s/frame), and runs much faster than existing offline approaches (i.e., 0.1 s/frame vs. 60 s/frame (Dou et al., 2015)). This time cost can still be greatly reduced, since we adopt a rather simple implementation of the non-rigid deformation solver that is not well optimized on the GPU. Secondly, if the sequence is long, then one canonical frame may not be sufficient. Lastly, it is somewhat limiting to assume that there always exists a proper canonical frame across the video, especially for extremely complex motions. This is a common problem suffered by most existing approaches that still requires further investigation.

Acknowledgements

We would like to thank the anonymous reviewers for their valuable comments and suggestions. We are also grateful to Tao Yu for running BodyFusion (Yu et al., 2017) on our data and for helpful discussions. This work was supported by the National Natural Science Foundation of China (Grant Nos. 61502023 & U1736217).

Appendix A. Supplementary material

Supplementary material related to this article can be found online at https://doi.org/10.1016/j.cagd.2018.06.002.

References

Cao, C., Weng, Y., Lin, S., Zhou, K., 2013. 3D shape regression for real-time facial animation. ACM Trans. Graph. (TOG) 32 (4), 41.
Chen, J., Bautembach, D., Izadi, S., 2013. Scalable real-time volumetric surface reconstruction. ACM Trans. Graph. (TOG) 32 (4), 113.
Curless, B., Levoy, M., 1996. A volumetric method for building complex models from range images. In: Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques. ACM, pp. 303–312.
Curless, B.L., 1997. New Methods for Surface Reconstruction from Range Images. PhD thesis. Stanford University.
Dou, M., Fuchs, H., Frahm, J.-M., 2013. Scanning and tracking dynamic objects with commodity depth cameras. In: 2013 IEEE International Symposium on Mixed and Augmented Reality. ISMAR. IEEE, pp. 99–106.
Dou, M., Khamis, S., Degtyarev, Y., Davidson, P., Fanello, S.R., Kowdle, A., Escolano, S.O., Rhemann, C., Kim, D., Taylor, J., et al., 2016. Fusion4D: real-time performance capture of challenging scenes. ACM Trans. Graph. (TOG) 35 (4), 114.
Dou, M., Taylor, J., Fuchs, H., Fitzgibbon, A., Izadi, S., 2015. 3D scanning deformable objects with a single RGBD sensor. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition. CVPR. IEEE, pp. 493–501.
Guo, K., Xu, F., Yu, T., Liu, X., Dai, Q., Liu, Y., 2017. Real-time geometry, albedo, and motion reconstruction using a single RGB-D camera. ACM Trans. Graph. (TOG) 36 (3), 32.
Innmann, M., Zollhöfer, M., Nießner, M., Theobalt, C., Stamminger, M., 2016. VolumeDeform: real-time volumetric non-rigid reconstruction. In: European Conference on Computer Vision. Springer, pp. 362–379.
Izadi, S., Kim, D., Hilliges, O., Molyneaux, D., Newcombe, R., Kohli, P., Shotton, J., Hodges, S., Freeman, D., Davison, A., et al., 2011. KinectFusion: real-time 3D reconstruction and interaction using a moving depth camera. In: Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology. ACM, pp. 559–568.
Jaimez, M., Souiai, M., Gonzalez-Jimenez, J., Cremers, D., 2015. A primal-dual framework for real-time dense RGB-D scene flow. In: 2015 IEEE International Conference on Robotics and Automation. ICRA. IEEE, pp. 98–104.
Kahler, O., Prisacariu, V.A., Ren, C.Y., Sun, X., Torr, P., Murray, D., 2015. Very high frame rate volumetric integration of depth images on mobile devices. IEEE Trans. Vis. Comput. Graph. 21 (11), 1241–1250.
Li, H., Adams, B., Guibas, L.J., Pauly, M., 2009. Robust single-view geometry and motion reconstruction. ACM Trans. Graph., vol. 28. ACM, p. 175.
Li, H., Vouga, E., Gudym, A., Luo, L., Barron, J.T., Gusev, G., 2013. 3D self-portraits. ACM Trans. Graph. (TOG) 32 (6), 187.
Liao, M., Zhang, Q., Wang, H., Yang, R., Gong, M., 2009. Modeling deformable objects from a single depth camera. In: 2009 IEEE 12th International Conference on Computer Vision. IEEE, pp. 167–174.
Lin, H., Gao, J., Zhou, Y., Lu, G., Ye, M., Zhang, C., Liu, L., Yang, R., 2013. Semantic decomposition and reconstruction of residential scenes from Lidar data. ACM Trans. Graph. (TOG) 32 (4), 66.
Newcombe, R.A., Fox, D., Seitz, S.M., 2015. DynamicFusion: reconstruction and tracking of non-rigid scenes in real-time. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 343–352.
Newcombe, R.A., Izadi, S., Hilliges, O., Molyneaux, D., Kim, D., Davison, A.J., Kohli, P., Shotton, J., Hodges, S., Fitzgibbon, A., 2011. KinectFusion: real-time dense surface mapping and tracking. In: 2011 10th IEEE International Symposium on Mixed and Augmented Reality. ISMAR. IEEE, pp. 127–136.
Nießner, M., Zollhöfer, M., Izadi, S., Stamminger, M., 2013. Real-time 3D reconstruction at scale using voxel hashing. ACM Trans. Graph. (TOG) 32 (6), 169.
Oikonomidis, I., Kyriazis, N., Argyros, A.A., 2011. Efficient model-based 3D tracking of hand articulations using Kinect. In: BMVC, vol. 1, p. 3.
Roth, H., Vona, M., 2012. Moving volume KinectFusion. In: BMVC, pp. 1–11.
Shao, T., Xu, W., Zhou, K., Wang, J., Li, D., Guo, B., 2012. An interactive approach to semantic modeling of indoor scenes with an RGBD camera. ACM Trans. Graph. (TOG) 31 (6), 136.
Slavcheva, M., Baust, M., Cremers, D., Ilic, S., 2017. KillingFusion: non-rigid 3D reconstruction without correspondences. In: IEEE Conference on Computer Vision and Pattern Recognition. CVPR, vol. 3, p. 7.
Steinbrucker, F., Kerl, C., Cremers, D., 2013. Large-scale multi-resolution surface reconstruction from RGB-D sequences. In: The IEEE International Conference on Computer Vision. ICCV.
Sumner, R.W., Schmid, J., Pauly, M., 2007. Embedded deformation for shape manipulation. ACM Trans. Graph. (TOG), vol. 26. ACM, p. 80.
Taylor, J., Shotton, J., Sharp, T., Fitzgibbon, A., 2012. The Vitruvian manifold: inferring dense correspondences for one-shot human pose estimation. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition. CVPR. IEEE, pp. 103–110.
Tong, J., Zhou, J., Liu, L., Pan, Z., Yan, H., 2012. Scanning 3D full human bodies using Kinects. IEEE Trans. Vis. Comput. Graph. 18 (4), 643–650.
Wang, K., Zhang, G., Xia, S., 2017. Templateless non-rigid reconstruction and motion tracking with a single RGB-D camera. IEEE Trans. Image Process. 26 (12), 5966–5979.
Wang, R., Wei, L., Vouga, E., Huang, Q., Ceylan, D., Medioni, G., Li, H., 2016. Capturing dynamic textured surfaces of moving targets. In: European Conference on Computer Vision. Springer, pp. 271–288.
Weiss, A., Hirshberg, D., Black, M.J., 2011. Home 3D body scans from noisy image and range data. In: International Conference on Computer Vision. IEEE, pp. 1951–1958.
Whelan, T., Kaess, M., Fallon, M., Johannsson, H., Leonard, J., McDonald, J., 2012. Kintinuous: spatially extended KinectFusion.
Xu, W., Zhou, K., Yu, Y., Tan, Q., Peng, Q., Guo, B., 2007. Gradient domain editing of deforming mesh sequences. ACM Trans. Graph. (TOG) 26 (3), 84.
Yang, J., Li, K., Li, K., Lai, Y.-K., 2015. Sparse non-rigid registration of 3D shapes. Computer Graphics Forum, vol. 34. Wiley Online Library, pp. 89–99.
Yang, Y., Xu, W., Guo, X., Zhou, K., Guo, B., 2013. Boundary-aware multidomain subspace deformation. IEEE Trans. Vis. Comput. Graph. 19 (10), 1633–1645.
Yu, T., Guo, K., Xu, F., Dong, Y., Su, Z., Zhao, J., Li, J., Dai, Q., Liu, Y., 2017. BodyFusion: real-time capture of human motion and surface geometry using a single depth camera. In: The IEEE International Conference on Computer Vision. ICCV. ACM.
Zhang, R., Chen, X., Shiratori, T., Tong, X., Liu, L., 2015a. An efficient volumetric method for non-rigid registration. Graph. Models 79, 1–11.
Zhang, Y., Xu, W., Tong, Y., Zhou, K., 2015b. Online structure analysis for real-time indoor scene reconstruction. ACM Trans. Graph. (TOG) 34 (5), 159.
Zhou, K., Xu, W., Tong, Y., Desbrun, M., 2010. Deformation transfer to multi-component objects. Computer Graphics Forum, vol. 29. Wiley Online Library, pp. 319–325.
Zhou, Q.-Y., Koltun, V., 2015. Depth camera tracking with contour cues. In: Proceedings of Computer Vision and Pattern Recognition, pp. 632–638.
Zollhöfer, M., Nießner, M., Izadi, S., Rehmann, C., Zach, C., Fisher, M., Wu, C., Fitzgibbon, A., Loop, C., Theobalt, C., et al., 2014. Real-time non-rigid reconstruction using an RGB-D camera. ACM Trans. Graph. 33 (4), 156.