From rendering to tracking point-based 3D models


Image and Vision Computing 28 (2010) 1386–1395


Christophe Dehais, Géraldine Morin, Vincent Charvillat
IRIT–ENSEEIHT, 2 rue Charles Camichel, B.P. 7122, 31071 Toulouse Cedex, France

Article info

Article history: Received 10 September 2008; Received in revised form 8 February 2010; Accepted 2 March 2010

Keywords: Visual tracking; Point-based model; Surface splatting; GPGPU

Abstract

This paper adds to the abundant visual tracking literature with two main contributions. First, we illustrate the interest of using Graphics Processing Units (GPU) to support efficient implementations of computer vision algorithms, and second, we introduce the use of point-based 3D models as a shape prior for real-time 3D tracking with a monocular camera. The joint use of point-based 3D models and the GPU allows us to adapt and simplify an existing tracking algorithm originally designed for triangular meshes. Point-based models are of particular interest in this context because they are the direct output of most laser scanners. We show that state-of-the-art techniques developed for point-based rendering can be used to compute, in real time, intermediate values required for visual tracking. In particular, apparent motion predictors are computed in parallel at each pixel, and novel views of the tracked object are generated online to help wide-baseline matching. Both computations derive from the same general surface splatting technique, which we implement, along with other low-level vision tasks, on the GPU, leading to a real-time tracking algorithm.
© 2010 Elsevier B.V. All rights reserved.

1. Introduction

Low-level vision tasks such as the extraction of visual features in images consume a large portion of the overall computation time of 3D visual tracking algorithms, compared to higher-level tasks such as pose estimation. For real-time applications, this is therefore an area of focus for increasing performance. Faster algorithms are not always available and might not behave as well in real-life scenarios. Another possibility is to exploit the computational capabilities of modern Graphics Processing Units (GPU), which are nowadays common even in the embedded computing realm. On workstations and desktop systems, they often outperform the main CPU. This paper presents a real-time visual tracking framework which makes heavy use of the GPU at nearly all stages of computation. Acknowledging the advantages of the GPU for vision tasks, we are able to propose the use of dense point-based models (PBM) to represent the tracked object. While PBM and related GPU-based techniques have been quite well studied by the computer graphics community [1], to our knowledge they have never been used in the context of visual tracking.

Commonly used object models are based on 3D meshes that naturally lead to edge-based tracking techniques. In his pioneering work, Lowe [2] extracts image lines that are matched and fitted to those of the model. One can avoid prior edge extraction by actively searching for strong image gradients along the normals of the projected model edges [3], from a limited set of control points sampled along the edges [4].

These techniques are very well suited to industrial objects that exhibit strong straight edges, but may fail for natural or smooth objects. To enrich the model with object appearance information, a common approach is to apply textures onto the model surface [5]. The popular active appearance models [6] use a compact texture representation derived from the analysis of the dense object appearance in a (possibly large) number of poses. Some objects (like faces) do not show strong texture information everywhere on their surface. Sparse texture representations, using for example interest points, help to address this issue. A representative technique is the one of Vacchetti et al. [7]; our work is strongly inspired by their tracking algorithm. Pushing further the idea of models based on sparse salient features leads to collections of visual features without explicit topology. The hyperpatches of Wiles et al. [8] are attached to 3D points and centered on a priori selected salient features in the image. Rothganger et al. [9] successfully applied a similar 3D object representation using texture descriptors to object recognition. Munoz et al. [10] use a set of small planar textured patches along with shape and texture bases that support the deformations and appearance changes of the model. We use a model similar in nature, made of a collection of unconnected points; however, instead of carefully located features, we use a dense set of unconnected points to model the object to track.


Such point clouds are of particular interest since they are the typical output of most 3D scanning devices and the initial structure recovered by most vision-based reconstruction techniques. Of course, dense point clouds introduce a somewhat larger volume of data than standard approaches, but this concern is reduced by the fact that, as we show in this paper, most of the processing is offloaded to the GPU. The intermediate values required by our tracking algorithm cannot, to date, be computed in real time on a standard CPU; since these computations are highly parallelisable, the GPU is well adapted to perform them.

Along with the introduction of point-based models (PBM) as a representation of interest for computer vision problems in general and for visual tracking in particular, this paper presents two main contributions. First, we adapt state-of-the-art techniques developed for point-based rendering to compute very efficiently intermediate values required by the tracking algorithm. In particular, apparent motion predictors are computed in parallel and at each pixel by the GPU. Second, we directly use our PBM rendering algorithm to generate novel views of the tracked object as necessary in order to avoid tracking drift. As a "side-effect", we come up with an elegant and somewhat simpler algorithm compared to traditional tracking approaches.

The next section lays the basis of this work. We first informally identify a few good properties of point-based models (Section 2.1). We then introduce the visual tracking framework we build upon (Section 2.2) and finally present the splatting algorithm used for rendering and computing attributes on the PBM (Section 2.3). Section 3 presents the extension of this algorithm to the computation of dense apparent motion predictors. Section 4 details our GPU implementations, including those of low-level tasks. Section 5 presents a qualitative and performance evaluation of the proposed algorithm. Section 6 concludes the paper.

2. GPU-supported point-based models for computer vision

A 3D model of the object of interest is helpful in many computer vision tasks such as object recognition [9], object tracking [11,7] and visual servoing [12]. In face tracking, many face models have been devised: a cylinder, an ellipsoid, a generic parameterized face model, etc. Thus, a natural question arises: what makes a good object model? Task-related and specific constraints notwithstanding, a set of good properties can be pointed out.

2.1. What is a good 3D model?

We believe a good model should satisfy the following four properties as far as possible. First, it should be easy to acquire, ideally output directly by a 3D digitization device or a reconstruction algorithm. Second, it should be easily manipulated in 3D, e.g. transformed, simplified or combined. Third, computer vision problems often mandate easy access to the model from image space, in particular for visibility computation, back-projection of image features and synthesis of novel views. Fourth, the model should be as general as possible and hold, along with 3D coordinates, any other geometric or photometric attributes relevant to the application (e.g. surface color and normal, deformation fields, saliency descriptors).

Some of these properties can actually conflict. A compact parametric model is lightweight and easily processed in 3D space, yet is not generally easy to acquire and extend with attributes. The often used 3D meshes are known to be tedious to acquire from inherently point-oriented raw data. Computing a motion model at any point of the projection of a 3D mesh involves many levels of indirection: from a given image pixel to the facet, from the facet to the vertices, where the normals are stored, and finally to the 2D motion induced by the facet plane.


The increasing ubiquity of GPUs makes an alternative solution possible. Instead of sparse facet-based models, dense unstructured PBM can be considered; they indeed satisfy the above-mentioned properties thanks to recent work from the computer graphics community [1]. Model creation is simplified because the output of a laser scanner or of a Structure-from-Motion algorithm can be used with minimal processing. Both 3D and 2D manipulations can be GPU-assisted, as shown in Section 2.3. In addition, PBM are expandable: along with 3D points, our models include static attributes such as color and normal, and dynamic, motion-related attributes (see Section 3).

This work argues in favor of PBM supported by GPU implementations for computer vision. We think that these models provide an elegant answer to the following general issues:

- dense computation in image space of various model attributes, such as normals or colors, and
- dense generation of intermediate pose-dependent values necessary to vision algorithms.

In the remainder of this paper, we show how we apply the approach described in this section to the model-based visual tracking problem.

2.2. 3D visual tracking using a point-based model

The proposed tracking algorithm implements the approach introduced in the previous section and follows the framework initially designed by Vacchetti et al. [7], which combines iterative and keyframe-based tracking.

2.2.1. Iterative tracking

The rigid 3D motion between two successive images $I_{t-1}$ and $I_t$ is described by a 3D Euclidean transformation, made of three elementary rotations and a translation, and modeled by a $4 \times 4$ matrix $\delta E$. Hence $E_t = \delta E\, E_{t-1}$. We use the standard pinhole camera model with $3 \times 4$ projection matrix $P_t = K E_t$, where $K$ is the (known) constant matrix of intrinsic parameters. Tracking the object iteratively means updating the current pose estimate, knowing the previous pose $E_{t-1}$ and a set of motion measurements made on the successive images. In this work, we track feature points detected with the Harris criterion [13] and matched using a standard correlation-based technique. Those steps are performed on the GPU, as detailed in Section 4.2.

Let $m^i_{t-1}$ be a feature point detected on image $I_{t-1}$. If this point lies on the image of the object, then $m^i_{t-1}$ is the projection of a 3D point $M^i$ on the object surface. The iterative tracking problem can be formulated as

$$\widehat{\delta E} = \arg\min_{\delta E} \sum_{i=1}^{k} \left\| W_{M^i}\!\left(\delta E, m^i_{t-1}\right) - m^i_t \right\|^2, \qquad (1)$$

where $W_{M^i}$ is the apparent motion model that relates $m^i_{t-1}$ to $m^i_t$ and depends on $M^i$. It can also be seen as a motion predictor, in the sense that it predicts the next position of $m^i_{t-1}$ given a small 3D displacement. Lepetit and Fua [14] survey various formulations of $W_{M^i}$, depending on both the image measurements (optical flow, image gradient, feature points, etc.) and the underlying 3D model.

2.2.2. Adaptation to PBM

At this point, an efficient 3D model for tracking should allow the easy interpretation of any 2D–2D correspondence $m^i_{t-1} \leftrightarrow m^i_t$ in terms of a 3D pose update. Those correspondences can come, for example, from feature point matching or from optical flow estimation. We should have, for any feature $m^i_{t-1}$, a motion model $W_{M^i}$ induced by the object model.
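For concreteness, here is a minimal Python/NumPy sketch (our own illustration, not the authors' code) of the quantities involved: the pose composition $E_t = \delta E\, E_{t-1}$, the projection $P_t = K E_t$, and the reprojection residual that Eq. (1) minimizes. For simplicity the motion model $W$ is taken here as the exact reprojection under $\delta E$; the paper instead uses the first-order approximation derived in Section 3.1.

```python
import numpy as np

def project(K, E, X):
    """Project 3D points X (N x 3) with pose E (4 x 4) and intrinsics K (3 x 3), i.e. P = K E."""
    Xh = np.hstack([X, np.ones((X.shape[0], 1))])         # homogeneous coordinates
    m = (K @ (E @ Xh.T)[:3]).T                            # (u, v, w) per point
    return m[:, :2] / m[:, 2:3]                           # perspective division

def iterative_residual(dE, E_prev, K, M, m_curr):
    """Sum of squared distances between predicted and measured features, in the spirit of Eq. (1).
    Here W is the exact reprojection under dE (a simplification of ours)."""
    return float(np.sum((project(K, dE @ E_prev, M) - m_curr) ** 2))

# toy example with hypothetical values
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
E_prev = np.eye(4)
M = np.random.rand(50, 3) + [0.0, 0.0, 5.0]               # points in front of the camera
dE_true = np.eye(4)
dE_true[:3, 3] = [0.01, 0.0, 0.0]                         # small translation along x
m_curr = project(K, dE_true @ E_prev, M)
print(iterative_residual(dE_true, E_prev, K, M, m_curr))  # ~0 at the true update
```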


Fig. 1. In green: feature point trails showing 2D–2D correspondences. In blue: motion predictors corresponding to a 3D translation (the sparse representation is only for clarity).

Assuming the motion model has $p$ parameters, we can decorate our PBM with $p$ dynamically generated attributes. Using the forward projection algorithm presented in Section 2.3, we can then evaluate the $p$ motion parameters at any pixel. Fig. 1 illustrates this: on the right, the PBM of the object of interest is rendered in gray. To every model point are appended $p = 2$ parameters defining the apparent motion model corresponding to a 3D translation. Those motion predictors are represented by the blue vectors, sparsely for readability of the illustration. The point correspondences, represented as green vectors on the real toy leopard on the left, can now easily be projected onto the blue motion field and, were the model correctly registered, the 3D translation could be estimated. This simple example illustrates the suitability of a PBM for computing dense representations of various model attributes, as claimed in Section 2.1. The details of the computation of the motion predictors are given in Section 3.

2.2.3. Keyframe-based tracking

Using inter-frame measurements alone is prone to error accumulation and causes the recovered object pose to drift over time. As proposed in [7], we overcome this problem by also registering the object with generated views of the tracked object. Those views use the texture provided by a set of keyframes. A keyframe is a view of the object for which the pose is precisely computed offline. A set of keyframes is constructed so as to roughly cover the range of views that will be seen later in the tracked sequences (see Fig. 2a). As the model is registered, feature points on a keyframe can be back-projected onto the surface model, yielding 3D–2D correspondences readily available for pose estimation. Those correspondences can be propagated to the current frame via 2D–2D correspondences with the keyframe. As the current frame and the closest keyframe may still be quite far apart, making feature point matching hard to achieve [15], a novel view is rendered using the current pose estimate $E_{t-1}$. Such a view is shown in Fig. 2b. Features are then matched between the current frame and this novel view.

2.2.4. Wrap-up: exploiting PBM generality

The very same forward projection algorithm is used to render virtual novel views and to compute the apparent motion model at every pixel (see Section 2.3). In this respect, the PBM and its associated GPU rendering algorithm provide an elegant answer to the visual tracking problem. Intermediate values necessary to pose estimation are computed the same way as novel views of the object are generated. Both computations exploit the massively parallel architecture of the GPU. In particular, our dense computation of motion predictors avoids on-demand queries of the motion model at feature locations (see Section 3).

2.3. Surface splatting algorithm for point-based model rendering

A PBM defines a 3D object by a set of points sampling its surface, as depicted in Fig. 3a. No explicit connectivity information is necessary, and the surface is not required to be regularly sampled. A fundamental issue for point-based graphics is the reconstruction of a hole-free continuous object surface from the samples (see Fig. 3b and c). Popular approaches involve rendering a "thick" primitive at each point in the image so that the holes are covered. Zwicker et al. [16] formalized such an approach by introducing the surface splatting algorithm. Each sample (or splat) is associated with a number of attributes (position, normal, color, etc.), and the rendering can be seen as a resampling process of these attributes on the regular pixel grid in image space.

A splat $p_i$ is defined by a 3D point, an oriented surface normal¹ and a radius. The surface is further defined by a set of attributes $\{A_1, A_2, \ldots\}$. An attribute $A$ is locally approximated using a function $P_i^A : \mathbb{R}^2 \to \mathbb{R}^{n_A}$ defined in the splat plane ($n_A$ is the dimension of the attribute space, e.g. $n_A = 1$ when $A$ is scalar). A radial (3D) reconstruction kernel $r_i$, depending solely on the radius of the splat, is defined in the splat reference frame. A typical choice is a Gaussian kernel with a standard deviation adapted to the splat radius. Reconstructing the surface in image space involves the projection of points $y_i$ from the reference plane of splat $p_i$ to an image point $x$. Let this projection be approximated by the mapping $M_i : y_i \mapsto x$. Then, the evaluation of attribute $A$ of the surface at image point $x$ is given by:

$$S_A(x) = \frac{\sum_i r'_i(x)\, P_i^{\prime A}(x)}{\sum_i r'_i(x)}, \qquad (2)$$

with

$$r'_i(x) = r_i\!\left(M_i^{-1}(x)\right) \quad \text{and} \quad P_i^{\prime A}(x) = P_i^A\!\left(M_i^{-1}(x)\right). \qquad (3)$$

In practice, the support of the reconstruction kernels is truncated, so that the above sum is finite, and a low-pass filter is convolved with $S_A$ to avoid resampling artifacts. Under reasonable conditions the image footprints are elliptic and can be efficiently rasterized and blended. Several implementations of this rendering algorithm have been proposed, from the original CPU-based implementation of Zwicker et al. [17] to hardware-accelerated implementations making use of modern GPUs [18,19]. Our own GPU implementation is very similar and consists in:

- projecting the model points and computing the Gaussian kernel parameters,
- rasterizing the ellipse corresponding to the kernel footprint and blending the weighted attributes for each pixel of the ellipse,
- handling visibility using back-face culling and ε-depth test techniques.

Fig. 3 shows a leopard model rendered by our system. Again, the same algorithm is also able to render the novel view shown in Fig. 2b and the motion vectors of Fig. 1, as we show in the next section.

¹ When only a pure point cloud is available, the two latter attributes can be estimated by statistical analysis of local neighborhoods.
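The resampling of Eq. (2) can be sketched on the CPU as follows (a minimal Python/NumPy illustration of ours, not the authors' GPU shaders). It assumes isotropic image-space Gaussian footprints with truncated support and a constant attribute value per splat, and it omits the perspective-accurate elliptical kernels, the low-pass prefilter and the visibility handling described above.

```python
import numpy as np

def splat_attributes(centers_2d, radii_px, attributes, width, height, cutoff=3.0):
    """Resample per-splat attributes onto the pixel grid following Eq. (2):
    S_A(x) = sum_i r_i'(x) P_i'^A(x) / sum_i r_i'(x), with truncated Gaussian kernels."""
    n_attr = attributes.shape[1]
    acc = np.zeros((height, width, n_attr))      # numerator of Eq. (2)
    wsum = np.zeros((height, width))             # denominator of Eq. (2)
    for (cx, cy), rad, a in zip(centers_2d, radii_px, attributes):
        r = int(np.ceil(cutoff * rad))           # truncated kernel support
        x0, x1 = max(0, int(cx) - r), min(width, int(cx) + r + 1)
        y0, y1 = max(0, int(cy) - r), min(height, int(cy) + r + 1)
        if x0 >= x1 or y0 >= y1:
            continue
        ys, xs = np.mgrid[y0:y1, x0:x1]
        w = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * rad ** 2))  # Gaussian weight
        acc[y0:y1, x0:x1] += w[..., None] * a
        wsum[y0:y1, x0:x1] += w
    covered = wsum > 1e-8
    acc[covered] = acc[covered] / wsum[covered][:, None]   # normalization of Eq. (2)
    return acc, covered

# toy usage: 200 splats carrying an RGB color attribute (hypothetical data)
centers = np.random.rand(200, 2) * [320, 240]
radii = np.full(200, 4.0)
colors = np.random.rand(200, 3)
image, mask = splat_attributes(centers, radii, colors, width=320, height=240)
```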


Fig. 2. (a) A keyframe "cloud" and (b) a novel view generated from one of the keyframes of (a).

Fig. 3. (a) PBM of a toy leopard; (b) and (c) close views with increasing splat radius.

3. Splatting motion predictors

In this section, we derive an expression for the iterative pose update (Eq. (1)) adapted to PBM. Our apparent motion model approximation takes the form of a basis of six 2D motion vectors. We call motion predictors the image projections of these six vectors. We rely on a dense computation of the motion predictors so as to be able to query the predicted motion of any image point on the model projection.

3.1. Point-based motion predictors

To obtain a 2D motion model suited to the use of 3D points, we linearize the projection of the 3D motion as proposed by Drummond and Cipolla [4]. We take the first-order approximation of the exponential map form of the Euclidean matrix $\delta E$:

$$\delta E \approx I + \sum_{j=1}^{6} \alpha_j G_j, \qquad (4)$$

where the matrices $G_j$ are the generators of 3D elementary motions (rotations and translations w.r.t. the axes of the object frame) and $\alpha = (\alpha_1, \ldots, \alpha_6)$ are the corresponding parameters of the elementary translations and rotations. We can now relate the apparent motion of an image point $m$ to the small rigid 3D transformation of amplitude $\alpha_j$ undertaken by the object. Writing $m = PM = (u, v, w)^\top$ and $P G_j M = (u'_j, v'_j, w'_j)^\top$, we obtain the linear relation:

$$m' = m + \sum_{j=1}^{6} \alpha_j l_j \quad \text{with} \quad l_j = \frac{1}{w^2} \begin{bmatrix} u'_j w - u w'_j \\ v'_j w - v w'_j \end{bmatrix}, \quad j = 1, \ldots, 6. \qquad (5)$$

The 2D vector $l_j$ is the first-order approximation of the motion of the point $m$ when $M$ moves according to an elementary 3D transformation of amplitude $\alpha_j$.

3.2. Dense evaluation of the motion predictors

By setting the $\{l_j\}$ as dynamically generated, pose-dependent attributes of the PBM, the splatting algorithm of Section 2.3 provides a straightforward way to compute the motion bases everywhere on the model image. Fig. 4 illustrates this. As a reference, Fig. 4a shows a rendered PBM of a face. For each sample point of the model, we show the vector $l_6$ corresponding to the rotation around the z-axis (pointing towards the camera) (Fig. 4b). Note that in this figure, motion vectors associated with hidden sample points also appear. In contrast, using the splatting algorithm, a dense motion field can be computed, and depth is correctly handled (hidden points are discarded) (Fig. 4c). For each pixel belonging to the projection of the object, $l_6$ is computed using only the visible neighboring splats.

3.3. Recovering the 3D motion

Performing a step of iterative tracking is now possible using Eq. (5) as the linear motion model $W_{M^i}$ introduced in Eq. (1). The motion vector approximation is defined at each meaningful pixel, that is, at each feature point lying on the object projection. Let us denote $l^i_j = (u^i_j, v^i_j)^\top$, $j = 1, \ldots, 6$, the basis of elementary motions computed at the point $m^i_{t-1}$, and $d^i = m^i_{t-1} - m^i_t$. The pose update is then the solution of the problem:

$$\min_{(\alpha_1, \ldots, \alpha_6)^\top} \sum_{i=1}^{k} \Big\| \sum_{j=1}^{6} \alpha_j l^i_j + d^i \Big\|^2. \qquad (6)$$

This problem is linear in the unknowns $\alpha_j$ and may be written and solved in the following matrix form:

$$\hat{a}_t = \arg\min_{a_t = (\alpha_1, \ldots, \alpha_6)^\top} \| L a_t + d \|^2. \qquad (7)$$
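The following Python/NumPy sketch (ours) makes Eqs. (4)–(7) concrete: it builds the six generators $G_j$, evaluates the motion predictors $l_j$ of Eq. (5) for a 3D point, and solves the plain least-squares problem of Eq. (7). The ordering of the generators is our assumption, and the robustified (IRLS) version used in practice (Section 3.3) is omitted here.

```python
import numpy as np

def se3_generators():
    """The six generators G_j of Eq. (4): translations along, then rotations about,
    the x, y, z axes of the object frame (this ordering is our assumption)."""
    G = np.zeros((6, 4, 4))
    G[0, 0, 3] = G[1, 1, 3] = G[2, 2, 3] = 1.0   # translations
    G[3, 1, 2], G[3, 2, 1] = -1.0, 1.0           # rotation about x
    G[4, 0, 2], G[4, 2, 0] = 1.0, -1.0           # rotation about y
    G[5, 0, 1], G[5, 1, 0] = -1.0, 1.0           # rotation about z
    return G

def motion_basis(P, M):
    """The six 2D motion predictors l_j of Eq. (5) for a 3D point M,
    given the 3 x 4 projection matrix P = K E."""
    Mh = np.append(M, 1.0)
    u, v, w = P @ Mh
    L = np.empty((2, 6))
    for j, G in enumerate(se3_generators()):
        uj, vj, wj = P @ (G @ Mh)
        L[:, j] = [(uj * w - u * wj) / w**2, (vj * w - v * wj) / w**2]
    return L                                      # columns are l_1 ... l_6

def solve_pose_increment(L_stack, d_stack):
    """Plain least-squares solution of Eq. (7), argmin_a ||L a + d||^2.
    L_stack: (2k, 6) stacked 2 x 6 blocks, d_stack: (2k,) stacked d^i."""
    a, *_ = np.linalg.lstsq(L_stack, -d_stack, rcond=None)
    return a                                      # (alpha_1, ..., alpha_6)
```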


Fig. 4. (a) A 3D PBM of a face rendered with hidden surface removal. (b) Motion vectors $\{l_6\}$ corresponding to a rotation of the model around the z-axis, some of which should be hidden. (c) The dense motion field, showing vectors $\{l_6\}$ interpolated by splatting. Note the correct handling of hidden points.

In practice, a robust version of this linear estimate (via Iteratively Reweighted Least Squares [20]) is used to cope with outlying matches. From this correction $\hat{a}_t$, we can easily derive an intermediate update $\widehat{\delta E}$ to estimate the unknown pose $E_t$. This first estimate $\hat{E}^1_t = \widehat{\delta E}\, E_{t-1}$ is then refined using registration with a keyframe.

3.4. Comparison with traditional approaches

The method detailed in the two previous sections takes a relatively different route from traditional model-based 3D tracking approaches such as that of [7]. Those often comprise two steps in order to form a motion prediction for a particular feature point $m^i_{t-1}$ on image $I_{t-1}$. First, the feature is back-projected onto the model, using the current pose estimate $E_{t-1}$, to get the corresponding 3D surface point $M^i$. The new position $m^i_t$ of the feature point is then obtained by projecting $M^i$ using the putative pose $E_t$. The back-projection/projection process can be modeled with the motion model $W$ that maps $m^i_{t-1}$ to $m^i_t$ according to the current pose (see Section 2.2). In the case of piecewise planar surfaces, $W$ is a homography. The explicit back-projection step can thus be replaced by an implicit back-projection that returns the plane equation of the facet on which $M^i$ lies.

Our proposed method differs in two respects: first, we do not need explicit or implicit back-projection of the feature points, because our motion model is directly projected onto every image pixel, thanks to the flexible point-based rendering technique. Second, because we directly approximate the pose update of Eq. (4), we obtain a linear least squares problem which is easier to solve. Both aspects make our proposal much simpler than the original mesh-based algorithm it is based on. Because our algorithm computes the intermediate values necessary to tracking in one forward projection pass, intermediate structures such as Facet-ID images [7], which may not scale well with the number of facets, are no longer necessary.

3.5. Implementation details

To make the computation of dense motion predictors tractable, we use a GPU implementation of the splatting algorithm outlined in Section 2.3. We build upon the work of [18] and [19] on deferred shading and extend their proposal to our vision-related needs. Any number of attributes can in theory be interpolated by the splatting algorithm, but restrictions apply to the number of output buffers of the fragment shader stage. At every point of the PBM, we define a basis of six 2D motion vectors (see Section 3.1), so we need 12 attributes. The denominator of Eq. (2) (normalization term) needs to be carried along too, and each output buffer can hold four components, which means $\lceil (12+1)/4 \rceil = 4$ buffers are needed, which is the limit on many common graphics cards. Because this many buffers takes up a lot of bandwidth on the graphics bus, and because the normalization term is no longer needed once the motion predictors are computed, we reorganize the buffer components so that only three buffers need to be downloaded to the CPU after the last splatting stage. Those buffers form the motion predictor image, which is illustrated in Fig. 5. There is clearly room for improvement at this stage: we could try to download only the pixels of the rendered motion predictor image corresponding to the feature points. That would, however, require a rather complex combination of shaders, because "gather" operations are not well suited to current GPU architectures.
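As a small arithmetic check of the buffer budget above (a sketch of ours; the function name and the commented card limits are illustrative, not from the paper):

```python
import math

def render_targets_needed(n_scalars, channels_per_target=4):
    """Number of RGBA render targets required to carry n_scalars interpolated values."""
    return math.ceil(n_scalars / channels_per_target)

# during splatting: 6 predictors x 2 components + 1 normalization weight = 13 scalars
print(render_targets_needed(6 * 2 + 1))   # -> 4, the limit of many cards of that generation
# after normalization the weight is dropped, so only 12 scalars are read back
print(render_targets_needed(6 * 2))       # -> 3 buffers downloaded to the CPU
```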

Fig. 5. A visualization on a 30 × 30 regular grid of the computed motion predictors for x translation and z rotation.


Fig. 6. Illustration of the use of keyframes. See text for details.

4. Novel view generation and low-level image processing

In this section, we describe two key parts of our algorithm, namely the generation of novel views from keyframes using the PBM, and the extraction and matching of feature points between images (either between two consecutive images of the sequence or between an image of the sequence and a novel view). These two parts are respectively entirely and nearly entirely performed on the GPU.

4.1. Novel view generation

As mentioned in Section 2.2, the 3D–2D matches can be obtained thanks to 2D–2D matches between the current frame $I_t$ and a close keyframe. Finding matches between feature points becomes difficult when the baseline between the two views increases; however, the current frame and the selected "close" keyframe are often in such a configuration. To address this issue, an intermediate virtual keyframe is synthesized by splatting the model with the estimate $\hat{E}^1_t$ of the current pose. Fig. 6 illustrates the four steps of the whole process, which we summarize here:

(1) Select the closest keyframe according to a similarity criterion between poses.
(2) Extract texture information from the keyframe, thanks to the manually computed pose. Note that this step is done offline. As shown in Fig. 2a, each keyframe thus provides its own textured version of the object model.
(3) Render the textured model (with the texture chosen from the closest keyframe) according to the current pose estimate. This produces an intermediate virtual keyframe containing only the tracked object (center image of Fig. 6).
(4) Determine 2D–2D matches between the current frame and the intermediate keyframe and deduce 2D–3D correspondences for the current frame.

The correspondences established between the keyframe and the current image can be integrated into the pose estimation by minimizing the reprojection error between 3D and 2D points in a step following the minimization of Eq. (7). Another approach, which we adopt, is the following. Thanks to the estimate $\hat{E}^1_t$, we can not only render the textured model of the object but also compute the motion predictors (see the central rendering in Fig. 6) for this resynthesized view.

With the 2D–2D matches between $I_t$ and this resynthesized view (see the dashed blue line in Fig. 6), we now have the information needed to formulate a problem similar to Eq. (7), giving a correction to the current pose estimate $\hat{E}^1_t$. This refinement process may be iterated (as suggested in [4]), but we found that a single step suffices, as shown by the experiments described in Section 5.
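Step (1) of the process above relies on a pose-similarity criterion that the paper does not spell out. The sketch below (Python/NumPy, ours) uses one plausible choice, the rotation geodesic angle plus a weighted translation distance; both the criterion and the weight `w_trans` are our assumptions.

```python
import numpy as np

def rotation_angle(R):
    """Geodesic angle (in radians) of a 3 x 3 rotation matrix."""
    c = (np.trace(R) - 1.0) / 2.0
    return float(np.arccos(np.clip(c, -1.0, 1.0)))

def closest_keyframe(E_current, keyframe_poses, w_trans=1.0):
    """Step (1) of Section 4.1: return the index of the keyframe whose pose is most
    similar to the current estimate, here scored by rotation angle plus weighted
    translation distance (the actual criterion is not specified in the paper)."""
    best_idx, best_score = None, np.inf
    for idx, E_k in enumerate(keyframe_poses):
        dR = E_current[:3, :3].T @ E_k[:3, :3]
        dt = np.linalg.norm(E_current[:3, 3] - E_k[:3, 3])
        score = rotation_angle(dR) + w_trans * dt
        if score < best_score:
            best_idx, best_score = idx, score
    return best_idx
```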

4.2. Feature point extraction and matching on the GPU

Our algorithm runs nearly all of its steps on the GPU. Here, we briefly describe the GPU implementation of the low-level image processing steps of feature point extraction and matching. Feature points are extracted according to the standard Harris criterion. All image manipulations required to compute a cornerness map on the GPU are implemented without particular difficulty using fragment shaders.² Since the computation carried out at every image pixel is independent from that at the neighboring pixels, we can maximize parallelism and minimize memory transfer by packing four gray-level values into one RGBA pixel. The Harris criterion is based on the eigenvalues of the local structure tensor of the gray-level image. The elementary operations are the convolution of the image with a Gaussian derivative filter, a few pixel-wise arithmetic operations, a per-pixel computation of the Harris criterion and a final non-maxima removal pass. The GPU-generated cornerness map is then downloaded to the CPU, which creates the feature point list (this consists of either applying a threshold or doing a very fast partial sort of the cornerness values). A recent algorithm has been proposed [21] to generate point lists directly on the GPU, thus further saving CPU time and some bus bandwidth.

We determine correspondences by matching small windows around feature points, using a normalized cross-correlation criterion (NCC). For every feature of a given image, the NCC scores are computed with all feature points in the other image, and the feature with maximum score is marked as a potential match. The same process with the images swapped is performed, and a cross-validation test discards the features which are not marked as the match of their marked match. Performed naively, this algorithm does not map well to a GPU architecture, because of the scattered reads (the sparse sets of feature points are located nearly randomly in the image). We thus perform a first pass that reads every feature window and vectorizes it into one line of an output buffer of size $n \times w^2$, where $n$ is the number of features in the image and $w^2$ is the window size in pixels. We then perform the rest of the algorithm on the flattened neighborhood buffers of each image to be matched.

² We use the GLSL shading language.
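The matcher of this section can be sketched on the CPU as follows (Python/NumPy, our illustration of the described scheme, not the GPU shaders): windows are flattened into an $n \times w^2$ buffer, all-pairs NCC scores are obtained by a single matrix product, and cross-validation keeps only mutual best matches. Features are assumed to lie at least $\lfloor w/2 \rfloor$ pixels from the image border.

```python
import numpy as np

def flatten_windows(image, points, w=9):
    """Gather the w x w neighborhood of each feature into one row of an (n, w*w) buffer,
    mirroring the first GPU pass that linearizes the scattered reads. Each row is made
    zero-mean and unit-norm so that a dot product equals the NCC score."""
    h = w // 2
    rows = []
    for x, y in points:                       # features assumed away from the border
        patch = image[y - h:y + h + 1, x - h:x + h + 1].astype(np.float64).ravel()
        patch -= patch.mean()
        rows.append(patch / (np.linalg.norm(patch) + 1e-8))
    return np.array(rows)

def cross_validated_matches(image1, pts1, image2, pts2, w=9):
    """NCC matching with cross-validation: keep (i, j) only if j is the best match of i
    and i is the best match of j."""
    A = flatten_windows(image1, pts1, w)
    B = flatten_windows(image2, pts2, w)
    ncc = A @ B.T                             # all-pairs NCC scores
    best_12 = ncc.argmax(axis=1)              # best match in image2 for each feature of image1
    best_21 = ncc.argmax(axis=0)              # best match in image1 for each feature of image2
    return [(i, j) for i, j in enumerate(best_12) if best_21[j] == i]
```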


5. Results

We designed a complete implementation of the tracking framework presented in the previous sections, which we summarize in the following subsection. We then analyze the performance of our implementation and underline the crucial role of the GPU. Finally, we show the good behavior, in terms of precision and drift correction, of the approach mixing iterative and keyframe-based tracking.

5.1. Algorithm summary

Assuming the pose $E_0$ of the object in the frame $I_0$ is known, we detect feature points in each frame $I_t$ and match them to the points from the previous image $I_{t-1}$, as detailed in Section 4.2. We then compute the motion predictors for the pose $\hat{E}_{t-1}$ and download the corresponding maps to the CPU memory (see Section 3.2). This allows us to sample the maps at the feature point locations in image $I_{t-1}$, giving the predicted motions used to form the matrix $L$ in Eq. (7). By solving problem (7), we obtain an intermediate pose estimate $\hat{E}^1_t$, which we then refine using feature matches established between the current frame and the virtual novel view (Section 4.1).
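The step "sample the maps at the feature point locations ... to form the matrix $L$" can be sketched as follows (Python/NumPy, ours; the $(H, W, 6, 2)$ layout of the downloaded predictor maps and the function name are our assumptions):

```python
import numpy as np

def assemble_linear_system(features_prev, features_curr, predictor_maps):
    """Sample the downloaded predictor maps at the matched feature locations of I_{t-1}
    and stack them into the matrix L and vector d of Eq. (7).
    predictor_maps: (H, W, 6, 2) array of splatted predictors l_1..l_6 (layout assumed);
    features_prev, features_curr: (k, 2) pixel coordinates of matched features."""
    L_blocks, d_blocks = [], []
    for (x0, y0), (x1, y1) in zip(features_prev, features_curr):
        lj = predictor_maps[int(round(y0)), int(round(x0))]   # 6 x 2 predictors at m_{t-1}^i
        if not np.any(lj):                                    # pixel not covered by the model
            continue
        L_blocks.append(lj.T)                                 # 2 x 6 block, columns l_1..l_6
        d_blocks.append([x0 - x1, y0 - y1])                   # d^i = m_{t-1}^i - m_t^i
    return np.vstack(L_blocks), np.concatenate(d_blocks)

# the pose increment then follows as in the sketch after Eq. (7):
#   alpha, *_ = np.linalg.lstsq(L, -d, rcond=None)
```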

5.2. Performance analysis

We stress that more than 80% of the algorithm is performed on the GPU, and we compare those parts to CPU-based implementations (ours or reported by others).

- Feature point extraction and matching. The graph of Fig. 7 shows the mean computation time per frame of our GPU implementation against a CPU implementation of our own, on three sequences of varying size. The feature extraction performs about three times faster on the GPU. Combining extraction and matching, the GPU does not perform outstandingly better overall, because the feature matching algorithm is really not suited to its architecture. However, the GPU implementation scales a lot better in terms of feature window size and is able to match every point instead of limiting the search to the closest features, as is mandatory for good performance on the CPU (doing so on the GPU did not affect performance). We use 9 × 9 windows as they provide a good trade-off between the assumption of local planarity and distinctiveness.
- Novel view synthesis. Here, a comparison with a pure CPU implementation is unfair because the GPU is particularly well adapted to this task. As a reference, Zwicker et al. report nearly one second to render a 512 × 512 image of a model similar in complexity to the ones we use [22].
- Motion predictor evaluation. As above, even if the bus bandwidth required to download the motion vector maps is somewhat detrimental to our implementation, a pure CPU-based one would perform similarly worse than in the case of novel view synthesis.

The performance of our GPU implementation for both novel view rendering and motion predictor evaluation is shown in Fig. 8 for two GPU architectures (Nvidia G71 and G80) and three PBM of increasing complexity. We see that real-time computation is easily achievable on a modern GPU even with a rather complex PBM. Note that the central face model shown in Fig. 8 was obtained directly with a laser scanner.

The graph of Fig. 9 sums up the average timings of the different parts of the algorithm, measured on a system with an Intel Core2 CPU at 2.6 GHz and an NVidia G80 graphics card. The overall pose update process takes on average 81.5 ms, which allows real-time interaction with the system. In general, the observed frame rates range from 10 to 15 fps, depending mainly on the complexity of the PBM for a given frame resolution. As for any iterative tracking technique, the apparent motion of the tracked object is somewhat limited; note however that the keyframe-based approach significantly alleviates this issue.

5.3. Evaluation of the tracking quality

We now present experiments highlighting the benefits of using a hybrid tracking approach. With our original tracker, we obtain results consistent with those reported in [7]. Iterative tracking, while robust in the presence of incorrect image measurements, is prone to accumulating estimation error, and the reprojection of the tracked object will often lag behind its real counterpart. In contrast, keyframe-based tracking does not accumulate error over time, but due to the less precise matching process with the synthesized view, it is affected by jitter and can also fail when not enough correct matches are found.

We illustrate these observations on a synthetic sequence that provides ground-truth values for the pose parameters. The sequence involves translations and rotations (see Fig. 10a). We plot the output of the algorithm against the ground-truth values when activating iterative tracking only (IT), when activating keyframe tracking only (KF) and finally when combining both (KF + IT). The graphs (b)–(e) of the recovered pose parameters show that the combined approach is always closer to the ground truth. The results in Fig. 10d, corresponding to the rotation around the y-axis, are particularly significant: the iterative tracker loses track early and never recovers; the keyframe-based tracker exhibits a lot of jitter; the combined tracker output, on the contrary, is at the same time smooth and close to the ground truth.

Fig. 7. Performance comparison of the tracking (extraction and matching) of 900 feature points on a CPU and a GPU architecture, on three sequences of varying size. The feature window size is 9 × 9 and the maximum search distance for the CPU version is set to 30 pixels. Red bars are indicative of interactive and real-time frame rates.


Fig. 8. Performance of the splatting algorithm on two GPU architectures (NVidia G71 and G80) for novel view rendering (orange bars, left) and motion vector computation (purple bars, right). Three PBM of increasing complexity are considered. The resolution is 640 × 480.

Fig. 9. Timings of the different parts of our tracking algorithm, with a video resolution of 640 × 480. The operations done on the CPU (in shades of green) are slightly offset away from the GPU-based ones (in shades of blue).

We finally show in Fig. 11 some screen captures of our tracker running in real time on live sequences taken by a standard camera. Further results and video material illustrating the effectiveness of the system on live video sequences can be found at http://dehais.perso.enseeiht.fr/pbmtracking. This video illustrates the typical amplitude of motion that the system can track.

6. Conclusion and perspective

In this paper, we introduced to computer vision the idea, recently developed by the computer graphics community, of using topology-free models.

Along with providing the easy acquisition praised in computer graphics, using PBM brought new insights into how to efficiently compute the vision-related intermediate values on a GPU architecture designed for massively parallel computation. We highlighted that combining PBM with efficient GPU-based rendering and manipulation techniques can be effective for many computer vision problems requiring a model of the object of interest. In the specific case of 3D visual tracking, we showed that, with the very same rendering algorithm, we could both compute apparent motion predictors at every image pixel efficiently and generate novel views of the tracked object. The proposed approach compares favorably in terms of simplicity and scalability to existing tracking algorithms designed for triangulated meshes. The resulting implementation is a real-time tracking system running about 80% of its computation time on the GPU, which frees the CPU for other tasks.



Fig. 10. Tracking results for the synthetic sequence shown in (a). Graph (b) shows the evolution of all three translation parameters. Graphs (c)–(e) show the individual rotation parameters (respectively $R_x$, $R_y$ and $R_z$).


Fig. 11. Screenshots from live tracking sequences. First line: tracking the toy leopard shown throughout the paper. Second line: tracking using a face model acquired with a range scanner.

As GPU and CPU performance and capabilities increase, we naturally expect the tracking time to improve. Even if the particular implementation of the ideas presented in this paper must be adapted to harness new generations of GPUs and CPUs, the principles proposed for tracking based on a PBM will remain valid. The perspectives of this work are twofold. First, we would like to investigate the formulation of an edge-based tracking technique on PBM. This would further demonstrate the interest of such models in this context. Second, we believe that this work is an interesting path towards 3D non-rigid tracking. Deformations in 3D could indeed correspond to 2D deformation fields of the same nature as the motion predictors we generate by splatting rigid motions. Working with a point-based 3D model may simplify the management of visibility, including self-occlusions. Being densely evaluated, those motion fields can combine texture and geometry and thus can be used to verify the optical flow constraint equation or a more general illumination constraint.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.imavis.2010.03.001.

References

[1] M. Gross, H. Pfister, Point-based Graphics, Morgan Kaufmann, 2007.
[2] D.G. Lowe, Robust model-based motion tracking through the integration of search and estimation, International Journal of Computer Vision 8 (2) (1992) 113–122.
[3] H. Kollnig, H.-H. Nagel, 3D pose estimation by directly matching polyhedral models to gray value gradients, International Journal of Computer Vision 23 (3) (1997) 283–302.
[4] T. Drummond, R. Cipolla, Real-time visual tracking of complex structures, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (7) (2002) 932–946.
[5] M. Pressigout, E. Marchand, Real-time hybrid tracking using edge and texture information, International Journal of Robotics Research 26 (7) (2007) 689–713.
[6] J. Xiao, S. Baker, I. Matthews, T. Kanade, Real-time combined 2D + 3D active appearance models, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, 2004, pp. 535–542.
[7] L. Vacchetti, V. Lepetit, P. Fua, Stable real-time 3D tracking using online and offline information, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (10) (2004) 1391.
[8] C.S. Wiles, A. Maki, N. Matsuda, Hyperpatches for 3D model acquisition and tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (12) (2001) 1391–1403.
[9] F. Rothganger, S. Lazebnik, C. Schmid, J. Ponce, 3D object modeling and recognition using local affine-invariant image descriptors and multi-view spatial constraints, International Journal of Computer Vision 66 (3) (2006) 231–259.
[10] E. Munoz, J.M. Buenaposada, L. Baumela, Efficient model-based 3D tracking of deformable objects, in: Proceedings of ICCV 2005, Beijing, China, 2005, pp. 877–882.
[11] G. Klein, D. Murray, Full-3D edge tracking with a particle filter, in: Proceedings of the British Machine Vision Conference (BMVC'06), BMVA, Edinburgh, 2006.
[12] A. Comport, E. Marchand, F. Chaumette, Robust model-based tracking for robot vision, in: IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS'04, Sendai, Japan, 2004, pp. 692–697.
[13] C. Harris, M. Stephens, A combined corner and edge detector, in: Proceedings of the Fourth Alvey Vision Conference, 1988, pp. 147–151.
[14] V. Lepetit, P. Fua, Monocular model-based 3D tracking of rigid objects: a survey, Foundations and Trends in Computer Graphics and Vision 1 (1) (2005) 1–89.
[15] C. Schmid, R. Mohr, C. Bauckhage, Evaluation of interest point detectors, International Journal of Computer Vision 37 (2) (2000) 151–172.
[16] M. Zwicker, H. Pfister, J. van Baar, M. Gross, Surface splatting, in: E. Fiume (Ed.), SIGGRAPH 2001, Computer Graphics Proceedings, ACM Press/ACM SIGGRAPH, 2001, pp. 371–378.
[17] M. Zwicker, J. Räsänen, M. Botsch, C. Dachsbacher, M. Pauly, Perspective accurate splatting, in: GI '04: Proceedings of Graphics Interface 2004, Canadian Human–Computer Communications Society, School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada, 2004, pp. 247–254.
[18] G. Guennebaud, L. Barthe, M. Paulin, Deferred splatting, Computer Graphics Forum 23 (3) (2004) 653–660 (EG2004 Proceedings).
[19] M. Botsch, A. Hornung, M. Zwicker, L. Kobbelt, High-quality surface splatting on today's GPUs, in: Eurographics Symposium on Point-Based Graphics 2005, 2005, pp. 17–24.
[20] P. Huber, Robust Statistics, Wiley, New York, 1981.
[21] G. Ziegler, A. Tevs, C. Theobalt, H.-P. Seidel, On-the-fly point clouds through histogram pyramids, in: Proceedings of Vision, Modeling and Visualization 2006 (VMV06), 2006, pp. 137–144.
[22] M. Zwicker, H. Pfister, J. van Baar, M. Gross, EWA splatting, IEEE Transactions on Visualization and Computer Graphics 8 (2002) 223–238.