High-resolution 3D surface strain magnitude using 2D camera and low-resolution depth sensor


Matthew Shreve, Mauricio Pamplona, Timur Luguev, Dmitry Goldgof, Sudeep Sarkar
Department of Computer Science and Engineering, University of South Florida, 4202 E. Fowler Ave., Tampa, FL, USA

Keywords: 3-D, Optical strain, Optical flow

Abstract: Generating 2-D strain maps of the face during facial expressions provides a useful feature that captures the bio-mechanics of facial skin tissue, and has had wide application in several research areas. However, most applications have been restricted to collecting data on a single pose. Moreover, methods that strictly use 2-D images for motion estimation can potentially suppress large strains because of projective distortions caused by the curvature of the face. This paper proposes a method that allows estimation of 3-D surface strain using a low-resolution depth sensor. The algorithm automatically aligns a rough approximation of a 3-D surface with an external high-resolution camera image. We provide experimental results that demonstrate the robustness of the method on a dataset collected using the Microsoft Kinect synchronized with two external high-resolution cameras, as well as 101 subjects from a publicly available 3-D facial expression video database (BU4DFE).

1. Introduction

Capturing and quantifying the strain observed during facial expressions has found useful application in areas such as expression detection, face identification, and medical evaluations [1–3]. However, one limitation in each of these applications (especially the latter two) is that, in order to estimate the consistency of strain over several acquisition sessions, each subject was required to provide a pose that aligned with prior data. We describe a method that is more invariant to view than strictly 2-D approaches, thus allowing greater variability in data acquisition. There are several advantages to supplementing 2-D motion estimates with aligned 3-D depth values. Due to the foreshortening that occurs when the curvature of the face is projected onto the 2-D image plane, the horizontal motion estimated along these points is often suppressed. By projecting these vectors back onto the 3-D surface, we can adjust them to better approximate the true displacements. Similarly, motion that is orthogonal to the image plane is lost in 2-D-only approximations, but recovered when using 3-D data. This method is possible because the surface of the face is a 2-D surface embedded in 3-D. Moreover, we assume that any local region of the face (a 3 × 3 neighborhood) is flat and can be approximated using a 2-D plane.

This paper has been recommended for acceptance by Xiaoyi Jiang.
Corresponding author: M. Shreve. Tel.: +1 (813) 974 3652; fax: +1 (813) 974 5456.

Hence, we can use standard optical flow techniques to describe the 2-D motion, adjust the resulting vectors using 3-D depth approximations, and then determine the 3-D strain tensors. The proposed algorithm consists of the following steps: (a) the Kinect sensor is automatically calibrated to an external HD camera using an algorithm based on SIFT features; (b) optical flow is calculated between each consecutive pair of frames in the video collected from the external camera; (c) pairs of motion vectors are stitched together to obtain the motion from the first frame to every other frame; (d) each of the stitched motion fields is projected onto the aligned 3-D surface of the Kinect; (e) surface strain is calculated for each pixel of the face; (f) a masking technique separates reliable strain estimations from regions where optical flow fails (such as the eyes, mouth, and the boundary of the head); (g) the strain map is back-projected to the original orientation of the Kinect sensor, which generates an aligned 3-D surface strain map. We have reported preliminary results in [4]; here we use a new method for automatic calibration and facial tracking and extend our experiments from 40 subjects and 2 expressions to over 100 subjects and 6 expressions from the BU4DFE database.


Fig. 1. Overall algorithm design.

While we have not found prior work that estimates 3-D surface strain on the face, we noted in [4] that there are many approaches for estimating 3-D volumetric motion. Perhaps the largest application area is the entertainment industry, where modeling the surface of the face during expressions is used to animate humanoid avatars in interactive video games or for realistic special effects in movies. Many of these methods (especially those that densely track key points on the face) do not work strictly from passive imaging of the face, but require manually applied facial landmarks and makeup for tracking and 3-D reconstruction, especially when lighting conditions are not ideal [5–7]. The main drawback of these methods is that they do not provide motion estimates dense enough to capture the elastic properties of the face. Closer to our method, a dense approach is given by Bradley et al. [8] that uses optical flow to track a mesh generated from seven pairs of cameras and nine light grids; however, the complexity of their acquisition setup may not be feasible in many applications. Other approaches that recover 3-D disparity from multi-view triangulation can be found in [8–10], many of which have been used to find non-rigid disparities [11–13]. Methods that use a single camera while exploiting knowledge of the object being studied (such as fitting a generic 3-D face mesh to a 2-D image) are discussed in [14], and some commercial and open-source projects also exist. One such product is [15], which is compatible with a range of commodity 3-D sensors and can update a live avatar mesh by automatically tracking several key points on the face; the open-source project [16] has similar capabilities but works with 2-D images. However, none of these methods provide per-pixel tracking, so using them to estimate surface strain is not feasible. Finally, dynamic depth map approaches [17,18] assume a single 3-D acquisition device with a calibrated color image, such as the Microsoft Kinect. The first estimates 3-D motion for rigid gestures rather than the non-rigid motion of the face; the latter uses pre-recorded animation priors to drive real-time avatar animations. Our approach also uses dynamic depth maps, estimating the 3-D surface strain on the face by tracking skin pixels over a video sequence captured from an external high-resolution camera that has been automatically calibrated and aligned with an RGB-D sensor.

The paper is organized as follows: Section 2 gives a brief overview of optical flow and how the resulting motion vectors are used to calculate surface strain. Section 3 describes the algorithm, including how the 3-D transform of the low-resolution depth map is solved to match an external HD camera. Section 4 provides two experiments, one that demonstrates the robustness of the method with respect to depth resolution and a second that compares the 3-D surface strain maps calculated from two HD cameras angled approximately 45 degrees apart. We provide our conclusions in Section 5.

2. Optical strain

Methods for calculating 2-D optical strain can be found in our earlier work [1,3]. In this paper, we propose a new method that uses depth approximations in order to calculate a more view-invariant 2-D strain map.

Fig. 2. Matched SIFT features from right camera view and the Kinect RGB image are given in (a). In (b), the aligned depth images are shown.


Fig. 3. A mask is generated by defining the boundaries of the face, eyes, and mouth using the 66 automatically detected points on the face shown in (a). Part (b) shows the original mask (strain values in white region pass through, black regions are set to 0). The transformed masks for the two external high resolution cameras are shown in (c) and (d) using the described calibration method. Part (e) shows the points that are visible to each view. The light gray areas are points only visible from the right camera, the dark gray areas are those found in the left camera, and the white regions are points common to both views.

The method works by first computing the 2-D motion along the surface using standard optical flow techniques. The motion vectors are then projected onto the 3-D surface of the face and adjusted based on its curvature.

2.1. Optical flow

Optical flow is based on the brightness conservation principle [19]. In general, it assumes (i) constant intensity at each point over a pair of frames, and (ii) smooth pixel displacement within a small image region. The first constraint is encapsulated by the following equation:

$$(\nabla I)^T \mathbf{p} + I_t = 0, \qquad (1)$$

where $I(x, y, t)$ is the image intensity at point $(x, y)$ at time $t$, $\nabla I$ is the spatial gradient, and $I_t$ is the temporal derivative. The horizontal and vertical motion vectors are represented by $\mathbf{p} = [\,p = dx/dt,\; q = dy/dt\,]^T$. We tested several optical flow implementations, which are summarized in [2]. Based on these experiments, we chose the [20] implementation of the Horn–Schunck method due to the speed, accuracy, and flexibility it provides. The Horn–Schunck method formulates the flow using the following global energy functional:

$$\iint \left[(I_x u + I_y v + I_t)^2 + \alpha^2\left(\lVert\nabla u\rVert^2 + \lVert\nabla v\rVert^2\right)\right] dx\,dy, \qquad (2)$$

where $I_x$, $I_y$, and $I_t$ are the derivatives of the image intensities in the $x$, $y$, and $t$ dimensions. The optical flow is denoted by $\vec{V} = [u(x, y), v(x, y)]^T$ and $\alpha$ is a regularization constant whose value specifies the degree of smoothness in the flow fields. This global energy functional can be minimized using the Euler–Lagrange equations; an iterative solution of these equations is derived in [21]:

Fig. 4. Depth reconstruction and surface strain maps for two example expressions (happy and anger) with the depth data at resolutions of 200 × 200, 50 × 50, and 40 × 40 (each column after the regular image). Depth maps are quantized to 20 levels to highlight consistency.


Fig. 5. Box plots of the correlation coefficients of the 3-D surface strain when using full 1:1 depth information and depth downsampled at ratios of 1:2, 1:3, 1:4, and 1:5 (depth resolutions of 100 × 100, 66 × 66, 50 × 50, and 40 × 40). One panel is shown per expression (happy, sad, angry, disgust, fear, surprise), plotting the correlation coefficient against depth resolution. Results are for all 101 subjects performing 6 expressions. The mean is denoted by the larger solid circle, while outliers are denoted by gray dots (approximately ±2.7σ).

$$u^{k+1} = \bar{u}^k - \frac{I_x\left(I_x \bar{u}^k + I_y \bar{v}^k + I_t\right)}{\alpha^2 + I_x^2 + I_y^2}, \qquad v^{k+1} = \bar{v}^k - \frac{I_y\left(I_x \bar{u}^k + I_y \bar{v}^k + I_t\right)}{\alpha^2 + I_x^2 + I_y^2}. \qquad (3)$$

The weighted average of $u$ in a fixed neighborhood at pixel location $(x, y)$ is denoted by $\bar{u}(x, y)$. With respect to $\alpha$, we empirically found that $\alpha = 0.05$ and $k = 200$ offered a good trade-off between capturing the non-rigid motion of the face and suppressing noisy motion outliers.
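As a concrete illustration of Eq. (3), the following is a minimal NumPy/SciPy sketch of the Horn–Schunck iteration. It is not the MATLAB implementation [20] used in the paper; the derivative and averaging kernels are common textbook choices and should be treated as assumptions.

```python
import numpy as np
from scipy.ndimage import convolve

def horn_schunck(im1, im2, alpha=0.05, n_iter=200):
    """Minimal Horn-Schunck optical flow (Eq. 3). Returns (u, v)."""
    im1 = im1.astype(np.float64)
    im2 = im2.astype(np.float64)

    # Spatial and temporal derivatives (simple averaging kernels).
    kx = np.array([[-1.0, 1.0], [-1.0, 1.0]]) * 0.25
    ky = np.array([[-1.0, -1.0], [1.0, 1.0]]) * 0.25
    kt = np.ones((2, 2)) * 0.25
    Ix = convolve(im1, kx) + convolve(im2, kx)
    Iy = convolve(im1, ky) + convolve(im2, ky)
    It = convolve(im2, kt) - convolve(im1, kt)

    # Kernel for the local weighted averages u_bar, v_bar.
    avg = np.array([[1, 2, 1], [2, 0, 2], [1, 2, 1]], dtype=np.float64) / 12.0

    u = np.zeros_like(im1)
    v = np.zeros_like(im1)
    denom = alpha ** 2 + Ix ** 2 + Iy ** 2
    for _ in range(n_iter):
        u_bar = convolve(u, avg)
        v_bar = convolve(v, avg)
        common = (Ix * u_bar + Iy * v_bar + It) / denom
        u = u_bar - Ix * common   # update of Eq. (3)
        v = v_bar - Iy * common
    return u, v
```

The default values of alpha and n_iter mirror the empirical settings reported above.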

Our method is based on tracking the motion occurring on the face during facial expressions, where expressions can last several seconds and consist of motion that is too large to be accurately and consistently tracked by frame-to-frame optical flow. To solve this, we use a vector stitching technique that combines pairs of vectors from consecutive frames in order to recover the larger overall displacements occurring over several frames [22]. In our previous work [1], we showed that a significant amount of noise accumulates after several hundred frames (over 800) in a video; in this work, however, we only need to stitch roughly 30–50 frames. Although we chose vector stitching, there are alternative methods for tracking the motion during facial expressions, including different approaches for calculating optical flow (a summary evaluation of several optical flow techniques can be found in [23]). A coarse-to-fine approach capable of tracking larger motions is discussed in [19]; however, its time complexity for large HD images was significantly greater than our approach of stitching together smaller flow estimations (3–5 min per frame compared with 2–4 s per frame). Another approach is to warp the 3-D model to a 2-D plane using a diffeomorphic mapping, such as the harmonic mapping given in [24]; optical flow, and hence optical strain, would then be calculated in the warped domain and inversely mapped back onto the 3-D model.
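A minimal sketch of the vector stitching idea, composing the frame-to-frame flow fields into cumulative flows from the first frame. The representation of flows as (u, v) arrays and the bilinear resampling via map_coordinates are our assumptions, not necessarily the interpolation scheme used in [22].

```python
import numpy as np
from scipy.ndimage import map_coordinates

def stitch_flows(pairwise_flows):
    """Compose consecutive flows (u_t, v_t): frame t -> t+1 into
    cumulative flows from frame 0 to every later frame."""
    u0, v0 = pairwise_flows[0]
    h, w = u0.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)

    cum_u, cum_v = u0.copy(), v0.copy()
    stitched = [(cum_u.copy(), cum_v.copy())]
    for u_t, v_t in pairwise_flows[1:]:
        # Sample the next pairwise flow at the positions the pixels
        # have already moved to (bilinear interpolation, order=1).
        coords = [ys + cum_v, xs + cum_u]
        du = map_coordinates(u_t, coords, order=1, mode='nearest')
        dv = map_coordinates(v_t, coords, order=1, mode='nearest')
        cum_u = cum_u + du
        cum_v = cum_v + dv
        stitched.append((cum_u.copy(), cum_v.copy()))
    return stitched
```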

Fig. 6. Acquisition setup for the USF-Kinect dataset, used in experiment 2 to demonstrate the method's robustness under view changes.

2.2. Strain

The 3-D displacement of any deformable object can be expressed by a vector $\mathbf{u} = [u, v, w]^T$. If the motion is small enough, then the corresponding finite strain tensor is defined as:


$$\epsilon = \frac{1}{2}\left[\nabla \mathbf{u} + (\nabla \mathbf{u})^T\right], \qquad (4)$$

which can be expanded to the form:

$$
\epsilon = \begin{bmatrix}
\epsilon_{xx} = \frac{\partial u}{\partial x} &
\epsilon_{yx} = \frac{1}{2}\left(\frac{\partial u}{\partial y} + \frac{\partial v}{\partial x}\right) &
\epsilon_{zx} = \frac{1}{2}\left(\frac{\partial w}{\partial x} + \frac{\partial u}{\partial z}\right) \\
\epsilon_{xy} = \frac{1}{2}\left(\frac{\partial v}{\partial x} + \frac{\partial u}{\partial y}\right) &
\epsilon_{yy} = \frac{\partial v}{\partial y} &
\epsilon_{zy} = \frac{1}{2}\left(\frac{\partial w}{\partial y} + \frac{\partial v}{\partial z}\right) \\
\epsilon_{xz} = \frac{1}{2}\left(\frac{\partial u}{\partial z} + \frac{\partial w}{\partial x}\right) &
\epsilon_{yz} = \frac{1}{2}\left(\frac{\partial v}{\partial z} + \frac{\partial w}{\partial y}\right) &
\epsilon_{zz} = \frac{\partial w}{\partial z}
\end{bmatrix} \qquad (5)
$$

where $(\epsilon_{xx}, \epsilon_{yy}, \epsilon_{zz})$ are normal strain components and $(\epsilon_{xy}, \epsilon_{xz}, \epsilon_{yz})$ are shear strain components. Since each of these strain components is a function of the displacement vectors $(u, v, w)$ over a continuous space, each strain component is approximated using the discrete optical flow data $(p, q, r)$:

$$p = \frac{\Delta x}{\Delta t} = \frac{u}{\Delta t}, \qquad u = p\,\Delta t, \qquad (6)$$
$$q = \frac{\Delta y}{\Delta t} = \frac{v}{\Delta t}, \qquad v = q\,\Delta t, \qquad (7)$$
$$r = \frac{\Delta z}{\Delta t} = \frac{w}{\Delta t}, \qquad w = r\,\Delta t, \qquad (8)$$

where $\Delta t$ is the change in time between two image frames. Setting $\Delta t$ to a fixed interval length, we approximate the first-order partial derivatives:

$$\frac{\partial u}{\partial x} = \frac{\partial p}{\partial x}\Delta t, \qquad \frac{\partial u}{\partial y} = \frac{\partial p}{\partial y}\Delta t, \qquad \frac{\partial u}{\partial z} = \frac{\partial p}{\partial z}\Delta t, \qquad (9)$$
$$\frac{\partial v}{\partial x} = \frac{\partial q}{\partial x}\Delta t, \qquad \frac{\partial v}{\partial y} = \frac{\partial q}{\partial y}\Delta t, \qquad \frac{\partial v}{\partial z} = \frac{\partial q}{\partial z}\Delta t, \qquad (10)$$
$$\frac{\partial w}{\partial x} = \frac{\partial r}{\partial x}\Delta t, \qquad \frac{\partial w}{\partial y} = \frac{\partial r}{\partial y}\Delta t, \qquad \frac{\partial w}{\partial z} = \frac{\partial r}{\partial z}\Delta t. \qquad (11)$$

Finally, the above computation scheme can be implemented using a variety of methods that take the spatial derivative over a finite number of points. We chose the central difference method due to its computational efficiency, thus calculating the normal strain components as:

$$\frac{\partial u}{\partial x} = \frac{u(x + \Delta x, y, z) - u(x - \Delta x, y, z)}{2\Delta x} = \frac{p(x + \Delta x, y, z) - p(x - \Delta x, y, z)}{2\Delta x}, \qquad (12)$$
$$\frac{\partial v}{\partial y} = \frac{v(x, y + \Delta y, z) - v(x, y - \Delta y, z)}{2\Delta y} = \frac{q(x, y + \Delta y, z) - q(x, y - \Delta y, z)}{2\Delta y}, \qquad (13)$$
$$\frac{\partial w}{\partial z} = \frac{w(x, y, z + \Delta z) - w(x, y, z - \Delta z)}{2\Delta z} = \frac{r(x, y, z + \Delta z) - r(x, y, z - \Delta z)}{2\Delta z}, \qquad (14)$$

where $\Delta x = \Delta y = \Delta z = 1$ pixel.

2.3. Surface strain

The equations in the previous section describe 3-D volumetric strain. In our approach, however, we adjust the 2-D surface motion along a 3-D volume by supplementing the 2-D optical flow vectors with a depth component. Due to the low resolution of the Kinect data, the depth value for each point on the face is determined by the change in $z$ along a locally fit plane. However, because of the assumption that the face is locally flat, all components that differentiate with respect to $z$ are zero. Hence, we are left with two normal strains $(\epsilon'_{xx}, \epsilon'_{yy})$, one shear strain $(\epsilon'_{xy})$, and two partial shears $(\epsilon'_{xz}, \epsilon'_{yz})$. We can then rewrite Eq. (5) as follows:

$$
\epsilon' = \begin{bmatrix}
\epsilon'_{xx} = \frac{\partial u}{\partial x} &
\epsilon'_{yx} = \frac{1}{2}\left(\frac{\partial u}{\partial y} + \frac{\partial v}{\partial x}\right) &
\epsilon'_{zx} = \frac{\partial w}{2\,\partial x} \\
\epsilon'_{xy} = \frac{1}{2}\left(\frac{\partial v}{\partial x} + \frac{\partial u}{\partial y}\right) &
\epsilon'_{yy} = \frac{\partial v}{\partial y} &
\epsilon'_{zy} = \frac{\partial w}{2\,\partial y} \\
\epsilon'_{xz} = \frac{\partial w}{2\,\partial x} &
\epsilon'_{yz} = \frac{\partial w}{2\,\partial y} &
\epsilon'_{zz} = 0
\end{bmatrix}. \qquad (15)
$$

Finally, we calculate the strain magnitude using the following:

$$\epsilon'_{mag} = \sqrt{(\epsilon'_{xx})^2 + (\epsilon'_{yy})^2 + (\epsilon'_{xy})^2 + (\epsilon'_{xz})^2 + (\epsilon'_{yz})^2}. \qquad (16)$$

The values in $\epsilon'_{mag}$ can then be normalized to $[0, 255]$ to generate a strain map image in which low and high intensity values correspond to small and large strain deformations, respectively (the un-normalized values are used for comparison in our experiments). Note that in contrast to our previous work, which was limited to the projected 2-D motion [1], this final surface strain equation also includes the motion lost due to foreshortening.
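A minimal NumPy sketch of Eqs. (12)–(16), assuming the flow components p and q and a depth-derived out-of-plane component r are already available as per-pixel arrays; np.gradient supplies the central differences with a one-pixel spacing.

```python
import numpy as np

def surface_strain_magnitude(p, q, r):
    """Surface strain magnitude (Eq. 16) from horizontal flow p,
    vertical flow q and the depth-derived out-of-plane motion r.
    Central differences with 1-pixel spacing via np.gradient."""
    # np.gradient returns derivatives along (rows = y, cols = x).
    du_dy, du_dx = np.gradient(p)
    dv_dy, dv_dx = np.gradient(q)
    dw_dy, dw_dx = np.gradient(r)

    e_xx = du_dx                      # normal strains (Eq. 15)
    e_yy = dv_dy
    e_xy = 0.5 * (dv_dx + du_dy)      # in-plane shear
    e_xz = 0.5 * dw_dx                # partial shears
    e_yz = 0.5 * dw_dy

    return np.sqrt(e_xx**2 + e_yy**2 + e_xy**2 + e_xz**2 + e_yz**2)

def to_strain_map(mag):
    """Normalize un-normalized magnitudes to a [0, 255] image."""
    rng = mag.max() - mag.min()
    if rng == 0:
        return np.zeros_like(mag, dtype=np.uint8)
    return ((mag - mag.min()) / rng * 255).astype(np.uint8)
```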

3. Algorithm

The algorithm is based on calculating optical flow on a 2-D image, with the depth approximated by a low-resolution sensor. Since cheaper commodity RGB-D sensors such as the Microsoft Kinect do not have sufficient optics and resolution for accurate flow tracking, we align a separate HD webcam with the Kinect depth model. Hence, optical flow is performed on the HD image and then projected onto a depth map which has been aligned to the same image. The overall design of the algorithm can be found in Fig. 1.

3.1. Automatic Kinect-to-webcam calibration

We automatically calibrate Kinect RGB-D images to images from an HD webcam using local invariant features, as proposed by Pamplona Segundo et al. [25]. First, the scale-invariant feature transform (SIFT) [26] is employed to extract keypoints from both the Kinect and webcam color images, represented by points in Fig. 2(a). Then, non-face keypoints are discarded and the remaining keypoints are used to establish correspondences between the Kinect and webcam images, illustrated as lines. Robustness against 3-D noise and correspondence errors is achieved by applying the RANSAC algorithm of Fischler and Bolles [27] to find the subset of correspondences whose transformation maximizes the number of Kinect keypoints projected onto their corresponding webcam keypoints. At least four correct correspondences are required to estimate the transformation that projects the Kinect depth data into the webcam coordinate system (see Fig. 2(b)), considering a camera model with seven parameters [28]: translation and rotation along the X, Y, and Z axes, and focal length. To retrieve these parameters for a given subset of four correspondences we use the Levenberg–Marquardt minimization approach described by Marquardt [29].
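For illustration only, the sketch below shows a simplified version of this calibration step using OpenCV: SIFT keypoints are matched between the Kinect RGB image and the webcam image, and a pose is estimated with solvePnPRansac under the assumption of a known, fixed webcam focal length. It is a stand-in for, not a reproduction of, the seven-parameter Levenberg–Marquardt fit of [25,28,29]; the face-region filtering of keypoints is also omitted.

```python
import cv2
import numpy as np

def calibrate_kinect_to_webcam(kinect_rgb, kinect_xyz, webcam_img, K_webcam):
    """Estimate a transform projecting Kinect 3-D points into the webcam
    image. kinect_xyz: HxWx3 metric coordinates per Kinect pixel,
    K_webcam: assumed (known) 3x3 webcam intrinsic matrix."""
    sift = cv2.SIFT_create()
    kp_k, des_k = sift.detectAndCompute(kinect_rgb, None)
    kp_w, des_w = sift.detectAndCompute(webcam_img, None)

    # Match descriptors with a ratio test.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.knnMatch(des_k, des_w, k=2)
    good = [m[0] for m in matches
            if len(m) == 2 and m[0].distance < 0.75 * m[1].distance]

    # Pair Kinect 3-D points with their 2-D webcam correspondences.
    obj_pts, img_pts = [], []
    for m in good:
        x, y = map(int, kp_k[m.queryIdx].pt)
        X = kinect_xyz[y, x]
        if np.isfinite(X).all() and X[2] > 0:      # valid depth only
            obj_pts.append(X)
            img_pts.append(kp_w[m.trainIdx].pt)
    obj_pts = np.array(obj_pts, dtype=np.float64)
    img_pts = np.array(img_pts, dtype=np.float64)

    # RANSAC pose estimation (rotation + translation), in the spirit of
    # the RANSAC step of [27]; focal length is assumed known here.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        obj_pts, img_pts, K_webcam, None, reprojectionError=3.0)
    return rvec, tvec, inliers
```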

3.2. Fitting planes using least squares

When working with a depth map whose resolution is lower than that of its calibrated RGB image, many points within the image will not have depth values. In our experiments using the Kinect sensor, the face typically occupies approximately 500 × 500 pixels in the aligned high-resolution camera, with noisy depth values available for only a portion (approximately half) of these coordinates. In order to smooth and interpolate the missing depth values, we fit a plane to every 5 × 5 neighborhood using least squares fitting; each pixel is then updated to its new planar value based on its fitted plane.
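A minimal sketch of this plane-fitting step, assuming the aligned depth map is stored as a floating-point array with NaN marking missing values; each pixel is replaced by the value of the plane z = ax + by + c fitted to its 5 × 5 neighborhood by least squares.

```python
import numpy as np

def fill_depth_with_planes(depth, win=5):
    """Fit a plane z = a*x + b*y + c to each win x win neighborhood of a
    depth map (NaN = missing) and replace the center pixel with its
    planar value. Smooths noise and interpolates small holes.
    (Plain Python loops for clarity; not optimized.)"""
    h, w = depth.shape
    r = win // 2
    out = depth.copy()
    ys, xs = np.mgrid[-r:r + 1, -r:r + 1]
    for y in range(r, h - r):
        for x in range(r, w - r):
            patch = depth[y - r:y + r + 1, x - r:x + r + 1]
            valid = np.isfinite(patch)
            if valid.sum() < 3:          # need >= 3 points for a plane
                continue
            A = np.column_stack([xs[valid], ys[valid], np.ones(valid.sum())])
            coeffs, *_ = np.linalg.lstsq(A, patch[valid], rcond=None)
            out[y, x] = coeffs[2]        # plane value at the center (0, 0)
    return out
```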


Fig. 7. Example of the automatic calibration process: (a) and (d) show correspondences between the Kinect and the right webcam in blue and between the Kinect and the left webcam in red; (b) and (c) show the resulting projection of the Kinect depth data (center) to the coordinate system of the right webcam (left) and of the left webcam (right). Figure best viewed in color.

3.3. Facial landmark detection and masking

Optical strain maps are derived from optical flow estimates that are susceptible to tracking failures, which in turn corrupt the strain estimates. In general, the regions most susceptible to this kind of error are located around the eyes and mouth, due to the self-occlusions that occur when subjects open and close them. Moreover, since flow estimation also fails at the boundary of the head (due to the head moving in front of the background), areas outside of the head are masked. We segment these regions of the face using the facial landmarks automatically detected by Saragih et al. [30]. Fig. 3 shows the detected facial landmarks as well as the original mask; the two views transformed using the transformation matrices identified by the method described in the previous section can also be found in this figure. Note that only one camera is needed to run the algorithm; however, to demonstrate the robustness of the method to view changes, we aligned two cameras for the experiment in Section 4.2. For that second experiment, we performed additional masking by comparing only values that are visible in both views (see part (e) of Fig. 3). These points are found by back-projecting every pixel from each calibrated view onto the 3-D model and incrementing a value at each location.
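As an illustration of the landmark-based masking described at the start of this subsection, a minimal OpenCV sketch is given below. It assumes the 66 landmarks have already been grouped into point lists for the face boundary, the two eyes, and the mouth; the exact landmark indices of the tracker in [30] are not reproduced here.

```python
import cv2
import numpy as np

def build_face_mask(shape, face_boundary, eyes, mouth):
    """Binary mask: 1 inside the face, 0 outside the head and inside
    the eye and mouth regions where optical flow is unreliable.
    face_boundary, eyes, mouth: lists of (x, y) landmark points."""
    mask = np.zeros(shape[:2], dtype=np.uint8)

    # Fill the convex hull of the facial boundary points with 1.
    hull = cv2.convexHull(np.array(face_boundary, dtype=np.int32))
    cv2.fillConvexPoly(mask, hull, 1)

    # Zero out the eye and mouth regions (self-occlusion failures).
    for region in list(eyes) + [mouth]:
        cv2.fillPoly(mask, [np.array(region, dtype=np.int32)], 0)
    return mask

# Usage (hypothetical point lists): strain_map * build_face_mask(
#     frame.shape, contour_pts, (left_eye_pts, right_eye_pts), mouth_pts)
```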

4. Experiments

4.1. Experiment 1: strain map consistencies at varying depth resolutions


Fig. 8. Example strain maps for the happy expression from two different views. The first two columns are the RGB images for the left and right camera. The next two images are the corresponding 2-D strain maps. The last two images are the corresponding 3-D surface strain maps. Note that the 3-D surface strain images only show the strain values for pixels that are common to both views (see Fig. 3 part e.)

Fig. 9. Relationship between the normalized correlation coefficients of the 2-D and 3-D surface strain maps calculated from two different views, plotted per frame together with the normalized summed strain magnitude (series: summed strain magnitude, correlation coefficient with 2-D alignment, correlation coefficient with 3-D alignment; x-axis: frame number; y-axis: normalized value). When the summed strain magnitude is high, the 3-D surface strain appears correlated between the two camera views, especially when compared to the correlation between 2-D strain maps after a feature-based alignment.

To demonstrate that a sparse depth resolution of the face is sufficient for reliable 3-D strain estimation, we compared strain maps calculated at the full resolution of the BU dataset's depth data with those calculated from several downsampled depth maps. This is a publicly available 3-D dataset [31] that contains a single view and over 600 sequences of high-resolution (1280 × 720) videos with depth data. We use 101 subjects and 6 expressions from this dataset to measure the consistency of strain map calculation at low depth resolutions. The depth resolution of the BU data is roughly 200 × 200 for the face image. The depth data is then downsampled to 100 × 100, 66 × 66, 50 × 50, and 40 × 40 (see Fig. 4). For each of these resolutions, missing depth values are filled by fitting a plane to the local neighborhood using linear least squares, as described in Section 3.2. We then estimate the 3-D surface strain using the 2-D optical flow and the downsampled depth map, and compare it to the estimate obtained with the original depth map by computing the correlation coefficient, which measures the per-pixel similarity using the normalized covariance. Overall, Fig. 5 shows that the strain estimation is stable across these resolutions, even when only 40 × 40 depth coordinates are used from the original 200 × 200 depth map. A slight decrease in the correlation coefficient is observed for each decrease in depth resolution; the distributions are tightly centered around their medians, and at least 85% correlation is maintained at all four depth resolutions.
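For completeness, a minimal sketch of the similarity measure used here and in the second experiment: the Pearson correlation coefficient (normalized covariance) between two strain maps, restricted to unmasked face pixels. The boolean-mask convention is our assumption.

```python
import numpy as np

def strain_correlation(strain_a, strain_b, mask):
    """Pearson correlation coefficient between two strain maps,
    computed only over pixels where mask is True."""
    a = strain_a[mask].ravel()
    b = strain_b[mask].ravel()
    return np.corrcoef(a, b)[0, 1]
```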


Fig. 10. Boxplots of the normalized correlation coefficients between the two views for the 3-D surface strain maps and the 2-D aligned strain maps, per subject (1–8). The mean of each boxplot is denoted by 'o', the line within the box denotes the median, and the range of outliers is denoted by the dotted lines (approximately ±2.7σ).

4.2. Experiment 2: surface strain maps from two views

While the first experiment demonstrated that the surface strain estimations are consistent when using low-resolution depth information, the second experiment (see Fig. 6) demonstrates the method's increased robustness to view by calculating two strain maps independently (one for each view) and then providing a similarity measure between the two. We developed a dataset containing two webcam views (roughly 45 degrees apart) and one Kinect sensor. The two cameras allow us to calculate two separate strain maps from two different views, which we can then compare for similarity. We recorded 8 subjects performing the smile and surprise facial expressions. Some example images from this dataset can be found in Fig. 7. Because large amounts of data were acquired simultaneously over USB 2.0 bandwidth, the frame rate of each video fluctuated between 5 and 6 frames per second; each sequence lasted approximately 50 frames, or 10 s. The first step is to align the Kinect's 3-D data to each external webcam (see Fig. 7). We then calculate the surface strain for each of these views and transform each of them back to the Kinect coordinate system, where we determine the normalized correlation coefficient between the two registered strain maps. Note that this is only done for points that are common to both views (see Fig. 3), since there are large occluded regions due to parallax between the views. For comparison, we also analyze the correlation of the two 2-D strain maps calculated from each camera after aligning both to the center view using an affine transformation based on the locations of the eyes and mouth. Fig. 8 shows a comparison of the 2-D strain maps and the 3-D surface strain maps generated using the method; qualitatively, the 3-D surface strain maps are similar between the views. Results for multiple frames in an example sequence are given in Fig. 9. This figure plots the summed strain magnitude along with the normalized correlation coefficients for both 2-D and 3-D, and quantitatively demonstrates that the facial surface strain estimation is consistent between both cameras. We also observe that the normalized correlation coefficient increases as the strain magnitude increases. This is likely due to the noise

in flow estimation when there is little or no motion occurring on the face. Hence, when calculating the correlation coefficients for all subjects in Fig. 10, we only consider frames where the normalized strain magnitude is over 80%. Overall, the results show a significant improvement when comparing strain maps computed with versus without 3-D information. We observe a 40–80% correlation between strain maps calculated from two different views when using the 3-D information to calculate surface strain and align them to a central view (the means, denoted by 'o', range from 60% to 70%). However, when using a 2-D based alignment, there is a significant decrease in correlation.

5. Conclusions

Optical strain maps generated from facial expressions capture the elastic deformation of facial skin tissue. This feature has been shown (in 2-D) to have wide application in several research areas, including biometrics and medical imaging. However, previous work on strain imaging has been limited to a single view, thus forcing training data to be manually aligned with future queries. In this paper, we propose a novel method that removes this restriction and allows for a range of views across multiple recording sessions. Furthermore, the method is able to recover motion that is lost due to distortions when a 3-D model is projected onto a 2-D image plane. The method works by first calibrating an external high-resolution camera to a low-resolution depth sensor using matched SIFT features. Next, optical flow is calculated in the external camera view and then projected onto the calibrated depth map in order to approximate the 3-D displacement. We have experimentally demonstrated that the method is robust at several depth resolutions: quantitative and qualitative results show that the surface strain maps for 101 subjects and over 600 expression sequences at 4 sub-sampled depth resolutions are highly correlated with the surface strain maps computed at full depth resolution. Moreover, when evenly downsampling a depth resolution of 200 × 200 to as little as 40 × 40, we lose less than 10% correlation on average compared to using the full depth resolution. We have also reported results that demonstrate the increased view-invariance of the method by recording subjects from two views approximately 45 degrees apart. In this experiment, we show that the independent 3-D surface strain maps generated from each of these two views are highly correlated. This is especially true when the amount of strain is relatively high, as opposed to when


there is little to no strain present, because of noise in the optical flow estimation. In our future work, we will analyze several applications of this method and its potential to boost performance over methods that are based on 2-D strain.

References

[1] M. Shreve, S. Godavarthy, D. Goldgof, S. Sarkar, Macro- and micro-expression spotting in long videos using spatio-temporal strain, in: International Conference on Automatic Face and Gesture Recognition, 2011, pp. 51–56.
[2] M. Shreve, N. Jain, D. Goldgof, S. Sarkar, W. Kropatsch, C.-H.J. Tzou, M. Frey, Evaluation of facial reconstructive surgery on patients with facial palsy using optical strain, in: Proceedings of the 14th International Conference on Computer Analysis of Images and Patterns, 2011, pp. 512–519.
[3] M. Shreve, V. Manohar, D. Goldgof, S. Sarkar, Face recognition under camouflage and adverse illumination, in: First IEEE International Conference on Biometrics: Theory, Applications, and Systems, 2010, pp. 1–6.
[4] M. Shreve, S. Fefilatyev, N. Bonilla, D. Goldgof, S. Sarkar, Method for calculating view-invariant 3D optical strain, in: International Workshop on Depth Image Analysis, 2012.
[5] B. Bickel, M. Botsch, R. Angst, W. Matusik, M. Otaduy, H. Pfister, Multi-scale capture of facial geometry and motion, in: ACM Transactions on Graphics, vol. 29, 2007, p. 33.
[6] V. Blanz, C. Basso, T. Poggio, T. Vetter, Reanimating faces in images and video, Comput. Graphics Forum 22 (2003) 641–650.
[7] I. Lin, M. Ouhyoung, Mirror MoCap: automatic and efficient capture of dense 3D facial motion parameters, Visual Comput. 21 (2005) 355–372.
[8] D. Bradley, W. Heidrich, T. Popa, A. Sheffer, High resolution passive facial performance capture, in: ACM Transactions on Graphics, vol. 29, 2010, p. 41.
[9] Y. Furukawa, J. Ponce, Dense 3D motion capture from synchronized video streams, in: Image and Geometry Processing for 3-D Cinematography, 2010, pp. 193–211.
[10] J. Pons, R. Keriven, O. Faugeras, Multi-view stereo reconstruction and scene flow estimation with a global image-based matching score, Int. J. Comput. Vision 72 (2007) 179–193.
[11] L. Valgaerts, C. Wu, A. Bruhn, H.-P. Seidel, C. Theobalt, Lightweight binocular facial performance capture under uncontrolled lighting, ACM Trans. Graphics 31 (6) (2012) 1–11.
[12] T. Beeler, F. Hahn, D. Bradley, B. Bickel, P. Beardsley, C. Gotsman, R.W. Sumner, M. Gross, High-quality passive facial performance capture using anchor frames, in: ACM SIGGRAPH 2011 Papers, SIGGRAPH '11, 2011, pp. 75:1–75:10.
[13] M. Klaudiny, A. Hilton, High-detail 3D capture and non-sequential alignment of facial performance, in: Second International Conference on 3D Imaging, Modeling, Processing, Visualization and Transmission (3DIMPVT), 2012, pp. 17–24.
[14] M. Penna, The incremental approximation of nonrigid motion, Comput. Vision 60 (1994) 141–156.
[15] Faceshift, Markerless motion capture at every desk, 2013. http://www.faceshift.com/.
[16] Livedriver, Face tracking, 2013.
[17] S. Hadfield, R. Bowden, Kinecting the dots: particle based scene flow from depth sensors, in: Proceedings of the International Conference on Computer Vision, 2011, pp. 2290–2295.
[18] T. Weise, S. Bouaziz, H. Li, M. Pauly, Realtime performance-based facial animation, in: ACM Transactions on Graphics, vol. 30, 2011, p. 77.
[19] M.J. Black, P. Anandan, The robust estimation of multiple motions: parametric and piecewise-smooth flow fields, Comput. Vision Image Understanding 63 (1996) 75–104. http://dx.doi.org/10.1006/cviu.1996.0006.
[20] MATLAB, version 7.10.0 (R2010a), The MathWorks Inc., Natick, Massachusetts, 2010.
[21] B.K.P. Horn, B.G. Schunck, Determining optical flow, Artif. Intell. 17 (1981) 185–203.
[22] S. Baker, S. Roth, D. Scharstein, M. Black, J.P. Lewis, R. Szeliski, A database and evaluation methodology for optical flow, in: IEEE 11th International Conference on Computer Vision, 2007, pp. 1–8.
[23] S. Baker, D. Scharstein, J. Lewis, S. Roth, M. Black, R. Szeliski, A database and evaluation methodology for optical flow, Int. J. Comput. Vision 92 (2011) 1–31.
[24] Y. Wang, M. Gupta, S. Zhang, S. Wang, X. Gu, D. Samaras, P. Huang, High resolution tracking of non-rigid motion of densely sampled 3D data using harmonic maps, Int. J. Comput. Vision 76 (3) (2008) 283–300. http://dx.doi.org/10.1007/s11263-007-0063-y.
[25] M. Pamplona Segundo, L. Gomes, O.R.P. Bellon, L. Silva, Automating 3D reconstruction pipeline by SURF-based alignment, in: International Conference on Image Processing, 2012, pp. 1761–1764.
[26] D.G. Lowe, Object recognition from local scale-invariant features, in: International Conference on Computer Vision, 1999, pp. 1150–1157.
[27] M.A. Fischler, R.C. Bolles, Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography, Commun. ACM 24 (6) (1981) 381–395.
[28] M. Corsini, M. Dellepiane, F. Ponchio, R. Scopigno, Image-to-geometry registration: a mutual information method exploiting illumination-related geometric properties, Comput. Graphics Forum 28 (7) (2009) 1755–1764.
[29] D.W. Marquardt, An algorithm for least-squares estimation of nonlinear parameters, SIAM J. Appl. Math. 11 (2) (1963) 431–441.
[30] J.M. Saragih, S. Lucey, J.F. Cohn, Face alignment through subspace constrained mean-shifts, in: International Conference on Computer Vision, 2009.
[31] L. Yin, X. Chen, Y. Sun, T. Worm, M. Reale, A high-resolution 3D dynamic facial expression database, in: The 8th International Conference on Automatic Face and Gesture Recognition, 2008, pp. 1–6.
