High-resolution 3D surface strain magnitude using 2D camera and low-resolution depth sensor

Matthew Shreve, Mauricio Pamplona, Timur Luguev, Dmitry Goldgof, Sudeep Sarkar

Department of Computer Science and Engineering, University of South Florida, 4202 E. Fowler Ave., Tampa, FL, USA
Keywords: 3-D, Optical strain, Optical flow
Abstract

Generating 2-D strain maps of the face during facial expressions provides a useful feature that captures the bio-mechanics of facial skin tissue, and has had wide application in several research areas. However, most applications have been restricted to collecting data on a single pose. Moreover, methods that strictly use 2-D images for motion estimation can potentially suppress large strains because of projective distortions caused by the curvature of the face. This paper proposes a method that allows estimation of 3-D surface strain using a low-resolution depth sensor. The algorithm consists of automatically aligning a rough approximation of a 3-D surface with an external high resolution camera image. We provide experimental results that demonstrate the robustness of the method on a dataset collected using the Microsoft Kinect synchronized with two external high resolution cameras, as well as 101 subjects from a publicly available 3-D facial expression video database (BU4DFE).
1. Introduction

Capturing and quantifying the strain observed during facial expressions has found useful application in areas such as expression detection, face identification, and medical evaluations [1–3]. However, one limitation in each of these applications (especially the latter two) is that, in order to estimate the consistency of strain over several acquisition sessions, each subject was required to provide a pose that aligned with prior data. We describe a method that is more invariant to view than when strictly using 2-D, thus allowing greater variability in data acquisition. There are several advantages to supplementing 2-D motion estimations with aligned 3-D depth values. Due to the foreshortening that occurs when the curvature of the face is projected onto the 2-D image plane, the horizontal motion estimated along these points is often suppressed. By projecting these vectors back onto the 3-D surface, we are able to adjust them to better approximate the true displacements. Similarly, motion that is orthogonal to the image plane is lost in 2-D-only approximations, but recovered when using 3-D data. This method is possible because the surface of the face is a 2-D surface embedded in 3-D. Moreover, we assume that any local region of the face (a 3×3 neighborhood) is flat and can be approximated using a 2-D plane.
Hence, we can use standard optical flow techniques to describe the 2-D motion, adjust them using 3-D depth approximations, and then determine the 3-D strain tensors. The proposed algorithm consists of the following steps: (a) the Kinect sensor is automatically calibrated to an external HD camera using an algorithm that is based on SIFT features; (b) optical flow is calculated between each consecutive pair of frames in the video collected from the external camera; (c) we then stitch together pairs of motion vectors in order to obtain the motion from the first frame to every other frame; (d) each of the stitched motion fields is projected onto the aligned 3-D surface of the Kinect; (e) surface strain is calculated for each pixel of the face; (f) a masking technique is used that separates reliable strain estimations from those regions where optical flow fails (such as the eyes, mouth, and the boundary of the head); (g) the strain map is back-projected to the original orientation of the Kinect sensor, which generates an aligned 3-D surface strain map. We have reported preliminary results in [4]. However, we now use a new method for automatic calibration and facial tracking and extend our experiments from 40 subjects and two expressions to over 100 subjects and six expressions from the BU4DFE database. While we have not found work in the literature that estimates the 3-D surface strain on the face, we found in [4] that there are many approaches for estimating 3-D volumetric motion. Perhaps one of the biggest applications is for the entertainment industry, where modeling the surface of the face during expressions can be used to animate humanoid avatars in interactive video games
Fig. 1. Overall algorithm design.
or for realistic special effects in movies. Many of these methods (especially those that densely track key points on the face) do not work strictly from passive imaging of the face, but require the manual application of facial landmarks and makeup that are used for tracking and 3-D reconstruction, especially when lighting conditions are not ideal [5–7]. The main drawback of these methods is that they do not provide dense enough motion estimation to allow for the capture of the elastic properties of the face. More similar to our method, a dense approach is given by Bradley et al. [8] that uses optical flow to track a mesh generated from seven pairs of cameras and nine light grids. However, because of the complexity of their acquisition device, it may not be feasible in many applications. Other approaches that attempt to find 3-D disparity from multiple-view triangulation can be found in [8–10], many of which have been used to find non-rigid disparities [11–13]. Methods that use a single camera while exploiting knowledge of the object being studied (such as updating a generic 3-D face mesh on a 2-D image) are discussed in [14], while some commercial and open-source projects can also be found. One such product is [15], which is compatible with a range of commodity 3-D sensors and can update a live avatar mesh by automatically tracking several key points on the face. Alternatively, the open source project [16] has similar capabilities, but works with 2-D images. However, none of these methods provide per-pixel tracking, thus using them to estimate surface strain is not feasible. Finally, dynamic depth map approaches [17,18] assume a single 3-D acquisition device with a calibrated color image, such as the Microsoft Kinect. In the first approach, 3-D motion is estimated for rigid gestures as opposed to the non-rigid motion of the face; the latter uses pre-recorded animation priors to drive real-time avatar animations. Our approach also uses dynamic depth maps in order to estimate the 3-D surface strain on the face by tracking the skin pixels over a video sequence captured from an external high-resolution camera that has been automatically calibrated/aligned with an RGB-D sensor. The paper is organized in the following way: in Section 2 we give a brief overview of optical flow and how we can use these motion vectors to calculate surface strain. Section 3 gives an approach for solving the 3-D transform of the low resolution depth map to match that of an external HD camera. Section 4 provides two experiments, one that demonstrates the robustness of the method
with respect to depth resolution, and a second that compares the 3-D surface strain maps calculated from two HD cameras angled approximately 45° apart. We provide our conclusions in Section 5.
2. Optical strain

Methods for calculating 2-D optical strain can be found in our earlier work [1,3]. In this paper, we propose a new method that uses depth approximation in order to calculate a more view-invariant 2-D strain map. The method works by first computing the 2-D
Fig. 2. Matched SIFT features from right camera view and the Kinect RGB image are given in (a). In (b), the aligned depth images are shown.
Fig. 3. A mask is generated by defining the boundaries of the face, eyes, and mouth using the 66 automatically detected points on the face shown in (a). Part (b) shows the original mask (strain values in white region pass through, black regions are set to 0). The transformed masks for the two external high resolution cameras are shown in (c) and (d) using the described calibration method. Part (e) shows the points that are visible to each view. The light gray areas are points only visible from the right camera, the dark gray areas are those found in the left camera, and the white regions are points common to both views.
motion along the surface using standard optical flow techniques. Then, the motion vectors are projected to the 3-D surface of the face and adjusted based on the curvature of the face.

2.1. Optical flow

Optical flow is based on the brightness conservation principle [19]. In general, it assumes (i) constant intensity at each point over a pair of frames, and (ii) smooth pixel displacement within a small image region. The first constraint is encapsulated by the following equation:
$$(\nabla I)^T \mathbf{p} + I_t = 0, \qquad (1)$$

where $I(x, y, t)$ represents the image intensity at point $(x, y)$ at time $t$, $\nabla I$ represents the spatial intensity gradient, and $I_t$ the temporal derivative. The horizontal and vertical motion vectors are represented by $\mathbf{p} = [p = dx/dt,\; q = dy/dt]^T$. We tested several optical flow implementations, which are summarized in [2]. Based on these experiments, we chose the [20] implementation of the Horn–Schunck method due to the speed, accuracy, and flexibility that the implementation provided. First, the Horn–Schunck method formulates the flow using the following global energy functional:
$$E = \iint \left[ (I_x u + I_y v + I_t)^2 + \alpha^2 \left( \|\nabla u\|^2 + \|\nabla v\|^2 \right) \right] dx\, dy, \qquad (2)$$

where $I_x$, $I_y$, and $I_t$ are the derivatives of the image intensities in the $x$, $y$, and $t$ dimensions. The optical flow is denoted by $\vec{V} = [u(x, y), v(x, y)]^T$ and $\alpha$ is a regularization constant. The choice of $\alpha$ specifies the degree of smoothness in the flow fields. This global energy function can be minimized using the Euler–Lagrange equations. An iterative solution of these equations is derived in [21]:
Fig. 4. Depth reconstruction and surface strain maps for two example expressions (happy and anger) with the depth data at resolutions of 200×200, 50×50, and 40×40 (each column after the regular image). Depth maps are quantized to 20 levels to highlight consistency.
Fig. 5. Box plots of the correlation coefficients of the 3-D surface strain when using full 1:1 depth information and depth downsampled at ratios of 1:2, 1:3, 1:4, and 1:5 (one panel per expression: Happy, Sad, Angry, Disgust, Fear, and Surprise; x-axis: depth resolution of 100×100, 66×66, 50×50, and 40×40; y-axis: correlation coefficient). Results are for all 101 subjects performing 6 expressions. The mean is denoted by the larger solid circle, while outliers are denoted by gray dots (approximately 2.7σ).
$$u^{k+1} = \bar{u}^k - \frac{I_x \left( I_x \bar{u}^k + I_y \bar{v}^k + I_t \right)}{\alpha^2 + I_x^2 + I_y^2}, \quad
v^{k+1} = \bar{v}^k - \frac{I_y \left( I_x \bar{u}^k + I_y \bar{v}^k + I_t \right)}{\alpha^2 + I_x^2 + I_y^2}. \qquad (3)$$

The weighted average of $u$ in a fixed neighborhood at a pixel location $(x, y)$ is denoted by $\bar{u}$. With respect to $\alpha$, we empirically found that $\alpha = 0.05$ and $k = 200$ iterations offered a good trade-off between capturing the non-rigid motion of the face and suppressing noisy motion outliers.
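For concreteness, the update in Eq. (3) can be sketched as follows. This is a minimal NumPy illustration rather than the implementation used in this work (we used the MATLAB-based implementation [20]); the derivative and averaging kernels shown here are common choices and should be treated as assumptions.

```python
import numpy as np
from scipy.ndimage import convolve

def horn_schunck(I1, I2, alpha=0.05, n_iter=200):
    """Estimate dense optical flow (p, q) between two grayscale frames
    using the iterative update of Eq. (3). I1, I2: float arrays in [0, 1]."""
    I1 = I1.astype(np.float64)
    I2 = I2.astype(np.float64)

    # Spatial and temporal derivatives (simple finite-difference kernels).
    kx = np.array([[-1.0, 1.0], [-1.0, 1.0]]) * 0.25
    ky = np.array([[-1.0, -1.0], [1.0, 1.0]]) * 0.25
    kt = np.ones((2, 2)) * 0.25
    Ix = convolve(I1, kx) + convolve(I2, kx)
    Iy = convolve(I1, ky) + convolve(I2, ky)
    It = convolve(I2, kt) - convolve(I1, kt)

    # Kernel used to form the local flow averages (u_bar, v_bar).
    avg = np.array([[1, 2, 1], [2, 0, 2], [1, 2, 1]], dtype=np.float64) / 12.0

    u = np.zeros_like(I1)
    v = np.zeros_like(I1)
    denom = alpha ** 2 + Ix ** 2 + Iy ** 2
    for _ in range(n_iter):
        u_bar = convolve(u, avg)
        v_bar = convolve(v, avg)
        common = (Ix * u_bar + Iy * v_bar + It) / denom
        u = u_bar - Ix * common   # horizontal flow p
        v = v_bar - Iy * common   # vertical flow q
    return u, v
```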
Our method is based on tracking the motion occurring on the face during facial expressions, where expressions can last several seconds and consist of motion that is too large to be accurately and consistently tracked by optical flow. To solve this, we use a vector stitching technique that combines pairs of vectors from consecutive frames in order to recover the overall larger displacements occurring over several frames [22] (a sketch is given below). In our previous work [1], we have shown that a significant amount of noise accumulates after several hundred frames (over 800) in a video; however, in this work we only need to stitch roughly 30–50 frames. Although we chose to use vector stitching, it is worth noting that there are alternative methods that could be used for tracking the motion during facial expressions, including different approaches for calculating optical flow (a summary evaluation of several optical flow techniques can be found in [23]). A coarse-to-fine approach that is capable of tracking larger motions is discussed in [19]; however, the time complexity of this approach for large HD images was significantly greater than our current approach of stitching together smaller flow estimations (3–5 min per frame compared with 2–4 s per frame). Another approach is to warp the 3-D model to a 2-D plane using diffeomorphic mapping, such as the harmonic mapping given in [24]. Optical flow, and hence optical strain, would be calculated in the warped domain and then inversely mapped back onto the 3-D model.
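A minimal sketch of the vector stitching idea referenced above: consecutive frame-to-frame flows are accumulated by sampling each new flow field at the positions reached so far. The bilinear interpolation and border handling are our own assumptions for illustration; [22] should be consulted for the exact formulation.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def stitch_flows(flows):
    """Compose a list of frame-to-frame flows [(u_01, v_01), (u_12, v_12), ...]
    into the accumulated flow from frame 0 to the last frame."""
    h, w = flows[0][0].shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    u_acc = np.zeros((h, w))
    v_acc = np.zeros((h, w))
    for u, v in flows:
        # Sample the current frame-to-frame flow at the positions that the
        # accumulated flow has carried each original pixel to.
        x_new = xs + u_acc
        y_new = ys + v_acc
        du = map_coordinates(u, [y_new, x_new], order=1, mode='nearest')
        dv = map_coordinates(v, [y_new, x_new], order=1, mode='nearest')
        u_acc += du
        v_acc += dv
    return u_acc, v_acc
```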
Fig. 6. Acquisition setup for the USF-Kinect dataset, used in experiment 2 to demonstrate the method's robustness under view changes.
2.2. Strain

The 3-D displacement of any deformable object can be expressed by a vector $\mathbf{u} = [u, v, w]^T$. If the motion is small enough, then the corresponding finite strain tensor is defined as:
$$\epsilon = \frac{1}{2}\left[\nabla\mathbf{u} + (\nabla\mathbf{u})^T\right], \qquad (4)$$

which can be expanded to the form:

$$\epsilon = \begin{bmatrix} \epsilon_{xx} = \frac{\partial u}{\partial x} & \epsilon_{yx} = \frac{1}{2}\left(\frac{\partial u}{\partial y} + \frac{\partial v}{\partial x}\right) & \epsilon_{zx} = \frac{1}{2}\left(\frac{\partial w}{\partial x} + \frac{\partial u}{\partial z}\right) \\ \epsilon_{xy} = \frac{1}{2}\left(\frac{\partial v}{\partial x} + \frac{\partial u}{\partial y}\right) & \epsilon_{yy} = \frac{\partial v}{\partial y} & \epsilon_{zy} = \frac{1}{2}\left(\frac{\partial w}{\partial y} + \frac{\partial v}{\partial z}\right) \\ \epsilon_{xz} = \frac{1}{2}\left(\frac{\partial u}{\partial z} + \frac{\partial w}{\partial x}\right) & \epsilon_{yz} = \frac{1}{2}\left(\frac{\partial v}{\partial z} + \frac{\partial w}{\partial y}\right) & \epsilon_{zz} = \frac{\partial w}{\partial z} \end{bmatrix}, \qquad (5)$$

where $(\epsilon_{xx}, \epsilon_{yy}, \epsilon_{zz})$ are normal strain components and $(\epsilon_{xy}, \epsilon_{xz}, \epsilon_{yz})$ are shear strain components. Since each of these strain components is a function of the displacement vector $(u, v, w)$ over a continuous space, each strain component is approximated using the discrete optical flow data $(p, q, r)$:

$$p = \frac{\Delta x}{\Delta t} = \frac{u}{\Delta t}, \quad u = p\,\Delta t, \qquad (6)$$
$$q = \frac{\Delta y}{\Delta t} = \frac{v}{\Delta t}, \quad v = q\,\Delta t, \qquad (7)$$
$$r = \frac{\Delta z}{\Delta t} = \frac{w}{\Delta t}, \quad w = r\,\Delta t, \qquad (8)$$

where $\Delta t$ is the change in time between two image frames. Setting $\Delta t$ to a fixed interval length, we approximate the first-order partial derivatives:

$$\frac{\partial u}{\partial x} = \frac{\partial p}{\partial x}\Delta t, \quad \frac{\partial u}{\partial y} = \frac{\partial p}{\partial y}\Delta t, \quad \frac{\partial u}{\partial z} = \frac{\partial p}{\partial z}\Delta t, \qquad (9)$$
$$\frac{\partial v}{\partial x} = \frac{\partial q}{\partial x}\Delta t, \quad \frac{\partial v}{\partial y} = \frac{\partial q}{\partial y}\Delta t, \quad \frac{\partial v}{\partial z} = \frac{\partial q}{\partial z}\Delta t, \qquad (10)$$
$$\frac{\partial w}{\partial x} = \frac{\partial r}{\partial x}\Delta t, \quad \frac{\partial w}{\partial y} = \frac{\partial r}{\partial y}\Delta t, \quad \frac{\partial w}{\partial z} = \frac{\partial r}{\partial z}\Delta t. \qquad (11)$$

Finally, the above computation scheme can be implemented using a variety of methods that take the spatial derivative over a finite number of points. We chose the central difference method due to its computational efficiency, thus calculating the normal strain tensors as:

$$\frac{\partial u}{\partial x} = \frac{u(x+\Delta x, y, z) - u(x-\Delta x, y, z)}{2\Delta x} = \frac{p(x+\Delta x, y, z) - p(x-\Delta x, y, z)}{2\Delta x}, \qquad (12)$$
$$\frac{\partial v}{\partial y} = \frac{v(x, y+\Delta y, z) - v(x, y-\Delta y, z)}{2\Delta y} = \frac{q(x, y+\Delta y, z) - q(x, y-\Delta y, z)}{2\Delta y}, \qquad (13)$$
$$\frac{\partial w}{\partial z} = \frac{w(x, y, z+\Delta z) - w(x, y, z-\Delta z)}{2\Delta z} = \frac{r(x, y, z+\Delta z) - r(x, y, z-\Delta z)}{2\Delta z}, \qquad (14)$$

where $\Delta x = \Delta y = \Delta z = 1$ pixel.

2.3. Surface strain

The equations described in the above section describe 3-D volumetric strain. However, in our approach we adjust the 2-D surface motion along a 3-D volume by supplementing the 2-D optical flow vectors with a depth component. Due to the low resolution of the Kinect data, the depth value for each point on the face is determined by the change in $z$ along a locally fit plane. However, because of the assumption that the face is locally flat, all components that differentiate with respect to $z$ are then 0. Hence, we are left with two normal strains ($\epsilon'_{xx}, \epsilon'_{yy}$), one shear strain ($\epsilon'_{xy}$), and the two partial shears ($\epsilon'_{xz}, \epsilon'_{yz}$). We can then rewrite Eq. (5) as follows:

$$\epsilon' = \begin{bmatrix} \epsilon'_{xx} = \frac{\partial u}{\partial x} & \epsilon'_{yx} = \frac{1}{2}\left(\frac{\partial u}{\partial y} + \frac{\partial v}{\partial x}\right) & \epsilon'_{zx} = \frac{\partial w}{2\,\partial x} \\ \epsilon'_{xy} = \frac{1}{2}\left(\frac{\partial v}{\partial x} + \frac{\partial u}{\partial y}\right) & \epsilon'_{yy} = \frac{\partial v}{\partial y} & \epsilon'_{zy} = \frac{\partial w}{2\,\partial y} \\ \epsilon'_{xz} = \frac{\partial w}{2\,\partial x} & \epsilon'_{yz} = \frac{\partial w}{2\,\partial y} & \epsilon'_{zz} = 0 \end{bmatrix}. \qquad (15)$$

Finally, we calculate the strain magnitude using the following:

$$\epsilon'_{mag} = \sqrt{(\epsilon'_{xx})^2 + (\epsilon'_{yy})^2 + (\epsilon'_{xy})^2 + (\epsilon'_{xz})^2 + (\epsilon'_{yz})^2}. \qquad (16)$$

The values in $\epsilon'_{mag}$ can then be normalized to $[0, 255]$ to generate a strain map image where low and high intensity values correspond to small and large strain deformations, respectively (the un-normalized values are used for comparison in our experiments). Note that in contrast to our previous work that was limited to the projected 2-D motion [1], this final surface strain equation also includes the motion lost due to foreshortening.
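Under the locally flat assumption, Eqs. (12)–(16) reduce to simple image-domain differences. The following sketch, assuming the per-pixel displacements p, q and the depth-derived component r are already available as dense arrays on the aligned grid (with Δx = Δy = 1 pixel), illustrates the computation; it is not the exact code used in our experiments.

```python
import numpy as np

def surface_strain_magnitude(p, q, r):
    """Compute the surface strain magnitude of Eq. (16) from dense
    displacement fields p (horizontal), q (vertical) and r (depth change),
    all given in pixels on the aligned image grid (dx = dy = 1 pixel)."""
    # np.gradient uses central differences in the interior of the array.
    dp_dy, dp_dx = np.gradient(p)
    dq_dy, dq_dx = np.gradient(q)
    dr_dy, dr_dx = np.gradient(r)

    # Strain components of Eq. (15): the face is treated as locally flat,
    # so derivatives with respect to z vanish.
    e_xx = dp_dx
    e_yy = dq_dy
    e_xy = 0.5 * (dq_dx + dp_dy)
    e_xz = 0.5 * dr_dx
    e_yz = 0.5 * dr_dy

    return np.sqrt(e_xx**2 + e_yy**2 + e_xy**2 + e_xz**2 + e_yz**2)
```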
3. Algorithm

The algorithm is based on calculating optical flow on a 2-D image, with the depth approximated by a low resolution sensor. Since cheaper commodity RGB-D sensors such as the Microsoft Kinect do not have sufficient optics and resolution for accurate flow tracking, we align a separate HD webcam with the Kinect depth model. Hence, optical flow is performed on the HD image and then projected onto a depth map which has been aligned to the same image. The overall design of the algorithm can be found in Fig. 1.

3.1. Automatic Kinect-to-webcam calibration

We automatically calibrate Kinect RGB-D images to images from an HD webcam using local invariant features, as proposed by Pamplona Segundo et al. [25]. First, the scale-invariant feature transform (SIFT) [26] is employed to extract keypoints from both the Kinect and webcam color images, represented by points in Fig. 2(a). Then, non-face keypoints are discarded and the remaining keypoints are used to establish correspondences between the Kinect and webcam images, illustrated as lines. Robustness against 3-D noise and correspondence errors is achieved by applying the RANSAC algorithm of Fischler and Bolles [27] to find the subset of correspondences whose transformation maximizes the number of Kinect keypoints projected onto their corresponding webcam keypoints. At least four correct correspondences are required to estimate the transformation that projects the Kinect depth data into the webcam coordinate system (see Fig. 2(b)), considering a camera model with seven parameters [28]: translation and rotation in the X, Y, and Z axes, and focal length. To retrieve these parameters for a given subset of four correspondences we use the Levenberg–Marquardt minimization approach described by Marquardt [29].
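The calibration step can be approximated with off-the-shelf OpenCV components, as sketched below. Note the simplifications relative to the method described above: the webcam intrinsics are assumed known instead of estimating the seven-parameter model (including focal length) via Levenberg–Marquardt, and the non-face keypoint filtering is omitted.

```python
import cv2
import numpy as np

def calibrate_kinect_to_webcam(kinect_rgb, kinect_xyz, webcam_rgb, K_webcam):
    """Estimate a rigid transform that projects Kinect 3-D points into the
    webcam view from SIFT correspondences between the two color images.
    kinect_xyz: HxWx3 array of Kinect 3-D coordinates registered to kinect_rgb.
    K_webcam: assumed-known 3x3 webcam intrinsic matrix."""
    g1 = cv2.cvtColor(kinect_rgb, cv2.COLOR_BGR2GRAY)
    g2 = cv2.cvtColor(webcam_rgb, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(g1, None)
    kp2, des2 = sift.detectAndCompute(g2, None)

    # Ratio-test matching of SIFT descriptors.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = [m for m, n in matcher.knnMatch(des1, des2, k=2)
               if m.distance < 0.75 * n.distance]

    obj, img = [], []
    for m in matches:
        x, y = np.round(kp1[m.queryIdx].pt).astype(int)
        X = kinect_xyz[y, x]
        if np.all(np.isfinite(X)) and X[2] > 0:   # keep points with valid depth
            obj.append(X)
            img.append(kp2[m.trainIdx].pt)

    # RANSAC pose estimation (rotation + translation), in the spirit of [27].
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.array(obj, dtype=np.float32), np.array(img, dtype=np.float32),
        K_webcam, None, reprojectionError=3.0)
    return rvec, tvec, inliers
```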
3.2. Fitting planes using least squares
When working with a depth map whose resolution is lower than that of its calibrated RGB image, many points within the image will not have depth values. In our experiments using the Kinect sensor, the face typically occupies approximately 500×500 pixels in the aligned high resolution camera image, with noisy depth values for only a portion (approximately one half) of these coordinates. In order to smooth out and interpolate the missing depth values, we fit a plane to every 5×5 neighborhood using least squares fitting. Each pixel is then updated to its new planar value based on its fitted plane.
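A minimal sketch of this plane-fitting step, assuming a boolean mask marks the pixels that received a (possibly noisy) depth value; the per-pixel loop is written for clarity rather than speed.

```python
import numpy as np

def fit_local_planes(depth, valid, half=2):
    """Smooth/interpolate a sparse depth map by fitting z = a*x + b*y + c
    to the valid samples in each (2*half+1)^2 neighborhood (5x5 by default)."""
    h, w = depth.shape
    out = depth.astype(np.float64).copy()
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - half), min(h, y + half + 1)
            x0, x1 = max(0, x - half), min(w, x + half + 1)
            ys, xs = np.mgrid[y0:y1, x0:x1]
            m = valid[y0:y1, x0:x1]
            if m.sum() < 3:          # need at least 3 points to define a plane
                continue
            A = np.column_stack([xs[m], ys[m], np.ones(m.sum())])
            z = depth[y0:y1, x0:x1][m]
            (a, b, c), *_ = np.linalg.lstsq(A, z, rcond=None)
            out[y, x] = a * x + b * y + c   # planar value at the pixel
    return out
```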
Fig. 7. Example of the automatic calibration process: (a) and (d) show correspondences between Kinect and right webcam in blue and between Kinect and left webcam in red; (b) and (c) show resulting projection of the Kinect depth data (center) to the coordinate system of the right webcam (left) and to the coordinate system of the left webcam (right). Figure best viewed in color. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
3.3. Facial landmark detection and masking

Optical strain maps are derived from optical flow estimations that are susceptible to tracking failures, which in turn produce failures in the strain estimation. In general, the regions most susceptible to this kind of error are located around the eyes and mouth, due to the self-occlusions that occur when subjects open and close them. Moreover, since failures in flow estimation occur at the boundary of the head (due to the head moving in front of the background), areas outside of the head are masked. We segment these regions of the face using the facial landmarks automatically detected by Saragih et al. [30]. Fig. 3 shows the detected facial landmarks as well as the original mask. The two transformed views using the transformation matrix identified by the
method described in the previous section can also be found in this figure. Please note that only one camera is needed to run the algorithm; however, to demonstrate the robustness of the method to view changes, we aligned two cameras for the experiment in Section 4.2. It is worth noting that for the second experiment in the next section, we performed additional masking by only comparing values that are visible in both views (see Fig. 3(e)). These points are found by back-projecting every pixel from each calibrated view back onto the 3-D model and incrementing a value at each location.
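The visibility bookkeeping can be sketched as follows, assuming a hypothetical per-view map that records, for every image pixel, the index of the 3-D model point it back-projects to (the paper does not prescribe a particular data structure).

```python
import numpy as np

def common_view_mask(n_model_points, pixel_to_vertex_maps):
    """Count, for every point of the 3-D face model, how many calibrated views
    see it, and keep only the points visible in all views.
    pixel_to_vertex_maps: one HxW integer map per view giving, for each image
    pixel, the index of the model point it back-projects to (-1 if none)."""
    counts = np.zeros(n_model_points, dtype=np.int32)
    for vmap in pixel_to_vertex_maps:
        seen = np.unique(vmap[vmap >= 0])
        counts[seen] += 1
    return counts == len(pixel_to_vertex_maps)   # True where common to all views
```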
Fig. 8. Example strain maps for the happy expression from two different views. The first two columns are the RGB images for the left and right camera. The next two images are the corresponding 2-D strain maps. The last two images are the corresponding 3-D surface strain maps. Note that the 3-D surface strain images only show the strain values for pixels that are common to both views (see Fig. 3 part e.)
Fig. 9. Normalized summed strain magnitude and normalized correlation coefficients (2-D aligned and 3-D aligned) plotted against frame number. This figure shows the relationship between the normalized correlation coefficients of the 2-D and 3-D surface strain maps calculated from two different views, along with the normalized summed strain magnitude. When the summed strain magnitude is high, the 3-D surface strain appears correlated between the two different camera views, especially when compared to the correlation between 2-D strain maps after a feature-based alignment.
4. Experiments

4.1. Experiment 1: strain map consistencies at varying depth resolutions

To demonstrate that a sparse depth resolution of the face is sufficient for reliable 3-D strain estimation, we compared strain maps calculated at the full resolution of the BU dataset's depth data with those calculated at several downsampled depth resolutions. This is a publicly available 3-D dataset [31] that contains a single view and over 600 sequences of high-resolution (1280×720) videos with depth data. We use 101 subjects and 6 expressions from this dataset to measure the consistency of strain map calculation at low depth resolutions. The depth resolution of the BU data is roughly 200×200 for the face image. The depth data is then downsampled to 100×100, 66×66, 50×50, and 40×40 (see Fig. 4). For each of these resolutions, pixels that are missing depth data are filled in
by fitting a plane to it’s local neighborhood using linear least squares as described in Section 3.2. We then estimate the 3-D surface strain using the 2-D optical flow and the downsampled depth map and compare it when using the original depth map by finding the correlation coefficient. The correlation coefficient provides a similarity between each pixel using the normalized covariance function. Overall, Fig. 5 shows that the strain estimation is stable across these resolutions, even when only 40 40 depth coordinates are used from the original
Fig. 10. Boxplot showing the normalized correlation coefficients for 3-D surface strain maps and 2-D aligned strain maps for each of the 8 subjects. The means for each boxplot are denoted by 'o', the mark within the box denotes the median, while the range of outliers is denoted by the dotted lines (approximately 2.7σ).
A slight decrease in the correlation coefficient is observed for each decrease in depth resolution. The standard deviation is tightly centered around the median, and at least 85% correlation is maintained at all four depth resolutions.

4.2. Experiment 2: surface strain maps from two views

While the first experiment demonstrated that the surface strain estimations are consistent when using low-resolution depth information, the next experiment (see Fig. 6) demonstrates that the method has an increased robustness to view by calculating two strain maps independently (one for each view) and then providing a similarity measure between the two. We developed a dataset containing two webcam views (roughly 45° apart) and one Kinect sensor. These two cameras allow us to calculate two separate strain maps from two different views, which we can then compare for similarity. We recorded 8 subjects performing the smile and surprise facial expressions. Some example images from this dataset can be found in Fig. 7. Due to simultaneously acquiring large amounts of data over USB 2.0 bandwidth, the frame rate of each video fluctuated between 5 and 6 frames per second. Each sequence lasted approximately 50 frames, or 10 s. The first step is to align the Kinect's 3-D data to each external webcam (see Fig. 7). We then calculate the surface strain for each of these views and transform each of them back to the Kinect coordinate system, where we can then determine the normalized correlation coefficient between the two registered strain maps. Note that this is only done for points in each image that are common to both views (see Fig. 3), since there will be large occluded regions due to parallax between both views. For comparison, we also analyze the correlation of the two 2-D strain maps calculated from each camera after aligning both to the center view using an affine transformation based on the location of the eyes and mouth. Fig. 8 shows a comparison of the 2-D strain maps and the 3-D surface strain maps generated using the method, where qualitatively we can see that the 3-D surface strain maps are similar between each view. Results for multiple frames in an example sequence are given in Fig. 9. This figure plots the summed strain magnitude along with the normalized correlation coefficients for both 2-D and 3-D, and quantitatively demonstrates that the facial surface strain estimation is consistent between both cameras. We also observe that the normalized correlation coefficient increases as the strain magnitude increases. This is likely due to the noise
in flow estimation when there is little or no motion occurring on the face. Hence, when we calculated the correlation coefficients for all the subjects in Fig. 10, we only considered frames where the strain magnitude is over 80%. Overall, the results are evidence that there is a significant improvement when comparing strain maps with and without 3-D information. We observe a 40–80% correlation in strain maps calculated from two different views when using the 3-D information to calculate surface strain and align them to a central view (note that the means, denoted by 'o', range from 60% to 70%). However, when using a 2-D based alignment, there is a significant decrease in correlation.

5. Conclusions

Optical strain maps generated from facial expressions capture the elastic deformation of facial skin tissue. This feature has been shown (in 2-D) to have wide application in several research areas, including biometrics and medical imaging. However, previous work on strain imaging has been limited to a single view, thus forcing training data to be manually aligned with future queries. In this paper, we propose a novel method that removes this restriction and allows for a range of views from multiple recording sessions. Furthermore, we have proposed a method that is able to recover motion that is lost due to distortions from a 3-D model being projected onto a 2-D image plane. The method works by first calibrating an external high resolution camera to a low resolution depth sensor using matched SIFT features. Next, optical flow is calculated in the external camera and then projected onto the calibrated depth map in order to approximate the 3-D displacement. We have experimentally demonstrated that the method is robust at several depth resolutions by giving quantitative and qualitative results that show that the surface strain maps for 101 subjects and over 600 expression sequences at 4 sub-sampled depth resolutions are highly correlated with the surface strain maps computed using the full depth resolution. Moreover, when evenly down-sampling a depth resolution of 200×200 to as little as 40×40, we lose less than 10% correlation on average when compared to using the full depth resolution. We have also reported results that demonstrate the increased view-invariance of the method by recording subjects from two views approximately 45° apart. In this experiment, we show that the independent 3-D surface strain maps generated from each of these two views are highly correlated. This is especially true when the amount of strain is relatively high, as opposed to when
there is little to no strain present, because of noise in the optical flow estimation. In our future work, we will analyze several applications of this method and its potential to boost performance over methods that are based on 2-D strain.
References

[1] M. Shreve, S. Godavarthy, D. Goldgof, S. Sarkar, Macro- and micro-expression spotting in long videos using spatio-temporal strain, in: International Conference on Automatic Face and Gesture Recognition, 2011, pp. 51–56.
[2] M. Shreve, N. Jain, D. Goldgof, S. Sarkar, W. Kropatsch, C.-H.J. Tzou, M. Frey, Evaluation of facial reconstructive surgery on patients with facial palsy using optical strain, in: Proceedings of the 14th International Conference on Computer Analysis of Images and Patterns, 2011, pp. 512–519.
[3] M. Shreve, V. Manohar, D. Goldgof, S. Sarkar, Face recognition under camouflage and adverse illumination, in: First IEEE International Conference on Biometrics: Theory, Applications, and Systems, 2010, pp. 1–6.
[4] M. Shreve, S. Fefilatyev, N. Bonilla, D. Goldgof, S. Sarkar, Method for calculating view-invariant 3D optical strain, in: International Workshop on Depth Image Analysis, 2012.
[5] B. Bickel, M. Botsch, R. Angst, W. Matusik, M. Otaduy, H. Pfister, Multi-scale capture of facial geometry and motion, in: ACM Transactions on Graphics, vol. 29, 2007, p. 33.
[6] V. Blanz, C. Basso, T. Poggio, T. Vetter, Reanimating faces in images and video, Comput. Graphics Forum 22 (2003) 641–650.
[7] I. Lin, M. Ouhyoung, Mirror MoCap: automatic and efficient capture of dense 3D facial motion parameters, Visual Comput. 21 (2005) 355–372.
[8] D. Bradley, W. Heidrich, A.S.T. Popa, High resolution passive facial performance capture, in: ACM Transactions on Graphics, vol. 29, 2010, p. 41.
[9] Y. Furukawa, J. Ponce, Dense 3D motion capture from synchronized video streams, in: Image and Geometry Processing for 3-D Cinematography, 2010, pp. 193–211.
[10] J. Pons, R. Keriven, O. Faugeras, Multi-view stereo reconstruction and scene flow estimation with a global image-based matching score, Int. J. Comput. Vision 72 (2007) 179–193.
[11] L. Valgaerts, C. Wu, A. Bruhn, H.-P. Seidel, C. Theobalt, Lightweight binocular facial performance capture under uncontrolled lighting, ACM Trans. Graphics 31 (6) (2012) 1–11.
[12] T. Beeler, F. Hahn, D. Bradley, B. Bickel, P. Beardsley, C. Gotsman, R.W. Sumner, M. Gross, High-quality passive facial performance capture using anchor frames, in: ACM SIGGRAPH 2011 Papers, SIGGRAPH '11, 2011, pp. 75:1–75:10.
[13] M. Klaudiny, A. Hilton, High-detail 3D capture and non-sequential alignment of facial performance, in: 2012 Second International Conference on 3D Imaging, Modeling, Processing, Visualization and Transmission (3DIMPVT), 2012, pp. 17–24.
[14] M. Penna, The incremental approximation of nonrigid motion, Comput. Vision 60 (1994) 141–156.
[15] Faceshift, Markerless motion capture at every desk, 2013. http://www.faceshift.com/.
[16] Livedriver, Face tracking, 2013.
[17] S. Hadfield, R. Bowden, Kinecting the dots: particle based scene flow from depth sensors, in: Proceedings of International Conference on Computer Vision, 2011, pp. 2290–2295.
[18] T. Weise, S. Bouaziz, H. Li, M. Pauly, Realtime performance-based facial animation, in: ACM Transactions on Graphics, vol. 30, 2011, p. 77.
[19] M.J. Black, P. Anandan, The robust estimation of multiple motions: parametric and piecewise-smooth flow fields, Comput. Vision Image Understanding 63 (1996) 75–104.
[20] MATLAB, version 7.10.0 (R2010a), The MathWorks Inc., Natick, Massachusetts, 2010.
[21] B.K.P. Horn, B.G. Schunck, Determining optical flow, Artif. Intell. 17 (1981) 185–203.
[22] S. Baker, S. Roth, D. Scharstein, M. Black, J.P. Lewis, R. Szeliski, A database and evaluation methodology for optical flow, in: IEEE 11th International Conference on Computer Vision, 2007, pp. 1–8.
[23] S. Baker, D. Scharstein, J. Lewis, S. Roth, M. Black, R. Szeliski, A database and evaluation methodology for optical flow, Int. J. Comput. Vision 92 (2011) 1–31.
[24] Y. Wang, M. Gupta, S. Zhang, S. Wang, X. Gu, D. Samaras, P. Huang, High resolution tracking of non-rigid motion of densely sampled 3D data using harmonic maps, Int. J. Comput. Vision 76 (3) (2008) 283–300.
[25] M. Pamplona Segundo, L. Gomes, O.R.P. Bellon, L. Silva, Automating 3D reconstruction pipeline by SURF-based alignment, in: International Conference on Image Processing, 2012, pp. 1761–1764.
[26] D.G. Lowe, Object recognition from local scale-invariant features, in: International Conference on Computer Vision, 1999, pp. 1150–1157.
[27] M.A. Fischler, R.C. Bolles, Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography, Commun. ACM 24 (6) (1981) 381–395.
[28] M. Corsini, M. Dellepiane, F. Ponchio, R. Scopigno, Image-to-geometry registration: a mutual information method exploiting illumination-related geometric properties, Comput. Graphics Forum 28 (7) (2009) 1755–1764.
[29] D.W. Marquardt, An algorithm for least-squares estimation of nonlinear parameters, SIAM J. Appl. Math. 11 (2) (1963) 431–441.
[30] J.M. Saragih, S. Lucey, J.F. Cohn, Face alignment through subspace constrained mean-shifts, in: International Conference on Computer Vision, 2009.
[31] L. Yin, X. Chen, Y. Sun, T. Worm, M. Reale, A high-resolution 3D dynamic facial expression database, in: The 8th International Conference on Automatic Face and Gesture Recognition, 2008, pp. 1–6.