Towards Stabilizing Facial Landmark Detection and Tracking via Hierarchical Filtering: A New Method

Yi Jin^a, Xingyan Guo^a, Yidong Li^a,*, Junliang Xing^b, Hui Tian^c

^a School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China. Email: {yjin,16120368,ydli}@bjtu.edu.cn
^b National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100080, China. Email: [email protected]
^c China Mobile Research Institute, China Mobile Communication Corporation, Xuanwu Men West Street, Beijing, China. Email: [email protected]
* Corresponding author

Abstract

Many facial landmark detection and tracking methods suffer from instability problems that have a negative influence on real-world applications such as facial animation, head pose estimation and real-time facial 3D reconstruction. Unstable landmark tracking results cause face pose shaking and face movement that is not fluent. However, most existing landmark detection and tracking methods only consider the stability of the face location and neglect the stability of local landmark movement. To solve the problem of local landmark shaking, we present a novel hierarchical filtering method for stabilized facial landmark detection and tracking in video frames. The proposed method addresses the challenging local landmark shaking problem and provides effective remedies for it. The main contribution of our solution is a novel hierarchical filtering strategy, which guarantees the robustness of global whole-face shape tracking and the adaptivity of local facial part tracking. The proposed solution does not depend on specific face detection and alignment algorithms and can thus be easily deployed into existing systems. Extensive experimental evaluations and analyses on benchmark datasets and 3D head pose datasets verify the effectiveness of our proposed stabilizing method.

Keywords: Video analysis, Facial landmark tracking, Stabilizing tracking, 3D head pose estimation, Global and local filtering
Figure 1: The ground-truth landmark annotations are noisy and not accurate to the pixel.
1. Introduction

Detecting and tracking facial landmarks in video streams is a very important computer vision task, with many practical applications such as face anti-spoofing [1], face animation [2, 3], head pose estimation and real-time facial 3D reconstruction [4]. These real-world applications rely on the smoothness of the landmark tracking results: unstable landmark trajectories cause face pose shaking and face movement that is not fluent.

Landmark jitter in video is a serious concern in landmark tracking, even if the face is motionless. Landmarks regressed directly from image information can jitter, including landmarks regressed by convolutional neural networks (CNNs). The jittery tracking results have a negative influence on the applications built upon facial landmark detection and tracking. There are no good solutions for this problem because the face feature points in the training data are noisy and cannot be annotated accurately to the pixel. As Fig. 1 shows, the chin landmarks are not stable; a learning model trained on such unstable annotations will also shake.

Stabilizing landmark tracking is thus an interesting topic with important practical significance, yet there are few primary research studies on how to mitigate this negative influence in landmark tracking. In this paper, we propose a novel hierarchical filtering strategy for stabilizing video facial landmark detection and tracking.
Figure 2: Overview of our hierarchical stabilizing filtering. The proposed method includes two stages: global filtering and local filtering. In the global filtering stage, we guarantee the stability of the face by a global location filter and a global pose filter. The local filtering stage aims to fine-tune landmark locations with the restriction of previous landmarks.
Fig. 2 illustrates the whole process of our approach. We evaluate our approach on the publicly available 300-VW database [5] and the BU head tracking dataset [6]. Extensive experimental results demonstrate the effectiveness and generalization of our proposed stabilizing tracking method in terms of the stability of both the landmark tracking results and the pose shaking. To summarize, the main contributions of this work are three-fold:

• A novel stabilized video facial landmark detection and tracking solution is developed to address challenging facial landmark shaking problems.

• The hierarchical filtering comprises a global strategy and a local strategy. The global strategy leverages a 3D face model and guarantees the smoothness of the tracked global face pose trajectory. The local strategy uses a Kalman filter and learns a landmark quality assessment model to guarantee the smoothness of the facial landmarks in different local parts of faces.

• Extensive experimental evaluations of the stability of landmark trajectories and head pose movements, and of the accuracy of landmarks on different databases, demonstrate the superiority and effectiveness of the proposed stabilizing method.
2. Related Work

Facial landmark detection (face alignment) localizes a set of landmarks on face images and has drawn increasing attention in face anti-spoofing [1], face animation [2, 3], head pose estimation and real-time facial 3D reconstruction [4]. It is also a very difficult task due to many challenging factors, including large face pose variations, partial face occlusions, fast camera motions, and complex background clutter. The performance of facial landmark detection and tracking in videos has greatly improved in the last decade [7, 8, 9, 10, 11]. For instance, a generative deformable shape model such as the active appearance model (AAM) [12] was proposed to estimate a shape that reduces the influence of occlusion. Cristinacce and Cootes [13] and Saragih et al. [14] proposed the constrained local model (CLM), using principal component analysis (PCA) on the concatenated local patches to build the shape model. Part-based appearance features such as the scale-invariant feature transform (SIFT) [11] have been widely used to learn linear mappings to shape deviations and to improve robustness. The ensemble of regression trees proposed in [9] has been widely used for its high detection speed. Asthana et al. [15] proposed a discriminative response map model that can better address unseen variations. Facial landmark detection is performed from the bounding box provided by face detectors; currently, face detection has matured enough to provide effective and efficient solutions for faces captured in images [16].
Different from facial landmark detection, facial landmark tracking localizes a series of facial landmarks over image sets or video frames. Facial landmark tracking methods can be divided into two categories. One is the tracking-by-detection method, which detects the face bounding box and then separately performs landmark detection in the bounding box area of each video frame. OpenFace [17] is a real-time facial behaviour analysis toolkit that uses the recently proposed conditional local neural fields (CLNF) [18], an upgraded version of CLM that addresses the issues of feature detection in complex scenes; it employs a face validation step to address the face "drift" problem. A two-stage 3D landmark regression network was proposed in [19], which estimates the 3D face shape independently in each frame. Apparently, these methods discard the shape information of previously tracked frames, and none of them smooths the landmark trajectories. Different from those tracking-by-detection methods, the Kalman filter has been applied to stabilize the estimation of the face position in the tree-based deformable part models (DPM) tracker proposed by [20]. [21] developed a multi-task deep CNN named TCDCN to learn feature representations for face alignment. Yu et al. [22] presented a two-stage cascaded deformable shape model to localize facial landmarks; in tracking mode, it directly uses information from past frames as the initialization for the following frames.
The other category is the incremental tracking method. The state-of-the-art iCCR method [23] is a facial landmark tracking algorithm performing online updates of the models through incremental learning. The work in [24] used incremental learning to take advantage of temporal coherency and create a more adaptive fitting method. A multi-view cascade shape regression model named spatio-temporal cascade shape regression (STCSR) was used in [25] to track facial shapes. These pure tracking methods assume that the shape difference between adjacent frames cannot be too large. The tracked model slowly changes appearance, while slight errors in the tracker accumulate over time and the tracker becomes increasingly worse. This slight error accumulation causes "drifting" in a poor face tracker. Arguably the most important shortcoming of these incremental models is erroneous fitting; we therefore omit discussing them further.

To summarize, most existing landmark tracking methods focus on a stable face position by dealing with face drift, but neglect the impact of jittering landmark trajectories. OpenFace employs a face validation step to stabilize the face position, and the tree-based deformable part models (DPM) employ a Kalman filter to stabilize the face box. However, these methods cannot address landmark shaking when the face changes smoothly; they can only reduce global face drifting and false detection.

Therefore, this study makes a major contribution to research on landmark shaking by demonstrating a novel pipeline for stabilizing video facial landmark detection and tracking. To the best of our knowledge, this is the first solution focusing on increasing the stability of landmark trajectories. We propose a novel pipeline for stabilizing video facial landmark detection and tracking that builds on different tracking and detection pipelines.
2.1. 3D face reconstruction

3D face reconstruction, which fits a 3D model by minimizing the difference between images gathered by a single 2D camera and the model appearance, is widely used in face alignment, pose estimation, face recognition and facial animation. In addition to pure-2D methods, 3D face reconstruction has the potential to address large poses. Blanz and Vetter pioneered the popular 3D morphable model (3DMM) [26, 27], and most recent 3D face reconstruction methods build on 3DMM research.

Landmark-based reconstruction methods, which reconstruct a realistic 3D face by regressing the landmark positions, have been proposed to improve efficiency. This approach estimates the model parameters through coupled dictionary mappings between 2D feature points and 3D model vectors. The work of [28, 29] and the open-source Surrey Face Model [30] implement shape-to-landmark fitting. More recently, [31] estimated the 3DMM parameters on a sparse set of landmarks using a CNN.

Image-based reconstruction methods estimate the model parameters by regressing the 2D image directly and detecting the 3D feature landmarks with a 3D model. The DDE model of [32] can infer accurate 3D face models and landmark points from 2D videos. Zhu et al. [33] used a convolutional neural network (CNN) to learn a mapping from image pixels to 3D coordinates. A nonlinear 3DMM was learned in [4].
3. The Proposed Approach

In this section, we describe the proposed solution for stabilizing video facial landmark detection and tracking.

3.1. Processing Pipeline Overview

We first introduce a novel pipeline (Fig. 3) for stabilized video facial landmark detection and tracking with three main steps: 1) a head tracking stage, 2) a global filtering stage, and 3) a local filtering stage. Previous tracking-by-detection methods independently detect landmarks per frame, so the resulting facial landmark trajectories are unstable. Our novel pipeline instead makes full use of the continuity of head movement; a new approach for stabilizing face landmark tracking is proposed in this paper.
Figure 3: The flowchart of the proposed solution for stabilized facial landmark detection and tracking in video, which has three main steps: 1) the head tracking stage, 2) the global filtering stage, and 3) the local filtering stage.
At the face tracking stage, we assume that the face movement between two adjacent frames cannot be too large, so it is reasonable to estimate the current face box from the previous landmark locations. Different from previous box estimation methods, which only use the maximum and minimum landmark locations, our tracking method first uses the least squares method to estimate the face box by establishing the mapping relation between the face landmarks and the face box location:

$$X \cdot \theta = Y \implies \theta = \left( X^T X \right)^{-1} X^T Y \qquad (1)$$

where $X$ is the facial contour feature point location matrix over various poses, $Y$ is the corresponding face box corner point location matrix, and $\theta$ is the mapping relation between $X$ and $Y$ established by the least squares method. The current face box $Y_n$ is estimated from the mapping relation $\theta$ and the previous contour landmarks $X_{n-1}$ as follows:

$$Y_n = X_{n-1} \cdot \theta \qquad (2)$$
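As a minimal sketch of this mapping, assuming the training matrices $X$ (contour landmark coordinates) and $Y$ (box corner coordinates) are already stacked row-wise, Eqs. (1) and (2) reduce to an ordinary least squares fit followed by a per-frame matrix product; the function names and array shapes below are illustrative assumptions, not our released code:

```python
import numpy as np

def fit_box_mapping(X, Y):
    """Eq. (1): fit the landmark-to-box mapping theta by least squares.

    X: (N, d) contour landmark coordinates of N training faces in various poses.
    Y: (N, 4) corresponding face box corner coordinates.
    """
    # np.linalg.lstsq minimizes ||X @ theta - Y||^2, i.e. theta = (X^T X)^{-1} X^T Y.
    theta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return theta  # shape (d, 4)

def predict_box(theta, prev_contour):
    """Eq. (2): estimate the current face box from the previous contour landmarks."""
    return prev_contour @ theta  # Y_n = X_{n-1} . theta
```

Since the fit is done once offline, the per-frame cost of the box prediction is a single small matrix product.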
Considering the face movement in video, we then adjust the estimated face box so that the box trajectory is more suitable for tracking landmarks in the next frame.

At the global filtering stage, we guarantee the stability of the face with a global location filter and a global pose filter. The global location filter is combined with face re-detection when "box drifting" occurs, producing a set of stable face boxes that serve as the detection area of the next stage. In the global pose filter, we leverage a specific 3D face model to roughly estimate the pose information, so the global smoothness of the tracked face trajectory is guaranteed. Face re-detection and re-alignment are used to correct incorrect tracking results. To the best of our knowledge, this is the first time a 3D face method has been used to address the "drifting" problem, and it is a novel way to obtain a smooth face pose trajectory.

The local filtering stage aims to fine-tune the landmark locations under the restriction of the previous landmarks. We learn a landmark quality assessment model and verify the localization accuracy to guarantee the adaptivity of the facial landmarks at different local face parts.

We found that our proposed pipeline can accurately and stably track the landmarks, and that the processing speed increases considerably compared with the traditional tracking-by-detection pipeline.
3.2. Global Filtering Stage

In this section, we describe the new global filtering strategy that guarantees the smoothness of the tracked global face trajectory. Owing to video continuity, the global face pose changes between two adjacent frames cannot be too large. At the global filtering stage, we need to determine whether the face is "drifting" by estimating the shape changes between two adjacent frames. To describe the face location and orientation, we set two "face drifting" checkpoints.

First, we evaluate the face box change amplitude to guarantee the smoothness of the face location. The face box is the search area of landmark detection, and poor face box tracking causes bad landmark tracking results. To avoid face box "drifting", we calculate the difference between the previous and current face box locations and scales in this stage. It is important to ensure the smoothness of the face box trajectories; thus, a global location filter is applied so that stable face box trajectories trigger the landmark alignment process.

Second, the novel global pose filter is applied to guarantee the smoothness of the face pose. We estimate the 3D pose changes by roughly reconstructing the 3D model. First, we select 7 main inner-face landmarks (4 points for the corners of the two eyes, 1 point for the tip of the nose and two for the mouth corners). After that, we fit a specific neutral 3D face model to roughly estimate the pose information of the face shape. The 3D model we use in this filtering stage is from [30], with 4,338 vertices. The specific 3D shape parameters are fitted to the first two frames over 30 iterations. During pose estimation, given the 7 2D-3D landmark correspondences, we estimate the 3D face pose. There are three possible cases when the roughly estimated face pose changes considerably: 1. the face moved very fast; 2. face tracking failed, causing a failure in landmark detection; 3. the facial expression changed considerably (we roughly estimate the pose using the neutral face model). Face re-detection and re-alignment are used to correct "drifting" shape tracking results. Based on these two global filtering methods, we keep the shape trajectories stable against global face location changes and global pose movement.
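In our pipeline the rough pose comes from fitting the model of [30]; purely as an illustration of the 7-point 2D-3D pose estimate and the drift check, the same quantity can be sketched with a generic PnP solver. The 3D coordinates, the pinhole intrinsics and the threshold handling below are assumptions of the sketch, not our exact implementation:

```python
import numpy as np
import cv2

# Rough 3D positions (generic model units) of the 7 inner-face landmarks:
# four eye corners, nose tip and two mouth corners. Placeholder geometry only;
# in our method these come from the fitted neutral 3D face model.
MODEL_POINTS_3D = np.array([
    [-225.0, 170.0, -135.0],   # right eye, outer corner
    [ -75.0, 170.0, -135.0],   # right eye, inner corner
    [  75.0, 170.0, -135.0],   # left eye, inner corner
    [ 225.0, 170.0, -135.0],   # left eye, outer corner
    [   0.0,   0.0,    0.0],   # nose tip
    [-150.0, -150.0, -125.0],  # right mouth corner
    [ 150.0, -150.0, -125.0],  # left mouth corner
], dtype=np.float64)

def rough_head_pose(pts2d, w, h):
    """Estimate rough Euler angles (degrees) from the 7 2D landmarks."""
    # Simple pinhole intrinsics with the focal length approximated by the image width.
    K = np.array([[w, 0.0, w / 2.0],
                  [0.0, w, h / 2.0],
                  [0.0, 0.0, 1.0]])
    ok, rvec, tvec = cv2.solvePnP(MODEL_POINTS_3D, pts2d.astype(np.float64), K, None)
    R, _ = cv2.Rodrigues(rvec)       # rotation vector -> 3x3 rotation matrix
    angles, *_ = cv2.RQDecomp3x3(R)  # Euler angles about the x, y, z axes, in degrees
    return np.asarray(angles)

def pose_drifted(prev_angles, angles, thresh_deg=10.0):
    """Global pose filter: trigger re-detection when any dimension jumps too much."""
    return bool(np.any(np.abs(angles - prev_angles) > thresh_deg))
```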
3.3. Local Filtering Stage

In the previous section, we designed a global filtering strategy that solves the "face drifting" problem by restricting global face movement. However, slight local landmark changes can still affect the stability of the landmark trajectories. We therefore propose a new local filtering strategy to further stabilize the landmark trajectories. The local filtering strategy learns a landmark quality assessment model and verifies the localization accuracy to guarantee the adaptivity of the facial landmarks at different local face parts.

The Kalman filter estimates a joint probability distribution over the state variables for each time frame and is well suited to a wide variety of tracking tasks. Typically, it provides a very smooth estimate of the state. However, the variation of face positions and expressions follows no fixed pattern, so pure Kalman filter tracking is inapplicable to a non-rigid face. In this context, we propose a Kalman local smoothing filter that combines the Kalman filter with a local smoothing method, as described in the following paragraphs and in Fig. 4.

The Kalman filter algorithm works in a two-step process and can run in a real-time system. In the prediction step, the Kalman filter produces estimates of the current state variables $\hat{s}_t$ along with their uncertainties $\hat{P}_t$, called the a priori error estimate matrix; they are obtained from the corrected state $\hat{s}_{t-1}$ and the error covariance $\hat{P}_{t-1}$ at time $t-1$:

$$\hat{s}_t = A \hat{s}_{t-1} + B u_{t-1} \qquad (3)$$
Figure 4: The flowchart of the improved Kalman local smooth filter.
$$\hat{P}_t = A \hat{P}_{t-1} A^T + Q \qquad (4)$$

In the correction step, the Kalman gain at time $t$ is computed using Eq. (5). Once the outcome of the next measurement $z_t$ is observed, the corrected state $\hat{s}_t$ and the a posteriori error estimate covariance matrix $\hat{P}_t$ are obtained by refining the predicted state estimate and error covariance with the noisy measurement:

$$K_t = \hat{P}_t H^T \left( H \hat{P}_t H^T + R \right)^{-1} \qquad (5)$$

$$\hat{s}_t = \hat{s}_t + K_t \left( z_t - H \hat{s}_t \right) \qquad (6)$$

$$\hat{P}_t = \left( I - K_t H \right) \hat{P}_t \qquad (7)$$
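A minimal, self-contained sketch of the recursion in Eqs. (3)-(7) (one filter per landmark; we omit the control term $Bu$, which our model does not use) could look as follows; it is illustrative rather than our released code:

```python
import numpy as np

class KalmanFilter:
    """Linear Kalman filter implementing the two-step recursion of Eqs. (3)-(7)."""

    def __init__(self, A, H, Q, R, s0, P0):
        self.A, self.H, self.Q, self.R = A, H, Q, R
        self.s, self.P = s0, P0  # corrected state and error covariance

    def predict(self):
        self.s = self.A @ self.s                      # Eq. (3), without a control term
        self.P = self.A @ self.P @ self.A.T + self.Q  # Eq. (4): a priori covariance
        return self.s

    def correct(self, z):
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)      # Eq. (5): Kalman gain
        self.s = self.s + K @ (z - self.H @ self.s)   # Eq. (6): refine with measurement z
        self.P = (np.eye(self.P.shape[0]) - K @ self.H) @ self.P  # Eq. (7)
        return self.s
```

Each of the 68 landmarks is tracked with its own filter instance, using the constant-acceleration matrices defined below.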
Then, we use the Kalman constant acceleration model [34] to track the 68 facial landmarks. $(x^i, y^i)$ is the location of each landmark and serves as both the state location and the measurement of the Kalman model. $s_t^i = \left[ x^i \; y^i \; v_x^i \; v_y^i \; a_x^i \; a_y^i \right]^T$ is the state of the $i$th landmark at time $t$, including its velocities and accelerations. The transition matrix $A$, observation matrix $H$, measurement noise covariance $R$, and process noise covariance $Q$ are defined as follows:
$$A = \begin{bmatrix}
1 & 0 & \Delta t & 0 & \tfrac{1}{2}\Delta t^2 & 0 \\
0 & 1 & 0 & \Delta t & 0 & \tfrac{1}{2}\Delta t^2 \\
0 & 0 & 1 & 0 & \Delta t & 0 \\
0 & 0 & 0 & 1 & 0 & \Delta t \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}, \qquad
H = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0
\end{bmatrix}$$

$$Q = \begin{bmatrix}
\tfrac{1}{4}\Delta t^4 & 0 & \tfrac{1}{2}\Delta t^3 & 0 & \tfrac{1}{2}\Delta t^2 & 0 \\
0 & \tfrac{1}{4}\Delta t^4 & 0 & \tfrac{1}{2}\Delta t^3 & 0 & \tfrac{1}{2}\Delta t^2 \\
\tfrac{1}{2}\Delta t^3 & 0 & \Delta t^2 & 0 & \Delta t & 0 \\
0 & \tfrac{1}{2}\Delta t^3 & 0 & \Delta t^2 & 0 & \Delta t \\
\tfrac{1}{2}\Delta t^2 & 0 & \Delta t & 0 & 1 & 0 \\
0 & \tfrac{1}{2}\Delta t^2 & 0 & \Delta t & 0 & 1
\end{bmatrix} \sigma_v^2, \qquad
R = \begin{bmatrix}
10 & 0 \\
0 & 10
\end{bmatrix}$$
where $\Delta t$ is the duration between two adjacent frames.

We define the landmark location of the last frame as $\hat{s}_{t-1}$ and the landmark location detected by the landmark detector in the current frame as the measurement $z_t$. We predict the location $\left( x_{kalman}^{i}, y_{kalman}^{i} \right)$ with the Kalman model and smooth the predicted landmark results. This stabilized location, taken as the posterior state of the Kalman model, is not sufficiently accurate by itself. We therefore take advantage of the continuous face change in video streams and use the landmarks predicted by the Kalman model to smooth the landmark change, guided by the quality assessment model.

At this stage, we learn a landmark quality assessment model that measures the quality of every landmark change. First, we calculate the landmark variance relative to the global shaking value (GSV):

$$GSV = \sqrt{\frac{\sum_{m=1}^{7}\left[\left(x_{new}^{m} - x_{kalman}^{m}\right)^2 + \left(y_{new}^{m} - y_{kalman}^{m}\right)^2\right]}{7\left[\left(x_{reye} - x_{leye}\right)^2 + \left(y_{reye} - y_{leye}\right)^2\right]}} \qquad (8)$$
GSV is a mean point-to-point Euclidean distance of the 7 main landmarks (4 points for the two eye corners, 1 point for the tip of the nose and two for the mouth corners), normalized by the inter-ocular distance; these 7 landmarks are the same as those mentioned in Section 3.2. We choose these 7 points to describe the global face movement because they are largely unaffected by local expression changes. If the value of GSV is larger than 0.1, the method determines that the face is moving too much, and the predicted landmarks impose only weak constraints on the detected landmarks. The landmark quality assessment model is described in Eq. (9):

$$q^i = \frac{\left(x_{new}^{i} - x_{kalman}^{i}\right)^2 + \left(y_{new}^{i} - y_{kalman}^{i}\right)^2}{\left(x_{reye} - x_{leye}\right)^2 + \left(y_{reye} - y_{leye}\right)^2} - GSV \qquad (9)$$
where $q^i$ is the quality of the $i$th landmark. The landmark quality assessment model captures the difference between a single landmark's variance and the global variance, and verifies the localization accuracy to guarantee the adaptivity of the facial landmarks in different local face parts. When calculating the relative change of every point against the global face change, it is essential to select the GSV as the change reference. If the tracking of the face is successful, we use the local filtering method guided by the landmark quality assessment model to reduce "face shaking":

$$L_{new}^{i} = w^i \cdot L_{kalman}^{i} + \left(1 - w^i\right) \cdot L_{new}^{i} \qquad (10)$$

$$w^i = \lambda \cdot q^i + \tau \cdot GSV \qquad (11)$$
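A compact sketch of this local filtering step, under the assumptions that the landmarks are stored as (68, 2) arrays, that reye/leye index the eye corners used for normalization, and that $\lambda$ and $\tau$ are tuned so that $w^i$ stays in $[0, 1]$ (the paper does not report their values), is:

```python
import numpy as np

def local_smooth(kalman_pts, new_pts, main_idx, reye, leye, lam, tau):
    """Eqs. (8)-(11): blend detected landmarks with their Kalman predictions.

    kalman_pts, new_pts: (68, 2) Kalman-predicted and freshly detected landmarks.
    main_idx: indices of the 7 main landmarks (eye corners, nose tip, mouth corners).
    reye, leye: indices of the eye corners used for inter-ocular normalization.
    """
    iod2 = np.sum((new_pts[reye] - new_pts[leye]) ** 2)  # squared inter-ocular distance
    d2 = np.sum((new_pts - kalman_pts) ** 2, axis=1)     # per-landmark squared change

    gsv = np.sqrt(d2[main_idx].sum() / (7.0 * iod2))     # Eq. (8): global shaking value
    if gsv > 0.1:
        # The face is moving too much: the Kalman prediction only weakly
        # constrains the detection, so keep the detected landmarks.
        return new_pts

    q = d2 / iod2 - gsv                                  # Eq. (9): per-landmark quality
    # Eq. (11); lam and tau are chosen so that a large GSV or a large q
    # yields a small weight, and we clip w to a valid blending range.
    w = np.clip(lam * q + tau * gsv, 0.0, 1.0)
    # Eq. (10): convex blend of the Kalman prediction and the fresh detection.
    return w[:, None] * kalman_pts + (1.0 - w[:, None]) * new_pts
```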
We use $w^i$ to guide landmark smoothing, where the smoothing rate $w^i$ depends on the variance quality of each landmark $q^i$ and the global shaking value (GSV). When GSV is high, the face movement is violent and the reference value of the landmarks predicted by the Kalman model is low: the larger the value of GSV, the smaller the smoothing weight $w^i$. The value $q^i$ describes a point's localization difference relative to the global face difference: the smaller the value of $q^i$, the larger the value of $w^i$. A large $q^i$ indicates that the jitter magnitude of that point is larger than the jitter of the global feature points, so the constraint weight of the Kalman-predicted landmark should be small during landmark smoothing. The Kalman model guarantees the smoothness of the landmark change, while the landmark quality assessment model guarantees the localization accuracy and the adaptivity of the facial landmarks at different local face parts.

3.4. Implementation Details

Finally, the implementation details are given in this section. We design a stabilizing pipeline around the face detector and landmark detector found in the dlib library. We initialize our face tracking method by finding a face box in the first frame using the dlib face detector [35]. Considering face movement in video, we adjust the previously estimated face box and extend its range, so that the box trajectory is more suitable for landmark tracking in video:
$$x_o = \frac{(x_l + x_r) + x_{nose} + x_o}{4} \qquad (12)$$

$$y_o = \frac{\frac{y_u + y_d}{2} + 2 y_{nose} + y_o}{4} \qquad (13)$$

$$r = \frac{(y_d - y_u) + (x_r - x_l) + 2r}{3} \qquad (14)$$
where $x_l$ and $y_u$ are the minimum values of the contour landmarks, $x_r$ and $y_d$ are the maximum values of the contour landmarks, and $(x_{nose}, y_{nose})$ is the nose point coordinate. The box centre $(x_o, y_o)$ and radius $r$ are used to renew the face box, which is then sent to the next stage.

Our face box tracking method greatly improves the processing speed. Face box prediction costs no more than 1 ms, whereas face detection requires considerably more time to locate a face box in every frame. Most importantly, compared with the face detector, the face predictor relies on the previous face to a degree. Stable face box trajectories drive the landmark alignment method to obtain more stable landmark trajectories. The dlib landmark detector we use in our pipeline is the ensemble of regression trees [9], which shows state-of-the-art performance.

To avoid face box "drifting", we also need to calculate the difference between the previous and current face box locations and scales in the global filtering stage. If the difference is larger than 20%, we regard the face as lost and move back to the initialization stage to detect the face box again. To ensure the smoothness of the face pose, if any dimension of the pose change is larger than 10 degrees, we determine that "face drifting" has occurred and move back to face re-detection. This is the first time that 3D pose estimation has been used to avoid "face drifting".
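A sketch of the box renewal of Eqs. (12)-(14) together with the drift check, assuming the 20% rule compares the centre shift and radius change against the previous radius (the exact ratio is not spelled out above), follows:

```python
import numpy as np

def renew_face_box(contour_pts, nose, prev_center, prev_r):
    """Eqs. (12)-(14): renew the face box centre and radius with the previous state.

    contour_pts: (K, 2) contour landmarks of the current frame.
    nose: (x, y) coordinate of the nose landmark.
    """
    xl, yu = contour_pts.min(axis=0)  # minimum x and y of the contour landmarks
    xr, yd = contour_pts.max(axis=0)  # maximum x and y of the contour landmarks
    xo = ((xl + xr) + nose[0] + prev_center[0]) / 4.0              # Eq. (12)
    yo = ((yu + yd) / 2.0 + 2.0 * nose[1] + prev_center[1]) / 4.0  # Eq. (13)
    r = ((yd - yu) + (xr - xl) + 2.0 * prev_r) / 3.0               # Eq. (14)
    return np.array([xo, yo]), r

def box_drifted(prev_center, prev_r, center, r, thresh=0.2):
    """Global location filter: treat the face as lost when the box jumps > 20%."""
    loc_jump = np.linalg.norm(center - prev_center) / prev_r
    scale_jump = abs(r - prev_r) / prev_r
    return loc_jump > thresh or scale_jump > thresh
```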
We also improve the original pose estimation method, using a variable 3D face model [30] with 3,448 vertices. After fitting the 3D face shape and expression, we map the 3D face landmarks onto the 2D image plane. Then, we combine the mapped landmarks with the detected 2D landmarks and increase the weight of the 58 visibly mapped landmarks. The adjusted 2D landmarks and the indicated shape parameters are used to initialize the 3D shape and then estimate the expression and pose. Compared with the original pose estimation method, our new method reduces the negative effect of invisible landmarks. We use the shape parameters of a specific face model to initialize the model shape fitting stage, reducing the number of shape parameter iterations while fitting the expression parameters normally. Landmark smoothing is constrained by a person-specific 3D model, which guarantees the adaptivity of the facial landmarks in different local parts of the face.

Our solution can be easily deployed into existing systems to improve tracking performance. To prove this, we transplant our solution into the existing OpenFace [17] toolkit. OpenFace assembles the dlib face detector, the CLNF landmark detector and a 68-vector linear 3D point distribution model. In tracking mode, the OpenFace pipeline does not detect the face box in every frame; instead, it initializes the face landmark model with the face model fitted in the last frame.
Table 1: Description of the 300-VW dataset.

Properties               Descriptions
Num of videos            114 videos
Single/Multiple faces    Single
Facial pose              Various poses
Facial expression        Various expressions
Illumination             Various illuminations
Ground truth             68 facial landmark annotations
For this reason, we use only the global pose smoothing filter and our local smoothing filter, and give up the face box estimation step. Experimental results demonstrate the effectiveness and generalization of our proposed method in stabilizing face landmark tracking.
4. Experiments

4.1. Datasets

The 300-VW dataset [5] presents a benchmark for long-term facial landmark tracking, currently containing 114 annotated videos. These videos are separated into three categories: category one, recorded in good lighting conditions with various head poses; category two, recorded in unconstrained conditions with various head poses but without large occlusions; and category three, the most challenging of the three, recorded in completely "in-the-wild" conditions. To verify the robustness of facial shape tracking in unconstrained conditions, we primarily evaluate the performance of our proposed pipeline on the category three test set. This is by far the most challenging face tracking dataset, containing 86 landmark detection results over 27,687 frames.
Table 2: Description of the BU head tracking dataset.

Properties               Descriptions
Num of videos            72 videos
Single/Multiple faces    Single
Facial pose              Various poses
Facial expression        Various expressions
Illumination             Various illuminations
Ground truth             3D head pose annotations
The BU head tracking dataset [6] contains 72 facial videos with continuous head pose annotations. These videos include various head poses under both uniform and varying lighting.

4.2. Performance evaluation

We implemented the stabilized facial landmark tracking pipeline, and in this section we compare the performance of our proposed stabilizing tracking pipeline with an unstable tracking pipeline; the underlying face detection and alignment methods are the same. Our experimental environment is Windows 7. We run our pipeline on a PC with an Intel(R) Core(TM) i3-2100 CPU @ 3.10 GHz, and the development environment is Visual Studio 2015.

The landmark tracking results are shown in Fig. 5. Subfigures (a), (c) and (e), shown with blue landmarks, depict the unstable tracking results. The blue landmarks are not stable enough: the chin part of (a), the nose part of (c) and the contour landmarks of (e) are not fluent even when the face movement is very slight. Subfigures (b), (d) and (f) of Fig. 5 are the corresponding video sequences tracked by our stabilizing solution. The red landmarks are more stable, and the movement of the local landmarks is more fluent than that of the blue landmarks.
Figure 5: Stabilizing landmark trajectories. Subfigures (a), (c) and (e) depict the detection results of the non-stabilizing method; subfigures (b), (d) and (f) are the corresponding video sequences tracked by our stabilizing solution.
Table 3: Comparison of mean shaking angle (degrees) on the 300-VW test set.

                    category 1                category 2                category 3
method              yaw     pitch   roll      yaw     pitch   roll      yaw     pitch   roll
dlib+stable         0.741   0.969   0.474     0.828   1.148   0.515     1.245   1.396   0.592
dlib                0.931   1.363   0.685     1.002   1.726   0.781     1.899   2.165   1.025
openface+stable     0.532   0.496   0.301     0.746   0.737   0.382     1.105   1.214   0.495
openface            0.541   0.515   0.302     0.755   0.747   0.375     1.142   1.219   0.568
TCDCN               0.532   0.648   0.389     0.984   1.248   0.608     1.241   1.473   0.897
iCCR                0.673   0.722   0.515     0.814   1.090   0.464     1.396   1.827   1.229
groundtruth         0.653   0.821   0.383     0.599   0.771   0.298     1.029   1.225   0.625
Table 4: Comparison of mean shaking angle (degrees) on the BU head tracking dataset.

                uniform-light             varying-light
method          yaw     pitch   roll      yaw     pitch   roll
dlib+stable     0.643   0.870   0.456     0.455   0.727   0.295
dlib            0.847   1.308   0.738     0.673   1.149   0.518
groundtruth     0.682   1.026   0.705     0.599   1.061   0.757
Figure 6: The 3D head pose change trends in the yaw, pitch and roll dimensions. Apparently, our novel stabilizing tracking method can effectively smooth the head pose trajectories and reduce the impact of face “shaking”. The red lines exhibited smoother and smaller fluctuations than the blue lines.
To measure the stability of the landmark trajectories, we design a new metric called the mean shaking angle, defined as

$$\text{Mean-Shaking-Angle}(yaw, pitch, roll) = \frac{\sum_{n=1}^{num-1}\left|\alpha_{n+1} - \alpha_{n}\right|}{num - 1}$$

in which $\alpha_n$ represents the yaw, pitch or roll angle (in degrees) of the $n$th frame. We estimate the 3D pose precisely for every single frame and calculate the average of the shaking angles between every two adjacent frames in the yaw, pitch and roll dimensions.
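The metric is straightforward to compute from a per-frame angle sequence; a small sketch:

```python
import numpy as np

def mean_shaking_angle(angles):
    """Mean shaking angle of a yaw, pitch or roll trajectory (degrees per frame).

    angles: (num,) array with one angle per frame; implements
    sum_{n=1}^{num-1} |alpha_{n+1} - alpha_n| / (num - 1).
    """
    angles = np.asarray(angles, dtype=np.float64)
    return float(np.abs(np.diff(angles)).mean())
```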
The mean shaking angle evaluates the smoothness of head pose movement and the stability of the global face landmarks. We evaluate our method on the 300-VW dataset with 68 landmarks. The comparison methods are iCCR [23], TCDCN [21], OpenFace [17] (CLNF landmark detector), OpenFace with our stabilizing solution, the dlib pipeline [9] (ERT landmark detector), and dlib with our stabilizing solution; the performance is listed in Table 3. We also evaluate our method on the BU head pose dataset with 68 landmarks, comparing the dlib pipeline [9] (ERT landmark detector) and dlib with our stabilizing solution; the performance is listed in Table 4. As we can see, our proposed stable method achieves better performance than the unstable methods in all categories. Our 3D head poses are estimated by 3D face reconstruction, which differs from the pose annotation method of the BU head pose dataset, and our stabilizing method is more stable than the existing tracking methods in the same experimental environment.

Fig. 6 plots the 3D pose trajectories, where the red line is the result of tracking by our stabilizing pipeline and the blue line is the result of the unstable method. An obvious improvement in the smoothing of the landmark trajectories can be seen, as the negative influence of jittery tracking results is greatly reduced by our stabilizing solution. When the head pose changes between 1 and 3 degrees per frame, our stabilizing solution works well. Global filtering aims to reinitialize the frames that change considerably and to prevent the detection error from accumulating into the next frame. Local filtering aims to guarantee the smoothness of the landmark trajectories in the following cases: some landmarks jitter, slight occlusion, and contour landmarks that are not accurate enough.

We use the mean standard deviation of the landmark distance between adjacent frames to evaluate the smoothness of the landmark movement.
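The paper does not give a closed formula for this smoothness measure, so the sketch below is one plausible reading: the standard deviation of each landmark's adjacent-frame displacement, averaged over all landmarks:

```python
import numpy as np

def mean_point_movement_std(traj):
    """Assumed reading of the Table 5 metric.

    traj: (T, 68, 2) array of tracked landmark locations over T frames.
    """
    # (T-1, 68) magnitudes of each landmark's displacement between adjacent frames.
    disp = np.linalg.norm(np.diff(traj, axis=0), axis=2)
    # Per-landmark standard deviation of that displacement, averaged over landmarks.
    return float(disp.std(axis=0).mean())
```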
Table 5: Comparison of mean standard deviation of point movement on 300-VW.

method             category 1   category 2   category 3
dlib+stable        0.0229       0.0215       0.0394
dlib               0.0533       0.0869       0.0991
openface+stable    0.0258       0.0224       0.0393
openface           0.0533       0.0308       0.1140
iCCR               0.1039       0.0441       0.0626
groundtruth        0.0339       0.0211       0.1342
Table 5 shows the comparison of the mean standard deviation of point movement on 300-VW. The comparison methods are iCCR [23], OpenFace [17], OpenFace with our stabilizing solution, the dlib pipeline [9], and dlib with our stabilizing solution. The results in Table 5 demonstrate that transplanting the stabilizing solution into different pipelines has a positive impact on landmark stabilization.

To measure the stability of the face detection stage, the comparison of face box trends is shown in Fig. 7. As the blue lines in Fig. 7 show, both the size and the location of the face box are unstable when it is independently detected in each frame, while our face box tracking method reduces face "drifting" and makes the face box trajectories change more smoothly. Moreover, as a foundation step of landmark detection, it greatly improves the face estimation speed. The stable face box trajectories contribute to the smoothness of the landmark trajectories.

To measure the performance of landmark tracking, we compare our proposed solution with non-stabilizing tracking methods; the underlying face detection and alignment methods are the same. We plot the area under the curve of the point-to-point root-mean-square (RMS) error [5] lower than 0.08, which covers tracking results that are not "drifting".
Figure 7: Comparison of face box tracking results. 1) The first row depicts the trends of the face box size. 2) The second row shows the trends of the x-coordinate of the face box centre. 3) The last row shows the trends of the y-coordinate of the face box centre. Our new tracking method generates more stable box trajectories and avoids "face drifting" to some degree.
The results obtained from all videos of category three are plotted in Fig. 8, using iCCR [23], TCDCN [21], OpenFace [17], OpenFace with our stabilizing solution, the dlib pipeline [9], and dlib with our stabilizing solution; the performance is listed in Table 3. We can see that the result of the stabilizing solution with the dlib detector is more accurate than that of the unstable dlib detector, while the result of the stabilizing solution with the OpenFace method is not as good: our stabilizing pipeline sacrifices some accuracy to achieve stable landmark tracking. Nevertheless, our proposed methods achieve performance similar to the non-stabilizing methods, and the accuracy of the landmarks smoothed by our solution is still satisfying.
Figure 8: Comparison of RMS error curves on the category three test set of 300-VW.
This result indicates that our stabilizing pipeline does not reduce the accuracy of the landmarks.

Next, we introduce the computational cost of our solution. Note that our solution is built on top of a face detector and a landmark detector, and the time costs we discuss below do not include them. At the global filtering stage, we only need 80 to 200 ms in the first frame to initialize the 3D model.

Table 6: Comparison of successful detection rate (%) on 300-VW category 3.
method        410    411    516    517    526    528    529    530    531    533    557    558    559    562    all videos
dlib+stable   69.4   74.1   99.8   98.7   95.1   93.3   73.1   77.5   98.2   49.9   75.8   100    98.3   94.3   86.33
dlib          52.8   71.0   99.8   98.1   91.1   90.7   61.3   64.2   93.2   44.5   75.4   100    97.3   93.9   81.50
In the other frames, we use 8 landmarks to estimate the 3D head pose, which takes less than 1 ms per frame. Local filtering with the Kalman filter and the landmark quality assessment model takes 50 ms per frame. Our experimental environment is Visual Studio 2015 on an i3-2100 CPU @ 3.10 GHz.

Finally, we explain how our method can be used to enhance face detection performance. The percentage of frames with one successful detection is shown in Table 6; the last column shows that the improvement over plain face detection is approximately 4.8%. Our proposed stabilizing solution works better than the tracking-by-detection method and shows a significant improvement in tracking ability in all videos. This performance improvement shows that our face estimation method performs much better than the detector, since the face tracking stage can estimate the face location robustly and stably. Notably, our method, which estimates the face location from the previous face, can remedy the shortcomings of the face detector, especially in challenging conditions.

5. Conclusions

In this paper, a novel hierarchical filtering method is proposed for stabilized facial landmark detection and tracking. The proposed approach addresses the challenging "face drifting" and "landmark shaking" problems using global filtering and local filtering.
First, we leverage the 3D face model to avoid strong movement of the face shape. The Kalman filter and the landmark quality assessment model guarantee the stability and robustness of the local facial parts. Furthermore, the new method does not depend on any specific face detection or alignment algorithm, so our solution can be easily deployed into existing systems to improve tracking performance. Finally, experiments on the 300-VW dataset demonstrate the effectiveness and good generalization ability of our method for both landmark movements and head pose movements; in some frames, our stabilized landmarks are more stable than the ground-truth landmarks. Future work will extend this stabilizing method to 3D landmark tracking.
Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grants 61972030, 61672088 and KKA118001533.

References

[1] I. Chingovska, A. Anjos, S. Marcel, On the effectiveness of local binary patterns in face anti-spoofing, in: Proceedings of the International Conference of the Biometrics Special Interest Group (BIOSIG), IEEE, 2012, pp. 1–7.
[2] A.-E. Ichim, P. Kadleček, L. Kavan, M. Pauly, Phace: Physics-based face modeling and animation, ACM Transactions on Graphics (TOG) 36 (4) (2017) 153.
[3] C. Cao, Y. Weng, S. Lin, K. Zhou, 3D shape regression for real-time facial animation, ACM Transactions on Graphics (TOG) 32 (4) (2013) 41.
[4] L. Tran, X. Liu, Nonlinear 3D face morphable model, arXiv preprint arXiv:1804.03786.
[5] J. Shen, S. Zafeiriou, G. G. Chrysos, J. Kossaifi, G. Tzimiropoulos, M. Pantic, The first facial landmark tracking in-the-wild challenge: Benchmark and results, in: Proceedings of the IEEE International Conference on Computer Vision Workshops, 2015, pp. 50–58.
[6] M. La Cascia, S. Sclaroff, V. Athitsos, Fast, reliable head tracking under varying illumination: An approach based on registration of texture-mapped 3D models, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (4) (2000) 322–336.
[7] S. Zafeiriou, G. Chrysos, A. Roussos, E. Ververas, J. Deng, G. Trigeorgis, The 3D Menpo facial landmark tracking challenge.
[8] H. Yang, X. Jia, C. C. Loy, P. Robinson, An empirical study of recent face alignment methods, arXiv preprint arXiv:1511.05049.
[9] V. Kazemi, J. Sullivan, One millisecond face alignment with an ensemble of regression trees, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[10] S. Ren, X. Cao, Y. Wei, J. Sun, Face alignment at 3000 fps via regressing local binary features, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[11] X. Xiong, F. De la Torre, Supervised descent method and its applications to face alignment, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 532–539.
[12] T. F. Cootes, G. J. Edwards, C. J. Taylor, Active appearance models, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (6) (2001) 681–685.
[13] D. Cristinacce, T. F. Cootes, Feature detection and tracking with constrained local models, in: BMVC, Vol. 1, 2006, p. 3.
[14] J. M. Saragih, S. Lucey, J. F. Cohn, Deformable model fitting by regularized landmark mean-shift, International Journal of Computer Vision 91 (2) (2011) 200–215.
[15] A. Asthana, S. Zafeiriou, G. Tzimiropoulos, S. Cheng, M. Pantic, From pixels to response maps: Discriminative image filtering for face alignment in the wild, IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (6) (2015) 1312–1320.
[16] S. Zafeiriou, C. Zhang, Z. Zhang, A survey on face detection in the wild: Past, present and future, Computer Vision and Image Understanding 138 (2015) 1–24.
[17] T. Baltrušaitis, P. Robinson, L.-P. Morency, OpenFace: An open source facial behavior analysis toolkit, in: 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2016, pp. 1–10.
[18] T. Baltrusaitis, P. Robinson, L.-P. Morency, Constrained local neural fields for robust facial landmark detection in the wild, in: Proceedings of the IEEE International Conference on Computer Vision Workshops, 2013, pp. 354–361.
[19] P. Xiong, G. Li, Y. Sun, Combining local and global features for 3D face tracking, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2529–2536.
[20] M. Uřičář, V. Franc, V. Hlaváč, Facial landmark tracking by tree-based deformable part model based detector, in: Proceedings of the IEEE International Conference on Computer Vision Workshops, 2015, pp. 10–17.
[21] Z. Zhang, P. Luo, C. C. Loy, X. Tang, Learning deep representation for face alignment with auxiliary attributes, IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (5) (2016) 918–930.
[22] X. Yu, J. Huang, S. Zhang, D. N. Metaxas, Face landmark fitting via optimized part mixtures and cascaded deformable model, IEEE Transactions on Pattern Analysis and Machine Intelligence (11) (2016) 2212–2226.
[23] E. Sánchez-Lozano, B. Martinez, G. Tzimiropoulos, M. Valstar, Cascaded continuous regression for real-time incremental face tracking, in: European Conference on Computer Vision, Springer, 2016, pp. 645–661.
[24] A. Asthana, S. Zafeiriou, S. Cheng, M. Pantic, Incremental face alignment in the wild, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1859–1866.
[25] J. Yang, J. Deng, K. Zhang, Q. Liu, Facial shape tracking via spatio-temporal cascade shape regression, in: Proceedings of the IEEE International Conference on Computer Vision Workshops, 2015, pp. 41–49.
[26] V. Blanz, T. Vetter, A morphable model for the synthesis of 3D faces, in: Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, ACM Press/Addison-Wesley Publishing Co., 1999, pp. 187–194.
[27] V. Blanz, S. Romdhani, T. Vetter, Face identification across different poses and illuminations with a 3D morphable model, in: Proceedings of the Fifth IEEE International Conference on Automatic Face and Gesture Recognition, IEEE, 2002, pp. 202–207.
[28] V. Blanz, T. Vetter, Face recognition based on fitting a 3D morphable model, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (9) (2003) 1063–1074.
[29] J. Roth, Y. Tong, X. Liu, Adaptive 3D face reconstruction from unconstrained photo collections, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4197–4206.
[30] P. Huber, G. Hu, R. Tena, P. Mortazavian, P. Koppen, W. J. Christmas, M. Rätsch, J. Kittler, A multiresolution 3D morphable face model and fitting framework, in: Proceedings of the 11th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, 2016.
[31] A. Jourabloo, X. Liu, Large-pose face alignment via CNN-based dense 3D model fitting, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4188–4196.
[32] C. Cao, Q. Hou, K. Zhou, Displaced dynamic expression regression for real-time facial tracking and animation, ACM Transactions on Graphics (TOG) 33 (4) (2014) 43.
[33] X. Zhu, Z. Lei, X. Liu, H. Shi, S. Z. Li, Face alignment across large poses: A 3D solution, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 146–155.
[34] U. Prabhu, K. Seshadri, M. Savvides, Automatic facial landmark tracking in video sequences using Kalman filter assisted active shape models, in: European Conference on Computer Vision, Springer, 2010, pp. 86–99.
[35] D. E. King, Max-margin object detection, arXiv preprint arXiv:1502.00046.