Research Highlights

• We develop a 4D facial expression recognition algorithm.
• Our algorithm is suitable for both high and low-resolution RGB-D videos.
• 4D feature learning is used for facial expression recognition.
• We demonstrate that feature learning is extremely effective in this setting.
• Extensive experimental comparisons and discussions are presented at the end of the paper.
Pattern Recognition Letters journal homepage: www.elsevier.com
3D Dynamic Facial Expression Recognition using Low-Resolution Videos

Jie Shao a,∗∗, Ilaria Gori b, Shaohua Wan b, J.K. Aggarwal b

a Department of Electronic and Information Engineering, Shanghai University of Electric Power, Shanghai, 200090, China
b Computer and Vision Research Center/Department of ECE, The University of Texas at Austin, Austin, 78712, USA

∗∗ Corresponding author: Tel.: +86-021-68029237; fax: +86-021-65430410; e-mail: [email protected] (Jie Shao)

ABSTRACT
In this paper, we focus on the problem of 4D facial expression recognition. While traditional methods rely on building deformation models on high-resolution 3D meshes, our approach works directly on low-resolution RGB-D sequences; this allows us to apply our algorithm to videos retrieved by widespread, standard low-resolution RGB-D sensors such as the Kinect. After preprocessing both the RGB and depth image sequences, sparse features are learned from spatio-temporal local cuboids. A Conditional Random Fields classifier is then employed for training and classification. The proposed system is fully automatic and achieves superior results on three low-resolution datasets built from the 4D facial expression recognition database BU-4DFE. Extensive evaluations of our approach and comparisons with state-of-the-art methods are presented.

Keywords: 3D dynamic facial expression recognition; low-resolution video; RGB-D video; feature learning; sparse feature.

© 2015 Elsevier Ltd. All rights reserved.
1. Introduction
Facial expression recognition (FER) plays an important role in several areas such as human-machine interaction and robotics. In recent years, it has also been applied to the fields of commerce and medical science. However, in real applications, recognizing facial expressions is still a challenging task. Most methods in the literature exploit 2D data to solve this problem (Pantic et al., 2000; Zeng et al., 2009). Although such algorithms have demonstrated remarkable results, they are still prone to issues such as illumination changes and pose variation. Taking full advantage of 3D information is an effective solution to these drawbacks, as three-dimensional data is invariant to such transformations. Besides, facial expressions are intrinsically dynamic; several studies have shown how the temporal component, realized in the dynamics of the expressions, provides crucial cues that facilitate recognition (Sinha et al., 2006). In this sense, acquiring and analyzing sequences of 3D data instead of static scans allows capturing facial characteristics in the 3D spatio-temporal domain, enabling a more natural and precise recognition. Therefore, our work focuses on 4D Facial Expression
Recognition, also known as 3D Dynamic Facial Expression Recognition, which is characterized by the use of 3D and temporal information.
Traditional approaches to 3D dynamic facial expression recognition use two kinds of data sources: 3D meshes (Sandbach et al., 2011; Amor et al., 2014; Berretti et al., 2013; Fang et al., 2011) or depth image sequences (Le et al., 2011). According to the results presented in the literature, methods that employ 3D meshes usually achieve higher accuracy than those using depth images. However, methods using 3D meshes are restricted to high-resolution data sources, and retrieving such data usually involves expensive commercial 3D sensors. On the other hand, the most popular devices in this context are low-resolution RGB-D sensors, e.g. the Kinect or the Creative Labs Senz3D. Such sensors are not suitable for building high-resolution 3D meshes, but they are still well suited to obtaining precise depth images. Therefore, unlike most of the works dealing with high-resolution 4D facial expression recognition, our goal is to develop a system able to obtain good results on both high and low-resolution RGB-D sequences, which allows us to exploit our system with low-resolution RGB-D sensors. Since at the moment there are no publicly available datasets built with the Kinect, we simply retrieve low-resolution images from the most popular database used to test 4D facial expression recognition algorithms – BU-4DFE (Yin
et al., 2008).
Our method relies on sparse-coding-based feature learning. Feature learning, which comprises a set of algorithms that find a new and more suitable representation of data, has proved its effectiveness in many fields of computer vision, ranging from object recognition (Bo et al., 2012) to image classification (Theriault et al., 2013) and face recognition (Lee et al., 2009). However, to the best of our knowledge, this technique has never been applied to 3D dynamic facial expression recognition. We show in this work that feature learning is extremely effective for the problem we are facing. The method we propose can be summarized as follows: as a first step, LBP-TOP descriptors (Huang et al., 2011) are extracted from both gray and depth image cuboids. Then, codebooks are initialized over the descriptors via k-means clustering, and optimized along with the new codes using LLC (Wang et al., 2010). At this stage, each cuboid provides a local descriptor; in order to obtain a global representation of an image, we merge the local features via spatial pyramid pooling (Lazebnik et al., 2006). Finally, a Conditional Random Fields (CRFs) classifier (Lafferty et al., 2001) is applied for training and testing. Further experiments and comparisons with state-of-the-art algorithms are presented at the end of the paper. The main contributions of our work are the following:
• We develop a 4D facial expression recognition algorithm that is suitable for both high and low-resolution RGB-D videos. This characteristic is extremely useful, as low-resolution devices allow for faster processing, enabling direct human-machine interaction applications in the future.
• We demonstrate that feature learning, which has never been applied to 3D dynamic facial expression recognition problems, is extremely effective in this setting.
The remainder of the paper is organized as follows. Related work is reviewed in Section 2. Section 3 provides the details of the proposed method. Section 4 reports our recognition results on the BU-4DFE database and provides a comparative analysis. Finally, Section 5 concludes the paper.

2. Related work
Although there has been considerable effort towards 3D facial expression analysis (Sandbach et al., 2012), most algorithms have focused on analyzing static data (Rosato et al., 2008). The earliest methods proposing a solution to the dynamic 3D facial expression recognition problem include (Sun and Yin, 2008; Sandbach et al., 2011; Fang et al., 2011). Sun et al. (Sun and Yin, 2008) used a 3D landmark shape model associated with 2D texture features and classified six facial expressions using Hidden Markov Models (HMMs). A 3D motion-based feature, Free-Form Deformations (FFDs), was exploited by Sandbach et al. in (Sandbach et al., 2011), which modeled the motion between frames of 3D facial geometry sequences and used HMMs to classify the video sequences. Fang et al. (Fang et al., 2011), instead, leveraged the annotated deformable face model (AFM) (Kakadiaris et al., 2007) approach to register consecutive frames, and then extracted spatio-temporal features from the models. Later on, a new approach called Deformation Vector Field (DVF) was proposed in (Amor et al., 2014), which densely captured dynamic information from the entire face based on Riemannian facial analysis; random forests (Breiman, 2001) were then used for expression classification. Recently, Berretti et al. (Berretti et al., 2013) presented a fully automatic and real-time approach, which introduced a facial landmark detection algorithm and used the mutual distances between landmarks to build the deformation models. All the methods mentioned so far, however, rely on high-resolution 3D meshes, and are therefore not suitable for videos retrieved by low-resolution RGB-D sensors. Only a few methods employ depth data, for instance (Le et al., 2011), which proposes a facial level-curves-based approach and classifies spatio-temporal features using HMMs.
It is well known that feature extraction is a crucial step in every recognition procedure. Still, recognizing facial expressions entails detecting very small changes on faces, especially when using low-resolution sequences. One possible solution is to learn new and more discriminative descriptors using feature learning procedures. One popular state-of-the-art feature learning approach is deep learning, which has proved useful in facial-expression-related tasks such as facial expression recognition (Liu et al., 2014) and facial landmark localization (Wu et al., 2013). However, a huge amount of data has to be provided to deep learning approaches in order to obtain satisfactory results; furthermore, such methods are currently too complex and time-consuming for practical applications. Another group of popular feature learning approaches are those related to dictionary learning, which can be used along with the Bag-of-Words (BoW) paradigm (Li et al., 2009), the Spatial Pyramid Matching (SPM) approach (Lazebnik et al., 2006), and other image representation models. Ionescu et al. (Ionescu et al., 2013), for instance, proposed a BoW-based 2D facial expression classification method for low-resolution images: they extracted dense SIFT descriptors and represented images as visual words from a codebook. Sikka et al. (Sikka et al., 2012) combined multi-scale dense SIFT features with SPM, and highlighted that spatial information at multiple scales is crucial for facial expression recognition. Sparse coding has also been widely used recently: in (Zafeiriou and Petrou, 2010), Zafeiriou et al. proposed to exploit sparse representations derived from l1 optimization problems for 2D facial expression recognition. However, as far as we know, there are no methods in the literature that exploit feature learning for 4D facial expression recognition.
3. Methodology
Our method is composed of three steps: preprocessing, feature learning, and classification. At the preprocessing stage, faces are detected and aligned in color and depth image sequences, and normalized to the same size afterwards. Then, color images are converted to gray, and 4D texture features are captured from cuboids of gray and depth image sequences via LBP-TOP descriptors. Based on these spatio-temporal features,
codebooks are initialized and updated via LLC in order to obtain sparse codes of the cuboid descriptors. 4D global features are built using spatial pyramid pooling. Finally, gray and depth features are concatenated and classified by CRFs.
Fig. 1. Three preprocessing steps of 3D dynamic facial expression recognition.
3.1. Preprocessing
Assuming that RGB and depth images of facial expressions have been captured, three preprocessing steps have to be performed in order to obtain aligned face areas across the sequences. First, faces are detected automatically with the Viola-Jones detector on the RGB images, and the eyes are detected as well; the results of such detections are shown in Fig. 1(a). Since the automatically detected face area varies from frame to frame, face detection is performed only on the first frame of each sequence, whereas eye detection is performed on every frame of the sequence. We refer to the central coordinates of the eyes as (x_e1, y_e1) and (x_e2, y_e2), and to their central point (x_c = (x_e1 + x_e2)/2, y_c = (y_e1 + y_e2)/2) as the reference point. If we connect the two eye centers with a line, the pose of the face can be corrected based on the angle between this line and the horizontal axis; this procedure is executed for each frame of the sequence. We then perform face detection on the first frame of the sequence and identify the top edge y_Top. At this point we can compute the quantities height_ref = |y_Top − y_c| and width_ref = (|x_e1 − x_c| + |x_e2 − x_c|)/2, which help us identify the scale of the faces for the rest of the sequence. In particular, we compute x_Left = x_c − 2 × width_ref, x_Right = x_c + 2 × width_ref, y_Top = y_c − height_ref, and y_Bottom = y_c + 2 × height_ref, as shown in Fig. 1(b). These values are computed for the rest of the sequence, allowing us to scale the faces. An example of normalized RGB and depth faces is shown in Fig. 1(c).
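To make this step concrete, the following is a minimal sketch of the alignment and cropping described above. It uses OpenCV's stock Haar cascades as the Viola-Jones face and eye detectors; the specific cascade files, the helper names and the output size are illustrative assumptions, not the authors' implementation (which was written in Matlab).

```python
import cv2
import numpy as np

# Stock OpenCV Haar cascades used here as the Viola-Jones detectors (an assumption:
# the paper only states that Viola-Jones face/eye detection is used).
FACE_CASCADE = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
EYE_CASCADE = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")


def detect_eyes(gray_frame):
    """Return the centres of the two detected eyes, ordered left to right."""
    boxes = EYE_CASCADE.detectMultiScale(gray_frame)
    centres = sorted((x + w / 2.0, y + h / 2.0) for (x, y, w, h) in boxes)
    return centres[:2]


def align_and_crop(frame, eyes, height_ref, width_ref, out_size=(160, 240)):
    """Rotate the frame so the eye line is horizontal, then crop and resize the
    face box of Section 3.1; out_size is (width, height) and is an assumption."""
    (xe1, ye1), (xe2, ye2) = eyes
    xc, yc = (xe1 + xe2) / 2.0, (ye1 + ye2) / 2.0          # reference point
    angle = np.degrees(np.arctan2(ye2 - ye1, xe2 - xe1))   # tilt of the eye line
    rot = cv2.getRotationMatrix2D((xc, yc), angle, 1.0)
    upright = cv2.warpAffine(frame, rot, (frame.shape[1], frame.shape[0]))
    x_left, x_right = int(xc - 2 * width_ref), int(xc + 2 * width_ref)
    y_top, y_bottom = int(yc - height_ref), int(yc + 2 * height_ref)
    face = upright[max(y_top, 0):y_bottom, max(x_left, 0):x_right]
    return cv2.resize(face, out_size)
```

As in the text, height_ref and width_ref would be computed once from the first frame of the sequence and reused for all remaining frames.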
3.2. 4D Texture Feature Extraction
At this stage, the images in each sequence are aligned and normalized, and the RGB images are converted to gray. A pair of gray and depth image sequences represents a dynamic expression sample. Each pair is partitioned into l × n × τ cuboids, where τ is the number of cuboids along the temporal axis; notably, a pair of gray and depth sequences is partitioned into the same number of cuboids. Then, a feature learning procedure is executed (see Fig. 2).
Local Binary Patterns (LBP) is a non-parametric operator that efficiently summarizes the local structures of an image (Heikkla and Pietikainen, 2006). A 3D LBP computed on only the three orthogonal planes of a cuboid is called LBP-TOP. As shown in the middle of Fig. 2, our 4D texture features are modeled with the concatenated LBP histograms from the three orthogonal planes of both gray and depth image cuboids. Let (x_r, y_r, t_r) be a vector containing the position (x_r, y_r) of a pixel r at time t_r, and let P be the number of pixels in its neighborhood. The LBP-TOP feature v_r is computed as

v_r = \{LBP_P(x_r, y_r),\ LBP_P(x_r, t_r),\ LBP_P(y_r, t_r)\}
    = \Big\{\sum_{p=0}^{P-1} s(g_{x_p,y_p,t_r} - g_{x_r,y_r,t_r})\,2^p,\ \sum_{p=0}^{P-1} s(g_{x_p,y_r,t_p} - g_{x_r,y_r,t_r})\,2^p,\ \sum_{p=0}^{P-1} s(g_{x_r,y_p,t_p} - g_{x_r,y_r,t_r})\,2^p\Big\},    (1)

where g_{x_r,y_r,t_r} is the value of pixel r, and s(x) = 1 if x ≥ 0 and s(x) = 0 if x < 0. For each plane of the cuboid, the LBP values of the pixels are accumulated into a histogram, so an LBP-TOP descriptor is composed of three histogram vectors. The feature extraction step on both gray and depth sequences provides the basic 4D texture feature of each sample, whose dimension is 59 × 3 × N × 2: 59 is the LBP vector dimension on each orthogonal plane (see (Heikkla and Pietikainen, 2006) for details), 3 is the number of orthogonal planes, N = l × n × τ is the number of cuboids, and the descriptor is computed on two channels (gray and depth). Specifically, given the LBP-TOP feature vector v_i of cuboid i, with v_i ∈ R^{59×3}, the 4D texture feature of an expression sample is
V = \{(v_1^{gray}, \ldots, v_i^{gray}, \ldots, v_N^{gray}),\ (v_1^{depth}, \ldots, v_i^{depth}, \ldots, v_N^{depth})\}, \quad V ∈ R^{(59×3×2)×N}.    (2)

Fig. 2. Feature learning procedure after preprocessing. It includes the extraction of 4D texture features, the coding of such descriptors via LLC and a spatial pyramid pooling.
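As an illustration of Eq. (1), the sketch below computes a 59-bin uniform-pattern LBP histogram on each of the three orthogonal planes of a cuboid and concatenates them. It is a simplified NumPy rendering (all slices of each plane are accumulated into a single histogram) under our own naming, not the authors' implementation; the uniform-pattern definition follows (Heikkla and Pietikainen, 2006).

```python
import numpy as np

def _uniform_lut(p=8):
    """Lookup table mapping 8-bit LBP codes to 59 uniform-pattern bins."""
    lut = np.full(2 ** p, -1, dtype=np.int64)
    nxt = 0
    for code in range(2 ** p):
        bits = [(code >> i) & 1 for i in range(p)]
        transitions = sum(bits[i] != bits[(i + 1) % p] for i in range(p))
        if transitions <= 2:          # uniform pattern
            lut[code] = nxt
            nxt += 1
    lut[lut == -1] = nxt              # all non-uniform codes share the last bin
    return lut                        # 59 bins for p = 8

_LUT = _uniform_lut()

def lbp_u2_hist(plane):
    """59-bin uniform LBP histogram of one 2D slice (8 neighbours, radius 1)."""
    c = plane[1:-1, 1:-1]
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c, dtype=np.int64)
    for bit, (dy, dx) in enumerate(shifts):
        nb = plane[1 + dy:plane.shape[0] - 1 + dy, 1 + dx:plane.shape[1] - 1 + dx]
        code |= ((nb >= c).astype(np.int64) << bit)     # s(g_p - g_c) in Eq. 1
    return np.bincount(_LUT[code].ravel(), minlength=59).astype(np.float64)

def lbp_top(cuboid):
    """LBP-TOP descriptor of a cuboid of shape (T, H, W), all dims >= 3:
    concatenated XY, XT and YT histograms, 59 x 3 values."""
    xy = sum(lbp_u2_hist(cuboid[t]) for t in range(cuboid.shape[0]))
    xt = sum(lbp_u2_hist(cuboid[:, y, :]) for y in range(cuboid.shape[1]))
    yt = sum(lbp_u2_hist(cuboid[:, :, x]) for x in range(cuboid.shape[2]))
    return np.concatenate([h / max(h.sum(), 1) for h in (xy, xt, yt)])
```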
3.3. Codebook training via LLC
In order to represent each cuboid as a sparse vector, we use a coding strategy. To this end, we initialize a codebook B using k-means clustering over all our data, and then loop through all the training descriptors in the dataset to update B incrementally via LLC; based on the learned codebook, the sparse codes are acquired via LLC as well. LLC is a coding scheme that can be used both for codebook training and for sparse coding. It learns the codebook B = {b_1, ..., b_m, ..., b_M} and the associated sparse codes C = {c_1, ..., c_k, ..., c_K} from the training samples by solving the following optimization problem:
\arg\min_{C,B} \sum_{k=1}^{K} \|v_k - B c_k\|^2 + \lambda \|d_k \odot c_k\|^2, \quad s.t.\ \mathbf{1}^T c_k = 1\ \ \forall k; \quad \|b_m\|^2 \le 1\ \ \forall m,    (3)

where ⊙ denotes element-wise multiplication and d_k ∈ R^M is the locality adaptor that gives a different freedom to each codeword, proportional to its similarity to the feature vector v_k. In particular,

d_k = \exp\!\left(\frac{\widehat{dist}(v_k, B)}{\sigma}\right),    (4)

\widehat{dist}(v_k, B) = \frac{dist(v_k, B)}{\max(dist(v_k, B))},    (5)
where dist(v_k, B) = [dist(v_k, b_1), ..., dist(v_k, b_M)]^T, dist(v_k, b_m) is the Euclidean distance between v_k and b_m, and σ is the parameter that adjusts the weight decay speed of the locality adaptor.
After the codebook initialization via k-means, the training descriptors are employed to update the dictionary. Specifically, during each iteration the current codebook B is used to encode a training descriptor v_k by computing the sparse code c_k with Eq. 3. Then, only the subset of codewords B^* whose corresponding weights c_k(j) are larger than a predefined constant β is updated in the codebook:

B^* = \{b_j\}, \quad s.t.\ |c_k(j)| > \beta.    (6)

We then represent v_k with the sparse code c^*:

c^* = \arg\min_{c} \|v_k - B^* c\|^2, \quad s.t.\ \sum_j c(j) = 1.    (7)
Finally, B^* is replaced by B^*_k, which is obtained from c^* through a gradient-descent update, as described in (Wang et al., 2010). Each update, starting from one training descriptor, generates a new codebook, which is then updated with another training descriptor in the next iteration; the iterations end when every training descriptor has been used once. In our case, since the number of cuboids is very large, only a randomly selected subset of cuboids is used as training descriptors for the codebook. The updated codebook is finally applied to all the 4D texture descriptors of the cuboids via Eq. 3, obtaining a set of sparse codes.
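For a fixed codebook, the code defined by Eq. 3 for a single descriptor can be obtained in closed form from the constrained least-squares problem, in the spirit of (Wang et al., 2010). The sketch below shows this coding step only (the incremental codebook update is not reproduced); the function name and the NumPy formulation are ours, and the default λ = 500 and σ = 100 follow the values reported in Section 4.2.

```python
import numpy as np

def llc_code(v, B, lam=500.0, sigma=100.0):
    """Locality-constrained sparse code of descriptor v for a fixed codebook B (M x D).

    Solves Eq. 3 with the locality adaptor of Eqs. 4-5, using the closed form of the
    equality-constrained quadratic program and normalising the code to sum to one.
    """
    # Locality adaptor (Eqs. 4-5): larger penalty for codewords far from v.
    dist = np.linalg.norm(B - v, axis=1)
    d = np.exp((dist / dist.max()) / sigma)
    # With 1^T c = 1, the data term ||v - B^T c||^2 equals c^T (B - 1 v^T)(B - 1 v^T)^T c.
    shifted = B - v
    G = shifted @ shifted.T + lam * np.diag(d ** 2)
    c = np.linalg.solve(G, np.ones(B.shape[0]))
    return c / c.sum()                # enforce the sum-to-one constraint
```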
3.4. Coding and spatial pyramid pooling
In our coding process, codebooks for gray and depth images are learned separately, so that facial expression samples are represented by a combination of sparse codes from both codebooks. Since the numbers of cuboids in the gray and depth sequences are the same, their codebooks have the same number of entries as well. Given that the codebook B has M entries, the sparse codes of each facial expression sequence form C ∈ R^{(M×2)×N}, where N = l × n × τ is the number of cuboids.
Spatial pyramid pooling is then applied to the sparse codes of each facial expression sequence. It partitions the sequence into pyramid cells, each comprising a set of cuboids. The feature of each cell Q is the max-pooled sparse code, i.e., the component-wise maximum over all sparse codes of the cuboids within the cell:

f_Q = \big[\max_{j \in Q} |c_{j1}|, \ldots, \max_{j \in Q} |c_{jm}|, \ldots, \max_{j \in Q} |c_{jM}|\big],    (8)
where j ranges over all the cuboids in the cell Q, and c_{jm} is the m-th component of the sparse code c_j. We use a three-level pyramid with 1 × 1 × 1, 2 × 2 × 1 and 4 × 4 × 1 cells, as illustrated in Fig. 2. As a result, the pooled feature vector for a gray or depth image sequence is 21 times the dimensionality of f_Q.
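A compact sketch of the three-level max pooling of Eq. 8 is given below. It assumes the sparse codes are arranged on the l × n × τ cuboid grid with l and n at least 4 (so that every pyramid cell is non-empty); the function name is ours.

```python
import numpy as np

def pyramid_max_pool(codes, levels=((1, 1, 1), (2, 2, 1), (4, 4, 1))):
    """Max-pool sparse codes over the spatial pyramid of Eq. 8.

    `codes` has shape (l, n, tau, M): one M-dimensional sparse code per cuboid.
    With the default three levels (1 + 4 + 16 = 21 cells) the result has length 21 * M.
    """
    l, n, tau, M = codes.shape
    pooled = []
    for cx, cy, ct in levels:
        # Split the cuboid grid into cx * cy * ct cells and keep, for each cell,
        # the component-wise maximum of |c_j| over the cuboids it contains.
        for xs in np.array_split(np.arange(l), cx):
            for ys in np.array_split(np.arange(n), cy):
                for ts in np.array_split(np.arange(tau), ct):
                    cell = np.abs(codes[np.ix_(xs, ys, ts)])
                    pooled.append(cell.reshape(-1, M).max(axis=0))
    return np.concatenate(pooled)
```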
Algorithm 1 Framework of 4D feature learning
Input: a pair of gray and depth facial expression videos
Output: {f^1, ..., f^Q, ..., f^21}_{gray/depth}
1. Calculate the LBP-TOP features V ∈ R^{(59×3×2)×N} by Eq. (2)
2. Initialize the dictionaries B_{gray/depth} = {b_1^{gray/depth}, ..., b_m^{gray/depth}, ..., b_M^{gray/depth}}
3. Randomly select k descriptors to update B_{gray/depth}
   while k do
      1) Select B^* from B by Eq. (6)
      2) Calculate c^* by Eq. (7)
      3) Obtain B^*_k from c^*
      4) B^* ← B^*_k
      5) k ← k − 1
   end while
4. Obtain C_{gray/depth} for each cuboid by Eq. (3)
5. Pool C_{gray/depth} → {f^1, ..., f^Q, ..., f^21}_{gray/depth}
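Putting the pieces together, Algorithm 1 can be rendered in Python roughly as follows. The sketch reuses the hypothetical helpers lbp_top, llc_code and pyramid_max_pool from the earlier listings and assumes the two codebooks have already been initialized with k-means and refined with the incremental LLC update; it is an illustration, not the authors' code.

```python
import numpy as np

def learn_4d_features(gray_cuboids, depth_cuboids, B_gray, B_depth):
    """Sketch of Algorithm 1 for one expression sample.

    gray_cuboids / depth_cuboids: arrays of shape (l, n, tau, T, H, W) holding the
    cuboids of the gray and depth sequences; B_gray / B_depth: codebooks of shape (M, 177).
    Returns the concatenated pooled features of the two channels.
    """
    pooled = []
    for cuboids, B in ((gray_cuboids, B_gray), (depth_cuboids, B_depth)):
        l, n, tau = cuboids.shape[:3]
        codes = np.zeros((l, n, tau, B.shape[0]))
        for i in range(l):
            for j in range(n):
                for k in range(tau):
                    # LBP-TOP descriptor (Eq. 1) and its LLC sparse code (Eq. 3).
                    codes[i, j, k] = llc_code(lbp_top(cuboids[i, j, k]), B)
        # 21-cell spatial pyramid pooling (Eq. 8) on the l x n x tau code grid.
        pooled.append(pyramid_max_pool(codes))
    return np.concatenate(pooled)   # gray and depth features are concatenated
```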
3.5. CRFs for facial expression recognition
After pooling, we train and classify with models based on Conditional Random Fields (CRFs). CRFs have been proved effective in facial expression recognition (Jain et al., 2011), since they exploit temporal frames as neighboring consistency information.
We refer to the observation sequence (X_1, X_2, ..., X_T) as X and to the expression label sequence (Y_1, Y_2, ..., Y_T) as Y, with Y ∈ 𝒴. Here X_t, t ∈ {1, 2, ..., T}, is the t-th observation vector of the temporal sequence, representing features. Since the features are extracted from cuboids spanning several frames, t does not index the t-th frame of the video but the t-th cuboid of the temporal sequence. Given the above definitions, a CRF model for a temporal sequence of length T is defined as
p(Y|X, \theta)_{CRF} = \frac{\exp\big(\sum_k \theta_k F_k(Y, X)\big)}{\sum_{Y \in \mathcal{Y}} \exp\big(\sum_k \theta_k F_k(Y, X)\big)},    (9)

F_k(Y, X) = \sum_{t=1}^{T} f_k(Y_{t-1}, Y_t, X, t),    (10)
where f_k(Y_{t-1}, Y_t, X, t) is a unified representation for a transition function t_k(Y_{t-1}, Y_t, X, t) or a state function s_k(Y_t, X, t). t_k(Y_{t-1}, Y_t, X, t) models the temporal dependence between the (t-1)-th and the t-th features over all observation vectors and labels of the temporal sequence, while s_k(Y_t, X, t) refers to the features of all observation vectors and labels at time t.
In Fig. 3, we provide an example of facial expression training data: a sequence of the disgust expression. It can be divided into five phases chronologically: neutral, onset of disgust, apex of disgust, offset of disgust, and neutral again. Since the different phases of a facial expression are not as complex and varied as, for instance, those of human activities, we can recognize expressions based only on the onset phase. In order to detect onset segments, we rely on the work presented by Zafeiriou et al. in (Zafeiriou et al., 2013): they obtain a latent space by applying EM-SFA to an expression video and accurately capture the transitions between the different states of the expression. As a result, we can feed the CRFs with expression samples that include only the onset phase and recognize the facial expression in a completely automatic manner.

Fig. 3. Different phases in a dynamic facial expression video, including: onset, apex and offset.

4. Experimental results and qualitative evaluation

4.1. Database description and preprocessing
BU-4DFE is the most popular 3D dynamic facial expression database. It provides a high-resolution RGB image sequence as well as a 3D mesh video for each expression. Every expression includes about 100 frames, and most sequences include the different phases of the expression. The 3D landmark points of each face are provided as well. Each subject in the database has 6 different expression videos, namely anger, disgust, fear, happiness, sadness, and surprise. There are 101 subjects in total, which means that there are 606 facial expression sequences.
In order to test our algorithm on low-resolution images, we build three datasets with different resolutions based on BU-4DFE. Their resolutions are 240 × 160, 120 × 80 and 60 × 40 respectively, and we name them dataset 1, dataset 2 and dataset 3. Since no depth data is provided by BU-4DFE, we rebuild the depth image sequences with the following strategy: assuming that the area of the extracted face has been defined, depth information is extracted from the vertex values of the 3D meshes. The depth values of the faces within the same sequence are mapped to the range 0 to 255 according to the same rule, so that the background has the value 0 and the highest nose tip among the faces within an expression sequence has the value 255. However, since the depth resolution of the Kinect is about 2 mm at 1 m (Khoshelham and Elberink, 2012), and assuming the depth of a head is no more than 30 cm, we use only 128 values between 1 and 255, by selecting the odd numbers in the depth images.
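The depth rebuilding step just described can be sketched as follows. The function name, the way the nose-tip and far depths (z_near, z_far) and the background mask are supplied, and the assumption that z holds distances from the sensor are all illustrative choices, not details given in the paper.

```python
import numpy as np

def quantise_depth(z, z_near, z_far, background_mask):
    """Rescale face depths to 1..255 with one rule for the whole sequence, keeping only
    the 128 odd levels to mimic the ~2 mm depth resolution of a Kinect at 1 m;
    background pixels are set to 0 and the nose tip (z_near) maps to 255."""
    scaled = np.clip((z_far - z) / (z_far - z_near), 0.0, 1.0) * 254.0 + 1.0
    odd = (2 * np.round((scaled - 1.0) / 2.0) + 1).astype(np.int32)   # snap to odd values
    depth = np.where(background_mask, 0, np.clip(odd, 1, 255))
    return depth.astype(np.uint8)
```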
4.2. Recognition results using CRFs
We provide experimental results on our rebuilt datasets. Following our strategy that only frames belonging to the onset of the expression are used for training and classification, we manually remove the videos that do not contain the onset phase. We also remove the videos that contain corrupted images or meshes (i.e., part of the image or mesh is lost). In order to keep the characteristics of the original database, we keep as many subjects as possible; the selection results in 411 sequences from 95 subjects. Ten-fold cross validation is performed on the selected dataset: in each round, the 95 subjects are randomly partitioned into testing and training samples, composed of 9 and 86 subjects respectively. The results of the ten rounds are then averaged to provide a statistically meaningful performance measurement of the proposed method.
Experiments are carried out on all three datasets. After comparison, we found that partitioning the image frames into blocks of between 5 × 5 and 10 × 10 pixels usually gives the best performance, so we partition the image frames into 24 × 16 parts in datasets 1 and 2, and into 12 × 8 parts in dataset 3. We define M as half the number of descriptors used in the codebook initialization, set λ = 500 and σ = 100 in Eq. 3 and Eq. 4 respectively, and set β in Eq. 6 to 0.01.
Expression recognition results are reported in Table 1: Acc. of 6 is the accuracy over all six expressions, while Acc. of 3 is the accuracy over happiness, sadness, and surprise.

Table 1. Expression recognition results on the three low-resolution datasets
Dataset     Resolution   Acc. of 6   Acc. of 3
Dataset 1   240 × 160    83.07%      97.75%
Dataset 2   120 × 80     79.38%      98.96%
Dataset 3   60 × 40      69.1%       97.52%

The confusion matrix of dataset 1 is shown in Table 2. The columns of the table are the true expressions to be classified, whereas the rows represent the classification results. The best classification result is obtained for surprise, with an accuracy of 98%; anger, happiness, and sadness all reach accuracies above 80%.

Table 2. Confusion matrix of our method on 240 × 160 image sequences
             Anger   Disgust   Fear   Happi.   Sadness   Surprise
Anger        0.85    0.05      0.05   0        0.11      0
Disgust      0.13    0.79      0.09   0        0         0
Fear         0       0.07      0.56   0        0.06      0.02
Happiness    0       0.05      0.15   0.95     0         0
Sadness      0.02    0.01      0.02   0.05     0.83      0
Surprise     0       0.03      0.13   0        0         0.98

The experiments were implemented in Matlab on a PC with an Intel i5 CPU and 4 GB of RAM. The average time needed to recognize a facial expression is 2.3 s; the algorithm would certainly run much faster with optimized C++ code.
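A sketch of the subject-disjoint evaluation protocol described above is given below, using scikit-learn's GroupShuffleSplit to draw ten random splits of 9 test subjects; the classifier factory train_crf is a hypothetical placeholder for the CRF training of Section 3.5, and the whole listing is an assumption about how the protocol could be scripted rather than the authors' setup.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def evaluate(features, labels, subject_ids, n_rounds=10, test_subjects=9):
    """Average accuracy over n_rounds random subject-disjoint splits (Section 4.2)."""
    splitter = GroupShuffleSplit(n_splits=n_rounds, test_size=test_subjects, random_state=0)
    accuracies = []
    for train_idx, test_idx in splitter.split(features, labels, groups=subject_ids):
        model = train_crf(features[train_idx], labels[train_idx])   # hypothetical CRF trainer
        predictions = model.predict(features[test_idx])
        accuracies.append(np.mean(predictions == labels[test_idx]))
    return float(np.mean(accuracies))
```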
4.3. Qualitative evaluation
We test and compare several features against our proposed feature on dataset 1. In traditional facial expression recognition, LBP and Gabor features are the most widely used; as extensions of these two features, we adopt LBP-TOP and 3D Gabor as candidates in our experiments. We also consider HON4D (Oreifej and Liu, 2013), a 4D feature with good performance on action recognition. Fig. 4 plots the recognition performance of our feature, LBP-TOP, and the other two competitors. Our feature outperforms the others for all tested numbers of subjects.

Fig. 4. Comparison results between methods using LBP-TOP, 3D Gabor, HON4D and our feature.

4.4. Comparison to other techniques
Six representative works considering the problem of dynamic 3D facial expression recognition are selected for comparison (Le et al., 2011; Sandbach et al., 2011; Berretti et al., 2013; Fang et al., 2011; Sun and Yin, 2008; Amor et al., 2014). All of them were evaluated on the BU-4DFE database with ten-fold cross validation, as in our experiments, but with different numbers of subjects or expressions. The main difference is that our results are based on low-resolution 240 × 160 images, whereas all the other works use the high-resolution BU-4DFE data. A comparison based only on accuracy values would therefore not be rigorous, so we list the main characteristics of these six methods and ours in Table 3, partitioning them into three groups (separated in the table). The first group contains non-automatic approaches recognizing six expressions; the second contains automatic approaches that also recognize six expressions; the last group presents methods that recognize only three expressions. For each method we list the input data, the classifier, and the number of subjects and expression sequences used in the experiments; the average accuracy is shown in the last column.
Regarding the two groups that deal with six expressions, if we focus only on accuracy, (Sun and Yin, 2008) and (Amor et al., 2014) both obtain better results than our approach. However, in (Sun and Yin, 2008) the ground-truth 3D landmark points of the faces are employed directly as features, so the method is not automatic and may not be robust in practical applications. The work presented in (Amor et al., 2014) shows the highest accuracy among the existing related works, but its result is based on high-resolution 3D meshes from BU-4DFE, whereas ours comes directly from low-resolution RGB-D images, which are the common input in human-machine interaction applications. Notably, our approach outperforms the other two methods on three-expression recognition.

Table 3. Comparison among existing 4D facial expression recognition methods
Method                                    Input data                  Classifier          Fully-auto.  Expr.  Sub.  Seq.  Acc.
Sun et al. (Sun and Yin, 2008)            3D landmarks + RGB images   HMMs                n            6      60    360   90.4%
Fang et al. (Fang et al., 2011)           3D meshes                   SVM                 n            6      100   507   75.82%

Drira et al. (Amor et al., 2014)          3D meshes                   Random forest       y            6      60    360   93.25%
Berretti et al. (Berretti et al., 2013)   3D meshes                   HMMs                y            6      60    360   79.44%
Ours                                      RGB and depth images        CRFs                y            6      95    411   83.07%

Le et al. (Le et al., 2011)               depth images                HMMs                y            3      60    360   92.22%
Sandbach et al. (Sandbach et al., 2011)   3D meshes                   Gentle boost+HMMs   y            3      100   -     81.93%
Ours                                      RGB and depth images        CRFs                y            3      95    411   97.75%
5. Conclusion
We present a new method to recognize low-resolution 3D dynamic facial expressions using feature learning on spatio-temporal texture descriptors. After preprocessing, color, depth and temporal information are all captured by spatio-temporal RGB-D texture features. We then apply sparse coding to represent these features over a learned basis and to reduce their dimensionality. Spatial pyramid pooling is subsequently exploited to retain the global position information of the cuboids while discarding the remaining components. Finally, CRFs, a class of statistical models well suited to predicting labels for sequences of samples, take full advantage of the temporal consistency information when classifying our 4D features.
We assessed our method on three low-resolution datasets, obtaining a satisfactory accuracy over six different facial expressions. As opposed to methods using high-resolution 3D meshes, we present an algorithm that is suitable for applications that require raw data from standard RGB-D sensors. In this sense, our results are promising and can be replicated on databases recorded with low-resolution sensors such as the Kinect.
ACKNOWLEDGEMENTS
This work is supported by the National Natural Science Foundation of China (No. 61302151, No. 61401268) and the Shanghai Natural Science Foundation, China (No. 13ZR1455100).
References

Amor, B.B., Drira, H., Berretti, S., Daoudi, M., Srivastava, A.. 4D facial expression recognition by learning geometric deformations. IEEE Transactions on Cybernetics 2014;44(12):2443–2457.
Berretti, S., del Bimbo, A., Pala, P.. Automatic facial expression recognition in real-time from dynamic sequences of 3D face scans. The Visual Computer 2013;29(12):1333–1350.
Bo, L., Ren, X., Fox, D.. Unsupervised feature learning for RGB-D based object recognition. International Symposium on Experimental Robotics 2012;88:387–402.
Breiman, L.. Random forests. Machine Learning 2001;45:5–32.
Fang, T., Zhao, X., Shah, S.K., Kakadiaris, I.A.. 4D facial expression recognition. In: ICCVW. 2011. p. 1594–1601.
Heikkla, M., Pietikainen, M.. A texture-based method for modeling the background and detecting moving objects. IEEE Transactions on Pattern Analysis and Machine Intelligence 2006;28(4):657–662.
Huang, D., Shan, C., Ardabilian, M., Wang, Y., Chen, L.. Local binary patterns and its application to facial image analysis: a survey. IEEE Transactions on Systems, Man, and Cybernetics 2011;41(6):765–781.
Ionescu, R.T., Popescu, M., Grozea, C.. Local learning to improve bag of visual words model for facial expression recognition. In: ICML 2013 Workshop on Representation Learning. 2013.
Jain, S., Hu, C., Aggarwal, J.K.. Facial expression recognition with temporal modeling of shapes. In: ICCV Workshops. 2011. p. 1642–1649.
Kakadiaris, I., Passalis, G., Toderici, G., Murtuza, M.N., Lu, Y., Karampatziakis, N., Theoharis, T.. Three-dimensional face recognition in the presence of facial expressions: An annotated deformable model approach. PAMI 2007;29(4):640–649.
Khoshelham, K., Elberink, S.O.. Accuracy and resolution of Kinect depth data for indoor mapping applications. Sensors 2012;12(2):1437–1454.
Lafferty, J., McCallum, A., Pereira, F.C.. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proceedings of the 18th International Conference on Machine Learning. 2001. p. 282–289.
Lazebnik, S., Schmid, C., Ponce, J.. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: CVPR. 2006. p. 2169–2178.
Le, V., Tang, H., Huang, T.S.. Expression recognition from 3D dynamic faces using robust spatio-temporal shape features. In: FG. 2011. p. 414–421.
Lee, H., Grosse, R., Ranganath, R., Ng, A.Y.. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In: ICML. 2009. p. 609–616.
Li, Z., Imai, J., Kaneko, M.. Facial-component-based bag of words and PHOG descriptor for facial expression recognition. In: IEEE International Conference on Systems, Man and Cybernetics. 2009. p. 1353–1358.
Liu, P., Han, S., Meng, Z., Tong, Y.. Facial expression recognition via a boosted deep belief network. In: CVPR. 2014. p. 1805–1812.
Oreifej, O., Liu, Z.. HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences. In: CVPR. 2013. p. 716–723.
Pantic, M., Rothkrantz, L.. Automatic analysis of facial expressions: the state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence 2000;22(12):1424–1445.
Rosato, M., Chen, X., Yin, L.. Automatic registration of vertex correspondences for 3D facial expression analysis. In: Biometrics: Theory, Applications and Systems, 2nd IEEE International Conference on. 2008. p. 1–7.
Sandbach, G., Zafeiriou, S., et al.. A dynamic approach to the recognition of 3D facial expressions and their temporal models. In: FG. 2011. p. 406–413.
Sandbach, G., Zafeiriou, S., Pantic, M., Yin, L.. Static and dynamic 3D facial expression recognition: A comprehensive survey. Image and Vision Computing 2012;30(10):683–697.
Sikka, K., Wu, T., Susskind, J., Bartlett, M.. Exploring bag of words architectures in the facial expression domain. In: ECCV. 2012. p. 250–259.
Sinha, P., Balas, B., Ostrovsky, Y., Russell, R.. Face recognition by humans: 19 results all computer vision researchers should know about. Proceedings of the IEEE 2006;94(11):1948–1962.
Sun, Y., Yin, L.. Facial expression recognition based on 3D dynamic range model sequences. In: ECCV. 2008. p. 58–71.
Theriault, C., Thome, N., Cord, M.. Dynamic scene classification: Learning motion descriptors with slow features analysis. In: ICCV. 2013. p. 2603–2610.
Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., Gong, Y.. Locality-constrained linear coding for image classification. In: CVPR. 2010. p. 3360–3367.
Wu, Y., Wang, Z., Ji, Q.. Facial feature tracking under varying facial expressions and face poses based on restricted Boltzmann machine. In: CVPR. 2013. p. 3452–3459.
Yin, L., Chen, X., Sun, Y., Worm, T.. A high-resolution 3D dynamic facial expression database. In: FG. 2008. p. 1–6.
Zafeiriou, L., Nicolaou, M.A., Zafeiriou, S., Nikitidis, S., Pantic, M.. Learning slow features for behaviour analysis. In: ICCV. 2013. p. 4321–4328.
Zafeiriou, S., Petrou, M.. Sparse representations for facial expressions recognition via l1 optimization. In: CVPRW. 2010. p. 32–39.
Zeng, Z., Pantic, M., Roisman, G., Huang, T.. A survey of affect recognition methods: Audio, visual, and spontaneous expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence 2009;31(1):39–58.