Measuring the intensity of spontaneous facial action units with dynamic Bayesian network

Yongqiang Li (a), S. Mohammad Mavadati (b), Mohammad H. Mahoor (b), Yongping Zhao (a), Qiang Ji (c)

(a) School of Electrical Engineering and Automation, Harbin Institute of Technology, Harbin 15001, China
(b) Department of Electrical and Computer Engineering, University of Denver, Denver, CO 80208, USA
(c) Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute, Troy, NY 12180, USA

Corresponding author: Yongqiang Li. Tel.: +86 15245104277. E-mail address: [email protected] (Y. Li).

Article history: Received 27 March 2014; Received in revised form 16 March 2015; Accepted 20 April 2015.

Abstract

Automatic facial expression analysis has received great attention in different applications over the last two decades. The Facial Action Coding System (FACS), which describes all possible facial expressions in terms of a set of facial muscle movements called Action Units (AUs), has been used extensively to model and analyze facial expressions. FACS also describes methods for coding the intensity of AUs, and AU intensity measurement is important in several studies in behavioral science and developmental psychology. However, in the majority of existing studies on facial expression recognition, the focus has been on basic expression recognition or facial action unit detection; there are very few investigations into measuring the intensity of spontaneous facial actions. In addition, the few studies on AU intensity recognition usually measure the intensity of facial actions statically and individually, ignoring the dependencies among multilevel AU intensities as well as the temporal information. These spatiotemporal interactions among facial actions are crucial for understanding and analyzing spontaneous facial expressions, since it is these coherent, coordinated, and synchronized interactions that produce a meaningful facial display. In this paper, we propose a framework based on a Dynamic Bayesian Network (DBN) to systematically model the dynamic and semantic relationships among multilevel AU intensities. Given the extracted image observations, AU intensity recognition is accomplished through probabilistic inference by systematically integrating the image observations with the proposed DBN model. Experiments on the Denver Intensity of Spontaneous Facial Action (DISFA) database demonstrate the superiority of our method over single image-driven methods in AU intensity measurement.

Keywords: Spontaneous facial expression; AU intensity; Dynamic Bayesian Network; FACS; DISFA database

1. Introduction

Facial expression is one of the most common nonverbal communication media that individuals use in their daily social interactions. Analyzing facial expressions provides powerful information for describing the emotional states and psychological patterns of individuals. In the last two decades automatic facial expression recognition has gained increasing attention in several applications in developmental psychology, social robotics, affective online tutoring environments, and intelligent Human–Computer Interaction (HCI) design [1,2].

The Facial Action Coding System (FACS) is one of the best-known approaches for describing and analyzing facial expressions [3]. FACS describes all possible facial expressions based on a set of anatomical facial muscle movements, called Action Units (AUs). For instance, AU12, or lip corner puller, specifies the contraction of the zygomaticus major muscle [3]. FACS is also capable of representing the dynamics of every facial behavior by annotating the intensity of each AU on a five-level ordinal scale (i.e., scales A–E, indicating barely visible to maximum intensity of each AU). AU intensity can describe the occurrence of spontaneous facial expressions in more detail. The general relationship between the scale of evidence and the A–B–C–D–E intensity scoring, together with some AU samples, is illustrated in Fig. 1. Generally, the A level refers to a trace of the action; B, slight evidence; C, marked or pronounced; D, severe or extreme; and E, maximum evidence. For example, "AU12B" indicates AU12 with a B intensity level. Manual FACS coding is an intensive and time-consuming task, and designing an automatic system that can specify the list of AUs and their intensities would help the community analyze spontaneous facial behaviors accurately and efficiently.


Fig. 1. Relation between the scale of evidence and intensity scores for facial action units [3].

The majority of the existing literature focuses on two types of facial expression studies. The first category is concerned with analyzing and classifying prototypic facial expressions (a.k.a. the six basic expressions: happiness, sadness, disgust, anger, surprise, and fear). These studies are mostly designed to recognize the basic expressions that represent human emotions; these expressions of emotion are known to be similar across different cultures [39]. The second category of facial behavior analysis specifies expressions by a set of AUs, where the goal is to represent and recognize facial AUs defined by FACS. The latter approach can describe a wider range of facial expressions, and AU-based analyzers are also capable of representing the prototypic facial expressions as combinations of AUs. For instance, 'fear' can be represented by a combination of AU1, 2, 4, 5, and 25 [3].

Most existing studies have focused on prototypic facial expressions and on detecting the occurrence of AUs in posed facial expressions. In many real-world applications, however, we need to analyze spontaneous facial expressions, such as categorizing pain-related facial expressions [7] and measuring the engagement of students in online tutoring applications. Spontaneous facial behavior analysis can be very challenging due to several factors, such as out-of-plane head motion and varying poses, subtle facial expressions, and intra-subject variability in the dynamics and timing of facial actions. In addition, it has been shown that the dynamics and patterns of spontaneous facial expressions can be very different from posed ones. Because of these challenging factors, analyzing spontaneous facial expressions, especially for intensity measurement, is not as robust and accurate as analyzing posed expressions. In early work on automatic AU intensity measurement, Bartlett et al. [13] measured the intensity of AUs in posed and spontaneous facial expressions using Gabor wavelets and support vector machines. The mean correlation with human-coded intensity of their automated system was 0.63 for posed and 0.3 for spontaneous facial behavior, respectively. These quantitative results demonstrate that recognizing spontaneous expressions is more challenging than recognizing posed expressions.

In the area of spontaneous facial action recognition, there are very few works on detecting or measuring the intensity of spontaneous facial actions [5,15]. To the best of the authors' knowledge, most current studies, including [5,15,14], analyze spontaneous facial actions statically and individually; in other words, the dependencies among multilevel AU intensities as well as the temporal information are ignored. The semantic and dynamic relationships among facial actions are crucial for understanding and analyzing spontaneous expressions; in fact, the coordinated and synchronized spatiotemporal interactions between facial actions produce a meaningful facial expression. Tong et al. [25] employed a Dynamic Bayesian Network (DBN) to model the dependencies among AUs and achieved an improvement over single image-driven methods, especially for recognizing AUs that are difficult to detect but have strong relationships with other AUs. However, their work [25] focuses on AU detection in posed expressions. Following the idea in [25], in this paper we introduce a framework based on a DBN to systematically model the spatiotemporal dependencies among multi-AU intensity levels in multiple frames, in order to measure the intensity of spontaneous facial actions.

The proposed probabilistic framework is capable of recognizing multilevel AU intensities in spontaneous facial expressions. The Denver Intensity of Spontaneous Facial Action (DISFA) database [9], which is publicly available for analyzing AU intensities and their dynamics, is employed in this study. For every frame in DISFA, the intensity of every AU is provided on a scale from 0 (absence of an AU) to 5 (maximum intensity). To demonstrate the effectiveness of the proposed model, rigorous experiments are performed on the DISFA database; the experimental results, as well as a detailed analysis of the improvements, are reported in this paper.

2. Related works

Given the significant role of faces in humans' emotional and social life, automating the analysis of facial expressions has gained great attention in both academia and industry. An automated facial expression recognition system usually consists of two key stages: feature extraction and machine learning algorithm design for classification. Commonly used features that represent facial gestures or facial movements include optical flow [35,41], explicit feature measurements (e.g., length of wrinkles and degree of eye opening) [42], Haar features [43], Local Binary Pattern (LBP) features [44,45], independent component analysis (ICA) [46], feature points [36], 3D geometric features [54,55], and Gabor wavelets [4]. Given the extracted facial features, facial actions are identified using machine learning methods such as Neural Networks [42], Support Vector Machines (SVM) [5], rule-based approaches [47], AdaBoost classifiers, and Sparse Representation (SR) classifiers [48].

Facial action intensity estimation has recently received increasing interest due to the semantic richness of the predictions. Bartlett et al. [49] estimated the intensity of action units using the distance of a test example to the SVM separating hyperplane, while Hamm et al. [50] used the confidence of the decision obtained from AdaBoost. A multi-class classifier is a natural choice for this problem. For example, Mahoor et al. [5] utilized AAM features in conjunction with six one-vs-all SVM classifiers to automatically measure the intensity of AU6 and AU12 in videos captured from infant–mother interactions. More recently, Mavadati et al. presented the DISFA database [9], which is fully coded with six-level AU intensities, and employed Local Binary Pattern histogram (LBPH), HOG, and Gabor features followed by multi-class SVM classification for AU intensity measurement. Besides multi-class classifiers, regression is also commonly used for continuous expression estimation. For instance, Jeni et al. [51,56] and Savran et al. [52] applied Support Vector Regression (SVR) for prediction, while Kaltwang et al. [53] used Relevance Vector Regression (RVR) instead. Both SVR and RVR are extensions of SVM to regression, although RVR yields a probabilistic output.

Most current works recognize facial actions individually and statically. However, due to the richness, ambiguity, and dynamic nature of facial actions, individually recognizing each AU intensity is neither accurate nor reliable for spontaneous facial expressions.


Fig. 2. The flowchart of the proposed online AU intensity recognition system.

Understanding spontaneous facial expressions requires not only improving facial motion observations but, more importantly, exploiting the spatiotemporal interactions among facial motions, since the coherent, coordinated, and synchronized interactions among AUs produce a meaningful facial expression. Some previous studies have focused on exploiting the semantic and dynamic relationships among facial actions [35,36,25]. Lien et al. [35] and Valstar et al. [36] employed sets of Hidden Markov Models (HMMs) to represent the evolution of facial actions in time. Tong et al. [25,37] constructed a Dynamic Bayesian Network (DBN) based model to further exploit the semantic and temporal dependencies among facial actions. However, these works are all limited to detecting the presence or absence of AUs, mainly in posed expressions. Recently, there have been some works exploiting the interactions between AU intensity values. Sandbach et al. [58] employed a Markov Random Field to model the static correlations among AU intensities, which improves the recognition accuracy compared to regressors such as SVRs and Directed Acyclic Graph Support Vector Machines. Conditional Ordinal Random Fields (CORF) were extended in [59] to AU intensity estimation; however, this method [59] can only estimate the intensity of an AU when its presence is already known. Baltrušaitis et al. [57] combine neural networks and Continuous Conditional Random Fields (CCRF) into Continuous Conditional Neural Fields (CCNF) for structured regression in AU intensity estimation, where each hidden component consists of a neural layer.

In this work, we construct a DBN to systematically model the spatiotemporal interactions among multi-level AU intensities. Advanced machine learning techniques are employed to train the framework from both subjective prior knowledge and training data. The proposed method differs from previous works [25,37] in both theory and application. Theoretically, this paper focuses on exploiting the AU intensity correlations and modeling the spatiotemporal dependencies among multi-level AU intensities. In terms of application, the focus of this paper is to design and develop an automatic system that measures the intensity of spontaneous facial action units, which is much more challenging than detecting posed facial action units.

Fig. 2 gives the flowchart of the proposed online AU intensity measurement system, which consists of two independent but collaborative phases: image observation extraction and DBN inference. First, we detect the face and 66 facial landmark points in the videos automatically. Then we register the images according to the detected facial landmark points.

HOG and Gabor features are employed to describe local appearance changes of the face. After that, SVM classification produces an observation score for each AU intensity individually. Given the image observations for all AUs, AU intensity recognition is accomplished through probabilistic inference by systematically integrating the image observations with the proposed DBN model.

The remainder of the paper is organized as follows. Section 3 describes the AU observation extraction method. In Section 4, we build the DBN model for AU intensity recognition, including BN structure learning (Section 4.1), DBN parameter learning (Section 4.3), and DBN inference (Section 4.4). Section 5 presents our experimental results and discussion, and Section 6 concludes the paper.

3. AU intensity observation extraction

In this section we describe our AU intensity image observation extraction method, which consists of face registration, facial image representation, dimensionality reduction, and SVM classification.

3.1. Face registration

Image registration is a commonly used technique to align similar data (i.e., a reference and a sensed image). In order to register two images, a set of landmark points is often exploited for representing and aligning the images. In our study, we used the 66 landmark points of the DISFA database to represent the locations of the important facial components [9]. To obtain the reference landmark points, we averaged the 66 landmark points over the entire training set. A 2D similarity transformation and bilinear interpolation were used to transform each new image into the reference coordinate system. The registered images are then masked to extract the facial region and resized to 128 × 108 pixels.
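As an illustration of this step, the sketch below registers one frame with scikit-image; the exact masking and cropping choices, and the use of SimilarityTransform/warp, are our assumptions rather than the authors' implementation.

```python
# A minimal sketch of the registration step in Section 3.1, assuming `frame`
# is a grayscale face image, `pts` its 66 landmark points as an (x, y) array,
# and `ref_pts` the mean landmark shape over the training set.
import numpy as np
from skimage import transform

def register_face(frame, pts, ref_pts, out_shape=(128, 108)):
    tform = transform.SimilarityTransform()
    tform.estimate(pts, ref_pts)        # 2D similarity: image landmarks -> reference landmarks
    # warp() expects a map from output (reference) coordinates to input coordinates
    registered = transform.warp(frame, tform.inverse, order=1)   # order=1: bilinear interpolation
    # masking to the facial region would follow; here we only resize to 128 x 108
    return transform.resize(registered, out_shape, order=1)
```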

Please cite this article as: Y. Li, et al., Measuring the intensity of spontaneous facial action units with dynamic Bayesian network, Pattern Recognition (2015), http://dx.doi.org/10.1016/j.patcog.2015.04.022i

Y. Li et al. / Pattern Recognition ∎ (∎∎∎∎) ∎∎∎–∎∎∎

4

3.2. Facial image representation

After registering the facial images, we utilize two well-known feature extraction techniques that are capable of representing the appearance information: the Histogram of Oriented Gradients (HOG) and localized Gabor features, which are described below.

3.2.1. Histogram of oriented gradients

The histogram of oriented gradients was first applied to human detection [16]; however, it has also been reported to be a powerful descriptor for representing and analyzing facial images [9]. HOG counts the occurrences of gradient orientations in localized portions of an image and can efficiently describe the local shape and appearance of an object. To represent the spatial information of an object, the image is divided into small cells and, for each cell, a histogram of gradients is calculated. (For more information about gradient filters and the number of histogram bins, interested readers are referred to [16].) In our experiments, for every image (of size 128 × 108 pixels) we built cells of 18 × 16 pixels, so that overall 48 cells are constructed from each image. We applied the horizontal gradient filter [−1 0 1] with 59 orientation bins. To construct the HOG feature vector, the HOG representations of all cells are concatenated, yielding a HOG feature vector of size 2832 (48 × 59).

3.2.2. Localized Gabor features

The Gabor wavelet is another well-known technique for representing the texture information of an object. A Gabor filter is defined by a Gaussian kernel modulated by a sinusoidal plane wave. Gabor features have a powerful capability for representing facial textures and have been used in different applications, including facial expression recognition [17]. In our experiments, to efficiently extract both texture and shape information of the facial images, 40 Gabor filters (5 scales and 8 orientations) were applied to regions defined around each of the 66 landmark points, and as a result 2640 Gabor features were extracted.
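For concreteness, here is a sketch of the two descriptors. The cell tiling (16 × 18-pixel cells over the 128 × 108 image), the unsigned-gradient convention, the Gabor filter frequencies, and the 16 × 16-pixel patch around each landmark are our assumptions; only the overall recipe (48 cells × 59 bins = 2832 HOG dimensions, 66 landmarks × 40 filters = 2640 Gabor dimensions) follows the text.

```python
import numpy as np
from scipy.ndimage import convolve1d
from scipy.signal import fftconvolve
from skimage.filters import gabor_kernel

def hog_features(img, cell=(16, 18), n_bins=59):
    gx = convolve1d(img.astype(float), [-1, 0, 1], axis=1)   # horizontal [-1 0 1] filter
    gy = convolve1d(img.astype(float), [-1, 0, 1], axis=0)
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)                  # unsigned orientation in [0, pi)
    feats = []
    for r in range(0, img.shape[0], cell[0]):
        for c in range(0, img.shape[1], cell[1]):
            a = ang[r:r + cell[0], c:c + cell[1]].ravel()
            w = mag[r:r + cell[0], c:c + cell[1]].ravel()
            hist, _ = np.histogram(a, bins=n_bins, range=(0, np.pi), weights=w)
            feats.append(hist)
    return np.concatenate(feats)                             # 48 cells x 59 bins = 2832 dims

def localized_gabor_features(img, landmarks, patch=8):
    kernels = [gabor_kernel(frequency=0.1 * (s + 1), theta=o * np.pi / 8)
               for s in range(5) for o in range(8)]          # 5 scales x 8 orientations
    feats = []
    for (x, y) in landmarks.astype(int):
        p = img[max(y - patch, 0):y + patch, max(x - patch, 0):x + patch].astype(float)
        for k in kernels:
            resp = fftconvolve(p, np.real(k), mode='same')
            feats.append(np.abs(resp).mean())                # one response per filter and landmark
    return np.asarray(feats)                                 # 66 x 40 = 2640 dims
```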

3.3. Dimensionality reduction: manifold learning

In many real-world machine learning and pattern classification applications, high-dimensional features make analyzing the samples more complicated; here, the 2832-dimensional HOG features and the 2640-dimensional Gabor features contain redundant information. The literature has shown that facial expression appearances are embedded along a low-dimensional manifold in a high-dimensional space [19,20]. Therefore, employing a non-linear technique (i.e., manifold learning) that can preserve the local information of facial appearances is an effective approach for representing and classifying facial expressions. Manifold learning is a nonlinear technique which assumes that the data points are sampled from a low-dimensional manifold embedded in a high-dimensional space. Mathematically speaking, given a set of points $x_1, \ldots, x_n \in \mathbb{R}^D$, the goal is to find a set of points $y_1, \ldots, y_n \in \mathbb{R}^d$ ($d \ll D$) such that $y_i$ represents $x_i$ efficiently. In this paper we utilize the Laplacian eigenmap algorithm [23] to reduce the dimensionality of the data; more specifically, the high-dimensional features are mapped to a space of 29 dimensions. The Laplacian eigenmap algorithm, introduced by Belkin and Niyogi [23], aims to map points that are close in the high-dimensional space to points that are close in the low-dimensional one. A generalized eigenvector problem is solved, and the d eigenvectors corresponding to the d smallest eigenvalues describe the embedded d-dimensional Euclidean space; for more details see [23]. Similar to [5], we utilize the Spectral Regression (SR) algorithm to find a projection function which maps high-dimensional data, such as the HOG and Gabor features, into the low-dimensional space.

3.4. Classification

Given the reduced feature vectors, we extract the AU intensity observations through SVM classification [24]. The SVM classifier aims to find the discriminative hyperplane with the maximum margin separating data belonging to different classes. Several parameters, such as the kernel type (e.g., Linear, Polynomial, or Radial Basis Function (RBF) kernels), can affect the efficiency of the SVM classifier; for more detailed information on SVM see [24]. In our experiments, to classify the C = 6 intensity levels of each AU, we used a one-against-one strategy, in which C(C−1)/2 = 15 binary discriminant functions are defined, one for every possible pair of classes. We did the same for all 12 AUs of the DISFA database, which results in 15 × 12 = 180 binary classifiers. We examined three different kernels (Linear, Polynomial, and Gaussian RBF), and the Gaussian RBF outperformed the other two. Although we can extract AU intensities up to some accuracy in this way, this image-appearance-based approach treats each AU and each frame individually and relies heavily on the accuracy of face region alignment. In order to model the dynamics of AUs as well as their semantic relationships, and to deal with image uncertainty, we utilize a DBN for AU inference. Consequently, the output of the SVM classifier is used as the evidence for the subsequent AU inference via the DBN.
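A sketch of the per-AU observation classifiers described above, using scikit-learn; the specific hyper-parameters (C, gamma) are placeholders that would be tuned, and the input is assumed to be the 29-dimensional reduced feature vector.

```python
# A minimal sketch of the classification stage in Section 3.4.
from sklearn.svm import SVC

def train_au_observers(X_train, y_train):
    """X_train: (N, 29) reduced features; y_train: (N, 12) intensity labels (0-5).
    Returns one RBF SVM per AU; each SVC uses the one-against-one strategy
    internally, i.e., C(C-1)/2 = 15 binary discriminants for 6 levels."""
    observers = []
    for au in range(y_train.shape[1]):
        clf = SVC(kernel='rbf', C=1.0, gamma='scale',
                  decision_function_shape='ovo', probability=True)
        clf.fit(X_train, y_train[:, au])
        observers.append(clf)
    return observers

# The per-frame observation fed to the DBN is the predicted level (or score)
# of each AU's classifier:
# obs = [clf.predict(x.reshape(1, -1))[0] for clf in observers]
```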

4. DBN model for facial action unit intensity recognition

4.1. AU intensity correlation learning

4.1.1. AU intensity correlation analysis

Measuring the intensity of AUs in a single frame of video is difficult due to the variety, ambiguity, and dynamic nature of facial actions. This is especially true for spontaneous facial expressions.

Fig. 3. Nonadditive effect in an AU combination. (a) AU12 occurs alone. (b) AU15 occurs alone. (c) AU12 and AU15 appear together. (Adapted from [3]).


Moreover, when AUs occur in combination, they may be non-additive, which means that the appearance of an AU in a combination is different from its stand-alone appearance. Fig. 3 demonstrates an example of the non-additive effect: when AU12 (lip corner puller) appears alone, the lip corners are pulled up toward the cheekbone; however, if AU15 (lip corner depressor) also becomes active, then the lip corners are somewhat angled down due to the presence of AU15. The non-additive effect increases the difficulty of recognizing AUs individually.

As described in the FACS manual [3], the inherent relationships among AUs can provide the information required to better analyze facial expressions. These inherent relationships can be summarized as co-occurrence relationships and mutually exclusive relationships. The co-occurrence relationships characterize groups of AUs that often appear together to show meaningful facial emotions. For instance, AU6+AU12+AU25 represents 'happiness' (Fig. 4), AU4+AU15+AU17 shows 'sadness' (Fig. 4), and AU9+AU17 is an illustration of 'disgust' (Fig. 4). On the other hand, considering the alternative rules provided in the FACS manual [3], some AUs are mutually exclusive, which means that they rarely happen together in spontaneous expressions in real life. For instance, it may not be possible to demonstrate AU25 (lips part) together with AU24 (lip pressor), as described in the FACS manual: "the lips cannot be parted and pressed together" [3]. Note that a "mutually exclusive relationship" in this paper does not mean that the involved AUs can never be present together, but that they are exclusive and the probability of their occurring simultaneously is very low.

In addition to the co-occurrence and mutually exclusive relationships among AUs, there are also limitations and restrictions on the intensity levels of co-occurring action units. For instance, when AU6 (cheek raiser) and AU12 (lip corner puller) occur together, as shown in Fig. 4, a high/low intensity of one AU indicates a high probability of a high/low intensity of the other. Similarly, when AU10 (upper lip raiser) and AU12 are activated together, if AU12 has a strong intensity, the intensity of AU10 is forced to be low, as described in the FACS manual: "such strong actions of AU12 counteract the influence of AU10 on the shape of the upper lip" [3]. Modeling such dependencies among AUs has been proven helpful for increasing AU recognition accuracy [35,36,25]. For example, Tong et al. [25,37] employed a DBN to systematically model the spatiotemporal relationships among AUs and achieved a marked improvement over the image observations alone. However, [25,37] focus on AU detection, which only recognizes an AU's absence or presence. In addition, [25] detects AUs in posed expressions, which are created by asking subjects to deliberately make specific facial actions or expressions. Spontaneous expressions, on the other hand, typically occur in uncontrolled conditions and are more challenging to measure [13] because of head pose variations and the occurrence of subtle facial behaviors. In this paper, inspired by [25], we adopt a Bayesian Network (BN) to represent and capture the semantic relationships among AUs, as well as the correlations of the AU intensities, in order to more accurately and robustly measure the intensity of spontaneous facial actions.


4.1.2. BN structure learning

A BN is a Directed Acyclic Graph (DAG) that represents a joint probability distribution among a set of variables. In a BN, nodes denote variables and the links among nodes denote the conditional dependencies among variables. In this work, we employ 12 hidden nodes representing the 12 AUs of the DISFA database [9] (i.e., AU1, AU2, AU4, AU5, AU6, AU9, AU12, AU15, AU17, AU20, AU25, AU26), each of which has six discrete states indicating the intensity of the AU. The structure of a BN captures the dependency relationships among the variables and is crucial for accurately modeling the joint probability distribution. Hence, we employ a BN structure learning algorithm to learn the BN structure from both training data and prior knowledge. The goal of BN structure learning is to find a structure B that maximizes a score function such as the Bayesian Information Criterion (BIC) [26]:

$$\max_B S_D(B) = \max_B \left( \max_{\theta} L_D(\theta) - \frac{d}{2} \log K \right) \qquad (1)$$

where $L_D(\theta)$ is the log-likelihood function of the parameters $\theta$ with respect to the data D and the structure B, d is the number of free parameters in B, and K is the number of samples in the training data. $L_D(\theta)$ can be expressed as

$$L_D(\theta) = \log \prod_{i=1}^{n} \prod_{j=1}^{m_i} \prod_{k=1}^{r_i} \hat{\theta}_{ijk}^{B} \qquad (2)$$

where $\hat{\theta}_{ijk}^{B}$ represents the parameters for network B, n is the number of variables (nodes), $m_i$ is the number of parent configurations of the ith node, and $r_i$ is the number of possible states of the ith node. Note that the value of $m_i$ depends on the structure B; more specifically, it depends on the parent set of each node. The number of free parameters in B is computed as $d = \sum_{i=1}^{n} m_i (r_i - 1)$.

Based on the score function, BN structure learning can be defined as an optimization problem. Campos et al. [27,28] recently developed an exact BN structure learning algorithm from data and expert knowledge based on the BIC score function. They [27,28] found properties that strongly reduce the time and memory costs, and developed a branch-and-bound algorithm that integrates structural constraints with data in a way that guarantees global optimality. In this work, we employ the method presented in [28] to learn our BN structure from both training data and prior knowledge.

The prior knowledge is derived from both data analysis and the study of previous works. Table 1 reports the Pearson Correlation Coefficients (PCC) of the AU intensities on the DISFA database. From Table 1 we can see that there are strong correlations between some variables. For instance, the PCC between the intensities of AU2 and AU1 is 0.732, which means that there is a strong correlation between the intensities of these two AUs, e.g., a high/low intensity of AU2 indicates a high probability of a high/low intensity of AU1.
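The pairwise correlations of Table 1, and the threshold rule used below to derive structural constraints, can be computed directly from the frame-level labels; a minimal sketch, assuming `labels` is an (N_frames, 12) integer array of intensity levels:

```python
import numpy as np

def au_intensity_pcc(labels):
    """Return the 12 x 12 Pearson correlation matrix of AU intensities."""
    return np.corrcoef(labels, rowvar=False)

# Pairs whose |PCC| exceeds a threshold (0.5 in this work) are turned into
# structural constraints (links) for the BN structure learning:
# strong_pairs = np.argwhere(np.triu(np.abs(au_intensity_pcc(labels)), k=1) > 0.5)
```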

Fig. 4. AU combinations that show meaningful expressions. (a) AU6+AU12+AU25 to represent happiness. (b) AU4+AU15+AU17 to represent sadness. (c) AU9+AU17 to represent disgust. (Adapted from [9].)


Table 1. Pearson correlation coefficients of AU intensities on the DISFA database (each entry a_ij represents PCC(AUi, AUj); the matrix is symmetric, so only the upper triangle is shown).

AU no.  AU1    AU2    AU4    AU5     AU6     AU9     AU12    AU15    AU17    AU20    AU25    AU26
AU1     1.000  0.732  0.129  0.206   -0.019  0.064   -0.035  0.035   0.017   0.035   0.034   0.103
AU2     -      1.000  0.050  0.263   -0.042  -0.024  -0.047  0.004   -0.010  0.028   -0.031  0.038
AU4     -      -      1.000  -0.009  0.060   0.419   -0.139  0.256   0.324   0.097   0.029   0.039
AU5     -      -      -      1.000   -0.041  -0.027  -0.050  -0.003  -0.028  0.029   -0.049  -0.005
AU6     -      -      -      -       1.000   0.231   0.594   0.061   0.022   0.072   0.476   0.077
AU9     -      -      -      -       -       1.000   -0.043  0.184   0.131   0.084   0.155   0.072
AU12    -      -      -      -       -       -       1.000   -0.086  -0.055  -0.049  0.586   0.136
AU15    -      -      -      -       -       -       -       1.000   0.343   0.163   0.047   0.077
AU17    -      -      -      -       -       -       -       -       1.000   0.234   -0.091  0.028
AU20    -      -      -      -       -       -       -       -       -       1.000   0.001   0.068
AU25    -      -      -      -       -       -       -       -       -       -       1.000   0.441
AU26    -      -      -      -       -       -       -       -       -       -       -       1.000

The strong correlation between two variables can be modeled by a link in the BN. Thus, if the PCC between the intensities of two AUs is higher than a threshold t (set to 0.5 in this work), we add a link (a constraint) between the two AUs in the BN to model this correlation. The direction of the link is set based on the characteristics of the AUs we are going to recognize, as well as on the analysis of previous studies [38,25]. In this way, we manually constructed three constraints, i.e., a link from AU2 to AU1, a link from AU12 to AU6, and a link from AU12 to AU25, which help reduce the search space. These manually constructed constraints are reasonable for understanding spontaneous facial expressions. For instance, in spontaneous expressions AU6, AU12, and AU25 often co-occur to represent happiness, and the intensities of the three AUs interact with each other to represent the intensity of the happiness; hence, there are links among these three variables in the BN to model such correlations.

Given the complete training data and the structural constraints, we employ the learning algorithm presented in [28] to learn our BN structure. The learned structure is shown in Fig. 5. From Fig. 5 we can see that learned links such as AU9 to AU6, AU2 to AU5, and AU17 to AU15 are all reasonable reflections of facial expressions. For example, the pain expression often involves AU6 and AU9, which means that there is a high co-occurrence relationship between these two AUs. Similarly, AU2 and AU5 often occur together in surprise and fear expressions, and the sadness expression usually involves AU15 and AU17. In summary, the learnt structure is capable of reflecting the pattern of AU intensity correlations in the training set.

4.2. Dynamic dependencies learning

The above BN structure can only capture the static dependencies among AU intensities. In this section, we extend it to a dynamic Bayesian network by adding dynamic links. In general, a DBN is made up of interconnected time slices of static BNs based on two assumptions: first, the system is first-order Markov, i.e., $P(x^{t+1} \mid x^1, \ldots, x^t) = P(x^{t+1} \mid x^t)$; second, the transition probability $P(x^{t+1} \mid x^t)$ is the same for all t. Therefore, a DBN can be defined by a pair of BNs $(B_1, B_\rightarrow)$: (1) the static network $B_1$ is the same as that learned above, which captures the static relationships among AU intensities; (2) the transition network $B_\rightarrow$ specifies the transition probability $P(x^{t+1} \mid x^t)$ for all t in a finite time-slice sequence, i.e., the dynamic dependencies among AU intensities. Then, given a DBN model, the joint probability over the sequence $x^1, \ldots, x^T$ can be computed as follows:

$$P(x^1, \ldots, x^T) = P_{B_1}(x^1) \prod_{t=1}^{T-1} P_{B_\rightarrow}(x^{t+1} \mid x^t) \qquad (3)$$


Fig. 5. The learned BN structure from both training data and prior knowledge.

where $P_{B_1}(x^1)$ captures the joint probability of the variables in the static BN ($B_1$), and $P_{B_\rightarrow}(x^{t+1} \mid x^t)$ captures the transition probability, which can be decomposed based on the transition network:

$$P_{B_\rightarrow}(x^{t+1} \mid x^t) = \prod_{i=1}^{n} P_{B_\rightarrow}\big(x_i^{t+1} \mid pa(X_i^{t+1})\big) \qquad (4)$$

where $pa(X_i^{t+1})$ represents the parent configuration of variable $X_i^{t+1}$ in the transition network $B_\rightarrow$, and n indicates the number of random variables in $x^t$.
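To make the factorization of Eqs. (3)–(4) concrete, the sketch below evaluates the log joint probability of an intensity sequence given conditional probability tables; the dictionary-based representation of the CPTs and parent sets is our assumption, not the paper's data structure.

```python
# `cpt_static[i]` maps a tuple of parent states (within slice 1) to a
# distribution over states of AU i; `cpt_trans[i]` does the same for the
# parents of AU i in the transition network (parents may live in slice t or t-1).
import numpy as np

def log_joint(sequence, cpt_static, parents_static, cpt_trans, parents_trans):
    """sequence: list of state vectors x^1..x^T (one intensity level per AU).
    Returns log P(x^1, ..., x^T) under the DBN."""
    logp = 0.0
    x1 = sequence[0]
    for i, dist in cpt_static.items():                       # P_{B1}(x^1)
        pa = tuple(x1[j] for j in parents_static[i])
        logp += np.log(dist[pa][x1[i]])
    for t in range(1, len(sequence)):                        # prod_t P_{B->}(x^{t+1} | x^t)
        prev, cur = sequence[t - 1], sequence[t]
        for i, dist in cpt_trans.items():
            pa = tuple(prev[j] if is_prev else cur[j]
                       for (j, is_prev) in parents_trans[i])
            logp += np.log(dist[pa][cur[i]])
    return logp
```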

As discussed above, a DBN can be defined by a pair of BNs $(B_1, B_\rightarrow)$. The static network $B_1$ has been learned in the previous sections; here, we focus on learning the transition network $B_\rightarrow$. Given the complete training data, the learning process for $B_\rightarrow$ is the same as the static BN structure learning: we just need to reformat two successive frames as one data instance to train the transition network $B_\rightarrow$. Compared to the static network, however, there are more coherent structural constraints on the transition network. The transition network consists of two types of links: inter-slice links and intra-slice links. The inter-slice links are the dynamic links connecting the temporal variables of two successive time slices. In contrast, the intra-slice links connect the variables within a single time slice, and are the same as the static network structure. The coherent structural constraints on the transition network are as follows. First, the variables in the first time slice do not have parents. Second, the inter-slice links can only have one direction, namely from the previous time slice to the current time slice. Finally, based on the stationarity assumption, the intra-slice links for every time slice should be the same.



Fig. 6. (a) The learned transition network and (b) the completed DBN model for AU intensity recognition (the unshaded nodes are hidden nodes, and the shaded nodes are the image observation nodes). The self-arrow at each AU node indicates the temporal evolution of a single AU from the previous time slice to the current time slice. The arrow from AUi at time t−1 to AUj (j ≠ i) at time t indicates the pairwise dynamic dependency between different AUs.

Furthermore, to reduce the search space and increase the generalization ability, we impose an additional constraint such that each node $X_i^{t+1}$ has at most two parents from the previous time slice. We also add some experiential constraints based on the analysis of previous studies. Given the complete training data and the above structural constraints, we then apply the learning algorithm in [28] to identify the transition network structure; here, we fix the intra-slice links as the previously learned static network. The learned transition network is shown in Fig. 6(a). For presentation clarity, we use a self-arrow at each AU node to indicate the temporal dependency of a single AU from the previous time slice to the current time slice. From Fig. 6(a) we can see that the learned transition network felicitously reflects the dynamic dependencies among AU intensities. For example, the dynamic link from $AU_6^{t-1}$ to $AU_4^{t}$ means that the eyebrows tend to lower by activating AU4 as the intensity of the cheek raiser (AU6) increases, and the dynamic link from $AU_{12}^{t-1}$ to $AU_6^{t}$ means that before the cheek is raised by activating AU6 (cheek raiser), it is most likely that the lip corners have already been pulled upward by activating AU12 (lip corner puller).

Finally, Fig. 6(b) shows the completed DBN model for AU intensity recognition. The unshaded nodes are the hidden nodes, whose states represent AU intensities and need to be inferred from the DBN given the image observations. The image observations, which are obtained from the computer vision techniques discussed in Section 3, are represented using the shaded nodes. The links from the hidden nodes to the observation nodes capture the likelihood of the image observation given the corresponding AU intensity, e.g., $P(M_{AU_i} \mid AU_i)$.

4.3. DBN parameter learning

Given the DBN structure, we now focus on learning the parameters from the training data in order to infer the hidden nodes. During DBN parameter learning, we treat the DBN as an expanded BN consisting of two-slice static BNs connected through the temporal variables. Learning the parameters of a DBN means finding the most probable values $\hat{\theta}$ of $\theta$ that best explain the training data. Let $\theta_{ijk}$ denote a probability parameter:

$$\theta_{ijk} = P(x_i^k \mid pa_j(X_i)) \qquad (5)$$

where i ranges over all the variables (nodes in the BN), j ranges over all the possible parent instantiations for variable $X_i$, and k ranges over all the instantiations of $X_i$ itself (the intensity levels of the AUs). Therefore, $x_i^k$ represents the kth state of variable $X_i$, and $pa_j(X_i)$ is the jth configuration of the parent nodes of $X_i$. In this work, the "fitness" of the parameters $\theta$ to the training data D is quantified by the log-likelihood function $\log(p(D \mid \theta))$, denoted as $L_D(\theta)$. Assuming that the training samples are independent, and based on the conditional independence assumptions in the DBN, we have the log-likelihood function in Eq. (6), where $n_{ijk}$ is the count of the cases in which node $X_i$ has state k with parent configuration j:

$$L_D(\theta) = \log \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \theta_{ijk}^{\,n_{ijk}} \qquad (6)$$

Since we have complete training data, i.e., for each frame we have the intensity labels of all 12 AUs, the learning process can be described as the following constrained optimization problem:

$$\arg\max_{\theta} L_D(\theta), \quad \text{s.t.} \quad g_{ij}(\theta) = \sum_{k=1}^{r_i} \theta_{ijk} - 1 = 0 \qquad (7)$$

where $g_{ij}$ imposes that the distribution defined for each variable, given a parent configuration, sums to one over all variable states. This problem has its global optimum solution at $\theta_{ijk} = n_{ijk} / \sum_{k} n_{ijk}$.
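Since the optimum of Eq. (7) is just normalized counts, the parameter learning can be sketched as simple counting over the training frames; the small smoothing constant is our addition to avoid zero probabilities and is not part of the original formulation.

```python
import numpy as np
from collections import defaultdict

def learn_cpts(data, parents, n_states=6, alpha=1e-6):
    """data: (N, n_vars) integer array of joint states; parents[i] lists the
    parent indices of variable i. Returns theta_ijk = n_ijk / sum_k n_ijk."""
    cpts = {}
    for i in range(data.shape[1]):
        counts = defaultdict(lambda: np.zeros(n_states))
        for row in data:
            pa = tuple(row[j] for j in parents[i])
            counts[pa][row[i]] += 1                     # n_ijk
        cpts[i] = {pa: (c + alpha) / (c + alpha).sum()  # normalize over k
                   for pa, c in counts.items()}
    return cpts
```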

To evaluate the quantity of training data needed for learning the proposed model, we perform a sensitivity study of model learning on different amounts of training data. For this purpose, the Kullback–Leibler (KL) divergences of the parameters are computed versus the number of training samples, as shown in Fig. 7. From Fig. 7 we can see that the learning process of the proposed model requires a total of about 50,000 training samples. Since we have more than 50,000 training samples available in the database, the training data for model learning are sufficient.

4.4. DBN inference

Given the complete DBN model and the AU image observations, we can estimate the AU intensities by maximizing the posterior probability of the hidden nodes. Let $AU_{1:N}^{t}$ represent the nodes for the N target AUs at time t. Given the available evidence up to time t, $M_{AU_{1:N}}^{1:t}$, the posterior probability $P(AU_{1:N}^{t} \mid M_{AU_{1:N}}^{1:t})$ can be factorized and computed via the proposed model by performing the DBN updating process described in [31].


Fig. 7. The KL divergence of the proposed DBN model versus the training data size. We use different numbers of subjects' data to train the model.

Because of the recursive nature of the inference process, as well as the relatively simple network topology, the inference can be implemented very efficiently.
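As a simplified illustration of this recursive updating, the sketch below filters a single AU chain given its per-frame observation likelihoods; the full model of Fig. 6(b) couples all 12 AUs through the learned links and would be handled by a general-purpose DBN inference engine, so this is only the core prediction–correction idea, not the authors' implementation.

```python
import numpy as np

def filter_au(obs_likelihoods, transition, prior):
    """obs_likelihoods: (T, 6) array, row t = P(MAU^t | AU^t = k) from the SVM scores.
    transition: (6, 6) matrix, transition[k_prev, k] = P(AU^t = k | AU^{t-1} = k_prev).
    prior: (6,) initial distribution. Returns the filtered posteriors, shape (T, 6)."""
    T = obs_likelihoods.shape[0]
    beliefs = np.zeros((T, 6))
    b = prior * obs_likelihoods[0]
    beliefs[0] = b / b.sum()
    for t in range(1, T):
        predicted = beliefs[t - 1] @ transition        # prediction step
        b = predicted * obs_likelihoods[t]             # correction with the new evidence
        beliefs[t] = b / b.sum()
    return beliefs

# The estimated intensity at frame t is then argmax_k beliefs[t, k].
```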

5. Experimental results

Table 2. Total number of frames for each AU intensity level on the DISFA database (adopted from [9]).

AU no.  '0'     '1'    '2'    '3'    '4'   '5'
AU1     122036  2272   1749   2809   1393  555
AU2     123450  1720   934    3505   836   369
AU4     106220  4661   7636   6586   4328  1383
AU5     128085  1579   719    293    104   34
AU6     111330  9157   5986   3599   601   141
AU9     123682  1659   2035   3045   316   77
AU12    100020  13943  6869   7233   2577  172
AU15    122952  5180   1618   1017   47    0
AU17    117884  6342   4184   2281   112   11
AU20    126282  1591   1608   1305   28    0
AU25    84762   9805   13935  15693  5580  1039
AU26    105838  13443  7473   3529   314   217

Table 3. Average AU intensity recognition results (ICC) using the SVM classifier alone, the BN, and the DBN on the DISFA database (%).

Feature type   SVM    BN     DBN
HOG feature    66.64  68.95  70.16
Gabor feature  76.57  77.71  78.34

In our experiments we utilized the DISFA database to evaluate the performance of automatic measurement of the intensity of spontaneous action units. We first introduce the contents of DISFA and then report the results of the proposed system for measuring the intensity of the 12 AUs of this database.

5.1. DISFA database description

The Denver Intensity of Spontaneous Facial Action (DISFA) database [9,33] contains 4-minute videos of spontaneous facial expressions of 27 adult participants. The facial images of DISFA were video recorded by a high-resolution camera (1024 × 768 pixels) at a frame rate of 20 fps. In DISFA, the intensity of 12 AUs (i.e., AU1, AU2, AU4, AU5, AU6, AU9, AU12, AU15, AU17, AU20, AU25, AU26) has been coded by a FACS coder at six intensity levels. The 6-level intensity in DISFA represents the dynamics of each AU on an ordinal scale from 0 (absence of an AU) to 5 (maximum intensity) [3]. The database also provides a set of 66 landmark points that represent the coordinates of the important components of the human face, such as the corners of the eyes and the boundary of the lips [9]. In this study, we utilized the entire set of video frames of all participants to measure the intensity of the 12 aforementioned AUs. Table 2 (adopted from [9]) lists the total number of frames for each AU intensity level on the DISFA database.

5.2. Recognition and reliability measures

To evaluate the proposed automated AU intensity measurement we used both the recognition rate and the Intra-Class Correlation (ICC) value. ICC is a statistical index that ranges from 0 to 1 and is a measure of correlation or conformity for a data set with multiple targets [32]. In other words, ICC measures the reliability of studies in which n targets are rated by k judges (in this paper, k = 2 and n = 6). ICC is similar to the Pearson correlation and is preferred when computing consistency between judges or measurement devices. The ICC is defined as

$$\mathrm{ICC} = \frac{BMS - EMS}{BMS + (k-1) \cdot EMS} \qquad (8)$$

where BMS is the between-targets mean squares and EMS is the residual mean squares defined by Analysis Of Variance (ANOVA). That is, the ICC indicates the proportion of total variance due to differences between targets. See [32] for additional details.
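For reference, Eq. (8) can be computed from a standard two-way ANOVA decomposition; the sketch below assumes `ratings` holds the automated and manual intensity scores for the same targets (k = 2 raters), and the consistency form of the ICC is our reading of [32].

```python
# A minimal sketch of Eq. (8): ICC = (BMS - EMS) / (BMS + (k - 1) * EMS).
import numpy as np

def icc(ratings):
    """ratings: (n_targets, k_raters) array."""
    n, k = ratings.shape
    grand_mean = ratings.mean()
    target_means = ratings.mean(axis=1)
    rater_means = ratings.mean(axis=0)
    bms = k * ((target_means - grand_mean) ** 2).sum() / (n - 1)     # between-targets mean squares
    resid = ratings - target_means[:, None] - rater_means[None, :] + grand_mean
    ems = (resid ** 2).sum() / ((n - 1) * (k - 1))                   # residual mean squares
    return (bms - ems) / (bms + (k - 1) * ems)
```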

5.3. Results analysis

As explained in the previous section on AU observation extraction, we employed two types of features (HOG and Gabor features) followed by SVM classification. Given the image observations, we employ the DBN model to estimate the intensity of each AU. For further comparison, we also test the BN model, which is the static network part of the DBN model. We employed leave-one-subject-out cross-validation to evaluate the proposed method. The current system can process about 7.1 frames/s on a 2.4 GHz Intel Core2 PC; it could be sped up by optimizing the code and by implementing the DBN inference in C++ instead of Matlab. The average recognition results using the SVM classifier alone, the BN, and the DBN are given in Table 3.

From Table 3 we can see that, for both types of features, employing the BN model, which models the static relationships among AU intensities, yields an improvement over the image-driven method alone, and the DBN model, which further incorporates the temporal dependencies among AU intensities, achieves a further improvement. For image observations that are not very accurate, i.e., the HOG feature observations, the improvement obtained by employing the DBN model is significant. This is mainly because the enhancement of the framework comes from combining the DBN model with the image-driven methods, and erroneous image observations can be compensated through the dynamic and semantic relationships encoded in the DBN. Thus, employing the DBN can improve the image observation accuracy, especially for AUs that are hard to recognize but have strong relationships with other AUs. We show the AU intensity recognition results for each AU using the SVM classifier alone, the BN model, and the DBN model in Fig. 8. To demonstrate the correlation relationships among AU intensities, we list the correlation matrix between AU2 and AU1 on the DISFA database in Table 4.
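Tables 4 and 5 below are simple empirical conditional probabilities that can be obtained by co-occurrence counting over the annotated frames; a minimal sketch, reusing the `labels` array introduced earlier (with i and j the column indices of the conditioning and conditioned AU):

```python
import numpy as np

def conditional_table(labels, i, j, n_levels=6):
    """Return table[a, b] = P(AU_j = b | AU_i = a) estimated from frame labels."""
    table = np.zeros((n_levels, n_levels))
    for a in range(n_levels):
        rows = labels[labels[:, i] == a]
        if len(rows):
            table[a] = np.bincount(rows[:, j], minlength=n_levels) / len(rows)
    return table
```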



Fig. 8. The comparison of AU intensity recognition results by using the SVM classifier alone, the BN, and the DBN on DISFA database. (a) Recognition results for HOG feature. (b) Recognition results for Gabor feature.

Table 4. The correlation matrix between AU2 and AU1 on the DISFA database (each entry a_ij represents P(a_j | a_i)).

AU intensity  AU1=0   AU1=1   AU1=2   AU1=3   AU1=4   AU1=5
AU2=0         0.9710  0.0098  0.0065  0.0080  0.0046  0.0000
AU2=1         0.5929  0.2661  0.0749  0.0566  0.0094  0
AU2=2         0.2723  0.2800  0.3131  0.1213  0.0099  0.0033
AU2=3         0.1746  0.0838  0.1574  0.4713  0.1013  0.0116
AU2=4         0       0.0198  0.0507  0.1483  0.5464  0.2349
AU2=5         0       0       0       0.0167  0.0864  0.8969

From Table 4 we can see that the intensity dependency between AU2 and AU1 is strong, i.e., a high/low intensity of AU2 indicates a high probability of a high/low intensity of AU1. For example, when the intensity of AU2 is at level "0", the probability of AU1 = 0 is 0.9710, and when the intensity of AU2 is at level "5", the probability of AU1 = 5 is 0.8969. By modeling this AU intensity dependency between AU2 and AU1 in the DBN model, for the HOG features the ICC of AU1 is increased from 69.31 percent (for the SVM classifier) to 72.77 percent (for the DBN model) and that of AU2 from 56.01 percent to 63.48 percent, while for the Gabor features the ICC of these two AUs is increased from 79.68 percent to 82.90 percent and from 82.68 percent to 87.33 percent, respectively.

Similarly, we list the correlation matrix between AU9 and AU5 in Table 5. From Table 5 we can see that, when AU9 occurs, the probability of AU5 being present is low, which means that there is an exclusive relationship between these two AUs. The proposed model can improve the observation accuracy by modeling such relationships, especially for image observations with low accuracy. For example, for the HOG feature observations, AU5 is not well recognized by the SVM, and the proposed model increases the ICC of AU5 from 51.43 percent (for the SVM classifier) to 57.05 percent (for the DBN model), while that of AU9 is increased from 73.31 percent to 75.27 percent. For the Gabor features, the improvements for these two AUs are not as significant, mainly because the SVM classifier already recognizes these two AUs well.

Finally, Table 6 lists the recognition results (ICC and accuracy) for each intensity level using the proposed DBN model on the DISFA database. Accuracy was defined as the average of correctly detected intensity levels for each AU.

Table 5. The correlation matrix between AU9 and AU5 on the DISFA database (each entry a_ij represents P(a_j | a_i)).

AU intensity  AU5=0   AU5=1   AU5=2   AU5=3   AU5=4   AU5=5
AU9=0         0.9776  0.0130  0.0058  0.0025  0.0008  0.0003
AU9=1         0.9970  0.0018  0.0012  0       0       0
AU9=2         0.9940  0.0030  0.0030  0       0       0
AU9=3         1.0000  0       0       0       0       0
AU9=4         1.0000  0       0       0       0       0
AU9=5         1.0000  0       0       0       0       0

From Table 6 we can see that the average recognition accuracy for AU intensity levels "1" and "5" is lower than that for the other levels. The low recognition accuracy for intensity level "1" is mainly because a low-intensity expression produces very subtle facial changes and is genuinely hard to detect. A high intensity level, in contrast, represents an exaggerated expression which intuitively should be easy to recognize; the low recognition accuracy for intensity level "5" is most likely due to the lack of training samples. As Table 2 (adopted from [9]) shows, the number of frames for intensity level "5" is far smaller than that for the other levels.

Table 7 compares our work with the previous state-of-the-art method for measuring the intensity of AUs on the DISFA database, in terms of both ICC and accuracy. Note that AU intensity recognition is a relatively recent problem within the field, and there are few works measuring the intensity of AUs. Mavadati et al. presented the DISFA database in [9] and provided a baseline method for measuring AU intensity. From Table 7 we can see that the proposed method achieves slightly better results than [9], regarding both ICC and accuracy.


Table 6. The recognition results for each intensity level using the proposed DBN model on the DISFA database.

(a) HOG features
              AU1    AU2    AU4    AU5    AU6    AU9    AU12   AU15   AU17   AU20   AU25   AU26   Avg.
ICC (%)       72.77  63.48  86.67  57.05  70.03  75.27  86.53  62.29  57.73  54.73  85.07  70.31  70.16
Accuracy (%)  82.19  82.81  79.77  92.95  75.49  90.26  77.57  86.09  76.24  90.92  68.85  69.88  81.08
6-Level '0'   98.58  98.81  98.07  99.64  95.89  99.26  95.44  98.04  97.26  99.44  93.26  94.22  97.33
        '1'   4.27   4.25   11.00  11.79  17.95  9.90   31.46  12.46  12.67  8.38   18.02  20.46  13.55
        '2'   18.67  15.35  40.53  24.35  41.44  24.61  45.80  20.84  23.17  29.85  50.30  29.77  30.39
        '3'   47.24  29.93  49.55  40.82  44.38  71.55  72.75  60.16  66.09  26.20  72.64  40.59  51.83
        '4'   46.68  29.16  61.18  20.74  26.24  66.77  61.33  17.65  6.25   5.56   57.61  36.89  36.34
        '5'   33.18  14.58  61.28  38.46  0      55.41  0      -      14.29  -      60.53  61.41  28.18

(b) Gabor features
              AU1    AU2    AU4    AU5    AU6    AU9    AU12   AU15   AU17   AU20   AU25   AU26   Avg.
ICC (%)       82.90  87.33  88.14  58.35  80.09  80.59  84.99  72.20  74.14  55.93  94.75  80.70  78.34
Accuracy (%)  87.90  91.75  82.41  94.15  84.40  93.01  80.48  91.51  82.22  89.15  81.03  81.12  86.60
6-Level '0'   99.30  99.49  98.78  99.24  97.08  99.66  96.54  98.97  97.98  99.52  97.13  96.63  98.36
        '1'   9.98   12.29  18.70  9.44   30.05  17.38  38.47  29.12  16.39  9.12   31.66  34.36  21.41
        '2'   33.16  33.15  52.16  41.15  55.60  41.57  47.58  40.39  55.71  31.83  74.01  58.52  47.07
        '3'   65.25  73.50  59.00  46.18  74.91  82.22  72.83  81.99  64.59  45.31  85.52  67.80  68.26
        '4'   48.36  80.51  61.44  44.17  47.26  58.31  62.43  0      10.84  0      70.38  70.66  46.20
        '5'   45.36  93.16  48.53  23.46  0      21.57  0      -      10.00  -      86.48  76.30  33.74

Table 7. Comparison with the state-of-the-art method.

                HOG features          Gabor features
Works           ICC      Accuracy     ICC      Accuracy
Work [9] 2013   0.70     79.14        0.77     85.71
Our work        0.7016   81.08        0.7834   86.60

In addition, since the AU image observation extraction phase and the AU intensity inference phase of the proposed method are independent, given more accurate image observations our method can achieve further improvement with few changes.

6. Conclusions and future work

Due to the richness, ambiguity, and dynamic nature of facial actions, individually and statically recognizing each AU intensity is not always accurate and reliable for spontaneous facial expressions. Hence, improving the recognition system requires not only improving the observation extraction accuracy but, more importantly, exploiting the spatiotemporal interactions among facial actions, since it is these coherent, coordinated, and synchronized interactions that produce a meaningful facial display. In this work, we presented a DBN-based probabilistic framework to model the semantic and dynamic relationships among multilevel AU intensities. Given the extracted image observations, AU intensity recognition is accomplished through probabilistic inference by systematically integrating the image observations with the proposed DBN model. In this way, erroneous AU intensity observations can be compensated by the model's built-in spatial and temporal relationships, thus improving the performance of the AU intensity measurement system. For instance, experimental results on the DISFA database show that the overall ICC value for HOG features increased from 66.64% to 70.16% and for Gabor features from 76.57% to 78.34%. In spontaneous facial expressions, several complicating factors, such as head pose variation, may occur. In order to deal with such challenges when measuring the intensity of spontaneous action units, future work might introduce another layer of hidden nodes into the model to accommodate them.

Conflict of interest

None declared.

Acknowledgements

This work was mainly accomplished when the first author visited Rensselaer Polytechnic Institute (RPI) as a visiting student, and was partially supported by National Natural Science Foundation of China Funded Project 61402129, Heilongjiang Postdoctoral Science Foundation Funded Project LBH-Z14090, and awards IIP-1111568 and BCS-1052781 from the National Science Foundation to the University of Denver.

References

[1] C. Breazeal, Sociable machines: expressive social exchange between humans and robots (Sc.D. dissertation), Department of Electrical Engineering and Computer Science, MIT, 2000.
[2] F. Dornaika, B. Raducanu, Facial Expression Recognition for HCI Applications, Prentice Hall Computer Applications in Electrical Engineering Series, 2009, pp. 625–631.
[3] P. Ekman, W.V. Friesen, J.C. Hager, Facial Action Coding System, A Human Face, Salt Lake City, UT, 2002.
[4] M.S. Bartlett, G. Littlewort, M. Frank, C. Lainscsek, I. Fasel, J. Movellan, Recognizing facial expression: machine learning and application to spontaneous behavior, in: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR '05), 2005, pp. 568–573.
[5] M.H. Mahoor, S. Cadavid, D.S. Messinger, J.F. Cohn, A framework for automated measurement of the intensity of non-posed facial action units, in: 2nd IEEE Workshop on CVPR for Human Communicative Behavior Analysis (CVPR4HB), Miami Beach, June 25, 2009.
[7] P. Lucey, J.F. Cohn, K.M. Prkachin, P. Solomon, I. Matthews, Painful data: the UNBC-McMaster shoulder pain expression archive database, in: IEEE International Conference on Automatic Face and Gesture Recognition (FG2011), Santa Barbara, CA, March 2011.
[9] S.M. Mavadati, M.H. Mahoor, K. Bartlett, P. Trinh, J.F. Cohn, DISFA: a spontaneous facial action intensity database, IEEE Trans. Affect. Comput. 4 (April–June (2)) (2013).
[13] M.S. Bartlett, G.C. Littlewort, C. Lainscsek, I. Fasel, M.G. Frank, J.R. Movellan, Fully automatic facial action recognition in spontaneous behavior, in: 7th International Conference on Automatic Face and Gesture Recognition, 2006, pp. 223–228.
[14] S.M. Mavadati, M.H. Mahoor, K. Bartlett, P. Trinh, Automatic detection of non-posed facial action units, in: Proceedings of the IEEE International Conference on Image Processing (ICIP), September–October 2012.
[15] R. Sprengelmeyer, I. Jentzsch, Event related potentials and the perception of intensity in facial expressions, Neuropsychologia 44 (2006) 2899–2906.
[16] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: Computer Vision and Pattern Recognition 2005, vol. 1, June 2005, pp. 886–893.
[17] Y. Tian, T. Kanade, J.F. Cohn, Evaluation of Gabor-wavelet-based facial action unit recognition in image sequences of increasing complexity, in: International Conference on Automatic Face and Gesture Recognition, 2002, p. 229.
[19] Y. Chang, C. Hu, M. Turk, Manifold of facial expression, in: IEEE International Workshop on Analysis and Modeling of Faces and Gestures, IEEE Computer Society, Nice, France, 2003.


[20] C. Shan, S. Gong, P.W. McOwan, Appearance manifold of facial expression, Computer Vision in Human–Computer Interaction, Springer, Berlin, Heidelberg (2005) 221–230.
[23] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Comput. 15 (2003) 1373–1396.
[24] V.N. Vapnik, An overview of statistical learning theory, IEEE Trans. Neural Netw. 10 (September (5)) (1999) 988–999.
[25] Y. Tong, W. Liao, Q. Ji, Facial action unit recognition by exploiting their dynamic and semantic relationships, IEEE Trans. Pattern Anal. Mach. Intell. 29 (10) (2007).
[26] G. Schwarz, Estimating the dimension of a model, Ann. Stat. 6 (1978) 461–464.
[27] C.P. de Campos, Z. Zeng, Q. Ji, Structure learning of Bayesian networks using constraints, in: ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 113–120.
[28] C.P. de Campos, Q. Ji, Efficient structure learning of Bayesian networks using constraints, J. Mach. Learn. Res. (2011) 663–689.
[31] K.B. Korb, A.E. Nicholson, Bayesian Artificial Intelligence, Chapman and Hall/CRC, London, UK, 2004.
[32] P.E. Shrout, J.L. Fleiss, Intraclass correlations: uses in assessing rater reliability, Psychol. Bull. 86 (2) (1979) 420–428.
[33] 〈http://www.engr.du.edu/mmahoor/disfa.htm〉.
[35] J.J.J. Lien, T. Kanade, J.F. Cohn, C.C. Li, Detection, tracking, and classification of action units in facial expression, Robot. Auton. Syst. 31 (2000) 131–146.
[36] M. Valstar, M. Pantic, Combined support vector machines and hidden Markov models for modeling facial action temporal dynamics, in: Proceedings of IEEE International Conference on Human–Computer Interaction, 2007, pp. 118–127.
[37] Y. Tong, J. Chen, Q. Ji, A unified probabilistic framework for spontaneous facial activity modeling and understanding, IEEE Trans. Pattern Anal. Mach. Intell. 32 (2) (2010) 258–273.
[38] Y. Li, S.M. Mavadati, M.H. Mahoor, Q. Ji, A unified probabilistic framework for measuring the intensity of spontaneous facial action units, in: 10th IEEE International Conference on Automatic Face and Gesture Recognition, 2013.
[39] P. Ekman, Darwin, deception, and facial expression, Ann. N.Y. Acad. Sci. 1000 (2003) 205–221.
[41] G. Donato, M.S. Bartlett, J.C. Hager, P. Ekman, T.J. Sejnowski, Classifying facial actions, IEEE Trans. Pattern Anal. Mach. Intell. 21 (October (10)) (1999) 974–989.
[42] Y. Tian, T. Kanade, J.F. Cohn, Recognizing action units for facial expression analysis, IEEE Trans. Pattern Anal. Mach. Intell. 23 (2) (2001) 97–115.
[43] J. Whitehill, C.W. Omlin, Haar features for FACS AU recognition, in: Proceedings of IEEE International Conference on Automatic Face and Gesture Recognition, 2006, pp. 97–101.
[44] C. Shan, S. Gong, P.W. McOwan, Facial expression recognition based on local binary patterns: a comprehensive study, Image Vis. Comput. 27 (2009) 803–816.


[45] G. Zhao, M. Pietikainen, Dynamic texture recognition using local binary patterns with an application to facial expressions, IEEE Trans. Pattern Anal. Mach. Intell. 29 (6) (2007) 915–928.
[46] B.A. Draper, K. Baek, M.S. Bartlett, J.R. Beveridge, Recognizing faces with PCA and ICA, Comput. Vis. Image Underst. 91 (2003) 115–137.
[47] M. Pantic, I. Patras, Dynamics of facial expressions: recognition of facial actions and their temporal segments from face profile image sequences, IEEE Trans. Syst. Man Cybern. B Cybern. 36 (2) (2006) 433–449.
[48] S.W. Chew, R. Rana, P. Lucey, S. Lucey, S. Sridharan, Sparse temporal representations for facial expression recognition, in: Lecture Notes in Computer Science, vol. 7088, 2012, pp. 311–322.
[49] M. Bartlett, G. Littlewort, M. Frank, C. Lainscsek, I. Fasel, J. Movellan, Automatic recognition of facial actions in spontaneous expressions, J. Multimed. 1 (6) (2006) 22–35.
[50] J. Hamm, C.G. Kohler, R.C. Gur, R. Verma, Automated facial action coding system for dynamic analysis of facial expressions in neuropsychiatric disorders, J. Neurosci. Methods 200 (2) (2011) 237–256.
[51] L.A. Jeni, J.M. Girard, J.F. Cohn, F. De la Torre, Continuous AU intensity estimation using localized, sparse facial feature space, in: IEEE International Conference on Automatic Face and Gesture Recognition Workshops, 2013.
[52] A. Savran, B. Sankur, M.T. Bilge, Regression-based intensity estimation of facial action units, Image Vis. Comput. 30 (10) (2012) 774–784.
[53] S. Kaltwang, O. Rudovic, M. Pantic, Continuous pain intensity estimation from facial expressions, in: Proceedings of the International Symposium on Visual Computing, 2012, pp. 368–377.
[54] G. Sandbach, S. Zafeiriou, M. Pantic, Binary pattern analysis for 3D facial action unit detection, in: Proceedings of the British Machine Vision Conference (BMVC 2012), BMVA Press, Guildford, UK, September 2012.
[55] G. Sandbach, S. Zafeiriou, M. Pantic, Local normal binary patterns for 3D facial action unit detection, in: 19th IEEE International Conference on Image Processing (ICIP), IEEE, Orlando, FL, USA, October 2012, pp. 1813–1816.
[56] L. Jeni, A. Lőrincz, T. Nagy, Z. Palotai, J. Sebők, Z. Szabó, D. Takács, 3D shape estimation in video sequences provides high precision evaluation of facial expressions, Image Vis. Comput. 30 (10) (2012) 785–795.
[57] T. Baltrušaitis, P. Robinson, L. Morency, Continuous conditional neural fields for structured regression, in: European Conference on Computer Vision (ECCV 2014), Zurich, Switzerland, September 2014.
[58] G. Sandbach, S. Zafeiriou, M. Pantic, Markov random field structures for facial action unit intensity estimation, in: Proceedings of IEEE International Conference on Computer Vision Workshops (ICCV-W13), Workshop on Decoding Subtle Cues from Social Interactions, Sydney, Australia, December 2013.
[59] O. Rudovic, V. Pavlovic, M. Pantic, Context-sensitive conditional ordinal random fields for facial action intensity estimation, in: IEEE International Conference on Computer Vision (ICCV) Workshops, Sydney, Australia, December 2013.

Yongqiang Li received the B.S., M.S. and Ph.D. degrees in instrument science and technology from Harbin Institute of Technology, Harbin, China, in 2007, 2009 and 2014, respectively. He is currently an Assistant Professor at Harbin Institute of Technology. He worked as a visiting student at Rensselaer Polytechnic Institute, Troy, USA, from September 2010 to September 2012. His areas of research include computer vision, pattern recognition, and human–computer interaction.

S. Mohammad Mavadati received the B.Sc. degree in electronics engineering from the Shahrood University of Technology, Iran, in September 2007, and the M.Sc. degree in telecommunication engineering from Yazd University, Iran, in March 2010. He is currently working toward the Ph.D. degree and is a Graduate Research Assistant in the Department of Electrical and Computer Engineering at the University of Denver. His research interests include automatic analysis of facial expression, pattern classification, and computer vision. He is a student member of both the IEEE and the IEEE Signal Processing Society.

Mohammad H. Mahoor received the B.S. degree in electronics from the Abadan Institute of Technology, Iran, in 1996, the M.S. degree in biomedical engineering from the Sharif University of Technology, Iran, in 1998, and the Ph.D. degree in electrical and computer engineering from the University of Miami, Florida, in 2007. He joined the University of Denver (DU) as an Assistant Professor of computer engineering in September 2008. He has authored or coauthored more than 60 refereed research publications. He is the director of the Image Processing and Computer Vision Laboratory at DU. His research interests include affective computing and developing automated systems for facial expression recognition. He is a member of the IEEE.

Yongping Zhao received the Ph.D. degree in electrical engineering from Harbin Institute of Technology, Harbin, China. He is currently a Professor with the Department of Instrument Science and Technology at Harbin Institute of Technology, Harbin, China. His areas of research include signal processing, system integration and pattern recognition.

Qiang Ji received his Ph.D. degree in electrical engineering from the University of Washington. He is currently a Professor with the Department of Electrical, Computer, and Systems Engineering at Rensselaer Polytechnic Institute (RPI). He recently served as a Program Director at the National Science Foundation (NSF), where he managed NSF's computer vision and machine learning programs. He also held teaching and research positions with the Beckman Institute at University of Illinois at Urbana-Champaign, the Robotics Institute at Carnegie Mellon University, the Department of Computer Science at University of Nevada at Reno, and the US Air Force Research Laboratory. Prof. Ji currently serves as the Director of the Intelligent Systems Laboratory (ISL) at RPI. Prof. Ji's research interests are in computer vision, probabilistic graphical models, information fusion, and their applications in various fields. He has published over 160 papers in peer-reviewed journals and conferences. His research has been supported by major governmental agencies including NSF, NIH, DARPA, ONR, ARO, and AFOSR as well as by major companies including Honda and Boeing. Prof. Ji is an Editor on several related IEEE and international journals and he has served as a General Chair, Program Chair, Technical Area Chair, and Program Committee Member in numerous international conferences/workshops. Prof. Ji is a Fellow of IEEE and IAPR.
