A spatio-temporal RBM-based model for facial expression recognition

Pattern Recognition 49 (2016) 152–161

S. Elaiwat a,*, M. Bennamoun a, F. Boussaid b

a School of Computer Science and Software Engineering, University of Western Australia, 35 Stirling Highway, Crawley, WA, Australia
b School of Electrical, Electronic and Computer Engineering, University of Western Australia, 35 Stirling Highway, Crawley, WA, Australia

☆ This research is supported by the Australian Research Council Grant DP110102166.
* Corresponding author. E-mail address: [email protected] (S. Elaiwat).

Article history: Received 30 January 2015; Received in revised form 26 May 2015; Accepted 22 July 2015; Available online 31 July 2015

Keywords: Facial expression recognition; Restricted Boltzmann Machines; Spatio-temporal features; Image transformations

Abstract: The ability to recognize facial expressions will be an important characteristic of next-generation human computer interfaces. Towards this goal, we propose a novel RBM-based model to effectively learn the relationships (or transformations) between image pairs associated with different facial expressions. The proposed model has the ability to disentangle these transformations (e.g. pose variations and facial expressions) by encoding them into two different hidden sets, namely facial-expression morphlets and non-facial-expression morphlets. The first hidden set encodes facial-expression morphlets through a factored four-way sub-model conditional on label units. The second hidden set encodes non-facial-expression morphlets through a factored three-way sub-model. With such a strategy, the proposed model can learn transformations between image pairs while disentangling facial-expression transformations from non-facial-expression transformations. This is achieved using an algorithm dubbed Quadripartite Contrastive Divergence. Reported experiments demonstrate the superior performance of the proposed model compared to the state of the art.

1. Introduction

Analyzing facial expressions is important for Human Computer Interaction (HCI) applications and non-verbal communication, in areas such as online/remote education, intelligent homes, computer-aided medical treatment, the monitoring of human behaviors and lifestyles, and entertainment [1]. The majority of reported facial expression recognition (FER) systems focus on the analysis of facial expressions from single (still) images, without exploiting the dynamic nature of facial expressions. Recent works on video have shown that the performance of FER systems can be significantly improved by capturing the dynamics associated with the formation of a given facial expression. This holds particularly true for natural/spontaneous expressions without exaggerated posing [2–4]. Fig. 1 shows examples of challenging cases for FER systems.

The majority of existing dynamic FER models are based on hand-engineered features (e.g. [4–7]) extracted and pooled over selected frames of the video. Although these features achieve fairly good results [3], they are still not fully automated due to their strong reliance on human expertise to craft the most suitable descriptors.


Furthermore, these features cannot be readily transferred from one task to another (e.g. face recognition, facial expression recognition and fingerprint recognition). To address these limitations, a number of works [2,8] have applied feature learning to build efficient FER models. In these works, generative models (e.g. the Auto-encoder (AE) and Restricted Boltzmann Machines (RBM)) are used to learn "good" features. Generative models such as RBMs can be represented as a bi-partite network with two layers: a set of input units y (first layer) and a set of hidden units z (second layer). Combinations of these models can also be used to build deep hierarchical structures. For example, Ranzato et al. [8] introduced a deep generative model based on a Deep Belief Network (DBN) with five hidden layers, where a gated MRF was used as the first hidden layer and RBMs for the remaining layers. The deep hierarchical structure allows us to build robust features, which can be used to recognize facial expressions under challenging conditions such as in the presence of occlusions [8]. However, standard generative models are limited to single input images, and thus cannot be utilized to extract features from videos (spatio-temporal features). Extensions of the standard generative model were introduced for various applications (e.g. [9–12]) to learn variations between frames of videos. For example, Susskind et al. [12] presented an extension of the RBM to capture the transformations associated with changes in facial expression between two images of the same subject.

Despite extensive research in both hand-engineered features and feature learning, the development of an efficient FER system remains a challenging problem for a number of reasons.


Fig. 1. Samples of subjects with various facial expressions. (a) Expressions with occlusion. (b) Expressions with pose variations. (c) “Happy” subjects.

Fig. 2. Block diagram of the proposed model.

First, facial expressions of the same class (e.g. happy) can be very different from one person to another (Fig. 1(c)). Second, facial expressions are highly intertwined with facial morphology and facial pose variations [13] (Fig. 1(a) and (b)). To tackle this, Rifai et al. [13] proposed a feature learning model to disentangle the features of facial (still) images into two different sets, with emotion-related features isolated in one set and the remaining features isolated in the second set. Indeed, some feature learning models (e.g. [14,12]) proposed in the context of image transformations (relationships between two images) implicitly separate the variations between images (transformations) from their structure. However, these models are sensitive to common transformations such as pose, illumination or expression. In addition, these models cannot disentangle a given type of transformation (e.g. facial expression transformations) from the other transformations (e.g. pose and occlusion). Ideally, it is desirable to learn how to disentangle the transformations associated with facial expressions from all other transformations.

In this work, we introduce a spatio-temporal feature learning model that can capture the different transformations between facial image pairs while isolating the transformations associated solely with facial expressions. This model combines a number of desired properties, including simplicity (a single hidden layer), the ability to learn image relations [15] and the disentanglement of features into different categories [13]. Fig. 2 shows a block diagram of our proposed model. The main contributions of this paper can be summarized as follows:

• A novel RBM-based model is proposed to capture various transformations (or relationships) between image pairs and disentangle these transformations into facial expression morphlets (FE morphlets) and non-facial expression morphlets (non-FE morphlets). Unlike other RBM-based models (e.g. [14]), the proposed model learns features through two different hidden sets. The first hidden set is used to capture FE morphlets, while the other is used to capture non-FE morphlets. These hidden sets collaborate during the process of learning how to disentangle FE morphlets from non-FE morphlets. In contrast, other models, including deep generative models [16] and transformation learning models [12,14], lack this ability to simultaneously capture and separate certain types of transformations from others (e.g. pose variations and facial expressions), because they usually learn through a single hidden set for each hidden layer. All unit sets in our model are connected together through two factored sub-models, which differ from the Factored 3-way RBM [17] in two main ways. Firstly, these factored sub-models do not share the same hidden sets, and one of them (the FE morphlets units) is conditional on label units representing the expression class (e.g. Happiness). Thus, the FE morphlets set is forced to learn those transformations that are compatible with the labels, while the other hidden set learns the transformations that are not captured by the FE morphlets set. Secondly, each sub-model defines the joint distribution between input unit sets, rather than the conditional distribution of one input set given the other. This avoids the limitations associated with learning case-specific RBMs [18] and with the normalization that results from having a conditional partition function [12].

• A Quadripartite Contrastive Divergence algorithm is introduced to learn the proposed RBM-based model. This is required because the proposed model exhibits a quadripartite structure, given that it involves two input sets, two hidden sets and one conditional label set. The standard Contrastive Divergence (CD) algorithm is limited to models exhibiting a bi-partite structure. Even the extended version of the CD algorithm proposed by Susskind et al. [12] (3-way CD) is only applicable to tri-partite structures. Furthermore, all existing variations of the CD algorithm can only learn a single hidden set. In contrast, our Quadripartite CD algorithm allows us to define the joint distribution between input sets through two independent hidden sets by repeatedly applying Gibbs sampling between the hidden and the visible sets. This makes it possible to simultaneously capture transformations while sorting them into different sets. Both hidden sets are involved in the reconstruction of the input sets, which is important to ensure that the actual transformations between image pairs are distributed between the hidden sets. Since our model is conditional on label units of limited size (the number of classes), it can easily discriminate between facial expressions by defining the probability distribution for each class rather than sampling the label units.

• Extensive evaluations, carried out on the well-known Cohn–Kanade (CK+) [19], MMI [20] and AFEW [31] datasets, show that the proposed model leads to dramatic improvements in facial expression recognition.

The rest of the paper is organized as follows. Section 2 provides an overview of related work, including feature learning and previously reported RBM-based models for learning image relationships. The proposed spatio-temporal model is presented in Section 3, together with the proposed Quadripartite Contrastive Divergence algorithm. A performance evaluation of the proposed model on three well-known datasets (CK+, MMI and AFEW) is reported in Section 4. Finally, a conclusion is given in Section 5.

2. Previous works

Feature learning enables one to extract features automatically rather than relying on human expertise. In addition, it can easily be adapted to other tasks [18]. Most common feature learning models, such as K-means clustering and Restricted Boltzmann Machines, can be represented as a bi-partite network whose function is to learn salient structures and patterns as a set of latent features (hidden units) from input data (visible units). To extract more descriptive features, deep architecture models (e.g. deep sparse representation [21] and the Deep Belief Network (DBN) [22,8]) were constructed from simple models. Deep architecture models usually comprise several layers of the same architecture (e.g. RBM), each of which is greedily pre-trained. These models are then fine-tuned using supervised learning for classification purposes. Although considered powerful models for feature learning, deep architecture models suffer from high complexity and computational cost. Furthermore, parameter selection (e.g. learning rate, sparsity and momentum) is a delicate and time-consuming task, often requiring cross-validation [23].

Feature learning models can be extended to learn spatio-temporal features by adding one or more extra input unit sets. Taylor et al. [9] proposed a conditional RBM (CRBM), in which the current visible units (v_t) are conditioned on the previous time-slice visible units (v_{t-1}) as an additional input set. However, this conditional relation (between v_t and v_{t-1}) is limited to the biases. Memisevic and Hinton [15] proposed a conditional version of the RBM, namely the Gated RBM (G-RBM), using a multiplicative interaction between the hidden units (h), the current visible units (v_t) and the previous visible units (v_{t-1}). The G-RBM was shown to be well suited to encoding the relationship between images rather than their contents (structure). Other gated models (e.g. [24,25]) applied the same concept as the G-RBM but with the use of auto-encoders. Unfortunately, the use of gated models is limited to small image patches, since the number of parameters grows cubically with the number of pixels in the patch (for roughly the same number of units h, v_t and v_{t-1}) [18]. Memisevic and Hinton [14] proposed a factorized version of the gated RBM by factorizing the parameter tensor W into three sub-weight matrices. The factored gated RBM has a demonstrated ability to learn different types of image transformations such as shifts, rotation, scaling, affine transformations and convolution with random kernels. Various works have applied different variations of the factored gated RBM, including Ranzato et al. [16] to model natural images, Zeiler et al. [11] for facial expression transfer and Taylor et al. [17] for the modeling of motion style. The latter work proposed additional conditional label units to force the generative learning to learn w.r.t. the label style. The common limitation of these factored models lies in that they can only model the conditional distribution of a single image given another one. To address this, Susskind et al. [12] presented a joint distribution model over an image pair by extending the Contrastive Divergence (CD) algorithm from 2-way (bi-partite structure) to 3-way (tri-partite structure). The joint distribution model has a number of advantages: it overcomes the limitation of training the model as a set of case-specific RBMs, and it performs image matching between two images by measuring how compatible they are under the trained model [18]. Unfortunately, these feature learning algorithms still cannot distinguish the transformations associated with facial expressions from other transformations.

In the following, we give a brief overview of the previously reported RBM-based models proposed for the task of learning relationships between images.

Gated RBM (G-RBM): Introduced by Memisevic and Hinton [15] in the context of learning image transformations. The G-RBM differs from the standard RBM in that its visible and hidden units (y and h) are conditioned on other input units x through multiplicative interactions, as shown in Fig. 3. These interactions allow the hidden units to learn how to encode transformations between image pairs.

Fig. 3. Two views of G-RBM: (a) gated regression and (b) modulated bi-partite network [14].


The G-RBM can be viewed as a gated regression (Fig. 3(a)) with a very large number of mixture components, where each activated hidden unit blends a slice of the weight tensor W into a linear regression [17]. An alternative view of the G-RBM (Fig. 3(b)) is as a bi-partite network (e.g. an RBM) whose connections are modulated by additional input units (x) that gate a set of linear filters to reconstruct the output image (y). The three-way energy function of the G-RBM is defined as

E(y, h, x) = - \sum_{ijk} w_{ijk} x_i y_j h_k - \sum_{jk} w^{yh}_{jk} y_j h_k - \sum_{k} w^{h}_{k} h_k - \sum_{j} w^{y}_{j} y_j    (1)

where w_{ijk} represents a tensor weight between the input, output and hidden units. w^h_k and w^y_j are the standard biases of the hidden and output units, respectively, while w^{yh}_{jk} represent the gated biases. The learning rule here is similar to that of the standard Contrastive Divergence algorithm of the RBM [26].

Factored Gated RBM (FG-RBM): Since the energy function of the G-RBM (1) involves a summation over the input, output and hidden units, the complexity associated with one inference step is O(N^3), assuming that the number of input units (I), output units (J) and hidden units (K) is each N. Memisevic and Hinton [14] proposed the factored version of the G-RBM (FG-RBM), shown in Fig. 4(a). Notice that the weight tensor w has been factorized into three weight matrices w^x, w^y and w^h:

w_{ijk} = \sum_{f=1}^{F} w^{x}_{if} w^{y}_{jf} w^{h}_{kf}    (2)

where F refers to the number of factors. Factorizing the weight tensor reduces the complexity from O(N^3) to O(N^2). The energy function of the factored three-way interaction is given by

E(y, h, x) = - \sum_{f=1}^{F} \sum_{ijk} x_i y_j h_k w^{x}_{if} w^{y}_{jf} w^{h}_{kf} - \sum_{k} w^{h}_{k} h_k - \sum_{j} w^{y}_{j} y_j    (3)
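To make the factorization concrete, the following NumPy sketch (our own illustration, not code from the paper; array names and shapes are assumptions) computes the FG-RBM hidden activation probabilities p(h_k = 1 | x, y) = sigmoid(w^h_k + \sum_f w^h_{kf} (\sum_i x_i w^x_{if}) (\sum_j y_j w^y_{jf})) using only matrix-vector products, so the cost per image pair scales with the number of factors rather than cubically with the number of units:

import numpy as np

def fgrbm_hidden_probs(x, y, Wx, Wy, Wh, bh):
    """Factored gated RBM: p(h_k = 1 | x, y).

    x  : (I,)   first image (flattened pixels)
    y  : (J,)   second image
    Wx : (I, F) input-to-factor weights  w^x_{if}
    Wy : (J, F) output-to-factor weights w^y_{jf}
    Wh : (K, F) hidden-to-factor weights w^h_{kf}
    bh : (K,)   hidden biases            w^h_k
    """
    fx = x @ Wx                     # (F,) factor responses of x
    fy = y @ Wy                     # (F,) factor responses of y
    pre = bh + Wh @ (fx * fy)       # (K,) hidden pre-activations
    return 1.0 / (1.0 + np.exp(-pre))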

Notice that the inference task for a trained FG-RBM can take one of three forms, depending on the application. The first form computes the conditional distribution p(h | x, y) for given test images x and y; this form infers the transformations between images x and y through the latent features (h). The second form computes the conditional distribution p(y | h, x) from the input image x and fixed latent features h; this form is usually used in the context of analogy making [14]. The last form computes the probability distribution p(y, x) for a new test image x (not included in the training data). This form cannot be computed directly because it requires marginalizing over the latent units h; instead, p(y, x) can be sampled using alternating Gibbs sampling [14].

Taylor and Hinton [17] introduced a style-gated RBM for modeling different styles of human motion. The proposed model includes additional conditional units (style units) which connect multiplicatively with the other units. Style units act as a controller for the learning process, so that the generated conditional distribution is compatible with the style labels. Fig. 4(b) shows the corresponding model with three sub-models (f, m and n) conditional on real-valued labels (features l) derived from the discrete style units (s). The energy function of this model with real-valued visible units is defined as

E(y, h, x, l) = \sum_{j} \frac{(\hat{a}_j - y_j)^2}{2\sigma_j^2} - \sum_{f} \sum_{j,k,g} \frac{y_j}{\sigma_j} w^{y}_{jf} w^{h}_{kf} w^{l}_{gf} h_k l_g - \sum_{k} \hat{b}_k h_k    (4)

with the first and the last terms embedding the sub-models n and m within the dynamic biases \hat{a} and \hat{b}. The remaining sub-model (f) is represented by the middle term. For simplicity, the value of σ was chosen to be 1.

Fig. 4. (a) Graphical representation of FG-RBM [14]. (b) Styled FG-RBM with real-valued styles [17].

Joint Distribution RBM: Susskind et al. [12] introduced a way to define the joint probability distribution between two images rather than the conditional distribution. Their model extends the standard Contrastive Divergence algorithm (2-way CD, over a bi-partite structure) to incorporate 3-way sampling from the distributions p(y | x, h), p(x | y, h) and p(h | x, y), as a tri-partite structure. Since the model is no longer conditional on any given unit, it can match image pairs and measure how compatible they are under a given trained model. The energy function of the joint probability distribution is defined as

E(x, y, h) = - \sum_{f} \left(\sum_{i} x_i w^{x}_{if}\right) \left(\sum_{j} y_j w^{y}_{jf}\right) \left(\sum_{k} h_k w^{h}_{kf}\right) - \sum_{k} w^{h}_{k} h_k + \frac{1}{2} \sum_{i} (x_i - w^{x}_{i})^2 + \frac{1}{2} \sum_{j} (y_j - w^{y}_{j})^2    (5)

Eq. (5) is similar to Eq. (3) when assuming that the visible units are real-valued, with the additional term \frac{1}{2} \sum_{i} (x_i - w^{x}_{i})^2. This is because the model is no longer conditional on the units x. After training the model, the matching of image pairs is achieved using the log-probability [12]:

\log p(x, y) = - \log Z - \frac{1}{2}\left(\sum_{i} (x_i - w^{x}_{i})^2 + \sum_{j} (y_j - w^{y}_{j})^2\right) + \sum_{k} \log\left(1 + \exp\left(w^{h}_{k} + \sum_{f} w^{h}_{kf} \left(\sum_{i} x_i w^{x}_{if}\right) \left(\sum_{j} y_j w^{y}_{jf}\right)\right)\right)    (6)

Note that the log of the partition function Z is the same for all image pairs. It can thus be canceled when comparing image pairs.
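As an illustration of how Eq. (6) is used for matching, the sketch below (our own, with assumed variable names; bx and by stand for the visible biases w^x and w^y) evaluates the unnormalized log-probability of an image pair, dropping the constant -log Z, so that scores are only comparable between pairs under the same trained model:

import numpy as np

def joint_match_score(x, y, Wx, Wy, Wh, bh, bx, by):
    """Unnormalized log p(x, y) of the joint-distribution RBM (Eq. (6)),
    with the constant -log Z dropped."""
    quad = 0.5 * (np.sum((x - bx) ** 2) + np.sum((y - by) ** 2))
    pre = bh + Wh @ ((x @ Wx) * (y @ Wy))          # (K,) hidden pre-activations
    return -quad + np.logaddexp(0.0, pre).sum()    # sum_k log(1 + exp(pre_k)), overflow-safe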

3. Proposed spatio-temporal RBM-based model for facial expression recognition

The central idea of our model is to learn to disentangle the spatio-temporal features into two sets of latent features (hidden units). The first set represents the desirable transformations which result from changes in facial expression between image pairs of the same subject (expression morphlet units). The second set represents the other transformations between image pairs, including pose variations and illumination (nuisance morphlet units). This differs from other probability distribution models in that our model can learn to focus on a certain type of transformation while disentangling it from the other transformations; other models do not have this latter characteristic. In our model, the visible units x and y are connected multiplicatively to two sets of hidden units (h and u) through two sub-models F and M (Fig. 5). The hidden units h are designed to capture the facial expression morphlets (FE morphlets), while the other hidden units u are designed to capture the non-facial expression morphlets (non-FE morphlets), as shown in Fig. 5. In the first sub-model (F), the units x, y and h are connected to the label units l through factored 4-way interactions, which allow us to learn features (transformations) with respect to the changes introduced by the labels. In contrast, the sub-model M, like the standard Joint Distribution RBM, is not conditioned on any unit, with the units x, y and u connected through factored 3-way interactions (Fig. 5). Therefore, the latter sub-model can define a joint probability distribution between units representing common transformations (e.g. pose variations). Notice that while these sub-models work independently to encode features through the hidden units h and u, they still work together to reconstruct the visible units x and y. To increase the effectiveness of the labels, they are mapped

and expanded from n discrete values to q real values, where n ≤ q. It would be possible to connect soft-max style units to the hidden units rather than conditioning the interactions on the label units (in a similar way to [22] in the case of the DBN). However, Taylor and Hinton [17] showed that this is not a good option for dynamic models (e.g. the conditional RBM), since the conditional relation with respect to past input sets is much stronger than the conditional relation with respect to the label units. This has the effect of learning the relationship (consistency) between the input unit sets while resisting the variations associated with the changing labels.

Assuming that the visible units are real-valued (with σ = 1 for simplicity, as in our discussion of the styled-gated RBM), the energy function of our model takes the following form:

E(x, y, h, u, l) = - \sum_{f} E^{(F)}_{f} - \sum_{m} E^{(M)}_{m} - \sum_{z} h_z w^{h}_{z} - \sum_{k} u_k w^{u}_{k} + \frac{1}{2} \sum_{i} (x_i - w^{x}_{i})^2 + \frac{1}{2} \sum_{j} (y_j - w^{y}_{j})^2    (7)

where the first two terms (summed over the factors f and m) are the energy functions over the sub-models F and M, and the remaining four terms correspond to the interactions between the units h, u, x and y and their biases w^{h}_{z}, w^{u}_{k}, w^{x}_{i} and w^{y}_{j}, respectively. E^{(F)}_{f} and E^{(M)}_{m} are given by

E^{(F)}_{f} = \left(\sum_{i} x_i w^{xF}_{if}\right) \left(\sum_{j} y_j w^{yF}_{jf}\right) \left(\sum_{z} h_z w^{h}_{zf}\right) \left(\sum_{g} l_g w^{l}_{gf}\right)    (8)

E^{(M)}_{m} = \left(\sum_{i} x_i w^{xM}_{im}\right) \left(\sum_{j} y_j w^{yM}_{jm}\right) \left(\sum_{k} u_k w^{u}_{km}\right)    (9)

The joint distribution over an image pair (x, y) conditional on the label units l is given by

P(x, y, l) = \sum_{h, u} p(x, y, h, u, l)    (10)

where the probability distribution over the visible and hidden units is given by

p(x, y, h, u, l) = \frac{1}{Z(l)} \exp(-E(x, y, h, u, l)), \quad \text{with} \quad Z(l) = \sum_{x, y, h, u} \exp(-E(x, y, h, u, l))    (11)

Like a standard RBM, the partition function Z in Eq. (11) over the visible and hidden units is intractable if the number of hidden units (h and u) is large. Instead, it is easier to sample the model through a set of conditional distributions over the hidden and visible units. The conditional distributions of the hidden units (h and u) given the other units are defined as

p(h | x, y, u, l) = \prod_{z} \rho\!\left(w^{h}_{z} + \sum_{f} w^{h}_{zf} \left(\sum_{i} x_i w^{xF}_{if}\right) \left(\sum_{j} y_j w^{yF}_{jf}\right) \left(\sum_{g} l_g w^{l}_{gf}\right)\right)    (12)

p(u | x, y, h, l) = \prod_{k} \rho\!\left(w^{u}_{k} + \sum_{m} w^{u}_{km} \left(\sum_{i} x_i w^{xM}_{im}\right) \left(\sum_{j} y_j w^{yM}_{jm}\right)\right)    (13)

while the reconstruction distributions of the visible units (x and y) are defined as follows:

p(y | x, h, u, l) = \prod_{j} \mathcal{N}\!\left(w^{y}_{j} + \sum_{f} w^{yF}_{jf} \left(\sum_{i} x_i w^{xF}_{if}\right) \left(\sum_{z} h_z w^{h}_{zf}\right) \left(\sum_{g} l_g w^{l}_{gf}\right) + \sum_{m} w^{yM}_{jm} \left(\sum_{i} x_i w^{xM}_{im}\right) \left(\sum_{k} u_k w^{u}_{km}\right),\; 1\right)    (14)

p(x | y, h, u, l) = \prod_{i} \mathcal{N}\!\left(w^{x}_{i} + \sum_{f} w^{xF}_{if} \left(\sum_{j} y_j w^{yF}_{jf}\right) \left(\sum_{z} h_z w^{h}_{zf}\right) \left(\sum_{g} l_g w^{l}_{gf}\right) + \sum_{m} w^{xM}_{im} \left(\sum_{j} y_j w^{yM}_{jm}\right) \left(\sum_{k} u_k w^{u}_{km}\right),\; 1\right)    (15)

Fig. 5. The proposed spatio-temporal model.

where ρ(·) is the logistic function and N(·,·) is the Gaussian distribution. It is important to note that the conditional distributions of the hidden units, p(h | x, y, u, l) and p(u | x, y, h, l), are independent from each other, since the different sub-models allow each one to learn different types of features. However, in the reconstruction phase, both sub-models F and M, with both hidden sets h and u, work together to reconstruct the visible units x and y. Thus, our model has the notable property of learning feature disentangling [27] while modeling the transformations between image pairs [12]. Since the inference of the hidden and visible units consists of four conditional distributions, the Contrastive Divergence (CD) algorithm cannot be applied for learning. To address this limitation, we extend the CD algorithm to learn our model efficiently, as discussed in the next section.
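For illustration, the following NumPy sketch (our own, not the authors' implementation; the dictionary keys and array shapes are assumptions, with W['xF'] of shape (I, F), W['yF'] of shape (J, F), W['h'] of shape (Z, F), W['l'] of shape (G, F), W['xM'] of shape (I, M), W['yM'] of shape (J, M) and W['u'] of shape (K, M)) evaluates Eqs. (12)–(15) for a single image pair:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def hidden_probs(x, y, l, W, b):
    """Eqs. (12)-(13): activation probabilities of the FE-morphlet units h
    and the non-FE-morphlet units u, given an image pair (x, y) and labels l."""
    ph = sigmoid(b['h'] + W['h'] @ ((x @ W['xF']) * (y @ W['yF']) * (l @ W['l'])))
    pu = sigmoid(b['u'] + W['u'] @ ((x @ W['xM']) * (y @ W['yM'])))
    return ph, pu

def mean_y(x, h, u, l, W, b):
    """Eq. (14): mean of the unit-variance Gaussian over y; both sub-models contribute."""
    return (b['y']
            + W['yF'] @ ((x @ W['xF']) * (h @ W['h']) * (l @ W['l']))
            + W['yM'] @ ((x @ W['xM']) * (u @ W['u'])))

def mean_x(y, h, u, l, W, b):
    """Eq. (15): mean of the unit-variance Gaussian over x."""
    return (b['x']
            + W['xF'] @ ((y @ W['yF']) * (h @ W['h']) * (l @ W['l']))
            + W['xM'] @ ((y @ W['yM']) * (u @ W['u'])))

Note how the two hidden sets are inferred independently of each other (Eqs. (12)–(13)), yet both appear in the reconstruction means (Eqs. (14)–(15)).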

3.1. Quadripartite Contrastive Divergence algorithm

The parameters of our model, including the factors (weights) and the biases, can be learned by maximizing the average log-probability L = \sum_{\alpha} \log p(y^{\alpha}, x^{\alpha}, l^{\alpha}) of the training set {(y^{\alpha}, x^{\alpha}, l^{\alpha})}. The derivative of L w.r.t. any single parameter θ takes the following form:

\frac{\partial L}{\partial \theta} = \sum_{\alpha} \left( \left\langle -\frac{\partial E(x^{\alpha}, y^{\alpha}, h, u, l^{\alpha})}{\partial \theta} \right\rangle_{h, u} - \left\langle -\frac{\partial E(x, y, h, u, l^{\alpha})}{\partial \theta} \right\rangle_{x, y, h, u} \right)    (16)

Computing the exact value of the second term over x, y, h and u is intractable. Fortunately, it can be approximated by drawing samples from the conditional distributions p(h | x, y, u, l), p(u | x, y, h, l), p(y | x, h, u, l) and p(x | y, h, u, l). Note that the standard CD algorithm is limited to bi-partite structures, while our model exhibits a quadripartite structure. Conditional distribution models such as [14] exhibit a bi-partite structure, since they define the conditional distribution of one visible unit set given the other. Susskind et al. [12] showed that it is possible to extend the 2-way CD algorithm to a tri-partite structure (3-way CD) by adding an extra sampling step p(x | y, h). From that perspective, we extend the standard CD to be compatible with a quadripartite structure (Quadripartite CD) by repeatedly applying Gibbs sampling between the hidden and the visible sets, p(h | x, y, u, l), p(u | x, y, h, l), p(y | x, h, u, l) and p(x | y, h, u, l). Unlike the standard CD, a single iteration here involves the update of four sets (two hidden sets and two visible sets). The order of the updates is close to that of CD, in that the hidden units are sampled first and the visible units are reconstructed next. The sampling of the hidden units in our model requires an update of both hidden sets (h and u); the order of this update is not important, because h and u are independent. In the same manner, the reconstruction of the visible units requires the update of both visible sets (x and y); since these are not fully independent, they are affected by the updating order. In this work, the order of updating the visible sets is chosen randomly at each iteration. By differentiating Eq. (16) w.r.t. each parameter θ, we get

\Delta w^{xF}_{if} \propto \sum_{\alpha} \left( \left\langle x^{\alpha}_i \sum_{j} y^{\alpha}_j w^{yF}_{jf} \sum_{z} h_z w^{h}_{zf} \sum_{g} l_g w^{l}_{gf} \right\rangle_{0} - \left\langle x_i \sum_{j} y_j w^{yF}_{jf} \sum_{z} h_z w^{h}_{zf} \sum_{g} l_g w^{l}_{gf} \right\rangle_{q} \right)    (17)

\Delta w^{yF}_{jf} \propto \sum_{\alpha} \left( \left\langle y^{\alpha}_j \sum_{i} x^{\alpha}_i w^{xF}_{if} \sum_{z} h_z w^{h}_{zf} \sum_{g} l_g w^{l}_{gf} \right\rangle_{0} - \left\langle y_j \sum_{i} x_i w^{xF}_{if} \sum_{z} h_z w^{h}_{zf} \sum_{g} l_g w^{l}_{gf} \right\rangle_{q} \right)    (18)

\Delta w^{h}_{zf} \propto \sum_{\alpha} \left( \left\langle h_z \sum_{i} x^{\alpha}_i w^{xF}_{if} \sum_{j} y^{\alpha}_j w^{yF}_{jf} \sum_{g} l_g w^{l}_{gf} \right\rangle_{0} - \left\langle h_z \sum_{i} x_i w^{xF}_{if} \sum_{j} y_j w^{yF}_{jf} \sum_{g} l_g w^{l}_{gf} \right\rangle_{q} \right)    (19)

\Delta w^{l}_{gf} \propto \sum_{\alpha} \left( \left\langle l_g \sum_{i} x^{\alpha}_i w^{xF}_{if} \sum_{j} y^{\alpha}_j w^{yF}_{jf} \sum_{z} h_z w^{h}_{zf} \right\rangle_{0} - \left\langle l_g \sum_{i} x_i w^{xF}_{if} \sum_{j} y_j w^{yF}_{jf} \sum_{z} h_z w^{h}_{zf} \right\rangle_{q} \right)    (20)

\Delta w^{xM}_{im} \propto \sum_{\alpha} \left( \left\langle x^{\alpha}_i \sum_{j} y^{\alpha}_j w^{yM}_{jm} \sum_{k} u_k w^{u}_{km} \right\rangle_{0} - \left\langle x_i \sum_{j} y_j w^{yM}_{jm} \sum_{k} u_k w^{u}_{km} \right\rangle_{q} \right)    (21)

\Delta w^{yM}_{jm} \propto \sum_{\alpha} \left( \left\langle y^{\alpha}_j \sum_{i} x^{\alpha}_i w^{xM}_{im} \sum_{k} u_k w^{u}_{km} \right\rangle_{0} - \left\langle y_j \sum_{i} x_i w^{xM}_{im} \sum_{k} u_k w^{u}_{km} \right\rangle_{q} \right)    (22)

\Delta w^{u}_{km} \propto \sum_{\alpha} \left( \left\langle u_k \sum_{i} x^{\alpha}_i w^{xM}_{im} \sum_{j} y^{\alpha}_j w^{yM}_{jm} \right\rangle_{0} - \left\langle u_k \sum_{i} x_i w^{xM}_{im} \sum_{j} y_j w^{yM}_{jm} \right\rangle_{q} \right)    (23)

\Delta w^{x}_{i} \propto \sum_{\alpha} \left( \langle x^{\alpha}_i \rangle_{0} - \langle x_i \rangle_{q} \right), \quad \Delta w^{y}_{j} \propto \sum_{\alpha} \left( \langle y^{\alpha}_j \rangle_{0} - \langle y_j \rangle_{q} \right), \quad \Delta w^{h}_{z} \propto \sum_{\alpha} \left( \langle h_z \rangle_{0} - \langle h_z \rangle_{q} \right), \quad \Delta w^{u}_{k} \propto \sum_{\alpha} \left( \langle u_k \rangle_{0} - \langle u_k \rangle_{q} \right)    (24)

where ⟨·⟩_0 and ⟨·⟩_q denote expectations under the data distribution and after q steps of Gibbs sampling, respectively.

To avoid numerical instabilities [14] during the learning process, it is important to keep the conditional distribution matrices positive definite. This requirement was explained in [28] in the context of a single-image model and in [12] for the joint density model of image pairs. It can be effectively met by normalizing the columns of each weight matrix (W^{xF}, W^{yF}, W^h, W^l, W^{xM}, W^{yM}, W^u) after each iteration. The proposed Quadripartite Contrastive Divergence procedure is given as Algorithm 1.

Algorithm 1. Quadripartite Contrastive Divergence.

Inputs: training data (y^α, x^α, l^α), learning rate ε
Initialize the factors and biases
For α = 1 to batch_size
  %% positive phase
  compute activations from the visible and label units:
    A^xF = x^α · W^xF, A^xM = x^α · W^xM, A^yF = y^α · W^yF, A^yM = y^α · W^yM, A^l = l^α · W^l
  sample h from p(h | x, y, u, l); sample u from p(u | x, y, h, l)
  compute activations from the hidden units: A^h = h · W^h, A^u = u · W^u
  compute the positive updates: W^{xF+}, W^{yF+}, W^{h+}, W^{l+}, W^{xM+}, W^{yM+}, W^{u+}
  apply the positive updates:
    W^h = W^h + εW^{h+}, W^xF = W^xF + εW^{xF+}, W^yF = W^yF + εW^{yF+},
    W^l = W^l + εW^{l+}, W^u = W^u + εW^{u+}, W^yM = W^yM + εW^{yM+}, W^xM = W^xM + εW^{xM+},
    w^x = w^x + εx^α, w^y = w^y + εy^α, w^h = w^h + εh, w^u = w^u + εu
  %% negative phase
  set the value of n randomly (0 < n < 1)
  if n > 0.5
    sample x̂ from p(x | y^α, h, u, l); A^xF = x̂ · W^xF, A^xM = x̂ · W^xM
    sample ŷ from p(y | x̂, h, u, l); A^yF = ŷ · W^yF, A^yM = ŷ · W^yM
  else
    sample ŷ from p(y | x^α, h, u, l); A^yF = ŷ · W^yF, A^yM = ŷ · W^yM
    sample x̂ from p(x | ŷ, h, u, l); A^xF = x̂ · W^xF, A^xM = x̂ · W^xM
  end
  set h = p(h | x̂, ŷ, u, l); set u = p(u | x̂, ŷ, h, l)
  compute activations from the hidden and label units: A^h = h · W^h, A^u = u · W^u, A^l = l^α · W^l
  compute the negative updates: W^{xF−}, W^{yF−}, W^{h−}, W^{l−}, W^{xM−}, W^{yM−}, W^{u−}
  apply the negative updates:
    W^h = W^h − εW^{h−}, W^xF = W^xF − εW^{xF−}, W^yF = W^yF − εW^{yF−},
    W^l = W^l − εW^{l−}, W^xM = W^xM − εW^{xM−}, W^yM = W^yM − εW^{yM−}, W^u = W^u − εW^{u−},
    w^x = w^x − εx̂, w^y = w^y − εŷ, w^h = w^h − εh, w^u = w^u − εu
  normalize the weights
end
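As a worked illustration of one training iteration, the sketch below (our own, under the same shape assumptions as the helper functions given after Eq. (15), and reusing hidden_probs, mean_x and mean_y from that sketch) updates only one factor matrix and the visible biases; the remaining matrices follow the same positive-minus-negative pattern of Eqs. (17)–(24):

import numpy as np

rng = np.random.default_rng(0)

def cd_step(x, y, l, W, b, eps=1e-3):
    """One Quadripartite CD update for a single training pair (x, y) with labels l."""
    # positive phase: hidden sets driven by the data pair
    ph, pu = hidden_probs(x, y, l, W, b)
    h = (rng.random(ph.size) < ph).astype(float)
    u = (rng.random(pu.size) < pu).astype(float)
    pos_xF = np.outer(x, (y @ W['yF']) * (h @ W['h']) * (l @ W['l']))  # data term of Eq. (17)

    # negative phase: reconstruct the visible sets in a random order, then re-infer the hiddens
    if rng.random() > 0.5:
        x_neg = mean_x(y, h, u, l, W, b) + rng.standard_normal(x.size)
        y_neg = mean_y(x_neg, h, u, l, W, b) + rng.standard_normal(y.size)
    else:
        y_neg = mean_y(x, h, u, l, W, b) + rng.standard_normal(y.size)
        x_neg = mean_x(y_neg, h, u, l, W, b) + rng.standard_normal(x.size)
    ph_neg, _ = hidden_probs(x_neg, y_neg, l, W, b)
    neg_xF = np.outer(x_neg, (y_neg @ W['yF']) * (ph_neg @ W['h']) * (l @ W['l']))  # model term

    # gradient step on W^{xF} (Eq. (17)) and on the visible biases (Eq. (24))
    W['xF'] += eps * (pos_xF - neg_xF)
    W['xF'] /= np.linalg.norm(W['xF'], axis=0, keepdims=True)  # column normalization for stability
    b['x'] += eps * (x - x_neg)
    b['y'] += eps * (y - y_neg)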

During the early stages of training, each sub-model competes with the other to define the probability distribution between image pairs. After a number of iterations, sub-model F starts to define a joint distribution obeying the changes introduced by the label units; the transformations associated with facial expressions thus slowly move to sub-model F. In contrast, sub-model M is not conditioned on the label units, and therefore the other transformations (especially those which are not compatible with the labels) slowly settle in sub-model M. This process of disentangling the transformations also increases the capacity to capture more complex non-linear transformations. Fig. 6(c) shows the features (transformations) encoded by the hidden set h from the input images x and y (Fig. 6(a) and (b)). Each row of blocks in Fig. 6(c) represents the encoded features of the corresponding subjects (Fig. 6(a) and (b)) having the same expression. It is clear from Fig. 6(c) that the encoded features of the same expression, even across different subjects, are relatively similar, while they vary across different expressions.

Fig. 6. (a) Sample input images to units x of the model. (b) Examples of input images to units y of the model. (c) An illustration of the features encoded by hidden set h.

3.2. Discriminating facial expressions

After training the model, we investigated two ways to discriminate between facial expressions: using either the reconstruction error or the log probability that the model assigns to the test images. To explain both options, let us first define a testing image pair as (x, y) and the labels (classes) as l ∈ {l_1, l_2, ..., l_n}, where n denotes the number of classes. Since our model is conditional on the label units with a limited number of classes (n < 8), it is easy to reconstruct x and y under the trained model for each specific label l (as shown in Fig. 7). This process produces n reconstructed image pairs (x', y')_{1:n}, which are used to define a distance vector d = \left( \sum (x - x'_l)^2 + \sum (y - y'_l)^2 \right)_{l=1}^{n} / 2. The expression label corresponding to the lowest distance is chosen as the expression class of the image pair (x, y). Alternatively, we can use, for each specific class l, the log probability of the joint distribution assigned to x and y:

\log p(x, y, l) = - \log Z - \frac{1}{2}\left(\sum_{i} (w^{x}_{i} - x_i)^2 + \sum_{j} (w^{y}_{j} - y_j)^2\right) + \sum_{z} \log\left(1 + \exp\left(w^{h}_{z} + \sum_{f} w^{h}_{zf} \left(\sum_{i} x_i w^{xF}_{if}\right) \left(\sum_{j} y_j w^{yF}_{jf}\right) \left(\sum_{g} l_g w^{l}_{gf}\right)\right)\right) + \sum_{k} \log\left(1 + \exp\left(w^{u}_{k} + \sum_{m} w^{u}_{km} \left(\sum_{i} x_i w^{xM}_{im}\right) \left(\sum_{j} y_j w^{yM}_{jm}\right)\right)\right)    (25)

Given that the log of the partition function (log Z) in Eq. (25) is the same for all image pairs, it can be ignored when comparing image pairs. The distance vector is then constructed as d = (\log p(x, y, l))_{l=1}^{n}. Similar to the reconstruction error, the expression class is chosen based on the lowest distance.

Fig. 7. Discriminating the expression of a new image pair under the trained model.
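The reconstruction-error variant can be sketched as follows (our own illustration, reusing the hidden_probs, mean_x and mean_y helpers given after Eq. (15); label_codes and the n-to-q real-valued label mapping are assumptions):

import numpy as np

def classify_pair(x, y, label_codes, W, b):
    """Pick the expression whose label code yields the smallest reconstruction error.
    label_codes[c] is the real-valued code of class c."""
    d = np.empty(len(label_codes))
    for c, l in enumerate(label_codes):
        ph, pu = hidden_probs(x, y, l, W, b)    # infer morphlet activations for this class
        x_rec = mean_x(y, ph, pu, l, W, b)      # reconstruct each image under class c
        y_rec = mean_y(x, ph, pu, l, W, b)
        d[c] = 0.5 * (np.sum((x - x_rec) ** 2) + np.sum((y - y_rec) ** 2))
    return int(np.argmin(d))                    # expression class with the lowest distance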

4. Experiments

4.1. Experimental setup

We assessed the performance of the proposed model using different numbers of hidden units (in each set h and u), real-valued labels and factors (for each sub-model), considering both uniform configurations (all sets and sub-models have the same number of units/factors) and non-uniform configurations (each set/sub-model has a different number of units/factors). We found that the proposed model exhibits better performance when the numbers of hidden units (h and u) and factors (F and M) are similar, while the number of real-valued labels is comparable to the number of hidden units. The best performance was obtained with 250 units in each hidden set (h and u), 150 real-valued labels and 250 factors for each sub-model. It is worth noting that expanding the label units (to be comparable to the hidden units) increases their contribution during the learning process. The performance could be further improved by increasing the number of hidden units or factors; however, the learning process would then become increasingly complicated, with a larger amount of data required for training. A learning rate of 10^{-3} was used when the average updates were not relatively high (less than a threshold value τ), and it was reduced by a factor of 2 when the average updates exceeded that threshold. The momentum was set to 0.5 for the first 50 epochs and 0.9 for the rest, while the weight decay was set to 2 × 10^{-4}.

To demonstrate the effectiveness of our model, extensive experiments were carried out on the CK+ [19] and MMI [20] datasets, which have been widely used to evaluate most FER systems. In the pre-processing stage, facial regions were detected in each image and cropped automatically using the Zhu and Ramanan detection algorithm [29]. Cropped facial regions were then resized to 64 × 64. No additional normalizations, such as pose and scale alignment, were applied afterwards.
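A minimal sketch of the training schedule described above (our own illustration; the threshold τ and the epoch counter are assumptions about how the stated rules would be wired together):

def training_schedule(epoch, avg_update, tau):
    """Learning rate, momentum and weight decay as described in Section 4.1."""
    lr = 1e-3 if avg_update < tau else 0.5e-3   # halve the rate when updates get large
    momentum = 0.5 if epoch < 50 else 0.9       # 0.5 for the first 50 epochs, 0.9 afterwards
    weight_decay = 2e-4
    return lr, momentum, weight_decay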

Fig. 8. Samples from the CK+ dataset. Each row represents a different subject. The first column shows neutral expressions, and the remaining columns show apex expressions.

4.2. Experiments on the CK+ dataset

The CK+ dataset [19] contains 593 sequences from 123 subjects of both sexes; it is an extension of the CK dataset [30]. Each image sequence represents one facial expression and varies in duration from 10 to 60 frames, starting from the onset (neutral face) and ending at the apex. Only 327 sequences are labeled, each with one of seven expressions (Anger, Contempt, Disgust, Fear, Happiness, Sadness and Surprise) based on the Facial Action Coding System (FACS). In our experiments, we constructed three image pairs from each sequence: the first image in each pair is the onset (neutral) image and the second image is a different apex image. During the validation phase, each image pair votes for one of the seven expressions and, since each sequence is represented by three image pairs, the expression class with the highest number of votes is taken as the predicted expression of the sequence. The dataset was divided into 10 sub-sets based on subject identity, so that subjects from any two different sub-sets cannot overlap. For each of the 10 runs, one of the sub-sets was selected for testing while the remaining sets were used for training, yielding a 10-fold cross-validation. Fig. 8 shows some samples from the CK+ dataset with neutral and apex expressions.

The performance of the model was evaluated using both the reconstruction error and the log probability (as explained in Section 3.2). Since the input units x represent neutral faces, it is not useful to include these units when computing the reconstruction error; we therefore limited the reconstruction error to the input units y only. Our experiments show that the performance of the two methods is very close; however, the classification method based on the reconstruction error performs 5–10% better than the method based on the log probability. Fig. 9 shows the recognition rate of both methods as a function of the number of epochs. Our model achieved a 95.66% average recognition rate (averaged over 10 folds/runs) for the whole CK+ dataset with 7 expressions. The average recognition rate of each class is depicted in Fig. 10(a). Fear, followed by Sadness and Anger, reported the lowest recognition rates, while Happiness, Disgust and Contempt reported the highest recognition rates.
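The per-sequence voting described above can be sketched as follows (our own illustration; classify_pair is the reconstruction-error classifier sketched at the end of Section 3.2):

import numpy as np

def classify_sequence(pairs, label_codes, W, b):
    """Majority vote over the three (onset, apex) image pairs of one sequence."""
    votes = [classify_pair(x, y, label_codes, W, b) for (x, y) in pairs]
    return int(np.bincount(votes).argmax())     # expression with the highest vote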

Fig. 9. Average recognition rate using the reconstruction error and the log probability as a function of the number of epochs.

4.3. Experiments on the MMI dataset

To include more challenging cases, we evaluated the performance of our model on the MMI dataset [20], which contains 30 subjects of both sexes, with ages ranging from 19 to 62 and different ethnic backgrounds (European, Asian and South American). In this dataset, 213 sequences are labeled with the six basic expressions, of which 205 sequences were captured in frontal view. Each sequence represents a single facial expression, starting with an onset (neutral face), followed by an apex and ending with an offset. As for the CK+ dataset, we constructed three image pairs from each sequence, where the first image in each pair is the onset image and the second image is an apex image. The MMI dataset is more challenging than the CK+ dataset because the subjects pose the facial expressions non-uniformly; several subjects pose the same expression in different ways. Furthermore, part of the dataset involves subjects with accessories such as glasses, headcloths and moustaches. The average recognition rate (over 10 folds/runs) of our model on the MMI dataset with 6 expressions is 81.63%. Fig. 10(b) reports the average recognition rate for each class. The recognition rate is lower here than for CK+, but it is still impressive given the challenging nature of the MMI dataset.

4.4. Experiments on the AFEW dataset

To evaluate the robustness of our model under even more challenging conditions and real scenarios, we applied it to the Acted Facial Expressions in the Wild (AFEW) dataset [31].


Fig. 10. Recognition rate for each expression of the (a) CK þ dataset and (b) MMI dataset.

Table 1
Performance comparison on the CK+, MMI and AFEW datasets.

Methods     Validation setting   CK+       MMI       AFEW
[35]        15-Fold              86.3%     59.7%     –
[36]        LOSO                 89.4%     –         –
[34]        10-Fold              89.89%    73.53%    –
[2]         10-Fold              94.19%    74.59%    31.73%
[33]        10-Fold              82.26%    54.15%    19.45%
[4]         10-Fold              88.9%     59.51%    25.27%
[3]         5-Fold               92.3%     –         –
Our model   10-Fold              95.66%    81.63%    46.36%

AFEW is a collection of short videos extracted from movies captured in real-world environments. The dataset includes 1426 video sequences from 330 subjects of both genders, with ages ranging from 1 to 70 years. Following the protocol proposed in the Emotion Recognition in the Wild Challenge (EmotiW) [32], the dataset was divided into three main sets (training, validation and testing), with each set including 7 different facial expressions (Angry, Disgust, Fear, Happy, Neutral, Sad and Surprise). Given that the ground truth of the testing set is not available, the evaluation was carried out on the validation set, as done in [2]. The average recognition rate of our model on the AFEW dataset was found to be 46.36%. Note that AFEW is one of the most challenging datasets, as it includes facial expressions under real-world conditions.

4.5. Performance comparison

We first compared our proposed model with state-of-the-art approaches on the CK+ dataset, as shown in Table 1. State-of-the-art approaches were classified into two categories depending on whether they are based on feature learning or on hand-engineered features. Table 1 shows that our model boasts the highest recognition rate (95.66%). The results reported for Wang et al. [33], Zhao and Pietikainen [4], Zhong et al. [34] and Liu et al. [2] were obtained after applying our own protocol (10-fold cross-subject validation). Our model achieved the highest performance, followed by Liu et al. [2] (94.19%), whose model is composed of three different feature-level components, namely a spatio-temporal manifold (STM), a Universal Manifold Model (UMM) and an Expressionlet Modeling component; these components combine to extract mid-level features of facial expressions. In contrast, our model can extract "good" features and achieve the highest performance with only a single-component model structure (an RBM-based model with a single hidden layer). Other works applied different protocols: Sanin et al. [3] performed a 5-fold cross-validation, Chew et al. [36] performed a Leave One Subject Out (LOSO) cross-validation, and Wang et al. [35] adopted a 15-fold cross-validation.

In Table 1, we also compare our proposed model against state-of-the-art approaches on the CK+, MMI and AFEW datasets. For the MMI dataset, our model boasts the highest recognition rate among all approaches, improving the best reported result [2] by 7.04%. Furthermore, our model exhibits the lowest performance drop when

moving from the CK+ dataset to the MMI dataset. This indicates that our model is not only very accurate but also robust. For the AFEW dataset, the performance of all approaches degrades significantly, as it introduces major challenges compared to the CK+ and MMI datasets. However, our model still achieves the highest recognition rate among all the compared approaches, with a significant improvement of 14.63% over the best reported result (31.73% by [2]).

It is important to note that our model defines the joint distribution between image pairs, which usually requires more training time than other standard learning methods (e.g. the RBM). However, unlike the testing phase, the training phase is usually performed off-line. We evaluated the time complexity of our model on randomly selected subsets with 100 training image pairs and 100 testing image pairs covering seven expressions; this evaluation was performed on 50 different subsets selected from the AFEW dataset. The proposed model took an average of 0.0248 s to complete one training epoch, and an average of 0.0057 s to classify the expression of each testing image pair. The evaluation was carried out on a standard desktop with an Intel Core i7 3.40 GHz processor, 24.0 GB of RAM and a standard Matlab implementation.

5. Conclusion

In this paper, we introduced a novel spatio-temporal model capable of capturing the different transformations between image pairs while simultaneously disentangling the transformations associated with facial expressions from all other transformations. Unlike other models, our model encodes the relationship between image pairs through two independent hidden sets. The first hidden set was designed for facial expression transformations (FE morphlet units), while the second set was designed for other transformations (non-FE morphlet units) such as pose and occlusion. These sets were learned through two independent sub-models to facilitate the learning of different features. To ensure that the facial expression morphlet units learn expression transformations, these units were styled (guided) by conditional label units. Furthermore, since our model exhibits a quadripartite structure, we proposed an extension of the standard Contrastive Divergence, namely Quadripartite Contrastive Divergence, to efficiently learn our model. Previously reported models usually rely on a deep, complex network structure to extract good features, and the complexity of their parameter settings is very high, with parameters selected through cross-validation. In contrast, our model relies on a single hidden layer and is capable of learning the transformations associated with facial expressions while isolating them from all other transformations. The experimental results on the CK+, MMI and AFEW datasets show that the proposed model outperforms the most recent state-of-the-art models. Future avenues of work include applying our model to learn human motion style, investigating its ability to learn additional transformations such as facial pose (in addition to facial expression), and learning transformations and content (structure) simultaneously (e.g. discriminating the identity and the expression simultaneously).

Conflict of interest

None declared.


References

[1] C. Lisetti, D. Schiano, Automatic facial expression interpretation: where human computer interaction, artificial intelligence and cognitive science intersect, Pragmat. Cognit. 8 (1) (2000) 185–235.
[2] M. Liu, S. Shan, R. Wang, X. Chen, Learning expressionlets on spatio-temporal manifold for dynamic facial expression recognition, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[3] A. Sanin, C. Sanderson, M. Harandi, B. Lovell, Spatio-temporal covariance descriptors for action and gesture recognition, in: IEEE Workshop on Applications of Computer Vision (WACV), 2013, pp. 103–110.
[4] G. Zhao, M. Pietikainen, Dynamic texture recognition using local binary patterns with an application to facial expressions, IEEE Trans. Pattern Anal. Mach. Intell. 29 (6) (2007) 915–928.
[5] T. Wu, M. Bartlett, J.R. Movellan, Facial expression recognition using Gabor motion energy filters, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2010, pp. 42–47.
[6] A. Dhall, A. Asthana, R. Goecke, T. Gedeon, Emotion recognition using PHOG and LPQ features, in: IEEE International Conference on Automatic Face Gesture Recognition and Workshops, 2011, pp. 878–883.
[7] A. Lorincz, L. Jeni, Z. Szabo, J. Cohn, T. Kanade, Emotional expression classification using time-series kernels, in: IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2013, pp. 889–895.
[8] M. Ranzato, J. Susskind, V. Mnih, G. Hinton, On deep generative models with applications to recognition, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 2857–2864.
[9] G.W. Taylor, G.E. Hinton, S.T. Roweis, Modeling human motion using binary latent variables, in: Advances in Neural Information Processing Systems, 2006, pp. 1345–1352.
[10] G. Taylor, R. Fergus, Y. LeCun, C. Bregler, Convolutional learning of spatio-temporal features, in: K. Daniilidis, P. Maragos, N. Paragios (Eds.), Computer Vision, ECCV 2010, Lecture Notes in Computer Science, vol. 6316, Springer Berlin Heidelberg, Heraklion, Crete, Greece, 2010, pp. 140–153.
[11] M.D. Zeiler, G.W. Taylor, L. Sigal, I. Matthews, R. Fergus, Facial expression transfer with input–output temporal restricted Boltzmann machines, in: J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, K. Weinberger (Eds.), Advances in Neural Information Processing Systems, vol. 24, Curran Associates, Inc., Granada, Spain, 2011, pp. 1629–1637.
[12] J. Susskind, R. Memisevic, G. Hinton, M. Pollefeys, Modeling the joint density of two images under a variety of transformations, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 2793–2800.
[13] S. Rifai, Y. Bengio, A. Courville, P. Vincent, M. Mirza, Disentangling factors of variation for facial expression recognition, in: A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, C. Schmid (Eds.), Computer Vision, ECCV, Lecture Notes in Computer Science, Springer Berlin Heidelberg, Florence, Italy, 2012, pp. 808–822.
[14] R. Memisevic, G.E. Hinton, Learning to represent spatial transformations with factored higher-order Boltzmann machines, Neural Comput. 22 (6) (2010) 1473–1492.
[15] R. Memisevic, G. Hinton, Unsupervised learning of image transformations, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007, pp. 1–8.
[16] M. Ranzato, A. Krizhevsky, G.E. Hinton, Factored 3-way restricted Boltzmann machines for modeling natural images, J. Mach. Learn. Res.—Proc. Track 9 (2010) 621–628.
[17] G.W. Taylor, G.E. Hinton, Factored conditional restricted Boltzmann machines for modeling motion style, in: International Conference on Machine Learning, 2009, pp. 1025–1032.
161

[18] R. Memisevic, Learning to relate images, IEEE Trans. Pattern Anal. Mach. Intell. 35 (8) (2013) 1829–1846. [19] P. Lucey, J. Cohn, T. Kanade, J. Saragih, Z. Ambadar, I. Matthews, The extended Cohn–Kanade dataset (ckþ ): a complete dataset for action unit and emotionspecified expression, in: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2010, pp. 94–101. [20] M.F. Valstar, M. Pantic, Induced disgust, happiness and surprise: an addition to the mmi facial expression database, in: Proceedings of International Conference on Language Resources and Evaluation, Workshop on EMOTION, Malta, 2010, pp. 65–70. [21] P. Liu, S. Han, Y. Tong, Improving facial expression analysis using histograms of log-transformed nonnegative sparse representation with a spatial pyramid structure, in: IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), 2013, pp. 1–7. [22] G. Hinton, S. Osindero, Y. Teh, A fast learning algorithm for deep belief nets, Neural Comput. 18 (7) (2006) 1527–1554. [23] A. Coates, A.Y. Ng, H. Lee, An analysis of single-layer networks in unsupervised feature learning, in: International Conference on Artificial Intelligence and Statistics, 2011, pp. 215–223. [24] R. Memisevic, Non-linear latent factor models for revealing structure in highdimensional data (Ph.D. thesis), University of Toronto, 2008. [25] R. Memisevic, Gradient-based learning of higher-order image features, in: IEEE International Conference on Computer Vision (ICCV), 2011, pp. 1591–1598. [26] G.E. Hinton, Training products of experts by minimizing contrastive divergence, Neural Comput. 14 (8) (2002) 1771–1800. [27] P. Liu, J. Zhou, I.-H. Tsang, Z. Meng, S. Han, Y. Tong, Feature disentangling machine—a novel approach of feature selection and disentangling in facial expression analysis, in: D. Fleet, T. Pajdla, B. Schiele, T. Tuytelaars (Eds.), Computer Vision, ECCV, Lecture Notes in Computer Science, 2014, pp. 151–166. [28] M. Ranzato, G. Hinton, Modeling pixel means and covariances using factorized third-order Boltzmann machines, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 2551–2558. [29] X. Zhu, D. Ramanan, Face detection, pose estimation, and landmark localization in the wild, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2879–2886. [30] T. Kanade, J. Cohn, Y. Tian, Comprehensive database for facial expression analysis, in: IEEE International Conference on Automatic Face and Gesture Recognition, 2000, pp. 46–53. [31] A. Dhall, R. Goecke, S. Lucey, T. Gedeon, Collecting large, richly annotated facial-expression databases from movies, MultiMed. IEEE 19 (3) (2012) 34–41. [32] A. Dhall, R. Goecke, J. Joshi, M. Wagner, T. Gedeon, Emotion recognition in the wild challenge 2013, in: Proceedings of the 15th ACM on International Conference on Multimodal Interaction, ACM, Sydney, Australia, 2013, pp. 509–516. [33] L. Wang, Y. Qiao, X. Tang, Motionlets: Mid-level 3d parts for human motion recognition, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 2674–2681. [34] L. Zhong, Q. Liu, P. Yang, B. Liu, J. Huang, D. Metaxas, Learning active facial patches for expression analysis, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2562–2569. [35] Z. Wang, S. Wang, Q. 
Ji, Capturing complex spatio-temporal relations among facial muscles for facial expression recognition, in: 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 3422–3429. [36] S. Chew, S. Lucey, P. Lucey, S. Sridharan, J. Conn, Improved facial expression recognition via uni-hyperplane classification, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2554–2561.

Said Elaiwat received the B.Sc. degree in computer science from AL-Zaytoonah Private University of Jordan, Amman, Jordan in 2004 and the M.Sc. degree in computer science from Al-Balqa Applied University, Amman, Jordan, in 2007. He is currently a Ph.D. student in the School of Computer Science and Software Engineering, The University of Western Australia, Crawley, Australia. His current research interests include object/face recognition, image processing, computer vision and machine learning.

Mohammed Bennamoun received the M.Sc. degree in control theory from Queen's University, Kingston, ON, Canada, and the Ph.D. degree in computer vision from Queen's/QUT, Brisbane, Australia. He was a Lecturer in robotics with Queen's University and joined QUT in 1993 as an Associate Lecturer. He is currently a Winthrop Professor. He served as the Head of the School of Computer Science and Software Engineering, The University of Western Australia, Crawley, Australia, from 2007 to 2012, and as the Director of the University Centre at QUT: The Space Centre for Satellite Navigation from 1998 to 2002. He was an Erasmus Mundus Scholar and a Visiting Professor with the University of Edinburgh, Edinburgh, U.K., in 2006. He was a Visiting Professor with the Centre National de la Recherche Scientifique and Telecom Lille1, France, in 2009, Helsinki University of Technology, Helsinki, Finland, in 2006, and the University of Bourgogne and Paris 13, Paris, France, from 2002 to 2003. He is the co-author of Object Recognition: Fundamentals and Case Studies (Springer-Verlag, 2001) and the co-author of an edited book on Ontology Learning and Knowledge Discovery Using the Web in 2011. He has published over 250 journal and conference publications and secured highly competitive national grants from the Australian Research Council (ARC). Some of these grants were in collaboration with industry partners (through the ARC Linkage Project scheme) to solve real research problems for industry, including Swimming Australia, the West Australian Institute of Sport, a textile company (Beaulieu Pacific), and AAM-GeoScan. He has worked on research problems and collaborated (through joint publications, grants, and supervision of Ph.D. students) with researchers from different disciplines, including animal biology, speech processing, biomechanics, ophthalmology, dentistry, linguistics, robotics, photogrammetry, and radiology. He received the Best Supervisor of the Year Award from QUT and an award for research supervision from UWA in 2008. He has served as a Guest Editor for a couple of special issues in international journals, such as the International Journal of Pattern Recognition and Artificial Intelligence. He was selected to give conference tutorials at the European Conference on Computer Vision and the International Conference on Acoustics, Speech and Signal Processing. He has organized several special sessions for conferences, including a special session for the IEEE International Conference on Image Processing, has served on the program committees of many international conferences, and has contributed to the organization of many local and international conferences. His current research interests include control theory, robotics, obstacle avoidance, object recognition, artificial neural networks, signal/image processing, and computer vision (particularly 3D).

Farid Boussaid (M'00–SM'04) received the M.S. and Ph.D. degrees in microelectronics from the National Institute of Applied Science (INSA), Toulouse, France, in 1996 and 1999, respectively. He joined Edith Cowan University, Perth, Australia, as a Postdoctoral Research Fellow and a member of the Visual Information Processing Research Group in 2000. He joined the University of Western Australia, Crawley, Australia, in 2005, where he is currently an Associate Professor. His current research interests include smart CMOS vision sensors, image processing, gas sensors, neuromorphic systems, and device simulation, modeling and characterization in deep submicron CMOS processes.