Facial expression recognition via region-based convolutional fusion network


Yingsheng Ye (a), Xingming Zhang (a), Yubei Lin (b,*), Haoxiang Wang (a)

a B3 Building, School of Computer Science and Engineering, South China University of Technology, Guangzhou Higher Education Mega Center, Panyu District 510006, Guangzhou, PR China
b B7 Building, School of Software Engineering, South China University of Technology, Guangzhou Higher Education Mega Center, Panyu District 510006, Guangzhou, PR China


Article history: Received 12 December 2018; Revised 12 April 2019; Accepted 15 April 2019; Available online 18 April 2019. Keywords: Facial expression recognition; Emotion recognition; Convolutional neural network

Abstract

One of the key challenges of deep-learning-based facial expression recognition (FER) is learning effective and robust features from variant samples. In this paper, a Region-based Convolutional Fusion Network (RCFN) is proposed to address this challenge in three ways. Firstly, a muscle movement model is built to segment out crucial regions of the frontal face, providing well-unified patches that remove unrepresentative regions and greatly reduce the interference caused by facial organs with varied sizes and positions among individuals. Secondly, a fast and practical network is constructed to extract robust triple-level features, from low level to semantic level, in each crucial region and fuse them for FER. Thirdly, a constrained punitive loss is introduced to leverage the network training and boost FER performance. The experimental results show that RCFN is effective on commonly used datasets such as KDEF, CK+, and Oulu-CASIA, and achieves performance comparable to other state-of-the-art FER methods.

1. Introduction

In recent years Facial Expression Recognition (FER) has become a research hotspot in computer vision. It plays an active role in many applications such as interactive systems and games [1], health care [2], surveillance [3], and online social relations [4]. Many achievements in FER based on visual information have been made, as summarized in [5,6], but challenges and issues remain. The differences between facial expression classes lie in crucial regions such as the eyes, forehead, and mouth, which appear alongside unrepresentative regions (such as the cheeks, hair, nose, ears, and background) and facial organs with varied sizes and positions across samples. These factors can greatly increase the intra-class difference, especially the facial organs with varied sizes and positions. Cropped faces, which might include some background pixels around the edge regions [7-9] or nearly no background pixels [10,11], are widely adopted in most existing FER approaches to remove most of the useless regions outside the face. However, cropped faces cannot reduce the unrepresentative


* Corresponding author.
E-mail addresses: [email protected] (Y. Ye), [email protected] (X. Zhang), [email protected] (Y. Lin), [email protected] (H. Wang).
https://doi.org/10.1016/j.jvcir.2019.04.009

regions within the face and cannot simultaneously unify all facial organs in each face. Besides cropped faces, extracting geometric features or Action Units (AUs) is an ideal way to deal with unrepresentative regions. However, geometric features and AUs are easily affected by facial organs with varied sizes and positions across individuals. From this point of view, unrepresentative regions and facial organs with varied sizes and positions have become a limitation for most existing FER approaches, since these approaches either cannot greatly reduce the interference brought by this limitation from the bottom up in their feature models, or have to build a large feature set for feature selection to enhance the generalization of their feature models. In this paper, we design a Region-based Convolutional Fusion Network (RCFN) for FER, aiming to learn robust visual features and greatly reduce the interference caused by this limitation from the bottom up through well-defined crucial regions. The major contributions of the proposed RCFN are as follows:

• A muscle movement model is built to help extract three well-defined crucial regions from frontal facial images, which greatly reduces unrepresentative regions for FER and eliminates interference caused by facial organs with varied sizes and positions.


• A practical and fast neural network is proposed to extract robust triple-level features, from low level to semantic level, in each crucial region and combine them for FER, coping with small-scale datasets without data augmentation.
• A constrained punitive loss is proposed for network optimization; it automatically adjusts the loss according to the network output, which boosts FER performance.

The rest of the paper is organized as follows. Section 2 reviews related work on vision-based FER approaches. Section 3 describes RCFN in detail. Section 4 presents the experimental results and comparisons with other methods. Section 5 concludes the paper.

Fig. 1. Exception for the categories of [20]. (a) Categories from [20]. (b) An exception expression of sad, which should be categorized as Lips-Eyes-Forehead-Based instead of Lips-Based.

2. Related work

Though FER and face recognition (FR) are two different problems, many techniques are mutually applicable due to their closely related problem-solving paradigms and shared facial input. In FR, many remarkable accomplishments have been made. In contrast to the strong prior knowledge required by most existing hand-crafted binary face descriptors, [12] proposed an unsupervised compact binary face descriptor learning method, and [13] proposed a simultaneous local binary feature learning and encoding method to jointly learn binary codes and a codebook in a one-stage procedure. Ref. [14] categorized patches into a rotational binary pattern and proposed a rotation-invariant co-occurrence method to exploit high-order statistical information. Different from the above individually learned binary feature codes, [15] proposed a context-aware local binary feature learning method to exploit the contextual information of adjacent bits. Compared with binary descriptors, deep network architectures have shown astonishing performance in FR with large numbers of network weights and tremendous amounts of training samples, such as FaceNet [16] and VGGFace [17]. Whereas FR focuses on identity-level features, FER requires expression-level features and ignores identity information in the final classification. Setting aside acoustic information, we only discuss visual-information-based FER approaches, which can be grouped into image-based approaches and sequence-based (or video-based) approaches.

• Image-based FER approaches. Image-based FER approaches commonly use cropped faces to extract local features, such as Histogram of Oriented Gradients (HOG) [18], pyramid HOG [19], Stepwise Linear Discriminant Analysis (SWLDA) [20], and Local Phase Quantization (LPQ) [21]. Torre et al. [18] used HOG features located by facial landmarks and a selective transfer machine to learn a generic maximal-margin classifier. Sun and Wen [19] embedded enhanced relevance feedback in dimensionality reduction and employed an enhanced cognitive gravitation model for density information compensation. Siddiqi et al. [20] designed a hierarchical recognition strategy based on SWLDA and Hidden Conditional Random Fields (HCRFs), with expression category recognition first and then expression recognition; however, the exclusive categories were not well compatible with variant individuals, as shown in Fig. 1. Turan et al. [21] employed soft locality preserving projection for LPQ dimensionality reduction to control the spread levels of different classes. Differently, Alphonse et al. [22] proposed a novel Monogenic Directional Pattern (MDP) and embedded a pseudo-Voigt kernel in both generalized discriminant analysis and an extreme learning machine. Moeini et al. [23] proposed Dual Dictionary Learning (DDL) based on feature sets extracted from nine sub-images, and employed collaborative representation classification (CRC). Besides the above hand-crafted

features, Yuan and Mao [24] utilized exponential elastic preserving projections to exploit the manifold structure of the data to alleviate intra-class differences. In addition, many other methods chose deep learning to learn effective recognition models. Sun et al. [25] embedded an attention mapping between the convolution layer and the fully connected layer in end-to-end supervised learning. Liu et al. [26] utilized a Boosted Deep Belief Network (BDBN) to unify feature learning, feature selection, and classifier construction in a loopy framework over a set of overlapping image patches from each training sample, and learned the expression classifier in a statistical way. Both Lopes's [10] and Xie's [27] approaches employed well-preprocessed cropped faces and data augmentation to train their small convolutional neural networks, coping with few data and the training sample order. However, their networks suffered greater impediments when classifying expressions that share similar characteristics.

• Sequence-based FER approaches. Besides the image-based approaches, many researchers have shown interest in sequence-based (or video-based) FER approaches and have explored evolving information from expression sequences in quite different ways. Rodriguez et al. [8] combined a fine-tuned VGG-16 CNN with an LSTM to extract spatiotemporal variations from expression sequences. Liu et al. [28] proposed Spatio-temporal Manifold ExpressionLets (STM-ExpLet) to bridge the gap between low-level features and semantic-level features. Sikka et al. [29] proposed the Latent Ordinal Model (LOMo) to model an expression sequence by order-oriented costs of sub-events from the corresponding sequence, where sub-events are detected by a latent variable model and costs are learned with a regularized max-margin hinge loss minimization. Elaiwat et al. [30] proposed a spatio-temporal Restricted Boltzmann Machine (ST-RBM) based model to capture the facial-expression transformation between the onset frame and the apex frame, where the ST-RBM is learned through quadripartite contrastive divergence. Zhao et al. [9] also targeted the image pair of onset frame and apex frame, but utilized a Peak-Piloted Deep Network (PPDN) to extract expression evolving features with peak gradient suppression during training. Unlike the above methods, Abdallah et al. [31] extracted two-level pyramid uniform temporal Local Binary Pattern features from only the XT and YT orthogonal planes of 33 selected sub-regions, followed by PCA reduction before classification (named PCA[PTLBPu2]). In contrast to appearance features, Pu et al. [32] utilized a Lucas-Kanade optical flow tracker to estimate the facial landmarks of the onset frame and apex frame, and forwarded these landmarks into two random forests for AU prediction and FER respectively. Besides using either temporal geometric or appearance features, exploiting both of them is another trend in sequence-based FER. Both Jung's Deep Temporal Appearance Geometry Network (DTAGN) [33,34] and


Fig. 2. Different paradigms of cropped faces: (a) is from [9], (b) is from [24], and (c) is from [10].

Zhang's Part-based Hierarchical Recurrent Neural Network and Multi-Signal CNN (PHRNN-MSCNN) [35] used two separate networks for geometric and appearance features respectively before the final fusion strategies. Jung et al. employed xy-coordinates, taken relative to the nose landmark coordinate and normalized by their standard deviation, together with cropped faces and two different fusion strategies: weighted sum and joint fine-tuning. Zhang's PHRNN employed facial landmarks from four regions (eyebrows, eyes, nose, and mouth) and forwarded them into stacks of bidirectional recurrent neural networks; Zhang's MSCNN required two cropped faces in training to provide a recognition signal and a verification signal, but only one in testing. Lin et al. [36] employed two independent subspace analysis networks to learn spatiotemporal features (ST-ISA) and geometric features (G-ISA), and combined them as STG-ISA features for FER. Majumder et al. [37] designed an Automatic FER System (AFERS) that embeds geometric features, regional LBP, auto-encoders, and a Kohonen self-organizing map (SOM) classifier into a deep network, where the auto-encoders fuse the geometric features and LBP for the SOM classifier. Chen [38] employed HOG-TOP features along with geometric warp features to capture effective facial expression changes from expression sequences, and Yan [39] adopted Chen's features for triplet training and proposed collaborative discriminative multi-metric learning (CDMML) for FER.

Some existing FER approaches employ the whole face as input for feature extraction. For example, Fig. 2(a) [9] includes useless regions like hair and background around the corners. These regions not only increase the computation cost but also affect the robustness or generalization of the classification model in one way or another. In order to deal with these regions, cropped faces like Fig. 2(b) [24] and Fig. 2(c) [10] are widely adopted by most existing FER approaches. However, these kinds of cropped faces still contain some regions near the image center that are unrepresentative for FER, like the yellow and green regions, and cause another problem: the variance of those regions among individuals, especially the green regions, may burden feature learning and classification. To avoid this defect of cropped faces, our RCFN approach adaptively extracts crucial regions for feature learning.

3. Region-based convolutional fusion network

The proposed RCFN includes crucial region extraction and network prediction. Fig. 3 shows the network architecture of RCFN. The RCFN network takes three crucial regions, extracted adaptively from the frontal facial image, as inputs to the corresponding sub-networks, and then fuses the features extracted by the sub-networks for FER.

Different from the routinely segmented blocks used by the binary-code FR methods mentioned above [12-15], our RCFN segments crucial regions using a muscle movement model and facial landmarks, and yields well-unified patches for each region. The binary


codes focus on local vision individually but transfer abruptly to global vision in the codebook learning phase. In contrast to this abrupt transfer from local to global vision, our RCFN moves from local vision (the small kernel receptive field in the first layer) to global vision gradually, following the expansion of the kernel receptive field. Compared with FER methods that employ cropped faces, our RCFN learns features from three well-unified crucial regions in order to reduce unrepresentative regions and eliminate the interference caused by the non-unified facial organs of cropped faces from different individuals. Though the selected sub-regions in PCA[PTLBPu2] [31] come from the mouth, cheeks, eyes, and eyebrows, they are divided by fixed block sizes instead of adaptively for each individual. And different from the four regions (two eyes, nose, and mouth) of AFERS [37] and the nine aligned sub-images (two cheeks, nose, philtrum, mouth, eyes, two eyebrows, and chin) of DDL + CRCLBP [23], our RCFN only employs three regions (eyes, forehead, and mouth). Details of RCFN are described in the following subsections.

3.1. Crucial region extraction

To avoid the unrepresentative regions of the currently widely used cropped faces for FER, especially the nose region, we employ region extraction to select representative regions for our feature model. Different from common paradigms such as the redundant patches in [40] and the non-overlapping fixed blocks in [31], we choose well-defined crucial regions and normalize them to eliminate the interference brought by varied sizes and positions.

According to the Facial Action Coding System [41,42], facial expressions are expressed by a series of muscle movements. For example, around the eyes, angry in Fig. 4(a) shows muscles contracted towards the center, resulting in staring eyes, wrinkled eyebrows, and wrinkles between the eyes, while surprise in Fig. 4(b) shows muscles stretched upwards, resulting in wide-opened eyes, raised eyebrows, and smooth skin between the eyes. Around the mouth, angry shows muscles strongly contracted towards the center, resulting in a closed mouth with wrinkles, while surprise shows muscles stretched both up and down, resulting in an opened mouth without any wrinkle. Through the observation of facial expressions and the corresponding muscle movement styles across multiple datasets, we select the eye patch, forehead patch, and mouth patch as our crucial regions for their intra-class common characteristics and representativeness among individuals, and we construct the muscle movement model to describe the muscle movement styles of the crucial regions for the six basic expressions, shown in Table 1. Of course, not all facial expressions have totally different muscle movement styles from one another; some share similar styles, such as angry versus disgust and fear versus surprise. Due to the intra-class differences of expressions, different muscle movement styles might also occur for the same facial expression, which means the rows in Table 1 might have more than one muscle movement style.

We utilize the muscle movement model and facial landmarks [43] to design rules that automatically and adaptively segment out three crucial regions which not only preserve intra-class common characteristics and representativeness but also unify the crucial regions from variant samples. Algorithm 1 shows the details of crucial region extraction, where the segmentation rules are designed based on the muscle movement model and the facial landmarks. In Algorithm 1, r_f and r_e are the height/width ratios of the forehead patch and eye patch, with values of 1.75 and 0.40625 respectively; r_m is the width/height ratio of the mouth patch, with a value of 1.8; k_1 and k_2 are scale factors with empirical values of 0.83333 and 0.4 respectively. These constant values are used for crucial region unification.


Algorithm 1. RCFN for Frontal Facial Expression Recognition
Input: frontal facial image. Output: CR (crucial regions).
1: Estimate the landmarks of the frontal facial image, where (x_i, y_i) is the i-th landmark point, i ∈ {0, 1, 2, ..., 67};
2: Calculate the left eye center (x_l, y_l) from the 36th-41st landmark points and the right eye center (x_r, y_r) from the 42nd-47th landmark points; then calculate the horizontal angle θ between (x_l, y_l) and (x_r, y_r) as θ = arctan((y_r - y_l)/(x_r - x_l));
3: if θ ≠ 0 then
4:   Rotate the image by θ about the point (x_l, y_l);
5:   Re-estimate the landmarks of the frontal facial image;
6: end if
7: Calculate the coordinates y_fb = (y_28 + y_29)/2 and y_ft = y_fb - r_f (x_42 - x_39);
8: Use the top-left point (x_39, y_ft) and the bottom-right point (x_42, y_fb) to segment out the forehead patch and resize it to 28 × 49; append it to CR;
9: Calculate the eye patch width w_e = max(x_27 - (x_0 + x_17)/2, (x_16 + x_26)/2 - x_27) and height h_e = r_e · w_e;
10: Calculate the center point (x_ec, y_ec) of the 17th-26th and 36th-47th landmark points;
11: Use (x_ec, y_ec) as the eye patch center and segment out the eye patch with width w_e and height h_e; resize it to 64 × 26 and append it to CR;
12: Calculate the mouth patch height h_m = k_1 (y_8 - y_33) and width w_m = r_m · h_m;
13: Calculate the center (x_mc, y_mc) of the 48th-67th landmark points; then calculate the top-left point of the mouth patch (x_mt, y_mt) as x_mt = x_mc - w_m/2 and y_mt = y_mc - k_2 · h_m;
14: Use the bounding box (x_mt, y_mt, w_m, h_m) to segment out the mouth patch and resize it to 54 × 30; append it to CR;
15: return CR.
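For concreteness, the following Python sketch (ours, not the authors' code) implements the segmentation rules of Algorithm 1 with NumPy and OpenCV. The ratios and patch sizes are taken from the text above; the detect_landmarks helper, the rounding, and the rotation details are assumptions.

```python
# A minimal sketch of Algorithm 1. It assumes a user-supplied detect_landmarks(img)
# helper returning a (68, 2) float array of (x, y) points from a 68-point detector
# such as [43]; width/height orderings follow the ratios r_f, r_e, r_m of the text.
import numpy as np
import cv2

R_F, R_E, R_M = 1.75, 0.40625, 1.8   # forehead/eye height-width ratios, mouth width-height ratio
K1, K2 = 0.83333, 0.4                # empirical scale factors k_1, k_2

def extract_crucial_regions(img, detect_landmarks):
    lm = detect_landmarks(img)
    left_eye, right_eye = lm[36:42].mean(axis=0), lm[42:48].mean(axis=0)
    theta = np.degrees(np.arctan2(right_eye[1] - left_eye[1], right_eye[0] - left_eye[0]))
    if abs(theta) > 1e-3:                                      # steps 3-6: level the eyes
        M = cv2.getRotationMatrix2D((float(left_eye[0]), float(left_eye[1])), theta, 1.0)
        img = cv2.warpAffine(img, M, (img.shape[1], img.shape[0]))
        lm = detect_landmarks(img)

    # Steps 7-8: forehead patch, resized to 28 x 49 (width x height)
    y_fb = (lm[28, 1] + lm[29, 1]) / 2
    y_ft = y_fb - R_F * (lm[42, 0] - lm[39, 0])
    forehead = cv2.resize(img[int(y_ft):int(y_fb), int(lm[39, 0]):int(lm[42, 0])], (28, 49))

    # Steps 9-11: eye patch centered on the eyebrow/eye landmarks, resized to 64 x 26
    w_e = max(lm[27, 0] - (lm[0, 0] + lm[17, 0]) / 2, (lm[16, 0] + lm[26, 0]) / 2 - lm[27, 0])
    h_e = R_E * w_e
    ec = np.concatenate([lm[17:27], lm[36:48]]).mean(axis=0)
    eye = cv2.resize(img[int(ec[1] - h_e / 2):int(ec[1] + h_e / 2),
                         int(ec[0] - w_e / 2):int(ec[0] + w_e / 2)], (64, 26))

    # Steps 12-14: mouth patch, resized to 54 x 30
    h_m = K1 * (lm[8, 1] - lm[33, 1])
    w_m = R_M * h_m
    mc = lm[48:68].mean(axis=0)
    x_mt, y_mt = mc[0] - w_m / 2, mc[1] - K2 * h_m
    mouth = cv2.resize(img[int(y_mt):int(y_mt + h_m), int(x_mt):int(x_mt + w_m)], (54, 30))

    return forehead, eye, mouth                                # the CR list of Algorithm 1
```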

Fig. 5 shows automatically segmented forehead patches (green boxes), eye patches (blue boxes), and mouth patches (red boxes) from different people. For interference reduction, our crucial regions leave out the background, hair, parts of the cheeks, and especially the nose. From the noses in Fig. 5, we can tell that the nose region contains more variance than constancy within expression classes, and that its characteristics are not representative and are hard to unify effectively due to its varied sizes and shapes among individuals. Combining Table 1 and Fig. 5, angry and disgust share similar muscle movement styles in both the forehead and eye patches, having similar eyebrows and textures in the middle, but their mouth patches are very likely different; the same holds for happy versus surprise and fear versus sad. None of the crucial regions alone is sufficient for FER, but together they possess significant characteristics and representativeness for FER.

The proposed crucial regions have two major advantages for FER. First, the crucial regions provide the most important and representative regions at the beginning of feature extraction, greatly reducing the interference brought by unrepresentative regions, especially the nose region. Second, the crucial regions are well extracted and unified, eliminating the interference caused by facial organs with varied sizes and positions and providing fine-grained details for robust feature learning.

Fig. 4. Samples of muscle movements for (a) angry and (b) surprise. Yellow arrows represent the muscle movement directions. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 3. Network architecture of RCFN. The three crucial regions are forwarded through independent sub-networks, which share the same structure.


Table 1. Muscle movement model of crucial regions for the six basic expressions (angry, disgust, fear, happy, sad, and surprise). For each expression, the model assigns the forehead, eye, and mouth patches one of the following muscle movement styles: relaxed muscle; contracted muscles towards down; contracted muscles towards center; contracted muscles towards up and down; stretched muscles towards up and down; stretched muscles towards left and right; stretched muscles towards up.

3.2. RCFN network architecture design

The network architecture of RCFN is shown in Fig. 3. The RCFN network adopts three well-unified crucial regions (the forehead patch, eye patch, and mouth patch) simultaneously instead of a cropped face. To make full use of these inputs, the RCFN network employs an independent sub-network for each crucial region. Adapting ideas from VGG-Nets [44], we choose small convolutional kernels and employ similar combinations of convolution layers and pooling layers for the crucial regions, but we only employ six convolution layers (C1-C6) and three pooling layers (M1-M3) for each crucial region in order to better fit the sizes of the crucial regions. In these combinations, the kernel receptive field becomes four times larger after each pooling layer. The output sizes of the last pooling layer of the sub-networks are 4 × 7, 8 × 4, and 7 × 4 respectively, which are close to the 7 × 7 of VGG-Nets. These stacks of convolution layers model the information transformation from low-level features to semantic-level features.

Along with the stacks of convolution layers, we embed inception-style layers (F1 and F2) inspired by GoogLeNet [45]. Unlike GoogLeNet, our inception layers are derived directly from the pooling layers of each crucial region to extract features of different semantic levels. We then use a concatenation layer to combine the outputs from the sub-networks. Benefits of the concatenation layer include alleviation of gradient loss across the stacks of convolution layers and adjustment of the feedback gradient via the weights in the fully connected layers (F1-F3). After the concatenation layer, we place two stacks of fully connected layers (FC1 and FC2) and dropout layers (D1 and D2) [46] before the final classifier layer. The dropout layers are used to prevent over-fitting; the mechanism of a dropout unit is shown in Fig. 6.

Different from the multiple neural networks in the discriminative deep multi-metric learning (DDMML) method [47] and other deep models, the three independent sub-networks of RCFN are designed to learn the corresponding patterns inside each crucial region.

Fig. 5. Crucial region samples of six expressions. Due to the consent for publication of CK+, the first column has two individuals.

Fig. 6. Dropout layer mechanism. (a) During training, a unit is present with probability p and connects to the units in the next layer with weight w. (b) During testing, the unit is always present, with the weight changed to pw.


The sub-networks extract triple-level features, from low level to semantic level, from the corresponding crucial regions, and these triple-level features are used to exploit feature compensation for FER instead of a single top-level feature. Our RCFN focuses on feature compensation of visual features from three location-oriented crucial regions, while DDMML focuses on metric compensation of K distance metrics from the same image pair.
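The following Keras-style sketch (an illustrative reconstruction, not the authors' released implementation) wires up the three-branch fusion architecture of Fig. 3, following the layer listing in Table 2 below. The grayscale inputs, the softmax classifier, the reading of 'drop_0.8' as a keep probability, and the exact wiring of the F1-F3 taps into the 9216-dimensional concatenation are assumptions.

```python
# Illustrative Keras-style reconstruction of the RCFN fusion network (Fig. 3, Table 2).
# Filter counts and layer names follow Table 2; input channels, the classifier activation,
# the dropout interpretation, and the tap wiring are our assumptions.
from tensorflow.keras import layers, Model

def sub_network(patch):
    """One per-patch branch: C1-C6 conv stacks, M1-M3 poolings, and F1-F3 fc_tanh_1024 taps."""
    taps, x = [], patch
    for filters in (8, 32, 128):                                  # C1/C2, C3/C4, C5/C6
        x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
        x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
        x = layers.MaxPooling2D(2, strides=2, padding='same')(x)  # M1-M3 ('max_pool')
        taps.append(layers.Dense(1024, activation='tanh')(layers.Flatten()(x)))  # F1-F3
    return taps

def build_rcfn(num_classes=6):
    # Crucial regions as (height, width, channels); sizes follow Section 3.1.
    forehead = layers.Input((49, 28, 1))
    eye = layers.Input((26, 64, 1))
    mouth = layers.Input((30, 54, 1))
    feats = sub_network(forehead) + sub_network(eye) + sub_network(mouth)
    x = layers.Concatenate()(feats)               # CL: 9 taps x 1024 = 9216-d
    for _ in range(2):                            # FC1/D1 and FC2/D2
        x = layers.Dense(2048, activation='relu')(x)
        x = layers.Dropout(0.2)(x)                # 'drop_0.8' read as keep probability 0.8
    out = layers.Dense(num_classes, activation='softmax')(x)
    return Model([forehead, eye, mouth], out)

model = build_rcfn()
```

Because each call to sub_network builds its own layers, the three branches do not share weights, matching the independent sub-networks described above.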

• Network Configuration. Detailed settings of the RCFN network are listed in Table 2. Convolution layers use the ReLU activation function and 3 × 3 kernels with stride 1 and 'same' padding, denoted as 'conv_{filter size}'. Pooling layers are max pooling layers with a 2 × 2 kernel and stride 2, denoted as 'max_pool'. Fully connected layers are denoted as 'fc_{activation}_{filter size}', and dropout layers as 'drop_{ratio}'. In the classification layer, 'C' represents the number of facial expression classes. Compared with FaceNet [16] and VGGFace [17], our RCFN has fewer network weights and copes with small data scale without data augmentation or pre-training.

Table 2. RCFN network configuration. RCFN has three sub-networks for the forehead patch (F.P.), eye patch (E.P.), and mouth patch (M.P.). All the sub-networks share the same construction, consisting of the C1-F3 layers.

C1: conv_8; C2: conv_8; M1: max_pool; F1: fc_tanh_1024
C3: conv_32; C4: conv_32; M2: max_pool; F2: fc_tanh_1024
C5: conv_128; C6: conv_128; M3: max_pool; F3: fc_tanh_1024
CL: concat_9216
FC1: fc_relu_2048; D1: drop_0.8
FC2: fc_relu_2048; D2: drop_0.8
Classifier Layer: C

3.3. Constrained punitive loss

During training, we add one more softmax operation before the popular softmax cross entropy (SCE) to build a constrained punitive loss (CPL) for network optimization. The additional softmax operation maintains the element order of the network output and is therefore compatible with the arg-max classification protocol. With the label expressed as a one-hot vector and the network output vector denoted as z, the proposed CPL, marked as CE_{CPL}, is calculated by the following equations:

s_i = \frac{\exp(z_i)}{\sum_{j=1}^{C} \exp(z_j)}, \quad i \in \{1, 2, \ldots, C\}    (1)

CE_{CPL} = \ln \frac{\sum_{i=1}^{C} \exp(s_i)}{\exp(s_k)}    (2)

where C denotes the number of expression categories, and k is the ground-truth label index of the one-hot vector. The standard SCE, marked as CE_{SCE}, is calculated by:

CE_{SCE} = \ln \frac{\sum_{i=1}^{C} \exp(z_i)}{\exp(z_k)}    (3)

We can analyze how the network output vector z affects the difference between CE_{CPL} and CE_{SCE} by:

CE_{CPL} - CE_{SCE} = \ln \frac{\sum_{i=1}^{C} \exp\left(\frac{\exp(z_i)}{\sum_{j=1}^{C}\exp(z_j)} + z_k\right)}{\sum_{i=1}^{C} \exp\left(z_i + \frac{\exp(z_k)}{\sum_{j=1}^{C}\exp(z_j)}\right)}    (4)

We define a conditional function C(z) corresponding to z as follows:

C(z) = \sum_{i=1}^{C} \exp\left(\frac{\exp(z_i)}{\sum_{j=1}^{C}\exp(z_j)} + z_k\right) - \sum_{i=1}^{C} \exp\left(z_i + \frac{\exp(z_k)}{\sum_{j=1}^{C}\exp(z_j)}\right)    (5)

According to Eq. (4), we have the following relationships between CE_{CPL} and CE_{SCE} corresponding to C(z):

CE_{CPL} > CE_{SCE}, if and only if C(z) > 0    (6)

CE_{CPL} = CE_{SCE}, if and only if C(z) = 0    (7)

CE_{CPL} < CE_{SCE}, if and only if C(z) < 0    (8)

Compared with SCE, the proposed CPL works as a punitive mechanism fed by the network output and leverages network training through each training sample.

3.4. RCFN training setting

Experiments are carried out under Python 3.5.2 with TensorFlow 1.2.1 and CUDA 8.0 on a Windows 10 platform with two Nvidia GTX 1080 Ti graphics cards. Staircase exponential decay is applied to the learning rate using the following equation:

LR_d = R_0 \cdot R_d^{S_g / S_d}    (9)

where LR_d, R_0, and R_d represent the decayed learning rate, initial learning rate, and decay rate respectively, and S_g and S_d indicate the global step and decay steps respectively. We set R_0 = 10^{-4} and R_d = 0.8 empirically, while S_d is determined by:

S_d = t \cdot \frac{N}{B_s}    (10)

where N is the sample size of the training set, t is a constant set to 20, and B_s is the batch size, set to 30. This means the learning rate decays after every 20 epochs. The total number of training epochs is set to 140. We employ the stochastic optimization method ADAM [48] to optimize our RCFN network.
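For reference, a minimal sketch of the schedule of Eqs. (9)-(10) with the ADAM optimizer, written against the current tf.keras API (the paper itself used TensorFlow 1.2.1); the training-set size below is a placeholder value.

```python
# Staircase exponential decay (Eqs. (9)-(10)) with ADAM, using the current tf.keras API.
import tensorflow as tf

N = 840                                       # training-set size (placeholder; dataset dependent)
batch_size = 30                               # B_s
t = 20                                        # decay every t epochs
decay_steps = t * N // batch_size             # S_d, Eq. (10)

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-4,               # R_0
    decay_steps=decay_steps,                  # S_d
    decay_rate=0.8,                           # R_d
    staircase=True)                           # LR_d = R_0 * R_d^(floor(S_g / S_d)), Eq. (9)

optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
# model.compile(optimizer=optimizer, loss='categorical_crossentropy')  # then train for 140 epochs
```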

4. Experiments

To get a comprehensive analysis of RCFN, we evaluate the RCFN network on the Karolinska Directed Emotional Faces (KDEF) [49], the Extended Cohn-Kanade Dataset (CK+) [50], and Oulu-CASIA [51].

4.1. Dataset and experimental protocols

4.1.1. KDEF dataset

The KDEF dataset contains 4900 pictures of 70 individuals (35 females and 35 males aged from 20 to 30) displaying 7 facial expressions (anger, disgust, fear, happy, sad, neutral, and surprise). We only use the 840 frontal-view images without the neutral expression, and divide these 840 images into 10 subject-exclusive folds. In order to keep an even gender distribution inside each fold, each of the first five folds includes three females and three males with the subject ID in ascending order, and each of the last five folds includes four females and four males with the subject ID in


ascending order. We employ leave-one-fold-out cross validation, selecting one fold for testing and the rest for training, and repeating this process until each fold has been selected for testing once. For each selection, we run it 10 times and choose the best result as the report of that selection. Then, we average the results from all selections as the final report.

4.1.2. Oulu-CASIA dataset

The original Oulu-CASIA NIR&VIS dataset includes sequences of six facial expressions (anger, disgust, fear, happy, sad, and surprise) collected from 80 subjects aged from 23 to 58. There are in total 2880 expression sequences captured under an NIR (near-infrared) camera and a VIS (visible-light) camera in three different illumination conditions: normal, weak, and dark. Each expression sequence starts with the onset frame and ends with the apex frame of the corresponding expression. To acquire sufficient experimental results for the RCFN configurations, we use two different splits of the Oulu-CASIA dataset:

• OCVN (Oulu-CASIA VIS Normal illumination): Same as the data employed by [9,28,29,33-35], this split contains 480 (6 × 80) sequences of six expressions from normal illumination conditions under the VIS camera.
• OCNIR (Oulu-CASIA NIR): This split consists of 1440 (3 × 6 × 80) sequences of six expressions captured under the NIR camera in three different illumination conditions.

In each split, we evenly group the sequences into 10 folds by ascending order of subject ID using the subject-exclusive scheme, and employ leave-one-fold-out cross validation. We select the last three frames from each expression sequence for training and testing.

4.1.3. CK+ dataset

The CK+ dataset includes 593 sequences from 123 subjects, varying in duration and starting from the onset frame to the apex frame of the corresponding expressions. Considering the seven facial expressions (anger, contempt, disgust, fear, happy, sad, and surprise) for FER, only 327 sequences from 118 subjects are provided with labels. For the purposes of objectively comparing with other literature and analyzing the impacts of different configurations of RCFN, we set up three different splits for the popular leave-one-fold-out cross validation:

• CK+107 (CK+ 10-fold 7 classes): We group the 327 sequences with 7 facial expressions into 10 folds with the subject-exclusive scheme. Literature sharing this scheme includes [8,19,21,27-30,33-36,39,52].
• CK+106 (CK+ 10-fold 6 classes): In this split, we ignore the sequences of the contempt expression and group the remaining 309 sequences from 107 subjects into 10 subject-exclusive folds. Literature sharing this scheme includes [9,20,23,31,37].
• CK+86 (CK+ 8-fold 6 classes): This split is similar to CK+106 but uses 8 folds instead of 10. Literature sharing this scheme includes [10,26].

Same as the implementation reported in [8-10,26], etc., we select the last three frames from each sequence as the samples for training and testing.

4.1.4. Metrics

We present accuracies calculated in two different ways for our RCFN networks in order to provide a fair comparison of RCFN with other methods. The first one is the accuracy P of the predicted labels, calculated by the following equation:

P = \frac{Hit}{N}    (11)

where Hit is the number of hits against the ground-truth labels of the testing samples, and N is the total number of testing samples. Besides the accuracy P, we also calculate the average accuracy AP over the expression classes by the following equation:

AP = \frac{1}{C} \sum_{e}^{C} \frac{Hit_e}{N_e}    (12)

where C is the number of expression classes, Hit_e indicates the hits among the test samples of expression e, and N_e represents the total number of test samples of expression e. In particular, P and AP are equal if the test sample numbers of the expression classes are equal to each other, since Hit = \sum_{e}^{C} Hit_e and N = C \cdot N_e. We report both accuracies on the CK+ dataset because of its unequal test sample numbers among expression classes. On KDEF and Oulu-CASIA, we only report AP, since AP and P are equal due to the equal sample numbers of each expression class.
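A small sketch (ours) of how P and AP of Eqs. (11) and (12) can be computed from ground-truth and predicted label arrays; the example also illustrates that P and AP coincide when every class has the same number of test samples.

```python
# Overall accuracy P (Eq. (11)) and class-averaged accuracy AP (Eq. (12)).
import numpy as np

def p_and_ap(y_true, y_pred, num_classes):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    p = np.mean(y_true == y_pred)                          # Hit / N
    per_class = [np.mean(y_pred[y_true == c] == c)         # Hit_e / N_e for each class e
                 for c in range(num_classes)]
    return float(p), float(np.mean(per_class))

# Example with two test samples per class: P and AP are both 4/6.
print(p_and_ap([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], num_classes=3))
```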

4.2. Performance evaluation

There are five different settings to illustrate the effectiveness of the proposed RCFN. The first setting uses the whole face, like Fig. 7(a), as input to a network consistent with Table 2 but with only one sub-network instead of three, marked as 'WCFN'. The second uses a cropped face, like Fig. 7(b), instead of the whole face in the first setting, marked as 'CCFN'. The third is RCFN with SCE, marked as 'RCFN'. The fourth is RCFN with CPL, marked as 'RCFN(CPL)'. The last one replaces the max pooling with average pooling in the pooling layer 'M3' of the fourth setting, marked as 'RCFN(CPL + avg)'; it is based on the assumption that using average pooling in the higher semantic layer 'M3' might improve the classification result compared with max pooling.

Table 3 shows the experimental results on OCNIR, OCVN, KDEF, CK+86, CK+106, and CK+107, along with the corresponding average training time of one epoch, including testing. The last column shows the average accuracy of each method over the nine metrics corresponding to the nine columns of APs and Ps. The best results on the nine metrics are achieved by either RCFN(CPL) or RCFN(CPL + avg), except that the best AP on KDEF is achieved by CCFN.

In Table 3, WCFN, which uses the whole face with useless regions, is inferior to all others in all metrics. CCFN and RCFN are close to each other: CCFN performs better on OCNIR, OCVN, and KDEF, while RCFN performs better or very close on CK+86, CK+106, and CK+107. RCFN spends less average training time than CCFN in most settings, though the experiments were run under different loads and hardware IDs (same product). RCFN(CPL) and RCFN(CPL + avg) both outperform CCFN except for AP on KDEF and P on CK+86. In other words, the well-defined crucial regions of our RCFN are sufficient to classify facial expressions, and provide significant interference elimination as well as computation cost reduction from unrepresentative regions while maintaining fine-grained details, compared with the whole face and the cropped face.

Fig. 7. Cropped face paradigms for WCFN and CCFN. (a) Face for WCFN with size of 128 × 128. (b) Face for CCFN with size of 96 × 72.


Table 3. Experiment results on Oulu-CASIA, KDEF, and the CK+ dataset (AP and P in %). TE represents the average training time of one epoch.

Method           OCNIR (4320)   OCVN (1440)   KDEF (840)    CK+86 (927 samples)   CK+106 (927 samples)  CK+107 (981 samples)  Average
                 AP      TE     AP     TE     AP     TE     AP     P      TE      AP     P      TE      AP     P      TE
WCFN             77.75   14.69  77.64  2.57   89.55  1.27   93.73  94.54  1.57    95.89  95.87  1.55    95.68  95.03  1.57    90.63
CCFN             80.79   7.85   85.90  1.38   91.60  0.71   97.58  97.87  0.99    97.50  97.71  0.84    97.82  97.54  0.99    93.81
RCFN             79.95   7.62   84.17  1.48   90.73  0.74   97.70  97.69  0.81    97.32  98.06  0.79    96.77  97.79  0.81    93.35
RCFN(CPL)        83.45   7.36   86.11  1.39   91.11  0.78   98.10  98.01  0.82    98.42  98.60  0.81    97.94  98.70  0.82    94.49
RCFN(CPL + avg)  82.96   7.67   86.94  1.43   91.01  0.73   99.26  97.79  0.83    97.84  98.37  0.82    98.52  98.70  0.83    94.60

From the view of data scale, we achieve state-of-the-art results using 840-4320 samples, coping with small data scale without data augmentation and pre-training. Moreover, these results demonstrate the robustness of our triple-level features extracted by the proposed network for FER. In addition, the small training time (including testing) for one epoch shows that our network is fast in both training and testing: RCFN can easily process over 1000 samples per second on a single GTX 1080 Ti graphics card during training, and is even faster at test time.

As for the CPL, the direct comparison between RCFN and RCFN(CPL) shows the improvement made by CPL. Interestingly, the results of RCFN(CPL) and RCFN(CPL + avg) show that the performances of max pooling and average pooling in 'M3' are very close. Over the nine metrics, RCFN(CPL) beats RCFN(CPL + avg) five times, RCFN(CPL + avg) beats RCFN(CPL) three times, and there is one draw in P of CK+107. Although RCFN(CPL) is two rounds ahead, RCFN(CPL + avg) has slightly higher average performance. The data are not sufficient to support the

previous assumption that using average pooling in the higher semantic layer 'M3' would perform better than max pooling. Perhaps the 'M3' layer is not deep enough, or average pooling and max pooling have their own strengths in specific domains.

Fig. 8 shows the confusion matrices of the best APs in Table 3. In Table 1, angry shares totally different muscle movement styles with happy as well as surprise, and the confusion matrices in Fig. 8 show that the false prediction rates of mis-predicting angry as happy or surprise, and vice versa, are zero or very close to zero. The same holds for disgust versus happy or surprise, since disgust also shares totally different muscle movement styles with happy and surprise. Besides the different muscle movement styles, angry shares similar muscle movement styles with disgust and sad, and the confusion matrices of Fig. 8(a), (c), and (b) show that the false prediction rates of mis-predicting angry as disgust or sad are higher than the others. Other facial expressions that share similar muscle movement styles encounter similar false prediction rates, such as fear versus

Fig. 8. Confusion matrices.


Table 4. Performance comparisons with other methods on the CK+ dataset.

Method               Input   CK+86 AP   CK+106 P   CK+107 AP   CK+107 P
DSAE [52]            Image   –          –          93.78       –
STM-ExpLet [28]      Seq.    –          –          –           94.19
STG-ISA [36]         Seq.    –          –          94.5        –
LOMo [29]            Seq.    –          –          –           95.1
ST-RBM [30]          Pair    –          –          95.66       –
LPQ-SLPM-NN [21]     Image   –          –          95.9        –
CDMML [39]           Seq.    –          –          96.6        –
BDBN [26]            Image   96.7       –          –           –
CNN [10]             Image   96.76      –          –           –
SH-RER [20]          Image   –          96.83      –           –
DTAGN [33]           Seq.    –          –          96.94       –
VGG16 + LSTM [8]     Seq.    –          –          –           97.2
DTAGN(Joint) [34]    Seq.    –          –          97.25       –
PPDN [9]             Pair    –          97.3       –           –
CFER [19]            Image   –          –          –           97.66
CNN [27]             Image   –          –          97.85       –
PCA[PTLBPu2] [31]    Seq.    –          98.1       –           –
PHRNN-MSCNN [35]     Seq.    –          –          97.78       98.5
AFERS [37]           Seq.    –          98.95      –           –
DDL + CRCLBP [23]    Image   –          99.0       –           –

CCFN                 Image   97.58      97.71      97.82       97.54
RCFN                 Image   97.70      98.06      96.77       97.79
RCFN(CPL)            Image   98.1       98.6       97.94       98.7
RCFN(CPL + avg)      Image   99.26      98.37      98.52       98.7

surprise and disgust versus sad. These patterns of false prediction rates in the confusion matrices reveal the successful construction of the muscle movement model for extracting crucial regions.

4.3. Comparison with state-of-the-art approaches

We compare our proposed methods with 20 other state-of-the-art approaches on one or both of CK+ and OCVN under the leave-one-fold-out cross validation strategy, as shown in Tables 4 and 5. Eight of them are image-based approaches; the others target expression sequences instead of a single apex frame during feature extraction. Two of them extract features based on an image pair, and the rest extract features from sequences with lengths larger than two. In Table 4, only four approaches [10,27,33,34] mention data augmentation, while the others either do not employ data augmentation or do not mention it in their experimental details. Sequence-based methods treat each expression sequence as one sample in training and testing, except for Rodriguez's method [8], which generates three different sequences of length 10, each ending in one of the last three frames of the corresponding expression sequence. Elaiwat's ST-RBM [30] uses the predictions of three image pairs from the same expression sequence to vote for the label of that sequence.

On CK+86, our RCFN(CPL + avg) obtains improvements of 2.5% and 2.56% compared with Lopes's CNN [10] and Liu's BDBN [26]

Table 5. Performance comparisons with other methods on the OCVN dataset.

Method              Input   AP
PPDN [9]            Pair    72.4
STM-ExpLet [28]     Seq.    74.59
DTAGN [33]          Seq.    80.62
DTAGN(Joint) [34]   Seq.    81.46
LOMo [29]           Seq.    82.1
PHRNN-MSCNN [35]    Seq.    86.25

CCFN                Image   85.90
RCFN                Image   84.17
RCFN(CPL)           Image   86.11
RCFN(CPL + avg)     Image   86.94

respectively. Note that the result of BDBN is the average accuracy of the outputs of six binary classifiers in a one-versus-all classification strategy, which trains a specific binary classifier for each expression class. Seven expressions were tested by Lopes, but one of them is neutral instead of the contempt used in our experiment. On CK+106, DDL + CRCLBP [23] obtains the highest P of 99.0% and AFERS [37] achieves the second best with 98.95%. Compared with DDL + CRCLBP and AFERS, our RCFN(CPL) achieves 98.6% (the third best) with a narrow gap. Note that DDL + CRCLBP, AFERS, and RCFN all employ well-defined regions, but differ in one way or another. On CK+107, our RCFN(CPL + avg) outperforms all other methods, and only 6 out of 14 methods have higher AP or P than our CCFN or RCFN. Though the sequence-based PHRNN-MSCNN [35] is close to our RCFN(CPL + avg), our methods have an advantage over the other methods, not to mention that some of them employed data augmentation.

In Table 5, five of the six compared methods are sequence-based and the other one is image-pair-based. Our RCFN(CPL + avg) achieves the highest AP of 86.94%, and the sequence-based PHRNN-MSCNN [35] obtains the second best with 86.25%. Our methods again have an advantage over the other five methods.

The results of Tables 4 and 5 prove the effectiveness of our well-unified crucial regions and the RCFN network for feature extraction, and their compatibility with small data scale in FER. Most of the methods in Tables 4 and 5 use cropped faces similar to the paradigms in Fig. 2, which retain the nose region with less common characteristics but more variance caused by varied shapes, sizes, and positions, as well as the varied sizes and positions of other facial organs. These factors hinder their performance in FER, with the exception of AFERS and DDL + CRCLBP. On the contrary, our FER approach overcomes these factors through well-unified crucial regions and learns a robust and fast classification model using the proposed RCFN network, achieving encouraging performance without data augmentation and pre-training.

5. Conclusion and future work In this paper, we propose a new FER approach named RCFN, which provides a novel way to extract and compensate the key characteristics from facial crucial regions. RCFN has three major


contributions, summarized as follows. First, a muscle movement model is built to help extract three crucial regions (the forehead patch, eye patch, and mouth patch), aiming to greatly reduce unrepresentative regions and remove interference caused by facial organs with varied sizes and positions. Second, a fast and practical network architecture is constructed to extract robust triple-level features, from low level to semantic level, out of the crucial regions and fuse them for FER, providing fast training and testing as well as coping with small data scale. Third, a loss punitive mechanism named constrained punitive loss is introduced to leverage the loss during training, gaining a significant improvement in FER performance. Through the experiments, the image-based RCFN shows encouraging performance on three public datasets, achieves competitive results with other state-of-the-art approaches in FER, and outperforms many sequence-based approaches that employ temporal geometric and appearance features. However, RCFN has its limitations in crucial region extraction and network optimization. Though crucial region extraction is adaptive to variant samples, it is designed for frontal faces. Our future work will focus on seeking a robust approach for expressions in the wild, as well as a feasible loss function targeting the greater challenges they bring.

Conflict of interest The authors declared that there is no conflict of interest. References [1] N.T. Cao, A.H. Ton-That, H.-I. Choi, An effective facial expression recognition approach for intelligent game systems, Int. J. Comput. Vis. Robot. 6 (2016) 223–234. [2] G.P. Amminger, M.R. Schäfer, K. Papageorgiou, C.M. Klier, M. Schlögelhofer, N. Mossaheb, S. Werneck-Rohrer, B. Nelson, P.D. McGorry, Emotion recognition in individuals at clinical high-risk for schizophrenia, Schizophr. Bull. 38 (2012) 1030–1039. [3] Q. Wang, K. Jia, P. Liu, Design and implementation of remote facial expression recognition surveillance system based on pca and knn algorithms, in: 2015 International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP), 2015, pp. 314–317. doi:https://doi.org/10.1109/ IIH-MSP.2015.54. [4] Z. Zhang, P. Luo, C.C. Loy, X. Tang, Learning social relation traits from face images, in: 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 3631–3639, https://doi.org/10.1109/ICCV.2015.414. [5] B.C. Ko, A brief review of facial emotion recognition based on visual information, Sensors 18 (2018) 401. [6] Z.J. Xu Linlin, Shumei Zhang, Summary of facial expression recognition methods based on image, J. Comput. Appl. 37 (2017) 3509. [7] M. Mohammadi, E. Fatemizadeh, M. Mahoor, Pca-based dictionary building for accurate facial expression recognition via sparse representation, J. Vis. Commun. Image Represent. 25 (2014) 1082–1092. [8] P. Rodriguez, G. Cucurull, J. Gonzlez, J.M. Gonfaus, K. Nasrollahi, T.B. Moeslund, F.X. Roca, Deep pain: exploiting long short-term memory networks for facial expression classification, IEEE Trans. Cybern. PP (2018) 1–11. [9] X. Zhao, X. Liang, L. Liu, T. Li, Y. Han, N. Vasconcelos, S. Yan, Peak-piloted deep network for facial expression recognition, in: Computer Vision – ECCV 2016, vol. 9906, Springer International Publishing, Cham, 2016, pp. 425–442. [10] A.T. Lopes, E. de Aguiar, A.F.D. Souza, T. Oliveira-Santos, Facial expression recognition with convolutional neural networks: coping with few data and the training sample order, Pattern Recogn. 61 (2017) 610–628. [11] Y. Guo, G. Zhao, M. Pietikäinen, Dynamic facial expression recognition using longitudinal facial expression atlases, in: Computer Vision – ECCV 2012, Springer, Berlin Heidelberg, Berlin, Heidelberg, 2012, pp. 631–644. [12] J. Lu, V.E. Liong, X. Zhou, J. Zhou, Learning compact binary face descriptor for face recognition, IEEE Trans. Pattern Anal. Mach. Intell. 37 (2015) 2041–2056. [13] J. Lu, V.E. Liong, J. Zhou, Simultaneous local binary feature learning and encoding for homogeneous and heterogeneous face recognition, IEEE Trans. Pattern Anal. Mach. Intell. 40 (2018) 1979–1993. [14] Y. Duan, J. Lu, J. Feng, J. Zhou, Learning rotation-invariant local binary descriptor, IEEE Trans. Image Process. 26 (2017) 3636–3651. [15] Y. Duan, J. Lu, J. Feng, J. Zhou, Context-aware local binary feature learning for face recognition, IEEE Trans. Pattern Anal. Mach. Intell. 40 (2018) 1139–1153. [16] F. Schroff, D. Kalenichenko, J. Philbin, Facenet: a unified embedding for face recognition and clustering, in: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 815–823, https://doi.org/10.1109/ CVPR.2015.7298682.

[17] O.M. Parkhi, A. Vedaldi, A. Zisserman, Deep face recognition, in: Proceedings of the British Machine Vision Conference (BMVC), BMVA Press, 2015, pp. 41.1– 41.12, https://doi.org/10.5244/C.29.41. [18] F.D.L. Torre, W.S. Chu, X. Xiong, F. Vicente, Intraface, in: IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, 2015, pp. 1–8. [19] Y. Sun, G. Wen, Cognitive facial expression recognition with constrained dimensionality reduction, Neurocomputing 230 (2017) 397–408. [20] M.H. Siddiqi, R. Ali, A.M. Khan, Y.T. Park, S. Lee, Human facial expression recognition using stepwise linear discriminant analysis and hidden conditional random fields, IEEE Trans. Image Process. 24 (2015) 1386–1398. [21] C. Turan, K.-M. Lam, Histogram-based local descriptors for facial expression recognition (fer): a comprehensive study, J. Vis. Commun. Image Represent. 55 (2018) 331–341. [22] A.S. Alphonse, D. Dharma, A novel monogenic directional pattern (mdp) and pseudo-voigt kernel for facilitating the identification of facial emotions, J. Vis. Commun. Image Represent. 49 (2017) 459–470. [23] A. Moeini, K. Faez, H. Moeini, A.M. Safai, Facial expression recognition using dual dictionary learning, J. Vis. Commun. Image Represent. 45 (2017) 20–33. [24] S. Yuan, X. Mao, Exponential elastic preserving projections for facial expression recognition, Neurocomputing 275 (2018) 711–724. [25] W. Sun, H. Zhao, Z. Jin, A visual attention based roi detection method for facial expression recognition, Neurocomputing 296 (2018) 12–22. [26] P. Liu, S. Han, Z. Meng, Y. Tong, Facial expression recognition via a boosted deep belief network, in: 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1805–1812, https://doi.org/10.1109/ CVPR.2014.233. [27] Z. Xie, Y. Li, X. Wang, W. Cai, J. Rao, Z. Liu, Convolutional neural networks for facial expression recognition with few training samples, in: 2018 37th Chinese Control Conference (CCC), 2018, pp. 9540–9544. doi:https://doi.org/10.23919/ ChiCC.2018.8483159. [28] M. Liu, S. Shan, R. Wang, X. Chen, Learning expressionlets on spatio-temporal manifold for dynamic facial expression recognition, in: 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1749–1756, https:// doi.org/10.1109/CVPR.2014.226. [29] K. Sikka, G. Sharma, M. Bartlett, Lomo: latent ordinal model for facial analysis in videos, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5580–5589, https://doi.org/10.1109/ CVPR.2016.602. [30] S. Elaiwat, M. Bennamoun, F. Boussaid, A spatio-temporal rbm-based model for facial expression recognition, Pattern Recogn. 49 (2016) 152–161. [31] T.B. Abdallah, R. Guermazi, M. Hammami, Facial-expression recognition based on a low-dimensional temporal feature space, Multimedia Tools Appl. (2018). [32] X. Pu, K. Fan, X. Chen, L. Ji, Z. Zhou, Facial expression recognition from image sequences using twofold random forest classifier, Neurocomputing 168 (2015) 1173–1180. [33] H. Jung, S. Lee, S. Park, I. Lee, C. Ahn, J. Kim, Deep temporal appearancegeometry network for facial expression recognition, CoRR abs/1503.01532, 2015a. [34] H. Jung, S. Lee, J. Yim, S. Park, J. Kim, Joint fine-tuning in deep neural networks for facial expression recognition, in: 2015 IEEE International Conference on Computer Vision (ICCV), 2015b, pp. 2983–2991. doi:https://doi.org/10.1109/ ICCV.2015.341. [35] K. Zhang, Y. Huang, Y. Du, L. 
Wang, Facial expression recognition based on deep evolutional spatial-temporal networks, IEEE Trans. Image Process. 26 (2017) 4193–4203. [36] C. Lin, F. Long, J. Yao, M.-T. Sun, J. Su, Learning spatiotemporal and geometric features with isa for video-based facial expression recognition, in: Neural Information Processing, Springer International Publishing, 2017, pp. 435–444. [37] A. Majumder, L. Behera, V.K. Subramanian, Automatic facial expression recognition system using deep network-based data fusion, IEEE Trans. Cybern. 48 (2018) 103–114. [38] J. Chen, Z. Chen, Z. Chi, H. Fu, Facial expression recognition in video with multiple feature fusion, IEEE Trans. Affective Comput. 9 (2018) 38–50. [39] H. Yan, Collaborative discriminative multi-metric learning for facial expression recognition in video, Pattern Recogn. 75 (2018) 33–40. [40] J. Liu, Y. Deng, T. Bai, C. Huang, Targeting ultimate accuracy: face recognition via deep embedding, CoRR abs/1506.07310, 2015. [41] J. Hamm, C.G. Kohler, R.C. Gur, R. Verma, Automated facial action coding system for dynamic analysis of facial expressions in neuropsychiatric disorders, J. Neurosci. Meth. 200 (2011) 237–256. [42] P. Ekman, W.V. Friesen, Facial action coding system (facs): a technique for the measurement of facial actions, Riv. Psichiatria 47 (1978) 126–138. [43] V. Kazemi, J. Sullivan, One millisecond face alignment with an ensemble of regression trees, in: 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 00, 2014, pp. 1867–1874. doi:https://doi.org/10. 1109/CVPR.2014.241. [44] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition abs/1409.1556, 2015. [45] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 00, 2015, pp. 1–9. doi:https://doi.org/10.1109/CVPR.2015.7298594. [46] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, J. Mach. Learn. Res. 15 (2014) 1929–1958.

Y. Ye et al. / J. Vis. Commun. Image R. 62 (2019) 1–11 [47] J. Lu, J. Hu, Y. Tan, Discriminative deep metric learning for face and kinship verification, IEEE Trans. Image Process. 26 (2017) 4269–4282. [48] D.P. Kingma, J. Ba, Adam: a method for stochastic optimization, CoRR abs/ 1412.6980, 2014. [49] D. Lundqvist, A. Flykt, A. Öhman, The karolinska directed emotional faces – kdef, CD ROM from Department of Clinical Neuroscience, Psychology section, Karolinska Institutet, 1998, pp. 91–630. [50] P. Lucey, J.F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, I. Matthews, The extended cohn-kanade dataset (ck+): a complete dataset for action unit and emotion-

11

specified expression, in: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, 2010, pp. 94–101. doi: https://doi.org/10.1109/CVPRW.2010.5543262. [51] G. Zhao, X. Huang, M. Taini, S.Z. Li, M. PietikäInen, Facial expression recognition from near-infrared videos, Image Vis. Comput. 29 (2011) 607–619. [52] N. Zeng, H. Zhang, B. Song, W. Liu, Y. Li, A.M. Dobaie, Facial expression recognition via learning deep sparse autoencoders, Neurocomputing 273 (2018) 643–649.