Facial decomposition for expression recognition using texture/shape descriptors and SVM classifier

Khadija Lekdioui a,b,∗, Rochdi Messoussi b, Yassine Ruichek a, Youness Chaabi b, Raja Touahni b

a Le2i FRE2005, CNRS, Arts et Métiers, Univ. Bourgogne Franche-Comté, UTBM, F-90010 Belfort, France
b Laboratoire des Systèmes de Télécommunication et Ingénierie de la Décision (LASTID), Université Ibn Tofail, BP 133, Kenitra 14000, Maroc
Abstract

Automatic facial expression analysis is a challenging topic in computer vision due to its complexity and its important role in many applications such as human-computer and social interaction. This paper presents a Facial Expression Recognition (FER) method based on an automatic and more efficient facial decomposition into regions of interest (ROI). First, seven ROIs, representing more precisely the facial components involved in the expression of emotions (left eyebrow, right eyebrow, left eye, right eye, between eyebrows, nose and mouth), are extracted using the positions of some landmarks detected by IntraFace (IF). Then, each ROI is resized and partitioned into blocks, which are characterized using several texture and shape descriptors and their combination. Finally, a multiclass SVM classifier is used to classify the six basic facial expressions and the neutral state. In terms of evaluation, the proposed automatic facial decomposition is compared with existing ones to show its effectiveness, using three public datasets. The experimental results showed the superiority of our facial decomposition against existing ones and reached recognition rates of 96.06%, 92.03% and 93.34% for the CK, FEED and KDEF datasets, respectively. Then, a comparison with state-of-the-art methods is carried out using the CK+ dataset.
∗ Corresponding author.
Email addresses: [email protected] (Khadija Lekdioui), [email protected] (Rochdi Messoussi), [email protected] (Yassine Ruichek), [email protected] (Youness Chaabi), [email protected] (Raja Touahni)
The comparison analysis demonstrated that our method outperformed or was competitive with the results achieved by the compared methods.

Keywords: Facial components, Texture and shape descriptors, Facial expression recognition, SVM classifier, ROI extraction
1. Introduction

Facial expression is an important aspect of behavior and non-verbal communication. Indeed, it represents a form of non-verbal body language, which plays an essential role in human interactions. Recently, several research works
have focused on FER to detect emotions and analyze people's behavior [1, 2, 3]. Some facial muscles are specifically associated with certain emotional states and allow, according to Ekman [4], the expression of primary emotions (Sadness, Anger, Fear, Joy, Disgust and Surprise). Emotion recognition systems based on facial expression analysis have found their interest in various applications,
such as eLearning and affective computing. Several studies have shown the importance of taking into account the emotional aspects of a group of learners when creating computing environments for collaborative learning [5, 6]. In [6], the authors proposed to use computer tools to help learners share the emotions they feel in a collaborative task. Others proposed to integrate facial
expression analysis based systems into educational platforms in order to make emotion sharing automatic and unconscious [7]. In [8], the authors used automatic facial expression analysis to discover the learning status of e-learners. In affective computing, emotion recognition is crucial to allow behavioral and emotional interaction between a human and a machine. In [9], the authors sought
to promote more dynamic and flexible communication between a learner and an educational platform through the integration of an agent whose role is to capture and manage emotions expressed by the learner during a learning session, using facial expression analysis. Emotion recognition based solely on facial expression analysis is a very difficult task, especially within uncontrolled con-
ditions such as lighting changes, occlusion, fringe, and facial pose variation in
front of the camera. In recent years, more research works [10, 11] have been conducted to address these problems, and the literature offers a significant and growing body of research in computer vision that contributes to performing FER with increasing performance. The first research work in face detection and FER can
be found in [12]. Feature extraction is an important step in FER system. Feature extraction methods, used to characterize facial expression, can be categorized into two approaches: appearance-based and geometry-based methods. Appearance-based methods consist of extracting facial texture caused by expressions using different
descriptors including Local Binary Pattern (LBP) [13], Local Ternary Pattern (LTP) [14], Compound Local Binary Pattern (CLBP)[15] and Histogram of Oriented Gradient (HOG)[16]. Geometry-based methods extract shape information and locations of facial components such as distances and angles between landmarks [17]. Some research works combine appearance and geometric information
as in [18, 19, 20]. Based on appearance-based features extracted from the whole face, Gritti et al. [21] extensively investigated different local features including HOG descriptor, LBP descriptor and its variants, and Gabor wavelets. Using Support Vector Machines (SVM) classifier, the authors demonstrated that the best FER
performances are obtained using LBP descriptor. Other extensive experiments were carried out in [22]. The authors showed that LBP features are effective and efficient for FER, even in low-resolution video sequences, using different machine learning methods. Carcagnì et al. [23] carried out large experiments using HOG descriptor, highlighting that a proper set of HOG parameters (cell
size and number of orientation bins) can make the HOG descriptor one of the most powerful techniques to recognize facial expressions, using an SVM classifier. Ahmed et al. [15] proposed a facial feature descriptor constructed with CLBP and used SVM for classification. Most frameworks based on the geometric approach use Active
Shape Model (ASM) or its extension called Active Appearance Models (AAM). In [2], the authors detect facial expression by observing changes in key features
in AAM using Fuzzy Logic. Shbib and Zhou [3] used geometric displacement of projected ASM feature points, and the mean shape of ASM is analyzed to evaluate FER based on SVM classifier. Poursaberi et al. [18] combined texture
and geometric features. They used Gauss-Laguerre circular harmonic filter to extract texture and geometric information of fiducial points, then K-nearest neighbor (KNN) is used for facial expression classification. Rapp et al. [19] also combined appearance and geometric information. They proposed an original combination of two heterogeneous descriptors. The first one uses Local Gabor
Binary Pattern (LGBP) in order to exploit multi-resolution and multi-direction representation between pixels. The second descriptor is based on AAM, which provides important information on the position of facial key points. Then, they used multiple kernel SVM based classification to achieve FER. Valstar and Pantic [24] proposed a method that consists in tracking 20 fiducial points in all
subsequent frames in a video, using particle filtering with factorized likelihoods. Then, the features are calculated from the position of facial points as indicated by the point tracker. A comparison of two feature types, geometry-based and Gabor wavelets-based, using a multi-layer perceptron, is explored in [25]. The comparison showed that Gabor wavelet coefficients are much more powerful than
geometric positions. In [26], the authors proposed a method that extracts patch-based 3D Gabor features to obtain salient distance features, selects the salient patches and then performs patch matching operations using an SVM classifier. Lei et al. [17] proposed a geometric feature extraction method applying an ASM automatic fiducial point location algorithm to facial expression images, and then
calculating the Euclidean distance between the center of gravity of the face shape and the annotated points. Finally, they extract the geometric deformation difference between features of the neutral expression and the facial expression to analyze, before applying an SVM classifier for FER. In addition to the aforementioned traditional machine learning methods, deep learning with
Convolutional Neural Networks (CNN) and Deep Belief Networks (DBN) has also been applied for FER and has achieved competitive recognition rates [27, 28, 29, 30]. However, deep learning methods require more memory and are time
consuming [27, 31]. Khorrami et al. [29] empirically showed that the learned features obtained by CNNs correspond completely with the Facial Action Coding System (FACS) developed by Ekman and Friesen [32]. In [30], the authors demonstrated that a combination of different methods such as dropout, max pooling and batch normalization impacts the performance of CNNs. A unified FER system combining feature learning, feature selection, and classifier construction was proposed based on the Boosted Deep Belief Network (BDBN)
[27]. Liu et al [28] constructed a deep architecture by using convolutional kernels for learning local appearance variations caused by facial expressions and extracting features based on DBNs. All the methods presented above use facial image as a whole region of interest. Recently, many researchers have studied facial expression using facial
regions or facial patches instead of using the whole face [33, 34, 35, 36]. Indeed, some specific face-regions are inappropriate to FER. In [33], the authors used AAM to extract facial regions, then applied Gabor wavelet transformation to extract facial features from the defined regions, and finally used SVM to recognize learned expressions. The authors in [34, 35] defined facial components from
which they extracted HOG feature descriptors. Then, SVM classifier is used for expression recognition. In [37], the authors combined geometric and appearance facial features, then applied Radial Basis Function (RBF) based neural network for facial expression classification. The geometric facial features are extracted from regions of interest defined from facial decomposition. Happy and Routray
[36] extracted appearance based features from salient patches selected around facial landmarks then used SVM to classify expressions. According to FACS, the main facial features required to analyze facial expressions are located around the eyebrows, eyes, nose and mouth. Hence, different definitions of ROIs around these facial components are proposed in the
literature [37, 35]. The first one [37] segmented the face into three ROIs (eyes-eyebrows, nose, mouth) using a segmentation process which detects the face image, determines its width and height, then uses predefined ratios to estimate the regions in question. The second one [35] aims to define six ROIs (eyebrows,
left eye, right eye, between eyes, nose, mouth) by dividing the face manually. In
our work, different from the ones mentioned before, facial expression recognition is investigated using new face-regions defined more precisely with the help of some facial landmarks (detected by the IF algorithm), which allow the automatic extraction of seven ROIs (left eyebrow, right eyebrow, between eyebrows, left eye, right eye, nose, mouth). Furthermore, unlike the previous works where the
superior region contains more than one facial component, our decomposition aims to separate the facial components. This ensures better face registration and therefore an appropriate facial representation. Two major contributions are presented in this paper. First, we define an automatic facial decomposition that improves FER performance, compared to
state of the art facial decompositions and the approach based on whole face. Second, the study is performed considering texture and shape descriptors, and SVM for classification. We consider different descriptors, including LBP, LTP, CLBP, HOG and their combination, to carry out comprehensive performance comparison between
whole face based and face-regions based methods on three different datasets. Two of them, CK [38] and KDEF [39] represent posed emotions. The third one, FEED [40], contains spontaneous emotions. Furthermore, the proposed method is compared with other state of the art methods such as deep learning based methods and feature points based ones.
The paper is organized as follows. Section 2 presents the overview of the proposed methodology for FER. Regions of interest extraction from the facial image is detailed in Section 3. Feature extraction is presented in Section 4. Section 5 presents multiclass SVM based FER. In Section 6, we present extensive experiments and discuss the results. Section 7 concludes the paper and proposes some future works.
2. Overview of the proposed methodology

The objective of the proposed methodology is to characterize the six universal facial expressions (Joy, Fear, Disgust, Surprise, Sadness, Anger) and the
neutral state, by analyzing specific well defined face-regions from which facial features are extracted. The proposed study for FER is based on two main steps: (i) determination of specific and more accurate face-regions using IF framework [41], allowing facial landmarks detection; (ii) evaluation of different texture and shape descriptors (texture through LBP and its variants; and shape through
HOG) and their combination. To explain the interest of facial decomposition into different ROIs, representing the main components of the face (eyes, eyebrows, nose, mouth, between eyebrows), one can see in Figure 1 that when using the whole face as a ROI, the main components are not located in the same region. Indeed, the red region in Figure 1 (a) contains mouth and nose, whereas
the same red region in Figure 1 (b) contains only the mouth. This is due to the difference in face shape from one person to another. As reported in [20], the holistic method, which uses the whole face as one ROI, does not provide good face registration due to the different shapes and sizes of facial components among the population and according to the expression. Then, one facial component may
not be located in the same region. Hence the interest of facial decomposition into different ROIs, representing the main components of the face (left eye, right eye, left eyebrow, right eyebrow, nose, mouth, between eyebrows). This issue affects FER performance, as we will show in Section 6. When facial decomposition is applied, one can extract the facial components and analyze them globally to
perform expression recognition. Figures 1 (c) and (d) and Figures 1 (e) and (f) illustrate ROIs, representing nose and mouth, extracted from the faces of Figures 1 (a) and (b) respectively. Figure 2 illustrates the different steps of the proposed system: a) the openCV face detector [42] is used to detect the position of the face within the image; b) facial landmarks are detected using IF; c)
facial components (ROIs) are extracted using the coordinates of specific facial landmarks; d) the defined ROIs (left eyebrow, right eyebrow, between eyebrows,
left eye, right eye, nose and mouth) are cropped; e) each ROI is scaled and partitioned into blocks (this step allows extracting more local information); f) for each block, a feature descriptor is extracted. The descriptors of all the blocks are
then concatenated to build the descriptor of the ROI. Finally, the descriptors of all ROIs are concatenated to obtain the descriptor of the face; g) the obtained descriptor is fed into a multiclass SVM to achieve the recognition task.
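To make the pipeline of Figure 2 concrete, the following minimal Python sketch outlines steps (a) to (g) under stated assumptions: OpenCV's Haar cascade stands in for the face detector of [42], and detect_landmarks, extract_rois and describe_roi are hypothetical helpers (IntraFace itself is not wrapped here); this is an illustration, not the authors' implementation.

```python
# Minimal sketch of the FER pipeline of Figure 2 (hypothetical helper names).
import cv2
import numpy as np

def recognize_expression(image_bgr, detect_landmarks, extract_rois,
                         describe_roi, svm_model):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)

    # (a) Face detection (Haar cascade used as a stand-in for the OpenCV detector [42]).
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    face_box = faces[0]

    # (b) 49 facial landmarks inside the face box (IntraFace/SDM assumed available).
    landmarks = detect_landmarks(gray, face_box)          # e.g. shape (49, 2)

    # (c)-(d) Crop the seven ROIs from the landmark coordinates (see Table 1).
    rois = extract_rois(gray, landmarks)                  # dict: name -> image patch

    # (e)-(f) Scale each ROI, split it into blocks, describe each block and concatenate.
    face_descriptor = np.concatenate([describe_roi(roi) for roi in rois.values()])

    # (g) Multiclass SVM prediction (one label among the 6 expressions + neutral).
    return svm_model.predict(face_descriptor.reshape(1, -1))[0]
```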
Figure 1: (a) The whole face as ROI; the red region incorporates two facial components, the mouth and a part of the nose. (b) The whole face as ROI; the red region incorporates only one facial component, the mouth. (c) and (d) Nose and mouth ROIs extracted from (a). (e) and (f) Nose and mouth ROIs extracted from (b).
Figure 2: Automatic FER system.
3. Regions of interest extraction

As stated in [20], using the whole face as a unique region to extract features
affects FER performances. To solve this issue, our approach proposes to extract specific face-regions, representing the main face components (eyes, eyebrows, nose, mouth, between eyebrows), from which features will be extracted. The objective is to increase FER performance through feature descriptors. In Section 6, we will show through comprehensive experiments the performance of the
proposed face decomposition, compared with whole face and state of the art face decompositions.
To achieve the ROIs extraction step, we start by detecting facial landmarks using the popular IF framework [41]. The IF algorithm can detect 49 landmarks
around the regions of eyebrows, eyes, nose, and mouth (see Figure 3) using the Supervised Descent Method (SDM) [41]. The 49 landmarks detected by IF algorithm are shown in Figure 3, where each point has a label number. Hence, the face shape is described by X∗ = (X1 , X2 , · · · , Xp ), where Xi is ith point, and p denotes the number of landmarks (here p = 49). Xi = (xi , yi ), where xi
and yi are the horizontal and vertical coordinates of the ith point, respectively. In this paper, we use some of these points (see Table 1) to define the proposed face decomposition into ROIs, which are located mainly around the eyebrows, eyes, nose and mouth (see Figure 3).
Table 1: Facial components extraction using IF algorithm based landmarks.

Facial component   | Starting point              | Width      | Height
Left eyebrow       | (x1, y4)                    | x11 − x1   | max(x4 − x3, y3 − y1)
Right eyebrow      | (x11, y7)                   | x11 − x10  | max(x7 − x6, y7 − y10)
Between eyebrows   | (x5, min(y5, y6))           | x6 − x5    | y13 − y12
Left eye           | (x20, y22)                  | x23 − x20  | y24 − y22
Right eye          | (x26, y27)                  | x29 − x26  | y31 − y27
Nose               | (x15, y12)                  | x19 − x15  | y12 − y17
Mouth              | (x25, min(y32, y35, y38))   | x30 − x25  | max(y32, y38, y41) − min(y32, y35, y38)
Figure 3: 49 Landmarks detected by IF algorithm.
After detecting the landmarks with the IF algorithm, seven ROIs, which are prone to
change with facial expression, are extracted (left eyebrow, right eyebrow, left eye, right eye, between eyebrows, nose and mouth). Figure 4 shows an example of ROIs extraction. These ROIs are the most representative regions of facial expression according to FACS. The idea behind FER is to analyze the face locally by focusing on permanent and transient features. Permanent features
are eyes, eyebrows, nose and mouth. Their shape and texture are exposed to
change with facial expression, which produces different wrinkles and furrows that are called transient facial features (for example, vertical wrinkles between the eyebrows due to their convergence towards one another, especially when the face expresses sadness and anger according to FACS). This leads us
to choose meticulously the starting point, width and height for each ROI (see Table 1) in order to capture more exactly permanent and transient features (see Figure 4). In our work, FER is especially based on texture information. Hence, we are interested in analyzing specific areas of the face that are likely to present changes in texture information with facial expression. For this purpose,
it is more useful to set the width of the mouth region to the horizontal distance between the points x25 and x30 , instead of choosing the points x32 and x38 which horizontally delimit only the mouth. This choice allows to detect changes in texture information within the area between the points x25 and x32 and the area between the points x30 and x38 , especially when the face comes to expressing joy,
as reported by FACS. Indeed, the joy emotion is expressed on the face by stretching the lip corners. For the height, we chose to set it to the difference between the maximum of the points y32, y38 and y41 and the minimum of the points y32, y38 and y35, instead of simply using the difference between the points y41 and y35 which vertically delimit only the mouth. Indeed, for some people,
when they smile, the corners X32 and X38 of their mouth move above the point X35 . Furthermore, based again on FACS, sadness is expressed by lowering the lip outer corners, which justifies the choice of the maximum of the points y32 , y38 and y41 . Our aim by using the minimum of the points y32 , y38 and y35 , and maximum of the points y32 , y38 and y41 , is to consider the highest and lowest
point respectively in order to extract the region of interest that brings complete information needed for precise and reliable FER. For eyebrow region, we set the width of the left eyebrow region to the horizontal distance between the points x1 and x11 , instead of choosing the points x1 and x5 which horizontally delimit only the left eyebrow. This choice allows to detect changes in texture
information within the area between the points x5 and x11, especially when the face comes to expressing fear, anger and sadness as reported by FACS. The same
strategy is adopted for the right eyebrow. In addition to the corresponding facial component, the mouth and eyebrow regions contain a small area around them that is affected by the deformation of the mouth and eyebrows when the face comes
to express something. In contrast, for the eyes, we do not need to analyze an additional area because the required information is inside the eyes. For example, when the face expresses fear, the upper eyelids rise and the lower ones are tightened, which has the effect of displaying a superior Sanpaku. Hence our choice to use the points that delimit only the eyes.
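As an illustration of how the ROI geometry of Table 1 translates into code, here is a hedged Python sketch for two of the seven regions (mouth and left eyebrow); pts is assumed to hold the 49 IntraFace landmarks indexed from 1 as in Figure 3, and the remaining rows of Table 1 follow the same pattern.

```python
import numpy as np

def extract_rois(gray, pts):
    """Crop facial components from a gray image, following Table 1.

    pts: mapping from landmark label i (1..49, as in Figure 3) to (x_i, y_i).
    Only two ROIs are sketched here; the other rows of Table 1 are analogous.
    """
    def x(i): return pts[i][0]
    def y(i): return pts[i][1]

    def crop(x0, y0, w, h):
        x0, y0, w, h = int(x0), int(y0), int(round(w)), int(round(h))
        return gray[y0:y0 + h, x0:x0 + w]

    rois = {}

    # Mouth: start at (x25, min(y32, y35, y38)), width x30 - x25,
    # height max(y32, y38, y41) - min(y32, y35, y38)   (Table 1).
    top = min(y(32), y(35), y(38))
    rois["mouth"] = crop(x(25), top,
                         x(30) - x(25),
                         max(y(32), y(38), y(41)) - top)

    # Left eyebrow: start at (x1, y4), width x11 - x1,
    # height max(x4 - x3, y3 - y1)   (Table 1).
    rois["left_eyebrow"] = crop(x(1), y(4),
                                x(11) - x(1),
                                max(x(4) - x(3), y(3) - y(1)))

    return rois
```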
Figure 4: (a) Examples of the 49 landmarks detected by IF. (b) Examples of ROI extraction.
4. Feature extraction

Once the ROIs are extracted from a facial image, the next step is to apply a scaling procedure by modifying the ROI size. This step is crucial as each facial component (represented by a ROI) could have different sizes depending on the face shape in the image. The idea behind scaling is to get ROIs of the same
component with a close scale. To achieve that, we apply different ROI sizes
and then select the one providing better FER performances, as we will show in Section 6. After the scaling procedure, we partition each ROI into regular blocks in which features will be extracted. The partitioning step is interesting as it allows extracting local information. Once feature extraction is achieved in each
block of a ROI, the features of all blocks are concatenated to build the feature descriptor of the ROI. Finally, the feature descriptors of all ROIs are concatenated to create the face feature descriptor.
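A minimal sketch of the scaling, partitioning and concatenation steps described above, under stated assumptions: block_descriptor is one of the per-block descriptors of Sections 4.1–4.4, and the target size and block grid per ROI are taken as given (the concrete values used in the experiments are reported in Section 6).

```python
import cv2
import numpy as np

def describe_roi(roi, target_size, grid, block_descriptor):
    """Scale a ROI, split it into grid = (rows, cols) blocks and
    concatenate the per-block descriptors (e.g. LBP/LTP histograms or HOG)."""
    w, h = target_size
    roi = cv2.resize(roi, (w, h))
    rows, cols = grid
    bh, bw = h // rows, w // cols
    blocks = [roi[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
              for r in range(rows) for c in range(cols)]
    return np.concatenate([block_descriptor(b) for b in blocks])

def describe_face(rois, params, block_descriptor):
    """Concatenate the ROI descriptors into the face descriptor.
    params maps each ROI name to its (target_size, grid) configuration."""
    return np.concatenate([describe_roi(rois[name], size, grid, block_descriptor)
                           for name, (size, grid) in params.items()])
```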
In the following, we will present the different feature descriptors we used in
this study.

4.1. LBP

The LBP operator was initially proposed by Ojala et al. [13] in order to characterize the texture of an image. It consists in applying a threshold procedure to the difference between the value of each pixel and the value of the pixels of its neighborhood.
The result of this operation gives a chain of binary values (an 8-bit code in the case of a 3 × 3 neighborhood), read in clockwise direction around the central pixel. This chain of binary values can be transformed into a decimal value, representing a gray level of the LBP image.
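To make the operator concrete, here is a small, unoptimized Python sketch of the basic 3 × 3 LBP code and of the 256-bin block histogram used later as a block descriptor; it follows the textbook definition of [13], not necessarily the authors' implementation.

```python
import numpy as np

def lbp_image(gray):
    """Basic 3x3 LBP: threshold the 8 neighbors against the center pixel,
    read the bits clockwise and convert them to a decimal code."""
    g = gray.astype(np.int32)
    h, w = g.shape
    out = np.zeros((h - 2, w - 2), dtype=np.int32)
    # Clockwise neighbor offsets starting from the top-left corner.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    center = g[1:-1, 1:-1]
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = g[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        out += (neighbor >= center).astype(np.int32) * (1 << (7 - bit))
    return out.astype(np.uint8)

def lbp_histogram(block):
    """256-bin histogram of LBP codes, used as the block descriptor."""
    hist, _ = np.histogram(lbp_image(block), bins=256, range=(0, 256))
    return hist.astype(np.float32)
```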
4.2. CLBP
Ahmed et al. [15] tried to increase the robustness of LBP feature descriptor by incorporating additional local information that is ignored by the original LBP operator. They extended the basic LBP to another variant, which is called CLBP. In this variant, the comparison between the central pixel and the pixels of its neighborhood is performed using simultaneously the sign (represented
by a bit), like in LBP, and the magnitude (represented by another bit) of the difference between gray values.
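For illustration only, a hedged sketch of the sign/magnitude idea behind CLBP: each neighbor contributes a sign bit (as in LBP) and a magnitude bit obtained here by comparing the absolute difference to the local mean absolute difference. This follows the general description above and in [15]; the exact magnitude threshold used by the authors is an assumption.

```python
import numpy as np

def clbp_code(patch3x3):
    """Compound LBP code of one 3x3 patch: 8 sign bits + 8 magnitude bits."""
    p = patch3x3.astype(np.int32)
    center = p[1, 1]
    # Neighbors read clockwise from the top-left corner.
    neighbors = np.array([p[0, 0], p[0, 1], p[0, 2], p[1, 2],
                          p[2, 2], p[2, 1], p[2, 0], p[1, 0]])
    diff = neighbors - center
    sign_bits = (diff >= 0).astype(np.int32)
    # Magnitude bit: 1 when the absolute difference exceeds the local average.
    mag_bits = (np.abs(diff) > np.abs(diff).mean()).astype(np.int32)
    code = 0
    for s, m in zip(sign_bits, mag_bits):
        code = (code << 2) | (int(s) << 1) | int(m)   # two bits per neighbor
    return code                                        # value in [0, 65535]
```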
4.3. LTP and Dynamic LTP

LTP [14] is an interesting descriptor that overcomes some limits of LBP. In LTP, instead of being binarized, the sampled values take one of three values according to their distance to the value of the central pixel. In LTP, the indicator function s(x) is defined as:

s(x) = \begin{cases} 1 & \text{if } x \ge t \\ 0 & \text{if } |x| < t \\ -1 & \text{if } x \le -t \end{cases} \qquad (1)
where t is a threshold defined manually.
LTP is resistant to noise. However, it is not invariant to gray-level transformations owing to the naive choice of the threshold. To overcome this problem, many studies [43, 44, 45] proposed techniques, named dynamic LTP, for calculating a dynamic threshold based on the values of the neighboring pixels. Nevertheless, to the best of our knowledge, all the formulas proposed in the literature to calculate a dynamic threshold use a parameter called the scaling factor, whose value is constant for the whole neighborhood. In all existing works, to set this parameter, the authors chose a value between 0 and 1. In [45], the authors evaluated the performance of their proposed method considering different values for the scaling factor. In our study, we carried out large experiments using the original LTP with different strategies for threshold calculation. The first strategy is to set the threshold value from 0 to 5 [14], as proposed in many research works. The second one, proposed in [45], uses formula 1 (see Table 2), considering different values for the scaling factor δ (0.02, 0.08, 0.05, 0.1, 0.15 and 0.2). In this formula, pc denotes the value of the central pixel. In this paper, we introduce other strategies to achieve dynamic thresholding with the simple formulas 2, 3, 4 and 5 given in Table 2, where pi is the value of the ith neighbor pixel. Another important consideration discussed in this article is the manner of defining LTP. Indeed, changing the original definition of the LTP operator [14] could impact the classification results. The original definition of LTP (Eq. (1)) is to assign 0 when the absolute value of the difference is strictly lower than the threshold, while in other works [46, 47] the authors unconsciously changed the original definition of LTP by assigning 0 when the absolute value of the difference is less than or equal to the threshold (Eq. (2)). Since the representation of LTP is
based on three values, the allocation of the equal sign to one or more of the three conditions could affect the classification results. This statement led us to ask the question: which one of the LTP definitions would be adapted to our application? To answer this question, we conducted extensive experiments by considering the different definitions (Eq. (1), (2), (3) and (4)):

s(x) = \begin{cases} 1 & \text{if } x > t \\ 0 & \text{if } |x| \le t \\ -1 & \text{if } x < -t \end{cases} \qquad (2)

s(x) = \begin{cases} 1 & \text{if } x \ge t \\ 0 & \text{if } x \ge -t \text{ and } x < t \\ -1 & \text{if } x < -t \end{cases} \qquad (3)

s(x) = \begin{cases} 1 & \text{if } x > t \\ 0 & \text{if } x > -t \text{ and } x \le t \\ -1 & \text{if } x \le -t \end{cases} \qquad (4)
Table 2: Threshold formulas for the LTP operator (N: number of neighbors; in this work N = 8).

Formula 1: t = p_c \times \delta
Formula 2: t = \frac{1}{N}\sum_{i=0}^{N-1} \sqrt{p_i}
Formula 3: t = \frac{1}{N}\sum_{i=0}^{N-1} p_i
Formula 4: t = \frac{1}{N}\sqrt{\sum_{i=0}^{N-1} p_i}
Formula 5: t = \sqrt{\frac{1}{N}\sum_{i=0}^{N-1} p_i}
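The following sketch encodes one 3 × 3 neighborhood with the ternary function of Eq. (1) and shows how a dynamic threshold such as Formula 4 of Table 2 can replace the manual value t. The split of the ternary pattern into an upper and a lower binary code before histogramming is standard LTP practice rather than something specified above, and the choice of Formula 4 here is only illustrative.

```python
import numpy as np

def ltp_codes(patch3x3, t=None):
    """LTP of one 3x3 patch using Eq. (1); returns (upper, lower) binary codes."""
    p = patch3x3.astype(np.float64)
    center = p[1, 1]
    # Neighbors read clockwise from the top-left corner.
    neighbors = np.array([p[0, 0], p[0, 1], p[0, 2], p[1, 2],
                          p[2, 2], p[2, 1], p[2, 0], p[1, 0]])
    if t is None:
        # One possible dynamic threshold, Formula 4 of Table 2: sqrt(sum(p_i)) / N.
        t = np.sqrt(neighbors.sum()) / neighbors.size
    diff = neighbors - center
    # Eq. (1): 1 if diff >= t, 0 if |diff| < t, -1 if diff <= -t.
    ternary = np.where(diff >= t, 1, np.where(np.abs(diff) < t, 0, -1))
    # Standard split of the ternary pattern into two LBP-like binary codes.
    upper = sum(int(v == 1) << (7 - i) for i, v in enumerate(ternary))
    lower = sum(int(v == -1) << (7 - i) for i, v in enumerate(ternary))
    return upper, lower
```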
4.4. HOG 285
HOG descriptor, largely inspired from Scale Invariant Feature Transform (SIFT), was proposed by Dalal and Triggs [16] to address the limitations of SIFT in the case of dense grids. The purpose of HOG is to represent the appearance and shape of an object in an image by the distribution of intensity gradients or edge directions. This is accomplished by dividing the image into small connected
290
regions, called cells, and, for each cell, computing a local histogram (with 9 bins) of gradient directions for the pixels belonging to this cell. The concatenation of all local histograms forms the HOG descriptor. The local histograms are normalized to local contrast by calculating a measure of the intensity over larger spatial
regions, called blocks. This normalization leads to better invariance to changes
in illumination and shadowing.
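As an illustration only, the block-level HOG computation can be sketched with scikit-image's reference implementation (assumed available); the mapping of the cell and block sizes quoted in Section 6.2 onto these parameters is an assumption, not the authors' exact configuration.

```python
import numpy as np
from skimage.feature import hog

def hog_descriptor(block, orientations=9,
                   pixels_per_cell=(2, 2), cells_per_block=(4, 4)):
    """9-bin gradient-orientation histograms per cell, normalized per block."""
    return hog(block.astype(np.float64),
               orientations=orientations,
               pixels_per_cell=pixels_per_cell,
               cells_per_block=cells_per_block,
               block_norm="L2-Hys",
               feature_vector=True)
```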
5. SVM based classification

SVMs are among the most known classification methods. They are inspired from statistical learning theory [48]. SVMs perform linear classification based on supervised learning. We used a linear SVM classifier to avoid parameter sensitivity. Given labeled training data (x_i, y_i), i = 1, ..., p, with y_i ∈ {−1, 1}, the linear SVM solves

\min_{w} \; \frac{1}{2} w^T w + C \sum_{i=1}^{p} \xi(w, x_i, y_i) \qquad (5)

where C > 0 is a penalty parameter and w is normal to the hyperplane. The term \sum_{i=1}^{p} \xi(w, x_i, y_i) can be considered as the total misclassification error. The classification of new test data is carried out by computing the distance from the test data to
the hyperplane.
SVM is a binary classifier. For a multiclass problem, as in our case, the problem is decomposed into multiple binary classification problems, each of which is then handled by an SVM. Among the existing multiclass SVM methods, we opted for
the one-against-one based one, which is a competitive approach according to the detailed comparison conducted in [49]. The one-against-one based multiclass SVM method consists in constructing one classifier for each pair of classes, and then training k(k − 1)/2 classifiers (k is the number of classes) in order to distinguish the examples of one class from those of another class. The classification of a
new test data is determined by the class that obtains the maximum of the sum of binary classifier responses. In this work, to implement SVM classification, we used the LIBSVM library [50], which offers codes of the one-against-one based multiclass SVM method.
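For illustration, the training and prediction steps can be reproduced with scikit-learn's SVC, which wraps LIBSVM and uses the same one-against-one decomposition; this is a stand-in for the authors' C++/LIBSVM code, with C = 0.3 as reported in Section 6.2.

```python
from sklearn.svm import SVC

# X: face descriptors (one row per image), y: expression labels (7 classes).
# SVC wraps LIBSVM and trains k(k-1)/2 one-against-one binary classifiers.
def train_fer_svm(X, y, C=0.3):
    model = SVC(kernel="linear", C=C, decision_function_shape="ovo")
    model.fit(X, y)
    return model

def predict_expression(model, face_descriptor):
    return model.predict(face_descriptor.reshape(1, -1))[0]
```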
6. Experiments
6.1. Databases used To evaluate the proposed methodology, three different public databases are used in order to carry out tests in different conditions. The first database, called FEED [40], represents spontaneous emotions in image sequences. The second database, called CK [38], contains posed emotions over image sequences. The
third database, called KDEF [39], contains also posed emotions, but in independent images, i.e. one image represents one expression. Table 3 summarizes the properties of the used datasets.

Table 3: Properties of CK, FEED and KDEF datasets

Dataset                  | CK                                             | FEED                            | KDEF
Type                     | posed                                          | spontaneous                     | posed
Lighting conditions      | Uniform                                        | Uniform                         | Uniform
Face size                | We do not perform any alignment                | We do not perform any alignment | We do not perform any alignment
Angle of the face        | Frontal                                        | Frontal                         | Frontal
# of images              | 610                                            | 630                             | 280
# and names of emotions  | 6 basic expressions + neutral                  | 6 basic expressions + neutral   | 6 basic expressions + neutral
# of subjects            | 9 males / 23 females                           | 8 males / 8 females             | 20 males / 20 females
Ethnicity                | Euro-American, Afro-American, Asian and Latino | European                        | Caucasian
6.1.1. CK Database

This database contains image sequences representing six facial expressions
for male and female subjects (the number of females is greater than the number of males) of different ethnic origins such as Euro-American (81%), Afro-American (13%) and other (6%). It is composed of posed (deliberated) facial expressions. Each sequence starts with a neutral state and carries on with the expressive state. The database is made up of subjects of varying ages, skin
colors and facial conformations. The lighting condition is relatively uniform. The image sequences, which are captured with frontal views, were digitized into 640 × 490 or 640 × 480 pixels resolution. All these characteristics allowed this database to be widely used for FER evaluation. In this study, a dataset of 610 images selected from 98 image sequences, representing 32 subjects and the 7
basic facial expressions, namely neutral, joy, sadness, surprise, anger, fear and disgust, is considered. Figure 5 illustrates some images of CK database.
Figure 5: Examples of three facial expressions in the CK database. (a) Image sequence representing the sadness expression. (b) Image sequence representing the fear expression. (c) Image sequence representing the joy expression.
6.1.2. FEED Database

As in the CK database, this database contains image sequences representing the 7 basic facial expressions for male and female subjects (the number of fe-
males and males is the same). The image sequences, which are captured with frontal views, were digitized into 640 × 480 pixels resolution. Unlike CK, FEED database is much more challenging, its subjects performed spontaneous expressions and some of them are not well distinguishable. The difference between natural facial expressions and deliberated ones is important. We will show in
345
the experiments the impact of the expression type (natural/posed) on the recognition performances. In this study, the 7 basic facial expressions fill up a dataset of 630 images selected from 81 image sequences representing 16 subjects. Figure 6 shows some images of FEED database.
(a)
(b)
(c)
Figure 6: Examples of three facial expressions for the FEED database. (a) Image sequence representing sadness expression. (b) Image sequence representing fear expression. (c) Image sequence representing joy expression.
6.1.3. KDEF Database
Composed of deliberated facial expressions, this database contains images of male and female subjects belonging to Caucasian ethnicity, each subject expressing the 7 different emotions. The number of males and females is the same. We note that some subjects have moles on their face. Each posed facial expression has been captured twice for each subject from 5 different angles (-90, -45,
0, +45, +90 degrees). In this study, we built a dataset of 280 images representing 40 subjects with the 7 basic facial expressions. In this work, we considered only the images captured with the straight angle (0 degree). This choice is made considering two motivations. The first one is related to the region extraction algorithm,
which could meet difficulties to perform well for non-zero view angles. Second, we considered the same experimental methodology for all the three databases, by choosing frontal view for face capturing. Figure 7 illustrates some images of KDEF database.
Figure 7: Examples of different facial expressions from the KDEF database. (a) neutral (b) joy (c) sadness (d) surprise (e) anger (f) fear (g) disgust.
6.2. Results and Discussion
Figure 8: Regions of interest. (a) The whole face as region of interest. (b) 3 regions of interest already used in the literature [37]. (c) 6 regions of interest already used in the literature [35]. (d) Our 7 proposed regions of interest.
We applied 10-fold cross-validation to test our proposed method. Each dataset is divided into 10 person-independent subsets of equal number of images that are used to conduct 10 experiments. In each experiment, nine subsets are used for training and the remaining subset is used for the test. The 10-fold cross-validation is used to fill the confusion matrix and then the recognition rate
is obtained by calculating the F-score, which is defined as F\text{-score} = \frac{2 \cdot \mathrm{Recall} \cdot \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}}
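For clarity, a small sketch of how per-class F-scores can be computed from a confusion matrix (the recognition rates reported below are F-scores); the rows-true/columns-predicted convention is an assumption.

```python
import numpy as np

def per_class_f_scores(confusion):
    """Per-class F-scores from a confusion matrix (rows: true, cols: predicted)."""
    confusion = np.asarray(confusion, dtype=np.float64)
    recall = np.diag(confusion) / confusion.sum(axis=1)
    precision = np.diag(confusion) / confusion.sum(axis=0)
    return 2.0 * recall * precision / (recall + precision)

# Example: a 2-class matrix [[45, 5], [10, 40]] gives F-scores of about 0.86 and 0.84.
```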
[51]. The choice of the size and number of blocks of each ROI, as well as of the descriptor, generally impacts the recognition performance. In our experiments, each ROI is divided into W × H blocks for descriptor computation. The best parameters are set
by performing several tests. For test and evaluation, we considered three ROI face decompositions with different parameter configurations, i.e. the sizes of ROI and the number of blocks in each ROI, as shown in Figures 9, 10 and 11. The first one considers the whole face as a single ROI (see Figure 8 (a)), as proposed in [15, 23]. In the second decomposition, the face is composed of six
380
ROIs (see Figure 8 (c)), as proposed in [35]. The third decomposition is the one we proposed, and which considers seven ROIs, as shown in Figure 8 (d). In [37], the authors proposed another face decomposition with three ROIs (see Figure 8 (b)). The performances of the face decompositions with three ROIs (see Figure
8 (b)) and six ROIs (see Figure 8 (c)) are close, with small advantage to the 385
second decomposition (six ROIs). Indeed the best recognition rates provided by the face decomposition with three ROIs versus six ROIs are 93.34% (VS 93.55%), 82.17% (VS 83.04%), and 92.6% (VS 92.53%) on CK, FEED and KDEF datasets, respectively. For conciseness, we opted to not give result details for this decomposition (three ROIs) as we did for the other decompositions.
The experiment settings are as follows:

• ROI parameters: ROI size and number of blocks in each ROI (see Figures 9, 10 and 11). Figures 9, 10 and 11 provide, for each tested dataset, only the parameter configurations that yield high recognition rates: given a dataset, and for each tested descriptor, we conserve the parameter config-
uration that yields the higher recognition rate.

• Feature descriptor parameters: for LBP, CLBP and LTP, we used a radius of 1 pixel, a neighborhood of 8 pixels and a histogram of 256 bins. Concerning the LTP descriptor, the threshold is set manually or automatically using the formulas given in Table 2. We mention in the results analysis how we set
the threshold. When set manually, the threshold value is given. For HOG, we used blocks of 8 × 8 pixels, cells of 2 × 2 pixels and a 9-bin histogram for each block.

• SVM parameters: we used a linear kernel with parameter C = 0.3, selected by performing a grid-search in 10-fold cross-validation.
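A hedged sketch of the C selection described in the settings above, using scikit-learn's grid search with 10-fold cross-validation as a stand-in for the authors' procedure; the candidate grid shown is illustrative.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def select_svm_c(X, y, candidates=(0.01, 0.03, 0.1, 0.3, 1.0, 3.0)):
    """Pick the penalty parameter C by 10-fold cross-validated grid search."""
    search = GridSearchCV(SVC(kernel="linear"),
                          param_grid={"C": list(candidates)},
                          cv=10, scoring="f1_macro")
    search.fit(X, y)
    return search.best_params_["C"]
```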
6.2.1. Experimental results using the whole face as one ROI

Figure 9 presents the results obtained when using the whole face as one ROI. For each dataset, we considered the texture descriptors (LBP, CLBP and LTP) (see Figure 9 (a)), the shape descriptor (HOG) (see Figure 9 (b)) and their concatenation (see Figure 9 (c)). As we can see in Figure 9, the HOG descriptor provides the
best recognition rate (93.24%) when the CK dataset is tested with parameter configuration 1 (see Figure 9 (b)). We can also see, for the same dataset, that
when the LTP descriptor is involved (solely or in hybrid form with the HOG descriptor), the recognition rate is interesting for almost all parameter configurations, and often close to the highest recognition rate given by the HOG descriptor. However,
when the FEED and KDEF datasets are tested, the hybrid descriptor LTP + HOG shows the highest recognition rate, when compared to the other descriptors. Indeed, we obtain 85.32% on the FEED dataset with parameter configuration 3 (see Figure 9 (c)). Note that the LTP used here is defined using Eq. (1) with a fixed threshold selected experimentally (t = 1). For the KDEF dataset, we obtain
a recognition rate of 92.19%, with the parameters configuration 2 (see Figure 9 (c)). Here, LTP is defined using Eq. (1) with a threshold procedure based on formula 3 (see Table 2).
Figure 9: Recognition rate for ROI=1 using different descriptors with different configurations of size and number of blocks (results reported as F-scores for the CK, FEED and KDEF datasets). (a) Texture descriptors (three configurations). (b) HOG descriptor (two configurations). (c) Hybrid method (three configurations).
6.2.2. Experimental results using the six regions of interest (number of ROI = 6) 425
In this section, we analyze the performances of the descriptors using face decomposition with 6 ROIs. As we can see in Figure 10, applied on CK dataset, the hybrid descriptor (LTP + HOG) outperforms all the other descriptors in all parameter configurations. Here, LTP is computed using Eq. (4) with threshold formula 1 (see Table 2). The hybrid descriptor (LTP + HOG) was able
to slightly increase the recognition rate to 93.55% (see Figure 10 (c)), when compared to the results of the previous case, i.e. the face as a whole ROI (number of ROI = 1), where the HOG descriptor had shown its effectiveness compared to all the other descriptors (see Figure 9). Concerning the KDEF dataset, the LTP descriptor, defined with Eq. (2) and threshold formula 4 (see Table 2), provides the best
results (see Figure 10 (a)). It slightly increased the recognition rate to 92.53%, when compared to the results of the previous case (number of ROI = 1), where the hybrid descriptors provided the best results, in particular the combination of HOG and LTP (see Figure 9 (c)). For FEED dataset, one can note globally from Figure 10 a significant decrease in the recognition rate to 83.04%,
when compared to the results of the previous case where the number of ROI = 1 (see Figure 9). The decrease in the recognition rate concerns the texture and hybrid descriptors. However, the HOG descriptor increases the recognition rate, when compared to the results of the previous case (number of ROI = 1) with the same descriptor, and provides the best recognition rate (83.04%) when compared to
all the tested descriptors (see Figure 10).
Figure 10: Recognition rate for ROI=6 using different descriptors with different configurations of size and number of blocks (results reported as F-scores for the CK, FEED and KDEF datasets). (a) Texture descriptors (two configurations). (b) HOG descriptor (two configurations). (c) Hybrid method (two configurations).
6.2.3. Experimental results using specific well defined regions of interest (number of ROI = 7)

As we can see in Figure 11, a significant increase in the recognition rate for all tested datasets is obtained with our proposed facial decomposition (number of
ROI = 7), when compared to the previous ones (number of ROI = 1 and number of ROI = 6). Indeed, our facial decomposition provided the best recognition rates, which are obtained mainly with the hybrid descriptors. In particular, the combination of LTP and HOG achieved the high recognition rates of 96.06%, 92.03% and 93.34% for CK, FEED and KDEF datasets respectively. In the
descriptor combination, the LTP that allowed these best results are defined with Eq. (4) associated to threshold formula 1, Eq. (4) associated to threshold formula 3, and Eq. (2) associated to threshold formula 1, for CK, FEED and KDEF datasets respectively. The improvements given by our proposed facial decomposition was possible thanks to the relevant selected ROIs, the location
and proper cropping of facial components.
Figure 11: Recognition rate for ROI=7 using different descriptors with different configurations of size and number of blocks (results reported as F-scores for the CK, FEED and KDEF datasets). (a) Texture descriptors (two configurations). (b) HOG descriptor (two configurations). (c) Hybrid method (three configurations).
Tables 4, 5 and 6 summarize the best results obtained for CK, FEED and KDEF datasets, respectively, through the different descriptors and the different facial decompositions (number of ROI = 1, 6 and 7). Through these tables, we can see that the proposed method (face decomposition with number of ROI = 7) 465
outperforms all the others when using almost all the descriptors and for all the tested datasets. The tables show that hybrid descriptors demonstrated generally better results with respect to elementary ones, especially for our proposed facial decomposition. Concerning the datasets variability, one can see in Figures 9, 10 and 11, that on one hand, the recognition rates for FEED dataset, composed of
spontaneous facial expressions, are lower than those obtained for CK and KDEF datasets, composed of posed facial expressions. The main reason is that posed facial expressions are easy to detect, when compared to spontaneous expressions, which are generally acquired in non-controlled experience protocol. On the other hand, the recognition rates for KDEF dataset are low, when compared to those
obtained for CK dataset. This is due to the difference of the construction of each database. Indeed, CK dataset is composed of image sequences, while KDEF dataset is composed of independent images.
Table 4: Comparison of the performance of ROI=1, ROI=6 and ROI=7 for the CK dataset and all the tested descriptors

Descriptor | ROI=1 | ROI=6 | ROI=7
LBP        | 89.42 | 90.17 | 91.33
CLBP       | 87.13 | 86.6  | 91.26
LTP        | 92.44 | 93.1  | 94.29
HOG        | 93.24 | 88.48 | 90.28
LBP+HOG    | 89.5  | 89.53 | 93.75
CLBP+HOG   | 87.31 | 89.58 | 94.43
LTP+HOG    | 92.89 | 93.55 | 96.06
Table 5: Comparison of the performance of ROI=1, ROI=6 and ROI=7 for the FEED dataset and all the tested descriptors

Descriptor | ROI=1 | ROI=6 | ROI=7
LBP        | 84.17 | 72.41 | 86.96
CLBP       | 75.01 | 71.29 | 81.7
LTP        | 84.75 | 78.69 | 89.25
HOG        | 76.13 | 83.04 | 86.4
LBP+HOG    | 85.26 | 77.43 | 89.89
CLBP+HOG   | 79.09 | 76.84 | 85.66
LTP+HOG    | 85.32 | 80.16 | 92.03
Table 6: Comparison of the performance of ROI=1, ROI=6 and ROI=7 for the KDEF dataset and all the tested descriptors

Descriptor | ROI=1 | ROI=6 | ROI=7
LBP        | 90.44 | 87.95 | 91.87
CLBP       | 90.43 | 81.87 | 88.66
LTP        | 92.12 | 92.53 | 92.87
HOG        | 89.63 | 91.06 | 88.25
LBP+HOG    | 90.44 | 90.74 | 92.2
CLBP+HOG   | 91.85 | 89    | 92.59
LTP+HOG    | 92.19 | 91.81 | 93.34
6.2.4. Confusion matrices

This section analyzes the results through confusion matrices. Tables 7, 8
and 9 show the confusion matrices (corresponding to the high recognition rate reached) for CK, FEED and KDEF datasets respectively. As one can see in Table 7, we achieved very good results for the majority of emotions with an average rate close to 100%, with the exception of neutral and anger emotions that are recognized with average rates of 86% and 90% respectively. Neutral
expression is misclassified as the fear expression with an error rate of 9%. For the FEED dataset (see Table 8), we can observe that one of the main reasons for the decrease in the recognition rate is the misclassification of the sadness, anger and fear expressions as the neutral expression, which is recognized with a lower precision of 82.13% (computed from the neutral column as 95/(95+7+5+6.66+2)), and the
confusion between the fear and surprise expressions and between the disgust and joy/anger expressions. For KDEF dataset (see Table 9), neutral, joy and surprise emotions are recognized with the maximum rate of 100%. The other emotions are classified with an average recognition rate between 85% and 90%. One can see also that expression confusion is more present, with more or less
low percentages. Tables 8 and 9 also show that the fear expression is confused with the surprise expression with an error rate of 10% and 7.5% for the FEED and KDEF datasets respectively. This misclassification is maybe due to similar face
deformation caused by these expressions.

Table 7: Confusion Matrix for CK dataset (associated to the high recognition rate: 96.06%)

         | Neutral | Joy | Sad  | Surprise | Anger | Fear | Disgust
Neutral  | 86      | 0   | 1    | 0        | 3     | 9    | 1
Joy      | 0       | 100 | 0    | 0        | 0     | 0    | 0
Sad      | 1       | 0   | 99   | 0        | 0     | 0    | 0
Surprise | 3       | 0   | 0    | 97       | 0     | 0    | 0
Anger    | 2.22    | 0   | 4.44 | 0        | 90    | 3.33 | 0
Fear     | 0       | 0   | 0    | 0        | 0     | 100  | 0
Disgust  | 0       | 0   | 0    | 0        | 0     | 0    | 100
Table 8: Confusion Matrix for FEED dataset (associated to the high recognition rate: 92.03%)

         | Neutral | Joy | Sad | Surprise | Anger | Fear  | Disgust
Neutral  | 95      | 0   | 2   | 0        | 0     | 3     | 0
Joy      | 0       | 94  | 0   | 6        | 0     | 0     | 0
Sad      | 7       | 0   | 93  | 0        | 0     | 0     | 0
Surprise | 0       | 0   | 0   | 98       | 0     | 2     | 0
Anger    | 5       | 0   | 0   | 0        | 93    | 2     | 0
Fear     | 6.66    | 0   | 0   | 10       | 0     | 83.33 | 0
Disgust  | 2       | 6   | 1   | 0        | 5     | 0     | 86
Table 9: Confusion Matrix of KDEF dataset (associated to the high recognition rate: 93.34%)

         | Neutral | Joy | Sad | Surprise | Anger | Fear | Disgust
Neutral  | 100     | 0   | 0   | 0        | 0     | 0    | 0
Joy      | 0       | 100 | 0   | 0        | 0     | 0    | 0
Sad      | 5       | 0   | 90  | 0        | 2.5   | 0    | 2.5
Surprise | 0       | 0   | 0   | 100      | 0     | 0    | 0
Anger    | 2.5     | 0   | 2.5 | 2.5      | 87.5  | 0    | 5
Fear     | 0       | 0   | 2.5 | 7.5      | 0     | 90   | 0
Disgust  | 7.5     | 0   | 5   | 0        | 2.5   | 0    | 85
6.2.5. Processing time
Processing time (in ms), required to handle one image with all the steps of the proposed method (face detection, landmarks detection, regions extraction, conversion RGB to gray, scaling and partitioning, features computation, SVM prediction), is reported in the Table 10. It has been computed considering the average of the processing of 100 frames, on an Intel(R) Core(TM) i5-2430M
CPU 2.40 GHz based Windows machine. The algorithm codes are written in C++.

Table 10: Processing time (ms) for CK, FEED and KDEF datasets

          | CK     | FEED   | KDEF
HOG       | 196.24 | 201.30 | 228.38
LTP       | 214.96 | 220.02 | 242.89
HOG+LTP   | 236.18 | 244.61 | 266.60
6.3. Comparison with the state of the art

This section is dedicated to the comparison with state-of-the-art methods. The comparison process, in the FER field, faces different difficulties such as the
absence of a common evaluation protocol, the fact that the shared databases require an experimental setting for selecting images, and the fact that the algorithm codes of the existing methods are not accessible and their reimplementation could generate errors. In order to
deal with these difficulties and make a fair comparison, we carried out different experiments on the CK+ database [52], which is used in all the compared methods,
by following the same protocols for images selection and k-fold cross-validation as performed in the works considered for comparison [20, 36, 29, 27, 28]. The first protocol, applied in [29, 27, 28], consists of selecting in each sequence the first image for neutral expression and three peak frames for target one. We recall that each sequence represents the target expression starting by the neutral one.
The second protocol, used in [36], selects in each sequence the last peak frame for target expression. In[36], the neutral expression is not considered in the recognition process. The third one, applied in [20], takes two peak frames for anger, fear and sadness expressions, last peak frame for disgust, happy, and surprise expressions, and first frame for neutral expression in few sequences. Tables 11
and 12 report the recognition rates of our proposed method and the compared methods, applied on CK+7 (all the emotion expressions including the neutral one) and CK+6 (excluding the neutral expression), respectively. Table 13 summarizes the parameter values that allowed our method to reach the best results. For all the methods in comparison, we considered the recognition performances
reported in their referenced paper (see Tables 11 and 12). As shown in Table 11, our proposed method achieved the best recognition rates among all the compared method independently of their features category and used procedure of face registration. Indeed, compared to [28], our method provided an accuracy of 96.03% VS 93.7%. When compared to [20], the proposed method reached
94.48% in accuracy and 94.52% in F-score VS 90.08% and 90.64%. From Table 12, we can observe that our method exceeded the method that extracts features from small patches located around facial landmarks [36] and outperformed also the ones using BDBN [27] and appearance and geometric features [20]. We note that the BDBN method [27] reached an F-score of 83.4%, which is lower com-
pared to the one (96.9%) provided by our method. Compared to our method, the method [29] achieved the better accuracy of 98.3% VS 96.77% thanks to data augmentation procedure by applying a random transformation to each input image (translations, horizontal flips, rotations, scaling, and pixel intensity 34
augmentation). It should be noted that our method, unlike all the compared 545
methods, did not use any preprocessing such as face alignment, patch-wise mean subtraction or data augmentation. Furthermore, our method needs less memory and computational cost unlike the deep learning methods [27, 31]. As we can see from Table13, there are three parameter settings (a, b and c) that allowed our method to achieve the best results, which outperform all the
compared state of the art methods, except one of them to which the proposed method remains competitive. However, if we consider only the parameters setting b (see Table 13) as standard one for our method, the results are still better than those of the compared state of the art methods, regardless of the used experimental protocol and the considered number of expressions (6 or 7), as we
can see in Tables 11 and 12.

Table 11: Comparison of different methods on the CK+ database with 7 expressions. a and b are the references of the parameter values (see Table 13) that allowed our method to reach the best results using the experimental protocols in [28] and [20], respectively.

Method              | Category                 | Face registration | Experiment protocol | F-score         | Accuracy
Liu et al. [28]     | Deep learning            | Whole face        | [28]                | N/A             | 93.7
Ours                | Appearance               | ROI               | [28]                | 94.63a (94.19b) | 96.03a (95.47b)
Ghimire et al. [20] | Appearance and Geometric | ROI               | [20]                | 90.64           | 90.08
Ours                | Appearance               | ROI               | [20]                | 94.52b          | 94.48b
Table 12: Comparison of different methods on the CK+ database with 6 expressions. b and c are the references of the parameter values (see Table 13) that allowed our method to reach the best results using the experimental protocols in ([29], [27], [20]) and [36], respectively.

Method                 | Category                 | Face registration | Experiment protocol | F-score        | Accuracy
Khorrami et al. [29]   | Deep learning            | Whole face        | [29]                | N/A            | 98.3
Ours                   | Appearance               | ROI               | [29]                | 96.01b         | 96.77b
Liu et al. [27]        | Deep learning            | Whole face        | [27]                | 83.4           | 96.7
Ours                   | Appearance               | ROI               | [27]                | 96.9b          | 97.52b
Ghimire et al. [20]    | Appearance and Geometric | ROI               | [20]                | 94.24          | 94.1
Ours                   | Appearance               | ROI               | [20]                | 95.8b          | 95.58b
Happy and Routray [36] | Appearance               | Patch             | [36]                | 94.39          | 94.14
Ours                   | Appearance               | ROI               | [36]                | 96.3c (95.13b) | 97.18c (96.25b)
Table 13: Optimized parameter values taken by our method to reach the best results (see Tables 11 and 12) on CK+ dataset.
Reference | Descriptor | LTP / Thresholding                | Configuration of LTP               | Configuration of HOG
a         | LTP+HOG    | Eq. (3) / formula 3 (see Table 2) | Configuration 1 (see Figure 11(a)) | Configuration 1 (see Figure 11(b))
b         | LTP+HOG    | Eq. (1) / formula 3 (see Table 2) | Configuration 1 (see Figure 11(a)) | Configuration 1 (see Figure 11(b))
c         | LTP+HOG    | Eq. (4) / fixed threshold (t=2)   | Configuration 2 (see Figure 11(a)) | Configuration 1 (see Figure 11(b))
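To make the settings of Table 13 more concrete, the sketch below illustrates how a hybrid LTP+HOG descriptor can be computed for one block of a resized ROI. It is a minimal illustration, not the authors' code: it assumes the standard LTP operator of Tan and Triggs [14] with a fixed threshold t (the paper's own LTP variants, Eqs. (1), (3) and (4) with the thresholding formulas of Table 2, may differ), an 8-neighbour 3x3 window, and the HOG implementation of scikit-image with illustrative parameters rather than the exact configurations of Figure 11.

```python
# Minimal sketch of an LTP+HOG block descriptor (assumptions: standard LTP with a
# fixed threshold t, 3x3 neighbourhood, scikit-image HOG with illustrative settings).
import numpy as np
from skimage.feature import hog

def ltp_codes(block, t=2):
    """Upper/lower LTP codes from the 8-neighbour 3x3 pattern around each pixel."""
    h, w = block.shape
    # offsets of the 8 neighbours, clockwise from top-left
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    center = block[1:h - 1, 1:w - 1].astype(np.int32)
    upper = np.zeros_like(center)
    lower = np.zeros_like(center)
    for k, (dy, dx) in enumerate(offs):
        neigh = block[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx].astype(np.int32)
        upper += (neigh >= center + t).astype(np.int32) << k   # +1 states -> upper code
        lower += (neigh <= center - t).astype(np.int32) << k   # -1 states -> lower code
    return upper, lower

def block_descriptor(block, t=2):
    """Concatenate the upper/lower LTP histograms with the HOG of the same block."""
    upper, lower = ltp_codes(block, t)
    h_up, _ = np.histogram(upper, bins=256, range=(0, 256))
    h_lo, _ = np.histogram(lower, bins=256, range=(0, 256))
    hog_vec = hog(block.astype(np.float64), orientations=9, pixels_per_cell=(8, 8),
                  cells_per_block=(2, 2), block_norm='L2-Hys')
    return np.concatenate([h_up, h_lo, hog_vec]).astype(np.float32)

# Example on one hypothetical 32x32 grayscale block taken from a resized ROI
rng = np.random.default_rng(0)
block = rng.integers(0, 256, size=(32, 32), dtype=np.uint8)
print(block_descriptor(block).shape)
```

In the full pipeline, such block descriptors would be concatenated over all blocks of the seven ROIs before being fed to the multiclass SVM.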
6.4. Cross-database evaluation
We evaluated the generalization ability of our method across different databases by carrying out six experiments. In each experiment, we performed the training on one dataset and tested on the other two datasets (see Table 14).
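The sketch below illustrates this cross-database protocol. It is a minimal illustration under stated assumptions rather than the authors' implementation: the feature matrices are random placeholders standing in for the extracted face descriptors, the SVM hyperparameters are purely illustrative, and scikit-learn's SVC (itself built on LIBSVM [50]) is used in place of LIBSVM directly; the F-score is macro-averaged, following [51].

```python
# Minimal sketch of the cross-database protocol: train a multiclass SVM on one
# dataset and test on each of the other two, reporting accuracy and macro F-score.
from itertools import permutations
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(0)

def fake_dataset(n_samples, n_features=324, n_classes=7):
    """Placeholder data standing in for real face descriptors and expression labels."""
    X = rng.normal(size=(n_samples, n_features)).astype(np.float32)
    y = rng.integers(0, n_classes, size=n_samples)   # 6 basic expressions + neutral
    return X, y

datasets = {"CK": fake_dataset(300), "KDEF": fake_dataset(280), "FEED": fake_dataset(200)}

for train_name, test_name in permutations(datasets, 2):     # the six train/test pairs
    X_tr, y_tr = datasets[train_name]
    X_te, y_te = datasets[test_name]
    clf = SVC(kernel="rbf", C=10.0, gamma="scale",
              decision_function_shape="ovo")                 # one-vs-one multiclass SVM
    clf.fit(X_tr, y_tr)
    y_pred = clf.predict(X_te)
    acc = 100.0 * accuracy_score(y_te, y_pred)
    f1 = 100.0 * f1_score(y_te, y_pred, average="macro")     # macro-averaged F-score
    print(f"train {train_name} / test {test_name}: accuracy {acc:.2f}, F-score {f1:.2f}")
```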
As we can see from Table 14, our method achieves encouraging results. In particular, when the model is trained on the KDEF dataset (posed emotions), the results on the other two datasets (spontaneous or posed emotions) are very interesting. This suggests that training the model on the KDEF dataset is valuable for recognizing both spontaneous and posed emotions. We can also see that the model behaves relatively well when it is trained and tested on posed emotions (CK/KDEF and KDEF/CK cases).

Table 14: Cross-database performance on CK, KDEF, and FEED datasets.
Train    | CK    | CK    | KDEF  | KDEF  | FEED  | FEED
Test     | FEED  | KDEF  | CK    | FEED  | CK    | KDEF
Accuracy | 68.41 | 79.28 | 78.85 | 79.52 | 58.36 | 67.85
F-score  | 70.41 | 79.35 | 77.14 | 74.17 | 57.83 | 70.04
7. Conclusion

A new facial decomposition for expression recognition is presented. The method first extracts regions of interest (ROI) using landmarks given by the IntraFace algorithm. After the ROI preprocessing stage, feature extraction is performed to construct the face descriptor. A multiclass SVM classifier is finally trained and used for FER. Several texture (LBP, CLBP, LTP) and shape (HOG) descriptors, as well as their combinations, are tested and evaluated. To demonstrate its performance, the proposed facial decomposition is compared with existing ones using three public datasets. The results showed that the new facial decomposition significantly improves the recognition rate. This improvement stems from two factors. First, the proposed decomposition extracts relevant and precise facial components, which are involved in the expression of emotions. Second, exploiting both texture and shape information contributed to this improvement: the descriptor evaluation demonstrated that hybrid descriptors constructed through heterogeneous concatenation of texture and shape features perform best, in particular the concatenation of LTP and HOG. However, the optimal size (after scaling) of the ROIs varies according to the training data; it is therefore difficult to move towards a fully generic system. For future work, we plan to consider other sophisticated hand-crafted and deep learning descriptors. We are also interested in exploiting FER with multiple observations in the training and testing phases. Another perspective is to extend the developed framework to facial images acquired with uncontrolled viewpoints.
Acknowledgements

This research work is part of the Volubilis project registered under Volubilis MA/14/302. The authors would like to thank the Franco-Moroccan Volubilis program for its support. The authors also thank the anonymous reviewers for their helpful comments and suggestions.
References

[1] S. Shakya, S. Sharma, A. Basnet, Human behavior prediction using facial expression analysis, in: Computing, Communication and Automation (ICCCA), 2016 International Conference on, IEEE, 2016, pp. 399–404.
[2] A. A. Gunawan, et al., Face expression detection on Kinect using active appearance model and fuzzy logic, Procedia Computer Science 59 (2015) 268–274.
[3] R. Shbib, S. Zhou, Facial expression analysis using active shape model, Int. J. Signal Process. Image Process. Pattern Recognit. 8 (1) (2015) 9–22.
[4] P. Ekman, An argument for basic emotions, Cognition & Emotion 6 (3-4) (1992) 169–200.
[5] U. X. Eligio, S. E. Ainsworth, C. K. Crook, Emotion understanding and performance during computer-supported collaboration, Computers in Human Behavior 28 (6) (2012) 2046–2054.
[6] G. Molinari, C. Bozelle, D. Cereghetti, G. Chanel, M. Bétrancourt, T. Pun, Feedback émotionnel et collaboration médiatisée par ordinateur: Quand la perception des interactions est liée aux traits émotionnels, in: Environnements Informatiques pour l'Apprentissage Humain, Actes de la conférence EIAH, 2013, pp. 305–326.
[7] K. Lekdioui, R. Messoussi, Y. Chaabi, Etude et modélisation des comportements sociaux d'apprenants à distance, à travers l'analyse des traits du visage, in: 7ème Conférence sur les Environnements Informatiques pour l'Apprentissage Humain (EIAH 2015), 2015, pp. 411–413.
[8] M.-T. Yang, Y.-J. Cheng, Y.-C. Shih, Facial expression recognition for learning status analysis, in: International Conference on Human-Computer Interaction, Springer, 2011, pp. 131–138.
[9] R. Nkambou, V. Heritier, Reconnaissance émotionnelle par l'analyse des expressions faciales dans un tuteur intelligent affectif, in: Technologies de l'Information et de la Connaissance dans l'Enseignement Supérieur et l'Industrie, Université de Technologie de Compiègne, 2004, pp. 149–155.
[10] R. Li, P. Liu, K. Jia, Q. Wu, Facial expression recognition under partial occlusion based on Gabor filter and gray-level co-occurrence matrix, in: Computational Intelligence and Communication Networks (CICN), 2015 International Conference on, IEEE, 2015, pp. 347–351.
[11] G. Stratou, A. Ghosh, P. Debevec, L.-P. Morency, Effect of illumination on automatic expression recognition: a novel 3D relightable facial database, in: Automatic Face & Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference on, IEEE, 2011, pp. 611–618.
[12] A. Samal, P. A. Iyengar, Automatic recognition and analysis of human faces and facial expressions: A survey, Pattern Recognition 25 (1) (1992) 65–77.
[13] T. Ojala, M. Pietikäinen, D. Harwood, A comparative study of texture measures with classification based on featured distributions, Pattern Recognition 29 (1) (1996) 51–59.
[14] X. Tan, B. Triggs, Enhanced local texture feature sets for face recognition under difficult lighting conditions, IEEE Transactions on Image Processing 19 (6) (2010) 1635–1650.
[15] F. Ahmed, H. Bari, E. Hossain, Person-independent facial expression recognition based on compound local binary pattern (CLBP), Int. Arab J. Inf. Technol. 11 (2) (2014) 195–203.
[16] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), Vol. 1, IEEE, 2005, pp. 886–893.
[17] G. Lei, X.-h. Li, J.-l. Zhou, X.-g. Gong, Geometric feature based facial expression recognition using multiclass support vector machines, in: Granular Computing, 2009, GRC'09. IEEE International Conference on, IEEE, 2009, pp. 318–321.
[18] A. Poursaberi, H. A. Noubari, M. Gavrilova, S. N. Yanushkevich, Gauss–Laguerre wavelet textural feature fusion with geometrical information for facial expression identification, EURASIP Journal on Image and Video Processing 2012 (1) (2012) 1–13.
[19] V. Rapp, T. Sénéchal, L. Prevost, K. Bailly, H. Salam, R. Seguier, Combinaison de descripteurs hétérogènes pour la reconnaissance de micro-mouvements faciaux, in: RFIA 2012 (Reconnaissance des Formes et Intelligence Artificielle), 2012, pp. 978–2.
[20] D. Ghimire, S. Jeong, S. Yoon, J. Choi, J. Lee, Facial expression recognition based on region specific appearance and geometric features, in: Digital Information Management (ICDIM), 2015 Tenth International Conference on, IEEE, 2015, pp. 142–147.
[21] T. Gritti, C. Shan, V. Jeanne, R. Braspenning, Local features based facial expression recognition with face registration errors, in: Automatic Face & Gesture Recognition, 2008. FG'08. 8th IEEE International Conference on, IEEE, 2008, pp. 1–8.
[22] C. Shan, S. Gong, P. W. McOwan, Facial expression recognition based on local binary patterns: A comprehensive study, Image and Vision Computing 27 (6) (2009) 803–816.
[23] P. Carcagnì, M. Coco, M. Leo, C. Distante, Facial expression recognition and histograms of oriented gradients: a comprehensive study, SpringerPlus 4 (1) (2015) 1.
[24] M. Valstar, M. Pantic, Fully automatic facial action unit detection and temporal analysis, in: 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06), IEEE, 2006, pp. 149–149.
[25] Z. Zhang, M. Lyons, M. Schuster, S. Akamatsu, Comparison between geometry-based and Gabor-wavelets-based facial expression recognition using multi-layer perceptron, in: Automatic Face and Gesture Recognition, 1998. Proceedings. Third IEEE International Conference on, IEEE, 1998, pp. 454–459.
[26] L. Zhang, D. Tjondronegoro, Facial expression recognition using facial movement features, IEEE Transactions on Affective Computing 2 (4) (2011) 219–229.
[27] P. Liu, S. Han, Z. Meng, Y. Tong, Facial expression recognition via a boosted deep belief network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1805–1812.
[28] M. Liu, S. Li, S. Shan, X. Chen, AU-inspired deep networks for facial expression feature learning, Neurocomputing 159 (2015) 126–136.
[29] P. Khorrami, T. Paine, T. Huang, Do deep neural networks learn facial action units when doing expression recognition?, in: Proceedings of the IEEE International Conference on Computer Vision Workshops, 2015, pp. 19–27.
[30] T. Hinz, P. Barros, S. Wermter, The effects of regularization on learning facial expressions with convolutional neural networks, in: International Conference on Artificial Neural Networks, Springer, 2016, pp. 80–87.
[31] B. Liu, M. Wang, H. Foroosh, M. Tappen, M. Pensky, Sparse convolutional neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 806–814.
[32] P. Ekman, W. V. Friesen, Facial Action Coding System.
[33] L. Wang, R. Li, K. Wang, A novel automatic facial expression recognition method based on AAM, Journal of Computers 9 (3) (2014) 608–617.
[34] J. Chen, Z. Chen, Z. Chi, H. Fu, Facial expression recognition based on facial components detection and HOG features, in: International Workshops on Electrical and Computer Engineering Subfields, 2014, pp. 884–888.
[35] M. M. Donia, A. A. Youssif, A. Hashad, Spontaneous facial expression recognition based on histogram of oriented gradients descriptor, Computer and Information Science 7 (3) (2014) 31.
[36] S. Happy, A. Routray, Automatic facial expression recognition using features of salient facial patches, IEEE Transactions on Affective Computing 6 (1) (2015) 1–12.
[37] A. A. Youssif, W. A. Asker, Automatic facial expression recognition system based on geometric and appearance features, Computer and Information Science 4 (2) (2011) 115.
[38] T. Kanade, J. F. Cohn, Y. Tian, Comprehensive database for facial expression analysis, in: Automatic Face and Gesture Recognition, 2000. Proceedings. Fourth IEEE International Conference on, IEEE, 2000, pp. 46–53.
[39] D. Lundqvist, A. Flykt, A. Öhman, The Karolinska Directed Emotional Faces (KDEF), CD ROM from Department of Clinical Neuroscience, Psychology section, Karolinska Institutet (1998) 91–630.
[40] F. Wallhoff, Facial expressions and emotion database, Technische Universität München.
[41] X. Xiong, F. De la Torre, Supervised descent method and its applications to face alignment, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 532–539.
[42] G. Bradski, et al., The OpenCV library, Dr. Dobb's Journal 25 (11) (2000) 120–126.
[43] X. Wu, J. Sun, G. Fan, Z. Wang, Improved local ternary patterns for automatic target recognition in infrared imagery, Sensors 15 (3) (2015) 6399–6418.
[44] A. A. Mohamed, R. V. Yampolskiy, Adaptive extended local ternary pattern (AELTP) for recognizing avatar faces, in: Machine Learning and Applications (ICMLA), 2012 11th International Conference on, Vol. 1, IEEE, 2012, pp. 57–62.
[45] M. Ibrahim, M. Alam Efat, S. Kayesh, S. M. Khaled, M. Shoyaib, M. Abdullah-Al-Wadud, Dynamic local ternary pattern for face recognition and verification, in: Proceedings of the International Conference on Computer Engineering and Applications, Tenerife, Spain, Vol. 1012, 2014.
[46] A. Mignon, Apprentissage de métriques et méthodes à noyaux appliqués à la reconnaissance de personnes dans les images, Ph.D. thesis, Université de Caen (2012).
[47] W.-H. Liao, Region description using extended local ternary patterns, in: Pattern Recognition (ICPR), 2010 20th International Conference on, IEEE, 2010, pp. 1003–1006.
[48] V. N. Vapnik, An overview of statistical learning theory, IEEE Transactions on Neural Networks 10 (5) (1999) 988–999.
[49] C.-W. Hsu, C.-J. Lin, A comparison of methods for multiclass support vector machines, IEEE Transactions on Neural Networks 13 (2) (2002) 415–425.
[50] C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, ACM Transactions on Intelligent Systems and Technology (TIST) 2 (3) (2011) 27.
[51] M. Sokolova, G. Lapalme, A systematic analysis of performance measures for classification tasks, Information Processing & Management 45 (4) (2009) 427–437.
[52] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, I. Matthews, The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression, in: Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, IEEE, 2010, pp. 94–101.
Highlights

- An automatic and appropriate facial decomposition with ROIs for FER.
- Emotion representation using texture and shape-based descriptors and their combination.
- LTP descriptor analysis with new definitions and strategies for thresholding.
- Comprehensive evaluation and comparison to the state of the art on 4 public databases.
- Our method outperforms all the compared methods except one, with which it remains competitive.