Facial decomposition for expression recognition using texture/shape descriptors and SVM classifier

Khadija Lekdioui a,b,∗, Rochdi Messoussi b, Yassine Ruichek a, Youness Chaabi b, Raja Touahni b

a Le2i FRE2005, CNRS, Arts et Métiers, Univ. Bourgogne Franche-Comté, UTBM, F-90010 Belfort, France
b Laboratoire des Systèmes de Télécommunication et Ingénierie de la Décision (LASTID), Université Ibn Tofail, BP 133, Kenitra 14000, Maroc
Abstract

Automatic facial expression analysis is a challenging topic in computer vision due to its complexity and its important role in many applications such as human-computer and social interaction. This paper presents a Facial Expression Recognition (FER) method based on an automatic and more efficient facial decomposition into regions of interest (ROI). First, seven ROIs, representing more precisely the facial components involved in the expression of emotions (left eyebrow, right eyebrow, left eye, right eye, between eyebrows, nose and mouth), are extracted using the positions of some landmarks detected by IntraFace (IF). Then, each ROI is resized and partitioned into blocks, which are characterized using several texture and shape descriptors and their combination. Finally, a multiclass SVM classifier is used to classify the six basic facial expressions and the neutral state. In terms of evaluation, the proposed automatic facial decomposition is compared with existing ones to show its effectiveness, using three public datasets. The experimental results showed the superiority of our facial decomposition against existing ones and reached recognition rates of 96.06%, 92.03% and 93.34% for the CK, FEED and KDEF datasets, respectively. Then, a comparison with state-of-the-art methods is carried out using the CK+ dataset.
∗ Corresponding author.
Email addresses: [email protected] (Khadija Lekdioui), [email protected] (Rochdi Messoussi), [email protected] (Yassine Ruichek), [email protected] (Youness Chaabi), [email protected] (Raja Touahni)
The comparison analysis demonstrated that our method outperformed or was competitive with the results achieved by the compared methods.

Keywords: Facial components, Texture and shape descriptors, Facial expression recognition, SVM classifier, ROI extraction
1. Introduction

Facial expression is an important aspect of behavior and non-verbal communication. Indeed, it represents a form of non-verbal body language, which plays an essential role in human interactions. Recently, several research works
have focused on FER to detect emotions and analyze people's behavior [1, 2, 3]. Some facial muscles are specifically associated with certain emotional states and allow, according to Ekman [4], the expression of primary emotions (Sadness, Anger, Fear, Joy, Disgust and Surprise). Emotion recognition systems based on facial expression analysis have found their interest in various applications,
such as eLearning and affective computing. Several studies have shown the importance of taking into account the emotional aspects of a group of learners when creating computing environments for collaborative learning [5, 6]. In [6], the authors proposed to use computer tools to help learners share the emotions they feel in a collaborative task. Others proposed to integrate facial
expression analysis based systems into educational platforms in order to make emotion sharing automatic and unconscious [7]. In [8], the authors used automatic facial expression analysis to discover the learning status of e-learners. In affective computing, emotion recognition is crucial to allow behavioral and emotional interaction between a human and a machine. In [9], the authors sought
to promote more dynamic and flexible communication between a learner and an educational platform through the integration of an agent whose role is to capture and manage emotions expressed by the learner during a learning session, using facial expression analysis. Emotion recognition based solely on facial expression analysis is a very difficult task, especially within uncontrolled con-
ditions such as lighting changes, occlusion, fringe, and facial pose variation in
front of the camera. In recent years, more research works [10, 11] have been conducted to address these problems, and the literature offers a significant and growing body of research in computer vision that contributes to performing FER with increasing performance. The first research work in face detection and FER can
be found in [12]. Feature extraction is an important step in FER system. Feature extraction methods, used to characterize facial expression, can be categorized into two approaches: appearance-based and geometry-based methods. Appearance-based methods consist of extracting facial texture caused by expressions using different
descriptors including Local Binary Pattern (LBP) [13], Local Ternary Pattern (LTP) [14], Compound Local Binary Pattern (CLBP)[15] and Histogram of Oriented Gradient (HOG)[16]. Geometry-based methods extract shape information and locations of facial components such as distances and angles between landmarks [17]. Some research works combine appearance and geometric information
as in [18, 19, 20]. Based on appearance-based features extracted from the whole face, Gritti et al. [21] extensively investigated different local features including HOG descriptor, LBP descriptor and its variants, and Gabor wavelets. Using Support Vector Machines (SVM) classifier, the authors demonstrated that the best FER
performances are obtained using LBP descriptor. Other extensive experiments were carried out in [22]. The authors showed that LBP features are effective and efficient for FER, even in low-resolution video sequences, using different machine learning methods. Carcagnì et al. [23] carried out large experiments using HOG descriptor, highlighting that a proper set of HOG parameters (cell
size and number of orientation bins) can make the HOG descriptor one of the most powerful techniques to recognize facial expressions, using an SVM classifier. Ahmed et al. [15] proposed a facial feature descriptor constructed with CLBP and used SVM for classification. Most frameworks based on the geometric approach use Active
Shape Model (ASM) or its extension called Active Appearance Models (AAM). In [2], the authors detect facial expression by observing changes in key features
in AAM using Fuzzy Logic. Shbib and Zhou [3] used geometric displacement of projected ASM feature points, and the mean shape of ASM is analyzed to evaluate FER based on SVM classifier. Poursaberi et al. [18] combined texture
and geometric features. They used Gauss-Laguerre circular harmonic filter to extract texture and geometric information of fiducial points, then K-nearest neighbor (KNN) is used for facial expression classification. Rapp et al. [19] also combined appearance and geometric information. They proposed an original combination of two heterogeneous descriptors. The first one uses Local Gabor
Binary Pattern (LGBP) in order to exploit multi-resolution and multi-direction representation between pixels. The second descriptor is based on AAM, which provides important information on the position of facial key points. Then, they used multiple kernel SVM based classification to achieve FER. Valstar and Pantic [24] proposed a method that consists in tracking 20 fiducial points in all
subsequent frames in a video, using particle filtering with factorized likelihoods. Then, the features are calculated from the position of facial points as indicated by the point tracker. A comparison of two feature types, geometry-based and Gabor wavelets-based, using a multi-layer perceptron, is explored in [25]. The comparison showed that Gabor wavelet coefficients are much more powerful than
geometric positions. In [26], the authors proposed a method that extracts patch-based 3D Gabor features to obtain salient distance features, selects the salient patches and then performs patch matching operations using an SVM classifier. Lei et al. [17] proposed a geometric feature extraction method applying an ASM automatic fiducial point location algorithm to facial expression images, and then
calculating the Euclidean distance between the center of gravity of the face shape and the annotated points. Finally, they extract the geometric deformation difference between features of the neutral expression and the facial expression to analyze, before applying an SVM classifier for FER. In addition to the aforementioned traditional machine learning methods, deep learning with
Convolutional Neural Networks (CNN) and Deep Belief Networks (DBN) has also been applied for FER and has achieved competitive recognition rates [27, 28, 29, 30]. However, deep learning methods require more memory and are time
consuming [27, 31]. Khorrami et al. [29] empirically showed that the learned features obtained by CNNs correspond completely with the Facial Action Coding System (FACS) developed by Ekman and Friesen [32]. In [30], the authors demonstrated that a combination of different methods such as dropout, max pooling and batch normalization impacts the performance of CNNs. A unified FER system combining feature learning, feature selection, and classifier construction was proposed based on the Boosted Deep Belief Network (BDBN)
[27]. Liu et al [28] constructed a deep architecture by using convolutional kernels for learning local appearance variations caused by facial expressions and extracting features based on DBNs. All the methods presented above use facial image as a whole region of interest. Recently, many researchers have studied facial expression using facial
regions or facial patches instead of using the whole face [33, 34, 35, 36]. Indeed, some specific face-regions are inappropriate to FER. In [33], the authors used AAM to extract facial regions, then applied Gabor wavelet transformation to extract facial features from the defined regions, and finally used SVM to recognize learned expressions. The authors in [34, 35] defined facial components from
which they extracted HOG feature descriptors. Then, SVM classifier is used for expression recognition. In [37], the authors combined geometric and appearance facial features, then applied Radial Basis Function (RBF) based neural network for facial expression classification. The geometric facial features are extracted from regions of interest defined from facial decomposition. Happy and Routray
[36] extracted appearance based features from salient patches selected around facial landmarks then used SVM to classify expressions. According to FACS, the main facial features required to analyze facial expressions are located around the eyebrows, eyes, nose and mouth. Hence, different definitions of ROIs around these facial components are proposed in the
literature [37, 35]. The first one [37] segmented the face into three ROIs (eyes-eyebrows, nose, mouth) using a segmentation process which detects the face image, determines its width and height, then uses predefined ratios to estimate the regions in question. The second one [35] aims to define six ROIs (eyebrows,
left eye, right eye, between eyes, nose, mouth) by dividing the face manually. In
our work, different from the ones mentioned before, facial expression recognition is investigated using new face-regions defined more precisely with the help of some facial landmarks (detected by the IF algorithm), which allow the automatic extraction of seven ROIs (left eyebrow, right eyebrow, between eyebrows, left eye, right eye, nose, mouth). Furthermore, unlike the previous works where the
superior region contains more than one facial component, our decomposition aims to separate the facial components. This ensures better face registration and therefore an appropriate facial representation. Two major contributions are presented in this paper. First, we define an automatic facial decomposition that improves FER performance, compared to
state of the art facial decompositions and the approach based on whole face. Second, the study is performed considering texture and shape descriptors, and SVM for classification. We consider different descriptors, including LBP, LTP, CLBP, HOG and their combination, to carry out comprehensive performance comparison between
whole face based and face-regions based methods on three different datasets. Two of them, CK [38] and KDEF [39] represent posed emotions. The third one, FEED [40], contains spontaneous emotions. Furthermore, the proposed method is compared with other state of the art methods such as deep learning based methods and feature points based ones.
The paper is organized as follows. Section 2 presents the overview of the proposed methodology for FER. Regions of interest extraction from the facial image is detailed in Section 3. Feature extraction is presented in Section 4. Section 5 presents multiclass SVM based FER. In Section 6, we present extensive experiments and discuss the results. Section 7 concludes the paper and proposes some future works.
2. Overview of the proposed methodology

The objective of the proposed methodology is to characterize the six universal facial expressions (Joy, Fear, Disgust, Surprise, Sadness, Anger) and the
neutral state, by analyzing specific well defined face-regions from which facial features are extracted. The proposed study for FER is based on two main steps: (i) determination of specific and more accurate face-regions using IF framework [41], allowing facial landmarks detection; (ii) evaluation of different texture and shape descriptors (texture through LBP and its variants; and shape through
HOG) and their combination. To explain the interest of facial decomposition into different ROIs, representing the main components of the face (eyes, eyebrows, nose, mouth, between eyebrows), one can see in Figure 1 that when using the whole face as a ROI, the main components are not located in the same region. Indeed, the red region in Figure 1 (a) contains mouth and nose, whereas
the same red region in Figure 1 (b) contains only the mouth. This is due to the difference in face shape from one person to another. As reported in [20], the holistic method, which uses the whole face as one ROI, does not provide good face registration due to the different shapes and sizes of facial components among the population and according to the expression. Then, one facial component may
not be located in the same region. Hence the interest of facial decomposition into different ROIs, representing the main components of the face (left eye, right eye, left eyebrow, right eyebrow, nose, mouth, between eyebrows). This issue affects FER performance, as we will show in Section 6. When facial decomposition is applied, one can extract the facial components and analyze them globally to
perform expression recognition. Figures 1 (c) and (d) and Figures 1 (e) and (f) illustrate ROIs, representing nose and mouth, extracted from the faces of Figures 1 (a) and (b) respectively. Figure 2 illustrates the different steps of the proposed system: a) the openCV face detector [42] is used to detect the position of the face within the image; b) facial landmarks are detected using IF; c)
facial components (ROIs) are extracted using the coordinates of specific facial landmarks; d) the defined ROIs (left eyebrow, right eyebrow, between eyebrows,
left eye, right eye, nose and mouth) are cropped; e) each ROI is scaled and partitioned into blocks (this step allows extracting more local information); f) for each block, a feature descriptor is extracted. The descriptors of all the blocks are
then concatenated to build the descriptor of the ROI. Finally, the descriptors of all ROIs are concatenated to obtain the descriptor of the face; g) the obtained descriptor is fed into a multiclass SVM to achieve the recognition task.
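To make the pipeline of Figure 2 concrete, the following minimal Python sketch outlines steps (a) to (g) under stated assumptions: OpenCV's Haar cascade stands in for the face detector of [42], and detect_landmarks, extract_rois and describe_roi are hypothetical helpers (IntraFace itself is not wrapped here); this is an illustration, not the authors' implementation.

```python
# Minimal sketch of the FER pipeline of Figure 2 (hypothetical helper names).
import cv2
import numpy as np

def recognize_expression(image_bgr, detect_landmarks, extract_rois,
                         describe_roi, svm_model):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)

    # (a) Face detection (Haar cascade used as a stand-in for the OpenCV detector [42]).
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    face_box = faces[0]

    # (b) 49 facial landmarks inside the face box (IntraFace/SDM assumed available).
    landmarks = detect_landmarks(gray, face_box)          # e.g. shape (49, 2)

    # (c)-(d) Crop the seven ROIs from the landmark coordinates (see Table 1).
    rois = extract_rois(gray, landmarks)                  # dict: name -> image patch

    # (e)-(f) Scale each ROI, split it into blocks, describe each block and concatenate.
    face_descriptor = np.concatenate([describe_roi(roi) for roi in rois.values()])

    # (g) Multiclass SVM prediction (one label among the 6 expressions + neutral).
    return svm_model.predict(face_descriptor.reshape(1, -1))[0]
```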
Figure 1: (a) The whole face as ROI; the red region incorporates two facial components, the mouth and a part of the nose. (b) The whole face as ROI; the red region incorporates only one facial component, the mouth. (c) and (d) Nose and mouth ROIs extracted from (a). (e) and (f) Nose and mouth ROIs extracted from (b).
Figure 2: Automatic FER system.
3. Regions of interest extraction

As stated in [20], using the whole face as a unique region to extract features
affects FER performances. To solve this issue, our approach proposes to extract specific face-regions, representing the main face components (eyes, eyebrows, nose, mouth, between eyebrows), from which features will be extracted. The objective is to increase FER performance through feature descriptors. In Section 6, we will show through comprehensive experiments the performance of the
proposed face decomposition, compared with whole face and state of the art face decompositions.
To achieve the ROIs extraction step, we start by detecting facial landmarks using the popular IF framework [41]. The IF algorithm can detect 49 landmarks
around the regions of eyebrows, eyes, nose, and mouth (see Figure 3) using the Supervised Descent Method (SDM) [41]. The 49 landmarks detected by IF algorithm are shown in Figure 3, where each point has a label number. Hence, the face shape is described by X∗ = (X1 , X2 , · · · , Xp ), where Xi is ith point, and p denotes the number of landmarks (here p = 49). Xi = (xi , yi ), where xi
and yi are the horizontal and vertical coordinates of the ith point, respectively. In this paper, we use some of these points (see Table 1) to define the proposed face decomposition into ROIs, which are located mainly around the eyebrows, eyes, nose and mouth (see Figure 3).
Table 1: Facial components extraction using IF algorithm based landmarks.

Facial component   | Starting point              | Width      | Height
Left eyebrow       | (x1, y4)                    | x11 − x1   | max(x4 − x3, y3 − y1)
Right eyebrow      | (x11, y7)                   | x11 − x10  | max(x7 − x6, y7 − y10)
Between eyebrows   | (x5, min(y5, y6))           | x6 − x5    | y13 − y12
Left eye           | (x20, y22)                  | x23 − x20  | y24 − y22
Right eye          | (x26, y27)                  | x29 − x26  | y31 − y27
Nose               | (x15, y12)                  | x19 − x15  | y12 − y17
Mouth              | (x25, min(y32, y35, y38))   | x30 − x25  | max(y32, y38, y41) − min(y32, y35, y38)
Figure 3: 49 Landmarks detected by IF algorithm.
After detecting the landmarks with the IF algorithm, seven ROIs, which are prone to
change with facial expression, are extracted (left eyebrow, right eyebrow, left eye, right eye, between eyebrows, nose and mouth). Figure 4 shows an example of ROIs extraction. These ROIs are the most representative regions of facial expression according to FACS. The idea behind FER is to analyze the face locally by focusing on permanent and transient features. Permanent features
are eyes, eyebrows, nose and mouth. Their shape and texture are exposed to
change with facial expression, which produces different wrinkles and furrows that are called transient facial features (for example, vertical wrinkles between the eyebrows due to their convergence towards one another, especially when the face expresses sadness and anger according to FACS). This leads us
to choose meticulously the starting point, width and height for each ROI (see Table 1) in order to capture more exactly permanent and transient features (see Figure 4). In our work, FER is especially based on texture information. Hence, we are interested in analyzing specific areas of the face that are likely to present changes in texture information with facial expression. For this purpose,
it is more useful to set the width of the mouth region to the horizontal distance between the points x25 and x30 , instead of choosing the points x32 and x38 which horizontally delimit only the mouth. This choice allows to detect changes in texture information within the area between the points x25 and x32 and the area between the points x30 and x38 , especially when the face comes to expressing joy,
as reported by FACS. Indeed, the joy emotion is expressed on the face by stretching the lip corners. For the height, we chose to set it to the difference between the maximum of the points y32, y38 and y41 and the minimum of the points y32, y38 and y35, instead of simply using the difference between the points y41 and y35 which vertically delimit only the mouth. Indeed, for some people,
when they smile, the corners X32 and X38 of their mouth move above the point X35 . Furthermore, based again on FACS, sadness is expressed by lowering the lip outer corners, which justifies the choice of the maximum of the points y32 , y38 and y41 . Our aim by using the minimum of the points y32 , y38 and y35 , and maximum of the points y32 , y38 and y41 , is to consider the highest and lowest
point respectively in order to extract the region of interest that brings complete information needed for precise and reliable FER. For eyebrow region, we set the width of the left eyebrow region to the horizontal distance between the points x1 and x11 , instead of choosing the points x1 and x5 which horizontally delimit only the left eyebrow. This choice allows to detect changes in texture
information within the area between the points x5 and x11, especially when the face comes to expressing fear, anger and sadness as reported by FACS. The same
strategy is adopted for the right eyebrow. In addition to the corresponding facial component, the mouth and eyebrow regions contain a small area around them that is affected by the deformation of the mouth and eyebrows when the face comes
to express something. In contrast, for the eyes, we do not need to analyze an additional area because the required information is inside the eyes. For example, when the face expresses fear, the upper eyelids rise and the lower ones are tightened, which has the effect of displaying a superior Sanpaku. Hence our choice to use the points that delimit only the eyes.
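As an illustration of how the ROI geometry of Table 1 translates into code, here is a hedged Python sketch for two of the seven regions (mouth and left eyebrow); pts is assumed to hold the 49 IntraFace landmarks indexed from 1 as in Figure 3, and the remaining rows of Table 1 follow the same pattern.

```python
import numpy as np

def extract_rois(gray, pts):
    """Crop facial components from a gray image, following Table 1.

    pts: mapping from landmark label i (1..49, as in Figure 3) to (x_i, y_i).
    Only two ROIs are sketched here; the other rows of Table 1 are analogous.
    """
    def x(i): return pts[i][0]
    def y(i): return pts[i][1]

    def crop(x0, y0, w, h):
        x0, y0, w, h = int(x0), int(y0), int(round(w)), int(round(h))
        return gray[y0:y0 + h, x0:x0 + w]

    rois = {}

    # Mouth: start at (x25, min(y32, y35, y38)), width x30 - x25,
    # height max(y32, y38, y41) - min(y32, y35, y38)   (Table 1).
    top = min(y(32), y(35), y(38))
    rois["mouth"] = crop(x(25), top,
                         x(30) - x(25),
                         max(y(32), y(38), y(41)) - top)

    # Left eyebrow: start at (x1, y4), width x11 - x1,
    # height max(x4 - x3, y3 - y1)   (Table 1).
    rois["left_eyebrow"] = crop(x(1), y(4),
                                x(11) - x(1),
                                max(x(4) - x(3), y(3) - y(1)))

    return rois
```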
Figure 4: (a) Examples of the 49 landmarks detected by IF. (b) Examples of ROI extraction.
4. Feature extraction

Once the ROIs are extracted from a facial image, the next step is to apply a scaling procedure by modifying the ROI size. This step is crucial as each facial component (represented by a ROI) could have different sizes depending on the face shape in the image. The idea behind scaling is to get ROIs of the same
component with a close scale. To achieve that, we apply different ROI sizes
and then select the one providing better FER performances, as we will show in Section 6. After the scaling procedure, we partition each ROI into regular blocks in which features will be extracted. The partitioning step is interesting as it allows extracting local information. Once feature extraction is achieved in each
block of a ROI, the features of all blocks are concatenated to build the feature descriptor of the ROI. Finally, the feature descriptors of all ROIs are concatenated to create the face feature descriptor.
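A minimal sketch of the scaling, partitioning and concatenation steps described above, under stated assumptions: block_descriptor is one of the per-block descriptors of Sections 4.1–4.4, and the target size and block grid per ROI are taken as given (the concrete values used in the experiments are reported in Section 6).

```python
import cv2
import numpy as np

def describe_roi(roi, target_size, grid, block_descriptor):
    """Scale a ROI, split it into grid = (rows, cols) blocks and
    concatenate the per-block descriptors (e.g. LBP/LTP histograms or HOG)."""
    w, h = target_size
    roi = cv2.resize(roi, (w, h))
    rows, cols = grid
    bh, bw = h // rows, w // cols
    blocks = [roi[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
              for r in range(rows) for c in range(cols)]
    return np.concatenate([block_descriptor(b) for b in blocks])

def describe_face(rois, params, block_descriptor):
    """Concatenate the ROI descriptors into the face descriptor.
    params maps each ROI name to its (target_size, grid) configuration."""
    return np.concatenate([describe_roi(rois[name], size, grid, block_descriptor)
                           for name, (size, grid) in params.items()])
```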
In the following, we will present the different feature descriptors we used in
this study.

4.1. LBP

The LBP operator was initially proposed by Ojala et al. [13] in order to characterize the texture of an image. It consists in applying a threshold procedure to the difference between the value of each pixel and the value of the pixels of its neighborhood.
The result of this operation gives a chain of binary values (an 8-bit code in the case of a 3 × 3 neighborhood), read in clockwise direction around the central pixel. This chain of binary values can be transformed into a decimal value, representing a gray level of the LBP image.
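To make the operator concrete, here is a small, unoptimized Python sketch of the basic 3 × 3 LBP code and of the 256-bin block histogram used later as a block descriptor; it follows the textbook definition of [13], not necessarily the authors' implementation.

```python
import numpy as np

def lbp_image(gray):
    """Basic 3x3 LBP: threshold the 8 neighbors against the center pixel,
    read the bits clockwise and convert them to a decimal code."""
    g = gray.astype(np.int32)
    h, w = g.shape
    out = np.zeros((h - 2, w - 2), dtype=np.int32)
    # Clockwise neighbor offsets starting from the top-left corner.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    center = g[1:-1, 1:-1]
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = g[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        out += (neighbor >= center).astype(np.int32) * (1 << (7 - bit))
    return out.astype(np.uint8)

def lbp_histogram(block):
    """256-bin histogram of LBP codes, used as the block descriptor."""
    hist, _ = np.histogram(lbp_image(block), bins=256, range=(0, 256))
    return hist.astype(np.float32)
```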
4.2. CLBP
Ahmed et al. [15] tried to increase the robustness of LBP feature descriptor by incorporating additional local information that is ignored by the original LBP operator. They extended the basic LBP to another variant, which is called CLBP. In this variant, the comparison between the central pixel and the pixels of its neighborhood is performed using simultaneously the sign (represented
by a bit), like in LBP, and the magnitude (represented by another bit) of the difference between gray values.
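For illustration only, a hedged sketch of the sign/magnitude idea behind CLBP: each neighbor contributes a sign bit (as in LBP) and a magnitude bit obtained here by comparing the absolute difference to the local mean absolute difference. This follows the general description above and in [15]; the exact magnitude threshold used by the authors is an assumption.

```python
import numpy as np

def clbp_code(patch3x3):
    """Compound LBP code of one 3x3 patch: 8 sign bits + 8 magnitude bits."""
    p = patch3x3.astype(np.int32)
    center = p[1, 1]
    # Neighbors read clockwise from the top-left corner.
    neighbors = np.array([p[0, 0], p[0, 1], p[0, 2], p[1, 2],
                          p[2, 2], p[2, 1], p[2, 0], p[1, 0]])
    diff = neighbors - center
    sign_bits = (diff >= 0).astype(np.int32)
    # Magnitude bit: 1 when the absolute difference exceeds the local average.
    mag_bits = (np.abs(diff) > np.abs(diff).mean()).astype(np.int32)
    code = 0
    for s, m in zip(sign_bits, mag_bits):
        code = (code << 2) | (int(s) << 1) | int(m)   # two bits per neighbor
    return code                                        # value in [0, 65535]
```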
4.3. LTP and Dynamic LTP

LTP [14] is an interesting descriptor that overcomes some limits of LBP. In LTP, instead of being binarized, the sampled values take one of three values according to their distance to the value of the central pixel. In LTP, the indicator function s(x) is defined as:

s(x) = \begin{cases} 1 & \text{if } x \ge t \\ 0 & \text{if } |x| < t \\ -1 & \text{if } x \le -t \end{cases} \qquad (1)
where t is a threshold defined manually.
LTP is resistant to noise. However, it is not invariant to gray-level transformations owing to the naive choice of the threshold. To overcome this problem, many studies [43, 44, 45] proposed techniques, named dynamic LTP, for calculating a dynamic threshold based on the values of the neighboring pixels. Nevertheless, to the best of our knowledge, all the formulas proposed in the literature to calculate a dynamic threshold use a parameter called the scaling factor, whose value is constant for the whole neighborhood. In all existing works, to set this parameter, the authors chose a value between 0 and 1. In [45], the authors evaluated the performance of their proposed method considering different values for the scaling factor. In our study, we carried out large experiments using the original LTP with different strategies for threshold calculation. The first strategy is to set the threshold value from 0 to 5 [14], as proposed in many research works. The second one, proposed in [45], uses formula 1 (see Table 2), considering different values for the scaling factor δ (0.02, 0.08, 0.05, 0.1, 0.15 and 0.2). In this formula, pc denotes the value of the central pixel. In this paper, we introduce other strategies to achieve dynamic thresholding with the simple formulas 2, 3, 4 and 5 given in Table 2, where pi is the value of the ith neighbor pixel. Another important consideration discussed in this article is the manner of defining LTP. Indeed, changing the original definition of the LTP operator [14] could impact the classification results. The original definition of LTP (Eq. (1)) is to assign 0 when the absolute value of the difference is strictly lower than the threshold, while in other works [46, 47] the authors unconsciously changed the original definition of LTP by assigning 0 when the absolute value of the difference is less than or equal to the threshold (Eq. (2)). Since the representation of LTP is
based on three values, the allocation of the equal sign to one or more of the three conditions could affect the classification results. This statement led us to ask the question: which one of the LTP definitions would be adapted to our application? To answer this question, we conducted extensive experiments by considering the different definitions (Eq. (1), (2), (3) and (4)):

s(x) = \begin{cases} 1 & \text{if } x > t \\ 0 & \text{if } |x| \le t \\ -1 & \text{if } x < -t \end{cases} \qquad (2)

s(x) = \begin{cases} 1 & \text{if } x \ge t \\ 0 & \text{if } x \ge -t \text{ and } x < t \\ -1 & \text{if } x < -t \end{cases} \qquad (3)

s(x) = \begin{cases} 1 & \text{if } x > t \\ 0 & \text{if } x > -t \text{ and } x \le t \\ -1 & \text{if } x \le -t \end{cases} \qquad (4)
Table 2: Threshold formulas for the LTP operator (N: number of neighbors; in this work N = 8).

Formula 1: t = p_c \times \delta
Formula 2: t = \frac{1}{N}\sum_{i=0}^{N-1} \sqrt{p_i}
Formula 3: t = \frac{1}{N}\sum_{i=0}^{N-1} p_i
Formula 4: t = \frac{1}{N}\sqrt{\sum_{i=0}^{N-1} p_i}
Formula 5: t = \sqrt{\frac{1}{N}\sum_{i=0}^{N-1} p_i}
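The following sketch encodes one 3 × 3 neighborhood with the ternary function of Eq. (1) and shows how a dynamic threshold such as Formula 4 of Table 2 can replace the manual value t. The split of the ternary pattern into an upper and a lower binary code before histogramming is standard LTP practice rather than something specified above, and the choice of Formula 4 here is only illustrative.

```python
import numpy as np

def ltp_codes(patch3x3, t=None):
    """LTP of one 3x3 patch using Eq. (1); returns (upper, lower) binary codes."""
    p = patch3x3.astype(np.float64)
    center = p[1, 1]
    # Neighbors read clockwise from the top-left corner.
    neighbors = np.array([p[0, 0], p[0, 1], p[0, 2], p[1, 2],
                          p[2, 2], p[2, 1], p[2, 0], p[1, 0]])
    if t is None:
        # One possible dynamic threshold, Formula 4 of Table 2: sqrt(sum(p_i)) / N.
        t = np.sqrt(neighbors.sum()) / neighbors.size
    diff = neighbors - center
    # Eq. (1): 1 if diff >= t, 0 if |diff| < t, -1 if diff <= -t.
    ternary = np.where(diff >= t, 1, np.where(np.abs(diff) < t, 0, -1))
    # Standard split of the ternary pattern into two LBP-like binary codes.
    upper = sum(int(v == 1) << (7 - i) for i, v in enumerate(ternary))
    lower = sum(int(v == -1) << (7 - i) for i, v in enumerate(ternary))
    return upper, lower
```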
4.4. HOG 285
HOG descriptor, largely inspired from Scale Invariant Feature Transform (SIFT), was proposed by Dalal and Triggs [16] to address the limitations of SIFT in the case of dense grids. The purpose of HOG is to represent the appearance and shape of an object in an image by the distribution of intensity gradients or edge directions. This is accomplished by dividing the image into small connected
290
regions, called cells, and, for each cell, computing a local histogram (with 9 bins) of gradient directions for the pixels belonging to this cell. The concatenation of all local histograms forms the HOG descriptor. The local histograms are normalized to local contrast by calculating a measure of the intensity over larger spatial
regions, called blocks. This normalization leads to better invariance to changes
in illumination and shadowing.
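As an illustration only, the block-level HOG computation can be sketched with scikit-image's reference implementation (assumed available); the mapping of the cell and block sizes quoted in Section 6.2 onto these parameters is an assumption, not the authors' exact configuration.

```python
import numpy as np
from skimage.feature import hog

def hog_descriptor(block, orientations=9,
                   pixels_per_cell=(2, 2), cells_per_block=(4, 4)):
    """9-bin gradient-orientation histograms per cell, normalized per block."""
    return hog(block.astype(np.float64),
               orientations=orientations,
               pixels_per_cell=pixels_per_cell,
               cells_per_block=cells_per_block,
               block_norm="L2-Hys",
               feature_vector=True)
```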
5. SVM based classification

SVMs are among the most known classification methods. They are inspired from statistical learning theory [48]. SVMs perform linear classification based on supervised learning. We used a linear SVM classifier to avoid parameter sensitivity. Given labeled training data (x_i, y_i), i = 1, ..., p, with y_i ∈ {−1, 1}, the linear SVM solves

\min_{w} \; \frac{1}{2} w^T w + C \sum_{i=1}^{p} \xi(w, x_i, y_i) \qquad (5)

where C > 0 is a penalty parameter and w is normal to the hyperplane. The term \sum_{i=1}^{p} \xi(w, x_i, y_i) can be considered as the total misclassification error. The classification of new test data is carried out by computing the distance from the test data to
the hyperplane.
SVM is a binary classifier. For a multiclass problem, as in our case, the problem is decomposed into multiple binary classification problems, each of which is then handled by an SVM. Among the existing multiclass SVM methods, we opted for
the one-against-one based one, which is a competitive approach according to the detailed comparison conducted in [49]. The one-against-one based multiclass SVM method consists in constructing one classifier for each pair of classes, and then training k(k − 1)/2 classifiers (k is the number of classes) in order to distinguish the examples of one class from those of another class. The classification of a
new test data is determined by the class that obtains the maximum of the sum of binary classifier responses. In this work, to implement SVM classification, we used the LIBSVM library [50], which offers codes of the one-against-one based multiclass SVM method.
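For illustration, the training and prediction steps can be reproduced with scikit-learn's SVC, which wraps LIBSVM and uses the same one-against-one decomposition; this is a stand-in for the authors' C++/LIBSVM code, with C = 0.3 as reported in Section 6.2.

```python
from sklearn.svm import SVC

# X: face descriptors (one row per image), y: expression labels (7 classes).
# SVC wraps LIBSVM and trains k(k-1)/2 one-against-one binary classifiers.
def train_fer_svm(X, y, C=0.3):
    model = SVC(kernel="linear", C=C, decision_function_shape="ovo")
    model.fit(X, y)
    return model

def predict_expression(model, face_descriptor):
    return model.predict(face_descriptor.reshape(1, -1))[0]
```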
6. Experiments
6.1. Databases used To evaluate the proposed methodology, three different public databases are used in order to carry out tests in different conditions. The first database, called FEED [40], represents spontaneous emotions in image sequences. The second database, called CK [38], contains posed emotions over image sequences. The
third database, called KDEF [39], contains also posed emotions, but in independent images, i.e. one image represents one expression. Table 3 summarizes the properties of the used datasets.

Table 3: Properties of CK, FEED and KDEF datasets

Dataset                  | CK                                             | FEED                            | KDEF
Type                     | posed                                          | spontaneous                     | posed
Lighting conditions      | Uniform                                        | Uniform                         | Uniform
Face size                | We do not perform any alignment                | We do not perform any alignment | We do not perform any alignment
Angle of the face        | Frontal                                        | Frontal                         | Frontal
# of images              | 610                                            | 630                             | 280
# and names of emotions  | 6 basic expressions + neutral                  | 6 basic expressions + neutral   | 6 basic expressions + neutral
# of subjects            | 9 males / 23 females                           | 8 males / 8 females             | 20 males / 20 females
Ethnicity                | Euro-American, Afro-American, Asian and Latino | European                        | Caucasian
6.1.1. CK Database

This database contains image sequences representing six facial expressions
for male and female subjects (the number of females is greater than the number of males) of different ethnic origins such as Euro-American (81%), Afro-American (13%) and other (6%). It is composed of posed (deliberated) facial expressions. Each sequence starts with a neutral state and carries on with the expressive state. The database is made up of subjects of varying ages, skin
colors and facial conformations. The lighting condition is relatively uniform. The image sequences, which are captured with frontal views, were digitized into 640 × 490 or 640 × 480 pixels resolution. All these characteristics allowed this database to be widely used for FER evaluation. In this study, a dataset of 610 images selected from 98 image sequences, representing 32 subjects and the 7
basic facial expressions, namely neutral, joy, sadness, surprise, anger, fear and disgust, is considered. Figure 5 illustrates some images of CK database.
Figure 5: Examples of three facial expressions in the CK database. (a) Image sequence representing the sadness expression. (b) Image sequence representing the fear expression. (c) Image sequence representing the joy expression.
6.1.2. FEED Database

As in the CK database, this database contains image sequences representing the 7 basic facial expressions for male and female subjects (the number of fe-
males and males is the same). The image sequences, which are captured with frontal views, were digitized into 640 × 480 pixels resolution. Unlike CK, FEED database is much more challenging, its subjects performed spontaneous expressions and some of them are not well distinguishable. The difference between natural facial expressions and deliberated ones is important. We will show in
345
the experiments the impact of the expression type (natural/posed) on the recognition performances. In this study, the 7 basic facial expressions fill up a dataset of 630 images selected from 81 image sequences representing 16 subjects. Figure 6 shows some images of FEED database.
(a)
(b)
(c)
Figure 6: Examples of three facial expressions for the FEED database. (a) Image sequence representing sadness expression. (b) Image sequence representing fear expression. (c) Image sequence representing joy expression.
6.1.3. KDEF Database
Composed of deliberated facial expressions, this database contains images of male and female subjects belonging to Caucasian ethnicity, each subject expressing the 7 different emotions. The number of males and females is the same. We note that some subjects have moles on their face. Each posed facial expression has been captured twice for each subject from 5 different angles (-90, -45,
0, +45, +90 degrees). In this study, we built a dataset of 280 images representing 40 subjects with the 7 basic facial expressions. In this work, we considered only the images captured with the straight angle (0 degree). This choice is made considering two motivations. The first one is related to the region extraction algorithm,
which could meet difficulties to perform well for non-zero view angles. Second, we considered the same experimental methodology for all the three databases, by choosing frontal view for face capturing. Figure 7 illustrates some images of KDEF database.
Figure 7: Examples of different facial expressions from the KDEF database. (a) neutral (b) joy (c) sadness (d) surprise (e) anger (f) fear (g) disgust.
6.2. Results and Discussion
Figure 8: Regions of interest. (a) The whole face as region of interest. (b) 3 regions of interest already used in the literature [37]. (c) 6 regions of interest already used in the literature [35]. (d) Our 7 proposed regions of interest.
We applied 10-fold cross-validation to test our proposed method. Each dataset is divided into 10 person-independent subsets of equal number of images that are used to conduct 10 experiments. In each experiment, nine subsets are used for training and the remaining subset is used for the test. The 10-fold cross-validation is used to fill the confusion matrix and then the recognition rate
is obtained by calculating the F-score, which is defined as F\text{-score} = \frac{2 \cdot \mathrm{Recall} \cdot \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}}
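For clarity, a small sketch of how per-class F-scores can be computed from a confusion matrix (the recognition rates reported below are F-scores); the rows-true/columns-predicted convention is an assumption.

```python
import numpy as np

def per_class_f_scores(confusion):
    """Per-class F-scores from a confusion matrix (rows: true, cols: predicted)."""
    confusion = np.asarray(confusion, dtype=np.float64)
    recall = np.diag(confusion) / confusion.sum(axis=1)
    precision = np.diag(confusion) / confusion.sum(axis=0)
    return 2.0 * recall * precision / (recall + precision)

# Example: a 2-class matrix [[45, 5], [10, 40]] gives F-scores of about 0.86 and 0.84.
```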
[51]. The choice of the size and number of blocks of each ROI, as well as of the descriptor, generally impacts the recognition performance. In our experiments, each ROI is divided into W × H blocks for descriptor computation. The best parameters are set
by performing several tests. For test and evaluation, we considered three ROI face decompositions with different parameter configurations, i.e. the sizes of ROI and the number of blocks in each ROI, as shown in Figures 9, 10 and 11. The first one considers the whole face as a single ROI (see Figure 8 (a)), as proposed in [15, 23]. In the second decomposition, the face is composed of six
380
ROIs (see Figure 8 (c)), as proposed in [35]. The third decomposition is the one we proposed, and which considers seven ROIs, as shown in Figure 8 (d). In [37], the authors proposed another face decomposition with three ROIs (see Figure 8 (b)). The performances of the face decompositions with three ROIs (see Figure
8 (b)) and six ROIs (see Figure 8 (c)) are close, with small advantage to the 385
second decomposition (six ROIs). Indeed the best recognition rates provided by the face decomposition with three ROIs versus six ROIs are 93.34% (VS 93.55%), 82.17% (VS 83.04%), and 92.6% (VS 92.53%) on CK, FEED and KDEF datasets, respectively. For conciseness, we opted to not give result details for this decomposition (three ROIs) as we did for the other decompositions.
The experiment settings are as follows:

• ROI parameters: ROI size and number of blocks in each ROI (see Figures 9, 10 and 11). Figures 9, 10 and 11 provide, for each tested dataset, only the parameter configurations that yield high recognition rates: given a dataset, and for each tested descriptor, we conserve the parameter config-
uration that yields the higher recognition rate.

• Feature descriptor parameters: for LBP, CLBP and LTP, we used a radius of 1 pixel, a neighborhood of 8 pixels and a histogram of 256 bins. Concerning the LTP descriptor, the threshold is set manually or automatically using the formulas given in Table 2. We mention in the results analysis how we set
the threshold. When set manually, the threshold value is given. For HOG, we used blocks of 8 × 8 pixels, cells of 2 × 2 pixels and a 9-bin histogram for each block.

• SVM parameters: we used a linear kernel with parameter C = 0.3, selected by performing a grid-search in 10-fold cross-validation.
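A hedged sketch of the C selection described in the settings above, using scikit-learn's grid search with 10-fold cross-validation as a stand-in for the authors' procedure; the candidate grid shown is illustrative.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def select_svm_c(X, y, candidates=(0.01, 0.03, 0.1, 0.3, 1.0, 3.0)):
    """Pick the penalty parameter C by 10-fold cross-validated grid search."""
    search = GridSearchCV(SVC(kernel="linear"),
                          param_grid={"C": list(candidates)},
                          cv=10, scoring="f1_macro")
    search.fit(X, y)
    return search.best_params_["C"]
```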
6.2.1. Experimental results using the whole face as one ROI

Figure 9 presents the results obtained when using the whole face as one ROI. For each dataset, we considered the texture descriptors (LBP, CLBP and LTP) (see Figure 9 (a)), the shape descriptor (HOG) (see Figure 9 (b)) and their concatenation (see Figure 9 (c)). As we can see in Figure 9, the HOG descriptor provides the
best recognition rate (93.24%) when the CK dataset is tested with parameter configuration 1 (see Figure 9 (b)). We can also see, for the same dataset, that
when the LTP descriptor is involved (solely or in hybrid form with the HOG descriptor), the recognition rate is interesting for almost all parameter configurations, and often close to the highest recognition rate given by the HOG descriptor. However,
when the FEED and KDEF datasets are tested, the hybrid descriptor LTP + HOG shows the highest recognition rate, when compared to the other descriptors. Indeed, we obtain 85.32% on the FEED dataset with parameter configuration 3 (see Figure 9 (c)). Note that the LTP used here is defined using Eq. (1) with a fixed threshold selected experimentally (t = 1). For the KDEF dataset, we obtain
a recognition rate of 92.19%, with the parameters configuration 2 (see Figure 9 (c)). Here, LTP is defined using Eq. (1) with a threshold procedure based on formula 3 (see Table 2).
Figure 9: Recognition rate for ROI=1 using different descriptors with different configurations of size and number of blocks (results reported as F-scores for the CK, FEED and KDEF datasets). (a) Texture descriptors (three configurations). (b) HOG descriptor (two configurations). (c) Hybrid method (three configurations).
6.2.2. Experimental results using the six regions of interest (number of ROI = 6) 425
In this section, we analyze the performances of the descriptors using face decomposition with 6 ROIs. As we can see in Figure 10, applied on CK dataset, the hybrid descriptor (LTP + HOG) outperforms all the other descriptors in all parameter configurations. Here, LTP is computed using Eq. (4) with threshold formula 1 (see Table 2). The hybrid descriptor (LTP + HOG) was able
to slightly increase the recognition rate to 93.55% (see Figure 10 (c)), when compared to the results of the previous case, i.e. the face as a whole ROI (number of ROI = 1), where the HOG descriptor had shown its effectiveness compared to all the other descriptors (see Figure 9). Concerning the KDEF dataset, the LTP descriptor, defined with Eq. (2) and threshold formula 4 (see Table 2), provides the best
results (see Figure 10 (a)). It slightly increased the recognition rate to 92.53%, when compared to the results of the previous case (number of ROI = 1), where the hybrid descriptors provided the best results, in particular the combination of HOG and LTP (see Figure 9 (c)). For FEED dataset, one can note globally from Figure 10 a significant decrease in the recognition rate to 83.04%,
when compared to the results of the previous case where the number of ROI = 1 (see Figure 9). The decrease in the recognition rate concerns the texture and hybrid descriptors. However, the HOG descriptor increases the recognition rate, when compared to the results of the previous case (number of ROI = 1) with the same descriptor, and provides the best recognition rate (83.04%) when compared to
all the tested descriptors (see Figure 10).
Figure 10: Recognition rate for ROI=6 using different descriptors with different configurations of size and number of blocks (results reported as F-scores for the CK, FEED and KDEF datasets). (a) Texture descriptors (two configurations). (b) HOG descriptor (two configurations). (c) Hybrid method (two configurations).
6.2.3. Experimental results using specific well defined regions of interest (number of ROI = 7)

As we can see in Figure 11, a significant increase in the recognition rate for all tested datasets is obtained with our proposed facial decomposition (number of
ROI = 7), when compared to the previous ones (number of ROI = 1 and number of ROI = 6). Indeed, our facial decomposition provided the best recognition rates, which are obtained mainly with the hybrid descriptors. In particular, the combination of LTP and HOG achieved the high recognition rates of 96.06%, 92.03% and 93.34% for CK, FEED and KDEF datasets respectively. In the
descriptor combination, the LTP that allowed these best results are defined with Eq. (4) associated to threshold formula 1, Eq. (4) associated to threshold formula 3, and Eq. (2) associated to threshold formula 1, for CK, FEED and KDEF datasets respectively. The improvements given by our proposed facial decomposition was possible thanks to the relevant selected ROIs, the location
and proper cropping of facial components.
Figure 11: Recognition rate for ROI=7 using different descriptors with different configurations of size and number of blocks (results reported as F-scores for the CK, FEED and KDEF datasets). (a) Texture descriptors (two configurations). (b) HOG descriptor (two configurations). (c) Hybrid method (three configurations).
Tables 4, 5 and 6 summarize the best results obtained for CK, FEED and KDEF datasets, respectively, through the different descriptors and the different facial decompositions (number of ROI = 1, 6 and 7). Through these tables, we can see that the proposed method (face decomposition with number of ROI = 7) 465
outperforms all the others when using almost all the descriptors and for all the tested datasets. The tables show that hybrid descriptors demonstrated generally better results with respect to elementary ones, especially for our proposed facial decomposition. Concerning the datasets variability, one can see in Figures 9, 10 and 11, that on one hand, the recognition rates for FEED dataset, composed of
spontaneous facial expressions, are lower than those obtained for CK and KDEF datasets, composed of posed facial expressions. The main reason is that posed facial expressions are easy to detect, when compared to spontaneous expressions, which are generally acquired in non-controlled experience protocol. On the other hand, the recognition rates for KDEF dataset are low, when compared to those
obtained for CK dataset. This is due to the difference of the construction of each database. Indeed, CK dataset is composed of image sequences, while KDEF dataset is composed of independent images.
Table 4: Comparison of the performance of ROI=1, ROI=6 and ROI=7 for the CK dataset and all the tested descriptors

Descriptor | ROI=1 | ROI=6 | ROI=7
LBP        | 89.42 | 90.17 | 91.33
CLBP       | 87.13 | 86.6  | 91.26
LTP        | 92.44 | 93.1  | 94.29
HOG        | 93.24 | 88.48 | 90.28
LBP+HOG    | 89.5  | 89.53 | 93.75
CLBP+HOG   | 87.31 | 89.58 | 94.43
LTP+HOG    | 92.89 | 93.55 | 96.06
Table 5: Comparison of the performance of ROI=1, ROI=6 and ROI=7 for the FEED dataset and all the tested descriptors

Descriptor | ROI=1 | ROI=6 | ROI=7
LBP        | 84.17 | 72.41 | 86.96
CLBP       | 75.01 | 71.29 | 81.7
LTP        | 84.75 | 78.69 | 89.25
HOG        | 76.13 | 83.04 | 86.4
LBP+HOG    | 85.26 | 77.43 | 89.89
CLBP+HOG   | 79.09 | 76.84 | 85.66
LTP+HOG    | 85.32 | 80.16 | 92.03
Table 6: Comparison of the performance of ROI=1, ROI=6 and ROI=7 for the KDEF dataset and all the tested descriptors

Descriptor | ROI=1 | ROI=6 | ROI=7
LBP        | 90.44 | 87.95 | 91.87
CLBP       | 90.43 | 81.87 | 88.66
LTP        | 92.12 | 92.53 | 92.87
HOG        | 89.63 | 91.06 | 88.25
LBP+HOG    | 90.44 | 90.74 | 92.2
CLBP+HOG   | 91.85 | 89    | 92.59
LTP+HOG    | 92.19 | 91.81 | 93.34
6.2.4. Confusion matrices

This section analyzes the results through confusion matrices. Tables 7, 8
and 9 show the confusion matrices (corresponding to the high recognition rate reached) for CK, FEED and KDEF datasets respectively. As one can see in Table 7, we achieved very good results for the majority of emotions with an average rate close to 100%, with the exception of neutral and anger emotions that are recognized with average rates of 86% and 90% respectively. Neutral
expression is misclassified as the fear expression with an error rate of 9%. For the FEED dataset (see Table 8), we can observe that one of the main reasons for the decrease in the recognition rate is the misclassification of the sadness, anger and fear expressions as the neutral expression, which is recognized with a lower precision of 82.13% (computed from the neutral column as 95/(95+7+5+6.66+2)), and the
confusion between the fear and surprise expressions and between the disgust and joy/anger expressions. For KDEF dataset (see Table 9), neutral, joy and surprise emotions are recognized with the maximum rate of 100%. The other emotions are classified with an average recognition rate between 85% and 90%. One can see also that expression confusion is more present, with more or less
low percentages. Tables 8 and 9 also show that the fear expression is confused with the surprise expression with an error rate of 10% and 7.5% for the FEED and KDEF datasets respectively. This misclassification is maybe due to similar face
deformation caused by these expressions.

Table 7: Confusion Matrix for CK dataset (associated to the high recognition rate: 96.06%)

         | Neutral | Joy | Sad  | Surprise | Anger | Fear | Disgust
Neutral  | 86      | 0   | 1    | 0        | 3     | 9    | 1
Joy      | 0       | 100 | 0    | 0        | 0     | 0    | 0
Sad      | 1       | 0   | 99   | 0        | 0     | 0    | 0
Surprise | 3       | 0   | 0    | 97       | 0     | 0    | 0
Anger    | 2.22    | 0   | 4.44 | 0        | 90    | 3.33 | 0
Fear     | 0       | 0   | 0    | 0        | 0     | 100  | 0
Disgust  | 0       | 0   | 0    | 0        | 0     | 0    | 100
Table 8: Confusion Matrix for FEED dataset (associated to the high recognition rate: 92.03%)

         | Neutral | Joy | Sad | Surprise | Anger | Fear  | Disgust
Neutral  | 95      | 0   | 2   | 0        | 0     | 3     | 0
Joy      | 0       | 94  | 0   | 6        | 0     | 0     | 0
Sad      | 7       | 0   | 93  | 0        | 0     | 0     | 0
Surprise | 0       | 0   | 0   | 98       | 0     | 2     | 0
Anger    | 5       | 0   | 0   | 0        | 93    | 2     | 0
Fear     | 6.66    | 0   | 0   | 10       | 0     | 83.33 | 0
Disgust  | 2       | 6   | 1   | 0        | 5     | 0     | 86
Table 9: Confusion Matrix of KDEF dataset (associated to the high recognition rate: 93.34%)

         | Neutral | Joy | Sad | Surprise | Anger | Fear | Disgust
Neutral  | 100     | 0   | 0   | 0        | 0     | 0    | 0
Joy      | 0       | 100 | 0   | 0        | 0     | 0    | 0
Sad      | 5       | 0   | 90  | 0        | 2.5   | 0    | 2.5
Surprise | 0       | 0   | 0   | 100      | 0     | 0    | 0
Anger    | 2.5     | 0   | 2.5 | 2.5      | 87.5  | 0    | 5
Fear     | 0       | 0   | 2.5 | 7.5      | 0     | 90   | 0
Disgust  | 7.5     | 0   | 5   | 0        | 2.5   | 0    | 85
6.2.5. Processing time
Processing time (in ms), required to handle one image with all the steps of the proposed method (face detection, landmarks detection, regions extraction, conversion RGB to gray, scaling and partitioning, features computation, SVM prediction), is reported in the Table 10. It has been computed considering the average of the processing of 100 frames, on an Intel(R) Core(TM) i5-2430M
CPU 2.40 GHz based Windows machine. The algorithm codes are written in C++.

Table 10: Processing time (ms) for CK, FEED and KDEF datasets

          | CK     | FEED   | KDEF
HOG       | 196.24 | 201.30 | 228.38
LTP       | 214.96 | 220.02 | 242.89
HOG+LTP   | 236.18 | 244.61 | 266.60
6.3. Comparison with the state of the art

This section is dedicated to the comparison with state-of-the-art methods. The comparison process, in the FER field, faces different difficulties such as the
absence of a common evaluation protocol, the fact that the shared databases require an experimental setting for selecting images, and the fact that the algorithm codes of the existing methods are not accessible and their reimplementation could generate errors. In order to
deal with these difficulties and make a fair comparison, we carried out different experiments on the CK+ database [52], which is used in all the compared methods,
by following the same protocols for images selection and k-fold cross-validation as performed in the works considered for comparison [20, 36, 29, 27, 28]. The first protocol, applied in [29, 27, 28], consists of selecting in each sequence the first image for neutral expression and three peak frames for target one. We recall that each sequence represents the target expression starting by the neutral one.
The second protocol, used in [36], selects in each sequence the last peak frame for target expression. In[36], the neutral expression is not considered in the recognition process. The third one, applied in [20], takes two peak frames for anger, fear and sadness expressions, last peak frame for disgust, happy, and surprise expressions, and first frame for neutral expression in few sequences. Tables 11
and 12 report the recognition rates of our proposed method and the compared methods, applied on CK+7 (all the emotion expressions including the neutral one) and CK+6 (excluding the neutral expression), respectively. Table 13 summarizes the parameter values that allowed our method to reach the best results. For all the methods in comparison, we considered the recognition performances
reported in their referenced paper (see Tables 11 and 12). As shown in Table 11, our proposed method achieved the best recognition rates among all the compared method independently of their features category and used procedure of face registration. Indeed, compared to [28], our method provided an accuracy of 96.03% VS 93.7%. When compared to [20], the proposed method reached
94.48% in accuracy and 94.52% in F-score VS 90.08% and 90.64%. From Table 12, we can observe that our method exceeded the method that extracts features from small patches located around facial landmarks [36] and outperformed also the ones using BDBN [27] and appearance and geometric features [20]. We note that the BDBN method [27] reached an F-score of 83.4%, which is lower com-
pared to the one (96.9%) provided by our method. Compared to our method, the method [29] achieved the better accuracy of 98.3% VS 96.77% thanks to data augmentation procedure by applying a random transformation to each input image (translations, horizontal flips, rotations, scaling, and pixel intensity 34
augmentation). It should be noted that our method, unlike all the compared 545
methods, did not use any preprocessing such as face alignment, patch-wise mean subtraction or data augmentation. Furthermore, our method needs less memory and computational cost unlike the deep learning methods [27, 31]. As we can see from Table13, there are three parameter settings (a, b and c) that allowed our method to achieve the best results, which outperform all the
compared state of the art methods, except one of them to which the proposed method remains competitive. However, if we consider only the parameters setting b (see Table 13) as standard one for our method, the results are still better than those of the compared state of the art methods, regardless of the used experimental protocol and the considered number of expressions (6 or 7), as we
can see in Tables 11 and 12.

Table 11: Comparison of different methods on the CK+ database with 7 expressions. a and b are the references of the parameter values (see Table 13) that allowed our method to reach the best results using the experimental protocols in [28] and [20], respectively.

Method              | Category                 | Face registration | Experiment protocol | F-score         | Accuracy
Liu et al. [28]     | Deep learning            | Whole face        | [28]                | N/A             | 93.7
Ours                | Appearance               | ROI               | [28]                | 94.63a (94.19b) | 96.03a (95.47b)
Ghimire et al. [20] | Appearance and Geometric | ROI               | [20]                | 90.64           | 90.08
Ours                | Appearance               | ROI               | [20]                | 94.52b          | 94.48b
Table 12: Comparison of different methods on the CK+ database with 6 expressions. b and c are the references of the parameter values (see Table 13) that allowed our method to reach the best results using the experimental protocols in ([29], [27], [20]) and [36], respectively.

Method                 | Category                 | Face registration | Experiment protocol | F-score        | Accuracy
Khorrami et al. [29]   | Deep learning            | Whole face        | [29]                | N/A            | 98.3
Ours                   | Appearance               | ROI               | [29]                | 96.01b         | 96.77b
Liu et al. [27]        | Deep learning            | Whole face        | [27]                | 83.4           | 96.7
Ours                   | Appearance               | ROI               | [27]                | 96.9b          | 97.52b
Ghimire et al. [20]    | Appearance and Geometric | ROI               | [20]                | 94.24          | 94.1
Ours                   | Appearance               | ROI               | [20]                | 95.8b          | 95.58b
Happy and Routray [36] | Appearance               | Patch             | [36]                | 94.39          | 94.14
Ours                   | Appearance               | ROI               | [36]                | 96.3c (95.13b) | 97.18c (96.25b)
Table 13: Optimized parameter values taken by our method to reach the best results (see Tables 11 and 12) on CK+ dataset.
Reference | Descriptor | LTP / Thresholding                | Configuration of LTP               | Configuration of HOG
a         | LTP+HOG    | Eq. (3) / formula 3 (see Table 2) | Configuration 1 (see Figure 11(a)) | Configuration 1 (see Figure 11(b))
b         | LTP+HOG    | Eq. (1) / formula 3 (see Table 2) | Configuration 1 (see Figure 11(a)) | Configuration 1 (see Figure 11(b))
c         | LTP+HOG    | Eq. (4) / fixed threshold (t=2)   | Configuration 2 (see Figure 11(a)) | Configuration 1 (see Figure 11(b))
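To make the settings of Table 13 more concrete, the sketch below illustrates how a hybrid LTP+HOG descriptor can be computed for one block of a resized ROI. It is a minimal illustration, not the authors' code: it assumes the standard LTP operator of Tan and Triggs [14] with a fixed threshold t (the paper's own LTP variants, Eqs. (1), (3) and (4) with the thresholding formulas of Table 2, may differ), an 8-neighbour 3x3 window, and the HOG implementation of scikit-image with illustrative parameters rather than the exact configurations of Figure 11.

```python
# Minimal sketch of an LTP+HOG block descriptor (assumptions: standard LTP with a
# fixed threshold t, 3x3 neighbourhood, scikit-image HOG with illustrative settings).
import numpy as np
from skimage.feature import hog

def ltp_codes(block, t=2):
    """Upper/lower LTP codes from the 8-neighbour 3x3 pattern around each pixel."""
    h, w = block.shape
    # offsets of the 8 neighbours, clockwise from top-left
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    center = block[1:h - 1, 1:w - 1].astype(np.int32)
    upper = np.zeros_like(center)
    lower = np.zeros_like(center)
    for k, (dy, dx) in enumerate(offs):
        neigh = block[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx].astype(np.int32)
        upper += (neigh >= center + t).astype(np.int32) << k   # +1 states -> upper code
        lower += (neigh <= center - t).astype(np.int32) << k   # -1 states -> lower code
    return upper, lower

def block_descriptor(block, t=2):
    """Concatenate the upper/lower LTP histograms with the HOG of the same block."""
    upper, lower = ltp_codes(block, t)
    h_up, _ = np.histogram(upper, bins=256, range=(0, 256))
    h_lo, _ = np.histogram(lower, bins=256, range=(0, 256))
    hog_vec = hog(block.astype(np.float64), orientations=9, pixels_per_cell=(8, 8),
                  cells_per_block=(2, 2), block_norm='L2-Hys')
    return np.concatenate([h_up, h_lo, hog_vec]).astype(np.float32)

# Example on one hypothetical 32x32 grayscale block taken from a resized ROI
rng = np.random.default_rng(0)
block = rng.integers(0, 256, size=(32, 32), dtype=np.uint8)
print(block_descriptor(block).shape)
```

In the full pipeline, such block descriptors would be concatenated over all blocks of the seven ROIs before being fed to the multiclass SVM.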
6.4. Cross-database evaluation
We evaluated the generalization ability of our method across different databases by carrying out six experiments. In each experiment, we performed the training on one dataset and tested on the other two datasets (see Table 14).
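The sketch below illustrates this cross-database protocol. It is a minimal illustration under stated assumptions rather than the authors' implementation: the feature matrices are random placeholders standing in for the extracted face descriptors, the SVM hyperparameters are purely illustrative, and scikit-learn's SVC (itself built on LIBSVM [50]) is used in place of LIBSVM directly; the F-score is macro-averaged, following [51].

```python
# Minimal sketch of the cross-database protocol: train a multiclass SVM on one
# dataset and test on each of the other two, reporting accuracy and macro F-score.
from itertools import permutations
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(0)

def fake_dataset(n_samples, n_features=324, n_classes=7):
    """Placeholder data standing in for real face descriptors and expression labels."""
    X = rng.normal(size=(n_samples, n_features)).astype(np.float32)
    y = rng.integers(0, n_classes, size=n_samples)   # 6 basic expressions + neutral
    return X, y

datasets = {"CK": fake_dataset(300), "KDEF": fake_dataset(280), "FEED": fake_dataset(200)}

for train_name, test_name in permutations(datasets, 2):     # the six train/test pairs
    X_tr, y_tr = datasets[train_name]
    X_te, y_te = datasets[test_name]
    clf = SVC(kernel="rbf", C=10.0, gamma="scale",
              decision_function_shape="ovo")                 # one-vs-one multiclass SVM
    clf.fit(X_tr, y_tr)
    y_pred = clf.predict(X_te)
    acc = 100.0 * accuracy_score(y_te, y_pred)
    f1 = 100.0 * f1_score(y_te, y_pred, average="macro")     # macro-averaged F-score
    print(f"train {train_name} / test {test_name}: accuracy {acc:.2f}, F-score {f1:.2f}")
```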
As we can see from Table 14, our method achieves encouraging results. In particular, when the model is trained on the KDEF dataset (posed emotions), the results on the other two datasets (spontaneous or posed emotions) are very interesting. This suggests that training the model on the KDEF dataset is valuable for recognizing both spontaneous and posed emotions. We can also see that the model behaves relatively well when it is trained and tested on posed emotions (CK/KDEF and KDEF/CK cases).

Table 14: Cross-database performance on CK, KDEF, and FEED datasets.
Train    | CK    | CK    | KDEF  | KDEF  | FEED  | FEED
Test     | FEED  | KDEF  | CK    | FEED  | CK    | KDEF
Accuracy | 68.41 | 79.28 | 78.85 | 79.52 | 58.36 | 67.85
F-score  | 70.41 | 79.35 | 77.14 | 74.17 | 57.83 | 70.04
7. Conclusion

A new facial decomposition for expression recognition is presented. The method first extracts regions of interest (ROI) using landmarks given by the IntraFace algorithm. After the ROI preprocessing stage, feature extraction is performed to construct the face descriptor. A multiclass SVM classifier is finally trained and used for FER. Several texture (LBP, CLBP, LTP) and shape (HOG) descriptors, as well as their combinations, are tested and evaluated. To demonstrate its performance, the proposed facial decomposition is compared with existing ones using three public datasets. The results showed that the new facial decomposition significantly improves the recognition rate. This improvement stems from two factors. First, the proposed decomposition extracts relevant and precise facial components, which are involved in the expression of emotions. Second, exploiting both texture and shape information contributed to this improvement: the descriptor evaluation demonstrated that hybrid descriptors constructed through heterogeneous concatenation of texture and shape features perform best, in particular the concatenation of LTP and HOG. However, the optimal size (after scaling) of the ROIs varies according to the training data; it is therefore difficult to move towards a fully generic system. For future work, we plan to consider other sophisticated hand-crafted and deep learning descriptors. We are also interested in exploiting FER with multiple observations in the training and testing phases. Another perspective is to extend the developed framework to facial images acquired with uncontrolled viewpoints.
Acknowledgements

This research work is part of the Volubilis project registered under Volubilis MA/14/302. The authors would like to thank the Franco-Moroccan Volubilis program for its support. The authors also thank the anonymous reviewers for their helpful comments and suggestions.
References

[1] S. Shakya, S. Sharma, A. Basnet, Human behavior prediction using facial expression analysis, in: Computing, Communication and Automation (ICCCA), 2016 International Conference on, IEEE, 2016, pp. 399–404.
[2] A. A. Gunawan, et al., Face expression detection on Kinect using active appearance model and fuzzy logic, Procedia Computer Science 59 (2015) 268–274.
[3] R. Shbib, S. Zhou, Facial expression analysis using active shape model, Int. J. Signal Process. Image Process. Pattern Recognit. 8 (1) (2015) 9–22.
[4] P. Ekman, An argument for basic emotions, Cognition & Emotion 6 (3-4) (1992) 169–200.
[5] U. X. Eligio, S. E. Ainsworth, C. K. Crook, Emotion understanding and performance during computer-supported collaboration, Computers in Human Behavior 28 (6) (2012) 2046–2054.
[6] G. Molinari, C. Bozelle, D. Cereghetti, G. Chanel, M. Bétrancourt, T. Pun, Feedback émotionnel et collaboration médiatisée par ordinateur: Quand la perception des interactions est liée aux traits émotionnels, in: Environnements Informatiques pour l'Apprentissage Humain, Actes de la conférence EIAH, 2013, pp. 305–326.
[7] K. Lekdioui, R. Messoussi, Y. Chaabi, Etude et modélisation des comportements sociaux d'apprenants à distance, à travers l'analyse des traits du visage, in: 7ème Conférence sur les Environnements Informatiques pour l'Apprentissage Humain (EIAH 2015), 2015, pp. 411–413.
[8] M.-T. Yang, Y.-J. Cheng, Y.-C. Shih, Facial expression recognition for learning status analysis, in: International Conference on Human-Computer Interaction, Springer, 2011, pp. 131–138.
[9] R. Nkambou, V. Heritier, Reconnaissance émotionnelle par l'analyse des expressions faciales dans un tuteur intelligent affectif, in: Technologies de l'Information et de la Connaissance dans l'Enseignement Supérieur et l'Industrie, Université de Technologie de Compiègne, 2004, pp. 149–155.
[10] R. Li, P. Liu, K. Jia, Q. Wu, Facial expression recognition under partial occlusion based on Gabor filter and gray-level co-occurrence matrix, in: Computational Intelligence and Communication Networks (CICN), 2015 International Conference on, IEEE, 2015, pp. 347–351.
[11] G. Stratou, A. Ghosh, P. Debevec, L.-P. Morency, Effect of illumination on automatic expression recognition: a novel 3D relightable facial database, in: Automatic Face & Gesture Recognition and Workshops (FG 2011), 2011 IEEE International Conference on, IEEE, 2011, pp. 611–618.
[12] A. Samal, P. A. Iyengar, Automatic recognition and analysis of human faces and facial expressions: A survey, Pattern Recognition 25 (1) (1992) 65–77.
[13] T. Ojala, M. Pietikäinen, D. Harwood, A comparative study of texture measures with classification based on featured distributions, Pattern Recognition 29 (1) (1996) 51–59.
[14] X. Tan, B. Triggs, Enhanced local texture feature sets for face recognition under difficult lighting conditions, IEEE Transactions on Image Processing 19 (6) (2010) 1635–1650.
[15] F. Ahmed, H. Bari, E. Hossain, Person-independent facial expression recognition based on compound local binary pattern (CLBP), Int. Arab J. Inf. Technol. 11 (2) (2014) 195–203.
[16] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), Vol. 1, IEEE, 2005, pp. 886–893.
[17] G. Lei, X.-h. Li, J.-l. Zhou, X.-g. Gong, Geometric feature based facial expression recognition using multiclass support vector machines, in: Granular Computing, 2009, GRC'09. IEEE International Conference on, IEEE, 2009, pp. 318–321.
[18] A. Poursaberi, H. A. Noubari, M. Gavrilova, S. N. Yanushkevich, Gauss–Laguerre wavelet textural feature fusion with geometrical information for facial expression identification, EURASIP Journal on Image and Video Processing 2012 (1) (2012) 1–13.
[19] V. Rapp, T. Sénéchal, L. Prevost, K. Bailly, H. Salam, R. Seguier, Combinaison de descripteurs hétérogènes pour la reconnaissance de micro-mouvements faciaux, in: RFIA 2012 (Reconnaissance des Formes et Intelligence Artificielle), 2012, pp. 978–2.
[20] D. Ghimire, S. Jeong, S. Yoon, J. Choi, J. Lee, Facial expression recognition based on region specific appearance and geometric features, in: Digital Information Management (ICDIM), 2015 Tenth International Conference on, IEEE, 2015, pp. 142–147.
[21] T. Gritti, C. Shan, V. Jeanne, R. Braspenning, Local features based facial expression recognition with face registration errors, in: Automatic Face & Gesture Recognition, 2008. FG'08. 8th IEEE International Conference on, IEEE, 2008, pp. 1–8.
[22] C. Shan, S. Gong, P. W. McOwan, Facial expression recognition based on local binary patterns: A comprehensive study, Image and Vision Computing 27 (6) (2009) 803–816.
[23] P. Carcagnì, M. Coco, M. Leo, C. Distante, Facial expression recognition and histograms of oriented gradients: a comprehensive study, SpringerPlus 4 (1) (2015) 1.
[24] M. Valstar, M. Pantic, Fully automatic facial action unit detection and temporal analysis, in: 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06), IEEE, 2006, pp. 149–149.
[25] Z. Zhang, M. Lyons, M. Schuster, S. Akamatsu, Comparison between geometry-based and Gabor-wavelets-based facial expression recognition using multi-layer perceptron, in: Automatic Face and Gesture Recognition, 1998. Proceedings. Third IEEE International Conference on, IEEE, 1998, pp. 454–459.
[26] L. Zhang, D. Tjondronegoro, Facial expression recognition using facial movement features, IEEE Transactions on Affective Computing 2 (4) (2011) 219–229.
[27] P. Liu, S. Han, Z. Meng, Y. Tong, Facial expression recognition via a boosted deep belief network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1805–1812.
[28] M. Liu, S. Li, S. Shan, X. Chen, AU-inspired deep networks for facial expression feature learning, Neurocomputing 159 (2015) 126–136.
[29] P. Khorrami, T. Paine, T. Huang, Do deep neural networks learn facial action units when doing expression recognition?, in: Proceedings of the IEEE International Conference on Computer Vision Workshops, 2015, pp. 19–27.
[30] T. Hinz, P. Barros, S. Wermter, The effects of regularization on learning facial expressions with convolutional neural networks, in: International Conference on Artificial Neural Networks, Springer, 2016, pp. 80–87.
[31] B. Liu, M. Wang, H. Foroosh, M. Tappen, M. Pensky, Sparse convolutional neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 806–814.
[32] P. Ekman, W. V. Friesen, Facial Action Coding System.
[33] L. Wang, R. Li, K. Wang, A novel automatic facial expression recognition method based on AAM, Journal of Computers 9 (3) (2014) 608–617.
[34] J. Chen, Z. Chen, Z. Chi, H. Fu, Facial expression recognition based on facial components detection and HOG features, in: International Workshops on Electrical and Computer Engineering Subfields, 2014, pp. 884–888.
[35] M. M. Donia, A. A. Youssif, A. Hashad, Spontaneous facial expression recognition based on histogram of oriented gradients descriptor, Computer and Information Science 7 (3) (2014) 31.
[36] S. Happy, A. Routray, Automatic facial expression recognition using features of salient facial patches, IEEE Transactions on Affective Computing 6 (1) (2015) 1–12.
[37] A. A. Youssif, W. A. Asker, Automatic facial expression recognition system based on geometric and appearance features, Computer and Information Science 4 (2) (2011) 115.
[38] T. Kanade, J. F. Cohn, Y. Tian, Comprehensive database for facial expression analysis, in: Automatic Face and Gesture Recognition, 2000. Proceedings. Fourth IEEE International Conference on, IEEE, 2000, pp. 46–53.
[39] D. Lundqvist, A. Flykt, A. Öhman, The Karolinska Directed Emotional Faces (KDEF), CD ROM from Department of Clinical Neuroscience, Psychology section, Karolinska Institutet (1998) 91–630.
[40] F. Wallhoff, Facial expressions and emotion database, Technische Universität München.
[41] X. Xiong, F. De la Torre, Supervised descent method and its applications to face alignment, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 532–539.
[42] G. Bradski, et al., The OpenCV library, Dr. Dobb's Journal 25 (11) (2000) 120–126.
[43] X. Wu, J. Sun, G. Fan, Z. Wang, Improved local ternary patterns for automatic target recognition in infrared imagery, Sensors 15 (3) (2015) 6399–6418.
[44] A. A. Mohamed, R. V. Yampolskiy, Adaptive extended local ternary pattern (AELTP) for recognizing avatar faces, in: Machine Learning and Applications (ICMLA), 2012 11th International Conference on, Vol. 1, IEEE, 2012, pp. 57–62.
[45] M. Ibrahim, M. Alam Efat, S. Kayesh, S. M. Khaled, M. Shoyaib, M. Abdullah-Al-Wadud, Dynamic local ternary pattern for face recognition and verification, in: Proceedings of the International Conference on Computer Engineering and Applications, Tenerife, Spain, Vol. 1012, 2014.
[46] A. Mignon, Apprentissage de métriques et méthodes à noyaux appliqués à la reconnaissance de personnes dans les images, Ph.D. thesis, Université de Caen (2012).
[47] W.-H. Liao, Region description using extended local ternary patterns, in: Pattern Recognition (ICPR), 2010 20th International Conference on, IEEE, 2010, pp. 1003–1006.
[48] V. N. Vapnik, An overview of statistical learning theory, IEEE Transactions on Neural Networks 10 (5) (1999) 988–999.
[49] C.-W. Hsu, C.-J. Lin, A comparison of methods for multiclass support vector machines, IEEE Transactions on Neural Networks 13 (2) (2002) 415–425.
[50] C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, ACM Transactions on Intelligent Systems and Technology (TIST) 2 (3) (2011) 27.
[51] M. Sokolova, G. Lapalme, A systematic analysis of performance measures for classification tasks, Information Processing & Management 45 (4) (2009) 427–437.
[52] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, I. Matthews, The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression, in: Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, IEEE, 2010, pp. 94–101.
Highlights

- An automatic and appropriate facial decomposition with ROIs for FER.
- Emotion representation using texture and shape-based descriptors and their combination.
- LTP descriptor analysis with new definitions and strategies for thresholding.
- Comprehensive evaluation and comparison to the state of the art on 4 public databases.
- Our method outperforms all the compared methods except one, with which it remains competitive.