Visual-audio emotion recognition based on multi-task and ensemble learning with multiple features


Man Hao, Wei-Hua Cao, Zhen-Tao Liu, Min Wu, Peng Xiao

Neurocomputing, https://doi.org/10.1016/j.neucom.2020.01.048 (PII: S0925-2312(20)30099-0, Reference: NEUCOM 21805)

Received 7 August 2019; Revised 18 November 2019; Accepted 1 January 2020


Man Hao (a,b), Wei-Hua Cao (a,b,*), Zhen-Tao Liu (a,b), Min Wu (a,b), Peng Xiao (a,b)

a School of Automation, China University of Geosciences, Wuhan 430074, China
b Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan 430074, China

Abstract

An ensemble visual-audio emotion recognition framework based on multi-task and blending learning with multiple features is proposed in this paper. To address the problem that existing features cannot accurately identify different emotions, we extract two kinds of features for each modality, i.e., Interspeech 2010 and deep features for the audio data, and LBP and deep features for the visual data, with the intent of identifying different emotions accurately by using different features. Owing to the diversity of the features, SVM classifiers are designed for the manual features, i.e., the Interspeech 2010 features and local LBP features, and CNNs are designed for the deep features, through which four sub-models are obtained. Finally, the blending ensemble algorithm is used to fuse the sub-models to improve the performance of visual-audio emotion recognition. In addition, multi-task learning is applied in the CNN model for deep features, which can predict multiple tasks at the same time with fewer parameters and improve the sensitivity of the single recognition model to the user's emotion by sharing information between different tasks. Experiments are performed on the eNTERFACE'05 database, and the results indicate that the recognition rate of the multi-task CNN increases by 3% and 2% on average over the CNN model in speaker-independent and speaker-dependent experiments, respectively. Moreover, the visual-audio emotion recognition accuracy of our method reaches 81.36% and 78.42% in the speaker-independent and speaker-dependent experiments, respectively, which is higher than that of some state-of-the-art works.

Keywords: Visual-audio emotion recognition, Multiple features, Multi-task learning, Ensemble learning

This work was supported by the National Natural Science Foundation of China under Grants 61976197, 61403422 and 61273102, the Hubei Provincial Natural Science Foundation of China under Grants 2018CFB447 and 2015CFA010, the Wuhan Science and Technology Project under Grant 2017010201010133, the 111 Project under Grant B17040, and the Fundamental Research Funds for National University, China University of Geosciences (Wuhan) under Grant 1810491T07. Corresponding author: Wei-Hua Cao, email: [email protected].

1. Introduction

With the development of affective computing, it is hoped that robots can understand users' feelings and respond appropriately to improve the user experience in human-robot interaction (HRI). The primary problem HRI faces is how to identify emotional information accurately [1]. Motivated by the higher requirements of emotion recognition, the trend in emotion recognition research is to integrate multi-modality data and capture complementary information between modalities to improve robustness and recognition rate. Emotion can be detected from different forms of cues [2], such as speech [3], facial expression [4], and physiological signals [5], among which speech and facial expression are natural and effective ways to express emotions during communication. To date, bimodal emotion recognition, especially using speech and facial expression, has attracted extensive attention owing to its great potential in various applications such as the service industry, healthcare, and online gaming.

To deal with the bimodal emotion recognition problem from facial expression images and speech, various methods have been proposed in the literature; the recognition systems are roughly divided into two types, as shown in (a) and (b) of Figure 1. Early multimodal emotion recognition was based on feature engineering. Many authors agree that the most important audio features for emotion recognition are pitch, intensity, duration, spectral energy distribution, Mel Frequency Cepstral Coefficients (MFCCs), average Zero Crossings Density (ZCD), and filter-bank energy parameters (FBEs) [3],[6]. Commonly used visual features are geometric features [7], Local Binary Patterns (LBP) features [8], and Gabor features [9]. With the advent of deep neural networks in the last decade, researchers are more inclined to explore deep features, which have been applied to bimodal emotion recognition [10, 11] to learn deep representations of spectrograms and facial expression images. However, no single feature has yet been found that accurately identifies all different emotions.

Figure 1: Emotion recognition systems for visual-audio data: (a) extraction of manual features from the visual-audio data followed by a classifier; (b) extraction of deep features followed by a classifier; (c) the proposed scheme, in which manual and deep features are extracted separately, classified, and then fused.

Driven by the above problem, we propose an ensemble visual-audio emotion recognition method based on multi-task and ensemble learning with multiple features, as shown in (c) of Figure 1. Considering that different features of the same modality have different advantages [12, 13], manual features, i.e., Interspeech 2010 features and LBP features, as well as deep features are extracted from the audio and visual data. LBP is a texture feature with rotation invariance and gray-scale invariance, which captures local details in an image well. The deep feature extracted by a CNN is a global feature of the image, which is sensitive to changes in position and orientation. Interspeech 2010 is a dedicated emotional speech feature set that contains manually extracted emotion-related information, whereas the deep feature of audio data contains more comprehensive information but is descriptive and less stable. Given their respective advantages, the two kinds of features can complement each other and are used together to improve the performance of emotion recognition. Because of the differences between the two kinds of features, decision-level fusion is employed. Finally, we establish an ensemble framework with four sub-models from the visual and audio data, namely, multi-task CNN with mel-spectrograms, SVM with Interspeech 2010 features, multi-task CNN with facial images, and SVM with local LBP features. Experiments are implemented on the eNTERFACE'05 database, and the results show that our method achieves higher accuracy than some state-of-the-art works.

The main contributions of this paper are two-fold. (1) We propose an ensemble emotion recognition framework consisting of four sub-models with manual and deep features in the visual and audio modalities, which improves the generalization ability and recognition accuracy for different emotions. (2) A multi-task CNN model is established in the ensemble framework, in which gender recognition is selected as an auxiliary task to improve the sensitivity of the single model to the user's emotion.

The rest of the paper is organized as follows. Related works are introduced in Section 2, the framework of bimodal emotion recognition is proposed in Section 3, and experimental results and analysis are given in Section 4.

2. Related Works

Researchers have made numerous efforts on fusion methods for audio-visual emotion recognition, including data/feature-level fusion, decision-level fusion, and hybrid approaches. Chen et al. [14] explored the properties of visual and audio features and adopted multiple kernel learning to find an optimal feature fusion. A bimodal emotion recognition approach based on the sparse kernel reduced-rank regression fusion method was proposed in [15] to fuse the emotion features of the two modalities. A PSO-optimized FAMNN feature-level fusion method was proposed in [16], in which the emotions in the audio and visual information were recognized using a fuzzy ARTMAP neural network (FAMNN) and particle swarm optimization (PSO) was employed to determine the optimum parameters of the FAMNN. These methods belong to feature-level fusion, which mainly combines speech and facial expression features directly; this is simple and intuitive, but it cannot flexibly express the influence coefficient of each feature and leads to an excessive feature dimension. To solve these problems, some scholars have studied decision-level fusion methods, which train a classifier for each feature and combine the recognition results of all classifiers as the final output. A classification model for speech and facial expressions was established in [17], and the Bayes sum rule was used to fuse the results. Kaya [18] proposed an extreme learning machine method for modeling audio and video features for emotion recognition under uncontrolled conditions.

To accurately identify different emotions, we extract manual features and deep features simultaneously for the visual and audio data. Nowadays, multi-information fusion [19]-[21] and feature selection [22] methods have been studied in many fields. However, a deep neural network for deep features, such as a CNN, needs a large amount of data for training. Thus, we augment the mel-spectrograms and facial images in the training dataset, which creates difficulties for direct fusion with the LBP or Interspeech 2010 features. Therefore, we focus on decision-level fusion, and the problem that visual-audio emotion recognition then faces is how to effectively combine multiple models.

Ensemble learning is a way of combining multiple learning algorithms for better performance; it can not only combine models with different structures, but also train them on different subsets of features.

Given the flexibility and practicality of ensemble learning, it has been introduced into the field of emotion recognition [23]-[26]. Lee and Narayanan [23] used an ensemble of three classifiers to detect negative emotion in spoken dialogs, where each classifier was separately trained with acoustic, lexical, and discourse features, respectively; they showed that the fusion of the classifiers' decisions outperforms a single classifier. An ensemble learning method based on selective ensemble feature selection and rough set theory was proposed in [25], which can meet the tradeoff between accuracy and diversity of the base classifiers and was proved to be effective for emotion recognition. Sun et al. [26] proposed an ensemble model for speech emotion recognition that used feature extraction methods with different principles to generate the subspaces for the base classifiers, ensuring their diversity. However, these methods are mainly applied to single-modality emotion recognition. In this study, we propose an ensemble model with multiple features for bimodal emotion recognition: we extract different features in each modality and design different classifiers according to the characteristics of the features, which exploits the advantages of the different features to improve the robustness of the model. Besides, the design is flexible and easily adaptable to other multi-modal recognition problems.

On the other hand, the performance of an ensemble model highly depends on the ability of its base classifiers. Many problems in the real world cannot be decomposed into independent sub-problems. For example, human emotions are quite complex and the ways of expressing emotion are diverse, affected by many factors such as different speakers, genders, and ages. That is to say, other information is related to an individual's emotional state; thus, exploiting the relevance of this information is critical to improving emotion recognition accuracy. To achieve sharing and mutual assistance of different information, multi-task learning was proposed [27], which improves learning efficiency and prediction accuracy by learning shared representations between appropriate tasks. To date, it has been widely applied to many problems from computer vision to disease identification [28]-[31]. Among them, the most widely used is facial recognition [32, 33], in which facial verification and facial landmark prediction are considered as auxiliary tasks so that the classifier for the major task can be better trained with their help. Multi-task learning has also been applied in emotion recognition: Xia et al. [34] considered two dimensional labels, i.e., valence and activation, to improve categorical speech emotion recognition.

Since several research areas have benefited greatly from multi-task learning, we design a multi-task CNN framework to improve the base classifier's ability for emotion recognition, in which gender recognition is selected as the auxiliary task.

3. Bimodal Emotion Recognition

In this section, we present an ensemble visual-audio emotion recognition framework. Different from other emotion recognition methods, we extract both manual features and deep features for the visual and audio data, which makes the features more discriminative and robust so as to accurately identify different emotions. As described in Fig. 2, the framework contains four sub-models, i.e., multi-task CNN with mel-spectrograms, SVM with Interspeech 2010 features, multi-task CNN with facial images, and SVM with local LBP features. After the four models are established, the blending algorithm in ensemble learning is adopted to fuse them for better performance.


Figure 2: Framework of bimodal emotion recognition.

3.1. Multi-channel Emotional Feature Extraction

In this part, the speech signals and facial expression images in the emotional video signals are acquired first, and then different features are extracted from each modality for emotion recognition.

3.1.1. Speech feature extraction

For the audio modality, fundamental frequency, formants, short-term energy, etc. are commonly used speech features. To obtain such features, we extract the Interspeech 2010 acoustic feature set (IS10) [35] using openSMILE [36].

This feature set contains a variety of statistics over frame-level acoustic features including loudness, Mel-frequency cepstral coefficients (MFCCs), line spectral pairs (LSPs), fundamental frequency (F0), voicing, shimmer, and jitter, which form 1582 utterance-level features. Since the scales of the different speech features are inconsistent, z-score normalization is used to obtain normalized N × 1582 feature matrices, where N is the number of samples. The IS10 acoustic feature set works well in speech emotion recognition, but it lacks time-domain information. To obtain features carrying both time-domain and frequency-domain information, the mel-spectrogram is generated, which presents the strength or "loudness" of a signal over time at various frequencies for a particular waveform. In current research, it is used for identifying the strength and frequencies of formants and for real-time biofeedback in voice training and therapy. To draw mel-spectrograms, the discrete Fourier transform is used:

X(k) = \sum_{n=0}^{N-1} x(n) e^{-i \frac{2\pi k n}{N}}, \quad k = 0, 1, \cdots, N-1.    (1)

These frequency contents are stacked together in time by compressing the amplitude axis into a contour map. Spectrograms have time along the horizontal axis, frequency along the vertical axis, and the amplitude of the signal at any given time and frequency shown as a color level; conventionally, black signals the most energy and white the least. Fig. 3 shows mel-spectrograms of the six emotions, i.e., angry, disgust, fear, happy, sad, and surprise. The mel-spectrogram captures the variation of each frequency component of the speech signal over time, and thus contains both time-domain and frequency-domain features. The signal amplitude in the texture of the figures reflects the formants; in addition to formants, common prosodic features such as the fundamental frequency are also included in the mel-spectrogram. The intensity of the texture distribution in the mel-spectrogram represents the distribution of energy, reflecting emotional information such as tone, accent, and pauses in the speech.


Figure 3: Mel-spectrogram examples of six basic emotions.
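As a concrete illustration of this step, the following Python sketch shows one way such mel-spectrogram images could be produced with the librosa library; the sampling rate, FFT size, hop length, and figure size are illustrative choices of ours, not the authors' exact settings.

    import numpy as np
    import librosa
    import matplotlib.pyplot as plt

    def melspectrogram_image(wav_path, out_png, sr=16000, n_mels=128):
        """Turn a speech file into a gray-scale mel-spectrogram image (illustrative parameters)."""
        y, sr = librosa.load(wav_path, sr=sr)                   # load and resample the waveform
        S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                           hop_length=256, n_mels=n_mels)
        S_db = librosa.power_to_db(S, ref=np.max)               # log-compress the power values
        plt.figure(figsize=(2.2, 1.28), dpi=100)
        plt.axis("off")
        plt.imshow(S_db, origin="lower", aspect="auto", cmap="gray_r")  # darker = more energy
        plt.savefig(out_png, bbox_inches="tight", pad_inches=0)
        plt.close()
        return S_db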

3.1.2. Facial expression feature extraction

To extract facial expression features from the audio-visual emotion dataset, the first step is obtaining facial expression images. All videos are first framed to obtain a series of facial expression images. Considering that emotion in an actual situation is a gradual process that usually needs time to be evoked, for each emotional video clip we select five consecutive images in the middle of the video. In this way, we obtain richer emotional information; on the other hand, the number of training samples is expanded, which makes model training better.

Changes in facial expression cause facial texture information to appear or disappear, thus texture features can be used for emotion classification. LBP [37] is a representative texture feature, which compares the gray value of each pixel of the image with those of its neighboring pixels and then uses a binary pattern to represent the comparison result, accurately describing the spatial structure of the local texture in the image. LBP has been well applied in facial emotion recognition [12, 38] because of its good rotation and gray-scale invariance and its simple calculation, and because it can overcome the problems of image displacement, rotation, and illumination imbalance. Therefore, the LBP feature is extracted in this paper. Expression changes are mainly reflected in the eyes, brows, and mouth; thus, the dlib library [39] is used for face detection to obtain the feature points of the eyes, brows, and mouth. Then the LBP features of the eyes, eyebrows, and mouth are extracted; the whole process is shown in Fig. 4.


Figure 4: LBP feature extraction process.
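To make the pipeline of Fig. 4 concrete, the sketch below extracts LBP histograms from the brow, eye, and mouth regions using dlib landmarks and scikit-image; the 68-point predictor file, the landmark index ranges, and the padding are our assumptions for illustration rather than the paper's exact configuration.

    import cv2
    import dlib
    import numpy as np
    from skimage.feature import local_binary_pattern

    detector = dlib.get_frontal_face_detector()
    # standard 68-point landmark model, downloaded separately (assumed here)
    predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

    def region_lbp_histogram(gray, points, pad=10, P=8, R=1):
        """Uniform-LBP histogram of the bounding box around a group of landmarks."""
        x0, x1 = max(points[:, 0].min() - pad, 0), points[:, 0].max() + pad
        y0, y1 = max(points[:, 1].min() - pad, 0), points[:, 1].max() + pad
        codes = local_binary_pattern(gray[y0:y1, x0:x1], P, R, method="uniform")
        hist, _ = np.histogram(codes, bins=P + 2, range=(0, P + 2), density=True)
        return hist

    def facial_lbp_features(image_bgr):
        """Concatenated LBP histograms of brows, eyes, and mouth of the first detected face."""
        gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
        faces = detector(gray, 1)
        if not faces:
            return None
        pts = np.array([[p.x, p.y] for p in predictor(gray, faces[0]).parts()])
        # 68-point convention: brows 17-26, eyes 36-47, mouth 48-67
        regions = [pts[17:27], pts[36:48], pts[48:68]]
        return np.concatenate([region_lbp_histogram(gray, r) for r in regions])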

3.2. Emotion Recognition

For the obtained mel-spectrograms and facial expression images, a multi-task CNN model is designed to improve the sensitivity of the single-modality recognition model to the user's emotion, in which gender recognition is the auxiliary task and emotion recognition is the main task. Then, SVM classifiers are designed for the manually extracted speech emotion features and facial expression features. Finally, the blending ensemble algorithm is used to fuse the sub-models and obtain the final recognition result.

3.2.1. Multi-task CNN

The multi-task learning model is established on the basis of a CNN, in which the convolutional layers are shared; this is a good way to reduce parameters and share information for multi-task recognition. The architecture of the multi-task CNN model is shown in Fig. 5.


Figure 5: Architecture of multi-task CNN.

The CNN structure is designed according to AlexNet [43]. It is composed of five convolutional layers (C1, C2, C3, C4, C5) to extract features and three subsampling layers (P1, P2, P5); the parameters of each layer and the output feature map sizes are shown in Table 1.

Table 1: CNN parameters.

Layer   Dimension           Stride   Feature map size of output
C1      3×3 (64 filters)    2        64×64×64
P1      2×2                 2        64×32×32
C2      3×3 (128 filters)   2        128×16×16
P2      2×2                 2        128×8×8
C3      3×3 (256 filters)   1        256×8×8
C4      3×3 (256 filters)   1        256×8×8
C5      3×3 (256 filters)   1        256×8×8
P5      2×2                 2        256×4×4

At the top of the CNN, the Softmax function is used to calculate the loss during training and to predict the class probabilities during classification. The input and output of Softmax represent the feature and the category of the classification, respectively. Suppose the sample data is {(x_1, y_1), ..., (x_m, y_m)}, where m is the number of input samples. For each sample, a linear score is first computed,

h_{\theta_j}(x) = e^{x^{T}\theta_j},    (2)

where θ_j represents the weight parameter of the j-th output node. Normalizing these scores, the probability of each category is expressed as

P(y = j \mid x) = \frac{h_{\theta_j}(x)}{\sum_{j=1}^{K} h_{\theta_j}(x)},    (3)

where K represents the number of output categories and P(y = j | x) ∈ [0, 1]. The output category is the one with the highest classification probability,

\arg\max_{j} P(y = j \mid x).    (4)
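Equations (2)-(4) amount to the standard softmax classifier; a minimal, numerically stabilized NumPy sketch (our own illustration, not the authors' code) is:

    import numpy as np

    def softmax_predict(x, theta):
        """x: feature vector of length d; theta: (K, d) weights, one row per class."""
        scores = theta @ x                              # linear scores theta_j^T x, Eq. (2)
        scores = scores - scores.max()                  # shift for numerical stability
        probs = np.exp(scores) / np.exp(scores).sum()   # normalized probabilities, Eq. (3)
        return probs, int(np.argmax(probs))             # most probable class, Eq. (4)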

Since the two tasks may not converge at the same time, we set two weighting parameters to balance them; the loss function is computed as

loss_{full} = p \, loss_{emotion} + q \, loss_{gender},    (5)

where the parameters p and q are trained using the back-propagation algorithm. Softmax is used with the cross-entropy loss, thus the above loss function can be expressed as

loss_{full} = -p \sum_{k=1}^{K_1} y_{emotion} \log P_{emotion}(y_{emotion} = k \mid x) - q \sum_{k=1}^{K_2} y_{gender} \log P_{gender}(y_{gender} = k \mid x),    (6)

where K_1 and K_2 are the numbers of emotion and gender classes, respectively. When training the classifiers, we choose stochastic gradient descent (SGD) to minimize the loss in Eq. (6). The gradient of the loss function gives the direction of steepest descent, and the update process can be expressed as

w = w - \eta \nabla_{w} L,    (7)

where η is the step size, w is the weight vector, and ∇_w L is the gradient.
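A PyTorch sketch of the shared-trunk, two-head design of Fig. 5 with the joint loss of Eqs. (5)-(7) is given below; the trunk roughly follows Table 1, but the fully connected width and the fixed weights p = 0.9, q = 0.1 (the values used in Section 4.2) are illustrative simplifications rather than the authors' exact implementation.

    import torch
    import torch.nn as nn

    class MultiTaskCNN(nn.Module):
        """Shared convolutional layers (roughly Table 1) with separate emotion and gender heads."""
        def __init__(self, n_emotions=6, n_genders=2):
            super().__init__()
            def block(cin, cout, stride, pool):
                layers = [nn.Conv2d(cin, cout, 3, stride=stride, padding=1), nn.ReLU(inplace=True)]
                if pool:
                    layers.append(nn.MaxPool2d(2, 2))
                return layers
            self.shared = nn.Sequential(
                *block(3, 64, 2, True),      # C1 + P1
                *block(64, 128, 2, True),    # C2 + P2
                *block(128, 256, 1, False),  # C3
                *block(256, 256, 1, False),  # C4
                *block(256, 256, 1, True),   # C5 + P5 -> 256 x 4 x 4 for 128 x 128 inputs
                nn.Flatten(),
                nn.Linear(256 * 4 * 4, 512), nn.ReLU(inplace=True), nn.Dropout(0.3),
            )
            self.emotion_head = nn.Linear(512, n_emotions)   # task 1: emotion
            self.gender_head = nn.Linear(512, n_genders)     # task 2: gender

        def forward(self, x):
            h = self.shared(x)
            return self.emotion_head(h), self.gender_head(h)

    def joint_loss(emo_logits, gen_logits, emo_y, gen_y, p=0.9, q=0.1):
        """Weighted sum of the two cross-entropy terms, as in Eqs. (5)-(6)."""
        ce = nn.CrossEntropyLoss()
        return p * ce(emo_logits, emo_y) + q * ce(gen_logits, gen_y)

    model = MultiTaskCNN()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001)   # SGD update of Eq. (7)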

3.3. Fusion method

For the manually extracted IS10 and LBP features, traditional classifiers can meet the requirements because their dimensions are not very high. Among them, SVM is widely used in speech and facial expression emotion recognition [40]-[42] due to its good classification performance; thus, a sub-model using a multi-class SVM classifier is established in each modality. Although the manually extracted feature sets have a certain emotional representation ability, they cannot fully represent the deeper information of the data, while deep learning methods can extract high-order emotional features and compensate for this defect. For the deep features, we design multi-task CNN classifiers with the original facial expression images and mel-spectrograms as input.

Considering the diversity of the base classifiers in speech and facial expression, an ensemble model is established, in which the blending algorithm is used to fuse the four models. Blending is a machine learning method that combines the advantages of multiple models and generally achieves better prediction results than a single model. It consists of two layers: the first layer is traditional training of several base classifiers; the second layer recombines the outputs of the base classifiers into a new training set to train a higher-level classifier, whose purpose is to determine the weights or the way the base classifiers are combined. The outputs of the base classifiers are used as input when training the second-level classifier, whose role is to integrate them. The process is shown in Fig. 6, where C_1, ..., C_m are the classifiers in the first layer, representing the four sub-models mentioned above, and the meta-classifier is the second-level classifier, for which logistic regression is adopted.

Figure 6: Process of blending ensemble.

Firstly, the original training data set is divided into a training data set DT and a hold-out data set DA according to a certain proportion, where the proportion of the training data set is about 60-80%. Then, the four models C_1, ..., C_4, i.e., SVM with IS10, multi-task CNN with mel-spectrograms, SVM with local LBP, and multi-task CNN with facial images, are built by learning DT. After the model training, the prediction results P_1, ..., P_4 on DA are obtained. Each prediction sequence is then regarded as a new feature and arranged as an input. Finally, a new classifier using logistic regression is trained on these new features to obtain the final recognition result. The algorithm flow is shown in Algorithm 1.

Algorithm 1 Blending.
Require: Training data, D = {DA, DT};
Ensure: ensemble classifier H; final emotion result E;
1: Step 1: learn base-level classifiers
2: for m = 1 to 4 do
3:   learn C_m based on D;
4: end for
5: Step 2: construct new data set of predictions
6: D' = {DP, DT}, where DP = C_1(DA), ..., C_4(DA);
7: Step 3: learn a meta-classifier
8: learn H based on D';
9: return H, E;
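The blending step of Algorithm 1 could be written, for instance, with scikit-learn as follows; the base models here are generic stand-ins exposing fit/predict_proba (in the paper they are the two SVMs and the two multi-task CNNs), and the split handling follows the standard blending recipe under that assumption.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def blend_train(base_models, X_train_views, y_train, X_holdout_views, y_holdout):
        """Fit base classifiers on DT, then fit the logistic-regression meta-classifier on
        their predictions for the hold-out part DA (Steps 1-3 of Algorithm 1)."""
        meta_features = []
        for model, X_tr, X_ho in zip(base_models, X_train_views, X_holdout_views):
            model.fit(X_tr, y_train)                         # Step 1: learn base-level classifiers
            meta_features.append(model.predict_proba(X_ho))  # Step 2: predictions become new features
        meta_clf = LogisticRegression(max_iter=1000)         # Step 3: learn the meta-classifier
        meta_clf.fit(np.hstack(meta_features), y_holdout)
        return meta_clf

    def blend_predict(base_models, X_test_views, meta_clf):
        """Final emotion decision from the fused base-classifier outputs."""
        meta_X = np.hstack([m.predict_proba(X) for m, X in zip(base_models, X_test_views)])
        return meta_clf.predict(meta_X)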


4. Experiments

The proposed approach is evaluated on the eNTERFACE database [44]. We set up two different experiments: a speaker-independent experiment and a speaker-dependent experiment. We first present the results of the four sub-models in the visual and audio modalities; at the same time, the results of the CNN are given to demonstrate the superiority of the multi-task CNN (MTCNN). Then we show the recognition results for different emotions in each modality. Finally, the fusion results and the comparison with state-of-the-art alternatives are given. The experiments are performed in MATLAB R2015b and Python 3.6 on a 64-bit Windows 10 computer with a dual-core Intel Core i5 CPU clocked at 3.30 GHz and 8 GB of RAM.

4.1. Speech database

We used the eNTERFACE audio-visual emotion database [44], which contains clips of 44 subjects from 14 different nationalities, including 32 males and 12 females. The subjects were asked to act the six basic emotions (anger, disgust, fear, happy, sad, and surprise) while uttering selected sentences in English with the target emotions; for most subjects, five believable reactions to each emotion were retained, recorded with an acoustic sampling rate of 48 kHz and a visual frame rate of 25 fps. Fig. 7 shows pictures of the six emotions. The sixth subject has only three emotional sentences, and the 23rd subject has only one sentence for each emotion. To eliminate the influence of this data inconsistency on the experimental results, we removed these two individuals from the experiments.


Figure 7: Some samples of the facial pictures from the eNTERFACE dataset [44].


4.2. Evaluation for four sub-models

In this section, we present the experimental results of each sub-model. Considering the application of the model in different situations, we set up speaker-independent experiments and speaker-dependent experiments. For the speaker-independent experiments, the 42 people in the database are randomly divided, i.e., 30 for training and 12 for testing. For the speaker-dependent experiments, we used roughly 70% of the database to train the classifiers and the remaining 30% to test them. Since CNN training has high requirements on the amount of data, the mel-spectrograms and facial expression images are augmented. The learning rate is set to 0.001 and the number of iterations to 6000. For the fully connected layers, dropout is set to 0.3 to prevent overfitting. The weights for gender and emotion are set to 0.1 and 0.9, respectively; these values were tuned according to experimental results. The multi-class SVMs use linear kernel functions.

4.2.1. Audio Modality Results

For the audio data, we first extract the IS10 acoustic features and perform classification using the multi-class SVM. Then the audio data are converted into mel-spectrograms. To facilitate data augmentation, the size of the generated mel-spectrograms is unified to 128×220, while the CNN input image size is 128×128. In the data augmentation process, the original mel-spectrograms are cropped, and the step size of each movement is important: if the step size is too small, the augmented samples contain more repeated information, which may result in over-fitting of the model. To prevent this, we determine the cropping step size according to experimental results, finally setting it to 8, which augments the mel-spectrogram samples of the training set by 12 times. Correspondingly, for the test set, we take the middle part of the original mel-spectrogram and obtain a 128×128 image.
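The cropping-based augmentation described above might be implemented as in the sketch below; the function names are ours, and the 128-wide window with step 8 over a 128×220 image yields the twelve training crops mentioned in the text.

    import numpy as np

    def crop_augment(mel_img, crop_width=128, step=8):
        """Slide a crop_width window over the time axis of a 128 x 220 mel-spectrogram image."""
        width = mel_img.shape[1]
        crops = [mel_img[:, s:s + crop_width]
                 for s in range(0, width - crop_width + 1, step)]   # 12 crops for width 220
        return np.stack(crops)

    def center_crop(mel_img, crop_width=128):
        """Test-time crop: keep the middle crop_width columns, as used for the test set."""
        start = (mel_img.shape[1] - crop_width) // 2
        return mel_img[:, start:start + crop_width]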

The recognition results of the three sub-models are shown in Table 2.

Table 2: Recognition results of sub-models in audio modality.

Sub-model   SI-accuracy   SD-accuracy
SVM         47.23%        42.16%
CNN         54.99%        52.35%
MTCNN       56.33%        54.57%


4.2.2. Visual Modality Results

In the facial expression modality, analysis of the database shows that most of the emotion is expressed in the middle and latter part of each video. Therefore, we select five facial expression frames after the middle frame of each video for the training set, which enlarges the training data five-fold. For the test set, the image of the middle frame of each video is selected. Then, using the dlib feature point detection method, the face area in these images is detected and resized to 128×128.
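A sketch of this frame selection and face cropping with OpenCV and dlib is shown below; reading five frames starting from the middle of the clip and the simple rectangle crop are our illustrative choices, not necessarily the authors' exact procedure.

    import cv2
    import dlib

    detector = dlib.get_frontal_face_detector()

    def middle_face_frames(video_path, n_frames=5, size=128):
        """Return up to n_frames face crops, resized to size x size, starting at the middle frame."""
        cap = cv2.VideoCapture(video_path)
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        cap.set(cv2.CAP_PROP_POS_FRAMES, total // 2)   # jump to the middle of the clip
        faces = []
        for _ in range(n_frames):
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            rects = detector(gray, 1)
            if rects:
                r = rects[0]
                crop = gray[max(r.top(), 0):r.bottom(), max(r.left(), 0):r.right()]
                faces.append(cv2.resize(crop, (size, size)))
        cap.release()
        return faces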

The recognition results of the three sub-models are shown in Table 3.

Table 3: Recognition results of sub-models in visual modality.

Sub-model   SI-accuracy   SD-accuracy
SVM         61.42%        56.39%
CNN         63.45%        61.67%
MTCNN       66.93%        63.32%

To verify the effectiveness of the multi-task CNN model, the recognition results of three classifier models are compared in the visual and audio modalities, i.e., the SVM classifier, the CNN model, and the multi-task CNN (MTCNN) model. The experimental results are shown in Table 2 and Table 3, from which we can see that the multi-task CNN model improves the accuracy from 54.99% to 56.33% for audio and from 63.45% to 66.93% for visual in the speaker-independent experiments. Similarly, in the speaker-dependent experiments, the multi-task CNN model improves the accuracy from 52.35% to 54.57% for audio and from 61.67% to 63.32% for visual. Besides, the recognition rate of the multi-task CNN model is more than 5% higher than that of the SVM classifier in both experiments for audio and visual. These results show that the gender recognition task we selected can effectively improve the performance of single-modality emotion recognition.

4.3. Evaluation for the ensemble framework

After training each classifier using audio and video, we train a blending classifier over the SVM and MTCNN sub-models, for audio, visual, and visual-audio, to obtain the final emotion result. The fusion results for the two sub-models in visual and audio are provided in Figure 8 and Figure 9. The final recognition results of visual-audio and the comparison with the state of the art are summarized in Table 4.

Speaker-independent and speaker-dependent experiments are implemented in the audio and visual modalities. The final ensemble recognition rates are 61.06% and 59.13% for audio and 70.12% and 65.02% for visual in the speaker-independent and speaker-dependent experiments, respectively; the comparison with the two sub-models is shown in Figures 8 and 9. It can be observed that the recognition result of the ensemble model is higher than that of the SVM and MTCNN models in both audio and visual, because the blending ensemble combines the advantages of the two models to improve the overall recognition performance.


Figure 8: Recognition results of sub-model and ensemble model in audio.

In the eNTERFACE database, we recognize six emotion classes. The recognition accuracies of the SVM model and the MTCNN model in audio under different emotions are shown in Figure 10, from which it can be seen that each model performs differently for different emotional states. For instance, the recognition accuracy for angry and disgust is higher in the MTCNN model than in the SVM model; on the contrary, the SVM model is better than the MTCNN for the surprise emotional state. This illustrates that different features can accomplish the emotion recognition task, yet no single feature has been found that accurately identifies all emotions. It is precisely because the ensemble combines the advantages of the two features in different emotional states that its recognition performance improves.


Figure 9: Recognition results of sub-model and ensemble model in visual.


Figure 10: Recognition results of different emotion in audio.

For visual-audio emotion recognition, we fuse the IS10, LBP, and deep features, which results in a higher recognition rate compared to a single modality. The comparison of our ensemble framework with other visual-audio emotion recognition models is provided in Table 4, where SI-accuracy and SD-accuracy stand for the emotion recognition results of the speaker-independent and speaker-dependent experiments, respectively. As can be seen from Table 4, the recognition accuracy of our ensemble framework with multiple features is higher than that of [21] and [45] in the speaker-dependent experiment and [46] in the speaker-independent experiment, which used only manual features and a single classifier; this proves the validity of the multi-feature model.

Table 4: Comparison of recognition performances of different models for visual-audio emotion recognition.

Model             Features                        Classifier         SI-accuracy   SD-accuracy
[21]              MFCC, Gabor                     HMM                -             76%
[45]              MFCC, RASTA-PLP                 SVM                -             76.4%
[46]              MFCC, LPQ                       SVM                77.02%        -
[47]              MFCC, prosody, ITMI, and QIM    Multi-classifier   77.78%        -
SVM model         LBP, MFCC                       SVM                63.30%        61.10%
MTCNN model       Deep features                   CNN                69.02%        65.67%
Proposed method   LBP, MFCC, deep features        Multi-classifier   81.36%        78.42%

In addition, our experimental results yield better recognition performance than [47], which uses a single kind of feature with a multi-classifier model. We also present the results of visual-audio emotion recognition using a single kind of feature; by comparison, the effectiveness of our proposed framework is visible. For single-modality recognition of the audio and visual data, the experimental results demonstrate that the multi-task learning mechanism can improve recognition performance by sharing information between different tasks. Besides, blending can exploit the advantages of models with different features to improve the overall recognition performance. Finally, from the experimental results we can see that there is little difference between the speaker-independent and speaker-dependent experiments, which may be explained by the large number of subjects in the selected database and the small number of video samples per user.

5. Conclusion

An ensemble framework for visual-audio emotion recognition based on multi-task and ensemble learning with multiple features was proposed. Interspeech 2010, LBP, and deep features are extracted to accurately identify different emotions, and the MTCNN is designed to improve the sensitivity of the single model to the user's emotion through information sharing. The validity of the proposal was verified through a series of contrast experiments on the eNTERFACE database, which contained the comparison of the MTCNN model with the CNN and SVM models, of the ensemble of SVM and MTCNN models with the individual MTCNN and SVM models in the same modality, and of our method with other relevant works. Experimental results show that the recognition accuracy of the MTCNN model is higher than that of the SVM and CNN classifiers. By comparison, the recognition results for different emotions in audio differ between the MTCNN and SVM models, and a better recognition result is obtained by fusing the models using blending.

In future work, the difference between the speaker-independent and speaker-dependent experiments will be further examined. Besides, end-to-end feature fusion methods will be studied for emotion recognition, and we will study key frame selection methods so that the facial expression information in the visual modality can be fully utilized. Finally, our method could be applied to human-robot interaction [48], initiative service [49], internet teaching systems, etc.

References

[1] P. Salovey, J. D. Mayer, Emotional intelligence, Imagination, Cognition, and Personality. 9 (3) (1990) 185-211.

[2] Z. Zeng, M. Pantic, G. I. Roisman, et al., A survey of affect recognition methods: audio, visual, and spontaneous expressions, IEEE Transactions on Pattern Analysis & Machine Intelligence. 31 (1) (2009) 39-58.

[3] Z. T. Liu, M. Wu, W. H. Cao, et al., Speech emotion recognition based on feature selection and extreme learning machine decision tree, Neurocomputing. 273 (2018) 271-280.

[4] L. Ricciardi, F. Viscocomandini, R. Erro, et al., Facial Emotion Recognition and Expression in Parkinson's Disease: An Emotional Mirror Mechanism, PLoS One. 12 (1) (2017) e0169110.

[5] Z. T. Liu, Q. Xie, M. Wu, et al., Speech Emotion Recognition Based on An Improved Brain Emotion Learning Model, Neurocomputing. 309 (2018) 145-156.

[6] F. Noroozi, T. Sapiński, D. Kamińska, et al., Vocal-based emotion recognition using random forests and decision tree, International Journal of Speech Technology. 20 (2) (2017) 239-246.

[7] F. Noroozi, M. Marjanovic, A. Njegus, et al., Audio-Visual Emotion Recognition in Video Clips, IEEE Transactions on Affective Computing. 10 (1) (2019) 60-75.

[8] S. Q. Zhang, L. M. Li, Z. J. Zhao, Audio-Visual Emotion Recognition Based on Facial Expression and Affective Speech, Springer Berlin Heidelberg. (2012).

[9] Y. Wang, L. Guan, A. N. Venetsanopoulos, Kernel Cross-Modal Factor Analysis for Information Fusion With Application to Bimodal Emotion Recognition, IEEE Transactions on Multimedia. 14 (3) (2012) 597-607.

[10] S. Q. Zhang, S. L. Zhang, T. J. Huang, et al., Learning Affective Features with a Hybrid Deep Model for Audio-Visual Emotion Recognition, IEEE Transactions on Circuits & Systems for Video Technology. 28 (10) (2018) 3030-3043.

[11] P. Tzirakis, G. Trigeorgis, M. A. Nicolaou, et al., End-to-End Multimodal Emotion Recognition Using Deep Neural Networks, IEEE Journal of Selected Topics in Signal Processing. 11 (8) (2017) 1301-1309.

[12] Y. Ding, Q. Zhao, B. Li, et al., Facial Expression Recognition from Image Sequence based on LBP and Taylor Expansion, IEEE Access. 4 (2017) 19409-19419.

[13] X. B. Jia, C. C. Wen, Facial Expression Recognition Based on Gabor Features and Fuzzy Classifier, Applied Mechanics and Materials. 182-183 (2012) 1046-1050.

[14] J. K. Chen, Z. Chen, Z. Chi, et al., Emotion Recognition in the Wild with Feature Fusion and Multiple Kernel Learning, International Conference on Multimodal Interaction. (2014).

[15] J. Yan, W. Zheng, Q. Xu, et al., Sparse Kernel Reduced-rank Regression for Bimodal Emotion Recognition from Facial Expression and Speech, IEEE Transactions on Multimedia. 18 (7) (2016) 1319-1329.

[16] D. Gharavian, M. Bejani, M. Sheikhan, Audio-visual emotion recognition using FCBF feature selection method and particle swarm optimization for fuzzy ARTMAP neural networks, Multimedia Tools & Applications. 76 (2) (2017) 2331-2352.

[17] S. Sahoo, A. Routray, Emotion recognition from audio-visual data using rule based decision level fusion, in Proceedings of the 2016 IEEE Students' Technology Symposium. (2017) 7-12.

[18] H. Kaya, A. A. Salah, Combining Modality-Specific Extreme Learning Machines for Emotion Recognition in the Wild, Journal on Multimodal User Interfaces. 10 (2016) 139-149.

[19] Z. C. Li, J. H. Tang, T. Mei, Deep Collaborative Embedding for Social Image Understanding, IEEE Transactions on Pattern Analysis and Machine Intelligence. 41 (9) (2019) 2070-2083.

[20] J. H. Tang, X. B. Shu, G. J. Qi, et al., Tri-Clustered Tensor Completion for Social-Aware Image Tag Refinement, IEEE Transactions on Pattern Analysis and Machine Intelligence. 39 (8) (2017) 1662-1674.

[21] Y. Wang, L. Guan, A. N. Venetsanopoulos, Kernel Cross-Modal Factor Analysis for Information Fusion With Application to Bimodal Emotion Recognition, IEEE Transactions on Multimedia. 14 (3) (2012) 597-607.

[22] Z. Li, J. Tang, Unsupervised Feature Selection Via Nonnegative Spectral Analysis and Redundancy Control, IEEE Transactions on Image Processing. 24 (12) (2015) 5343-5355.

[23] C. M. Lee, S. S. Narayanan, Toward detecting emotions in spoken dialogs, IEEE Transactions on Speech & Audio Processing. 13 (2) (2005) 293-303.

[24] Z. Yin, M. Zhao, Y. Wang, et al., Recognition of emotions using multimodal physiological signals and an ensemble deep learning model, Computer Methods and Programs in Biomedicine. 140 (2017) 93-110.

[25] G. Wang, Y. Yang, A Novel Emotion Recognition Method Based on Ensemble Learning and Rough Set Theory, International Journal of Cognitive Informatics & Natural Intelligence. 5 (3) (2017) 61-72.

[26] Y. Sun, G. Wen, Ensemble softmax regression model for speech emotion recognition, Multimedia Tools & Applications. (2016) 1-24.

[27] R. Caruana, Multitask learning, in Learning to Learn. Berlin, Germany: Springer, (1998) 95-133.

[28] Y. Yan, E. Ricci, R. Subramanian, et al., A Multi-Task Learning Framework for Head Pose Estimation under Target Motion, IEEE Transactions on Pattern Analysis and Machine Intelligence. 38 (6) (2016) 1070-1083.

[29] R. Ranjan, V. M. Patel, R. Chellappa, HyperFace: A Deep Multi-task Learning Framework for Face Detection, Landmark Localization, Pose Estimation, and Gender Recognition, IEEE Transactions on Pattern Analysis & Machine Intelligence. 41 (1) (2019) 121-135.

[30] D. Chen, K. W. Mak, Multitask learning of deep neural networks for low-resource speech recognition, IEEE/ACM Transactions on Audio, Speech & Language Processing. 23 (7) (2015) 1172-1183.

[31] X. Zhu, H. I. Suk, S. W. Lee, et al., Subspace Regularized Sparse Multi-Task Learning for Multi-Class Neurodegenerative Disease Identification, IEEE Transactions on Biomedical Engineering. 63 (3) (2016) 607-618.

[32] T. Devries, K. Biswaranjan, G. W. Taylor, Multi-Task Learning of Facial Landmarks and Expression, IEEE Canadian Conference on Computer and Robot Vision (CRV). 2014.

[33] P. Yang, Learning active facial patches for expression analysis, in Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2012.

[34] R. Xia, Y. Liu, A Multi-task Learning Framework for Emotion Recognition Using 2D Continuous Space, IEEE Transactions on Affective Computing. 8 (1) (2017) 3-14.

[35] B. Schuller, S. Steidl, A. Batliner, et al., The Interspeech 2010 paralinguistic challenge, Interspeech. (2010) 2794-2797.

[36] F. Eyben, M. Wollmer, B. Schuller, Opensmile: the Munich versatile and fast open-source audio feature extractor, in Proceedings of the 18th ACM International Conference on Multimedia. (2010) 1459-1462.

[37] Y. Liu, Y. Cao, Y. Li, et al., Facial expression recognition with PCA and LBP features extracting from active facial patches, in Proceedings of the 2016 IEEE International Conference on Real-time Computing and Robotics. 2016.

[38] Z. T. Liu, M. Wu, W. H. Cao, et al., A facial expression emotion recognition based human-robot interaction system, IEEE/CAA Journal of Automatica Sinica. 4 (4) (2017) 668-676.

[39] D. E. King, Dlib-ml: A Machine Learning Toolkit, Journal of Machine Learning Research. 10 (3) (2009) 1755-1758.

[40] M. Sheikhan, B. Mahdi, G. Davood, Modular neural-SVM scheme for speech emotion recognition using ANOVA feature selection method, Neural Computing & Applications. 23 (1) (2013) 215-227.

[41] A. Milton, S. S. Roy, S. T. Selvi, SVM Scheme for Speech Emotion Recognition using MFCC Feature, International Journal of Computer Applications. 69 (9) (2014) 34-39.

[42] W. Wei, Q. Jia, Weighted Feature Gaussian Kernel SVM for Emotion Recognition, Computational Intelligence and Neuroscience. (6) (2016) 1-7.

[43] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, in International Conference on Neural Information Processing Systems. (2012) 1097-1105.

[44] O. Martin, I. Kotsia, B. Macq, I. Pitas, The eNTERFACE'05 audio-visual emotion database, in 22nd International Conference on Data Engineering Workshops (ICDEW'06). (2006) 1-8.

[45] S. Zhalehpour, Z. Akhtar, C. E. Erdem, Multimodal Emotion Recognition with Automatic Peak Frame Selection, IEEE International Symposium on Innovations in Intelligent Systems and Applications. (2014) 116-121.

[46] S. Zhalehpour, O. Onder, Z. Akhtar, et al., BAUM-1: A Spontaneous Audio-Visual Face Database of Affective and Mental States, IEEE Transactions on Affective Computing. 8 (3) (2017) 300-313.

[47] M. Bejani, D. Gharavian, N. M. Charkari, Audio-visual emotion recognition using ANOVA feature selection method and multi-classifier neural networks, Neural Computing and Applications. 24 (2) (2014) 399-412.

[48] Z. T. Liu, F. F. Pan, M. Wu, et al., A multimodal emotional communication based humans-robots interaction system, in Proceedings of the Chinese Control Conference. (2016) 6363-6368.

[49] M. Hao, W. H. Cao, M. Wu, et al., Proposal of Initiative Service Model for Service Robot, CAAI Transactions on Intelligence Technology. 2 (4) (2017) 253-261.

Man Hao received the B.E. degree from China University of Geosciences, Wuhan, China, in 2016. She is currently a Ph.D. student in the School of Automation, China University of Geosciences. Her current research interests include initiative service robots, emotion recognition, and human-robot interaction systems. She is a student member of the Chinese Association for Artificial Intelligence.


Wei-Hua Cao received his B.S., M.S., and Ph.D. degrees in Engineering from Central South University, Changsha, China, in 1994, 1997, and 2007, respectively. He is a Professor in the School of Automation, China University of Geosciences. He was a visiting scholar with the Department of Electrical Engineering, University of Alberta, Edmonton, Canada, from 2007 to 2008. His research interests cover intelligent control and process control.

Zhen-Tao Liu received the B.E. and M.E. degrees from Central South University, Changsha, China, in 2004 and 2008, respectively, and the Dr. E. degree from Tokyo Institute of Technology, Tokyo, Japan, in 2013. From 2013 to 2014, he was with Central South University, Changsha, China. Since Sept. 2014, he has been with the School of Automation, China University of Geosciences, Wuhan, China. His research interests include affective computing, fuzzy systems, and intelligent robots. He is a member of the IEEE-IES (Industrial Electronics Society, Institute of Electrical and Electronics Engineers) Technical Committee on Human Factors, the CAAI (Chinese Association for Artificial Intelligence) Technical Committee on Intelligent Service, and SOFT (Japan Society for Fuzzy Theory and Systems). He is an Associate Editor of Int. J. of Advanced Computational Intelligence and Intelligent Informatics. He received Best Paper Awards of Int. J. of Advanced Computational Intelligence and Intelligent Informatics in 2018 and 2017, respectively, the Best Presentation Award in IWACIII2017, the Young Researcher Award of Int. J. of Advanced Computational Intelligence and Intelligent Informatics in 2014, the Best Paper Award in the ASPIRE League Symposium 2012, the Excellent Presentation Award in IWACIII2009, and the Zhang Zhongjun Best Paper Nomination Award in CPCC 2009.

Min Wu received his B.S. and M.S. degrees in engineering from Central South University, Changsha, China, in 1983 and 1986, respectively, and his Ph.D. degree in engineering from the Tokyo Institute of Technology, Tokyo, Japan, in 1999. He was a faculty member of the School of Information Science and Engineering at Central South University from 1986 to 2014, and was promoted to Professor in 1994. In 2014, he moved to China University of Geosciences, Wuhan, China, where he is a professor in the School of Automation. He was a visiting scholar with the Department of Electrical Engineering, Tohoku University, Sendai, Japan, from 1989 to 1990, and a visiting research scholar with the Department of Control and Systems Engineering, Tokyo Institute of Technology, from 1996 to 1999. He was a visiting professor at the School of Mechanical, Materials, Manufacturing Engineering and Management, University of Nottingham, Nottingham, UK, from 2001 to 2002. His current research interests include process control, robust control, and intelligent systems. Dr. Wu is a member of the Chinese Association of Automation, and a fellow of IEEE. He received the IFAC Control Engineering Practice Prize Paper Award in 1999 (together with M. Nakano and J. She).

Peng Xiao received the B.E. degree from Nanchang University, Nanchang, China, in 2018. He is currently pursuing the master's degree with the School of Automation, China University of Geosciences. His current research interests include speech emotion recognition and human-robot interaction systems. He is a student member of the Chinese Association for Artificial Intelligence.

Declaration of Competing Interest

The authors confirm that there are no known conflicts of interest associated with this publication and that there has been no significant financial support for this work that could have influenced its outcome.

Author Contributions

Man Hao: Methodology, Software, Writing - Original Draft. Wei-Hua Cao: Investigation, Conceptualization. Zhen-Tao Liu: Validation, Writing - Reviewing and Editing. Min Wu: Resources, Supervision. Peng Xiao: Software, Investigation.
