Multi-modal Correlated Network for emotion recognition in speech










Minjie Ren, Weizhi Nie, Anan Liu, Yuting Su
The School of Electrical and Information Engineering, Tianjin University, China

Article history: Received 19 September 2019; Received in revised form 11 October 2019; Accepted 14 October 2019.

Keywords: Multi-modal; Emotion recognition; Neural networks

Abstract

With the growing demand for automatic emotion recognition systems, emotion recognition is becoming more and more crucial for human–computer interaction (HCI) research. Recently, the performance of automatic emotion recognition has improved continuously thanks to the development of both hardware and deep learning methods. However, because emotion is an abstract concept with multiple forms of expression, automatic emotion recognition is still a challenging task. In this paper, we propose a novel Multi-modal Correlated Network for emotion recognition that exploits information from both the audio and visual channels to achieve more robust and accurate detection. In the proposed method, the audio and visual signals are first preprocessed for feature extraction. After preprocessing, we obtain the Mel-spectrograms, which can be treated as images, and the representative frames from the visual segments. The Mel-spectrograms are then fed to a convolutional neural network (CNN) to obtain the audio features, and the representative frames are fed to a CNN and an LSTM to obtain the visual features. In particular, we employ the triplet loss to increase the inter-class differentiation, and we propose a novel correlated loss to reduce the intra-class differentiation. Finally, we apply a feature fusion method to fuse the audio and visual features for emotion classification. Experimental results on the AFEW dataset demonstrate that the correlation information between the modalities is crucial for automatic emotion recognition and that the proposed method achieves state-of-the-art performance on the classification task.

© 2019 Zhejiang University and Zhejiang University Press. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

Due to its wide applications in human–computer interaction (HCI), automatic human affect recognition, as a key step towards more natural HCI, has attracted considerable attention from researchers across different scientific fields. Moreover, a variety of intelligent systems have the potential to be greatly improved by automatic emotion recognition technology, including medical treatment, customer feedback assessment, online gaming and education quality evaluation. For example, when an emotion recognition module is embedded in a medical treatment system, appropriate medications or treatments can be prescribed according to the prejudged mental and physical state of patients (Chen et al., 2018). Many forms of input can be used to detect emotions, such as short phrases, speech, facial expressions, videos, emoticons, long texts and short messages. These input forms vary across applications, such as emotion-aware

e-health systems (Doctor et al., 2016; Lin et al., 2016), affect-aware smart cities (Guthier et al., 2014), and intelligent home systems embedding a botanical Internet of Things (IoT) and emotion detection (Chen et al., 2017). Many of these systems are based on text or emoticon inputs. Recently, many emotion recognition systems based on electroencephalogram (EEG) signals have been proposed (Hossain et al., 2018; Menezes et al., 2017); however, the EEG cap is invasive and makes users feel uncomfortable. Based on the above review of emotion recognition systems and methods, we note that a single input modality cannot provide the desired recognition accuracy (Huang et al., 2016; Valstar et al., 2016). Although there are various input modalities for multi-modal emotion recognition, the most common input is a combination of audio and visual information, because these two modalities can be obtained non-invasively and they are among the most powerful and natural signals humans use to convey their emotional states (Darwin and Prodger, 1998; Tian et al., 2001). Although several works on audio-visual emotion recognition have been proposed, most of them suffer from low recognition accuracy and are not robust enough. Especially in real-world environments, where the conditions of data acquisition are largely uncontrolled, audio-visual emotion recognition remains a challenging task. The varying factors of audio-visual emotion data include indoor and outdoor scenarios,





environmental noise, lighting conditions, motion blur, occlusions and pose changes. In order to tackle these problems and obtain more robust and accurate detection, we propose an audio-visual emotion recognition network. Our network first extracts features from the two modalities; we then define weights for the different modality features and fuse them. This process exploits the advantages of the different modalities to generate a more robust final feature. In addition, we employ a correlated loss and a triplet loss in our network, which reduce the intra-class differentiation and increase the inter-class differentiation. The contributions of this paper can be summarized as follows:

– We propose a novel audio-visual network for emotion recognition. In the proposed method, we utilize different network architectures to extract audio and visual features, respectively, and the triplet loss is employed to boost performance.
– We propose a novel correlated loss to guide the feature learning of the different modalities during training. The correlated loss is used to make the speech signal feature approximate the video signal feature for better classification performance; furthermore, it effectively increases the learning speed and the robustness of the final feature vector.
– We design two experiments to verify the effectiveness of the different parts of our method. The comparison with classic methods on the AFEW dataset shows that we achieve state-of-the-art performance, which means our network obtains more robust and accurate detection of human emotion.

The rest of this paper is organized as follows: Section 2 briefly reviews the related work; Section 3 explains how each part of our network works; experiments, results and analysis are provided in Section 4; finally, we draw conclusions in Section 5.

2. Related works

This section consists of three parts, covering existing work on emotion recognition from speech signals, from video signals, and from both speech and video signals. It gives an overview of the current effective methods in emotion recognition research.

2.1. Speech-based emotion recognition

In Han's method, segment-level features are used to detect emotions: deep neural networks (DNNs) are employed to predict the probability of each emotion for every speech segment (Han et al., 2014). Utterance-level features are generated from these probabilities; the experiments were performed on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database (Busso et al., 2008) and achieved 54.3% accuracy. In Fayek et al. (2017), the authors use the spectrogram of the speech signal as the input to the proposed convolutional neural networks (CNNs); an accuracy of 64.78% was obtained on the IEMOCAP database. A decision tree based on extreme learning machines (ELM) was utilized for detecting emotions from speech signals in Liu et al. (2018); on the CASIA Chinese emotion corpus (Tao et al., 2008), the method obtained 89.6% accuracy. There is a long tradition of utilizing hidden Markov models (HMMs) in automatic speech recognition (ASR) owing to their effectiveness in capturing the temporal dynamics of speech. Mao et al. (2019) proposed several models based on HMMs and tested their effectiveness on both the Berlin Emotional Speech Database (EMO-DB) (Burkhardt et al., 2005) and the IEMOCAP database, achieving 86.23% and 62.23%, respectively. Besides HMMs, recurrent neural networks (RNNs) with long short-term memory (LSTM) have also proved effective for dynamic modeling in speech emotion recognition (Mirsamadi et al., 2017; Han et al., 2018). Recently, deep retinal CNNs (DRCNNs) were proposed by Niu et al. (2017) and shown to recognize emotions with an accuracy as high as 99.25% on the IEMOCAP database.

2.2. Video-based emotion recognition

In Hossain and Muhammad (2017), a mobile emotion recognition application based on face detection was developed, using the bandlet transform and local binary patterns (LBP) as the feature extractor and a Gaussian mixture model (GMM) as the classifier; the experimental results on the CK database achieved 99.7% accuracy. In Zeng et al. (2018), histogram of oriented gradients (HoG) features were employed and deep sparse autoencoders were proposed for an image-based emotion recognition system; an accuracy of 96% was obtained on the extended CK database (CK+). With the rise of deep learning, many new approaches have been proposed for video-based emotion recognition. Mollahosseini et al. proposed a deep neural network (DNN) based approach (Mollahosseini et al., 2016) whose input is the raw face image; 93.2% accuracy was achieved on the CK+ database. In Ding et al. (2017), a combination of several deep models was used as one deep network, called FaceNet2ExpNet, which achieved 96.8% accuracy on the CK+ database. The Bounded Residual Gradient Networks (BReG-Net) were proposed for facial expression recognition in Hasani et al. (2019); in this network, the shortcut connection between the input and the output of the ResNet module is replaced with a differentiable function with a bounded gradient, and 69.29% accuracy was achieved on the FER2013 dataset.


2.3. Audio-video based emotion recognition

Zhang et al. proposed a system with audio-visual pre-trained models for emotion recognition (Zhang et al., 2017). The raw audio signal is first transformed into a Mel-spectrogram, which is used as the input to a CNN for feature extraction; for the video signal, key face frames are selected and then input to a 3D-CNN. The system achieved around 86% accuracy on the eNTERFACE database. In Kaya et al. (2017a), several different networks are used to extract features from the different signals, and these features are fused at the score level into a more robust and effective feature vector for emotion recognition; the authors tested their method on the EmotiW 2015 database and obtained 54.55% accuracy. In a recent study, Cai et al. (2019) utilized the OpenSmile toolkit to extract features from the audio channel, and LBP-TOP together with an ensemble of CNNs for the visual channel. Two fusion methods, feature-level fusion and model-level fusion, were introduced to fuse the extracted features, and experimental results on the EmotiW2018 AFEW dataset (Dhall et al., 2012) reached 56.81% accuracy. In Zhang et al. (2019), factorized bilinear pooling (FBP) was proposed to deeply integrate the extracted audio and video features; the approach was validated on the AFEW database and achieved an accuracy of 62.48%. In Ortega et al. (2019), a pre-trained CNN and transfer learning were employed to extract features from key face frames, and a minimalistic set of parameters was used as the audio features. Feature-level fusion was then carried out, providing concordance correlation coefficients (CCC) of 0.749 and 0.565 for arousal and valence, respectively, on the RECOLA dataset.




Fig. 1. An overall block diagram of the proposed emotion recognition system. As shown in the figure, the triplet loss minimizes the distance between the anchor and the positive samples and maximizes the distance between the anchor and the negative samples. Meanwhile, the correlated loss is employed to guide the feature learning of the different modalities, meaning that the classification results of the two modality networks should be similar to each other.

3. Proposed method

As Fig. 1 shows, the framework of our work includes three parts. (1) Data processing: the two modalities, speech and video signals, are first preprocessed separately and then fed to our network. (2) Emotion recognition network: it is used to extract the emotion features of the different modalities. Here, we introduce the correlated loss so that the networks can guide each other in learning the feature information, increasing the learning speed and producing more robust features. (3) Audio-visual information fusion: an effective fusion method is employed to take advantage of the information of both modalities for a more accurate emotion recognition result. In the following, we introduce these three parts in detail.

3.1. Data pre-processing

Speech signal pre-processing: We first divide the input speech signal into 40 ms frames with 50% overlap between successive frames. The frames are then multiplied by a Hamming window, and the outputs pass through the fast Fourier transform to convert them to the frequency domain. Next, we apply 25 band-pass filters (BPFs) to the frequency-domain signal and take the logarithm of the filter outputs. Finally, we follow the above steps frame by frame to form the Mel-spectrogram of the

speech signal. Since traditional hand-crafted speech features cannot achieve the expected performance on large amounts of noisy data, we employ CNN models to extract the features of the speech signal. As the input of the CNNs, the speech signals should be transformed into image form. Thus, we perform the previous steps to obtain the input image, which uses the Mel-spectrogram, the delta image and the double-delta image as its three channels; the Mel-spectrogram's delta and double-delta coefficients encode the relative temporal information of the speech signal.

Video signal pre-processing: The face detection algorithm of Viola and Jones (2001) is first employed to filter out the frames without faces. Key frames are then selected from the video segment: for a window of 2i + 1 video frames, where i is empirically set to three, we use the chi-square distance to find the frame with the least distance as the key frame. Next, we convert the key frame into a gray-scale image, perform mean normalization, and calculate the LBP image and the IDP image from the gray-scale image. The window is then shifted by 4 frames, and we repeat the above process until the end of the video segment. For a 2.02 s video segment, 16 key frames are selected and resampled to 227 × 227 as the input of the CNN.
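To make the speech pre-processing concrete, the following is a minimal sketch (not the authors' code) of how the three-channel input could be built with librosa: 40 ms Hamming-windowed frames with 50% overlap, a 25-band Mel filter bank, the logarithm of the filter outputs, and the delta and double-delta images stacked as channels. The 16 kHz sampling rate and the function name are assumptions.

```python
import numpy as np
import librosa

def speech_to_three_channel_input(wav_path, sr=16000):
    """Build the 3-channel input of Section 3.1: log-Mel-spectrogram, delta, double delta."""
    y, sr = librosa.load(wav_path, sr=sr)
    win_length = int(0.040 * sr)        # 40 ms frames
    hop_length = win_length // 2        # 50% overlap between successive frames

    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=win_length, win_length=win_length,
        hop_length=hop_length, window="hamming", n_mels=25)
    log_mel = librosa.power_to_db(mel)  # logarithm of the 25 band-pass filter outputs

    delta = librosa.feature.delta(log_mel, order=1)   # encodes relative temporal information
    delta2 = librosa.feature.delta(log_mel, order=2)

    # Image-like array of shape (3, n_mels, n_frames) used as the input to the 2D-CNN.
    return np.stack([log_mel, delta, delta2], axis=0)
```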

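The key-frame selection can be read as picking, in each window of 2i + 1 = 7 frames, the frame whose gray-level histogram has the smallest total chi-square distance to the other frames in the window, then shifting the window by 4 frames. The OpenCV sketch below illustrates this reading; the face filtering (Viola and Jones, 2001) and the LBP/IDP images are omitted, and the helper names and the "smallest summed distance" interpretation are assumptions.

```python
import cv2

def gray_histogram(gray):
    """256-bin gray-level histogram used for the chi-square comparison."""
    return cv2.calcHist([gray], [0], None, [256], [0, 256])

def select_key_frames(gray_frames, i=3, shift=4, size=(227, 227)):
    """Pick one key frame per window of 2*i + 1 frames, shifting the window by `shift`.

    Within a window, the key frame is taken to be the frame whose histogram has the
    smallest summed chi-square distance to the other frames' histograms (one reading
    of the 'least distance' criterion in Section 3.1).
    """
    hists = [gray_histogram(f) for f in gray_frames]
    window = 2 * i + 1
    key_frames = []
    for start in range(0, len(gray_frames) - window + 1, shift):
        idx = range(start, start + window)
        best = min(idx, key=lambda a: sum(
            cv2.compareHist(hists[a], hists[b], cv2.HISTCMP_CHISQR)
            for b in idx if b != a))
        key_frames.append(cv2.resize(gray_frames[best], size))
    return key_frames   # e.g. 16 key frames for a ~2 s segment, fed to the 3D-CNN
```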



3.2. Emotion recognition network

Based on the preprocessed audio and visual data, we propose a novel audio-visual network. Fig. 1 shows the detailed framework of the proposed method, which consists of two networks: one for speech signal feature extraction, using a 2D-CNN, and one for video signal feature extraction, employing a 3D-CNN and an LSTM to learn the feature extraction function. In this paper, we consider that the features of the different modalities should be similar because they represent the same human emotion at the same time. To realize this assumption, we give the two modality features produced by the networks the same dimension, fc ∈ R^4096. The two deep neural networks are trained simultaneously with the proposed losses, namely the triplet loss for each modality and the correlated loss across modalities. The 2D-CNN and 3D-CNN architecture details can be found in Hossain and Muhammad (2019). In order to capture the temporal and sequential information in the video data, following Lian et al. (2018), a CNN-LSTM architecture is utilized: the outputs of the 3D-CNN are fed to the LSTM, with 16 face features as input for each iteration of LSTM training. The outputs of all hidden states of the LSTM layer are then averaged by a mean pooling layer, whose output gives the final video representation. Finally, a fully-connected layer and a softmax layer produce the classification scores.
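For readers who want the wiring of the visual branch at a glance, the following PyTorch sketch shows the described pipeline (3D-CNN features for each of the 16 key-frame steps, an LSTM over the sequence, mean pooling of the hidden states, and a fully connected layer followed by softmax). It is only a schematic under assumptions: the 3D-CNN backbone is reduced to a stand-in block because the paper defers its details to Hossain and Muhammad (2019), and all layer sizes except the 4096-dimensional feature and the 16 steps are arbitrary.

```python
import torch
import torch.nn as nn

class VideoBranch(nn.Module):
    """Schematic visual branch: 3D-CNN -> LSTM -> mean pooling -> fully connected layer."""

    def __init__(self, feat_dim=4096, num_classes=7):
        super().__init__()
        # Stand-in for the 3D-CNN backbone (its details are deferred to
        # Hossain and Muhammad, 2019); it maps each key-frame chunk to a feat_dim vector.
        self.cnn3d = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((1, 8, 8)), nn.Flatten(),
            nn.Linear(32 * 8 * 8, feat_dim), nn.ReLU())
        self.lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, clips):
        # clips: (batch, 16, 1, depth, height, width) -- 16 key-frame chunks per segment
        b, t = clips.shape[:2]
        feats = self.cnn3d(clips.flatten(0, 1)).view(b, t, -1)  # per-step 3D-CNN features
        hidden, _ = self.lstm(feats)       # hidden states for all 16 steps
        pooled = hidden.mean(dim=1)        # mean pooling: the 4096-d video representation
        return pooled, self.fc(pooled)     # representation for fusion + class scores (softmax in the loss)
```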

In the training step of each single-modal network, the triplet loss is employed to make an anchor sample x_i^a closer to all positive samples x_i^+ and farther from the negative samples x_i^-. It is defined as

L_t = \max\Big[ 0, \sum_{i}^{N} \big( \| f(x_i^a) - f(x_i^+) \|_2^2 - \| f(x_i^a) - f(x_i^-) \|_2^2 + \beta \big) \Big]    (1)

where β is the margin used to separate the positive and negative samples, N represents the set of all samples in the training set, and f(x) is a function learned by a fully connected deep neural network (DNN). The triplet loss has been utilized in many research fields and has obtained excellent results in many classic classification problems. In our method, it is employed for each modality network to shorten the distance between samples with similar emotional content in the feature space and thus guarantee the final classification performance.

Moreover, we introduce the correlated loss to guide the cross-modal fusion during training, which increases the learning speed and the robustness of the final feature vector. The correlated loss is used to make the speech signal feature approximate the video signal feature for better classification performance; more specifically, the features of the different modalities should be similar for the same part of the input, so the correlated loss can better guide the feature learning of the different modalities. In the training process, we randomly sample m clips from the dataset, which yields m classification results, and denote the speech modality features by {f_1, f_2, ..., f_m} and the video modality features by {f̂_1, f̂_2, ..., f̂_m}. All features are l2-normalized, i.e. ‖f_i‖_2 = 1, which means that when the speech feature is f_m, the video modality feature should be classified to the same result m. The probability of the video clip x̂_i being recognized as emotion i is defined by

P(i \mid \hat{x}_i) = \frac{ \exp(f_i^{T} \hat{f}_i / T) }{ \sum_{k=1}^{m} \exp(f_k^{T} \hat{f}_i / T) }    (2)

where f_i^T f̂_i represents the similarity between the two modality features f_i and f̂_i, and T is a parameter controlling the concentration level of the feature distribution. Our correlated loss is therefore defined by

L_c(x, \hat{x}) = - \sum_{i} \log P(i \mid \hat{x}_i)    (3)

The final loss functions of the speech and video modality networks are

L_x = L_{t,x} + L_c(x, \hat{x})    (4)

L_{\hat{x}} = L_{t,\hat{x}} + L_c(x, \hat{x})    (5)

where L_{t,x} is the triplet loss of the speech network with input x, L_{t,x̂} is the triplet loss of the video network with input x̂, and L_c(x, x̂) is the correlated loss computed over the speech input x and video input x̂. Finally, the two networks are optimized via back-propagation with stochastic gradient descent.
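A hedged PyTorch sketch of the two loss terms is given below; it is not the authors' implementation. The triplet term uses squared Euclidean distances with a per-triplet hinge (the common Σ max(0, ·) variant of Eq. (1)), and the correlated term implements Eqs. (2)–(3) as a cross-entropy over the temperature-scaled similarity matrix of l2-normalized speech and video features. The margin β, the temperature T and the batch construction are assumptions.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, beta=0.2):
    """Triplet term (cf. Eq. (1)): pull positives toward the anchor and push
    negatives at least `beta` farther away, using squared L2 distances."""
    d_pos = (anchor - positive).pow(2).sum(dim=1)
    d_neg = (anchor - negative).pow(2).sum(dim=1)
    return F.relu(d_pos - d_neg + beta).sum()

def correlated_loss(speech_feat, video_feat, T=0.07):
    """Correlated term (Eqs. (2)-(3)): -sum_i log P(i | video clip i)."""
    f = F.normalize(speech_feat, dim=1)      # enforce ||f_i||_2 = 1
    f_hat = F.normalize(video_feat, dim=1)
    logits = f_hat @ f.t() / T               # logits[i, k] = f_k^T f_hat_i / T
    targets = torch.arange(f.size(0), device=f.device)
    return F.cross_entropy(logits, targets, reduction="sum")

# Eqs. (4)-(5): each single-modal network adds the shared correlated term, e.g.
# loss_speech = triplet_loss(a_s, p_s, n_s) + correlated_loss(speech_feat, video_feat)
# loss_video  = triplet_loss(a_v, p_v, n_v) + correlated_loss(speech_feat, video_feat)
```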

3.3. Audio-visual information fusion

Following the above process, we obtain two feature extraction models based on the two emotion modalities. By design, the closer these two extracted features are in the feature space, the better. A weighted fusion method is employed to fuse the two feature vectors; the framework of this fusion is shown in Fig. 1, and Eq. (6) describes the process in detail:

f_g = \sum_{i=1}^{2} w_i \, \xi(f_{M_i}), \qquad \sum_{i=1}^{2} w_i = 1    (6)

where f_{M_i} represents the feature vector extracted by the 2D-CNN and by the 3D-CNN with LSTM, respectively, from the speech and video signals; M denotes the modality, so its subscript can be 1 or 2 in our work; w_i is the weight of each extracted feature, used to balance the audio and video features; and f_g is the final feature vector for the emotion classification task. Finally, we apply softmax to the fused feature to obtain the class label. The related experimental results are shown in Section 4.
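A minimal sketch of the weighted fusion of Eq. (6), assuming the mapping ξ is l2 normalization and that both modality features have the same dimension; the weights shown are the values reported later in Section 4.2.

```python
import torch.nn.functional as F

def fuse_features(audio_feat, video_feat, w_audio=0.3, w_video=0.7):
    """Weighted fusion of Eq. (6): f_g = w_1 * xi(f_M1) + w_2 * xi(f_M2), with w_1 + w_2 = 1.

    xi is taken to be l2 normalization here (an assumption); the 0.3 / 0.7 weights
    are the best validation values reported in Section 4.2.
    """
    f_audio = F.normalize(audio_feat, dim=1)   # speech feature from the 2D-CNN
    f_video = F.normalize(video_feat, dim=1)   # video feature from the 3D-CNN + LSTM
    return w_audio * f_audio + w_video * f_video   # fused feature fed to the softmax classifier
```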

4. Experiments

In this section, we present a brief description of the database used in our experiments, the related experimental details, results, and discussion.

4.1. Experimental dataset

The Acted Facial Expressions in the Wild (AFEW) dataset is used to evaluate the proposed method. AFEW is a dynamic temporal facial expression corpus consisting of short video clips of facial expressions in close-to-real-world environments. The clips are collected from TV series and movies and include actors' natural head pose movements, indoor and outdoor scenarios, and various illuminations and occlusions. The subjects vary in race, age, and gender, and each clip is labeled with one of seven expressions: happiness, surprise, anger, disgust, fear, sadness and neutral. The dataset has three partitions: a training set (773 clips), a test set (653 clips) and a validation set (383 clips).


Table 1
Performance comparison on the validation set for the introduced losses.

Method                 Channel   Accuracy
2D-CNN                 Audio     48.15
2D-CNN(TL)             Audio     48.59
2D-CNN(TL+CL)          Audio     48.74
3D-CNN                 Visual    54.32
3D-CNN(LSTM)           Visual    55.87
3D-CNN(LSTM+TL)        Visual    56.73
3D-CNN(LSTM+TL+CL)     Visual    58.62

Table 2
Performance comparison on the validation set for our fusion method.

Method                                                  Accuracy
Simple concatenation of the two modality features       59.68
Weighted combination of the two modality features       60.59


Table 3
Performance comparison on the validation and test sets in terms of the average recognition accuracy over the 7 emotions.

Method                      Channel          Accuracy
Hu et al. (2017)            Audio, visual    60.34
Fan et al. (2016)           Audio, visual    59.02
Vielzeuf et al. (2017)      Audio, visual    58.81
Yao et al. (2016)           Audio, visual    57.84
Ouyang et al. (2017)        Audio, visual    57.20
Kim et al. (2017)           Audio, visual    57.12
Yan et al. (2016)           Audio, visual    56.66
Wu et al. (2016)            Audio, visual    55.31
Kaya et al. (2017b)         Audio, visual    54.55
Ding et al. (2016)          Audio, visual    53.96
Yao et al. (2015)           Audio, visual    53.80
Bargal et al. (2016)        Visual           56.66
Sun et al. (2016)           Visual           50.14
2D-CNN (baseline)           Audio            48.15
3D-CNN(LSTM) (baseline)     Visual           55.87
Proposed Method             Audio, visual    60.59

4.2. Experiment for validating the effectiveness of our method

In the proposed method, we introduce the triplet loss to increase the inter-class differentiation and the correlated loss to reduce the intra-class differentiation. To validate the effectiveness of our method, we design two experiments, one for the introduced losses and one for our fusion method; both are performed on the AFEW dataset.

The results validating the two loss functions are listed in Table 1. 3D-CNN(LSTM) means that the LSTM network is employed after the 3D-CNN network, "TL" denotes that the triplet loss is added to the network, and "CL" denotes the correlated loss. The unimodal classification results without the two introduced losses clearly have the lowest accuracy, and the method that adds both the triplet loss and the correlated loss to the global loss function outperforms the networks that use only the triplet loss. We also find that the 3D-CNN network based on the visual channel obtains better classification results than the audio network, which suggests that the video data carry more representative information for human emotion analysis than the audio data; the experimental results support this view. We therefore assume that the modalities should have different weights in the final information fusion in order to exploit the advantage of each modality. We define the weights of the different modality networks (from the fusion Eq. (6)) and sample different values of the fusion parameters to find the best values for the emotion classification task. As Table 2 shows, the simple concatenation method treats the two modality features equally (w1 = 0.5, w2 = 0.5), whereas our weighted method sets w1 = 0.3 and w2 = 0.7 for the 2D-CNN and 3D-CNN features, respectively, and achieves the best result. These experiments support our assumption and the soundness of our network design.
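The weight selection described above amounts to a small grid search over (w1, w2) with w1 + w2 = 1 on the validation set. The sketch below illustrates that procedure under assumptions; evaluate_on_validation is a hypothetical helper standing in for the evaluation pipeline and is not part of the paper.

```python
import numpy as np

def search_fusion_weights(evaluate_on_validation, step=0.1):
    """Grid-search the fusion weights w1 (audio) and w2 = 1 - w1 (video).

    evaluate_on_validation(w1, w2) is a placeholder that should return the
    validation accuracy obtained with a given pair of fusion weights.
    """
    best_weights, best_acc = None, -1.0
    for w1 in np.arange(0.0, 1.0 + 1e-9, step):
        w2 = 1.0 - w1
        acc = evaluate_on_validation(w1, w2)
        if acc > best_acc:
            best_weights, best_acc = (w1, w2), acc
    return best_weights, best_acc   # Section 4.2 reports (0.3, 0.7) as the best setting
```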

4.3. Comparisons with typical methods

The proposed method is compared with two baseline methods: (1) the 2D-CNN network for audio-based emotion recognition and (2) the 3D-CNN with LSTM network for visual-based emotion recognition. As shown in Table 3, our proposed method improves on the baseline unimodal methods that employ a single type of feature and achieves an emotion classification accuracy of 60.46% on the AFEW dataset. Compared with the audio-based method, our approach gains 11.85% on the AFEW dataset, and 4.59% compared with the visual-only network. This significant improvement validates the importance of extracting multimodal information to recognize and analyze human emotion. Among multimodal methods, the proposed approach outperforms both direct concatenation and the variant without the correlated loss, showing the advantage of weighted concatenation and the importance of retaining the similarity between the audio and video modalities (through the correlated loss). Our method outperforms most state-of-the-art models in terms of classification accuracy, which indicates that the design of our network is reasonable, and the experimental results demonstrate its effectiveness on the emotion classification task.

The performance of our method can be explained as follows. Different CNN structures extract the feature information from the two signals, and the CNN-LSTM architecture after the 3D-CNN captures the temporal and sequential information in the video signals. Furthermore, by introducing the correlated loss function and the static multimodal fusion method, our network integrates each modality's feature information and takes the correlation between the two modalities into consideration when forming the final feature vector. The resulting feature vector is therefore more responsive to the relevant feature information than the features from a unimodal network or other typical methods, so we obtain a more robust and accurate detection of emotion and achieve state-of-the-art performance.

5. Conclusion

In this paper, we proposed a novel audio-visual network architecture for emotion recognition that exploits the information from both the audio and visual channels. Our method employs a 2D-CNN and a 3D-CNN to extract audio and video features, respectively; to further capture the temporal information in a video clip, an LSTM is employed after the 3D-CNN. We then fuse the two modality features with statistically defined weights in order to generate a more robust feature for the classification task. In our network, the triplet loss and the correlated loss are employed to increase the inter-class differentiation and reduce the intra-class differentiation. Compared with traditional typical methods, our method not only utilizes the data of two different modalities, speech and video signals, but also takes the correlation information between them into consideration for automatic emotion recognition. Experimental results on the AFEW dataset demonstrate the effectiveness of the proposed network, and the comparison with typical multimodal methods shows that our method achieves state-of-the-art performance.




Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Bargal, S., Barsoum, E., F, C.C., Zhang, C., 2016. Emotion recognition in the wild from videos using images. In: ICMI. ACM, pp. 433–436.
Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W., Weiss, B., 2005. A database of German emotional speech. In: Proceedings of the INTERSPEECH, Lisbon, Portugal.
Busso, C., Bulut, M., Lee, C.C., Kazemzadeh, A., Mower, E., Kim, S., Chang, J.N., Lee, S., Narayanan, S.S., 2008. IEMOCAP: interactive emotional dyadic motion capture database. Lang. Resour. Eval. 42 (4), 335–359.
Cai, J., Meng, Z., Khan, A.S., Li, Z., O'Reilly, J., Han, S., Liu, P., Chen, M., Tong, Y., 2019. Feature-level and model-level audiovisual fusion for emotion recognition in the wild. In: CVPR. arXiv:1906.02728v1.
Chen, M., Yang, J., Zhu, X., Wang, X., Liu, M., Song, J., 2017. Smart Home 2.0: innovative smart home system powered by botanical IoT and emotion detection. Mob. Netw. Appl. http://dx.doi.org/10.1007/s1103.
Chen, M., Zhang, Y., Qiu, M., Guizani, N., Hao, Y., 2018. SPHA: smart personal health advisor based on deep analytics. IEEE Commun. Mag. 56 (3), 164–169.
Darwin, C., Prodger, P., 1998. The Expression of the Emotions in Man and Animals. Oxford University Press, USA.
Dhall, A., Goecke, R., Lucey, S., Gedeon, T., et al., 2012. Collecting large, richly annotated facial-expression databases from movies. IEEE Multimedia 19 (3), 34–41.
Ding, W., Xu, M., Huang, D., Lin, W., Dong, M., Yu, X., Li, H., 2016. Audio and face video emotion recognition in the wild using deep neural networks and small datasets. In: ICMI. ACM, pp. 506–513.
Ding, H., Zhou, S.K., Chellappa, R., 2017. FaceNet2ExpNet: regularizing a deep face recognition net for expression recognition. In: Proceedings of the 12th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2017), Washington, DC, pp. 118–126.
Doctor, F., Karyotis, C., Iqbal, R., James, A., 2016. An intelligent framework for emotion aware e-healthcare support systems. In: Proceedings of the IEEE Symposium Series on Computational Intelligence (SSCI), Athens, Greece, pp. 1–8.
Fan, Y., Lu, X., Li, D., Liu, Y., 2016. Video-based emotion recognition using CNN-RNN and C3D hybrid networks. In: ICMI, pp. 445–450.
Fayek, H.M., Lech, M., Cavedon, L., 2017. Evaluating deep learning architectures for speech emotion recognition. Neural Netw. 92, 60–68.
Guthier, B., Alharthi, R., Abaalkhail, R., El Saddik, A., 2014. Detection and visualization of emotions in an affect-aware city. In: Proceedings of the First International Workshop on Emerging Multimedia Applications and Services for Smart Cities. EMASC '14, ACM, New York, NY, USA, pp. 23–28.
Han, W., Ruan, H., Chen, X., Wang, Z., Li, H., Schuller, B., 2018. Towards temporal modelling of categorical speech emotion recognition. In: Proc. INTERSPEECH, pp. 932–936.
Han, K., Yu, D., Tashev, I., 2014. Speech emotion recognition using deep neural network and extreme learning machine. In: Proc. INTERSPEECH, pp. 223–227.
Hasani, B., Negi, P.S., Mahoor, M.H., 2019. Bounded residual gradient networks (BReG-Net) for facial affect computing. In: 14th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2019). arXiv:1903.02110.
Hossain, M.S., Muhammad, G., 2017. An emotion recognition system for mobile applications. IEEE Access 5, 2281–2287.
Hossain, M.S., Muhammad, G., 2019. Emotion recognition using deep learning approach from audio–visual emotional big data. Inf. Fusion 49.
Hossain, M.S., Muhammad, G., AL Qurishi, M., 2018. Verifying the images authenticity in cognitive internet of things (CIoT)-oriented cyber physical system. Mobile Netw. Appl. 23 (2), 239–250.
Hu, P., Cai, D., Wang, S., Yao, A., Chen, Y., 2017. Learning supervised scoring ensemble for emotion recognition in the wild. In: ICMI. ACM, pp. 553–560.
Huang, X., Kortelainen, J., Zhao, G., Li, X., Moilanen, A., Seppänen, T., Pietikäinen, M., 2016. Multi-modal emotion analysis from facial expressions and electroencephalogram. Comput. Vis. Image Underst. 147, 114–124.
Kaya, H., Gürpınar, F., Salah, A.A., 2017a. Video-based emotion recognition in the wild using deep transfer learning and score fusion. Image Vision Comput. 65, 66–75.
Kaya, H., Gürpınar, F., Salah, A.A., 2017b. Video-based emotion recognition in the wild using deep transfer learning and score fusion. Image Vision Comput. 65, 66–75.
Kim, D., Lee, M., Choi, D., Song, B., 2017. Multi-modal emotion recognition using semi-supervised learning and multiple neural networks in the wild. In: ICMI. ACM, pp. 529–535.
Lian, Z., Li, Y., Tao, J., Huang, J., 2018. Investigation of multimodal features, classifiers and fusion methods for emotion recognition. In: CVPR.
Lin, K., Xia, F., Wang, W., Tian, D., Song, J., 2016. System design for big data application in emotion-aware healthcare. IEEE Access 4, 6901–6909.
Liu, Z.-T., Wu, M., Cao, W.-H., Mao, J.-W., Xu, J.-P., Tan, G.-Z., 2018. Speech emotion recognition based on feature selection and extreme learning machine decision tree. Neurocomputing 273, 271–280.
Mao, S., Tao, D., Zhang, G., Ching, P.C., Lee, T., 2019. Revisiting hidden Markov models for speech emotion recognition. In: ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). http://dx.doi.org/10.1109/ICASSP.2019.8683172.
Menezes, M.L.R., Samara, A., Galway, L., et al., 2017. Towards emotion recognition for virtual environments: an evaluation of EEG features on benchmark dataset. Pers. Ubiquitous Comput. http://dx.doi.org/10.1007/s00779-0171072-7.
Mirsamadi, S., Barsoum, E., Zhang, C., 2017. Automatic speech emotion recognition using recurrent neural networks with local attention. In: Proc. ICASSP, pp. 2227–2231.
Mollahosseini, A., Chan, D., Mahoor, M.H., 2016. Going deeper in facial expression recognition using deep neural networks. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, pp. 1–10.
Niu, Y., Zou, D., Niu, Y., He, Z., Tan, H., 2017. A breakthrough in speech emotion recognition using deep retinal convolution neural networks. arXiv:1707.09917.
Ortega, J.D.S., Cardinal, P., Koerich, A.L., 2019. Emotion recognition using fusion of audio and video features. In: CVPR. arXiv:1906.10623v1.
Ouyang, X., Kawaai, S., Goh, E., Shen, S., Ding, W., Ming, H., Huang, D., 2017. Audio-visual emotion recognition using deep transfer learning and multiple temporal models. In: ICMI. ACM, pp. 577–582.
Sun, B., Li, L., Zhou, G., He, J., 2016. Facial expression recognition in the wild based on multimodal texture features. J. Electron. Imaging 25 (6), 061407.
Tao, J., Liu, F., Zhang, M., Jia, H.B., 2008. Design of speech corpus for mandarin text to speech. In: Proceedings of the Blizzard Challenge 2008 Workshop.
Tian, Y.-I., Kanade, T., Cohn, J.F., 2001. Recognizing action units for facial expression analysis. IEEE Trans. Pattern Anal. Mach. Intell. 23 (2), 97–115.
Valstar, M., Gratch, J., Schuller, B., et al., 2016. AVEC 2016: depression, mood, and emotion recognition workshop and challenge. In: Proceedings of the Sixth International Workshop on Audio/Visual Emotion Challenge. AVEC '16, ACM, New York, NY, USA, pp. 3–10.
Vielzeuf, V., Pateux, S., Jurie, F., 2017. Temporal multimodal fusion for video emotion classification in the wild. In: ICMI. ACM, pp. 569–576.
Viola, P., Jones, M.J., 2001. Rapid object detection using a boosted cascade of simple features. In: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 511–518.
Wu, J., Lin, Z., Zha, H., 2016. Multi-view common space learning for emotion recognition in the wild. In: ICMI. ACM, pp. 464–471.
Yan, J., Zheng, W., Cui, Z., Tang, C., Zhang, T., Zong, Y., Sun, N., 2016. Multi-clue fusion for emotion recognition in the wild. In: ICMI, pp. 458–463.
Yao, A., Cai, D., Hu, P., W, S., Sha, L., Chen, Y., 2016. HoloNet: towards robust emotion recognition in the wild. In: ICMI. ACM, pp. 472–478.
Yao, A., Shao, J., Ma, N., Chen, Y., 2015. Capturing AU-aware facial features and their latent relations for emotion recognition in the wild. In: ICMI, pp. 451–458.
Zeng, N., Zhang, H., Song, B., Liu, W., Li, Y., Dobaie, A.M., 2018. Facial expression recognition via learning deep sparse autoencoders. Neurocomputing 273, 643–649.
Zhang, Y., Wang, Z.-R., Du, J., 2019. Deep fusion: an attention guided factorized bilinear pooling for audio-video emotion recognition. In: CVPR. arXiv:1901.04889v1.
Zhang, S., Zhang, S., Huang, T., Gao, W., Tian, Q., 2017. Learning affective features with a hybrid deep model for audio-visual emotion recognition. IEEE Trans. Circuits Syst. Video Technol. 99 (1).
