Wild Facial Expression Recognition Based on Incremental Active Learning

Minhaz Uddin Ahmed^a, Kim Jin Woo^a, Kim Yeong Hyeon^a, Md. Rezaul Bashar^b, Phill Kyu Rhee^a

^a Computer Engineering Department, Inha University, 100 Inha-ro, Nam-gu 22212, Incheon, Republic of Korea
^b Science, Technology and Management Crest, Sydney, Australia
Abstract
Facial expression recognition in the wild is a challenging problem in computer vision research because of varying circumstances, such as pose dissimilarity, age, lighting conditions, and occlusions. Numerous methods, such as point tracking, piecewise affine transformation, compact Euclidean space, the modified local directional pattern, and dictionary-based component separation, have been applied to this problem. In this paper, we propose a deep learning-based automatic wild facial expression recognition system that implements an incremental active learning framework using the VGG16 model developed by the Visual Geometry Group. We gathered a large amount of unlabeled facial expression data from Intelligent Technology Lab (ITLab) members at Inha University, Republic of Korea, to train the framework. We collected these data under five different lighting conditions (good lighting, average lighting, close to the camera, far from the camera, and natural lighting) and with seven facial expressions: happy, disgusted, sad, angry, surprised, fearful, and neutral. Our face detection framework is adapted from a multi-task cascaded convolutional network detector, and repeating the entire process yields better performance. Our experimental results demonstrate that incremental active learning improves the starting baseline accuracy from 63% to an average of 88% on the ITLab dataset in a wild environment. We also present extensive results on a facial expression benchmark, the Extended Cohn-Kanade dataset, as well as on the ITLab face dataset captured in a wild environment, and obtain better performance than state-of-the-art approaches.

Keywords: expression recognition; emotion classification; face detection; convolutional neural network; active learning

1. Introduction
Facial expressions are a form of nonverbal communication for social interactions. These days, the self-portrait photograph (the selfie) is used on many different social networking websites where
images are captured in uncontrolled situations, owing to the availability of cheap digital cameras and smartphones. Many of these images contain different facial expressions, such as happiness, sadness, disgust, anger, fear, and surprise. In our daily conversations, 55 percent of our feelings are expressed nonverbally or through facial expressions (Mehrabian, 1968; Ekman, 1993; Zhang, Zhao, Morvan, & Chen, 2018). The main challenge in facial expression recognition (FER) is that facial expressions vary across situations, with the person's mood, age, and skin color, and under different lighting conditions. Frontal face views with different expressions underpin a number of applications, such as human-computer interfaces, surveillance systems, census systems, virtual reality, and customer-satisfaction analysis, all of which rely heavily on accurate face detection (Osuna, Freund & Girosi, 1997). A survey on FER by Fasel and Luettin (2003) covered the prominent automatic facial expression analysis methods. Sharma (2011) achieved 63.33% classification accuracy with a feature-point-tracking technique, including 50% accuracy for angry images and 60% accuracy for sad images. Different feature extraction methods can be applied to FER challenges, such as geometric measurement, locally linear discriminant embedding (LLDE) (Li, Zheng & Huang, 2008), principal component analysis (PCA), quadtree decomposition, adaptive boosting (AdaBoost) (Fasel & Luettin, 2003), and the Modified Census Transform classifier (Froba & Ernst, 2004). Recent research has shown that deep convolutional neural networks increase recognition performance (Simonyan & Zisserman, 2014). However, a large volume of labeled data is necessary to train a machine to accurately recognize facial expressions (Taigman, Yang, Ranzato, & Wolf, 2014), and creating a dataset with a large number of labeled emotions is tedious because of the manual effort involved. Although manually labeled facial expression images make training more reliable, labeling is very time-consuming. Our incremental active learning framework is suited to tackling this challenge.
In our research, we have considered five different environments (good lighting, average lighting, close to the camera, far from the camera, and natural lighting) with seven facial expressions: happy, disgusted, sad, angry, surprised, afraid, and neutral. Most facial expression recognition research considers six emotions (Ekman, 1993; Fasel & Luettin, 2003), omitting the neutral expression. Face appearance poses a number of challenges, such as pose, occlusion, age, gender, and expression-intensity changes. In addition, most prior research used little training data, so dealing with a massive amount of data is another challenge.
The main contributions of this work are summarized below.
I. Our facial expression recognition method works under different lighting conditions for both real-time and offline data.
II. We provide five different wild-environment datasets (good lighting, average lighting, face close to the camera, face far from the camera, and natural light), which is unique, and we support them experimentally.
III. Our incremental active learning method improves performance on both the ITLab dataset and the Extended Cohn-Kanade dataset (Kanade & Cohn, 2000).
We have organized this paper as follows. In Section 2, we review work related to facial expression recognition. In Section 3, we describe the proposed facial expression recognition system. In Section 4, we give details about the dataset, the experimental environment, and the results. Finally, in Section 5, we draw conclusions from the experiments and suggest future work.
2. Related Work
A multilayer perceptron network (MLPN) with a constrained learning algorithm (CLA) is an effective way to speed up training in neural connectionist approaches (Huang, 2004); that method used an adaptive learning parameter, and its initial weight selection for root finding in the MLPN increases accuracy. Recently, convolutional neural networks (CNNs) have markedly advanced facial expression recognition research (Simonyan & Zisserman, 2014). Commonly, CNNs improve on traditional features by learning from millions of training examples (Taigman et al., 2014). A large amount of labeled data and a deep network are mandatory for accurate facial expression recognition with a CNN, but in practice it is hard to collect millions of annotated images and train on them from scratch.
Therefore, it is common to train the network starting from a pre-trained CNN model.
In this paper, we explore several deep learning methods used in recent facial expression recognition work, such as Inception (Szegedy, Vanhoucke, Ioffe, Shlens & Wojna, 2016) and VGG (Simonyan & Zisserman, 2014). Taigman et al. (2014) proposed DeepFace, a nine-layer framework that contains 120 million parameters; DeepFace achieves a fairly high accuracy of 97.35% on the Labeled Faces in the Wild dataset. Schroff et al. (2015) proposed FaceNet, a data-driven method that directly learns a mapping from face images into a compact Euclidean space where distances correspond to face similarity. Liu (2015) proposed a two-stage approach combining a multi-patch deep CNN and metric learning, which outperformed several state-of-the-art methods.
Taheri et al. (2013) proposed joint face and facial-expression recognition with a dictionary-based component separation algorithm. They decomposed an expressive test face into building components drawn from two data-driven dictionaries, one for neutral components and another for expression components; the morphological elements of the test face with respect to these dictionaries were then used for face and expression recognition. Wang, Hu, and Deng (2017) applied a compressed Fisher vector for robust facial expression recognition: first, they put forward a compact Fisher vector obtained by zeroing out small posteriors and then calculating and reweighting first-order statistics; second, light iterative quantization and the compact Fisher vector (CFV) were applied together to encode the convolutional activations of a CNN. Rifai et al. (2012) applied a multiscale contractive convolutional network to determine facial traits in an image. Burkert et al. (2015) proposed a CNN architecture with four parts (convolution, pooling, parallel feature extraction, and fully connected layers) that achieves high accuracy; it is the approach most similar to ours.
Deep supervised auto-encoders were used by Gao, Zhang, Jia, Lu, and Zhang (2015) to extract features robust to variance in illumination, expression, occlusion, and face pose for recognition. Zhu and Ramanan (2012) obtained surprising results on an "in the wild" face benchmark dataset; their model used mixtures of trees with a shared pool of parts but was trained on only hundreds of faces. Kim et al. (2015) modified the deep network architecture, input normalization, and random weight initialization while training deep models; to classify six facial expressions (anger, happy, sad, surprise, neutral, and disgust), they constructed a hierarchical architecture with exponentially weighted decision fusion. Uddin et al. (2017) processed modified local directional pattern features of the face with generalized discriminant analysis and then applied a deep-belief network for better performance. Yu and Zhang (Yu, 2015) assembled a face-detection module based on an ensemble of three state-of-the-art face detectors, followed by a classification module with a combination of multiple deep convolutional neural networks.
3. Overview of the proposed system
Our proposed system uses an incremental active learning method with a CNN for precise facial expression recognition. Manually labeling a large volume of FER data is a challenge for a deep learning platform. On the other hand, the inclusion of new FER data incurs an overfitting problem: a trained model contaminated with noisy data and substantial labeling errors loses prediction power in ways a computer cannot correct on its own. To solve this problem, a better learning approach (labeling new data with the incremental active learning method) reduces network training time, because the data are less noisy and the FER labels are correct.
Fig. 1. Illustration of the proposed incremental active learning approach.

A webcam (Logitech HD Webcam C270) is used to gather facial expression images, as seen in Fig. 1. In the second step, we normalize the image intensity values and eliminate noise in the image by cropping. Here, color images are converted to grayscale, and contrast is enhanced by histogram equalization.

3.1 Noise removal and pre-processing
Fig. 2. Intensity normalization.

Fig. 2 shows intensity normalization. The brightness of the light is a noise element that is a big barrier to facial recognition: when a shadow occurs, the corresponding pixel values at that position change, so a dark, occluded image and a bright, unoccluded image have completely different pixel values even if the facial expression in the image is the same. Adopting an intensity normalization (Pizer,
Amburn, Austin, Cromartie, Geselowitz, Greer, Romeny, Zimmerman, & Zuiderveld, 1987) method minimizes interference from illumination (Zhang et al., 2018).
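A minimal pre-processing sketch in Python with OpenCV is given below; the crop box interface and the CLAHE parameters are illustrative assumptions rather than the exact values used in our pipeline.

```python
import cv2

def preprocess_face(image_bgr, box):
    """Crop the detected face, convert to grayscale, and enhance contrast,
    approximating the pre-processing described above.
    `box` = (x, y, w, h) from the face detector (illustrative interface)."""
    x, y, w, h = box
    face = image_bgr[y:y + h, x:x + w]             # crop away background noise
    gray = cv2.cvtColor(face, cv2.COLOR_BGR2GRAY)  # color -> grayscale
    # Contrast-limited adaptive histogram equalization in the spirit of
    # Pizer et al. (1987); clipLimit and tileGridSize are assumed values.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return clahe.apply(gray)
```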
Fig. 3. Facial expression gathering and the dataset creation tool.

Our primary model is created from a large number of good facial expression sample images trained against the Visual Geometry Group (VGG) pre-trained model. Large batches and mini batches of image datasets were created using an image gathering tool developed by the Intelligent Technology Lab (ITLab) at Inha University, Korea, as shown in Fig. 3. Each mini batch contains the seven facial expressions in five sets, for a total of 35 images per batch. To get better performance, we increased the number of images per facial expression by 10 and 15 per dataset, so each batch then contained 70 and 105 images, respectively. We then trained on each mini batch, evaluated each batch of the dataset against the pre-trained model, and checked the performance score.
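The batch arithmetic described above can be made explicit with a short sketch (the expression names follow the paper; the helper itself is illustrative):

```python
EXPRESSIONS = ["angry", "disgusted", "fear", "happy", "sad", "surprised", "neutral"]

def mini_batch_size(images_per_expression):
    """A mini batch holds one group of images per expression, so
    7 x 5 = 35 images, growing to 7 x 10 = 70 and 7 x 15 = 105."""
    return len(EXPRESSIONS) * images_per_expression

assert mini_batch_size(5) == 35
assert mini_batch_size(10) == 70
assert mini_batch_size(15) == 105
```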
Fig. 4. Face image dataset training tool.

After the performance evaluation, if the predicted score was less than 0.9, we applied active learning to eliminate the low-score facial expression labels and replaced those images with new ones to improve the learning model. If the training score was higher than the previous training result, we combined the batch with another mini batch of the dataset and trained again. If the performance was lower than the previous result, the training data were discarded and a new batch of training data was used. This process continued until performance saturated. Fig. 4 shows the training tool's graphical user interface (GUI), where we can set fine-tuning parameters such as the learning rate, batch size, number of epochs, and central processing unit (CPU) or graphics processing unit (GPU) execution. Active learning is performed when the training result does not improve smoothly. Confidence values lie between 0 and 1, and the threshold was set to 0.9 because a lower value biases the outcome of our experiment. If the labeled score is below the threshold, we do not keep that image; we replace it with a new facial expression image.
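The confidence-threshold rule can be sketched as follows; the 0.9 threshold comes from the description above, while `predict` (returning per-class confidences in [0, 1]) is a hypothetical interface:

```python
CONFIDENCE_THRESHOLD = 0.9  # lower values biased the outcome of the experiment

def filter_mini_batch(predict, mini_batch):
    """Split a mini batch by model confidence: images whose best label
    score falls below the threshold are flagged for replacement."""
    kept, to_replace = [], []
    for image, label in mini_batch:
        if max(predict(image)) >= CONFIDENCE_THRESHOLD:
            kept.append((image, label))
        else:
            to_replace.append((image, label))  # swap in a new expression image
    return kept, to_replace
```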
Fig. 5. Active learning tool for face image manipulation.
Fig. 5 shows the active learning tool, a supervised process that automatically learns a new label and then iteratively labels unlabeled data through the model. This method greatly reduces the effort
required to label new data. A new dataset that includes wrong labels in the training process can drastically reduce performance. Finally, the active learning method overcomes the overfitting problem by attaching the correct label (Cohn, Ghahramani, & Jordan, 1996).

3.2 Deep convolutional neural network
Fig. 6. The VGG very deep 16 network model.

The convolutional neural network is one of the machine learning schemes that has received attention in recent years owing to its performance on computer vision problems (Rifai et al., 2012; Burkert et al., 2015; Lawrence, Giles, Tsoi, & Back, 1997; Huang, Liu, van der Maaten, & Weinberger, 2016; Huynh, Tran, & Kim, 2016). A CNN extracts feature vectors according to the network structure specified by the user (Gao et al., 2015; Rifai et al., 2012) and classifies them against ground truth labels; the trained network features fit the training data well. The VGG 16-layer deep network was used for facial expression recognition (Simonyan & Zisserman, 2014). Instead of designing a network from scratch, we used the pre-trained VGG16 network for transfer learning (Li, Sun, & Xu, 2017). To recognize facial expressions, we extract feature vectors for classification by modifying the final layer of the VGG network so they
can be identified through facial feature representation. When the classifier receives a new face image, it classifies the extracted feature vector into one of seven labels, where 1 denotes an angry expression, 2 disgusted, 3 fear, 4 smiling, 5 sadness, 6 surprised, and 7 neutral. The network uses very small (3×3) convolution filters throughout, which yields better performance for large-scale image recognition. Figure 6 shows the structure of the VGG very deep 16 network used in our experiments (Simonyan & Zisserman, 2014; Huynh et al., 2016). The overall algorithm for facial expression classification in a wild environment using incremental active learning is presented in Algorithm 1, after the brief transfer-learning sketch below.
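Our implementation runs on Caffe, but the layer modification can be sketched in PyTorch for illustration: the ImageNet classifier head of the pre-trained VGG16 is replaced with a seven-way output, one unit per expression class. The frozen-features choice is an assumption, not necessarily the paper's exact fine-tuning recipe.

```python
import torch.nn as nn
from torchvision import models

NUM_EXPRESSIONS = 7  # angry, disgusted, fear, smiling, sad, surprised, neutral

# Load VGG16 pre-trained on ImageNet and swap the final fully connected
# layer (4096 -> 1000 classes) for a seven-class expression output.
vgg16 = models.vgg16(pretrained=True)
vgg16.classifier[6] = nn.Linear(in_features=4096, out_features=NUM_EXPRESSIONS)

# Optionally freeze the convolutional features so fine-tuning updates only
# the classifier; a common transfer-learning choice, assumed here.
for param in vgg16.features.parameters():
    param.requires_grad = False
```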
Algorithm 1. Wild facial expression classification using incremental active learning

Input: Labeled and unlabeled datasets, pre-trained model
Output: Correctly labeled wild FER dataset and classification model

Method:
Step 1: Select pre-processed FE features from a training set
Repeat:
  Step 2: Train a model using Step 3 and Step 4
  Step 3: For each image, predict the FE
          Find the maximum predicted score
  Step 4: If the prediction is false:
            Apply AL
          If the predicted score exceeds the threshold value:
            Combine with the previous dataset
          Else:
            Replace the FE image set with a new image set
  Step 5: Train, go to Step 4
Step 6: Select the final model
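A compact Python sketch of Algorithm 1 follows; `train`, `predict`, `evaluate`, and `relabel` are hypothetical helpers passed in by the caller, not the paper's actual code.

```python
def incremental_active_learning(model, labeled, val_set, mini_batches,
                                train, predict, evaluate, relabel,
                                threshold=0.9):
    """Grow the training set one mini batch at a time, keeping a batch
    only when retraining on it improves the evaluation score."""
    best_score = evaluate(model, val_set)            # baseline performance
    for batch in mini_batches:
        # Steps 3-4: predict each image and apply active learning (relabel
        # or replace) when the maximum predicted score is below threshold.
        batch = [img if max(predict(model, img)) >= threshold else relabel(img)
                 for img in batch]
        candidate = train(labeled + batch)           # Steps 2 and 5: retrain
        score = evaluate(candidate, val_set)
        if score > best_score:                       # keep improving batches
            labeled, model, best_score = labeled + batch, candidate, score
        # otherwise the batch is discarded and the next batch is tried
    return model, labeled                            # Step 6: final model
```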
4. Experimental Result and Analysis

4.1 Dataset Overview

A. Extended Cohn-Kanade Dataset
In our experiments, we used the Cohn-Kanade dataset. It comes in two versions, and its subjects are 65% female, 15% African-American, and 3% Asian or Latino. We used version 2, known as CK+, because it provides sufficient labeled facial expression data with frontal face poses. Six common facial expressions in the CK+ dataset are considered: anger, disgust, fear, happiness, sadness, and surprise. The first few images of each video clip, which show no expression, are treated as neutral. We did not consider other expressions, such as contempt. To train the network models, we used 760 images for training and 180 for testing. Training samples before and after pre-processing are shown in Fig. 7.
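The neutral-labeling convention for CK+ sequences can be sketched as below; the number of frames treated as neutral is an assumed value.

```python
def label_ck_sequence(frames, peak_label, neutral_count=3):
    """CK+ clips begin without expression and end at the expression peak,
    so the first few frames are labeled neutral (count is illustrative)."""
    neutral = min(neutral_count, len(frames))
    labels = ["neutral"] * neutral + [peak_label] * (len(frames) - neutral)
    return list(zip(frames, labels))
```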
Fig. 7. A Cohn-Kanade (CK+) face image (surprise expression) before and after pre-processing.

B. ITLab Face Expression Datasets
Our dataset includes facial expressions mainly from East and South Asian members of the Intelligent Technology Lab (ITLab), Inha University, Republic of Korea. Seven emotions are considered: anger, disgust, fear, happiness, sadness, surprise, and neutral. We captured these facial expressions in five different settings: good lighting, average lighting, close to the camera, far from the camera, and natural lighting. We collected 30 images from 30 sequential frames for each batch. For each expression, there are on average 1,050 images. For each environment, we gathered both test and training datasets. Fig. 6 shows the VGG 16 network model used to train the ITLab database and create a pre-trained model.
4.2 Facial expressions in different environments
(a) Good lighting
(b) Average lighting
(c) Close to the camera
(d) Far from the camera
(e) Natural light

Fig. 8. Samples of different ITLab facial expressions in a wild environment.

Fig. 8 shows facial expression images in wild environments: good lighting, average lighting, with the face close to the camera, with the face far from the camera, and in natural light. As outlined in Fig. 8, our approach takes advantage of an integrated learning framework and detects facial expressions under different lighting conditions. The input for our framework is a sequence of images. We consider a diverse and large amount of training data, which helps a new dataset be classified well and supports real-time detection. We trained our network for 50,000 iterations with a learning rate of 0.001 and a batch size of 8.

Our experiments use a publicly available VGG net that has five convolutional blocks and three fully connected layers. These networks are pre-trained on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC-2014) dataset (Berg, Deng, & Fei-Fei, 2010), which includes 1.2 million training images labeled into 1000 classes. For the detection system, we used a multi-task cascaded convolutional network (Zhang et al., 2016), a convolutional neural network-based face detector, and an incremental active learning model (Lopes, de Aguiar, De Souza, & Oliveira-Santos, 2017) implemented on the popular Caffe deep learning library (Jia et al., 2014). All implementations ran on a single server with the Compute Unified Device Architecture (CUDA) deep neural network library (cuDNN) (Chetlur & Woolley, 2014) and a single NVIDIA GeForce GTX 970 graphics card running Ubuntu 14.04.

Table 1. Correctly classified facial expressions. (A higher result indicates better performance.)

Environment                   Average correctly classified   Average incorrectly classified
                              facial expression images       facial expression images
Good lighting conditions      91                             9
Average lighting conditions   92.5                           7.5
Close to the camera           94                             5.5
Far from the camera           90                             10
Natural light                 96                             3.5

Table 1 shows facial expression image accuracies in a wild environment. For the close-to-the-camera and natural-light environments, FE detection performance is higher than in the other environments because conditions are less noisy: the correctly classified facial expressions were 94 and 96 images, respectively. The face far from the camera had the worst FE detection rate of all the environments.
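For illustration, the detection step can be sketched with the `mtcnn` Python package as a stand-in for the Caffe-based multi-task cascaded convolutional network used in our pipeline:

```python
import cv2
from mtcnn import MTCNN  # pip install mtcnn; stand-in for the Caffe MTCNN

detector = MTCNN()

def detect_faces(image_path):
    """Return the bounding box of every face found in an image; each
    detection also carries facial keypoints and a confidence score."""
    image = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)
    return [d["box"] for d in detector.detect_faces(image)]  # [x, y, w, h]
```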
Table 2. Facial expression recognition in different environments. (A higher result indicates better performance.)

Environment                   Average FER recognition (%)
Good lighting conditions      71.19
Average lighting conditions   74.83
Close to the camera           73.30
Far from the camera           64.50
Natural light                 81.00
Table 2 shows the average facial expression recognition accuracy for sadness, neutral, disgust, happy, fear, angry, and surprise in the different environments. Our experiments show that natural light and average lighting conditions produce better results, with average FER accuracy at 81% and 74.83%, respectively.
Fig. 9. Example images from real-time online evaluation of facial expression recognition.

Table 3. Facial expression recognition accuracy in online evaluation.

Expression   Baseline        Random          Incremental Active Learning
Average      63.0 ± 0.2%     72.4 ± 0.2%     88.0 ± 0.03%
Neutral      81 ± 0.2%       76 ± 0.1%       92 ± 0.02%
Angry        55 ± 0.3%       51 ± 0.2%       89 ± 0.01%
Happy        49 ± 0.3%       73 ± 0.2%       89 ± 0.02%
Disgust      50 ± 0.2%       91 ± 0.1%       90 ± 0.04%
Fear         75.0 ± 0.1%     73.0 ± 0.2%     78.0 ± 0.1%
Sadness      69 ± 0.2%       80.5 ± 0.1%     86 ± 0.03%
Surprise     58 ± 0.2%       62.5 ± 0.2%     76 ± 0.02%

Table 3 shows the online evaluation results. We compare our incremental active learning method against baseline and random facial expression recognition. Here, the baseline is the initial data model without any additional facial expression images; for random selection, training samples are drawn randomly from the pool of facial expression images to avoid sample selection bias (Zadrozny, 2004). Without incremental active learning, average facial expression recognition performance is only 63%, whereas with incremental active learning it is over 88%. With the randomly sampled facial expression data model, average performance is only 72.4%. Quite remarkably, incremental active learning produces better performance than random and baseline selection because the finest samples are chosen for training.

In Fig. 10, we show the performance of the baseline, random, and incremental active learning approaches. Each group of bars shows the facial expression recognition performance in the online evaluation, with three bars corresponding to baseline accuracy, random sampling accuracy, and the proposed incremental active learning method's accuracy.
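The selection strategies compared in Table 3 and Fig. 10 can be contrasted in a short sketch; `predict` (returning per-class confidences) is a hypothetical interface supplied by the caller.

```python
import random

def random_selection(pool, k, seed=None):
    """Random baseline: draw k samples uniformly from the unlabeled pool."""
    return random.Random(seed).sample(pool, k)

def active_selection(pool, k, predict, threshold=0.9):
    """Active-learning-style selection: prefer the samples the current
    model scores most confidently, mirroring the 0.9 threshold above."""
    scored = sorted(((max(predict(x)), x) for x in pool),
                    key=lambda pair: pair[0], reverse=True)
    confident = [x for score, x in scored if score >= threshold]
    # Fall back to the k best-scoring samples if too few pass the threshold.
    return confident[:k] if len(confident) >= k else [x for _, x in scored[:k]]
```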
Fig. 10. Facial expression performance analysis comparing the proposed incremental active learning with the baseline and random data results from online evaluation.
In our experiment, the active learning model starts from an environment with high initial performance. However, if we start active learning from high initial performance, supervised learning can predict incorrect labels with high confidence. To minimize this effect, we gradually learned from the environment most similar to the data used to construct the initial learning model. The experimental results show that convergence of the initial performance decreases as the environment changes, because the various environmental data are learned automatically through active learning. Our method is most useful when facial expressions are captured under better lighting conditions in similar environments.
C. Benchmark Dataset Comparison
A few popular deep learning approaches to facial expression recognition, such as Inception (Szegedy, Vanhoucke, Ioffe, Shlens & Wojna, 2016), VGG (Simonyan & Zisserman, 2014), and GoogLeNet (Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, & Rabinovich, 2015), are compared with ours. These models are trained with both CK+ (Kanade & Cohn, 2000) and ITLab face data.

Table 4. Performance of a few representative comparison methods on the Cohn-Kanade (CK+) dataset.

Method                          Accuracy on CK+ (%)
KNN (Shan, Guo & You, 2017)     77.27
CNN (Shan, Guo & You, 2017)     80.30
GM (Wu & Lin, 2018)             86.83
GM+AFM (Wu & Lin, 2018)         87.78
GM+W-AFM (Wu & Lin, 2018)       88.25
GM+W-CR-AFM (Wu & Lin, 2018)    89.84
Ours                            91.80

The results in Table 4 show a few representative comparison methods and their performance on the CK+ dataset. Our model trained with CK+ data outperforms the other methods.
Table 5. Comparison of state-of-the-art approaches and our method. Accuracy is reported per test set.

Approach          Test set    Test set     Test set     Test set     Test set      Test set
                  (180, 0)    (180, 250)   (180, 500)   (180, 750)   (180, 1000)   (0, 1000)
VGG [ADAM]        0.912       0.415        0.300        0.225        0.222         0.215
VGG [RMS]         0.927       0.514        0.391        0.304        0.312         0.233
Inception [RMS]   0.916       0.514        0.380        0.300        0.301         0.234
Ours              0.91        0.76         80.77        82.22        82.75         72.96
In Table 5, "Test set (N, M)" denotes test data combining N images from the CK+ dataset and M images from the ITLab facial expression dataset. The common incremental active learning parameters are the batch size, the total number of epochs, and the learning rate. Table 5 compares different state-of-the-art approaches using the ITLab facial expression dataset and CK+. Note that our method outperforms the other techniques on the mixed test data of the ITLab and CK+ datasets. Our method has the lowest accuracy for Test set (0, 1000) and the best accuracy for Test set (180, 0), which is significantly higher than that of Wu and Lin (2018).

5. Conclusion
In this paper, we propose an incremental active learning framework that can handle facial expressions in various wild environments. We successfully label the right candidate images using the active learning method. During rigorous experiments on the ITLab and CK+ datasets, several issues affected performance, such as the person's skin color and a camera too far from the face (more than 10 meters in our case) under various lighting conditions. However, by adjusting the experiment order with respect to camera distance (less than 5 meters) and comparing initial performance, these issues can be addressed by excluding data from the experiment when the performance is not
satisfactory. The proposed system is expected to help considerably when a large amount of training data from various environments is required. Our future research direction is to find more flexible ways to handle the learning sequence. In addition, reducing the number of high-confidence erroneous labels that occur in the active learning process, and thereby minimizing the discarded training data, will maximize efficiency. Our experimental results indicate that the proposed method is applicable to different domains, including cognitive systems, surveillance systems, and security systems, where the environment is not friendly.
ACKNOWLEDGEMENT
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (NRF-2016R1D1A1B03935440). The GPUs used in this research were generously donated by NVIDIA.
References

Bettadapura, V. (2012). Face expression recognition and analysis: The state of the art. arXiv preprint arXiv:1203.6722.
Burkert, P., Trier, F., Afzal, M. Z., Dengel, A., & Liwicki, M. (2015). DeXpression: Deep convolutional neural network for expression recognition. arXiv preprint arXiv:1509.05371.
Chetlur, S., & Woolley, C. (2014). cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759.
Cohn, D. A., Ghahramani, Z., & Jordan, M. I. (1996). Active learning with statistical models. Journal of Artificial Intelligence Research, 4, 129–145. https://doi.org/10.1613/jair.295
Dornaika, F., & Bosaghzadeh, A. (2013). Exponential local discriminant embedding and its application to face recognition. IEEE Transactions on Cybernetics, 43(3).
Ekman, P. (1993). Facial expression and emotion. American Psychologist. https://doi.org/10.1037/0003-066X.48.4.384
Fasel, B., & Luettin, J. (2003). Automatic facial expression analysis: A survey. Pattern Recognition, 36(1), 259–275. https://doi.org/10.1016/S0031-3203(02)00052-3
Froba, B., & Ernst, A. (2004). Face detection with the modified census transform. IEEE International Conference on Automatic Face and Gesture Recognition (FGR'04).
Gao, S., Zhang, Y., Jia, K., Lu, J., & Zhang, Y. (2015). Single sample face recognition via learning deep supervised auto-encoders. IEEE Transactions on Information Forensics and Security. https://doi.org/10.1109/TIFS.2015.2446438
Happy, S. L., Patnaik, P., Routray, A., & Guha, R. (2017). The Indian spontaneous expression database for emotion recognition. IEEE Transactions on Affective Computing, 8(1).
Huang, D. S. (2004). A constructive approach for finding arbitrary roots of polynomials by neural networks. IEEE Transactions on Neural Networks, 15(2).
Huang, G., Liu, Z., van der Maaten, L., & Weinberger, K. Q. (2016). Densely connected convolutional networks. https://doi.org/10.1109/CVPR.2017.243
Huynh, X., Tran, T., & Kim, Y. (2016). Information Science and Applications (ICISA) 2016, 376, 441–442. https://doi.org/10.1007/978-981-10-0557-2
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., … Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. ACM International Conference on Multimedia, 675–678. https://doi.org/10.1145/2647868.2654889
Kanade, T., & Cohn, J. F. (2000). Comprehensive database for facial expression analysis. Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition, 46–53. https://doi.org/10.1109/AFGR.2000.840611
Kim, B., Lee, H., Roh, J., & Lee, S. (2015). Hierarchical committee of deep CNNs with exponentially-weighted decision fusion for static facial expression recognition. Proceedings of the 2015 ACM International Conference on Multimodal Interaction, 427–434. https://doi.org/10.1145/2818346.2830590
Lawrence, S., Giles, C. L., Tsoi, A. C., & Back, A. D. (1997). Face recognition: A convolutional neural-network approach. IEEE Transactions on Neural Networks, 8(1), 98–113. https://doi.org/10.1109/72.554195
Li, B., Zheng, C. H., & Huang, D. S. (2008). Locally linear discriminant embedding: An efficient method for face recognition. Pattern Recognition, 41, 3813–3821.
Li, H., Sun, J., & Xu, Z. (2017). Multimodal 2D + 3D facial expression recognition with deep fusion convolutional neural network. IEEE Transactions on Multimedia. https://doi.org/10.1109/TMM.2017.2713408
Liu, J. (2015). Targeting ultimate accuracy: Face recognition via deep embedding. CVPR, 4–7.
Lopes, A. T., de Aguiar, E., De Souza, A. F., & Oliveira-Santos, T. (2017). Facial expression recognition with convolutional neural networks: Coping with few data and the training sample order. Pattern Recognition, 61, 610–628. https://doi.org/10.1016/j.patcog.2016.07.026
Mehrabian, A. (1968). Communication without words. Psychology Today, 2, 53–55.
Osuna, E., Freund, R., & Girosi, F. (1997). Training support vector machines: An application to face detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Pizer, S. M., Amburn, E. P., Austin, J. D., Cromartie, R., Geselowitz, A., Greer, T., ter Haar Romeny, B., Zimmerman, J. B., & Zuiderveld, K. (1987). Adaptive histogram equalization and its variations. Computer Vision, Graphics, and Image Processing, 39(3), 355–368.
Rifai, S., Bengio, Y., Courville, A., Vincent, P., & Mirza, M. (2012). Disentangling factors of variation for facial expression recognition. Computer Vision – ECCV 2012, 7577, 808–822. https://doi.org/10.1007/978-3-642-33783-3_58
Schroff, F., Kalenichenko, D., & Philbin, J. (2015). FaceNet: A unified embedding for face recognition and clustering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 815–823. https://doi.org/10.1109/CVPR.2015.7298682
Shan, K., Guo, J., You, W., Lu, D., & Bie, R. (2017). Automatic facial expression recognition based on a deep convolutional-neural-network structure. IEEE SERA 2017.
Sharma, P. (2011). Feature based method for human facial emotion detection using optical flow based analysis. International Journal of Research in Computer Science, 1(1), 31–38.
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Song, X., Feng, Z., Hu, G., & Wu, X. (2017). Half-face dictionary integration for representation-based classification. IEEE Transactions on Cybernetics, 47(1).
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1–9.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2818–2826.
Taheri, S., Patel, V. M., & Chellappa, R. (2013). Component-based recognition of faces and facial expressions. IEEE Transactions on Affective Computing, 4(4), 360–371. https://doi.org/10.1109/T-AFFC.2013.28
Taigman, Y., Yang, M., Ranzato, M., & Wolf, L. (2014). DeepFace: Closing the gap to human-level performance in face verification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1701–1708. https://doi.org/10.1109/CVPR.2014.220
Tawari, A., & Trivedi, M. M. (2013). Face expression recognition by cross modal data association. IEEE Transactions on Multimedia, 15(7).
Uddin, M. Z., Hassan, M. M., Almogren, A., Zuair, M., Fortino, G., & Torresen, J. (2017). A facial expression recognition system using robust face features from depth videos and deep learning. Computers and Electrical Engineering, 63, 114–125. https://doi.org/10.1016/j.compeleceng.2017.04.019
Wang, H., Hu, J., & Deng, W. (2017). Compressing Fisher vector for robust face recognition. IEEE Access, 5, 23157–23165. https://doi.org/10.1109/ACCESS.2017.2749331
Wu, B. F., & Lin, C. H. (2018). Adaptive feature mapping for customizing deep learning based facial expression recognition model. IEEE Access, 12451–12461.
Yu, Z. (2015). Image based static facial expression recognition with multiple deep network learning. ACM International Conference on Multimodal Interaction (ICMI), 435–442. https://doi.org/10.1145/2823327.2823341
Zadrozny, B. (2004). Learning and evaluating classifiers under sample selection bias. Twenty-First International Conference on Machine Learning (ICML), 114.
Zhang, K., Zhang, Z., Li, Z., & Qiao, Y. (2016). Joint face detection and alignment using multi-task cascaded convolutional networks. IEEE Signal Processing Letters. https://doi.org/10.1109/LSP.2016.2603342
Zhang, W., Zhao, X., Morvan, J. M., & Chen, L. (2018). Improving shadow suppression for illumination robust face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1–14. https://doi.org/10.1109/TPAMI.2018.2803179
Zhu, X., & Ramanan, D. (2012). Face detection, pose estimation, and landmark localization in the wild. CVPR.
Zong, Y., Zheng, W., Huang, X., Yan, K., Yan, J., & Zhang, T. (2016). Emotion recognition in the wild via sparse transductive transfer linear discriminant analysis. Journal on Multimodal User Interfaces, 1–10.