Automatic evaluation of facial nerve paralysis by dual-path LSTM with deep differentiated network

Pengfei Xu a, Fei Xie b, Tongsheng Su c, Zhaoxin Wan c, Zhaoyong Zhou d, Xiaoyu Xin e, Ziyu Guan a,*

a School of Information Science and Technology, Northwest University, Xi'an 710127, China
b School of Computer Science, Northwestern Polytechnical University, Xi'an 710072, China
c Shaanxi Hospital of Traditional Chinese Medicine, Xi'an 710003, China
d Network & Education Technology Center, Northwest A&F University, Xianyang 712100, China
e Rui Jin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China

* Corresponding author. E-mail addresses: [email protected] (P. Xu), [email protected] (F. Xie), [email protected] (Z. Zhou), [email protected] (Z. Guan).
Article info

Article history: Received 18 May 2019; Revised 23 November 2019; Accepted 9 January 2020; Available online xxx. Communicated by Dr. Qingshan Liu.

Keywords: Facial nerve paralysis; Facial asymmetry; Dual-path LSTM; Deep differentiated network; Deep learning methods
Abstract

Facial nerve paralysis adversely affects patients' mental and physical health, and several existing evaluation methods of facial paralysis are based on static facial asymmetry. However, these traditional methods suffer from two drawbacks: (1) facial movement information, which plays an important role in facial paralysis analysis, is usually ignored; (2) shallow machine learning models have limited ability to extract useful facial features for the evaluation of facial paralysis. To solve these problems, we present a dual-path LSTM with deep differentiated network (DP-LSTM-DDN) to evaluate the severity of facial paralysis automatically. The key idea behind DP-LSTM-DDN is that the diagnosis results are sensitive to facial asymmetry and to the patterns of facial muscular movements while patients perform the diagnostic facial actions. Therefore, we design a deep differentiated network to analyze the difference between the two sides of a patient's face. Furthermore, since the involved facial regions are as important as the whole face for facial paralysis diagnosis, we propose a dual-path LSTM network to extract both global and local facial movement features. These high-level representations are then fused for the final evaluation of facial paralysis. The experimental results verify that DP-LSTM-DDN outperforms the state-of-the-art methods. © 2020 Elsevier B.V. All rights reserved.
1. Introduction

Facial nerve paralysis is a common and frequently occurring disease that impairs facial muscular movements; its cardinal symptom is that the facial expression muscles lose their normal motor function. As shown in Fig. 1, most facial paralysis cases are clinically unilateral, i.e., only one side of the face is affected [1]. Facial paralysis adversely affects patients' normal life and social activities and imposes a heavy mental burden on them. The evaluation of facial paralysis is mainly based on facial asymmetry and on the patterns of facial muscular movements. At present, most diagnoses and evaluations of facial paralysis are made by specialist doctors according to their medical experience and the relevant grading standards, so the results are easily affected by subjective factors. In addition, the whole diagnosis process is inefficient. For example, patients undergoing rehabilitation self-training have to spend considerable time on frequent diagnosis and re-evaluation by their doctors. Therefore, it is important to make an accurate diagnosis and evaluation of facial paralysis using artificial intelligence technologies, which can serve as auxiliary diagnosis tools that ease doctors' workloads, reduce the influence of subjective factors, and allow patients to perform self-diagnosis.

With the development of computer vision, studies on facial (micro-)expression recognition have laid a good foundation for analyzing facial appearance changes [2]. At present, several evaluation methods for facial paralysis are based on the asymmetry of static facial images or of facial muscle movements. These methods can be roughly divided into two categories. (1) Methods based on facial key point detection [3], which mainly capture facial shape information but ignore texture information, so the extracted features cannot describe facial states well. Moreover, their performance highly depends on the accuracy of key point detection, and it is hard to
accurately detect key points on patients' faces because of their complex facial structures and varied facial states. (2) Methods based on facial region partition [1,4], which capture more facial texture information and achieve higher accuracy. However, these methods are still performed on static facial images and consider only facial asymmetry, ignoring facial muscle movement features. Besides, they typically employ shallow models for classification, which limits facial feature extraction, and they usually consider only the local region of a single organ while ignoring global facial features.

A number of deep models have been applied to facial micro-expression recognition and action recognition [2,5], and these works show that deep models can extract effective dynamic features. Deep learning methods should therefore also be effective for evaluating facial paralysis. However, up to now, only one existing facial paralysis evaluation method, proposed by Guo [6], uses a classical network (GoogleNet) to evaluate facial paralysis objectively. We therefore design a new deep neural network framework to assess the severity of facial paralysis automatically.

During the diagnosis of facial paralysis, patients are required to perform several standard facial actions, and doctors evaluate the severity of facial paralysis by analyzing the resulting facial asymmetry. They also focus on the asymmetry around particular facial regions. For example, when a patient smiles, asymmetric changes in facial shape and texture occur over the whole face, and the largest asymmetry appears around the mouth, as shown in Fig. 1. Based on these observations, we propose dual-path LSTM with deep differentiated network (DP-LSTM-DDN) to extract high-level movement features of the facial muscles from videos of the diagnostic facial actions. In DP-LSTM-DDN, the deep differentiated network (DDN) is designed to extract the differentiated features between the two sides of a patient's face. The dual-path LSTM employs one LSTM sub-network to extract global movement features from the whole face and another to extract local movement features from the involved facial regions. Finally, the features extracted by the dual-path LSTM are fused for the final evaluation.

Compared with existing evaluation methods, DP-LSTM-DDN makes two main contributions. (1) We propose the deep differentiated network (DDN) to extract the differentiated features between the two sides of the patients' faces or of the involved facial regions. This network is inspired by the way doctors diagnose facial paralysis based on facial asymmetry. (2) We propose the dual-path LSTM (DP-LSTM) to extract global and local movement features of the patients' faces from videos of the diagnostic facial actions, and these features are fused for the automatic evaluation of facial paralysis. This architecture is inspired by the way doctors evaluate facial paralysis by focusing on the asymmetry of the muscle movements both over the whole face and in the involved facial regions.

Fig. 1. The facial changes of the diagnostic facial action "smile".

2. Related work
With the increasing incidence of facial paralysis, efficient and automated recognition and evaluation methods have become an urgent need in clinical diagnosis. Several methods have been proposed for the automatic evaluation of facial nerve paralysis. These methods provide an efficient and convenient diagnostic route for facial paralysis; more importantly, they reduce the influence of subjective factors on the diagnosis results and ease the workload of doctors, and some of them have been successfully applied in clinical diagnosis.

Key point detection is a classic way of modeling human faces, and several evaluation methods for facial analysis are based on it. Nishida [7] calculated the movement distances of detected key points and quantitatively measured the severity of facial paralysis from these distances. Liu [8] estimated the severity of facial paralysis by comparing the numbers of pixels in certain regions on either side of the face. Wachtman [3] evaluated facial paralysis based on facial asymmetry. Wang [9] proposed an assessment method for facial paralysis based on the active appearance model (AAM). Modersohn [10] proposed an improved AAM to extract patients' facial features and further identify and analyze their disease. Barbosa [11] detected key points with a hybrid classification model and analyzed facial disease by tracking the changes of the facial key points. Suchy [12] put forward a blind analysis system for facial paralysis based on the comparative analysis of facial regional features. Although the above methods achieved encouraging results in automatic facial paralysis analysis, their performance highly depends on the accuracy of key point detection, and accurately detecting key points on the faces of facial paralysis patients remains a big challenge.

There are also evaluation methods based on face region partition, which assess the severity of facial paralysis by comparing the shape and texture features of two symmetric regions. Lin [4] proposed an automatic classification method for patients' facial motor function based on Gabor features and SVM. Ma [13] put forward a facial nerve function evaluation method based on facial key points and facial regions. Ngo [14] made a quantitative evaluation of facial paralysis by combining Gabor and LBP features.

In addition, several evaluation methods based on dynamic and 3D features extracted from videos have been proposed to address the problems of the above two types of methods, which consider only the asymmetry of static facial images and neglect the motion features of facial muscle movements. He [15,16] proposed an evaluation method based on the optical flow of facial motion, which evaluates the severity of facial paralysis by measuring the changes of the optical flow in specific facial regions before and after the facial actions. Some researchers have designed automatic facial analysis systems to identify subtle changes in patients' facial movements [17]. Hontanilla [18] used a 3D model to quantitatively analyze facial movements by asking patients to perform facial actions.
Wang [1] further proposed a method to automatically evaluate the severity of facial paralysis based on both static facial asymmetry and dynamic facial changes. Liu [19] used infrared thermal images to extract the facial temperature distribution in the relevant facial regions for evaluating facial paralysis. Ngo [20] applied a concentric modulation filter and measured the asymmetry between the two sides of the filtered images to evaluate facial paralysis. However, these traditional methods extract hand-crafted features from static facial images by considering facial asymmetry, and
few works pay attention to deep features of facial muscle movements. In recent years, deep learning methods have been applied successfully to facial expression and micro-expression recognition [21-25], and several deep learning approaches have been proposed for evaluating facial paralysis. Our task shares several similarities with facial expression and micro-expression recognition: (1) both focus on facial muscle movement features; (2) most existing methods solve both as classification tasks; (3) subtle changes in facial movements may greatly affect the final results. There are also differences between them: (1) facial paralysis evaluation assesses the severity of a facial disease and deals with images or videos of facial paralysis patients, whereas facial expression and micro-expression recognition identify the types of facial movements and deal with images or videos of normal facial movements; (2) facial paralysis evaluation mainly focuses on the difference between the two sides of the face, whereas facial expression and micro-expression recognition mainly focus on changes of the whole face rather than facial asymmetry, which is the biggest difference between the two tasks. Recently, an automatic detection method for the feature points of facial paralysis was proposed based on a deep convolutional network [26], and the method in [6] used the GoogleNet model [27] for transfer learning and reported satisfactory results. However, these methods apply a classical network to facial paralysis analysis with few changes to the original structure; a deep network specifically designed for facial paralysis analysis is still missing. Evaluation methods for facial paralysis based on deep learning therefore have broad development prospects.
Fig. 2. The overall framework of DP-LSTM-DDN.
3. Automatic evaluation for facial paralysis by DP-LSTM-DDN

3.1. Overview of DP-LSTM-DDN

Doctors evaluate the severity of facial paralysis by observing the differences between the left and right sides of the patient's face and of the involved facial regions. Inspired by this, we design the dual-path LSTM with deep differentiated network (DP-LSTM-DDN) to evaluate facial paralysis automatically. As shown in Fig. 2, DP-LSTM-DDN has two LSTM paths. In each path, DDN extracts features of the difference between the two sides of the face, or of the involved facial regions corresponding to the specific diagnostic action, and the sequence of DDN features extracted from the frames is fed into an LSTM to learn the temporal features of the facial movements. The videos recording the facial movements are input into DP-LSTM-DDN: one path extracts the global temporal features Fg of the whole facial muscle movements, while the other learns the local temporal features Fl of the involved facial regions. Finally, the feature vectors obtained by the dual-path LSTM are concatenated into a fused feature vector for evaluating the severity of facial paralysis.

3.2. Symmetrical separation of the patients' faces

Different facial paralysis patients have different difficulties in performing the diagnostic facial actions, so the recordings have different lengths. We therefore first decompose these videos of varying length into frame sets, and then obtain consistent inputs with the same number of frames by selecting frames at equal intervals [28,29]. The frames are extracted with existing tools (such as OpenCV or Matlab), and 30 frames are selected from each video at equal intervals. Specifically, given a video with n frames, we sample at interval n/30: from the frame sequence C = {c1, c2, ..., ck, ..., cn}, the frames c(n/30)×k (k = 1, 2, ..., 30) are selected, so each processed video has 30 selected frames.

In most realistic scenarios, complex and frequently changing backgrounds adversely affect DDN when it extracts the facial differentiated features from the diagnostic videos. We therefore focus on the patients' faces to reduce the interference from the background. Furthermore, we perform symmetry separation on the patients' faces and on the involved facial regions, and the separated faces and facial regions are then input into DDN to extract the differentiated features. To achieve these tasks, Faster RCNN [30] is employed to detect the face in every frame (as shown in Fig. 3(a)), and the key points are obtained by applying the AAM algorithm [9] to the detected faces, as shown in Fig. 3(b). We then compute the facial medial axis line from the coordinates of the key points and perform symmetry separation on the faces and the involved facial regions, as shown in Fig. 3(c). However, the accuracy of the conventional AAM algorithm on paralyzed faces is not as high as on normal faces, owing to the large differences between patients. Therefore, we select two key points with relatively stable locations, P1(x1, y1) and P2(x2, y2), to determine the facial medial axis line for symmetry separation, as shown in Fig. 4. In addition, the two separated sides of the face must have the same size for the comparative analysis by DDN.

Fig. 3. Face detection and symmetry separation.
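As a concrete illustration of the equal-interval selection described above, the following minimal Python sketch samples 30 frames from a diagnostic video with OpenCV. The function name is ours, and the use of OpenCV's VideoCapture is an assumption; the paper only states that OpenCV or Matlab can be used.

```python
import cv2
import numpy as np

def sample_frames(video_path, num_frames=30):
    """Select `num_frames` frames at equal intervals from a video,
    mirroring the c_(n/30)xk selection rule described above."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    n = len(frames)
    if n < num_frames:
        raise ValueError(f"video has only {n} frames")
    # indices (k * n) / num_frames for k = 1..num_frames, shifted to 0-based
    idx = (np.arange(1, num_frames + 1) * n) // num_frames - 1
    return [frames[i] for i in idx]
```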
Please cite this article as: P. Xu, F. Xie and T. Su et al., Automatic evaluation of facial nerve paralysis by dual-path LSTM with deep differentiated network, Neurocomputing, https://doi.org/10.1016/j.neucom.2020.01.014
JID: NEUCOM 4
ARTICLE IN PRESS
[m5G;January 22, 2020;12:17]
P. Xu, F. Xie and T. Su et al. / Neurocomputing xxx (xxxx) xxx
Fig. 4. The two key points with relative stability locations for symmetry separation.
Fig. 5. Region location by Faster RCNN and the determination of the upper and lower bounds of the three involved facial regions.
The medial axis line of the face is L(x) = (x1 + x2)/2. The width of each side of the face in the horizontal direction for symmetry separation is W = max(|min(X) − (x1 + x2)/2|, |max(X) − (x1 + x2)/2|), where X denotes the horizontal coordinates of all key points. Therefore, the left side of the face spans from L(x) − W to L(x) in the horizontal direction, and the right side spans from L(x) to L(x) + W. min(Y) and max(Y) determine the upper and lower bounds of the separated faces, where Y denotes the vertical coordinates of all facial key points.

To capture the differentiated features of the involved facial regions, we also use Faster RCNN to locate three involved regions on the face, as shown in Fig. 5. These involved regions are symmetrically separated in the horizontal direction in the same way as the whole face. In the vertical direction, the bounds (y1,h, y1,l), (y2,h, y2,l) and (y3,h, y3,l) of the bounding boxes obtained by Faster RCNN determine the upper and lower bounds of the three involved facial regions.
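The symmetry separation above can be condensed into a short sketch. The helper below is hypothetical: it assumes the AAM key points and the two stable points P1, P2 are already available as NumPy data, and mirroring the right half so both halves share the same orientation is our assumption rather than something the paper specifies.

```python
import numpy as np

def split_face(face_img, keypoints, p1, p2):
    """Split a detected face into left/right halves of equal width using the
    medial-axis construction above. `keypoints` is an (N, 2) array of (x, y)
    AAM key points; p1, p2 are the stable points (x1, y1) and (x2, y2)."""
    X, Y = keypoints[:, 0], keypoints[:, 1]
    Lx = (p1[0] + p2[0]) / 2.0                       # medial axis L(x) = (x1 + x2) / 2
    W = max(abs(X.min() - Lx), abs(X.max() - Lx))    # half-face width
    top, bottom = int(Y.min()), int(Y.max())         # vertical bounds from key points
    Lx, W = int(round(Lx)), int(round(W))
    # boundary clipping against the image border is omitted for brevity
    left = face_img[top:bottom, Lx - W:Lx]
    right = face_img[top:bottom, Lx:Lx + W]
    right_mirrored = right[:, ::-1]                  # flip so both halves align (our assumption)
    return left, right_mirrored
```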
3.3. Extracting the differentiated features by DDN

DDN consists of two parts, as shown in Fig. 6. The first part is a bifurcated convolutional neural network (BCNN) with two streams for feature extraction from pairs of images: the two sides of the face, or of the involved facial regions, are input into BCNN to obtain a pair of feature maps, and the difference information between each pair of feature maps is then calculated. In the latter part of DDN, this difference information is used as the input of a convolutional neural network that learns the differentiated features, i.e., the features of the difference between the features of the two sides of the face or of the involved facial regions. As in a Siamese network [31], the two streams of BCNN share the same architecture, and each stream consists of 5 convolution layers, 3 pooling layers, 5 deconvolution layers, 3 unpooling layers and 4 normalization layers. For our task, the difference between the shape and texture features of the two sides of the face reflects the severity of facial paralysis. In addition, BCNN contains deconvolution operations, which backward-derive the possible input activation map from the convolution kernels; this map reflects the relative activation degree, so the contribution of a feature point in the input activation map is directly proportional to its input value, and the network regards these feature points as the most important ones to be activated. Moreover, the deconvolution part first obtains an enlarged but sparse response map through the unpooling layers, and these sparse responses are turned into dense features by the deconvolution layers. Based on our previous work and on experiments with several types of neural networks, we found that a network with deconvolution layers extracts useful texture features for the evaluation of facial paralysis and provides applicable feature maps for computing the difference information. The structure of BCNN can be described as

(conv + ReLU + pooling) + (Norm + conv + ReLU + pooling) + (Norm + (conv + ReLU) × 3 + pooling) + (unpooling + Deconv × 3 + Norm) + (unpooling + Deconv + Norm) + (unpooling + Deconv)   (1)

Fig. 6. The overall architecture of DDN.
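For readers who prefer code, the following PyTorch sketch instantiates one BCNN stream following Eq. (1) and the kernel sizes given in the next paragraph (11 × 11, 5 × 5 and 3 × 3 convolutions, 3 × 3 pooling, LRN). The channel widths, strides, paddings and the single-channel decoder output are our assumptions, since the paper does not specify them; this is a sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn

class BCNNStream(nn.Module):
    """One stream of the bifurcated CNN: a conv/pool encoder with LRN,
    mirrored by an unpooling/deconvolution decoder, as in Eq. (1)."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(3, 96, 11, stride=4, padding=5), nn.ReLU())
        self.pool1 = nn.MaxPool2d(3, stride=2, return_indices=True)
        self.norm1 = nn.LocalResponseNorm(5)
        self.conv2 = nn.Sequential(nn.Conv2d(96, 256, 5, padding=2), nn.ReLU())
        self.pool2 = nn.MaxPool2d(3, stride=2, return_indices=True)
        self.norm2 = nn.LocalResponseNorm(5)
        self.conv3 = nn.Sequential(
            nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 256, 3, padding=1), nn.ReLU())
        self.pool3 = nn.MaxPool2d(3, stride=2, return_indices=True)
        # decoder: unpooling + deconvolution layers mirror the encoder
        self.unpool3 = nn.MaxUnpool2d(3, stride=2)
        self.deconv3 = nn.Sequential(
            nn.ConvTranspose2d(256, 384, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(384, 384, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(384, 256, 3, padding=1), nn.ReLU())
        self.norm3 = nn.LocalResponseNorm(5)
        self.unpool2 = nn.MaxUnpool2d(3, stride=2)
        self.deconv2 = nn.Sequential(nn.ConvTranspose2d(256, 96, 5, padding=2), nn.ReLU())
        self.norm4 = nn.LocalResponseNorm(5)
        self.unpool1 = nn.MaxUnpool2d(3, stride=2)
        # single-channel output map F for the SGM comparison (our assumption)
        self.deconv1 = nn.ConvTranspose2d(96, 1, 11, stride=4, padding=5)

    def forward(self, x):
        x = self.conv1(x); s1 = x.size(); x, i1 = self.pool1(x); x = self.norm1(x)
        x = self.conv2(x); s2 = x.size(); x, i2 = self.pool2(x); x = self.norm2(x)
        x = self.conv3(x); s3 = x.size(); x, i3 = self.pool3(x)
        x = self.unpool3(x, i3, output_size=s3); x = self.deconv3(x); x = self.norm3(x)
        x = self.unpool2(x, i2, output_size=s2); x = self.deconv2(x); x = self.norm4(x)
        x = self.unpool1(x, i1, output_size=s1)
        return self.deconv1(x)

# both face halves pass through the same weight-shared stream:
# stream = BCNNStream(); F1, F2 = stream(left_half), stream(right_half)
```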
In BCNN, the first convolution layer uses an 11 × 11 convolution kernel and is followed by a pooling layer with a 3 × 3 pooling window. The second convolution layer uses a 5 × 5 kernel and is also followed by a pooling layer. Local response normalization (LRN) is applied after these first two convolution-pooling stages. The third, fourth and fifth convolution layers all use 3 × 3 kernels, and the third pooling layer follows the fifth convolution layer. The network then performs the reverse operations of deconvolution, unpooling and normalization, whose parameters and structures correspond one-to-one to the preceding convolution, pooling and normalization operations: the deconvolution layers correspond to the convolution layers, and the unpooling layers to the pooling layers.

The main purpose of DDN is to extract features of the difference between the features extracted from a pair of images. In a traditional Siamese network, the difference between two images is measured by the Euclidean distance between their features. In contrast, DDN learns deep features from the difference information between the pair of feature maps. In DDN, the semi-global matching algorithm (SGM) [32] is used to compute the difference information between the feature maps extracted by BCNN. SGM makes a fine-grained comparison and generates a new map DF of the difference information: it performs pixelwise matching based on mutual information and an approximation of a global smoothness constraint. In SGM, a disparity map is formed by selecting a disparity for each pixel, and this disparity map is our difference information DF between the feature maps F1 and F2. SGM defines a global energy function over the disparity map and seeks the optimal disparity of each pixel by minimizing this energy function:
E(D_F) = \sum_{p} \left( C(p, D_{F,p}) + \sum_{q \in N_p} P_1 \, T\big[|D_{F,p} - D_{F,q}| = 1\big] + \sum_{q \in N_p} P_2 \, T\big[|D_{F,p} - D_{F,q}| > 1\big] \right) \quad (2)
where DF is the disparity map and E(DF) is the energy function of DF; p and q are pixels in the feature maps F1 and F2, respectively; Np denotes the neighborhood of p; and C(p, DF,p) is the matching cost of pixel p with disparity DF,p. P1 and P2 are the penalty factors applied to the neighbors of p whose disparity differs from that of p by exactly 1 and by more than 1, respectively. T[·] returns 1 if its argument is true and 0 otherwise.

DF is then used as the input of the second part of DDN, which extracts the deep features of the difference information with a network consisting of 5 convolution layers, 3 pooling layers, 2 normalization layers and 3 fully connected layers. A feature vector with 1000 dimensions is obtained after the 3 fully connected layers, as described in formula (3):
(conv + ReLU + pooling) + (Norm + conv + ReLU + pooling) + (Norm + (conv + ReLU) × 3 + pooling) + (fc + ReLU) × 2 + fc   (3)

In summary, BCNN extracts deep features from a pair of images, the difference information between the resulting feature maps is computed by SGM, and the final differentiated features are extracted from this difference map by a convolutional neural network. DDN can therefore efficiently extract the differentiated features between the two sides of the face, or of the involved facial regions, for evaluating facial paralysis.

3.4. Evaluating facial paralysis by DP-LSTM-DDN

Although DDN effectively extracts the differentiated features between the left and right sides of the face, these features represent only the static difference. For evaluating the severity of facial paralysis, we must also extract and analyze the temporal features of the changes in facial morphology and texture. Therefore, LSTM is used to learn the dynamic features of facial movements and is combined with DDN to extract sequential differentiated features from the videos. Furthermore, we focus not only on the movement differences over the whole face but also on the involved facial regions corresponding to the different diagnostic actions. We therefore design DP-LSTM-DDN, which combines the dual-path LSTM with DDN to extract the temporal features of the movement differences between the two symmetric sides of the whole face and of the involved facial regions.

As shown in Fig. 2, in one path of DP-LSTM, a sequence of frames X = (x1, x2, ..., xi, ..., xT) recording the facial movements is used as the input, where xi is the ith frame. The face is detected in each frame by Faster RCNN and then separated symmetrically, and the two symmetric sides of the face are input into DDN to capture the differentiated features. This yields a sequence of differentiated features F = (f1, f2, ..., fi, ..., fT), where fi is the feature vector of the ith frame xi. F is then input into the LSTM to learn the global temporal features Fg. The same operations are applied to the involved facial regions corresponding to the specific diagnostic action, yielding the local temporal features Fl. Finally, Fg and Fl are concatenated into a new feature vector for the final evaluation of facial paralysis.
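A minimal sketch of the dual-path fusion step is given below, assuming the per-frame DDN feature sequences have already been computed. The hidden size and the use of the last LSTM hidden state as Fg and Fl are our assumptions; the paper only specifies the dual-path structure, the 1000-dimensional DDN features, the concatenation and the four severity levels.

```python
import torch
import torch.nn as nn

class DPLSTMHead(nn.Module):
    """Dual-path fusion: two LSTMs consume the DDN feature sequences of the
    whole face and of the involved region, and the two temporal features are
    concatenated for 4-level severity grading."""
    def __init__(self, feat_dim=1000, hidden=256, num_classes=4):
        super().__init__()
        self.global_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.local_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, global_feats, local_feats):
        # global_feats, local_feats: (batch, T=30, feat_dim) DDN feature sequences
        _, (hg, _) = self.global_lstm(global_feats)   # Fg: last hidden state
        _, (hl, _) = self.local_lstm(local_feats)     # Fl: last hidden state
        fused = torch.cat([hg[-1], hl[-1]], dim=1)    # concatenate Fg and Fl
        return self.classifier(fused)                 # logits over the 4 severity levels

# usage sketch:
# scores = DPLSTMHead()(torch.randn(2, 30, 1000), torch.randn(2, 30, 1000))
```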
4. The dataset and experimental setup

To support the automatic evaluation of facial paralysis, we collected videos of the diagnostic facial actions performed by facial paralysis patients; all videos are used as experimental data with the patients' permission. Videos of 103 facial paralysis patients were obtained, and 40 normal volunteers were also involved in the video collection. Each person was asked to perform seven facial actions (raise eyebrows, close eyes, screw up the nose, plump the cheeks, open the mouth, smile and frown), and the sequence of these seven actions was repeated 3 times by each person. During video collection, one video segment records one facial action of one person. The severity of facial paralysis is divided into four levels (normal, mildly ill, moderately ill and critically ill), denoted by the labels 0, 1, 2 and 3, respectively, and three professional doctors from our cooperating hospital evaluated the severity for us. If the evaluations given by the doctors were consistent, they were used as the ground truth; otherwise, the final decision of an expert with extensive diagnostic experience was taken as the ground truth.

The 3003 video samples were divided into four groups corresponding to the four severity levels. In each group, 70% of the videos were used for model training, split 5:1 into a training set and a validation set, and the remaining 30% were used as the test data. All experimental results were obtained with 5-fold cross-validation. For DP-LSTM-DDN, the initial learning rate is 0.0001 and the number of iterations is 2000. The sequence of 30 frames selected from a video is used as one training or test sample and as the basic processing unit. Some traditional methods evaluate the severity of facial paralysis from static facial images, so for them we selected key frames from all video segments as the experimental data; these key frames capture the facial state at the greatest extent of the facial action. Accuracy, precision, recall and F1 score are used as the four metrics for evaluating the performance of the compared methods; they are calculated as follows:
Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \quad (4)

Precision = \frac{TP}{TP + FP} \quad (5)

Recall = \frac{TP}{TP + FN} \quad (6)

F1\ score = \frac{2 \times Precision \times Recall}{Precision + Recall} \quad (7)
where TP means a sample is actually positive and is predicted as positive, FN means a sample is actually positive but is predicted as negative, FP means a sample is actually negative but is predicted as positive, and TN means a sample is actually negative and is predicted as negative.

5. Experimental results and analysis

To verify the effectiveness and superiority of DP-LSTM-DDN for the evaluation of facial paralysis, several existing methods are used for comparison, and the experimental results are shown in Table 1. Four types of evaluation methods for facial paralysis are compared: traditional methods using hand-crafted features, CNN-based methods, facial expression and micro-expression recognition methods, and LSTM-based methods.
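The following sketch computes the four metrics for the 4-level grading task. Since the formulas above are stated for the binary case, the one-vs-rest macro-averaging used here is our assumption, not a detail taken from the paper.

```python
import numpy as np

def macro_metrics(y_true, y_pred, num_classes=4):
    """Accuracy, precision, recall and F1 for multi-class severity grading,
    with per-class scores macro-averaged over the four levels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    accuracy = np.mean(y_true == y_pred)
    precisions, recalls = [], []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    precision, recall = np.mean(precisions), np.mean(recalls)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1
```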
Fig. 7. The histogram of the evaluation metrics.
Table 1. The evaluation metrics of the comparison methods.

Networks           | Accuracy | Precision | Recall | F1
-------------------|----------|-----------|--------|--------
Gabor+SVM [4]      | 0.6087   | 0.6731    | 0.5446 | 0.5801
LBP+SVM [1]        | 0.6488   | 0.5673    | 0.5423 | 0.5018
GoogleNet [6]      | 0.5675   | 0.5958    | 0.6014 | 0.5936
VGG-16 [33]        | 0.6389   | 0.6580    | 0.6292 | 0.6346
Resnet34 [34]      | 0.6286   | 0.6726    | 0.6786 | 0.6419
Resnet50 [34]      | 0.6667   | 0.6862    | 0.6814 | 0.6790
Resnet101 [34]     | 0.6571   | 0.7279    | 0.6306 | 0.6544
CNN-FER [24]       | 0.3889   | 0.5786    | 0.3888 | 0.4590
MicroExpSTCNN [25] | 0.5102   | 0.5307    | 0.5102 | 0.5165
LSTM [35]          | 0.6531   | 0.6900    | 0.7041 | 0.6970
CNN-LSTM [5]       | 0.6364   | 0.6705    | 0.6417 | 0.6475
DP-LSTM-DDN        | 0.7347   | 0.7396    | 0.7242 | 0.7219
The work of Wang [1] is a classic example of the evaluation methods based on hand-crafted features. On our experimental data, the average accuracy of its evaluation based on static asymmetry features is 64.88%. The similar method based on Gabor features and SVM [4] has lower accuracy but achieves better recall and F1 score. Both methods are generally inferior to the CNN-based methods, especially in recall and F1 score. The reason is that hand-crafted features cannot represent deep facial asymmetry features, and these traditional methods only consider global facial features without local facial motion features; they also place strict requirements on the experimental images and videos.

Several CNN-based methods were also applied to our task and tested on static facial images. Compared with the traditional methods [1,4], the methods based on GoogleNet [6] and VGG-16 [33] achieve similar accuracy and precision but better recall and F1 score, while the methods based on Resnets [34] bring clear improvements in all four metrics; Resnet50 outperforms the traditional methods [1,4] with an accuracy of 66.67% and an F1 score of 67.90%. These results demonstrate the power of deep features for the evaluation of facial paralysis: compared with hand-crafted features, CNN-based methods capture more generalizable facial features. In addition, Resnets have more convolution layers and residual connections with an ensemble-like effect, which enables them to retain more effective features.

We also applied facial expression and micro-expression recognition methods to our task. As shown in Table 1, these methods can be transferred to facial paralysis evaluation, and MicroExpSTCNN [25] achieves better accuracy, recall and F1 score than CNN-FER [24]. However, they do not perform well on facial paralysis evaluation, because they mainly focus on the overall facial motion features rather than on the asymmetry of the facial movements.
In contrast, our DP-LSTM-DDN considers the asymmetry features of both global and local facial movements, and improves on MicroExpSTCNN [25] by 0.2245, 0.2089, 0.2140 and 0.2054 in accuracy, precision, recall and F1 score, respectively.

Finally, the methods based on a single LSTM stream [35], CNN-LSTM [5] and our DP-LSTM-DDN were evaluated on the videos of the patients' facial actions. The LSTM-based methods are generally superior, which we attribute to their full use of the dynamic features extracted from the facial muscle movements. In particular, DP-LSTM-DDN achieves 12.60% higher accuracy and a 14.18% larger F1 score than the Gabor+SVM method [4], as well as 6.80% higher accuracy and a 4.29% larger F1 score than Resnet50. Among the LSTM-based methods, DP-LSTM-DDN also performs best, because the sequential differentiated features extracted by DP-LSTM-DDN better reflect the changes of the differences in facial movements, and because DP-LSTM-DDN fuses global and local differentiated features rather than using only the motion features of the whole face. This use of asymmetry features is consistent with the way doctors evaluate facial paralysis. Fig. 7 illustrates the superiority of DP-LSTM-DDN more intuitively: although the margins over the other methods vary across the metrics, DP-LSTM-DDN is generally superior to all comparison methods.

6. Conclusion and discussion

In this paper, we design a new network model, which combines a dual-path LSTM with a deep differentiated network (DDN), to evaluate the severity of facial paralysis automatically. In DP-LSTM-DDN, DDN is proposed to extract the differentiated features between the two sides of the face, and the global and local facial movement features are extracted by DP-LSTM from videos of the diagnostic facial actions. The experimental results show that DP-LSTM-DDN is effective for evaluating the severity of facial paralysis and performs considerably better than the existing methods. In future work, we need to collect more experimental data, and images
and videos recorded by depth cameras may be even more helpful for our task. Furthermore, fine-grained classification may be applied to the evaluation of facial paralysis, dividing its severity into more levels so that doctors can make more accurate diagnoses.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Pengfei Xu: Conceptualization, Methodology, Writing - original draft. Fei Xie: Funding acquisition. Tongsheng Su: Data curation. Zhaoxin Wan: Data curation. Zhaoyong Zhou: Validation. Xiaoyu Xin: Data curation. Ziyu Guan: Methodology, Writing - review & editing.

Acknowledgments

This research was supported in part by the National Natural Science Foundation of China under grants Nos. 61973250, 61936006, 61672409, 61876145, 61802335, 61973249, and 61702415; the Talent Support Project of the Science Association in Shaanxi Province: 20180108; the scientific research plan for servicing the local area of the Shaanxi Province Education Department: 19JC038 and 19JC041; the Major Basic Research Project of Shaanxi Province: 2017ZDJC-31; the Shaanxi Province Science Fund for Distinguished Young Scholars: 2018JC-016; and the Fundamental Research Funds for the Central Universities: JB190301 and JB190305.

References

[1] T. Wang, S. Zhang, J. Dong, et al., Automatic evaluation of the degree of facial nerve paralysis, Multimed. Tools Appl. 75 (19) (2016) 11893–11908.
[2] S.-J. Wang, W.-J. Yan, T. Sun, et al., Sparse tensor canonical correlation analysis for micro-expression recognition, Neurocomputing 214 (2016) 218–232.
[3] G.S. Wachtman, J.F. Cohn, J.M. VanSwearingen, et al., Automated tracking of facial features in patients with facial neuromuscular dysfunction, Plast. Reconstr. Surg. 107 (5) (2001) 1124–1133.
[4] Y. Lin, Research on Automatic Quantitative Assessment of Facial Paralysis, Chinese Marine University, Qingdao, Shandong, 2008.
[5] P. Xu, L. Wang, Z. Guan, et al., Evaluating brush movements for Chinese calligraphy: a computer vision based approach, in: IJCAI, 2018, pp. 1050–1056.
[6] Z. Guo, M. Shen, L. Duan, et al., Deep assessment process: objective assessment process for unilateral peripheral facial paralysis via deep convolutional neural network, in: Proceedings of the 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017), IEEE, 2017, pp. 135–138.
[7] T. Nishida, Y.W. Chen, N. Matsushiro, et al., An image based quantitative evaluation method for facial paralysis, in: Proceedings of the 2nd International Conference on Software Engineering and Data Mining, IEEE, 2010, pp. 706–709.
[8] L. Liu, G. Cheng, J. Dong, et al., Evaluation of facial paralysis degree based on regions, in: Proceedings of the 2010 3rd International Conference on Knowledge Discovery and Data Mining, IEEE, 2010, pp. 514–517.
[9] Q. Wang, Research into Assessment of Facial Paralysis Based on Active Appearance Model, Chinese Marine University, Qingdao, Shandong, 2012.
[10] L. Modersohn, J. Denzler, Facial paresis index prediction by exploiting active appearance models for compact discriminative features, in: Proceedings of the VISIGRAPP (4: VISAPP), 2016, pp. 271–278.
[11] J. Barbosa, K. Lee, S. Lee, et al., Efficient quantitative assessment of facial paralysis using iris segmentation and active contour-based key points detection with hybrid classifier, BMC Med. Imaging 16 (1) (2016) 23.
[12] B.H. Suchy, S.R. Wolf, A. Gebhard, et al., Bildanalysesystem zur Erkennung einer Fazialisparese, HNO 49 (10) (2001) 814–817.
[13] L. Ma, Research into Assessment of Facial Paralysis Based on Key Points and Regions, Chinese Marine University, Qingdao, Shandong, 2009.
[14] T.H. Ngo, M. Seo, Y.-W. Chen, et al., Quantitative assessment of facial paralysis using local binary patterns and Gabor filters, in: Proceedings of the 5th Symposium on Information and Communication Technology, ACM, 2014, pp. 155–161.
[15] S. He, J.J. Soraghan, B.F. O'Reilly, Biomedical image sequence analysis with application to automatic quantitative assessment of facial paralysis, EURASIP J. Image Video Process. 2007 (1) (2007) 081282.
[16] S. He, J.J. Soraghan, B.F. O'Reilly, et al., Quantitative analysis of facial paralysis using local binary patterns in biomedical videos, IEEE Trans. Biomed. Eng. 56 (7) (2009) 1864–1870.
[17] S. Wang, H. Li, F. Qi, et al., Objective facial paralysis grading based on pface and eigenflow, Med. Biol. Eng. Comput. 42 (5) (2004) 598–603.
[18] B. Hontanilla, C. Auba, Automatic three-dimensional quantitative analysis for evaluation of facial movement, J. Plast. Reconstr. Aesthet. Surg. 61 (1) (2008) 18–30.
[19] X. Liu, S. Dong, M. An, et al., Quantitative assessment of facial paralysis using infrared thermal imaging, in: Proceedings of the 2015 8th International Conference on Biomedical Engineering and Informatics (BMEI), IEEE, 2015, pp. 106–110.
[20] T.H. Ngo, M. Seo, N. Matsushiro, et al., Quantitative analysis of facial paralysis based on filters of concentric modulation, in: Proceedings of the 2015 12th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), IEEE, 2015, pp. 1758–1763.
[21] X. Ben, P. Zhang, R. Yan, et al., Gait recognition and micro-expression recognition based on maximum margin projection with tensor representation, Neural Comput. Appl. 27 (8) (2016) 2629–2646.
[22] X. Zhu, X. Ben, S. Liu, et al., Coupled source domain targetized with updating tag vectors for micro-expression recognition, Multimed. Tools Appl. 77 (3) (2018) 3105–3124.
[23] X. Ben, M. Yang, P. Zhang, Survey of automatic micro expressions recognition methods, J. Comput.-Aided Design Comput. Graph. 26 (9) (2014) 1385–1395.
[24] S. Alizadeh, A. Fazel, Convolutional neural networks for facial expression recognition, arXiv:1704.06756, 2017.
[25] S.P.T. Reddy, S.T. Karri, S.R. Dubey, Spontaneous facial micro-expression recognition using 3D spatiotemporal convolutional neural networks, in: Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), 2019, pp. 1–8.
[26] H. Yoshihara, M. Seo, T.H. Ngo, et al., Automatic feature point detection using deep convolutional networks for quantitative evaluation of facial paralysis, in: Proceedings of the 9th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), IEEE, 2016, pp. 811–814.
[27] C. Szegedy, W. Liu, Y. Jia, et al., Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[28] X. Chang, Y.-L. Yu, Y. Yang, et al., Semantic pooling for complex event analysis in untrimmed videos, IEEE Trans. Pattern Anal. Mach. Intell. 39 (8) (2016) 1617–1632.
[29] X. Chang, Z. Ma, M. Lin, et al., Feature interaction augmented sparse learning for fast kinect motion detection, IEEE Trans. Image Process. 26 (8) (2017) 3911–3920.
[30] S. Ren, K.M. He, R. Girshick, et al., Faster R-CNN: towards real-time object detection with region proposal networks, in: Proceedings of the 29th Conference on Neural Information Processing Systems, 2015, pp. 91–99.
[31] H. Noh, S. Hong, B. Han, Learning deconvolution network for semantic segmentation, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1520–1528.
[32] H. Hirschmuller, Accurate and efficient stereo processing by semi-global matching and mutual information, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2005, pp. 807–814.
[33] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, CoRR abs/1409.1556, 2015.
[34] K. He, X. Zhang, S. Ren, et al., Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[35] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, et al., Beyond short snippets: deep networks for video classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2015, pp. 4694–4702.

Pengfei Xu: He is an associate professor at the School of Information Science and Technology, Northwest University, China. His main research interests include image processing and pattern recognition.
Fei Xie: He is an associate professor at the School of Computer Science, Northwestern Polytechnical University, China. His main research interests include image processing and pattern recognition.
Tongsheng Su: He is a chief physician at the Shaanxi Hospital of Traditional Chinese Medicine, China. His main research interests include acupuncture therapy for peripheral neuropathy.
Zhaoxin Wan: He is an attending physician at the Shaanxi Hospital of Traditional Chinese Medicine, China. His main research interests include acupuncture therapy for peripheral neuropathy.
Xiaoyu Xin: She is an attending doctor at the Neurological Department of Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, China. Her main research interests include cerebrovascular disease and peripheral neuropathy.
Ziyu Guan: He is a professor at the School of Information Science and Technology, Northwest University, China. His main research interests include data mining and machine learning.
Zhaoyong Zhou: He is a senior engineer at the Network & Education Technology Center, Northwest A&F University, China. His main research interests include pattern recognition and agricultural information.