A GPSO-optimized Convolutional Neural Networks for EEG-based Emotion Recognition

Zhongke Gao, Yanli Li, Yuxuan Yang, Xinmin Wang, Na Dong*, and Hsiao-Dong Chiang, Fellow, IEEE

Neurocomputing (2019). https://doi.org/10.1016/j.neucom.2019.10.096

This work is supported by the National Natural Science Foundation of China under Grant Nos. 61873181, 61922062 and 61773282. Z. Gao, Y. Li, Y. Yang and X. Wang are with the School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China (e-mail: [email protected]). N. Dong* is the corresponding author of this paper. She is with the School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China (e-mail: [email protected]). H. D. Chiang is with the School of Electrical and Computer Engineering, Cornell University, Ithaca, NY 14853 USA.

Abstract—An urgent problem in the field of deep learning is the optimization of model construction, which frequently hinders performance and often needs to be handled by experts. Optimizing the hyper-parameters remains a substantial obstacle in designing deep learning models, such as CNNs, in practice. In this paper, we propose an automatic optimization framework that uses a binary coding system and a GPSO algorithm with gradient penalties to select the network structure. Such swarm intelligence optimization approaches have been used but not extensively exploited, and existing work focuses on models with a fixed network depth. We design an experiment to arouse three types of emotion states in each subject and simultaneously collect the EEG signals corresponding to each emotion category. The GPSO-based method efficiently explores the solution space, allowing the CNNs to obtain competitive classification performance on the dataset. The results indicate that the GPSO-optimized CNN model achieves a prominent classification accuracy and that the proposed method provides an effective automatic optimization framework for CNNs with uncertain network depth in the emotion recognition task.

Index Terms—Convolutional Neural Networks (CNNs); hyper-parameter optimization; electroencephalography (EEG) analysis; particle swarm optimization (PSO)

I. INTRODUCTION

In our daily life, human behavior is affected by one's emotions to some extent. As a distinctive feature of intelligent organisms, emotion has a relatively great impact on the efficiency of social information exchange as well as on human work efficiency. With the rapid development of interactive technologies, it is necessary to integrate emotion recognition into human-computer interaction systems to improve interaction ability and further realize the comprehensive intelligence of computers. Judging the emotions of others relies on long-term experience and individual thinking, which is a complex matter for humans, let alone machines. This greatly challenges researchers in the field of affective computing [1], which focuses on endowing machines with the ability to read and respond to human emotions. In general, the performance of human emotions can be inferred from various physiological representations of human

beings, including facial expressions, muscle responses, heart rates, and other direct or indirect emotional expressions. Therefore, emotion recognition can in principle draw on a variety of methods. Many researchers have used different cues to explore human emotions, including facial expression recognition [2-4], vocal analysis [5, 6] and other physiological signals such as electrocardiogram (EKG), electromyogram (EMG), skin resistance and electroencephalogram (EEG) signals [7, 8]. Compared with other cues, EEG signals are a direct reflection and expression of emotions and carry relatively direct and objective information [9]. Their relevance to human emotions has also been proven by many related physiological studies [10, 11].

EEG, typically noninvasive, is an electrophysiological monitoring method that records the electrical activity of the brain by measuring voltage fluctuations resulting from ionic currents within the neurons of the brain [12]. EEG is most often used to diagnose epilepsy, as well as other diseases or brain activities that cause abnormalities in EEG readings. Despite limited spatial resolution, EEG is one of the few mobile techniques available and offers millisecond-range temporal resolution. The derivatives of the EEG technique include evoked potentials (EPs) and event-related potentials (ERPs). With the development of ERPs, EEG is gradually being used more in cognitive science, cognitive psychology, and psychophysiological research. Affective computing [13, 14] is one of the common applications of EEG.

Although EEG signals have many advantages, they are still inevitably affected by noise interference due to their low signal-to-noise ratio (SNR). In addition, the high complexity, nonlinearity and non-stationarity of EEG signals, and even the correlation between different channels, have also brought numerous difficulties to the research. Some methods, including classical power spectral density (PSD) analysis, have been gradually developed in recent years [15-17], but most of them have limitations to a greater or lesser extent, such as neglecting the connectivity between two distant brain regions [18, 19]. In order to solve these problems, researchers are working in many different directions, for instance, better channel information enhancement methods [20] or better filtering and classification methods [21]. Recently, with the rise of deep learning, many recognition problems have been well solved with deep learning techniques [22], as have EEG analysis problems [23, 24]. Therefore, the idea of using deep learning as a feature extraction method for EEG-based emotion recognition is initiated. In addition, deep learning techniques can also be used as classifiers to realize


emotion recognition. An effective and reliable emotion recognition model places high demands on the classifier, but the generalization capabilities of conventional support vector machine (SVM) or K-nearest neighbors (KNN) methods are limited, because the features of EEG signals are more complex to analyze and classify than those of many other signals. Deep learning techniques can not only analyze the intrinsic information underlying the signal series, but also cope with the nonlinearity and complexity, and reduce the difficulty of implementation. Due to these great advantages, deep learning techniques have contributed greatly to various recognition research areas, including object detection [25], medical research [26, 27], face recognition [28], fatigue driving [29, 30, 31] and short-term load forecasting [32]. Among the various deep learning techniques, widespread attention has been paid to convolutional neural networks (CNNs) [33] and the long short-term memory (LSTM) [34, 35] model. The powerful classification and feature extraction capabilities of CNNs have been proven in facial recognition [28], speech recognition [36], and medical imaging [25]. Furthermore, deep learning techniques have also shown satisfactory results and huge potential in the analysis of physiological signals, such as cardiologist-level arrhythmia detection [37], P300 brain-computer interface analysis [38], and steady state visual evoked potential (SSVEP) analysis [39, 40].

Although deep learning techniques have achieved many successes, designing a suitable deep learning model requires extremely specialized knowledge and is, to a considerable extent, a time- and labor-consuming process. Establishing a suitable and efficient deep learning model is so demanding that it runs counter to the original intention of using these techniques and to the requirements of an intelligent society. Therefore, it is extremely necessary to find an appropriate automatic method that can replace complicated manual experiments to obtain the best or near-optimal deep learning model.

So far, a great deal of work has been done to improve the performance of deep neural networks [41-45]. One meaningful research direction is hyper-parameter optimization. Taking CNNs as an example, many hyper-parameters can be deeply investigated, such as the number of convolutional layers, the number and size of the feature maps in convolution, and the layer distribution of the network. Moreover, many hyper-parameters can be further tuned to improve performance, including the number of iterations, the learning rate, the attenuation function, regularization-related methods, and weight initialization methods. The optimization of the above parameters has great influence on the performance of deep learning networks, which makes hyper-parameter optimization in deep neural networks significant and in high demand for further exploration. So far, hyper-parameter optimization mainly relies on experiential knowledge, which is quite time-consuming and inefficient. First, experiential knowledge generally depends on random search, which is not conducive to promotion and rapid mastery, and the

approach of random search would greatly affect the efficiency of parameter tuning. Second, random search can only change the structure and parameters of one network at a time, and trial runs and verification are based on comparing a single network with the current optimal network, so the utilization of information is very low. Moreover, even when redundant computing resources are available, such a serial process cannot exploit them, which greatly limits the efficiency of parameter tuning.

Essentially, the above challenges in deep learning networks can be regarded as an optimization problem: seeking the optimal network. Therefore, we can treat the parameters of multiple deep learning networks as variables in the optimization problem. Then, evaluation indexes such as the accuracy of the resulting network can be used as the objective function, while the EEG data used for testing are kept fixed. With the data fixed, the problem of tuning the parameters of a deep learning network is transformed into the mathematical problem of seeking an optimal value. In order to make the best use of redundant computing resources and improve optimization efficiency, the ability of swarm intelligence optimization algorithms to perform parallel operations and exchange information has attracted our attention. Swarm intelligence algorithms mainly include the particle swarm optimization (PSO) algorithm [46], the ant colony optimization (ACO) algorithm [47] and their improved versions. The PSO algorithm is simpler than the ACO algorithm in setting its own parameters, which is consistent with our original intention of simplifying the parameter optimization of the deep learning model.

Some studies have tried to use PSO methods in DNNs. Lorenzo et al. [48] apply PSO to optimize DNNs in the image processing field. Sinha et al. [49] apply basic PSO to the hyper-parameter optimization of CNNs. The PSO algorithm encounters some obstacles in the hyper-parameter optimization of deep learning models, namely low convergence efficiency when there are many hyper-parameters to be optimized. Ye [50] employs a GPSO method for its better local fast-search capabilities in the optimization of DNN hyper-parameters. However, the basic GPSO algorithm lacks a priority concept for parameters that represent different meanings, which makes it impossible to flexibly adjust the optimization propensity of the algorithm. Furthermore, the existing works are based on deep learning models with a fixed number of layers. Recently, some studies have started from genetic algorithms [51] and reinforcement learning methods [52, 53] to optimize structures with a variable number of layers, but these approaches presuppose some possible baseline network structures, so the entire network structure is not generated from scratch.

To cope with these problems, we propose the following methods. First, we propose a method based on a binary coding system that translates the parameter tuning problem of CNNs, including the number and distribution of CNN functional layers, into a mathematical optimization problem. Second, we develop a gradient-priority particle swarm optimization (GPSO) algorithm with priority concepts, based


on the classical PSO algorithm and gradient descent method. Compared with manual adjustment, it has certain advantages in training efficiency and model performance, especially in the case of abundant computing resources. Compared with other traditional emotion recognition methods, the deep learning model adjusted by this method has achieved better results in analyzing and classifying the emotion data obtained from experiments. The layout of the article is organized as follows: first, we describe in detail the deep learning model for emotion data and the algorithmic ideas of the optimization framework used; second, we describe the experimental process; finally, we give the results and discussions, and draw conclusions.


II. METHODOLOGY

We develop a novel EEG-based CNN for extracting the features from EEG signals corresponding to different emotions. In order to realize automatic dynamic optimization of hyper-parameter tuning in the CNN model, we first develop a novel swarm intelligence optimization method named the GPSO algorithm. The method consists of two main parts: the way of transforming structural problems into mathematical problems, and the principle of the optimization method.

Fig. 1. The schematic diagram of the binary system method.

A. Fundamental Model Architecture

The model takes each sample as a matrix X ∈ R^{E×T}, comprising E recorded electrodes and T sampling points of 1 s of EEG signals. In this fundamental model architecture, the node numbers of the two fully-connected layers are fixed as 64 and 3 to simplify the problem, while the kind, number and distribution of convolutional layers and pooling layers are undetermined and to be optimized. In addition, rectified linear activation [54] with a batch normalization (BN) layer [41] is applied after each possible convolutional layer, referred to as the activation block [42]. The cross-entropy objective function is employed in the softmax layer to generate a distribution over the categories. Furthermore, the weights of the convolutional layers are initialized with the Glorot normal initializer. The stochastic gradient descent (SGD) [44] optimizer with a learning rate of 0.001 is employed to optimize the model.

We define a unit in the CNN as x_{l,r,(p,q)} at position (p, q) of feature map r in layer l, and similarly define σ_{l,r,(p,q)} as the scalar product between a set of input neurons and the corresponding weights. A rectified linear function f describes the relationship between x_{l,r,(p,q)} and σ_{l,r,(p,q)}:

x_{l,r,(p,q)} = f(σ_{l,r,(p,q)})   (1)

In this way, the information transmission process can be described as follows.

a) From the original data to convolutional layer 1:

σ_{1,r,(p,q)} = ω_{1,r,0} + \sum_{j=1}^{N_{ch}} I_{p,q+j-1} ω_{1,r,j}   (2)

where ω_{1,r,0} is a bias and ω_{1,r,t} denotes the set of weights with 1 ≤ t ≤ N_{ch}; N_{ch} is the kernel size in layer 1.

b) From one convolutional layer L_k to the next convolutional layer L_{k+1}:

σ_{k+1,r,(p,q)} = ω_{k+1,r,0} + \sum_{s=1}^{N_1} \sum_{g=1}^{N_f} x_{k,s,(p+g-1,q)} ω_{k+1,(s,r)}   (3)

σ_{k+1,r,(p,q)} = ω_{k+1,r,0} + \sum_{s=1}^{N_1} \sum_{g=1}^{N_f} x_{k,s,(p,q+g-1)} ω_{k+1,(s,r)}   (4)

where N_1 is the number of filters in L_k, and N_f is the kernel size of L_{k+1}.

c) From a convolutional layer L_j to the next max-pooling or average-pooling layer L_{j+1} of size (1, n):

ω^{max}_{p,q} = \max\{x(p, q), \ldots, x(p, q + n)\}   (5)

ω^{average}_{p,q} = \frac{1}{n} \sum\{x(p, q), \ldots, x(p, q + n)\}   (6)

d) From the last layer to the fully-connected layer:

σ^{f}_{u} = ω_{0,u} + \sum_{h=1}^{N_l} x_{l,h} ω^{f}_{u,h}   (7)

where N_l = 64 refers to the number of neurons in the last layer and 1 ≤ h ≤ N_l. Classical back propagation is employed as the learning algorithm to tune the thresholds and weights of the network.
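To make this fundamental architecture concrete, the following is a minimal Keras-style sketch of one candidate model assembled from a list of layer codes (1 for convolution, 0 for pooling, as detailed in Section II-B below). The builder function name, the fixed pooling size and the padding behaviour are illustrative assumptions rather than the authors' exact implementation.

```python
import tensorflow as tf

def build_candidate_cnn(layer_codes, conv_filters, n_channels=30, n_samples=200):
    """Assemble a candidate CNN from a binary layer layout (1 = conv, 0 = pool).

    layer_codes  : e.g. [1, 1, 1, 1, 0, 1, 1, 1, 0], decoded from an integer such as 494
    conv_filters : number of kernels for each convolutional slot, e.g. [16, 16, 32, 32, 64, 64, 64]
    """
    inputs = tf.keras.Input(shape=(n_channels, n_samples, 1))
    x, conv_idx = inputs, 0
    for code in layer_codes:
        if code == 1:  # convolution followed by the BN + ReLU "activation block"
            x = tf.keras.layers.Conv2D(conv_filters[conv_idx], kernel_size=(1, 3),
                                       kernel_initializer='glorot_normal')(x)
            x = tf.keras.layers.BatchNormalization()(x)
            x = tf.keras.layers.Activation('relu')(x)
            conv_idx += 1
        elif code == 0:  # pooling layer of size (1, n); max-pooling of size (1, 2) assumed here
            x = tf.keras.layers.MaxPooling2D(pool_size=(1, 2))(x)
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(64)(x)                              # first fully-connected layer (64 nodes)
    outputs = tf.keras.layers.Dense(3, activation='softmax')(x)   # 3 emotion classes
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
                  loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model
```

For example, build_candidate_cnn([1, 1, 1, 1, 0, 1, 1, 1, 0], [16, 16, 32, 32, 64, 64, 64]) mirrors the layer layout of the integer 494 and the filter counts discussed later in the paper.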

B. Optimization Framework

Before we introduce the GPSO algorithm in detail, we first address how the algorithm is applied to hyper-parameter adjustment. As the distribution and number of network layers influence the number of hyper-parameters, it is of great importance to choose a suitable


distribution pattern and the right number of network layers. However, such a complex issue requires a large number of tests. Therefore, in this work, we propose to convert the issue into a problem that can be solved by mathematical algorithms. Specifically, we input a numerical parameter and convert it into a binary combination of 1s and 0s. We then treat the convolutional and pooling layers as binary 1 and 0, respectively. In this way, we can obtain the corresponding network layer distribution pattern and the exact number of the two specific layer types. The schematic diagram and an example are shown in Fig. 1, where Conv is the convolutional layer and Pool is the pooling layer. As shown in Fig. 1, we input the numerical parameter 494, and a nine-bit binary number (111101110) is obtained, so the formed CNN model is composed of seven convolutional layers and two pooling layers. The arrangement of the layers follows the order of the 1s and 0s in the binary number 111101110.

In addition, two strategies can be considered to simultaneously adjust the number of layers and the parameters of each layer. One strategy is to fix the number of layers and tune the hyper-parameters to obtain the optimal structure for the current number of layers, then gradually increase the number of layers to obtain a series of optimal structures, and finally choose the overall best structure among them. This strategy is simple but inefficient, repeating a lot of useless work. The other strategy is to limit the total number of layers within a certain range and set a flag for the unused portion (the blank layers). Restrictions and judgments are made on the unused parts so that particles are treated as individuals that have lost some dimensions of the whole space. In this way, the optimization algorithm can be applied to optimize deep learning models with an unfixed number of layers.

Fig. 2 gives an example of the parameter settings in the position array. As shown in Fig. 2, the situation can be simplified into three categories, described as follows. (1) A convolutional layer has 4 parameters (1, a, b, c), where 1 represents a convolutional layer, a is the number of kernels, and (b, c) is the size of each kernel; in this paper, we limit the size of the convolution kernels to simplify the problem. (2) A pooling layer also has 4 parameters (0, d, e, f), where 0 represents a pooling layer, d selects the pooling function (max or average), and (e, f) is the size of the pooling window. (3) A blank layer has 4 parameters (-1, -1, -1, -1), indicating that the layer is not used in the model. The binary number formed by the first parameter of each layer (such as 111101110) is converted into a decimal number (494), which is used as the characteristic value of the network structure to distinguish different structures and to avoid duplicate encodings such as (-, 1, 1, 1, 1, 0, 1, 1, 1, 0) and (1, 1, 1, 1, 0, -, 1, 1, 1, 0), where - indicates an unused layer. In practical applications, in order to simplify calculations and search more efficiently, we move all unused layers to the end of the vector before input. Thus, we have formed a framework for translating hyper-parameters into optimization goals.
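As a concrete illustration of this encoding, the following minimal Python sketch converts the characteristic integer into its binary layer layout and a position array of the form described above; the helper name, the default kernel size and the default pooling size are hypothetical placeholders.

```python
def decode_structure(code, n_bits, conv_filters, kernel=(1, 3), pool=(1, 2), max_layers=10):
    """Turn a characteristic integer (e.g. 494) into a layer layout and position array.

    1 -> convolutional layer (1, a, b, c), 0 -> pooling layer (0, d, e, f),
    unused slots are padded at the end as (-1, -1, -1, -1).
    """
    bits = [int(ch) for ch in format(code, '0{}b'.format(n_bits))]  # 494 -> [1,1,1,1,0,1,1,1,0]
    position, conv_idx = [], 0
    for bit in bits:
        if bit == 1:   # convolution: (1, number of kernels, kernel height, kernel width)
            position.append((1, conv_filters[conv_idx], kernel[0], kernel[1]))
            conv_idx += 1
        else:          # pooling: (0, pooling flag 1 = max / 2 = average, height, width)
            position.append((0, 1, pool[0], pool[1]))
    # pad unused layers at the end of the vector, as described in Section II-B
    position += [(-1, -1, -1, -1)] * (max_layers - len(position))
    return bits, position

bits, pos = decode_structure(494, 9, [16, 16, 32, 32, 64, 64, 64])
# bits == [1, 1, 1, 1, 0, 1, 1, 1, 0]: seven convolutional layers and two pooling layers
```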

C. The gradient-priority particle swarm optimization (GPSO) algorithm

With the above method, we can transform the hyper-parameter optimization problem of the CNN into a mathematical optimization problem. We then develop the optimization algorithm named gradient-priority particle swarm optimization (GPSO). The main purpose of the GPSO algorithm is to handle the problems that arise when PSO is applied to CNNs, namely the low convergence efficiency of PSO when many hyper-parameters have to be optimized, and the lack of priority for parameters that represent different meanings, which makes it impossible to flexibly adjust the optimization propensity of the algorithm. Fig. 3 is the flow diagram of the GPSO algorithm. The detailed steps of the algorithm are as follows.

(a) Initialize the population (including individual positions, velocities, and inertia counter values):

X_{n,i,j} = random(0, 1),  V_{n,i,j} = random(0, 1),  I_i = I_0,  with n ∈ (0, n_m), i ∈ (1, i_m), j ∈ (1, j_m)   (8)

where n refers to the iteration number, i refers to the particle index, and j refers to the j-th positional parameter of the coordinates of an individual particle. I_i is the inertia counter of particle i and I_0 is its initial value. n_m is the maximum number of iterations, i_m is the number of particles, and j_m is the number of parameters.

(b) Determine whether there are overlapping particle entities; if so, re-initialize the particle with the higher index until there is no overlap. Skip otherwise.

(c) According to the correspondence defined above, the parameter values corresponding to the position X_{n,i} are parsed and substituted into the network for training to obtain the corresponding objective function J_{n,i}. If accuracy is the target, the objective function is the accuracy.

(d) Record the individual best position P^{b}_{i}, its objective function J^{pb}_{i}, the global optimum position G^{b} and its objective function J^{gb}:

J^{pb}_{i,n} = \max\{J_{i,n}, J^{pb}_{i,n-1}\}   (9)

which means that J^{pb}_{i,n} comes from the comparison between the current value and the previous one, and P^{b}_{i} is the position corresponding to J^{pb}_{i}.

J^{gb} = \max\{J^{pb}_{1,n}, J^{pb}_{2,n}, \ldots, J^{pb}_{i_m,n}\}   (10)

(e) Calculate the gradients and find the maximum gradient direction for each individual:

J^{G}_{n,i} = \max\left\{\frac{J_{n,i} - J_{m,k}}{X_{n,i} - X_{m,k}}\right\},  m ∈ (1, n), k ∈ (0, i_m)   (11)

X^{G}_{n,i} is the position X_{m,k} corresponding to J^{G}_{n,i}.

(f) Update the particle's speed:

V_{n+1,i} = (w × V_{n,i}) + a × (X^{G}_{n,i} - X_{n,i}) + b × (P^{b} - X_{n,i})   (12)


Fig. 2 encodes each layer with four parameters: (1, a, b, c) for a convolutional layer with a kernels of size (b, c); (0, d, e, f) for a pooling layer, where d = 1 denotes max-pooling, d = 2 denotes average-pooling, and (e, f) is the pooling size; and (-1, -1, -1, -1) for an unused layer. The example 494 (111101110) is transformed into seven convolutional layers and two pooling layers.

Fig. 2. The schematic diagram of the parameter settings in the position array.

where w is the inertia coefficient of the speed, used to keep the movement from falling into a local optimum, and a and b are velocity correction coefficients used to adjust the direction of particle motion. In practical applications, since the parameters at different positions have different physical meanings, a and b are two coefficient vectors.

(g) Determine whether the inertia counter has reached zero; if so, let

V_{n+1,i} = random(0, 1)   (13)

to realize a random reset of the speed.

(h) Update the particle's position:

X_{n+1,i} = X_{n,i} + V_{n+1,i}   (14)

(i) Determine whether the iteration is completed. If so, output G^{b} and J^{gb}; otherwise return to step (b).

Fig. 4 is a schematic diagram of the algorithm. Fig. 4(a) shows the initialization of the population: each individual is endowed with an initial position, velocity, and inertia counter value. Fig. 4(b) shows the process of updating the particle velocity after calculating the maximum gradient direction. The initial velocity of particle A points toward the lower left corner. After the calculation, we find that the maximum gradient direction of A is toward particle G, so the direction of the updated particle motion is modified to be toward G. If the maximum gradient property of G with respect to A is always maintained, A eventually approaches G along the trajectory marked by the red line in Fig. 4(b). For particle B in the PSO algorithm, B would update its velocity toward the global optimum G and ignore the nearby particles E and F. In GPSO, it can be calculated that E is the maximum gradient direction for particle B, so B eventually updates its own speed based on E's location; if E does not move, B moves along the yellow line. Fig. 4(c) shows the particles' motion state after a certain number of iterations. Fig. 4(d) shows two special states that appear in the algorithm. One is that particles E and F overlap in the course of motion; at this time, the algorithm disperses F, which has the larger index, to another random position in space. The other is that when the inertia counter of particle D returns to zero, the speed of D is reset randomly.
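To summarize steps (a)-(i), the following is a minimal, self-contained sketch of the GPSO loop under stated assumptions: the fitness call (decoding a position into a CNN, training it, and returning its accuracy) is abstracted as a callable, the gradient ratio of Eq. (11) is reduced to a scalar with a simple mean, and the overlap check of step (b) and the gradient-penalty weighting of Section II-D are omitted for brevity. Coefficient values are placeholders, not the authors' settings.

```python
import numpy as np

def gpso(fitness, dim, n_particles=20, n_iter=60, w=0.5, a=0.3, b=0.3, inertia_reset=5, seed=0):
    """Sketch of the gradient-priority PSO loop; `fitness` maps a position in [0, 1]^dim
    to an objective value such as validation accuracy of the decoded CNN."""
    rng = np.random.default_rng(seed)
    X = rng.random((n_particles, dim))               # (a) positions
    V = rng.random((n_particles, dim))               # (a) velocities
    counters = np.full(n_particles, inertia_reset)   # (a) inertia counters
    J = np.array([fitness(x) for x in X])            # (c) evaluate
    P_best, J_best = X.copy(), J.copy()              # (d) individual bests

    history = [(X.copy(), J.copy())]
    for _ in range(n_iter):
        for i in range(n_particles):
            # (e) maximum gradient direction over previously evaluated particles (Eq. 11)
            grads = [np.mean((J[i] - Jm) / (X[i] - Xm + 1e-9))
                     for Xs, Js in history for Xm, Jm in zip(Xs, Js)]
            k = int(np.argmax(grads))
            X_grad = np.concatenate([Xs for Xs, _ in history])[k]
            # (f) velocity update (Eq. 12), then (g) random reset if the counter expires
            V[i] = w * V[i] + a * (X_grad - X[i]) + b * (P_best[i] - X[i])
            counters[i] -= 1
            if counters[i] == 0:
                V[i], counters[i] = rng.random(dim), inertia_reset   # (g)
            X[i] = np.clip(X[i] + V[i], 0.0, 1.0)                    # (h) position update
            J[i] = fitness(X[i])                                     # (c) re-evaluate
            if J[i] > J_best[i]:                                     # (d) individual best (Eq. 9)
                P_best[i], J_best[i] = X[i].copy(), J[i]
        history.append((X.copy(), J.copy()))
    g_idx = int(np.argmax(J_best))                                   # (d) global best (Eq. 10)
    return P_best[g_idx], J_best[g_idx]                              # (i) output G^b and J^gb
```

In the paper, each fitness evaluation corresponds to training one candidate CNN decoded from the particle's position array, which is what makes this parallel, population-based search attractive when spare computing resources are available.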

D. The gradient-priority particle swarm optimization (GPSO) algorithm

Compared with the PSO algorithm, the GPSO algorithm focuses more on high local search efficiency [50]. Furthermore, GPSO with priority concepts can achieve selective optimization by setting gradient penalties. Gradient penalties reduce the proportion of particles moving toward each other across different dimensions by increasing the weight of the distance between different layers. Through gradient penalties, particles first find their optimal position within their own dimensions as far as possible, and move towards particles of other dimensions only when the objective function within their own dimensions shows overall poorer performance.

In the experiment, we use a GPSO method with 20 particles to tune 40 possible hyper-parameters, which are the structural hyper-parameters of the CNN. The learning rate and dropout rate are not included, mainly because we use the SGD with momentum optimizer to automatically tune the learning rate during training, and the dropout rate mainly influences overfitting, which is avoided by an early-stopping strategy. Through the GPSO algorithm, we obtain an optimized model architecture as shown in Fig. 5. The network architecture is mainly composed of 7 convolutional layers, completed with 2 fully-connected layers and one softmax layer. In the convolutional layers, the kernel size is set as (1, 3) with a default stride of 1, and the numbers of filters are (16, 16, 32, 32, 64, 64, 64).


Fig. 3. The algorithm flow diagram of the GPSO algorithm.

Fig. 4. The schematic diagram of the algorithm. (a) The initialization population with 8 individuals. (b) The process of updating the particle velocity after calculating the maximum gradient direction. (c) The particle's motion state after a certain amount of iterations. (d) Two special states that appear in the algorithm: E and F overlap, and the inertia counter of particle D returns to zero.

TABLE I
THE SOURCES OF THE SELECTED MOVIE CLIPS AND THEIR CORRESPONDING EMOTIONAL LABELS

Number | Film clip sources      | Emotional label | Duration
1      | Miracle in Cell No. 7  | Sadness         | 9 min 55 sec
2      | Dearest                | Sadness         | 4 min 30 sec
3      | A hero or not          | Happiness       | 5 min 30 sec
4      | Mr. Bean               | Happiness       | 7 min 20 sec
5      | Mr. Bean               | Happiness       | 5 min 52 sec
6      | Dead silence           | Fear            | 8 min 24 sec
7      | The Conjuring 2        | Fear            | 4 min
8      | Lights out             | Fear            | 3 min 25 sec

III. EXPERIMENTS

At the beginning of the experimental design, we choose film clips that contain relatively direct and abundant emotional expressions to evoke the targeted emotions. The EEG acquisition experiments are conducted in the Laboratory of Complex Networks and Intelligent Systems at Tianjin University. The experimental process has been approved by the ethics committee of the general hospital affiliated with Tianjin Medical University in China.

A. Film Clips

There are three types of film clips: happiness, sadness and fear. The source, emotion type, and duration of the clips are shown in Table I. The video clips are selected by non-subjects from a large number of well-known movies. The real

experimental scene is shown in Fig. 6.

B. Participants

Fifteen right-handed and healthy students (10 males and 5 females) aged from 20 to 24 (mean: 22.14, std: 1.13) participate in the experiment. None of the subjects has vision or hearing impairments. Before the experiment, they are asked to keep calm and avoid fatigue as well as other possible interference as much as possible, to guarantee that their emotions can be aroused successfully and correctly. In addition, each subject is provided with a detailed description of the experiment, and some of the matters needing attention during the experiment are described, including focusing on the movie itself and avoiding body and facial movements as much as possible while watching the movie clips.


Fig. 5. The optimized CNN architecture: an input of size 30 × 200; seven convolutional layers with kernel size (1, 3) producing feature maps 16@30×198, 16@30×196, 32@30×194, 32@30×192, 64@30×190, 64@30×188 and 64@30×186; a fully-connected layer of 64 × 3; and a softmax output over happiness, sadness and fear.

C. Experiment Protocol

During the experiment, the subjects follow the guide words appearing on the screen. The entire experiment includes eight movie clips, namely eight segments. Before the beginning of each segment, there is a duration of 5 seconds to give a prompt for the start of the movie screening and to remind the subject to start concentrating on the screen. Then, the program randomly selects one of the eight movie clips to be screened. After the screening is completed, a questionnaire is presented to help the subject evaluate his/her emotions and grade the intensity of the emotions on a 10-point scale. After completing the questionnaire, a certain amount of rest time is provided to relax and avoid mixed emotions.

Fig. 6. The real experimental scene and the experiment procedure.

D. Device and Environment

The EEG recording device is the ESI NeuroScan System with 40 electrodes, arranged according to the international 10-20 electrode placement system, and the sampling rate is 1000 Hz. Among these 40 electrodes, A1 and A2 are the reference electrodes, and four electrodes (placed horizontally and vertically around the eyes) are responsible for recording the electrooculogram (EOG) signals.

E. Data Pre-processing

After completing all signal acquisition processes, the EEG and EOG signals of each subject are pre-processed with the EEGLAB toolbox. To remove noise and blink artifacts, a band-pass filter between 1 and 50 Hz and independent component analysis (ICA) are applied to the signals. In addition, the reference electrode channels and spare channel signals are removed during this preprocessing phase. Finally, we obtain pre-processed signals with 30 channels for each subject. Then, the signals sampled at 1000 Hz are divided into 1-second epochs without overlapping. From these, we obtain 855, 1110, and 945 pure epochs for sadness, happiness and fear, respectively. At the same time, in order to reduce the computational burden and verify the effectiveness of the algorithm faster, we down-sample the signals from 1000 Hz to 200 Hz.

IV. RESULTS AND DISCUSSION

A. Overall Performance

In order to examine the generalization ability of the model and avoid overfitting, the entire data set is divided into 10 sub-data sets and a within-subject 10-fold cross-validation is applied for model evaluation. Besides, we also apply within-subject cross-validation with the training set and test set from different clips. The clips-based split takes one clip of a certain emotion type as the training data and takes another clip, with a large time interval from the former one, as the test data. There is no continuity or overlap in time between the test data and the training data. The results are shown in the method comparison part. In this way, the average accuracy is used as the criterion for calculating the accuracy of each subject. Thus, an average accuracy under 10-fold cross-validation of 92.44% is achieved with a standard deviation of 3.60% across the fifteen subjects. The performance of our model on the EEG signals of each subject is shown in Fig. 7.

Fig. 7. The accuracy and standard deviation of the fifteen subjects.

It can be noticed that the result is different for each subject, which may result from many unavoidable personal factors, including the individual's educational background, the sensitivity to the film's emotional expressions and the emotional resonance of past experience. The difference may also come from uncontrollable equipment factors, including minor errors when the EEG data acquisition device is set up. Despite individual differences, our CNN model still shows a great effect and potential on the emotion recognition task. Although the average recognition accuracy of the fifteen subjects reaches a satisfactory 92.44%, the GPSO algorithm can in fact yield even more competitive performance: in the process of parameter tuning of the CNN model, there is a model that achieves higher accuracy for each of the fifteen subjects. This phenomenon may be due to the model's adaptability to individual differences. Finally, the model with the highest average accuracy over all subjects is selected as our final model.
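As an illustration of the within-subject 10-fold evaluation described above, here is a minimal sketch built on scikit-learn's StratifiedKFold; the build_model constructor (for example, the candidate-CNN builder sketched in Section II-A) and the epoch count are hypothetical placeholders, not the authors' protocol.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def evaluate_subject(X, y, build_model, n_splits=10, epochs=50):
    """Within-subject 10-fold cross-validation; X has shape (n_epochs, 30, 200, 1).
    Assumes build_model() returns a compiled Keras-style model reporting accuracy."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    accuracies = []
    for train_idx, test_idx in skf.split(X, y):
        model = build_model()                       # fresh model per fold
        model.fit(X[train_idx], y[train_idx], epochs=epochs, verbose=0)
        _, acc = model.evaluate(X[test_idx], y[test_idx], verbose=0)
        accuracies.append(acc)
    return float(np.mean(accuracies)), float(np.std(accuracies))
```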

B. Network Tuning

The purpose of this paper is to find a method that can replace manual parameter adjustment and find suitable network parameters as quickly as possible. What distinguishes our method from other works is that it not only adjusts the parameters of a model with a fixed number of layers, such as the size and number of convolution kernels, but also adjusts the number and distribution of the layers. As mentioned above, the GPSO automatically adjusts a total of 40 possible parameters and finally arrives at an optimized CNN model with 28 effective parameters and 12 invalid parameters.

Fig. 8. The emotion recognition accuracy curve of the 1st subject's EEG data during optimization, comparing GPSO+BC, PSO+BC, PSO+7Layer-fixed, GPSO+7Layer-fixed and PSO+3Layer-fixed.

Each parameter combination represents a form of the network. A random search over each parameter would take a long time and a great deal of effort, and manually adjusting parameter combinations wastes computing resources while its efficiency depends heavily on personal experience. When the number of network layers gradually increases, manual parameter adjustment becomes extremely inefficient and time-consuming. Therefore, the best choice is to utilize redundant computing resources to train networks in parallel and find the most suitable network parameters.

Fig. 8 shows the emotion recognition accuracy curve of the 1st subject's EEG data during the optimization process based on the GPSO algorithm and other existing approaches. The accuracy in Fig. 8 presents a non-linear, step-like increase because the hyper-parameters are discrete; the curve in green represents the global optimal accuracy of the GPSO algorithm. It can be seen from these results that, during parameter adjustment, the recognition ability of the model for EEG signals gradually improves as the GPSO algorithm runs. In addition, Fig. 8 also shows the optimization processes of several other approaches: (1) the PSO method combined with the binary coding system; (2) a PSO method on a fixed three-layer network; (3) a PSO method on a fixed seven-layer network; (4) a GPSO method on a fixed seven-layer network. The PSO method used here is the same as in [48-50]. By comparison, we find that GPSO searches more efficiently than PSO and is less affected by local optima. Besides, the difference in accuracy between the fixed three-layer network and the fixed seven-layer network shows the importance of searching for the appropriate network depth.

C. Method Comparison

To prove the effectiveness of the method in the field of emotion recognition, we compare the performance of our method with four competitive methods used in emotion recognition.



Fig. 9. Comparison among the proposed model and other advanced algorithms.

To achieve an effective comparison, we employ competitive methods that have achieved valid results on public datasets. In recent years, the publicly available SEED dataset has attracted growing interest in the EEG-based emotion recognition area, and various studies have explored this challenging task. Based on different ideas, some existing works provide valuable approaches and valid results, including: PSD+SVM [15] with 73.4%, GSCCA [55] with 83.72%, DBN [15] with 86.08%, HCNN [56] with 86.2%, SVM [15] using DE features with a performance of 86.65%, and BDAE [57] using DE features of EEG signals together with eye movement signals with a high mean accuracy of 91.01%. These methods mainly fall into the following types: the widely used PSD-based and DE-based methods, CNN-based methods and GCNN-based methods. The representative methods we select for comparison are as follows.

PSD+SVM: a classical emotion recognition method for EEG signals. The PSD feature is extracted from each channel of the EEG signals in five specific frequency bands (delta: 1-3 Hz, theta: 4-7 Hz, alpha: 8-13 Hz, beta: 14-30 Hz, gamma: 31-50 Hz) and then fed into a traditional SVM.

DE+SVM: differential entropy (DE) [15] is a feature commonly used in emotion recognition in recent years, which is usually better than PSD. DE is extracted from each channel of the EEG signals in the same five frequency bands.

HCNN (Hierarchical Convolutional Neural Networks): a DE-based convolutional neural network for classification developed in [56].

DE+DBN: a DE-based advanced method for feature extraction and classification developed in [15].

CNN (optimized by GPSO): the convolutional neural network optimized by our GPSO framework.

To achieve an effective comparison, all five methods are applied to the dataset collected in our emotion experiment to test their average accuracy and standard deviation, and the results are shown in Fig. 9. It should be noted that, since our goal is to realize emotion recognition based on complete 1-second samples, our method does not perform cross-sample feature smoothing. This makes it difficult to perform the comparison on public databases against those methods that use cross-sample feature smoothing.
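For reference, here is a minimal sketch of the PSD+SVM baseline described above, computing band powers with Welch's method over the five frequency bands listed in the text and feeding them to an SVM; the segment length and classifier settings are placeholder assumptions rather than the settings used in the original references.

```python
import numpy as np
from scipy.signal import welch
from sklearn.svm import SVC

BANDS = {'delta': (1, 3), 'theta': (4, 7), 'alpha': (8, 13), 'beta': (14, 30), 'gamma': (31, 50)}

def psd_features(epochs, fs=200):
    """Band-power features per channel; `epochs` has shape (n_epochs, 30, 200)."""
    freqs, psd = welch(epochs, fs=fs, nperseg=fs, axis=-1)   # psd: (n_epochs, 30, n_freqs)
    feats = [psd[..., (freqs >= lo) & (freqs <= hi)].mean(axis=-1) for lo, hi in BANDS.values()]
    return np.concatenate(feats, axis=-1)                    # (n_epochs, 30 * 5)

# Example usage (X_train/X_test: raw 1-second epochs, y_*: emotion labels)
# clf = SVC(kernel='rbf', C=1.0).fit(psd_features(X_train), y_train)
# acc = clf.score(psd_features(X_test), y_test)
```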

Fig. 10. The average number of successfully-predicted samples in each type, shown as confusion matrices for GPSO-CNN, PSD+SVM, DE+SVM, HCNN and DE+DBN.

Thus, we reproduce all four introduced methods and perform all comparisons on our dataset. The flow of the feature-based methods is as follows: the 30 × 200 segments (30 channels at 200 Hz) obtained in data preprocessing are entered, the features are computed from these data, and the processed feature values are used as the input of the classifiers. The classifiers and network structures are constructed within the ranges recommended in the original references, and the optimal values are selected after testing.

First of all, we can see that, compared to the traditional PSD+SVM method (70.39% ± 1.12%) and the DE+SVM method (87.39% ± 6.69%), the accuracy of the GPSO-optimized CNN is significantly improved and its standard deviation is obviously lower. Second, the comparison with HCNN (91.65% ± 4.32%) and DE+DBN (89.75% ± 6.35%) shows that the accuracy of the GPSO-optimized CNN model (92.44% ± 3.60%) is slightly higher than these advanced methods. Besides, we perform another statistical analysis, the Kappa value, on the results based on the confusion matrices. The average number of successfully-predicted samples in each type is shown in Fig. 10 in the form of confusion matrices, and the Kappa values of the five methods are: PSD+SVM (0.554), DE+SVM (0.811), HCNN (0.873), DE+DBN (0.847), and the CNN optimized by GPSO (0.905). Furthermore, the results of these methods under the evaluation protocol with training and test sets from different time periods are as follows:


the PSD+SVM method (63.67% ± 11.12%), the DE+SVM method (67.97% ± 9.84%), HCNN (69.27% ± 9.98%), DE+DBN (71.73% ± 9.31%) and the GPSO-optimized CNN model (75.40% ± 9.32%). The Kappa values of the five methods are: PSD+SVM (0.446), DE+SVM (0.516), HCNN (0.535), DE+DBN (0.569), and the CNN optimized by GPSO (0.600). Generally speaking, due to the differences in EEG signals across different time periods, the extracted features are affected, and the accuracy of all methods under this evaluation decreases. Although the accuracy of all methods decreases under this evaluation protocol, our method still maintains the highest accuracy among all methods. In general, the comparison with the four previous methods illustrates the effectiveness of our GPSO-optimized CNN model structure.

V. CONCLUSION

Since EEG analysis has a wide range of applications in many fields of brain science, it is of great significance to establish a highly accurate and reliable EEG signal analysis method. Deep learning techniques provide an EEG analysis method with strong generality and high accuracy. However, the manual adjustment of the hyper-parameters of a deep learning model is an experience-dependent task, so the wide and rapid use of this method is restricted to some extent. Therefore, it is necessary to find a method that can quickly acquire an efficient deep learning model without manual attempts and make full use of spare computing resources. In this paper, we design a novel optimization framework based on the GPSO algorithm for tuning the hyper-parameters of the CNN model, and apply the optimized CNN model to the EEG-based emotion recognition task. The effect of our CNN model is verified on the movie-evoked emotion classification task. The results show that the GPSO-optimized CNN model provides an average accuracy higher than that of four existing works and that the accuracy increases greatly after using our optimization method, proving the effectiveness of our method. In addition, since this method does not restrict the content of the EEG signals, it can in principle be applied to other types of EEG signal analysis, and we expect further applications in other fields.

REFERENCES

[1] R. Picard, Affective Computing, MIT Press, 2000.
[2] M. Pantic and L. J. M. Rothkrantz, “Automatic analysis of facial expressions: The state of the art,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 12, pp. 1424-1445, 2011.
[3] Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang, “A survey of affect recognition methods: Audio, visual, and spontaneous expressions,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 1, pp. 39-58, 2009.
[4] K. Mistry, L. Zhang, S. C. Neoh, C. P. Lim, and B. Fielding, “A micro-GA embedded PSO feature selection approach to intelligent facial emotion recognition,” IEEE Trans. Cybern., vol. 47, no. 6, pp. 1496-1509, 2017.
[5] M. E. Ayadi, M. S. Kamel, and F. Karray, “Survey on speech emotion recognition: Features, classification schemes, and databases,” Pattern Recogn., vol. 44, no. 3, pp. 572-587, 2011.
[6] B. Schuller, A. Batliner, S. Steidl, and D. Seppi, “Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge,” Speech Commun., vol. 53, no. 9-10, pp. 1062-1087, 2011.

[7] R. L. Mandryk, K. M. Inkpen, and T. W. Calvert, “Using psychophysiological techniques to measure user experience with entertainment technologies,” Behav. Inform. Technol., vol. 25, no. 2, pp. 141-153, 2006.
[8] G. Chanel, C. Rebetez, M. Bétrancourt, and T. Pun, “Emotion assessment from physiological signals for adaptation of game difficulty,” IEEE Trans. Syst. Man Cybern. A, vol. 41, no. 6, pp. 1052-1063, 2011.
[9] X. W. Wang, D. Nie, and B. L. Lu, “Emotional state classification from EEG data using machine learning approach,” Neurocomputing, vol. 129, pp. 94-106, 2014.
[10] D. Mathersul, L. M. Williams, P. J. Hopkinson, and A. H. Kemp, “Investigating models of affect: Relationships among EEG alpha asymmetry, depression, and anxiety,” Emotion, vol. 8, no. 4, pp. 560-572, 2008.
[11] G. G. Knyazev, J. Y. Slobodskoj-Plusnin, and A. V. Bocharov, “Gender differences in implicit and explicit processing of emotional facial expressions as revealed by event-related theta synchronization,” Emotion, vol. 10, no. 5, pp. 678-687, 2010.
[12] N. Sheehy, “Electroencephalography: Basic Principles, Clinical Applications and Related Fields,” Wolters Kluwer Health/Lippincott Williams & Wilkins, 1982.
[13] Y. P. Lin, C. H. Wang, T. P. Jung, T. L. Wu, S. K. Jeng, J. R. Duann, and J. H. Chen, “EEG-based emotion recognition in music listening,” IEEE Trans. Biomed. Eng., vol. 57, no. 7, pp. 1798-1806, 2010.
[14] A. M. Bhatti, M. Majid, S. M. Anwar, and B. Khan, “Human emotion recognition and analysis in response to audio music using brain signals,” Comput. Hum. Behav., vol. 65, pp. 267-275, 2016.
[15] W. L. Zheng and B. L. Lu, “Investigating critical frequency bands and channels for EEG-based emotion recognition with deep neural networks,” IEEE Trans. Auton. Mental Develop., vol. 7, no. 3, pp. 162-175, 2015.
[16] X. Chai, Q. S. Wang, Y. P. Zhao, Y. Q. Li, D. Liu, X. Liu, and O. Bai, “A fast, efficient domain adaptation technique for cross-domain electroencephalography (EEG)-based emotion recognition,” Sensors, vol. 17, no. 5, pp. 1014, 2017.
[17] Y. Zhang, X. M. Ji, and S. H. Zhang, “An approach to EEG-based emotion recognition using combined feature extraction method,” Neurosci. Lett., vol. 633, pp. 152-157, 2016.
[18] H. Shahabi and S. Moghimi, “Toward automatic detection of brain responses to emotional music through analysis of EEG effective connectivity,” Comput. Hum. Behav., vol. 58, pp. 231-239, 2016.
[19] Y. Dasdemir, E. Yildirim, and S. Yildirim, “Analysis of functional brain connections for positive-negative emotions using phase locking value,” Cogn. Neurodyn., vol. 11, no. 6, pp. 487-500, 2017.
[20] W. Wu, Z. Chen, X. Gao, Y. Li, E. N. Brown, and S. Gao, “Probabilistic common spatial patterns for multichannel EEG analysis,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 639-653, 2015.
[21] F. Qi, Y. Li, and W. Wu, “RSTFC: A Novel Algorithm for Spatio-Temporal Filtering and Classification of Single-Trial EEG,” IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 12, pp. 3070-3082, 2017.
[22] S. Ji, W. Xu, M. Yang, and K. Yu, “3D convolutional neural networks for human action recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, pp. 221-231, 2013.
[23] R. T. Schirrmeister, J. T. Springenberg, L. D. J. Fiederer, M. Glasstetter, K. Eggensperger, M. Tangermann, F. Hutter, W. Burgard, and T. Ball, “Deep learning with convolutional neural networks for brain mapping and decoding of movement-related information from the human EEG,” Hum. Brain Mapp., 2017.
[24] W. Wu, S. Nagarajan, and Z. Chen, “Bayesian Machine Learning for EEG/MEG Signal Processing,” IEEE Signal Process. Mag., vol. 33, no. 1, pp. 14-36, 2016.
[25] Y. Bengio, A. Courville, and P. Vincent, “Representation Learning: A Review and New Perspectives,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798-1828, 2013.
[26] E. P. Giri, M. I. Fanany, A. M. Arymurthy, and S. K. Wijaya, “Ischemic stroke identification based on EEG and EOG using 1D convolutional neural network and batch normalization,” in Proc. ICACSIS'8, Malang, Indonesia, pp. 484-491, 2016.
[27] N. Tajbakhsh, J. Y. Shin, S. R. Gurudu, R. T. Hurst, C. B. Kendall, M. B. Gotway, and J. Liang, “Convolutional neural networks for medical image analysis: Full training or fine tuning?” IEEE Trans. Med. Imaging, vol. 35, no. 5, pp. 1299-1312, 2016.
[28] O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recognition,” in Proc. BMVC'26, vol. 1, no. 3, Swansea, UK, pp. 6, 2015.


[29] M. Hajinoroozi, Z. Mao, T. P. Jung, C. T. Lin, and Y. Huang, “EEG based prediction of driver's cognitive performance by deep convolutional neural network,” Signal Process.: Image Commun., vol. 47, pp. 549-555, 2016.
[30] Z. Gao, X. Wang, Y. Yang, C. Mu, Q. Cai, and W. Dang, “EEG-based spatio-temporal convolutional neural network for driver fatigue evaluation,” IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 9, pp. 2755-2763, 2019.
[31] Z. Gao, S. Li, Q. Cai, et al., “Relative Wavelet Entropy Complex Network for Improving EEG-Based Fatigue Driving Classification,” IEEE Transactions on Instrumentation and Measurement, vol. 68, no. 7, pp. 2491-2497, 2018.
[32] L. G. Chen, H. D. Chiang, N. Dong, and R. P. Liu, “Group-based chaos genetic algorithm and non-linear ensemble of neural networks for short-term load forecasting,” IET Generation, Transmission & Distribution, pp. 1440-1447, 2016.
[33] Y. LeCun and Y. Bengio, “Convolutional networks for images, speech, and time series,” The Handbook of Brain Theory and Neural Networks, MIT Press, pp. 3361, 1995.
[34] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735-1780, 1997.
[35] F. A. Gers, N. N. Schraudolph, and J. Schmidhuber, “Learning precise timing with LSTM recurrent networks,” J. Mach. Learn. Res., vol. 3, no. 1, pp. 115-143, 2003.
[36] A. R. Mohamed, D. Yu, and L. Deng, “Investigation of full-sequence training of deep belief networks for speech recognition,” in Proc. Interspeech, pp. 2846-2849, 2010.
[37] P. Rajpurkar, A. Y. Hannun, M. Haghpanahi, C. Bourn, and A. Y. Ng, “Cardiologist-Level Arrhythmia Detection with Convolutional Neural Networks,” arXiv preprint, 2017.
[38] H. Cecotti and A. Graser, “Convolutional neural networks for P300 detection with application to brain-computer interfaces,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 3, pp. 433-445, 2011.
[39] N. S. Kwak, K. R. Muller, and S. W. Lee, “A convolutional neural network for steady state visual evoked potential classification under ambulatory environment,” PLoS One, vol. 12, no. 2, pp. e0172578, 2017.
[40] Z. K. Gao, K. L. Zhang, W. D. Dang, Y. X. Yang, Z. B. Wang, H. B. Duan, and G. R. Chen, “An Adaptive Optimal-Kernel Time-Frequency Representation-based Complex Network Method for Characterizing Fatigued Behavior Using the SSVEP-based BCI System,” Knowledge-Based Systems, vol. 152, pp. 163-171, 2018.
[41] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. ICML'32, pp. 448-456, 2015.
[42] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in Proc. 13th Eur. Conf. Comput. Vis. (ECCV), pp. 630-645, 2014.
[43] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proc. AISTATS'10, pp. 249-256, 2010.
[44] L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proc. COMPSTAT'19, pp. 177-186, 2010.
[45] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088, pp. 533-536, 1986.
[46] Y. Shi and R. Eberhart, “Modified particle swarm optimizer,” in Proc. IEEE ICEC Conference, Anchorage, pp. 69-73, 1998.
[47] M. Dorigo, G. Di Caro, and L. M. Gambardella, “Ant algorithms for discrete optimization,” Artificial Life, vol. 5, no. 2, pp. 137, 1999.
[48] P. R. Lorenzo, J. Nalepa, L. S. Ramos, and J. R. Pastor, “Hyper-parameter selection in deep neural networks using parallel particle swarm optimization,” Genetic & Evolutionary Computation Conference Companion, ACM, 2017.
[49] T. Sinha, A. Haidar, and B. Verma, “Particle Swarm Optimization Based Approach for Finding Optimal Values of Convolutional Neural Network Parameters,” PLoS One, vol. 12, no. 12, e0188746, 2017.
[50] F. Ye, “Particle swarm optimization-based automatic parameter selection for deep neural networks and its applications in large-scale and high-dimensional data,” IEEE Congress on Evolutionary Computation, IEEE, 2018.
[51] M. Suganuma, S. Shirakawa, and T. Nagao, “A Genetic Programming Approach to Designing Convolutional Neural Network Architectures,” Genetic & Evolutionary Computation Conference Companion, pp. 497-504, 2017, doi:10.1145/3071178.3071229.

[52] I. Bello, B. Zoph, V. Vasudevan, and Q. V. Le, “Neural Optimizer Search with Reinforcement Learning,” in Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 459-468, 2017.
[53] B. Baker, O. Gupta, N. Naik, and R. Raskar, “Designing Neural Network Architectures using Reinforcement Learning,” International Conference on Learning Representations, abs/1611.02167, 2016.
[54] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proc. ICML'27, Haifa, Israel, pp. 807-814, 2010.
[55] W. Zheng, “Multichannel EEG-based emotion recognition via group sparse canonical correlation analysis,” in IEEE Transactions on Cognitive and Developmental Systems, vol. 9, no. 3, pp. 281-290, 2016.
[56] J. Li, Z. Zhang, and H. He, “Hierarchical Convolutional Neural Networks for EEG-Based Emotion Recognition,” Cognitive Computation, vol. 10, no. 2, pp. 368-380, 2018.
[57] W. Liu, W. L. Zheng, and B. L. Lu, “Emotion Recognition Using Multimodal Deep Learning,” International Conference on Neural Information Processing, pp. 521-529, 2016.

Zhongke Gao received the M.Sc. and Ph.D. degrees from Tianjin University, Tianjin, China, in 2007 and 2010, respectively. He is currently a Full Professor with the School of Electrical and Information Engineering, Tianjin University, and the Director of the Laboratory of Complex Networks and Intelligent Systems, Tianjin University, Tianjin, China. His current research interests include complex networks, multisource information fusion, deep learning, sensor design, multiphase flows, and brain-computer interfaces. He and his research team have published over 100 papers in refereed journals and conference proceedings. He has been serving as an Editorial Board Member for Scientific Reports and an Associate Editor for IEEE Access, Neural Processing Letters, and Royal Society Open Science.

Yanli Li received the bachelor’s degree from the School of Electrical and Information Engineering, Tianjin University, Tianjin, China, in 2017. He is currently working toward the master’s degree at the School of Electrical and Information Engineering, Tianjin University, Tianjin, China. His research interests include affective computing, EEG analysis, and machine learning.

Yuxuan Yang received the bachelor’s degree from Anhui University, Hefei, China, in 2014, and the master’s degree from the School of Electrical and Information Engineering, Tianjin University, Tianjin, China, in 2017. She is currently pursuing the Ph.D. degree at the School of Electrical and Information Engineering, Tianjin University, Tianjin, China. Her research interests include affective computing, EEG analysis, machine learning and complex networks.


Xinmin Wang received the bachelor’s degree from the School of Electrical and Information Engineering, Tianjin University, Tianjin, China, in 2017. He is currently working toward the master’s degree at the School of Electrical and Information Engineering, Tianjin University, Tianjin, China. His research interests include affective computing, EEG analysis and machine learning.

Na Dong received her Ph.D. degree in control theory and control application at Nankai University in 2011. She is currently an associate professor with the School of Electrical and Information Engineering, Tianjin University, Tianjin, China. Her current research areas encompass intelligent control algorithms, heuristic optimization algorithms, neural networks, data-driven control, deep learning and image processing.

Hsiao-Dong Chiang (M'87-SM'91-F'97) received the Ph.D. degree in electrical engineering and computer sciences from the University of California at Berkeley, Berkeley, CA, USA. He is a Professor of electrical engineering with Cornell University, Ithaca, NY, USA. He is the Founder of Bigwood Systems, Inc., Ithaca. He and his research team have published over 350 papers in refereed journals and conference proceedings. He holds 17 U.S. and overseas patents and several consultant positions. His current research interests include nonlinear system theory, nonlinear computation, nonlinear optimization, and their practical applications. Prof. Chiang has served as an Associate Editor for several IEEE TRANSACTIONS and journals.


CONFLICT OF INTEREST

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.