J. Parallel Distrib. Comput. 116 (2018) 18–26
Using convolution control block for Chinese sentiment analysis

Zheng Xiao a, Xiong Li b, Le Wang a, Qiuwei Yang a,*, Jiayi Du c, Arun Kumar Sangaiah d,*

a College of Computer Science and Information Engineering, Hunan University, Changsha 410082, China
b School of Computer Science and Engineering, Hunan University of Science and Technology, Xiangtan 411201, China
c College of Computer and Information Engineering, Central South Forest Technology University, Changsha 410004, China
d School of Computer Science and Engineering, VIT University, Vellore 632014, Tamil Nadu, India

* Corresponding authors. E-mail addresses: [email protected] (Z. Xiao), [email protected] (X. Li), [email protected] (L. Wang), [email protected] (Q. Yang), [email protected] (J. Du), [email protected], [email protected] (A.K. Sangaiah).
Highlights

• We generate a word vector query library, i.e. the lookup table, trained by the skip-gram algorithm on a large-scale unlabeled dataset from Chinese Wikipedia of size 1.3 GB.
• Based on the convolutional neural network, we propose a Chinese sentiment classification model built on the concept of the convolution control block.
• We experiment on a real-world dataset of millions of Chinese hotel reviews to compare the performance of our model with LR_all and DCN.
Article info

Article history: Received 19 July 2017; Received in revised form 26 October 2017; Accepted 30 October 2017; Available online 16 November 2017

Keywords: Natural language processing; Deep learning; Convolutional neural network; Sentiment analysis
Abstract

Convolutional neural networks (CNN) have lately received great attention because of their good performance in computer vision and speech recognition, and they have also been widely used in natural language processing. However, methods designed for English cannot be transplanted directly because of phrase segmentation, and existing methods for Chinese retrieve semantics poorly. We propose a Chinese sentiment classification model built on the concept of the convolution control block (CCB). It aims at classifying Chinese sentences as positive or negative. The CCB based model considers short and long context dependencies: parallel convolutions with different kernel sizes are designed for phrase segmentation, a gate convolution merges and filters abstract features, and tiering 5 layers of CCBs captures word connections within a sentence. Our model is evaluated on the Million Chinese Hotel Review dataset. Its positive emotion accuracy reaches 92.58%, which outperforms LR_all and DCN by 2.89% and 4.03%, respectively. Model depth and sentence length are positively related to the accuracy, and gate convolution indeed improves model accuracy.

© 2017 Elsevier Inc. All rights reserved.
1. Introduction

Nowadays, the Internet has evolved from a static, one-way information carrier into a dynamic and interactive medium, where human beings can post comments on news or products, send instant messages, and so on. The massive amount of interactive information exposes people's emotions, which are the users' psychological footprint. Those posted texts are subjective information reflecting people's opinions and attitudes on social events or commodities, and they are of great value. They help enterprises to build a good brand image or to investigate user experience; they help consumers to make purchase decisions; they help governments to know the public opinion on policies or hot social issues; they are even helpful for medical
institutions to evaluate people's health status and take counter measures. Sentiment analysis is widely applied in Internet public opinion analysis, product or service review mining, etc.

Text sentiment analysis is a classic research topic in the field of natural language processing (NLP), and it plays an important role in the intelligent network and society. Machine learning [31,9] and pattern recognition [24,32] are two important technologies in the computer field, which can be used in many areas, such as wireless sensor networks [32] and wireless communication networks [24]. Traditional methods of sentiment analysis mainly include the emotional dictionary based classification method [6] and the traditional machine learning based method. Although these methods perform well in classification accuracy, they confront many difficulties. The former relies on a large number of manual operations, such as manual tagging of emotional parts of speech and manual development of search rules; moreover, such methods are incapable of dealing with new words and unknown words. The latter cannot distinguish the
semantics of sentences because it ignores the order of words in a sentence, which leads to the problem of sentiment misclassification. Take the Bag of Words (BOW) model for example, a feature model frequently used in machine learning based methods. The BOW model represents a text (such as a sentence or document) as a collection of words, but the collection ignores the syntax of the statement and the order in which the words appear. As a result, it cannot catch the context-sensitive information between the words and underneath the sequence. Polarity changes do happen in real-world cases [23], and this pitfall leads to potential misclassification.

Word connection has an important impact on classification performance, and neglecting it causes traditional machine learning to suffer from serious misclassification. For example, consider two Chinese sentences, one meaning "the phone is very expensive but good-looking" and the other "the phone is good-looking but very expensive". The BOW model regards these sentences as identical without noticing the subtle difference, leading to the wrong judgment that the two sentences carry the same sentiment. In fact, the two sentences emphasize different parts and their emotions differ: the former emphasizes that the phone looks good, and its emotion is positive, while the latter emphasizes that the phone is expensive, and its emotion is negative. The BOW model ignores the word order of the statement and thus cannot understand the emotional implication. This example illustrates the word connection within a sentence.

There is another problem, the word connection within a phrase, which is exclusive to Chinese. For example, consider two Chinese phrases meaning "sorry" and "great", which have opposite sentiments. In English there is a single word for each, but in Chinese each phrase contains three characters; furthermore, the character meaning "no" that appears in these phrases is negative on its own and challenges the judgment. Therefore, phrase segmentation is a special feature of Chinese, which impedes the transplantation of methods designed for English sentiment analysis.

Recently, deep learning methods have started to be exploited for NLP tasks, such as word-based ConvNets, long short-term memory (LSTM) [12], recurrent neural network (RNN) models, and convolutional neural network (CNN) models. Section 2 gives a comprehensive analysis and comparison of those methods. This paper focuses on CNN for the Chinese sentiment analysis task because of its advantages in training speed, efficiency, and parallel implementation. To address the word connection in phrases and sentences, we propose a convolution control block (CCB) construct and a CCB based model. Instead of the bag of words, an unsupervised pre-training algorithm is used to generate word vectors, making the word relation measurable.

In a sentence, several words form a semantic unit, which makes a unified contribution to sentiment. The number of words in a semantic unit reflects a semantic distance; the larger the distance is, the more words are thought to be interdependent. According to the characteristics of Chinese, 3 and 5 are taken as the short and long semantic distances. Based on that, two parallel convolutional operations of different sizes are executed simultaneously, and then a gate convolution is used to merge and filter the high-level abstract features. This structure is called the convolution control block (CCB) construct, and it attempts to deal with the phrase segmentation problem.
A large network of CCBs is organized by tiering and pooling, so that the word connection is extended to the sentence level. A CCB based model is therefore designed with the CCB as its basic construct. This model has a good understanding of the dependency and semantics of the Chinese context: it considers larger context dependencies and extracts more abstract hierarchical features through CCB networking. In contrast to dictionary based and traditional machine learning based methods, our model does not need to manually develop
emotional dictionaries and classification rules; it is partially immune to new and unknown words through training; and it retains the semantic retrieval ability that is damaged in the BOW model. To evaluate the performance of our model, two recent deep learning methods, LR_all [3] and DCN [28], are compared. As far as F1-score is concerned, our model outperforms them by about 2%. Model depth and sentence length are positively related to accuracy, and gate convolution is shown to improve accuracy. The following summarizes our contributions:
• Based on CNN, we propose a Chinese sentiment classification model built on the concept of the convolution control block. It is realized with the deep learning library Keras, and its main component is a stack of 5 layered CCBs.
• The performance of the model is assessed on a real-world dataset of millions of Chinese hotel reviews. The experimental results show that its accuracy is higher than that of LR_all and DCN.
• The depth of the convolutional layers, the length of the training statements, and the existence of the gate convolution are studied, so as to obtain better hyper-parameters.

The rest of this paper is organized as follows. In Section 2, we briefly describe relevant work in the field of sentiment analysis. In Section 3, we review the concept of convolution and develop the construct of the convolution control block. Sections 4 and 5 describe the model design and the training algorithm. Section 6 shows the experimental results, and finally Section 7 summarizes the paper.

2. Related works

2.1. Emotional dictionary based methods

The dictionary based approach is the simplest method for sentiment analysis. Words or phrases are annotated with their sentiment tendency, and the emotional intensity of each word or phrase is then aggregated to get the emotional orientation of the whole text. Riloff and Shepherd [25] constructed a semantic dictionary based on a corpus. Hatzivassiloglou and McKeown [10] studied the emotional tendencies of English words on a large-scale corpus, together with the impact of conjunctions on the sentiment of adjectives. Kamps and Marx [14] researched emotional tendencies using the well-known emotional dictionary WordNet.

2.2. Traditional machine learning based methods

Traditional machine learning methods include maximum entropy, decision trees, support vector machines, hidden Markov models, and conditional random fields. These methods do not need to build a dictionary; instead, they construct feature templates for text classification and sentiment recognition, and they learn automatically from annotated datasets. Wang et al. [33] proposed a new approach to detect the sentiments of Chinese microblogs using layered features. Semi-supervised learning is a good choice when there is far more unlabeled than labeled data for training: Li et al. [19] presented a semi-supervised bootstrapping algorithm for analyzing China's foreign relations from the People's Daily, and Yu and Kübler [35] investigated the use of semi-supervised learning in opinion detection, both in sparse data situations and for domain adaptation, and showed that co-training reaches the best results in an in-domain setting with small labeled datasets.
2.3. Deep learning based methods
Recently, more and more researchers are using deep neural networks to study NLP tasks. The performance of deep neural networks [1,15] on NLP tasks has been shown to outperform classical n-gram language models [2]. Regarding different network architectures for NLP tasks, there are many related research works. Graves et al. [8] proposed the long short-term memory (LSTM) neural network for sequence modeling tasks, and Socher et al. [29,30] used tree-structured long short-term memory networks to improve semantic representations. Such recurrent neural networks are able to retain memory between training examples, allowing them to capture relations between words, and they are promising for sentiment analysis because the LSTM variant can capture long and short term dependencies [11]. However, LSTM is a temporal learning method that is hard to train in parallel, so a large-scale text dataset takes LSTM quite a long time to learn. Hence, the convolutional neural network (CNN) draws attention. Deep CNNs have one key advantage over existing approaches to sentiment analysis that rely on extensive feature engineering: CNNs automate the feature generation phase and learn more general representations, making the approach robust and flexible across domains.

Because English words are composed of letters, Zhang et al. [37] first proposed a convolutional neural network classification model based on letter-level features; their model used 6 convolutional layers and 3 fully connected layers for large-scale text classification datasets. Conneau et al. [4] presented a new architecture for text classification that operates directly at the character level; they use only small convolution and pooling operations, summing up to 29 convolutional layers, and report improvement over the state-of-the-art models on several public text classification tasks. Kim [17] used a CNN for the sentence classification task, in which one convolutional layer is followed by a max pooling layer over time and one fully connected layer with dropout classifies the sentences. Kalchbrenner et al. [16] proposed the DCNN model, a dynamic convolutional neural network with five convolutional layers; it uses dynamic k-max pooling to detect the k most important features over a sequence and returns the subsequence of the k maximum values in the sentence. The network architecture designed by Dos Santos et al. [28] does not need any input about the syntactic structure of the sentence.

However, most of the current work [28,17] that uses CNNs for sentiment analysis targets English. Although those models reach high accuracy on English datasets, they give poor accuracy if applied directly to Chinese datasets, because of the word connection in phrases and sentences: Chinese is more complicated and requires dealing with the word segmentation task. In this paper, our work is similar to [17]. Considering the characteristics of Chinese, we try to use a deeper CNN to extract more useful classification features; multiple CCBs with parallel convolution are designed to extract features, not just a simple stack of layers.
3. Convolution control block

Before we formally introduce our model, we first review the related concepts and theories of CNNs. Convolution is a basic operation in digital signal processing [18]; it is able to find other, implicit representations of a signal. There are two kinds of convolution, one-dimensional (1-D) convolution and two-dimensional (2-D) convolution. This paper focuses on the temporal convolutional operation [17], i.e., 1-D convolution, which is applied to temporal sequences.

3.1. Convolutional operation

In a traditional neural network, a block that moves across the input layer is called a filter. The filter transforms a sub-node matrix of the current layer into a unit node matrix of the next layer of the network; this process is defined as forward propagation. The unit node matrix refers to a node matrix with length and width of 1. Suppose that x_i is the i-th node input; then the value of the unit node matrix is calculated as
y = f(x · w + b).    (1)
Here w is the filter's weight vector, b is the corresponding offset parameter, f is the activation function, and the symbol · denotes the dot product: assuming two vectors A, B ∈ R^{1×m}, the operation A · B = Σ_{i=1}^{m} a_i b_i.

In CNN, the dot product is replaced by convolution, and the convolutional operator is represented by the symbol ∗. Suppose we have a discrete input function g : [1, m] → ℜ and a discrete kernel function f : [1, k] → ℜ. The filter slides along the input layer (or the current neural network layer), and the distance of a single movement is called the stride. In this case, the input matrix size is 1 × m, the filter size is 1 × k, and the stride of the filter is d. The convolutional operation f ∗ g : [1, l] → ℜ is defined as
y_j = (g ∗ f)(j) = Σ_{i=1}^{k} f(i) · g(j·d − i + c).    (2)
Here c = k − d + 1 is an offset constant. The output matrix of the convolutional operation is of size 1 × l, where l = ⌈(m − k + 1)/d⌉. Fig. 1 shows an example of a 1-D convolution operation with a kernel function of size 3 and stride 2.
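To make Eq. (2) concrete, the following is a minimal NumPy sketch of the 1-D convolution illustrated in Fig. 1 (kernel size k = 3, stride d = 2); the function name conv1d and the toy input values are ours, not from the paper.

```python
import numpy as np

def conv1d(g, f, d):
    """1-D convolution of Eq. (2): y_j = sum_i f(i) * g(j*d - i + c), with c = k - d + 1.
    g is the input sequence (length m), f the kernel (length k), d the stride."""
    m, k = len(g), len(f)
    c = k - d + 1
    l = int(np.ceil((m - k + 1) / d))          # output length, as stated after Eq. (2)
    y = np.zeros(l)
    for j in range(1, l + 1):                  # 1-based indexing to mirror the paper
        y[j - 1] = sum(f[i - 1] * g[j * d - i + c - 1] for i in range(1, k + 1))
    return y

# toy example matching Fig. 1: kernel size 3, stride 2, input length 7 -> output length 3
g = np.array([1., 2., 3., 4., 5., 6., 7.])
f = np.array([0.5, 1.0, -0.5])
print(conv1d(g, f, d=2))
```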
3.2. CCB construct

In a sentence, several words form a semantic unit, which makes a unified contribution to sentiment. The number of words in a semantic unit reflects a semantic distance; the larger the distance is, the more words are thought to be interdependent. According to the characteristics of Chinese, 3 and 5 are taken as the short and long semantic distances, so a window of size 3 or 5 is used to trap those potential semantic units. Different from the traditional convolution construct, a parallel construct is designed to take semantic units of different sizes into account simultaneously. Fig. 2 shows the construct, which is named the convolution control block (CCB). It includes parallel convolution, gate convolution, a normalization function, and an activation function. Temp conv indicates the temporal convolution operation on the input text, and nb_filters represents the number of kernel functions. In the parallel convolution, the kernel functions are of size either 3 or 5. Assume that the input sentence X has a length of m words. According to Eq. (2), the results of the two parallel convolution operations are

c^L = X_{1×m} ∗ w^L + b^L
c^R = X_{1×m} ∗ w^R + b^R.    (3)
Here w^L and w^R denote the kernel functions taken by the left and the right parallel convolutions, of size 3 and 5, respectively. b^L and b^R are two offset vectors. They are of the same size as c^L and c^R, namely ⌈(m − 3 + 1)/d^L⌉ and ⌈(m − 5 + 1)/d^R⌉, respectively, where d^L and d^R are the left and the right convolution strides. Temp batch norm means batch normalization [13], which makes the data conform to the standard normal distribution, i.e., with mean 0 and variance 1.
Fig. 1. 1-D convolution operation with kernel size 3 and stride 2.
Batch normalization is expected to reduce the internal coupling of the data. Each element of c^L and c^R is normalized over a mini-batch following the steps of Algorithm 1; the operation is batch-wise, and so our training algorithm is batch-based.

Algorithm 1 Batch Normalization Algorithm
Require: the j-th elements of c^L and c^R over a mini-batch: {c_{j1}, c_{j2}, ..., c_{jr}}, where r is the batch size
Parameters: γ, β
Ensure: normalized elements: {c̄_{j1}, c̄_{j2}, ..., c̄_{jr}}
1: Compute the mean: μ ← (1/r) Σ_{i=1}^{r} c_{ji}
2: Compute the variance: σ² ← (1/r) Σ_{i=1}^{r} (c_{ji} − μ)²
3: Normalize the input value: ĉ_{ji} ← (c_{ji} − μ) / sqrt(σ² + ε)
4: Scale and shift the normalized value: c̄_{ji} ← γ ĉ_{ji} + β

Since c^L and c^R are linear combinations of the outputs of the upper layer, a network built only on them cannot approximate arbitrary functions. Thus, after normalization, the activation functions Tanh and ReLU are used to non-linearize the normalized data. The hyperbolic tangent function is Tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x}), while the rectified linear unit is ReLU(x) = max(x, 0). h^L and h^R are the non-linearized results of c^L and c^R, computed element by element:

h^L = Tanh(c^L) = Tanh(BN(c^L))
h^R = ReLU(c^R) = ReLU(BN(c^R)).    (4)

Here BN denotes the batch normalization function. h^L and h^R are different high-level features extracted from the same input, and reflect the short and long context dependencies, respectively. In order to fuse these two kinds of features, a gate convolution [5] with kernel size 1 is applied, because high-level abstract representations are thought to have less interdependency. Before that, the two parallel convolution outputs are combined by an element-wise product:

h = h^L · h^R    (5)
y = ReLU(BN(h ∗ w + b)).    (6)

Eq. (6) gives the final output of the CCB. It includes the gate convolution with kernel size 1 to control the information flow. After the gate convolution layer, the output data are batch normalized and non-linearized.
Fig. 2. The construct of convolutional control block.
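To make the construct concrete, the following is a minimal Keras (TensorFlow 2.x) sketch of one CCB following Eqs. (3)-(6) and Fig. 2. This is our reading of the construct, not the authors' released code; the function name ccb_block, the filter count, and the 'same' padding are our assumptions (the paper does not state its padding scheme).

```python
from tensorflow.keras import layers

def ccb_block(x, nb_filters=128):
    """One convolution control block, Eqs. (3)-(6); x has shape (batch, timesteps, channels)."""
    # Eq. (3): parallel temporal convolutions with kernel sizes 3 and 5;
    # 'same' padding is an assumption so that both branches keep the input length
    c_l = layers.Conv1D(nb_filters, kernel_size=3, strides=1, padding='same')(x)
    c_r = layers.Conv1D(nb_filters, kernel_size=5, strides=1, padding='same')(x)
    # Eq. (4): batch normalization (Algorithm 1) followed by Tanh / ReLU non-linearities
    h_l = layers.Activation('tanh')(layers.BatchNormalization()(c_l))
    h_r = layers.Activation('relu')(layers.BatchNormalization()(c_r))
    # Eq. (5): element-wise fusion of the short- and long-distance features
    h = layers.Multiply()([h_l, h_r])
    # Eq. (6): gate convolution with kernel size 1, then batch normalization and ReLU
    y = layers.Conv1D(nb_filters, kernel_size=1)(h)
    return layers.Activation('relu')(layers.BatchNormalization()(y))

# example: apply one CCB to a batch of embedded sentences of length 120 and dimension 300
inp = layers.Input(shape=(120, 300))
out = ccb_block(inp)
```

With stride 1 and 'same' padding the two branches have the same length, so the element-wise product of Eq. (5) is well defined.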
4. CCB based model

A convolutional neural network includes five parts: an input layer, convolutional layers, pooling layers, a fully connected layer, and a Softmax layer [36]. Our model coarsely consists of three parts, as shown in Fig. 3. The first part is the input and table lookup layers at the bottom. A sentence is imported character by character, without segmentation, and a special table is built to reflect the inter-word connection; Section 4.1 describes how this table is shaped and how its elements are produced. The next part is the CCB combination layer. Deep learning often relies on a deep and broad network topology, so that more abstract hierarchical features and larger context dependencies are extracted; a number of CCBs are tiered through pooling, as described in Sections 4.2 and 4.3. The last part outputs the maximum likelihood classification by use of the Softmax function, on which Section 4.4 provides more details.
4.1. Word embedding

In traditional machine learning models, the input is the set of features extracted from text after segmentation, and the feature definition has a significant impact on model efficiency. A deep neural network simplifies this step because of its capability of automatic feature extraction. So the input is simply the sequence of the statement S = {s_1, s_2, ..., s_t}, where t is the number of characters in the statement (each Chinese character is imported individually, as described above). For example,
consider a Chinese sentence meaning "I do not like this clothes"; it contains eight characters, so t = 8. Each character then needs to be converted to a computable vector, which is also called word embedding. One-hot encoding [37] is widely used: it sorts the vocabulary V, defines a vector of size |V|, and, for each word, sets the corresponding element to 1. This scheme has the drawbacks of over-large dimension and sparseness. More importantly, one-hot encoding cannot reflect the similarity or discrepancy of two words [21] through the Euclidean distance of their one-hot vectors; for example, under one-hot encoding "King" is no closer to "Men" than to "Women", although semantically it should be. In addition, one-hot encoding is unable to reflect the phrase connection between words. A word has an impact on the preceding or succeeding words, particularly within a phrase, and considering this phrase connection in the word embedding helps solve the phrase segmentation problem.

Fig. 3. Convolution control block based model.

To mitigate the aforementioned problems of one-hot encoding, we use the skip-gram algorithm [22] to construct a new word vector table. The word vectors generated by skip-gram consider the co-occurrence probability of neighboring words, so the Euclidean distance between words uncovers their relationship in a phrase or a sentence. Skip-gram is an unsupervised learning algorithm, aiming at maximizing the conditional probability of the word sequence, Eq. (7):

(1/t) Σ_{i=1}^{t} Σ_{−u≤j≤u, j≠0} log p(s_{i+j} | s_i)    (7)

where u is the maximum context distance. We use a large-scale unlabeled dataset, the Chinese Wikipedia dataset of size 1.3 GB, to train the skip-gram model. The word vectors are pre-stored in a lookup table T^{|V|×D}, where |V| is the size of the vocabulary and D is the size of the word vector. By looking up the table T^{|V|×D}, we obtain the corresponding word vector sequence of a statement S, i.e., S_{D×t} = {s_1, s_2, ..., s_t}.
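As an illustration of how such a lookup table can be produced, the sketch below uses gensim's skip-gram implementation (assuming the gensim 4.x API; the paper does not name its skip-gram toolkit). The corpus here is a two-sentence placeholder standing in for the 1.3 GB Chinese Wikipedia dump, and sentences are split into individual characters to match the character-level model.

```python
from gensim.models import Word2Vec

# placeholder corpus: each training sample is a list of Chinese characters
corpus = [list("一个中文句子"), list("另一个句子")]

model = Word2Vec(
    sentences=corpus,
    vector_size=300,   # word vector size D (Table 1)
    window=5,          # maximum context distance u (Table 1), Eq. (7)
    sg=1,              # skip-gram training objective
    min_count=1,
    workers=4,
)

# lookup table T^{|V| x D}: map a statement to its sequence of D-dimensional vectors
sentence = list("不喜欢")
vectors = [model.wv[ch] for ch in sentence if ch in model.wv]
```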
4.2. Convolutional tiering

A convolutional layer analyzes each small block of the neural network in order to obtain a higher degree of abstraction, and it is necessary to extend features in both depth and breadth. First, more kernel functions are adopted to diversify the features in a CCB. The parameter nb_filters determines how many kernels exist in a single layer, and the parallel convolution then outputs two matrices C^L = {c_1^L, c_2^L, ..., c_{nb_filters}^L} and C^R = {c_1^R, c_2^R, ..., c_{nb_filters}^R}. Then, to deepen the neural network, CCBs are tiered with one or more CCBs as the previous layer. As the stack of CCBs grows, trivial features should be filtered out, so max pooling, described in Section 4.3, is inserted between two layers of CCBs.

4.3. Pooling

In a CNN, the convolutional layers are usually followed by a pooling layer. The pooling layer is very effective in reducing the size of the input matrix and the number of weight parameters in the last fully connected layer or the next layer of the network. In addition, the pooling layer not only greatly speeds up the computation of the neural network, but is also effective in preventing over-fitting. Similar to the forward propagation of the convolutional layer, the forward propagation of the pooling layer is also accomplished by a sliding filter. The difference is that the pooling filter does not compute a weighted sum of the nodes; instead it uses a simple pooling function that takes the maximum or the average of the current region as the output of the network at that position. The main pooling operations are max pooling and average pooling.

In our case, the CCBs output the feature y = {y_1, y_2, ..., y_n}. If the filter size of the pooling layer is k′ and the stride is d′, then the max pooling output y′ = {y′_1, y′_2, ..., y′_l}, with l = ⌈(n − k′ + 1)/d′⌉, is defined as Eq. (8):

y′_i = max(y_{i·d′ − j + c′}), for j = 1 to k′    (8)

where c′ = k′ − d′ + 1 is the offset constant. Our default model stacks five layers of CCBs, and the gap between two CCBs is a max pooling layer. Finally, we insert a global average pooling layer between the convolutional layers and the classifier. In this model, we set the pooling kernel size k′ = 3 and the stride d′ = 2. The max pooling operation ensures that we keep the most important feature in every 3 features, and it shortens the output sequence to about half of its original length. The global average pooling layer averages all the features of the CCB output, which allows the output feature to be directly trained with the classifier [17]. For the final feature from the CCBs, y′ = {y′_1, y′_2, ..., y′_l}, its global average pooling result z_j ∈ z = {z_1, z_2, ..., z_{nb_filters}} is defined as Eq. (9):

z_j = (1/l) Σ_{i=1}^{l} y′_i.    (9)
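A minimal NumPy sketch of the max pooling of Eq. (8) with k′ = 3 and d′ = 2, and of the global average pooling of Eq. (9); the function names and the toy feature vector are ours.

```python
import numpy as np

def max_pool1d(y, k=3, d=2):
    """Max pooling of Eq. (8): keep the largest value in each window of size k, stride d."""
    n = len(y)
    l = int(np.ceil((n - k + 1) / d))
    return np.array([y[j * d:j * d + k].max() for j in range(l)])

def global_average_pool(y):
    """Global average pooling of Eq. (9): average all features of one feature map."""
    return y.mean()

y = np.array([0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4])
print(max_pool1d(y))           # roughly halves the sequence length, keeping local maxima
print(global_average_pool(y))  # one value per feature map
```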
4.4. Classifying

In the sentiment analysis task, we need to predict the sentiment label c̃ ∈ C for each input sequence, where C is the binary label set {positive, negative}. The softmax function is often used in the final layer of a neural-network-based classifier; such networks are commonly trained under a log loss (or cross-entropy) regime, giving a non-linear variant of multinomial logistic regression. The Softmax layer supports the decision making for the classification problem, through which a probability distribution over the different classes is obtained. Given the feature vector z = {z_1, z_2, ..., z_{nb_filters}}, the probability of the sentiment label c̃ is defined as Eq. (10):
p(c̃ | z) = softmax(z) = e^{z·ω_c̃} / Σ_{c∈C} e^{z·ω_c}.    (10)

Here ω_c is the parameter vector to be trained based on maximum entropy. As per maximum likelihood, the sentiment classification of the sample z is

label(z) = arg max_{c̃∈C} p(c̃ | z).    (11)
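Putting Sections 4.1-4.4 together, the following is a minimal Keras (TensorFlow 2.x) sketch of the whole CCB based model with the default hyper-parameters later listed in Table 1 (sentence length 120, 5 CCB layers, 128 kernels, pooling size 3, pooling stride 2). It is our reconstruction rather than the authors' code: the vocabulary size is a placeholder, the embedding layer stands in for the pre-trained skip-gram lookup table of Section 4.1, the 'same' padding is assumed, and the kernel count is kept constant per layer (the paper doubles it with depth in its depth experiments of Section 6.4).

```python
from tensorflow.keras import layers, models

def ccb_block(x, nb_filters):
    # one convolution control block, repeated from the sketch in Section 3.2 (Eqs. (3)-(6))
    c_l = layers.Conv1D(nb_filters, 3, padding='same')(x)
    c_r = layers.Conv1D(nb_filters, 5, padding='same')(x)
    h_l = layers.Activation('tanh')(layers.BatchNormalization()(c_l))
    h_r = layers.Activation('relu')(layers.BatchNormalization()(c_r))
    h = layers.Multiply()([h_l, h_r])
    y = layers.Conv1D(nb_filters, 1)(h)
    return layers.Activation('relu')(layers.BatchNormalization()(y))

def build_ccb_model(seq_len=120, vocab_size=10000, emb_dim=300,
                    n_layers=5, nb_filters=128):
    """Lookup table -> tiered CCBs with max pooling between them ->
    global average pooling -> softmax classifier. vocab_size is a placeholder."""
    inp = layers.Input(shape=(seq_len,), dtype='int32')
    # in the paper the embedding weights are the pre-trained skip-gram vectors (Section 4.1);
    # here the layer is randomly initialized as a stand-in
    x = layers.Embedding(vocab_size, emb_dim)(inp)
    for i in range(n_layers):
        x = ccb_block(x, nb_filters)
        if i != n_layers - 1:                       # max pooling between CCB layers (Section 4.3)
            x = layers.MaxPooling1D(pool_size=3, strides=2)(x)
    x = layers.GlobalAveragePooling1D()(x)          # Eq. (9)
    out = layers.Dense(2, activation='softmax')(x)  # Eq. (10): positive / negative
    return models.Model(inp, out)

model = build_ccb_model()
model.summary()
```

Dropout, weight decay, and the exact padding scheme are not specified in the paper, so they are omitted here.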
5. Training algorithm

The training algorithm of the convolution control block model mainly uses the training set to train the model parameters and the validation set to adjust them. The description is shown in Algorithm 2, which is based on the typical error back propagation method [27].

Algorithm 2 Training Algorithm of the Convolution Control Block Model
Require: training dataset D_train, the number of CCB tiering layers η
Ensure: model parameters θ = {w^L, w^R, w, b^L, b^R, b}
1: randomly initialize all model weights and biases
2: set the model learning rate α
3: repeat
4:   for each mini-batch of sentences S ∈ D_train do
5:     get the embedded word vector sequence S = (s_1, s_2, ..., s_t) from the lookup table T^{|V|×D}
6:     for i = 1 to η do
7:       perform the convolution control block to get the output y for the input S
8:       if i ≠ η then
9:         perform the max pooling operation to get the output y′
10:      end if
11:      S = y′
12:    end for
13:    perform the global average pooling to get the output z for the input y′
14:    get the predicted labels c̃_S of the sentences S through the softmax function
15:    compute the loss function J(θ)
16:    perform the back propagation algorithm to compute the gradient ∇θ
17:    update the model parameters θ = θ − α∇θ
18:  end for
19: until model convergence
20: save the model parameters θ

When the model is trained, the test set is used to test the generalization capability of the model. To prevent over-fitting, Algorithm 3 provides an optimization algorithm [34] to improve the learned parameters above.

Algorithm 3 Optimization Algorithm for Post Training
Require: test dataset D_test
Ensure: optimized model parameters θ̃
1: load the trained model parameters θ
2: while stopping criterion not met do
3:   sample a mini-batch of m examples from D_test
4:   use the m samples to perform forward propagation and compute J(θ)
5:   perform back propagation to estimate the gradient:
6:   g ← (1/m) Σ_i ∇θ J(X_i, Y_i), (X_i, Y_i) ∈ D_test
7:   apply the update: θ ← θ − αg
8: end while
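At the Keras level, the mini-batch gradient descent of Algorithm 2 reduces to a standard compile/fit loop. The sketch below assumes the build_ccb_model function from the earlier sketch, uses the learning rate and batch size of Table 1, and replaces the real data with random placeholders standing in for the encoded MioChnCorp splits; it is an illustration, not the authors' training script.

```python
import numpy as np
import tensorflow as tf

# random placeholder data standing in for the encoded review splits (Section 6.1)
x_train = np.random.randint(0, 10000, size=(1024, 120))
y_train = np.random.randint(0, 2, size=(1024,))
x_val = np.random.randint(0, 10000, size=(256, 120))
y_val = np.random.randint(0, 2, size=(256,))

model = build_ccb_model()   # from the sketch after Section 4.4 (our reconstruction)

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),  # learning rate alpha, Table 1
    loss='sparse_categorical_crossentropy',                  # log loss / cross-entropy, Section 4.4
    metrics=['accuracy'],
)
model.fit(x_train, y_train,
          batch_size=128,                  # mini-batch size r, Table 1
          epochs=30,                       # illustrative; Algorithm 2 runs until convergence
          validation_data=(x_val, y_val))
```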
6. Experiments and results

In this paper, the server configuration is a 3.2 GHz 8-core CPU, 16 GB of memory, and two NVIDIA GeForce GTX 980 GPUs. The algorithms are implemented on the software platform Keras, a deep learning framework in Python.

6.1. Dataset

The performance of our proposed model is evaluated on the Million Chinese Hotel Review (MioChnCorp) dataset [20]. All comments in this dataset have been labeled positive or negative by their publishers. It contains a total of 908,189 positive comments and 101,762 negative comments. After mixing the positive and negative comments randomly, 192,848 comments are exported, of which 101,424 are positive and the rest are negative. The exported dataset is divided into a training set, a test set, and a verification set. The training set contains 115,708 comments, while the test set and verification set have 38,570 comments each. Before the experiments, the selected comments are cleaned by wiping out extra spaces and symbols and cutting each sample to a fixed length.

6.2. Hyper-parameters

Back propagation [26] and standard gradient descent are used in the training algorithm (Algorithm 2). In our CCB based model there exist a number of hyper-parameters, which should be settled ahead of training. By cross validation, we screen out the most favorable hyper-parameter setting, shown in Table 1. The Glorot normal initializer [7] is adopted to initialize the model's weights and biases.

Table 1 Hyper-parameters of the default CCB-based model.
Hyper-parameter | Value
Word vector size D | 300
Maximum context distance u | 5
Learning rate α | 0.01
CCB tiering layers η | 5
Batch size r | 128
Number of kernels nb_filters | 128
Left, right, and gate convolution stride d^L, d^R, d | 1
Pooling size k′ | 3
Pooling stride d′ | 2
6.3. Performance metrics

F1-score and accuracy are the two performance metrics used to evaluate the generalization ability. F1-score is a typical metric for assessing the accuracy of a binary classifier; it takes both Precision and Recall into account. Eqs. (12) to (14) give the definitions of Precision, Recall, and F1-score. The symbols P and R denote the Precision and Recall metrics, and the subscripts "pos" and "neg" denote the two sentiment categories Positive and Negative.

P_pos = (# of truly predicted positive comments) / (# of positive comments)
P_neg = (# of truly predicted negative comments) / (# of negative comments)    (12)

R_pos = (# of truly predicted positive comments) / (# of comments predicted positive)
R_neg = (# of truly predicted negative comments) / (# of comments predicted negative)    (13)

F1_pos = 2 × (P_pos × R_pos) / (P_pos + R_pos)
F1_neg = 2 × (P_neg × R_neg) / (P_neg + R_neg).    (14)

From Eq. (14), F1-score is the harmonic mean of Precision and Recall. When F1-score equals 1, P_pos = P_neg = R_pos = R_neg = 1 and the prediction is totally correct; an F1-score of 0 indicates the worst performance. Accuracy is an intuitive metric, defined as Eq. (15):

accuracy = (# of truly predicted comments) / (# of comments).    (15)
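The per-class metrics of Eqs. (12)-(14) and the accuracy of Eq. (15) can be computed directly from the predicted and true labels, as in the following sketch; the function names are ours, and the labels 1/0 stand for positive/negative.

```python
import numpy as np

def per_class_metrics(y_true, y_pred, label):
    """P, R, and F1 for one class, following the definitions of Eqs. (12)-(14)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == label) & (y_true == label))
    p = tp / max(np.sum(y_true == label), 1)   # Eq. (12): truly predicted / all comments of the class
    r = tp / max(np.sum(y_pred == label), 1)   # Eq. (13): truly predicted / comments predicted as the class
    f1 = 2 * p * r / max(p + r, 1e-12)         # Eq. (14)
    return p, r, f1

def accuracy(y_true, y_pred):
    """Eq. (15): fraction of correctly predicted comments."""
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))

y_true = [1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]
print(per_class_metrics(y_true, y_pred, label=1), accuracy(y_true, y_pred))
```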
6.4. Result analysis

Our experiments study five aspects of the CCB based model: training speed, performance comparison, model depth, sentence length, and gate convolution.

(A) Training speed

Figs. 4 and 5 show the accuracy dynamics of the CCB based model on the training set and the validation set. Each iteration stands for traversing the entire training set once. As can be seen from Fig. 4, the accuracy increases gradually with the number of iterations. The accuracy of the model increases rapidly from iteration 1 to 16; at this stage, the model is rapidly fitting the training set and the distribution of the training set is being learned, which is called the under-fitting state. However, as the number of iterations increases further, the model accuracy tends to saturate; at this point, the model has learned the distribution of the training set. Fig. 5 shows the same statistic as Fig. 4 but on the validation set. While the model is in the under-fitting state, the accuracy on the verification set in the first 10 iterations is low and fluctuates. As the training carries on, the model gradually fits the training set, and meanwhile the accuracy on the verification set stabilizes.

Fig. 4. Accuracy dynamic on the training set.
Fig. 5. Accuracy dynamic on the validation set.

(B) Performance comparison

In order to illustrate the performance of our model in Chinese sentiment analysis, the LR_all model proposed by Lin et al. [20] and the Deep Convolution Network architecture (DCN) proposed by Conneau et al. [4] are used as the benchmark comparison models. F1-scores of our model and those two models are shown in Table 2. The length of each sentence is fixed to 120.
Table 2 Performance of LR_all, DCN, and CCB models (%).
Model | P_pos | R_pos | F1_pos | P_neg | R_neg | F1_neg
LR_all | 90.3 | 91.9 | 91.09 | 91.7 | 90.1 | 90.89
DCN | 89.16 | 92.98 | 91.03 | 92.57 | 88.55 | 90.52
CCB (ours) | 93.19 | 92.67 | 92.93 | 91.8 | 92.38 | 92.09
Through the comparison of the experimental results in Table 2, it is observed that our model achieves the best performance in Chinese sentiment analysis. The precision of our model on the positive comments reaches 93.19%, which is about 3% higher than that of the other two models, and the F1-scores on the positive and negative comments are 92.93% and 92.09%, respectively, also an improvement over LR_all and DCN. Among the three models, the performance of DCN is relatively poor, probably because its structure is relatively deeper and its parameters are more difficult to converge. In addition, the F1-scores of LR_all and DCN in our experiment are a little lower than those reported in [20] and [4]. Perhaps this is because those two models were designed for English rather than Chinese: they do not take advantage of Chinese features, so their performance drops, and they are unable to fully handle the impact of phrase segmentation in Chinese. Our model, in contrast, learns phrase semantics through the convolutions of size 3 and 5.

(C) Impact of model depth

The effect of model depth is studied by adding and removing CCBs. The number of convolution kernels doubles as the depth increases by one layer; for example, if the model has three layers of CCBs tiered, the number of convolution kernels in the first layer is 128, and the numbers of kernels in the second and third layers are 256 and 512, respectively. Fig. 6 shows how the accuracy of Chinese sentiment analysis changes with the depth of the model. As seen from the results, the accuracy improves as the model depth increases. When the depth of the model is 1, the accuracy is only 91.73%; when the depth grows to 3, the accuracy rises to 92.49%; and when the depth is 5, the accuracy improves by only 0.09% compared with depth 4, which is not a distinct improvement. However, the deeper the model is, the more difficult it is to train and converge: for deeper models, the gradient dispersion effect can easily emerge under stochastic gradient descent, making it difficult to find the global minimum of the loss function during back propagation. So the default depth of the CCB based model stays at 5.

Fig. 6. The effect of network depth on accuracy.

(D) Impact of sentence length

This section examines the effect of sentence length on model accuracy. In this experiment, the sentence length varies from 60 to 120. The experimental results are shown in Fig. 7.
It can be seen that with the increase of sentence length, the accuracy of the model also increases. When the length of the sentence is 120, the accuracy is 92.53%, which is 0.49% higher than that with sentence length 60. The longer the sentence is, the more effectively the parallel convolution in the CCB captures short and long context dependencies. Besides, long sentences help mitigate the gradient vanishing problem.

Fig. 7. The effect of sentence length on accuracy.

(E) Impact of gate convolution

Section 3 introduces the gate convolution and explains why we use it. This section examines the effect of gate convolution on the accuracy of the model. The experimental results are shown in Table 3.

Table 3 The accuracy with or without gate convolution.
Model | Accuracy
Model with gate | 92.53
Model without gate | 91.4

From the experimental results we can see that after using the gate convolution, our model obtains an accuracy of 92.53% on the Chinese sentiment analysis dataset, 1.13% higher than the model without the gate convolution. It is therefore thought to be necessary to adopt the gate convolution, which effectively controls how much of the information of the parallel convolution can flow into the next layer; some trivial features are filtered out, and the more important ones remain.

7. Conclusion

This paper offered an empirical study on Chinese sentiment analysis with a convolutional neural network. Considering the word connection in phrases and sentences, we designed a parallel convolution with different kernel sizes and developed a construct called the convolution control block. Our CCB based model captures short and long context dependencies. It is evaluated on the Million Chinese Hotel Review dataset, and the results show that our model performs best compared to the LR_all and DCN models. The experimental results also show that the model depth, the sentence length, and the gate convolution are related to the accuracy of sentiment classification, which offers guidance for model configuration. Our model is of practical value in comment or instant message mining. However, we simplified sentiment analysis to a binary classification problem; positive and negative are not enough to cover all kinds of emotions, and a more complex network should be studied in the future. To date, the convolutional neural network has been successfully applied in the field of computer vision, where the depth of some typical network architectures has reached hundreds of layers. For natural language processing tasks, convolutional neural network models are still very shallow. Therefore, making full use of the convolutional neural network to extract higher-level features automatically and to establish a deeper model that solves multinomial sentiment classification tasks is promising.

References
[1] Y. Bengio, P. Vincent, C. Janvin, A neural probabilistic language model, J. Mach. Learn. Res. 3 (6) (2003) 1137–1155.
[2] S.F. Chen, J. Goodman, An empirical study of smoothing techniques for language modeling, in: Meeting on Association for Computational Linguistics, 1996, pp. 310–318.
[3] R. Collobert, Deep learning for efficient discriminative parsing, in: International Conference on Artificial Intelligence and Statistics, 2011, pp. 224–232.
[4] A. Conneau, H. Schwenk, L. Barrault, Y. LeCun, Very deep convolutional networks for natural language processing, 2016.
[5] Y.N. Dauphin, A. Fan, M. Auli, D. Grangier, Language modeling with gated convolutional networks, 2016.
[6] A. Esuli, F. Sebastiani, SentiWordNet: a publicly available lexical resource for opinion mining, in: Proceedings of the 5th Conference on Language Resources and Evaluation (LREC'06), 2006, pp. 417–422.
[7] X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, J. Mach. Learn. Res. 9 (2010) 249–256.
[8] A. Graves, Supervised sequence labelling, in: Supervised Sequence Labelling with Recurrent Neural Networks, Springer Berlin Heidelberg, Berlin, Heidelberg, 2012, pp. 5–13.
[9] B. Gu, V.S. Sheng, K.Y. Tay, W. Romano, S. Li, Incremental support vector learning for ordinal regression, IEEE Trans. Neural Netw. Learn. Syst. 26 (7) (2015) 1403–1416.
[10] V. Hatzivassiloglou, K.R. McKeown, Predicting the semantic orientation of adjectives, in: Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics, ACL '98, 1997, pp. 174–181.
[11] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (8) (1997) 1735–1780.
[12] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (8) (2014) 1735–1780.
[13] S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by reducing internal covariate shift, 2015.
[14] J. Kamps, M. Marx, Words with attitude, in: Proceedings of the International Conference on Global WordNet, 2002, pp. 332–341.
[15] R. Józefowicz, O. Vinyals, M. Schuster, N. Shazeer, Y. Wu, Exploring the limits of language modeling, 2016. URL http://arxiv.org/abs/1602.02410.
[16] N. Kalchbrenner, E. Grefenstette, P. Blunsom, A convolutional neural network for modelling sentences, 2014.
[17] Y. Kim, Convolutional neural networks for sentence classification, 2014. URL http://arxiv.org/abs/1408.5882.
[18] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012.
[19] J. Li, E. Hovy, Sentiment analysis on the people's daily, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 467–476.
[20] Y. Lin, H. Lei, J. Wu, X. Li, An empirical study on sentiment classification of Chinese review using word embedding, 2015.
[21] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, in: International Conference on Learning Representations, 2013.
[22] T. Mikolov, I. Sutskever, K. Chen, G.S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, K.Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 26, Curran Associates, Inc., 2013, pp. 3111–3119.
[23] G. Paltoglou, M. Thelwall, More than bag-of-words: sentence-based document representation for sentiment analysis, in: Recent Advances in Natural Language Processing (RANLP), 2013, pp. 546–552.
[24] Z. Qu, J. Keeney, S. Robitzsch, F. Zaman, X. Wang, Multilevel pattern mining architecture for automatic network monitoring in heterogeneous wireless communication networks, China Commun. 13 (7) (2016) 108–116.
[25] E. Riloff, J. Shepherd, A corpus-based approach for building semantic lexicons, in: Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, 1997, pp. 117–124.
[26] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning internal representation by back-propagation of errors, Nature 323 (1986) 533–536.
[27] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning representations by back-propagating errors, in: J.A. Anderson, E. Rosenfeld (Eds.), Neurocomputing: Foundations of Research, MIT Press, Cambridge, MA, USA, 1988, pp. 696–699.
[28] C.N.D. Santos, M. Gatti, Deep convolutional neural networks for sentiment analysis of short texts, in: International Conference on Computational Linguistics, 2014, pp. 69–78.
[29] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. Manning, A. Ng, C. Potts, Recursive deep models for semantic compositionality over a sentiment treebank, in: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2013, pp. 1631–1642.
[30] K. Tai, R. Socher, C. Manning, Improved semantic representations from tree-structured long short-term memory networks, in: The 53rd Annual Meeting of the Association for Computational Linguistics (ACL 2015), 2015, pp. 1556–1566.
[31] Q. Tian, S. Chen, Cross-heterogeneous-database age estimation through correlation representation learning, Neurocomputing 238 (2017) 286–295.
[32] B. Wang, X. Gu, L. Ma, S. Yan, Temperature error correction based on BP neural network in meteorological wireless sensor network, Int. J. Sensor Netw. 23 (4) (2017) 265–278.
[33] D. Wang, F. Li, Sentiment analysis of Chinese microblogs based on layered features, in: C.K. Loo, K.S. Yap, K.W. Wong, A. Teoh, K. Huang (Eds.), Neural Information Processing: 21st International Conference, ICONIP 2014, Kuching, Malaysia, November 3–6, 2014, Proceedings, Part II, 2014, pp. 361–368.
[34] R.J. Williams, D. Zipser, Gradient-based learning algorithms for recurrent networks and their computational complexity, in: Y. Chauvin, D.E. Rumelhart (Eds.), Backpropagation, L. Erlbaum Associates Inc., Hillsdale, NJ, USA, 1995, pp. 433–486.
[35] N. Yu, S. Kübler, Semi-supervised learning for opinion detection, in: Proceedings of the 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, Volume 03, WI-IAT '10, 2010, pp. 249–252.
[36] M.D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in: D. Fleet, T. Pajdla, B. Schiele, T. Tuytelaars (Eds.), Computer Vision, ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part I, Springer International Publishing, 2014, pp. 818–833.
[37] X. Zhang, J.J. Zhao, Y. LeCun, Character-level convolutional networks for text classification, Neural Inf. Process. Syst. (2015) 649–657.
Zheng Xiao received his Ph.D. in computer science from Fudan University, China, in 2009, and his B.Sc. in communication engineering from Hunan University in 2003. He is now an assistant professor in the College of Information Science and Engineering of Hunan University. His research interests include high-performance computing, distributed artificial intelligence, and intelligent information processing.
Xiong Li is an associate professor at the School of Computer Science and Engineering of the Hunan University of Science and Technology (HNUST), China. He received his master's degree in mathematics and cryptography from Shaanxi Normal University (SNNU), China, in 2009, and his Ph.D. degree in computer science and technology from Beijing University of Posts and Telecommunications (BUPT), China, in 2012. He has published more than 50 refereed journal papers in his research areas, which include cryptography, information security, and cloud computing security. He has served as a TPC member of several international conferences on information security and as a reviewer for more than 30 ISI indexed journals. He is a winner of the Journal of Network and Computer Applications 2015 best research paper award.
Le Wang received his bachelor's degree from Hunan University of Science and Engineering in 2012. He is now a graduate student in the College of Computer Science and Electronic Engineering of Hunan University. His research interests include deep learning, transfer learning, and natural language processing.
Qiuwei Yang received the M.S. and Ph.D. degrees in Computer Science and Electronics from Huazhong University of Science and Technology in 2008. He is currently an assistant professor in the Department of Computer Science and Electrical Engineering (CSEE) at Hunan University, China (April 2008 to present), and a post doctor at present. His interests are multimedia security, network security, and privacy protection.
Jiayi Du received his B.Sc., M.Sc., and Ph.D. in computer science from Hunan University, China, in 2004, 2010, and 2015, respectively. He is currently an assistant professor at Central South Forest Technology University, China. His research interests include modeling and scheduling for parallel and distributed computing systems, embedded system computing, cloud computing, parallel system reliability, and parallel algorithms.
Arun Kumar Sangaiah received his Ph.D. degree in Computer Science and Engineering from VIT University, Vellore, India. He is presently working as an Associate Professor in the School of Computer Science and Engineering, VIT University, India. His areas of interest include software engineering, computational intelligence, wireless networks, bio-informatics, and embedded systems. He has authored more than 100 publications in different journals and conferences of national and international repute. His current research work includes global software development, wireless ad hoc and sensor networks, machine learning, cognitive networks, and advances in mobile computing and communications. Moreover, he has carried out a number of funded research projects for Indian government agencies and has registered one Indian patent in the area of computational intelligence. Prof. Sangaiah also serves as an editorial board member or associate editor of various international journals.