Accepted Manuscript

Textual sentiment analysis via three different attention convolutional neural networks and cross-modality consistent regression
Zufan Zhang, Yang Zou, Chenquan Gan

To appear in: Neurocomputing
PII: S0925-2312(17)31609-0
DOI: 10.1016/j.neucom.2017.09.080
Reference: NEUCOM 18963
Communicated by Dr. Nianyin Zeng
Received date: 25 January 2017; Revised date: 27 June 2017; Accepted date: 24 September 2017

Please cite this article as: Zufan Zhang, Yang Zou, Chenquan Gan, Textual sentiment analysis via three different attention convolutional neural networks and cross-modality consistent regression, Neurocomputing (2017), doi: 10.1016/j.neucom.2017.09.080.
Textual sentiment analysis via three different attention convolutional neural networks and cross-modality consistent regression
Zufan Zhang, Yang Zou∗, Chenquan Gan
School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
∗Corresponding author. Email addresses: [email protected] (Zufan Zhang), [email protected] (Yang Zou), [email protected] (Chenquan Gan)
Abstract
Word embeddings and the CNN (Convolutional Neural Network) architecture are crucial ingredients of sentiment analysis. However, sentiment and lexicon embeddings are rarely used, and CNNs are unable to capture the global features of a sentence. To this end, semantic embeddings, sentiment embeddings and lexicon embeddings are applied for text encoding, and three different attention mechanisms, namely an attention vector, LSTM (Long Short-Term Memory) attention and attentive pooling, are integrated with the CNN model in this paper. Additionally, each word and its context are explored to disambiguate the meaning of the word and enrich the input representation. To improve the performance of the three attention CNN models, CCR (Cross-modality Consistent Regression) and transfer learning are presented. It is worth noting that CCR and transfer learning are used in textual sentiment analysis for the first time. Finally, experiments on two different datasets demonstrate that the proposed attention CNN models achieve the best or next-best results against existing state-of-the-art models.
Keywords: Textual sentiment analysis, Word embedding, Lexicon embedding, Attention mechanism, Cross-modality consistent regression
1. Introduction
Recent years have witnessed the rapid development of information technology and a revolutionary transformation of social media. For example, websites such as Facebook, Twitter, Flickr, Weibo and IMDB (Internet Movie Database), where people can express their sentiments or emotions by uploading text, pictures or videos, have grown increasingly popular. Meanwhile, a large amount of user data, which is widely applied in public opinion analysis and product
recommendation [1], is generated every moment. Indeed, how to extract sentiment polarity from user data, a problem involving text mining and NLP (Natural Language Processing), has become an important research direction of sentiment analysis. Generally, sentiment analysis, which mainly focuses on text analysis (especially short-text sentiment analysis), covers two scenarios: tweets and reviews. In the former case, the text is more colloquial and contains more popular words than in the latter case.

Currently, approaches to extracting sentiment from sentences fall into two categories: lexicon-based sentiment analysis and machine-learning-based sentiment classification. The former employs lexicons to compute the sentiment score of each word in a sentence, from which the sentiment score of the sentence can be obtained. The effectiveness of this method has been verified by previous work (e.g., see Refs. [2, 3, 4]). Traditional machine learning methods for sentiment analysis require hand-crafted features, which can be very difficult to design, so automatically extracting features is of great significance. In this paper, word embeddings, CNNs and attention mechanisms are investigated.

Because numerous NLP tasks depend heavily on word embeddings, many word embedding models, such as NNLM (Neural Network Language Model) [5], C&W (Collobert & Weston) [6], CBOW (Continuous Bag-of-Words) [7], Skip-gram [7] and SSWE (Sentiment-Specific Word Embedding) [8], have been proposed. The NNLM, C&W, CBOW and Skip-gram models are used to construct semantic embeddings. By analyzing the shortcomings of current semantic embeddings, Tang et al. [8] built on the C&W model and presented the SSWE model to train sentiment embeddings for sentiment classification. However, SSWE inevitably suffers from high time complexity because of its non-linear hidden layer. To remedy this flaw, this paper applies the SSWE idea to the Skip-gram model. Additionally, lexicon embeddings are also examined; different from previous work (see Refs. [9, 10]), this paper collects many different kinds of sentiment lexicons.

With the development of deep learning [11], many deep neural networks, such as convolutional neural networks and deep belief networks, have been applied to many areas [12]. Sentiment analysis, as one of those areas, has achieved great success (e.g., see Refs. [13, 14, 15, 16]). In Ref. [13], a one-layer CNN, which utilized a variety of convolution kernels to capture n-gram features, was applied to sentence classification with one kind of semantic embeddings. On this basis, Ref. [17] incorporated different types of word embeddings. To disambiguate the meaning of a word in a sentence, a standard RNN (Recurrent Neural Network) was used in Ref. [18]; this paper employs an LSTM instead. A common defect of the above-mentioned work is that it can only capture local features, so an attention mechanism is considered in this paper.

Attention mechanisms have been successfully applied to many fields, such as machine translation, reasoning about entailment and sentiment classification (e.g., see Refs. [19, 20, 21]). In sentiment analysis, attention is usually considered for document-level or aspect-level sentiment classification, while general sentiment classification is neglected. Recently, Ref. [10] applied an attention vector to capture global features of the sentence for this problem. However, attention weights are only assigned to single words.
In this paper, not only the attention weights of single words but also the attention weights of multiple words are considered. In addition, two other attention mechanisms are examined, whose single-word attention weights are computed by an LSTM and by attentive pooling [22], respectively. Most importantly, CCR [23] is presented to impose consistency constraints across related but different modalities, and transfer learning is used to yield further improvement.

To sum up, this paper constructs three different embeddings (i.e., semantic embeddings, sentiment embeddings and lexicon embeddings) for text encoding. Based on the extracted local features, this paper employs three attention mechanisms (i.e., LSTM attention, attentive pooling and an attention vector) to extract global features of the sentence, and hand-crafted features are also captured to serve as additional information. In addition, an LSTM is used to capture the context of each word in the sentence to disambiguate the meaning of the word. CCR and transfer learning are presented to improve the performance of the proposed attention CNN models. Experiments on two different datasets are also reported to show the robustness and effectiveness of the proposed models.

The organization of the rest of the paper is as follows. Section 2 introduces three different word embeddings. Section 3 describes three different attention mechanisms, cross-modality consistent regression and transfer learning. Experiments are presented in Section 4. Finally, Section 5 summarizes this work.

2. Word embeddings
Word representation plays a critical role in many NLP tasks. Decent word embeddings can better encode texts and improve system performance. In this section, three different word embeddings trained under different regimes are introduced.
2.1. Semantic embeddings
The Skip-gram model trains semantic embeddings by predicting the context words of a target word. Hierarchical softmax and negative sampling are used to accelerate the training procedure because of the large vocabulary size. It is worth noting that Skip-gram is capable of capturing semantic relations between words. The loss function of the model minimizes the negative log probability:

L_semantic = − Σ_{(w,c)∈D} Σ_{w_j∈c} log P(w_j | w),   (1)

where D is the whole corpus, w denotes a target word, and c is the context of w.

Figure 1: Sentiment embeddings model

2.2. Sentiment embeddings
Semantic embeddings ignore the sentiment polarity of a word in a sentence and map words with similar semantic context but opposite sentiment polarity into nearby vectors. To integrate sentiment information into Skip-gram, inspired by Ref. [8], a softmax layer is added (see Figure 1), so the model can learn word embeddings with both sentiment and semantic information. The training loss of the semantic part is the same as for the semantic embeddings, and the sentiment part minimizes the negative log probability:

L_sentiment = − Σ_y y log(y_pred),   (2)
where y_pred = softmax(X, W) is the predicted sentiment label, X is the average vector of the target word and its context, W is a parameter matrix, and y is the gold sentiment label. The final loss function of the sentiment embeddings is a linear combination of the semantic and sentiment parts:

L = α L_semantic + (1 − α) L_sentiment.   (3)

In [8], empirical experiments have shown that the best performance is achieved when α ∈ [0.4, 0.5].
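To make Eqs. (2) and (3) concrete, the following Python/NumPy sketch computes the sentiment part of the loss and the combined objective for one training example. All names, shapes and the placeholder semantic-loss value are our own illustrative choices, not the authors' implementation.

import numpy as np

def softmax(z):
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def sentiment_loss(X, W, y):
    """Eq. (2): cross-entropy between the gold label y and softmax(X W)."""
    y_pred = softmax(X @ W)              # predicted sentiment distribution
    return -np.sum(y * np.log(y_pred + 1e-12))

# Toy example: 100-d averaged vector of a target word and its context,
# 3 sentiment classes (negative / neutral / positive).
rng = np.random.default_rng(0)
X = rng.normal(size=100)                 # average of target + context embeddings
W = rng.normal(scale=0.01, size=(100, 3))
y = np.array([0.0, 0.0, 1.0])            # gold label: positive

alpha = 0.45                             # weight in [0.4, 0.5] as reported in [8]
L_semantic = 2.7                         # placeholder value standing in for Eq. (1)
L = alpha * L_semantic + (1 - alpha) * sentiment_loss(X, W, y)   # Eq. (3)
print(L)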
2.3. Lexicon embeddings
Sentiment lexicons are valuable resources that cannot be neglected. In [9], only a few lexicons are employed to construct lexicon embeddings, and in [10], only lexicons that contain sentiment scores are considered. Moreover, most lexicons are constructed from tweets, so it is worth incorporating more different kinds of sentiment lexicons to improve the coverage. The lexicons used in this paper are as follows:
• MPQA [4]
• Bing Liu Opinion Lexicon [2]
• NRC Emotion Lexicon [24]
• TS-Lex [25]
• SSPE-Lex [25]
Figure 2: Comparison between sentiment and semantic embeddings — (a) closest words to "good" for sentiment; (b) closest words to "good" for semantic; (c) closest words to "bad" for sentiment; (d) closest words to "bad" for semantic
• NRC Hashtag Sentiment Lexicon [26]
• MaxDiff Twitter Lexicon [4]
• Sentiment140 Lexicon [4]
• NRC Sentiment140 AffLexNegLex [4]
• Hashtag Sentiment AffLexNegLex [4]
• NRC Amazon Laptop Sentiment Lexicons [27]
• NRC Yelp Restaurant Sentiment Lexicons [27]
2.4. Comparison and analysis
Differences between sentiment and semantic embeddings are explored (see Figure 2). Sentiment embeddings can distinguish the sentiment polarity of a word. For example, "good" and "bad" are mapped into close vectors in the semantic embedding space; on the contrary, they are mapped into distant (opposite) vectors in the sentiment embedding space.

Many methods have been proposed to fuse multiple word embeddings. For example, one type of word embeddings can be appended to the end of another (e.g., see Refs. [10, 15]), each set of embeddings can be treated as a "channel" (see Refs. [10, 13]), or each set can be taken independently [13, 15]. Empirical results have shown that treating the embedding sets independently achieves the best performance [13].
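The three fusion options can be stated in tensor terms. The sketch below is an illustration under assumed toy dimensions, not the authors' code; it contrasts concatenation, channel stacking and the independent treatment adopted in this paper.

import numpy as np

n, d_sem, d_sent = 8, 100, 100           # 8 words, two 100-d embedding sets
rng = np.random.default_rng(1)
D_sem  = rng.normal(size=(n, d_sem))     # semantic document matrix
D_sent = rng.normal(size=(n, d_sent))    # sentiment document matrix

# (a) append one set of embeddings to the end of the other
D_concat = np.concatenate([D_sem, D_sent], axis=1)       # shape (n, 200)

# (b) treat each set as a "channel" of the same image-like input
D_channels = np.stack([D_sem, D_sent], axis=0)            # shape (2, n, 100)

# (c) treat each set independently: run a separate CNN per set and
#     fuse the resulting features later (the best option in the comparison above)
inputs = {"semantic": D_sem, "sentiment": D_sent}
print(D_concat.shape, D_channels.shape, list(inputs))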
3. Multi-modality sentiment classification

Previous work has verified the effectiveness of the CNN architecture and of LSTM networks in sentiment analysis. These two common frameworks are described first, then the three attention mechanisms are introduced, and finally CCR and transfer learning are detailed.
3.1. LSTM & CNN
RNNs (Recurrent Neural Networks) can model sequence tasks and have been widely applied to NLP. However, considering that a standard RNN suffers from exploding or vanishing gradients, the LSTM was proposed to tackle this problem by introducing a memory cell that is able to preserve state over long periods of time [28]. The LSTM uses an input gate i_t, a forget gate f_t, an output gate o_t and a memory cell c_t to compute the hidden state of each word. Formally, h_t is defined by:

f_t = σ(W_f · [h_{t−1}, x_t] + b_f),   (4)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i),   (5)
c̃_t = tanh(W_c · [h_{t−1}, x_t] + b_c),   (6)
c_t = f_t ∗ c_{t−1} + i_t ∗ c̃_t,   (7)
o_t = σ(W_o · [h_{t−1}, x_t] + b_o),   (8)
h_t = o_t ∗ tanh(c_t),   (9)
where x_t is the embedding of word w_t, σ denotes the sigmoid function, ∗ is element-wise multiplication, and the remaining symbols are LSTM parameters. In this paper, the bidirectional extension of the LSTM (Bi-LSTM) [29] is applied to better encode the sequence. The Bi-LSTM maps each word to a pair of hidden vectors h_t^f and h_t^b, which denote the hidden vectors of the forward and backward LSTM, respectively.

The CNN architecture is used to capture local features of the sentence. For each set of word embeddings, the document matrix of a sentence is denoted as D^i ∈ R^{n×d^i}, where n is the number of words in the sentence and d^i is the dimension of the corresponding word embeddings. A word can have a variety of meanings; for example, for the bigram "Adam's apple" in the sentence "His Adam's apple bobbed in his throat", it is hard to tell whether it refers to a body part or to the fruit. To disambiguate the meaning of a word, an LSTM rather than a standard RNN [16] is employed to capture the contextual information of the word. The left-side context of each word in D^i is denoted as c_l ∈ R^{d^l} and the right-side context as c_r ∈ R^{d^r}. The contextual information is then integrated into the document matrix to build the input matrix D̂^i ∈ R^{n×(d^l+d^i+d^r)}. The input matrix is fed into the convolutional layer and convolved with weights w_c ∈ R^{l×(d^l+d^i+d^r)} to obtain the convolution matrix S^i ∈ R^{(n−l+1)×m_c^i}, where l is the length of the filter and m_c^i is the number of convolution filters. A max pooling operation is performed on each column of the convolution matrix.

However, local features, which account only for a local view rather than the global view of the document [30], are insufficient for sentiment classification. To overcome this shortage, three attention mechanisms are presented to capture global features. Furthermore, given that lexicon embeddings represent linguistic features of the word, only the semantic and sentiment embeddings are used to capture global features.
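As an illustration of the pipeline above, the following NumPy sketch builds the input matrix D̂ from word embeddings and left/right context vectors, convolves it with filters of length l, and max-pools each column to obtain local features. The random context vectors, the tanh nonlinearity and all dimensions are illustrative assumptions; in the actual models the context vectors come from the Bi-LSTM.

import numpy as np

def convolve(D_hat, filters):
    """Slide each filter of length l over the rows of D_hat.

    D_hat:   (n, d) input matrix (context + word embedding per row)
    filters: (m_c, l, d) convolution kernels
    returns: (n - l + 1, m_c) convolution matrix S
    """
    m_c, l, d = filters.shape
    n = D_hat.shape[0]
    S = np.zeros((n - l + 1, m_c))
    for i in range(n - l + 1):
        window = D_hat[i:i + l]                        # l consecutive rows
        S[i] = np.tensordot(filters, window, axes=([1, 2], [0, 1]))
    return np.tanh(S)                                  # nonlinearity (illustrative)

# Toy sentence: n = 10 words, word embeddings d_i = 100,
# left/right context vectors d_l = d_r = 50 (stand-ins for Bi-LSTM outputs).
rng = np.random.default_rng(2)
n, d_i, d_l, d_r = 10, 100, 50, 50
E  = rng.normal(size=(n, d_i))                         # word embeddings
cl = rng.normal(size=(n, d_l))                         # left-side context
cr = rng.normal(size=(n, d_r))                         # right-side context

D_hat = np.concatenate([cl, E, cr], axis=1)            # (n, d_l + d_i + d_r)
filters = rng.normal(scale=0.1, size=(100, 3, d_l + d_i + d_r))  # m_c = 100, l = 3
S = convolve(D_hat, filters)                           # (n - l + 1, m_c)
local_features = S.max(axis=0)                         # max pooling over each column
print(S.shape, local_features.shape)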
Figure 3: LSTM attention
3.2. LSTM attention
Intuitively, not all words contribute equally to the sentiment of a sentence. A straightforward way to reflect this is to assign different weights to words. The proposed model 1 employs a Bi-LSTM to compute the attention weight of each word and is regarded as a baseline attention model. The structure of the Bi-LSTM attention is shown in Figure 3, and the attention weight of each word is computed as follows:

e_i^s = v^T tanh(W^s h_i^s + b^s),   (10)
α_i^s = exp(e_i^s) / Σ_{j=1}^{T} exp(e_j^s),   (11)
where T is the number of words, α_i^s is the attention weight of the i-th word in sentence s, and h_i^s is the concatenated hidden vector. The transposed input matrix is then multiplied by the computed attention weights to obtain the global features.
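A minimal NumPy sketch of Eqs. (10)-(11) and of the subsequent weighting of the input matrix is given below; the dimensions and parameter initializations are illustrative assumptions rather than the settings of the trained model.

import numpy as np

def attention_weights(H, W_s, b_s, v):
    """Eqs. (10)-(11): score each concatenated hidden vector and normalize.

    H: (T, 2*h) Bi-LSTM hidden vectors (forward and backward concatenated)
    returns: (T,) attention weights that sum to 1
    """
    e = np.tanh(H @ W_s.T + b_s) @ v        # e_i = v^T tanh(W^s h_i + b^s)
    e = e - e.max()                          # numerical stability
    a = np.exp(e)
    return a / a.sum()

rng = np.random.default_rng(3)
T, h, d_att = 10, 50, 64                     # 10 words, 50-d hidden states, 64-d attention space
H   = rng.normal(size=(T, 2 * h))            # concatenated [h_f ; h_b] per word
W_s = rng.normal(scale=0.1, size=(d_att, 2 * h))
b_s = np.zeros(d_att)
v   = rng.normal(scale=0.1, size=d_att)

alpha = attention_weights(H, W_s, b_s, v)    # (T,)
D_hat = rng.normal(size=(T, 200))            # toy input matrix standing in for D̂
global_features = D_hat.T @ alpha            # weighted combination of word rows
print(alpha.sum(), global_features.shape)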
3.3. Attentive pooling
Attentive pooling is an attention mechanism that has been employed for relation classification [22, 31]; the proposed model 2 adopts it for sentiment classification. Different from max pooling, attentive pooling is able to determine the importance of individual windows in S^i by introducing a correlation matrix. Some of these windows represent meaningful words in the sentence, so in this paper attentive pooling is used to determine the importance of each word.

Define W^{class} as the class embeddings, whose columns encode the distributed vector representations of the class labels. The dimension of each class embedding must equal the number of convolution kernels. The class embeddings are parameters learned by the network and are initialized by randomly sampling each value from a uniform distribution [22] U(−β, β), where β = √(6/(|C| + m_c)) and |C| is the number of classes. A correlation matrix G is first created to capture the connections between the convolution matrix and the class embeddings:

G = S^i U W^{class},   (12)

where U is a weighting matrix learned by the network and the convolution matrix is produced by convolution kernels with a filter length of one. A softmax is then applied to each row of the correlation matrix G to obtain the attentive pooling matrix A. A is multiplied by the transposed input matrix D̂^T to highlight important components, and a max operation is applied to select the global features:

w_q^o = max_p (D̂^T A)_{p,q}.   (13)
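The following sketch illustrates Eqs. (12)-(13) with toy dimensions: a correlation matrix, a row-wise softmax and a max selection. It is an illustration of the reconstructed equations rather than the authors' implementation, and all shapes and initializations are assumptions.

import numpy as np

def softmax_rows(M):
    M = M - M.max(axis=1, keepdims=True)
    e = np.exp(M)
    return e / e.sum(axis=1, keepdims=True)

def attentive_pooling(S, U, W_class, D_hat):
    """Eqs. (12)-(13): correlation matrix, row-wise softmax, max selection.

    S:       (n, m_c) convolution matrix from filters of length one
    U:       (m_c, m_c) learned weighting matrix
    W_class: (m_c, |C|) class embeddings, one column per class
    D_hat:   (n, d) input matrix
    """
    G = S @ U @ W_class                 # Eq. (12), shape (n, |C|)
    A = softmax_rows(G)                 # attentive pooling matrix
    weighted = D_hat.T @ A              # (d, |C|): highlight important rows
    return weighted.max(axis=0)         # Eq. (13): max over p for each column q

rng = np.random.default_rng(4)
n, m_c, n_classes, d = 10, 100, 3, 200
beta = np.sqrt(6.0 / (n_classes + m_c))                # initialization range from [22]
S = rng.normal(size=(n, m_c))
U = rng.normal(scale=0.1, size=(m_c, m_c))
W_class = rng.uniform(-beta, beta, size=(m_c, n_classes))
D_hat = rng.normal(size=(n, d))
print(attentive_pooling(S, U, W_class, D_hat).shape)   # (|C|,)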
3.4. Attention vector
Attention weights are useful for extracting the sentiment polarity of a sentence. However, the attention mechanisms mentioned above, as well as the attention vector proposed by Ref. [10], only consider the attention weight of a single word. To ameliorate this issue, the proposed model 3 extends the attention mechanism to multiple words. Convolution matrices produced with different filter lengths vary in size; max pooling is executed over each row of the convolution matrix to obtain the attention vector (see Figure 4), and the corresponding transposed input matrix is multiplied by it to extract global features, as sketched below. To simplify the model, only attention weights of single words and bigrams are considered. In particular, for filter length l = 2, the input matrix is denoted as D̂_{l2} ∈ R^{(n−1)×2(d^l+d^i+d^r)}, whose i-th row is r̂_i = [r_i, r_{i+1}], where r_i is the i-th row vector of D̂.

Figure 4: Attention vector — (a) 1-gram attention vector; (b) 2-gram attention vector
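A sketch of the attention-vector computation for unigrams and bigrams follows. The construction of D̂_{l2} by concatenating adjacent rows matches the description above, while the filter shapes, the tanh nonlinearity and all dimensions are illustrative assumptions.

import numpy as np

def ngram_attention(D_hat, filters):
    """Attention vector for one filter length (sketch of model 3).

    D_hat:   (rows, d) input matrix (rows = n for unigrams, n-1 for bigrams)
    filters: (m_c, d) kernels applied to single rows of D_hat
    """
    S = np.tanh(D_hat @ filters.T)       # (rows, m_c) convolution matrix
    a = S.max(axis=1)                    # max over each row -> attention vector
    return D_hat.T @ a                   # transposed input matrix times the vector

rng = np.random.default_rng(5)
n, d = 10, 200                           # d = d_l + d_i + d_r
D1 = rng.normal(size=(n, d))             # unigram input matrix D̂
# bigram input matrix D̂_l2: row i is the concatenation [r_i, r_{i+1}]
D2 = np.concatenate([D1[:-1], D1[1:]], axis=1)          # (n-1, 2d)

f1 = rng.normal(scale=0.1, size=(100, d))
f2 = rng.normal(scale=0.1, size=(100, 2 * d))
g1 = ngram_attention(D1, f1)             # single-word attention features
g2 = ngram_attention(D2, f2)             # bigram attention features
print(g1.shape, g2.shape)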
Figure 5: The integrated model (three attention CNNs over semantic, sentiment and lexicon embeddings; their local and global features, together with hand-crafted features, are fused by CCR)
3.5. Hand-crafted features
Apart from local and global features, there are hand-crafted features, such as punctuation, all-caps and emoticons, that cannot be ignored, especially in Twitter sentiment analysis. In addition, negation-based and POS-based features are also useful. The extracted hand-crafted features can be categorized as follows: morphological features (the number of question marks and exclamation marks, and the existence of a question mark at the message's end), POS-based features (the number of adjectives, verbs and subjective emoticons), and negation-based features (the number of negative words).
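A possible realization of these hand-crafted features is sketched below. The emoticon pattern, the negation list and the POS tag names (taken from the Tweet NLP tagset) are simplified assumptions for illustration, not the exact feature extractor used in the experiments.

import re

NEGATIONS = {"not", "no", "never", "n't", "cannot"}      # illustrative list

def hand_crafted_features(tokens, pos_tags):
    """Morphological, POS-based and negation-based counts for one message.

    tokens:   list of word strings
    pos_tags: list of POS tags aligned with tokens (e.g. from Tweet NLP)
    """
    text = " ".join(tokens)
    return {
        "n_question_marks":    text.count("?"),
        "n_exclamation_marks": text.count("!"),
        "ends_with_question":  int(tokens[-1].endswith("?")) if tokens else 0,
        "n_all_caps":          sum(t.isupper() and len(t) > 1 for t in tokens),
        "n_emoticons":         len(re.findall(r"[:;=][\-o]?[)(DPp]", text)),
        "n_adjectives":        sum(tag == "A" for tag in pos_tags),  # Tweet NLP adjective tag
        "n_verbs":             sum(tag == "V" for tag in pos_tags),
        "n_negations":         sum(t.lower() in NEGATIONS for t in tokens),
    }

tokens   = ["This", "movie", "is", "NOT", "good", "!", "!"]
pos_tags = ["D", "N", "V", "R", "A", ",", ","]
print(hand_crafted_features(tokens, pos_tags))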
3.6. Cross-modality consistent regression
Several strategies, such as early fusion and late fusion, have been proposed for final feature fusion. Early fusion concatenates the max-pooled features of different modalities to form a single long feature vector. Late fusion builds the final probability distribution from a linear combination of all predicted probability distributions. However, neither of these fusions significantly improves model performance. In this paper, another method, cross-modality consistent regression, is applied to process the extracted features.

KL divergence is a non-symmetric measure of the difference between two probability distributions. Let p and q denote two probability distributions of the same length. Because of the asymmetry of KL divergence, a symmetrized alternative is used to measure the penalty between any two distributions:

D(p‖q) = D_KL(p‖q) + D_KL(q‖p).   (14)
CCR is integrated with our model (see Figure 5). Define x_i^m and x_i^c as the m-th modality features of the i-th training sample and the concatenation of all modality features of the i-th training sample, respectively. The objective function of the model is

J(Θ) = (1/N) Σ_{i=1}^{N} D(y_i ‖ p_{θ^c}(x_i^c)) + (λ/2) ‖θ^c‖² + (1/N) Σ_{m=1}^{M} Σ_{i=1}^{N} D(p_{θ^c}(x_i^c) ‖ p_{θ^m}(x_i^m)) + (λ/2) ‖θ^m‖²,   (15)

where

p_θ(x_i) = (1 / Σ_{k=1}^{K} e^{θ_k^T x_i}) [e^{θ_1^T x_i}, e^{θ_2^T x_i}, ..., e^{θ_K^T x_i}]^T

is the predicted probability distribution of features x_i over the classes, K is the number of classes, y_i is the gold sentiment distribution, M and N are the numbers of modalities and training samples, respectively, λ is a hyperparameter, and θ^m and θ^c are the parameters of the m-th modality classifier and of the concatenated-feature classifier. The gradients of the objective function J(Θ) with respect to θ^m and θ^c are

∂J(Θ)/∂θ^m = (1/N) Σ_{m=1}^{M} Σ_{i=1}^{N} ∂D(p_{θ^c}(x_i^c) ‖ p_{θ^m}(x_i^m))/∂θ^m + λθ^m,

∂J(Θ)/∂θ^c = (1/N) Σ_{i=1}^{N} ∂D(y_i ‖ p_{θ^c}(x_i^c))/∂θ^c + λθ^c + (1/N) Σ_{m=1}^{M} Σ_{i=1}^{N} ∂D(p_{θ^c}(x_i^c) ‖ p_{θ^m}(x_i^m))/∂θ^c.   (16)

The derivative term ∂D(p_{θ^c}(x_i^c) ‖ p_{θ^m}(x_i^m))/∂θ^m can be computed as

∂D(p_{θ^c}(x_i^c) ‖ p_{θ^m}(x_i^m))/∂θ_{jl}^m
= ∂/∂θ_{jl}^m Σ_{k=1}^{K} [p_{θ^c}(k|x_i^c) ln(p_{θ^c}(k|x_i^c)/p_{θ^m}(k|x_i^m)) + p_{θ^m}(k|x_i^m) ln(p_{θ^m}(k|x_i^m)/p_{θ^c}(k|x_i^c))]
= Σ_{k=1}^{K} [1 − ln(p_{θ^c}(k|x_i^c)/p_{θ^m}(k|x_i^m)) − p_{θ^c}(k|x_i^c)/p_{θ^m}(k|x_i^m)] ∂p_{θ^m}(k|x_i^m)/∂θ_{jl}^m,   (17)

where θ_j^m is a sub-vector of θ^m and θ_{jl}^m is its l-th element. Similarly, the gradient of D(p_{θ^c}(x_i^c) ‖ p_{θ^m}(x_i^m)) with respect to θ^c is

∂D(p_{θ^c}(x_i^c) ‖ p_{θ^m}(x_i^m))/∂θ_{jh}^c = Σ_{k=1}^{K} [1 − ln(p_{θ^m}(k|x_i^m)/p_{θ^c}(k|x_i^c)) − p_{θ^m}(k|x_i^m)/p_{θ^c}(k|x_i^c)] ∂p_{θ^c}(k|x_i^c)/∂θ_{jh}^c.   (18)

In the prediction stage, the final probability distribution is obtained by minimizing the objective function

min_p J(p | p_{θ^1}, p_{θ^2}, ..., p_{θ^M}) = Σ_{m=1}^{M} Σ_{k=1}^{K} p(k) ln(p(k)/p_{θ^m}(k)),  s.t. Σ_{k=1}^{K} p(k) = 1,   (19)

which yields the final probability for each class:

p(k) = (Π_{m=1}^{M} p_{θ^m}(k))^{1/M} / Σ_{j=1}^{K} (Π_{m=1}^{M} p_{θ^m}(j))^{1/M}.   (20)
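The following NumPy sketch ties Eqs. (14), (15) and (19)-(20) together for a single training sample: the symmetrized KL penalty, the CCR objective value and the fused prediction as a normalized geometric mean of the modality distributions. The toy distributions, the per-sample simplification (N = 1) and the summation of the regularizer over modalities are our own illustrative choices.

import numpy as np

def sym_kl(p, q, eps=1e-12):
    """Eq. (14): symmetrized KL divergence between two distributions."""
    p, q = p + eps, q + eps
    return np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))

def ccr_objective(y, p_c, p_mods, theta_c, theta_mods, lam):
    """Eq. (15) evaluated for one training sample (N = 1 for brevity)."""
    data_term  = sym_kl(y, p_c) + lam / 2 * np.sum(theta_c ** 2)
    cross_term = sum(sym_kl(p_c, p_m) for p_m in p_mods)
    reg_mods   = lam / 2 * sum(np.sum(t ** 2) for t in theta_mods)
    return data_term + cross_term + reg_mods

def fuse_prediction(p_mods):
    """Eqs. (19)-(20): fused distribution as a normalized geometric mean."""
    M = len(p_mods)
    prod = np.prod(np.stack(p_mods), axis=0) ** (1.0 / M)
    return prod / prod.sum()

# Toy example with three modalities (semantic, sentiment, lexicon) and 3 classes.
y      = np.array([0.0, 0.0, 1.0])
p_c    = np.array([0.1, 0.2, 0.7])                       # concatenated-feature prediction
p_mods = [np.array([0.2, 0.2, 0.6]),
          np.array([0.1, 0.3, 0.6]),
          np.array([0.15, 0.25, 0.6])]
theta_c    = np.zeros(10)
theta_mods = [np.zeros(10) for _ in p_mods]
print(ccr_objective(y, p_c, p_mods, theta_c, theta_mods, lam=0.01))
print(fuse_prediction(p_mods))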
3.7. Transfer learning
Inspired by Ref. [32], transfer learning repeatedly retrains the model to obtain better parameters and yield further improvement (a minimal sketch of this loop is given after the steps). Concretely, the steps are as follows:

Step 1: train the model on the training samples.
Step 2: use the trained model to label the testing samples.
Step 3: evaluate the labeled testing samples.
Step 4: combine the training samples with the testing samples to retrain the model.
Step 5: use the retrained model to label the testing samples again.
Step 6: repeat Step 3.
Step 7: combine the training samples with the testing samples whose sentiment label is consistent with all previous sentiment labels, and retrain the model.
Step 8: repeat Step 5.
Step 9: repeat Step 3.
Step 10: repeat Steps 7-9 until the model performance stays unchanged.
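A minimal sketch of this self-training loop follows. Since gold test labels are not available inside the loop, label stability across rounds is used here as a stand-in for the "performance unchanged" stopping criterion, and the dummy classifier only makes the sketch executable; the real procedure would plug in one of the attention CNN models.

def self_training(train_x, train_y, test_x, fit, predict, max_rounds=5):
    """Sketch of the Step 1-10 loop: retrain on test samples whose predicted
    label has stayed consistent across all previous rounds."""
    model = fit(train_x, train_y)                       # Step 1
    history = [predict(model, test_x)]                  # Step 2: one label per test sample
    for _ in range(max_rounds):
        # keep only test samples labeled consistently so far (Steps 4 / 7)
        consistent = [i for i in range(len(test_x))
                      if len(set(h[i] for h in history)) == 1]
        aug_x = list(train_x) + [test_x[i] for i in consistent]
        aug_y = list(train_y) + [history[-1][i] for i in consistent]
        model = fit(aug_x, aug_y)                       # retrain
        labels = predict(model, test_x)                 # relabel (Steps 5 / 8)
        if labels == history[-1]:                       # labels unchanged (Step 10 proxy)
            break
        history.append(labels)
    return model

# Dummy classifier so the sketch runs: always predicts the majority label.
def fit(xs, ys):
    return max(set(ys), key=ys.count)

def predict(model, xs):
    return [model for _ in xs]

print(self_training([1, 2, 3], ["pos", "pos", "neg"], [4, 5], fit, predict))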
Table 1: Statistics of the Twitter datasets
Corpus         Positive  Neutral  Negative   Total
Train             7,727    7,837     2,916  18,480
Dev                 884      616       279   1,779
Twt2013 test      1,572    1,649       601   3,813
Twt2014 test        982      669       202   1,853
Twt2015 test      1,038      987       365   2,390
Twt2016 test      7,059   10,342     3,213  20,632
Table 2: Statistics of the SST2 dataset
Corpus   Positive  Negative  Total
Train       3,610     3,310  6,920
Dev           428       444    872
Test          912       909  1,821
Table 3: Statistics of the SST5 dataset
Corpus   Very pos  Positive  Neutral  Negative  Very neg  Total
Train       1,288     2,322    1,624     2,218     1,092  8,544
Dev           165       279      229       289       139  1,101
Test          399       510      389       633       279  2,210

4. Experiments
The proposed models are evaluated on two different datasets: a Twitter dataset from SemEval-2016 Task 4 [33] and the SST (Stanford Sentiment Treebank) dataset constructed by Stanford University, which includes two subtasks, binary and fine-grained classification of sentence sentiment polarity. For the Twitter dataset, the three word embeddings trained by ourselves are applied for text encoding, and the evaluation metric is the macro-F1 score of the positive and negative categories. For the SST dataset, the publicly available 300-dimensional GloVe 840B vectors are used as semantic embeddings while the other two embeddings remain unchanged, and accuracy is used as the evaluation metric. The statistics of the two datasets are listed in Tables 1, 2 and 3, and the hyperparameter settings are presented in Table 4.

4.1. Data preprocessing
In both datasets, all sentences are converted into lowercase, and Tweet NLP is used for tokenization. In particular, for the Twitter datasets, user tags, URLs and numbers in sentences are mapped into generic tokens.
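A possible preprocessing routine is sketched below; the placeholder token spellings and regular expressions are our own assumptions, and tokenization with Tweet NLP would be applied separately.

import re

def preprocess(sentence):
    """Lowercasing plus the generic-token mapping used for the Twitter data."""
    s = sentence.lower()
    s = re.sub(r"@\w+", "<user>", s)                    # user tags
    s = re.sub(r"https?://\S+|www\.\S+", "<url>", s)    # URLs
    s = re.sub(r"\b\d+(\.\d+)?\b", "<number>", s)       # numbers
    return s

print(preprocess("@JohnDoe I rated it 9.5 - review at https://example.com !"))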
Table 4: Hyperparameters
Parameter       Parameter name                Value
d1, d2, d3      dimension of each embedding   100/300, 100, 19
m1c, m2c, m3c   filter number                 100, 100, 9
l               filter length                 1, 2, 3, 4, 5
dl, dr          dimension of hidden vector    50, 50
4.2. Results analysis
Table 5 shows the results on the Twitter test sets. Four models are presented for comparison. Webis [34] and UNITN [35] are two top-ranked teams in SemEval-2015 Task 10 [37]. SwissCheese [36] and SENSEI-LIF [17] are two top-ranked teams in SemEval-2016 Task 4. SwissCheese, which uses two-layer CNNs and an ensemble mechanism, is the state-of-the-art performer on the Twt2015 and Twt2016 test sets. UNITN and SENSEI-LIF, both of which use a one-layer CNN, achieve the best results on the Twt2013 and Twt2014 test sets, respectively. Compared with these previous models, the proposed three models achieve highly competitive results and all outperform the state-of-the-art result achieved by SwissCheese on the Twt2016 test set. Model 3 gives the highest F1-score on the Twt2016 test set, a 0.6% improvement. Although the proposed models fail to give the best results on the Twt2013-Twt2015 test sets, they achieve the next-best results and show very strong robustness. The reasons why the proposed models give the top results only on the Twt2016 test set can be analyzed from two aspects. On the one hand, the Twt2016 test set is larger than Twt2013-Twt2015. On the other hand, the annotation method of the Twt2016 test data differs from that of Twt2013-Twt2015, and the Twt2016 labels are more accurate. The proposed models therefore achieve their higher performance on the larger and more convincing test set.

Table 6 shows the results on the SST test sets. Various CNN models are included for comparison. CNN-non-static [13] and CNN-multichannel [13] are two baseline models. DCNN (Dynamic CNN) [38] adopts a k-max pooling strategy, in which the maximum k values are extracted from the feature maps and the relative order of these values is preserved.
Table 5: Results on the Twitter test sets

Model         Twt2013  Twt2014  Twt2015  Twt2016
Webis           68.49    70.86    64.84      -
UNITN           72.79    73.60    64.59      -
SwissCheese     70.00    71.60    67.10    63.30
SENSEI-LIF      70.60    74.40    66.20    63.00
Model 1         72.30    73.90    66.80    63.40
Model 2         72.30    74.10    66.80    63.70
Model 3         72.40    74.30    66.90    63.90
Table 6: Results on the SST test sets
Model                    SST2   SST5
CNN-non-static           87.2   48.0
CNN-multichannel         88.1   47.4
DCNN                     86.8   48.5
C-CNN(w2v+Syn+Glv)       88.58  47.47
MG-CNN(w2v+Syn+Glv)      88.36  48.37
MGNC-CNN(w2v+Syn+Glv)    88.65  49.16
SC-EAV                     -    48.8
Model 1                  88.77  49.22
Model 2                  88.81  49.58
Model 3                  88.83  49.69
C-CNN (w2v+Syn+Glv), MG-CNN (w2v+Syn+Glv) and MGNC-CNN (w2v+Syn+Glv) capitalize on multiple sets of semantic embeddings for sentence classification. SC-EAV integrates single-word attention with a CNN to extract sentence features. The proposed models all give top results on both the binary and the fine-grained classification tasks, and model 3 again surpasses all other models, achieving the highest accuracy.

Tables 7 and 8 present the results of ablation experiments on the Twt2016 test set and the SST5 test set, respectively. As can be seen from the tables, lexicon embeddings contribute an obvious extra improvement on the Twt2016 test set, while the attention mechanisms perform better on the SST5 test set. This is intuitive, because Twitter data are more colloquial and cover a variety of topics, whereas the movie reviews of the SST dataset are more formal and tailored to specific topics. From the above analysis, the proposed model 3 outperforms the other two models on the two datasets and in the ablation experiments.

Table 7: Ablation experiment: macro F1-score on the Twt2016 test set with features removed
Feature set     Model 1  Model 2  Model 3
All features      63.40    63.70    63.90
w/o lexicon       62.20    62.30    62.50
w/o attention     62.80    62.80    62.80
Table 8: Ablation experiment: accuracy on the SST5 test set with features removed
Feature set     Model 1  Model 2  Model 3
All features      49.22    49.58    49.69
w/o lexicon       48.57    48.77    48.83
w/o attention     47.45    47.45    47.45
Table 9: F1-score on the Twitter test set with diverse feature processing methods on model 3 without transfer learning
Method   Twt2013  Twt2014  Twt2015  Twt2016
Early      69.60    72.80    64.54    61.80
Late       72.00    73.60    64.79    63.10
CCR        72.40    74.10    66.80    63.80
Figure 6: Attention visualizations — (a) single word attention visualization; (b) multiple words attention visualization
Different global feature extraction methods can account for this. Model 3 assigns attention weights to single words and bigrams, which captures the most important components, while the other two models achieve the next-best results because they only assign weights to single words. By incorporating global information, all three models achieve the best or next-best results, which proves the usefulness of the global features.

Table 9 shows results on the Twitter test sets using diverse feature processing methods on model 3 without transfer learning. Early fusion gives the worst results, which can be attributed to the small amount of training data. Late fusion ignores the correlations between different modalities. CCR enforces agreement among the sentiment labels predicted by the different modality features, which clearly improves performance compared with the other two fusions. Additionally, model 3 performs better with transfer learning than without it.

Figure 6 depicts heatmaps of the attention mechanisms. The attention mechanism for single words indeed puts more focus on the words related to the sentiment of the sentence, and the attention on multiple words detects the most important phrases.
5. Conclusions
Three different word embeddings are used to encode texts in this paper. Sentiment embeddings and lexicon embeddings help to overcome the shortcomings of semantic embeddings. Additionally, three attention mechanisms are integrated with the CNN model to extract sentence features. Most importantly, CCR and transfer learning are presented to improve model performance. Finally, the empirical results demonstrate that lexicon embeddings, the attention mechanisms and CCR are effective in general sentiment analysis, bringing significant improvements. In the near future, the latest optimization algorithms can be exploited for the proposed models [39].

Acknowledgments
The authors are grateful to the anonymous reviewers and the editor for their valuable comments and suggestions. This work is supported by the Science and Technology Research Program of Chongqing Municipal Education Commission (Grant No. KJ1704080), the Chongqing Research Program of Basic Research and Frontier Postdoctoral Science Foundation (Grant No. cstc2017jcyjA1054), and the Doctoral Scientific Research Foundation of Chongqing University of Posts and Telecommunications (Grant No. A2016-10).

References
[1] B. J. Jansen, M. Zhang, K. Sobel, A. Chowdury, Twitter power: Tweets as electronic word of mouth, J. Am. Soc. Inf. Sci. Technol. 60 (11) (2009) 2169-2188.
[2] M. Hu, B. Liu, Mining and summarizing customer reviews, in: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, pp. 168-177.
[3] T. Wilson, J. Wiebe, P. Hoffmann, Recognizing contextual polarity in phrase-level sentiment analysis, in: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, 2005, pp. 347-354.
[4] S. Kiritchenko, X. Zhu, S. M. Mohammad, Sentiment analysis of short informal texts, J. Artif. Intell. Res. 50 (2014) 723-762.
[5] Y. Bengio, R. Ducharme, P. Vincent, C. Janvin, A neural probabilistic language model, J. Mach. Learn. Res. 3 (2003) 1137-1155.
[6] R. Collobert, J. Weston, A unified architecture for natural language processing: deep neural networks with multitask learning, in: Proceedings of the 25th International Conference on Machine Learning (ICML), 2008, pp. 160-167.
[7] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, CoRR abs/1301.3781.
[8] D. Tang, F. Wei, N. Yang, M. Zhou, T. Liu, B. Qin, Learning sentiment-specific word embedding for twitter sentiment classification, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), 2014, pp. 1555-1565.
[9] S. Ebert, N. T. Vu, H. Schütze, A linguistically informed convolutional neural network, in: Proceedings of the 6th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (WASSA@EMNLP), 2015, pp. 109-114.
[10] B. Shin, T. Lee, J. D. Choi, Lexicon integrated CNN models with attention for sentiment analysis, CoRR abs/1610.06272.
[11] W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, F. E. Alsaadi, A survey of deep neural network architectures and their applications, Neurocomputing 234 (2017) 11-26.
[12] N. Zeng, Z. Wang, H. Zhang, W. Liu, F. E. Alsaadi, Deep belief networks for quantitative analysis of a gold immunochromatographic strip, Cognitive Computation 8 (4) (2016) 684-692.
[13] Y. Kim, Convolutional neural networks for sentence classification, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1746-1751.
[14] R. Zhang, H. Lee, D. R. Radev, Dependency sensitive convolutional neural networks for modeling sentences and documents, in: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 1512-1521.
[15] Y. Zhang, S. Roller, B. C. Wallace, MGNC-CNN: A simple approach to exploiting multiple word embeddings for sentence classification, in: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 1522-1527.
[16] W. Yin, H. Schütze, Multichannel variable-size convolution for sentence classification, in: Proceedings of the 19th Conference on Computational Natural Language Learning (CoNLL), 2015, pp. 204-214.
[17] M. Rouvier, B. Favre, SENSEI-LIF at SemEval-2016 task 4: Polarity embedding fusion for robust sentiment analysis, in: Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval@NAACL-HLT), 2016, pp. 202-208.
[18] S. Lai, L. Xu, K. Liu, J. Zhao, Recurrent convolutional neural networks for text classification, in: Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI), 2015, pp. 2267-2273.
[19] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, CoRR abs/1409.0473.
[20] T. Rocktäschel, E. Grefenstette, K. M. Hermann, T. Kocisky, P. Blunsom, Reasoning about entailment with neural attention, CoRR abs/1509.06664.
[21] H. Chen, M. Sun, C. Tu, Y. Lin, Z. Liu, Neural sentiment classification with user and product attention, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016, pp. 1650-1659.
[22] L. Wang, Z. Cao, G. de Melo, Z. Liu, Relation classification via multi-level attention CNNs, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), 2016, pp. 1298-1307.
[23] Q. You, J. Luo, H. Jin, J. Yang, Cross-modality consistent regression for joint visual-textual sentiment analysis of social multimedia, in: Proceedings of the 9th ACM International Conference on Web Search and Data Mining (WSDM), 2016, pp. 13-22.
[24] S. M. Mohammad, T. Yang, Tracking sentiment in mail: How genders differ on emotional axes, CoRR abs/1309.6347.
[25] D. Tang, F. Wei, B. Qin, M. Zhou, T. Liu, Building large-scale twitter-specific sentiment lexicon: A representation learning approach, in: Proceedings of the 25th International Conference on Computational Linguistics (COLING), 2014, pp. 172-182.
[26] S. Mohammad, S. Kiritchenko, X. Zhu, NRC-Canada: Building the state-of-the-art in sentiment analysis of tweets, in: Proceedings of the 7th International Workshop on Semantic Evaluation (SemEval@NAACL-HLT), 2013, pp. 321-327.
[27] S. Kiritchenko, X. Zhu, C. Cherry, S. Mohammad, NRC-Canada-2014: Detecting aspects and sentiment in customer reviews, in: Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval@COLING), 2014, pp. 437-442.
[28] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (8) (1997) 1735-1780.
[29] A. Graves, J. Schmidhuber, Framewise phoneme classification with bidirectional LSTM and other neural network architectures, Neural Networks 18 (5-6) (2005) 602-610.
[30] R. Socher, B. Huval, C. D. Manning, A. Y. Ng, Semantic compositionality through recursive matrix-vector spaces, in: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2012, pp. 1201-1211.
[31] C. N. dos Santos, B. Xiang, B. Zhou, Classifying relations by ranking with convolutional neural networks, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, 2015, pp. 626-634.
[32] S. J. Pan, Q. Yang, A survey on transfer learning, IEEE Trans. Knowl. Data Eng. 22 (10) (2010) 1345-1359.
[33] P. Nakov, A. Ritter, S. Rosenthal, F. Sebastiani, V. Stoyanov, SemEval-2016 task 4: Sentiment analysis in Twitter, in: Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval@NAACL-HLT), 2016, pp. 1-18.
[34] M. Hagen, M. Potthast, M. Büchner, B. Stein, Webis: An ensemble for twitter sentiment detection, in: Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval@NAACL-HLT), 2015, pp. 582-589.
[35] A. Severyn, A. Moschitti, UNITN: Training deep convolutional neural network for twitter sentiment classification, in: Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval@NAACL-HLT), 2015, pp. 464-469.
[36] J. Deriu, M. Gonzenbach, F. Uzdilli, A. Lucchi, V. D. Luca, M. Jaggi, SwissCheese at SemEval-2016 task 4: Sentiment classification using an ensemble of convolutional neural networks with distant supervision, in: Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval@NAACL-HLT), 2016, pp. 1124-1128.
[37] S. Rosenthal, P. Nakov, S. Kiritchenko, S. Mohammad, A. Ritter, V. Stoyanov, SemEval-2015 task 10: Sentiment analysis in Twitter, in: Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval@NAACL-HLT), 2015, pp. 451-463.
[38] N. Kalchbrenner, E. Grefenstette, P. Blunsom, A convolutional neural network for modelling sentences, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), 2014, pp. 655-665.
[39] N. Zeng, H. Zhang, W. Liu, J. Liang, F. E. Alsaadi, A switching delayed PSO optimized extreme learning machine for short-term load forecasting, Neurocomputing 240 (2017) 175-182.
Zufan Zhang is a professor with the School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications (CQUPT), Chongqing, China. He received his B.Eng. and M.Eng. degrees in 1995 and 2000, respectively, from CQUPT, and his Ph.D. degree in Communications and Information Systems from the University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2007. He was a visiting professor at the Centre for Wireless Communications (CWC), University of Oulu, Finland, from Feb. 2011 to Jan. 2012. His current main research interest concerns wireless and mobile communication networks.

Yang Zou received his B.Sc. degree from Xihua University in 2015. He is currently pursuing the Master's degree in Information and Communication Engineering at Chongqing University of Posts and Telecommunications, Chongqing, China. His research interests include multimedia processing and deep learning techniques.
Chenquan Gan is a senior lecturer in computer science at Chongqing University of Posts and Telecommunications. He received his B.Sc. degree from the Department of Mathematics at Inner Mongolia Normal University in 2010, and his M.Sc. and Ph.D. degrees from the Department of Computer Science at Chongqing University in 2012 and 2015, respectively. He has published more than 10 research papers in international journals. His research interests include difference equations, computer virus propagation dynamics, and deep learning.