Aspect-based sentiment classification with multi-attention network

Qiannan Xu, Li Zhu∗, Tao Dai, Chengbing Yan
School of Software Engineering, Xi'an Jiaotong University, Xi'an, Shaanxi, China
∗ Corresponding author. E-mail address: [email protected] (L. Zhu).
Article info
Article history: Received 5 March 2019; Revised 8 December 2019; Accepted 1 January 2020; Available online xxx
Communicated by Dr. Erik Cambria
Keywords: Aspect-based sentiment classification; Sentiment analysis; Attention mechanism; Neural network

Abstract
Aspect-based sentiment classification aims to predict the sentiment polarity of an aspect term in a sentence instead of the sentiment polarity of the entire sentence. Neural networks have been used for this task, and most existing methods have adopted sequence models, which require more training time than other models. When an aspect term comprises several words, most methods involve a coarse-level attention mechanism to model the aspect, and this may result in information loss. In this paper, we propose a multi-attention network (MAN) to address the above problems. The proposed model uses intra- and inter-level attention mechanisms. In the former, the MAN employs a transformer encoder instead of a sequence model to reduce training time. The transformer encoder encodes the input sentence in parallel and preserves long-distance sentiment relations. In the latter, the MAN uses a global and a local attention module to capture differently grained interactive information between aspect and context. The global attention module focuses on the entire relation, whereas the local attention module considers interactions at word level; this was often neglected in previous studies. Experiments demonstrate that the proposed model achieves superior results when compared to the baseline models.
© 2020 Elsevier B.V. All rights reserved.
1. Introduction

Aspect-based sentiment classification is a fine-grained task in aspect-based sentiment analysis (ABSA). Instead of predicting the sentiment polarity of an entire sentence, the sentiment polarity of a specific aspect in the sentence is determined [1]. For example, in the sentence 'This is a high-speed computer, but it has short battery life', the sentiment polarities of the aspects 'speed' and 'battery life' are positive and negative, respectively. Aspect-based sentiment classification thus overcomes a limitation of sentence-level sentiment classification: when a sentence contains more than one aspect, each aspect may have a different sentiment polarity. Aspect-based sentiment classification consists of two stages: aspect extraction [2–7] and sentiment classification [8,9]. The former explores the aspects that appear in reviews, and the latter classifies the opinions about these aspects. In this study, we focus only on sentiment classification. Recently, sequence models such as long short-term memory (LSTM) [10] and gated recurrent units [11] have been successfully used in aspect-based sentiment classification [9,12,13]. Despite the effectiveness of these approaches, sequential models encode words one at a time, which is time-consuming.
To overcome this, Xue and Li [14] proposed a parallelisable solution by using convolutional neural networks (CNNs). Although CNNs are effective in reducing training time, they cannot capture long-distance relations in sentences. In addition, aspect-level sentiment polarity is highly dependent on both the review context and the aspect. Some models utilise an attention mechanism to add aspect information [15–17]. However, most of them regard all aspect words as a whole. When an aspect contains several words, these approaches ignore the differing importance of the words in the aspect phrase, resulting in information loss. For example, the aspect of the sentence 'This place has many different styles of pizza, and they are all amazing' contains three words. In the aspect phrase 'styles of pizza', 'of' contributes less than 'styles' and 'pizza', so it is inappropriate to treat the three aspect words as equally important. In this paper, we propose a multi-attention network (MAN) to address the aforementioned issues. MAN is a parallelisable model, as no sequence model is involved. It contains an intra- and an inter-level attention mechanism. The former learns word representations through a transformer encoder [18], which is based on a self-attention mechanism that can process context and aspect in parallel. Self-attention also allows MAN to handle long-distance dependencies because it considers every two words in a sentence. The latter employs global and local attention to capture coarse- and fine-grained interactive information between aspect and context. Global attention captures the entire interaction, whereas local
attention captures the word-level interaction between aspect and context words. The main contributions of this study can be summarised as follows:
• We propose a novel model (MAN) to process words in review sentences in parallel using an attention mechanism. The proposed model requires significantly less training time than sequence models. MAN can effectively capture long dependencies in sentences by self-attention.
• MAN introduces global and local attention modules to capture different-level interactions between aspect and context. The local attention module considers the difference between aspect words.
• We evaluated MAN on several datasets, namely laptop, restaurant, and twitter. Experiments demonstrate that the proposed model achieves superior results when compared to the baseline models.
The rest of this paper is organised as follows. In Section 2, we review related work. In Section 3, we define the problem of aspect-based sentiment classification and present the proposed model in detail. Section 4 reports experiments and evaluations. Section 5 concludes this paper.

2. Related work

In this section, we review related work as follows: First, we discuss the particularities of aspect-based sentiment classification and existing related methods. Secondly, we present recent neural networks for aspect-based sentiment classification. Thirdly, we present some attention mechanisms for aspect-based sentiment classification.
2.1. Aspect-based sentiment classification

Sentiment classification can be divided into document-, sentence-, and aspect-level [19]. In the first, it is assumed that a document contains only one topic, and an opinion about the topic is provided [20–23]. In the second, the sentiment polarity of an entire sentence is analysed [24–26]. The third (aspect-based sentiment classification) is fine-grained sentiment analysis, as it classifies the sentiment polarity of an aspect term or phrase [1]. The largest challenge of aspect-based classification, compared with the first two, is to consider aspect information and the interaction between aspect and context. Existing methods can be divided into three categories. In the first, a sentiment dictionary is constructed for determining the sentiment polarity of an aspect. Singh et al. [27] proposed searching for adjectives that appear before or after aspect words within a context window. If these adjectives are detected, a sentiment dictionary called SentiWordNet (http://swn.isti.cnr.it/) is used to infer the sentiment polarity of the aspect. Deng et al. [28] constructed a sentiment lexicon, where word polarities depend on topics or domains, through a hierarchical supervision topic model. Federici and Dragoni [29] employed the Stanford NLP tool (http://stanfordnlp.github.io/CoreNLP/index.html) to identify the part-of-speech of each word and generate a syntax tree of the input sentence. Secondly, certain studies depend on statistical methods. Mohammad et al. [30] employed supervised models, including naïve Bayes, Bayes networks, decision trees, K-nearest neighbour, and support vector machines (SVMs) with morphological, syntactic, and semantic features. They compared the results and found that SVMs are the best classifiers. Other methods used different features such as bi-grams [31,32]. However, knowledge-based methods are limited by the employed resources, and statistical methods are not effective in aspect classification owing to their weak semantic comprehension ability. The third solution, deep learning, is receiving increasing attention.

2.2. Neural networks for aspect-based sentiment classification

Neural networks have been extensively used in several fields, including opinion mining. The most common model used in aspect-level sentiment classification is LSTM. For example, Tang et al. [13] used two LSTMs to encode the left and right context of an aspect simultaneously and concatenated the last hidden states of the LSTMs as the final representation. Ma et al. [33] constructed an extended LSTM with two-stage attention to model the target and the entire context. There are numerous other studies based on LSTM [8,34–37]. However, LSTM requires a long training time to encode each input word, and thus other networks have been adopted. Specifically, memory networks [38] have been used in aspect-level sentiment classification [39,40]. Tang et al. [15] constructed a memory network with multiple computational layers. They regarded the aspect vector as a query, and the context embeddings as memories. Chen et al. [41] developed a position-weighted memory and applied recurrent attention to obtain the final sentiment prediction. Even though memory networks can handle long-range relations, they focus on single memory elements, thus rendering certain tasks unlearnable. Other neural architectures such as CNNs [42,43] have also been employed. Xue and Li [14] proposed a model based on convolutional layers and gating units. The convolution layers allow parallel processing, and the gating units generate accurate features. However, CNNs cannot capture long-distance relations in sentences.

2.3. Attention mechanism for aspect-based sentiment classification
Attention mechanisms are important in natural language processing (NLP) and have achieved remarkable performance in aspect-based sentiment classification [44–46]. Attention mechanisms were introduced to LSTM to incorporate aspect information. Ma et al. [16] used attention to capture interactive information between aspect and context. He et al. [47] transferred document-level knowledge to an attention-based LSTM to improve classification performance. These attention models learn context and aspect at a coarse-grained level and cause information loss [48]. To resolve this, Fan et al. [48] proposed a model called MGAN for classification. They employed a fine-grained attention mechanism for word-level learning, and coarse-grained attention to obtain multi-grained information of a sentence. The proposed solution is based on a transformer, which is a self-attentive model [18]. The transformer contains multi-head attention layers to capture information and relies on residual connections [49] and layer normalisation [50] for convergence. Self-attention allows the proposed model to process input sentences in parallel. It obtains information between every two words, and thus it can capture word-level information and long-distance dependencies in a sentence. Even though the proposed model is inspired by MGAN [48], it differs in three main respects. First, it applies a self-attention mechanism to the word embedding to learn the hidden context and aspect representations, whereas MGAN uses a BiLSTM to obtain the hidden representation. The self-attention mechanism can encode each word in parallel and naturally captures long-range semantic relationships between two words. Secondly, the proposed model does not use fine-grained attention to aspect from context, as in the case of MGAN. As it is experimentally demonstrated that fine-grained attention to aspect from context is not significant, we only retain attention to context from aspect. Thirdly, the local attention weights are trainable. In general, to the best of our knowledge, MAN is the first model in which an attention mechanism
Fig. 1. MAN architecture. (Context and aspect word embeddings are fed into transformer encoders, the intra-attention layers, stacked ×N; the resulting aspect and context hidden representations are passed to the global and local attention modules, the inter-attention layers, whose outputs feed the output layer with softmax.)
performs aspect-level sentiment classification independently without using a CNN or a sequence model.
3. Multi-attention network for aspect-based sentiment classification

This section presents the structure of MAN. The overall architecture is shown in Fig. 1. It consists of input embedding, multi-attention, and output layers.

3.1. Task definition

We are given a context sentence $c = \{w_1, w_2, \ldots, w_n\}$ and an aspect that contains $m$ words, $a = \{w_i, w_{i+1}, \ldots, w_{i+m-1}\}$. The aspect could be a single word or a word phrase of the context sentence. The goal of aspect-based sentiment classification is to infer the sentiment polarity of the aspect $a$ in the sentence $c$. The sentiment polarity may be positive, neutral, or negative.

3.2. Input embedding layer

The input embedding layer maps each input word $w_i$ into a low-dimensional vector $v_i \in \mathbb{R}^{d_w}$, where $d_w$ is the dimension of the word vector. An embedding lookup matrix $M \in \mathbb{R}^{|V| \times d_w}$ generated by GloVe [51] is employed to obtain the input vectors, where $|V|$ is the size of the vocabulary. Subsequently, we obtain two sets of word vectors: the context embedding $\{v_1, v_2, \ldots, v_n\} \in \mathbb{R}^{n \times d_w}$ and the aspect embedding $\{v_i, v_{i+1}, \ldots, v_{i+m-1}\} \in \mathbb{R}^{m \times d_w}$.
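To make the lookup concrete, the following minimal sketch (ours, not the authors' released code) builds the two embedding sets from a lookup matrix M; the toy vocabulary, the random matrix contents, and the function name are illustrative assumptions, and in practice the rows of M would be the pretrained GloVe vectors.

```python
import numpy as np

d_w = 300
# Illustrative toy vocabulary; rows of M would normally come from pretrained GloVe.
vocab = {"this": 0, "place": 1, "has": 2, "styles": 3, "of": 4, "pizza": 5}
M = np.random.uniform(-0.25, 0.25, size=(len(vocab), d_w))

def embed(words):
    # Map a token sequence to its matrix of word vectors (len(words) x d_w).
    return np.stack([M[vocab[w]] for w in words])

context_emb = embed(["this", "place", "has", "styles", "of", "pizza"])  # n x d_w
aspect_emb = embed(["styles", "of", "pizza"])                           # m x d_w
```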
3.3. Multi-attention mechanism

In this section, we describe the main part of the proposed model, namely the multi-attention mechanism. In Section 3.3.1, we review the transformer encoder, and then we introduce the intra- and inter-attention mechanisms in Sections 3.3.2 and 3.3.3, respectively. Moreover, we discuss the need for position attention in Section 3.3.4.

3.3.1. Transformer encoder

A transformer encoder consists of two layers: a multi-head attention mechanism and a fully connected layer. Each layer is followed by a residual connection and layer normalisation. Before introducing multi-head attention, we first introduce the scaled dot-product attention. Generally, the input of an attention function is a query $q_i \in \mathbb{R}^d$ and a set of key–value pairs, whereas the output is the weighted sum of the values. In scaled dot-product attention, the weights are computed by the dot products of the query $q_i$ and the keys as follows:

\[ \mathrm{Attention}(q_i, K, V) = \mathrm{softmax}\!\left(\frac{q_i K^{\top}}{\sqrt{d}}\right) V, \tag{1} \]
Fig. 2. Details of the global attention module. (The aspect and context hidden representations are average-pooled and attended to produce the whole-interaction embedding.)
where the keys and values are denoted as $K = (k_1, k_2, \ldots, k_n)$, $k_i \in \mathbb{R}^d$, and $V = (v_1, v_2, \ldots, v_n)$, $v_i \in \mathbb{R}^d$, respectively. The multi-head attention function consists of $h$ parallel scaled dot-product attention mechanisms called 'heads'. The query $q_i$, keys, and values are first divided into $h$ segments. Then, linear projections are used to learn information from different subspaces. Finally, the output of multi-head attention is generated as

\[ \mathrm{MultiHead}(q_i, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^O, \qquad \mathrm{head}_j = \mathrm{Attention}\!\left(W_j^Q q_i, \; W_j^K K, \; W_j^V V\right), \tag{2} \]

where $W_j^Q \in \mathbb{R}^{\frac{d}{h} \times d}$, $W_j^K \in \mathbb{R}^{\frac{d}{h} \times d}$, and $W_j^V \in \mathbb{R}^{\frac{d}{h} \times d}$ are linear projection parameter matrices. In practice, we often pack a set of queries into a matrix $Q$ to allow multi-head attention to process these queries simultaneously.
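As a minimal illustration of Eqs. (1) and (2), the following sketch (our simplification, not the authors' released code) computes scaled dot-product and multi-head attention in PyTorch; the projections are written as right-multiplications by matrices of shape (d, d/h), and the residual connection and layer normalisation that complete a transformer encoder layer are omitted.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q: (n_q, d), k and v: (n_k, d); Eq. (1).
    d = q.size(-1)
    weights = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    return weights @ v

def multi_head_attention(q, k, v, wq, wk, wv, wo):
    # wq, wk, wv: lists of h projection matrices of shape (d, d/h); wo: (d, d); Eq. (2).
    heads = [scaled_dot_product_attention(q @ wq[j], k @ wk[j], v @ wv[j])
             for j in range(len(wq))]
    return torch.cat(heads, dim=-1) @ wo
```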
3.3.2. Intra-attention layer

We apply a transformer encoder to the word embedding to obtain hidden representations. Intra-attention, also called self-attention, encodes each word based on all other words in the same sentence. We first apply the transformer encoder to the context, where the context words are fed as queries, keys, and values. Then, we obtain the output of the encoder as the hidden context representation $[h_1^c, h_2^c, \ldots, h_n^c]$ as follows:

\[ [h_1^c, h_2^c, \ldots, h_n^c] = \mathrm{Encoder}(v_1, v_2, \ldots, v_n), \tag{3} \]

where $h_i^c \in \mathbb{R}^{d_h}$, and $d_h$ is the hidden dimension. Similarly, we feed the aspect words into the transformer encoder and obtain the hidden aspect representation $[h_1^a, h_2^a, \ldots, h_m^a]$, where $h_i^a \in \mathbb{R}^{d_h}$. It is well known that networks with multiple layers can achieve more abstract representations [52]. It is difficult to extract complicated semantic relations in sentences by using only a single attention layer. Therefore, we use a stack of $N$ attention layers to learn comprehensive context and aspect information.

3.3.3. Inter-attention layer

In addition to intra-attention, the relationship between aspect and context is crucial for sentiment prediction. However, this interaction will become more complicated when an aspect phrase contains more than one word, as different words in an aspect phrase make different contributions to the interaction between aspect and context. To address this issue, we present two types of inter-attention modules, namely global and local attention, to extract interactive information of varying granularity.

The global attention mechanism explores the interplay between context and aspect, and generates a coarse-grained representation of the relation. Fig. 2 shows the details of global attention. The attention mechanism first considers the global influence on the context. It applies average pooling to the hidden aspect vectors and then computes the attention weight $\alpha_i$ for each context word as follows:

\[ h_{avg}^a = \frac{1}{m} \sum_{i=1}^{m} h_i^a, \qquad \alpha_i = \frac{\exp\!\left(f(h_{avg}^a, h_i^c)\right)}{\sum_{j=1}^{n} \exp\!\left(f(h_{avg}^a, h_j^c)\right)}, \tag{4} \]

where the function $f(h_{avg}^a, h_i^c)$ is calculated by

\[ f(h_{avg}^a, h_i^c) = \tanh\!\left(h_{avg}^a \times W \times {h_i^c}^{\top} + b\right), \tag{5} \]

where $W \in \mathbb{R}^{d_h \times d_h}$ and $b \in \mathbb{R}$ are the weight matrix and bias, respectively, and ${h_i^c}^{\top}$ is the transpose of $h_i^c$. For the influence on the aspect, the global attention mechanism performs similar operations as follows:

\[ h_{avg}^c = \frac{1}{n} \sum_{i=1}^{n} h_i^c, \qquad \beta_i = \frac{\exp\!\left(f(h_{avg}^c, h_i^a)\right)}{\sum_{j=1}^{m} \exp\!\left(f(h_{avg}^c, h_j^a)\right)}. \tag{6} \]

Subsequently, the weighted context and weighted aspect representations are computed by

\[ m^c = \sum_{i=1}^{n} \alpha_i h_i^c, \qquad m^a = \sum_{i=1}^{m} \beta_i h_i^a. \tag{7} \]

Finally, the context and aspect representations are concatenated to form the interaction embedding.
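The following sketch (ours; sharing one bilinear parameter W between the two directions is an assumption made for brevity) shows how Eqs. (4)–(7) can be computed for a single context–aspect pair in PyTorch.

```python
import torch

def global_attention(h_c, h_a, W, b):
    # h_c: (n, d_h) context hidden states, h_a: (m, d_h) aspect hidden states,
    # W: (d_h, d_h) bilinear weight, b: scalar bias.
    h_a_avg = h_a.mean(dim=0)                                             # Eq. (4), pooling
    h_c_avg = h_c.mean(dim=0)                                             # Eq. (6), pooling
    alpha = torch.softmax(torch.tanh(h_a_avg @ W @ h_c.t() + b), dim=-1)  # Eqs. (4)-(5)
    beta = torch.softmax(torch.tanh(h_c_avg @ W @ h_a.t() + b), dim=-1)   # Eq. (6)
    m_c = alpha @ h_c                                                     # Eq. (7), weighted context
    m_a = beta @ h_a                                                      # Eq. (7), weighted aspect
    return torch.cat([m_c, m_a], dim=-1)                                  # whole-interaction embedding
```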
Fig. 3. Details of the local attention module.

The local attention mechanism captures word-level interactions to enhance the relation information between aspect and context, as shown in Fig. 3. In reality, aspect words will assign different attention to each word in the context sentence when the aspect phrase contains more than one word. For example, common words such as 'of' in the aspect will attend little to the context words. We use an aspect–context encoder to compute word-level attention scores between each context word and the aspect. The aspect–context encoder contains a transformer encoder and an attention weight vector $\gamma$, which can be learned at the training stage. For the input of the transformer encoder, each hidden aspect embedding is considered a query, and the contextual embeddings are regarded as keys and values. The output of the transformer
encoder is a fine-grained representation of the aspect words with the context. We then obtain the weighted representations as word-level interaction embeddings. The local attention mechanism can be denoted as follows:

\[ [h_1^{ac}, h_2^{ac}, \ldots, h_m^{ac}] = \mathrm{Encoder}(h_1^a, h_2^a, \ldots, h_m^a, h_1^c, h_2^c, \ldots, h_n^c), \qquad m^{ac} = \sum_{i=1}^{m} \gamma_i \times h_i^{ac}, \tag{8} \]

where the $\gamma_i$ are the components of the attention weight vector.
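A compact sketch of Eq. (8) follows (ours; the aspect–context encoder is passed in as a callable, for example the multi-head attention above, and gamma is a trainable parameter):

```python
import torch

def local_attention(h_a, h_c, gamma, encoder):
    # h_a: (m, d_h) aspect hidden states used as queries,
    # h_c: (n, d_h) context hidden states used as keys and values,
    # gamma: (m,) trainable attention weights, encoder: aspect-context encoder.
    h_ac = encoder(h_a, h_c, h_c)   # [h_1^ac, ..., h_m^ac], Eq. (8)
    return gamma @ h_ac             # m^ac, the word-level interaction embedding
```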
3.3.4. Position attention

Utilising an attention mechanism results in neglecting the order information of the input sequence. In this study, we introduce two position attention mechanisms into the input embedding. The first position encoding is added to the input of the global attention mechanism and involves sine and cosine functions. It can easily describe the relative position in a sentence [18], because sine and cosine functions can describe the distance between two words. The position encoding is calculated based on the position and the dimension as follows:

\[ p(k, 2i) = \sin\!\left(\frac{k}{1000^{2i/d_h}}\right), \qquad p(k, 2i+1) = \cos\!\left(\frac{k}{1000^{2i/d_h}}\right), \tag{9} \]

where $k$ is the position and $i$ is the dimension of the vector. The second position encoding is added to the input of local attention and considers the distance between context and aspect. Obviously, context words near the aspect are more important than context words far from the aspect. Therefore, the position encoding is calculated as follows:

\[ p_i = 1 - \frac{l}{n - m + 1}, \qquad l = \begin{cases} p_s - i & i \le p_s \\ 0 & p_s < i \le p_e \\ i - p_e & i > p_e \end{cases} \tag{10} \]

where $n$ and $m$ denote the lengths of the context and aspect, respectively, and $p_s$ and $p_e$ are the start and end positions of the aspect phrase, respectively.
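The two position mechanisms can be sketched as follows (ours; the base 1000 follows Eq. (9) as printed, d_h is assumed even, and the span indices ps and pe_idx are illustrative names for the aspect start and end positions of Eq. (10)):

```python
import numpy as np

def sinusoid_position_encoding(n, d_h):
    # Eq. (9): sine on even dimensions, cosine on odd dimensions.
    pe = np.zeros((n, d_h))
    pos = np.arange(n)[:, None]
    two_i = np.arange(0, d_h, 2)[None, :]
    angle = pos / np.power(1000.0, two_i / d_h)
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

def distance_position_weights(n, m, ps, pe_idx):
    # Eq. (10): context words far from the aspect span [ps, pe_idx] are down-weighted.
    weights = np.empty(n)
    for i in range(n):
        if i <= ps:
            l = ps - i
        elif i <= pe_idx:
            l = 0
        else:
            l = i - pe_idx
        weights[i] = 1 - l / (n - m + 1)
    return weights
```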
3.4. Model training

We concatenate the outputs of the global and local attention mechanisms to obtain the final representation vector. We then feed it into a fully connected layer and a softmax layer to predict the final sentiment polarity. The final loss function consists of the cross-entropy loss and L2 regularisation, and is calculated as follows:

\[ L = -\sum_{i \in D} \sum_{c \in C} y_i^c \log p_i^c + \lambda \lVert \Theta \rVert^2, \tag{11} \]

where $p_i \in \mathbb{R}^{|C|}$ is the predicted sentiment distribution of the i-th context–aspect pair, $y_i \in \mathbb{R}^{|C|}$ is the true sentiment polarity, which is a one-hot vector, $C$ denotes the set of sentiment polarities, and $D$ denotes the training data set. $\lambda$ controls the influence of the L2 regularisation, and $\Theta$ denotes all the parameters. In addition, we employ the dropout strategy to avoid overfitting.
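A minimal sketch of the output layer and the loss of Eq. (11) is given below (ours; the function and argument names are illustrative, in practice the L2 term is often realised through the optimiser's weight decay, and dropout would be applied to the representations during training):

```python
import torch
import torch.nn.functional as F

def output_and_loss(m_global, m_local, fc, labels, params, lam):
    # m_global: (B, d_g) outputs of the global attention module,
    # m_local: (B, d_l) outputs of the local attention module,
    # fc: linear layer mapping to |C| classes, labels: (B,) gold polarity indices.
    r = torch.cat([m_global, m_local], dim=-1)                 # final representation
    logits = fc(r)
    loss = F.cross_entropy(logits, labels)                     # cross-entropy term of Eq. (11)
    loss = loss + lam * sum((p ** 2).sum() for p in params)    # L2 regularisation
    return F.softmax(logits, dim=-1), loss
```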
4. Experiment
4.1. Datasets

We evaluate MAN on five datasets: laptop2014, restaurant2014, restaurant2015, restaurant2016, and twitter. The first two datasets are from SemEval 2014 Task 4 (http://alt.qcri.org/semeval2014/task4/) and consist of reviews of laptops and restaurants. Restaurant2015 and Restaurant2016 are reviews of restaurants from SemEval 2015 Task 12 (http://alt.qcri.org/semeval2015/task12/) and SemEval 2016 Task 5 (http://alt.qcri.org/semeval2016/task5/), respectively. The last dataset contains twitter posts with manually tagged sentiment labels [53]. The first four datasets contain reviews of laptops and restaurants, where each sentence is annotated with, for example, the target aspect, the aspect category, and the sentiment polarity. The sentiment labels are 'positive', 'neutral', 'negative', and 'conflict'. The label 'conflict' implies that the sentiment orientation is ambiguous, that is, both positive and negative. We remove samples labelled as 'conflict' [41]. The twitter dataset has been widely used for ABSA [13,54,55]. Dong et al. [53] collected tweet posts by using certain keywords. These tweets are divided into a train set and a test set, and are manually annotated as positive, neutral, and negative. The percentages of negative, neutral, and positive samples are 25%, 50%, and 25%, respectively. The details of the five datasets are shown in Table 1.

4.2. Experimental setting

In the experiments, we use pretrained 300-dimensional GloVe [51] vectors to initialise the word embedding. Words out of vocabulary are randomly initialised using a uniform distribution U(−0.25, 0.25). The initial values of all weight matrices are sampled from a uniform distribution U(−0.01, 0.01). The dimension of the hidden representation is set to 300, and the number of heads in the transformer encoder is set to 10. We use the Adam optimiser with a learning rate of 0.001 to train the model. The batch size is set to 64, and the L2 regularisation parameter is set to 0.001. We set the dropout rate to 0.1 and the number of attention layers to 2.
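For reference, these settings can be collected into a single configuration; this summary is ours, and the GloVe file name is an assumption (the paper only states 300-dimensional GloVe vectors).

```python
# Hyperparameters reported in Section 4.2 (variable names are ours).
config = {
    "word_embedding": "glove.840B.300d.txt",  # assumption: any pretrained 300-d GloVe file
    "embedding_dim": 300,
    "hidden_dim": 300,
    "num_heads": 10,
    "num_attention_layers": 2,
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "batch_size": 64,
    "l2_reg": 1e-3,
    "dropout": 0.1,
    "oov_init": (-0.25, 0.25),
    "weight_init": (-0.01, 0.01),
}
```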
4.3. Baseline models

In the experiments, the proposed model is compared with several existing models. For fairness, the word embedding vectors of all baseline methods are initialised using the same 300-dimensional GloVe vectors. The batch size is set to 64, and the other parameters are set to the default values in the original papers. The compared baselines are listed as follows:

ATAE-LSTM [12] employs an attention mechanism to compute a weighted hidden representation. The aspect embedding is concatenated with the context embedding and fed into a simple LSTM model.

GCAE [14] is based on a convolutional neural network and a gating mechanism. The gate units select important parts from the feature maps for sentiment prediction.

TD-LSTM [13] utilises two LSTMs on the left and right context of an aspect to obtain a target-dependent representation. The final representation contains the left and right contexts and the aspect.

IAN [16] learns the context and aspect representations using two LSTMs. Then, the model employs an attention mechanism to capture the interaction between context and aspect.

MemNet [15] uses a deep memory network for sentiment classification. It contains a content attention mechanism and position encoding. The number of computational layers in MemNet is set to 9.

RAM [41] uses a bidirectional LSTM and captures features by a recurrent attention mechanism. This model constructs a position-weighted memory network to capture long-range information. The number of attention layers in RAM is set to 2.

TNet-LF [56] consists of LSTM, CNN, and a special component called CPT layers. CPT contains target-specific representations and a context-preserving mechanism.

MGAN [48] uses LSTM to learn hidden representations and applies coarse- and fine-grained attention mechanisms to the output of the LSTM to capture interactive information between aspect and context.
Table 1
Statistics of datasets.

             Laptop2014      Restaurant2014   Restaurant2015   Restaurant2016   Twitter
Sentiment    Train   Test    Train   Test     Train   Test     Train   Test     Train   Test
Positive       987    341     2164    728      1178    437      1618    596      1561    173
Neutral        460    169      633    196        48     34        88     36      3127    346
Negative       866    128      805    196       380    328       708    189      1560    173
Table 2
Main results on datasets. The results with '∗' are retrieved from the original papers, and the best performance is marked in bold.

            Laptop2014           Restaurant2014       Restaurant2015       Restaurant2016       Twitter
Model       Accuracy  Macro-F1   Accuracy  Macro-F1   Accuracy  Macro-F1   Accuracy  Macro-F1   Accuracy  Macro-F1
ATAE-LSTM∗  68.70     –          77.20     –          –         –          –         –          –         –
GCAE        69.46     64.46      77.60     67.53      78.31     65.47      81.26     69.71      69.21     64.06
TD-LSTM     71.48     68.43      78.11     66.73      78.80     61.71      83.77     71.20      70.62     69.01
IAN∗        72.10     –          78.60     –          –         –          –         –          –         –
MemNet      72.34     64.25      79.13     66.41      80.12     67.46      82.58     67.32      68.52     66.71
RAM         74.51     70.53      80.37     69.79      80.79     69.21      84.37     72.19      71.04     69.73
MGAN        75.39     72.47      81.25     71.94      80.81     68.99      85.12     72.46      72.54     70.81
TNet-LF     76.32     71.56      80.75     70.46      81.43     68.78      85.03     71.49      74.78     73.42
MAN         78.13     73.20      84.38     71.31      82.65     69.10      85.87     73.28      76.56     72.19
4.4. Results

Accuracy and the macro-F1 score [35,41] were used as evaluation metrics. The experiment results are shown in Table 2. It can be seen that the proposed model outperforms the baseline methods in most cases. Specifically, its accuracy exhibits an improvement of approximately 0.84% to 3.63% compared with TNet-LF, which is the current best model on these datasets. This is because the self-attention mechanism captures the relation between every two words, including bidirectional context information and long-range relations, and the trainable weights of the local attention mechanism enable the proposed model to learn the different importance of the words in the aspect phrase. Among the baseline models, ATAE-LSTM performs worst in terms of accuracy. GCAE is slightly better than ATAE-LSTM owing to the feature extraction performed by its nonlinear gating mechanism. However, the performance of GCAE remains unsatisfactory owing to the lack of long-distance dependencies. TD-LSTM achieves higher accuracy than GCAE because it models the left and right contexts of an aspect rather than the entire sentence. IAN is better than TD-LSTM in terms of accuracy, as it interactively generates target and context representations. MemNet performs better than IAN but cannot perform as effectively as RAM, which considers the results of multiple attention mechanisms. TNet-LF outperforms RAM on all datasets, particularly on the twitter dataset. The improvement is due to the convolution component, which enables TNet-LF to learn non-sequential features. Moreover, the accuracy on the twitter dataset remains lower than that on the restaurant or laptop datasets in most cases. This may be because sentences on twitter express more complex emotions not limited to positive/negative, as in 'They didn't know how I feel inside, through my smile I cry' (Alicia Keys, 'Caged Bird'). Furthermore, a larger number of out-of-vocabulary words, such as 'OMG', appear in the twitter dataset, and thus classifying sentiment polarity is more challenging.
Table 3
Runtime of each epoch on the laptop dataset. The smallest training time is marked in bold.

Model     Time
TD-LSTM   261.05
RAM       255.61
MGAN      279.37
TNet-LF   362.74
GCAE      35.68
MemNet    51.22
MAN       44.82
4.5. Training time

Herein, we compare the training time of the proposed model with that of the baseline methods. For fairness, we run MAN and the baseline models on the same Tesla P100 GPU. Table 3 shows the runtime for each epoch on the laptop dataset. It can be seen that even though MAN is slower than GCAE, it requires less training time than the other models, particularly those based on LSTM. MAN is slower than GCAE because it must compute additional attention weight parameters during training. Moreover, MAN runs faster than the sequence models owing to the different computational complexity. The cost of self-attention is $O(l_m^2 d_w)$, whereas the cost of LSTM is $O(l_m d_w^2)$, where $l_m$ is the maximum length of the input sentence and $d_w$ is the dimension of the word vector [57]. In general, $l_m$ is smaller than $d_w$. In the experiment, we set $l_m$ to 50 and $d_w$ to 300. In addition, self-attention can compute weights in parallel, whereas sequence models are serial.

4.6. Model analysis

Herein, we investigate the effect of important MAN components, such as the number of layers and the various attention modules.

4.6.1. Effect of the number of layers

The performance of the proposed model is affected by the number of attention layers in the transformer encoder. The stacked attention layers are used to handle complex sentiment relations in the input sequence. Therefore, we evaluate the performance of the proposed model with one to five attention layers.
Table 4
Effect of layers. n in MAN(n) is the number of attention layers.

              Laptop2014           Restaurant2014       Restaurant2015       Restaurant2016       Twitter
# of layers   Accuracy  Macro-F1   Accuracy  Macro-F1   Accuracy  Macro-F1   Accuracy  Macro-F1   Accuracy  Macro-F1
MAN(1)        75.00     69.84      82.81     73.19      79.76     64.72      80.39     69.14      73.44     70.81
MAN(2)        76.56     72.07      81.25     72.42      81.47     66.83      84.18     70.35      76.56     72.19
MAN(3)        78.13     73.20      84.38     71.31      82.65     69.10      85.03     71.49      71.88     69.44
MAN(4)        73.43     65.03      79.69     68.87      81.76     67.84      82.66     68.21      71.88     68.29
MAN(5)        71.88     65.64      78.13     67.63      80.15     61.71      78.48     64.53      70.31     69.47
Table 5
Effect of attention. MAN is the full proposed model.

          Laptop2014           Restaurant2014       Restaurant2015       Restaurant2016       Twitter
Model     Accuracy  Macro-F1   Accuracy  Macro-F1   Accuracy  Macro-F1   Accuracy  Macro-F1   Accuracy  Macro-F1
MAN(AS)   73.68     64.52      79.71     69.85      75.92     64.03      80.65     65.21      70.88     68.47
MAN(I)    74.44     65.98      81.25     72.02      78.66     64.82      82.03     68.16      73.26     71.93
MAN(A)    75.10     70.62      81.37     67.64      80.27     66.46      82.65     69.20      73.44     71.43
MAN(C)    76.45     71.41      82.79     71.84      80.64     65.97      83.24     68.83      73.88     69.85
MAN(P)    77.69     71.99      79.69     71.27      81.39     67.75      84.77     70.32      75.00     71.00
MAN       78.13     73.20      84.38     71.31      82.65     68.23      84.93     71.49      76.56     72.19
MAN(CA)   78.09     72.97      84.16     71.30      81.86     68.15      85.03     70.71      76.48     72.22
Fig. 4. Visualisation of attention weights. Colour intensity represents the strength of the attention weights.
As shown in Table 4, the proposed model generally achieves the best performance with two or three attention layers, in terms of both accuracy and macro-F1, on the five datasets, except in certain special cases; for example, the macro-F1 score on the restaurant2014 dataset is an exception. However, the proposed model with only one attention layer cannot learn complicated sentiment features in sentences. Nevertheless, an excessive number of layers also degrades performance, as it increases computational complexity, and therefore the generalisation ability and prediction accuracy are reduced.
4.7. Case study

4.7.1. Effect of attention

To verify the effectiveness of the different attention mechanisms, we design six variants of MAN, called MAN(AS), MAN(I), MAN(A), MAN(C), MAN(P), and MAN(CA). MAN(AS) and MAN(I) denote the proposed model without local attention and without global attention, respectively. MAN(A) removes the global aspect influence on the context, and MAN(C) removes global attention to the target from the context. MAN(P) is MAN without position attention. In addition, we add local attention to aspect from context, as in MGAN [48], which is denoted by MAN(CA).
As shown in Table 5, the full model (MAN) yields the best results on the five datasets in most cases. MAN(AS) has the worst performance, implying that word-level interactive attention is the most important module in the proposed model. Moreover, the gap between MAN(AS) and MAN is more obvious on the twitter dataset; this may be due to the number of aspect words. After counting the aspect words in the datasets, we found that 69.86% of the aspects in the twitter dataset contain more than one word, whereas the corresponding proportions for the laptop2014 and restaurant2014 datasets are 36.66% and 24.65%, respectively. The results indicate that local attention is more useful for reviews that contain multi-word aspects, such as twitter posts. MAN(I), MAN(A), and MAN(C) show the effectiveness of global interaction, including global attention to context from aspect and global attention to aspect from context. In addition, the results of MAN(P) suggest that position information can improve sentiment prediction. However, local attention to aspect from context cannot improve the performance of sentiment classification, because word-level influence on aspect from context is similar to local attention and thus cannot provide additional information. MAN therefore uses only local attention to context from aspect. We apply MAN to a case from the restaurant dataset to visualise its attention results. Considering the context 'This place has many different styles of pizza and they are all amazing' and the aspect
'styles of pizza', Fig. 4 shows the attention weights for the interactive information, including global and local attention. Fig. 4(a) and (b) show the global attention weights for the aspect and the context, respectively. It can be seen that the context words 'many' and 'amazing' are important in classifying the sentiment polarity of 'styles of pizza'. Some common words such as 'the', 'this', and 'of' in the context have less influence on the final prediction. For the aspect phrase 'styles of pizza', 'styles' and 'pizza' are assigned more attention than the word 'of'. The global attention weights demonstrate that the multi-attention mechanism is effective in capturing context and aspect features. The visualisation of the local interaction between aspect and context is shown in Fig. 4(c). It can be seen that the word 'styles' in the aspect receives higher attention scores with 'different' and 'amazing' than with other words, whereas the words 'many' and 'amazing' in the context are important to 'pizza' in the aspect phrase. The local attention weights demonstrate that the proposed model takes full advantage of the differences between the words in the aspect phrase.

5. Conclusion and future work

We proposed a novel model based on multiple attention mechanisms (MAN) for aspect-based sentiment classification. MAN requires less training time than sequence models because it can process the input sentence in parallel. Compared with convolution models, MAN can effectively capture long-range sentiment relations. Moreover, it uses global and local attention mechanisms to capture differently grained interactive relations between aspect and context. The global attention mechanism computes the entire interactive information, whereas the local attention computes the word-level interaction. Experiments demonstrated that the proposed approach achieves the best performance in aspect-based sentiment classification. In the future, we plan to further study attention mechanisms and construct new network frameworks. Besides, we will consider recent effective methods such as capsule networks and the XLNet model to further improve aspect-based sentiment classification.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Qiannan Xu: Conceptualization, Methodology, Software, Data curation, Writing - original draft, Writing - review & editing. Li Zhu: Resources, Writing - review & editing, Funding acquisition. Tao Dai: Writing - original draft, Writing - review & editing. Chengbing Yan: Project administration, Writing - original draft.

Acknowledgment

This research is supported by the National Key Research and Development Projects (No. 2018AAA0101100 and No. 2019YFB2102500).

References

[1] M. Hu, B. Liu, Mining and summarizing customer reviews, in: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2004, pp. 168–177.
[2] L. Shu, H. Xu, B. Liu, Lifelong learning crf for supervised aspect extraction, in: Proceedings of Meeting of the Association for Computational Linguistics, 2, 2017, pp. 148–154.
[3] X. Li, L. Bing, P. Li, W. Lam, Z. Yang, Aspect term extraction with history attention and selective transformation, in: Proceedings of International Joint Conference on Artificial Intelligence, 2018, pp. 4194–4200. [4] M. Dragoni, M. Federici, A. Rexha, An unsupervised aspect extraction strategy for monitoring real-time reviews stream, Inf. Process. Manag. 56 (3) (2019) 1103–1118. [5] T.A. Rana, Y.N. Cheah, Aspect extraction in sentiment analysis: comparative analysis and survey, Artif. Intell. Rev. 46 (4) (2016) 459–483. [6] D. Ma, S. Li, F. Wu, X. Xie, H. Wang, Exploring sequence-to-sequence learning in aspect term extraction, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 3538–3547. [7] H. Xu, B. Liu, L. Shu, P.S. Yu, Double embeddings and CNN-based sequence labeling for aspect extraction, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2018, pp. 592–598, doi:10.18653/v1/P18-2094. [8] J. Wang, J. Li, S. Li, Y. Kang, M. Zhang, L. Si, G. Zhou, Aspect sentiment classification with both word-level and clause-level attention networks, in: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, 2018, pp. 4439–4445. [9] X. Li, L. Bing, W. Lam, B. Shi, Transformation networks for target-oriented sentiment classification, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018, pp. 946–956. [10] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (8) (1997) 1735–1780. [11] J. Chung, Ç. Gülçehre, K. Cho, Y. Bengio, Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, (2014), arXiv:1412.3555. [12] Y. Wang, M. Huang, x. zhu, L. Zhao, Attention-based lstm for aspect-level sentiment classification, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 606–615. [13] D. Tang, B. Qin, X. Feng, T. Liu, Effective lstms for target-dependent sentiment classification, in: Proceedings of the 26th International Conference on Computational Linguistics: Technical Papers, 2016, pp. 3298–3307. [14] W. Xue, T. Li, Aspect based sentiment analysis with gated convolutional networks, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018, pp. 2514–2523. [15] D. Tang, B. Qin, T. Liu, Aspect level sentiment classification with deep memory network, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 214–224. [16] D. Ma, S. Li, X. Zhang, H. Wang, Interactive attention networks for aspect-level sentiment classification, in: Proceedings of International Joint Conference on Artificial Intelligence, 2017, pp. 4068–4074. [17] Y. Cui, Z. Chen, S. Wei, S. Wang, T. Liu, G. Hu, Attention-over-attention neural networks for reading comprehension, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017, pp. 593–602. [18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: Proceedings of Neural Information Processing Systems, 2017, pp. 5998–6008. [19] K. Schouten, F. Frasincar, Survey on aspect-level sentiment analysis, IEEE Trans. Knowl. Data Eng. 28 (3) (2016) 813–830. [20] G. Rao, W. Huang, Z. Feng, Q. Cong, LSTM with sentence representations for document-level sentiment classification, Neurocomputing 308 (2018) 49–57. [21] M. 
Dragoni, G. Petrucci, A neural word embeddings approach for multi-domain sentiment analysis, IEEE Trans. Affect. Comput. 8 (4) (2017) 457–470. [22] A. Tripathy, A. Anand, S.K. Rath, Document-level sentiment classification using hybrid machine learning approach, Knowl. Inf. Syst. 53 (3) (2017) 805–831. [23] D. Ma, S. Li, X. Zhang, H. Wang, X. Sun, Cascading multiway attentions for document-level sentiment classification, in: Proceedings of the Eighth International Joint Conference on Natural Language Processing, 2017, pp. 634– 643. [24] F. Wu, J. Zhang, Z. Yuan, S. Wu, Y. Huang, J. Yan, Sentence-level sentiment classification with weak supervision, in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2017, pp. 973–976. [25] Z. Yuan, F. Wu, J. Liu, C. Wu, Y. Huang, X. Xie, Neural sentence-level sentiment classification with heterogeneous supervision, in: Proceedings of IEEE International Conference on Data Mining, ICDM, 2018, pp. 1410–1415. [26] X. Fu, W. Liu, Y. Xu, L. Cui, Combine Hownet lexicon to train phrase recursive autoencoder for sentence-level sentiment analysis, Neurocomputing 241 (2017) 18–27. [27] V.K. Singh, R. Piryani, A. Uddin, P. Waila, Sentiment analysis of movie reviews: a new feature-based heuristic for aspect-level sentiment classification, in: Proceedings of International Multi-conference on Automation, 2013. [28] D. Deng, L. Jing, J. Yu, S. Sun, M.K. Ng, Sentiment lexicon construction with hierarchical supervision topic model, IEEE/ACM Trans. Audio Speech Lang. Process. 27 (4) (2019) 704–718. [29] M. Federici, M. Dragoni, A knowledge-based approach for aspect-based opinion mining, in: Proceedings of Third SemWebEval Challenge at Semantic Web Challenges, ESWC 2016, Heraklion, Crete, Greece, May 29–June 2, 2016, Revised Selected Papers, 2016, pp. 141–152. [30] M. Alsmadi, M. Alayyoub, Y. Jararweh, O. Qawasmeh, Enhancing aspect-based sentiment analysis of arabic hotels reviews using morphological, syntactic and semantic features, Inf. Process. Manag. 56 (2) (2018) 308–319. [31] C. Brun, J. Perez, C. Roux, Xrce: feedbacked ensemble modeling on syntactico-semantic knowledge for aspect based sentiment analysis, in: Proceedings of the 10th International Workshop on Semantic Evaluation, 2016, pp. 277– 281.
[32] A. Kumar, S. Kohail, A. Kumar, A. Ekbal, C. Biemann, Beyond sentiment lexicon: combining domain dependency and distributional semantics features for aspect based sentiment analysis, in: Proceedings of the 10th International Workshop on Semantic Evaluation, 2016, pp. 1129–1135. [33] Y. Ma, H. Peng, T.M. Khan, E. Cambria, A. Hussain, Sentic lstm: a hybrid network for targeted aspect-based sentiment analysis, Cogn. Comput. 10 (4) (2018) 639–650. [34] J. Liu, Y. Zhang, Attention modeling for targeted sentiment, in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, 2017, pp. 572–577. [35] B. Huang, Y. Ou, K.M. Carley, Aspect level sentiment classification with attention-over-attention neural networks, in: Proceedings of International Conference on Social Computing, 2018, pp. 197–206. [36] Y. Tay, A.T. Luu, S.C. Hui, Learning to attend via word-aspect associative fusion for aspect-based sentiment analysis, in: Proceedings of National Conference on Artificial Intelligence, 2018, pp. 5956–5963. [37] J. Yang, R. Yang, C. Wang, J. Xie, Multi-entity Aspect-based Sentiment Analysis with Context, Entity and Aspect Memory (2018) 6029–6036. [38] S. Sukhbaatar, A. Szlam, J. Weston, R. Fergus, End-to-end memory networks, in: Proceedings of Neural Information Processing Systems, 2015, pp. 2440–2448. [39] R. Ma, K. Wang, T. Qiu, A.K. Sangaiah, D. Lin, H.B. Liaqat, Feature-based compositing memory networks for aspect-based sentiment classification in social internet of things, Futur. Gener. Comput. Syst. 92 (2019) 879–888. [40] C. Li, X. Guo, Q. Mei, Deep memory networks for attitude identification, in: Proceedings of web search and data mining, 2017, pp. 671–680. [41] P. Chen, Z. Sun, L. Bing, W. Yang, Recurrent attention network on memory for aspect sentiment analysis, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 452–461. [42] X. Gu, Y. Gu, H. Wu, Cascaded convolutional neural networks for aspect-based opinion summary, Neural Process. Lett. 46 (2) (2017) 581–594. [43] J. Feng, S. Cai, X. Ma, et al., Enhanced sentiment labeling and implicit aspect identification by integration of deep convolution neural network and sequential algorithm, Cluster Comput. 22 (3) (2019) 5839–5857. [44] X. Wang, G. Xu, J. Zhang, X. Sun, L. Wang, T. Huang, Syntax-directed hybrid attention network for aspect-level sentiment analysis, IEEE Access 7 (2019) 5014–5025. [45] K. Shuang, X. Ren, Q. Yang, R. Li, J. Loo, Aela-dlstms: attention-enabled and location-aware double lstms for aspect-level sentiment classification, Neurocomputing 334 (2019) 25–34. [46] C. Sun, L. Huang, X. Qiu, Utilizing BERT for Aspect-based Sentiment Analysis via Constructing Auxiliary Sentence, CoRR abs/1903.09588 (2019). [47] R. He, W.S. Lee, H.T. Ng, D. Dahlmeier, Exploiting document knowledge for aspect-level sentiment classification, in: Proceedings of meeting of the association for computational linguistics, 2, 2018, pp. 579–585. [48] F. Fan, Y. Feng, D. Zhao, Multi-grained attention network for aspect-level sentiment classification, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 3433–3442. [49] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of Computer Vision and Pattern Recognition, 2015. [50] J.L. Ba, J.R. Kiros, G.E. Hinton, Layer Normalization (2016). [51] J.
Pennington, R. Socher, C. Manning, Glove: global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014, pp. 1532–1543. [52] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436–444. [53] L. Dong, F. Wei, C. Tan, D. Tang, M. Zhou, K. Xu, Adaptive recursive neural network for target-dependent twitter sentiment classification, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 2014, pp. 49–54. [54] M. Zhang, Y. Zhang, D. Vo, Gated neural networks for targeted sentiment analysis, in: Proceedings of the Thirtieth Conference on Artificial Intelligence, 2016, pp. 3087–3093.
[55] M. Yang, Q. Qu, X. Chen, C. Guo, Y. Shen, K. Lei, Feature-enhanced attention network for target-dependent sentiment classification, Neurocomputing 307 (2018) 91–97. [56] X. Li, L. Bing, W. Lam, B. Shi, Transformation networks for target-oriented sentiment classification, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018, pp. 946–956. [57] A. Vaswani, S. Bengio, E. Brevdo, F. Chollet, A.N. Gomez, S. Gouws, L. Jones, U. Kaiser, N. Kalchbrenner, N. Parmar, Tensor2tensor for Neural Machine Translation (2018). Qiannan Xu received her B.E. degree in Computer Science and Technology from Southwestern University of Finance and Economics, China, in 2017. She is currently pursuing the M.S. degree in the School of Software Engineering at Xi’an Jiaotong University. Her main research interests include sentiment analysis and natural language processing.
Li Zhu received his Ph.D. degree in Computer System Architecture from Xi'an Jiaotong University, China, in 2000. He is currently a Professor in the School of Software Engineering at Xi'an Jiaotong University. His research interests include machine learning and computer networking.
Tao Dai received his B.E. and M.S. degree in Software Engineering from Xi’an Jiaotong University, China, in 2008 and 2011, respectively. He is currently a Ph.D. candidate in the School of Software Engineering at Xi’an Jiaotong University. His main research interests include machine learning and information retrieval.
Chengbing Yan received her B.E. degree in Computer Science and Technology from Sun Yat-Sen University, China, in 2017. She is currently pursuing the M.S. degree in the School of Software Engineering at Xi’an Jiaotong University. Her main research interests include machine learning and image processing.