Knowledge-Based Systems xxx (xxxx) xxx
Contents lists available at ScienceDirect
Knowledge-Based Systems journal homepage: www.elsevier.com/locate/knosys
Aspect-based sentiment analysis with gated alternate neural network✩
Ning Liu, Bo Shen∗
School of Electronic and Information Engineering, Beijing Jiaotong University, Beijing 100044, China Key Laboratory of Communication and Information Systems, Beijing Municipal Commission of Education, Beijing, China
Article info
Article history: Received 28 March 2019; received in revised form 28 August 2019; accepted 29 August 2019; available online xxxx
Keywords: Aspect-based sentiment analysis; Natural language processing; Text classification; Deep learning; CNN; RNN; Attention
Abstract
Aspect-based sentiment analysis (ABSA) is a type of fine-grained sentiment analysis. Previous work in ABSA is mostly based on recurrent neural networks (RNNs). However, RNNs employed in ABSA have some weaknesses, such as a lack of position invariance and a lack of sensitivity to local key patterns. A convolutional neural network (CNN) addresses these limitations of RNNs, but is itself weak at capturing long-distance dependency and modeling sequence information. Moreover, the attention mechanism employed in ABSA may introduce noise that is detrimental to capturing important sentiment expressions. In this paper, we assume that a sentence consists of several sentiment clues, and that a sentiment clue consists of multiple words. Based on this, we propose a novel neural network structure, named the Gated Alternate Neural Network (GANN), to address the limitations mentioned above. In GANN, a specially designed module, named the Gate Truncation RNN (GTR), is used to learn informative aspect-dependent sentiment clue representations. These representations concurrently encode the relative distance between each context word and the aspect target, the sequence information, and the semantic dependency within a sentiment clue. To filter out noise, a gating mechanism is designed to control information flow and obtain more precise representations. Convolution and pooling mechanisms are employed to capture key local sentiment clue features and acquire position invariance of the features. To verify the effectiveness and generalization of GANN, we conducted extensive experiments on four Chinese and three English datasets. The experimental results show that GANN achieves state-of-the-art results and indicate that our proposed model is language-independent. © 2019 Elsevier B.V. All rights reserved.
1. Introduction
Sentiment analysis has become an important task of natural language processing in recent years, especially for data such as blogs, forums, microblogs, e-commerce platforms, and other online social media. Sentiment analysis aims at detecting the sentiment polarity or sentiment rating of an aspect, sentence or document [1]. Sentiment analysis is regarded as a key technique for realizing strong artificial intelligence and machines that can entirely understand human languages. Additionally, sentiment analysis can be used in various application scenarios, such as financial [2] and political prediction [3], e-health [4] and e-tourism [5], user profiles [6] and user influence, community detection [7] and dialogue systems [8].
✩ No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.knosys.2019.105010.
∗ Corresponding author at: School of Electronic and Information Engineering, Beijing Jiaotong University, Beijing 100044, China.
E-mail addresses:
[email protected] (N. Liu),
[email protected] (B. Shen).
Aspect-based sentiment analysis (ABSA) is a form of fine-grained sentiment analysis. It aims at detecting the polarity of an entity or an entity's aspect in a sentence. ABSA can be divided into two subtasks according to the object of analysis: aspect term (or target) sentiment analysis and aspect category sentiment analysis. As the name implies, aspect term sentiment analysis detects the polarity of an aspect target that occurs in the sentence. Aspect category sentiment analysis aims at mining the polarity of the aspect category to which a target belongs. For example, in the sentence "The food was lousy - too sweet or too salty and the portions tiny", the aspect terms are "food" and "portions", and the sentiment polarities towards both aspect terms are negative. The aspect term "food" belongs to the aspect category "QUALITY", while the aspect term "portions" belongs to the aspect category "STYLE_OPTIONS". Correspondingly, both aspect categories are negative. In this paper, we are concerned with the aspect target sentiment analysis task. Recurrent neural networks (RNNs) are commonly used to model sentiment polarity toward an aspect term or category in an ABSA task. An RNN can model sentence sequence information and capture long-distance dependency, but it lacks position invariance and sensitivity to local key patterns. These limitations may reduce the performance of RNNs when they are
https://doi.org/10.1016/j.knosys.2019.105010 0950-7051/© 2019 Elsevier B.V. All rights reserved.
Please cite this article as: N. Liu and B. Shen, Aspect-based sentiment analysis with gated alternate neural network, Knowledge-Based Systems (2019) 105010, https://doi.org/10.1016/j.knosys.2019.105010.
N. Liu and B. Shen / Knowledge-Based Systems xxx (xxxx) xxx
employed in the ABSA task. Meanwhile, convolutional neural networks (CNNs) can capture key local features and have position invariance, but they are weak at modeling long-distance dependency. In ABSA, word order information is also important. Unfortunately, a CNN is not sensitive to word order, which can lead to wrong decisions; for example, it may fail to distinguish "not good, but poor" from "not poor, but good". Intuitively, the distance between a word and the aspect term plays an auxiliary role in ABSA: a closer word contributes more information, while a more distant word contributes less. As such, encoding the relative distance between context words and aspect terms into a model is important in ABSA. Moreover, a sentence may in some cases contain multiple aspect terms. Thus, how best to make a model precisely select the context information relevant to the current aspect term remains a challenge. One promising approach is the attention mechanism. However, attention can introduce noise that misleads the model into learning information irrelevant to the current aspect target, so this noise also needs to be alleviated. In this paper, we assume that a sentence consists of multiple fragments, which we call sentiment clues. These sentiment clues play a decisive role in identifying the sentiment polarity of an aspect target. Each sentiment clue consists of multiple words. Modeling these clues and combining them with the aspect target will improve the model's performance in the ABSA task. Based on this assumption, we propose a novel language-independent deep neural network, named Gated Alternate Neural Network (GANN), to overcome the inadequacies mentioned above. GANN has two key layers: a Gate Truncation Layer, and a Local Feature and Position Invariance Learning Layer. In the Gate Truncation Layer, a Gate Truncation RNN (GTR) is designed to learn informative, denoised, aspect-dependent sentiment clue representations.
These representations concurrently encode the relative distance between context words and an aspect target, the sequence information, and the semantic dependency within the sentiment clues. A gating mechanism named a filter gate is contained in GTR to filter out noise. In the Local Feature and Position Invariance Learning Layer, convolution and pooling mechanisms are designed to capture the key denoised aspect-dependent sentiment clue representations and acquire position invariance of the features. Experimental results show that the proposed model is effective for ABSA and achieves state-of-the-art results on three English datasets and on most of four Chinese datasets. Further, these results across datasets in different languages indicate that our proposed model is language-independent and dataset source-independent. The main contributions of this research are the following.
• We assume that a review or opinion sentence consists of multiple sentiment clues, which are sentence fragments. A specific sentiment clue can play an important role in deciding the sentiment polarity towards an aspect term or aspect category.
• We propose a novel framework, named Gated Alternate Neural Network (GANN), to overcome the weaknesses of CNNs and RNNs used in ABSA.
• We propose a Gate Truncation RNN (GTR) module to encode and learn sentiment clue representations.
• We propose a filter gating mechanism to alleviate the noise that is introduced by the attention mechanism.
• Extensive experiments were conducted on three English and four Chinese datasets from different sources and domains. GANN achieves state-of-the-art results when compared with other state-of-the-art methods. These experimental results verify the superior performance and powerful generalization ability of GANN. Further, these results also demonstrate that GANN is language-independent and dataset source-independent.
2. Related work
In the related literature, the algorithms proposed for sentiment analysis can be divided into three categories: symbolic and linguistic rule-based approaches, traditional machine learning methods, and deep learning methods. Symbolic and linguistic rule-based approaches employ lexicons [9], ontologies [10] and linguistic rules [11] to detect the polarity of the object. Traditional machine learning methods take advantage of statistical theory, such as Bayesian theory [12], the Maximum Entropy model [13] and the Support Vector Machine (SVM) [14]. Traditional machine learning methods can obtain good performance; for example, an SVM can use a small-scale dataset to achieve good generalization ability. However, they depend on feature engineering, which is time-consuming and laborious work. In deep learning, neural networks automatically detect abstract features that are suitable for the task, e.g., an autoencoder [15]. In general, the recurrent neural network (RNN) [16], long short-term memory (LSTM) [17,18], bidirectional LSTM [19], gated recurrent unit (GRU) [20], convolutional neural network (CNN) [21] and recursive neural network [22] are frequently used to encode the sequence of words to decide the sentiment polarity of the specific aspect, sentence or document. In this section, we discuss previous work on aspect-based sentiment analysis from three perspectives: RNN, CNN and memory network. These methods are summarized in Table 1. There are also some literature reviews related to the ABSA task. A good survey of and introduction to the field of aspect-based sentiment analysis is the 2016 work of Schouten and Frasincar [23], which discusses various evaluation measures and techniques and also covers related and complicated issues. In recent years, deep learning has dominated many application domains, including aspect-based sentiment analysis. Zhang et al.
[24] provided a particularly timely overview of multi-granularity sentiment analysis, covering document-level, sentence-level, and aspect-level sentiment classification and other related tasks. They focused on deep learning and provided a comprehensive study of its current applications in sentiment analysis.
2.1. Methods based on recurrent neural network
Target-Dependent LSTM (TD-LSTM) and Target-Connection LSTM (TC-LSTM) are two RNN-based neural networks that focus on the ABSA task [25]. The former divides the sentence into a left part and a right part around the aspect target and feeds them into two separate LSTM models, one in forward order and one in backward order. Because it does not consider the relevancy between aspect terms and sentence context words, TD-LSTM does not learn aspect-dependent sentence representations. The latter implicitly models this relevancy by concatenating the aspect target embedding with the context word embeddings and feeding them into two LSTMs in forward and backward order. To learn sentence semantic representations, Wang et al. [26] proposed the AT-LSTM and ATAE-LSTM models, which combine the attention mechanism with LSTM. The authors found that the attention mechanism can capture the importance of different sentence context information according to the aspect terms. In the AT-LSTM model, relevancy is modeled by concatenating the hidden output representation and the aspect target embedding to learn attention values. To model the semantic relationship between context words and the specific aspect target more powerfully, ATAE-LSTM additionally concatenates the aspect target embedding with the context word embeddings on the basis of AT-LSTM. When the distance in dependency is very long, one attention mechanism may be poor at capturing different key context
words toward different aspect targets. To alleviate this problem, multiple-attention mechanisms are adopted in Recurrent Attention on Memory (RAM) [27] and Interactive Attention Network (IAN) [28]. In general, a review consists of multiple sentences, and a sentence consists of multiple words; thus a review naturally has a hierarchical structure. Based on this structure, Ruder et al. [29] proposed a hierarchical model, named Hierarchical bidirectional LSTM (H-LSTM), for the ABSA task. They found that modeling the inner knowledge of the review structure can improve model performance. Some researchers argue that averaging aspect target embeddings may introduce new, irrelevant word meanings and lose the sequence information of the specific aspect target, which would severely reduce the model's performance. To tackle this problem, Target-specific Transformation Networks (TNet) [30] and the Aspect Target Sequence Model (ATSM) [31] model the aspect target explicitly. TNet employs a bidirectional LSTM to model the aspect term sequence and generates a target-specific representation through the interaction of each context representation with the specific aspect term representation. Instead of an attention mechanism, TNet uses a CNN to extract salient features. ATSM explicitly encodes not only context information, but also the specific aspect target at three granularities: radical, character and word. In ATSM, the attention mechanism is employed to learn the representation of the aspect target.
2.2. Methods based on convolutional neural network
Kim et al. [21] employed convolution and the max-pooling mechanism to capture local key features in sentiment analysis. However, the algorithm is not directly suitable for ABSA. One reason is that a sentence may contain multiple aspect targets, and a convolutional neural network may not precisely distinguish the different sentiment expressions of different aspect targets, which hinders performance. Additionally, long-distance dependency and word order are not captured by a convolutional neural network. In ABSA tasks, convolutional neural networks are often employed as an additional component to discover local key features, as in CNN + LP (linguistic patterns) [32], or as a powerful tool for replacing attention, as in TNet [30].
2.3. Methods based on memory network
Recently, memory networks have played a large role in question answering [33,34]. Tang et al. [35] first used a memory network in the ABSA task (we call this model MemNN). They proposed a deep memory network by viewing context words as the fact description and viewing the aspect target as the question. The task is formalized as answering the sentiment polarity towards the specific aspect target. However, their work models word–aspect relationships only implicitly, which may be inadequate to generate powerful attention values and word–aspect representations. To mitigate this problem, Tay et al. proposed Aspect Fusion LSTM (AF-LSTM) [36], based on LSTM, and Dyadic Memory Networks (DyMemNN) [37], based on a memory network, which employ circular convolution and circular correlation to implement word–aspect associative fusion. None of these approaches eliminates the noise introduced by the attention mechanism, which degrades prediction accuracy [30]. This phenomenon has also been observed in machine translation [38] and image captioning [39]. Our proposed neural network GANN differs markedly from the methods mentioned above, such as MemNN, RAM and AF-LSTM. Firstly, GANN can alleviate the weaknesses present in RNN-based methods by dividing a sentence into multiple sentiment clues and employing convolution and max-pooling operations to detect local features. Secondly, GANN can alleviate the weakness present in CNN-based methods by proposing a GTR module to encode these sentiment clues. To address the noise problem, we designed the filter gating mechanism. Moreover, unlike methods based on a memory network, our approach does not view the sentence as the fact and the aspect as the query.
Table 1
Summary of the previous related work. The symbol "√" denotes that the model fits into the category.
Models (rows): TD-LSTM, TC-LSTM, AT-LSTM, ATAE-LSTM, RAM, IAN, H-LSTM, TNet, ATSM, CNN+LP, MemNN, AF-LSTM, DyMemNN
Categories (columns): Based on RNN; Based on CNN; Based on memory network; Attention mechanism
3. Methodology
This section introduces our model. To explain the proposed model clearly, all symbols used in it are listed in Table 2.
3.1. Method overview
This section defines the task and describes the model inputs. Subsequently, an overview of our proposed model is given.
3.1.1. Task definition
The goal of the task is to predict the sentiment polarity of a specific aspect target in a sentence, choosing among three classes (positive, neutral, negative) or two classes (positive, negative).
3.1.2. Model inputs
Some researchers view the Chinese written character [40] or radical [41] as the basic unit. In contrast, we consider the word as the atomic unit because in most circumstances feelings are expressed by combinations of words. Given the context words $c = \{w_1, w_2, \ldots, w_{n-1}, w_n\}$ and the aspect term (also called aspect target) $e$, we can obtain the relative distance $s_j$ between the $j$th context word and the specific aspect term in a sentence. When the specific aspect term contains multiple words, we take the average of their word embeddings as the embedding of the aspect target. The length of the context is $n$. GANN accepts the context, the specific aspect target, and the relative distance as inputs. The following subsections discuss each input in detail.
3.1.2.1. Input form of the sequence. Vanilla RNN-based networks generally receive the sentence context, in order, and the aspect target as inputs. However, when the sequence is long, they suffer from a heavy computing burden and from forgetting earlier dependent information. To alleviate these problems, we divide a sentence into multiple sentiment clues, each consisting of $m$ words. More simply, a sliding window of size $m$, which is a hyperparameter in GANN, starts from the beginning of the context and advances one word at a time. In Fig. 1, the symbol h represents the
Table 2
All symbols used to describe the proposed model.
Symbol | Type | Meaning or implication
$c$ | Set | The context words
$w_i$ | One-hot vector | The $i$th word in the context
$e$ | One-hot vector | The specific aspect target
$n$ | Scalar | The length of the context
$m$ | Scalar | The size of the sliding window
$D$ | Matrix | The word embedding matrix
$V$ | Scalar | The size of the vocabulary
$d$ | Scalar | The dimension of the word embedding
$\hat{w}_i$ | Vector | The word embedding of the $i$th context word
$\hat{e}$ | Vector | The word embedding of the aspect target
$r$ | Scalar | The length of the multiword aspect target
$s_j$ | Scalar | The relative distance between the $j$th context word and the specific aspect term
$l_j$ | One-hot vector | The one-hot representation of the relative distance $s_j$
$E$ | Matrix | The location embedding matrix
$\hat{l}_j$ | Vector | The location embedding of the $j$th context word
$p$ | Scalar | The maximum of the relative distance
$q$ | Scalar | The dimension of the location embedding
$z_t, r_t$ | Vector | The update and reset gates in GRU
$\sigma_g(), \sigma_f()$ | Function | The sigmoid function
$g()$ | Function | The hyperbolic tangent function
$W_i, U_i$ | Matrix | $i \in \{z, r, h, f, g, m\}$, the weight parameter matrices in GANN
$b_i$ | Vector | $i \in \{z, r, h, f, g, m\}$, the biases in GANN
$h, \overrightarrow{h}, \overleftarrow{h}$ | Vector | The outputs of the bidirectional, the forward and the backward GRU
$k$ | Scalar | The dimension of the GRU unit
$f_i$ | Vector | The filter gate in GTR
$h'_i$ | Vector | The output of the $i$th time step of the Filter Gate Layer
$a_i$ | Scalar | The attention weight of $h'_i$
$h^l$ | Vector | The output of the $l$th GTR in the attention layer
$h^g$ | Vector | The final aspect-dependent sentiment clue representation
$s_h, s_w$ | Scalar | The height and width of the convolution kernel
$s_{h'}, s_{w'}$ | Scalar | The height and width of the max-pooling window
$h^c, h^f$ | Vector | The outputs of the convolution and max-pooling operations
$h^m, h^s$ | Vector | The outputs of the fully connected layer and the softmax layer
$\hat{c}$ | Scalar | The number of sentiment polarity classes
$\hat{y}$ | Scalar | The predicted class of the sentiment polarity
$\odot$ | Operation | Element-wise multiplication
$;$ | Operation | Vector concatenation
Fig. 1. The input form comparison between vanilla RNN and GANN. The dashed box in GANN indicates the sliding window. The number m is the size of the sliding window, m is 2 here.
output of the network or module. In the RNN presented in Fig. 1, h represents the output representation of a word, which is encoded by the RNN. In the GANN presented in Fig. 1, h represents the output representation of a sentiment clue, which is encoded by a GTR module. The contrast between GANN and a vanilla RNN is evident in Fig. 1.
3.1.2.2. Sentence context and aspect target embedding. Before we feed the context and aspect target into GANN, we need to map each one-hot vector into a word embedding. The word is mapped from a high-dimensional sparse vector into a low-dimensional dense vector, and compositionality can then be computed by mathematical operations in the embedding space. Let $D \in \mathbb{R}^{V \times d}$ be the word embedding matrix, where $V$ is the size of the vocabulary and $d$ is the dimension of the word embedding. The context embedding can be obtained by a lookup table as follows:
$$\hat{w}_j = \mathrm{lookup}(w_j) = w_j^{T} \cdot D \tag{1}$$
where $w_j \in \mathbb{R}^{V \times 1}$ is a one-hot vector in which the $j$th element is 1 and all other elements are 0. When the aspect target consists of a single word, the aspect embedding can be obtained as follows:
$$\hat{e} = \mathrm{lookup}(e) = e^{T} \cdot D, \quad e \in \mathbb{R}^{V \times 1} \tag{2}$$
When the aspect target consists of $r$ words, the aspect embedding is computed as follows:
$$\hat{e} = \frac{1}{r} \sum_{j=1}^{r} \mathrm{lookup}(e_j) = \frac{1}{r} \sum_{j=1}^{r} e_j^{T} \cdot D, \quad e_j \in \mathbb{R}^{V \times 1} \tag{3}$$
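As a rough illustration of Eqs. (1)-(3), the sketch below performs the one-hot lookup and the averaging of a multiword aspect target in plain Python. The vocabulary, the numbers in the matrix `D`, and the helper names `one_hot`, `lookup` and `aspect_embedding` are assumptions made for this example only and do not come from the paper.

```python
# Toy vocabulary and embedding matrix D (V x d); all values are invented for illustration.
vocab = {"the": 0, "battery": 1, "life": 2, "is": 3, "great": 4}
D = [
    [0.0, 0.1],
    [0.2, 0.4],
    [0.6, 0.0],
    [0.1, 0.1],
    [0.9, 0.3],
]


def one_hot(index, size):
    """Return a one-hot list with a 1 at `index` and 0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def lookup(word):
    """Eqs. (1)/(2): the product of the one-hot row vector with D selects one row of D."""
    w = one_hot(vocab[word], len(D))
    dim = len(D[0])
    return [sum(w[v] * D[v][k] for v in range(len(D))) for k in range(dim)]


def aspect_embedding(aspect_words):
    """Eq. (3): average the embeddings of the r words of the aspect target."""
    r = len(aspect_words)
    vectors = [lookup(word) for word in aspect_words]
    return [sum(vec[k] for vec in vectors) / r for k in range(len(D[0]))]


single = aspect_embedding(["battery"])         # reduces to Eq. (2) when r = 1
multi = aspect_embedding(["battery", "life"])  # mean of two rows of D
```

In practice the one-hot product is usually replaced by direct row indexing; the explicit product is kept here only to mirror the notation of the equations.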
3.1.2.3. Location embedding. When the specific aspect target contains one word, the relative distance between each context word and the specific aspect term can be computed as follows:
$$s_j = \left| i_s - j_s \right| \tag{4}$$
Fig. 2. GANN overall framework. In the embedding layer, red circles represent sentence context embeddings, green circles represent location embeddings, and dark red circles represent the aspect target embedding. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
where $i_s$ denotes the index of the specific aspect target in the sentence and $j_s$ denotes the index of the specific context word in the sentence. In the case of an aspect term consisting of multiple words, $e = \{e_i, e_{i+1}, \ldots, e_{i+r}\}$, where $r$ is the length of the multiword aspect term, $s_j$ can be obtained as below:
$$s_j = \begin{cases} i - j, & \text{when context word } w_j \text{ is on the left of } e \text{ in a sentence} \\ j - (i + r), & \text{when context word } w_j \text{ is on the right of } e \text{ in a sentence} \end{cases} \tag{5}$$
Given an embedding matrix $E \in \mathbb{R}^{p \times q}$, where $p$ is the maximum of the relative distance and $q$ is the dimension of the location embedding space, we first convert the relative distance $s_j$ into the location one-hot vector $l_j$, and then convert the one-hot vector into a location embedding through the lookup table, as follows:
$$\hat{l}_j = \mathrm{lookup}(l_j) = l_j^{T} \cdot E, \quad l_j \in \mathbb{R}^{p \times 1} \tag{6}$$
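The relative distance of Eqs. (4)-(5) and the location lookup of Eq. (6) can be sketched as follows. This is a minimal illustration under stated assumptions: positions are 0-based, the aspect span is given by hypothetical `first` and `last` indices (playing the roles of $i$ and $i + r$ in Eq. (5)), words inside the aspect span receive distance 0, and distances beyond the last row of `E` are clipped to that row. None of these conventions is specified in the text, and the values in `E` are invented.

```python
def relative_distances(n, first, last):
    """Eqs. (4)-(5): distance of every context position to the aspect span [first, last]."""
    distances = []
    for j in range(n):
        if j < first:          # context word on the left of the aspect
            distances.append(first - j)
        elif j > last:         # context word on the right of the aspect
            distances.append(j - last)
        else:                  # assumption: words of the aspect itself get distance 0
            distances.append(0)
    return distances


def location_embedding(distance, E):
    """Eq. (6): turn the distance into a one-hot vector l_j and multiply it with E (p x q)."""
    p = len(E)
    row = min(distance, p - 1)  # assumption: clip distances that exceed the table
    l = [1.0 if i == row else 0.0 for i in range(p)]
    q = len(E[0])
    return [sum(l[i] * E[i][k] for i in range(p)) for k in range(q)]


tokens = "the battery life is great".split()
dist = relative_distances(len(tokens), first=1, last=2)  # aspect target: "battery life"
E = [[0.0, 0.0], [0.5, 0.1], [0.2, 0.7]]                 # toy location table, p = 3, q = 2
loc = [location_embedding(s, E) for s in dist]
```

With a single-word aspect (`first == last`) the function reduces to the absolute difference of Eq. (4).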
Lastly, the input vectors are obtained by concatenating the location embeddings, context embeddings, and aspect target embedding.
3.1.3. Algorithm overview
We designed a hierarchical neural architecture, named GANN, with four goals: to capture long-distance dependency and encode sequence information within a sentiment clue, to filter out noise in the attention mechanism, to capture the local key features of the sentiment clues, and to obtain position invariance. In GANN, the Gate Truncation Layer is designed to model the sentiment clues. It is composed of Gate Truncation RNN (GTR) modules, which capture the long-distance dependency and encode the sequence information within a sentiment clue. Because of the gate mechanism in GTR, we can obtain precise, denoised, and informative aspect-dependent sentiment clue representations. A Local Feature and Position Invariance Learning Layer is designed to learn the local key sentiment clue representation and obtain position invariance by employing the convolution and max-pooling mechanisms. A Fully Connected Layer and a nonlinearity are designed to extract high-level abstract features and transform the final aspect-dependent sentence representation into a feature that is suitable for the ABSA task. The overall architecture is shown in Fig. 2.
3.2. Gate Truncation RNN
This section describes the details of the GTR module and explains how it achieves good performance. GTR consists of a bidirectional GRU layer, a filter gate layer, and an attention layer, which are discussed in the following subsections. The GTR architecture is shown in Fig. 3.
3.2.1. Bidirectional GRU layer
Compared to a unidirectional GRU, a bidirectional GRU takes into account the forward semantic information of the sentence while also considering the backward information of the sentence. The
Fig. 3. The overall architecture of GTR. The dimension of the output representation hl is equal to the dimension of the bidirectional GRU, e.g. h1 . The superscript l represents the index of the outputs in Gate Truncation Layer, l ∈ [1, n−m+1], n is the length of context words, m is the size of the sliding window. In this figure, m is 3.
backward semantic information allows the model to use future words when encoding the current word. Thus, we employed a bidirectional GRU as the encoder in GTR, as shown in the middle and bottom parts of Fig. 3. The bidirectional GRU is composed of a forward GRU and a backward GRU. Context words are fed into the forward GRU in forward order and into the backward GRU in reverse order. A unidirectional GRU has two gates and one hidden output. Taking the forward calculation as an example, they are obtained as follows:
$$z_t = \sigma_g(W_z x_t + U_z h_{t-1} + b_z) \tag{7}$$
$$r_t = \sigma_g(W_r x_t + U_r h_{t-1} + b_r) \tag{8}$$
$$\tilde{h} = g(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) \tag{9}$$
$$\overrightarrow{h} = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h} \tag{10}$$
where $z_t$ is the update gate, $r_t$ is the reset gate, $\overrightarrow{h}$ is the output state of the forward GRU unit, $x_t$ denotes the current input of the sentence context at time step $t$, $\sigma_g$ is the logistic nonlinearity, and $g$ is the activation nonlinearity, which is generally the hyperbolic tangent function but may also be a rectified linear unit (ReLU). The operation $\odot$ denotes element-wise (Hadamard) multiplication. For simplicity, the output representation of the forward GRU unit is denoted by $\overrightarrow{h}$, and the output representation of the backward GRU unit is denoted by $\overleftarrow{h}$. The output representation of the bidirectional GRU is denoted as below:
$$h = [\overrightarrow{h}; \overleftarrow{h}] \tag{11}$$
where the operation ";" denotes the concatenation of vectors. We concatenate the two vectors by row, with $\overrightarrow{h} \in \mathbb{R}^{k \times 1}$, $\overleftarrow{h} \in \mathbb{R}^{k \times 1}$, and $h \in \mathbb{R}^{2k \times 1}$, where $k$ is the dimension of the GRU unit.
3.2.2. Filter gate layer
The attention mechanism may introduce noise, that is, information that is irrelevant when predicting the sentiment polarity of the aspect target, such as context words that are not correlated with expressing the sentiment, e.g. "is", "has", "after", etc. In addition, when a sentence contains multiple aspect targets, the bidirectional GRU may wrongly encode sentiment expressions directed at another aspect target and fail to accurately encode those directed at the current aspect target. This forces the model to capture incorrect aspect-dependent sentiment clue representations
3.3. Convolution and pooling layer
This section provides details of the convolution and max-pooling mechanisms. These two operations are the key techniques used to detect local sentiment clue representations and acquire position invariance.
Fig. 4. The overall architecture of the filter gate. The green representation $h_i$ denotes the output representation of the bidirectional GRU at the $i$th time step, that is, the concatenation of the outputs of the two directions.
and reduces the performance of the algorithm. To mitigate the problems mentioned above, a filter gate is designed that uses a gating mechanism to control the signal flow. A signal passes through the filter gate when it contributes informative sentiment information; otherwise, the signal is discarded. The architecture is shown in Fig. 4. The filter gate accepts the concatenation of the output representation of the bidirectional GRU and the aspect term embedding as inputs. There is a gate and a nonlinear activation function in the filter gate. The output of the filter gate is defined as below:
$$h'_i = f_i \odot h_i \tag{12}$$
$$f_i = \sigma_f(W_f \cdot h_i^{T} + U_f \cdot \hat{e} + b_f) \tag{13}$$
where $\hat{e}$ is the aspect term embedding, $W_f \in \mathbb{R}^{2k \times 2k}$, $U_f \in \mathbb{R}^{2k \times d}$, $b_f \in \mathbb{R}^{2k \times 1}$, and $h'_i \in \mathbb{R}^{2k \times 1}$. The operation $\odot$ denotes element-wise (Hadamard) multiplication, and $\sigma_f$ is the sigmoid function.
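To make the data flow of Eqs. (7)-(13) concrete, the following sketch runs a tiny bidirectional GRU over one window and then applies the filter gate. It is an illustrative sketch, not the authors' implementation: all vectors are treated as plain Python lists (so the row/column transposes of the equations are ignored), the weights are arbitrary constants rather than trained parameters, and the helper names `toy_params`, `constant`, `bigru` and `filter_gate` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]


def hadamard(a, b):
    return [x * y for x, y in zip(a, b)]


def gru_step(x_t, h_prev, P):
    """Eqs. (7)-(10): one time step of a unidirectional GRU with g = tanh."""
    z = [sigmoid(v) for v in add(matvec(P["Wz"], x_t), matvec(P["Uz"], h_prev), P["bz"])]
    r = [sigmoid(v) for v in add(matvec(P["Wr"], x_t), matvec(P["Ur"], h_prev), P["br"])]
    h_tilde = [math.tanh(v) for v in add(matvec(P["Wh"], x_t),
                                         matvec(P["Uh"], hadamard(r, h_prev)), P["bh"])]
    return [(1.0 - zi) * hp + zi * ht for zi, hp, ht in zip(z, h_prev, h_tilde)]


def bigru(xs, P_fwd, P_bwd, k):
    """Eq. (11): run the window forward and backward, then concatenate per time step."""
    fwd, h = [], [0.0] * k
    for x in xs:
        h = gru_step(x, h, P_fwd)
        fwd.append(h)
    bwd, h = [], [0.0] * k
    for x in reversed(xs):
        h = gru_step(x, h, P_bwd)
        bwd.append(h)
    bwd.reverse()
    return [f + b for f, b in zip(fwd, bwd)]  # list concatenation plays the role of [h_fwd; h_bwd]


def filter_gate(h_i, e_hat, Wf, Uf, bf):
    """Eqs. (12)-(13): f_i = sigmoid(Wf h_i + Uf e_hat + bf), then h'_i = f_i * h_i."""
    f = [sigmoid(v) for v in add(matvec(Wf, h_i), matvec(Uf, e_hat), bf)]
    return hadamard(f, h_i)


def constant(rows, cols, value):
    return [[value] * cols for _ in range(rows)]


def toy_params(k, in_dim, value):
    """Untrained toy parameters: every weight equals `value`, every bias is zero."""
    P = {}
    for gate in ("z", "r", "h"):
        P["W" + gate] = constant(k, in_dim, value)
        P["U" + gate] = constant(k, k, value)
        P["b" + gate] = [0.0] * k
    return P


window = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [1.0, 1.0, 0.5]]  # m = 3 toy input vectors
P = toy_params(k=2, in_dim=3, value=0.1)
states = bigru(window, P, P, k=2)          # each state has 2k = 4 entries
e_hat = [0.4, 0.2]                         # toy aspect embedding, d = 2
opened = filter_gate(states[0], e_hat, constant(4, 4, 0.0), constant(4, 2, 0.0), [50.0] * 4)
closed = filter_gate(states[0], e_hat, constant(4, 4, 0.0), constant(4, 2, 0.0), [-50.0] * 4)
```

The two calls at the end use extreme biases only to show the two regimes of the gate: with a large positive pre-activation the state passes almost unchanged, and with a large negative one it is suppressed toward zero.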
3.2.3. Attention layer In fact, only a few words can play a decisive role in determining the sentiment polarity of the aspect target in the specific sentiment clue. Attention can capture important sentiment words in a sentiment clue toward the current aspect target. The architecture of attention can be observed at the top of Fig. 3. The output of the attention layer is calculated as the following: hl =
m ∑
ai · h′i
(14)
i=1
egi ai = softmax (gi ) = ∑m
j=1
egj
gi = tanh(Wg · h′i ; eˆ + bg )
[
]
(15) (16)
where $m$ is the size of the sliding window, $W_g \in \mathbb{R}^{1 \times (2k+d)}$, $b_g \in \mathbb{R}^{1 \times 1}$, $g_i \in \mathbb{R}$, $h^l \in \mathbb{R}^{2k \times 1}$, $l \in [1, n-m+1]$ is the index of the GTR, and $n$ is the number of context words. We thus obtain multiple aspect-dependent sentiment clue representations, whose number depends on $m$. The final aspect-dependent sentiment clue representation $h^g$ is formed by concatenating them column-wise:

$$h^g = [h^1; h^2; \cdots; h^{n-m+1}] \tag{17}$$

where $h^g \in \mathbb{R}^{2k \times (n-m+1)}$.
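Eqs. (14) to (17) can be illustrated with a short NumPy sketch. It is a simplified illustration under stated assumptions rather than the authors' implementation: the gated BiGRU outputs of each window are replaced by random arrays, $W_g$ is stored as a flat vector, and the names `attention_pool` and `windows` are hypothetical.

```python
import numpy as np


def attention_pool(h_prime, e_hat, W_g, b_g):
    """Attention over one window, following Eqs. (14)-(16).

    h_prime: (m, 2k) gated representations of one sliding window
    e_hat:   (d,)    aspect term embedding
    W_g:     (2k + d,) weight vector (the 1 x (2k+d) matrix, flattened)
    b_g:     scalar bias
    """
    m = h_prime.shape[0]
    # [h'_i ; e_hat] for every time step, shape (m, 2k + d)
    concat = np.hstack([h_prime, np.tile(e_hat, (m, 1))])
    g = np.tanh(concat @ W_g + b_g)          # Eq. (16), shape (m,)
    a = np.exp(g - g.max())                  # numerically stable softmax, Eq. (15)
    a = a / a.sum()
    return a @ h_prime, a                    # Eq. (14): weighted sum, shape (2k,)


# Toy sizes and random stand-ins for the gated BiGRU outputs (hypothetical).
rng = np.random.default_rng(0)
n, m, k, d = 12, 5, 3, 6
e_hat = rng.standard_normal(d)
W_g = rng.standard_normal(2 * k + d)
b_g = 0.0
windows = rng.standard_normal((n - m + 1, m, 2 * k))  # one block per GTR

# Eq. (17): stack the n - m + 1 clue representations column-wise.
clues = [attention_pool(w, e_hat, W_g, b_g)[0] for w in windows]
h_g = np.stack(clues, axis=1)
print(h_g.shape)  # (6, 8), i.e. (2k, n - m + 1)
```

With $n = 12$ context words and a window of $m = 5$, the sketch yields $n - m + 1 = 8$ clue representations, one per window position.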
3.3.1. Convolution layer

After obtaining the final aspect-dependent sentiment clue representation, we apply a convolution operation to capture local features in it. Convolution can capture the most important aspect-dependent sentiment clue vectors and extract a high-level abstract representation. Local features are important in determining the sentiment polarity of the aspect target, and convolution also helps to acquire position invariance. The output features of the convolution are defined as follows:

$$h^c = \mathrm{Conv2d}\left((h^g)^T \mid s_h, s_w\right) \tag{18}$$

where $\mathrm{Conv2d}$ denotes the convolution operation, and "$\mid$" indicates that the operation is conditioned on the given $s_h$ and $s_w$, the height and the width of the convolution kernel. In general, $s_h$ is chosen freely, while $s_w$ equals the dimension of the concatenated output representation of the bidirectional GRU. In other words, when $s_w$ takes this value, $\mathrm{Conv2d}$ can be viewed as the convolution in the 1d-CNN proposed by Kim et al. [21]. We slide the kernel with a stride of one and finally obtain $h^c \in \mathbb{R}^{(n-m-s_h+2) \times 1}$.

3.3.2. Max-pooling layer

The max-pooling layer is the key to realizing position invariance. Max-pooling detects the most important aspect-dependent sentiment clue feature among the local features. The decisive aspect-dependent sentiment representation extracted by max-pooling is regarded as the final aspect-dependent sentiment representation. An important characteristic of max-pooling is that it transforms variable-length vectors into fixed-length vectors. The output of the pooling is given as follows:

$$h^p = \mathrm{MaxPooling}\left(h^c \mid s_{h'}, s_{w'}\right) \tag{19}$$

where $\mathrm{MaxPooling}$ represents the max-pooling operation, and $s_{h'}$ and $s_{w'}$ are the height and the width of the pooling window, respectively. In general, $s_{h'}$ is $n-m-s_h+2$, $s_{w'}$ is 1, and the window moves one step at a time. We thus get $h^p \in \mathbb{R}^{1 \times 1}$. If we use $n$ filters for each filter size in the convolution layer (here $n$ denotes the number of filters rather than the number of context words), we obtain $n$ aspect-dependent sentiment representations $(h^p_1, h^p_2, \ldots, h^p_n)$ after the max-pooling operation. We flatten these representations and acquire the final aspect-dependent sentiment representation $h^f \in \mathbb{R}^{n \times 1}$.

3.4. Fully connected layer and softmax layer

To predict the sentiment polarity of the aspect target, a fully connected layer transforms the final sentiment representation $h^f$ into a high-level sentiment representation that is suitable for predicting the sentiment orientation. The fully connected layer consists of a multi-layer perceptron (MLP), and its output is obtained as follows:

$$h^m = \mathrm{relu}\left(W_m \cdot h^f + b_m\right) \tag{20}$$

where $W_m$ and $b_m$ are parameters learned during training, $W_m \in \mathbb{R}^{\hat{c} \times n}$, $b_m \in \mathbb{R}^{\hat{c} \times 1}$, $h^m \in \mathbb{R}^{\hat{c} \times 1}$, and $\hat{c}$ is the number of predicted sentiment classes.
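The shapes in Eqs. (18) to (20) can be checked with a small NumPy sketch. It is an illustration under assumptions, not the authors' code: the kernels span the full width $2k$, no padding, bias, or nonlinearity is applied after the convolution (the text does not specify these), and all sizes, names, and random parameters are hypothetical.

```python
import numpy as np


def conv_full_width(x, kernel):
    """Valid convolution with stride 1 and a kernel as wide as the input.

    x:      (L, 2k) rows are sentiment clue representations, i.e. (h^g)^T
    kernel: (s_h, 2k)
    Returns a feature map of length L - s_h + 1.
    """
    L, s_h = x.shape[0], kernel.shape[0]
    return np.array([np.sum(x[t:t + s_h] * kernel) for t in range(L - s_h + 1)])


# Toy sizes (hypothetical): n context words, window m, kernel height s_h.
rng = np.random.default_rng(0)
n, m, k, s_h = 20, 5, 3, 4
n_filters, n_classes = 7, 3

h_g = rng.standard_normal((2 * k, n - m + 1))        # stand-in for Eq. (17)
kernels = rng.standard_normal((n_filters, s_h, 2 * k))

# Eq. (18): one feature map per filter, each of length n - m - s_h + 2.
h_c = np.stack([conv_full_width(h_g.T, K) for K in kernels])

# Eq. (19): max over all positions gives one scalar per filter.
h_f = h_c.max(axis=1)

# Eq. (20): fully connected layer with ReLU.
W_m = rng.standard_normal((n_classes, n_filters))
b_m = np.zeros(n_classes)
h_m = np.maximum(0.0, W_m @ h_f + b_m)

print(h_c.shape, h_f.shape, h_m.shape)  # (7, 13) (7,) (3,)
```

Since the maximum does not depend on where in the feature map the strongest response occurs, `h_f` has the same length for any sentence length, which is the fixed-length property described above.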
Please cite this article as: N. Liu and B. Shen, Aspect-based sentiment analysis with gated alternate neural network, Knowledge-Based Systems (2019) 105010, https://doi.org/10.1016/j.knosys.2019.105010.
In the softmax layer, a softmax function is used to obtain the probability of each sentiment class. Finally, we take the class with the maximum probability as the predicted sentiment polarity of the specific aspect target. The predicted sentiment class is given as follows:

$$\hat{y} = \underset{\hat{c}}{\mathrm{argmax}}\,(h^s) \tag{21}$$

$$h^s = \mathrm{softmax}(h^m) \tag{22}$$

where $h^s \in \mathbb{R}^{\hat{c} \times 1}$ denotes the probability of each of the $\hat{c}$ classes.

4. Experiments

This section introduces our experiments and explains the results in detail. We conducted our experiments on multi-language and multi-source datasets, consisting of three English and four Chinese datasets, to verify our proposed model's performance, language independence, and source independence. Additionally, we studied the role of the GANN hyperparameters and provide a rule of thumb for selecting them. We used TensorFlow [42] to implement our model. Adam [43] was employed as the optimization method, with a learning rate of 1e−3. We adopted Dropout [44] as our regularization strategy, with a keep probability of 0.5. GloVe [45] 300-dimensional word embeddings were used for the English datasets. Because time and energy were limited, we did not crawl a large-scale Chinese corpus as Peng et al. [31] did in their experiments; such a corpus can be used to produce Chinese word embeddings with algorithms such as Word2vec [46], GloVe [45], and fastText [47]. Instead, we randomly initialized the word embedding matrix for the Chinese datasets from a uniform distribution. The batch size was 128, and we report the best results on the test datasets within 100 epochs.

4.1. Datasets

We used the following datasets in our experiments: the SemEval2014 restaurant and laptop datasets [48], the four Chinese datasets used by Peng et al. [31], and the Tweet dataset collected by Dong et al. [49]. The Tweet dataset is an English dataset and is larger than the other datasets. The SemEval2014 datasets are English datasets covering two review domains: restaurant and laptop. The four Chinese datasets cover four domains: car, notebook, camera, and phone. English sentences differ considerably from Chinese sentences in form. For example, words are separated by spaces in English, while Chinese has no obvious separator other than punctuation. We used NLPIR-ICTCLAS [50] for Chinese word segmentation. We removed sentences in which the segmented tokens do not match the aspect target lists given before segmentation, which slightly reduced the amount of data in the four Chinese datasets. We held out twenty percent of each Chinese dataset as the test set.

The two SemEval2014 datasets and the four Chinese datasets consist of customer reviews. For example, in the SemEval2014 restaurant dataset, the review 'The quantity is also very good, you will come out satisfied' shows that the customer has a positive polarity towards the aspect term 'quantity', which belongs to the aspect category 'food'. The Tweet dataset consists of people's opinions from a social media platform. For example, the opinion 'musicmonday Britney spears - lucky do you remember this song? It is awesome. I love it' shows that this person has a positive opinion towards the aspect term 'Britney spears'. The statistical information of aspect terms for all datasets can be found in Table 3.

In the SemEval2014 restaurant dataset, we divided aspect terms into four aspect categories: Food, Service, Ambience, and Miscellaneous. The train and test datasets contain 1166 and 402 Food items, 562 and 167 Service items, 384 and 105 Ambience items, and 1406 and 299 Miscellaneous items, respectively. In the SemEval2014 laptop dataset, we divided aspect terms into three aspect categories: Battery, Screen, and Miscellaneous. The train and test datasets contain 97 and 15 Battery items, 84 and 14 Screen items, and 2069 and 594 Miscellaneous items, respectively.

In Table 3, 'Length' denotes the maximum sentence length in the train and test datasets. The four Chinese datasets contain only two sentiment classes, positive and negative, so we use "–" to indicate that a dataset does not contain the neutral class. 'Lg.' represents language, 'En' represents English, and 'Ch' represents Chinese. We use 'AsT.' to represent the number of aspect terms found in the dataset.

Table 3
The statistical information of all datasets. Counts are given as Train / Test.

| Datasets | Overall | Pos. | Neg. | Neu. | Length | Lg. | AsT. |
|---|---|---|---|---|---|---|---|
| Restaurant | 3578 / 1110 | 2148 / 721 | 800 / 195 | 630 / 194 | 78 | En | 1630 |
| Laptop | 2317 / 637 | 991 / 341 | 865 / 128 | 459 / 168 | 82 | En | 1301 |
| Twitter | 6257 / 694 | 1567 / 174 | 1563 / 174 | 3127 / 346 | 44 | En | 177 |
| Camera | 1635 / 408 | 1157 / 263 | 478 / 145 | – / – | 26 | Ch | 967 |
| Car | 885 / 221 | 668 / 166 | 217 / 55 | – / – | 34 | Ch | 538 |
| Notebook | 485 / 121 | 331 / 72 | 154 / 49 | – / – | 18 | Ch | 326 |
| Phone | 1885 / 471 | 1260 / 303 | 625 / 168 | – / – | 39 | Ch | 1089 |

4.2. Metrics

Accuracy is a common and easily understood evaluation criterion in ABSA. It is the proportion of correctly classified samples among all samples of all classes. In general, higher accuracy indicates a better classifier. However, this is not always the case: accuracy cannot reflect the performance of a classifier on a dataset with an unbalanced class distribution. As Table 3 shows, the class distribution is highly unbalanced in most datasets. To reflect the classifier's performance more faithfully, macro-F1 is adopted as an additional indicator; it is the unweighted mean of the per-class F1 scores, each of which is the harmonic mean of the precision and recall of that class.
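The gap between accuracy and macro-F1 on unbalanced data can be seen in a small numerical sketch. The example below is hypothetical: it imagines a trivial classifier that always predicts the positive class on the Car test set, using only the class counts from Table 3 (166 positive, 55 negative). The helper names are illustrative, and the convention of scoring an undefined precision or recall as zero is an assumption.

```python
import numpy as np


def accuracy(cm):
    """Correct predictions (diagonal) divided by all samples."""
    return np.trace(cm) / cm.sum()


def macro_f1(cm):
    """Unweighted mean of per-class F1; rows are actual, columns are predicted."""
    tp = np.diag(cm).astype(float)
    predicted = cm.sum(axis=0).astype(float)  # column sums
    actual = cm.sum(axis=1).astype(float)     # row sums
    # Undefined ratios (zero denominators) are scored as 0 by convention here.
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return f1.mean()


# Hypothetical majority-class predictor on the Car test set (Table 3 counts).
cm_majority = np.array([[166, 0],   # actual positive: all predicted positive
                        [55, 0]])   # actual negative: all predicted positive

print(round(accuracy(cm_majority), 3))  # 0.751
print(round(macro_f1(cm_majority), 3))  # 0.429
```

Such a predictor reaches an accuracy of about 75% without ever recognizing a negative sample, while its macro-F1 stays near 43%, which is why both metrics are reported.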
The precision of a class is the proportion of samples correctly predicted as that class among all samples predicted as that class. The recall of a class is the proportion of samples correctly predicted as that class among all true samples of that class. The confusion matrix is described in Table 4.

Table 4
Multicategory confusion matrix. We use three categories of sentiment polarity as an example.

| Actual classes (rows) / Predicted classes (columns) | Positive class | Negative class | Neutral class |
|---|---|---|---|
| Positive class | a | b | c |
| Negative class | d | e | f |
| Neutral class | g | h | i |

The accuracy of the classifier and the precision, recall, and F1 of a specific class are given as follows:

$$\mathrm{Accuracy} = \frac{a+e+i}{a+b+c+d+e+f+g+h+i} \tag{23}$$

$$\mathrm{Precision}_{positive} = \frac{a}{a+d+g} \tag{24}$$

$$\mathrm{Recall}_{positive} = \frac{a}{a+b+c} \tag{25}$$

$$F1_{positive} = 2 \cdot \frac{\mathrm{Precision}_{positive} \cdot \mathrm{Recall}_{positive}}{\mathrm{Precision}_{positive} + \mathrm{Recall}_{positive}} \tag{26}$$

In the same way, we can get $\mathrm{Precision}_{negative}$, $\mathrm{Recall}_{negative}$, $F1_{negative}$, $\mathrm{Precision}_{neutral}$, $\mathrm{Recall}_{neutral}$, and $F1_{neutral}$. The value of macro-F1 is then obtained as follows:

$$\mathrm{Macro\text{-}F1} = \frac{1}{3} \sum_{j} F1_j \tag{27}$$

where $j \in \{positive, neutral, negative\}$.

4.3. Comparison algorithms

This section presents the baselines used for comparison. Firstly, we describe variants of the proposed GANN, which aim to verify the validity of each module of GANN. Secondly, we introduce state-of-the-art algorithms for the ABSA task, which are used to demonstrate our model's effectiveness.

4.3.1. Variants of GANN

The major layers of our proposed neural network are the Gate Truncation Layer and the Local Feature and Position Invariance Learning Layer. Thus, we designed different variants of these modules to validate the particular advantage of each layer. GANN-v1 and GANN-v2 were designed to verify the Gate Truncation Layer. GANN-v3 and GANN-v4 were designed to verify the Local Feature and Position Invariance Learning Layer.

1. GANN-v1: This is the first variant of GANN. It removes all GTR modules and replaces them with a vanilla GRU. The difference between GANN-v1 and GANN is that GANN-v1 does not contain the Gate Truncation Layer, which consists of multiple GTR modules. In GANN-v1, we do not consider aspect-dependent sentiment clues: the context is viewed as a whole, and a GRU is employed to encode all the context embeddings and location embeddings.
2. GANN-v2: This is the second variant of GANN. It does not have the gate mechanism in the GTR module. The output representations of the bidirectional GRU are sent directly to the attention layer in GTR. The difference between GANN-v2 and GANN is that GANN-v2 does not contain the Filter Gate Layer designed in the GTR module of GANN. In GANN-v2, we do not deal with the noise introduced by the attention mechanism.
3. GANN-v3: This is the third variant of GANN. It eliminates the Local Feature and Position Invariance Learning Layer and employs an attention mechanism as a high-level feature extractor. It takes the output of this additional attention layer as the input to the fully connected layer. The difference between GANN-v3 and GANN is that GANN-v3 does not employ the convolution and max-pooling operations as a high-level feature extractor to get the final aspect-dependent sentiment representation. In GANN-v3, we use an attention mechanism to obtain the final aspect-dependent sentiment representation.
4. GANN-v4: This is the fourth variant of GANN. It replaces the max-pooling mechanism in the Local Feature and Position Invariance Learning Layer with the average of the output representation of each feature map. The difference between GANN-v4 and GANN is that GANN-v4 uses average-pooling in the Local Feature and Position Invariance Learning Layer, whereas GANN employs max-pooling.

To clearly show the differences between the variants of GANN and the overall architecture, we list the modules or layers that these models contain or do not contain in Table 5. Because our aim is to show the differences between these models, we show only some of their modules or layers, not all of them. In Table 5, 'W' indicates that the network contains the module or layer, 'W/O' indicates that the network does not contain it, and the brackets give the substitute module.
Table 5
The modules or layers contained or not contained in each variant of GANN and the overall GANN. The first three module columns belong to the Gate Truncation Layer/GTR; the last two belong to the Local Feature and Position Invariance Learning Layer.

| Models | Bidirectional GRU layer | Filter gate layer | Attention layer | Convolution operation | Max-pooling operation |
|---|---|---|---|---|---|
| GANN-v1 | W/O (Vanilla GRU) | W/O (Vanilla GRU) | W/O (Vanilla GRU) | W | W |
| GANN-v2 | W | W/O | W | W | W |
| GANN-v3 | W | W | W | W/O (Attention layer) | W/O (Attention layer) |
| GANN-v4 | W | W | W | W | W/O (Average-pooling operation) |
| GANN | W | W | W | W | W |

4.3.2. State-of-the-art methods

In our experiments, we consider the following state-of-the-art methods:

1. SVM: The SVM classifier is a state-of-the-art traditional machine learning method. It generally accepts unigrams, bigrams, POS tags, etc. as input features. In ABSA, aspect target features are added to the sentence context features.
2. LSTM: LSTM is a typical method for modeling time series. It captures long-term dependency better than a vanilla RNN by introducing an input gate, a forget gate, an output gate, and a memory cell. LSTM treats each word in a sentence equally, so it is not sensitive to aspect terms and related sentiment expression words in ABSA.
3. Bi-LSTM: Bi-LSTM uses both head-to-tail and tail-to-head sequential information. As with LSTM, it cannot recognize the aspect target and related sentiment expression words.
4. TD-LSTM: TD-LSTM [25] divides the sentence into two subsentences according to the location of the aspect target. It employs two LSTMs to model the forward sequence information of the first subsentence and the backward sequence information of the second subsentence separately. While it considers the effect of aspect targets to some degree, it does not take full advantage of the semantic relation between the aspect target and context words in the ABSA task.
5. TC-LSTM: TC-LSTM [25] extends TD-LSTM by appending the average of the aspect target word embeddings to the word embeddings. It implicitly captures the semantic relation between the aspect target and sentence context words in the ABSA task.
6. AT-LSTM: AT-LSTM [26] is an attention-based model for the ABSA task. It concatenates the aspect word embedding and the hidden outputs of LSTM and then uses the attention mechanism to assign different weights to different hidden outputs of LSTM.
7. ATAE-LSTM: ATAE-LSTM [26] makes better use of aspect term information than AT-LSTM. It concatenates the aspect term embedding to each of the context word embeddings.
8. MemNN: MemNN [35] is inspired by the working mechanism of question-answering systems. It regards context words as a factual description and views aspect terms as the query. However, it does not take advantage of word order information.
9. IAN: IAN [28] employs attention and pooling mechanisms to learn the attention values of context words and the aspect target separately and generates their representations separately. It then concatenates these two representations to form the final aspect-dependent sentiment representation.
10. ATAM-S: ATAM-S [31] models the sentence and aspect terms at three levels of granularity: word, character, and radical. It explicitly models the sequential information of the aspect term sequence. The problem with this method is that it does not consider semantic correlation information between aspect terms and context words. In our experiments, we use the best results on word granularity from the original paper.
11. ATAM-F: ATAM-F [31] integrates the three representation granularities on the basis of ATAM-S. Early and late fusion are designed to merge these three granularities. It can use multiple granularities to acquire better semantic representations of the aspect target and context words. It does not consider the relation between the aspect target and context words. We use the best results between the two fusion manners on the three representation granularities from the original paper.

On the four Chinese datasets, we compare with SVM, LSTM, Bi-LSTM, TD-LSTM, TC-LSTM, AT-LSTM, ATAE-LSTM, MemNN, ATAM-S, and ATAM-F. The structures of SVM, LSTM, Bi-LSTM, ATAM-S, and ATAM-F are consistent with the structures described in Peng et al. [31]. The structures of TD-LSTM and TC-LSTM are the same as the structures proposed by Tang et al. [25]. The structures of AT-LSTM and ATAE-LSTM are consistent with the structures proposed by Wang et al. [26]. The structure of MemNN is the same as the structure proposed by Tang et al. [35]. The structure of IAN is the same as the structure proposed by Ma et al. [28].

4.4. Results analysis

This section presents the experimental results for all variants of the model and all state-of-the-art methods and provides a detailed analysis of how our proposed model obtains improved performance compared to other methods.

4.4.1. Comparison of GANN variants

This section presents an ablation study that compares the four variants of GANN on the four Chinese datasets. The results are shown in Table 6. Our proposed GANN achieves the best accuracy and F1-score on three datasets as well as the highest average accuracy and F1-score over the four datasets. In other words, these experimental results demonstrate the validity of the layers or modules used in our GANN architecture.

The differences between GANN-v1 and GANN are as follows. Firstly, the former ignores the semantic correlation information and the interaction between the context and the aspect target. Secondly, the former does not consider the effects of different sentiment clues, which are very important for determining sentiment polarity. The significant performance degradation of GANN-v1 in accuracy and macro-F1 indicates that the GTR module better captures and models the semantic correlation information and the interaction between context words and the aspect target. It also indicates that the GTR module successfully encodes sentiment clues.

Even if the semantic correlation information and the interaction between context words and the aspect target are modeled and the different sentiment clues are correctly learned, the overall performance is still not satisfactory. This is illustrated by GANN-v2, which captures and encodes the important information mentioned above but does not filter out noise that easily confuses the neural network. GANN employs a gating mechanism to discard irrelevant semantic information, so that informative and precise sentiment clue representations are learned and contribute to the final performance.

GANN-v3 and GANN-v4 verify the validity of the Local Feature and Position Invariance Learning Layer. GANN-v3 discards this layer and instead employs an attention layer, based on the self-attention mechanism, to learn the aspect-dependent sentiment representation. Compared with GANN, the performance of GANN-v3 decreases on most datasets, because it cannot detect local features or properly capture position invariance in a sentence. Table 6 shows that GANN-v3 achieves the second-best performance not only on the largest number of datasets but also on the average over all datasets. This further shows that the attention mechanism is a powerful technique in ABSA and can capture relevant aspect-dependent sentiment clue representations.
There are multiple pooling mechanisms in the literature, such as max-pooling and average-pooling. In GANN-v4, we replaced max-pooling with average-pooling, which gives up the ability to detect the most important features. Comparing GANN-v4 with GANN, the results demonstrate that max-pooling indeed surpasses average-pooling in the ABSA task, because max-pooling successfully captures the most useful features contributing to sentiment polarity and also does well in obtaining position invariance.

Table 6 also shows that GANN does not outperform GANN-v3 and GANN-v4 on the Car dataset. We believe the first reason is that the polarity distribution of the Car dataset is extremely non-uniform. Table 3 shows that there are about three times as many positive instances as negative instances in the Car dataset, whereas there are about twice as many positive instances as negative instances in the other datasets, such as the Notebook and Phone datasets. The second reason is that the convolution and max-pooling operations may learn to detect useless or even harmful local features for the minority sentiment polarity when the polarity distribution is extremely non-uniform. As a result, GANN-v1, GANN-v2, and the overall GANN perform worse than GANN-v3 and GANN-v4.

On the Notebook dataset, GANN-v2 achieves the best results among the four variants and obtains results comparable to the overall GANN. Table 3 shows that the sentences of the Notebook dataset are quite short. The main reason why GANN-v2 performs well is that the attention mechanism may produce very little noise when the sentence is short, so the proposed filter gating mechanism has less room to improve the performance of the model in this scenario. Thus, GANN-v2, which omits the filter gating mechanism, obtains better results than the other variants and achieves performance comparable to the overall GANN.

Table 6
The experiments of different variants of GANN. The metric "Acc." means the accuracy and "F1" denotes the macro-F1. The bold indicates the best result, the asterisk indicates the second-best result.

| | Camera Acc. | Camera F1 | Car Acc. | Car F1 | Notebook Acc. | Notebook F1 | Phone Acc. | Phone F1 | Average Acc. | Average F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| GANN-v1 | 85.78 | 84.58 | 81.90 | 76.88 | 80.17 | 79.42 | 87.05 | 86.08 | 83.73 | 81.74 |
| GANN-v2 | 86.52 | 85.31 | 82.35 | 75.64 | 82.64∗ | 82.05∗ | 87.47 | 86.54 | 84.75 | 82.39 |
| GANN-v3 | 87.01∗ | 85.97∗ | **84.62** | **78.90** | 80.99 | 78.91 | 88.32∗ | 87.29∗ | 85.24∗ | 82.75∗ |
| GANN-v4 | 86.03 | 84.58 | 84.62∗ | 78.03∗ | 81.82 | 81.14 | 87.69 | 86.40 | 85.04 | 82.54 |
| GANN | **87.99** | **86.75** | 83.71 | 77.66 | **82.65** | **82.16** | **89.17** | **88.16** | **85.88** | **83.68** |

4.4.2. Comparison of state-of-the-art models

This section compares our proposed method with other state-of-the-art methods. The results for the four Chinese datasets are given in Table 7. Table 7 shows that GANN obtains state-of-the-art results on most of the four datasets. It is around 0.3%–5% better than other state-of-the-art methods in accuracy. GANN acquires the highest macro-F1 on all datasets by around 0.2%–3.7%. GANN obtains the highest average accuracy and macro-F1 by around 1.3% and 1.7%.

Table 7
The experiments of state-of-the-art methods. The metric "Acc." means the accuracy and "F1" denotes the macro-F1. The bold indicates the best result, the asterisk indicates the second-best result. A dash (–) indicates that the original paper does not report the result on the dataset.

| | Camera Acc. | Camera F1 | Car Acc. | Car F1 | Notebook Acc. | Notebook F1 | Phone Acc. | Phone F1 | Average Acc. | Average F1 |
|---|---|---|---|---|---|---|---|---|---|---|
| SVM | 69.83 | 41.11 | 75.60 | 43.04 | 66.92 | 40.09 | 67.02 | 40.11 | 69.84 | 41.09 |
| LSTM | 78.31 | 68.72 | 81.99 | 58.83 | 74.63 | 62.32 | 81.38 | 72.13 | 79.08 | 65.5 |
| Bi-LSTM | 78.35 | 69.35 | 81.82 | 56.42 | 74.15 | 63.09 | 81.45 | 70.42 | 78.94 | 64.82 |
| TD-LSTM | 70.48 | 51.46 | 76.53 | 46.67 | 67.10 | 40.58 | 69.17 | 53.40 | 70.82 | 48.03 |
| TC-LSTM | 70.88 | 54.79 | 76.19 | 50.99 | 68.39 | 50.57 | 69.88 | 54.26 | 71.33 | 52.66 |
| AT-LSTM | 85.05 | 83.44 | 80.09 | 72.34 | 79.34 | 77.99 | 86.41 | 84.46∗ | 82.73 | 79.56 |
| ATAE-LSTM | 85.54 | 84.09∗ | 81.90 | 76.88∗ | **83.47** | 82.14∗ | 85.77 | 83.87 | 84.17 | 81.74∗ |
| MemNN | 70.59 | 55.13 | 75.55 | 51.01 | 69.10 | 53.51 | 70.29 | 55.93 | 71.38 | 53.90 |
| ATAM-S | 82.88 | 72.50 | 82.94 | 64.18 | 75.59 | 60.09 | 84.86 | 75.35 | 81.57 | 68.03 |
| ATAM-F | **88.30** | – | 82.94∗ | – | 77.52 | – | 88.46∗ | – | 84.31∗ | – |
| GANN | 87.99∗ | **86.75** | **83.71** | **77.66** | 82.65∗ | **82.16** | **89.17** | **88.16** | **85.88** | **83.68** |

We believe the first reason why GANN does better than the other state-of-the-art methods is that we explicitly capture and model sentiment clues. We designed the GTR module to divide a sentence into multiple sentiment clues and then encode them. Other state-of-the-art methods treat the sentence as a whole and ignore the different roles of different sentiment clues in the ABSA task. To emphasize this difference, the baseline model GANN-v1 eliminates the GTR module, so the model does not learn context sentiment expression clues, which results in poor performance. The second reason is that we designed a gate mechanism to filter out noise that would reduce model performance.
Other state-of-the-art methods ignore this problem. To validate the importance of the gate mechanism, we designed the second variant of GANN, GANN-v2, which differs from GANN only in lacking the noise-filtering function. The experimental results in Table 6 validate our assumption. The third reason is that we capture local features and have the ability to acquire position invariance in aspect-dependent sentiment clues. Other state-of-the-art methods are based only on LSTM, which is poor at detecting local features and acquiring position invariance. Some state-of-the-art methods use an attention mechanism to get the final aspect-dependent representation. Attention is essentially a weighted sum of the hidden representations, which lacks position invariance. To demonstrate the validity of local features and position invariance, we designed two variants: GANN-v3 and GANN-v4. They differ from GANN in ignoring the detection of local features and of max features, respectively. The experimental results in Table 6 demonstrate the validity of our assumptions.

Although our proposed model uses the average of the aspect term words as the representation of an aspect term that contains multiple words, we do not explicitly consider the sequential information of the aspect target. GANN nevertheless exceeds ATAM-S and ATAM-F in most cases; the only exception in Table 7 is the accuracy of ATAM-F on the Camera dataset. This demonstrates that our model can obtain strong performance in ABSA. We expect that GANN would achieve better performance if sequential information were considered when modeling the aspect target.

Table 8
Accuracy results on SemEval2014 restaurant, SemEval2014 laptop and twitter datasets.

| | Restaurant | Laptop | Twitter |
|---|---|---|---|
| LSTM | 74.28 | 66.45 | 64.84 |
| TD-LSTM | 75.63 | 68.13 | 66.62 |
| AT-LSTM | 76.60 | 68.90 | 68.01 |
| ATAE-LSTM | 77.20 | 68.70 | 70.03 |
| MemNN | 78.38 | 71.11 | 70.38 |
| IAN | 78.60 | 71.78 | 71.53 |
| RAM | 78.93 | 71.81 | 68.30 |
| GANN | 80.09 | 72.21 | 72.40 |
Fig. 5. (a) shows the effect of the size of the sliding window. (b) describes the max length of sentence context words in four Chinese datasets.

Table 9
Accuracy results on the original imbalanced SemEval2014 dataset and the balanced SemEval2014 dataset.

| | Unbalanced SemEval2014: Restaurant | Unbalanced SemEval2014: Laptop | Balanced SemEval2014: Restaurant | Balanced SemEval2014: Laptop |
|---|---|---|---|---|
| LSTM | 74.28 | 66.45 | 69.15 | 68.39 |
| AT-LSTM | 76.60 | 68.90 | 71.19 | 70.47 |
| ATAE-LSTM | 77.20 | 68.70 | 70.75 | 69.01 |
| MemNN | 78.38 | 71.11 | 71.62 | 70.05 |
| IAN | 78.60 | 71.78 | 72.28 | 70.31 |
| RAM | 78.93 | 71.81 | 73.64 | 70.57 |
| GANN | 80.09 | 72.21 | 74.83 | 71.62 |
To further verify the generalization capacity and test whether our proposed model is language-independent, we conducted additional experiments on the SemEval2014 and Twitter datasets. The experimental results are shown in Table 8, which provides a comparison with the top state-of-the-art methods: RAM, IAN, MemNN, ATAE-LSTM, AT-LSTM, TD-LSTM, and LSTM. The structure of RAM is consistent with the structure proposed by Chen et al. [27]. Table 8 shows that GANN achieves the highest accuracy on all three datasets, which supports the strong generalization capacity of our proposed model. Our model explicitly captures aspect-dependent sentiment clues, filters out noise, and detects local features, while also having the property of position invariance.

The datasets that we used for the experiments are imbalanced. To account for this, we used the SemEval2014 dataset to construct balanced versions of the restaurant and laptop datasets. Take the restaurant training dataset as an example. Table 3 shows that the neutral class has the fewest training samples in the restaurant dataset, so we keep all training samples of the neutral class and randomly sample from the positive and negative classes until each has as many samples as the neutral class. The balanced restaurant testing dataset is constructed in the same way. The experimental results are shown in Table 9 in comparison with state-of-the-art methods: RAM, IAN, MemNN, ATAE-LSTM, AT-LSTM, and LSTM. GANN obtains the best performance on all unbalanced and balanced datasets, which demonstrates the effectiveness of our proposed model. The main reason is that GANN considers aspect-dependent sentiment clues, filters out noise, and detects local key features.
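The balancing procedure described above amounts to downsampling every class to the size of the smallest one. A minimal sketch follows, assuming samples are stored as `(text, label)` pairs; the function name, the fixed seed, and the placeholder data are hypothetical, and only the class counts (2148 positive, 800 negative, 630 neutral restaurant training samples) come from Table 3.

```python
import random
from collections import Counter, defaultdict


def balance_by_downsampling(samples, seed=0):
    """Randomly downsample every class to the size of the smallest class.

    samples: list of (text, label) pairs.
    """
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[1]].append(sample)

    target = min(len(items) for items in by_label.values())
    rng = random.Random(seed)  # fixed seed so the subset is reproducible

    balanced = []
    for items in by_label.values():
        # The smallest class is kept whole; larger classes are sampled without replacement.
        balanced.extend(items if len(items) == target else rng.sample(items, target))
    rng.shuffle(balanced)
    return balanced


# Placeholder data with the restaurant training class counts from Table 3.
train = (
    [(f"pos-{i}", "positive") for i in range(2148)]
    + [(f"neg-{i}", "negative") for i in range(800)]
    + [(f"neu-{i}", "neutral") for i in range(630)]
)
balanced_train = balance_by_downsampling(train)
print(Counter(label for _, label in balanced_train))
```

Under these counts the balanced training set would contain 630 samples per class, 1890 in total, which is roughly half the size of the original 3578-sample training set.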
Additional experiments were conducted at the aspect category level on the SemEval2014 restaurant and laptop datasets. The aspect category level task differs from the aspect term level task in ABSA: each aspect term belongs to one of the aspect categories. For instance, in the sentence “But the staff was so horrible to us”, the aspect term is “staff” and the aspect category is “service”. In the SemEval2014 restaurant dataset, we divided the aspects into four categories: Food, Service, Ambience, and Misc (Miscellaneous). In the SemEval2014 laptop dataset, we divided them into three categories: Battery, Screen, and Misc (Miscellaneous). In the aspect term level task, the aspect term and the context are used as input, whereas in the aspect category level task, the aspect category and the context are used as input. The experimental results are shown in Table 10 in comparison with state-of-the-art methods: IAN, MemNN, ATAE-LSTM, and AE-LSTM. GANN achieves the best accuracy on both datasets, which verifies the strong performance of our proposed GANN. We believe there are three reasons. First, GANN takes into account aspect-dependent sentiment clues and uses the GTR to encode these clues as well as the word order within each sentiment clue. Second, GANN filters out some noise. Third, GANN detects local key features, obtains position invariance through the max-pooling operation, and explicitly considers the specific form of the sentiment clue.

Table 10
Accuracy results (%) of aspect categories on the SemEval2014 restaurant and laptop datasets.

Restaurant
Method      Food    Service  Ambience  Misc    Overall
AT-LSTM     83.33   88.62    83.81     75.59   81.91
ATAE-LSTM   82.59   89.22    81.90     75.92   81.60
MemNN       80.35   87.43    83.81     73.91   79.96
IAN         86.57   91.02    85.71     77.59   84.48
GANN        87.06   92.22    86.67     78.26   85.20

Laptop
Method      Battery  Screen  Misc    Overall
AT-LSTM     66.67    50.00   68.47   68.03
ATAE-LSTM   73.33    50.00   67.00   66.77
MemNN       60.00    42.86   65.52   64.89
IAN         73.33    64.29   71.76   71.63
GANN        80.00    71.43   73.73   73.82

4.5. Effect of size of sliding window and convolution kernel

This section focuses on the sliding window in the Gate Truncation RNN Layer and the convolution kernel in the Local Features and Position Invariance Learning Layer. We analyze the effects of the sliding window size and the convolution kernel size in turn.

4.5.1. Sliding window size

This subsection describes the impact of the sliding window size on the effectiveness of GANN and how to choose the optimal size in the ABSA task. We ran experiments with different sliding window sizes in the GTR. All the experiments in this part were conducted on the four Chinese datasets. The experimental results are shown in Fig. 5. Fig. 5(a) shows that the sliding window size affects the performance of GANN. On most datasets, GANN achieves the best performance when the sliding window size is 15. Fig. 5(b) shows that the maximum length of the sentence context varies depending on the dataset. In other words, as a rule of thumb, 15 can be used as the sliding window size regardless of the type of dataset and the length of the context in ABSA tasks. Moreover, the performance of GANN becomes stable as the size of the sliding window increases.

4.5.2. Convolution kernel size

To validate the effect of the convolution kernel size on model performance and to determine the optimal size, we conducted a series of comparative experiments on the restaurant and laptop English datasets. The experimental results are given in Fig. 6, which shows that the convolution kernel size has only a slight impact on the performance of our proposed model. The optimal size is independent of the type of dataset. According to the results in Fig. 6, 9 can be used as the convolution kernel size.
Fig. 6. The horizontal axis shows the convolution kernel size. The vertical axis shows the accuracy of GANN.
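To make the role of the convolution kernel size concrete, the following minimal Python sketch applies kernels of two different sizes to toy sequences and then takes the maximum over time. It is an illustration under stated assumptions, not the GANN implementation: the helper names (`conv1d_valid`, `max_over_time`, `encode`, `make_sequence`), the hand-picked kernel weights, the ReLU activation, and the two-dimensional toy vectors are all hypothetical, whereas GANN learns its filters from the clue representations produced by the GTR.

```python
def conv1d_valid(sequence, kernel):
    """Slide a k x d kernel over an n x d sequence (stride 1, no padding)."""
    k = len(kernel)
    feature_map = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        response = sum(
            w * x
            for w_row, x_row in zip(kernel, window)
            for w, x in zip(w_row, x_row)
        )
        feature_map.append(max(0.0, response))  # ReLU (an assumed activation)
    return feature_map


def max_over_time(feature_map):
    """Keep only the strongest response, wherever it occurs in the sequence."""
    return max(feature_map)


def encode(sequence, kernels):
    """One pooled value per kernel, so the output length equals len(kernels)."""
    return [max_over_time(conv1d_valid(sequence, kernel)) for kernel in kernels]


def make_sequence(length, position, pattern):
    """Zero vectors with `pattern` written in, starting at `position`."""
    sequence = [[0.0, 0.0] for _ in range(length)]
    for offset, vector in enumerate(pattern):
        sequence[position + offset] = list(vector)
    return sequence


pattern = [[1.0, 0.0], [0.0, 1.0]]  # a toy "local key pattern"
kernels = [
    [[1.0, 0.0], [0.0, 1.0]],              # kernel size 2
    [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],  # kernel size 3
]

short = make_sequence(5, 0, pattern)  # pattern at the start of a short input
long = make_sequence(9, 4, pattern)   # same pattern in the middle of a longer input

print(encode(short, kernels))  # [2.0, 2.0]
print(encode(long, kernels))   # [2.0, 2.0]
```

Both inputs yield a vector with one entry per kernel, although their feature maps have different lengths, and the pooled values are identical even though the pattern sits at different positions. This is the fixed-length output and position invariance that the paper attributes to convolution followed by max-pooling.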
5. Discussion

The proposed model outperforms vanilla LSTM and vanilla CNN on the ABSA task. Our main idea was to combine an RNN and a CNN in one framework so as to overcome the weaknesses of each.

The first reason why the proposed model performs better than LSTM is that GANN uses convolution and max-pooling operations as a high-level feature extractor to obtain the final aspect-dependent sentiment representation. By setting the window size of the filter in the convolution operation, GANN can detect different local key representations from the aspect-dependent sentiment clue representations encoded by the GTR. By setting the window size for max-pooling, GANN obtains a fixed-length vector regardless of the length of the input sequence and acquires location invariance with respect to the order of these local key representations. That is to say, GANN can detect the most important aspect-dependent sentiment clue representations among these local key representations and acquire location invariance. These functions are not available in vanilla LSTM.

The second reason why the proposed model performs better than LSTM is that we designed a special GTR module to model the sentiment clues together with the specific aspect target. The sentiment clue plays a decisive role in identifying the sentiment polarity of an aspect target, and the interaction between the sentiment clue and the specific aspect target is important in the ABSA task.

The third reason why the proposed model performs better than LSTM is that we proposed the filter gating mechanism to suppress the noise introduced by the attention.

The main reason why the proposed model performs better than CNN is that GANN overcomes the weaknesses of CNN by designing a specific GTR module as an encoder.
In GANN, a bidirectional GRU in the GTR module captures the semantic dependency, the preceding historical information, and the word order within each sentiment clue. Increasing the size of the sliding window in the GTR module lets the model capture longer semantic dependencies within the sentiment clue. These functions are not available in vanilla CNN.

6. Conclusions

In this paper, we assume that a sentence consists of multiple sentiment clues and that a sentiment clue is composed of multiple words. Based on these assumptions, we proposed a novel neural network framework called the Gated Alternate Neural Network (GANN) for aspect-based sentiment analysis (ABSA). GANN can effectively alleviate the problems that exist in RNN, CNN, and the attention mechanism in the ABSA task. GANN has two important layers: the Gate Truncation RNN Layer, which consists of a GTR module, and the Local Features and Position Invariance Learning Layer, which provides convolution and max-pooling mechanisms. The GTR module is designed to learn informative aspect-dependent sentiment clue representations in which the relative distance between each context word and the aspect target, the sequence information, and the semantic dependency within a sentiment clue are concurrently encoded. In the GTR, an extra gating mechanism is designed to alleviate the effect of noise and obtain more precise sentiment clue representations. Convolution and max-pooling mechanisms are used to capture key aspect-dependent sentiment clue representations and acquire position invariance of the features. Extensive experiments were conducted on four Chinese datasets and three English datasets to verify the performance and generalization of GANN. Experimental results show that GANN achieves state-of-the-art results. Additionally, a rule of thumb for selecting model hyperparameters is given. In the future, we believe that considering aspect term sequence information will improve the performance of the model. Integrating
common knowledge with the model is another important next step and one of the key techniques toward making machines fully understand human emotion.

Acknowledgments

This work was supported by the National Key Research and Development Program of China (grant number 2018YFC0831300) and the Fundamental Research Funds for the Central Universities, China (grant number 2019YJS022).

References

[1] B. Liu, Sentiment Analysis: Mining Opinions, Sentiments, and Emotions, Cambridge University Press, 2015.
[2] L. Khaidem, S. Saha, S.R. Dey, Predicting the direction of stock market prices using random forest, arXiv preprint arXiv:1605.00003, 2016.
[3] A. Tumasjan, T.O. Sprenger, P.G. Sandner, I.M. Welpe, Predicting elections with Twitter: What 140 characters reveal about political sentiment, ICWSM 10 (2010) 178–185.
[4] E. Cambria, A. Hussain, T. Durrani, C. Havasi, C. Eckl, J. Munro, Sentic computing for patient centered applications, in: Signal Processing (ICSP), 2010 IEEE 10th International Conference on, IEEE, 2010, pp. 1279–1282.
[5] A. Valdivia, M.V. Luzón, F. Herrera, Sentiment analysis in TripAdvisor, IEEE Intell. Syst. 32 (2017) 72–77.
[6] Z. Zhang, Y. Liu, G. Xu, H. Chen, A weighted adaptation method on learning user preference profile, Knowl.-Based Syst. 112 (2016) 114–126.
[7] B. Shen, N.-W. Wang, H.-H. Qiu, A new genetic algorithm for overlapping community detection, J. Internet Technol. 15 (2014) 1143–1150.
[8] T. Young, E. Cambria, I. Chaturvedi, M. Huang, H. Zhou, S. Biswas, Augmenting end-to-end dialog systems with commonsense knowledge, arXiv preprint arXiv:1709.05453, 2017.
[9] M. Al-Ayyoub, S.B. Essa, I. Alsmadi, Lexicon-based sentiment analysis of Arabic tweets, Int. J. Soc. Netw. Min. 2 (2015) 101–114.
[10] R.Y. Lau, C. Li, S.S. Liao, Social analytics: Learning fuzzy product ontologies for aspect-oriented sentiment analysis, Decis. Support Syst. 65 (2014) 80–94.
[11] S. Poria, E. Cambria, G. Winterstein, G.-B. Huang, Sentic patterns: Dependency-based rules for concept-level sentiment analysis, Knowl.-Based Syst. 69 (2014) 45–63.
[12] S. Das, A.K. Kolya, Sense GST: Text mining & sentiment analysis of GST tweets by Naive Bayes algorithm, in: Research in Computational Intelligence and Communication Networks (ICRCICN), 2017 Third International Conference on, IEEE, 2017, pp. 239–244.
[13] A.M. El-Halees, Arabic text classification using maximum entropy, IUG J. Natural Stud. 15 (2015).
[14] F. Luo, C. Li, Z. Cao, Affective-feature-based sentiment analysis using SVM classifier, in: Computer Supported Cooperative Work in Design (CSCWD), 2016 IEEE 20th International Conference on, IEEE, 2016, pp. 276–281.
[15] G. Zhou, Z. Zhu, T. He, X.T. Hu, Cross-lingual sentiment classification with stacked autoencoders, Knowl. Inf. Syst. 47 (2016) 27–44.
[16] J.L. Elman, Finding structure in time, Cogn. Sci. 14 (1990) 179–211.
[17] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (1997) 1735–1780.
[18] F.A. Gers, J. Schmidhuber, F. Cummins, Learning to forget: Continual prediction with LSTM, Neural Comput. 12 (2000) 2451–2471.
[19] A. Graves, J. Schmidhuber, Framewise phoneme classification with bidirectional LSTM and other neural network architectures, Neural Netw. 18 (2005) 602–610.
[20] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using RNN encoder–decoder for statistical machine translation, arXiv preprint arXiv:1406.1078, 2014.
[21] Y. Kim, Convolutional neural networks for sentence classification, arXiv preprint arXiv:1408.5882, 2014.
[22] R. Socher, C.C. Lin, C. Manning, A.Y. Ng, Parsing natural scenes and natural language with recursive neural networks, in: Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 129–136.
[23] K. Schouten, F. Frasincar, Survey on aspect-level sentiment analysis, IEEE Trans. Knowl. Data Eng. 28 (2015) 813–830.
[24] L. Zhang, S. Wang, B. Liu, Deep learning for sentiment analysis: A survey, Wiley Interdiscip. Rev.: Data Min. Knowl. Discov. 8 (2018) e1253.
[25] D. Tang, B. Qin, X. Feng, T. Liu, Effective LSTMs for target-dependent sentiment classification, arXiv preprint arXiv:1512.01100, 2015.
[26] Y. Wang, M. Huang, L. Zhao, Attention-based LSTM for aspect-level sentiment classification, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 606–615.
[27] P. Chen, Z. Sun, L. Bing, W. Yang, Recurrent attention network on memory for aspect sentiment analysis, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 452–461.
[28] D. Ma, S. Li, X. Zhang, H. Wang, Interactive attention networks for aspect-level sentiment classification, arXiv preprint arXiv:1709.00893, 2017.
[29] S. Ruder, P. Ghaffari, J.G. Breslin, A hierarchical model of reviews for aspect-based sentiment analysis, arXiv preprint arXiv:1609.02745, 2016.
[30] X. Li, L. Bing, W. Lam, B. Shi, Transformation networks for target-oriented sentiment classification, arXiv preprint arXiv:1805.01086, 2018.
[31] H. Peng, Y. Ma, Y. Li, E. Cambria, Learning multi-grained aspect target sequence for Chinese sentiment analysis, Knowl.-Based Syst. 148 (2018) 167–176.
[32] S. Poria, E. Cambria, A. Gelbukh, Aspect extraction for opinion mining with a deep convolutional neural network, Knowl.-Based Syst. 108 (2016) 42–49.
[33] A. Kumar, O. Irsoy, P. Ondruska, M. Iyyer, J. Bradbury, I. Gulrajani, V. Zhong, R. Paulus, R. Socher, Ask me anything: Dynamic memory networks for natural language processing, in: International Conference on Machine Learning, 2016, pp. 1378–1387.
[34] S. Sukhbaatar, J. Weston, R. Fergus, End-to-end memory networks, Adv. Neural Inf. Process. Syst. (2015) 2440–2448.
[35] D. Tang, B. Qin, T. Liu, Aspect level sentiment classification with deep memory network, arXiv preprint arXiv:1605.08900, 2016.
[36] Y. Tay, A.T. Luu, S.C. Hui, Learning to attend via word-aspect associative fusion for aspect-based sentiment analysis, arXiv preprint arXiv:1712.05403, 2017.
[37] Y. Tay, L.A. Tuan, S.C. Hui, Dyadic memory networks for aspect-based sentiment analysis, in: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, ACM, 2017, pp. 107–116.
[38] T. Luong, H. Pham, C.D. Manning, Effective approaches to attention-based neural machine translation, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1412–1421.
[39] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, Y. Bengio, Show, attend and tell: Neural image caption generation with visual attention, in: International Conference on Machine Learning, 2015, pp. 2048–2057.
[40] X. Zhang, J. Zhao, Y. LeCun, Character-level convolutional networks for text classification, Adv. Neural Inf. Process. Syst. (2015) 649–657.
[41] H. Peng, E. Cambria, X. Zou, Radical-based hierarchical embeddings for Chinese sentiment analysis at sentence level, in: The 30th International FLAIRS Conference, Marco Island, 2017.
[42] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, TensorFlow: A system for large-scale machine learning, in: OSDI, 2016, pp. 265–283.
[43] D.P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980, 2014.
[44] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, J. Mach. Learn. Res. 15 (2014) 1929–1958.
[45] J. Pennington, R. Socher, C. Manning, GloVe: Global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[46] T. Mikolov, I. Sutskever, K. Chen, G.S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, Adv. Neural Inf. Process. Syst. (2013) 3111–3119.
[47] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, arXiv preprint arXiv:1607.04606, 2016.
[48] M. Pontiki, D. Galanis, H. Papageorgiou, I. Androutsopoulos, S. Manandhar, A.-S. Mohammad, M. Al-Ayyoub, Y. Zhao, B. Qin, O. De Clercq, SemEval-2016 task 5: Aspect based sentiment analysis, in: Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), 2016, pp. 19–30.
[49] L. Dong, F. Wei, C. Tan, D. Tang, M. Zhou, K. Xu, Adaptive recursive neural network for target-dependent Twitter sentiment classification, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2014, pp. 49–54.
[50] H.-P. Zhang, H.-K. Yu, D.-Y. Xiong, Q. Liu, HHMM-based Chinese lexical analyzer ICTCLAS, in: Proceedings of the Second SIGHAN Workshop on Chinese Language Processing-Volume 17, Association for Computational Linguistics, 2003, pp. 184–187.
Please cite this article as: N. Liu and B. Shen, Aspect-based sentiment analysis with gated alternate neural network, Knowledge-Based Systems (2019) 105010, https://doi.org/10.1016/j.knosys.2019.105010.