Equipping Recurrent Neural Network with CNN-style Attention Mechanisms for Sentiment Analysis of Network Reviews
Mohd Usama, Belal Ahmad, Jun Yang, Saqib Qamar, Parvez Ahmad, Yu Zhang, Jing Lv, Joze Guna

M. Usama, B. Ahmad, J. Yang, S. Qamar, P. Ahmad, and Y. Zhang are with the School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China (emails: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]). J. Lv (corresponding author) is with the Hospital Attached to Huazhong University of Science and Technology (email: [email protected]). J. Guna is with the Faculty of Electrical Engineering, University of Ljubljana, Trzaska cesta 25, 1000 Ljubljana, Slovenia (email: [email protected]).
Abstract
Deep learning algorithms have achieved remarkable results in natural language processing (NLP) and computer vision. In particular, deep learning methods such as convolutional and recurrent neural networks have shown remarkable performance in text analytics tasks. From the attention-mechanism perspective, however, the convolutional neural network (CNN) has been applied less often than the recurrent neural network (RNN), because RNN can learn long-term dependencies and generally gives better results than CNN. CNN nevertheless has its own advantage: it can extract high-level features that are invariant to local translation by using a fixed-size local context at the input level. In this paper, we therefore propose a new model based on RNN with a CNN-style self-attention mechanism that combines the merits of both architectures in a single model. In the proposed model, first, CNN learns a high-level representation of the words at the input level. Second, a self-attention mechanism focuses the model on the features that contribute most to the prediction task by calculating attentive context vectors over the hidden state representation generated by CNN. Finally, the hidden state representations from CNN and the attentive context vectors are used together in the RNN, which processes them sequentially. To validate the model, we experiment on three benchmark datasets, i.e., Movie Review, Stanford Sentiment Treebank (SST1), and SST2. Experiment results and their analysis demonstrate the effectiveness of the proposed model.
Keywords: Convolutional neural network, recurrent neural network, attentive learning, network reviews, sentiment analysis.
1. Introduction
The enormous and rapid growth of social networks, smart gadgets, and the internet brings together billions of users who generate short texts on the internet, such as public opinions on services, products, movies, and blogs [1, 2, 3, 4]. These short-text reviews are usually semantically and subjectively oriented. Identifying and classifying the proper semantic orientation of text reviews written on the internet is now receiving much attention from researchers, because it solves various practical problems for both customers and companies, such as which products a customer likes or dislikes and whether a product is doing well in the market. Sentiment analysis of short text remains challenging because short text usually contains limited semantic and contextual information, which limits the accuracy of the analysis [5].

In the early days, many researchers performed the sentiment analysis task using traditional classification approaches that are based on manual feature engineering and require rule-based methods, such as data mining techniques [6] and knowledge-based approaches [7, 8]. These traditional approaches mainly used words as statistical indicator representations such as TF-IDF (term frequency-inverse document frequency). Such schemes do not consider word features as a positional factor and usually fail to represent the features that carry large weights. Besides this, some researchers used machine learning algorithms [9] such as Naive Bayes and Support Vector Machines for feature extraction and sentiment classification. However, these algorithms encounter problems such as data sparsity when the training dataset is small.
Later, as training datasets grew [10, 11], researchers started to use distributed representation methods with deep learning approaches [12, 13], which do not require manual feature engineering and can deal with the above problems [14, 15]. These approaches automatically extract features from text and perform the appropriate analysis for many different problems in NLP. Deep learning approaches have been successfully applied in many areas of NLP; they not only improve the accuracy and efficiency of prediction but also reduce prediction time. 5G also provides an efficient application environment for deep learning based on big data [16, 17]. Furthermore, traditional deep learning approaches such as CNN [18] and RNN [19] use a max pooling layer to find the important significant word features. The max pooling approach most likely helps text prediction and classification tasks achieve better results; however, there is no principled way to select significant features, since the max pooling layer simply keeps the features with the maximum activation value. Meanwhile, the combination of the label-less mechanism [20], the cognitive information concept [21, 22], 5G, and deep learning promotes the further development of artificial intelligence [23, 24].
To deal with this issue in a more principled way, the idea of an attention mechanism in deep learning was first proposed by Bahdanau et al. [25] for machine translation. The attention mechanism is a way to concentrate the model on the significant features by using a softmax in the interior of the model. Generally, softmax is used at the output to generate the probability of classes and categories. However, it is also possible to use softmax to normalize a stack of numbers so that they all sum to one, which effectively assigns a percentage weight to every piece of information; the model can then use these weights while processing the input data. More specifically, we know that not all word features in a sentence representation contribute equally to its meaning. So, here we use a self-attention approach to direct the attention of the model to the word features that contribute to the meaning of the text.

Attention-based neural networks have proven successful in many problems of NLP and computer vision, such as image classification [26] and machine translation [27]. However, most of the work on attention mechanisms is based on RNN and its variants. Attention-based RNN uses three kinds of inputs to evaluate the result at the current state: the current input given to the RNN, the representations of the left and right context computed bidirectionally, and an attentive weighted sum of the local-context hidden states. After the success of the attention mechanism in RNN, some work has also been done on CNN with attention mechanisms to solve different problems in NLP. CNN has proven beneficial both with and without attention: without attention for sentence modeling [28, 29], where CNN learns word representations with a sliding window, and with attention for many problems such as sentence modeling [30], textual entailment [31], and machine translation [32].
Our motivation for using a recurrent neural network with a CNN-style self-attention mechanism is inspired by the individual successes of CNN and RNN with attention mechanisms in NLP.

In this study, we propose a new model based on a recurrent neural network with a CNN-style self-attention mechanism that combines the merits of both architectures in one model. First, we use CNN to generate hidden state units (h_i) from the input representation. Then, over these hidden state units, we apply a self-attention mechanism to obtain attentive context vectors (C_i). Finally, both the hidden state units (h_i) from CNN and the attentive context vectors (C_i) are used together as input to the RNN. The final feature map from the RNN is used to perform the sentiment analysis task.

To evaluate the proposed model, we experiment with four variations of the model on three benchmark datasets. The details of the model variations are given in Subsection 4.1. Experiment results show the effectiveness of the proposed model: it outperforms existing works on two of the three benchmark datasets and achieves state-of-the-art results.
The main contributions of the article are summarized as follows:

• We propose a new model based on the recurrent neural network with a CNN-style self-attention mechanism. Experiment results, analysis, and attention visualization show the effectiveness of the proposed model.

• We demonstrate four variations of the proposed model, i.e., ATTConv RNN-pre, ATTPooling RNN-pre, ATTConv RNN-rand, and ATTPooling RNN-rand. Experiment results show that ATTPooling RNN-pre and ATTPooling RNN-rand perform better than ATTConv RNN-pre and ATTConv RNN-rand, respectively.

• We apply all model variations to three benchmark datasets (Movie Reviews, Stanford Sentiment Treebank SST1, and SST2) to test the effectiveness of the proposed models and achieve better results on two out of the three datasets.
The remainder of the article is organized as follows. Section 2 reviews the relevant existing work. Section 3 describes the proposed model and evaluation methods. Section 4 describes the model variations, experiment setup, and benchmark datasets used to measure the performance of the models. Section 5 discusses the analysis of results. Finally, Section 6 concludes the paper.
2. Related Work
2.1. Sentiment Analysis Task
The existing work in sentiment analysis has been carried out with various methods, such as data mining techniques [33], knowledge-based approaches [34, 35], and machine learning algorithms [36]. Later, with the rapid growth of deep learning, classical deep learning algorithms such as the convolutional neural network (CNN) and recurrent neural network (RNN) achieved remarkable success in many areas, such as computer vision [37], natural language processing [38], and speech recognition [39]. These technologies enable the rapid development of application scenarios such as intelligent medicine [40, 41] and intelligent health care [42, 43]. Typical CNN and RNN models have already been used for sentiment analysis and achieved remarkable results. Yoon Kim [29] used CNN to perform sentence-level classification with pre-trained word vectors. Recursive deep learning was studied for sentiment classification by Socher et al. [44]. Tai et al. [45] proposed a generalization of the long short-term memory (LSTM) network, named Tree-LSTM, for predicting semantic relatedness and for sentiment classification of text. Wang et al. [46] proposed an approach for getting the benefit of dropout training without sampling. Y. Ma et al. [47] proposed a deep weighted approach for emotion recognition in audio-visual data.
After that, the idea of combining two networks in one model began to be explored, although only a few researchers have used two networks together in one model for sentiment classification. Kim et al. [48] proposed a neural language model that receives input at the character level and predicts output at the word level. Their model combines CNN, a highway network, and LSTM: the output from the highway network and the CNN over characters is used as input to an LSTM RNN-LM (recurrent neural network language model). Zhang et al. [49] and Vu et al. [50] both combined two distinct neural networks, CNN and RNN, for relation classification; these models achieved better results than existing models. Furthermore, Hassan et al. [51] put forward an end-to-end bottom-up architecture combining CNN and RNN for sentiment analysis. They presented a sequence architecture that uses word embeddings as input to a convolutional layer, which learns feature maps using windows of different sizes and various weights. The output of the convolutional layer is then used as input to an LSTM layer, which learns long-term dependencies from these feature maps to generate a sentence-level representation. Liu et al. [52] proposed an attention-based gated convolutional neural network for sentiment analysis; besides this, they proposed a new activation function, named NLReLU.
2.2. Attention-based CNN

Little work on CNN has been based on attention. Gehring et al. [32] proposed an attention-based CNN model for machine translation. They used hierarchical convolutional layers for both the encoder and the decoder. Each encoder hidden unit is queried at the n-th decoder layer by the output hidden units of the convolution; a weighted sum of all encoder hidden units is then added to the decoder hidden units, and these updated hidden units are passed to the (n+1)-th convolutional layer. Hence, this attention-based model needs multiple convolutional layers to pass the weighted context from the encoder to the decoder side, which shows that their attention plays its role after convolution. There are other CNN architectures [53, 54] that use attention at the pooling layer and call it attentive pooling. The architectures in both papers take two input sentences in parallel, and a set of hidden units for each sentence is computed by convolution. Each sentence then learns attentive weights for each of its hidden units based on that hidden unit and all hidden units of the other sentence. The resulting representation of each sentence is obtained by weighted mean pooling, i.e., attentive pooling.
2.3. Attention-based RNN

The earliest attention-based model was proposed by Graves et al. [55]. They used a differentiable attention scheme that forces the RNN to look at distinct parts of the input. This scheme has been explored by many researchers to solve text analytics problems such as neural machine translation [56], text generation [57], document reconstruction and summarization [58], and machine comprehension [59]. It has been further explored in relation classification tasks such as textual entailment [60] and answer selection [61].

Gazpio et al. [62] proposed a self-attention-based supervised model to learn word embeddings; this method calculates an attention weight at each position in the window for every word. Sentence-level attention was proposed by Liu et al. [63], who used mean pooling as an attention context over LSTM states and re-weighted the pooling vector of the sentence. Li et al. [64] proposed a similar self-attention-based factoid model for question encoding. Cheng et al. [65] proposed the LSTMN model, an intra-sentence attention scheme that was further used by Parikh et al. [66]. Their scheme calculates an attention vector for every hidden unit during the recurrent process, targeting lexical correlations between adjacent words.

In contrast to these, our proposed model uses a recurrent neural network with a CNN-style self-attention mechanism. In the proposed model, the RNN uses the hidden state units and the attention vectors generated from standard convolution or max pooling together.
3. Proposed Model
This section describes the proposed model, which consists of an input layer, a word embedding layer, CNN layers, a CNN-style self-attention mechanism, an RNN layer with the CNN-style self-attention mechanism (in the experiments we use a more powerful variant of the RNN layer, i.e., LSTM), a full connection layer, and a softmax output layer. The overall architecture of the proposed model is shown in Figure 1.

3.1. Input layer

To start, assume the input layer of the model receives text data as X(x_1, x_2, ..., x_n), where x_1, x_2, ..., x_n are the n words and m is the dimension of each input word. Each word vector is thus represented in an R^m dimensional space, and the dimension of the text data is R^{m×n}.

3.2. Word embedding layer

To learn word embeddings, let the vocabulary size for the text representation be d. The 2-dimensional word embedding matrix is then represented as A ∈ R^{m×d}. The input text X(x_i), where i = 1, 2, ..., n and X ∈ R^{m×n}, is passed from the input layer to the embedding layer to generate word embedding vectors from the raw corpus text. We use a distributed representation method to learn the word embeddings and implement it with the word2vec model [67] (https://code.google.com/p/word2vec/). Hence, the actual input fed into the model is the representation of the input text X(x_1, x_2, ..., x_n) ∈ R^{m×n} as numerical word vectors, where x_1, x_2, ..., x_n are the n word vectors in the embedding vocabulary, each in an R^m dimensional space, and the dimension m of each input word is selected during word embedding learning.
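For readers who want to see these two layers concretely, the minimal sketch below (not from the paper; the toy vocabulary size, word indices, and dimensions are made up) shows a padded, tokenized sentence being mapped to its matrix of word vectors with a Keras embedding lookup. Note that Keras stores the embedding matrix as d × m and returns an n × m sentence matrix, the transpose of the R^{m×n} orientation used in the text.

```python
import numpy as np
import tensorflow as tf

n, m, d = 10, 8, 50          # sentence length, word dimension, vocabulary size (toy values)

# A toy "tokenized" sentence: word indices into a vocabulary of size d, padded to length n.
sentence = np.array([[4, 12, 7, 33, 2, 0, 0, 0, 0, 0]])   # shape (1, n)

# Embedding matrix of shape (d, m): each row is one word vector (the matrix A of Section 3.2).
embedding = tf.keras.layers.Embedding(input_dim=d, output_dim=m)

X = embedding(sentence)       # shape (1, n, m): the numerical input text representation
print(X.shape)                # (1, 10, 8)
```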
Figure 1: Overall architecture of the proposed framework (input text representation, convolution, merge layer 2, max pooling, merge layer 1, self-attention, RNN with CNN-style self-attention, full connection layer, and softmax output). The dotted arrow lines indicate that the hidden unit representation is passed either from the convolution to merge layer 2 or from the max pooling to merge layer 1.
3.3. CNN layers

3.3.1. Convolution layer
The convolution layer performs the convolution operation over the word-vector representation received from the embedding layer, processed as a successive sequence of rows. Let s words be selected at a time, with a weight matrix W of dimension W ∈ R^{s×m}, to perform the convolution operation as follows:

h_i = f(W · X_{i:i+s-1} + b_i)    (1)

Here, h_i ∈ R^{n-s+1} is the feature map generated by repeatedly sliding over s words at a time, f is the non-linear ReLU function, and b_i is the bias term.
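As an illustration of Eq. (1), the following sketch applies a 1-D convolution with window size s and ReLU over an embedded sentence; with "valid" padding it produces the n - s + 1 feature positions described above. The toy shapes and the Keras layer choice are ours, not the paper's.

```python
import tensorflow as tf

n, m, s, num_filters = 10, 8, 5, 100   # toy sentence length, word dim, window size, feature maps

conv = tf.keras.layers.Conv1D(
    filters=num_filters, kernel_size=s, padding="valid", activation="relu")

X = tf.random.normal((1, n, m))        # stand-in for the embedded sentence
H = conv(X)                            # h_i = f(W . X_{i:i+s-1} + b_i), shape (1, n - s + 1, filters)
print(H.shape)                         # (1, 6, 100)
```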
3.3.2. Max pooling layer

We then perform the max pooling operation over the feature map generated by the convolution, which halves the feature map by keeping the features with the maximum activation value, as follows:

p_i = max(h_i)    (2)

Here, p_i ∈ R^{(n-s+1)/2} is the new feature map. Following the literature [29], we use two convolution layers in parallel with filter sizes 4 and 5. We then concatenate the features of both convolution layers at merge layer 2, and again at merge layer 1 after performing the max pooling operation, as shown in Figure 1.
3.4. CNN-style self-attention mechanism

After obtaining the hidden state units h_1, h_2, ..., h_n from CNN, we calculate the attentive context vectors based on them. For a simple mathematical explanation, let H = (h_1, h_2, ..., h_n); the feature map H then has shape R^{n×t}, where t is the dimension of the hidden states.
Figure 2: Illustration of the hidden units and attentive context vectors used together at the RNN. X_i are the input representations and C_i are the attentive context vectors calculated from the hidden units h_i generated by standard convolution.
Figure 3: Illustration of the hidden units and attentive context vectors used together at the RNN. X_i are the input representations and C_i are the attentive context vectors calculated from the hidden units h_i generated by max pooling.
Self-attention requires all hidden units as input to compute the attention vector. The attention context vector C_i for each word x_i is therefore calculated as a weighted sum over all hidden units of the snippet, as follows:

α_i = tanh(W_x · H^T + b_x)    (3)

C_i = exp(α_i) / Σ_j exp(α_j)    (4)

where W_x ∈ R^{m×n} is a weight matrix, b_x is the bias term, and α_i is the attention weight.

For ATTConv RNN, the hidden units generated by the convolution layer are used to compute the attention vector, as shown in Figure 2. For ATTPooling RNN, the hidden units generated by the max pooling layer are used to compute the attention vector, as shown in Figure 3.

The attentive context vector is used with the RNN and jointly learned during the training process. This context vector is expected to focus on the word features that play an essential role in the prediction and contribute to the meaning of the text.
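The following NumPy sketch spells out Eqs. (3)-(4) and the weighted sum they imply: a tanh projection scores each hidden state, a softmax normalizes the scores, and the attentive context is the weighted sum of the hidden states. The exact shape of W_x used here (one score per hidden state) is our simplification for illustration, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
n, t = 6, 100                        # number of hidden states, hidden dimension

H  = rng.normal(size=(n, t))         # hidden states h_1..h_n from convolution or max pooling
Wx = rng.normal(size=(1, t)) * 0.1   # attention weight vector (learned during training in practice)
bx = 0.0

alpha = np.tanh(Wx @ H.T + bx).ravel()          # Eq. (3): one unnormalized score per hidden state
weights = np.exp(alpha) / np.exp(alpha).sum()   # Eq. (4): softmax-normalized attention weights

context = weights @ H                 # attentive context vector: weighted sum of the hidden states
print(weights.round(3), context.shape)
```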
3.5. RNN layer with CNN-style self-attention mechanism

After computing the context vectors, we use the hidden units h_i from CNN and the context vectors C_i together at the RNN to generate higher-level hidden state units by processing them sequentially. The final feature map from the RNN is obtained by jointly learning from the hidden state units from CNN and the attentive context vectors C_i (as shown in Figure 1), as follows:

h(t)^1_{i,j} = w_1[i](h(t)_j, h(t-1)_j, C(t)_{i,j}) + b_1    (5)

For simplification, we can rewrite the above equation as:

h(t)^1_{i,j} = w_1[i](h(t)_j, h(t-1)_j) + w_2[i] · C(t)_{i,j} + b_1    (6)

with i = 1, 2, ..., l and j = 1, 2, ..., n, where w_1 ∈ R^{l×nm} and w_2 ∈ R^{l×nm} are weight matrices, b_1 is the bias term, h(t) is the hidden unit representation at time step t, h(t-1) is the hidden unit representation at time step t-1, and C_i is the attentive context vector. The final feature map from the RNN is obtained from h^1(t) as a new feature map. We use a soft self-attention approach here so that the gradient can propagate through both the convolution and the attention context vector components in parallel. To capture non-linear relationships among features, we use the rectified linear unit (ReLU) as the activation function:

h(t)^2_{i,j} = max(0, h(t)^1_{i,j})    (7)
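A hedged sketch of one way to realize Eqs. (5)-(7): the CNN hidden state and its attentive context vector are concatenated at every time step and fed to an LSTM (the RNN variant the paper says it uses), followed by a ReLU. The concatenation-based wiring is our interpretation of the joint term in Eq. (6), not the authors' exact implementation.

```python
import tensorflow as tf

n, t = 25, 200                        # time steps (hidden states) and feature dimension (toy values)
h = tf.keras.Input(shape=(n, t))      # hidden states from the CNN / max-pooling stage
c = tf.keras.Input(shape=(n, t))      # attentive context vectors aligned with each step

x = tf.keras.layers.Concatenate(axis=-1)([h, c])            # present h_t and C_t jointly at every step
r = tf.keras.layers.LSTM(100, return_sequences=False)(x)    # Eqs. (5)-(6): sequential processing
r = tf.keras.layers.Activation("relu")(r)                   # Eq. (7): ReLU non-linearity

model = tf.keras.Model([h, c], r)
model.summary()
```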
3.6. Full connection layer

After that, we use the full connection layer. The full connection operation is performed over the feature map h(t)^2_{i,j} obtained from the RNN after the non-linearity, as follows:

h^3 = w_4 · h^2 + b_4    (8)

where h^2 is the feature map received from the RNN after the non-linearity, h^3 is the feature map obtained from the full connection operation, and w_4 and b_4 are the weight and bias of the full connection layer.
3.7. Softmax output layer
Finally, the feature map from the full connection layer is passed to the softmax classifier, as shown in Figure 1, to perform the sentiment classification, which results in a probability distribution over all categories:

Y_i = exp(h_i) / Σ_{j=1}^{c} exp(h_j)    (9)

where Y_i is the predicted probability of the i-th class over all classes.

To measure the disparity between the real sentiment and the predicted sentiment of each sentence in the corpus, we use cross-entropy as the loss function:

loss = - Σ_{s∈T} Σ_{i=1}^{s'} Y_i^t(C) log(Y_i(C))    (10)

where Y^t(C) is the element corresponding to the real sentiment of the sentence, Y(C) is the element corresponding to the predicted sentiment of the sentence, s' is the number of sentiment classes, and T is the training corpus.

We initialize the training process with pre-trained word vectors from word2vec and then use a stochastic gradient descent algorithm with the Adadelta update method to train and update the parameters of the model.
4. Model Variations, Datasets, and Experiment Setup

4.1. Model variations

We experimented with several variations of the model, defined as follows:

• ATTPooling RNN-pre: A model based on the recurrent neural network with the pooling-style self-attention mechanism, trained with pre-trained word vectors from word2vec. Here the self-attention mechanism is applied over the hidden state units (h_i) generated by the max pooling layer, as shown in Figure 3.

• ATTConv RNN-pre: A model based on the recurrent neural network with the convolution-style self-attention mechanism, trained with pre-trained word vectors from word2vec. Here the self-attention mechanism is applied over the hidden state units (h_i) generated by the standard convolution layer, as shown in Figure 2.

• ATTPooling RNN-rand: A model based on the recurrent neural network with the pooling-style self-attention mechanism, trained with randomly initialized word vectors. Here the self-attention mechanism is applied over the hidden state units (h_i) generated by the max pooling layer, as shown in Figure 3.

• ATTConv RNN-rand: A model based on the recurrent neural network with the convolution-style self-attention mechanism, trained with randomly initialized word vectors. Here the self-attention mechanism is applied over the hidden state units (h_i) generated by the standard convolution layer, as shown in Figure 2.

These four variants form a 2 × 2 grid over the attention source (max pooling vs. convolution) and the word-vector initialization (pre-trained vs. random), as sketched below.
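A small configuration sketch of that grid (the dictionary and key names are ours, for bookkeeping only):

```python
from itertools import product

ATTENTION_SOURCES = ("pooling", "conv")      # which hidden states feed the self-attention
EMBEDDING_INITS   = ("pre", "rand")          # word2vec pre-trained vs. randomly initialized vectors

VARIANTS = {
    f"ATT{src.capitalize()} RNN-{init}": {"attention_over": src, "pretrained": init == "pre"}
    for src, init in product(ATTENTION_SOURCES, EMBEDDING_INITS)
}
# -> ATTPooling RNN-pre, ATTPooling RNN-rand, ATTConv RNN-pre, ATTConv RNN-rand
print(VARIANTS)
```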
4.2. Datasets
We experimented with three benchmark datasets to evaluate the proposed model. Detailed statistics of the datasets are given in Table 1.

• MR (Movie Reviews, https://www.cs.cornell.edu/people/pabo/movie-review-data/): This dataset contains a total of 10662 positive and negative movie reviews, with one sentence per review. The challenge is to predict the positive or negative sentiment associated with each sentence [68].

• SST1 (Stanford Sentiment Treebank, https://nlp.stanford.edu/sentiment/): This dataset is an extension of the MR dataset, with training, test, and development splits provided and labels in five classes: very positive, very negative, positive, negative, and neutral [44].

• SST2 (https://nlp.stanford.edu/projects/glove/): Similar to SST1 but with neutral reviews removed and binary labels.
4.3. Training Procedure and Hyperparameters

4.3.1. Hyperparameters

For all datasets, we used the rectified linear unit activation function, window sizes of 4 and 5 for the convolution layers, 100 feature maps for each CNN layer, an RNN layer output size of 100, dropout of 0.50 at the full connection layer, and a mini-batch size of 50. Training was performed with a stochastic gradient descent algorithm using the Adadelta update method [69]. All parameter values were selected through a grid search.
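To show how the pieces of Section 3 fit these hyperparameter choices, here is a hedged end-to-end sketch of one convolution-attention variant in Keras. The layer sizes follow the stated settings (window sizes 4 and 5, 100 feature maps, LSTM output 100, dropout 0.5, softmax output, Adadelta), but the attention wiring, the "same" padding, and the placeholder vocabulary/length values are our assumptions rather than the authors' exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

n, m, vocab, classes = 56, 300, 20000, 5      # max length, embedding dim, vocab size, classes (SST1-like)

inp = layers.Input(shape=(n,), dtype="int32")
emb = layers.Embedding(vocab, m)(inp)                          # word embedding layer

convs = [layers.Conv1D(100, k, padding="same", activation="relu")(emb) for k in (4, 5)]
H = layers.Concatenate()(convs)                                # merge layer 2: CNN hidden states

scores  = layers.Dense(1, activation="tanh")(H)                # Eq. (3): one score per position
weights = layers.Softmax(axis=1)(scores)                       # Eq. (4): normalize over time
context = layers.Lambda(lambda z: tf.reduce_sum(z[0] * z[1], axis=1))([weights, H])
context_seq = layers.RepeatVector(n)(context)                  # make the context available at every step

x = layers.Concatenate()([H, context_seq])                     # hidden states + attentive context
x = layers.LSTM(100)(x)                                        # RNN layer (LSTM variant), output size 100
x = layers.Dropout(0.5)(x)                                     # dropout at the full connection layer
out = layers.Dense(classes, activation="softmax")(x)           # softmax output, Eq. (9)

model = tf.keras.Model(inp, out)
model.compile(optimizer=tf.keras.optimizers.Adadelta(),        # Adadelta update rule [69]
              loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```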
Table 1: Statistical details of the datasets (after tokenization) used in the experiments. c: number of classes; Al: average sentence length; Ml: maximum sentence length; V: dataset size; N: vocabulary size; VTrain: training set size; VTest: test set size; VValid: validation set size. CV indicates that there was no standard train/test/dev split, so 10-fold cross-validation was used.

Dataset   c   Al   Ml   V       N       VTrain   VTest   VValid
MR        2   20   51   10662   18983   8655     1064    CV
SST1      5   18   56   11855   18784   8544     2210    1101
SST2      2   19   56   9613    18784   6920     1821    872
4.3.2. Pre-trained Word Vectors
We initialized the word vectors with a pre-trained set of word vectors obtained from word2vec, an unsupervised natural language algorithm that produces vector representations of words; such pre-trained vectors are commonly used in the absence of a large training set to improve model performance. The publicly available word2vec vectors (https://code.google.com/p/word2vec/) with word dimension 300 were used in the experiments; they were trained on 100 billion words of Google News data using the continuous bag-of-words framework. Words absent from the pre-trained set were initialized randomly.
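A sketch of how such an embedding matrix is typically assembled, assuming the gensim library and a local copy of the released GoogleNews word2vec binary; the file path, the tiny vocabulary, and the uniform range for out-of-vocabulary words are placeholders/assumptions, not details from the paper.

```python
import numpy as np
from gensim.models import KeyedVectors

DIM = 300
vocab = {"the": 0, "movie": 1, "was": 2, "great": 3}          # placeholder word -> index map

# Path is a placeholder; the binary is the publicly released Google News word2vec model.
w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

rng = np.random.default_rng(0)
embedding_matrix = rng.uniform(-0.25, 0.25, size=(len(vocab), DIM))   # random init for OOV words
for word, idx in vocab.items():
    if word in w2v:
        embedding_matrix[idx] = w2v[word]                             # copy the pre-trained vector
```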
5. Experiment Results and Discussion

The proposed model is implemented using the well-known libraries TensorFlow and Keras. We compare the running time of the model variations on a system with 256 GB RAM and a 16 GB GPU. Here, we set 30 epochs for all model variations to compare them with each other. Figure 4 shows the details of the running times of all models. Subsection 4.3 describes the parameter settings and the training method for the model parameters and word vectors.

Figure 4: Running time (in minutes) of all model variations on the MR, SST1, and SST2 datasets. There is not much difference in running time between pre-trained and randomly initialized models on any dataset; that is, ATTPooling RNN-pre and ATTConv RNN-pre take approximately the same time to complete execution as ATTPooling RNN-rand and ATTConv RNN-rand. However, ATTPooling RNN takes about half the time of ATTConv RNN on the SST1 and SST2 datasets for both randomly initialized and pre-trained variants, i.e., ATTPooling RNN runs about twice as fast as ATTConv RNN on these datasets. On the MR dataset, the running speeds of all models are almost the same.

Three benchmark datasets are used in the experiments to measure the performance of the model; the results are listed in Table 2. As we expected, the model variants with randomly initialized word vectors (ATTPooling RNN-rand and ATTConv RNN-rand) do not perform as well as the pre-trained variants (ATTPooling RNN-pre and ATTConv RNN-pre). With pre-trained word vectors we obtained the expected results: even the pre-trained variant with attentive convolution (ATTConv RNN-pre) achieves better performance than some sophisticated natural language models that use complex structures, such as [44]. These results suggest that pre-trained vectors perform well and extract features better than random initialization; using pre-trained vectors yields an accuracy gain of approximately 2-3% over random initialization.

The experiment results validate that the proposed model performs better than existing state-of-the-art methods on two out of three datasets and achieves competitive results on the third. In particular, as expected, ATTPooling RNN-pre performs better than ATTConv RNN-pre on all datasets, with accuracy gains of 1.2% on MR, 0.73% on SST1, and 1.77% on SST2. These results indicate that applying the attention mechanism over the words selected by max pooling is more beneficial than applying attention over all word features from the convolution on these prediction datasets.
The challenge of this dataset is to predict sentiment from binary labeled sentence i.e. positive or negative which contain one sentence per review. Reference [29] perform experimentation with several variations of the standard CNN mod-
Jo
355
el by doing fine tuning in word embedding and using a multichannel approach. Compared to his results our model accuracy is better and achieves 2.82% gain approximate. There is another work[70] which used bow SVM and same non19
Journal Pre-proof
of
Table 2: Results (accuracy, %) of the proposed model against existing works. The best result in each column is marked with *.

Models                                 Year   MR       SST1     SST2
Recursive Neural Tensor Network [44]   2013   -        45.7     85.4
Word vector averages [44]              2013   -        32.7     80.1
Fast Dropout [46]                      2013   79.1     -        -
DCNN [28]                              2014   -        48.5     86.8
Non-static [29]                        2014   81.5     48.0     87.2
Multi-channel [29]                     2014   81.1     47.4     88.1
Tree LSTM [71]                         2015   -        48.0     -
Tree bi-LSTM [72]                      2015   79.0     -        -
LSTM [45]                              2015   -        46.4     84.9
CNN-GRU-word2vec [73]                  2016   82.28    50.68    89.95*
bowwvSVM [70]                          2016   79.7     43.2     83.3
Non-static GloVe+word2vec CNN [70]     2016   81.0     45.9     85.6
ConvLstm [74]                          2017   -        47.5     88.3
CRDLM [51]                             2018   -        48.8     89.2
AGCNN-SELU-3-channel [52]              2018   81.3     49.4     87.6
AGCNN-NLReLU-3-channel [52]            2018   81.9     49.4     87.4
ATTPooling RNN-pre (ours)              2019   84.32*   51.14*   89.62
ATTConv RNN-pre (ours)                 2019   83.12    50.41    87.85
ATTPooling RNN-rand (ours)             2019   79.25    48.32    86.09
ATTConv RNN-rand (ours)                2019   78.21    46.36    84.90
Another work [70] used a bag-of-words SVM and the same non-static CNN model as [29], but trained with combined GloVe and word2vec embeddings, and reported lower accuracy than the above; compared to [70], we achieve an accuracy gain of approximately 3.32%. Reference [73] proposed a joint architecture using CNN and RNN and reported much better results than previous works, which shows that combining CNN and RNN gives better results than a single CNN or RNN architecture in the prediction task. The latest work [52] used attention with CNN and achieved remarkable results, better than existing works except for the joint-architecture model; this substantiates the benefits of attention in the prediction task. In the proposed model we use CNN and RNN together with an attention mechanism and achieve an accuracy gain of 2.42% over the latest CNN attention model [52] and 2.04% over the joint architecture [73]. Overall, the proposed architecture is the most modern of its type and achieves the best results among all existing works on the MR dataset.
architecture[73]. In overall proposed model architecture is most modern of its type and achieves the best results among all existing works on MR dataset. 5.2. Analysis of results on SST1 dataset
375
urn al P
SST1 dataset is fine-grain extension of MR dataset which labeled into five classes as very positive, very negative, positive, negative, and neutral. Early existing works on SST1 dataset based on different architectures such as RNTN[44], Deep CNN[28], Non-static and Multichannel CNN[29], LSTM[45], Tree-LSTM[71], and GloVe+word2vec CNN[70] reported the accuracy between 46% to 48% approximately except the Word vector averages[44]. Word vector 380
average reported much worst results with accuracy 32.7%. Compared to early existing works our model achieves more than 2% gain in accuracy approximate. Later works such as CNN-GRU[73] and CRDLM[51] used joint architecture approach; reported 50.68% and 48.8% accuracy results respectively which is better compared to early existing works. But other joint architecture ConvLstm[74] could not achieve better results than all early existing works but still produced strong competitive results. Moreover, the latest work[52] based on attention
Jo
385
mechanism with CNN reported poor results with accuracy 49.4%, than the joint architecture model but still get better results than early existing works. However, our model not only performs better than early existing works and joint 21
Journal Pre-proof
390
architecture models but also than latest CNN based attention model[52]. The proposed performs better than the latest existing models and achieves 0.46% and
of
1.74% accuracy gain than joint architecture[73] and gated-attention model[52] respectively. In general, the proposed model achieves state-of-art results on
395
5.3. Analysis of results on SST2 dataset
pro
SST2 dataset also.
SST2 is the subset of SST1 dataset with binary labeled and without neutral reviews. On SST2 dataset recursive and recurrent model such as RNTN[44] and LSTM[45] achieve accuracy between 84-86% approximate. While all variation of standard CNN model(DCNN, Non-static CNN, and Multichannel CNN) achieve better results than existing recursive and recurrent model with accu-
re-
400
racy more than 86% except GloVe+word2vec CNN[70] which got 85.6% accuracy. Compared to existing recursive, recurrent, and CNN models our model achieves better results with accuracy more than 89%. The proposed model also
405
urn al P
got gain in accuracy 2.02% when compared to latest gated-attention model[52]. But On this dataset joint architecture models (CNN-GRU[73] and CRDLM[51]) performed better compared to all existing models. However, compared to the proposed model only one joint model (CNN-GRU) got little better results with 0.33% gain in accuracy. That means our model not giving state-of-art results on SST2 dataset but still giving better results than many existing works including 410
the latest one and a strong competitor of joint architecture models. Furthermore, to validate the effectiveness of the proposed model, we do a comparative study with strong benchmark attention models. To understand the inner and deep behavior of the model we visualize the attentive context over test dataset.
5.4. Comparison with strong benchmark attention models
Jo
415
5.4.1. Comparison with attention-based CNN Little work has done on attention-based CNN, and among them almost work
in CNN is based on attentive pooling[54] but the proposed model is based on 22
Figure 5: Attention visualization as a heat map for four sentences from the MR dataset, for (a) ATTConv RNN-pre and (b) ATTPooling RNN-pre. In both (a) and (b) the first two sentences carry positive sentiment ("emerges as something rare an issue movie thats so honest and keenly observed that it doesnt feel like one"; "offers that rare combination of entertainment and education") and the last two carry negative sentiment ("most of the problems with the film dont derive from the screenplay but rather the mediocre performances by most of the actors involved"; "curling may be a unique sport but men with brooms is distinctly ordinary").
The process of computing the attentive context vector is the same as in those models, but ours differs in two ways: (i) the proposed model uses both the convolutional and the recurrent neural network, where the RNN computes the high-level representation of the text using the hidden state units generated from the standard convolution or max pooling; and (ii) we compute the attentive context vectors from the hidden state units generated by CNN and then use them together with those hidden state units at the RNN.

5.4.2. Comparison with attention-based RNN

A significant amount of work has been done on RNN and its variants with attention mechanisms [55]. The proposed model has benefits over attention-based RNN in the following ways: (i) in attentive RNN, the attentive context is calculated from the RNN itself, whereas in the proposed model the attentive context is calculated from the hidden state units generated by standard convolution and is then used at the RNN to learn the high-level representation; and (ii) as can be seen from Figure 1, the proposed model is composed of both CNN and RNN, so it gets the benefits of both: CNN learns robust local features using sliding convolution, and RNN processes these features sequentially with the attentive context vector generated from the CNN itself.

5.5. Attention visualization

In this section, we visualize the attentive context over sentences to compare which word features are attended to by the model when making predictions, for both the ATTConv RNN-pre and ATTPooling RNN-pre models. The same sentences are used as input for both variations of the model so that we can infer the differences between ATTConv RNN-pre and ATTPooling RNN-pre.

The parameter settings of the model are kept the same as during training. After training on all datasets and saving the models, we randomly selected four sentences (two positive and two negative) from the test set and compared both models by visualizing the attentive context as a heat map. Since the model architecture is the same for all datasets, visualizing every dataset would be largely redundant; thus we only visualize the MR dataset. This four-sentence visualization for both model architectures reflects the attention behavior across all datasets.

As Figure 5 shows, from the sentiment perspective, for both model architectures (ATTPooling RNN-pre and ATTConv RNN-pre) the attentive context vector focuses more on the positive parts ("emerges as something", "so honest and keenly observed", and "entertainment and education") in the positive sentences, and similarly focuses on the negative parts ("most of the problems with the film dont derive" and "distinctly ordinary") in the negative sentences, to predict the actual sentiments.

However, from the model perspective, there is a little variation in the attention paid to important parts. In the first positive sentence, the ATTPooling RNN-pre model (Figure 5(b)) focuses on the positive parts ("emerges as something" and "so honest and keenly observed") with similar intensity, while ATTConv RNN-pre (Figure 5(a)) does not focus on the "emerges as something" part as much as on "so honest and keenly observed". In the second positive sentence, both architectures focus with similar intensity on similar words. In the first negative sentence, the part "mediocre performances" does not receive as much attention from ATTConv RNN-pre (Figure 5(a)) as it does from ATTPooling RNN-pre (Figure 5(b)); from a ground-truth perspective, to predict negative sentiment from this sentence, "mediocre performances" should be treated as strongly negative, which is what the ATTPooling RNN-pre (Figure 5(b)) architecture does. Thus, in conclusion, we can say that ATTPooling RNN-pre performs better than ATTConv RNN-pre not only in terms of accuracy but also in attending to the important word features related to the corresponding sentiment.
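For readers who want to reproduce this kind of figure, the sketch below plots a single-row heat map from a list of tokens and one attention weight per token (for example, the softmax weights of Eq. (4) extracted from a trained model); the tokens come from one of the sentences in Figure 5, and the weights shown are purely illustrative, not the model's actual outputs.

```python
import numpy as np
import matplotlib.pyplot as plt

tokens  = "offers that rare combination of entertainment and education".split()
weights = np.array([0.05, 0.04, 0.20, 0.16, 0.03, 0.25, 0.04, 0.23])   # illustrative values only

fig, ax = plt.subplots(figsize=(8, 1.2))
ax.imshow(weights[np.newaxis, :], cmap="Reds", aspect="auto")   # one row: darker = more attention
ax.set_xticks(range(len(tokens)))
ax.set_xticklabels(tokens, rotation=45, ha="right")
ax.set_yticks([])
plt.tight_layout()
plt.show()
```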
6. Conclusion

This article proposed a new model based on the recurrent neural network with CNN-style self-attention, combining the convolutional and recurrent neural networks with a self-attention mechanism for the sentiment analysis task. The model was evaluated on three benchmark datasets. The proposed model not only extracts local contextual features through CNN and high-level temporal features through RNN, but also focuses more attention on important word features. Analysis of the results against baseline methods, a comparative study, and attention visualization demonstrate the effectiveness of the model. Moreover, the experiments indicate that the model achieves better results than existing state-of-the-art works on two out of three datasets in the sentiment analysis task, with accuracies of 84.32%, 51.14%, and 89.62% on the MR, SST1, and SST2 datasets, respectively (Table 2). In future work, we will test our approach on other tasks of natural language processing.

References
[1] H. Jiang, C. Cai, X. Ma, Y. Yang and J. Liu, “Smart Home Based on WiFi Sensing: A Survey,” IEEE Access, vol. 6, pp. 13317-13325, 2018. [2] Y. Zhang, et al., “TempoRec: Temporal-Topic Based Recommender for So-
urn al P
490
cial Network Services,” Mobile Networks and Applications, Vol. 22, No. 6, pp. 1182-1191, 2017.
[3] K. Lin, J. Song, J. Luo, J. Wen, G. Ahmed, “GVT: Green Video Transmission in the Mobile Cloud Networks,” IEEE Transactions on Circuits & 495
Systems for Video Technology. vol. 27, no. 1, pp. 159-169, Mar. 2016. [4] J. Chen, C. Wang, Z. Zhao, K. Chen, R. Du, G. Ahn, “Uncovering the Face of Android Ransomware: Characterization and Real-Time Detection,” IEEE Transactions on Information Forensics and Security, vol. 13 no. 5, pp. 12861300, 2018.
[5] X. Ma, et al., “Modeling multi-aspects within one opinionated sentence simultaneously for aspect-level sentiment analysis,” Future Generation Com-
Jo
500
puter Systems, Vol. 93, pp. 304-311, April 2019.
[6] S. Kim, E. Hovy, “Automatic detection of opinion bearing words and sentences,” In: Proceedings of the IJCNLP, pp. 61-66, 2005. 26
Journal Pre-proof
505
[7] S. Kim, E. Hovy, “Automatic identification of pro and con reasons in online reviews,” In: Proceedings of the COLING/ACL, pp. 483-490, 2006.
of
[8] K. Lin, J. Luo, L. Hu, M. Shamim Hossain, A. Ghoneim. “Localization based on Social Big Data Analysis in the Vehicular Networks,” IEEE Transactions
510
pro
on Industrial Informatics, vol. 13, no. 4, pp. 1932-1940, Aug. 2017.
[9] B. Pang, L. Lee, S. Vaithyanathan, “Thumbs up? sentiment classification using machine learning techniques,” In: Proceedings of the EMNLP, pp. 79-86, 2002.
[10] M. Chen, S. Mao, Y. Liu, “Big data: A survey,” Mobile Netw. Appl., vol.
515
re-
19, no. 2, pp. 171-209, Apr. 2014.
[11] M. Usama, M. Liu, M. Chen, “Job schedulers for Big data processing in Hadoop environment: testing real-life schedulers using benchmark program-
urn al P
s,” Digital Communications and Networks, vol. 3 no. 4, pp. 260273, 2017. [12] M. Chen, Y. Hao , K. Hwang, L. Wang, L. Wang, “Disease Prediction by Machine Learning Over Big Data From Healthcare Communities,” IEEE 520
Access, Vol. 5, No. 1, pp. 8869-8879, 2017. [13] M. Usama, B. Ahmad, J. Wan, M. S. Hossain, M. F. Alhamid, M. A. Hossain, “Deep Feature Learning for Disease Risk Assessment Based on Convolutional Neural Network With Intra-Layer Recurrent Connection by Using Hospital Big Data,” IEEE Access, vol. 6, pp. 67927-67939, 2018.
525
[14] K. Lin, C. Li, D. Tian, A. Ghoneim, M. Shamim Hossain, S. Amin. “Artificial-Intelligence-Based Data Analytics for Cognitive Communication in Heterogeneous Wireless Networks,” IEEE Wireless Communications, vol.
Jo
26, no. 3, pp. 83-89, Jun. 2019.
[15] J. Chen, K. He, Q. Yuan, G. Xue, R. Du, L. Wang, “Batch Identification
530
Game Model for Invalid Signatures in Wireless Mobile Networks,” IEEE Transactions on Mobile Computing, vol. 16 no. 6, pp. 15301543, 2017.
27
Journal Pre-proof
[16] M. Chen, J. Yang, J. Zhou, Y. Hao, J. Zhang, C. Youn, “5G-Smart Diabetes: Towards Personalized Diabetes Diagnosis with Healthcare Big Data
535
of
Clouds,” IEEE Communications, Vol. 56, No. 4, pp. 16-23, April 2018. [17] K. Lin, M. Chen, J. Deng, M. Hassan, G. Fortino “Enhanced Fingerprinting and Trajectory Prediction for IoT Localization in Smart Buildings,” IEEE
pro
Transactions on Automation Science and Engineering, vol.13, no. 3, pp. 1294 - 1307, Apr. 2016.
[18] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, “Gradient-based learning ap540
plied to document recognition,” In Proceedings of the IEEE, pp. 2278-2324,
re-
1998.
[19] J. L. Elman, “Finding structure in time,” Cognitive Science, vol. 14, no. 2, pp. 179-211, 1990.
[20] M. Chen, Y. Hao, “Label-less Learning for Emotion Cognition,” IEEE Transactions on Neural Networks and Learning Systems, DOI:
urn al P
545
10.1109/TNNLS.2019.2929071, 2019. [21] M.
Chen,
mation
Y.
Hao,
Measurements:
H.
Gharavi,
A
New
V.
Leung,
Perspective,”
“Cognitive Information
InforSci-
ence,https://arxiv.org/abs/1907.01719, 2019. 550
[22] M. Chen, F. Herrera, K. Hwang, “Cognitive Computing: Architecture, Technologies and Intelligent Applications,” IEEE Access, Vol. 6, pp. 1977419783, 2018.
[23] Y. Zhang, et al., “Edge Intelligence in the Cognitive Internet of Things: Improving Sensitivity and Interactivity,” IEEE Network, vol. 33, no. 3, pp. 58-64, 2019.
Jo
555
[24] Y. Zhang, et al., “COCME: Content-Oriented Caching on the Mobile Edge for Wireless Communications,” IEEE Wireless Communications, vol. 26, no. 3, pp. 26-31, June 2019.
28
Journal Pre-proof
[25] D. Bahdanau, K. Cho, Y. Bengio, “Neural machine translation by jointly 560
learning to align and translate,” In Proceedings of ICLR, 2015.
of
[26] Y. Peng, X. He and J. Zhao, “Object-Part Attention Model for FineGrained Image Classification,” in IEEE Transactions on Image Processing,
pro
vol. 27, no. 3, pp. 1487-1500, March 2018.
[27] B. Zhang, D. Xiong, J. Su, “Neural Machine Translation with Deep Atten565
tion,” in IEEE Transactions on Pattern Analysis and Machine Intelligence. doi: 10.1109/TPAMI.2018.2876404, 2018.
[28] N. Kalchbrenner, E. Grefenstette, P. Blunsom, “A convolutional neural
re-
network for modelling sentences,” In Proceedings of ACL, pp. 655-665, 2014. [29] Y. Kim, “Convolutional neural networks for sentence classification,” In 570
Proceedings of EMNLP, pp. 1746-1751, 2014.
urn al P
[30] M. Er, Y. Zhang, N. Wang, P. Mahardhika, “Attention pooling-based convolutional neural network for sentence modeling,” Information Sciences, Vol. 373, pp. 388-403, 2016.
[31] Z. Zhang, Y. Zou, C. Gan. “Textual sentiment analysis via three differ575
ent attention convolutional neural networks and cross-modality consistent regression,” Neurocomputing, Vol. 275, pp. 1407-1415, 2018. [32] J. Gehring, M. Auli, D. Grangier, D. Yarats, Y. Dauphin, “Convolutional sequence to sequence learning,” In Proceedings of ICML, pp. 1243-1252, 2017.
580
[33] S. Shayaa, et al., “Sentiment Analysis of Big Data: Methods, Applications, and Open Challenges,” IEEE Access, vol. 6, pp. 37807-37827, 2018.
Jo
[34] D. Pham, A. Le, “Learning multiple layers of knowledge representation for aspect based sentiment analysis,” Data & Knowledge Engineering, Vol. 114, pp. 26-39, 2018.
29
Journal Pre-proof
585
[35] He K.; Chen J., Du R., Wu Q., Xue G., Zhang X., “DeyPoS: Deduplicatable Dynamic Proof of Storage for Multi-User Environments,” IEEE Transactions
of
on Computers, vol. 65 no. 12, pp. 36313645, 2016. [36] A. Abdi, S. Shamsuddin, S. Hasan, J. Piran, “Machine learning-based multi-documents sentiment-oriented summarization using linguistic treatment,” Expert Systems with Applications, Vol. 109, pp. 66-85, 2018.
pro
590
[37] N. Akhtar, A. Mian, “Threat of Adversarial Attacks on Deep Learning in Computer Vision: A Survey,” IEEE Access, vol. 6, pp. 14410-14430, 2018. [38] T. Young, D. Hazarika, S. Poria, E. Cambria, “Recent Trends in Deep
595
re-
Learning Based Natural Language Processing,” IEEE Computational Intelligence Magazine, vol. 13, no. 3, pp. 55-75, Aug. 2018. [39] A. B. Nassif, I. Shahin, I. Attili, M. Azzeh, K. Shaalan, “Speech Recognition Using Deep Neural Networks: A Systematic Review,” IEEE Access, vol. 7,
urn al P
pp. 19143-19165, 2019.
[40] M. Chen, X. Shi, Y. Zhang, D. Wu, M. Guizani, “Deep Features Learning 600
for Medical Image Analysis with Convolutional Autoencoder Neural Network,” IEEE Trans. Big Data, DOI:10.1109/TBDATA.2017.2717439, 2017. [41] M. Chen, J. Yang , X. Zhu, X. Wang, M. Liu, J. Song, “Smart Home 2.0: Innovative Smart Home System Powered by Botanical IoT and Emotion Detection,” Mobile Networks and Applications, Vol. 22, pp. 1159-1169, 2017.
605
[42] M. Chen, J. Yang, L. Hu, M. Hossain, G. Muhammad, “Urban Healthcare Big Data System based on Crowdsourced and Cloud-based Air Quality Indicators,” IEEE Communications, Vol. 56, No. 11, Nov. 2018.
Jo
[43] M. Chen, W. Li, Y. Hao, Y. Qian, I. Humar, “Edge Cognitive Computing based Smart Healthcare System,” Future Generation Computing System,
610
Vol. 86, pp. 403-411, Sep. 2018.
30
Journal Pre-proof
[44] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. Manning, A. Ng, C. Potts, “Recursive deep models for semantic compositionality over a sentiment tree-
of
bank,” In: Proceedings of EMNLP, pp. 16-42, 2013. [45] K. Tai, R. Socher, C. Manning, “Improved semantic representations from 615
tree-structured long short-term memory networks,” arXiv preprint arX-
pro
iv:1503.00075, 2015.
[46] S. Wang, C. Manning, “Fast Dropout Training,” Proceedings of ICML, 2013.
[47] Y. Ma, Y. Hao, M. Chen, J. Chen, P. Lu, A. Kosir, “Audio-visual emotion fusion (AVEF): A deep efficient weighted approach,” Information Fusion, Vol. 46, pp. 184-192, 2019.
re-
620
[48] Y. Kim, Y. Jernite, D. Sontag, A. Rush, “Character-aware neural language
urn al P
models,” arXiv preprint arXiv:1508.06615, 2015.
[49] X. Zhang, F. Chen, R. Huang, “A Combination of RNN and CNN for 625
Attention-based Relation Classification,” In: proceeding of ICICT, pp. 911917, 2018.
[50] N. Vu, H. Adel, P. Gupta, H. Schutze, “Combining Recurrent and Convolutional Neural Networks for Relation Classification,” In: proceedings of NAACL-HLT, pp. 534-539, 2016. 630
[51] A. Hassan, A. Mahmood, “Convolutional Recurrent Deep Learning Model for Sentence Classification,” IEEE Access, vol. 6, pp. 13949-13957, 2018. [52] Y. Liu, L. Ji, R. Huang, T. Ming, C. Gao, J. Zhang, “An attention-gated convolutional neural network for sentence classification,” arXiv preprint arX-
Jo
iv:1808.07325, 2018. 635
[53] W. Yin, H. Schutze, B. Xiang, B. Zhou, “ABCNN: Attention-based convolutional neural network for modeling sentence pairs,” TACL, pp. 259-272, 2016.
31
Journal Pre-proof
[54] C. Santos, M. Tan, B. Xiang, B. Zhou, “Attentive pooling networks,” CoRR, abs/1602.03609, 2016. [55] A. Graves, G. Wayne, I. Danihelka, “Neural turing machines,” CoRR, ab-
of
640
s/1410.5401, 2014.
pro
[56] Y. Heo, S. Kang, D. Yoo, “Multimodal Neural Machine Translation With Weakly Labeled Images,” IEEE Access, vol. 7, pp. 54042-54053, 2019. [57] H. Zheng, W. Wang, W. Chen, A. K. Sangaiah, “Automatic Generation of 645
News Comments Based on Gated Attention Neural Networks,” IEEE Access, vol. 6, pp. 702-710, 2018.
re-
[58] K. Al-Sabahi, Z. Zuping, M. Nadher, “A Hierarchical Structured SelfAttentive Model for Extractive Document Summarization (HSSAS),” IEEE Access, vol. 6, pp. 24205-24212, 2018.
[59] S. Liu, S. Zhang, X. Zhang, H. Wang, “R-Trans: RNN Transformer Net-
urn al P
650
work for Chinese Machine Reading Comprehension,” IEEE Access, vol. 7, pp. 27736-27745, 2019.
[60] M. Lu, Y. Fang, F. Yan, M. Li, “Incorporating Domain Knowledge into Natural Language Inference on Clinical Texts,” IEEE Access, vol. 7, pp. 655
57623-57632, 2019.
[61] T. Shao, Y. Guo, H. Chen, Z. Hao, “Transformer-Based Neural Network for Answer Selection in Question Answering,” IEEE Access, vol. 7, pp. 2614626156, 2019.
[62] I. Lopez-Gazpio, M. Maritxalar, M. Lapata, E. Agirre. “Word n-gram at660
tention models for sentence similarity and inference,” Expert Systems with
Jo
Applications, Vol. 132, pp. 1-11, 2019.
[63] Y. Liu, C. Sun, L. Lin, X. Wang, “Learning natural language inference using bidirectional LSTM model and inner-attention,” CoRR, abs/1605.09090, 2016.
32
Journal Pre-proof
665
[64] P. Li,W. Li, Z. He, X. Wang, Y. Cao, J. Zhou, W. Xu, “Dataset and neural recurrent sequence labeling model for open-domain factoid question
of
answering,” arXiv preprint arXiv:1607.06275, 2016. [65] J. Cheng, D. Li, M. Lapata, “Long short-term memory-networks for ma-
670
pro
chine reading,” In Conference on EMNLP, pp. 551-561, 2016.
[66] A. Parikh, O. Tackstrom, D. Das, J. Uszkoreit, “A decomposable attention model for natural language inference,” In Proceedings of EMNLP, 2016. [67] Google, “word2vec,” https://code.google.com/p/word2vec/. [68] B. Pang, L. Lee, “Seeing stars: Exploiting class relationships for sentiment
675
re-
categorization with respect to rating scales,” In: Proceedings of ACL, 2005. [69] M. Zeiler, “Adadelta:An adaptive learning rate method,” CoRR, abs/1212.5701, 2012.
urn al P
[70] Y. Zhang, B. Wallace, “A sensitivity analysis of (and practitioners’ guide to) convolutional neural networks for sentence classification,” arXiv preprint arXiv:1510.03820, 2016. 680
[71] X. Zhu, P. Sobhani, H. Guo, “Long short-term memory over tree structures,” arXiv preprint arXiv:1503.04881, 2015. [72] J. Li, D. Jurafsky, E. Hovy, “When are tree structures necessary for deep learning of representations?,” arXiv preprint arXiv:1503.00185, 2015. [73] X. Wang, W. Jiang, Z. Luo, “Combination of Convolutional and Recur-
685
rent Neural Network for Sentiment Analysis of Short Texts,” Proceedings of COLING pp. 24282437, 2016.
Jo
[74] A. Hassan, A. Mahmood, “Deep Learning Approach for Sentiment Analysis of Short Texts,” In: Proceedings of the ICCAR, 2017.