
Equipping Recurrent Neural Network with CNN-style Attention Mechanisms for Sentiment Analysis of Network Reviews


Mohd Usama, Belal Ahmad, Jun Yang, Saqib Qamar, Parvez Ahmad, Yu Zhang, Jing Lv, Joze Guna

M. Usama, B. Ahmad, J. Yang, S. Qamar, P. Ahmad and Y. Zhang are with the School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China (email: [email protected]). J. Lv (corresponding author) is with the Hospital Attached to Huazhong University of Science and Technology (email: [email protected]). J. Guna is with the Faculty of Electrical Engineering, University of Ljubljana, Trzaska cesta 25, 1000 Ljubljana, Slovenia (email: [email protected]).

Abstract

Deep learning algorithms have achieved remarkable results in natural language processing (NLP) and computer vision. In particular, deep learning methods such as convolutional and recurrent neural networks have shown remarkable performance in text analytics tasks. From the attention-mechanism perspective, however, the convolutional neural network (CNN) has been applied less often than the recurrent neural network (RNN), because an RNN can learn long-term dependencies and therefore tends to give better results than a CNN. Yet a CNN has its own advantage: it can extract high-level features that are invariant to local translation by using a fixed-size local context at the input level. In this paper we therefore propose a new model based on an RNN with a CNN-style self-attention mechanism, combining the merits of both architectures in one model. In the proposed model, the CNN first learns a high-level representation of the words at the input level. Second, a self-attention mechanism directs the model toward the features that contribute most to the prediction task by computing attentive context vectors over the hidden-state representation generated by the CNN. Finally, the hidden-state representations from the CNN and the attentive context vectors are used together at the RNN, which processes them sequentially. To validate the model we experiment on three benchmark datasets, i.e., Movie Review, Stanford Sentiment Treebank-1, and Treebank-2. The experimental results and their analysis demonstrate the effectiveness of the proposed model.

Keywords: Convolutional neural network, recurrent neural network, attentive learning, network reviews, and sentiment analysis.

1. Introduction

The enormous and rapid growth of social networks, smart gadgets, and the internet has brought together billions of users who generate short texts on the internet, such as public opinions on services, products, movies, and blogs [1, 2, 3, 4]. These short-text reviews are usually semantically and subjectively oriented. Identifying and classifying the semantic orientation of text reviews written on the internet is now attracting much attention from researchers, because it solves various practical problems for customers as well as companies, such as which products a customer likes or dislikes and whether a product is doing well in the market. Sentiment analysis of short text remains challenging because a short text usually contains limited semantic and contextual information, which limits the accuracy of the analysis [5].

In the early days, many researchers performed sentiment analysis with traditional classification approaches that are based on manual feature engineering and require rule-based methods, such as data mining techniques [6] and knowledge-based approaches [7, 8]. These traditional approaches mainly used words as statistical indicator representations such as TF-IDF (term frequency-inverse document frequency). Such schemes do not treat word features as a positional factor and usually do not represent the features that carry large weights. Besides this, some researchers used machine learning algorithms [9] such as Naive Bayes and Support Vector Machines for feature extraction and sentiment classification. However, these algorithms encounter problems like data sparsity when the training dataset is small.


Later, as training datasets grew [10][11], researchers started to use distributed representation methods with deep learning approaches [12][13], which do not require manual feature engineering and can deal with the above problems [14, 15]. These approaches automatically extract features from text and perform the appropriate analysis for many different problems in NLP. Deep learning has been applied successfully in many areas of NLP; it not only improves the accuracy and efficiency of prediction but also reduces the prediction time. 5G also provides an efficient application environment for deep learning based on big data [16, 17]. Furthermore, traditional deep learning approaches such as CNN [18] and RNN [19] use a max pooling layer to find the most significant word features. The max pooling approach is likely to help text prediction and classification tasks achieve better results; however, it offers no principled way to deal with significant features, since the max pooling layer simply selects the features with the maximum activation value. Meanwhile, the combination of label-less learning [20], the cognitive information concept [21, 22], 5G, and deep learning is promoting the further development of artificial intelligence [23, 24].

To deal with this issue in a more principled way, the idea of an attention mechanism in deep learning was first proposed by Bahdanau et al. [25] for machine translation. An attention mechanism is a way to concentrate the model on the significant features by using a softmax inside the model. Generally, softmax is used at the output to generate the probability of the classes and categories. However, it is also possible to use softmax to normalize a set of numbers so that they sum to one, which effectively assigns a percentage weight to each piece of information and can then be used to steer the processing of the input data. More specifically, not all word features in a sentence representation contribute equally to the meaning of the text. We therefore use a self-attention approach to direct the model's attention to the word features that contribute most to that meaning.

Attention-based neural networks have proven successful in many problems of NLP and computer vision, such as image classification [26] and machine translation [27]. However, most work on attention mechanisms is based on the RNN and its variants. An attention-based RNN uses three kinds of input to evaluate the result at the current state, i.e., the current input given to the RNN, the representation of the left and right context computed bidirectionally, and an attentive weighted sum of the local-context hidden states. After the success of the attention mechanism in RNNs, a little work has also been done on CNNs with attention mechanisms to solve different problems in NLP. CNNs have proven beneficial both with and without attention: without attention for sentence modeling [28] [29], where the CNN learns word representations with a sliding window, and with attention for many problems such as sentence modeling [30], textual entailment [31], and machine translation [32].


Our motivation for using a recurrent neural network with a CNN-style self-attention mechanism is inspired by the individual successes of the CNN and of the RNN with attention mechanisms in NLP.

In this study, we propose a new model based on a recurrent neural network with a CNN-style self-attention mechanism that combines the merits of both architectures in one model. First, we use a CNN to generate hidden state units (h_i) from the input representation. Then, over these hidden state units, we apply a self-attention mechanism to obtain the attention context vectors (C_i). Finally, both the hidden state units (h_i) from the CNN and the attention context vectors (C_i) are used together as input to the RNN. The final feature map from the RNN is used to perform the sentiment analysis task.

To evaluate the proposed model, we experiment with four variations of the model on three benchmark datasets; the details of the model variations are given in subsection 4.1. The experimental results show the effectiveness of the proposed model: it outperforms existing work on two of the three benchmark datasets and achieves state-of-the-art results compared to existing works.
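To make this pipeline concrete before the formal description in Section 3, the following is a minimal Keras sketch of the CNN, self-attention, and RNN arrangement (Keras and TensorFlow are the libraries used for the experiments in Section 5). The vocabulary size, sequence length, single convolution branch, and the per-position way the attention weights are formed here are simplifying assumptions made for illustration, not the exact configuration of the proposed model.

from tensorflow.keras import layers, models

max_len, vocab_size, embed_dim, n_classes = 51, 20000, 300, 2   # assumed sizes

inputs = layers.Input(shape=(max_len,), dtype="int32")
x = layers.Embedding(vocab_size, embed_dim)(inputs)                 # word embedding layer
h = layers.Conv1D(100, 5, padding="same", activation="relu")(x)     # CNN hidden states h_i
scores = layers.Dense(1)(h)                                         # unnormalised attention scores
weights = layers.Softmax(axis=1)(scores)                            # softmax over word positions
context = layers.Lambda(lambda t: t[0] * t[1])([h, weights])        # attentive context vectors C_i
merged = layers.Concatenate()([h, context])                         # h_i and C_i fed jointly to the RNN
r = layers.LSTM(100)(merged)                                        # RNN (LSTM) over the merged sequence
outputs = layers.Dense(n_classes, activation="softmax")(layers.Dropout(0.5)(r))

model = models.Model(inputs, outputs)
model.compile(optimizer="adadelta", loss="categorical_crossentropy", metrics=["accuracy"])

In the experiments the attention is computed over either the convolution output or the max-pooled output, which gives the ATTConv and ATTPooling variants described in Section 4.1.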


The main contributions of the article are summarized as follows:

• We propose a new model based on a recurrent neural network with a CNN-style self-attention mechanism. Experimental results, analysis, and attention visualization show the effectiveness of the proposed model.

• Furthermore, we demonstrate four variations of the proposed model, i.e., ATTConv RNN-pre, ATTPooling RNN-pre, ATTConv RNN-rand, and ATTPooling RNN-rand. The experimental results show that ATTPooling RNN-pre and ATTPooling RNN-rand perform better than ATTConv RNN-pre and ATTConv RNN-rand, respectively.

• We apply all model variations to three benchmark datasets (Movie Reviews, Stanford Sentiment Treebank, and Treebank-2) to test the effectiveness of the proposed models and achieve better results on two of the three datasets.

The remainder of the article is organized as follows. Section 2 reviews the relevant existing work. Section 3 explains the proposed model and the evaluation methods. Section 4 describes the model variations, the experimental setup, and the benchmark datasets used to measure the performance of the models. Section 5 discusses the analysis of the results. Finally, Section 6 concludes the paper.

2. Related Work

2.1. Sentiment Analysis Task

Existing work on sentiment analysis has used various methods such as data mining techniques [33], knowledge-based approaches [34, 35], and machine learning algorithms [36]. Later, with the rapid growth of deep learning, classical deep learning algorithms such as the convolutional neural network (CNN) and the recurrent neural network (RNN) achieved remarkable success in many areas such as computer vision [37], natural language processing [38], and speech recognition [39]. These technologies enable the rapid development of application scenarios such as intelligent medicine [40, 41] and intelligent healthcare [42, 43]. Typical CNNs and RNNs have already been used for sentiment analysis and have achieved remarkable results. Yoon Kim [29] used a CNN to perform sentence-level classification with pre-trained word vectors. Recursive deep learning for sentiment classification was studied by Socher et al. [44]. Tai et al. [45] proposed a generalization of the long short-term memory (LSTM) network, named Tree-LSTM, for predicting semantic relatedness and for sentiment classification of text. Wang et al. [46] proposed an approach for obtaining the benefit of dropout training without sampling. Y. Ma et al. [47] proposed a depth-weighted method to visualize emotion recognition in audio.

to visualize emotional recognition in audio.

pro

training without sampling.Y. Ma et. al.[47] proposed a depth-weighted method

After that, an exclusive idea of research by combining two networks begun. 120

Very few researchers used two networks together in one model for sentiment classification. Kim et. al.[48] proposed a neural language model that receives

re-

input at the character level and predict output at the word level. Their model derived by combining CNN, highway network, and LSTM. In his model output from the highway network and CNN over characters is used as input for LSTM 125

RNN-LM (recurrent neural network language model). Zhang et. al.[49] and Vu et al.[50] both utilized two distinct neural networks together and combine

urn al P

CNN and RNN for relation classification. These models achieved better results compared to existing models. Furthermore, Hassan et. al.[51] put forward an end-to-end bottom-up architecture by combining CNN and RNN for sentiment 130

analysis. The present sequence architecture in their paper and used word embedding as an input to the convolutional layer, which learned features maps by using the window of different sizes and various weights. Then obtained output from the convolutional layer is used as an input to the LSTM layer, where LSTM learned long-term dependencies from these feature maps to generate sentence

135

level representation. Liu et. al. [52] proposed attention based gated convolutional neural network for sentiment analysis. Besides this, they proposed a new activation function named them NLReLU.

Jo

2.2. Attention based CNN

In CNN, little work was done based on attention. Gehring et al.[32] pro-

140

posed the attention-based CNN model for machine translation. They used a hierarchical convolutional layer for both encoder and decoder. Each encoder 6

Journal Pre-proof

hidden units be queried at nth decoder layer by output hidden units of convolution, after than weighted sum of the total encoder hidden units will be added

145

of

to the decoder hidden units, and these updated hidden units will be received by n+1 convolutional layer. Hence, the above attention-based model needs the multi convolutional layer to pass the weighted context from encoder to decoder

pro

side, which shows that their attention played a role after convolution. There are some other CNN architecture[53] [54] which used attention at the pooling layer and named it attentive pooling. The architecture used in both paper uses 150

two input sentence in parallel and sets of the hidden unit for every sentence is computed by convolution. Thus, every sentence learns attentive weights for

re-

each hidden unit based on the current hidden unit with all hidden unit in other sentences. After that resulted representation of each sentence will obtain by weighted mean pooling or attentive pooling. 155

2.3. Attention based RNN

urn al P

The foremost attention-based model was proposed by Graves et al.[55]. They used differentiable attention schemes which enforce RNN to look over distinct parts of the input. This scheme has been explored by many researchers to solve the problem of in text analytics such as neural machine translation[56], 160

text generation [57], document reconstruction, and summarization[58], machine comprehension[59]. It is further explored in relation classification such as textual entailment[60] and answer selection[61].

Gazpio et al.[62] proposed a self-attention based supervised model to learn word embedding. This method can calculate attention weight at each position 165

in the window for every word. A sentence level attention is proposed by Liu et al.[63]. They used mean pooling as an attention context over LSTM states and re-weighted pooling vector of the sentence. Li et al.[64] proposed similar self-

Jo

attention based factoid model for question encoding. Cheng et al.[65] proposed LSTMN model which is intra-sentence based attention scheme and further used

170

by Parikh et al.[66]. Proposed scheme calculates the attention vector during the recurrent process for every hidden unit, which is targeted on lexical correlations 7

Journal Pre-proof

between adjacent words. In contrast to these, our proposed model uses the recurrent neural network

175

of

with CNN-style self-attention mechanism. In the proposed model, RNN will use hidden state units and attention vectors generated from standard convolution

pro

or max pooling together.

3. Proposed Model

This section describes the proposed model, which consists of an input layer, a word embedding layer, CNN layers, a CNN-style self-attention mechanism, an RNN layer (2) with CNN-style self-attention, a full-connection layer, and a softmax output layer. The overall architecture of the proposed model is shown in Figure 1.

3.1. Input layer

To start, assume that the input layer of the model receives text data as X(x1, x2, ..., xn), where x1, x2, ..., xn are the n input words and each word has dimension m. Each word vector is thus represented in an R^m dimensional space, and the dimension of the text data is R^{m x n}.

3.2. Word embedding layer

To learn word embeddings, let the vocabulary size for the text representation be d. The two-dimensional word embedding matrix is then represented as A^{m x d}. The input text X(x_i), where i = 1, 2, ..., n and X is in R^{m x n}, is passed from the input layer to the embedding layer to generate word embedding vectors from the raw corpus text. We use a distributed representation method to learn the word embeddings and implement it through the word2vec (3) model [67]. Hence, the actual input fed into the model is the representation of the input text X(x1, x2, ..., xn) in R^{m x n} as numerical word vectors.

2. In the experiments we use a more powerful variant of the RNN layer, i.e., LSTM.
3. https://code.google.com/p/word2vec/

Figure 1: Overall architecture of the proposed framework. The dotted arrow lines indicate that the hidden-unit representation is passed either from the convolution layer to merge layer 2 or from the max pooling layer to merge layer 1.

Here, x1, x2, ..., xn are the n word vectors in the embedding vocabulary, each represented in an R^m dimensional space, where the word dimension m is chosen during word-embedding learning.

3.3. CNN layers


3.3.1. Convolution layer

The convolution layer performs the convolution operation over the word-vector representation received from the embedding layer as a successive sequence of rows. Let s words be selected at a time, with a weight matrix W of dimension W in R^{s x m}, to perform the convolution operation as follows:

h_i = f(X_{i:i+s-1} * W[i] + b_i)    (1)

Here, h_i in R^{n-s+1} is the feature map generated by repeatedly taking s words at a time, f is the non-linear ReLU function, and b_i is the bias term.
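As a concrete illustration of Eq. (1), the sketch below computes the scalar-valued feature map of a single convolution filter in NumPy. The toy dimensions are assumptions; a real layer uses many filters (100 feature maps per layer in Section 4.3.1).

import numpy as np

def conv_feature_map(X, W, b, s):
    """Eq. (1): h_i = f(X_{i:i+s-1} * W + b) with f = ReLU.
    X: (n, m) matrix of word vectors, W: (s, m) filter, b: scalar bias."""
    n, m = X.shape
    h = np.empty(n - s + 1)
    for i in range(n - s + 1):
        window = X[i:i + s]                        # s consecutive word vectors
        h[i] = max(0.0, np.sum(window * W) + b)    # ReLU non-linearity
    return h

# toy usage: 10 words, 300-dim embeddings, window size s = 5
X = np.random.randn(10, 300)
W = np.random.randn(5, 300)
print(conv_feature_map(X, W, b=0.1, s=5).shape)    # (6,) = n - s + 1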


3.3.2. Max pooling layer

We then perform the max pooling operation over the feature map generated by the convolution, which halves the feature map by selecting the features with the maximum activation value, as follows:

p_i = max(h_i)    (2)

Here, p_i in R^{(n-s+1)/2} is the new feature map. Following the literature [29], we use two convolution layers in parallel with filter sizes 4 and 5. We then concatenate the features of both convolution layers at merge layer 2 and, after performing the max pooling operation, at merge layer 1, as shown in Figure 1.
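The two parallel convolution branches and the two merge layers of Figure 1 can be sketched in Keras as follows. The input shape is an assumed example; merged_conv and merged_pool correspond to merge layer 2 and merge layer 1, respectively.

from tensorflow.keras import layers, Input

x = Input(shape=(51, 300))                       # embedded sentence: n words x m dims (assumed sizes)
c4 = layers.Conv1D(100, 4, padding="same", activation="relu")(x)
c5 = layers.Conv1D(100, 5, padding="same", activation="relu")(x)
merged_conv = layers.Concatenate()([c4, c5])     # merge layer 2: concatenated convolution features
p4 = layers.MaxPooling1D(pool_size=2)(c4)        # Eq. (2): keep the max of every two activations
p5 = layers.MaxPooling1D(pool_size=2)(c5)
merged_pool = layers.Concatenate()([p4, p5])     # merge layer 1: concatenated pooled features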

3.4. CNN-style self-attention mechanism


After obtaining the hidden state units h1, h2, ..., hn from the CNN, we calculate the attentive context vectors based on them. For a simple mathematical explanation, let H = (h1, h2, ..., hn); the feature map H then has shape R^{n x t}, where t is the dimension of the hidden states.


Figure 2: Illustration of the hidden units and attentive context vectors used together at the RNN. X_i are the input representations and C_i are the attentive context vectors calculated from the hidden units h_i generated by the standard convolution.


Figure 3: Illustration of the hidden units and attentive context vectors used together at the RNN. X_i are the input representations and C_i are the attentive context vectors calculated from the hidden units h_i generated by max pooling.


Self-attention requires all hidden units as input to compute the attention vector. The attention context vector C_i for each word x_i is therefore calculated as a weighted sum over all hidden units of the snippet, as follows:

alpha_i = tanh(W_x * H^T + b_x)    (3)

C_i^x = exp(alpha_i) / sum_i exp(alpha_i)    (4)

where W_x in R^{m x n} is a weight matrix, b_x is the bias term, and alpha_i is the attention weight.


For ATTConv RNN, the hidden units generated by the convolution layer are used to compute the attention vector, as shown in Figure 2. For ATTPooling RNN, the hidden units generated by the max pooling layer are used to compute the attention vector, as shown in Figure 3.


This attentive context vector is used with the RNN and is learned jointly during the training process. The context vector is expected to focus on the word features that play an essential role in prediction and contribute to the meaning of the text.
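A minimal NumPy sketch of one reading of Eqs. (3)-(4) is given below: every hidden state receives a score, the scores are softmax-normalised, and each hidden state is weighted accordingly to form its attentive context vector. The shapes and the per-position form of C_i are our assumptions, since the equations leave some freedom of interpretation.

import numpy as np

def attention_context(H, Wx, bx):
    """Score each hidden state (Eq. 3), normalise with softmax (Eq. 4),
    and weight the hidden states to obtain the attentive context vectors C_i.
    H: (n, t) hidden states from the convolution or the max pooling layer."""
    alpha = np.tanh(H @ Wx + bx)                    # Eq. (3): one score per position
    weights = np.exp(alpha) / np.exp(alpha).sum()   # Eq. (4): softmax over positions
    C = weights[:, None] * H                        # per-word attentive context C_i
    return C, weights

H = np.random.randn(7, 100)                         # 7 positions, 100-dim hidden states
C, w = attention_context(H, Wx=np.random.randn(100), bx=0.0)
print(C.shape, w.sum())                             # (7, 100) 1.0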

3.5. RNN layer with CNN-style self-attention mechanism

After computing the context vectors, we use the hidden units h_i from the CNN and the context vectors C_i together at the RNN to generate higher-level hidden state units by processing them sequentially. Hence, the final feature map of the RNN is obtained by jointly learning from the hidden state units of the CNN and the attentive context vectors C_i (as shown in Figure 1), as follows:

h(t)^1_{i,j} = w_1[i](h(t)_j, h(t-1)_j, C(t)_{i,j}) + b_1    (5)

For simplification, we can rewrite the above equation as:

h(t)^1_{i,j} = w_1[i](h(t)_j, h(t-1)_j) + w_2[i] * C(t)_{i,j} + b_1,   i = 1, 2, ..., l and j = 1, 2, ..., n    (6)

where w_1 in R^{l x nm} and w_2 in R^{l x nm} are the weight matrices, b_1 is the bias term, h(t) is the hidden-unit representation at time step t, h(t-1) is the hidden-unit representation at time step t-1, and C_i is the attentive context vector; the final feature map of the RNN is obtained from h^1(t) as a new feature map. We use a soft self-attention approach here so that the gradient can propagate through both the convolution and the attention-context-vector components in parallel. To model non-linear relationships among features, we use the rectified linear unit (ReLU) as the activation function:

h(t)^2_{i,j} = max(0, h(t)^1_{i,j})    (7)

3.6. Full connection layer

After that, we use the full connection layer. The full connection operation is performed over the feature map h(t)^2_{i,j} obtained from the RNN after the non-linearity, as follows:

h^3 = w_4 * h^2 + b_4    (8)

where h^2 is the feature map received from the RNN after the non-linearity, h^3 is the feature map obtained from the full connection operation, and w_4 and b_4 are the weight and bias of the full connection layer.

3.7. Softmax output layer

Finally, the feature map from the full connection layer is passed to a softmax classifier, as shown in Figure 1, to perform sentiment classification; the result is a probability distribution over all categories:

Y_i = exp(h_i) / sum_{j=1}^{c} exp(h_j)    (9)

where Y_i is the probability of the i-th class over all classes. To measure the disparity between the real sentiment and the predicted sentiment of a sentence in the corpus text, we use cross-entropy as the loss function:

loss = - sum_{s in T} sum_{i=1}^{s'} Y_i^t(C) log(Y_i(C))    (10)

where Y^t(C) is the element corresponding to the real sentiment of the sentence, Y(C) is the element corresponding to the predicted sentiment of the sentence, s' is the number of sentiment classes, and T is the training corpus text.

4.1. Model variations 275

re-

4. Model variations, Datasets and Experiment Setup

We experimented with several variations of the model defined as follows: • ATTPooling RNN-pre: A model based on the recurrent neural network

urn al P

with pooling-style self-attention mechanism trained with pre-trained wordvectors from word2vec. Here self-attention mechanism is used over the hidden state units (hi ) generated from max pooling layer as shown in

280

figure 3.

• ATTConv RNN-pre: A model based on the recurrent neural network with convolution-style self-attention mechanism trained with pre-trained word-vectors from word2vec. Here self-attention mechanism is used over the hidden state units (hi ) generated from standard convolution layer as

285

shown in figure 2.

• ATTPooling RNN-rand: A model based on the recurrent neural network with pooling-style self-attention mechanism trained with randomly

Jo

initialized word-vectors. Here self-attention mechanism is used over the hidden state units (hi ) generated from max pooling layer as shown in

290

figure 3.

15

Journal Pre-proof

• ATTConv RNN-rand: A model based on the recurrent neural network with convolution-style self-attention mechanism trained with randomly

of

initialized word-vectors. Here self-attention mechanism is used over the hidden state units (hi ) generated from standard convolution layer as shown in figure 2.

pro

295

4.2. Datasets

We experimented with three benchmark datasets to evaluate the proposed model. Detail statistics of the datasets are given in Table 1.

• MR4 (Movie Reviews): This dataset contains a total of 10662 positive and negative movie reviews from networks with one sentence per review.

re-

300

Challenge is to predict positive and negative sentiment associated with sentences[68].

• SST15 (Stanford Sentiment Treebank): This dataset is an extension 305

urn al P

of MR dataset with training, test, and deviation set splits provided with labeled into five classes as very positive, very negative, positive, negative, and neutral[44].

• SST26 : Similar to SST1 but with neutral reviews removed and binary labeled.

4.3. Training Procedure and Hyperparameters 310

4.3.1. Hyperparameters

For all dataset, we used activation function rectified linear units, window size 4 and 5 for convolution layers, feature maps 100 for each CNN layer, RNN layer output size 100, dropout 0.50 at full connection layer, and mini-batch size

Jo

50. The training was performed through a stochastic gradient descent algorithm 4 https://www.cs.cornell.edu/people/pabo/movie-review-data/ 5 https://nlp.stanford.edu/sentiment/ 6 https://nlp.stanford.edu/projects/glove/

16

Journal Pre-proof

Table 1: Statistical details of the datasets after tokenization used in experiment. c: Number of classes, Al :Average sentence length, Ml :Maximum sentence

of

length, V :Dataset size, N :Vocabulary size, VT rain :Training set, VT est :Test set, VV alid :Validation set, CV indicates that there was no standard train/test/dev

c

Al

Ml

V

N

VTrain

VTest

VValid

MR

2

20

51

10662

18983

8655

1064

CV

SST1

5

18

56

11855

18784

8544

2210

1101

SST2

2

19

56

9613

18784

6920

1821

872

re-

315

Dataset

pro

split for dataset so 10-fold Cross validation was used

with the Adadelta update method[69]. All the parameters values are selected through the grid-search method.

urn al P

4.3.2. Pre-trained Word Vectors

We initialized word vectors with pre-trained set of word vectors obtained from word2vec, an unsupervised natural language algorithm to obtain vector 320

representations for words which are used in the absence of large training set to improve performance of the model. Publically available word2vec7 . vectors with word dimension 300 were used in experiments which were trained on 100 billion words from Google news data. Word vectors were trained by using the continuous bag-of-words framework. Words which were absent in pre-train set

325

were initialized randomly.

5. Experiment Results and Discussion Implement the proposed model is done by using well-known library Ten-

Jo

sorflow and Keras. We compare the running time for model variations on the system (256GB RAM and 16GB GPU). Here, we set 30 epochs for all model 7 https://code.google.com/p/word2vec/

17

of

500

414

410

pro

400 300 232 186.5 160

200

209

182

154

150

121

99.3

100 0

ATTPooling RNN-pre

93

MR SST1 SST2

re-

Running Time of the Models in Minuts

Journal Pre-proof

ATTConv RNNpre

ATTPooling RNN-rand

ATTConv RNNrand

urn al P

Models

Figure 4: Running time of all model variations. From these statistics, we can tell that there is not much difference in running time with pre-trained and randomly initialized models on all the datasets. That means ATTpooling RNN-pre and ATTConv RNN-pre consume approximate same time to complete execution as ATTpooling RNNrand and ATTConv RNN-rand. While ATTpooling RNN takes half of the times than ATTConv RNN on SST1 and SST2 datasets for both randomly initialized and pre-trained. Thus running speed of ATTPooing RNN is 2 times fast than ATTConv RNN for both randomly initialized and pre-trained on SST1 and SST2 datasets. But

Jo

for MR dataset running speed of all models are almost the same.

18

Journal Pre-proof

330

variations to compare with each other. Figure 4 shows the details of the running time of all models. Subsection 4.3 describes the parameter settings and training

of

method of the model’s parameters and word vectors. Three benchmark datasets are used in the experiment to measure the performance of the model; results are listed in Table 2. As we expected, our variant models (ATTPooing RNNrand and ATTConv RNN-rand) with randomly initialized word-vectors does not

pro

335

perform better than pre-trained model variations (ATTPooing RNN-pre and ATTConv RNN-pre). While we got expected results by using pre-trained word vectors, even pre-trained model variants with attentive convolution (ATTConv RNN-pre ) is achieve better results performance again some sophisticated natural language models which utilized complex structure such as[44]. The achieved

re-

340

results suggest that pre-trained vector performs well and better extracted the features than randomly initialized. There can be an accuracy gain of 2-3% approximate by using pre-trained vectors than random initialized. Experiment results validate that the proposed model performs better than existing state-of-art methods on two out of three datasets, and achieve compet-

urn al P

345

itive results on one dataset. Especially, as we expected, our model ATTPooling RNN-pre does perform better than ATTConv RNN-pre over all the datasets. ATTPooling RNN-pre achieves remarkable results and again in increase accuracy by 1.2% on MR, 0.73% on SST1, and 1.77% on SST2 dataset than ATTConv 350

RNN-pre. These results state that our architecture using attention mechanism over selected words by max pooling is more beneficiary than using attention over all word features from convolution on prediction datasets. 5.1. Analysis of results on MR dataset

The challenge of this dataset is to predict sentiment from binary labeled sentence i.e. positive or negative which contain one sentence per review. Reference [29] perform experimentation with several variations of the standard CNN mod-

Jo

355

el by doing fine tuning in word embedding and using a multichannel approach. Compared to his results our model accuracy is better and achieves 2.82% gain approximate. There is another work[70] which used bow SVM and same non19

Journal Pre-proof

of

Table 2: Results of the proposed model against existing works. Bold represent the best results. Year

Recursive Neural Tensor Network[44]

2013

-

45.7

85.4

Word vector averages[44]

2013

-

32.7

80.1

Fast Dropout[46]

2013

79.1

-

-

DCNN[28]

2014

-

48.5

86.8

Non-static[29]

2014

81.5

48.0

87.2

re-

Multi-channel[29]

MR

SST1

pro

Models

SST2

2014

81.1

47.4

88.1

2015

-

48.0

-

2015

0.79

-

-

2015

-

46.4

84.9

CNN-GRU-word2vec[73]

2016

82.28

50.68

89.95

bowwvSVM[70]

2016

79.7

43.2

83.3

Non-static GloVe+word2vec CNN[70]

2016

81.0

45.9

85.6

ConvLstm[74]

2017

-

47.5

88.3

CRDLM [51]

2018

-

48.8

89.2

AGCNN-SELU-3-channal[52]

2018

81.3

49.4

87.6

AGCNN-NLReLU-3-channal[52]

2018

81.9

49.4

87.4

ATTPooling RNN-pre

2019

84.32

51.14

89.62

ATTConv RNN-pre

2019

83.12

50.41

87.85

ATTPooling RNN-rand

2019

79.25

48.32

86.09

ATTConv RNN-rand

2019

78.21

46.36

84.90

Tree LSTM1[71] Tree bi-LSTM[72]

Jo

urn al P

LSTM[45]

20

Journal Pre-proof

360

static CNN model as [29] but trained with combined Glove and Word2vec and reported lesser accuracy than above. But compared to [70] we achieve 3.32%

of

gain in accuracy approximate. Reference [73] proposed a joint architecture by using CNN and RNN; reported much better results than previous works. This shows that combining CNN and RNN gives better results than single CNN or RNN architecture in the prediction task. The latest works[52] used attention

pro

365

with CNN and further achieve remarkable results which is better than existing works except for the joint architecture model. This latest work substantiates the benefits of attention in the prediction task. Thus in the proposed model we utilized CNN and RNN together with attention mechanism and achieve accuracy gain 2.42% than the latest CNN attention model[52] and 2.04% than joint

re-

370

architecture[73]. In overall proposed model architecture is most modern of its type and achieves the best results among all existing works on MR dataset. 5.2. Analysis of results on SST1 dataset

375

urn al P

SST1 dataset is fine-grain extension of MR dataset which labeled into five classes as very positive, very negative, positive, negative, and neutral. Early existing works on SST1 dataset based on different architectures such as RNTN[44], Deep CNN[28], Non-static and Multichannel CNN[29], LSTM[45], Tree-LSTM[71], and GloVe+word2vec CNN[70] reported the accuracy between 46% to 48% approximately except the Word vector averages[44]. Word vector 380

average reported much worst results with accuracy 32.7%. Compared to early existing works our model achieves more than 2% gain in accuracy approximate. Later works such as CNN-GRU[73] and CRDLM[51] used joint architecture approach; reported 50.68% and 48.8% accuracy results respectively which is better compared to early existing works. But other joint architecture ConvLstm[74] could not achieve better results than all early existing works but still produced strong competitive results. Moreover, the latest work[52] based on attention

Jo

385

mechanism with CNN reported poor results with accuracy 49.4%, than the joint architecture model but still get better results than early existing works. However, our model not only performs better than early existing works and joint 21

Journal Pre-proof

390

architecture models but also than latest CNN based attention model[52]. The proposed performs better than the latest existing models and achieves 0.46% and

of

1.74% accuracy gain than joint architecture[73] and gated-attention model[52] respectively. In general, the proposed model achieves state-of-art results on

395

5.3. Analysis of results on SST2 dataset

pro

SST2 dataset also.

SST2 is the subset of SST1 dataset with binary labeled and without neutral reviews. On SST2 dataset recursive and recurrent model such as RNTN[44] and LSTM[45] achieve accuracy between 84-86% approximate. While all variation of standard CNN model(DCNN, Non-static CNN, and Multichannel CNN) achieve better results than existing recursive and recurrent model with accu-

re-

400

racy more than 86% except GloVe+word2vec CNN[70] which got 85.6% accuracy. Compared to existing recursive, recurrent, and CNN models our model achieves better results with accuracy more than 89%. The proposed model also

405

urn al P

got gain in accuracy 2.02% when compared to latest gated-attention model[52]. But On this dataset joint architecture models (CNN-GRU[73] and CRDLM[51]) performed better compared to all existing models. However, compared to the proposed model only one joint model (CNN-GRU) got little better results with 0.33% gain in accuracy. That means our model not giving state-of-art results on SST2 dataset but still giving better results than many existing works including 410

the latest one and a strong competitor of joint architecture models. Furthermore, to validate the effectiveness of the proposed model, we do a comparative study with strong benchmark attention models. To understand the inner and deep behavior of the model we visualize the attentive context over test dataset.

5.4. Comparison with strong benchmark attention models

Jo

415

5.4.1. Comparison with attention-based CNN Little work has done on attention-based CNN, and among them almost work

in CNN is based on attentive pooling[54] but the proposed model is based on 22

pro

of

Journal Pre-proof

emerges as something rare an issue movie thats so honest and keenly observed that it doesnt feel like one offers that rare combination of entertainment and education

most of the problems with the film dont derive from the screenplay but rather the mediocre performances by most of the actors involved

re-

curling may be a unique sport but men with brooms is distinctly ordinary (a) ATTConv RNN-pre.

urn al P

emerges as something rare an issue movie thats so honest and keenly observed that it doesnt feel like one offers that rare combination of entertainment and education most of the problems with the film dont derive from the screenplay but rather the mediocre performances by most of the actors involved curling may be a unique sport but men with brooms is distinctly ordinary (b) ATTPooling RNN-pre.

Figure 5: Attention visualization as heat map for four sentences from MR dataset. In both (a) and (b) first two sentence are with positive

Jo

sentiment and last two are with negative sentiment.

23

Journal Pre-proof

RNN with CNN-style self-attention mechanism. In this study, the process of 420

computing attentive context vector is the same, but it differs in two ways: (i)

of

Proposed model used both the convolutional and recurrent neural network. Here RNN computes the high-level representation of text by using hidden states units generated from standard convolution/max pooling. (ii) We compute attentive

425

pro

context vectors from hidden state units generated from CNN and then used them with hidden state units together at RNN. 5.4.2. Comparison with attention-based RNN

The significant amount of work has done on RNN and its variant with attention mechanism[55]. Proposed model has benefits over attention based RNN

430

re-

in the following ways: (i) In case of attentive RNN, attentive context is calculated based on the RNN itself; but in the proposed model, attentive context is calculated based on hidden state units generated from standard convolution and further used at RNN to learn high-level representation. (ii) As we can see

urn al P

from figure 1, the proposed model is composed of both CNN and RNN, thus it gets benefits of both. CNN learns the robust local feature by using sliding 435

convolution, and RNN process these feature sequentially with attentive context vector generated from CNN itself. 5.5. Attention visualization

In this section, we visualize attentive context over sentences to compare which word features are being attended by the model when doing predictions for 440

both ATTConv RNN-pre and ATTPooling RNN-pre model. The same sentences are used as input for both variations of the model so that we can infer the difference between ATTConv RNN-pre and ATTPooling RNN-pre. Parameters setting of the model is kept same as during training. After train-

Jo

ing on all dataset and saving the models; we randomly selected four sentences 445

(two positives and two negatives) from test dataset and compare both models by visualizing the attentive context as a heat map. Since both model architectures used is the same for all dataset, which makes visualizing on all dataset is

24

Journal Pre-proof

quite redundant. Thus we only visualized for MR dataset. This four sentence visualization for both model architectures reflects the attention situation for all dataset.

of

450

Indicating figure 5, from sentiment perspective we can tell that for both model architectures (ATTPooling RNN-pre and ATTConv RNN-pre) the at-

pro

tentive context vector focus more on positive part (”emerge as something” and ”so honest and keenly observed”, and ”entertainment and education”) in posi455

tive sentence and similarly focus on negative part (”most of the problems with the film dont derive”, and ”distinctly ordinary”) in negative sentence to predict the actual sentiments.

re-

However, from the model perspective, there is a little variation in attention focus toward important parts. In first positive sentence ATTPooling RNN-pre 460

(figure 5(b)) model focus on positive parts (”emerge as something” and ”so honest and keenly observed”) with similar intensity while ATTConv RNN-pre (figure 5(a)) does not focus on ”emerge as something” part as focus on ”so honest

urn al P

and keenly observed”. But in the second positive sentence, both architectures focus with similar intensity towards similar words. In the first negative sentence 465

the part ”mediocre performances” is not getting as much attention by ATTConv RNN-pre (figure 5(a)) as its get by ATTPooling RNN-pre (figure 5(b)); while with ground truth perspective, to predict negative sentiment from first negative sentence, ”mediocre performances” should be focus as much negative; which is done by ATTPooling RNN-pre (figure 5(b)) architecture. Thus, in conclusion,

470

we can say that ATTpooling RNN-pre performing better than ATTConv RNNpre not only in terms of accuracy but also in getting attention over important word features related to the corresponding sentiment.

Jo

6. Conclusion

This article proposed a new model based on the recurrent neural network

475

with CNN-style self-attention by combining the convolutional and recurrent neural network with the self-attention mechanism for sentiment analysis task.

25

Journal Pre-proof

The evaluation process of the model performed over three benchmark datasets. The proposed model can extract not only local contextual features by CNN

480

of

and high-level temporal features by RNN but also focuses more attention on important word features. Analysis of results with baseline methods, comparative study, and attention visualization demonstrates the effectiveness of the model.

pro

Moreover, performed experiments indicate that the model achieves better results than existing state-of-the-art works on two out of three datasets in sentiment analysis task with accuracy 84.46%, 51.2%, and 89.6% on MR, SST1, and SST2 485

dataset respectively. In the future study, we will test our approach in other

References

re-

tasks of natural language processing.

[1] H. Jiang, C. Cai, X. Ma, Y. Yang and J. Liu, “Smart Home Based on WiFi Sensing: A Survey,” IEEE Access, vol. 6, pp. 13317-13325, 2018. [2] Y. Zhang, et al., “TempoRec: Temporal-Topic Based Recommender for So-

urn al P

490

cial Network Services,” Mobile Networks and Applications, Vol. 22, No. 6, pp. 1182-1191, 2017.

[3] K. Lin, J. Song, J. Luo, J. Wen, G. Ahmed, “GVT: Green Video Transmission in the Mobile Cloud Networks,” IEEE Transactions on Circuits & 495

Systems for Video Technology. vol. 27, no. 1, pp. 159-169, Mar. 2016. [4] J. Chen, C. Wang, Z. Zhao, K. Chen, R. Du, G. Ahn, “Uncovering the Face of Android Ransomware: Characterization and Real-Time Detection,” IEEE Transactions on Information Forensics and Security, vol. 13 no. 5, pp. 12861300, 2018.

[5] X. Ma, et al., “Modeling multi-aspects within one opinionated sentence simultaneously for aspect-level sentiment analysis,” Future Generation Com-

Jo

500

puter Systems, Vol. 93, pp. 304-311, April 2019.

[6] S. Kim, E. Hovy, “Automatic detection of opinion bearing words and sentences,” In: Proceedings of the IJCNLP, pp. 61-66, 2005. 26

Journal Pre-proof

505

[7] S. Kim, E. Hovy, “Automatic identification of pro and con reasons in online reviews,” In: Proceedings of the COLING/ACL, pp. 483-490, 2006.

of

[8] K. Lin, J. Luo, L. Hu, M. Shamim Hossain, A. Ghoneim. “Localization based on Social Big Data Analysis in the Vehicular Networks,” IEEE Transactions

510

pro

on Industrial Informatics, vol. 13, no. 4, pp. 1932-1940, Aug. 2017.

[9] B. Pang, L. Lee, S. Vaithyanathan, “Thumbs up? sentiment classification using machine learning techniques,” In: Proceedings of the EMNLP, pp. 79-86, 2002.

[10] M. Chen, S. Mao, Y. Liu, “Big data: A survey,” Mobile Netw. Appl., vol.

515

re-

19, no. 2, pp. 171-209, Apr. 2014.

[11] M. Usama, M. Liu, M. Chen, “Job schedulers for Big data processing in Hadoop environment: testing real-life schedulers using benchmark program-

urn al P

s,” Digital Communications and Networks, vol. 3 no. 4, pp. 260273, 2017. [12] M. Chen, Y. Hao , K. Hwang, L. Wang, L. Wang, “Disease Prediction by Machine Learning Over Big Data From Healthcare Communities,” IEEE 520

Access, Vol. 5, No. 1, pp. 8869-8879, 2017. [13] M. Usama, B. Ahmad, J. Wan, M. S. Hossain, M. F. Alhamid, M. A. Hossain, “Deep Feature Learning for Disease Risk Assessment Based on Convolutional Neural Network With Intra-Layer Recurrent Connection by Using Hospital Big Data,” IEEE Access, vol. 6, pp. 67927-67939, 2018.

525

[14] K. Lin, C. Li, D. Tian, A. Ghoneim, M. Shamim Hossain, S. Amin. “Artificial-Intelligence-Based Data Analytics for Cognitive Communication in Heterogeneous Wireless Networks,” IEEE Wireless Communications, vol.

Jo

26, no. 3, pp. 83-89, Jun. 2019.

[15] J. Chen, K. He, Q. Yuan, G. Xue, R. Du, L. Wang, “Batch Identification

530

Game Model for Invalid Signatures in Wireless Mobile Networks,” IEEE Transactions on Mobile Computing, vol. 16 no. 6, pp. 15301543, 2017.

27

Journal Pre-proof

[16] M. Chen, J. Yang, J. Zhou, Y. Hao, J. Zhang, C. Youn, “5G-Smart Diabetes: Towards Personalized Diabetes Diagnosis with Healthcare Big Data

535

of

Clouds,” IEEE Communications, Vol. 56, No. 4, pp. 16-23, April 2018. [17] K. Lin, M. Chen, J. Deng, M. Hassan, G. Fortino “Enhanced Fingerprinting and Trajectory Prediction for IoT Localization in Smart Buildings,” IEEE

pro

Transactions on Automation Science and Engineering, vol.13, no. 3, pp. 1294 - 1307, Apr. 2016.

[18] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, “Gradient-based learning ap540

plied to document recognition,” In Proceedings of the IEEE, pp. 2278-2324,

re-

1998.

[19] J. L. Elman, “Finding structure in time,” Cognitive Science, vol. 14, no. 2, pp. 179-211, 1990.

[20] M. Chen, Y. Hao, “Label-less Learning for Emotion Cognition,” IEEE Transactions on Neural Networks and Learning Systems, DOI:

urn al P

545

10.1109/TNNLS.2019.2929071, 2019. [21] M.

Chen,

mation

Y.

Hao,

Measurements:

H.

Gharavi,

A

New

V.

Leung,

Perspective,”

“Cognitive Information

InforSci-

ence,https://arxiv.org/abs/1907.01719, 2019. 550

[22] M. Chen, F. Herrera, K. Hwang, “Cognitive Computing: Architecture, Technologies and Intelligent Applications,” IEEE Access, Vol. 6, pp. 1977419783, 2018.

[23] Y. Zhang, et al., “Edge Intelligence in the Cognitive Internet of Things: Improving Sensitivity and Interactivity,” IEEE Network, vol. 33, no. 3, pp. 58-64, 2019.

Jo

555

[24] Y. Zhang, et al., “COCME: Content-Oriented Caching on the Mobile Edge for Wireless Communications,” IEEE Wireless Communications, vol. 26, no. 3, pp. 26-31, June 2019.

28

Journal Pre-proof

[25] D. Bahdanau, K. Cho, Y. Bengio, “Neural machine translation by jointly 560

learning to align and translate,” In Proceedings of ICLR, 2015.

of

[26] Y. Peng, X. He and J. Zhao, “Object-Part Attention Model for FineGrained Image Classification,” in IEEE Transactions on Image Processing,

pro

vol. 27, no. 3, pp. 1487-1500, March 2018.

[27] B. Zhang, D. Xiong, J. Su, “Neural Machine Translation with Deep Atten565

tion,” in IEEE Transactions on Pattern Analysis and Machine Intelligence. doi: 10.1109/TPAMI.2018.2876404, 2018.

[28] N. Kalchbrenner, E. Grefenstette, P. Blunsom, “A convolutional neural

re-

network for modelling sentences,” In Proceedings of ACL, pp. 655-665, 2014. [29] Y. Kim, “Convolutional neural networks for sentence classification,” In 570

Proceedings of EMNLP, pp. 1746-1751, 2014.

urn al P

[30] M. Er, Y. Zhang, N. Wang, P. Mahardhika, “Attention pooling-based convolutional neural network for sentence modeling,” Information Sciences, Vol. 373, pp. 388-403, 2016.

[31] Z. Zhang, Y. Zou, C. Gan. “Textual sentiment analysis via three differ575

ent attention convolutional neural networks and cross-modality consistent regression,” Neurocomputing, Vol. 275, pp. 1407-1415, 2018. [32] J. Gehring, M. Auli, D. Grangier, D. Yarats, Y. Dauphin, “Convolutional sequence to sequence learning,” In Proceedings of ICML, pp. 1243-1252, 2017.

580

[33] S. Shayaa, et al., “Sentiment Analysis of Big Data: Methods, Applications, and Open Challenges,” IEEE Access, vol. 6, pp. 37807-37827, 2018.

Jo

[34] D. Pham, A. Le, “Learning multiple layers of knowledge representation for aspect based sentiment analysis,” Data & Knowledge Engineering, Vol. 114, pp. 26-39, 2018.

29

Journal Pre-proof

585

[35] He K.; Chen J., Du R., Wu Q., Xue G., Zhang X., “DeyPoS: Deduplicatable Dynamic Proof of Storage for Multi-User Environments,” IEEE Transactions

of

on Computers, vol. 65 no. 12, pp. 36313645, 2016. [36] A. Abdi, S. Shamsuddin, S. Hasan, J. Piran, “Machine learning-based multi-documents sentiment-oriented summarization using linguistic treatment,” Expert Systems with Applications, Vol. 109, pp. 66-85, 2018.

pro

590

[37] N. Akhtar, A. Mian, “Threat of Adversarial Attacks on Deep Learning in Computer Vision: A Survey,” IEEE Access, vol. 6, pp. 14410-14430, 2018. [38] T. Young, D. Hazarika, S. Poria, E. Cambria, “Recent Trends in Deep

595

re-

Learning Based Natural Language Processing,” IEEE Computational Intelligence Magazine, vol. 13, no. 3, pp. 55-75, Aug. 2018. [39] A. B. Nassif, I. Shahin, I. Attili, M. Azzeh, K. Shaalan, “Speech Recognition Using Deep Neural Networks: A Systematic Review,” IEEE Access, vol. 7,

urn al P

pp. 19143-19165, 2019.

[40] M. Chen, X. Shi, Y. Zhang, D. Wu, M. Guizani, “Deep Features Learning 600

for Medical Image Analysis with Convolutional Autoencoder Neural Network,” IEEE Trans. Big Data, DOI:10.1109/TBDATA.2017.2717439, 2017. [41] M. Chen, J. Yang , X. Zhu, X. Wang, M. Liu, J. Song, “Smart Home 2.0: Innovative Smart Home System Powered by Botanical IoT and Emotion Detection,” Mobile Networks and Applications, Vol. 22, pp. 1159-1169, 2017.

605

[42] M. Chen, J. Yang, L. Hu, M. Hossain, G. Muhammad, “Urban Healthcare Big Data System based on Crowdsourced and Cloud-based Air Quality Indicators,” IEEE Communications, Vol. 56, No. 11, Nov. 2018.

Jo

[43] M. Chen, W. Li, Y. Hao, Y. Qian, I. Humar, “Edge Cognitive Computing based Smart Healthcare System,” Future Generation Computing System,

610

Vol. 86, pp. 403-411, Sep. 2018.

30

Journal Pre-proof

[44] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. Manning, A. Ng, C. Potts, “Recursive deep models for semantic compositionality over a sentiment tree-

of

bank,” In: Proceedings of EMNLP, pp. 16-42, 2013. [45] K. Tai, R. Socher, C. Manning, “Improved semantic representations from 615

tree-structured long short-term memory networks,” arXiv preprint arX-

pro

iv:1503.00075, 2015.

[46] S. Wang, C. Manning, “Fast Dropout Training,” Proceedings of ICML, 2013.

[47] Y. Ma, Y. Hao, M. Chen, J. Chen, P. Lu, A. Kosir, “Audio-visual emotion fusion (AVEF): A deep efficient weighted approach,” Information Fusion, Vol. 46, pp. 184-192, 2019.

re-

620

[48] Y. Kim, Y. Jernite, D. Sontag, A. Rush, “Character-aware neural language

urn al P

models,” arXiv preprint arXiv:1508.06615, 2015.

[49] X. Zhang, F. Chen, R. Huang, “A Combination of RNN and CNN for 625

Attention-based Relation Classification,” In: proceeding of ICICT, pp. 911917, 2018.

[50] N. Vu, H. Adel, P. Gupta, H. Schutze, “Combining Recurrent and Convolutional Neural Networks for Relation Classification,” In: proceedings of NAACL-HLT, pp. 534-539, 2016. 630

[51] A. Hassan, A. Mahmood, “Convolutional Recurrent Deep Learning Model for Sentence Classification,” IEEE Access, vol. 6, pp. 13949-13957, 2018. [52] Y. Liu, L. Ji, R. Huang, T. Ming, C. Gao, J. Zhang, “An attention-gated convolutional neural network for sentence classification,” arXiv preprint arX-

Jo

iv:1808.07325, 2018. 635

[53] W. Yin, H. Schutze, B. Xiang, B. Zhou, “ABCNN: Attention-based convolutional neural network for modeling sentence pairs,” TACL, pp. 259-272, 2016.

31

Journal Pre-proof

[54] C. Santos, M. Tan, B. Xiang, B. Zhou, “Attentive pooling networks,” CoRR, abs/1602.03609, 2016. [55] A. Graves, G. Wayne, I. Danihelka, “Neural turing machines,” CoRR, ab-

of

640

s/1410.5401, 2014.

pro

[56] Y. Heo, S. Kang, D. Yoo, “Multimodal Neural Machine Translation With Weakly Labeled Images,” IEEE Access, vol. 7, pp. 54042-54053, 2019. [57] H. Zheng, W. Wang, W. Chen, A. K. Sangaiah, “Automatic Generation of 645

News Comments Based on Gated Attention Neural Networks,” IEEE Access, vol. 6, pp. 702-710, 2018.

re-

[58] K. Al-Sabahi, Z. Zuping, M. Nadher, “A Hierarchical Structured SelfAttentive Model for Extractive Document Summarization (HSSAS),” IEEE Access, vol. 6, pp. 24205-24212, 2018.

[59] S. Liu, S. Zhang, X. Zhang, H. Wang, “R-Trans: RNN Transformer Net-

urn al P

650

work for Chinese Machine Reading Comprehension,” IEEE Access, vol. 7, pp. 27736-27745, 2019.

[60] M. Lu, Y. Fang, F. Yan, M. Li, “Incorporating Domain Knowledge into Natural Language Inference on Clinical Texts,” IEEE Access, vol. 7, pp. 655

57623-57632, 2019.

[61] T. Shao, Y. Guo, H. Chen, Z. Hao, “Transformer-Based Neural Network for Answer Selection in Question Answering,” IEEE Access, vol. 7, pp. 2614626156, 2019.

[62] I. Lopez-Gazpio, M. Maritxalar, M. Lapata, E. Agirre. “Word n-gram at660

tention models for sentence similarity and inference,” Expert Systems with

Jo

Applications, Vol. 132, pp. 1-11, 2019.

[63] Y. Liu, C. Sun, L. Lin, X. Wang, “Learning natural language inference using bidirectional LSTM model and inner-attention,” CoRR, abs/1605.09090, 2016.

32

Journal Pre-proof

665

[64] P. Li,W. Li, Z. He, X. Wang, Y. Cao, J. Zhou, W. Xu, “Dataset and neural recurrent sequence labeling model for open-domain factoid question

of

answering,” arXiv preprint arXiv:1607.06275, 2016. [65] J. Cheng, D. Li, M. Lapata, “Long short-term memory-networks for ma-

670

pro

chine reading,” In Conference on EMNLP, pp. 551-561, 2016.

[66] A. Parikh, O. Tackstrom, D. Das, J. Uszkoreit, “A decomposable attention model for natural language inference,” In Proceedings of EMNLP, 2016. [67] Google, “word2vec,” https://code.google.com/p/word2vec/. [68] B. Pang, L. Lee, “Seeing stars: Exploiting class relationships for sentiment

675

re-

categorization with respect to rating scales,” In: Proceedings of ACL, 2005. [69] M. Zeiler, “Adadelta:An adaptive learning rate method,” CoRR, abs/1212.5701, 2012.

urn al P

[70] Y. Zhang, B. Wallace, “A sensitivity analysis of (and practitioners’ guide to) convolutional neural networks for sentence classification,” arXiv preprint arXiv:1510.03820, 2016. 680

[71] X. Zhu, P. Sobhani, H. Guo, “Long short-term memory over tree structures,” arXiv preprint arXiv:1503.04881, 2015. [72] J. Li, D. Jurafsky, E. Hovy, “When are tree structures necessary for deep learning of representations?,” arXiv preprint arXiv:1503.00185, 2015. [73] X. Wang, W. Jiang, Z. Luo, “Combination of Convolutional and Recur-

685

rent Neural Network for Sentiment Analysis of Short Texts,” Proceedings of COLING pp. 24282437, 2016.

Jo

[74] A. Hassan, A. Mahmood, “Deep Learning Approach for Sentiment Analysis of Short Texts,” In: Proceedings of the ICCAR, 2017.

33