Bidirectional LSTM with self-attention mechanism and multi-channel features for sentiment classification

Weijiang Li*, Fang Qi, Ming Tang, Zhengtao Yu
Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China
* Corresponding author. E-mail address: [email protected] (W. Li).

Article history: Received 21 June 2019; Revised 30 December 2019; Accepted 1 January 2020. Communicated by Dr. Erik Cambria.

Keywords: Self-attention mechanism; Multi-channel features; Bidirectional long short-term memory; Sentiment classification

Abstract: A wealth of linguistic knowledge and sentiment resources is available today, but current deep learning approaches do not make full use of this sentiment-specific information in sentiment analysis tasks. Moreover, sentiment analysis can be treated as a sequence modeling task, and sequence models have a limitation: the input text sequence is decoded into a vector of fixed length. If this length is set too short, input information is lost and the text is ultimately misclassified. To address these problems, we propose a bidirectional LSTM model with a self-attention mechanism and multi-channel features (SAMF-BiLSTM). The method models the existing linguistic knowledge and sentiment resources in sentiment analysis tasks to form different feature channels, and uses a self-attention mechanism to enhance the sentiment information. The SAMF-BiLSTM model can fully exploit the relationship between target words and sentiment polarity words in a sentence without relying on a manually curated sentiment lexicon. In addition, we propose the SAMF-BiLSTM-D model, based on SAMF-BiLSTM, for document-level classification tasks. This method first obtains the representations of all sentences in a document through SAMF-BiLSTM training, then uses a BiLSTM to learn over these sentence representations, and thereby obtains the sentiment feature information of the entire document. Finally, we evaluate the models on five datasets. The results show that SAMF-BiLSTM and SAMF-BiLSTM-D are superior to other advanced methods in classification accuracy in most cases.

1. Introduction

Sentiment analysis [1] is a branch of affective computing research [2] that aims to classify texts as positive or negative, and sometimes as neutral [3]. Existing methods for affective computing and sentiment analysis [4] can be divided into three categories: knowledge-based techniques, statistical methods, and hybrid methods. Knowledge-based techniques classify texts into affect categories based on the presence of fairly explicit affect words. Common sources of sentiment words or multi-word expressions include the Affective Lexicon [5], linguistic annotation schemes [6], WordNet-Affect [7], SentiWordNet [8], SenticNet [9], and probabilistic knowledge bases trained from linguistic corpora [10,11]. The main weakness of knowledge-based methods is that they handle emotions poorly when language rules are involved [12]. Moreover, the validity of knowledge-based methods depends heavily on the depth and breadth of the employed resources.




Statistical methods, such as support vector machines and deep learning, have been widely used for sentiment classification of text, for example in movie review classifiers [13,14]. By providing machine learning algorithms with a large training corpus of affectively annotated text, a system can learn not only the affective valence of affect keywords, but also the valence of other keywords and word co-occurrence frequencies. While these methods may be effective at categorizing a user's text at the page or paragraph level, they do not work well with smaller text units such as sentences or clauses. Hybrid approaches use both knowledge-based techniques and statistical methods to perform tasks such as emotion recognition and polarity detection from textual or multi-modal data [15-19].

Sentiment classification is the core issue of sentiment analysis. Traditional sentiment classification techniques are mainly based on rules and machine learning. Rule-based approaches mainly use sentiment lexicons, templates and statistical features obtained from experience or expert opinion to classify the sentiment of text; this usually requires a large amount of manual intervention [20,21].


Machine learning methods regard sentiment analysis as a classification problem: they first build a training set by manually labeling part of the data, then extract features from and learn on the training data to construct a classification model. Finally, the trained model is used to classify and predict test data with unknown labels [22,23]. However, such methods often depend on the quality of the annotated data and on complex feature engineering, which requires considerable labor. Recently, with the development of deep learning, neural-network-based methods have become mainstream and are widely used in natural language processing (NLP). Compared with traditional machine learning methods, deep learning methods excel in sentiment analysis: they need neither hand-built sentiment lexicons nor syntactic analysis. As long as the training dataset reaches a certain scale, a deep learning model with high classification accuracy and good generalization can be trained. Convolutional neural networks (CNN) [24-26] and recurrent neural networks (RNN) [27] are the most commonly used deep learning models in sentiment analysis tasks. A CNN extracts high-dimensional features between locally adjacent words by sliding windows of different sizes over the word vectors of a sentence. However, a CNN filter covers a limited number of words and cannot capture long-term dependencies, so it cannot model the semantic relationship between non-adjacent words in a sentence. Unlike the CNN, the RNN is specifically designed for sequence modeling; it captures contextual semantics and can apply memorized content to the current step, and is therefore used more frequently in text categorization. However, for long sequences the traditional RNN may suffer from exploding or vanishing gradients. Long short-term memory (LSTM) [28] is an RNN architecture with memory cells as hidden units, which effectively alleviates the vanishing and exploding gradient problems. In addition, LSTM considers the order dependencies between word sequences, so it can capture both long-distance and short-range dependencies; this ability to extract information from long text makes it important in NLP. The basic LSTM scans the sequence in only one direction. Bidirectional long short-term memory (BiLSTM) [29] is a further development that scans the sequence in both directions, allowing simultaneous access to forward and backward contexts, and therefore handles sequence modeling tasks better than LSTM. Although these neural network models have achieved great success in sentiment classification, there is still much room for improvement:

1. There is a large amount of linguistic knowledge and many sentiment resources, such as sentiment vocabulary, negation words (not, never), degree adverbs (a little, very), etc. These resources play a vital role in traditional sentiment classification methods [30], but this unique sentiment information has not yet been fully utilized in recent deep neural network models such as CNN and LSTM. For example, the WFCNN model [31] uses the terms in a sentiment lexicon to obtain abstract representations of the words in a text, and then uses a convolutional neural network to extract sequence features of the abstract words. The sentiment features in this method depend on a manually curated sentiment lexicon, so the linguistic knowledge and sentiment feature information in the sentiment analysis task cannot be fully utilized.
In addition, the method uses a single feature representation, which depends heavily on the initial values of the input vectors and makes it difficult to express the importance of each word in the sentence correctly.

2. Text sentiment classification can be seen as a sequence modeling task. In current research, a single attention mechanism is often used to capture important input information. However, in different representation subspaces the same words may express different levels of sentiment at different positions, and all of this information together forms the overall semantics of the input sequence. The input sequence poses a problem: whatever the length of the input text, it is eventually decoded into a vector of fixed length. If the input text exceeds this length, important information is lost and the text is ultimately misclassified. To address the problem of overly short input sequence vectors, the literature [32] proposes a network model that combines a part-of-speech attention mechanism with an LSTM and uses an attention matrix to compute word-level attention features. The experimental results show that good classification can be achieved within a certain dimension, but when the dimension of the text mapping exceeds a threshold, the classification accuracy decreases as the vector dimension increases. Liu et al. [33] proposed a bidirectional LSTM classification model with an attention mechanism and convolutional layers to handle texts of arbitrary length and the sparsity of text data; their results show that the performance of the model is affected by the convolution window size and stride.

To solve the above problems, we propose a BiLSTM model based on a self-attention mechanism and multi-channel features (SAMF-BiLSTM). First, we model the existing linguistic knowledge and sentiment resources in the sentiment analysis task and combine the word vectors of the input sentence with part-of-speech feature vectors, position feature vectors and dependency feature vectors to form different feature channel inputs. We then learn a BiLSTM for each channel, so that the model can learn the sentiment features of the sentence from different angles and mine hidden information about different aspects of the sentence. Next, the three feature channel vectors are combined with the output vectors of the three BiLSTMs, and a self-attention model is used to discover the important information in the sentence. The self-attention used in this paper is a special case of the attention mechanism. Unlike traditional attention, self-attention reduces the dependence on external information: it computes dependency relationships directly, regardless of the distance between words, learns the weight of each word with respect to the sentiment tendency of the sentence, and focuses on strengthening the sentiment features in the sentence, so that the model learns more hidden feature information.

The main contributions of our work can be summarized as follows:
• We find that modeling the special linguistic knowledge and sentiment resources used in sentiment classification can enhance classification. We achieve this by establishing multiple feature channel inputs to sequential BiLSTM models.
• We propose a self-attention mechanism that combines multiple feature vectors with the hidden output of the BiLSTM model and gives different sentiment weights to different words. It effectively raises the importance of sentiment polarity words and fully discovers the sentiment information in the text.
• Based on the SAMF-BiLSTM model, we propose the SAMF-BiLSTM-D model for document-level text classification tasks.
• We verify our models on several datasets. The experimental results show that SAMF-BiLSTM and SAMF-BiLSTM-D achieve better performance than the baseline models.
The rest of the paper is structured as follows: Section 2 reviews related work; Section 3 presents our model; Section 4 describes the comparative experiments and analyzes the experimental results; and Section 5 concludes the paper and discusses future work.


2. Related work

Sentiment classification is an important task in natural language processing (NLP). In earlier research, machine learning methods attracted the attention of many scholars; the common machine-learning-based sentiment classification methods include Naive Bayes (NB), Support Vector Machines (SVM) and Maximum Entropy (ME). The performance of these methods depends largely on n-gram features or manually designed features, and constructing features by extracting information from the data can greatly improve classification. Therefore, in traditional sentiment classification, many researchers try to improve performance by designing better features from linguistic knowledge and sentiment resources. For example, Tang et al. [34] proposed training an SVM classifier with sentiment-specific word embedding (SSWE) features. Huang et al. [35] incorporated sentiment emoji and the personality emotions of microblog users into the LDA graphical model, deriving microblog topics and sentiment simultaneously; they further added a sentiment layer and microblog user-relationship parameters to LDA [36], using user relationships and microblog topics to learn the sentiment polarity of microblogs. Vo et al. [37] added emoticon features to the sentiment lexicon to automatically construct training texts and perform sentiment analysis on Twitter; this method effectively exploits the sentiment information hidden in different expressions in Twitter text, and by learning these expressions the model makes full use of the sentiment information of the input text and improves classification. There are also studies on automatically constructing sentiment lexicons from social data and multiple languages [38]. However, most of these models are based on the bag-of-words model, in which each word is independent of the others, ignoring the semantic relations between words.

With the development of deep learning in NLP, many researchers have begun to use deep learning for sentiment classification. Chen et al. [31] proposed a sentiment classification method combining a sentiment lexicon and a convolutional neural network, which uses the words in the sentiment lexicon to abstract the words in the text and then uses a CNN to extract sequence features of the abstract words. However, this method relies on a manually compiled sentiment lexicon and a single feature, so it cannot fully utilize the sentiment feature information specific to the sentiment analysis task. To solve this problem, Chen et al. [39] proposed a multi-channel convolutional neural network that combines the word vector, part-of-speech vector and position value of the input text to learn the sentiment feature information in the sentence. Experiments show that the method can combine more features to learn and optimize the model, but it still fails to recognize sentences with sarcastic sentiment well. Zhang et al. [40] proposed a convolutional neural network sentiment analysis model based on critical learning and rule optimization, which consists of three key components: a feature-based predictor, a rule-based predictor and a critical learning network. The critical learning network can estimate the importance of knowledge rules and use them adaptively. Experimental results show that the method outperforms state-of-the-art methods in sentiment analysis, but for negative polarity rules and sentence structure rules the model requires a manually curated additional sentiment lexicon. Teng et al. [41] proposed a method based on simple weighted, context-sensitive lexicons, using an RNN to learn sentiment intensity and the strengthening and negation of lexical sentiment, which together constitute the sentiment value of the sentence.


Qian et al. [42] proposed a simple LSTM model trained with sentence-level annotations that models existing linguistic rules such as sentiment vocabulary, negation words and degree adverbs. The model uses the linguistic rules effectively and performs well, but it needs an intensity regularizer and relies heavily on regularized sentence-level annotations. Tai et al. [43] proposed Tree-LSTM, a neural network that introduces memory cells and gates into a tree structure. Experimental results show that Tree-LSTM outperforms all LSTM baselines on two tasks, but the model relies on a parse tree structure and expensive phrase-level annotations, and its performance drops significantly when only sentence-level training is used. Cambria et al. [44] proposed combining sub-symbolic and symbolic AI: long short-term memory networks automatically discover concept primitives from text through lexical substitution, link them with commonsense concepts and named entities, and finally perform sentiment analysis in a new three-level knowledge representation. Li et al. [45] started from word representations; in order to investigate the influence each word has on the sentiment label of both the target word and the context words, they incorporate such prior sentiment information at both the word level and the document level, and determine the best way to incorporate prior sentiment information by evaluating sentiment analysis performance in each category. Experimental results on real-world datasets demonstrate that the word representations learned by DLJT2 significantly improve sentiment analysis performance.

Nowadays, attention mechanisms have become an effective way to achieve excellent results by selecting important information. The attention mechanism was first proposed in the field of computer vision, with the purpose of imitating human attention and giving different weights to different parts of an image. Bahdanau et al. [46] used the attention mechanism for machine translation and were the first to apply it in NLP. Ma et al. [47] proposed an attention model based on hidden states that learns attention interactively from context and aspect. Wang et al. [48] proposed attention-based LSTMs for aspect-level sentiment classification, and Liu et al. [49] proposed a content-attention-based aspect classification model; their key idea is to add aspect information to the attention mechanism. For sentence-level sentiment analysis, Guan et al. [50] proposed an attention-enhanced bidirectional LSTM. The attention mechanism of this model directly learns, from the word vectors, the weight of each word with respect to the sentiment tendency of the sentence, so as to emphasize words that improve classification; a bidirectional LSTM learns the semantic information of the text, and parallel fusion further improves the classification effect. Ma et al. [51] proposed a novel method for aspect-based sentiment analysis that augments a long short-term memory (LSTM) network with a hierarchical attention mechanism consisting of target-level attention and sentence-level attention, and incorporates commonsense knowledge of emotion-related concepts into the end-to-end training of the deep neural network. Experiments on two public datasets show that the proposed attention architecture combined with Sentic LSTM outperforms state-of-the-art methods on targeted sentiment tasks. Zhou et al. [52] proposed an attention-based LSTM network, and Vaswani et al. [53] proposed self-attention and multi-head attention in 2017, both of which use self-attention to solve classification problems. Lin et al. [54] used the self-attention mechanism to learn sentence representations over word vectors in an LSTM network and achieved good results in sentiment classification. Wang et al. [55] proposed an RNN-based sentiment classification capsule model (RNN-Capsule) in which each capsule has an attribute, a state, and three modules (a representation module, a probability module and a reconstruction module).


The attribute of a capsule is its designated sentiment category, and the attention mechanism is used to construct the capsule representation. The probability module calculates the capsule's state probability from the capsule representation: if its state probability is the highest among all capsules, the capsule is active, otherwise it is inactive. RNN-Capsule achieved good classification results on two benchmark datasets and one professional dataset. In addition, Zhao et al. [56] extended existing capsule networks into a new framework with advantages in scalability, reliability and generalizability. The method was verified on two NLP tasks, multi-label text classification and question answering; the experimental results show clear improvements on both tasks over other state-of-the-art methods, and in low-resource settings the best results are obtained with fewer training instances. Liang et al. [57] proposed MATT-CNN, a multi-channel attentional convolutional neural network that combines three attention mechanisms over word vectors, part-of-speech vectors and position vectors. The model can focus on the relationship between the target words of a sentence and the other words from several kinds of information, and addresses the fact that, for target-dependent sentiment analysis, models combining attention with serial-input networks such as LSTM train slowly and cannot process text in parallel. Huang et al. [58] proposed an attention-over-attention (AOA) neural network for aspect-level sentiment classification, which models aspects and sentences jointly and captures the interactions between aspects and context sentences. Through the AOA module, the model learns aspect and sentence representations together and automatically focuses on the important parts of the sentence; experimental results show that the method is superior to previous LSTM-based architectures. Aiming at the problem that deep learning approaches for sentiment classification cannot fully utilize sentiment linguistic knowledge, Yang et al. [59] proposed a Feature-enhanced Attention Network for target-dependent Sentiment classification (FANS). The method first uses word features, part-of-speech features and word position features to learn feature-enhanced word representations, and then uses a multi-perspective co-attention network to better learn multi-view sentiment- and target-specific sentence representations by modeling the interactions among context words, target words and sentiment words. Lei et al. [60] proposed a Multi-sentiment-resource Enhanced Attention Network (MEAN), which uses the attention mechanism to integrate three kinds of sentiment linguistic knowledge (sentiment lexicon, negation words and intensity words) into a deep neural network. By using different types of sentiment resources, the model exploits sentiment-related information from different representation subspaces to classify sentiment more effectively.

Our model differs from the above in several ways. First, we model the linguistic knowledge of sentiment vocabulary, negation words and intensity words to form different feature channels, and let the BiLSTM learn the sentiment feature information of the sentence from different angles. Second, we use a self-attention mechanism that adaptively weights the current input directly, ignoring the distance between words, reducing the dependence on external information, learning the weight of each word with respect to the sentiment tendency of the sentence, and focusing on strengthening the important feature information in the sentence.

3. SAMF-BiLSTM model

The architecture of our model is shown in Figure 1. Formally, the words of the text form a sequence {x1, x2, . . . , xn}. Each word is mapped into a multi-dimensional continuous-valued vector wi, 1 ≤ i ≤ n, using trained word vectors. The word vectors of the sentence are concatenated to obtain the word vector matrix of the entire sequence, W^d = w1 ⊕ w2 ⊕ . . . ⊕ wn, with dimension d. We do not use the word vectors W^d directly as the input of the bidirectional LSTM; instead, we combine them with the part-of-speech feature vectors, the position value vectors and the dependency parsing vectors to form different channels based on the word vectors (see Section 3.1). The purpose is to let the model learn sentiment feature information from different perspectives and fully exploit the hidden information in the sentence. As shown in the figure, the bidirectional LSTM extracts feature information from the three channel inputs, which is then layer-normalized to obtain V_LN; a weighting matrix S_att is learned through the self-attention mechanism to weight the original V_LN, and different sentiment weights are assigned to different words for sentiment classification. The specific design is introduced in the following sections.

3.1. Multi-channel features

The multi-channel features in this paper consist of the word vector W^d, the part-of-speech feature vector Tag^m, the position value vector Pos^l, and the dependency parsing vector Par^p computed over the entire dataset.

Part-of-speech feature vectors. We use the sentiment word set of HowNet (http://www.keenage.com/html/c_index.htm) to re-label the part-of-speech tags of the input sentence. Through this labeling, the model learns the words that have an important influence on sentiment classification, focusing on the annotation of special sentiment words: degree adverbs (very, extremely), positive/negative evaluation words (good, bad), positive/negative sentiment words (like, disappointment) and negation words (not, never). As with the word vector W^d, let ti ∈ Tag^m, where ti is the ith part-of-speech feature vector and m is the dimension of the part-of-speech vector.

Position value vectors. The relative positions of words often hide important information; the same word appearing at different positions may express different sentiment information. Therefore, each position value is mapped into a multi-dimensional continuous-valued vector pi ∈ Pos^l, where pi is the ith position feature vector and l is the dimension of the position feature vector.

Dependency parsing vectors. Dependency parsing reveals the syntactic structure of a sentence by analyzing the dependencies between the components of the language unit. By syntactically analyzing the input sentence, the syntactic structure of the sentence and the dependencies between its words are determined. This allows the model to make more use of existing linguistic knowledge in the sentiment analysis task and to discover more hidden sentiment information.

We then concatenate the part-of-speech feature vectors, position value vectors and dependency parsing vectors with the word vectors, respectively, to form different feature channel inputs. The model can thus learn sentiment feature information from different angles and explore the hidden information of the sentence from different perspectives. To keep the model simple, we use a simple row-wise vector concatenation in the experiments:

R_wt = W^d ⊕ Tag^m    (1)
R_wp = W^d ⊕ Pos^l    (2)
R_wpa = W^d ⊕ Par^p    (3)
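To make the channel construction concrete, the following is a minimal sketch (in PyTorch) of how Eqs. (1)-(3) can be assembled by concatenating a word embedding with one auxiliary embedding per channel. The vocabulary and tag-set sizes are hypothetical placeholders; the embedding dimensions follow the settings reported later in Section 4.1.2 (300/30/25/25).

```python
import torch
import torch.nn as nn

class MultiChannelFeatures(nn.Module):
    """Builds the three feature channels R_wt, R_wp, R_wpa of Eqs. (1)-(3)."""
    def __init__(self, vocab=20000, n_tags=50, n_pos=200, n_dep=50):
        super().__init__()
        self.word = nn.Embedding(vocab, 300)   # W^d
        self.tag = nn.Embedding(n_tags, 30)    # Tag^m (part of speech)
        self.pos = nn.Embedding(n_pos, 25)     # Pos^l (position value)
        self.dep = nn.Embedding(n_dep, 25)     # Par^p (dependency relation)

    def forward(self, word_ids, tag_ids, pos_ids, dep_ids):
        w = self.word(word_ids)                        # (batch, n, 300)
        r_wt = torch.cat([w, self.tag(tag_ids)], -1)   # Eq. (1): word + POS
        r_wp = torch.cat([w, self.pos(pos_ids)], -1)   # Eq. (2): word + position
        r_wpa = torch.cat([w, self.dep(dep_ids)], -1)  # Eq. (3): word + dependency
        return r_wt, r_wp, r_wpa
```

Each returned tensor is later fed to its own BiLSTM channel, so the three channels can be trained jointly while keeping the word embedding shared.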


Fig. 1. The architecture of the SAMF-BiLSTM.

Algorithm 1: SAMF-BiLSTM algorithm.
Input: The word vector W^d, part-of-speech feature vector Tag^m, position value vector Pos^l and dependency parsing vector Par^p, combined into multi-channel features by Eqs. (1)-(3).
Output: return p_k, where k is the task.
1: for each iteration t do
2:   Use Eqs. (5) and (6) to obtain forward and backward context features from the multi-channel feature sequences;
3:   Use Eqs. (7)-(9) to calculate the mean and variance of the summed inputs of the neurons in the BiLSTM hidden layer, obtaining the hidden-layer output V_LN;
4:   Use Eqs. (11)-(13) to calculate the word self-attention weight matrix for each channel;
5:   Use Eq. (14) to weight the hidden-layer output V_LN of each BiLSTM channel; the weighted attention feature vector is O_ve;
6:   Fuse the attention feature vectors of the three channels to obtain S_att, and classify with the softmax function;
7:   Update the model parameters using the loss function of Eq. (17) and the Adadelta method.
8: end for
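Algorithm 1 can be read as the following training-loop skeleton. This is a hedged sketch in PyTorch: `SAMFBiLSTM` stands for a hypothetical module wrapping the channel encoders, layer normalization, self-attention and softmax classifier of Sections 3.1-3.3, and the L2 term of Eq. (17) is approximated here by the optimizer's weight_decay.

```python
import torch
import torch.nn.functional as F

def train(model, loader, epochs=10, lr=1.0, weight_decay=1e-4):
    # Adadelta with weight decay plays the role of Eq. (17)'s L2 regularizer.
    opt = torch.optim.Adadelta(model.parameters(), lr=lr, weight_decay=weight_decay)
    for _ in range(epochs):
        for word_ids, tag_ids, pos_ids, dep_ids, labels in loader:
            logits = model(word_ids, tag_ids, pos_ids, dep_ids)  # steps 2-6 of Algorithm 1
            loss = F.cross_entropy(logits, labels)                # cross-entropy part of Eq. (17)
            opt.zero_grad()
            loss.backward()                                       # back-propagation
            opt.step()                                            # step 7: parameter update
    return model
```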

Fig. 2. Bidirectional LSTM network structure.

3.2. Long short-term memory and layer normalization

LSTM [28] is an improvement of the RNN; it considers the sequential dependencies among word sequences and alleviates the long-distance dependence and vanishing gradient problems of the RNN. In summary, the hidden state h_t and memory cell c_t of the LSTM are a function of their previous states h_{t-1}, c_{t-1} and the input vector W_t. The hidden state at each position (h_t) considers only the forward context, without regard to the backward context:

c_t, h_t = g_{LSTM}(c_{t-1}, h_{t-1}, W_t)    (4)

Bidirectional LSTM [29] considers both forward and backward information to better capture two-way semantic dependencies; the architecture of the BiLSTM is shown in Figure 2. The BiLSTM uses two parallel channels (forward and backward) at the same time and concatenates the hidden states of the two LSTMs as the representation of each position. The forward and backward LSTMs are respectively formulated as:


\overrightarrow{c}_t, \overrightarrow{h}_t = g_{LSTM}(\overrightarrow{c}_{t-1}, \overrightarrow{h}_{t-1}, W_t)    (5)
\overleftarrow{c}_t, \overleftarrow{h}_t = g_{LSTM}(\overleftarrow{c}_{t+1}, \overleftarrow{h}_{t+1}, W_t)    (6)

where g_{LSTM} is the same as in Eq. (4), and the parameters of the two LSTMs are shared. The representation of the entire sentence is [\overrightarrow{h}_n, \overleftarrow{h}_1], where n is the number of words in the sentence. At each position t, the representation is h_t = \overrightarrow{h}_t ⊕ \overleftarrow{h}_t, a concatenation of the hidden states of the forward LSTM and backward LSTM. In this way, the forward and backward contexts can be considered simultaneously.

Next, we use the layer normalization proposed in [61] to calculate the mean and variance of the summed inputs of the neurons in the hidden layer. The purpose is to stabilize the hidden dynamics of the LSTM network and prevent over-fitting. In layer normalization, each neuron of the BiLSTM hidden layer h_t has its own adaptive bias and gain, and all hidden units in a layer share the same normalization terms μ and σ:

h_t = f( (g / σ_t) ⊙ (h_t − μ_t) + b )    (7)
μ_t = (1/H) \sum_{i=1}^{H} h_{t,i}    (8)
σ_t = \sqrt{ (1/H) \sum_{i=1}^{H} (h_{t,i} − μ_t)^2 }    (9)

where H denotes the number of hidden units in a layer and ⊙ is element-wise multiplication between two vectors. g and b are defined as the gain and bias parameters, with the same dimension as h_t.

The output of all hidden-layer states of the BiLSTM is then given by Eq. (10), where the dimension of V_LN is n × H:

V_LN = (h_1, h_2, . . . , h_n)    (10)
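The BiLSTM-plus-layer-normalization encoding of one channel (Eqs. (5)-(10)) could be sketched as below. This is a minimal PyTorch sketch under the assumption that nn.LayerNorm over the concatenated hidden states is an adequate stand-in for the per-timestep normalization of Eqs. (7)-(9); the input size 330 corresponds to the R_wt channel (300 + 30).

```python
import torch
import torch.nn as nn

class ChannelEncoder(nn.Module):
    """Encodes one feature channel with a BiLSTM followed by layer normalization."""
    def __init__(self, in_dim=330, hidden=128):
        super().__init__()
        self.bilstm = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)
        self.norm = nn.LayerNorm(2 * hidden)  # adaptive gain g and bias b of Eq. (7)

    def forward(self, x):                 # x: (batch, n, in_dim), e.g. R_wt
        h, _ = self.bilstm(x)             # h_t = [forward h_t ; backward h_t], Eqs. (5)-(6)
        return self.norm(h)               # V_LN = (h_1, ..., h_n), Eq. (10)
```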

3.3. Self-attention and output

The attention mechanism was first proposed in the field of image processing; its purpose is to focus on certain feature information during model training. Conventional attention either uses the state of the last hidden layer of the LSTM or aligns the hidden state of the current input with the LSTM output states. However, in the sentiment analysis task, a self-attention mechanism that adaptively weights the current input is more appropriate.

Taking the part-of-speech channel as an example, we analyze samples from the MR dataset. As shown in Table 1, the sentiment words in a sentence (unnecessary, bad, etc.) reflect the sentiment tendency of the reviewer. In order to pay more attention to these sentiment words during classification, we use the self-attention mechanism to learn the internal structure of sentences and to strengthen certain feature information in them. The self-attention mechanism is an improvement of the attention mechanism: it reduces the dependence on external information and is better at capturing the internal correlations of the data or features.

Table 1. Analysis of key words in MR data samples.
MR data sample | Key words
an ambitious, serious film that manages to do virtually everything wrong; sitting through it is something akin to an act of cinematic penance. | ambitious; serious; virtually; wrong; penance
because of an unnecessary and clumsy last scene, 'swimfan' left me with a very bad feeling. | unnecessary; clumsy; very; bad
the emotion is impressively true for being so hot-blooded, and both leads are up to the task. | impressively; true; hot-blooded
the screenplay sabotages the movie's strengths at almost every juncture. all the characters are stereotypes, and their interaction is numbingly predictable. | sabotages; almost; stereotypes; numbingly; predictable

Figure 3 shows how the self-attention weight matrix w_att1 of the R_wt channel is computed:

Fig. 3. Self-attention structure of R_wt. The V_LN2 of the R_wp channel and the V_LN3 of the R_wpa channel are additional auxiliary weights that participate in the calculation of the self-attention weight matrix of the R_wt channel.

P_VLN = V_LN1,  I_tpp = Tag^m,  L_nor = L(V_LN2 ⊕ V_LN3)    (11)
a_wt1 = P_VLN ⊕ I_tpp ⊕ L_nor    (12)
w_att1 = softmax(L_3(tanh(L_2 tanh(L_1 a_wt1))))    (13)

In the above formulas, P_VLN, I_tpp and L_nor are the self-auxiliary matrix, the initial attention matrix and the additional auxiliary matrix, respectively. L, L_1, L_2 and L_3 are weight vectors of dimension H, 3H + m + 1, H + m and m, respectively. The softmax is used for normalization; the hidden state V_LN1 of the BiLSTM is then weighted by the self-attention weight w_att1, and the weighted attention feature vector is O_ve1:

O_ve1 = w_att1 ⊙ V_LN1    (14)



As with the calculation of the attention feature vector of the R_wt channel, the attention feature vectors of the R_wp and R_wpa channels are calculated as O_ve2 and O_ve3. Sentiment analysis is essentially a classification problem, so at the end of the model the attention feature vectors of the three channels are merged to obtain S_att, and the softmax function is then used to classify them:

S_att = [O_ve1, O_ve2, O_ve3]    (15)
p = softmax(w_c S_att + b_c)    (16)

where w_c is the weight matrix and b_c is the bias. Eq. (17) is the loss function used during model training; weight decay is applied to the model parameters to regularize them:

loss = − \sum_{i=1}^{D} \sum_{k=1}^{C} y_i^k \log p_i^k + λ‖θ‖^2    (17)

where D is the size of the training dataset, C is the number of labels, p is the predicted sentiment category, y is the actual category, λ‖θ‖^2 is the L2 regularization term, λ is the L2 regularization hyper-parameter, and θ is the set of parameters in the model. The back-propagation algorithm is used to update the network parameters.
3.4. SAMF-BiLSTM-D model

As shown in Table 2, in the sentiment classification task the average length of sentence-level text does not exceed 100 (SL < 100). Each word in the text may carry characteristic information that affects the classification result. Our SAMF-BiLSTM model fully learns the linguistic feature information of each word in the sentence and focuses on strengthening feature information, so it works well on sentence-level classification tasks (Table 4). However, in document-level texts with an average length of more than 100 (SL ≥ 100), each text contains multiple sentences, and each sentence may have a different sentiment tendency. Therefore, the factors that affect the classification of a document are sentences rather than words. To address this problem, Le et al. [62] proposed an unsupervised algorithm that learns distributed feature representations of sentences and documents, expressing variable-length sentences or paragraphs as fixed-length vectors while considering word order within a certain range. Tang et al. [63] introduced a preference matrix for each user and product of a document into CNN-based sentiment classification. Xu et al. [64] proposed Cached Long Short-Term Memory networks (CLSTM) to capture the overall semantic information in long texts; the memory is divided into several groups with different forgetting rates, so that the network can better preserve sentiment information within a recurrent unit. Chen et al. [65] proposed a hierarchical neural network that incorporates user and product information into sentiment classification. The model first constructs a hierarchical LSTM, using average pooling over words and sentences to generate sentence and document representations, and then incorporates user and product information through attention at different semantic levels to improve classification.

Directly classifying document-level text with the SAMF-BiLSTM model leads to poor results, because the sentiment features of the document cannot be captured accurately (see Table 5). Based on the SAMF-BiLSTM model, we therefore propose the SAMF-BiLSTM-D model for document-level classification tasks (see Figure 4).

Fig. 4. The architecture of SAMF-BiLSTM-D. It consists of the left and right parts of the figure.

Like [62] and [65], SAMF-BiLSTM-D is first trained to obtain sentence representations and then obtains the document representation. As shown in Fig. 4 (left), we divide the document Doc. into a sentence sequence [S1, S2, . . . , Sm], where m is the number of sentences. Each sentence Si, 1 ≤ i ≤ m, is further divided into a sequence of words {xi1, xi2, . . . , xin}, where n is the length of Si. Following Section 3.1, we vectorize the word features to form three channels. We then use the SAMF-BiLSTM model to learn the word-level sentiment of each sentence in the document and obtain the representation vector S_att_j, 1 ≤ j ≤ m, of each sentence. After that, all the sentences in Doc. are expressed as DS = [S_att1, S_att2, . . . , S_attm] and fed into the model shown in Fig. 4 (right) for training. After layer normalization, the sentence self-attention weight matrix w_satt is calculated as follows:

S_wt = V_SLN ⊕ DS    (18)
w_satt = softmax(L_2 tanh(L_1 S_wt))    (19)

where V_SLN is the hidden output of the BiLSTM, L_1 and L_2 are weights of dimension H_S + m and m respectively, and H_S is the number of hidden units. We can then obtain the weighted attention feature vector O_sve:

O_sve = w_satt ⊙ V_SLN    (20)

Finally, the softmax function is used to classify the document.
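A compact sketch of this document-level pipeline (Eqs. (18)-(20)) is given below. It assumes `sentence_model` is a trained SAMF-BiLSTM that returns one vector S_att per sentence (a hypothetical interface), and the sentence-attention weights are computed with a simplified two-layer scorer rather than the exact form of Eq. (19).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SAMFBiLSTMD(nn.Module):
    """Document model: sentence vectors -> BiLSTM -> sentence self-attention -> classifier."""
    def __init__(self, sentence_model, sent_dim, hidden=100, n_classes=5):
        super().__init__()
        self.sentence_model = sentence_model            # SAMF-BiLSTM used as sentence encoder
        self.bilstm = nn.LSTM(sent_dim, hidden, batch_first=True, bidirectional=True)
        self.score = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.Tanh(),
                                   nn.Linear(hidden, 1))  # roughly L1, L2 of Eq. (19)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, sentence_inputs):
        # sentence_inputs: the m per-sentence channel inputs of one document
        sent_vecs = [self.sentence_model(*s) for s in sentence_inputs]  # each: (sent_dim,)
        ds = torch.stack(sent_vecs).unsqueeze(0)          # DS: (1, m, sent_dim)
        v, _ = self.bilstm(ds)                            # V_SLN: (1, m, 2*hidden)
        w = F.softmax(self.score(v).squeeze(-1), dim=-1)  # w_satt, Eqs. (18)-(19)
        o = (w.unsqueeze(-1) * v).sum(dim=1)              # O_sve, Eq. (20), pooled over sentences
        return self.out(o)                                # document-level class logits
```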

4. Experiment

In this section, we conduct experiments on five real-world datasets, present the experimental details, evaluate the performance of the models and analyze the results.

4.1. Experimental setup

4.1.1. Datasets

Table 2 describes the datasets in detail, where C is the number of classification targets, SL is the average length of the dataset samples, SD is the average number of sentences in document-level text, DS is the size of the dataset, WS is the vocabulary size of the dataset, and Test is the size of the test set.

Table 2. Datasets for sentiment classification.
Data | C | SL | SD | DS | WS | Test
MR | 2 | 20 | - | 10,062 | 18,765 | 1066
SST-5 | 5 | 18 | - | 11,855 | 17,836 | 2210
SST-2 | 2 | 19 | - | 9613 | 16,185 | 1821
YELP3 | 5 | 189 | 11 | 71,193 | 48,957 | 8671
IMDB | 10 | 395 | 16 | 76,538 | 105,373 | 9112

We experiment with the models on the following five datasets.
1. MR (https://www.cs.cornell.edu/people/pabo/movie-review-data/): Movie Reviews (MR) is a sentence polarity dataset, which includes 5331 positive and 5331 negative samples.
2. SST-5 (https://nlp.stanford.edu/sentiment/): Stanford Sentiment Treebank-5 is a five-class dataset that includes 227,376 phrase-level fine-grained sentiment labels, obtained by parsing 11,855 sentences with the Stanford parser. We train on SST-5 with sentence-level annotations and with phrase-level annotations, and test on the sentence-level test data.
3. SST-2: In the Stanford Sentiment Treebank-2 dataset the sentiment is divided into two categories, with neutral reviews removed. Very positive and positive reviews are marked as positive; negative and very negative reviews are marked as negative. It includes 9613 training samples (3310 negative and 3610 positive) and 1821 test samples. We train on the phrase-level-annotated SST-2 data and test on the sentence-level test data.




4. YELP3 (http://www.yelp.com/dataset_challenge): Yelp 2013 is a review dataset with five sentiment classes, containing 62,522 training samples and 8671 test samples.
5. IMDB (http://ir.hit.edu.cn/dytang/paper/acl2015/dataset.7z): IMDB is a movie review dataset with ten sentiment classes, containing 67,426 training samples and 9112 test samples.

MR, SST-5 and SST-2 are sentence-level datasets (SL < 100); YELP3 and IMDB are document-level datasets (SL ≥ 100).

4.1.2. Experimental settings

We use Stanford CoreNLP (https://nlp.stanford.edu/software/corenlp-backup-download.html) for word segmentation, part-of-speech tagging and dependency parsing on the five experimental datasets in Table 2. We use the GloVe vectors proposed by Pennington et al. [66] as the initial word embeddings, where each word vector has 300 dimensions and the vocabulary size is 1.9M. Words not covered by the pre-trained vectors are randomly initialized from the uniform distribution U(−0.05, 0.05). Throughout the experiments, the word vector dimension is 300, the part-of-speech feature dimension is 30, the position feature dimension is 25, and the dependency parsing feature dimension is 25. Training uses the AdaDelta method proposed by Zeiler et al. [67] to update the model parameters. The dropout rate for all datasets is set to 0.5. We report the best result on the test set as the final performance. The hyper-parameter settings of the model are shown in Table 3.

Table 3. The best hyper-parameter configuration.
Parameter | MR | SST-5 | SST-2 | YELP3 (W) | YELP3 (S) | IMDB (W) | IMDB (S)
Learning rate | 0.1 | 0.1 | 0.1 | 0.1 | 0.01 | 0.1 | 0.01
Hidden layer units | 128 | 128 | 128 | 128 | 100 | 128 | 100
Weight decay | 1e-3 | 1e-4 | 1e-5 | 1e-4 | 1e-3 | 1e-4 | 1e-3
Batch size | 16 | 64 | 64 | 25 | 32 | 28 | 128
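As an illustration of the embedding initialization described above, out-of-vocabulary words can be drawn from U(−0.05, 0.05) while known words take their pre-trained vectors. The sketch below assumes `glove` is a word-to-vector dictionary loaded elsewhere (a hypothetical name).

```python
import numpy as np

def build_embedding_matrix(vocab, glove, dim=300, seed=0):
    """Pre-trained GloVe vectors where available, U(-0.05, 0.05) otherwise."""
    rng = np.random.default_rng(seed)
    emb = np.empty((len(vocab), dim), dtype=np.float32)
    for i, word in enumerate(vocab):
        if word in glove:
            emb[i] = glove[word]                    # 300-d pre-trained vector
        else:
            emb[i] = rng.uniform(-0.05, 0.05, dim)  # unregistered word
    return emb
```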

4.2. Baseline methods

We compare against benchmark methods for sentence-level and document-level text classification. The baseline methods can be divided into the following three groups.

1. General baseline models:
SVM: Support Vector Machine [68].
MNB: Multinomial Naive Bayes with unigram features [69].
NBSVM: A variant of SVM that uses naive Bayes log-count ratios as feature values, proposed by Wang and Manning [69].
SSWE + SVM: Tang et al. [34] proposed a method combining sentiment-specific word embeddings (SSWE) and SVM.
CNN: The convolutional neural network model proposed by Kim et al. [24].
RNN: Recurrent neural network proposed by Socher et al. [70].
RNTN: Socher et al. [71] proposed a recursive neural tensor network that uses tensors to model the correlations between different dimensions of child-node vectors.
LSTM/BiLSTM: Long short-term memory network and bidirectional long short-term memory network.
Paragraph-Vec: The unsupervised distributed feature representation learning algorithm proposed by Le et al. [62].

2. Sentence-level network models:
Tree-LSTM: A neural network that adds memory cells and gates to a tree structure, proposed by Tai et al. [43].
WFCNN: A convolutional neural network combined with sentiment sequences, proposed by Chen et al. [31].
NCSL: Teng et al. [41] treat the sentiment score of a sentence as a weighted sum of prior scores in the sentence, where the weights are learned by a neural network.
LR-Bi-LSTM: The linguistically regularized LSTM proposed by Qian et al. [42].
RNN-Capsule: An RNN-based sentiment classification capsule model proposed by Wang et al. [55].
Capsule-B: A CNN-based capsule model for sentence classification proposed by Yang et al. [72].
AC-BiLSTM: A bidirectional LSTM classification model with an attention mechanism and convolutional layers, proposed by Liu et al. [33].



CL+CNN: A convolutional neural network model combining critical learning and rule optimization for sentiment analysis, proposed by Zhang et al. [40].

3. Document-level network models:
RNTN+RNN: Each sentence is represented by a Recursive Neural Tensor Network (RNTN [71]) and the sentence representations are fed into a recurrent neural network (RNN); the hidden vectors of the RNN are then averaged to obtain a document representation for sentiment classification.
UPNN (CNN and no UP): Tang et al. [63] proposed using user and product preference matrices as extra information to train a CNN classifier (UPNN). UPNN (CNN no UP) uses only the CNN, without user and product information.
CIFG-LSTM/CIFG-BLSTM: The LSTM and BiLSTM with coupled input and forget gates proposed by Greff et al. [73], denoted CIFG-LSTM and CIFG-BLSTM respectively.
CLSTM: Xu et al. [64] proposed a cached LSTM for capturing semantic information in long text, with the variants CLSTM (using a regular LSTM) and B-CLSTM (using a BiLSTM).
NSC: The hierarchical model with word- and sentence-level average pooling proposed by Chen et al. [65]. NSC+LA uses local context as an attention mechanism to capture semantic information.

4.3. Results

4.3.1. Overall comparison

(a) Sentence-level sentiment classification

The comparison results for sentiment classification of sentence-level texts (MR, SST-5 and SST-2) are shown in Table 4. The performance of the models is evaluated by classification accuracy, and the results of the first 17 baseline methods are cited from the literature [33,40,42,43].

Table 4. The accuracy of sentence-level sentiment classification; "-" indicates that the method does not report a result on this dataset.
Methods (in order): SVM, MNB, NBSVM, Paragraph-Vec, CNN, RNN, RNTN, LSTM, BiLSTM, Tree-LSTM, WFCNN, NCSL, LR-Bi-LSTM, RNN-capsule, Capsule-B, AC-BiLSTM, CL+CNN, SAMF-BiLSTM.
MR: 79.0, 79.4, 81.5, 77.7, 75.9, 77.4, 78.5, 80.7, 82.9, 82.1, 83.8, 82.1, 83.2, 84.3; SAMF-BiLSTM: 83.3.
SST-5: 46.9, 43.2, 43.4, 45.6, 46.5, 48.1, 48.0, 47.1, 48.6, 49.3, 48.6, 48.9, -; SAMF-BiLSTM: 49.7.
SST-5 (+phrase): 40.7, 48.7, 48.0, 44.8, 45.7, 46.4, 49.1, 51.0, 49.6, 51.1, 50.6, 51.2; SAMF-BiLSTM: 51.8.
SST-2 (+phrase): 79.4, 87.8, 87.2, 82.4, 85.4, 84.9, 87.5, 88.0, 88.7, 89.1, 88.7, 88.3, 89.5; SAMF-BiLSTM: 89.7.

As can be seen from Table 4, our SAMF-BiLSTM achieves better results on the three datasets than the traditional classification methods SVM, MNB and NBSVM, which indicates that neural network models classify sentiment better than traditional methods. SAMF-BiLSTM is also superior to the other neural network models on all datasets except MR. On the SST-5 and SST-2 datasets, the SAMF-BiLSTM accuracies are 49.7%, 51.8% and 89.7%.



Compared with the three CNN-based methods (CNN, Capsule-B and CL+CNN), SAMF-BiLSTM gives better results on two of the three datasets, which indicates that the LSTM-based method used in this paper is more suitable for this task than the CNN-based methods. At the same time, SAMF-BiLSTM outperforms WFCNN, LR-Bi-LSTM and NCSL, which also model linguistic knowledge; this indicates the effectiveness of modeling the existing linguistic knowledge as different feature channels and letting the model learn the sentiment feature information of the sentence from different angles. The self-attention used in this paper also achieves better performance than the AC-BiLSTM method, which uses a conventional attention mechanism. Compared with the Tree-LSTM method, which relies on phrase-level annotations (its performance drops by 2.9% when trained with sentence-level annotations only), the SAMF-BiLSTM method does not rely on a parse tree; on SST-5 there is only a small difference between using and not using phrase-level annotations. In addition, CL+CNN is the only method that reaches 84.3% on the binary MR dataset; however, our method is not significantly different from CL+CNN, and SAMF-BiLSTM, which combines self-attention and linguistic features, is more practical than the CL+CNN model, which requires manual curation of an additional sentiment lexicon (negation words and transition words).

(b) Document-level sentiment classification

The comparison results for sentiment classification of document-level texts (YELP3 and IMDB) are shown in Table 5. The performance of the models is evaluated by classification accuracy, and the results of the top 13 baseline methods are cited from the literature [63-65].

Table 5. The accuracy of document-level sentiment classification.
Method | YELP3 | IMDB
AvgWordvec + SVM | 52.6 | 30.4
SSWE + SVM | 54.9 | 31.2
Paragraph-Vec | 55.4 | 34.1
RNTN+RNN | 57.4 | 40.1
UPNN (CNN and no UP) | 57.7 | 40.5
LSTM | 53.9 | 37.8
BiLSTM | 58.4 | 43.3
CIFG-LSTM | 57.3 | 39.1
CIFG-BLSTM | 59.2 | 44.5
CLSTM | 59.4 | 42.1
B-CLSTM | 59.8 | 46.2
NSC | 62.7 | 44.3
NSC+LA | 63.1 | 48.7
SAMF-BiLSTM | 59.5 | 45.6
SAMF-LSTM-D | 62.4 | 45.7
SAMF-BiLSTM-D | 63.8 | 48.9

As can be seen from Table 5, the proposed SAMF-BiLSTM-D method obtains better results (63.8% and 48.9%) than the other baselines on both datasets. Compared with RNTN+RNN, Paragraph-Vec, NSC and NSC+LA, the SAMF-BiLSTM-D method achieves better classification, which indicates the effectiveness of our method. At the same time, SAMF-BiLSTM-D compares favorably with CIFG-LSTM, CIFG-BLSTM, CLSTM and B-CLSTM, which modify the internal memory of the LSTM model. In addition, as can be seen from Table 5, on the document-level datasets (YELP3 and IMDB) SAMF-BiLSTM-D is better than the sentence-level SAMF-BiLSTM method, which indicates that SAMF-BiLSTM-D is more suitable for this task and can capture the sentiment tendencies of document-level text.


W. Li, F. Qi and M. Tang et al. / Neurocomputing xxx (xxxx) xxx Table 6 The accuracy for SAMF-BiLSTM with different self-attention mechanism.

MF-BiLSTM SAMF-BiLSTM (no Itpp ) SAMF-BiLSTM (no PVLN ) SAMF-BiLSTM (no Lnor ) SAMF-LSTM (all) SAMF-BiLSTM (our model)

MR

SST-5

SST-2

81.9 82.3 82.5 83.0 82.2 83.3

49.5 50.8 51.3 51.5 51.1 51.8

88.0 88.4 88.9 89.2 88.6 89.7

Table 8 The accuracy for SAMF-BiLSTM with different linguistic feature. Feature channel

SA-BiLSTM

Rwp √ × × √ √ × √

Table 7 The accuracy for SAMF-BiLSTM-D with different selfattention mechanism.

MF-BiLSTM-D SAMF-BiLSTM-D SAMF-BiLSTM-D SAMF-BiLSTM-D SAMF-BiLSTM-D SAMF-BiLSTM-D

(no Itpp ) (no PVLN ) (no Lnor ) (no wsatt ) (our model)

YELP3

IMDB

59.6 63.0 62.8 63.2 63.6 63.8

45.4 47.6 46.9 48.1 48.8 48.9

SST-5

SST-2

× × √

79.1 80.9 82.1

49.7 50.2 50.8

87.8 88.3 88.7

× √

× √ √

81.9 83.0 82.9

50.5 51.0 51.4

88.5 88.8 89.4





83.3

51.8

89.7

Rwt

× √ × √

Table 9 The accuracy for SAMF-BiLSTM-D with different linguistic feature. Feature channel

SA-BiLSTM-D

Rwp √ × × √ √ × √

SAMF-BiLSTM-D models in sentiment classification task can be verify. 4.3.2. Impact of each component of SAMF-BiLSTM SAMF-BiLSTM consists of two parts, the self-attention mechanism and multi-channel language features. It should be demonstrated that all components of SAMF-BiLSTM can be used for the final result. In this section, we will conduct a set of experiments to evaluate the effects of self-attention and multi-channel language features on the performance of the SAMF-BiLSTM and SAMFBiLSTM-D models. Since SAMF-BiLSTM does not rely on parse trees, it is similar to the use of phrase-level annotations on SST-5 that do not use phrase-level annotations. Therefore, in order to unify the analysis, we only used the SST-5 data of the phrase-level annotation in the experiment. (a) the impact of self-attention mechanism on models The proposed self-attention weight is composed of three parts: initial attention matrix Itpp , the self-auxiliary matrix PVLN and additional auxiliary matrix Lnor (see Fig. 3). In order to reveal the influence of self-attention on the model, we preserve the linguistic features of the model. On the five data sets, the selfattention weight adjustment experiments were performed on the SAMF-BiLSTM and SAMF-BiLSTM-D models. The observed results are shown in Tables 6 and 7. It can be seen from Tables 6 and 7 that the MF-BiLSTM and MF-BiLSTM-D classification without using the word attention mechanism are significantly inferior to the SAMF-BiLSTM (no Itpp ) and SAMF-BiLSTM using the word self-attention mechanism. This

Table 8
The accuracy of SAMF-BiLSTM (SA-BiLSTM with the indicated feature channels) with different linguistic features.

Rwp   Rwpa   Rwt     MR     SST-5   SST-2
√     ×      ×       79.1   49.7    87.8
×     ×      √       80.9   50.2    88.3
×     √      ×       82.1   50.8    88.7
√     ×      √       81.9   50.5    88.5
√     √      ×       83.0   51.0    88.8
×     √      √       82.9   51.4    89.4
√     √      √       83.3   51.8    89.7

Table 9
The accuracy of SAMF-BiLSTM-D (SA-BiLSTM-D with the indicated feature channels) with different linguistic features.

Rwp   Rwpa   Rwt     YELP3   IMDB
√     ×      ×       59.4    45.8
×     ×      √       60.3    46.3
×     √      ×       61.7    47.7
√     ×      √       62.9    47.4
√     √      ×       63.1    47.9
×     √      √       63.5    48.4
√     √      √       63.8    48.9

The gap between these model variants indicates that self-attention has a clear impact on our approach. By adjusting the self-attention weight, it can be observed that the initial attention matrix Itpp, the self-auxiliary matrix PVLN and the additional auxiliary matrix Lnor, which together determine the self-attention weight, all have a strong influence on the performance of SAMF-BiLSTM and SAMF-BiLSTM-D. In addition, SAMF-LSTM (all) is clearly worse than SAMF-BiLSTM (our model) even though both use the full self-attention weights, which shows that BiLSTM handles this sequence modeling task better than a unidirectional LSTM. At the same time, SAMF-BiLSTM (our model) and SAMF-BiLSTM-D (our model) with the full self-attention weights give the best results. This demonstrates that every ingredient of the self-attention mechanism contributes to the final results of SAMF-BiLSTM and SAMF-BiLSTM-D.

(b) the effect of different linguistic features

The multi-channel language features we propose include Rwp (word vectors and position values), Rwpa (word vectors and dependency-parsing features) and Rwt (word vectors and part-of-speech vectors) (see Fig. 1). To reveal the influence of linguistic features on the model, we performed linguistic feature adjustment experiments with the SAMF-BiLSTM and SAMF-BiLSTM-D models on the five datasets (see Table 3). From Tables 8 and 9 it can be seen that, as linguistic features are added, the complexity of the model increases and the performance fluctuates to some extent.
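The following minimal sketch illustrates how such channels can be assembled by concatenating a word vector with position, dependency-relation and part-of-speech embeddings. The embedding sizes, the lookup tables and the simple concatenation scheme are illustrative assumptions rather than the exact construction used in our model.

# Minimal sketch (illustrative, not the exact construction): building three
# feature channels by concatenating a word vector with a position embedding
# (Rwp), a dependency-relation embedding (Rwpa) and a POS-tag embedding (Rwt).
import numpy as np

rng = np.random.default_rng(0)
WORD_DIM, POS_DIM, DEP_DIM, TAG_DIM = 300, 25, 25, 30   # assumed sizes

word_emb = {"movie": rng.normal(size=WORD_DIM), "great": rng.normal(size=WORD_DIM)}
position_emb = rng.normal(size=(100, POS_DIM))           # index = token position
dep_emb = {"nsubj": rng.normal(size=DEP_DIM), "amod": rng.normal(size=DEP_DIM)}
tag_emb = {"NN": rng.normal(size=TAG_DIM), "JJ": rng.normal(size=TAG_DIM)}

def build_channels(tokens, deps, tags):
    # tokens: list of words; deps: dependency relation per token; tags: POS tag per token.
    w = np.stack([word_emb[t] for t in tokens])
    rwp = np.concatenate([w, position_emb[:len(tokens)]], axis=1)                 # word + position
    rwpa = np.concatenate([w, np.stack([dep_emb[d] for d in deps])], axis=1)      # word + dependency
    rwt = np.concatenate([w, np.stack([tag_emb[t] for t in tags])], axis=1)       # word + POS tag
    return rwp, rwpa, rwt

rwp, rwpa, rwt = build_channels(["great", "movie"], ["amod", "nsubj"], ["JJ", "NN"])
print(rwp.shape, rwpa.shape, rwt.shape)   # (2, 325) (2, 325) (2, 330)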

Fig. 5. Influence of parts-of-speech features in different dimensions.


Fig. 6. Influence of dependency parsing features in different dimensions.

However, the overall performance of the model increases as linguistic features are added. The three-channel SAMF-BiLSTM and SAMF-BiLSTM-D improve accuracy by 1.8%–4.4% compared with the models that use only word features, and Rwt and Rwpa play a key role in this improvement. This demonstrates that multi-channel linguistic features can further improve the performance of SAMF-BiLSTM and SAMF-BiLSTM-D.

4.3.3. Influence of vector sizes and different word embeddings

From the linguistic feature adjustment experiments we conclude that, on top of the word vectors, the part-of-speech features and the dependency-parsing features play key roles in the classification effect. Therefore, we further analyze the part-of-speech features, the dependency-parsing features and the word vectors. Figs. 5 and 6 show the performance of the SAMF-BiLSTM and SAMF-BiLSTM-D models with part-of-speech and dependency-parsing features of different dimensions; the vector sizes are taken from the set {10, 20, 25, 30, 50, 100, 200}. As can be seen from Fig. 5, the accuracy on the four datasets (MR, SST-2, YELP3 and IMDB) tends to increase as the size of the part-of-speech vector grows. On the MR and SST-2 datasets the model fluctuates once the part-of-speech vector size exceeds 30, and the classification accuracy shows a downward trend as the dimension increases further, while on the YELP3 and IMDB datasets the performance tends to stabilize. Fig. 6 shows that the model performance tends to be stable once the dependency-parsing vector size exceeds 25. Therefore, choosing an appropriate part-of-speech and dependency-parsing dimension yields better performance.

Fig. 7 shows the performance of the SAMF-BiLSTM and SAMF-BiLSTM-D models with different word embedding sizes and different initializations. We use vector sizes from the set {50, 100, 150, 200, 300} and compare pre-trained and randomly initialized word embeddings; note that the dimensions of all the units in the model change accordingly. As can be seen from Fig. 7, SAMF-BiLSTM and SAMF-BiLSTM-D with pre-trained word embeddings are better than their counterparts with random word embeddings on all datasets. As the vector size grows, the models with pre-trained embeddings exhibit a steady upward trend, whereas those with random embeddings begin to fluctuate once the vector size exceeds 150. The experimental results show that, at the best settings, the gap between the two initialization schemes is in the range of 1.6%–2.3%. Therefore, pre-trained word embeddings are more suitable for SAMF-BiLSTM and SAMF-BiLSTM-D.
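The following minimal PyTorch sketch contrasts the two initialization schemes compared in Fig. 7. The vocabulary, the dimensions and the pre-trained vector file (assumed to be in GloVe text format) are illustrative assumptions, not the exact setup of our experiments.

# Minimal sketch: random vs. pre-trained word embedding initialization.
# File name, vocabulary and dimensions are hypothetical.
import numpy as np
import torch
import torch.nn as nn

def build_embedding(vocab, dim=300, pretrained_path=None):
    # vocab: dict mapping word -> index. Returns an nn.Embedding layer.
    if pretrained_path is None:
        # random initialization: every weight is trained from scratch
        return nn.Embedding(len(vocab), dim)
    weights = np.random.normal(scale=0.1, size=(len(vocab), dim)).astype("float32")
    with open(pretrained_path, encoding="utf-8") as f:
        for line in f:                        # GloVe text format: word v1 ... vd
            parts = line.rstrip().split(" ")
            if parts[0] in vocab and len(parts) == dim + 1:
                weights[vocab[parts[0]]] = np.asarray(parts[1:], dtype="float32")
    # freeze=False lets the pre-trained vectors be fine-tuned during training
    return nn.Embedding.from_pretrained(torch.tensor(weights), freeze=False)

vocab = {"<pad>": 0, "great": 1, "movie": 2}
emb_random = build_embedding(vocab, dim=300)                        # random init
# emb_glove = build_embedding(vocab, 300, "glove.840B.300d.txt")    # pre-trained init (hypothetical path)
print(emb_random(torch.tensor([1, 2])).shape)                       # torch.Size([2, 300])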

Fig. 7. Influence of different word embedding and vector size.


Fig. 8. Self-attention visualization of the three feature channels. Ove1, Ove2 and Ove3 denote the self-attention scoring vectors of the text obtained through the three channels. (a) Answer: positive. Prediction: positive. (b) Answer: positive. Prediction: negative.

4.4. Discussions

For SAMF-BiLSTM and SAMF-BiLSTM-D, BiLSTM can access contextual information and learn the context of each word in the text more effectively. The multi-channel feature input allows BiLSTM to learn sentiment feature information from a sentence from different angles. Self-attention reduces the dependence on external information: it computes dependencies between words directly, regardless of their distance, learns the weight each word contributes to the sentiment tendency of the sentence, and strengthens the emotional features in the sentence, so that the model can learn more hidden feature information. The combination of these components makes the semantic understanding of sentences more accurate and improves the classification capability of SAMF-BiLSTM and SAMF-BiLSTM-D.

The experimental results show that the self-attention mechanism, BiLSTM and the multi-channel language features are all important for the performance of SAMF-BiLSTM and SAMF-BiLSTM-D. Among them, the multi-channel features have a greater impact on classification accuracy than self-attention. At the same time, the dimensions of the part-of-speech and syntactic features in the multi-channel input also affect the performance of SAMF-BiLSTM and SAMF-BiLSTM-D.

For text classification tasks, the way the word embedding vectors are generated affects the classification accuracy. Compared with pre-trained word embeddings, randomly initialized word embeddings require training more parameters, and the classification performance fluctuates with different random initializations. The experimental results show that pre-trained word embeddings achieve better results than random word embeddings, so pre-trained embeddings are more suitable for SAMF-BiLSTM and SAMF-BiLSTM-D.

All experimental results show that the combination of the self-attention mechanism, BiLSTM and multi-channel features significantly improves the accuracy of text classification. On most benchmark datasets, SAMF-BiLSTM and SAMF-BiLSTM-D achieve better results than the other baselines, which shows that they have stronger classification capabilities.

4.5. Case study

To further analyze the advantages of our model over BiLSTM (no self-attention, no multi-channel features), MF-BiLSTM (multi-channel features but no self-attention), WFCNN (a CNN using sentiment sequence features) and LR-Bi-LSTM (an LSTM with linguistic features), we used the trained SAMF-BiLSTM, BiLSTM, MF-BiLSTM, WFCNN and LR-Bi-LSTM models to predict several specific examples. Since SAMF-BiLSTM-D is based on SAMF-BiLSTM, we only analyze SAMF-BiLSTM in this section.

Table 10
Analysis of typical sample cases.

ID  Text
1   all of the elements are in place for a great film noir, but director george hickenlooper's approach to the material is too upbeat.
    Success: SAMF-BiLSTM (√), LR-Bi-LSTM (×), MF-BiLSTM (×), BiLSTM (×), WFCNN (×)
2   After discovering the use of Samsung mobile phones, my Weibo is full of typos! Can't stand it! Be careful! Be careful!
    Success: SAMF-BiLSTM (√), LR-Bi-LSTM (√), MF-BiLSTM (√), BiLSTM (×), WFCNN (×)

The sample classification results are shown in Table 10. In example 2, the sentiment words do not act alone; the sentiment of the whole sentence is expressed through the word sequence combined with the sentence semantics. WFCNN misclassifies this example because the features it extracts only capture locally adjacent words. Although BiLSTM has a strong ability to capture contextual semantics, example 2 contains many positive and negative sentiment words, and BiLSTM misclassifies it because these special sentiment words are not treated explicitly. The SAMF-BiLSTM, LR-Bi-LSTM and MF-BiLSTM models make full use of language knowledge: they not only capture contextual semantics, but also enhance the sentiment words in the text according to the surrounding context, so they classify the example correctly. For the text with “but” in example 1, LR-Bi-LSTM fails because its regularizer has limitations: it does not consider the dependency structure of the sentence, but directly adjusts the intensity of the sentiment words over the entire text. Our proposed SAMF-BiLSTM, in contrast, can learn the sentiment in the text from the sentence structure, the positions of the words and the parts of speech, and therefore classifies the example correctly.

In addition, we visualize two cases from the MR test set in Fig. 8 to explain how the multi-channel features and self-attention of SAMF-BiLSTM work. The depth of the color indicates how important a word is: the darker the color, the more important the word. Fig. 8(a) is an example with a “but” clause, whose polarity is determined by the clause after “but”. We observe that the attention score vector Ove1 highlights the two most salient words, “flawed” and “engrossing”. The attention score vector Ove2, with the help of position, part-of-speech and syntactic information, highlights “engrossing” without being distracted by unrelated words.
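The word-importance shading in Fig. 8 can be approximated in plain text from the attention score vectors. In the following minimal sketch the tokens and scores are made-up illustrative values, not the actual data behind Fig. 8.

# Minimal sketch (illustrative tokens and scores): rendering a word-level
# attention distribution as a simple text "heat map", where a denser
# character stands for a larger attention weight.
import numpy as np

def show_attention(tokens, scores, levels=" .:-=+*#%@"):
    # Map softmax-normalized attention weights to characters of increasing density.
    alpha = np.exp(scores - np.max(scores))
    alpha = alpha / alpha.sum()
    for tok, a in zip(tokens, alpha):
        shade = levels[min(int(a * len(levels) / alpha.max()), len(levels) - 1)]
        print(f"{tok:>12s}  {a:5.2f}  {shade * 10}")

tokens = ["the", "film", "is", "flawed", "but", "engrossing"]
scores = np.array([0.1, 0.4, 0.1, 1.2, 0.9, 2.0])   # hypothetical Ove scores
show_attention(tokens, scores)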


For the attention score vector Ove3, with the syntax of the sentence, the part of speech and the position as aids, “but” is strengthened and transfers its effect to “engrossing”, so “engrossing” is shaded slightly darker than “flawed”. Therefore, SAMF-BiLSTM predicts this sample correctly. Fig. 8(b) is another example with a “but” clause. In general, when no target word is specified, the polarity of such a sample is determined by the clause after “but”, and taken as a whole the sentence is negative. However, there are two possible target words in the sentence, “film” and “book”: with “film” as the target word the sentence is positive, whereas with “book” as the target word it would be judged negative. Since this sample belongs to the MR movie review dataset, it should be judged positive, i.e., with “film” as the target word. As can be seen from Fig. 8, SAMF-BiLSTM does not take the target word “film” as the center of its prediction; instead, it starts from the sentence structure, focuses on the clause after “but” and makes a wrong judgment.

4.6. Error analysis

To better understand the limitations of our model, we analyzed the errors made by SAMF-BiLSTM. Specifically, we randomly selected 50 instances that SAMF-BiLSTM predicted incorrectly from the test set of the MR movie review dataset, and we group the reasons for these classification errors into the following categories. First, SAMF-BiLSTM cannot reliably predict text with multiple target words. For example, for the sentence “intriguing and beautiful film, but those of you who read the book are likely to be disappointed.”, it is not certain whether the target word is “film” or “book”, and our model misjudges the sentence based on its structure, position and part-of-speech information and on the word “book” after “but”. Second, if the text is too long, the multi-channel features become sparse, which affects the distribution of the self-attention weights and, in turn, the classification effect.

5. Conclusions

We propose a bidirectional LSTM with a self-attention mechanism and multi-channel features. The model consists of two parts, the self-attention mechanism and the multi-channel features. First, the existing linguistic knowledge and sentiment resources of the sentiment analysis task are modeled and different feature channels are generated as the input of the model. Then BiLSTM is used to fully extract the effective sentiment resource information. Finally, the self-attention mechanism is used to strengthen the important information. In addition, we propose the SAMF-BiLSTM-D model, built on SAMF-BiLSTM, for document-level text classification: it uses SAMF-BiLSTM to obtain the representations of all sentences in a document and then applies another BiLSTM over these sentence representations to obtain the representation of the entire document. We conducted an experimental evaluation on five sentiment classification datasets to verify the performance of our models. The experimental results show that, in most cases, SAMF-BiLSTM and SAMF-BiLSTM-D achieve better classification performance than several advanced baseline methods.

Future work will focus on the study of attention mechanisms and the design of model structures for document-level classification tasks. It can be divided into the following aspects: (1) Improving our method by introducing other attention mechanisms; (2) Designing new attention mechanisms and network models for specific document-level text


classification tasks; (3) Apply our method to the practical scenarios. Declaration of Competing Interest The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. CRediT authorship contribution statement Weijiang Li: Conceptualization, Methodology, Project administration. Fang Qi: Software, Validation, Data curation, Writing original draft. Ming Tang: Validation, Visualization, Investigation. Zhengtao Yu: Supervision, Writing - review & editing. Acknowledgments This work is supported by the National Natural Science Foundation of China (61363045); The Key Project of Yunnan Nature Science Foundation (No. 2013FA130); Science and technology innovation talents fund projects of Ministry of Science and Technology (No. 2014HE001) References [1] E. Cambria, D. Das, S. Bandyopadhyay, A. Feraco, A Practical Guide to Sentiment Analysis, Springer, 2017. [2] S. Poria, E. Cambria, R. Bajpai, A. Hussain, A review of affective computing: from unimodal analysis to multimodal fusion, Inf. Fusion 37 (2017) 98–125. [3] I. Chaturvedi, E. Ragusa, P. Gastaldo, R. Zunino, E. Cambria, Bayesian network based extreme learning machine for subjectivity detection, J. Frankl. Inst. 355 (4) (2018) 1780–1797. [4] E. Cambria, Affective computing and sentiment analysis, IEEE Intell. Syst. 31 (2) (2016) 102–107. [5] A. Ortony, G.L. Clore, A. Collins, The Cognitive Structure of Emotions, Cambridge University Press, 1990. [6] J. Wiebe, T. Wilson, C. Cardie, Annotating expressions of opinions and emotions in language, Lang. Resour. Eval. 39 (2–3) (2005) 165–210. [7] C. Strapparava, A. Valitutti, Wordnet affect: an affective extension of wordnet, in: Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004), 4, Citeseer, 2004, pp. 1083–1086. [8] A. Esuli, F. Sebastiani, Sentiwordnet: A publicly available lexical resource for opinion mining, in: Proceedings of the LREC, 6, Citeseer, 2006, pp. 417–422. [9] E. Cambria, D. Olsher, D. Rajagopal, Senticnet 3: a common and common-sense knowledge base for cognition-driven sentiment analysis, in: Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, AAAI Press, 2014, pp. 1515–1521. [10] S. Somasundaran, J. Wiebe, J. Ruppenhofer, Discourse level opinion interpretation, in: Proceedings of the 22nd International Conference on Computational Linguistics-Volume 1, Association for Computational Linguistics, 2008, pp. 801–808. [11] D. Rao, D. Ravichandran, Semi-supervised polarity lexicon induction, in: Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, 2009, pp. 675–682. [12] S. Poria, E. Cambria, A. Gelbukh, F. Bisio, A. Hussain, Sentiment data flow analysis by means of dynamic linguistic patterns, IEEE Comput. Intell. Mag. 10 (4) (2015) 26–36. [13] X. Glorot, A. Bordes, Y. Bengio, Domain adaptation for large-scale sentiment classification: A deep learning approach, in: Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 513–520. [14] R.Y. Lau, Y. Xia, Y. Ye, A probabilistic generative model for mining cybercriminal networks from online social media, IEEE Comput. Intell. Mag. 9 (1) (2014) 31–43. [15] Y. Xia, E. Cambria, A. Hussain, H. Zhao, Word polarity disambiguation using Bayesian model and opinion-level features, Cognit. Comput. 7 (3) (2015) 369–380. [16] M. Dragoni, A.G. 
Tettamanzi, C. da Costa Pereira, A fuzzy system for concept-level sentiment analysis, in: Semantic Web Evaluation Challenge, Springer, 2014, pp. 21–27. [17] D.R. Recupero, V. Presutti, S. Consoli, A. Gangemi, A.G. Nuzzolese, Sentilo: frame-based sentiment analysis, Cognit. Comput. 7 (2) (2015) 211–225. [18] J.M. Chenlo, D.E. Losada, An empirical study of sentence features for subjectivity and polarity classification, Inf. Sci. (Ny) 280 (2014) 275–288. [19] E. Cambria, B. White, Jumping NLP curves: a review of natural language processing research, IEEE Comput. Intell. Mag. 9 (2) (2014) 48–57.


[20] A. Joshi, A. Balamurali, P. Bhattacharyya, R. Mohanty, C-feel-it: a sentiment analyzer for micro-Blogs, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Systems Demonstrations, Association for Computational Linguistics, 2011, pp. 127–132. [21] P. Chesley, B. Vincent, L. Xu, R.K. Srihari, Using verbs and adjectives to automatically classify blog sentiment, Training 580 (263) (2006) 233. [22] E. Boiy, M.-F. Moens, A machine learning approach to sentiment analysis in multilingual web texts, Inf. Retr. Boston 12 (5) (2009) 526–558. [23] Q. Ye, Z. Zhang, R. Law, Sentiment classification of online reviews to travel destinations by supervised machine learning approaches, Expert Syst. Appl. 36 (3) (2009) 6527–6535. [24] Y. Kim, Convolutional neural networks for sentence classification, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1746–1751. [25] N. Kalchbrenner, E. Grefenstette, P. Blunsom, A convolutional neural network for modelling sentences, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2014, pp. 655–665. [26] T. Lei, R. Barzilay, T. Jaakkola, Molding CNNS for text: non-linear, non-consecutive convolutions, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1565–1575. [27] K.-i. Funahashi, Y. Nakamura, Approximation of dynamical systems by continuous time recurrent neural networks, Neural Netw. 6 (6) (1993) 801–806. [28] X. Zhu, P. Sobihani, H. Guo, Long short-term memory over recursive structures, in: Proceedings of the International Conference on Machine Learning, 2015, pp. 1604–1612. [29] A. Graves, J. Schmidhuber, Framewise phoneme classification with bidirectional LSTM and other neural network architectures, Neural Netw. 18 (5–6) (2005) 602–610. [30] Q. Li, S. Shah, Learning stock market sentiment lexicon and sentiment-oriented word vector from stocktwits, in: Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), 2017, pp. 301–310. [31] Z. Chen, R. Xu, L. Gui, Q. Lu, Combining convolutional neural networks and word sentiment sequence features for chinese text sentiment classification, J. Chinese Inf. Process. (in China) 29 (6) (2015) 172–178. [32] S. Pei, L. Wang, Text sentiment analysis based on attention mechanism, Comput. Eng. Sci. (in China) 41 (2) (2019) 344–353. [33] G. Liu, J. Guo, Bidirectional lstm with attention mechanism and convolutional layer for text classification, Neurocomputing 337 (2019) 325–338. [34] D. Tang, F. Wei, N. Yang, M. Zhou, T. Liu, B. Qin, Learning sentiment-specific word embedding for twitter sentiment classification, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2014, pp. 1555–1565. [35] F. Huang, S. Feng, D. Wang, G. Yu, Mining topic sentiment in microblogging based on multi-feature fusion, Chinese J. Comput. (in China) 40 (4) (2017) 872–888. [36] F. Huang, G. Yu, J. Zhang, C. Li, C. Yuan, J. Lu, Mining topic sentiment in micro-blogging based on micro-blogger social relation, J. Softw. (in China) 28 (3) (2017) 694–707. [37] D.T. Vo, Y. Zhang, Donâ;;t count, predict! an automatic approach to learning sentiment lexicons for short text, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2016, pp. 219–224. [38] Y. Chen, S. 
Skiena, Building sentiment lexicons for all major languages, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2014, pp. 383–389. [39] K. Chen, B. Liang, W. Ke, Chinese micro-blog sentiment analysis based on multi-channels convolutional neural networks, J. Comput. Res. Devel. 55 (5) (2018) 945–957. [40] B. Zhang, X. Xu, X. Li, X. Chen, Y. Ye, Z. Wang, Sentiment analysis through critic learning for optimizing convolutional neural networks with rules, Neurocomputing 356 (2019) 21–30. [41] Z. Teng, D.T. Vo, Y. Zhang, Context-sensitive lexicon features for neural sentiment analysis, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2016, pp. 1629–1638. [42] Q. Qian, M. Huang, J. Lei, X. Zhu, Linguistically regularized LSTM for sentiment classification, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 1679–1689. [43] K.S. Tai, R. Socher, C.D. Manning, Improved semantic representations from tree-structured long short-term memory networks, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2015, pp. 1556–1566. [44] E. Cambria, S. Poria, D. Hazarika, K. Kwok, Senticnet 5: Discovering conceptual primitives for sentiment analysis by means of context embeddings, in: Thirty-Second AAAI Conference on Artificial Intelligence, 2018, pp. 1795–1802. [45] Y. Li, Q. Pan, T. Yang, S. Wang, J. Tang, E. Cambria, Learning word representations for sentiment analysis, Cognit. Comput. 9 (6) (2017) 843–851. [46] D. Bahdanau, K. Cho, Y. Bengio, Neural Machine Translation by Jointly Learning to Align and Translate, International Conference on Learning Representations, 2015. [47] D. Ma, S. Li, X. Zhang, H. Wang, Interactive attention networks for aspect-level sentiment classification, in: Proceedings of the 26th International Joint Conference on Artificial Intelligence, AAAI Press, 2017, pp. 4068–4074.

[48] Y. Wang, M. Huang, X. Zhu, L. Zhao, Attention-based LSTM for aspect-level sentiment classification, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2016, pp. 606–615. [49] Q. Liu, H. Zhang, Y. Zeng, Z. Huang, Z. Wu, Content attention model for aspect based sentiment analysis, in: Proceedings of the World Wide Web Conference, International World Wide Web Conferences Steering Committee, 2018, pp. 1023–1032. [50] P. Guan, B. Li, X. Lv, J. Zhou, Attention enhanced bi-directional LSTM for sentiment analysis, J. Chinese Inf. Process. (in China) 33 (2) (2019) 105–111. [51] Y. Ma, H. Peng, E. Cambria, Targeted aspect-based sentiment analysis via embedding commonsense knowledge into an attentive LSTM, in: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018, pp. 5876–5883. [52] X. Zhou, X. Wan, J. Xiao, Attention-based LSTM network for cross-lingual sentiment classification, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2016, pp. 247–256. [53] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Proceedings of the Advances in Neural Information Processing Systems, 2017, pp. 5998–6008. [54] Z. Lin, M. Feng, C.N. dos Santos, M. Yu, B. Xiang, B. Zhou, Y. Bengio, A structured self-attentive sentence embedding, CoRR abs/1703.03130 (2017). [55] Y. Wang, A. Sun, J. Han, Y. Liu, X. Zhu, Sentiment analysis by capsules, in: Proceedings of the World Wide Web Conference, International World Wide Web Conferences Steering Committee, 2018, pp. 1165–1174. [56] W. Zhao, et al., Towards Scalable and Reliable Capsule Networks for Challenging NLP Applications, 2019 Meeting of the Association for Computational Linguistics 1549–1559. [57] B. Liang, Q. Liu, J. Xu, Q. Zhou, P. Zhang, Aspect-based sentiment analysis based on multi-attention CNN, J. Comput. Res. Develop. (in China) 54 (8) (2017) 1724–1735. [58] B. Huang, Y. Ou, K.M. Carley, Aspect level sentiment classification with attention-over-attention neural networks, in: Proceedings of the International Conference on Social Computing, Behavioral-Cultural Modeling and Prediction and Behavior Representation in Modeling and Simulation, Springer, 2018, pp. 197–206. [59] M. Yang, Q. Qu, X. Chen, C. Guo, Y. Shen, K. Lei, Feature-enhanced attention network for target-dependent sentiment classification, Neurocomputing 307 (2018) 91–97. [60] Z. Lei, Y. Yang, M. Yang, Y. Liu, A multi-sentiment-resource enhanced attention network for sentiment classification, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2018, pp. 758–763. [61] J.L. Ba, J.R. Kiros, G.E. Hinton, Layer normalization, Stat 1050 (2016) 21. [62] Q. Le, T. Mikolov, Distributed representations of sentences and documents, in: Proceedings of the International conference on machine learning, 2014, pp. 1188–1196. [63] D. Tang, B. Qin, T. Liu, Learning semantic representations of users and products for document level sentiment classification, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2015, pp. 1014–1023. [64] J. Xu, D. Chen, X. Qiu, X. 
Huang, Cached long short-term memory neural networks for document-level sentiment classification, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2016, pp. 1660–1669. [65] H. Chen, M. Sun, C. Tu, Y. Lin, Z. Liu, Neural sentiment classification with user and product attention, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2016, pp. 1650–1659. [66] J. Pennington, R. Socher, C. Manning, Glove: Global vectors for word representation, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543. [67] M.D. Zeiler, ADADELTA: An Adaptive Learning Rate Method, 2012. [68] Y. Liu, J.-W. Bi, Z.-P. Fan, A method for multi-class sentiment classification based on an improved one-vs-one (ovo) strategy and the support vector machine (SVM) algorithm, Inf. Sci. (Ny) 394 (2017) 38–52. [69] S. Wang, C.D. Manning, Baselines and bigrams: Simple, good sentiment and topic classification, in: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers, Association for Computational Linguistics, 2012, pp. 90–94. [70] R. Socher, J. Pennington, E.H. Huang, A.Y. Ng, C.D. Manning, Semi-supervised recursive autoencoders for predicting sentiment distributions, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2011, pp. 151–161. [71] R. Socher, A. Perelygin, J. Wu, J. Chuang, C.D. Manning, A. Ng, C. Potts, Recursive deep models for semantic compositionality over a sentiment treebank, in: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013, pp. 1631–1642. [72] M. Yang, W. Zhao, J. Ye, Z. Lei, Z. Zhao, S. Zhang, Investigating capsule networks with dynamic routing for text classification, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2018, pp. 3110– 3119. [73] K. Greff, R.K. Srivastava, J. Koutník, B.R. Steunebrink, J. Schmidhuber, Lstm: a search space odyssey, IEEE Trans. Neural Netw. Learn Syst. 28 (10) (2016) 2222–2232.



Weijiang Li received the B.A. degree in computer science and engineering from Dalian University of Technology, Dalian, China, in 1991, and the M.S. and Ph.D. degrees in computer science from the Department of Computer Science, Harbin Institute of Technology, Harbin, China, in 2004 and 2008, respectively. Since May 2008, he has been an Assistant Professor in the Computer Application Key Laboratory of Yunnan Province, Kunming University of Science and Technology, Kunming, China. His research interests include natural language processing and information retrieval.

Ming Tang is currently a postgraduate student in the Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China. His current research interests include natural language processing and machine learning.

Fang Qi is currently a postgraduate student in the Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China. Her current research interests include deep learning technology and natural language processing.

Zhengtao Yu received his Ph.D. degree in computer application technology from Beijing Institute of Technology, Beijing, China, in 2005. He is currently a professor in the School of Information Engineering and Automation, Kunming University of Science and Technology, China. His main research interests include natural language processing, information retrieval and machine learning.
