Using a Stacked Residual LSTM Model for Sentiment Intensity Prediction


Jin Wang, Bo Peng, and Xuejie Zhang*

School of Information Science and Engineering, Yunnan University, Kunming, PR China

* Corresponding author. Tel.: +86-135-0871-5558. E-mail address: [email protected].

Abstract


The sentiment intensity of a text indicates the strength of its association with positive sentiment, expressed as a continuous real value between 0 and 1. Compared to polarity classification, predicting the sentiment intensities of texts enables more fine-grained sentiment analysis. By introducing word embedding techniques, recent studies that use deep neural models have outperformed existing lexicon- and regression-based methods for sentiment intensity prediction. A common way to obtain better performance from a neural network is to add more layers so that it learns higher-level features. However, as the depth increases, the network degrades and becomes more difficult to train, because errors accumulate between layers and gradients vanish. To address this problem, this paper proposes a stacked residual LSTM model to predict the sentiment intensity of a given text. After investigating the performance of shallow and deep architectures, we introduce a residual connection to every few LSTM layers to construct an 8-layer neural network. The residual connections center layer gradients and propagated errors, which makes the deeper network easier to optimize. This approach enables us to successfully stack more LSTM layers for this task and improves the prediction accuracy over existing methods. Experimental results show that the proposed method outperforms the lexicon-, regression-, and conventional NN-based methods proposed in previous studies.


Keywords: Sentiment Intensity Prediction; Stacked Residual LSTM; Neural Network; Sentiment Analysis

1. Introduction


Online social networking services (SNSs), such as Twitter, Facebook and Weibo, enable users to share their thoughts, opinions, and emotions with others through texts that are informal and strongly subjective. Analyzing such user-generated information is very useful for understanding how sentiments spread from person to person on the Internet. Sentiment analysis techniques [1]–[5] provide a way to handle such affective information automatically.


As an active research field in computational linguistics and affective computing [6], sentiment analysis can analyze, process, and draw inferences from such subjective texts and their affective information. Most existing methods of sentiment analysis focus on polarity classification, which assigns the target texts to several categories, e.g., positive or negative. Such methods mostly use classification models: they first extract features such as n-grams, bag-of-words (BOW) or part-of-speech (POS) tags [7], [8], and then apply support vector machines (SVM), naïve Bayes, maximum entropy, logistic regression or random forests to these features to classify texts as either positive or negative [9].

Alternatively, sentiment intensity prediction can be another choice for sentiment analysis. More specifically, the sentiment intensity of a word, phrase or text indicates the strength of its association with positive sentiment, also known as a valence value [10], [11] or affective rating. It is a score between 0 and 1, where 0 and 1 respectively indicate the least and the greatest association with positive sentiment (i.e., negative and positive). In contrast to the traditional approach, intensity prediction is usually treated as a regression problem rather than a classification problem, since sentiment intensity is defined as a continuous real value. The following three movie reviews, taken from the Stanford Sentiment Treebank (SST) corpus [12] (http://nlp.stanford.edu/sentiment/), were rated in both polarity and sentiment intensity:


(Text 1, negative, senti=0.375) The movie is genial but never inspired, and little about it will stay with you.
(Text 2, negative, senti=0.194) However, the movie does not really deliver for country music fans or for family audiences.
(Text 3, negative, senti=0.083) Bears are even worse than I imagined a movie ever could be.


All three reviews were classified as negative. However, the third review, which was rated with a lower intensity (more negative) than the other two, deserves higher priority and more attention. In addition, the intensity prediction approach can enable more intelligent and fine-grained sentiment applications, such as hotspot detection and forecasting [13], mental illness identification [14], financial news analysis [15], question answering [16], and blog post analysis [17].

Few studies have sought to predict continuous affective ratings of texts, using lexicon- and regression-based methods. The lexicon-based methods rely on the underlying assumption that the intensity of a given text can be estimated via the composition of the intensities of its constituent words [18]. Another approach uses regression-based methods [17], [19], [20], which seek to learn the correlations between sentiment intensities and linguistic features of words, e.g., BOW and POS. However, the prediction performance of such methods is still low.

Recently, several classification methods have explored the use of deep neural networks and word embeddings, such as convolutional neural networks (CNN) [21], [22], recurrent neural networks (RNN) [23], [24] and long short-term memory (LSTM) [25], [26]. CNN [21], [22] is able to extract active local n-gram features, whereas LSTM [25], [26] models texts sequentially, focusing on past information and drawing conclusions from the entire text. However, such NN methods are typically used as classifiers to distinguish whether a given text is positive or negative; they have not been thoroughly investigated for sentiment intensity prediction.

In image recognition, several recently proposed models, such as VGG [27], InceptionNet [28], [29] and ResNet [30], have all exploited very "deep" architectures. The success of these models reveals that increasing the depth of a neural network can help improve the performance of learning models, as deeper networks learn better representations of features [31]. For language modeling tasks, a feasible way of applying a deep architecture is to use a stacked CNN [32] or LSTM [33], [34] model. The question then arises: is training a better prediction model as simple as stacking more layers?




Fig. 1: Mean absolute error (MAE) and Pearson correlation coefficient (r) on SST corpus with 1-layer and 8-layer plain LSTM model. The deeper network has higher training error and thus testing error.


In fact, deeper networks are easily affected by the degradation problem: as the network gains more layers, prediction accuracy becomes saturated and then degrades, because errors accumulate between layers and gradients vanish. To illustrate this phenomenon, we trained a conventional 2-layer LSTM and a stacked 8-layer plain LSTM on SST for sentiment intensity prediction. Figure 1 shows the mean absolute error and Pearson correlation coefficient of these two models on the training and testing sets. Unexpectedly, the deeper network has higher training and test error. Similar results on other corpora are shown in Fig. 4. This result indicates that the more layers are stacked, the harder the model is to optimize.

In this paper, we propose a stacked residual LSTM model to predict the sentiment intensity of a given text. To tackle the degradation problem, we introduce a residual connection to every few LSTM layers, inspired by ResNet [30]. The residual connection centers layer gradients and propagated errors, which makes the network easier to optimize. This approach enables us to successfully stack more LSTM layers for NLP tasks. Just as stacked deep convolutional networks extract different levels of features, from pixels to shapes and contours, in image processing tasks [28]–[30], the proposed stacked LSTM model can extract higher-level sequence features from lower-level n-gram features to form a hierarchical representation. These features correspond to linguistic function blocks, which could be words, phrases, clauses, sentences, or even paragraphs.

Experiments were conducted on four English and Chinese corpora to evaluate the performance of the stacked residual LSTM model. We first investigate the degradation problem in stacked LSTM when the model is deepened. Next, the proposed model is compared with several previously proposed methods, including traditional lexicon- and regression-based methods and conventional deep neural network-based models such as CNN, LSTM and RNN. Besides sentiment intensity prediction, this stacked model with residual connections could also be used to build various time series prediction applications, such as short-term electrical load forecasting [35], solar irradiation forecasting [36], QoS estimation for streaming services [37], [38] and video sequence recognition [39], [40].

The rest of this paper is organized as follows. Section 2 offers a brief review of related work. Section 3 describes the proposed neural network model and residual architecture. Section 4 summarizes the comparative results of different methods for sentiment intensity prediction. The study's conclusions are presented in Section 5.


2. Related Work

The sentiment intensity of a word, phrase or sentence indicates the strength of its positive emotion. In this section, we present a brief review of sentiment intensity prediction for texts, covering lexicon-, regression- and conventional neural network-based methods.

2.1. Lexicon-Based Methods


Lexicon-based methods are based on the underlying assumption that the intensity of a text can be estimated via the composition of the intensities of its constituent words. An affective lexicon, in which affective words are tagged with sentiment ratings, is used as the basis for scoring each text in these methods. Given the affective scores of words, one may calculate the affective score of a text through different composition methods. An intuitive composition method is the arithmetic mean; that is, the sentiment intensity of a text t can be predicted as the average sentiment intensity of the words w in the text, defined as

    senti_t = \frac{1}{n} \sum_{w \in t} senti_w        (1)


where senti_t and senti_w respectively denote the sentiment intensity of text t and word w, and n is the number of words in t. Instead of simply using the arithmetic mean of the affective values of words, Paltoglou et al. [18] used three different methods to estimate a sentence's overall sentiment score: weighted arithmetic mean, weighted geometric mean, and a Gaussian mixture model. Their experimental results show that the weighted geometric mean outperforms the other two methods. Although lexicon-based methods can be easily applied, they cannot model a sentence or document with complex linguistic expressions. For example, if a positive review contains more negative words than positive words, it will be incorrectly predicted as negative. This means that the emotional import of a sentence or document is not simply the sum of the emotional associations of its component words.
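To make the composition concrete, the following Python sketch implements the arithmetic mean of Eq. (1) and a weighted geometric mean in the spirit of Paltoglou et al. [18]; the toy lexicon, the default score for out-of-vocabulary words, and the distance-from-neutral weighting are illustrative assumptions rather than the resources and weights used in the cited work.

```python
import math

# Toy affective lexicon: word -> sentiment intensity in [0, 1] (illustrative values only).
LEXICON = {"good": 0.81, "great": 0.90, "boring": 0.17, "terrible": 0.05, "movie": 0.50}

def arithmetic_mean_intensity(text, lexicon=LEXICON, default=0.5):
    """Eq. (1): average the intensities of the lexicon words appearing in the text."""
    scores = [lexicon[w] for w in text.lower().split() if w in lexicon]
    return sum(scores) / len(scores) if scores else default

def weighted_geometric_mean_intensity(text, lexicon=LEXICON, default=0.5):
    """Weighted geometric mean in the spirit of [18]; here each word is weighted by
    its distance from the neutral score 0.5 (an assumed, illustrative weighting)."""
    words = [w for w in text.lower().split() if w in lexicon]
    if not words:
        return default
    weights = [abs(lexicon[w] - 0.5) + 1e-6 for w in words]
    log_sum = sum(wt * math.log(lexicon[w]) for w, wt in zip(words, weights))
    return math.exp(log_sum / sum(weights))

print(arithmetic_mean_intensity("a great but boring movie"))           # ~0.523
print(weighted_geometric_mean_intensity("a great but boring movie"))
```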


2.2. Regression-Based Methods


Gökçay et al. [19] applied a simple linear regression model on a sentiment lexicon to calculate the overall sentiment scores of texts. Sentences are decomposed into words to obtain sentiment intensities from an affective lexicon, and a list of stop-words is used to remove words that are not found in the lexicon. Next, a regression model is built between the sentiment intensity of the sentence and the average intensities of the words in the sentence. Instead of simply using the mean affective values of words, Malandrakis et al. [20] extract the weighted average and maximum intensities of component words as features to train regression models. The authors also proposed a method that extracts n-grams with affective ratings as features to predict sentiment intensities for sentences and documents. Paltoglou and Thelwall [17] predicted the intensities of a sentence or document on an ordinal five-level scale, from very negative/low to very positive/high, and considered the sentiment prediction problem as both classification and regression. Both approaches are based on BOW features, with support vector machines (SVM) used for classification and ε-support vector regression (ε-SVR) used for regression. Their experimental results also show that regression techniques tend to make smaller-scale errors.
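As a minimal sketch of the regression-based approach, the snippet below builds the average and maximum word-intensity features used by AVR/MVR-style models [20] and fits an ordinary linear regression with scikit-learn; the toy lexicon and training pairs are invented for illustration, and the cited works use richer n-gram features and other regressors such as ε-SVR.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

LEXICON = {"good": 0.81, "great": 0.90, "boring": 0.17, "terrible": 0.05}

def intensity_features(text, lexicon=LEXICON, default=0.5):
    """Average and maximum word intensity, in the spirit of the AVR/MVR features [20]."""
    scores = [lexicon.get(w, default) for w in text.lower().split()]
    return [float(np.mean(scores)), float(np.max(scores))]

# Tiny illustrative training set: (text, gold sentiment intensity in [0, 1]).
train = [("a great movie", 0.85), ("boring and terrible", 0.10), ("good but boring", 0.45)]
X = np.array([intensity_features(t) for t, _ in train])
y = np.array([s for _, s in train])

model = LinearRegression().fit(X, y)
print(model.predict(np.array([intensity_features("a good movie")])))
```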


Fig. 2: Illustrative example of LSTM cells with a residual connection.

2.3. Neural Network-Based Methods


Recently, word embeddings have been shown to boost performance in several NLP tasks [41], [42], including semantic parsing and sentiment analysis. Given a variable-length text, one challenge of using a learning algorithm for sentiment analysis is to find a way to take individual word vectors and transform them into a feature vector of the same length for every text. One intuitive method is simply to average the word vectors in a given text [43]–[45]. Although such a method is easy to implement and computationally efficient, it sacrifices word order information, making it very similar in spirit to BOW.

Based on word embeddings, several deep neural networks have been proposed for positive-negative classification, such as CNN [21], [22], RNN [23], [24] and LSTM [25], [26]. A CNN model consists of convolution and pooling layers and provides a standard architecture that maps variable-length sentences or texts into fixed-size distributed vectors. The CNN model takes the sequence of word embeddings as input, summarizes the sentence meaning by convolving a sliding window over the sentence, and outputs a fixed-length distributed vector through additional layers, such as dropout and fully connected layers, where the activation can be sigmoid, tanh or ReLU.

RNN and LSTM are theoretically more powerful for language modeling due to their ability to represent a sentence or text with sequence order information, rather than with a fixed-length window as in CNN. Both models learn contextual information that can be beneficial for capturing the semantics of long texts. However, due to the vanishing and exploding gradient problems, the traditional RNN is difficult to train. To address these problems, LSTM introduces a gating structure that allows explicit memory updates and deletions. An LSTM has three such gates, which protect and control the state of a memory cell to learn the long-distance dependencies of a given sentence or text.
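For illustration, a short NumPy sketch of the averaging strategy mentioned above; the embedding table here is a random stand-in for pre-trained word2vec/GloVe vectors.

```python
import numpy as np

EMB_DIM = 300
rng = np.random.default_rng(0)
# Stand-in embedding table; in practice these are pre-trained word vectors.
embeddings = {w: rng.normal(size=EMB_DIM) for w in ["the", "movie", "was", "great"]}

def average_word_vectors(text, emb=embeddings, dim=EMB_DIM):
    """Map a variable-length text to one dim-sized vector by averaging its word vectors.
    Word order is discarded, which is why this behaves much like a BOW representation."""
    vecs = [emb[w] for w in text.lower().split() if w in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

print(average_word_vectors("the movie was great").shape)   # (300,)
```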


2.4. Stacked Neural Network Methods

Deep learning is built on the theoretical hypothesis that a deep, hierarchical model can be exponentially more efficient at representing some functions than a shallow model [46]. According to recent progress in computer vision [28]–[30], deep architectures are able to learn hierarchical representations of whole sentences in NLP tasks. Several works seek to use such deep architectures, stacking many convolutional or recurrent layers, to approach this goal.



Fig. 3: Example network architectures of the eight-layer stacked residual LSTM model for sentiment intensity prediction (Type 1, n=1, and Type 2, n=2).


Zhang et al. [47] applied a character-level convolutional neural network to sentiment analysis tasks. Their model uses up to six convolutional layers, followed by a fully connected softmax layer for classification. A similar stacked CNN that uses character-level information was proposed by Santos et al. [48]. Conneau et al. [32] proposed stacking 29 convolutional layers to form very "deep" convolutional networks for text classification tasks. A "benefit of depth" was shown for those stacked convolutional neural networks in NLP tasks.

An RNN or LSTM can also be made deeper by stacking multiple recurrent layers on top of each other [33], [34]. Like a stacked CNN model that extracts features at different levels, a stacked RNN or LSTM can extract higher-level features from lower-level features to form a hierarchical representation. However, we are not aware of any work that uses more than six LSTM layers for sentiment classification or intensity prediction. Deeper networks have been reported not to improve performance, or have simply not been tried: the deeper the network goes, the more difficult it is to optimize, since the stacked model suffers from the degradation problem. With residual connections, we are able to show that performance improves with increased depth in a stacked LSTM model.

3. Stacked Residual LSTM Model


This section presents the architecture of the proposed stacked residual LSTM. We first introduce a residual connection into LSTM layers to form a building block. Next, by stacking these LSTM building blocks, we turn the deep stacked LSTM model into its residual version. To output continuous affective ratings instead of discrete categories, we adopt a linear decoder in the output layer. The details of the proposed stacked residual LSTM are described in what follows.

3.1. Residual Connection


In most NLP tasks, LSTM is more powerful than RNN because it avoids the vanishing and exploding gradient problems of RNN [49]. The LSTM introduces a new structure called a memory cell, as shown in Fig. 2(a). As in the RNN, the LSTM model is defined at each time step t by a collection of vectors in \mathbb{R}^d: an input gate i_t, a forget gate f_t, an output gate o_t, a memory cell c_t and a history representation h_t. The entries of the gating vectors i_t, f_t and o_t are in the range [0, 1]. The LSTM transition equations are as follows:


Gates:

    i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)
    f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)        (2)
    o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)

Input transform:

    c\_in_t = \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_{c\_in})        (3)

Memory update:

    c_t = f_t \odot c_{t-1} + i_t \odot c\_in_t
    h_t = o_t \odot \tanh(c_t)        (4)


where x_t is the input word vector at time step t, \sigma denotes the logistic sigmoid function, W and b respectively denote the weights and biases, and \odot denotes element-wise multiplication. Intuitively, the forget gate controls the extent to which the previous memory cell is forgotten, the input gate controls how much each unit is updated, and the output gate controls the exposure of the internal memory state.

Pascanu et al. [50] explored multiple ways of combining one LSTM layer with another and discussed various difficulties in training such deep LSTM models. Graves [34] also investigated stacked LSTMs for text generation. In our model, we use a vertical stack architecture in which the hidden representation of the previous layer h_t^{(l-1)} is used as the input for the next layer, where l denotes the layer. Thus, the hidden state at time step t in layer l can be calculated as follows:

    h_t^{(l)} = \mathrm{LSTM}(x_t, h_{t-1}^{(l)})              if l = 1
    h_t^{(l)} = \mathrm{LSTM}(h_t^{(l-1)}, h_{t-1}^{(l)})      if l > 1        (5)
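A NumPy sketch of one LSTM time step following Eqs. (2)–(4) and of the vertical stacking rule of Eq. (5); the hidden size, the uniform initialization, and the omission of any training loop are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step implementing Eqs. (2)-(4); p holds the weights and biases."""
    i_t = sigmoid(p["Wxi"] @ x_t + p["Whi"] @ h_prev + p["bi"])     # input gate
    f_t = sigmoid(p["Wxf"] @ x_t + p["Whf"] @ h_prev + p["bf"])     # forget gate
    o_t = sigmoid(p["Wxo"] @ x_t + p["Who"] @ h_prev + p["bo"])     # output gate
    c_in = np.tanh(p["Wxc"] @ x_t + p["Whc"] @ h_prev + p["bc"])    # input transform
    c_t = f_t * c_prev + i_t * c_in                                  # memory update, Eq. (4)
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

def init_params(d_in, d_hid, rng):
    p = {}
    for g in "ifoc":
        p["Wx" + g] = rng.uniform(-0.01, 0.01, (d_hid, d_in))
        p["Wh" + g] = rng.uniform(-0.01, 0.01, (d_hid, d_hid))
        p["b" + g] = np.zeros(d_hid)
    return p

def stacked_lstm(xs, layers):
    """Eq. (5): layer 1 reads the word vectors; layer l > 1 reads layer l-1's hidden states."""
    inputs = xs
    for p in layers:
        h = np.zeros(len(p["bi"]))
        c = np.zeros_like(h)
        outputs = []
        for x_t in inputs:
            h, c = lstm_step(x_t, h, c, p)
            outputs.append(h)
        inputs = outputs
    return inputs            # hidden states of the top layer, one per time step

rng = np.random.default_rng(0)
sentence = [rng.normal(size=300) for _ in range(5)]       # a 5-word "sentence" of 300-d vectors
layers = [init_params(300, 300, rng) for _ in range(3)]   # a 3-layer plain stacked LSTM
print(stacked_lstm(sentence, layers)[-1].shape)           # (300,)
```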


where x_t denotes the input word vector at step t. When multiple layers are stacked, the network becomes very hard to train, leading to the degradation problem. As described previously, the problem arises from the low convergence rate of the training error, rather than from the vanishing or exploding gradient problem. Inspired by the shortcut connections in ResNet [30], we employ residual connections to address this problem. Fig. 2(b) shows an example of a residual LSTM building block. We introduce a residual connection that adds the input vector x_t of a block of n layers to the hidden state h_t^{(l)} at the top of that block. Thus, the dimension of each hidden state h_t^{(l)} is required to match the dimension of the input vector x_t. By adding residuals for learning, the hidden state \hat{h} can be written as

    \hat{h}_t^{(l)} = \mathrm{LSTM}(h_t^{(l-1)}, h_{t-1}^{(l)}) + x_t^{(l-n)}        (6)


where \hat{h} in layer l is updated with the residual value x_t^{(l-n)}, and the addition in Eq. (6) is element-wise matrix addition. As shown in Fig. 2(c), n denotes the number of intermediate LSTM layers spanned by a residual connection. Intuitively, if n is very large, the stacked model becomes very expensive in terms of computation and hard to train.
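The residual block of Eq. (6) can be sketched with the Keras functional API (one of the toolkits used in the experiments of Section 4); the layer size is an assumption, and the only hard requirement, as stated above, is that the block input and the LSTM output share the same dimension so that the addition is well defined.

```python
from tensorflow.keras.layers import Input, LSTM, Add

def residual_lstm_block(x, n=2, units=300):
    """Eq. (6): pass x through n stacked LSTM layers and add x back to the output.
    n=1 gives a Type 1 block, n=2 a Type 2 block; Add() has no trainable parameters."""
    h = x
    for _ in range(n):
        h = LSTM(units, return_sequences=True)(h)
    return Add()([h, x])

x = Input(shape=(None, 300))      # a sequence of 300-d word vectors (assumed dimension)
y = residual_lstm_block(x, n=2)   # Type 2 block; y has the same shape as x
```

Because the addition is element-wise, `units` must equal the last dimension of `x`; in the bi-directional variant used later, the per-direction size would therefore be halved so that the concatenated output still matches the block input.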


3.2. Stacked LSTM Architecture

In this paper, we propose an eight-layer stacked residual LSTM model for sentiment intensity prediction. Starting from a vertically stacked LSTM model, we insert residual connections, as shown in Fig. 3, which turn the network into its residual version. As noted previously, the output hidden state can be directly added to the input vector only when the input and output have the same dimensions. By introducing residual connections, we propose two types of models, i.e., Type 1 (n=1) and Type 2 (n=2); that is, we insert a residual connection after every one or two stacked LSTM layers.


Table 1 Sentiment corpora used in this experiment.

Corpus                                      | Train  | Dev   | Test  | Language
Stanford Sentiment Treebank (SST) [12]      | 8,544  | 1,101 | 2,210 | English
VADER [58]                                  | 23,503 (5-fold CV)     | English
EmoBank [59], [60]                          | 10,548 (5-fold CV)     | English
Chinese Valence-Arousal Texts (CVAT) [54]   | 2,100 (5-fold CV)      | Chinese

It is worth noting that the matrix addition of each residual connection does not add any parameters that need to be learned; thus, it does not increase the complexity of the proposed model. To improve performance, we also adopt a bi-directional strategy in each LSTM layer [34]. At each time step, the hidden state of the bidirectional LSTM is the concatenation of the forward and backward hidden states, capturing both past and future information. Like RNN, LSTM is a biased model: the words at the tail of a sentence are more dominant than the words at the head. LSTM could therefore reduce the prediction performance when it is used to capture the sentiment intensity of a whole text, since the key components may appear in any part of the text rather than only at the end. Therefore, we use a mean pooling method to learn text vectors and to make each word contribute equally to the prediction.
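Putting these pieces together, the following is a minimal Keras sketch of the eight-layer Type 2 architecture described in this section: four residual blocks of two bi-directional LSTM layers each, mean pooling over time steps, and a linear output unit (the linear decoder introduced in the next subsection). The maximum sequence length and the choice of 150 units per direction (so that the concatenated 300-dimensional output matches the 300-dimensional word embeddings) are illustrative assumptions.

```python
from tensorflow.keras.layers import (Input, LSTM, Bidirectional, Add, Dropout,
                                     GlobalAveragePooling1D, Dense)
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adadelta

MAX_LEN, EMB_DIM, UNITS = 50, 300, 150    # 150 per direction -> 300 after concatenation

inputs = Input(shape=(MAX_LEN, EMB_DIM))  # pre-trained word vectors fed as input
x = inputs
for _ in range(4):                         # 4 residual blocks x 2 Bi-LSTM layers = 8 layers
    h = Bidirectional(LSTM(UNITS, return_sequences=True))(x)
    h = Bidirectional(LSTM(UNITS, return_sequences=True))(h)
    x = Add()([h, x])                      # residual connection (Type 2, n = 2)
    x = Dropout(0.25)(x)                   # dropout after each residual connection
pooled = GlobalAveragePooling1D()(x)       # mean pooling over time steps
intensity = Dense(1, activation="linear")(pooled)   # linear decoder, Eq. (7)

model = Model(inputs, intensity)
model.compile(loss="mse", optimizer=Adadelta(learning_rate=0.5))  # objective and optimizer of Section 3.3
model.summary()
```

A Type 1 variant would simply apply the Add() and Dropout after every single Bi-LSTM layer instead of after every pair.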

3.3. Linear Decoder

Since the intensity values are continuous, the prediction task requires a regression model, as noted previously. Instead of using a softmax or sigmoid classifier, a linear activation function, also known as a linear decoder, is used in the output layer, defined as

    senti_t = W_o x_o + b_o        (7)

where x_o represents the feature map learned from the previous layer, senti_t is the intensity of the target text, and W_o and b_o respectively denote the weight and bias associated with the output layer.

The proposed models are trained by minimizing the mean squared error between the actual and predicted values. Given a training set of texts X = {x^{(1)}, x^{(2)}, ..., x^{(m)}} and their sentiment intensities S = {senti^{(1)}, senti^{(2)}, ..., senti^{(m)}}, the cost function is defined as follows:

    J(X, S) = \frac{1}{2m} \sum_{i=1}^{m} \left( h(x^{(i)}) - senti^{(i)} \right)^2        (8)


where h represents the hypothesis of the proposed stacked residual LSTM model. Training is carried out with the back-propagation algorithm (BP) [51] using the Adadelta optimizer [52] with a learning rate of 0.5. To avoid overfitting, we add a dropout layer [53] after each residual connection to exploit its regularization effect. All parameters are randomly initialized from a uniform distribution in (-0.01, 0.01) and updated in each iteration.

4. Experimental Results

In this section, we first investigate the degradation problem in stacked architectures when the model is deepened. Then, we present the experimental results of the proposed stacked residual LSTM for sentiment intensity prediction against existing lexicon-, regression-, and conventional deep NN-based methods. We also investigate the training performance of models with different depths.


4.1. Dataset


This experiment used four corpora in both English and Chinese, as listed in Table 1. The VADER dataset contains social texts from four different domains: social media texts (4,000 samples), digital product reviews (3,708 samples), opinion news articles (5,190 samples) and movie reviews (10,605 samples), for a total of 23,503 samples. Each sentence in the SST and VADER corpora was manually assigned a sentiment score between 0 and 1. The other two corpora were rated with valence and arousal values. Valence represents the degree of pleasant versus unpleasant (i.e., positive versus negative) feeling and can therefore be treated as sentiment intensity. Each sentence in EmoBank has two different intensities assigned from different perspectives, i.e., reader and writer. In this experiment, we rescale the valence values into the range [0, 1]. For SST, the learning models are trained on the training set to predict sentiment intensities for the test samples, and the development set is used to fine-tune the models. For the other three corpora, we performed 5-fold cross-validation (i.e., 80% of the texts were used as training samples and 20% for testing).
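As a small illustration of the preprocessing just described, the sketch below rescales raw valence ratings into [0, 1] and builds a 5-fold split with scikit-learn; the assumed raw rating range (1-9) and the placeholder corpus are examples only and should be replaced by each corpus's actual scale and data.

```python
import numpy as np
from sklearn.model_selection import KFold

def rescale_valence(v, lo=1.0, hi=9.0):
    """Map raw valence ratings on [lo, hi] to sentiment intensities in [0, 1]."""
    return (np.asarray(v, dtype=float) - lo) / (hi - lo)

texts = ["text %d" % i for i in range(10)]      # placeholder corpus
raw_valence = np.linspace(1, 9, 10)             # placeholder raw ratings
intensity = rescale_valence(raw_valence)
print(intensity)

# 5-fold cross-validation: 80% of the texts for training, 20% for testing in each fold.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(texts)):
    print(fold, len(train_idx), len(test_idx))  # 8 training / 2 test samples per fold
```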


4.2. Evaluation Metrics

To compare the results of the proposed stacked residual LSTM model against regression-based methods and other deep neural networks, we use the Pearson correlation coefficient (r) and the mean absolute error (MAE) to evaluate the performance of each model on the four datasets. The MAE reflects the difference between the predicted sentiment intensities and the corresponding manually rated values. The Pearson correlation coefficient measures the linear correlation between the actual and predicted values, giving a value between +1 and −1 inclusive, where +1 is total positive correlation, 0 is no correlation, and −1 is total negative correlation. A higher Pearson correlation coefficient and a lower MAE indicate more accurate forecasting performance.
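Both metrics can be computed directly with NumPy and SciPy, as in this small sketch.

```python
import numpy as np
from scipy.stats import pearsonr

def evaluate(y_true, y_pred):
    """Return (MAE, Pearson r): lower MAE and higher r mean more accurate predictions."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    mae = float(np.abs(y_true - y_pred).mean())
    r, _ = pearsonr(y_true, y_pred)
    return mae, r

print(evaluate([0.10, 0.55, 0.90], [0.20, 0.50, 0.80]))   # approx. (0.083, 0.997)
```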


4.3. Implementation Details

The proposed stacked residual LSTM model is compared with several existing methods for sentiment intensity prediction. The implementation details of each method are as follows.

• wAM and wGM: weighted arithmetic mean (wAM) and weighted geometric mean (wGM) are two lexicon-based methods [18]. In these methods, the sentiment intensity of a candidate text is estimated via the weighted mean of the tokens in the text.

• AVR and MVR: average values regression (AVR) and maximum values regression (MVR) are two regression-based methods [20], which extract the weighted average and maximum intensities of component words as features to train regression models. The sentiment intensities of English words used in the above methods are taken from the extended ANEW [11], while the sentiment intensities of Chinese words are taken from CVAW [54].


• CNN, RNN and LSTM: three other word embedding-based methods are also introduced for comparison: convolutional neural networks (CNN) [21], [22], recurrent neural networks (RNN) [23], [24] and long short-term memory (LSTM) [25], [26]. CNN can extract active local n-gram features, while RNN and LSTM model the texts sequentially.


• Stacked Residual LSTM (SR-LSTM): the proposed 8-layer stacked residual LSTM models are also implemented for comparison. By introducing residual connections, we compare two types of stacked models, i.e., Type 1 (n=1) and Type 2 (n=2); that is, we add a residual connection after every one or two stacked LSTM layers.


Fig. 4: Training on SST, CVAT and EmoBank with plain and residual architecture.


To enhance the performance of the LSTM layers, we also adopt a bi-directional strategy [34]. The word embeddings used in this experiment were pre-trained by GloVe [55] on Common Crawl 840B (https://nlp.stanford.edu/projects/glove/) for English and on Wikipedia dumps (https://dumps.wikimedia.org/) for Chinese. For the neural models, we also implement dropout [53] as regularization to prevent overfitting; the dropout rate was set to 0.25. The dimension of the input word embeddings and the hidden state is 300. All word embedding-based methods are implemented using the Gensim [56] (http://radimrehurek.com/gensim/), TensorFlow [57] (http://www.tensorflow.org/), and Keras (https://keras.io/) toolkits with default parameter settings.


Table 2 Comparative results of different methods on the SST, EmoBank and CVAT corpora (MAE / r).

Methods                       | SST           | EmoBank (reader) | EmoBank (writer) | CVAT
Lexicon-wAM                   | 0.202 / 0.350 | 0.148 / 0.365    | 0.145 / 0.372    | 0.204 / 0.406
Lexicon-wGM                   | 0.198 / 0.385 | 0.132 / 0.398    | 0.136 / 0.383    | 0.199 / 0.418
Regression-MAR                | 0.182 / 0.455 | 0.112 / 0.433    | 0.118 / 0.429    | 0.171 / 0.476
Regression-MVR                | 0.179 / 0.448 | 0.107 / 0.426    | 0.112 / 0.432    | 0.174 / 0.468
CNN                           | 0.153 / 0.679 | 0.072 / 0.510    | 0.057 / 0.504    | 0.143 / 0.625
RNN                           | 0.160 / 0.659 | 0.082 / 0.445    | 0.068 / 0.439    | 0.165 / 0.493
LSTM                          | 0.149 / 0.700 | 0.075 / 0.511    | 0.055 / 0.524    | 0.132 / 0.641
Bi-LSTM                       | 0.148 / 0.715 | 0.070 / 0.531    | 0.055 / 0.529    | 0.125 / 0.652
2-layer LSTM                  | 0.143 / 0.709 | 0.068 / 0.549    | 0.052 / 0.541    | 0.112 / 0.684
2-layer Bi-LSTM               | 0.139 / 0.718 | 0.067 / 0.556    | 0.050 / 0.548    | 0.108 / 0.690
8-layer LSTM                  | 0.153 / 0.688 | 0.078 / 0.484    | 0.061 / 0.491    | 0.134 / 0.632
8-layer Bi-LSTM               | 0.151 / 0.692 | 0.077 / 0.490    | 0.061 / 0.498    | 0.130 / 0.638
8-layer SR-LSTM (Type 1)      | 0.148 / 0.692 | 0.076 / 0.490    | 0.061 / 0.490    | 0.098 / 0.740
8-layer SR-LSTM (Type 2)      | 0.131 / 0.764 | 0.058 / 0.615    | 0.046 / 0.592    | 0.161 / 0.611
8-layer SR-Bi-LSTM (Type 1)   | 0.146 / 0.696 | 0.071 / 0.512    | 0.060 / 0.492    | 0.096 / 0.742
8-layer SR-Bi-LSTM (Type 2)   | 0.128 / 0.772 | 0.058 / 0.612    | 0.047 / 0.596    | 0.162 / 0.610

4.4. Evaluation of the Residual Connections


We first evaluate the 2-layer and 8-layer plain networks on the different corpora. The results in Fig. 1 and Fig. 4 show that the deeper 8-layer stacked LSTM without residual connections has a higher testing MAE and a lower correlation r than the shallower 2-layer LSTM network. We observe the degradation problem: the deep plain models suffer from the increased depth and exhibit a higher training MAE and a lower correlation r as the architecture becomes deeper. This phenomenon is unlikely to be caused by vanishing or exploding gradients, since the gate mechanism and memory cell of the LSTM were designed to address that problem. It is also unlikely to be caused by overfitting, as the stacked models were regularized with dropout.

After introducing residual connections, the deeper 8-layer models (Type 2) reverse this trend and exceed the 2-layer shallow networks, exhibiting a considerably lower training MAE (by 8%-10%) and a higher correlation r (by 8%-9%). Compared to their 8-layer plain counterparts, the residual connections also help the models reduce MAE by 15% and increase correlation r by 11%. The proposed models thus converge better and are easier to optimize.


Table 3 Comparative results of different methods on the VADER corpus (MAE / r).

Methods                       | Social        | Product       | Article       | Movie
Lexicon-wAM                   | 0.225 / 0.486 | 0.175 / 0.421 | 0.169 / 0.423 | 0.183 / 0.487
Lexicon-wGM                   | 0.224 / 0.489 | 0.173 / 0.442 | 0.168 / 0.431 | 0.182 / 0.490
Regression-MAR                | 0.202 / 0.568 | 0.169 / 0.482 | 0.156 / 0.472 | 0.175 / 0.526
Regression-MVR                | 0.201 / 0.571 | 0.165 / 0.486 | 0.155 / 0.477 | 0.174 / 0.534
CNN                           | 0.141 / 0.803 | 0.121 / 0.679 | 0.088 / 0.711 | 0.134 / 0.737
RNN                           | 0.199 / 0.607 | 0.139 / 0.514 | 0.128 / 0.496 | 0.169 / 0.571
LSTM                          | 0.134 / 0.818 | 0.124 / 0.673 | 0.082 / 0.724 | 0.124 / 0.744
Bi-LSTM                       | 0.129 / 0.822 | 0.118 / 0.686 | 0.074 / 0.741 | 0.122 / 0.755
2-layer LSTM                  | 0.124 / 0.821 | 0.112 / 0.692 | 0.068 / 0.749 | 0.118 / 0.762
2-layer Bi-LSTM               | 0.121 / 0.829 | 0.108 / 0.697 | 0.065 / 0.752 | 0.112 / 0.768
8-layer LSTM                  | 0.146 / 0.776 | 0.128 / 0.672 | 0.096 / 0.702 | 0.133 / 0.738
8-layer Bi-LSTM               | 0.144 / 0.778 | 0.129 / 0.670 | 0.092 / 0.710 | 0.132 / 0.732
8-layer SR-LSTM (Type 1)      | 0.138 / 0.805 | 0.126 / 0.679 | 0.055 / 0.782 | 0.128 / 0.736
8-layer SR-LSTM (Type 2)      | 0.111 / 0.882 | 0.098 / 0.712 | 0.078 / 0.732 | 0.101 / 0.779
8-layer SR-Bi-LSTM (Type 1)   | 0.130 / 0.812 | 0.122 / 0.680 | 0.052 / 0.788 | 0.130 / 0.734
8-layer SR-Bi-LSTM (Type 2)   | 0.108 / 0.886 | 0.096 / 0.718 | 0.077 / 0.738 | 0.098 / 0.785

4.5. Comparative Results of Sentiment Intensity Prediction


Table 2 and Table 3 present the results of the stacked residual LSTM model compared with several other methods for sentiment intensity prediction on the English and Chinese corpora. For the lexicon-based methods, wGM outperformed wAM, which is consistent with the results reported by Paltoglou et al. [18]. Instead of directly using the sentiment intensities of words to measure those of texts, the regression-based methods learned the correlations between the sentiment intensities of words and texts, thereby yielding better performance. Once word embedding and deep learning techniques were introduced, the performance of the NN-based methods (except RNN) increased dramatically; RNN suffered from the vanishing and exploding gradient problems. In addition, the proposed stacked residual LSTM models outperformed the other NN-based methods, indicating the effectiveness of the residual connections. By overcoming the degradation problem, the residual connections can increase performance as the architecture of the stacked models gets deeper.

Another observation from Table 2 and Table 3 is that the Type 2 (n=2) stacked residual model also outperforms the Type 1 (n=1) model. When n=1, the function in Eq. (6) reduces to a standard stacked LSTM with a bias derived from the input vector x; therefore, the performance of Type 1 (n=1) models is very similar to that of the plain stacked models.

4.6. Training Time Analysis

For some time-sensitive processing tasks, the time consumed by training and prediction is another critical metric for the overall performance of a model. Although the proposed model achieved better performance than the existing models for sentiment intensity prediction, stacking more layers results in longer training time, since the more LSTM layers are introduced to the model, the more parameters need to be trained. In Table 4, we compare the sizes of several trained models, including the bi-directional LSTM (Bi-LSTM), 2-layer Bi-LSTM, 8-layer Bi-LSTM and the proposed 8-layer Bi-LSTM models with residual connections (SR-Bi-LSTM).


Table 4 The sizes of LSTM models of different depths.

Methods                       | #parameters | #units (SST) | #units (EmoBank) | #units (CVAT)
Bi-LSTM                       | 541K        | 56           | 120              | 247
2-layer Bi-LSTM               | 1.08M       | 112          | 240              | 494
8-layer Bi-LSTM               | 3.25M       | 448          | 960              | 1,976
8-layer SR-Bi-LSTM (Type 1)   | 3.25M       | 448          | 960              | 1,976
8-layer SR-Bi-LSTM (Type 2)   | 3.25M       | 448          | 960              | 1,976

Table 5 The training time per epoch (avg ± std) of LSTM models of different depths.

Methods                       | SST (CPU)  | SST (GPU)   | EmoBank (CPU) | EmoBank (GPU) | CVAT (CPU) | CVAT (GPU)
Bi-LSTM                       | 44s±2.8s   | 3.32s±0.12s | 93s±5.5s      | 5.54s±0.28s   | 112s±3.2s  | 3.20s±0.11s
2-layer Bi-LSTM               | 90s±4.3s   | 5.56s±0.15s | 186s±7.6s     | 11.1s±0.62s   | 226s±3.8s  | 8.50s±0.14s
8-layer Bi-LSTM               | 290s±10.8s | 14.2s±0.44s | 691s±22.4s    | 26.3s±1.02s   | 843s±12.4s | 24.2s±0.82s
8-layer SR-Bi-LSTM (Type 1)   | 278s±12.4s | 13.8s±0.48s | 682s±23.2s    | 25.8s±1.18s   | 839s±10.9s | 23.8s±0.77s
8-layer SR-Bi-LSTM (Type 2)   | 282s±13.2s | 13.2s±0.51s | 685s±22.8s    | 25.2s±1.06s   | 844s±10.6s | 24.4s±0.80s

We provide the number of hidden units as well as the total number of trainable parameters on SST, EmoBank and CVAT. As shown in Table 4, adding residual connections to the original stacked LSTM model does not add any trainable parameters; thus, it does not increase the complexity of the model. The increase in training complexity depends only on the length of the sentences (#units) and the number of stacked layers (#parameters).

For a more detailed analysis, we trained these five models on an 8-core CPU and on a GTX 1080 GPU with cuDNN 7, respectively. As shown in Table 5, the training time increases roughly linearly with the number of layers on both CPU and GPU. Compared with the 8-layer plain Bi-LSTM model, the proposed stacked residual model with the same number of layers does not take much more time to train. When a GPU is used, training is 20x-30x faster than on a CPU. Therefore, the training and prediction efficiency of the proposed model on a GPU platform remains highly competitive for time-sensitive tasks.

5. Conclusions


In this paper, we presented a stacked residual LSTM model to predict the sentiment intensities of texts. By introducing residual connections to every few LSTM layers, we constructed an eight-layer stacked LSTM model. The residual connections let the deep stacked model avoid the degradation problem. In addition, a residual connection does not increase the complexity of the model, since it does not add any trainable parameters to the stacked model. In our experiments, we also provide comprehensive empirical evidence that these residual networks are easier to optimize. Comparative results show that the model achieves a lower MAE and a higher Pearson correlation coefficient in sentiment intensity prediction than existing methods, indicating improved accuracy. With GPU support, the training and prediction efficiency of the proposed model remains highly competitive for time-sensitive language processing tasks. Future work will attempt to introduce attention or memory mechanisms in order to extract more useful information between layers and to improve the performance of the residual architecture.


Acknowledgements

This work was supported by the National Natural Science Foundation of China (NSFC) under Grants No. 61702443 and No. 61762091, and in part by the Educational Commission of Yunnan Province of China under Grant No. 2017ZZX030. The authors would like to thank the anonymous reviewers for their constructive comments.

References

[1] B. Pang and L. Lee, "Opinion mining and sentiment analysis," Found. Trends Inf. Retr., vol. 1, no. 2, pp. 91–231, 2006.
[2] B. Liu, "Sentiment analysis and opinion mining," Synth. Lect. Hum. Lang. Technol., vol. 5, no. 1, pp. 1–167, 2012.
[3] R. Feldman, "Techniques and applications for sentiment analysis," Commun. ACM, vol. 56, no. 4, p. 82, 2013.
[4] S. M. Mohammad, "Sentiment analysis: Detecting valence, emotions, and other affectual states from text," Emot. Meas., 2015.
[5] R. A. Calvo and S. D. Mello, "Affect detection: An interdisciplinary review of models, methods, and their applications," IEEE Trans. Affect. Comput., vol. 1, no. 1, pp. 18–37, 2015.
[6] R. W. Picard, "Affective computing," MIT Press, no. 321, pp. 1–16, 1995.
[7] B. Pang and L. Lee, "Thumbs up? Sentiment classification using machine learning techniques," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-02), 2002, pp. 79–86.
[8] P. D. Turney, "Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL-2002), 2002, pp. 417–424.
[9] S. Wang and C. D. Manning, "Baselines and bigrams: Simple, good sentiment and topic classification," in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL-2012), 2012, pp. 90–94.
[10] J. A. Russell, "A circumplex model of affect," J. Pers. Soc. Psychol., vol. 39, no. 6, pp. 1161–1178, 1980.
[11] A. B. Warriner, V. Kuperman, and M. Brysbaert, "Norms of valence, arousal, and dominance for 13,915 English lemmas," Behav. Res. Methods, vol. 45, no. 4, pp. 1191–1207, 2013.
[12] R. Socher, A. Perelygin, and J. Wu, "Recursive deep models for semantic compositionality over a sentiment treebank," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-2013), 2013, pp. 1631–1642.
[13] N. Li and D. Dash, "Using text mining and sentiment analysis for online forums hotspot detection and forecast," Decis. Support Syst., vol. 48, no. 2, pp. 354–368, 2010.
[14] T. Nguyen, D. Phung, B. Dao, S. Venkatesh, and M. Berk, "Affective and content analysis of online depression communities," IEEE Trans. Affect. Comput., vol. 5, no. 3, pp. 217–226, 2014.
[15] L. Yu, J. Wu, P. Chang, and H. Chu, "Using a contextual entropy model to expand emotion words and their intensity for the sentiment classification of stock market news," Knowledge-Based Syst., vol. 41, pp. 89–97, 2013.
[16] M.-C. de Marneffe, C. D. Manning, and C. Potts, "Was it good? It was provocative. Learning the meaning of scalar adjectives," in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL-10), 2010, pp. 167–176.
[17] G. Paltoglou and M. Thelwall, "Seeing stars of valence and arousal in blog posts," IEEE Trans. Affect. Comput., vol. 4, no. 1, pp. 116–123, 2013.
[18] G. Paltoglou, M. Theunis, A. Kappas, and M. Thelwall, "Predicting emotional responses to long informal text," IEEE Trans. Affect. Comput., vol. 4, no. 1, pp. 106–115, 2013.
[19] D. Gökçay, E. Işbilir, and G. Yildirim, "Predicting the sentiment in sentences based on words: An exploratory study on ANEW and ANET," in Proceedings of the 3rd IEEE International Conference on Cognitive Infocommunications (CogInfoCom-2012), 2012, pp. 715–718.
[20] N. Malandrakis, A. Potamianos, E. Iosif, and S. Narayanan, "Distributional semantic models for affective text analysis," IEEE Trans. Audio, Speech Lang. Process., vol. 21, no. 11, pp. 2379–2392, 2013.
[21] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, "A convolutional neural network for modelling sentences," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL-2014), 2014, pp. 655–665.
[22] Y. Kim, "Convolutional neural networks for sentence classification," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP-2014), 2014, pp. 1746–1751.
[23] A. Graves, Supervised Sequence Labelling. Springer Berlin Heidelberg, 2012.
[24] O. Irsoy and C. Cardie, "Opinion mining with deep recurrent neural networks," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-2014), 2014, pp. 720–728.
[25] K. S. Tai, R. Socher, and C. D. Manning, "Improved semantic representations from tree-structured long short-term memory networks," in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL-2015), 2015, pp. 1556–1566.
[26] X. Wang, Y. Liu, C. Sun, B. Wang, and X. Wang, "Predicting polarities of tweets by composing word embeddings with long short-term memory," in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL-2015), 2015, pp. 1343–1353.
[27] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proceedings of the International Conference on Learning Representations (ICLR-15), 2015, pp. 1–10.
[28] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR-2016), 2016, pp. 2818–2826.
[29] C. Szegedy, S. Ioffe, and V. Vanhoucke, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," arXiv preprint arXiv:1602.07261, 2016.
[30] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR-2016), 2016, pp. 770–778.
[31] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, "Learning hierarchical features for scene labeling," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1915–1929, 2013.
[32] A. Conneau, H. Schwenk, Y. Le Cun, and L. Barrault, "Very deep convolutional networks for text classification," in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL-2017), 2017, pp. 1107–1116.
[33] R. Pascanu, Ç. Gülçehre, K. Cho, and Y. Bengio, "How to construct deep recurrent neural networks," arXiv preprint arXiv:1312.6026, 2013.
[34] A. Graves, N. Jaitly, and A. R. Mohamed, "Hybrid speech recognition with deep bidirectional LSTM," in Proceedings of the 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU-2013), 2013, pp. 273–278.
[35] W. Kong, Z. Y. Dong, Y. Jia, D. J. Hill, Y. Xu, and Y. Zhang, "Short-term residential load forecasting based on LSTM recurrent neural network," IEEE Trans. Smart Grid, 2017.
[36] X. Qing and Y. Niu, "Hourly day-ahead solar irradiance prediction using weather forecasts by LSTM," Energy, vol. 148, pp. 461–468, 2018.
[37] X. Luo, M. C. Zhou, Y. Xia, Q. Zhu, A. C. Ammari, and A. Alabdulwahab, "Generating highly accurate predictions for missing QoS data via aggregating nonnegative latent factor models," IEEE Trans. Neural Networks Learn. Syst., vol. 27, no. 3, pp. 524–537, 2016.
[38] X. Luo, M. Zhou, Z. Wang, Y. Xia, and Q. Zhu, "An effective scheme for QoS estimation via alternating direction method-based matrix factorization," IEEE Trans. Serv. Comput., 2016.
[39] J. Liu, A. Shahroudy, D. Xu, A. Kot Chichung, and G. Wang, "Skeleton-based action recognition using spatio-temporal LSTM network with trust gates," IEEE Trans. Pattern Anal. Mach. Intell., pp. 1–14, 2017.
[40] X. Wang, L. Gao, J. Song, and H. Shen, "Beyond frame-level CNN: Saliency-aware 3-D CNN with LSTM for video action recognition," IEEE Signal Process. Lett., vol. 24, no. 4, pp. 510–514, 2017.
[41] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Proceedings of the Annual Conference on Advances in Neural Information Processing Systems (NIPS-2013), 2013, pp. 1–9.
[42] T. Mikolov, G. Corrado, K. Chen, and J. Dean, "Efficient estimation of word representations in vector space," in Proceedings of the International Conference on Learning Representations (ICLR-2013), 2013, pp. 1–12.
[43] M. Iyyer, V. Manjunatha, J. Boyd-Graber, and H. Daumé III, "Deep unordered composition rivals syntactic methods for text classification," in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL-2015), 2015, pp. 1681–1691.
[44] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, "Bag of tricks for efficient text classification," arXiv preprint arXiv:1607.01759, 2016.
[45] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," arXiv preprint arXiv:1607.04606, 2016.
[46] Y. Bengio, "Learning deep architectures for AI," Found. Trends Mach. Learn., vol. 2, no. 1, pp. 1–127, 2009.
[47] X. Zhang, J. Zhao, and Y. LeCun, "Character-level convolutional networks for text classification," in Proceedings of Advances in Neural Information Processing Systems (NIPS-2015), 2015, pp. 3057–3061.
[48] C. N. Dos Santos and M. Gatti, "Deep convolutional neural networks for sentiment analysis of short texts," in Proceedings of the 25th International Conference on Computational Linguistics (COLING-2014), 2014, pp. 69–78.
[49] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Trans. Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
[50] R. Pascanu, Ç. Gülçehre, K. Cho, and Y. Bengio, "How to construct deep recurrent neural networks," arXiv preprint arXiv:1312.6026, pp. 1–10, 2013.
[51] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, "Efficient backprop," in Neural Networks: Tricks of the Trade, pp. 9–48, 2012.
[52] M. D. Zeiler, "Adadelta: An adaptive learning rate method," arXiv preprint arXiv:1212.5701, 2012.
[53] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, pp. 1929–1958, 2014.
[54] L.-C. Yu, L.-H. Lee, S. Hao, J. Wang, Y. He, J. Hu, K. R. Lai, and X. Zhang, "Building Chinese affective resources in valence-arousal dimensions," in Proceedings of the 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL/HLT-2016), 2016, pp. 540–545.
[55] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global vectors for word representation," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-2014), 2014, pp. 1532–1543.
[56] R. Rehurek and P. Sojka, "Software framework for topic modelling with large corpora," in Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 2010, pp. 45–50.
[57] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, et al., "TensorFlow: A system for large-scale machine learning," in Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI-2016), 2016, pp. 265–283.
[58] C. J. Hutto and E. Gilbert, "VADER: A parsimonious rule-based model for sentiment analysis of social media text," in Proceedings of the 8th International AAAI Conference on Weblogs and Social Media (ICWSM-2014), 2014, pp. 216–225.
[59] S. Buechel and U. Hahn, "EmoBank: Studying the impact of annotation perspective and representation format on dimensional emotion analysis," in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL-2017), 2017, pp. 578–585.
[60] S. Buechel and U. Hahn, "Readers vs. writers vs. texts: Coping with different perspectives of text understanding in emotion annotation," in Proceedings of the 11th Linguistic Annotation Workshop (EACL-2017), 2017, pp. 1–12.

Jin Wang is a lecturer in the School of Information Science and Engineering, Yunnan University, China. He received the Ph.D. degree in computer science and engineering from Yuan Ze University, Taoyuan, Taiwan and in communication and information systems from Yunnan University, Kunming, China. His research interests include natural language processing, text mining, and machine learning.


Bo Peng is a Ph.D. candidate at the School of Information Science and Engineering, Yunnan University, China. His research interests include natural language processing and machine learning.


Xuejie Zhang is a professor in the School of Information Science and Engineering, and Director of High Performance Computing Center, Yunnan University, China. He received his Ph.D. in Computer Science and Engineering from Chinese University of Hong Kong in 1998. His research interests include high performance computing, cloud computing, and big data analytics.
