Neural Networks 119 (2019) 299–312
Hierarchical gated recurrent neural network with adversarial and virtual adversarial training on text classification
Hoon-Keng Poon a, Wun-She Yap a,∗, Yee-Kai Tee a, Wai-Kong Lee b, Bok-Min Goi a
a Lee Kong Chian Faculty of Engineering and Science, Universiti Tunku Abdul Rahman, Malaysia
b Faculty of Information and Communication Technology, Universiti Tunku Abdul Rahman, Malaysia
∗ Corresponding author: W.-S. Yap ([email protected]).
Article info
Article history: Received 23 January 2019; Received in revised form 23 June 2019; Accepted 14 August 2019; Available online 2 September 2019.
Keywords: Machine learning; Adversarial training; Text classification; Small-scale datasets; Neural network.
Abstract
Document classification aims to assign one or more classes to a document for ease of management by understanding its content. The hierarchical attention network (HAN) has been shown to be effective in classifying ambiguous documents. HAN parses information-intense documents into slices (i.e., words and sentences) such that each slice can be learned separately and in parallel before the classes are assigned. However, the hierarchical attention approach introduces redundant training parameters and is therefore prone to overfitting. To mitigate this concern, we propose a variant of the hierarchical attention network that applies adversarial and virtual adversarial perturbations to 1) the word representation, 2) the sentence representation and 3) both word and sentence representations. The proposed variant is tested on eight publicly available datasets. The results show that the proposed variant outperforms the hierarchical attention network with and without random perturbation. More importantly, the proposed variant achieves state-of-the-art performance on multiple benchmark datasets. Visualizations and analysis are provided to show that perturbation can effectively alleviate the overfitting issue and improve the performance of the hierarchical attention network.
© 2019 Elsevier Ltd. All rights reserved.
1. Introduction

Nowadays, big data has become an essential asset for companies to make good decisions about their services, employees, strategies, policies and products. Massive amounts of data have been created by the digital transformation and social networking trends, including digital news, micro-blogging, messaging applications, Twitter and Facebook. Twitter alone generates over 400 million tweets every day. However, only a small amount of this data has been utilized effectively. To further exploit the data created by these trends, text classification (Li & Jain, 1998) is the key to processing data at scale.

Text classification, a widely used natural language processing task, is the key technique behind several popular applications, including detecting spam and non-spam emails (Pantel & Lin, 1998), categorizing documents into different topics (Manevitz & Yousef, 2001) and understanding customer sentiment from reviews in social networks (Pang, Lee, & Vaithyanathan, 2002). Text classification assumes that documents with similar contents have high similarity between them. Thus, a labeled dataset containing documents and their labels is used to train a classifier, and a new document is then classified based on its similarity with what the trained classifier has learned.

Different classifiers have been proposed to classify texts, ranging from traditional text classifiers (e.g., support vector machine (Hearst, Dumais, Osuna, Platt, & Scholkopf, 1998), naive Bayes (Lewis, 1998), logistic regression (Ng & Jordan, 2001) and random forest (Breiman, 2001)) to neural network based classifiers (e.g., convolutional neural network (CNN) (Krizhevsky, Sutskever, & Hinton, 2012), recurrent neural network (RNN) (Mikolov, Karafiát, Burget, Černocký, & Khudanpur, 2010) and bidirectional recurrent neural network (Schuster & Paliwal, 1997)). However, text classification becomes challenging when the dimensionality of the input feature space is large. For instance, a word alone or a combination of words can have distinct meanings in different contexts, which makes it difficult to capture the correct semantics. Furthermore, classification models are usually not well understood. A model that can accurately classify text from semantic information and retrieve salient features from text therefore becomes necessary.

1.1. Traditional text classifiers

Traditional text classifiers include support vector machine (Hearst et al., 1998), naive Bayes (Lewis, 1998), logistic regression (Ng & Jordan, 2001) and random forest (Breiman, 2001). These classifiers are widely used due to their simplicity of implementation and reasonable performance in various tasks.
However, these methods face the challenge of data sparsity. Most of them rely on bag-of-words (Zhang, Jin, & Zhou, 2000) or bag-of-ngrams to represent a document as a numeric vector. Bag-of-words can be viewed as a binary feature vector indicating the presence or absence of each word in a document. Instead of assigning individual words to the vector, bag-of-ngrams creates a vector that records the occurrence of n consecutive words in a document. This implies that the size of the feature space must match the size of the document or corpus vocabulary, and the growth of the feature space becomes a burden to the text classifier. Due to limited computation capability and storage resources, feature space reduction then becomes a design goal.

To reduce the feature space, more compact features such as noun phrases (Lewis, 1992), part-of-speech tags (Kristina & Manning, 2000) and term frequency-inverse document frequency (TF–IDF) (Salton & Buckley, 1988) were proposed. These features focus on important keywords rather than the whole vocabulary. However, they also involve statistical filtering or supervised processing techniques which might introduce biases. For instance, noun phrases and TF–IDF eliminate words or phrases whose occurrence falls below or above a predefined threshold. These eliminated words or phrases are automatically assumed to be insignificant, which is not true in many cases. Consequently, the reduced features used to represent a text can be inadequate.

Besides the space redundancy, these features are still restricted by the loss of word order and semantic meaning. A numerical vector captures only the existence or importance of individual features but ignores the contextual information of the word sequence, which might mislead the classifier. In addition, the semantic meaning of words is missing; e.g., the representation treats synonyms like "good" and "excellent" no differently from antonyms like "good" and "poor". Consequently, most traditional text classifiers achieve only mediocre performance in classifying text.

1.2. Deep neural network

Recently, research on deep neural networks and representation learning has disrupted conventional text classification methods. These ideas aim to solve the data sparsity and semantic representation problems. Many neural models that learn word representations have been proposed; the learned representations are generally called word embeddings. A word embedding is a fixed-length distributed continuous vector, so it is no longer constrained by a large vocabulary. It also carries rich semantic information and enables us to relate similar words by measuring semantic distances between embedding vectors. With word embeddings pre-trained by neural network models, the feature space of a document is reduced from the size of the vocabulary down to the embedding size times the number of words. As a smaller feature space allows more hidden layers to be used, a deeper network is able to capture more complex linguistic sentiment. Therefore, applying deep neural networks to natural language processing tasks is appealing.

Several recent studies have applied deep neural networks and word embeddings to text classification tasks. Socher, Pennington, Huang, Ng, and Manning (2011) proposed semi-supervised recursive autoencoders to predict the sentiment of a sentence and further introduced a deep recursive neural network (Socher et al., 2013a) for phrase and sentence prediction.
Impressed by the outstanding performance of convolutional neural networks in image classification (Krizhevsky et al., 2012), Kim (2014) introduced the convolutional neural network for text classification. Besides, Mikolov (2012) proposed the use of recurrent neural networks to build better language models than n-gram models.
Meanwhile, Yang et al. (2016) proposed a hierarchical attention mechanism on top of a recurrent neural network to capture structural semantic information at the word and sentence levels. Along the same direction, Cheong, Yap, Tee, and Lee (2018) and Poon, Yap, Tee, Goi, and Lee (2018) demonstrated the efficiency of the hierarchical Gated Recurrent Neural Network (Hi-GRNN) in document-level polarity classification and in classifying various types of documents. However, while deeper networks deliver state-of-the-art performance on many natural language processing tasks, they come at the cost of higher computational complexity. Moreover, the larger number of training parameters causes the deeper network to overfit: with limited training data, the classification model might closely fit biases in the data. As claimed by Caruana, Lawrence, and Giles (2001), a network with too much capacity, too many hidden layers and hidden units, tends to overfit the training data. It then over-emphasizes certain features of the training data, which might harm generalization to new data. Owing to the high number of training parameters in its hierarchical architecture, Hi-GRNN tends to converge in the early training phase and hence overfits easily.

A direct method against overfitting is to restrict the capacity of the deep neural network. For instance, the densely connected long short-term memory (LSTM) proposed by Ding, Xia, Yu, Li, and Yang (2018) fixes the size of the hidden layer output to restrict the expansion of network capacity that increases with network depth. However, a smaller capacity might leave insufficient training parameters for complex problems such as text classification. Therefore, a well-suited regularization method is needed to restrain overfitting without compromising computational complexity.

1.3. Semi-supervised deep neural network

One commonly used regularization method in neural networks is dropout. Dropout (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014) was first proposed for CNN and is widely used in many other domains including text. The key idea is to randomly drop neurons from the network during training to avoid excessive co-adaptation. However, previous works found that applying dropout to RNN might not work. Different from feed-forward-only neural networks like CNN, RNN is known for its ability to retain memory from previous states, and applying standard dropout to RNN tends to limit this strength. For instance, Bayer et al. (2013) stated that setting any outgoing vector to zero results in dramatic changes during every forward pass in RNN. To retain the valuable memorization ability of RNN, Jozefowicz, Zaremba, and Sutskever (2015) and Pham, Bluche, Kermorvant, and Louradour (2014) proposed to apply dropout only on the non-recurrent connections. However, Gal and Ghahramani (2016) found that these approaches still lead to overfitting.

Apart from dropout, another increasingly popular regularization alternative is to learn an additional representation embedding for the input from a large collection of unlabeled data, also called semi-supervised learning. These pre-trained embeddings aim to augment the inputs with richer features. For example, to generate extra contextual information during training, a two-view embedding method (Johnson & Zhang, 2015), which includes pre-trained embeddings from related unlabeled samples, was proposed.
Similarly, Tang and de Sa (2018) proposed a multi-view learning method that ensembles and leverages the advantages of the gated recurrent unit (GRU) and neural encoders to generate rich semantic information about the inputs from unlabeled data. Both the two-view embedding and multi-view learning methods are forms of
regularization against overfitting that adds uncertainty into the model. However, both methods suffer from an inherent drawback: massive collections of related unlabeled text corpora are needed.

Besides learning additional supplementary embeddings for the input, Sietsma and Dow (1991) suggested adding random noise to the inputs and hidden layers during training to prevent overfitting. However, adding random noise tends to generate intractable perturbations to the training process. Later works applied non-random noise that actively acts as a regularizer for neural network models. Examples include adversarial training (AT) (Goodfellow, Jonathon, & Christian, 2015) and virtual adversarial training (VAT) (Miyato, Maeda, Koyama, Nakae and Ishii, 2016). Both ideas add perturbations to the inputs such that the adversary regularizes the machine learning model. In reality, classifiers are known to be vulnerable to small perturbations that go unnoticed by humans yet lead to severe misclassification. Adversarial training trains a model on both the original inputs and adversarially modified inputs; in other words, it improves the robustness of machine learning models beyond what they can learn from the training data alone. Subsequently, Miyato, Dai and Goodfellow (2016) extended the idea of adversarial training from supervised training to semi-supervised training (also known as virtual adversarial training). Virtual adversarial training regularizes the underlying model by introducing adversarial perturbations to the inputs without needing any labels.

In this paper, we propose a Hi-GRNN model with adversarial and virtual adversarial training. We recommend adding both regularization methods in each layer of our hierarchical architecture, which prevents our model from overfitting caused by the increased number of training parameters. Unlike adversarial perturbation for image classification, small changes to a discrete text input can subvert its meaning. Thus, we define the perturbation only on the embedding layers of Hi-GRNN instead of on the discrete word inputs. Both the continuous, distributed word representations and the sentence representations are perturbed before being fed into Hi-GRNN. The resulting representation after perturbation does not map to any word or sentence, indicating that our training strategy is intended to strengthen model robustness against any possible perturbed validation data. Moreover, we show that the proposed model can keep optimizing the negative log-likelihood over longer training iterations; in other words, the classifier does not converge easily to a local minimum. Besides, we examine the classifier performance by tuning the norm constraint that adjusts the weight of the adversarial perturbation, which allows us to observe the impact of our approach on generalization to new data. Finally, we further analyze the datasets with limited classification improvement under our approach to guide future improvement. In summary, the contributions of this paper are as follows:
• A hierarchical attention network with adversarial and virtual adversarial training, named Hi-GRNN, is proposed by adding perturbations on both word and sentence representations to minimize the overfitting issue;
• The significance of our proposed model in optimization and regularization is demonstrated by achieving state-of-the-art performance on six small-scale benchmark datasets over other baseline methods;
• Graphical measures and embedding visualizations are used to analyze the effectiveness of our proposed Hi-GRNN on two datasets (SST-1 and SST-2) with limited classification improvement.
Fig. 1. An example of hierarchical input parsing architecture.
2. Model description

In this section, we present the architecture of our proposed Hi-GRNN, inspired by the hierarchical attention network proposed by Yang et al. (2016). To understand the proposed Hi-GRNN, we first explain the gated recurrent unit and the hierarchical attention network in Sections 2.1 and 2.2 respectively.

2.1. Gated recurrent unit (Cho, Van Merriënboer, Bahdanau and Bengio, 2014)

The GRU (Cho, Van Merriënboer, Bahdanau et al., 2014) uses a gated hidden unit as an alternative to simpler recurrent units such as the plain tanh unit. It combines an input gate and a forget gate into an update gate. This gated hidden unit is similar to, but performs better than, LSTM (Hochreiter & Schmidhuber, 1997) in learning long-term dependencies. The architecture encodes variable-length sentences into a fixed-length vector representation. Two gates, the reset gate r_t and the update gate z_t, control the updating of information into a new state y_t. First, the reset gate r_t is computed using Eq. (1):

r_t = σ(W_r x_t + U_r y_{t−1} + b_r)    (1)

where σ is the logistic sigmoid function, x_t is the input representation vector at time t, W_r ∈ ℜ^{n,m} and U_r ∈ ℜ^{n,n} are learned weight matrices and y_{t−1} is the previous state. Here m and n are the word embedding dimensionality and the number of hidden units respectively. Subsequently, the update gate z_t is computed using Eq. (2):

z_t = σ(W_z x_t + U_z y_{t−1} + b_z)    (2)

where W_z ∈ ℜ^{n,m} and U_z ∈ ℜ^{n,n} are learned weight matrices. The candidate state ỹ_t is then computed using Eq. (3):

ỹ_t = tanh(W_y x_t + r_t ⊙ U_y y_{t−1} + b_y)    (3)

where ⊙ is the element-wise multiplication, and W_y ∈ ℜ^{n,m} and U_y ∈ ℜ^{n,n} are learned weight matrices. Finally, the new state y_t of the GRU is computed using Eq. (4):

y_t = (1 − z_t) ⊙ y_{t−1} + z_t ⊙ ỹ_t    (4)

If the reset gate r_t is zero, the previous state does not contribute to the candidate state. Thus, the hidden state can effectively drop irrelevant information and generate a more compact representation (Cho, Van Merriënboer, Gulcehrer, Bahdanau, Bougares, Schwenk and others, 2014).
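To make Eqs. (1)–(4) concrete, the following sketch computes a single GRU step in NumPy. It is an illustrative sketch only: the random weights and the helper name gru_step are our own assumptions, and the paper's experiments use TensorFlow rather than this code.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, y_prev, W_r, U_r, b_r, W_z, U_z, b_z, W_y, U_y, b_y):
    """One GRU update following Eqs. (1)-(4)."""
    r_t = sigmoid(W_r @ x_t + U_r @ y_prev + b_r)             # Eq. (1): reset gate
    z_t = sigmoid(W_z @ x_t + U_z @ y_prev + b_z)             # Eq. (2): update gate
    y_cand = np.tanh(W_y @ x_t + r_t * (U_y @ y_prev) + b_y)  # Eq. (3): candidate state
    return (1.0 - z_t) * y_prev + z_t * y_cand                # Eq. (4): new state

m, n = 200, 256  # embedding size and hidden units, matching the settings in Section 3.2
rng = np.random.RandomState(0)
params = []
for _ in range(3):  # reset gate, update gate, candidate state parameters
    params += [rng.randn(n, m) * 0.01, rng.randn(n, n) * 0.01, np.zeros(n)]
y_t = gru_step(rng.randn(m), np.zeros(n), *params)
print(y_t.shape)  # (256,)
```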
2.2. Hierarchical attention network (Yang et al., 2016)

For hierarchical training, a document y is parsed into K sentences {s_{yk} | k = 1, ..., K}, where the words of sentence k are denoted as T words {w_{kt} | t = 1, ..., T}. Fig. 1 shows an example of the hierarchical input parsing architecture, where a document is split into multiple sentences and each sentence is split into multiple words. Each word is represented with a fixed-dimensional continuous vector instead of a discrete one-hot vector to reduce the redundancy. This transformation is also known as word embedding (Mikolov et al., 2013). Words are transformed from word tokens into continuous vectors, x_{kt} = W_e w_{kt}. Notice that W_e ∈ ℜ^{d×|V|} is the word embedding matrix, where d is the vector dimension and |V| is the number of unique words in the vocabulary; each row of W_e corresponds to the word embedding of the ith word. Fig. 2 shows the hierarchical attention network proposed by Yang et al. (2016). In summary, the hierarchical attention network consists of a gated recurrent unit, word encoder, word attention, sentence encoder, sentence attention and document classification layer, as described below.

Fig. 2. Hierarchical attention network proposed by Yang et al. (2016).

A bidirectional GRU is applied on the word representation vectors to obtain annotations of words. It summarizes information from both directions, i.e., forward (denoted GRU→) and backward (denoted GRU←), to capture contextual information, namely bidirectional context-dependent annotations. As a result, a sequence of forward hidden states h→_{k1}, ..., h→_{kT} and a sequence of backward hidden states h←_{kT}, ..., h←_{k1} are generated:

x_{kt} = W_e w_{kt}, t ∈ [1, T]    (5)
h→_{kt} = GRU→(x_{kt}), t ∈ [1, T]    (6)
h←_{kt} = GRU←(x_{kt}), t ∈ [T, 1]    (7)

By concatenating the forward hidden state h→_{kt} and the backward hidden state h←_{kt}, we obtain an annotation for the given word as h_{kt} = [h→_{kt}, h←_{kt}], which summarizes the contextual information centered on word w_{kt}.

However, not every word is equally important to the sentence representation. Hence, an attention mechanism is introduced. The attention mechanism allows the model to attend to and accumulate the past output vectors to capture the whole semantics of the word or sentence representations. The hierarchical attention network utilizes this mechanism to capture the important information from words and sentences. The output of the attention can be seen as a higher-level representation of the input words or sentences. Notice that average or max pooling can be used as an alternative to summarize the meaning of words into a sentence representation, but it is less accurate than the attention mechanism (Rush, Chopra, & Weston, 2015). More precisely, Eqs. (8)–(10) are used to capture words that contribute important meaning from the annotations:

u_{kt} = tanh(W_w h_{kt} + b_w)    (8)
α_{kt} = exp(u_{kt}^⊤ u_w) / ∑_{t=1}^{T} exp(u_{kt}^⊤ u_w)    (9)
s_{yk} = ∑_{t} α_{kt} h_{kt}    (10)

where u_{kt} is a hidden representation of h_{kt} and u_w is randomly initialized and jointly learned during the training process. With the attention mechanism, we first feed the word annotation to a single-layer perceptron with hyperbolic tangent activation to obtain the hidden representation u_{kt}. Then, we compute the normalized weight α_{kt} through a softmax over the hidden representations. Lastly, a sentence representation vector is computed by summing the word annotations weighted by the normalized weights.

Similarly, given a document y, the document representation is composed by summarizing the sentence representation vectors. The bidirectional GRU is used again on the sentences to capture contextual information across sentences:

h→_k = GRU→(s_{yk}), k ∈ [1, K]    (11)
h←_k = GRU←(s_{yk}), k ∈ [K, 1]    (12)

To identify sentences that contribute important meaning to the document, the attention mechanism is used again to summarize the output annotations of sentences. Subsequently, the vector representation of document y is obtained as follows:

u_k = tanh(W_s h_k + b_s)    (13)
α_k = exp(u_k^⊤ u_s) / ∑_{k=1}^{K} exp(u_k^⊤ u_s)    (14)
v = ∑_{k} α_k h_k    (15)

where u_s is randomly initialized and jointly learned during the training process. Finally, the real values are converted to conditional class probabilities by adding a softmax layer:

p = softmax(W_c v + b_c)    (16)
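As a companion to Eqs. (8)–(10), the sketch below shows the attention pooling step in NumPy; the shapes, random initialization and function name attention_pool are illustrative assumptions rather than the authors' implementation. The same pooling is reused for the sentence level in Eqs. (13)–(15).

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

def attention_pool(H, W, b, u_ctx):
    """Attention pooling as in Eqs. (8)-(10): H is a (T, 2n) matrix of
    bidirectional annotations, u_ctx is the learned context vector."""
    U = np.tanh(H @ W.T + b)      # Eq. (8): hidden representations u_kt
    alpha = softmax(U @ u_ctx)    # Eq. (9): normalized attention weights
    return alpha @ H              # Eq. (10): weighted sum -> sentence vector

T, d = 7, 512                     # 7 words, 2 x 256 hidden units after concatenation
rng = np.random.RandomState(0)
H = rng.randn(T, d)
sent_vec = attention_pool(H, rng.randn(d, d) * 0.01, np.zeros(d), rng.randn(d) * 0.01)
print(sent_vec.shape)             # (512,)
```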
2.3. The proposed hierarchical gated recurrent neural network with adversarial and virtual adversarial training

Introducing the hierarchical attention approach leads to redundant training parameters and is thus prone to overfitting. To mitigate this concern for the hierarchical attention network (HAN) (Yang et al., 2016), we propose to use adversarial and virtual adversarial training. More precisely, we train the HAN classifier to be robust to perturbations of the embedding using adversarial and virtual adversarial training. We name this variant of HAN the hierarchical gated recurrent neural network with adversarial and virtual adversarial training (Hi-GRNN), as shown in Fig. 3.

Fig. 3. The proposed Hi-GRNN model.

Adversarial training is a regularization method designed to strengthen the robustness of a classifier against small, approximately worst-case perturbations; such small, well-tuned perturbations can easily lead to severe misclassification. The model could trivially learn to make the perturbation insignificant by learning embeddings with a very large norm. To prevent this, in the proposed Hi-GRNN model we replace the embedding x_{kt} with the normalized embedding x̃_{kt} defined by Eq. (17):

x̃_{kt} = (x_{kt} − E(x_{kt})) / √Var(x_{kt}), where E(x_{kt}) = ∑_{t=1}^{T} f_t x_{kt} and Var(x_{kt}) = ∑_{t=1}^{T} f_t (x_{kt} − E(x_{kt}))²    (17)

where f_t is the frequency of the tth word, calculated over all training examples.

Let x denote the input and θ the parameters of the classifier y. When we train our classifier using adversarial training, the loss function L is computed using Eq. (18):

L = −log p(y | x + r_{AT}; θ), where r_{AT} = arg min_{r, ∥r∥≤ϵ} log p(y | x + r; θ̃)    (18)

where r_{AT} is an adversarial perturbation to the input and θ̃ is the current, constant set of classifier parameters. We use θ̃ rather than θ to indicate that the backpropagation algorithm does not propagate gradients through the construction of the adversarial perturbation. During each training step, the worst-case adversarial perturbation r_{AT} against the conditional probability p(y|x; θ) of the classifier y given the input x and parameters θ is computed. These perturbations r_{AT} disrupt the classifier, and the classifier is trained against misclassification by minimizing the cost function with respect to θ. However, exact minimization with respect to r_{AT} is intractable. Goodfellow et al. (2015) proposed to linearize log p(y|x; θ̃) around x; with an L2 (Euclidean) norm constraint, the resulting adversarial perturbation can be backpropagated easily in neural networks:

r_{AT} = −ϵ g / ∥g∥_2, where g = ∇_x log p(y|x; θ)    (19)

In our work, we add the adversarial perturbations to the embeddings at different stages. At the word level, the adversarial word perturbation r_{w−AT} is applied to the word embedding rather than directly to the input; perturbing the input directly can lead to massive distortion, as small changes to the input might not correspond to any entry of the word embedding vocabulary. At the sentence level, the adversarial sentence perturbation r_{s−AT} is applied to the hidden representation composed from the sequence of words. To define the adversarial perturbations at the different stages, we denote the concatenation of a sequence of normalized word embedding vectors as X = [x̃_{k1}, x̃_{k2}, ..., x̃_{kT}] and the sequence of sentence representation vectors as S = [s_1, s_2, ..., s_K]. We then define the model conditional probability of the classifier y given the input X and parameters θ as p(y|X; θ), and given S and θ as p(y|S; θ). The adversarial perturbations for words, r_{w−AT}, and sentences, r_{s−AT}, are then defined as follows:

r_{w−AT} = −ϵ g / ∥g∥_2, where g = ∇_X log p(y|X; θ)    (20)
r_{s−AT} = −ϵ g / ∥g∥_2, where g = ∇_S log p(y|S; θ)    (21)

To train the classifier to be robust against such disruptive perturbations, we define the adversarial losses at the word and sentence levels as:

L_{w−AT}(θ) = −(1/N) ∑_{n=1}^{N} log p(y_n | X_n + r_{w−AT,n}; θ)    (22)
L_{s−AT}(θ) = −(1/N) ∑_{n=1}^{N} log p(y_n | X_n + r_{w−AT,n}, S_n + r_{s−AT,n}; θ)    (23)

where N is the number of inputs.

Meanwhile, virtual adversarial training (VAT), proposed by Miyato, Maeda, Koyama et al. (2016), is a notion of local distributional smoothness that can be used as a regularization method. It is closely related to adversarial training but introduces the additional cost

KL[p(·|x; θ) ∥ p(·|x + r_{VAT}; θ)], where r_{VAT} = arg max_{r, ∥r∥≤ϵ} KL[p(·|x; θ̃) ∥ p(·|x + r; θ̃)]    (24)

where r_{VAT} denotes the perturbation produced by virtual adversarial training and KL[p ∥ q] indicates the KL divergence between distributions p and q. Similar to adversarial training, the classifier is trained to be smooth, and hence robust against such perturbations, by minimizing the cost in Eq. (24). One advantage of virtual adversarial training is that this cost requires only the input x and not the label y; compared to adversarial training, virtual adversarial training is therefore applicable to semi-supervised training. However, r_{VAT} cannot be calculated exactly, so linearization and an L2 norm constraint were proposed (Goodfellow et al., 2015) to approximate it efficiently with backpropagation. During each training step, approximate virtual adversarial perturbations are calculated at the word and sentence levels as follows:

r_{w−VAT} = −ϵ g / ∥g∥_2, where g = ∇_{X+d} KL[p(·|X; θ) ∥ p(·|X + d_w; θ)]    (25)
r_{s−VAT} = −ϵ g / ∥g∥_2, where g = ∇_{S+d} KL[p(·|S; θ) ∥ p(·|S + d_s; θ)]    (26)

where d is a TD-dimensional small random vector. The virtual adversarial losses are then defined as:

L_{w−VAT}(θ) = (1/N) ∑_{n=1}^{N} KL[p(·|X_n, S_n; θ) ∥ p(·|X_n + r_{w−VAT,n}; θ)]    (27)
L_{s−VAT}(θ) = (1/N) ∑_{n=1}^{N} KL[p(·|X_n + r_{w−VAT,n}, S_n + r_{s−VAT,n}; θ) ∥ p(·|X_n + r_{w−VAT,n}; θ)]    (28)

where N is the number of examples.

In our experiments, we examine the capability of adversarial and virtual adversarial training at both the word and sentence levels. For word-only adversarial and virtual adversarial training, we minimize the negative log-likelihood along with the word-level adversarial loss (i.e., L_{w−AT}) and virtual adversarial loss (i.e., L_{w−VAT}). For sentence-level adversarial and virtual adversarial training, we minimize the negative log-likelihood along with the word and sentence level adversarial losses (i.e., L_{w−AT} and L_{s−AT}) and virtual adversarial losses (i.e., L_{w−VAT} and L_{s−VAT}).
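The sketch below illustrates how the L2-normalized gradient perturbations of Eqs. (19)–(26) could be generated in TensorFlow-style code. The callables model_logp and model_logits are hypothetical stand-ins for the Hi-GRNN forward pass, and the code is an illustration under our assumptions rather than the authors' implementation; note that the virtual adversarial perturbation below follows the standard VAT convention of moving in the direction that increases the KL divergence.

```python
import tensorflow as tf

def adversarial_perturbation(embed, labels, model_logp, epsilon):
    """r_AT = -epsilon * g / ||g||_2, g = gradient of log p(y | embed)
    with respect to the (word or sentence) embedding, cf. Eqs. (19)-(21)."""
    with tf.GradientTape() as tape:
        tape.watch(embed)
        log_p = model_logp(embed, labels)  # log-likelihood of the true labels (hypothetical)
    g = tf.stop_gradient(tape.gradient(log_p, embed))
    return -epsilon * g / (tf.norm(g) + 1e-12)

def virtual_adversarial_perturbation(embed, model_logits, epsilon, xi=1e-6):
    """Approximate r_VAT from the gradient of the KL divergence between
    predictions on the clean and randomly perturbed embedding, cf. Eqs. (25)-(26)."""
    p = tf.stop_gradient(tf.nn.softmax(model_logits(embed)))
    d = xi * tf.random.normal(tf.shape(embed))  # small random direction
    with tf.GradientTape() as tape:
        tape.watch(d)
        q = tf.nn.softmax(model_logits(embed + d))
        kl = tf.reduce_sum(p * (tf.math.log(p + 1e-12) - tf.math.log(q + 1e-12)))
    g = tf.stop_gradient(tape.gradient(kl, d))
    # standard VAT sign: perturb in the direction that increases the KL divergence
    return epsilon * g / (tf.norm(g) + 1e-12)
```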
3. Experiments

All training and verification of the models are conducted using TensorFlow (Abadi et al., 2016) on GPUs. We compare our proposed model with other existing text classification models using the eight datasets summarized in Table 1.

Table 1. Summary of the eight datasets, where T indicates the number of words per sentence and K indicates the number of sentences per document. CV indicates that cross validation is used.

Dataset    | Classes | Avg. T | Max T | Avg. K | Max K | Vocab  | Train  | Test
Reuter     | 8       | 9      | 41    | 9      | 31    | 21,071 | 5,485  | 2189
Amazon     | 4       | 24     | 350   | 5      | 31    | 40,451 | 7,200  | 800
Snippet    | 8       | 15     | 28    | 1      | 1     | 23,142 | 10,060 | 2280
SST-1      | 5       | 14     | 36    | 1      | 4     | 17,173 | 9,654  | 2210
SST-2      | 2       | 14     | 36    | 1      | 4     | 15,554 | 7,792  | 1821
TREC       | 6       | 8      | 19    | 1      | 2     | 7,149  | 5,542  | 500
Subjective | 2       | 18     | 43    | 1      | 2     | 20,177 | 10,000 | CV
MR         | 2       | 16     | 36    | 1      | 3     | 17,173 | 10,662 | CV

Reuter (Debole & Sebastiani, 2005). This dataset is a subset of Reuters-21578, a set of labeled news articles from the 1987 Reuters newswire which are classified according to 135 thematic categories, mostly concerning business and economy. We select the eight categories with the highest frequency (i.e., earning, acquisition, crude, trade, money, interest, grain and ship), with a total of 7674 articles, which are further split into 5485 for training and 2189 for testing.

Amazon (Blitzer, Dredze, & Pereira, 2007). The Amazon dataset consists of 8000 product reviews acquired from Amazon over four sub-collections (i.e., books, DVDs, electronics and kitchen appliances). Each collection is split into 1800 for training and 200 for testing; in total, we have 7200 reviews for training and 800 for testing.

Snippet (Phan, Nguyen, & S., 2008). Snippet consists of 10,060 training snippets and 2280 test snippets from eight topics, including business, computers, culture-arts, education-science, engineering, health, politics-society and sports. The top 20 snippets were selected from different numbers of phrases for each topic. On average, each snippet has 18.07 words. Both the labeled training and testing data were retrieved from Google search using JWebPro.

SST-1 (Socher et al., 2013b). The Stanford Sentiment Treebank (SST-1) is an extension of the Movie Review dataset from Rotten Tomatoes. Each review has a fine-grained label (i.e., very positive, positive, neutral, negative, very negative). We split the reviews into 9654 for training and 2210 for testing. Moreover, phrase-level annotations on all inner nodes are provided. Note that data is provided at both the phrase and sentence level for training, but only sentences are used for testing.

SST-2 (Socher et al., 2013b). SST-2 is the same dataset as SST-1 but with binary labels and without neutral sentences. The remaining data comprises 7792 reviews for training and 1821 for testing.

TREC (Voorhees, 1999). The TREC dataset consists of open-domain, fact-based questions divided into broad semantic categories. We use the six-class version of the small TREC dataset, whose classes are abbreviation, entity, person, description, location and numeric information. The training set consists of 5452 labeled questions and the testing set consists of 500 questions.

Subj (Pang & Lee, 2004). The Subjectivity dataset is from Pang et al., where the task is to classify a sentence as being subjective or objective. 5000 movie review snippets from Rotten Tomatoes were collected as subjective data and the remaining 5000 were taken from the Internet Movie Database (IMDB). Cross validation is conducted where the data is split into 9000 for training and 1000 for testing in each fold.

Movie Review (MR) (Pang & Lee, 2005). Movie Review is a popular sentiment classification dataset proposed by Pang et al. The reviews involve positive and negative sentiment classes and contain only one sentence per review. The dataset consists of 10,662 reviews with a vocabulary of 17,173 words, and 10-fold cross validation is used to train and validate the dataset.
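To make the T and K statistics of Table 1 concrete, the short sketch below shows one way such per-dataset statistics could be computed from raw documents. It is an illustrative sketch only; the punctuation-based sentence splitting and whitespace tokenization are our own assumptions rather than the preprocessing used in the paper.

```python
import re

def document_stats(documents):
    """Compute Table 1 style statistics (T: words per sentence, K: sentences
    per document) for a list of raw text documents."""
    words_per_sentence, sentences_per_doc = [], []
    vocab = set()
    for doc in documents:
        # naive sentence split on ., ! and ? followed by whitespace
        sentences = [s for s in re.split(r'(?<=[.!?])\s+', doc.strip()) if s]
        sentences_per_doc.append(len(sentences))
        for sent in sentences:
            tokens = sent.lower().split()
            words_per_sentence.append(len(tokens))
            vocab.update(tokens)
    return {
        'avg_T': sum(words_per_sentence) / max(len(words_per_sentence), 1),
        'max_T': max(words_per_sentence, default=0),
        'avg_K': sum(sentences_per_doc) / max(len(sentences_per_doc), 1),
        'max_K': max(sentences_per_doc, default=0),
        'vocab': len(vocab),
    }

print(document_stats(["This movie is great. I would watch it again!"]))
```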
3.1. Data preparation

To prepare the datasets for training, we apply a standard set of pre-processing steps to all eight datasets. We convert upper-case letters to lower-case letters and reserve digits, special characters, and special tokens for padding and out-of-vocabulary characters. Some special characters beyond the utf-8 string decoder are removed; in total, the character vocabulary consists of 72 different characters. Besides, variable-sized documents are handled as fixed-sized inputs through shortening and padding, where the fixed length is selected to cover 98% of the total text of each dataset. Moreover, no limit is set on the maximum vocabulary size.
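As a concrete illustration of this preparation step, the sketch below pads or truncates tokenized documents to a fixed length chosen to cover 98% of the documents. The token name '<pad>' and the percentile-based length selection are illustrative assumptions, not the exact implementation used by the authors.

```python
def fixed_length(token_lists, coverage=0.98):
    """Pick the document length that covers `coverage` of all documents."""
    lengths = sorted(len(toks) for toks in token_lists)
    return lengths[min(int(coverage * len(lengths)), len(lengths) - 1)]

def pad_or_truncate(tokens, max_len, pad_token='<pad>'):
    """Shorten long documents and pad short ones to a fixed size."""
    tokens = tokens[:max_len]
    return tokens + [pad_token] * (max_len - len(tokens))

docs = [d.lower().split() for d in ["a short review .", "a slightly longer review of a movie ."]]
L = fixed_length(docs)
print([pad_or_truncate(d, L) for d in docs])
```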
3.2. Experimental settings

In our experiments, we train and evaluate the text classification performance in four different modes: (1) the HAN model, (2) the proposed Hi-GRNN model with random perturbation, (3) the proposed Hi-GRNN model with adversarial training, and (4) the proposed Hi-GRNN model with adversarial and virtual adversarial training. Training with labels is done as follows. The best combination of the Adam optimizer (Kingma & Ba, 2015) with learning rate (varying from 10^-2 to 10^-6), weight decay penalization (from 0.9 to 0.99) and number of epochs (from 10 to 200) was selected using a combination of grid search and manual search (Montavon, Orr, & Müller, 2012). Batch sizes are 128 for Snippet, SST, Subjective, TREC and MR, and 32 for Reuter and Amazon, depending on the number of parameters and the memory of our equipment. Word embeddings from pre-trained global vectors (GloVe) (Pennington, Socher, & Manning, 2014) are used to assign pre-trained continuous vectors to the input. The GloVe vocabulary contains the 400,000 most frequent words obtained from five large-scale corpora. We set the word embedding size to 200 as it gives optimal results in our experiments. In all layers, weights and biases are initialized from a Gaussian distribution with a standard deviation of 0.01. Words and padding tokens which are not included in the pre-trained vocabulary receive the same Gaussian initialization. For the gated recurrent unit layer, we fix the size of the hidden units to 256 for both the forward and backward GRUs; the output representation then has 512 units after concatenation.

For each dataset, the original training set is divided into training, testing and cross-validation sets as shown in Table 1. 10-fold cross validation is performed on the Subjective and MR datasets. For each mode, we optimize several hyperparameters, e.g., the norm constraint epsilon of the adversarial and virtual adversarial training and the dropout rate. The epsilon value for adversarial training is assigned in the range 0.2 to 0.5 according to the complexity of the input data; values above or below this range can lead to intractable overfitting or under-fitting during training. Besides, the dropout rate is set between 0.1 and 0.2 within the gated recurrent unit layer. A higher dropout rate can cause the model with adversarial and virtual adversarial training to underfit severely, as the accumulation of perturbation and dropout can confuse the model excessively. Note that we do not perform early stopping in our experiments, in order to observe the behavior of the validation and training losses after convergence. All training and testing is conducted offline on a local machine equipped with an Nvidia Titan Xp.

4. Results & discussion

4.1. Comparisons

Other than the four aforementioned modes, we also compare the text classification performance with the following baseline methods:

1. Gaussian (Nikolentzos, Meladianos, Rousseau, Stavrakas, & Vazirgiannis, 2017): A method proposed by Nikolentzos et al. that models each document as a Gaussian document representation based on the distributed representations of its words. Documents are classified based on the similarity of their distributions.
2. NBSVM (Wang & Manning, 2012): A hybrid of two traditional methods, naive Bayes and support vector machine. Wang et al. proposed a variant of the support vector machine using naive Bayes log-count ratios as features that gives consistent gains on sentiment analysis tasks.
3. WMD (Kusner, Sun, Kolkin, & Weinberger, 2015): Kusner et al. proposed the Word Mover's Distance (WMD) that measures the dissimilarity between documents. WMD measures word dissimilarity by the Euclidean distance in word2vec embedding space and document dissimilarity by the minimum cumulative cost of travelling from the words of one document to those of another.
4. Bi-LSTM (Qian, Huang, Lei, & Zhu, 2017): Qian et al. proposed a linguistically-inspired regularizer on sequential bidirectional LSTM models. This model captures the linguistic role of sentiment words, negation words and intensity words in sentiment expression.
5. Skipthought (Kiros et al., 2015): An unsupervised learning approach inspired by the skip-gram model and extended to a sentence-level encoder–decoder called Skip-thought vectors. The model extracts features from sentences and evaluates the representations through a linear model.
6. DisSent (Nie, Bennett, & Goodman, 2017): The DisSent model proposed by Nie et al. trains sentence embeddings using explicit discourse relations. They evaluated DisSent using transfer learning to the SentEval (Conneau & Kiela, 2018) training framework with a simple softmax layer for sentiment classification.
7. Multi-View (Tang & de Sa, 2018): Tang et al. proposed a regularization method that creates a combination of different views from unlabeled corpora. The learning framework encodes a unified multi-view representation of the input using a recurrent neural network and a linear model.
8. DC-Bi-LSTM (Ding et al., 2018): Ding et al. proposed a densely connected bidirectional LSTM model inspired by highway networks (Srivastava, Greff, & Schmidhuber, 2015). They accumulate and concatenate representations from each layer, with depth up to 20, using recursive training to improve the information flow between layers.
9. DSCNN (Zhang, Lee, & Radev, 2016): A hierarchical hybrid model that builds textual representations via an LSTM network and extracts global features with a CNN-based model. The model captures dependency features both within sentences and across sentences.
10. BLSTM-2DCNN (Zhou et al., 2016): Two-dimensional convolution and max-pooling are used for feature extraction on top of an LSTM network. This allows better sampling over the LSTM network and exploits more features for sequence modeling tasks.
11. CNN (Kim, 2014): Kim proposed a simple CNN model with hyperparameter tuning and pre-trained word vectors for sentiment analysis and question classification. He also proposed a multi-channel CNN architecture as a regularization alternative.
12. Semantic-CNN (Wang et al., 2015): An improved CNN that models short texts based on semantic clustering and a convolutional neural network. The model combines useful semantic units from the input under the supervision of semantic cliques obtained from fast clustering (Rodriguez & Laio, 2014) for short text classification.

4.2. Results

Table 2 shows the classification performance of different training methods on the eight datasets. Full, Word and Sent denote the hidden layers to which perturbations are applied: Full indicates that perturbation is applied to both word and sentence representations, while Word and Sent indicate that perturbation is applied to word representations only and sentence representations only respectively. Hi-GRNN Full AT is the method with only adversarial training; Hi-GRNN Full Random is the method with random perturbation; Hi-GRNN Full AT+VAT is the method with both adversarial and virtual adversarial training. Besides, we investigate the performance of the model when adversarial and virtual adversarial training are applied only to word or sentence representations, denoted Hi-GRNN Word AT+VAT and Hi-GRNN Sent AT+VAT.

With our recommended training parameters and pre-processing, the HAN model outperforms the state-of-the-art models on four out of eight datasets (i.e., TREC, Snippet, Reuter and Amazon). For the TREC dataset, the HAN model is 0.2% higher in accuracy compared to the 98.2% achieved by Gaussian. Meanwhile, the HAN model is 1.2% and 0.6% higher in accuracy compared to Semantic-CNN and CNN on the Snippet and Reuter datasets. The HAN model also outperforms Gaussian, with an accuracy 0.02% higher, on the Amazon dataset. Note that the proposed Hi-GRNN model outperforms the HAN model in terms of accuracy on all datasets. Among the variants of the Hi-GRNN model, Hi-GRNN Full AT+VAT improves the performance drastically relative to the HAN model and other state-of-the-art methods. With adversarial training, we improve the accuracy of text classification by 0.2% to 4.6% on all eight datasets compared to the HAN model.
Table 2. Classification accuracy of the Hierarchical Gated Recurrent Neural Network (Hi-GRNN) against other state-of-the-art models. The best result of each dataset is marked with *. Full, Word and Sent denote perturbations on both word and sentence representations, word representations only and sentence representations only respectively. There are six categories: (1) a collection of baseline machine learning methods; (2) long short-term memory neural network models; (3) hybrid models of long short-term memory and convolutional neural networks; (4) convolutional neural network models; (5) the hierarchical attention network; and (6) our proposed Hi-GRNN model.

Model                 | SST-1 | SST-2 | Subj  | TREC  | MR    | Snip  | Reuter | Amaz
Baselines
  Gaussian            | –     | –     | 93.1  | 98.2  | 80.2  | 82.2  | 97.1   | 94.9
  NBSVM               | –     | –     | 93.6  | 97.8  | 86.9  | 64.7  | 97.1   | 94.8
  WMD                 | –     | –     | 86.0  | 92.4  | 66.4  | 74.1  | 95.0   | 92.0
LSTM
  Bi-LSTM             | 49.1  | 87.5  | 93.0  | 93.6  | 81.8  | –     | –      | –
  Skipthought         | –     | 82.9  | 93.7  | 92.2  | 79.4  | –     | –      | –
  DisSent             | –     | 82.9  | 92.4  | 84.6  | 82.5  | –     | –      | –
  Multi-View          | –     | 89.6  | 95.7  | 89.8  | 85.0  | –     | –      | –
  DC-Bi-LSTM          | 51.9  | 89.7* | 94.5  | 95.6  | 82.8  | –     | –      | –
LSTM+CNN
  DSCNN               | 50.6  | 88.7  | 93.9  | 95.6  | 82.2  | –     | –      | –
  BLSTM-2DCNN         | 52.4* | 89.5  | 94.0  | 96.1  | 82.3  | –     | –      | –
CNN
  CNN                 | 47.4  | 88.1  | 93.1  | 98.0  | 81.1  | 84.7  | 97.0   | 94.9
  Semantic-CNN        | –     | –     | –     | 97.2  | –     | 85.1  | –      | –
HAN                   | 44.9  | 83.8  | 92.7  | 98.4  | 78.2  | 86.3  | 97.7   | 95.0
Hi-GRNN
  Full Random         | 45.5  | 84.4  | 93.8  | 99.0  | 82.3  | 85.7  | 97.7   | 95.2
  Full AT             | 47.7  | 85.7  | 94.5  | 98.8  | 82.8  | 86.5  | 98.0   | 95.7
  Word AT+VAT         | 46.8  | 85.2  | 94.8  | 98.6  | 82.8  | 88.6* | 98.3   | 94.7
  Sent AT+VAT         | 45.7  | 83.5  | 94.5  | 98.8  | 83.0  | 86.4  | 97.6   | 95.8
  Full AT+VAT         | 46.8  | 85.4  | 97.0* | 99.6* | 89.2* | 88.3  | 98.5*  | 96.2*
With both adversarial and virtual adversarial training, we achieve state-of-the-art performance on six datasets (i.e., Subj, TREC, MR, Snippet, Reuter and Amazon), as shown in Table 2. However, there are two exceptions: the proposed Hi-GRNN model underperforms on the SST-1 and SST-2 datasets.

4.3. Continual learning

Adding adversarial and virtual adversarial training to Hi-GRNN makes continual learning possible without suffering from overfitting. We do not perform early stopping, letting the classifier learn continuously so that it does not converge early to a local minimum. To check whether the proposed model can effectively alleviate the overfitting issue, we investigate the learning curves of six different methods (i.e., the HAN model, Hi-GRNN Full Random, Hi-GRNN Full AT, Hi-GRNN Word AT+VAT, Hi-GRNN Sent AT+VAT and Hi-GRNN Full AT+VAT) on the eight datasets. These learning curves track the negative log-likelihood obtained by the different models over 25 iterations. The negative log-likelihood computed by the loss function describes how well the predicted output fits the exact label of the data. We show the learning curves for validation data and training data in Figs. 4 and 5 respectively. For validation data, we expect a low negative log-likelihood, which indicates better generalization on new data. In contrast, a low negative log-likelihood on training data indicates a tendency to overfit. Therefore, a well regularized and generalized model is expected to have low negative log-likelihood on validation data and high negative log-likelihood on training data.

Referring to Fig. 4, we can observe that adversarial and virtual adversarial training achieve much lower negative log-likelihood than the HAN model on validation data. Unlike the HAN model, which tends to overfit after it reaches its minimum negative log-likelihood, adversarial and virtual adversarial training keep the value lower and allow continual learning of the proposed Hi-GRNN model. By adding adversarial and virtual adversarial perturbations to the embeddings, the model is unlikely to memorize the training data. With longer training, the model is able to achieve lower negative log-likelihood and higher accuracy than the HAN model. In Fig. 5, we can see that the HAN model has lower negative log-likelihood compared to the proposed Hi-GRNN models, which indicates that the HAN model is more prone to overfitting (i.e., biased to the training data).

To show that adversarial training is a better regularizer, we investigate the performance of adversarial and virtual adversarial perturbations against randomly initialized perturbations. Both Figs. 4 and 5 show that random perturbations have weaker regularizing capability than adversarial perturbations: Hi-GRNN Full Random has higher negative log-likelihood on validation data and lower negative log-likelihood on training data compared to the other adversarial training methods. Theoretically, adversarial perturbations have the advantage of continually increasing the cost, which leads to significant network regularization, whereas small random perturbations in high-dimensional input spaces are approximately orthogonal to the cost gradient. Note that for Hi-GRNN Full Random, we replace the adversarial perturbations with Gaussian-distributed noise. Even though adding random perturbations to the embeddings outperforms the HAN model on almost all datasets, it still underperforms adversarial training overall.

4.4. Effect of tuning the norm constraint

Adding adversarial perturbations to the embeddings leads to better regularization. However, adding too much perturbation may lead to underfitting. Thus, fine tuning of the norm constraint, which adjusts the weight of the adversarial perturbations, is needed. Fig. 6 shows the effect of tuning the norm constraint ϵ on the negative log-likelihood. As mentioned, we set the norm constraint ϵ from 0.25 to 1.00; note that a higher norm constraint will lead to intractable overfitting. We only show the relationship between the norm constraint and the negative log-likelihood on the SST-2 dataset using Hi-GRNN Full AT and Hi-GRNN Full AT+VAT. We can see that models with lower norm constraint values tend to perform better, and the negative log-likelihood becomes intractable as the norm constraint increases. Note that when the norm constraint is set to 1.00 in Hi-GRNN Full AT+VAT, the model tends to converge in the early phase. This shows that the superposition of both adversarial and virtual adversarial perturbations with a high norm constraint can significantly increase the model complexity.
Fig. 4. Negative log likelihood obtained from 8 validation datasets.
4.5. Information retrieval

Information retrieval is important as it interprets the predictions made by a text classifier. Operations within a neural network are usually difficult to interpret and are often described as a black box. With the attention mechanism applied in our proposed models, words that have a significant impact on the prediction result can be found and analyzed easily. Fig. 7 demonstrates that the proposed Hi-GRNN models can retrieve salient information that contributes to the prediction result. As described in Eqs. (9) and (14), normalized attention weights are trained to emphasize the meaningful representations of both words and sentences. Thus, we investigate the effect of the adversarial perturbations added to our proposed models. Since
there might be more than one sentence in a document and both sentence and word attention weights are used, a normalization across the whole document is necessary for visualization. We denote the normalization equation as follows:
X_{α,kt} = |α_K| · |α_T|, where |α_K| = α_k / ∑_{k=1}^{K} α_k and |α_T| = α_{kt} / ∑_{t=1}^{T} α_{kt}    (29)
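A minimal sketch of this document-level normalization, under our reading of Eq. (29), is shown below; the array shapes and variable names are illustrative assumptions.

```python
import numpy as np

def document_attention(word_alpha, sent_alpha):
    """Combine word-level weights (K x T) and sentence-level weights (K,)
    into document-level weights X of shape (K, T), following Eq. (29)."""
    sent_norm = sent_alpha / sent_alpha.sum()                        # |alpha_K|
    word_norm = word_alpha / word_alpha.sum(axis=1, keepdims=True)   # |alpha_T| per sentence
    return sent_norm[:, None] * word_norm                            # X_{alpha,kt}

word_alpha = np.array([[0.7, 0.2, 0.1], [0.3, 0.3, 0.4]])  # word attention per sentence
sent_alpha = np.array([0.6, 0.4])                          # sentence attention
print(document_attention(word_alpha, sent_alpha))          # sums to 1 over the document
```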
Fig. 5. Negative log likelihood obtained from 8 training datasets.

Fig. 7 shows a sample for which Hi-GRNN Full AT+VAT predicts the review correctly (i.e., c1) while the other models predict it wrongly (i.e., c0). In the absence of perturbations, the HAN model tends to memorize certain words rather than interpreting the document as a whole, which leads to wrong predictions on unseen documents. In contrast, the proposed Hi-GRNN models tend to have a more plateau-shaped distribution of attention weights across the document. Based on these distributions, we expect the adversarial models to be robust against small perturbations as they do not rely on memorizing features. For the given example, we found that the proposed models (except Hi-GRNN Full AT+VAT) tend to focus on the word "watchable" and thus wrongly predict the review as negative. Meanwhile, Hi-GRNN Full AT+VAT may classify the review correctly due to its understanding of the words "guilty pleasure", which indicate a strongly positive meaning.
4.6. Silhouettes evaluation

Table 2 shows that our proposed models perform below expectation on both the fine-grained and binary SST datasets (i.e., SST-1 and SST-2). These datasets contain obscure training examples, which leads to unfavorable classification accuracy. To investigate the obscurity of the underlying data, we compute the Silhouettes score (Rousseeuw, 1987), which can be used to evaluate clustering quality. The Silhouettes score measures how similar an object is to its own cluster compared to other clusters.
Fig. 6. Effects of norm constraint tuning on SST-2 dataset.
Fig. 7. Attention weights extracted from a predicted result for six different methods using SST-2 dataset. c0 and c1 indicate the result predicted by different methods where c0 denotes negative comment and c1 denotes positive comment. HG denotes Hi-GRNN.
The score ranges from −1 to 1, where a high value indicates that an object is well matched to its own cluster and poorly matched to neighboring clusters. In our case, a high Silhouettes score on a dataset indicates that the predicted outputs of the classifier are well separated across classes. We first cluster the document representations of the validation data after five training epochs using K-means clustering. Subsequently, we calculate the Silhouettes score for two to 10 clusters. Moreover, the clustering of each dataset is visualized using t-SNE (Maaten & Hinton, 2008) to reduce the high-dimensional representations (a sketch of this clustering analysis is given after the list below). More precisely, Fig. 8 consists of two columns for each dataset as follows:
• Left column: the Silhouettes scores for different numbers of clusters in each dataset, showing how many clusters (i.e., classes) are suitable to represent the output of each dataset.
• Right column: the visualization of each dataset when the number of clusters is fixed to 5, 2, 8 and 8 for SST-1, SST-2, Reuter and Snippet respectively. Similar to other existing literature, we follow the number of classes given by the benchmark datasets.
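A minimal sketch of the clustering analysis referenced above is given here, assuming scikit-learn and document representations stored in a NumPy array; the variable names and the random placeholder representations are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.manifold import TSNE

# doc_repr: (num_documents, dim) document representations from the trained model
doc_repr = np.random.RandomState(0).randn(500, 512).astype(np.float32)

# Silhouettes score for 2 to 10 clusters (left column of Fig. 8)
for n_clusters in range(2, 11):
    labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(doc_repr)
    print(n_clusters, silhouette_score(doc_repr, labels))

# 2-D visualization of the clustering with t-SNE (right column of Fig. 8)
embedded = TSNE(n_components=2, random_state=0).fit_transform(doc_repr)
```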
suitable to be classified into 8, 2, 8 and 7 classes respectively. Meanwhile, referring to the right column of Fig. 8, different colors represent different classes in each dataset. Unlike Reuter and Snippet that have clear observable differentiation between classes, SST datasets have lower clustering quality. For both SST datasets, majority of the output embeddings entangle in a group rather than separated. As expected, SST-1 dataset has low Silhouettes score in its dedicated class (i.e., class 5). Although we notice that a small group of output embeddings is well separated, the output embeddings are still not well clustered in overall. Judging from visualization of SST-2 output embeddings, the clustering is ambiguous even though the Silhouettes score is high. Thus, our proposed models underperform other methods in classifying SST datasets due to the unclear boundary of different classes. In addition, small-scale training data with limited features provided might be the root cause that limits the capability of our proposed models. 5. Conclusion Small-scale datasets contain lesser information. With deep neural networks, massive training parameters cause bias to training data and thus a learning algorithm tends to overfit. To mitigate the concern of overfitting over small-scare datasets, we
310
H.-K. Poon, W.-S. Yap, Y.-K. Tee et al. / Neural Networks 119 (2019) 299–312
Fig. 8. Silhouettes score analysis and K-means clustering on two sentiment datasets (SST-1 and SST-2) and two categorical datasets (Snippet and Reuter). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
have proposed the Hi-GRNN models that integrate hierarchical structure of gated recurrent neural network with two different adversarial trained regularizers. Experiments were conducted on eight small-scale benchmark datasets and the proposed model (i.e., Hi-GRNN Full AT+VAT) achieves state-of-the-art performance on multiple benchmark datasets. The proposed model provides rich semantic and context information for both words
and sentences while preventing overfitting as compared to the HAN model. Finally, meaningful comparison was made to demonstrate the regularization and generalization ability of our learning framework, and to explain our result. From our work, we found that adversarial perturbations applied to a learning network can aggravate the regularize effect. Besides, our proposed models do not perform well for small-scale datasets consisting of obscure
inputs. However, the proposed adversarial and virtual adversarial training introduce sequential processing and thus cannot be implemented efficiently on parallel computing platforms, meaning that the training time increases linearly with the size of the training data. In this paper, we focused on minimizing overfitting for small-scale datasets; the proposed approach will be extended to larger datasets in future work.

Acknowledgments

This research was supported in part by the Collaborative Agreement with NextLabs (Malaysia) Sdn Bhd (Project title: Advanced and Context-Aware Text/Media Analytics for Data Classification). We gratefully acknowledge NVIDIA Corporation for donating the Titan Xp GPU to support this research.

References

Abadi, M., et al. (2016). TensorFlow: Large-scale machine learning on heterogeneous distributed systems. In Proceedings of the 12th USENIX symposium.
Bayer, J., Osendorfer, C., Korhammer, D., Chen, N., Urban, S., & van der Smagt, P. (2013). On fast dropout and its applicability to recurrent networks. CoRR abs/1311.0701.
Blitzer, J., Dredze, M., & Pereira, F. (2007). Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the 45th annual meeting of the association of computational linguistics (pp. 440–447). ACL.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Caruana, R., Lawrence, S., & Giles, C. L. (2001). Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In Advances in neural information processing systems (NIPS) Vol. 14 (pp. 402–408).
Cheong, H. S., Yap, W. S., Tee, Y. K., & Lee, W. K. (2018). Hierarchical attention networks for different types of documents with smaller size of datasets. In Proceedings of the 6th international conference on robot intelligence technology and applications (RITA) (pp. 28–41). Springer.
Cho, K., Van Merriënboer, B., Bahdanau, D., & Bengio, Y. (2014). On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of eighth workshop on syntax, semantics and structure in statistical translation (SST@EMNLP) (pp. 103–111). ACL.
Cho, K., Van Merriënboer, B., Gulcehrer, C., Bahdanau, D., Bougares, F., Schwenk, H., et al. (2014). Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of conference on empirical methods in natural language processing (EMNLP) (pp. 1724–1734). ACL.
Conneau, A., & Kiela, D. (2018). SentEval: An evaluation toolkit for universal sentence representations. In The international conference on language resources and evaluation (LREC). ELRA.
Debole, F., & Sebastiani, F. (2005). An analysis of the relative hardness of Reuters-21578 subsets. Journal of the Association Information Science and Technology, 56(6), 584–596.
Ding, Z., Xia, R., Yu, J., Li, X., & Yang, J. (2018). Densely connected bidirectional LSTM with applications to sentence classification. In LNCS: vol. 11109, Natural language processing and chinese computing - 7th CCF international conference (NLPCC) (pp. 278–287). Springer.
Gal, Y., & Ghahramani, Z. (2016). A theoretically grounded application of dropout in recurrent neural networks. In Advances in neural information processing systems (NIPS) Vol. 29 (pp. 1019–1027).
Goodfellow, Ian J., Jonathon, S., & Christian, S. (2015). Explaining and harnessing adversarial examples. In Proceedings of the international conference on learning representations.
Acknowledgments

This research was supported in part by the Collaborative Agreement with NextLabs (Malaysia) Sdn Bhd (Project title: Advanced and Context-Aware Text/Media Analytics for Data Classification). We gratefully acknowledge NVIDIA Corporation for donating the Titan Xp GPU to support this research.

References

Abadi, M., et al. (2016). TensorFlow: Large-scale machine learning on heterogeneous distributed systems. In Proceedings of the 12th USENIX symposium.
Bayer, J., Osendorfer, C., Korhammer, D., Chen, N., Urban, S., & van der Smagt, P. (2013). On fast dropout and its applicability to recurrent networks. CoRR abs/1311.0701.
Blitzer, J., Dredze, M., & Pereira, F. (2007). Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the 45th annual meeting of the association for computational linguistics (pp. 440–447). ACL.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Caruana, R., Lawrence, S., & Giles, C. L. (2001). Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In Advances in neural information processing systems (NIPS) Vol. 14 (pp. 402–408).
Cheong, H. S., Yap, W. S., Tee, Y. K., & Lee, W. K. (2018). Hierarchical attention networks for different types of documents with smaller size of datasets. In Proceedings of the 6th international conference on robot intelligence technology and applications (RITA) (pp. 28–41). Springer.
Cho, K., Van Merriënboer, B., Bahdanau, D., & Bengio, Y. (2014). On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of the eighth workshop on syntax, semantics and structure in statistical translation (SSST@EMNLP) (pp. 103–111). ACL.
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., et al. (2014). Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the conference on empirical methods in natural language processing (EMNLP) (pp. 1724–1734). ACL.
Conneau, A., & Kiela, D. (2018). SentEval: An evaluation toolkit for universal sentence representations. In The international conference on language resources and evaluation (LREC). ELRA.
Debole, F., & Sebastiani, F. (2005). An analysis of the relative hardness of Reuters-21578 subsets. Journal of the Association for Information Science and Technology, 56(6), 584–596.
Ding, Z., Xia, R., Yu, J., Li, X., & Yang, J. (2018). Densely connected bidirectional LSTM with applications to sentence classification. In LNCS: vol. 11109, Natural language processing and chinese computing - 7th CCF international conference (NLPCC) (pp. 278–287). Springer.
Gal, Y., & Ghahramani, Z. (2016). A theoretically grounded application of dropout in recurrent neural networks. In Advances in neural information processing systems (NIPS) Vol. 29 (pp. 1019–1027).
Goodfellow, I. J., Shlens, J., & Szegedy, C. (2015). Explaining and harnessing adversarial examples. In Proceedings of the international conference on learning representations.
Hearst, M. A., Dumais, S. T., Osuna, E., Platt, J., & Scholkopf, B. (1998). Support vector machines. IEEE Intelligent Systems and their Applications, 13(4), 18–28.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Johnson, R., & Zhang, T. (2015). Semi-supervised convolutional neural networks for text categorization via region embedding. In Advances in neural information processing systems (NIPS) Vol. 28 (pp. 919–927).
Jozefowicz, R., Zaremba, W., & Sutskever, I. (2015). An empirical exploration of recurrent network architectures. In International conference on machine learning (ICML) (pp. 2342–2350). JMLR.org.
Kim, Y. (2014). Convolutional neural networks for sentence classification. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1746–1751). ACL.
Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In International conference on learning representations.
Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., et al. (2015). Skip-thought vectors. In Advances in neural information processing systems (NIPS) Vol. 28 (pp. 3294–3302).
Kristina, T., & Manning, C. D. (2000). Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In Proceedings of the joint SIGDAT conference on empirical methods in natural language processing and very large corpora (EMNLP) (pp. 63–70). ACL.
Krizhevsky, A., Sutskever, I., & Hinton, G. (2012). ImageNet classification with deep convolutional neural networks. In Proceedings of the advances in neural information processing systems (NIPS) Vol. 25 (pp. 1106–1114).
Kusner, M., Sun, Y., Kolkin, N., & Weinberger, K. (2015). From word embeddings to document distances. In International conference on machine learning (ICML) (pp. 957–966). JMLR.org.
Lewis, D. D. (1992). An evaluation of phrasal and clustered representations on a text categorization task. In Proceedings of the 15th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR) (pp. 37–50). ACM.
Lewis, D. D. (1998). Naive (Bayes) at forty: The independence assumption in information retrieval. In LNCS: vol. 1398, Proceedings of the European conference on machine learning (ECML) (pp. 4–15). Springer.
Li, Y. H., & Jain, A. K. (1998). Classification of text documents. The Computer Journal, 41(8), 537–546.
Maaten, L. V. D., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research (JMLR), 9, 2579–2605.
Manevitz, L. M., & Yousef, M. (2001). One-class SVMs for document classification. Journal of Machine Learning Research (JMLR), 2, 139–154.
Mikolov, T. (2012). Statistical language models based on neural networks. (Ph.D. dissertation), Brno University of Technology.
Mikolov, T., Chen, K., Corrado, G., Dean, J., Sutskever, I., & Zweig, G. (2013). word2vec. https://code.google.com/p/word2vec (accessed 11 December 2017).
Mikolov, T., Karafiát, M., Burget, L., Černocký, J., & Khudanpur, S. (2010). Recurrent neural network based language model. In Proceedings of the 11th annual conference of the international speech communication association (INTERSPEECH) (pp. 1045–1048). ISCA.
Miyato, T., Dai, A. M., & Goodfellow, I. (2016). Adversarial training methods for semi-supervised text classification. In Proceedings of the international conference on learning representations.
Miyato, T., Maeda, S. I., Koyama, M., Nakae, K., & Ishii, S. (2016). Distributional smoothing with virtual adversarial training. In Proceedings of the international conference on learning representations.
Montavon, G., Orr, G. B., & Müller, K.-R. (2012). LNCS: vol. 7700, Neural networks: Tricks of the trade. Springer.
Ng, A. Y., & Jordan, M. I. (2001). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Proceedings of the advances in neural information processing systems (NIPS) Vol. 14 (pp. 841–848).
Nie, A., Bennett, E. D., & Goodman, N. D. (2017). DisSent: Sentence representation learning from explicit discourse relations. CoRR abs/1710.04334.
Nikolentzos, G., Meladianos, P., Rousseau, F., Stavrakas, Y., & Vazirgiannis, M. (2017). Multivariate Gaussian document representation from word embeddings for text categorization. In Proceedings of the 15th conference of the european chapter of the association for computational linguistics (volume 2) (pp. 450–455). ACL.
Pang, B., & Lee, L. (2004). A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd annual meeting on association for computational linguistics (pp. 271–278). ACL.
Pang, B., & Lee, L. (2005). Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd annual meeting on association for computational linguistics (ACL) (pp. 115–124). ACL.
Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the 2002 conference on empirical methods in natural language processing (EMNLP) (pp. 79–86). ACL.
Pantel, P., & Lin, D. (1998). SpamCop - A spam classification and organization program. In Proceedings of the AAAI workshop on learning for text categorization (AAAI) (pp. 95–98). IAAI.
Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532–1543). ACL.
Pham, V., Bluche, T., Kermorvant, C., & Louradour, J. (2014). Dropout improves recurrent neural networks for handwriting recognition. In 14th international conference on frontiers in handwriting recognition (ICFHR) (pp. 285–290). IEEE.
Phan, X. H., Nguyen, L. M., & Horiguchi, S. (2008). Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In Proceedings of the 17th international conference on world wide web (pp. 91–100). ACM.
Poon, H. K., Yap, W. S., Tee, Y. K., Goi, B. M., & Lee, W. K. (2018). Document level polarity classification with attention gated recurrent unit. In 2018 international conference on information networking (ICOIN) (pp. 7–12). IEEE.
Qian, Q., Huang, M., Lei, J., & Zhu, X. (2017). Linguistically regularized LSTMs for sentiment classification. In Proceedings of the 55th annual meeting of the association for computational linguistics (ACL) (volume 1: long papers) (pp. 1679–1689). ACL.
Rodriguez, A., & Laio, A. (2014). Clustering by fast search and find of density peaks. Science, 344(6191), 1492–1496.
Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65.
Rush, A. M., Chopra, S., & Weston, J. (2015). A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 conference on empirical methods in natural language processing (EMNLP) (pp. 379–389). ACL.
Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5), 513–523.
Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673–2681.
Sietsma, J., & Dow, R. J. (1991). Creating artificial neural networks that generalize. Neural Networks, 4(1), 67–79.
Socher, R., Pennington, J., Huang, E. H., Ng, A. Y., & Manning, C. D. (2011). Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the 2011 conference on empirical methods in natural language processing (EMNLP) (pp. 151–161). ACL.
Socher, R., Perelygin, A., Wu, J. Y., Chuang, J., Manning, C. D., Ng, A. Y., & Potts, C. (2013a). Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing (EMNLP) (pp. 1631–1642). ACL.
Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A., et al. (2013b). Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing (EMNLP) (pp. 1631–1642). ACL.
Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). Training very deep networks. In Advances in neural information processing systems (NIPS) Vol. 28 (pp. 2377–2385).
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research (JMLR), 15(1), 1929–1958.
Tang, S., & de Sa, V. R. (2018). Multi-view sentence representation learning. CoRR abs/1805.07443.
Voorhees, E. M. (1999). The TREC-8 question answering track report. In Proceedings of the eighth text retrieval conference (TREC) (pp. 77–82). NIST.
Wang, S., & Manning, C. D. (2012). Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th annual meeting of the association for computational linguistics (ACL) (volume 2) (pp. 90–94). ACL.
Wang, P., Xu, J., Xu, B., Liu, C., Zhang, H., Wang, F., et al. (2015). Semantic clustering and convolutional neural network for short text categorization. In Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (pp. 352–357). ACL.
Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., & Hovy, E. (2016). Hierarchical attention networks for document classification. In Proceedings of the 2016 conference of the north american chapter of the association for computational linguistics: Human language technologies (pp. 1480–1489). ACL.
Zhang, Y., Jin, R., & Zhou, Z.-H. (2000). Understanding bag-of-words model: A statistical framework. International Journal of Machine Learning and Cybernetics, 1(1–4), 43–52.
Zhang, R., Lee, H., & Radev, D. (2016). Dependency sensitive convolutional neural networks for modeling sentences and documents. In Annual conference of the north american chapter of the association for computational linguistics: Human language technologies (HLT-NAACL) (pp. 1512–1521). ACL.
Zhou, P., Qi, Z., Zheng, S., Xu, J., Bao, H., & Xu, B. (2016). Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling. In Proceedings of the 26th international conference on computational linguistics (COLING) (pp. 4845–4849). ACL.