Finding decision jumps in text classification


Xianggen Liu a,d, Lili Mou b, Haotian Cui a,d, Zhengdong Lu c, Sen Song a,d,∗

a Laboratory for Brain and Intelligence and Department of Biomedical Engineering, Tsinghua University, Beijing 100084, China
b Department of Computing Science, University of Alberta, Edmonton, AB T6G 2E8, Canada
c DeeplyCurious.ai, Beijing 100084, China
d Beijing Innovation Center for Future Chip, Tsinghua University, Beijing 100084, China

Article history: Received 12 March 2019; Revised 25 June 2019; Accepted 28 August 2019. Communicated by Dr. T. Mu.

Keywords: Text classification; Reinforcement learning; Weak supervision; Rationalizing neural prediction

Abstract

Text classification is one of the key problems in natural language processing (NLP), and in early years, it was usually accomplished by feature-based machine learning models. Recently, deep neural networks have become powerful learning machines, making it possible to work with text itself as raw input for classification problems. However, existing neural networks are typically end-to-end and lack an explicit interpretation of the prediction. In this paper, we propose Jumper, a novel framework that models text classification as a sequential decision process. Generally, Jumper is a neural system that scans a piece of text sequentially and makes classification decisions at the time it wishes, inspired by the cognitive process of human text reading. In our framework, both the classification result and the time of making the classification are part of the decision process, controlled by a policy network and trained with reinforcement learning. Experimental results on real-world applications demonstrate the following properties of a properly trained Jumper: (1) it tends to make decisions as soon as the evidence is sufficient, therefore reducing total text reading by 30–40% and often finding the key rationale of the prediction; and (2) it achieves classification accuracy better than or comparable to state-of-the-art models on several benchmark and industrial datasets. We further conduct a simulation experiment with mock data, which confirms that Jumper is able to make a decision at the theoretically optimal decision position.

1. Introduction

Natural language understanding plays an important role in various applications, including text classification [14], information extraction [44], and machine comprehension [9,31]. Recently, neural networks have become a prevailing technique in natural language processing (NLP) and have achieved strong performance in these tasks. However, previous work mainly focuses on the ultimate performance of a task (e.g., classification accuracy). For example, Kim [14] builds several variants of convolutional neural networks for sentiment classification. It is typically unclear where and how such a model makes its decision, which is in fact important in real industrial applications for debuggability and interpretability [21].

In this paper, we propose a novel framework, Jumper, that models text understanding as a sequential decision process, inspired by the cognitive process of humans. When people read text, we look for clues, perform reasoning, and obtain information from the text. Jumper mimics this process by reading the text in a sentence-by-sentence manner with a neural network. At each sentence, the model makes a decision (also known as an action) based on the input, and at the end of this process, it would have some "understanding" of the text.

More specifically, our paper focuses on paragraph-level text classification, which may exhibit more semantic changes than sentence-level classification. When our neural network reads a paragraph, it is assumed to hold a default value "None" at the beginning. At each decision step, a sentence of the paragraph is fed to the neural network; the network then decides if it is confident enough to "jump" to a non-default value. We impose a constraint that each jump is a finalized decision, which cannot be updated in the future. Such a decision process is depicted in Fig. 1, and we call our model Jumper.

☆ This paper is an extension of Liu et al. [18]. The code of our work is available at: https://github.com/Liuxg16/jumper-codes
∗ Corresponding author at: Laboratory for Brain and Intelligence and Department of Biomedical Engineering, Tsinghua University, Beijing 100084, China. E-mail addresses: [email protected] (X. Liu), doublepower.[email protected] (L. Mou), [email protected] (H. Cui), luz@deeplycurious.ai (Z. Lu), [email protected], [email protected] (S. Song).



Fig. 1. Illustration of Jumper's decision process. In this paragraph of six sub-sentences, Jumper makes a prediction at an appropriate step for each subtask (a predefined question). Jumper is trained separately for each subtask, and here we show the decision processes together.

Our model is trained by reinforcement learning with only weak supervision. That is to say, we assume our training labels contain only the ultimate results, and no supervision signal is given regarding the step at which the model should make a decision. This also conforms to human reading, as people are typically certain about reading comprehension results, but it is difficult to model how human beliefs change as they read.

An intriguing consequence of the one-jump constraint is that it forces our model to be serious about both when to predict and what to predict. This is because a paragraph does not contain a special symbol indicating its end. If our model defers its decision beyond the point where it could have made a sufficiently accurate prediction, it risks not being able to predict at all. On the other hand, if the model predicts too early, it risks low accuracy. By optimizing the expected reward in reinforcement learning, the model learns to make decisions at an "optimal" time step. Jumper has the following advantages compared with traditional end-to-end text classification:

• Jumper is able to locate the evidence of the classification (if it is within one or a few sentences), which coincides with recent work on rationalizing neural prediction [15].
• In tasks where information is scattered more widely, Jumper learns to make a decision as soon as it is confident enough, making it possible to skip the remaining part of a paragraph without loss of accuracy.

To evaluate our approach, we first design a simulation experiment where we generate mock data with manually specified distributions. The simulation shows that Jumper is able to find the theoretically optimal decision step in the online-prediction fashion. Then, we apply our model to real-world tasks, including two benchmark datasets (movie review¹ and the AG's news corpus²), as well as an industrial application of analyzing occupational injury. Experiments show that Jumper achieves comparable or higher ultimate classification accuracy compared with strong baselines. Moreover, it reduces the length of text reading by 30–40%, resulting in fast inference. For information extraction-style classification where the evidence is centered in a single sentence, our model is able to find the key rationale without training labels of jumping positions.

¹ https://www.cs.cornell.edu/people/pabo/movie-review-data/
² https://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html

2. Related work

Text classification aims to categorize a piece of text into predefined classes, and it is a fundamental problem in natural language processing (NLP), with applications ranging from sentiment analysis [27,47] to topic classification [36,45].

The representation of words or sentences plays an important role in text classification. Traditional text classification usually adopts hand-crafted features or feature templates (e.g., bag-of-words features), based on which machine learning models are used for classification [11,13,27]. However, such bag-of-words features may lose critical information about the grammar and the order of words. This may lead to conflicts in sentence representations, since sentences of different meanings could have the same bag-of-words features [32]. The n-gram model [37] alleviates the above problem by considering several consecutive words. Mikolov et al. [23] present the distributed "word-to-vector" (Word2Vec) representation to capture the semantic information of words. The Word2Vec representation of a word, also known as a word embedding, is trained on the surrounding words over a huge corpus [22] and is widely used in neural networks.

Another key advance in text classification is the development of more powerful classifiers. In early years, the naive Bayes classifier [20] and decision tree-based classifiers [40] were used in document categorization, featuring low computational cost. Later, nonparametric techniques, such as k-nearest neighbors (kNN) [16] and the support vector machine (SVM) with the radial basis function (RBF) kernel [19], were also widely used for text classification. Recently, the deep neural network has become one of the most powerful models, as it works well with raw input of words [14,41,45]. Typical neural models include convolutional neural networks (CNNs) [4,10] and recurrent neural networks (RNNs), where RNNs usually involve long short-term memory (LSTM) [7] and the gated recurrent unit (GRU) [5]. Attention mechanisms [1,17], including the more advanced transformer architecture [6,35], are achieving state-of-the-art performance in text classification. However, these architectures usually lack interpretability.


Other work predicts classification labels for not only sentences but also words and phrases [25,33,46,48], but these methods require fine-grained supervision at both the word/phrase and sentence levels. Recently, researchers have focused more on rationalizing predictions in a weakly supervised manner [15]. Lei et al. [15] build a neural text classifier on key phrases in a paragraph, where key phrase extraction is learned by reinforcement learning with a real-valued reward. However, their method cannot deal with non-existing information, because it is unclear how to train and predict without extracted phrases. Also, such an approach would be more difficult to train with a sparse reward (like 0–1 loss). Yu et al. [42] learn to skim text by predicting how many words to skip during reading. Another study [43] further brings in a re-read operation in addition to word skipping. However, it is counterintuitive that a network can learn to skip several future words (which by themselves have a lot of freedom) without actually seeing them. By contrast, our network skims text by ignoring all future sentences after it has become confident enough to predict, where the confidence is in terms of its expectation of the remaining sentences.

Different from existing approaches, our paper models text classification as a sequential decision process. The network is related to the belief tracker in Wen et al. [38] for a task-oriented dialog system. However, their network is trained by cross-entropy loss with strong supervision of the groundtruth labels at every step. We instead propose a one-jump constraint in the decision process and train our network by reinforcement learning. As we shall show in the experiments, Jumper outperforms cross-entropy loss when trained with weak supervision, in terms of finding the rationale.

3. The proposed method

In our approach, a paragraph is segmented into sub-sentences³, each of which can be thought of as a basic unit for some "proposition" and is fed to Jumper in order. Jumper takes an action after reading each sub-sentence. Fig. 2 provides an overview of Jumper, which has a hierarchical structure as follows:

• A sentence encoder reads the words in a sentence and extracts the semantic features into a fixed-dimensional vector space.
• A controller, essentially consisting of a recurrent neural network (RNN) and a policy network (PolicyNet), is built upon the sentence encoders, and takes actions ("jumps") when appropriate.

The RNN reads the input sentences in sequence and maintains the entire historical information. PolicyNet predicts the decision action at each step based on the RNN's hidden states. We impose a one-jump constraint that allows at most one prediction of a non-default value in a paragraph. That is to say, if Jumper predicts a non-default value at a certain step, then the decision is final, being the classification result of the paragraph. Jumper is trained by reinforcement learning with weak supervision: we only use the classification accuracy as the reward signal, so Jumper learns not only what to predict but also when to predict. The rest of this section elaborates on the model components and the training process.

Fig. 2. Overview of Jumper (SentEnc is a CNN-based sentence encoder).

³ Segmented by ",.!?" We abuse the terminologies of sentence and sub-sentence for simplicity if not confusing.

3.1. Sentence encoder

We use a convolutional neural network (CNN) as the sentence encoder [14]. The CNN applies a set of sliding windows to the concatenation of neighboring words to extract local features, which are aggregated by max pooling to represent sentence-level information. For a particular sentence in a paragraph, we denote the word embeddings by $x_1, x_2, \ldots, x_L \in \mathbb{R}^d$, where $L$ is the number of words in the sentence and $d$ is the dimension of the embeddings. We also denote the concatenation of column vectors $x_i, x_{i+1}, \ldots, x_j$ by $x_{i:j} = [x_i \oplus x_{i+1} \oplus \cdots \oplus x_j]$. Then, convolution and its pooling are computed by

$$c_{k,i} = f(w_k\, x_{i:i+h-1} + b_k), \quad (1)$$
$$c_k = \max\{c_{k,1}, c_{k,2}, \ldots, c_{k,L-h+1}\}, \quad (2)$$
$$c = [c_1 \oplus c_2 \oplus \cdots \oplus c_K], \quad (3)$$

where $w_k \in \mathbb{R}^{hd}$ and $b_k \in \mathbb{R}$ are the weights of the $k$th convolutional kernel, extracting a local feature $c_{k,i}$ at position $i$. The maximum feature over all positions is chosen as the sentence's representation in terms of this kernel. Finally, the features of different kernels are concatenated as the encoded representation of the sentence, denoted as $c$.

We would like to point out that other neural networks (e.g., RNNs) may also be a reasonable architecture for the sentence encoder. In our work, we choose the CNN because we hope to further induce word-level rationales by backtracking through the max-pooling layer, as will be described in Section 3.4.
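For concreteness, the following is a minimal sketch (not the authors' released implementation) of Eqs. (1)–(3) in PyTorch. The embedding dimension, number of feature maps, and window sizes follow Section 5.3, and the class name SentEnc echoes Fig. 2; the exact details are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentEnc(nn.Module):
    """CNN sentence encoder: a sketch of Eqs. (1)-(3)."""
    def __init__(self, d=300, n_kernels=200, windows=(1, 2, 3, 4, 5)):
        super().__init__()
        # One Conv1d per window size h; its kernels play the role of (w_k, b_k).
        self.convs = nn.ModuleList([nn.Conv1d(d, n_kernels, h) for h in windows])

    def forward(self, x):                 # x: (batch, L, d) word embeddings,
        x = x.transpose(1, 2)             # padded so that L >= max window size
        feats = []
        for conv in self.convs:
            c = F.relu(conv(x))           # Eq. (1): c_{k,i} = f(w_k x_{i:i+h-1} + b_k)
            c = c.max(dim=2).values       # Eq. (2): max pooling over positions i
            feats.append(c)
        return torch.cat(feats, dim=1)    # Eq. (3): concatenate kernel features
```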

3.2. Controller

Based on the encoded sentence features, the controller of Jumper takes corresponding actions, as in a sequential decision process. Inside the controller are two submodules: (1) an RNN fuses the current input and the previous sentences, integrating information over the entire history; and (2) a policy network (PolicyNet) makes a decision for the current step (see the gray part of Fig. 2).

Formally, the RNN takes the sequence of sentence features $c_1, \ldots, c_T$ computed by Eq. (3), where $T$ is the number of sentences, and updates its hidden states accordingly. In this paper, we use the gated recurrent unit (GRU) [5] as our recurrent update, denoted by $h_t = \mathrm{GRU}(h_{t-1}, c_t)$, where $h_t$ is the hidden state at time step $t$.

Based on the RNN's hidden states, PolicyNet predicts the decision action. Suppose the classification task has $N$ possible target labels; we use a softmax predictor of $N+1$ ways, where an additional target "None" represents non-existing information. Notice that the "None" class does not differ from the other classification labels at the beginning of training. Rather, reinforcement learning with the one-jump constraint makes the model predict "None" until it is confident enough to take a jump action.

Let $a_t \in \mathbb{R}^{N+1}$ denote Jumper's decision after processing the $t$th sentence. It is given by a policy distribution

$$\pi(a_t \mid c_t) = \mathrm{softmax}(W_p [c_t \oplus h_t] + b_p), \quad (4)$$

where $W_p$ and $b_p$ are the weights and the bias term. Here, we feed PolicyNet the concatenation of the RNN's hidden state and the sentence features, inspired by ResNet [8]. During training, we sample an action from the predicted distribution, whereas for testing, we choose the action with the maximum a posteriori probability, i.e., $a_t = \operatorname{argmax} \pi(a_t \mid c_t)$.
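A minimal sketch of one controller step, under the same caveat as before: nn.GRUCell and nn.Linear stand in for the GRU update and $(W_p, b_p)$ of Eq. (4), and the dimension names are hypothetical.

```python
import torch
import torch.nn as nn

class Controller(nn.Module):
    """RNN + PolicyNet: one decision step of Jumper."""
    def __init__(self, feat_dim, hidden_dim, n_labels):
        super().__init__()
        self.gru = nn.GRUCell(feat_dim, hidden_dim)
        # N + 1 outputs: the N target labels plus the default "None".
        self.policy = nn.Linear(feat_dim + hidden_dim, n_labels + 1)

    def step(self, c_t, h_prev):
        h_t = self.gru(c_t, h_prev)                      # h_t = GRU(h_{t-1}, c_t)
        logits = self.policy(torch.cat([c_t, h_t], -1))  # W_p [c_t ; h_t] + b_p
        return torch.softmax(logits, dim=-1), h_t        # pi(a_t | c_t), Eq. (4)

# Training samples an action; testing takes the argmax:
#   a_t = torch.distributions.Categorical(pi).sample()   (training)
#   a_t = pi.argmax(dim=-1)                              (testing)
```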



3.3. One-jump constraint and training process

The main difficulty in learning Jumper is the lack of step-by-step supervision, i.e., we assume the labels contain only the ultimate results but no information about the appropriate position to jump. Admittedly, learning would be easier if we had fine-grained annotations, but they are costly and labor-intensive to obtain. We therefore propose a training method using reinforcement learning (RL) with a one-jump constraint. The one-jump constraint allows at most one non-default prediction (not "None") in a paragraph; that is, Jumper terminates the reading process once it makes a non-default prediction at a certain step.

We train Jumper with reinforcement learning. Specifically, a classification reward compares the model prediction with the groundtruth by

$$R^{(j)}_{\text{acc}} = \mathbb{I}\left(a^{(j)}_{T^{(j)}_{\text{jump}}} = l^{(j)}\right) \quad (5)$$

for a data point $j$, where $T^{(j)}_{\text{jump}}$ is the step at which the model jumps (predicts the non-default value) and $a^{(j)}_{T^{(j)}_{\text{jump}}}$ is the action that the model takes at that step. $\mathbb{I}$ is the indicator function and $l^{(j)}$ is the groundtruth.

With only the classification reward, however, our model tends to jump at early stages. This is because, for a uniform distribution over all $N+1$ actions (due to randomly initialized parameters), the probability of jumping at time step $t$ is $\frac{N}{(N+1)^t}$, which is near zero when $t$ grows. Thus, we design an intermediate step-by-step reward as

$$R^{(j,t)}_{\text{step}} = \begin{cases} r, & \text{if } s_t = \text{None}, \\ 0, & \text{otherwise}, \end{cases} \quad (6)$$

for a data point $j$ and each step $t$. Here, $r$ is a (small) positive constant, balancing the importance of $R^{(\cdot)}_{\text{step}}$ and $R^{(\cdot)}_{\text{acc}}$. We would like to emphasize that the design of $R^{(\cdot)}_{\text{step}}$ differs from traditional planning and reinforcement learning (e.g., the maze problem), where the reward for each step is negative. In our problem, each step has a positive reward, which alleviates the early-jumping problem.

Following the calculation of the return in reinforcement learning [34], we compute the accumulated reward from step $t$ to the jumping step as

$$R^{(j)}_{t:T^{(j)}_{\text{jump}}} = \sum_{t'=t}^{T^{(j)}_{\text{jump}}} \gamma^{t'-t}\, R^{(j,t')}_{\text{step}} + R^{(j)}_{\text{acc}}, \quad (7)$$

where $T^{(j)}_{\text{jump}}$ denotes the jumping step and $\gamma \in [0, 1]$ is the discounting rate. The objective is to maximize the expected reward

$$J(\Theta) = \sum_{j} \mathbb{E}_{\pi\left(a^{(j)}_t \mid c^{(j)}_t\right)}\left[R^{(j)}_{1:T^{(j)}_{\text{jump}}}\right], \quad (8)$$

where $\Theta$ denotes all model parameters.

The gradient of the objective is computed by

$$\nabla_{\Theta} J(\Theta) \approx \sum_{j=1}^{N} \sum_{t=1}^{T^{(j)}_{\text{jump}}} \frac{1}{N\, T^{(j)}_{\text{jump}}}\, \hat{R}^{(j)}_{t:T^{(j)}_{\text{jump}}}\, \nabla_{\Theta} \log \pi\left(a^{(j)}_t \mid c^{(j)}_t\right), \quad (9)$$

where $a^{(j)}_t$ is sampled from $\pi(\cdot \mid c^{(j)}_t)$ and $\hat{R}^{(j)}_{t:T^{(j)}_{\text{jump}}}$ is the adjusted reward, described below. This approach is also known as the REINFORCE algorithm [34,39]. To balance exploration and exploitation, we reserve a small probability $\epsilon$ of uniformly sampling from the entire action space.

To reduce the variance of REINFORCE, we subtract from the reward a baseline term (computed as the average of $M = 5$ samples for each data point) and truncate negative rewards, following Mou et al. [24]. The truncation originates from the idea of "reward inaction," where unsuccessful trials are ignored (Section 5.8 of Sutton et al. [34]); it prevents the gradient from being messed up by incorrect samples. The adjusted reward $\hat{R}^{(j)}_{t:T^{(j)}_{\text{jump}}}$ in Eq. (9) is computed by

$$\hat{R}^{(j)}_{t:T^{(j)}_{\text{jump}}} = \max\left\{0,\ R^{(j)}_{t:T^{(j)}_{\text{jump}}} - \frac{1}{M}\sum_{m=1}^{M} R^{(j,m)}_{t:T^{(j,m)}_{\text{jump}}}\right\}, \quad (10)$$

where $R^{(j,m)}_{t:T^{(j,m)}_{\text{jump}}}$ stands for the accumulated reward at the $m$th trial of the $j$th data point.

It is interesting to develop an intuitive understanding of why Jumper can find the "right" position to predict with only weak supervision. For a particular data sample, let $t^*$ be the position at which the network could have predicted. The reward encourages the model to predict at any time after $t^*$, because later sentences yield a slightly higher reward due to $R_{\text{step}}$. However, if the network maximizes the reward for a particular training data point by not predicting, it has to wait for the intermediate reward. Since there is no clue indicating the end of a paragraph, the network cannot learn such information, and thus it risks not being able to predict for other samples, resulting in a low total reward over the training set. Therefore, our Jumper framework with the one-jump constraint enables the model to find the "right" position to predict with only weak supervision.
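The following sketch shows how the rewards and the REINFORCE update could be computed for a single data point. It is an illustration under our reading of Eqs. (5)–(10), with $r$ and $\gamma$ set as in Section 4.3; the helper names are ours, not the released code's.

```python
import torch

def accumulated_reward(t, T_jump, correct, r=0.05, gamma=0.9):
    """R_{t:T_jump} of Eq. (7) for one sampled trajectory.

    Assumes the step reward of Eq. (6) is collected at every step before
    the jump (where the action is still "None") and is 0 at the jump itself.
    """
    R_step = sum(gamma ** (tp - t) * r for tp in range(t, T_jump))
    R_acc = 1.0 if correct else 0.0          # Eq. (5): 0-1 classification reward
    return R_step + R_acc

def adjusted_rewards(rewards):
    """Eq. (10): subtract the M-sample mean baseline and truncate at zero."""
    baseline = sum(rewards) / len(rewards)   # rewards: M trials of the same (j, t)
    return [max(0.0, R - baseline) for R in rewards]

def reinforce_loss(log_probs, R_hat):
    """Per-data-point REINFORCE surrogate for Eq. (9): weight each taken
    action's log-probability by its adjusted reward; autograd on the
    negated average yields the gradient estimate."""
    T_jump = len(log_probs)
    return -sum(R * lp for R, lp in zip(R_hat, log_probs)) / T_jump
```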



3.4. Backtracking word-level clues

So far, our approach works at the sentence level. To obtain word-level rationales, we propose a simple heuristic that backtracks the information flow through the max-pooling operation, based on the key sentence at which Jumper makes the non-default decision. We compute the gradient of the log-likelihood with respect to the previous step's sentence representation; it is then multiplied by the magnitude of the difference between the two steps (ignoring the sign by taking the square). The two factors indicate how a feature (at the previous step) could have improved the prediction, and what changed most at the current time step. We then choose the top $D = 10$ values, yielding the most important $D$ dimensions of the CNN output:

$$\mathcal{D} = \operatorname{top}_D\left(\frac{\partial \log \pi(a_t \mid c_t)}{\partial c_{t-1}} \odot (c_t - c_{t-1})^2\right), \quad (11)$$

where $\odot$ indicates the point-wise product. We backtrack where the maximum value comes from in the max-pooling operation of Eq. (2), obtaining the word that matters in a dimension $d$ as $w_d = \operatorname{argmax}_i\, c_{d,i}$. The importance of a word is counted as the fraction of dimensions in $\mathcal{D}$ at which the word is backtracked.
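A sketch of this heuristic, assuming log_pi_at is the scalar log-probability of the chosen action, c_prev and c_t are the CNN outputs of the key sentence at steps $t-1$ and $t$ (with gradient tracking enabled on c_prev), and argmax_pos[d] records which word position won the max pooling of Eq. (2) in dimension $d$; all names are ours.

```python
import torch

def word_clues(log_pi_at, c_prev, c_t, argmax_pos, D=10):
    # Gradient of the log-likelihood w.r.t. the previous representation.
    grad, = torch.autograd.grad(log_pi_at, c_prev, retain_graph=True)
    score = grad * (c_t - c_prev) ** 2      # Eq. (11), point-wise product
    top_dims = score.topk(D).indices        # the D most important dimensions
    # Backtrack each top dimension to the word that won its max pooling;
    # a word's importance is the fraction of top dimensions mapped to it.
    return [int(argmax_pos[d]) for d in top_dims]
```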


4. Simulation experiment

In this section, we present a simulation experiment on mock data to investigate whether Jumper can learn to predict at the optimal time step. We conduct such simulation experiments because it is hard to quantitatively analyze optimality with real data.

4.1. Definition of the simulation task

Fig. 3 illustrates the probabilistic graphical model that generates the mock data. To be specific, a data sample is a sequence of characters, generated as follows:

• There are two sequence generators, G0 and G1. For each sequence, we sample a generator with a probability α of 0.5.
• For each sequence, we also sample its length L (number of characters) from a Poisson distribution, that is, L ∼ Poisson(λ). In our experiment, λ = 8. For Jumper training, we set the maximum length of a sequence to 15 (covering 98% of samples).
• In a sequence, each character X is generated independently by a Bernoulli distribution (i.e., we have a unigram model with a vocabulary size of 2). The generator G0 generates "a" with probability μ0 and "b" with probability 1 − μ0; the other generator G1 generates "a" with probability μ1 and "b" with probability 1 − μ1. We set μ0 = 0.3 and μ1 = 0.7.
• A special indicator EOS (abbreviation of End-of-Sentence) is appended to the end of each sequence. EOS is introduced simply for computing the theoretical optimum, and is not seen by Jumper.

Fig. 3. The generative model that generates the mock dataset. We select a latent variable μ with probability α; then we generate each character X with probability μ, and obtain a data sample by repeating L times, where L is sampled from a Poisson distribution with expected value λ. N is the number of data samples in the dataset.

For each data point, the character sequence x1 x2 … xL is observable, whereas the generator (G = G0 or G = G1) is unobserved, being the variable to predict. We consider the setting of online prediction, where the characters arrive one after another as a stream. At each time step, the model is allowed to predict which generator has synthesized the sequence; the model can also opt not to predict, if it decides to see more evidence. The online prediction has two constraints: (1) Once a prediction has been made, it cannot be changed even if future characters show opposite evidence. (2) Once EOS is seen, the model loses its privilege of prediction, and if it has not predicted yet, we say the model errs on this data point.

Intuitively, if the model predicts too early, the expected accuracy is low due to the partial evidence; if the model predicts too late, the expected accuracy is also low due to the constraint of not being able to predict. Hence, there exists a theoretical optimum that maximizes the expected accuracy. Fortunately, we are able to compute the theoretically optimal decision process since we know the data distribution.

Notice that this simulation simplifies the classification task described in Section 3. Here, each character corresponds to a (sub)sentence, i.e., a decision step, in a paragraph, and thus the tailored Jumper does not have the CNN sentence module (Section 3.1). Such simplification is sensible, as it affects neither the learning nor the theoretical computation of the optimal decision step.
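For concreteness, a data sample can be drawn with a few lines of Python; this is a sketch of the generative process in Fig. 3 under the constants above (the function name is ours).

```python
import numpy as np

def sample_sequence(alpha=0.5, mu=(0.3, 0.7), lam=8, rng=np.random):
    g = int(rng.random() < alpha)                   # latent generator G_0 or G_1
    L = rng.poisson(lam)                            # length L ~ Poisson(lambda)
    chars = ["a" if rng.random() < mu[g] else "b" for _ in range(L)]
    return chars + ["EOS"], g                       # observed sequence, latent label
```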

4.2. Computation of the theoretical optimum

Assuming the data distribution is known, we can compute the theoretically optimal decision step in a recursive fashion, akin to the methodology of value iteration [2,34]. At a time step $t$, a partial sequence $x_{1:t} = x_1 \ldots x_t$ has been observed ($x_t \neq \text{EOS}$). We have three choices, namely, predicting $G = G_0$, predicting $G = G_1$, and not predicting at this step. The maximum expected accuracy at this step is

$$\mathrm{Acc}_{x_{1:t}} = \max\{\mathrm{Acc}_{x_{1:t}}(\text{predict } G_0),\ \mathrm{Acc}_{x_{1:t}}(\text{predict } G_1),\ \mathrm{Acc}_{x_{1:t}}(\text{not predict})\}. \quad (12)$$

We first compute the best performance we could obtain at step $t$. We have

$$p(G = G_i \mid x_{1:t}) \propto p(x_{1:t} \mid G = G_i)\, p(G = G_i) \quad (13)$$
$$\propto \prod_{j=1}^{t} p(x_j \mid G = G_i), \quad (14)$$

where Eq. (13) is due to Bayes' rule, and in Eq. (14) we use the generation distribution defined in Section 4.1 (the prior $p(G = G_i) = \alpha = 0.5$ is a constant). The proportion is normalized over $G_i$, and therefore

$$\mathrm{Acc}_{x_{1:t}}(\text{predict } G_i) = \frac{\prod_{j=1}^{t} p(x_j \mid G = G_i)}{\prod_{j=1}^{t} p(x_j \mid G = G_0) + \prod_{j=1}^{t} p(x_j \mid G = G_1)}. \quad (15)$$

Alternatively, the prediction can be delayed: the expected accuracy of not predicting at step $t$ equals the maximum expected accuracy at step $t+1$ when $x_{1:t+1}$ is observed. At step $t+1$, there are two scenarios. If $x_{t+1} = \text{EOS}$, the expected accuracy is 0. Otherwise, the expected accuracy can be computed recursively by Eq. (12). In other words,

$$\mathrm{Acc}_{x_{1:t}}(\text{not predict}) = p(x_{t+1} = \text{EOS}) \cdot 0 + p(x_{t+1} \neq \text{EOS}) \cdot \mathbb{E}_{x_{t+1} \sim p(x_{t+1} \mid x_{1:t})}\left[\mathrm{Acc}_{x_{1:t+1}}\right], \quad (16)$$

where $p(x_{t+1} = \text{EOS})$ can be computed since we generate the sequence length from $\text{Poisson}(\lambda)$; see Appendix A for details.

During the sequential decision process, Eq. (12) defines the optimal decision at each step based on the partial sequence. Since the prediction of $G = G_0$ or $G = G_1$ cannot be updated, the optimal decision step is the earliest step at which predicting $G$ has a larger expected accuracy than not predicting, that is,

$$t_{\text{optimal}} = \operatorname{argmin}_{t}\{\exists\, i,\ \mathrm{Acc}_{x_{1:t}}(\text{predict } G_i) > \mathrm{Acc}_{x_{1:t}}(\text{not predict})\}. \quad (17)$$

In this way, we build the above recursive method to derive the optimal decision step and classification result. The pseudo-code is shown in Appendix A.

4.3. Implementation of Jumper and baselines

We apply Jumper to this task and see whether the jumping positions are close to the theoretical optima. Following the

setting defined in Section 4.1, Jumper reads the input sequence in order and predicts which generator has synthesized the sequence in an online fashion.

We compare our model with two baselines, namely, random jumping and Jumper trained with cross-entropy loss (denoted as Jumper-ce). The random-jumping baseline chooses the jumping position randomly within the sequence length; it also predicts the class label randomly. Jumper-ce is trained by cross-entropy loss at the end of a paragraph without the one-jump constraint. In this case, Jumper-ce does not have an explicit decision step, and thus we induce a potentially plausible jumping position by heuristics. For a sequence where Jumper-ce changes its mind at one or more steps, we randomly choose one of these steps as the jumping position, with the jumping target being the classification prediction. If Jumper-ce keeps the same prediction during the whole sequence, we follow the random-jumping baseline and choose the jumping step randomly; however, the predicted class is given by Jumper-ce in this case, as opposed to random jumping.

For a fair comparison, Jumper and Jumper-ce share the same hyperparameters. The GRU layer is 20-dimensional. For the input layer, we use a one-hot representation to encode the characters instead of embeddings, since the vocabulary is small. Our mock dataset contains 100k, 10k, and 10k samples for training, validation, and test, respectively. To train Jumper and Jumper-ce, we apply the AdaDelta algorithm (initial learning rate of 0.1) with a mini-batch size of 1000. We adopt a dropout layer (dropout rate of 0.5) on top of the RNN layer for regularization. In addition, the early-stopping criterion [3] is used to select the parameters according to the best performance on the development set. Jumper has a few reinforcement learning hyperparameters: the intermediate reward r is 0.05, the discounting rate γ is 0.9, and the exploration rate ε is 0.1.

4.4. Results and discussion

We measure Jumper by both jumping positions and classification predictions, compared with the theoretical optima discussed in Section 4.2. In particular, the quality of jumping positions is evaluated by (1) PositionAcc, the accuracy of the model's jumping positions; and (2) Pearson's correlation [29], a measure of the linear correlation between the jumping positions and the theoretical optima.

Table 1
Position accuracy and correlation between the jumping positions and the theoretically optimal decision positions over all test samples.

Model             PositionAcc   Pearson's Correlation
Random Jumping    12.50%        0
Jumper-ce         59.85%        −0.069
Jumper            91.66%        0.947

Table 1 shows the performance of jumping positions in the simulation experiment. As seen, both random jumping and Jumper-ce achieve low PositionAcc and are uncorrelated with the theoretical optima. In particular, Jumper-ce is trained by cross-entropy loss, which is unable to tell the decision position during classification. While our heuristic for Jumper-ce is reasonable, the induced jumping positions are far from optimal (60% PositionAcc). By contrast, Jumper is trained by our proposed method with the one-jump constraint. It is able to recover the theoretically optimal decision step, evidenced by the high PositionAcc and correlation score.

In addition to jumping positions, Table 2 shows the classification performance. Jumper-ce also yields a poor accuracy (at its jumping position). In other words, the accuracy of Jumper-ce is subject to our heuristically induced decision step, which leads to lower performance than Jumper-ce's prediction at the end of the sequence (as it is trained). Jumper, however, achieves 82.69% classification accuracy, which is close to the analytic optimum (in an online fashion). In summary, the simulation experiment verifies that Jumper is indeed able to learn not only the classification label but also the "optimal" decision position.

Table 2
Test accuracy (%) and the average jumping position in the simulation task.

Model                      Accuracy (%)   Avg. jumping position
Random Jumping             49.98          4.97
Jumper-ce                  81.79          4.58
Optimal decision process   82.97          3.85
Jumper                     82.69          3.95

5. Real data experiments

In addition to the simulation experiment, we also evaluate Jumper on three real-world tasks, including two benchmark datasets and one industrial application.

5.1. Datasets

In this part, we describe the datasets used in our experiments.

• Movie Review (MR), whose objective is binary sentiment classification (positive vs. negative) of movie reviews [26]; it is widely used as a sentence classification task.
• AG news corpus (AG), a collection of more than one million news articles. Following Zhang et al. [45], we classify the largest four categories: "world," "sports," "business," and "science or technique." We randomly selected 5% of the training data as the development set, since AG does not have a standard split.
• Occupational Injury (OI).⁴ This task, the information extraction of occupational injury, originates from a real industrial application in the legal domain. We constructed a dataset (in Chinese) of 3995 cases related to occupational injuries from an online domain-specific forum. Based on an established ontology with 15 slots, each text is annotated with answers to these 15 questions; the statistics are listed in Appendix B. We report two subtasks, occupational injury identification (InjIdn) and injury level (Level), where our model is trained independently in a single-task setting.

⁴ The Occupational Injury dataset is available at: https://github.com/Liuxg16/jumper-codes/tree/master/data/OI-dataset

Table 3
Statistics of the datasets after tokenization: the number of classes, the number of data samples, the vocabulary size, and the number of test samples.

Data   # of classes   # of samples   Vocabulary size   Test
MR     2              10,662         18,765            10-fold
AG     4              127,600        17,836            7600
OI     2–12           3995           2089              400

5.2. Competing methods

We compare Jumper with the following baselines:

• Hierarchical CNN-GRU. As mentioned in the previous section, we use Jumper trained with cross-entropy loss as a baseline model, which is essentially a hierarchical model with a CNN and a GRU for sentences and paragraphs, respectively. This baseline is similar to our model except for the training criterion.
• Bi-GRU. It reads a text in two opposite directions, and the final states are concatenated for prediction.
• CNN. This model is proposed by Kim [14], with several different sizes of convolution operators to learn the sentence representation.
• Self-Attentive. Lin et al. [17] propose a self-attentive model that attends to the sequence itself.

In the latter three baselines, we concatenated all sentences, and the models were applied to the whole paragraph.

5.3. Implementation details

For Jumper and Hierarchical CNN-GRU, we adopted the same hyperparameters and training strategy as in the simulation experiment, except that words are represented by embeddings. We applied coarse grid search on the MR and OI development sets to select hyperparameters for the other baselines. The CNN part used the rectified linear unit (ReLU) as the activation function, filter windows of sizes 1 to 5 with 200 feature maps each, and a dropout rate of 0.5. We re-implemented the self-attentive model using the same hyperparameters as in Lin et al. [17]. We did not perform any dataset-specific tuning except early stopping on the development sets.

In addition, word embeddings for all models were initialized with 300d GloVe vectors [30] and fine-tuned during training. Other parameters were initialized by randomly sampling from the uniform distribution over [−0.01, 0.01]. For all models, we used AdaDelta with a learning rate of 0.1 and a batch size of 50. Both Jumper and the baseline methods were implemented with the PyTorch library (version 0.3.1) [28], and NVIDIA GeForce GTX 1080 GPUs were used for training and testing.
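As a rough sketch (not the released training script), the initialization and optimizer setup described above might look as follows, where model and glove are placeholders for the network and the pre-trained embedding matrix:

```python
import torch
import torch.nn as nn

embedding = nn.Embedding.from_pretrained(glove, freeze=False)  # 300d GloVe, fine-tuned
for name, p in model.named_parameters():
    if "embedding" not in name:                                # other parameters
        nn.init.uniform_(p, -0.01, 0.01)                       # U[-0.01, 0.01]
optimizer = torch.optim.Adadelta(model.parameters(), lr=0.1)
```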

5.4. Results and discussion

In this section, we present Jumper's performance regarding several aspects: classification accuracy, jumping accuracy, and inference speed. We also present an analysis of the learning dynamics and a case study.

Classification results. We first analyze the classification accuracy of Jumper compared with the baselines. Table 4 shows the test performance on the three datasets with four tasks. We notice that, on the MR and AG datasets, Jumper occasionally predicts "None," which is not a valid label in these datasets. This puts our model at a disadvantage, and we take the most likely non-default (not "None") label as the prediction at the end of a paragraph.

Table 4
The test classification accuracy (mean ± standard deviation) on the MR, AG, and OI datasets. For the AG and OI datasets, we ran each baseline ten times with different random seeds and report the average accuracy and the standard deviation. For the MR dataset, we randomly split the data into 10 folds: 8 for training, 1 for validation, and 1 for test; the process was repeated 10 times and we report the average test performance. †Results quoted from previous papers.

Model                  MR             AG             OI-Level       OI-InjIdn
CNN† [14]              81.00          –              –              –
fasttext† [12]         –              92.50          –              –
Bi-GRU                 77.80 ± 1.03   92.35 ± 0.10   94.51 ± 0.41   72.95 ± 0.42
CNN                    80.80 ± 0.93   92.49 ± 0.08   96.18 ± 0.26   74.28 ± 0.35
Self-Attentive         82.10 ± 0.81   91.38 ± 0.06   96.94 ± 0.24   73.27 ± 0.29
Hierarchical CNN-GRU   80.23 ± 0.92   92.41 ± 0.08   95.53 ± 0.34   74.63 ± 0.40
Jumper                 80.67 ± 0.87   92.57 ± 0.07   97.21 ± 0.27   75.47 ± 0.33

As shown, our Jumper model achieves comparable or better performance on all these tasks. This indicates that modeling text classification as a sequential decision process does not hurt, and may even improve, performance. We would also like to point out that accuracy is not the only performance measure we consider. More importantly, our proposed model is able to find the key supporting sentence for text classification, or shorten the reading process, as shown in the following experiments.

Efficiency of text reading. Jumper has better inference efficiency than full-text reading. This is because Jumper makes a decision as soon as it sees sufficient evidence during its reading process, due to the one-jump constraint, and after the prediction, there is no need to read future sentences. We see in Table 5 that, while our model achieves similar or higher performance compared with strong baselines, it reduces the length of text reading by 30–40%, leading to fast inference (20–30% speedup).

Table 5
The average inference time (10⁻⁴ s) on the MR, AG, and OI datasets, tested 5 times independently on NVIDIA GeForce GTX 1080 GPUs. In this experiment, the batch size is 1, which simulates a machine learning model deployed in human-interactive applications. Here, we mainly compare Hierarchical CNN-GRU and Jumper, as they share the same neural architecture (but differ in training and inference methods). We compute the speedup of inference time and the reduction of text reading (in percentage) achieved by Jumper.

Dataset                MR             AG             OI-Level        OI-InjIdn
Hierarchical CNN-GRU   51.30 ± 3.02   71.01 ± 0.46   103.27 ± 5.95   102.43 ± 8.65
Jumper                 38.29 ± 3.37   59.12 ± 0.48   83.61 ± 6.14    78.53 ± 8.87
Speedup %              33.98%         20.11%         23.57%          30.43%
Reduced %              32.7%          41.0%          33.8%           41.2%

Accuracy of jumping positions. We are further curious whether Jumper "jumps" at the right position in an information extraction-style task such as OI-Level. We annotate the rationale sentences in 400 data points (also available on our website; see Footnote 4), serving as the test groundtruth. Note that we still have no training labels for jumping positions (rationale sentences) in this experiment.


Fig. 4. The classification accuracy (a) and jumping behavior (b) of Jumper during its training stage.

Table 6
Performance of finding the key rationale on the OI-Level dataset, where information is often local. CA: classification accuracy. JA: jumping accuracy. OA: overall accuracy.

Model                  CA      JA      OA
CNN                    96.25   94.81   91.25
Self-Attentive         97.00   98.45   95.50
Hierarchical CNN-GRU   96.00   98.18   94.25
Jumper                 97.25   100     97.25

We compare Jumper with three baselines, namely, Hierarchical CNN-GRU, the CNN classifier [14], and the Self-Attentive model. To derive the jumping positions of Hierarchical CNN-GRU, we choose the first position at which it makes a prediction other than "None." Compared with the heuristics applied to Jumper-ce in Section 4.3, this technique fits Hierarchical CNN-GRU better, because the target labels of the OI-Level task include "None," whereas the simulation experiment (Section 4) is a binary classification task (excluding "None"). For the CNN classifier, we choose the sentence with the maximum number of words selected by max pooling across different dimensions. For the Self-Attentive model, we select the sentence with the largest attention weights.

We use the following metrics: (1) jumping accuracy (JA), the percentage of correct jumping positions conditioned on correct classification; and (2) overall accuracy (OA), the percentage of samples with both correct jumping positions and correct classification results. We also include the classification accuracy (CA), as shown in Table 4. It is easy to verify that OA = CA · JA.

The results are shown in Table 6. We see that Jumper discovers the jumping position with very high accuracy in terms of both JA and OA, and that both CNN and Hierarchical CNN-GRU perform worse on this task. Although they achieve similar classification results (Jumper slightly outperforming by ∼1%), Jumper is better at finding the key rationale by 2–6%. This shows that our one-jump constraint forces the model to think more carefully about when to make a decision, and that reinforcement learning is an effective way to learn the correct position for making decisions.

Another interesting finding is that, for Hierarchical CNN-GRU, the classification accuracy at the end of the paragraph (Table 4) is lower than the accuracy at the position where it could have predicted (Table 6).

This shows evidence of a distortion phenomenon of distributed representations: when a neural network is fed too much irrelevant information, its knowledge becomes less accurate.

Analysis of learning dynamics. Fig. 4(a) plots the learning curves of the classification and jumping accuracies on the OI-Level task. We see that, although the model is trained with classification labels only, its jumping accuracy also improves steadily. The two curves, in fact, align well during training. This implies that the classification labels, within our Jumper framework, indeed help to learn the appropriate decision position.

Fig. 4(b) shows the fraction of samples on which Jumper does not jump during the entire sequence although the groundtruth label is not "None." In this case, the prediction is always incorrect, which puts our framework at a disadvantage. We can divide the curve into three stages: (1) Before training, Jumper tends to jump early, as it almost always makes a non-default (not "None") prediction within the sequence, which verifies the early-jumping problem stated in Section 3.3. (2) At the beginning of training, Jumper quickly learns to remain at the default value, showing that our intermediate reward R_step in Eq. (6) helps to mitigate the early-jumping problem. (3) After 80 batches, it starts to jump before the paragraph ends. Eventually, only a small fraction of samples are left unpredicted in the Jumper framework.

Case study. Fig. 5 shows several examples of the decisions made by the neural network. In the AG and MR datasets, information is located over a wide range, and the network makes a prediction as soon as it sees enough evidence (e.g., "trade commissioner" for the business domain). By backtracking the word-level rationales, we find that words like "trade commissioner" and "tiresome" play a more important role in the decision making. In these cases, the model does not need to read future sentences, which is more efficient than reading the entire paragraph. For OI-Level classification, where information is mostly local, the neural network precisely locates the sub-sentence that contains the information, as shown in both Table 6 and Fig. 5.

In addition, we present two error cases of Jumper on the AG dataset. In the first error case, Jumper makes a non-default decision too early, leading to a wrong prediction (it predicts the business category, but the groundtruth is "science or technique"). In the second error case, on the contrary, Jumper delays its decision (predicting "None") when it reads the second sentence. However, there is no following sentence, resulting in a wrong decision again.


Fig. 5. Case study. We show the histogram of decision distributions and the heatmaps of word importance for MR, AG, and OI-Level samples.

Overall, the above two types of error cases are relatively rare in Jumper's predictions. Although Jumper fails to predict correctly in a few specific cases, its classification accuracy is quantitatively superior or comparable to the other baselines, as demonstrated in Table 4.

6. Conclusion and future work

In this paper, we have proposed a novel framework, Jumper, that models text classification as a sequential decision process on a sentence-by-sentence basis when reading a paragraph. We train Jumper by reinforcement learning with a one-jump constraint. Experiments show that Jumper finds the optimal decision position on a synthetic dataset; that it achieves comparable or higher performance than baselines; that it reduces text reading to a large extent; and that it can find the key rationale if the information is local within a sentence.

In future work, we would like to incorporate symbolic reasoning into the output layer in a multitask setting, where we could explicitly handle inference, contradiction, etc., among different slots.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

We thank the anonymous reviewers for their insightful suggestions. This work was supported in part by the National Natural Science Foundation of China (No. 61836004) and the Beijing Brain Science Special Project (No. Z181100001518006). Lili Mou is an Amii Fellow; he also thanks AltaML for support.

Appendix A. Mathematical details for the optimal decision process

The probability of the next character $x_{t+1}$ given the history $x_{1:t}$ can be computed by


$$p(x_{t+1} \mid x_{1:t}) = \sum_{i \in \{0,1\}} p(x_{t+1}, \mu_i \mid x_{1:t}) \quad (A.1)$$
$$= \sum_{i \in \{0,1\}} p(x_{t+1} \mid x_{1:t}, \mu_i)\, p(\mu_i \mid x_{1:t}) \quad (A.2)$$
$$= \sum_{i \in \{0,1\}} p(x_{t+1} \mid \mu_i)\, p(\mu_i \mid x_{1:t}) \quad (A.3)$$
$$= \sum_{i \in \{0,1\}} p(x_{t+1} \mid \mu_i)\, \frac{p(x_{1:t}, \mu_i)}{p(x_{1:t})} \quad (A.4)$$
$$= \sum_{i \in \{0,1\}} p(x_{t+1} \mid \mu_i)\, \frac{p(x_{1:t} \mid \mu_i)\, p(\mu_i)}{\sum_{i' \in \{0,1\}} p(x_{1:t} \mid \mu_{i'})\, p(\mu_{i'})}, \quad (A.5)$$

where $p(x_t = \text{a} \mid \mu_i) = \mu_i$.

Given the generated characters, the probability that the sequence ends after $t$ characters is given by

$$\Pr(x_{t+1} = \text{EOS}) = \frac{\Pr(L = t)}{\Pr(L \geq t)} \quad (A.6)$$
$$= \frac{\Pr(L = t)}{1 - \Pr(L \leq t - 1)}, \quad (A.7)$$

where $\Pr(\cdot)$ represents the probability of the corresponding event.
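Eqs. (A.6) and (A.7) can be evaluated directly with SciPy's Poisson distribution; the helper name below is ours.

```python
from scipy.stats import poisson

def p_eos(t, lam=8):
    # Pr(x_{t+1} = EOS) = Pr(L = t) / Pr(L >= t) = Pr(L = t) / (1 - Pr(L <= t-1))
    return poisson.pmf(t, lam) / (1.0 - poisson.cdf(t - 1, lam))
```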

A.1. Algorithm

The computation of the theoretical optimum described in Section 4.2 can be implemented by the following pseudo-code.

Algorithm 1 The algorithm for computing the theoretical optimum.
1: μ0, μ1, λ ← 0.3, 0.7, 8
2: procedure OptimalAcc(Na, Nb)  ▷ Na, Nb denote the numbers of characters "a" and "b" in a given sequence x
3:   p_{μ0} ← μ0^{Na} (1 − μ0)^{Nb} / (μ0^{Na} (1 − μ0)^{Nb} + μ1^{Na} (1 − μ1)^{Nb})  ▷ the best performance we can obtain by predicting now
4:   p_{μ1} ← μ1^{Na} (1 − μ1)^{Nb} / (μ0^{Na} (1 − μ0)^{Nb} + μ1^{Na} (1 − μ1)^{Nb})
5:   Acc_{G0,t}, Acc_{G1,t} ← p_{μ0}, p_{μ1}
6:   if Na + Nb = 15, return Acc_{G0,t}, Acc_{G1,t}, 0
7:   p_{t+1,1} ← p(μ0 | x) μ0 + p(μ1 | x) μ1  ▷ probability that the next character is "a"
8:   p_{t+1,0} ← p(μ0 | x)(1 − μ0) + p(μ1 | x)(1 − μ1)  ▷ probability that the next character is "b"
9:   Acc_{G0,t+1,1}, Acc_{G1,t+1,1}, Acc_{None,t+1,1} ← OptimalAcc(Na + 1, Nb)
10:  Acc_{G0,t+1,0}, Acc_{G1,t+1,0}, Acc_{None,t+1,0} ← OptimalAcc(Na, Nb + 1)
11:  for action in {G0, G1, None} do
12:    Acc_{action,t+1} ← p_{t+1,1} Acc_{action,t+1,1} + p_{t+1,0} Acc_{action,t+1,0}  ▷ compute the expected accuracy at t + 1
13:  Acc_{None,t} ← max{Acc_{G0,t+1}, Acc_{G1,t+1}, Acc_{None,t+1}}
14:  return Acc_{G0,t}, Acc_{G1,t}, Acc_{None,t}

Appendix B. Supplementary table

Table B1
Statistics of the Occupational Injury (OI) dataset. We chose injury identification (InjIdn) and injury level (Level) as the tasks for single-slot prediction.

Subtask          # of classes   Majority guess (%)
IsOccuInj        2              85.68
AssoPay          2              81.75
LaborContr       2              93.22
EndLabor         3              93.67
OnOff            2              93.69
DiseRel          3              99.05
OutForPub        2              99.07
WorkTime         3              79.75
WorkPlace        3              80.60
JobRel           3              91.34
InjIdn           3              55.02
ConfirmLevel     3              72.99
Insurance        3              89.66
HaveMedicalFee   3              83.63
Level            12             82.65

References

[1] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, in: ICLR, 2015.
[2] R. Bellman, A Markovian decision process, J. Math. Mech. 6 (5) (1957) 679–684.
[3] Y. Bengio, Practical recommendations for gradient-based training of deep architectures, in: Neural Networks: Tricks of the Trade, Springer, 2012, pp. 437–478.
[4] P. Blunsom, E. Grefenstette, N. Kalchbrenner, A convolutional neural network for modelling sentences, in: ACL, 2014, pp. 655–665.
[5] K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using RNN encoder–decoder for statistical machine translation, in: EMNLP, 2014, pp. 1724–1734.
[6] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, in: NAACL-HLT, 2019, pp. 4171–4186.
[7] K. Greff, R.K. Srivastava, J. Koutník, B.R. Steunebrink, J. Schmidhuber, LSTM: a search space odyssey, IEEE Trans. Neural Netw. Learn. Syst. 28 (10) (2015) 2222–2232.
[8] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: CVPR, 2016, pp. 770–778.
[9] X. He, D. Golub, Character-level question answering with attention, in: EMNLP, 2016, pp. 1598–1607.
[10] B. Hu, Z. Lu, H. Li, Q. Chen, Convolutional neural network architectures for matching natural language sentences, in: NIPS, 2014, pp. 2042–2050.
[11] K.S. Jones, A statistical interpretation of term specificity and its application in retrieval, J. Doc. 60 (5) (1972) 493–502.
[12] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of tricks for efficient text classification, in: EACL, 2017, pp. 427–431.
[13] I. Kanaris, K. Kanaris, I. Houvardas, E. Stamatatos, Words versus character n-grams for anti-spam filtering, Int. J. Artif. Intell. Tools 16 (06) (2007) 1047–1067.
[14] Y. Kim, Convolutional neural networks for sentence classification, in: EMNLP, 2014, pp. 1746–1751.
[15] T. Lei, R. Barzilay, T. Jaakkola, Rationalizing neural predictions, in: EMNLP, 2016, pp. 107–117.
[16] L. Li, C.R. Weinberg, T.A. Darden, L.G. Pedersen, Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method, Bioinformatics 17 (12) (2001) 1131–1142.
[17] Z. Lin, M. Feng, C.N.d. Santos, M. Yu, B. Xiang, B. Zhou, Y. Bengio, A structured self-attentive sentence embedding, in: ICLR, 2017.
[18] X. Liu, L. Mou, H. Cui, Z. Lu, S. Song, Jumper: learning when to make classification decisions in reading, in: IJCAI, 2018, pp. 4237–4243.
[19] L.M. Manevitz, M. Yousef, One-class SVMs for document classification, J. Mach. Learn. Res. 2 (Dec) (2001) 139–154.
[20] C. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.
[21] G. Marcus, Deep learning: a critical appraisal, arXiv preprint arXiv:1801.00631, 2018.
[22] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, in: ICLR, 2013.
[23] T. Mikolov, I. Sutskever, K. Chen, G.S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: NIPS, 2013, pp. 3111–3119.
[24] L. Mou, Z. Lu, H. Li, Z. Jin, Coupling distributed and symbolic execution for natural language queries, in: ICML, 2017, pp. 2518–2526.


[25] L. Mou, H. Peng, G. Li, Y. Xu, L. Zhang, Z. Jin, Discriminative neural sentence modeling by tree-based convolution, in: EMNLP, 2015, pp. 2315–2325.
[26] B. Pang, L. Lee, A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts, in: ACL, 2004, pp. 271–278.
[27] B. Pang, L. Lee, S. Vaithyanathan, Thumbs up?: sentiment classification using machine learning techniques, in: ACL, 2002, pp. 79–86.
[28] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, A. Lerer, Automatic differentiation in PyTorch, in: NIPS Autodiff Workshop, 2017.
[29] K. Pearson, Note on regression and inheritance in the case of two parents, Proc. R. Soc. Lond. 58 (1895) 240–242.
[30] J. Pennington, R. Socher, C.D. Manning, GloVe: global vectors for word representation, in: EMNLP, 2014, pp. 1532–1543.
[31] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, SQuAD: 100,000+ questions for machine comprehension of text, in: EMNLP, 2016, pp. 2383–2392.
[32] G. Sidorov, F. Velasquez, E. Stamatatos, A. Gelbukh, L. Chanona-Hernández, Syntactic dependency-based n-grams as classification features, in: Mexican International Conference on Artificial Intelligence, Springer, 2012, pp. 1–11.
[33] R. Socher, J. Pennington, E.H. Huang, A.Y. Ng, C.D. Manning, Semi-supervised recursive autoencoders for predicting sentiment distributions, in: EMNLP, 2011, pp. 151–161.
[34] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.
[35] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: NIPS, 2017, pp. 5998–6008.
[36] S. Wang, C.D. Manning, Baselines and bigrams: simple, good sentiment and topic classification, in: ACL (2), 2012, pp. 90–94.
[37] G. Weikum, Foundations of statistical natural language processing, in: International Conference on Management of Data, 31, 2002, pp. 37–38.
[38] T.-H. Wen, D. Vandyke, N. Mrkšić, M. Gasic, L.M. Rojas Barahona, P.-H. Su, S. Ultes, S. Young, A network-based end-to-end trainable task-oriented dialogue system, in: EACL, 2017, pp. 438–449.
[39] R.J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Mach. Learn. 8 (1992) 229–256.
[40] B. Xu, X. Guo, Y. Ye, J. Cheng, An improved random forest classifier for text categorization, J. Comput. 7 (12) (2012) 2913–2920.
[41] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, E. Hovy, Hierarchical attention networks for document classification, in: NAACL-HLT, 2016, pp. 1480–1489.
[42] A.W. Yu, H. Lee, Q. Le, Learning to skim text, in: ACL, 2017, pp. 1880–1890.
[43] K. Yu, Y. Liu, A.G. Schwing, J. Peng, Fast and accurate text classification: skimming, rereading and early stopping, in: ICLR Workshop, 2018.
[44] D. Zeng, K. Liu, S. Lai, G. Zhou, J. Zhao, Relation classification via convolutional deep neural network, in: COLING, 2014, pp. 2335–2344.
[45] X. Zhang, J. Zhao, Y. LeCun, Character-level convolutional networks for text classification, in: NIPS, 2015, pp. 649–657.
[46] Y. Zhang, I. Marshall, B.C. Wallace, Rationale-augmented convolutional neural networks for text classification, in: EMNLP, 2016, pp. 795–804.
[47] X. Zhou, X. Wan, J. Xiao, Attention-based LSTM network for cross-lingual sentiment classification, in: EMNLP, 2016, pp. 247–256.
[48] X. Zhu, P. Sobhani, H. Guo, Long short-term memory over tree structures, in: ICML, 2015, pp. 1604–1612.


Xianggen Liu is currently a Ph.D. student at Tsinghua University. He received his B.S. degree from the School of Computer Science and Engineering, University of Electronic Science and Technology of China. He mainly focuses on neural networks and machine learning, and their applications to natural language processing (NLP) problems, such as text classification, tagging, parsing, and reasoning.

Lili Mou is an assistant professor at the Department of Computing Science, University of Alberta. Lili received his B.S. and Ph.D. degrees from the School of EECS, Peking University. After that, he worked as a postdoctoral fellow at the University of Waterloo and a research scientist at Adeptmind (a startup in Toronto, Canada). His research interests include deep learning applied to natural language processing as well as programming language processing. He has publications at top conferences and journals, including AAAI, ACL, CIKM, COLING, EMNLP, ICASSP, ICML, IJCAI, INTERSPEECH, NAACL-HLT, and TACL.

Haotian Cui received his B.S. and Master's degrees in 2015 and 2018, respectively, from the Department of Biomedical Engineering, Tsinghua University. His research interests relate to NLP and data mining.

Zhengdong Lu is the founder and CTO of DeeplyCurious (a startup in Beijing, China). Lu received his Ph.D. degree from Oregon Health & Science University in 2008. Dr. Lu then worked as a postdoctoral researcher at the University of Texas at Austin for several years. Before founding DeeplyCurious, he worked as an associate researcher at MSRA and a senior researcher at Noah's Ark Lab, Huawei. Dr. Lu has published more than 60 top conference and journal articles and has served as a reviewer for several international venues (NIPS, ICML, and IEEE Transactions on PAMI). His research interests include machine learning, deep learning, natural language understanding, and data mining.

Sen Song is a principal investigator at the Department of Biomedical Engineering, Tsinghua University. He received his Ph.D. degree at Brandeis University in the USA and worked for several years at the Massachusetts Institute of Technology as a postdoctoral fellow. Since joining Tsinghua, he has been increasingly interested in brain-inspired neural networks and spiking neural networks.
