Exploiting Deep Representations for Natural Language Processing
Zi-Yi Dou^a,∗, Xing Wang^b,∗, Shuming Shi^b, Zhaopeng Tu^b,∗∗
^a Carnegie Mellon University
^b Tencent AI Lab
Abstract

Advanced neural network models generally implement systems as multiple layers to model complex functions and capture complicated linguistic structures at different levels [1]. However, only the top layers of deep networks are leveraged in the subsequent process, which misses the opportunity to exploit the useful information embedded in other layers. In this work, we propose to expose all of these embedded signals with two types of mechanisms, namely deep connections and iterative routings. While deep connections allow better information and gradient flow across layers, iterative routings directly combine the layer representations into a final output through an iterative routing-by-agreement mechanism. Experimental results on both machine translation and language representation tasks demonstrate the effectiveness and universality of the proposed approaches, which indicates the necessity of exploiting deep representations for natural language processing tasks. While the two strategies individually boost performance, combining them improves performance further.

Keywords: Natural Language Processing, Deep Neural Networks, Deep Representations, Layer Aggregation, Routing-by-Agreement

2019 MSC: 00-01, 99-00
∗ Zi-Yi Dou and Xing Wang contributed equally to this work.
∗∗ Zhaopeng Tu is the corresponding author: [email protected].
Email addresses: [email protected] (Zi-Yi Dou), [email protected] (Xing Wang), [email protected] (Shuming Shi), [email protected] (Zhaopeng Tu)
1. Introduction

Neural network models have advanced the state of the art in various Natural Language Processing (NLP) tasks [2, 3, 4]. Nowadays, advanced neural network models generally implement systems as multiple layers (i.e. deep neural networks), regardless of the specific model architectures, such as Recurrent Neural Networks (RNNs) [5, 1], Convolutional Neural Networks (CNNs) [6], or Self-Attention Networks (SANs) [3, 4]. Several researchers have revealed that deep neural networks (DNNs) are able to capture various linguistic properties of input sentences [1, 7, 8], and that different types of syntactic and semantic information are captured by different layers.

However, current DNN models only leverage the top layer in the subsequent process, which misses the opportunity to exploit useful information embedded in other layers. In addition, information about the output that is lost in one layer cannot be recovered in higher layers [9]. Although residual connections [10] are generally incorporated to combine layers, these connections are themselves "shallow" and only fuse information through simple, one-step operations [11]. In response to this problem, Peters et al. [1] have shown that simultaneously exposing all layer representations outperforms methods that utilize just the top layer for transfer learning tasks. However, they still aggregate layer representations in a shallow fusion with a simple linear combination. Recently, aggregating layers to better fuse semantic and spatial information has proven to be of profound value in computer vision tasks [11, 12]. A few recent works analyze the necessity and effectiveness of information aggregation for DNNs in NLP tasks [13, 14, 15], and we aim to further extend previous methods in the field of NLP.

In this work, we investigate how to effectively fuse information across DNN layers from two different perspectives, namely information flow and representation composition. While residual connections [10] and linear combination [1] are respectively the shallow mechanisms of the two perspectives, we propose two deep mechanisms to better extract the full spectrum of linguistic information embedded in different DNN layers:
• Deep Connections maintain additional layers to allow better information and gradient flow between DNN layers. Specifically, we investigate two structures for deep connections: iterative deep connection and hierarchical deep connection [11]. Iterative deep connection follows the base hierarchy to refine layer representations layer by layer, while hierarchical deep connection assembles its own hierarchy of tree-structured connections to merge layer representations. Compared with the shallow residual connections, deep connections iteratively and hierarchically merge layer representations by incorporating more depth and sharing.

• Iterative Routings cast the representation composition as the problem of assigning parts to wholes. The routing iteratively updates the proportion of how much a part (i.e. the partial information embedded in a specific layer) should be assigned to a whole (i.e. the final output representation), based on the agreement between parts and wholes. Compared with the shallow linear combination, iterative routings have two appealing strengths. First, while the linear combination is encoded in a static set of learned weights, iterative routings combine the information dynamically in that they treat each combination of hidden states differently. Second, the routing mechanism provides a new way to aggregate information according to the representation of the final output, as well as to directly model the part-whole relationships. To combine the advantages of both mechanisms, we propose a simple strategy that feeds the aggregation nodes instead of the standard layers to the iterative routing.
We evaluated our approach on two representative NLP tasks, machine translation and language representation. For machine translation, we conducted experiments on the benchmark WMT14 English⇒German task using the state-of-the-art Transformer model [3]. Experimental results show that both types of strategies individually improve translation performance over the vanilla Transformer model, indicating the necessity and effectiveness of fusing information across layers for deep networks. By combining the advantages of both mechanisms, we obtain a further improvement in translation performance. For language representation, we use the linguistic probing tasks [16] that study what linguistic properties are captured by input representations. Experimental results show that our approaches indeed produce more informative representations, which embed more syntactic and semantic information.

Contributions. Our key contributions are:

• Our study demonstrates the necessity and effectiveness of fusing information across layers for DNNs in NLP tasks. It also indicates the superiority of dynamic principles for aggregating deep representations.

• Our work is among the few studies (cf. [17, 18]) which show that the idea of capsule networks can have promising applications in natural language processing tasks.
2. Background

2.1. Deep Neural Networks

Deep representations have proven to be of profound value in various NLP tasks, such as reading comprehension [4], question answering [1], and machine translation [3]. Multi-layer networks are employed to perform the representation learning task through a series of nonlinear transformations from the representation of input sequences to the final output representation (deep encoder). The layer can be implemented as RNNs [1, 5], CNNs [6], or SANs [3, 4]. In this work, we take the representative SAN model as an example, which has advanced the state of the art in different NLP tasks and will be used in the experiments later. However, we note that the proposed approach is generally applicable to any other type of DNN.

Specifically, the deep encoder is composed of a stack of L identical layers, each of which has two sub-layers. The first sub-layer is a self-attention network, and the second one is a position-wise fully connected feed-forward network.
A residual connection [10] is employed around each of the two sub-layers, followed by layer normalization [19]. Formally, the output of the first sub-layer C_e^l and the second sub-layer H_e^l are calculated as

$$C_e^l = \mathrm{Ln}\big(\mathrm{Att}(Q_e^l, K_e^{l-1}, V_e^{l-1}) + H_e^{l-1}\big), \qquad H_e^l = \mathrm{Ln}\big(\mathrm{Ffn}(C_e^l) + C_e^l\big), \tag{1}$$

where Att(·), Ln(·), and Ffn(·) are the self-attention mechanism, layer normalization, and a feed-forward network with ReLU activation in between, respectively. {Q_e^l, K_e^{l-1}, V_e^{l-1}} are the query, key, and value vectors that are transformed from the (l−1)-th encoder layer H_e^{l-1}.
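To make Eq. (1) concrete, the following is a minimal PyTorch sketch of one encoder layer with the post-sub-layer normalization of Eq. (1). The class and variable names are ours, and the multi-head attention module internally performs the query/key/value projections, so this is a hedged approximation of the notation rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class SANEncoderLayer(nn.Module):
    """One encoder layer as in Eq. (1): self-attention and feed-forward
    sub-layers, each wrapped with a residual connection and layer norm."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.att = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, h_prev: torch.Tensor) -> torch.Tensor:
        # C^l = LN(ATT(Q^l, K^{l-1}, V^{l-1}) + H^{l-1})
        att_out, _ = self.att(h_prev, h_prev, h_prev)
        c = self.ln1(att_out + h_prev)
        # H^l = LN(FFN(C^l) + C^l)
        return self.ln2(self.ffn(c) + c)

# usage: a stack of L identical layers, keeping every layer output H^1..H^L
layers = nn.ModuleList(SANEncoderLayer() for _ in range(6))
h = torch.randn(2, 10, 512)            # (batch, length, d_model)
states = []
for layer in layers:
    h = layer(h)
    states.append(h)
```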
There has been a wealth of research over the past several years on Seq2Seq learning [20, 2, 3], which generally lies in a deep encoder-decoder framework. The deep encoder is identical to the above description (e.g. Equation 1). The deep decoder is also composed of a stack of L identical layers. In addition to the two sub-layers in each decoder layer, the decoder inserts a third sub-layer D_d^l to perform attention over the output of the encoder stack H_e^L:

$$C_d^l = \mathrm{Ln}\big(\mathrm{Att}(Q_d^l, K_d^{l-1}, V_d^{l-1}) + H_d^{l-1}\big), \quad D_d^l = \mathrm{Ln}\big(\mathrm{Att}(C_d^l, K_e^{L}, V_e^{L}) + C_d^l\big), \quad H_d^l = \mathrm{Ln}\big(\mathrm{Ffn}(D_d^l) + D_d^l\big), \tag{2}$$

where {Q_d^l, K_d^{l-1}, V_d^{l-1}} are transformed from the (l−1)-th decoder layer H_d^{l-1}, and {K_e^L, V_e^L} are transformed from the top layer of the encoder. The top layer of the decoder H_d^L is used to generate the final output sequence.
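Analogously, a hedged sketch of one decoder layer of Eq. (2) follows, with the extra sub-layer attending over the top encoder layer; the causal mask is left to the caller, and all names are our assumptions.

```python
import torch
import torch.nn as nn

class SANDecoderLayer(nn.Module):
    """One decoder layer as in Eq. (2): masked self-attention, attention over
    the top encoder layer H_e^L, and a feed-forward sub-layer, each with a
    residual connection and layer norm."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_att = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_att = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.ln1, self.ln2, self.ln3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, h_prev, enc_top, causal_mask=None):
        c, _ = self.self_att(h_prev, h_prev, h_prev, attn_mask=causal_mask)
        c = self.ln1(c + h_prev)                       # C_d^l
        d, _ = self.cross_att(c, enc_top, enc_top)     # attends to H_e^L
        d = self.ln2(d + c)                            # D_d^l
        return self.ln3(self.ffn(d) + d)               # H_d^l

# usage with dummy target states and encoder output
dec = SANDecoderLayer()
out = dec(torch.randn(2, 7, 512), torch.randn(2, 10, 512))
```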
2.2. Exploiting Deep Representations

DNNs can be considered as a strong feature extractor with extended receptive fields capable of linking salient features from the entire sequence [21]. However, one potential problem with the vanilla DNNs, as shown in Figure 1a, is that they stack layers in sequence and only utilize the information in the top layer. While studies have shown that deeper layers extract more semantic and more global features [22, 1], this does not mean that the last layer is the ultimate representation for every task. This work follows recent successes in exploiting deep representations through improved information flow and representation composition, the representative methods of which are residual connections [10] and linear combination [1].

Residual connections [10] are important for assembling very deep networks. They utilize skip connections to jump over some layers by directly adding the layer input to the layer output, as shown in Equations 1 and 2. The skip connections allow information to flow unimpeded through the entire network.
Although residual connections have been incorporated to combine layers, these connections are themselves "shallow" and only fuse information through simple, one-step operations [11].

Linear Combination. Recently, Peters et al. [1] have shown that simultaneously exposing all layer representations outperforms methods that utilize just the top layer on several generation tasks. They proposed a simple method to linearly combine the outputs of all layers, as shown in Figure 1b:

$$\widetilde{H} = \sum_{l=1}^{L} W_l H^l, \tag{3}$$

where {W_1, . . . , W_L} are trainable parameters. The linear combination strategy is applied to both the encoder and decoder. The combined layer H̃, which embeds all layer representations instead of only the top layer H^L, is used in the subsequent processes. As seen, the linear combination is encoded in a static set of weights {W_1, . . . , W_L}, which ignores the useful context of sentences that could further improve layer aggregation.

In this work, we propose to effectively fuse information across DNN layers by extending the above two representative methods, aiming at better information flow and representation composition for DNNs.
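For reference, a small sketch of the shallow linear combination of Eq. (3). We use one scalar weight per layer for brevity, whereas W_l in the paper may be a full matrix, so this parameterization is an assumption rather than the authors' exact formulation.

```python
import torch
import torch.nn as nn

class LinearCombination(nn.Module):
    """Shallow fusion of Eq. (3): H~ = sum_l W_l * H^l with a static,
    input-independent weight per layer (scalar here for simplicity)."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.w = nn.Parameter(torch.full((num_layers,), 1.0 / num_layers))

    def forward(self, states):
        # states: list of L tensors, each of shape (batch, length, d_model)
        stacked = torch.stack(states, dim=0)                  # (L, batch, length, d)
        return (self.w.view(-1, 1, 1, 1) * stacked).sum(dim=0)

# toy usage with random per-layer states
states = [torch.randn(2, 10, 512) for _ in range(6)]
combined = LinearCombination(num_layers=6)(states)            # (2, 10, 512)
```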
Figure 1: Illustration of (a) vanilla model without any aggregation, (b) shallow linear combination, and (c,d) deep connection strategies. Aggregation nodes are represented by green circles.
3. Approach

In this paper, we investigate two types of mechanisms, namely deep connections (Section 3.1) and iterative routings (Section 3.2). While deep connections encourage feature propagation and gradient flow across layers (i.e., better information flow), iterative routings focus on aggregating useful parts embedded in different layers (i.e., better representation composition). In addition, we propose to further improve the performance by combining the advantages of both mechanisms (Section 3.3).
3.1. Deep Connections

Recently, designing architectures to encourage feature propagation and gradient flow has proven to be of profound value in computer vision tasks [11, 12]. Inspired by these works, we propose to better propagate information across layers for NLP tasks by maintaining additional layers that aggregate the standard layers, as depicted in Figure 1.

3.1.1. Iterative Connection

As illustrated in Figure 1c, iterative connection follows the iterated stacking of the backbone architecture. Aggregation begins at the shallowest, smallest scale and then iteratively merges deeper, larger scales.
Figure 2: Hierarchical Connection (b) that aggregates layer representations through a CNN-like tree structure (a).
The iterative deep connection function I for a series of layers H_1^l = {H^1, · · · , H^l} with increasingly deeper and more semantic information is formulated as

$$\widehat{H}^l = \mathcal{I}(H_1^l) = \mathrm{Agg}(H^l, \widehat{H}^{l-1}), \tag{4}$$

where we set Ĥ^1 = H^1 and Agg(·, ·) is the aggregation function:

$$\mathrm{Agg}(x, y) = \mathrm{Ln}\big(\mathrm{Ffn}([x; y]) + x + y\big). \tag{5}$$
As seen, in this work we first concatenate x and y into z = [x; y], which is subsequently fed to a feed-forward network with a sigmoid activation in between. Residual connections and layer normalization are also employed; specifically, both x and y have residual connections to the output.
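The iterative connection of Eqs. (4)-(5) can be sketched as follows. This is a hedged approximation: the feed-forward width, the exact placement of the sigmoid, and all names are our assumptions.

```python
import torch
import torch.nn as nn

class Agg2(nn.Module):
    """AGG(x, y) of Eq. (5): concatenate, apply a position-wise feed-forward
    network with a sigmoid activation, add residual connections from both
    inputs, and layer-normalize."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(2 * d_model, d_ff), nn.Sigmoid(),
                                 nn.Linear(d_ff, d_model))
        self.ln = nn.LayerNorm(d_model)

    def forward(self, x, y):
        return self.ln(self.ffn(torch.cat([x, y], dim=-1)) + x + y)

def iterative_connection(states, aggs):
    """Eq. (4): fold the layer representations H^1..H^L left-to-right,
    refining the running aggregate with each deeper layer."""
    agg_state = states[0]                          # H^_1 = H^1
    for h, agg in zip(states[1:], aggs):
        agg_state = agg(h, agg_state)              # H^_l = AGG(H^l, H^_{l-1})
    return agg_state

# toy usage with random per-layer states
states = [torch.randn(2, 10, 512) for _ in range(6)]
aggs = nn.ModuleList(Agg2() for _ in range(5))     # one aggregation node per merge
final = iterative_connection(states, aggs)
```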
3.1.2. Hierarchical Connection

While iterative connection deeply combines states, it may still be insufficient to fuse the layers due to its sequential architecture. Hierarchical connection, on the other hand, merges layers through a tree structure to preserve and combine
feature channels, as shown in Figure 1d. The original model proposed by [11] requires the number of layers to be a power of two, which limits the applicability of the method to a broader range of NMT architectures (e.g. six layers in [3]). To solve this problem, we introduce a CNN-like tree with a filter size of two, as shown in Figure 2a. Following [11], we first merge aggregation
nodes of the same depth for efficiency so that there would be at most one aggregation node for each depth. Then, we further feed the output of an aggregation node back into the backbone as the input to the next sub-tree, instead of only routing intermediate aggregations further up the tree, as shown in Figure 1d. The interaction between aggregation and backbone nodes allows the model to
better preserve features.

Formally, each aggregation node Ĥ^i is calculated as

$$\widehat{H}^i = \begin{cases} \mathrm{Agg}(H^{2i-1}, H^{2i}), & i = 1 \\ \mathrm{Agg}(H^{2i-1}, H^{2i}, \widehat{H}^{i-1}), & i > 1 \end{cases} \tag{6}$$

where Agg(H^{2i-1}, H^{2i}) is computed via Eq. 5, and Agg(H^{2i-1}, H^{2i}, Ĥ^{i-1}) is computed as

$$\mathrm{Agg}(x, y, z) = \mathrm{Ln}\big(\mathrm{Ffn}([x; y; z]) + x + y + z\big). \tag{7}$$

The aggregation node at the top layer Ĥ^{L/2} serves as the final output of the network.
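A compact sketch of the CNN-like tree of Eqs. (6)-(7), including the feedback of each aggregation node into the backbone described above. The stand-in backbone layers and all names are our assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class AggN(nn.Module):
    """AGG(...) of Eqs. (5)/(7) for a fixed number of inputs: concatenate,
    position-wise feed-forward (sigmoid), residuals from every input, layer norm."""
    def __init__(self, n_inputs: int, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(n_inputs * d_model, d_ff), nn.Sigmoid(),
                                 nn.Linear(d_ff, d_model))
        self.ln = nn.LayerNorm(d_model)

    def forward(self, *xs):
        return self.ln(self.ffn(torch.cat(xs, dim=-1)) + sum(xs))

def hierarchical_encode(x, layers, aggs):
    """Eq. (6): merge every pair of backbone layers into an aggregation node,
    which both climbs the tree and is fed back as input to the next sub-tree."""
    agg_state = None
    for i, agg in enumerate(aggs):
        h_odd = layers[2 * i](x)                       # H^{2i-1}
        h_even = layers[2 * i + 1](h_odd)              # H^{2i}
        if agg_state is None:
            agg_state = agg(h_odd, h_even)             # i = 1 case of Eq. (6)
        else:
            agg_state = agg(h_odd, h_even, agg_state)  # i > 1 case of Eq. (6)
        x = agg_state                                  # feedback into the backbone
    return agg_state                                   # H^_{L/2}: final output

# toy usage with 6 stand-in layers (replace with real encoder layers)
layers = nn.ModuleList(nn.Sequential(nn.Linear(512, 512), nn.ReLU()) for _ in range(6))
aggs = nn.ModuleList([AggN(2)] + [AggN(3) for _ in range(2)])
out = hierarchical_encode(torch.randn(2, 10, 512), layers, aggs)
```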
3.2. Iterative Routing

Recent studies have proven that simultaneously exposing all layer representations outperforms methods that utilize just the top layer on several generation tasks [1, 23]. The representation composition that combines all layer representations to form a final output can be cast as the problem of assigning parts to wholes, for which iterative routing is an appealing solution [24, 25]. Concretely, the basic idea is to iteratively update the proportion of how much a part should be assigned to a whole, based on the agreement between parts and wholes. Specifically, in this work we explore two representative routing mechanisms, namely dynamic routing and EM routing, which differ in how the iterative routing procedure is implemented. We expect that DNNs can benefit greatly from advanced routing algorithms, which allow the model to directly learn the part-whole relationships.
[Figure 3 labels: Input Capsules Ψ → Vote Vectors V → agreement C_{l→n} → Output Capsules Ω → Final Output]
Figure 3: Illustration of the dynamic routing algorithm.
3.2.1. Dynamic Routing

Dynamic routing is a straightforward implementation of routing-by-agreement. To illustrate, the information of L input capsules is dynamically routed to N output capsules, which are concatenated to form the final output H̃ = [Ω_1, . . . , Ω_N], as shown in Figure 3. Each vector output of capsule n is calculated with a non-linear "squashing" function [24]:

$$\Omega_n = \frac{\|S_n\|^2}{1 + \|S_n\|^2}\,\frac{S_n}{\|S_n\|}, \tag{8}$$

$$S_n = \sum_{l=1}^{L} C_{l\to n} V_{l\to n}, \tag{9}$$

where S_n is the total input of capsule Ω_n, which is a weighted sum over all "vote vectors" V_{*→n} transformed from the input capsules Ψ:

$$V_{l\to n} = W_{l\to n} \Psi_l, \tag{10}$$
where W_{l→n} is a trainable transformation matrix, and Ψ_l is an input capsule associated with input layer H^l:

$$\Psi_l = F_l(H^1, \ldots, H^L), \tag{11}$$

where F_l(·) is a distinct non-linear function. C_{l→n} is the assignment probability (i.e. agreement) that is determined by the iterative dynamic routing.

Algorithm 1 Dynamic Routing.
Input: input capsules Ψ = {Ψ_1, . . . , Ψ_L}, iterations T;
Output: capsules Ω = {Ω_1, . . . , Ω_N}.
1: procedure Routing(Ψ, T)
2:   ∀(Ψ_l, Ω_n): B_{l→n} = 0
3:   for T iterations do
4:     ∀(Ψ_l, Ω_n): C_{l→n} = softmax(B_{l→n})
5:     ∀Ω_n: compute Ω_n by Eq. 8
6:     ∀(Ψ_l, Ω_n): B_{l→n} = B_{l→n} + Ω_n · V_{l→n}
7:   return Ω

Algorithm 1 lists the iterative dynamic routing procedure. The assignment probabilities associated with each input capsule Ψ_l sum to 1, i.e. Σ_n C_{l→n} = 1, and are determined by a "routing softmax" (Line 4):

$$C_{l\to n} = \frac{\exp(B_{l\to n})}{\sum_{n'=1}^{N}\exp(B_{l\to n'})}, \tag{12}$$

where B_{l→n} measures the degree to which Ψ_l should be coupled to capsule n (similar to the energy function in the attention model [2]) and is initialized to all zeros (Line 2). The initial assignment probabilities are then iteratively refined by measuring the agreement between the vote vector V_{l→n} and capsule n (Lines 2-6), which is implemented as a simple scalar product α_{l→n} = Ω_n · V_{l→n} (Line 6). With this iterative routing-by-agreement mechanism, an input capsule prefers to send its representation to output capsules whose activity vectors have a large scalar product with the vote coming from that input capsule. Benefiting from this high-dimensional coincidence filtering, capsule neurons are able to ignore all but the most active features from the input capsules. Ideally, each capsule output represents a distinct property of the input. To make the dimensionality of the final output consistent with that of the hidden layer (i.e. d), the dimensionality of each capsule output is set to d/N.
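A minimal sketch of Algorithm 1 over pre-computed vote vectors (the transform of Eq. (10) is assumed to have been applied already); capsules are plain vectors here rather than whole sequences, and all names are ours.

```python
import torch

def squash(s, eps=1e-9):
    """Non-linear squashing of Eq. (8)."""
    norm_sq = (s ** 2).sum(dim=-1, keepdim=True)
    return (norm_sq / (1.0 + norm_sq)) * s / torch.sqrt(norm_sq + eps)

def dynamic_routing(votes, n_iter=3):
    """Algorithm 1. `votes` holds V_{l->n} with shape (L, N, d_out):
    the vote of input capsule l for output capsule n. Returns the N
    output capsules Omega_1..Omega_N."""
    L, N, _ = votes.shape
    b = torch.zeros(L, N)                              # coupling logits B_{l->n}
    for _ in range(n_iter):
        c = torch.softmax(b, dim=-1)                   # Eq. (12), sums to 1 over n
        s = (c.unsqueeze(-1) * votes).sum(dim=0)       # Eq. (9): S_n
        omega = squash(s)                              # Eq. (8): Omega_n
        b = b + (votes * omega.unsqueeze(0)).sum(-1)   # agreement Omega_n · V_{l->n}
    return omega                                       # (N, d_out)

# toy usage: route L = 6 layer capsules into N = 4 output capsules of size d/N
votes = torch.randn(6, 4, 128)
output = dynamic_routing(votes)   # flatten the N capsules to obtain H~
```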
Algorithm 2 Iterative EM Routing: returns the activations A^Ω of the output capsules, given the activations A^Ψ and votes V of the input capsules.
1: procedure EM Routing(A^Ψ, V)
2:   ∀(Ψ_l, Ω_n): C_{l→n} = 1/N
3:   for T iterations do
4:     ∀Ω_n: M-Step(C, A^Ψ, V)
5:     ∀Ψ_l: E-Step(µ, σ, A^Ω, V)
6:   ∀Ω_n: Ω_n = A^Ω_n ∗ µ_n
7:   return Ω

1: procedure M-Step(C, A^Ψ, V)      ▹ hold C constant, adjust (µ_n, σ_n, A^Ω_n) for Ω_n
2:   ∀Ψ_l: C_{l→n} = C_{l→n} ∗ A^Ψ_l
3:   Compute µ_n, σ_n by Eqs. 14 and 15
4:   Compute A^Ω_n by Eq. 17

1: procedure E-Step(µ, σ, A^Ω, V)   ▹ hold (µ, σ, A^Ω) constant, adjust C_{l→∗} for Ψ_l
2:   ∀Ω_n: compute C_{l→n} by Eq. 19
3.2.2. EM Routing
Dynamic routing uses the cosine of the angle between two vectors to measure their agreement: Ω_n · V_{l→n}. The cosine saturates at 1, which makes it insensitive to the difference between a quite good agreement and a very good agreement. In response to this problem, [25] propose a novel Expectation-Maximization routing algorithm.
Specifically, the routing process fits a mixture of Gaussians using the Expectation-Maximization (EM) algorithm, where the output capsules play the role of Gaussians and the means of the activated input capsules play the role of the datapoints. It iteratively adjusts the means, variances, and activation probabilities of the output capsules, as well as the assignment probabilities C of the input capsules, as listed in Algorithm 2. Compared with the dynamic routing described above, EM routing assigns means, variances, and activation probabilities to each capsule, which are used to better estimate the agreement for routing.

The activation probability A^Ψ_l of the input capsule Ψ_l is calculated by

$$A^{\Psi}_l = W^{\Psi}_l \Psi_l, \tag{13}$$
where W^Ψ_l is a trainable transformation matrix, and Ψ_l is calculated by Equation 11. The activation probabilities A^Ψ and votes V of the input capsules are fixed during the EM routing process.

M-Step. For each Gaussian associated with Ω_n, the M-step consists of finding the mean µ_n of the votes from the input capsules and the variance σ_n about that mean:

$$\mu_n = \frac{\sum_l C_{l\to n} V_{l\to n}}{\sum_l C_{l\to n}}, \tag{14}$$

$$(\sigma_n)^2 = \frac{\sum_l C_{l\to n}\,(V_{l\to n} - \mu_n)^2}{\sum_l C_{l\to n}}. \tag{15}$$
The incremental cost of using an active capsule Ω_n is

$$cost_n = \Big(\log(\sigma_n) + \frac{1 + \log(2\pi)}{2}\Big)\sum_l C_{l\to n}. \tag{16}$$
The activation probability of capsule Ω_n is calculated by

$$A^{\Omega}_n = \mathrm{logistic}\Big(\lambda\big(\beta_A - \beta_\mu \sum_l C_{l\to n} - cost_n\big)\Big), \tag{17}$$
where β_A is a fixed cost for coding the mean and variance of Ω_n when activating it, β_µ is another fixed cost per input capsule when not activating it, and λ is an inverse temperature parameter set with a fixed schedule. We refer the readers to [25] for more details.

E-Step. The E-step adjusts the assignment probabilities C_{l→∗} for each input Ψ_l. First, we compute the probability density of the vote V_{l→n} from Ψ_l under the Gaussian distribution fitted by the output capsule Ω_n it gets assigned to:

$$p_n = \frac{1}{\sqrt{2\pi(\sigma_n)^2}}\exp\Big(-\frac{(V_{l\to n} - \mu_n)^2}{2(\sigma_n)^2}\Big). \tag{18}$$
Accordingly, the assignment probability is re-normalized by

$$C_{l\to n} = \frac{A^{\Omega}_n\, p_n}{\sum_{n'} A^{\Omega}_{n'}\, p_{n'}}. \tag{19}$$
As stated above, EM routing is a more powerful routing algorithm, which can better estimate the agreement by allowing active capsules to receive a cluster of similar votes. In addition, it assigns an activation probability A to represent the probability of whether each input capsule is activated or not. These improvements can help to better learn the part-whole relationships [25].
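The EM routing steps of Eqs. (13)-(19) can be sketched compactly as below. This is a hedged approximation: β_A, β_µ, and the inverse temperature λ are treated as fixed scalars (the paper schedules λ and refers to [25] for the cost terms), the density of Eq. (18) is evaluated per dimension in log space for numerical stability, and all names are ours.

```python
import math
import torch

def em_routing(votes, a_in, n_iter=3, beta_a=1.0, beta_u=1.0, lambd=1.0, eps=1e-9):
    """Compact sketch of Algorithm 2. `votes` is V_{l->n}, shape (L, N, d_out);
    `a_in` holds the input activations A^Psi_l of Eq. (13), shape (L,)."""
    L, N, d = votes.shape
    c = torch.full((L, N), 1.0 / N)                      # initial assignments C_{l->n}
    for _ in range(n_iter):
        # ---- M-step: hold C fixed, update (mu, sigma, A^Omega) per output capsule ----
        r = c * a_in.unsqueeze(-1)                       # C_{l->n} * A^Psi_l
        r_sum = r.sum(dim=0) + eps                       # (N,)
        mu = (r.unsqueeze(-1) * votes).sum(0) / r_sum.unsqueeze(-1)              # Eq. (14)
        var = (r.unsqueeze(-1) * (votes - mu) ** 2).sum(0) / r_sum.unsqueeze(-1)
        sigma = torch.sqrt(var + eps)                                            # Eq. (15)
        cost = (torch.log(sigma) + (1 + math.log(2 * math.pi)) / 2).sum(-1) * r_sum  # Eq. (16)
        a_out = torch.sigmoid(lambd * (beta_a - beta_u * r_sum - cost))          # Eq. (17)
        # ---- E-step: hold the Gaussians fixed, update the assignments C ----
        log_p = -0.5 * (((votes - mu) ** 2) / (var + eps)
                        + torch.log(2 * math.pi * (var + eps))).sum(-1)          # Eq. (18)
        c = torch.softmax(torch.log(a_out + eps) + log_p, dim=-1)                # Eq. (19)
    return a_out.unsqueeze(-1) * mu                      # Omega_n = A^Omega_n * mu_n

# toy usage: 6 input capsules routed into 4 output capsules
omega = em_routing(torch.randn(6, 4, 64), torch.rand(6))
```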
3.3. Combining Deep Connection and Routing Mechanisms
Given that deep connection and routing mechanisms fuse information across layers from different perspectives, it is natural to combine the advantages of both models. In this work, we propose two combination approaches to leverage the strengths of both exploring more depth and sharing in feature aggregation, as well as directly modeling part-whole relationships. Intuitively, we combine the two mechanisms by feeding the aggregation nodes instead of the standard layers to the iterative routing:

$$\widetilde{H} = \mathrm{Routing}(H^1, \widehat{H}^1, H^3, \widehat{H}^2, \ldots, H^{2I-1}, \widehat{H}^I), \tag{20}$$

where H̃ is the final representation, Routing(·) is either dynamic routing or EM routing, and Ĥ^i is the aggregation node and H^i the original node as depicted in Figure 1.
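As a usage note, the interleaving of Eq. (20) amounts to collecting the odd backbone layers and the aggregation nodes before routing; a tiny sketch (names assumed, shapes to be adapted to the chosen routing):

```python
def combine_for_routing(backbone, agg_nodes):
    """Build the input list of Eq. (20): [H^1, H^_1, H^3, H^_2, ..., H^{2I-1}, H^_I].
    `backbone` holds H^1..H^L and `agg_nodes` holds H^_1..H^_I (I = L/2)."""
    inputs = []
    for i, agg in enumerate(agg_nodes):
        inputs.append(backbone[2 * i])   # H^1, H^3, ... in 1-based notation
        inputs.append(agg)               # H^_1, H^_2, ...
    return inputs                        # fed to dynamic or EM routing as input capsules
```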
| #  | Model                     | # Para. | Train | Test | BLEU  | Δ     |
|----|---------------------------|---------|-------|------|-------|-------|
| 1  | Transformer               | 88M     | 1.79  | 1.43 | 27.31 | –     |
| 2  | + Iterative Connection    | +32M    | 1.32  | 1.22 | 28.27 | +0.96 |
| 3  | + Hierarchical Connection | +23M    | 1.46  | 1.25 | 28.33 | +1.02 |
| 4  | + Dynamic Routing         | +38M    | 1.37  | 1.24 | 28.22 | +0.91 |
| 5  | + EM Routing              | +57M    | 1.10  | 1.15 | 28.81 | +1.50 |
| 6  | + Combine (3+5)           | +79M    | 0.94  | 0.97 | 28.95 | +1.64 |
| 7  | Transformer-Big           | 264M    | 0.73  | 0.63 | 28.58 | –     |
| 8  | + Hierarchical Connection | +92M    | 0.53  | 0.59 | 28.90 | +0.32 |
| 9  | + EM Routing              | +226M   | 0.37  | 0.46 | 28.97 | +0.39 |
| 10 | + Combine (8+9)           | +319M   | 0.30  | 0.42 | 29.63 | +1.05 |

Table 1: Translation performance on the WMT14 English⇒German translation task. "# Para." denotes the number of parameters, and "Train" and "Test" respectively denote the training (steps/second) and decoding (sentences/second) speeds on a Tesla P40.
4. Experiment

In this section, we evaluate the performance of our proposed models on both machine translation tasks and linguistic probing tasks.

4.1. Machine Translation Tasks

4.1.1. Setting

Dataset. We conducted experiments on the widely-used WMT14 English⇒German (En⇒De) translation task, in which the training corpus consists of about 4.56 million sentence pairs. We used newstest2013 as the development set and newstest2014 as the test set. All the data had been tokenized and segmented into subword symbols using byte-pair encoding with 32K merge operations [26]. We used the 4-gram NIST BLEU score [27] as the evaluation metric, and the sign-test [28] for statistical significance testing.

Models. We evaluated the proposed approaches on the Transformer model [3]. We followed the configurations in [3], and reproduced their reported results on the En⇒De task. The parameters of the proposed models were initialized by the pre-trained model. All the models were trained on eight NVIDIA P40 GPUs, each allocated a batch size of 4096 tokens.
4.1.2. Results

Table 1 shows the results on the WMT14 En⇒De translation task. Clearly, the proposed approaches outperform the Transformer in all cases, while there are still considerable differences among the different variations.

Deep Connections. (Rows 2-3, 8) The hierarchical connection outperforms its iterative connection counterpart, and introduces fewer parameters (23M vs. 32M). In addition, the hierarchical connection is more efficient in both training and testing. The hierarchical connection model also improves the Transformer-Big model by 0.32 BLEU points. In the following experiments, we use "Hierarchical Connection" as the default option for deep connections.

Iterative Routing. (Rows 4-5, 9) Benefiting from the advanced routing-by-agreement algorithm, the dynamic routing strategy achieves improvements similar to the deep connection approaches. EM routing further improves performance by better estimating the agreement during routing. These findings suggest broad applicability of capsule networks to natural language processing tasks, which has not yet been fully investigated.

Combining Aggregating and Routing Mechanisms. (Rows 6, 10) Feeding the aggregation nodes to the iterative routing further enhances the performance of the model. Specifically, this improves the Transformer-Base model by 1.64 BLEU points and the Transformer-Big model by 1.05 BLEU points. These results support the claim that the model benefits from both better information flow and better representation composition.

4.1.3. Effect on Encoder and Decoder

Both the encoder and decoder are composed of a stack of L layers, which may benefit from the proposed approaches. In this experiment, we investigate how
our models affect the two components, as shown in Table 2. All results are reported on the development set of the En⇒De task.
| Model                    | Encoder | Decoder | BLEU  |
|--------------------------|---------|---------|-------|
| Transformer              | N/A     | N/A     | 26.13 |
| +Hierarchical Connection | ✓       | ×       | 26.32 |
|                          | ×       | ✓       | 26.41 |
|                          | ✓       | ✓       | 26.69 |
| +EM Routing              | ✓       | ×       | 26.63 |
|                          | ×       | ✓       | 26.65 |
|                          | ✓       | ✓       | 26.89 |

Table 2: Effects of hierarchical connection and EM routing on the encoder and decoder.
As seen, for both approaches, fusing information across the layers of the encoder or decoder individually consistently outperforms the vanilla baseline model, and exploiting both components further improves performance. These results support the claim that fusing information is useful for both understanding the input sequence and generating the output sequence.

4.1.4. Effect on LSTM Model

| Model                                  | BLEU  |
|----------------------------------------|-------|
| LSTM-based Encoder                     | 27.23 |
| + Hierarchical Connection + EM Routing | 28.43 |

Table 3: Effects of the proposed approaches on an LSTM-based encoder.
In this experiment, we replaced the encoder of the Transformer with an LSTM-based encoder to investigate whether the proposed approach is applicable to other types of DNNs. Specifically, we substituted the SAN with an LSTM, kept the other components unchanged, and applied both the aggregating and routing mechanisms to the Transformer-Base model with the LSTM-based encoder. At each layer, the encoder has a 512-dimensional bidirectional LSTM layer. We conducted the experiment on the En⇒De task.

Experimental results are listed in Table 3. We can see that the proposed approach achieves a 1.20 BLEU point improvement over the Transformer model with an LSTM-based encoder, demonstrating that the proposed approach is applicable to other types of DNNs.
4.2. Linguistic Probing Tasks

4.2.1. Setting

Recently, Conneau et al. [16] introduced 10 probing tasks to study what linguistic properties are captured by input representations. A probing task is a classification problem that requires the model to make predictions related to certain linguistic properties of sentences. The abbreviations for the 10 tasks are listed in Table 4. Basically, these tasks are set to test the model's abilities to capture surface, syntactic, or semantic information. We refer the reader to [16] for details. We conducted these probing tasks to study whether the proposed approaches can improve the language understanding capability of DNNs by producing more informative representations.

For each classification task, the models were trained and examined using the publicly available dataset provided by [16], where each task is assigned 100k sentences for training, 10k sentences for validation, and 10k sentences for testing. For reasons of computational efficiency, we only evaluated the hierarchical connection model, the EM routing model, and their combination, according to the empirical results in Section 4.1. The models were trained for 50 epochs with 1k samples in each iterative step. Since the datasets are much smaller, we reduced the hidden size and filter size to 128 and 512 respectively, and kept the other parameters unchanged.

4.2.2. Results
Table 4 lists the classification accuracies of the three models on the 10 probing tasks. We highlight the best accuracies under each category (e.g., "Surface", "Syntactic", and "Semantic") in bold. As seen, all the proposed approaches outperform the baseline in all cases, indicating that our models produce more informative representations by exploiting deep neural networks. Combining deep connections and iterative routing achieves the best accuracies, which is consistent with the results on the machine translation task.

Concerning the three main categories, the relative improvements are respectively 0.7*%, 1.5*%, and 0.9*%. The syntactic tasks achieve the best improvement, which may indicate that the bottom layers contain more syntactic information and that combining layers encourages the information flow.
(Surface tasks: SeLen, WC; Syntactic tasks: TrDep, ToCo, BShif; Semantic tasks: Tense, SubNm, ObjNm, SOMO, CoIn.)

| Model      | SeLen | WC    | Ave.  | TrDep | ToCo  | BShif | Ave.  | Tense | SubNm | ObjNm | SOMO  | CoIn  | Ave.  |
|------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| Baseline   | 95.89 | 98.03 | 96.96 | 44.78 | 84.53 | 52.66 | 60.66 | 84.76 | 85.18 | 81.45 | 49.87 | 68.97 | 74.05 |
| Connection | 96.27 | 98.62 | 97.45 | 46.02 | 84.89 | 53.45 | 61.45 | 85.55 | 86.03 | 81.79 | 50.67 | 69.54 | 74.72 |
| Routing    | 96.25 | 99.01 | 97.63 | 45.73 | 85.00 | 53.79 | 61.51 | 85.46 | 85.73 | 81.94 | 50.03 | 69.86 | 74.60 |
| Combine    | 96.21 | 99.15 | 97.68 | 46.44 | 85.26 | 53.50 | 61.73 | 85.43 | 86.23 | 81.87 | 50.28 | 70.75 | 74.91 |

Table 4: Classification accuracies on the 10 probing tasks evaluating the linguistic properties ("Surface", "Syntactic", and "Semantic") embedded in the final encoding representation produced by each model. "Ave." denotes the averaged accuracy in each type of linguistic task. "Connection" denotes the hierarchical connection model, "Routing" the EM routing model, and "Combine" the combination of the two mechanisms.
Concerning the individual tasks, "TrDep" and "CoIn" achieve relatively higher improvements, suggesting that the model's abilities to learn both syntactic and semantic information have been improved.
5. Related Work

Representation learning is at the core of deep learning. Our work is inspired by technological advances in representation learning, specifically in the fields of deep representation learning, representation interpretation, and capsule networks.

Deep Representation Learning. Deep neural networks have advanced the state of the art in various communities, such as computer vision and natural language processing. One key challenge of training deep networks lies in how to transform information across layers, especially when the network consists of hundreds of layers. In response to this problem, ResNet [10] uses skip connections to combine layers by simple, one-step operations. Dense connections [12] are designed to better propagate features and losses through skip connections that concatenate all the layers in stages. Yu et al. [11] design structures that iteratively and hierarchically merge the feature hierarchy to better fuse information in a deep fusion.

Concerning machine translation, researchers have shown that deep networks with advanced connecting strategies outperform their shallow counterparts [29, 30]. Due to its simplicity and effectiveness, the skip connection has become a standard component of state-of-the-art NMT models [31, 6, 3]. In this work, we show that deep representation exploitation can further improve performance over simply using skip connections.

Representation Interpretation. Several researchers have tried to visualize the
representation of each layer to help better understand what information each layer captures [22, 32, 33]. Concerning natural language processing tasks, [7] find that both local and global source syntax are learned by the NMT encoder and that different types of syntax are captured at different layers. [8] show that higher-level layers are more representative than lower-level layers. Peters et al. [1] demonstrate that higher-level layers capture context-dependent aspects of word meaning while lower-level layers model aspects of syntax. Inspired by these observations, we propose to expose all of these representations to better fuse information across layers.

Capsule Networks. The idea of iterative routing was first proposed by [24], which
aims at addressing the representational limitations of convolutional and recurrent neural networks for image classification. The iterative routing procedure is further improved by using the Expectation-Maximization algorithm to better estimate the agreement between capsules [25]. In the computer vision community, [34] explore its application on CIFAR data with higher dimensionality, and [35] apply
capsule networks to the object segmentation task. The applications of capsule networks to natural language processing tasks, however, have not been widely investigated to date. [17] test capsule networks on text classification tasks and [18] propose to aggregate a sequence of vectors via dynamic routing for sequence encoding. To the best of our knowledge, this work is the first to apply the idea of iterative routing to NMT.
6. Conclusion

In this work, we propose several ways to better exploit the deep representations that are learned by multiple layers for NLP tasks, including deep connection and iterative routing strategies. Experimental results on both sequence-to-sequence and sentence classification tasks demonstrate that the proposed approach consistently outperforms the strong baseline models. Future directions include validating our approach on other architectures such as RNN-based [2, 21] or CNN-based [6] models, as well as combining it with other advanced techniques [36, 37, 38, 39] to further improve performance.
References

[1] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep contextualized word representations, in: NAACL, 2018.
[2] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, in: ICLR, 2015.
[3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: NIPS, 2017.
[4] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv, 2018.
[5] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (8) (1997) 1735-1780.
[6] J. Gehring, M. Auli, D. Grangier, D. Yarats, Y. N. Dauphin, Convolutional sequence to sequence learning, in: ICML, 2017.
[7] X. Shi, I. Padhi, K. Knight, Does string-based neural MT learn source syntax?, in: EMNLP, 2016.
[8] A. Anastasopoulos, D. Chiang, Tied multitask learning for neural speech translation, in: NAACL, 2018.
[9] N. Tishby, N. Zaslavsky, Deep learning and the information bottleneck principle, in: 2015 IEEE Information Theory Workshop (ITW), IEEE, 2015, pp. 1-5.
[10] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: CVPR, 2016.
[11] F. Yu, D. Wang, E. Shelhamer, T. Darrell, Deep layer aggregation, in: CVPR, 2018.
[12] G. Huang, Z. Liu, L. van der Maaten, K. Q. Weinberger, Densely connected convolutional networks, in: CVPR, 2017.
[13] Z.-Y. Dou, Z. Tu, X. Wang, S. Shi, T. Zhang, Exploiting deep representations for neural machine translation, in: EMNLP, 2018, pp. 4253-4262.
[14] Q. Wang, F. Li, T. Xiao, Y. Li, Y. Li, J. Zhu, Multi-layer representation fusion for neural machine translation, in: COLING, 2018.
[15] Z.-Y. Dou, Z. Tu, X. Wang, L. Wang, S. Shi, T. Zhang, Dynamic layer aggregation for neural machine translation with routing-by-agreement, in: AAAI, 2019.
[16] A. Conneau, G. Kruszewski, G. Lample, L. Barrault, M. Baroni, What you can cram into a single $&!#∗ vector: Probing sentence embeddings for linguistic properties, in: ACL, 2018.
[17] W. Zhao, J. Ye, M. Yang, Z. Lei, S. Zhang, Z. Zhao, Investigating capsule networks with dynamic routing for text classification, in: ACL, 2018.
[18] J. Gong, X. Qiu, S. Wang, X. Huang, Information aggregation via dynamic routing for sequence encoding, arXiv.
[19] J. L. Ba, J. R. Kiros, G. E. Hinton, Layer normalization, arXiv preprint arXiv:1607.06450.
[20] I. Sutskever, O. Vinyals, Q. V. Le, Sequence to sequence learning with neural networks, in: NIPS, 2014.
[21] M. X. Chen, O. Firat, A. Bapna, M. Johnson, W. Macherey, G. Foster, L. Jones, N. Parmar, M. Schuster, Z. Chen, Y. Wu, M. Hughes, The best of both worlds: Combining recent advances in neural machine translation, in: ACL, 2018.
[22] M. D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in: ECCV, 2014.
[23] Y. Shen, X. Tan, D. He, T. Qin, T.-Y. Liu, Dense information flow for neural machine translation, in: NAACL, 2018.
[24] S. Sabour, N. Frosst, G. E. Hinton, Dynamic routing between capsules, in: NIPS, 2017.
[25] G. E. Hinton, S. Sabour, N. Frosst, Matrix capsules with EM routing, in: ICLR, 2018.
[26] R. Sennrich, B. Haddow, A. Birch, Neural machine translation of rare words with subword units, in: ACL, 2016.
[27] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a method for automatic evaluation of machine translation, in: ACL, 2002.
[28] M. Collins, P. Koehn, I. Kucerova, Clause restructuring for statistical machine translation, in: ACL, 2005.
[29] F. Meng, Z. Lu, Z. Tu, H. Li, Q. Liu, A deep memory-based architecture for sequence-to-sequence learning, in: ICLR Workshop, 2016.
[30] J. Zhou, Y. Cao, X. Wang, P. Li, W. Xu, Deep recurrent models with fast-forward connections for neural machine translation, TACL.
[31] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al., Google's neural machine translation system: Bridging the gap between human and machine translation, arXiv.
[32] Y. Li, J. Yosinski, J. Clune, H. Lipson, J. E. Hopcroft, Convergent learning: Do different neural networks learn the same representations?, in: ICLR, 2016.
[33] Y. Ding, Y. Liu, H. Luan, M. Sun, Visualizing and understanding neural machine translation, in: ACL, 2017.
[34] E. Xi, S. Bing, Y. Jin, Capsule network performance on complex data, arXiv.
[35] R. LaLonde, U. Bagci, Capsules for object segmentation, arXiv.
[36] P. Shaw, J. Uszkoreit, A. Vaswani, Self-attention with relative position representations, in: NAACL, 2018.
[37] T. Shen, T. Zhou, G. Long, J. Jiang, S. Pan, C. Zhang, DiSAN: Directional self-attention network for RNN/CNN-free language understanding, in: AAAI, 2018.
[38] B. Yang, Z. Tu, D. F. Wong, F. Meng, L. S. Chao, T. Zhang, Modeling localness for self-attention networks, in: EMNLP, 2018.
[39] J. Li, Z. Tu, B. Yang, M. R. Lyu, T. Zhang, Multi-head attention with disagreement regularization, in: EMNLP, 2018.
Conflict of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Zi-Yi Dou is a master's student at Carnegie Mellon University. He received his bachelor's degree from Nanjing University. His research interests include machine translation, natural language processing, and machine learning.
Xing Wang is a researcher with the Tencent AI Lab, Shenzhen, China. He received his Ph.D. degree from Soochow University in 2018. His research interests include statistical machine translation and neural machine translation.
Zhaopeng Tu is a Principal Researcher with the Tencent AI Lab, Shenzhen, China. He received his Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences in 2013. He was a Postdoctoral Researcher at the University of California, Davis from 2013 to 2014, and a researcher at Huawei Noah's Ark Lab, Hong Kong from 2014 to 2017. His research focuses on deep learning for natural language processing.
Shuming Shi is a Principal Researcher and the research director of the natural language processing center at Tencent AI Lab.