Research Highlights

• We introduced an adversarial approach to aid domain adaptation in a Deep Q Network.
• We devised a calibrator loss to reduce domain shift at the feature-network level.
• We evaluated the architecture in the PyDial environment with three different domains.
• Our method outperformed the baseline, although it is hindered by domain-inherent properties.
Adversarial Approach to Domain Adaptation for Reinforcement Learning on Dialog Systems

Sangjun Koo (a,b,∗∗), Hwanjo Yu (a), Gary Geunbae Lee (a)

(a) Pohang University of Science and Technology (POSTECH), 77 Cheongam-Ro, Nam-Gu, Pohang, 37673, Republic of Korea
(b) Scatter Lab Inc., 125 Wangsimni-ro, Seongdong-gu, Seoul, 04766, Republic of Korea
ABSTRACT

Domain adaptation using source domain data is preferable in the development of Deep Q Network (DQN) based dialog systems because of the training cost for additional target domains. However, the inherent domain shift hinders feature-space-level generalization, which degrades performance. We introduce Adversarial Calibrator based Transfer learning (ACT) to aid the coherent training of the target domain feature extractor network. In ACT, the feature extractor of the target domain is considered as a generator that is trained to match a pre-stored latent feature set of the source domain with adversarial training losses. We verified ACT with the PyDial framework, conducting domain adaptation experiments with the following domains: San Francisco Restaurant (SFR), Cambridge Restaurant (CR), and Laptop11 (LAP). The results demonstrate that ACT outperforms β-VAE based adaptation between the pair of similar domains (SFR-LAP), and even outperforms individual learning between the pair of distinct domains (SFR-CR).

© 2019 Elsevier Ltd. All rights reserved.
1. Introduction

The main challenge in implementing a spoken dialog system (SDS) is to establish a dialog management policy: an agent function that selects appropriate system actions for a given user's input dialog. While deterministic methods [17, 24] are primarily used to implement commercial systems, stochastic reinforcement learning (RL) methods [8, 29], including deep RL [32], have been widely researched, because these methods can manage input errors with flexibility [25, 33]. Traditionally, SDSs are built only for limited domains, because policy training requires a significant amount of computational resources [9]. However, owing to the increasing popularity of SDSs, especially multi-domain SDSs, an efficient method for constructing multiple policy models is required.
∗∗ Corresponding author. Tel.: +82-10-3394-4942; Fax: +82-54-279-2299.
E-mail addresses: [email protected] (Sangjun Koo), [email protected] (Hwanjo Yu), [email protected] (Gary Geunbae Lee).
A domain adaptation method using pre-trained policies can be a desirable solution because it reduces several individual policy trainings to a single policy training. The method can be formulated as follows [19]: let us assume that a source set S = {(x_i, y_i)}_{i=1...n} is drawn from the source domain distribution (D_S^X)^n and a target set T = {x_i}_{i=n+1...N} from the target domain distribution (D_T^X)^{n′}. Domain adaptation between the domains D_S and D_T is then defined as minimizing the risk of the estimator η, R_{D_T}(η) = Pr(η(x) ≠ y), using S. Although considerable research [4, 20] has been devoted to the theoretical basis of this technique, implementing domain adaptation in traditional RL problems is not straightforward.

By contrast, deep RL-based techniques are considered feasible candidates for domain adaptation, because the problem space can be directly represented in the form of neural network layers. For instance, the deep Q network (DQN) [21], which is a basic structure for Q-learning with neural networks, introduces a feature extractor network: a series of convolutional neural network layers that transform low-level input spaces into high-level latent feature spaces. The architecture thus reduces the problem to the derivation of a generalized mapping.
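To make this decomposition concrete, the following minimal PyTorch sketch (our own illustration; neither the framework nor these hidden sizes are prescribed by the paper, and the action count is a placeholder) separates a DQN into a feature extractor F : S → I and a policy head P : I → A, so that Q(s) ≈ P(F(s)):

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """F(.; W_f): maps a low-level state vector s onto a latent feature embedding."""
    def __init__(self, state_dim: int, feat_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                                 nn.Linear(256, feat_dim), nn.ReLU())

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)

class PolicyHead(nn.Module):
    """P(.; W_p): maps a feature embedding onto Q-values, one per action."""
    def __init__(self, feat_dim: int, num_actions: int):
        super().__init__()
        self.net = nn.Linear(feat_dim, num_actions)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return self.net(feat)

# Q(s; theta_f, theta_p) = P(F(s)): the composition that domain adaptation targets.
F = FeatureExtractor(state_dim=636, feat_dim=1024)   # 636 = SFR input dimension (Table 3)
P = PolicyHead(feat_dim=1024, num_actions=16)        # the action count here is a placeholder
q_values = P(F(torch.randn(4, 636)))                 # a batch of 4 states -> shape (4, 16)
```

Domain adaptation then amounts to reusing or aligning the parameters of these two components across domains.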
However, applying domain adaptation to dialog policies introduces another novel problem: incoherent training among different domains. The diversity among domains, which is known as the domain shift [27] or domain bias, hinders the derivation of a general feature space; the feature space tends to yield a sparse and distinct latent feature space mapping. This results in inefficient handling of the shared features and may degrade performance.

Adversarial methods can be adopted to reduce the feature-level differences between domains. The main intuition of the Generative Adversarial Network (GAN) is to have two networks, a generator and a discriminator, compete against each other [13]; the main goal of adversarial methods is to generate "fake" data that is indistinguishable from the data in the original space. This method can be directly applied to the feature extractor: if a set of latent features from a specific domain is given (the original data), a feature extractor for another domain (the generator) can be trained against the difference between them (the discriminator).

In this work, we propose a method called Adversarial Calibrator based Transfer learning (ACT) that can align a feature extractor with another pre-trained network. The main contributions of the proposed method are as follows: (1) We suggest the main concept of the calibrator, a variant of an adversarial generator that enables target feature extractors to be coherently trained. The training of the network is achieved with a Q-loss and a feature discriminator loss. (2) We establish the mathematical basis of the training process and implement it with the PyDial framework [32], which is widely used for RL-based dialog management benchmarking.

2. Related work

Although most traditional reinforcement learning methods do not support domain adaptation, Gaussian-Process State-Action-Reward-State-Action (GP-SARSA) [7] introduced the concept of a Gaussian kernel representation, which supports adaptation from pre-trained models. There have been some successful attempts to implement the method for multi-domain dialog systems [10, 11], although further studies are required to optimize the handling of the large-scale Gaussian kernels.

In deep RL, one of the major approaches is based on the structural similarity between the source task and the target task; architectures based on this perspective suggest that domain adaptation can be achieved by network-wise techniques, including mimicking a generalized pre-trained policy [23] or inter-utilization of the feature networks [14, 26]. Because these approaches are rather content-independent, they are theoretically compatible with several applications, although most methods are implemented in the Arcade Learning Environment [3] and are practically limited.

The other approaches are based on semantic similarity; they focus on utilizing inter-domain knowledge and data-level relevance. Some notable techniques include weight sharing over the same slot entities across different domains [15] and reflecting the personalized behavior of a specific user [22]. In contrast to the former structural approaches, these methods are content-dependent: they can be utilized in specific tasks but are extremely difficult to adapt to other, more general tasks.

The use of adversarial methods has been proposed to minimize an approximate domain discrepancy distance through an adversarial objective with respect to a domain discriminator [30]. Various studies have used this method to construct generalized classifiers [12, 28, 30], especially in image processing. Our main contribution is to show that these methods can also be employed in a dialog system, implying further uses in other RL tasks.

3. Preliminaries

We present the notations used in the discussion in this paper (Table 1).

Table 1: Notations
Symbol                      Description
s ∈ S, a ∈ A                states and actions
r ← R : S → ℝ               reward (function)
γ                           discount factor
D_S                         source domain
D_T                         target domain
E_i                         network variable
M(·)                        replay distribution
θ                           weights (parameters)
θ_f^i ∈ W_f                 weights of the feature extractor
θ_p^i ∈ W_p                 weights of the policy network
θ_c^i ∈ W_c                 weights of the calibrator
I_s ∈ W_f(s) : ℐ            feature embedding
F(·; W_f) : S → ℐ           feature extractor function
P(·; W_p) : ℐ → A           policy network function
C(·; W_c) : ℐ → ℝ           calibrator function
L(·) : · → ℝ                loss function
Furthermore, we present a simplified canonical form to indicate the inputs, outputs, and components in the training process. Let the feed-in input be (A, B) and the output be (X, Y, Z) for a network E : (C, D). Moreover, Č is the only component that is updated using the objective loss L. The corresponding canonical form is as follows:

(A, B) ↦ (E : (Č, D)) ↦ (X, Y, Z) : L

4. Architecture

4.1. Construction of ACT

The construction of the proposed architecture comprises the following four stages (Fig. 1 and Algorithm 1): (1) pre-training in the source domain; (2) storage of a feature embedding set; (3) source network training with the calibration network; and (4) fine-tuning with the adapted networks.
Fig. 1: Schematic of the architecture. In the source domain (D_S), an input state s is mapped by the feature extractor W_f to a feature embedding ℐ and then by the source policy W_p^S : ℐ → A, trained with the source Q-loss L_Q(x ∼ M_{D_S}). The calibrator network W_c(W_f(s), I) : ℐ, ℐ → ℝ yields the discriminator loss L_C(W_f(s), I). After adaptation, the target-domain (D_T) feature extractor and target policy W_p^T : ℐ → A are trained with the target Q-loss L_Q(x ∼ M_{D_T}) together with the discriminator loss.
Algorithm 1 Adversarial Adaptation Procedure
Input: E_S, E_T
1:  for i ← 1 to numPreTrainEpochs do
2:    Construct M_S(·): D_S
3:    for j ← 1 to numPreTrainBatches do
4:      x = (s, a, r, s′, a′) ∼ M_S(·)
5:      (x) ↦ (E_S : (W̌_f, W̌_p)) ↦ () : L_Q(x)                         ▷ Pre-training
6:  ℐ = ∅
7:  Construct M_S(·): D_S
8:  for k ← 1 to countFeatEmbeddings do
9:    x = (s, a, r, s′, a′) ∼ M_S(·)
10:   (x) ↦ (E_S : (W_f)) ↦ (I_k)
11:   ℐ ← {I_k} ∪ ℐ                                                    ▷ Feature embedding storing
12: for i ← 1 to numCalibTrainEpochs do
13:   Construct M_S(·): D_S
14:   for j ← 1 to numCalibTrainBatches do
15:     Sample I_ij ← ℐ
16:     x = (s, a, r, s′, a′) ∼ M_S(·)
17:     (x, I_ij) ↦ (E_S : (W̌_f, W̌_p, W̌_c)) ↦ () : L_Q(x) + κ · L_C(s, I_ij)   ▷ Calibrator network training
18: E_T : (W_p, W_c) ← E_S : (W_p, W_c)                                ▷ Adaptation
19: for m ← 1 to numFineTuneEpochs do
20:   Construct M_T(·): D_T
21:   for n ← 1 to numFineTuneBatches do
22:     Sample I_mn ← ℐ
23:     x = (s, a, r, s′, a′) ∼ M_T(·)
24:     (x, I_mn) ↦ (E_T : (W̌_f, W̌_p, W̌_c)) ↦ () : L_Q(x) + κ · L_C(s, I_mn)   ▷ Fine tuning
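For readers who prefer code to the canonical-form notation, the following self-contained sketch mirrors the four stages of Algorithm 1 on random dummy data. It is our own simplification: PyTorch in place of the PyDial training loop, random tensors standing in for the replay distributions M_S(·) and M_T(·), and a squared-error stub (`q_loss_stub`) standing in for the full DQN Q-loss; none of these names exist in PyDial.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def mlp(sizes):
    """Small helper building a ReLU MLP from a list of layer sizes."""
    layers = []
    for i, (d_in, d_out) in enumerate(zip(sizes[:-1], sizes[1:])):
        layers.append(nn.Linear(d_in, d_out))
        if i < len(sizes) - 2:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

STATE_DIM, FEAT_DIM, N_ACTIONS, KAPPA = 32, 64, 8, 0.05   # toy sizes, not the paper's settings

feat_S = mlp([STATE_DIM, FEAT_DIM])        # W_f: source feature extractor
policy = mlp([FEAT_DIM, N_ACTIONS])        # W_p: policy head (shared after adaptation)
calib  = mlp([FEAT_DIM, 32, 1])            # W_c: calibrator / discriminator

def q_loss_stub(feat, pol, states, targets):
    """Squared-error stand-in for the DQN Q-loss L_Q (Eq. 1)."""
    return ((pol(feat(states)) - targets) ** 2).mean()

def calib_loss(feat, cal, states, stored_feats):
    """Log-loss calibration term L_C (Eq. 5); sigmoid squashing is our assumption."""
    fake = torch.sigmoid(cal(feat(states)))
    real = torch.sigmoid(cal(stored_feats))
    return -(torch.log(1 - fake + 1e-8) + torch.log(real + 1e-8)).mean()

def sample_batch(batch=16):
    """Stand-in for x ~ M(.): random states and random Q-value targets."""
    return torch.randn(batch, STATE_DIM), torch.randn(batch, N_ACTIONS)

# (1) Pre-training in the source domain: update W_f and W_p with L_Q only.
opt = torch.optim.Adam(list(feat_S.parameters()) + list(policy.parameters()), lr=1e-3)
for _ in range(200):
    s, y = sample_batch()
    opt.zero_grad(); q_loss_stub(feat_S, policy, s, y).backward(); opt.step()

# (2) Store a set of source feature embeddings (the set I in Algorithm 1).
with torch.no_grad():
    stored = torch.cat([feat_S(sample_batch()[0]) for _ in range(8)])

# (3) Calibrator-network training on the source: L_Q + kappa * L_C updates W_f, W_p, W_c.
params = list(feat_S.parameters()) + list(policy.parameters()) + list(calib.parameters())
opt = torch.optim.Adam(params, lr=1e-3)
for _ in range(200):
    s, y = sample_batch()
    sampled_I = stored[torch.randint(len(stored), (16,))]
    loss = q_loss_stub(feat_S, policy, s, y) + KAPPA * calib_loss(feat_S, calib, s, sampled_I)
    opt.zero_grad(); loss.backward(); opt.step()

# (4) Adaptation and fine-tuning: W_p and W_c are carried over from the source network,
#     while a fresh target feature extractor is trained on target-domain batches.
feat_T = mlp([STATE_DIM, FEAT_DIM])
params = list(feat_T.parameters()) + list(policy.parameters()) + list(calib.parameters())
opt = torch.optim.Adam(params, lr=1e-3)
for _ in range(200):
    s, y = sample_batch()                  # would be x ~ M_T(.) in the real system
    sampled_I = stored[torch.randint(len(stored), (16,))]
    loss = q_loss_stub(feat_T, policy, s, y) + KAPPA * calib_loss(feat_T, calib, s, sampled_I)
    opt.zero_grad(); loss.backward(); opt.step()
```

The essential point is stage (4): only W_p and W_c are copied from the source network (Algorithm 1, line 18), and the target feature extractor is trained against both L_Q and the κ-weighted calibration term.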
Note that the training process is driven by two objective loss functions with corresponding stages: the Q-loss function [21] and the discriminator loss function. The Q-loss (Eq. 1) indicates the proficiency of the network for a given task. The loss is calculated with a sampled SARSA tuple x = (s, a, r, s′, a′) from the experience replay set M, using the feature extractor and the policy network (Eq. 2), where p(a | s; θ) ∝ softmax(Q(s; θ_f, θ_p)).

L_Q(x) = L^{(n+1)}(θ^{(n+1)}) = E_{x ∼ M′(·)} [ ( r + γ · max_{a′ ∈ A} Q(s′, a′; θ^{(n)}) − Q(s, a; θ^{(n+1)}) )^2 ]    (1)

(E : W_f, W_p)(s) = P(F(s)) ≈ Q(s; θ_f, θ_p)    (2)
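As a concrete reading of Eq. 1, the sketch below (our own PyTorch rendering, not taken from the PyDial implementation) assumes an online network holding θ^{(n+1)} and a frozen target network holding θ^{(n)}; the toy dimensions are placeholders.

```python
import torch

def dqn_q_loss(q_online, q_target, batch, gamma=0.99):
    """Eq. 1: E[(r + gamma * max_a' Q(s', a'; theta^(n)) - Q(s, a; theta^(n+1)))^2]
    over a batch of sampled tuples (s, a, r, s') drawn from the replay set M."""
    s, a, r, s_next = batch
    with torch.no_grad():                                     # theta^(n) is held fixed
        target = r + gamma * q_target(s_next).max(dim=1).values
    pred = q_online(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a; theta^(n+1))
    return ((target - pred) ** 2).mean()

# Toy usage with 6-dimensional states and 3 actions (placeholder sizes).
q_online, q_target = torch.nn.Linear(6, 3), torch.nn.Linear(6, 3)
batch = (torch.randn(8, 6), torch.randint(3, (8,)), torch.randn(8), torch.randn(8, 6))
loss = dqn_q_loss(q_online, q_target, batch)
```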
The discriminator loss (Eq. 3) indicates the similarity among the feature embeddings, which consist of the current feature embedding and a pre-stored feature embedding. The current feature embedding F(s) is elicited from the present feature extractor with the sampled state s in the SARSA tuple. The pre-stored feature embedding I, in contrast, is sampled from the set ℐ, which is constructed from the previous network. The loss is calculated with a calibrator network, which is a series of layers that act as a discriminator, mapping a given feature embedding to a real value (Eq. 4).

L_C(s, I_n) = −L(C(F(s, I_n; θ_f^{(n)}); θ_c^{(n)}))    (3)

(E : W_c, W_f)(s) = C(F(s)) → ℝ    (4)

To ensure the convergence of the calibration training, candidates for the calibration loss function L should be convex and differentiable. We used a log-loss function for the designated calibration term as a default (Eq. 5):

L_C(s, I_n) = −{ log(1 − C(F(s; θ_f^{(n)}); θ_c^{(n)})) + log(C(I_n; θ_c^{(n)})) }    (5)
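A hedged sketch of the log-loss calibration term in Eq. 5 follows; since C maps an embedding to a real value, we squash its output with a sigmoid so the log terms are well defined, which is our assumption rather than a detail stated in the paper.

```python
import torch

def calibrator_log_loss(calibrator, feature_extractor, s, stored_embedding, eps=1e-8):
    """Eq. 5: L_C(s, I_n) = -{ log(1 - C(F(s))) + log(C(I_n)) }.
    The calibrator maps an embedding to a real value, so we squash its output
    with a sigmoid to keep the log terms well defined (our assumption)."""
    fake = torch.sigmoid(calibrator(feature_extractor(s)))   # current embedding F(s)
    real = torch.sigmoid(calibrator(stored_embedding))       # pre-stored embedding I_n
    return -(torch.log(1.0 - fake + eps) + torch.log(real + eps)).mean()
```

In Algorithm 1 this term is added to the Q-loss with weight κ during both calibrator training and fine-tuning.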
4.2. Convergence property of ACT

Consider the updating scheme of the network E in the n-th iteration. For each iteration in a training session, the model parameter θ is updated with L(n) = L_Q(n) + κ · L_C(n), where L_Q and L_C are the loss functions of the Q-update term and the calibration term, respectively.

Assumption 1. The update process according to L_Q(n) for θ is induced by a Lipschitz operator K_Q that is a contraction for the model parameter θ, described as θ_{n+1} ← K_Q(θ_n): there exists a constant 0 ≤ c_Q < 1 such that ‖K_Q(θ_n) − K_Q(θ_{n−1})‖ < c_Q ‖θ_n − θ_{n−1}‖ for all n.

Assumption 2. The update process according to L_C(n) for θ is induced by a Lipschitz operator K_C that is a contraction for the model parameter θ, described as θ_{n+1} ← K_C(θ_n): there exists a constant 0 ≤ c_C < 1 such that ‖K_C(θ_n) − K_C(θ_{n−1})‖ < c_C ‖θ_n − θ_{n−1}‖ for all n.

Assumption 3. The overall update process according to L(n) for θ can be described as (K_Q ∘ K_C)(θ_n) = K_Q(K_C(θ_n)).

Assumption 4. K_Q, K_C, and K_Q ∘ K_C are non-periodic.

Theorem 1. If the assumptions hold, the overall loss function L(n) = L_Q(n) + κ · L_C(n) almost surely converges.

Proof. We want to show that K_Q ∘ K_C (As. 3) is a contraction.

‖(K_Q ∘ K_C)(θ_n) − (K_Q ∘ K_C)(θ_{n−1})‖    (6)
≤ c_Q ‖K_C(θ_n) − K_C(θ_{n−1})‖    (by As. 1)    (7)
≤ c_C · c_Q ‖θ_n − θ_{n−1}‖    (by As. 2)    (8)
< ‖θ_n − θ_{n−1}‖    (by As. 1, 2)    (9)

By Eq. 9, K_Q ∘ K_C is a contraction over θ. By Assumption 4 and the contraction mapping theorem, the convergence is shown.

It has been proved that a Q-learning process constructed with a linear estimator network and Lipschitz transition functions can be Lipschitz continuous [2], which supports Assumption 1. Additionally, a GAN learning process can also be Lipschitz continuous if the loss follows the Earth-Mover (Wasserstein) distance [1], which supports Assumption 2. Note that the native versions of the Q-loss and GAN loss were used in the architecture, which require stronger conditions since they are constructed with the Jensen–Shannon distance; the empirically convergent behavior of the implementation implies Lipschitz continuity of the system. Assumption 3 is valid if the update is achieved by stochastic gradient descent with a fixed learning rate, and is asymptotically valid if the update is achieved by adaptive gradient descent methods.

Note that Theorem 1 only shows conditional convergence. In fact, after a sufficient amount of training, the gradient of θ for L_C will hinder the stochastic gradient descent of θ for L_Q, which makes Assumption 4 invalid; hence, performance will degrade. To prevent this potential problem, the scaling parameter κ can be annealed as the iterations proceed.
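One way to realise the suggested annealing is a simple geometric decay of κ; the schedule below is our own sketch with hypothetical hyper-parameters, not a setting reported in the paper.

```python
def annealed_kappa(iteration: int, kappa_0: float = 0.05,
                   decay: float = 0.999, kappa_min: float = 0.0) -> float:
    """Geometrically decay the calibration weight so that the L_C gradient no longer
    dominates the L_Q gradient late in training (cf. the Theorem 1 discussion)."""
    return max(kappa_min, kappa_0 * decay ** iteration)

# e.g. kappa is 0.05 at iteration 0 and roughly 0.018 after 1,000 iterations.
```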
5. Experiment

We implemented our proposed model (Table 2) using the PyDial framework, an open-source end-to-end statistical spoken dialog system toolkit [32]. One major advantage of using PyDial is that the system provides conventional benchmark functions with various configurations.

Table 2: Layer settings of the network

Layer          Description
w_f1, w_f2     (∗, 768), (768, 1024) (ReLU)
w_c1, w_c2     (1024, 256), (256, 1) (ReLU)
w_p1, w_p2     (1024, 512), (512, ∗) (FC)
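A possible PyTorch instantiation of Table 2 follows (our own reading of the table; '∗' is the domain-dependent input dimension or action count, and whether a ReLU follows the final calibrator layer is not specified, so we omit it):

```python
import torch.nn as nn

def build_act_networks(input_dim: int, num_actions: int):
    """Layer settings from Table 2; '*' stands for the domain-dependent size."""
    feature_extractor = nn.Sequential(        # w_f1, w_f2 (ReLU)
        nn.Linear(input_dim, 768), nn.ReLU(),
        nn.Linear(768, 1024), nn.ReLU())
    calibrator = nn.Sequential(               # w_c1, w_c2 (ReLU between the two layers)
        nn.Linear(1024, 256), nn.ReLU(),
        nn.Linear(256, 1))
    policy = nn.Sequential(                   # w_p1, w_p2 (fully connected)
        nn.Linear(1024, 512),
        nn.Linear(512, num_actions))
    return feature_extractor, calibrator, policy

# e.g. the SFR source domain has input dimension 636 (Table 3); the action count
# depends on the domain ontology and is not listed in the paper.
f, c, p = build_act_networks(input_dim=636, num_actions=16)   # 16 is a placeholder
```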
5.1. Data Description

Three supported domain datasets were used for the experiment: the San Francisco Restaurant (SFR) dataset as the source domain and the Cambridge Restaurant (CR) and Laptop11 (LAP) datasets as target domains (Table 3). We varied the error rates of the input from 0% to 30%.

Table 3: Description of the datasets

                 Source    Target 1   Target 2
Domains          SFR       CR         LAP
Input Dim.       636       268        257
Constraints      6         3          11
Requests         11        9          21
Error Rate       0%, 15%, 30% (all domains)
Maximum Turns    25 (all domains)
5.2. Experiment Design

We used two metrics to evaluate training methods: task success rate (TSR) and reward. TSR is the ratio of completed dialogs to all dialogs [31]. The reward for an episode X is calculated as 20 · 1(X) − T, where 1(X) is the success indicator and T is the dialog length of X, measured in turns [32]. After each training batch of 50 episodes, we measured performance over a test batch of 250 episodes.

We tested three different methods: (1) the baseline (individual training without adaptation); (2) pre-training the feature extractor with a β-Variational Auto-Encoder (β-VAE); and (3) ACT with (3a) κ = 0.05 and (3b) κ = 0.1. Here, the β-VAE originated from DARLA [14], which pre-trains a disentangled version of the input space, although the implemented version is rather "entangled" because there is no distinct concept of an "agent image" in dialog input. Each training and test session was repeated five times with different random seed values.

We used 5,000 episodes of SFR to pre-train the β-VAE (β = 1.0) and another 10,000 episodes of SFR to pre-train the network weights of the target networks for method (2). For methods (3a) and (3b), we iterated over 5,000 episodes of SFR for pre-training with κ = 0.05, where the TSR reached 65.8%. In addition, we iterated over another 5,000 episodes of SFR batches to construct the latent image set.
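To make the metrics concrete, here is a small helper (our own, not a PyDial function) that computes the per-episode reward 20 · 1(X) − T and the TSR of a test batch:

```python
from typing import List, Tuple

def episode_reward(success: bool, num_turns: int) -> int:
    """Reward of an episode X: 20 * 1(X) - T, with T the dialog length in turns."""
    return 20 * int(success) - num_turns

def task_success_rate(episodes: List[Tuple[bool, int]]) -> float:
    """TSR: percentage of successful dialogs in a test batch."""
    return 100.0 * sum(success for success, _ in episodes) / len(episodes)

# A successful 8-turn dialog scores 20 - 8 = 12; a failed 12-turn dialog scores -12.
test_batch = [(True, 8), (False, 12), (True, 10), (True, 6)]
print(task_success_rate(test_batch))                       # 75.0
print([episode_reward(s, t) for s, t in test_batch])       # [12, -12, 10, 14]
```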
Table 4: Experiment result

(a) TSR (%) for SFR–CR

Err.  Method          400   600   800   1000  1200  1400  1600  1800  2000
0%    Baseline        85.8  84.6  86.9  87.0  86.8  77.6  89.1  87.2  88.2
0%    β-VAE           82.3  85.5  81.3  86.2  86.7  80.2  81.1  85.0  83.8
0%    ACT (κ = 0.05)  90.1  87.0  88.6  83.1  86.5  93.2  91.5  89.8  85.0
0%    ACT (κ = 0.10)  82.4  82.2  85.0  82.2  90.0  91.7  90.3  83.0  89.8
15%   Baseline        70.1  70.8  77.2  75.4  73.0  78.8  76.3  72.8  81.0
15%   β-VAE           74.6  72.0  76.0  77.8  73.6  73.5  73.5  76.7  75.0
15%   ACT (κ = 0.05)  78.2  78.9  79.4  77.0  83.1  79.9  82.5  79.2  72.4
15%   ACT (κ = 0.10)  69.7  82.5  82.6  85.6  79.2  89.3  89.8  81.7  83.1
30%   Baseline        60.7  57.6  66.0  64.8  68.6  72.0  72.6  72.0  75.3
30%   β-VAE           55.7  63.4  64.6  63.0  57.8  58.6  65.7  64.7  54.7
30%   ACT (κ = 0.05)  59.2  67.0  71.6  74.0  75.7  70.9  70.8  76.0  71.7
30%   ACT (κ = 0.10)  61.0  60.6  70.3  69.8  73.7  65.6  75.0  74.4  71.2

(b) TSR (%) for SFR–LAP

Err.  Method          400   600   800   1000  1200  1400  1600  1800  2000
0%    Baseline        52.5  61.2  61.7  66.3  60.6  57.0  58.2  67.8  59.9
0%    β-VAE           49.4  59.5  53.8  50.6  56.8  64.7  56.0  50.8  57.9
0%    ACT (κ = 0.05)  55.5  54.2  58.5  58.6  61.4  64.7  66.8  47.5  65.0
0%    ACT (κ = 0.10)  49.2  62.3  61.4  62.4  70.0  52.9  58.0  66.1  61.8
15%   Baseline        51.5  46.4  52.7  52.5  56.7  49.3  54.7  58.5  55.5
15%   β-VAE           44.4  48.9  52.9  48.6  48.8  51.1  47.0  52.1  50.3
15%   ACT (κ = 0.05)  56.2  44.0  52.5  50.6  55.1  49.8  58.6  51.5  66.4
15%   ACT (κ = 0.10)  50.7  50.5  57.7  56.1  54.3  56.1  52.6  54.8  57.2
30%   Baseline        44.8  51.0  46.7  44.9  42.6  44.0  44.3  44.9  54.7
30%   β-VAE           42.1  38.2  42.5  38.2  45.8  42.5  36.3  36.2  46.4
30%   ACT (κ = 0.05)  39.6  39.8  42.6  46.2  48.2  48.5  41.1  39.0  44.6
30%   ACT (κ = 0.10)  39.4  47.3  45.0  39.4  39.3  45.8  32.2  45.6  45.5

5.3. Analysis

The TSR for SFR–CR (Table 4a) shows that ACT performed better than the baseline method over the interval 400–1,600 episodes in both kappa settings. The learning curves of TSR and reward (Figs. 3a and 3b) show that the proposed methods also converged faster than the baseline. Moreover, the standard deviation of our method for the parameter setting κ = 0.05 was smaller than that of the baseline iteration. A significant improvement was recorded over the interval 1,400–1,600 episodes, where both the naive training session of the CR and the SFR achieved TSRs of 77.6% and 57.0%, respectively.

In contrast, our methods yielded a poor TSR and reward for SFR–LAP, except over the interval 1,400–1,600 episodes (Table 4b), although the standard deviation was smaller than that of the LAP–SFR baseline. We assume that the internal domain gap is the main cause of this phenomenon; intuitively, the difference between LAP and SFR is larger than the difference between CR and SFR. Fitting a target feature network to generate source-like feature embedding would cause "incorrect" overfitting of the overall networks. The results from higher error rates support this assumption; TSRs of 15% and 30% over the
interval 1,000–1,600 episodes indicate better margin-wise performance than that of 0% in the LAP domain. Moreover, we note that the performance improvement yielded by a larger discriminator loss weight (κ) in SFR–CR was more beneficial than that in SFR–LAP. This implies that the proposed calibration is more effective when the distance between the source domain and target domain is narrow, and that the effectiveness of ACT is limited by the inherent difference in the domain characteristics. Even though the architecture did not require inter-domain knowledge, which is our main contribution, we assume that the latent relations among the domains were reflected in the calibrated feature-network-level images.

Although the trends of the learning curves of the proposed settings accord with each other in both domains (Fig. 3), the numerical values for each setting showed distinct differences, even though the initial seed values for the settings were equal. This result implies that the magnitude of the gradient for L_C affects the iteration, and the proposed method could benefit from an annealing scheme for κ.
Fig. 2: Comparison of feature embedding images for the source (D_S), adaptation (D_S ← D_T), and target (D_T) settings in the SFR, CR, and LAP domains, each shown with its corresponding turn-0 user input (e.g., inform(type="restaurant", area="dontcare") for CR; hello() with type="laptop", weightrange="light weight", pricerange="moderate" for LAP).
Note that ACT yielded better performance than the β-VAE based feature network adaptation in both CR and LAP for every experimental error rate. This was expected, since the main objective of DARLA is to obtain generalized latent representations by disentangling sub-images of objects in the input image. The method was devised to operate on visual input feeds with identical dimensions; however, it proved ineffective for dialog systems, where input states are composed of one-dimensional linear sequences with different lengths.

For an intuitive comprehension of the procedure, we also recorded the feature embedding images of the trained networks for specific episodes. Figure 2 shows the 1024-dimensional image (W_f(s)) from each setting with the corresponding state (turn 0). The proposed networks yielded a sparse feature embedding, which indirectly shows the contraction behavior of our method: it reduces domain shift and prevents inefficient usage of the feature space.
6. Conclusion

This study considered the problem of domain adaptation for dialog systems based on DQN. Our proposed architecture supports domain adaptation by reducing domain shift at the feature level without inter-domain knowledge. The proposed approach has the potential to solve a wide range of other domain adaptation problems in DQN-based reinforcement learning, potentially reducing the training cost for target domains, although it can suffer from performance degradation in some cases, possibly owing to inherent differences between the domains.

One major challenge in improving the proposed architecture would be to consider inter-domain knowledge in a "parameterized" way: to devise a method to bridge the "gap" between feature-level parameters and slot-level domain knowledge. This could be achieved either at the policy-design level, as slot-dependent dialog policies have been shown to perform better than slot-independent ones [5], or at the network level, as recent adversarial approaches support the efficacy of parameterizing the semantic properties of the data [6, 16]. Adding pixel-wise attention [18] between the source embedding and the target embedding could be a viable solution. Another major challenge is to find the best parameter settings for a given source and target. As pointed out in previous sections, the discriminator loss can hinder the overall process if the discriminator weight is excessive. One method to prevent this would be an "adaptive" weight scheme, where the weight is updated based on the recorded performance, measured after every iteration.

Acknowledgments

This research was supported by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (NRF-2017M3C4A7063570).
Fig. 3: Learning curves of the experiment (blue shade: σ of the proposed method; green shade: σ of the baseline). (a) Task success rate of SFR–CR; (b) Reward of SFR–CR; (c) Task success rate of SFR–LAP; (d) Reward of SFR–LAP. (Left: 0%, Middle: 15%, Right: 30% error rate.)
References

[1] Arjovsky, M., Chintala, S., Bottou, L., 2017. Wasserstein generative adversarial networks, in: Precup, D., Teh, Y.W. (Eds.), Proceedings of the 34th International Conference on Machine Learning, PMLR, International Convention Centre, Sydney, Australia. pp. 214–223.
[2] Asadi, K., Misra, D., Littman, M., 2018. Lipschitz continuity in model-based reinforcement learning, in: Dy, J., Krause, A. (Eds.), Proceedings of the 35th International Conference on Machine Learning, PMLR, Stockholmsmässan, Stockholm, Sweden. pp. 264–273.
[3] Bellemare, M.G., Naddaf, Y., Veness, J., Bowling, M., 2013. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research 47, 253–279.
[4] Ben-David, S., Blitzer, J., Crammer, K., Pereira, F., 2006. Analysis of representations for domain adaptation, in: Proceedings of the 19th International Conference on Neural Information Processing Systems, MIT Press. pp. 137–144.
[5] Casanueva, I., Budzianowski, P., Su, P.H., Ultes, S., Rojas Barahona, L.M., Tseng, B.H., Gašić, M., 2018. Feudal reinforcement learning for dialogue management in large domains, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), Association for Computational Linguistics, New Orleans, Louisiana. pp. 714–719.
[6] Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., Choo, J., 2018. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8789–8797.
[7] Engel, Y., Mannor, S., Meir, R., 2005. Reinforcement learning with Gaussian processes, in: Proceedings of the 22nd International Conference on Machine Learning, ACM. pp. 201–208.
[8] Gašić, M., Jurčíček, F., Keizer, S., Mairesse, F., Thomson, B., Yu, K., Young, S., 2010. Gaussian processes for fast policy optimisation of POMDP-based dialogue managers, in: Proceedings of the 11th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Association for Computational Linguistics. pp. 201–204.
[9] Gašić, M., Mrkšić, N., Rojas-Barahona, L.M., Su, P.H., Ultes, S., Vandyke, D., Wen, T.H., Young, S., 2017. Dialogue manager domain adaptation using Gaussian process reinforcement learning. Computer Speech & Language 45, 552–569.
[10] Gašić, M., Breslin, C., Henderson, M., Kim, D., Szummer, M., Thomson, B., Tsiakoulis, P., Young, S., 2013. POMDP-based dialogue manager adaptation to extended domains, in: Proceedings of the SIGDIAL 2013 Conference, Association for Computational Linguistics, Metz, France. pp. 214–222. URL: http://www.aclweb.org/anthology/W/W13/W13-4035.
[11] Gašić, M., Young, S., 2014. Gaussian processes for POMDP-based dialogue manager optimization. IEEE/ACM Transactions on Audio, Speech, and Language Processing 22, 28–40.
[12] Ghifary, M., Kleijn, W.B., Zhang, M., Balduzzi, D., Li, W., 2016. Deep reconstruction-classification networks for unsupervised domain adaptation, in: European Conference on Computer Vision, Springer. pp. 597–613.
[13] Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y., 2014. Generative adversarial nets, in: Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, MIT Press. pp. 2672–2680.
[14] Higgins, I., Pal, A., Rusu, A., Matthey, L., Burgess, C., Pritzel, A., Botvinick, M., Blundell, C., Lerchner, A., 2017. DARLA: Improving zero-shot transfer in reinforcement learning, in: Proceedings of the 34th International Conference on Machine Learning - Volume 70, JMLR.org. pp. 1480–1490.
[15] Ilievski, V., Musat, C., Hossmann, A., Baeriswyl, M., 2018. Goal-oriented chatbot dialog management bootstrapping with transfer learning, in: IJCAI.
[16] Karras, T., Laine, S., Aila, T., 2018. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948.
[17] Lee, C., Jung, S., Kim, S., Lee, G.G., 2009. Example-based dialog modeling for practical multi-domain dialog system. Speech Communication 51, 466–484.
[18] Liu, N., Han, J., Yang, M.H., 2018. PiCANet: Learning pixel-wise contextual attention for saliency detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3089–3098.
[19] Long, M., Cao, Y., Wang, J., Jordan, M., 2015. Learning transferable features with deep adaptation networks, in: International Conference on Machine Learning, pp. 97–105.
[20] Mansour, Y., Mohri, M., Rostamizadeh, A., 2008. Domain adaptation with multiple sources, in: Proceedings of the 21st International Conference on Neural Information Processing Systems, Curran Associates Inc. pp. 1041–1048.
[21] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M.A., 2013. Playing Atari with deep reinforcement learning. CoRR abs/1312.5602.
[22] Mo, K., Zhang, Y., Li, S., Li, J., Yang, Q., 2018. Personalizing a dialogue system with transfer reinforcement learning, in: Thirty-Second AAAI Conference on Artificial Intelligence.
[23] Parisotto, E., Ba, J., Salakhutdinov, R., 2016. Actor-Mimic: Deep multitask and transfer reinforcement learning, in: ICLR.
[24] Raux, A., Eskenazi, M., 2009. A finite-state turn-taking model for spoken dialog systems, in: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, Stroudsburg, PA, USA. pp. 629–637.
[25] Roy, N., Pineau, J., Thrun, S., 2000. Spoken dialogue management using probabilistic reasoning, in: Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, Stroudsburg, PA, USA. pp. 93–100.
[26] Rusu, A.A., Rabinowitz, N.C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., Pascanu, R., Hadsell, R., 2016. Progressive neural networks. CoRR abs/1606.04671.
[27] Shimodaira, H., 2000. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference 90, 227–244.
[28] Sun, B., Saenko, K., 2016. Deep CORAL: Correlation alignment for deep domain adaptation. CoRR abs/1607.01719.
[29] Thomson, B., Schatzmann, J., Young, S., 2008. Bayesian update of dialogue state for robust dialogue systems, in: 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4937–4940.
[30] Tzeng, E., Hoffman, J., Saenko, K., Darrell, T., 2017. Adversarial discriminative domain adaptation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7167–7176.
[31] Ultes, S., Kraus, M., Schmitt, A., Minker, W., 2015. Quality-adaptive spoken dialogue initiative selection and implications on reward modelling, in: Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Association for Computational Linguistics, Prague, Czech Republic. pp. 374–383.
[32] Ultes, S., Rojas Barahona, L.M., Su, P.H., Vandyke, D., Kim, D., Casanueva, I., Budzianowski, P., Mrkšić, N., Wen, T.H., Gašić, M., Young, S., 2017. PyDial: A multi-domain statistical dialogue system toolkit, in: Proceedings of ACL 2017, System Demonstrations, Association for Computational Linguistics, Vancouver, Canada. pp. 73–78.
[33] Zhang, B., Cai, Q., Mao, J., Guo, B., 2001. Planning and acting under uncertainty: A new model for spoken dialogue systems, in: Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann Publishers Inc. pp. 572–579.