Real user evaluation of a POMDP spoken dialogue system using automatic belief compression


Computer Speech and Language 28 (2014) 873–887

Paul A. Crook (a), Simon Keizer (b,*), Zhuoran Wang (b), Wenshuo Tang (b), Oliver Lemon (b)

(a) Microsoft Corporation, Redmond, WA, USA
(b) Heriot-Watt University, Edinburgh, UK

* Corresponding author. Tel.: +44 131 451 4335. E-mail addresses: [email protected], [email protected] (S. Keizer).

Received 20 March 2013; received in revised form 26 September 2013; accepted 16 December 2013; available online 27 December 2013.

Abstract

This article describes an evaluation of a POMDP-based spoken dialogue system (SDS), using crowdsourced calls with real users. The evaluation compares a "Hidden Information State" POMDP system which uses a hand-crafted compression of the belief space, with the same system instead using an automatically computed belief space compression. Automatically computed compressions are a way of introducing automation into the design process of statistical SDSs and promise a principled way of reducing the size of the very large belief spaces which often make POMDP approaches intractable. This is the first empirical comparison of manual and automatic approaches on a problem of realistic scale (restaurant, pub and coffee shop domain) with real users. The evaluation took 2193 calls from 85 users. After filtering for minimal user participation, the two systems were compared on more than 1000 calls.

© 2013 Elsevier Ltd. All rights reserved.

Keywords: Spoken dialogue systems; Dialogue management; Belief compression

1. Introduction

One of the main problems for a spoken dialogue system (SDS) is to determine the user's goal (e.g. plan suitable meeting times or find a good Indian restaurant nearby) under uncertainty, and thereby to compute the optimal next system dialogue action (e.g. offer a restaurant, ask for clarification). Recent research in statistical SDSs has successfully addressed aspects of these problems through the application of Partially Observable Markov Decision Process (POMDP) approaches (Thomson and Young, 2010; Young et al., 2010). In these approaches, system responses are computed on the basis of a distribution over possible user goals, called the belief space, rather than the single most likely user goal. However, in order to keep belief state monitoring and action selection tractable, various techniques are used to reduce the size of the space. For learning action selection policies in particular, current systems rely on system designers hand-selecting a subset of the features of the belief space and action set.

This paper proposes the use of automatic belief compression (ABC) techniques in POMDP dialogue systems. Automatic belief space compression is attractive in that it reduces the knowledge required when constructing statistical SDSs and allows for greater automation of the design process (see Section 7.1). The aim of this paper is to demonstrate that automatic belief compression is competitive from an end user viewpoint, without which the potential reduction in



design effort is of little value. To evaluate this, a comparison is made between an existing state-of-the-art POMDP SDS that uses a hand-coded belief space compression and a variant that uses an automatically compressed belief space. The evaluation is done both in simulation and with real users. The compressed belief space used for dialogue management in the variant system was generated by applying Exponential Family Principal Components Analysis (E-PCA) (Roy and Gordon, 2002; Roy et al., 2005) to the belief space to automatically select a reduced set of summarising features. However, whereas Roy and Gordon (2002) and Roy et al. (2005) require knowledge of the original POMDP model in order to train a policy in the compressed belief space, we use reinforcement learning (RL) to avoid having to know the precise POMDP transition and observation probabilities, since the approximations required in scaling up state-of-the-art statistical SDS models mean that these are not directly available.

The main contribution of this work is that it is the first empirical comparison of manual and automatic approaches to SDS belief state compression on a problem of realistic scale (restaurant, pub and coffee shop domain) and with real users. In addition, to the best of our knowledge, this is the first work to demonstrate the feasibility of using E-PCA as a feature selection procedure in RL scenarios, which also strengthens the potential of E-PCA belief compression in practical applications (see Section 2.2 for a more extended discussion).

1.1. Background: POMDPs for SDS dialogue management

POMDPs are Markov Decision Processes where the system's state is only partially observable, i.e. there is uncertainty as to what the true state is. The ability to account for uncertainty is crucial for SDSs because their knowledge about the state is uncertain due to speech recognition errors and the fact that the user's goals are not directly observable. In POMDP models of spoken dialogue (Williams and Young, 2005; Thomson and Young, 2010; Young et al., 2010) the dialogue policy (what the system should say next) is based not on a single view of the current state of the conversation, but on a probability distribution over all possible states of the conversation (denoted as the system's belief b). The optimal POMDP SDS dialogue act thus automatically takes account of the uncertainty about the user's utterances and goals.

Formally, a POMDP is defined as a tuple ⟨S, A, O, T, Ω, R⟩, where S is the set of states that the environment can be in, A is the set of actions that the system can take, O is the set of observations which it can receive, T is a set of conditional transition probabilities which describe the likelihood of transitioning between states given a selected action (i.e. P(s′ | s, a), where s, s′ ∈ S and a ∈ A), Ω is a set of conditional observation probabilities which describe the likelihood of each observation occurring (i.e. P(o′ | s′, a), where o′ ∈ O), and R is the reward function (R : S × A → ℝ). For an SDS dialogue manager (DM) we say that the user's utterance, after it has been rendered into the form of a semantic act (or a list of semantic acts¹), is the observation o which the POMDP receives. We assume that the dialogue has a discrete number of states which it can be in, represented by the set of POMDP states S. Finally, the DM action is equated to the POMDP act a.

¹ In the case of a system that considers N-best lists of ASR output.

Now, given a set of transition matrices, observation vectors, and an initial belief b₀, the POMDP DM can monitor and update its belief b over the possible states of the dialogue.
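To make the update concrete, here is a minimal sketch (our own illustration, not part of any HIS implementation) of one belief monitoring step for a small discrete POMDP; the arrays T, Omega and b mirror the tuple elements defined above.

```python
import numpy as np

def belief_update(b, a, o, T, Omega):
    """One step of POMDP belief monitoring.

    b     : (|S|,) current belief over states
    a     : index of the action just taken
    o     : index of the observation just received
    T     : (|A|, |S|, |S|) transition model, T[a, s, s2] = P(s2 | s, a)
    Omega : (|A|, |S|, |O|) observation model, Omega[a, s2, o] = P(o | s2, a)
    """
    predicted = T[a].T @ b               # sum over s of P(s2 | s, a) * b(s)
    unnorm = Omega[a, :, o] * predicted  # weight each s2 by P(o | s2, a)
    return unnorm / unnorm.sum()         # normalise (the constant k)
```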
1.2. The need for state space compression

Even considering limited domains, POMDP state spaces grow very quickly. For example, the domain ontology used by the Hidden Information State (HIS) (Young et al., 2010) dialogue systems evaluated in this paper contains three types of entity (restaurant, pub and coffee shop), and both the entity type and between four and six further attributes (depending on the entity type) are searchable by the user. For example, restaurant has the attributes cuisine, city area, near a landmark and price range, whereas pub has the attributes children allowed, has Internet, has TV, city area, near a landmark and price range. There are 28 cuisines, 15 city areas, 52 landmarks, and 4 price ranges, and the remaining searchable attributes are Boolean. Even without considering negation of attribute values, or that an attribute can reference a conjunction or disjunction of values (Gašić and Young, 2011), the user goal space can represent 87,360 states.²

² Allowing disjunctions and negation increases this user goal space to around 2¹⁰⁰.



The dialogue hypothesis space is larger again since, in addition to the user goal, various attributes which summarise the dialogue state are also tracked, e.g. whether a hypothesised user goal is confirmed, not confirmed, or rejected, and whether it has been mentioned by the user, system or both, etc. Focusing only on the user goal space, a POMDP belief is a probability distribution over these possible states, i.e. an 87,360-dimensional real-valued (ℝ) space.

In order to render such large belief spaces tractable, the current state of the art in POMDP SDSs uses a variety of hand-crafted compression techniques, such as making several types of independence assumption. For example, a dialogue system designer might decide that users are only ever interested in one type of food or one location, and that their interests in food type, price range, quality, etc. are independent.³ The real-valued user goal space distribution can then be reduced to a much smaller "summary space" (Williams and Young, 2005) consisting of around 100 × ℝ values.⁴ However, such assumptions limit the expressiveness of the user goal space and thus what the dialogue manager can infer. This can have a detrimental effect on the quality of the dialogues and hence on the user experience.

³ These are not the assumptions used in the HIS system.
⁴ By considering only the maximum marginal likelihood for each of the user goal attributes.

The tight coupling between some dialogue states and actions (e.g. a user's goal state travel-from-London and system act confirm-from-London) has led some researchers to conclude that compression techniques, such as state aggregation, are not useful in the dialogue domain (Williams and Young, 2007). However, such tight coupling may not exist for all states; indeed, Value Directed Compression (VDC) has already been applied to a small spoken dialogue system problem (Poupart, 2005), where it was shown that compressions could be found, losslessly compressing a test problem of 433 states to 31 basis functions.

2. Automatic belief compression methods

The current state of the art in POMDP SDSs uses a variety of hand-crafted compression techniques, such as making several types of independence assumption, as discussed above. Poupart (2005) and Crook and Lemon (2010) propose replacing hand-crafted compressions with automatic compression techniques. The idea is to use principled statistical methods for automatically reducing the dimensionality of belief spaces which preserve useful distributions from the full space, and thus can more accurately represent real users' goals.

A POMDP is defined as a tuple ⟨S, A, O, T, Ω, R⟩; see Section 1.1. Uncertainty in the current dialogue state is expressed as a distribution b that is maintained over the states s ∈ S. Automatic belief compression consists of finding some mapping b → b̃, where b̃ is expressed using a smaller number of bases (U) than used by b, i.e. |U| < |S|.

One approach to automatic belief compression is to use knowledge of the tuple ⟨S, A, O, T, Ω, R⟩ in order to determine an appropriate mapping. This is what the VDC algorithm (Poupart, 2005) does. Roughly speaking, it uses knowledge of the transition and observation probabilities, T and Ω, to compute projections of the rewards R into the future and discover a minimal set of bases which preserve the value function (the discounted future reward of state-action pairs). The mapping to this minimal set of bases can be used to compress b and theoretically guarantee⁵ that an optimal policy can be represented. For HIS this approach is not easily applied, as the state transition probabilities T and observation probabilities Ω are not available ahead of runtime. This is due to the approximations adopted in the HIS framework to keep it tractable, e.g. the mixing of bi-gram and rule-based models in computing transition likelihoods, or the approximation of the observation likelihood by the series of N-best confidence scores produced at runtime by the ASR-SLU pipeline (Young et al., 2010).

⁵ Unfortunately, numerical stability issues which occur when the VDC algorithm is implemented on a digital computer mean that this guarantee is difficult to realise in practice; see Wang et al. (2013).

A sample-based approach to determining the b → b̃ mapping is thus more appropriate for HIS, and one of the most popular and successful forms of dimensionality reduction given sampled data is principal components analysis (Roy and Gordon, 2002; Roy et al., 2005). Given that the sampled HIS belief space is unlikely to lie on a linear manifold, we use Exponential Family PCA to compute the belief space mapping.



2.1. E-PCA

Conventional principal components analysis assumes a linear transformation between some original data X and the projection of that data V in the reduced-dimensional space. The assumption that X lies on a lower dimensional linear manifold is not always valid, and Exponential Family PCA adds a link function f which performs a non-linear projection of the linear mapping of basis vectors U with V (Roy and Gordon, 2002; Roy et al., 2005), i.e.

    X̂ = f(UV)    (1)

where X̂ is the recovered approximation of the original data X.

In this application our original data X is a collection of column vectors bᵢ, where each bᵢ represents an uncompressed POMDP belief state. If X ≡ B = [b₁ b₂ b₃ ··· b_K], where K is the number of sampled belief states, then

    B̂ = f(U B̃) = exp(U B̃)    (2)

where B̃ = [b̃₁ b̃₂ b̃₃ ··· b̃_K] are the compressed belief states and B̂ is the recovered approximation of B, i.e. B̂ ≈ B. Since beliefs are probability distributions, the exponential link function shown in Eq. (2) appears the most appropriate choice (as per Roy and Gordon, 2002; Roy et al., 2005); it corresponds to a Poisson error model for each component of the reconstructed belief. In addition, if the E-PCA parameters U and B̃ are obtained by maximising the log-likelihood of the data with respect to the exponential link function, this is equivalent to minimising the unnormalised Kullback-Leibler (KL) divergence between the original beliefs B and their reconstructions B̂. Once the matrix U has been computed for a given sample B, new compressed beliefs can be found by solving Eq. (3) for each new vector b:

    b̃ = argmin_{b̃′} UKL(b ‖ f(U b̃′)) = argmin_{b̃′} Σ( e^{U b̃′} − b · U b̃′ + b · ln b − b ) = argmin_{b̃′} Σ( e^{U b̃′} − b · U b̃′ )    (3)

where UKL is the unnormalised KL divergence, the summations are over the elements of the vectors e^{U b̃′} and b, and, for the purposes of minimising this function, any terms not involving b̃′ can be ignored.
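As an illustration of how Eq. (3) can be solved in practice, the following sketch minimises the unnormalised KL objective with Newton's method; this is our own illustrative code rather than the authors' implementation, and the zero initialisation and fixed iteration count are assumptions.

```python
import numpy as np

def compress_belief(b, U, iters=20):
    """Solve Eq. (3): find the compressed belief b_tilde minimising
    UKL(b || exp(U @ b_tilde)) for a new uncompressed belief b.

    b : (|S|,) belief vector;  U : (|S|, M) E-PCA basis matrix.
    """
    b_tilde = np.zeros(U.shape[1])
    for _ in range(iters):
        recon = np.exp(U @ b_tilde)             # reconstruction f(U b~)
        grad = U.T @ (recon - b)                # gradient of the Eq. (3) objective
        hess = U.T @ (recon[:, None] * U)       # Hessian: U^T diag(recon) U
        b_tilde -= np.linalg.solve(hess, grad)  # Newton step
    return b_tilde
```

The linear solve makes each iteration roughly cubic in the number of bases M, which is consistent with the run-time cost discussed in Section 7.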

2.2. Discussion

To the best of our knowledge, there are only three types of belief compression algorithm. We discuss their capability and suitability for application to the HIS framework below.

VDC exploits the low-rankness of a POMDP model, and produces a linear projection in a value-directed manner such that a belief and its compression obtain an (approximately) identical value. As mentioned above, it is designed for "model-based" POMDPs, and requires explicit representations of the transition, observation and reward functions (matrices), which are not available in a "model-free" POMDP system such as HIS.

E-PCA exploits the sparsity of the belief space based on sampled beliefs. Note that the original E-PCA belief compression method of Roy and Gordon (2002) and Roy et al. (2005) was also proposed for model-based POMDP problems. It aims to cluster beliefs with reduced dimensionality such that the POMDP can be approximated by a (more tractable) MDP and solved using corresponding dynamic programming techniques. However, since computing the compression bases for E-PCA does not rely on any POMDP parameters but only on sampled belief points, the compression process itself can be directly applied to model-free POMDP problems, as will be shown in Section 4. In addition, the previous belief clustering strategy for solving an E-PCA compressed POMDP has only demonstrated success on some relatively simple problems with a few hundred state dimensions, where sampled beliefs can be reconstructed quite accurately using a small number of bases (fewer than 10). It may fail on more complex real-world problems due to the computational complexity of compressing a sufficiently large set of belief samples to achieve reliable clusters (see Yu et al. (2005) for an example). In this paper, we show that, instead of clustering beliefs, the compression bases computed by E-PCA can be used as features of MDP states that serve as input to a sample-efficient reinforcement learning (RL) policy optimiser, which also strengthens the potential of E-PCA belief compression in more practical use.

Benefiting from the insights behind both E-PCA and VDC, Non-negative Matrix Factorisation (NMF) methods have also been proposed for belief compression (Li et al., 2007; Theocharous and Mahadevan, 2010).


NMF belief compression seeks a linear projection (matrix), also by exploring the belief space (sampled beliefs), and then applies the projection (and its orthogonal matrix⁶) to the POMDP parameter matrices to construct a linear compression. The advantage of NMF is the linear and non-negative compression function, which is helpful in (approximately) solving a model-based POMDP, according to our previous research (Wang et al., 2013). But in the model-free setting (an RL scenario) of interest here, non-negativity is an entirely unnecessary requirement. If one discards the non-negative constraint on the compression function, the problem becomes essentially equivalent to linear PCA. Roy et al. (2005) have shown that linear PCA performs poorly at representing probability distributions, which suggests that the most (and probably the only) suitable belief compression method for the HIS framework is E-PCA.

⁶ In practice, the orthogonality in NMF belief compression algorithms is unachievable due to the low-rankness of the two factor matrices (Wang et al., 2013).

3. The spoken dialogue systems

In order to evaluate the use of automatic compression of beliefs in a POMDP SDS, we use an end-to-end spoken dialogue system that runs with the Hidden Information State (HIS) POMDP dialogue manager (Young et al., 2010). This is a state-of-the-art statistical dialogue system which maintains a distribution over dialogue states and uses a hand-coded compression of the belief space to make policy optimisation tractable. We created a variant of this system by replacing the hand-coded compression with an automatic E-PCA compression and carried out a comparative evaluation.

3.1. The HIS system

The HIS POMDP system deals with the problem of computational complexity in belief monitoring and action selection in several ways (Young et al., 2010). First, the dialogue state is factored into three components: the last user act a_u, the user's goal s_u and the dialogue history s_d. The dialogue history is based on a simple confirmation model and is encoded deterministically: it yields probability one when the updated dialogue state hypothesis is consistent with the history, and zero otherwise. Second, the space of user goals is partitioned (at run-time) into equivalence classes p of equally likely goals. By replacing a machine state s_m with its factors (s_u, a_u, s_d) and making reasonable independence assumptions, it can be shown that in partitioned form the belief is updated as follows:

    b′(p′, a_u′, s_d′) = k · P(o′ | a_u′) · P(a_u′ | p′, a_m) · Σ_{s_d} P(s_d′ | p′, a_u′, s_d, a_m) · P(p′ | p) · b(p, s_d)    (4)

where p is the parent of p′, i.e. p′ is a partition resulting from splitting p, and the four probability factors are, respectively, the observation model, the user action model, the dialogue model, and the belief refinement. In this equation, the observation model is approximated by the normalised distribution of confidence measures output by the speech understanding system. Unlike the observation model, the user action model takes into account the context in which the user performed their action. It allows the observation probability that is conditioned on a_u′ to be scaled by the probability that the user would perform the action a_u′ given the partition p′ and the last system prompt a_m. For example, the observation model might not be able to distinguish between an acknowledgement and an affirmation for the utterance "Yes", but if the last system act was a confirmation ("You want Chinese food?"), the affirmation reading should become more likely (Keizer et al., 2008).

The process of belief updating in the HIS system results in a belief state which consists of a list of dialogue state hypotheses and their associated probabilities. Each hypothesis consists of a user goal partition, the last user act, and dialogue history information. Selecting appropriate system actions based on the current belief state happens via a policy which is optimised using reinforcement learning (RL), in interaction with a simulated user (Schatzmann et al., 2007).

Despite the assumed conditional independencies and the mechanism for partitioning the user goal space, the belief space is still too large for tractable policy optimisation. Therefore, a hand-coded method for compressing the state-action space is adopted. A belief is mapped onto a so-called summary state, defined by a vector consisting of the probabilities of the top two state hypotheses; two discrete status variables, h-status and p-status, summarising the state of the top state hypothesis and its associated partition; and finally the type of the last user act.


Table 1
Sample-based approach to automatic belief compression in SDS.

1. Collect a sample of beliefs B by running some existing SDS policy.
2. Use E-PCA to generate bases U that form a lower dimensional space which can represent the sampled beliefs with a given degree of precision.
3. Run the SDS and at run-time use the bases U to compress the beliefs; additionally compute features indicating the quality of the current compressed belief b̃.
4. Treat the compressed belief plus quality features as the state of an MDP and learn the policy on-line using sample-efficient RL.

The set of possible system acts is also compressed in summary space, which is done by removing all semantic content items, leaving only a reduced set of dialogue act types. These can be mapped back into master space by inferring information from the most likely state hypotheses. Additionally, the still continuous summary space is discretised by maintaining a set of grid points, which is expanded and updated during the training process. The policy is then represented as a Q-function over these grid points and the summary actions. A distance metric is used to find the closest grid point in each turn and to decide whether to add a new grid point. The policy is then optimised through RL, using a Monte Carlo control algorithm (Gašić et al., 2008).

3.2. The ABC system

Instead of the summary-space method used in the baseline HIS system, we propose to use an automatic belief compression (ABC) technique and perform policy optimisation in the resulting compressed space. The approach adopted, outlined in Table 1, is similar to that described in Roy and Gordon (2002) and Roy et al. (2005); however, we do not require precise knowledge of the original uncompressed POMDP. In their work the POMDP transition and observation probabilities and reward function are used to estimate an MDP model in the compressed belief space, i.e. T(b̃′ | b̃, a) and R(b̃, a), and they solve this MDP using value iteration (a form of dynamic programming) (Sutton and Barto, 1998). Instead, we use an on-line training approach with sample-efficient RL, thus avoiding the need to directly represent the lower dimensional transition probabilities and reward function, as well as the less reliable belief clustering procedure. This is possible because our requirement is not a lower-dimensional representation of the problem, but a representation compact enough to make policy learning feasible. In addition, because we are compressing beliefs at run-time, we can compute quality measures for the compressed belief. To aid action selection (through disambiguation of potentially overlapping compressed belief states) we make these quality measures available as features to the policy. We call this system "ABC-HIS".

3.2.1. Policy learning

Policy learning in ABC-HIS is performed using a temporal differences (TD)-based RL approach (Sutton and Barto, 1998), with an on-line, sample-efficient RL algorithm from a class of algorithms known as the Kalman Temporal Differences (KTD) framework (Geist and Pietquin, 2010). The KTD framework casts the problem of learning the state-action value function (the Q-function) as one of Kalman filtering (Kalman, 1960), i.e. the tracking of a hidden random variable of a non-stationary dynamic system through indirect observations. The aim of the KTD framework is to benefit from the advantages of Kalman filtering: on-line second-order learning, uncertainty estimation, and the handling of non-stationarity.

The standard KTD algorithms are derived under an assumption of deterministic MDP transitions. Although in practice this does not necessarily prevent their successful application to MDPs with stochastic transitions, a bias in the value estimates can be demonstrated when this assumption does not hold. An extension of the KTD approach is the addition of observation noise to the Kalman filter: the extended Kalman Temporal Differences (XKTD) algorithm adds auto-regressive process noise to the Kalman filter, which removes the bias for stochastic MDP transitions (Geist and Pietquin, 2010).
For KTD it is possible to derive an off-policy algorithm, namely KTD-Q. Off-policy algorithms have the advantage that the state-action values they estimate are uninfluenced by the policy actually being followed during learning (Sutton and Barto, 1998); thus KTD-Q can be trained while following a randomly acting policy. For XKTD it is not possible to derive an off-policy algorithm, so only an on-policy algorithm, XKTD-SARSA, exists. Since the value estimates


of on-policy algorithms are affected by the policy being followed, training has to proceed by largely exploiting the current best estimate of the policy with a limited level of exploration. Both KTD-Q and XKTD-SARSA have relative merits, as outlined above, and so both were investigated in this application.
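The KTD and XKTD derivations are beyond the scope of a short example, but the on-/off-policy distinction itself comes down to the bootstrap target used; the sketch below shows it for plain TD targets (not the Kalman variants).

```python
import numpy as np

def sarsa_target(r, q_next, a_next, gamma=0.95):
    """On-policy target (as in XKTD-SARSA): bootstrap on the action a_next
    that the current policy actually selected in the next state."""
    return r + gamma * q_next[a_next]

def q_learning_target(r, q_next, gamma=0.95):
    """Off-policy target (as in KTD-Q): bootstrap on the greedy action,
    regardless of the policy being followed during learning."""
    return r + gamma * np.max(q_next)
```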

4. Experimental set-up

4.1. Belief samples

The sampled beliefs were collected using HIS logs from system evaluations conducted with real users⁷; see Jurčíček et al. (2011) for details of that evaluation. This corpus consists of some 836 dialogues. The HIS logs do not contain the system's beliefs, so for this work they were regenerated by re-running the logged transactions with an off-line HIS system which used the same policy. Unfortunately, due to stochastic policy actions, exact reproduction of all the dialogues was not possible. Around 51% of the dialogues were reproduced exactly, in full. For the remaining 49%, beliefs were sampled up to the turn before the reproduction diverged. This resulted in a total of 4972 sampled beliefs.

⁷ Student volunteers as well as Mechanical Turk crowdsourced workers.

4.2. Belief vector representation

HIS represents the probability distribution over dialogue hypotheses (its belief) through a combination of partitions of the user's goal space, inferred last user acts, and dialogue histories. The user goal space is defined using an ontology which represents the user's intentions and the target entity in the domain; see Young et al. (2010) for details. The user goal partitions p are defined using the same ontology and thus can span arbitrary portions of the space, from all items in the domain to one specific pub, coffee shop, or restaurant.

E-PCA requires the original data, i.e. the belief samples, to be represented in vector form. The most straightforward mapping of the distribution over the space described by the user goal ontology would require a vector with one element for each possible partition that can be represented. The resulting dialogue state belief vector would be significantly larger again, as it has to additionally incorporate information about the hypothesised last user act and dialogue history. Given the 4972 sampled beliefs, such massive belief vectors would be very sparsely populated. Since our aim is to test the efficacy of applying sample-based compression to a real SDS, we selected a more compact (yet almost equally expressive) representation of the user goal distribution. Rather than a distribution over all possible partitions, we use a distribution over the underlying database of entities that the system can present to the user. There are 158 entities in the HIS database. A user goal vector containing 160 elements was used to represent the user goal distribution: one element per database entity, plus two additional elements, one indicating that no entity in the database matches the partition description, and one representing that no database look-up has been attempted (this occurs, for example, when the partition indicates that the user is not interested in searching for an entity in the restaurant, pub and coffee shop domain).

Internally, HIS can represent 9 possible hypothesis statuses (which summarise the dialogue history) and 22 user dialogue act types. When combined with the 160-element vector representing the user goal distribution, this results in a belief vector of 31,680 elements.
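As an illustration of the size calculation (160 × 9 × 22 = 31,680), one possible flat layout of the joint vector is sketched below; the ordering of the three factors is our assumption, as it is not specified here.

```python
N_GOAL, N_STATUS, N_ACT = 160, 9, 22   # 160 * 9 * 22 = 31,680 elements

def belief_vector_index(goal, status, act):
    """Flat index of the joint (goal, hypothesis status, user act type)
    component in the 31,680-element belief vector (assumed ordering)."""
    return (goal * N_STATUS + status) * N_ACT + act
```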

4.3. Compression

E-PCA compressions were computed using the 4972 sampled belief vectors (each vector containing 31,680 elements). The number of bases of the E-PCA mapping was set in turn to 20, 50 and 100, representing three possible low dimensional compressions of the HIS beliefs. The average and maximum errors that result from compressing the sampled beliefs to these levels and reconstructing them are given in Table 2 and Fig. 1.
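In outline, the error figures of Table 2 can be computed as below; this is a hedged sketch in which compress_belief is a solver for Eq. (3) such as the one sketched in Section 2.1, and the unnormalised KL form follows the Poisson error model.

```python
import numpy as np

def reconstruction_errors(B, U, compress_belief):
    """Average/maximum Euclidean and unnormalised KL reconstruction errors
    over sampled beliefs, in the spirit of Table 2.

    B : (K, |S|) sampled beliefs;  U : (|S|, M) E-PCA bases;
    compress_belief : function solving Eq. (3) for a single belief.
    """
    eucl, kl = [], []
    for b in B:
        b_hat = np.exp(U @ compress_belief(b, U))   # reconstruction
        eucl.append(np.linalg.norm(b - b_hat))
        nz = b > 0                                  # 0 * log 0 terms vanish
        kl.append(np.sum(b[nz] * np.log(b[nz] / b_hat[nz])) + np.sum(b_hat - b))
    return (np.mean(eucl), np.max(eucl)), (np.mean(kl), np.max(kl))
```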



Table 2
Reconstruction errors for the 4972 sampled belief vectors when compressed to and recovered from the number of bases shown.

Number of bases | Euclidean error (average) | Euclidean error (maximum) | KL divergence (average) | KL divergence (maximum)
20 | 0.325 | 0.986 | 1.777 | 7.660
50 | 0.302 | 0.982 | 1.623 | 7.662
100 | 0.296 | 0.982 | 1.584 | 7.658

Fig. 1. Plot visualising the change in the average KL divergence (from Table 2) against the number of bases.

4.4. Training

The 20, 50, and 100 bases projections, i.e. U = U₂₀, U₅₀ or U₁₀₀, resulting from Eq. (2) were incorporated into the ABC-HIS system (using Eq. (3)) and were applied at run-time to compress new beliefs as they were encountered by the running system. Two quality measures for the compression were computed: the Kullback-Leibler (KL) divergence and the Euclidean error between the original belief b and the recovered belief b̂. The compressed belief plus the two quality measures were used as the MDP state for DM action selection.

The 20, 50, or 100 elements of the compressed belief and the 2 quality measures are continuously valued. Function approximation was therefore used to estimate the Q-value of each possible action. Taking M as the number of bases and Q as the number of quality measures, the value function for each action is approximated by a linear sum of the products of M + Q + 1 weights and features; the +1 arises because a constant feature was added. The HIS DM can select between 12 different summary actions. This gives rise to 12 × (M + Q + 1) weights that need to be adapted, i.e. 276, 636, or 1236 weights respectively. Given this number of weights, policy learning was carried out using the sample-efficient RL algorithms KTD-Q and XKTD-SARSA, as described in Section 3.2.1.

The DM policy was learnt on-line against the same user simulator used to train the policy of the baseline HIS system (see Section 5.1 for more details). The error rate, which is used in creating simulated N-best lists of (possibly corrupted) user act hypotheses with confidence scores from the true user acts, was varied dynamically for each training dialogue in the range 0.1–0.4. The policies were regularly sampled and evaluated as learning progressed, and training was terminated when policy improvement appeared to have plateaued.
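A minimal sketch of this linear value function, assuming M = 100 bases, the 2 quality measures and the 12 summary actions; the KTD/XKTD weight updates themselves are omitted.

```python
import numpy as np

M, QM, N_ACTIONS = 100, 2, 12          # bases, quality measures, summary acts

def q_values(b_tilde, quality, W):
    """Q-value of every summary action under the linear approximation.

    b_tilde : (M,) compressed belief
    quality : (QM,) KL divergence and Euclidean reconstruction errors
    W       : (N_ACTIONS, M + QM + 1) weights, one row per action,
              including the weight for the constant feature
    """
    phi = np.concatenate([b_tilde, quality, [1.0]])   # the "+1" constant
    return W @ phi

# e.g. greedy action selection once the weights are trained:
# a = int(np.argmax(q_values(b_tilde, quality, W)))
```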

5. Evaluation

The two DMs, the original HIS and ABC-HIS as described previously, were evaluated both in simulation and with human users.


Table 3
Percentage of successful dialogues and average dialogue score when evaluated against the user simulator with error rate of 0.1. Figures shown include ±95% confidence intervals. All results are for the XKTD-SARSA algorithm and training with variable simulation error rate between 0.1 and 0.4.

System | % Success | Av. dialogue score
HIS | 95.06 ± 0.42 | 87.03 ± 0.43
ABC-HIS 20 bases | 91.70 ± 0.54 | 82.56 ± 0.56
ABC-HIS 50 bases | 93.76 ± 0.47 | 84.27 ± 0.49
ABC-HIS 100 bases | 94.53 ± 0.45 | 86.02 ± 0.46

Table 4
Percentage of successful dialogues and average dialogue score when evaluated against the user simulator with error rate of 0.4. Figures shown include ±95% confidence intervals. All results are for the XKTD-SARSA algorithm and training with variable simulation error rate between 0.1 and 0.4.

System | % Success | Av. dialogue score
HIS | 79.31 ± 0.79 | 67.83 ± 0.80
ABC-HIS 20 bases | 69.56 ± 0.90 | 57.26 ± 0.92
ABC-HIS 50 bases | 74.01 ± 0.86 | 61.24 ± 0.87
ABC-HIS 100 bases | 78.76 ± 0.80 | 66.77 ± 0.81

5.1. Evaluation with user simulation

For both training and evaluating DM policies of the HIS system and its ABC variant, an agenda-based user simulator (Schatzmann et al., 2007) was used, which generates user behaviour that is both rational (consistent with a given random goal) and displays the variation observed with real users. Previous evaluations of the HIS system have shown that this simulator generates sufficiently realistic data to learn effective policies and is also capable of predicting the performance of the dialogue system when evaluated in interaction with real users. With the simulator, thousands of dialogues can be generated and scored automatically. Note that the simulator was only used for policy optimisation, not for learning the belief compression.

The user simulator and DM interact at the semantic understanding level; that is, neither the DM nor the simulated user generates natural language or audio. To simulate the uncertainty of spoken utterances passing through ASR and SLU, a noise model is applied to the output of the simulated user before it is provided to the DM. This noise model first generates a semantic N-best list from the dialogue act generated by the simulated user and then proceeds to corrupt each entry in the N-best list with a fixed likelihood. This likelihood is referred to as the error rate. In addition, confidence scores are assigned to each entry on the N-best list, using a Dirichlet distribution (Thomson et al., 2012). For both training and testing in simulation this noise model was used with fixed-length N-best lists where N = 2.

To establish a performance target, the original HIS system was run against the user simulator using the same policy from which the belief samples had been generated. It was evaluated at two error rates, 0.1 and 0.4, the latter being close to what has been observed in a previous data collection with real users. Given stochasticity in both the HIS DM policy actions and the user simulator's responses, 10,000 dialogues were run in order to establish tight error bounds. The results reported in Tables 3 and 4 show the percentage of successful dialogues (% Success) and the average dialogue score. Success is a binary score returned by the simulated user based on a comparison of the information it obtained from the system with the goal that the simulator was initialised with for that dialogue. The dialogue score is computed as +100 for success, minus 1 per turn, plus additional penalties, for example for failing to respond to the immediate request "Could you repeat that, please?".

Although KTD-Q has the advantage of being an off-policy approach, the on-policy XKTD-SARSA algorithm produced the better results. This may be due to bias induced in KTD-Q's Q-value estimates by this task's stochastic transitions, which could be significant enough to have a noticeable effect on policy learning. However, we have not eliminated the alternative possibility that on-line policy learning may be advantageous in this situation,⁸ as KTD-Q was trained following a random policy.

⁸ Any compression may result in the confounding of what should be independent states. If this occurs, the process of on-line policy learning can find successful policies which navigate around or through the confounded states (Crook, 2007).
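The dialogue score just described can be written down directly; the magnitudes of the additional penalties are not stated, so the penalties argument below is a placeholder.

```python
def dialogue_score(success, n_turns, penalties=0):
    """Dialogue score: +100 for success, -1 per turn, plus any additional
    penalties (e.g. for ignoring "Could you repeat that, please?").
    The penalty magnitudes are an assumption; they are not stated here."""
    return (100 if success else 0) - n_turns - penalties
```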


Table 5
Percentage of successful dialogues when evaluated against the user simulator with error rate of 0.1. Figures shown include ±95% confidence intervals. All results are for ABC-HIS with 50 bases.

Algorithm | Training error rate | % Success
KTD-Q | Variable 0.1–0.4 | 90.48 ± 0.58
XKTD-SARSA | Variable 0.1–0.4 | 93.76 ± 0.47
XKTD-SARSA | Fixed 0.4 | 92.22 ± 0.52
LSTD | Fixed 0.4 | 92.95 ± 0.50

Table 6
Percentage of successful dialogues when evaluated against the user simulator with error rate of 0.4. Figures shown include ±95% confidence intervals. All results are for ABC-HIS with 50 bases.

Algorithm | Training error rate | % Success
KTD-Q | Variable 0.1–0.4 | 71.73 ± 0.88
XKTD-SARSA | Variable 0.1–0.4 | 74.01 ± 0.86
XKTD-SARSA | Fixed 0.4 | 74.59 ± 0.85
LSTD | Fixed 0.4 | 73.43 ± 0.87

Tables 3 and 4 only report results for ABC-HIS policies learnt with XKTD-SARSA. It is not the aim of this paper to make any claim as to which RL set-up is the most effective in these circumstances, so we did not undertake a systematic comparison of all alternative combinations of features, algorithms and training parameters. Nevertheless, several variations were tested for 50 bases; these are reported in Tables 5 and 6. As an additional comparison, the off-line policy learning algorithm least squares temporal difference (LSTD) (Boyan, 1999) was also investigated. This used transitions which were sampled with a moderate level of exploration whilst exploiting a known reasonable XKTD-SARSA policy.

The policy learnt with KTD-Q shows a markedly lower success rate than XKTD-SARSA. Although not shown here, the average dialogue scores follow a similar trend. As detailed above, this could be due to bias in the KTD-Q algorithm's value estimates or due to differences between on- and off-policy training. The off-line, off-policy LSTD algorithm achieved similar results to XKTD-SARSA. Since LSTD shows no advantage, there is little to recommend it in this situation, as it requires separate sampling and learning stages; these stages are combined when using on-line algorithms such as XKTD-SARSA.

Overall, the simulation results indicate that, among the ABC-HIS systems, using 100 bases gives the best performance. This system is still outperformed by the original HIS system, but the overall margin is small, and it has the advantage of using an automatic approach to arrive at a belief space compression instead of a hand-coded mapping. The real user evaluation therefore compares the original HIS system with the 100 bases ABC-HIS system.

5.2. Evaluation with human users

For the human user evaluation, subjects were recruited using crowdsourcing technology. Via CrowdFlower, an intermediary crowdsourcing service, jobs were published on Amazon Mechanical Turk (AMT), asking workers to call the system and find venues in Cambridge (UK) according to particular scenarios. After the call, they were asked to fill out a short questionnaire assessing the quality of the conversation. A single toll-free phone number was set up for workers to call and talk to the evaluated systems. This phone number could handle multiple incoming calls, which were randomly allocated to systems, thus ensuring that users were unable to identify the systems by the number dialled. Each job's web page provided this phone number and a scenario, in the form of an automatically generated textual description of a task, specified in terms of a set of constraints (e.g. food=Chinese) and a set of requested slots (e.g. the restaurant's phone number or address). Example tasks are:

• You are looking for a pub in Castle Hill and it should have TV. Make sure you get the address, phone number, and whether it has internet.
• You are looking for a Mediterranean restaurant and it should be in the city centre area. You want to know the phone number and postcode of the venue.


Table 7
Subjective scores resulting from the questionnaire answers, together with 95% confidence intervals. The results for Q1 are percentages, whereas those of Q2–Q3 are average scores.

System | Num | Q1 | Q2 | Q3
ABC-HIS | 508 | 71.46 ± 3.93 | 4.122 ± 0.157 | 4.843 ± 0.135
HIS | 518 | 78.19 ± 3.56 | 4.334 ± 0.157 | 4.981 ± 0.132

In order to make sure that workers actually talked to the system, they were given a 4-digit token at the end of the conversation, which they had to report back on the job web page. After successful validation of this token, the questionnaire would be enabled. The questionnaire consisted of three questions:

Q1: Did you get all the information you were looking for? [Y/N]

Please state your attitude towards the following statements:

Q2: The system understood me well. [1–6 Likert scale]
Q3: The system's responses were quick. [1–6 Likert scale]

After completing the questionnaire, a password was given to the user, which was required to finalise the CrowdFlower job and be paid. Workers were paid $0.20 per job (corresponding to one task/scenario/dialogue/call), in line with previous work on crowdsourced SDS evaluation (Jurčíček et al., 2011).

6. Results and analysis

We associated the collected dialogues with user identities (by matching time-stamps on the submitted questionnaires with those on the submitted CrowdFlower tasks). The randomised allocation of incoming calls to systems meant that exposure to the two variants was skewed for a portion of the users, especially if they had only made a small number of calls. To counter this, we used the user ids to filter out users that called one of the systems a lot more than the other, and users that called the systems fewer than some minimum number of times. The results after filtering (at least 3 dialogues per system; 0.4 maximum ratio between the two numbers of calls a user made to each of the systems; resulting in data from 1026 calls and 35 users) are given in Table 7. The scores indicate that the original HIS system is preferred over ABC-HIS on all scores, in particular in terms of perceived success rate.

We also measured objective performance, based on the log files and the tasks assigned to the users. The following metrics were used:

• PCs/FCs: partial resp. full completion success rate
• PCn/FCn: number of turns until partial resp. full completion
• PCp/FCp: partial resp. full completion dialogue score

where partial completion of a dialogue means that the system offered a restaurant matching the constraints described in the scenario given to the user before the call, and full completion means that the system also provided all required information about this venue (e.g. phone number and address). The results are given in Table 8. On these measures the original HIS system still outperforms the ABC-HIS system, but the difference in performance is much smaller.

Whereas the evaluation results in simulation suggested that the original HIS system was only marginally better than the ABC-HIS system, the evaluation with real users revealed a significant preference for the original HIS system, also confirmed by the objective performance measures. Besides the observation that the behaviour of the simulated users is not perfectly realistic, and therefore evaluation results in simulation are not fully predictive, one also has to conclude that the automatic belief state compression in combination with XKTD-SARSA policy optimisation resulted in a system that was somehow under-performing in comparison to the original HIS system.
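For concreteness, the per-user filter described above can be sketched as follows; our reading of the "0.4 maximum ratio" criterion (as the min/max ratio of the two call counts) is an assumption and may differ from the exact formulation used.

```python
def keep_user(calls_a, calls_b, min_calls=3, max_ratio=0.4):
    """Per-user filter sketch: at least min_calls to each system, and a
    reasonably balanced split between the two systems.  The balance test
    requires min/max of the two call counts to be at least max_ratio;
    this interpretation of the criterion is an assumption."""
    lo, hi = sorted((calls_a, calls_b))
    return lo >= min_calls and lo / hi >= max_ratio
```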


Table 8
Objective scores based on log files and assigned tasks.

System | Num | PCs (%) | PCn | PCp | FCs (%) | FCn | FCp
ABC-HIS | 508 | 63.78 | 5.06 | 58.72 | 36.02 | 8.05 | 27.97
HIS | 517 | 68.09 | 4.11 | 63.97 | 41.59 | 7.15 | 34.44

Table 9
Error analysis.

Metric | ABC-HIS (ALL) | HIS (ALL) | ABC-HIS (UNSUCC) | HIS (UNSUCC)
NumSysRepeats | 374 (7%) | 296 (6%) | 179 (10%) | 139 (9%)
AvgNumVenueOffers | 4.15 | 4.6 | 4.5 | 6.5
NumDialsNoVenueOffered | 5 (0.98%) | 1 (0.19%) | 5 (3.45%) | 1 (0.88%)

In order to better understand these results, we analysed the data further and looked in particular at repetitive behaviour of the system and the number of venue offers made by the system. Table 9 summarises this analysis. In the row labelled "NumSysRepeats", the number of times the system repeated the last system action is indicated, both as an absolute number and as a percentage of the total number of system actions. The ABC-HIS dialogues have slightly more system repeats than those of the original HIS system. When only taking into account the dialogues considered unsuccessful by the users (column UNSUCC), this trend is the same, while the percentages of system repeats are higher for both systems. It might be the case that, due to the particular belief compression, the ABC-HIS system can be caught in a state it cannot get out of, causing the policy to repeatedly return the same action. The resulting repetitive system behaviour might cause users to give up sooner and give more negative scores.

The row in Table 9 labelled "AvgNumVenueOffers" gives the average number of system acts per dialogue in which a venue is offered. For the corpus of dialogues with the original HIS system, the average is slightly higher, especially when only considering the dialogues deemed unsuccessful by the users. Finally, the row labelled "NumDialsNoVenueOffered" shows the number of dialogues in which no venue was offered. Although both systems tend to start offering venues very quickly (due to the simulator's tolerance of this kind of behaviour), the ABC-HIS system had more (failed) dialogues without any venue offers. These statistics could indicate that the ABC-HIS system has learned (in simulation) to be slightly more cautious than the original HIS system, but suffered from this in interaction with real users. The cases in which no venue was offered could also have been the result of repeated system actions (most likely a prompt such as "Can I help you with anything else?") in states the system could not get out of.

7. Discussion and future work

There are a number of factors that could explain, either separately or in combination, the ABC-HIS system's behaviour and hence its under-performance. E-PCA minimises the reconstruction error over the set of original sampled beliefs, the assumption being that accurate reconstruction will allow for near optimal policy learning. However, it could be the case that the errors that remain in reconstruction are significant in terms of our primary objective, which is not reconstruction but action selection. As pointed out by Roy et al. (2005), "we would ideally like to find the most compact representation that minimises control errors." Unfortunately, with no way to easily compute the underlying model's transition and observation functions, it is difficult to directly compute a loss function that is weighted by the value of each state.

From Table 2 and Fig. 1 we can see that, although the selected number of bases (100) gives the lowest level of reconstruction error among the compressions that were computed, it is by no means perfect. The values for KL divergence are unnormalised, so it is not possible to judge directly from them how good the reconstruction is. However, the maximum Euclidean error does suggest that for some sampled beliefs, the reconstructions are missing possibly significant parts of the distribution. Surveying the reconstructions for these cases typically reveals reduced or missing peaks in the distribution. It was to alleviate the potential problems this causes (for example, two sets of beliefs being


mapped onto the same compressed representation) that error features were added to the MDP state used by RL.

Increasing the number of bases should reduce the reconstruction error further and thus improve the performance of ABC-HIS. However, the computation time for each iteration of the optimisation problem in Eq. (3), i.e. computing the compressed belief b̃ for every new belief b, is roughly cubic in the number of bases due to a matrix inversion operation (Roy et al., 2005). As such, using more than 100 bases would cause the overall SDS's response to the user to become perceptibly slow, even though multi-core parallel computing was employed (e.g. greater than 1 second even for the limited number of iterations used). In addition, from the shape of the plot in Fig. 1 it appears that a small increase in the number of bases, say by 20–50 bases, would not lead to a large improvement in the average KL divergence for this problem.

E-PCA is particularly well suited to compressing problems where the beliefs are relatively sparse and have a small number of degrees of freedom (Roy et al., 2005). Although it seems plausible that the states in a statistical SDS are likely to be sparse (with individual users focusing on a limited number of goals, and thus distributions due to noise and ambiguity being tied to those sparse goals), it is possible that the number of degrees of freedom could be significantly larger than the number of bases selected (e.g. due to mixtures of distributions over the 158 venues in the database). Similarly, the compressed belief space may do a poor job of reconstructing new beliefs if the sample of beliefs that was collected and used to compute U does not provide sufficient coverage of the belief space actually encountered during testing. Future work should try to tease apart these effects, including an examination of the shape and degrees of freedom exhibited by the run-time beliefs, and whether alternative formulations of compression would be better suited to compressing these distributions.

An additional factor could be the use of a linear function approximation for the state-action Q-values. Although E-PCA performs a non-linear projection of the beliefs onto a linear manifold, this does not guarantee that there exists a linear mapping between the features of the compressed beliefs and the Q-values. Additionally, in cases where errors in the belief compression cause beliefs that should be far apart to be mapped close together in the compressed space, the linear approximation could exacerbate the problem by assigning similar Q-values to the two beliefs. Although the KTD framework allows for non-linear (even non-differentiable) function approximation, the difficulty lies in selecting an appropriate function. A limited number of experiments were run with various non-linear functions: squared, cubed, and even exponent features. However, the additional complexity and weights often significantly slowed learning and showed no gain in policy performance. Other alternatives to linear function approximation that could alleviate these issues include using a k-nearest neighbour or variable resolution grid representation of the value function. Trying these approaches is left to future work.

Finally, the domain or set of tasks presented here to the users may not be complex enough to expose any deficit in the manual belief compression, e.g.
to force situations where the optimal action could depend on more than the top two hypotheses, or on specific features of the top hypotheses that are missing from the hand-crafted summary space.

7.1. System development effort

The principal aim of this paper is to examine whether automated belief compression is of value in automating the design of POMDP SDSs. To this end, the study considered whether the resulting system's performance is comparable to a state-of-the-art POMDP SDS whose performance has been worked on and refined by a team of researchers over a period of 6 years, since Young et al. (2006) and Young (2006), and over a number of iterations of user tests and refinements (Young et al., 2007, 2010; Gašić et al., 2008; Keizer et al., 2008). Given the strength of the baseline, the ABC results show promise, given the comparatively short time span (a few months) and the absence of iterations of data collection and refinement; the E-PCA compression is based only on an initial data collection.

These experiments, however, do not quantify the value of achieving automation in this setting. Development of the HIS baseline took place over many years and involves many components: ASR, SLU, DM, TTS, user simulator, etc. It is therefore difficult to account in retrospect for the proportion of time and the level of expertise required to design and experiment with features to include in the summary space. In addition, although the HIS summary space is designed to abstract away from the details of the domain, its design is closely matched to the underlying partition representation used by HIS. It is also possible that the design settled upon is best suited to a subset of domains, e.g. goal seeking, where statistics relating to the probability of the top two partitions in the HIS belief are the most significant signals.


Table 10
Comparison of design and development of a new POMDP SDS for a new domain using the current state-of-the-art approach and the proposed ABC approach.

Current state-of-the-art approach:
1. Define the domain and its ontology and the representation of dialogue states in this domain
2. Design the internal POMDP DM belief space representation
3. Sample dialogues in this domain, e.g. Wizard of Oz data collection, using a rule-based SDS, or human-human conversations
4. Annotate sampled dialogues and draw representative statistics
5. Create a user simulation based on these statistics
6. Populate POMDP DM transition and observation likelihood models based on these statistics
7. Design the compressed space that summarises the POMDP DM belief space using knowledge of the internal POMDP DM belief space
8. Select an RL algorithm for the resulting summary-space states
9. Train the POMDP DM policy using the user simulation and RL
10. Repeat steps 7–9, experimenting with summary features to include
11. Test the POMDP SDS with real users; collect feedback and new dialogues
12. Repeat from step 4 to refine the system

Proposed ABC approach:
1. Define the domain and its ontology and the representation of dialogue states in this domain
2. Design the internal POMDP DM belief space representation
3. Sample dialogues in this domain, e.g. Wizard of Oz data collection, using a rule-based SDS, or human-human conversations
4. Annotate sampled dialogues and draw representative statistics
5. Create a user simulation based on these statistics
6. Populate POMDP DM transition and observation likelihood models based on these statistics
7. Sample from the annotated dialogues and use E-PCA to create a summary space; knowledge of the underlying POMDP DM space is not required, and this works with any numerical vector representation
8. Apply a continuous-state RL algorithm such as XKTD-SARSA
9. Train the POMDP DM policy using the user simulation and RL
10. Repeat steps 7 and 9, experimenting with the compression level used
11. Test the POMDP SDS with real users; collect feedback and new dialogues
12. Repeat from step 4 to refine the system

In contrast, the proposed automated approach of E-PCA is generally applicable to any POMDP DM representation (it has not been specifically designed for HIS) and has a single parameter which can be tuned without any deep understanding of the underlying domain, the POMDP representation or the likely belief distribution. Table 10 provides a comparison of the steps that would be required to construct a new POMDP SDS in a new domain using the current state-of-the-art approach and the proposed ABC approach. Examining the processes side by side, the savings that can be made are principally tied to the level of automation and the level of expertise required, and thus to the speed with which refinement can occur. An expert is still required to work on the initial system design phase, but in the ABC case they can take a step back after step 2, and an automated process can run the series of experiments to test alternative compression levels. A simple selection can then be made on the basis of the trade-off between performance and system latency. In the conventional case the search is not as easy to automate, due to the size and complexity of searching over the potential feature space. Expert input is thus required for feature selection, slowing this step and tying up valuable expertise. Both reducing the level of expertise required and automation are very attractive to industry, where expertise can often be expensive and continuous refinement is the norm.

8. Conclusion

E-PCA and other automated compression techniques reduce the human design load by automating part of the current POMDP SDS design process. This reduces the knowledge required when building such statistical systems and makes it more feasible for industry to automate their development. While the automatic belief compression approach outlined in this paper under-performs when compared to the hand-crafted state compression, it should be recognised that HIS sets a high-quality target, since its manually designed compression presumably represents many person-hours of design and refinement by a research team. In contrast, the ABC-HIS approach is a first step towards "push button" building of POMDP SDSs, where automation should allow for continual improvement as run-time data is accumulated. From that viewpoint the results are quite promising, with a dialogue success rate of 95% in simulation (with a low level of simulated noise) and a 71% perceived success rate with real users, cf. 78% for HIS. Such compression approaches are not only applicable to SDSs, but should be equally relevant for multi-modal interaction systems where several modalities are combined in user-goal or state estimation.


Future work includes examining the effect of continual improvement as user interaction data accumulates, as well as exploring alternative compression algorithms that make different underlying assumptions.

Acknowledgements

The authors would like to thank the Dialogue Systems Group at the University of Cambridge for the use of the HIS POMDP spoken dialogue system and the Cambridge Tourist Information domain and database. The research leading to these results was funded by the Engineering and Physical Sciences Research Council, UK (EPSRC) under project no. EP/G069840/1, and by the European Commission’s Framework 7 Programme under grant agreement no. 270019 (SPACEBOOK project).