Dual MPC with Reinforcement Learning


11th IFAC Symposium on Dynamics and Control of Process Systems, including Biosystems
June 6-8, 2016. NTNU, Trondheim, Norway

Available online at www.sciencedirect.com

ScienceDirect

IFAC-PapersOnLine 49-7 (2016) 266–271

Juan E. Morinelly ∗, B. Erik Ydstie ∗

∗ Chemical Engineering Department, Carnegie Mellon University, Pittsburgh, PA 15213 USA (e-mail: {jmorinel, ydstie}@andrew.cmu.edu)

Abstract: An adaptive optimal control algorithm for systems with uncertain dynamics is formulated under a Reinforcement Learning framework. An embedded exploratory component is included explicitly in the objective function of an output feedback receding horizon Model Predictive Control problem. The optimization is formulated as a Quadratically Constrained Quadratic Program and is solved to ε-global optimality. The iterative interaction between the action specified by the optimal solution and the approximation of cost functions balances the exploitation of current knowledge and the need for exploration. The proposed method is shown to converge to the optimal policy for a controllable discrete-time linear plant with unknown output parameters.

© 2016, IFAC (International Federation of Automatic Control) Hosting by Elsevier Ltd. All rights reserved. Peer review under responsibility of International Federation of Automatic Control. 10.1016/j.ifacol.2016.07.276

Keywords: Adaptive control, dual control, optimal control, model predictive control, reinforcement learning, approximate dynamic programming.

1. INTRODUCTION

Approximate Dynamic Programming (ADP) and Reinforcement Learning (RL) methods have been an increasingly active research area in many engineering disciplines (Lewis and Liu (2013)). These methods provide a general structure for feedback control of complex, uncertain systems. Sutton and Barto (1998) define the theoretical development of RL under three characteristics of the problem to be solved: the closed-loop nature of the interaction between an agent and its environment, an agent that must discover which actions yield the most rewards, and a performance evaluation that considers extended periods. These aspects are closely related to the features of adaptive dual controllers as described by Wittenmark (1995).

Bertsekas and Tsitsiklis (1995) coined the term Neuro-Dynamic Programming to refer to ADP and RL methods, because these methods are usually applied through the use of neural networks as approximating structures to derive control policies for nonlinear systems. Similarly, other names have been proposed, such as Actor-Critic RL Feedback Control by Lewis et al. (2012) or Action Dependent Heuristic Dynamic Programming by Ferrari and Stengel (2004), to specify a feature of their proposed methods. Lee and Lee (2009) propose an ADP method with exploration provided by stochastic simulations with defined suboptimal policies. The underlying theoretical basis is common to all methods and rests on the original work by Bellman (1952), approximated by a sequential learning process that combines search and long-term memory (Barto and Dietterich (2004)).

This study formulates a generalization of the dual Model Predictive Control (MPC) algorithm by Heirung et al. (2015) by introducing the effect of anticipated information under the ADP/RL framework. The algorithm is developed for the simple case of regulation of a discrete linear system with uncertain dynamics. In the RL context, the agent can be viewed as the MPC algorithm itself, which computes actions adaptively by solving an optimal control problem defined by the environment's response in the form of output measurements.

2. MATHEMATICAL BACKGROUND

The content of this section follows the development by Bertsekas (2012), where a detailed description can be found. Consider the controllable and observable discrete-time SISO system

ϕ_{t+1} = f(ϕ_t, u_t) = Aϕ_t + bu_t
y_t = g(ϕ_t) = θ'ϕ_t + v_t        (1)

where the state ϕ_t ∈ R^n, the admissible control u_t ∈ R only takes values dependent on ϕ_t (u_t ∈ U(ϕ_t) ⊂ R, ∀ϕ_t ∈ R^n), and the output y_t ∈ R is given by θ'ϕ_t plus a noise sequence v_t ∼ N(0, σ²). For the case where A, b, and θ are deterministic, consider an agent tasked with bringing the output in (1) to zero with cautious inputs. A stage cost function for a given time, t, is defined in terms of the current state, ϕ_t, and control action, u_t:

c(ϕ_t, u_t) = ϕ_t'θθ'ϕ_t + ru_t² ≥ 0        (2)

where r > 0 dictates the caution level. This stage cost is a nonnegative function. Equations (1) and (2) define a Markov Decision Process (MDP) with continuous state and action spaces in which an agent perceives the environment's response to an action in the form of a succeeding state and a stage cost. The response depends on the current state-control pair only, which is known as the Markov property. MDPs are generally defined by the agent-environment interaction under the Markov property. The state transition dynamics and the costs are not necessarily known or deterministic (Mitchell (1997)).



A policy is a function π = {µ_0, µ_1, ...} with µ_k : R^n → U that belongs to the set of all admissible policies, Π. If π = {µ, µ, ...} then the policy is said to be time-invariant or stationary. The infinite horizon total cost or cost-to-go function, J_π(ϕ), is defined for any state ϕ ∈ R^n, a stage cost function c, and a fixed policy π as the limit of the expected value for the discounted accumulation of all future stage costs subject to the system dynamics:

J_π(ϕ_t) := lim_{N→∞} E[ Σ_{k=0}^{N−1} α^k c(ϕ_{t+k}, µ_k(ϕ_{t+k})) ]        (3)

where α is a discount factor (0 < α ≤ 1). J_µ(ϕ_t) indicates the cost-to-go starting from state ϕ_t for a stationary policy µ. In the context of RL, equation (3) provides the agent with a decision criterion to learn an optimal policy, π*(ϕ) ∈ Π, that minimizes the cost-to-go function:

π*(ϕ_t) := arg J*(ϕ_t) = arg min_{π∈Π} J_π(ϕ_t)        (4)

Controllability of {A, b} guarantees the existence of an optimal stationary policy that stabilizes the system and minimizes (3). The problem we consider is the design of an algorithm that adaptively learns and implements policies forward in time for a system with unknown output parameter vector, θ. A desired property of the algorithm is that, as experience from the feedback interaction accumulates, the learning process converges to the optimal policy.

2.1 Bellman's Equation

The cost-to-go function for a stationary policy µ at the current state ϕ_t can be expressed recursively by extracting the first N contributions, the N-step cost. For a deterministic stage cost, such as (2), equation (3) can be rewritten in terms of the one-step cost as

J_µ(ϕ_t) = c(ϕ_t, µ(ϕ_t)) + αJ_µ(ϕ_{t+1})        (5)

and, in general, the expression for the N-step cost is given by

J_µ(ϕ_t) = J_{µ,N}(ϕ_t) + α^N J_µ(ϕ_{t+N})        (6)

where J_{µ,N}(ϕ_t) := Σ_{k=0}^{N−1} α^k c(ϕ_{t+k}, µ(ϕ_{t+k})).

These backward recursions define a consistency path followed by the state transitions under the specified stationary policy. The solution to (5) or (6) yields the value of the infinite sum for a given stationary policy, µ, in (3). Furthermore, for c(ϕ, u) ≥ 0, the optimal cost-to-go satisfies

J*(ϕ_t) = min_{u∈U(ϕ)} [c(ϕ_t, u) + αJ*(ϕ_{t+1})]        (7)

This is Bellman's equation and holds for any ϕ_t ∈ R^n. An optimal policy has the property that, regardless of previous actions that led to ϕ_t, the controls specified by it are optimal from then on. For the discounted case with bounded cost per stage, a function that satisfies this backward recursion is the unique solution to (7). An agent with perfect knowledge of J*, f, g, and c can compute optimal control actions, forward in time, given by the argument that solves the optimization problem posed by (7). A necessary and sufficient condition for a stationary policy µ to be optimal is

J*(ϕ) = c(ϕ, µ(ϕ)) + αJ*(f(ϕ, µ(ϕ))), ∀ϕ ∈ R^n        (8)

which states that the optimal policy is stationary when it satisfies the backward recursion given by (5).


This condition holds for the discounted case with bounded cost and regardless of discounting when the cost function is nonnegative.

2.2 Policy Iteration with Perfect Knowledge

For system (1) with quadratic cost (2), define a set of stationary policies given by a linear state feedback gain, L, such that the matrix A + bL has all of its eigenvalues in the unit circle. This stabilizing policy µ(ϕ) = Lϕ yields a quadratic cost-to-go function that can be expressed as

J_µ(ϕ_t) = ϕ_t'Kϕ_t        (9)

where K is a positive semidefinite, symmetric cost matrix. For any µ_0(ϕ) = L_0ϕ belonging to the stabilizing set given above, equation (7) yields

J_{µ0}(ϕ_t) = c(ϕ_t, L_0ϕ_t) + αJ_{µ0}((A + bL_0)ϕ_t)        (10)

with J_{µ0} defined by K_0 as in (9). Given that equation (10) must hold for all ϕ ∈ R^n,

0 = α(A + bL_0)'K_0(A + bL_0) − K_0 + θθ' + rL_0'L_0        (11)

Obtaining the solution K_0 to (11) denotes a policy evaluation step. The cost-to-go under µ_0 for any ϕ ∈ R^n can be evaluated with (9) and K_0. An improved policy µ_1, such that J_{µ1} ≤ J_{µ0} across the state space, can be determined by obtaining the minimizer of the one-step cost plus the cost-to-go for the current policy at the succeeding state over R^n:

µ_1(ϕ) = arg min_{u∈U(ϕ)} [c(ϕ, u) + αJ_{µ0}(Aϕ + bu)]        (12)

The solution to (12) denotes a policy improvement step, which can be followed by a new evaluation step to determine K_1. The algorithm that sequentially iterates between evaluation and improvement steps is known as Policy Iteration (PI). A sequence of cost matrices {K_l}, where the index l denotes a PI step, is generated as a result. For a linear system with quadratic stage cost, PI converges to the optimal linear state feedback policy for a stabilizing L_0 and corresponds to Hewer's algorithm for solving the discrete-time algebraic Riccati equation (DARE) by repeated solutions of Lyapunov equations (Lewis et al. (2012)). Convergence is achieved when the cost matrix obtained during policy evaluation is unchanged (K_l = K_{l−1}). The corresponding policy satisfies (7); therefore it is optimal. Its cost matrix, K*, defines the stationary optimal feedback linear control policy:

µ*(ϕ) = −α(αb'K*b + r)⁻¹b'K*Aϕ = L*ϕ        (13)

For an agent with perfect knowledge of the transition dynamics and the stage cost, PI can be applied off-line; it is exact and does not require any learning. As such, it is not an ADP/RL method.
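As an illustration of the evaluation/improvement cycle (not code from the paper), the sketch below runs Hewer's iteration for the discounted LQ problem defined by (1)-(2), using the Section 4.1 system with the true θ: the evaluation step solves the Lyapunov equation (11) and the improvement step applies (12)-(13). The initial stabilizing gain is obtained here by pole placement, which is just one convenient choice.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov, solve_discrete_are
from scipy.signal import place_poles

A = np.array([[2.5, -1.0, -0.1], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
b = np.array([[1.0], [0.0], [0.0]])
theta = np.array([[5.0], [2.0], [-1.0]])
alpha, r = 1.0, 1.0
Q = theta @ theta.T                                   # state weight theta theta' from (2)

# Initial stabilizing gain L0 for u = L0 phi: place the closed-loop poles inside the unit circle.
L = -place_poles(A, b, [0.1, 0.2, 0.3]).gain_matrix

for _ in range(100):                                  # Hewer / policy iteration
    Acl = np.sqrt(alpha) * (A + b @ L)
    # Policy evaluation (11):  K = alpha (A+bL)' K (A+bL) + theta theta' + r L'L
    K = solve_discrete_lyapunov(Acl.T, Q + r * L.T @ L)
    # Policy improvement (12)-(13):  L <- -alpha (alpha b'Kb + r)^{-1} b'K A
    L_new = -alpha * (b.T @ K @ A) / (alpha * (b.T @ K @ b).item() + r)
    if np.allclose(L_new, L, atol=1e-10):
        break
    L = L_new

# Cross-check: K should match the stabilizing DARE solution of the equivalent scaled problem.
K_star = solve_discrete_are(np.sqrt(alpha) * A, b, Q, np.array([[r / alpha]]))
print(np.allclose(K, K_star, rtol=1e-6), L)           # L is the optimal gain L* of (13)
```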


2.3 Q-learning

Q-learning is an RL method introduced by Watkins (1989) in order to overcome the limitations of an agent without the ability of making perfect predictions of the environment. The core idea is to learn a function that incorporates the system dynamics and the stage cost. The quantity being minimized in (7) serves as the definition for the Q function:

Q(ϕ_t, u) := c(ϕ_t, u) + αJ*(ϕ_{t+1})        (14)

Q is the sum of the one-step cost incurred by taking the admissible control action u from state ϕ_t and following the policy π*(ϕ) thereafter. For system (1) with stage cost (2), the control that minimizes Q is given by the optimal linear state feedback control policy (13). Q(ϕ_t, u) is given by a quadratic function in terms of the augmented state χ = [ϕ_t' u]':

Q(ϕ_t, u) = χ' [ αA'K*A + θθ'   αA'K*b
                 αb'K*A         αb'K*b + r ] χ        (15)

where u is arbitrary and does not necessarily correspond to the input specified by the optimal policy. For system (1) with nonnegative stage cost (2), Q(ϕ_t, u) is convex with respect to u for a given ϕ_t. Therefore, the optimal policy control action is given by the stationarity condition

∂Q(ϕ_t, u)/∂u = 0        (16)

evaluated at ϕ_t ∈ R^n. This expression yields exactly the same optimal linear feedback gain specified by (13). Q-learning is equivalent to learning the Hamiltonian and applying the stationarity and convexity conditions according to Pontryagin's minimum principle, as shown by Mehta and Meyn (2009). Bradtke et al. (1994) formulate an adaptive control algorithm for continuous spaces using PI in terms of Q and provide convergence and stability results. This formulation requires the addition of white noise to the input in order to provide adequate exploration for its convergence.

3. ADAPTIVE AGENTS WITH ANTICIPATION

An important consideration in ADP/RL control methods is how to introduce excitation for the learning to take place effectively while maintaining control. The most common strategies to achieve this are ε-greedy policy iteration and the addition of white noise to the input signal, as discussed by Lewis et al. (2012). Heirung et al. (2015) developed a detailed description of the concept of anticipated information and its results applied to MPC for FIR systems. In the context of RL, the inclusion of anticipated information provides the agent with the ability to introduce exploration in optimal decisions along the control path. Below, a generalization of the FIR dual MPC control algorithm by Heirung et al. (2015), designed from an RL perspective, is provided. Main definitions from their development are included for completeness.

3.1 Anticipated Information

Consider the case where the system dynamics {A, b} are known and all the uncertainty is contained in the vector of output parameters, θ. A Certainty Equivalence (CE) strategy relies on information contained in the set of previous control inputs and output measurements,

Y_t := {u_{t−1}, u_{t−2}, ..., y_t, y_{t−1}, ...}.        (17)

A CE output predictor for system (1) is given by the current estimate of θ given Y_t:

ŷ_{t+k|t} = E[y_{t+k} | Y_t] = E[θ' | Y_t]ϕ_{t+k|t} = θ̂_t'ϕ_{t+k|t},  for k = 0, 1, ...        (18)

where ŷ_{t|t} = y_t and ϕ_{t+k|t} is deterministically given by the known current state, ϕ_{t|t} = ϕ_t, and the transition dynamics for a series of future control actions u_t, u_{t+1}, ..., u_{t+k−1}:

ϕ_{t+k+1|t} = Aϕ_{t+k|t} + bu_{t+k}.        (19)

The error covariance matrix, P_t, associated with θ̂, is defined as

P_t := E[(θ − θ̂_t)(θ − θ̂_t)' | Y_t].        (20)

In order to incorporate the effect of future control inputs, the set of past and anticipated information, Y_{k|t}, is introduced,

Y_{k|t} := {u_t, u_{t+1}, ..., u_{t+k}, P_t, θ̂_t, Y_t}        (21)

leading to the definition of the anticipated error covariance matrix:

P_{t+k|t} := E[(θ − θ̂_t)(θ − θ̂_t)' | Y_{k|t}] = E[θθ' | Y_{k|t}] − θ̂_tθ̂_t'        (22)

where P_{t|t} = P_t. For a given set of anticipated control actions, u_N := {u_{t+k} ∈ U(ϕ_k)}_{k=0}^{N−1}, the parameter uncertainty is propagated forward in time with Recursive Least Squares (RLS, Ljung (1999)):

R_{t+k+1} = R_{t+k} + ϕ_{t+k+1|t}ϕ_{t+k+1|t}'        (23)

where R_{t+k} = P_{t+k|t}⁻¹ and ϕ_{t+k+1|t} is given by (19). An approximate stage cost function, related to (2), under output predictor parameter uncertainty is defined for k ≥ 0, given the set of past and anticipated information:

ĉ_{t+k|t} := ϕ_{t+k|t}'E[θθ' | Y_{k|t}]ϕ_{t+k|t} + ru_{t+k}²
          = ϕ_{t+k|t}'(θ̂_tθ̂_t' + P_{t+k|t})ϕ_{t+k|t} + ru_{t+k}²
          = ŷ_{t+k|t}² + ru_{t+k}² + ϕ_{t+k|t}'P_{t+k|t}ϕ_{t+k|t}        (24)

This nonlinear function can be expressed in terms of R_{t+k|t} by introducing an additional variable,

z_{t+k} := P_{t+k|t}ϕ_{t+k|t}        (25)

equivalently,

R_{t+k}z_{t+k} = ϕ_{t+k|t}.        (26)

(25) is substituted into (24) to obtain

ĉ_{t+k|t} = ŷ_{t+k|t}² + ru_{t+k}² + ϕ_{t+k|t}'z_{t+k}        (27)

which can be computed exactly given Y_{k|t} for any k > 0 with equations (18), (19), (23), and (26).
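The forward propagation in (23)-(27) is simple to implement. The sketch below (an illustration, not code from the paper) computes ĉ_{t+k|t} along a candidate input sequence from θ̂_t, P_t, ϕ_t, and y_t; the numerical values in the example call are placeholders apart from A and b, which are the Section 4.1 matrices.

```python
import numpy as np

def anticipated_stage_costs(A, b, theta_hat, P_t, phi_t, y_t, u_seq, r=1.0):
    """Return [c_hat_{t|t}, ..., c_hat_{t+N-1|t}] from (27) for a candidate input sequence u_seq.

    R_{t+k} follows the RLS recursion (23) starting from R_t = P_t^{-1}, z_{t+k} solves (26),
    and y_hat_{t+k|t} is the CE prediction (18) propagated with the dynamics (19).
    """
    phi = phi_t.reshape(-1, 1)
    R = np.linalg.inv(P_t)                                      # R_t = P_t^{-1}
    costs = []
    for k, u in enumerate(u_seq):
        y_hat = y_t if k == 0 else (theta_hat.T @ phi).item()   # (18), with y_hat_{t|t} = y_t
        z = np.linalg.solve(R, phi)                             # (26): R_{t+k} z_{t+k} = phi_{t+k|t}
        costs.append(y_hat**2 + r * u**2 + (phi.T @ z).item())  # (27)
        phi = A @ phi + b * u                                   # (19): phi_{t+k+1|t}
        R = R + phi @ phi.T                                     # (23): R_{t+k+1}
    return costs

# Illustrative call with placeholder values:
A = np.array([[2.5, -1.0, -0.1], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
b = np.array([[1.0], [0.0], [0.0]])
print(anticipated_stage_costs(A, b, theta_hat=np.zeros((3, 1)), P_t=1e3 * np.eye(3),
                              phi_t=np.array([1.0, 0.0, 0.0]), y_t=0.5, u_seq=[0.1, -0.2, 0.0]))
```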

3.2 Dual Cost-To-Go

An approximate infinite-horizon cost function for the current state, ϕ_t ∈ R^n, and a given policy, π, can be written in terms of (27):

Ĵ_π(ϕ_t) = lim_{N→∞} Σ_{k=0}^{N−1} α^k ĉ_{t+k|t}
         = lim_{N→∞} Σ_{k=0}^{N−1} α^k ŷ_{t+k|t}² + lim_{N→∞} Σ_{k=0}^{N−1} α^k ru_{t+k}² + lim_{N→∞} Σ_{k=0}^{N−1} α^k ϕ_{t+k|t}'z_{t+k}
         = Ĵ_{π,exploit}(ϕ_t) + Ĵ_{π,caution}(ϕ_t) + Ĵ_{π,explore}(ϕ_t)        (28)

This decomposition makes an explicit distinction of the desired features of a dual control policy: exploitation of previous knowledge, caution, and exploration.


It has the property that as θ̂_t → θ, Ĵ_{π,explore}(ϕ_t) → 0, converging to the exact cost-to-go given by stage cost (2).

3.3 N-step Bellman's Equation

A stationary optimal policy can be expressed by the N-step recursion given by (6):

J*(ϕ_t) = J*_N(ϕ_t) + α^N J*(ϕ_{t+N})        (29)

Corollary 1. Bellman's equation can be expressed in terms of the N-step cost for a deterministic system with a stationary optimal policy and nonnegative stage cost:

J*(ϕ_t) = min_{u_N} [ Σ_{k=0}^{N−1} α^k c(ϕ_{t+k}, u_{t+k}) + α^N J*(ϕ_{t+N}) ]        (30)

Proof. Equation (29) can be rewritten using the definition for the optimal policy (4):

J*_N(ϕ_t) + α^N J*(ϕ_{t+N}) = min_{π∈Π} [J_π(ϕ_t)]
                            ≤ min_{π∈Π} [J_{π,N}(ϕ_t) + α^N J*(ϕ_{t+N})]
                            = min_{u_N} [ Σ_{k=0}^{N−1} α^k c(ϕ_{t+k}, u_{t+k}) + α^N J*(ϕ_{t+N}) ]

If the reverse inequality holds, (30) must be true. Rewrite equation (3) for a deterministic system by extracting the first N stage costs in the sum:

J_π(ϕ_t) = Σ_{k=0}^{N−1} α^k c(ϕ_{t+k}, µ_k(ϕ_{t+k})) + α^N J_π(ϕ_{t+N})
         ≥ Σ_{k=0}^{N−1} α^k c(ϕ_{t+k}, µ_k(ϕ_{t+k})) + α^N J*(ϕ_{t+N})
         ≥ min_{u_N} [ Σ_{k=0}^{N−1} α^k c(ϕ_{t+k}, u_{t+k}) + α^N J*(ϕ_{t+N}) ]

Since the inequality holds for any π ∈ Π, it must also hold for J*:

J*(ϕ_t) ≥ min_{u_N} [ Σ_{k=0}^{N−1} α^k c(ϕ_{t+k}, u_{t+k}) + α^N J*(ϕ_{t+N}) ]

Combining the two inequalities completes the proof.
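For the known-parameter LQ case, relation (29)-(30) can also be checked numerically: with J*(ϕ) = ϕ'K*ϕ and the gain (13), the first N discounted stage costs along the optimal trajectory plus the discounted tail cost reproduce J*(ϕ_t). The check below is illustrative only and again borrows the Section 4.1 system with the true θ; the test state is arbitrary.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[2.5, -1.0, -0.1], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
b = np.array([[1.0], [0.0], [0.0]])
theta = np.array([[5.0], [2.0], [-1.0]])
alpha, r, N = 1.0, 1.0, 5

# Optimal cost matrix and gain for the known-theta problem, cf. (13).
K = solve_discrete_are(np.sqrt(alpha) * A, b, theta @ theta.T, np.array([[r / alpha]]))
L = -alpha * (b.T @ K @ A) / (alpha * (b.T @ K @ b).item() + r)

phi = np.array([[1.0], [-2.0], [0.5]])              # arbitrary test state
J_star = (phi.T @ K @ phi).item()

acc, x = 0.0, phi.copy()
for k in range(N):                                  # N steps under the optimal policy
    u = (L @ x).item()
    acc += alpha**k * ((x.T @ theta @ theta.T @ x).item() + r * u**2)   # stage cost (2)
    x = A @ x + b * u
tail = alpha**N * (x.T @ K @ x).item()              # alpha^N J*(phi_{t+N})

print(np.isclose(J_star, acc + tail))               # (29)-(30) hold along the optimal trajectory
```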

3.4 Approximate Iteration

For system (1) with stage cost (2) under output parameter uncertainty, an approximate cost-to-go function is defined:

Ĵ*(ϕ_t) := min_{u_N} [ Σ_{k=0}^{N−1} α^k ĉ_{t+k|t} + α^N ϕ_{t+N|t}'K̂*_tϕ_{t+N|t} ]        (31)

This approximation of (30) is composed of an N-step truncation of (28) and a CE cost-to-go. It assumes that for a large enough N, the exploration component in the cost-to-go for ϕ_{t+N} is negligible, and Ĵ*(ϕ_{t+N}) is given by (9) with cost matrix K̂*_t, the solution to

K̂*_t = Ã'(K̂*_t − K̂*_tb(b'K̂*_tb + r̃)⁻¹b'K̂*_t)Ã + θ̂_tθ̂_t'        (32)

where Ã = √α A and r̃ = r/α. The solution of (32) is only possible if the pair {A, b} is controllable and the pair {A, θ̂_t} is observable.
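Since (32) is a standard DARE in the scaled quantities Ã = √α A and r̃ = r/α, K̂*_t can be obtained with an off-the-shelf Riccati solver whenever the observability check passes. A minimal sketch follows (illustrative; the helper name and the error-handling choice are not from the paper).

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def ce_cost_matrix(A, b, theta_hat, alpha=1.0, r=1.0):
    """Solve (32) for K_hat*_t, assuming {A, b} controllable and {A, theta_hat} observable."""
    n = A.shape[0]
    # Observability check for the pair {A, theta_hat} (cf. Step 1 of the RLdMPC algorithm).
    obs = np.vstack([theta_hat.T @ np.linalg.matrix_power(A, i) for i in range(n)])
    if np.linalg.matrix_rank(obs) < n:
        raise ValueError("{A, theta_hat} not observable: keep the previous K_hat instead")
    A_tilde = np.sqrt(alpha) * A                       # scaled dynamics, A_tilde = sqrt(alpha) A
    r_tilde = np.array([[r / alpha]])                  # scaled input weight, r_tilde = r / alpha
    return solve_discrete_are(A_tilde, b, theta_hat @ theta_hat.T, r_tilde)

# With the true theta of Section 4.1 and alpha = 1, this is the cost matrix behind the gain (13).
A = np.array([[2.5, -1.0, -0.1], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
b = np.array([[1.0], [0.0], [0.0]])
K_hat = ce_cost_matrix(A, b, theta_hat=np.array([[5.0], [2.0], [-1.0]]))
```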


By solving the optimization problem posed by (31) and implementing the first element of the minimizer, corresponding to u_t, an actor agent specifies an optimal control signal that balances exploration with respect to a finite horizon window of size N, and the infinite horizon CE problem. Upon implementation of the optimal dual control action, a critic agent collects the subsequent output measurement, y_{t+1}, and provides improved approximations for the output parameter vector, θ̂_{t+1}, P_{t+1}, and the cost-to-go function Ĵ*. A new optimization problem is defined and the procedure can be repeated forward in time. This process is similar to PI in Q-learning, except that the function being approximated includes the cost of N steps, and it has an embedded mechanism to generate excitation.

3.5 Control Algorithm

An indirect adaptive optimal control algorithm is presented based on the process described above. The distinct feature of the proposed method emerges from the inclusion of anticipated information, which is expressed in the optimal action that balances dual control considerations. In the subsequent discussion, the algorithm is referred to as RL dual MPC (RLdMPC). A detailed description is given in Fig. 1.

Initialization: Specify the initial state, ϕ_{t0}, output parameter estimate, θ̂_{t0}, and its covariance matrix, P_{t0}. Set K̂_{t0} = 0_n. Set t_0 → t.

Step 1: Check observability of {A, θ̂_t}. If it checks, solve (32) and update K̂_t with its solution. If it doesn't check, keep the previous K̂_t.

Step 2: Solve

min_{u_N}  Σ_{k=0}^{N−1} α^k (ŷ_{t+k|t}² + ru_{t+k}² + ϕ_{t+k|t}'z_{t+k}) + α^N ϕ_{t+N|t}'K̂*_tϕ_{t+N|t}

s.t.  ϕ_{t+k+1|t} = Aϕ_{t+k|t} + bu_{t+k}
      ŷ_{t+k|t} = y_t for k = 0;  ŷ_{t+k|t} = θ̂_t'ϕ_{t+k|t} for k > 0
      R_{t+k} = P_t⁻¹ for k = 0;  R_{t+k} = R_{t+k−1} + ϕ_{t+k|t}ϕ_{t+k|t}' for k > 0
      R_{t+k}z_{t+k} = ϕ_{t+k|t}
      ϕ_{t+k|t}'z_{t+k} ≥ 0
      y_min ≤ ŷ_{t+k|t} ≤ y_max
      u_min ≤ u_{t+k} ≤ u_max
      for k ∈ [0, N − 1]        (33)

with ϕ_t, y_t, θ̂_t, P_t, and K̂*_t given at time t, and implement the first element of the solution, u_t.

Step 3: Measure y_{t+1}, update the state (ϕ_{t+1} = Aϕ_t + bu_t) and the parameter estimate:

θ̂_{t+1} = θ̂_t + G_{t+1}(y_{t+1} − θ̂_t'ϕ_{t+1})
G_{t+1} = P_tϕ_{t+1}(σ² + ϕ_{t+1}'P_tϕ_{t+1})⁻¹
P_{t+1} = (I_n − G_{t+1}ϕ_{t+1}')P_t        (34)

Step 4: Set t + 1 → t, return to Step 1.

Fig. 1. RLdMPC Algorithm

4. COMPUTATIONAL RESULTS

The optimization problem (33) is a Quadratically Constrained Quadratic Program (QCQP) in terms of the column vector x,

x := [ ŷ_{t|t} ... ŷ_{t+N−1|t},  u_t ... u_{t+N−1},  ϕ_{t+1|t}' ... ϕ_{t+N|t}',  z_t' ... z_{t+N−1}',  R̄_t ... R̄_{t+N−1} ]'        (35)

where R̄_{t+k} is a row vector with n² elements corresponding to the concatenation of the rows of R_{t+k}. Column vector x contains all the variables involved in (33) and it has Nn² + 2N(n + 1) elements. The equations that define this optimization problem can be written, in terms of x, in the following form:

min_x  x'M_o x + a_o'x
s.t.   x'M_i x + a_i'x = b_i,  i ∈ E
       x'M_i x + a_i'x ≤ b_i,  i ∈ I
       x_min ≤ x ≤ x_max        (36)

where E and I are the sets of equality and inequality constraints, respectively. E is composed of n² + N(n + 1) linear equations and (N − 1)n² + Nn quadratic equations. A single linear inequality and N − 1 quadratic equations constitute I.
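Problem (36) is a nonconvex QCQP (the quadratic equality constraints arising from (23) and (26) make it so), and the paper solves it to ε-global optimality with BARON. Purely as an illustration of the form (36), the sketch below passes a problem of that structure to a local NLP solver; the helper name and the (M_i, a_i, b_i) data layout are assumptions, and a local solver offers no global guarantee.

```python
import numpy as np
from scipy.optimize import minimize

def solve_qcqp_locally(M0, a0, eq, ineq, xmin, xmax, x0):
    """Local solution of (36): min x'M0 x + a0'x subject to quadratic constraints and bounds.

    eq and ineq are lists of (M_i, a_i, b_i) triples for x'M_i x + a_i'x = b_i and <= b_i.
    Unlike the epsilon-global BARON solution used in the paper, SLSQP returns a local optimum.
    """
    obj = lambda x: x @ M0 @ x + a0 @ x
    cons = [{"type": "eq", "fun": lambda x, M=M, a=a, c=c: x @ M @ x + a @ x - c}
            for M, a, c in eq]
    cons += [{"type": "ineq", "fun": lambda x, M=M, a=a, c=c: c - (x @ M @ x + a @ x)}
             for M, a, c in ineq]
    res = minimize(obj, x0, method="SLSQP", bounds=list(zip(xmin, xmax)), constraints=cons)
    return res.x, res.fun

# Tiny 2-variable instance, just to exercise the interface (not data from the paper).
M0, a0 = np.eye(2), np.array([1.0, -2.0])
eq = [(np.diag([1.0, 0.0]), np.array([0.0, 1.0]), 1.0)]   # x1^2 + x2 = 1
sol, val = solve_qcqp_locally(M0, a0, eq, [], xmin=[-5, -5], xmax=[5, 5], x0=np.zeros(2))
```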

4.1 Example

Consider system (1) specified by

A = [ 2.5  −1  −0.1
       1    0    0
       0    1    0 ],    b = [1 0 0]',    θ = [5 2 −1]'

and v_t ∼ N(0, 1).
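To give a feel for Step 2 of Fig. 1 on this example, the sketch below (illustrative only) eliminates ϕ, ŷ, R, and z through (18), (19), (23), and (26) and optimizes directly over the N anticipated inputs with a local solver; the paper instead solves the lifted QCQP (33)/(36) to ε-global optimality with BARON, and the wide output box constraints are omitted here since they are inactive at these values.

```python
import numpy as np
from scipy.optimize import minimize

A = np.array([[2.5, -1.0, -0.1], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
b = np.array([[1.0], [0.0], [0.0]])
alpha, r, N = 1.0, 1.0, 3

def step2_objective(u_seq, phi_t, y_t, theta_hat, P_t, K_hat):
    """Objective of (33) expressed as a function of the N anticipated inputs only."""
    phi, R = phi_t.copy(), np.linalg.inv(P_t)
    J = 0.0
    for k, u in enumerate(u_seq):
        y_hat = y_t if k == 0 else (theta_hat.T @ phi).item()        # (18)
        z = np.linalg.solve(R, phi)                                  # (26)
        J += alpha**k * (y_hat**2 + r * u**2 + (phi.T @ z).item())   # (27)
        phi = A @ phi + b * u                                        # (19)
        R = R + phi @ phi.T                                          # (23)
    return J + alpha**N * (phi.T @ K_hat @ phi).item()               # CE tail cost from (31)

# One Step-2 solve at t0 with the paper's initialization theta_hat = 0, P = 1e3*I; at t0 the
# observability check of Step 1 fails for theta_hat = 0, so K_hat keeps its initial value 0_n.
# The current state and output below are placeholders, not values from the paper.
phi_t, y_t = np.array([[1.0], [0.0], [0.0]]), 0.5
theta_hat, P_t, K_hat = np.zeros((3, 1)), 1e3 * np.eye(3), np.zeros((3, 3))
res = minimize(step2_objective, x0=np.zeros(N), args=(phi_t, y_t, theta_hat, P_t, K_hat),
               bounds=[(-100.0, 100.0)] * N, method="SLSQP")
u_t = res.x[0]    # the first element of the minimizer would be applied to the plant
```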



The following output feedback regulator methods were applied:

(i) RLdMPC: N = 3, θ̂_{t0} = [0 0 0]', P_{t0} = 10³ I_3
(ii) same as (i), except K̂*_t = 0, ∀t
(iii) same as (i), except ĉ_{t+k|t} = ŷ_{t+k|t}² + ru_{t+k}²
(iv) µ* computed exactly with (13)

The discount factor and the caution parameter were set at α = 1 and r = 1, respectively. A common initial condition for the state, ϕ_{t0}, and noise sequence v_t were generated using the function randn in MATLAB. Box constraints for the system input and output were specified with y_min = −2000, u_min = −100, y_max = 2000, and u_max = 100. Optimizations for cases (i)-(iii) were solved to global ε-optimality with BARON (Tawarmalani and Sahinidis (2005)) through its MATLAB interface with a relative termination tolerance ε_r = 10⁻³. The average solution times for cases (i)-(iii) were 1.017, 0.317, and 0.274 seconds, in that order. An additional case, equivalent to (iii) but without the cost-to-go for ϕ_{t+N} (K̂*_t = 0, ∀t), was also studied; however, it resulted in an infeasible optimization problem after a few steps.

The stage costs from t_0 to t_0 + 10 were added for each case using (2) and are shown in Fig. 2. By definition, optimal policy (13) gives the minimum cost and can be considered a base cost. Any additional cost can be attributed to learning and instability. Thus, RLdMPC achieves the minimum deviation from optimality. The difference between cases (i) and (iii) is achieved by including the cost-to-go after N steps, while the difference between cases (i) and (ii) is obtained by including the exploratory component of (27), ϕ_{t+k|t}'z_{t+k}, in the finite horizon portion of the objective function.

Fig. 2. Observed Total Cost

Fig. 3 displays the input and output trajectories for control methods (i)-(iv). In agreement with Fig. 2, the RLdMPC trajectories are the closest to optimal, achieving the fastest regulation with minimum input compared to cases (ii) and (iii). The cross marks in the bottom plot correspond to the optimal input computed with (13) for the RLdMPC state trajectory. At the second step, RLdMPC converges to the optimal policy and remains optimal thereafter.

Fig. 3. Input/Output Trajectories

The output predictor parameter identification with RLS is shown in Fig. 4 for methods (i)-(iii). The right column provides a zoomed-in view once the estimates have stabilized. All cases behave similarly, converging near the true value after 3 steps.

Fig. 4. Output Parameter Identification

5. CONCLUSION

RLdMPC was shown to converge to the optimal policy by the sequential interaction of an actor agent that solves an optimization problem and an adaptive critic agent that provides improved approximations to the system model and cost-to-go function. The main contribution is the inclusion of anticipated information in the decision process. This feature offers an alternative to introduce excitation automatically along the control path with an optimal balance of dual objectives. The cost of this inclusion compared to a CE strategy is expressed in an increase in problem size and the solution of a QCQP instead of a QP for the optimal control problem. The size increment could be alleviated by reducing the number of anticipation steps in the objective function. The results highlight the stabilizing effect of an approximate cost-to-go in an adaptive receding horizon MPC strategy. RLdMPC was shown for the regulation case but can be easily extended to a tracking objective.

ACKNOWLEDGEMENTS

The authors would like to thank the ACS Petroleum Research Fund for their financial support.

REFERENCES

Barto, A.G. and Dietterich, T.G. (2004). Reinforcement learning and its relationship to supervised learning. In J. Si, A.G. Barto, W.B. Powell, and D. Wunsch (eds.), Handbook of Learning and Approximate Dynamic Programming (IEEE Press Series on Computational Intelligence). Wiley-IEEE Press.
Bellman, R. (1952). On the theory of dynamic programming. Proceedings of the National Academy of Sciences of the United States of America, 38(8), 716–719.
Bertsekas, D.P. (2012). Dynamic Programming and Optimal Control, Vol. II. Athena Scientific, 4th edition.
Bertsekas, D.P. and Tsitsiklis, J.N. (1995). Neuro-dynamic programming: an overview. In Proceedings of the 34th IEEE Conference on Decision and Control, 560–564.


Bradtke, S.J., Ydstie, B.E., and Barto, A.G. (1994). Adaptive linear quadratic control using policy iteration. In American Control Conference, 1994, volume 3, 3475–3479. IEEE.
Ferrari, S. and Stengel, R.F. (2004). Model-based adaptive critic designs. In J. Si, A.G. Barto, W.B. Powell, and D. Wunsch (eds.), Handbook of Learning and Approximate Dynamic Programming (IEEE Press Series on Computational Intelligence). Wiley-IEEE Press.
Heirung, T.A.N., Ydstie, B.E., and Foss, B. (2015). Dual MPC for FIR systems: information anticipation. In 9th IFAC Symposium on Advanced Control of Chemical Processes.
Lee, J.M. and Lee, J.H. (2009). An approximate dynamic programming based approach to dual adaptive control. Journal of Process Control, 19(5), 859–864.
Lewis, F., Vrabie, D., and Vamvoudakis, K. (2012). Reinforcement learning and feedback control: using natural decision methods to design optimal adaptive controllers. Control Systems, IEEE, 32(6), 76–105.
Lewis, F.L. and Liu, D. (2013). Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, volume 17. John Wiley & Sons.
Ljung, L. (1999). System Identification: Theory for the User. Prentice Hall PTR.
MATLAB (2015). R2015a. The MathWorks Inc., Natick, Massachusetts.
Mehta, P. and Meyn, S. (2009). Q-learning and Pontryagin's minimum principle. In 48th IEEE Conference on Decision and Control.
Mitchell, T.M. (1997). Machine Learning. McGraw-Hill, Inc., New York, NY, USA, 1st edition.
Sutton, R.S. and Barto, A.G. (1998). Reinforcement Learning: An Introduction. MIT Press.
Tawarmalani, M. and Sahinidis, N.V. (2005). A polyhedral branch-and-cut approach to global optimization. Mathematical Programming, 103(2), 225–249.
Watkins, C.J.C.H. (1989). Learning from Delayed Rewards. Ph.D. thesis, University of Cambridge, England.
Wittenmark, B. (1995). Adaptive dual control methods: an overview. In 5th IFAC Symposium on Adaptive Systems in Control and Signal Processing.