A robust Markov game controller for nonlinear systems


Applied Soft Computing 7 (2007) 818–827 www.elsevier.com/locate/asoc

Rajneesh Sharma *, Madan Gopal
Control Laboratory, Electrical Engineering Department, Indian Institute of Technology, Delhi, Hauz Khas, New Delhi-110016, India
Received 13 April 2005; received in revised form 9 January 2006; accepted 13 February 2006. Available online 3 May 2006.

Abstract

This paper proposes a reinforcement learning (RL)-based, game-theoretic formulation for designing robust controllers for nonlinear systems affected by bounded external disturbances and parametric uncertainties. Based on the theory of Markov games, we consider a differential game in which a 'disturbing' agent tries to make the worst possible disturbance while a 'control' agent tries to make the best control input. The problem is formulated as finding a min–max solution of a value function. We propose an online procedure for learning the optimal value function and for calculating a robust control policy. The proposed game-theoretic paradigm has been tested on the control task of a highly nonlinear two-link robot system. We compare the performance of the proposed Markov game controller with a standard RL-based robust controller and an H∞ theory-based robust game controller. For the robot control task, the proposed controller achieved superior robustness to changes in payload mass and external disturbances over the other control schemes. Results also validate the effectiveness of neural networks in extending the Markov game framework to problems with continuous state–action spaces.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Reinforcement learning; Markov decision process; Matrix games; Markov games; Markov game controller; Neural networks; Inverted pendulum; Two-link robot

1. Introduction

Control of nonlinear systems is a nontrivial problem and can be framed as a reinforcement-learning (RL) task. RL is a computationally simple, direct approach to adaptive-learning control. In reinforcement learning, environmental models have been used quite extensively for both offline learning and online action planning. However, no model is perfect, and modeling errors can cause unpredictable results, sometimes worse than with no model at all. External disturbances and modeling errors cause a deviation of the real system state from that predicted by the model. The designed controller should be robust to such model uncertainties and external disturbances. The Markov decision process (MDP) framework [2], which has been used extensively to design RL controllers for systems/processes, is grossly inadequate for nonlinear systems, as it allows only a single player acting in a stationary environment. In direct-adaptive-reinforcement-learning (DRAL) [13], the authors have claimed robust controller performance by combining a fixed-gain conventional controller and a standard RL-based neural network (NN) controller.

* Corresponding author. Tel.: +91 11 27943497; fax: +91 11 25099022. E-mail addresses: [email protected], [email protected] (R. Sharma).
1568-4946/$ – see front matter © 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.asoc.2006.02.005

Robust reinforcement learning (RRL) [12] is an online technique for robust game-theoretic controller design based on H∞ control theory. In the H∞ paradigm [11], worst-case controller design is viewed as a differential game between the controller and the disturber. H∞ control theory gives an analytical solution only for linear systems. For nonlinear systems, constructing the best control and the worst disturbance requires solving a Hamilton–Jacobi–Isaacs (HJI) equation, which is analytically infeasible [12]. The theory of zero-sum stochastic games has been applied by Altman and Hordijk for designing worst-case optimal controllers for queuing networks [1]. The context in which Altman and Hordijk apply zero-sum game theory is queuing models, e.g., where several users compete for some resource. In [5], a linear approximation architecture has been used to apply the Markov game setup to a football game and to the server-router flow control problem of Altman [14]. We emphasize that the role of the opponent in the above problems, and in other problems such as chess and backgammon, is different from that of noise/disturbance in a general control problem. In the Markov game (MG) formulation of controllers, as proposed in this paper, the controller–disturber tussle is viewed as a two-player zero-sum Markov game. We use a continuous-action version of the minimax-Q algorithm [7] for finding the continuous control to be


applied to the controlled plant. The approach is applicable to any general nonlinear system, with or without knowledge of a desired or reference trajectory, and yields a conceptually simple, direct, reactive controller for nonlinear systems. For moderately large state spaces, the Markov game controller can be implemented by gridding the state space and using a lookup table approach [7]. However, when the state space is large or continuous, some form of function approximation becomes inevitable to avoid the well-known "curse of dimensionality". Neural networks (NN) are an ideal candidate for approximating value functions as they possess excellent learning ability and a powerful generalization capability, besides other properties [3]. This work is a step toward extending the zero-sum Markov game framework to continuous state–action space systems using NN.

2. Learning frameworks

We consider a learning system (controller) that interacts with its environment (a discrete-time dynamical system). For each state x_k ∈ X of the dynamical system at discrete time k, there is a finite set of possible decisions (control actions, u(x_k) ∈ U(x_k)) that may be taken by the learning system (controller). The index k is a stage variable describing how many decisions have been made so far, k = 0, 1, 2, ...

2.1. Markov decision process (MDP)

There is a single adaptive agent (controller) interacting with the environment. At each step the agent senses the current system state x_k, chooses action u_k, receives a reinforcement signal c_k from the environment and moves to the next state x_{k+1}; the experience tuple is ⟨x_k, u_k, c_k, x_{k+1}⟩. An MDP consists of a tuple ⟨X, U, C, P⟩ where X is the set of states, U the action set for the controller, C the cost function C: X × U → ℝ, and P the state transition function P: X × U → P(X), where P(X) is the set of discrete probability distributions over the set X and p(x_k, u_k, x_{k+1}) is the probability of transition from x_k to x_{k+1} under action u_k. If the response at k + 1 depends only on the state and action at k, then the system is said to possess the Markov property, and the sequential decision-making problem under uncertainty is called a Markov decision process (MDP) [2]. The controller's aim is to discover a policy π_u: π_u(x_k) → u_k, where u_k ∈ U(x_k), so as to minimize the expected sum of discounted cost, i.e., E[\sum_{k=0}^{\infty} \gamma^k c_{t+k}], where c_{t+k} is the cost incurred k steps into the future and γ is the discount factor, 0 ≤ γ < 1. MDPs can be solved by either policy iteration or value iteration [2]. We describe a simulation-based value iteration procedure called Q learning, due to Watkins [10]. The value of a state, V(x_k), is the total expected discounted cost incurred by an optimal policy starting from state x_k ∈ X. The Q value is defined on a state–action pair and is the total expected discounted cost incurred by a policy that takes action u_k ∈ U(x_k) from state x_k ∈ X and follows the optimal policy thereafter. The Q value for the state–action pair (x_k, u_k) is defined as:

Q(x_k, u_k) = c_k + \gamma \sum_{x_{k+1} \in X} p(x_k, u_k, x_{k+1}) V(x_{k+1})    (1)


where V(x_{k+1}) = \min_{u \in U(x_{k+1})} Q(x_{k+1}, u) and c_k is the immediate cost of taking action u_k at state x_k. In words, the quality of a state–action pair is the immediate cost plus the expected discounted value of successor states weighted by their likelihood. The optimal policy π_u^* is defined as one that prescribes the minimizing action at any state, i.e.,

\pi_u^*(x_k) = \arg\min_{u \in U(x_k)} Q(x_k, u)

In the standard value iteration procedure, we need to apply the above equations to all states. We can use Eq. (1) directly as an update equation for an iteration process that calculates exact Q values. This does, however, require that a model be given (or be learned), because the equation uses p(x_k, u_k, x_{k+1}). Watkins [10] proposed a procedure to iteratively update Q values that requires neither the system transition probabilities nor the cost structure. The above iteration, i.e., Eq. (1), can be written in a more general form as:

Q(x_k, u_k) = (1 - h) Q(x_k, u_k) + h \left[ c_k + \gamma \sum_{x_{k+1} \in X} p(x_k, u_k, x_{k+1}) V(x_{k+1}) \right]    (2)

where h is a small learning-rate parameter with h \in (0, 1] that may change from one iteration to the next. Q learning is an approximate form of the above iteration wherein the expectation with respect to the successor state x_{k+1} is replaced by a single sample, i.e.,

Q(x_k, u_k) \leftarrow (1 - h) Q(x_k, u_k) + h \left( c_k + \gamma \min_{u \in U(x_{k+1})} Q(x_{k+1}, u) \right)
           = Q(x_k, u_k) + h \left( c_k + \gamma \min_{u \in U(x_{k+1})} Q(x_{k+1}, u) - Q(x_k, u_k) \right)    (3)

where x_{k+1} and c_k are generated from the pair (x_k, u_k) as per the transition probability p(x_k, u_k, x_{k+1}). The only requirement for using Q learning is that the state of the environment be fully observable. Q learning converges to the optimal Q values as long as every state–action pair is visited infinitely often and the learning-rate parameter h is reduced to a small value at a suitable rate [10].
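As a concrete illustration of the single-sample update in Eq. (3), the sketch below implements one tabular Q-learning step; the table shape, the toy transition and the fixed learning rate are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def q_learning_update(Q, x, u, c, x_next, h=0.1, gamma=0.95):
    """One Q-learning step, Eq. (3): move Q(x, u) toward the sampled target
    c + gamma * min_u' Q(x_next, u').  Q is a |X| x |U| table of costs."""
    target = c + gamma * Q[x_next].min()      # single-sample estimate of V(x_next)
    Q[x, u] += h * (target - Q[x, u])         # incremental correction toward the target
    return Q

# Toy usage: 5 states, 3 actions, one simulated transition (x=0, u=1) -> (cost=1.0, x_next=2)
Q = np.zeros((5, 3))
Q = q_learning_update(Q, x=0, u=1, c=1.0, x_next=2)
```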

2.2. Matrix games

A matrix game [8] is a tuple ⟨N, U_{1...N}, C_{1...N}⟩ where N is the number of players, U_i the action set of player i (U is the joint action space, i.e., U_1 × U_2 × ... × U_N), and C_i is player i's cost function. In a two-player zero-sum game, the game is played between two players with diametrically opposite goals, so that the cost to one player is the reward of the other, i.e., if C_1 and C_2 are the cost functions of the two players then C_1 = -C_2. In this setting we can use a single cost function to represent the game. We take the cost as c when the first player (the agent) takes action u ∈ U and the second player (the opponent) takes d ∈ D. The optimal policy in matrix games can be probabilistic, i.e., the agent's optimal policy specifies a probability distribution over its action set rather than a crisp action: π_u: X → PD(U), where PD(U) is a probability distribution over the action set U.


Such an optimal policy can be found using a linear programming technique [9]. The value of the game is given as:

V = \min_{\pi_u \in PD(U)} \max_{d \in D} \sum_{u \in U} c \, \pi_u    (4)

and the optimal policy \pi_u^* = [\pi_{u_1}, \pi_{u_2}, \ldots, \pi_{u_n}] is a probability distribution over the agent's action set U = (u_1, u_2, \ldots, u_n), where n = |U|, that minimizes V.

2.3. Markov games

Markov games [6] generalize the MDP framework to the multiagent case. A Markov game is represented by the tuple ⟨N, X, U_{1,2,...,N}, C_{1,2,...,N}, P⟩ where X is the set of states, N the number of agents, U_{1...N} the collection of action sets for agents 1...N, C_i the cost function for agent i, i.e., C_i: X × U_1 × U_2 × ... × U_N → ℝ, P the state transition function P: X × U_1 × U_2 × ... × U_N → P(X), and p(x_k, u_1, u_2, ..., u_N, x_{k+1}) the probability of moving from state x_k to x_{k+1} when each agent takes an action (u_i ∈ U_i) at state x_k. As in Q learning, for a two-player case we can define the Q value Q(x_k, u_k, d_k) for the tuple ⟨x_k, u_k, d_k⟩ as the expected cost for taking action u_k when the opponent takes action d_k at state x_k and continuing optimally thereafter, i.e.,

Q(x_k, u_k, d_k) = c_k + \gamma \sum_{x_{k+1} \in X} p(x_k, u_k, d_k, x_{k+1}) V(x_{k+1})    (5)

where p(x_k, u_k, d_k, x_{k+1}) is the probability of transition from state x_k to x_{k+1} when the agent takes action u_k and the opponent takes action d_k at state x_k, and c_k is the cost incurred when the first player (agent) takes action u_k ∈ U(x_k) and the second player (opponent) takes d_k ∈ D(x_k) at state x_k. The minimax-Q algorithm is similar to Q learning, except that the term \min_{u \in U(x_{k+1})} Q(x_{k+1}, u) is replaced by the value of the game played between the two players at state x_{k+1}, i.e.,

V(x_{k+1}) = \min_{\pi_u \in PD(U(x_{k+1}))} \max_{d \in D(x_{k+1})} \sum_{u \in U(x_{k+1})} Q(x_{k+1}, u, d) \, \pi_u

Q values are updated as:

Q(x_k, u_k, d_k) \leftarrow Q(x_k, u_k, d_k) + h \left[ c_k + \gamma V(x_{k+1}) - Q(x_k, u_k, d_k) \right]    (6)

where h is the learning-rate parameter and V(x_{k+1}) is the value of the game played between the agent and the opponent at state x_{k+1}.
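The following is a minimal tabular sketch of the minimax-Q update of Eq. (6), under the assumption of small finite action sets for both players. The mixed-strategy game value of Eq. (4) is approximated here by a pure-strategy security level purely to keep the sketch self-contained; an LP-based solver for the mixed value is sketched alongside Section 3.3.

```python
import numpy as np

def game_value_pure(G):
    """Security level of the minimizing (row) player over pure strategies:
    min_u max_d G[u, d].  A stand-in for the mixed-strategy value of Eq. (4);
    an LP-based solver for the mixed value is sketched in Section 3.3."""
    return G.max(axis=1).min()

def minimax_q_update(Q, x, u, d, c, x_next, h=0.1, gamma=0.95):
    """One minimax-Q step, Eq. (6): Q is a |X| x |U| x |D| cost table and
    V(x_next) is the value of the matrix game Q[x_next] between the two players."""
    V_next = game_value_pure(Q[x_next])
    Q[x, u, d] += h * (c + gamma * V_next - Q[x, u, d])
    return Q

# Toy usage: 4 states, 3 controller actions, 3 disturber actions
Q = np.zeros((4, 3, 3))
Q = minimax_q_update(Q, x=0, u=2, d=1, c=0.5, x_next=3)
```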

3. Markov game controller for continuous state–action spaces

To the best of our knowledge, this work describes a first effort to implement a Markov game controller for a continuous state–action space using NN.

3.1. Value function approximation

For continuous state spaces, the usual approach of state aggregation becomes computationally intractable, and the solution depends heavily on the grid resolution adopted for state aggregation. Further, the aggregated system may not retain the Markov property. Function approximation provides a way to deal with this problem. We use neural networks, which have a strong generalization capability. In the implementation using NN, the tabular Q function Q(x_k, u_k, d_k) is replaced by a parameterized Q function Q̂(x_k, u_k, d_k, w_k), where w_k is the set of parameters describing the approximation architecture used, i.e., the weights and biases of the network at time step k. The update equation for a transition from state x_k to x_{k+1} becomes

\hat{Q}(x_k, u_k, d_k, w_k) \leftarrow \hat{Q}(x_k, u_k, d_k, w_k) + h \left[ c_k + \gamma V(x_{k+1}) - \hat{Q}(x_k, u_k, d_k, w_k) \right]

where h is the learning-rate parameter.

3.2. Controller design

The controller works by employing a minimax criterion with respect to an estimate of a real-valued function Q of the state, the controller's action and the disturber's action. Q(x_k, u_k, d_k) is the expected discounted sum of cost of taking action u_k when the disturber chooses d_k at state x_k and continuing optimally thereafter. At each discrete time step k = 1, 2, ..., the Markov game controller observes the current state x_k of the Markov game, selects action u_k, incurs a cost c_k and observes the resulting state x_{k+1}. Based on these observations, Q is updated as per Eq. (6). For updating Q values one needs to calculate V(x_{k+1}), i.e., the expected cost for the optimal policy starting from state x_{k+1}. The Markov game controller (Fig. 1) consists of two parts: (i) a game solver and (ii) an NN function approximator. The task of the game solver is to provide the V(x_{k+1}) value for training the NN and to generate a mixed policy solution π_u(x_k) at the current state x_k, which is used to generate the action u_k. For problems with continuous state–action spaces, Q is represented not as a look-up table but as a parameterized structure, i.e., a neural network. Parameters of the function approximator are tuned via a supervised learning method based on error back-propagation. One can define a general way of moving from a unit of experience (x_k, u_k, d_k, c_k, x_{k+1}), as in Eq. (6), to a training example for Q, i.e., Q̂(x_k, u_k, d_k, w_k) should be

c_k + \gamma \min_{\pi_u \in PD(U(x_{k+1}))} \max_{d \in D(x_{k+1})} \sum_{u \in U(x_{k+1})} \hat{Q}(x_{k+1}, u, d, w_k) \, \pi_u

3.3. Estimation of game value and continuous action generation

Suppose the system state is x_k, and let the cardinality of the controller's action space U(x_k) and that of the disturber's D(x_k) each be 3. The game matrix at state x_k consists of Q̂(x_k, u, d, w_k) values supplied by the NN approximator and is given by

\begin{bmatrix} Q_{1,1} & Q_{1,2} & Q_{1,3} \\ Q_{2,1} & Q_{2,2} & Q_{2,3} \\ Q_{3,1} & Q_{3,2} & Q_{3,3} \end{bmatrix}    (7)

where Q_{i,j} = Q̂(x_k, u_i, d_j, w_k) ∀ u_i ∈ U(x_k), d_j ∈ D(x_k).
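The snippet below sketches how the game matrix of Eq. (7) could be assembled by querying a generic approximator Q̂(x, u, d; w) once per action pair. The `q_hat` callable and the three-element action sets are placeholders standing in for the paper's trained neural network and its action discretization.

```python
import numpy as np

def build_game_matrix(q_hat, x, U, D):
    """Assemble G with G[i, j] = Q_hat(x, u_i, d_j), as in Eq. (7).
    q_hat: callable approximator (e.g. a trained NN) returning a scalar cost estimate."""
    return np.array([[q_hat(x, u, d) for d in D] for u in U])

# Placeholder approximator and 3-action sets, for illustration only
q_hat = lambda x, u, d: float(np.sum(x) + 0.1 * u - 0.05 * d)
x = np.array([0.1, -0.2, 0.0, 0.3])      # e.g. (theta1, theta2, dtheta1, dtheta2)
U = [-50.0, 0.0, 50.0]                    # controller actions (assumed discretization)
D = [-10.0, 0.0, 10.0]                    # disturber actions (assumed discretization)
G = build_game_matrix(q_hat, x, U, D)     # 3x3 game matrix at state x
```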


Fig. 1. Proposed Markov game controller.

The game solver is a linear program [9] that solves this game matrix to generate a policy π_u(x_k) over the controller's action set, i.e., it solves for

\pi_u(x_k) = \arg\min_{\pi_u \in PD(U(x_k))} \max_{d \in D(x_k)} \sum_{u \in U(x_k)} \hat{Q}(x_k, u, d, w_k) \, \pi_u

The policy π_u(x_k): x_k → PD(U(x_k)) specifies a probability distribution over the controller's action set U(x_k), and we use this probability distribution to generate a continuous control u_{kc} at state x_k (in Littman's work [7], a discrete action u_k ∈ U(x_k) is selected at state x_k as per the probability distribution π_u(x_k)). The continuous action is calculated as:

u_{kc}(x_k) = \sum_{i=1}^{|U(x_k)|} \pi_{u_i}(x_k) \, u_i
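As a sketch of the game solver described above, the code below computes the controller's mixed policy for a given game matrix by linear programming (scipy.optimize.linprog is one possible choice; the paper only states that a linear program [9] is used) and then forms the continuous action as the probability-weighted sum of the discrete actions. The example 3×3 cost matrix is arbitrary.

```python
import numpy as np
from scipy.optimize import linprog

def solve_matrix_game(G):
    """Mixed minimax policy for the row (controller) player of a cost matrix G:
    minimize v subject to sum_i pi_i * G[i, d] <= v for every disturber action d,
    sum_i pi_i = 1, pi_i >= 0.  Returns (pi, v), where v is the game value as in Eq. (4)."""
    m, n = G.shape
    c = np.r_[np.zeros(m), 1.0]                       # objective: minimize v
    A_ub = np.c_[G.T, -np.ones(n)]                    # G^T pi - v <= 0, one row per column d
    b_ub = np.zeros(n)
    A_eq = np.c_[np.ones((1, m)), np.zeros((1, 1))]   # probabilities sum to 1
    b_eq = np.ones(1)
    bounds = [(0.0, None)] * m + [(None, None)]       # pi >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[m]

def continuous_action(pi, U):
    """Probability-weighted sum of the discrete controller actions."""
    return float(np.dot(pi, U))

# Toy usage with an arbitrary 3x3 cost matrix and an assumed action set U = [-50, 0, +50] N
G = np.array([[1.0, 2.0, 0.5],
              [0.8, 1.5, 1.2],
              [2.0, 0.6, 1.1]])
pi, value = solve_matrix_game(G)
u_kc = continuous_action(pi, np.array([-50.0, 0.0, 50.0]))
```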

The controller observes the resulting system state x_{k+1}. The game solver gets estimates of Q̂(x_{k+1}, u, d, w_k) ∀ u ∈ U(x_{k+1}), d ∈ D(x_{k+1}) from the NN approximator and solves for V(x_{k+1}):

V(x_{k+1}) = \min_{\pi_u \in PD(U(x_{k+1}))} \max_{d \in D(x_{k+1})} \sum_{u \in U(x_{k+1})} \hat{Q}(x_{k+1}, u, d, w_k) \, \pi_u

This value is used to update the NN approximator's tunable parameters, i.e., the weights and biases of the network.

4. Controller details and simulation

4.1. Two-link robot arm control

The physical system has been simulated for 20 s using a third-order Runge–Kutta method with a fixed time step of 20 ms. Simulation model details can be found in Appendix A. The robot manipulator has to follow a prescribed desired trajectory expressed in joint space as q_d(k) = (θ_{1d}(k), θ_{2d}(k), θ̇_{1d}(k), θ̇_{2d}(k)). Finding the control input τ(k) = [τ_1(k), τ_2(k)] for this is the controller design problem. In all controllers implemented for this task, there is an outer proportional-derivative (PD) tracking loop [13]. We evaluate the proposed approach and compare its performance against other schemes for the following cases:

(i) Control versus external disturbances
The controller handles external disturbance torques having a uniform distribution within ±20% of the torque values applied at each joint.
(ii) Control versus external disturbances and varying payload mass
The controller handles, in addition to external disturbances as in (i), variation of the payload mass picked up by the robotic arm. The end-effector mass m_2 is varied with time, which corresponds to the robotic arm picking up and releasing payloads of different masses. The mass m_2 is varied as:
(a) For t < 4 s, m_2 = 1 kg.
(b) For 4 < t ≤ 7 s, m_2 = 2.5 kg.
(c) For 7 < t ≤ 9 s, m_2 = 1 kg.
(d) For 9 < t ≤ 12 s, m_2 = 4 kg.
(e) For 12 < t ≤ 15 s, m_2 = 1 kg.
(f) For 15 < t ≤ 18 s, m_2 = 2 kg.
(g) For 18 < t ≤ 20 s, m_2 = 1 kg.

4.1.1. Robust standard reinforcement-learning controller
Direct-reinforcement-learning-adaptive (DRAL) [13] is a Lyapunov theory-based RL control scheme that guarantees both system tracking stability and error convergence for the closed-loop system. The scheme incorporates, in addition to an outer PD tracking loop, an inner loop consisting of an action-generating NN that uses the reinforcement signal as an adaptive-learning signal. We have implemented a digital version of the DRAL controller, i.e., the control torque remains constant during the sampling interval. Further details on the controller architecture and controller parameters can be found in [13]. The DRAL scheme cannot be applied for controlling systems where no desired system trajectory is available, as in the case of the inverted pendulum swing-up problem.

4.1.2. H∞ controller
In robust reinforcement learning (RRL), an H∞ formulation has been used wherein controller design is viewed as a differential game in which a 'disturbing agent' tries to make the worst possible disturbance while a 'control agent' tries to make the best control output; details can be found in [12].


For nonlinear problems, there is no analytical way of solving the HJI equation to find the best control output [12]. We have implemented the model-free actor–disturber–critic formulation of RRL [12] for the discrete-time case. The actor–disturber–critic scheme involves simultaneous training of three function approximators:

- Critic: V(x_k) = V(x_k, w_c), where w_c is the parameter vector of the critic network.
- Actor (controller): u_k = σ_u(A_u(x_k, w_u)) + n^u_k, where w_u is the parameter vector of the actor network.
- Disturber: d_k = σ_d(A_d(x_k, w_d)) + n^d_k, where w_d is the parameter vector of the disturber network.

Here σ_u, σ_d are monotonically increasing functions, e.g., the logsig function, and n^u_k, n^d_k are the exploration noises in the actions of the controller and disturber, respectively. For each joint, three networks, one each for the actor, disturber and critic, have been used. The critic network has 16 neurons, while the networks for the actor and disturber use eight neurons each. The learning-rate parameter has been taken as 0.01 for all the networks. The robustness parameter has been kept the same as in [12]. The parameters of the three networks are updated using a discrete-time variant of the update equations given in [12].

4.1.3. Markov game controller
There are two function approximators, one each for the two links. Both have 16 neurons, and the learning-rate parameter has been fixed at 0.02. The rest of the parameters have been kept the same as in Section 4.2.1.
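A structural sketch of the actor–disturber–critic baseline of Section 4.1.2, assuming tiny single-hidden-layer networks of a common width; only the forward passes implied by the text are shown (value estimate, squashed actor and disturber outputs plus exploration noise, with tanh standing in for the logsig mentioned above, and the output scaling being a further illustrative assumption). The update rules themselves follow [12] and are not reproduced here.

```python
import numpy as np

def mlp(x, W1, b1, W2, b2):
    """One hidden layer with tanh, linear output."""
    return np.tanh(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
n_x, n_h = 4, 16                                   # state size and hidden width (illustrative)
params = {name: (rng.normal(0, 0.1, (n_x, n_h)), np.zeros(n_h),
                 rng.normal(0, 0.1, (n_h, 1)), np.zeros(1))
          for name in ("critic", "actor", "disturber")}

def forward(x, u_max=50.0, d_max=10.0, noise_scale=0.1):
    """Critic value V(x; w_c); actor u = sigma_u(A_u(x; w_u)) + noise; disturber likewise.
    Output scaling to an action range is an illustrative assumption."""
    V = mlp(x, *params["critic"])[0]
    u = u_max * np.tanh(mlp(x, *params["actor"])[0]) + noise_scale * rng.normal()
    d = d_max * np.tanh(mlp(x, *params["disturber"])[0]) + noise_scale * rng.normal()
    return V, u, d

V, u, d = forward(np.array([0.1, -0.2, 0.0, 0.3]))
```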

4.2. Inverted pendulum swing-up

We also evaluated the proposed approach on a simulated inverted pendulum control task [4]. Simulation model and cost function details can be found in Appendix B. Each trial is started from an initial state x_0 = (θ_0, θ̇_0), where θ_0 and θ̇_0 are selected from a uniform distribution that ranges between 0 and 0.01. A trial is terminated when it successfully balances the pole for 3000 time steps, corresponding to 5 min of continuous balancing in real time, or upon failure, as in [4]. We repeat the experiment a total of 100 times. We evaluate the proposed approach and compare its performance against other schemes for the following cases:

(i) Control versus external disturbances
The controller handles uniform disturbances in [-10, +10] N, or 20% of the applied force value, as well as a deterministic noise of 10 N.
(ii) Control versus external disturbances and parameter variations
The controller handles, in addition to disturbances as in (i), parameter variations of the controlled system, i.e., variations in the mass and length of the pendulum:
(a) For trials < 400, m = 2.0 kg and l = 0.5 m.
(b) For 400 ≤ trials < 700, m = 1.0 kg and l = 0.5 m.
(c) For 1000 ≤ trials < 1300, m = 1.0 kg and l = 0.25 m.
(d) For trials > 1300, m = 4.0 kg and l = 1 m.

Function approximation: The neural network used for function approximation is a two-layer structure with four inputs: θ_k, θ̇_k, u_k and d_k. The hidden layer has 18 neurons and the output layer contains a single neuron. The hidden layer neurons use the sigmoid activation function given by y = (1 - e^{-x})/(1 + e^{-x}), i.e., the tansig function, and the output layer uses the linear function purelin.

4.2.1. Markov game controller
The controller chooses a continuous control action u_{kc} ∈ [-50, +50] N, i.e., the action set for the controller is U = {-50, 0, +50} N with |U| = 3, and the action set for the disturber is D = {(<0), 0, (>0)} N with |D| = 3. At each state x_k, the controller and disturber play out a zero-sum Markov game whose solution gives a probability distribution π_u(x_k) over the controller's discrete action set. The continuous control action u_{kc} derived using the policy π_u(x_k) is applied to the controlled system. The exploration level ε is decayed from 0.5 to 0.02 over the trials. The discount factor γ of the process is set to 0.8 and the learning-rate parameter h is set to 0.01.

4.2.2. H∞ controller
The actor–disturber–critic architecture described in Section 4.1.2 has been used. The parameters of the three networks are updated using a discrete-time variant of the update equations given in [12]. Other learning parameters and robustness parameter values have been taken from [12].

5. Results

5.1. Two-link robot arm control

5.1.1. Control versus external disturbances and parameter variations

5.1.1.1. Output tracking errors. Figs. 2–4 show the output tracking errors (both links) of the robotic arm for all controllers, i.e., the H∞, robust standard reinforcement learning and Markov game controllers. A small error after the transient period shows that the Markov game controller is robust to both external disturbances and payload mass variations. A comparative evaluation of the results shows the Markov game controller's superiority over the H∞ and robust standard RL controllers.

5.1.1.2. Control effort requirement. A good controller should exert only an optimal control effort for achieving the desired performance. Figs. 5–7 bring out the control effort exerted by each controller at both joints for controlling the robotic arm. The maximum torque values and the overall norm of the control torques exerted by the Markov game controller are less than those of its counterparts.

5.1.2. Control versus external disturbances

5.1.2.1. Output tracking errors. A comparison of Figs. 8–10 shows that when only external disturbances are present, all the controllers perform well. Here again, the Markov game controller's performance is slightly better.


Fig. 2. Output tracking errors: H∞ controller.

Fig. 5. Joint torques for joint 1 (θ_1) and joint 2 (θ_2): H∞ controller.

Fig. 3. Output tracking errors: robust standard RL controller.

5.1.2.2. Control effort requirement. All controllers exert almost identical control effort at the joints when only external disturbances are present.

5.2. Inverted pendulum swing-up

The robust standard RL (DRAL) scheme cannot be applied to this problem as there exists no prior desired or reference trajectory for balancing the pendulum.

Fig. 4. Output tracking errors: Markov game controller.

5.2.1. Control versus external disturbances

5.2.1.1. Markov game controller. For the Markov game controller, with 1000 trials the expected number of balancing steps is about 2993. Fig. 11 also shows the best and worst controller performance (in terms of balancing steps) as a function of the number of trials. With 1000 trials, even the worst-case controller performance is 1512 steps, i.e., it could balance the pendulum for at least half the time. As can be seen from Fig. 12, the controller is able to achieve 63% success in just about 200 trials, and thereafter it improves to about 97% with 1000 trials.


Fig. 6. Joint torques for joint 1 (θ_1) and joint 2 (θ_2): robust standard RL controller.

Fig. 7. Joint torques for joint 1 (θ_1) and joint 2 (θ_2): Markov game controller.

Fig. 13 shows that the pole angle's maximum deviation from the vertical is only 0.14 rad, or θ ≈ 8°, even though the failure condition is |θ| > π/2.

5.2.1.2. H∞ controller. Using the cost function and learning parameters of the networks as in [12], the controller attained its first successful balance within 110 trials, and thereafter it achieved and maintained a success probability of 99% all along.

5.2.2. Control versus external disturbances and parameter variations

Figs. 14–16 show the performance of the Markov game controller in handling the external disturbances and parameter variations. Results have been averaged over 50 episodes, each of 1500 trials. From Fig. 15, the Markov game controller not only handles the disturbances well but also achieves a high success probability. Figs. 14–16 also show the performance obtained with a standard non-robust RL controller, i.e., a Q-controller, clearly bringing out the inadequacy of the MDP setup for handling noise and disturbances.

5.2.2.1. H∞ controller. On implementing the H∞ (RRL) controller with parameters as explained in Section 5.2.1, the controller achieved a success probability of 99% within 110 trials. When the mass and length of the pendulum were changed (Section 4.2), the controller failed only twice or thrice and again attained a high success rate of 99%. The H∞ controller is found to be comparatively superior to the Markov game controller for the pendulum swing-up task.

Fig. 8. Output tracking errors: H∞ controller.


Fig. 12. Probability of success: Markov game controller.

Fig. 9. Output tracking errors: robust standard RL controller.

Fig. 10. Output tracking errors: Markov game controller.

Fig. 13. Pole angle trajectory for a successful trial.

Fig. 11. Average balancing steps: Markov game controller.

Fig. 14. Disturbance tolerance: probability of success.


Fig. 15. Disturbance tolerance: average balancing length.

Fig. 16. Disturbance tolerance: minimum balancing length.

6. Conclusions and future work

The paper shows that the zero-sum Markov game framework is an effective platform for designing robust controllers for nonlinear systems. The work also illustrates the usefulness of generic function approximators such as neural networks for extending Markov games to continuous state–action spaces. The Markov game formulation offers a general control scheme applicable to any control problem, i.e., problems with or without a predefined desired reference trajectory and problems having a delayed or immediate cost/reward scheme. We have shown that the Markov game controller can deal with external disturbances as well as parameter variations. The proposed control methodology is tested on a two-link robotic arm control task. We also applied the approach to an inverted pendulum swing-up task. The results reveal that in the case of the two-link robotic arm, the proposed approach is superior to the H∞-based algorithm (RRL) of Morimoto and Doya [12] and the robust standard RL-based approach (DRAL) of Kim and Lewis [13]. However, for the inverted pendulum swing-up task, it is observed that with the cost function and learning parameters as given in [12], the RRL algorithm gives better results. This corroborates the known fact that for nonlinear systems, no universally superior scheme is possible; the performance of any scheme is highly dependent on the types of nonlinearities involved. Our paper adds one potential tool to the toolkit of designers for control of nonlinear systems. We are working on making the current approach more robust by incorporating ideas from game theory that allow the controller's action selection to be based on the opponent's observed behavior, such as the 'fictitious play' approach. Further, we would attempt to bring an element of 'universal consistency' into the controller's behavior by using the 'cautious fictitious play' approach from the game theory literature.

Appendix A. Two-link robot arm control task

Fig. A1. Simulation model for two-link robotic manipulator.

The simulation model (Fig. A1) for the two-link robot arm has been taken from [15]. A two-link robot arm has all the nonlinear effects common to general robot manipulators and is easy to simulate. The dynamical equations of the robotic manipulator are:

- Manipulator dynamics:

\begin{bmatrix} a + b + 2h\cos\theta_2 & b + h\cos\theta_2 \\ b + h\cos\theta_2 & b \end{bmatrix} \begin{bmatrix} \ddot{\theta}_1 \\ \ddot{\theta}_2 \end{bmatrix} + \begin{bmatrix} -h(2\dot{\theta}_1\dot{\theta}_2 + \dot{\theta}_2^2)\sin\theta_2 \\ h\dot{\theta}_1^2\sin\theta_2 \end{bmatrix} + \begin{bmatrix} a e_1\cos\theta_1 + h e_1\cos(\theta_1 + \theta_2) \\ h e_1\cos(\theta_1 + \theta_2) \end{bmatrix} = \begin{bmatrix} \tau_1 \\ \tau_2 \end{bmatrix}

a = (m_1 + m_2) a_1^2, \quad b = m_2 a_2^2, \quad h = m_2 a_1 a_2, \quad e_1 = g/a_1

- Manipulator parameters: a_1 = a_2 = 1 m, m_1 = m_2 = 10 kg.

- Desired trajectory: θ_{1,desired} = sin(t), θ_{2,desired} = cos(t).

- System state space: continuous, with four variables, i.e., x_k = (θ_1, θ_2, θ̇_1, θ̇_2).

The task is to apply a sequence of torques τ(k) = [τ_1(k), τ_2(k)] at each joint so that the joints follow a prescribed trajectory in joint space, i.e., q_d = [θ_{1,desired}, θ_{2,desired}].
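A sketch of the forward dynamics implied by the manipulator equations above: the joint accelerations are obtained by solving the 2×2 inertia system. Parameter values follow Appendix A, the gravity constant g = 9.8 m/s² is an assumption (it is stated only in Appendix B), and the plain Euler step shown is for illustration only, whereas the paper integrates with a third-order Runge-Kutta method at a 20 ms step.

```python
import numpy as np

# Appendix A parameters: link lengths and masses (m2 is the varying end-effector/payload mass)
a1 = a2 = 1.0          # m
m1, m2 = 10.0, 10.0    # kg
g = 9.8                # m/s^2 (assumed value, taken from Appendix B)

def arm_accel(theta, dtheta, tau):
    """Joint accelerations of the two-link arm: M(theta) ddtheta + V(theta, dtheta) + G(theta) = tau."""
    a = (m1 + m2) * a1**2
    b = m2 * a2**2
    h = m2 * a1 * a2
    e1 = g / a1
    t1, t2 = theta
    dt1, dt2 = dtheta
    M = np.array([[a + b + 2*h*np.cos(t2), b + h*np.cos(t2)],
                  [b + h*np.cos(t2),       b]])
    V = np.array([-h*(2*dt1*dt2 + dt2**2)*np.sin(t2),
                   h*dt1**2*np.sin(t2)])
    G = np.array([a*e1*np.cos(t1) + h*e1*np.cos(t1 + t2),
                  h*e1*np.cos(t1 + t2)])
    return np.linalg.solve(M, np.asarray(tau) - V - G)

def euler_step(theta, dtheta, tau, dt=0.02):
    """One 20 ms step (Euler shown here; the paper uses third-order Runge-Kutta)."""
    ddtheta = arm_accel(theta, dtheta, tau)
    return theta + dt*dtheta, dtheta + dt*ddtheta

theta, dtheta = np.zeros(2), np.zeros(2)
theta, dtheta = euler_step(theta, dtheta, tau=[5.0, 2.0])
```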

Appendix B. Inverted pendulum swing-up task

Fig. B1. Inverted pendulum problem.

The controller has to balance a pendulum of unknown mass and length in an upright position by applying a sequence of forces to the cart, as shown in Fig. B1. The state space of the problem is continuous and consists of the vertical angle of the pole and its angular velocity, (θ, θ̇). The transitions are governed by the nonlinear dynamics of the system [4] and depend on the current state x_k = (θ_k, θ̇_k) and the current control F:

\ddot{\theta}_k = \frac{g\sin(\theta_k) - a m l \dot{\theta}_k^2 \sin(2\theta_k)/2 - a\cos(\theta_k) F}{(4/3)\, l - a m l \cos^2(\theta_k)}    (B1)

where g is the gravity constant (g = 9.8 m/s²), m the mass of the pendulum (m = 2.0 kg), M the mass of the cart (M = 8.0 kg), l the length of the pendulum (l = 0.5 m) and a = 1/(M + m). The system is simulated by numerically approximating the equation of motion using Euler's method with a time step of T = 0.1 s, as suggested in [4], and discrete-time state equations of the form θ_{k+1} = θ_k + T θ̇_k. The pole must be kept within π/2 of vertical by applying a sequence of forces. We wish to control the pendulum using delayed reinforcement learning [10], i.e., the learning is based on a delayed reinforcement signal, which is a failure. The task is a difficult assignment-of-credit problem, as failure may occur only after a very long sequence of actions and no analytic objective function is available. We take the one-step costs as:

c = \begin{cases} 1 & \text{if } |\theta| > \pi/2 \text{ (termination condition)} \\ 0 & \text{otherwise} \end{cases}
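A minimal sketch of the simulation described in this appendix, assuming only what the text states: Eq. (B1) gives the angular acceleration, the state is advanced with Euler's method at T = 0.1 s, and the one-step cost is 1 on failure (|θ| > π/2) and 0 otherwise.

```python
import numpy as np

g, m, M, l = 9.8, 2.0, 8.0, 0.5        # gravity, pole mass, cart mass, pole length
a = 1.0 / (M + m)
T = 0.1                                 # Euler time step, as in [4]

def pendulum_accel(theta, dtheta, F):
    """Angular acceleration from Eq. (B1)."""
    num = g*np.sin(theta) - a*m*l*dtheta**2*np.sin(2*theta)/2 - a*np.cos(theta)*F
    den = (4.0/3.0)*l - a*m*l*np.cos(theta)**2
    return num / den

def step(theta, dtheta, F):
    """One discrete-time transition and its cost: c = 1 on failure (|theta| > pi/2), else 0."""
    ddtheta = pendulum_accel(theta, dtheta, F)
    theta_next = theta + T*dtheta
    dtheta_next = dtheta + T*ddtheta
    failed = abs(theta_next) > np.pi/2
    cost = 1.0 if failed else 0.0
    return theta_next, dtheta_next, cost, failed

# One transition from a near-upright state with a +50 N push
print(step(0.01, 0.0, 50.0))
```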

References

[1] E. Altman, A. Hordijk, Zero-sum Markov games and worst-case optimal control of queueing systems, Invited paper, QUESTA, Optim. Queueing Syst. 21 (1995) 415–447 (special issue).
[2] D.P. Bertsekas, J.N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Belmont, MA, 1996.
[3] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed., Prentice-Hall, Englewood Cliffs, NJ, 1999.
[4] M.G. Lagoudakis, R. Parr, Least-squares policy iteration, J. Mach. Learn. Res. 4 (2003) 1107–1149.
[5] M.G. Lagoudakis, R. Parr, Value function approximation in zero-sum Markov games, in: Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence (UAI 2002), Edmonton, Alta., Canada, August 1–4, 2002, pp. 283–292.
[6] M.L. Littman, Value-function reinforcement learning in Markov games, J. Cogn. Syst. Res. 2 (2001) 55–66.
[7] M.L. Littman, Markov games as a framework for multi-agent reinforcement learning, in: Proceedings of the 11th International Conference on Machine Learning, Morgan Kaufmann, San Francisco, CA, 1994.
[8] G. Owen, Game Theory, 2nd ed., Academic Press, Orlando, FL, 1982.
[9] G. Strang, Linear Algebra and its Applications, Academic Press, Orlando, FL, 1980.
[10] C.J.C.H. Watkins, Learning from delayed rewards, Ph.D. Thesis, King's College, Cambridge, UK, 1989.
[11] K. Zhou, J.C. Doyle, K. Glover, Robust and Optimal Control, Prentice-Hall, New Jersey, 1996.
[12] J. Morimoto, K. Doya, Robust reinforcement learning, Neural Comput. 17 (2005) 335–359.
[13] Y.H. Kim, F.L. Lewis, Direct-adaptive-reinforcement-learning neural network control for nonlinear systems, in: Proceedings of the American Control Conference, Albuquerque, NM, June 1997, pp. 1804–1808.
[14] E. Altman, Flow control using the theory of zero-sum Markov games, IEEE Trans. Automat. Control 39 (1994) 814–818.
[15] F.L. Lewis, S. Jagannathan, A. Yesildirek, Neural Network Control of Robotic Manipulators and Nonlinear Systems, Taylor & Francis, London, 1999.