
Automatica, Vol. 12, pp. 177-183. Pergamon Press, 1976. Printed in Great Britain

On Stochastic Dynamic Stackelberg Strategies*†

DAVID CASTANON‡ and MICHAEL ATHANS§

In dynamic two-person games with alternating decision, the feedback Stackelberg solution concept represents an adaptive equilibrium between independent players; this concept applies to a class of stochastic multi-stage games. Summary--Feedback Stackelberg strategies are considered for two-person linear multi-stage games with quadratic performance criteria and noisy measurements. Explicit solutions are given when the information sets are nested; the solutions are closely related to the 'separation theorem' of stochastic control.

I. INTRODUCTION

In recent years, considerable attention has been given in the literature to differential and multi-stage games [1-3, 5]; most of it has dealt with deterministic games of perfect information. Behn and Ho [3] and Rhodes and Luenberger [1] considered the solution of pursuit-evasion games where each player made noise-corrupted linear measurements of the state. The problem was studied in more detail by Willman [4]; without additional restrictions, the general formulation of the problem leads to solutions involving infinite-dimensional estimators. The concept of non-zero-sum games is well known in static competitive economics, and has recently been introduced to dynamic games [5-7]. The solution of a non-zero-sum game is defined in terms of the approach taken by the players in defining optimality. The Stackelberg solution to the game is obtained when one of the players is forced to wait until the other player announces his decision before making his own decision. Chen and Cruz [7] and Cruz and Simaan [8] have discussed extensively the Stackelberg solutions of dynamic games with perfect information.

In this paper, optimal Stackelberg strategies are obtained for a given class of games, consisting of multi-stage games where the state evolves linearly and each player attempts to minimize a quadratic cost functional of his own decisions, the states and the other player's decisions over the duration of the game. At each stage, each player makes his decision based on the information available to him through his observations of the state. Furthermore, one of the players (the leader) makes his decision first at all stages. Problems of this nature arise frequently in economics, where decisions must be made by two different parties and one of them is subordinated to the other, and hence must wait for the other party's decision before formulating its own. Previous results on Stackelberg strategies [7] have concerned themselves primarily with open-loop strategies. Cruz and Simaan [8, 9] introduced the notion of feedback Stackelberg strategies, which are adaptive strategies satisfying Bellman's principle of optimality [10]. In the present problem the information available to each player changes at each stage, so the optimal strategies will be adaptive. In many cases the optimal strategies for each player are closely related to the separation principle of optimal control; that is, the optimal strategy for the stochastic game consists of the same feedback laws used for the deterministic game, with the minimum-variance estimate of the state as argument. Some comments will be made on the limitations of these cases, and on possible extensions of this work to other open problems in the area.

The notation used throughout the paper conforms closely to that which is common in optimal control theory. Matrices are denoted by upper-case letters, vectors by lower-case letters, and stages by the argument (t); the value of a function z at stage t is z(t). Euclidean n-space is R^n. Lower-case Greek letters denote gaussian random variables of zero mean, while upper-case Greek letters denote their covariances. M' is the transpose of matrix M, and tr{M} is its trace. A gaussian random variable x will be indicated as x = N(x̄, Σ), where x̄ is its mean and Σ its covariance. E(x) is the expected value of x, and E(x|z) denotes the conditional expectation of x given z.

* Received 12 August 1974; revised 25 March 1975; revised 30 September 1975. The original version of this paper was presented at the 6th IFAC Congress which was held in Boston/Cambridge, Mass., U.S.A., during August 1975. The published proceedings of this IFAC meeting may be ordered from: ISA (Instrument Society of America), 400 Stanwix Street, Pittsburgh, PA 15222, U.S.A. It was recommended for publication in revised form by associate editor I. Rhodes.
† This research was conducted at the Decision and Control Sciences Group of the MIT Electronic Systems Laboratory with partial support provided by the Air Force Office of Scientific Research under grant AF-AFOSR-72-2273.
‡ Graduate student, Department of Mathematics, MIT, Cambridge, MA 02139, U.S.A.
§ Professor of Electrical Engineering, Room 35-308, MIT, Cambridge, MA 02139, U.S.A.

II. PROBLEM STATEMENT

Problem 1. Consider the stochastic multi-stage game where x(t), the system state at stage t, evolves as

x(t+1) = A(t)x(t) + B(t)u(t) + C(t)v(t) + θ(t),  t = 0, ..., N-1,   (1)

z1(t) = H1(t)x(t) + ξ1(t),
z2(t) = H2(t)x(t) + ξ2(t),  t = 1, ..., N,   (2)

where x(t) ∈ R^n, u(t) ∈ R^m, v(t) ∈ R^l, z1(t) ∈ R^p, z2(t) ∈ R^q for all t. The vectors θ(t), ξ1(t), ξ2(t) and x(0) are independent gaussian random vectors for all t, where x(0) = N(x̄0, Σ0), θ(t) = N(0, Θ(t)) and ξi(t) = N(0, Ξi(t)). The matrices A, B, C, H1 and H2 are bounded for all t and of appropriate dimensions. In equation (1), u(t) represents the decision of the leader and v(t) the decision of the follower. Under these assumptions the state vector x is a discrete gaussian Markov process whose evolution is described by (1). At the start of the game, each player seeks to minimize the expected value of the cost Ji, where

Ji(x, u, v, 0) = x'(N)Qi(N)x(N) + Σ_{t=0}^{N-1} [x'(t)Qi(t)x(t) + u'(t)Ri(t)u(t) + v'(t)Si(t)v(t)],  i = 1, 2.   (3)
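As a concrete illustration of the model (1)-(3), the short Python sketch below rolls out the dynamics once and accumulates both players' quadratic costs. The horizon, dimensions and matrices are arbitrary illustrative choices, and the decision rules are passed only the current measurement for brevity; the information sets used in the rest of the paper contain the full measurement and decision histories.

```python
import numpy as np

# Illustrative, stage-invariant problem data (hypothetical values).
rng = np.random.default_rng(0)
N = 5                                 # horizon
n, m, l, p, q = 2, 1, 1, 1, 1         # dims of x, u, v, z1, z2
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])          # leader input matrix
C = np.array([[0.0], [0.05]])         # follower input matrix
H1 = np.array([[1.0, 0.0]])           # leader measurement matrix
H2 = np.array([[0.0, 1.0]])           # follower measurement matrix
Theta = 0.01 * np.eye(n)              # covariance of theta(t)
Xi1, Xi2 = 0.01 * np.eye(p), 0.04 * np.eye(q)   # covariances of xi1(t), xi2(t)
Q1, Q2 = np.eye(n), np.eye(n)
R1, S1 = np.eye(m), 0.1 * np.eye(l)
R2, S2 = 0.1 * np.eye(m), np.eye(l)

def simulate(leader_policy, follower_policy, x0):
    """Roll out (1)-(2) once and return the realized costs (3) of both players."""
    x, J1, J2 = x0.copy(), 0.0, 0.0
    for t in range(N):
        z1 = H1 @ x + rng.multivariate_normal(np.zeros(p), Xi1)  # eq. (2)
        z2 = H2 @ x + rng.multivariate_normal(np.zeros(q), Xi2)
        u = leader_policy(t, z1)         # leader decides first
        v = follower_policy(t, z2, u)    # follower observes u(t) before deciding
        J1 += x @ Q1 @ x + u @ R1 @ u + v @ S1 @ v
        J2 += x @ Q2 @ x + u @ R2 @ u + v @ S2 @ v
        x = A @ x + B @ u + C @ v + rng.multivariate_normal(np.zeros(n), Theta)  # eq. (1)
    return J1 + x @ Q1 @ x, J2 + x @ Q2 @ x   # add terminal costs

# Example: costs under (suboptimal) zero decision rules.
J1, J2 = simulate(lambda t, z1: np.zeros(m), lambda t, z2, u: np.zeros(l),
                  x0=np.array([1.0, 0.0]))
```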

At a given stage t, the information sets include exact knowledge of the system dynamics (1), the measurement rules of both players (2) and the cost functionals of both players (3). Additionally, they include exact knowledge of all decisions made by each player up to stage t-1 and the statistics of the random elements θ(t), ξ1(t) and ξ2(t) for all t. The leader's and the follower's information sets contain the values of all measurements up to stage t-1 available to the leader and the follower, respectively. Additionally, the follower's information set contains the exact value of the leader's decision at time t. The stochastic nature of Problem 1 indicates that desirable strategies are closed-loop strategies based on the information available to each player. Cruz and Simaan [9] define two different closed-loop solutions in the Stackelberg sense: the closed-loop Stackelberg solutions and the feedback Stackelberg solutions. The closed-loop Stackelberg solutions are feedback laws determined at the start of the game, with the underlying assumption that both

players will use these strategies throughout the game. Should either player deviate from his strategy at any stage, the remaining decisions would cease to be optimal. In a competitive situation, or any time the players' actual decisions differ from their optimal decisions, this solution concept is inadequate; it assumes that the leader has complete control over both his own decisions and the follower's decisions. The feedback Stackelberg solution concept, on the other hand, represents an equilibrium for both players: should one player follow his optimal strategy, the other player's decisions which minimize his cost are given by his optimal feedback Stackelberg strategies. Additionally, the feedback Stackelberg strategies, with their stage-by-stage definition, retain their optimal properties after any suboptimal play. In that sense, they are adaptive strategies. They are the appropriate concept of an optimal solution when both players make decisions independently and stochastic elements are present.

The optimal feedback Stackelberg strategies (hereinafter referred to as OFSS) are found through the following procedure [9]: at each stage, the leader computes the follower's expected reaction to his decision, based on minimizing the follower's expected cost-to-go assuming that both players will use their OFSS in the future. The leader then seeks to minimize his expected cost-to-go assuming that the follower will respond as expected, and that both players will use the OFSS in the future. The follower then uses the leader's decision to compute his optimal decision v(t) by similarly minimizing his expected cost-to-go. These expectations are conditioned on the information sets available to each player. The solution obtained in this fashion has the property that it is adaptive; at any instant of time and from any state, it provides the leader with the best choice of control in the Stackelberg sense, regardless of previous decisions, under the assumption that the OFSS will be used in the future. The equations that define these optimal solutions are as follows. Let argmin f(u) denote the value of u at which f(u) achieves its absolute minimum. Then

v0(u, t) = argmin_v E{J2(x, u, v, t) | Z2(t)},   (5)

u*(t) = argmin_u E{J1(x, u, v0(u, t), t) | Z1(t)},   (6)

v*(t) = v0(u*(t), t).   (7)

v0(u, t) is the follower's optimal reaction to a decision u by the leader. Hence, the leader chooses a strategy that minimizes his expected cost-to-go assuming the follower will respond optimally; thus, u*(t) is the leader's optimal decision, and v*(t) the follower's optimal response at stage t. The assumption that both players will play optimally in the future, although not explicitly included in equations

(5)-(7), is an essential part of the definition. Denote the optimal cost-to-go at stage t by

Ji*(t) = E{Ji(x, u, v, t) | Zi(t), u(j) = u*(j), v(j) = v*(j), j = t, ..., N-1},  i = 1, 2.

We can now rewrite equations (5)-(7) to include this assumption, defining the OFSS:

v0(u, t) = argmin_{v(t)} E{x'Q2x + u'R2u + v'S2v + J2*(t+1) | Z2(t)},   (8)

u*(t) = argmin_{u(t)} E{x'Q1x + u'R1u + v0'S1v0 + J1*(t+1) | Z1(t)},   (9)

v*(t) = v0(u*(t), t),   (10)

where J1*(t+1) and J2*(t+1) depend on u(t), v(t) through the random variable x(t+1). Since x is a Markov process, the conditional expectations in (8)-(10) are dependent on the random variables x(t) and {θ(j), j = t, ..., N-1}. It follows that, since x(t) and {θ(j), j = t, ..., N-1} are mutually independent, the probability density is

p(x, {θ}_t^{N-1}) = p(x(t)) p({θ}_t^{N-1}).   (11)

Thus

E{J1*(t+1) | Z1(t)} = E{J1(x(t+1), u*, v*, t+1) | Z1(t)}   (12)

= E{J1(A(t)x(t) + B(t)u(t) + C(t)v(t) + θ(t), u*, v*, t+1) | Z1(t)}.   (13)

We can now use stochastic dynamic programming to obtain the solution using equations (7)-(10). Consider the Nth stage, where no decisions are made. The expected cost-to-go at stage N for the players is

J1*(N) = E{x'(N)Q1(N)x(N) | Z1(N)},   (14)

J2*(N) = E{x'(N)Q2(N)x(N) | Z2(N)}.   (15)

Using these results in (9) and (10), the optimal strategies for both players can be obtained at stage N-1. This process is repeated until the entire chain of optimal decisions is determined. We will now examine some specific cases where this procedure is carried out.
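The stage-wise order of play in (8)-(10) can be illustrated numerically before specializing to particular information patterns. The sketch below treats one stage with scalar decisions and a known state: the follower's reaction v0(u) is obtained by minimizing his cost-to-go for a fixed leader decision, and the leader minimizes his own cost-to-go with that reaction substituted. The cost functions J1_to_go and J2_to_go, the numerical values, and the use of a generic optimizer are illustrative assumptions, not the paper's construction.

```python
import numpy as np
from scipy.optimize import minimize

def follower_reaction(u, x, J2_to_go):
    """v0(u): follower's minimizing response to a fixed leader decision u (cf. (8))."""
    res = minimize(lambda v: J2_to_go(x, u, v[0]), x0=[0.0])
    return res.x[0]

def leader_decision(x, J1_to_go, J2_to_go):
    """u*: leader minimizes his cost-to-go anticipating v0(u) (cf. (9));
    the follower then plays v* = v0(u*) (cf. (10))."""
    res = minimize(lambda u: J1_to_go(x, u[0],
                                      follower_reaction(u[0], x, J2_to_go)),
                   x0=[0.0])
    u_star = res.x[0]
    return u_star, follower_reaction(u_star, x, J2_to_go)

# Example single-stage quadratic costs-to-go (terminal weights set to 1).
A_, B_, C_ = 1.0, 0.5, 0.3
J1 = lambda x, u, v: x**2 + u**2 + 0.1 * v**2 + (A_*x + B_*u + C_*v)**2
J2 = lambda x, u, v: x**2 + 0.1 * u**2 + v**2 + (A_*x + B_*u + C_*v)**2
u_star, v_star = leader_decision(1.0, J1, J2)
```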

III. THE DETERMINISTIC CASE

Problem 2. Suppose both players have perfect information of the state, as a special case of Problem 1. Then, since Z1(N) and Z2(N) both contain knowledge of x(N), (14) and (15) become

J1*(N) = x'(N)Q1(N)x(N),   (16)

J2*(N) = x'(N)Q2(N)x(N).   (17)

Let us make the inductive assumption that Jl*(t) = x'Kl(t)x + Πl(t), l = 1, 2, for some deterministic matrix Kl(t) and function Πl(t). This assumption can be proven using the principles of dynamic programming. The optimal strategies are

u*(t) = -W(t)⁻¹Y(t)x(t),   (18)

v0(u, t) = -(S2(t) + C'(t)K2(t+1)C(t))⁻¹C'(t)K2(t+1)(A(t)x(t) + B(t)u(t)) = -Λ(t)(A(t)x + B(t)u),   (19)

where, dropping the argument t for brevity,

W = R1 + B'Λ'S1ΛB + B'(I - CΛ)'K1(t+1)(I - CΛ)B,   (20)

Y = B'Λ'S1ΛA + B'(I - CΛ)'K1(t+1)(I - CΛ)A,   (21)

L = Q1 + A'Λ'S1ΛA + A'(I - CΛ)'K1(t+1)(I - CΛ)A.   (22)

This assumes the required inverses exist. We will state sufficient conditions for the existence of these inverses further on. The optimal costs-to-go are

J1*(t) = x'K1(t)x + Π1(t),   (23)

J2*(t) = x'K2(t)x + Π2(t),   (24)

where

K1(t) = L(t) - Y'(t)W⁻¹(t)Y(t),  K1(N) = Q1(N),   (25)

Π1(t) = Π1(t+1) + tr{Θ(t)K1(t+1)},  Π1(N) = 0,   (26)

K2(t) = Q2 + (A - BW⁻¹Y)'K2(t+1)(I - CΛ)(A - BW⁻¹Y) + Y'W⁻¹R2W⁻¹Y,  K2(N) = Q2(N),   (27)

Π2(t) = Π2(t+1) + tr{Θ(t)K2(t+1)},  Π2(N) = 0.   (28)

The right-hand sides of equations (20)-(28) depend only on known parameters of the game (i.e. A(t), B(t), etc.) and future values of the cost-to-go matrices K1(t) and K2(t) and functions Π1(t) and Π2(t). Thus, we can solve them backwards in time using the initial conditions K1(N) = Q1(N), K2(N) = Q2(N), Π1(N) = 0 = Π2(N). The solutions of these equations provide the optimal feedback matrices for the implementation of the adaptive Stackelberg strategies. Figure 1 contains a block representation of the final optimal strategies. Notice that the follower implements his optimal strategy after the leader does so. Still, the existence of a solution is not guaranteed unless W⁻¹ exists for all t and (S2 + C'K2(t+1)C)⁻¹ exists for all t. Furthermore, the solutions given must minimize the expressions for J1(x, u, v, t) and J2(x, u, v, t).
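For concreteness, the following Python sketch carries out the backward recursions (18)-(28) for stage-invariant problem data, returning the leader and follower feedback matrices W⁻¹Y and Λ for each stage. It is an illustrative transcription under the definiteness conditions given in the next paragraph; the function and variable names are our own choices.

```python
import numpy as np

def deterministic_ofss_gains(A, B, C, Q1, R1, S1, Q2, R2, S2, Theta, N):
    """Backward recursions (18)-(28): u*(t) = -W(t)^-1 Y(t) x(t),
    v*(t) = -Lam(t) (A x(t) + B u*(t))."""
    n = A.shape[0]
    K1, K2 = Q1.copy(), Q2.copy()       # K1(N) = Q1(N), K2(N) = Q2(N)
    Pi1, Pi2 = 0.0, 0.0                 # Pi1(N) = Pi2(N) = 0
    gains = []
    for t in reversed(range(N)):
        Lam = np.linalg.solve(S2 + C.T @ K2 @ C, C.T @ K2)                  # (19)
        IcL = np.eye(n) - C @ Lam                                           # I - C Lam
        W = R1 + B.T @ Lam.T @ S1 @ Lam @ B + B.T @ IcL.T @ K1 @ IcL @ B    # (20)
        Y = B.T @ Lam.T @ S1 @ Lam @ A + B.T @ IcL.T @ K1 @ IcL @ A         # (21)
        L = Q1 + A.T @ Lam.T @ S1 @ Lam @ A + A.T @ IcL.T @ K1 @ IcL @ A    # (22)
        Pi1 += np.trace(Theta @ K1)                                         # (26)
        Pi2 += np.trace(Theta @ K2)                                         # (28)
        F = A - B @ np.linalg.solve(W, Y)                                   # A - B W^-1 Y
        K1_t = L - Y.T @ np.linalg.solve(W, Y)                              # (25)
        K2_t = (Q2 + F.T @ K2 @ IcL @ F
                + Y.T @ np.linalg.solve(W, R2) @ np.linalg.solve(W, Y))     # (27)
        gains.append((t, np.linalg.solve(W, Y), Lam))
        K1, K2 = K1_t, K2_t
    return list(reversed(gains)), Pi1, Pi2
```

At each stage the leader's feedback W⁻¹Y acts on the current state, and the follower's gain Λ acts on Ax + Bu after the leader's decision has been observed, reproducing the order of play referred to in Fig. 1.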


Sufficient conditions for the existence of the inverses and of the minima are given as follows. Let Q1 and S1 be positive semidefinite for all t, and R1(t) positive definite. The expected cost-to-go is then non-negative, so K1(t) ≥ 0. Thus, from (20), W is positive definite, hence invertible. Similarly, if we let Q2 and R2 be positive semidefinite and S2 be positive definite, then (S2 + C'K2(t+1)C) is invertible. In addition, these conditions ensure that the functionals minimized are strictly convex over the whole space, so the minima are unique. Thus, the optimal strategies (18) and (19) are the unique optimal feedback Stackelberg strategies for Problem 2, under the conditions previously described. Let us proceed to examine the more general case where measurement noise is present in the problem.

FIG. 1. Block representation of the final optimal strategies.

IV. THE STOCHASTIC CASE

Problem 3. Consider the special case of Problem 1 where the leader knows both z1(t) and z2(t) in (2), whereas the follower knows only z2(t). So, for any t, Z1(t) ⊃ Z2(t), implying that the information sets are nested. In addition, note that neither player can assume the other has played optimally in the past. This assumption is implicit in Bellman's principle of optimality; it is stated here explicitly so that the possibility of information transfer between players through their controls is avoided completely. The optimal strategies for this problem are derived in the Appendix as

v0(u, t) = -Λ(t)(A(t)x̂2 + B(t)u),   (29)

u*(t) = -W(t)⁻¹Y(t)x̂1 - W(t)⁻¹M(t)(x̂2 - x̂1),   (30)

J1*(t) = [x̂1'  (x̂2 - x̂1)'] [K1A(t)  K1B(t); K1B'(t)  K1C(t)] [x̂1; x̂2 - x̂1] + Π1(t),   (31)

J2*(t) = x̂2'K2(t)x̂2 + Π2(t),   (32)

where x̂1(t) = E{x | Z1(t)}, x̂2(t) = E{x | Z2(t)}, and W(t), Λ(t) and Y(t) are defined in (19), (20) and (21) as in the deterministic case, with K1A(t) replacing K1(t). The recursive relations for K1A(t), K1B(t), K1C(t), K2(t), Π1(t) and Π2(t) are derived and expressed in detail in the Appendix, where the initial conditions are K1B(N) = K1C(N) = 0, K1A(N) = Q1(N), K2(N) = Q2(N), Π1(N) = tr{Σ1(N)Q1(N)}, Π2(N) = tr{Σ2(N)Q2(N)}.

As in the deterministic case, existence and uniqueness of the optimal Stackelberg strategies follow from assuming S2 > 0, R1 > 0 and S1, R2, Q1, Q2 ≥ 0 for all stages. The right-hand sides of equations (33)-(39) can be determined at stage t if the cost-to-go matrices K1A(t+1), K1B(t+1), K1C(t+1) and K2(t+1) are known. We have initial conditions at stage N, so we can solve these equations backwards a priori, because the only terms involved are the parameters of the game A(t), B(t), etc., which are known to both players, and the covariances of the estimates, which can be computed a priori as in the Appendix.

There are several important aspects of the solution. First, the recursive relations for K1A(t) and K2(t) are identical to relations (25) and (27) in the deterministic case, with the same initial conditions, so that the solutions K1A(t) and K2(t) for Problem 3 are the same as K1(t) and K2(t) for Problem 2. Thus, as far as the follower is concerned, he is playing a 'separation principle' strategy, which consists of the optimal deterministic feedback law applied to his best estimate of the state. The leader, on the other hand, knows both his own estimate and the follower's estimate, so his optimal strategy includes a term taking advantage of the difference in estimates. When both estimates are the same, the leader also plays as in the separation principle of optimal control.

A special case of Problem 3 is the case where the follower has no measurements, in which case G2 = 0. Another case of interest is where the leader has perfect information on the state, in addition to knowing the measurements of the follower. In both of these cases the information sets are nested, so Z2 ⊂ Z1. If the information sets were not nested, then each player would be unable to estimate the other player's estimate at each stage. Thus, at each stage, the optimal strategies would include, in addition to each player's own estimate, terms involving estimates of the other player's estimates of the state in the future. Carried out to N stages, the augmented state vector would be roughly of dimension 2nN for each player. This leads to estimators of much larger dimension than the system itself, and hence is impractical.

An interesting variation of Problem 3 is the case where, instead of knowing the other player's past decisions exactly, there is some channel noise involved, so that a noise-corrupted version of the decisions is known. If this noise is zero-mean, additive, gaussian and statistically independent of the other noises in the system, then the OFSS remain unaffected, but the expected optimal cost increases. The reason for this is that neither player extracts information from the other player's past decisions. Thus, past decisions merely affect the mean value of the random vector x(t), and including additive channel noise with the above properties would just leave x(t) a Gauss-Markov process with increased covariance. The channel noise has no effect on the OFSS because its independence from the other random elements in the system implies that the conditional expectations of x(t) given Zi(t) would not be affected.
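The structure of (29)-(32) can be sketched in code: each player propagates the conditional mean of the state given his own information (a standard discrete-time Kalman recursion, computable a priori since the state is Gauss-Markov; the leader would process the stacked measurement (z1, z2)), the follower applies the deterministic feedback law to his estimate, and the leader adds a correction driven by the difference of the two estimates. The gain matrices W, Y, M and Λ are assumed to have been computed beforehand from the recursions referenced above; the class and function names are illustrative.

```python
import numpy as np

class PlayerEstimator:
    """Conditional mean x̂_i(t) = E{x(t) | Z_i(t)} for one player, written as a
    standard Kalman filter; past decisions u, v are known to both players."""
    def __init__(self, H, Xi, x0, Sigma0):
        self.H, self.Xi = H, Xi
        self.xhat, self.Sigma = x0.copy(), Sigma0.copy()

    def predict(self, A, B, C, u, v, Theta):
        self.xhat = A @ self.xhat + B @ u + C @ v
        self.Sigma = A @ self.Sigma @ A.T + Theta

    def update(self, z):
        E = self.H @ self.Sigma @ self.H.T + self.Xi     # innovation covariance
        G = self.Sigma @ self.H.T @ np.linalg.inv(E)     # gain (cf. G2 in the Appendix)
        self.xhat = self.xhat + G @ (z - self.H @ self.xhat)
        self.Sigma = self.Sigma - G @ self.H @ self.Sigma

def stochastic_ofss_stage(xhat1, xhat2, A, B, W, Y, M, Lam):
    """Strategies (29)-(30): the leader uses both estimates, the follower his own."""
    u = -np.linalg.solve(W, Y @ xhat1) - np.linalg.solve(W, M @ (xhat2 - xhat1))
    v = -Lam @ (A @ xhat2 + B @ u)
    return u, v
```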

V. CONCLUSION

In Section II we have described an approach for solving multi-stage stochastic games of the Stackelberg type. Moreover, when the cost criteria involved are quadratic, the dynamics linear, and the information sets nested, we have found the exact form of the optimal strategies for the players, and shown that they correspond closely to what a 'separation principle' would indicate. The general case has been discussed briefly, noting that it cannot be solved compactly. The reason is that, as indicated in the previous section, in order for a player to evaluate his optimal strategy at stage t, he has to estimate the other player's estimate of the state at stage t. However, this implies that he has to estimate the other player's information set at stage t, which involves estimating all of that player's measurements up to stage t. Thus, the augmented state vector includes the estimates of the measurements at each stage; as the number of stages grows large, the augmented state vector becomes infinite-dimensional.

A related problem that is of interest but is not discussed here is solving Problem 3 under the condition that the players may exchange information using their decisions. Intuitively, both players should improve their expected cost. This is not certain, however, since the field of game theory is full of counter-intuitive results. Also of interest are the expressions for the cost-to-go matrices K1A, K1B, K1C and K2; it is not possible to examine the present expressions and interpret the contributions of each term to the total cost.


One final question remains: why should the concept of optimal strategies be the OFSS? Cruz and Simaan [9] point out that, in some cases, the leader's total cost is higher under this solution concept than if both players acted simultaneously at each stage under a Nash solution concept. In Section II we discussed why the appropriate Stackelberg solution concept is the OFSS. In many practical applications, the order in which decisions are made at each stage is fixed, so the Nash solution concept is not applicable. In the few situations where a Nash concept can be used, it is in the leader's best interest to compute both optimal costs under the different solution concepts; then, he may choose whether to make decisions simultaneously with the follower or to continue as leader. In either case, it is necessary to know the meaning of the OFSS and how to evaluate them.

In conclusion, we have shown conditions which guarantee existence and uniqueness of solutions to Problem 3. Also, we have exhibited a general technique for the solution of multi-stage games of the Stackelberg type, and shown that the optimal strategies for a class of information patterns closely resemble those suggested by assuming that a 'separation principle' applies.

REFERENCES

[1] I. B. RHODES and D. G. LUENBERGER: Differential games with imperfect state information. IEEE Trans. Aut. Control AC-14, 29-38 (1969).
[2] I. B. RHODES and D. G. LUENBERGER: Stochastic differential games with constrained state estimators. IEEE Trans. Aut. Control AC-14, 476-481 (1969).
[3] R. D. BEHN and Y. C. HO: On a class of linear stochastic differential games. IEEE Trans. Aut. Control AC-13, 3 (1968).
[4] W. W. WILLMAN: Formal solutions for a class of stochastic pursuit-evasion games. IEEE Trans. Aut. Control AC-14, 5 (1969).
[5] A. W. STARR and Y. C. HO: Nonzero-sum differential games. J. Opt. Theory Applications 3, 3 (1969).
[6] A. W. STARR and Y. C. HO: Further properties of nonzero-sum differential games. J. Opt. Theory Applications 3, 4 (1969).
[7] C. I. CHEN and J. B. CRUZ, JR.: Stackelberg solution for two person games with biased information patterns. IEEE Trans. Aut. Control AC-17, 5 (1972).
[8] M. SIMAAN and J. B. CRUZ, JR.: On the Stackelberg strategy in non-zero-sum games. J. Opt. Theory Appl. 11, 5 (1973).
[9] M. SIMAAN and J. B. CRUZ, JR.: Additional aspects of the Stackelberg strategy in non-zero-sum games. J. Opt. Theory Appl. 11, 6 (1973).
[10] R. BELLMAN: Dynamic Programming. Princeton University Press, Princeton, N.J. (1957).

APPENDIX

Development of solution to Problem 3

Let x̂i(t) = E{x(t) | Zi(t)}, i = 1, 2. Then (14) and (15) become

Ji*(N) = x̂i'Qi(N)x̂i + tr{Σi(N)Qi(N)},

where

Σi(t) = E{(x(t) - x̂i(t))(x(t) - x̂i(t))' | Zi(t)},
Σi(t+1|t) = E{Σi(t+1) | Zi(t)},   (A1)
x̂i(t+1|t) = E{x(t+1) | Zi(t)}.

Assume that, at stage t, the optimal expected costs-to-go are given by

J1*(t) = [x̂1'(t)  (x̂2(t) - x̂1(t))'] [K1A(t)  K1B(t); K1B'(t)  K1C(t)] [x̂1(t); x̂2(t) - x̂1(t)] + Π1(t),   (A2)

J2*(t) = x̂2'(t)K2(t)x̂2(t) + Π2(t).   (A3)
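The stage-N expressions rely on the identity E{x'Qx | Z} = x̂'Qx̂ + tr{ΣQ} when x has conditional mean x̂ and conditional covariance Σ; the same device reappears in (A8). A quick Monte Carlo check of the identity, with arbitrary illustrative numbers, is given below.

```python
import numpy as np

# Monte Carlo check of E{x'Qx | Z} = x̂'Q x̂ + tr{Σ Q} for x | Z ~ N(x̂, Σ).
rng = np.random.default_rng(1)
n = 3
Q = np.diag([1.0, 2.0, 0.5])
xhat = np.array([0.3, -1.0, 2.0])
root = rng.normal(size=(n, n))
Sigma = root @ root.T + 0.1 * np.eye(n)          # an arbitrary covariance matrix

samples = rng.multivariate_normal(xhat, Sigma, size=200_000)
mc_value = np.mean(np.einsum('ij,jk,ik->i', samples, Q, samples))
closed_form = xhat @ Q @ xhat + np.trace(Sigma @ Q)
# mc_value and closed_form agree to Monte Carlo accuracy.
```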


These assumptions are consistent with (A1). We now proceed by induction to show that the above assumptions are indeed true. From (9) and (10), the optimal strategies are given by

v0(u, t) = argmin_v E{x'Q2x + u'R2u + v'S2v + Π2(t+1) + x̂2'(t+1)K2(t+1)x̂2(t+1) | Z2(t)}.   (A4)

We can calculate the expectation as

v0(u, t) = argmin_v [x̂2'Q2x̂2 + u'R2u + v'S2v + x̂2'(t+1|t)K2(t+1)x̂2(t+1|t) + Π2(t+1) + tr{Σ2Q2} + tr{(Σ2(t+1|t) - Σ2(t+1))K2(t+1)}].

Since x̂2(t+1|t) = A(t)x̂2(t) + B(t)u(t) + C(t)v(t), this becomes

v0(u, t) = argmin_v [x̂2'(Q2 + A'K2(t+1)A)x̂2 + u'(R2 + B'K2(t+1)B)u + v'(S2 + C'K2(t+1)C)v + 2v'C'K2(t+1)(Ax̂2 + Bu) + 2u'B'K2(t+1)Ax̂2 + b2(t)],   (A5)

where

b2(t) = Π2(t+1) + tr{Σ2Q2} + tr{(Σ2(t+1|t) - Σ2(t+1))K2(t+1)}.

We can differentiate with respect to v and obtain the minimum:

v0(u, t) = -(S2 + C'K2(t+1)C)⁻¹C'K2(t+1)(Ax̂2 + Bu) = -Λ(Ax̂2 + Bu).   (A6)

Notice that Λ(t) is defined as it was in (19).
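The step from (A5) to (A6) is the minimization of a quadratic in v with Hessian 2(S2 + C'K2(t+1)C). The short check below, using random stand-in matrices, confirms numerically that the stated minimizer zeroes the gradient and is not improved by perturbations.

```python
import numpy as np

# Check: v* = -(S2 + C'K2 C)^-1 C'K2 a minimizes v'S2 v + (a + Cv)'K2(a + Cv),
# where a stands for A x̂2 + B u. The matrices below are random stand-ins.
rng = np.random.default_rng(2)
n, k = 4, 2
C = rng.normal(size=(n, k))
a = rng.normal(size=n)
S2 = np.eye(k)                       # S2 > 0
root = rng.normal(size=(n, n))
K2 = root @ root.T                   # K2 >= 0, symmetric

v_star = -np.linalg.solve(S2 + C.T @ K2 @ C, C.T @ K2 @ a)

def cost(v):
    return v @ S2 @ v + (a + C @ v) @ K2 @ (a + C @ v)

grad_at_vstar = 2 * (S2 + C.T @ K2 @ C) @ v_star + 2 * C.T @ K2 @ a
assert np.allclose(grad_at_vstar, 0.0, atol=1e-10)
assert all(cost(v_star) <= cost(v_star + 1e-3 * rng.normal(size=k)) for _ in range(10))
```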

Using (10), we get

u*(t) = argmin_u E{x'Q1x + u'R1u + v0'(u, t)S1v0(u, t) + Π1(t+1) + [x̂1'(t+1)  (x̂2(t+1) - x̂1(t+1))'] [K1A(t+1)  K1B(t+1); K1B'(t+1)  K1C(t+1)] [x̂1(t+1); x̂2(t+1) - x̂1(t+1)] | Z1(t)}.   (A7)

For any matrix L, we can determine the following expectations:

E{x̂1'(t+1)Lx̂1(t+1) | Z1(t)} = x̂1'(t+1|t)Lx̂1(t+1|t) + tr{(Σ1(t+1|t) - Σ1(t+1))L},   (A8)

E{x̂1'(t+1)Lx̂2(t+1) | Z1(t)} = E{E[x̂1'(t+1)Lx̂2(t+1) | Z1(t) ∪ Z2(t+1)] | Z1(t)},   (A9)

because Z1(t) ∪ Z2(t+1) induces a refinement of the partition induced by Z1(t) alone. The inner expectation can be rewritten as

E{x̂1'(t+1)Lx̂2(t+1) | Z1(t) ∪ Z2(t+1)} = E{x(t+1) | Z1(t) ∪ Z2(t+1)}'Lx̂2(t+1).   (A10)

Since Z2(t+1) ∪ Z1(t) involves adding the information of the measurement z2(t+1) to Z1(t),

x̄1 = E{x | Z2(t+1) ∪ Z1(t)} = x̂1(t+1) + Σ̄(t+1)H2'(t+1)E2⁻¹(t+1)(z2(t+1) - H2(t+1)x̂1(t+1)),   (A11)

where Σ̄ = E{(x - x̄1)(x - x̄1)' | Z1(t) ∪ Z2(t+1)}. Let us drop the argument t+1 for brevity. We also know

x̂2 = x̂2(t+1|t) + Σ2H2'E2⁻¹(z2 - H2x̂2(t+1|t)).

So, (A9) becomes

E{x̂1'Lx̂2 | Z1(t)} = E{(x̂1 + Σ̄H2'E2⁻¹(z2 - H2x̂1))'L(x̂2(t+1|t) + Σ2H2'E2⁻¹(z2 - H2x̂2(t+1|t))) | Z1(t)}.   (A12)

Now, z2 conditioned on Z1(t) is a gaussian random variable with mean H2x̂1 and covariance H2Σ1(t+1|t)H2' + Ξ2. So

E{x̂1'Lx̂2 | Z1(t)} = x̂1'Lx̂2(t+1|t) + x̂1'LG2H2(x̂1 - x̂2(t+1|t)) + tr{Σ1(t+1|t)LG2H2},   (A13)

where G2 = Σ2H2'E2⁻¹, and

E{x̂2'Lx̂2 | Z1(t)} = E{(x̂2(t+1|t) + G2(z2 - H2x̂2(t+1|t)))'L(x̂2(t+1|t) + G2(z2 - H2x̂2(t+1|t))) | Z1(t)}
= x̂2'(t+1|t)Lx̂2(t+1|t) + 2x̂2'(t+1|t)LG2H2(x̂1 - x̂2(t+1|t)) + tr{G2E2G2'L + G2H2Σ1(t+1|t)H2'G2'L}.   (A14)

Let us expand (A7) and take expected values using (A8), (A13) and (A14):

u*(t) = argmin_u [x̂1'(t)Q1(t)x̂1(t) + u'(t)R1(t)u(t) + (A(t)x̂2(t) + B(t)u(t))'Λ'(t)S1(t)Λ(t)(A(t)x̂2(t) + B(t)u(t)) + x̂1'(K1A - 2K1B + K1C)x̂1 + 2x̂1'(K1B - K1C)x̂2 + 2x̂1'(K1B - K1C)G2H2(x̂1 - x̂2) + x̂2'K1Cx̂2 + 2x̂2'K1CG2H2(x̂1 - x̂2) + (x̂2 - x̂1)'H2'G2'K1CG2H2(x̂2 - x̂1) + Π1(t+1) + tr{Σ1(t)Q1(t)} + tr{(Σ1(t+1|t) - Σ1(t+1))(K1A - 2K1B + K1C)} + 2tr{G2H2Σ1(t+1|t)(K1B - K1C)} + tr{G2(E2 + H2Σ1(t+1|t)H2')G2'K1C}].   (A15)


Since

x̂1(t+1|t) = A(t)x̂1(t) + B(t)u(t) + C(t)v0(u, t) = (I - CΛ)Ax̂1 + (I - CΛ)Bu - CΛA(x̂2 - x̂1),

x̂2(t+1|t) = (I - CΛ)Ax̂1 + (I - CΛ)Bu + (I - CΛ)A(x̂2 - x̂1),

where the argument t was dropped for brevity, substituting into (A15) and differentiating, we find u*(t) as

u*(t) = -W⁻¹(t)Y(t)x̂1(t) - W⁻¹(t)M(t)(x̂2(t) - x̂1(t)),   (A16)

where

W(t) = R1 + B'(I - CΛ)'K1A(t+1)(I - CΛ)B + B'Λ'S1ΛB,

Y(t) = B'Λ'S1ΛA + B'(I - CΛ)'K1A(t+1)(I - CΛ)A,   (A17)

M(t) = B'(I - CΛ)'K1A(t+1)(I - CΛ)A + B'(I - CΛ)'(K1B(t+1) - K1A(t+1))A + B'Λ'S1ΛA - B'(I - CΛ)'K1B(t+1)G2(t+1)H2(t+1)A.

So, from (A2),

J1*(t) = [x̂1'(t)  (x̂2(t) - x̂1(t))'] [K1A(t)  K1B(t); K1B'(t)  K1C(t)] [x̂1(t); x̂2(t) - x̂1(t)] + Π1(t),

verifying the induction assumption made in (A2), where

K1A(t) = Q1 + A'Λ'S1ΛA + A'(I - CΛ)'K1A(t+1)(I - CΛ)A - Y'W⁻¹Y,

K1B(t) = A'Λ'S1ΛA + A'(I - CΛ)'K1A(t+1)(I - CΛ)A + A'(I - CΛ)'(K1B(t+1) - K1A(t+1))A - A'(I - CΛ)'K1B(t+1)G2(t+1)H2(t+1)A - Y'W⁻¹M,   (A18)

K1C(t) = -M'W⁻¹M + A'Λ'(S1 + C'K1A(t+1)C)ΛA - A'Λ'C'(K1B(t+1) - K1B(t+1)G2(t+1)H2(t+1))A + A'(K1B(t+1)G2(t+1)H2(t+1) - K1B(t+1))'CΛA + A'(I - G2(t+1)H2(t+1))'K1C(t+1)(I - G2(t+1)H2(t+1))A,

and

Π1(t) = Π1(t+1) + tr{(Σ1(t+1|t) - Σ1(t+1))(K1A(t+1) - 2K1B(t+1) + K1C(t+1))} + 2tr{G2(t+1)H2(t+1)Σ1(t+1|t)(K1B(t+1) - K1C(t+1))} + tr{Σ1(t)Q1(t)} + tr{G2(t+1)(H2(t+1)Σ1(t+1|t)H2'(t+1) + E2(t+1))G2'(t+1)K1C(t+1)}.   (A19)

Similarly, the optimal cost-to-go for player 2, the follower, is given by

J2*(t) = E{x̂2'(Q2 + A'K2(t+1)A)x̂2 + u*'(R2 + B'K2(t+1)B)u* + 2u*'B'K2(t+1)Ax̂2 - (Ax̂2 + Bu*)'Λ'C'K2(t+1)(Ax̂2 + Bu*) + b2(t) | Z2(t)} = x̂2'K2(t)x̂2 + Π2(t).

This follows because E{x̂1 | Z2(t)} = x̂2, since Z2(t) ⊂ Z1(t). The optimal cost-to-go matrix is given by

K2(t) = Q2 + (A - BW⁻¹Y)'(K2(t+1) - Λ'C'K2(t+1))(A - BW⁻¹Y) + Y'W⁻¹R2W⁻¹Y   (A20)

and

Π2(t) = Π2(t+1) + tr{Σ2Q2} + tr{(Σ2(t+1|t) - Σ2(t+1))K2(t+1)} + tr{(Σ2(t) - Σ1(t))(M - Y)'W⁻¹(R2 + B'(K2(t+1) - K2(t+1)CΛ)B)W⁻¹(M - Y)}.   (A21)