Economics Letters 31 (1989) 139-143
North-Holland

SYMMETRIC PATHS AND EVOLUTION TO EQUILIBRIUM IN THE DISCOUNTED PRISONERS' DILEMMA

William STANFORD
University of Illinois at Chicago, Chicago, IL 60680, USA

Received 18 January 1989
Accepted 27 February 1989
We consider an alternating best response process for strategies in two-player, discounted repeated games. Properties of this process in general games are studied and a sharp convergence result is proved for a class of strategies in the discounted prisoners' dilemma.
1. Introduction

Given a two-person non-cooperative game and a strategy f_1 for player one (P1), consider the following evolutionary process: (i) player two (P2) observes f_1 and finds a best response strategy g_1, (ii) P1 observes g_1 and reacts optimally with f_2, (iii) P2 optimally reacts to that optimal reaction by finding g_2, etc. If, given f_1, a sequence of such strategies {f_n} exists and converges to the strategy f, {g_n} exists and converges to the strategy g, and the pair (f, g) is a Nash equilibrium of the game, then the strategy f_1 may be called prestable. Related notions go back at least to Cournot (1960) and can also be found in Brown (1951) and Robinson (1951), where the method of 'fictitious play' was introduced and investigated.

In this note, the non-cooperative game of greatest interest will be the infinitely repeated, discounted prisoners' dilemma, though other discounted games will also be considered. This means the pure strategy f_1 is a repeated game strategy, and prestability of f_1 will be expressed by requiring best response sequences {f_n} and {g_n}, with limit pair (f, g), to exist and the latter to constitute a subgame perfect Nash equilibrium of the discounted game.

Our object in writing this note is two-fold. First, we make remarks true of such alternating best response processes in quite general discounted games and for arbitrary initial strategies f_1. The facts outlined in these remarks may be useful in proving prestability results for some general classes of strategies in such games.
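As an illustration of the process in the simplest possible setting, consider the one-shot prisoners' dilemma, where a "strategy" is just an action. The following sketch uses assumed payoff values satisfying the prisoners' dilemma conditions of section 2 (a_1 = b_2 = 2, a_2 = b_1 = -1); the note's real interest is the repeated game, where strategies are automata rather than single actions.

```python
# Alternating best responses in the one-shot prisoners' dilemma.
# Payoff normalization: (C,C) -> (1,1), (N,N) -> (0,0); the off-diagonal
# values 2 and -1 are assumed for illustration.

PAYOFF = {  # (P1 action, P2 action) -> (P1 payoff, P2 payoff)
    ('C', 'C'): (1, 1), ('C', 'N'): (-1, 2),
    ('N', 'C'): (2, -1), ('N', 'N'): (0, 0),
}

def best_response(player, opponent_action):
    """The action maximizing this player's payoff against a fixed opponent action."""
    if player == 1:
        return max('CN', key=lambda a: PAYOFF[(a, opponent_action)][0])
    return max('CN', key=lambda a: PAYOFF[(opponent_action, a)][1])

f = 'C'                          # initial P1 strategy f1
for _ in range(5):
    g = best_response(2, f)      # P2 observes f and reacts optimally
    f = best_response(1, g)      # P1 observes g and reacts optimally
print(f, g)                      # N N: the process settles on the stage game equilibrium
```

Here the process converges immediately to mutual non-cooperation; the point of the repeated game analysis below is that cooperative limits become possible.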
Second, we consider a set of restrictive conditions on f_1 and prove a proposition showing these conditions are sufficient to ensure that f_1 is prestable in the discounted prisoners' dilemma. These conditions are based on the idea of 'symmetric path equilibrium' introduced in Abreu (1986) in a symmetric discounted oligopoly game. Further, the limit pair (f, g) may prescribe cooperative stage game actions. Thus we have 'cooperative' equilibrium as the potential outcome of a stylized evolutionary process driven purely by self-interest.
0165-1765/89/$3.50 © 1989, Elsevier Science Publishers B.V. (North-Holland)
2. Alternating best response processes

Consider the following bimatrix game:

                       Player two
                       C_2           N_2
   Player one   C_1    1, 1          b_1, b_2
                N_1    a_1, a_2      0, 0
To make this a prisoners' dilemma, we require a_1 > 1, b_2 > 1, a_2 < 0, b_1 < 0. C_i is associated with 'cooperation' and N_i with 'non-cooperation'. In general, we restrict attention to two-player, simultaneous move, finite action stage games, of which the prisoners' dilemma is but an example. We then concentrate on infinitely repeated, discounted versions of such games. For the sake of brevity, we will not reproduce here the (by now) standard formal definitions of discounted infinitely repeated game, pure strategy, Nash equilibrium, or subgame perfect Nash equilibrium for such games. Our usage is consistent with Rubinstein (1979) and Abreu (1986), for example. We will also adopt at some points the vocabulary of finite automaton theory and make use of the equivalence between the concepts of repeated game strategies and automata as shown in Kalai and Stanford (1988).

We begin with an observation. A pure strategy f_1 for P1 in any infinitely repeated discounted game defines a discounted Markov decision problem for P2. After all, f_1 consists of a state space (a set of induced strategies), a deterministic transition rule, and a behavior function that determines P1's action depending on the state. The current period reward for P2 depends on the state of the system (through P1's action) and the action of P2. Long term payoffs are partly determined by the transition rule of f_1. If the action set of P2 in the stage game is finite (as we assume), then results of Blackwell (1965), among others, indicate that P2 has a stationary dynamic programming best response strategy g_1. The stationarity of g_1 means that f_1 'controls' g_1 in the sense that the current state of f_1 determines the action chosen by g_1. Given the transitions and behavior function of f_1, actions by g_1 are optimal in all states of f_1.
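The observation that f_1 induces a discounted Markov decision problem for P2 can be made concrete. The following is a minimal sketch, assuming example P2 payoffs (a_2 = -1, b_2 = 2 under the normalization above), a discount factor of 0.9, and an assumed example f_1: a two-state "grim" automaton that cooperates until any N is observed. None of these particulars come from the paper.

```python
# Computing P2's stationary dynamic programming best response to an automaton
# strategy f1 by value iteration on the induced Markov decision problem.

DELTA = 0.9                       # discount factor (assumed)
P2_REWARD = {('C', 'C'): 1, ('C', 'N'): 2, ('N', 'C'): -1, ('N', 'N'): 0}

STATES = ['coop', 'punish']       # f1's state space (assumed "grim" automaton)
def transition(s, pair):          # deterministic transition on the action pair
    return 'punish' if (s == 'punish' or 'N' in pair) else 'coop'
def behavior(s):                  # f1's behavior function: P1's action per state
    return 'C' if s == 'coop' else 'N'

# Value iteration on the two-state MDP faced by P2 (cf. Blackwell, 1965).
V = {s: 0.0 for s in STATES}
for _ in range(500):
    V = {s: max(P2_REWARD[(behavior(s), a)]
                + DELTA * V[transition(s, (behavior(s), a))] for a in 'CN')
         for s in STATES}

# The stationary best response g1: an optimal P2 action for each state of f1.
g1 = {s: max('CN', key=lambda a, s=s: P2_REWARD[(behavior(s), a)]
             + DELTA * V[transition(s, (behavior(s), a))])
      for s in STATES}
print(g1)   # {'coop': 'C', 'punish': 'N'}
```

With this discount factor, cooperating against the 'coop' state is worth 1/(1 - 0.9) = 10, which beats the one-period deviation gain of 2 followed by perpetual punishment, so g_1 cooperates until punished and defects thereafter.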
We may think of f_1 and g_1 as sharing the same state space and transition rule, though in fact the 'complexity' of g_1 (the state space cardinality of the smallest automaton implementing g_1) may be lower than that of f_1. This is because g_1 may not distinguish between states that correspond to different strategies induced by f_1 and different histories: g_1 may induce the same dynamic programming best response to two different strategies induced by f_1 after two different histories. In short, if both players have finite action spaces in the stage game, then the sequential best response process described in the introduction has meaning in the discounted game, since two suitable sequences {f_n} and {g_n} exist. This is true for any initial strategy f_1, whether of finite or infinite complexity. See again Blackwell (1965) on this last point.

In the following, we require stationary best responses to be chosen at each stage of the sequential process. We also require that the choice of a given stationary best response for a player at any stage implies its choice at all subsequent stages where it is a feasible best response. One supposes that, in some sense, dynamic programming best responses will usually be unique, so that such choice rules will not often be necessary. Since stationary best responses are chosen at each stage, if f_1 has a finite state space, i.e., f_1 is of finite complexity, then at some point in the sequential process a minimal state space cardinality is reached and maintained at all subsequent stages. For any two best response strategies obtained beyond this point, there will exist a one-to-one and onto map from the state space of either strategy to that of the
other. Further, this map will preserve transitions: if the two automata are 'in' corresponding states under this map and the same input is submitted to both, then the two automata will transit to states that also correspond. To summarize, if the players have finite action spaces in the stage game and f_1 is of finite complexity, then a sequential best response process as described above exists and gives rise to a minimal limit state space and a limit transition function. Best response strategies obtained far enough along in the process may be assumed to share this state space and transition function. Under these finiteness assumptions, such sequential best response processes are always convergent to this extent. Long period, though finite, cycling of the behavior functions may prevent convergence of the {f_n} and {g_n} sequences to a subgame perfect equilibrium pair. Conditions sufficient to rule this out are taken up in the following.

Consider now Abreu's (1986) concept of 'symmetric path' strategy adapted to the discounted prisoners' dilemma. Such a strategy for P1, say, consists of a sequence of pairs of actions Q_1 = {q_1(t) : q_1(t) = (N_1, N_2) or q_1(t) = (C_1, C_2), t = 1, 2, ...} and a rule stating that P1 play according to path Q_1 unless some unilateral deviation occurs. Such a deviation has either the form (N_1, C_2) or (C_1, N_2). On such a deviation, Q_1 is restarted. Simultaneous deviations are ignored. It is straightforward to verify that states, i.e., induced strategies, correspond precisely to the different tail sequences coming from the sequence Q_1. For fixed k = 1, 2, ..., such a tail sequence has the form Q_1(k) = {q_1(k + t) : t = 0, 1, 2, ...} with Q_1 = Q_1(1). Fix such a strategy for P1, denote it by f_1, and assume f_1 has finitely many states. Thus the corresponding sequence Q_1 gives rise to only finitely many different tail sequences, and so Q_1 is eventually periodic.
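The state structure just described can be sketched directly. In the sketch below, the eventually periodic path is an assumed example (stem N C followed by the repeating cycle C C N), encoded as a finite stem plus cycle; 'C' abbreviates the pair (C_1, C_2) and 'N' the pair (N_1, N_2).

```python
# A finite complexity symmetric path strategy as an automaton. States
# correspond to the distinct tail sequences Q(k), i.e., positions in the path.

STEM, CYCLE = ['N', 'C'], ['C', 'C', 'N']    # assumed path Q = N C | C C N C C N ...
PATH = STEM + CYCLE                           # one state per path position

def step(k, pair):
    """Transition: next state from state k after the action pair is observed."""
    if pair in (('N', 'C'), ('C', 'N')):      # unilateral deviation: restart Q
        return 0
    nxt = k + 1                               # otherwise advance along the path,
    return nxt if nxt < len(PATH) else len(STEM)  # looping inside the cycle

def prescribed(k):
    """Behavior: the (symmetric) action prescribed in state k."""
    return PATH[k]

# With no deviations, play follows the stem once and then cycles forever.
k, play = 0, []
for _ in range(8):
    a = prescribed(k)
    play.append(a)
    k = step(k, (a, a))
print(''.join(play))        # NCCCNCCN
```

A unilateral deviation in any state, e.g. step(3, ('C', 'N')), returns the automaton to state 0, restarting Q; simultaneous deviations such as step(3, ('N', 'N')) are ignored and play advances.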
By Blackwell's (1965) results, we know there is a stationary dynamic programming best response g_1, which we assume P2 adopts. Thus for each of the finitely many tail sequences Q_1(k), g_1 prescribes an action for P2. This naturally defines a sequence of P2 actions, indexed by t. If we pair each member of this P2 action sequence with the corresponding P1 action, the result is another symmetric sequence of pairs of actions: R_1 = {r_1(t) : r_1(t) = (N_1, N_2) or r_1(t) = (C_1, C_2), t = 1, 2, ...}. The second coordinate of r_1(k) is the optimal action for P2 given the state of f_1, i.e., given Q_1(k), for k = 1, 2, .... The strategy g_1 is determined by the sequence R_1 and the transition rule of f_1. Alternatively, a little reflection shows that g_1 is determined by R_1 and a rule stating that P2 play according to R_1 unless some unilateral deviation occurs. Such a deviation has either the form (N_1, C_2) or (C_1, N_2). On such a deviation, R_1 is restarted. Simultaneous deviations are ignored. We may formalize this in a lemma.

Lemma 1. In the discounted prisoners' dilemma, any stationary dynamic programming best response to a finite complexity symmetric path strategy is a finite complexity symmetric path strategy.

We remark that finite complexity considerations are really irrelevant to the truth of the lemma. In addition, it is essential that players have two actions in the stage game; the lemma generally will not hold in a repeated game based on a stage game where the players have at least three actions, as in a duopoly game.

Suppose now that R_1 ≠ Q_1 and consider the first t for which r_1(t) ≠ q_1(t). If we denote this t by t_1, then we claim r_1(t_1) = (N_1, N_2). To see this, note that P2 must have a non-negative discounted payoff along the play path generated by (f_1, g_1), starting with t = 1. Thus, if r_1(t_1) = (C_1, C_2), we could find t < t_1 such that q_1(t) = r_1(t) = (C_1, C_2).
If t' is the largest of such t values, it is easy to verify that choosing N_2 at time t' strictly improves P2's discounted payoff in the corresponding state. This yields a contradiction, since g_1 is optimal. Implicitly using Lemma 1, earlier remarks, and symmetry of the above argument, we have another lemma.
Lemma 2. In the discounted prisoners' dilemma, consider a finite complexity symmetric path strategy and a stationary dynamic programming best response. The first t for which the corresponding paths differ (should it exist) yields the best response path action pair (N_1, N_2).

As with Lemma 1, finite complexity plays no role here, though this changes in what follows. By earlier remarks, we may assume that f_1, g_1, and all succeeding strategies share the same minimal finite state space and transition function. (This is equivalent to the assertion that Q_n(k_1) = Q_m(k_2) if and only if R_n(k_1) = R_m(k_2) for all positive integers n and m.) Also, by our tie breaking rules, the condition that f_n → f, g_n → g is equivalent to the existence of a positive integer k for which f_k = f_{k+1}, i.e., Q_k = Q_{k+1}. We now show that such a k exists. Were this not the case, without loss of generality, our tie breaking rules and the finiteness of the common state space of all strategies imply the existence of a smallest integer j ≥ 3 such that Q_m = Q_{m+j}, R_m = R_{m+j} for m = 1, 2, .... For i = 1, 2, ..., j − 1, let t_i = min{t : q_i(t) ≠ r_i(t) or r_i(t) ≠ q_{i+1}(t)}, t* = min{t_i}, and i* a member of argmin{t_i}. Thus t* = t_{i*}. If q_{i*}(t*) ≠ r_{i*}(t*) then, by Lemma 2, q_{i*}(t*) = (C_1, C_2) and r_{i*}(t*) = (N_1, N_2). By the definition of t* and Lemma 2, q_{i*+1}(t*) = (N_1, N_2). Lemma 2 and periodicity of the sequence (Q_1, R_1), (Q_2, R_2), ... then yield a contradiction. Similarly if r_{i*}(t*) ≠ q_{i*+1}(t*).

Proposition 1. Any finite complexity symmetric path strategy f_1 is prestable in the discounted prisoners' dilemma. Moreover, the limit pair (f, g) is a pair of finite complexity symmetric path strategies with corresponding paths Q and R satisfying Q = R.

To prove the proposition, note that once convergence is established, the process yields f as a discounted dynamic programming best response to g and vice versa. Thus the pair (f, g) is a subgame perfect Nash equilibrium of the discounted game. That Q = R follows from Lemma 2 and convergence.

We want to make a few more remarks about alternating best response processes in general infinitely repeated games with discounting. For a two-player game, let x = (a_1, a_2) be a pair of actions with a_i a member of player i's stage game action set A_i. Let k be a non-negative integer. We denote by x^k the concatenation of k copies of x, where x^0 is the 'empty input string'. Let f_1 be a finite complexity strategy for the repeated game. Then f_1 consists of a minimal state space S, a transition function T : S × A → S where A = A_1 × A_2, and a behavior function B : S → A_1, which determines action as a function of state. Let s ∈ S and define the x path of s to be P_x(s) = {T(s, x^k) : k = 0, 1, 2, ...}, where T(s, x^k) is defined inductively by T(s, x^0) = s, T(s, x^k) = T(T(s, x^{k−1}), x) for k = 1, 2, .... Define the x cycle of s as C_x(s) = {T(s, x^k) : there exists n > k such that T(s, x^k) = T(s, x^n), k = 0, 1, 2, ...}. Finally, the x stem of s is S_x(s) = P_x(s) − C_x(s). As we have remarked, if g_1 is a stationary dynamic programming best response to f_1, there is a natural transition preserving map from S onto the state space of g_1. Denote such a map by α. The proof of the following lemma is straightforward, though a bit tedious.
Lemma 3. Define f_1, g_1, α, s, and x as above. Then:

(1) P_x(α(s)) = α(P_x(s)),
(2) C_x(α(s)) = α(C_x(s)),
(3) S_x(α(s)) ⊆ α(S_x(s)),
(4) the cardinality of C_x(α(s)) divides the cardinality of C_x(s).
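The objects in the lemma are easy to compute for any finite automaton: iterate the transition under the fixed input x until a state repeats. The four-state automaton below is an assumed example, not one taken from the paper.

```python
# Computing the x path, x cycle, and x stem of a state of a finite automaton,
# following the definitions above.

def x_path(T, s, x):
    """P_x(s): the states T(s, x^k), k = 0, 1, 2, ..., in order of first
    visit, together with the first state to repeat."""
    seen, order = set(), []
    while s not in seen:
        seen.add(s)
        order.append(s)
        s = T(s, x)          # feed one more copy of x
    return order, s

def x_cycle_and_stem(T, s, x):
    """(C_x(s), S_x(s)): states visited infinitely often vs. finitely often."""
    order, repeat = x_path(T, s, x)
    i = order.index(repeat)  # states from index i onward recur forever
    return set(order[i:]), set(order[:i])

# Example automaton on states {0, 1, 2, 3} whose transition under the fixed
# input pair x = (C_1, C_2) is 0 -> 1 -> 2 -> 3 -> 2.
T = lambda s, x: {0: 1, 1: 2, 2: 3, 3: 2}[s]
cycle, stem = x_cycle_and_stem(T, 0, ('C', 'C'))
print(sorted(cycle), sorted(stem))   # [2, 3] [0, 1]
```

In this example the x cycle of state 0 has cardinality 2, so part (4) of the lemma would force the x cycle of α(0) in any best response automaton to have cardinality 1 or 2.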
This provides some detail to the statement of the Proposition. For example, if we define the period of a finite complexity symmetric path strategy in the discounted prisoners' dilemma to be the cardinality of the (C_1, C_2) cycle of any state, then the fourth part of the lemma states that the (common) period of the limit strategies f and g divides the period of the initial strategy f_1.
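The process behind Proposition 1 can be simulated numerically. The sketch below is ours, not the paper's: it assumes the payoffs a_1 = b_2 = 2, a_2 = b_1 = -1 with the normalization of section 2, a discount factor of 0.9, and example initial paths. Using Lemma 1's characterization, each round maps a symmetric path (encoded as stem plus cycle, so the stem/cycle shape is preserved) to the path of a stationary best response, computed by value iteration.

```python
# Alternating best responses among symmetric path strategies: a numerical
# sketch of Proposition 1 under assumed payoffs and discount factor.

DELTA = 0.9
REWARD = {('C', 'C'): 1, ('C', 'N'): 2, ('N', 'C'): -1, ('N', 'N'): 0}  # P2

def best_response_path(stem, cycle):
    """Path of the best response to the symmetric path strategy stem + cycle,
    returned in the same stem/cycle shape (cf. Lemma 1)."""
    path = stem + cycle
    loop = len(stem)                      # advancing past the end re-enters here
    def nxt(k, a):                        # comply: advance; deviate: restart
        if a != path[k]:
            return 0
        return k + 1 if k + 1 < len(path) else loop
    V = [0.0] * len(path)                 # value iteration on the induced MDP
    for _ in range(1000):
        V = [max(REWARD[(path[k], a)] + DELTA * V[nxt(k, a)] for a in 'CN')
             for k in range(len(path))]
    opt = [max('CN', key=lambda a, k=k: REWARD[(path[k], a)] + DELTA * V[nxt(k, a)])
           for k in range(len(path))]
    return opt[:loop], opt[loop:]

# A one-period punishment is too weak at these payoffs: the process converges
# to perpetual non-cooperation.
stem, cycle = ['N', 'C'], ['C', 'C', 'N']
for _ in range(10):
    stem, cycle = best_response_path(stem, cycle)
print(stem, cycle)                        # ['N', 'N'] ['N', 'N', 'N']

# A three-period punishment stem sustains cooperation: the path N N N C C C ...
# is already a fixed point of the map, so Q = R at once.
print(best_response_path(['N', 'N', 'N'], ['C']))   # (['N', 'N', 'N'], ['C'])
```

Both runs reach a fixed point, i.e., a limit pair with Q = R, and the limit's stem and cycle lengths are those of the initial path, consistent with the divisibility remark above.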
3. Concluding remarks
We have pointed out some general facts about alternating best response processes and considered an example with sharp convergence properties. The Proposition and its proof admit accessible though minor generalizations, but a truly comprehensive theorem seems difficult to formulate and prove. On the other hand, we are aware of no counterexamples to the assertion that all finite complexity pure strategies are prestable in the discounted prisoners' dilemma. For example, strategies based on the dynamics (transition functions) of various 'trigger strategies' are also prestable. A computer aided search may be helpful in uncovering counterexamples, and we hope to pursue this.
References

Abreu, Dilip, 1986, Extremal equilibria of oligopolistic supergames, Journal of Economic Theory 39, 191-225.
Blackwell, David, 1965, Discounted dynamic programming, Annals of Mathematical Statistics 36, 226-235.
Brown, G.W., 1951, Iterative solutions of games by fictitious play, in: T.C. Koopmans, ed., Activity analysis of production and allocation, Cowles Commission Monograph 13 (Wiley, New York).
Cournot, Augustin, 1960, Researches into the mathematical principles of the theory of wealth, translated by N.T. Bacon (Kelley, New York).
Kalai, Ehud and William Stanford, 1988, Finite rationality and interpersonal complexity in repeated games, Econometrica 56, 397-410.
Robinson, Julia, 1951, An iterative method of solving a game, Annals of Mathematics 54, 296-301.
Rubinstein, Ariel, 1979, Equilibrium in supergames with the overtaking criterion, Journal of Economic Theory 21, 1-9.