JOURNAL OF MATHEMATICAL PSYCHOLOGY 36, 278-282 (1992)

Theoretical Note

Urn Models for Evolutionary Learning Games

NOLA D. TRACY
McNeese State University

JOHN W. SEAMAN, JR.
Baylor University

Correspondence and requests for reprints should be sent to Nola D. Tracy, Department of Mathematics, Computer Science and Statistics, MSU Box 92340, Lake Charles, LA 70609-2340.
In an influential paper, Harley (Journal of Theoretical Biology 89, 1981, 611-633) proposed a model in which an evolutionarily stable strategy (ESS) is learned rather than inherited. Harley also established an approximation to such an ESS called the relative payoff sum learning rule. Audley and Jonckheere (British Journal of Statistical Psychology 9, 1956, 87-94) and Arnold (Journal of Mathematical Psychology 10, 1973, 232-239; 4, 1967, 301-315) have developed general urn models for learning through reinforcement. In this paper, we establish correspondences among these learning models, thus suggesting a potential bridge between evolutionary game theory and classical learning theory. © 1992 Academic Press, Inc.

1. INTRODUCTION

Evolutionary game theory (Maynard Smith and Price, 1973; Maynard Smith, 1982) is primarily concerned with the evolution of behavior. Strategies in evolutionary games are behavioral phenotypes, and payoffs are changes in Darwinian fitness. Loosely speaking, an evolutionarily stable strategy (ESS) is a behavioral phenotype such that, if it is adopted by all members of a population, then no mutant strategy can invade under the influence of natural selection. For a concise overview of evolutionary game theory, see Riechert and Hammerstein (1983). A review of the extensive literature on the subject has been provided by Hines (1987). Harley (1981) has developed and explored ideas about learning rules and evolutionary game theory in an attempt to understand how an ESS might be achieved by non-genetic means. In Harley's theory, learning rules are the behavioral phenotypes of interest; he is concerned with characterizing an ESS for learning. Harley has suggested an approximation to such an ESS known as the relative payoff sum learning rule. Tracy and Seaman (1989) have established a rigorous mathematical foundation for Harley's theory, correcting certain technical arguments appearing in Harley's paper.

Audley and Jonckheere (1956) have proposed a two-response-choice stochastic learning model and discussed several modifications, including some which allow the effect of the previous history sequence to depend on temporal distance from the present trial. Arnold (1973) has generalized Audley and Jonckheere's model to a continuum of responses. He has also considered the effect of the previous history sequence, but provided for lag-one dependencies through the introduction of correlation parameters. Harley's model also makes use of the previous history sequence, though not through the use of correlation parameters. We shall see that, although he was apparently unaware of the connection, Harley's model utilizes temporal distance just as Audley and Jonckheere had suggested.

In this paper, we establish a correspondence among these models, thus establishing a potential link between evolutionary game theory and classical learning theory. In Section 2, we review Arnold's generalization of Audley and Jonckheere's urn model. In Section 3 we present Harley's model and the relative payoff sum learning rule. We characterize Harley's development as an urn model in Section 4 and establish the connections to the work of Arnold, Audley, and Jonckheere.

2. ARNOLD'S URN SCHEME

Consider a sequence of trials in which an individual must make a choice between two actions. In particular, let $\mathcal{U}$ be a set (or an urn) initially containing $\alpha_i$ balls of color $C_i$, $i = 1, \ldots, k$. After each trial, one of $p$ sets $\mathcal{B}_j$ of balls is added to the urn. The set $\mathcal{B}_j$ is known as a reinforcement set. Let $\alpha_{ij}$ be the number of balls of color $i$ in $\mathcal{B}_j$, $i = 1, \ldots, k$, $j = 1, \ldots, p$, so that the total number of balls in $\mathcal{B}_j$ is $\sum_{i=1}^{k} \alpha_{ij}$. After trial $t$, reinforcement set $\mathcal{B}_j$ is added to the urn with probability $\pi_j$. The $\pi_j$'s cannot depend on the results of trials $1, \ldots, t-1$. Arnold (1967, p. 303) refers to this as noncontingent reinforcement (see also Arnold, 1973, p. 233; Atkinson and Estes, 1963, p. 142; or Johnson and Kotz, 1977, p. 271). The assumption of noncontingent reinforcement is used in the derivation of certain asymptotic results (see the end of Section 3). On each of $t$ trials, an individual samples (with replacement) from one of the reinforcement sets $\mathcal{B}_j$ added on previous trials or from $\mathcal{U}$. Thus, on trial $t$ there are two possible actions:

Action 1. Randomly select a ball from the reinforcement set added after trial $t-g$, $g = 1, \ldots, t-1$, or

Action 2. Randomly select a ball from $\mathcal{U}$.
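To make the scheme concrete, the following minimal Python sketch simulates the urn, the reinforcement sets, and the two actions. It is not from any of the cited papers; all names and numerical values (k, urn, reinforcement_sets, pi, delta) are illustrative assumptions, and the selection probabilities in delta correspond to the quantities $\delta_g$ formalized below.

```python
import random

# A sketch of the sampling scheme described above.  All parameter values are
# illustrative only; they do not come from Arnold or Audley and Jonckheere.
k = 2                         # number of colors (possible responses)
urn = [3, 1]                  # alpha_i: initial balls of color C_i in the urn U
reinforcement_sets = [        # alpha_{ij}: the p = 2 reinforcement sets B_j
    [2, 0],                   # B_1
    [0, 2],                   # B_2
]
pi = [0.5, 0.5]               # pi_j: noncontingent probability of adding B_j

# delta[g-1]: probability of Action 1 using the set added after trial t - g.
# Any nonnegative sequence summing to less than 1 will do; Action 2 gets the rest.
delta = [0.2 * 0.5 ** g for g in range(100)]

def run_trial(urn, added_sets):
    """Simulate one trial: choose Action 1 or Action 2, draw a ball color in
    proportion to the counts in the chosen set, then add a reinforcement set."""
    source = urn                              # Action 2 by default (prob. beta_t)
    u, acc = random.random(), 0.0
    for g in range(1, len(added_sets) + 1):   # Action 1 choices, most recent first
        acc += delta[g - 1]
        if u < acc:
            source = added_sets[-g]           # the set added after trial t - g
            break
    color = random.choices(range(k), weights=source)[0]
    j = random.choices(range(len(reinforcement_sets)), weights=pi)[0]
    for i in range(k):                        # the urn accumulates the new set
        urn[i] += reinforcement_sets[j][i]
    added_sets.append(reinforcement_sets[j])
    return color

added = []
responses = [run_trial(urn, added) for _ in range(20)]
```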


Let $\delta_g$, $g = 1, \ldots, t-1$, be the probability of selecting from the reinforcement set added after trial $t-g$ in Action 1. Thus, Action 2 has probability $\beta_t = 1 - \sum_{g=1}^{t-1} \delta_g$. Clearly, $\sum_{g=1}^{\infty} \delta_g$ converges and $\lim_{t\to\infty} \beta_t = 1 - \sum_{g=1}^{\infty} \delta_g$. Let $Y_h$ be the value of the index of the reinforcement set chosen after trial $h$. Note that $\Pr(Y_h = j) = \pi_j$ and the $Y_h$'s are mutually independent. Let $X_t$ take the value $i$ if a ball of color $C_i$ is chosen on trial $t$, $i = 1, \ldots, k$. Then the probability that the ball chosen on trial $t$ will be of color $C_i$, given the sequence of reinforcement sets for trials $1, \ldots, t-1$, is given by

$$\Pr\left(X_t = i \,\middle|\, \bigcap_{j=1}^{t-1} (Y_j = a_j)\right) = \beta_t \left( \frac{\alpha_i + \sum_{h=1}^{t-1} \alpha_{i a_h}}{\sum_{i=1}^{k} \alpha_i + \sum_{h=1}^{t-1} \sum_{i=1}^{k} \alpha_{i a_h}} \right) + \sum_{g=1}^{t-1} \delta_g \left( \frac{\alpha_{i a_{t-g}}}{\sum_{i=1}^{k} \alpha_{i a_{t-g}}} \right), \qquad (1)$$

where $a_h$ indexes the reinforcement set added after trial $h$; that is, $a_h \in \{1, 2, \ldots, p\}$, $h = 1, \ldots, t-1$. Note that the accumulation of balls in the urn changes the probability of a particular color being chosen on any trial. This is the actual "learning" procedure. Arnold (1973, p. 234) has noted that "the Audley-Jonckheere urn scheme can be identified as the special case where $\delta_i = 0\ \forall i$." Thus, the Audley-Jonckheere urn scheme takes the following form:

$$\Pr\left(X_t = i \,\middle|\, \bigcap_{j=1}^{t-1} (Y_j = a_j)\right) = \frac{\alpha_i + \sum_{h=1}^{t-1} \alpha_{i a_h}}{\sum_{i=1}^{k} \alpha_i + \sum_{h=1}^{t-1} \sum_{i=1}^{k} \alpha_{i a_h}}. \qquad (2)$$

Clearly, in this form, the Audley-Jonckheere urn scheme makes no use of previous history sequences. However, as we have already noted, Audley and Jonckheere (1956, p. 89) have suggested that temporal distance be used for such purposes: “Another way in which we might modify assumptions as to the effects of previous events on the probability of success at the (n + 1)th trial is to weight the effects with some function of their temporal distance from the trial in question.” In the next section, we shall see that Harley has done just that.
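The conditional probabilities in (1) and (2) can also be computed directly. The sketch below is ours, not Arnold's or Audley and Jonckheere's; the function names and example values are arbitrary illustrations of the reconstructed formulas.

```python
def arnold_prob(i, alpha, added, delta):
    """Eq. (1): P(X_t = i) given that the sets in `added` were added on trials
    1, ..., t-1.  `alpha[i]` is the initial count of color C_i in the urn,
    `added[h-1]` holds the counts of the set added after trial h, and
    `delta[g-1]` is the probability of sampling the set added after trial t-g."""
    t_minus_1 = len(added)
    beta_t = 1.0 - sum(delta[:t_minus_1])
    urn_i = alpha[i] + sum(s[i] for s in added)
    urn_total = sum(alpha) + sum(sum(s) for s in added)
    p = beta_t * urn_i / urn_total
    for g in range(1, t_minus_1 + 1):
        s = added[-g]                          # set added after trial t - g
        p += delta[g - 1] * s[i] / sum(s)
    return p

def audley_jonckheere_prob(i, alpha, added):
    """Eq. (2): the special case with every delta_g = 0 (urn sampling only)."""
    return arnold_prob(i, alpha, added, delta=[0.0] * len(added))

# Arbitrary illustration: two colors, three completed trials.
alpha = [3.0, 1.0]
added = [[2.0, 0.0], [0.0, 2.0], [2.0, 0.0]]
print(arnold_prob(0, alpha, added, delta=[0.2, 0.1, 0.05]))
print(audley_jonckheere_prob(0, alpha, added))
```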

3. HARLEY'S RELATIVE PAYOFF SUM LEARNING RULE

As we have stated, evolutionary game theory is concerned with the evolution of behavioral phenotypes. In the application of game theory to the evolution of behavior it is not unreasonable to assume that individuals play a given "game" repeatedly. Foraging and mating come immediately to mind. Suppose the strategy (behavioral phenotype) to be employed in such a game is learned rather than genetically predetermined. If the learning process is genetically determined, then it must be under the influence of natural selection and potentially amenable to study under evolutionary game theory.

Harley (1981) has approached the problem of the evolution of learning from an evolutionary game-theoretic point of view. He has defined an evolutionarily stable (ES) learning rule and established an approximation to it. As we shall see, that approximation, the so-called relative payoff sum (RPS) learning rule, is closely related to the urn models of Audley, Jonckheere, and Arnold.

Formally, the RPS rule is defined as follows. For a given game, suppose an individual has $k$ possible distinct behaviors ("pure strategies") and let $r_i$ be proportional to the a priori probability of employing the $i$th behavior; that is, prior to learning, the probability of using behavior $i$ is proportional to $r_i$, $i = 1, \ldots, k$. (There is evidence for the existence of genetically determined probability distributions; see the discussion on the female digger wasp Sphex ichneumoneus in Maynard Smith, 1982, p. 75, and the references therein.) Let $R_i(t)$ be the probability of using behavior $i$ on trial $t$ and let the random variable $Q_i(t)$ be the payoff, representing changes in Darwinian fitness, for using behavior $i$ on trial $t$. Then, for $0 < m < 1$ and $t \geq 2$, Harley has proposed that

$$R_i(t) = \frac{r_i + \sum_{h=1}^{t-1} m^{t-h-1} Q_i(h)}{\sum_{j=1}^{k} \left( r_j + \sum_{h=1}^{t-1} m^{t-h-1} Q_j(h) \right)}.$$

The coefficient $m^{t-h-1}$, called the memory factor, provides for heavier weights on more recent payoffs. Harley has described the RPS learning rule as specifying "most frequently the behaviour which has, up to present, paid the most, but only in proportion, roughly, to its cumulative payoff relative to the overall total" (1981, p. 617). Note that, like (1), $R_i(t)$ is a conditional probability, since it depends on a specific prior sequence of trials and outcomes. Arnold (1973) has derived the asymptotic response probability distributions for his noncontingent reinforcement urn scheme. Motivated by Arnold's asymptotic results, Tracy and Seaman (1989) have shown that, under certain conditions, Harley's RPS learning rule approximates the ES learning rule in the sense that the two have the same limiting form with probability one; that is, the RPS rule converges almost surely to the ES learning rule.
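As a sketch only, assuming the reconstruction of the RPS formula above is faithful to Harley's definition, the rule can be written in a few lines; the function name and example numbers below are ours, not Harley's.

```python
def rps_probability(i, r, payoffs, m, t):
    """Relative payoff sum rule: probability R_i(t) of using behavior i on
    trial t, for 0 < m < 1 and t >= 2.  `r[j]` is the prior weight of behavior
    j and `payoffs[h-1][j]` is the realized payoff Q_j(h) on trial h."""
    def weighted_sum(j):
        return r[j] + sum(m ** (t - h - 1) * payoffs[h - 1][j] for h in range(1, t))
    return weighted_sum(i) / sum(weighted_sum(j) for j in range(len(r)))

# Illustrative example: two behaviors, equal prior weights, payoffs on trials 1-3.
r = [1.0, 1.0]
payoffs = [[0.0, 2.0], [1.0, 0.0], [0.0, 3.0]]
print([rps_probability(i, r, payoffs, m=0.8, t=4) for i in range(2)])
```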

4. CONNECTING THE MODELS

Although he did not recognize it as such, Harley's RPS learning rule is a generalization of Audley and Jonckheere's urn model in the same spirit as Arnold's generalization. Like Arnold, Harley allows for more than two responses. On a given trial, the player can choose behaviors from the existing repertoire of $k$ distinct behaviors. Since the reinforcement urn sets correspond to units of Darwinian fitness from payoffs which have been translated into reinforcement probabilities (see Harley, 1981, pp. 617-619, for a possible physiological structure capable of effecting such a translation), choosing from this repertoire is analogous to selecting a ball from the urn; that is, choosing Action 2, not Action 1, in Arnold's model. This forces $\delta_i = 0$ for all $i$, so that $\beta_t = 1$ for all $t$ and, in view of Eq. (2), the correspondence between Harley's model and Audley and Jonckheere's urn scheme begins to take shape.

It remains to consider Harley's utilization of previous history sequences. Here, like Audley and Jonckheere, he makes use of a geometric progression of weights. These weights are in the form of Harley's memory factors, which result in greater emphasis being placed on more recent trials. The inclusion of Harley's memory factors in (2) completes the correspondence.
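The correspondence can be illustrated numerically. The sketch below (ours, reusing rps_probability and the example values from the sketch in Section 3) treats the prior weights $r_j$ as initial ball counts and the memory-weighted payoffs $m^{t-h-1} Q_j(h)$ as the temporally weighted "balls" added after trial $h$, and then applies the Audley-Jonckheere form (2) with every $\delta_g = 0$; the result coincides with the RPS probability.

```python
def rps_via_weighted_urn(i, r, payoffs, m, t):
    """The Section 4 correspondence, sketched numerically: take r_j as the
    initial ball counts and m**(t-h-1) * Q_j(h) as the temporally weighted
    'balls' added after trial h, then apply the Audley-Jonckheere form (2)
    with every delta_g = 0, so that beta_t = 1 and only the urn is sampled."""
    weighted_sets = [[m ** (t - h - 1) * payoffs[h - 1][j] for j in range(len(r))]
                     for h in range(1, t)]
    urn_i = r[i] + sum(s[i] for s in weighted_sets)
    urn_total = sum(r) + sum(sum(s) for s in weighted_sets)
    return urn_i / urn_total

# Agrees with rps_probability from the previous sketch for any inputs, e.g.:
r, payoffs, m = [1.0, 1.0], [[0.0, 2.0], [1.0, 0.0], [0.0, 3.0]], 0.8
assert abs(rps_via_weighted_urn(0, r, payoffs, m, t=4)
           - rps_probability(0, r, payoffs, m, t=4)) < 1e-12
```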

5. CONCLUSION

Evolutionary game theory has had a great impact on thinking in behavioral ecology. Perhaps the correspondences considered in this paper will extend that influence into the domain of learning theory. Indeed, Maynard Smith (1982, p. 55) has noted “... there is not only a formal analogy between learning and evolution; there is also a causal connection between them.”

ACKNOWLEDGMENT

We are indebted to the editor and two anonymous referees for their helpful criticism on an earlier version of this manuscript.

REFERENCES

ARNOLD, B. C. (1967). A generalized urn scheme for simple learning with a continuum of responses. Journal of Mathematical Psychology, 4, 301-315.
ARNOLD, B. C. (1973). Response distributions for a generalized urn scheme under noncontingent reinforcement. Journal of Mathematical Psychology, 10, 232-239.
ATKINSON, R. C., & ESTES, W. K. (1963). Stimulus sampling theory. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology (Vol. 2, Chap. 10). New York: Wiley.
AUDLEY, R. J., & JONCKHEERE, A. R. (1956). The statistical analysis of the learning process. British Journal of Statistical Psychology, 9, 87-94.
HARLEY, C. B. (1981). Learning the evolutionarily stable strategy. Journal of Theoretical Biology, 89, 611-633.
HINES, W. G. S. (1987). Evolutionarily stable strategies: A review of basic theory. Theoretical Population Biology, 31, 195-272.
JOHNSON, N. L., & KOTZ, S. (1977). Urn models and their application. New York: Wiley.
MAYNARD SMITH, J. (1982). Evolution and the theory of games. Cambridge: Cambridge University Press.
MAYNARD SMITH, J., & PRICE, G. R. (1973). The logic of animal conflict. Nature, 246, 15-18.
RIECHERT, S., & HAMMERSTEIN, P. (1983). Game theory in the ecological context. Annual Review of Ecology and Systematics, 14, 377-409.
TRACY, N. D., & SEAMAN, J. W. (1989). Evolutionarily stable learning rules. Working Paper Series IS-WP-89-11, Department of Information Systems, Baylor University.

RECEIVED: June 18, 1990