Engineering Applications of Artificial Intelligence 20 (2007) 667–675 www.elsevier.com/locate/engappai
Approximation spaces in off-policy Monte Carlo learning

James F. Peters, Christopher Henry

Department of Electrical and Computer Engineering, University of Manitoba, Winnipeg, Man., Canada R3T 5V6

Received 19 November 2006; accepted 20 November 2006. Available online 3 January 2007.
Abstract

This paper introduces an approach to off-policy Monte Carlo (MC) learning guided by behaviour patterns gleaned from approximation spaces and rough set theory introduced by Zdzisław Pawlak in 1981. During reinforcement learning, an agent makes action selections in an effort to maximize a reward signal obtained from the environment. The problem considered in this paper is how to estimate the expected value of cumulative future discounted rewards in evaluating agent actions during reinforcement learning. The solution to this problem results from a form of weighted sampling using a combination of MC methods and approximation spaces to estimate the expected value of returns on actions. This is made possible by considering behaviour patterns of an agent in the context of approximation spaces. The framework provided by an approximation space makes it possible to measure the degree that agent behaviours are a part of ("covered by") a set of accepted agent behaviours that serve as a behaviour evaluation norm. Furthermore, this article introduces an adaptive action control strategy called run-and-twiddle (RT) (a form of adaptive learning introduced by Oliver Selfridge in 1984), where approximation spaces are constructed on a "need by need" basis. Finally, a monocular vision system has been selected to facilitate the evaluation of the reinforcement learning methods. The goal of the vision system is to track a moving object, and rewards are based on the proximity of the object to the centre of the camera field of view. The contribution of this article is the introduction of a RT form of off-policy MC learning.
© 2006 Elsevier Ltd. All rights reserved.

Keywords: Approximation space; Behaviour pattern; Expectation; Off-policy learning; Monte Carlo method; Rough sets; Run-and-twiddle
An approximation space ... serves as a formal counterpart of perception ability or observation.
Ewa Orłowska, March 1982

1. Introduction

The problem considered in this paper is how to guide reinforcement learning by an agent in a non-stationary environment using weighted sampling of returns. Specifically, one might ask how to estimate the expected value of returns relative to what has been learned from experience (i.e., from previous states). The solution to this problem stems from a rough set approach to reinforcement learning. This is made possible by observing returns received from actions during intervals of time called episodes. During an
E-mail addresses: [email protected] (J.F. Peters), [email protected] (C. Henry).
episode, returns and other state information are recorded in information tables called ethograms (Peters et al., 2005a) that provide a basis for observing patterns of past behaviour. Furthermore, these patterns are used to guide future action selections through the use of approximation spaces, which provide a framework for estimating the value of returns on actions carried out by an agent. In addition, a form of adaptive learning called run-and-twiddle (RT) introduced by Oliver Selfridge in 1984 (Selfridge, 1984) is investigated, where learning parameters are adjusted as a result of poor performance. By way of illustration of the suggested approach, this paper considers the problem of how to estimate the value of returns during off-policy Monte Carlo (MC) reinforcement learning (see, e.g., Gaskett, 2002; Precup et al., 2000; Sutton and Barto, 1998). This paper represents a continuation of research on approximation space-based reinforcement learning (see, e.g., Peters, 2005a; Peters et al., 2005a–d) and related work on approximation spaces (see, e.g., Peters, 2004; Peters
et al., 2003; Stepaniuk, 1998; Skowron and Stepaniuk, 2005). The approach suggested in this paper is reminiscent of the MC method introduced by Ulam (1951), where one estimates the chances of success of a future event based on knowledge of the number of successes during a number of previous trials. A basic assumption in this paper is that the environment is non-stationary and that a perfect model of the environment is not available.

Let r_i denote the ith reward received in an episode (resulting from an action selected by an agent). Then, for an episode m that ends at time T, define the return R as

R = r_1 + γ r_2 + γ^2 r_3 + ... + γ^{T−1} r_T,

which is the cumulative future discounted reward resulting from a sequence of actions, where γ ∈ [0, 1] is called a discount rate. Define a policy π(s, a) (we use π(s, a) and π interchangeably) as a mapping from an environment state s to the probability of selecting a particular action. The basic idea is to choose actions during an episode (according to some policy) that ends in a terminal state at time T so that the expected discounted return E_π(R) following policy π improves. In choosing actions, it is necessary to estimate the expected value of R. Let Pr(X = x) denote the probability that X equals x. It is assumed that the return R (cumulative discounted future rewards) for a sequence of actions is a discrete random variable¹, and the probability Pr(R = r) is not known. In effect, if the episodic behaviour of an agent yields a sequence of returns R_1, ..., R_n over n episodes, the value of the expectation E[R] = Σ_{j=1}^{n} x_j Pr(R_j = x_j) is not known. MC methods (see, e.g., Hammersley and Handscomb, 1964; Rubinstein, 1981; Ulam, 1951) offer an approach to estimating the expected value of R. In general, a MC method relies on the use of pseudo-random methods to construct estimates of unknown quantities such as the expected value of a random variable. Consequently, an agent using these methods can only "learn" the expected return based on observations from its environment. This means the agent learns only from experience, which is reflected in sample sequences of states, actions, rewards and returns on sequences of actions that result from interaction with a non-stationary environment (i.e., a constantly changing, complex environment). This form of experience is episodic with randomly varying duration, and it is possible for each episode to yield markedly different returns on sequences of actions. In such an environment, continual exploration is essential. That is, an agent learns during an episode by exploring actions that may not have been particularly promising (low rewards) in the past. MC estimation is used in controlling behaviour by estimating the value of actions by an agent that follows a particular policy. The conventional off-policy learning control method, its approximation space-based counterpart, and a RT method are considered in this article.

¹ A random variable is a function X : Ω → R defined on the domain of a probability space, where Ω is a sample space (Jukna, 2001).
The contribution of this article is the introduction of a RT form of off-policy MC learning.

This article is organized as follows. Some of the basic notions associated with reinforcement learning are given in Section 2. A brief overview of off-policy MC learning is given in Section 3. The basic idea of an approximation space is briefly presented in Section 4, which includes the basic ideas and notation from rough set theory. The basic approach to reinforcement learning considered in the context of approximation spaces is presented in Section 5. An ethogram-based form of off-policy MC learning is given in Section 6. A RT form of off-policy MC learning is given in Section 7. Sample test results with a monocular vision system that learns using various forms of the off-policy MC learning algorithms are presented in Section 8.

2. Reinforcement learning

Reinforcement learning is the act of learning the correct action to take in a specific situation based on feedback obtained from the environment (Sutton and Barto, 1998; Watkins, 1989). In the context of this paper, feedback is a numerical reward associated with each action taken by an agent. In this research, rewards received in response to actions are combined with feature extraction from observed individual behaviours (this information is accumulated in ethograms; Peters et al., 2005a). An ethogram provides a record of less rewarding and more rewarding behaviours, which becomes useful in controlling the actions of an agent so that the reward signal "improves" (increases in amplitude). At some point during an episode, an agent will pause to take into account the content of an ethogram, and then modify its behaviour accordingly. This approach to learning is reminiscent of the RT method (Selfridge, 1984; Watkins, 1989). That is, an agent at first performs actions experimentally, then pauses ("twiddles") long enough to gauge which actions have more promise before continuing.

A central mechanism in the algorithms discussed in this paper is the update rule (see, e.g., Sutton and Barto, 1998). Let S be the set of all states, and let s ∈ S denote the current state. Similarly, let A(s) be the set of all actions for a given state, with a specific action a ∈ A(s). Also, let A = ∪_{s∈S} A(s) be the set of all possible actions. Next, define the state–action value function Q^π_n(s, a) as the estimated value of selecting action a in state s while following policy π (note, we abbreviate Q^π_n(s, a) as Q(s, a) or simply Q unless doing so introduces ambiguity). Let Q_k, Q_{k−1}, r_k, α denote the current estimated Q-value, the previous estimated Q-value, the action reward, and the step size, respectively. The update rule has the form

Q_k = Q_{k−1} + α [r_k − Q_{k−1}],

where α is used to control the rate of learning. The motivation for working with update rules in learning algorithms is the resulting computational simplicity as well as the intuitive representation of the incremental character of the learning process.

Using the MC method, Q*(s, a), the actual value of action a in state s, can be estimated using a weighted sum. Let w_i denote an importance sampling weight on R_i used to
obtain Q^π_n(s, a), which is an approximation of Q*(s, a) using

Q^π_n(s, a) ≈ (Σ_{i=1}^{n} w_i R_i) / (Σ_{i=1}^{n} w_i).   (1)

Property 1. A weighted average value function can be rewritten as an incremental update rule. For a proof, see Peters and Henry (2006).

3. Off-policy MC control

A MC control algorithm is dubbed "off policy" in the case where the policy that is being revised is not the one being used to make decisions. The two policies are π′ (called a behaviour policy, used to generate behaviour) and π (called an estimation policy). The advantage of this scheme is that the behaviour policy π′ can be used for exploration in sampling all possible actions, while the estimation policy π can be deterministic and can be improved after each episode (Sutton and Barto, 1998). In this way, information about non-greedy actions can still be acquired without sacrificing the deterministic policy, which would have fallen short of this task on its own. Notice that a requirement for this method to work is that the behaviour policy π′ must assign a non-zero probability to every action in every state (i.e., π′(s, a) > 0 for all state–action pairs) (Sutton and Barto, 1998). This is guaranteed through the use of the ε-greedy policy

π′(s, a) = 1 − ε + ε/|A(s)|   if a = a* (the greedy action),
π′(s, a) = ε/|A(s)|           if a ≠ a*.

The basic idea is to choose actions during an episode that ends in a terminal state at time T so that the expected return E[R] is maximized. To see how this is done, we first consider the weighted sampling of R-values, which utilizes Property 2. The formal algorithm (general case) is given in Algorithm 1, which reflects Property 2. Notice, also, that Algorithm 1 can be simplified using the update rule from Property 1.

Property 2. A weight on returns can be defined as a function of a single episodic behaviour policy π′ as shown in Eq. (2) (see Peters and Henry, 2006):

w_k = ∏_{k=t}^{T_i(s)−1} 1/π′(s_k, a_k).   (2)

Algorithm 1. Off-policy MC
Input: States s ∈ S, Actions a ∈ A.
Output: Policy π(s).
for (all s ∈ S, a ∈ A(s)) do
  π(s), π′(s, a) ← arbitrary policies;
  Q(s, a) ← arbitrary; N(s, a) ← 0; D(s, a) ← 0;
end
while True do
  Use π′(s, a) to generate an episode: s_0, a_0, r_1, s_1, a_1, r_2, ..., s_{T−1}, a_{T−1}, r_T, s_T;
  τ ← latest time at which a_τ ≠ π(s_τ);
  for (t = τ; t < T; t = t + 1) do
    a = a_t, s = s_t;
    w ← ∏_{k=t}^{T−1} 1/π′(s_k, a_k);
    N(s, a) ← N(s, a) + w R_t;
    D(s, a) ← D(s, a) + w;
    Q(s, a) ← N(s, a)/D(s, a);
  end
  for (each s ∈ S) do
    π(s) ← argmax_a Q(s, a);
  end
end
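As an illustration, the following C++ sketch outlines Algorithm 1 for a small discrete state–action space. It is not the implementation used in this study: the state and action counts, the placeholder environment, and the ε-greedy parameter are assumptions made only for the example, and γ = 1 is used so that the return is a plain sum of rewards.

```cpp
// Sketch of Algorithm 1 (off-policy MC control with weighted importance sampling).
// Hypothetical problem sizes and episode generator; only the update logic follows the text.
#include <algorithm>
#include <cstdlib>
#include <vector>

struct Step { int s, a; double r; };             // (s_t, a_t, r_{t+1})

const int NS = 8, NA = 3;                        // assumed state/action counts
const double EPSILON = 0.1;

// epsilon-greedy behaviour policy pi'(s, a) built around the greedy estimation policy pi(s)
double behaviourProb(int a, int greedy) {
    return (a == greedy) ? 1.0 - EPSILON + EPSILON / NA : EPSILON / NA;
}

int sampleBehaviourAction(int greedy) {
    if (std::rand() / (double)RAND_MAX < 1.0 - EPSILON) return greedy;
    return std::rand() % NA;
}

int main() {
    std::vector<std::vector<double>> Q(NS, std::vector<double>(NA, 0.0));
    std::vector<std::vector<double>> N(NS, std::vector<double>(NA, 0.0));
    std::vector<std::vector<double>> D(NS, std::vector<double>(NA, 0.0));
    std::vector<int> pi(NS, 0);                  // deterministic estimation policy

    for (int episode = 0; episode < 1000; ++episode) {
        // Generate an episode with the behaviour policy (the environment is a stand-in).
        std::vector<Step> ep;
        int s = std::rand() % NS;
        for (int t = 0; t < 20; ++t) {
            int a = sampleBehaviourAction(pi[s]);
            double r = std::rand() / (double)RAND_MAX;   // placeholder reward
            ep.push_back({s, a, r});
            s = std::rand() % NS;                        // placeholder transition
        }
        int T = (int)ep.size();

        // tau: latest time at which the behaviour action differs from pi(s_t)
        int tau = 0;
        for (int t = T - 1; t >= 0; --t)
            if (ep[t].a != pi[ep[t].s]) { tau = t; break; }

        // Return R_t (gamma = 1 for brevity) and the importance weight of Eq. (2)
        for (int t = tau; t < T; ++t) {
            double R = 0.0, w = 1.0;
            for (int k = t; k < T; ++k) {
                R += ep[k].r;
                w *= 1.0 / behaviourProb(ep[k].a, pi[ep[k].s]);
            }
            int st = ep[t].s, at = ep[t].a;
            N[st][at] += w * R;
            D[st][at] += w;
            Q[st][at] = N[st][at] / D[st][at];   // weighted average, Eq. (1)
        }

        // Greedy improvement of the estimation policy
        for (int i = 0; i < NS; ++i)
            pi[i] = (int)(std::max_element(Q[i].begin(), Q[i].end()) - Q[i].begin());
    }
    return 0;
}
```

Using the update rule of Property 1, the ratio N(s, a)/D(s, a) can instead be maintained incrementally, which is the form adopted in Algorithms 2 and 3.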
4. Approximation spaces

This section briefly presents some fundamental concepts in rough set theory (Pawlak, 1981, 1982) that provide a foundation for estimating the value of returns on actions performed during reinforcement learning. An overview of rough set theory and its applications is given in Grzymała-Busse et al. (2001), Polkowski and Skowron (1998), Skowron and Stepaniuk (2005), and Polkowski (2002). For computational reasons, a syntactic representation of knowledge in rough set theory is provided in the form of data tables.

4.1. Rough sets

Let U denote a non-empty finite set called a universe and let P(U) be the power set of U (i.e., the family of all subsets of U). Elements of U could, for example, represent objects or behaviours. A feature F of elements in U is measured by an associated probe function f = f_F whose range is denoted by V_f, called the value set of f; that is, f : U → V_f. There may be more than one probe function for each feature. For example, a feature of an object may be its weight, and different probe functions for weight are found by different weighing methods; or a feature might be colour, where probe functions could measure the quantity of red, green, blue, hue, intensity, and saturation contained in an image. The similarity or equivalence of objects can be investigated quantitatively by comparing a sufficient number of object features by means of probes (Pavel, 1993). For present purposes, each feature has only one associated probe function and its value set is taken to be a finite set (usually of real numbers). Thus one can identify the set of features with the set of associated probe functions, and hence we use f rather than f_F and call V_f = V_F a set of feature values. If F is a finite set of probe functions for features of elements in U, the pair (U, F) is called a data table, or information system (IS). For each subset B ⊆ F of probe functions, define the binary relation ∼_B = {(x, x′) ∈ U × U : ∀f ∈ B, f(x) = f(x′)}. Since
each ∼_B is an equivalence relation, for B ⊆ F and x ∈ U let [x]_B denote the equivalence class, or block, containing x, that is,

[x]_B = {x′ ∈ U : ∀f ∈ B, f(x′) = f(x)} ⊆ U.

If (x, x′) ∈ ∼_B (also written x ∼_B x′), then x and x′ are said to be indiscernible with respect to all feature probe functions in B, or simply, B-indiscernible. Information about a sample X ⊆ U can be approximated from information contained in B by constructing a B-lower approximation

B_*X = ∪_{x : [x]_B ⊆ X} [x]_B,

and a B-upper approximation

B^*X = ∪_{x : [x]_B ∩ X ≠ ∅} [x]_B.

The B-lower approximation B_*X is a collection of blocks of sample elements that can be classified with full certainty as members of X using the knowledge represented by features in B. By contrast, the B-upper approximation B^*X is a collection of blocks of sample elements representing both certain and possibly uncertain knowledge about X. Whenever B_*X ⊊ B^*X, the sample X has been classified imperfectly, and is considered a rough set. In this paper, only B-lower approximations are used.

4.2. Generalized approximation spaces (GASs)

This section gives a brief introduction to approximation spaces. A very detailed introduction to approximation spaces is presented in Skowron and Stepaniuk (1995), Stepaniuk (1998), and Polkowski (2002). An approximation space serves as a formal counterpart of perception or observation (Orłowska, 1982), and provides a framework for approximate reasoning about vague concepts. Define a neighbourhood function on a set U as a function N : U → P(U) that assigns to each x ∈ U some subset of U containing x. A particular kind of neighbourhood function on U is determined by any partition ξ : U = U_1 ∪ ... ∪ U_d, where for each x ∈ U, the ξ-neighbourhood of x, denoted N_ξ(x), is the U_i that contains x. In terms of the equivalence relations in Section 4.1, for some fixed B ⊆ F and any x ∈ U, [x]_B = N_B(x) naturally defines a neighbourhood function N_B. In effect, the neighbourhood function N_B defines an indiscernibility relation, which, in turn, defines for every object x a set of like-wise defined objects (i.e., objects whose value sets agree precisely; see, e.g., Peters et al., 2003). An overlap function ν on U is any function ν : P(U) × P(U) → [0, 1] that reflects the degree of overlap between two subsets of U. A GAS is a triple (U, N, ν), where U is a non-empty set of objects, N is a neighbourhood function on U, and ν is an overlap function on U. In this work, only indiscernibility relations determine N.
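To make the block and lower-approximation constructions concrete, the following C++ sketch partitions a small, hypothetical feature table into B-indiscernibility classes and collects the B-lower approximation of a sample X; the table values, the choice of B, and the sample X are illustrative assumptions only.

```cpp
// Sketch: B-indiscernibility classes and the B-lower approximation of a sample X.
// The feature table, B, and X are assumed for illustration.
#include <iostream>
#include <map>
#include <set>
#include <vector>

int main() {
    // Each row is an object x in U; each column holds a probe function value f(x).
    std::vector<std::vector<int>> table = {
        {1, 0, 2}, {1, 0, 2}, {0, 1, 2}, {0, 1, 1}, {1, 0, 2}
    };
    std::vector<int> B = {0, 1};                 // chosen probe functions B (column indices)
    std::set<int> X = {0, 1, 2, 4};              // sample X (object indices)

    // Partition U into blocks [x]_B: objects with identical values on all f in B.
    std::map<std::vector<int>, std::set<int>> blocks;
    for (int x = 0; x < (int)table.size(); ++x) {
        std::vector<int> key;
        for (int f : B) key.push_back(table[x][f]);
        blocks[key].insert(x);
    }

    // B-lower approximation: union of blocks entirely contained in X.
    std::set<int> lower;
    for (const auto& kv : blocks) {
        bool contained = true;
        for (int x : kv.second)
            if (!X.count(x)) { contained = false; break; }
        if (contained) lower.insert(kv.second.begin(), kv.second.end());
    }

    for (int x : lower) std::cout << x << ' ';   // objects certainly in X w.r.t. B
    std::cout << '\n';
    return 0;
}
```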
A set X ⊆ U is definable in a GAS if, and only if, X is the union of some values of the neighbourhood function. Specifically, any IS (U, F) and any B ⊆ F naturally define a parameterized approximation space AS_B = (U, N_B, ν), where N_B(x) = [x]_B, a B-indiscernibility class in a partition of U. A standard example (see, e.g., Skowron and Stepaniuk, 1995) of an overlap function is standard rough inclusion, defined by ν_SRI(X, Y) = |X ∩ Y| / |X| for non-empty X. Then ν_SRI(X, Y) measures the portion of X that is included in Y. An analogous notion is used in this work. If U = U_beh is a set of behaviours, let Y ⊆ U represent a kind of "standard" for evaluating sets of similar behaviours. For any X ⊆ U, we are interested in how well X "covers" Y, and so we consider another form of overlap function, namely, standard rough coverage ν_SRC, defined by

ν_SRC(X, Y) = |X ∩ Y| / |Y|   if Y ≠ ∅,
ν_SRC(X, Y) = 1               if Y = ∅.   (3)

In other words, ν_SRC(X, Y) returns the fraction of Y that is covered by X. In the case where X = Y, ν_SRC(X, Y) = 1. The minimum coverage value ν_SRC(X, Y) = 0 is obtained when X ∩ Y = ∅. One might note that for non-empty sets, ν_SRC(X, Y) = ν_SRI(Y, X).

5. Reinforcement learning with approximation spaces

By way of an illustration of the utility of approximation spaces, a rough set approach to reinforcement learning is briefly considered in this section. The study of reinforcement learning carried out in the context of approximation spaces is the outgrowth of recent work on approximate reasoning and intelligent systems (see, e.g., Peters, 2004, 2005a,b; Peters et al., 2002, 2005a–d; Peters and Henry, 2006). An overview of a MC approach to reinforcement learning with approximation spaces is given in Peters and Henry (2006).

An ethogram is an IS with an additional attribute representing a decision. During an episode, an ethogram is constructed, which provides the basis for an approximation space and the derivation of the degree that a block of equivalent behaviours covers a set of behaviours representing a standard (see, e.g., Peters, 2005a; Peters et al., 2005a; Tinbergen, 1963). Define a behaviour to be a tuple (s, a, r) observed at any one time t, and let U_beh = {x_0, x_1, x_2, ...} denote a set of behaviours. Decisions to accept or reject an action are made by the learning agent during the learning process; let d denote a decision (0 = reject action, 1 = accept action). Let B be the set of probe functions for state, action, and reward. The probe functions are suppressed, so identifying probe functions with features, write B = {s, a, r}. Let D = {x ∈ U_beh : d(x) = 1}, where d(x) = 1 specifies that behaviour x has been accepted by a learning agent. Assume that N_B : U_beh → P(U_beh) is used
to compute B_*D, which represents certain knowledge about the behaviours in D. For each possible feature value j of action a (that is, j ∈ V_a) and each x ∈ U_beh, put B_j(x) = [x]_B if, and only if, a(x) = j, and call B_j(x) an action block. Put ℬ_j = {B_j(x) : a(x) = j, x ∈ U_beh}, the set of blocks that "represent" action a(x) = j. Setting ν = ν_SRC, define the average (lower) rough coverage with respect to an action a(x) = j as

n̄_a = (1/|ℬ_j|) Σ_{B_j(x) ∈ ℬ_j} ν(B_j(x), B_*D).   (4)
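The following C++ sketch illustrates Eqs. (3) and (4) on a small, hypothetical ethogram with discretized state, action, and reward values; the table contents are assumptions made only for the example.

```cpp
// Sketch: standard rough coverage (Eq. 3) and average action coverage n_bar_a (Eq. 4)
// computed from a small, hypothetical ethogram. Column meanings are assumed.
#include <iostream>
#include <map>
#include <set>
#include <vector>

struct Behaviour { int s, a, r, d; };            // state, action, reward (discretized), decision

// Eq. (3): fraction of Y covered by X (1 by convention when Y is empty).
double roughCoverage(const std::set<int>& X, const std::set<int>& Y) {
    if (Y.empty()) return 1.0;
    int common = 0;
    for (int y : Y) if (X.count(y)) ++common;
    return (double)common / Y.size();
}

int main() {
    std::vector<Behaviour> ethogram = {
        {0, 1, 2, 1}, {0, 1, 2, 1}, {1, 1, 0, 0}, {1, 0, 2, 1}, {0, 0, 1, 0}
    };

    // Blocks [x]_B for B = {s, a, r}, and the accepted set D = {x : d(x) = 1}.
    std::map<std::vector<int>, std::set<int>> blocks;
    std::set<int> D;
    for (int x = 0; x < (int)ethogram.size(); ++x) {
        blocks[{ethogram[x].s, ethogram[x].a, ethogram[x].r}].insert(x);
        if (ethogram[x].d == 1) D.insert(x);
    }

    // B-lower approximation of D: union of blocks entirely inside D.
    std::set<int> lowerD;
    for (const auto& kv : blocks) {
        bool inside = true;
        for (int x : kv.second) if (!D.count(x)) { inside = false; break; }
        if (inside) lowerD.insert(kv.second.begin(), kv.second.end());
    }

    // Eq. (4): average coverage of B_*D by the action blocks of each action value j.
    std::map<int, double> nbar;                  // n_bar_a for each action value
    std::map<int, int> blockCount;
    for (const auto& kv : blocks) {
        int action = kv.first[1];
        nbar[action] += roughCoverage(kv.second, lowerD);
        blockCount[action] += 1;
    }
    for (auto& kv : nbar) {
        kv.second /= blockCount[kv.first];
        std::cout << "n_bar(a=" << kv.first << ") = " << kv.second << '\n';
    }
    return 0;
}
```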
Consequently, B_*D provides a useful behaviour standard, or behaviour norm, for gaining knowledge about the proximity of behaviours to what is considered normal. The term normal applied to a set of behaviours denotes forms of behaviour that have been accepted. The introduction of some form of behaviour standard makes it possible to measure the degree to which blocks of equivalent action-specific behaviours are "covered" by a set of those behaviours that are part of a standard. The framework provided by an approximation space makes it possible to derive pattern-based rewards, which are used by the learning agent in response to perceived states of its environment.

6. Off-policy MC control with approximation spaces

An approximation space-based alternative to the weighted sampling method in Algorithm 1 is briefly explored in this section.

6.1. Weighted sampling based on approximation spaces

The framework provided by an approximation space makes it possible to derive pattern-based rewards, which are used by a learning agent to choose actions in response to perceived states of the environment. Each ethogram constructed during an episode provides a record of behaviour patterns that have been observed. The action value function Q^π(s, a) can be calculated using the weighted average in the following equation (which is equivalent to Eq. (1) by letting W = Σ_{i=1}^{n} w_i):

Q^π(s, a) ≈ (1/W) Σ_{k=1}^{m} R_k w_k.

Next, the weight defined in Eq. (2) is used in conjunction with Eq. (4) to obtain an approximation space-based weight as shown in

w_m = (1 + n̄_a) ∏_{k=t_m}^{T_m − 1} 1/π′(s_k, a_k).   (5)

Thus, the weights given by Eq. (5) are emphasized by the average rough coverage value for a given action, and the weight reduces to the traditional case (Eq. (2)) in the event that n̄_a = 0. In other words, in approximating the expected value of the return R, the samples R_k can be weighted within the context of an approximation space. This leads to Property 3.
Property 3. The expectation E[R] can be estimated with approximation space-based weighted sampling.

To obtain the new form of weighted sampling, Algorithm 1 has been rewritten using the update rule from Property 1 and the approximation space-based weighted sampling estimates of the expected value of R_m from Property 3 (see Algorithm 2). A tail-of-episode (TOE) approach to off-policy MC control is used in Algorithm 2. That is, the new approach to weighted sampling is applied only after a swarm has gained some experience at the beginning of each episode. Experience at the beginning of an episode is represented by an extracted ethogram. The knowledge gained from each ethogram influences the weighted sampling (guided by average rough coverage values for each action) and governs behaviour in the tail of an episode.

Algorithm 2. TOE off-policy MC
Input: States s ∈ S, Actions a ∈ A.
Output: Policy π(s).
for (all s ∈ S, a ∈ A(s)) do
  π(s), π′(s, a) ← arbitrary policies;
  Q(s, a) ← arbitrary; R(s, a) ← 0; C(s, a) ← 0; W(s, a) ← 1;
end
while True do
  Use π′(s, a) to generate an episode: s_0, a_0, r_1, s_1, a_1, r_2, ..., s_{T−1}, a_{T−1}, r_T, s_T;
  τ ← latest time at which a_τ ≠ π(s_τ);
  Extract ethogram table DT_swarm = (U_beh, A, d);
  Compute n̄_a ∀a ∈ A;
  for (t = τ; t < T; t = t + 1) do
    a = a_t, s = s_t;
    C(s, a) ← C(s, a) + 1;
    R(s, a) ← R(s, a) + (1/C(s, a)) [r_t − R(s, a)];
    w ← (1 + n̄_a) ∏_{k=t}^{T−1} 1/π′(s_k, a_k);
    W(s, a) ← W(s, a) + w;
    Q(s, a) ← Q(s, a) + (w/W(s, a)) [R(s, a) − Q(s, a)];
  end
  for (all s ∈ S, a ∈ A(s)) do
    C(s, a) ← 0; R(s, a) ← 0;
  end
  for (each s ∈ S) do
    π(s) ← argmax_a Q(s, a);
  end
end
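A minimal C++ sketch of the Eq. (5) weight and of the incremental updates used in the tail-of-episode loop of Algorithm 2 is given below; the step records and the value of n̄_a are assumptions made only for the example, and the ethogram machinery itself is not shown.

```cpp
// Sketch: approximation space-based weight (Eq. 5) and the incremental
// tail-of-episode update of Algorithm 2. Types and values are assumed.
#include <vector>

struct TailStep { int s, a; double r, behaviourProb; };   // s_k, a_k, r_{k+1}, pi'(s_k, a_k)

// Eq. (5): w = (1 + n_bar_a) * prod_{k=t}^{T-1} 1 / pi'(s_k, a_k)
double toeWeight(const std::vector<TailStep>& episode, int t, double nBarA) {
    double w = 1.0 + nBarA;
    for (int k = t; k < (int)episode.size(); ++k)
        w /= episode[k].behaviourProb;
    return w;
}

// Incremental updates for one tail step (Property 1 form used in Algorithm 2).
void toeUpdate(double& C, double& R, double& W, double& Q,
               double reward, double w) {
    C += 1.0;
    R += (reward - R) / C;        // running average of rewards for (s, a)
    W += w;
    Q += (w / W) * (R - Q);       // weighted incremental estimate of Q(s, a)
}

int main() {
    // Hypothetical two-step tail with n_bar for the chosen action equal to 0.25.
    std::vector<TailStep> tail = {{0, 1, 0.8, 0.9}, {1, 1, 0.6, 0.9}};
    double C = 0.0, R = 0.0, W = 1.0, Q = 0.0;
    double w = toeWeight(tail, 0, 0.25);
    toeUpdate(C, R, W, Q, tail[0].r, w);
    return 0;
}
```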
In some sense, Algorithm 2 mimics the approach Ulam suggests in learning from experience (see, e.g., Ulam, 1951).
For example, consider trying to estimate the success of the next solitaire hand based on previous experience with solitaire hands. In our case, the ethogram extracted at some point after the beginning of each episode of Algorithm 2 reflects the experience of a learning agent in coping with an environment. The approximation space-based TOE off-policy MC algorithm is explained in detail in Peters et al. (2005d).
7. RT form of MC

Other forms of approximation space-based off-policy algorithms are also possible. For example, consider performing ethogram extraction (and consequently calculating an approximation space) only when things appear to be getting worse, i.e., when an agent's actions are consistently causing transitions into states with a lower estimate of Q. Such a situation can be observed by keeping track of the number of times Q(s_{t+1}, a_{t+1}) < Q(s_t, a_t). Intuitively, an agent should effect a change in its behaviour policy during such an event. This change is analogous to the "twiddle" described by Selfridge, and is implemented in Algorithm 3 by recalculating n̄_a whenever the number of times Q(s_{t+1}, a_{t+1}) < Q(s_t, a_t) exceeds some threshold. The result is an algorithm that adjusts its learning parameters based on previous experience whenever things appear to be getting worse; a code sketch of this trigger follows Algorithm 3.

Algorithm 3. RT off-policy MC
Input: States s ∈ S, Actions a ∈ A, initialized threshold th.
Output: Policy π(s).
for (all s ∈ S, a ∈ A(s)) do
  Initialize π(s), π′(s, a), R(s, a), C(s, a), Q(s, a), W(s, a), n̄_a;
end
u ← 0;
while True do
  Use π′(s, a) to generate an episode: s_0, a_0, r_1, s_1, a_1, r_2, ..., s_{T−1}, a_{T−1}, r_T, s_T;
  τ ← the latest time at which a_τ ≠ π(s_τ);
  for (t = τ; t < T; t = t + 1) do
    a = a_t, s = s_t;
    C(s, a) ← C(s, a) + 1;
    R(s, a) ← R(s, a) + (1/C(s, a)) [r_t − R(s, a)];
    w ← (1 + n̄_a) ∏_{k=t}^{T−1} 1/π′(s_k, a_k);
    W(s, a) ← W(s, a) + w;
    Q(s, a) ← Q(s, a) + (w/W(s, a)) [R(s, a) − Q(s, a)];
    if Q(s_{t+1}, a_{t+1}) < Q(s, a) then
      u ← u + 1;
      if u = th then
        Extract ethogram table DT_swarm = (U_beh, A, d);
        Compute n̄_a ∀a ∈ A;
        u ← 0;
      end
    end
  end
  for (all s ∈ S, a ∈ A(s)) do
    C(s, a) ← 0; R(s, a) ← 0;
  end
  for (each s ∈ S) do
    π(s) ← argmax_a Q(s, a);
  end
end
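The twiddle trigger of Algorithm 3 can be sketched as follows in C++; the threshold value, the number of actions, and the stubbed coverage computation are assumptions made only for the example.

```cpp
// Sketch: the run-and-twiddle trigger of Algorithm 3. A counter of accumulated
// Q decreases drives recomputation of n_bar_a; the ethogram machinery is only stubbed.
#include <vector>

struct RTMonitor {
    int u = 0;                   // number of observed Q decreases since the last twiddle
    int threshold;               // th in Algorithm 3 (an assumed tuning parameter)

    explicit RTMonitor(int th) : threshold(th) {}

    // Returns true when the agent should "twiddle", i.e., extract a fresh ethogram
    // and recompute the average rough coverage n_bar_a for every action.
    bool observe(double qNext, double qCurrent) {
        if (qNext < qCurrent) {
            ++u;
            if (u == threshold) { u = 0; return true; }
        }
        return false;
    }
};

// Stand-in for ethogram extraction and Eq. (4); a real agent would rebuild the
// approximation space from the behaviours recorded so far.
void recomputeCoverage(std::vector<double>& nBar) {
    for (double& v : nBar) v = 0.5;              // placeholder coverage values
}

int main() {
    RTMonitor monitor(3);                        // twiddle after 3 Q decreases
    std::vector<double> nBar(4, 0.0);            // n_bar_a for 4 assumed actions
    // Inside the episode loop: after each update, check the trigger.
    if (monitor.observe(/*qNext=*/0.2, /*qCurrent=*/0.4))
        recomputeCoverage(nBar);
    return 0;
}
```

The point of the design is that the comparatively expensive ethogram extraction and coverage computation run only when the counter reaches the threshold, rather than at the end of every episode as in the TOE method.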
8. Results

A monocular vision system was used to obtain the results for this paper (for details see Peters et al., 2006). The system consists of a digital camera with two degrees of freedom attached to a TS-5500 compact full-featured PC-compatible single board computer based on an AMD 586-class processor (TS-5500 User's Manual, 2004), which is part of a swarmbot similar to the one described in Mondada et al. (2004). The camera movements are provided by two Hobbico mini servos. The learning algorithms were implemented in C++ in a Linux development environment. The goal of the system is to track a moving target using a combination of image processing and reinforcement learning techniques. The target is a moving image displayed on the LCD screen of an IBM Thinkpad laptop. Furthermore, the camera remains at a fixed distance from the target.

The digital camera passes images to the preprocessing stage of the monocular vision system. Preprocessing of the images is necessary to create an environment suitable for reinforcement learning. The preprocessing entails using a process called template matching to calculate the coordinates of the target with respect to the origin (located at the centre) of a given image. Similarly, the image is divided into regions that represent states. The system is considered to be in state s if the coordinates of the target lie in the region associated with that state. Actions correspond to servo steps, where each state has a range of allowable movements. Finally, rewards are in the interval [0, 1] and are based on the Euclidean distance of the coordinates of the target from the origin, where the largest reward occurs when the target is located at the origin.

The results are given in Figs. 1–3 and were obtained by running five-minute trials with Algorithms 1–3 and parameter γ = 0.1, 0.5, and 1. Figs. 1 and 2 are plots of a running average of the distance of the target from the origin and the average reward, respectively.

[Fig. 1. Distance of target from origin. Panels plot the distance of the target measured from the origin against time steps for γ = 0.1, 0.5, and 1; each panel compares Off Policy MC, TOE, and Run and Twiddle.]

[Fig. 2. Normalized total reward. Panels plot the normalized average reward against episodes for γ = 0.1, 0.5, and 1; each panel compares Off Policy MC, TOE, and Run and Twiddle.]

[Fig. 3. Normalized total reward. Panels plot the normalized total state value against episodes for γ = 0.1, 0.5, and 1; each panel compares Off Policy MC, TOE, and Run and Twiddle.]

For γ = 0.1 and 0.5, the traditional MC algorithm performs better. This is due to the lower processing power of the TS-5500 (see, e.g., Peters et al., 2006). During learning, the TOE and RT methods both have to calculate approximation spaces. This causes the system to briefly pause (due to the large amount of calculation), and consequently they lose the target. Also note that the TOE method performs worse than the RT method, and this is due to the fact that the TOE method calculates an approximation space at the end of every episode, and the RT method only does so when needed. Next, observe that the RT method performs the best when γ = 1. This appears to be due to the fact that rewards are not being discounted at the end of an episode (when learning traditionally occurs in MC methods) and consequently the value of R for each
state is more sensitive to each reward. Lastly, observe that in Fig. 3 the approximation space-based methods appear to converge to Q* faster than the traditional MC method. This makes sense, since the weights in the approximation space-based algorithms are emphasized using rough coverage, which reflects decisions made in the past.
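For reference, the reward used by the vision system (largest when the target sits at the image centre, falling off with Euclidean distance) can be sketched as follows; the image dimensions and the particular normalization are assumptions, not the exact formula used in the experiments.

```cpp
// Sketch: a [0, 1] reward from the template-matching coordinates of the target,
// largest when the target is at the image centre. The image size and the
// normalization are illustrative assumptions.
#include <cmath>

const double HALF_WIDTH = 160.0;   // assumed half-width of the image in pixels
const double HALF_HEIGHT = 120.0;  // assumed half-height of the image in pixels

// (x, y) are target coordinates relative to the image centre (the origin).
double reward(double x, double y) {
    double dist = std::sqrt(x * x + y * y);
    double maxDist = std::sqrt(HALF_WIDTH * HALF_WIDTH + HALF_HEIGHT * HALF_HEIGHT);
    return 1.0 - dist / maxDist;   // 1 at the origin, approaching 0 at the corners
}

int main() {
    double r = reward(15.0, -40.0);   // example target offset in pixels
    return r > 0.0 ? 0 : 1;
}
```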
9. Conclusion

This article presents an approach to off-policy MC reinforcement learning within the context of approximation spaces. A basic assumption made in this article is that learning is carried out in a non-stationary environment. The results obtained from the monocular vision system tend to corroborate the hypothesis that the use of the episodic average degree of coverage of blocks of equivalent behaviours as a basis for weighted sampling estimates of the expected value of returns in the off-policy MC method results in improved reinforcement learning. Moreover, the concept of RT is to change an action strategy when things appear to be getting worse. Our results also indicate that this control strategy is an improvement. From this work, it has been found that approximation spaces provide frameworks for a weighted sampling approach to estimating the expected value of episodic returns during off-policy MC learning. Future work will include consideration of off-policy MC learning in other forms of approximation spaces, feature extraction, pattern recognition in binocular vision systems, and new platforms for testing the current as well as new forms of the learning algorithms.

Acknowledgements

The authors wish to thank Andrzej Skowron, David Gunderson and the anonymous reviewers for their insights and many helpful suggestions concerning parts of this article. This research has been supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) Grant 185986 and Manitoba Hydro Grant T277.

References
Gaskett, C., 2002. Q-Learning for Robot Control. Supervisor: A. Zelinsky, Department of Systems Engineering, The Australian National University.
Grzymała-Busse, J., et al. (Eds.), 2001. Rough sets and their applications. International Journal of Applied Mathematics and Computer Science 11 (3) (special issue).
Hammersley, J.M., Handscomb, D.C., 1964. Monte Carlo Methods. Methuen & Co. Ltd., London.
Jukna, S., 2001. Extremal Combinatorics. Springer, Berlin.
Mondada, F., Pettinaro, G.C., Guignard, A., 2004. A New Distributed Robotic Concept. Autonomous System Lab (LSA), EPFL-STI-I2S, Lausanne, Switzerland. Kluwer Academic Publishers, The Netherlands.
Orłowska, E., 1982. Semantics of vague concepts. Applications of rough sets. Report 469, Institute for Computer Science, Polish Academy of Sciences.
Pavel, M., 1993. Fundamentals of Pattern Recognition, second ed. Marcel Dekker Inc., New York.
Pawlak, Z., 1981. Classification of objects by means of attributes. Research Report PAS 429, Polish Academy of Sciences.
Pawlak, Z., 1982. Rough sets. International Journal of Computer and Information Sciences 11, 341–356.
Peters, J.F., 2004. Approximation space for intelligent system design patterns. Engineering Applications of Artificial Intelligence 17, 393–400.
Peters, J.F., 2005a. Rough ethology: towards a biologically inspired study of collective behaviour in intelligent systems with approximation spaces. Transactions on Rough Sets III, 153–174.
Peters, J.F., 2005b. Approximation spaces in off-policy Monte Carlo learning. In: Burczynski, T., Cholewa, W., Moczulski, W. (Eds.), Recent Methods in Artificial Intelligence Methods. AI-METH Series, Gliwice, pp. 139–144.
Peters, J.F., Henry, C., 2006. Reinforcement learning with approximation spaces. Fundamenta Informaticae 71, 1–27.
Peters, J.F., Skowron, A., Stepaniuk, J., Ramanna, S., 2002. Towards an ontology of approximate reason. Fundamenta Informaticae 51 (1–2), 157–173.
Peters, J.F., Skowron, A., Synak, P., Ramanna, S., 2003. Rough sets and information granulation. In: Bilgic, T., Baets, D., Kaynak, O. (Eds.), 10th International Fuzzy Systems Association World Congress, Lecture Notes in Computer Science, vol. 2715. Springer, Heidelberg, pp. 370–377.
Peters, J.F., Henry, C., Ramanna, S., 2005a. Rough ethograms: study of intelligent systems behaviour. In: Kłopotek, S., Wierzchoń, S.T., Trojanowski, K. (Eds.), Intelligent Information Systems, Advances in Soft Computing. Springer, Heidelberg, pp. 117–126.
Peters, J.F., Henry, C., Ramanna, S., 2005b. Reinforcement learning with pattern-based rewards. In: Proceedings of the Fourth International IASTED Conference on Computational Intelligence. Calgary, Alberta, CA, pp. 267–272.
Peters, J.F., Henry, C., Ramanna, S., 2005c. Reinforcement learning in swarms that learn. In: Skowron, A., et al. (Eds.), IEEE/WIC/ACM International Conference on Intelligent Agent Technology. Compiegne University of Technology, France, pp. 400–406.
Peters, J.F., Lockery, D., Ramanna, S., 2005d. Monte Carlo off-policy reinforcement learning: a rough set approach. In: Nedjah, N., et al. (Eds.), Fifth IEEE International Conference on Hybrid Intelligent Systems. Rio de Janeiro, Brazil, pp. 187–192.
Peters, J.F., Borkowski, M., Henry, C., Lockery, D., Ramanna, S., Gunderson, D.S., 2006. Line-crawling bots that inspect electric power transmission line equipment. In: Proceedings of the Third International Conference on Autonomous Robots and Agents (ICARA 2006). Palmerston North, New Zealand, 12–14 December 2006.
Polkowski, L., 2002. Rough Sets. Mathematical Foundations. Advances in Soft Computing. Springer, Heidelberg.
Polkowski, L., Skowron, A., 1998. Rough Sets in Knowledge Discovery 2. Studies in Fuzziness and Soft Computing, vol. 19. Springer, Heidelberg.
Precup, D., Sutton, R.S., Singh, S., 2000. Eligibility traces for off-policy evaluation. In: 17th Conference on Machine Learning (ICML 2000). Morgan Kaufmann, San Francisco, CA, pp. 759–766.
Rubinstein, R.Y., 1981. Simulation and the Monte Carlo Method. Wiley, Toronto.
Selfridge, O.G., 1984. Some themes and primitives in ill-defined systems. In: Selfridge, O.G., et al. (Eds.), Adaptive Control of Ill-Defined Systems. Plenum Press, London.
Skowron, A., Stepaniuk, J., 1995. Generalized approximation spaces. In: Lin, T.Y., Wildberger, A.M. (Eds.), Soft Computing. Simulation Councils, San Diego, pp. 18–21.
Skowron, A., Stepaniuk, J., 2005. Modelling complex patterns by information systems. Fundamenta Informaticae XXI, 1001–1013.
Stepaniuk, J., 1998. Approximation spaces, reducts and representatives. In: Polkowski, L., Skowron, A. (Eds.), Rough Sets in Knowledge Discovery 2. Studies in Fuzziness and Soft Computing, vol. 19. Springer, Heidelberg, pp. 109–126.
Sutton, R.S., Barto, A.G., 1998. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, USA.
Tinbergen, N., 1963. On aims and methods of ethology. Zeitschrift für Tierpsychologie 20, 410–433.
TS-5500 User's Manual, 2004. <http://www.embeddedarm.com/>.
Ulam, S., 1951. On the Monte Carlo method. In: Proceedings of the Second Symposium on Large-scale Digital Calculating Machinery, pp. 207–212.
Watkins, C.J.C.H., 1989. Learning from Delayed Rewards. Ph.D. Thesis, supervisor: Richard Young, King's College, University of Cambridge, UK.