Two Classifier Systems for Reinforcement Learning of Motion Patterns

Copyright © IFAC Mobile Robot Technology, Jejudo Island, Korea, 2001

K. Yamada*, M. Svinin*, K. Ohkura*, S. Hosoe**, K. Ueda*

* Mechanical Engineering Department, Kobe University, 1-1, Rokkodai, Nada-ku, Kobe 657-8501, Japan
** Bio-Mimetic Control Research Center, RIKEN, Anagahora, Shimoshidami, Nagoya 463-0003, Japan


Abstract: Two reinforcement learning techniques for adaptive segmentation of a continuous state space are formulated in this paper. The first technique uses weight vectors and a continuous matching function, while the second makes use of the Bayesian discrimination method, where the segmented state space is represented by Bayes boundaries. The proposed techniques have been tested under simulation for a navigation task, where the agent has no a priori knowledge of the environment or of its own internal model. The simulation results show that the agent can segment its continuous state space adaptively and reach the goal. Copyright © 2001 IFAC

Keywords: Reinforcement learning, classifier systems, adaptive segmentation.

1. INTRODUCTION

The development of autonomous robots is a very promising direction challenging researchers and engineers working in the fields of robotics and artificial intelligence. In recent years, many interesting paradigms and concepts have been proposed for controlling autonomous robots. Reinforcement learning control is among them.

In this paper we explore an approach where reinforcement learning is realized by classifier systems that output control commands in response to a sensory input. In a classifier system, the actually observable sensor space is populated with IF-THEN rules mapping certain regions of the state space to certain actions. As learning progresses, the state space evolves and its structure is self-organized.

In our research we were inspired by the framework of learning classifier systems (LCS). Basically, LCS are a class of inductive models capable of adaptation and learning in reinforcement mode (Holland and Reitman, 1978). In a sense, they emulate the mental processes in animals and humans.

Classical LCS, as outlined by Holland, deal with a binary representation of the sensor and action spaces. However, in many applications the sensor and action data are real-valued, and the mapping from the sensor space to the action space is continuous. Developing and extending the framework of LCS to a continuous mapping in the real-valued data format is therefore an important problem.

In conventional LCS, the sensory input $x \in \{0,1\}^n$ and the condition part of the $i$-th classifier $v_i \in \{0,1,\#\}^n$. Geometrically, the "don't care" symbol #, placed in the $k$-th position, defines a projection onto the $k$-th coordinate axis of $\{0,1\}^n$. A generalized rule with $k$ symbols # covers a $k$-dimensional subspace of $\{0,1\}^n$.

On the other hand, the symbol # defines a zero binary weight of the corresponding sensor. Thus, the condition part $v_i$ contains information of a different nature, about the states and about the weights associated with the classifier. While this is convenient for the compact representation of the classifier, it can be confusing (or conceptually incorrect) when later on one has to conduct the genetic operations with the components of $v_i$.
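To make this classical representation concrete, the following minimal sketch (our illustration, not code from the paper) shows how a ternary condition over {0, 1, #} covers binary sensor strings; the function name and the example strings are ours.

```python
def matches(condition: str, sensor: str) -> bool:
    """Return True if a ternary LCS condition (over {0, 1, #}) covers a binary
    sensor string: '#' matches either bit, while 0/1 must match exactly."""
    return all(c == '#' or c == s for c, s in zip(condition, sensor))

# A condition with k '#' symbols covers a k-dimensional subspace of {0,1}^n:
print(matches("1#0#", "1101"))  # True  -- positions 2 and 4 are "don't care"
print(matches("1#0#", "0100"))  # False -- the first bit differs
```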

An initial study on continuous LCS was undertaken by Wilson (1992), who tried to combine the approximation philosophy of CMAC (Albus, 1975) with the adaptive possibilities of classifier systems. Later on, Wilson (2000) extended his accuracy-based classifier system to real-valued inputs and discrete outputs. The system, named XCSR, was tested on learning a Boolean multiplexer function mapping $\mathbb{R}^n \rightarrow \{0,1\}$.

Another possible approach to the introduction of real-valued variables into the discrete domain of LCS is to use the theory of fuzzy sets. This approach was first proposed by Valenzuela-Rendón (1991). In it, every real-valued variable (input, output, or internal one) is represented by a fixed set of membership functions corresponding to the linguistic values of this variable. Thus, the search space of fuzzy LCS can be described by the (finite) alphabet of the linguistic variables, augmented with the "don't care" symbol (Bonarini, 2000).

The main drawback of the fuzzy LCS is that the parameters of the membership functions are not changed adaptively, so that fewer resources are employed where the environment is changing slowly. Parodi and Bonelli (1993) have proposed a system in which rules encode the center and the width of triangular membership functions. They, however, used a very simplified credit assignment procedure that is hardly suitable for reinforcement learning.

While the concept of fuzzy LCS looks very promising for inductive learning, its machinery is heavy. This leads to complex data structures and algorithms, and often makes it difficult to clearly understand the learning process and interpret its results. To simplify the design of LCS, the discretization of the continuous process can be done on the level of n-ary fuzzy relations (n-dimensional membership functions). The scalar membership functions, if necessary, can be restored by projecting the fuzzy relations onto the corresponding coordinate axes.

In this paper we sketch two possible approaches to the implementation of classifier systems working in continuous state and action domains. The first one is a fuzzy-like classifier system, using weight vectors and continuous matching functions, while the other exploits the ideas of the Bayesian discrimination technique. These two classifier systems are then tested under simulation for a robot navigation task.

2. FUZZY-LIKE CLASSIFIER SYSTEM

In this section we show how the conventional framework of LCS can be blended with the ideas of fuzzy set theory, with fuzzy relations (matching functions) being defined on the receptive fields. The structure of the classifier system is similar to the basic LCS proposed by Wilson (1994), except that it does not contain the genetic component, which is an important but independent component of LCS.

2.1 System description

Let $x \in \mathbb{R}^{n_s}$ be the sensory input, where $n_s$ is the number of sensors. The system operates on a set of action rules. The rule $r$ is defined as $r := \langle u, v, w, a \rangle$, where $v \in \mathbb{R}^{n_s}$ is the state vector associated with and memorized in the rule $r$, $w \in \mathbb{R}^{n_s}$ is the weight vector, $u$ is the utility of the rule, $a \in \mathbb{R}^{n_a}$ is the action corresponding to the rule $r$, and $n_a$ is the number of actuators.

If $v$ matches in some sense the current sensory input $x$, the rule $r$ becomes active and can trigger its action $a$. The weight vector $w$ is used for comparing $v$ and $x$. The components of $w$, $w_i \in [0,1]$, are continuous analogs of Holland's "don't care" symbol #.

The closer $w_i$ is to zero, the less important the measurement of the $i$-th sensor. The rules for which $w = 0$ are called indefinite. They can be activated anywhere, regardless of the current state $x$ the robot is in. All the other rules are definite. They can be activated in a vicinity of $v$, and the vicinity is defined using the weight $w$. The rule's specificity $\lambda = \sum_{i=1}^{n_s} w_i / n_s$ serves as a measure of definiteness of the rule. In the beginning, there is only one indefinite rule with an initially assigned utility $u_0$. As the learning progresses, the total number of rules, $n_r$, varies by reproduction and extinction.

The rules compete with each other for the right to trigger their actions. For all the rules $r_j$, $j = 1, \ldots, n_r$, the normalized weighted distance between the current sensory state $x$ and the rule's state vector $v^j$ is defined as
$$\sigma_j^2 = \frac{1}{n_s} \sum_{k=1}^{n_s} \left( \frac{(x_k - v_k^j)\, w_k^j}{d_k} \right)^2 ,$$
where $d_k$ is a time-dependent scaling parameter. The winner rule is selected with a probability given by the Boltzmann distribution, where the matching rate $m_j = \exp(-T_m \sigma_j^2)$ and $T_m$ is a constant.
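To make the matching and selection step concrete, here is a minimal sketch under our own assumptions: rules are stored as dictionaries, and, since the explicit form of the Boltzmann distribution is not reproduced above, we assume the selection probability is proportional to exp(m_j u_j / T); this is one plausible reading rather than the authors' exact formula.

```python
import numpy as np

def select_winner(x, rules, d, T=3.0, T_m=150.0, rng=np.random.default_rng()):
    """Action selection of the fuzzy-like classifier system (sketch).

    x     : current sensory input, shape (n_s,)
    rules : list of dicts with keys 'v' (state vector), 'w' (weight vector),
            'u' (utility) and 'a' (action)
    d     : time-dependent scaling parameters, shape (n_s,)
    """
    n_s = len(x)
    m = np.empty(len(rules))
    for j, r in enumerate(rules):
        sigma2 = np.sum(((x - r['v']) * r['w'] / d) ** 2) / n_s   # normalized weighted distance
        m[j] = np.exp(-T_m * sigma2)                              # matching rate m_j
    logits = m * np.array([r['u'] for r in rules]) / T            # assumed Boltzmann energies
    p = np.exp(logits - logits.max())
    p /= p.sum()
    j_win = int(rng.choice(len(rules), p=p))
    return j_win, m[j_win]
```

Note that the indefinite rule ($w = 0$) always has $\sigma^2 = 0$ and $m = 1$, so it remains a candidate in any state, which matches its exploratory role described above.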

The utilities of the rules are updated every time after the winner executes its action. The utility adjustment mechanism consists of the following four parts.

1. Direct payoff distribution. The direct payoff $P$ is given to the winner rules only in specific states. There are two types of payoff: reward ($P > 0$) and punishment ($P < 0$). The payoff is spread back along the sequence of rules that triggered their actions (i.e., to the current and previous winners) with the discount rate $\gamma > 0$:
$$u_w^{(k)} \leftarrow u_w^{(k)} + \gamma^k P, \quad k = 0, \ldots, N,$$
where $N$ is the depth of the winners chain. Here, the parent of $r_w$ is $r_w^{(1)}$, the parent of $r_w^{(1)}$ is $r_w^{(2)}$, and so on.

2. Bucket brigade strategy. The current winner $r_w$ hands over a part of its utility, $\Delta u$, back to the previous winner $r_w^{(1)}$: $u_w^{(1)} \leftarrow u_w^{(1)} + \Delta u$, where $\Delta u = \kappa \lambda_w (u_w - u_w^{(1)})$ if $u_w > u_w^{(1)}$, and $\Delta u = 0$ otherwise.

3. Taxation. Whenever a definite rule $r_w$ triggers its action, its utility is updated as $u_w \leftarrow (1 - c_l) u_w$.

4. Evaporation. All the rules reduce their utilities at the evaporation rate $\eta < 1$ when the robot reaches the goal state: $u_w \leftarrow \eta u_w$. The rules whose utility decreases below the threshold $u_{min}$ are removed.
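A compact sketch of this four-part utility adjustment, under our own assumptions about the data layout (rules as dictionaries, the chain of past winners kept with the current winner first); the constants are the values reported for the 2D simulation in Section 2.2.

```python
def distribute_payoff(winners_chain, P, gamma=0.8):
    """Part 1: spread the payoff P back along the chain of winners,
    winners_chain[k] being the rule that won k steps ago (r_w^(k))."""
    for k, rule in enumerate(winners_chain):
        rule['u'] += (gamma ** k) * P                  # u <- u + gamma^k * P

def bucket_brigade(winner, previous_winner, kappa=0.2):
    """Part 2: the current winner hands a share of its utility back to the
    previous winner, scaled by its own specificity lambda_w."""
    lam_w = sum(winner['w']) / len(winner['w'])
    if winner['u'] > previous_winner['u']:
        previous_winner['u'] += kappa * lam_w * (winner['u'] - previous_winner['u'])

def tax(winner, c_l=0.015):
    """Part 3: taxation of a definite rule each time it triggers its action."""
    if any(wi != 0 for wi in winner['w']):             # indefinite rules are not taxed
        winner['u'] *= (1.0 - c_l)

def evaporate(rules, eta=0.95, u_min=9.5):
    """Part 4: on reaching the goal, shrink all utilities and drop weak rules."""
    for r in rules:
        r['u'] *= eta
    return [r for r in rules if r['u'] >= u_min]
```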

The winner rule $r_w$ always generates a new rule $r_c$, except for the case when the action triggered by $r_w$ has led to stepping back or falling down. If the winner is the indefinite rule, the parameters of the reproduced rule are set as $v^c = x$, $a^c = a_w$, $u_c = u_w$, and $w_i^c = 1$, $i = 1, \ldots, n_s$.

If the winner is a definite rule, the newly produced rules are called generalized. The winner reproduces a generalized rule $r_c$ provided that its matching rate $m_w$ is within a certain reproduction threshold $B_r$, i.e., $m_w < B_r$, where $B_r = \exp(-T_r u_w)$ and $T_r$ is a constant. The components of the vectors $v^c$ and $w^c$ for the new generalized rule are set as
$$v^c = x, \quad a^c = a_w, \quad u_c = \lambda_c u_w, \quad w_i^c = 1 - \frac{|x_i - v_i|}{d_i}, \quad i = 1, \ldots, n_s .$$
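The reproduction step can be sketched as follows; the data layout, the function name, and the assumption that the winner's matching rate m_w from the selection step is passed in explicitly are ours, while the threshold test and the component formulas follow the text above.

```python
import numpy as np

def reproduce(winner, m_w, x, d, T_r=0.6):
    """Create a new rule r_c from the winner after it fires (sketch).

    Returns None when no rule is reproduced. The caller is expected to skip
    reproduction when the action led to stepping back or falling down.
    """
    n_s = len(x)
    if np.all(winner['w'] == 0):                       # indefinite winner
        w_new = np.ones(n_s)
        u_new = winner['u']
    else:                                              # definite winner
        B_r = np.exp(-T_r * winner['u'])               # reproduction threshold
        if m_w >= B_r:                                 # matched too closely, no generalization
            return None
        w_new = 1.0 - np.abs(x - winner['v']) / d      # w_i^c = 1 - |x_i - v_i| / d_i
        u_new = (w_new.sum() / n_s) * winner['u']      # u_c = lambda_c * u_w
    return {'v': x.copy(), 'w': w_new, 'u': u_new, 'a': winner['a']}
```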

2.2 Simulation

The feasibility of the classifier system is tested under two types of simulations. The first one shows how the proposed learning system segments the state space in a two-dimensional environment. The second one shows how the proposed learning system performs in a multi-dimensional state space.

1. 2D state space. In this simulation we want to check how the agent constructs the state space. The goal is to reach the light source. The simulation setup is shown in Fig. 1. The sensory input is the $(x, y)$ coordinates of the goal in the coordinate system fixed to the agent. The time interval between the start of motion and reaching the goal is called an episode. The agent is set at the starting point, with its heading orientation random at every episode. The episodes are updated when the agent reaches the goal, or when the number of produced actions reaches 1000. The agent is rewarded only upon reaching the goal. Note that the agent does not have a priori knowledge of the environment or of the goal coordinates.

The parameters of the classifier system are set as follows: $P = 20$ (reward), $P = -0.05u$ (punishment), $T = 3.0$, $T_m = 150$, $T_r = 0.6$, $c_l = 0.015$, $\gamma = 0.8$, $\eta = 0.95$, $\kappa = 0.2$, $u_{min} = 9.5$, $u_0 = 10$, $n_r = 100$. The simulation shows that the agent can learn how to reach the goal. This was expected, and we do not dwell on this feature. What we want to inspect here is the state space decomposition, which is illustrated in Fig. 2. In this example, the evolution of the actually observed region is shown by circles. As the learning progresses, the classifier system produces the action rules more and more consistently and in smaller numbers. The consistent action rules attract fewer states, and the irrelevant rules are evaporated.

Fig. 1. Simulation setting.

Fig. 2. State space decomposition: (a) acquired behavior, (b) 10th episode, (c) 30th episode, (d) 100th episode.

2. 16D state space. In this simulation the environment contains obstacles. The agent has eight infrared sensors which can function both as light sensors and as proximity sensors. Thus, we can set $n_s = 16$. The episodes are updated when the robot reaches the goal, or when the number of produced actions exceeds 1000.

The acquired behavior in episode No. 100 is shown in Fig. 3. As can be seen, the agent has acquired the light-seeking, wall-following and collision-avoidance patterns. The number of collisions and the number of actions (steps) are plotted in Fig. 4. There, we also indicate the episodes at which the agent obtains the reward. As can be seen, the number of collisions decreases as the frequency of reaching the light source increases.

Fig. 3. Acquired behavior.

Fig. 4. Learning history (number of steps, number of collisions, and reward per episode).

Our observations also reveal a high correlation between the number of collisions and the number of generations of the rules. This indicates that the exploration function of the indefinite rule is gradually subsumed by the exploitation function of the definite rules with high utilities. As learning progresses, new rules are seldom generated and the total number of rules gradually decreases. This is because only particular definite rules, causing "useful" behavior, trigger their actions and increase their utility. Conversely, the "irrelevant" rules decrease their utility and evaporate eventually. Thus, after some time, the behavior acquired by the survivors becomes more dominant.
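The episode protocol used in these simulations can be summarized by the following sketch; the environment and agent interfaces (reset, step, reached_goal, act, give_payoff) are hypothetical placeholders rather than an API from the paper.

```python
def run_episode(agent, env, max_steps=1000, reward=20.0):
    """One training episode: random initial heading, reward only at the goal,
    and a cap of 1000 produced actions (sketch with a hypothetical interface)."""
    x = env.reset(random_heading=True)
    for step in range(max_steps):
        a = agent.act(x)                  # winner selection, possible reproduction
        x = env.step(a)
        if env.reached_goal():
            agent.give_payoff(reward)     # direct payoff, then evaporation
            return step + 1
    return max_steps
```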

3. BAYESIAN CLASSIFIER SYSTEM

The Bayesian discrimination method is one of the well-known methods of pattern classification. In this method, the conditional probability distribution $p(x|C_i)$ of the $i$-th cluster, $i = 1, 2, \ldots$, and the prior probability $P(C_i)$ are assumed to be known in advance. Given the input data $x$, each cluster estimates the posterior probability $P(C_i|x)$ using Bayes' formula, and the input data $x$ is assigned to the cluster with the maximal posterior probability.

The Bayesian discrimination method estimates the loss of misclassification of the sensory input. Basically, the agent selects one of the rules with the smallest estimated loss and executes the corresponding action. In addition to the Bayes decision theory, the method described below blends the ideas of LCS and learning vector quantization in order to adaptively divide and segment the continuous state space into clusters.

3.1 System description

An individual rule is associated with a cluster in the state space. Formally, a rule $r$ is defined as $r := \langle u, v, a, f, \Sigma, \phi \rangle$, where $u$, $v$ and $a$ have the same meaning as before, $f$ is the occurrence (prior probability) of the rule, $\Sigma = \mathrm{diag}\{\sigma_1, \ldots, \sigma_{n_s}\}$ is the covariance matrix of the rule, and $\phi$ is the sample set of the rule. Each sample $\phi_i = \{x_1, \ldots, x_{n_s}\}$ of the set $\phi$ is a sensory input previously assigned to the rule.
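A possible container for such a rule, written as a Python dataclass; the field names are ours, and only the diagonal of the covariance matrix is stored, as implied by its diagonal form above.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class BayesRule:
    """One rule/cluster of the Bayesian classifier system, r = <u, v, a, f, Sigma, phi>."""
    u: float                     # utility
    v: np.ndarray                # cluster center (state vector)
    a: np.ndarray                # action
    f: float                     # occurrence (prior probability) of the rule
    sigma2: np.ndarray           # diagonal of the covariance matrix Sigma
    phi: list = field(default_factory=list)   # sample set of past sensory inputs
```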

Formally, the system functions in much the same way as described in the previous section. However, the action selection mechanism and the reproduction mechanism are different. The action selection mechanism is implemented using the Bayesian discrimination technique. The selection procedure is organized as follows. Given the conditional probability distribution of the $i$-th rule's cluster,
$$p(x|C_i) = \frac{1}{(2\pi)^{n_s/2} |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (x - v_i)^T \Sigma_i^{-1} (x - v_i) \right\},$$
and the estimated value of $f_i$, the risk of misclassification of the sensory input $x$ into other clusters can be defined as $g_i = -\log\{f_i \cdot p(x|C_i)\}$. The rule with the minimal value of $g_i$ is selected as the winner and is denoted as $r_w$. If the value $g_w$ of the winner risk is larger than a threshold $g_{th}$, the winner is rejected, and the agent performs an action selected randomly. Otherwise, the agent performs the action specified by the winner rule.

As to the modification/reproduction phase, it is performed if the action triggered by the winner rule $r_w$ did not result in a situation where the agent is punished. More specifically, if the winner rule is rejected (that is, if $g_w > g_{th}$), a new rule, memorizing the current sensory input and the executed action, is produced. The parameters of the new rule are set as $v_c = x$, $\Sigma_c = \sigma_0^2 I$, $a_c = a_w$, $u_c = u_0$, $f_c = f_0$, where $I$ is the unit matrix, and $\sigma_0$, $u_0$ and $f_0$ are constants.

On the other hand, if the winner rule $r_w$ is not rejected, its parameters are modified as follows. First, the sample set $\phi$ is updated, i.e., the current sensory input $x$ is added to $\phi$. Then, the sample mean $\bar{x} = \{\bar{x}_1, \ldots, \bar{x}_{n_s}\}^T$ and the sample variance $s^2 = \{s_1^2, \ldots, s_{n_s}^2\}^T$ are estimated from the updated set $\phi$, and the confidence intervals for $\bar{x}$ and $s^2$ are determined.

If all the components of the state vector $v$ and of the covariance matrix $\Sigma$ of the winner rule $r_w$ are within the confidence intervals, the parameters of the rule are left unchanged. Otherwise, if even one of the components is out of the corresponding confidence interval, the parameters of the rule $r_w$ are modified as
$$v_i \leftarrow v_i + \alpha (x_i - v_i), \quad \sigma_i^2 \leftarrow \sigma_i^2 + \alpha^2 (s_i^2 - \sigma_i^2), \quad f_w \leftarrow f_w + \beta (1 - f_w),$$
where $\alpha$ and $\beta$ are constants. For all the other rules, the prior probabilities $f_i$ are updated as $f_i \leftarrow (1 - \beta) f_i$.
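The selection and adaptation steps can be sketched as follows, reusing the BayesRule container from above. The diagonal-Gaussian log-density and the risk $g_i$ follow the formulas in the text; the confidence-interval test is delegated to a caller-supplied predicate (not shown here), and applying the prior update of the other rules only when the winner is modified is our reading of the text.

```python
import numpy as np

def select_bayes_winner(x, rules, g_th):
    """Risk-based selection: g_i = -log(f_i * p(x | C_i)) with a diagonal Gaussian.
    Returns (winner_index, rejected); a rejected winner triggers a random action."""
    risks = []
    for r in rules:
        diff = x - r.v
        log_p = (-0.5 * np.sum(diff ** 2 / r.sigma2)
                 - 0.5 * np.sum(np.log(2.0 * np.pi * r.sigma2)))
        risks.append(-(np.log(r.f) + log_p))
    i_win = int(np.argmin(risks))
    return i_win, risks[i_win] > g_th

def adapt_winner(rule, x, within_confidence, other_rules=(), alpha=0.1, beta=0.0001):
    """LVQ-like modification of an accepted (non-rejected) winner (sketch)."""
    rule.phi.append(np.asarray(x, dtype=float))        # add the current input to the sample set
    xbar = np.mean(rule.phi, axis=0)                    # sample mean
    s2 = np.var(rule.phi, axis=0)                       # sample variance
    if within_confidence(rule, xbar, s2):               # all components inside the intervals
        return                                          # leave the rule unchanged
    rule.v = rule.v + alpha * (x - rule.v)
    rule.sigma2 = rule.sigma2 + alpha ** 2 * (s2 - rule.sigma2)
    rule.f = rule.f + beta * (1.0 - rule.f)
    for other in other_rules:                           # shrink the other priors
        other.f = (1.0 - beta) * other.f
```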

3.2 Simulations

The feasibility of the Bayesian classifier system is tested under two types of simulations, the 2D and 16D cases, similar to those considered in Section 2.2.

1. 2D state space. The simulation settings are similar to the example considered in Section 2.2. The episodes are updated when the agent reaches the goal, or when the number of produced actions reaches 1000. The agent is rewarded only upon reaching the goal and is punished for every collision. The parameters of the classifier system are set as follows: $P = 20$ (reward), $P = -0.05u$ (punishment), $c_l = 0.01$, $\gamma = 0.8$, $\eta = 0.98$, $\kappa = 0.15$, $u_{min} = 9.5$, $u_0 = 10$, $\sigma_0 = 0.05$, $f_0 = 0.001$, $\alpha = 0.1$, $\beta = 0.0001$.

Fig. 5 shows the agent's behavior in episode No. 100 and the segmented state space (Bayes boundaries) in episodes No. 20, No. 40 and No. 100. The number of collisions and the number of actions (steps) are plotted in Fig. 6. In the first episode, the state space of the agent is undivided. The agent's behavior is random and the number of collisions is large. It takes 20 episodes for the agent to acquire the goal-seeking and collision-avoidance patterns.

Fig. 5(b)-5(d) show the segmented state space at each episode. The circles indicate the areas covered by the rules, and the undivided state space is shown by the white area. The arcs and the arrows indicate the rules' actions¹. As learning progresses, the state space is segmented and the shape of each cluster is modified.

¹ The agent turns in the direction specified by the arc, and then moves in the direction (resulting after the turn) specified by the arrow.

2. 16D state space. This simulation illustrates that the state space can be segmented adaptively. In this simulation the agent has to reach the goal and avoid obstacles. Again, the agent has 8 light sensors and 8 infrared sensors, and the learning system has to segment the 16-dimensional state space. All the other settings are the same as in simulation 1.

Fig. 7 shows the behaviors the agent has acquired. Depending on the heading direction at the starting point, the robot overcomes the obstacle from the left or from the right side. The learning history is shown in Fig. 8. As can be seen, the number of steps necessary to reach the goal stabilizes after 20 episodes.

Fig. 5. Segmented state space: (a) acquired behavior, (b) 20th episode, (c) 40th episode, (d) 100th episode.

Fig. 6. Learning history (number of steps, number of collisions, and reward per episode).

Fig. 7. Simulation results (16D state space): (a) clockwise motion, (b) counterclockwise motion.

Fig. 8. Learning history.

4. CONCLUSIONS

Two reinforcement learning techniques for adaptive segmentation of the continuous state space have been formulated in this paper. The first technique uses weight vectors and continuous matching functions, while the second makes use of the Bayesian discrimination method, where the segmented state space is represented by Bayes boundaries. The proposed techniques have been applied to a navigation task and tested under simulation. The simulation results show that the agent can segment its continuous state space adaptively and, by doing so, reach the goal.

The following directions can be outlined for future research.

• Currently, the exploration of the action space is performed randomly in our system. In the future, we plan to employ more complex techniques to abstract and decompose the continuous action space. In this connection, the radial basis function networks, as in (Morimoto and Doya, 1999), or the techniques of Williams (1992) should be given consideration.

• Currently, our systems do not include a genetic component, which is normally present in LCS and used for the discovery of new rules. This is an important component of the LCS framework. In the future, we plan to include such a genetic component, implemented in the spirit of Evolutionary Programming or Evolutionary Strategies.


REFERENCES


Albus, J. S. (1975). A new approach to manipulator control: The cerebellar model articulation controller. Journal of Dynamic Systems, Measurement and Control, Trans. ASME 97(3), 220-227.

Bonarini, A. (2000). An introduction to learning fuzzy classifier systems. In: Learning Classifier Systems: An Introduction to Contemporary Research (P. L. Lanzi, W. Stolzmann and S. W. Wilson, Eds.). Vol. 1813 of Lecture Notes in Artificial Intelligence. Springer-Verlag. Berlin. pp. 83-104.

Holland, J. H. and J. S. Reitman (1978). Cognitive systems based on adaptive algorithms. In: Pattern-Directed Inference Systems (D. A. Waterman and F. Hayes-Roth, Eds.). Academic Press. New York.

Morimoto, J. and K. Doya (1999). Hierarchical reinforcement learning for motion learning: Learning "stand-up" trajectories. Advanced Robotics 13(3), 267-268.

Parodi, A. and P. Bonelli (1993). A new approach to fuzzy classifier systems. In: Proceedings of the 5th International Conference on Genetic Algorithms (S. Forrest, Ed.). Morgan Kaufmann. pp. 223-230.

Valenzuela-Rendón, M. (1991). The fuzzy classifier system: a classifier system for continuously varying variables. In: Proceedings of the 4th International Conference on Genetic Algorithms (L. B. Booker and R. K. Belew, Eds.). Morgan Kaufmann. pp. 346-353.

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, 229-256.

Wilson, S. W. (1992). Classifier system mapping of real vectors. In: Collected Abstracts for the First International Workshop on Learning Classifier Systems. October 6-8, NASA Johnson Space Center, Houston, Texas.

Wilson, S. W. (1994). ZCS: A zeroth level classifier system. Evolutionary Computation 2(1), 1-18.

Wilson, S. W. (2000). State of XCS classifier system research. In: Learning Classifier Systems: An Introduction to Contemporary Research (P. L. Lanzi, W. Stolzmann and S. W. Wilson, Eds.). Vol. 1813 of Lecture Notes in Artificial Intelligence. Springer-Verlag. Berlin. pp. 63-82.


