Information Sciences 234 (2013) 112–120
Policy sharing between multiple mobile robots using decision trees

Yu-Jen Chen, Kao-Shing Hwang*, Wei-Cheng Jiang
Department of Electrical Engineering, National Sun Yat-Sen University, Kaohsiung 80424, Taiwan
Article info
Article history: Received 7 January 2009; Received in revised form 10 October 2012; Accepted 16 January 2013; Available online 1 February 2013
Keywords: Multi-agent; Cooperation; Sharing; Reinforcement learning; Mobile robot
Abstract
Reinforcement learning is one of the more prominent machine learning technologies, because of its unsupervised learning structure and its ability to produce continual learning, even in a dynamic operating environment. Applying this learning to cooperative multi-agent systems not only allows each individual agent to learn from its own experience, but also offers the opportunity for the individual agents to learn from other agents in the system, in order to increase the speed of learning. In the proposed learning algorithm, an agent stores its experience as a state aggregation, by use of a decision tree, such that policy sharing between multiple agents is eventually accomplished by merging the different decision trees of peers. Unlike lookup tables, which have a homogeneous structure for state aggregation, the decision trees carried within agents have heterogeneous structures. The method detailed in this study allows policy sharing between cooperative agents by merging their trees into a hyper-structure, instead of forcefully merging entire trees. The proposed scheme initially allows the entire decision tree to be transferred from one agent to others; however, the evidence shows that only some leaf nodes hold experience that is useful for policy sharing. The proposed method therefore induces a hyper decision tree from a large number of samples drawn from the shared nodes. The results from simulations in a multi-agent cooperative domain illustrate that the proposed algorithms perform better than an algorithm that does not allow sharing.
© 2013 Elsevier Inc. All rights reserved.
1. Introduction

Reinforcement learning (RL) uses the concept of learning by reward: if a robot is given rewards when its performance is good, it can learn to improve its behavior [17,6,11]. Because they rely on trial-and-error, RL methods usually require a great deal of time to search for an optimal policy, especially in a huge state space. In complex environments, humans usually ask locals or experts when they are lost or face an unknown situation. Similarly, in a cooperative social environment, agents using reinforcement learning share their experience with each other, in order to reduce the need for exploration [16,18,20]. There are various ways to represent a policy learned by RL, such as probability distribution functions, neural networks for function approximation, tabular methods, and so on [9,13,21–23]. Using a lookup table may be an easy way to record a policy, but it is not appropriate for problems with multiple dimensions, where neural networks are a more viable alternative. However, since the policy is encoded in the weights of a neural network, it is not easy to distinguish which weights hold well-learned knowledge that is worth sharing. A decision tree is also a popular representation method and it is likewise appropriate for problems with multiple dimensions [7,14,19]. Since a decision tree partitions the state space into regions, agents can share the regions that have been learned well with their partners.
The policies from decision trees containing similar policies can be incorporated into a shared policy by sharing the experience stored in their state aggregations. This raises the problem of how best to combine the policies of different agents. Although there are many algorithms for merging trees, most of them merge the entire trees, which results in much redundancy [1]. The entire decision tree must be transferred to the other agents, yet not all of the leaf nodes hold experience that is helpful in achieving a goal. Since the creation of a decision tree involves the induction of a great number of samples [2,15], the shared information can instead be appended by inducing the tree from samples drawn from the shared nodes.

The remainder of this paper is organized as follows. Section 2 describes two policy-sharing algorithms. Two simulations are used to compare these algorithms with a no-sharing baseline in Section 3. Finally, conclusions are presented in Section 4.

2. Policy sharing

Regardless of the algorithm used for policy sharing, communication between agents is a basic requirement [3–5,10,12,24]. Since there are many available communication protocols for wired networks, wireless networks and Bluetooth, only policy sharing is discussed here [8]. Some assumptions about communication are therefore made in advance. There is no limitation on the communication distance between any two agents; in other words, any two agents can communicate directly with each other, without the need for a third agent. These communication protocols also address the problem of cross-communication.

2.1. Sharing without request

Policy sharing without request occurs when an experienced agent voluntarily shares its policy. When an agent terminates N episodes, it broadcasts its policy to the other agents. The policy consists of the information from all of its leaf nodes, including the ranges and reliabilities of the nodes, the state-action values, and the action having the maximum state-action value in each node. The other agents receive and save the shared information until they terminate an episode. Fig. 1 shows an illustrative scenario of the sharing-without-request method.

If an agent receives another agent's policy, it randomly creates samples from the ranges of the shared nodes and of the leaf nodes in its own decision tree. For each sample, the action having the maximum state-action value is denoted as its class, and all samples of the same class are classified into a category. While a policy is being learned by each agent, the corresponding decision tree may not split appropriately, creating redundant leaf nodes, especially in the early episodes; if some trivial areas are split, the proliferated states are only scarcely visited. The reliability is proportional to the number of activations and inversely proportional to the variance. The reliability of a state is defined by (1), where c_r is a reliability constant and count is the number of activations of the state.
reliability = \frac{2}{1 + \exp(-c_r \cdot count / \sigma_e^2)} - 1    (1)
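As an illustration only (not code from the paper), the reliability in (1) and the per-leaf information that is broadcast can be sketched in Python as follows; the class and field names are hypothetical:

import math
from dataclasses import dataclass

@dataclass
class SharedLeaf:
    # One leaf node of a broadcast policy, as described above (field names are assumed).
    bounds: list          # (lower, upper) range of the node in each dimension
    reliability: float    # Eq. (1)
    q_values: dict        # action -> state-action value
    best_action: int      # action with the maximum state-action value

def reliability(count, sigma_e_sq, c_r):
    # Eq. (1): grows with the number of activations, shrinks with the variance.
    return 2.0 / (1.0 + math.exp(-c_r * count / sigma_e_sq)) - 1.0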
In each node, the number of samples is defined by (2).
N_{node} = reliability_{node} \cdot N \cdot H(v_{node})    (2)
In (2), reliability_node is the reliability of the node, N is the maximum number of samples in a node, v_node is the state value of the node, and H(·), given in (3), is the Heaviside step function. Once all of the samples have been created, they are passed through the decision tree and fall into different leaf nodes. Fig. 2a is an illustration of the samples in a leaf node. The bold rectangles represent shared nodes from the experienced agent and the normal rectangle represents one leaf node of the agent receiving experience.
Fig. 1. An illustrative scenario of sharing policy without request: an experienced agent broadcasts its policy after terminating N episodes, and the receiving agent appends the received policy after terminating an episode.
Fig. 2. An illustration showing the appending of shared information: (a) the samples from the leaf nodes; (b) the splits (sp1–sp4) in the leaf node.
The small triangles, circles and squares denote three categories. A category whose number of samples is less than N is not considered when the leaf node is split. The mean, μ_{c,d}, and variance, σ²_{c,d}, of the samples belonging to category c in dimension d are calculated in order to find a splitting point. For example, the probability profiles of two categories are shown in Fig. 3; the split point, located at the intersection of the two probability curves, provides greater purity. For each pair of categories, one split point can be obtained from (4), using the pairs (μ_{i,d}, σ²_{i,d}) and (μ_{j,d}, σ²_{j,d}). Fig. 4 shows all of the split points of the samples in Fig. 2a. Each dashed line represents the split point of two categories; for example, sp1 is the split point between the triangle and the circle categories in the horizontal dimension.
H(x) = \begin{cases} 0, & x \le 0 \\ 1, & x > 0 \end{cases}    (3)
\frac{n_i}{\sqrt{2\pi\sigma_{i,d}^2}} \exp\!\left(-\frac{(x-\mu_{i,d})^2}{2\sigma_{i,d}^2}\right) = \frac{n_j}{\sqrt{2\pi\sigma_{j,d}^2}} \exp\!\left(-\frac{(x-\mu_{j,d})^2}{2\sigma_{j,d}^2}\right)    (4)
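The following Python sketch (not from the paper; the function names are illustrative) computes the number of samples drawn for a node according to (2)–(3) and solves (4) for the intersection of two weighted Gaussian profiles in one dimension:

import math

def num_samples(reliability_node, N, v_node):
    # Eq. (2)-(3): only nodes with a positive state value contribute samples.
    return int(reliability_node * N * (1 if v_node > 0 else 0))

def gaussian_split_point(n_i, mu_i, var_i, n_j, mu_j, var_j):
    # Solve Eq. (4) for x by equating the logs of both sides (quadratic in x).
    a = 1.0 / var_j - 1.0 / var_i
    b = 2.0 * (mu_i / var_i - mu_j / var_j)
    c = (mu_j ** 2 / var_j - mu_i ** 2 / var_i
         + 2.0 * math.log(n_i / n_j) + math.log(var_j / var_i))
    if abs(a) < 1e-12:                 # equal variances: a single crossing
        return -c / b
    disc = b ** 2 - 4.0 * a * c
    if disc < 0:                       # no real crossing; fall back to the midpoint
        return 0.5 * (mu_i + mu_j)
    roots = [(-b + s * math.sqrt(disc)) / (2.0 * a) for s in (1.0, -1.0)]
    lo, hi = sorted((mu_i, mu_j))
    inside = [r for r in roots if lo <= r <= hi]
    return inside[0] if inside else roots[0]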
The leaf nodes attempt to split at each split point and the values of the purity function after splitting are compared. The purity function of the sample set, φ, in a node is defined as (5), for all categories c.
purity(P(\varphi)) = purity([p_1, p_2, \ldots, p_{|A|}]) = \sum_{c=1}^{|A|} p_c \log(p_c)    (5)
where p_c is the probability of the occurrence of category c in φ. The splitting point is chosen as (6)
split\_point = \arg\max_{sp} \left[ purity(P(\varphi_{sp,l})) + purity(P(\varphi_{sp,r})) \right]    (6)
where φ_{sp,l} and φ_{sp,r} are the sample sets belonging to the left and right child nodes after splitting at the split point, sp. Fig. 4 shows that maximum purity occurs after splitting at sp1. The algorithm is summarized as follows:
Fig. 3. The proper splitting point of two probability density functions.
Fig. 4. Split points between two categories.
for d = 1 to number of state-space dimensions
    for each category i with number of samples > N
        for each category j with number of samples > N and j ≠ i
            calculate a candidate sp from (μ_{i,d}, σ²_{i,d}) and (μ_{j,d}, σ²_{j,d}) using (4)
split_point = arg max_sp [purity(P(φ_{sp,l})) + purity(P(φ_{sp,r}))]
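The same procedure can be written compactly in Python (an illustrative sketch, not the authors' implementation; gaussian_split_point is the hypothetical helper from the sketch after Eq. (4), and samples are assumed to be (vector, category) pairs):

import math
from collections import Counter
from statistics import mean, pvariance

def purity(labels):
    # Eq. (5): sum of p_c * log(p_c) over the categories present in the set.
    counts = Counter(labels)
    total = len(labels)
    return sum((n / total) * math.log(n / total) for n in counts.values())

def best_split(samples, num_dims, N):
    # samples: list of (x, category), where x is a num_dims-dimensional vector.
    best, best_score = None, -math.inf
    counts = Counter(cat for _, cat in samples)
    cats = [c for c, n in counts.items() if n > N]        # skip sparse categories
    for d in range(num_dims):
        for i in cats:
            for j in cats:
                if i == j:
                    continue
                xi = [x[d] for x, c in samples if c == i]
                xj = [x[d] for x, c in samples if c == j]
                vi, vj = pvariance(xi), pvariance(xj)
                if vi <= 0 or vj <= 0:
                    continue
                sp = gaussian_split_point(len(xi), mean(xi), vi,
                                          len(xj), mean(xj), vj)   # Eq. (4)
                left = [c for x, c in samples if x[d] <= sp]
                right = [c for x, c in samples if x[d] > sp]
                if not left or not right:
                    continue
                score = purity(left) + purity(right)               # Eq. (6)
                if score > best_score:
                    best, best_score = (d, sp), score
    return best    # (dimension, split value), or None if no admissible split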
After splitting, the left and right child nodes are recursively split until no split point remains. Fig. 2b illustrates the recursive splitting. The leaf node is first split at sp1 and the left child is then split again at sp2. The main category on the left-hand side after splitting at sp2 is the category "square". Since the numbers of samples in both the "circle" and "triangle" categories are less than N, the left child node of sp2 stops splitting. Similarly, the numbers in the "circle" and "square" categories are less than N, so the right child node stops splitting. In the right child node of sp1, since the number in the "square" category is less than N, only the "triangle" and "circle" categories are split. When a leaf node stops splitting, the state-action values and the action policy of the leaf node are set to the average of the samples in the leaf node.

2.2. Sharing by request

Similarly to human interaction, when an agent enters a leaf node and does not know which action within the node leads toward the goal, it broadcasts a request for help to the other agents. Agents receive positive rewards only when they achieve the goal. Since the state-action value is the reward accumulated from the current state to the goal, it indicates whether the agent can expect a positive reward from the state: actions with positive state-action values guide the agent to the goal, whereas a negative state value provides no such guidance. Therefore, if the state value of an agent's current node is less than 0, the agent broadcasts a request to the other agents, as illustrated in Fig. 5. The agent broadcasts its sensory inputs and state value. The other agents receive the request and search for a node corresponding to the sensory inputs. If the state value of that node is larger than the received state value, the information from the node, including the range of the node, its state-action values and its policy, is sent back to the requesting agent; otherwise a "not found" message is sent.
Fig. 5. Communication between agents.
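As a rough sketch of this request/response exchange (Python; the message fields and the tree/node attributes such as find_leaf, bounds, state_value, q_values and greedy_action are hypothetical, not taken from the paper):

from dataclasses import dataclass

@dataclass
class HelpRequest:
    sensory_input: tuple     # the requesting agent's current state
    state_value: float       # the (negative) state value of its current node

@dataclass
class HelpReply:
    bounds: list             # (lower, upper) range of the shared node per dimension
    state_value: float
    q_values: dict           # action -> state-action value
    policy: int              # greedy action of the shared node

def handle_request(tree, request):
    # Run by a peer agent: find the leaf covering the requested state and reply
    # only if its state value improves on the requester's (otherwise "not found").
    node = tree.find_leaf(request.sensory_input)     # hypothetical tree API
    if node is not None and node.state_value > request.state_value:
        return HelpReply(list(node.bounds), node.state_value,
                         dict(node.q_values), node.greedy_action)
    return None

def choose_reply(replies):
    # The requester keeps only the shared node with the maximum state value.
    replies = [r for r in replies if r is not None]
    return max(replies, key=lambda r: r.state_value, default=None)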
Fig. 6. An illustration of a splitting point in dimension i, based on overlaps.
When the agent receives information from the nodes of other agents, only the shared node with the maximum state value is considered. Since both the agent's node and the shared node of the other agent correspond to the same state, they overlap in the state space. The agent's node can be split at one of the boundaries of the shared node, as shown by the dotted line in Fig. 6. The upper and lower boundaries of the node in dimension i are denoted as node.b_i^U and node.b_i^L, respectively; the upper and lower boundaries of the shared node in dimension i are denoted as share.b_i^U and share.b_i^L, respectively. If a boundary of the shared node lies outside the range of the node, then the corresponding product a_ij·b_ij is less than 0. Hence, the split point is chosen as (7), except when all of the a_ij·b_ij are less than 0; in other words, if the shared node fully covers the node, the node is not split. Once the node has been split, the two child nodes inherit the information of the node and the agent's current node becomes one of the child nodes. In Fig. 6, the small cross marks the state of the agent, so the agent's current node becomes the left child node. Finally, the information of the current node, including its state-action values and policy, is set to the average of the shared node and the current node.
split\_point = \arg\max_{i,j} \left\{ \frac{a_{ij}\, b_{ij}}{node.b_i^U - node.b_i^L} \;\middle|\; a_{ij}\, b_{ij} > 0 \right\}    (7)
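The excerpt does not spell out a_ij and b_ij in the text; a plausible reading, used only for the illustrative sketch below, is that for each shared-node boundary j ∈ {L, U} in dimension i, a_ij and b_ij are its signed distances to the node's lower and upper bounds, so that their product is positive exactly when the boundary falls inside the node. Under that assumption, a minimal Python sketch of the boundary selection is:

def choose_overlap_split(node_bounds, share_bounds):
    # node_bounds, share_bounds: lists of (lower, upper) pairs per dimension.
    # Returns (dimension, split_value), or None if the shared node covers the node.
    best, best_score = None, 0.0
    for i, ((n_lo, n_hi), (s_lo, s_hi)) in enumerate(zip(node_bounds, share_bounds)):
        width = n_hi - n_lo
        for boundary in (s_lo, s_hi):
            a = boundary - n_lo            # assumed meaning of a_ij
            b = n_hi - boundary            # assumed meaning of b_ij
            if a * b > 0:                  # boundary lies strictly inside the node
                score = (a * b) / width    # as in the reconstructed Eq. (7)
                if score > best_score:
                    best, best_score = (i, boundary), score
    return best                            # None: no valid boundary, do not split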
3. Simulation

In this section, two simulations of the proposed policy-sharing algorithms are presented. In one simulation, a group of robots learns how to reach a target; in the other, a three-wheeled car learns the shortest path for patrolling six rooms.

3.1. Reaching a target

In the first simulation, three mobile robots try to approach a target in a maze, as shown in Fig. 7. The three rectangles marked in Fig. 7 are the starting points for the three mobile robots, and the target that each robot attempts to reach is the black circle in the center of the maze. The robot's task is to simultaneously determine the position of the target and to find a path to it.
Fig. 7. The maze for the goal-approaching task with the three mobile robots.
The agents have four actions: movement up, down, left, or right, one step at a time. The three gray broken rings are the walls of the maze. Since the walls are curved, the state aggregation must match the curves and partition the state space more finely along them, whereas the four corners of the map are not important for reaching the target and can be partitioned roughly. If the agent hits a wall after making a movement, it remains in the same position and receives a reward of −1. When the agent arrives at the target, it receives a reward of +5 and the episode is terminated; otherwise, the agent moves to the next position and receives a reward of −0.01 (a minimal code sketch of this reward scheme is given at the end of this subsection). The episode is also terminated when the number of steps reaches 5000. The agent executes 5000 episodes per round. At the beginning of each round, the decision tree is pruned so that it has only a root node, and all parameters are reset. Since mobile robot 3 is the farthest from the goal, it is expected that robots 1 and 2 will share their policies with robot 3, so that robot 3 can avoid some exploration. The parameters of the simulation are listed in Table 1.

The simulation results for robot 3 are shown in Fig. 8. The curves sketch the moving average, with a period of 200, over 20 runs. The robot using the sharing-without-request algorithm (blue line) reaches the target faster than the robot with no sharing (yellow line) at first; however, it performs badly after the 500th episode. The raw data show that some runs in later episodes do not reach the target. This may be the result of over-sharing from robots 1 and 2: since they are closer to the target, they have run many episodes before robot 3 terminates an episode, and they share too frequently. Agents that use the sharing-by-request algorithm do not voluntarily share their policies and therefore avoid over-sharing. Although the sharing-by-request algorithm (green line in Fig. 8) takes fewer steps than the no-sharing algorithm, it produces more leaf nodes, as listed in Table 2. Since the agent splits at the boundary of the shared node, there may be redundant splitting when the current node and the shared node have the same optimal action. Also, if the range of the current node differs little from that of the shared node, an improper split can occur, as illustrated in Fig. 9. The solid rectangle represents the current node and the dashed rectangle represents the shared node; the split point of the current node is at the dotted line, but it is almost negligible.
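A minimal sketch of the reward scheme described above (Python; the function interface and names are illustrative, not from the paper):

# Hypothetical reward scheme for the maze task, as described in the text.
WALL_PENALTY = -1.0
GOAL_REWARD = 5.0
STEP_COST = -0.01
MAX_STEPS = 5000

def step_reward(hit_wall, reached_target):
    # Returns (reward, episode_done) for one move.
    if reached_target:
        return GOAL_REWARD, True
    if hit_wall:
        return WALL_PENALTY, False     # the robot stays in place
    return STEP_COST, False

def episode_done(steps, reached_target):
    return reached_target or steps >= MAX_STEPS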
Table 1
The parameters for the approaching goal simulation.

  Parameter symbol    Parameter value
  α                   0.5
  γ                   0.99
  ε                   0.01
  c_r                 0.0022
Fig. 8. The simulation results for mobile robot 3.
Table 2
The number of leaf nodes and steps for mobile robot 3.

  Method                     Number of leaf nodes    Steps
  No sharing                 615                     240
  Sharing without request    505                     929
  Sharing by request         883                     231
Fig. 9. An illustration of an improper split.
Fig. 10. Three-wheeled cars wandering around six rooms.
Fig. 11. A three-wheeled car with two free wheels at the rear and one steered traction wheel at the front.
3.2. Wandering between rooms

The scenario for this simulation involves three three-wheeled cars wandering between six rooms, as shown in Fig. 10. The three small rectangles are the starting points of the three-wheeled cars and the arrows indicate their initial directions. The bold solid lines represent walls. The structure of a three-wheeled car is shown in Fig. 11 and its kinematic equations are given in (8).
\begin{cases} x' = r_c\,\omega\,\sin(\theta + \alpha) + 2 r_c\,\omega\,\sin(\alpha)\cos(\theta)/3 \\ y' = r_c\,\omega\,\cos(\theta + \alpha) + 2 r_c\,\omega\,\sin(\alpha)\sin(\theta)/3 \\ \theta' = 2 r_c\,\omega\,\sin(\alpha)/(3L) \end{cases}    (8)
where r_c is the radius of the three wheels, ω is the rotational velocity of the front steering wheel, θ is the orientation of the front wheel of the three-wheeled car with respect to the y-axis, and α is the steering angle. When a three-wheeled car wanders clockwise across a gate between two rooms, it receives a reward of +3. If the three-wheeled car collides with a wall, it is repositioned to its initial position and direction and receives a penalty of 3; otherwise, it keeps wandering until 400 steps have been completed and is penalized by a small cost of 0.001 for each step. The state space has three dimensions: the x and y coordinates and the orientation of the three-wheeled car.
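An Euler-integration step of (8), as reconstructed above, might be sketched as follows (Python; dt is a hypothetical time step, and L, which is not defined in this excerpt, is taken here to be the wheelbase):

import math

def tricycle_step(x, y, theta, alpha, omega, r_c, L, dt=0.05):
    # One Euler step of the kinematic model in the reconstructed Eq. (8).
    # alpha: steering angle, omega: rotational velocity of the front wheel,
    # r_c: wheel radius, L: assumed wheelbase, dt: assumed integration step.
    x_dot = r_c * omega * math.sin(theta + alpha) + 2 * r_c * omega * math.sin(alpha) * math.cos(theta) / 3
    y_dot = r_c * omega * math.cos(theta + alpha) + 2 * r_c * omega * math.sin(alpha) * math.sin(theta) / 3
    theta_dot = 2 * r_c * omega * math.sin(alpha) / (3 * L)
    return x + x_dot * dt, y + y_dot * dt, theta + theta_dot * dt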
Table 3
The parameters for the wandering between rooms simulation.

  Parameter symbol    Parameter value
  α                   0.3
  γ                   0.99
  ε                   0.01
  c_r                 0.004
Fig. 12. The simulation results for the three-wheeled cars.
Table 4
The number of leaf nodes and steps for the three-wheeled cars.

  Method                     Number of leaf nodes    Steps
  No sharing                 2475                    332
  Sharing without request    2485                    351
  Sharing by request         2727                    366
Within the range of the action space, the steering angle varies between −π/4 and π/4 and is quantized into three discrete actions: −π/6, 0 and π/6. The range of the action bias is from −π/12 to π/12. The other simulation parameters are listed in Table 3. In this simulation, the three cars share reciprocally with each other. Each car is expected to explore only two rooms and then share the results of its exploration; after passing through its second room, it can utilize the others' experience and find the remaining gates faster.

The simulation results are roughly similar to those of the previous simulation and are sketched in Fig. 12. The curves are the average number of steps, given as a moving average with a period of 200, over 20 runs. Since, with no sharing, the three cars must explore the six rooms by themselves, the slope of the no-sharing curve (yellow line) is almost constant before the 10,000th episode. Unlike the reaching-the-target simulation, the three cars have episodes of similar length, so they do not over-share, and the curve for sharing without request (blue line) remains better than that for no sharing in the later episodes. The curve for sharing by request (green line) changes slope after the 4000th episode, because the cars have passed through their second gates and search for the later gates using the received experience. Table 4 lists the number of leaf nodes and steps after 20,000 episodes. The sharing-by-request algorithm again has the greatest number of leaf nodes of the three algorithms.
4. Conclusion

This paper proposes two policy-sharing algorithms by which agents working in the same environment can share policies and learn from one another. The policy of each agent is represented by a decision tree, and the agents share their experience by transferring the information stored in the nodes of their decision trees. The proposed policy-sharing algorithms can reduce exploration in areas that have already been explored by other agents. With the "sharing without request" algorithm, an agent that takes a shorter path shares
its policy too frequently, which significantly influences the learning processes of the other agents that take a longer path. By sharing policy only when an agent makes a request, an agent can avoid over-sharing with a peer that has short episodes. Although the "sharing by request" algorithm performs better, it results in more leaf nodes, because the overlap-based splitting causes many trivial splits. The simulation results show that only some leaf nodes hold useful and valid experience for policy sharing, and that in a multi-agent cooperative domain the proposed algorithms perform better than an algorithm without sharing. Although the sharing-by-request algorithm performs better, it may produce more redundant nodes due to improper splitting. Future work could combine the advantages of the two algorithms, in order to avoid the problems of over-sharing and improper splitting.

References

[1] M. Asadpour, M.N. Ahmadabadi, R. Siegwart, Heterogeneous and hierarchical cooperative learning via combining decision trees, in: Proceedings of the 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2006, pp. 2684–2690.
[2] L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees, Wadsworth Inc., Belmont, California, 1984.
[3] M. Cimino, F. Marcelloni, Autonomic tracing of production processes with mobile and agent-based computing, Information Sciences 181 (5) (2010) 935–953.
[4] R.J. Duro, M. Graña, J. de Lope, On the potential contributions of hybrid intelligent approaches to multicomponent robotic system development, Information Sciences 180 (14) (2010) 2635–2648.
[5] Q. Guo, M. Zhang, A novel approach for multi-agent-based intelligent manufacturing system, Information Sciences 179 (18) (2009) 3079–3090.
[6] A. Gosavi, Reinforcement learning: a tutorial survey and recent advances, INFORMS Journal on Computing 21 (2) (2009) 178–192.
[7] H.W. Hu, Y.L. Chen, K. Tang, A dynamic discretization approach for constructing decision trees with a continuous label, IEEE Transactions on Knowledge and Data Engineering 21 (11) (2009) 1505–1514.
[8] K.S. Hwang, C.J. Lin, C.Y. Lo, Cooperative learning by policy-sharing in multiple agents, Cybernetics and Systems: An International Journal 40 (4) (2009) 286–309.
[9] K.S. Hwang, H.Y. Lin, Y.P. Hsu, H.H. Yu, Self-organizing state aggregation for architecture design of Q-learning, Information Sciences 181 (13) (2011) 2813–2822.
[10] R.E. Haber, R.M. del Toro, A. Gajate, Optimal fuzzy control system using the cross-entropy method: a case study of a drilling process, Information Sciences 180 (14) (2010) 2777–2792.
[11] C.F. Juang, C.M. Lu, Ant colony optimization incorporated with fuzzy Q-learning for reinforcement fuzzy control, IEEE Transactions on Systems, Man, and Cybernetics – Part A: Systems and Humans 39 (3) (2009) 597–608.
[12] B. Liao, H. Huang, ANGLE: an autonomous, normative and guidable agent with changing knowledge, Information Sciences 180 (17) (2010) 3117–3139.
[13] C.H. Lee, Y.C. Lee, Nonlinear systems design by a novel fuzzy neural system via hybridization of electromagnetism-like mechanism and particle swarm optimisation algorithms, Information Sciences 186 (1) (2012) 59–72.
[14] L.D. Pyeatt, A.E. Howe, Decision tree function approximation in reinforcement learning, in: Proceedings of the Third International Symposium on Adaptive Systems: Evolutionary Computation and Probabilistic Graphical Models, 2001, pp. 70–77.
[15] R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
[16] P. Stone, M. Veloso, Multiagent Systems: A Survey from a Machine Learning Perspective, 2000.
[17] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, Mass, 1998.
[18] M. Tan, Multi-agent reinforcement learning: independent vs. cooperative agents, in: Proceedings of the Tenth International Conference on Machine Learning, Amherst, MA, 1993, pp. 330–337.
[19] W.T.B. Uther, M.M. Veloso, Tree based discretization for continuous state space reinforcement learning, in: Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), Madison, WI, 1998, pp. 769–774.
[20] V. Vassiliades, A. Cleanthous, C. Christodoulou, Multiagent reinforcement learning: spiking and nonspiking agents in the iterated Prisoner's dilemma, IEEE Transactions on Neural Networks 22 (4) (2011) 639–653.
[21] N.A. Vien, H. Yu, T.C. Chung, Hessian matrix distribution for Bayesian policy gradient reinforcement learning, Information Sciences 181 (9) (2011) 1671–1685.
[22] W. Wu, L. Li, J. Yang, Y. Liu, A modified gradient-based neuro-fuzzy learning algorithm and its convergence, Information Sciences 180 (9) (2010) 1630–1642.
[23] A. Weissensteiner, A Q-learning approach to derive optimal consumption and investment strategies, IEEE Transactions on Neural Networks 20 (8) (2009) 1234–1243.
[24] Guofu Zhang, Jianguo Jiang, Zhaopin Su, Meibin Qi, Hua Fang, Searching for overlapping coalitions in multiple virtual organizations, Information Sciences 180 (17) (2010) 3140–3156.