Q-8f-02-3
Copyright © 1999 IFAC, 14th Triennial World Congress, Beijing, P.R. China
ROBOT NAVIGATION IN COMPLEX AND INITIALLY UNKNOWN ENVIRONMENTS
Arthur P. de S. Braga
Aluizio F. R. Araujo
University of São Paulo, Department of Electrical Engineering, Av. Dr. Carlos Botelho, 1465, 13560-250 São Carlos, SP, Brazil
[email protected] [email protected]
Abstract: This paper proposes an agent that uses reinforcement learning methods to guarantee its autonomy to navigate. Differently from the representations commonly found in the technical literature, the reinforcement signal from the environment is represented through reward and penalty surfaces to endow the agent with the ability to plan and to behave reactively. The agent solves the goal-directed reinforcement learning problem, in which a first learning stage finds a path based only on local information, and this path is then ameliorated through further training. The proposed task is executed in an initially unknown environment; the initial viable solution is then improved by employing a variable learning rate for the reward evaluation. The simulations suggest that the agent always reaches the target, even in complex environments. Copyright © 1999 IFAC

Keywords: Learning algorithms, agents, planning, robot navigation, autonomous vehicles.
1. INTRODUCTION
Viable planning for autonomous navigation of a mobile robot in a dynamic and initially unknown world is a nontrivial task. The plan should be able to take the agent to a goal and to avoid obstacles on its way. In classic control theory, the general procedure is to use a reference signal that indicates the desired system output (in this case, the robot position in the environment). Thus, the agent acts to reduce the difference between the obtained and desired outputs. This approach is not viable if there is no reference signal; in that case, the trajectory has to be discovered by the agent. Singh (1994) argues that an agent needs to learn in order to work autonomously under the uncertainty caused by partial or total lack of knowledge about the complex environment in which it is immersed. Very often, agents with learning ability can produce solutions
for this class of task in a more efficient way than pre-programmed agents. In this case, learning allows the agent to adapt to changes in the environment not foreseen by a designer, and the agent can improve its performance over time. The reinforcement learning (RL) paradigm mimics animal behavior, interacting with the world in the search for a goal. The agent learns the complexity of the environment by receiving feedback reinforcement signals from the world that indicate the usefulness of an action for reaching an objective. In this paper, the agent learns through an RL method that takes into consideration two distinct appraisals for each agent state: reward and penalty evaluations. On one hand, a reward surface, modified whenever the agent reaches the target, indicates paths the agent can follow to reach the target. On the other
hand, the penalty surface, modified whenever the agent moves, is set up in a way that the agent can get rid of obstacles on its way. Thus, the reward surface deals basically with planning (Millan, 1996; Winston, 1992), whilst the penalty surface deals mainly with reactive behavior (Brooks, 1986; Chang and Gaudiano, 1997; Whitehead and Ballard, 1991). The composition of the two surfaces guides the agent decisions. The two surfaces are constructed employing the temporal difference method as the RL learning strategy.

The chosen representation is suitable to solve the goal-directed reinforcement learning problem (GDRLP) (Koenig and Simmons, 1996), defined as finding a viable path to reach a goal and then diminishing the number of steps to reach the target. In this approach, the second part of the GDRLP is achieved through a variable learning rate that allows fine improvements in the surfaces.

The next section presents the task to be executed. Section 3 introduces the RL algorithm used. Section 4 shows simulation results obtained for three different environments. Finally, Section 5 presents the conclusions.

2. THE PROBLEM

The problem consists of an autonomous agent, a mobile robot, acting in an environment with obstacles. The robot-environment interaction is the only data source used by the agent to pursue three basic goals: to find a path that links starting points to a target, to guarantee that along the trajectory the robot is able to avoid obstacles, and to improve the performance of the robot in reaching the goal. According to Leonard and Durrant-Whyte (1991), the navigation problem can be divided into three subproblems:

1) Establishment of the robot position in space;
2) Establishment of the robot goal;
3) Establishment of a path that takes the robot from an initial position to the goal.

The establishment of the robot spatial position serves as the basis for the two following stages because this phase describes the state of the environment. The second stage aims at determining the objective that the robot wants to reach. The establishment of a path to reach the goal defines the robot actions. This last stage is the main interest of this work. In sum, our agent knows where it is placed at each step and recognizes the goal when it reaches it. Hence, the agent has to determine paths between any position in the initially unknown world and a chosen goal.

This problem, defined by Koenig and Simmons (1996) as a goal-directed reinforcement learning problem (GDRLP), has two stages. The first part, the goal-directed exploration problem (GDEP), involves the exploration of the state space to determine at least one viable path between the initial and the target states. The second phase uses such knowledge to find the optimal or a sub-optimal path.

3. THE AGENT

The navigation problem can be viewed as a decision tree in which each node corresponds to a possible state s and each link corresponds to an action that changes states. The search in this tree is heuristically guided by evaluations of all possible actions the agent can take from the state s. The agent has to learn how to evaluate the actions from scratch, using environmental feedback, without the help of a teacher. Hence, reinforcement learning is a suitable approach to solve the GDRLP because it matches these requirements.

In RL, the agent receives a reinforcement signal r from its environment that quantifies the instantaneous contribution of a particular action in the state s towards reaching the goal state. The evaluation of a state s under a policy π, denoted by V(s), estimates the total expected return obtained by starting the process from that state and following the mentioned policy (Sutton and Barto, 1998). The learning process takes into consideration three types of situations: free states, obstacle states, and the goal state. A free state is defined as a state in which the agent reaches neither the goal (goal state) nor an obstacle (obstacle state). Thus, the reinforcement signal is defined below:

r_{t+1}(s_t, a, s_{t+1}) = { +1, if s_{t+1} is a goal state;
                             -1, if s_{t+1} is an obstacle state;      (1)
                              0, if s_{t+1} is a free state }

where: r_{t+1} is the reinforcement signal at time step t+1; s_t is the state at time t; and a is the action chosen by the agent. This representation was derived from the goal-reward representation (Sutton, 1990) and the action-penalty representation (Barto et al., 1995). The former indicates the states to guide the agent towards the goal. Consequently, it is a very sparse representation because, very often, there is only a single target. The latter is less sparse than the first
representation because the occurrence of obstacles is much more common than the presence of goals. However, this representation does not direct the agent towards the target.
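To make the composed representation concrete, the following minimal sketch codes the reinforcement signal of (1) for a grid world. The grid encoding (0 for free cells, 1 for obstacles, 2 for the goal) and the function name are assumptions made only for this example; the paper does not prescribe any particular implementation.

# Sketch of the reinforcement signal of Eq. (1).
# Assumption: states are (row, col) cells of a grid encoded as
# 0 = free, 1 = obstacle, 2 = goal.

FREE, OBSTACLE, GOAL = 0, 1, 2

def reinforcement(grid, next_state):
    """Return r_{t+1} for a transition that lands in next_state."""
    row, col = next_state
    cell = grid[row][col]
    if cell == GOAL:
        return +1      # goal-reward component
    if cell == OBSTACLE:
        return -1      # action-penalty component
    return 0           # free state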
Figure 1 presents an example of how the agent sees its workspace after a number of agent-environment interactions in which the goal was reached at least once. The information obtained about the environment from the reinforcement signal is encoded into two knowledge structures: the reward and penalty surfaces.

Fig. 1. An example of agent interaction with the workspace.

The learning strategy for solving this GDRLP is temporal difference (TD), as used by Barto et al. (1983). This technique is characterized by being incremental and by dispensing with a model of the world. The chosen learning rule is the following (Sutton and Barto, 1998):

ΔV(s_t) = α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ],   V(s_t) ← V(s_t) + ΔV(s_t)      (2)

where: V(s) is the evaluation of the state s; ΔV(s) is the update applied to the evaluation of the state s; α is the learning rate (0 < α ≤ 1); and γ is the discount factor (0 < γ ≤ 1).

The timing to update each one of the surfaces is different. The penalty surface is modified whenever the agent moves, and the algorithm considers only the state to be updated (the current state) and its successor. The changes follow every single visit to a particular state. Initially the agent has a totally random behavior that becomes reactive as time goes by. This behavior increasingly considers the paths to reach the goal as further information is added to the reward surface.

The reward surface is updated whenever the goal is encountered. The modification involves all the states the agent visited to reach the goal. The algorithm stores this trajectory while the agent tries to reach the goal. A variable learning rate was adopted to modify the values, avoiding the presence of local minima (Araujo and Braga, 1998). The variable learning rate is initially high and diminishes abruptly as a function of the number of visits: α(s), the learning rate of state s, falls from a maximum value α_2 = 0.9998 towards a minimum value α_1 = 0.1 as k(s), the number of visits to state s in trials that reached the goal, increases.

The composition of the reward and penalty surfaces guides the agent actions. The learning algorithm to construct both surfaces is shown in Figure 2.

Initialize with zeros all V(s)-values in both surfaces;
Repeat
    Reset the memorized current trajectory;
    Take the agent to the current initial position;
    Repeat
        Select an action a;
        Execute the action a;
        Update the penalty V(s)-value according to (2);
        Store the state s in the current trajectory;
    While the agent does not find an obstacle or a goal state;
    If the new state is a goal one, update the reward V(s)-value for the whole memorized trajectory according to (2);
While the agent does not find the target for a pre-determined number of times.

Fig. 2. The learning algorithm in procedural English.

The algorithm above deals with both the GDEP and the GDRLP. It was implemented in MATLAB, using the graphic user interface (GUI) to visualize the agent behavior; this facility also allows the reward and penalty surfaces to be inspected (see Figures 3 and 5).
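As an illustration only, the sketch below follows the procedure of Figure 2 in Python rather than MATLAB. The grid encoding, the four-action move set, the purely random action selection used in place of the composed policy, the 1/k decay chosen for the variable learning rate, and the backward sweep used to spread the reward-surface update along the memorized trajectory are all assumptions introduced for this example; the paper specifies none of these details.

import random

# Sketch of the learning loop of Figure 2 under the assumptions stated above.

FREE, OBSTACLE, GOAL = 0, 1, 2
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]       # up, down, left, right

ALPHA_P, GAMMA_P = 0.001, 0.67                     # penalty surface (paper values)
ALPHA_MIN, ALPHA_MAX = 0.1, 0.9998                 # bounds of alpha(s) (paper values)
GAMMA_R = 0.99                                     # reward surface discount (paper value)

def variable_alpha(k):
    """Hypothetical decay: high at first, falling towards ALPHA_MIN with the
    number k of goal-reaching visits (the paper does not give the exact form)."""
    return max(ALPHA_MIN, ALPHA_MAX / (1 + k))

def train(grid, start_states, cycles=1000, max_steps=10000):
    rows, cols = len(grid), len(grid[0])
    penalty = [[0.0] * cols for _ in range(rows)]  # penalty surface V_p(s)
    reward = [[0.0] * cols for _ in range(rows)]   # reward surface V_r(s)
    visits = [[0] * cols for _ in range(rows)]     # k(s)

    for _ in range(cycles):
        s = random.choice(start_states)            # take the agent to an initial position
        trajectory = []                            # reset the memorized trajectory
        for _ in range(max_steps):
            dr, dc = random.choice(ACTIONS)        # select and execute an action
            nxt = (min(max(s[0] + dr, 0), rows - 1),
                   min(max(s[1] + dc, 0), cols - 1))
            cell = grid[nxt[0]][nxt[1]]
            r = 1 if cell == GOAL else (-1 if cell == OBSTACLE else 0)  # Eq. (1)
            # Eq. (2): TD update of the penalty surface at every move.
            penalty[s[0]][s[1]] += ALPHA_P * (
                r + GAMMA_P * penalty[nxt[0]][nxt[1]] - penalty[s[0]][s[1]])
            trajectory.append(s)                   # store the state in the trajectory
            if cell == OBSTACLE:
                break                              # trial ends on a collision
            if cell == GOAL:
                # Goal found: one possible reading of "update the reward
                # V(s)-value for the whole memorized trajectory according
                # to (2)" is a backward sweep from the goal.
                target = 1.0
                for i, j in reversed(trajectory):
                    visits[i][j] += 1
                    a = variable_alpha(visits[i][j])
                    reward[i][j] += a * (GAMMA_R * target - reward[i][j])
                    target = reward[i][j]
                break
            s = nxt
    return reward, penalty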
The policy used during training to select an action for the agent is based on a composition of the reward and the penalty surfaces, as follows (a sketch of this selection rule in code is given after the list):
• In 60% of the attempts, the algorithm chooses the action with the largest combined value (reward + penalty);
• In 10% of the attempts, the algorithm chooses the action that takes the agent in the opposite direction of the largest penalty;
• In 5% of the attempts, the algorithm chooses the action with the largest reward;
• In 25% of the attempts, the algorithm chooses the action randomly.
• In case of a tie, the choice is random.
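A minimal sketch of this composed training policy is given below, assuming the grid-world conventions of the previous sketches. The helper functions and the literal reading of "the opposite direction of the largest penalty" are illustrative assumptions; the test-stage policy described next uses the same composition with probabilities 85%, 10% and 5%, and no purely random component.

import random

# Sketch of the composed training policy (60% / 10% / 5% / 25%).
# `reward` and `penalty` are the two surfaces indexed by (row, col).

ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def successor(state, action, rows, cols):
    """Hypothetical move model: step while staying inside the grid bounds."""
    return (min(max(state[0] + action[0], 0), rows - 1),
            min(max(state[1] + action[1], 0), cols - 1))

def argmax_random_tie(pairs):
    """Return the action with the largest value, breaking ties at random."""
    best = max(value for value, _ in pairs)
    return random.choice([a for value, a in pairs if value == best])

def training_policy(state, reward, penalty, rows, cols):
    scored = []
    for a in ACTIONS:
        i, j = successor(state, a, rows, cols)
        scored.append((reward[i][j], penalty[i][j], a))

    draw = random.random()
    if draw < 0.60:
        # Largest combined value (reward + penalty).
        return argmax_random_tie([(r + p, a) for r, p, a in scored])
    if draw < 0.70:
        # One reading of "opposite direction of the largest penalty":
        # find the action whose successor carries the most negative
        # penalty value and move the other way.
        worst = argmax_random_tie([(-p, a) for r, p, a in scored])
        return (-worst[0], -worst[1])
    if draw < 0.75:
        # Largest reward.
        return argmax_random_tie([(r, a) for r, p, a in scored])
    # Remaining 25%: purely random action.
    return random.choice(ACTIONS)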
The first three components of the policy provide the agent with a tendency to follow the growth of the reward evaluation and to avoid the growth of the penalty evaluation. The fourth component, however, adds a random element to the agent behavior to allow variations in the initially learned paths. This randomness makes possible the improvement of the trajectories as the number of training cycles increases (a training cycle corresponds to the random selection and training of some initial points).

In the test stage, the policy does not use the random component:

• In 85% of the attempts, the algorithm chooses the action with the largest combined value (reward + penalty);
• In 10% of the attempts, the algorithm chooses the action that takes the agent in the direction opposite to the state with the largest penalty;
• In 5% of the attempts, the algorithm chooses the action with the largest reward;
• In case of a tie, the choice is random.

4. SIMULATIONS

The algorithm was tested in three different complex workspaces. The first set of tests considered an environment with many free states, few obstacles and no other marks. This workspace (Figure 6) has a sparse distribution of feedback information; thus, initially the agent may choose very inefficient paths. The other two workspaces have many obstacle states, reducing the agent action choices (Figures 7 and 8). The agent navigation is evaluated with respect to: the evolution of the reward and penalty surfaces; the effects of the distribution of sensorial information on agent learning; and the process of trajectory optimization. In all situations, the agent had no initial knowledge about the environment.

During the training cycles, for the two-room, few-obstacle environment, the penalty surface changes only slightly after the initial interactions (Figure 3). After the first time the goal is reached, the occurrence of collisions between agent and obstacles is considerably reduced because the stored information guides the agent towards the goal. This behavior can be seen in Figure 4, in which the number of agent-obstacle collisions is plotted against the number of training cycles.

Fig. 3. Sketch of the penalty surface after: (a) 1 training cycle, (b) 100 training cycles, (c) 1000 training cycles and (d) 5574 training cycles.

Fig. 4. Occurrence of agent-obstacle collisions during the training stage for the first environment.

In the early stages of the training phase, there is a peak of collisions. After that, the number of clashes diminishes abruptly, becoming oscillatory as a consequence of the random component of the policy. After the first cycles, learning occurs basically in the reward surface, following the first time the agent arrives at its target from a particular initial point. As time goes by, the reward surface maps more states
and becomes smoother (Figure 5) than the one initially produced.
Fig. 5. The reward surface after: (a) 1 training cycle, (b) 100 training cycles, (c) 1000 training cycles and (d) 5574 training cycles.

The agent used the following parameters in all tests (p: penalty surface; r: reward surface): α_p = 0.001, γ_p = 0.67, α_r = 0.9998, γ_r = 0.99.

The trajectories in Figure 6 exemplify the improvement of the agent performance for the few-obstacle environment.

Fig. 6. Improvement of the agent path for reaching a goal (on the left hand side of the room) from two particular initial points. The figure shows the agent trajectory after (a) 1, (b) 100, (c) 1000 and (d) 5574 training cycles.

The trajectories above exemplify the algorithm's tendency to improve the initial path. In a workspace with many free states the improvement of the path is significant, while in workspaces with many obstacle states the tendency is for only discrete modifications of the initial trajectory to occur (Figure 7).

Fig. 7. (a)-(c) Improvement of the agent performance between an initial point and the goal state after 1, 100 and 1000 training cycles. The final point is placed at the upper right side.

The trajectories of Figure 7 could be shorter if the agent approached the goal through the central corridor. This was not possible because the first link between the initial and goal points was made along the left side of the room. The random component of the policy tries to explore new states, but the narrow passages due to the obstacles induce the agent to exploit the known path (Figure 8).

Fig. 8. The reward surface after: (a) 1 training cycle, (b) 100 training cycles, (c) 1000 training cycles.

Fig. 9. Improvement of the agent performance between an initial point and the goal state after: (a) 1 training cycle, (b) 100 training cycles, (c) 1000 training cycles, (d) 10000 training cycles.
Fig. 10. The reward surface after: (a) 1 training cycle, (b) 100 training cycles, (c) 1000 training cycles, (d) 10000 training cycles.

The agent can also work in more complex environments, such as a maze (Figure 9). In this case, the learning phase adjusts the trajectory so as to produce smoother paths, and the reward surface is updated just around the states of the first trajectory obtained (Figure 10).

5. CONCLUSIONS

This paper proposed an agent that uses reinforcement learning methods to guarantee its autonomy to navigate in its environment state space. Differently from the representations commonly found in the technical literature, the reinforcement signal from the environment, instead of simply rewarding or penalizing the agent actions, composes reward and penalty to endow the agent with the ability to plan and to behave reactively. The agent is supposed to solve the GDRLP, in which an exploration stage finds a solution to the proposed task in an initially unknown environment and, after that, the initial solution is improved through further training. The tests suggest that the agent always finds a viable path between an initial state and a goal state (provided that a solution exists). The agent is able to find its way in environments with few or many obstacles. In the former environments, the path improvement is more significant than in the latter ones. Such a better performance is obtained through further training, in which the reward surface explores new spatial positions and becomes smoother.

The tests also suggest that the agent does not follow the shortest path, because the policy is not totally deterministic and because the shortest path is not the only one to match the improvement criteria.

REFERENCES

Araujo, A. F. R. and Braga, A. P. S. (1998). Reward-penalty reinforcement learning scheme for planning and reactive behavior. In: IEEE International Conference on Systems, Man, and Cybernetics, San Diego, CA, USA.
Barto, A. G., Sutton, R. S. and Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13, 5, 834-846.
Barto, A. G., Bradtke, S. J. and Singh, S. P. (1995). Learning to act using real-time dynamic programming. Artificial Intelligence, 73, 1, 81-138.
Brooks, R. A. (1986). A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation, RA-2, 1, 14-23.
Chang, C. and Gaudiano, P. (1997). Neural competitive maps for reactive and adaptive navigation. In: 2nd International Conference on Computational Intelligence and Neuroscience, 19-23, Research Triangle Park, North Carolina.
Koenig, S. and Simmons, R. G. (1996). The effect of representation and knowledge on goal-directed exploration with reinforcement-learning algorithms. Machine Learning, 22, 227-250.
Leonard, J. J. and Durrant-Whyte, H. F. (1991). Mobile robot localization by tracking geometric beacons. IEEE Transactions on Robotics and Automation, 7, 3, 376-382.
Millan, J. del R. (1996). Rapid, safe, and incremental learning of navigation strategies. IEEE Transactions on Systems, Man, and Cybernetics, 26, 408-420.
Singh, S. P. (1994). Learning to solve Markovian decision tasks. PhD thesis, University of Massachusetts, Amherst.
Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In: Proceedings of the Seventh International Conference on Machine Learning, 216-224.
Sutton, R. S. and Barto, A. G. (1998). An Introduction to Reinforcement Learning. Cambridge, MA: MIT Press, Bradford Books.
Whitehead, S. D. and Ballard, D. H. (1991). Learning to perceive and act by trial and error. Machine Learning, 7, 45-83.
Winston, P. H. (1992). Artificial Intelligence. Reading, MA: Addison-Wesley.