Automatica, Vol. 6, pp. 469-476. Pergamon Press, 1970. Printed in Great Britain.
Control of a Robot in a Partially Unknown Environment* Commande d'un robot dans un environnement partiellement inconnu Regelung eines Roboters in einer partiell unbekannten Umgebung
Управление роботом в частично-неизвестном окружении
W. G. KECKLER† and R. E. LARSON‡
Summary--This paper discusses the problem of controlling the motion of a robot in a partially unknown environment. The problem can be formulated as a stochastic control problem with some state variables related to the physical dynamics of the robot and others related to the information the robot has obtained about the environment. First, a dynamic programming procedure that calculates the optimum policy for the robot is described. Next, a heuristic method that is capable of solving much larger problems is presented. The performance of the heuristic is shown to compare favorably with that of humans in complex examples.

1. Introduction THE OPTIMIZATION of systems in which stochastic effects are present has been studied extensively by a number of researchers [1-6]. An extremely general formulation of these problems has been called by MEIER [6] the combined optimum control and estimation problem. A solution to this problem has been formulated using dynamic programming [6-8]. Even though several theoretical papers have been written on this subject, there have been very few examples worked out for any cases but the linear Gaussian problem [9, 10].

This paper first describes a dynamic programming approach that was proposed for the problem of optimally controlling a robot, equipped with sensors, that is operating in an unknown environment.§ A methodology is presented for formulating a class of stochastic control problems in which there are informational variables that specify the degree of knowledge about the physical state of the system as well as the physical variables of the type encountered in most control applications. These problems are present in a number of areas; the robot example discussed here is related to the general problem of unmanned exploration of a hostile, inaccessible environment, while another formulation of this type has been developed for mission reliability problems [11]. The detailed calculations required to implement this approach are also described. Dynamic programming is shown to be feasible for handling system equations, performance criteria, and constraints that simultaneously involve physical variables and informational variables. Another aim of this paper is to demonstrate the relationship of a number of concepts from system theory to the combined optimum control and estimation problem; among the concepts discussed are dual control [12] and value of information [13].

In the robot problem, computational complexity increases exponentially with the number of physical and informational state variables. Thus, many problems of interest are too unwieldy to solve rigorously on present-day computers. In order to attack these problems, a heuristic method based on the optimization algorithm has been devised. It is thus possible in this paper to analyze the relation between the heuristic methods and optimization approaches for a concrete example. The results of heuristic methods are also compared with the performance of humans in some representative cases.

The remainder of the paper is organized as follows: Section 2 defines the mission of the robot and develops a state-space formulation for the problem of controlling it. Section 3 describes a dynamic programming solution method and presents results for some illustrative examples. The computational difficulties of this algorithm are also discussed. Section 4 describes a heuristic solution method with greatly reduced computational requirements. The performance of the heuristic method is evaluated in terms of both the optimum performance and that obtained by a group of control theorists given the same information. Section 5 evaluates the potential of both optimization techniques and heuristic methods for solving practical problems involving stochastic effects.
2. Problem statement and formulation The problem discussed in this paper can be stated as follows: A robot is attempting to perform a specified task in an unknown environment. As part of this task, it must travel to a particular spatial location, called the goal. There are a number of barriers that the robot cannot traverse; the presence or absence of these barriers is not known a priori, but their potential locations are all known. The robot has a sensor system that can detect the presence or absence of these barriers; however, there is a cost associated with using these sensors. The problem is to devise a policy for the robot that will find that path to the goal that minimizes the sum of the cost of using the sensor system and a cost that reflects the length of the path. This problem can be formulated as an optimum control problem in which the minimum-distance requirement and the sensor cost are joined in a common cost criterion. Minimization of this criterion requires that state and control
* Received 3 April 1969; revised 24 November 1969. The original version of this paper was presented at the 4th IFAC Congress which was held in Warsaw, Poland during June 1969. It was recommended for publication in revised form by associate editor P. Dorato. This research was performed while both authors were with Stanford Research Institute, Menlo Park, California.
† School of Engineering and Applied Science, Washington University, St. Louis, Missouri 63130.
‡ Systems Control, Incorporated, 260 Sheridan Avenue, Palo Alto, California 94306.
§ A robot project is under current study at Stanford Research Institute. "Application of Intelligent Automata to Reconnaissance," Contract AF 30(602)-4147, SRI Project 5953.
Brief Papers
variables be defined that completely represent the situation and options of the robot. Complete specification of the state variables requires not only that the location of the robot in its environment be specified, but also that the state of knowledge about the barrier configuration at each position be considered. Control options include both movements to a number of adjacent unblocked areas and use of the sensing equipment to determine the presence or absence of barriers at more distant sites. Since additional knowledge can prevent wasteful moves, the expected benefit of observing must be considered each time that a control decision is made. The control decision is a function of present location in the environment and of the present state of knowledge of barrier locations.

A simple problem that illustrates the features of this formulation is shown pictorially in Fig. 1. In this example, an automaton is attempting to find its way through the environment shown to a goal located at the upper left-hand square, which has coordinates ($x_1 = 1$, $x_2 = 1$). As shown in Fig. 2, the automaton can move one square either up, down, left, or right.

FIG. 1. Automaton environment. (Legend: G, goal square; marked squares, possible barrier locations.)

FIG. 2. Automaton moves. (Legend: 0, present square; 1, squares that can be reached in one move; 2, squares that can be reached in two moves.)

There are two squares on which it is possible that barriers are present: namely ($x_1 = 1$, $x_2 = 2$) and ($x_1 = 2$, $x_2 = 3$). The automaton is not allowed to pass through these barriers. Initially, the automaton does not know if these barriers are present; instead, it knows that there is a barrier in ($x_1 = 1$, $x_2 = 2$) with probability 0.4 and a barrier in ($x_1 = 2$, $x_2 = 3$) with probability 0.5. The robot can always "see" one move ahead, i.e. if it is within one move of a barrier location, it can find out if the barrier is there or not. For a certain price, which is expressed as a specified fraction of a move (0.3 moves), the automaton can make an observation of all squares that are two moves away (see Fig. 2). The objective is to find the policy for the automaton that reaches the goal square ($x_1 = 1$, $x_2 = 1$) from any initial square while minimizing the expected value of the sum of moves and penalties for making observations.

This problem can most easily be put into the desired framework by defining the complete state description of the system (information state) to be a four-dimensional vector x,

$$x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix}, \tag{1}$$

where
$x_1$  horizontal coordinate in Fig. 1
$x_2$  vertical coordinate in Fig. 1
$x_3$  state of knowledge about barrier at ($x_1 = 1$, $x_2 = 2$)
$x_4$  state of knowledge about barrier at ($x_1 = 2$, $x_2 = 3$).

The first two variables are quantized to the values

$$x_1 = 1, 2, 3 \qquad x_2 = 1, 2, 3, 4. \tag{2}$$

The latter two variables are quite different from the usual state variables associated with dynamic systems. Each variable can take on three different values as follows

$$x_3 = P, A, Q \qquad x_4 = P, A, Q, \tag{3}$$

where
P  barrier is known to be present
A  barrier is known to be absent
Q  absence or presence of barrier is not known.

The control vector, u, has three components. They are

$$u = \begin{bmatrix} u_1 \\ u_2 \\ u_3 \end{bmatrix}, \tag{4}$$

where
$u_1$  positive change in $x_1$*
$u_2$  negative change in $x_2$
$u_3$  decision to make an observation.

The variable $u_3$ takes on two values

$$u_3 = L, N, \tag{5}$$

where
L  an observation is made
N  no observation is made.

The set of admissible controls is thus

$$U = \left\{ \begin{bmatrix} 1 \\ 0 \\ N \end{bmatrix}, \begin{bmatrix} -1 \\ 0 \\ N \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \\ N \end{bmatrix}, \begin{bmatrix} 0 \\ -1 \\ N \end{bmatrix}, \begin{bmatrix} 0 \\ 0 \\ L \end{bmatrix} \right\}, \tag{6}$$

corresponding to move right, move left, move up, move down, and make an observation. The first two system equations can be written as

$$x_1(k+1) = x_1(k) + u_1(k)$$
$$x_2(k+1) = x_2(k) - u_2(k). \tag{7}$$

* For historical reasons, the convention was adopted that a positive move was up and to the right in Fig. 1, while the direction of position $x_1$ and $x_2$ was as shown there.
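The position dynamics of equation (7) and the control set of equation (6) can be illustrated with a short Python sketch. The names `ADMISSIBLE_CONTROLS`, `next_position` and `in_grid` are hypothetical, and the sign convention (positive $u_1$ moves right, positive $u_2$ moves up and so decreases $x_2$) is an assumption based on the footnote and on the controls listed in Table 3.

```python
# Hypothetical encoding of the five admissible controls (u1, u2, u3).
# u3 is "N" (no observation) or "L" (look).
ADMISSIBLE_CONTROLS = {
    "right": (1, 0, "N"),
    "left": (-1, 0, "N"),
    "up": (0, 1, "N"),
    "down": (0, -1, "N"),
    "look": (0, 0, "L"),
}


def next_position(x1, x2, control):
    """Position update: x1 grows to the right, x2 grows downward (assumed)."""
    u1, u2, _ = control
    return x1 + u1, x2 - u2


def in_grid(x1, x2, n_x1=3, n_x2=4):
    """True if the square lies inside the 3 x 4 grid of the example."""
    return 1 <= x1 <= n_x1 and 1 <= x2 <= n_x2
```

For instance, `next_position(1, 2, ADMISSIBLE_CONTROLS["up"])` gives the goal square `(1, 1)`, while the look control leaves the position unchanged. Barrier checks are deliberately left out of this sketch.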
The uncertainty about the barriers is taken into account by defining a random forcing function vector w with two components,

$$w = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}, \tag{8}$$

where
$w_1$  presence or absence of barrier at ($x_1 = 1$, $x_2 = 2$)
$w_2$  presence or absence of barrier at ($x_1 = 2$, $x_2 = 3$).

These variables can take on the values

$$w_1 = B, R \qquad w_2 = B, R, \tag{9}$$

where
B  barrier is present
R  barrier is absent.

This vector affects only the state variables $x_3$ and $x_4$. In writing the system equations for these two variables, it is useful to define two auxiliary variables, $m_1$ and $m_2$. The variable $m_1$ takes on the value 1 if the control is such that the presence or absence of the barrier at ($x_1 = 1$, $x_2 = 2$) will be determined; $m_1 = 0$ otherwise. The variable $m_2$ is 1 if the control will determine the presence or absence of the barrier at ($x_1 = 2$, $x_2 = 3$); $m_2 = 0$ otherwise. For $u_3 = N$, $m_1$ is equal to 1 if the move chosen causes the next square to be within one move of the barrier at ($x_1 = 1$, $x_2 = 2$). For $u_3 = L$, $m_1$ is equal to 1 if the barrier at ($x_1 = 1$, $x_2 = 2$) is within two moves of the present square. Otherwise, $m_1 = 0$. Similar conditions can be written for $m_2$. The system equation for $x_3$ can thus be written in the form

$$x_3(k+1) = f_3[x_3(k), u_3(k), m_1(k), w_1(k)], \tag{10}$$

where $f_3$ is defined by Table 1, and where

$$p(w_1 = B) = 0.4 \qquad p(w_1 = R) = 0.6$$
$$p(w_2 = B) = 0.5 \qquad p(w_2 = R) = 0.5. \tag{11}$$

TABLE 1. NEXT VALUE OF $x_3$ FOR AUTOMATON*

x3(k)   u3(k)   m1(k)   w1(k)   x3(k+1)
P       --      --      --      P
A       --      --      --      A
Q       N       0       --      Q
Q       N       1       B       P
Q       N       1       R       A
Q       L       0       --      Q
Q       L       1       B       P
Q       L       1       R       A

* The entry "--" implies that the value of $x_3(k+1)$ is the same for all values of this variable.

The equation for $x_4$ is

$$x_4(k+1) = f_4[x_4(k), u_3(k), m_2(k), w_2(k)], \tag{12}$$

where $f_4$ is specified by a table similar to that in Table 1. The performance criterion, which is to be minimized, is the sum of moves and penalties for observations. This criterion can be written as

$$J = \sum_{k=0}^{\infty} l[x(k), u(k)], \tag{13}$$

where $l[x(k), u(k)]$ is specified as in Table 2. In this table, it is assumed that the penalty for an observation is 0.3 moves.

TABLE 2. VALUE OF $l[x(k), u(k)]$ FOR AUTOMATON

x1(k)   x2(k)   u3(k)   l[x(k), u(k)]
1       1       --      0
1       ≠1      N       1.0
1       ≠1      L       0.3
≠1      1       N       1.0
≠1      1       L       0.3
≠1      ≠1      N       1.0
≠1      ≠1      L       0.3

The constraints are that the set of admissible controls and the set of admissible states are as defined above. Also, if a barrier is present, a move to that square is forbidden. However, if the robot should be placed on a barrier square, it is permitted to move off this square. Thus, the expected number of moves from a potentially blocked square to the goal can be defined and computed. The extension of this formulation to a larger problem is straightforward, and the optimum solution for the general case is presented in the next section. The computational procedure described below requires that there exists at least one path to the goal from each square, including barrier squares, which is known to be unblocked. Thus, no square can have all moves from it blocked by barrier squares, nor can any group of squares be completely surrounded by barrier squares; in other words, there must always be a deterministic path to the goal known to the robot.
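As an illustration of how the informational state evolves, the transition function for $x_3$ and the single-stage cost might be coded as follows. This is a minimal sketch based on one reading of Tables 1 and 2; the function names `next_knowledge` and `stage_cost` are hypothetical, and computing the auxiliary variable $m_1$ from the grid geometry is left outside the sketch.

```python
def next_knowledge(x3, u3, m1, w1):
    """Next value of the knowledge variable x3 (one reading of Table 1).

    x3 is "P" (known present), "A" (known absent) or "Q" (unknown);
    w1 is "B" (barrier present) or "R" (barrier absent).
    u3 is kept in the signature to mirror equation (10); in this reading
    its effect enters only through m1.
    """
    if x3 != "Q":      # once known, the knowledge never changes
        return x3
    if m1 == 0:        # the barrier square is not brought into view
        return "Q"
    return "P" if w1 == "B" else "A"


def stage_cost(x1, x2, u3, look_cost=0.3):
    """Single-stage cost (one reading of Table 2): zero at the goal (1, 1)."""
    if (x1, x2) == (1, 1):
        return 0.0
    return look_cost if u3 == "L" else 1.0
```

A matching function for $x_4$ would take $m_2$ and $w_2$ in place of $m_1$ and $w_1$.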
3. Solution and results for the complete-state representation The problem formulation of the preceding section satisfies the conditions for use of the technique "approximation in policy space" [1, 15]. Two of these conditions are that the system equations and constraints have no explicit dependence on time and that the performance criterion be the sum of time-invariant terms over an infinite number of stages. Another condition is that there exists a state and control that has a single-stage cost of zero. Finally, there must exist a finite-length path from this state to any other state that is unblocked with probability 1.0. If the latter condition is not met, the expected cost of reaching the goal is infinite, and it is necessary to use the technique "iteration in policy space" [15].

Because of the stochastic nature of this problem, the particular version of approximation in policy space described below is used. A proof that this method converges to the optimum solution can be obtained by using the results shown in Refs. [1, 15]. In order to start the procedure, an initial guess of the optimal policy, $\hat{u}^{(0)}(x)$, is made. One such control policy that is easy to compute is to pick the control that is optimal if all barriers are present, the worst possible situation. The corresponding minimum cost function, $I^{(0)}(x)$, is found by solving a deterministic minimum-path-length problem for each square, including those on which barriers are located. Formally, this function is obtained from solving

$$I^{(0)}(x) = l[x, \hat{u}^{(0)}(x)] + I^{(0)}\{f[x, \hat{u}^{(0)}(x), w^*]\}, \tag{14}$$

where f represents the system equation vector defined in equations (7), (10), and (12), and where $w^*$ corresponds to $w_1 = B$, $w_2 = B$. A particularly efficient method for solving this equation is to note that $I^{(0)}(x) = 0$ for all states corresponding to physical location at the goal, and then to compute outward from this square. When $I^{(0)}(x)$ has been found for all x, a new policy $\hat{u}^{(1)}(x)$ is found by solving

$$I'(x) = \min_{u} \left[ l(x, u) + E_{w}\{ I^{(0)}[f(x, u, w)] \} \right]. \tag{15}$$

Since $I'(x)$ is not allowed to be a stochastic variable, it is necessary to take the expected value of $I^{(0)}$ in equation (15). The policy $\hat{u}^{(1)}(x)$ is the value of u for which the minimum in equation (15) is attained. However, $I'(x)$ is not $I^{(1)}(x)$, the minimum cost corresponding to policy $\hat{u}^{(1)}(x)$, because $I^{(0)}$ appears inside the braces. The function $I^{(1)}(x)$ is found from

$$I^{(1)}(x) = l[x, \hat{u}^{(1)}(x)] + E_{w}\{ I^{(1)}[f(x, \hat{u}^{(1)}(x), w)] \}. \tag{16}$$

This equation can be solved iteratively as in Ref. [15]. In general, a new policy $\hat{u}^{(j+1)}(x)$ is formed from knowledge of $I^{(j)}(x)$, using

$$I'(x) = \min_{u} \left[ l(x, u) + E_{w}\{ I^{(j)}[f(x, u, w)] \} \right]. \tag{17}$$
The corresponding minimum cost function, $I^{(j+1)}(x)$, is found by solving

$$I^{(j+1)}(x) = l[x, \hat{u}^{(j+1)}(x)] + E_{w}\{ I^{(j+1)}[f(x, \hat{u}^{(j+1)}(x), w)] \}. \tag{18}$$

As already indicated, convergence to the true optimum can be proved for this case. When this method is applied to the problem described in the preceding section the results shown in Table 3 are obtained. Several interesting effects can be observed in Table 3. The first is Feldbaum's dual control effect [12]; this effect is said to occur whenever the optimal control is used to gain more information about the system instead of optimizing the performance directly. The effect is illustrated in this example by the optimal controls of looking instead of moving toward the goal. By imposing a penalty for the observation, the system is able to make a decision whether to gather more information or to continue moving on the basis of the information gathered so far. This occurs at ($x_1$, $x_2$, $x_3$, $x_4$) = (1, 4, Q, P) and ($x_1$, $x_2$, $x_3$, $x_4$) = (1, 4, Q, Q).

TABLE 3. SOLUTION TO AUTOMATON PROBLEM*

Present state            Optimal control       Minimum
x1    x2    x3    x4     u1    u2    u3        cost
1     1     --    --                           0
2     1     --    --     -1    0     N         1
3     1     --    --     -1    0     N         2
1     2     --    --     0     1     N         1
2     2     --    --     0     1     N         2
3     2     --    --     0     1     N         3
1     3     A     --     0     1     N         2
1     3     P     A      1     0     N         4
1     3     P     P      0     -1    N         8
2     3     --    --     0     1     N         3
3     3     --    --     0     1     N         4
1     4     A     --     0     1     N         3
1     4     P     A      0     1     N         5
1     4     P     P      1     0     N         7
1     4     P     Q      1     0     N         6
1     4     Q     A      0     1     N         3.8
1     4     Q     P      0     0     L         4.9
1     4     Q     Q      0     0     L         4.5
2     4     --    A      0     1     N         4
2     4     A     P      -1    0     N         4
2     4     P     P      1     0     N         6
2     4     Q     P      -1    0     N         5.9
3     4     --    --     0     1     N         5

* The entry "--" implies that the optimal control and minimum cost are the same for all values of this variable.

Another effect that can be observed is Howard's value of information [13]. This effect is related to the cost that should be paid in order to gain information. Again, this is illustrated here in the decision of whether or not to make an observation; if the cost of 0.3 moves does not pay for the increase in performance, then the observation should not be made. At the two points where the observation is made, it is interesting to note the optimal control and minimum cost for the case where the decision to observe is not allowed. It is seen from Table 4 that at ($x_1$, $x_2$, $x_3$, $x_4$) = (1, 4, Q, P), the net profit of making the observation is 0.5 moves, while at ($x_1$, $x_2$, $x_3$, $x_4$) = (1, 4, Q, Q), the net profit is 0.1 moves.

TABLE 4. COMPARISON OF OPTIMAL CONTROL AND MINIMUM COST WITH AND WITHOUT THE OBSERVATION CONTROL

                         Minimum cost    Minimum cost    Optimal control
                         with            without         with no observation
x1    x2    x3    x4     observation     observation     u1    u2    u3
1     4     Q     P      4.9             5.4             0     1     N
1     4     Q     Q      4.5             4.6             0     1     N
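The entries of Table 4 can be checked by hand from the costs of the fully informed states. The Python sketch below does this arithmetic under two assumptions that are not stated explicitly in the text: a look from square (1, 4) reveals both potential barriers, and a move up to (1, 3) also reveals both, because each barrier square is then one move away. The constant and function names are hypothetical.

```python
LOOK_COST = 0.3
P_W1_B = 0.4  # probability of a barrier at (x1=1, x2=2)
P_W2_B = 0.5  # probability of a barrier at (x1=2, x2=3)

# Costs-to-go for fully informed states, as read from Table 3.
COST_14_A, COST_14_PA, COST_14_PP = 3.0, 5.0, 7.0  # at square (1, 4)
COST_13_A, COST_13_PA, COST_13_PP = 2.0, 4.0, 8.0  # at square (1, 3)


def expect(p_present, cost_if_present, cost_if_absent):
    """Expected cost over the presence or absence of one barrier."""
    return p_present * cost_if_present + (1.0 - p_present) * cost_if_absent


# State (1, 4, Q, P): only the barrier at (1, 2) is unknown.
with_look_qp = LOOK_COST + expect(P_W1_B, COST_14_PP, COST_14_A)
without_look_qp = 1.0 + expect(P_W1_B, COST_13_PP, COST_13_A)

# State (1, 4, Q, Q): both barriers are unknown.
with_look_qq = LOOK_COST + expect(
    P_W1_B, expect(P_W2_B, COST_14_PP, COST_14_PA), COST_14_A)
without_look_qq = 1.0 + expect(
    P_W1_B, expect(P_W2_B, COST_13_PP, COST_13_PA), COST_13_A)

print(round(with_look_qp, 2), round(without_look_qp, 2))  # 4.9 5.4
print(round(with_look_qq, 2), round(without_look_qq, 2))  # 4.5 4.6
```

The differences, 0.5 and 0.1 moves, are the net profits of observing quoted in the text.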
The technique described above has been implemented in a computer program, and other larger examples have been solved. One of these is the one shown in Fig. 3. There are four squares on which barriers are possible; the presence of each one has a probability of 0.7. The price of a look two moves ahead is still 0.3. The state vector of this system has six state variables--the two position coordinates and four information variables. Since this problem has 1620 possible states, the complete solution cannot be presented here. However, in only 29 of these situations is the look option chosen when it is available. The improvement resulting from the use of this option at a cost of 0.3 moves ranges from 0.3 to 1.2 moves. When the a priori probability of the presence of each barrier is reduced to 0.5, 25 situations require looks, and the maximum saving is 0.8 moves. When the barrier probability is reduced to 0.3, the number of look situations declines to 12, but the maximum gain is again 1.2. A discussion of the reasons for the situations is interesting, but not central to the theme of this paper.
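The figure of 1620 states can be reproduced by multiplying the number of squares by three knowledge values (P, A, Q) for each potential barrier. The sketch below assumes that the environment of Fig. 3 has 20 squares; this is an inference from the quoted total, not a number given in the text, and `state_count` is a hypothetical helper.

```python
def state_count(n_squares, n_barriers, knowledge_values=3):
    """Positions times knowledge combinations (P, A or Q per barrier)."""
    return n_squares * knowledge_values ** n_barriers


print(state_count(20, 4))  # 1620, the enlarged example (20 squares assumed)
print(state_count(12, 2))  # 108, the 3 x 4 example of Fig. 1
```

Each additional potential barrier triples the count, which is why the complete state description grows so quickly.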
FIG. 3. Enlarged automaton environment. (Legend: marked squares, possible barrier locations.)

Though the complete state description does lead to optimal solutions, the procedure does have one very serious shortcoming: the number of states that must be considered builds up very rapidly with the number of potential barriers:

$$N = n_{x_1} \cdot n_{x_2} \cdot 3^{n_b}, \tag{19}$$

where
$N$  total number of states
$n_b$  number of barriers
$n_{x_1} \cdot n_{x_2}$  number of states which are used to represent the environment.

Thus, a problem which requires a 10 x 10 grid to represent the environment and includes 20 barriers requires 1.9 × 10s storage locations. Some reduction in state requirements can be achieved by eliminating those to which it is clear that no transitions will occur. For example, if the robot is only one move from the goal, its knowledge of the barrier configuration is irrelevant. The only important fact is which control option will take it to the goal. Thus, there is no reason to include any information states at squares only one move from the goal. At many other squares, the full range of information states is not required, but elimination of these states will not reduce the problem to manageable proportions. The next section describes a formulation that can yield control policies based on fewer computations than the procedure described in this section.

4. Reduced state-space formulation The problem stated in section 2 does not require that the control policy for all possible states be computed. If the problem is attacked in the context of the particular example the robot is facing, considerable savings in computer time are obtained. The solution is initiated with the robot located at a particular position in the environment and equipped with a specific a priori knowledge about the barrier configuration it must overcome. This includes knowledge of the possible locations of barriers and the probability of each one being present. The information about the barriers is not formalized as state variables, but instead is used as parameters for the decision-making process. The state of the system is represented by only the quantized position coordinates, $x_1$ and $x_2$. The control vector has the same five options given in equation (6), and the state equations are given by equation (7). If the control option "Look" is chosen, or if a move yields additional information about the barrier configuration, the control policy is revised to include the new information. The performance criterion continues to be the sum of moves and penalties for observations. The actual computational procedure is analogous to that described in section 3. The quantities defined below will be used to describe the procedure:

$x_1$  a quantized value of the horizontal coordinate of the environment.
$x_2$  a quantized value of the vertical coordinate of the environment.
$X$  $\{x : x = [x_1, x_2]\}$; i.e. X is the set of all squares in the system environment.
$\varepsilon$  a measure of convergence of the computational procedure.

Given a particular $x^* \in X$, define

$X^*$  the set of squares one move from $x^*$.
$X^*_B$  the set of squares one move from $x^*$ on which barriers are known to be present.
$X^*_P$  the set of squares one move from $x^*$ to which moves may be possible. (Note: $X^*_P = X^* - X^*_B$.)
$X^*_A$  the set of squares one move from $x^*$ on which barriers are known to be absent. (Note: $X^*_A \subseteq X^*_P$.)
$p(x_i)$  the probability there is a barrier on a square denoted by $x_i$.

$X^*_A$ cannot be empty because $x^*$ cannot be completely surrounded by barriers. $X^*$ has at most four elements, and it may have fewer if $x^*$ is on a boundary of X. $X^*_B$ contains at most three elements, and $X^*_P$ contains at most four elements. The description below outlines the computational procedure when the look option is not allowed. Modifications required to include the look option are described later. Further comments about Step 2 appear in the Appendix.

Step 1. For all $x \in X$ compute the initial cost estimate, $J^0(x)$, by implementing the control policy based on the assumption that all of the barriers are present. Let $k = 0$. Go to Step 2.

Step 2. For every $x \in X$ compute $J^{k+1}(x)$ as follows:
(a) Let $x^* = x$. Order the elements $x_i \in X^*_P$ according to increasing values of $J^k(x_i)$. Thus, once $X^*_P$ is ordered, $X^*_P = \{x_1, x_2, \ldots\}$ implies $J^k(x_1) \le J^k(x_2) \le \ldots$ Furthermore, at least one of the $x_i \in X^*_A$. Let

$$N = \min_{x_i \in X^*_P} \{ i : x_i \in X^*_A \}. \tag{20}$$

(N is the index of the first element in $X^*_P$ which is also in $X^*_A$.)
(b) Compute the revised value of $J^k(x^*)$, and denote it by $J^{k+1}(x^*)$:

if $N = 1$, then
$$J^{k+1}(x^*) = 1 + J^k(x_1); \tag{21}$$

if $N > 1$, then
$$J^{k+1}(x^*) = 1 + J^k(x_1)[1 - p(x_1)] + \sum_{i=2}^{N} J^k(x_i)[1 - p(x_i)] \prod_{j=1}^{i-1} p(x_j), \tag{22}$$

where $p(x_N) = 0$ because $x_N$ is known to be free of barriers.

Step 3. Compute

$$\max_{x \in X} \left[ J^k(x) - J^{k+1}(x) \right] = M.$$

If $M < \varepsilon$, a predetermined measure of convergence, the cost has converged near enough to the limit $J(x)$, and the procedure continues to Step 4. If $M \ge \varepsilon$, another iteration is required, and the procedure returns to Step 2.

Step 4. The robot now has costs assigned to every square in the environment and is located at a particular square $x^* \in X$. Since the robot can "see" one move ahead, all elements in $X^*$ must be either in $X^*_B$ or in $X^*_A$. The robot chooses to move to the square $x_i \in X^*_A$ for which $J(x_i)$ is smallest.
Step 5. Execute the move indicated by Step 4. If the movement to $x_i$ brings no new barrier squares into view, there is no new information to be processed. The robot can move again according to the rule in Step 4. If the robot obtains new information about the unknown barriers, the changed situation requires that Steps 1-3 be repeated before the rule in Step 4 can be used. Steps 1 to 5 are repeated until the robot reaches the goal.

Since the cost criterion directly specifies the control in this computational procedure as indicated in Step 4, the characteristics of the cost obtained by the heuristic determine the control policy. The truly optimal values of $J(x)$, for all $x \in X$, are the expected values of the minimum costs based on the present state of information. Unfortunately, computational experience indicates that the costs obtained by the heuristic do not converge to the optimal values. The computation of $J(x^*)$ depends only on the cost associated with neighbouring locations and fails to adequately recognize the risk of entering a cul de sac. Thus, the approximate costs are lower than the truly optimal costs. The control policy reflects this deficiency by causing the robot to be more willing to enter cul de sacs than it should be. Though the costs obtained by the heuristic do not converge to the optimal costs, they have been found to converge to values bounded above by the optimal costs and below by the costs in a barrier-free environment. The costs appear to be nearer the optimal costs than the lower bound. The procedure outlined above converges in five or six iterations. The value of $\varepsilon$ is not critical in convergence. Reducing $\varepsilon$ by an order of magnitude results in one or two more iterations
before the convergence criterion is satisfied. Furthermore, a change in $\varepsilon$ does not change the limits to which the costs converge. When the look option is allowed, Step 5 must be modified. If the robot is located at $x^*$, observe that a look option can bring as many as eight additional squares, as indicated in Fig. 2, into the "view" of the robot; define this set as $X^*_L$. Since $x^*$ cannot be surrounded by barriers, at least one element of $X^*_L$ will be known to be open. In addition, $X^*_L$ can contain squares known to be blocked and squares whose status is unknown; let $X^*_{LP}$ be the subset of $X^*_L$ containing squares whose barrier status is unknown. If there are n members of $X^*_{LP}$, then there are $2^n$ possible barrier configurations, consisting of all combinations of each of the n barriers being present or absent. If the robot chooses the look option, it will see one of the $2^n$ possible configurations. In order to determine whether to take the move indicated by Steps 1 to 4 or to look, the robot must compute the expected effect on its strategy of the information it would obtain by looking. From this point of view, Step 5 becomes the following:
Step 5'a. For each barrier configuration which a look could reveal, compute the probability of this configuration and repeat Steps 1 to 4 to determine the move which should result.

Step 5'b. From the information computed in 5'a compute the probability of a look changing the move chosen in Step 4. If this probability is above a given threshold, the look option is chosen as the next move; if it is less than the threshold, the move indicated by Step 4 in the no-look option is chosen.

Step 5'c. Execute the move indicated by 5'b. If this move brings no new barrier squares into view or into the range of another look, the robot can immediately move again by the rule in Step 4. If new information is within the view of the look option or the robot's direct view, then Steps 1 to 4, 5'a and 5'b must be repeated to determine the next move.

The choice of the threshold used in Step 5'b is based on the observation that an incorrect move will cost the robot two moves in a cellular environment--the move itself and the move required to correct it. Thus, if the probability of the move being changed by looking is greater than one-half the cost of the look, the look option should be chosen. The cost of a look in the example of section 2 is 0.3. If the probability of a look changing the move is greater than 0.15, the look option should be chosen.

This procedure results in actions on the part of the robot simulation that appear to be a learning process. When it is positioned at a particular cell in the environment, it is able to determine its next move through consideration of the entire barrier and goal configuration. If this move increases its knowledge of the environment, the computational procedure is reapplied to determine if a better control policy can be found. Thus, as the robot moves through the environment searching for the shortest path to the goal in a systematic manner, it is also mapping the environment. If it were again placed in this same environment, this "experience" would enable it to reach the goal much more quickly than the first time.

Since this heuristic is not optimal, it is difficult to determine how effective it is. However, in one test, the performance of a group of control theorists was compared with the performance of the heuristic in the barrier configurations shown in Figs. 4 and 5. The barriers were placed according to the probabilities shown in these figures. Solutions for six cases based on the configuration of Fig. 4 were obtained by the heuristic and by four people. In the resulting twenty-four comparisons of control theorists with the heuristic, the heuristic performed better eleven times, poorer seven times, and there were six ties. Five cases based on Fig. 5 were presented to five people and to the heuristic.

FIG. 4. First environment presented to control theorists. Probability of a barrier on a shaded square is 0.5. (Legend: S, possible starting points; G, goal.)
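Steps 5'a and 5'b can be sketched in Python as an enumeration over the $2^n$ configurations that a look could reveal. The sketch assumes that barriers are independent and that a caller supplies a function `move_for` returning the move Steps 1 to 4 would select for a given configuration; both function names and the interface are hypothetical.

```python
from itertools import product


def look_change_probability(unknown, move_for, current_move):
    """Probability that a look changes the move chosen without looking.

    unknown: dict mapping each unknown square to its barrier probability.
    move_for: function taking {square: barrier_present} and returning a move.
    """
    squares = list(unknown)
    total = 0.0
    for present in product([True, False], repeat=len(squares)):
        config = dict(zip(squares, present))
        prob = 1.0
        for square, is_present in config.items():
            prob *= unknown[square] if is_present else 1.0 - unknown[square]
        if move_for(config) != current_move:
            total += prob
    return total


def should_look(change_probability, look_cost=0.3):
    """Threshold rule: look if the probability exceeds half the look cost."""
    return change_probability > look_cost / 2.0
```

For example, with a single unknown square of barrier probability 0.4, and a policy that moves "up" unless that square is blocked, the probability of a changed move is 0.4, which exceeds the threshold of 0.15, so the look would be chosen.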
FIG. 5. Second environment presented to control theorists. Probability of a barrier on a shaded square equals the number in the square. (Legend: S, start; G, goal.)

The heuristic won sixteen of the resulting twenty-five trials, lost six, and tied three. Tables 5 and 6 compare the performance of the heuristic with the average performance of the humans in each case.

TABLE 5. COMPARISON OF PERFORMANCES BY HEURISTIC AND THE CONTROL THEORISTS ON THE CONFIGURATION OF FIG. 4

         Heuristic          Average of 4 control theorists
Case     Moves    Looks     Moves     Looks
1        12       2         13.75     0.75
2        12       2         13.00     1.25
3        14       2         14.50     0.75
4        12       0         12.00     1.00
5        14       0         14.00     0
6        14       2         14.50     2.00

Heuristic's combined record: Win 11, Lose 7, Tie 6.

TABLE 6. COMPARISON OF PERFORMANCES BY HEURISTIC AND THE CONTROL THEORISTS ON THE CONFIGURATION OF FIG. 5

         Heuristic          Average of 5 control theorists
Case     Moves    Looks     Moves     Looks
1        12       1         13.2      0.8
2        22       1         21.6      1.0
3        18       2         18.8      1.6
4        12       1         18.0      1.4
5        12       2         14.0      0.8

Heuristic's combined record: Win 16, Lose 6, Tie 3.
The configuration shown in Fig. 5 is considerably more complex than that in Fig. 4; there are 100 states instead of 64 and the probabilities are no longer the simple 0.5 that humans can deal with intuitively. As shown above, the heuristic, which considers each configuration in the same systematic manner, scores better relative to the control theorists as the problems become more complex. Even though the heuristic is not optimal, it is able to deal with these problems more successfully than can humans.
5. Conclusions

The paper shows how additional state variables can be defined in a dynamic programming formulation to include information-gathering considerations in the decision-making process. An example illustrates the change in control policies that results from these considerations. Although the complete state description of the system does provide a good example of the value of information, the number of possible states increases so fast that solutions are not feasible even for moderate-sized problems.

A heuristic that includes much of the viewpoint of the complete description has been devised and applied to several grid and barrier configurations. Instead of considering all possible barrier configurations, this procedure deals only with the situation actually facing the robot. As new information is obtained, the solution of the problem is repeated.

The performance of the route-finding heuristic has been compared with the performance of a group of control theorists, and it has been found to be superior even on moderate-sized problems. Based on computational experience, it appears that the improvements over human decision-makers will become even larger as the size of problems grows. Procedures motivated by the same viewpoint could be used to automate many searches now done by humans.

References

[1] R. BELLMAN: Adaptive Control Processes. Princeton University Press, Princeton, N.J. (1961).
[2] H. J. KUSHNER: Some problems and recent results in stochastic control. IEEE Conv. Record (1965).
[3] H. J. KUSHNER: Stochastic Stability and Control. Academic Press, New York (1967).
[4] W. M. WONHAM: Stochastic problems in optimal control. IEEE National Conv. Record 11 (2), 114-124 (1963).
[5] M. AOKI: Optimization of Stochastic Systems. Academic Press, New York (1967).
[6] L. MEIER, III: Combined optimum control and estimation. Proc. 3rd Ann. Allerton Conf. on Circuit and System Theory, Urbana, Ill., pp. 109-120 (October 1965).
[7] R. SUSSMAN: Optimal Control of Systems with Stochastic Disturbances. Electronics Research Lab., University of California, Berkeley, Ser. 63, No. 20 (November 1963).
[8] M. AOKI: Optimal Bayesian and Min-Max Controls of a Class of Stochastic and Adaptive Dynamic Systems. Preprints, IFAC Tokyo Symp. on Systems Engrg. for Control System Design, pp. 11-21-11-31 (1965).
[9] P. D. JOSEPH and J. T. TOU: On linear control theory. Trans. AIEE (Applications and Industry) 80, 193-196 (1961).
[10] T. L. GUNCKEL, II and G. F. FRANKLIN: A general solution for linear sampled-data control. Trans. ASME J. Bas. Engng, Ser. D 85, 197-201 (1963).
[11] R. S. RATNER and D. G. LUENBERGER: Performance-adaptive renewal policies for linear systems. IEEE Trans. Aut. Control AC-14, 344-51 (1969).
[12] A. A. FELDBAUM: Dual control theory, I. Avtom. i Telemekh. 21, 1240-1249 (1960).
[13] R. A. HOWARD: Information value theory. IEEE Trans. Syst. Sci. Cybern. SSC-2, 22-26 (1966).
[14] R. E. LARSON: State Increment Dynamic Programming. American Elsevier, New York (1968).
[15] R. A. HOWARD: Dynamic Programming and Markov Processes. John Wiley, New York (1960).
Appendix A
Derivation of equations in the computation procedure

As indicated by Steps 1-5 (Section 4), the basis of the heuristic proposed in this paper is the iterative solution of equations (21) and (22). This appendix derives these equations and explains the motivation for them.

The value of $N$ obtained in Step 2a indicates the index of the first member of $X^*_p$ to which the robot is certain that it can move when it is located at $x^*$. Because of the geometry of the situation (see Fig. 2), $X^*_p$ can have at most four members and must have at least one member. The only possible values of $N$ are 1, 2, 3, or 4. Equation (22) will be derived for $N = 3$. The reader can verify that equation (22) is true for $N = 2$ and 4 and that equation (22) reduces to equation (21) when $N = 1$.

The computational procedure is based on the assumption that if the robot were located at $x^*$, it would move to the first barrier-free member of $X^*_p$. If $N = 3$, $x_1$ and $x_2$ are not members of $X^*_a$, which means there is a non-zero probability that there are barriers on locations $x_1$ and $x_2$. Thus, the probability that the robot would move to $x_1$, given that the robot is at $x^*$ at time $n$, can be expressed as

$$P\{x(n+1) = x_1 \mid x(n) = x^*\} = 1 - p(x_1). \tag{A.1}$$
Since $x_2$ is the second most desirable next location,

$$\begin{aligned}
P\{x(n+1) = x_2 \mid x(n) = x^*\} &= P\{x(n+1) = x_2 \mid x(n) = x^*,\ x(n+1) \neq x_1\} \cdot P\{x(n+1) \neq x_1 \mid x(n) = x^*\} \\
&= p(x_1)[1 - p(x_2)].
\end{aligned} \tag{A.2}$$
Since $x_3$ is known to be unblocked and is the third choice,

$$\begin{aligned}
P\{x(n+1) = x_3 \mid x(n) = x^*\} &= P\{x(n+1) \neq x_1,\ x(n+1) \neq x_2 \mid x(n) = x^*\} \\
&= P\{x(n+1) \neq x_1 \mid x(n) = x^*\} \cdot P\{x(n+1) \neq x_2 \mid x(n) = x^*,\ x(n+1) \neq x_1\} \\
&= p(x_1)p(x_2).
\end{aligned} \tag{A.3}$$
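As a quick check on equations (A.1)-(A.3), the following Python sketch computes the move probabilities for an ordered list of candidate locations and confirms that, for N = 3, they sum to one. The helper name `move_probabilities` and the sample barrier probabilities are illustrative assumptions and do not come from the paper; the sketch also assumes that barriers at different locations are independent, as the product form of (A.2) and (A.3) suggests.

```python
from math import isclose, prod


def move_probabilities(barrier_probs):
    """Return P{robot moves to x_i} for ordered candidates x_1..x_N.

    barrier_probs[i] is p(x_{i+1}), the probability of a barrier there.
    The last candidate is assumed to be known clear, so its p must be 0.
    (Hypothetical helper, for illustration only.)
    """
    if not barrier_probs or barrier_probs[-1] != 0.0:
        raise ValueError("last candidate must be known clear (p = 0)")
    probs = []
    for i, p in enumerate(barrier_probs):
        # Probability that every more desirable candidate is blocked.
        blocked_before = prod(barrier_probs[:i])
        probs.append(blocked_before * (1.0 - p))
    return probs


# Sample values (assumed): p(x_1) = 0.5, p(x_2) = 0.25, x_3 known clear.
p1, p2 = 0.5, 0.25
probs = move_probabilities([p1, p2, 0.0])

# The three entries correspond to (A.1), (A.2), and (A.3) respectively.
assert isclose(probs[0], 1 - p1)
assert isclose(probs[1], p1 * (1 - p2))
assert isclose(probs[2], p1 * p2)
assert isclose(sum(probs), 1.0)
```

Because the three events are mutually exclusive and exhaustive, the probabilities must sum to one for any choice of p(x_1) and p(x_2); the final assertion checks this for the sample values.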
Within each iteration $k$, $J^k(x)$ is assumed to be the cost of reaching the goal from $x$. $J^{k+1}(x^*)$ is the expected cost obtained on the $(k+1)$st iteration based on the costs obtained on the $k$th iteration. Thus, from equations (A.1)-(A.3),

$$J^{k+1}(x^*) = 1 + [1 - p(x_1)]J^k(x_1) + p(x_1)[1 - p(x_2)]J^k(x_2) + p(x_1)p(x_2)J^k(x_3), \tag{A.4}$$
where "1" is the known cost of moving from $x^*$ to any member of $X^*_p$. Equation (A.4) can be rearranged as

$$J^{k+1}(x^*) = 1 + J^k(x_1) + p(x_1)[J^k(x_2) - J^k(x_1)] + p(x_1)p(x_2)[J^k(x_3) - J^k(x_2)]. \tag{A.5}$$
Equation (A.5) has exactly the same form as equation (22) for $N = 3$.
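The rearrangement from (A.4) to (A.5) can be checked numerically. The Python sketch below evaluates both forms for sample values and also includes a loop that extends the incremental pattern of (A.5) to any N. The function names, the sample costs, and the general-N loop are assumptions made for illustration; the loop follows the pattern of (A.5) and is not copied from equation (22).

```python
from math import isclose


def cost_update_a4(p1, p2, j1, j2, j3):
    """Equation (A.4): expectation over the three possible moves."""
    return 1 + (1 - p1) * j1 + p1 * (1 - p2) * j2 + p1 * p2 * j3


def cost_update_a5(p1, p2, j1, j2, j3):
    """Equation (A.5): the same quantity in incremental form."""
    return 1 + j1 + p1 * (j2 - j1) + p1 * p2 * (j3 - j2)


def cost_update(barrier_probs, costs):
    """Incremental form for N candidates (hypothetical generalization).

    costs[i] is J^k(x_{i+1}); barrier_probs[i] is p(x_{i+1}) for the
    first N - 1 candidates (the N-th is known clear and needs no entry).
    """
    total = 1 + costs[0]
    blocked = 1.0  # probability that all better candidates are blocked
    for i in range(1, len(costs)):
        blocked *= barrier_probs[i - 1]
        total += blocked * (costs[i] - costs[i - 1])
    return total


# Sample values (assumed): barrier probabilities and iteration-k costs.
p1, p2 = 0.5, 0.25
j1, j2, j3 = 4.0, 6.0, 10.0

a4 = cost_update_a4(p1, p2, j1, j2, j3)
a5 = cost_update_a5(p1, p2, j1, j2, j3)
general = cost_update([p1, p2], [j1, j2, j3])

assert isclose(a4, a5)
assert isclose(a5, general)
```

With N = 1 the loop body never runs and the update is simply 1 + J^k(x_1), the cost of one certain move plus the cost-to-go from the first candidate.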
Résumé--This article discusses the problem of controlling the movements of a robot in a partially unknown environment. The problem can be formulated as a stochastic control problem with some state variables related to the physical dynamics of the robot and other state variables related to the information the robot has obtained about the environment. The article first describes a dynamic programming procedure that calculates the optimal policy for the robot. It then presents a heuristic method capable of solving much larger problems. The performance of the heuristic method is shown to compare favorably with that of humans in complex examples.
Zusammenfassung--The paper deals with the problem of controlling the motion of a robot in a partially unknown environment. The problem can be formulated as a stochastic control problem with some state variables that relate to the physical dynamics of the robot or to the information the robot has obtained. First, the dynamic programming procedure that calculates the optimal strategy for the robot is described. Then a heuristic method that allows larger problems to be solved is presented. It is shown that the performance of the heuristic method compares favorably with that of humans in complex examples.

Резюме--The present article discusses the problem of controlling the movements of a robot in a partially unknown environment. The problem can be formulated as a stochastic control problem with some state variables related to the physical dynamics of the robot and other state variables related to the information the robot has obtained about the environment. The article first describes a dynamic programming method that computes the optimal policy for the robot. It then proposes a heuristic method capable of solving considerably broader problems. It is shown that the results of the heuristic method compare satisfactorily with human behavior in complex examples.