Learning legal moves in planning problems: A connectionist approach to examining legal moves in the tower-of-hanoi


Engng Applic. Artif. Intell. Vol. 5, No. 3, pp. 239-245, 1992. Printed in Great Britain. All rights reserved.

0952-1976/92 $5.00 + 0.00. Copyright © 1992 Pergamon Press Ltd

Contributed Paper

Learning Legal Moves in Planning Problems: a Connectionist Approach to Examining Legal Moves in the Tower-of-Hanoi

ANDREW SOHN, New Jersey Institute of Technology, Newark

JEAN-LUC GAUDIOT, University of Southern California, Los Angeles

While optimizing scheduling problems such as the Traveling Salesman Problem is relatively easy for neural networks, solving planning problems of artificial intelligence such as the Tower-of-Hanoi (ToH) has been known to be much more difficult. In this paper, the differences between scheduling and planning problems are identified from the neural network perspective. This analysis is the basis of an approach to solving planning problems with learning capabilities. In particular, the ToH is chosen as the target problem, and a set of constraints derived from the ToH is formulated, based on the representation outlined in this paper. The system is designed to learn to generate legal moves by generating random illegal states and by measuring their legality. The approach described in this paper would establish a homogeneous structure which could be applied to planning problems which involve legality learning.

Keywords: Neural networks, planning problems, temporal credit assignment, legality learning, constraint satisfaction.

INTRODUCTION

Neural networks have been shown to be effective in optimizing scheduling problems. The optimization problems encountered most often in AI applications include the Traveling Salesman Problem (TSP), the n-Queen Puzzle, the Polyomino Puzzle, etc. The TSP has been successfully solved by Hopfield,1 and the solution has been further optimized by using self-organizing feature maps.2 Another typical AI optimization problem, the n-Queen Puzzle, has been solved as shown in Refs 3 and 4, based on a formulation similar to that used in Ref. 1.

Can neural networks then possibly solve problems such as the Tower-of-Hanoi (ToH)? It appears that the ToH could be easily solved by a formulation similar to that used in optimization problems. However, a closer examination of the problem reveals that there is substantial difficulty in solving it with neural networks. Indeed, the ToH is a planning problem, which requires a solution in the form of a temporal sequence of steps. The main difference between the TSP and the ToH is that the TSP is given information related to a sequence a priori in the problem description, and gives the intermediate steps no consideration (because it requires only a goal state). The ToH, by contrast, is not given a sequence of intermediate steps in the problem description but is instead required to find one; hence it is a planning problem.

The planning problem of the ToH may be viewed as a credit-assignment problem. This was explored in Samuel's Checkers5 and Barto's Associative Search Network (ASN).6 However, the ToH displays another important difference from Checkers and the ASN: each step of the sequence of steps in the ToH requires checking of its legality, whereas Checkers and the ASN do not (of course, Checkers looks ahead for a good move at each step). Every step in the ToH must be a legal move as opposed to an illegal move, apart from being a good or a bad move. Many conventional approaches to the ToH problem are not concerned with checking the legality of each move; they take legal moves for granted and look for a good or an optimal move as they run through the search space. This work, however, is concerned solely with learning legal moves in the ToH.

Correspondence should be sent to: Professor J.-L. Gaudiot, Department of Electrical Engineering Systems, University of Southern California, University Park, Los Angeles, CA 90089-2563, U.S.A.


Fig. 1. The Tower of Hanoi problem with 3 disks. (a) Initial state and (b) goal state.

After the system learns to generate the legal moves, it is believed that, by using a formulation similar to legality learning, the system can be further equipped with a learning capability which distinguishes between good and bad moves.

This paper explores a neural network approach to finding a solution to the planning problem, especially the ToH. First, taking the above two crucial differences into account, a model to represent the ToH states in a neural network is developed. A ToH of 3 disks can be represented by an array of nine neurons, each of which is fully connected to all the others. A set of constraints can then be derived from the problem domain, based on this representation. Each constraint is used to help the learning process eventually generate legal moves in the ToH. The system operates on two spaces: the planning sequence space and the legality learning space. The planning sequence space consists of an actual sequence of legal planning steps, whereas the legality learning space consists of illegal states. In an attempt to find a sequence of legal states, the system moves back and forth between the two spaces. Learning to select more legal moves takes place in the legality learning space by generating illegal states. A legality measure is used to guide the system in its search for legal states. Simulation results show that the system moves in a direction in which it learns to generate legal moves.

This paper begins by describing a model, as well as an overall system architecture, to represent the ToH in neural networks. Next, a set of domain-specific constraints is derived from the ToH and formulated in equations conforming to the system requirements. A learning algorithm is then presented which enables the system to learn to generate legal moves. Simulation results and limitations of the approach are given.


Possible solutions to the limitations mentioned are briefly presented, along with a discussion of future research. The paper concludes with a summary.

A SYSTEM ARCHITECTURE

First the ToH will be described, followed by a description of its representation in neural networks. Based on this representation, we shall describe the system architecture.

The Tower-of-Hanoi problem with three disks can be stated as follows: There are three pegs, 1, 2 and 3, and three disks of different sizes, A, B and C, as shown in Fig. 1. The disks can be stacked on the pegs. Initially the disks are all on peg 1, as shown in Fig. 1(a); the largest, disk C, is on the bottom, while the smallest, disk A, is on top. It is desired to transfer all of the disks to peg 3, as shown in Fig. 1(b), by moving one disk at a time. Only the top disk on a peg can be moved, and it can never be placed on top of a smaller disk.

The network we use for this problem is arranged as an array of 3 × 3 neurons n_ij, i, j = 1, ..., 3. Let D_i and P_j, i, j = 1, ..., 3, denote a disk and a peg, respectively. Each row i represents a disk D_i whereas each column j represents a peg P_j. There are 81 connections between the nine neurons, i.e. all neurons are fully connected. Let w_ij;kl, i, j, k, l = 1, ..., 3, denote the connection strength between n_ij and n_kl, where w is a real value greater than or equal to 0. A particular state of the search space is then represented as an array of 3 × 3 neurons, as depicted in Fig. 2. An output value n_ij = 1 indicates that disk D_i is on peg P_j.

Fig. 2. A network of 3 × 3 neurons to represent the ToH. Each row represents a disk i whereas each column represents a peg j. All neurons are fully connected, i.e. 81 connections between 9 neurons. (a) Initial state s0 with n11 = n21 = n31 = 1. (b) Goal state sg with n13 = n23 = n33 = 1.
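To make the representation concrete, the 3 × 3 state array can be sketched in code. This is an illustrative sketch, not from the paper; the helper `make_state` and its 0-based peg indices are our own conventions.

```python
# Illustrative sketch of the 3x3 state representation.
# Row i = disk D_i (smallest first), column j = peg P_j.

def make_state(pegs):
    """pegs[i] gives the 0-based peg index currently holding disk i."""
    state = [[0, 0, 0] for _ in range(3)]
    for disk, peg in enumerate(pegs):
        state[disk][peg] = 1  # neuron n_ij fires: disk i sits on peg j
    return state

s0 = make_state([0, 0, 0])  # initial state of Fig. 2(a): all disks on peg 1
sg = make_state([2, 2, 2])  # goal state of Fig. 2(b): all disks on peg 3

print(s0)  # [[1, 0, 0], [1, 0, 0], [1, 0, 0]]
print(sg)  # [[0, 0, 1], [0, 0, 1], [0, 0, 1]]
```

Each row of the array sums to 1 in any valid state, which is exactly the first constraint formulated below.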


Fig. 3. The system architecture, consisting of the network of 3 × 3 neurons and the legality learner. The legality learner learns to generate legal moves.

The initial state (all the disks D1, D2 and D3 on P1) is represented by turning on the three neurons n_i1 = 1, i = 1, ..., 3. Figure 2(a) depicts the initial state with n11 = n21 = n31 = 1 and all others = 0. The goal state is similarly represented by n_i3 = 1, i = 1, ..., 3, i.e. n13 = n23 = n33 = 1 and all others = 0, as shown in Fig. 2(b). The color of each neuron is for presentation purposes only and bears no meaning.

Figure 3 depicts the system architecture, consisting of the network and the legality learner. The network consists of nine neurons arranged in a 3 × 3 array, as discussed above. Let x_ij(t), i, j = 1, ..., 3, denote the input signal at time t to each neuron n_ij of the network. Let p_ij(t), i, j = 1, ..., 3, be the membrane potential of n_ij at time t, i.e. p_ij(t) = Σ_kl w_ij;kl x_kl(t). Given an input x(t), the network generates an output y(t): if p_ij(t) > θ, then y_ij(t) = 1; otherwise, y_ij(t) = 0.

The second module, the legality learner, determines the legality measure l(t) of the current move and updates the weights accordingly. The legality l(t) is a function of the current input x(t), the current output y(t), and three constraints c1, c2 and c3. The first constraint, c1, requires that there not be two disks of the same size at any time, i.e. that each disk appear on exactly one peg. The second constraint, c2, states that only one disk can move at a time. The third constraint, c3, states that a larger disk cannot be placed on top of a smaller one. If the legality measure l(t) ≤ 0, the current move is legal; otherwise, it is illegal.

The system learns legal moves through the legality measure l(t) by changing the connection weights. The learning rule can be briefly stated as follows: when a legal move occurs, the weights are strengthened such that the network will be more likely to generate legal moves in the future. When an illegal move is generated, the weights are weakened such that the network will be less likely to generate illegal moves in the future. We shall come back to this learning procedure shortly. In the meantime, the three constraints are formulated below.
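The network dynamics just described can be sketched as follows. This is a minimal illustration under our own naming; the weight layout `w[i][j][k][l]` and the particular threshold value are assumptions consistent with the text.

```python
import random

N = 3        # a 3 x 3 array of neurons
THETA = 0.5  # firing threshold (the simulations use values between 0.4 and 0.6)

def forward(x, w):
    """p_ij = sum_kl w_ij;kl * x_kl; y_ij = 1 if p_ij > THETA, else 0."""
    y = [[0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            p = sum(w[i][j][k][l] * x[k][l]
                    for k in range(N) for l in range(N))
            y[i][j] = 1 if p > THETA else 0
    return y

# Weights are initialized to random reals in [0, 1], as in the simulations.
w = [[[[random.random() for _ in range(N)] for _ in range(N)]
      for _ in range(N)] for _ in range(N)]
x0 = [[1, 0, 0], [1, 0, 0], [1, 0, 0]]  # initial state
y0 = forward(x0, w)                      # a candidate (possibly illegal) next state
```

The output y0 is only a candidate state; whether it is accepted depends on the legality measure developed next.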

CONSTRAINTS FOR LEGALITY MEASURE

One of the two important differences between the ToH and the TSP is the legality attached to each move. By identifying and formulating constraints, the system can reach the next state from the current state through a legal move. The first constraint c1 requires that there not be two disks of the same size at any time t, i.e. that each disk appear on exactly one peg. Based on our representation, it can be formulated from y(t) as follows:

c1 = Σ_{i=1}^{n} | Σ_{j=1}^{n} y_ij(t) - 1 |,

where n is the total number of disks. Each row i must contain one and only one fired neuron in each step, so that only three neurons fire in total. If c1 = 0, the constraint is satisfied; otherwise, it is violated. Figure 4(a) illustrates an output y(t) which satisfies the constraint c1, whereas Fig. 4(b) illustrates a violation of the constraint.

The second constraint c2 requires that only one disk move at a time. Observing that the number of fired neurons displaced from the input x(t) to the output y(t) must be two, this constraint is formulated as follows:

c2 = | Σ_{i=1}^{n} Σ_{j=1}^{n} | x_ij(t) - y_ij(t) | - 2 |.

Fig. 4. (a) Constraint c1 satisfied, (b) constraint c1 violated, (c) constraint c2 satisfied, and (d) constraint c2 violated. x(t) is an input to and y(t) is an output from the network.
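The first two constraints can be computed directly from the state arrays. The summation forms below are our reconstruction of the typographically damaged originals, matching the stated semantics: c1 penalizes rows without exactly one fired neuron, and c2 penalizes moves that change any number of neurons other than two.

```python
def c1(y):
    """Zero iff every row (disk) has exactly one fired neuron (peg)."""
    return sum(abs(sum(row) - 1) for row in y)

def c2(x, y):
    """Zero iff exactly two neurons differ between input x and output y,
    i.e. exactly one disk has moved."""
    changed = sum(abs(x[i][j] - y[i][j])
                  for i in range(3) for j in range(3))
    return abs(changed - 2)

x = [[1, 0, 0], [1, 0, 0], [1, 0, 0]]  # all disks on peg 1
y = [[0, 1, 0], [1, 0, 0], [1, 0, 0]]  # disk A moved to peg 2
print(c1(y), c2(x, y))  # 0 0
```

Both measures are zero for the legal move shown, as in Figs 4(a) and 4(c).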



Fig. 5. A part of the search graph for the ToH problem. There are six states which satisfy the first two constraints c1 and c2. Three states, s1, s2 and s3, satisfy the third constraint c3, whereas s4, s5 and s6 violate it.

When c2 = 0, the constraint is satisfied and the move is legal. Otherwise, the constraint c2 is violated. Figures 4(c) and (d) illustrate respectively a success and a failure of the constraint c2.

The third constraint c3 requires that a larger disk not be placed on top of a smaller one. To illustrate the constraint, consider the part of the search space for the ToH depicted in Fig. 5. Imposing the two constraints c1 and c2, the system would come up with the six states s_i, i = 1, ..., 6. Owing to the third constraint c3, there are only three states, s1, s2 and s3, to which s0 can expand by a legal move. After a careful and close examination of the above legal and illegal states, the constraint c3 for the column j in which a disk movement takes place can be formulated as follows:

c3 = d_j(t) - a_j(t),

d_j(t) = | Σ_{i=1}^{n} i x_ij(t) - Σ_{i=1}^{n} i y_ij(t) |,

a_j(t) = max( Σ_{i=1}^{n} i x_ij(t), Σ_{i=1}^{n} i y_ij(t) ) / m,

m = max( Σ_{i=1}^{n} x_ij(t), Σ_{i=1}^{n} y_ij(t) ),

where n is the total number of disks. If c3 ≤ 0, the move is legal. Otherwise, it is illegal. The basic idea behind this legality measure stems from the fact that d, the difference between the two states, must not be greater than a, the average disk index of the fuller column.

Consider a legal move, s0 -> s2, of Fig. 5 for example. We first observe that j = 2, since column 2 is involved in the movement. We further observe that m = 2, since column 2 of x(t) has one disk, column 2 of y(t) has two disks, and max(1, 2) is 2. With j = 2 and m = 2, we find

d2 = | (1 × 1 + 2 × 1 + 3 × 0) - (1 × 0 + 2 × 1 + 3 × 0) | = | 3 - 2 | = 1,
a2 = max{3, 2}/2 = 1.5.

We obtain c3 = d2 - a2 = 1 - 1.5 = -0.5 < 0; hence the move s0 -> s2 is legal. Consider another move, s0 -> s5, which is an illegal move. We find

d2 = | (1 × 0 + 2 × 1 + 3 × 1) - (1 × 0 + 2 × 1 + 3 × 0) | = | 5 - 2 | = 3,
a2 = max{5, 2}/2 = 2.5.

We obtain c3 = d2 - a2 = 3 - 2.5 = +0.5 > 0; hence the move s0 -> s5 is illegal. The main advantage of this legality measure is that it can be used for the ToH with n disks. Readers can easily verify the legality measure for any number of disks and configurations.

Putting together the three constraints c1, c2 and c3 discussed above, the complete formulation of the legality measure is now determined as follows:

l(x, y) = A c1 + B c2 + C c3,

where A, B and C are constants which control the significance of each constraint. If l(t) ≤ 0, the move is legal. Otherwise, it is illegal. The first constraint c1, which forces the network to fire only three neurons, is the most important of the three, and A is therefore weighted most heavily. At the present time, the order of importance is kept as A ≥ B ≥ C, to reflect the relative significance correctly and to avoid a possible offset. In the following section, we shall describe the legality measure in a form appropriate to the learning procedure.
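The third constraint can likewise be sketched in code. Because the printed equations are damaged, the formulas below are our reconstruction; they reproduce the two worked examples above (c3 = -0.5 for the legal move, +0.5 for the illegal one). The concrete disk positions chosen for s0 are our own assumption, consistent with the column values in the worked examples.

```python
def c3(x, y, j):
    """Legality heuristic for the (0-based) column j where a disk moved."""
    xw = sum((i + 1) * x[i][j] for i in range(3))  # disk-weighted sum, input column
    yw = sum((i + 1) * y[i][j] for i in range(3))  # disk-weighted sum, output column
    m = max(sum(x[i][j] for i in range(3)),
            sum(y[i][j] for i in range(3)))        # larger disk count in column j
    d = abs(xw - yw)       # index of the moved disk
    a = max(xw, yw) / m    # average disk index of the fuller column
    return d - a           # <= 0: legal move; > 0: illegal move

# An s0 with disk B alone on peg 2 (our assumption, matching the example).
x       = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
y_legal = [[0, 1, 0], [0, 1, 0], [1, 0, 0]]  # disk A placed on disk B: legal
y_bad   = [[0, 0, 1], [0, 1, 0], [0, 1, 0]]  # disk C placed on disk B: illegal
print(c3(x, y_legal, 1), c3(x, y_bad, 1))  # -0.5 0.5
```

Combining this with the two earlier constraints as l = A c1 + B c2 + C c3 yields the complete legality measure.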

THE LEARNING RULE

Learning the legality of moves is performed by randomly generating illegal states. Figure 6 illustrates the two spaces involved in legality learning: the planning sequence space and the legality learning space. The planning sequence space consists of a sequence of legal states, each of which is a step in an actual plan. The legality learning space consists of sequences of illegal states, each of which is generated to help the system learn legal moves. The states s_{i,0}, i = 0, ..., g, are legal states whereas the states s_{i,j}, i = 0, ..., g - 1, j ≠ 0, are illegal states.

Initially, the system is in the planning sequence space. It enters the learning space by generating an illegal state. While the system generates illegal states, learning of legal moves takes place through modification of the connection weights. The generation of illegal states is guided by the legality measure l(*). Each time an illegal state is generated, the system modifies the weights such that it moves in a direction in which l(*) decreases over time. As soon as it hits a legal move, the system exits the legality learning space and returns to the planning sequence space.

The transition from an illegal state in the legality learning space to a legal state in the planning sequence space is detected by measuring the legality l(*). The legality has been defined such that when l(*) ≤ 0, the system is in a legal state. Therefore, the system modifies the weights in the legality learning space such that l(*) eventually decreases to zero. Suppose that the system is in the legal state s_{i,0} of Fig. 6. It will generate a next state s_{i,1}, which may or may not be a legal state. If s_{i,1} is a legal state, the sequence of illegal states in the legality learning space terminates, thereby reaching s_{i+1,0}, which is in the planning sequence space. If s_{i,1} is an illegal state, learning takes place by modifying the weights in a direction in which the legality measure l(*) decreases over time. The system thus gradually learns in the legality learning space by generating illegal states. The complete sequence of learning is described as follows:

Procedure Legality_Learning at time (i, k)
1. Generate a temporary output, y = W s_{i,k}.
2. Compute the change in legality, Δl = l(s_{i,0}, y) - l(s_{i,0}, s_{i,k}).
3. If Δl < 0, s_{i,k+1} = y. Otherwise, s_{i,k+1} = s_{i,k}.
4. Update the weights according to the following rule:
w_ij;kl = w_ij;kl + α Δl x_kl y_ij,
where x is the input state s_{i,k} at time (i, k) and α is a learning rate (see below).


Fig. 6. The states in boxes represent an actual planning sequence, whereas those in circles indicate illegal states. In the legality learning space, learning of legal moves takes place in a direction in which the legality decreases, i.e. l(s_{i,k-1}, s_{i,k}) > l(s_{i,k}, s_{i,k+1}).


5. Normalize the weights such that Σ_row w = N, where N is a constant.
6. If l(s_{i,0}, s_{i,k}) ≤ 0, a legal state s_{i+1,0} has been found; go to step 1 with time (i + 1, 0). Otherwise, go to step 1 with time (i, k + 1).

Note in step 4 that the indices i, j, k, l for w, x and y bear no relationship to the indices i, k for s: the former denote the neurons' positions in the array, whereas the latter denote time steps. The amount by which a weight w_ij;kl is modified depends on the significance of the legality change for each move. If both the input x_kl of the state s_{i,k} and the output y_ij fire and the legality moves closer to a legal state, the connection w_ij;kl is strengthened such that it will be more likely to produce a legal state next time. If both fire and the legality moves farther away from a legal state, the connection is weakened such that it will be less likely to produce an illegal move next time.
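Under the same assumptions as the earlier sketches, one pass of Procedure Legality_Learning can be written out as follows. The single-argument `legality` stand-in is a simplification of the paper's two-argument l(s_{i,0}, *); in the paper it is the weighted sum A c1 + B c2 + C c3.

```python
N, ALPHA, THETA = 3, 1.0, 0.5  # array size, learning rate, firing threshold

def legality_step(w, s, legality):
    """One pass of Procedure Legality_Learning (steps 1-5); `legality` is a
    caller-supplied stand-in measuring how far a state is from legal."""
    # 1. Generate a temporary output y = W s through the thresholded network.
    y = [[0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            p = sum(w[i][j][k][l] * s[k][l] for k in range(N) for l in range(N))
            y[i][j] = 1 if p > THETA else 0
    # 2-3. Accept y as the next state only if the legality measure decreased.
    dl = legality(y) - legality(s)
    s_next = y if dl < 0 else s
    # 4. Weight update: w_ij;kl += alpha * dl * x_kl * y_ij.
    for i in range(N):
        for j in range(N):
            for k in range(N):
                for l in range(N):
                    w[i][j][k][l] += ALPHA * dl * s[k][l] * y[i][j]
    # 5. Normalize so that each neuron's incoming weights sum to a constant.
    for i in range(N):
        for j in range(N):
            tot = sum(w[i][j][k][l] for k in range(N) for l in range(N))
            if tot > 0:
                for k in range(N):
                    for l in range(N):
                        w[i][j][k][l] /= tot
    return s_next  # step 6, the legality test l <= 0, is left to the caller
```

The caller repeats this step until the legality test of step 6 succeeds, at which point the system returns to the planning sequence space.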

SIMULATIONS AND DISCUSSIONS

Simulations have been carried out to demonstrate the proposed approach to learning legal moves in the ToH. The three constants of the legality function l(*, *) have been set as follows: A = 3, B = 2, C = 1. The normalization constant N has been set to 1. The learning rate α is initially set to 1 and is then varied between 1 and 2. The threshold θ is set between 0.4 and 0.6. Each weight is initialized to a random real value between 0 and 1.

For the initial state shown in Fig. 1(a), the system took on average 30-40 iterations to generate a legal state. We have not tried all the possible combinations of legal moves in the ToH (a combination of legal moves refers to a move from one legal state to another, and there are 27 legal states in the ToH).7 However, among the dozen legal moves we have tried, we observed that, from one legal state to any other legal state, the system took on average 20-40 iterations for different parameter settings. We also observed that, as the system gradually learns, the number of iterations needed to generate a legal state decreases correspondingly. The number of iterations taken by the system should not be compared with the number of states explored by a conventional AI approach. As indicated above, the ToH has 27 legal states, and a conventional AI approach explores the legal states of the search space using whatever search strategy it has elected to use. Our approach, however, includes both the legal states and the illegal states in learning legal moves.

We had much difficulty in carrying out the simulations, for the following reasons. First, the number of neurons used in this particular simulation is rather small compared to the tens or hundreds of neurons used in Ref. 8. This gives a weight space of dimension 81,

which is small considering the ratio of the number of legal states to the number of illegal states in the ToH: there are only 27 legal states, whereas there are infinitely many weight combinations which generate illegal states. Second, although the three constraints used in this paper are indispensable, they did not reduce the solution space sufficiently. A possible solution to the first problem would be to increase the number of neurons by several orders of magnitude, so that the system has a bigger weight space. The second problem could be resolved by introducing an additional layer to detect legality at the hardware level. We have reported this kind of hierarchical representation in the domain of production systems.9,10

In this paper, we have limited ourselves to legality learning for the planning problem of the Tower of Hanoi. The legality learning has been accomplished by generating illegal states in the legality learning space. The next step, which we are currently undertaking, is to make the system learn to distinguish between good and bad moves in the planning sequence space. To do this, we consider a domain-specific evaluation function defined as follows:

z(s_{i,0}) = Σ_{i=1}^{n} Σ_{j=1}^{n} (i + j - 1) y_ij(t).
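As a check on this evaluation function (reconstructed here as a double sum over rows and columns), note that each fired neuron contributes i + j - 1, so z grows as disks move toward peg 3 and peaks at the goal state:

```python
def z(y, n=3):
    """Payoff measure: sum of (i + j - 1) over fired neurons, 1-based indices.
    Maximal when every disk sits on peg 3 (the goal state)."""
    return sum((i + j - 1) * y[i - 1][j - 1]
               for i in range(1, n + 1) for j in range(1, n + 1))

s0 = [[1, 0, 0], [1, 0, 0], [1, 0, 0]]  # all disks on peg 1
sg = [[0, 0, 1], [0, 0, 1], [0, 0, 1]]  # all disks on peg 3
print(z(s0), z(sg))  # 6 12
```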

The above evaluation function z(*) provides, in the planning sequence space, a measure of the current position relative to the goal state; it is at a maximum when the goal state is reached, i.e. when all three disks A, B and C are placed on peg 3. A learning rule for good or bad moves would have a formulation similar to that of legality learning, and may be informally described as follows:

w_ij;kl = w_ij;kl + β Δz y_kl x_ij,

where β is the learning rate for good moves. The last step, which is the ultimate goal of this study, is to put the two learning capabilities together such that the system finds a legal state which is also a good move. Again, an informal description would have a formulation similar to the following:

w_ij;kl = w_ij;kl + γ f(Δl, Δz) y_kl x_ij,

where f(*, *) is a function of the legality measure and the payoff measure, and γ is a learning rate.

CONCLUSIONS

Solving a planning problem such as the ToH with neural networks is difficult because it involves a temporal sequence. Two crucial differences between a

scheduling problem such as the TSP and a planning problem such as the ToH have been clearly identified in this paper: (1) the planning problem requires a solution in the form of a sequence of intermediate steps, and (2) each step in the sequence must be a legal one. Taking these two differences into account, this paper has presented a system equipped with the capability to learn to generate legal moves. An array of nine neurons was used to represent the ToH. Three constraints were derived from the ToH, based on which the legality measure was formulated. The legality measure enabled the system to detect legal and illegal moves. The learning algorithm formulated in this paper pushed the system in a direction in which learning of legal moves takes place. By generating illegal states in the legality learning space, the system learns legal moves, thereby finding a legal state which is an actual step in the planning sequence space.

Simulations showed that the system was able to learn to generate legal moves. Although the search tree of 27 legal states in the ToH was not fully explored, the system moved from one legal state to another while learning was taking place. The next step, which we are currently undertaking, is to give the system the ability to distinguish between good and bad moves towards an optimal solution of the ToH. The approach described in this paper would provide a framework which could be applied to different AI planning and production systems.

REFERENCES

1. Hopfield J. and Tank D. Neural computation of decisions in optimizing problems. Biol. Cybernetics 52, 141-151 (1985).
2. Angeniol B., de La Croix Vaubois G. and Le Texier J.-Y. Self-organizing feature maps for the traveling salesman problem. Neural Networks 1, 289-293 (1988).
3. Akiyama Y., Yamashita A., Kajiura M. and Aiso H. Combinatorial optimization with Gaussian machines. Proc. IJCNN Int. Joint Conf. Neural Networks, pp. I-533-540 (1989).
4. Kajiura M., Akiyama Y. and Anzai Y. Solving large scale puzzles with neural networks. Proc. Int. Workshop on Tools for Artificial Intelligence, pp. 562-569 (1989).
5. Samuel A. Some studies in machine learning using the game of checkers. IBM J. Res. Dev. 3, 211-229 (1959).
6. Barto A. and Sutton R. Landmark learning: an illustration of associative search. Biol. Cybernetics 42, 1-8 (1981).
7. Nilsson N. Problem Solving Methods in Artificial Intelligence. McGraw-Hill, New York (1971).
8. Anderson C. W. Tower of Hanoi with connectionist networks: learning new features. Proc. Annual Conf. of the Cognitive Science Society, pp. 345-349.
9. Sohn A. and Gaudiot J.-L. Connectionist production systems in local representation. Proc. Int. Joint Conf. Neural Networks (1990).
10. Sohn A. and Gaudiot J.-L. Representation and processing of production systems in connectionist architectures. Int. J. Pattern Recognition Artif. Intell. 4 (1990).