
Neural Networks. Vol. 3, pp. 179-189, 1990 Printed in the USA. All rights reserved.

0893-6080/90 $3.00 + .00 1990 Pergamon Press plc

ORIGINAL CONTRIBUTION

Consistency of HDP Applied to a Simple Reinforcement Learning Problem

PAUL J. WERBOS
National Science Foundation

(Received 27 March 1989; revised and accepted 11 August 1989)

Abstract--In "reinforcement learning over time," a system of neural networks learns to control a system of motors or muscles so as to maximize some measure of performance or reinforcement in the future. Two architectures or designs are now widely used to address this problem in an engineering context: backpropagation through time and the adaptive critic family. This article begins with a brief review of these and other neurocontrol methods and their applications. Then it addresses the issue of consistency in using Heuristic Dynamic Programming (HDP), a procedure for adapting a "critic" neural network, closely related to Sutton's method of temporal differences. In a multivariate linear environment, HDP converges to the correct system of weights. However, a variant of HDP--which appeals to common sense and which uses backpropagation with a complete gradient--leads to the wrong weights almost always. Similar consistency tests may be useful in evaluating architectures for neural nets to identify or emulate dynamic systems.

Keywords--Reinforcement learning, HDP, Dynamic programming, Neurocontrol, Adaptive critics, Consistency, Optimization, Convergence.

INTRODUCTION

Many people in many fields are now familiar with the large potential of artificial neural networks (ANNs) in applications like pattern recognition and associative memory. In the past year or two, there has been growing interest in a different kind of application--"neurocontrol"--in which we design ANNs to directly control motors or muscles or some other kind of overt physical action to achieve physical results in an external environment. Applications are being studied or commercialized in areas as diverse as robotics, chemical plant control, electric utility control, avionics, aid to the disabled, and economic modeling, among others (Werbos, 1989b). The human brain itself--as a whole system--inputs sensor information (some of which provides reinforcement) and outputs signals to control muscles. The biological function of the brain is to control these muscles so as to achieve physical results over time; in other words, the brain itself is a neurocontrol system. Before we can understand the brain as a whole system, even to a first-order approximation, we must first understand how it is possible to build large-scale neurocontrol systems with generalized capabilities. Phenomena like pattern recognition and memory may be important parts of such a system, but they are subordinate to the overall function of the system.

This article describes some properties of Heuristic Dynamic Programming (HDP), which is one of the adaptation methods developed for the "adaptive critic" family of neurocontrol systems. Adaptive critic designs--as I will define them--basically include all neural net designs capable of optimization over time under conditions of noise or uncertainty. HDP may be seen as a generalization of the method of temporal differences used by Barto, Sutton, and Anderson (1983), the adaptive critic method that has seen the widest degree of engineering application to date. More complex methods have been proposed in the past by myself (1977, 1982), by Klopf (1982), and by Grossberg and Levine (1987); however, a better understanding of HDP will hopefully be a useful step along the way to better understanding and application of those methods as well. This article begins with a brief review of neurocontrol, reinforcement learning, and adaptive critics. Next, it describes a simplified form of HDP and briefly relates it to Sutton's method and other ex-

The views expressed here are those of the author, and do not reflect the official views of NSF or any of its components. Requests for reprints should be sent to Paul J. Werbos, Room 1151, NSF, Washington, DC 20550.

tensions. Then it summarizes new results regarding the "consistency" of HDP--the ability of HDP to converge to the right answer in the asymptotic limit. It describes a simple reinforcement learning problem that these conclusions are based on; it demonstrates the conclusions by calculating the correct weights for that problem and then comparing them against the weights that come from HDP and a variant of HDP. The variant of HDP based on backpropagation and a complete gradient turns out to be inconsistent almost always. The article concludes with an interpretation of this inconsistency, and ideas that may help in finding a faster learning method in the future.

NEUROCONTROL, REINFORCEMENT LEARNING, AND ADAPTIVE CRITICS

The literature on neurocontrol already includes at least 60 to 70 published applications-oriented papers; therefore, the review here discusses only those basic concepts that are most essential to an appreciation of HDP. My first encounter with the term "neurocontrol" was in a tutorial presented by Allan Guez of Drexel at the 1988 IEEE Conference on Intelligent Control. Based on Guez's review, and my own less systematic search of the literature, I would argue that the existing work can be reduced to five basic design strategies (Werbos, 1989b):
• Supervised control, in which a neural network learns to imitate humans or other controllers which already know how to perform a task;
• Direct inverse control, in which we try to map the coordinates of a robot arm (usually x1, x2, x3) or of a similar device back to the actuator signals (usually the joint angles θ1, θ2, θ3) which would send the arm to the desired coordinates;
• Neural adaptive control, in which neural nets are used to replace the (linear) mappings normally used in Self-Tuning Regulators, Model-Reference Adaptive Control, and other such homeostatic control systems used in conventional control theory;
• Backpropagation through time (BTT), as a control method (other uses of BTT will receive little attention here);
• Adaptive critic methods.
All of these methods have important possible applications; however, only the last two methods address the problem of planning or optimization over time. Only the last two address the challenge of designing systems that respond to reinforcement signals or performance measures so as to improve performance in a way that accounts for the effect of present actions on future situations. In fact, Jordan (1989) and Ka-

P. J. Werbos wato (1990) have shown that direct inverse control has limitations even in controlling the position of a robot arm, if one intends to minimize some measure of energy cost and fully exploit redundant degrees of freedom (something that Kawato, 1990, argues that humans do, according to recent experiments in Japan). BTT and adaptive critic methods both address the problem of reinforcement learning over time, which may be defined as follows. Suppose that we are asked to design a system of neural networks that outputs a vector, u(t), at every time t, which directly controls motors or muscles, which affect the external environment. The vector u(t) is simply a set of numbers, u~(t) through u,,(t), indicating the outputs of output neurons numbered from 1 to m. The state of the external environment, R(t), is available as an input to the network, from input lines labelled R~(t) through Rn(t). Our system is asked to learn how to maximize some measure of performance. (~. which may be available as one of the inputs (like R~) or, more generally, as a function of R, U(R). (Psychologists and economists have tended to stress the former case. but Werbos, 1990, shows how the former is a special case of the latter and argues that a more general formulation is important both for biology and engineering. Klopf's notion of drive reinforcement learning is based on similar arguments.) In either case. the goal is to maximize U in the long term, over all future times t, and to account for the impact of current actions u(t) on future states of the environment R(C). Sutton (1984) has traced the notion of reinforcement learning back to Minsky, in the early 1960S or earlier. Werbos (1968) included a lengthy discussion of the problem, and its relevance to understanding human intelligence, along with the intuition that later led to the first rigorous formulation of backpropagation in Werbos (1974, 1982). 
Werbos (1988a, 1988b) discussed the problem briefly, using more modern language, and stressed the importance of the time dimension. Williams (1988) discussed a very different version of reinforcement learning, without the time dimension, but Williams (1990) addresses the time dimension at length. BTT is simpler and older than the adaptive critic family of neural net methods. Werbos (1974) first proposed the method, and, in Chapter II (Section ix), gave a detailed paper-and-pencil example of the approach. By 1988, there were at least four worked-out computer implementations of BTT, given in Werbos (1989a), Nguyen and Widrow (1990), Jordan (1989), and Kawato (1990). Jordan (1989) and Kawato (1990) applied BTT to the problem of "inverse dynamics," the problem of making a robot arm follow a desired trajectory or reach a desired point. In earlier years, Jordan simply adapted his neural net-

work so as to minimize the gap between the actual trajectory of a robot arm and the desired trajectory; however, his current work includes considerable sophisticated reasoning about how to specify U so as to make the robot arm really do what we want it to do. Kawato does likewise, though Kawato focuses on the problem of planning to reach a desired target point. Jordan optimizes the weights in an action network (which inputs R(t) and outputs u(t)), while Kawato optimizes the schedule of actions u(t) through u(T), where T is the time when the target is reached. Nguyen and Widrow (1990) used BTT to adapt an action network that drives a simulated truck to a desired landing point. Werbos (1989a) used BTT to optimize a schedule of drilling so as to maximize long-term profits in the natural gas industry; the results were used in the official forecasts of the Energy Information Administration in 1988. In general, one can use BTT or adaptive critics to maximize or minimize any function U. The challenge is to figure out what we really want our computer system to accomplish, and express that goal in algebraic terms. (With BTT and adaptive critics, it is also good to avoid "punitive" reinforcement schedules that reinforce behavior leading to out-of-bounds conditions.) Werbos (in press) and Williams (1990) provide tutorials on how to use BTT with ANNs. BTT is simple and exact, but it has some important limitations. First, BTT requires that we have a neural network available that "models" or "emulates" the dynamics of the environment that we want to control before we can begin adapting the action network. In other words, we need to have a network already available and already adapted that inputs R(t) and u(t) and outputs a forecast of what R(t + 1) will be. In principle, we have to assume that this forecast is exact; in other words, the method does not allow for randomness or error in forecasting.
(Jacobson and Mayne, 1970, described how a similar approach can be extended to the case of noise, but their approach is likely to be efficient only when the time-horizon is relatively short.) Strictly speaking, this model or emulator need not be an ANN in the narrowest sense; one can use any model made up of differentiable functions, with time-lags and a requirement for simultaneous solution, if one is willing to use a more complex form of backpropagation (Werbos, 1988b). Second, BTT requires a flow of information backwards from time T to time T - 1, to T - 2, and so on. This requires exact storage of an entire string or time-series of observations, which is totally implausible from a biological point of view. It requires the use of something like "batch learning" (Werbos, 1988a). It is inconsistent with full-fledged real-time adaptation, which would be desirable in many engineering applications; however, by adapting to one

string of experience at a time (i.e., data for a single task), a system based on BTT could capture some of the advantages of real-time learning. The adaptive critic family of methods is larger and more complex than BTT. Every current member of this family is fully consistent with real-time learning. The term "adaptive critic" comes from Barto, Sutton, and Anderson (1983), but the approach itself is much older; Barto (1990) cites a very extensive literature from psychology and artificial intelligence related to this family. "Adaptive critic" methods may be defined loosely as methods that try to approximate dynamic programming, using adaptive networks designed for use in the general case. Dynamic programming is the only exact and efficient method available to solve the problem of utility maximization over time in the general case where noise and nonlinearity may be present. Figure 1 illustrates how dynamic programming works. In dynamic programming, the user supplies a utility function, U(R), and a stochastic model F of the external environment. The equations of dynamic programming show how to calculate another function, J(R), which may be viewed as a secondary or strategic utility function. The basic theorem of dynamic programming is that maximizing J in the short term is equivalent to maximizing U in the long term. Unfortunately, the equations of dynamic programming are too complex to solve exactly in the general case; the costs go up exponentially with the number of components of R, and usually reach a practical limit when the number of components is about five or six. Adaptive critic designs may be defined more precisely as designs that include a critic network, a network that inputs R and outputs either an approximation to J or an approximation to the derivatives of J.
(In some of my earlier papers, I called this a "strategic assessment" network, and Grossberg has used the terms "drive representation" and "secondary reinforcement.") All of these designs also include action networks, networks that input R(t), output u(t), and somehow try to maximize J(R(t + 1)).
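For a tiny discrete problem, the recursion that adaptive critics try to approximate can be written out directly. The sketch below is not from the article: the three-state chain, the transition tables, and the discount factor gamma (added so the infinite sum converges, in the spirit of the Howard (1960) device discussed later) are all hypothetical values chosen for illustration.

```python
# Value iteration: compute J satisfying the dynamic-programming recursion
# J(s) = U(s) + gamma * max_a sum_s' P[a][s][s'] * J(s').
# Maximizing J one step ahead is then equivalent to maximizing U long-term.

def value_iteration(P, U, gamma=0.9, tol=1e-10):
    n = len(U)
    J = [0.0] * n
    while True:
        J_new = [U[s] + gamma * max(
                     sum(P[a][s][sp] * J[sp] for sp in range(n))
                     for a in range(len(P)))
                 for s in range(n)]
        if max(abs(x - y) for x, y in zip(J, J_new)) < tol:
            return J_new
        J = J_new

# Hypothetical 2-action, 3-state chain; P[a][s][s'] is a transition probability.
P = [
    [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.1, 0.0, 0.9]],  # action 0: mostly stay
    [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.9, 0.0, 0.1]],  # action 1: mostly advance
]
U = [0.0, 0.0, 1.0]  # reinforcement is earned only in state 2
J = value_iteration(P, U)
print(J)  # J ranks states by long-term promise, not by immediate payoff
```

The point of the sketch is the curse of dimensionality the article mentions: this table-based recursion is exact, but its cost explodes as the number of state components grows, which is what motivates approximating J with a critic network instead.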

[Figure 1 diagram: a Model of Reality (F) and a Utility Function (U) are supplied to Dynamic Programming (or GDHP), which produces the Secondary or Strategic Utility Function (J).]

FIGURE 1. What dynamic programming requires and produces.

[Figure 2 diagram: R(t) feeds a Critic Network, whose score J(t) ... J(t+1) is fed back as a reward or punishment signal to an Action Network that outputs u(t).]

FIGURE 2. A simple adaptive critic system.

How can this idea be translated into an integrated design for a complete system? Figures 2 and 3 illustrate two basic possibilities. Figure 2 illustrates the basic design of Barto, Sutton, and Anderson (1983), in which the critic network and the action network make up the entire system. The output of the critic network is used as a general "reward" or "punishment" signal to the entire action network. Unfortunately, there is an information bottleneck here if the vector u(t) has a lot of components; it is hard to tell, at any time t, which components of u(t) are responsible for the reward or punishment. Figure 3 illustrates the basic design of Werbos (1977, 1982), in which the weights in the action network are calculated by a form of backpropagation; in other words, they are adapted in proportion to the derivatives of J with respect to these weights, which are calculated by using the chain rule for ordered derivatives (Werbos 1974, 1989a) back through a stochastic model of the environment. This requires more work--in developing the model--but it does overcome the information bottleneck. Werbos (1990) describes these designs in more detail, including their strengths and weaknesses and possibilities for trying to get the best of both worlds. Whatever design we choose for the system as a whole, we still need a method to adapt the critic network. Sutton's method of temporal differences and HDP are two closely related methods to do this. These methods themselves suffer from a certain information bottleneck, which more complex methods (like Klopf's methods or DHP and GDHP (Werbos, 1990)) may help to solve; however, these more complex methods raise consistency issues that are very similar to those we see with HDP.

[Figure 3 diagram: a Dynamic Model of the Environment, R(t+1) = F(R(t), u(t), noise), links R(t) and u(t) to R(t+1), which feeds the Critic; derivatives of J flow back through the model to the action network.]

FIGURE 3. Backpropagated adaptive critic (BAC) (note how J is not used directly, but its derivatives are).

For the record, it should be noted that adaptive critics--like backpropagation through time--can also be used to adapt more conventional networks that perform tasks like pattern recognition or forecasting with time-lagged recurrent links. In the past, only backpropagation through time has been used (as in Watrous & Shastri, 1987, and Werbos, 1974) because it is both simpler and exact, even when noise is present, in these applications; however, Werbos (1988b) describes an alternative procedure that could be used in true real-time learning (as in biological systems).

DESCRIPTION OF HDP

This article describes only a simple variant of HDP; see Werbos (1989b) for other variants. In the simplest version of adaptive critics, the critic network inputs the vector R(t), uses a vector of weights w, and outputs a score J. J(R(t)) is supposed to be an estimate of the expected value of U(R(t)) + U(R(t + 1)) + ... + U(R(∞)), the total reinforcement to be expected across all future time periods. Of course, R(t + i) will depend on more than just R(t). It will also depend on the actions taken between times t and t + i. We want J to represent the expectation value that would result if we continued to use the existing action network. Howard (1960) showed that a scoring procedure of this kind will still lead us to the optimal strategy of action if we alternately update the action network in response to the current scores and update the scores to reflect the current action network. (Howard did not consider neural networks or approximate optimization, as does HDP; however, Howard's formulation of dynamic programming was the inspiration behind HDP, in Werbos, 1977.) In a complex, real-time system, everything will be updated in parallel; however, this article describes a simplified approach.

In general, when we really use an infinite time-horizon, the sum of U(R(t)) through U(R(∞)) may not converge. Howard (1960) described a way to deal with this problem, which is easily adapted to neural networks (Werbos, 1989b). However, this will not be necessary for the simple example given here, and will not be considered.

How can we adapt the critic network so that J does approximate what it is supposed to? First, we must define the setting more precisely. Let us suppose that we are given a time series of vectors, R(t), for t going from 1 to T. Let us suppose that this entire time series comes from a period when we were using the current action network (so that we do not have to worry about the action network at all). Let us suppose that we know the function U(R).
Our critic network is simply an implementation of some mathematical function J(R, w); in other words, we may define the function J(R, w) as that function that yields the output of our critic network as a function of its inputs, R and w. In HDP, we use some sort of supervised learning method to adapt the critic. Any supervised learning method will do, so long as we use the right inputs and targets. For example, anyone familiar with backpropagation should be able to fill in the rest of the details, when the inputs to the net and the targets are fully specified, for the training set. In the simplest form of HDP, we carry out several passes through the training set, as in basic backpropagation using "batch learning" (see Werbos, 1988a, 1988b). We start with an initial set of weights, w(0); in pass number n, we derive a new set of weights, w(n). We keep going through the training set, over and over, until the weights settle down, that is, until

w(n+1) ≈ w(n).

On each pass, our training set consists of T pairs of input and target, for t = 1 through T. (There is only one target for each pair, because the network has only one output, J.) The calculations required at each time t are illustrated in Figure 4. At each time t, the inputs are simply the vector R(t). At time t (within pass number n), the target is:

J(R(t + 1), w(n-1)) + U(R(t)).  (1)

In other words, before we begin adaptation of the weights, we have to plug R(t + 1) into the critic network, using the old weights, to calculate the target for each time t. The targets are then fixed throughout pass number n. Then, in the adaptation phase of pass number n, we update the weights to try to reach the targets. We could do this in an exhaustive way, by searching the entire weight space, or by using a supervised learning method (like Kohonen's pseudoinverse method) which converges in a single pass; alternatively, we could simply do a single pass of backpropagation. Theory tells us that we should simply throw out the case where t = T when we are trying to control an infinite, continuous process because eqn (1) is not well defined in that case (since R(T + 1) is unknown). In many practical applications, however, our set of observations R(1) through R(T) is actually made up of several strings of observations, where

[Figure 4 diagram: the critic, with the old weights, evaluates R(t+1) to produce J(t+1); the target J(t+1) + U(t) is then compared with the critic's output for R(t).]

each string represents a different experiment on the system to be controlled. In those cases, R(T) usually represents the completion of the last experiment; in that case the target for time T is simply U(R(T)). In fact, the target for the final time t of any string is simply U(R(t)). (In many problems, we may even want to use a different utility function, U, for the final times, depending on "how well" the experiment is completed; it is up to us to decide what we want to maximize.) Werbos (1990) describes how to deal with cases where U(t) + ... + U(∞) is a divergent sum, and explains how we sometimes need another network to estimate R(t) when the true state vector R(t) differs from the vector X(t) of what we see from our sensors. It discusses how to use a stochastic model of the environment, if present, and real-time learning. In the example given in Barto, Sutton, and Anderson (1983), the payoff occurs at the final time T at the end of any string; inserting this back into eqn (1), one arrives at the same targets they use in their method. In that paper, they also use a table-lookup kind of network, in which the target can simply be inserted into the relevant cell of the table. Strictly speaking, Sutton's method of temporal differences could also be used in forecasting applications; Sutton (1988) proves theorems about the consistency of that application, in a Markov chain environment, and refers to the "Adaptive Heuristic Critic" as a special case of prediction.
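The mechanics of one pass, with targets formed as in eqn (1), can be sketched in a few lines of Python. This is not the article's code: the one-dimensional linear critic and the scalar least-squares fit are illustrative assumptions, since the article stresses that any supervised learning method would do for the adaptation phase.

```python
# One batch pass of HDP: targets are computed with the OLD weights w_old and
# frozen for the whole pass, then a supervised method fits the critic to them.
# Here the critic is linear in one input, J(R, w) = w[0] * R[0], so the
# supervised step collapses to a scalar least-squares ratio.

def hdp_pass(R_series, U, w_old):
    """R_series: states R(1)..R(T); U: utility function; w_old: weights from the previous pass."""
    inputs, targets = [], []
    for t in range(len(R_series) - 1):            # t = T is thrown out: R(T+1) is unknown
        j_next = w_old[0] * R_series[t + 1][0]    # J(R(t+1), w_old)
        inputs.append(R_series[t])
        targets.append(j_next + U(R_series[t]))   # target from eqn (1)
    # Least-squares fit of w to (inputs, targets) -- the "supervised" step.
    num = sum(x[0] * y for x, y in zip(inputs, targets))
    den = sum(x[0] * x[0] for x in inputs)
    return [num / den]

# Tiny deterministic check: the series [1], [2], [3] with U(R) = R[0] and
# w_old = [1] gives targets 3 and 5, and a fitted weight of 13/5 = 2.6.
print(hdp_pass([[1.0], [2.0], [3.0]], lambda R: R[0], [1.0]))  # [2.6]
```

Iterating `hdp_pass` until the returned weights settle down is exactly the pass-by-pass scheme described above, with the targets recomputed from the new weights before each pass.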

THE ISSUE OF CONSISTENCY: MAJOR CONCLUSIONS

The main goal of this article is to explore the issue of consistency with HDP through a class of examples. "Consistency" is defined like statistical consistency--the ability of a method to converge on the right answer, as the data base goes to infinite size. When we use basic backpropagation as our supervised learning method, with HDP, we are minimizing the following square error on each pass, as a function of w(n):

E = Σ_t (J(R(t), w(n)) - (J(R(t + 1), w(n-1)) + U(R(t))))^2.  (2)

Common sense would appear to suggest that we really want to minimize the following expression instead, as a function of w:

E' = Σ_t (J(R(t), w) - (J(R(t + 1), w) + U(R(t))))^2.  (3)

FIGURE 4. Mechanics of HDP at time t. Broken line represents feedback actually used to adapt weights.


To minimize E', with backpropagation, we would calculate the complete gradient of E' with respect to w. In other words, in differentiating E', we would account for all effects of w on J, including its effect on J(R(t + 1), w). Instead of updating w, and then changing the targets on each pass, we would hope to anticipate the change in targets from pass to pass by using the complete gradient. Perhaps there are ways to use the complete gradient of E' in accelerating HDP. However, there are other ways one might try to accelerate convergence in this kind of moving-target situation. A major result of this article is that minimizing (2)--as described in the previous section--does converge to the right answer, in a simple linear example, while minimizing (3) converges to the wrong answer almost always. This has some interesting consequences. For one thing, the approach of the previous section allows the use of any supervised learning method; Werbos (1988a) describes a few reasons to prefer other learning methods besides the simple version of backpropagation popularized by Rumelhart, Hinton, and Williams (1986). For another, the approach of the previous section was motivated by an effort to approximate Howard's scheme, for which Howard proved a number of consistency and convergence theorems; our results here suggest that this was a legitimate approach and that attempts at shortcuts based on eqn (3) are not so legitimate. It is interesting to consider how similar paradoxes might arise in other applications of neural networks, such as dynamic systems modeling (Werbos, 1988a, 1988b). Preliminary work on those lines hints at serious limitations with all known methods to handle that application, including the methods borrowed from control engineering and statistics, whenever there is noise in unobserved variables. Some readers may find our conclusions here very paradoxical.

How can the minimum of E be different from the minimum of E' after convergence has been obtained and w(n) = w(n+1)? The remaining sections of this article work out the mathematics that makes them different for a class of reinforcement learning problems. However, to understand what is going on, it may help to consider a simple example, for a function E which has nothing to do with reinforcement learning as such:

E = (w(n) + w(n-1))w(n) - 4w(n),

where there is only one weight w to be updated on each pass. When we minimize this with respect to w(n), holding w(n-1) constant, it is easy to see that we require:

∂E/∂w(n) = 0 = 2w(n) + w(n-1) - 4

w(n) = 2 - (1/2)w(n-1).  (4)

Convergence will be reached when w(n) = w(n-1); solving the two-equation system made up of this condition and eqn (4), we can see that w(n) will converge to 4/3. (This is easily checked by picking any w(0) and using a calculator to iterate over eqn (4) and watching it converge.) On the other hand, if we simply treat w(n-1) and w(n) as two copies of the same thing (as is done in eqn (3)), we arrive at:

E' = 2w^2 - 4w.

A little calculus shows very quickly that E' has a unique minimum at w = 1, not at w = 4/3. The remainder of this article mainly describes the example that underlies these conclusions; then, the final section provides a more theoretical interpretation of the problem and ideas for accelerating HDP.
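The two answers are easy to check numerically; the following is a quick sketch of the arithmetic above, not code from the article:

```python
# Iterating eqn (4), w(n) = 2 - (1/2) * w(n-1), from an arbitrary starting point:
w = 0.0
for _ in range(60):
    w = 2.0 - 0.5 * w
print(w)  # converges to 4/3, the fixed point of eqn (4)

# Minimizing E' = 2w^2 - 4w directly: dE'/dw = 4w - 4 = 0, so the minimum
# sits at w = 1 -- a different answer from the fixed point above.
w_direct = 4.0 / 4.0
print(w_direct)
```

The gap between 4/3 and 1 is the whole paradox in miniature: the pass-by-pass minimum of E and the complete-gradient minimum of E' need not coincide.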

THE EXAMPLE

Many people working in this field restrict their attention to networks of model neurons, which calculate a weighted sum of their inputs and run the result through a sigmoidal function. However, the basic methods of neural network theory, including backpropagation, can be applied to any network of differentiable functions (Werbos, 1974, 1988b, 1989a). In some applications, it is useful to use more complex functions, based on our prior knowledge of the problem under study. In evaluating complex learning algorithms, it can also be useful to consider simpler functions. This article considers an example based on simple, linear functions. The success of HDP in confronting a linear example does not prove its value in the general, nonlinear case. However, with few exceptions, a learning rule that is good for the general, nonlinear case should be able to handle the linear case as well (as a limiting case). The linear test can be a very useful filter in ruling out certain classes of algorithms. This article considers the example of an environment that is governed by a simple linear difference equation:

R(t + 1) = AR(t) + e(t),  (6)

where A is a matrix and e is a random vector governed by a multivariate normal distribution. I will assume:

[e_i] = 0,  [e_i e_j] = σ_ij,  (7)

where the square brackets denote expectation values. This model is well known both in statistics and in control theory. In statistics, it is called a vector autoregressive process, or an AR(1) process. The


calculations below will make use of mathematical devices that are quite routine in statistics; see Box and Jenkins (1970) for a more elaborate discussion of these devices and simpler examples of their use. I will further limit the example by assuming that eqn (6) represents a "stationary process," as this is defined by statisticians. Basically, this means that the infinite series I + A + A^2 + ... converges, where I is the identity matrix, and that R(t) will not go to infinity as t goes to infinity. When the series I + A + A^2 + ... does converge, it is well known from matrix algebra that it converges to (I - A)^-1. For the example, I will take:

U(R) = U^T R,  (8)

where U is a known vector and U^T represents the transpose of that vector. (In other words, U^T R is simply the dot product of U and R.) For the critic network, I will consider a simple linear network:

J(R) = w^T R.  (9)

In the following sections, I will try to see what happens to the weights w when we use HDP on this problem. I will ask whether the J that emerges from HDP is in fact the correct J to use in this situation. I will restrict myself to the infinite limit, where T goes to infinity. There are other questions one might ask about this example, but the question of consistency is certainly an important starting point.
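As a numerical preview of the answer derived in the next section (eqn (10)), the sketch below checks that the weights w = (I - A^T)^-1 U make the linear critic w^T R reproduce the truncated series U^T (I + A + A^2 + ...) R. The particular 2x2 stationary matrix A, vector U, and state R are made-up values, not from the article:

```python
# Check: w = (I - A^T)^-1 U  implies  w.R = U.(I + A + A^2 + ...).R
# for a stationary example (spectral radius of A below 1).

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

A = [[0.5, 0.1], [0.2, 0.3]]   # hypothetical stationary matrix
U = [1.0, -2.0]                # hypothetical utility vector
R = [0.7, -1.3]                # an arbitrary state

# Solve (I - A^T) w = U by Cramer's rule (2x2 only, for transparency).
M = [[1 - A[0][0], -A[1][0]], [-A[0][1], 1 - A[1][1]]]   # I - A^T
det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
w = [(U[0] * M[1][1] - M[0][1] * U[1]) / det,
     (M[0][0] * U[1] - M[1][0] * U[0]) / det]

# Truncate the series I + A + A^2 + ... at 60 terms (plenty for convergence here).
S = [[1.0, 0.0], [0.0, 1.0]]
Ak = [[1.0, 0.0], [0.0, 1.0]]
for _ in range(60):
    Ak = matmul(Ak, A)
    S = [[S[i][j] + Ak[i][j] for j in range(2)] for i in range(2)]

j_series = sum(U[i] * sum(S[i][k] * R[k] for k in range(2)) for i in range(2))
j_critic = w[0] * R[0] + w[1] * R[1]
print(abs(j_series - j_critic) < 1e-9)   # True: the critic matches the series
```

The identity behind the check is w^T R = U^T (I - A^T)^-T R = U^T (I - A)^-1 R, which is exactly the expected long-run sum computed in the next section.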

WHAT ARE THE CORRECT WEIGHTS?

Before asking whether HDP yields the correct weights, we have to establish what the correct weights are. In this section, I argue that the correct weights are:

w = (I - A^T)^-1 U.  (10)

This can be argued on two levels, informal and formal. On an informal level, I start from my earlier statement that J(R(t)) is supposed to be an estimate of the expected value of:

U(R(t)) + U(R(t + 1)) + ... + U(R(∞)).

From eqn (8), this is:

[U^T (R(t) + R(t + 1) + ... + R(∞))],  (11)

where we again use square brackets to denote the expectation value. But from eqns (6) and (7):

[R(t + 1)] = A[R(t)],

which implies:

[R(t + i)] = A^i [R(t)].

Substituting back into eqn (11), and recalling that R(t) is known with certainty, we conclude that the correct value of J(R(t)) is:

U^T (I + A + A^2 + ... + A^∞) R(t) = U^T (I - A)^-1 R(t).

Comparing this with eqn (9), we can see that J(R) will exactly equal the correct value if we use eqn (10) to define the weights w. (In more complex examples, of course, there will usually be no set of weights that makes J(R) exactly correct.) From a formal point of view, the value of J defined in eqns (9) and (10) exactly satisfies the Bellman equation of dynamic programming; therefore, an action network that succeeds in maximizing this J in the short term will automatically maximize the sum of U over time in the long term. Again, Howard (1960) proved many of the underlying theorems. (Howard's examples even have a similar appearance to ours, but he focuses on Markov chains rather than arbitrary matrices A, because of his emphasis on exact methods.)

CONSISTENCY OF HDP PROPER

Next, we must figure out what weights HDP will actually converge to, in our example. As with the simple example of eqn (4), we first need to describe the relation between w(n) and w(n-1), assuming that we fully minimize E in pass number n with respect to w(n) (holding w(n-1) constant). Later I will relax this assumption somewhat. In the asymptotic limit, as an infinite number of observations becomes available, minimizing E over the sample is equivalent to minimizing the long-term statistical average of error E(t) across all observations t. Thus, we know for all i that our minimization procedure will enforce:

0 = ∂/∂w(n)_i [E(t)] = ∂/∂w(n)_i [(J(R(t), w(n)) - (J(R(t + 1), w(n-1)) + U(R(t))))^2];

substituting in from eqns (6) through (9), this reduces to:

0 = ∂/∂w(n)_i [(w(n)^T R(t) - (w(n-1)^T (AR(t) + e) + U^T R(t)))^2].

Now that R(t + 1) no longer appears in the equation, we may simply rewrite this as:

0 = ∂/∂w(n)_i [(w(n)^T R - (w(n-1)^T (AR + e) + U^T R))^2]
  = ∂/∂w(n)_i [((w(n)^T - w(n-1)^T A - U^T) R - w(n-1)^T e)^2].  (12)

186

T j, P. J. 14er[ os

For convenience, let us define: v = w(~) - ATw(~ ~ - U.

(13)

In brief, the weights that result from using HDP proper are exactly the correct weights.

Note that:

Ovi {~ Ow~,) =

INCONSISTENCY ifi=j if i ¢ j

(14)

With this definition, eqn (12) reduces to:

0 =

~

OF M I ~ M I Z I N G

E'

Suppose that we minimize E ' - - a s defined in eqn (3)--instead of using H D P proper. In that case, basic calculus leads to a series of calculations very similar to those of the previous section, which culminate in:

(vrR - wl"-')re) z = [Ri(vrR - w(" t~re)]. 0 =

Because e is assumed to be random (or, more precisely, because of the causality assumption invoked again and again by authors like Box and Jenkins, 1970), we know that [Rie/] = 0 for all i and j; thus, our equation reduces to:

If the vector R contained some component that is always zero, then this equation would not be strong enough to tell us what w (n) will be, for a given w ("- ~): however, if we ever encountered a reinforcement learning problem like that, we could simply throw out such components of R since they don't contribute anything. (More could be said on this special case, but it would be a distraction from our present goals.) Thus, it is safe to assume that the matrix [R~Rj] is nonsingular; in that case, the equation above simply tells us that:

0 (vrR _

(16)

where v = w -

Arw

-

IL

(17)

Now, however, wi appears in three places in eqns (16) and (17), while w ~n) appeared in only one place in the corresponding equations before. Because e is random (as before), and because we can multiply by 1/2 without invalidating the equation, eqn (16) reduces to: 0 = (1/2)[~w (vrR)2t + ( 1 / 2 ) ' [ ~ ( w r e ) 2]

= [ (vTR) ~ 0-~v [ ( w R'] % ) - '+1 0 w ,

= ~ vk[RkR~] OVJow,+ ~i wiJeie~]

v=0,

Using matrix algebra, and calculating the derivatives of v from eqn (17), this may be expressed as:

which by eqn (13) tells us: w(,i = Arw~,, i~ + U

(15) 0 = vrC(l

From numerical analysis, we know that this iterative process converges because our assumption of stationarity tells us that the eigenvalues of A (and A r) are all inside the unit circle in the complex plane. In fact, we know that this process converges to the correct weights, as given in eqn (10). Using backpropagation with HDP, we may not choose to complete the minimization process in each pass. However, convergence still requires that the minimum we are moving towards on pass number n must equal w ~"-1) (so long as we use a learning rate greater than zero); since eqn (15) still gives the minimum we are moving towards, any stable (converged) set of weights w must still obey: w=

Arw

+ U,

which can only be satisfied by eqn (10). This does not prove that the process ever will converge in this case; however, since this is essentially a slower version of eqn (15), which clearly does converge, there is reason to be optimistic.

-

A r) + wZQ,

(18)

where C is the covariance matrix of the vector R. Now if Q equals zero, and if C and i - A T are nonsingular (as discussed in the previous section), then we may still deduce that v = 0, which leads to the correct weights, as in eqn (10). However. if Q is anything else--if there is any noise at all in the env i r o n m e n t - t h e n the nonsingularity guarantees that v will be something else as well. unless w is zero. (In the case where w = 0. eqn (18) tells us that v = 0, which is correct only if U = 0; this happens only if we are facing a null problem, in which U(R) is always zero for all vectors R.) In summary, the minimization of E' would lead to a set of weights that are different from the correct weights, unless one of the following two conditions is met: (a) the random noise (i.e.. the vector e in eqn (7) is always zero; (b) the problem is a null problem, in the sense that U(R) is always 0 for all vectors R. This establishes the conclusions discussed earlier in this article.
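The contrast between eqn (15) and eqn (18) can be checked numerically. The sketch below uses illustrative matrices of my own choosing (not from the article): it iterates the HDP-proper recursion of eqn (15), which reaches the correct weights of eqn (10), and solves the stationarity condition obtained by rearranging eqn (18) into [(I - A) C (I - A^T) + Q] w = (I - A) C U, which lands elsewhere whenever Q is nonzero.

```python
import numpy as np

# Numeric check of the conclusion above, with illustrative numbers of my
# own choosing (not from the article).  HDP proper has fixed point
# w = A^T w + U (eqn (15)), i.e. the correct weights of eqn (10), while
# setting the full gradient of E' to zero gives eqn (18),
#   0 = v^T C (I - A^T) + w^T Q,  with  v = w - A^T w - U,
# which rearranges to  [(I - A) C (I - A^T) + Q] w = (I - A) C U.
A = np.array([[0.5, 0.2],
              [0.1, 0.6]])
U = np.array([1.0, -2.0])
Q = 0.1 * np.eye(2)                       # covariance of the noise e
I = np.eye(2)

# Stationary covariance of R:  C = sum_k A^k Q (A^T)^k
C, term = np.zeros((2, 2)), Q.copy()
for _ in range(300):
    C += term
    term = A @ term @ A.T

w_correct = np.linalg.solve(I - A.T, U)   # eqn (10)
w_hdp = np.zeros(2)
for _ in range(300):                      # iterate eqn (15) to convergence
    w_hdp = A.T @ w_hdp + U
w_Eprime = np.linalg.solve((I - A) @ C @ (I - A.T) + Q, (I - A) @ C @ U)

print(w_correct)   # HDP proper converges here (w_hdp matches)
print(w_Eprime)    # minimizing E' lands somewhere else when Q != 0
```

Setting Q to zero in this sketch makes the two solutions coincide, matching condition (a) above.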

OTHER POSSIBILITIES FOR ACCELERATION
The results above are somewhat discouraging, because they show no way to accelerate HDP to account for what we know about the changes in the targets from pass to pass. If we do not account for this knowledge, then each pass through HDP will extend the effective foresight horizon of the system by only one time unit, at most. (Intuitively, the effective foresight horizon is a number h such that our approximation to J(R(t)) is really just an approximation to U(R(t)) + . . . + U(R(t + h)).) In applications where a small time unit is used (circa 1/10 of a second or less), this could significantly limit the effective foresight of the system. This problem would apply to any use of HDP, regardless of which method we use to solve the supervised learning problem within HDP.

This section will describe some very tentative ideas about the possibility of converging faster--of extending the foresight horizon by more than one unit per pass--when using gradient-based learning within HDP.

With conventional gradient-based function minimization, there is a huge literature on methods to accelerate convergence. This literature was mainly developed by numerical analysts and operations research specialists; therefore, it mainly applies to "batch learning." In conventional pattern-recognition applications, I have found (Werbos, 1988a) that batch learning with sophisticated numerical methods actually works faster than simple observation-by-observation learning when the dataset is "dense" (has little redundancy between observations). Presumably, it should be possible to adapt these numerical concepts to observation-by-observation learning to get faster convergence; however, this will take a lot of research, and will presumably require some kind of synthesis between backpropagation and concepts involving associative memory.

In gradient-based batch learning, there are three numerical methods that currently seem to dominate all others in their capabilities.
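The one-unit-per-pass behavior itself can be illustrated numerically. In the sketch below (matrices of my own choosing, not from the article), starting from w = 0 and applying the pass-by-pass recursion of eqn (15), the critic after n passes accounts for utility over exactly the next n time steps:

```python
import numpy as np

# Sketch of the "one time unit per pass" point, with illustrative numbers
# of my own (not from the article).  Starting from w = 0, each full HDP
# pass applies eqn (15), w <- A^T w + U, so after n passes
#     w^(n) = U + A^T U + ... + (A^T)^(n-1) U,
# i.e. J(R) = w^(n)T R only "sees" utility over the next n steps.
A = np.array([[0.5, 0.2],
              [0.1, 0.6]])
U = np.array([1.0, -2.0])

w = np.zeros(2)
for n in range(1, 6):
    w = A.T @ w + U                       # one full HDP pass
    horizon_sum = sum(np.linalg.matrix_power(A.T, k) @ U for k in range(n))
    assert np.allclose(w, horizon_sum)    # the horizon grew by exactly one step
    print(n, w)
```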
First, there is the BFGS method, which belongs to the quasi-Newton family of methods, suitable for problems involving 100-200 weights or so. The "S" stands for Shanno. Second, there is Shanno's conjugate gradient method, which performs far better than other conjugate gradient methods with large ANNs because it does not require an expensive line search for maximum effectiveness (Werbos, 1988a). Third, there is the Karmarkar class of methods, which has yet to be applied to ANNs. Shanno says that he will soon publish ways of adapting these methods to get a radical improvement in convergence with linear programming (analogous to the first stage of ANN adaptation), and he states that the theory behind these

methods includes a discussion of how to minimize something like square error.

According to Shanno's papers, these methods all basically depend on a quadratic approximation to the minimization problem--an approximation that breaks down when there are moving targets. In the case of HDP, however, the targets move in a relatively simple way, which should allow a relatively simple adaptation of the classical numerical methods. This section will suggest a few thoughts along those lines.

Our basic problem here is that we are looking for values of w and w' that solve the following equations, for an error function E:

g_i = (∂E/∂w_i)(w, w') = 0   for all i (19)

w_i = w_i'   for all i. (20)

The gradient vector g is defined here as in the papers by Shanno. This is not the same as minimizing E(w, w), which would change eqn (19) to:

(∂E/∂w_i)(w, w) + (∂E/∂w_i')(w, w) = 0. (21)

In fact, eqn (21) will not be consistent with eqn (19) unless the derivatives with respect to w_i' happen to be zero, which is rarely true at a solution.

In the usual minimization problems, based on eqn (19) but with w' treated as fixed, we try to approximate Newton's method for solving eqn (19). For a given value of w, we approximate:

g(w + δw) ≈ g(w) + H δw, (22)

where

H_ij = ∂^2 E / ∂w_i ∂w_j. (23)

This leads to Newton's way of changing the weights:

Δw = -H^{-1} g. (24)

In most quasi-Newton methods, we estimate H^{-1} by a matrix H', which should have the following property if the approximation is good:

H'(g^(n+1) - g^(n)) = w^(n+1) - w^(n). (25)

To enforce this property, we may use the following updating procedure for the approximation H':

H'^(n+1) = H'^(n) + ((w^(n+1) - ŵ^(n+1)) (g^(n+1) - g^(n))^T) / ((g^(n+1) - g^(n))^T (g^(n+1) - g^(n))), (26)

where

ŵ^(n+1) = w^(n) + H'^(n) (g^(n+1) - g^(n)). (27)


This type of update underlies all of the numerical methods discussed above, though they involve many refinements to this basic idea.

In the present situation, we need to modify eqn (22) to account for eqn (20). This yields:

g(w + δw) ≈ g(w) + H^+ δw, (28)

where

H^+_ij = ∂^2 E / ∂w_i ∂w_j + ∂^2 E / ∂w_i ∂w_j'. (29)

Thus, we need to find a modified update rule for H', such that H' becomes an approximation to the inverse of H^+ instead of the inverse of H. In actuality, this modified H' should meet the exact same condition as eqn (25), if g^(n) is defined as:

g^(n) = (∂E/∂w)(w^(n), w^(n)). (30)

Thus, the basic update rules of eqns (26) and (27) should still be valid, on the whole. Intuitively, this seems to suggest that the usual methods of numerical analysis would automatically solve the moving-targets problem here. If so, gradient descent using these methods would allow the foresight horizon to grow faster than one time period per pass, which would be impossible if HDP were used with "fast" one-pass supervised learning methods.

In actuality, eqn (30) does not match the partial gradient we would normally use in HDP; in HDP, we calculate the gradient at the point (w^(n), w^(n-1)). At the least, we would want some proofs of stability and convergence here. Such proofs may not be very different in quality or difficulty from the existing proofs in numerical analysis, but a great deal of work needs to be done. More seriously, the refinements developed by Shanno and others exploit the symmetry of H very heavily; since H^+ is not symmetric, those refinements cannot be used directly. Still, the literature on solving systems of nonlinear equations uses an asymmetric Jacobian instead of a symmetric H, so methods from that field might be applied here. Also, there may be some way, in these methods, to exploit the possibility of evaluating the gradient with respect to w'. The similarity between eqn (26) and the Widrow-Hoff learning rule might also lead to useful insights. Clearly, a lot of research will be needed to resolve these issues and to explore other approaches to extending the effective foresight horizon.

CONCLUSIONS

A simple version of HDP converges to the correct set of weights, in a simple class of examples, in the limit as the training set grows larger. In this simple version of HDP, we can use any form of supervised learning to adjust the weights in each pass, including backpropagation, fast one-pass methods, or other methods. An "accelerated" version, making use of a complete error gradient, converges to an incorrect set of weights, almost always, in the same example. Nevertheless, possibilities exist for faster learning, using the simple version of HDP (with its incomplete gradient), by carrying over some basic ideas from numerical analysis; this would require substantial further research, especially for the case of real-time learning. These results should apply with little modification to certain more complex methods (as in Werbos, 1990), which involve a similar effort to approximate dynamic programming.

REFERENCES

Barto, A. G. (1990). Connectionist learning for control: An overview. In T. Miller, R. Sutton, & P. Werbos (Eds.), Neural networks for robotics and control. Cambridge, MA: MIT Press.

Barto, A. G., Sutton, R., & Anderson, C. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13, 834-846.

Box, G. E. P., & Jenkins, G. (1970). Time-series analysis: Forecasting and control. San Francisco: Holden-Day.

Grossberg, S., & Levine, D. S. (1987). Neural dynamics of attentionally modulated Pavlovian conditioning. Applied Optics, 26(23).

Howard, R. (1960). Dynamic programming and Markov processes. Cambridge, MA: MIT Press.

Jacobson, D., & Mayne, D. (1970). Differential dynamic programming. New York: Elsevier.

Jordan, M. I. (1989). Generic constraints on underspecified target trajectories. In Proceedings of the First International Joint Conference on Neural Networks (IEEE Catalog No. 89CH2765-6). New York: IEEE.

Kawato, M. (1990). Computational schemes and neural network models for formation and control of multijoint arm trajectory. In T. Miller, R. Sutton, & P. Werbos (Eds.), Neural networks for robotics and control. Cambridge, MA: MIT Press.

Klopf, A. H. (1982). The hedonistic neuron: A theory of memory, learning and intelligence. Washington, DC: Hemisphere.

Nguyen, D., & Widrow, B. (1990). The truck backer-upper: An example of self-learning in neural networks. In T. Miller, R. Sutton, & P. Werbos (Eds.), Neural networks for robotics and control. Cambridge, MA: MIT Press.

Rumelhart, D., Hinton, G., & Williams, R. (1986). Parallel distributed processing (Chap. 8). Cambridge, MA: MIT Press.

Sutton, R. (1984). Temporal credit assignment in reinforcement learning. Ph.D. thesis, University of Massachusetts, Department of Computer and Information Science, Amherst, MA.

Sutton, R. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9-44.

Watrous, R., & Shastri, L. (1987). Learning phonetic features using connectionist networks. In Proceedings of the First International Conference on Neural Networks (ICNN). New York: IEEE.

Werbos, P. (1968). Elements of intelligence. Cybernetica (Namur), 3.

Werbos, P. (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. Ph.D. thesis, Committee on Applied Mathematics, Harvard University, Cambridge, MA.

Werbos, P. (1977). Advanced forecasting methods for global crisis warning and models of intelligence. General Systems Yearbook, 22, 25-38.

Werbos, P. (1982). Applications of advances in nonlinear sensitivity analysis. In R. Drenick & F. Kozin (Eds.), Systems modeling and optimization: Proceedings of the 10th IFIP Conference, New York (1981) (pp. 762-777). New York: Springer-Verlag.

Werbos, P. (1988a). Backpropagation: Past and future. In Proceedings of the Second International Conference on Neural Networks (ICNN) (Vol. I, pp. 343-353) (IEEE Catalog No. 88CH2632-8). New York: IEEE. (A transcript of the talk and slides--which is more narrative than the paper--is available from the author)

Werbos, P. (1988b). Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1, 339-356.

Werbos, P. (1989a). Maximizing long-term gas industry profits in two minutes in Lotus using neural network methods. IEEE Transactions on Systems, Man, and Cybernetics, 19(2), 315-333.

Werbos, P. (1989b). Backpropagation and neurocontrol: A review and prospectus. In Proceedings of the International Joint Conference on Neural Networks. New York: IEEE. (A slightly expanded version is available from the author)

Werbos, P. (in press). Backpropagation through time: What it is and how to do it. IEEE Proceedings.

Werbos, P. (1990). A menu of designs for reinforcement learning over time. In T. Miller, R. Sutton, & P. Werbos (Eds.), Neural networks for robotics and control. Cambridge, MA: MIT Press.

Williams, R. (1988). On the use of backpropagation in associative reinforcement learning. In Proceedings of the Second International Conference on Neural Networks (ICNN) (Vol. I) (IEEE Catalog No. 88CH2632-8). New York: IEEE.

Williams, R. (1990). Adaptive state representation and estimation using recurrent connectionist networks. In T. Miller, R. Sutton, & P. Werbos (Eds.), Neural networks for robotics and control. Cambridge, MA: MIT Press.