Mathl. Comput. Modelling Vol. 23, No. 1/2, pp. 175-188, 1996
A Neighboring Optimal Adaptive Critic for Missile Guidance

J. DALTON
Intelligent Systems Center, University of Missouri-Rolla, Rolla, MO 65401, U.S.A.
[email protected]
S. N. BALAKRISHNAN
Department of Mechanical and Aerospace Engineering and Engineering Mechanics, University of Missouri-Rolla, Rolla, MO 65401, U.S.A.
bala@umr.edu

Abstract-We present a neural network approach to missile guidance which is based on the notion of an adaptive critic. This approach is derived from the use of both a nominal solution of a linear optimal guidance law and a neighboring optimal control law. No assumptions about target maneuver dynamics are made during neural network training. We discuss neuro-control training issues, and the neural network control system results are compared with those obtained from an optimal control formulation. Numerical results from the simulations of the neuro-controller under reference conditions and under perturbations due to target maneuvers are presented. We also demonstrate the transfer of control knowledge from the critic network to the controller network while the simulated missile is in flight.

Keywords-Neuro-control, Neural networks, Adaptive critic, Optimal guidance, Reinforcement learning.
1. INTRODUCTION

Homing missile guidance has a long standing history and a wide literature base [1]. There is a wide range of control techniques and guidance laws available. These guidance schemes range from classical control laws which assume constant target velocity to more advanced control laws which assume at least statistical knowledge of target acceleration. Yet with sophistication in electronics and advances in aircraft performance, engineers continue to seek ways to improve the performance and expand the flight envelope of missiles. The purpose of this work is to demonstrate the capabilities of techniques based on recent advances in artificial neural networks. These techniques will exploit parallel architectures for higher computational throughput. We show a neighboring optimal adaptive critic formulation capable of solving the missile guidance problem.

Pastrick et al. [1] present an excellent survey article comparing five categories of guidance laws for short range tactical missiles. Classical guidance laws include line-of-sight, pursuit and proportional navigation guidance. Modern techniques include optimal linear guidance laws and other approaches based on differential game theory. The survey briefly describes each of these methods and outlines relevant references. Guidance laws based on proportional navigation are widely used because of their simplicity. In its most basic form, proportional navigation produces missile acceleration commands that are proportional to the line-of-sight angle rate [2]. For zero lag guidance systems, proportional navigation minimizes the integral square control effort required for zero miss distance assuming constant velocity of the target [3]. If the target accelerates, an additional term may be added to the proportional control law yielding the augmented proportional control law. Details of this
method may be found in the text by Zarchan [4]. We use the control law described by Fiske [5] as a basis for comparing our neural network based results with those obtainable using optimal guidance techniques. Modern guidance control laws are based on the assumption that an accurate model of the engagement exists. The modeling of target acceleration is difficult because true knowledge of a target evasive maneuver cannot be obtained. The motion model we use for comparison models target acceleration as an exponential function and assumes that maneuver time constants are known. In reality, these time constants may not be known and further, the exponential model may not be appropriate. Balakrishnan has noted that the use of either a state observer or Kalman filter to estimate the target acceleration states may lead to poor results [6,7]. We seek an alternate approach which does not make restrictive assumptions about a particular target acceleration model.

Published literature contains many references to applications of Artificial Neural Networks (ANNs) to various control problems [8]. Examples include the control of robotic arms [9-11],
flexible space structures [12,13], and others. Werbos has described the use of neural networks in both control and system identification problems [14]. Steck and Balakrishnan have previously examined the use of neural networks in optimal guidance [15]. The three most widely used ANN architectures for control are multilayer feed-forward neural networks trained via the Backpropagation method [16], the Cerebellar Model Articulation Controller (CMAC) [9] and Hopfield networks [17,18]. Of these methods, Backpropagation and CMAC are nonrecurrent, meaning that there is no feedback between network outputs and network inputs. Hopfield networks, on the other hand, contain feedback connections, and stability issues arise. We use feed-forward neural network control components which are based on the notion of an adaptive critic.

The development of critic based control systems can be described in terms of three stages. The first is the study of learning control systems which occurred in the 1960s. Fu defines a learning control system as a control system which is capable of modifying its behavior based on experience in order to maintain acceptable performance in the presence of uncertainties [19]. Sklansky describes learning control in terms of three feedback loops: the controller, a system identifier, and a teacher [20]. It is the teacher which distinguishes Sklansky's learning control from adaptive control architectures. Nikolic and Fu show similar work except that they allow for incomplete knowledge in the teacher [21]. The early work in learning control established a base for the second stage of development in critic based systems. Mendel and McLaren introduce a subclass of learning control which they call reinforcement learning [22]. In this subclass, a method for evaluating performance of the control system is incorporated into the system, and the teacher of learning control systems becomes a critic in reinforcement learning control. The critic offers an evaluation or critique of the current environment and control actions so that the controller can provide better control actions in the future. Widrow, Gupta and Maitra use neural type elements in a reinforcement learning system and further develop the idea of "learning with a critic." Barto, Sutton, and Anderson [23] apply these ideas in a two component, adaptive critic control architecture using simple neural network elements. The adaptive critic is used to learn to balance a pole on a moving cart in a feedback, reinforcement learning control system.

More recently, the notion of the adaptive critic in the context of neuro-control has been further developed. Werbos [24] lists adaptive critics with reinforcement learning as one of five dominant paradigms for neural control systems. Werbos also shows the use of backpropagation for training adaptive critics. He asserts that adaptive critics provide an approximation to dynamic programming [25]. Jameson implements Werbos' backpropagated critic and compares it with earlier work in reinforcement learning [26]. Sofge and White apply the idea of the adaptive critic in process control for the manufacturing of thermoplastic composites [27]. The primary underlying premise of learning control is that knowledge of the plant is unknown and must be discovered by the control system. Conversely, conventional optimal control theory depends strongly on knowledge
of plant dynamics and embedded stochastic process models.
We offer our work as a means of
bridging this gap. We use the critic architectures above as a basis for the neighboring optimal adaptive critic and decompose the feedback control task into two subtasks. Control on a known nominal trajectory is provided by a controller network based on observed or estimated plant states. Then a critic network gives a critique of controller action and supplies a complementary control signal. Signals from the controller and critic added together form the plant input signal. The controller network is trained using optimal trajectory data on a nominal trajectory, and we use neighboring optimal control techniques [3] to form the training set for the critic network. Results indicate that this two component architecture exhibits good behavior in the presence of stochastic disturbances in the feedback loop.
In fact, neither the controller nor the critic network mappings need to be
accurate at every point along a trajectory.
The stochastic nature of the critic which is built in
during training yields a robust control system. The neighboring optimal adaptive critic architecture inherits its stochastic behavior from the use of neighboring optimal control in constructing the critic training set. However, the neural control structure yields additional benefits. Neural networks give us the ability to tune the feedback mapping in a smooth way in response to changes in system dynamics or changes in the stochastic nature of the system. The neuro-control architecture also works when the optimal feedback mapping is not unique. In this case, we obtain a least-squares curve fit automatically as a result of network training and approximate optimal control in normal operation. In addition, we may use the same underlying neural networks for widely varying control problems simply by substituting appropriate network parameters. The neighboring optimal adaptive critic has a parallel structure which enables high speed operation. We use all available problem information in the formulation of the neighboring optimal adaptive critic. This is in contrast to the trial and error procedures used in reinforcement learning systems. As a result, we can get near optimal performance without the need for many trials which is important in applications such as missile guidance where we can only make one attempt. The cost for this performance is in the need to know, a priori, more information about plant dynamics. However, we note that it is possible to make a transition to the reinforcement learning case by approximating neighboring optimal control in the construction of the critic training set. This might be done, for example, when plant dynamics are not known and can only be approximated. Dalton gives an extensive study of the use of the critic based approach to control described in this paper in his Ph.D. dissertation [28]. In the remainder of this work, we address the application of a critic based neural network approach to the homing missile guidance problem. The homing missile problem is defined in Section 2. The topology and training of neural network blocks that are used in our solution are described in Section 3. Simulation results characterizing performance of the neural control system are included in Section 4. We conclude with recommendations for further work in Section 5.
2. PROBLEM DEFINITION
The contents of this section describe the homing missile intercept problem, that is, the determination of proper control signals which guide a missile to the point of intercepting a target. Numerous approaches to the solution of this problem exist, ranging from line-of-sight, pursuit, and proportional navigation to more advanced techniques based on optimal control theory and the theory of differential games [1]. This section begins with the development of a state space description of the problem followed by a review of a linear optimal guidance control law and the development of the neighboring optimal control law.
2.1. State Space Description
In the missile guidance problem, a short range air-to-air missile engages a maneuvering target. All discussions assume motion in a plane for simplicity. The state vector, X, for the engagement consists of six components: relative position, x and y, and relative velocity, v_x and v_y, of the missile with respect to the target, and target acceleration, a_{tx} and a_{ty},

X = [x \;\; y \;\; v_x \;\; v_y \;\; a_{tx} \;\; a_{ty}]^T.   (1)

The system matrix, A, and the input matrix, B, for a state space model with states X and input u are included here for reference:
A = \begin{bmatrix}
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & \alpha_x & 0 \\
0 & 0 & 0 & 0 & 0 & \alpha_y
\end{bmatrix},   (2)

where \alpha_x and \alpha_y are maneuver time constants, and

B = \begin{bmatrix}
0 & 0 \\
0 & 0 \\
-1 & 0 \\
0 & -1 \\
0 & 0 \\
0 & 0
\end{bmatrix}.   (3)
In our work, we assume that although the relative position and relative velocity states are known to the control system through the use of an observer or estimator, the target acceleration states are not available. The state equations for the engagement model are given by
\dot{X} = AX + Bu.   (4)
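For concreteness, the model of (2)-(4) can be coded as below. This is a minimal sketch in Python with NumPy; the function names are ours, and the sign conventions follow our reading of the reconstructed matrices rather than any code from the original study.

```python
import numpy as np

def engagement_matrices(alpha_x, alpha_y):
    """System matrix A of (2) and input matrix B of (3).

    State X = [x, y, vx, vy, atx, aty]; input u = [ux, uy].
    """
    A = np.zeros((6, 6))
    A[0, 2] = A[1, 3] = 1.0               # position rates are the relative velocities
    A[2, 4] = A[3, 5] = 1.0               # relative acceleration couples to target acceleration
    A[4, 4], A[5, 5] = alpha_x, alpha_y   # first-order (exponential) target maneuver model
    B = np.zeros((6, 2))
    B[2, 0] = B[3, 1] = -1.0              # missile command acceleration (assumed sign)
    return A, B

def xdot(X, u, A, B):
    """Right-hand side of (4): Xdot = A X + B u."""
    return A @ X + B @ u
```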
2.2. Linear Optimal Guidance Law

When target accelerations are known, guidance commands may be determined from a linear optimal guidance law [5] with performance index

J = \frac{1}{2} X^T(t_f) S_f X(t_f) + \frac{\gamma}{2} \int_{t_0}^{t_f} u^T u \, dt,   (5)

where X^T(t_f) S_f X(t_f) is the square of the miss distance,

X^T(t_f) S_f X(t_f) = x^2(t_f) + y^2(t_f),   (6)

\gamma is a scalar control weighting, and u is the missile command acceleration. This guidance law is based on an estimate of time-to-go calculated by assuming that relative range rate is constant,

T_{go} = \frac{R}{V},   (7)

where R is the relative range and V is the relative velocity between the target and the missile. The optimal missile command accelerations are then obtained by state feedback with a time-varying gain matrix,

K(t) = N(t) \begin{bmatrix} C_1 & 0 & C_2 & 0 & C_{3x} & 0 \\ 0 & C_1 & 0 & C_2 & 0 & C_{3y} \end{bmatrix},   (8)

where

C_1 = T_{go},   (9)

C_2 = T_{go}^2,   (10)

C_{3x} = T_{go} \, \frac{e^{\alpha_x T_{go}} - \alpha_x T_{go} - 1}{\alpha_x^2},   (11)

C_{3y} = T_{go} \, \frac{e^{\alpha_y T_{go}} - \alpha_y T_{go} - 1}{\alpha_y^2},   (12)

and

N(t) = \frac{3}{3\gamma + T_{go}^3}.   (13)

A value of 10^{-4} is assumed for the control energy gain \gamma. The linear optimal feedback control law is

u = K(t) X(t).   (14)
Notice that if the target is not accelerating, the optimal control may be determined using only the gains C_1 and C_2, and the relative positions and velocities. Given the feedback control law of (14), we can choose nominal initial conditions, X(0), and generate a corresponding nominal optimal state trajectory. We next formulate a neighboring optimal control law [3]. Neighboring optimal control can be used to supply modifications in control based on small excursions from a known nominal optimal trajectory which result from target acceleration, measurement noise and disturbances.
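The gains (9)-(13) and the feedback law (14) might be evaluated as in the sketch below. The closed form used for N(t) is the standard linear-quadratic result for this cost and should be checked against [5]; names and defaults are ours, with \gamma = 10^{-4} as assumed above. The same gain matrix reappears in the neighboring optimal law of the next subsection.

```python
import numpy as np

def guidance_gains(t_go, alpha_x, alpha_y, gamma=1.0e-4):
    """Evaluate the time-varying gains (9)-(13); alpha_x, alpha_y must be nonzero."""
    c1 = t_go                                  # (9)  position gain
    c2 = t_go ** 2                             # (10) velocity gain
    c3x = t_go * (np.exp(alpha_x * t_go) - alpha_x * t_go - 1.0) / alpha_x ** 2  # (11)
    c3y = t_go * (np.exp(alpha_y * t_go) - alpha_y * t_go - 1.0) / alpha_y ** 2  # (12)
    n = 3.0 / (3.0 * gamma + t_go ** 3)        # (13), assumed linear-quadratic form
    return n, c1, c2, c3x, c3y

def optimal_control(X, t_go, alpha_x, alpha_y, gamma=1.0e-4):
    """Feedback law (14), u = K(t) X(t); applied to a state variation,
    the same map gives the neighboring optimal correction of (17)."""
    n, c1, c2, c3x, c3y = guidance_gains(t_go, alpha_x, alpha_y, gamma)
    K = n * np.array([[c1, 0.0, c2, 0.0, c3x, 0.0],
                      [0.0, c1, 0.0, c2, 0.0, c3y]])
    return K @ np.asarray(X, dtype=float)
```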
2.3. Neighboring Optimal Control

The solution of the neighboring optimal control problem involves the minimization of the second variation of the performance index, (5), given by

\delta^2 J = \frac{1}{2} \delta X^T(t_f) S_f \delta X(t_f) + \frac{\gamma}{2} \int_{t_0}^{t_f} \delta u^T(t) \, \delta u(t) \, dt,   (15)

subject to the constraints

\delta \dot{X}(t) = A \, \delta X(t) + B \, \delta u(t),   (16)
with \delta X(t_0) specified. Notice that this problem is identical in form to the original minimization problem with only a change of variable to accommodate the variations, \delta X(t) and \delta u(t). The variation in control is

\delta u(t) = \begin{bmatrix} \delta u_x(t) \\ \delta u_y(t) \end{bmatrix}
= N(t) \begin{bmatrix} C_1 & 0 & C_2 & 0 & C_{3x} & 0 \\ 0 & C_1 & 0 & C_2 & 0 & C_{3y} \end{bmatrix}
\begin{bmatrix} \delta x(t) \\ \delta y(t) \\ \delta v_x(t) \\ \delta v_y(t) \\ \delta a_{tx}(t) \\ \delta a_{ty}(t) \end{bmatrix}.   (17)
The following section describes the architecture and training of neural network blocks used in implementing systems with neural network complementary control. We use a nominal optimal trajectory to train a controller neural network in the feedback loop, and we use neighboring optimal solutions to train a critic network.

3. NEURAL NETWORK ARCHITECTURES AND TRAINING
The use of feed-forward neural networks in a critic based architecture for control is the primary focus of this paper. The contents of this section define neural network terminology to be used in the remainder of the study. The architecture and training methodology of the neural control system is also presented.
3.1. Assumed Neural Network Structure
The class of neural networks to be discussed is feed-forward neural networks with three layers of processing elements. The input layer uses a logarithmic activation function given by

f_{in}(x) = \mathrm{sgn}(x) \ln(1 + |x|).   (18)

Two hidden layers use a sigmoid function,

f_h(x) = \frac{1}{1 + e^{-x}},   (19)

and the output layer processing elements have a linear activation function,

h(x) = x.   (20)

The transformation of signals through a single network layer is easily visualized in terms of a composition of three vector transformations. Let the layer input vector be given by [1 \;\; x^T]^T, where x here denotes the vector of inputs and the constant value of 1 in the first row is used to add biases to the input of the hidden layer elements. Let z represent an intermediate vector and let W be a matrix of weights and biases; then

z = W [1 \;\; x^T]^T,   (21)

where the bias terms are contained in the first column of the matrix W. A vector of activation functions, f, then completes the transformation to the layer output vector,

y_i = f_i(z_i),   (22)

where y_i are components of the layer output vector and f_i are components of the diagonal transformation f.
3.2. The Neural Control System

A block diagram of the control system is shown in Figure 1. It is a closed loop design where the plant represents the missile-target engagement dynamics given in (1)-(4). The two neural networks in the control system have the same feed-forward structure as defined above. The controller network produces a control output in response to plant output signals. The controller network by itself provides a state feedback mapping to the plant and is trained on a given nominal optimal trajectory.
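For reference, the following is a compact sketch of the assumed network of Section 3.1 with the activations (18)-(20). The logistic form of the sigmoid and the layer sizes, which match the controller network of Section 4.1, are assumptions of this sketch.

```python
import numpy as np

def f_in(x):
    """Input activation (18): sgn(x) ln(1 + |x|)."""
    return np.sign(x) * np.log1p(np.abs(x))

def sigmoid(x):
    """Hidden-layer activation (19), assumed logistic."""
    return 1.0 / (1.0 + np.exp(-x))

def forward(weights, x):
    """Forward pass: each layer computes z = W [1, y^T]^T as in (21),
    then applies its activation componentwise as in (22)."""
    W1, W2, W3 = weights
    y = f_in(np.asarray(x, dtype=float))            # input layer
    y = sigmoid(W1 @ np.concatenate(([1.0], y)))    # first hidden layer
    y = sigmoid(W2 @ np.concatenate(([1.0], y)))    # second hidden layer
    return W3 @ np.concatenate(([1.0], y))          # linear output layer, (20)

# Controller-sized example: 4 inputs, two hidden layers of 6 elements, 2 outputs.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(6, 5)),    # 4 inputs plus a bias column
           rng.normal(size=(6, 7)),    # 6 hidden plus a bias column
           rng.normal(size=(2, 7))]
u_c = forward(weights, [1000.0, 200.0, -100.0, -50.0])
```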
Figure 1. The critic based neural control system.
The controller network training set is built by matching state vectors and the corresponding control vectors from a nominal optimal solution. A specified number of copies of the transformed nominal state vectors and optimal control vectors are randomly ordered to form the training set. The control system design methodology is not dependent upon the training method. We use backward error propagation because of its simplicity and widespread use.
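A minimal sketch of this construction follows; the 20 copies anticipate Section 4.1, and restricting the network input to the four measured state components is our reading of the setup, since the target acceleration states are assumed unavailable.

```python
import numpy as np

def controller_training_set(states, controls, copies=20, rng=None):
    """Randomly order `copies` replicas of the nominal (state, control) pairs.

    Only the relative position and velocity components feed the controller,
    since the target acceleration states are assumed unavailable."""
    if rng is None:
        rng = np.random.default_rng()
    pairs = [(np.asarray(x)[:4], np.asarray(u)) for x, u in zip(states, controls)]
    pairs = pairs * copies
    return [pairs[i] for i in rng.permutation(len(pairs))]
```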
The critic network is so named because it critiques the controller's performance. The purpose of the critic network is to monitor the current state of the plant and the controller's response to that state. It then produces a correction signal which augments that of the controller network to produce a modified control command to the plant. The critic network is trained using data from neighboring optimal control. We now discuss this in greater detail.

The controller input signal at a time t is X^*(t) + \delta X(t), where X^*(t) is an optimal state vector on a given nominal trajectory and \delta X(t) is the variation of the state vector. The controller then produces u_c(t) = u^*(t) + \Delta u(t), where u^*(t) is optimal control corresponding to X^*(t). The second term of the controller output, \Delta u(t), is distinguished from the variation, \delta u(t), because the neural network controller does not, in general, produce optimal control away from the nominal trajectory which it was trained on. The error in the mapping of the controller network at the new state, X(t) + \delta X(t), is contained in \Delta u(t). The purpose of the critic network is to remove the term, \Delta u(t), from the controller output and add back an optimal variation in control, \delta u(t). Therefore, the critic input vector is obtained from a state vector component, X(t) + \delta X(t), and a proposed control action from the controller network. The desired critic output vector is

u_d(t) = -\Delta u(t) + \delta u(t).   (23)
To produce the critic training set, we start with sampled nominal state and control functions and expressions for the control variations given in (17). At each sample instant, t, we choose uniformly distributed \delta X(t) and \Delta u(t) and calculate the critic network input. The corresponding critic output vector is also calculated. Using this technique, we can choose as many training set examples as necessary. The means and variances of \delta X(t) and \Delta u(t) must be selected to cover regions of the state and control spaces where plant operation is expected. Based on the preceding development and data obtained from a nominal trajectory, we can now construct the critic training set. The process used to create a single training set example requires only the selection of random perturbations, \delta X and \Delta u. The variation in control, \delta u, corresponding to \delta X can then be calculated using (17). Then the desired critic output is found using (23). The procedure is straightforward in design but difficulties can occur in practice. By choosing \delta X and \Delta u randomly late in the engagement, it is possible to get a pair which require an unrealistically high correction, \delta u. In this case, our critic training set could contain desired network output vectors for cases which rarely occur in actual engagements. As a result, the critic network may be biased toward these rare cases and not provide an accurate representation for the cases which do frequently occur. Therefore, it is important to carefully choose training set examples, and we now consider one method that works for this type of problem. We take components of \Delta u and the relative velocity components of the change in state, \delta X, from zero mean uniform distributions. Relative position components of \delta X are uniformly distributed with zero mean but with variances proportional to the corresponding position component of the state. This selection of random variations in the state has the effect of producing a wider coverage area early in an engagement and a tighter, more densely populated coverage area close to intercept where accuracy is important. We also restrict critic output signals so that unreasonably large training examples are not generated.

The summers which appear at the output of the plant and the output of the controller network in Figure 1 are included so that disturbances may be injected into the control system for robustness testing. Both vector signals, m(t) and a(t), are zero mean and may be drawn from arbitrary distributions.
In particular, m(t) provides a way for inserting noise into the system to reflect
error in state estimates. In this case, both the critic and controller networks use a measured state vector to compute control signals. The other noise signal, a(t), can be used to simulate actuator noise or inaccuracies in the controller network mapping.
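Putting the pieces together, one critic training example might be generated as below, reusing optimal_control from the Section 2.2 sketch for the variation of (17). The spread parameters and the six-component critic input (four measured state components plus the two controller outputs) are illustrative assumptions, not values from the original study.

```python
import numpy as np

def critic_example(X_star, u_star, t_go, alpha_x, alpha_y, rng,
                   pos_frac=0.1, vel_half=50.0, du_half=5.0):
    """Generate one (input, target) pair for the critic training set."""
    dX = np.zeros(6)
    # position variations: zero mean, spread proportional to the nominal position
    dX[0] = rng.uniform(-1.0, 1.0) * pos_frac * abs(X_star[0])
    dX[1] = rng.uniform(-1.0, 1.0) * pos_frac * abs(X_star[1])
    # relative velocity variations: zero mean uniform with a fixed spread
    dX[2:4] = rng.uniform(-vel_half, vel_half, size=2)
    # target-acceleration variations are left at zero, as in the text above
    dU = rng.uniform(-du_half, du_half, size=2)        # controller mapping error
    du = optimal_control(dX, t_go, alpha_x, alpha_y)   # variation in control, (17)
    critic_in = np.concatenate((X_star[:4] + dX[:4], u_star + dU))
    critic_out = -dU + du                              # desired output, (23)
    return critic_in, critic_out
```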
4. ANALYSIS OF NUMERICAL RESULTS
We begin this section by describing off-line training of the controller and critic networks using nominal and neighboring optimal trajectory
data.
Next we show simulations which compare
performance of the neural control system with and without the critic network to that obtained with the guidance law. We also show the region of admissible initial states which result in acceptable miss distances. We conclude the section by demonstrating on-line controller training based on critic signals.

4.1. Controller and Critic Network Training
To construct the controller and critic training sets, we first select an initial state for a nominal trajectory,

X(0) = [1000 \;\; 200 \;\; -100 \;\; -50 \;\; 0 \;\; 0]^T.   (24)
Then we run a simulation based on a fourth order Runge-Kutta integration with time step \Delta t = 0.2 s. State feedback control is calculated at each time step. Control is assumed constant during each step. The simulation ends when the rate of change in the magnitude of relative position changes sign, indicating that minimum range has been achieved. At this point, we step back to the immediately preceding time step and divide the sampling interval by a factor of 200. Then we integrate forward again to accurately determine the minimum range. This minimum range is called the miss distance. Data from the nominal state and control trajectories is used to form the controller network training set. The controller network for this problem contains four input elements, six hidden layer elements in each of two hidden layers, and two output elements. The controller network training set is constructed by randomly ordering 20 copies of the data taken from each sampled time on the nominal trajectories. The number of backward error propagation weight modifications performed during network training is 500,000 with the learning rate set at 0.0005 and the momentum term set at 0.3. The absolute error between output signals in the controller mapping and training set data is less than 0.1 ft/s². We now discuss training the critic network. We generate a total of 4949 training points for the critic training set, 101 for each data point along the nominal trajectory using (23). The perturbations \delta X and \Delta u are set to zero for one set so that the nominal trajectory itself is included in the training set. 4900 examples are placed in random order and the remaining examples on the nominal trajectory are presented to the network on alternating training iterations. A very small learning rate, 5 \times 10^{-6}, is required to assure convergence during training. We choose the momentum term as 0.3 and train for five million iterations. The absolute error in critic network output components after training on the nominal trajectory is less than 0.76.
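A sketch of the closed-loop simulation and miss-distance refinement just described follows (fourth order Runge-Kutta step, \Delta t = 0.2 s held-constant control, and a 200x finer refinement pass). Here `control(X, t)` stands for any of the feedback laws above; the structure is ours, under the stated assumptions.

```python
import numpy as np

def rk4_step(f, X, u, dt):
    """One fourth-order Runge-Kutta step with the control held constant."""
    k1 = f(X, u)
    k2 = f(X + 0.5 * dt * k1, u)
    k3 = f(X + 0.5 * dt * k2, u)
    k4 = f(X + dt * k3, u)
    return X + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def simulate(f, control, X0, dt=0.2):
    """Integrate until relative range begins to grow, then re-run the last
    step at dt/200 to resolve the minimum range (the miss distance)."""
    X, t = np.asarray(X0, dtype=float), 0.0
    while True:
        X_prev, t_prev = X.copy(), t
        X = rk4_step(f, X, control(X, t), dt)
        t += dt
        if np.hypot(X[0], X[1]) > np.hypot(X_prev[0], X_prev[1]):
            break                            # range rate changed sign in this step
    X, t, fine = X_prev, t_prev, dt / 200.0  # step back and refine
    miss = np.hypot(X[0], X[1])
    while True:
        X = rk4_step(f, X, control(X, t), fine)
        t += fine
        r = np.hypot(X[0], X[1])
        if r > miss:
            return miss, t                   # minimum range found
        miss = r
```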
4.2. Neural Guidance Results
Both the controller/critic and the controller configurations produce signals which closely match the optimal signals on the nominal trajectory as shown in Figure 2. The cost of control for the controller/critic system is higher than that of the other configurations due to fluctuations late in the engagement. These occur because the noise in relative velocity input components of the critic network has a greater disturbing effect near the point of intercept. There is a trade-off between accuracy in the critic mapping and the range of possible initial states for acceptable guidance. Since the control weighting factor in the performance index is small, the deviations in control
produced by the controller/critic architecture do not significantly change the total cost for the engagement.
Figure 2. Comparison of nominal trajectory control signals.
A performance comparison of the three architectures is shown in Table 1. In each one of the three cases, the miss distances are less than one foot. The state trajectories are essentially the same for the three configurations as shown in Figure 3. The performance indices for the controller/critic and optimal architectures are comparable, and the cost in the controller only configuration is only slightly higher.

Table 1. Performance comparison for three architectures.

Architecture                  Miss Distance (ft)   Performance Index   Flight Time (s)
Deterministic Guidance Law    0.0409               0.0144              9.7920
Controller/Critic             0.0525               0.0151              9.7620
Controller Only               0.5656               0.1736              9.7940
In order to determine the region in state space for successful operation, we consider projections of the state space assuming that the initial relative velocity components of the state remain fixed. Target acceleration is assumed to be zero. We then vary initial relative position states to find the boundaries of a two-dimensional region which result in acceptable miss distances. This test is performed by beginning at the nominal initial conditions and searching along radials at 15 degree intervals up to 500 ft from the nominal initial position in the x-y plane. Initial relative positions tested which result in a miss distance less than 10 ft are shown in Figure 4. We perform the same test using the controller configuration. In this case the only successful runs occur on radials at 0 and 180 degrees for points that are within 50 ft of the nominal initial relative position. The controller configuration provides acceptable performance on the nominal trajectory but does not perform well for regions in the state space which are away from the controller training set. The kill region for the controller/critic configuration is cone-shaped, reflecting the method used for selecting critic network training data. Larger kill regions can be produced at the expense of more extensive training and larger training sets.

Figure 3. Comparison of nominal state trajectories.

Figure 4. Initial relative positions for a miss less than 10 feet.

Robustness testing in the face of measurement noise for nonaccelerating targets is also performed about the nominal trajectory. In this investigation, we add zero mean Gaussian noise with standard deviation 20 to the relative position and velocity components of the state vector at each sampling instant along the trajectory. We show sixty Monte Carlo simulation runs performed for the (deterministic) optimal feedback control law, (14), controller/critic, and controller architectures. Results of this experiment comparing miss distance and total cost as given by the performance index and flight time are shown in Table 2. The neural controller/critic configuration gives the best performance and the controller configuration the worst. Miss distances for the optimal control law configuration average approximately five feet greater than those for the controller/critic configuration. This shows the inherent robustness of the controller/critic design.

Table 2. Performance by architecture for 60 trials with measurement noise added.

Architecture                             Miss Distance (ft)   Performance Index   Flight Time (s)
Deterministic Guidance Law   Maximum     39.5170              782.1540            10.9380
                             Minimum     0.4147               1.2898              9.4440
                             Mean        14.9843              170.6145            9.9220
                             St. Dev.    9.6224               177.4997            0.2897
Controller/Critic            Maximum     34.1532              583.3779            9.8890
                             Minimum     0.1580               0.1358              9.7090
                             Mean        9.8232               76.2936             9.8157
                             St. Dev.    7.4704               115.9711            0.0371
Controller Only              Maximum     79.3828              3150.8000           9.8070
                             Minimum     0.7471               0.2946              9.6730
                             Mean        34.9214              786.8937            9.7542
                             St. Dev.    18.8215              719.2236            0.0229

4.3. On-Line Controller Training

We conclude this discussion of neural guidance by demonstrating that the controller can be trained incrementally using the critic as a teacher. The following procedure is used during a single training session. First we choose independent, zero mean, variance one, normally distributed controller weights. Then we perform simulations using the controller and controller/critic configurations before modifying controller weights. Results from these simulations are saved to assess the training session. The actual on-line training is done for a specific set of initial conditions over the course of several trials. During each trial we use the output of the critic in the BEP
training process to modify controller weights at each sample instant. Each trial continues until range rate changes sign, indicating minimum range has been achieved. Controller weights which result from a given trial are used at the beginning of the next trial so that the controller gathers experience in each pass. During training trials, the critic output is not directly applied to the plant. At the conclusion of an on-line training session we again test performance of the controller and controller/critic configurations in simulation runs with controller weights fixed to see if the controller has learned. We say that a training session is successful if the controller, by itself, is able to provide control resulting in a miss less than ten feet. In a series of sixty independent on-line training sessions beginning with the nominal initial conditions, (24), we observe the following results. First, in all of the sessions the controller/critic configurations give control resulting in misses less than one foot both before and after on-line controller learning. The mean miss distance over the sixty trials is reduced from 0.1970 ft to 0.0970 ft as a result of controller training. We also note that for the sixty sessions, the untrained controller network produces trajectories with a mean miss of 258 ft when unaided by the critic network. The smallest miss in this case for untrained controllers is 29 ft. In each training session a fixed number of ten trials are taken. We observe an improvement in performance as a result of training for all sixty sessions. A total of thirty four of the training sessions are successful and the mean miss distance after training for controller configurations is 20 ft.
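One on-line update might look as follows, using the critic output as the output-layer error in a backward error propagation step for the network of (18)-(22). This sketch reuses f_in and sigmoid from the Section 3.1 sketch; the learning rate and momentum defaults are the off-line values of Section 4.1, reused here only as an assumption since the on-line values are not stated.

```python
import numpy as np

def online_controller_update(weights, vel, x, u_crit, lr=5.0e-4, momentum=0.3):
    """One BEP step on the controller, driven by the critic signal u_crit.

    The critic output is treated as (desired - actual) at the linear output
    layer, so no separate training target needs to be formed."""
    W1, W2, W3 = weights
    a0 = f_in(np.asarray(x, dtype=float))            # input layer, (18)
    a1 = sigmoid(W1 @ np.concatenate(([1.0], a0)))   # first hidden layer
    a2 = sigmoid(W2 @ np.concatenate(([1.0], a1)))   # second hidden layer
    e3 = np.asarray(u_crit, dtype=float)             # output-layer error
    e2 = (W3[:, 1:].T @ e3) * a2 * (1.0 - a2)        # back through sigmoid, (19)
    e1 = (W2[:, 1:].T @ e2) * a1 * (1.0 - a1)
    grads = [np.outer(e1, np.concatenate(([1.0], a0))),
             np.outer(e2, np.concatenate(([1.0], a1))),
             np.outer(e3, np.concatenate(([1.0], a2)))]
    for i in range(3):
        vel[i] = momentum * vel[i] + lr * grads[i]   # momentum-smoothed step
        weights[i] = weights[i] + vel[i]
    return weights, vel
```

A session would repeat this update at every sample instant of a trial, carrying `weights` and the momentum state `vel` across trials so that the controller accumulates experience.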
Figure 5. Miss distances for on-line learning trials.

Figure 6. Plant outputs for configurations before and after on-line training.

Miss distances for ten of the sixty training sessions are shown as a bar graph in Figure 5. The first and last bars for each session on the bar graph represent results before and after controller training, respectively. The intermediate bars for each session are miss distances at the end of trials as the controller is being trained. In session 1, the miss distance for the trained controller is 159 ft. This is the largest final miss distance over all sixty trials. Despite the large miss in the end for session 1, we note that there is improvement during the course of on-line training. If we continue with more learning trials, we can produce a better controller network. In session 2, the miss distances increase late in the session. In this case it may have been better to terminate the learning session earlier. Reducing the learning rate may also produce better results but prolongs the training process. We show the state and control trajectories for session 10 in Figures 6 and 7, respectively. The controller configuration produces nearly constant, nonoptimal control after training and misses by six feet. The use of the critic for on-line training of the controller network can be compared with reinforcement learning methods. It is not strictly a supervised learning method since the critic
signal is only an approximation of the required correction at any specific point. We also observe improvement in performance measured by miss distance as a result of previous experiences gathered by the controller network as in reinforcement learning. On the other hand, the method that we present depends on the existence of a trained critic network. A true reinforcement learning system should be able to gather experience rather than having it built in. If we could find a way to develop the critic network based on information taken from the system and the controller during operation, we would have a learning control system. This remains a topic of research, will require a means of performing system identification, and will need a way to encode this information in the form of an approximation for neighboring optimal control.
Figure 7. Plant inputs for configurations before and after on-line training.

5. CONCLUSIONS

We have presented a neural control design methodology based on the notion of a neighboring optimal adaptive critic. This technique decomposes the control task into two subtasks. The first is accommodated by a controller network which provides near optimal control on a given nominal trajectory. The second task is to provide a correction to the controller output which compensates for inaccuracies in the controller mapping and for off-nominal conditions. As a result of this
decomposition, we can provide feedback control which approximates optimal control within a region surrounding the chosen nominal trajectory. The controller/critic system is robust with respect to measurement noise. In addition, we have shown initial results which indicate that the critic network can be used to train a controller network. Further research will focus on extending the on-line learning method so that wider regions in the state space can be accommodated. We would also like to be able to implement self-improvement algorithms which allow retraining of the control system in the face of target maneuvers.
REFERENCES

1. H. Pastrick, S. Seltzer and M. Warren, Guidance laws for short-range tactical missiles, Journal of Guidance and Control 4 (March/April), 98-108 (1981).
2. F.W. Nesline and P. Zarchan, A new look at classical vs. modern homing missile guidance, Journal of Guidance and Control 4 (January/February), 78-85 (1981).
3. A.E. Bryson, Jr. and Y. Ho, Applied Optimal Control, Hemisphere Publishing Corporation, (1975).
4. P. Zarchan, Tactical and Strategic Missile Guidance, Vol. 124 of Progress in Astronautics and Aeronautics, American Institute of Aeronautics and Astronautics, Inc., (1990).
5. P.H. Fiske, Advanced digital guidance and control concepts for air-to-air tactical missiles, Tech. Rep. AFATL-TR-77-130, Eglin AFB, FL, (1977).
6. S.N. Balakrishnan and J.L. Speyer, Assumed density filter with application to homing missile guidance, AIAA Journal of Guidance, Control and Dynamics (January/February), 4-12 (1989).
7. S.N. Balakrishnan, An extension to modified polar coordinates and application with passive measurements, AIAA Journal of Guidance, Control and Dynamics (November/December), 906-912 (1989).
8. S.N. Balakrishnan and R. Weil, Neurocontrol: A literature survey, Mathl. Comput. Modelling (this issue).
9. J. Albus, A new approach to manipulator control: The Cerebellar Model Articulation Controller (CMAC), Transactions of the ASME, 220-227 (September 1975).
10. J.A. Franklin and O.G. Selfridge, Some new directions for adaptive control theory in robotics, In Neural Networks for Control, (Edited by W. Miller III, R. Sutton and P. Werbos), Chapter 14, pp. 349-364, MIT Press, (1990).
11. M. Kawato, Computational schemes and neural network models for formation and control of multijoint arm trajectory, In Neural Networks for Control, (Edited by W. Miller III, R. Sutton and P. Werbos), Chapter 9, pp. 197-228, MIT Press, (1990).
12. K. Tsutsumi, K. Katayama and H. Matsumoto, Neural computation for controlling the configuration of 2-dimensional truss structure, In Proceedings of the IEEE International Conference on Neural Networks, San Diego, CA, July 24-27, 1987, Vol. 2.
13. G. Wang and D. Miu, Unsupervised adaptive neural-network control of complex mechanical systems, In Proceedings of the 1991 American Control Conference, Boston, MA, June 26-28, 1991, pp. 28-31.
14. P.J. Werbos, Neural networks for control and system identification, In Proceedings of the 28th Conference on Decision and Control, Tampa, FL, December 1989, pp. 260-265.
15. J. Steck and S.N. Balakrishnan, Use of Hopfield neural networks in optimal guidance, IEEE Trans. on Aerospace and Electronic Systems (January), 287-293 (1994).
16. D. Rumelhart, G. Hinton and R. Williams, Learning internal representations by error propagation, In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, (Edited by D. Rumelhart, J. McClelland and the PDP Research Group), MIT Press, (1986).
17. J.J. Hopfield, Neural networks and physical systems with emergent collective computational abilities, In Proceedings National Academy of Science, Vol. 79, pp. 2554-2558, (April 1982).
18. J.J. Hopfield, Neurons with graded response have collective computational properties like those of two-state neurons, In Proceedings National Academy of Science, Vol. 81, pp. 3088-3092, (May 1984).
19. K.S. Fu, Learning control systems, In Computer and Information Sciences, (Edited by J.T. Tou and R.H. Wilcox), pp. 318-343, Spartan Books, Washington, DC, (1964).
20. J. Sklansky, Learning systems for automatic control, IEEE Transactions on Automatic Control AC-11, 6-19 (January 1966).
21. Z. Nikolic and K. Fu, An algorithm for learning without external supervision and its application to learning control systems, IEEE Transactions on Automatic Control AC-11, 414-422 (July 1966).
22. J. Mendel and R. McLaren, Reinforcement learning control and pattern recognition systems, In Adaptive, Learning, and Pattern Recognition Systems: Theory and Applications, (Edited by J. Mendel and K.S. Fu), pp. 287-318, Academic Press, New York, (1970).
23. A.G. Barto, R.S. Sutton and C.W. Anderson, Neuronlike adaptive elements that can solve difficult learning control problems, IEEE Transactions on Systems, Man, and Cybernetics SMC-13 (September/October), 834-846 (1983).
24. P. Werbos, Backpropagation and neurocontrol: A review and prospectus, In Proceedings of the International Joint Conference on Neural Networks, Washington, DC, June 18-22, 1989, Vol. I, pp. 209-216.
25. P. Werbos, Approximate dynamic programming for real-time control and neural modeling, In Handbook of Intelligent Control, (Edited by D. White and D. Sofge), Chapter 13, pp. 493-525, Van Nostrand Reinhold, New York, (1992).
26. J. Jameson, A neurocontroller based on model feedback and the adaptive heuristic critic, In Proceedings of the International Joint Conference on Neural Networks, San Diego, June 1990, pp. 1137-1144.
27. D.A. Sofge and D.A. White, Neural network based process optimization and control, In Proceedings of the 29th Conference on Decision and Control, New York, December 1990, pp. 3270-3276.
28. J.S. Dalton, A critic based system for neural guidance and control, Ph.D. Thesis, Electrical Engineering, University of Missouri-Rolla, Rolla, MO, (December 1994).