Neural Networks 14 (2001) 1089-1098
Contributed article
Estimates of average complexity of neurocontrol algorithms

Tomas Hrycej*

DaimlerChrysler AG, Research Center Ulm, Postfach 2360, D-89013 Ulm, Germany

Received 5 April 2000; accepted 15 February 2001

* Tel.: +49-731-505-2852; fax: +49-731-505-4210. E-mail address: [email protected] (T. Hrycej).
Abstract

Neurocontrol algorithms can be operated in a batch mode or an incremental mode. Furthermore, some of them have variants with and without an explicit plant model. These variants exhibit fundamentally different behavior with regard to the volume of data necessary for convergence. To assess this difference, simplified algorithms in a discrete state space using the dynamic programming framework are analyzed: a batch algorithm, and two incremental algorithms with and without a plant model. Analysis shows that the batch algorithm is the fastest, while the two incremental algorithms (in particular the model-free variant) are considerably slower, measured in the expected number of samples to convergence. © 2001 Elsevier Science Ltd. All rights reserved.

Keywords: Neurocontrol; Critic systems; Complexity; Batch/incremental mode
1. Introduction

Designing the control of a dynamic system consists of finding a mapping of plant states to controller actions, and can thus be viewed as a special case of the general functional approximation problem. Its non-standard feature is the impossibility of immediately determining the optimal action for a given state. The reason for this is that the effects of an action on a dynamic system are delayed, and optimality can be investigated only if future consequences of the action are considered. Consequently, statements about the complexity of neurocontrol algorithms cannot be easily derived from statements about the complexity of algorithms for learning a mapping. While there is a fair amount of work on the complexity of neural net algorithms for input/output mapping, concerning both learning of a mapping (Amari & Murata, 1992; Baum & Haussler, 1989; Blumer, Ehrenfeucht, Haussler & Warmuth, 1989; Schmitt, 1998; Yang & Amari, 1998) and implementation (Horne & Hush, 1996), much less is known about the complexity of neurocontrol algorithms.

Within neurocontrol algorithms, there is a variety of algorithmic frameworks, the most prominent of which are the algorithms based on backpropagation through time and the critic-based algorithms. Although the problem formulation in these two approaches is very different, there are hints that they are, to a large degree, equivalent: see a general model subsuming both approaches (Feldkamp et al., 1997), or algebraic arguments (Hrycej, 1998).

However, there is a less striking aspect that seems to have a more important effect on the complexity. A learning algorithm can be operated in a batch mode (first sampling data and then optimizing) or in an incremental mode (making an optimization step for each data sample). There are also variants working with an explicit plant model, and others working without it. Although both general approaches, backpropagation through time and critic systems, can be operated in both modes, some variants are committed to the batch or the incremental mode. For example, critic models without an explicit plant model seem to be inherently incremental and have so far no batch counterparts. It can be expected that both modes exhibit different convergence properties. This is what the work reported here focuses on: the comparison of efficiency between algorithms that differ in

- mode (incremental or batch), and
- using an explicit plant model or not.

Since the main difference between incremental and batch algorithms is in their use of measurement samples, the focus is on measuring the efficiency in terms of the number of samples necessary for convergence.

The presentation is structured in the following way. Section 2 motivates the commitment to algorithms working in a discrete state space. Some simplifying assumptions about sampling the measurements are explained, and basic
definitions are given. In three subsections, three algorithms are presented:

1. a batch algorithm using a plant model;
2. an incremental algorithm not using a plant model; and
3. an incremental algorithm using a plant model.

For each algorithm, the expected number of samples to convergence is derived. Discrete state space problems represent only a minority of applications. To outline a possible way towards the generalization of the results to real-valued problems, grid-based discrete approximations of real-valued problems are briefly discussed in Section 3, providing some further insights about complexity. The results of a simulation experiment are reported in Section 4.

2. Investigated algorithms

Neurocontrol algorithms in continuous domains use explicitly or implicitly some numerical optimization principle (e.g. gradient descent, Kalman training, relaxation, etc.). Consequently, the complexity of each neurocontrol algorithm heavily depends on the properties of this optimization principle (e.g. its order or its globality). By contrast, in discrete domains, dynamic programming is a prominent method without serious alternatives of comparable generality. Dynamic programming possesses clear complexity bounds for arbitrary nonlinear plants and state-dependent cost functions. This has motivated the decision to consider simple algorithms working in a discrete state space. The close relationship between dynamic programming and critic systems suggests the possibility of future generalizations to the continuous state space.

The discrete state space is assumed to have N discrete states and M discrete actions applicable to each state. Such a system can be alternatively viewed as a directed graph with N nodes and M edges leaving each node (i.e. N × M edges altogether). Edges are allowed to connect a node with itself.

The algorithms investigated here are based on sampled data. An artificial sampling environment is assumed that guarantees a sufficient plant excitation and passing through all working points. Sampling from the set of all N × M possible state/action combinations is random, with a uniform probability distribution which is independent of the preceding sample. The probability of sampling a given state is p_1 = 1/N, while that of sampling a given action is p_2 = 1/M. The joint probability of sampling a given state/action pair is then p = p_1 p_2 = 1/(MN).

Note 2.1. The random sampling procedure needs some justification. In fact, the random sampling assumption is constituted by three partial assumptions: (1) random sampling of actions, (2) random sampling of states and (3) independence of state samples. Each of these partial assumptions has to be justified separately.

1. In the neurocontrol community, it is popular to consider plants excited by the controller itself. Within this configuration, it is possible to speak about the well-known exploration/exploitation trade-off. It is frequently neglected that identification of a controlled plant may then completely fail. The controller actions (the `exploitation' part of the trade-off) excite the plant within a subspace of the state space, so that some modes of the plant remain unidentified. The plant identification procedure is then divergent (Åström & Wittenmark, 1989). A potential solution to this problem is a full-size dual control (Feldbaum, 1965). Dual control is a difficult framework whose complexity analysis is far beyond the scope of this work. This is why non-controlled excitation (i.e. a commitment to sole exploration) is considered here. According to control theory, the plant excitation by input actions is a part of a deliberate experiment design that has to satisfy certain conditions for the identification process to converge (Åström & Wittenmark, 1989). A uniform random sampling is a good choice for a general discrete-state plant of arbitrary order.

2. To which distribution of states the uniformly distributed input will lead depends on the particular plant. Stable plants will tend to be biased towards states near the open-loop equilibrium state (i.e. states of low order in the sense of Definition 2.1, if the goal state is identical with the open-loop equilibrium state). Unstable plants have a bias towards states distant from the equilibrium state (i.e. high-order states). Lack of further information justifies the uniform distribution assumption (O'Hagan, 1994).

3. The assumption of independence of successive samples is admittedly artificial. For plants of realistic complexity, successive samples are not independent: with random input, the distribution of the next state is not independent of the present state. The reason for this is that not every state can be reached from every other state in a single step. This problem can be circumvented by modifying the sampling procedure. Instead of sampling every state snapshot, only every n-th state snapshot constitutes a sample. For growing n, the samples become asymptotically independent. A satisfactory degree of independence will be reached for n of the order of magnitude of the longest path through the state graph. Furthermore, the fact that states along the optimal paths are of particular importance for the control optimization is favorable: optimal paths are usually substantially shorter than the longest path. This can be expected to lead to a fast extinction of the sampling dependence and of its relevance to the results.

All algorithms take a given number t of samples and then terminate (Algorithms 2 and 3) or start a separate
optimization phase (Algorithm 1). Formally, sampling data consists of getting state transition triples (i, j, k), with i being the original state, j the action, and k the resulting state. The control objective is to reach the goal state (the state with index 1) on an optimum-cost trajectory. If c_{ij} is the cost of performing action j at state i, the cost of a trajectory is the sum of c_{ij} over the trajectory.

The algorithms require storing three mappings:

- the plant model, mapping a state and an action to the resulting state;
- the critic, mapping a state to its cumulative cost; and
- the controller, mapping a state to the corresponding optimal action.

For a discrete state space, these mappings are discrete. An appropriate implementation of such mappings is with the help of look-up tables, with the mapping's arguments represented by look-up table indices and the mapping values by the look-up table entries. The plant model look-up table is represented by an (N × M)-matrix S, with s_{ij} being the state that results after application of action j to state i. The look-up table consisting of the vector v = (v_1, ..., v_N) contains the critic mapping: the entry v_i represents the cumulative cost of state i. The vector a = (a_1, ..., a_N) represents the controller look-up table: it maps each state i to the optimum action a_i.

Two simple definitions are useful for further investigations.

Definition 2.1. A state is of the l-th order if the optimum path to the goal state has l edges. In other words, on the optimum path from an l-th order state, l - 1 intermediary states (i.e. excluding the goal state) are visited.

Definition 2.2. An algorithm converges for state i at last at time t if the optimum path to the goal state (including optimal actions) has been determined from t samples. An algorithm converges for state i exactly at time t if the optimum path to the goal state (including optimal actions) has been determined from t samples but not from t - 1 samples.

The focus of investigation is the evaluation of the expected number of samples after which a state of the l-th order converges.

Note 2.2. To estimate the expected computing time of an algorithm, convergence for all states, rather than for an individual state of a given order, would be of interest. To determine the convergence for all states is substantially more difficult than for a single state. It would be necessary to consider not only the distribution of state orders in the given problem, but also the dependences between the convergence probabilities of individual states. Such dependences would result from the particular structure of the problem. For example, if the optimal paths from many states
have common segments, convergence times for different states are strongly correlated. However, to compare the properties of various algorithms, considering the convergence of individual states seems to be a sufficient basis.

2.1. Batch algorithm using a plant model

Algorithm 1, using a plant model, can be organized into the sequence (1) plant model identification (i.e. filling the plant model look-up table), and (2) controller optimization by dynamic programming. In the following informal description, the plant model identification consists of Steps 1 and 2, while Steps 3-5 optimize the controller for the given plant model.

1. Initializing s_{ij} to infinity.
2. Sampling a fixed number t of triples (i, j, k) and storing k in s_{ij}. (After this step, the model look-up table entry s_{ij} contains the next state k for a given state/action pair (i, j). Entries for pairs that have not been sampled so far contain infinity.)
3. Initializing v_i to infinity, except for placing zero at the goal state.
4. Iteratively performing the cost/action update for all states.
5. Stopping if no change has taken place in the last update loop.

An update for state i consists of a loop over all actions j, setting the cumulative cost v_i to the minimum of v_i and c_{ij} + v_{s_{ij}}, and a_i to j if v_i was improved. If each state/optimal action pair (i, j) has been sampled at least once, the controller look-up table a contains the optimum action for every state i. Consequently, the controller will produce the optimum control path from any initial state to the goal state. If some state/optimal action pair has not appeared in the t samples taken, the control paths containing this pair are suboptimal.
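A minimal Python sketch of this batch procedure, assuming a hypothetical sampling routine sample_transition() that returns a random triple (i, j, k), a given cost table c[i][j], and the goal state placed at index 0 (all of these names and conventions are illustrative, not from the original):

```python
import numpy as np

def batch_algorithm(N, M, t, sample_transition, c, goal=0):
    """Algorithm 1 (sketch): identify the plant model from t random samples,
    then optimize the controller by dynamic programming on look-up tables."""
    # Steps 1-2: plant model identification
    S = np.full((N, M), -1, dtype=int)          # -1 stands in for "infinity" (not sampled)
    for _ in range(t):
        i, j, k = sample_transition()
        S[i, j] = k
    # Step 3: initialize cumulative costs, zero at the goal state
    v = np.full(N, np.inf)
    v[goal] = 0.0
    a = np.full(N, -1, dtype=int)               # controller look-up table
    # Steps 4-5: iterate cost/action updates until no change takes place
    changed = True
    while changed:
        changed = False
        for i in range(N):
            for j in range(M):
                k = S[i, j]
                if k >= 0 and c[i][j] + v[k] < v[i]:
                    v[i] = c[i][j] + v[k]
                    a[i] = j
                    changed = True
    return v, a
```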
We wish to estimate the expected number of samples to convergence for a given state. With random sampling from a uniform distribution, the number of samples to convergence is identical for all states of the same order. Let us denote the length of the optimal path to the goal state (i.e. the order of the given state) by L. Let us denote the set of all L state/optimal action pairs on this path by u_L. Let us further denote by Q_1(t, l) the probability that a certain subset u_l ⊆ u_L of l ≤ L state/optimal action pairs on the optimal path has been completely sampled exactly after t samples. Being completely sampled means that every pair of this subset has been sampled at least once at time t. Note that this probability is identical for all subsets of size l. This is why no indexing of different subsets of size l is used.

Let us further consider a subset u_{l-1} ⊂ u_l of l - 1 state/optimal action pairs on the optimal path from the given state.
For time t, t > 0, L ≥ l > 0, the probability that

1. the sampling of all state/optimal action pairs from u_{l-1} has been completed exactly at time s < t,
2. none of the L - l + 1 state/optimal action pairs from u_L \ u_{l-1} has been sampled between time s and time t, and
3. the last missing state/optimal action pair (i.e. the only element of u_l \ u_{l-1}) on the optimal path has been sampled exactly at time t

is

\sum_{s=0}^{t-1} Q_1(s, l-1) (1 - p(L-l+1))^{t-s-1} p.    (1)

However, there are l subsets u_{l-1} such that u_{l-1} ⊂ u_l. So the recursive relationship for Q_1 is

Q_1(0, 0) = 1
Q_1(t, 0) = 0,  t > 0
Q_1(0, l) = 0,  L ≥ l > 0
Q_1(t, l) = l \sum_{s=0}^{t-1} Q_1(s, l-1) (1 - p(L-l+1))^{t-s-1} p,  t > 0, L ≥ l > 0.    (2)

The relationship for t > 0, l > 0 can be rewritten as

Q_1(t, l) = l \sum_{s=0}^{t-1} Q_1(s, l-1) (1 - p(L-l+1))^{t-s-1} p
          = l \sum_{s=0}^{t-2} Q_1(s, l-1) (1 - p(L-l+1))^{t-s-1} p + l Q_1(t-1, l-1) (1 - p(L-l+1))^{t-(t-1)-1} p
          = (1 - p(L-l+1)) \, l \sum_{s=0}^{t-2} Q_1(s, l-1) (1 - p(L-l+1))^{t-1-s-1} p + l p \, Q_1(t-1, l-1)
          = (1 - p(L-l+1)) Q_1(t-1, l) + l p \, Q_1(t-1, l-1).    (3)

The probability of convergence at last at time t for an l-th order state is

P_1(t, 0) = 1,  t ≥ 0
P_1(0, l) = 0,  L ≥ l > 0
P_1(t, l) = \sum_{s=0}^{t} Q_1(s, l) = \sum_{s=1}^{t} \left[ (1 - p(L-l+1)) Q_1(s-1, l) + l p \, Q_1(s-1, l-1) \right]
          = (1 - p(L-l+1)) P_1(t-1, l) + l p \, P_1(t-1, l-1),  t > 0, L ≥ l > 0.    (4)

This relationship will be used in comparing theoretical and simulated behavior of the algorithm in Section 4.

To obtain the expected number of samples to convergence, let us define the probability generating function S_1(x, l) = \sum_t x^t Q_1(t, l). Using (3), we get

S_1(x, 0) = 1
S_1(x, l) = \sum_{t=1}^{\infty} x^t Q_1(t, l) = \sum_{t=1}^{\infty} x^t (1 - p(L-l+1)) Q_1(t-1, l) + \sum_{t=1}^{\infty} x^t \, l p \, Q_1(t-1, l-1)
          = x (1 - p(L-l+1)) S_1(x, l) + x l p \, S_1(x, l-1),  l > 0,    (5)

which results in the recursive relationship

S_1(x, 0) = 1
S_1(x, l) = \frac{x l p \, S_1(x, l-1)}{1 - x(1 - p(L-l+1))},  l > 0,    (6)

or, expanding the recursion,

S_1(x, L) = L! \prod_{l=1}^{L} \frac{x p}{1 - x(1 - p(L-l+1))}.    (7)

The mean value of the random variable t is generally equal to the first derivative of the probability generating function at x = 1. The first derivative of (7) is

\frac{dS_1(x, L)}{dx} = S_1(x, L) \sum_{l=1}^{L} \frac{1 - x(1 - p(L-l+1))}{x p} \cdot \frac{p (1 - x(1 - p(L-l+1))) + x p (1 - p(L-l+1))}{(1 - x(1 - p(L-l+1)))^2}
                      = S_1(x, L) \sum_{l=1}^{L} \frac{1}{x (1 - x(1 - p(L-l+1)))}.    (8)

Setting x = 1 in (8) and using the fact that S_1(1, L) = 1, we get
E_1(L) = \frac{dS_1(1, L)}{dx} = \sum_{l=1}^{L} \frac{1}{p(L-l+1)} = \frac{1}{p} \sum_{l=1}^{L} \frac{1}{l}.    (9)

Theorem 2.1. The expected number of samples to convergence for a state of order L is

E_1(L) = \frac{1}{p} \sum_{l=1}^{L} \frac{1}{l}.    (10)

The sum in (10) can be roughly approximated by

E_1(L) \approx \frac{\log L}{p}.    (11)
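As a quick numeric illustration of Theorem 2.1 and approximation (11) (the values of N, M and L below are hypothetical, not taken from this paper):

```python
import math

N, M, L = 100, 10, 12                    # hypothetical problem size and state order
p = 1.0 / (N * M)                        # probability of sampling a given state/action pair

E1_exact = sum(1.0 / l for l in range(1, L + 1)) / p   # Theorem 2.1, Eq. (10)
E1_approx = math.log(L) / p                             # approximation (11)

print(E1_exact, E1_approx)               # roughly 3103 vs 2485 expected samples
```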
The optimization part of the algorithm does not depend on data sampling. If the reference state is reachable from any other state in at most L_max steps (i.e. if the maximum order of the problem is L_max), the algorithm stops after at most L_max + 1 iterations of Steps 4 and 5. In other words, the complete controller optimization requires, if all plant model entries s_{ij} have been sampled, at most (L_max + 1) N M operations. With random sampling, the maximum number of operations for the optimization part amounts to

(L_max + 1) / p.    (12)

This is obviously a larger number than (11), but it cannot be concluded that the sampling process is faster than the optimization! Note that the number of operations for optimization (12) is not directly comparable with the expected number of samples to convergence (11), for two reasons.

- The expected number of samples (11) refers to a single state of a given order (see Note 2.2). The number of samples to convergence for all N states would be substantially higher. By contrast, the number of operations (12) refers to optimality for all states.
- The number of samples (11) reflects the mean complexity. (In the worst case, an infinite number of samples is needed for random sampling.) By contrast, the number of operations (12) refers to the worst case.

2.2. Incremental algorithm not using a plant model

Algorithm 2 does not maintain a plant model look-up table s_{ij}. It initializes v_i to infinity, except for placing zero at the goal state. Then, t triples (i, j, k) are sampled, and, after each sampling, a partial update of v_i and a_i is performed if c_{ij} + v_k < v_i. (A complete update considering all possible actions cannot be performed because of the lack of information about the effects of actions different from the current one.)
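A sketch of this model-free incremental variant, using the same hypothetical sampling interface as in the sketch of Algorithm 1:

```python
import numpy as np

def incremental_model_free(N, M, t, sample_transition, c, goal=0):
    """Algorithm 2 (sketch): after each sampled triple (i, j, k), perform a
    partial update of the cumulative cost v and the controller a, considering
    only the currently sampled action j."""
    v = np.full(N, np.inf)
    v[goal] = 0.0
    a = np.full(N, -1, dtype=int)
    for _ in range(t):
        i, j, k = sample_transition()
        if c[i][j] + v[k] < v[i]:        # partial update for the sampled action only
            v[i] = c[i][j] + v[k]
            a[i] = j
    return v, a
```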
To converge exactly at time t for an l-th order state, the path from the next (i.e. the (l - 1)-st order) state must have been determined at time s < t. The missing state/optimal action pair then has to be sampled exactly at time t, with probability p. For Algorithm 2, the order of sampling is relevant, and thus the missing sample is always the state/optimal action pair of the l-th state itself. At times s + 1, ..., t - 1, the l-th state/optimal action pair must not be sampled (otherwise, the path would be completed before time t). The probability of this is (1 - p)^{t-s-1}. So, a recursive definition of the probability Q_2(t, l) that an l-th order state converges exactly at time t is

Q_2(0, 0) = 1
Q_2(t, 0) = 0,  t > 0
Q_2(0, l) = 0,  l > 0
Q_2(t, l) = \sum_{s=0}^{t-1} Q_2(s, l-1) (1 - p)^{t-s-1} p,  t > 0, l > 0.    (13)

The relationship for t > 0, l > 0 can be written as

Q_2(t, l) = \sum_{s=0}^{t-1} Q_2(s, l-1) (1 - p)^{t-s-1} p
          = \sum_{s=0}^{t-2} Q_2(s, l-1) (1 - p)^{t-s-1} p + Q_2(t-1, l-1) (1 - p)^{t-(t-1)-1} p
          = (1 - p) \sum_{s=0}^{t-2} Q_2(s, l-1) (1 - p)^{t-1-s-1} p + Q_2(t-1, l-1) p
          = (1 - p) Q_2(t-1, l) + p \, Q_2(t-1, l-1).    (14)

The probability of convergence at last at time t is

P_2(t, 0) = 1,  t ≥ 0
P_2(0, l) = 0,  l > 0
P_2(t, l) = \sum_{s=0}^{t} Q_2(s, l) = \sum_{s=1}^{t} \left[ (1 - p) Q_2(s-1, l) + p \, Q_2(s-1, l-1) \right]
          = (1 - p) \sum_{s=0}^{t-1} Q_2(s, l) + p \sum_{s=0}^{t-1} Q_2(s, l-1)
          = (1 - p) P_2(t-1, l) + p \, P_2(t-1, l-1),  t > 0, l > 0.    (15)

As for Algorithm 1, the expected number of samples to convergence (21) can be determined with the help of a probability generating function S_2(x, l), using (14):
S_2(x, 0) = 1
S_2(x, l) = \sum_{t=1}^{\infty} x^t Q_2(t, l) = \sum_{t=1}^{\infty} x^t (1 - p) Q_2(t-1, l) + \sum_{t=1}^{\infty} x^t p \, Q_2(t-1, l-1)
          = x (1 - p) S_2(x, l) + x p \, S_2(x, l-1),  l > 0,    (16)

which results in the recursive relationship

S_2(x, 0) = 1
S_2(x, l) = \frac{x p \, S_2(x, l-1)}{1 - x(1 - p)},  l > 0,    (17)

and finally the non-recursive formula

S_2(x, l) = \left( \frac{x p}{1 - x(1 - p)} \right)^l.    (18)

The mean value of the variable t is equal to the first derivative of S_2(x, l) at x = 1. The first derivative of (18) is

\frac{dS_2}{dx} = l \left( \frac{x p}{1 - x(1 - p)} \right)^{l-1} \frac{p (1 - x(1 - p)) + x p (1 - p)}{(1 - x(1 - p))^2}
               = l \left( \frac{x p}{1 - x(1 - p)} \right)^{l-1} \frac{p}{(1 - x(1 - p))^2}
               = \frac{l}{p x^2} \left( \frac{x p}{1 - x(1 - p)} \right)^{l+1}.    (19)

Setting x = 1 in (19), we get

E_2(l) = \frac{dS_2(1, l)}{dx} = \frac{l}{p} \left( \frac{p}{1 - (1 - p)} \right)^{l+1} = \frac{l}{p}.    (20)

Theorem 2.2. The expected number of samples to convergence for a state of order L is

E_2(L) = \frac{L}{p}.    (21)

2.3. Incremental algorithm using a plant model

Like Algorithm 2, Algorithm 3 makes an update of the cumulative cost v_i and the controller a_i after each sample. However, it simultaneously stores the sampled information in the plant model look-up table s_{ij}. This stored model information makes it possible to update the cumulative cost and the controller with respect to all possible actions, instead of considering only the currently sampled action j. First, s_{ij} and v_i are initialized to infinity, except for placing zero at the goal state. Then, t triples (i, j, k) are sampled, and, after each sampling, a complete update of v_i and a_i is performed, as in Algorithm 1.
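A sketch of this incremental variant with a plant model, again under the hypothetical sampling interface used above; note that the work per sample grows with M, because the whole action set is reconsidered for the sampled state:

```python
import numpy as np

def incremental_with_model(N, M, t, sample_transition, c, goal=0):
    """Algorithm 3 (sketch): store each sampled transition in the model table S
    and update the sampled state with respect to all actions sampled so far."""
    S = np.full((N, M), -1, dtype=int)    # -1 stands in for "infinity" (not sampled)
    v = np.full(N, np.inf)
    v[goal] = 0.0
    a = np.full(N, -1, dtype=int)
    for _ in range(t):
        i, j, k = sample_transition()
        S[i, j] = k                        # store the sampled model entry
        for jj in range(M):                # complete update over all M actions, as in Algorithm 1
            kk = S[i, jj]
            if kk >= 0 and c[i][jj] + v[kk] < v[i]:
                v[i] = c[i][jj] + v[kk]
                a[i] = jj
    return v, a
```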
To investigate the behavior of this algorithm, let us define two auxiliary algorithms.

- First, a model-building Algorithm A corresponding to Steps 1 and 2 of Algorithm 1. This algorithm has reached its goal if all state/optimal action pairs on the optimum path have been sampled, regardless of the sampling order. The probability that this algorithm converges for a given state at last at time t is given by the recursive relationship (4). The expected number of samples to convergence for this algorithm is (10).
- Second, an incremental Algorithm B analogous to Algorithm 2. The difference is that the convergence definition for a state is modified. It is now sufficient for convergence if any path (rather than the optimal one) has been found by the algorithm. In other words, this algorithm has reached its goal if all states on the optimum path have been visited in the correct order, no matter whether the action sampled is optimum or not. The probability that this algorithm has converged for a given state at last at time t is given by the recursive relationship (15) with p_1 substituted for p:

P_B(t, l) = (1 - p_1) P_B(t-1, l) + p_1 P_B(t-1, l-1)    (22)

for t > 0, l > 0, and zero otherwise (except for P_B(t, 0), which equals unity). The expected number of samples to convergence is, by (21),

E_B(L) = \frac{L}{p_1}.    (23)

For both these algorithms running in parallel (i.e. processing the same sample sequence) to be equivalent to Algorithm 3, an additional condition would have to be satisfied. For each state on the optimum path sampled by Algorithm B, the model entry for the optimum action would have to be already filled by Algorithm A. This means that the two algorithms are not independent. The joint probability of both algorithms converging for a given state seems to be difficult to find. However, the probabilities of convergence of each of the two algorithms can be used to determine lower and upper bounds for Algorithm 3.

For no possible sampling order can Algorithm 3 converge earlier than either Algorithm A or Algorithm B alone. Consequently, the probability of convergence at last at time t has the upper bound

\min\{ P_A(t, l), P_B(t, l) \}.    (24)

On the other hand, even in the least favorable case, Algorithm 3 would not converge later than Algorithm A followed by a subsequently performed Algorithm B. So, P_3(t, l) is at least as high as the product of the probabilities P_A(t_1, l) and P_B(t_2, l) with any time partition t_1 + t_2 = t, and the lower
bound is

\max_{0 \le s \le t} P_A(s, l) P_B(t-s, l).    (25)

The same arguments lead to the bounds for the expected number of samples to convergence. The number of samples to convergence of Algorithm 3 cannot be smaller than that of Algorithm A or Algorithm B alone, that is, than either of (10) and (23). It also cannot be larger than the sum of both, which would correspond to running both algorithms in sequence. This proves the following theorem.

Theorem 2.3. The expected number of samples to convergence for a state of order L is

E_3(L) \in \langle \max\{E_A(L), E_B(L)\},\ E_A(L) + E_B(L) \rangle    (26)

with

E_A(L) = \frac{1}{p} \sum_{l=1}^{L} \frac{1}{l}    (27)

and

E_B(L) = \frac{L}{p_1}.    (28)

Using approximation (11), we get closed formulae for the lower

\max\{E_A(L), E_B(L)\} \approx \max\left\{ \frac{\log L}{p}, \frac{L}{p_1} \right\}    (29)

and upper

\frac{\log L}{p} + \frac{L}{p_1}    (30)

bounds of the number of samples to convergence E_3(L).

3. Discretization effects

Suppose the discrete formulation of a control problem has been obtained by discretization of the corresponding continuous problem. Suppose further that the domain of each state and action variable has been discretized into h intervals. Then, with n state and m action variables, there are h^n discrete states and h^m discrete actions. If all states and actions are equally probable, the probability of sampling a state is

p_1 = h^{-n}    (31)

and that of sampling a state/action pair

p = h^{-(n+m)}.    (32)

Refining the grid by an integer factor c, the optimum path will cross c times more elementary hypercubes in the state space than before. So the length of the optimum path tends (except for fluctuations caused by rounding) to grow linearly with the number of discretization intervals h:

L \approx Kh    (33)

with K being a problem-dependent constant. Then, the expected numbers of samples to convergence given by (10) and (21), as well as the bounds (29) and (30), can be written as

E_1(L) \approx \frac{\log L}{p} = (\log K + \log h) h^{n+m}
E_2(L) = \frac{L}{p} = K h^{n+m+1}    (34)
E_3(L) \in \left\langle \max\left\{ \frac{\log L}{p}, \frac{L}{p_1} \right\},\ \frac{\log L}{p} + \frac{L}{p_1} \right\rangle = \langle h^{n+1} \max\{(\log K + \log h) h^{m-1}, K\},\ h^{n+1} [(\log K + \log h) h^{m-1} + K] \rangle

The relationship between Algorithms 1 and 2 is clear. The ratio

\frac{E_1}{E_2} = \frac{\log h + \log K}{hK}    (35)

expresses that Algorithm 1 is always superior to Algorithm 2. The superiority of Algorithm 1 grows both with control problem complexity (characterized by K) and with the precision required (characterized by h).

The relationship between Algorithm 3 and the two other algorithms depends on which of the partial Algorithms A and B contributes more to the computing time. If (\log h + \log K) h^{m-1} > K, the larger contribution is that of Algorithm A. Algorithm 3 then requires a number of samples that is larger than that necessary for Algorithm 1 by a factor between 1 and 1 + c, with c = K / ((\log h + \log K) h^{m-1}). If m > 1 (i.e. problems with multiple action variables), the above condition is mostly satisfied. With growing m, c converges to zero. Then, the complexity of Algorithm 3 in the number of samples required is close to that of Algorithm 1. But in this case, there are h^m actions to be considered for each sample, and the computing time per sample grows rapidly. So for problems for which Algorithm 3 is relatively efficient in terms of the number of samples to convergence, it is inefficient in terms of computing time.

For m = 1 (i.e. problems with a single action variable), there are the following ratios

\frac{E_1}{E_3} \approx \frac{\log h + \log K}{K}    (36)

and

\frac{E_3}{E_2} \approx \frac{1}{h}    (37)

which determine the position of Algorithm 3 between Algorithms 1 and 2. If h > K, the performance of Algorithm 3 tends to be closer to that of Algorithm 1 than to that of Algorithm 2. If K > h, it will be closer to that of Algorithm 2. In other words, for a single action variable, Algorithm 3 may well cope with high precision requirements, but not with problem complexity.
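A small numeric sketch of the estimates in (34); the values of n, m, h and K below are hypothetical and serve only to illustrate the formulae:

```python
import math

def complexity_estimates(n, m, h, K):
    """Closed-form sample-complexity estimates (34) for a grid-discretized problem."""
    p1 = h ** (-n)              # probability of sampling a given state, Eq. (31)
    p = h ** (-(n + m))         # probability of sampling a state/action pair, Eq. (32)
    L = K * h                   # approximate order of a state, Eq. (33)
    E1 = math.log(L) / p                      # batch algorithm (Algorithm 1)
    E2 = L / p                                # incremental, model-free (Algorithm 2)
    E3_bounds = (max(math.log(L) / p, L / p1),
                 math.log(L) / p + L / p1)    # incremental with model (Algorithm 3)
    return E1, E2, E3_bounds

print(complexity_estimates(n=2, m=1, h=10, K=2))    # hypothetical example
```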
4. Computational experiment

A simple computational experiment illustrates these concepts. The linear second-order plant

\dot{z}_1 = -r z_1 - r^2 f z_2 + g u
\dot{z}_2 = z_1 / f    (38)

with constants r = 0.1, f = 10^2, g = 10^2, has been transformed to a nonlinear plant by x_1 = log(z_1), x_2 = log(z_2) and discretized to 101 × 101 states, with 21 discrete actions. The time has been discretized using Δt = 5.

The behavior of the three algorithms is depicted in the top left panel of Fig. 1. It shows the development of the proportion of converging states (i.e. the states for which a correct value of the cumulative utility function has been determined) as a function of the number of sampled values. The first 6,000,000 randomly generated samples are considered.

Fig. 1. Proportions of converging states and estimated convergence probabilities.

The proportions P_a for a = 1, 2, 3, denoting Algorithms 1, 2 and 3, satisfy the condition P_1 ≥ P_3 ≥ P_2. The batch Algorithm 1 is clearly the most efficient. The performance of the incremental Algorithm 3 relatively rapidly approaches that of Algorithm 1, consistently with the ratios (36) and (37), since h > K. However, the cost of this relatively good performance is massive computation: 21 operations (a value comparison and an update) for each sample. In contrast, the optimization in Algorithm 1 is done only once and is negligible compared with the number of samples. So if the number of computing operations is considered, Algorithm 3 may be less efficient even than the model-free Algorithm 2, which processes only the currently sampled action (Hrycej, 1998).

Further, the simulation results have been compared with the estimates of Theorems 2.1, 2.2 and 2.3. The proportion of states with a correct value of the cumulative utility function measured in the simulation experiment can be compared with the weighted average probability of determining the optimum path. The average probability is to be computed from the states of different orders, weighted by the order distribution observed in the problem. Theoretical and simulated numbers of samples to convergence can also be compared.
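The theoretical curves referred to here follow from the recursions (4) and (15) (and (22) for the bounds on Algorithm 3). A sketch of their direct evaluation is given below; the weighting by the observed order distribution is omitted, and only the previous time step is kept in memory:

```python
import numpy as np

def p1_curve(t_max, L, p):
    """Recursion (4): P1(t, L) for the batch Algorithm 1, for t = 0 .. t_max."""
    prev = np.zeros(L + 1)
    prev[0] = 1.0                      # P1(t, 0) = 1; P1(0, l) = 0 for l > 0
    curve = [prev[L]]
    for _ in range(t_max):
        cur = np.empty(L + 1)
        cur[0] = 1.0
        for l in range(1, L + 1):
            cur[l] = (1 - p * (L - l + 1)) * prev[l] + l * p * prev[l - 1]
        curve.append(cur[L])
        prev = cur
    return np.array(curve)

def p2_curve(t_max, L, p):
    """Recursion (15) for Algorithm 2; with p replaced by p1 it becomes
    recursion (22) for the auxiliary Algorithm B used in the bounds on Algorithm 3."""
    prev = np.zeros(L + 1)
    prev[0] = 1.0
    curve = [prev[L]]
    for _ in range(t_max):
        cur = np.empty(L + 1)
        cur[0] = 1.0
        for l in range(1, L + 1):
            cur[l] = (1 - p) * prev[l] + p * prev[l - 1]
        curve.append(cur[L])
        prev = cur
    return np.array(curve)
```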
As stated above, there are 10,201 states and 21 actions. The order of states varies between zero and 14, the average being 11. The probabilities are p_1 = 0.0000980296 and p_2 = 0.0476190. The results for the three algorithms are plotted in the remaining three panels of Fig. 1. Since computing the lower bound for Algorithm 3 with the help of the recursive formula (25) for over a million samples is computationally infeasible, only the upper bound is given.

It has to be noted that the probability estimates and the proportions of converging states in the algorithms' simulations are not perfect counterparts of each other. The main reason for this is that convergences for different states are not stochastically independent processes. If a large proportion of states has converged, the remaining ones are more probable to converge, too. This makes the proportion curve steeper in the region where the proportion is high. The reason for introducing this conceptual inconsistency was that computing probability estimates by sampling from series of simulations would require huge computing times. The difference in the definitions causes the differences between theoretical estimates and simulation results, the most striking of which is that the proportions received in the simulation exceed the estimated upper bound of probability for Algorithm 2 in the bottom right panel of Fig. 1. Nevertheless, estimates and simulation results show acceptable correspondence.

The expected numbers of samples to convergence are given in Table 1. In order to avoid the computationally intractable assessment of the expected number of samples via genuine statistics of repeated simulations, the cumulative probability distribution of the number of samples to convergence is approximated by the proportion of converging states in the algorithms' simulation. The expected number of samples is computed from such an approximated distribution. Also here, good correspondence between theoretical estimates and simulation results can be observed. This time, simulated results for Algorithm 3 remain within the computed bounds.

Table 1
Comparison between theoretical and simulated expected numbers of samples to convergence

Algorithm   Number of samples
            Theoretical            Simulated
1           507,601                509,671
2           2,361,024              2,682,600
3           ⟨507,617, 620,031⟩     602,887

5. Summary and limitations

The investigation of simplified algorithms in a discrete
state space suggests the following conclusions.

- The batch algorithm is superior to the incremental ones in the number of samples necessary for convergence.
- The incremental algorithm using an explicit model converges faster with incoming measurement samples, since a complete controller optimization is performed for each sampled state. However, it requires more computations than the model-free algorithm, and can turn out to be less efficient if sampling is not the limiting factor.

The discrete state space algorithms are only rough approximations of real neurocontrol algorithms in continuous state space. As stated in the Introduction, for discrete problems there are hardly any generally applicable alternatives to dynamic programming, and the analysis of theoretical complexity remains relatively simple. By contrast, there is a broad variety of computing methods for continuous problems. There are different optimization frameworks such as backpropagation through time (Werbos, 1977), which is close to classical optimal control approaches proposed, for example, by Tsypkin (1971), or reinforcement learning (Barto, Sutton & Anderson, 1983; Werbos, 1991). Also, the optimization methods vary from first-order gradient descent through second-order methods such as Kalman training (Feldkamp et al., 1991) to global optimization (Hrycej, 1997). All these methods have their own complexity behavior. For algorithms of incremental type, consistency with the conditions of stochastic approximation (Dvoretzky, 1956) and its effects on complexity have to be considered.

A straightforward generalization of the conclusions made on discrete systems is possible only for nonlinear control algorithms based on dynamic optimization that use an approximation of the continuous space through a discrete grid (see Section 3) on which the functionals sought are represented by interpolation on a look-up table (Kreisselmeier & Birkhölzer, 1994). Their complexity is comparable in order of magnitude to that of their discrete counterparts if numerical convergence issues are carefully solved.

There are also deviations from the random action assumption of Algorithms 1-3. For practical reasons, the plant cannot be arbitrarily destabilized by arbitrary actions. Furthermore, many algorithms of adaptive type learn during the control, deliberately performing actions only in (or closely around) the optimum known so far. Convergence to better actions (generalization across states) is usually based (explicitly or implicitly) on gradient descent. This will accelerate the convergence by descending down the current attractor but may also prevent (if not combined with the random principle) the search through the complete search space. A complementary strategy is deliberate exploration, extensively studied in dual and stochastic control, the foundations of which have been laid in Bellman (1961) and Feldbaum (1965). Here, the distribution of actions is biased towards actions which are expected to reduce parameter uncertainty.
In neurocontrol, this is also known as the `exploration versus exploitation trade-off'.

Acknowledgements

I am greatly indebted to Wilm Eggbert and the referees for valuable hints for the improvement of the presentation.

References

Amari, S.-I., & Murata, N. (1992). Statistical theory of learning curves under entropic loss criterion. Neural Computation, 7, 140-152.
Åström, K., & Wittenmark, B. (1989). Adaptive control. Reading, MA: Addison-Wesley.
Barto, A., Sutton, R., & Anderson, C. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man and Cybernetics, 5, 834-846.
Baum, E. B., & Haussler, D. (1989). What size net gives valid generalization? Neural Computation, 4 (4), 605-618.
Bellman, R. (1961). Adaptive control processes: a guided tour. Princeton, NJ: Princeton University Press.
Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. (1989). Learnability and the Vapnik-Chervonenkis dimension. Journal of the Association for Computing Machinery, 36, 929-965.
Dvoretzky, A. (1956). On stochastic approximation. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability (vol. 1, pp. 39-55). Berkeley.
Feldbaum, A. (1965). Optimal control theory. New York: Academic Press.
Feldkamp, L., Puskorius, G. V., Davis, L. I., & Yuan, F. (1991). Decoupled Kalman training of neural and fuzzy controllers for automotive systems. In Proc. IEEE VTS Workshop `Fuzzy and Neural Systems, and Vehicle Applications', Tokyo.
Feldkamp, L., Puskorius, G. V., & Prokhorov, D. (1997). Unified formulation for training recurrent networks with derivative adaptive critics. In Proc. IEEE ICNN '97, Houston.
Horne, B., & Hush, D. (1996). Bounds on the complexity of recurrent neural network implementations of finite state machines. Neural Networks, 9 (2), 243-252.
Hrycej, T. (1997). Neurocontrol: towards an industrial control technology. New York: Wiley.
Hrycej, T. (1998). Complexity of neurocontrol algorithms. In Proc. International Joint Conference on Neural Networks 1998, Anchorage.
Kreisselmeier, G., & Birkhölzer, T. (1994). Numerical nonlinear regulator design. IEEE Transactions on Automatic Control, 39 (1), 33-46.
O'Hagan, A. (1994). Kendall's advanced theory of statistics (vol. IIB): Bayesian inference. London: Arnold.
Schmitt, M. (1998). Identification criteria and lower bounds for perceptron-like learning rules. Neural Computation, 10, 235-250.
Tsypkin, Y. (1971). Adaptation and learning in automatic systems. New York: Academic Press.
Werbos, P. (1977). Advanced forecasting methods for global crisis warning and models of intelligence. General Systems Yearbook, 25-38.
Werbos, P. (1991). A menu of designs for reinforcement learning over time. In W. Miller, R. Sutton & P. Werbos (Eds.), Neural networks for control. Cambridge, MA: MIT Press.
Yang, H., & Amari, S.-I. (1998). Complexity issues in natural gradient descent method for training multilayer perceptrons. Neural Computation, 10, 2137-2157.