KNOWLEDGE-BASED AND LEARNING CONTROL
Copyright © IFAC Design Methods of Control Systems, Zurich, Switzerland, 1991
DESIGN OF A RULE-BASED CONTROLLER FOR THE INVERTED PENDULUM

H. Kiendl and M. Krabs
Department of Electrical Engineering, University of Dortmund, Dortmund, Germany
Abstract. An iterative learning process for the design of a rule-based two-level controller is discussed which uses the previously described ROSA-Method for rule-based modeling. Learning starts with the observation of the system behaviour and yields a controller consisting of IF-THEN rules and an inference strategy based on a relevance index derived from a statistical evaluation of the observation data. The design process is illustrated by application to the benchmark problem "stabilization of the inverted pendulum on a cart". The analytical plant model has not been used for anything other than simulation. The resulting controller requires only little information about the system state, and its rules turn out to be transparent to human understanding.

Keywords. Control system design; artificial intelligence; learning systems; inference processes; nonlinear control systems; bang-bang control.
SURVEY

We present an iterative learning process to generate rule-based controllers consisting of IF-THEN rules. This design procedure is based on the observation of the plant subjected to control actions having a random component. The rules are obtained by modeling the observed and evaluated system behaviour using rule-based modeling by the ROSA-Method (Rule Oriented Statistical Analysis) introduced by Kiendl and Krabs (1989). The learning process does not require an analytical plant model; it is sufficient either to simulate the plant behaviour or to observe a real plant for an adequately long time. We stress this aspect because it is of special interest to have a design procedure at one's disposal when classical analytical methods fail. In this paper, however, we choose the plant "inverted pendulum on a cart" for illustration (Fig. 1). Stabilization of this plant is a nontrivial and well-known benchmark problem which has already been treated by many different controller design methods. We assume that the force F can take only the two values F = ±F_max, and we treat the plant as a discrete-time system with the sampling period T = 0.02 s.

Fig. 1. Benchmark problem "inverted pendulum on a cart": F is the driving force, φ the pendulum angle, x the position of the cart (-2.4 m ≤ x ≤ 2.4 m). The aim is to hold pendulum angle and cart position within the restrictions given by the bounds for φ and x shown in the figure.

The learning process yields a rule-based controller consisting of a set of rules having the predetermined structure

"IF the system is at sampling instant k in situation E_λ
 THEN apply F = F_max sign(φ)"
(or F = -F_max sign(φ), respectively).     (1)

An inference strategy derives a control action from all rules that are applicable simultaneously. Such a controller can be interpreted as a generalization of the conventional two-level controller F = F_max sign(φ), which does not stabilize this plant. This benchmark problem is a more advanced application of the ROSA-Method than the design of the rule-based controller in the work of Kiendl, Krabs and Fritsch (1991).
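For concreteness, the discrete-time plant can be sketched as follows. The paper does not list the physical parameters; the values below are the common cart-pole parameters of Barto, Sutton and Anderson (1983) and should be treated as assumptions, as should the helper names `step`, `two_level` and `failed`. The bounds |x| < 2.4 m and |φ| < 12° follow Fig. 1 and Fig. 3.

```python
import math

# Assumed cart-pole parameters (Barto, Sutton and Anderson, 1983);
# the paper itself specifies only T = 0.02 s and F = +-F_max.
G = 9.81          # gravity, m/s^2
M_CART = 1.0      # cart mass, kg
M_POLE = 0.1      # pole mass, kg
L_HALF = 0.5      # pole half-length, m
M_TOTAL = M_CART + M_POLE
F_MAX = 10.0      # magnitude of the bang-bang force, N
T = 0.02          # sampling period, s (as in the paper)

def step(state, force):
    """One Euler integration step; state = (x, x_dot, phi, phi_dot)."""
    x, x_dot, phi, phi_dot = state
    sin_p, cos_p = math.sin(phi), math.cos(phi)
    tmp = (force + M_POLE * L_HALF * phi_dot ** 2 * sin_p) / M_TOTAL
    phi_acc = (G * sin_p - cos_p * tmp) / (
        L_HALF * (4.0 / 3.0 - M_POLE * cos_p ** 2 / M_TOTAL))
    x_acc = tmp - M_POLE * L_HALF * phi_acc * cos_p / M_TOTAL
    return (x + T * x_dot, x_dot + T * x_acc,
            phi + T * phi_dot, phi_dot + T * phi_acc)

def two_level(state):
    """Conventional two-level controller F = F_max sign(phi)."""
    return F_MAX if state[2] >= 0 else -F_MAX

def failed(state):
    """Failure: violation of the bounds |x| < 2.4 m, |phi| < 12 deg."""
    x, _, phi, _ = state
    return abs(x) > 2.4 or abs(phi) > math.radians(12.0)
```

Starting near the upright position and applying `two_level` at every instant, one can count the simulation steps until `failed(state)` first becomes true; this is the performance measure used by the learning process below.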
THE ROSA-METHOD

The ROSA-Method generates a set of IF-THEN rules from the observation of the input-output behaviour of dynamical systems. The following outline covers only the elements required here; for details see Kiendl and Krabs (1989).

The observation data required are sequences of discrete-time events derived from the time functions of input and output variables of the system at sampling instants k. The events have to be defined according to the context of interest and to the previously available knowledge about the system to be modeled. In the example of the inverted pendulum we intend to obtain a controller which uses less information about the system state than the neural net approach of Barto, Sutton and Anderson (1983), we want to take advantage of the symmetry of the problem, and we want to obtain rules having the structure (1). Taking this into account we choose the following six input events

e1: "pendulum angle is good"          (|φ| < 1°)
e2: "pendulum angle is medium"        (1° ≤ |φ| < 6°)
e3: "pendulum angle is bad"           (|φ| ≥ 6°)
e4: "pendulum is falling"             (sign(φ) = sign(φ̇))
e5: "cart position is good"           (|x| < 0.8 m)
e6: "signs of deviations are equal"   (sign(φ) = sign(x))     (2)

and the output event a:

"control action is F = F_max sign(φ)".     (3)

Observation of the input and output variables yields input sequences e_i(k) and an output sequence a(k), each consisting of the elements "1", "0" and "?", indicating whether at instant k the corresponding event has been present, not present, or whether its presence could not be detected. Using the above input events e_i we define "input situations" E_λ. An example of an input situation is "E_1: e_1 = 1 and e_2 = 0 and e_5 = 1". If the input sequences match the situation E_λ at instant k we say E_λ(k) = TRUE. The chosen set of input situations defines a rule space consisting of rules having the general form

R_λ:  "IF input situation E_λ(k) = TRUE THEN a(k) = 1"
R̄_λ:  "IF input situation E_λ(k) = TRUE THEN a(k) = 0".     (4)

We consider a rule as "relevant" if a statistical analysis of the observed input and output sequences yields a "significant" difference between the conditional probability

p_λ = p(a(k) = 1 | E_λ(k) = TRUE)     (5)

and the probability

p = p(a(k) = 1).     (6)

We estimate the probability p_λ by the conditional relative frequency p̂_λ with which a = 1 is observed when E_λ = TRUE. The probability p is estimated by the relative frequency p̂ of a = 1. The accuracy of the estimate p̂_λ depends on the number of occurrences of E_λ = TRUE; that of p̂ depends on the number of instants k that have been observed. As a measure for these accuracies we determine the confidence intervals V_λ and V which include the true values of p_λ and p, respectively, with a probability of 0.95, assuming the events are approximately binomially distributed.

Now we define the relevance index J_B assigned to a rule R_λ/R̄_λ according to the "relative distance" d_1/d_2 between the confidence intervals (Fig. 2): if V and V_λ are not disjunct (case a), we consider the rule R_λ/R̄_λ as not relevant and set J_B = 0; if V and V_λ are disjunct and p̂_λ > p̂ (case b), we set the relevance index of rule R_λ to J_B = d_1/d_2; if V and V_λ are disjunct and p̂_λ ≤ p̂ (case c), we set the relevance index of rule R̄_λ to J_B = -d_1/d_2.     (7)

Fig. 2. Definition of the relevance index J_B by the relative distance d_1/d_2 between the confidence intervals V_λ and V.

The absolute value of J_B is a measure for the relevance of a rule; its sign decides whether R_λ or R̄_λ is valid. The relevance index is a credit value from the interval ]-1, 1[ assigned to each rule. The condition |J_B| > 0 defines the minimal requirements to be met to raise "observations" to "rules".

To evaluate the set of rules we use the following inference strategy, which is simple compared to Zuenkov (1990) but sufficiently effective: apply the best applicable rule according to |J_B|.

The application of the ROSA-Method is supported by the inductive expert system ROSA developed by the authors.
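A minimal sketch of this machinery, under stated assumptions: the paper does not say how the 0.95 confidence intervals are computed, so a normal approximation is used here, and the "relative distance" d_1/d_2 is read as the gap between the intervals divided by their total span (one plausible interpretation). The rule representation, the function names, and the J_B values in the demo are hypothetical.

```python
import math

def confidence_interval(successes, n, z=1.96):
    """Approximate 0.95 binomial confidence interval for a relative
    frequency (normal approximation -- an assumption; the paper only
    states that the intervals cover the true values with prob. 0.95)."""
    if n == 0:
        return (0.0, 1.0)
    p_hat = successes / n
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return (max(0.0, p_hat - half), min(1.0, p_hat + half))

def relevance_index(a, e_active):
    """Relevance index J_B of a rule pair R_lambda / R_bar_lambda.

    a        -- observed output sequence a(k) as 0/1 values
    e_active -- booleans, True where E_lambda(k) = TRUE
    Returns 0 if the intervals V_lambda and V overlap, otherwise
    +-d1/d2 with d1 = gap between the intervals and d2 = their span.
    """
    n_lam = sum(e_active)
    k_lam = sum(ai for ai, ei in zip(a, e_active) if ei)
    v_lam = confidence_interval(k_lam, n_lam)
    v = confidence_interval(sum(a), len(a))
    if v_lam[0] <= v[1] and v[0] <= v_lam[1]:
        return 0.0                                   # case (a): not relevant
    d2 = max(v_lam[1], v[1]) - min(v_lam[0], v[0])   # total span
    if v_lam[0] > v[1]:                              # case (b): rule R_lambda
        return (v_lam[0] - v[1]) / d2
    return -(v[0] - v_lam[1]) / d2                   # case (c): rule R_bar

def infer(rules, events):
    """Inference strategy of the paper: apply the best applicable rule
    by |J_B|; the sign selects a(k) = 1 (R_lambda) or a(k) = 0 (R_bar)."""
    applicable = [r for r in rules
                  if all(events.get(name) == val
                         for name, val in r["if"].items())]
    if not applicable:
        return None
    best = max(applicable, key=lambda r: abs(r["J_B"]))
    return 1 if best["J_B"] > 0 else 0
```

For instance, a rule learned for the situation E_1 could be encoded as `{"if": {"e1": 1, "e2": 0, "e5": 1}, "J_B": 0.4}` (values made up); `infer` then returns 1 for matching events, i.e. F = F_max sign(φ).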
THE ITERATIVE LEARNING PROCESS

The starting point of our controller design procedure is to generate a control sequence that leads to good system performance and thus can be used to learn from. To get this control sequence we can assume that a conventional two-level controller, although not stabilizing the plant, does the right thing most of the time. So it should be possible to "learn" a better controller by observing the system operating with the two-level controller, provided that the correct decisions can be distinguished from the wrong ones. These considerations lead to the following approach:

- We intend to design a controller which operates as follows: every second sampling period a conventional two-level controller is applied. In the "free periods" in between, a rule-based controller is applied to correct the wrong decisions of the conventional controller.

- To learn the rule-based controller we observe and evaluate the system performance, operating with the conventional two-level controller and a partly random controller active in between, as described below. The random actions improve the chance to find good rules which differ from the reactions of the conventional controller. For the rules we provide the general form (1). The input situations E_λ which the rules refer to are based on the six events (2). This is the only information about the system state the rule-based controller has access to. The iterative learning process is illustrated by Fig. 3.

Fig. 3. Structure of the iterative learning process to generate a rule-based controller: 100 simulation runs (each up to a pole angle of 12°), rule-based modeling of the longest run, keeping all rules with |J_B| > 0; when no new rules are found, the final controller results (here: 34 rules).

The main steps are:

- We start to balance the pendulum with a completely random control sequence in the free periods. This leads to a failure after a very short time. As a failure we regard any violation of the restrictions shown in Fig. 1. We perform 100 simulations, each starting at vertical pendulum position and medium cart position and ending at failure.

- During each simulation we make a protocol of the appearance of the predefined input events (2) and the output event (3). The best (longest) simulation is selected to learn from, assuming that it contains (on average) better control actions than the others. We then analyse the protocol of this simulation by means of the ROSA-Method, which yields rules R_λ and R̄_λ. Notice that these rules can be interpreted immediately as control rules of the form (1), as they come from a control sequence that yields a system performance which is good on average.

- The next step is to replace 40 % of the random control actions by the output of the rule-based controller consisting of the previously found rules and the inference strategy described above. We now have a controller which is a combination of a conventional two-level controller (50 % of the control actions), a random generator (30 % of the control actions) and a rule-based controller (20 % of the control actions), as shown in Fig. 3. With this modified controller we repeat the first step. Due to the contribution of the rule-based control actions, the number of simulation steps until failure now increases. This holds for the best of 100 simulations (Fig. 4, curve a) as well as for the average (Fig. 4, curve b). The rule-based controller improves by modeling the longest of the hundred simulation runs as before. The percentages 50 %, 30 % and 20 % remain constant in each iteration. Due to the 30 % random control actions the curves in Fig. 4 do not tend to infinity: the number of steps until failure goes up to 800, which means a balancing time of 16 seconds. The average length of the simulations increases from 50 to 200 steps.

- After about 20 iterations of the two steps above the set of rules becomes stable, that is, no new rules are found. Replacing all random control actions by the rule-based actions, we get a rule-based controller with the provided structure. It turns out that it stabilizes the plant.

Fig. 4. Improvement of the set of rules during the learning process consisting of 20 iterations: curve a shows the number of simulation steps until failure for the best of 100 runs, curve b the corresponding average.
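The mixing of the three controllers during learning can be sketched as the following action selector. The function name, the `rule_controller` callable and its `None` convention for "no applicable rule" are assumptions; the first, fully random iteration corresponds to p_random = 1.0.

```python
import random

def select_force(k, state, rule_controller, f_max=10.0,
                 p_random=0.6, rng=random):
    """Control action at sampling instant k during learning.

    Even instants: the conventional two-level controller acts.
    Odd ("free") instants: with probability p_random a random +-f_max
    action is taken, otherwise the current rule-based controller
    decides. p_random = 0.6 reproduces the paper's mix of 50 %
    two-level, 30 % random and 20 % rule-based control actions.
    rule_controller is a hypothetical callable returning +-f_max,
    or None when no rule applies.
    """
    phi = state[2]
    if k % 2 == 0:                  # conventional two-level controller
        return f_max if phi >= 0 else -f_max
    if rng.random() < p_random:     # random exploration action
        return rng.choice((f_max, -f_max))
    action = rule_controller(state)
    if action is None:              # no applicable rule: stay random
        return rng.choice((f_max, -f_max))
    return action
```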
RESULTS AND CONCLUSION

The obtained rule-based controller consists of IF-THEN rules that can be understood and interpreted. As an example we consider one of these rules:

IF   "pendulum angle is good"
     and "pendulum is not falling"
     and "cart position is good"
THEN "control action is not F = F_max sign(φ)".     (8)

This rule applies F = -F_max sign(φ) to the plant, which is opposite to the action of a conventional two-level controller. It can be interpreted as a "decelerating pulse" that decreases the velocity of the rising ("not falling") pendulum and thus decreases the overshoot amplitude of the pendulum angle, which is already close to the vertical position ("pendulum angle is good"). This pulse cannot do any harm to the cart because "cart position is good". Rules like this express the crucial features of the control law in a way closer to human thinking than a mathematical algorithm would. Therefore it should be possible to extract the basic message of the rules and transfer it to related systems. Moreover, we expect the controller to be robust with respect to parameter variations of the plant as long as they do not change the qualitative input-output behaviour.

Furthermore, the result of the learning process can be characterized as follows: although we start with very little knowledge about the system and quite a primitive controller, we obtain a set of rules that can stabilize the plant. The rules refer to a coarse quantization of the state variables (e.g. the sign of the pendulum velocity). This is little state information compared with the neural net approach of Barto, Sutton and Anderson (1983), who divided the system state into 162 regions, and compared with the full state information the neural net controller of Anderson (1989) requires. The learning process is driven by nothing more than the failure of control sequences; no information about the ideal position of either cart or pendulum is required, in contrast to Bondi, Casalino and Gambardella (1988), who supply an ideal trajectory to the learning system. Rules for correct control actions have been learned from empirical data that contain quite an amount of unfavourable control actions. This result shows that our ROSA-Method is able to extract rules from uncertain data and that the relevance index described before (Kiendl, Krabs and Fritsch, 1990) yields an adequate credit assignment. The rules may also be embedded in an expert system for expert control as suggested by Bieker (1986).

The model of the plant has not been used for anything other than simulation, and simulation can be substituted by observation data of the real system. Therefore we see a field of application for this design procedure in plants for which mathematical models suitable for conventional controller design methods are not available. These considerations suggest connections with qualitative modeling and simulation methods as proposed, e.g., by Kuipers (1989).

ACKNOWLEDGEMENT

This research was sponsored by the Deutsche Forschungsgemeinschaft (DFG). We thank M. Fritsch for his constructive contributions and critical review of the manuscript, and C. Thum, who performed the simulations necessary to obtain the rule-based controller.

REFERENCES

Anderson, C. W. (1989). Learning to Control an Inverted Pendulum Using Neural Networks. IEEE Contr. Syst. Mag., 9, No. 3, 31-37.
Barto, A. G., R. S. Sutton, and C. W. Anderson (1983). Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems. IEEE Trans. Syst., Man, Cybern., 13, No. 5, 834-846.
Bieker, B. (1986). Wissenserwerb für eine einfache Expertenregelung. atp, 28, No. 9, 448-457.
Bondi, P., G. Casalino, and L. Gambardella (1988). On the Iterative Learning Control Theory for Robotic Manipulators. IEEE J. Robotics Automat., 4, No. 1, 14-22.
Kiendl, H., and M. Krabs (1989). Ein Verfahren zur Generierung regelbasierter Modelle für dynamische Systeme. at, 37, No. 11, 423-430.
Kiendl, H., M. Krabs, and M. Fritsch (1990). Regelbasierte Analyse und Synthese von Regelungssystemen. In Automatisierungstechnik '90, VDI-Bericht 855, VDI-Verlag, pp. 153-162.
Kiendl, H., M. Krabs, and M. Fritsch (1991). Rule-Based Modelling of Dynamical Systems. In D. Popovic (Ed.), Analysis and Control of Industrial Processes, Vieweg, Braunschweig, pp. 217-231.
Kuipers, B. (1989). Qualitative Reasoning: Modeling and Simulation with Incomplete Knowledge. Automatica, 25, No. 4, 571-585.
Zuenkov, M. A., and A. G. Poletykin (1990). Agreement Method and Its Use in Expert Systems. Automat. and Remote Contr., 50, No. 9, Part 2, 1242-1249.