Nuclear Instruments and Methods in Physics Research A293 (1990) 502-506 North-Holland
BEYOND EXPERT SYSTEMS: LEARNING PROGRAMS IN LARGE-PHYSICS CONTROL SYSTEMS Scott H. CLEARWATER and Eric G. STERN
Departments of Computer Science and Physics, University of Pittsburgh, Pittsburgh, PA 15260, USA
Recently, expert systems have been used to diagnose problems in a variety of areas. The use of many of these systems has resulted in savings of time and money. Despite these advantages, expert systems are still difficult to construct, and this is one factor preventing their more widespread use. One particular inefficiency is commonly referred to as the "knowledge-acquisition bottleneck": the process whereby problem-solving and diagnostic knowledge is codified. Machine-learning systems are problem-solving programs that have been used successfully in many fields to overcome this bottleneck. Learning systems are able to search a hypothesis space of possible solutions systematically or heuristically while avoiding an exhaustive, combinatorially large search. The input to the learning system is given in abstract terms, and the program is left to learn generalizations based on regularities in the training data given to it. This paper reviews learning systems and discusses several paradigms: rule-based induction, neural nets and genetic algorithms. Comparisons between learning systems and conventional adaptation techniques, such as conventional knowledge acquisition, operations research, statistical analysis and closed feedback systems, will be discussed. Examples of potential applications of learning systems to monitoring, design and analysis in large-physics control systems will be given.

1. Introduction
The purpose of a control system is to automate some aspects of the control of an artifact. Large control systems, by their very nature, are highly complex and costly to build and maintain. This complexity is a result of the physical processes that necessitate a complex control system. The computer programs traditionally used in control systems are also highly complex, yet they lack the ability to reason about their actions and are unable to improve their performance without extensive human intervention. Expert systems [1] are computer programs that seek to increase human productivity by performing complex tasks requiring knowledge-intensive reasoning, such as designing or troubleshooting an experiment. While many successful expert systems have been built for many types of problems [2,3], their development is still costly, and this is an inhibiting factor in their more widespread use. The greatest cost, and the least automated part, of building an expert system occurs during knowledge acquisition. Knowledge acquisition is the process of converting an expert's knowledge into a directly machine-usable form. This conversion is traditionally done by programmers who laboriously interview the expert(s) and then encode their knowledge. In short, knowledge acquisition is how the expert system "learns". In fact, it is the programmer who must first do the learning, later encoding it. Computer learning systems offer the possibility of widening the knowledge-acquisition bottleneck by systematizing the encoding of knowledge into machine-usable form. In particular, learning systems can be useful in helping to design and recognize faults in a control system by considering relationships too complicated or ill-structured for traditional approaches.

2. Learning systems

2.1. Overview
Learning systems are programs that seek to improve their performance at some task [4]. Improved performance can be measured through improved reliability of the answers provided, or by a decrease in the time required to perform the task, or both. All learning systems contain, whether stated explicitly or not, the following components: an I/O system for interaction with the environment, a domain model for the domain under study containing the semantics and syntax for representing the knowledge of the domain and the data, a search program for generating possible solutions, and a performance element for measuring whether the program's performance is improving.
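As an illustration, the four components just listed can be sketched as a minimal program. Everything here (the toy threshold domain, the function names) is hypothetical and chosen only to show how an I/O system, domain model, search program and performance element fit together; it is not code from any of the systems discussed in this paper.

```python
# Minimal sketch of the four components of a learning system: training
# examples arrive through the I/O system, a domain model supplies the
# candidate hypotheses, a search program enumerates them, and a
# performance element scores each one. All names are illustrative.

def learn(examples, domain_model, generate_candidates, performance):
    """Return the best-scoring hypothesis for the training examples."""
    best, best_score = None, float("-inf")
    for hypothesis in generate_candidates(domain_model):   # search program
        score = performance(hypothesis, examples)          # performance element
        if score > best_score:
            best, best_score = hypothesis, score
    return best

# Toy domain: learn a threshold rule "x > t" from labeled numbers.
examples = [(0.2, False), (0.4, False), (0.6, True), (0.9, True)]
domain_model = {"thresholds": [0.1, 0.3, 0.5, 0.7]}        # domain knowledge

def generate_candidates(model):
    for t in model["thresholds"]:
        yield t

def performance(t, exs):
    # Number of examples the rule "x > t" classifies correctly.
    return sum((x > t) == label for x, label in exs)

best_t = learn(examples, domain_model, generate_candidates, performance)
print(best_t)  # 0.5, which classifies all four examples correctly
```

Restricting the candidate thresholds in the domain model is exactly the role knowledge plays in the next paragraph: the more the domain model constrains the hypothesis space, the less search is needed.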
Learning, or more generally problem-solving, is a process involving the use of domain-specific knowledge and search through a solution space to achieve a goal. Knowledge is represented by a set of syntax rules describing how domain-specific objects can be arranged and a set of semantics describing the meaning of the symbols that represent these objects . Knowledge includes facts about objects and how they behave and interact with other objects . Search is a procedure that
defines operations or translations through a space of possible solutions. The more knowledge available, the less search is required; the less knowledge available, the more search must be performed to solve the problem [5]. Generally, search is very time-consuming and should be avoided where possible by the use of knowledge that can reduce the search space.

2.2. Search

Since search is so crucial to any learning activity, we briefly review some of the standard methods. Search can be divided into two types: blind and heuristic. In blind search no domain knowledge or heuristics are employed. Examples of blind search are depth-first and breadth-first, as shown in fig. 1. In this paper, search is viewed as an exploration of the nodes in a tree-like search space. In depth-first search, the search follows all the children down one branch before trying another. In breadth-first search, the search visits all the nodes at a given level before going to a deeper level.

Fig. 1. Search tree showing names of nodes (parameters in the problem) and the links between the nodes showing how the search space is shaped. The numbers refer to the cost or the distance from the node to the goal node G1. In depth-first search, the order in which the nodes are explored would be: R, A1, C1, E1, C2, C3, B1, D1, D2, F1, F2, G1. In breadth-first search the order would be R, A1, B1, C1, C2, C3, D1, D2, E1, F1, F2, G1. A hill-climbing algorithm would follow R, B1, D1, thus getting caught in the local minimum at D1. In best-first search the order would be R, B1, D1, D2, F2. Finally, in a beam search with a beam width of 2, the search order would be R, A1, B1, D1, D2, F1, F2, G1.

In heuristic search, information about the problem is used to reduce the search. The heuristic involves using a cost function that allows nodes in the search space to be ordered so that better nodes can be investigated first. Examples of heuristic search are hill-climbing, best-first and beam search. Hill-climbing is the most familiar type of search to scientists and engineers and is used in nearly all numerical optimization programs. In hill-climbing the search proceeds by following the best child of the last expanded node. However, hill-climbing can get stuck in local optima such as ridges, plateaus and foothills in the search space, all of which get worse as the dimensionality of the problem increases. Best-first search is similar to hill-climbing, except that in the former the best-child nodes of all the previously expanded nodes are considered. Beam search is a heuristic version of breadth-first search in which only some number of the best nodes at a given level (the beam width) are searched.

2.3. Induction
Induction is a type of inference (translation, with respect to a set of symbols and rules, from one statement to another) that aims to preserve truth (though, unlike deduction, truth preservation is not guaranteed) and expands knowledge. The expansion of knowledge makes induction a candidate for learning systems. An example of a knowledge-based induction program is shown below.
(1) Input a partial domain theory, including (a) a description language containing the names of terms and their possible values, (b) performance thresholds, (c) stopping conditions and (d) previously saved rules.
(2) Input new examples via the I/O system.
(3) Test the current rule against the new examples and keep it if the performance requirements are met.
(4) If the stopping conditions specified in the performance element are met, then report results to the I/O system and stop.
(5) Generate the next rule to be tested according to the search paradigm, and go to (3).
This approach is called knowledge-based because the domain theory contains explicit specifications of the syntax and semantics for the rules, as well as any other heuristics that may be used, such as those used to constrain search. Choosing the terms for a learning program is somewhat analogous to choosing the coordinate system when solving a physics problem; choosing a good one facilitates problem-solving, and a bad one makes the problem difficult or impossible to solve. Different types of induction programs differ in the way they search for rules (step 5).

2.4. Genetic algorithms

Genetic algorithms are a hybrid between knowledge-based and neural-based approaches. They are induction programs that, in analogy to biological systems, improve their performance by combining rules (e.g., by crossover), according to a performance-based "fitness", into ones that perform better than the previous ones [6]. The components of the "genes" can be either numeric or symbolic. Like neural nets, genetic algorithms can use a significant amount of processing time.
Random mutations help keep the program out of local optima, at the cost of greater search time.
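The crossover, mutation and fitness-selection cycle described above can be sketched in a few lines. This is a generic toy example, not any particular published algorithm: the bit-string encoding, population size, mutation rate and target are all arbitrary choices made for illustration.

```python
import random

# Toy genetic algorithm: candidate "rules" are encoded as bit strings,
# combined by single-point crossover, perturbed by random mutation, and
# selected by a fitness function. All parameters are illustrative.
random.seed(1)

TARGET = [1, 0, 1, 1, 0, 1, 0, 1]           # the "best rule" to be found

def fitness(genome):
    return sum(g == t for g, t in zip(genome, TARGET))

def crossover(a, b):
    cut = random.randrange(1, len(a))       # single-point crossover
    return a[:cut] + b[cut:]

def mutate(genome, rate=0.05):
    # Random bit flips help escape local optima, as noted above.
    return [1 - g if random.random() < rate else g for g in genome]

population = [[random.randint(0, 1) for _ in TARGET] for _ in range(20)]
for generation in range(50):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]               # keep the fittest half
    children = [mutate(crossover(random.choice(parents),
                                 random.choice(parents)))
                for _ in range(10)]
    population = parents + children

print(max(fitness(g) for g in population))  # typically reaches 8, a perfect match
```

Because the fittest half is carried over unchanged, the best fitness never decreases; the mutation rate trades off exploration against the extra search time the text mentions.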
XV. EXPERT SYSTEMS
2.5. Connectionist approaches
Connectionism includes a class of learning programs commonly referred to as neural networks, Boltzmann machines [7], simulated annealers [8], etc. The idea behind this class of learners is an analogy to a brain containing a very large number of interconnected simple processing units. The inputs and outputs of the units must be numerical. Learning takes place by adjusting the strengths of the connections between the processing units so that the difference between the output of the network and the desired output is minimized. Thus neural networks are a type of parameter-fitting program. Because these programs have no explicit knowledge, they are purely search programs, and consequently they typically require prodigious computing time even for learning fairly simple patterns. The hyperspace of parameters is typically searched using a hill-climbing algorithm. As with all hill-climbing programs, it is possible for the parameters to settle into a relative optimum. Simulated annealers are like neural nets but seek the global optimum by randomly jumping out of local optima, at the cost of much more computation time.

2.6. Previous applications of learning systems in science

One of the most famous learning systems is meta-DENDRAL [9], which was used to learn rules for the breaking of chemical bonds in a mass spectrometer. More recently its descendant, the RL program [10], has been used for learning diagnostic rules for a particle accelerator [11]. The RL program is discussed in more detail in section 4.

3. Comparisons with optimization techniques

3.1. Operations research

Operations research seeks mathematical solutions to problems of optimization. In some sense optimization is a learning problem where the problem is some sort of minimization based on parameter fitting. While the results are rigorous and there have been many breakthroughs in this field [12], the approach makes many assumptions about what is being optimized. For example, all the data must be numerical.
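The parameter-fitting view of connectionist learning and numerical optimization can be made concrete with a single linear unit whose "connection strengths" are adjusted by hill-climbing (gradient descent) on the output error. This is only an illustrative sketch of the idea, reduced to one unit; the data and learning rate are invented for the example.

```python
# Single-unit sketch of learning as parameter fitting: the connection
# strength w and bias b are adjusted by gradient descent so that the
# squared difference between the unit's output and the desired output
# is minimized. Purely illustrative; not a full neural network.

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]   # desired mapping: y = 2x + 1

w, b = 0.0, 0.0                                # initial connection strengths
rate = 0.1                                     # hill-climbing step size
for _ in range(2000):
    for x, y in data:
        out = w * x + b                        # unit output
        err = out - y                          # deviation from desired output
        w -= rate * err * x                    # descend the error surface
        b -= rate * err

print(round(w, 2), round(b, 2))                # converges close to 2.0 and 1.0
```

Note that everything here is numerical, which is exactly the limitation the surrounding text attributes to optimization-based approaches: a symbolic status value such as ALARM has no gradient to descend.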
Certainly many control-system data are numerical, but some are also symbolic, such as the possible values of a status indicator (ALARM, WARNING, NORMAL). Also, many quantities are numerical but strictly limited to integer values, which makes them highly discontinuous. Finally, programs used in this field are often unable to handle cases where some of the data are missing.

3.2. Statistical analysis

Statistics provides another area where the results are rigorously defensible. For example, Bayesian analysis
has long been used to make predictions. However, Bayesian and other statistical methods suffer from the same malady as operations research: there is no guarantee that the assumptions of the techniques hold for the data under consideration. In other words, the assumed distribution functions may not correctly reflect the underlying distributions in the data. The problems with parametric statistics are overcome, in part, by the use of nonparametric statistics. The recently developed CART methodology [13] has been used to learn classification rules in a number of domains. CART also has the advantage of being able to use examples that contain missing data.

3.3. Closed feedback systems

Feedback systems have been widely employed in control systems. However, these systems are not designed for, nor are they capable of, learning. That is, they cannot actively change to a state of higher organization; they can only change reactively [14].

3.4. Summary

Each of the above methods has been shown to work very well for certain types of problems. However, in many cases they have the disadvantage that real-world situations nullify their assumptions and make them inappropriate.

4. Applications

4.1. The learning program

We have used a knowledge-based induction program in several areas of high-energy physics (HEP). The particular learning program used was RL4 (written in LISP), which has been used previously to learn diagnostic rules for recognizing errors in a beam line. The reason for using a program like RL is that it is not tied to the assumptions that limit the applicability of other systems. For example, the data can be numerical or symbolic, and the search paradigm can be changed to employ a variety of heuristics. These advantages are summed up by saying that RL is more general than many other learning or optimization programs and makes its assumptions more explicit and changeable. RL4 is a descendant of earlier versions of RL, which derived from meta-DENDRAL.
RL searches a parameter space to form IF-THEN rules, where the "if" part contains combinations of the parameters and the "then" part corresponds to the concept defined by the rule. Acceptable combinations are determined by testing the rules against performance thresholds, as discussed below.
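A hedged sketch of how such a test might look: a candidate rule is kept only if it covers enough of the positive examples (TP/P), covers few enough of the negative examples (FP/N), and the ratio of the two fractions is high enough. The rule encoding, threshold values and toy events below are all invented for illustration and are not RL4's actual representation.

```python
# Illustrative filter for candidate IF-THEN rules, in the spirit of the
# performance constraints described in the text: minimum TP/P, maximum
# FP/N, and a minimum ratio of the two. Thresholds here are arbitrary.

def satisfies(rule, positives, negatives,
              min_tp_frac=0.8, max_fp_frac=0.1, min_ratio=4.0):
    """Keep a rule only if it covers enough positives and few negatives."""
    tp = sum(rule(ex) for ex in positives)      # true positives
    fp = sum(rule(ex) for ex in negatives)      # false positives
    tp_frac = tp / len(positives)               # TP/P
    fp_frac = fp / len(negatives)               # FP/N
    ratio = tp_frac / fp_frac if fp_frac else float("inf")
    return tp_frac >= min_tp_frac and fp_frac <= max_fp_frac \
        and ratio >= min_ratio

# Toy events described by a single hypothetical attribute "total_et".
positives = [{"total_et": 250}, {"total_et": 310}, {"total_et": 220}]
negatives = [{"total_et": 90}, {"total_et": 150}, {"total_et": 230}]

rule = lambda ex: ex["total_et"] > 200          # IF (total_et > 200) THEN ...
print(satisfies(rule, positives, negatives))    # False: a negative leaks through
```

Here the candidate rule is too general: it catches all three positives (TP/P = 1.0) but also one negative (FP/N = 1/3), failing both the FP/N ceiling and the ratio floor, so a search program would go on to specialize it.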
The input to RL is:
(1) A set of training examples classified into positive and negative instances of various concepts, with no assumptions about the correctness of the classifications.
(2) A partial theory of the domain, or half-order theory (HOT), consisting of the vocabulary (names for the parameters) of the examples, relationships among the terms in the vocabulary, and heuristics for pruning the search space. These pruning heuristics include plausible values of descriptors, the size and complexity of the rules, and an ordering of the terms for heuristic-search purposes. Meaningful ranges of numerically valued attributes can either be set manually in the HOT or determined automatically.
The search is driven by performance constraints on the plausible rules. These constraints involve:
(1) the minimum fraction of positive examples of the concept (true positives) correctly classified by the rule (TP/P),
(2) the maximum fraction of negative examples of the concept (false positives) classified by the rule (FP/N), and
(3) the minimum ratio of the fraction defined in (1) to the fraction defined in (2).
Rules that are too general satisfy (1) but not (2), and rules that are too specific satisfy (2) but not (1). Often the best results come from a higher value of the ratio defined in (3), even with less than optimal values for the quantities in (1) and (2). Thus the search is not exhaustive, in the sense that not every node in the search space is explored. For example, RL does not specialize a rule that is already too specific. A complete derivation, or trace-back, of the rules is available. The performance criteria start at stringent values (high TP/P and low FP/N) so that the best rules are found first. If there are uncovered positive examples remaining after the search is completed at a particular performance level, the thresholds are progressively loosened until a stopping condition based on absolute minimum thresholds is reached.

4.2. Design of monitor placement
As a test of how a learning program could be used in design, we tried to learn rules for positioning monitors in a beam line. In this case we used an actual beam-line design and varied the positions where the monitors were located. Then, for a given placement of monitors, a simulator [15] generated beam trajectories. Each trajectory was affected by a different magnet or monitor error. The trajectories included effects due to residual magnet misalignments and miscalibrations, as well as the effects of monitor resolution. Sections of the beam line where there was a magnet error or monitor error were used as positive examples of the error type, and the other sections as negative examples of the concept to be learned (e.g., if element error was the concept to be learned, then sections with monitor errors and no magnet errors were negative examples of the concept, etc.). In a second stage of learning, the characteristics of a beam line, such as monitor spacing, along with the rules learned in the first stage, were used as attributes, with the concept to be learned being "good monitor placement". The positive examples of a "good monitor design" were the monitor arrangements whose error-finding rules from the first stage of learning had high performance, and the negative examples were the monitor arrangements that had poorly performing error-finding rules. The rule found by RL4 in the second stage of learning was basically identical to the beam-line designers' rule of thumb: "one monitor every quarter betatron phase shift is a good monitor placement".

4.3. Design of detector electronics

At any detector, and especially for an SSC detector, where the interaction rate is so high (O(100 MHz)), it is crucial to choose the best granularity for the detector. For triggering purposes, the granularity of the detector is coarsened by forming local sums of adjacent cells. If the granularity at the trigger level is too coarse, then events with finer structure may be missed. If the granularity is too fine, then this poses problems of cost, power consumption, large numbers of cables, and storage and I/O problems for the control system. Again RL4 was used to learn rules, with the goal being to obtain high efficiency and low background for interesting physics events. This time the positive examples were events classified as having "at least one jet" and the negative examples were events with "no reconstructable jets". RL4 found rules similar to those found by a conventional analysis (i.e., guessing what histograms to make and choosing the criteria based on visual inspection).

4.4.
Trigger strategy for SSC experiments

Another important consideration for the SSC is that the trigger must be carefully designed to keep trigger rates within the capacities of the control system. An example of a trigger is the selection of events containing a top quark (with 140 GeV mass) from a background of other processes. We approach this problem by simulating the various trigger levels and applying learning techniques to establish the actual criteria (rules) at each level. As with many HEP experiments, the trigger is broken into different parts, each designed to select a different subset of the data. In particular, we divide the trigger into three levels: level I is used at the hardware level to filter minimum-bias events and other high-rate processes using simple measures such as the total transverse energy in the detector. At level II, the trigger parameters are more complex, involving counting classes of particles such as electrons, muons and energy clusters. The level-III trigger employs sophisticated programs and is more complicated than the level-II trigger. It involves correlating energy clusters and measuring and ordering particle types, such as electrons and muons, by energy or momentum. (For the present purposes of demonstrating the utility of knowledge-based induction we have neglected the effects of event pileup.)

Table 1 shows the trigger rates at the various trigger levels using as a filter the rules learned by RL4.

Table 1
Trigger-rate goals for each trigger level, along with the performance of RL4-generated rules on the signal and background training data.

Trigger      Desired rate [Hz]   Rate (RL4 rules) [Hz]   Signal [nb/%]   Background [nb]
Raw rate     10^8                2.2 x 10^8              16.0/100        2.2 x 10^8
Level I      10^5                1.8 x 10^4              16.0/100        1.8 x 10^4
Level II     10^2-10^3           1.1 x 10^3              8.3/52          1.1 x 10^3
Level III    10^1-10^2           1.0 x 10^2              4.5/28          9.7 x 10^1

At level I, RL4 used top events as the concept to be learned, and minimum-bias and low-energy two-jet events as the negative examples of the concept. In this simulation the trigger rate at level I is limited by the speed of the hardware, the amount of disk storage and other resource constraints. We were able to set the performance thresholds used by RL4 to learn rules corresponding to the rate constraints of the control system. The subset of rules found by RL4 that gave the numbers in table 1 were:

Level I:   IF (TOTAL-ET > 200 GEV) THEN (TOP-EVENTS YES)
Level II:  IF (MISSING-PT > 27 GEV) THEN (TOP-EVENTS YES)
Level III: IF ((LARGEST-ELECTRON-ET < 270 GEV) AND (EVENT-CIRCULARITY > 0.324)) THEN (TOP-EVENTS YES)

Further analysis with RL4, corresponding to an off-line analysis, was used to eliminate nearly all the background events, leaving a high-purity top-event sample. The above example illustrates how a learning program can be used to see whether the constraints imposed by limitations in the control system will allow the desired measurements to be made.

5. Summary

Learning programs can be of use when the problem is highly complex and where a partial domain theory
can be constructed. Also, if the problem changes frequently and not enough human resources are available, then machine-learning methods may be appropriate. The knowledge-based induction system we used has an advantage over many other learning or optimization systems in that it is quite general and makes many of the assumptions underlying its reasoning explicit and accessible to change by the user. The learning program was seen to be quite useful in the design of complicated physics experiments by taking into account complex correlations among the parameters.

Acknowledgements

We thank Wilfred Cleland and Frank Paige for useful discussions and Osamu Kora for his results.

References

[1] B.G. Buchanan and R.G. Smith, Ann. Rev. Comput. Sci. 3 (1988) 23.
[2] E. Feigenbaum, P. McCorduck and H. Nii, The Rise of the Expert Company (Times Books, New York, 1988).
[3] Proc. First Annual Conference on Innovative Applications of Artificial Intelligence (Stanford University, 1989).
[4] R.S. Michalski, J.G. Carbonell and T.M. Mitchell (eds.), Machine Learning: An Artificial-Intelligence Approach (Tioga Publishing Co., Palo Alto, 1983).
[5] P.H. Winston, Artificial Intelligence (Addison-Wesley, Reading, 1985).
[6] Machine Learning 3 (2,3) (1988).
[7] G.E. Hinton and T.J. Sejnowski, in: Parallel Distributed Processing, eds. D.E. Rumelhart and J.L. McClelland (MIT Press, Cambridge, MA, 1986) p. 282.
[8] S. Kirkpatrick, C.D. Gelatt and M.P. Vecchi, Science 220 (1983) 671.
[9] B.G. Buchanan and T.M. Mitchell, in: Pattern-Directed Inference Systems, eds. D.A. Waterman and F. Hayes-Roth (Academic Press, New York, 1978).
[10] L.M. Fu, Ph.D. thesis (Stanford University, March 1985).
[11] B.G. Buchanan, J. Sullivan, T.P. Cheng and S.H. Clearwater, Proc. 7th National Conf. on Artificial Intelligence, St. Paul, 1988 (Morgan Kaufmann, San Mateo, CA) p. 552.
[12] P.E. Gill, W. Murray and M.H. Wright, Practical Optimization (Academic Press, New York, 1981).
[13] L. Breiman, J.H. Friedman, R.A. Olshen and C.J. Stone, Classification and Regression Trees (Wadsworth, Belmont, 1984).
[14] L. von Bertalanffy, General System Theory (George Braziller, New York, 1968) p. 150.
[15] M.J. Lee, S.H. Clearwater, S.D. Kleban and L.J. Selig, Proc. 1987 Particle Accelerator Conf., Washington, DC, 1987, eds. E.R. Lindstrom and L.S. Taylor (IEEE, New York, 1987) p. 1334.