Modeling digital circuits for troubleshooting


Artificial Intelligence 51 (1991) 223-271 Elsevier


Walter C. Hamscher
Price Waterhouse Technology Centre, 68 Willow Rd, Menlo Park, CA 94025, USA

Abstract

Hamscher, W.C., Modeling digital circuits for troubleshooting, Artificial Intelligence 51 (1991) 223-271. Existing methods for model-based troubleshooting have not previously scaled up to deal with complex digital circuits, in part because traditional circuit models do not represent aspects of the device that troubleshooters consider important. An instruction level simulation of a microprocessor explicitly represents the logic levels present on its external bus at every clock edge, but not the fact that during normal operation those bus signals should be very active. A schematic may represent the connectivity of field replaceable components, but does not show how their combined behavior implements the intentions of the designer. The specifications of a component rarely say how it is likely to fail. This suggests basing troubleshooting on a specialized circuit model that emphasizes such aspects. Although it is beyond current technology to derive such models from circuit schematics automatically, this work shows that these models can make the troubleshooting of complex circuits feasible. This paper describes an implemented program for troubleshooting complex digital circuits, using a representation that makes explicit their behavior at a high level of temporal abstraction, their physical and functional organization, and the common ways that their components fail.

0004-3702/91/$03.50 © 1991 Elsevier Science Publishers B.V. All rights reserved

1. Modeling for troubleshooting

In model-based diagnosis, a model produces predictions about the observed behavior of some artifact; comparison with actual observations of that artifact produces discrepancies; these discrepancies then give rise to possible diagnoses, each of which is a set of underlying differences between the artifact and the model [9]. Model-based troubleshooting refers to diagnosis in which the artifact is an engineered device, the model is a description of its internal structure and the intended behavior of its components, and the diagnoses are interpreted as physical defects in the device that have occurred while in service. Model-based diagnosis promises many advantages over traditional approaches to automated diagnosis, one of which is that a single diagnosis engine can be built and reused for many devices by providing a different model for each device.

Model-based diagnosis has been studied extensively, with several formal characterizations [15,21,38], and with demonstrations in a variety of domains suggested by the following sample:

• analog circuits, as in INTER [12], WATSON [4], SOPHIE [5], IDS [35], IN-ATE [6], DEDALE [7], and others [33,45];
• digital circuits, as in HT [8,25], DART [20,27], SATURN [43,44], GDE [16], GMODS [26], SHERLOCK [17], and others [1,19,23];
• medicine, as in ABEL [36], LOCALIZE [18], HFP [30], and CASEY [28];
• fluid and electromechanical systems, as in LES/LOX [40], GDE+ [45], and others [34].

Applying model-based troubleshooting to large-scale devices raises difficult issues. First, predicting the behavior of devices with complex time-dependent behavior could be impractically expensive. Consider board-scale digital circuits: it is relatively expensive to simulate or otherwise reason about their behavior over a significant number of clock cycles. Structural complexity as measured by component count does not by itself raise any comparable issue [18,40]. Second, the troubleshooting engine produces diagnoses that are logically possible but physically implausible or even irrelevant. For example, in sequential circuits almost any discrepancy yields a set of diagnoses that includes almost every physical component [25].

Since the architecture of model-based troubleshooting involves two elements, a device-independent engine and a device-specific model, there are at least two broad strategies for attacking these scaling problems: improve the engine or improve the model. Both of these are useful avenues of attack.

Consider the strategy of improving the engine. One approach is to limit consideration of diagnoses to those that are statistically plausible, rather than considering all that are logically possible. With this more limited goal, knowledge about how components are likely to break and misbehave (that is, fault models) can be used as heuristics for refining estimates of diagnosis likelihoods. Fault models have been incorporated into model-based troubleshooting before [5,26,35,45], but in most previous work fault models have been used under the unrealistic assumption that the set of known misbehaviors is exhaustive. The work reported here and independent work by de Kleer and Williams [17] have no such restriction.

A second strategy is to improve the model, that is, to construct a device representation that is appropriate for troubleshooting.
One approach is to represent the device in a hierarchy of abstractions, using successively more detailed models of behavior as the diagnosis proceeds. This idea appears in some form in nearly every model-based diagnosis program [4,5,8,20,35,37,43]. The problem is that a mere commitment to using layers of abstraction says nothing about what abstractions will be appropriate. A well known example of a principle for constructing appropriate representations is the no-function-in-structure principle, which says that the laws of the parts of the device may not presume the functioning of the whole [14]. This principle helps to ensure that reasoning about the behavior of a given collection of components will not be invalidated by the presence of faults elsewhere in the device. It concerns the correctness of diagnosis.

The focus of this paper is on principles concerning the efficiency of diagnosis. For example, the goal of troubleshooting is repair, and that generally involves replacing a component; representing the device as a collection of its primitively replaceable parts can improve efficiency since the program will not spend effort distinguishing between failures that have identical repairs. For another example, the cost of reasoning over many individual events in sequential digital circuits motivates the use of temporally abstract descriptions of circuit components to improve efficiency when troubleshooting [25], just as it does when generating tests [27]. Similarly, there is a large gap between the temporal granularity at which events occur in a digital circuit and the temporal granularity at which observations can easily be made during troubleshooting; knowing this provides guidance about the appropriate temporal granularity to use in representing the behavior of the circuit. These and other principles constitute the strategy of modeling for troubleshooting.

This paper describes a model-based troubleshooting program that diagnoses faults in a board-scale digital circuit more complex than previously attempted. The circuits used as examples in the current work are all from the console controller board of the Symbolics 3600. About 40% of the board has been represented and several troubleshooting examples run, the largest involving some 100 visible circuit nodes and 20 chips including two microprocessors. (Singh's SATURN program [43] generated tests for a simpler digital circuit board consisting of about 60 circuit nodes and a dozen SSI chips no more complex than registers.) The troubleshooting engine is XDE, a domain-independent diagnosis engine based on GDE [16]. XDE embodies some of the "improve the engine" strategy: it extends GDE to incorporate hierarchic diagnosis in physical and functional hierarchies and to use fault models. The central concern of this paper, however, is the representation of circuits that arises from pursuing the strategy of "modeling for troubleshooting". This representation is embodied as two languages: (i) the circuit structure language BASIL, which makes explicit both the physical and functional organization of the circuit; (ii) the temporal constraint propagation language TINT, which makes explicit circuit behavior at multiple levels of temporal abstraction. While it is beyond current technology to derive this specialized representation from more readily available information such as circuit schematics, the work makes important contributions to the interim goal of understanding how to represent complex devices in ways that make troubleshooting them feasible.

Fig. 1. A portion of the console controller board.

1.1. A troubleshooting scenario

A troubleshooting scenario presented here serves to illustrate the distinctive features of the circuit representation and its use by XDE. The console controller board is responsible for transmitting keystrokes and mouse motions to the host computer and for decoding the video signal coming from the host for display on a CRT and the audio signal for output to a speaker. Some keystroke sequences can change the volume of the speaker, the brightness of the CRT, and so forth. Fig. 1 shows abstractly a few of the components (boxes) and the signals through which they interact (arrows). Each small superscript represents the number of chips in that component; there are sixteen in all. The oscillator O produces a clock signal that is buffered by B and sent on to two places: the reset circuitry R and a microprocessor M1. The microprocessor M1 polls the mouse inputs. Each tenth of an inch of mouse motion along its x- or y-axes causes M1 to interrupt a second microprocessor M2 with a two-byte message. M2 responds to the interrupt through some bus control circuitry D. After receiving the two-byte message, M2 sends the message on to the host, again through the bus control circuitry D. The host displays the changed mouse position on the screen.

Suppose the console controller board reset button is pressed and the mouse rolled around for a couple of seconds. The model predicts that if all sixteen chips are working, mouse motion will be observed at Output. The model is too coarse to predict how fast or how far the cursor will move on the screen; it predicts only that motion will be observed. This temporally abstract behavior is both more efficient to make predictions from and easier to observe than the traditional clock-cycle-by-clock-cycle model of digital circuit behavior.


But suppose the mouse cursor does not move at all. XDE indicates that any one of the sixteen chips might be broken; each chip is a suspect. There are now many possible signals to probe, and XDE ranks them. The model has assigned each chip a prior likelihood of failure, and it turns out that the likeliest chip to fail by far is the oscillator O. XDE suggests probing its output; suppose it is observed to have a frequency of approximately 10 MHz. The oscillator O can be discounted as an unlikely suspect using knowledge in the model about how some components fail. The model says that when oscillators fail, they usually fail catastrophically, producing an output frequency of 0. Because the signal was observed to be changing, XDE concludes that the oscillator chip is probably not responsible. It is still a suspect, just a relatively unlikely one. This leaves fifteen chips as likely suspects.

XDE now suggests another probe by considering the predictions that the model makes at each signal. For example, the model predicts that the Clock signal should have frequency 5 MHz. Each prediction is tagged with a set of chips that, if all were working properly, would ensure the correctness of the prediction. The representation of clock signals in terms of their frequencies is an example of a temporal abstraction; millions of underlying events (rising and falling edges) have been abstracted into a simple description that is easy to reason about and easy to observe. Although the model represents many signals in temporally abstract ways, there are other signals for which the standard digital vocabulary suffices. For example, the Constant output of C is a constant 1 throughout the entire session, and the model predicts that. Also, the Reset signal should be asserted while the reset button is pressed and unasserted otherwise, and the model predicts that as well. These predictions (that the clock frequency is 5 MHz, and so forth) can be used in subsequent predictions.
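The effect of the oscillator's fault model on its standing as a suspect can be illustrated with a small likelihood calculation. The sketch below is not XDE's actual procedure: it assumes exactly one broken chip, and the prior weights, the chip names `chip1` to `chip15`, and the 5% figure for oscillator failures that leave the output changing are hypothetical values chosen only for illustration.

```python
def rank_suspects(priors, likelihoods):
    """Return normalized posterior weights for single-fault candidates.

    priors      -- dict: chip name -> prior failure weight (hypothetical values)
    likelihoods -- dict: chip name -> P(observation | that chip is the broken one);
                   chips not listed are assumed not to affect the observation (1.0)
    """
    unnormalized = {c: p * likelihoods.get(c, 1.0) for c, p in priors.items()}
    total = sum(unnormalized.values())
    return {c: w / total for c, w in unnormalized.items()}


# Hypothetical priors: the oscillator O is by far the likeliest chip to fail.
priors = {"O": 0.40}
priors.update({f"chip{i}": 0.04 for i in range(1, 16)})  # the other fifteen chips

# Observation: O's output is changing at roughly 10 MHz.
# Assumed fault model: only 5% of oscillator failures leave the output changing.
posterior = rank_suspects(priors, {"O": 0.05})
```

With these illustrative numbers the oscillator falls from the likeliest suspect to less likely than any other single chip, yet it keeps a nonzero weight, consistent with the remark that it is still a suspect.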
The behavior model for the first microprocessor M1 says that if the Clock input is 5 MHz, the Constant input is 1, and the Reset signal is not asserted, then the microprocessor will be running. While M1 is running, each movement of the mouse results in the Interrupt line being asserted. If all that is known is that the mouse is moving around, the model does not predict exactly when the interrupt will be asserted; instead it predicts that the signal will be changing while the mouse is moving and a constant 1 value otherwise. The model makes many other predictions, but these are all that will be needed in this example. The important one at the moment is the prediction that the Interrupt signal will be changing while the mouse is moving. This prediction is tagged with the set of eight chips that would support the prediction if working properly, namely the eight chips in all components except M2 and D.
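The coarse behavior just described for M1 can be written down as a small function. This is only a sketch of the abstraction, not TINT's rule syntax; the function name, the string "changing" as a value, and the `tolerance_mhz` parameter are assumptions introduced for illustration.

```python
def predict_interrupt(clock_mhz, constant, reset_asserted, mouse_moving,
                      tolerance_mhz=0.5):
    """Coarse prediction for M1's Interrupt output over an interval.

    Returns "changing", the constant value 1, or None when the abstract
    model makes no prediction (M1 is not known to be running).
    tolerance_mhz is a hypothetical allowance for imprecise measurement.
    """
    running = (abs(clock_mhz - 5.0) <= tolerance_mhz
               and constant == 1
               and not reset_asserted)
    if not running:
        return None  # the coarse description covers only the running case
    return "changing" if mouse_moving else 1
```

Note that the function says nothing about when individual interrupts occur; like the model in the text, it commits only to whether the line is changing over the interval.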

The probe that XDE now suggests is the Interrupt output of M1. Suppose the interrupt line is probed, revealing that it is a constant 1 even while the mouse is rolled around. This is a discrepancy, since it was supposed to be changing so long as those eight chips were working properly. One of the chips was the oscillator, which has been shown to be an unlikely suspect; this leaves seven as likely suspects (Fig. 2).

Fig. 2. Likely suspects after probing Interrupt.

The model predicted that the Reset signal should be asserted just while the reset button was pressed, so long as the five chips in O, B and R were working. Probing the Reset signal reveals that upon pressing the button it is asserted, then unasserted. This means that the chips in R are no longer suspects, since their failure could not explain the observations made. Now there are five likely suspects (Fig. 3).

Fig. 3. Likely suspects after probing Reset.

The model predicted that the Constant signal should be 1 throughout the session, so long as the chips in C were working. Probing this signal reveals that it is indeed 1, so the chips in C are no longer suspects. Now there are three likely suspects (Fig. 4).

Fig. 4. Likely suspects after probing Constant.

Finally, a probe of the Clock signal reveals that it has frequency around 5 MHz. The model says that if the clock input to M1 has a high enough frequency and the reset input is not asserted, then the microprocessor should be running. This means that the Interrupt signal should be changing, which contradicts previous observations. Hence M1 is the only remaining suspect and XDE terminates.

The interesting thing about this scenario is that it is so simple compared to the underlying complexity of the real circuit. The circuit is structurally complex; there are thousands of transistors in the chips, hundreds of possible flaws in the wires alone. It is behaviorally complex; consider all the microprocessor instruction cycles that occurred during the one second of mouse motion. People can troubleshoot the circuit without thinking about all those details, and the program can troubleshoot it without explicitly representing them.
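The way discrepancies narrow the suspects in this scenario can be sketched as set intersection over the chips that support each violated prediction. The sketch assumes a single broken chip and ignores the likelihood ranking and fault models that XDE also uses, so it reproduces only the starting point and the final answer, not the intermediate counts of "likely" suspects. The chip counts for B, R, C and D are inferred from the suspect counts quoted in the scenario, and all names in the code are hypothetical.

```python
from functools import reduce

# Chips per component. O, M1 and M2 follow the text; the counts for B, R, C
# and D are inferred from the suspect counts in the scenario (an assumption).
CHIP_COUNTS = {"O": 1, "B": 2, "R": 2, "C": 2, "M1": 1, "M2": 4, "D": 4}


def chips(*components):
    """Set of chip labels such as 'B.0', 'B.1' for the given components."""
    return {f"{c}.{i}" for c in components for i in range(CHIP_COUNTS[c])}


def single_fault_candidates(conflicts):
    """Chips that appear in every conflict set (single-fault assumption)."""
    return reduce(set.intersection, conflicts)


conflicts = [
    chips(*CHIP_COUNTS),              # no motion at Output: all sixteen chips
    chips("O", "B", "R", "C", "M1"),  # Interrupt constant while the mouse moves
    chips("M1"),                      # Clock, Reset, Constant observed correct,
                                      # yet Interrupt is still not changing
]
suspects = single_fault_candidates(conflicts)
```

The last conflict contains only M1 because, once its inputs have been observed directly, the prediction about Interrupt no longer depends on the chips that produce those inputs.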

1.2. Eight modeling principles

The important thing about the circuit model is not that it uses abstractions to deal with complexity; any representation does that. The important idea is that it embodies structural and behavioral abstractions appropriate to troubleshooting. Temporal abstractions, in particular, allow the program to avoid reasoning about long sequences of individual events and instead reason in terms of "moving" mice, "running" clocks, "changing" signals, and so forth. Guided by the principles below, a knowledge engineer can use these abstractions to construct a representation that makes troubleshooting a complex circuit feasible. The model of the console controller board is appropriate for model-based troubleshooting because it was constructed according to those principles.

Behavior

One set of principles concerns the representation of circuit behavior. These principles and their embodiment in the behavior description language TINT will be presented in Section 2.


• The behavior of components should be represented in terms of features that are easy for the troubleshooter to observe. Some features of time-varying signals are easier to observe than others. The frequency of a clock, for example, is easier to observe than the timing of each of its individual transitions. Expressing the behavior of components in the terms that are more easily observed is a way of choosing what details to ignore. Human troubleshooters who successfully use these coarse observations provide evidence that this is an effective strategy.

• The behavior of components should be represented in terms that are stable over long periods of time or that summarize much activity into a single parameter. This is easiest for a component for which changes on its inputs always result in changes on its outputs. In the troubleshooting scenario, the number of mouse step increments over a period of seconds (a single parameter describing much activity) determined the number of times the interrupt line would be asserted over that period. Such relationships can be derived when each individual change results in one or more other changes.

• A temporally coarse behavior description that only covers part of the behavior of a component is better than not covering any at all. Although the full behavior of a component may be too complex to reduce to a simple relationship between (say) the number of changes on its inputs and the number of changes on its outputs, there may be such a relationship that involves only a subset of its inputs, assuming that the others are held constant. In the case of the microprocessor, for example, the relationship between the mouse motion inputs and interrupt output holds only so long as the clock input is running and the reset input is not asserted.
Since the troubleshooting program will eventually use the more detailed behaviors as long as the diagnosis remains ambiguous, no diagnostic resolution will be lost by only representing a subset of the possible behaviors abstractly.

• A sequential circuit should be encapsulated into a single component to enable the description of its behavior in a temporally coarse way. Although the individual behaviors of the components in a sequential circuit may not lend themselves to temporally coarse descriptions, the loop may be performing a simple function when taken as a whole. For example, the R component in the troubleshooting scenario is actually a sequential circuit with 2 ~4 distinct states. When viewed in temporally coarse terms, however, there is a simple correspondence between the state of the button and the state of the output. Encapsulating the group of components makes it possible to reason about its behavior in a temporally coarse way, and as in the troubleshooting scenario described, it may not be necessary ever to consider the details of its behavior.

Structure

A second set of principles concerns how the structure of a given circuit should be represented. These principles are embodied in the structure language BASIL, presented in Section 4.

• Components in the representation of the physical organization of the circuit should correspond to the possible repairs of the actual device. The representation of physical organization plays a central role in the troubleshooting program, and the program represents all of its diagnoses in terms of the physical components that could be damaged. In the scenario presented earlier, for example, the diagnoses were expressed in terms of chips, which are "repaired" by replacement. Making the elements of this representation correspond to possible repair actions ensures that the troubleshooting program will not waste effort trying to discriminate between diagnoses that have identical repairs.

• Components in the representation of the functional organization of the circuit should facilitate behavioral abstraction. The only role that an explicit representation of functional organization plays in model-based troubleshooting is to make behavior prediction more efficient. For example, the only reason that the component M2 exists in the model is that the combined behavior of the four chips inside it can be described more simply in the aggregate than individually. In extracting the functional organization from a raw schematic the modeler need only represent what will make the behavior easy to reason with, rather than what the designer had in mind. While there will often be a great deal of overlap, these two are not necessarily the same.

Failures

A final set of principles concerns what knowledge about failures should be represented explicitly. These principles are discussed further in Section 5.

• An explicit representation of a given component failure mode should be used if the underlying failure has high likelihood. Components break in the field in certain ways much more often than other ways. Chips, for example, fail more often with breaks in the tiny wires that connect their pins to the silicon chip inside than in other ways. The benefit of knowledge about such failures comes when they are inconsistent with the symptoms, since this can reduce the ambiguity among the possible diagnoses.


• An explicit representation of a given component failure mode should be used if the resulting misbehavior is drastically simpler than the normal behavior of the component. If a component with normally complex behavior has some internal fault or faults that cause it to misbehave catastrophically, then any partially correct behavior observed for the component makes it a less likely suspect. In the troubleshooting example, the oscillator was known to fail in a way that made it produce a zero output frequency, and that misbehavior was easy to rule out even though the measurement of its output was imprecise. Such cases are quite common in digital circuits, and this is a consequence of aspects of their design. Complex functions tend to get implemented in state machines or as firmware for general processors. The circuits then use the same hardware components over and over to implement different steps of the overall computation, many of which depend on the previous step. Hence a perturbation caused by failure in any one unit of hardware rapidly cascades and propagates its effects. The very economy of the design (the reuse of hardware for different substeps of a complex behavior) means that after many cycles the behavior will little resemble that intended. Since complex components typically communicate with one another through protocols and languages in which the meaningful message sequences occupy only a fraction of the theoretically available bandwidth, when a component is intended to produce a message sequence understandable by some other component, the message will probably never get through. As a result, the faulty behavior of the device is effectively much simpler than its correct behavior, making it worthwhile to model explicitly.

2. Representing behavior

To be appropriate for troubleshooting, behavioral abstractions should retain enough predictive power to detect symptoms, but should allow predictions to be made efficiently. Among the characteristics of digital circuit troubleshooting are (i) a gap of several orders of magnitude between the temporal granularity at which events occur in the machine and the temporal granularity at which observations can easily be made, and (ii) the fact that the most common physical failures are usually manifest at coarse timescales. These characteristics mean that temporal precision can often be sacrificed without losing too much predictive force. Temporal abstractions, including familiar concepts such as frequency, cycles, counting, sequence, duration, sampling, and change, make it possible to reason about large numbers of events occurring in the circuit without having to refer explicitly to each one.

This section first presents the temporal constraint propagator TINT. Next, some simple examples of digital component behaviors as represented with TINT are presented, along with several temporal abstractions. Finally, these temporal abstractions are used to describe the behavior of some components.

2.1. TINT

The behavior of circuit components is represented using a simple temporal reasoning system in which rules are used to derive facts about the values of functions of time. A function of time is called a signal; for example, the voltage at a circuit node is a signal because its value can change over time. An event is a change in the value of a signal.

TINT is implemented using the predicate calculus-based expert system substrate JOSHUA [39]. The syntax will be shown as Cambridge prefix predicate calculus with [] denoting predicate terms, () denoting function terms, and the prefix ? denoting universally quantified variables. TINT provides the four-place predicate thru for making assertions about signal values.

    [thru ?l ?u ?signal ?value]

means that from the lower bound time ?l to the upper bound time ?u inclusive, ?signal had value ?value. The predicate

    [tsame ?l ?u ?signal1 ?signal2]

means that at each time during the interval, ?signal1 had the same value as ?signal2. Any token can appear as the ?value of a signal.

In contrast to more sophisticated models of time (for example, Allen's interval model [2]), for simplicity time is taken to be a sparse set, the integers divisible by a temporal granularity constant δ. Granularity can be thought of as the smallest unit of time that is measurable by available instruments. Only integers, along with -∞ and +∞, can appear as time arguments to the thru and tsame predicates. This use of time-stamps in TINT rather than symbolic quantities or expressions results in serious limitations as compared to other temporal reasoning systems such as TCP [47] and TMM [11], but it is adequate as a demonstration vehicle.

The ?signal argument of the thru predicate is normally a function term. For example, the term (voltage (in a u32a)) denotes the voltage signal at node (in a u32a). The voltage function maps a node to a real-valued signal.

Abstractions and behaviors are functions from signals to signals. Abstractions describe relationships between signals at different levels of detail. Behaviors describe the relationships that components enforce between their input and output signals. TINT manipulates assertions with predicate ground terms containing composite terms built up from primitive signals and abstractions.

    [thru -∞ +∞ (change (ll (in a u32a))) nil],

for example, means that the signal resulting from applying the change abstraction to the logic level (ll) signal at node (in a u32a) was always nil.

TINT provides rules that are used in forward chaining to propagate the consequences of assertions about signal values. The following rule says that if ?i is of type inverter and the input signal (ll (in 0 ?i)) has value ?v from time ?l to time ?u inclusive, then the output signal (ll (out 0 ?i)) is the result of applying lognot to ?v:

    If   [isa ?i inverter]
    and  [thru ?l ?u (ll (in 0 ?i)) ?v]
    Then [thru ?l ?u (ll (out 0 ?i)) (lognot ?v)]

The set of all thru predications (predicate ground terms) referring to the same signal is called the history of the signal. TINT combines overlapping intervals of the same history having the same value into maximal intervals and records a contradiction if a signal has more than one value at a given time. Predications denoting nonmaximal intervals are "shadowed", meaning that rules are prevented from firing on them. It would be inefficient to allow rules to fire on such intervals because this would result in the derivation of many useless predications denoting nonmaximal intervals. TINT uses an ATMS to maintain Boolean constraints among the truth values assigned to predications, along with minimal environments for each assertion [13,23,31].

TINT provides predicates, rules, and a framework of signals and abstractions that together are used to describe circuit behavior. However, the main issue is the vocabulary of signal types and abstractions and the specific rules that the program will use to reason about them. These are treated next.
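The bookkeeping for a signal history described above (merging overlapping intervals with the same value into maximal intervals, and flagging a contradiction when values disagree) can be sketched as follows. This is an illustration under simplifying assumptions, not TINT's implementation: it omits shadowing, rule firing and the ATMS, and the names `History`, `assert_thru` and `Contradiction` are hypothetical.

```python
import math


class Contradiction(Exception):
    """Raised when a signal would have two different values at one time."""


class History:
    """Maximal intervals (lo, hi, value) asserted for one signal."""

    def __init__(self):
        self.intervals = []

    def assert_thru(self, lo, hi, value):
        """Record that the signal had `value` from lo to hi inclusive."""
        kept = []
        for l, u, v in self.intervals:
            overlaps = l <= hi and lo <= u
            if overlaps and v != value:
                raise Contradiction(
                    f"{v!r} vs {value!r} during [{max(l, lo)}, {min(u, hi)}]")
            if overlaps:
                # Same value: absorb into one maximal interval.
                lo, hi = min(lo, l), max(hi, u)
            else:
                kept.append((l, u, v))
        kept.append((lo, hi, value))
        self.intervals = sorted(kept, key=lambda iv: iv[0])


h = History()
h.assert_thru(0, 5, 1)
h.assert_thru(3, 9, 1)            # overlaps with the same value: merged
h.assert_thru(-math.inf, -1, 0)   # disjoint from [0, 9], so no conflict
```

Following the wording of the text, only overlapping intervals are merged here; intervals that are merely adjacent on the time grid are left separate.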

2.2. Behaviors

Circuit components have intended behaviors that are functions from signals to signals, and these behaviors can be translated into rules. The intended behavior of a component depends on some collection of background conditions (for example, that the component in question is "working", that is, not physically damaged, that it is connected to a power source, and so forth). By convention, the background conditions for a component are collected and summarized as a mode signal whose value is normal during the intervals that all the conditions are satisfied. For example, the following rule says that if a two-input AND-gate ?a has the status working and is getting power, then its mode is normal:


    If   [isa ?a and2]
    and  [status-of ?a working]
    and  [thru ?l ?u (power (in power ?a)) t]
    Then [thru ?l ?u (mode ?a) normal]

The principal behavior rules for AND-gates depend on the mode signal having the value normal. In the following rule the signals (ll ...) denote the digital signals (logic levels) appearing at the input ports (in 0 ?a), (in 1 ?a) and at the output port (out 0 ?a). If any input of a binary AND-gate is 0, then the output is 0:

    If   [isa ?x and2]
    and  [thru ?la ?ua (mode ?x) normal]
    and  [thru ?lb ?ub (ll (in ?n ?x)) 0]
    and  (overlap (?la ?ua) (?lb ?ub))
    Then [thru (max ?la ?lb) (min ?ua ?ub) (ll (out 0 ?x)) 0]

The function overlap tests whether its argument intervals have any point in common. Another rule for the AND-gate says that with all but one of its inputs held to 1, it acts as a buffer. In the two-input case, with inputs numbered 0 and 1, this means that as long as input ?n is 1, the output is the same as input (- 1 ?n). The predication

    [tsame ?l ?u ?signal1 ?signal2]

means that at every time between the lower bound ?l and the upper bound ?u inclusive, ?signal1 has the same value as ?signal2:

    If   [isa ?x and2]
    and  [thru ?la ?ua (mode ?x) normal]
    and  [thru ?lb ?ub (ll (in ?n ?x)) 1]
    and  (overlap (?la ?ua) (?lb ?ub))
    Then [tsame (max ?la ?lb) (min ?ua ?ub) (ll (in (- 1 ?n) ?x)) (ll (out 0 ?x))]
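
The interval arithmetic in the two AND-gate rules can be mirrored in a few lines of Python. This is a sketch of the rules' logic only, not of forward chaining in JOSHUA; intervals are assumed to be closed (lo, hi) pairs, and the function names other than overlap are hypothetical.

```python
def overlap(a, b):
    """True if closed intervals a = (lo, hi) and b = (lo, hi) share a time point."""
    return max(a[0], b[0]) <= min(a[1], b[1])


def and2_zero_rule(mode_normal, input_zero):
    """Interval over which the output is 0, or None if the rule does not fire.

    mode_normal -- interval during which the gate's mode is normal
    input_zero  -- interval during which some input is 0
    """
    if not overlap(mode_normal, input_zero):
        return None
    return (max(mode_normal[0], input_zero[0]),
            min(mode_normal[1], input_zero[1]))


def and2_buffer_rule(mode_normal, input_one, n):
    """While input n is 1, the output equals input (1 - n).

    Returns (lo, hi, other_input_index), or None if the rule does not fire.
    """
    if not overlap(mode_normal, input_one):
        return None
    lo = max(mode_normal[0], input_one[0])
    hi = min(mode_normal[1], input_one[1])
    return (lo, hi, 1 - n)
```

In both rules the conclusion holds only on the intersection of the premise intervals, which is what the (max ...) and (min ...) terms compute.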

Other rules for describing the behavior of components do not correspond to their input/output directionality. For example, there is a rule that if the output of an AND-gate is 1 then all of the inputs must be 1.

The previous examples of behavior rules involved only combinational behaviors. Sequential behaviors require introducing signals to explicitly represent the internal states of components. As with any program for reasoning about change, TINT encounters the frame problem [29,32,41]. The approach used in TINT is not general, since it requires that each component interacts with few enough other components and in few enough ways that they can all be listed explicitly. The result is a rule (a frame axiom) for every state signal that mentions every kind of event that could change that state. Appendix A shows examples of rules describing sequential behavior.

2.3. Temporal abstractions

The notion of an "abstraction" takes on a specific meaning in TINT as a function from signals to signals. For example, suppose the function sign maps real numbers into {-, 0, +}. An example of an abstraction would be a function tsign that maps a real-valued signal into a {-, 0, +}-valued signal for each point in time. Temporal abstractions are abstractions whose pointwise definitions require reference to signal values over multiple times. The temporal abstractions used in this paper are presented below and include change, counting, sequences, cycles, frequency, and sampling.

Change

The function change is t only at moments when the underlying signal has just changed its value, otherwise it is nil. Stay is the negation of change. An example showing the values of these signals over time is shown below (this and subsequent examples follow the convention that δ = 1, and that the more abstract the signal the closer it appears to the top line):

(stay X)     ?    nil  t    t    nil
(change X)   ?    t    nil  nil  t
X            3    4    4    4    5
time         0    1    2    3    4
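As an informal illustration of the change and stay abstractions, the following Python sketch computes both pointwise over a list of sampled values. The list encoding of a signal, the strings "t", "nil", and "?", and the function names are assumptions made for this sketch and are not part of TINT.

```python
UNKNOWN = "?"  # value used where the abstraction cannot be determined


def change(signal):
    """'t' where the signal differs from its previous value, else 'nil'.
    The first point has no predecessor, so its value is unknown."""
    out = [UNKNOWN]
    for prev, cur in zip(signal, signal[1:]):
        out.append("t" if cur != prev else "nil")
    return out


def stay(signal):
    """Pointwise negation of change; unknown stays unknown."""
    flip = {"t": "nil", "nil": "t", UNKNOWN: UNKNOWN}
    return [flip[v] for v in change(signal)]


x = [3, 4, 4, 4, 5]
print(change(x))  # ['?', 't', 'nil', 'nil', 't']
print(stay(x))    # ['?', 'nil', 't', 't', 'nil']
```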

In the domain of troubleshooting circuit boards, it is much easier to observe whether a given single-bit signal changed or not during an interval of several seconds than it is to observe each individual change. The abstraction changed-during is specifically tailored to making statements about whether a given logic level signal ever changed, statements that typically arise from observations of the circuit. (changed-during ?l ?u ?S) is t only at the upper bound time ?u and only when ?S changed at least once during the interval from ?l to ?u inclusive:

(changed-during 1 6 S)   nil  nil  nil  nil  nil  nil  t    nil
(changed-during 1 3 S)   nil  nil  nil  nil  nil  nil  nil  nil
(change S)               ?    nil  nil  nil  t    nil  nil  nil
S                        0    0    0    0    1    1    1    1
time                     0    1    2    3    4    5    6    7

For example, [thru 6 6 (changed-during 1 6 S) t] means that S changed at least once between times 1 and 6 inclusive.
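A minimal Python sketch of changed-during follows, assuming a signal is a list indexed by integer time starting at 0 and that the unknown change value at time 0 counts as "no change". The function name and encoding are hypothetical.

```python
def changed_during(lower, upper, signal):
    """'t' only at time `upper`, and only if the signal changed at least
    once at some time in [lower, upper]; 'nil' everywhere else."""
    # changed[i] is True when signal[i] differs from signal[i - 1];
    # index 0 has no predecessor and is treated as "no change" (assumption).
    changed = [False] + [cur != prev for prev, cur in zip(signal, signal[1:])]
    out = []
    for time in range(len(signal)):
        if time == upper and any(changed[lower:upper + 1]):
            out.append("t")
        else:
            out.append("nil")
    return out


s = [0, 0, 0, 0, 1, 1, 1, 1]
print(changed_during(1, 6, s))  # 't' only at index 6
print(changed_during(1, 3, s))  # 'nil' everywhere
```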

Counting

The function count-ww counts the number of events that have occurred with respect to a window of fixed width. It takes an argument n that is the width of the window in units of δ, and a signal argument S:

(count-ww 3 S)   ?    1    1    2    2    1
S                nil  nil  t    t    nil  nil
time             1    2    3    4    5    6

In this example, the window of width 3 evaluated at (say) time 4 consists of times 2, 3, and 4.
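The windowed count can be sketched in Python as below. The sketch assumes that the event signal is a list of "t"/"nil" values indexed from 0, and that the count is unknown ("?") whenever the window reaches back before the first recorded time or contains an unknown value; the treatment of such boundary cases is an assumption of this sketch.

```python
def count_ww(width, events):
    """Number of 't' events in the window of `width` points ending at each
    time; '?' where the window is not fully known."""
    counts = []
    for time in range(len(events)):
        start = time - width + 1
        if start < 0:
            counts.append("?")  # window extends before the first sample
            continue
        window = events[start:time + 1]
        counts.append("?" if "?" in window else window.count("t"))
    return counts


events = ["nil", "t", "t", "nil", "nil", "t", "nil"]
print(count_ww(3, events))  # ['?', '?', 2, 2, 1, 1, 1]
```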

Sequences

The abstraction sequence indicates when a particular string of (possibly repeated) values has appeared contiguously on a signal. It is t only after the extreme end of such a sequence, and only at a single point. Given a sequence like (0 1) it can be thought of as a finite string recognizer for occurrences of the regular expression 0+1+0.

(sequence '(0 1) S)   ?    nil  nil  t    nil  nil  t    nil  t
S                     0    0    1    1    0    1    0    1    0
time                  0    1    2    3    4    5    6    7    8

Cycles

The function cycles-ww is the composition of the count and sequence abstractions. It is used to count the number of endings of a particular sequence of values:

(cycles-ww 3 '(0 1) S)   ?    ?    ?    1    2    1    2
(sequence '(0 1) S)      ?    nil  t    nil  t    nil  t
S                        0    1    0    1    0    1    0
time                     0    1    2    3    4    5    6

Typically, the larger the window, the less relative fluctuation of the cycle count over time. For example, suppose A and B are signals that are just slightly out of phase. (cycles-ww n ... A) and (cycles-ww n ... B) will have the same value most of the time, and will never differ by more than 1.

[Example: (cycles-ww 8 ... A) and (cycles-ww 8 ... B), shown above (sequence ... A) and (sequence ... B), for times 0 through 7; the individual values are not recoverable from the source.]

The larger the window, the less the relative difference and, conversely, the easier it is to detect significant deviations (as for example the difference between a signal occasionally asserted and one that is running at 20 kHz). By convention, the window size is usually taken to be 1000 times the expected period of the signal, so that the cycles-ww of a pair of signals can be judged as equal if they differ by no more than 0.1%, that is, by no more than one cycle in a thousand.

Frequency

Frequency is the number of cycles that occurred during a window, divided by the duration of that window. The abstraction function fww yields the frequency of a signal with respect to a window size and a particular sequence of values. With a sufficiently large window relative to the cycle time (for example, 1000 times as large), the result is an adequate approximation to the normal notion of "frequency".

(fww 3 '(0 1) S)   ?    ?    ?    1/3  2/3  1/3  2/3
S                  0    1    0    1    0    1    0
time               0    1    2    3    4    5    6
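The way sequence, cycles-ww, and fww compose can be sketched in Python. This is one possible reading of the prose definitions, not the TINT implementation: a signal is assumed to be a list of values indexed from time 0, sequence is assumed to report "t" at the first point after the run of the last pattern value has ended, and windows that are not fully known yield "?". All function names are hypothetical.

```python
from fractions import Fraction


def sequence(pattern, signal):
    """'t' at the point just after a contiguous occurrence of the pattern
    (each value possibly repeated) has ended; 'nil' otherwise."""
    out = ["?"]
    runs = [signal[0]]  # values of the runs seen so far, repeats collapsed
    for prev, cur in zip(signal, signal[1:]):
        if cur == prev:
            out.append("nil")  # still inside a run, nothing has ended
            continue
        # The run of `prev` has just ended: do the latest runs spell the pattern?
        out.append("t" if runs[-len(pattern):] == list(pattern) else "nil")
        runs.append(cur)
    return out


def count_ww(width, events):
    """Windowed count of 't' events; '?' where the window is not fully known."""
    counts = []
    for time in range(len(events)):
        start = time - width + 1
        window = events[max(start, 0):time + 1]
        if start < 0 or "?" in window:
            counts.append("?")
        else:
            counts.append(window.count("t"))
    return counts


def cycles_ww(width, pattern, signal):
    """Composition of the count and sequence abstractions."""
    return count_ww(width, sequence(pattern, signal))


def fww(width, pattern, signal):
    """Cycles in the window divided by the window duration."""
    return [c if c == "?" else Fraction(c, width)
            for c in cycles_ww(width, pattern, signal)]


s = [0, 1, 0, 1, 0, 1, 0]
print(sequence((0, 1), s))      # ['?', 'nil', 't', 'nil', 't', 'nil', 't']
print(cycles_ww(3, (0, 1), s))  # ['?', '?', '?', 1, 2, 1, 2]
```

For this alternating signal, fww with a window of 3 yields the fractions 1/3 and 2/3 at the points where the cycle count is 1 and 2 respectively.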

Sampling

The notion of sampling is essential to understanding behavior of synchronous systems; here, the sampling of a signal refers to the values that the signal takes on at certain (usually regularly spaced) moments. The abstraction function sample-and-hold (abbreviated samp) takes two argument signals V and S; V is t where the signal S is to be sampled. The value of samp is the value of S where V was last t:

(samp V S)   ?    1    1    1    1    0    0    0
V            nil  t    nil  nil  nil  t    nil  nil
S            -    1    0    1    0    0    0    1
time         0    1    2    3    4    5    6    7

[The value of S at time 0 is illegible in the source.]
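A sample-and-hold abstraction is straightforward to sketch in Python. The list encoding and the function name samp are assumptions of this sketch, and the signal S used in the usage lines is illustrative rather than taken from the paper.

```python
def samp(v, s):
    """Value of S at the most recent time at which V was 't';
    '?' before V has been 't' for the first time."""
    out = []
    held = "?"
    for flag, value in zip(v, s):
        if flag == "t":
            held = value  # take a new sample
        out.append(held)  # hold it until the next sample
    return out


v = ["nil", "t", "nil", "nil", "nil", "t", "nil", "nil"]
s = [0, 1, 0, 1, 0, 0, 0, 1]  # illustrative values
print(samp(v, s))  # ['?', 1, 1, 1, 1, 0, 0, 0]
```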

Using temporal abstractions

Abstractions define how a signal such as (ll n48) (the logic level at node 48) relates to signals "below" it such as (voltage n48), and signals "above" it such as (fww 10^6 '(0 1) (ll n48)) (the frequency at node 48, measured at cycles starting with 0 and with a window of 10^6 time units). Abstractions thus yield rules that fire "upward", "downward", or even "sideways" between different abstractions of the same base signal.

The important property of temporal abstractions is that they sacrifice precision without sacrificing the ability to detect faulty behavior. In troubleshooting the idea is to detect discrepancies between the observed behavior of the real device and an idealized model of it; thus the predictions of interest are those that can be made efficiently from what we have observed and that could be significantly violated if the device were broken. The change abstraction is useful because it is easy to observe whether signals in a device are changing or not, and easy to predict what the consequences of change (or lack of it) would be. Similarly, the frequency abstraction is useful even if frequencies are hard to observe accurately: the distinction between zero and nonzero frequencies is easy to observe and is likely to result in significantly different behavioral consequences. By summarizing (possibly very long) sequences of events, temporal abstractions make complex behaviors look simple enough for troubleshooting to be feasible.

[Fig. 5. Abstractions and behaviors. For an abstraction A and a behavior B with z = (B x y), the abstracted behavior AB satisfies (AB (A x) (A y)) == (A z) == (A (B x y)).]

2.4. Temporally coarse behaviors

Component behaviors can be described with respect to more than one level or kind of abstraction. Given any abstraction A and behavior B we can define a function AB that describes the abstracted behavior (Fig. 5). For example, let A be the sign abstraction, and let B be real addition. The abstracted behavior AB is the qualitative addition function qplus (Fig. 6).

[Fig. 6. Example of abstractions and behaviors. With A = sign and B = plus, (qplus (sign x) (sign y)) == (sign z) == (sign (plus x y)).]

An example involving temporal abstractions is provided by the abstracted behavior of a counter that increments on falling edges of its input (Fig. 7). By temporally abstracting its input and carry-out output with respect to the count of falling edges on each signal, a four-bit counter can be viewed as dividing the abstracted input by 16. The output frequency would thus also be 1/16 that of the input.

[Fig. 7. Counter behavior with respect to the counting abstraction. With A = count of falling edges, a 4-bit counter with input x and output z divides by 16: (A z) == (/ (A x) 16).]

Viewing counters as frequency dividers in this way is useful because sometimes their inputs have known frequencies that are stable over long intervals of time. For example, one way that the frequency of the input signal could be known over a long interval is if it is the output of a 9.8 MHz oscillator. This is approximated as a frequency of 10^7 cycles per second, with a window size of a thousand periods, that is, 1000 × 10^-7 seconds:

If [isa ?o 9.8MHz-oscillator]
and [thru ?l ?u (mode ?o) normal]
Then [thru ?l ?u (fww 10^-4 sec '(0 1) (ll (out 0 ?o))) 10^7]

The frequency divider behavior allows the program to predict what the output frequency will be over similarly long intervals. Frequency dividers can have multiple outputs, which by convention are numbered from 0 upwards. The frequency at the nth output is 1/2^(n+1) that of the input, and the window size at the nth output of a frequency divider scales by 2^(n+1) (because signals at lower frequencies have longer periods and hence require a longer duration to go through the same number of cycles):

If [isa ?d frequency-divider]
and [has-port ?d (out ?n ?d)]
and [thru ?la ?ua (mode ?d) normal]
and [thru ?lb ?ub (fww ?w ?cyc (ll (in 0 ?d))) ?f]
and (overlap (?la ?ua) (?lb ?ub))
Then [thru (max ?la ?lb) (min ?ua ?ub) (fww (truncate (* ?w (expt 2 (+ ?n 1)))) ?cyc (ll (out ?n ?d))) (/ ?f (expt 2 (+ 1 ?n)))]
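The arithmetic in the conclusion of the frequency divider rule can be checked with a small Python sketch. The function name divider_output is hypothetical, and the sketch assumes that the input frequency and window are given as exact numbers in consistent units.

```python
import math
from fractions import Fraction


def divider_output(n, input_freq, input_window):
    """Frequency and window size predicted at output n of a frequency
    divider, given the frequency and window size at its input."""
    factor = 2 ** (n + 1)
    out_freq = Fraction(input_freq) / factor         # (/ ?f (expt 2 (+ 1 ?n)))
    out_window = math.trunc(input_window * factor)   # (truncate (* ?w (expt 2 (+ ?n 1))))
    return out_freq, out_window


# Output 3 of a divider behaves like a divide-by-16 stage.
print(divider_output(3, 16, 10))  # (Fraction(1, 1), 160)
```

Using Fraction keeps the divided frequency exact, which matters if predicted and observed frequencies are later compared for equality.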

The counter provides an example in which the frequency abstraction yields a useful characterization of its behavior, but criteria are needed to decide which abstractions are appropriate for different behaviors. In principle, any behavior can be abstracted using any abstraction, and moreover there is no reason that the same abstraction A need be applied to all the signals x, y, and z in Fig. 5. Ideally, any prediction made by (A (B ...)) will also be made by AB. However, AB will rarely be able to do so for an arbitrary combination of behavior and abstractions, even when A and B are total functions. Abstracting real addition with respect to sign, for example, yields the partial function qplus (Fig. 6). The strength of AB can be characterized by the degree to which it is a total function. One way to strengthen a weak function is to make assumptions about the relationship between x and y such that AB is stronger over the resulting restricted domains. In the case of sign addition, one might assume that (sign x) and (sign y) are never -, so that the resulting restriction of qualitative addition becomes a total function.

Given a particular abstraction function A, one should ask: for what class of behaviors B is it possible to formulate easily computable and strong abstract behaviors AB, or, failing that, what reasonable assumptions can be made to strengthen AB? In the case of temporal abstractions the answer is that they are appropriate for event-preserving behaviors. Behaviors are event-preserving to the extent that changes on their input signals are reflected as changes on their outputs (event-preserving behaviors include all one-to-one functions). This is such a small class that it is tempting to conclude that the corresponding class of digital components is so small as to be worthless. This is not so, because it is often possible to compose groups of digital components and define abstract signals in such a way that the behaviors of the resulting aggregate components are event-preserving. Hence the relevant class of digital circuit structures is quite large and diverse, and some examples will be presented shortly.

Faced with a specific digital circuit and the collection of temporal abstractions above, principles are needed to guide a person in using them to describe the behavior of the circuit. This model-building process is not automated, but can be metaphorically understood as "parsing" the circuit schematic: grouping components into composite structures and abstracting signals. The three basic principles by which behaviors are appropriately "parsed" are reduction, synchronization, and encapsulation, each discussed briefly below.
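The notion of the strength of an abstracted behavior, discussed earlier in this section, can be made concrete with qualitative addition. The Python sketch below is illustrative only: the string encoding of signs and the use of None for "no prediction" are assumptions, and the function names are hypothetical.

```python
def sign(x):
    """The sign abstraction A: real numbers to '-', '0', or '+'."""
    return "+" if x > 0 else "-" if x < 0 else "0"


def qplus(a, b):
    """Abstracted addition AB on signs. Returns None where the sign of the
    sum is not determined by the signs of the addends (a partial function)."""
    if a == "0":
        return b
    if b == "0" or a == b:
        return a
    return None  # opposite signs: the abstraction makes no prediction


# Wherever qplus is defined it agrees with (sign (plus x y)).
for x in range(-3, 4):
    for y in range(-3, 4):
        predicted = qplus(sign(x), sign(y))
        assert predicted is None or predicted == sign(x + y)

# Restricting both inputs to be never '-' makes qplus total.
assert all(qplus(a, b) is not None for a in "0+" for b in "0+")
```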

Reduction

Any function of n inputs with one of its inputs held constant yields a new function of n - 1 inputs. A multiple-input behavior considered under the special case of its having one or more constant inputs sometimes lends itself to temporal abstraction. The resulting behavior is incomplete, of course, in the sense that it does not cover cases in which the inputs are not constant. It is nevertheless worthwhile because it provides an alternative to the undesirable option of predicting all behavior at a temporally detailed level. Weak temporally abstract predictions are better than none.

The simplest example is the behavior of a two-input AND-gate during an interval when one of its inputs is a constant 1. By a rule shown earlier, this results in an assertion that the output and free input are the same at each moment during that interval. This assertion will have consequences for any temporal abstraction of either signal. For example, if the frequency of the free input is known then rules will fire to deduce the output frequency as well. There are similar rules for the behavior of any Boolean gate with all but one of its inputs held constant.

Synchronization

Many digital circuits have signals that provide timing information, and the sampling abstraction can simplify the behavior of components to which they are connected. Representing the behavior of a component in terms of its inputs and outputs sampled with respect to a common clock lends itself to temporal abstraction. In particular the behavior may turn out to be nearly event-preserving.

The simplest example is a falling-edge triggered register. Its basic behavior is not event-preserving, because events on its data input will not change the register state unless the latching input falls too. However, from the point of view of the falling edges on its clock inputs, the register is a delay element; its behavior is a one-to-one function except for the delay of one clock cycle. In terms of the temporal abstractions given above, sampling the input and output with respect to falling edges on its clock reveals that an event on the (abstracted) input must be followed one clock cycle later by the same event on the (abstracted) output. A rule that takes advantage of this says that if the input is known to be changing (with respect to a clock) then the output is changing (with respect to the same clock), provided that the clock frequency was nonzero during a window falling within the time that the data signal was changing:

If [isa ?r register]
and [thru ?la ?ua (mode ?r) normal]
and [thru ?lb ?ub (fww ?w '(0 1) (ll (in clk ?r))) ?f]
and [thru ?u ?u (changed-during ?l ?u (samp (ll (in 0 ?r)) (fall (ll (in clk ?r))))) ?v]
and (<= (max ?la ?lb) ?l ?u (min ?ua ?ub))
and (and (> ?f 0) (< ?w (- ?u ?l)))
Then [thru ?u ?u (changed-during ?l ?u (samp (ll (out 0 ?r)) (fall (ll (in clk ?r))))) ?v]

This can be extended to describe the behavior of a shift register, which can be viewed as a cascade of these delay elements. If enough changes are observed at the shift register input, a lower bound can be derived on the number of changes that should be observed at its output.

Encapsulation

After grouping components together, their combined behavior may lend itself to temporal abstraction using reduction or synchronization. Figure 8 shows a circuit that is part of a serial-to-parallel converter; it detects falling edges on the Start signal and asserts its Msb eighteen cycles of the Clock later. Any subsequent falling edges on the Start signal that occur before Msb has been asserted are ignored. Encapsulation alone does not usually simplify reasoning about the behavior of the loop. In this example, the behavior of this group of components has just as many states as the individual components. However, the whole circuit acts much like a counter (and hence much like a frequency divider) with respect to the Start line. The number of falling edges on Msb sampled with respect to falling edges of the Clock input is bounded from below by ⌊n/18⌋, where n is the number of falling edges on the Start signal. The following rule says that if the frequency of the Start signal is high enough over a long enough interval, then the Msb output must have changed at least once:

[Fig. 8. Burst detector. Two 4-bit counters with Clock, Start, and Load signals and an Msb output; the counter starts at 11101110 and counts through 11101111, 11110000, ..., 11111111, finishing at 00000000.]

If [isa ?c burst-detector]
and [thru ?la ?ua (mode ?c) normal]
and [thru ?lb ?ub (fww ?w '(0 1) (samp (fall (ll (in clock ?c))) (ll (in start ?c)))) ?f]
and (< (/ 1 ?w) ?f)
and (overlap (?la ?ua) (?lb ?ub))
and (< 18 (/ (- (min ?ua ?ub) (max ?la ?lb)) ?w))
Then [thru (max ?ua ?ub) (max ?ua ?ub) (changed-during (max ?la ?lb) (min ?ua ?ub) (ll (out y ?c))) t]

This rule is useful because it can use information about temporally coarse signals to make predictions about other, easily observed signals. Suppose that the only information about the input is that it is a stream of 1200 bytes per second. That is enough information for this rule to fire and predict that the Msb output ought to be changing, without having to reason about the step-by-step counter behavior.
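The numeric side conditions of the burst detector rule can be sketched in Python. The sketch reads the bound on Msb falling edges as the floor of n/18, which is an interpretation of the text, and the function names and the sample numbers in the usage lines are hypothetical.

```python
def msb_edges_lower_bound(start_falling_edges):
    """Lower bound on falling edges of Msb, given n falling edges on Start
    (read here as floor(n / 18))."""
    return start_falling_edges // 18


def burst_rule_fires(freq, window, lower, upper):
    """Numeric premises of the rule: the frequency exceeds one cycle per
    window, and the common interval [lower, upper] is longer than
    18 windows."""
    return (1 / window) < freq and 18 < (upper - lower) / window


print(msb_edges_lower_bound(1200))          # 66
print(burst_rule_fires(1200, 0.01, 0, 1))   # True
print(burst_rule_fires(1200, 0.1, 0, 1))    # False
```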


2.5. Summary of the behavior representation

This section has presented a representation of circuit behavior designed with troubleshooting explicitly in mind. The representation exploits two important features of the domain. First, it is possible to make behavior predictions at coarse time scales, using easily made observations. Second, these predictions are falsifiable because the most common failures in real circuits are manifest at coarse time scales. As illustrated earlier, representing behavior in this temporally abstract way enables a general model-based troubleshooting engine to successfully diagnose failures in a very complex circuit.

TINT assertions and rules describe the behavior of individual components. These components need to be organized into a representation of the structure of the device. This will be treated shortly.

3. Troubleshooting

This paper concerns principles by which a knowledge engineer can organize a model to facilitate model-based troubleshooting. As discussed in Section 1, these principles guide the representation of behavior, of structure, and of faults and misbehaviors. The preceding section discussed the representation of behavior, and now this section presents an overview of the troubleshooting engine XDE before subsequent sections proceed with discussions of how to represent and use knowledge about circuit structure, faults and misbehaviors.

XDE incorporates hierarchic diagnosis and fault models into the model-based diagnosis framework of GDE. Readers not familiar with GDE may wish to read Appendix C, which presents a small example showing GDE in operation using BASIL and TINT terminology and notation. Figure 9 shows a flowchart for GDE. GDE chooses the first and all subsequent observations, terminating when the ambiguity (entropy) of the set of alternative candidates is below a preset threshold. Figure 10 shows XDE. There are two new operations that can be performed before suggesting new observations: decomposition, which enables hierarchic diagnosis, and refinement, which enables the use of fault models.

[Fig. 9. GDE flowchart. Boxes recoverable from the source: Start, Choose Observation, Add Observation, Done.]

[Fig. 10. XDE flowchart. Boxes recoverable from the source: Use a Fault Model, Choose Observation, Add Observation, Done.]

Like GDE, XDE uses conflicts to construct diagnoses. XDE assigns to every diagnosis that it considers a weight¹ that is its prior probability normalized with respect to all the other diagnoses, an approximation to its posterior probability conditioned on the observations that have been made so far.

¹ A weight in XDE is not a posterior probability as used in GDE, unless one assumes p(V_i = O_ik | C_j) = 1 whenever C_j ∧ V_i = O_ik is consistent. See Appendix C.

Rules are run, predictions are made, and further conflicts discovered in just those environments corresponding to those diagnoses with weights above the bottom tenth percentile. All diagnoses except those with weights falling in the lowest percentile are eligible for refinement and decomposition. After each new observation, XDE finds the most likely diagnosis and refines it. Refinement involves selecting


the most likely fault for a component believed faulty in that diagnosis, adding an assumption corresponding to the belief that the component is faulty, and thereby discovering new conflicts and computing a new set of diagnoses. If there is no such refinement operation available, it decomposes a component instead. If no diagnosis is eligible for this either, it suggests a probe (Fig. 10).

XDE assumes that some initial set of "free" observations is available, and all of these are presented to the device model before any refinements or decompositions are done. Refinement and decomposition both have priority over gathering new observations because, heuristically, gathering observations is much more expensive than any additional computation to rule out diagnoses. Refinement has priority over decomposition because, again heuristically, refinement is often able to rule out alternative diagnoses while decomposition often increases the number of alternatives. (Experience with XDE suggests the need for research into a more flexible control structure. The program should not always try refinements before decompositions, nor decompositions before probes; people clearly make more use of the current state of the diagnosis to make this decision, and so should the program.)
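The fixed priority ordering described above (refinement, then decomposition, then a probe) can be summarized in a small Python sketch. This is a simplification under stated assumptions and not the XDE implementation: diagnoses are represented only by name, the sets refinable and decomposable are assumed to be computed elsewhere, percentile filtering is omitted, and the sketch assumes that less likely diagnoses are tried when the most likely one offers no refinement.

```python
def normalize(priors):
    """Weights: prior probabilities normalized over the diagnoses considered."""
    total = sum(priors.values())
    return {name: p / total for name, p in priors.items()}


def next_action(weights, refinable, decomposable):
    """Choose the next step: refine if possible, else decompose, else probe."""
    ranked = sorted(weights, key=weights.get, reverse=True)
    for name in ranked:
        if name in refinable:
            return ("refine", name)
    for name in ranked:
        if name in decomposable:
            return ("decompose", name)
    return ("probe", None)


weights = normalize({"d1": 3, "d2": 1})
print(next_action(weights, refinable={"d2"}, decomposable={"d1"}))  # ('refine', 'd2')
print(next_action(weights, refinable=set(), decomposable={"d1"}))   # ('decompose', 'd1')
print(next_action(weights, refinable=set(), decomposable=set()))    # ('probe', None)
```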

4. Representing structure

Model-based troubleshooting requires an explicit representation of the internal structure of the device being diagnosed, as well as its behavior. All the diagnoses that the troubleshooting engine produces will be expressed in terms of the components that appear in that structure representation. BASIL is a language for representing that structure that is complementary to TINT. BASIL descends from DPL [3] and TDL [10] and inherits the idea of representing circuit structures as networks of objects with connections between them at ports. Like TINT, BASIL is implemented as a vocabulary of predicates and terms. The predication

[isa u32 sn74116]

for example, asserts that U32 is an instance of the type SN74116,

[ako sn74116 dip]

that SN74116s are a kind of DIP (dual inline package, a rectangular chip package with pins on the two long sides),

[conn ll216 n216 (in 1 u32)]

that LL216 connects the port N216 to input port number 1 of DIP U32, and so forth. The primitive components are chip sections (areas of silicon inside the DIPs), wires, and pins (the metal leads that connect the silicon to the etches on the circuit board).

Circuits are described in BASIL in terms of two component part-of hierarchies, a physical part-of hierarchy and a functional part-of hierarchy. Figures 11 and 12 show a simplified example that leaves out the pins: the only primitive components shown there are the chip sections corresponding to individual gates. There are two boards A and B, each having several DIPs. Three of the DIPs on A and two of the DIPs on B form a single four-bit adder. We assume that each board has other chips not shown. The four-bit adder is composed of two two-bit adders tb1 and tb2 (only one is shown). Each two-bit adder is composed of two full-adders, each full-adder is composed of two half-adders and an OR-gate, and each half-adder is composed of an AND-gate and an XOR-gate. Each of the full-adders, fa1-fa4, is distributed across three chips: a quad AND-gate chip, a quad XOR-gate chip, and a quad OR-gate chip. Fig. 12 shows the physical part-of relation ppart-of and the functional part-of relation fpart-of. Remember that the leaf elements of the two trees are identical, even though they are not drawn that way. The rationales for these two hierarchies are discussed below.

[Fig. 11. Physical and functional organizations. Labels recoverable from the source: Board A, Board B, and the DIPs QA1, QX1, QO1, QA2, and QX2.]

[Fig. 12. Physical and functional hierarchies, showing the relations ppart-of, fpart-of, and xpart-of among the adder, tb1, tb2, and their subcomponents.]

4.1. Physical organization

The need for troubleshooting efficiency indicates several desirable properties of a structure representation. Every field replaceable component should correspond to some node in the hierarchy, so that the program will not waste effort on distinguishing between failures having identical repairs. The hierarchy should be strict, because if components can share subcomponents, then their likelihoods of failure are not independent, which in turn complicates the ranking of diagnoses. The leaves of the structure should correspond to the locations of possible failures, so that it does not represent more detail than necessary.

These properties are embodied in the BASIL representation of the physical structure of the device. In BASIL the ppart-of relation is defined to be strict and groups components by their packaging. Thus chip sections and their pins are grouped into DIPs, DIPs into boards, and so forth. All assumptions (in the GDE sense) about which components are working are located in the physical hierarchy. In XDE, hierarchic diagnosis descends through this hierarchy, replacing assumptions about parent components with assumptions about their subcomponents. However, there is no behavioral information associated with most physical components, the exceptions being the primitive components, which are treated as both functional and physical.

4.2. Functional organization

Predicting the behavior of a complex device from the details of its physical organization can be greatly simplified by using a representation of the intended behaviors of groups of components at multiple levels of abstraction. For example, it is easier to reason about the behavior of a digital logic gate than about the equivalent collection of resistors and transistors: the structural composition of those components enables abstraction of their combined behavior. For the same reason, it is easier to reason about an adder performing arithmetic on integers than about the equivalent collection of digital logic gates, and so on. A functional hierarchy provides a way of organizing these structural compositions to which intended behaviors are attached. A node in the functional hierarchy exists just to facilitate behavioral abstraction.

The functional part-of hierarchy is not required to be strict, and groups components by their function. For example, two gates are collected together to form a functional component called a half-adder; full-adders are grouped together to form two-bit adders, and so forth (Fig. 11). Since the hierarchy is not required to be strict, it is perfectly legitimate to (say) group a half-adder, an OR-gate and an AND-gate into a three-input majority circuit. All behavioral information about the circuit is associated with components appearing in this functional hierarchy. The concept of a functional component, then, is closely related to that of a slice [46], in the sense that it enforces constraints among signal values at arbitrarily distant physical locations in the circuit.

In both hierarchies, a component is believed to be working if and only if all its subcomponents are working. For example, h1 is believed to be working if and only if both a1 and x1 are working, hence if and only if both QA1 and QX1 are working. Since assumptions only appear in the physical hierarchy, the practical consequence is that any behavioral assertion (such as "the output of gate U32A is 1"), although arising from knowledge located in the functional hierarchy, depends ultimately only upon observations made by the troubleshooter, plus assumptions about which physical components are working. This is exactly as it should be, since this means that any diagnoses produced by a GDE-like scheme will be in terms of physical components needing repair.
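The way a belief about a functional component reduces to assumptions about physical components can be sketched in Python. The dictionaries below encode only the half-adder fragment mentioned in the text (h1 composed of a1 and x1, located in DIPs QA1 and QX1); the data layout and the function name are assumptions of this sketch, not BASIL.

```python
# Functional parent -> functional children (fragment of the fpart-of relation).
FPART_CHILDREN = {"h1": ["a1", "x1"]}

# Primitive chip section -> the DIP that physically contains it (ppart-of).
PPART_PARENT = {"a1": "QA1", "x1": "QX1"}


def physical_support(component):
    """Set of DIPs that must all be working for `component` to be working."""
    children = FPART_CHILDREN.get(component)
    if children is None:
        # A primitive component: both functional and physical.
        return {PPART_PARENT[component]}
    support = set()
    for child in children:
        support |= physical_support(child)
    return support


print(sorted(physical_support("h1")))  # ['QA1', 'QX1']
```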

4.3. The decomposition operation in XDE

The presence of two hierarchies raises a coordination problem for a troubleshooting engine such as XDE. It is easy to see how to descend through the strict physical hierarchy, expanding those components that appear in the likeliest few diagnoses. But which behaviors in the functional hierarchy should be invoked given a particular level of assumption in the physical? The solution XDE uses is to invoke the behaviors of all functional components that fully contain any immediate physical subcomponent. A physical component is "fully contained" if it is a physically maximal part of the functional component, abbreviated xpart-of.

The xpart-of relation holds between each physical component and zero or more functional components. A physical component is a physically maximal part of a functional component when all its subcomponents help to implement that functional component. Strictly speaking, it is when all the leaf ppart-of descendants of the physical component are leaf fpart-of descendants of the functional, but the same is not true of the parent of the physical component. xpart-of is maximal with respect to ppart-of but minimal with respect to fpart-of. Figure 12 shows with dashed lines some examples of the relation xpart-of. For example, QA1 is xpart-of tb1 because all of its leaf subcomponents are leaf subcomponents of tb1, but the same is not true of the parent of QA1, Board A. Hence if Board A were assumed to be working, then since QA1 is an immediate physical subcomponent of Board A and is xpart-of tb1, the behavior of tb1 would be run. The behaviors of the children of tb1 would not.

4.4. Summary

Effective troubleshooting requires information that is not always represented explicitly in traditional circuit models. BASIL addresses this by embodying the principles of modeling for troubleshooting. For example, components of the structure representation should correspond to the possible repairs of the actual device; this principle is embodied in BASIL as a physical hierarchy. Components should also facilitate behavioral abstraction; this is embodied as a functional hierarchy. BASIL represents these as two separate hierarchies each having different characteristics. XDE uses both hierarchies, giving primacy to the physical hierarchy and coordinating its descent through the functional hierarchy based on the "physically maximal subpart" relation.

5. Representing faults and misbehaviors

Knowledge about how components are likely to fail is important for troubleshooting. One of the strengths of GDE and other model-based troubleshooting frameworks is that they use only information about the intended behavior of components [8,16,20,21,38]. As long as the device model can detect all conflicts between the intended and observed behavior, they produce focused diagnoses. However, there are many reasons why conflicts go undetected when troubleshooting complex devices: observations may be imprecise, models of component behavior may be approximate, reasoning about their behavior may be intractable, and so forth. Faced with such a situation, these schemes generate many diagnoses that appear logically possible but are physically implausible. Within this context, knowledge about how a given component is likely to fail can be viewed as heuristic information about which diagnoses to disregard or discount.
XDE is able to make use of, but does not rely on, available knowledge about known modes of failure to perform this discriminatory function. Important as it is, knowledge about failure modes is not always included in traditional circuit models. Hence, just as the principles of modeling for troubleshooting lead to the modeling of temporally coarse behavior (Section 2) and to the separate modeling of physical and functional organization (Section 4), these same principles lead to a representation that includes explicit information about failures. The notion of a syndrome makes explicit the physical and functional aspects of that knowledge as represented in BASIL and TINT.

In BASIL, every physical component has a status that indicates whether it is failing, and possibly in what way. The status working means that the physical component is undamaged and hence its behavior is normal. The status other means that it is damaged in a way whose consequences for its behavior are unknown. Each type of physical component also has a (possibly empty) set of statuses indicating that it is broken in some way whose behavioral consequences fall into a known category (a fault model). For example, a resistor might have the status open, meaning that it has some physical damage resulting in its having infinite resistance. The latter statuses are collectively referred to as syndromes of the component. The assumption that a physical component U25 is working is denoted U25; the assumption that U25 has failed with syndrome S is denoted U25_S.

Assertions about behavior may be supported by assumptions about components having statuses other than working. Just as assertions about normal behavior are labeled with sets of assumptions about components having status working, assertions about (normal and) abnormal behavior will be labeled with sets of assumptions both about components working and not working. All statuses for a physical component are mutually exclusive and have an estimated prior probability. The statuses are exhaustive, but the syndromes are not: that is, for any physical component the status other always has a nonzero (although possibly very small) probability. For efficiency reasons, there are no explicit assumptions of the form "physical component X has status other". A physical component can only have a status besides working or other by assumption. There is currently no inference mechanism by which it can be directly deduced that a physical component is failing with a certain syndrome, and hence no handling of dependent failures.

The inclusion of syndrome statuses raises three issues: the behavioral consequences of a physical component having a syndrome status; the assignment of prior probabilities to each status; and the treatment of syndromes by XDE. These are discussed in turn.

5.1. Behavioral consequences of physical failures

Technically, a syndrome is a set of sets of physical failures that result in equivalent misbehaviors of a component. (This is merely a definition of what a knowledge engineer would be referring to in the real world when including a syndrome in the knowledge base. XDE does not itself derive syndromes from more primitive information in the knowledge base, so it has no need to represent or construct the actual sets of sets of physical failures.) For example, consider an imaginary DIP inverter-dip with four pins (power, ground, input, output) and just one TTL inverter on it. Some of the possible physical failures inside the DIP are:

(a) the pulldown is open;
(b) the output pin is open;
(c) the pullup is shorted;
(d) the pulldown is shorted;
(e) the input pin is open.

Three example syndromes are:

(1) A constant output logic level of 1, producible by several different combinations of physical failures. Its pulldown might be open, its output pin might be open (since TTL floats high), its pulldown might be open and its pullup shorted, and so on. This is the set of sets {{a}, {b}, {a,c}, ...}.
(2) A constant logic level of 0. Its input pin might be open, its pulldown might be shorted, and so on. This is the set of sets {{e}, {d}, ...}.
(3) A constant frequency of 0, producible by both sets of failures described above. The union of those sets is thus another syndrome.

Although in principle syndromes can thus intersect, in practice the syndromes for a given component are disjoint sets. Syndromes are sets of sets of failures, but for mnemonic value they are usually named according to the misbehavior that results. For example, syndrome (3) above, which caused the inverter-dip output frequency to be zero, will be denoted zerof. Since the misbehavior of a component is relative to its intended behavior, each syndrome is thus tied implicitly to a level of behavioral abstraction. The status working corresponds to an empty set of failures; the status other corresponds to all combinations not covered by working and the failure syndromes.

The behavioral consequence of a physical component having status working is that functional components of which it is xpart-of can have their normal intended behavior. From Fig. 12, for example, if DIPs QA1 and QX1 are working then h4 has the normal behavior of a half-adder. Similarly, the behavioral consequences of a physical component having some syndrome status will be that some functional component has incorrect behavior. If, for example, the inverter-dip has status zerof, then the output of the inverter will have frequency 0. Knowledge about the behavior resulting from physical failures is given to the system in the same way as knowledge about correct behavior.
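The set-of-sets reading of a syndrome can be made concrete with a small sketch. The encoding below is purely illustrative: the names `STUCK_AT_1`, `STUCK_AT_0`, `ZEROF`, and `status_of` are hypothetical, only the failure sets listed explicitly in the text are included, and XDE itself does not construct these sets.

```python
# Illustrative encoding of the inverter-dip example; all names are hypothetical.
# Failure labels follow the list (a)-(e) in the text.
FAILURES = {
    "a": "pulldown open",
    "b": "output pin open",
    "c": "pullup shorted",
    "d": "pulldown shorted",
    "e": "input pin open",
}

# A syndrome is a set of failure sets. Only the sets named in the text appear;
# the elided "..." members are deliberately not filled in.
STUCK_AT_1 = frozenset({frozenset("a"), frozenset("b"), frozenset("ac")})
STUCK_AT_0 = frozenset({frozenset("e"), frozenset("d")})
ZEROF = STUCK_AT_1 | STUCK_AT_0  # syndrome (3): the union of (1) and (2)


def status_of(failures, syndromes):
    """Map a set of physical failures to 'working', a syndrome name, or 'other'."""
    failures = frozenset(failures)
    if not failures:
        return "working"  # the empty set of failures
    for name, syndrome in syndromes.items():
        if failures in syndrome:
            return name
    return "other"  # any combination not covered by a syndrome


if __name__ == "__main__":
    known = {"zerof": ZEROF}
    for case in ([], ["a"], ["a", "c"], ["c"]):
        print(case, "->", status_of(case, known))
```

In this sketch the lone failure (c) falls under other only because the text does not assign it to a syndrome; a fuller knowledge base might classify it differently.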
As an example of the behavioral consequences of syndromes, consider the burst detector (shown earlier in Fig. 8 and reproduced in Fig. 13). Eighteen clock cycles after the Start signal falls, the output Msb is asserted for one cycle.

Fig. 13. Burst detector, with physical organization. (Labels: Start, Clock, Load, Msb; DIPs U10, U11, U20.)

The internal structure of the burst detector involves three DIPs: two four-bit counters U10 and U11, and a quad NOR gate DIP U20. Any of the three DIPs U10, U11, or U20 could fail in ways that prevent the burst detector from ever starting to count, so that Msb would always be 0. For example, there are three pins in U20 that if open would cause the Load signal to be stuck at 1, the result being that counting would never start. Thus each of the three DIPs has a syndrome denoted bd-inactive (burst detector inactive), and if any of them has that status then the burst detector is inactive:

    If   [status-of ?u bd-inactive]
    and  (member ?u '(u10 u11 u20))
    Then [status-of burst-detector inactive]

If the burst detector is in inactive mode then its output is 0:

    If   [isa ?b burst-detector]
    and  [thru ?l ?u (mode ?b) inactive]
    Then [thru ?l ?u (ll (out msb ?b)) 0]
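The two rules above can be paraphrased procedurally. The sketch below is an assumption-laden simplification: the function names are hypothetical, chip statuses are plain strings in a dictionary, and the time intervals carried by the thru predicate are omitted.

```python
# Simplified paraphrase of the two burst-detector rules; names are hypothetical
# and the temporal intervals of the original rules are left out.
BURST_DETECTOR_CHIPS = ("u10", "u11", "u20")


def burst_detector_mode(chip_status):
    """Return 'inactive' if any chip has status bd-inactive, else None
    (None means the rule does not fire and nothing is concluded)."""
    if any(chip_status.get(chip) == "bd-inactive" for chip in BURST_DETECTOR_CHIPS):
        return "inactive"
    return None


def msb_logic_level(mode):
    """Return the predicted logic level of Msb, or None if no prediction is made."""
    return 0 if mode == "inactive" else None


if __name__ == "__main__":
    status = {"u10": "working", "u11": "working", "u20": "bd-inactive"}
    print(msb_logic_level(burst_detector_mode(status)))
```

Returning None rather than a default value mirrors the fact that the rules only add conclusions; they say nothing when their antecedents fail.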

5.2. Assigning probabilities

Each component status needs to be assigned a prior probability. XDE uses these probabilities to rank alternative diagnoses by their likelihood. These probabilities thus help to guide the program in its choice of probes and are used in determining when to terminate the diagnosis. Estimating failure probabilities in general is subtle and complex; a very simple framework is used here. For example, independence between failures is assumed, a strong simplifying assumption, although not as strong as assuming that failure effects are independent, as in MYCIN [42]. That is, the likelihood that a given component fails does not depend on whether any other component has failed, but it is possible for the effects of multiple failures to mask, counteract, or intensify one another.

The probability of a given component working is estimated from its "complexity", a nonnegative integer representing the number of breakable physical parts and how likely they are to break. The probability of a component having status working is the probability that all its subcomponents are working, assuming independence. The probability of failure in a component with complexity 1 has been assigned 0.0001; almost any number very close to 0 could have been used. For example, the complexity of a chip section might be 1; the complexity of a pin might be 2 (pins have a solder joint on each end and tend to break easily); hence the complexity of an n-pin DIP would be 2n + 1; and so forth. The probability that a 16-pin DIP is working would thus be $(1 - 0.0001)^{33} = 0.9967$.

An estimated probability is assigned to each of the possible statuses of a physical component using the complexity estimates. For example, assume that pins have complexity 2 and all other primitive components have complexity 0. Then the likelihood that the inverter-dip is working is estimated as $0.9999^{8}$, the likelihood that all four pins are working. The likelihood that the inverter-dip has syndrome zerof is estimated as $4 \times ((1 - 0.9999^{2}) \times 0.9999^{6})$, the likelihood that exactly one of the four pins is independently broken. This is only an estimate, since on the one hand there might be failures in the pins other than opens, but on the other hand multiple pin failures that would cause the same syndrome are not being counted. Finally, the likelihood that it has status other is then 1 minus the likelihoods of these other two statuses:

Syndrome    Likelihood
working     $0.9999^{8} = 0.9992$
zerof       $4 \times ((1 - 0.9999^{2}) \times 0.9999^{6}) = 0.0007$
other       $1 - 0.9992 - 0.0007 = 0.0001$
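The arithmetic behind these estimates can be reproduced with a short sketch. The function names are hypothetical; the only inputs taken from the text are the unit failure probability 0.0001, the pin complexity 2, and the 2n + 1 complexity of an n-pin DIP.

```python
# Sketch of the complexity-based estimate; function names are hypothetical.
P_UNIT_FAIL = 0.0001  # failure probability of a complexity-1 part (from the text)


def p_working(complexity):
    """Probability that a part of the given complexity is working,
    assuming its unit parts fail independently."""
    return (1.0 - P_UNIT_FAIL) ** complexity


def single_pin_syndrome_statuses(n_pins, pin_complexity=2):
    """Status likelihoods for a DIP whose only breakable parts are its pins
    and whose one syndrome is 'exactly one pin broken' (the inverter-dip case)."""
    pin_ok = p_working(pin_complexity)
    working = pin_ok ** n_pins
    zerof = n_pins * (1.0 - pin_ok) * pin_ok ** (n_pins - 1)
    return {"working": working, "zerof": zerof, "other": 1.0 - working - zerof}


if __name__ == "__main__":
    print(round(p_working(2 * 16 + 1), 4))  # 16-pin DIP with complexity 2n + 1
    for status, likelihood in single_pin_syndrome_statuses(4).items():
        print(f"{status:8s} {likelihood:.7f}")
```

Evaluated without intermediate rounding, zerof comes out slightly below 0.0008 and the residual for other is far smaller than the tabulated 0.0001; the table's last entry reflects subtracting the already rounded values.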

For each of the three DIPs in the burst detector, the likelihood of each syndrome occurring is estimated from the likelihood of failures in the pins. For example, let the likelihood of U10 working be $0.9999^{32}$, the likelihood that all sixteen pins are working. The likelihood of U10 having syndrome bd-inactive is then $3 \times (0.0002 \times 0.9999^{30})$, the likelihood that the DIP has exactly one of the three single-pin faults that cause bd-inactive. The likelihood of other is just the residual:

U10 status     Likelihood    Description
working        0.997         All sixteen pins working
bd-inactive    0.0006        Any of three pins open
other          0.0024

For U11, there are four open pin faults that can cause the syndrome:

U11 status     Likelihood    Description
working        0.997         All sixteen pins working
bd-inactive    0.0008        Any of four pins open
other          0.0022

For U20, five open pin faults can cause it:

U20 status     Likelihood    Description
working        0.997         All fourteen pins working
bd-inactive    0.001         Any of five pins open
other          0.002

Of course there are better ways of estimating these failure rates: the power dissipation of the DIP, for example, would probably be a better predictor. This scheme has the advantage that it can be derived from the representation of physical structure once the basic units of complexity have been chosen.

5.3. The r

redistributed among all diagnoses of all components. For example, suppose the two other diagnoses had been that DIP j was broken (with weight 0.30) and that DIP k was broken (also with weight 0.30). After redistributing the weight 0.35 evenly across the three diagnoses, j and k would have weights of 0.42 each and the diagnosis involving i would have a weight of 0.17. Thus the likelihood of i being broken relative to the other diagnoses will have been decreased from 0.40 to 0.17.

This simple intuition is complicated by the need to consider multiple faults. Recall that in XDE, assumptions can be of the form "component X is working" or "component X is broken with syndrome S". Following the terminology of GDE, let an environment be a set of assumptions, denoted {...}. The power set of all assumptions forms the universe of possible environments. In the burst detector example, there are three chips (U10, U11, and U20) and hence three assumptions U10, U11, and U20, each denoting the assumption that a particular chip has status working. There are three other assumptions U10i, U11i, and U20i, each denoting the assumption that a chip has status bd-inactive. With six assumptions, there are $2^{6}$ environments. Every prediction is labeled with the set of environments in which it holds. Conflicts arise when predictions are contradictory. Conflicts are sets of assumptions, at least one of which must be false, and are denoted ⟨...⟩. Conflicts also arise because each component can have only one status at a time. For example, suppose that the output of the burst detector is observed to be incorrect. Then
$p(\mathrm{U10})\,p(\mathrm{U11})\,p(\mathrm{U20i}) = (0.997)(0.997)(0.001) = 0.00099$.

The probability of the environment {U10,U11} is $p(\mathrm{U10})\,p(\mathrm{U11})\,p(\mathrm{U20}_{other}) = (0.997)(0.997)(0.002) = 0.00199$. The number of consistent environments, however, is exponential in the number of assumptions, and many have very low probabilities. For example, even the empty environment {} is a consistent environment, corresponding to the diagnosis that all three chips have status other, but its probability is very small.

GDE heuristically considers only those consistent environments that are maximal, that is, consistent environments that have no consistent supersets. (Technically, in GDE the diagnoses are referred to as candidates and are computed as the minimal covering sets of the conflicts. But with respect to the universe of assumptions, each candidate is simply the complement of a maximal consistent environment: the correspondence is one-to-one.) The goal is to consider only those diagnoses with the highest likelihoods. The GDE heuristic works well because it only uses assumptions that components are working properly, and these assumptions are individually much more likely than their negations. In the terminology of XDE, this is like saying that the likelihood of the status working is always much greater than the likelihood of the status other. For example, with only the assumptions U10, U11, and U20, and the only conflict being ⟨U10,U11,U20⟩, GDE would only consider the environments {U10,U11}, {U10,U20}, and {U11,U20}.

Similarly, XDE heuristically considers those diagnoses that correspond to the maximal consistent environments. However, the GDE heuristic is no longer entirely appropriate, since a syndrome status for a given component could be either more or less likely than the status other. For example, given the conflicts shown earlier, the two environments {U10,U11,U20i} and (its subset) {U10,U11} have the probabilities 0.00099 and 0.00199 respectively. Following the GDE heuristic would overlook {U10,U11} as a diagnosis because it is not a maximal environment, even though its probability is greater than that of the maximal environment {U10,U11,U20i}. XDE deals with this by considering as diagnoses all environments that are either maximal or that can be derived from the maximal consistent environments by deleting syndrome assumptions. Intuitively, XDE considers diagnoses in which known failure syndromes are replaced by the status other. {U10,U11}, for example, would be considered because it can be derived from {U10,U11,U20i} by deleting the syndrome assumption U20i. This is a good compromise, since very low likelihood environments ({U10i,U11i}, for example) would still not be considered, because they are not derivable from any maximal consistent environment by deleting syndrome assumptions.
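Two of the calculations in this passage lend themselves to a small sketch: the GDE step from conflicts to minimal covering sets (whose complements are the maximal consistent environments), and the probability of an environment when components without an explicit assumption are taken to have status other. The brute-force enumeration and all function names are illustrative assumptions, not the ATMS-based mechanism used by GDE or XDE; the status likelihoods are the rounded values from the tables in Section 5.2.

```python
# Illustrative sketch; names are hypothetical and the enumeration is brute force.
from itertools import combinations

STATUS_P = {  # rounded likelihoods from the tables in Section 5.2
    "U10": {"working": 0.997, "bd-inactive": 0.0006, "other": 0.0024},
    "U11": {"working": 0.997, "bd-inactive": 0.0008, "other": 0.0022},
    "U20": {"working": 0.997, "bd-inactive": 0.001, "other": 0.002},
}


def minimal_hitting_sets(conflicts):
    """All minimal sets of assumptions that intersect every conflict."""
    universe = sorted(set().union(*conflicts)) if conflicts else []
    found = []
    for size in range(len(universe) + 1):
        for combo in combinations(universe, size):
            candidate = frozenset(combo)
            if any(smaller <= candidate for smaller in found):
                continue  # contains a smaller cover, so it is not minimal
            if all(candidate & conflict for conflict in conflicts):
                found.append(candidate)
    return found


def maximal_environments(assumptions, conflicts):
    """Complements of the minimal covers within the universe of assumptions."""
    universe = frozenset(assumptions)
    return [universe - cover for cover in minimal_hitting_sets(conflicts)]


def environment_probability(env):
    """env maps component -> assumed status; a component with no assumption
    is treated as having status 'other'."""
    p = 1.0
    for component, table in STATUS_P.items():
        p *= table[env.get(component, "other")]
    return p


if __name__ == "__main__":
    conflict = frozenset({"U10", "U11", "U20"})
    for env in maximal_environments(conflict, [conflict]):
        print(sorted(env))
    print(environment_probability({"U10": "working", "U11": "working", "U20": "bd-inactive"}))
    print(environment_probability({"U10": "working", "U11": "working"}))
```

The sketch covers only the GDE-style example with working assumptions; it does not attempt XDE's additional step of deleting syndrome assumptions from maximal environments.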

If the additional conflicts ⟨U10⟩, ⟨U11⟩, and ⟨U20⟩ had been discovered, then {U10i,U11i,U20i} would be a maximal consistent diagnosis, hence the diagnosis {U10i,U11i} would be considered as well. XDE is thus capable of constructing and considering diagnoses in which multiple components have syndrome statuses as well as the status other.

To see the overall effect of this scheme, suppose that initially there are no syndrome assumptions, and only the conflict ⟨U10,U11,U20⟩ is known. Because there are no syndrome assumptions yet, the probability of a component U having status other is simply $1 - p(U)$. Diagnoses shall be denoted [[ ]] and written out as the universe of assumptions with an indication of whether each assumption is present (true) or absent (false) in that environment; an absent assumption is written with a preceding ¬. The environment {U10,U11}, for example, corresponds to the diagnosis [[U10,U11,¬U20]]. The set of diagnoses and their weights are as follows:

Diagnosis             Probability    Weight
[[¬U10,U11,U20]]      0.00298        0.33
[[U10,¬U11,U20]]      0.00298        0.33
[[U10,U11,¬U20]]      0.00298        0.33

Now let the syndromes U10i, U11i, and U20i be added. The probability of a component U having status other is now $1 - p(U) - p(Ui)$. Suppose that the output of the burst detector had been observed to be active, so that it is inconsistent for any of the chips to have status bd-inactive. There would thus be three new conflicts ⟨U10i⟩, ⟨U11i⟩, and ⟨U20i⟩. The new diagnoses and likelihoods are:

Diagnosis                                  Probability    Weight
[[¬U10,U11,U20,¬U10i,¬U11i,¬U20i]]         0.0024         0.36
[[U10,¬U11,U20,¬U10i,¬U11i,¬U20i]]         0.0022         0.33
[[U10,U11,¬U20,¬U10i,¬U11i,¬U20i]]         0.0020         0.30
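The two tables of weights can be recomputed with a short sketch. It assumes, as the text does, single-fault diagnoses in which one chip has status other and the remaining chips are working; the function name and the dictionary layout are hypothetical.

```python
# Sketch of the weight calculation for single-fault diagnoses; names are hypothetical.
P_WORKING = {"U10": 0.997, "U11": 0.997, "U20": 0.997}


def diagnosis_weights(p_other):
    """For each chip, the probability that it has status 'other' while every
    remaining chip is working, and that probability normalized over all diagnoses."""
    probabilities = {}
    for broken, p in p_other.items():
        for chip, p_ok in P_WORKING.items():
            if chip != broken:
                p *= p_ok
        probabilities[broken] = p
    total = sum(probabilities.values())
    return {chip: (p, p / total) for chip, p in probabilities.items()}


if __name__ == "__main__":
    # Before syndromes are added: other = 1 - p(U).
    before = diagnosis_weights({chip: 1.0 - p for chip, p in P_WORKING.items()})
    # After syndromes are added: other = 1 - p(U) - p(Ui), from the status tables.
    after = diagnosis_weights({"U10": 0.0024, "U11": 0.0022, "U20": 0.002})
    for label, result in (("before", before), ("after", after)):
        for chip, (p, weight) in result.items():
            print(label, chip, f"{p:.5f}", f"{weight:.2f}")
```

Normalizing by the sum of the diagnosis probabilities is what turns the small absolute changes in the other likelihoods into the modest shift in weights shown in the second table.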

In this case, the syndromes and the conflicts derived from them had two effects. First, they reduced the probabilities associated with each diagnosis (from 0.00298 to 0.0024, and so forth). Had there been other components in the example, their relative likelihoods would have been increased. Second, they changed the relative likelihoods of failure among the three components, although not by much. In the XDE framework, the effect of these shifts in relative likelihood between diagnoses is twofold: they bias the selection of probes so as to focus on distinguishing among the likeliest diagnoses, and, if a syndrome status for a given component is much likelier than the status other, they can reduce the likelihood of a diagnosis so much that for all practical purposes the diagnosis will be ruled out.

The treatment of syndromes in XDE is similar in its essentials to the treatment of fault modes in SHERLOCK [17]. One difference is that SHERLOCK allows several assumptions that each correspond to a different correct behavior for a component. XDE requires only the status working, because the behavior modes of a component are treated separately as a TINT signal mode that can vary over time (Section 2). A second difference is that SHERLOCK has an explicit assumption for each component that corresponds to the status other in XDE. As discussed above, the calculation of relative diagnosis likelihoods in XDE treats this assumption as being implicit, so this difference is not important.

5.4. Summary: principles for using syndromes

XDE extends the GDE paradigm to include the use of fault models because when this kind of knowledge is available it provides improved guidance to the troubleshooter. The notion of a syndrome, with its prior probability and both physical and functional aspects, is the embodiment of a fault model in BASIL and TINT. However, the mere fact that XDE can make use of syndromes says nothing about which syndromes should be included in the model of a given circuit. Just as the goal of modeling for troubleshooting yielded principles for using temporal abstractions and principles for choosing which circuit components to represent explicitly, examination of the impact of syndromes on the performance of XDE yields principles about which syndromes are worth including in the circuit model.

Faults with high likelihood are worth including explicitly. If a particular component is suspected of failure, but (say) 99% of the failures in components of that type produce a behavior other than the one being observed, then that component is almost certainly not faulty. In the scenario of Section 1, for example, the oscillator was virtually ruled out as a suspect because a very likely failure mode (zero output frequency) was contradicted by observations (nonzero output frequency). By contrast, in the burst detector example above, the results are undramatic because the status bd-inactive is less likely than the status other. Discovering the conflict ⟨U10i⟩ thus has only a small effect on the overall likelihood of U10 being broken.

Faults that drastically simplify behavior are worth including explicitly. This is because XDE is completely driven by the detection of conflicts: without conflicts, there is no disambiguation of diagnoses. When behavior is drastically simplified, conflicts will generally be easier to detect using inexpensive observations. In the case of the oscillator, the faulty behavior is much simpler than the correct behavior, and so it is easy to detect using an oscilloscope. Had the syndrome been (say) that the oscillator skipped every hundredth cycle, a detailed model of behavior would have been required to represent it and the available observations would not have been able to distinguish it anyway. Such misbehaviors are better dealt with at the lower levels of structural and behavioral detail from which they originate.

6. Conclusion

The model-based troubleshooting paradigm consists of a device-independent troubleshooting engine and a device-specific model, and hence there are two obvious strategies for dealing with the issues that scaling raises: improve the diagnosis engine, or improve the device model. The work presented here focuses on the second approach, and contributes both a collection of principles for a knowledge engineer to follow in constructing a device model for troubleshooting, and concrete instantiations of those principles in the languages BASIL and TINT. The example in Section 1 showed how XDE diagnosed a fault in the Symbolics 3600 console controller board, and provides a way to review the modeling principles discussed earlier and to summarize the ways in which they interact with the troubleshooting engine XDE to provide leverage on the problem of diagnosing complex devices.

6.1. Representing structure

Figures 1-4 showed the functional organization of the console controller board as represented in BASIL. The program reasoned about the behavior of the device in terms of the behavior of individual components within this representation. The "boundaries" of these functional components simplified representing and reasoning about their behavior, and were deliberately chosen to do so, an example of modeling for troubleshooting. XDE represented its diagnoses in terms of sets of chips, which were elements of a different, physical representation that made explicit the possible repairs of the device. The correspondence between the functional and physical representations was only hinted at by superscript numbers on the boxes in Figs. 1-4, but was discussed at length in Section 4. This correspondence is represented in the relation physically maximal part-of, a relationship between a physical component and those functional components whose correct behavior depends upon it. In the console controller board example the correspondence happened to be simple: each functional component corresponded to a few chips, and there was no sharing of chips between functional components.

6.2. Representing behavior

The behavior of each functional component in the console controller board example is represented in a predicate-calculus-based language for describing constraints between signal values over time. For example, the behavior of component R is a relationship that holds between a pulse on the input Button and the longer pulse on the output Reset that results, given a certain frequency at Clock. The language TINT for writing such descriptions was described in Section 2. The temporally abstract behavior descriptions associated with the components in Figs. 1-4 had a number of properties desirable for troubleshooting. The model described signals in terms of features that were easy to observe: for example, "the Interrupt signal is changing". The model described behavior in terms of features that were stable over long periods of time, so that it was not necessary to represent many events explicitly. The model encapsulated complex circuits consisting of many subcomponents into abstract functional components to allow simplified behavior descriptions: for example, it represented microprocessor M I as a component with very simple relationships between its Clock and Mouse Moves inputs and its Interrupt output. The important thing was not that the behavior model used abstractions, but that it used abstractions developed for their appropriateness to the task of troubleshooting, and further, that these abstractions made the troubleshooting of a complex device feasible.

6.3. Representing faults and misbehavior

XDE uses information about the relative likelihood of component failures to rank alternative diagnoses and to choose discriminating probes. Because the focus of this paper is on modeling issues, the example in Section 1 suppressed these details, and for the most part each chip was treated as having equal likelihood of failure. As discussed in Section 5, the computation of relative likelihoods of failure in XDE is in reality more elaborate and based on estimates of the complexity of each physical component. XDE is able to refine its estimates of the relative likelihoods of diagnoses by using any fault models available at any appropriate level of behavioral abstraction. In the console controller board example, the model included the knowledge that the oscillator component O nearly always failed in such a way as to produce a constant output (instead of a 9.8 MHz wave). XDE used this fact to discount the diagnosis that O was responsible for the observed misbehavior of the board as a whole. Explicitly representing this fault and its effects as a syndrome in the model was well motivated both because the fault had relatively high likelihood, and because the resulting misbehavior of the oscillator was drastically simpler than its correct behavior.

Appendix A. Sequential behavior

A falling-edge triggered register provides the simplest example of sequential behavior, involving only three rules. The first rule says that (a) the number appearing at the output of the register is identical to its state, and that (b) changes from 1 to 0 on the clock input are "interesting":

    If   [isa ?r register]
    Then [tsame -∞ +∞ (state ?r) (num (out 0 ?r))]
    and  [interesting-event (ll (in clk ?r)) (1 0)]

The value of the abstract signal (event ?from ?to ?s) is t whenever there has been a change from the value ?from to ?to. The value of this abstract signal is recorded explicitly only when that event type is marked as "interesting". The second rule is a state-transition rule. Any change from 1 to 0 on the clock input causes the register to enter the state selected by its data input signal (num (input 0 ?r)). The previous state of the register is irrelevant. The rule below concludes that during (at least) the single moment succeeding the transition, state had the value ?input:

    If   [isa ?r register]
    and  [thru ?la ?ua (mode ?r) normal]
    and  [thru ?lb ?ub (event 1 0 (ll (in clk ?r))) t]
    and  [thru ?lc ?uc (num (in 0 ?r)) ?input]
    and  (overlap (?la ?ua) (?lb ?ub) (?lc ?uc))
    Then [thru (+ δ ?ub) (+ δ ?ub) (state ?r) ?input]

The third rule is a persistence rule. The register stays in whatever state it is in so long as there has been no change of the clock from 1 to 0:

    If   [isa ?r register]
    and  [thru ?la ?ua (mode ?r) normal]
    and  [thru ?lb ?ub (event 1 0 (ll (in clk ?r))) nil]
    and  [thru ?lc ?uc (state ?r) ?state]
    and  (<= (max ?la ?lb) ?lc (min ?ua ?ub))
    Then [thru (max ?la ?lb) (+ δ (min ?ua ?ub)) (state ?r) ?state]

In general, transition rules deduce that a component must have been in a state for just one moment, and the persistence rules subsequently deduce how long that state must have lasted.
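The interplay of the transition and persistence rules can be illustrated with a discrete-time sketch. It is a simplification under stated assumptions: time advances in unit steps, the increment δ is taken to be one step, the register is always in normal mode, and the function name is hypothetical. It simulates forward from sampled signals rather than reasoning over intervals as TINT does.

```python
# Discrete-time sketch of the register rules; delta is assumed to be one step
# and the register is assumed to stay in normal mode throughout.
def simulate_register(clk, data):
    """Return the register state at each time step (None means unknown).

    clk  -- logic levels (0 or 1) on the clock input, one per step
    data -- numbers on the data input, one per step
    """
    n = len(clk)
    state = [None] * n
    for t in range(1, n - 1):
        falling_edge = clk[t - 1] == 1 and clk[t] == 0
        if falling_edge:
            state[t + 1] = data[t]   # transition rule: new state one step after the edge
        else:
            state[t + 1] = state[t]  # persistence rule: no edge, so the state carries over
    return state


if __name__ == "__main__":
    clk = [1, 1, 0, 0, 1, 0, 0]
    data = [5, 5, 7, 7, 9, 9, 3]
    # The register output equals its state at all times (first rule).
    print(simulate_register(clk, data))
```

The state stays unknown until the first falling edge has been seen, which corresponds to the fact that the persistence rule needs some established state to extend.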

Appendix B. Other temporal abstractions

Event

The abstract signal (event ?from ?to ?S) is t whenever the underlying signal ?S has just changed sometime during the interval between ?from and ?to. For example, (event 50 70 S) is t where S has just changed from 50 to 70:

time                 0    1    2    3    4    5    6
S                    ?    50   50   70   30   70   50
(event 50 70 S)      ?    ?    nil  t    nil  nil  nil
(event :any 50 S)    ?    ?    nil  nil  nil  nil  t

A ?from argument of :any denotes the special case of any transition to ?to, which is useful for marking the known beginning of an interval. (event :any 50 S) is t at time 6. However, it is not known to be t at time 1 since the value of S could have been 50 at time 0.

Duration

The abstraction duration is defined to be 1 when the signal has just changed, and it indicates how long a signal has stayed at the same value:

time            1    2    3    4
X               3    4    4    5
(duration X)    ?    1    2    1

Swing

The natural companion to the frequency abstraction is that of amplitude, here termed the "swing" of a signal: the difference between its maximum and minimum values with respect to a window. The abstraction max-min-ww denotes swing:

(max-min-ww 3 S)    ?    ?    10   25   15   30   15   15   10

[Table: nine time steps; the S and time rows are not recoverable.]

As with fww, if the underlying signal is periodic and the window is large enough, the fluctuations will be relatively insignificant.
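The three abstractions in this appendix can be sketched over sampled signals. The encoding is an assumption made for illustration: signals are Python lists, an unknown value (? in the tables) is None, the `ANY` sentinel stands in for :any, and a result is reported as unknown whenever a sample it depends on is unknown.

```python
# Sketch of the event, duration, and swing abstractions over sampled signals.
# None encodes an unknown value; ANY is a hypothetical stand-in for :any.
ANY = object()


def event(frm, to, signal):
    """True where the signal has just changed from `frm` to `to`."""
    out = []
    for t, cur in enumerate(signal):
        prev = signal[t - 1] if t > 0 else None
        if prev is None or cur is None:
            out.append(None)  # a needed sample is unknown
        elif frm is ANY:
            out.append(cur == to and prev != to)
        else:
            out.append(prev == frm and cur == to)
    return out


def duration(signal):
    """Number of steps the signal has held its current value (1 right after a change)."""
    out = []
    for t, cur in enumerate(signal):
        if t == 0 or cur is None or signal[t - 1] is None:
            out.append(None)
        elif cur != signal[t - 1]:
            out.append(1)
        else:
            out.append(None if out[-1] is None else out[-1] + 1)
    return out


def max_min_ww(window, signal):
    """Swing: maximum minus minimum over the most recent `window` samples."""
    out = []
    for t in range(len(signal)):
        recent = signal[max(0, t - window + 1): t + 1]
        if len(recent) < window or any(v is None for v in recent):
            out.append(None)
        else:
            out.append(max(recent) - min(recent))
    return out


if __name__ == "__main__":
    s = [None, 50, 50, 70, 30, 70, 50]
    print(event(50, 70, s))
    print(event(ANY, 50, s))
    print(duration([3, 4, 4, 5]))
    print(max_min_ww(3, [0, 10, 0, 15, 0]))  # hypothetical signal, not from the paper
```

With None read as ? and True/False as t/nil, the first three calls reproduce the event and duration tables above; the swing call uses made-up samples because the paper's underlying signal is not available here.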

Appendix C. The general diagnosis engine

In GDE [16] a device is described in terms of components and connections. In Fig. C.1 there are two components, each a digital adder, with one connection between them. These correspond to functional components in BASIL. The behavior description (embodied in BASIL as rules) is used to make local predictions about behavior. Each local prediction is tagged with the set of components on whose correct behavior it depends using an ATMS [13], so that when an observation is made that contradicts what the model predicted, the components responsible can be easily found. Each of these predictions is only valid if one or both adders are assumed to be working normally, and each prediction is tagged with the minimal sets of assumptions that support it. For example, suppose both inputs to an adder component Adder-1 are 2 (Fig. C.1). Neither input to Adder-1 requires any assumptions, so the tag for each "2" input is {}. The prediction that the output X is 4 relies on the assumption that Adder-1 is working normally along with all assumptions supporting the inputs, so it is tagged with the set {Adder-1}. Each such set of assumptions is called an environment. The prediction that the output Y is 8 is tagged with the environment containing the assumptions that Adder-1 and Adder-2 are both working. Observations such as those at the inputs of Adder-1 are true in the empty environment since they rely on no assumptions.

Fig. C.1. Behavior prediction example.

The behavior model need not only predict outputs from inputs, but can enforce any logical relationship between the values carried by connections in the device. Such predictions are tagged with sets of assumptions just as before. For example, if one input to Adder-2 is 4, and the output is 6, then the other input is predicted to be 2 and tagged with the assumption that Adder-2 is working (Fig. C.2). Similarly, if one input to Adder-1 is 2, then the other input is deduced to be 0 and that prediction is tagged with the assumptions that Adder-1 and Adder-2 are working.

Candidate generation involves detecting discrepancies and determining which components could have been responsible.
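Before turning to candidates, the tagging of predictions described above can be sketched in a few lines. This is only an illustration: a tagged value is represented as a (value, environment) pair, the helper names are hypothetical, and a real ATMS maintains sets of minimal environments per prediction rather than a single one.

```python
# Sketch of assumption-tagged predictions for the two-adder example.
# A tagged value is (value, environment); helper names are hypothetical.
EMPTY = frozenset()  # environment of an observation: no assumptions needed


def predict_sum(a, b, adder):
    """Forward prediction: output of `adder`, assuming it is working."""
    (value_a, env_a), (value_b, env_b) = a, b
    return value_a + value_b, env_a | env_b | {adder}


def predict_other_input(output, known_input, adder):
    """Backward prediction: the unknown input of `adder`, assuming it is working."""
    (value_out, env_out), (value_in, env_in) = output, known_input
    return value_out - value_in, env_out | env_in | {adder}


if __name__ == "__main__":
    # Fig. C.1: both inputs of Adder-1 are 2, the other input of Adder-2 is 4.
    x = predict_sum((2, EMPTY), (2, EMPTY), "Adder-1")
    y = predict_sum(x, (4, EMPTY), "Adder-2")
    print(x, y)
    # Fig. C.2: the output of Adder-2 is 6 and one of its inputs is 4.
    x_back = predict_other_input((6, EMPTY), (4, EMPTY), "Adder-2")
    in_back = predict_other_input(x_back, (2, EMPTY), "Adder-1")
    print(x_back, in_back)
```

In this sketch the forward and backward values for X (4 under {Adder-1}, 2 under {Adder-2}) disagree, which is the kind of discrepancy that candidate generation starts from.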
Discrepancies are inconsistent predictions made under different sets of assumptions (that is, in different environments). For example, suppose the inputs to the two-adder device were as in the first case, but the output was observed to be 6 (Fig. C.3). Superimposing the two sets of predictions, it can be seen that (among other discrepancies) node X is predicted to be 4 if Adder-1 is working, but 2 if Adder-2 is working.

Fig. C.2. Another prediction example.

Fig. C.3. Discrepancies produce conflicts.

The union of the environments that underlie inconsistent predictions is termed a conflict, and conflicts are denoted with angle brackets ⟨ ⟩. In this case, ⟨Adder-1, Adder-2⟩ is a conflict. A conflict is a set of assumptions that contains at least one that must be false. In GDE, the assumptions are about whether components are working properly, so a conflict can be thought of as a set of components that cannot all be working properly. If one of the components in each conflict were actually failing, it would resolve the inconsistency. The minimal set covers of these conflicts are termed candidates, denoted with square brackets [ ]. By Occam's razor only the minimal set covers (those with no subsets that are covers) are needed; the minimal covers are the simplest explanations for the inconsistency. Each candidate corresponds to a set of components that would resolve all the inconsistencies if all of them were failing. For example, if there is just one conflict ⟨Adder-1, Adder-2⟩ there are two singleton candidates, denoted [Adder-1] and [Adder-2]. The covering set that includes both adders is not a candidate, since it is not minimal.

This scheme incorporates the handling of multiple faults in a natural way. Suppose we subsequently observe that X is 5. There would then be two conflicts ⟨Adder-1⟩ and ⟨Adder-2⟩, and their minimal set cover would be the candidate [Adder-1, Adder-2], meaning that both Adder-1 and Adder-2


are faulty.

In general, the number of candidates can be exponential in the number of conflicts. Consider for example 2n assumptions and n conflicts, one for each pair of assumptions 2i and 2i + 1; this results in 2^n candidates. Exponential blowup is rare in practice; a more common phenomenon is that along with a small set of single-fault candidates there will be a larger set of multiple-fault candidates. For example, the two conflicts ⟨A, B, C, D⟩ and ⟨D, E, F, G⟩ yield one single-fault candidate [D] and nine two-fault candidates.

In GDE each candidate is a set of assumptions that would resolve all conflicts if they were all false. GDE assigns a prior probability to each candidate by treating each assumption as independent and assigning to each a prior probability near 1.0 of being true. The prior probability of a candidate is then the probability that all the assumptions it includes are false and all other assumptions are true. The minimal candidates' prior probabilities can be normalized to yield an approximation to their posterior probabilities. Continuing the two-adder example, let the initial probability of each adder working be p(Adder) = 0.99. The posterior of each is approximately 0.50, computed as shown:

Candidate    Prior                                      Normalized
[Adder-1]    (1 - p(Adder-1)) × p(Adder-2) = 0.0099     0.50
[Adder-2]    p(Adder-1) × (1 - p(Adder-2)) = 0.0099     0.50

Suppose there had been three adders A, B, and C with p(A) = p(B) = p(C) = 0.99, and that there were two conflicts ⟨A, B⟩ and ⟨B, C⟩. There would be two candidates [B] and [A, C] whose rankings would be as shown below. This yields the intuitively satisfying result that the single-fault candidate [B] is much more likely than the multiple-fault candidate [A, C]:

Candidate    Prior                                            Normalized
[B]          p(A) × (1 - p(B)) × p(C) = 0.0098                0.99
[A, C]       (1 - p(A)) × p(B) × (1 - p(C)) = 0.000099        0.01
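The generation of candidates as minimal set covers of the conflicts, and the normalization of their priors, can be sketched as follows. This is a brute-force illustration with hypothetical function names (`minimal_candidates`, `normalized_priors`), not the algorithm GDE actually uses; it enumerates subsets in order of size and is therefore only suitable for small examples.

```python
from itertools import combinations
from math import prod

def minimal_candidates(conflicts):
    """Return the minimal sets of components that intersect every conflict."""
    components = sorted(set().union(*conflicts))
    found = []
    for size in range(1, len(components) + 1):
        for combo in combinations(components, size):
            cand = frozenset(combo)
            covers = all(cand & conflict for conflict in conflicts)
            # Smaller covers were found first, so a superset is never minimal.
            if covers and not any(prev <= cand for prev in found):
                found.append(cand)
    return found

def normalized_priors(candidates, p_working):
    """Prior of a candidate: its members fail, every other component works."""
    def prior(cand):
        return prod((1 - p) if comp in cand else p
                    for comp, p in p_working.items())
    priors = {cand: prior(cand) for cand in candidates}
    total = sum(priors.values())
    return {cand: value / total for cand, value in priors.items()}

# Three-adder example from the text: conflicts <A, B> and <B, C>.
cands = minimal_candidates([{"A", "B"}, {"B", "C"}])
ranking = normalized_priors(cands, {"A": 0.99, "B": 0.99, "C": 0.99})
```

For the three-adder example this gives the candidates [B] and [A, C] with normalized values of about 0.99 and 0.01. Applied to the conflicts ⟨A, B, C, D⟩ and ⟨D, E, F, G⟩, the same function returns [D] together with the nine two-fault candidates.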

As each new observation is added to the model, the posterior of each candidate $C_j$ is computed using Bayes' rule with $V_i$ a measurement and $O_{ik}$ one of $m$ measurement outcomes, where $1 \le k \le m$:

$$p(C_j \mid V_i = O_{ik}) = \frac{p(V_i = O_{ik} \mid C_j)\, p(C_j)}{p(V_i = O_{ik})}$$

where

$$p(V_i = O_{ik}) = \sum_j p(V_i = O_{ik} \mid C_j)\, p(C_j),$$

$p(C_j)$ is either the prior or the result of the previous update, and

$$p(V_i = O_{ik} \mid C_j) = \begin{cases} 1, & \text{if } C_j \vdash V_i = O_{ik}, \\ 1/m, & \text{otherwise (a heuristic).} \end{cases}$$
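A single Bayesian update of this kind can be sketched as follows. The function name `update_posteriors` and its arguments are hypothetical. The sketch uses likelihood 1 for a candidate that predicts the observed outcome and the 1/m heuristic for a candidate that predicts nothing; giving likelihood 0 to a candidate that predicts a different outcome is an assumption of this sketch, on the grounds that such a candidate is ruled out by the resulting conflict.

```python
def update_posteriors(beliefs, predictions, outcome, m):
    """One Bayes update over candidates.

    beliefs:     candidate -> current p(C_j), prior or previous posterior
    predictions: candidate -> outcome it predicts, or None if it predicts none
    outcome:     the observed outcome O_ik of the measurement V_i
    m:           number of possible outcomes of the measurement
    """
    likelihood = {}
    for cand in beliefs:
        predicted = predictions.get(cand)
        if predicted is None:
            likelihood[cand] = 1.0 / m      # heuristic from the text
        elif predicted == outcome:
            likelihood[cand] = 1.0
        else:
            likelihood[cand] = 0.0          # assumption of this sketch
    evidence = sum(likelihood[c] * beliefs[c] for c in beliefs)
    return {c: likelihood[c] * beliefs[c] / evidence for c in beliefs}

# Hypothetical case: two equally likely candidates and a four-valued measurement.
# The first candidate predicts the observed value; the second predicts nothing.
posterior = update_posteriors(
    beliefs={"[Adder-1]": 0.5, "[Adder-2]": 0.5},
    predictions={"[Adder-1]": 4, "[Adder-2]": None},
    outcome=4,
    m=4,
)
```

In this made-up case the evidence term is 0.5 + 0.125 = 0.625, so the posteriors are 0.8 and 0.2.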

As diagnosis proceeds there are usually several candidates that could explain all the conflicts. To discriminate between these candidates requires gathering more information in the form of either (i) new observations of the device in its current state, or (ii) observations of its response to some new test stimuli. Since there are typically many observations and tests that could be performed, the program needs to choose which of them to do next. This choice can be formulated in terms of the cost of each action, the benefits of their various outcomes, and the likelihoods of those outcomes. Using the entropy of the distribution of outcomes ($\sum_k p(V_i = O_{ik}) \log p(V_i = O_{ik})$) as a "benefit" metric, GDE chooses the observation yielding the minimum expected entropy [22]. In GDE the device model is used to derive the expected outcomes of each possible observation along with their likelihoods. XDE inherits that scheme without any significant deviations, and its details are not relevant to the discussion in the main body of this paper.
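The choice of the next observation by minimum expected entropy can be sketched as follows. This is a simplified illustration rather than GDE's actual procedure: it ignores measurement cost, the names (`expected_entropy`, `best_measurement`, `probe-1`, `probe-2`) are hypothetical, and it assumes that a candidate predicting a different outcome has likelihood 0 while a candidate predicting nothing spreads its weight evenly over the m outcomes.

```python
from math import log2

def likelihood(predicted, outcome, m):
    """p(V = outcome | candidate) under the assumptions stated above."""
    if predicted is None:          # candidate predicts nothing here
        return 1.0 / m
    return 1.0 if predicted == outcome else 0.0

def entropy(probs):
    """Shannon entropy in bits of a distribution over candidates."""
    return -sum(p * log2(p) for p in probs if p > 0)

def expected_entropy(beliefs, predictions, outcomes):
    """Entropy of the candidate posterior, averaged over possible outcomes."""
    m = len(outcomes)
    expected = 0.0
    for outcome in outcomes:
        joint = {c: likelihood(predictions.get(c), outcome, m) * p
                 for c, p in beliefs.items()}
        p_outcome = sum(joint.values())
        if p_outcome > 0:
            posterior = [v / p_outcome for v in joint.values()]
            expected += p_outcome * entropy(posterior)
    return expected

def best_measurement(beliefs, measurements):
    """measurements: name -> (predictions per candidate, possible outcomes)."""
    return min(measurements,
               key=lambda name: expected_entropy(beliefs, *measurements[name]))

beliefs = {"[A]": 0.5, "[B]": 0.5}
measurements = {
    "probe-1": ({"[A]": 0, "[B]": 1}, [0, 1]),  # candidates disagree
    "probe-2": ({"[A]": 0, "[B]": 0}, [0, 1]),  # candidates agree
}
choice = best_measurement(beliefs, measurements)
```

In this made-up example `probe-1` separates the two candidates completely (expected entropy 0 bits), whereas `probe-2` leaves them indistinguishable (1 bit), so `probe-1` is chosen.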

Acknowledgement

This paper describes research done at the Artificial Intelligence Laboratory of the Massachusetts Institute of Technology. Support for the author's research on troubleshooting was provided by DEC, Wang, Symbolics, and DARPA under ONR contract N00014-85-K-0124. Randall Davis provided vital and long-term guidance in the overall task of scaling up the model-based techniques as embodied in the HT troubleshooting program to apply to realistic problems, and in presenting the results of the research. Ramesh Patil recognized the importance of temporal abstractions before I did, and encouraged me to press the attack. Howard Shrobe provided me my own Lisp machine and a quiet place to work, but more important, unwedged the research at a crucial moment by exposing me to the way that people really fix circuits. Brian Williams scrutinized several drafty papers and was always available to discuss truth maintenance systems, temporal reasoning, and GDE. Peter Szolovits was the first to encourage me to move beyond GDE and incorporate fault modes into the methodology. Randall Davis, Mark Stefik, and two anonymous reviewers provided many helpful suggestions on the organization of this paper.

References

[1] A. Abu-Hanna and Y. Gold, An integrated, deep-shallow expert system for multilevel diagnosis of dynamic systems, Tech. Report 504, Technion-Israel Institute of Technology, Haifa, Israel (1988); also in: J.S. Gero, ed., Artificial Intelligence in Engineering: Diagnosis and Learning (Elsevier, Amsterdam, 1988) 75-94.
[2] J.F. Allen, Towards a general theory of action and time, Artif. Intell. 23 (2) (1984) 123-154.
[3] J. Batali, An introduction to DPL, Memo 598, Artificial Intelligence Lab, MIT, Cambridge, MA (1981).
[4] A. Brown, Qualitative knowledge, causal reasoning, and the localization of failures, Tech. Report 362, Artificial Intelligence Lab, MIT, Cambridge, MA (1976).
[5] J.S. Brown, R. Burton and J. de Kleer, Pedagogical, natural language, and knowledge engineering issues in SOPHIE I, II, and III, in: D. Sleeman and J.S. Brown, eds., Intelligent Tutoring Systems (Academic Press, New York, 1982) 227-282.
[6] R.R. Cantone, F. Pipitone, W. Lander and M. Marrone, Model-based probabilistic reasoning for electronics troubleshooting, in: Proceedings IJCAI-83, Karlsruhe, Germany (1983) 207-211.
[7] P. Dague, O. Raiman and P. Deves, Troubleshooting: when modeling is the trouble, in: Proceedings AAAI-87, Seattle, WA (1987) 600-605.
[8] R. Davis, Diagnostic reasoning based on structure and behavior, Artif. Intell. 24 (1) (1984) 347-410; also in: D.G. Bobrow, ed., Qualitative Reasoning about Physical Systems (North-Holland, Amsterdam, 1984/MIT Press, Cambridge, MA, 1985).
[9] R. Davis and W.C. Hamscher, Model-based reasoning: troubleshooting, in: H.E. Shrobe, ed., Exploring Artificial Intelligence: Survey Talks from the National Conferences on Artificial Intelligence (Morgan Kaufmann, San Mateo, CA, 1988) 297-346; also in: P.H. Winston and S.A. Shellard, eds., Artificial Intelligence at MIT: Expanding Frontiers (MIT Press, Cambridge, MA, 1990).
[10] R. Davis and H. Shrobe, Representing the structure and behavior of digital hardware, IEEE Comput. 16 (10) (1983) 75-82.
[11] T.L. Dean and D.V. McDermott, Temporal data base management, Artif. Intell. 32 (1) (1987) 1-55.
[12] J. de Kleer, Local methods for localizing faults in electronic circuits, Memo 394, Artificial Intelligence Lab, MIT, Cambridge, MA (1976).
[13] J. de Kleer, An assumption-based TMS, Artif. Intell. 28 (2) (1986) 127-162.
[14] J. de Kleer and J.S. Brown, A qualitative physics based on confluences, Artif. Intell. 24 (1) (1984) 7-83; also in: D.G. Bobrow, ed., Qualitative Reasoning about Physical Systems (North-Holland, Amsterdam, 1984/MIT Press, Cambridge, MA, 1985).
[15] J. de Kleer, A. Mackworth and R. Reiter, Characterizing diagnoses, in: Proceedings AAAI-90, Boston, MA (1990) 324-330.
[16] J. de Kleer and B.C. Williams, Diagnosing multiple faults, Artif. Intell. 32 (1) (1987) 97-130.
[17] J. de Kleer and B.C. Williams, Diagnosis with behavioral modes, in: Proceedings IJCAI-89, Detroit, MI (1989) 1324-1330.
[18] M.B. First, B.J. Weimer, S. McLinden and R.A. Miller, LOCALIZE: computer-assisted localization of peripheral nervous system lesions, Comput. Biomed. Res. 15 (6) (1982) 525-543.
[19] L. Friedman, Diagnosis combining empirical and design knowledge, Tech. Report JPL D-1328, Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA (1983).
[20] M.R. Genesereth, The use of design descriptions in automated diagnosis, Artif. Intell. 24 (1984) 411-436; also in: D.G. Bobrow, ed., Qualitative Reasoning about Physical Systems (North-Holland, Amsterdam, 1984/MIT Press, Cambridge, MA, 1985).
[21] M.L. Ginsberg, Counterfactuals, Artif. Intell. 30 (1) (1986) 35-79.
[22] G.A. Gorry, J.P. Kassirer, A. Essig and W.B. Schwartz, Decision analysis as the basis for computer-aided management of acute renal failure, Am. J. Med. 55 (1973) 473-484.
[23] W.C. Hamscher, Model-based troubleshooting of digital systems, Tech. Report 1074, Artificial Intelligence Lab, MIT, Cambridge, MA (1988).
[24] W.C. Hamscher, XDE: diagnosing devices with hierarchic structure and known component failure modes, in: Proceedings 6th IEEE Conference on AI Applications, Santa Barbara, CA (1990) 48-54.
[25] W.C. Hamscher and R. Davis, Diagnosing circuits with state: an inherently underconstrained problem, in: Proceedings AAAI-84, Austin, TX (1984) 142-147.
[26] L.J. Holtzblatt, Diagnosing multiple failures using knowledge of component states, in: Proceedings 4th IEEE Conference on AI Applications, San Diego, CA (1988) 139-143.
[27] R. Joyce, Reasoning about time-dependent behavior in a system for diagnosing digital hardware faults, Working Paper HPP-83-37, Stanford Heuristic Programming Project, Stanford, CA (1983).
[28] P.A. Koton, Reasoning about evidence in causal explanations, in: Proceedings AAAI-88, St. Paul, MN (1988) 256-261.
[29] V. Lifschitz, Formal theories of action (Preliminary Report), in: Proceedings IJCAI-87, Milan, Italy (1987) 966-972.
[30] W.J. Long, S. Naimi, M.G. Criscitiello and R. Jayes, Using a physiological model for prediction of therapy effects in heart disease, in: Computers in Cardiology (MIT Press, Cambridge, MA, 1986).
[31] D.A. McAllester, An outlook on truth maintenance, Memo 551, Artificial Intelligence Lab, MIT, Cambridge, MA (1980).
[32] J.M. McCarthy and P.J. Hayes, Some philosophical problems from the standpoint of artificial intelligence, in: D. Michie and B. Meltzer, eds., Machine Intelligence 4 (Edinburgh University Press, Scotland, 1969) 463-502; also in: B.L. Webber and N.J. Nilsson, eds., Readings in Artificial Intelligence (Tioga Press, Palo Alto, CA, 1981).
[33] R. Milne, Fault diagnosis through responsibility, in: Proceedings IJCAI-85, Los Angeles, CA (1985) 423-425.
[34] H.T. Ng, Model-based, multiple fault diagnosis of time-varying, continuous physical devices, in: Proceedings 6th IEEE Conference on AI Applications, Santa Barbara, CA (1990) 9-15.
[35] J. Pan, Qualitative reasoning with deep-level mechanism models for diagnoses of mechanism failures, in: Proceedings 1st IEEE Conference on AI Applications, Denver, CO (1984) 295-301.
[36] R.S. Patil, Causal representation of patient illness for electrolyte and acid-base diagnosis, Tech. Report 267, Lab. for Computer Science, MIT, Cambridge, MA (1981).
[37] R.S. Patil, P.S. Szolovits and W. Schwartz, Causal understanding of patient illness in medical diagnosis, in: Proceedings IJCAI-81, Vancouver, BC (1981) 893-899.
[38] R. Reiter, A theory of diagnosis from first principles, Artif. Intell. 32 (1) (1987) 57-95.
[39] S. Rowley, H. Shrobe, R. Cassels and W.C. Hamscher, Joshua: uniform access to heterogeneous knowledge structures, or, Why Joshing is better than conniving or planning, in: Proceedings AAAI-87, Seattle, WA (1987) 45-52.
[40] E. Scarl, J.R. Jamieson and C.I. Delaune, A fault detection and isolation method applied to liquid oxygen loading for the space shuttle, in: Proceedings IJCAI-85, Los Angeles, CA (1985) 414-416.
[41] Y. Shoham, Chronological ignorance: time, nonmonotonicity, necessity, and causal theories, in: Proceedings AAAI-86, Philadelphia, PA (1986) 389-393.
[42] E.H. Shortliffe, MYCIN: Computer-Based Consultations in Medical Therapeutics (American Elsevier, New York, 1976).
[43] N. Singh, Saturn: an automatic test generation system for digital circuits, in: Proceedings AAAI-86, Philadelphia, PA (1986) 778-783.
[44] N. Singh, An Artificial Intelligence Approach to Test Generation (Kluwer Academic Publishers, Norwell, MA, 1987).
[45] P. Struss and O. Dressler, Physical negation: integrating fault models into the general diagnostic engine, in: Proceedings IJCAI-89, Detroit, MI (1989) 1318-1323.
[46] G.J. Sussman and G.L. Steele Jr, CONSTRAINTS--a language for expressing almost-hierarchical descriptions, Artif. Intell. 14 (1) (1980) 1-39.
[47] B.C. Williams, Doing time: putting qualitative reasoning on firmer ground, in: Proceedings AAAI-86, Philadelphia, PA (1986) 105-112.