The development of an integrated mathematical and knowledge-based maintenance delivery system


Computers Ops Res. Vol. 19, No. 5, pp. 425-434, 1992. Printed in Great Britain. All rights reserved

0305-0548/92 $5.00 + 0.00. Copyright © 1992 Pergamon Press Ltd

THE DEVELOPMENT OF AN INTEGRATED MATHEMATICAL AND KNOWLEDGE-BASED MAINTENANCE DELIVERY SYSTEM

RICHARD M. FELDMAN¹*, WILLIAM M. LIVELY²†, TOM SLADE³‡, L. G. McKEE⁴§ and ALAN TALBERT⁵¶

Departments of ¹Industrial Engineering and ²Computer Science, Texas A&M University, College Station, TX 77843-3131, ³Westlake Programming Laboratory, IBM, Roanoke, Texas, ⁴ALCOA Rockdale Operations, Rockdale, Texas and ⁵Fish Engineering, Houston, TX, U.S.A.

(Received November 1990; in revised form September 1991)

Scope and Purpose-Providing a combination of knowledge-based and analytic approaches to decision support can be a viable alternative for industry. This work illustrates a practical approach to the knowledge acquisition process and demonstrates how a knowledge-based system can become the framework for incorporating operations research analytical approaches into decision support. The maintenance system developed integrates these concepts. The goal of the system is to give advice regarding the appropriate actions for maintaining smelting pots in a continuous process production facility at an ALCOA plant. The requirements of the system are ease of use for production personnel, ease of modification and extension by the engineering staff and delivery on a PC-compatible personal computer.

Abstract-In this paper, we investigate a maintenance problem from a continuous manufacturing environment. The specific maintenance policies to be considered are whether to have replacement, minimal repair or no action. The objective function is based on discounted costs and will be structured so that changes in manufacturing conditions can be easily incorporated into the decision-making process. The delivery tool for the mathematical model will be a system integrating a rule-based expert system with an optimization procedure involving the mathematical analysis of replacement models.

1. INTRODUCTION

This paper describes the developmental process and results of an integrated system that combines a rule-based methodology with a mathematical analysis of various maintenance alternatives. The test bed for this integrated system was the ALCOA plant in Rockdale, TX. The Rockdale plant is a smelting facility that continuously produces aluminum in carbon-lined steel smelting pots. The aluminum is produced by running a current through an electrolytic bath, which eventually corrodes the carbon lining of the pots. Maintenance decisions center around anticipating pot failure and responding to pot failure. Pot failure occurs when a hole develops in the shell of the pot so that electrolyte leaks out. It is sometimes possible to fix a small leak without removing the pot from production through the use of probe-and-plug crews. More often, pot failure indicates problems with the lining significant enough that the pot must be removed from production, or “cut-out”. When a pot is cut-out, there are two major alternatives for restoring it to production: either the pot may be relined, i.e. its whole carbon lining is replaced; or it can be patched, by replacing only the lining surrounding a localized fault. Failures can be anticipated by considering a variety of performance parameters as well as the maintenance history of the pot. Production goals (metal purity) as well as efficiency measures can also contribute to the decision to remove a pot from production.

Two goals of the maintenance expert system are: (1) to provide a familiar and natural context for using an analytical model in evaluating maintenance policies; and (2) to represent in a set of rules the heuristic knowledge of current routine practices. In this way we have attempted to demonstrate how the dissimilar decision-support technologies offered by rule-based systems and mathematical analysis can work synergistically within an expert system.

*Richard M. Feldman is a Professor of Industrial Engineering at Texas A&M University. He received a B.A. degree in Mathematics from Hope College, Holland, MI, an M.S. degree in Mathematics from Michigan State University, East Lansing, an M.S. degree in Industrial Engineering from Ohio University, Athens, and a Ph.D. degree in Industrial Engineering from Northwestern University, Evanston, IL. His research interests include applied probability and simulation. He has served as Associate Editor for both Management Science and Operations Research.

†William M. Lively is an Associate Professor of Computer Science at Texas A&M University. He received a B.S. degree in Biology, an M.S. degree in Electrical Engineering and a Ph.D. degree in Computer Science and Electrical Engineering from Southern Methodist University, Dallas, TX. His research interests include software engineering and computer architecture.

‡Tom Slade is a technical staff member at IBM’s Westlake Programming Laboratory in Roanoke, TX. His degrees include a B.S. in Biology from Yale University, an M.S. in Computer Science from Southwest Texas State University, San Marcos, and a Ph.D. in Computer Science from Texas A&M University. His principal research interests are software engineering and human computer interaction.

§L. G. McKee is Chief Industrial Engineer for ALCOA’s Rockdale Operations in Rockdale, TX. He received his B.S. degree in Industrial and Systems Engineering from the Georgia Institute of Technology, Atlanta. His responsibilities at ALCOA include operations planning and performance improvement strategies.

¶Alan Talbert is a Control System Engineer for Fish Engineering in Houston, TX. He received a B.S. degree in Industrial Engineering from Texas A&M University. He worked for ALCOA’s Rockdale Operations before joining Fish Engineering. His responsibilities include planning and engineering of control systems in process industries.

2. THE MATHEMATICAL MODEL

For the last three decades, there has been an increasing interest in the study of maintenance models for systems with stochastic deterioration, and there has been an accompanying proliferation of models giving so-called optimal policies under a wide variety of differing assumptions [1,2]. The problem described in the Introduction is similar to the minimal repair models of Refs [3,4]. However, one of the problems of the standard models is that they are not dynamic in nature. That is, the optimal policy computed from the mathematical model is not designed to be adjusted based on a particular realization for which a pot is operating in an unusually efficient (or inefficient) manner. Thus, we shall develop policies different from the usual form found in Refs [3,4].

In the context of the problem described in the Introduction, mathematical analysis assists the decision maker at two specific points: (1) for young pots, in helping to decide between a patch (minimal repair) and reline (replacement) for a pot in which a leak has occurred in the side lining; and (2) for old pots, in helping to decide (in a proactive manner) if an old pot that has not yet failed should be relined (replaced) or left alone.

There are clearly some nonquantitative considerations in the patch vs reline decisions, especially the disruption caused by a patch. However, there are also some quantitative factors that should be helpful in knowing when the disruption would be justifiable. The main argument for a patch is the saving in incremental cost between a reline and a patch. Two factors add a slight complication to the calculation of the cost differential. There is a chance that a patch may fail shortly after it is installed; thus, a patch may add to the expense instead of being a cost saving. Also, pots have a slowly increasing operating cost over their life and, thus, the cost differential needs to include the operating cost.
For old pots, the decision to replace early is not usually considered, but there are two reasons why it should be considered. First, because as a pot ages it uses more power, the operating cost of a pot increases as it ages. Thus, there may be a point at which it is economical to replace an old pot with a new lining to reduce the operating cost. Second, it may be worthwhile to take advantage of low workloads of the reline crews. The following section develops the mathematical models necessary to give quantitative guidance for these two decisions. The decision process described in this section is simpler than the process described in the next section, due to the requirement to keep the mathematics tractable. Although this may give the appearance that there are two different decision processes discussed in Sections 2 and 3, this is not the case. However, since the assumptions in this section are more restrictive than in the next section, the expert system is designed to be selective in its use of the mathematical model.

The length of time until a pot must be relined is a random variable denoted by $T$. The length of time until the first failure of the pot is a random variable denoted by $T_1$, with its distribution function given by $F$ and its density function given by $f$. At the time of first failure, there are two possible actions: either the pot can be relined or it can be patched. After a patch, the pot may immediately fail with probability $p$. If it does not immediately fail (with probability $q = 1 - p$), the distribution governing the time until the next failure is identical to a pot of the same age that has not yet failed. The pot is always relined if it fails a second time. To state the above rules mathematically, if $T_1 = s$ and if the decision is made to reline the pot, then $T = s$. If $T_1 = s$ and if the decision is made to patch, then $T$ has the distribution given by

$$P(T \le t \mid T_1 = s) = p + q\,\frac{F(t) - F(s)}{1 - F(s)} \quad \text{for } t > s, \tag{1}$$

where $F(t) = P\{T_1 \le t\}$ for $t \ge 0$.
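As a rough numerical illustration of equation (1), the sketch below evaluates the conditional distribution of the relining time after a patch. The Weibull form of $F$ and its parameters are assumptions made only for this example (no particular $F$ is specified at this point in the text), and the function names are hypothetical.

```python
import math


def weibull_cdf(t, shape=2.5, scale=1500.0):
    """Assumed first-failure distribution F(t), with t in days (illustrative only)."""
    return 1.0 - math.exp(-((t / scale) ** shape)) if t > 0 else 0.0


def cdf_after_patch(t, s, p, F=weibull_cdf):
    """P(T <= t | T_1 = s) from equation (1) when the pot is patched at age s."""
    if t < s:
        return 0.0  # the pot cannot be relined before its first failure
    q = 1.0 - p
    # Mass p sits at t = s (immediate patch failure); the rest follows the
    # conditional failure law of an unfailed pot of age s.
    return p + q * (F(t) - F(s)) / (1.0 - F(s))


if __name__ == "__main__":
    for t in (1000.0, 1200.0, 1500.0, 2500.0):
        print(t, round(cdf_after_patch(t, s=1000.0, p=0.1), 4))
```

At $t = s$ the function returns $p$, the probability that the patch fails immediately, and it rises toward 1 as $t$ grows.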

There are five costs relevant to this problem: $c_p$ is the cost (not including the cost of lost production while waiting to be patched) of a patch; $c_r$ is the cost (not including the cost of lost production while waiting in the reline queue) of relining after pot failure occurs; $c_s$ is the cost (not including the cost of lost production while waiting in the reline queue) of a scheduled relining in which the pot is removed from service before it fails; $c_d$ is the cost rate of lost production due to the pot being down; and $c_o(t)$ is the operating cost rate for a pot of age $t$. If a pot is to be relined, the average length of time that it waits in the reline queue is given by $\bar{t}_r$; and if a pot is to be patched, the length of time that it waits in a queue to be patched is given by $\bar{t}_p$. The major variable cost that affects the operating cost for a pot is the pot’s power usage. The cost function thus relates usage of electricity with the cost of energy. A pot’s usage of electricity varies over the lifetime of the pot and is thus a function of the age of the pot. We denote this function by $u(t)$, which represents the power used by a pot of age $t$. The cost of electricity (in dollars per unit energy) is $c_e$ and, thus, we have

$$c_o(t) = c_e\,u(t). \tag{2}$$

The current policy preferred by management is to leave pots in production until failure and then reline the pots upon failure. As discussed in the next section, a “quick fix” is sometimes possible, but we ignore that possibility here. This preferred policy will be referred to as the baseline policy and other options will be evaluated as they compare with the baseline policy. The total discounted cost of the baseline policy is denoted by $\varphi$, where the continuous discount factor is given by $\alpha$, with $0 < \alpha < 1$.

2.2 Cost for the baseline policy

Operating costs vary over the life of the pot. The typical operating cost (based on average power consumption) for a pot decreases during the first few months of service and then increases slowly over its remaining lifetime. Consider a pot of age $t_1$ that is placed into service at time zero and is taken out of service when it is of age $t_2$. Its total discounted operating cost, denoted by $C(t_1, t_2)$, is given as

$$C(t_1, t_2) = \int_{t_1}^{t_2} \exp[-\alpha(s - t_1)]\,c_o(s)\,ds. \tag{3}$$

Thus, a new pot that is placed in service at time zero and is relined due to a failure at time $s$ would have a total discounted cost associated with it of

$$C(0, s) + \exp(-\alpha s)\,(c_r + \bar{t}_r c_d). \tag{4}$$

The total discounted cost, $\varphi$, is the cost of a pot that we follow over an infinite planning horizon. That is, we start with a new pot, wait until it fails and reline it, wait again until it fails and reline it and repeat that cycle indefinitely. The total discounted cost of this infinite cycle is calculated by

$$\varphi = \int_0^\infty \left\{ C(0, s) + \exp(-\alpha s)\left[c_r + \bar{t}_r c_d + \exp(-\alpha \bar{t}_r)\,\varphi\right] \right\} f(s)\,ds. \tag{5}$$

The meaning of the individual terms on the right-hand side is as follows: $C(0, s)$ is the operating cost until the first reline; $\exp(-\alpha s)$ discounts back to the present time the various costs incurred during the reline at time $s$; $c_r$ is the cost of relining; $\bar{t}_r c_d$ is the expected cost of lost production while the pot is down waiting to be relined; $\varphi$ is the future cost of the continuing cycle of pots working and being relined; and, finally, $f(s)\,ds$ is the probability that the pot fails at time $s$. For the type of problem we are considering here, the repair time is small with respect to the discount factor; therefore, we shall assume that $\exp(-\alpha \bar{t}_r) \approx 1$. Equation (5) thus becomes

$$\varphi = \int_0^\infty \left[ C(0, s) + \exp(-\alpha s)\,(c_r + \bar{t}_r c_d + \varphi) \right] f(s)\,ds. \tag{6}$$

For the future equations, we shall make the same assumption; i.e. we assume that the discounting over the length of the repair time is negligible. Another approximation in equation (6) is that $\bar{t}_r$ represents a constant mean value independent of the policy. Because deviation from the preferred policy is unusual, we assume that the policy has a negligible effect on the queues. A queueing analysis greatly increases the complexity of the problem, so this is an assumption needed to keep the problem tractable. It is also a realistic assumption for the problem motivating this research. For notational convenience, we define two integral functions $\psi_1$ and $\psi_2$ to be

$$\psi_1(t) = \int_t^\infty C(t, s)\,f(s)\,ds \tag{7}$$

and

$$\psi_2(t) = \int_t^\infty \exp[-\alpha(s - t)]\,f(s)\,ds. \tag{8}$$

(Note that both these integrals are functions of the discount rate; however, we do not show them explicitly as a function of $\alpha$ since the discount rate is considered a constant.) We can now write the total discounted cost for a pot operating under the baseline policy as

$$\varphi = \frac{\psi_1(0) + (c_r + \bar{t}_r c_d)\,\psi_2(0)}{1 - \psi_2(0)}. \tag{9}$$
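The baseline cost in equation (9) can be approximated numerically once $F$, $u(t)$ and the cost parameters are chosen. The sketch below is one minimal way to do this. Every numerical value, the Weibull failure law, the shape of `u`, the truncation of the infinite integrals at `UPPER` and the use of Simpson's rule are assumptions made for illustration, not values or methods taken from the plant study.

```python
import math

# All numbers below are illustrative assumptions, not plant data.
ALPHA = 0.0003               # continuous discount rate (per day)
SHAPE, SCALE = 2.5, 1500.0   # Weibull parameters for the time to first failure (days)
C_E = 0.05                   # electricity price c_e (dollars per unit energy)
C_R = 60000.0                # cost c_r of relining after failure
C_D = 400.0                  # lost-production cost rate c_d (dollars per day)
TBAR_R = 15.0                # mean wait in the reline queue (days)
UPPER = 8000.0               # truncation point standing in for the infinite limit


def u(t):
    """Assumed power usage at age t: dips in the first months, then rises slowly."""
    return 4000.0 + 300.0 * math.exp(-t / 60.0) + 0.25 * t


def c_o(t):
    """Operating cost rate, equation (2)."""
    return C_E * u(t)


def f(s):
    """Weibull density of the time to first failure."""
    z = s / SCALE
    return (SHAPE / SCALE) * z ** (SHAPE - 1.0) * math.exp(-(z ** SHAPE))


def simpson(g, a, b, n):
    """Composite Simpson rule with n (even) subintervals."""
    h = (b - a) / n
    total = g(a) + g(b)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * g(a + i * h)
    return total * h / 3.0


def C(t1, t2, n=100):
    """Discounted operating cost from age t1 to age t2, equation (3)."""
    return simpson(lambda x: math.exp(-ALPHA * (x - t1)) * c_o(x), t1, t2, n)


def psi1(t, n=400):
    """Equation (7), truncated at UPPER."""
    return simpson(lambda s: C(t, s) * f(s), t, UPPER, n)


def psi2(t, n=400):
    """Equation (8), truncated at UPPER."""
    return simpson(lambda s: math.exp(-ALPHA * (s - t)) * f(s), t, UPPER, n)


def baseline_cost():
    """Total discounted cost of the baseline policy, equation (9)."""
    p2 = psi2(0.0)
    return (psi1(0.0) + (C_R + TBAR_R * C_D) * p2) / (1.0 - p2)


if __name__ == "__main__":
    print(f"phi = {baseline_cost():.0f}")
```

Because $\psi_2(0) < 1$ for any positive discount rate, the denominator in (9) stays positive and the infinite cycle has a finite present value.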

2.3 Cost considerations for an early failing pot

Let us assume that at time zero we have a pot of age $t$ that has failed. Furthermore, assume that the pot meets the requirements to be a candidate for a possible patch job and the actual waiting times at the reline facility and the patch facility are estimated to be $t_r$ and $t_p$, respectively. We let $\varphi_r$ be the total discounted cost if the reline option is taken, and $\varphi_p$ be the total discounted cost if the patch option is taken. These two values are

$$\varphi_r = c_r + t_r c_d + \varphi \tag{10}$$

and

$$\begin{aligned}
\varphi_p &= c_p + t_p c_d + p\,\varphi_r + \frac{q}{1 - F(t)} \int_t^\infty C(t, s)\,f(s)\,ds + \frac{q\,(c_r + \bar{t}_r c_d + \varphi)}{1 - F(t)} \int_t^\infty \exp[-\alpha(s - t)]\,f(s)\,ds \\
&= c_p + t_p c_d + p\,\varphi_r + \frac{q\,\psi_1(t)}{1 - F(t)} + \frac{q\,(c_r + \bar{t}_r c_d + \varphi)\,\psi_2(t)}{1 - F(t)}.
\end{aligned} \tag{11}$$
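Equations (10) and (11) reduce to simple arithmetic once $\varphi$, $\psi_1(t)$, $\psi_2(t)$ and $F(t)$ are known. The sketch below assumes those four quantities have already been computed elsewhere and are passed in as plain numbers; the function names and the sample values in the usage lines are hypothetical.

```python
def reline_cost(c_r, c_d, t_r, phi):
    """Equation (10): reline now, then follow the baseline policy."""
    return c_r + t_r * c_d + phi


def patch_cost(c_p, c_r, c_d, t_p, t_r, tbar_r, p, F_t, psi1_t, psi2_t, phi):
    """Equation (11), second form.

    F_t, psi1_t and psi2_t are F(t), psi_1(t) and psi_2(t) at the pot's age t.
    """
    q = 1.0 - p
    phi_r = reline_cost(c_r, c_d, t_r, phi)
    # Expected discounted cost of running the patched pot to its next failure
    # and then reverting to the baseline policy.
    future = (psi1_t + (c_r + tbar_r * c_d + phi) * psi2_t) / (1.0 - F_t)
    return c_p + t_p * c_d + p * phi_r + q * future


if __name__ == "__main__":
    # Made-up inputs, chosen only to exercise the formulas.
    args = dict(c_r=60000.0, c_d=400.0, t_r=20.0, phi=450000.0)
    phi_r = reline_cost(**args)
    phi_p = patch_cost(c_p=15000.0, t_p=5.0, tbar_r=15.0, p=0.1,
                       F_t=0.2, psi1_t=150000.0, psi2_t=0.6, **args)
    print("patch favored" if phi_r > phi_p else "reline favored", phi_r, phi_p)
```

Keeping the integrals outside these functions separates the economic comparison from the choice of failure distribution and integration method.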

Here $q = 1 - p$ and $p$ is the probability that the patch will fail immediately and cause the pot to be relined. The economic effect of a patch or reline is now given explicitly by comparing the two values $\varphi_r$ and $\varphi_p$. As an aid to management, we wish to discuss the economic interpretation of the difference $\Delta = \varphi_r - \varphi_p$. To do this, define the following:

$$\Delta_1 = c_r + t_r c_d - (c_p + t_p c_d),$$

$$\Delta_2 = \varphi - \frac{\psi_1(t) + (c_r + \bar{t}_r c_d + \varphi)\,\psi_2(t)}{1 - F(t)}$$

and $\Delta_3 = \Delta - \Delta_1 - \Delta_2$. Thus, we have the difference separated into three components, $\Delta_1$, $\Delta_2$ and $\Delta_3$. If $\Delta > 0$, then a patch is favored, and $\Delta_1$ represents the savings in patch vs reline costs incurred at the cut-out time. If $\Delta_2 > 0$, it represents the present value of the savings realized because of reduced future operating costs; if $\Delta_2 < 0$, it represents the present value of the increased cost incurred because of increased future operating costs. Because operating costs increase after an initial decrease, a very young pot will have a lower operating cost than either a new pot or an older pot; therefore, $\Delta_2$ may be positive or negative depending on the age of the pot when the decision to patch or reline is being made. Finally, $\Delta_3$ is the expected present value of the cost associated with the possibility that the patch may fail immediately, in which case the pot will have to be relined. If $\Delta < 0$, a reline is favored over the patch and the three component costs have similar meanings from the opposite point of view. (See Fig. 1 for a graphical representation of these factors.)

Fig. 1. Cost values vs age for comparing the reline and patch options. The top line, $\Delta_1$, is the cost savings of patch over reline that would be realized at cut-out time. The second line, $\Delta_2$, represents an increase in the future operating cost if a patch is used. The bottom line, $\Delta_3$, represents a cost associated with the possibility of a patch failure. If the sum of the second and third lines is less than the first (top) line, a patch is favored.

2.4 Cost considerations for an old pot

When an old pot that has not yet failed is being considered for removal for relining, the question may come up as to whether the pot should be removed immediately (perhaps to take advantage of a light workload of the reline crews) or whether it should remain in service until failure. There are two factors that may be different from their average values and thus affect the cost of relining: operating costs and waiting times in the reline queue. The first factor to consider is an increase in power usage. Assume that the pot is of age $t$ so that its expected power usage is given by $u(t)$; however, its actual usage rate is $\hat{u}$. The first step is to determine the age that would yield the usage rate of $\hat{u}$. Thus, we let $\hat{t}_u$ be such that $u(\hat{t}_u) = \hat{u}$; in other words, $\hat{t}_u$ is the effective age of the pot with respect to the usage of electricity.
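Finding the effective age amounts to inverting the usage curve $u$. The sketch below does this by bisection, under two assumptions that are not from the paper: a made-up usage curve `u`, and a search bracket restricted to ages where that curve is increasing (old pots), so that the solution of $u(\hat{t}_u) = \hat{u}$ is unique.

```python
import math


def u(t):
    """Assumed expected power usage at age t (days); not the plant's fitted curve."""
    return 4000.0 + 300.0 * math.exp(-t / 60.0) + 0.25 * t


def effective_usage_age(u_hat, lo=180.0, hi=4000.0, tol=1e-6):
    """Solve u(t_u) = u_hat by bisection on [lo, hi], where u is assumed increasing."""
    if not (u(lo) <= u_hat <= u(hi)):
        raise ValueError("observed usage lies outside the range covered by the bracket")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if u(mid) < u_hat:
            lo = mid   # the matching age is older than mid
        else:
            hi = mid   # the matching age is mid or younger
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    actual_age = 1200.0
    observed = u(actual_age) + 50.0  # pot drawing more power than expected for its age
    print(effective_usage_age(observed))
```

A pot that draws more power than expected for its age is assigned an effective age above its actual age, which raises the operating cost used in the comparison that follows.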
A change in power usage from its expected value will not only affect the pot’s operating cost, but it may also reflect a change in the pot’s probability law governing the time until pot failure. However, there may be other factors which also indicate a change in the pot’s probability of failure. The procedure is to use the actual age of the pot unless the user overrides that age and uses another effective age. We let $\hat{t}_\ell$ denote the effective age of the pot with respect to life length. The distribution for future failure time is then based on $\hat{t}_\ell$ instead of the actual age of the pot.

Fig. 2. Cost values vs age for comparing the reline and do nothing options. The decreasing line, $\Delta_1$, represents the savings possible by delaying the capital expenditure of an immediate replacement. The increasing line, $\Delta_2$, represents the increasing operating cost caused by using an old machine. To the right of the intersection, an immediate replacement is favored.

The second variable is the time spent waiting while in the reline queue; that variable is denoted by $t_r$. With actual values for the operating cost and $t_r$ instead of expected values, we can calculate the cost of immediately relining the pot, $\varphi_r$, and compare it to the cost of doing nothing, $\varphi_n$. (The “do nothing” policy implies that the pot will be left alone until it fails.) These two costs are given as

$$\varphi_r = c_s + t_r c_d + \varphi \tag{12}$$

and

$$\varphi_n = \frac{1}{1 - F(\hat{t}_\ell)} \int_{\hat{t}_\ell}^\infty C(\hat{t}_u, \hat{t}_u + s - \hat{t}_\ell)\,f(s)\,ds + \frac{c_r + \bar{t}_r c_d + \varphi}{1 - F(\hat{t}_\ell)} \int_{\hat{t}_\ell}^\infty \exp[-\alpha(s - \hat{t}_\ell)]\,f(s)\,ds, \tag{13}$$

where $\hat{t}_u$ is the effective age of the pot with respect to power usage and $\hat{t}_\ell$ is the effective age of the pot with respect to life length. As in the previous section, we break the difference between $\varphi_r$ and $\varphi_n$ into component parts so that an economic interpretation can be given. Let $\Delta = \varphi_r - \varphi_n$,

$$\Delta_1 = c_s + t_r c_d - \frac{(c_r + \bar{t}_r c_d)\,\psi_2(\hat{t}_\ell)}{1 - F(\hat{t}_\ell)}$$

and $\Delta_2 = \Delta - \Delta_1$. Thus, we have this difference separated into two components, $\Delta_1$ and $\Delta_2$. If $\Delta > 0$, it is better to do nothing and wait until the machine fails; and $\Delta_1$ represents the savings (a reduction of costs) in replacement costs realized by delaying the expenditure of capital for the replacement. The cost $\Delta_2$ represents the increase in the present value of operating costs due to using an old machine instead of a new machine. If $\Delta < 0$, it is better to replace the machine immediately, in which case $\Delta_1$ may be positive or negative and thus it may represent a cost or a savings depending on the relative values of costs and times involved for the machine. (See Fig. 2 for a graphical representation of these factors.)
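The old-pot comparison can be sketched in the same arithmetic style. The function below assumes the two integrals in (13) have been evaluated elsewhere and are supplied as numbers: `op_integral` for the first integral and `psi2_l` for $\psi_2(\hat{t}_\ell)$. All names and the sample inputs are hypothetical.

```python
def old_pot_components(c_s, c_r, c_d, t_r, tbar_r, phi, op_integral, psi2_l, F_l):
    """Return (delta, delta1, delta2) for the reline-now vs do-nothing decision.

    op_integral : first integral in equation (13), computed elsewhere
    psi2_l      : psi_2 evaluated at the effective life-length age
    F_l         : F evaluated at the effective life-length age
    """
    phi_r = c_s + t_r * c_d + phi                                            # equation (12)
    phi_n = (op_integral + (c_r + tbar_r * c_d + phi) * psi2_l) / (1.0 - F_l)  # equation (13)
    delta = phi_r - phi_n
    delta1 = c_s + t_r * c_d - (c_r + tbar_r * c_d) * psi2_l / (1.0 - F_l)
    delta2 = delta - delta1
    return delta, delta1, delta2


if __name__ == "__main__":
    # Made-up inputs, chosen only to exercise the formulas.
    delta, d1, d2 = old_pot_components(
        c_s=55000.0, c_r=60000.0, c_d=400.0, t_r=3.0, tbar_r=15.0,
        phi=450000.0, op_integral=90000.0, psi2_l=0.55, F_l=0.35)
    print("do nothing" if delta > 0 else "reline now", delta, d1, d2)
```

Passing a short actual queue time `t_r` is how a light workload of the reline crews would enter the comparison in this sketch.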


3. EXPERT SYSTEM DEVELOPMENT

In this section we describe the development of an expert system designed to capture the heuristic knowledge of current routine practices. Then in Section 4, we discuss the integration of the expert system with the mathematical model of the previous section.

A preliminary analysis indicated that an expert system designed to reflect current maintenance actions could be conveniently formulated as one of selecting among a set of standard maintenance actions. Such a goal-driven formulation indicates a natural fit with backward-chaining inferencing. The computers most conveniently available to plant personnel were IBM-PC compatibles, for which a wide variety of PC-based expert system tools supporting backward chaining are available [5,6]. A second requirement was the capability of calling external functions from within the expert system. This feature was necessary to interface with the analytical model. Again, this requirement is met by a large number of PC-based tools. A third consideration was the ability to handle large rule sets to anticipate growth. ALCOA, like many companies, had already invested in some expert system tools. We chose Level 5, by Information Builders Inc. [7], because it met our requirements, ALCOA had already bought it, and the engineering staff had already received some training in its use.

The knowledge acquisition process started with narrowing the focus to one domain of responsibility and a single expert. An iterative and cooperative process of testing and modifying the prototype during design/interview sessions produced a stable prototype consisting of 32 rules in 4 iterations. In our problem of maintaining a line of smelting pots, each pot line had different characteristics, depending on the type of the pot and the production goals.
Line operators are allowed relative freedom to manage their maintenance situations and we deemed it unproductive to try to discern which expert was “right” when in fact they each might be right in the context of their production priorities. Each potroom has a supervisor (responsible for two lines) and each line has a separate supervisor responsible for each shift. After meeting with a group of potroom and line supervisors, we identified one supervisor to be our official “expert”. The system which has resulted should be similar in structure, but not identical in detail, to those which would have been developed for other lines with different experts. As mentioned before, this should be expected not merely because of differences in approach among individuals, but because of differences in production priorities and equipment among lines.

3.1 The initial prototype

A common problem for designers is that people can more easily modify an evident design than they can generate design ideas. For this reason, we chose to present our expert with a crude system prototype as a basis for reaction and refinement. With expert system technology, one can easily code a simple rule base that captures one’s understanding of a narrow domain such as our pot maintenance problem. As a result of the group meeting and general background information gathered, we were able to come to our first meeting with our expert with a working prototype consisting of 14 rules. The basic design decisions included identifying what seemed to be the most important input variables and the maintenance actions that might be taken. The reasoning about actions was in terms of nominal categories for each parameter. For instance, instead of referring to actual iron content measures, the rules referred to LOW, MEDIUM and HIGH iron content. Pot age was similarly partitioned into YOUNG, MEDIUM, OLD and VERY-OLD categories.
Clancey [8] refers to this type of coding as qualitative classification and presents evidence that this is a useful and natural way to turn numerical data into a form used by people in reasoning. This coding strategy emphasizes the heuristic nature of qualitative decision making and conveniently avoids issues as to what is generally meant by a YOUNG pot or HIGH iron content. Many adjustments in order to satisfy test cases were handled by changing the boundary values used to convert numerical data (e.g. age and iron content) into nominal categories.

3.2 Setting up a cooperative design process

Many researchers [9,10] are advocating a cooperative approach to the design of software applications in which end-users directly participate in designing the system. This approach is also valid and natural for developing expert systems. One brings the current working prototype to a cooperative design interview and treats it as the piece of clay which one needs help in shaping into something correct and useful. The expert is encouraged to explore the boundaries of where it is reasonable and where it is not. We found it useful to come with a chart or list of test cases that checked around boundary values for important parameters. The test cases were presented as problems for the expert to solve, after which we would consult the rule system and compare and discuss discrepancies. It is important to foster a scepticism towards the rule system, because computer programs carry an undeserved authority for many people. With a relatively small system, revision is often possible during the interview session. We found that scheduling two brief interviews, one in the morning and one after lunch, allowed us to revise and retest the system while the material was still fresh in everyone’s mind. With larger systems revision is not going to be so simple, but during the initial design stage for many routine commercial or industrial rule systems, such an approach is likely to be effective. In this case, 4 iterations of the rule base resulted in a system with face validity, in the sense of producing correct results on a systematic set of test cases.

3.3 Encoding of rule base goals

The goals for the expert system form a simple tree structure whose root designates the selection of a maintenance action. Each leaf represents one of the five principal actions we identified. There are only five primary actions identified by the system to respond to a perceived maintenance problem: (1) cut-out; (2) probe/plug the pot with high priority; (3) probe/plug the pot with low priority; (4) start special samples on the pot; and (5) monitor closely (sort of an official problem list). A sixth action, do nothing, covers situations that do not merit follow-up activities.
The mathematical model was tied to these basic activities by using it in the conditions for the rules which call the mathematical model as an external program (see Section 4). In addition to the basic goal structure, there are two goals that effectively let the user bypass the rule base to call the mathematical model. These goals are presented as a top level menu when the expert system is started, allowing the user to either consult the rule base, call the mathematical model to calculate the patch and reline costs or call the mathematical model to calculate the relative costs of relining vs waiting until failure. The iterative process of refining the rule base involved adding rules not identified initially, modifying some initially formulated rules, and incorporating rules for accessing and displaying the mathematical analysis. For example, one of the initial rules was for emergency cut-out:

    IF pot red THEN cut-out.

Such a rule is straightforward and easily identified in the initial stage. An example of a rule identified during an evaluation step and added during the iterative process is

    IF pot on cut-out candidate list AND last three iron readings ≥ 0.20% AND other actions not ruled out THEN start special samples.

After including rules added during the evaluation process and rules added to access the mathematical analysis under the proper conditions, the rule base went from 14 rules to 32 rules.

3.4 Variables

The variables of the expert system which condition a maintenance decision can be considered in four categories or layers, including: (a) current operating status of the pot; (b) performance history for the pot; (c) plant-wide parameters; and (d) external economic factors.

(a) Current pot performance. The first and most important group includes the current operating status of the pot under consideration. Most important in this group are age and iron content. Iron content is an indicator of the corrosion of the carbon lining, resulting in contact between the metal shell of a pot and the electrolyte bath. Voltage level is not exploited by any rule, but our expert was concerned about efficiency and felt there should be some way to incorporate energy efficiency into the decision process. As a result of this concern, the mathematical model was extended to make use of power consumption information. Likewise, manganese and silicon impurities were mentioned but not incorporated into any rules. Temperature is only used indirectly to decide the obvious case of when a pot is glowing red from overheating. Since high operating temperature is considered an eventual cause for premature failure, future enhancements of the system should find some systematic way of incorporating this factor.

(b) Individual pot history. Of importance in the pot history is whether the pot has leaked before, whether the pot has been probed before, whether the pot has been identified previously as a cut-out candidate (meaning more frequent sampling for iron and other impurities) and whether the last readings of iron were high.

(c) Plant-wide parameters. Six variables were identified as bearing directly on the pot maintenance problem. First is the availability of the probing/plugging crew. Next is the pot line complement, i.e. how many pots of a particular line are currently operational. Also of interest are the waiting time to begin relining, the waiting time for patching and the turnaround time for both reline and patch. To give an idea of the scope of this project, pots usually last 3-5 years, a reline takes 10-20 days and the pot line complement is between 315 and 320 pots.

(d) Control variables. In addition to the variables described above, a small number of control variables are used to ensure that, if cut-out is recommended, lesser measures are not also recommended.

3.5 Rules

The current rule base contains 32 rules. Two groups of rules are distinguishable. One category of rule simply classifies numerical data into nominal categories. The second type of rule reasons about maintenance actions and decides whether to call the mathematical model. Simple modifications in the parameters used to classify age and iron content, and in some of the conditions for cut-out, should allow this rule set to be customized to fit the varying production goals and maintenance needs of the other smelting lines.

3.6 Documentation and explanation

The Level-5 shell allows text expansion for variables and conditions, so the user can request detail concerning a particular variable or value at any time during a consultation. About half of the code text supports these documentation functions. The user can also get explanations in terms of the sequence of rule firings for a consultation.
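The two groups of rules and the firing-sequence explanation can be illustrated with a minimal sketch. It is not the Level-5 rule base: the `Rule` class, the `consult` function, the fact names and the `IRON_HIGH` threshold are all hypothetical, chosen only to show classification rules feeding action rules, a control variable suppressing lesser measures, and a trace of rule firings being recorded.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Facts = Dict[str, object]


@dataclass
class Rule:
    name: str
    condition: Callable[[Facts], bool]
    conclusion: Facts  # facts asserted when the rule fires


def consult(rules: List[Rule], facts: Facts) -> List[str]:
    """Forward-chain until no rule fires; return the sequence of rule firings."""
    trace: List[str] = []
    fired = set()
    changed = True
    while changed:
        changed = False
        for rule in rules:
            if rule.name not in fired and rule.condition(facts):
                facts.update(rule.conclusion)
                fired.add(rule.name)
                trace.append(rule.name)
                changed = True
    return trace


# Hypothetical threshold (percent iron), for illustration only.
IRON_HIGH = 0.20

RULES = [
    # Group 1: classify numerical data into a nominal category.
    Rule(
        "classify-iron-high",
        lambda f: len(f.get("iron_readings", [])) >= 3
        and all(x >= IRON_HIGH for x in f["iron_readings"][-3:]),
        {"iron": "HIGH"},
    ),
    # Group 2: reason about maintenance actions.
    Rule(
        "cut-out-red-pot",
        lambda f: bool(f.get("pot_red", False)),
        # "cut_out" plays the role of a control variable.
        {"action": "cut-out", "cut_out": True},
    ),
    Rule(
        "start-special-samples",
        lambda f: bool(f.get("candidate", False))
        and f.get("iron") == "HIGH"
        and not f.get("cut_out", False),
        {"action": "start special samples"},
    ),
]

if __name__ == "__main__":
    facts = {"iron_readings": [0.21, 0.22, 0.25], "candidate": True}
    for step, name in enumerate(consult(RULES, facts), start=1):
        print(step, name)
    print("recommended action:", facts["action"])
```

The returned trace is the kind of information an explanation facility can replay to the user. In this sketch the control variable only works because the cut-out rule is listed before the lesser measure; a real shell would resolve such conflicts with its own mechanism.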

4. INTEGRATION OF THE EXPERT SYSTEM AND THE MATHEMATICAL MODEL

The expert system allows two ways to access the mathematical model. The user can call the model directly by selecting goals that bypass the normal rule base. Otherwise, the rule base decides when to call the mathematical model.

For pots in the YOUNG age category, the model is called when any “significant” maintenance action is warranted. Significant action includes either probing/plugging or cut-out actions. The expert system calls the model with a request for a “young pot analysis” and passes it the age of the pot and the lengths of the queues for relining and patching (in days). The model returns the long-term cost calculations for patching and relining, which the expert system presents to the user as an additional recommendation.

For pots in the OLD and VERY-OLD age categories for which cut-out has not been deduced as the appropriate action, the expert system calls the model with a request for an “old pot analysis” and passes it the power consumption rate of the pot in question as well as the pot age and the reline queue (in days). The model returns the long-term costs for relining vs waiting until failure. The expert system presents the results to the user as an additional recommendation.

After viewing the extra recommendation resulting from a call to the mathematical model, the user can continue and see the “normal” baseline policy recommendations. The baseline policy does not prefer patching or scheduled relining under any circumstances. Cost justification based on the results of the mathematical model is the only way such actions will be recommended. This strategy for integration allows the user to distinguish easily between recommendations that are meant to simulate normal decision practices and those based on the additional information provided by the mathematical model.

Calls to the mathematical model may result in recommendations that may be rejected because of strongly held existing beliefs. For instance, patching is considered risky, disruptive and only worthy of serious consideration for young pots. By restricting calls for a cost comparison between relining and patching to those young pots that would be cut out according to normal decision practices, a recommendation to patch will only arise when it has some reasonable chance of acceptance. The mathematical model makes no distinction between YOUNG and MEDIUM category pots, and recommendations to patch pots just above the threshold between these categories may also be acceptable. For this reason, allowing the user to override the rule base is important. Similar considerations hold for accessing the calculations that compare the cost of scheduled relining with waiting until failure. Although such a comparison is typically of interest only for old pots that are functioning inefficiently, the user should be able to explore the consequences over a wide range of circumstances.
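The calling conditions described in this section can be summarized in a short sketch. It covers only the decision of which analysis to request and which arguments to pass; the cost calculations of the mathematical model are not reproduced. All names (`Pot`, `ModelRequest`, `select_analysis`, `build_request`, `cheaper_option`), the action labels and the use of days as the age unit are assumptions made for illustration and do not come from the system itself.

```python
from dataclasses import dataclass
from typing import Dict, Optional

YOUNG_ANALYSIS = "young pot analysis"
OLD_ANALYSIS = "old pot analysis"

# Actions treated as "significant" for young pots (labels are hypothetical).
SIGNIFICANT_ACTIONS = {"probe/plug", "cut-out"}


@dataclass
class Pot:
    age_category: str     # "YOUNG", "MEDIUM", "OLD" or "VERY-OLD"
    age: float            # pot age; unit assumed to be days
    power_rate: float     # power consumption rate; unit not specified
    baseline_action: str  # action deduced by the normal rule base


@dataclass
class ModelRequest:
    analysis: str
    arguments: Dict[str, float]


def select_analysis(pot: Pot, user_choice: Optional[str] = None) -> Optional[str]:
    """Return the analysis to request, or None if the model is not called."""
    if user_choice is not None:
        # Direct access: the user bypasses the rule base.
        return user_choice
    if pot.age_category == "YOUNG" and pot.baseline_action in SIGNIFICANT_ACTIONS:
        return YOUNG_ANALYSIS
    if pot.age_category in {"OLD", "VERY-OLD"} and pot.baseline_action != "cut-out":
        return OLD_ANALYSIS
    return None


def build_request(analysis: str, pot: Pot, reline_queue_days: float,
                  patch_queue_days: float) -> ModelRequest:
    """Collect the arguments each analysis receives from the expert system."""
    if analysis == YOUNG_ANALYSIS:
        args = {"age": pot.age,
                "reline_queue_days": reline_queue_days,
                "patch_queue_days": patch_queue_days}
    else:
        args = {"age": pot.age,
                "power_rate": pot.power_rate,
                "reline_queue_days": reline_queue_days}
    return ModelRequest(analysis, args)


def cheaper_option(long_term_costs: Dict[str, float]) -> str:
    """Assumed presentation rule: report the option with the lowest cost."""
    return min(long_term_costs, key=long_term_costs.get)


if __name__ == "__main__":
    pot = Pot("YOUNG", age=200.0, power_rate=1.0, baseline_action="cut-out")
    analysis = select_analysis(pot)
    if analysis is not None:
        print(build_request(analysis, pot, reline_queue_days=12, patch_queue_days=3))
```

Passing `user_choice` corresponds to the direct route, which lets a user request the patch vs reline comparison for a MEDIUM pot that the rule base alone would not send to the model.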

5. CONCLUSIONS AND DISCUSSION

The prototype expert system for pot maintenance has successfully demonstrated that a knowledge-based system and mathematical analysis can be integrated into a single replacement policy delivery system. Using the expert system shell already purchased by ALCOA, we found that it was straightforward to encode a reasonable approximation to the current decision process. Our system was developed using “quick and dirty” statistical estimates for the probability distributions of pot life, and it was demonstrated to a group composed of engineers, line management and operational personnel. The result was an accepted tool for use in pot maintenance. Since the prototype was built using rough statistical estimates, the final step now is to obtain acceptable statistical estimates for pot life and power usage. The resulting replacement policy delivery system can serve both as a training tool and as a vehicle for the experienced operator to begin taking advantage of the mathematical analysis.

Good user interface characteristics made available by the shell have enhanced the value of the mathematical model in two ways. The first enhancement is the “free” availability of a uniform interface for gathering user input, providing explainers and context-sensitive help and displaying results. The second enhancement derives from having the rule base decide when the new information available through the mathematical model is relevant to a particular maintenance problem. This guidance is embedded in a simulation of normal decision procedures, so that recommendations at variance with traditional policies can be assessed in a familiar context.

Acknowledgement: This material is based upon work supported by the Texas Advanced Technology Program under Grant No. 3322.
