Developments in Reliability Data Collection and Analysis


Copyright © IFAC Fault Detection, Supervision and Safety for Technical Processes, Espoo, Finland, 1994


J.K. VAURIO

Imatran Voima Oy, P.O. Box 23, 07901 Loviisa, Finland

Abstract. A computerized failure and maintenance event data base has been established at Loviisa Power Plant. This data base supports maintenance management and planning as well as reliability and risk assessment studies for economy and safety. This paper describes the attributes and characteristics of this system, methods developed for estimating component reliability parameters, methods for analyzing aging and improvements in component failure rates, and developments in common cause failure modeling and quantification.

Key Words. Reliability data acquisition; failure rates; parameter estimation; maintenance engineering; nuclear plants

1. INTRODUCTION

Learning from experience is an essential part of obtaining and maintaining a safe and economic production system. This entails experience feedback from planned and unplanned events that have occurred during plant operation. The component event data collection and analysis systems established at Imatran Voima Oy's Loviisa Power Plant serve safety and economy via two activities:

(1) Safety: Probabilistic Safety Assessment (PSA), and
(2) Economy: Reliability Guided Maintenance (RGM).

The many purposes of a PSA include
- identification of accident sequences and assessment of their probabilities (frequencies)
- verification of compliance with safety objectives
- facilitating comparisons of risks to other plants and technologies
- identification of main risk contributors
- prioritization of improvements in plant systems, procedures and practices.

Justifying exceptions in conditions of operation and optimization of technical specifications or allowed outage times are possible additional applications. Monitoring the risk requires continuous or periodic updating of the reliability parameters of the PSA models.

The objectives of RGM are also manifold:
- establish the economic significance of the maintenance tasks and outages of individual components, component types or systems
- evaluate relative merits of different preventive maintenance tasks
- strike an optimal balance between preventive maintenance and corrective (repair) maintenance
- plan spare part inventories and repair resources
- detect aging, wear, or improved reliability and adjust maintenance actions accordingly
- determine proper requirements for replacement components and system modifications.

For both PSA and RGM it is necessary to record not only the failure frequencies but also the durations of outages (unavailabilities) for various failure modes and maintenance actions. It turns out that standby systems and alternating systems have more failure mechanisms and failure entry and exit alternatives than normally operating systems. Therefore, more complex reliability models are needed for standby components. This fact has a major impact on the nature and number of items that have to be collected to be able to estimate numerical values for the parameters of the models.

2. DATA COLLECTION

The first data collection at Loviisa plant was carried out manually for the period 1977 to 1988, mainly to support the reliability parameter data needs of a PSA. This phase introduced specifications for a computerized system developed and installed in 1989. This maintenance data system supports a general comprehensive component model and has the following special features, many unique to this system:

1. The data system is completely plant-specific. Component specific and component group specific failure rate and unavailability parameters can be estimated with associated trends and uncertainty estimates.

2. The items collected in the data system are based on comprehensive component models needed in PSA. These items are associated with several alternative entry and exit times of failures, repairs, tests etc. to allow estimation of both time-average and time-dependent unavailabilities.

3. Additional items associated with causes, impacts, costs, actions taken etc. are included to support plant maintenance planning (RGM).

4. Each item is recorded on computerized work orders by a responsible foreman or a shift supervisor as a part of the normal work order routine. This way the need for extra personnel is minimal. A Dedicated Data Inspector verifies the entries, promotes consistency and enters the critical unavailability timing information.

5. Practically all systems and components are included in the data base, not just safety related components.

The data collection system must be able to support comprehensive component unavailability expressions, $\bar{u} = \lambda_m \tau_m / (1 + \lambda_m \tau_m) \approx \lambda_m \tau_m$ for failures that are immediately detected, and

$$\bar{u} = \alpha + \frac{\alpha \tau_\alpha + r \tau_r + \delta}{T} + \lambda \left( \frac{T}{2} + \tau_\lambda \right) + f \tau_f \tag{1}$$

for failures detected in periodic tests or operations. Here

$\lambda_m$ = failure rate for monitored failures,
$\lambda$ = failure rate for true failures during standby,
$f$ = failure rate for incipient failures during standby,
$\alpha$ = probability of true failure per demand (test or operation),
$r$ = probability of incipient failure per demand,
$T$ = interval of tests or operations in which failures can be detected, and
$\delta$ = mean downtime associated with a test (if any).

$\tau_m$, $\tau_\lambda$, $\tau_f$, $\tau_\alpha$ and $\tau_r$ are the corresponding average repair times. (Several early PSAs assumed only per-demand probabilities $\alpha$, i.e. $\lambda = 0$, for standby components, which would imply unrealistic optimal test intervals $T = \infty$.)

Periodic preventive maintenance (PM) with intervals $T_p$ can also take a component out of service. These downtimes must also be recorded to calculate the mean downtime $\tau_p$ and the unavailability $\bar{u} = \tau_p / T_p$. (Possible errors made in PM lead to failure reports later on, thereby contributing to $\alpha$, $\lambda$, $f$ or $r$.)

The most important event attributes needed for estimating the parameters are:

A. Failure detection method. This indicates whether a failure is immediately detectable (by alarm or other symptoms) at any time or in one of the periodic tests or inspections; in the first case the event contributes to $\lambda_m$ and $\tau_m$, otherwise to the other parameters ($\alpha$, $\lambda$, etc.) of Eq. 1.

B. Unavailability impact (failure criticality). A true failure is such that the component becomes unavailable immediately when the event occurs; the event contributes to $\alpha$ or $\lambda$ ($\tau_\alpha$ or $\tau_\lambda$). An incipient failure is such that the component is unavailable only during the repair time; such an event contributes to $f$ or $r$ ($\tau_f$ or $\tau_r$).

C. Failure occurrence situation. For failures that are not immediately detected it is important to infer the probable occurrence situation. If the failure occurred during a standby period (after the preceding test or operation) it contributes to $\lambda$ or $f$ ($\tau_\lambda$ or $\tau_f$). If it occurred due to stresses during a periodic test or operation, it contributes to $\alpha$ or $r$ ($\tau_\alpha$ or $\tau_r$) depending on the unavailability impact (attribute B above).

T. Time points. The following times are recorded: Previous test ($t_0$); Failure detection ($t_1$); Component isolation for repair or maintenance ($t_2$); Repair or unavailability start ($t_3$); End of work ($t_4$); Return to service ($t_5$); Probable unavailability start ($t_0$, $t_1$, $t_2$, $t_3$ or a point derived from the half-interval $(t_1 - t_0)/2$; exceptionally earlier than $t_0$), depending on the answers to questions A, B, C above; End of unavailability ($t_4$ or $t_5$). The actual unavailability time (from start to end) is determined by the Data Inspector.

A number of additional attributes are requested for maintenance management, optimization, studies of common cause failures, etc. These include

D. Event impact on other components.
E. Maintenance action taken.
F. Quality of work plan documents.
G. Event impact on plant operation.
H. Failure mode.
J. Event impact on system.
K. Plant status at event detection.
O. Event impact on subsystem.
S. Event cause.
V. Event type.

All these attributes are coded so that complete information can be printed out in a single line. A compact list of events can then be listed over a requested time period for systems and component types of interest. Other basic information includes work, room and equipment numbers, locations, allowed outage times, parts used, PM route numbers and signatures of the recorders. Human errors are recorded in a different format and categorization.

The computer checks the logical consistency of the recordings, especially between attributes A, B, C and H. In spite of this and double-checking by the Dedicated Data Inspector it is not possible to keep event coding completely consistent and flawless. Additional verifications are needed whenever the PSA parameters are updated. It seems impossible to completely automate the data collection and analysis directly from the coded event data. Varying interpretations, assumptions and simplifications in PSA models always seem to require some additional manual work to adjust component parameters.

3. MAINTENANCE MANAGEMENT SUMMARIES

The event data base facilitates many useful summaries for special studies and for maintenance planning and management. These include coded listings and summaries over a specified time, e.g.
- Events by process location;
- Events by equipment number;
- Equipment replacements;
- Plant availability and trips;
- Events causing loss of production;
- Systems causing loss of production;
- Listing of all components with exceptional event rate or unavailability, separately for instrumentation, electrical components and mechanical equipment according to the maintenance branch called for;
- Listings and weekly summaries of unfinished or delayed works at different stages: work ordered but not released, work released but not completed, work completed but equipment use not authorized, work completed but data sheet not inspected, work and data completed;
- Listing and daily and monthly summaries of works with limited outage times (AOT, safety related works), separately for repair works and preventive maintenance, as well as for unfinished works;
- Quarterly and annual numbers of work orders, repair works, modification (backfitting) works and preventive maintenance, separately by maintenance branch (instrumentation/automatics maintenance, electrical maintenance, mechanical maintenance); these trends can be plotted for individual systems or components, too;
- Listing of potential common cause (simultaneous) events;
- Listing of human error events.

A summary of some interesting events at Loviisa unit 1 is given in Table 1. Quarterly numbers of repair works carried out by different maintenance branches are presented in Fig. 1. The role of different failure detection methods is illustrated by Fig. 2. Most of the failures are detected immediately via control room instrumentation. However, periodic testing is a dominating means of detection for standby safety systems.

Table 1 Event summary for Loviisa unit 1

Event                        1990   1991   1992   1993
Turbine trips                   3      1      0      3
Reactor trips                   1      0      0      1
Hot shutdowns                   2      0      2      0
Cold shutdowns                  1      1      0      0
Repair work orders           3739   3695   3859   3765
Safety system works
  - Repairs                   235    148    142    196
  - Prevent. mainten.         129     91    123    174

[Figure: LOVIISA 1, REPAIR WORK TRENDS. Curves: mechanical, instrumentation, electrical. Horizontal axis: beginning quarter.]

Fig. 1. Quarterly repair works

[Figure: NUMBER OF TRUE FAILURES BY METHOD OF DETECTION. Horizontal axis: beginning quarter.]

Fig. 2. Quarterly numbers of failures detected by different means

4. RELIABILITY PARAMETER ESTIMATION

A special data analysis code REPA has been developed to estimate the mean values and uncertainty distributions for the failure parameters $\lambda_m$, $\lambda$, $f$, $\alpha$, $r$ etc. (Eq. 1).

Trend Test. The first task in data analysis is to verify whether these parameters are constant in time. Aging and deterioration of components lead to increased failure rates while improved preventive maintenance, redesign or replacement tend to reduce failure rates. A statistical test for a trend has been implemented in REPA. It can be used if the component (or a group of identical components) experienced at least four failures during the recorded history. The test is based on the sum of the first j times to failure vs. the sum of the k last times to failure (Vaurio, 1983). This trend test can be used for a single component or for a group of nominally identical components.

Constant Failure Rate. If no significant trend is detected, a group of nominally identical components is taken as a representative sample of the factory population. The failure rates of this population vary, and the distribution of this variation can be estimated by matching moments with the observed numbers of failures. A gamma distribution has been found to be a suitable choice. This empirical distribution is then used as the prior in the Bayes equation, multiplying the Poisson likelihood of each component (Vaurio, 1987). This yields the posterior distribution for each component individually. The mean values are used in the base case risk assessment while the whole posterior distributions are used in uncertainty calculations. A similar method applies to the probability per demand parameters $\alpha$ and $r$ with beta prior distributions and binomial likelihood functions (Jänkälä and Vaurio, 1987). The method has several advantages and minimizes the need for generic data or subjectivity in determining a prior distribution.

Non-constant Failure Rates. If a statistical test indicates that a failure rate is not constant, three different trend models are fitted to the data using the maximum likelihood method. These are a Weibull model $\lambda(t) = a b t^{b-1}$, a fractional learning model $\lambda(t) = a c^{i}$ (where $i$ is the number of failures before time $t$) and a continuous fractional learning model $\lambda(t) = a/(1 + bt)$. Analytical expressions have been obtained for the maximum likelihood estimates of the parameters ($a$, $b$, $c$) of these models (Jänkälä and Vaurio, 1989), and these are taken as modes of their distributions. The variances and covariances are obtained from the Fisher information matrices for the models. Finally, the modes and the variances of the current values of the failure rates $\lambda(t_e)$ [$t_e$ = the end of the observation period, "today"] are calculated for each model using the estimated parameters (modes and covariances). The mode and the variance fully define a gamma distribution for $\lambda(t_e)$, and yield the mean value for base case PSA calculations.

The final choice between the three trend models is made based on the least squares criterion: the model that fits the data best over the whole period of observations is selected for use in the PSA.

Experience so far has indicated both increasing and decreasing trends in failure rates, even among similar components. Even nominally identical components can have rather individual and exceptional failure rates. One should review the historical data carefully before pooling components together. Pooling and averaging can obscure useful information and lead to wrong conclusions.

Unavailabilities. The mean values and the variances of component unavailabilities can be readily obtained from Eq. 1 using the error propagation formula, when the corresponding quantities for $\alpha$, $\lambda$, $\tau_\lambda$ etc. are known. ($T$ is normally known accurately.) The mean values are used in the base case PSA calculations. The variances are useful for uncertainty-importance studies (Andsten and Vaurio, 1992). Both mean values and variances are used to define distributions for uncertainty studies. Assuming a beta distribution guarantees that the values are properly between 0 and 1, although a gamma distribution is otherwise justified if the uncertainty of $\lambda$ dominates in Eq. 1.

Repair Times. As indicated by Eq. 1, the mean repair times (and the distributions) can be as important as the corresponding failure rates or probabilities. A simple way is to estimate the repair rates (inverses of the repair times) using the same method for the observed repair times as described above for the times to failure.

Common Cause Failures. Since common cause failures (CCF) are rare events, data from any single plant are not sufficient to estimate CCF parameters. Data from many plants can be used to determine empirical prior distributions for the CCF rates of various systems using the moment matching method described above for component failure rate estimation. Plant-specific experience can then be combined to obtain proper posterior distributions. This approach has been used in the PSA for Loviisa plant to obtain system-level unavailabilities due to common cause failures (Jänkälä and Vaurio, 1993). CCFs occur at any time, which means that the unavailability impact of CCFs for standby safety systems depends heavily on test intervals, testing schemes and other rules applied in case a failure is detected in a test (Vaurio, 1993).

Generic Data. In spite of extensive plant-specific data collection it is necessary in a PSA to rely on generic world-wide data sources for certain rare events. These include rare initiating events such as large leakages in vessels, primary circuit piping or steam generators. The empirical Bayes method described above in Section 4 is useful again for obtaining proper estimates and distributions for the frequencies of such events.
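The empirical Bayes idea used in this section (a gamma prior obtained by matching moments, then updated with each item's Poisson likelihood) can be illustrated with a minimal Python sketch. This is not the REPA code and not necessarily the estimator of Vaurio (1987), which may weight the observations differently: the function names `fit_gamma_prior` and `posterior_mean_var`, the simple unweighted moment estimators, and the example data are all assumptions made for illustration.

```python
from statistics import mean, variance


def fit_gamma_prior(failures, exposures):
    """Moment-match a gamma(shape, rate) prior to a group of similar items.

    failures[i] is the failure count of item i over exposures[i] time units.
    Assumption: simple unweighted moments, with the Poisson sampling noise
    (about m / t_i per item) subtracted from the observed spread of rates.
    """
    rates = [n / t for n, t in zip(failures, exposures)]
    m = mean(rates)
    noise = mean(m / t for t in exposures)
    v = variance(rates) - noise
    if v <= 0:
        raise ValueError("no spread beyond Poisson noise; pool the items")
    # A gamma distribution has mean shape/rate and variance shape/rate**2.
    return m * m / v, m / v


def posterior_mean_var(shape, rate, n, t):
    """Conjugate update: gamma prior times Poisson likelihood (n events in t)."""
    post_shape, post_rate = shape + n, rate + t
    return post_shape / post_rate, post_shape / post_rate**2


# Hypothetical data: four nominally identical items observed 10 years each.
failures = [0, 1, 2, 5]
exposures = [10.0, 10.0, 10.0, 10.0]
shape, rate = fit_gamma_prior(failures, exposures)
for n, t in zip(failures, exposures):
    m, v = posterior_mean_var(shape, rate, n, t)
    print(f"n={n}: raw rate {n / t:.3f}, posterior mean {m:.3f}, sd {v ** 0.5:.3f}")
```

With these made-up counts the item with no failures still receives a nonzero posterior mean, and the item with five failures is pulled toward the group mean; this shrinkage is what makes the approach usable for rare events where a single plant has little or no data.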

Human error probabilities are another area where relevant hard data are scarce. One usually has to rely on generic or subjective data. Experiments with operators responding to transients on a plant simulator can be used to identify nominal crew responses. Such experiments are rarely extensive enough to obtain actual error probabilities for a specific plant.

5. CONCLUSION

A computerized failure and maintenance event data base has been established at Loviisa power plant. This data base can support maintenance management as well as risk assessment (PSA) studies. Data is collected routinely as part of the normal work order routine. One Dedicated Data Inspector is assigned to verify entries, promote consistency and enter the most judgmental information. It is a long term effort to reach consistency in the interpretation and coding of event attributes. Parameters for PSA applications are estimated periodically by off-line computer codes. Careful review and reassessment is necessary whenever data for PSA is updated. Complete automation of this process is unlikely because of subtle differences in the assignments of data and the events defined in the logic system models of a PSA.

Data evaluations so far have indicated some unexpected (or at least not generally recognized) features: (1) most of the hardware failures in standby safety systems occur during standby periods (with failure rate $\lambda$), not at the times of the test operations, and (2) nominally identical components can have widely varying failure rates and trends.

6. REFERENCES

Andsten, R.S., and J.K. Vaurio (1992). Sensitivity, Uncertainty and Importance Analysis of a Risk Assessment. Nucl. Technology, 98, 160-170.

Jänkälä, K.E., and J.K. Vaurio (1987). Empirical Bayes Data Analysis for Plant Specific Safety Assessment. Proc. Int. Conf. PSA'87, Vol. I, pp. 281-286, Verlag TÜV Rheinland, Köln.

Jänkälä, K.E., and J.K. Vaurio (1989). Component Aging and Reliability Trends in Loviisa Nuclear Power Plant. Proc. Int. Mtg. Probability, Reliability and Safety Assessment (PSA'89), Pittsburgh, Pennsylvania, April 2-7, American Nuclear Society.

Jänkälä, K.E., and J.K. Vaurio (1993). Residual Common Cause Failure Analysis in a Probabilistic Safety Assessment. Proc. Int. Mtg. PSA'93, Clearwater Beach, Florida, January 26-29, American Nuclear Society.

Vaurio, J.K. (1983). Learning Curve Estimation Techniques for Nuclear Industry. Proc. Int. Conf. Numerical Methods in Nuclear Engineering, Montreal, Canada, September 6-9, Canadian Nuclear Society.

Vaurio, J.K. (1987). On Analytic Empirical Bayes Estimation of Failure Rates. Risk Analysis, 7, 329-338.

Vaurio, J.K. (1993). The Effects of Testing Arrangements on the Unavailability of Standby Systems. Proc. Int. Mtg. PSA'93, Clearwater Beach, Florida, January 26-29, American Nuclear Society.
