Microelectron. Reliab., Vol. 23, No. 3, pp. 519-529, 1983
0026-2714/83 $3.00 + .00
Pergamon Press Ltd.
Printed in Great Britain

FLOATING REDUNDANCY IN DIGITAL SYSTEMS

P. G. ANDERSON and Z. G. VRANESIC

Department of Electrical Engineering, University of Toronto, Toronto, Ontario, Canada

(Received for publication 10th January 1983)

ABSTRACT

Traditional styles of redundancy, such as triple modular redundancy (TMR), use exact functional duplicates to provide increased reliability [NE63]. This need not be the case: a system may be designed using floating redundancy. Floating redundancy improves reliability by using a floating spare that may perform as several module types. The adjective "floating" is used to describe this ability to function as two or more types. This paper outlines some of the results of a study of floating redundancy.

1. INTRODUCTION

Fault-tolerance in digital systems is often attained by using extra copies of the hardware, either in voting or sparing configurations. Two commonly used techniques are duplex [TO79] and triple modular redundancy (TMR) [NE63]. In the duplex technique, the hardware is duplicated and the results produced by each copy are compared to detect errors. When a failure is detected, the diagnostic software is used to determine which copy is at fault, and the type of failure (transient or permanent) that occurred. Action appropriate to the type and location of the failure is then taken. The TMR method employs one more copy of the original hardware and a voter to determine the correct result. The diagnostic software, which must be carefully written to avoid the consequences of even a small percentage of non-covered faults [AR72], is thus not needed, but the hardware cost is increased.

In systems where a large number of hardware units is required, the cost of the increased reliability achieved with duplex and TMR techniques is too high to allow practical application of these methods. A more modest approach is to use a sparing technique [CA71], where a standby spare is provided for each type of hardware functional module required. When a functional module fails, a spare of the same type is activated. While lower in cost, this approach is still likely to be expensive if the system consists of many diverse types of modules, but only a few (perhaps only one) copies of each type are needed.

This paper describes a type of hardware redundancy where exact copies are not (necessarily) used. It is a sparing technique, but the spare, called a floating spare, is able to substitute for more than one type of hardware module. Figure 1 shows an example of the use of floating redundancy in a six module system. There are three module types, A, B and F. In a fault-free system all the A and B modules are working or "active". The F module is a floating spare. It is able to replace either an A or a B type module, and can replace any of each type in the system. This is denoted in the figure by the dashed lines. The solid lines denote the physical connections.


Fig. 1. Floating spare in a six module system (modules A1, A2, B1, B2, B3 and floating spare F).

2. IMPLEMENTATION AND CONSEQUENCES OF IMPLEMENTATION

Some system organizations lend themselves to the use of a floating spare more conveniently than others. In this section, the attributes of such a system will be described, as well as the placement and use of the floating spare in the system.

When the floating spare substitutes for a unit that is not strictly combinational, i.e. that has memory, then the spare will often require initialization before it can take on the duties of the failed unit. Clearly, any data used to initialize the spare should be error-free for the spare to continue successfully. Checkpoint data used in rollback to recover from errors caused by transient faults is one candidate for this initialization, as it was procured when no error had been detected. In addition, the rollback procedure is not unlike the initialization process for the spare, and often the latter will be a subset of the former. Systems with rollback may therefore be more easily adapted to floating redundancy.
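The relationship between rollback and spare initialization can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `Unit` and `CheckpointedSystem`, and the dictionary standing in for a unit's registers, are hypothetical and do not come from the study.

```python
from copy import deepcopy
from dataclasses import dataclass, field


@dataclass
class Unit:
    """A module with memory; `state` stands in for its registers (assumed model)."""
    name: str
    state: dict = field(default_factory=dict)
    healthy: bool = True


class CheckpointedSystem:
    def __init__(self, active: Unit, spare: Unit):
        self.active = active
        self.spare = spare
        self.checkpoint = deepcopy(active.state)

    def take_checkpoint(self) -> None:
        # Intended to be called only while no error has been detected.
        self.checkpoint = deepcopy(self.active.state)

    def rollback(self) -> None:
        # Transient fault: restore the same unit from the checkpoint.
        self.active.state = deepcopy(self.checkpoint)

    def substitute_spare(self) -> None:
        # Permanent fault: initialize the spare from the same checkpoint
        # data, then let it take over the failed unit's duties.
        self.active.healthy = False
        self.spare.state = deepcopy(self.checkpoint)
        self.active = self.spare


system = CheckpointedSystem(Unit("A", {"acc": 0}), Unit("F"))
system.active.state["acc"] = 5
system.take_checkpoint()
system.active.state["acc"] = 999      # state corrupted by a fault
system.substitute_spare()
print(system.active.name, system.active.state)   # F {'acc': 5}
```

Both recovery paths read the same checkpoint, which is the sense in which spare initialization can be a subset of the rollback procedure.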

Fig. 2. Modules in a pipeline with floating redundancy (units A and B, floating spare F).

The method of getting the data to and from the spare can take many forms. Much will depend on the structure of the system, the fault-tolerance requirements of the system and the resulting choices that have been made for error detection and correction. Some reasonable methods are:
- change in routing by the user or operating system software,
- hardware mapping (similar to virtual addressing), and
- dedicated switches.
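The hardware mapping option can be pictured as a translation table from logical module names to physical units, in the spirit of a page table. The sketch below is a minimal software model under assumptions of our own: the class `ModuleMap`, the slot names and the single-substitution policy are illustrative and are not taken from the study.

```python
from __future__ import annotations


class ModuleMap:
    """Maps logical module names to physical units (hypothetical model)."""

    def __init__(self, logical_to_physical: dict[str, str], spare: str,
                 spare_can_replace: set[str]):
        self.table = dict(logical_to_physical)
        self.spare = spare
        self.spare_can_replace = set(spare_can_replace)
        self.spare_in_use = False

    def resolve(self, logical: str) -> str:
        """Return the physical unit currently serving a logical module."""
        return self.table[logical]

    def remap_to_spare(self, logical: str) -> bool:
        """Point a failed logical module at the spare; False if not possible."""
        if self.spare_in_use or logical not in self.spare_can_replace:
            return False
        self.table[logical] = self.spare
        self.spare_in_use = True
        return True


# The six module system of Figure 1: two A modules, three B modules, one spare.
modules = ModuleMap(
    {"A1": "slot0", "A2": "slot1", "B1": "slot2", "B2": "slot3", "B3": "slot4"},
    spare="slot5",
    spare_can_replace={"A1", "A2", "B1", "B2", "B3"},
)
modules.remap_to_spare("B2")
print(modules.resolve("B2"))   # slot5
```

Callers keep using the logical name, so the rest of the system need not know that a substitution has taken place.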


It is possible and advisable for the floating spare to be able to replace more than one unit at a time in some configurations, perhaps in limited groupings. Figure 2 shows a system consisting of two units in a pipeline, plus a floating spare (F). If F can substitute for either or both units, a bad link between A and B could be survived. The logistics of function switching, the extra hardware required for doing so and the limit of the spare's reliability can work against applying this technique beyond a few simultaneous substitutions.

While degraded operation may not be desirable in comparison to operation in the error-free state, it has definite advantages over system shutdown upon a single fault in some situations. Remote fault diagnoses could be done before a repair crew is sent. It may be possible to run a subset of tasks in the degraded machine, pruning the non-essential ones and modifying others to perform less than usual. Controlled shutdown and controlled switchover to alternatives can be done for computer system routines and for external processes controlled by the system. Each of these possibilities could depend on continued computing at some speed, even if a great deal slower than in the original state.

Operation with a slower or faster spare requires cooperation from the rest of the system. A system that uses results as they are generated by the unit (say using flags or interrupts) will most easily adapt to a changed response time. A system that is made with units working in step will not adapt as simply to timing changes brought about by floating redundancy.

The following characteristics should also be considered in a system using floating redundancy:
- use of software for diagnosis and reconfiguration,
- manual switching of the spare (a single board replacement could be of advantage in remote areas and airborne systems),
- the number of spares,
- the difficulty in producing a floating spare (in some situations it could be trivial, in others overly complex),
- implementing only a portion of the functions by the spare, and
- implementation of extra functions by the spare.

The last is of particular interest, since the spare is often in a position (both in terms of capability and physical placement) to do fault diagnosis and testing of the error-handling circuitry.

Microprogram control is proposed as the means of having the spare mimic the combinational aspects of differing types of modules. The flexibility of this method allows varying functions to be done, including the "extra functions" of the floating spare. Further, a single microprogrammed controller would suffice for all functions, providing a maximal overlapping of hardware. Of course, this approach requires that any degradation in processing speed be allowable. The reliability of the controller may also have to be high, although it is the switching mechanism, not the spare itself, that must be very reliable. This will be shown in the next section.

Two disadvantages of the approach should be noted:
a) We are likely to pay a speed penalty for some functions.
b) There will be a cost penalty for the hardware and firmware development of the spare, since some of this will differ from that of the module types the spare replaces.

3. RELIABILITY CONSIDERATIONS OF FLOATING REDUNDANCY

In this section floating redundancy will be compared to some other methods of hardware redundancy. In order to do so, some assumptions must be made to bring the methods to common ground. The reader should note that these assumptions will not exactly reflect the nature of most systems; however, they will allow the study of certain phenomena without the complexity of more specific models. The assumptions are:
- voters and switchers can be considered not to fail,
- failures are independent,
- the failure rate is Poisson (this results from the number of independent sources of failure [CO54]),
- the probability of a second failure occurring during the recovery process from a first failure can be assumed to be zero, and
- faults are permanent, stuck-at faults (unless otherwise indicated).

The effects of some of these assumptions are outlined in [LO76].

In Table 1 floating redundancy is compared to the simplex configuration and to five other arrangements that offer some reliability advantages over the simplex configuration in certain situations. The five other arrangements are duplex [TO79], TMR [NE63], hybrid [SI72], self-checking only [CA77] and quadding [TR62]. Each of these has different characteristics from the others, and has different areas of usefulness limited by these characteristics.

Even ignoring problems in the voter/switch, each technique for improving reliability has a certain maximum number of errors that can always be correctly handled. These are indicated in the first five rows of the table. Note that rollback is assumed successful in countering any transient errors in all schemes with error detection, and that the second faults (both transient and permanent) have been assumed to be particularly malevolent. In the TMR case, for example, the second permanent error has been assumed to have occurred in the same triple of modules as the first error. If not, recovery is possible. This may not be the case with floating redundancy, as the floating spare may not be capable of substituting simultaneously for two failed modules. In this case the reliability of floating redundancy will be less than TMR, as will be shown.

Table 1. Comparison of Aspects of Several Forms of Redundancy

                                        Simplex  Duplex     TMR   Hybrid    Self-     Quadding  Floating
                                                 with             (TMR      checking
                                                 diagnosis        core)
Faults handled
  Initial transients corrected          no       yes        yes   yes       yes       yes       yes
  1st permanent - detected              no       yes        yes   yes       yes       no        yes
  1st permanent - corrected             no       yes        yes   yes       no        yes       yes
  Transients corrected after
    1st permanent                       no       no         yes   yes       no        no        yes
  2nd permanent - detected              no       no         yes   yes       no        no        yes
  2nd permanent - corrected             no       no         no    yes       no        no        no
  Transients corrected thereafter       no       no         no    yes       no        no        no
Some local massive failure protection   no       yes        yes   yes       no        no        yes
Cost in modules                         N        2N         3N    N(S+3)    N'        4N        (1) N+1
                                                                                                (2) N'+1
                                                                                                (3) 2N+1
Diagnostic software required            no       yes        no    no        no        no        no
Stop to repair                          yes      no         no    no        yes       yes       no

Notes:
' denotes self-checking modules (20-35% larger [CA77][RE80]).
N is the number of modules in the simplex system.
S is the number of spares for each module.
(1) Detection by external examiner.
(2) Detection using self-checking modules.
(3) Detection using pairs of modules.

The next row indicates whether the scheme tolerates some massive failures, such as a burnt out, multiple-gate integrated circuit. The remaining rows show the cost of the scheme in modules (which can also mean added costs for power supplies, cooling, backplane slots, etc.), whether diagnostic software must be used to implement the scheme, and whether repair and replacement necessarily means halting the processing.

For the floating redundancy technique, three methods are proposed for error detection. (This is not done by the floating spare.) These methods are employing an external examiner, using self-checking modules, or using twinned modules. As can be seen from the table, floating redundancy exhibits roughly the same behaviour as the duplex and TMR techniques. We will now consider more closely the reliability of these three methods.

The two styles of floating redundancy to be examined in the remainder of this section have the structure shown in Figure 3. The structure consists of N matched pairs of module types plus a floating spare. Results from each pair are compared, and the floating spare is substituted for the faulty module in case of failure. Figure 3 does not show physical connections, as these do not matter to the analyses.

The difference between the two styles of floating redundancy lies in the allowed substitutions. In case A, only a single substitution is allowed. The floating spare, once in use, cannot be used in a second place. The entire system fails if a failure occurs in any second module. Up to N substitutions are allowed in case B, the opposite extreme. The single spare is able to substitute in more than one place by time multiplexing the required actions. Multiple failures can be tolerated, provided that no failure occurs in both the spare and any other module, or in any two modules of a single type. These two cases are the bounds of floating redundancy with respect to the number of substitutions. The system is assumed to have failed when protection against transient faults is lost.
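The survival rules for the two cases can be written as a small predicate. The following Python sketch is an interpretation of the rules as stated above, consistent with the equations of Appendix A; the function name `survives` and the encoding of a failed module as a `(type_index, copy)` tuple are assumptions made for illustration.

```python
def survives(failed: set, spare_ok: bool, case: str) -> bool:
    """Return True if the paired system with one floating spare is still up.

    failed   -- set of (type_index, copy) tuples, copy being 0 or 1
    spare_ok -- whether the floating spare itself is working
    case     -- "A" (single substitution) or "B" (up to N substitutions)
    """
    if not failed:
        return True            # all active units work; the spare is not needed
    if not spare_ok:
        return False           # a substitution is needed but the spare is gone
    types = [type_index for type_index, _ in failed]
    if len(types) != len(set(types)):
        return False           # both modules of one type have failed
    if case == "A":
        return len(failed) == 1
    return True                # case B: at most one failure per pair


print(survives({(0, 0), (1, 1)}, spare_ok=True, case="A"))   # False
print(survives({(0, 0), (1, 1)}, spare_ok=True, case="B"))   # True
```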

An alternative, using self-checking modules, was also studied in [AN81]. Since fewer modules are needed for the same error coverage, reliability increases. Unfortunately, off-the-shelf self-checking modules are not generally available at present. These results will not be presented here.

Fig. 3. Structure of case A and case B floating redundancy (N matched pairs of modules M1, M2, ..., Mn and a floating spare F).

The equations that are used for the graphs to follow are presented in Appendix A. Unless otherwise specified, the reliability of all modules is considered equal. Except where noted, the graphs are of Rsys (the system reliability) versus time in terms of the MTTF (mean time to failure) of a single module. The first three graphs are done for three module types, a situation selected as a practical circumstance for floating redundancy.

Figure 4 shows the reliability against time of the methods for a period of time that is short when compared with the MTTF of a single module. Despite this, the time represented is still significant, about 15 days for an LSI-11 [IN77]. As can be seen, the floating methods compare well with the others, especially when the simplex method is considered, the only curve with non-zero slope at T = 0. TMR and case B are close in this area, since they have equal first and second derivatives (independent of the number of units) at T = 0.
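The curves of Figure 4 can be approximated numerically from the equal-reliability equations of Appendix A. The Python sketch below is illustrative only: the function names are ours, time is taken in units of the module MTTF so that R = exp(-T), and the spare is assumed to have the same reliability as the other modules.

```python
import math


def module_reliability(t: float) -> float:
    """R = exp(-T / MTTF), with t already expressed in units of the MTTF."""
    return math.exp(-t)


def simplex(r: float, n: int) -> float:
    return r ** n


def duplex(r: float, n: int) -> float:
    return (r * (2 - r)) ** n


def tmr(r: float, n: int) -> float:
    return (r * r * (3 - 2 * r)) ** n


def case_a(r: float, n: int, rs: float) -> float:
    q = (1 - r) / r
    return r ** (2 * n) * (1 + 2 * n * rs * q)


def case_b(r: float, n: int, rs: float) -> float:
    q = (1 - r) / r
    return r ** (2 * n) * (1 + rs * ((1 + 2 * q) ** n - 1))


N_TYPES = 3  # three module types, as used for the first three graphs
print(f"{'T':>6} {'simplex':>9} {'duplex':>9} {'TMR':>9} {'case A':>9} {'case B':>9}")
for step in range(5):
    t = 0.01 * step
    r = module_reliability(t)
    row = (simplex(r, N_TYPES), duplex(r, N_TYPES), tmr(r, N_TYPES),
           case_a(r, N_TYPES, r), case_b(r, N_TYPES, r))
    print(f"{t:6.2f} " + " ".join(f"{value:9.6f}" for value in row))
```

Over this range the TMR and case B columns stay close together, in line with the remark about their equal first and second derivatives at T = 0.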

Fig. 4. Reliability curves illustrating floating redundancy. Vertical axis: reliability, 0.991 to 1.000. Horizontal axis: time (normalized), 0 to 40 x 10^-3. Curves: simplex, duplex, duplex with transients, TMR, case A, case B.


There are two major reasons that TMR and floating redundancy, requiring more modules in the assumed configuration than duplex, are contemplated for use at all in light of the solid lines in Figure 4:
1) A duplex pair with one failed unit will not survive a transient failure.
2) Diagnostic routines with very high fault coverage must be produced.

If transients are assumed to be nine times as frequent as permanent errors [MA79] and to cause system failure when they occur in a sole surviving module of a pair, the duplex reliability falls to the dashed curve shown in the diagram. The added reliability with transient protection is clearly shown here.

The cases of floating redundancy being examined in this section, and the TMR method, are able to allow continued operation by adding a guessing strategy to the basic methods. This is known as TMR-Simplex when applied to the TMR method [BO71]. Adding this strategy causes a similar shift in floating redundancy reliability as in TMR [AN81].

Fig. 5. Effect of 98% coverage. Vertical axis: reliability, 0.991 to 1.000. Horizontal axis: time (normalized), 0 to 40 x 10^-3. Curves: simplex, duplex, TMR, case A, case B.

In Figure 5 one can see how less than perfect fault coverage (dotted lines) lowers the original curves (solid lines), as has been found for other methods [AR72]. Even at the modest reliability levels shown, there are significant changes in the curves. Incomplete coverage draws the lines together, and methods with a higher module count, such as TMR, are more sensitive at the initial stages than even the case A method. For high reliability requirements, where full coverage is not attained, floating redundancy should be considered. Since the causes of less than 100% coverage can vary in significance from method to method and implementation to implementation, Figure 5 must be interpreted with care. The complexity of voting, switching and fault isolation may differ, thus the coverage also can change.

The reliability of a system using floating redundancy changes with the reliability (and complexity) of the floating spare. This was found to be proportionately more significant with case B than with case A, due to the former's greater reliance on the spare for higher system reliability levels. The bounds of reliability in this respect can be found using the reliability equations for case A and case B (see Appendix A). These bounds indicate the best and the worst that can be expected as the spare is changed. When the reliability of the spare is essentially zero, the system reliability for both cases is the square of the simplex reliability. When the spare's reliability becomes high, the curve for case A approaches

$$R_{sys} = P^2 (1 + 2NQ)$$

(R is the reliability of a single module, P equals R to the Nth power where N is the number of single modules, Q equals (1-R)/R). Since only one substitution is possible, system reliability is still very much limited by the reliability of the basic modules. Case B approaches

$$R_{sys} = P^2 \left(1 + [1 + 2Q]^N - 1\right) = [(2 - R)R]^N$$

which is the equation for duplex reliability. The perfect spare acts like a perfect diagnostic program.

In any system, if one adds a module type, the reliability goes down, due to the possibility of failure of the additions. For methods other than floating redundancy, the proportional change in reliability (at a given time) is constant for each addition, if all the modules have the same reliability.
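The two bounds on the spare's reliability can be checked numerically. The short Python sketch below uses the equal-reliability equations of Appendix A; the function names and the sample values R = 0.95 and N = 3 are arbitrary choices for illustration.

```python
from math import isclose


def case_a(r: float, n: int, rs: float) -> float:
    q = (1 - r) / r
    return r ** (2 * n) * (1 + 2 * n * rs * q)


def case_b(r: float, n: int, rs: float) -> float:
    q = (1 - r) / r
    return r ** (2 * n) * (1 + rs * ((1 + 2 * q) ** n - 1))


r, n = 0.95, 3
simplex = r ** n
duplex = ((2 - r) * r) ** n

# Useless spare (Rs = 0): both cases fall to the square of the simplex value.
print(isclose(case_a(r, n, 0.0), simplex ** 2), isclose(case_b(r, n, 0.0), simplex ** 2))

# Perfect spare (Rs = 1): case B reaches the duplex value, case A does not.
print(isclose(case_b(r, n, 1.0), duplex), case_a(r, n, 1.0) < duplex)
```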

Fig. 6. Effect of the number of units. Vertical axis: reliability (log scale). Horizontal axis: number of units. Curves: simplex, duplex, TMR, case A, case B.

Due to inherent asymmetry, this is not true of floating redundancy. Figure 6 shows how the reliabilities of several methodologies drop as module types are added. On this log scale, the non-floating methods have constant slopes, whereas the floating methods do not. However, the curves for floating redundancy approach constant slope as the number of units becomes large. For case A, the proportion of Rsys for N + 1 module types to Rsys for N module types is

$$\frac{R^{2N+2}\,\bigl(1 + 2(N+1)(1-R)\bigr)}{R^{2N}\,\bigl(1 + 2N(1-R)\bigr)}$$

which approaches R squared for large N. R squared is the probability that the additional pair of modules will be working. This limit means, in essence, that with a large number of module types, the spare will almost always be in use after a small time, because the likelihood of some failure is high. Thus each additional type must work by itself, since the spare can make no more substitutions.


For case B the proportion is

$$\frac{R^{2N+2}\,\bigl(1 + R\,[(1 + 2(1-R)/R)^{N+1} - 1]\bigr)}{R^{2N}\,\bigl(1 + R\,[(1 + 2(1-R)/R)^{N} - 1]\bigr)}$$

which approaches (2-R)R for large N, the same as the duplex proportion. Thus the addition acts like a duplex pair at this point, since the rest of the system is not likely to function if the spare has failed, and, if the spare is perfect, the system behaves like a duplex system. The proportion is larger than in case A because the spare is able to make multiple substitutions, unlike in case A. Also, because N is always an exponent in this equation, the approach of the proportion to the value independent of N, as N becomes large, is faster than in the other method.

The extreme other end of the range is one module type. For N = 1, the numbers of modules and the reliabilities of TMR, case A and case B are equivalent. (Note that floating redundancy is not strictly defined for N = 1.)
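The limiting behaviour of the two proportions, and the N = 1 equivalence with TMR, can be checked with a few lines of Python. The function names and the sample value R = 0.9 are illustrative assumptions; the spare is taken to have the same reliability R as the other modules, as in the expressions above.

```python
from math import isclose


def ratio_case_a(r: float, n: int) -> float:
    """Rsys(N + 1) / Rsys(N) for case A, with the spare reliability equal to R."""
    return r ** 2 * (1 + 2 * (n + 1) * (1 - r)) / (1 + 2 * n * (1 - r))


def ratio_case_b(r: float, n: int) -> float:
    """Rsys(N + 1) / Rsys(N) for case B, with the spare reliability equal to R."""
    growth = 1 + 2 * (1 - r) / r
    return r ** 2 * (1 + r * (growth ** (n + 1) - 1)) / (1 + r * (growth ** n - 1))


r = 0.9
for n in (1, 10, 100, 1000):
    print(n, round(ratio_case_a(r, n), 6), round(ratio_case_b(r, n), 6))
print("limits:", r ** 2, (2 - r) * r)

# N = 1: case A and case B both reduce to the TMR expression R^2 (3 - 2R).
tmr_single = r ** 2 * (3 - 2 * r)
case_a_single = r ** 2 * (1 + 2 * (1 - r))
case_b_single = r ** 2 * (1 + r * ((1 + 2 * (1 - r) / r) - 1))
print(isclose(tmr_single, case_a_single), isclose(tmr_single, case_b_single))
```

The case A column converges roughly as 1/N, while the case B column settles after only a few tens of module types, which matches the remark that N appears only as an exponent in case B.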

4. THE USES OF FLOATING REDUNDANCY

We consider three example structures where floating redundancy can be applied. Each structure has several types of modules within it, the basic requirement for floating redundancy, but the designs are quite different.

The first structure we will consider is that of Fapor. Fapor is an acronym for Fault-Adaptive ProcessOR. This processor was developed as an example of floating redundancy [AN81]. Fapor is composed of two processors, A1 and A2, operating in parallel. A comparator, C in Figure 7, is used at every write operation. Rollback is used to correct transient failures detected by the matching operation. The information for the rollback is held in registers X. F in the figure is the floating spare. Should a permanent failure occur in any one of the aforementioned units, the spare can be used instead. The switches shown allow the information from the spare to be transmitted as needed, depending on whether it is replacing a processor, the comparator or the rollback registers. There is some degradation of processing speed when the spare is used [AN81].

Fig. 7. Fapor structure (processors A1 and A2, floating spare F; key: data bus, switch).


To improve performance, system architects are proposing systems with special-purpose modules for specific functions. Integrated circuits and hardware modules are being contemplated and developed for such functions as searching [FO80], scheduling [DE80], arithmetic [PA80] and hashing [LA78]. Already common are back-end units for calculating FFTs [CH79] and for array processing [FL79]. Whenever two or more types of these modules are in a system, a floating spare could be added to provide fault-tolerance.

Two structures with a distributed CPU have been introduced recently. One is the data-flow structure [RU77], the other the Fermtor structure [LO80]. In both these designs there are several types of processing elements. Multiple copies of each type operate in parallel, performing their specific operation, then passing the control to another processor to perform the next operation. The systems are multiprocessors, each process having a separate control chain, so many elements are usually active at once.

Such architectures are candidates for floating redundancy. The floating spare is merely another type of processing element and thus the interconnection scheme need not be changed. The structure of the Fermtor elements, in particular, is standardized. This same basic structure, including microprogram control, would provide a convenient, compatible base on which to build the floating spare.

5. CONCLUDING REMARKS

Certain systems will be more amenable to floating redundancy. Since implementation will require communications between the spare and the modules it replaces, systems with centralized data routes, such as a bus, will allow the method to be more easily used. Similarly, systems that are not spread over large geographical areas will allow the use of higher speed links to the spare, and thus probably less speed degradation when the spare is in use. As with any method that improves hardware reliability, careful study of the advantages and disadvantages of the method in the particular system is required before any choice is made.

Floating redundancy offers improved reliability at a lower module count than some other methods. This technique is likely to be particularly attractive in multiprocessor systems, where similarly structured microprogram-controlled microprocessor modules are used to perform differing functions. Such systems are highly conducive to implementation with VLSI technology.

BIBLIOGRAPHY

[AN81] P. G. Anderson, "Floating Redundancy", PhD Thesis, Department of Electrical Engineering, University of Toronto, 1981.
[AR72] T. F. Arnold, "The Concept of Coverage and Its Effect on the Reliability Model of a Repairable System", Fault-Tolerant Computing Symposium, Digest of Papers, 1972.
[BO71] W. G. Bouricius, W. C. Carter, D. C. Jessep, P. R. Schneider, A. B. Wadia, "Reliability Modeling for Fault-Tolerant Computers", Fault-Tolerant Computing Symposium, Digest of Papers, 1971.
[CA71] W. C. Carter, W. G. Bouricius, D. C. Jessep, J. P. Roth, P. R. Schneider, A. B. Wadia, "A Theory of Design of Fault-Tolerant Computers Using Standby Sparing", Fault-Tolerant Computing Symposium, Digest of Papers, 1971.
[CA77] W. C. Carter, G. R. Putzolu, A. B. Wadia, W. G. Bouricius, D. C. Jessep, E. P. Hsieh, C. J. Tan, "Cost Effectiveness of Self-Checking Computer Design", Fault-Tolerant Computing Symposium, Digest of Papers, 1977.
[CH79] P. Chow, "A Hardware Implementation of the Prime Factor Fourier Transform", MASc Thesis, Department of Electrical Engineering, University of Toronto, 1979.
[CO54] D. R. Cox, W. L. Smith, "On the Superposition of Renewal Processes", Biometrika, Vol. 41, 1954.
[DE80] P. J. Denning, "Working Sets Past and Present", IEEE Transactions on Software Engineering, Vol. SE-6, No. 1, January 1980.
[FL79] Floating Point Systems, Inc., Form 7345, "The Array Processor Brochure", P.O. Box 23489, Portland, OR 97223.
[FO80] M. J. Foster, H. T. Kung, "Design of Special-Purpose VLSI Chips - Example and Opinions", Computer, Vol. 13, No. 1, January 1980.
[IN77] A. D. Ingle, D. P. Siewiorek, "Reliability Models for Multiprocessor Systems with and without Periodic Maintenance", Fault-Tolerant Computing Symposium, Digest of Papers, 1977.
[LA78] C. L. Lam, "A Proposal for Efficient File Addressing Techniques", PhD Thesis, Department of Electrical Engineering, University of Toronto, 1978.
[LO76] J. Losq, "A Highly Efficient Redundancy Scheme: Self-Purging Redundancy", IEEE Transactions on Computers, Vol. C-25, No. 6, June 1976.
[LO80] W. Loucks, "Fermtor: A Flexible Extendible Range Multiprocessor", PhD Thesis, Department of Electrical Engineering, University of Toronto, 1980.
[MA79] Y. K. Malaiya, S. Y. S. Su, "A Survey of Methods for Intermittent Fault Analysis", AFIPS Conference Proceedings, Vol. 48, 1979.
[NE63] J. von Neumann, "Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components", from A. H. Taub (ed.), Collected Works, Vol. 5, Macmillan Publishers, New York, 1963, as used in W. C. Carter, "Fault-Tolerant Computing: An Introduction and a Viewpoint", IEEE Transactions on Computers, Vol. C-22, No. 3, March 1973.
[PA80] J. Palmer, R. Nave, C. Wymore, R. Koehler, C. McMinn, "Making Mainframe Mathematics Accessible to Microcomputers", Electronics, Vol. 53, No. 11, May 8, 1980.
[RE80] D. A. Rennels, "Distributed Fault-Tolerant Computer Systems", Computer, Vol. 13, No. 3, March 1980.
[RU77] J. Rumbaugh, "A Data Flow Multiprocessor", IEEE Transactions on Computers, Vol. C-26, No. 2, February 1977.
[SI72] D. P. Siewiorek, E. J. McCluskey, "An Iterative Cell Switch Design for Hybrid Redundancy", Fault-Tolerant Computing Symposium, Digest of Papers, 1972.
[TO79] W. N. Toy, "Fault-Tolerant ESS Processors", Fault-Tolerant Computing Symposium, Digest of Papers, 1979.
[TR62] J. G. Tryon, "Quadded Logic", from R. H. Wilcox, W. H. Mann (eds.), Redundancy Techniques for Computing Systems, Spartan Press, New York, 1962.

Appendix A - Equations Used in Section 3

The following short forms are used:
T - Time (all modules unfailed at T = 0)
MTTF_i - Mean time to failure of module i
R_i(T) or R_i - Probability of module i being unfailed at T
R_s - Probability of spare being unfailed at T

We also assume R_i = exp(-T/MTTF_i).

For case A, the probability that the system survives is the sum of the probabilities that all active units survive, and that the spare survives with exactly one failure in the other modules:

$$R_{sys} = \prod_{i=1}^{N} R_i^2 + R_s \sum_{i=1}^{N} \Bigl[ \prod_{j \ne i} R_j^2 \cdot 2 R_i (1 - R_i) \Bigr] = P^2 \Bigl( 1 + 2 R_s \sum_{i=1}^{N} Q_i \Bigr)$$

where P is defined as $\prod_{i=1}^{N} R_i$ and $Q_i = (1 - R_i)/R_i$.

When all the R_i are the same, say R, then this equation becomes

$$R_{sys} = P^2 (1 + 2 N R_s Q)$$

For case B, the system reliability is the probability of all active units not failing, plus the probability of the spare not failing and at least one failure occurring, with no more than one unit failing per pair. Thus

$$R_{sys} = \prod_{i=1}^{N} R_i^2 + R_s \sum_{F} \Bigl[ \prod_{j \in G} R_j^2 \cdot \prod_{k \in B} R_k (1 - R_k) \Bigr] = P^2 \Bigl( 1 + R_s \sum_{F} \prod_{k \in B} Q_k \Bigr)$$

In this equation P and Q_k are as previously defined and F is the set of combinations of failures described above. For each of these combinations of failures there is a set of unfailed (good) units that form a set G and a set of failed (bad) units that form a set B. Indices j and k run through the elements of sets G and B respectively.

When all the R_i are equal, f failures can be distributed so that there is at most one failure per pair in C(N,f) ways. (C is the combinatorial operator.) Each failure can be in either element of the pair. As a result we can write this as

$$R_{sys} = P^2 \Bigl( 1 + R_s \sum_{f=1}^{N} C(N,f)\, 2^f Q^f \Bigr) = P^2 \bigl( 1 + R_s ([1 + 2Q]^N - 1) \bigr)$$
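The case B expression can be cross-checked by brute force, enumerating every working/failed state of the 2N active modules. In the Python sketch below the function names and the sample reliabilities are arbitrary, and the product form used for unequal R_i, namely P^2 (1 + R_s [prod_i (1 + 2 Q_i) - 1]), is our own rewriting of the sum over F rather than a formula quoted from the paper.

```python
from itertools import product
from math import isclose, prod


def case_b_enumerated(r: list, rs: float) -> float:
    """Sum the probability of every surviving state of the 2N active modules."""
    n = len(r)
    total = 0.0
    for state in product((True, False), repeat=2 * n):   # True means working
        pairs = [(state[2 * i], state[2 * i + 1]) for i in range(n)]
        failures_per_pair = [2 - (a + b) for a, b in pairs]
        if max(failures_per_pair) > 1:
            continue                      # both modules of one type failed
        probability = 1.0
        for i, pair in enumerate(pairs):
            for working in pair:
                probability *= r[i] if working else 1 - r[i]
        if sum(failures_per_pair) > 0:
            probability *= rs             # the spare is needed and must work
        total += probability
    return total


def case_b_closed(r: list, rs: float) -> float:
    p_squared = prod(ri * ri for ri in r)
    return p_squared * (1 + rs * (prod(1 + 2 * (1 - ri) / ri for ri in r) - 1))


reliabilities = [0.90, 0.80, 0.95]
print(isclose(case_b_enumerated(reliabilities, 0.85), case_b_closed(reliabilities, 0.85)))
```

With all R_i equal the product collapses to [1 + 2Q]^N, which recovers the closed form given above.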

The basic equations used for duplex and TMR were:

Duplex: $R(2 - R)$

TMR: $R^2 (3 - 2R)$

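DUPLEX WITH TRANSIENTS

For the curve showing the effects of transients on the duplex method the following was used (with T in units of the module MTTF, so that R = exp(-T)):

$$R_{sys} = R^2 + \int_0^T \frac{d}{dt}\bigl(1 - e^{-2t}\bigr)\, e^{-9(T-t)}\, dt = \frac{9}{7} R^2 - \frac{2}{7} R^9$$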

The factor of 9 comes from [MA79].

NON-PERFECT COVERAGE

Here c is the coverage. The equations were altered to:

Simplex: $R_{sys} = R^{N-1} R_k$

Case A: $R_{sys} = R^{2N-2} R_k^2 \bigl[ 1 + 2 c R_s ((N-1) Q + Q_k) \bigr]$

Case B: $R_{sys} = R^{2N-2} R_k^2 \Bigl[ 1 + R_s \Bigl( \sum_{f=1}^{N-1} C(N-1,f)\, 2^f Q^f c^f (1 + 2 c Q_k) + 2 c Q_k \Bigr) \Bigr]$

Duplex: $R_{sys} = \bigl( R\,[R + 2c(1 - R)] \bigr)^{N-1} \cdot R_k [R_k + 2c(1 - R_k)]$

TMR: $R_{sys} = \bigl( R^2 [R + 3c(1 - R)] \bigr)^{N-1} \cdot R_k^2 [R_k + 3c(1 - R_k)]$
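A simple consistency check on the coverage equations is that they reduce to the perfect-coverage forms when c = 1 and R_k = R. The Python sketch below does this check under assumptions of our own: the function names are illustrative, the upper summation limit for case B is taken as N - 1, and the equations are used as read from a degraded copy of the appendix, so the code should be treated as a reading aid rather than a definitive transcription.

```python
from math import comb, isclose


def duplex_cov(r, rk, n, c):
    return (r * (r + 2 * c * (1 - r))) ** (n - 1) * rk * (rk + 2 * c * (1 - rk))


def tmr_cov(r, rk, n, c):
    return (r * r * (r + 3 * c * (1 - r))) ** (n - 1) * rk * rk * (rk + 3 * c * (1 - rk))


def case_a_cov(r, rk, rs, n, c):
    q, qk = (1 - r) / r, (1 - rk) / rk
    return r ** (2 * n - 2) * rk ** 2 * (1 + 2 * c * rs * ((n - 1) * q + qk))


def case_b_cov(r, rk, rs, n, c):
    q, qk = (1 - r) / r, (1 - rk) / rk
    # Failures among the other N - 1 module types, each covered with probability c.
    others = sum(comb(n - 1, f) * (2 * c * q) ** f for f in range(1, n))
    return r ** (2 * n - 2) * rk ** 2 * (1 + rs * (others * (1 + 2 * c * qk) + 2 * c * qk))


r, rs, n = 0.97, 0.97, 3
q = (1 - r) / r

# With c = 1 and Rk = R the perfect-coverage equations are recovered.
print(isclose(duplex_cov(r, r, n, 1.0), (r * (2 - r)) ** n))
print(isclose(tmr_cov(r, r, n, 1.0), (r * r * (3 - 2 * r)) ** n))
print(isclose(case_a_cov(r, r, rs, n, 1.0), r ** (2 * n) * (1 + 2 * n * rs * q)))
print(isclose(case_b_cov(r, r, rs, n, 1.0), r ** (2 * n) * (1 + rs * ((1 + 2 * q) ** n - 1))))

# The 98% coverage of Figure 5 lowers every protected scheme.
print(tmr_cov(r, r, n, 0.98) < tmr_cov(r, r, n, 1.0))
```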