Fault-tolerant design of local ESS processors

World Abstracts on Microelectronics and Reliability

The third approach employs the concept of cellular automata in failure detection and switching functions. Finally, the fourth design approach provides an automatic shift from 2 out of 3 to 2 out of 2.
Pluribus--an operational fault-tolerant multiprocessor. DAVID KATSUKI, ERIC S. ELSAM, WILLIAM F. MANN, ERIC S. ROBERTS, JOHN G. ROBINSON, F. STANLEY SKOWRONSKI and ERIC W. WOLF. Proc. IEEE 66, (10) 1146 (October 1978). The authors describe the Pluribus multiprocessor system, outline several techniques used to achieve fault-tolerance, describe their field experience to date, and mention some potential applications. The Pluribus system places the major responsibility for recovery from failures on the software. Failing hardware modules are removed from the system, spare modules are substituted where available, and appropriate initialization is performed. In applications where the goal is maximum availability rather than totally fault-free operation, this approach represents a considerable savings in complexity and cost over traditional implementations. The software-based reliability approach has been extended to provide error-handling and recovery mechanisms for the system software structures as well. A number of Pluribus systems have been built and are currently in operation. Experience with these systems has given us confidence in their performance and maintainability, and leads us to suggest other applications that might benefit from this approach.

FTMP--a highly reliable fault-tolerant multiprocessor for aircraft. ALBERT L. HOPKINS, Jr., T. BASIL SMITH, III and JAYNARAYAN H. LALA. Proc. IEEE 66, (10) 1221 (October 1978). FTMP is a digital computer architecture which has evolved over a ten-year period in connection with several life-critical aerospace applications. Most recently it has been proposed as a fault-tolerant central computer for civil transport aircraft applications. A working emulation has been operating for some time, and the first engineering prototype is scheduled to be completed in late 1979. FTMP is designed to have a failure rate due to random causes of the order of 10^-10 failures per hour, on ten-hour flights where no airborne maintenance is available. The preferred maintenance interval is of the order of hundreds of flight hours, and the probability that maintenance will be required earlier than the preferred interval is desired to be at most a few percent. The design is based on independent processor-cache memory modules and common memory modules which communicate via redundant serial buses. All information processing and transmission is conducted in triplicate so that local voters in each module can correct errors. Modules can be retired and/or reassigned in any configuration. Reconfiguration is carried out routinely from second to second to search for latent faults in the voting and reconfiguration elements. Job assignments are all made on a floating basis, so that any processor triad is eligible to execute any job step. The core software in the FTMP will handle all fault detection, diagnosis, and recovery in such a way that applications programs do not need to be involved. Failure-rate models and numerical results are described for both permanent and intermittent faults. A dispatch probability model is also presented. Experience with an experimental emulation is described.
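The triad-based, floating job assignment described in this abstract can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names (`form_triads`, `retire`, `assign`), the integer processor ids, and the round-robin policy are all assumptions chosen only to show how a failed module might be retired and the remaining processors regrouped so that any triad can take any job step.

```python
def form_triads(healthy):
    """Group healthy processor ids into triads; leftovers remain as spares.

    Hypothetical helper: the real grouping policy is not given in the abstract.
    """
    healthy = list(healthy)
    n = len(healthy) // 3
    triads = [tuple(healthy[3 * i:3 * i + 3]) for i in range(n)]
    spares = healthy[3 * n:]
    return triads, spares


def retire(healthy, failed):
    """Return the healthy list with the failed processor removed."""
    return [p for p in healthy if p != failed]


def assign(job_steps, triads):
    """Floating assignment (assumed round-robin): any triad may run any step."""
    if not triads:
        raise ValueError("no complete triad available")
    return {step: triads[i % len(triads)] for i, step in enumerate(job_steps)}


# Seven processors form two triads plus one spare.
healthy = list(range(7))
triads, spares = form_triads(healthy)

# Processor 4 is retired; the spare is absorbed when triads are re-formed.
healthy = retire(healthy, 4)
triads_after, spares_after = form_triads(healthy)
schedule = assign(["a", "b", "c"], triads_after)
```

Re-forming the triads from the whole healthy pool, rather than patching a single triad, keeps the sketch short; a real system would presumably try to disturb running triads as little as possible.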

SIFT: design and analysis of a fault-tolerant computer for aircraft control. JOHN H. WENSLEY, LESLIE LAMPORT, JACK GOLDBERG, MILTON W. GREEN, KARL N. LEVITT, P. M. MELLIAR-SMITH, ROBERT E. SHOSTAK and CHARLES B. WEINSTOCK. Proc. IEEE 66, (10) 1240 (October 1978). SIFT (Software Implemented Fault Tolerance) is an ultrareliable computer for critical aircraft control applications that achieves fault tolerance by the replication of tasks among processing units. The main processing units are off-the-shelf minicomputers, with standard microcomputers serving as the interface to the I/O system. Fault isolation is achieved by using a specially designed redundant bus system to interconnect the processing units. Error detection and analysis and system reconfiguration are performed by software. Iterative tasks are redundantly executed, and the results of each iteration are voted upon before being used. Thus, any single failure in a processing unit or bus can be tolerated with triplication of tasks, and subsequent failures can be tolerated after reconfiguration. Independent execution by separate processors means that the processors need only be loosely synchronized, and a novel fault-tolerant synchronization method is described. The SIFT software is highly structured and is formally specified using the SRI-developed SPECIAL language. The correctness of SIFT is to be proved using a hierarchy of formal models. A Markov model is used both to analyze the reliability of the system and to serve as the formal requirement for the SIFT design. Axioms are given to characterize the high-level behavior of the system, from which a correctness statement has been proved. An engineering test version of SIFT is currently being built.
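Voting on replicated task results in software, followed by reconfiguration, can be sketched in a few lines of Python. This is only an illustration of majority voting under stated assumptions, not the SIFT implementation: the function names `vote` and `reconfigure`, the string unit ids, and the spare-substitution policy are hypothetical.

```python
from collections import Counter


def vote(results):
    """Majority-vote one iteration's results.

    `results` maps a processing-unit id to the value it computed.
    Returns (majority_value, suspects), where suspects are the units that
    disagreed. Raises ValueError if no value has a strict majority.
    """
    if not results:
        raise ValueError("no results to vote on")
    value, count = Counter(results.values()).most_common(1)[0]
    if 2 * count <= len(results):
        raise ValueError("no majority")
    suspects = sorted(u for u, v in results.items() if v != value)
    return value, suspects


def reconfigure(units, suspects, spares):
    """Drop suspect units, substituting spares while any remain (assumed policy)."""
    spares = list(spares)
    new_units = []
    for u in units:
        if u not in suspects:
            new_units.append(u)
        elif spares:
            new_units.append(spares.pop(0))
    return new_units, spares


# Three replicas of one task; p3 returns a faulty value and is outvoted.
value, suspects = vote({"p1": 42, "p2": 42, "p3": 7})
units, spares_left = reconfigure(["p1", "p2", "p3"], suspects, ["p4"])
```

With three replicas a single faulty result is masked by the vote, and replacing the disagreeing unit restores triplication so that a later failure can also be tolerated.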

Fault-tolerant design of local ESS processors. W. N. TOY. Proc. IEEE 66, (10) 1126 (October 1978). The stored program control of Bell System Electronic Switching Systems (ESS) has been under development since 1953. During this period, the No. 1 ESS, the No. 2 ESS, and the No. 3 ESS have been developed and used extensively by Bell System operating companies to provide commercial telephone service. These systems serve all types of telephone offices: the large-capacity No. 1 ESS serves metropolitan offices, the medium-capacity No. 2 ESS was designed for suburban offices, and the No. 3 ESS can be found in many small rural offices. The fault-tolerant design of ESS processors provides the same highly dependable telephone service established by the previous electro-mechanical systems. Pertinent processor architecture features used to achieve ESS reliability objectives are discussed. A detailed discussion of the maintenance design of the 3A processor is also included.

A case study of C.mmp, Cm*, and C.vmp: part I - Experiences with fault tolerance in multiprocessor systems. DANIEL P. SIEWIOREK, VITTAL KINI, HENRY MASHBURN, STEPHEN McCONNEL and MICHAEL TSAO. Proc. IEEE 66, (10) 1178 (October 1978). Three multiprocessor systems designed, implemented, and currently operational at Carnegie-Mellon University are compared and contrasted. The design goals and architectures are summarized with a special focus on reliability features. Experiences gained in design and operation are discussed. Finally, reliability data, with a focus on transient failures, measured from each system are presented and discussed.

4. MICROELECTRONICS

GENERAL

Integrated display components. Review of the international status and trends. W. HEIDBORN. Nachrichtentechnik Elektronik 28, (9) 356 (1978). (In German.) Owing to the growing technical possibilities of microelectronic circuits a continuous increase of the information capacity of display components is required. By systematizing the component