Application of Redundant Processing to Space Shuttle


Copyright © IFAC Control Science and Technology (8th Triennial World Congress), Kyoto, Japan, 1981

APPLICATION OF REDUNDANT PROCESSING TO SPACE SHUTTLE

J. T. Caulfield

International Business Machines, Owego, NY

Abstract. The Space Shuttle is a highly automated vehicle, in which the computers are essential to flight safety. The very high reliability required is achieved by a four-fold redundant configuration at the level of a complete computer, with redundancy management of the computers, external sensors, and interfacing equipment performed by a combination of hardware and software techniques. The paper describes the techniques used and the results achieved.

Keywords. Redundancy management; high reliability; spacecraft data processing; flight control; guidance and navigation; voting techniques; single point failure elimination.

The Space Shuttle has been designed as a general purpose, reusable space transportation system for the '80s and '90s. In addition to a very large payload carrying capability, it will be called upon to serve the unique requirements of a wide range of low earth orbit missions, and will be reconfigured from mission to mission with a very short turnaround time. It is an advanced, sophisticated vehicle that has created unique demands on its data processing system, which had to be created in a cost-effective manner out of early 1970s technology.

The computers in Space Shuttle are pervasive, controlling all flight and mission functions, and hence, as a total data processing system, had to have unprecedented reliability. The role of the data processing in Space Shuttle includes the normal navigation and guidance function, flight control including a digital autopilot and stabilization -- a true fly-by-wire system -- operation of four CRT display systems and keyboards for crew interface, system and payload monitoring, and management of sensor redundancy, in addition to management of the data processing system redundancy.

Space Shuttle operates in three different flight regimes: ascent, on orbit, and reentry. Hence, it must operate with inputs from several sensors and separate actuators for each of these flight regimes. During ascent, inertial inputs are used to control the main propulsion engines; on orbit, inertial, stellar, and electromagnetic sensor inputs are used with reaction jet control; and on reentry, the vehicle is flown as an unpowered aircraft with aerodynamic control surfaces and a full range of aircraft instrumentation.

The basic reliability requirement on every subsystem in Space Shuttle is Fail Operational / Fail Safe, meaning that after the first failure of any given unit, the system will still be fully operational, and it will be safe after a second failure. Because of the flight-critical nature of the data processing system,
this implies that it is Fail Operational / Fail Operational, able to continue its full operation after two like failures have occurred. There can be no single point failure modes, nor any conditions that might be perceived as failures and so result in the rejection of a good unit from redundant set operation. Flight control dynamics are very critical in both the ascent phase and the reentry into earth's atmosphere, and require smoothly continuous control with no transients in the control systems, thus requiring that all redundant units be powered on and instantly available during these critical mission phases. A derivative requirement is that there can be no skew in the time coherence of the input data used by each computer, and no more than a millisecond of time skew in the output. The redundancy technique must also tolerate transient transmission errors.

IBM and others have implemented various redundancy schemes in the past, such as quad redundant components, triple modular redundancy, and dual systems. Each of these techniques was chosen to meet the specific requirements of the system in question; none met the unique requirements of Shuttle. Withstanding two like failures requires, as a minimum, a quadruple redundant system. In addition, a fifth computer, on board as a payload management computer, is presently being used during critical mission phases as an independently programmed backup, to guard against generic software problems in the prime system that could conceivably affect the entire redundant set.

Because the outputs of the several computers are typically multiple digital words, the voting process becomes quite complicated and the voters themselves could be quite complex. By using the computers themselves as the voters upon each other, based on passing the desired data for comparison between themselves, the additional logic for redundancy management can be made quite simple, as shall be described
later. Redundancy management is thus accomplished by a combination of hardware and software, and allows techniques to be employed that can tolerate one-time transient failures either in the processors or on the interfacing buses. Redundancy at the high level, to be described, significantly reduces the number of cross-strapping nodes at which single-point failures could occur.

The basic computational requirement is for a Central Processor Unit (CPU) of 400,000 operations per second on a 32-bit data flow, with 106,000 32-bit words of non-destructive memory, and a micro-coded architecture capable of implementing a full instruction set of some 154 fixed point and floating point instructions. 24 output channels are required to interface to the data buses, implemented by a separate Input/Output Processor (IOP), cyclically time shared among the 24 channels. The IOP has direct memory access to the CPU's main memory, by a cycle stealing technique that minimizes CPU interference. The data buses and terminals are standardized throughout the avionics, operating at a rate of one 28-bit word each 33 microseconds, with a one megahertz clock cycle.

A block diagram of the system configuration is shown in Fig. 1. Each General Purpose Computer (GPC), consisting of a CPU and IOP, interfaces with the same 23 shared data buses in five groups, used for intercomputer communications, mass memories, displays, payload operations, launch functions, and the flight-critical sensors and controls. In addition, each computer has a dedicated output for flight instrumentation. With this configuration, any GPC can be used as the backup system, any GPC can command any data bus, any GPC can control any display or format, and any GPC can be assigned as a prime computer to control memory overlays from the mass memories, or can directly load the memory of any other computer. Assignment or reconfiguration is made by the flight crew by keyboard entry. In practice, each computer controls a pre-specified subset of the buses.

Fault-tolerant operation is achieved by the techniques shown in Fig. 2. For the flight-critical input channels, each computer controls one of four buses and listens on the other three. Only the controlling computer on any given bus transmits commands to any given sensor, but each of the four receives the input data simultaneously. Since the computers are synchronized, each computer requests data from its sensor simultaneously with the others, and the three sets of data from the three sensors are time coherent, with no special processing required to match three data sets in time.

The redundancy management scheme requires that the several computers be precisely synchronized with each other. If this were not done, one computer might be proceeding to read new input data prior to the start of operations requiring the use of that data, while another computer had not yet read the new data. With synchronization, however, all of the computers enter a sync routine and wait for one another prior to commanding input data that will be used in the flight-critical computations. No attempt is made to synchronize hardware; synchronization is accomplished through software, utilizing a coded set of three discrete outputs from each computer transmitted to all of the others. As each computer enters its sync routine, it looks for the presence of these discretes from the others and waits only a precise amount of time for them to arrive. Once they are all in the synchronization routine, they then all start together at the same point. If any computer fails to sync within the prescribed time, it is eliminated from the set.

In addition to having identical inputs from several sensors, each GPC must have the same protocol to derive the mean value. This protocol is entirely contained in software and includes a reasonableness check on each value to eliminate any totally erroneous input. With two sensors, an average is taken; with three, the middle value is used; and with four, the average of the two middle values. Input error conditions, when sensed, are transmitted to other GPCs over the intercomputer data buses, to insure that all process the data in the same way.

All critical outputs are voted at the end effector. Normal operation has the effector receiving four inputs and providing one output, although the voting effector will provide proper outputs with only two inputs. Thus, the two fault tolerant requirement is met on the output side of the computer complex, as each computer in the redundant set has provided one of the inputs to the effector over one of its command channels.

The voting techniques provide a straightforward means of detecting computer faults or transmission errors, and effectively mask them from the system on an instantaneous basis. Identifying and removing the failed computer from the set is, therefore, not time critical, and it is acceptable to wait each cycle until all output commands are computed and transmitted before comparing results.

Identification of a failed computer and its removal from the redundant set is accomplished by three categories of fault detection mechanisms. The first category is by internal BITE in each of the computers. This category includes parity checking, voltage monitors, internal timers, etc. Because of the inability to exercise I/O at will, this provides the lowest coverage of the three. The outputs of the internal BITE cause the affected computer to fail to sync.

A "fail-to-sync" is the most common and most effective of the three categories. Fail-to-sync occurs either because of a BITE error or because of any erroneous program branches or abnormal computational activity. There are approximately 280 sync points per second on the
average, and a computer is eliminated from the redundant set the first time that it fails to sync.

The third category is a compare test in which each computer compares the results of its critical computations with each of the others. In this test, all critical outputs are summed together, and the sum check word is transmitted over the intercomputer communication data buses. Each computer controls one of these buses and listens on the other four, and compares its results with each of the others. This test provides the most total coverage of the three categories; and since it is entirely under software control, it may be tailored to be tolerant of transient error conditions occurring from data transmission. In Shuttle, two noncompares in a row are required to eliminate a computer from the redundant set. A decision on this test therefore requires at least 80 milliseconds, based on two noncompares with a 40 millisecond cycle time.

If a computer disagrees with the sum word from any other computer, it sets a discrete, under software control, to that computer which says, in effect, "I disagree with you". No attempt is made at this point to identify which of the two is incorrect. However, if a computer receives two or more discretes from other members of the set, then hardware logic within the computer receiving the failed vote discretes signals that it has failed, and depending on the setting of hardware control latches, may reset its input/output to inhibit further transmission of any outputs. The hardware in each computer to indicate that it has failed is independent of the hardware and software that makes the decision on whether or not that computer agrees with the others.

The hardware logic used to indicate a failure is shown in Fig. 3, drawn for computer No. 1. The upper portion of this figure shows the disagreement outputs of computer No. 1 transmitted to the other four computers and to the crew panel.
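
The compare test and the failure-vote rule described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the flight software: the names `sum_check_word`, `CompareTest`, `fail_latches`, and `simulate` are hypothetical, the 32-bit wraparound of the sum word is assumed, and the hardware latch is modelled as a simple vote count. Only the three constants come from the text.

```python
from dataclasses import dataclass, field

CYCLE_MS = 40            # compare-test cycle time quoted in the text
NONCOMPARES_TO_VOTE = 2  # "two noncompares in a row"
VOTES_TO_FAIL = 2        # "two or more discretes from other members of the set"


def sum_check_word(critical_outputs, width_bits=32):
    """Sum the critical outputs into one check word (wraparound width is assumed)."""
    return sum(critical_outputs) & ((1 << width_bits) - 1)


@dataclass
class CompareTest:
    """Per-computer bookkeeping for the software compare test (hypothetical)."""
    my_id: int
    streak: dict = field(default_factory=dict)  # peer id -> consecutive noncompares

    def run_cycle(self, my_word, peer_words):
        """Return the set of peers this computer votes against this cycle."""
        votes = set()
        for peer, word in peer_words.items():
            if word == my_word:
                self.streak[peer] = 0          # a good compare clears the streak
            else:
                self.streak[peer] = self.streak.get(peer, 0) + 1
            if self.streak[peer] >= NONCOMPARES_TO_VOTE:
                votes.add(peer)                # set the "I disagree with you" discrete
        return votes


def fail_latches(votes_by_voter):
    """Model of the hardware rule: a computer is failed on >= 2 votes against it."""
    tally = {}
    for voter, targets in votes_by_voter.items():
        for target in targets:
            if target != voter:
                tally[target] = tally.get(target, 0) + 1
    return {gpc for gpc, count in tally.items() if count >= VOTES_TO_FAIL}


def simulate(words_per_cycle):
    """Run the compare test cycle by cycle; return the failed set after each cycle."""
    ids = sorted(words_per_cycle[0])
    tests = {i: CompareTest(i) for i in ids}
    history = []
    for words in words_per_cycle:
        votes = {
            i: tests[i].run_cycle(words[i], {j: w for j, w in words.items() if j != i})
            for i in ids
        }
        history.append(fail_latches(votes))
    return history


if __name__ == "__main__":
    good = sum_check_word([10, 20, 30])
    # Computer 3 sends a wrong sum word on two consecutive cycles.
    print(simulate([{1: good, 2: good, 3: good + 1, 4: good}] * 2))
    print("minimum decision time (ms):", NONCOMPARES_TO_VOTE * CYCLE_MS)
```

In this sketch the faulty computer also votes against each healthy one, but every healthy computer then holds only a single vote against it, which is below the two-vote threshold. A single noncompare followed by a good compare resets the streak, so a one-cycle transmission error removes nobody.
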
The lower half of Fig. 3 shows the discrete receivers, receiving disagreement votes from the other computers as well as its internally detected faults, which are used to manipulate the watchdog timer. If computer No. 1 has received failure votes from more than one other computer, it sets its voter fail latch, inhibits transmission of failure indications to the other computers, and provides its own computer fail signal to the crew panel. While outputs of this logic may be used to inhibit further I/O by this computer, it is typically left for crew action to remove that computer from the redundant set.

Software redundancy management tasks, in addition to synchronization and the performance of the compare tests, include intercomputer data transfers, such as status information of all subsystems. This data is used for system reconfiguration as necessary, accomplishment of display assignments, and initialization. Input error conditions must
be transmitted and used to insure that all computers do the same thing with input data, even though they may not all have seen the same error conditions. Software must also control bus management, bypassing specific units which may have failed or bypassing a specific failed bus (which may involve several separate sensors, such as IMU and star tracker).

A little thought will show that the theoretical coverage for identification of the first and second failure of a four computer redundant set is, essentially, unity. If a computer cannot set itself failed, then the others will set it failed anyway. The crew, through their matrix control panel, has an immediate indication of each computer's opinion of each of the others.

Experience has also confirmed this degree of coverage. There have been over 5,000 hours of formal multi-string verification testing in the Shuttle Avionics Integration Laboratory (SAIL), in addition to many hours of vehicle checkout, crew training simulations, etc. The formal testing has included nominal conditions, failure conditions, and stress testing. Although there are no recorded instances of failure to detect a failed computer, careful attention in design, analysis, and hardware testing has been required to insure that single-point failure modes, or any condition that could lead to propagation of a failure indication from one computer to another, have been eliminated. Some of these failure modes have been detected by software analysis or verification and corrected in software, but some have required hardware testing and subsequent changes in hardware.

Of particular concern have been transients created on the synchronization discretes or intercomputer data buses when one computer is intentionally or inadvertently powered off. Depending upon the timing of such transients, the exposure would exist that other computers could perceive these signals differently, with the result that a healthy computer could be voted out of the set.
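
The input-handling rules described earlier (a reasonableness check, shared input error conditions, and selection by sensor count) can be sketched as follows. This is an illustration under stated assumptions, not the Shuttle code: the function name `select_input`, the limits `lo` and `hi`, and the `flagged` set are hypothetical, and the paper does not say how a single surviving sensor is handled, so returning it unchanged is an assumption.

```python
from statistics import mean


def select_input(readings, lo, hi, flagged=frozenset()):
    """Pick one value from redundant sensor readings.

    readings: dict of sensor id -> value, as received by every GPC.
    lo, hi:   reasonableness limits (hypothetical; the paper gives no numbers).
    flagged:  sensor ids that any GPC has reported as having an input error,
              shared between computers so that all of them drop the same inputs.
    """
    good = sorted(
        value
        for sensor, value in readings.items()
        if sensor not in flagged and lo <= value <= hi
    )
    n = len(good)
    if n == 0:
        raise ValueError("no usable sensor input")
    if n == 1:
        return good[0]          # single survivor: assumed, not covered by the text
    if n == 2:
        return mean(good)       # two sensors: an average is taken
    if n == 3:
        return good[1]          # three sensors: the middle value is used
    if n == 4:
        return mean(good[1:3])  # four sensors: average of the two middle values
    raise ValueError("more than four sensors is outside the described protocol")


if __name__ == "__main__":
    readings = {"a": 10.0, "b": 10.2, "c": 10.4, "d": 99.0}
    # Sensor "d" fails the reasonableness check, so the middle of three is used.
    print(select_input(readings, lo=0.0, hi=50.0))
    # An error on sensor "a" reported by another GPC leaves two inputs to average.
    print(select_input(readings, lo=0.0, hi=50.0, flagged={"a"}))
```

Because every computer applies the same function to the same readings and the same shared set of flagged sensors, each arrives at the same value, which is the point of transmitting input error conditions between computers.
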
All known exposures of this kind were, fortunately, identified early enough for corrective action to be taken before the first flight.

From experience with a number of very high reliability programs requiring redundancy, one can conclude that the optimum approach is governed by the specific requirements of the application. In a high performance system, if transient interruption of processing cannot be tolerated, the redundancy must be of a dynamic voting type. Although more failures can be tolerated if the processor is partitioned into smaller modules, the complexity, and hence the failure rate, of the voters rises rapidly. In addition, the cross-strapping required at a lower partitioning level would cause a major increase in the single point failure exposure. The Space Shuttle approach -- redundancy at the GPC level -- though tolerant of only two failures, has been implemented with minimum additional
hardware, has minimized the problem areas of single point failure exposure, and provided a system where the software can be effectively utilized to manage the redundancy of the sensors and data buses as well as the data processing system. Tolerance to transient conditions on inputs can be programmed in at the discretion of the system designers.

With five computers on board, and the ability to tolerate two failures, but requiring a negligibly small probability of the loss of three, the need to achieve good inherent reliability is not eliminated, since the rate of separate failures is increased by a factor of five over that of a single processor. In actual fact, a mean-time-between-failure of over 4,000 hours for each GPC has been achieved to date, and is continuing to rise.

The effectiveness of the redundancy techniques herein described was put to the acid test of an actual flight failure once during the earlier Approach and Landing Test phase of the Space Shuttle program. During one of the five flights, a computer experienced a transient failure at the time of separation of the Shuttle from the parent aircraft. No transient condition appeared in the control of the vehicle. The only awareness that the pilots had of the malfunction was the indication on their cockpit display.

Acknowledgement. The work of Mr. Robert E. Poupard, of IBM, in the conceptual design of the Shuttle data processing system, and valuable counsel in the preparation of this paper, is gratefully acknowledged.

[Figure 1: block diagram of the five GPCs (GPC 1 to GPC 5, each a CPU paired with an IOP) and the serial data buses connecting them to the mass memory units, CRT displays, payload and launch interfaces, flight instrumentation, and flight-critical sensors and controls. Figure notes: buses are distributed by function for different levels of redundancy; the Intercomputer, Display, and Flight Critical buses are the critical buses.]

Fig. 1 - System Configuration


[Figure 2: redundant system operation, showing inputs and outputs. Legend: C = command channel for computer 1; L = listen channel for computer 1; X = no transaction for computer 1; a dummy transaction is used when only three sensors exist. Figure notes: inputs and outputs may or may not occur over the same physical channels, and functional separation is shown for clarity only; task synchronization of the computers is required.]

Fig. 2 - Redundant System Operation

[Figure 3: hard logic for identifying a failed computer, shown for computer 1. Legible labels include failure votes from the other computers, discrete receivers, the computer fail signal to the crew panel, GPC status and failed GPC indications, master and power-on reset, IOP fail reset, IOP reset, and termination control logic. Figure note: the logic is independent of the hardware and software that failed and caused a noncompare.]

Fig. 3 - Hard Logic for Identifying Failed Computer


Discussion to Paper CS 7.2

P. Kant (Netherlands): What is the reason for the asymmetric solar panel layout?

T. Suzuki (Japan): The other side of the pitch axis is used for the radiation cooling of VTIR and MSR, which have IR detectors.

Discussion to Paper CS 7.4

H. Wedde (Federal Republic of Germany): Who controls the intercommunication between the computers? Is it conceptually possible that more than two computers are thrown out (because of "false" behaviour) as a result of a very special intercommunication process?

J. T. Caulfield (USA): Each computer controls communication on one of the five intercommunication buses, with all others receiving (listening to) the transmitted message. Thus each computer can transmit any detected error or fault condition to each of the others. It is conceptually possible, but of extremely remote (negligible) possibility, since it requires failure of a specific component occurring during transition and in time coincidence with two other computers' processing. It is not conceptually possible to prove that there are no other possible conditions, although extensive analysis has not uncovered any that have not already been corrected by design change.

J. W. Hursh (USA): During what percentage of cycle time is an initialization fault possible?

J. T. Caulfield (USA): Approximately 1 in 65. After the launch delay, it was repeated in SAIL on the 74th attempt. After the problem was understood, it was easy to test for the condition, which could only occur upon the first initialization after system turn-on. The problem possibility will be eliminated in the next software release.
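
As a rough plausibility check on the figures in the last reply, the sketch below treats each initialization attempt as an independent trial that hits the fault window with probability 1/65. That independence is an assumption made here for illustration, not something stated in the discussion, and the function names are hypothetical.

```python
P_FAULT = 1 / 65  # fraction of the cycle in which the fault is possible (from the reply)


def prob_first_hit_at_or_after(attempt, p=P_FAULT):
    """Probability that the first occurrence takes at least `attempt` tries."""
    return (1 - p) ** (attempt - 1)


def expected_attempts(p=P_FAULT):
    """Mean number of tries until the first occurrence (geometric distribution)."""
    return 1 / p


if __name__ == "__main__":
    print("expected attempts:", expected_attempts())
    print("P(first hit on attempt 74 or later):", prob_first_hit_at_or_after(74))
```

Under this assumption the fault would be expected after about 65 attempts on average, and needing 74 or more attempts happens roughly one time in three, so reproducing it on the 74th try is consistent with a 1-in-65 window.
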