Redundancy techniques for use in an air traffic control computer


Pergamon Press 1964. Vol. 3, pp. 175-192. Microelectronics and Reliability. Printed in Great Britain.



D. J. CREASEY

Electronic and Electrical Engineering Department, University of Birmingham

Abstract--Perfect reliability has not been obtained using high quality components and rugged circuit design. In data handling systems, such as a computer used for air traffic control, component failures which cause unscheduled maintenance delays cannot be tolerated. In such systems maintenance is shown to be an integral part of reliability. Redundancy is one method which can be used to improve the system reliability to an acceptable level. Present techniques use circuit redundancy and information redundancy: these techniques and their application in an air traffic control computer are discussed.

1. RELIABILITY, A STATISTICAL PARAMETER

In many applications electronic equipment is required to operate for long periods of time without system failure. One such application is that of a computer for use in air traffic control, where failure may cause loss of human life. Such an equipment is called an on-line system. The reliability of an on-line system differs from other applications mainly in respect of the long period of time over which the equipment is expected to function. While it is impossible to predict the exact life of an on-line system, it is possible to obtain a measure of the equipment's reliability.

Reliability is defined as "the probability of performing without failure a specified function under given conditions for a specific period of time" (1). Because reliability is a probability, it must be appreciated that the problem of achieving reliability is a statistical problem. As such the reliability engineer can use statistical methods to predict the reliability of equipment, provided the reliability of each individual part under similar environmental conditions is already known.

Present-day components generally have a very high individual reliability, and to measure this reliability with any high degree of confidence requires that large numbers of test samples be used (2). Because of this factor, precise component reliabilities are best obtained from field experience, and life testing of components over a large period of time by the component manufacturer, or other large organization. It will be assumed that component reliability data is available. The problem is then to apply this data to design equipment, to predict the equipment reliability, and should the reliability be insufficient, to develop techniques to improve the reliability to an acceptable level.

This paper will assume that normal design techniques will not produce equipment which is sufficiently reliable (15), and will review techniques currently being studied which may be used to improve the equipment reliability. These techniques use simple statistical models: a description of these models is given in most standard texts (3). While the paper deals with methods of improving the reliability of an on-line system, the same methods are often applicable to other types of equipment.

2. CIRCUIT REDUNDANCY

Redundancy is one particular method of obtaining reliable circuits using less reliable components. As will be seen later this statement must be qualified because redundant circuits may reduce the overall reliability below the level of individual component reliability. The binomial distribution (3) can be useful in predicting the reliability of a redundant system. Let $p$ be the probability that an individual diode will operate satisfactorily, then $(1-p) = q$ is the probability of failure. Suppose two diodes are used in either a series or parallel connection and that failure of one diode does not affect the reliability of the remaining diode. The binomial expansion of $(p+q)^2$ gives individual terms as follows:

$p^2$ = the probability that both diodes operate satisfactorily
$2pq$ = the probability that one diode fails while the other continues operating
$q^2$ = the probability that both diodes fail.

As a simplification suppose that a short circuit failure of either diode is impossible, and an open circuit failure is the only failure mode possible. Let the diodes be connected in parallel, then the probability of the parallel connection functioning correctly is $P = p^2 + 2p(1-p)$. The term $2p(1-p)$ is included since an open circuit failure of one diode does not cause failure of the pair. Inspection will show that in all cases $P \geq p$ $(0 \leq p \leq 1)$, and increased reliability has resulted, except at $p = 0$ and $p = 1$.

Should the assumption that short circuit failure is impossible be false, the following analysis can be made. Let $r$ = the probability of open circuit failure, and $s$ = the probability of short circuit failure. Inspection of the terms in the expansion of $(p+r+s)^2$ gives the probability of total circuit failure $Q_2 = r^2 + 2s - s^2$. For the original component $Q_1 = r+s$. Let $Q_1/Q_2 = B$ be the improvement factor. The reliability is improved if $B > 1$, which requires $r > s$. The reliability is degraded if $B < 1$, i.e. if $s > r$. The reliability is unaltered if $B = 1$, i.e. if $s = r$.

It may be that $s > r$, in which case redundancy would increase the circuit reliability if the two diodes were placed in series instead of in parallel. Using this type of analysis as a building block many schemes may be devised. For example, suppose four diodes are used to replace a single diode as shown in the Hammock network of Fig. 1(a).

[Fig. 1. Four-diode networks (a) and (b).]

Using the multinomial expansion of $(p+r+s)^4$, and inspecting each individual term, it is found that the probability of total circuit failure is

$$Q_4 = s^2(2-s^2) + r^2(4-4r+r^2),$$

and the improvement factor is

$$B = \frac{s+r}{s^2(2-s^2)+r^2(4-4r+r^2)} \approx \frac{s+r}{2s^2+4r^2}.$$

When $s = r = 0.01$, $B = \frac{100}{3}$, and when $s = r = 0.001$, $B = \frac{1000}{3}$. Should $r \ll s$ then $B \approx \frac{1}{2s}$, but if $r \gg s$, $B \approx \frac{1}{4r}$. This particular configuration is said to be open circuit favourable. However, if the junctions between each pair of diodes are joined as in Fig. 1(b), the improvement factor is

$$B = \frac{r+s}{r^2(2-r^2)+s^2(4-4s+s^2)} \approx \frac{r+s}{2r^2+4s^2}$$

if $r$ and $s$ are both small.

This configuration of diodes is short circuit favourable. The designer should choose between the configurations in Figs. 1(a) and 1(b) depending on which of $r$ or $s$ is greater. The levels at which redundancy can be applied may be broken down as follows:

(1) component level (diodes, resistors, etc.)
(2) logical element level (gates, amplifiers, etc.)
(3) sub-system level (register, adder, etc.)
(4) system level (complete computer).
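The failure probabilities given earlier in this section for the parallel diode pair and for the four-diode networks of Figs. 1(a) and 1(b) can be checked numerically. The following is a minimal sketch, not part of the original paper: the function names are illustrative assumptions, and the formulas are simply those quoted in the text, with $r$ the open circuit and $s$ the short circuit failure probability.

```python
def parallel_pair_failure(r, s):
    """Q2 for two diodes in parallel: both open, or either one shorted."""
    return r**2 + 2 * s - s**2


def network_a_failure(r, s):
    """Q4 for Fig. 1(a), as quoted in the text."""
    return s**2 * (2 - s**2) + r**2 * (4 - 4 * r + r**2)


def network_b_failure(r, s):
    """Failure probability for Fig. 1(b), as quoted in the text."""
    return r**2 * (2 - r**2) + s**2 * (4 - 4 * s + s**2)


def improvement(r, s, redundant_failure):
    """Improvement factor B = Q1 / Q_redundant, with Q1 = r + s."""
    return (r + s) / redundant_failure(r, s)


if __name__ == "__main__":
    for r, s in [(0.01, 0.01), (0.001, 0.001), (0.01, 0.001), (0.001, 0.01)]:
        print(
            f"r={r} s={s} "
            f"B_pair={improvement(r, s, parallel_pair_failure):.3f} "
            f"B_a={improvement(r, s, network_a_failure):.2f} "
            f"B_b={improvement(r, s, network_b_failure):.2f}"
        )
```

Running the loop for unequal $r$ and $s$ shows how strongly the choice between the two networks depends on which failure mode dominates.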

If the redundant circuit configuration is such that the overall reliability is greater than the reliability of the redundant parts, it can be shown that the greatest benefit can be obtained if redundancy is applied at the lowest possible level. For example, consider a shift register composed of $n$ identical stages, each with a failure probability $q$. Suppose that each individual stage is duplicated in such a manner that a failed stage will be automatically switched from service, and the standby stage switched into service to maintain the register in a working condition. In this case the probability of total register failure is $1-(1-q^2)^n$. If the complete register had been duplicated the resulting failure probability would be $[1-(1-q)^n]^2$. When $q$ is small the approximate reduction in probability of failure when duplicating at the next lower level is

$$\frac{1-\left(1-nq^2+O(q^4)\right)}{\left[1-\left(1-nq+O(q^2)\right)\right]^2} \approx \frac{1}{n}.$$

In the example of the shift register if $q = 0.01$ and $n = 5$, the probability of failure with each individual stage duplicated is $5 \times 10^{-4}$, whereas the probability of failure when the complete register is duplicated is $25 \times 10^{-4}$. Generally there is little difference in cost when applying the same amount of redundancy at different levels. For example, in the shift register example already considered, there were exactly the same number of logical elements (stages) in each system. The additional cost in the lower level of redundancy would be the provision of additional switches at the individual stages. A further point is that the switching arrangement has been considered perfect, which is impossible.

In an on-line system, such as a computer used for air traffic control, the computer must work continuously, and faults must not degrade the system performance. As such the switching of faulty elements must be fully automatic, and must occur immediately a fault occurs, otherwise data may be lost. Redundancy may be described either as active or passive. Active redundancy requires change-over switches to switch out faulty units and switch in replacement units. Passive redundancy uses reserve units which are connected in such a way that failure of a single unit (or sometimes a number of units) does not affect the system operation. At first sight passive redundancy would seem to present a better solution to the on-line system problem, because there is no time lag to correct the fault.

3. PASSIVE CIRCUIT REDUNDANCY

Three schemes will be investigated which introduce passive redundancy at the logic element level. These three schemes are duplication (5), quadded logic (6), and majority decision logic. Essentially duplication consists of using two logic elements in parallel, each of which is fed into an OR gate before passing the information on to the next pair of logic elements [see Fig. 2(a)]. A refinement is to duplicate the OR gates as in Fig. 2(b). Unfortunately only errors which result in zero output from the OR gate can be corrected downstream from the fault by good signals from the duplicate of the faulty unit. Secondly, if a fault occurs in the OR gate automatic correction may not result.

[Fig. 2. (a) Logic elements feeding OR gates; (b) logic elements with duplicated OR gates.]

If the logic unit appears in quadruplicate (6) (see Fig. 3), and interconnection between rows of logic units is made in a special manner, then automatic recovery is brought about from all single errors. Automatic recovery is also possible from multiple faults, provided that the faults are not close to each other in the logical structure. As with duplication, this quadded logic corrects the fault downstream of the unit which caused it, using good signals from the neighbours of the faulty unit. Correction may not be at the next logic stage; indeed the following stage may propagate the fault over two rows of logic units before recovery is made. Fig. 3 illustrates how a single error is corrected in a typical quadded logic chain.

Majority decision logic is similar to the previous schemes in that logic units are arranged in redundant rows. The output of each logic unit, or of a row of logic units, is fed into a voting circuit along with the outputs of the other logic units in the same level. Should a fault occur in a logic unit, the fault is automatically corrected by the voting circuit taking a decision as to the correct answer. The decision is taken on a majority vote of the input conditions of the voting circuit. In order that a correct decision can be made, the number of rows must be odd. Normally three rows are used, the cost of five or more rows being very large. Figure 4(a) illustrates the use of triplicated majority decision logic. The voting circuit itself will of course be subject to failure so the voting circuit may be triplicated as illustrated in Fig. 4(b). A warning should be given that the configuration of Fig. 4(b) does not correct all single errors in a given logic level. Should a voting circuit fail in one row of one level, and a logic unit fail in a different row in the next level, then an error can be propagated. This is illustrated in Fig. 4(c). Prediction of reliability in this case requires that the complete triplicated logic system be analysed.

[Fig. 3. Quadded logic chain.]

[Fig. 4. (a), (b), (c): triplicated majority decision logic.]

An optimum case will obviously result if the voting circuits are regarded as perfect. In this case the reliability of the redundant block of Fig. 4(a) can be predicted by expanding $(p+q)^3$. Since two logic blocks must fail to cause an error the reliability is $p^3+3p^2q = 3p^2-2p^3$. For triplicated majority decision logic to improve the reliability, the reliability of the individual units must be greater than 0.5.

Much current work in reliability is based on the exponential failure law. This is a special case of the Poisson distribution which is itself an approximation to the binomial distribution (3), and it assumes a constant failure rate. The exponential failure law states that the probability of failure is given by $q = 1-e^{-rt}$ where $r$ = failure rate. Alternatively this may be written in the form $p = e^{-rt}$ = probability of zero failures in time $t$. When components are connected in such a way that failure of a single component causes a system failure, the components are functionally in series, and the product rule can be applied (7). Thus the system reliability is given by

$$P = p_1 p_2 \cdots p_n = e^{-(r_1+r_2+\cdots+r_n)t}$$

where $p_j$ is the reliability of the $j$th component and $r_j$ is the failure rate of the $j$th component. In general the reliability of any unit can be written as $p = e^{-rt}$ where $r = r_1+r_2+\cdots+r_n$ = the sum of the failure rates of the elements within the unit.

The fact that reliability decreases with time has a marked effect on the use of redundancy. Taking, for example, the expressions derived for the majority decision logic using perfect voting circuits, then substituting $p = e^{-rt}$, the system reliability becomes $P = 3e^{-2rt}-2e^{-3rt}$. This function is plotted in Fig. 5 against $rt$. The slope of the function $P$ with respect to $rt$ is $-6e^{-2rt}+6e^{-3rt}$. With $rt$ very small this is approximately $-6rt$ and the system reliability is given by $P \approx 1-3(rt)^2$.
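The behaviour of triplicated majority decision logic with perfect voting circuits under the exponential failure law can be tabulated directly. The sketch below is an illustration rather than anything from the original paper; the function names are assumptions. It evaluates $p = e^{-rt}$ and $P = 3p^2-2p^3$ and marks the point $p = 0.5$ (that is, $rt = \ln 2$) at which the two reliabilities are equal.

```python
import math


def unit_reliability(rt):
    """Exponential failure law: p = exp(-rt)."""
    return math.exp(-rt)


def tmr_reliability(p):
    """Triplicated majority logic with perfect voters: P = 3p^2 - 2p^3."""
    return 3 * p**2 - 2 * p**3


if __name__ == "__main__":
    crossover = math.log(2)  # rt at which p = 0.5, so that P = p
    for rt in (0.01, 0.1, 0.5, crossover, 1.0, 2.0):
        p = unit_reliability(rt)
        print(f"rt={rt:.3f}  p={p:.4f}  P={tmr_reliability(p):.4f}")
```

For $rt$ below $\ln 2$ the redundant block is the more reliable of the two, and above it the single unit is.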

[Fig. 5. Reliability plotted against hazard × time, $rt$.]

As $t$ increases the redundant system becomes less reliable, and eventually becomes less reliable than the non-redundant system. This is illustrated in Fig. 5 where the function $p = e^{-rt}$ is drawn for comparison with $P = 3e^{-2rt}-2e^{-3rt}$.

The reliability of an on-line system would have to be very high. Unless super-components can be found in which the failure rate is very small so that during the equipment life (20 years maybe) the product $rt$ is always minutely small, then redundancy by itself will not provide the necessary reliability. It would seem that some other criterion must be used to judge the usefulness of an on-line system. If a failure occurs in a redundant system, the level of redundancy associated with the failure is lost. Furthermore, since the failure probability density, $re^{-rt}$, is a maximum at $t = 0$, most failures can be expected immediately the equipment is installed. If the mean time between failures, $\frac{1}{r}$, is 100 hr the reliability of non-redundant equipment is only 0.368 after 100 hr has elapsed. This assumes that no maintenance action is performed; in some applications, such as submarine repeaters and satellite electronics, maintenance is virtually impossible, but in computer applications maintenance must be considered an integral part of reliability. It will be shown later that the addition of maintenance action can improve the reliability in the above example from 0.368 to 0.990. Furthermore, without maintenance the reliability will continue to fall, whereas with maintenance 0.990 is a minimum value.

4. AVAILABILITY

It has been shown that redundancy by itself cannot give the reliability required for an on-line system. This is because the reliability of the individual components decreases with time according to the exponential failure law. It should be possible to allow maintenance action to be taken so that failed parts can be renewed, and for the system to continue functioning after the maintenance has been completed. It will be assumed that the Poisson distribution can be applied to the repair of the system as well as to the system failure (11). The problems of maintainability and availability have been investigated by Calabro (7), but he places a time restraint on the maintenance action. This restraint is not required in an on-line system, since a fault must be cleared irrespective of any time limit. Assuming a non-redundant system the following analysis can be made:

The probability density function for the mean time to failure is

$$\frac{dq}{dt} = \frac{d}{dt}\left(1-e^{-rt}\right) = re^{-rt}.$$

Let $w$ be the mean repair rate, then the Poisson distribution gives $\frac{(wt)^f}{f!}e^{-wt}$ = the probability of performing $f$ maintenance actions in time $t$. If a repair is being carried out, the probability that further failure will occur before the repair is completed is the integral of the probability density function for the mean time to failure, weighted by the probability that the repair has not been completed, i.e.

$$\int_0^t re^{-rt}\,e^{-wt}\,dt = \frac{r}{r+w}\left(1-e^{-(r+w)t}\right).$$

The system reliability is then

$$1-\frac{r}{r+w}\left(1-e^{-(r+w)t}\right) \quad\text{or}\quad \frac{w}{r+w}+\frac{r}{r+w}\,e^{-(r+w)t}.$$

This formula is that derived by Barlow and Hunter for a one-unit repairable system, but they used the more general method of considering the system as a stationary Markov process. Because maintenance is performed, the system reliability can be called the system availability to distinguish it from the system reliability when no maintenance is called for. Availability can now be defined as the probability that at a given instant in time the system is working, given that maintenance action can be performed in accordance with prescribed procedures. As $(r+w)t$ increases, the availability approaches

$$\frac{w}{r+w} = \frac{\frac{1}{r}}{\frac{1}{r}+\frac{1}{w}} = \frac{\text{mean time to failure}}{\text{mean time to failure} + \text{mean time to repair}}.$$

This ratio has been defined as the UP TIME RATIO (7). The minimum value of availability occurs when $t$ approaches infinity, and this value is in fact the up time ratio.

The results obtained in the analysis are sufficient to show how maintenance action can become an integral part of the reliability model. For example, suppose that an equipment has a mean time to failure of 100 hr, and a mean repair time of 1 hr, then the minimum availability is 0.990. Without maintenance action the reliability would be only 0.368 after 100 hr service.

The inclusion of redundancy should enable maintenance action to be performed while the equipment continues running. Although the mean time to repair may be increased, the system availability should not be reduced. An analysis can be made of redundant repairable systems using equations already derived. Consider the triplicated majority decision logic system previously discussed. The reliability of this system was given by $3p^2-2p^3$. Replacing $p$ by the availability of a one-unit system, the availability of the repairable redundant system becomes:

$$3\left(\frac{w}{r+w}+\frac{r}{r+w}e^{-(r+w)t}\right)^2 - 2\left(\frac{w}{r+w}+\frac{r}{r+w}e^{-(r+w)t}\right)^3$$

$$= \frac{w^3+3w^2r+6r^2w\,e^{-(r+w)t}-3r^2(w-r)\,e^{-2(r+w)t}-2r^3e^{-3(r+w)t}}{(r+w)^3}.$$

This approaches $\frac{w^3+3w^2r}{(r+w)^3}$ as $t$ becomes large. As in the example considered previously, let the mean time to failure be 100 hr, but let the repair time be increased to 2 hr because redundancy has made detection of an error more difficult. Under these conditions the system availability becomes $\frac{1.06}{(1.02)^3}$, or 0.9989.

A distinction can be made between the up time ratio and the reliability of a system. The fraction of time during an interval 0 to $T$ that the system is working is given by the mean value of the availability function during the time interval. For a one-unit system this becomes

$$\frac{1}{T}\int_0^T\left[\frac{w}{r+w}+\frac{r}{r+w}e^{-(r+w)t}\right]dt = \frac{w}{r+w}+\frac{r}{T(r+w)^2}\left(1-e^{-(r+w)T}\right).$$

This is the up time ratio of the system at time $T$. It might be said that the availability represents the theoretical reliability calculated from predetermined values of $r$ and $w$. The up time ratio represents the empirical reliability calculated from results of the failure and repair of systems which obey the availability function formula. Thus the error in the observed reliability is the difference between the up time ratio and the availability, i.e.

$$\frac{r}{r+w}\left\{\frac{1}{T(r+w)}-\frac{e^{-(r+w)T}}{T(r+w)}-e^{-(r+w)T}\right\}.$$

As $T$ approaches infinity the transient conditions associated with both functions disappear and the steady state condition $\left(\frac{w}{r+w}\right)$ is equal in both cases. This is the limiting value of the system availability as previously shown. The up time ratio for redundant repairable systems can be similarly calculated.

5. CODE REDUNDANCY

If applied correctly circuit redundancy can increase equipment reliability. A further method of increasing the equipment reliability is that of providing redundant information, or coding. The provision of redundant information can be done in such a way that errors can be detected and even corrected. The solution of the reliability problem by coding stems from the original work of Hamming (12) and Shannon (13).

Suppose that a simple binary number has to be transmitted; extra digits may be added so that an error in one or more transmitted digits can be detected and corrected. The simplest codes are those which only provide for single error detection. In binary notation the decimal number 9 becomes 1001; a parity or check digit may be added to the binary number so that the number of 1's in the complete code will be either odd or even. Thus the decimal number 9 can be coded 10011 when an odd parity check is used, or 10010 when an even parity check is used. In each case the fifth digit is the parity check digit. An error in any single digit will cause the parity check to fail, indicating that an error is present, although the position of the faulty digit is unknown. This particular code will not detect multiple errors.

If blocks of information are to be transmitted an extension of this simple type of detecting code will enable a single error to be detected and corrected. This is illustrated in Table 1, where an even parity check is used.

Table 1

Information block      Even parity check on rows
0 1 0 1 0 1            1
1 1 0 1 0 0            1
1 0 1 1 1 0            0
0 0 0 1 1 0            0
1 1 0 0 1 0            1

1 1 1 0 1 1            Even parity check on columns.

A single error in any digit will cause the parity check to fail in one row and one column, thus not only will the presence of an error be known but the position of the faulty digit will be indicated, and its state can be changed. Multiple errors inside the information block can be detected and corrected provided there is no more than a single error in any one row or one column. The presence of a detected error in a row only can indicate that it is the parity digit which is at fault. This type of coding has been used successfully with magnetic tape storage units.

Redundant coding can be advantageous over redundant circuitry because the amount of redundancy can be less. The figure of merit very often used in redundant coding is $R = \frac{n}{m}$, where $n$ = total number of digits in the redundant code, and $m$ = number of information digits. This figure of merit should not be used to compare the relative merits of redundant codes and redundant circuits because redundant coding requires additional circuitry to encode and decode. The encoding and decoding circuits must be made secure by circuit means.
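The row and column parity scheme of Table 1 can be sketched in a few lines. The block below is only an illustration under stated assumptions: the helper names are hypothetical, the information block is the one tabulated in Table 1, and the correction routine assumes that at most one information digit is in error.

```python
# Information block from Table 1 (five rows of six information digits).
INFO = [
    [0, 1, 0, 1, 0, 1],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 1, 1, 1, 0],
    [0, 0, 0, 1, 1, 0],
    [1, 1, 0, 0, 1, 0],
]


def row_parity(block):
    """Even parity check digit for each row."""
    return [sum(row) % 2 for row in block]


def column_parity(block):
    """Even parity check digit for each column."""
    return [sum(col) % 2 for col in zip(*block)]


def correct_single_error(block, row_checks, col_checks):
    """Locate failing row and column checks; flip the digit where they cross."""
    bad_rows = [i for i, row in enumerate(block) if sum(row) % 2 != row_checks[i]]
    bad_cols = [j for j, col in enumerate(zip(*block)) if sum(col) % 2 != col_checks[j]]
    fixed = [row[:] for row in block]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        fixed[bad_rows[0]][bad_cols[0]] ^= 1
    return fixed, bad_rows, bad_cols


if __name__ == "__main__":
    rows, cols = row_parity(INFO), column_parity(INFO)
    received = [row[:] for row in INFO]
    received[2][3] ^= 1  # inject a single error
    fixed, bad_rows, bad_cols = correct_single_error(received, rows, cols)
    print("row checks:", rows, "column checks:", cols)
    print("failing row:", bad_rows, "failing column:", bad_cols, "restored:", fixed == INFO)
```

The sketch leaves the block unchanged when the failing checks do not identify exactly one row and one column, which covers the case in the text where only a row check fails and the parity digit itself is suspected.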

Simple geometric models

Let each digit in a three digit code represent one direction in the three dimensional system of coordinates as shown in Fig. 6(a). Consider, for example, the decimal number five, which is represented in a binary three digit code as 101. 101 means a zero component in the $y$ direction and unit components in the $x$ and $z$ directions. So it is clear that the number five corresponds to the coordinates of point "a" in Fig. 6(a), which can be labelled 101. A vector drawn between the origin and a will be called the code vector 101. Similarly all eight points corresponding to the eight numbers of the three digit binary code can be found, and vectors drawn from the origin to those points can be labelled in the same manner. As is seen in Fig. 6(b) these eight points form a unit cube, with vertices representing a three digit binary code.

[Fig. 6. (a) Three dimensional system of coordinates $x$, $y$, $z$; (b) unit cube with vertices labelled by the three digit binary code.]

Suppose only two digits are used as information digits, and the third is a redundant digit, which can be used as an even parity check digit. The resulting code is tabulated in Table 1(a).

Table 1(a)

Decimal number    1st digit    2nd digit    Parity digit
0                 0            0            0
1                 0            1            1
2                 1            0            1
3                 1            1            0

An error in a single digit can be detected but not corrected. For example, should 100 be read, an error will be indicated because there is no code vector 100 in Table 1(a). The error could be in any of the three digits and could result from an error in a single digit of the code vectors 000, 110, or 101.

The same conclusions can be drawn from an examination of Fig. 6(b). Each vertex of the unit cube is connected with three neighbouring vertices by unit distances along the cube sides. The difference between a vertex and each of these three neighbouring vertices is in one digit only. The code vector can be transferred to each of the three neighbouring vertices by an error in one digit with equal probability. An error in two digits will transfer the same code vector to three more vertices which are at a distance of two units along the cube sides. Finally the same code vector will be transferred to the only remaining vertex, which is at a distance of three units along the cube sides, by an error in three digits.

The vertices representing the code may be selected in such a way that the minimum distance between these vertices is two units along the cube sides. These vertices may be chosen as 000, 101, 011 and 110 [see Fig. 6(b)]. Each one of the four remaining non-coded vertices (100, 010, 001, 111) is separated by unit distance from three of the code vertices. So an error in one digit results in transferring a code vector to a vertex which is not part of the code, and this will be the indication of error. Of course as each uncoded vertex is separated by unit distance from three coded vertices, it is impossible to decide which of the three digits is in error.

To detect a single error in a code it can be seen that the distance between the code vertices should be at least two units. If the error has to be corrected the code vertices must be at least three units apart. For example, let 000 and 111 be chosen as code vertices; these vertices are separated by three units along the sides of the cube. Suppose that one digit in 000 is in error. The code vector 000 will be transformed on to one of three non-coded vertices. These three vertices are all at a distance of one unit from 000, whereas they are two units distant from 111. In order that the code vector 111 should be transformed on to the same vertex would require an error in two digits. The probability of error in one out of three digits is $3p^2q$, and the probability of error in two out of three digits is $3pq^2$, where $q$ = probability of error in a single uncoded digit and $p = 1-q$. Normally $p \gg q$, so if a received vector is only one digit in error from 000 it can be assumed that the correct code vector is 000. This code is equivalent to the majority decision logic discussed previously. The encoding of this triple redundant code would be very simple. A method of decoding the triple redundant code, other than using majority decision gates, will be considered later.

For codes with $n$-dimensional vectors, an $n$-dimensional geometric model may be used. The minimum distance between code vectors in a given code, the Hamming distance, is the minimum number of digits in one code vector which must be altered to obtain any other code vector in the same code. Thus if $v_1$ and $v_2$ are code vectors, and $v_1 \oplus v_2$ signifies addition to modulo two term by term, a 1 will only occur in $v_1 \oplus v_2$ where respective digits in $v_1$ and $v_2$ differ. The weight of the code vectors $v_1$ and $v_2$ is equal to the number of 1's in $v_1 \oplus v_2$. The Hamming distance or code weight determines how the code can be used. Table 2 illustrates this:

Table 2

Hamming distance    Code use
1                   Uniqueness.
2                   Single error detection.
3                   (i) single error correction
                    (ii) double error detection.
4                   (i) single error correction plus double error detection
                    (ii) triple error detection.
5                   (i) double error correction
                    (ii) single error correction plus triple error detection
                    (iii) quadruple error detection.
d                   d - 1 = 2E + D, where E is the number of errors to be corrected, and D the number of additional errors to be detected.
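The Hamming distance of the small codes discussed above, and the general relation $d - 1 = 2E + D$ from Table 2, can be explored with a short script. This is a hedged sketch: the function names are illustrative assumptions, and code vectors are represented simply as strings of binary digits.

```python
from itertools import combinations


def distance(v1, v2):
    """Number of digit positions in which two code vectors differ."""
    return sum(a != b for a, b in zip(v1, v2))


def minimum_distance(code):
    """Hamming distance of a code: the smallest distance between any two vectors."""
    return min(distance(a, b) for a, b in combinations(code, 2))


def capabilities(d):
    """All (E, D) pairs satisfying d - 1 = 2E + D."""
    return [(e, d - 1 - 2 * e) for e in range((d - 1) // 2 + 1)]


if __name__ == "__main__":
    parity_code = ["000", "011", "101", "110"]  # Table 1(a)
    triple_code = ["000", "111"]                # triple redundant code
    for name, code in (("parity", parity_code), ("triple", triple_code)):
        d = minimum_distance(code)
        print(name, "distance", d, "(E, D) options", capabilities(d))
```

Each `(E, D)` pair reads as "correct E errors and detect D further errors", so for distance 5 the three pairs correspond to the three entries listed in Table 2.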

Sbzgle error correcting codes Let a code contain vectors with n digits, m of which are information digits. T h e remaining h digits are the check or parity digits, and allow k parity checks to be applied to the received code vector. These parity checks are applied not only to the check digits but to a combination of check and information digits, in such a way that when all the checks have been made, all n digits in the vector have been included. Normally even parity checks are made, i.e. the addition by modulo two over the selected digits should equal 0 when there is no error, and 1 when there is a single error. By choosing the checks in a specific order, the observed sequence of k parity checks give a binary number, the check number, equal to the position of any single error in the original code vector. This will be illustrated later by examples. A zero value of check n u m b e r indicates that no error is present, thus the k digits in the check n u m b e r must describe n + l different things. The n u m b e r of separate items which can be represented by k digits is 2k: thus 2~>~ n + l . Since n = m + k this may be written 2n 2 m ~< - n+l" There exists a n u m b e r of codes which give a m i n i m u m n for a certain m: these codes are known as m i n i m u m redundancy codes since the redunn dancy - is a m i n i m u m for a given m. T h e values 7n of n, m, k and R for m i n i m u m redundancv codes are listed in Table 3. It would seem generally that an increase in n and m brings a reduction in R. However this leads to an increase in fault probability because these codes will only correct a single error. As n increases

184

D. J. CREASES"

Table 3 n

m

k

R

1 2 3 4 5 6 7 8

2 3 3 3 4 4 4 4

3 25 2 1-75 1.S 1.67 1.57 1.50

13

9

4

1.44

14

10

4

1.4

3 5 6 7 9 10 11 12

etc.

As n increases the probability of two digits being in error also increases. Suppose the probability of a digit being in error is q, where as before p + q = 1. The reliability of a single error correcting code containing n digits in each vector is

$$p^n + n p^{n-1} q = 1 - \frac{n(n-1)}{2} q^2 + O(q^3).$$

If q is small the probability of failure is approximately

$$\frac{n(n-1)}{2} q^2$$

which increases with n. The encoding and decoding equipments must be reliable so that the overall system reliability is improved. Because these equipments and the code are functionally in series the product rule applies (7), and the overall reliability becomes

$$u\,(p^n + n p^{n-1} q) = u\,(n p^{n-1} - (n-1) p^n)$$

where u is the reliability of the encoding and decoding equipments. Since the digits in the non-redundant code are also functionally in series, an increase in overall system reliability will result if the condition

$$u\,(n p^{n-1} - (n-1) p^n) \ge p^m$$

holds. Thus the reliability of the encoding and decoding equipments must satisfy

$$u \ge \frac{p^m}{n p^{n-1} - (n-1) p^n}$$

or

$$u \ge \frac{1}{1 + mq - O(q^2)}.$$

If q is small, to a first order approximation, the minimum reliability of the encoding and decoding equipments must be greater than $1 - mq \approx p^m$. Unfortunately as n increases the encoding and decoding equipments become complex and the condition for u becomes increasingly more difficult to satisfy. Thus in general the size of the code is limited by the reliability of the encoding and decoding equipment, and if the probability of error in a single digit is small, the probability of error in the encoding and decoding equipments must be less than the error probability of the non-redundant code.

Example of decoding procedure

The triple redundant code will be used as an example. Suppose the digits in the code vector are $a_1$, $a_2$ and $a_3$, where $a_1$ and $a_2$ are the parity digits and $a_3$ the information digit. The even parity checks $a_1 \oplus a_3$ and $a_2 \oplus a_3$ will be used. Thus when $a_3 = 1$, $a_1 = a_2 = 1$ and similarly when $a_3 = 0$, $a_1 = a_2 = 0$. There will be a total of eight possible received combinations of the three digits which have to be decoded. These combinations together with the check processes are shown in Table 4.

Table 4

 a1 a2 a3    a2 ⊕ a3    a1 ⊕ a3    Binary check number    Digit in error
  0  0  0       0          0               00                   0
  1  0  0       0          1               01                   1
  0  1  0       1          0               10                   2
  0  0  1       1          1               11                   3
  1  1  1       0          0               00                   0
  0  1  1       0          1               01                   1
  1  0  1       1          0               10                   2
  1  1  0       1          1               11                   3

The binary check number is the binary number resulting from the parity checks $a_2 \oplus a_3$, $a_1 \oplus a_3$. If it is assumed that only a single erroneous digit can occur, the check number represents the position of the digit in error.
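The decoding of the triple redundant code can be sketched in a few lines of Python. This is an illustration of the two parity checks described above, not the author's circuitry; the function name `decode_triple` is hypothetical.

```python
from itertools import product


def decode_triple(a1, a2, a3):
    """Decode one received vector of the triple redundant code.

    a1 and a2 are the parity digits, a3 is the information digit.
    Returns (check_number, corrected_information_digit).
    """
    high = a2 ^ a3                  # even parity check over a2 and a3
    low = a1 ^ a3                   # even parity check over a1 and a3
    check_number = 2 * high + low   # position of a single error, 0 = no error
    if check_number == 3:           # the information digit itself is in error
        a3 ^= 1
    return check_number, a3


# List all eight possible received combinations with their check numbers.
for a1, a2, a3 in product((0, 1), repeat=3):
    check, info = decode_triple(a1, a2, a3)
    print(a1, a2, a3, format(check, "02b"), check, info)
```

Only a check number of 3 changes the released digit, since errors in positions 1 and 2 affect parity digits, which are discarded after decoding.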

The construction of a single error correcting code

Because the check number should give the position of any error in the code vector, any position when represented by its binary number which has a 1 in its jth position from the right will cause the jth parity check to fail. Thus Table 5 can be constructed to give the positions over which the k parity checks are to be made. For example, in binary notation the decimal numbers 2, 3, 6, 7, 10, 11, 14, etc., all have a 1 in the second position from the right. These are the positions over which the second parity check is made.

Table 5

 Check    Positions over which check is made
   1      1  3  5  7  9 11 13 15 17 ...
   2      2  3  6  7 10 11 14 15 18 ...
   3      4  5  6  7 12 13 14 15 20 ...
  etc.

The position of the parity digits can be chosen so that they are independent of each other. In the encoding process if position 1 is made a parity digit, this digit will only occur in the first parity check. Similarly positions 2, 4, 8, 16, 32, etc., can be used as independent parity digits. Using the minimum redundancy code n = 6, m = 3 the decimal number 3 can be encoded $a_1 a_2 0 a_4 1 1$ where $a_1$, $a_2$ and $a_4$ are the parity digits which have to be determined. From Table 5 the first parity check is made over positions 1, 3, 5, 7 ... giving

$$a_1 \oplus 0 \oplus 1 = 0, \quad \text{i.e. } a_1 = 1.$$

The second parity check over positions 2, 3 and 6 gives

$$a_2 \oplus 0 \oplus 1 = 0, \quad \text{i.e. } a_2 = 1.$$

Similarly the third parity check over positions 4, 5 and 6 gives

$$a_4 \oplus 1 \oplus 1 = 0, \quad \text{i.e. } a_4 = 0.$$

Thus the encoded digital form of the decimal number 3 is 110011. Suppose that in transmission the third digit is changed from 0 to 1. Without coding this would change the decimal number 3 to 7. However by use of coding, the parity checks will not all equal 0:

$$a_1 \oplus a_3 \oplus a_5 = 1 \oplus 1 \oplus 1 = 1$$
$$a_2 \oplus a_3 \oplus a_6 = 1 \oplus 1 \oplus 1 = 1$$
$$a_4 \oplus a_5 \oplus a_6 = 0 \oplus 1 \oplus 1 = 0.$$

Writing these resulting digits from right to left as they occur gives the check number 011, which indicates that the third digit is incorrect. Its state can be changed from 1 to 0 and the parity digits neglected, giving the decoded binary number 011.

Table 6. The completely constructed Hamming (6, 3) code

 Decimal value    a1   a2   a3   a4   a5   a6
       0           0    0    0    0    0    0
       1           0    1    0    1    0    1
       2           1    0    0    1    1    0
       3           1    1    0    0    1    1
       4           1    1    1    0    0    0
       5           1    0    1    1    0    1
       6           0    1    1    1    1    0
       7           0    0    1    0    1    1

Mathematical representation of the encoding and decoding process

If a code vector v is transmitted, and a single error occurs, the received vector differs from v in one digit. Suppose e is a vector which has all zero digits except in the position where v is in error. The received vector is then $v \oplus e$. A convenient method of representing the parity checks is to establish a matrix [H] in such a way that $v[H]^T = 0$ where $[H]^T$ is the transpose of [H]. Thus if the received vector is $v \oplus e$ the parity check vector, or syndrome, becomes

$$[v \oplus e][H]^T = v[H]^T \oplus e[H]^T = e[H]^T.$$

In the Hamming (6, 3) code the parity check matrix is as follows:

$$[H] = \begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 1 \\ 0 & 1 & 1 & 0 & 0 & 1 \\ 1 & 0 & 1 & 0 & 1 & 0 \end{bmatrix}$$

a 3 × 6 matrix where the columns represent the binary numbers 1 to 6. It can be seen that if v represents the code vectors in Table 6, then $v[H]^T = 0$. Suppose that the vector representing 3 is in error in the third digit, i.e. $v \oplus e = (110011) \oplus (001000) = 111011$, then the syndrome becomes

$$[111011][H]^T = [001000][H]^T = [001000] \begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix} = 011$$

= the binary check number. Thus the third digit is in error. Using this notation the expression $v[H]^T$ represents the encoding process, and $[v \oplus e][H]^T$ or $e[H]^T$ represents the decoding process.
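The worked example for the Hamming (6, 3) code can be reproduced with a short Python sketch. It assumes the parity check matrix has as its columns the binary numbers 1 to 6, most significant digit in the top row; the function names `encode`, `syndrome` and `decode` are hypothetical and serve only as an illustration.

```python
# Rows of H are the three parity checks; column i (counting from 1) is the
# binary representation of i, most significant digit in the top row.
H = [
    [0, 0, 0, 1, 1, 1],   # third check: positions 4, 5, 6
    [0, 1, 1, 0, 0, 1],   # second check: positions 2, 3, 6
    [1, 0, 1, 0, 1, 0],   # first check: positions 1, 3, 5
]


def encode(value):
    """Encode a decimal value 0..7 into the six digits a1..a6."""
    a3, a5, a6 = (value >> 2) & 1, (value >> 1) & 1, value & 1
    a1 = a3 ^ a5          # makes the first parity check even
    a2 = a3 ^ a6          # makes the second parity check even
    a4 = a5 ^ a6          # makes the third parity check even
    return [a1, a2, a3, a4, a5, a6]


def syndrome(received):
    """Return the check number: received vector times H transposed, modulo two."""
    digits = [sum(h * r for h, r in zip(row, received)) % 2 for row in H]
    return digits[0] * 4 + digits[1] * 2 + digits[2]


def decode(received):
    """Correct at most one erroneous digit and return the decimal value."""
    word = list(received)
    position = syndrome(word)
    if 1 <= position <= 6:        # a check number of 7 cannot be a single error
        word[position - 1] ^= 1
    return word[2] * 4 + word[4] * 2 + word[5]


codeword = encode(3)              # 110011
received = codeword.copy()
received[2] ^= 1                  # third digit changed from 0 to 1
print(codeword, received, syndrome(received), decode(received))
```

A check number of 7 does not correspond to any of the six positions, so the sketch leaves such a vector unchanged rather than guessing.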


Encoding and decoding circuitry

Each different type of code requires a different type of circuitry to encode and decode the binary vector. A system, suitable for use with a core store, will be described which will encode and decode the Hamming (6, 3) single error correcting code. This system uses binary stages, together with AND and OR gates, plus buffer stages where required.

(1) Encoding. The problem here is that given the digits $a_3$, $a_5$ and $a_6$ to find the parity digits $a_1$, $a_2$ and $a_4$. Since the check equations are addition to modulo two, these equations may be written

$$a_3 \oplus a_5 = a_1, \qquad a_3 \oplus a_6 = a_2, \qquad a_5 \oplus a_6 = a_4.$$

Thus if $a_3$ and $a_5$ are fed in sequence into a binary stage the resulting output will be the binary sum $a_3 \oplus a_5$ or $a_1$. Similarly for the other parity equations. Fig. 7 represents the logical diagram of the encoding process. If the digits $a_3$, $a_5$ and $a_6$ arrive simultaneously in time, then either the digits may be delayed with respect to each other, or the binary stages may be replaced by an exclusive-or gate (Fig. 8). Using the exclusive-or gating technique, care must be taken that the two inputs are synchronous otherwise a false output can result.

FIG. 7. Logic circuit for encoding Hamming (6, 3) code.

FIG. 8. Exclusive-or gate.

(2) Decoding. The problem when decoding may be summarized as follows:
(a) establish whether an error is present by means of the parity check equations;
(b) if an error is present, examine the check number to determine which digit is in error;
(c) if an information digit is in error change its state;
(d) it may be necessary to indicate that a fault is present;
(e) release the corrected decoded vector.
An error may be detected by feeding in sequence the digits, over which the parity check is being made, into a binary stage. The output of the binary stage will indicate an error by giving a final output of 1. Comparison of the binary stage outputs in a number of AND gates will determine if the error is in an information digit. The outputs of the AND gates may be used to change the state of the erroneous information digit before the information vector is released. The scheme just described is illustrated in Fig. 9.

FIG. 9. Logic circuit for decoding Hamming (6, 3) code.

6. ACTIVE CIRCUIT REDUNDANCY

It has been explained before that active circuit redundancy requires change over switches to switch out faulty units, and switch in replacement units. Generally this will require time, and in certain applications the time lag may cause failure. However, if speed can be sacrificed, active circuit redundancy may be used to great advantage. Active circuit redundancy has the advantage that the system may be made self-checking, and if shared redundant units are used the redundancy ratio n/m becomes smaller; m is the minimum number of units in the equipment, and n − m is the number of redundant or standby units provided. Consider, for example, a computer which has m identical stores, all working independently, with n − m redundant stores provided. Suppose that there exists a switching arrangement where a store failure will be detected by the checking apparatus and a redundant store can replace the faulty store. The checking apparatus may take many forms, a possible form being to compare the information written into the store with the original information. Some form of non-destructive read out would have to be used (14). If the original data and the information in the store did not agree, it would indicate a failure in the store. A fault indicator could be activated so that maintenance action could correct the fault. Coding might also be used to check the information in the final read process in case a failure occurs after the information was written into the store.

The assumption that the reliability, p, is the same for units in use and those on standby is a worst case condition, because those on standby could in fact have higher reliabilities. However, if the reliability changes when the unit is put into service, analysis is made difficult. The overall reliability of the redundant system will be the sum of the first (n − m + 1) terms in the binomial expansion of $(p + (1 - p))^n$, i.e.

$$\sum_{j=m}^{n} \binom{n}{j} p^{j} (1 - p)^{n-j}.$$

In this form the advantage gained by using active redundancy is not apparent. The overall reliability may be expressed in a power series

$$a_n p^n - a_{n-1} p^{n-1} + a_{n-2} p^{n-2} + \dots + a_m p^m$$

where the coefficients are functions of $\binom{n}{j}$. p may now be written (1 − q), and the terms $(1 - q)^j$ can be expanded in a binomial expansion. If this is done the results shown in Table 7 are obtained.

Table 7

 m         Reliability of redundant system
 n         $1 - nq + O(q^2)$
 n − 1     $1 - \frac{n(n-1)}{1 \cdot 2} q^2 + O(q^3)$
 n − 2     $1 - \frac{n(n-1)(n-2)}{1 \cdot 2 \cdot 3} q^3 + O(q^4)$
 n − 3     $1 - \frac{n(n-1)(n-2)(n-3)}{1 \cdot 2 \cdot 3 \cdot 4} q^4 + O(q^5)$
 n − j     $1 - \binom{n}{j+1} q^{j+1} + O(q^{j+2})$

If q is small, then the general form of the reliability of an active shared redundancy system can be written

$$p(n, m) \simeq 1 - \binom{n}{m-1} q^{n+1-m}.$$

The analysis so far has neglected the reliability of the switching elements. Suppose the probability of failure of the switching elements is v, then the overall redundant system reliability is

$$\left[ 1 - \binom{n}{m-1} q^{n+1-m} \right] (1 - v).$$

When v is small, the overall reliability is approximately

$$1 - \binom{n}{m-1} q^{n+1-m} - v.$$

If the failure of a single unit in the non-redundant system causes failure, the units are effectively in series, and in order that redundancy should increase the system reliability, the condition

$$1 - \binom{n}{m-1} q^{n+1-m} - v \ge p^m = (1 - q)^m$$

must hold. This may be written in the alternative form

$$v + \binom{n}{m-1} q^{n+1-m} \le \sum_{j=1}^{m} \binom{m}{j} q^{j} (-1)^{j+1}.$$
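The small-q form can be checked numerically against the exact binomial sum. The sketch below is an illustration under the section's assumptions (independent units, each failing with probability q, and switching elements failing with probability v); the function names are hypothetical, and the condition is tested with the product form $(1 - v)\,p(n, m)$ rather than the small-v approximation.

```python
from math import comb


def exact_reliability(n, m, q):
    """Probability that at least m of n units work, each failing with probability q."""
    p = 1.0 - q
    return sum(comb(n, j) * p**j * q**(n - j) for j in range(m, n + 1))


def approx_reliability(n, m, q):
    """Leading-order form 1 - C(n, m-1) * q**(n+1-m), intended for small q."""
    return 1.0 - comb(n, m - 1) * q**(n + 1 - m)


def redundancy_helps(n, m, q, v):
    """True if the switched redundant system is at least as reliable as the
    m non-redundant units in series, for switch failure probability v."""
    return (1.0 - v) * exact_reliability(n, m, q) >= (1.0 - q)**m


# Illustrative values only: 4 working stores, 2 shared standby stores.
n, m, q = 6, 4, 0.01
print(exact_reliability(n, m, q), approx_reliability(n, m, q))
print(redundancy_helps(n, m, q, v=0.001), redundancy_helps(n, m, q, v=0.05))
```

With these illustrative figures the redundancy is worthwhile for v = 0.001 but not for v = 0.05, since the latter exceeds the failure probability of the non-redundant system.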

If q is small, the probability of failure of the switching elements must be less than the probability of failure of the non-redundant system, otherwise redundancy will degrade the system reliability. This is the same as saying that if the probability of failure in the units to which redundancy has been applied is reduced to zero, then the probability of failure in any additional equipment which the redundancy technique requires should be less than the probability of failure of the non-redundant system. This last statement would apply equally well to any form of redundancy, active circuit redundancy and code redundancy. The analysis of the active redundant system has assumed the reliability of the component parts to be independent of time. As was explained earlier components normally obey the exponential failure law, and so the concept of availability was introduced which required maintenance action. The equation for system availability was

$$1 - \frac{r}{r + w} \left( 1 - e^{-(r+w)t} \right).$$


The probability of failure, assuming maintenance action, is

$$\frac{r}{r + w} \left( 1 - e^{-(r+w)t} \right)$$

and substituting this value into the equation for active shared redundancy the availability becomes

$$A(n, m, t) = 1 - \binom{n}{m-1} \left[ \frac{r}{r + w} \left( 1 - e^{-(r+w)t} \right) \right]^{n+1-m}.$$

This has a minimum value as t approaches ∞:

$$A(n, m, \infty) = 1 - \binom{n}{m-1} \left( \frac{r}{r + w} \right)^{n+1-m}.$$
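The availability expression can be evaluated directly. The sketch below assumes that r and w denote the failure rate and the repair rate of a unit (they are defined earlier in the paper, outside this section, so this reading is an assumption); the function names and the numerical values are hypothetical.

```python
from math import comb, exp


def unit_failure_probability(r, w, t):
    """Probability that a maintained unit is failed at time t."""
    return r / (r + w) * (1.0 - exp(-(r + w) * t))


def availability(n, m, r, w, t):
    """A(n, m, t) for an active shared redundancy system (small-q form)."""
    return 1.0 - comb(n, m - 1) * unit_failure_probability(r, w, t) ** (n + 1 - m)


def limiting_availability(n, m, r, w):
    """A(n, m, infinity): the minimum value reached as t grows large."""
    return 1.0 - comb(n, m - 1) * (r / (r + w)) ** (n + 1 - m)


# Illustrative rates per hour: one failure per 1000 h, mean repair time 10 h.
n, m, r, w = 5, 4, 0.001, 0.1
for t in (0.0, 10.0, 100.0, 1000.0):
    print(t, availability(n, m, r, w, t))
print(limiting_availability(n, m, r, w))
```

The availability starts at unity and falls towards the limiting value as the transient term decays.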


7. APPLICATION OF REDUNDANCY TECHNIQUES TO AN AIR TRAFFIC CONTROL COMPUTER

Cluley (5) has calculated that the expected failure rate in a large digital system is too great for application in an air traffic control computer. It is necessary therefore to develop redundancy techniques which enable system reliability to be increased to an acceptable level. In applications such as satellite electronics, repair of electronic equipment is impossible, but for computer and data handling systems, repair of electronic equipment must be considered an integral part of reliability. The equipment design must be such that the time to detect and repair a fault is a minimum. In addition redundant equipment should enable a failure to be repaired without system down time. This must surely be the aim of an on-line system.

When repair and preventative maintenance are taken into account, the equipment reliability can be regarded as the probability that at any instant of time the system is working. This is defined as the system availability. If failure and repair of a system conform to the negative exponential failure and repair laws, the resulting availability equation contains steady state and transient conditions. The up time ratio of a system can be defined as

$$\frac{\text{working time}}{\text{working time} + \text{repair time}}.$$

This ratio again consists of steady state and transient conditions. The steady state conditions for availability and up time ratio are equal but the transient conditions are unequal. At any time the difference between the observed up time ratio and the theoretical availability is equal to the difference between the two transient state conditions. If the transients are finished the availability equals the up time ratio. When repair is not taken into account and the reliability of a non-redundant system follows a negative exponential law, the reliability approaches zero as the time approaches infinity. The availability of a non-redundant repairable system still follows a negative exponential curve, but the minimum availability approaches

$$\frac{\text{mean time between failures}}{\text{mean time between failures} + \text{mean repair time}}$$

as the time approaches infinity. Thus in a non-redundant repairable system the availability can be maximized by making the mean time between failures as large as possible and the mean repair time as small as possible. In a redundant repairable system care must be taken that the redundancy does not make system failures difficult to detect and repair, as a large increase in the mean repair time can, in extreme cases, degrade the availability.

Circuit redundancy can be applied at various levels but generally speaking the reliability improvement is greatest when redundancy is applied at the lowest level. If component redundancy is used a failed component cannot be detected and replaced until the complete redundant group has failed. This will require down time while the failure is repaired. This disadvantage, which is common to all passive circuit redundancy, may be overcome if checking facilities are provided. For example, consider a triplicated majority decision logic system: a comparator may be used to check that the three redundant levels are all in the same state. Failure of one level will be detected, and repair can be made without system down time. The probability of detecting a system error is equal to the reliability of the comparator. The system availability is affected by a failed comparator as it is most probable that down time will be required for repair in the eventuality of the redundant group failing. Alternatively if programmed down time can be permitted, say during 1 hr each day, power supplies may be removed from sections of one redundant level. If a system check then fails, it will indicate a fault in one of the remaining levels which can then be repaired.

Active circuit redundancy on the other hand requires a built-in fault detector, so that a failed unit can be switched from service. The same fault detector may be used as an alarm so that repairs can be made before further faults degrade the system. The use of coding also provides a built-in fault detector. However, unless redundant circuitry is used, an eventual catastrophic failure will result which the coding will be unable to correct and the system will require down time for repairs to be made. It would appear therefore that only active circuit redundancy and checked passive circuit redundancy when used by themselves will provide sufficient freedom from sudden catastrophic failures, resulting in unprogrammed down time.

There is no reason why all types of redundancy should not be combined to increase the system reliability to an acceptable level. For example, a system could consist of switched or active redundant sub-systems: the sub-systems could be made from passive redundant logic units, which could themselves be built from redundant components. This type of building block, where redundant levels consist of further redundant levels, is the Moore-Shannon (4) hypothesis that reliable circuits can be made from arbitrarily poor components. Active redundancy would be very useful if the redundant units could be shared so that although the level of redundancy is reduced the reliability can be maintained at an acceptable level. Unfortunately, although the reliability could be made as large as required, the complexity of the switching system would increase. The more switching functions the switching system has to perform, the less reliable it would become.
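The checked passive redundancy described above, a triplicated majority decision with a comparator, can be sketched as follows. This is only an illustration of the idea on single binary digits; the function names `majority` and `comparator_alarm` are hypothetical.

```python
def majority(a, b, c):
    """Majority decision over three redundant binary levels."""
    return (a & b) | (b & c) | (a & c)


def comparator_alarm(a, b, c):
    """True when the three redundant levels are not all in the same state,
    i.e. one level has failed and can be repaired while the vote still holds."""
    return not (a == b == c)


# One level has failed: the voted output is still correct and the alarm is raised.
print(majority(1, 0, 1), comparator_alarm(1, 0, 1))
# All levels agree: no alarm.
print(majority(1, 1, 1), comparator_alarm(1, 1, 1))
```

The vote masks the single failure while the alarm makes it visible, which is what allows repair without system down time.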
Although the reliability of the redundant system might approach unity, the reliability of the redundant system plus switching array may become less than the reliability of the non-redundant system. Thus the level of redundancy which can be applied to active redundancy is limited by the reliability of the switching array which is used.

In applications, such as information storage, the reliability of the received information could be made arbitrarily large by the use of redundant information. Here again something has to be lost for something gained. If the reliability of the encoding and decoding equipments is less than the reliability of the non-redundant information, the use of this type of redundancy will degrade the system performance. So it might be said that the error correcting and detecting capabilities of a code, although quite feasible mathematically, are limited in practice by the reliability of the encoding and decoding equipments. Assuming that reliable switching arrays, encoding and decoding equipments can be built, a block schematic of a reliable storage system is shown in Fig. 10. This type of arrangement could be applied to the storage problems in an air traffic control computer. It is important that once data has been accepted by the air traffic control computer it should not be lost. Failure in a channel can result in loss of information, however coding does enable automatic recovery to be made.

FIG. 10. Block schematic of system to provide reliable stores. m = number of non-redundant inputs and outputs; n = total number of stores provided, n > m; n − m = number of shared standby units; a = component level redundancy; b = logic level passive checked redundancy; c = redundant information or coding; d = active switched redundancy.

The codes described earlier in the text were of the Hamming type; other codes of course exist (16) but many of them are extremely difficult to encode and decode. For example, certain of the Bose-Chaudhuri (16) codes require a separate computer to decode them. There is little advantage in coding if the processes involve complex equipment or a large number of electronic components, because the overall reliability may well be reduced. In general the choice of code is limited by the reliability of the encoding and decoding equipment, and it has been shown in the text that the Hamming single error correcting codes can be encoded and decoded using simple binary stages and gating techniques. Coding has the disadvantage that the channel capacity has to be increased to facilitate the parity digits required for the checking procedures. However, reliability improvement using redundancy techniques in any other form will require extra equipment capacity, and coding can be made very efficient from the point of view of extra equipment required.

The reliability of a code depends upon the error correcting capabilities of the code and the code length. For example, the reliability of a single error correcting code using seven digits is greater than the reliability of a single error correcting code using eleven digits. One very simple method of improving the code reliability is to divide the code up into groups and to apply a complete parity check to each group. In this way a single error correcting code could be applied which could correct a number of errors equal to the number of groups, provided not more than a single error occurred in any group. A group need not consist of digits which are in sequence in the code, but could consist of digits picked at equal intervals. In this way it would be possible to detect and correct bursts of errors, the length of the burst being equal to the number of groups.
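Both remarks in the last paragraph can be illustrated numerically. The sketch below assumes independent digit errors with probability q and uses the single error correcting reliability $p^n + n p^{n-1} q$; the grouping function picks digits at equal intervals as described. All function names and the example sizes are hypothetical.

```python
def sec_reliability(n, q):
    """Reliability of an n-digit single error correcting code: the
    probability of no error plus the probability of exactly one error."""
    p = 1.0 - q
    return p**n + n * p**(n - 1) * q


def interleaved_groups(length, groups):
    """Assign digit positions 0..length-1 to groups by picking digits at
    equal intervals: group g holds g, g + groups, g + 2*groups, ..."""
    return [list(range(g, length, groups)) for g in range(groups)]


def burst_is_correctable(start, burst_length, length, groups):
    """True if a burst of consecutive errors leaves at most one error per group."""
    hits = [0] * groups
    for position in range(start, min(start + burst_length, length)):
        hits[position % groups] += 1
    return max(hits) <= 1


print(sec_reliability(7, 0.01), sec_reliability(11, 0.01))
print(interleaved_groups(12, 3))
print(all(burst_is_correctable(s, 3, 12, 3) for s in range(12)))
print(burst_is_correctable(0, 4, 12, 3))
```

With three groups, any burst of up to three consecutive digits places at most one error in each group, whereas a burst of four must place two errors in one group and so exceeds what a single error correcting check per group can repair.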
CONCLUSIONS

The failure rate of electronic components is such that the mean time to failure of a data processing system for use in air traffic control applications is too small. Because of this, maintenance action must be regarded as an integral part of system reliability. In order that unprogrammed down time should not cause loss of human life, redundant circuitry must be provided so that the system can continue to function correctly during the period that a fault exists. Redundancy must be applied so that failed components or units can be detected, and the complete redundant group must never be allowed to fail unless further redundant levels are present to maintain continuous system availability.

Loss of data due to a failed storage unit may be avoided if an error detecting and correcting code is used. Coding may be regarded as a redundancy technique because redundant information is required. Active circuit redundancy, passive checked circuit redundancy and coding all require additional electronic equipment. This additional equipment must be reliable so that the overall system reliability is not degraded. For example, the length and error correcting capabilities of any particular code are limited by the reliability of the encoding and decoding equipments. Similarly the amount of active redundancy which can be applied is limited by the reliability of the switching array used to switch the failed units from service and the redundant units into service. Although theoretically maximum benefit results when component redundancy is used, still greater benefit results by combining redundancy techniques at all levels, and in all forms.

The basic principles of applying redundancy should be to provide a simple basic system built from a few simple units which can be designed to operate within agreed limits and under specified environmental conditions, so that the mean time between system failures is maximized. Redundancy techniques should then be applied, but the system must be kept reasonably simple and easy to maintain, so that the mean time to repair a fault is minimized. The system availability will then be a maximum.

REFERENCES

1. Report by Advisory Group on Reliability of Electronic Equipment, Reliability of Military Electronic Equipment, Office of the Assistant Secretary of Defence (Research and Engineering), Washington (1957).
2. R. BREWER, Survey of Life Test Evidence of Semiconductor Devices in Commercial Production, Semiconductor Reliability, Engineering Publishers (1961).
3. D. A. BELL, Statistical Methods in Electrical Engineering, Chapman and Hall, London (1953).
4. E. F. MOORE and C. E. SHANNON, Reliable Circuits using less Reliable Relays, J. Franklin Inst. 262 (1956).
5. J. C. CLULEY, A Comparison of Duplicate and Triplicate Redundancy Schemes for Binary Logical Networks, Univ. of Birmingham, Electrical Engineering Department, Memorandum No. 119 (1962).
6. J. G. TRYON, Quadded Logic, Redundancy Techniques for Computing Systems, Spartan Books, Washington (1962).
7. S. R. CALABRO, Reliability Principles and Practices, McGraw Hill, New York (1962).
8. H. W. PRICE, Mean Life of Parallel Electronic Components, Exponential Distribution Case, Redundancy Techniques for Computing Systems, Spartan Books, Washington (1962).
9. V. J. MCMILLAN and P. COX, J. Brit. Instn. Radio Engrs. 22 (1961).
10. S. J. EINHORN, Reliability Prediction for Repairable Redundant Systems, Proc. I.E.E.E., p. 312 (Feb. 1963).
11. R. E. BARLOW and L. C. HUNTER, Sylvania Technologist 13 (1960).
12. R. W. HAMMING, Bell Syst. tech. J. 29, 147 (1950).
13. C. E. SHANNON, Bell Syst. tech. J. 27, 379-423 and 623-656 (1948).
14. C. J. QUARTERLY, Square-Loop Ferrite Circuitry, Iliffe Books, London (1962).
15. J. C. CLULEY, Logic Level Redundancy as a Means of Improving Digital Computer Reliability, University of Birmingham Memorandum No. 100 (1962).
16. W. W. PETERSON, Error Correcting Codes, M.I.T. Press and John Wiley, New York (1961).