Microelectron. Reliab., Vol. 31, No. 5, pp. 963-968, 1991. Printed in Great Britain.

0026-2714/91 $3.00+.00 © 1991 Pergamon Press plc

ARCHITECTURAL FACTORS INFLUENCING THE RELIABILITY OF FAULT-TOLERANT VLSI ARRAYS*

ANDREA BOBBIO

Istituto Elettrotecnico Nazionale Galileo Ferraris, Strada delle Cacce 91, 10135 Torino, Italy

(Received for publication 20 July 1990)

Abstract: The current trend in VLSI technology is toward highly modular structures in which identical Processing Elements (PE) are connected in a regular lattice. Enhancement of the operational reliability of these VLSI devices is obtained by means of fault-tolerance. Fault-tolerance is achieved by incorporating spare PE's into the array and designing a flexible interconnection network which is able to support the reconfiguration of the array in the presence of a fault. This paper discusses how architecture-related factors influence the configuration and the technology of the device, and how these factors can be accounted for in a predictive reliability model.

1 Introduction

Modern VLSI technology favors the use of a number of identical cells interconnected in a regular manner to yield short communication paths. Arrays of Processing Elements (PE) configured onto a bidimensional lattice [1,2,3] are an example of this trend. As the complexity (number of PE's integrated into a single VLSI) increases, fault-tolerant techniques must be incorporated into the device at the array design level in order to guarantee its correct operation. A common approach to achieve fault-tolerance is to add spare PE's and a flexible interconnection structure to the array. Upon occurrence of an operational fault, the array must undergo a reconfiguration process consisting in replacing the faulty PE with an inactive fault-free spare PE, and restructuring the interconnection network in order to establish logical connection patterns equivalent to those present before the reconfiguration. The presence of spare units can also be used to improve the production yield (percentage of good chips out of a wafer). When the manufacturing process is terminated, the chips with defective parts are statically restructured by interconnecting only good PE's. Indeed, many fault-tolerant schemes [4,5,6,7,8] have been primarily conceived with the aim of improving the yield rather than the operational reliability. Both issues of yield and performance/reliability have been addressed in [9,10,11,12]. Several competing factors must be considered in a reconfigurable fault-tolerant VLSI array

[12]:

• Architectural location of the spare PE's;
• Structure of the interconnection wiring;
• Coverage and latency of the reconfiguration mechanism;
• Increase in the chip area due to the added hardware.

The influence of these factors on the reliability of the device is examined in Section 2, with particular emphasis on the interrelationships between architectural issues and technological issues. In Section 3, some considerations are developed about the possibility of accounting for these factors in a comprehensive quantitative reliability prediction model.

*Work supported by the Italian National Research Council CNR under the Project "Materials and Devices for Solid State Electronics", Grant 89.01857.61.


2 Architecture-related factors

In this Section we examine the principal architecture-related factors that have a direct impact on the reliability of a fault-tolerant VLSI processor array. We assume that the VLSI has the lattice structure represented in Figure 1, where boxes indicate the PE's and the small circles the programmable switches. The interconnections between PE's and switches are not explicitly indicated in the figure, since different interconnection configurations could be realized on the same lattice structure [1,2].

2.1 Location of spare PE's

Given an N × M rectangular array, the most widely adopted proposal is to add extra rows or extra columns (or both) of spare PE's. The number of added elements affects the ability of the device to tolerate operational faults, but has a negative impact on the total area of the device, the configuration of the interconnection wiring, and consequently the complexity of the reconfiguration algorithm. Thus, even if higher levels of redundancy could be conceived, practical schemes discussed in the literature refer to the incorporation of a single spare row (or column); a single row and a single column; and finally two rows (in the topmost and bottommost positions) and two columns (in the rightmost and leftmost positions). A different scheme for incorporating fault-tolerance into a regular array has been proposed by Singh [5], and is obtained by locating spares at interstitial sites in the lattice (Figure 2). The redundancy level of the structure of Figure 2 can be increased by increasing the number of interstitial sites used to locate spares or the number of spares in each interstitial site. Further discussion on interstitial redundancy has been reported in [7].
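As a rough illustration of how these schemes differ in redundancy level, the spare counts implied by the geometries above can be tabulated for an N × M array (a sketch in Python; the scheme labels and the function name are ours, and the interstitial count assumes one spare per 2 × 2 cluster, which matches the 25% figure of Figure 2):

```python
def spare_counts(n: int, m: int) -> dict:
    """Number of spare PEs added by each scheme for an n x m array."""
    return {
        "single_row": m,                     # one extra row of m spares
        "single_row_and_column": n + m + 1,  # extra row + extra column (plus corner)
        "two_rows_two_columns": 2 * n + 2 * m + 4,  # full border of spares
        # interstitial: one spare per 2x2 cluster of PEs (25% redundancy, [5])
        "interstitial_25pct": (n // 2) * (m // 2),
    }

counts = spare_counts(8, 8)
```

For an 8 × 8 array the interstitial scheme and the single-row-plus-column scheme add a comparable number of spares, but their tolerance to fault patterns differs greatly, as discussed in Section 2.2.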

Figure 1 - Lattice structure of a programmable VLSI processor array

Figure 2 - Interstitial redundancy array (with 25% spare PE's) [5]


2.2 Reconfiguration strategy

Given that an operational fault is detected, the reconfiguration strategy should be able to replace the faulty PE with an inactive fault-free spare PE. An ideal strategy is one which is capable of solving any reconfiguration problem when spare PE's are still available: i.e. if S fault-free spares are present, an ideal strategy should guarantee the device survival up to the occurrence of S operational faults. However, several factors discourage the implementation of ideal strategies in VLSI arrays:

• The cost of an ideal strategy (and therefore the latency time of a fault) increases geometrically with the number of elements in the lattice;
• Since it is not known a priori which links will be set by the reconfiguration algorithm, all the possible interconnecting networks must be realized on the chip. This imposes severe penalties on the design, since communication is very costly in area and power;
• The final reconfigured connection can have paths connecting two logically adjacent PE's several length units long (where a length unit is the distance between physically adjacent PE's in the original lattice). This imposes capacitance (and therefore time delay) penalties.

The above mentioned points have motivated the search for simpler and faster, even if suboptimal, reconfiguration algorithms. In [6] single-track switches are used so that a faulty processor can be replaced only by the spare located either on the same row or on the same column. The efficiency of the reconfiguration algorithm based on this scheme has been improved in [8]. Lombardi et al. [12] propose an algorithm which allows piecewise vertical/horizontal reconfiguration paths through the array. Interstitial redundancy [5] facilitates the reconfiguration algorithm and the locality principle, by abandoning the search for ideality (two faulty PE's inside the same cluster of Figure 2 cannot be reconfigured in spite of the possibly high number of available spare PE's in other interstitial sites).

2.3 Area considerations

The incorporation of spares increases the area of the chip and thus increases the probability of having defects in the chip. Thus, even if a higher level of redundancy increases the capability of the reconfiguration algorithm to support defects or operational faults in the original array, an optimal trade-off can be envisioned between the advantages of increasing the number of spares and the shortcomings related to the increase in the chip area [4,9,11].
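The trade-off can be made concrete with a deliberately simplified yield model (our sketch, not a model taken from [4,9,11]): assume Poisson-distributed defects, so each PE of area a is defect-free with probability exp(-D·a) for defect density D, assume perfect static restructuring (the chip is usable if at least n of the n+s PEs are good), ignore wiring overhead, and score a design by good chips per unit of wafer area:

```python
from math import comb, exp

def effective_yield(n: int, s: int, defect_density: float, pe_area: float) -> float:
    """Good chips per unit wafer area (up to a constant), assuming:
    - defects hit PEs independently (Poisson model, density per unit area);
    - the chip works if at least n of the n+s PEs are defect-free
      (perfect static restructuring, an optimistic assumption)."""
    total = n + s
    p_good = exp(-defect_density * pe_area)   # P(one PE has no defect)
    chip_yield = sum(comb(total, k) * p_good**k * (1 - p_good)**(total - k)
                     for k in range(n, total + 1))
    chip_area = total * pe_area               # wiring overhead ignored for simplicity
    return chip_yield / chip_area

# pick the spare count that maximizes good chips per wafer
best_s = max(range(0, 17), key=lambda s: effective_yield(64, s, 0.1, 1.0))
```

Beyond the optimum, extra spares buy little additional yield while still paying their area cost; the operational-reliability models of Section 3 exhibit the same qualitative effect.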

2.4 Interconnection considerations

Higher degrees of fault-tolerance or more complete reconfiguration algorithms require more flexible and complex interconnection structures between the PE's of the array. Communication is the fundamental limitation on VLSI performance [2]: it is expensive in chip area, in delay, and in the power consumed and dissipated on the chip. All these considerations favour fault-tolerant architectures which limit the penalties caused by complex wiring structures.

3 Reliability models for fault-tolerant VLSI

The traditional approach to the predictive reliability of electronic components is based on the use of formulas that provide a value for a constant (time-independent) failure rate as a function of the technological characteristics and of the application environment of the device. The MIL HDBK 217-E [13] is the most representative collection of this class of models. The considerations developed in the previous Section are intended to show that, in the case of fault-tolerant VLSI, the interrelationships between architecture and technology cannot be neglected in the analysis of the device reliability. A Markov approach for the reliability analysis of fault-tolerant VLSI processor arrays has been developed in [9,11]. This approach leads to a very large state space whose analytical solution becomes a difficult task. In this paper, we propose a more compact formulation of the device reliability model that originates from similar studies in the area of fault-tolerant architectures in multichip systems [14]. These models are based on the observation that the occurrence of faults and the recovery process evolve at very different time scales, so that the two phenomena can be analysed separately and then combined into a unified comprehensive framework [15]. In particular, even if the reconfiguration process is analysed by means of a very complex and detailed model, its effect on the overall system reliability can be captured in a sufficiently accurate way by a single time-independent parameter called the coverage. The coverage of the reconfiguration algorithm is formally defined as the probability that the device is capable of reconfiguring to a functional state given that a fault has occurred:

c = Pr{ System recovers successfully | A fault has occurred }     (1)

For a survey of the coverage models proposed in the literature see [16]. The coverage probability c in (1) depends on the ability of the system to detect the faulty PE, on the possibility of replacing the faulty processor with a spare PE, and on the latency of the recovery process.

Fault Detection. The fault detection probability measures the ability of the system to correctly detect and isolate the fault, so that the reconfiguration procedure can be properly initiated. The value of the fault detection probability is difficult to estimate from theoretical arguments, since it is related to the efficiency of the testing procedure and to the type of fault (transient or permanent). In order to recover from transient faults, a fixed number of retry operations is usually attempted [6]. To assign correct figures to the detection probability, some degree of actual (or simulated) experimentation is needed. Indeed, many models reported in the literature assume that the detection and isolation of the fault is perfect [11,6].
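For illustration only (the paper gives no formula for this stage), one simple way to parametrize detection is to split faults into a transient fraction, cleared if any of a fixed number of retries succeeds, and a permanent fraction that must be detected and isolated by the testing procedure:

```python
def detection_coverage(p_detect: float, transient_frac: float,
                       p_retry_ok: float, retries: int) -> float:
    """Illustrative parametrization (our assumption, not a model from the
    paper): a transient fault is handled if at least one of `retries` retry
    operations succeeds (each with probability p_retry_ok); a permanent
    fault must be detected and isolated (probability p_detect)."""
    p_transient_cleared = 1 - (1 - p_retry_ok) ** retries
    return transient_frac * p_transient_cleared + (1 - transient_frac) * p_detect
```

Such a decomposition makes explicit why experimental data are needed: the transient fraction and the per-retry success probability are properties of the field environment, not of the design.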

Fault Survival. The survival probability [12] is defined as the probability that a feasible solution to the reconfiguration problem exists. If the reconfiguration algorithm is ideal (Section 2.2), the survival probability is equal to one as long as the number of faulty PE's is less than or equal to the number of spares S, and then drops to 0. In the presence of a suboptimal algorithm, the survival probability tends to decrease as the number of faults approaches the number of spares. Computing the survival probability in actual cases requires the definition of the spare geometry and location (since the survival probability depends not only on the number of faults but also on their position in the array) and of the algorithm used to replace faulty units. Solutions to the survival probability problem can be obtained by means of combinatorial/geometric arguments (connectivity of a graph), or by simulation. Examples, for different spare geometries and reconfiguration algorithms, are given in [10,6,12].
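As an example of the simulation route, the following sketch estimates the survival probability for a deliberately simplified scheme (our simplification, not the exact algorithm of [6]): a single spare row supplies one spare per column, and a faulty PE can only be replaced by the spare of its own column, so the array survives iff no column contains two or more faulty PEs:

```python
import random

def survival_probability(n_rows: int, n_cols: int, n_faults: int,
                         trials: int = 20000, seed: int = 1) -> float:
    """Monte Carlo estimate of the survival probability for the simplified
    single-spare-row scheme described above."""
    rng = random.Random(seed)
    cells = n_rows * n_cols
    ok = 0
    for _ in range(trials):
        faults = rng.sample(range(cells), n_faults)   # distinct faulty PEs
        cols = [f % n_cols for f in faults]
        if len(set(cols)) == n_faults:                # all faults in distinct columns
            ok += 1
    return ok / trials

# an ideal strategy with n_cols spares would survive any n_cols faults;
# this suboptimal scheme already degrades with only two faults:
p2 = survival_probability(8, 8, 2)
```

For two faults in an 8 × 8 array the exact value is 1 - 7/63 = 8/9, so the fault positions, not just their count, clearly drive the result.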

Fault Latency. The latency of the reconfiguration process is the time elapsed from the occurrence of the fault to the successful (if any) completion of the reconfiguration process. The latency is primarily related to the complexity of the reconfiguration algorithm and to whether the algorithm is self-driven or host-driven. The presence of a latency time during the reconfiguration can be modeled by means of a state whose exit time distribution equals the latency. During the sojourn in the latent state the device operation is frozen, and three main factors can have an impact on the performance/reliability:

1. The device suspends any activity and ignores stimuli coming from the outside world. This causes the loss of tasks arriving at the device during the latency period [17].
2. If a second failure occurs during the latency time (a near-coincident fault), a fatal failure occurs [16]; usually, only a single fault can be handled at a time.
3. In real-time applications there can be a deadline on the recovery process; if the latency time extends beyond the deadline, the reconfiguration is unsuccessful and a fatal failure occurs.

If the latency time is exponentially distributed with parameter δ and the deadline for the recovery process is d, the probability that the latency time is less than the deadline is given by:

c_d = Pr{ Latency < d } = 1 - e^(-δd)
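The deadline factor, and one simple way to combine the three contributions into the coverage c of Eq. (1), can be sketched as follows (the product form, which treats the three factors as independent, is our assumption, not a formula from the paper):

```python
from math import exp

def latency_coverage(delta: float, deadline: float) -> float:
    """P(latency < deadline) for an exponentially distributed latency with
    rate delta: c_d = 1 - exp(-delta * d), as in the text."""
    return 1.0 - exp(-delta * deadline)

def total_coverage(c_detect: float, c_survive: float,
                   delta: float, deadline: float) -> float:
    """Combine detection, survival and latency into a single coverage,
    assuming the three factors act independently: the fault must be
    detected, a feasible reconfiguration must exist, and recovery must
    complete before the deadline."""
    return c_detect * c_survive * latency_coverage(delta, deadline)
```

More refined combinations (e.g. accounting for near-coincident faults during the latency) are exactly what the FHM of Section 3.1 is meant to capture.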

3.1 The reliability model

The reliability prediction model assumes the form of a Markov chain. The operational states are characterized by two indices (n, i), where n (n ≤ N × M) is the number of working PE's functionally connected in the lattice and i (i ≤ S) is the number of still available fault-free spares. Each PE fails at a constant failure rate λ, where λ is evaluated from the technology and the complexity of the PE using conventional methods (viz. MIL HDBK 217-E) or experimental data. PE's are assumed identical and operational failures are assumed statistically independent. With the above assumptions, an operational failure in state (n, i) occurs at a rate nλ. If the reconfiguration process is assumed to be perfect (coverage probability c = 1), the device behaves as an (S out of N × M) system (S + 1 failures are necessary to determine the exhaustion of the redundancy). In actual cases, the reconfiguration cannot be assumed to be perfect. The transition from the operational state with i spares to the operational state with (i - 1) spares is represented by means of a block (the Fault Handling Model - FHM) that models the detection, recovery and reconfiguration process. The exit from the FHM toward the state with (i - 1) spares represents successful recovery and occurs with probability c(n,i), while the exit toward the fatal failure state represents the uncovered fraction of faults, and occurs with probability (1 - c(n,i)). In general, the coverage c(n,i) depends on the state (n, i) in which the failure occurred. A detailed representation of the FHM that explicitly accounts for the detection, survival and latency is reported in Figure 3. The FHM block can be further specified following, for instance, [18]. Due to the time scale separation, the sub-Markov chain inside the FHM can be analysed in isolation. The calculated exit probabilities from the FHM are then inserted in the higher level Markov model, from which an approximate estimate of the time behaviour of the device is evaluated. Results showing the influence of the coverage on the optimal level of redundancy have been reported in [18].

Figure 3 - The building-block of the Markov model and the FHM
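Under the assumptions just stated, and collapsing the FHM into a constant coverage c (i.e. ignoring the latency state, which is instantaneous on the fault time scale), the higher-level Markov chain has a closed-form solution: every operational state is left at the same rate nλ, so failures form a Poisson process, and the device survives to time t iff at most S failures occurred and each was covered. A minimal sketch of this simplified case:

```python
from math import exp, factorial

def array_reliability(n: int, s: int, lam: float, c: float, t: float) -> float:
    """Reliability R(t) of the simplified higher-level model: n working PEs,
    s spares, per-PE failure rate lam, constant coverage c, instantaneous
    reconfiguration.  Each covered fault consumes one spare and keeps n
    constant, so failures form a Poisson process of rate n*lam and

        R(t) = sum_{k=0}^{s} c**k * exp(-n*lam*t) * (n*lam*t)**k / k!
    """
    a = n * lam * t
    return sum(c**k * exp(-a) * a**k / factorial(k) for k in range(s + 1))

# with perfect coverage the device tolerates exactly s failures;
# imperfect coverage erodes the benefit of the added spares:
r_perfect = array_reliability(64, 4, 1e-5, 1.0, 10_000.0)
r_imperfect = array_reliability(64, 4, 1e-5, 0.9, 10_000.0)
```

With c = 1 this reduces to the (S out of N × M) behaviour described above; lowering c shows how imperfect coverage limits the useful degree of redundancy, which is the effect studied in [18]. State-dependent coverages c(n,i) or an explicit latency state would require a numerical transient solution instead of this closed form.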

4 Conclusion

This paper has presented a compact approach to model the reliability of fault-tolerant VLSI processor arrays by combining technological factors with architectural factors. The model is obtained by capturing the influence of the reconfiguration process in a time-independent coverage probability. The coverage can be evaluated separately from the knowledge of the device characteristics. The behaviour of the device in time is modeled by a Markov chain, from which many performance/reliability measures can be quantitatively estimated [17].

References

[1] L. Snyder. Introduction to the configurable, highly parallel computer. Computer, 47-56, January 1982.
[2] C.L. Seitz. Concurrent VLSI architectures. IEEE Transactions on Computers, C-33:1247-1265, 1984.
[3] S. Yalamanchili and J.K. Aggarwal. Reconfiguration strategies for parallel architectures. Computer, 44-51, December 1985.
[4] T.E. Mangir and A. Avizienis. Fault-tolerant design for VLSI: effect of interconnect requirements on yield improvement of VLSI designs. IEEE Transactions on Computers, C-31:609-815, 1982.
[5] A.D. Singh. Interstitial redundancy: an area efficient fault-tolerance scheme for large area VLSI processor arrays. IEEE Transactions on Computers, 37:1398-1410, 1988.
[6] S.Y. Kung, S.N. Jean, and C.W. Chang. Fault-tolerant array processors using single-track switches. IEEE Transactions on Computers, C-38:501-514, 1989.
[7] S. Latifi and A. El-Amawy. Nonplanar VLSI arrays with high fault-tolerance capabilities. IEEE Transactions on Reliability, 38:51-57, 1989.
[8] V.P. Roychowdhury, J. Bruck, and T. Kailath. Efficient algorithms for reconfiguration in VLSI/WSI arrays. IEEE Transactions on Computers, 39:480-489, 1990.
[9] I. Koren and M.A. Breuer. On area and yield considerations for fault-tolerant VLSI processor arrays. IEEE Transactions on Computers, C-33:21-27, 1984.
[10] M.G. Sami and R. Stefanelli. Reconfigurable architectures for VLSI processing arrays. Proceedings of the IEEE, 74:712-722, 1986.
[11] I. Koren and D.K. Pradhan. Modeling the effect of redundancy on yield and performance of VLSI systems. IEEE Transactions on Computers, C-36:344-355, 1987.
[12] F. Lombardi, M.G. Sami, and R. Stefanelli. Reconfiguration of VLSI arrays by covering. IEEE Transactions on Computer-Aided Design, 8:952-965, 1989.
[13] MIL HDBK 217-E. Reliability prediction of electronic equipment. Technical Report, US Department of Defense, 1986.
[14] A. Bobbio. Dependability analysis of fault-tolerant systems: a literature survey. Microprocessing and Microprogramming, 1990.
[15] A. Bobbio and K.S. Trivedi. An aggregation technique for the transient analysis of stiff Markov chains. IEEE Transactions on Computers, C-35:803-814, 1986.
[16] J. Bechta Dugan and K.S. Trivedi. Coverage modeling for dependability analysis of fault-tolerant systems. IEEE Transactions on Computers, 38:775-787, 1989.
[17] K. Trivedi, A.S. Sathaye, O.C. Ibe, and R.C. Howe. Should I add a processor? In Proceedings 23rd Annual Hawaii International Conference on System Sciences HICSS-23, pages 214-221, 1990.
[18] A. Bobbio. The effect of an imperfect coverage on the optimum degree of redundancy of a degradable multiprocessor system. In Proceedings RELIABILITY'87, Paper 5B/3, Birmingham, 1987.