Error propagation analysis using FPGA-based SEU-fault injection




Available online at www.sciencedirect.com

Microelectronics Reliability 48 (2008) 319–328 www.elsevier.com/locate/microrel

Alireza Ejlali *, Seyed Ghassem Miremadi
Department of Computer Engineering, Sharif University of Technology, Azadi Avenue, Tehran, Iran
Received 26 October 2006; received in revised form 1 April 2007
Available online 5 June 2007

Abstract

Error propagation analysis is one of the main objectives of fault injection experiments. This analysis helps designers detect design mistakes and provide effective mechanisms for fault-tolerant systems. However, error propagation analysis requires that the chosen fault injection technique provide a high degree of observability (i.e., the ability to observe the internal values and events of a circuit after a fault is injected). Simulation-based fault injection provides observability adequate for error propagation analysis, but its performance is inadequate to handle today's hardware complexity. As an alternative, FPGA-based fault injection can be used to accelerate the fault injection experiments, but the communication time needed for observing the circuit behavior from outside the FPGA imposes severe limitations on observability. In this paper, an observation technique for FPGA-based fault injection is proposed which significantly reduces the communication time as compared with previous scan-based observation techniques. Furthermore, this paper describes an SEU-fault injection technique based on a chain of parallel registers, which reduces the time needed for injecting SEU faults as compared to previous scan-based fault-injection techniques. As a case study, a 32-bit pipelined processor has been used in the fault injection experiments. The experimental results show that when a high degree of observability is required (e.g., for error propagation analysis), the proposed fault injection technique is over 1166 times faster than simulation-based fault injection, whereas the traditional scan-based technique achieves a speedup of only about 2–3. This means that the proposed technique is about 500 times faster than the traditional scan-based technique. These results are supported by a theoretical performance analysis. The speed increase has been achieved without an excessive increase in FPGA resource overhead: the FPGA overhead of the proposed technique is only 2–3% higher than that of the traditional scan-based technique. © 2007 Elsevier Ltd. All rights reserved.

* Corresponding author. E-mail addresses: [email protected] (A. Ejlali), [email protected] (S.G. Miremadi).
0026-2714/$ - see front matter © 2007 Elsevier Ltd. All rights reserved. doi:10.1016/j.microrel.2007.04.003

1. Introduction

Trends in CMOS technology, applications, and operating conditions are resulting in circuits with higher susceptibility to transient faults. Deep sub-micron (DSM), system on a chip (SoC), and low power design techniques aggravate the reliability problem [9,20,22], so that single event upsets (SEU faults) are becoming a major source of concern and it is increasingly important to analyze the potential consequences of SEUs on applications [6,9,11,17,18,20]. Fault injection is an effective method to study the consequences of SEUs [3,5,17,24] and to analyze how the errors caused by SEUs propagate in a system [8,15]. In practice, designers are interested in tools for performing early SEU-fault injection campaigns when a prototype of the designed circuit is not yet available, and thus when radiation testing cannot be exploited [5,20,23]. Early fault injection makes it possible to detect design mistakes in the design phase, before the system is implemented, and thus reduces the cost of correcting such mistakes. Simulation-based fault injection allows such an early evaluation of the system when only a high-level, non-synthesizable model of the system is available [9,17,20,21]. Moreover, simulation-based fault injection techniques provide control over where and when a fault is injected (controllability), and the ability to monitor internal events of a circuit (observability) [10,16]. However, the main drawback of simulation-based fault injection is that it is time-consuming [5,7,10].


A. Ejlali, S.G. Miremadi / Microelectronics Reliability 48 (2008) 319–328

One way to provide good controllability and observability as well as high speed in SEU-fault injection experiments is to use FPGA-based fault injection [3,5,7,24]. The techniques that use FPGAs for SEU-fault injection can be divided into two main categories:

• Instrumentation-based techniques [5,7]: In these techniques a hardware circuit (e.g., a scan chain) is added to the original circuit as fault-injection instrumentation, and both circuits (i.e., the original circuit and the fault injection instrumentation) are synthesized into an FPGA. Throughout the fault injection experiments, the FPGA configuration is fixed and no reconfiguration occurs.
• Reconfiguration-based techniques [3,24]: These techniques do not use an extra hardware circuit as fault injection instrumentation. Rather, the FPGA is reconfigured whenever a fault has to be injected into the circuit.

The most notable advantages of instrumentation-based techniques as compared to reconfiguration-based techniques are as follows:

• Instrumentation-based techniques can be implemented with any commercial off-the-shelf (COTS) FPGA, because they do not depend on any special feature of a specific FPGA (features such as the capability of runtime reconfiguration [3], the capability of partial reconfiguration [24,25], capture and readback [24,25], etc.).
• As discussed in [5], in instrumentation-based techniques there is no time overhead for reconfiguration, which means that a higher fault injection speed is achievable.

The main drawback of instrumentation-based techniques as compared to reconfiguration-based techniques is the hardware overhead caused by the insertion of the fault-injection instrumentation; reconfiguration-based techniques do not have such an overhead.
Although FPGA-based fault injection techniques can be used for high speed SEU-fault injection, their speed decreases as the required observability increases (i.e., there is a trade-off between observability and fault-injection speed). This is because higher observability requires more communication with the FPGA, which increases the fault injection time. High observability is required for error propagation analysis [1,15], which is one of the main objectives of fault injection experiments (see Section 2). To analyze how errors propagate in a circuit, the register (or signal) values of each faulty circuit model should be compared with the values of the fault free circuit. Traditionally, this is done by transferring the register (or signal) values of each faulty and fault free circuit from the FPGA to a host computer, so that the host computer can compare the values [5]. However, this technique is very time-consuming when high observability is required. Another technique is to save the values into a memory internal to the FPGA, so that they can be read and compared at the end of the fault injection experiment [19]. However, as we will argue later (Section 4), this technique puts practical limitations on the amount of activity that can be analyzed.

The main contribution of this paper is a built-in logic internal to the FPGA which provides the ability to monitor the circuit behavior and therefore reduces the communication time needed for observing the circuit behavior from outside the FPGA. This paper also describes an SEU-fault injection technique based on a chain of parallel registers, which reduces the time needed for injecting faults as compared to traditional scan-based techniques. A typical fault-injection environment has been developed to demonstrate the features of the proposed technique. As a case study, a pipelined RISC processor has been used in the experiments. Experimental results and theoretical performance studies show that the proposed technique is significantly faster than traditional scan-based and simulation-based fault-injection techniques.

The rest of the paper is organized as follows. The proposed observation technique is presented in Section 2. Section 3 describes the SEU injection technique based on a chain of registers. In Section 4, experimental results are presented to compare the proposed technique with the traditional scan-based technique. Section 5 provides a theoretical analysis of the performance of the proposed and traditional scan-based techniques. Finally, Section 6 concludes the paper and points out future work.

2. Observation logic

The ability to monitor the internal values of a circuit after the injection of a fault (observability) is one of the important objectives when fault injection is used for error propagation analysis [1,8,15].
The objective is to compare the internal values of each faulty model with those of the fault-free model and to extract the error manifestation data (locations and latencies). An error manifests on a flip-flop (signal) if its value in the faulty model differs from its value in the fault-free model. The error manifestation latency of a flip-flop (signal) is defined as the time from the injection of a fault to the manifestation of an error on that flip-flop (signal). Note that when fault injection experiments are used to evaluate a fault tolerance mechanism, a high level of observability may not be required. In this case, it may suffice to check the target system at the end of each experiment to see whether the fault tolerance (or fault detection) mechanism succeeded in tolerating (or detecting) the injected faults. However, when fault injection experiments are used to analyze error propagation, a high degree of observability is required [8,15]. In error propagation analysis, we want to know in which registers (or signals) and at what times (or clock cycles) the errors manifest [8,15]. To do this, many observation points must be monitored in every clock cycle to check whether


an error is manifested on the observation point or not. Such an analysis helps designers to design effective fault tolerance mechanisms [2].

One possible technique for providing observability in FPGA-based fault injection is to use scan-chain-based methods [5]. In this technique, a scan chain is used to read the internal register values of the target system. However, there is a trade-off between speed and observability. For example, observing the internal register values at each clock cycle requires shifting out all the internal values at every clock cycle, which is very time-consuming. Another technique is to store the internal values in the internal memory blocks of the FPGA, so that they can be read and compared at the end of the fault injection experiment [19]. However, this technique puts practical limitations on the amount of hardware and software activity that can be emulated (see Section 4).

The duplication technique is one of the observation techniques that have been used for observing fault effects in processors. In this technique the outputs of the target chip (faulty chip) are compared pin-by-pin and cycle-by-cycle with those of a gold unit (fault-free unit) in order to determine whether the injected faults have produced errors in the target chip [4]. Since in FPGA-based fault injection the internal nodes of the circuits are accessible, the duplication technique can also be used for comparing the internal node values of the faulty and fault free circuits (see Fig. 1). This technique does not have the limitations of the previous observation techniques (i.e., using a scan chain or internal memory blocks); however, it has a high hardware overhead (about 100%). Since the faulty and fault free circuits are identical, their combinational parts are identical too. Also, as shown in Fig. 1, the faults are injected into FFs and not into the combinational part of the target system. This is because the bit-flip fault model, which applies to FFs, is assumed in the fault injection experiments. (It has been shown that bit-flip faults closely match SEU effects [5,21].)

The observation technique proposed in this paper (see Fig. 2) uses a shared combinational circuit for both the faulty and fault free circuits. In this technique, when the multiplexer unit selects the input labeled 0,

Fig. 1. The technique of observation based on duplication.


Fig. 2. Proposed observation technique.

the circuit operates in the fault free mode for one clock cycle and the FFs labeled 'fault free' are loaded with new data; then the multiplexer selects the input labeled 1, the circuit operates in faulty mode for one clock cycle and the FFs labeled 'faulty' are loaded with new data. At the end of this stage the values of the faulty and fault free FFs are compared with each other. This process is repeated until the end of the fault injection experiment.

Fig. 3 shows in more detail how the proposed technique works. As shown in Fig. 3a, the clock signal CLK1 is applied to the fault free FFs and the clock signal CLK2 is applied to the faulty FFs. It can be seen from the waveform in Fig. 3b that the MUX_Select signal alternately selects the fault free and faulty FFs. The positive edges of CLK1 occur only when MUX_Select = 0 and the positive edges of CLK2 occur only when MUX_Select = 1 (it is assumed that the FFs are sensitive to the positive edge of the clock signals). Figs. 3c and d show how the circuit operates when MUX_Select = 0 and MUX_Select = 1, respectively. In these two figures, the solid lines represent the parts of the circuit which are operational and the dotted lines represent the parts that are disabled and idle. When MUX_Select = 0, as shown in Fig. 3c, the multiplexer selects the fault free FFs; hence, the outputs of the fault free FFs are connected to the inputs of the 'Combinational Circuit' and the outputs of the faulty FFs are not. Also, whenever MUX_Select = 0, as shown in Fig. 3b, only the clock signal CLK1, which is applied to the fault free FFs, is triggered and the clock signal CLK2 is not; hence, when MUX_Select = 0, only the fault free FFs are updated and the faulty FFs remain unchanged.
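The alternation between the two modes can be illustrated with a small software model. The following is only a sketch under stated assumptions, not the authors' hardware: `run_time_multiplexed` and `toy_comb` are hypothetical names, the two-register toy circuit stands in for the shared 'Combinational Circuit', and the fault is modeled as a single bit-flip applied to the faulty FF bank before the chosen cycle.

```python
from typing import Callable, List, Tuple

State = Tuple[int, ...]  # one integer per register (FF bank contents)


def run_time_multiplexed(
    comb: Callable[[State], State],
    initial: State,
    n_cycles: int,
    inject_cycle: int,
    reg_index: int,
    bit: int,
) -> List[Tuple[int, List[int]]]:
    """Step two FF banks through ONE shared next-state function.

    Returns (cycle, indices of differing registers) for every cycle in
    which the faulty bank differs from the fault free bank.
    """
    fault_free = initial
    faulty = initial
    mismatches = []
    for cycle in range(n_cycles):
        if cycle == inject_cycle:
            # Bit-flip fault model: invert one bit of one faulty register.
            regs = list(faulty)
            regs[reg_index] ^= 1 << bit
            faulty = tuple(regs)
        # MUX_Select = 0: shared logic reads the fault free FFs (CLK1 edge).
        fault_free = comb(fault_free)
        # MUX_Select = 1: shared logic reads the faulty FFs (CLK2 edge).
        faulty = comb(faulty)
        # End of the stage: compare the two banks register by register.
        differing = [i for i, (a, b) in enumerate(zip(fault_free, faulty)) if a != b]
        if differing:
            mismatches.append((cycle, differing))
    return mismatches


def toy_comb(state: State) -> State:
    """Hypothetical circuit: an 8-bit counter and an 8-bit accumulator."""
    counter, acc = state
    return ((counter + 1) & 0xFF, (acc + counter) & 0xFF)


if __name__ == "__main__":
    # Flip bit 0 of the accumulator (register 1) just before cycle 2.
    print(run_time_multiplexed(toy_comb, (0, 0), 4, 2, 1, 0))
```

Because both banks pass through the same `comb` function, the model needs only one copy of the combinational logic, at the cost of two evaluation steps per emulated clock cycle.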
In short, when MUX_Select = 0, the circuit is in fault free operation mode, and in this case the circuit is identical to the fault free circuit unit shown in Fig. 1. (Note that the solid lines in Fig. 3c form a circuit which is identical to the fault free circuit unit shown in Fig. 1.) Similarly, when MUX_Select = 1, as shown in Fig. 3d, the multiplexer selects the faulty FFs; hence, the outputs of the faulty FFs are connected to the inputs of the ‘Combinational Circuit’ and the outputs of the fault free FFs are not connected to the inputs of the ‘Combinational Circuit’. Also, whenever MUX_Select = 1, as shown in Fig. 3b, only


[Fig. 3, panels (a)–(d). Block labels: Inputs, Outputs, Combinational Circuit, MUX 2/1, MUX_Select, Fault Free FFs (CLK1), Faulty FFs (CLK2), Fault Injection.]

Fig. 3. Operation of the proposed observation technique: (a) the proposed observation architecture, (b) waveform of the MUX_Select and clock signals, (c) the operation of the proposed architecture when MUX_Select = 0 and (d) the operation of the proposed architecture when MUX_Select = 1.

the clock signal CLK2, which is applied to the faulty FFs, is triggered and the clock signal CLK1 is not; hence, when MUX_Select = 1, only the faulty FFs are updated and the fault free FFs remain unchanged. In short, when MUX_Select = 1, the circuit is in faulty operation mode, and in this case the circuit is identical to the faulty circuit unit shown in Fig. 1. (Note that the solid lines in Fig. 3d form a circuit which is identical to the faulty circuit unit shown in Fig. 1.)

Fig. 4 shows the compare logic in more detail. In this figure, the boxes labeled 'Equivalence check' are combinational circuits. Each 'Equivalence check' circuit compares a register of the fault free circuit with its counterpart in the faulty circuit and produces a 1-bit result, which is '1' if the two registers have different values and '0' if they are equal. The results of the comparison of all the registers are stored in the register labeled 'Comparison Register'. If at least one of the bits in the 'Comparison Register' is '1', the output of the OR gate interrupts the host computer and the host computer reads the 'Comparison Register' in response.

Fig. 4. Comparison logic. [Block labels: FFs of the target system, REG A (faulty), REG A (fault free), REG B (faulty), REG B (fault free), Equivalence check, Comparison Register.]

The hardware overhead of the proposed technique (see Section 4) is much less than the 100% hardware overhead of the duplication technique, and yet the observability provided by this technique is the same as that provided by the duplication technique. However, the emulation time of the proposed technique is twice the emulation time of the duplication technique, because in the proposed technique the system switches between faulty and fault-free modes alternately. Although the proposed technique increases the emulation time, it performs most of the observation process inside the FPGA and therefore reduces the communication time needed for observability. Overall, it significantly reduces the time needed for fault injection experiments, since the fault injection time is dominated by the communication time (see Sections 4 and 5).

3. SEU-fault injection

The fault injection logic used for injecting faults into the target system is a combinational logic which provides the ability to configure the flip-flops of the target system into a chain of parallel registers. The fault model used in this technique is the single/multiple bit-flip in the circuit storage elements, since it closely matches SEU effects [5,21]. Fig. 5 shows a schematic diagram of this fault injection technique.

Fig. 5. Schematic block diagram of the proposed fault injection technique.
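As a rough software illustration of the bit-flip fault model applied to flip-flops grouped into a chain of k-bit registers, the sketch below reads a chain, inverts the flip-flops named in a fault list, and returns the faulty contents for writing back. The function names and the flip-flop numbering (flip-flop `i` lives in word `i // k`, bit `i % k`) are assumptions made for this example, not details taken from the paper's hardware.

```python
import math
from typing import Iterable, List


def inject_bit_flips(chain: List[int], k: int, fault_bits: Iterable[int]) -> List[int]:
    """Return a copy of the register chain with the listed flip-flops inverted.

    chain[i] models one k-bit register; flip-flop number ff is assumed to
    sit in word ff // k at bit position ff % k.
    """
    faulty = list(chain)              # contents read out of the chain
    for ff in fault_bits:
        word, bit = divmod(ff, k)
        faulty[word] ^= 1 << bit      # bit-flip: invert the selected flip-flop
    return faulty                     # contents to be written back


def shifts_per_pass(n_ff: int, k: int) -> int:
    """Number of k-bit transfers needed to move the whole chain once."""
    return math.ceil(n_ff / k)


if __name__ == "__main__":
    print(inject_bit_flips([0b0000, 0b1111], k=4, fault_bits=[1, 4]))
    # A chain of flip-flops (k = 1) versus a chain of 32-bit registers.
    print(shifts_per_pass(861, 1), shifts_per_pass(861, 32))
```

The second helper shows where the speed advantage of wider registers comes from: the number of transfers per pass over the chain shrinks roughly in proportion to the register width.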


As shown in Fig. 5, the flip-flops of the system are grouped together to form k-bit registers. When the multiplexer units select the inputs labeled 1, the circuit performs its normal task. When the multiplexer units select the inputs labeled 0, the FFs of the target system form a chain of k-bit registers. As in most other SEU-fault injection techniques (e.g., [5,26,27]), a fault is injected by reading the contents of the chain, inverting the bits specified in the user-defined fault list, and writing the faulty data back into the chain. Since the flip-flops are grouped into k-bit registers, this technique is about k times faster than traditional scan-based techniques, which use a chain of single flip-flops.

4. Experimental results

The environment used for the experiments consists of two parts: a PCI-based PLDA board [13] and a Pentium IV system (3.2 GHz, RAM = 2 GB, OS = Windows XP). The PLDA board is connected to the computer via a PCI expansion slot. An FPGA chip, FLEX 10K200SFC484-1 [12,13], is mounted on the board. This FPGA can be configured through the PCI bus and, after configuration, can communicate with the computer through the PCI bus. In the experiments, a 32-bit pipelined RISC processor was synthesized into the FPGA and the faults were injected into the RISC processor. The architecture of the RISC processor is depicted in Fig. 6. Note that in these experiments fault injection was used only to study the error behavior of the pipelined RISC processor (i.e., error propagation analysis). It was not used to evaluate any fault-tolerance or error detection mechanisms. Three different sets of experiments were done:

(1) Simulation-based fault injection: The simulation-based fault-injection experiments were performed

Fig. 6. Architecture of the pipelined RISC processor.


Table 1. Error manifestation latencies (measured in clock cycles)

Workload                    Observation point   Min   Max   Mean     Median   Std      # of manifested errors   Total # of manifested errors
Selection sort              PC1                 0     378   124.48   144      112.11   532                      627
                            ALU output REG      0     409   141.01   197      136.97   330
Matrix multiplication       PC2                 0     288   129.50   155      103.17   486                      589
                            ALU output REG      0     153   89.36    97       42.93    369
Insertion into linked list  PC3                 0     103   37.45    52       28.84    582                      655
                            ALU output REG      0     62    19.05    36       18.46    290

using a ModelSim simulator (Version SE PLUS 6.0) [14]. In the simulation experiments, the variable manipulation technique [15] was used for injecting bit-flip (SEU) faults.
(2) FPGA-based fault injection using the traditional scan-based technique: In these experiments, scan-chain hardware like that proposed in [5] was used for injecting faults into the internal flip-flops of the processor. The same scan chain was used for observing the internal flip-flops of the system at each clock cycle.
(3) FPGA-based fault injection using the proposed technique: In these experiments, the proposed observation technique was used for observing the internal flip-flops of the processor and the proposed chain-of-registers technique was used for injecting faults into the internal flip-flops.

For the 2nd and 3rd sets of experiments, a program was developed which automatically inserts the observation and fault-injection logic into the target system code. Three different workloads were executed to analyze error propagation in the RISC processor:

  1. A selection sort algorithm, which sorts a 100-element array of integers.
  2. A matrix multiplication algorithm, which multiplies 5 × 5 matrices with integer elements.
  3. A linked-list processing algorithm, which inserts 200 nodes into a sorted linked list (initially, the linked list is empty).

Each of these workloads takes about 10000 clock cycles to complete. These three workloads were chosen because their operations together are intended to represent the various types of operations in real application programs. Also, much of the fault-injection literature (such as [15]) has used similar workloads. The fault list was randomly generated, so that each flip-flop has an equal probability of becoming faulty and each flip-flop may become faulty at each clock cycle with a constant probability. For all three workloads and all three sets of experiments the same fault list was used; therefore the error manifestation latencies obtained from the simulation-based experiments (1st set of experiments) did not differ from the results of the FPGA-based experiments (2nd and 3rd sets of experiments).
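A fault list with these properties can be sketched in a few lines. This is an illustrative assumption about how such a list might be drawn (independent uniform choices of clock cycle and flip-flop), not the authors' generator; `make_fault_list` and the `(cycle, flip_flop)` tuple layout are hypothetical. Fixing the seed mirrors the requirement that the same fault list be reused across all sets of experiments.

```python
import random
from typing import List, Tuple


def make_fault_list(n_faults: int, n_ff: int, n_cycles: int, seed: int = 0) -> List[Tuple[int, int]]:
    """Draw (injection cycle, flip-flop index) pairs uniformly at random.

    Every flip-flop is equally likely to be hit, and every clock cycle is
    equally likely to be the injection time.
    """
    rng = random.Random(seed)  # fixed seed so the list can be regenerated
    return [(rng.randrange(n_cycles), rng.randrange(n_ff)) for _ in range(n_faults)]


if __name__ == "__main__":
    # Sizes taken from this section: 1000 faults per campaign, 861 flip-flops,
    # workloads of about 10000 clock cycles.
    faults = make_fault_list(n_faults=1000, n_ff=861, n_cycles=10000, seed=1)
    print(faults[:3])
```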

In each fault injection campaign, 1000 SEU faults were injected into the registers of the RISC processor. Table 1 shows the error manifestation latencies for two observation points, i.e., the PC1 and ALU output registers. In the experiments, 25 registers were observed, but for the sake of simplicity Table 1 shows the error manifestation information of only these two observation points. Table 2 shows the amount of FPGA resources used in the experiments.

To use the storage technique proposed in [19], one would have to store 861 bits per clock cycle (because the RISC processor has 861 DFFs in its registers). Considering that each workload takes 10000 clock cycles to complete, 861 × 10000 = 8610000 bits may have to be stored in each experiment, which is far more than the maximum available memory (40960 bits) in Altera FLEX 10K FPGAs [12] (the FPGA used in the experiments). This example shows why the storage technique proposed in [19] puts practical limitations on the amount of activity that can be analyzed. The proposed technique does not have such a limitation and provides full observability with an overhead of only 18.57% (Table 2).

Table 3 shows the execution time of the fault-injection campaigns as well as the speedup figures. As shown in Table 3, the proposed technique is about 500 times faster than the traditional scan-based technique. Ref. [5] reports that the speedup of the traditional technique over the simulation-based technique varies between 14186 and 45859, whereas in Table 3 this speedup is very low, only about 2. This is because in the experiments reported in [5] the internal states of the target system were observed

Table 2. Available and consumed FPGA resources

                                                                                        Number of LCs
Total available resources in the FPGA                                                   9984
Pipelined RISC processor                                                                5837
Pipelined RISC processor as well as the proposed fault injection and observation logic  6921 (Overhead = 18.57%)
Pipelined RISC processor as well as the traditional scan-chain                          6756 (Overhead = 15.74%)


Table 3. Speedup of FPGA-based techniques

Workload                 Simulation time [s]   Emulation time, proposed technique [s]   Emulation time, traditional technique [s]   S(Prop/Sim) a   S(Trad/Sim) b   S(Prop/Trad) c
Selection sort           106132                102.21                                   59376.3                                     1038.37         1.79            580.09
Matrix multiplication    143188                149.76                                   59392.1                                     956.11          2.41            396.73
Linked list processing   126709                108.58                                   59361.0                                     1166.96         2.13            547.87

a Speedup of the proposed technique over the simulation-based technique.
b Speedup of the traditional technique over the simulation-based technique.
c Speedup of the proposed technique over the traditional technique.

only at the end of each fault-injection experiment, whereas in this paper the states of the target system are observed at each clock cycle (since this paper considers error propagation analysis, a high degree of observability is required). Therefore, for applications which do not require high observability, the traditional technique can be used effectively. However, for applications which require high observability, the traditional technique suffers from a significant communication bottleneck.

Table 3 shows that the emulation time of the proposed technique is workload dependent, while that of the traditional technique is not. This is because in the proposed technique the FPGA transfers observation data only when an error manifests in the observation points and, as stated in [15], error behavior is workload dependent. Therefore, in the proposed technique the communication time is workload dependent. In the traditional technique, however, a fixed amount of data is transferred at each clock cycle, so the communication time is not workload dependent.

5. Performance analysis

This section provides a theoretical performance analysis of the proposed fault-injection and observation technique as well as the traditional scan-based fault injection technique. In this analysis, it is assumed that high observability is required and therefore, after the fault injection, the internal values of the target system have to be observed at each clock cycle. Suppose a fault-injection campaign (i.e., a series of fault injection experiments) is executed using the proposed technique. The time required for executing the fault-injection campaign is

$$T^{\mathrm{New}}_{\mathrm{FPGA\ Campaign}} = N_f \left( 2 N_C T_{CLK} + 2 \left\lceil \frac{n_{FF}}{K} \right\rceil T_C + N_C \cdot \alpha \left\lceil \frac{w_{CR}}{K} \right\rceil T_C \right) \qquad (1)$$

where
$N_f$ is the number of fault-injection experiments during a fault injection campaign,
$N_C$ is the number of clock cycles required to complete a workload,
$T_{CLK}$ is the period of the clock signal applied to the target system,
$n_{FF}$ is the number of flip-flops in the target system,
$K$ is the bit width of the interface between the FPGA and the host computer (32 for the PCI bus),
$T_C$ is the communication time for transferring one $K$-bit word between the host computer and the FPGA,
$\alpha$ is the probability that an error occurs (manifests) in the observation points (registers),
$w_{CR}$ is the bit width of the 'Comparison Register'.

Following is a description of each term in Eq. (1):

Emulation time: $2 N_C T_{CLK}$ represents the emulation time of the system shown in Fig. 2. The emulation time for the original target system is $N_C T_{CLK}$; the emulation time of the system shown in Fig. 2 increases by a factor of 2 since it emulates both the faulty and the fault free behavior of the target system.

Fault injection time: $2 \lceil n_{FF}/K \rceil T_C$ represents the time required for reading the contents of the chain, inverting some of the bits, and writing the faulty data back into the chain. Note that the flip-flops of the target system are configured as a chain of registers. Also, it is assumed that the communication protocol is half-duplex (two shift operations of the whole chain are required).

Observation time: $N_C \cdot \alpha \lceil w_{CR}/K \rceil T_C$ represents the time required for reading the observation data. At each clock cycle, the 'Comparison Register' is read if an error manifests in any of the observation points.

The time required for executing a fault-injection campaign based on the traditional scan-based technique (like that proposed in [5]) is:

$$T^{\mathrm{Traditional}}_{\mathrm{FPGA\ Campaign}} = (N_f + 1) \left( N_C T_{CLK} + \frac{N_C}{2} n_{FF} T_C + n_{FF} T_C \right) \qquad (2)$$

Following is a description of each term in Eq. (2):

Emulation time: $N_C T_{CLK}$ represents the emulation time of the target system. The factor of 2 is not needed, since in each fault injection experiment only one circuit (faulty or fault-free) is emulated.

Fault injection time: $n_{FF} T_C$ represents the time required for writing fault injection data into the chain (see [5]).

Observation time: $\frac{N_C}{2} n_{FF} T_C$ represents the average time required for reading the contents of the scan chain at each clock cycle (after a fault is injected). It is assumed


that the communication protocol is half-duplex, so that fault injection and observation cannot be performed simultaneously. The factor of 1/2 is required because fault observation is performed only after a fault has been injected into the system. A fault may be injected at any clock cycle with constant probability, so the factor of 1/2 represents the average time.

To assess the theoretical analysis provided in this section (Eqs. (1) and (2)), we compare the emulation times estimated by the analytical equations with the emulation times obtained from the experiments. As mentioned in Section 4, the RISC processor used in the experiments has n_FF = 861 FFs, each workload takes N_C = 10000 clock cycles to complete, and in each fault injection campaign N_f = 1000 SEU faults were injected into the registers of the RISC processor. The FPGA clock period was T_CLK = 0.33 × 10^-7 s, the PCI communication transaction time was T_C = 15 × 10^-6 s, the bit width of the registers in the chain of registers was K = 32, and the number of registers observed in the experiments was w_CR = 25. To evaluate Eq. (1), we also need to know the value of the α parameter, i.e., the probability that an error manifests in the observation points. The α parameter can be easily obtained from the experiments by counting the number of interrupts raised by the circuit shown in Fig. 4, since each interrupt means that an error has manifested in one of the observation points. The values of the α parameter in the experiments were: α(Selection sort) = 0.63, α(Matrix multiplication) = 0.79, α(Linked list processing) = 0.67. Based on these values and Eqs. (1) and (2), the emulation times of the proposed and traditional techniques can be estimated as shown in Table 4. The estimated values are close to the experimental results (see Table 3).
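The estimates can be checked with a short calculation. The sketch below evaluates Eqs. (1) and (2) with the parameter values quoted above; the function and variable names are chosen for this example only.

```python
import math


def t_proposed(n_f, n_c, t_clk, n_ff, k, t_c, alpha, w_cr):
    """Campaign time of the proposed technique, Eq. (1), in seconds."""
    emulation = 2 * n_c * t_clk                         # faulty + fault free runs
    injection = 2 * math.ceil(n_ff / k) * t_c           # read chain, write it back
    observation = n_c * alpha * math.ceil(w_cr / k) * t_c
    return n_f * (emulation + injection + observation)


def t_traditional(n_f, n_c, t_clk, n_ff, t_c):
    """Campaign time of the traditional scan-based technique, Eq. (2), in seconds."""
    emulation = n_c * t_clk
    observation = (n_c / 2) * n_ff * t_c                # average over injection times
    injection = n_ff * t_c
    return (n_f + 1) * (emulation + observation + injection)


# Parameter values reported in this section.
COMMON = dict(n_f=1000, n_c=10000, t_clk=0.33e-7, n_ff=861, t_c=15e-6)
ALPHAS = {
    "Selection sort": 0.63,
    "Matrix multiplication": 0.79,
    "Linked list processing": 0.67,
}

if __name__ == "__main__":
    for workload, alpha in ALPHAS.items():
        estimate = t_proposed(k=32, alpha=alpha, w_cr=25, **COMMON)
        print(f"{workload}: proposed {estimate:.2f} s")
    print(f"traditional: {t_traditional(**COMMON):.2f} s")
```

With these inputs the observation term dominates Eq. (1) and the scan-out term dominates Eq. (2), which is why the communication time, rather than the emulation time, determines both campaign times.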
Since the $\alpha$ parameter changes as the workload changes, here we analyze the impact of the variations of the $\alpha$ parameter on the speedup of the proposed technique over the traditional scan-based technique. The speedup of the proposed technique over the traditional scan-based technique can be obtained from Eqs. (1) and (2) as

$$
S = \frac{T^{\text{Traditional}}_{\text{FPGA Campaign}}}{T^{\text{New}}_{\text{FPGA Campaign}}}
= \frac{(N_f + 1)\left(N_C T_{CLK} + \frac{N_C}{2}\, n_{FF}\, T_C + n_{FF}\, T_C\right)}
{N_f \left(2 N_C T_{CLK} + 2\left\lceil \frac{n_{FF}}{K} \right\rceil T_C + N_C\, \alpha \left\lceil \frac{w_{CR}}{K} \right\rceil T_C\right)}
\tag{3}
$$
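As a worked check of Eq. (3), the experimental parameter values ($N_f = 1000$, $N_C = 10000$, $n_{FF} = 861$, $T_{CLK} = 0.33 \times 10^{-7}$ s, $T_C = 15 \times 10^{-6}$ s, $K = 32$, $w_{CR} = 25$) can be substituted to obtain the speedup as an explicit function of $\alpha$. This is only a sketch: it assumes that $n_{FF}/K$ and $w_{CR}/K$ are rounded up to whole transfers (27 and 1, respectively), an assumption that reproduces the Table 4 estimates.

```latex
\begin{align*}
T^{\text{Traditional}} &= 1001\,\bigl(3.3\times10^{-4} + 64.575 + 0.012915\bigr)\ \text{s} \approx 64652.83\ \text{s},\\
T^{\text{New}}(\alpha) &= 1000\,\bigl(6.6\times10^{-4} + 8.1\times10^{-4} + 0.15\,\alpha\bigr)\ \text{s} = (1.47 + 150\,\alpha)\ \text{s},\\
S(\alpha) &\approx \frac{64652.83}{1.47 + 150\,\alpha},
\qquad
S(1) \approx \frac{64652.83}{151.47} \approx 426.8 .
\end{align*}
```

Because the $150\,\alpha$ term dominates the denominator for all but very small $\alpha$, the speedup falls roughly in proportion to $1/\alpha$.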

Table 4. Estimated emulation times

| Workload | Emulation time, proposed technique [s] | Emulation time, traditional technique [s] |
|---|---|---|
| Selection sort | 95.97 | 64652.83 |
| Matrix multiplication | 119.97 | 64652.83 |
| Linked list processing | 101.97 | 64652.83 |

Assuming that all the parameters except for the $\alpha$ parameter are constant (and equal to the values in our experiments), Fig. 7 shows how the S parameter (the speedup of the proposed technique over the traditional scan-based technique, defined in Eq. (3)) varies as the $\alpha$ parameter changes. As shown in this figure, the speedup decreases as the $\alpha$ parameter increases; however, the proposed technique is still much faster than the traditional scan-based technique even when the $\alpha$ parameter takes its maximum value of 1, where the speedup is almost 427.

Fig. 7. Impact of the $\alpha$ parameter (probability of error manifestation) on the speedup of the proposed technique over the traditional scan-based technique.

6. Concluding remarks and future works

This paper presents (i) a novel observation technique for FPGA-based fault injection which can be effectively used for analyzing how errors caused by SEUs propagate in a circuit (error propagation analysis), (ii) a fault injection logic based on a chain of parallel registers, (iii) a case study where faults are injected into a 32-bit pipelined RISC processor, and (iv) a theoretical performance analysis which supports the experimental results and helps to analyze the speedup of the proposed technique over the traditional one. Both the experimental and the theoretical analyses show that, for applications requiring a high degree of observability, the speedup provided by traditional scan-based fault injection is significantly reduced. However, the proposed observation technique can be effectively used for these applications. In fact, the experimental results show that the proposed technique is about 500 times faster than the traditional scan-based technique.

Although FPGA-based fault injection can be used for high-speed SEU-fault injection, it has some important limitations. One of these limitations is that FPGA-based fault injection can only be used for synthesizable models and cannot be used for non-synthesizable models. Some attempts have been made to extend the application of FPGA-based fault injection to non-synthesizable models [7,8]. In [7], it is discussed how FPGAs can be used to accelerate fault injection into switch-level models. (Note that switch-level models, e.g., Verilog switch-level models, are not synthesizable into FPGAs.) In [8], it is discussed how FPGAs can be used to accelerate fault injection into Verilog and VHDL codes of which only parts are synthesizable. However, these techniques do not cover all possible non-synthesizable codes. Hence, one interesting area for future work could be the use of FPGAs for accelerating fault injection into non-synthesizable models.

Another important limitation of FPGA-based fault injection is that some IP-cores can be synthesized into FPGAs although their source codes are not available to the user. For these IP-cores, error propagation analysis is impossible, because it does not make sense to analyze how errors propagate through a system whose architecture (structure) is unknown. However, one may still want to perform fault injection experiments for purposes other than error propagation analysis. In this case, one still faces serious problems such as:

(1) The controllability is very low, because information about the structure of the IP-core is not available and one has no control over where a fault is injected.
(2) It is impossible to insert extra code into the IP-core description to facilitate the fault injection process. Instrumentation-based techniques require adding the description of the fault injection instrumentation to the original IP-core description. Also, reconfiguration-based techniques may require the alteration of the IP-core description [25].

Based on the above, another interesting area for future work could be the use of FPGAs for performing fault injection into IP-cores whose source codes are not available.

The future works mentioned above are intended to develop new techniques that overcome the limitations of FPGA-based fault-injection techniques. Another possible area of research is to compare the existing FPGA-based fault-injection techniques to see which technique is more appropriate for a given application. As stated in Section 1, FPGA-based fault-injection techniques can be divided into two main categories: (1) instrumentation-based techniques and (2) reconfiguration-based techniques. A detailed comparison between these two categories (in terms of fault injection speed, hardware overhead, cost, TTM, etc.) requires that one implements both techniques and applies them to the same target system. This comparison could be a potential new area of research.


FPGA-based fault injection is an interesting area of research, and we believe that much work can still be carried out in this area. The works mentioned above are only some examples of possible and interesting candidates for future work.

Acknowledgement

The authors acknowledge Research Vice-Presidency of Sharif University of Technology for partially funding this work.

References

[1] Aidemark J, Folkesson P, Karlsson J. Path-based error coverage prediction. Journal of Electronic Testing Theory and Application (JETTA) 2002;16(June):343–9.
[2] Ammari A, Hadjiat K, Leveugle R. On combining fault classification and error propagation analysis in RT-Level dependability evaluation. In: 10th IEEE international on-line testing symposium (IOLTS). 2004; p. 227–32.
[3] Antoni L, Leveugle R, Fehér B. Using run-time reconfiguration for fault injection applications. IEEE Transactions on Instrumentation and Measurement 2003;52(5):1468–73. October.
[4] Carreira J, Madeira H, Silva J. Xception: Software fault injection and monitoring in processor functional units. In: Conference on dependable computing for critical applications (DCCA-5); 1995. p. 135–49, September.
[5] Civera P, Macchiarulo L, Rebaudengo M, Sonza Reorda M, Violante M. Exploiting circuit emulation for fast hardness evaluation. IEEE Trans Nucl Sci 2001;48(Dec):2210–6.
[6] Dupont E, Nicolaidis M, Rohr P. Embedded robustness IPs for transient-error-free ICs. IEEE Des Test Comput 2002;19(3):54–68. May–June.
[7] Ejlali A, Miremadi SG. FPGA-based fault injection into switch-level models. Journal of Microprocessors and Microsystems, Elsevier Science. April 2004;28(5–6): p. 317–27.
[8] Ejlali A, Miremadi SG, Zarandi HR, Asadi G, Sarmadi SB. A hybrid fault injection approach based on simulation and emulation cooperation. In: Proceedings of the IEEE/IFIP international conference on dependable systems and networks (DSN-2003), San Francisco, USA, June 2003; p. 479–88.
[9] Ejlali A, Schmitz MT, Al-Hashimi BM, Miremadi SG, Rosinger P. Combined time and information redundancy for SEU-tolerance in energy-efficient real-time systems. IEEE Trans (VLSI) Syst. April 2006;14(4): p. 323–35.
[10] Folkesson P, Svensson S, Karlsson J. A comparison of simulation based and scan chain implemented fault injection. In: Proceedings of the 28th international symposium on fault-tolerant computing. 1998; p. 284–93.
[11] Gonzales I, Berrojo L. Supporting fault tolerance in an industrial environment: The AMATISTA approach. In: Proceedings of the IEEE international on-line test workshop. 2001; p. 178–83.
[12] .
[13] .
[14] .
[15] Jenn E, Arlat J, Rimen M, Ohlsson J, Karlsson J. Fault injection into VHDL models: The MEFISTO tool. In: Proceedings of the 24th international symposium on fault-tolerant computing. 1994; p. 336–44.
[16] Kim S, Somani AK. Soft error sensitivity characterization for microprocessor dependability enhancement strategy. In: Proceedings of the international conference on dependable systems and networks. June 2002; p. 416–25.
[17] Leveugle R. Fault injection in VHDL descriptions and emulation. In: Proceedings. IEEE international symposium on defect and fault tolerance in VLSI systems. Oct. 2000; p. 414–19.
[18] Leveugle R, Ammari A. Early SEU fault injection in digital, analog and mixed signal circuits: a global flow Design. In: Proceedings of the design, automation and test in Europe conference and exhibition, February 2004, vol. 1, p. 590–95.
[19] Lima F, Rezgui S, Carro L, Velazco R, Reis R. On the use of VHDL simulation and emulation to derive error rates. In: Sixth European conference on radiation and its effects on components and systems. September 2001; p. 253–60.
[20] Maheshwari A, Burleson W, Tessier R. Trading off transient fault tolerance and power consumption in deep submicron (DSM) VLSI circuits. IEEE Trans (VLSI) Syst 2004;12(3):299–311.
[21] Rezgui S, Swift GM, Velazco R, Farmanesh FF. Validation of an SEU simulation technique for a complex processor: PowerPC7400. IEEE Trans Nucl Sci 2002;49(6):3156–62. December.
[22] Shanbhag NR. Reliable and efficient system-on-chip design. Computer 2004;37(3):42–50.
[23] Violante M. Accurate single-event-transient analysis via zero-delay logic simulation. IEEE Trans Nucl Sci 2003;50(6):2113–8. December.
[24] Aguirre M, Tombs JN, Muñoz F, Baena V, Torralba A, Fernández-León A, Tortosa-López F. FT-UNSHADES: A new system for SEU injection, analysis and diagnostics over post synthesis netlists. NASA Military and Aerospace Programmable Logic Devices (MAPLD 2005), Washington DC (USA), 2005.
[25] Tombs J, Aguirre Echanóve MA, Muñoz F, Baena V, Torralba A, Fernández-León A, Tortosa F. The implementation of a FPGA hardware debugger system with minimal system overhead. In: Proceedings of the international conference on field programmable logic and application (FPL), Lecture Notes in Computer Science (LNCS). 2004. p. 1062–66.
[26] Nguyen HT, Yagil Y, Seifert N, Reitsma M. Chip-level soft error estimation method. IEEE Trans. Device and Materials Reliability 2005;5(3):365–81.
[27] Civera P, Macchiarulo L, Rebaudengo M, Sonza Reorda M, Violante A. Exploiting FPGA-based techniques for fault injection campaigns on VLSI circuits. In: IEEE international symposium on defect and fault tolerance in VLSI systems (DFT'01). 2001; p. 250–58.