Microprocessors and Microsystems 34 (2010) 49–61
Reliable data path design of VLIW processor cores with comprehensive error-coverage assessment

Yung-Yuan Chen a,*, Kuen-Long Leu a,b

a Department of Computer Science and Information Engineering, Chung-Hua University, Hsin-Chu, Taiwan
b Department of Electrical Engineering, National Central University, Chung-Li, Taiwan
Article history: Available online 22 November 2009

Keywords: Concurrent error-detection; Error-recovery; Error-coverage analysis; Fault-tolerant processor core; Fault injection
Abstract

In this paper, an effective fault-tolerant framework offering very high error coverage with zero detection latency is proposed to protect the data paths of VLIW processor cores. The feature of zero detection latency is essential to real-time error-recovery. The proposed framework provides error-handling schemes of varying hardware complexity, performance and error coverage from which designers can select. A case study with an experimental VLIW architecture implemented in VHDL was used to demonstrate the impacts of our technique on hardware overhead and performance degradation. Fault injection experiments were performed to characterize the effects of the fault-occurring frequency and of workload variations on the error coverage, and of permanent faults on the length of time spent on error-recovery. The results observed from the experiments show that our approach can protect the VLIW data paths well even in a very severe fault scenario. As a result, the proposed fault-tolerant VLIW core is quite suitable for highly dependable embedded applications.

© 2009 Elsevier B.V. All rights reserved.
1. Introduction

As processor chips become more and more complicated and contain a large number of transistors, processors have a limited operational reliability due to the increased likelihood of faults or radiation-induced soft errors, especially as chip fabrication enters very deep submicron technology [1-3]. As specifically indicated in [4], the bit error rate in a processor is expected to be about 10 times higher than in a memory chip due to the higher complexity of the processor, and a processor may encounter a bit flip once every 10 h. Thus, it is essential to employ fault-tolerant techniques in the design of high-performance superscalar or VLIW processors to guarantee high operational reliability in critical applications. Recently, the reliability issue in high-end processors has been getting more and more attention [4-19]. For example, the Intel Itanium processor provides fault-tolerant features [10], such as an enhanced machine check abort (MCA) architecture with extensive error-correcting code (ECC), to maximize system reliability and availability. Lately, ARM announced a fault-tolerant processor named Cortex-R4F [17] to meet the stringent error-free safety standards and high performance requirements of automotive applications.
As pointed out in [2,20], the soft error rate in combinational logic will increase more quickly than the soft error rate in storage elements. Consequently, logic circuits will rapidly become the dominant source of soft errors for processors that employ ECC to protect the memory elements. So, the dependability of logic circuits will soon become a critical issue in future designs. To mitigate the risks of soft errors in logic circuits, in this study we focus our attention on the protection of the VLIW processors' data paths, particularly the functional modules. We assume that the register file in the data paths is protected by an ECC. By combining this reliable data path design with the control flow checking scheme presented in [19], we can construct a highly reliable VLIW processor.

The two primary parameters of concern for error-detection are the coverage and the latency. When an error occurs in the system, two critical issues should be addressed with respect to detection latency:

1. The length of the error-recovery time increases with an increase in detection latency.
2. How does the detection latency affect the performance of workload execution? In other words, how much error-recovery time is required to overcome the transient faults? Performing the error-recovery process will prolong the workload execution and could lead to a failure to meet the time constraints even though the error has been overcome successfully.
Therefore, the performance analysis needs to include an assessment of the performance degradation resulting from the error-recovery. It is clear that analyzing the latency effect on the performance of workload execution for detection schemes with variable latency will be quite involved, because the variable latency complicates the analysis of the impact of error-recovery on the performance aspect. From this point of view, to minimize the effect of error-detection latency on the recovery performance, zero detection latency is set as our research goal so as to accomplish real-time error-recovery by simply using the cost-effective instruction-retry method. Meanwhile, the feature of zero detection latency also simplifies the assessment of the performance degradation resulting from the error-recovery and of the latency effect on the performance of workload execution.

In this work, an effective fault-tolerant framework with zero detection latency is proposed to protect the data paths of VLIW processor cores. The framework presented here is comprehensive in that it consists of both error-detection and error-recovery methodologies. To verify our fault-robust approach, we implemented the proposed approach in an experimental 32-bit VLIW core and exploited an efficient fault-tolerant verification platform constructed by our team to assess the core's error coverage in various fault environments. By varying the fault attributes, our simulation-based injection tool can generate diverse fault environments (also called fault scenarios), which can be used to effectively validate the capability and the robustness of a fault-tolerant system under various fault scenarios. The assessment of the core's error coverage is used to validate the feasibility of our fault-robust approach.

The rest of this paper is organized as follows. The next section introduces the background of this research, which includes a survey of related work and the motivation of this work. Section 3 proposes a fault-tolerant approach concentrating on the dependable data path design of VLIW processor cores. An experimental 32-bit VLIW core based on the proposed fault-tolerant approach was implemented in VHDL, and the measurements of hardware overhead and performance degradation caused by our fault-robust scheme are presented in Section 4. In Section 5, a thorough error-coverage analysis is conducted to validate our scheme. The conclusions appear in Section 6.
2. Background

2.1. Related work

Previous research in reliable microprocessor design is mainly based on the concept of the time redundancy approach [4-8,11-16], which uses instruction replication and recomputation to detect errors by comparing the results of regular and duplicate instructions. The instruction replication, recomputation schedule and result comparison of regular and duplicate instructions can be accomplished either at the software level, in the source-code compilation phase, to generate redundant code for fault detection [5,6,13,14], or at the hardware level [4,7,11,12,16]. The choice between software-based and hardware-based methodologies is a trade-off among hardware complexity, memory space overhead and performance degradation. The works described in [13,14] adopted software techniques for detecting errors in superscalar and VLIW processors respectively. The compiler-based software redundancy schemes have the advantage of requiring no hardware modifications, but the performance degradation and code growth increase significantly, as pointed out in [4,7]. The hardware redundancy approach requires extra hardware and architectural modification to manage the instruction
replication, recomputation and comparison to detect the errors, with the benefits of lower performance degradation and no extra code space compared to the software redundancy approach. The regular instruction and its recomputation can be executed in the same functional unit at different time slots, using the recomputation-with-shifted-operands technique to rerun the instructions [21], or in different functional units, either concurrently or at different time slots. Previously, the papers [5,6,14] addressed the issue of fault-tolerant design in VLIW processors and employed the software redundancy approach to detect hardware faults. One representative software methodology proposed in [14] exploited the compiler-based redundant-code approach to detect hardware faults that occur in the data paths. In [14], the adopted fault model is basically transient or permanent functional faults that cause at most a single module failure. Because the comparison technique is used to detect the faults, the actual set of covered faults includes multiple module failures. However, the fault model described in [14] does not take the phenomena of multiple faults and common-mode failures into account. Under that restricted fault model, the approach presented in [14] claims 100% fault coverage. Practically, the phenomenon of common-mode failures, which could lead to unsafe failures, cannot be ignored in the development phase, as addressed in [22,23]. The software redundancy approach presented in [14] has the advantages of protecting the register file and of requiring no hardware modification, so it is not necessary to customize the processor core. But it leads to 50-100% performance degradation and 100-200% code growth, according to the results reported in [14]. Moreover, the code growth will increase the space requirement of the program memory and the power consumption of program execution. As indicated in [24], the power consumption due to memory traffic occupies a considerable portion of the system's power consumption. As a result, a software methodology using redundant code to detect faults will significantly increase the power cost of the system. In addition, this software redundancy scheme increases the compiler complexity as well. Finally, the work in [14] did not propose an error-recovery process and lacked a comprehensive error-coverage evaluation under different fault scenarios.

2.2. Motivation

According to the above statements, this work resorts to the hardware redundancy approach to achieve the following benefits: no code modification or growth, and reduced performance degradation and power consumption. However, the hardware redundancy approach pays the cost of hardware modification and overhead. The emphases of this study are described below:

- Adopt a more complete and realistic fault model, which includes multiple faults and common-mode failures.
- Propose a complete error-handling framework, which consists of the error-detection and error-recovery processes.
- Perform a comprehensive error-coverage evaluation under various fault scenarios, covering the error-detection coverage, the error-recovery coverage and the probability of unsafe failures resulting from common-mode failures.
3. Fault-tolerant data path design

The data paths normally comprise a register file and functional modules. A VLIW processor core may possess several different classes of functional modules in the data paths, such as integer ALU
and load/store units. One or more identical modules are provided for a specific functional class. In the following, we first describe the fault model adopted in this study, and then present the methodologies employed in our scheme to detect and recover from errors occurring in the data paths. Following that, we use a case study to demonstrate our fault-tolerant approach. Finally, the design metric competition in terms of hardware overhead, performance degradation and fault tolerance capability is discussed.

3.1. Fault model

Three types of faults, described below, are considered in the error-handling processes:

1. Correlated transient faults [25,26] (e.g., a burst of electromagnetic radiation), which could cause multiple module failures.
2. Single or multiple independent faults, which could cause single or multiple module failures.
3. Near-coincident faults [27,28]: a second fault occurring while recovery from a prior fault is being attempted is called a near-coincident fault; it could worsen the fault scenario and reduce the recovery probability, and this kind of fault can interfere with the recovery process in highly reliable systems.

The faults considered above can be transient or permanent. It is evident that the fault model adopted in this study is more rigorous, complete and realistic compared to the single-fault assumption commonly applied before. Since the potential faults in our fault model could cause multiple module failures, the phenomenon of common-mode failures [22,23] must be taken into account in the error-coverage analysis. Normally, common-mode failures will result in unsafe failures, and therefore the assessment of this failure probability is imperative to validate the system dependability, i.e. to verify whether or not the system satisfies the safety requirements of safety-critical applications. Last but not least, the occurrence of near-coincident faults may cause catastrophic failures and diminish the reliability of highly dependable systems if the error-handling processes do not take near-coincident faults into account. Therefore, the interference of near-coincident faults needs to be considered in the recovery process in order to mitigate its effect on the recovery probability. This implies that, due to the more rigorous fault model and the severe fault situations considered, a more powerful fault-tolerant scheme is required to raise the system reliability to a sound level for safety-critical applications.

3.2. Concurrent error-detection and real-time error-recovery

We note that the length of the error-recovery time mainly depends on the error-detection latency. Hence, the adopted error-detection scheme has a significant impact on the efficiency of the error-recovery. To fulfill the requirements of zero detection latency and real-time error-recovery, the execution results of each instruction must be examined immediately, and if errors are found, the erroneous instructions are retried at once to overcome the errors. So, the error-detection problem can be formalized as how to verify the execution results promptly for each instruction, i.e. how to achieve zero error-detection latency.

3.2.1. Concurrent error-detection

Concurrent error-detection (CED) scheme: We develop a simple CED scheme, which combines the duplication-with-comparison methodology (henceforth referred to as comparison) and the majority voting methodology to achieve zero error-detection latency.
To deal with the correlated transient faults, which may cause the multiple module failures, the triple modular redundancy (TMR) scheme is
enhanced to embed the ability to detect multiple module failures. Exploiting TMR as a detection scheme has the benefit of avoiding activation of the error-recovery procedure while only one faulty unit is detected. In contrast to TMR, the comparison scheme needs to spend time on error-recovery. Consider the following situation: if a permanent fault resides in one unit, a system utilizing the comparison method needs to perform the error-recovery process to overcome the errors whenever the faulty unit is used and the permanent fault is activated to produce output errors. That will significantly degrade the performance. Instead, TMR can tolerate one faulty unit, and therefore no error-recovery is required. Hence, using TMR can lower the performance degradation caused by the error-recovery. However, TMR demands more resources to carry out the error-detection compared to the comparison method. According to the above discussion, if resources allow, TMR is the first choice for the instruction-checking method. Note that the treatment of permanent faults and the effect of permanent faults on recovery efficiency will be explored further in Section 5.3.3.

The following notations are developed first:

x: number of different classes of functional modules in the data paths, x ≥ 1;
n_m(y): number of identical modules in the yth class, n_m(y) > 1 and 1 ≤ y ≤ x. n_m(y) is also the maximum number of instructions that can be executed concurrently in the modules of the yth class;
n_s(y): number of spare modules added to the yth class, n_s(y) ≥ 0;
n_i(y): number of instructions in an execution packet for the yth class, 0 ≤ n_i(y) ≤ n_m(y);
n_i: number of instructions in an execution packet, where n_i = Σ(y = 1..x) n_i(y).

An execution packet is defined as the instructions in the same packet that can be executed in parallel. There are n_m(y) + n_s(y) modules for the yth class, where 1 ≤ y ≤ x. As we know, if n_i(y) × 2 > n_m(y) + n_s(y), then the processor core clearly does not have enough resources to verify the instructions of the yth class in an execution packet concurrently based on the comparison methodology. Under those circumstances, the current execution packet needs to be partitioned into several packets that will be executed sequentially. Given an execution packet, there are three cases to consider for each n_i(y), 1 ≤ y ≤ x:

Case 1: n_i(y) × 2 = n_m(y) + n_s(y). In this case, each instruction in the yth class can be verified by the comparison scheme simultaneously.

Case 2: n_i(y) × 2 < n_m(y) + n_s(y). We can divide the instructions in the yth class into two groups, G(1) and G(2), containing m1 and m2 instructions respectively, where m1 + m2 = n_i(y) and m1, m2 ≥ 0. Instructions in G(1) and G(2) are examined by the TMR scheme and the comparison scheme respectively. The following equations and criterion are used to decide m1 and m2. The equations are m1 × 3 + m2 × 2 ≤ n_m(y) + n_s(y); m1 + m2 = n_i(y); m1, m2 ≥ 0. There may be several solutions to these equations. Because TMR can tolerate one faulty module, in contrast to the comparison scheme, the criterion employed is to choose the solution with the maximal value of m1 among the feasible solutions.

Case 3: n_i(y) × 2 > n_m(y) + n_s(y). Due to the limited resources, the n_i(y) instructions cannot all be verified in the same cycle.
Therefore, we need to partition n_i(y) instructions into two or three sequential packets such that the instructions in each packet can be examined concurrently. Consequently, one or two extra cycles are required to guarantee that each instruction can be verified while it is executed. The method of instruction partitioning is described next.
3.2.1.1. Instruction partitioning method.

if ((n_m(y) = 'odd') and (n_s(y) = 0) and (n_i(y) = n_m(y)))  //situation for a three-packet partition//
then {the n_i(y) instructions need to be partitioned into three sequential packets, and therefore require two extra cycles to finish their execution. We first distribute ⌊n_i(y)/3⌋ instructions to each packet; next, if (n_i(y) − ⌊n_i(y)/3⌋ × 3) = 1, then the remaining instruction is assigned to the first packet; if (n_i(y) − ⌊n_i(y)/3⌋ × 3) = 2, then the remaining two instructions are evenly distributed to the first and second packets.}
else {two-packet partition: if (n_i(y) = 'even') then we distribute n_i(y)/2 instructions to each packet; else ⌊n_i(y)/2⌋ + 1 and ⌊n_i(y)/2⌋ instructions are assigned to the first and second packets, respectively.}

Case 3 example: given n_m(y) = 5 with no spares added and n_i(y) = 5, the five instructions for the yth class will be partitioned into three packets, where the first, second and third packets contain two, two and one instructions respectively; if one spare is added, then three and two instructions are held in the first and second packets respectively.

Case 3 implies that the performance of program execution will be degraded. The degree of performance degradation depends on how frequently Case 3 occurs during the program execution. The compromise between hardware overhead and performance degradation can be accomplished by choosing a proper number of spare modules added to the yth class. In general, the performance degradation in our dependable VLIW processor stems mainly from two sources: the first is the extra cycles demanded for detecting the faults/errors; the second is the time spent on error-recovery in order to overcome the errors in the system. The error-recovery scheme is presented next.
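Before turning to error-recovery, the Case 1-3 selection rules and the partitioning method above can be condensed into a short sketch. The following Python fragment is our own illustration (the paper's design is in VHDL, and helper names such as plan_class_check are hypothetical): it decides which case an execution packet falls into for the yth class, picks (m1, m2) with maximal m1 in Case 2, and applies the partitioning method in Case 3.

```python
def choose_m1_m2(n_i, units):
    """Case 2: m1*3 + m2*2 <= units and m1 + m2 = n_i, maximising m1
    (instructions checked by TMR rather than by comparison)."""
    for m1 in range(n_i, -1, -1):              # prefer the largest feasible m1
        m2 = n_i - m1
        if m1 * 3 + m2 * 2 <= units:
            return m1, m2

def partition(n_i, n_m, n_s):
    """Case 3: split the n_i instructions of one class into 2 or 3 sequential packets."""
    if n_m % 2 == 1 and n_s == 0 and n_i == n_m:     # three-packet situation
        base, rem = n_i // 3, n_i % 3
        sizes = [base, base, base]
        if rem == 1:
            sizes[0] += 1                            # remaining instruction -> first packet
        elif rem == 2:
            sizes[0] += 1                            # two remaining -> first and second packets
            sizes[1] += 1
        return sizes
    if n_i % 2 == 0:                                 # two-packet partition
        return [n_i // 2, n_i // 2]
    return [n_i // 2 + 1, n_i // 2]

def plan_class_check(n_i, n_m, n_s):
    units = n_m + n_s
    if n_i * 2 == units:                             # Case 1: all checked by comparison
        return {"case": 1, "tmr": 0, "cmp": n_i}
    if n_i * 2 < units:                              # Case 2: mix of TMR and comparison
        m1, m2 = choose_m1_m2(n_i, units)
        return {"case": 2, "tmr": m1, "cmp": m2}
    return {"case": 3, "packets": partition(n_i, n_m, n_s)}   # Case 3

# Example from the text: n_m(y) = 5 and n_i(y) = 5.
print(plan_class_check(5, 5, 0))   # no spare  -> {'case': 3, 'packets': [2, 2, 1]}
print(plan_class_check(5, 5, 1))   # one spare -> {'case': 3, 'packets': [3, 2]}
```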
3.2.2. Real-time error-recovery

Error-recovery scheme: Since each instruction is executed and verified at the same time, instruction retry can be exploited to overcome the errors in an efficient manner. When the control unit of the data paths receives abnormal signals from the detection circuits, the error-recovery procedure is activated immediately to recover the erroneous instructions. The following notations are used to explain the proposed error-recovery scheme:

my(i): the ith module in the yth class, where 1 ≤ i ≤ n_m(y) + n_s(y);
TMR(my(i), my(j), my(k)): TMR using my(i), my(j) and my(k), where i ≠ j ≠ k. In the following, the term TMR(my(i), my(j), my(k)) is abbreviated to TMR_y(i, j, k);
r_no: number of retries permitted for an erroneous instruction, where r_no > 0.

To simplify the recovery scheme, the erroneous instructions in the same class are recovered one by one with the TMR scheme if the number of modules in this class is greater than or equal to three; otherwise, the comparison scheme is used. The erroneous instructions in different classes can be recovered in parallel. We allow r_no retries for an erroneous instruction to conquer the errors before declaring the system as fail-safe. Since the TMR scheme, represented as TMR_y(i, j, k), is employed for the instruction retry, an issue arises as to how to determine (i, j, k) for each retry during the recovery of an erroneous instruction. As we know, there are C(n_m(y) + n_s(y), 3) combinations of (i, j, k), where C(n, 3) denotes the number of ways to choose three modules out of n. Let S_TMR be the set that contains these C(n_m(y) + n_s(y), 3) combinations of TMR_y(i, j, k). Hence, S_TMR can be represented as {TMR_y(1, 2, 3), ..., TMR_y(1, 2, n_m(y) + n_s(y)), ..., TMR_y(1, n_m(y) + n_s(y) − 1, n_m(y) + n_s(y)), TMR_y(2, 3, 4), ..., TMR_y(2, n_m(y) + n_s(y) − 1, n_m(y) + n_s(y)), ..., TMR_y(n_m(y) + n_s(y) − 2, n_m(y) + n_s(y) − 1, n_m(y) + n_s(y))}, where n_m(y) + n_s(y) ≥ 3. It is clear that selecting, for example, TMR_y(1, 2, 3) constantly for each retry during the recovery of an erroneous instruction is the simplest approach, which has the advantage of simple implementation, but the recovery capability is confined to TMR_y(1, 2, 3). This implies that such a simple retry scheme is less effective because it lacks the flexibility of forming various TMR_y(i, j, k) to recover the erroneous instructions. In contrast, selecting elements one by one following the element sequence in S_TMR for the instruction retries is a complex approach. Such an approach suffers from a high implementation cost, but on the other hand it enjoys a very high recovery probability and can tolerate n_m(y) + n_s(y) − 2 faulty modules in the yth class if we choose r_no ≥ C(n_m(y) + n_s(y) − 1, 2). So, it is apparent that the (i, j, k) selecting strategy for instruction retry influences the implementation complexity of the control unit and the number of faulty modules that can be tolerated. A sound selecting strategy for (i, j, k) is presented next.

Selecting strategy: On the basis of the above discussion, a set named SS_TMR, a subset of S_TMR, is created to guide the instruction-retry process. SS_TMR is given below:

SS_TMR = {TMR_y(i, i + 1, i + 2), where 1 ≤ i ≤ n_m(y) + n_s(y) − 2}.

Example 1. Given n_m(y) = 4, n_s(y) = 1 and r_no = 6, then SS_TMR = {TMR_y(1, 2, 3), TMR_y(2, 3, 4), TMR_y(3, 4, 5)}. Therefore, the sequence of six retries for each erroneous instruction will be {TMR_y(1, 2, 3), TMR_y(2, 3, 4), TMR_y(3, 4, 5), TMR_y(1, 2, 3), TMR_y(2, 3, 4), TMR_y(3, 4, 5)}.

As seen from SS_TMR, the proposed retry method possesses a high regularity in its selecting strategy, so the SS_TMR strategy is easy to implement compared to the S_TMR strategy. The following example is used to discuss the error-recovery ability of the S_TMR and SS_TMR selecting strategies.

Example 2. Given n_m(y) = 4 and n_s(y) = 1. As stated before, n_m(y) + n_s(y) − 2 = 3 faulty modules can be completely tolerated by the S_TMR strategy if we choose r_no ≥ 6. The sequence of five retries for each erroneous instruction based on the S_TMR strategy is S(r_no = 5) = {TMR_y(1, 2, 3), TMR_y(1, 2, 4), TMR_y(1, 2, 5), TMR_y(1, 3, 4), TMR_y(1, 3, 5)}. The faulty pattern (fmy(i), fmy(j), fmy(k)) is used to represent the ith, jth and kth modules in the yth class being faulty. For r_no = 5, we can verify that the faulty pattern (fmy(1), fmy(2), fmy(3)) will cause recovery failure according to the set S(r_no = 5) if the faulty pattern remains during the recovery process. Therefore, choosing r_no to be five cannot completely tolerate three faulty modules. The set S(r_no = 6) is formed by adding the element TMR_y(1, 4, 5) to the set S(r_no = 5).
It is evident that the ten faulty patterns (fmy(i), fmy(j), fmy(k)) for three faulty modules can all be tolerated by the recovery process based on the set S(r_no = 6). For the SS_TMR strategy, assuming three out of the five modules are faulty, the occurrence of one of the following faulty patterns: (fmy(1), fmy(3), fmy(4)), (fmy(2), fmy(3), fmy(4)) or (fmy(2), fmy(3), fmy(5)) will lead to recovery failure if the faulty pattern remains during the recovery process. This phenomenon can easily be verified from the retry sequence shown in Example 1.
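To make the two examples concrete, the following short Python sketch is our own illustration (helper names such as ss_tmr and recovers are hypothetical, not the paper's implementation): it generates the SS_TMR retry sequence and checks whether a persistent faulty pattern can be recovered by it.

```python
def ss_tmr(n_m, n_s):
    """SS_TMR = { TMR_y(i, i+1, i+2) : 1 <= i <= n_m + n_s - 2 }."""
    return [(i, i + 1, i + 2) for i in range(1, n_m + n_s - 1)]

def retry_sequence(n_m, n_s, r_no):
    """Triples of modules used for r_no successive retries (cycled in order)."""
    base = ss_tmr(n_m, n_s)
    return [base[k % len(base)] for k in range(r_no)]

def recovers(sequence, faulty):
    """A retry succeeds once some triple contains at most one faulty module,
    assuming the faulty pattern persists during recovery."""
    return any(len(faulty.intersection(t)) <= 1 for t in sequence)

# Example 1: n_m = 4, n_s = 1, r_no = 6
print(retry_sequence(4, 1, 6))
# [(1, 2, 3), (2, 3, 4), (3, 4, 5), (1, 2, 3), (2, 3, 4), (3, 4, 5)]

# Example 2: the persistent pattern (fm1, fm3, fm4) defeats SS_TMR, while a
# pattern such as (fm2, fm4, fm5) is recovered by the triple (1, 2, 3).
print(recovers(retry_sequence(4, 1, 6), {1, 3, 4}))   # False
print(recovers(retry_sequence(4, 1, 6), {2, 4, 5}))   # True
```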
After analyses with several values of n_m(y) and n_s(y), we decided to adopt the SS_TMR selecting strategy for the following reasons: the first is its low implementation cost compared to the S_TMR strategy; the second is that most faults are of the transient type and may disappear during the recovery process. Based on the second reason, we can infer that both selecting strategies should have similar fault tolerance capabilities. We can see that the recovery time required for each erroneous instruction is between one and r_no clock cycles, where r_no cycles is the maximum allowable recovery time before the processor core enters the fail-safe state. In summary, our error-recovery scheme can provide the capability of real-time error-recovery.

3.3. Reliable data path design: a case study

In the following illustration, for simplicity of demonstration, we assume only one class of functional module, namely the ALU, in the data paths. In this case study, the original VLIW core contains three ALUs (n_m(1) = 3), and therefore at most three ALU instructions can be issued per cycle. A spare ALU (n_s(1) = 1) is added to prevent severe performance degradation, as explained below. From the CED scheme described in Section 3.2, we note that if no spare is added, the execution packets with n_i(1) = 2 or 3 fall into Case 3. Consequently, the performance may be degraded significantly. Hence, the cost of a spare is paid to lower the performance degradation. Clearly, adding three spares in order to completely eliminate the performance degradation may not be a feasible choice. Given n_m(1) = 3 and n_s(1) = 1, according to the CED scheme, n_i(1) = 1 falls into Case 2. The (m1, m2) can be (1, 0) or (0, 1); (1, 0) is selected as the final solution. So, if an execution packet contains only one ALU instruction, it is verified by the TMR scheme. For n_i(1) = 2, it is Case 1, and each instruction is verified by the comparison scheme. For n_i(1) = 3, it is Case 3: the three concurrent ALU instructions require scheduling into two sequential execution packets, where one packet contains two instructions and the other holds the remaining one; consequently, one extra ALU cycle is required to complete the execution of three concurrent ALU instructions for the error-detection need. Fig. 1 illustrates this case study.

CED process: The notation CPR_ALU(i, j) is used to denote an instruction verified with the comparison scheme using the ith and jth ALUs.
while (not end of program) {
  switch (n_i(1)) {
    case '1':
      I1: TMR_ALU(1, 2, 3);
      if (TMR_MV detects more than one ALU failure) then 'Error Signal' is turned on and the 'Error-recovery process' is activated to recover the failed instruction
    case '2': //the execution packet contains two instructions: I1 and I2//
      I1: CPR_ALU(1, 2); I2: CPR_ALU(3, 4);
      if (I1 fails) then 'Error Signal' is turned on and the 'Error-recovery process' is activated to recover I1
      if (I2 fails) then 'Error Signal' is turned on and the 'Error-recovery process' is activated to recover I2
    case '3': //the packet is divided into two packets and executed sequentially//
  }
}

Error-recovery process: {
  i ← 1;
  while (r_no > 0) {
    TMR_ALU(i, i + 1, i + 2);
    if (TMR_ALU succeeds) then the error-recovery succeeds and the process exits;
    else { r_no ← r_no − 1; i ← i + 1; if (i ≥ 3) then i ← 1; }
  }
  recovery failure: the processor core enters the fail-safe state
}
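As a behavioural reference for the checkers invoked above (an assumption of ours, not the paper's RTL), the following Python fragment shows how the comparator and the TMR majority voter classify ALU results; note that identical erroneous results from two modules pass both checkers, which is exactly the common-mode (fail-unsafe) case analyzed in Section 5.

```python
def compare(r1, r2):
    """Duplication with comparison: deliver the result or raise the error signal."""
    return (r1, "ok") if r1 == r2 else (None, "error")   # 'error' -> retry the instruction

def tmr_vote(r1, r2, r3):
    """TMR majority vote: a single dissenting module is masked; if no majority
    exists, more than one module has failed and error-recovery is activated."""
    if r1 == r2 or r1 == r3:
        return (r1, "ok")
    if r2 == r3:
        return (r2, "ok")
    return (None, "error")

# n_i(1) = 1: the single ALU instruction runs on ALU_1..ALU_3 and is voted.
print(tmr_vote(42, 42, 7))     # (42, 'ok')      -- one faulty ALU masked, no recovery needed
print(tmr_vote(42, 7, 13))     # (None, 'error') -- multiple failures, recovery activated

# n_i(1) = 2: I1 on ALU_1/ALU_2 and I2 on ALU_3/ALU_4, each pair compared.
print(compare(10, 10))         # (10, 'ok')
print(compare(10, 11))         # (None, 'error') -- instruction retried via TMR_ALU(i, i+1, i+2)

# If two modules deliver the same wrong value, both checkers accept it;
# this is the common-mode (fail-unsafe) case quantified in Section 5.
```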
Fig. 1. A case study: the left/middle/right parts of the figure correspond to one/two/three instructions in a packet, where MV is an abbreviation for majority voter and CPR for comparator.
3.4. Design metric competition

In this section, we discuss the design metric competition in terms of hardware overhead, performance degradation and fault tolerance capability. Table 1 lists the design metric data derived from the CED scheme presented in Section 3.2 for various numbers of identical modules in a functional class, where n_m and n_s represent the numbers of modules and spares, respectively.
Table 1
Data of design metrics for various numbers of identical modules.

n_m | n_s | Hardware overhead (%) | Instruction partitioning | CED scheme
3 | 1 | 33.3 | 3 → (2, 1) | 1: TMR; 2: comparison
4 | 0 | 0 | 3 → (2, 1); 4 → (2, 2) | 1: TMR; 2: comparison
4 | 1 | 25 | 3 → (2, 1); 4 → (2, 2) | 1: TMR; 2: (1: TMR, 1: comparison)
4 | 2 | 50 | 4 → (2, 2) | 1: TMR; 2: TMR; 3: comparison
5 | 0 | 0 | 3 → (2, 1); 4 → (2, 2); 5 → (2, 2, 1) | 1: TMR; 2: (1: TMR, 1: comparison)
5 | 1 | 20 | 4 → (2, 2); 5 → (3, 2) | 1: TMR; 2: TMR; 3: comparison
5 | 2 | 40 | 4 → (2, 2); 5 → (3, 2) | 1: TMR; 2: TMR; 3: (1: TMR, 2: comparison)
6 | 0 | 0 | 4 → (2, 2); 5 → (3, 2); 6 → (3, 3) | 1: TMR; 2: TMR; 3: comparison
6 | 1 | 16.7 | 4 → (2, 2); 5 → (3, 2); 6 → (3, 3) | 1: TMR; 2: TMR; 3: (1: TMR, 2: comparison)
6 | 2 | 33.3 | 5 → (3, 2); 6 → (3, 3) | 1: TMR; 2: TMR; 3: (2: TMR, 1: comparison); 4: comparison
The hardware overhead design metric in Table 1 simply counts the functional modules themselves and does not include other circuit portions resulting from the demand for fault tolerance, such as comparators and majority voters. As stated earlier, performing error-detection can lead to performance degradation due to the partitioning of instructions into several sequential packets. We use the notation 3 → (2, 1), for example, to represent the partition of a packet containing three instructions into two sequential packets, where one packet contains two instructions and the other holds the remaining one. Similarly, 5 → (2, 2, 1) will induce two extra cycles. The notations shown in the CED scheme column represent the methodologies used to detect instruction execution errors. The notations can be classified into two types: the first type, for instance '2: (1: TMR, 1: comparison)', represents an execution packet with two instructions where one instruction is verified by the TMR technique and the other is examined by the comparison scheme; the second type, for instance '2: TMR', represents an execution packet containing two instructions where each instruction is verified by the TMR scheme. As can be seen from Table 1, for a specific n_m, our approach offers several design options based on the trade-off among the metrics of hardware overhead, performance degradation and fault tolerance capability. Without loss of generality, we use n_m = 5 as an example to explain the design trade-off. For n_m = 5, there are three design options illustrated in Table 1. For a performance-oriented design, according to the instruction partitioning data shown in Table 1, we could choose the design using n_s = 1, i.e. the design using one spare, two majority voters and three comparators. We note that the design using two spares has the same performance level as the one-spare design. Clearly, the design with no spare enjoys the lowest hardware overhead but suffers from the highest performance degradation and lowest dependability among the three design options. Generally speaking, the designs with more spares have the advantages of lower performance degradation and better dependability but suffer from a higher hardware overhead. We should point out that not every functional class must be protected with a fault-tolerant design. We can assess the vulnerability of the data paths in a processor core to identify the critical functional classes, which are the candidates to be protected. The decision of which classes to protect depends on the requirement specification and the trade-off among the design metrics of hardware overhead, performance degradation and fault tolerance capability.
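As a cross-check of Table 1, the short self-contained sketch below (our own condensation of the Case 1-3 rules, not the authors' tooling; the function name table1_row is hypothetical) regenerates one row: the spare-area overhead, the packet sizes that require partitioning, and the checking scheme applied to each feasible packet size.

```python
def table1_row(n_m, n_s):
    units = n_m + n_s
    overhead = 100.0 * n_s / n_m                  # spare-module area only, as in Table 1
    partitions, ced = [], []
    for n_i in range(1, n_m + 1):
        if n_i * 2 > units:                       # Case 3: packet must be partitioned
            if n_m % 2 == 1 and n_s == 0 and n_i == n_m:
                sizes = (n_i // 3 + (n_i % 3 > 0), n_i // 3 + (n_i % 3 > 1), n_i // 3)
            elif n_i % 2 == 0:
                sizes = (n_i // 2, n_i // 2)
            else:
                sizes = (n_i // 2 + 1, n_i // 2)
            partitions.append(f"{n_i} -> {sizes}")
        elif n_i * 2 == units:                    # Case 1: duplication with comparison
            ced.append(f"{n_i}: comparison")
        else:                                     # Case 2: maximise TMR-checked instructions
            m1 = next(m for m in range(n_i, -1, -1) if 3 * m + 2 * (n_i - m) <= units)
            m2 = n_i - m1
            ced.append(f"{n_i}: TMR" if m2 == 0 else f"{n_i}: ({m1}: TMR, {m2}: comparison)")
    return overhead, partitions, ced

print(table1_row(5, 1))
# (20.0, ['4 -> (2, 2)', '5 -> (3, 2)'], ['1: TMR', '2: TMR', '3: comparison'])
# i.e. the n_m = 5, n_s = 1 row of Table 1
```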
4. Hardware implementation and performance evaluation: an experimental study

To validate the proposed approach, an experimental 32-bit fault-tolerant VLIW processor core was developed. Fig. 2 illustrates the architectural implementation. The processor core contains two classes of functional modules in the data paths: ALUs and load/store units. In this experimental study, the error-handling schemes presented in Section 3.3 were used to protect the class of ALUs; for simplicity of demonstration, we did not provide protection for the class of load/store units. The features of this 32-bit VLIW core are as follows: the instruction set is composed of 25 32-bit instructions; each ALU includes a 32 × 32 multiplier; a register file containing 32 32-bit registers with 12 read and five write ports is shared by the modules and is designed with bypass multiplexors that forward written data to the read ports when a simultaneous read and write to the same entry is commanded; the data memory is 1 k × 32 bits. The structure consists of five pipeline stages: 'instruction fetch and instruction dispatch (IF & ID)', 'decode and operand fetch from register file (DRF)', 'execution (EXE)', 'data memory reference (MEM)' and 'write back into register file (WB)'. This experimental processor core can issue at most three ALU and three load/store instructions per cycle. Note that the 'Error Analysis' block in the execution stage displayed in Fig. 2 was created to facilitate the measurement of the error coverage during the fault injection campaigns. The purpose of the 'Error Analysis' block will be explained in Section 5.2.

In our VLIW processor, each instruction is 32 bits long and a bundle contains six instructions. A 198-bit bundle is shown in Fig. 3, where 'I1'-'I6' are instructions and 'flag' is a 1-bit field. The compiler can utilize the 'flag' field to mix independent and dependent instructions in the same bundle. If 'flag' = 1, then the next instruction can be executed in the same clock cycle; otherwise, the next instruction will be executed in the subsequent clock cycle. The program memory space can be saved significantly with this long instruction format. The processor fetches one bundle at a time and determines how many execution packets are in the bundle. The instructions contained in a packet can be executed in parallel. The functions of the major blocks in Fig. 2 are briefly explained as follows:

Instruction dispatch unit: The functions of this unit include: (1) based on the flag bits, deciding how many execution packets are involved in the current bundle; (2) dispatching the execution packets in sequence; (3) controlling when to update the address of the next sequential bundle.

Instruction partition unit: This unit is responsible for partitioning three concurrent ALU instructions into two sequential execution packets, where one packet contains two instructions and the other contains the remaining one. Meanwhile, the unit sends an idle signal to the 'Main_Control' unit to demand a stall of the previous stages for one clock cycle so that the extra packet induced by the partition can be performed correctly.
Fig. 2. Architecture of fault-tolerant VLIW data paths (CP: comparator, MV: majority voter).
Fig. 3. An instruction bundle: instructions I1-I6, each followed by a 1-bit flag.
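As an illustration of the flag semantics just described (a sketch of ours, not the processor's dispatch logic), the following fragment groups a bundle's instructions into execution packets.

```python
def split_into_packets(bundle):
    """bundle: list of (instruction, flag) pairs, as in Fig. 3.
    flag = 1 means the next instruction belongs to the same packet;
    flag = 0 closes the current packet."""
    packets, current = [], []
    for instr, flag in bundle:
        current.append(instr)
        if flag == 0:                  # next instruction starts a new packet
            packets.append(current)
            current = []
    if current:                        # close any packet left open at the bundle end
        packets.append(current)
    return packets

# A hypothetical bundle: I1-I3 form one packet, I4-I5 another, I6 runs alone.
bundle = [("I1", 1), ("I2", 1), ("I3", 0), ("I4", 1), ("I5", 0), ("I6", 0)]
print(split_into_packets(bundle))      # [['I1', 'I2', 'I3'], ['I4', 'I5'], ['I6']]
```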
Besides that, the unit also delivers the instruction count, i.e. the number of instructions executed in parallel at the present clock cycle, to the 'ALU_Control' unit. Here, the instruction count can be 0, 1 or 2. The last function of this unit is to hold the instructions that require an instant retry when it receives the recovery signal from the 'ALU_Control' unit.

ALU_Control unit: The purpose of this control unit is to carry out the control tasks for the CED and error-recovery schemes. First of all, the control unit exploits the instruction count, i.e. the number of ALU instructions in the current packet, sent from the 'Instruction Partition' unit to determine how to dispatch each instruction in the current packet to the corresponding ALUs. This is achieved through proper control of the 'Dispatch' unit. The 'ALU_Control' unit then checks the correctness of the execution results and activates the error-recovery process if any error has been detected; otherwise, it passes the correct results to the next stage. Again, the error-recovery process will result in a pipeline stall.

A fault-tolerant VLIW processor based on the architecture of Fig. 2 and the features mentioned previously was realized in VHDL at the register-transfer level (RTL). The implementation data for a UMC 0.18 μm process are shown in Table 2. The area does not include the instruction memory or the 'Error Analysis' block. It is worth noting that the overhead of the 'ALU_Control' unit is only 0.26% of the area of the non-fault-tolerant VLIW core. This implies that the control task of our approach is simple and easy to implement. The performance of the VLIW core may be affected by introducing the reliable data path design into the 'EXE' stage. In this illustration, we simply resorted to the synthesis tool to synthesize the fault-tolerant core under the time constraint of the same clock frequency as the non-fault-tolerant core.
Table 2
Comparing our approach with the non-fault-tolerant VLIW core.

 | Area (μm²) | Overhead | ALU_Control (μm²) | Overhead | Core clock (MHz)
Non-fault-tolerant VLIW | 9,319,666 | | | | 128
Our approach | 10,708,296 | 14.9% | 24,215 | 0.26% | 128
Table 3
The results of performance degradation for several benchmark programs.

Program | Degradation (%)
Heapsort | 0.6
FFT | 4.4
Quicksort | 5.9
Four queens | 6.8
2·Σ(i = 1..5) Ai·Bi | 10.5
N!, N = 10 | 17.7
5 × 5 matrix multiplication | 28.4
IDCT (8 × 8) | 34.3
Table 3 lists the performance degradation caused by the CED demand. The performance degradation is between 0.6% and 34.3% for the eight benchmark programs used. Note that the impacts of various input data sequences on the performance degradation of the programs 'Heapsort' and 'Quicksort' are considered, and therefore the data for those two programs represent the average performance degradation. As mentioned before, the degree of performance degradation is related to the frequency of three ALU instructions being executed simultaneously. This implies that the IDCT program has the highest frequency of three ALU instructions executed concurrently among the eight benchmark programs, as shown in Table 3.

5. Error-coverage assessment

In this section, an error-coverage assessment based on fault injection [29,30] was conducted to validate our scheme. A comprehensive fault tolerance verification platform comprising a simulation-based fault injection tool [31], the ModelSim VHDL simulator and a data analyzer has been built. It offers the capability to effectively handle the operations of fault injection, simulation and error-coverage analysis. The core of the verification platform is the fault injection tool, which can inject transient and permanent faults into VHDL models of digital systems at chip, RTL and gate levels during the design phase. The tool uses the built-in commands of VHDL simulators to inject the faults into VHDL simulation models. The injection tool can inject the following classes of faults: '0' and '1' stuck-at faults, 'Z' (high-impedance) and 'X' (unknown) faults. A Weibull fault distribution is employed to decide the time instant of fault injection. A new characteristic of our tool is that it offers users a statistical analysis of the injected faults. The statistical data for each injection
campaign exhibit the degree of fault severity, which represents a fault scenario (also called a fault environment). The degree of fault severity is related to the probability of i faults (denoted as Pi) occurring concurrently while a fault-tolerant system is simulated in the injection campaign, where i ≥ 1. For example, P1 = 98% and P2 = 2% mean that, over the fault occurrences throughout the injection campaign, 98% are single faults and 2% are two concurrent faults. Hence, the injection tool can assist us in creating the proper fault environments that can be used to effectively validate the capability and the robustness of a fault-tolerant system under various fault scenarios. As a result, the validation process will be more comprehensive and complete. In a word, the proposed verification platform helps us raise the efficiency and validity of the dependability analysis.

5.1. Fault-tolerant design metrics

Since our approach uses the comparison and TMR methods, it is intuitive that the following types of errors will escape being detected or recovered: (1) two functional modules produce the same, erroneous results at the comparator; (2) two or three functional modules produce identical, erroneous results at the TMR voter. Such defects will result in unsafe failures (also called common-mode failures [23]). Three possible outcomes can happen for each instruction retry in the error-recovery process using the TMR scheme. One possibility is that the recovery is successful; another is that the retry fails and the core enters the fail-safe state; the last possibility is that two or three functional modules produce identical, erroneous results at the TMR voter such that the core encounters the fail-unsafe hazard. Fig. 4 illustrates the error-handling process in our fault-tolerant processor core, where the notations are explained below.

Fig. 4. Error-handling graph of the fault-tolerant mechanism.

As can be seen from Fig. 4, if errors happen, the system can enter one of the following states: 'successful recovery', 'fail-safe' and 'fail-unsafe'. The following notations are used in the assessment of the fault-tolerant design metrics:

Ce-det: error-detection coverage, i.e. probability of errors detected.
Ce-rec: error-recovery coverage, i.e. probability of errors recovered given errors detected.
Ce: error coverage, i.e. probability of errors detected and recovered.
Pf-s: probability of the system entering the fail-safe state due to the errors.
Pf-uns: probability of the system entering the fail-unsafe state due to the errors.
Pt-det-f-s: state transition probability from the 'detected' state to the 'fail-safe' state.
Pt-det-f-uns: state transition probability from the 'detected' state to the 'fail-unsafe' state.
Pf-uns-det: probability of the system entering the fail-unsafe state due to the detection defects stated above.
Pf-uns-rec: probability of the system entering the fail-unsafe state due to the recovery defects stated above.

The design metrics Ce-det, Ce-rec, Ce, Pf-s and Pf-uns will be exploited to justify the proposed fault-tolerant approach. The error-related parameters Ne, Ne-det, Ne-esc-det, Ne-rec, Ne-nrec-f-s and Ne-nrec-f-uns represent, respectively, the total number of errors that occurred, the number of errors detected, the number of errors that escaped detection, the number of errors recovered, the number of errors not recovered with the system entering the 'fail-safe' state, and the number of errors not recovered with the system entering the 'fail-unsafe' state. The following expressions are obtained:

Ce-det = Ne-det / Ne;  Ce-rec = Ne-rec / Ne-det;  Ce = Ce-det × Ce-rec    (1)

Pt-det-f-s = Ne-nrec-f-s / Ne-det;  Pt-det-f-uns = Ne-nrec-f-uns / Ne-det;
Pf-s = Ce-det × Pt-det-f-s;  Pf-uns-rec = Ce-det × Pt-det-f-uns    (2)

Pf-uns-det = Ne-esc-det / Ne;  Pf-uns = Pf-uns-det + Pf-uns-rec    (3)

Ne = Ne-det + Ne-esc-det;  Ne-det = Ne-rec + Ne-nrec-f-s + Ne-nrec-f-uns    (4)
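Expressions (1)-(4) translate directly into a small metric calculator. The sketch below is our own illustration, using hypothetical counter values rather than measured data; the argument names mirror the error-related parameters above.

```python
def design_metrics(ne_det, ne_esc_det, ne_rec, ne_nrec_fs, ne_nrec_funs):
    ne = ne_det + ne_esc_det                       # Eq. (4): total number of errors
    assert ne_det == ne_rec + ne_nrec_fs + ne_nrec_funs
    ce_det = ne_det / ne                           # error-detection coverage
    ce_rec = ne_rec / ne_det                       # error-recovery coverage
    ce = ce_det * ce_rec                           # overall error coverage, Eq. (1)
    pf_s = ce_det * (ne_nrec_fs / ne_det)          # fail-safe probability, Eq. (2)
    pf_uns = ne_esc_det / ne + ce_det * (ne_nrec_funs / ne_det)   # Eq. (3)
    return {"Ce-det": ce_det, "Ce-rec": ce_rec, "Ce": ce, "Pf-s": pf_s, "Pf-uns": pf_uns}

# Hypothetical counts from one campaign (not the paper's data):
print(design_metrics(ne_det=990, ne_esc_det=10, ne_rec=970, ne_nrec_fs=15, ne_nrec_funs=5))
# -> Ce-det 0.99, Ce-rec ≈ 0.9798, Ce ≈ 0.97, Pf-s ≈ 0.015, Pf-uns ≈ 0.015
```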
5.2. Hardware-assisted metric evaluation technique

The fault injection campaigns were used to derive the values of the error-related parameters described in the last section. The design metrics can then be calculated by expressions (1)-(3). To lower the development complexity of the error analysis tools, we developed a hardware-assisted metric evaluation technique to speed up the evaluation of the fault-tolerant design metrics. The principal idea of this error analysis technique is to add a measuring block, in hardware form, that helps collect the desired data during the fault simulation in order to measure the design metrics of interest. A measuring block termed 'Error Analysis', shown in Fig. 2, was generated to count the above error-related parameters in an efficient manner. The counting method implemented in the 'Error Analysis' block is briefly depicted below. From the preceding discussion, it is apparent that the correct results of each instruction executed in the ALUs are necessary for measuring the error-related parameters during the fault simulation. In the 'Error Analysis' block, the correct results of each ALU instruction are compared with the corresponding outcomes of the ALUs, so that errors occurring at the ALU outputs, as well as occurrences of common-mode failures, can be detected and counted. So, we provide an extra field for each machine instruction in an instruction bundle, as illustrated in Fig. 3, to store the correct results for ALU instructions. If an instruction is not of ALU type, its extra field is left unused. In this way, the correct result of each ALU instruction can be provided and, importantly, the length of an instruction bundle stays fixed, which simplifies the dispatching of instruction bundles and reduces the implementation complexity of the 'Instruction Dispatch' unit shown in Fig. 2.
It should be noted that the benchmark programs constructed in this way are only for fault simulation usage.

5.3. Experimental results

We have conducted a huge number of fault injection campaigns to validate the proposed fault-tolerant VLIW scheme shown in Fig. 2 under various fault environments. Five benchmark programs, including N! (N = 10), 5 × 5 matrix multiplication, 2·Σ(i = 1..5) Ai·Bi, heapsort and quicksort, were developed and used in the fault injection campaigns to analyze the design metrics mentioned in Section 5.1. We first performed a comprehensive experiment to explore a particular fault-related parameter, namely the fault-occurring frequency, to see its impact on the fault-tolerant metrics. By adjusting the fault-occurring frequency, we can create a variety of fault scenarios, which can be used to measure how robust our fault-tolerant system remains under the various fault scenarios. In addition to the investigation of the above fault-related parameter, we also examined the effects of workload variations on the error coverage and of permanent faults on the recovery time.

The common rules of fault injection for the experiments are: (1) the value of a fault was selected randomly from s-a-1 and s-a-0; (2) the injection targets covered the entire 'EXE' stage except the 'load/store' and 'Error Analysis' units, as shown in Fig. 2. To inject faults into the inside of the adders and multipliers, those components were implemented at the gate level. As shown in [7], the fault rate of a component is proportional to its circuit area. Therefore, our fault injection process uses the circuit areas of the components considered as injection targets to decide the probability of a fault being injected into a specific component when a fault occurs. Intuitively, the area ratio of a specific component corresponds to the probability of a fault being injected into that particular component. So, the fault distribution for each injection campaign is based on the circuit areas of the components located in the 'EXE' stage. The common fault injection parameters are: α = 1 (useful-life phase), failure rate λ = 0.001, probability of permanent fault occurrence = 0, fault duration = 5 clock cycles. In addition, the number of retries r_no is set to four. In the following, we discuss the outcomes obtained from the experiments.

5.3.1. Fault-occurring frequency

The goal of this experiment is to observe the effect of the fault-occurring frequency on the design metrics described in Section 5.1. In this experiment, we copy each of the following benchmark programs: 'N! (N = 10)', '5 × 5 matrix multiplication' and '2·Σ(i = 1..5) Ai·Bi' four times, and the twelve programs are then combined in a random sequence to form a workload, namely Workload 1, for the fault simulation. The length of Workload 1 is equal to 4384 (clocks) × 30 (ns/clock). Note that if the workload and fault duration are constant, the quantity of faults injected in an injection campaign, i.e. the fault-occurring frequency, will influence the degree of fault overlap.
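The campaign setup described above can be sketched as follows (our own illustration, not the injection tool's code; the component areas below are invented placeholders). It draws injection instants from a Weibull distribution with shape α = 1 and rate λ, picks the stuck-at value at random, and selects the target component with probability proportional to its circuit area.

```python
import random

def plan_campaign(components, n_faults, lam=0.001, alpha=1.0, duration=5):
    """components: dict mapping component name -> circuit area (assumed figures)."""
    names = list(components)
    total_area = sum(components.values())
    weights = [components[n] / total_area for n in names]
    faults = []
    for _ in range(n_faults):
        t = random.weibullvariate(1.0 / lam, alpha)        # injection instant (cycles)
        target = random.choices(names, weights=weights)[0]  # area-proportional target
        value = random.choice(["s-a-0", "s-a-1"])
        faults.append({"time": t, "target": target, "value": value, "duration": duration})
    return sorted(faults, key=lambda f: f["time"])

# Hypothetical area figures for the 'EXE'-stage injection targets:
exe_stage = {"ALU_1": 1.0, "ALU_2": 1.0, "ALU_3": 1.0, "ALU_4": 1.0, "ALU_Control": 0.05}
for f in plan_campaign(exe_stage, n_faults=3):
    print(f)
```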
Table 4
Statistical data for various fault injection campaigns.

Pi (%) | 100 faults | 500 faults | 1000 faults | 1500 faults | 2000 faults
P1 | 96.77 | 76.88 | 55.46 | 37.04 | 22.19
P2 | 3.02 | 20.08 | 32.69 | 37.47 | 34.23
P3 | 0.21 | 2.85 | 9.95 | 19.28 | 28.02
P4 | | 0.15 | 1.77 | 5.43 | 12.44
P5 | | 0.04 | 0.129 | 0.72 | 2.76
P6 | | | 0.001 | 0.057 | 0.32
P7 | | | | 0.003 | 0.037
P8 | | | | | 0.003
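Overlap statistics of the kind reported in Table 4 can be derived from the injection times and the fault duration. The fragment below is a simplified sketch of ours (the real tool may define Pi differently): it counts, for each injected fault, how many faults are simultaneously active.

```python
from collections import Counter

def overlap_statistics(injection_times, duration):
    times = sorted(injection_times)
    counts = Counter()
    for t in times:
        # number of faults whose active interval overlaps [t, t + duration)
        active = sum(1 for u in times if u < t + duration and u + duration > t)
        counts[active] += 1
    total = len(times)
    return {i: 100.0 * c / total for i, c in sorted(counts.items())}

# Toy example: three faults of duration 5 cycles, two of which overlap.
print(overlap_statistics([10, 12, 40], duration=5))   # {1: 33.3..., 2: 66.6...}
```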
Fig. 5. Fault-tolerant metric analysis: (a) coverage and (b) probabilities of fail-safe and fail-unsafe.
For instance, while the quantity of faults injected increases, the degree of fault overlap becomes more serious. In other words, different fault-occurring frequencies lead to different fault environments. Hence, in order to investigate the effect of the fault-occurring frequency on error coverage, we conducted a number of fault injection campaigns with various numbers of faults injected. The statistical analysis of an injection campaign is able to disclose the fault activity within the simulation. Table 4 lists the statistical data for five injection campaigns, where Pi is the probability of i faults occurring concurrently while the fault-tolerant system is simulated in the injection campaign. Clearly, from Table 4, the larger the number of faults injected, i.e. the higher the fault-occurring frequency, the worse the fault environment becomes due to the increased frequency of multiple faults, including correlated, mutually independent and near-coincident transient faults. Therefore, the statistical analysis helps designers choose a set of desired fault scenarios to test the ability of fault-tolerant systems. As a result, the proposed fault-tolerant verification platform can furnish more comprehensive and solid error-coverage measurements.

Fig. 5 characterizes the effect of the fault-occurring frequency on the fault-tolerant design metrics. The experimental results obtained have a 95% confidence interval of ±0.138% to ±0.983%. The outcomes shown in Fig. 5 reveal the fault tolerance capability of our scheme in the various fault environments. It is evident that the error coverage decreases as the fault-occurring frequency increases. Meanwhile, the processor core has a higher chance of entering the fail-safe and fail-unsafe states when the probability of occurrence of multiple faults rises. A safe failure occurs once the error-recovery process cannot overcome the errors due to a serious fault situation. Overall, the results presented in Fig. 5 are quite positive and sound, and they demonstrate the effectiveness of our fault-tolerant scheme even in a very severe fault environment. Another interesting point is how much the probability of unsafe failure can be reduced by our approach. To know the amount of reduction, we need to obtain the probability of unsafe failure for the non-fault-tolerant VLIW core. Since the fault environment has an effect on the probability of unsafe failure, for simplicity, we select a particular fault environment to show the amount of reduction achieved by our approach. The fault environment selected in this demonstration is the injection campaign with 100 faults injected, as shown in Table 4. The outcomes of the injection campaign with the non-fault-tolerant core reveal that its probability of unsafe failure is 0.4447. As a result, our approach significantly reduces the probability of unsafe failure from 0.4447 to 0.0069. Apparently, this huge reduction drastically enhances the safety level of the processor core.

5.3.2. Workload effect on error coverage and recovery performance

The goal of this experiment is to characterize the effect of workload variation on the error coverage and recovery performance. As stated earlier in Section 3.3, one or two ALU instructions per cycle can be performed in the execution stage displayed in Fig. 2.
We see that there is a difference in carrying out the error-detection between one and two instructions executed, i.e. TMR and comparison for one and two executed instructions respectively. Three workloads were created for this experiment to investigate the impact of the variation of the one/two/three ALU instruction ratios on the error coverage. Workloads 2 and 3 are 'heapsort' and 'quicksort' respectively. The construction method of Workload 4 is the same as that of Workload 1 except that seven copies of each benchmark program, instead of four, were used to build Workload 4. Table 5 provides the data collected from the execution stage while the workloads were executed on the proposed fault-tolerant VLIW processor. The ratios for each workload shown in Table 5 indicate the probability of one or two ALU instructions being performed in the execution stage while an ALU packet is executed. We chose these three workloads in an attempt to fairly exhibit the impact of workloads on the fault-tolerant design metrics. To broaden the investigation of this experiment, we generated three different fault environments in which the degrees of fault severity range from low to high. The concept of fault environment is the same as in the previous experiment, and therefore we do not provide the detailed statistical data (of the kind shown in Table 4) for the fault environments used in this experiment. Table 6 illustrates the effect of the workload variation on the error coverage under the three fault environments. Interestingly, the effect of workload variation on error coverage seems to be insignificant according to the results shown in Table 6. This discovery is valuable in that the fault tolerance capability of our system could be workload-invariant. Next, we discuss the impact of workload variation on the recovery performance. Intuitively, the TMR scheme can tolerate one faulty module without error-recovery, in contrast to the comparison mechanism. For instance, assume that ALU_1 and ALU_4 are faulty. TMR_ALU(1, 2, 3) is used for one-instruction execution, and therefore no error-recovery is required; however, comparison is employed for two-instruction execution, and so two clock cycles are required to complete the error-recovery. In Table 5, Workload 4 spends the least time on error-recovery among the three workloads under the same fault environment because it has the highest ratio of one-instruction packets. Clearly, the time spent on error-recovery is Workload 2 > Workload 3 > Workload 4.

Table 5
The workloads' data collected from the execution stage.
Table 5
The workloads' data collected from the execution stage.

              No. of ALU packets with     No. of ALU packets with      No. of ALU
              one instruction (ratio)     two instructions (ratio)     execution packets
Workload 2    1715 (21.5%)                6248 (78.5%)                 7963
Workload 3    3978 (50.9%)                3840 (49.1%)                 7818
Workload 4    5767 (80.3%)                1417 (19.7%)                 7184
Table 6
The characterization of the effect of workload variation on design metrics, where 'max. diff.' represents the maximal difference.

              Ce-det           Ce-rec           Ce               Pf-s             Pf-uns
Workload 2    0.9873–0.9976    0.9868–0.9984    0.9743–0.9960    0.0016–0.0130    0.0024–0.0127
Workload 3    0.9850–0.9954    0.9780–0.9991    0.9633–0.9946    0.0008–0.0217    0.0045–0.0150
Workload 4    0.9847–0.9945    0.9889–0.9988    0.9739–0.9933    0.0012–0.0109    0.0055–0.0152
Max. diff.    0.0026–0.0031    0.0109–0.0007    0.0110–0.0027    0.0008–0.0108    0.0031–0.0025
5.3.3. Effect of permanent faults on recovery time

The aim of this experiment is to explore the impact of permanent faults on the length of time spent in the error-recovery process. In fact, our approach can cope not only with transient faults but also with permanent faults. However, the two kinds of faults have quite different influences on the recovery time. The time spent on error-recovery is insignificant when the faults are transient, whereas the occurrence of permanent faults considerably increases it, because the recovery process is activated more frequently. Since we do not provide a method to distinguish transient from permanent faults, our scheme lacks the ability to isolate failed components; consequently, permanent faults increase the recovery time. We should point out, however, that isolating a failed component forces the system to operate in a degraded mode, so the performance is affected as well.

We use Fig. 2 to discuss the performance issue related to permanent faults. To simplify the discussion, the number of permanent faults is restricted to one. Assume that a permanent fault occurs in ALU_1. If ALU_1 is isolated, the performance degrades drastically, because every execution packet holding two ALU instructions must now be partitioned on account of the failed ALU_1. Each such packet induces one extra cycle, so the total number of extra cycles depends on how many such packets are executed. In contrast, our scheme does not emphasize the isolation of failed components. This decision has the following hardware and performance advantages:

1. Since permanent and transient faults are treated in the same manner, no additional hardware cost is required to tackle permanent faults.
2. Our scheme does not pay the extra cycles of the isolation method described above, because execution packets holding two ALU instructions need not be partitioned. However, our approach must pay extra cycles to overcome the errors resulting from the permanent fault in ALU_1 while executing two ALU instructions concurrently. Evidently, the permanent fault in ALU_1 is not always triggered to produce a wrong output; therefore, not every execution packet containing two ALU instructions activates the recovery process, in which each recovery takes one cycle to correct the errors. Clearly, the performance degradation of our approach is lower than that of the isolation method.

In general, we can expect that the more packets containing two ALU instructions a program has, the more its performance will be degraded by permanent faults. This expectation is confirmed by the experimental results provided below. Three systems, each equipped with a particular mechanism for dealing with permanent faults, were then created to investigate the effect of permanent faults on program execution time. The permanent fault-handling mechanisms adopted in this experiment are the 'isolation method' described above, 'our scheme', and 'our scheme with modification', in which TMR is replaced by comparison for checking execution packets that contain only one ALU instruction. Table 7 presents the performance degradation for Workloads 2–4 under the various schemes due to the occurrence of a permanent fault in ALU_1.
Note that the performance degradation shown in the rows of 'our scheme' and 'our scheme with modification' is caused by the time taken for error-recovery, whereas for the 'isolation method' it stems from the requirement of zero detection latency for each instruction execution, which forces two-instruction packets to be split once ALU_1 is isolated. Each datum in the rows of 'our scheme' and 'our scheme with modification' is the average of 300 fault injection campaigns, where each campaign injected a single permanent fault into ALU_1 at a different time instant. As stated previously, the performance degradation of the 'isolation method' is predictable by counting how many execution packets holding two ALU instructions are issued after the permanent fault occurs and the failed ALU_1 is isolated; it can therefore be estimated as half of the number of ALU packets with two instructions shown in Table 5.

Table 7 indicates that 'our scheme' has the lowest performance degradation among the three schemes. We point out that the ratios of one- and two-instruction packets of a workload, as shown in Table 5, play a key role in determining the degree of its performance degradation under permanent faults. These ratios imply that the performance degradation of 'our scheme' and of the 'isolation method' follows Workload 2 > Workload 3 > Workload 4, as confirmed in Table 7. The data in Table 7 also justify the use of TMR in 'our scheme': 'our scheme with modification' must pay more recovery cycles than 'our scheme' because the comparison mechanism is now used to check packets holding one ALU instruction. Note that the performance degradation of 'our scheme with modification' becomes the same as that of 'our scheme' when ALU_3 or ALU_4 fails, because 'our scheme with modification' exploits ALU_1 and ALU_2 to examine packets containing one ALU instruction. Apart from this situation, the data in Table 7 represent the performance degradation of the three systems with a single failed ALU.
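The 'isolation method' estimate discussed above can be reproduced from the packet counts of Table 5, as the short Python sketch below shows. The factor of one half encodes the assumption that the permanent fault arrives, on average, midway through the run; the per-packet activation probability used for the non-isolating scheme is a free, hypothetical parameter rather than a measured quantity.

def isolation_degradation(two_instr_packets, fault_free_cycles, fault_point=0.5):
    """Isolation method: every two-instruction packet issued after the fault
    is split into two packets, costing one extra cycle each."""
    extra = round(two_instr_packets * (1.0 - fault_point))
    return extra, 100.0 * extra / fault_free_cycles

def retry_degradation(two_instr_packets, fault_free_cycles,
                      activation_prob, fault_point=0.5):
    """Non-isolating (retry) scheme: only those two-instruction packets that
    actually activate the permanent fault pay one recovery cycle.
    activation_prob is a hypothetical per-packet activation probability."""
    extra = round(two_instr_packets * (1.0 - fault_point) * activation_prob)
    return extra, 100.0 * extra / fault_free_cycles

# Workload 2 (Table 5): 6248 two-instruction packets, 7963 fault-free clocks.
print(isolation_degradation(6248, 7963))           # (3124, 39.23...) as in Table 7
print(retry_degradation(6248, 7963, activation_prob=0.3))

With an activation probability well below one, the retry scheme's penalty stays far below the isolation estimate, which is consistent with the ordering observed in Table 7.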
Table 7
Effect of a single failed ALU on performance degradation under various schemes.

                               Workload 2 (clocks)    Workload 3 (clocks)    Workload 4 (clocks)
Our scheme                     960 (12.06%)           723 (9.25%)            474 (6.6%)
Our scheme with modification   1924 (24.16%)          2275 (29.1%)           1855 (25.82%)
Isolation method               3124 (39.23%)          1920 (24.56%)          708 (9.86%)
Fault-free execution time      7963                   7818                   7184
5.4. Discussions

In general, the effectiveness of fault-tolerant schemes for high-performance microprocessors can be measured by: (1) hardware overhead; (2) performance degradation; (3) program space overhead; (4) the design metrics described in Section 5.1; (5) error-detection latency; and (6) error-recovery efficiency. The design task is therefore to find a good trade-off among these six properties; each method may concentrate on some design attributes while degrading others. Previously, a representative methodology proposed in [14] exploited a compiler-based redundant-code approach to detect hardware faults occurring in VLIW data paths. As described in Section 2, this software redundancy scheme requires no hardware modification and incurs no hardware overhead, but it suffers from high performance degradation, program space overhead and power cost. In our work, a hardware redundancy approach was adopted instead, which requires no code modification or growth and reduces the power consumption and performance degradation compared with the software redundancy scheme [14]; however, our methodology must pay the cost of hardware modification and overhead.

Our work is summarized as follows:

(1) Our approach is more complete in that it offers an effective error-handling process comprising both error-detection and error-recovery methods. In the design of fault-tolerant systems, the adopted error-detection scheme determines the error-detection latency, and the length of this latency affects the implementation complexity and time efficiency of the error-recovery process. Our error-detection mechanism enjoys zero detection latency, so we can utilize a very simple instruction-retry method to recover from errors in real time (a behavioral sketch of such a retry loop is given below). Schemes with variable error-detection latency would pay a higher hardware cost to implement the error-recovery process and more execution time to overcome the errors. For example, the error-detection scheme presented in [16] exhibits a detection latency of 692 cycles on average and 36,183 cycles in the worst case; such a lengthy latency requires more time for error-recovery and degrades the performance significantly once errors occur. On the other hand, the scheme in [16] provides detection coverage of transient faults for the entire pipeline, whereas our scheme covers only the data paths but can conquer both transient and permanent faults.

(2) This work offers more complete and solid results, especially for the fault-tolerant design metrics. Based on such results, our method can be thoroughly validated with confidence; these data are valuable and important for judging whether the scheme is feasible. In addition, we introduce the concept of a fault scenario to imitate various degrees of fault severity, which allows us to verify how robust our scheme is in different fault scenarios and to gain insight into the impact of the fault environment on its robustness.
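As noted in item (1) above, the following toy model sketches why zero detection latency keeps the recovery logic simple: when the concurrent checker flags an error in the same cycle, the only action needed is to re-issue the same packet in the next cycle. The function names, the packet abstraction and the retry bound are illustrative assumptions and do not reproduce the processor's actual control logic.

def run_with_retry(packets, execute, error_detected, max_retries=3):
    """Cycle-level toy model of zero-latency detection with instruction retry.
    execute(packet) models the data path and error_detected(result) models the
    concurrent checker; both are placeholders.  Returns (results, total_cycles)."""
    results, cycles = [], 0
    for packet in packets:
        for _attempt in range(1 + max_retries):
            cycles += 1                        # one cycle per (re-)execution
            result = execute(packet)
            if not error_detected(result):     # flagged in the same cycle, so a
                results.append(result)         # transient costs exactly one retry
                break
        else:
            raise RuntimeError("recovery failed: possible permanent fault")
    return results, cycles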
6. Conclusions

This paper presents a fault-tolerant framework that focuses mainly on the data paths of VLIW processor cores. Based on a rigorous and realistic fault model, an approach combining a CED scheme with a real-time error-recovery scheme is proposed to enhance the dependability of the data paths. The framework is useful and complete in that it gives designers the opportunity to choose an appropriate solution for their needs in terms of hardware overhead, performance degradation and fault tolerance capability. The significant contributions of this study are:

(1) Integrating error-detection and error-recovery into VLIW cores with reasonable hardware overhead and performance degradation. It is worth noting that the proposed fault-tolerant framework achieves zero error-detection latency and real-time error-recovery.

(2) Conducting thorough fault injection campaigns to assess the fault-tolerant design metrics under a variety of fault environments. Importantly, we measure not only the error-detection and error-recovery coverage but also the fail-safe and fail-unsafe probabilities. Acquiring the fail-unsafe probability is crucial for safety-critical applications, since it indicates how likely the system is to fail without notice once errors occur; this measurement assists in assessing the risk level of the system. Moreover, a variety of fault environments, representing different degrees of fault severity, were constructed to validate our scheme and to establish its capability in various fault scenarios. These experiments give more realistic and comprehensive results, and they justify the effectiveness of our mechanism even in a very severe fault environment.
Acknowledgement

The authors acknowledge the support of the National Science Council, Republic of China, under Contract Nos. NSC 96-2221-E-216-006 and NSC 97-2221-E-216-018.
References

[1] C. Constantinescu, Impact of deep submicron technology on dependability of VLSI circuits, in: IEEE Intl. Conf. on Dependable Systems and Networks (DSN'02), 2002, pp. 205–209.
[2] P. Shivakumar et al., Modeling the effect of technology trends on the soft error rate of combinational logic, in: DSN'02, 2002, pp. 389–398.
[3] T. Karnik, P. Hazucha, J. Patel, Characterization of soft errors caused by single event upsets in CMOS processes, IEEE Trans. Dependable Secure Comput. 1 (2) (2004) 128–143.
[4] J.B. Nickle, A.K. Somani, REESE: a method of soft error detection in microprocessors, in: DSN'01, 2001, pp. 401–410.
[5] D.M. Blough, A. Nicolau, Fault tolerance in super-scalar and VLIW processors, in: IEEE Workshop on Fault Tolerant Parallel and Distributed Systems, 1992, pp. 193–200.
[6] J.G. Holm, P. Banerjee, Low cost concurrent error detection in a VLIW architecture using replicated instructions, in: International Conference on Parallel Processing, 1992, pp. 192–195.
[7] M. Franklin, A study of time redundant fault tolerance techniques for superscalar processors, in: IEEE International Workshop on Defect and Fault Tolerance in VLSI Systems (DFT'95), 1995, pp. 207–215.
[8] E. Rotenberg, AR-SMT: a microarchitectural approach to fault tolerance in microprocessors, in: 29th IEEE FTCS, 1999, pp. 84–91.
[9] T.M. Austin, DIVA: a reliable substrate for deep submicron microarchitecture design, in: ACM/IEEE 32nd Annual International Symposium on Microarchitecture, November 1999, pp. 196–207.
[10] N. Quach, High availability and reliability in the Itanium processor, IEEE Micro 20 (5) (2000) 61–69.
[11] F. Rashid, K.K. Saluja, P.A. Ramanathan, Fault tolerance through re-execution in multiscalar architecture, in: DSN'00, 2000, pp. 482–491.
[12] T. Sato, I. Arita, Evaluating low-cost fault-tolerance mechanism for microprocessors on multimedia applications, in: Pacific Rim International Symposium on Dependable Computing, 2001, pp. 225–232.
[13] N. Oh, P.P. Shirvani, E.J. McCluskey, Error detection by duplicated instructions in super-scalar processors, IEEE Trans. Reliab. 51 (1) (2002) 63–75.
[14] C. Bolchini, A software methodology for detecting hardware faults in VLIW data paths, IEEE Trans. Reliab. 52 (4) (2003) 458–468.
[15] S. Mitra et al., Robust system design with built-in soft-error resilience, IEEE Comput. (2005) 43–52.
[16] M.K. Qureshi, O. Mutlu, Y.N. Patt, Microarchitecture-based introspection: a technique for transient-fault tolerance in microprocessors, in: DSN'05, 2005, pp. 434–443.
[17] ARM Launches Fault-Tolerant Processor to Cut Cost of Future Car Development (09.10.06).
[18] E. Touloupis et al., Study of the effects of SEU-induced faults on a pipeline-protected microprocessor, IEEE Trans. Comput. 56 (12) (2007) 1585–1596.
[19] H.C. Lai, S.J. Horng, Y.Y. Chen, An online control flow check for VLIW processor, in: 14th Pacific Rim International Symposium on Dependable Computing, 2008, pp. 256–264.
[20] F. Irom, F.F. Farmanesh, Frequency dependence of single-event upset in advanced commercial PowerPC microprocessors, IEEE Trans. Nucl. Sci. 51 (6) (2004) 3505–3509.
[21] J.H. Patel, L.Y. Fung, Concurrent error detection in ALUs by recomputing with shifted operands, IEEE Trans. Comput. C-31 (7) (1982) 589–595.
[22] B.W. Johnson, Design and Analysis of Fault Tolerant Digital Systems, Addison-Wesley, 1989.
[23] S. Mitra, N.R. Saxena, E.J. McCluskey, Common-mode failures in redundant VLSI systems: a survey, IEEE Trans. Reliab. 49 (3) (2000) 285–295.
[24] W.T. Shiue, C. Chakrabarti, Memory design and exploration for low power, embedded systems, J. VLSI Signal Process. 29 (2001) 167–178.
[25] S.W. Kwak, B.K. Kim, Task-scheduling strategies for reliable TMR controllers using task grouping and assignment, IEEE Trans. Reliab. 49 (4) (2000) 355–362.
[26] A. Avizienis, J.-C. Laprie, B. Randell, C. Landwehr, Basic concepts and taxonomy of dependable and secure computing, IEEE Trans. Dependable Secure Comput. 1 (1) (2004).
[27] J.B. Dugan, K.S. Trivedi, Coverage modeling for dependability analysis of fault-tolerant systems, IEEE Trans. Comput. 38 (6) (1989) 775–787.
[28] A.L. White, Transient faults and network reliability, in: IEEE Aerospace Conference, 2004, pp. 78–83.
[29] J. Clark, D. Pradhan, Fault injection: a method for validating computer-system dependability, IEEE Comput. 28 (6) (1995) 47–56.
[30] M.C. Hsueh, T.K. Tsai, R.K. Iyer, Fault injection techniques and tools, IEEE Comput. 30 (4) (1997) 75–82.
[31] E. Jenn et al., Fault injection into VHDL models: the MEFISTO tool, in: 24th IEEE FTCS, 1994, pp. 66–75.
Yung-Yuan Chen received the MS degree in computer science and the PhD degree in electrical and computer engineering from the State University of New York at Buffalo in 1987 and 1991, respectively. He is currently a professor with the Department of Computer Science and Information Engineering, Chung-Hua University, Taiwan. His research interests include fault-tolerant computing, VLSI system design, computer architecture, reliable processor and SoC design with the FMEA process, and dependability assessment and validation.
Kuen-Long Leu received the MS degree from the Department of Computer Science and Information Engineering, Chung-Hua University, Taiwan, and is currently a PhD student in the Department of Electrical Engineering, National Central University, Taiwan. His research interests include fault-tolerant processor and system design and analysis, and the development of SoC dependability verification platforms.