Microprocessors and Microsystems 36 (2012) 462–470
Fast online error detection and correction with thread signature calculae

Bernhard Fechner, University of Augsburg, 86159 Augsburg, Germany
Article history: Available online 16 May 2012

Keywords: Online error detection and correction; Thread; Parity; RMT; Duplex; Checksum; FPGA
Abstract

To recognize transient control-flow and data faults caused by Single-Event Upsets (SEUs) in a microprocessor pipeline, several mechanisms that check the execution in the retirement stage have been proposed and discussed over the years. In this paper, we suggest a compression-based and a compression-free checksum scheme, which are able to recognize transient faults before commitment and preserve binary compatibility. The schemes are applicable to time-redundant (virtual duplex and redundantly multithreaded) as well as structurally redundant systems. They can localize a fault by partial re-execution within the pipeline. By additionally introducing a modified micro-rollback, single or multiple pipeline stages can be rolled back for a retry. In the best case, a fault can be localized, detected and corrected within four clock cycles in a fine-grained redundantly threaded microprocessor. We validate and analyze the scheme through an FPGA and a standard-cell implementation and conclude that it is able to replace the well-known parity computation for high-performance designs.

© 2012 Elsevier B.V. All rights reserved.
1. Introduction

A prominent example of transient faults is Intel's discovery of high transient fault rates in their 16-kbit DRAM family, caused by low-energy α-particles [27] after radioactive water had polluted the ceramic packaging. At minimum feature sizes below 90 nm, another phenomenon, previously known only from aerospace applications, can be observed at sea level [3]: an increasing probability of neutrons causing transient faults. Neutrons are the most common cause of transient faults in memory elements (Single-Event Upsets, SEUs) [3,23], although a recent field study [34] showed that for DRAMs permanent faults are much more common. For COTS processors this finding does not apply, since they do not contain DRAM. The average particle flow at sea level ranges from 20 to 14,400 neutrons/cm²/h [24]. To determine the exact fault rate, the neutron flow must be specified with respect to location and altitude. Ziegler [25] describes how the particle flow can be estimated; he computes the neutron flow for New York as

$F(E) = 1.5\,e^{-5.2752 - 2.6043\ln E - 0.5985(\ln E)^2 - 0.08915(\ln E)^3 + 0.003694(\ln E)^4}$

per hour per cm², where E is the particle energy in MeV. Table 1 shows the results. The fault rate in SRAMs increases at high altitudes by a factor of 3 to 10, approximately a factor of 1.3 per 1000 ft [4]. Shivakumar et al. [22] forecast a growth of transient errors in memory elements by a factor of 10⁵ from 1992 to 2011. For the next decade (>2010), 10⁴ FIT (an MTBF of 11.4155 years), i.e. a fault rate of 10⁻⁵/h, is foreseen. Since this trend continues with nanometer designs, it is important to tolerate such faults at the hardware level.
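A small script evaluating the flux formula above; the coefficient signs are an assumption on our part, chosen so that the values reproduce Table 1 to within a few percent:

```python
from math import exp, log

def neutron_flux(E: float) -> float:
    """Estimated sea-level (New York) neutron flux in neutrons/(cm^2 h)
    for a particle energy E in MeV, after Ziegler [25]; the coefficient
    signs are an assumption chosen to reproduce Table 1."""
    lnE = log(E)
    return 1.5 * exp(-5.2752 - 2.6043 * lnE - 0.5985 * lnE ** 2
                     - 0.08915 * lnE ** 3 + 0.003694 * lnE ** 4)

for E in (3, 4, 5):
    print(E, neutron_flux(E))   # ~1.9e-4, ~5.2e-5, ~1.7e-5, cf. Table 1
```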
This work is structured as follows: in Section 2, we introduce the main part of this work, the Thread Signature Calculus. A thread or hardware thread hereby denotes dedicated hardware to execute a software thread; the state of a thread is called its context, and a software thread is an instruction stream. An experimental evaluation of the fault coverage is done in Section 3. In Section 4, a comparison with the well-known parity scheme is conducted. Section 5 introduces the micro-rollback for single pipeline stages. Section 6 concludes the paper.

1.1. Basic fault model

A fault model specifies which faults are assumed to occur within the examined system. It should be an integral part of every work that evaluates fault coverage. Karlsson et al. [20] separate flip-to-0 (ft0) and flip-to-1 (ft1) faults as well as data, control-flow and other errors. The transition probabilities P{0 → 1} and P{1 → 0} are assumed to be equal; this assumption was confirmed by experimental observations [29]. We assume transient control-flow and data errors through SEUs (ft0, ft1) in the pipeline of a microprocessor (logic and memory). Single-Event Upsets (SEUs) are modeled through the inversion of a single bit in memory elements. If x is the width of pipeline stage i, this can be modeled as
$\mathit{pipe\_reg}(i) \Leftarrow \mathit{pipe\_reg}(i) \oplus \mathit{mask}\langle x-1 : 0\rangle.$

The fault mask mask is applied to each pipeline register according to the assumed fault rate (see Table 3). We assume that only a single fault can occur in a single pipeline register at a time.
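A minimal software model of this fault injection (a sketch under the stated fault model; the uniform choice of the flipped bit is an assumption):

```python
import random

LAMBDA = 1e-5   # per-cycle fault rate, as in Table 3

def inject_seu(pipe_reg: int, width: int, rng: random.Random) -> int:
    """With probability LAMBDA per clock cycle, XOR the pipeline register
    with a single-bit mask<width-1:0>, i.e. invert exactly one bit."""
    if rng.random() < LAMBDA:
        mask = 1 << rng.randrange(width)
        return pipe_reg ^ mask
    return pipe_reg
```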
Table 1
Particle flow and energy.

E (MeV)    Particle flow (cm²/h)
3          0.0001894434561
4          0.00005181768842
5          0.0000169856564

Table 3
Fault-injection parameters.

Parameter                       Value
Simulation type                 Software simulation
Fault rate                      λ = 10⁻⁵ per clock cycle
Duration of transient faults    Negligible (0.2/0.5 ns)
Distribution of faults          Discrete, equally distributed
Workload                        Synthetic, parameters from SPECint2006_base simulations
Random number generator         48 bit, linear congruence [30]
Simulation time                 10⁶ instructions
Fault-injection runs            10
Assumptions                     One fault at a time per component, stochastically independent

1.2. Related work
A well-known method to check the execution after writeback is the Dynamic Implementation Verification Architecture (DIVA) [7,12]. DIVA consists of a checker connected to the writeback stage of the processor to be checked; a watchdog timer, initialized with the maximal instruction latency, monitors the retirement. DIVA has two in-order pipelines: checkcomm to check the values to be written and checkcomp to check all computations. As stated in Section 1.1, like DIVA we support the detection and correction of data and control faults, but – unlike DIVA – we are able to detect and correct faults faster and conduct fault-injection experiments to determine the fault coverage. Additionally, we can identify single pipeline stages as faulty. Other methods to monitor only the control flow are based on signature analysis [13,14]. In the TRIP processor [6], a signature monitor [10,11] calculates control-flow signatures to detect discrepancies. Macroinstruction control-flow monitoring [9] separates the entire application program into blocks; the instruction sequence within a block is checked instruction by instruction to detect control-flow faults. When using signatured instruction streams (SISs) [11], checksums must be inserted after every branch. A monitor reads and compares the computed checksum with the checksum embedded in the object code. With a branch occurring every fourth to tenth instruction, application programs will grow in size by 10–25% if equal instruction lengths are assumed. The signatures must be loaded from main memory over the bus, increasing the power consumption and decreasing performance, because the bus is occupied while fetching the signatures; furthermore, the signatures must be handled on the system level so as not to pollute the caches. SIS was validated by fault injection [10]: in comparison to the original system, the fault coverage increased by 25%. In [17], fingerprinting was introduced, a method to compare the results of the processors within a duplex system. A fingerprint is built over the execution by computing a hash value with a linear block code (e.g. a CRC). Smolens et al. [17] also propose an integration of fingerprinting within the microarchitecture of a processor but do not discuss implementation issues such as the critical path. Methods that check the execution in the retirement have the disadvantage that faults are recognized only after they have become effective. In this
case, recovery mechanisms must be implemented to roll back to a sane system state. Such late detection leads to side effects such as performance degradation and error propagation. We try to minimize the performance degradation by rolling back single pipeline stages instead of a whole microarchitectural state. Naturally, the propagation of errors is not a concern if a rollback of the affected state is implemented; unfortunately, due to its nature, a propagated error can no longer be located. The proposed signature calculation can detect and locate faults before they become effective (before retirement). With the quick detection of faults, we can localize the fault and roll back a single pipeline stage instead of a whole microarchitectural state; only these parts must be rolled back in case of a fault. We introduce the compression-based and compression-free Thread Signature Calculus for redundantly multithreaded systems (RMT) [28]. Both schemes are not bound to RMT; we took it only as an application example, and the schemes can also be used within any system supporting time or structural redundancy. We suggest RMT because it enables the combination of different multithreading strategies such as fine-grained and coarse-grained multithreading (FMT, CMT): FMT fetches instructions on a cycle-by-cycle basis, while CMT changes the context on a latency, e.g. memory accesses after cache misses. The performance degradation through RMT is not part of this work, since it has already been discussed deeply in the literature and is only one application field of the Thread Signature Calculus. The idea of compression-based checksums was first introduced in [33] and analyzed in [32]; compression-free checksums were introduced in [31]. In contrast to our previous work, we additionally implement a micro-rollback and conduct a delay, power and area analysis of the proposed schemes with different technologies
Table 2
Opcode probabilities (%), SPECint2006_base benchmarks.

Benchmark    Logic  Addsub  Addshift  Sel   Br,cc  Jmp   Bru   Ld     St     Shiftsimple  Shift  Mul   Flags  Chk
astar        23.4   22.1    0.8       0.8   14.2   0.7   1.3   25.1   6.7    3            1.7    0     0.1    0.1
bzip2        14.9   38.8    0.2       0.1   14.8   0.2   0.3   17.3   12.5   0.8          0.1    0     0      0.1
gcc          24     17.9    6.5       0.7   12.6   0.3   0.6   25.2   6.9    1.5          1.6    0     1.1    1.1
gobmk        15.6   31.6    9.3       0.1   15.9   0.3   0.7   8.8    13.9   1            0.3    0.1   0      2.4
h264ref      14.2   28.6    1.4       0.9   14.5   1.8   1.7   21     11.4   1            0.2    0.1   0.4    2.7
hmmer        13.3   31.4    0.5       0.4   12.3   1.1   1.6   18.7   13.8   0.7          0.2    0.2   0.3    4.8
libquantum   22.4   19.9    5.9       0.2   10.7   0.2   0.4   19.7   8.1    0.8          5.8    0     5.7    0.1
mcf          14.3   24.4    0.7       0.6   13.5   1.2   2.3   27.8   13.1   1            0.2    0.3   0.5    0.2
omnetpp      23.3   22.2    0.8       0.9   14.1   0.7   1.4   25.1   6.7    2.9          1.7    0     0      0.1
perlbench    15.4   27.5    1.3       0.7   13.3   1.4   1.8   23.7   10.4   1.4          1.1    0.1   1.1    0.9
sjeng        17.3   20.9    18.4      0.1   8      0.2   0.3   17.8   10.9   1.3          3.5    0     0      1.2
xalancbmk    14.6   28.5    1.5       0.7   13.8   1.3   1.8   23.2   10.7   1.1          0.9    0.1   0.9    1
Average      17.73  26.15   3.94      0.52  13.14  0.78  1.18  21.12  10.43  1.38         1.44   0.08  0.84   1.23
(FPGA and standard cells) and compare the proposed Thread Signature Calculus with the well-known parity scheme.

2. Thread signature calculae

To recognize faults, we calculate an online signature over the contents of all pipeline registers within a pipelined processor, containing control information and data. Digital signatures are mainly used for secure, message-based communication between remote systems – error-detecting codes with transmitter authentication. In this work, the authentication is done via the mapping of a checksum to a thread, forming the signature. The signature need not be cryptographically strong, since we do not assume systematic attacks. Securing all pipeline registers by error-correcting (ECC) or error-detecting codes (EDC), e.g. parity, is not practicable, since the computation of a single parity bit can take longer than the critical path of a stage, especially in high-performance designs with only a few logic levels. In Section 2.4 we take a closer look at parity computation. EDCs/ECCs are practicable on a coarse-grained level, e.g. for caches. Since the entire execution must be secured with marginal performance loss, the computation must be done at the microarchitectural level. We regard two different methods: the compression-based and the compression-free checksum calculus.

2.1. Compression-based calculus

To check the correct execution within an RMT system, checksums must be calculated for each instruction stream (leading and redundant thread). These are computed and compared on a context change from the redundant to the leading thread. A context switch should be done on every conditional branch or cache miss. If we switch the context on a conditional branch, we avoid the problems arising from speculative execution, which would lead to different checksums. In case of a long-lasting latency (e.g. an access to main memory), the concerned thread is frozen until the cause of the latency is eliminated (i.e. data/code has been fetched from or written to memory) and the context is switched to another thread. This avoids complications when writing to main memory or the caches. We first apply signatures based on cyclic codes (CRCs) [1,2]. If the checksums differ, a fault is signaled. Let $v = \langle v_{x-1}, \dots, v_0 \rangle$ be a binary codeword of width x with the polynomial representation $m = m(x) = \sum_{i=0}^{x-1} m_i x^i$. The messages $m_i(x)$, $i \in \{0, \dots, n-1\}$, consist of the pipeline register contents during the execution between two context changes, n being the message length (here the number of pipeline registers). These checksums can be computed simultaneously for multiple instruction streams. Fig. 1 shows the pipeline-parallel computation of checksums with the generator polynomial $g(x) = \sum_{i=0}^{p-1} g_i x^i$. As the feedback stage, we propose the last pipeline stage (the retirement). The result of each pipeline stage is directed to the next stage
and to the checksum logic. It is only propagated if the concerned thread is active. In Fig. 1, two pipeline structures can be identified: the processor pipeline and the checksum pipeline. Since the criteria for a context switch are coded within the instruction streams and both streams are identical in the fault-free case, checksums propagate identically. The schemes in Figs. 1–3 do not require RMT. For example, as in [10], we could insert a pre-calculated checksum into the object code after an instruction which is expected to cause a latency; the checksum is then read and compared to the checksum calculated within the pipeline. To support out-of-order execution, the checksum for a single execution unit is calculated per thread by using a commutative operation Δ. We call this part of the checksum the out-of-order checksum, and the part calculated for the in-order stages the in-order checksum.

2.2. Construction of ∘ and Δ

Since in-order and out-of-order checksums must be integrated into the final checksum in arbitrary order, the necessary combination operator ∘ must be commutative. With static issuing, where each issued instruction is assigned statically to an execution unit, we know which instruction stream is associated with which checksum and execution unit; here, ∘ need not be commutative. For dynamic issuing (issued instructions are not fixed to a dedicated execution unit), the results from the out-of-order execution must be combined with the in-order checksums through a commutative ∘. Δ depends on the outputs of the preceding stage, the last stage and the feedback stage (depending on g(x)). Since each bit must influence the output, we have (b ∈ {0,1}):
$\Delta : \{0,1\} \times \{0,1\} \times \{0,1\} \to \{0,1\},$ (1)

$\Delta(x_1, x_2, x_3) = \Delta(x_2, x_1, x_3) = b,$ (2)

$\Delta(\bar{x}_1, x_2, x_3) = \Delta(x_1, \bar{x}_2, x_3) = \Delta(x_1, x_2, \bar{x}_3) = \bar{b}.$ (3)

Two functions satisfy this requirement:

$\Delta(x_1, x_2, x_3) = \begin{cases} x_1 \oplus x_2 & r = 0 \\ x_1 \oplus x_2 \oplus x_3 & r = 1 \end{cases}$ and $\Delta'(x_1, x_2, x_3) = \begin{cases} \overline{x_1 \oplus x_2} & r = 0 \\ \overline{x_1 \oplus x_2 \oplus x_3} & r = 1 \end{cases}$

(r = 0: no feedback, r = 1: feedback). We derive the combination operator and set ∘ = Δ. Fig. 2 shows the out-of-order checksum calculation. As noted before, each thread has processed an equal number of instructions when reaching a latency, since the criteria for a context switch (latencies like conditional branches and memory accesses) occur at equal points in the instruction stream, i.e. the checksum registers have equal contents.
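A minimal bit-level sketch of Δ (in Python, for illustration only; the hardware realization is a single XOR level per stage):

```python
def delta(x1: int, x2: int, x3: int, r: int) -> int:
    """Combination operator Delta on single bits: XOR of the preceding
    stage's output x1 and the last stage's output x2, plus the feedback
    bit x3 when feedback is active (r = 1). Complementing any single
    input complements the result, so each bit influences the checksum,
    and XOR is commutative, as required for dynamic issuing."""
    return x1 ^ x2 ^ (x3 & r)

# Sanity check of requirements (2) and (3) for r = 1:
assert all(delta(a, b, c, 1) == delta(b, a, c, 1)
           for a in (0, 1) for b in (0, 1) for c in (0, 1))
assert all(delta(1 - a, b, c, 1) != delta(a, b, c, 1)
           for a in (0, 1) for b in (0, 1) for c in (0, 1))
```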
Fig. 1. Compression-based checksum calculation.
Fig. 2. Checksum-calculation for out-of-order execution (compression-based).
Fig. 3. Signature calculation for SMT.
2.3. Compression-free calculus

The feedback-based mechanism of the compression-based checksum calculus in Section 2.1 precludes early detection and fast recovery of faults, since the context-change causing instruction in the redundant thread has to pass all p pipeline stages. Depending on how early context-change criteria can be detected, other instructions of the leading or redundant thread can be in the pipeline; in case of an error, all results produced by these instructions must be discarded. To reduce latency and complexity, we see the compression-free calculus of checksums as a remedy. Its advantage is obvious: less and much faster hardware and higher fault coverage. Therefore, we do not compare the two schemes here and refer the interested reader to [32]. In the compression-free calculus, the contents of all checksum registers are considered part of the checksum. Fig. 3 shows the final method (two threads, out-of-order support); for clarity, the pipeline is shown only up to the execution stage. The forwarding of checksums and the feedback are deactivated to decouple the pipeline stages, enabling the localization of faults, as in the sketch below. The results of the following experiments apply to the scheme shown in Fig. 3 (TID: thread ID, REG: register), where the combination operator is implied.
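A minimal software sketch of this decoupling (the function names are ours, not the paper's):

```python
from typing import List

def update_checksums(checksums: List[int], stage_regs: List[int]) -> None:
    """Compression-free calculus: every pipeline stage i keeps its own
    checksum register, updated by XOR with the stage's current register
    contents. There is no forwarding between stages and no feedback."""
    for i, reg in enumerate(stage_regs):
        checksums[i] ^= reg

def localize_fault(cs_leading: List[int], cs_redundant: List[int]) -> int:
    """Compare the per-stage checksums of both threads; because the stages
    are decoupled, the first mismatching index names the faulty stage
    (-1 means no fault was detected)."""
    for i, (a, b) in enumerate(zip(cs_leading, cs_redundant)):
        if a != b:
            return i
    return -1
```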
2.4. Theoretical comparison

Initially, a comparison with parity trees concerning the number of XOR gates and the maximal depth in gates has to take place. We calculate the costs of a parity tree with arbitrary fan-in and data width N. The number of XOR gates $f : \mathbb{N}_0 \to \mathbb{N}$ on the root level (zero) is $f(0) = \lceil N / \mathit{fanin} \rceil$, the minimal depth $t = \lceil \lg(N) / \lg(\mathit{fanin}) \rceil$. We solve the recurrence $f(n) = \lceil f(n-1) / \mathit{fanin} \rceil$ and get $f(n) = \lceil N / \mathit{fanin}^{\,n+1} \rceil$. Summing over all tree levels, the cost function (number of XOR gates within the tree) is:

$\displaystyle\sum_{i=0}^{t-1} f(i) = \left\lceil f(0)\,\frac{\mathit{fanin}^{\,t} - 1}{\mathit{fanin}^{\,t-1}\,(\mathit{fanin} - 1)} \right\rceil.$
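The cost and depth formulas can be checked with a small script (a sketch; gate counts follow the recurrence above):

```python
def parity_tree_costs(N: int, fanin: int):
    """Total XOR gate count and depth t of a parity tree over N bits,
    per the recurrence f(n) = ceil(f(n-1)/fanin), f(0) = ceil(N/fanin)."""
    gates, width, depth = 0, N, 0
    while width > 1:
        width = -(-width // fanin)   # integer ceiling division
        gates += width               # f(depth) gates on this level
        depth += 1
    return gates, depth

print(parity_tree_costs(64, 2))   # (63, 6): 63 two-input XORs, depth 6
print(parity_tree_costs(64, 4))   # (21, 3): wider fan-in, fewer gates and levels
```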
With a fan-in greater than two, parity trees cost less than the compression-free scheme; with a fan-in of two, the costs are approximately the same. The delay of the parity tree grows logarithmically and is only shortened by a wider fan-in, which most CMOS technologies restrict to 4–6. The delay of the checksum method is one gate per stage. Additionally, we have to take into account that the checking of register contents by parity computation is done in two turns: (1) the calculation when writing and (2) the calculation when reading for comparison. This can lead to a significant performance breakdown for large data paths, since the clock frequency has to be adjusted to the critical path of the tree. Note that the check for equality in the checksum scheme can be done by XORing both checksums and a wired-OR of all resulting bits (a logical "1" signals an error), as sketched below. Wang et al. [21] report that 88% of all faults injected into pipeline registers were masked when injecting transient faults during the execution of SPECint2000 benchmarks. Unfortunately, this result strongly depends on the underlying architecture, but we can conclude that not all bits of a pipeline register have to be secured, thus reducing the area requirements. To determine which bits should be secured, an implementation of the microarchitecture is necessary. Here, we could either work at the source-code level, since the implementer will insert timing constraints into the code for parts with tight timing requirements, or the parts to be secured can be determined by test series at the logic-simulation level, e.g. by critical-path analysis.
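The comparison itself is constant-depth; a sketch of the XOR/wired-OR check described above:

```python
def checksums_differ(cs_a: int, cs_b: int) -> bool:
    """XOR both checksums and OR-reduce the result; a logical '1' signals
    an error. In hardware this is one XOR level plus a wired-OR, so its
    depth does not grow with the checksum width the way a parity tree's
    depth grows with the data width."""
    return (cs_a ^ cs_b) != 0   # '!= 0' models the wired-OR over all bits
```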
3. Experimental results

In this section, elementary metrics like fault coverage and localization accuracy for multiple variables (as functions of branch probability and instructions per clock, IPC), as well as the latency until an error is detected, are obtained by software simulation. Two identical instruction streams of 10⁶ 32-bit instructions were generated as input for a pipeline modeled with an arbitrary number of stages p and an arbitrary control and data path width x. For a worst-case analysis, the pipeline is emptied every time a fault is detected. The data is not modified from stage to stage, since then the model would have been implementation-dependent. Branches (br,cc) and loads (ld) are inserted into the instruction stream with probabilities P and K. To obtain values for P and K, benchmarks of the Standard Performance Evaluation Corporation (SPEC) were executed; Table 2 lists the opcode probabilities in percent for the SPECint2006_base benchmarks. The results from Table 2 (Ld and Br,cc, arithmetic average, last column) are used as input parameters for the synthetic benchmarks (see Table 3, workload); a sketch of this workload generation follows below. The benchmarks were executed with the x86-64 simulator ptlsim [19] until 10⁶ instructions were committed. We could also execute the benchmarks and experiments with a larger number of instructions, but since we have a fixed fault rate, the deviation would only be smaller. However, we conducted several (10) fault-injection experiments for each value of P and K to smooth the results. Naturally, the execution of other benchmark types (e.g. the floating-point part of the CPU2006 benchmarks) would yield other values; we took these to have some real-world values. To keep P and K independent from particular benchmarks, we varied P and K between 1/10 and 1/10010 (10010 instructions between threads). For the architectural configuration, see Table 4. Conditional branches are denoted by br,cc, unconditional branches by bru/jmp. The context-change causing instruction groups (K and P) are highlighted. The first experiment, whose results are shown in Fig. 4, serves to visualize the dependency between fault coverage and the probability of a context switch.
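A sketch of such a synthetic stream generator, assuming the Table 2 averages for P (br,cc) and K (ld); the remaining instructions are treated as generic ALU operations:

```python
import random

P_BRANCH = 0.1314   # br,cc average from Table 2 (last column)
P_LOAD   = 0.2112   # ld average from Table 2 (last column)

def synthetic_stream(n: int, seed: int = 1):
    """Yield n synthetic instructions; br,cc and ld act as the
    context-switch causing instruction groups (P and K)."""
    rng = random.Random(seed)
    for _ in range(n):
        u = rng.random()
        if u < P_BRANCH:
            yield "br,cc"
        elif u < P_BRANCH + P_LOAD:
            yield "ld"
        else:
            yield "alu"
```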
Table 4
Microarchitectural parameters.

Parameter                                           Value
Reorder buffer                                      128 entries
Load/store queue                                    48/32 entries
Number of pipeline stages                           12
Width of fetch, issue, dispatch, commit, writeback  4 instructions
Execution units                                     4 (2 INT, 1 FP, 1 LOADSTORE)
Execution type                                      Speculative
Execution time                                      10⁶ commits
Physical register file                              128 registers
L1 instruction-/data cache                          16/32 KB, 4-way, 64-byte cache line, 2 cycles
L2 instruction-/data cache                          256 KB, 4-way, 64-byte cache line, 6 cycles
Memory access time                                  112 cycles
DTLB and ITLB                                       32 entries
Workload                                            SPECint2006_base
were injected according to the fault model with a rate of λ = 10⁻⁵. Since 10 fault-injection experiments were conducted, the fault coverage is shown with minimum and maximum. For further fault-injection parameters, see Table 3. One could argue that highly optimizing compilers group arithmetic instructions; additionally, the criterion for a context switch could be a real context switch of the operating system, resulting in hundreds of instructions between threads. This was also a reason for the variation of P and K. The probability K + P of a context switch was varied between 1/10 and 1/10010 in steps of 1/1000. Additionally, the polynomial trend (second degree) is shown. We see that the fault coverage decreases very slowly with the probability of a context switch. It decreases because the number of instructions between checkpoints is strongly determined by the probability of a context switch: a fault is not detected if there are two transient faults (transitions 0 → 1 and 1 → 0) in the same checkpoint interval, in the same instruction stream, at the same position in the same pipeline register. A consequence is that the scheme can be used for programs with a low probability of a context switch, or that the number of context changes can be reduced to gain performance. For superscalar processors with static issuing, the fault coverage will be higher if the results from the in-order and out-of-order execution are, e.g., concatenated, since a higher IPC will distribute more instructions among the available execution units. Dynamic issuing has no influence on fault coverage, but out-of-order execution influences the latency until an error is detected. To determine the latency, we resolved the issue and commit rates on the microarchitectural level (issue/commit µIPC) as well as the final commit rate (commit IPC). The simulations were done with ptlsim [19]; the workload was the SPECint2006_base benchmark suite. Fig. 6 shows the average IPC values (φ) within the commit stages (higher is better). Fig. 5 shows the latency until the detection of a fault (y-axis) for a 5-stage pipeline using coarse-grained multithreading with K + P = 1/5. A fault is detected after 5.93 cycles on average. If $p_i$ are the probabilities of the n context-switch causing instruction groups, with $\hat{p} = \sum_{i=1}^{n} p_i \in [0,1]$, s is the number of threads and $\phi$ the IPC, the latency is at most $s / (\phi \sum_{i=1}^{n} p_i)$ clocks.
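As a numerical illustration of this bound (the IPC value φ = 1 is an assumption for the example, not a measured figure):

```python
def max_detection_latency(p_sum: float, s: int, ipc: float) -> float:
    """Upper bound on the detection latency in clocks: s / (ipc * p_sum),
    with p_sum the summed probability of the context-switch causing
    instruction groups, s the number of threads and ipc the IPC (phi)."""
    return s / (ipc * p_sum)

# K + P = 1/5 and s = 2 threads as in the CMT measurement above:
print(max_detection_latency(p_sum=0.2, s=2, ipc=1.0))  # 10.0 clocks;
# consistent with the measured average of 5.93 cycles.
```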
The compression-free scheme is able to map faults onto a pipeline stage. Fig. 7 shows the minimal and maximal localization accuracy (10 fault-injection runs, y-axis) in percent with respect to K + P (x-axis), varied between 1/10 and 1/10010 in steps of 1/1000; additionally, the polynomial trend (second degree) is depicted. We see that the accuracy decreases slowly: after 10010 write accesses, still 80% of all faults can be mapped to a pipeline stage. For FMT, the latency to detect a fault is at most two clock cycles, and the fault can be perfectly localized, since results are compared every second cycle. An even number of faults cannot be detected by the parity computation; since the checksum scheme is applicable for every bit, burst failures up to length x can be recognized (x is the width of a pipeline stage).

Fig. 4. Fault coverage (%) over the probability K + P of a context switch.

Fig. 5. Latency until fault detection (latency in instructions over nonlinear time).

Fig. 6. IPC values (SPECint2006_base): issue µIPC, commit µIPC and commit IPC per benchmark.

Fig. 7. Localization accuracy (%) over the probability of a context change.

4. Implementation results

4.1. FPGA
Parity trees and the checksum calculus have been synthesized and implemented in VHDL for a Xilinx Virtex-E XCV1000 FPGA [8]. The sources were generated with support for a fan-in of two. For all examined bit widths, the area requirements of the checksum scheme and the parity trees differed by exactly 14 LUTs, whereby the logic for a single pipeline stage was implemented with normal place-and-route effort. Fig. 8 shows the critical path in ns for the parity and the checksum (right, darker) method. The critical path of the checksum scheme is (apart from N = 64) significantly shorter. We did a static power analysis for both schemes with the manufacturer's power analysis tool (XPower). All applied signals toggle at specific frequencies: the least significant bit at the highest frequency, then half of that frequency was used for the
next bit, and so on; thus, we simulated a run over all input combinations. Fig. 9 shows the difference in power consumption (in mW) between checksums and parity trees for increasing frequencies (x-axis: 25, 50, 200, 400 MHz) and bit widths (16, 32, 64). For low frequencies the difference is marginal; from 50 MHz on, a linear increase can be recognized. For greater bit widths, the checksum calculation consumes significantly less power than the parity computation.
Fig. 8. Critical path (ns) on the FPGA for bit widths 8–128.

Table 5
Standard-cell synthesis parameters.

Synthesis
  Conversion (vhdl → vbe)                Initial values, vdd/vss connectors
  Algebraic minimization (level)         3 (0 to max. 3)
  Algebraic minimization (optimization)  Delay/area balanced (50%)
Implementation
  Placement                              Ring
  Standard-cell library                  sxlib
  Routing                                6 metal layers
  Technology                             130 nm CMOS, 6 metal layers
  Transistor model                       BSIM4.4
Fig. 9. Difference in power consumption between checksum and parity trees (FPGA), over frequency (25, 50, 200, 400 MHz) for bit widths 16, 32 and 64.

Fig. 11. Delay of parity tree and checksum logic in ps (standard cells), over the bit width N:

Bit width   8     16    32    64    128   256   512
Parity      1230  1655  2079  2504  2928  3352  3777
Checksum    602   602   602   602   602   602   602
Fig. 10. Area of parity tree and checksum calculus in λ² (standard cells), over the bit width N:

Bit width   8      16      32      64      128      256      512
Parity      58750  124250  252500  511000  1030000  2066250  4141000
Checksum    75000  148000  297000  596250  1188000  2379250  4763000
Difference  16250  23750   44500   85250   158000   313000   622000

Fig. 12. Capacity of parity and checksum logic in pF (standard cells), over the bit width N:

Bit width   8    16   32   64   128   256   512
Parity      0.3  0.6  1.1  2.3  4.9   10.4  23.5
Checksum    0.4  1.1  2.0  4.7  12.1  9.9   20.9
Difference  0.1  0.5  0.9  2.4  7.2   −0.5  −2.6
Fig. 13. Fine-grained micro-rollback: the stage registers (REG) of thread 1 (v(x)) and thread 2 (w(x)) are compared (CMP) at every stage; a mismatch asserts the rollback signal of the affected stage.
4.2. Standard cells

Fig. 10 shows the area (in λ², x-axis N ∈ {8, …, 512}) of the parity tree and the checksum calculus for a standard-cell design. For the synthesis parameters, see Table 5. In contrast to the FPGA design, the area difference can be noticed much earlier. Both mechanisms were implemented with 4 metal layers. Fig. 11 shows the delay in ps for both methods, with the bit width N ∈ {8, …, 512} on the x-axis; the difference is visible even for small bit widths. Fig. 12 shows the capacity in pF of the parity and checksum logic for increasing bit widths (x-axis, N ∈ {8, …, 512}). Since no dynamic analysis of the power consumption was possible due to tool restrictions, we use a static analysis. The parity calculation has less capacity for bit widths from 8 to 128 bit; for bit widths over 128 bit, the capacity of the checksum calculus is lower. We recommend the checksum calculus for internal control or data path widths over 128 bit, due to its beneficial run-time behavior and low power consumption.
5. Micro-rollback

If the compression-free checksum calculus and FMT are combined with a micro-rollback, a simple and cost-effective method to recover the pipeline state after a fault results. Tamir and Tremblay [15] present a fast (within a few cycles) rollback mechanism. The micro-rollback in the Mirror processor [16] is supported by a rollback register file, an N-word FIFO delaying write operations on the architectural registers (delayed write). Based on [15], Pflanz [18] developed a micro-rollback for a master-checker system within ALUs and simple microprocessors. The trailer is assumed to be the perfect core, executing the program with one clock cycle of latency after the master; in the fault-free case its state is identical to the state of the master, and the master can be rolled back by injecting the trailer state. The checksum scheme can easily be extended to support a micro-rollback for pipelined processors. Fig. 13 shows the micro-rollback for an in-order pipeline. By implementing checksum registers for each execution unit, the micro-rollback can easily be extended to out-of-order pipelines. We do not assume the redundant thread (thread 2) to be fault-free, since this assumption would be too rigorous. From the detection to the correction of a fault, we need four clock cycles (fine-grained multithreading: executing the instruction twice, comparison/rollback, and retry). Depending on the implementation, this time can be further reduced.
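A sketch of the detect/compare/rollback/retry sequence (the arbitration among three results is described in the following paragraph); reexecute is a hypothetical stand-in for the micro-rollback retry:

```python
def resolve(r1: int, r2: int, reexecute) -> int:
    """r1, r2: per-stage results of the leading and the redundant thread.
    On a mismatch, the stage is rolled back and retried once (thread 3);
    two agreeing results win. Without consensus, both threads must be
    restarted, e.g. from a previous checkpoint."""
    if r1 == r2:
        return r1                  # no fault detected, commit to next stage
    r3 = reexecute()               # micro-rollback: retry the stage
    if r3 == r2:
        return r2                  # thread 1 was faulty
    if r3 == r1:
        return r1                  # thread 2 was faulty
    raise RuntimeError("no consensus - restart from checkpoint")
```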
A simple method to produce more than two results is to execute the redundant thread again (thread 3) by loading the checksum pipeline contents into the previous pipeline stage and doing a retry. If the results of threads 2 and 3 agree, thread 1 was faulty and we continue the execution; otherwise thread 2 was faulty, and the contents of the checksum register of thread 1 are transferred into the pipeline. After a rollback we have three or four checksums, depending on the number of threads participating in the rollback. If there is no consensus between the threads, we must restart both threads, e.g. retry from a previous checkpoint. A counter can be integrated to detect continuous faults; if such behavior is detected, we should resort to a fail-safe mode. Remember that our fault model excludes permanent faults and that the detection and toleration of such faults is not within the scope of this work. To directly load the correct values into the pipeline, the direct-load scheme from Yu [26] can also be used.

6. Conclusion and future work

In this work, we presented a novel checksum scheme for pipelined processors (redundantly threaded or structurally redundant), called the Thread Signature Calculus, supporting an arbitrary number of stages and threads. We introduced two sub-schemes, the compression-based and the compression-free checksum calculus. The performance of both schemes can be adjusted by increasing or decreasing the number of context switches. We did not handle compression-based checksums in detail, since they require more area and time, are less accurate, are more complex, and discover fewer faults than the compression-free calculus; moreover, compression-based checksums require more hardware because of the inter-stage feedback wiring. Therefore, we recommend the compression-free calculus. The scheme can be used for FMT or CMT systems: for fine-grained execution it has perfect fault coverage, while for coarse-grained multithreading the fault coverage decreases polynomially with the probability of a context switch. The scheme was validated through an FPGA and a standard-cell implementation. The results show a slight area increase but significantly lower power consumption and a shorter critical path in comparison with traditional parity, enabling direct integration into the processor pipeline without having to worry about performance issues. The scheme was extended to support rollback and retry for FMT; in the ideal case, a fault can be detected, localized and corrected within four clock cycles. The schemes can be used within a multiprocessor environment with shared-memory applications, either by applying them on the processor level or between
duplexed processors, integrating the checksum mechanism into the control and data paths. This assumes that applications are distributed between the processors in such a way that a fixed number of processors is used and that the execution of instructions is bound to dedicated processors.

References

[1] S. Lin, D. Costello, Error Control Coding, Prentice-Hall, 1983.
[2] W. Peterson, E. Weldon, Error-Correcting Codes, second ed., MIT Press, 1972.
[3] E. Normand, Single event upset at ground level, IEEE Transactions on Nuclear Science 43 (6) (1996) 2742–2750.
[4] H. Kobayashi et al., Soft errors in SRAM devices induced by high energy neutrons, thermal neutrons and alpha particles, IEDM Technical Digest (2002) 337–340.
[6] M. Rimén, J. Ohlson, A study of the error behavior of a 32-bit RISC subjected to simulated transient fault injection, in: Proc. of the Int'l. Test Conference, Baltimore, USA, 1992, pp. 696–704.
[7] C. Weaver, T. Austin, A fault tolerant approach to microprocessor design, in: IEEE Int'l. Conference on Dependable Systems and Networks (DSN-2001), 2001.
[8] Xilinx, Virtex-E 1.8 V Field Programmable Gate Arrays, 2002 (checked 30.11.07).
[9] M. Namjoo, Techniques for concurrent testing of VLSI processor operation, in: Proc. of the 12th Int'l. Symp. on Fault-Tolerant Computing, IEEE Computer Society, Santa Monica, CA, 1982, pp. 461–468.
[10] T. Sridhar, S.M. Thatte, Concurrent checking of program flow in VLSI processors, in: Digest of the 1982 Int'l. Test Conference, IEEE, 1982, pp. 191–199.
[11] J.P. Shen, M.A. Schuette, On-line self-monitoring using signatured instruction streams, in: Proc. of the 13th IEEE Int'l. Test Conference, 1983, pp. 275–282.
[12] T.M. Austin, DIVA: a reliable substrate for deep submicron microarchitecture design, in: Proc. of the 32nd Int'l. Symp. on Microarchitecture, November 1999.
[13] R. Leveugle, T. Michel, G. Saucier, Design of microprocessors with built-in on-line test, in: Proc. of the 20th IEEE Int'l. Symp. on Fault-Tolerant Computing, 1990, pp. 450–456.
[14] G. Miremadi, J. Ohlson, M. Rimén, J. Karlsson, Use of time and address signatures for control flow checking, in: Proc. 5th IFIP International Working Conference on Dependable Computing for Critical Applications (DCCA-5), IEEE Computer Society Press, Urbana-Champaign, IL, USA, September 1995, pp. 201–221.
[15] Y. Tamir, M. Tremblay, High-performance fault-tolerant VLSI systems using micro rollback, IEEE Transactions on Computers 39 (4) (1990) 548–554.
[16] Y. Tamir, M. Tremblay, M. Liang, T. Lai, The UCLA mirror processor: a building block for self-checking self-repairing computing nodes, in: Proc. of the 21st Int'l. Symp. on Fault-Tolerant Computing (FTCS), 1991, pp. 178–185.
[17] J.C. Smolens et al., Fingerprinting: bounding soft-error-detection latency and bandwidth, IEEE Micro 24 (6) (2004) 22–29.
[18] M. Pflanz, On-Line Error Detection and Fast Recover Techniques for Dependable Embedded Processors, LNCS 2270, Springer, Berlin/Heidelberg/New York, 2002, ISBN 3-540-43318-X.
[19] M.T. Yourst, PTLsim: a cycle accurate full system x86-64 microarchitectural simulator, in: IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS'07), San Jose, California, April 2007.
[20] J. Karlsson, P. Lidén, Transient fault effects in the MC6809E 8 bit microprocessor: a comparison of results of physical and simulated fault injection experiments, Technical Report No. 96, Dept. of Computer Engineering, Chalmers University of Technology, Göteborg, Sweden, 1990.
[21] N.J. Wang et al., Characterizing the effects of transient faults on a high-performance processor pipeline, in: Proc. of the 2004 International Conference on Dependable Systems and Networks, IEEE Computer Society, 2004, pp. 61–70, doi: 10.1109/DSN.2004.1311877.
[22] P. Shivakumar, M. Kistler, W. Keckler, D. Burger, L. Alvisi, Modeling the effect of technology trends on the soft error rate of combinational logic, in: Proc. IEEE Int'l. Conference on Dependable Systems and Networks (DSN'02), 2002, pp. 389–398.
[23] R. Baumann, E. Smith, Neutron-induced boron fission as a major source of soft errors in deep submicron SRAM devices, in: Proc. of the 38th International Reliability Physics Symposium, IEEE Computer Society, 2000.
[24] F.L. Kastensmidt, L. Carro, R. Reis, Fault-Tolerance Techniques for SRAM-Based FPGAs, Springer, 2006, ISBN 0-387-31068-1.
[25] J. Ziegler, Terrestrial cosmic ray intensities, IBM Journal of Research and Development 42 (1) (1998) 117 (checked 30.11.07).
[26] S. Yu, Fault Tolerance in Adaptive Real-Time Computing Systems, Dissertation, UMI No. AAI3038176, Stanford University, 2002.
[27] T.C. May, M.H. Woods, Alpha-particle-induced soft errors in dynamic memories, IEEE Transactions on Electron Devices 26 (1) (1979) 2–9.
[28] S.S. Mukherjee, M. Kontz, S.K. Reinhardt, Detailed design and evaluation of redundant multithreading alternatives, in: Proc. of the 29th Annual International Symposium on Computer Architecture (ISCA), 2002, pp. 99–110.
[29] P. Lidén, P. Dahlgren, R. Johansson, J. Karlsson, On latching probability of particle induced transients in combinational networks, in: 24th Int'l. Symp. on Fault-Tolerant Computing (FTCS-24), IEEE Computer Society Press, Austin, TX, USA, June 1994, pp. 340–349.
[30] D. Knuth, The Art of Computer Programming, Volume 2 (Seminumerical Algorithms), third ed., Addison-Wesley, 1997, ISBN 0-201-03822-6.
[31] B. Fechner, J. Keller, Compression-free checksum-based fault-detection schemes for pipelined processors, in: Proc. ARCS 2007 Workshop on Reliability and Fault-Tolerance, 2007.
[32] B. Fechner, Analysis of checksum-based execution schemes for pipelined processors, in: Proc. 11th IEEE Workshop on Dependable Parallel, Distributed and Network-Centric Systems, 2006.
[33] B. Fechner, Securing execution in simultaneous multithreaded processors, in: Proc. 5th European Dependable Computing Conference (EDCC-5), 2005.
[34] B. Schroeder, E. Pinheiro, W.D. Weber, DRAM errors in the wild: a large-scale field study, in: Proc. 11th Conf. on Measurement and Modeling of Computer Systems, 2009, pp. 193–204.
Bernhard Fechner has an education in accident insurance and received a Master's degree and a PhD in computer science, both from the University of Hagen. He worked for several years as a consultant and was involved in the conception and realization of the first telematic and virtual computer architecture lab of the University of Hagen, Germany. Dr. Fechner is currently an assistant professor at the Chair of Systems and Networking (Prof. Dr. Ungerer) at the University of Augsburg, Germany. His research interests include (fault-tolerant) microarchitectures, their realization (FPGA and standard-cell) and novel fault models. He is an IEEE and GI member.