Exploring the limitations of dataflow SIHFT techniques in out-of-order superscalar processors


D.M. Cardoso, R. Tonetto, M. Brandalero, G. Nazar, A.C. Beck, J.R. Azambuja
Instituto de Informática, Universidade Federal do Rio Grande do Sul, Porto Alegre, RS, Brazil

Abstract

This paper presents an analysis of the efficiency of dataflow SIHFT techniques applied to out-of-order superscalar processors. A set of five SIHFT techniques is applied to a benchmark of applications running on different configurations of the complex BOOM superscalar processor. A fault injection campaign is performed by simulating over 42 million faults affecting all of the system's flip-flops, and the efficiency of each technique is evaluated in terms of IPC gain, sensitivity reduction, and execution time. Results are presented individually for each micro-architectural structure, showing improvements in sensitivity reduction and IPC gain, but a deterioration in execution time. These conclusions may help designers develop less sensitive complex processor architectures.

1. Introduction

High-performance multiple-issue dynamically-scheduled processors have become an excellent alternative to meet the processing demands of applications with growing complexity, mainly because they provide computational power at acceptable cost while keeping binary compatibility. However, the very Complementary Metal-Oxide-Semiconductor (CMOS) technology shrinking that nowadays allows the high transistor integration required by high-performance systems is also responsible for an increased vulnerability to soft errors [1]. Even though complex superscalar processors have an inherently high degree of error masking (e.g., due to speculative execution) [2], the increasing susceptibility to soft errors is a growing concern for the design of fault tolerance strategies. This challenge has paved the way for alternative solutions, such as error protection mechanisms at the software level.

Software-Implemented Hardware Fault Tolerance (SIHFT) approaches can provide high flexibility, low development time, and low-cost solutions for computer-based systems, because no modifications to the processor hardware are required. In addition, Commercial Off-The-Shelf (COTS) processors, such as the aforementioned complex superscalar processors, can be used in safety-critical applications if the available protection mechanisms, such as SIHFT techniques, meet the required safety target [3].

This paper evaluates the execution time, Instructions Per Cycle (IPC), and sensitivity reduction obtained when applying SIHFT techniques to applications executing on an out-of-order (OoO) processor. To do so, we start by choosing five variations of SIHFT techniques, applying them to three variations of the complex superscalar Berkeley Out-of-Order Machine (BOOM), including versions with single-, dual-, and quad-issue cores, and analyzing the resulting data in terms of IPC gain and execution time.



Then, a fault injection campaign is performed, in which over 42 million faults are injected by RTL simulation into all flip-flops describing BOOM, in all versions (unprotected and protected). Afterwards, we evaluate the sensitivity of all software versions and compare the results with SIHFT techniques applied to simple processors. Finally, we analyze each micro-architectural structure of BOOM individually with respect to sensitivity reduction and draw conclusions about the limitations of SIHFT techniques applied to complex OoO processors.

The novelty of this work lies in evaluating different versions of SIHFT strategies by performing RTL fault injection campaigns with high controllability. By doing so, we are able to provide more realistic and comprehensive results than related work. To the best of our knowledge, this is the first work in the literature to inject faults into all individual micro-architectural structures of a real OoO superscalar processor and raise the main limitations of dataflow SIHFT techniques applied to this class of processors.

2. Related work

Fault tolerance techniques involve some kind of redundancy, categorized into (1) hardware, (2) information, and (3) time redundancy; these categories can also be combined into hybrid techniques [4]. As we are interested in options for processors already on the market, without hardware modifications, this paper focuses on SIHFT techniques, which are divided into dataflow and control-flow techniques. Given the flexibility of such techniques, assessing the reliability of the different solutions is a time-consuming process. Therefore, some works have examined the global effectiveness of the techniques before their application [5].


Software-implemented strategies that protect data integrity use temporal redundancy, i.e., they re-execute operations redundantly. The Error Detection by Duplicated Instructions (EDDI) [6] and duplicated-variables [7] techniques use this concept to insert redundancy at the instruction level in the binary code. All registers and memory locations are copied to unused resources; likewise, all assembly operations are duplicated, right after the original ones, using the register and memory location replicas. Instructions performed on a register are thus also executed on its copy, saving the result of each operation in a different location. For fault detection, branch instructions are inserted to check the consistency of the logical registers. These checks are inserted immediately before any instruction that writes to the data memory, with one comparison instruction for each register used by the memory access instruction.

The efficacy of EDDI was evaluated at a low level for superscalar processors in [8], but that work does not provide insights into the efficacy of the technique for each particular structure of the processor. Our experiments, in contrast, assess the effectiveness of five different SIHFT techniques with respect to the different hardware structures of a complex superscalar pipeline. We characterize the effectiveness of each technique for each particular micro-architectural structure, at a realistic, flip-flop level, highlighting which structures should be hardened by alternative fault tolerance mechanisms.

3. Implementation

Our setup comprises the BOOM processor (and its configurations), characterized by configuration values and area, and the SIHFT techniques, characterized by execution time and IPC gain. Both are described in detail in the following sections.

3.1. Superscalar BOOM processor

To evaluate our approach, we chose the BOOM processor [9], a parameterizable RISC-V core, due to its growing use in academia and its configuration flexibility. We performed experiments on three different configurations: single-issue (Single), dual-issue (Dual), and quad-issue (Quad) cores, the last of which is similar to the ARM Cortex-A15 processor. Table 1 shows their structural configurations.

The BOOM processor can be split into 12 separate micro-architectural structures: Physical Register File (PRF), Register Renaming units (Rename), Issue Unit (Issue), Fetch Unit (Fetch), Branch Reorder Buffer (BROB), Backing Predictor (BPD), Load-Store Unit (LSU), Reorder Buffer (ROB), Control/Status Registers (CSR File), instruction Translation-Lookaside Buffer (iTLB), branch hardware (Branch HW), and the pipeline execution stage (EXE), which includes the ALUs and the bypass network. The branch hardware contains information necessary for the correct resolution of branches, not only performance-enhancing structures; hence, faults injected into this structure may lead to failures.

In order to obtain the area of each individual micro-architectural structure, we synthesized the RTL description to NanGate's 15 nm standard cell library [10] with a 2.1 GHz synthesis target. Table 2 shows the area for each core and each structure. As one can see, the EXE plays a major role for the single-issue configuration, while the PRF, BPD, and LSU surpass it for the quad-issue configuration.

Table 1
BOOM configurations.

Structure               Single   Dual   Quad
Fetch width             1        2      4
PRF registers           100      110    128
Fetch buffer entries    4        4      4
Instruction window      10       20     28
LSU entries             4        16     32
ROB entries             24       48     64

Table 2
Micro-architectural structure areas (μm²).

Structure     Single    Dual      Quad
PRF           17,742    25,777    40,023
Rename        7680      16,178    26,390
Issue         3151      9039      17,630
Fetch         1537      1865      2496
BROB          778       938       877
BPD           10,227    20,350    41,299
LSU           4149      13,807    32,192
ROB           5303      8319      9495
CSR File      2662      2660      2913
iTLB          1147      1147      1150
Branch HW     4160      4242      4326
EXE           26,174    26,990    28,301
Total area    84,710    131,312   207,092
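The shift in area distribution noted above can be checked directly from Table 2; the snippet below (our illustration, not part of the original evaluation flow) computes each structure's share of the core area.

    # Fraction of core area per structure, using the Table 2 values.
    single = {"PRF": 17742, "Rename": 7680, "Issue": 3151, "Fetch": 1537,
              "BROB": 778, "BPD": 10227, "LSU": 4149, "ROB": 5303,
              "CSR File": 2662, "iTLB": 1147, "Branch HW": 4160, "EXE": 26174}
    quad = {"PRF": 40023, "Rename": 26390, "Issue": 17630, "Fetch": 2496,
            "BROB": 877, "BPD": 41299, "LSU": 32192, "ROB": 9495,
            "CSR File": 2913, "iTLB": 1150, "Branch HW": 4326, "EXE": 28301}

    def fractions(areas):
        total = sum(areas.values())
        return {s: a / total for s, a in areas.items()}

    # EXE is ~31% of the single-issue core but only ~14% of the quad-issue one,
    # where PRF (~19%), BPD (~20%), and LSU (~16%) take over.
    print(f"{fractions(single)['EXE']:.0%}", f"{fractions(quad)['EXE']:.0%}")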

3.2. SIHFT techniques

The five chosen dataflow SIHFT techniques are based on EDDI [6] and are called VAR and VARM. They transform the assembly code by inserting extra instructions that duplicate the dataflow instructions. For that, they first duplicate all used registers into spare ones. Then, they replicate all instructions that operate on the replicated data. Finally, they perform consistency checks between the original registers' data and their replicas' data by inserting compare instructions. When a mismatch is detected, the program flow branches to an error detection subroutine that flags an error and stops program execution. The difference between VAR and VARM is that VARM replicates load instructions with move instructions instead of a second load, increasing performance at the cost of lower fault detection.

The consistency checks are inserted at different points for VAR, VARM, and their variations: VAR inserts checks before any instruction that reads a register, Load and Store (LS) inserts checks only before memory access instructions, and Load, Store, and Branch (LSB) inserts checks only before memory access and control flow instructions, such as branch, jump, and call instructions. Table 3 shows the transformation of an original unprotected program code by all chosen dataflow SIHFT techniques. As one can see, the LS variation requires the fewest instructions, followed by LSB and VAR. The only difference between VAR and VARM is the use of "mv r2', r2" instead of "ld r2', (r4')".

We used 13 benchmarks in our analysis: CRC32, Dijkstra, k-means, matrix multiplication, 1D median filter, multiply filter, string search, qsort, rsort, Rijndael encryption, SHA, towers of Hanoi, and vector-vector add. All SIHFT techniques were applied automatically to the case-study applications.

Table 4 shows the average execution time overhead over all case-study applications, for all BOOM configurations. As one can notice, VARM LS yielded the lowest execution time overhead for all configurations, with overheads of 84.5%, 76.7%, and 42.2% for the single-, dual-, and quad-issue cores, respectively. VAR, on the other hand, more than doubled execution times for all BOOM configurations. It is also interesting to notice that, except for VAR, all variations presented 70% or less overhead on the quad-issue core.

To analyze the execution time overhead, considering the same frequency for all cores, we measured the mean instruction count and IPC for the SIHFT techniques, presented in Fig. 1. VAR has the highest instruction overhead, followed by LSB and LS, while VAR and VARM variations showed the same overhead, as expected. When considering IPC gain, results differ by BOOM configuration, but VARM LS shows the best IPC gain, which, combined with the lowest instruction overhead, results in the smallest execution time overhead. IPC gains are explained by the fact that instruction duplication inserts extra independent instructions, yielding an increase in Instruction-Level Parallelism (ILP) that can be better exploited by bigger cores [11].

Fig. 1. IPC gain and instruction count.
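Because all cores run at the same frequency, the time overhead of a hardened binary follows directly from how much the instruction count grows versus how much the IPC improves (t = I / (IPC × f)). The short sketch below illustrates this relationship; the input percentages are illustrative placeholders, not the measured values of Fig. 1.

    # Execution time at a fixed clock is t = I / (IPC * f), so the relative
    # time overhead depends only on instruction-count growth and IPC gain.
    def time_overhead(instr_overhead: float, ipc_gain: float) -> float:
        """Relative execution-time overhead; inputs as fractions (0.9 = +90%)."""
        return (1.0 + instr_overhead) / (1.0 + ipc_gain) - 1.0

    # Illustrative only: a technique that doubles the instruction count (+100%)
    # while the extra independent instructions lift IPC by 40% costs
    # 2.0 / 1.4 - 1 = ~43% in execution time, in the ballpark of the 42.2%
    # reported for VARM LS on the quad-issue core.
    print(f"{time_overhead(1.00, 0.40):.1%}")  # -> 42.9%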



Table 3
Dataflow technique transformations.

Original code:
    ld  r2, (r4)
    add r1, r2, r7
    beq r1, r3, lab
    st  r1, (r5)

VAR:
    bne r4, r4', err
    ld  r2, (r4)
    ld  r2', (r4')
    bne r2, r2', err
    bne r7, r7', err
    add r1, r2, r7
    add r1', r2', r7'
    bne r1, r1', err
    bne r3, r3', err
    beq r1, r3, lab
    bne r1, r1', err
    bne r5, r5', err
    st  r1, (r5)

VAR LS:
    bne r4, r4', err
    ld  r2, (r4)
    ld  r2', (r4')
    add r1, r2, r7
    add r1', r2', r7'
    beq r1, r3, lab
    bne r1, r1', err
    bne r5, r5', err
    st  r1, (r5)

VAR LSB:
    bne r4, r4', err
    ld  r2, (r4)
    ld  r2', (r4')
    add r1, r2, r7
    add r1', r2', r7'
    bne r1, r1', err
    bne r3, r3', err
    beq r1, r3, lab
    bne r1, r1', err
    bne r5, r5', err
    st  r1, (r5)

VARM LS:
    bne r4, r4', err
    ld  r2, (r4)
    mv  r2', r2
    add r1, r2, r7
    add r1', r2', r7'
    beq r1, r3, lab
    bne r1, r1', err
    bne r5, r5', err
    st  r1, (r5)

VARM LSB:
    bne r4, r4', err
    ld  r2, (r4)
    mv  r2', r2
    add r1, r2, r7
    add r1', r2', r7'
    bne r1, r1', err
    bne r3, r3', err
    beq r1, r3, lab
    bne r1, r1', err
    bne r5, r5', err
    st  r1, (r5)
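The automated application of these transformations can be viewed as a small rewriting pass over the instruction stream. The following Python sketch is our own illustration of the insertion rules of Table 3, not the authors' tool; the instruction encoding and the shadow-register mapping are assumptions made for readability.

    # Toy rewriting pass for the VAR family. An instruction is a tuple
    # (opcode, dest, sources, branch_target); shadow(r) names the spare
    # register that holds r's replica.
    def shadow(reg: str) -> str:
        return reg + "'"  # hypothetical replica-register mapping

    def checks(regs):
        # one consistency check per source register; mismatch branches to 'err'
        return [("bne", None, [r, shadow(r)], "err") for r in regs]

    def harden(program, policy="VAR", varm=False):
        out = []
        for op, dest, srcs, target in program:
            is_mem = op in ("ld", "st")
            is_branch = op in ("beq", "bne", "jal")
            # VAR checks before every register read; LS only before memory
            # accesses; LSB before memory accesses and control-flow changes.
            if policy == "VAR" or (is_mem and policy in ("LS", "LSB")) \
                    or (is_branch and policy == "LSB"):
                out.extend(checks(srcs))
            out.append((op, dest, srcs, target))
            if op == "ld":
                if varm:  # VARM copies the loaded value with a move...
                    out.append(("mv", shadow(dest), [dest], None))
                else:     # ...while VAR re-executes the load itself
                    out.append(("ld", shadow(dest),
                                [shadow(s) for s in srcs], None))
            elif dest is not None:
                # duplicate arithmetic on the replicated operands
                out.append((op, shadow(dest),
                            [shadow(s) for s in srcs], None))
        return out

    prog = [("ld", "r2", ["r4"], None), ("add", "r1", ["r2", "r7"], None),
            ("beq", None, ["r1", "r3"], "lab"), ("st", None, ["r1", "r5"], None)]
    hardened = harden(prog, policy="LS", varm=True)  # the VARM LS column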

Table 4
Execution time overhead (%). The Unprotected row shows absolute execution times.

Technique      Single     Dual       Quad
Unprotected    43.6 μs    24.6 μs    24.1 μs
VAR            161.1      158.9      104.7
VAR LS         105.9      80.3       66.4
VAR LSB        121.8      116.2      70.2
VARM LS        84.5       76.7       42.2
VARM LSB       118.4      117.9      58.8

Table 5
Sensitivity for each case-study application.

               Unprotected              VAR
Benchmark      Single  Dual   Quad      Single  Dual   Quad
CRC32          1.09    2.36   1.45      0.69    0.54   0.26
Dijkstra       1.26    1.43   0.96      0.49    0.60   0.41
Kmeans         0.98    1.15   0.85      0.24    0.23   0.17
MatMul         0.98    1.90   1.53      0.51    0.55   0.42
Median         0.97    0.98   0.56      0.66    0.53   0.56
Multiply       1.02    1.17   1.01      0.81    0.43   0.48
Search         1.31    1.56   1.13      0.42    0.59   0.57
Qsort          0.77    1.00   0.74      0.44    0.42   0.53
Rijndael       1.06    1.12   1.00      0.26    0.18   0.12
Rsort          0.93    1.52   1.19      0.38    0.40   0.26
SHA            1.12    1.96   1.63      0.51    0.43   0.38
Towers         1.11    1.70   1.29      0.44    0.66   0.47
vvadd          0.98    1.11   0.86      0.37    0.32   0.26

4. Fault injection results

To evaluate the robustness of the SIHFT techniques on BOOM, we performed a fault injection campaign on the 12 micro-architectural structures, using the platform described in [12]. We injected Single Event Upsets (SEUs) into a random bit of a random register, one per program execution, following the statistical model described in [13]; this resulted in over 42 million injected faults, leading to a confidence level of 99% with an error margin of 1%.
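For reference, the number of injections such a statistical campaign requires can be computed with the model of [13]; the sketch below is our illustration, using the usual worst-case failure probability p = 0.5 and an assumed placeholder for the fault-space size N.

    import math

    def sfi_sample_size(N: float, e: float, t: float, p: float = 0.5) -> float:
        """Sample size from the statistical fault injection model [13]:
        n = N / (1 + e^2 * (N - 1) / (t^2 * p * (1 - p)))."""
        return N / (1 + e**2 * (N - 1) / (t**2 * p * (1 - p)))

    # 99% confidence (t ~ 2.576), 1% error margin; N is a placeholder here.
    n = sfi_sample_size(N=1e9, e=0.01, t=2.576)
    print(math.ceil(n))  # ~16,590 injections per experiment; repeated over all
                         # benchmarks, techniques, and cores, campaigns reach
                         # tens of millions of faults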

We assume the instruction and data caches are protected with Error Correction Code (ECC) logic, and therefore no faults were injected into these structures. Faults were classified as Silent Data Corruption (SDC), when the final memory footprint differed from the golden memory; Hang, when the execution of the application timed out; or Detected, when the error detection mechanism branched to the error detection label. We use the term sensitivity to express the probability that a single bit flip causes either an SDC or a hang; the sensitivity of a given structure is thus the fraction of non-masked faults obtained at the end of the fault injection campaign. In the following, we analyze VAR and its variations VARM, LS, and LSB.
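A minimal sketch of this classification, assuming hypothetical fields of a per-run record (final memory image, timeout flag, and whether the error label was reached):

    from enum import Enum

    class Outcome(Enum):
        MASKED = "masked"      # finished and matched the golden memory
        DETECTED = "detected"  # the inserted checks branched to the error label
        SDC = "sdc"            # silent corruption of the memory footprint
        HANG = "hang"          # execution timed out

    def classify(run, golden_memory) -> Outcome:
        # 'run' is a hypothetical record of one faulty execution
        if run.timed_out:
            return Outcome.HANG
        if run.reached_error_label:
            return Outcome.DETECTED
        if run.memory != golden_memory:
            return Outcome.SDC
        return Outcome.MASKED

    def sensitivity(outcomes) -> float:
        """Fraction of injected faults causing an SDC or a hang."""
        bad = sum(o in (Outcome.SDC, Outcome.HANG) for o in outcomes)
        return bad / len(outcomes)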





4.1. Analysis on VAR

As the SIHFT technique with the most check instructions, VAR is expected to provide the best sensitivity reduction, at the highest cost in execution time overhead. Nonetheless, since we intend to explore the limits of the sensitivity reduction achievable by dataflow SIHFT techniques, VAR is the best starting point.

Fig. 2 depicts the micro-architectural structures' sensitivities for all three BOOM configurations: Fig. 2a shows data for the unprotected BOOM, and Fig. 2b shows data for the VAR SIHFT technique, applied to the benchmark applications. Both figures plot each micro-architectural structure's sensitivity multiplied by its fraction of the overall processor area. We use this metric to better estimate the impact that a given structure has on the overall core's sensitivity, as bigger/smaller structures will suffer more/fewer upsets.

Fig. 2. BOOM sensitivity divided by micro-architectural structures. (a) Unprotected original BOOM. (b) VAR SIHFT-hardened BOOM.
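The weighting itself is a one-line computation. In the sketch below, the areas are the quad-issue values from Table 2, while the raw per-structure sensitivities are placeholders rather than measured campaign results.

    # Area-weighted sensitivity contribution per structure (quad-issue core).
    area = {"PRF": 40023, "Rename": 26390, "Issue": 17630, "Fetch": 2496,
            "BROB": 877, "BPD": 41299, "LSU": 32192, "ROB": 9495,
            "CSR File": 2913, "iTLB": 1150, "Branch HW": 4326, "EXE": 28301}
    raw = {s: 0.01 for s in area}  # placeholder raw sensitivities

    total = sum(area.values())  # 207,092 um^2 (Table 2)
    weighted = {s: raw[s] * area[s] / total for s in area}
    # A structure with modest raw sensitivity can still dominate the core's
    # overall sensitivity if it occupies a large fraction of the area.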


When considering the unprotected processor, Fig. 2a shows that the PRF plays a significant role in the overall processor sensitivity, followed by the Rename unit and the LSU. The main reasons are the percentage of flip-flops belonging to those structures and the chance of an upset affecting the results (i.e., the occupancy of the structure). In other words, a high percentage of live bits in a structure, i.e., bits that can affect the application's output, can be counterbalanced by a small fraction of area for that structure. As an example, the LSU's occupancy is roughly the same for the single- and dual-issue cores, while the fraction of area occupied by the dual-issue LSU is double that of the single-issue one, so this structure tends to be more vulnerable in the dual-issue core. The quad-issue LSU, in turn, has a smaller occupancy due to its bigger number of entries (many of which are never allocated by the application during execution), while its fraction of area is not much bigger than the dual-issue one.

It is also interesting to notice how VAR tackles the sensitivity, in Fig. 2b. As one can see, it reduces the sensitivity of all structures except the Branch HW, which can be explained by the insertion of the several detection branches, which increase its occupancy. The PRF presents the best sensitivity reductions, as the chosen dataflow SIHFT techniques are tailored specifically for it. On the other hand, the reductions on the PRF do not reach 100%, as achieved by related work on simple processors [7]. Also, the LSU, ROB, Branch HW, and EXE structures still present high sensitivity.

Table 5 shows the sensitivity for each case-study application in our benchmark suite, with a mean reduction of 54%, 69%, and 67% for the single-, dual-, and quad-issue cores, respectively. Considering that those values account for the whole structure of the BOOM processor, while VAR targets mainly the PRF, these are adequate results. On the other hand, at least 36% of the faults still affect the computation of BOOM.
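These mean reductions can be re-derived from Table 5 by comparing the mean sensitivities of the VAR and unprotected versions; the sketch below reproduces them from the published table entries (small deviations are due to rounding in the table).

    # Mean VAR sensitivity reduction per core, recomputed from Table 5.
    unprot = {
        "Single": [1.09, 1.26, 0.98, 0.98, 0.97, 1.02, 1.31,
                   0.77, 1.06, 0.93, 1.12, 1.11, 0.98],
        "Dual": [2.36, 1.43, 1.15, 1.90, 0.98, 1.17, 1.56,
                 1.00, 1.12, 1.52, 1.96, 1.70, 1.11],
        "Quad": [1.45, 0.96, 0.85, 1.53, 0.56, 1.01, 1.13,
                 0.74, 1.00, 1.19, 1.63, 1.29, 0.86],
    }
    var = {
        "Single": [0.69, 0.49, 0.24, 0.51, 0.66, 0.81, 0.42,
                   0.44, 0.26, 0.38, 0.51, 0.44, 0.37],
        "Dual": [0.54, 0.60, 0.23, 0.55, 0.53, 0.43, 0.59,
                 0.42, 0.18, 0.40, 0.43, 0.66, 0.32],
        "Quad": [0.26, 0.41, 0.17, 0.42, 0.56, 0.48, 0.57,
                 0.53, 0.12, 0.26, 0.38, 0.47, 0.26],
    }
    for core in unprot:
        reduction = 1 - (sum(var[core]) / len(var[core])) \
                      / (sum(unprot[core]) / len(unprot[core]))
        print(core, f"{reduction:.0%}")
    # -> Single 54%, Dual 69%, Quad 66% (vs. 67% in the text, within rounding)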




In order to further analyze why VAR cannot achieve the sensitivity reduction reported by related work on simple processors, we selectively applied VAR to four case-study applications from our benchmark suite, protecting only a subset of the registers. The resulting data can be seen in Fig. 3, which shows, for the 1D median filter (a), CRC32 (b), Dijkstra (c), and PBM string search (d), the PRF sensitivity for each individually protected register and for some sets of protected registers.

Fig. 3. Physical Register File (PRF) sensitivity for the unprotected version, for each individual register, for groups of registers, and for VAR protection. (a) 1D Median Filter. (b) CRC32. (c) Dijkstra. (d) PBM String Search.

The first point to notice from Fig. 3 is that VAR is unable to reach 0.0 sensitivity. The main reasons are that some control flow faults may affect the PRF without being detected by VAR, and that faults may affect store instructions. Examples are a fault affecting the register used by a branch, which then branches incorrectly, and a fault affecting a register as it is being written to memory. The second point is that protecting some registers increases the sensitivity with respect to the original unprotected program code. This mainly happens because the complex architecture of BOOM tends to keep as much data as possible close to the execution stage, reading data from the PRF as little as possible. Consequently, the instruction replication and checking performed by VAR may not reach the actual PRF, checking temporary structures instead. When this happens, faults cannot be detected and lead to erroneous results.

4.2. Analysis on VAR variations

The VAR technique should be the best in terms of sensitivity reduction, but its high execution time overheads make it infeasible for many applications. In order to check how its variations perform in terms of sensitivity reduction, we tested VAR LS, VAR LSB, VARM LS, and VARM LSB. Fig. 4 shows the sensitivity reduction for all VAR variations.

Fig. 4. Sensitivity reduction for VAR variations.

As expected, all variations showed worse results than VAR. Still, for the single-issue configuration of BOOM, VAR LS and VAR LSB showed sensitivity reductions only 6.6% and 8.4% lower than VAR, respectively. For the dual-issue version of BOOM, VARM LS and VARM LSB showed sensitivity reductions 10.4% and 6.8% lower than VAR, while for the quad-issue version they showed reductions 3.4% and 3.6% lower than VAR, respectively. Such results show that, for complex OoO processors, such as the quad-issue configuration of BOOM, there are SIHFT techniques that can provide sensitivity reduction at a fair cost in terms of execution time overhead.

5. Conclusions and future work

We applied dataflow SIHFT techniques to different configurations of a real, complex OoO superscalar processor and performed a fault injection campaign at RTL. Results showed that, even though the extra instructions can be absorbed by the IPC gain and result in acceptable execution time overheads, the sensitivity reduction is smaller than that reported by related work on simple processor architectures. An analysis of selective protection and of the individual micro-architectural structures showed that SIHFT techniques alone are not able to provide sufficient sensitivity reduction. Future work will analyze the efficiency of control-flow and hybrid techniques.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001, the Fundação de Amparo à Pesquisa do Estado do RS (FAPERGS), and the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq).

References

[1] A. Dixit, A. Wood, The impact of new technology on soft error rates, IRPS, 2011.
[2] S.S. Mukherjee, et al., A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor, MICRO-36, 2003, pp. 29-40, Dec.
[3] A. Beck, et al., Adaptable Embedded Systems, 1st ed., Springer, New York, 2012, pp. 211-242.
[4] Sartor, et al., Exploiting idle hardware to provide low overhead fault tolerance for VLIW processors, JETC 13 (2) (2017), Mar.
[5] J.I. González, et al., Sharc: an efficient metric for selective protection of software against soft errors, Microelectron. Reliab. 88 (2018) 93-98.
[6] N. Oh, et al., Error detection by duplicated instructions in super-scalar processors, IEEE Trans. Reliab. 51 (1) (2002) 63-75, Mar.
[7] J.R. Azambuja, et al., Heta: hybrid error-detection technique using assertions, IEEE Trans. Nucl. Sci. 60 (4) (2013) 2805-2812, Aug.
[8] E. Cheng, et al., Tolerating soft errors in processor cores using CLEAR, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 37 (9) (2018) 1839-1852, Sep.
[9] C. Celio, et al., The Berkeley out-of-order machine: an industry-competitive, synthesizable, parameterized RISC-V processor, EECS Department, University of California, Berkeley, Tech. Rep., 2015.
[10] Martins, et al., Open cell library in 15 nm FreePDK technology, ISPD, 2015.
[11] Sartor, et al., Adaptive ILP control to increase fault tolerance for VLIW processors, ASAP, 2016, Jul.
[12] R. Tonetto, et al., Precise evaluation of the fault sensitivity of superscalar processors, DATE, 2018.
[13] R. Leveugle, et al., Statistical fault injection: quantified error and confidence, DATE, 2009.

