Proceedings of the 2nd IFAC Conference on Embedded Systems, Computational Intelligence and Telematics in Control (CESCIT 2015), June 22-24, 2015. Maribor, Slovenia
Available online at www.sciencedirect.com
ScienceDirect
IFAC-PapersOnLine 48-10 (2015) 252–257

Software Reliability Validation and Verification Using Fault Injection Techniques on a Fault Tolerant Processor

Gregor Kirbiš, David Selčan, Iztok Kramberger

Faculty of Electrical Engineering and Computer Science, University of Maribor, Maribor, Slovenia (e-mail: [email protected])

Abstract: Due to the continuously increasing complexities of embedded electronic systems, there is a clear need for more sophisticated methods of testing and evaluating the performances and reliabilities of the embedded software. One such method is to simulate potentially dangerous events, while monitoring the system’s response and stability. This paper presents a software simulation tool, which will be used for evaluating the mission critical software of the TRISAT mission. The software simulation tool simulates, on a per-cycle basis, the fault tolerant processor and its peripherals while also simulating the effects of the space environment on the simulated hardware. Using the simulation software, a dependability analysis was performed regarding the use of Error Correction and Detection Codes on the processor data memory as well as the rate of memory scrubbing, using two benchmark algorithms: matrix multiplication and quick sort.

© 2015, IFAC (International Federation of Automatic Control) Hosting by Elsevier Ltd. All rights reserved.
ISSN 2405-8963. Peer review under responsibility of International Federation of Automatic Control. DOI 10.1016/j.ifacol.2015.08.140

Keywords: simulation, radiation, fault detection, fault tolerance, error detection and correction.

1. INTRODUCTION

Due to increasing integration levels of electronic components, there arose a further need for improving the reliability of high-density systems, as described by Todd Austin (2008). The fabrication process technological trends are still in favour of reducing the fabrication sizes, which results in an increased deviation of transistor parameters. As a consequence, devices manufactured in this way need to integrate hardware and software error-mitigating techniques in order to improve their reliabilities. This is especially true for mission critical applications, like satellite control systems.

This paper presents a software simulation procedure using fault injection techniques, which is planned for application during the TRISAT mission, a technological demonstration space mission with additional educational benefits. The software simulator will be used to verify the operation of the on-board software. The same software can also be used during the satellite operation for reproducing the same conditions of the satellite.

The radiation environment encountered in space presents a unique challenge from a system reliability standpoint. Several radiation-induced effects need to be considered: single event effects (SEE), total ionizing dose (TID), single event upset (SEU) and single event transient (SET). Each of these effects causes different changes within the material, which can have devastating effects on the reliability of the system and therefore the mission’s success.

In order to mitigate the previously mentioned effects, system reliability can be increased by using redundancy. In such systems a dedicated fault detection, isolation and recovery (FDIR) unit needs to be implemented, which reports a critical system error when it is detected and prevents the system from hanging or crashing. In addition, every error mitigating unit needs to be evaluated either by software simulation or laboratory trials. In the related work from M. Karunarante (2005) and Uroš Legat (2010) the faults are injected during the synthesis of the HDL code or are injected by using external processors directly to the hardware. These methods require detailed knowledge of the FPGA hardware design, which is not desirable for general software simulators.

In order to simulate the impact of SEE and to assess the risks they present to the TRISAT mission, we have designed a program simulator which can help us evaluate the program design adequacy of the application. A simple and popular method, which reduces the impact of radiation effects on system reliability, is the multiplexed redundant execution, described by Pramod Subramanyan (2010). That method is not well suited for use on memory-constrained systems, as it suffers from higher demands for memory and increased power consumption. On the other hand, redundancy methods can resolve those multiple induced errors which are very hard to resolve using coding techniques. In many cases several of the presented techniques have been combined. Combining the methods results in a fault tolerant (FT) and reliable system.

Additionally, with increasing system complexity the likelihood of introducing software bugs also rises. Software simulation is usually performed in order to mitigate them. Mission critical software for space applications can also benefit from tests which also simulate the radiation-induced effects. One such simulator was described by Actel (2007) in Application Note AC304. Such system tests are usually not performed on the hardware due to the high cost of high-energy particle exposure.

Figure 2: Memory allocation of a memory row
In a high-density sub-micron SRAM memory, the memory cell placement is tight; for example, 35 nm technology results in cell sizes smaller than 10 µm². We can deduce from this that high-energy particles will still cause at least one double error, and in the worst-case scenario even triple errors are possible. The probability of the data being changed in such a way that the EDAC would not detect a double error is extremely low: for this to occur, more than eleven bits would have to be inverted simultaneously by an SEE. The PicoSky FT processor is designed so that the EDAC units are transparent to the user and do not cause any additional delays during memory read and write operations. In order to ensure a high level of fault tolerance, it includes two separate fault detection, isolation and recovery units, namely the supervisor and the FDIR. The main difference between these units is that the supervisor can only handle the processor’s traps, while the FDIR unit can also determine the error type, source and the procurement plan.
A radiation test study on the ProASIC3 family of FPGAs has shown that the FPGA core is much more tolerant to SEE than the FlashROM and SRAM memory; detailed test results of TID and SEE were presented by Sana Rezgui (2010). Because of this, we decided to inject faults only within the SRAM data memory. The high-density sub-micron SRAM memory is more vulnerable to SEU. When an SEE occurs, it corrupts not only a single memory cell but also some of the neighbouring cells, either in the same row or column, as presented by D. G. Mavis (2008). In order to mitigate the effects of the induced errors, the memory cells holding the same fragment of data need to be set at least 100 µm apart. Such large spacing between the memory cells is hard to achieve. In light of this, we propose a real-time software testing simulator with which we can simulate the effects of the injected faults within the memory allocations.

2. DESIGN DESCRIPTION
2.1 EDAC unit

A simple and popular method is to use Hamming codes. We decided that the Hamming (7, 4) code with an additional parity bit would be more suitable, since it is simple to implement within an FPGA, can correct all single errors in one byte and can detect most double errors. An interesting EDAC approach was also presented by T. Aaron Gulliver (1993), who stated that with his method any adjacent triple errors can be detected and double and single errors can be corrected. This approach may still need to be considered because of its simplicity, depending on the SRAM memory size and speed.
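As an illustration of the coding scheme described above, the following Python sketch encodes a 4-bit data fragment with a Hamming (7, 4) code plus one additional overall parity bit and decodes it again. The bit ordering, the function names and the status strings are assumptions made for this example; the paper does not give the PicoSky bit layout, and the actual EDAC unit is implemented in hardware.

```python
def hamming84_encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword (assumed bit layout)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]          # covers Hamming positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]          # covers positions 2, 3, 6, 7
    p3 = d[1] ^ d[2] ^ d[3]          # covers positions 4, 5, 6, 7
    bits = [p1, p2, d[0], p3, d[1], d[2], d[3]]   # Hamming positions 1..7
    overall = 0
    for b in bits:
        overall ^= b
    code = overall << 7              # bit 7 holds the additional parity bit
    for i, b in enumerate(bits):
        code |= b << i               # bit i holds Hamming position i + 1
    return code


def hamming84_decode(code):
    """Return (nibble, status); status is 'ok', 'corrected' or 'double'."""
    bits = [(code >> i) & 1 for i in range(7)]
    syndrome = 0
    for position in range(1, 8):
        if bits[position - 1]:
            syndrome ^= position     # XOR of the positions of all set bits
    overall = 0
    for i in range(8):
        overall ^= (code >> i) & 1   # parity over all eight stored bits
    if overall == 0 and syndrome != 0:
        return None, "double"        # two flipped bits: detected, not corrected
    status = "ok"
    if overall == 1:
        status = "corrected"
        if syndrome != 0:
            bits[syndrome - 1] ^= 1  # flip the single faulty bit back
        # syndrome == 0 here means only the additional parity bit was hit
    nibble = bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)
    return nibble, status
```

In this sketch a single flipped bit anywhere in the 8-bit codeword is corrected, while any two flipped bits are reported as a double error without attempting a correction.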
In our previous work we designed a flexible 8/16-bit fault tolerant (FT) FPGA core named PicoSky, a detailed description of which was presented by I. Kramberger (2012). The core includes separate memory allocations for data and code.

[Figure 1 diagram, "PicoSky Fault Tolerant Processor Architecture": blocks IDU, BPU, ALU, FDIR, control/status (FT) registers, debug support, supervisor/user general purpose register files, code memory interface (EDAC) to FLASH, and data memory interface (EDAC) to SRAM.]
2.2 Memory scrubbing unit
The PicoSky FT processor includes a memory-scrubbing unit within its architecture; certain registers are dedicated to enabling it and programming the scrubbing rate. The main purpose of this unit is to correct single errors at designated memory locations. The scrubbing rate depends upon the probability of single error injection, because these errors are corrected using EDAC. It also needs to be considered that the memory-scrubbing unit introduces a certain delay for reading, correcting and writing operations. An equation for describing the scrubbing rate was proposed by D. G. Mavis (2008); it is given as equation (1).
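The read, correct and write-back behaviour described above can be sketched in Python as follows. This is only a behavioural illustration under stated assumptions: the names `scrub_pass` and `scrub_due`, the `decode`/`encode` callables and the status strings `"ok"`, `"corrected"` and `"double"` are hypothetical and do not come from the PicoSky design, where scrubbing is performed by a hardware unit configured through registers.

```python
def scrub_pass(memory, decode, encode):
    """Walk over all stored codewords once and rewrite the correctable ones.

    memory -- mutable sequence of stored codewords
    decode -- callable: codeword -> (value, status), status being
              'ok', 'corrected' or 'double' (assumed interface)
    encode -- callable: value -> codeword
    Returns (number of corrected words, list of uncorrectable addresses).
    """
    corrected = 0
    uncorrectable = []
    for address, word in enumerate(memory):
        value, status = decode(word)
        if status == "corrected":
            memory[address] = encode(value)   # write the repaired codeword back
            corrected += 1
        elif status == "double":
            uncorrectable.append(address)     # left for a higher-level unit
    return corrected, uncorrectable


def scrub_due(cycle, period_cycles):
    """True when a scrubbing pass should start at this cycle count."""
    return cycle > 0 and cycle % period_cycles == 0
```

Scrubbing more often leaves less time for a second upset to accumulate in a word that already holds a single error, at the cost of the extra read, correct and write cycles mentioned above.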
Figure 1: PicoSky FT core pool
In order to correct single errors and detect double errors (SECDED), and to achieve larger spacing between the memory cells that hold the same fragment of data, we decided to use Hamming (8, 4) coding, which adds a parity bit for each bit of data. In order to achieve greater distances between fragments of data, without the need for using redundant memory cells in the SRAM memory, we have implemented a simple interleaving between neighbouring cells. As a result, at least four memory cells in a row have to be corrupted before a double error occurs, meaning that in a single row up to four single errors can be successfully corrected.
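The interleaving can be sketched in Python using the reallocation order shown in Figure 2 (D0, D8, D16, D24, D1, D9, ...), in which four 8-bit codewords share one 32-bit memory row. Treating D0 as the least significant bit, as well as all function and constant names, is an assumption of this example.

```python
CODEWORDS_PER_ROW = 4    # four codewords share one 32-bit memory row
BITS_PER_CODEWORD = 8    # 4 data/program bits + 4 EDAC parity bits
ROW_BITS = CODEWORDS_PER_ROW * BITS_PER_CODEWORD


def physical_to_logical(cell):
    """Logical bit index Dn stored in a given physical cell of the row."""
    return (cell % CODEWORDS_PER_ROW) * BITS_PER_CODEWORD + cell // CODEWORDS_PER_ROW


def logical_to_physical(bit):
    """Physical cell that holds logical bit Dn (inverse mapping)."""
    return (bit % BITS_PER_CODEWORD) * CODEWORDS_PER_ROW + bit // BITS_PER_CODEWORD


def interleave_row(row):
    """Reorder a 32-bit row from allocation order to reallocation order."""
    stored = 0
    for cell in range(ROW_BITS):
        stored |= ((row >> physical_to_logical(cell)) & 1) << cell
    return stored


def deinterleave_row(stored):
    """Restore the allocation order from a stored (interleaved) row."""
    row = 0
    for cell in range(ROW_BITS):
        row |= ((stored >> cell) & 1) << physical_to_logical(cell)
    return row


def codewords_hit(first_cell, burst_length):
    """Set of codeword indices touched by a burst of adjacent physical cells."""
    cells = range(first_cell, first_cell + burst_length)
    return {physical_to_logical(c) // BITS_PER_CODEWORD for c in cells}
```

With this mapping, any four adjacent physical cells belong to four different codewords, so a burst of that width appears as one single error per codeword, which the SECDED code can correct.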
[Figure 2 layout, recovered from the extracted labels: a 32-bit memory row holds four 8-bit codewords, each consisting of 4 data/program memory bits and 4 EDAC parity bits. Memory allocation order: D0, D1, D2, ..., D31. Memory reallocation order: D0 D8 D16 D24, D1 D9 D17 D25, D2 D10 D18 D26, D3 D11 D19 D27, D4 D12 D20 D28, D5 D13 D21 D29, D6 D14 D22 D30, D7 D15 D23 D31.]

The scrubbing-rate equation referred to in Section 2.2 reads:

$$T_{scrub} = 2\,\frac{R_{be}\,L^{2}}{R_{b}\,(L+P)^{2}} \qquad (1)$$

where $R_{be}$ is the desired error rate, $R_{b}$ is the SEU rate in errors/bit-day, $L$ is the word size and $P$ is the number of parity bits. The scrubbing time needed is inversely proportional to the induced error rate.
3. SIMULATOR’S STRUCTURE

In order to increase the system’s testability, the system tests need to be performed at the early design stages, as described by M. Karunarante (2005). The PicoSky processor was primarily designed and tested using Verilog. For the simulator, this code was translated to C-language code, which was implemented so that it preserved cycle-exact program execution. In order to achieve this, each module implemented in Verilog was translated into C code, with some additional timing adjustments of the program execution. This enabled us to connect hardware to the simulator via converters such as Ethernet to SPI or similar. In addition, the C-code implementation is easier and faster to compile than running the synthesis. The program under test is then compiled and loaded into the simulator.
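One common way to keep a software model cycle-exact is to split every clocked module into an evaluation phase and a commit phase, so that all modules see the register values from before the clock edge. The Python sketch below illustrates that idea only; it is an assumption of this example and not a description of how the PicoSky modules were actually translated, and the classes `Counter` and `Shadow` are hypothetical stand-ins for hardware modules.

```python
class Counter:
    """Hypothetical 8-bit counter that increments on every clock edge."""

    def __init__(self):
        self.value = 0
        self._next = 0

    def evaluate(self, system):
        self._next = (self.value + 1) & 0xFF   # compute, but do not publish yet

    def commit(self):
        self.value = self._next                # publish on the clock edge


class Shadow:
    """Hypothetical register that latches the counter value every cycle."""

    def __init__(self):
        self.value = 0
        self._next = 0

    def evaluate(self, system):
        self._next = system["counter"].value   # sees the pre-edge value

    def commit(self):
        self.value = self._next


def clock_cycle(system):
    """Advance all modules by exactly one clock cycle."""
    for module in system.values():
        module.evaluate(system)                # phase 1: read old state
    for module in system.values():
        module.commit()                        # phase 2: update all state


def run(system, cycles):
    for _ in range(cycles):
        clock_cycle(system)
    return system
```

Because no module publishes its new state before every module has read the old state, the result does not depend on the order in which the modules are visited, which mirrors the behaviour of registers clocked by a common edge.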
Figure 3: Block diagram of the simulator
[Labels recovered from the diagram include: virtual software, user, program under test (matrix multiplication, quick sort), fault injection generator, induced errors, GDB, TCP/IP, simulator, PicoSky, EDAC, FDIR unit, supervisor unit, fault tolerant program, generated data, copy of generated data, program output, system evaluation (MTTUF calculation), results.]

Figure 4: Flow chart of the simulator
[Steps recovered from the flow chart include: START, memory initialization, insert errors? (yes/no), fault injection, scrambler? (yes/no), memory scrambling, execute program, system evaluation, restart? (yes/no), END.]

The simulation program enabled us to perform subsystem tests at the early stages of the design; at this point many mission critical bugs were detected and subsequently solved. In addition, the simulator can be accessed over an Ethernet connection by other subsystems developed by supporting companies of the TRISAT mission. At the final testing procedure the working subsystems are physically connected and the functionality test is performed analogously to the simulation environment. In order to verify the simulator’s accuracy, the simulation results are compared with the measurements made during this stage.

Figure 5: Simulation design procedures: A – whole system simulator, B – system simulation with partial hardware, C – hardware integration.
[Labels recovered from the diagram include: computer, TCP/IP, subsystem simulator, physical interface, physical hardware, results.]

Essentially, the fault injection unit can insert faults at any memory location, whether it holds data, programs or registers. During each system’s development process, several design mistakes can remain undetected by the designers. Simulation programs are often used in order to fix the majority of such errors.

4. SIMULATION PROCEDURE

The simulator consists of two special units, the fault injection unit and the dependability calculation unit, which enable us to systematically change the state of any bit and monitor the system’s success rate. In order to simplify the simulator design, the error injection unit is constrained to only inject faults in between the processor’s cycles.

4.1 Fault injection unit

In order to fully test the success rate of the EDAC and to evaluate the system’s performance, fault injections are randomly generated over the assigned memory. Random number generation greatly depends on the distribution function. Since the probability of an SEE being induced over the assigned memory is more or less uniform, it is intuitive to use a uniform distribution function; other distribution functions may also be considered. Depending on the SEU generator initialization, the simulated faults can induce single or double errors. These errors can be generated within the program memory, data memory, registers or communication links. The type of the generated error can be transient, intermittent or permanent, as suggested by Seungjae Han (1995). At a specific count of cycles, a single or double error can be injected at a random location within the assigned memory. In addition, the simulator also supports the import of a look-up table (LUT), which can be used to minimize the number of simulation runs by only injecting errors at cube-distances, as explained in Lucas W. B. Lee (2005). Permanent faults can also be inserted at a specific memory location using the LUT.
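The behaviour of the fault injection unit can be approximated by the Python sketch below: faults are injected only between simulated cycles, at a fixed cycle interval, at a uniformly random byte address or at an address taken from a look-up table. The function names, the byte-wide memory model and the representation of the LUT as a plain list of addresses are assumptions of this example, not details given in the paper.

```python
import random


def inject_fault(memory, rng, errors=1, lut=None):
    """Flip `errors` distinct bits (1 = single, 2 = double) in one byte.

    memory -- bytearray standing in for the assigned memory
    rng    -- random.Random instance (uniform distribution)
    lut    -- optional list of candidate addresses (assumed LUT format)
    Returns (address, sorted list of flipped bit indices).
    """
    if errors not in (1, 2):
        raise ValueError("only single or double errors are modelled here")
    if lut:
        address = rng.choice(lut)
    else:
        address = rng.randrange(len(memory))
    flipped = sorted(rng.sample(range(8), errors))
    for bit in flipped:
        memory[address] ^= 1 << bit
    return address, flipped


def run_with_injection(memory, step, total_cycles, inject_every, rng,
                       errors=1, lut=None):
    """Run `step` once per cycle and inject faults between cycles.

    step -- callable(memory, cycle) simulating one processor cycle
    Returns a log of (cycle count, address, flipped bits) tuples.
    """
    log = []
    for cycle in range(total_cycles):
        step(memory, cycle)                    # one complete simulated cycle
        if (cycle + 1) % inject_every == 0:    # only in between two cycles
            address, flipped = inject_fault(memory, rng, errors, lut)
            log.append((cycle + 1, address, flipped))
    return log
```

Seeding the generator, for example with `random.Random(1)`, makes an injection campaign repeatable, which is convenient when a failing run has to be reproduced under a debugger.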
Figure 6: Fault injection unit flow chart [flow chart steps: Start; SEU generator initialization; LUT? (Yes/No); Select memory for fault injection; Random? (Yes: Generate random address; No: Insert address); Insert faults; Restart?; Stop]

4.2 Dependability calculation unit

The simulator also enables calculation of the dependability measure, defined as the mean time to unsafe failure (MTTUF). This assessment of system reliability was introduced by Vishwani D. Agrawal (2003), and is calculated as follows:

$$MTTUF = \frac{MTTF}{1 - C_{sys}} = \frac{1}{\lambda (1 - C_{sys})} \qquad (2)$$

Where $C_{sys}$ is the system's steady state fault coverage and $\lambda$ is the system's constant failure rate. The simulator takes a time stamp before and after the program's execution. For a given execution time and double error rate, we can calculate the MTTUF with the following equation:

$$MTTUF = \frac{t_{start} - t_{stop}}{E_{double}} \qquad (3)$$

Where $t_{start} - t_{stop}$ is the processing time and $E_{double}$ is the double error rate. This basic equation can be used for simple designs, where a double error represents a catastrophic failure. In more complex designs, the influence of the FDIR unit also needs to be considered. The probability of inserting a double error can also be calculated from the single error data rate and the memory size within which these errors occur:

$$P = \frac{L (L - 1)}{N^{2}} \qquad (4)$$

Where $P$ is the probability of a double error, $L$ is the length at which a double error occurs, and $N$ is the fault injection rate.

5. RESULTS AND DISCUSSION

The system's fault tolerance was evaluated using two benchmark programs: quick sort and matrix multiplication. These and similar programs are used for this purpose in the manner described by Olga Goloubeva (2006) and Mahroo Zandrahimi (2010). They were selected because they use most of the machine instructions of the PicoSky processor. In addition, the code was not fully optimized, so that more machine instructions were present within the code. The matrix multiplication used two 8 x 8 matrices, which occupied 1 kB of SRAM. The result of the calculation was stored in another 8 x 8 matrix, which used an additional 512 B of SRAM.

Table 1: Simulation results for matrix multiplication

| Protection | Cycle | Successful | Incorrect | Reset | Error rate [E/s] | MTTUF [ms] |
|---|---|---|---|---|---|---|
| Non-protected | 315,163 | 0.73 | 0.27 | 0.00 | 610.35 | 0.00 |
| Non-protected | 287,052 | 0.36 | 0.64 | 0.00 | 1831.05 | 0.00 |
| Non-protected | 270,652 | 0.16 | 0.84 | 0.00 | 4272.46 | 0.00 |
| Non-protected | 272,059 | 0.05 | 0.95 | 0.00 | 6713.87 | 0.00 |
| EDAC | 315,709 | 0.97 | 0.00 | 0.03 | 610.35 | 505.13 |
| EDAC | 311,122 | 0.87 | 0.00 | 0.13 | 1831.05 | 105.00 |
| EDAC | 295,489 | 0.68 | 0.00 | 0.32 | 4272.46 | 31.19 |
| EDAC | 281,348 | 0.39 | 0.00 | 0.61 | 6713.87 | 8.88 |
| EDAC with high scrubbing rate | 739,331 | 0.92 | 0.00 | 0.08 | 1831.05 | 443.60 |
| EDAC with high scrubbing rate | 714,848 | 0.53 | 0.00 | 0.47 | 4272.46 | 39.71 |
| EDAC with high scrubbing rate | 693,949 | 0.09 | 0.00 | 0.91 | 6713.87 | 3.47 |
| EDAC with middle scrubbing rate | 553,878 | 0.90 | 0.00 | 0.10 | 1831.05 | 240.01 |
| EDAC with middle scrubbing rate | 527,180 | 0.55 | 0.00 | 0.45 | 4272.46 | 31.63 |
| EDAC with middle scrubbing rate | 509,534 | 0.27 | 0.00 | 0.73 | 6713.87 | 9.26 |
| EDAC with low scrubbing rate | 436,150 | 0.77 | 0.00 | 0.23 | 1831.05 | 73.81 |
| EDAC with low scrubbing rate | 416,946 | 0.41 | 0.00 | 0.59 | 4272.46 | 14.21 |
| EDAC with low scrubbing rate | 386,668 | 0.09 | 0.00 | 0.91 | 6713.87 | 1.87 |

The results of the matrix multiplication were normalized by the measurement count. In addition, we categorized these results into three groups: successful, incorrect, and reset. The MTTUF value was also calculated, from which we could see that the best results were obtained by using EDAC alone. These results are not representative, since the memory scrubbing rate was still too low relative to the error rate, and the scrubbing was performed on 2 kB of memory, while the stored data used only approximately 80% of it. During this simulation we used three
different scrubbing rates: a high rate, at which the scrubbing unit is executed every 150 cycles, an intermediate rate at every 200 cycles, and a low rate at every 300 cycles. The PicoSky processor used a 20 MHz clock source for all simulations.
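To put the scrubbing intervals and the clock frequency given above on a common time scale, the following minimal sketch converts each interval to seconds and estimates how many injected errors fall between two executions of the scrubbing unit. It assumes that the error rates in E/s refer to the same time base as the 20 MHz clock, which is not stated explicitly; it also says nothing about how much memory a single execution of the scrubbing unit covers. The function and constant names are illustrative only.

```python
CLOCK_HZ = 20_000_000  # 20 MHz clock source stated in the text

# Cycles between two executions of the scrubbing unit, per scrubbing rate.
SCRUB_INTERVAL_CYCLES = {"high": 150, "middle": 200, "low": 300}

# Injected error rates in errors per second [E/s], as listed in the result tables.
ERROR_RATES = [610.35, 1831.05, 4272.46, 6713.87]


def scrub_period_s(cycles, clock_hz=CLOCK_HZ):
    """Time between two scrubbing-unit executions, in seconds."""
    return cycles / clock_hz


def errors_per_period(error_rate, cycles, clock_hz=CLOCK_HZ):
    """Expected number of injected errors between two executions.

    Assumption: the error rate and the clock share the same time base.
    """
    return error_rate * scrub_period_s(cycles, clock_hz)


if __name__ == "__main__":
    for name, cycles in SCRUB_INTERVAL_CYCLES.items():
        period_us = scrub_period_s(cycles) * 1e6
        worst = errors_per_period(max(ERROR_RATES), cycles)
        print(f"{name}: {period_us:.1f} us between executions, "
              f"{worst:.3f} errors per period at the highest rate")
```

Under these assumptions the three intervals correspond to 7.5, 10 and 15 microseconds between executions of the scrubbing unit.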
The comparisons between the scrubbing rates are shown in Figure 8. We can see that the different scrubbing rates do not contribute to a better success rate. When we plot the error rate vs. MTTUF, we can more clearly observe the effect of the scrubbing unit. A higher scrubbing rate offsets the curve to higher error rates, which means that this method is more tolerant to errors. The reason the same results are not obtained from the MTTUF value is that the scrubbing unit spends too much time correcting the memory.

Figure 7: Matrix multiplication results

Figure 8: Comparison between scrubbing rates when multiplying matrices

Table 2: Simulation results of the quick sort algorithm

| Protection | Cycle | Successful | Incorrect | Reset | Error rate [E/s] | MTTUF [ms] |
|---|---|---|---|---|---|---|
| Non-protected | 108,531 | 0.15 | 0.77 | 0.08 | 610.35 | 0.99 |
| Non-protected | 106,992 | 0.00 | 0.76 | 0.24 | 1831.05 | 0.00 |
| Non-protected | 103,667 | 0.00 | 0.56 | 0.44 | 4272.46 | 0.00 |
| Non-protected | 111,459 | 0.00 | 0.38 | 0.63 | 6713.87 | 0.00 |
| EDAC | 108,537 | 1.00 | 0.00 | 0.00 | 610.35 | |
| EDAC | 105,191 | 0.92 | 0.00 | 0.08 | 1831.05 | 115.71 |
| EDAC | 104,864 | 0.93 | 0.00 | 0.07 | 4272.46 | 73.40 |
| EDAC | 103,911 | 0.88 | 0.00 | 0.12 | 6713.87 | 38.10 |
| EDAC with high scrubbing rate | 276,080 | 0.96 | 0.00 | 0.04 | 1831.05 | 331.30 |
| EDAC with high scrubbing rate | 284,763 | 0.88 | 0.00 | 0.12 | 4272.46 | 109.16 |
| EDAC with high scrubbing rate | 273,975 | 0.69 | 0.00 | 0.31 | 6713.87 | 30.82 |
| EDAC with middle scrubbing rate | 206,212 | 1.00 | 0.00 | 0.00 | 1831.05 | 230.27 |
| EDAC with middle scrubbing rate | 207,277 | 0.89 | 0.00 | 0.11 | 4272.46 | 79.95 |
| EDAC with middle scrubbing rate | 201,996 | 0.63 | 0.00 | 0.37 | 6713.87 | 17.45 |
| EDAC with low scrubbing rate | 160,187 | 0.97 | 0.00 | 0.03 | 1831.05 | 224.26 |
| EDAC with low scrubbing rate | 161,222 | 0.90 | 0.00 | 0.10 | 4272.46 | 75.24 |
| EDAC with low scrubbing rate | 155,901 | 0.75 | 0.00 | 0.25 | 6713.87 | 22.83 |
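For reference, the MTTUF definition in Equation (2) can be evaluated directly. The sketch below is a minimal illustration of that formula, assuming a constant failure rate (so that MTTF is the reciprocal of the failure rate); the function names and the example values in the usage note are hypothetical and are not taken from the simulator.

```python
import math


def mttf(failure_rate):
    """Mean time to failure for a constant failure rate: MTTF = 1 / lambda."""
    if failure_rate <= 0:
        raise ValueError("failure_rate must be positive")
    return 1.0 / failure_rate


def mttuf(failure_rate, coverage):
    """Mean time to unsafe failure: MTTUF = MTTF / (1 - C_sys).

    coverage is the steady state fault coverage C_sys in the range [0, 1].
    With full coverage (C_sys = 1) no failure is unsafe, so the result is infinite.
    """
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must lie in [0, 1]")
    if coverage == 1.0:
        return math.inf
    return mttf(failure_rate) / (1.0 - coverage)
```

For example, a hypothetical failure rate of 2 failures per second with a coverage of 0.5 gives `mttuf(2.0, 0.5) == 1.0` seconds, twice the plain MTTF of 0.5 seconds.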
In Figure 7, we can see the comparison between the non-protected data and the data protected by EDAC. This figure shows that the EDAC unit greatly improved the success rates of the performed calculations.
Figure 10: Quick sort results
When comparing the non-corrected data with the EDAC-protected data for the quick sort algorithm, we can see an improvement similar to that in matrix multiplication.
Figure 9: Matrix multiplication reliability validation
The scrubbing unit was implemented in software using interrupts; this is also the main reason for its ineffectiveness.
While the scrubbing unit is correcting a single error, another error is already being injected by the simulator. Similar results were also obtained for the quick sort algorithm.
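The cost of the software scrubbing unit can be made concrete from the cycle counts in Table 2. The sketch below takes the quick sort rows at 1831.05 E/s and expresses each scrubbing configuration as a multiple of the cycle count with EDAC alone. Attributing the whole difference to scrubbing is an assumption, since the cycle counts may also vary with resets and other effects.

```python
# Cycle counts for quick sort at 1831.05 E/s, copied from Table 2.
CYCLES_EDAC_ONLY = 105_191
CYCLES_WITH_SCRUBBING = {"high": 276_080, "middle": 206_212, "low": 160_187}


def overhead_ratio(cycles, baseline=CYCLES_EDAC_ONLY):
    """Cycle count relative to the run that uses EDAC without scrubbing."""
    return cycles / baseline


OVERHEAD = {rate: overhead_ratio(c) for rate, c in CYCLES_WITH_SCRUBBING.items()}

if __name__ == "__main__":
    for rate, ratio in OVERHEAD.items():
        print(f"{rate} scrubbing rate: {ratio:.2f}x the cycles of EDAC alone")
```

With these numbers the high scrubbing rate needs roughly 2.6 times as many cycles as EDAC alone, the middle rate about 2.0 times, and the low rate about 1.5 times.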
7. REFERENCES
A. Benso, P. Prinetto (2003). Fault Injection Techniques and Tools for Embedded Systems Reliability Evaluation.
D. G. Mavis, P. H. Eaton, M. D. Sibley, R. C. Lacoe, E. J. Smith, and K. A. Avery (2008). Multiple Bit Upsets and Error Mitigation in Ultra-Deep Submicron SRAMs. IEEE Transactions on Nuclear Science.
I. Kramberger (2012). Design of Generic CAN Node for ESMO Mission. The 4S Symposium.
Lucas W. B. Lee and Katarzyna Radecka (2005). LUT Error Modelling Based on Implicit Cube-Distance Errors. Department of Electrical Engineering and Computer Engineering, Concordia University, Montreal, Canada. IEEE.
M. Karunarante, A. Sagahyroon, S. Prodhuturi (2005). RTL Fault Modeling. IEEE Conference Publications.
Mahroo Zandrahimi, Hamid R. Zarandi, Alireza Rohani (2010). An Analysis of Fault Effect and Propagations in ZPU: the World's Smallest 32 bit CPU. Department of Computer Engineering and Information Technology, Amirkabir University; School of Computer Science, Institute for Research in Fundamental Sciences (IPM). IEEE Conference Publication.
Olga Goloubeva, Maurizio Rebaudengo, Matteo Sonza Reorda, Massimo Violante (2006). Software Implemented Hardware Fault Tolerance (pages: 38162). Springer.
Pramod Subramanyan, Virendra Singh, Kewal K. Saluja, Erik Larsson (2010). Multiplexed Redundant Execution: A Technique for Efficient Fault Tolerance in Chip Multiprocessors. Design, Automation & Test in Europe Conference & Exhibition.
Sana Rezgui, Senior Principal Engineer and Project Lead for Radiation Effects (2010). Radiation-Tolerant ProASIC3 FPGAs Radiation Effects. Test Report, Actel Corporation.
Sana Rezgui, Senior Principal Engineer and Project Lead for Radiation Effects (2010). Radiation-Tolerant ProASIC3 FPGAs Single-Event-Latch-Up. Test Report, Actel Corporation.
Todd Austin, Valeria Bertacco, Scott Mahlke, and Yo Cao (2008). Reliable Systems on Unreliable Fabrics. Design in the Late- and Post-Silicon Eras.
T. Aaron Gulliver, Vijay K. Bhargava (1993). A Systematic (16,8) Code for Correcting Double Errors and Detecting Triple-Adjacent Errors. IEEE.
Uroš Legat, Anton Biasizzo, Franc Novak (2010). Automated SEU fault emulation using partial FPGA reconfiguration (pages: 24-27). Computer Systems Department, Jozef Stefan Institute, Ljubljana, Slovenia. IEEE.
Seungjae Han, Kang G. Shin, and Harold A. Rosenberg (1995). DOCTOR: An IntegrateD SOftware Fault InjeCTiOn EnviRonment for Distributed Real-time Systems. Real time computer laboratory. IEEE.
Unknown (2007). Actel Application Note AC304: Simulating SEU Events in EDAC RAM.
Figure 11: Comparison between scrubbing rates when applying quick sort
The comparison between scrubbing rates when applying quick sort, shown in Figure 12, did not reveal a better success rate either. The scrubbing rate would have much more effect if the scrubbing unit were implemented in logic, because the additional delay would be minimized.
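To illustrate why scrubbing without additional delay would be expected to help, the following is a minimal, idealized sketch and not the simulator described in this paper. It assumes one single-bit error is injected per step at a uniformly random memory word, that scrubbing clears all pending errors at zero cost, and that a second hit on a word before it is scrubbed counts as an uncorrectable double error. All names and parameter values are hypothetical.

```python
import random


def count_double_hits(num_words, num_injections, scrub_interval, seed=0):
    """Count injections that hit a word already holding an uncorrected error.

    Idealized model (assumption): one error per step, uniformly random word,
    and a zero-overhead scrub of the whole memory every scrub_interval steps.
    """
    rng = random.Random(seed)  # fixed seed so every run sees the same faults
    pending = [0] * num_words  # errors accumulated per word since the last scrub
    double_hits = 0
    for step in range(1, num_injections + 1):
        word = rng.randrange(num_words)
        if pending[word] >= 1:
            double_hits += 1  # second error before scrubbing: beyond single-error correction
        pending[word] += 1
        if step % scrub_interval == 0:
            pending = [0] * num_words  # zero-overhead scrub clears all pending errors
    return double_hits


if __name__ == "__main__":
    for interval in (10, 20, 40):
        hits = count_double_hits(num_words=256, num_injections=2000,
                                 scrub_interval=interval)
        print(f"scrub every {interval} injections: {hits} double hits")
```

Because the same fault sequence is replayed for every interval, and each shorter interval here divides the longer ones, the shorter interval can never produce more double hits in this model. The software scrubbing unit evaluated above does not behave this way, since its own execution time allows further errors to be injected.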
Figure 12: Quick sort reliability validation

6. CONCLUSION

This paper presented a simulation approach which can be used as a software verification and reliability simulator. The simulation results show that EDAC and the scrubbing unit greatly increase the system's reliability. It should also be noted that the scrubbing rate needs to be matched to the error rate and, if possible, should not greatly reduce the system's performance. In future work we will upgrade this simulator to a full system simulator with all the needed subsystems and their software. It should be noted that the simulation times for each run were relatively short, so all the results presented here are only approximate.