Microelectronics Reliability 47 (2007) 1836–1840 www.elsevier.com/locate/microrel
Configuration errors analysis in SRAM-based FPGAs: Software tool and practical results
V. Maingot a,*, J.B. Ferron a, R. Leveugle a, V. Pouget b, A. Douin b
a TIMA Laboratory, 46 Avenue Félix Viallet, 38031 Grenoble Cedex, France
b IMS Laboratory, 351 Cours de la Libération, 33402 Talence Cedex, France
Received 9 July 2007
Available online 4 September 2007
Abstract
The reconfigurability of SRAM-based FPGAs also has some drawbacks, especially when they are used in systems requiring a high level of safety and/or dependability. Dealing with single-event effects is an important issue in these systems. This paper presents a software tool to analyze a bit-stream and the functional effects of errors in it. Results of analyses are presented, based on experiments using a laser platform to inject faults in the circuit.
© 2007 Elsevier Ltd. All rights reserved.
* Corresponding author. E-mail address: [email protected] (V. Maingot).
0026-2714/$ - see front matter © 2007 Elsevier Ltd. All rights reserved. doi:10.1016/j.microrel.2007.07.074

1. Introduction
Due to the many advantages of the reconfigurability of SRAM-based FPGAs, their use is increasing even in systems requiring a high level of dependability (safety, availability, security, etc.). The main issue for such systems is their working conditions: they often have to operate in harsh environments, for example under ionizing radiation, or they may have to resist deliberate fault-based attacks, which create similar perturbations using, for example, a laser. Single-event effects (SEEs) induced by the interaction of particles with integrated circuits are a well-known threat for space systems, which are directly exposed to cosmic rays. With the shrinking of transistor sizes in modern technologies, systems are also sensitive to atmospheric particles at sea level. The most probable effect, when we consider SRAM-based FPGAs at sea level, is the single-event upset (SEU), i.e. a bit-flip in the embedded memories [1]. Faults in the configuration memory of an SRAM-based FPGA directly modify the definition of its function, dangerously impacting its ability to operate properly [2]. These errors usually last until the configuration memory is refreshed. Moreover, detecting and/or correcting these errors has, in most cases, a high cost, which increases further if multiple-bit upsets (MBUs) must also be considered.

Protecting the system against faults in the configuration memory is an important issue at design time. Several design-level solutions exist to develop fault-tolerant architectures on SRAM-based FPGAs. An example using the triple modular redundancy (TMR) technique is given in [3]. In all cases, the designer has to make a compromise between cost (area, power and performance overheads) and fault tolerance.

At design time, evaluating the effects of faults in the device and choosing the best protection strategy require realistic fault models. To build such models, it is necessary to use results from actual fault injections on a test device. The better the models are, the more accurate the evaluation of the dependability will be. An ongoing collaborative effort has allowed us to develop hardware and software tools and associated methodologies for performing pulsed laser fault injections in FPGA devices [4]. In this paper, we present the analysis of some results. We first describe our tool, soft error functional effect analysis in programmable devices (SEFEAProD), developed to analyze the configuration memory of
SRAM-based FPGAs and the effects of fault injections, performed by simulations or by experiments on silicon. Then, we detail some results obtained during a first injection campaign.

2. Presentation of the analysis tool
A software tool for the analysis of configuration errors had previously been developed for the Xilinx Virtex I family with the JBits 2.8 software development kit (SDK) [5]. The JBits SDK is a set of Java classes, provided by Xilinx, that defines an application program interface (API) into Xilinx devices. We developed our own tool, based on the JBits 3.0 SDK, to analyze the configuration of any device from the Virtex II family and to compare erroneous configurations with golden ones. Input files can be bitstream (.bit) files or read-back data (.rbd) files downloaded from the board.

A graphical user interface (GUI) has been implemented to make the analysis easier, with different views of the configuration memory and of the FPGA architecture. Fig. 1 shows the hierarchy of these views:
• Matrix tile view: The configuration memory is presented as a tile array, showing the tiles used by the design and, for each one, the different configuration bits. The criticality of each configuration bit is predicted with respect to its role and to the implemented design.
• Matrix frame view: The same information is displayed, but the configuration bits are grouped by configuration frame.
• Schematic tile view: It shows the resources actually used in each CLB tile by the implemented application. The interconnections used and the configuration (mode and content) of registers and look-up tables (LUTs) are available.

Fig. 1. Hierarchical view of the tool.

Additionally, this part of the tool can load several bit-streams in .bit format to detect erroneous configurations due, for example, to pulsed laser fault injections. The visualization of the implemented architecture after each fault injection allows the laser energy and position to be linked to architectural effects in the FPGA under test.

Finally, our software tool is also able to compare different bit-stream files and to report statistics on the effects of fault injection campaigns [4]. A bitwise comparison and the study of the structure of the bitstream give us the list of faulted bits and their role in the bit-stream. Text reports generated by the tool are then processed with UNIX shell scripts to perform statistical analyses. Currently, some work is still required to identify precisely the role of some bits. Nevertheless, the configuration of most of the functional resources can be analyzed.

3. Experimental results
We have conducted several laser fault injection campaigns on different test configurations of a Virtex II xc2v1000 FPGA. A campaign of about 50 runs, with multiple laser shots fired between the configuration and the read-back of the device, has been performed. The objective was to gather information on the largest possible number of error patterns. With these data, we obtain both a statistical analysis and a detailed view of the effects of the faults.

One of our first test configurations was designed to use all available CLBs in the FPGA (all flip-flops are connected and all LUTs implement a logic function), but no BRAM was used. The results presented below are based on this test setting.

Table 1 presents the average number of faulted configuration bits during this campaign for each type of element, and the corresponding percentages. The elements are the following: input/output blocks and their interconnections (IOB, IOI), global clock (GCLK), RAM blocks and their interconnections (BRAM, BRAMI), CLBs (CLB), and the configuration bits of lateral IOB/IOI that are contained in CLB frames (CLBIO).
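The bitwise comparison and per-element tally behind a table of this kind can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the JBits-based tool itself: the two bitstreams are treated as raw byte strings of equal length, the MSB-first bit numbering is an assumption, and `toy_element_of` is a hypothetical stand-in for the real mapping from a bit position to an element type, which depends on the frame structure of the device.

```python
from collections import Counter


def faulted_bits(golden: bytes, faulted: bytes):
    """Yield (bit_index, original_value) for every bit that differs."""
    if len(golden) != len(faulted):
        raise ValueError("bitstreams must have the same length")
    for byte_index, (g, f) in enumerate(zip(golden, faulted)):
        diff = g ^ f  # set bits mark the flipped positions
        for bit in range(8):
            mask = 0x80 >> bit  # MSB-first numbering (assumption)
            if diff & mask:
                yield byte_index * 8 + bit, 1 if g & mask else 0


def tally(golden: bytes, faulted: bytes, element_of) -> Counter:
    """Count faulted bits per (element type, original bit value)."""
    counts = Counter()
    for index, original in faulted_bits(golden, faulted):
        counts[(element_of(index), original)] += 1
    return counts


def toy_element_of(bit_index: int) -> str:
    """Hypothetical layout: the first 16 bits are CLB bits, the rest BRAM bits."""
    return "CLB" if bit_index < 16 else "BRAM"


golden = bytes([0b10110000, 0b00000001, 0b00000000])
faulted = bytes([0b00110000, 0b00000011, 0b00010000])
counts = tally(golden, faulted, toy_element_of)
for (element, original), n in sorted(counts.items()):
    print(f"{element}: {n} faulted bit(s) initially at {original}")
```

Keeping the original value of each flipped bit in the tally is what allows the split between faulted zeros and faulted ones.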
Table 1 is divided into three parts: the first two rows give the figures for all faulted bits, while the following rows give the same figures restricted to bits initially at zero (rows 3–4) or at one (rows 5–6). Most errors occur in CLB and BRAM frames. This is because these are the predominant elements in the chip, and also because we did not focus our laser shots on the I/O pads. The numbers of faulted zeros and ones are of the same order of magnitude: 53.09% and 46.91% of the faulted bits, respectively. This must be correlated with the area exposed to the laser.

Table 1
Average repartition of faulted bits
Element type                 Total    CLB     CLBIO   GCLK   IOB    IOI    BRAM    BRAMI
Number of ‘0’, ‘1’ faulted   137.85   80.95   0.54    0      0.03   0.03   50.41   5.87
Percentage                   100.00   58.72   0.39    0      0.02   0.02   36.57   4.26
Number of ‘0’ faulted        73.18    17.10   0.54    0      0.03   0.03   50.41   5.05
Percentage                   53.09    12.40   0.39    0      0.02   0.02   36.57   3.66
Number of ‘1’ faulted        64.67    63.85   0       0      0      0      0       0.82
Percentage                   46.91    46.32   0       0      0      0      0       0.59

Table 2 shows the average number of configuration bits per CLB in the golden bitstream, together with the average number of faulted bits. This density allows us to compare the probabilities of flipping a bit depending on its value: the probability of flipping a one is found to be 2.5 times higher than the probability of flipping a zero. Since zero is the default value, this means, for example, that a hit is more likely to suppress an interconnection than to create a new one.

Table 2
Average number of bits per CLB
Category               All bits   Bits at ‘1’   Bits at ‘0’
Golden bitstream       1760       212.80        1547.20
Faulted bits           9.15       2.37          6.78
Bit-flip probability   0.52%      1.11%         0.44%

Table 3 presents the repartition of erroneous bits in CLB tiles. These bits are categorized in three groups: interconnection configuration bits, logic configuration bits (including LUTs and user memory bits) and currently unidentified bits. As expected, the largest contribution comes from bits configuring the interconnections. The bits categorized as unknown are those that cannot be accessed through JBits; their identification is still ongoing.

Table 3
Average repartition of faulted CLB bits
Bit type     Total   Logic   Interco.   Unknown
Number       80.95   34.49   44.15      2.31
Percentage   58.72   25.02   32.03      1.68

As previously mentioned, our software gives us the complete list of erroneous bits and their function in the configuration of the FPGA. By analyzing the original bit-stream, we can understand the effects of the laser shot on the implemented architecture. We focus here on some particular examples of error patterns.

Most bit-flips in logic configuration modified LUTs and registers. For flip-flops, we identified the location in the bitstream of the user memory configuration bits. The content of each flip-flop in a CLB is configured by one bit. Consequently, a bit-flip at these locations is critical and will lead to an error (and potentially a failure) at execution time. For LUTs, the truth tables are entirely included in the bit-stream, so an error will modify the logic function (taken as a four-input function), but the actual function may be preserved if not all input patterns are used.

Interconnections in the device are configured in a heterogeneous way: they are not all configured by the same number of bits, which varies between one and three. Most of the interconnections are defined by two bits (90.3%), while very few use three bits (0.2%). Bit-flips in these configuration bits can lead to different modification patterns, which fall into two cases. For 1-bit connections, there are two types of modification pattern: the creation of a connection or its suppression. For multiple-bit connections, there are six types of modification pattern, depending on the initial state of the interconnection and on the modification. Fig. 2 illustrates these modification patterns. A bit-flip in bits initially connecting two wires leads to four patterns; the initial link can be maintained or not, with or without the addition of extra interconnections:
• No effect: The link is maintained without any modification.
• Suppressed: The initial link is suppressed without the creation of any other connection.
• Added: The initial link is maintained with the creation of extra connections.
• Modified: The initial link is suppressed with the addition of extra connections.
For an initial state with no connection, the bit-flip may have no effect (‘No effect’ pattern) or may create new connections (‘Created’ pattern).

Table 4 shows that in 94% of the cases, the error patterns concern interconnections defined by two bits. This is consistent with the percentage of such interconnections in the architecture (90.3%). However, we did not manage to flip bits in the 0.2% of interconnections defined by three bits. Of course, the sensitivity to bit-flips highly depends on the number of bits required to determine the connection. Since several 1-bit connections can be configured by the same bit, the number of observed modification patterns is higher than the number of bit-flips in this case. Due to the complexity of the configuration scheme of multiple-bit connections, the opposite is observed for them: the number of bit-flips is higher than the number of modification patterns.

Fig. 2. Common interconnection modification patterns (multiple-bit connections). Panels: connected wires (Modified, Suppressed, Added, No effect) and unconnected wires (No effect, Created).
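The six modification patterns for multiple-bit connections can be written as a small decision function. The sketch below is only an illustration of the definitions given in the text: representing the state of a wire resource as the set of sources connected to it is an assumption made here, and the source names used in the example are placeholders.

```python
def classify(initial_sources: set, final_sources: set) -> str:
    """Name the modification pattern of one wire resource after bit-flips."""
    if not initial_sources:
        # Initially unconnected wire: two possible patterns.
        return "Created" if final_sources else "No effect"
    kept = initial_sources <= final_sources          # initial link still present
    extra = bool(final_sources - initial_sources)    # new connections appeared
    if kept:
        return "Added" if extra else "No effect"
    return "Modified" if extra else "Suppressed"


examples = [
    ({"XQ0"}, {"XQ0"}),          # link untouched
    ({"XQ0"}, set()),            # link lost, nothing new
    ({"XQ0"}, {"XQ0", "XQ1"}),   # link kept, one extra connection
    ({"XQ0"}, {"YQ1"}),          # link lost, another one created
    (set(), set()),              # still unconnected
    (set(), {"XQ0"}),            # connection created from nothing
]
for before, after in examples:
    print(sorted(before), "->", sorted(after), ":", classify(before, after))
```

With this representation, the four patterns of the connected case and the two patterns of the unconnected case follow from two tests only: whether the initial link survives and whether any extra connection appears.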
Table 4
Repartition of faulted bits in CLB interconnections
Configuration bits      One bit   Two bits   Three bits
Number of bit-flips     18        2290       0
Modification patterns   91        1407       0
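As a quick check of the figures in Table 4, the short Python sketch below recomputes the share of modification patterns per connection type and the number of bit-flips per pattern. Only the four non-zero counts of the table are used; the variable names are illustrative.

```python
# (number of bit-flips, number of modification patterns) from Table 4
table4 = {
    "one bit": (18, 91),
    "two bits": (2290, 1407),
    "three bits": (0, 0),
}

total_patterns = sum(patterns for _, patterns in table4.values())

pattern_share = {}
flips_per_pattern = {}
for name, (flips, patterns) in table4.items():
    pattern_share[name] = 100 * patterns / total_patterns
    # No ratio is defined when no pattern was observed.
    flips_per_pattern[name] = flips / patterns if patterns else None

for name in table4:
    print(name, round(pattern_share[name], 1), flips_per_pattern[name])
```

The two-bit share comes out at about 93.9%, which is the 94% quoted in the text. The ratio of bit-flips to patterns is below one for 1-bit connections (more patterns than flips) and above one for 2-bit connections (more flips than patterns).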
Table 5 gives an average classification of the modification patterns for 2-bit connections over a full campaign. One important point is that in many cases, for interconnections defined by two bits, a correct connection structure is maintained in spite of the errors in the configuration data. For initially unconnected wires, 86.12% of the modifications do not create perturbations. For initially connected wires, the initial link is maintained in about 50% of the cases (column ‘Added’), and the real functional consequences of the added connections on the node depend on the global interconnection resource usage of a given design.

To study these figures further, we need to present the structure used for configuring an interconnection. For 2-bit interconnections, the link is defined between one resource and several sources. All configuration bits are associated with the resource; each of these bits defines the list of sources reachable if the bit is activated. The resource is consequently connected to the source contained in the intersection of the two activated lists. To illustrate this, Fig. 3 shows an extract of the configuration structure for the OMux9 resource; we have extracted five configuration bits defining, respectively, the following lists: {XQ0, XQ1}, {YQ0, YQ1}, {XQ0}, {XQ1, YQ0} and {YQ1}.

From an unconnected configuration, creating a connection needs at least two bit-flips. But since the intersection of two lists can be empty, two bit-flips do not always lead to the creation of a connection. Indeed, the ‘Created’ modification pattern needed three bit-flips on average. This also explains why most initially unconnected wires remain in their state: the average number of bit-flips per pattern is 1.6, which is below the threshold needed to have a chance of impacting the connection.

When considering an already established connection, the number of configuration bits per resource is larger than two (9.1 on average).
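The percentages and ratios quoted in this discussion can be recomputed from the two raw rows of Table 5 (average number of bit-flips and average number of modification patterns). The Python sketch below does only that arithmetic; the dictionary layout and function names are illustrative.

```python
# (initial state, effect) -> (average bit-flips, average modification patterns)
table5 = {
    ("Connected", "Modified"): (16.5, 7.1),
    ("Connected", "Suppressed"): (30.7, 20.4),
    ("Connected", "Added"): (49.0, 29.0),
    ("Connected", "No effect"): (0.0, 0.0),
    ("Unconnected", "No effect"): (1613.0, 1163.1),
    ("Unconnected", "Created"): (581.0, 187.4),
}


def patterns_in(state: str) -> float:
    """Total number of modification patterns for one initial state."""
    return sum(p for (s, _), (_, p) in table5.items() if s == state)


def flips_per_pattern(key):
    """Average number of bit-flips per pattern, or None if no pattern."""
    flips, patterns = table5[key]
    return flips / patterns if patterns else None


total_flips = sum(f for f, _ in table5.values())
total_patterns = sum(p for _, p in table5.values())

# Share of initially unconnected wires left unperturbed.
unaffected_share = 100 * table5[("Unconnected", "No effect")][1] / patterns_in("Unconnected")
# Share of initially connected wires whose link is kept ('Added' column).
added_share = 100 * table5[("Connected", "Added")][1] / patterns_in("Connected")
# Average number of bit-flips per pattern over all columns.
overall_flips_per_pattern = total_flips / total_patterns

print(round(unaffected_share, 2), round(added_share, 1), round(overall_flips_per_pattern, 1))
```

The unconnected share reproduces the 86.12% given in the text, the overall ratio rounds to 1.6, and the ‘Added’ share comes out slightly above 51%, which the text rounds to 50%.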
Since a connected resource has many more configuration bits than the two that define its connection, the probability of suppressing the connection (flipping one of its two bits) is smaller than the probability of adding a new connection. This trend is confirmed by the figures shown in Table 5. However, the average number of bit-flips needed is higher in the ‘Added’ case, due to the possibility of multiple creations.
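The intersection rule for 2-bit interconnections can be illustrated with the five lists extracted for the OMux9 resource. The sketch below rests on two assumptions that go beyond the text: the lists are assigned to the labels B1 to B5 in the order in which they are quoted, and when more than two bits are activated, the resource is taken to be connected to every source found in the intersection of any pair of activated lists.

```python
from itertools import combinations

# Lists of reachable sources per configuration bit (label order is assumed).
BIT_LISTS = {
    "B1": {"XQ0", "XQ1"},
    "B2": {"YQ0", "YQ1"},
    "B3": {"XQ0"},
    "B4": {"XQ1", "YQ0"},
    "B5": {"YQ1"},
}


def connected_sources(active_bits) -> set:
    """Sources connected to the resource for a given set of activated bits."""
    sources = set()
    for a, b in combinations(sorted(active_bits), 2):
        sources |= BIT_LISTS[a] & BIT_LISTS[b]  # pairwise intersection rule
    return sources


def flip(active_bits, bit) -> set:
    """Return the activated bits after one bit-flip."""
    return set(active_bits) ^ {bit}


initial = {"B1", "B3"}  # connects the resource to XQ0
after = {bit: connected_sources(flip(initial, bit)) for bit in sorted(BIT_LISTS)}
for bit, sources in after.items():
    print(f"flip {bit}: {sorted(sources)}")
```

In this extract, flipping B1 or B3 suppresses the link, flipping B4 adds XQ1 while keeping XQ0, and flipping B2 or B5 has no effect. Activating B1 and B2 from an unconnected state creates nothing, since their lists do not intersect. With only five bits, the proportions seen here say nothing about the measured ones.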
Fig. 3. Partial view of the configuration structure of a 2-bit interconnection (configuration bits B1 to B5; sources XQ0, XQ1, YQ0 and YQ1; resource OMUX9).

Table 5
Classification of modification patterns in interconnections defined by two bits
Initial state                             Connected                                    Unconnected
Effect on connection                      Modified   Suppressed   Added   No effect    No effect   Created
Average number of bit-flips               16.5       30.7         49      0            1613        581
Average number of modification patterns   7.1        20.4         29      0            1163.1      187.4
Percentage                                0.5        1.5          2.1     0            82.7        13.3
Average number of bit-flips per pattern   2.3        1.5          1.7     n/a          1.4         3.1

In the ‘Modified’ case, we have to both suppress the connection and create another one. In the optimal case, this needs two bit-flips (one to disconnect and one to re-associate the resource with another source). This situation, as a combination of the two previous ones, is less frequent and requires more bit-flips. The ‘No effect’ case happens when the list of the flipped bit contains no source present in the union of the lists of the two activated bits. Considering the multiplicity of faults in bits configuring a resource, it is very improbable that the original state of the interconnection is maintained.

Furthermore, we studied the evolution of the average number of bit-flips per pattern as a function of the number of created connections in the ‘Modified’, ‘Added’ and ‘Created’ situations. Results are shown in Fig. 4. To add one connection, one bit-flip is usually enough in the ‘Added’ situation because there is already a connection in the initial state. On the contrary, the ‘Modified’ situation needs two bit-flips on average (one to destroy the initial link and one to create a new connection). The ‘Created’ situation is similar because no configuration bit is activated in the initial state, so two bit-flips are needed to make one connection. Our experiments have shown that, from this situation, one additional bit-flip on a resource tends to create a new connection. As the number of created connections increases, the number of bit-flips necessary to add a new connection seems to decrease, which is linked to the increasing probability of finding a source in an already activated list.

Fig. 4. Average number of bit-flips in the ‘Modified’, ‘Added’ and ‘Created’ patterns (x-axis: number of created connections, 1 to 4; y-axis: average number of bit-flips per pattern, 0 to 4.5).

We did not use the common interconnection error patterns introduced in [6] since our goal was to identify possible modification patterns over one and only one wire resource within a CLB. We were able to classify all possible modification patterns. The next step of our study will be to determine the probability of each modification pattern following a single laser shot and to study the criticality of faults with respect to the implementation of the design. Then, we will be able to use the nomenclature in [6] by studying not only one wire resource but also the situation of the neighbouring interconnection resources. We will then be able to establish a correspondence between, for instance, our ‘Added’ situation and the ‘short’ or ‘bridge’ patterns previously defined.

4. Few lessons from this experiment
These analyses aim at giving detailed information on the effects of laser fault injection on an SRAM-based FPGA. These data can be used both to elaborate a fault model and to emphasize the criticality of different parts of the component. At this point of the study, after analyzing the results obtained during campaigns based on multiple laser shots, we are able to classify the induced errors. The development of a precise fault model will require the results of the next campaign, based on single laser shots, and their analysis. Error patterns obtained in such conditions will be used in emulation- or simulation-based fault injection campaigns to evaluate the robustness of designs to configuration errors.

Another aspect of this study is to reveal the sensitivity of the different elements on the chip. With this map of the criticality, a designer can choose which resources inside the chip should be used in priority. From the data presented in this paper, we can draw some recommendations on the use of resources inside the CLB, in particular for LUTs and interconnections. As each error in the configuration of a 4-input LUT will probably generate a failure, complex logic functions may have to be distributed over several LUTs to increase their robustness. Of course, the impact on the global design will have to be evaluated in further work.
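The LUT argument can be made concrete with a small model of a 4-input LUT. The sketch below is an illustration under stated assumptions: the truth table is held as a 16-bit integer in which bit i is the output for input pattern i (the actual bit ordering in the bitstream is not claimed here), and the set of input patterns that can occur in the design is supplied by the caller.

```python
def lut_output(truth_table: int, inputs: int) -> int:
    """Output of a 4-input LUT for one input pattern (0..15)."""
    return (truth_table >> inputs) & 1


def flip_is_masked(truth_table: int, flipped_bit: int, used_inputs) -> bool:
    """True if flipping one truth-table bit leaves all used patterns unchanged."""
    faulty = truth_table ^ (1 << flipped_bit)
    return all(
        lut_output(truth_table, i) == lut_output(faulty, i) for i in used_inputs
    )


# Hypothetical example: a 2-input AND mapped on a 4-input LUT whose two
# upper inputs never change, so only patterns 0..3 can occur.
AND2 = 0b0000_0000_0000_1000
used = range(4)

critical = [b for b in range(16) if not flip_is_masked(AND2, b, used)]
print("critical truth-table bits:", critical)
```

In this example only the four truth-table bits addressed by the used patterns are critical. When all 16 input patterns can occur, every bit-flip in the truth table changes the implemented function.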
For interconnections, on which we have focused our current work, the number of configuration bits needed to create a connection has to be taken into account. 1-bit connections have to be avoided when possible. Moreover, since in most cases the initial link can be maintained when initially connected (56%), the addition of connections to the initial wire resource is the main source of application failure. So the local density of interconnections should be kept low, and used resources have to be distributed over the maximum number of CLBs.

5. Conclusion and perspectives
We have briefly presented a software tool developed to analyze the configuration memory of SRAM-based FPGAs from the Virtex II family. The link with the architectural components allows us to study the effects of configuration errors from a designer's point of view. Experimental results point out sensitive elements in the FPGA and the higher probability of flipping a 1 (i.e. an activated bit) than a 0. Error patterns have also been discussed, and new directions towards robust design on SRAM-based FPGAs have been outlined.

Future work will first focus on the effect of single laser shots. Modification patterns obtained in this case will provide a good model for simulated or emulated fault injections. This should allow a better evaluation of the robustness of a design before using any laser facility. Also, the protection techniques outlined here will have to be implemented and evaluated.

Acknowledgements
This work is partly supported by the French Ministry of Research, through the project ACI-SI VENUS. The authors thank all the people at TIMA and IMS who contributed to the experiments whose results are analyzed in this paper.

References
[1] Alderighi M, Candelori A, Casini F, D’Angelo S, Mancini M, Paccagnella A, et al. SEU sensitivity of Virtex configuration logic. IEEE T Nucl Sci 2005;52(6):2462–7.
[2] Morgan K, Caffrey M, Graham P, Johnson E, Pratt B, Wirthlin M. SEU-induced persistent error propagation in FPGAs. IEEE T Nucl Sci 2005;52(6):2438–45.
[3] Kastensmidt FL, Sterpone L, Carro L, Reorda MS. On the optimal design of triple modular redundancy logic for SRAM-based FPGAs. In: Proceedings of design, automation and test in Europe (DATE) 2005, vol. 2; 2005. p. 1290–5.
[4] Pouget V et al. Tools and methodology development for pulsed laser fault injection in SRAM-based FPGAs. In: 8th Latin-American test workshop (LATW), March 12–14, 2007.
[5] Kinzel Filho C, Lima Kastensmidt F, Carro L. Improving reliability of SRAM-based FPGAs by inserting redundant routing. IEEE T Nucl Sci 2006;53(4):2060–8.
[6] Sonza Reorda M, Sterpone L, Violente M. Efficient estimation of SEU effects in SRAM-based FPGAs. In: Proceedings of international online testing symposium (IOLTS); 2005. p. 54–9.