Failure mode taxonomy for assessing the reliability of Field Programmable Gate Array based Instrumentation and Control systems

Annals of Nuclear Energy 108 (2017) 198–228 Contents lists available at ScienceDirect Annals of Nuclear Energy journal homepage: www.elsevier.com/lo...



Annals of Nuclear Energy journal homepage: www.elsevier.com/locate/anucene

Phillip McNelles a,b,⇑, Zhao Chang Zeng a, Guna Renganathan a, Marius Chirila a, Lixuan Lu b

a Canadian Nuclear Safety Commission, 280 Slater Street, Ottawa, Ontario K1P 5S9, Canada
b University of Ontario Institute of Technology, 2000 Simcoe Street, Oshawa, Ontario L1H 7K4, Canada

Article info

Article history: Received 14 January 2017; Received in revised form 20 April 2017; Accepted 22 April 2017

Keywords: FPGA; Failure modes; Taxonomy; Nuclear Power Plant; Digital I&C

Abstract

Field Programmable Gate Arrays (FPGAs) are a form of programmable digital hardware configured to perform digital logic functions. This configuration (programming) is performed using Hardware Description Language (HDL), making FPGAs a form of HDL Programmed Device (HPD). In the nuclear field, FPGAs have seen use in upgrades and replacements of obsolete Instrumentation and Control (I&C) systems. This paper expands upon previous work that resulted in extensive FPGA failure mode data, to allow for the application of the OECD-NEA failure modes taxonomy. The OECD-NEA taxonomy presented a method to model digital (software-based) I&C systems, based on the hardware and software failure modes, failure uncovering effects and levels of abstraction, using a Reactor Trip System/Engineering Safety Feature Actuation System (RTS/ESFAS) as an example system. To create the FPGA taxonomy, this paper presents an additional “sub-component” level of abstraction, to demonstrate the effect of the FPGA failure modes and failure categories on an FPGA-based system. The proposed FPGA taxonomy is based on the FPGA failure modes, failure categories, failure effects and uncovering situations. The FPGA taxonomy is applied to the RTS/ESFAS test system, to demonstrate the effects of the anticipated FPGA failure modes on a digital I&C system, and to provide a modelling example for this proposed taxonomy. © 2017 Elsevier Ltd. All rights reserved.

1. Introduction

A Field Programmable Gate Array (FPGA) belongs to a group of digital technologies known as Hardware Description Language (HDL) Programmed Devices (HPD). These are large scale integrated circuits that are programmed (configured) by the end user after they are built, in order to perform certain digital logic functions (IAEA, 2016). These logic functions are performed using the FPGA hardware, as there is no software or operating system present on the FPGA chip itself. Blank FPGAs are configured using HDLs, of which VHDL and Verilog are the most popular; both languages have their own IEEE standards. The HDLs textually describe the architecture of the logic functions and connections that will occur inside the FPGA chip, and then the design is synthesized onto the FPGA chip using software tools, creating the physical routing and logic functions. FPGAs have been the focus of research and implementation projects for safety-related and non-safety-related Nuclear Power Plant (NPP) systems in several countries across North and South America, Europe and Asia (McNelles and Lu, 2013; Electric Power Research Institute, 2009; EPRI, 2011; Menon and Guerra, 2015). Often, the FPGA-based systems are installed to replace existing analog or digital systems that are becoming obsolete. FPGAs possess certain potential advantages over other Instrumentation and Control (I&C) technologies, such as reduced complexity, faster response times, the ability to partition safety and non-safety functions on the FPGA chip, and the potential to help meet diversity requirements (IAEA, 2016; Valtion Teknillinen Tutkimuskeskus, 2011). FPGAs are not without drawbacks, though. Limitations of FPGA-based systems include a lack of experience in the nuclear field, a limited number of platforms/tools, and less access to the internal signals of an FPGA when compared to a microprocessor (IAEA, 2016; Valtion Teknillinen Tutkimuskeskus, 2011). The effect of FPGAs and other HDL technologies has been listed by the IAEA as one of the seventeen “important issues” facing digital I&C systems in Nuclear Power Plants (NPPs) (IAEA, 2015). Nevertheless, FPGA implementations are continuing to take place in the nuclear field, with many examples seen in the technical literature (IAEA, 2016; McNelles and Lu, 2013; Electric Power Research Institute, 2009; EPRI, 2011; Menon and Guerra, 2015; Valtion Teknillinen Tutkimuskeskus, 2011).

⇑ Corresponding author at: Canadian Nuclear Safety Commission, 280 Slater Street, Ottawa, Ontario K1P 5S9, Canada. E-mail address: [email protected] (P. McNelles).
http://dx.doi.org/10.1016/j.anucene.2017.04.033
0306-4549/© 2017 Elsevier Ltd. All rights reserved.

P. McNelles et al. / Annals of Nuclear Energy 108 (2017) 198–228

In order to analyze the reliability of FPGAs and FPGA-based systems, the potential failure modes of FPGAs, including both the hardware and HDL components, must be established. Previously, the available technical literature on FPGAs was reviewed and a Failure Mode and Effects Analysis (FMEA) for FPGA-based systems was compiled (McNelles et al., 2015), according to the International Electrotechnical Commission (IEC) FMEA standard (International Electrotechnical Commission, 2006). This paper expands significantly on that work, to create a taxonomy of FPGA failure modes that can be used in the modelling of FPGA-based systems, based on the taxonomy created for software-based systems. The framework used to develop the FPGA taxonomy is based on the OECD-NEA Taxonomy for Digital I&C Systems (OECD-NEA, 2015). That taxonomy considered a generic, software-based digital I&C system, such as one based on a microprocessor, so FPGAs and other HPDs were not in its scope. One of the recommendations in the OECD-NEA taxonomy stated that future work should involve “Complementation of the failure modes taxonomy with issues that were left out of the scope, e.g., control systems, networks, PLD technology (FPGA/ASIC)” (OECD-NEA, 2015). This makes the extension of the OECD-NEA taxonomy to include FPGA-based systems, via the FPGA FMEA and the resulting FPGA taxonomy, a useful and logical endeavor. This taxonomy was further enhanced through the inclusion of high-level fault classifications as well as IEC fault classifications, to provide additional information on the failure modes and the mitigation measures. In order to extend the framework of the OECD-NEA taxonomy to incorporate FPGAs and other HPDs, and thereby create the FPGA Taxonomy, the authors of this paper propose the creation of the “Logic Process” block. This block represents all digital logic hardware and software/HDL code for any form of digital logic device, making it a suitable bridging point between HPDs such as FPGAs and the OECD-NEA taxonomy.

This paper is organized as follows: Section 2 discusses the OECD-NEA digital failure mode taxonomy that is used as the basis for this paper, and the extension of that taxonomy for the inclusion of FPGA-based systems, including the creation and implementation of the “Logic Process”. Section 3 discusses failure mode categorization methodologies as well as the FPGA FMEA. Section 4 presents the FPGA taxonomy, in line with the generic digital failure modes taxonomy. Section 5 demonstrates the FPGA taxonomy. Conclusions from this research are drawn in Section 6. A list of all failures used in this paper is provided in Appendix A.

2. Extended taxonomy

As the OECD-NEA taxonomy did not explicitly consider HPDs such as FPGAs, it must be extended before the FPGA taxonomy can be implemented within its framework. To do so, this paper proposes the introduction of a “logic process” block, which incorporates established implementations of digital hardware and software. Section 2.1 explains the importance and relevance of creating the FPGA taxonomy. Section 2.2 provides a detailed overview of the OECD-NEA failure mode taxonomy. Section 2.3 discusses the extension of the OECD-NEA taxonomy to include FPGA-based systems, through the use of the “logic process”.

2.1. Importance and relevance of the FPGA taxonomy

The importance of constructing a failure modes taxonomy for FPGA-based systems is well-supported in the literature. This is seen in information published by international organizations (IAEA, IEEE and the OECD-NEA), as well as in a survey of the scientific/technical literature.

2.1.1. Information from international organizations

According to documents from the IAEA, “An increased number of FPGA based applications can be expected as nuclear operators and regulators become more familiar with the advantages of the technology” and “. . .the technology is expected to be applicable to large scale replacement of I&C systems in modernization projects, as well as providing complete I&C systems (safety and non-safety) in new Nuclear Power Plant designs” (IAEA, 2016). Additionally, although FPGAs have seen increased implementation in NPP I&C functions, most of those implementations are recent, so information regarding “lessons learned” and technical standards is not prevalent (IAEA, 2015). With the increased use of FPGA-based I&C systems in nuclear plants, this taxonomy will provide additional technical information for the purpose of hazard analysis. The reason for performing hazard analysis is said to be to “identify and control conditions that produce or contribute to a hazard” (IEEE Power and Energy Society, 2016). This includes the identification, avoidance, evaluation and resolution of hazards in all phases of the system lifecycle. These hazards are caused by failure modes, which must be identified and evaluated. Therefore, the FPGA taxonomy presented in this paper provides a means of identifying, categorizing and modelling the failure modes for use in hazard analysis, during the design and review of FPGA-based I&C systems. This would provide a basis for engineering and safety decisions based on system review criteria (Mossman et al., 2013).
Regarding the OECD-NEA taxonomy, it was stated in that document that “An activity focused on the development of a common taxonomy of failure modes is seen as an important step towards standardised digital Instrumentation and Control (I&C) reliability assessment techniques” (OECD-NEA, 2009) and “The taxonomy will be the basis of future modelling and quantification efforts” (OECD-NEA, 2015). These statements from the OECD-NEA underscore the importance of having a failure modes taxonomy for the analysis and assessment of digital systems. As stated previously, the OECD-NEA taxonomy considered a software-based system, and stated that the development of an FPGA taxonomy would be a source of future work on this topic. Furthermore, the OECD-NEA taxonomy document laid out certain criteria that the taxonomy was intended to meet. The only criterion designated as “Not Met” was entitled “Should capture defensive measures against fault Propagation (detection, isolation and correction) and other essential design features of digital I&C”, and was again left as a topic of future work (OECD-NEA, 2015). In this FPGA Taxonomy, potential mitigation methods were also included for the example failure modes/failure categories. These mitigation methods considered both FPGA-specific mitigation methods (Wang et al., 2011; Kretzschmar et al., 2016; Habinc, 2002) and mitigation methods for generic I&C systems (Hwang et al., 2010; Salewski and Taylor, 2007). Therefore, this FPGA taxonomy addresses two important areas of future work, as described by the OECD-NEA taxonomy (OECD-NEA, 2015).

2.1.2. Information from the technical literature and authors’ experience

The literature shows that a great deal of work has been put into the design, verification and validation (V&V) and safety analysis of FPGA-based control systems in general (Brombacher and van Beurden, 1999; Monmasson and Cirstea, 2007), and specifically in the nuclear industry (Lu et al., 2015a,b; Wu et al., 2016; Jung and Roh, 2017). The unique properties of FPGAs present certain challenges during the safety analysis process, which may differ from the challenges posed by analog systems or software-based digital systems (IAEA, 2015). Previous work by the authors of this paper considered different reliability analysis methods for an FPGA-based nuclear plant safety system (McNelles, 2016). While that paper used failure mode data from a previous FMEA (McNelles et al., 2015), the failure modes used were simply selected from the large list of compiled failure modes and slotted into the models, based on the authors’ expert opinion. However, future analyses/comparisons could potentially be improved if the FPGA failure mode data were properly compiled, categorized and analyzed, as was done in the OECD-NEA taxonomy. This would also allow for the inclusion of mitigation methods in the models. Therefore, the modelling and analysis of that reactor trip logic loop demonstrated to the authors that a complete FPGA taxonomy, similar to the OECD-NEA taxonomy, would be useful in the hazard analysis of FPGA-based I&C systems.

2.2. OECD-NEA failure modes taxonomy

The Committee for the Safety of Nuclear Installations (CSNI), as part of the Organization for Economic Co-operation and Development Nuclear Energy Agency (OECD-NEA), published in 2015 the “Failure Modes Taxonomy for Reliability Assessment of Digital I&C Systems for PRA” (OECD-NEA, 2015). This report compiled an extensive list of failure modes for digital (software-based) systems, for the purpose of assessing the reliability of digital I&C systems in NPPs. The taxonomy decomposes the digital I&C system into separate “levels of abstraction”, defines the types of failures and the uncovering situation of those failures, discusses a generic test system, and presents an example of the application of the taxonomy. This report represents the culmination of an international project, of which an interim report was published in 2009 (OECD-NEA, 2009).
The main features of the OECD-NEA taxonomy are discussed in this section. The full taxonomy can be found in the literature (OECD-NEA, 2015).

2.2.1. Levels of abstraction

The OECD-NEA taxonomy considered five levels of abstraction (the location inside the system where the failure(s) occur). From the highest to the lowest level, these are: System Level, Division Level, I&C Unit Level, I&C Module Level and Basic Component (BC) Level. These levels are defined below:

- System Level: The complete I&C system.
- Division Level: The physical separation of the I&C system, where each division is comprised of the I&C units.
- I&C Unit Level: The elements that execute the specific functions that are necessary for the I&C system to carry out its specified purpose. These units are defined by the general system functions they perform, and consist of I&C modules.
- Module Level: Hardware and software elements that support the specific tasks needed for the system to function. Examples include I/O cards (hardware) or operating systems (software). Modules are comprised of the basic components.
- Basic Component Level: The individual hardware components, such as CPUs, memory, etc., as well as the software used in those hardware components.

The failure effects at a lower level of abstraction become the failure modes at the next higher level of abstraction (International Electrotechnical Commission, 2006; Mossman et al., 2013). Fig. 1 shows an example of the relationship between the “Failure Effects” and “Failure Modes”, as defined in the OECD-NEA taxonomy. A failure in one (or more) component(s) at the “basic component” level will affect some part of the “I&C module” level, causing a failure at that level. A failure at the “basic component” level could propagate up through all levels of abstraction, and cause a failure of the overall system. This is referred to as “Cascade Failure Propagation” in the OECD-NEA Taxonomy (OECD-NEA, 2015).

Fig. 1. Failure Effect and Failure Mode Relation.
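The relationship in Fig. 1 can be illustrated with a short sketch. This is not taken from the OECD-NEA report: the names `Level` and `cascade`, and the `tolerated_at` parameter (a stand-in for a fault tolerance feature that stops propagation at a given level), are assumptions introduced here for illustration only.

```python
from enum import IntEnum


class Level(IntEnum):
    """The five levels of abstraction, ordered from lowest to highest."""
    BASIC_COMPONENT = 1
    MODULE = 2
    UNIT = 3
    DIVISION = 4
    SYSTEM = 5


def cascade(start, tolerated_at=frozenset()):
    """Return the levels reached by a failure that originates at `start`.

    The failure effect at one level is treated as the failure mode of the
    next level up. `tolerated_at` is a hypothetical set of levels at which
    a fault tolerance feature stops the propagation.
    """
    reached = []
    for level in Level:  # iterates in definition order (lowest first)
        if level < start:
            continue
        if level in tolerated_at:
            break
        reached.append(level)
    return reached


if __name__ == "__main__":
    # Unmitigated basic component failure: reaches every level.
    print([lvl.name for lvl in cascade(Level.BASIC_COMPONENT)])
    # Same failure, but tolerated at the division level.
    print([lvl.name for lvl in cascade(Level.BASIC_COMPONENT, {Level.DIVISION})])
```

With no tolerance the failure reaches the system level, which corresponds to the cascade case described above; stopping the loop at a tolerated level is only one possible way to model a mitigation.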

2.2.2. OECD-NEA example system

The OECD-NEA Taxonomy provides an example test system of a generic software-based Reactor Trip System (RTS) and Engineering Safety Feature Actuation System (ESFAS). A simplified version of that test system is shown in Fig. 2, with the full example system provided in the literature (OECD-NEA, 2015). As seen in Fig. 2, the overall system consists of four identical, redundant divisions. Each division is composed of an Acquisition and Processing Unit (APU), Voting Unit (VU), and Priority Unit (PU). These I&C units are themselves comprised of various modules, such as the I/O Board, Mother Board, Communication (Comm) Module, etc. These modules are made up of individual components, including Analog-to-Digital and Digital-to-Analog Converters (ADC/DAC), the microprocessor, and software functions.

Fig. 2. Simplified RTS/ESFAS Test System.
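As a rough illustration of the structure in Fig. 2, the simplified test system could be represented as nested records. The class names, the `build_test_system` helper, and the choice to give every unit the same three modules are assumptions made for this sketch; the full module allocation is given in the OECD-NEA report.

```python
from dataclasses import dataclass, field

# Unit types named in the text: Acquisition and Processing, Voting, Priority.
UNIT_TYPES = ("APU", "VU", "PU")

# Example modules named in the text; assigning all of them to every unit
# is a simplifying assumption, not the actual allocation.
EXAMPLE_MODULES = ("I/O Board", "Mother Board", "Comm Module")


@dataclass
class Unit:
    kind: str
    modules: list = field(default_factory=list)


@dataclass
class Division:
    index: int
    units: dict


def build_test_system(n_divisions=4):
    """Build identical, redundant divisions, each with an APU, VU and PU."""
    return [
        Division(i, {kind: Unit(kind, list(EXAMPLE_MODULES)) for kind in UNIT_TYPES})
        for i in range(1, n_divisions + 1)
    ]


if __name__ == "__main__":
    for division in build_test_system():
        print(division.index, sorted(division.units))
```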


2.2.3. OECD-NEA example taxonomy construction

A brief example of the construction of the OECD-NEA taxonomy for hardware and software failure modes is presented in this section, using one example hardware failure mode and one example software failure mode for the test system shown in Fig. 2. The complete taxonomy is found in the literature (OECD-NEA, 2015). In this example, Table 1 provides the fault location at the “Module” level and the corresponding “I&C Unit” level, for one hardware and one software fault. In turn, Table 2 provides examples of failure modes for each of the OECD-NEA “Failure Effects”, for the one hardware and one software module introduced in Table 1. Example “Uncovering Situations” and fault tolerance features are seen in Table 3 for those example faults that could affect the selected modules, based on the information in the OECD-NEA taxonomy (OECD-NEA, 2015). The information provided in Tables 1–3 is then used in Section 2.2.4 to present an example demonstration of the OECD-NEA taxonomy for the hardware and software failure modes.

2.2.4. OECD-NEA example taxonomy demonstration

Using the example information discussed in Section 2.2.3, an example of the OECD-NEA Taxonomy can be demonstrated. The OECD-NEA Taxonomy gives a four-step process to compress the hardware failure modes into their functional effects on the system. The steps are as follows:

(1) The failure effects are assigned to different failure modes of the RPS test system, based on the failure modes taxonomy at the module level. This allows for the uncovering situations and the functional impacts to be described for the test system.
(2) Failure mode categories are defined based on the failure effect(s) and uncovering situation(s) for the different failure modes, at the I&C unit level. Categories for the detection of failures are created based on the information on the location and the detection of these faults.
(3) The end effect of the fault is described based on fault tolerance coverage, location of the detection, and the functional impact on the I&C unit level.
(4) All of the basic failure modes for each I&C module that have the same generic attributes, detection method, and end effects are grouped together.

The full explanation, along with the example system, is given in the literature (OECD-NEA, 2015). The same process is used in the hardware FMEA and demonstration in the FPGA taxonomy, as discussed in Section 5.2.1. Examples of “Step 1” and the output of “Step 4” are seen in Tables 4 and 5, respectively, for the example hardware module (“Digital Output Module”). Those two tables provide the final demonstration for the hardware modules, based on the OECD-NEA taxonomy framework, showcasing the most pertinent information as defined in that document (OECD-NEA, 2015). The OECD-NEA Taxonomy gives separate treatments for the hardware and software FMEA/demonstrations. In the case of software, the OECD-NEA Taxonomy is based on the FMEA of the following list (OECD-NEA, 2015):

Table 1. Example Classification of I&C Modules.
I&C Module Fault Location | Relevant I&C Unit Category
APU Digital Output Module (Hardware) | Acquisition and Processing Unit (APU)
Elementary Function (Software) | Acquisition and Processing Unit (APU)

Table 2. Examples of Failure Modes and Failure Effects at the I&C Module Level.
I&C Module Output | Module Type | Failure Mode | Failure Effect
I&C Modules with Digital Outputs | Hardware Example: Digital Output Module | Hang/Crash (No output) | Fatal
I&C Modules with Digital Outputs | Hardware Example: Digital Output Module | Delayed Output | Non-Fatal
N/A | Software Example: Elementary Function | Hang/Crash (No output) | Fatal
N/A | Software Example: Elementary Function | Output Stuck (Current Value) | Non-Fatal

Table 3. Examples of Uncovering Situations at the I&C Module Level.
Example Uncovering Situation | Example Fault | Example Fault Tolerance Feature
Triggered by Demand (Hardware) | Failure of the digital output during a demand due to thermal shock | No detection before triggering
Offline Detection Mechanism (Software) | Elementary Function software stuck in a frozen state | Failure is detected through periodic testing

(1) Software CCF for all subsystems
(2) CCF for one subsystem
(3) Software fault causing a failure at the level of redundant systems
(4) Software causing a fault in application functions.

While this is sufficient for the levels of abstraction discussed in the OECD-NEA Taxonomy, it would not be applicable to the “Sub-Component” level that has been developed in this paper, as discussed in Section 5.2.2. In the demonstration for the example software module (“Elementary Function”), as seen in Table 6, a failure in an elementary function could affect one sub-system in either the APU or VU (based on the example RTS/ESFAS), or it could affect the whole system. In that table, the entry of “1” denotes that the failure is a system-wide Common Cause Failure (CCF), and the entry of “2a” represents a CCF in one sub-system, resulting in the loss of outputs from that sub-system. Finally, the OECD-NEA taxonomy concludes its demonstration with an example fault tree. This fault tree has been recreated in this paper and is used in the FPGA Taxonomy. That fault tree is presented in Fig. 15, and is located in Appendix B. The full demonstration of the OECD-NEA taxonomy is found in Ref. (OECD-NEA, 2015).

2.3. Taxonomy integration

The OECD-NEA taxonomy was developed with software-based systems in mind, but the same framework can be extended to FPGA-based systems. This section discusses the shortcomings of the OECD-NEA taxonomy as it pertains to FPGA-based systems, and provides a detailed discussion of the proposed “Logic Process”.

2.3.1. Application to FPGA-based systems

Upon inspection of the test system, it is seen that an FPGA-based system and taxonomy would not be significantly different from the software-based system and taxonomy, using the current levels of abstraction. In an actual system, the failure modes at the “System”, “Division” and “I&C Unit” levels would not differ between a software-based and an FPGA-based system. For example,


Table 4. Example Taxonomy Demonstration for the Hardware Modules (Step 1).
Hardware Module | Failure Mode(s) | Failure Effect | Uncovering Situation | Functional Impact (I&C Unit(s))
Digital Output Module | Signal Stuck at Current Value | Non-Fatal | Online Detection | Loss of specific module application function
Digital Output Module | Signal Fails to Opposite State | Non-Fatal | Revealed by Demand (Latent) | Loss of specific module application function

Table 5. Example Taxonomy Demonstration for the Hardware Modules (Steps 2–4).
Hardware Module | Compressed Failure Mode(s) | Failure Detection | Failure End Effect (RPS)
Digital Output Module | Loss of function | Self-Monitoring | Specific APU/VU output (based on FTD)
Digital Output Module | Latent Loss of function | Periodic Test | Loss of specific APU/VU output
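The grouping in Step 4 of the process in Section 2.2.4 can be sketched as follows. This is a simplified illustration rather than the OECD-NEA procedure itself: the `ModuleFailure` record and `compress` function are hypothetical names, the two records are the Table 4 rows, and the sketch groups directly on the Step 1 attributes instead of deriving the detection categories and end effects of Steps 2 and 3.

```python
from collections import defaultdict
from typing import NamedTuple


class ModuleFailure(NamedTuple):
    """One Step 1 row: a module-level failure mode and its attributes."""
    module: str
    mode: str
    effect: str
    uncovering: str
    impact: str


# The two example rows from Table 4.
TABLE_4 = [
    ModuleFailure("Digital Output Module", "Signal Stuck at Current Value",
                  "Non-Fatal", "Online Detection",
                  "Loss of specific module application function"),
    ModuleFailure("Digital Output Module", "Signal Fails to Opposite State",
                  "Non-Fatal", "Revealed by Demand (Latent)",
                  "Loss of specific module application function"),
]


def compress(records):
    """Group failure modes of a module that share all other attributes.

    Assumption: the Step 1 attributes stand in for the "generic attributes,
    detection method, and end effects" used as the grouping key in Step 4.
    """
    groups = defaultdict(list)
    for r in records:
        key = (r.module, r.effect, r.uncovering, r.impact)
        groups[key].append(r.mode)
    return dict(groups)


if __name__ == "__main__":
    for key, modes in compress(TABLE_4).items():
        print(key[2], "->", modes)
```

Because the two Table 4 rows differ in their uncovering situation, they stay in separate groups, which is consistent with Table 5 listing two compressed failure modes for the Digital Output Module.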

Table 6. Example Taxonomy Demonstration for the Software Modules.
Effect | Software Fault Location: EF (APU) | Software Fault Location: EF (VU)
1 Sub-system (1SS) | 2a | 2a
Total System | 1 | 1

for the system level failure modes for an RTS, the system could either have a “Missed Trip”, “Spurious Trip”, or a “Partial/Delayed Trip”, according to the OECD-NEA taxonomy (OECD-NEA, 2015), and these would not change with an FPGA-based system. However, there would be two main changes in an FPGA Taxonomy, at the “Module” Level and “Basic Component” Level, respectively.

- Module Level: At this level, the OECD-NEA Taxonomy considers separate hardware and software failure modes. However, as the FPGA does not run actual software, and does not have an Operating System (OS), failure modes like “OS Freeze” or “OS Crash” would not apply.
- Basic Component Level: In an FPGA-based system, there would be no “Microprocessor” or accompanying “Software”, so these two entries would be replaced by “FPGA” and “HDL Code”, respectively.

However, that requires swapping out parts of the OECD-NEA taxonomy for different systems, as the original taxonomy does not include devices such as FPGAs. This issue could be rectified by modifying the “Basic Component” level of the taxonomy to work with all forms of digital technology. The use of the OECD-NEA taxonomy for the creation of an FPGA taxonomy is important, as the OECD-NEA methodology is internationally recognized, and is used by working groups and researchers in this field. Following the protocol/methodology of the OECD-NEA taxonomy allows the FPGA taxonomy to retain its importance to those working groups, and ensures that the quality of the FPGA taxonomy is up to that international standard. The OECD-NEA taxonomy does not include FPGA-specific components, which will be discussed further in Section 4.1.

2.3.2. Logic process

Modelling FPGA-based systems with the original OECD-NEA taxonomy would not be possible, as FPGAs would not fit into its framework. The method proposed in this paper to extend the OECD-NEA taxonomy to incorporate FPGA-based systems is through the use of the “Logic Process”.
This block would replace all digital logic hardware and software/HDL at the “Basic Component” level. For example, both “microprocessor” and “software”, as well as “FPGA” and “HDL code”, could be represented by a single “Logic Process” block, as shown in Fig. 3. Components like the ADC/DAC, MUX/DEMUX, and any additional components at the “Basic Component” Level identified in the OECD-NEA taxonomy remain unchanged.

Fig. 3. Extended Taxonomy Using “Logic Process”.

In Fig. 3, the “Logic Process” block represents all potential digital hardware technologies (microprocessor, FPGA, etc.), as well as all software/HDL components. This extends the OECD-NEA taxonomy to incorporate FPGAs and other forms of control technology, for all five levels of abstraction. Using this extended taxonomy creates a plug-in for other forms of digital technology, and allows the modelling of the FPGA failure modes to be performed within the context of the OECD-NEA framework. This plug-in allows the FPGA failure mode data to be used within the OECD-NEA taxonomy framework, as presented in this paper, or by itself with the failure mode data given in Reference (McNelles et al., 2015).

3. Failure mode categorizations

The failure mode information is categorized in order to provide clear information to system developers, or to facilitate the modelling of the system. Several methods for the categorization of faults and failure modes have been seen in the literature, with this section presenting four separate methodologies. Section 3.1 discusses the failure mode categorization presented in the OECD-NEA taxonomy, Section 3.2 presents a fault taxonomy based on elementary fault classes, Section 3.3 describes a categorization specific to FPGA failure modes, Section 3.4 describes the IEC random hardware fault classification, and Section 3.5 discusses the relationship between those methodologies. These additional methodologies were selected due to their prevalence and recognition in the scientific literature, and their acceptance in practical use.

3.1. OECD-NEA categorization

This sub-section briefly outlines the aspects of the OECD-NEA failure modes taxonomy that are relevant to the FPGA Taxonomy constructed in this paper.

3.1.1. Failure effects categories

The OECD-NEA Taxonomy considers two overall categories of Failure Effects: “Fatal” and “Non-Fatal”. These two categories are each further broken down into two classifications, to give a total of four categories:


Fatal: The unit stops functioning completely, and no longer provides an output.
- Ordered Fatal: When the failure occurs, the unit outputs are forced into pre-set values.
- Haphazard Fatal: When the failure occurs, the unit outputs are not forced into pre-set values, so the unit is in an unpredictable state.

Non-Fatal: The unit fails, but still performs computations, passing along incorrect output data.
- Plausible Behavior: The incorrect outputs cannot be easily identified, given the current plant condition.
- Implausible Behavior: The outputs from the unit are obviously incorrect.

These “Failure Effects” are used to assess the possible outcomes of the failures at different levels of abstraction. It is possible that a certain failure mode could have multiple potential failure effects.
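The four failure effect categories above can be written down as a small decision sketch. The enum members follow the definitions in the text; the `classify` function and its three boolean parameters are hypothetical names introduced here, and a real analysis may assign more than one effect to a single failure mode.

```python
from enum import Enum


class FailureEffect(Enum):
    # (overall category, short description from the definitions above)
    ORDERED_FATAL = ("Fatal", "outputs forced into pre-set values")
    HAPHAZARD_FATAL = ("Fatal", "outputs not pre-set, unit state unpredictable")
    PLAUSIBLE = ("Non-Fatal", "incorrect outputs not easily identified")
    IMPLAUSIBLE = ("Non-Fatal", "outputs obviously incorrect")

    @property
    def is_fatal(self):
        return self.value[0] == "Fatal"


def classify(still_computes, outputs_preset=False, obviously_wrong=False):
    """Map three observations about a failed unit to one failure effect.

    still_computes:  the unit keeps performing computations (Non-Fatal).
    outputs_preset:  on a fatal failure, outputs are forced to pre-set values.
    obviously_wrong: on a non-fatal failure, the bad outputs are easy to spot.
    """
    if not still_computes:
        return (FailureEffect.ORDERED_FATAL if outputs_preset
                else FailureEffect.HAPHAZARD_FATAL)
    return (FailureEffect.IMPLAUSIBLE if obviously_wrong
            else FailureEffect.PLAUSIBLE)


if __name__ == "__main__":
    effect = classify(still_computes=False, outputs_preset=True)
    print(effect.name, effect.is_fatal)
```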

3.1.2. Fault uncovering situations

There are certain situations where the fault(s) of the digital I&C system will be uncovered. Two specific uncovering cases were considered:

(1) Uncovered Without Demand
- Failure detected with detection mechanisms
- Failure causes a Spurious Action
(2) Uncovered due to an Actual Demand
- The system failure occurs when the intended action is demanded

Detection methods can be broken down into “Online” and “Offline” detection. Online detection methods include self-monitoring and external monitoring. Offline detection includes periodic testing during maintenance intervals. Overall, this leads to four possible uncovering situations, seen in Fig. 4, with the full explanations provided in OECD-NEA (2015).

Fig. 4. Fault Uncovering Situations for Digital I&C Systems.

The four possible uncovering situations are:

- Revealed by Spurious Action
- Revealed by Demand (Latent or Triggered)
- Revealed by Online Detection
- Revealed by Offline Detection
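One way to read the two uncovering cases and the online/offline split is as a simple decision procedure that yields one of the four uncovering situations. The sketch below is illustrative only; the function name, its parameters, and the order in which the checks are made are assumptions, while the detection method strings follow the examples given in the text.

```python
from enum import Enum


class Uncovering(Enum):
    SPURIOUS_ACTION = "Revealed by Spurious Action"
    DEMAND = "Revealed by Demand (Latent or Triggered)"
    ONLINE_DETECTION = "Revealed by Online Detection"
    OFFLINE_DETECTION = "Revealed by Offline Detection"


# Detection methods named in the text for each detection type.
ONLINE_METHODS = {"self-monitoring", "external monitoring"}
OFFLINE_METHODS = {"periodic testing"}


def uncovering_situation(on_demand, spurious_action=False, method=None):
    """Return the uncovering situation for a fault.

    on_demand:       the failure shows up when the intended action is demanded.
    spurious_action: without a demand, the failure causes a spurious action.
    method:          without a demand, the detection mechanism that finds it.
    """
    if on_demand:
        return Uncovering.DEMAND
    if spurious_action:
        return Uncovering.SPURIOUS_ACTION
    if method in ONLINE_METHODS:
        return Uncovering.ONLINE_DETECTION
    if method in OFFLINE_METHODS:
        return Uncovering.OFFLINE_DETECTION
    raise ValueError("fault is not uncovered by any of the four situations")


if __name__ == "__main__":
    print(uncovering_situation(False, method="periodic testing").value)
```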

These uncovering situations will be used later in the taxonomy, to determine when certain FPGA-based failures will be revealed.

3.1.3. Overall taxonomy basis

Overall, the OECD-NEA Taxonomy considers four main elements:

(1) Fault Location
(2) Failure Effect
(3) Uncovering Situation
(4) End Effect (Maximum and most likely)

The “End Effect” would be identified during a specific analysis. In order to perform that analysis, three additional aspects can be included:

(5) Failure Origin
(6) Maximum possible end effect (assuming Fault Tolerant Design (FTD) features are not used or do not work)
(7) The most likely end effect (assuming FTD features are included and are effective)

The aforementioned seven elements will be applied to the FPGA FMEA data and test system, and are included in the FPGA taxonomy demonstration.

3.2. Elementary fault class categorization

The use of “elementary fault classes” (EFC) involves a high-level classification of the faults that could affect a system at some point in its lifecycle, based on eight topics (Avizienis et al., 2004). In total, those eight elementary fault classes were combined to produce 31 potential fault combinations; a simplified version of the overall fault taxonomy is seen in Fig. 7 (Avizienis et al., 2004). In reference (Avizienis et al., 2004), three major, potentially overlapping groupings were discussed:

- Development Faults: All fault classes occurring during the development stages
- Physical Faults: All hardware fault classes
- Interaction Faults: All external faults

In Fig. 7, the term “Mal” stands for “Malicious”, while “Non-Mal” denotes “Non-Malicious”, as defined in the literature (Avizienis et al., 2004). The last row gives possible examples of faults for the different fault class combinations. For example, “Software Faults” are a form of “Development Fault”, while “Production Defects” would be considered both a “Development Fault” and a “Physical Fault”. The fault classes seen in the bottom row of Fig. 7 represent the high-level categories to which the faults uncovered through the FPGA FMEA would correspond. Although there are 31 potential fault combinations listed in Ref. (Avizienis et al., 2004), many of those correspond to overlapping examples (e.g., Fig. 7(a) of Ref. (Avizienis et al., 2004) shows that “Software Flaws” are an example of four fault combinations, with nine total examples being provided). Therefore, a simplified figure covering those nine example categories was produced as Fig. 7 of this paper. All of these are included in the fault category mapping given in Table 7. These examples are then covered in the hardware and software FMEAs and the FPGA taxonomy demonstration, provided in Sections 4 and 5, respectively.


Table 7
FMEA Fault Category Mapping.

FMEA Category | Elementary Fault Class Example
All Software/HDL Failures (Except “Design Security”) | Software Faults
Manufacturer Defects; Board Level (Design); Sneak Circuits (HW); CCF (HW) | Production Defects and/or Hardware Errata
Environmental (Environmental Qualification) | Physical Interference (Natural, Hardware (HW))
Environmental (Radiation Induced Hard Errors) | Physical Interference (Nat., HW., Perm.)
Environmental (Radiation Induced Soft Errors) | Physical Interference (Nat., HW., Trans.)
Stress/Aging | Physical Deterioration
Human Factors (Maintenance Induced) | Physical Interference (Hardware, Non-Mal); Input Mistakes (Software, Non-Mal)
Human Factors (Security Breach); Design Security | Intrusion Attempts (Hardware, Mal); Virus/Worms (Software, Mal, Int); Logic/Timing Bombs (Software, Mal, Dev)

In terms of the EFC failure effects, five total effects were considered, described under the “failure domain”. These are (Avizienis et al., 2004):

- Timing Failure: Incorrect arrival time of information (either early or late).
- Content Failure: The information being sent/delivered is incorrect.
- Halt Failure: The external state becomes constant, with no perceivable activity. This includes cases where the system produces no output.
- Erratic Failure: A timing and content error that is not a “halt failure”, and that leads to the system producing erratic outputs.

When considering the detectability of these failures, there are two basic categories: “Signaled”, where the failure is detected and a warning signal is sent, and “Unsignaled”, where there is no detection and/or no warning signal sent. The EFC end effects and uncovering situations are seen in Fig. 5.

The avoidance/mitigation methods discussed in the EFC paper (“Means”) fall into four main categories, with several subcategories, as shown in Fig. 6. These are (Avizienis et al., 2004):

- Fault Tolerance (FT): Methods to avoid system failures in the presence of faults.
- Fault Prevention (FP): Prevent the occurrence/introduction of faults.

Fig. 6. Elementary Fault Class ‘‘Means” to Avoid/Mitigate Failures.

- Fault Forecasting (FF): Estimate the number of faults present, future faults that may occur, and the potential consequences of those faults.
- Fault Removal (FR): Reduce the number and/or severity of the faults.

It should be noted that the information presented in Fig. 5 and Fig. 6 is not an exhaustive list, and represents the information most relevant to the creation of the FPGA taxonomy. The complete listing is given in Ref. (Avizienis et al., 2004).

3.3. FPGA FMEA categorization

An extensive FPGA FMEA was undertaken for Phase 1 of this research project (McNelles et al., 2015). A literature review was performed that covered reports from technical organizations, international technical standards, white papers from FPGA manufacturers, and papers published in the scientific literature. This produced an extensive list of failures that could affect the physical FPGA chip and board, as well as the HDL code (logic) (McNelles et al., 2015). The FMEA data was categorized based on the stage in the FPGA lifecycle in which the failure could occur and the potential causes of failure. Individual Failure Sets (FS) were then created, based on similar causes, effects on the system, and mitigation methods (McNelles et al., 2015). The original preliminary results from the FPGA FMEA were modified and expanded upon to create a clearer and more informative categorization with additional Failure Sets. The new categorization is shown in Fig. 8 (McNelles et al., 2015). A basic description of the “Lifecycle” and “Cause” categories is provided as follows:

Design (Fabrication): Failures of the chip itself or the FPGA (HDL) logic that occur during chip fabrication or system design (hardware and software design).
- Design Defects: Failures caused by faults in the system hardware and/or the HDL logic.
- Manufacturer Defects: Failures due to defects in the physical FPGA chip that occur when the chip was manufactured.

Operation: Failures that occur during the operation of the FPGA-based system inside the NPP.
- Environmental: Failures that could be induced by the operating environment of the FPGA-based system, such as radiation-induced failures.
- Aging/Stress: Failures induced in semiconductor technologies during the ageing process or due to thermal-mechanical stress.
- Maintenance Induced/Human Factors: Failures induced by plant personnel during maintenance procedures.
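As an illustration only, the “Lifecycle” and “Cause” categories described above could be held in a small data structure. The following Python sketch is not part of the original FMEA: the names `Lifecycle`, `Cause`, `FailureSet` and `LIFECYCLE_OF_CAUSE` are hypothetical, and only the category labels are taken from the text.

```python
from dataclasses import dataclass
from enum import Enum


class Lifecycle(Enum):
    DESIGN = "Design (Fabrication)"
    OPERATION = "Operation"


class Cause(Enum):
    DESIGN_DEFECTS = "Design Defects"
    MANUFACTURER_DEFECTS = "Manufacturer Defects"
    ENVIRONMENTAL = "Environmental"
    AGING_STRESS = "Aging/Stress"
    HUMAN_FACTORS = "Maintenance Induced/Human Factors"


# Lifecycle stage under which each cause is listed in the text above.
LIFECYCLE_OF_CAUSE = {
    Cause.DESIGN_DEFECTS: Lifecycle.DESIGN,
    Cause.MANUFACTURER_DEFECTS: Lifecycle.DESIGN,
    Cause.ENVIRONMENTAL: Lifecycle.OPERATION,
    Cause.AGING_STRESS: Lifecycle.OPERATION,
    Cause.HUMAN_FACTORS: Lifecycle.OPERATION,
}


@dataclass(frozen=True)
class FailureSet:
    """Hypothetical record for one Failure Set (name plus its cause)."""
    name: str
    cause: Cause

    @property
    def lifecycle(self) -> Lifecycle:
        # The lifecycle stage is derived from the cause, not stored twice.
        return LIFECYCLE_OF_CAUSE[self.cause]


example = FailureSet("Radiation-Induced", Cause.ENVIRONMENTAL)
print(example.name, "->", example.cause.value, "/", example.lifecycle.value)
```

Deriving the lifecycle stage from the cause keeps the two attributes from drifting apart when Failure Sets are added or renamed.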

Fig. 5. Elementary Fault Class Failure Domain and Detectability.

The full explanation for each Failure Set can be found in McNelles et al. (2015). The categorization of FPGA failure modes was also used in the creation and demonstration of the FPGA


[Fig. 7 diagram: faults are grouped into Development Faults, Physical Faults and Interaction Faults (with overlapping “Development and Physical Faults” and “Physical and Interaction Faults” groups), and are classified as Development/Operational, Internal/External, Natural/Human-Made, Hardware/Software, Mal/Non-Mal and Permanent/Transient. Example fault classes in the bottom row: Software Faults, Logic/Timing Bomb, Hardware Errata, Production Defects, Physical Deterioration, Physical Interference, Intrusion Attempts, Virus/Worm, Input Mistakes.]

Fig. 7. Elementary Fault Classes (Avizienis et al., 2004).

Fig. 8. FPGA Failure Mode Categories (‘‘Failure Sets”).

Taxonomy. There are seven additional/amended categories added in this paper:

- Commercial-Off-The-Shelf “Software” (COTS): Represents dedication of any commercial grade software (HDL code, IP cores) and software tools used in the configuration of the FPGA-based system (Jung et al., 2016). This failure set would also include Pre-Developed Software (PDS). COTS/PDS needs to undergo failure modes analysis (such as FMEA), according to certain standards, including CSA N290.14 (CSA Group, 2015).

- Maintainability: Attributes included during the “Design” phase, which will assist with the maintenance of the “software” (HDL code) during the “Operation” phase. Attributes that may impede maintainability include the use of vendor-specific IP Cores, code structure/hierarchy, vendor-specific hard macros, synthesis attributes and constraints, as well as place and route directives (Bobrek and Bouldin, 2010; U.S. NRC, 2010).

- Environmental Qualification: This failure category was added under the “Environmental” cause. It consists of failures that should be accounted for during environmental qualification, such as failures caused by high temperatures, humidity, electrical noise, electromagnetic interference (EMI), seismic/vibration, etc. As an example, high temperatures may cause damage to the FPGA-based system at the device, package and board levels. This includes damage to the soldering, die, connectors, substrate and bonds. It should be noted that temperature is an important factor in many aging-related failure modes; however, those are considered under the “Stress-Aging” cause in Fig. 8 (IAEA, 2016; Actel, 2003; Benfica et al., 2016).

- Security Breach: The original “Maintenance (Human Factors)” cause was renamed “Human Factors”. It was sub-divided into two failure sets: “Maintenance Induced”, which signifies unintentional failure modes introduced during the maintenance period (Xilinx, 2013; Actel, 2005; Khalaquzzaman, 2010), and “Security Breach”, which signifies failures that were intentionally introduced (Valtion Teknillinen Tutkimuskeskus, 2011; Huffmire et al., 2008, 2010; Hadžić et al., 1999; Trimberger and Moore, 2014).

- Hard and Soft Radiation Induced: The “Radiation-Induced” Failure Set was also divided into two new failure sets. “Hard Radiation Induced” corresponds to “Hard” errors, which are permanent errors that cause damage/destruction to the FPGA chip. The other Failure Set, “Soft Radiation Induced”, represents “Soft” errors, which are transient errors that will not cause permanent FPGA damage (Mutuel, 2016). This division was made because the two failure sets will generally have different effects on the FPGA chip, and require different mitigation methods (Mutuel, 2016).

- Design Security: This Failure Set considers malicious actions directed at introducing logic errors into the HDL code and/or IP cores during the design stage of the FPGA-based system. These malicious functions could lie dormant, waiting to be activated upon some triggering condition. This failure set is different from the “Security Breach” failure set, as it strictly considers malicious software logic inserted into the FPGA chip during the “Design” stage of the lifecycle, whereas the “Security Breach” failure set considers hardware and software failure modes, and only considers malicious acts during the “Operation” part of the lifecycle.

- Aging Failures (Clock) and (FPGA Chip): The “Aging Process” failure set was divided into “Aging Process (Clock)” and “Aging Process (FPGA Chip)”. This was done to differentiate between aging process failures that affect the clock (e.g. reduce the clock frequency), and aging process failures that damage the internal FPGA circuitry (such as the logic blocks or programmable interconnects). The two new failure sets have some distinct failure effects and mitigation methods, so dividing that failure set allowed for better modelling using this taxonomy.

3.4. IEC classification of random hardware faults

Standards published by the IEC include a classification for random hardware faults. The following five categories are used (IEC, 2010; O’Connor et al., 2016):

- No Effect: The fault has no effect on the safety function of the component.
- Safe Detected: A failure that does not cause a spurious actuation, and is detected by diagnostic functions, leading to the correct reaction of the system, as defined in IEC 61508-2.
- Safe Undetected: A failure that unintentionally actuates the safety function (spurious actuation), or increases the probability of a spurious actuation.
- Dangerous Detected: A dangerous failure uncovered via some form of diagnostic function or test procedure.
- Dangerous Undetected: A failure that prevents the safety function from operating when required, or lowers the probability that it does so. These faults do not unintentionally execute the safety function, and do not put the system into a Defined Safe State (DSS).

These fault classifications, as well as the effects of those faults on a logic system such as a Functional Safety-PLC (FS-PLC) or, in this case, an FPGA, are visualized in Fig. 9. However, the “No Effect” fault classification was not considered in this taxonomy, as those faults would not pose a threat to the overall FPGA-based system. It is seen that the IEC fault classification includes two failure effects (“Safe” and “Dangerous”), and two uncovering situations (“Detected” and “Undetected”). These aspects will be compared with the OECD-NEA taxonomy definitions in Section 3.5.2. It should be noted that this fault classification applies only to random hardware faults. Although the FPGA HDL code will be synthesized into hardware on the FPGA chip, HDL code failures would be systematic errors, and as such, the random hardware failure classification does not apply to them.

3.5. Relationship between fault classifications

This section provides a comparison and mappings for the four fault classifications used during this research project (FPGA FMEA failure sets, Elementary Fault Classes, IEC fault classification and OECD-NEA taxonomy categorization), and discusses the relationship between those classifications.

3.5.1. Elementary fault classes and FPGA FMEA

The fault taxonomy from Ref. (Avizienis et al., 2004) presents an abstract, high-level taxonomy that could be applied to a wide variety of systems, while the FPGA FMEA categorization focused on FPGA-specific faults. There are some similarities though, as both classifications include “Design/Development” and “Operation” lifecycle stages. As the FMEA categorization is a lower-level classification, it can be mapped to the fault taxonomy in Ref. (Avizienis et al., 2004). As seen in Table 7, all of the FPGA FMEA categories can be mapped to the elementary fault classes. All of the software/HDL failures in Fig. 8 fall under the “Design Defect” “Cause”, and would be considered as “Software Faults” in the EFC. “Manufacturer Defects” would map to “Production Defects” and “Hardware Errata”, with the exception being that “Hardware Errata” only considers human-made causes, while “Production Defects” also include natural causes. The “Environmental Failures” map to “Physical Interference” (specifically “Natural, Hardware”,

Fig. 9. IEC Hardware Fault Classification and Logic System Behaviour.


and the “Radiation-Induced” failures could be “Permanent (Perm)” or “Transient (Trans)”). It should be noted that all of the EFCs from Table 7 originate from Ref. (Avizienis et al., 2004). Similarly, the “Stress/Aging” category would map to the “Physical Deterioration” example. The “Human Factors” category was broken up into two failure sets, to cover five possible mappings, based on the “Dimension” (Hardware or Software), “Objective” (Mal or Non-Mal), and fault grouping (Interaction (Int) or Development (Dev)) (Avizienis et al., 2004). The “Maintenance Induced” failure set would map to “Physical Interference” (hardware) and “Input Mistakes” (software), as they are both non-malicious. The “Security Breach” failure set would map to “Intrusion Attempts” for hardware faults, “Virus/Worms” for Interaction software faults, and “Logic/Timing Bombs” for Development software faults, as all three of those faults are considered malicious. This mapping allows the low-level FPGA faults to be related to the high-level elementary fault classifications, providing more information on the possible cause(s) of each failure mode. These two categorizations (seen in Fig. 7 and Fig. 8) would be useful in the design/development stage of an FPGA-based system, to identify potential failures and allow avoidance or mitigation methods to be utilized.

3.5.2. OECD-NEA categorization and IEC random hardware faults

It is seen that the IEC fault classification includes the failure effects and uncovering situations jointly, while the OECD-NEA taxonomy considers these to be separate classifications. However, it is still possible to map these failure classifications/uncovering situations, similar to what was done in Section 3.5.1. Table 8 shows the mapping of the fault categories.
In that table, “Safe” failures (both “Detected” and “Undetected”) could be considered either “Implausible Non-Fatal” or “Ordered Fatal”, as these failures would be easily detected and/or set the system to a pre-defined safe state. “Dangerous” failures would then map to “Plausible Non-Fatal” and “Haphazard Fatal”, as these failures would be much more difficult to detect and/or set the system into an unpredictable state.

Table 8
IEC/OECD-NEA Fault Category Mapping.

IEC Fault Category | OECD-NEA Failure Effect
Safe (Detected or Undetected) | Fatal (Ordered)
Dangerous (Detected or Undetected) | Fatal (Haphazard)
Dangerous (Detected or Undetected) | Non-Fatal (Plausible)
Safe (Detected or Undetected) | Non-Fatal (Implausible)

The mapping of uncovering situations is shown in Table 9. Here, “Safe Detected” and “Dangerous Detected” failures would map to both “Online Detection” and “Offline Detection”, as they are detected failures that will not be revealed via demands or spurious actuations. By the definitions given in IEC documentation (IEC, 2012b, 2010), the “Safe Undetected” failure would map to a “Spurious Actuation”, and the “Dangerous Undetected” failure would map to being “Revealed by Demand (Latent or Triggered)”.

Table 9
IEC/OECD-NEA Uncovering Situation Mapping.

IEC Fault Category | OECD-NEA Uncovering Situation
Safe Undetected Failure | Spurious Actuation
Dangerous Undetected Failure | Demand (Latent or Triggered)
Detected Failure (Safe or Dangerous) | Online Detection
Detected Failure (Safe or Dangerous) | Offline Detection

When the failure effect and the uncovering situation of a fault are known according to the OECD-NEA taxonomy, the corresponding IEC classification can be applied. As an example, a failure classified as an “Implausible Non-Fatal” failure with a “Spurious Actuation” uncovering situation in the OECD-NEA taxonomy would be considered a “Safe Undetected” failure under the IEC classification. Mapping the IEC fault classifications to the OECD-NEA taxonomy classifications also allows the IEC classifications to be applied to the FPGA FMEA failure sets during the creation of the FPGA Taxonomy.

3.5.3. EFC and OECD-NEA categorization

The EFC failure effects (“Failure Domain”) and uncovering situation (“Detectability”) seen in Fig. 5 were also mapped to the OECD-NEA taxonomy. Here, “Content Failures” and “Timing Failures” would both map to “Non-Fatal” failures, as these failures do not prevent the system from providing outputs. These failures could represent either “Plausible” or “Implausible” failures, depending on the extent of the failure. The “Halt” failure would be a form of “Fatal” failure, as it would prevent the system from outputting any new data (and may prevent any form of output). Depending on whether or not the system is forced into pre-set values, the “Halt” failure could constitute either an “Ordered” or a “Haphazard” fatal failure. Lastly, the “Erratic” failure would map to a “Haphazard Fatal” failure, due to the erratic nature of the outputs. These mappings are presented in Table 10.

Table 10
EFC/OECD-NEA Fault Category Mapping.

EFC Failure Domain | OECD-NEA Failure Effect
Content Failure | Non-Fatal (Plausible or Implausible)
Timing Failure (Early or Late) | Non-Fatal (Plausible or Implausible)
Halt Failure | Fatal (Ordered or Haphazard)
Erratic Failure | Fatal (Haphazard)

Considering the uncovering situations, the mapping is very simple, as seen in Table 11. Any failure uncovered through an “Online Detection” or “Offline Detection” method would constitute a “Signaled” failure, while any failure uncovered by “Demand” or “Spurious Actuation” would be an “Unsignaled” failure.

3.5.4. OECD-NEA categorization and FPGA FMEA

The OECD-NEA taxonomy discussed in Section 3.1 provides a framework for the categorization of failure modes based on their end effects, uncovering effects, and the level of abstraction at which the failure occurs. However, one potential shortcoming of that taxonomy is that it does not provide any categorization for the cause of those failure modes, as was done in Ref. (McNelles et al., 2015). Therefore, the framework from the OECD-NEA taxonomy is applied to the FPGA FMEA results and Failure Sets. The Failure Effects and uncovering situations are defined for the hardware and software (HDL) FPGA failure modes, as well as their potential end effects on the “I&C Module” and “System” levels of abstraction. The development of the FPGA taxonomy is laid out in Section 4.

3.5.5.
Classification mapping results

The mapping of the failure modes, failure effects and uncovering situations from the various fault classifications to the OECD-NEA taxonomy allows for a comprehensive plug-in to be developed for the FPGA taxonomy. The inclusion of these additional classifications will provide more guidance on the effects, detection and mitigation of faults which will affect FPGA-based systems. The results of the mappings are used to complete the FPGA Taxonomy and demonstrations, seen in Sections 4 and 5.

4. Development of the FPGA taxonomy

The development of the FPGA taxonomy followed the framework laid out in the OECD-NEA taxonomy, for the hardware and software (HDL) components. Section 4.1 introduces the


Table 11
EFC/OECD-NEA Uncovering Situation Mapping.

EFC Detectability | OECD-NEA Uncovering Situation
Unsignaled | Spurious Actuation
Unsignaled | Demand (Latent or Triggered)
Signaled | Online Detection
Signaled | Offline Detection

‘‘Sub-Component” Level of Abstraction; Section 4.2 discusses the inclusion of mitigation methods; and Sections 4.3 and 4.4 present the ‘‘Sub-Component” Taxonomies for Hardware and HDL Code, respectively. Section 4.5 provides a summary of the information provided in the aforementioned taxonomies.
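The mappings in Tables 8, 9 and 11 lend themselves to simple lookup tables. The Python sketch below is an illustration only and is not taken from the paper: the names `IEC_SAFETY_BY_EFFECT`, `UNCOVERING`, `iec_class` and `efc_detectability` are hypothetical, and rejecting combinations whose failure effect and uncovering situation imply different IEC safety categories is an assumption made for this example.

```python
# IEC safety category implied by the OECD-NEA failure effect (Table 8).
IEC_SAFETY_BY_EFFECT = {
    ("Fatal", "Ordered"): "Safe",
    ("Fatal", "Haphazard"): "Dangerous",
    ("Non-Fatal", "Plausible"): "Dangerous",
    ("Non-Fatal", "Implausible"): "Safe",
}

# OECD-NEA uncovering situation -> (IEC detection status,
# IEC safety category implied by Table 9 or None, EFC detectability from Table 11).
UNCOVERING = {
    "Spurious Actuation": ("Undetected", "Safe", "Unsignaled"),
    "Demand (Latent or Triggered)": ("Undetected", "Dangerous", "Unsignaled"),
    "Online Detection": ("Detected", None, "Signaled"),
    "Offline Detection": ("Detected", None, "Signaled"),
}


def iec_class(effect, uncovering):
    """Combine Tables 8 and 9 into one IEC label, e.g. 'Safe Undetected'."""
    safety = IEC_SAFETY_BY_EFFECT[effect]
    detection, implied_safety, _ = UNCOVERING[uncovering]
    # Assumption of this sketch: refuse pairs on which the two tables disagree.
    if implied_safety is not None and implied_safety != safety:
        raise ValueError(f"{effect} is inconsistent with {uncovering!r}")
    return f"{safety} {detection}"


def efc_detectability(uncovering):
    """Return the EFC detectability (Table 11) for an uncovering situation."""
    return UNCOVERING[uncovering][2]


# Worked example from Section 3.5.2.
print(iec_class(("Non-Fatal", "Implausible"), "Spurious Actuation"))  # Safe Undetected
print(efc_detectability("Online Detection"))  # Signaled
```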

4.1. Sub-Component level of abstraction

A “Sub-Component” (SC) level of abstraction was created to account for the failure modes of FPGAs, including the Failure Set data from Fig. 8. The FPGA failure data collected during the FMEA affects the FPGA chip/board or the actual HDL code. However, these failures were not included in the OECD-NEA Taxonomy, as it stopped at the “Basic Component” level, and did not consider failures beyond that level in detail. To allow for the inclusion of the FPGA FMEA data, the “Sub-Component” level of abstraction was proposed, in order to account for the effects of the different FPGA and HDL code failures on the example system. The FPGA Taxonomy focuses on the “Sub-Component” and “Basic Component” levels, to tie together the FPGA FMEA data and the OECD-NEA Taxonomy example system, by plugging the “Sub-Component” level into the “Logic Process” developed in Section 2.3.2. The “Sub-Component” level of abstraction is shown in Fig. 10.

The “Sub-Component” level is a potential way to illustrate the effects of the failure modes of the FPGA system. It breaks down the “FPGA” and “HDL” entries at the “Basic Component” level into their most basic hardware and “software” components, respectively. This allows the Failure Sets presented in Fig. 8 to be used to construct the FPGA failure mode taxonomy, based on the OECD-NEA template. To create this “Sub-Component” level of abstraction, a vertical extension was applied to the “Basic Component” level, as seen in Fig. 10. The FPGA chip itself is comprised of several smaller (hardware) constituents, and the overall FPGA logic/HDL code (software) is said to be comprised of function blocks, such as mathematical/logic functions, IP cores, etc. As each (configured) FPGA is comprised of the aforementioned hardware and software constituents, it is analogous to the “System” level being comprised of the constituent divisions at the “Division” level, or a motherboard at the “Module” level being comprised of constituent components such as microprocessors, software, A/D converter, D/A converter, etc., at the “Basic Component” level. Therefore, a vertical extension of the new “Logic Process” block at the “Basic Component” level to the new “Sub-Component” level is warranted, as the various hardware and software constituents will comprise the configured FPGA.

As seen in Fig. 10, the FPGA Chip can be broken down into many categories based on the hardware (blue) and “software” (red) modules that make up a (configured) FPGA. This allows the failure categories shown in Fig. 8 to be applied to the taxonomy, to provide detailed information about the failure modes and failure effects regarding the hardware and software components of FPGAs and FPGA-based systems. It should be noted that the “Soft Processor” is a form of “IP Core”; however, it was included here separately, due to the inclusion of the “Processor” in the OECD-NEA Taxonomy. Additionally, it may seem counterintuitive that the Mux/Demux constituent appears at both the “Basic Component” and “Sub-Component” levels. However, this is because the Mux and/or Demux could be its own physical (hardware) component (such as in the case of an analog Mux), or it could be programmed into the FPGA. In the case of a configured FPGA, the Mux/Demux could be affected by software errors (errors in the HDL code), as well as hardware errors, if failures occur in the logic resources or programmable interconnects used by the Mux/Demux in the configured FPGA.

This additional level of abstraction allows the failure mode data in this paper to be re-structured, and provides information on the classifications, fault locations and uncovering situations that are given in the taxonomy. It also demonstrates how failures at the sub-component level have the potential to cascade upwards through the levels of abstraction, causing a failure of the overall system. Furthermore, the lifecycle information of the failure modes is included in the new FPGA taxonomy.

4.2. Inclusion of mitigation measures and fault classification mapping

The identification and categorization of FPGA failure modes is important; however, once these failure modes have been ascertained, it is imperative that defenses against these failures be established. This FPGA failure mode taxonomy provides examples of detection, avoidance and/or mitigation methods for each of the failure modes/failure mode categories resulting from the FMEA (McNelles et al., 2015). Specific mitigation methods are provided for each of the failure sets, based on the research into these measures performed during the FPGA FMEA, in order to eliminate the occurrence of each failure mode during the “Design”

Fig. 10. Relationship Between ‘‘Basic Component”, ‘‘Sub-Component” and ‘‘Failure Categories”


stage of the lifecycle (e.g. HDL code errors), or to mitigate the failure mode during the “Operation” stage, if it cannot be eliminated (e.g. aging process failure). This allows mitigation methods to be properly incorporated in future design, modelling and analysis of FPGA-based systems, and fulfills an important additional criterion for future work laid out in the OECD-NEA taxonomy.

Additionally, high-level mitigation measures are provided based on information from the EFC literature (Avizienis et al., 2004). The “Fault Prevention” methods primarily apply to failures in the “Design” stage of the lifecycle (e.g. HDL code failures), as one should prevent those faults from being implemented during the design/construction of the system. Certain failures that occur during the “Operation” stage, such as aging failures, cannot be prevented, as aging effects will inevitably damage any electronic system over time. In that case, “Fault Removal” (Correction) is applicable, as it would involve the removal of the FPGA chip(s)/board(s) that have been damaged due to operational conditions (e.g. aging failures, radiation interactions, environmental interactions, etc.). “Fault Tolerance” involves the methods used in the detection of certain failures (such as SEUs), as well as the methods to handle or recover from (mitigate) those failures (such as TMR or error detection codes). It should be noted that “Fault Forecasting” is applicable to any form of conceivable failure; however, it is especially relevant to CCF, as it is assumed that CCF is a latent error that cannot be detected or removed (OECD-NEA, 2015). The specific mitigation methods for each failure set from the FPGA FMEA, as well as the high-level measures from the EFC, are included in the Hardware and Software FPGA Taxonomy in Sections 4.3 and 4.4, as well as in the Taxonomy demonstration presented in Section 5.
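To make the “Fault Tolerance” category more concrete, the following Python sketch models a bitwise two-out-of-three majority voter of the kind used in TMR. It is a software illustration only, not HDL and not the authors' implementation; the function name `majority_vote` and the 8-bit example value are assumptions chosen for this example.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Return the bitwise 2-out-of-3 majority of three redundant copies."""
    return (a & b) | (a & c) | (b & c)


stored = 0b10101010           # value held by all three redundant registers
upset = stored ^ 0b10000000   # one copy after a single-bit upset in the MSB

# A single corrupted copy is outvoted by the two good copies.
voted = majority_vote(stored, upset, stored)
print(f"{voted:08b}")  # 10101010

# The same upset in two copies defeats the voter: the wrong value wins.
defeated = majority_vote(upset, upset, stored)
print(f"{defeated:08b}")  # 00101010
```

The second case shows why voting alone masks only a single faulty copy; upsets that accumulate in more than one copy would need an additional measure to be caught.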
The fault classification mappings discussed in Section 3.5 are also incorporated into the FPGA Taxonomy tables in Sections 4.3 and 4.4. In Tables 13 and 15, the OECD-NEA Categorization and FPGA FMEA Mapping (Section 3.5.4) allows the OECD-NEA Failure Effect “Fatal” or “Non-Fatal” (column labelled “Failure Effect”) to be related to the FPGA FMEA data (columns labelled “Failure” and “Fault Location”). Additionally, the mapping between the EFCs and FPGA FMEA (Section 3.5.1) allows the “Failure Set”, “Cause” and “Lifecycle” data from the low-level FPGA FMEA to be related to the high-level EFCs, as seen in the three rightmost columns of Tables 13 and 15. In terms of Tables 14 and 16, the mapping of the FPGA FMEA failure mode data with the other failure mode classifications of Section 3.5 allows the failure mode examples in the “Fault” column to be related to the example “Uncovering Situations” from the OECD-NEA, EFC and Functional Safety (IEC) fault classifications (where applicable). Furthermore, it allows the FPGA FMEA failure mode data (“Fault” column) and the FPGA FMEA data for the “Fault Tolerance Feature” to be related to the “Fault Tolerance Features” for the OECD-NEA and EFC fault classifications.

4.3. Sub-Component hardware taxonomy

The “Sub-Component” taxonomy for the hardware sub-components is discussed here. Fig. 11 gives a representation of the FPGA Chip and Board, along with the hardware sub-components, and an example of failures that affect those components. In Fig. 11, the FPGA chip is divided into its three underlying components: the FPGA I/O, the Configurable Logic Blocks (CLB), and the Programmable Interconnects (PI). The CLB is further subdivided into its sub-components: the Look-Up Tables (LUT), Registers/Flip-Flops and Muxes (the latter discussed in the OECD-NEA taxonomy). The effects of failures of the inputs from the clock, the FPGA board (on which the FPGA chip itself would reside), as well as inputs into the FPGA board itself, are also considered. All of these components (except for the Mux) were then assigned an example failure


Fig. 11. FPGA Chip/Board Hardware Failures.

Fig. 12. Effects of failures of CLBs and Programmable Interconnects.

mode, taken from the FPGA FMEA research. The sub-components and failure data shown in Fig. 11 are reconstructed in Table 13, which also ties the failure modes back to the fault classes shown in Fig. 7 and the failure mode categories in Fig. 8, along with the potential effects of those failures on the “Basic Component” and “Sub-Component” levels of abstraction. The definitions for all the acronyms used in Fig. 11 are found in Appendix A. It should be noted that in Fig. 11, the square icons represent the hardware components and sub-components of the FPGA, whereas the diamond-shaped icons represent examples of failure modes that have the potential to affect the indicated FPGA hardware component or sub-component.

Table 13 also considers the elementary fault classes from Ref. (Avizienis et al., 2004), listed in parentheses below the information taken from Fig. 8. The “Failure Set” column includes the elementary fault example mapping from Table 7. The “Cause” column denotes whether the failure is from the “Development”, “Physical” or “Interaction” group. Lastly, the “Lifecycle” column states whether the failure is in the “Development” or “Operational” portion of the lifecycle. Additionally, the column in Table 13 entitled “Failure Effect” denotes the “Fatal” or “Non-Fatal” failure effects associated with


Table 12
Effects of SEU on Register Storage Values.

Intended Value (Base 2) | Intended Value (Base 10) | Bit Flipped (MSB/LSB) | Erroneous Value (Base 2) | Erroneous Value (Base 10)
10101010 | 170 | MSB | 00101010 | 42
10101010 | 170 | LSB | 10101011 | 171
00101010 | 42 | MSB | 10101010 | 170
00101010 | 42 | LSB | 00101011 | 43

Table 13 Sub-Component Level Failure Modes and Failure Effects (Hardware). Failure

Fault Location

SC Level Effect

BC Level Effect

Failure Effect

Failure Set

Cause

Lifecycle

TDDB

LUT

Destruction of FPGA LUT

No Output/ Incorrect Output

Operation (Operational)

Programmable Interconnect

No Output/ Incorrect Output

Stress/Aging (Physical)

Operation (Operational)

SEGR

Logic Gate

Destruction of Programmable Interconnects Destruction of FPGA logic gate

Register (Storage Element) Register and/or Logic Gates

Temporary Bit Upset in Memory Element

Incorrect Output

Transient pulse through logic/registers/output

Incorrect Output

NonFatal

Substrate Breakdown (High Temp.) Stuck-Pin

FPGA Chip/ Board

Destruction of FPGA Device, Package or Board

Board/Chip destroyed, no output

Fatal

Environmental (Physical/ Interaction) Environmental (Physical/ Interaction) Environmental (Physical/ Interaction) Environmental (Physical/ Interaction)

Operation (Operational)

SEU

Aging Process (FPGA Chip) (Physical Deterioration) Aging Process (FPGA Chip)(Physical Deterioration) Radiation-Induced Hard Errors(Physical Interference) Radiation-Induced Soft Errors(Physical Interference) Radiation-Induced Soft Errors(Physical Interference) Environmental Qualification(Physical Interference)

Stress/Aging (Physical)

EM

Fatal or NonFatal Fatal or NonFatal Fatal or NonFatal NonFatal

FPGA I/O

FPGA Pin Stuck (‘‘1” or ‘‘0”)

Incorrect Output

NonFatal

Chip and Board (Production Defects/ _Hardware Errata)

HCE

Clock

Delayed Output

ESD

FPGA Chip/ Board

Reduction in Clock Frequency ES damage to the FPGA Chip/Board

Aging Process (Clock) (Physical Deterioration) Maintenance Induced (Physical Interference)

Differential Power Analysis (DPA) Hardware Sneak Circuit Data Retention Failure (DRF) Discrete (Digital) Input Hardware CCF

FPGA Chip/ Board (FPGA Logic)

Secret Cryptographic keys are recovered (unauthorized)

Board/Chip destroyed, no output System security and/or IP is compromised

NonFatal Fatal

NonFatal

Security Breach (Intrusion Attempts)

FPGA Chip/ Board

Spurious or Missed Actuation

Incorrect Output or No Output

Programmable Interconnect

Interconnect Self-Healing

Incorrect Output

Fatal or NonFatal NonFatal

Board I/O

Failure of Input on the FPGA Board

Fatal

FPGA Chip/ Board

Common Cause Failure (Hardware)

No output (resulting from no input) No output

SET

No Output/ Incorrect Output

each example failure, as discussed in Section 3.1.1 and described in more detail in the OECD-NEA Taxonomy (OECD-NEA, 2015). The hardware failures in Table 13 include Hot Carrier Effects (HCE), which will slow down the clock period, and Single Event Upsets (SEU), which will invert a data bit that is stored in a memory element, such as a register. A full list of all acronyms and definitions for the failures can be found in Appendix A. Failures that affect the interconnects, such as Electromigration (EM), could result in either fatal or non-fatal errors, depending on the location where the failure occurred. If the EM failure occurs in an interconnect carrying only part of the data (i.e. it is one of several inputs

Fatal

Operation (Operational) Operation (Operational) Operation (Operational)

Manufacturer Defects (Development/ Physical) Stress/Aging (Physical) Human Factors (Physical/ Interaction) Human Factors (Physical/ Interaction)

Design (Development)

Sneak Circuit(Production Defect)

Design Defect (Physical)

Design (Development)

Bit Error(Physical Deterioration)

Stress/Aging (Physical)

Operation (Operational)

Board Level(Production Defects/Hardware Errata) Common Cause Failure (Production Defects)

Design Defects (Development/ Physical) Design Defects (Development/ Physical)

Design (Development)

Operation (Operational) Operation (Operational) Operation (Operational)

Design (Development)

that will be summed and then output) denoted ‘‘EM 1”, the failure will be non-fatal. If the EM failure occurs in an interconnect that is the only input or output path for the signal, then the failure would be fatal, as seen with ‘‘EM 2”. Similarly, an error in a CLB (such as Time Dependent Dielectric Breakdown (TDDB)) could be non-fatal, if there are many logic blocks performing computations in parallel, as in the case of ‘‘TDDB 1”. However, if it is the only logic block leading to an output, then it could be considered as a fatal error, denoted by ‘‘TDDB 2”. For NonFatal failures in both cases, the effects could be either ‘‘Plausible” or ‘‘Implausible”, depending on the failure location, logic process,

211

P. McNelles et al. / Annals of Nuclear Energy 108 (2017) 198–228 Table 14 Uncovering Situation and Mitigation Method Examples for Sub-Component Level (Hardware). Uncovering Situation

Fault

NEA (International Electrotechnical Commission, 2006)

Functional Safety (IEC, 2012b)

EFC (Avizienis et al., 2004)

Online detection mechanisms

Safe Detected Failure

Signaled

Offline detection mechanisms

Dangerous Detected Failure

Signaled

Latent revealed by demand Dangerous Undetected failure

Unsignaled

Triggered by demand

Dangerous Undetected failure

Unsignaled

Spurious actuation

Safe Undetected Failure

Unsignaled

Undetectable

Dangerous Undetected Failure

Unsignaled

Fault Tolerance Feature NEA (International Electrotechnical Commission, 2006)

FPGA FMEA (McNelles EFC (Avizienis et al., et al., 2015) 2004)

Fault Tolerance: Error Detection (Concurrent) Fault Tolerance: Error Handling (Compensation) Fault Removal: Correction Fault Tolerance: Error MTBF for Aging EM (National Aeronautics Periodic Testing Detection (Concurrent) Failures, and Space Administration, Fault Tolerance: Error Periodic Testing 2008; Srinivasan et al., Handling 2008)): Revealed by peri(Compensation) odic testing of the FPGA Fault Removal: Correction SEGR (National Aeronautics Detected during uncovering Sensitivity Testing, Fault Tolerance: Error and Space Administration, incident Protective Circuits, Detection (Concurrent) 2008; Srinivasan et al., Periodic Testing Power De-Rating, Fault Tolerance: Error 2008): Damage to the logic Periodic Testing Handling gates causes incorrect (Compensation) calculations, not detected Fault Removal: until actuation required Correction No detection before Electrostatic Protection Fault Prevention ESD (Benfica et al., 2016; Program Fault Removal: Xilinx, 2013): FPGA fails due triggering Correction to Electro-Static Discharge from another component (due to Maintenance errors) Failure is not detected SEU Mutuel (2016, 2014): Detection (Before Actuation) Triple Modular Fault Tolerance: Error Redundancy (TMR), Memory Upset causes valDetection (Concurrent) Error Detection and ues to read above a setpoint, Fault Tolerance: Error Correction Codes causing a spurious trip Handling (Diagnosis, Failure is detected (EDAC Compensation, (EDAC) Methods) Isolation) Hardware CCF (OECD-NEA, Undetectable Diversity and Defence Fault Prevention Fault Forecasting 2015): Undetectable in Depth, Requirements from technical standards, CCF Analysis HCE (National Aeronautics and Space Administration, 2008; Srinivasan et al., 2008): Revealed by monitoring clock skew

and combination of inputs. These situations are visualized in Fig. 12. In the case of registers (storage elements), there is the vulnerability to SEU. These failure could invert a stored memory value (0 ? 1 or 1 ? 0), and that could affect the output of the system or sub-system. The effect that this inverted bit would have on output of the values would depend on which bit in memory is inverted. For example, for an 8-bit signal, shown in Table 12, if the Most Significant Bit (MSB), typically the leftmost bit is flipped, the difference is much greater than if the Least Significant Bit (LSB) is flipped, typically the rightmost bit. In Table 12, the first example is a binary input of ‘‘10101010”, corresponding to a value of ‘‘170” in decimal notation (base 10). If the MSB is flipped (1 ? 0), then the resulting binary signal is ‘‘00101010”, or ‘‘42” in decimal notation. This is obviously a very large change in value, which could cause a large change in output calculations (although this is likely an ‘‘Implausible” Non-Fatal Failure). If the LSB is flipped (0 ? 1), then the ensuing binary representation is ‘‘10101011”, or ‘‘171” in decimal form. This is a much smaller change (likely a ‘‘Plausible” Non-Fatal failure), and may actually be a negligible difference, depending on the system logic. Due to the large variation in the change in signal values due to SEUs, the corresponding failures could include both ‘‘Revealed by Demand (Latent)” and ‘‘Spurious Actuation”, and the Failure Effects could be either ‘‘Plausible” or ‘‘Implausible” Non-Fatal failures.

Monitoring

Monitor Clock Skew, MTBF for HC, Periodic Testing
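The bit-flip arithmetic behind Table 12 can be reproduced with a short sketch. The Python below is illustrative only and is not taken from the paper: the names `flip_bit` and `seu_cases` are hypothetical, and the register is assumed to hold an unsigned 8-bit value with bit 0 as the LSB.

```python
def flip_bit(value: int, bit: int, width: int = 8) -> int:
    """Return `value` with one bit inverted, as a single event upset would do.

    Assumption: the register is an unsigned integer of `width` bits,
    with bit 0 the LSB and bit (width - 1) the MSB.
    """
    if not 0 <= bit < width:
        raise ValueError("bit index is outside the register width")
    if not 0 <= value < (1 << width):
        raise ValueError("value does not fit in the register")
    return value ^ (1 << bit)


def seu_cases(intended: int, width: int = 8) -> dict:
    """Erroneous values after an upset of the MSB and of the LSB."""
    return {
        "MSB": flip_bit(intended, width - 1, width),
        "LSB": flip_bit(intended, 0, width),
    }


if __name__ == "__main__":
    # The two intended values used in Table 12.
    for intended in (0b10101010, 0b00101010):
        for which, wrong in seu_cases(intended).items():
            print(f"{intended:08b} ({intended}) {which} flipped -> "
                  f"{wrong:08b} ({wrong}), change of {abs(wrong - intended)}")
```

An MSB upset always shifts an 8-bit value by 128, while an LSB upset shifts it by 1, which is why the same physical event can range from an implausible jump to a change that may be negligible for the system logic.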

Following the example set out in the OECD-NEA taxonomy, Table 14 provides one example for each of the uncovering situations, and an example of one of the corresponding mitigation measures from the fault classifications considered in this paper, for the failure modes seen in Fig. 11 and Table 13. Although (hardware) CCF was not explicitly shown in the example figure, it is included in the corresponding tables (Tables 13 and 14). Additionally, CCF was stated to be “undetectable” in the OECD-NEA taxonomy (OECD-NEA, 2015). It should be noted that Table 14 was not intended to be an exhaustive list; it was intended to provide an example of each uncovering situation and potential mitigation method, as was done in the OECD-NEA taxonomy (OECD-NEA, 2015). The information provided in Tables 13 and 14 is used in the demonstration of the FPGA Taxonomy, to create the data tables seen in Section 5.2. In that sub-section, uncovering situations and mitigation methods are provided for an example failure mode from each of the Failure Sets listed in Fig. 8.

4.4. Sub-Component HDL code taxonomy

The same principle as in Section 4.3 is applied to the HDL Code failure modes (the “software” component of the FPGA). The example in this case is loosely based on the Overtemperature and Overpressure trip parameters in the Westinghouse AP1000 documentation, and is presented in Fig. 13 (AP1000 Design Control Document, 0000). Several software/logic


Fig. 13. FPGA ‘‘Software” Failures (Parameter Trip).

Fig. 14. FPGA ‘‘Software” Failures (State Machine).

errors are seen, including the use of a latch instead of a register in the synchronization chain, a mathematical error due to “Arithmetic Overflow”, and a “Stuck Output” (the output will not update, even with changing inputs) from the soft processor core that calculates the Trip Setpoint (TSP). The Software Sneak Circuit example shows that a sneak circuit could be created that bypasses certain important functions, resulting in a “Missed Trip” or “Spurious Trip”. A second IP core, this time a COTS IP Core (considered to perform some generic signal processing function), may contain a failure, or the specifications of that IP core may not be appropriate. Composition Errors (either malicious or non-malicious) could cause the IP cores to alter one another, or interfere with each other’s functions. Design Tool Subversion could insert code that affects the clock signal (such as if a dynamic clock frequency is used), distorting proper chip timing. Finally, an FPGA virus could cause damage to the FPGA chip (such as a short circuit), resulting in no output being sent. All of these failures could affect the output of the FPGA logic. It should be noted that in Fig. 13, the square icons represent software sub-component function blocks (FPGA logic), while the diamond-shaped icons represent examples of failure modes that have the potential to affect the indicated FPGA software sub-component.

A smaller example in Fig. 14 shows certain potential errors in the FPGA state machine. As can be seen, the state machine could get caught in an Endless Loop (Infinite Loop), as the state S2 is encoded in such a way that a value of “1” has two possible paths. In this case, the state machine could continuously loop back into itself, causing it to hang. A second fault in the state machine is seen with S4, where the state is unreachable. State machines often see use in FPGA-based systems, and as such were given consideration in the FPGA taxonomy. The information regarding the software sub-components, failure modes, and failure categories was compiled and displayed in Table 15. As is the case with Table 13, the column entitled “Failure Effect” in Table 15 denotes the failure effect defined in the OECD-NEA Taxonomy (OECD-NEA, 2015).

It should be noted that the failure mode(s) and uncovering situations of individual IP cores are dependent on the functionality of each individual IP Core, as stated in the OECD-NEA taxonomy. In Fig. 13, it was assumed that the filter (such as a lead-lag filter) was implemented (digitally) on the FPGA, using an IP Core. A failure in that core could affect the filtered output, with failures such as “Stuck Output”, “No Output”, “Delayed Output”, etc. The IP Core in this case was used as an example. As there are potentially numerous IP Cores available from different vendors, it is not practical to discuss failure modes for all the individual IP Cores (with the aforementioned exception of the Soft Processor). As in the case of the hardware sub-components, the information regarding uncovering situations and mitigation methods is given in Table 16, with software CCF being “undetectable”.

4.5. Taxonomy summary

The FPGA taxonomy was constructed using the same overall process as used in the OECD-NEA failure modes taxonomy, including the failure effects and uncovering situations as defined in that

document (OECD-NEA, 2015), with a brief explanation of that information provided in Section 3.1. In this paper, the FPGA failure mode data was obtained through the performance of a detailed FMEA, for which the preliminary results were published in Ref. (McNelles et al., 2015), and then expanded on in Section 3.3. The FPGA FMEA categorized the failure modes in three steps: first by the stage of the lifecycle in which the failure occurs, second by the overall cause of the failure, and lastly by grouping the failures into “Failure Sets”, based on the common effects and mitigation methods of those failures. A similar approach is seen in the EFCs, which provide data on generic, high-level faults, and provide additional categorization through the use of “Physical”, “Development” and “Interaction” fault classes, as seen in Ref. (Avizienis et al., 2004) and in Section 3.2 of this paper. The EFC also presents its own variant of failure effects (“Failure Domain”) and of uncovering situations (“Detectable”). Finally, the IEC classification for random hardware faults was presented in Section 3.4, which considers the detected/undetected safe or dangerous faults, as described in Ref. (IEC, 2010; O’Connor et al., 2016).

The various classifications outlined in Section 3 were then used to construct the hardware and HDL code taxonomies seen in Sections 4.3 and 4.4, respectively. The FMEA tables (Tables 13 and 15) were constructed using the failure mode data collected in the FPGA FMEA, and included categorization data from the FPGA FMEA, the EFC and the failure effects from the OECD-NEA taxonomy. The subsequent tables in the hardware and HDL code taxonomy (Tables 14 and 16) present example failures from the FPGA FMEA for each of the uncovering situations defined in the OECD-NEA taxonomy. Those tables also incorporate the uncovering situations considered in the EFC and IEC Functional Safety documentation (where applicable), as well as example fault tolerance features from the FPGA FMEA, EFC, and OECD-NEA taxonomy. The information presented in Sections 4.3 and 4.4 was used in Section 5 to demonstrate the FPGA Taxonomy for example failure modes covering all of the “Failure Sets” discussed in Section 3.3, using the process outlined in Section 2.2.4.

Table 15
Sub-Component Level Failure Modes and Failure Effects (Software).

Failure | Fault Location | SC Level Effect | BC Level Effect | Failure Effect | Failure Set | Cause | Lifecycle
Endless Loop | State Machine | State Machine caught in an endless loop | No Output/Stuck Output | Fatal | State Machine (Software Fault) | Design Defects (Development) | Design (Development)
Unreachable States | State Machine | State(s) cannot be reached as intended | No Output or Incorrect Output | Fatal or Non-Fatal | State Machine (Software Fault) | Design Defects (Development) | Design (Development)
COTS HDL Code or Tool Failure | FPGA Logic | Function Dependent HDL Error | No Output or Incorrect Output | Fatal or Non-Fatal | COTS (Software Fault) | Design Defects (Development) | Design (Development)
Stuck Output | Soft Processor | SP core stops updating | No Output/Stuck Output | Fatal | Soft Processor (Software Fault) | Design Defects (Development) | Design (Development)
Math Error (Arithmetic Overflow) | EF (Math) | Arithmetic overflow leads to calculation error | Incorrect (Math) Output | Non-Fatal | Logic Errors (HDL) (Software Fault) | Design Defects (Development) | Design (Development)
Logic Error (Comparator Error) | EF (Logic) | Error in comparator leads to logic error | Incorrect (Logic) Output | Non-Fatal | Logic Errors (HDL) (Software Fault) | Design Defects (Development) | Design (Development)
Fixed Point Resolution Error | Data Type | Low resolution of FXP value | Incorrect setpoint | Non-Fatal | Input and Data Type (Software Fault) | Design Defects (Development) | Design (Development)
Software Sneak Circuit | FPGA Chip/Board | Spurious or Missed Actuation | Incorrect Output or No Output | Fatal or Non-Fatal | Sneak Circuit (Production Defect) | Design Defect (Development) | Design (Development)
Latch | Registers | Unintended asynchronous signals | Unknown/Random Output | Non-Fatal | Clock/Timing (Software Fault) | Design Defects (Development) | Design (Development)
Design Tool Subversion | FPGA Logic/Timing (FPGA Synthesis Tool) | Unauthorized HDL code synthesized | Unauthorized access, incorrect outputs or device damage | Fatal or Non-Fatal | Design Security (Logic/Timing Bomb) | Design Defects (Development) | Design (Development)
FPGA Virus | FPGA Logic/Circuitry | Internal signal conflict in FPGA | Device damage/destruction | Fatal or Non-Fatal | Human Factors (Virus/Worm) | Security Breach (Interaction) | Operation (Operational)
Composition Problem | IP Core | Updated IP Cores alter or interfere with each other | Incorrect outputs or device damage | Fatal or Non-Fatal | Human Factors (Input Mistakes) | Maintenance-Induced (Interaction) | Operation (Operational)
IP Core Failure | IP Core | Function Dependent | Function Dependent | Fatal or Non-Fatal | Maintainability (Software Fault) | Design Defects (Development) | Design (Development)
Software CCF | FPGA Chip/Board | CCF due to Software | No Output | Fatal | Common Cause Failure (Software Fault) | Design Defects (Development) | Design (Development)
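The two state machine faults listed in Table 15 and illustrated in Fig. 14 (an endless loop and an unreachable state) are the kind of defect that a static reachability check on the transition table can flag. The following is a minimal sketch under stated assumptions: the five-state table `FSM` is invented for illustration and is not the machine drawn in Fig. 14, and the function names are hypothetical rather than part of any FPGA tool.

```python
from collections import deque


def reachable_states(transitions: dict, start: str) -> set:
    """Breadth-first search over the transition table, starting at `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nxt in transitions.get(state, {}).values():
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def unreachable_states(transitions: dict, start: str) -> set:
    """States that are declared or targeted but can never be entered."""
    declared = set(transitions)
    targeted = {n for moves in transitions.values() for n in moves.values()}
    return (declared | targeted) - reachable_states(transitions, start)


def trapped_states(transitions: dict, start: str, goal: str) -> set:
    """Reachable states from which `goal` can never be reached again.

    Entering one of these states means the machine hangs (endless loop).
    """
    return {
        state
        for state in reachable_states(transitions, start)
        if goal not in reachable_states(transitions, state)
    }


# Hypothetical transition table: state -> {input bit -> next state}.
FSM = {
    "S0": {"0": "S0", "1": "S1"},
    "S1": {"0": "S3", "1": "S2"},
    "S2": {"0": "S2", "1": "S2"},  # assumed encoding error: no way out of S2
    "S3": {"0": "S0", "1": "S0"},
    "S4": {"0": "S0", "1": "S0"},  # no transition ever targets S4
}

if __name__ == "__main__":
    print("Unreachable:", sorted(unreachable_states(FSM, "S0")))
    print("Endless loop candidates:", sorted(trapped_states(FSM, "S0", "S0")))
```

A check of this kind belongs to the design-time measures named in Table 16 (State Machine Hazard Analysis); it does not replace a run-time measure such as a watchdog timer.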

5. FPGA taxonomy demonstration

The OECD-NEA taxonomy was demonstrated using the aforementioned RTS/ESFAS test system. For the FPGA taxonomy demonstration, the information in Section 4 was used to create the FPGA taxonomy, for hardware and “software”, at the “Basic Component” and “Sub-Component” levels of abstraction. Filling out these levels of abstraction links the FMEA and the PRA modelling information given in the example in the OECD-NEA document, to demonstrate how the failures of the configured FPGA and HDL code would affect the overall system. Section 5.1 covers the “Basic Component” level, while Section 5.2 presents the demonstration for the “Sub-Component” level hardware and software FMEA. Lastly, a modelling example using fault trees is provided in Section 5.3.


Table 16
Uncovering Situation and Mitigation Method Examples for Sub-Component Level (Software).

Uncovering Situation: NEA (International Electrotechnical Commission, 2006) | Uncovering Situation: EFC (Avizienis et al., 2004) | Fault | Fault Tolerance Feature: NEA (International Electrotechnical Commission, 2006) | Fault Tolerance Feature: FPGA FMEA (McNelles et al., 2015) | Fault Tolerance Feature: EFC (Avizienis et al., 2004)
Online detection mechanisms | Signaled | Endless Loop (IEC, 2010; 2012a): State Machine Endless Loop caught by WDT. State Machine returned to pre-defined state | Monitoring | WDT and return to predefined state | FT: Error Detection (Concurrent Detection); FT: (Error Handling: Rollback/Rollforward)
Offline detection mechanisms | Unsignaled | Unreachable States (IEC, 2010): Unreachable states found and corrected by using State Machine Hazard Analysis | Periodic Testing | State Machine Hazard Analysis | Fault Prevention; Fault Removal: Verification (static and dynamic)
Latent revealed by demand | Unsignaled | FXP Resolution Error (Digital Instrumentation and Controls Working Group, 2013; Taylor, 2012): Low Resolution for the FXP data type leads to an inaccurate TSP, causing the demand to fail when it should actuate | Detected during uncovering incident, Periodic Testing | Ensure proper calculation/verification of all FXP values, Input Sanity Check | Fault Prevention; Fault Removal: Verification (static and dynamic)
Triggered by demand | Unsignaled | Design Tool Subversion (Valtion Teknillinen Tutkimuskeskus, 2011; Huffmire et al., 2008): Malicious logic inserted into HDL code during design, triggered by activation condition. Failure is not detected | No detection before triggering | Secure Lifecycle, Trusted Tools and IP Cores | Fault Prevention; Fault Removal: Verification (static and dynamic)
Spurious actuation | Unsignaled | Logic Error (Comparator) (Bobrek and Bouldin, 2010; NRC, 2010): Incorrect comparator logic (i.e. “>” instead of “<”) causes a spurious trip. Failure is detected | Detection (Before Actuation) | HDL Code V&V, International Standards | Fault Prevention; Fault Removal: Verification (static and dynamic)
Undetectable | Unsignaled | Software CCF (OECD-NEA, 2015): Undetectable | Undetectable | Diversity and Defence in Depth, Requirements from technical standards, CCF Analysis/Specification | Fault Prevention; Fault Forecasting
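Two of the Table 16 examples, the FXP resolution error and the comparator error, can be illustrated numerically. The sketch below is an assumption-laden illustration, not the paper's method: the setpoint and reading values are hypothetical, the setpoint is assumed to be stored by rounding to the nearest fixed-point step, and all function names are invented.

```python
def quantize(value: float, frac_bits: int) -> float:
    """Round `value` to the nearest fixed-point step of 2**-frac_bits."""
    scale = 1 << frac_bits
    return round(value * scale) / scale


def high_trip(reading: float, setpoint: float) -> bool:
    """Trip is demanded when the reading reaches or exceeds the setpoint."""
    return reading >= setpoint


def low_trip_intended(reading: float, setpoint: float) -> bool:
    """Intended low-value trip: actuate when the reading falls below the setpoint."""
    return reading < setpoint


def low_trip_faulty(reading: float, setpoint: float) -> bool:
    """Same trip with the comparator reversed ('>' written instead of '<')."""
    return reading > setpoint


TRUE_SETPOINT = 10.3  # hypothetical trip setpoint (TSP) in engineering units
READING = 10.4        # hypothetical process value, above the true setpoint

if __name__ == "__main__":
    for frac_bits in (1, 8):
        stored = quantize(TRUE_SETPOINT, frac_bits)
        print(f"{frac_bits} fractional bit(s): stored TSP = {stored}, "
              f"trip = {high_trip(READING, stored)}")
    # Healthy reading of 7.0 against a low setpoint of 5.0.
    print("intended comparator:", low_trip_intended(7.0, 5.0))
    print("faulty comparator:", low_trip_faulty(7.0, 5.0))
```

With one fractional bit the stored setpoint becomes 10.5, so a reading of 10.4 does not actuate the trip even though it exceeds the true setpoint (a failure revealed only by demand). The reversed comparator trips on a healthy reading, which corresponds to the spurious actuation row.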

Table 17 Basic Component Level FPGA FMEA for the OECD-NEA AIM. Failure Mode

Failure Mode Detection

Failure Effects on AIM

Comments

CCF would cause all the AIM channels to fail

No

Incorrect AIM Output(All Channels Fail) Incorrect AIM Output

No

No

Incorrect AIM Output

May not be detected (if plausible failure)

No

Yes

No

Yes

No

Yes

No

Yes

No AIM OutputFPGA Reset or Predefined state(Detection) Incorrect AIM OutputFPGA Reset or Predefined state(Detection) Delayed AIM OutputsFPGA Reset (Detection) Missed AIM OutputsFPGA Reset (Detection)

Logic issues may be detected with WDT, resetting the FPGA, or sending it to a predefined state Logic issues may be detected with WDT, resetting the FPGA, or sending it to a predefined state Timing errors cause FPGA to output data too slowly, delaying AIM outputs Timing errors cause FPGA to output data too quickly, AIM outputs are missed

Application HDL

WDT

Hardware/Software CCF

(Undetectable)

(Undetectable)

Hardware/Software Sneak Circuit Random or Incorrect FPGA Outputs FPGA logic stuck internally, no outputs sent(No Output) FPGA stops updating outputs (Stuck Output) Delayed Outputs

No

Timing Error (Fast or Slow)

5.1. Basic Component level demonstration

It was determined to start with the “Basic Component Level” for the FPGA taxonomy, and work from there. At this level, the example failure modes for the components other than the FPGA/microprocessor and HDL/software would again be the same, so they were not considered here. Examples relevant to the FPGA and/or HDL code based on the OECD-NEA examples are shown in Table 17.

Spurious or Missed Output (SCA)

The “Sub-Component” taxonomy demonstrations, which are discussed in Section 5.2, build up to the “Basic Component” level, and then up to the total “System” level, to demonstrate the effect of the FPGA and HDL failures on the overall system. The Analog Input Module (AIM) was used in Table 17, as it was the module used in the “Basic Component” example in the OECD-NEA Taxonomy. The effects of FPGA failures on the AIM can then be related to the digital RTS/ESFAS.
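Table 17 notes that logic issues such as a stuck or missing output may be detected by a watchdog timer (WDT), which then resets the FPGA or sends it to a predefined state. The tick-based simulation below is a minimal sketch of that detection idea only. The class and function names, the timeout value and the per-tick "alive" signal are all assumptions made for illustration, not a description of any specific FPGA watchdog.

```python
from typing import Iterable, Optional


class Watchdog:
    """Counts ticks since the last kick and expires at `timeout_ticks`."""

    def __init__(self, timeout_ticks: int) -> None:
        if timeout_ticks < 2:
            raise ValueError("timeout must allow at least one missed kick")
        self.timeout_ticks = timeout_ticks
        self.counter = 0

    def kick(self) -> None:
        """Called by healthy logic to show that it is still updating."""
        self.counter = 0

    def tick(self) -> bool:
        """Advance one clock tick; return True if the watchdog has expired."""
        self.counter += 1
        return self.counter >= self.timeout_ticks


def first_expiry(logic_alive: Iterable[bool], timeout_ticks: int = 3) -> Optional[int]:
    """Return the tick index at which the watchdog expires, or None.

    `logic_alive` holds one boolean per tick: True if the application
    logic updated its output (and kicked the watchdog) during that tick.
    """
    wdt = Watchdog(timeout_ticks)
    for index, alive in enumerate(logic_alive):
        if alive:
            wdt.kick()
        if wdt.tick():
            return index  # here the FPGA would be reset or forced to a safe state
    return None


if __name__ == "__main__":
    healthy = [True] * 6
    stuck_after_two = [True, True, False, False, False, False]
    print("healthy logic:", first_expiry(healthy))
    print("logic stuck after tick 1:", first_expiry(stuck_after_two))
```

In this sketch the stuck logic is detected three ticks after its last update, which mirrors the "No/Yes" detection pattern in Table 17: the application HDL does not reveal the failure itself, while the WDT does.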

Table 18 Sub-Component Level FPGA Taxonomy Demonstration (Step 1) – Hardware. Failure Mode Information

Failure Effect(s) and Uncovering Situation(s) Functional Components of Failure Effect Uncovering Safety Fault HW Modules (NEA) (OECD- Situation(NEA) (OECD-NEA, 2015) Classification NEA, 2015) (IEC, 2012b)

Failure Mode (McNelles et al., 2015)

Failure Set (FPGA FMEA) (McNelles et al., 2015)

Elementary Fault Class (Avizienis et al., 2004)

HCE,NBTI

Aging Process (Clock)

Physical Clock Deterioration

Non-Fatal (Plausible or Implausible)

Online Detection

Revealed by Demand(Latent)

Offline Detection

Timing (Late), Signaled Timing (Late), Unsignaled

Delayed Output

Dangerous Detected Undetected (Dangerous or Safe)

Content, Signaled Content, Unsignaled

No Output (Fatal) Incorrect Output (NonFatal)

Aging Process (FPGA Physical Programmable Fatal Chip) Deterioration Interconnect (Haphazard) Non-Fatal (Plausible or Implausible)

SEDB,SEGR

Radiation Induced Hard Errors

Physical Interference (Nat., HW., Perm.)

CLB Logic

Fatal (Haphazard) Non-Fatal (Plausible or Implausible)

Dangerous Offline DetectionRevealed Detected by Demand(Latent) Undetected (Dangerous or Safe)

SEU

Radiation Induced Soft Errors

Physical Interference (Nat., HW., Trans.)

Registers

Non-Fatal (Plausible) Non-Fatal (Implausible) Non-Fatal (Implausible)

Detected Online DetectionSpurious Failure (Safe or Actuation Dangerous) Safe Undetected Failure

Stuck Pin

Chip and Board

Production Defects/ Hardware Errata

FPGA Chip I/O Non-Fatal (Plausible or Implausible)

Discrete (Digital) Input

Board Level

Production Defects/ Hardware Errata

FPGA Board I/O Fatal(Ordered) Online Detection

Safe (Detected or Undetected)

ESD EOS

MaintenanceInduced Physical Interference (HW., NonMal)

FPGA Chip/ Board

Dangerous Undetected

SET

Fatal (Haphazard)

Revealed by Demand(Latent)

Undetected (Dangerous or Safe) Spurious Actuation Safe Undetected Failure

Revealed by Demand (Triggered)

AIM (OECDNEA, 2015)

Delayed AIM Outputs

Mitigation Method(s) FPGA FMEA

EFC (Avizienis et al., 2004)

Monitor Clock Skew (Bobrek and Bouldin, 2010; U.S. NRC, 2010), MTBF for aging failures (JEDEC Solid State Technology Association, 2011; National Aeronautics and Space Administration, 2008) Periodic Testing (McNelles et al., 2015) Incorrect AIM MTBF for Aging Failures (JEDEC Output Solid State Technology Association, 2011; National Aeronautics and Space Administration, 2008), Periodic Testing (McNelles et al., 2015)

FT: Error Detection (Concurrent) FT: Error Handling (Compensation) FR: Correction

FT: Error Detection (Concurrent) FT: Error Handling (Compensation) FR: Correction FT: Error Incorrect AIM Sensitivity Testing (Titus, 2013), Content, No Output Output Protective Circuits (Mutuel, 2016) Detection Signaled (Fatal) Content, (Concurrent) Power De-Rating (Scheik, 2008) Incorrect Unsignaled FT: Error Periodic Testing (McNelles et al., Output (NonHandling 2015) Fatal) (Compensation) FR: Correction Content, Incorrect Incorrect AIM TMR (Wang et al., 2011), FT: Error Signaled Output Output EDAC (Habinc, 2002) Detection (Concurrent) FT: Error Content, Spatial or Temporal Redundancy Handling Unsignaled (Mutuel, 2016, 2014), (Diagnosis, Circuit Freezing (Smith, 2012) Compensation, Isolation) FR: Content, Incorrect Incorrect AIM Detect/control damaged/ (Verification, Signaled Output Output disconnected pins (Bobrek and Correction) Bouldin, 2010, U.S. NRC, 2010) Content, FT: Error Periodic Testing (McNelles et al., Unsignaled Detection 2015) (Concurrent) FT: Error Handling (Compensation) FR: Verification No AIM Standards for fault models, Halt,Signaled No output Output diagnostic coverage and mitigation FT: Error (resulting or Detection (IEC, 2010, 2010a) from no input) Unsignaled (Concurrent) FT: Error Handling (Compensation) No Output No AIM Electrostatic Protection Program FP Halt or Output (Xilinx, 2013; Actel, 2005) FR: (Correction) Erratic, Unsignaled


Detected(Safe or Dangerous) Dangerous Undetected

Electromigration, Stress Migration

Revealed by Demand(Latent)

Functional Impact Domain and ‘‘BC” Detectability (Avizienis et al., 2004)


Table 18 (continued) Failure Mode Information

Failure Effect(s) and Uncovering Situation(s) Functional Components of Failure Effect Uncovering Safety Fault HW Modules (NEA) (OECD- Situation(NEA) (OECD-NEA, 2015) Classification NEA, 2015) (IEC, 2012b)

Failure Set (FPGA FMEA) (McNelles et al., 2015)

Elementary Fault Class (Avizienis et al., 2004)

DRF

Bit Error

Physical Programmable Non-Fatal Deterioration Interconnect (Plausible or Implausible)

Substrate Breakdown (High Temp.) DPA

Environmental Qualification

Physical Interference (Nat., HW.) Intrusion Attempts (Hardware, Mal)

Security Breach

Hardware Sneak Circuit

Sneak Circuit

Hardware CCF

CCF

Production Defects/ Hardware Errata Production Defects/ Hardware Errata

FPGA Chip/ Board

Fatal (Haphazard)

FPGA Chip/ Board (FPGA Logic)

Non-Fatal (Plausible)

FPGA Chip/ Board

Non-Fatal (Plausible or Implausible)

FPGA Chip/ Board

Fatal (Haphazard)

Revealed by Demand(Latent)

Dangerous Undetected

Revealed by Demand (Triggered) Online Detection

Dangerous Undetected

Functional Impact Domain and ‘‘BC” Detectability (Avizienis et al., 2004) Content, Unsignaled

Halt or Erratic, Unsignaled Safe Detected No Impact, Signaled

Revealed by Dangerous Demand (Latent) Undetected Spurious Actuation Safe Undetected Undetectable Dangerous Undetected

Content, Unsignaled

Halt or Erratic, Unsignaled

Incorrect Output

AIM (OECDNEA, 2015)

Mitigation Method(s) FPGA FMEA

Incorrect AIM Copies of Programming Data Output (Bobrek and Bouldin, 2010; U.S. NRC, 2010), MTBF for Interconnects (Bobrek and Bouldin, 2010; U.S. NRC, 2010), Data for Configuration memory and P/E cycles (Bobrek and Bouldin, 2010; U.S. NRC, 2010) No Output No AIM Environmental Qualification Output Procedures (IAEA, 2016; Korash et al., 1998) No Functional No Functional Secure Lifecycle (Huffmire et al., 2008), Impact Impact Side Channel Attack Counter-mea(System (System sures (Trimberger and Moore, Security Security Compromised) Compromised) 2014; Huffmire et al., 2010; Mulder et al., 2007) Cyber Security Guide (US NRC, 2010) No Output Incorrect AIM General and FPGA-Specific Sneak Output Circuit Analysis (Hahn et al., 1991; Incorrect Remnant, 2009; European Space Output Agency, 1997; ESA, 1997) Board level No AIM Diversity and Defence in Depth Failure Output (McNelles et al., 2015), Requirements from technical standards (IEEE Power and Energy Society, 2016; IEC, 2007, 2012) CCF Analysis (O’Connor and Mosleh, 2016; Kang and Kim, 2012)

EFC (Avizienis et al., 2004)

FF FT: Error Handling (Compensation)

FP FR: (Correction) FP FT: Error Detection (Concurrent)

FP FR: Verification

FP FF


Failure Mode (McNelles et al., 2015)

Table 19 Sub-Component Level FPGA Taxonomy Demonstration (Steps 2–4) – Hardware. Failure Mode Information

Failure Detection Components of Compressed NEA (International HW Modules Failure Electrotechnical Mode Commission, 2006)(Detection)

Failure Set(FPGA FMEA) (McNelles et al., 2015)

Elementary Fault Class (Avizienis et al., 2004)

Aging Process (Clock)

Physical Clock Deterioration

Failure End Effect Domain and Detectability (Avizienis et al., 2004)

AIM (International Electrotechnical Commission, 2006)

FPGA FMEA RTS/ESFAS (International Electrotechnical Commission, 2006)

Elementary Fault Class (Avizienis et al., 2004)

Detected(Safe or Dangerous) Dangerous Undetected

Timing (Late), Signaled Timing (Late), Unsignaled

Delayed AIM Outputs (Latent Loss of Function/ Loss of Function)

Loss of 1oo4 conditions of specific APU/VU outputs

FT: Error Detection (Concurrent) FT: Error Handling (Compensation) FR: Correction

Dangerous Detected Undetected (Dangerous or Safe)

Content, Signaled Content, Unsignaled

Incorrect AIM Output (Loss of Function)

Mitigation: Monitor Clock Skew (Bobrek and Bouldin, 2010; U.S. NRC, 2010), MTBF for Aging Failures (JEDEC Solid State Technology Association, 2011; National Aeronautics and Space Administration, 2008), Periodic Testing (McNelles et al., 2015)

RTS/ESFAS: Loss of 1oo4 conditions of specific APU/VU outputs
Mitigation: MTBF for Aging Failures (JEDEC Solid State Technology Association, 2011; National Aeronautics and Space Administration, 2008), Periodic Testing (McNelles et al., 2015)

Monitoring

Online Detection

Latent Loss of Function

Periodic Test

Revealed by Demand(Latent)

Aging Process (FPGA Chip); Physical Deterioration; Programmable Interconnects; Latent Loss of Function

Periodic Test

Offline Detection Revealed by Demand(Latent)

Radiation Induced Soft Errors

Chip and Board

Board Level

Physical Interference (Nat., HW., Perm.)

CLB Logic

Physical Interference (Nat., HW., Trans.)

Registers

Latent Loss of Function

Periodic Test

Offline Detection Revealed by Demand(Latent)

Loss of Function

Monitoring

Spurious Function

Self-Revealing

Non-Fatal (Plausible) Non-Fatal (Implausible) Non-Fatal (Implausible)

Detected Failure (Safe or Dangerous) Safe Undetected Failure

Non-Fatal (Plausible or Implausible)

Undetected (Dangerous or Safe) Safe Undetected Failure

Safe (Detected or Undetected)

Production Defects/ Hardware Errata

FPGA Chip I/O Latent Loss of Function

Production Defects/ Hardware Errata

FPGA Board I/O Loss of Function

Periodic Test

Online Detection

FPGA Chip/ Board

Revealed by Demand

Fatal (Haphazard) Dangerous Undetected

Maintenance-Induced Physical Interference

Spurious Function

Loss of Function

Periodic Test

Dangerous Detected Undetected (Dangerous or Safe)

Self-Revealing

EFC: FT: Error Detection (Concurrent); FT: Error Handling (Compensation); FR: Correction

Domain and Detectability: Content, Signaled; Content, Unsignaled
AIM: Incorrect AIM Output (Loss of Function)
RTS/ESFAS: Loss of 1oo4 conditions of specific APU/VU outputs
Mitigation: Sensitivity Testing (Titus, 2013), Protective Circuits (Mutuel, 2016), Power De-Rating (Scheik, 2008), Periodic Testing (McNelles et al., 2015)
EFC: FT: Error Detection (Concurrent); FT: Error Handling (Compensation); FR: Correction

Domain and Detectability: Content, Signaled; Content, Unsignaled
AIM: Incorrect AIM Output (Loss of Function or Spurious Function)
RTS/ESFAS: 1oo4 conditions of specific APU/VU outputs according to FTD
Mitigation: TMR (Wang et al., 2011), EDAC (Habinc, 2002), Spatial or Temporal Redundancy (Mutuel, 2016, 2014), Circuit Freezing (Smith, 2012)
EFC: FT: Error Detection (Concurrent); FT: Error Handling (Diagnosis, Compensation, Isolation); FR: (Verification, Correction)

Domain and Detectability: Content, Signaled; Content, Unsignaled
AIM: Incorrect AIM Output (Loss of Function or Spurious Function)
RTS/ESFAS: 1oo4 conditions of specific APU/VU outputs according to FTD
Mitigation: Detect/control damaged/disconnected pins (Bobrek and Bouldin, 2010; U.S. NRC, 2010), Periodic Testing (McNelles et al., 2015)
EFC: FT: Error Detection (Concurrent); FT: Error Handling (Compensation); FR: Verification

Domain and Detectability: Halt, Signaled or Unsignaled
AIM: No Output/Incorrect AIM Output (Loss of Function)
RTS/ESFAS: 1oo4 conditions of specific APU/VU outputs according to FTD
Mitigation: Standards for fault models, diagnostic coverage and mitigation (IEC, 2010, 2010a)
EFC: FT: Error Detection (Concurrent); FT: Error Handling (Compensation)

Domain and Detectability: Halt or Erratic, Unsignaled
AIM: No AIM Output (Loss of Function)
RTS/ESFAS: 1oo4 conditions of specific APU/VU outputs according to FTD
Mitigation: Electrostatic Protection Program (Xilinx, 2013; Actel, 2005)
EFC: FP; FR: (Correction)


Functional Safety Fault Classification (IEC, 2012)

Loss of Function

Radiation Induced Hard Errors

Mitigation Method

NEA (International Electrotechnical Commission, 2006)(Uncovering)


Table 19 (continued) Failure Mode Information Failure Set(FPGA FMEA) (McNelles et al., 2015)

Bit Error

Elementary Fault Class (Avizienis et al., 2004)

Failure Detection

Components of HW Modules

Compressed Failure Mode

NEA (International Electrotechnical Commission, 2006) (Detection)

(HW., NonMal); Physical Deterioration; Programmable Interconnect; Latent Loss of Function

Failure End Effect NEA (International Electrotechnical Commission, 2006)(Uncovering)

FPGA FMEA RTS/ESFAS (International Electrotechnical Commission, 2006)

Unsignaled

Function)

Dangerous Undetected

Content, Unsignaled

Incorrect AIM Output

VU outputs according to FTD Loss of 1oo4 conditions of specific APU/VU outputs

Revealed by Demand (Latent)

No AIM Output (Loss of Function)

Environmental Qualification

Physical Interference (Nat., HW.)

FPGA Chip/ Board

Loss of Function

Fatal (Haphazard) Revealed by Demand (Triggered)

Dangerous Undetected

Halt or Erratic, Unsignaled

Security Breach

Intrusion Attempts (Hardware, Mal)

FPGA Chip/ Board (FPGA Logic)

AIM Function is Unaffected

Monitoring

Non-Fatal (Plausible)

Online Detection

Safe Detected Normal AIM Output (System Security Compromised)

Sneak Circuit

Production Defects/ Hardware Errata Production Defects/ Hardware Errata

FPGA Chip/ Board

Latent Loss of Function Spurious Function Loss of Function

Periodic Test

Revealed by Demand (Latent) Spurious Actuation Fatal(Haphazard)

Dangerous Content, Undetected Unsignaled Safe Undetected Undetectable Dangerous Undetected

CCF

FPGA Chip/ Board

Self-Revealing Undetectable

No Output/ Incorrect AIM Output Complete AIM failure

Mitigation: Copies of Programming Data (Bobrek and Bouldin, 2010; U.S. NRC, 2010), MTBF for Interconnects (Bobrek and Bouldin, 2010; U.S. NRC, 2010), Data for Configuration memory and P/E cycles (Bobrek and Bouldin, 2010; U.S. NRC, 2010)

RTS/ESFAS: 1oo4 conditions of specific APU/VU outputs according to FTD
Mitigation: Environmental Qualification Procedures (IAEA, 2016; Korash et al., 1998)

RTS/ESFAS: Normal RTS/ESFAS Operation (System Security Compromised)
Mitigation: Secure Lifecycle (Huffmire et al., 2008), Side Channel Attack Counter-measures (Trimberger and Moore, 2014; Huffmire et al., 2010; Mulder et al., 2007), Cyber Security Guide (US NRC, 2010)

RTS/ESFAS: Loss of 1oo4 conditions of specific APU/VU outputs
Mitigation: General and FPGA-Specific Sneak Circuit Analysis (Hahn et al., 1991; Remnant, 2009; European Space Agency, 1997; ESA, 1997)

RTS/ESFAS: Loss of Multiple Division Functions
Mitigation: Diversity and Defence in Depth (McNelles et al., 2015), Requirements from technical standards (IEEE Power and Energy Society, 2016; IEC, 2007, 2012), CCF Analysis (O’Connor and Mosleh, 2016; Kang and Kim, 2012)

Elementary Fault Class (Avizienis et al., 2004)

FF FT: Error Handling (Compensation)

FP FR: (Correction)

FP FT: Error Detection (Concurrent)

FP FR: Verification

FP FF


AIM (International Electrotechnical Commission, 2006)

(Triggered) Periodic Test

Mitigation Method

Domain and Detectability (Avizienis et al., 2004)

Functional Safety Fault Classification (IEC, 2012)


5.2. Sub-Component demonstration

The FMEAs based on the OECD-NEA Taxonomy for both Hardware and HDL Code failures are presented in this section. The process for the Hardware FMEA and FPGA Taxonomy demonstration is discussed in Section 5.2.1, and the process for the HDL Code (Software) FMEA is discussed in Section 5.2.2.

5.2.1. Hardware FMEA and demonstration

The FPGA hardware FMEA and demonstration follow the process outlined in the OECD-NEA taxonomy, which was presented in Section 2.2.3 of this paper (OECD-NEA, 2015). The example used in this paper considers the “Module” level up to the end effect on the RTS/ESFAS. However, since the overall system consists of separate divisions that contain essentially the same I&C units, the end effect of each I&C unit is the effect that is generally considered. The same basic process is applied to the “Sub-Component” level, with the end effects given first for the “Basic Component” level and then up to the full system level (RTS/ESFAS). The process for the hardware sub-components is shown in Tables 18 and 19. In this taxonomy, those end effects are assumed to occur if the mitigation methods are not implemented, fail, or are implemented incorrectly.

Table 18 covers Step 1 of the process. Examples of potential failure modes for the different (sub-component) hardware modules are presented, along with the Failure Effects and Uncovering Situations (as defined by the OECD-NEA taxonomy). Finally, the functional impact of the failure modes on the Basic Component level (the overall FPGA chip) is listed. Table 19 shows Steps 2–4 of the process, including the “Compressed Failure Mode” (the high-level failure mode at the I&C unit level, based on similar uncovering situations and failure effects), examples of failure detection methods, and the potential effect(s) on the AIM. Additionally, the potential top-level effects of the low-level failures are shown.
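As a rough illustration of how sub-component failures might be grouped into compressed failure modes, the sketch below uses a lookup keyed on the uncovering situation. The pairings are an assumption read off the rows of Tables 19 and 21 (for example, “Online Detection” appearing alongside “Monitoring” and “Loss of Function”); the names `COMPRESSION` and `compress` and the example entries are hypothetical, and the sketch is not part of the OECD-NEA procedure itself.

```python
from collections import defaultdict

# Assumed pairing: uncovering situation -> (compressed failure mode, detection means).
# Read off the rows of Tables 19 and 21 for illustration only.
COMPRESSION = {
    "Online Detection": ("Loss of Function", "Monitoring"),
    "Offline Detection": ("Latent Loss of Function", "Periodic Test"),
    "Revealed by Demand (Latent)": ("Latent Loss of Function", "Periodic Test"),
    "Spurious Actuation": ("Spurious Function", "Self-Revealing"),
}


def compress(failure_modes):
    """Group (name, uncovering situation) pairs by compressed failure mode."""
    groups = defaultdict(list)
    for name, uncovering in failure_modes:
        compressed, detection = COMPRESSION[uncovering]
        groups[compressed].append((name, detection))
    return dict(groups)


# Hypothetical sub-component entries, loosely following the examples in the text.
example = [
    ("SEU in register (flagged by EDAC)", "Online Detection"),
    ("SEU in register (not flagged)", "Revealed by Demand (Latent)"),
    ("State machine 'No Exit'", "Online Detection"),
]
result = compress(example)
```

Keying on the uncovering situation keeps the grouping independent of the individual failure mechanism, which reflects the idea that failures with similar uncovering situations and effects collapse into one higher-level mode.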
The AIM was chosen because it was used in the “Basic Component” level example in the OECD-NEA taxonomy, and was used in Table 17 to create the “Basic Component” level FPGA FMEA example. This allows Tables 17–19 to be tied together, and allows the RTS/ESFAS example system used in the OECD-NEA taxonomy to be applied to the FPGA taxonomy.

An example of a hardware failure is a Single Event Upset (SEU) occurring in one of the Registers. This would temporarily invert a memory bit, leading to either a “Plausible” or “Implausible” Non-Fatal failure. The incorrect data would be passed through the chip, causing an “Incorrect Output” at the “Basic Component” level (Table 18). These types of errors can be detected using Error Detection Codes/Error Detection and Correction Codes (EDC/EDAC), so if those methods are included, they may reveal that an SEU has occurred. If these methods fail or were not included, the error may not be seen until there is a “Spurious Actuation”, or until it is “Revealed by Demand (Latent)”. This will lead to an “Incorrect Output” from the AIM (Table 19). Finally, this AIM failure will cause a failure at the “System” level that depends on the exact RTS/ESFAS function in which the failure occurred.

5.2.2. HDL code FMEA and demonstration

The HDL Code FMEA and demonstration followed the same process as the Hardware case outlined in Section 5.2.1. This is reasonable for an FPGA, as any software faults will manifest themselves as hardware logic errors once the HDL code is synthesized and the FPGA is configured. The results are shown in Tables 20 and 21, which demonstrate the “Sub-Component” level of abstraction as it pertains to the RTS/ESFAS test system. This was all based on the structure and template of the OECD-NEA Taxonomy.

It was seen that both hardware failures (Tables 18 and 19) and HDL Code failures (Tables 20 and 21) could cause one or more failures in the AIM, and those AIM failure modes would then have an effect on the total system. The “Sub-Component” level failures would cause a failure at the “Basic Component” level (the FPGA or HDL Code), which would in turn cause a failure at the “Module” level, then the “Unit” level, and finally the “System” level. This is the “Cascading Failure Propagation” discussed in the OECD-NEA Taxonomy. It should be noted that the end effects of the AIM failure modes on the RTS/ESFAS system were taken from the OECD-NEA Taxonomy (OECD-NEA, 2015). As in the case of the hardware failure modes, those end effects are assumed to occur if the mitigation methods are not implemented, fail, or are implemented incorrectly.

An example of an HDL Code failure is an error in a state machine that causes a “No Exit” failure. This would prevent the state machine from transitioning out of that state, so the state machine could become “stuck”. This could be a fatal error, causing “No Output” or “Loss of FPGA Function” at the “Basic Component” level (Table 20). This error may be detected by diagnostic measures such as Watchdog Timers (WDT), or through offline measures such as Periodic Testing. It would likely cause a “Loss of Function” in the AIM, which would then cause a failure of the RTS/ESFAS system (Table 21).
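The “No Exit” and “Unreachable” state machine failure modes listed in Table 20 lend themselves to a simple static check. The following is a minimal sketch, assuming the state machine has already been extracted from the HDL into a transition table; the function `analyse_fsm`, the state names and the table itself are hypothetical, and this is not the verification method used in the paper.

```python
def analyse_fsm(transitions, initial):
    """Return (no_exit, unreachable) state sets for a transition table.

    transitions maps each state to the set of states it can move to.
    A state with no transition to a different state is flagged "No Exit";
    a state never reached from the initial state is flagged "Unreachable".
    """
    states = set(transitions)
    for targets in transitions.values():
        states |= set(targets)

    # "No Exit": every outgoing transition (if any) leads back to the same state.
    no_exit = {s for s in states if not (set(transitions.get(s, ())) - {s})}

    # Depth-first search from the initial state to find reachable states.
    reachable, frontier = {initial}, [initial]
    while frontier:
        state = frontier.pop()
        for nxt in transitions.get(state, ()):
            if nxt not in reachable:
                reachable.add(nxt)
                frontier.append(nxt)

    return no_exit, states - reachable


# Hypothetical state machine; state names are illustrative only.
fsm = {
    "IDLE": {"CHECK"},
    "CHECK": {"IDLE", "TRIP"},
    "TRIP": {"TRIP"},    # self-loop only: no way out once entered
    "RESET": {"IDLE"},   # never entered from any other state
}
no_exit, unreachable = analyse_fsm(fsm, "IDLE")
```

A flagged state is not necessarily a fault (a latched trip state may be intentional), so the output of such a check would still need engineering review. A watchdog timer addresses the same failure mode at run time instead of at design time.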

5.3. FPGA taxonomy modelling demonstration

The new FPGA taxonomy, based on the original OECD-NEA taxonomy, will now be demonstrated on an FPGA-based test system. The OECD-NEA taxonomy provided some examples of fault trees, one of which was recreated and is shown in Fig. 15. It considers a “Spurious Actuation” of one division (Division “X”) in the Emergency Feed Water system (EFW), due to a failure in the Voting Unit (VU) in that division. At that level of abstraction, there is little difference between the fault trees for the FPGA and software-based system taxonomies, so the fault tree must be expanded to include the lower levels of abstraction. Of specific interest is the hardware (HW)-based basic event entitled “HW Module Failure #6”, highlighted in grey. According to the OECD-NEA taxonomy, one of the potential causes of “HW Module Failure #6” is a failure in the AIM. That basic event is then expanded (with a specific focus on the AIM) in the subtrees shown in Fig. 16 and Fig. 17. Figs. 15–17 are located in Appendix B of this paper.

Fig. 16 sets “HW Module Failure #6” as the Top Event, and then proceeds down the levels of abstraction, through the AIM (Unit Level) and the FPGA Chip and HDL Code (Component Level), ending at the individual hardware and software modules (Sub-Component Level). This allows the individual failures of the hardware and software sub-components, as well as their effects on the total system (in this case the Spurious Trip of the EFW), to be modelled. The fault tree shown in Fig. 17 is similar to the fault tree in Fig. 16, except that the basic events are set as the Failure Sets from Fig. 8. This allows the effects of the different Failure Sets to be modelled, as opposed to the failures of the sub-components. The mitigation methods for those failure modes are also listed in the fault trees in Fig. 16 and Fig. 17, to show the effect of a failure of the mitigation methods, or of the mitigation methods not being employed. This represents an improvement on the fault trees from the previous work in Ref. (McNelles, 2016), as those fault trees/DFM models did not include any mitigation methods (apart from TMR).
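To make the role of the mitigation methods in such a fault tree concrete, the sketch below evaluates a small static fault tree in which each sub-component failure propagates only if its mitigation also fails. It assumes independent basic events and uses placeholder probabilities that are not taken from the paper; the function names and the three example events are hypothetical and do not reproduce Figs. 16 and 17.

```python
from math import prod


def or_gate(probs):
    """Probability that at least one independent input event occurs."""
    return 1.0 - prod(1.0 - p for p in probs)


def and_gate(probs):
    """Probability that all independent input events occur."""
    return prod(probs)


def mitigated_event(p_failure, p_mitigation_fails=1.0):
    """Failure propagates only if its mitigation also fails.

    p_mitigation_fails = 1.0 models a mitigation method that is not employed.
    """
    return and_gate([p_failure, p_mitigation_fails])


# Placeholder numbers for illustration only; they are not taken from the paper.
seu_register = mitigated_event(1e-3, p_mitigation_fails=1e-2)  # e.g. TMR/EDAC in place
fsm_no_exit = mitigated_event(1e-4, p_mitigation_fails=1e-1)   # e.g. watchdog timer
sneak_circuit = mitigated_event(1e-5)                          # no mitigation employed

# Any of the sub-component events is assumed sufficient to fail the AIM.
aim_failure = or_gate([seu_register, fsm_no_exit, sneak_circuit])
```

Setting `p_mitigation_fails` to 1.0 for an event reproduces the case where the mitigation method is not employed, which is the assumption under which the end effects in Tables 18–21 are stated.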

Failure Mode Information


Table 20 Sub-Component Level FPGA Taxonomy Demonstration (Step 1) – Software. Failure Effect(s) and Uncovering Situation(s)

Functional Impact

Mitigation Method(s)

Failure Set (FPGA FMEA) (McNelles et al., 2015)

Elementary Fault Class (Avizienis et al., 2004)

Components of HW Modules

Failure Effect (NEA) (OECD-NEA, 2015)

Uncovering Situation (NEA) (OECD-NEA, 2015)

Domain and Detectability (Avizienis et al., 2004)

‘‘BC”

FPGA FMEA AIM (OECD-NEA, 2015)

EFC (Avizienis et al., 2004)

Hang/ Deadlock Signal Delay

Soft Processor

Software Fault

Soft Processor

Fatal (Ordered)

Online Detection

Halt, Signaled

Non-Fatal (Plausible)

Revealed by Demand (Latent)

Timing (Late), Unsignaled

No Output (Loss of FPGA Function) Delayed FPGA Signal Output

Non-Fatal (Plausible)

Revealed by Demand (Latent)

Content, Unsignaled

Incorrect Output

Non-Fatal (Implausible)

Online Detection

Content, Signaled

Incorrect Output

Spurious Actuation

Content, Signaled

Incorrect Output

FP FR: Verification (static and dynamic) FT: Error Detection (Concurrent Detection) FT: (Error Handling: Rollback/ Rollforward)

Fatal (Ordered)

Online Detection

Fatal (Ordered)

Online Detection

No Output or Stuck Output (Fatal, loss of FPGA function)

Fatal (Ordered)

Online Detection

Halt, Signaled Halt, Signaled Halt, Signaled Halt, Unsignaled Content, Unsignaled Content, Signaled Content, Signaled Content, Unsignaled Content, Unsignaled Content, Signaled Content, Unsignaled Content, Unsignaled Timing (Fast or Slow), Unsignaled

No AIM Output Delayed AIM Output Incorrect AIM Output Incorrect AIM Output Incorrect AIM Output No AIM Output Stuck AIM Output

Random/ Unknown Signal

Hang/ Deadlock Endless Loop

State Machine

Software Fault

State Machine

No Exit

Unreachable

Math Errors

Non-Fatal (Plausible)

Logic Errors (HDL)

Software Fault

HDL EF

Logic Errors

Latch

Set-up/Hold Violation

Non-Fatal (Implausible) Non-Fatal (Plausible)

Non-Fatal (Implausible) Non-Fatal (Plausible)

Clock/Timing

Software Fault

Registers

Non-Fatal (Plausible)

Non-Fatal (Plausible)

Revealed by Demand (Latent) Revealed by Demand (Latent) Offline Detection Online Detection Revealed by Demand (Latent) Spurious Actuation Online Detection Revealed by Demand (Latent) Spurious Actuation Revealed by Demand (Latent) Spurious Actuation Revealed by Demand (Latent) Spurious Actuation

Incorrect Output (Non-Fatal)

Incorrect AIM Output

Incorrect Output

Incorrect AIM Output

Incorrect or Delayed Output (Timing Error)

Incorrect AIM Output Delayed AIM Output

FPGA V&V, (IAEA, 2016; IEC, 2012) Software V&V (IEEE Power and Energy Society, 2016; IEC, 2006, 2004)

EFC: FP; FR: Verification (static and dynamic); FT: Error Detection (Concurrent Detection); FT: (Error Handling: Rollback/Rollforward)

Mitigation: HDL Code V&V (IAEA, 2016; Bobrek and Bouldin, 2010; U.S. NRC, 2010), International Standard (IEC, 2012)
EFC: FP; FR: Verification (static and dynamic)

WDT (IEC, 2010, 2012) State Machine Hazard Analysis (IEC, 2010) Return to pre-defined state (Bobrek and Bouldin, 2010; U.S. NRC, 2010)

Implement Registers (Bobrek and Bouldin, 2010; U.S. NRC, 2010; IEC, 2012) Complete logic statements and sensitivity lists latch) (Bobrek and Bouldin, 2010; U.S. NRC, 2010) Static Timing Analysis (Bobrek and Bouldin, 2010; U.S. NRC, 2010) Timing Simulations (IAEA, 2016; IEC, 2012)

FP FR: Verification (static and dynamic)


Failure Mode (McNelles et al., 2015)

Table 20 (continued) Failure Mode Information

Failure Effect(s) and Uncovering Situation(s)

Functional Impact

Mitigation Method(s)

Failure Set (FPGA FMEA) (McNelles et al., 2015)

Elementary Fault Class (Avizienis et al., 2004)

Components of HW Modules

Failure Effect (NEA) (OECD-NEA, 2015)

Uncovering Situation (NEA) (OECD-NEA, 2015)

Domain and Detectability (Avizienis et al., 2004)

‘‘BC”

FPGA FMEA AIM (OECD-NEA, 2015)

FXP Resolution

Input and Data Type

Software Fault

Data Type

Non-Fatal (Plausible) Non-Fatal (Implausible) Non-Fatal (Plausible) Non-Fatal (Implausible) Fatal or NonFatal (Function Dependent)

Online Detection

Incorrect Output

Incorrect AIM Output

Ensure proper calculation/ verification of all FXP values (Taylor, 2012) Input Sanity Check (Bobrek and Bouldin, 2010; U.S. NRC, 2010)

FP FR: Verification (static and dynamic)

Revealed by Demand (Latent) Function Dependent

Content, Signaled Content, Unsignaled Content, Signaled Content, Unsignaled Function Dependent

Function Dependent

Function Dependent

Incorrect AIM Output No AIM Output Function Dependent

Mitigation: FPGA V&V (IAEA, 2016; IEC, 2012), Software V&V (IEEE Power and Energy Society, 2016; IEC, 2006), Obsolescence Management (O’Connor et al., 2016)

Mitigation: Secure Lifecycle (Huffmire et al., 2008), Trusted Tools and IP Cores (Valtion Teknillinen Tutkimuskeskus, 2011; Huffmire et al., 2008)

Mitigation: FPGA Dedication (Jung et al., 2016), Software Dedication (Preckshot and Scott, 1996; Electric Power Research Institute, 2013)

Mitigation: Secure Architecture and comms (Valtion Teknillinen Tutkimuskeskus, 2011; Huffmire et al., 2008), Trusted Tools and IP Cores (Valtion Teknillinen Tutkimuskeskus, 2011; Huffmire et al., 2008)

Mitigation: Secure Lifecycle (Huffmire et al., 2008), Bit stream Protection (Valtion Teknillinen Tutkimuskeskus, 2011; Huffmire et al., 2008), Secure Architecture and comms (Valtion Teknillinen Tutkimuskeskus, 2011; Huffmire et al., 2008), Trusted Tools and IP Cores (Valtion Teknillinen Tutkimuskeskus, 2011; Huffmire et al., 2008)

Mitigation: General and FPGA-Specific Sneak Circuit Analysis (Hahn et al., 1991; Remnant, 2009; European Space Agency, 1997; ESA, 1997)

Mitigation: Diversity and Defence in Depth (McNelles et al., 2015), Requirements from technical standards (IEEE Power and Energy Society, 2016; IEC, 2007, 2012a), CCF Analysis/specification (O’Connor and Mosleh, 2016; Kang and Kim, 2012; Rainer, 2007)

FP FR: Verification (static and dynamic) FP FR: Verification (static and dynamic)

Input Overflow

Revealed by Demand (Latent) Online Detection

IP Core Failure Maintainability

Software Fault

IP Core

Design Tool Subversion

Design Security

Logic/Timing Bombs (SW., Mal., Dev.)

FPGA Logic/ Timing (FPGA Synthesis Tool)

Fatal or NonFatal

Triggered by Demand

Erratic, Unsignaled

Unauthorized access, incorrect outputs or device damage

COTS HDL Code Failure

COTS

Software Fault

FPGA Logic

Fatal or NonFatal (Function Dependent)

Function Dependent

Function Dependent

Function Dependent

Composition Problem

Maintenance Induced

Input Mistakes (SW., NonMal.)

IP Core

Fatal or NonFatal (Function Dependent)

Function Dependent

Function Dependent

Function Dependent

Function Dependent

FPGA Virus

Security Breach

Virus/Worms (SW., Mal., Int.,)

FPGA Logic/ Circuitry

Fatal (Haphazard)

Triggered by Demand

Erratic, Unsignaled

No Output (Chip Damaged/Destroyed)

No AIM Output

Software Sneak Circuit

Sneak Circuit

Software Fault

FPGA Chip Logic Non-Fatal (Plausible or Implausible)

Content, Unsignaled

Incorrect Output

Incorrect AIM Output

Software CCF

CCF

Software Fault

FPGA Logic

Revealed by Demand (Latent) Spurious Actuation Undetectable

Halt or Erratic, Unsignaled

Board level Failure

No AIM Output

Fatal (Haphazard)

EFC (Avizienis et al., 2004)

FP FR: Verification (static and dynamic) FP FR: Verification (static and dynamic)

FP


Failure Mode (McNelles et al., 2015)

FP FR: Verification (static and dynamic) FP FF


Failure Mode Information


Table 21 Sub-Component Level FPGA Taxonomy Demonstration (Steps 2–4) – Software. Failure Detection

Failure End Effect

Mitigation Method

Elementary Fault Class (Avizienis et al., 2004)

Components of HW Modules

Compressed Failure Mode

NEA (International Electrotechnical Commission, 2006) (Detection)

NEA (International Electrotechnical Commission, 2006) (Uncovering)

Domain and Detectability (Avizienis et al., 2004)

AIM (International Electrotechnical Commission, 2006)

RTS/ESFAS (International Electrotechnical Commission, 2006)

FPGA FMEA

Elementary Fault Class (Avizienis et al., 2004)

Soft Processor

Software Fault

Soft Processor

Loss of Function Latent Loss of Function

Monitoring

Online Detection

Halt, Signaled

Delayed Output or No Output (Latent loss of Function/Loss of Function)

Loss of 1oo4 conditions of specific APU/VU outputs

FPGA V&V (IAEA, 2016; IEC, 2012) Software V&V (IEEE Power and Energy Society, 2016; IEC, 2006, 2004)

Halt, Signaled Content, Unsignaled Content, Signaled

No Output or Incorrect Output (Latent loss of Function /Loss of Function)

Loss of 1oo4 conditions of specific APU/VU outputs

WDT (IEC, 2010a, 2012) State Machine Hazard Analysis (IEC, 2010) Return to pre-defined state (Bobrek and Bouldin, 2010; U.S. NRC, 2010) Return to pre-defined state (Bobrek and Bouldin, 2010; U.S. NRC, 2010)

Mitigation: HDL Code V&V (IAEA, 2016; Bobrek and Bouldin, 2010; U.S. NRC, 2010), International Standard (IEC, 2012)

AIM: Delayed or Missed Output (Timing Error)
RTS/ESFAS: Loss of 1oo4 conditions of specific APU/VU outputs
Mitigation: Implement Registers (Bobrek and Bouldin, 2010; U.S. NRC, 2010; IEC, 2012), Complete logic statements and sensitivity lists latch) (Bobrek and Bouldin, 2010; U.S. NRC, 2010), Static Timing Analysis (Bobrek and Bouldin, 2010; U.S. NRC, 2010), Timing Simulations (IAEA, 2016; IEC, 2012)

AIM: Incorrect Output (Loss of Function)
RTS/ESFAS: Loss of 1oo4 conditions of specific APU/VU outputs
Mitigation: Ensure proper calculation/verification of all FXP values (Taylor, 2012), Input Sanity Check (Bobrek and Bouldin, 2010; U.S. NRC, 2010)

AIM: Function Dependent
RTS/ESFAS: 1oo4 conditions of specific APU/VU outputs according to FTD
Mitigation: FPGA V&V (IAEA, 2016; IEC, 2012), Software V&V (IEEE Power and Energy Society, 2016; IEC, 2006), Obsolescence Management (O’Connor et al., 2016)

FP FR: Verification (static and dynamic) FT: Error Detection (Concurrent Detection) FT: (Error Handling: Rollback/Rollforward)

FP FR: Verification (static and dynamic) FT: Error Detection (Concurrent Detection) FT: (Error Handling: Rollback/Rollforward)

FP FR: Verification (static and dynamic)

State Machine

Logic Errors (HDL)

Clock/Timing

Input and Data Type

Software Fault

Software Fault

Software Fault

Software Fault

Maintainability Software Fault

State Machine

HDL EFs

Registers

Data Type

IP Core

Periodic Test Revealed by Demand (Latent)

Timing, Unsignaled Content, Unsignaled Content, Unsignaled

Spurious Function

Self-Revealing

Spurious Actuation

Loss of Function Latent Loss of Function

Monitoring

Online Detection

Periodic Test

Revealed by Demand (Latent) Offline Detection

Latent Loss of Function Spurious Function

Periodic Test

Online Detection

Self-Revealing

Revealed by Demand (Latent)

Content, Signaled Content, Unsignaled

Periodic Test Self-Revealing

Revealed by Demand (Latent) Spurious Actuation

Timing (Fast or Slow), Unsignaled

Monitoring

Online Detection

Periodic Test

Revealed by Demand (Latent)

Content, Signaled Content, Unsignaled

Function Dependent

Function Dependent

Latent Loss of Function Spurious Function

Loss of Function Latent Loss of Function Function Dependent

Function Dependent

Incorrect Output Loss of 1oo4 (Loss of Function) conditions of specific APU/VU outputs

FP FR: Verification (static and dynamic)

FP FR: Verification (static and dynamic) FP FR: Verification (static and dynamic)


Failure Set (FPGA FMEA) (McNelles et al., 2015)

Table 21 (continued) Failure Mode Information

Failure Detection

Failure End Effect

Mitigation Method FPGA FMEA

Compressed Failure Mode

NEA (International Electrotechnical Commission, 2006) (Detection)

NEA (International Electrotechnical Commission, 2006) (Uncovering)

Domain and Detectability (Avizienis et al., 2004)

AIM (International Electrotechnical Commission, 2006)

RTS/ESFAS (International Electrotechnical Commission, 2006)

Design Security Software Fault

FPGA Logic/ Timing (FPGA Synthesis Tool)

Latent Loss of Function Loss of Function

Undetectable

Triggered by Demand

Erratic, Unsignaled

Incorrect Output or No Output (Once triggered)

COTS

Software Fault

FPGA Logic

Function Dependent

Function Dependent

Function Dependent

Function Dependent

Function Dependent

Maintenance Induced

Software Fault

IP Core

Function Dependent

Function Dependent

Function Dependent

Function Dependent

Function Dependent

Security Breach Software Fault

FPGA Logic/ Circuitry

Loss of Function Latent Loss of Function

Undetectable

Triggered by Demand

Erratic, Unsignaled

Damage/ Destruction of AIM

Sneak Circuit

Software Sneak Circuit; Latent Loss of Function; Spurious Function

Software CCF; Latent Loss of Function

Periodic Test

Revealed by Demand (Latent) Spurious Actuation

Content, Unsignaled

Incorrect Output

Halt or Erratic, Unsignaled

Board level Failure

Complete AIM failure

Mitigation: Secure Lifecycle (Huffmire et al., 2008), Trusted Tools and IP Cores (Valtion Teknillinen Tutkimuskeskus, 2011; Huffmire et al., 2008)

RTS/ESFAS: 1oo4 conditions of specific APU/VU outputs according to FTD
Mitigation: FPGA Dedication (Jung et al., 2016), Software Dedication (Preckshot and Scott, 1996; Electric Power Research Institute, 2013)

RTS/ESFAS: 1oo4 conditions of specific APU/VU outputs according to FTD
Mitigation: Secure Architecture and comms (Valtion Teknillinen Tutkimuskeskus, 2011; Huffmire et al., 2008), Trusted Tools and IP Cores (Valtion Teknillinen Tutkimuskeskus, 2011; Huffmire et al., 2008)

RTS/ESFAS: Loss of 1oo4 conditions of specific APU/VU outputs
Mitigation: Secure Lifecycle (Huffmire et al., 2008), Bit stream Protection (Valtion Teknillinen Tutkimuskeskus, 2011; Huffmire et al., 2008), Secure Architecture and comms (Valtion Teknillinen Tutkimuskeskus, 2011; Huffmire et al., 2008), Trusted Tools and IP Cores (Valtion Teknillinen Tutkimuskeskus, 2011; Huffmire et al., 2008)

RTS/ESFAS: Loss of 1oo4 conditions of specific APU/VU outputs
Mitigation: General and FPGA-Specific Sneak Circuit Analysis (Hahn et al., 1991; Remnant, 2009; European Space Agency, 1997; ESA, 1997)

RTS/ESFAS: Loss of Multiple Division Functions
Mitigation: Diversity and Defence in Depth (McNelles et al., 2015), Requirements from technical standards (IEEE Power and Energy Society, 2016; IEC, 2007, 2012), CCF Analysis/specification (O’Connor and Mosleh, 2016; Kang and Kim, 2012; Rainer, 2007)

CCF

Elementary Fault Class (Avizienis et al., 2004)

Software Fault

Software Fault

Self-Revealing Undetectable

Loss of 1oo4 conditions of specific APU/VU outputs

Elementary Fault Class (Avizienis et al., 2004) FP FR: Verification (static and dynamic)

FP FR: Verification (static and dynamic) FP FR: Verification (static and dynamic)

FP

FP FR: Verification (static and dynamic) FP FF


Components of HW Modules

Failure Set (FPGA FMEA) (McNelles et al., 2015)


The fault trees shown in Figs. 15–17 may be simplistic compared with those of an actual RTS/ESFAS. However, they serve the purpose of showing how failures identified by the FPGA taxonomy, down to the sub-component level, affect the overall system. Furthermore, the FPGA Taxonomy is not limited to use with fault trees, and could be applied to both static and dynamic reliability assessment methodologies.

6. Conclusion

In order to ensure that FPGA-based systems would be reliable and functionally safe, an extensive FMEA was performed in the preliminary work by compiling and categorizing FPGA failure modes. The FMEA results are expanded on in this paper. The OECD-NEA Failure Modes Taxonomy was applied to the FPGA FMEA data to create an FPGA Taxonomy, following the future work recommendations in that OECD-NEA document, to fill an important area of need. To showcase the differences between FPGA-based systems and software-based systems, an additional level of abstraction, termed the “Sub-Component” level, was proposed. This level includes the individual failures of the FPGA chip and HDL code, which can potentially cause failures in the basic components of a digital I&C system. The procedure from the OECD-NEA Taxonomy was applied to create the taxonomy of the FPGA sub-component failures and uncovering effects, which were then built up through all levels of abstraction to demonstrate the effects of sub-component failures at the FPGA system level.

This FPGA taxonomy discussed several methods that were used to categorize/model failure modes for digital systems. The FPGA taxonomy is intended for use in the design and analysis of FPGA-based I&C systems for NPPs. It would serve as a benchmark for evaluating the failure mode analysis/hazard analysis of FPGA-based systems. This new taxonomy provides information regarding the stage in the lifecycle at which the failure occurs, whether the failure effects are fatal or non-fatal, the uncovering/detection methods for the failure, and how failures at the FPGA/HDL level will affect the overall I&C system. This allows failure modes to be identified and avoided in the “Design/Development” stage, and residual failure modes to be mitigated in the “Operation” stage of the system lifecycle. The use of the FPGA taxonomy in the hazard analysis of FPGA-based systems provides a basis for the safety and engineering decisions made during the design and review of those systems.

Future work on this project would include the application of the FPGA Taxonomy to specific FPGA test systems, to provide qualitative and quantitative analyses. Additionally, comparisons between the analyses from different reliability methodologies using the FPGA Taxonomy could be performed.

Acknowledgement

The funding for this research program was provided by the Canadian Nuclear Safety Commission (CNSC). Additional funding via a PhD scholarship was provided by the Canadian Nuclear Society (CNS).

Appendix A. Failure Mode Definitions (McNelles et al., 2015)

Failure | Definition
AO (Arithmetic Overflow) | Loss of data caused by (data) overflow after a mathematical operation
ESD (Electrostatic Discharge) | Damage to the FPGA chip or board due to discharge of static electricity
EOS (Electrical Overstress) | Damage to the FPGA chip or board due to inadequate electrical protection
EM (Electromigration) | Thermal-mechanical stress resulting in the destruction of interconnects
FXP (Fixed Point) | Fixed Point Data Type (errors can be caused by round-off or insufficient resolution)
HCE (Hot Carrier Effects) | Timing/clock cycle failure due to change in threshold voltage
Meta (Metastability) | Oscillations between "0" and "1" cause an "Unknown" value
NBTI (Negative Bias Temperature Instability) | Timing/clock cycle failure due to change in threshold voltage
P/E (Program/Erase) | P/E cycle refers to the reprogramming of an FPGA. This is not a failure mode by itself, but Flash-based FPGAs can withstand a limited number of P/E cycles.
TDDB (Time Dependent Dielectric Breakdown) | Destruction of logic gates and look-up tables due to dielectric breakdown
SED (Single Event Disturb) | Temporary information corruption in a memory element
SEDB (Single Event Dielectric Breakdown) | Rupture of the dielectric material in gates
SEGR (Single Event Gate Rupture) | Rupture of the dielectric material in gates
SEU (Single Event Upset) | Temporary information corruption in a memory element
SHE (Single Hard Error) | Permanent state change in memory element
SM (Stress Migration) | Thermal-mechanical stress resulting in the destruction of interconnects
TC (Thermal Cycling) | Destruction of logic gates and look-up tables due to temperature cycling

Appendix B. Fault Trees

P. McNelles et al. / Annals of Nuclear Energy 108 (2017) 198–228

Fig. 15. OECD-NEA Taxonomy Fault Tree for a spurious division-X ‘‘EFW-OFF” Event.


Fig. 16. Fault Tree For ‘‘HW Module #6” (Sub-Component Level).

Fig. 17. Fault Tree For ‘‘HW Module #6” (Sub-Component Level) Using Failure Categories.
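
Fault trees such as those in this appendix combine basic events through logic gates up to a top event. As a rough illustration of how such a tree can be evaluated, the sketch below computes a top-event probability under the simplifying assumptions that only AND and OR gates are used and that all basic events are statistically independent. The tree structure, event names, and probabilities are hypothetical and are not taken from Figs. 15 to 17.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class BasicEvent:
    name: str
    probability: float  # hypothetical value, not from the paper


@dataclass
class Gate:
    name: str
    kind: str  # "AND" or "OR"
    inputs: List[Union["Gate", BasicEvent]]


def probability(node: Union[Gate, BasicEvent]) -> float:
    """Probability of a node, assuming independent basic events."""
    if isinstance(node, BasicEvent):
        return node.probability
    child_probs = [probability(child) for child in node.inputs]
    if node.kind == "AND":
        # All inputs must occur: product of the input probabilities.
        result = 1.0
        for p in child_probs:
            result *= p
        return result
    if node.kind == "OR":
        # At least one input occurs: 1 minus the chance that none occurs.
        none_occur = 1.0
        for p in child_probs:
            none_occur *= 1.0 - p
        return 1.0 - none_occur
    raise ValueError(f"unknown gate kind: {node.kind}")


def example_module(label: str) -> Gate:
    """Hypothetical module that fails if its chip OR its HDL code fails."""
    return Gate(
        name=f"module {label} fails",
        kind="OR",
        inputs=[
            BasicEvent(f"chip failure {label}", 0.1),
            BasicEvent(f"HDL code failure {label}", 0.2),
        ],
    )


# Hypothetical top event: both redundant modules fail.
EXAMPLE_TOP = Gate("both modules fail", "AND", [example_module("A"), example_module("B")])

if __name__ == "__main__":
    print(f"Top event probability: {probability(EXAMPLE_TOP):.4f}")
```

The independence assumption breaks down when the same basic event feeds several branches or when common cause failures are present; in those cases a minimal cut set or similar method would be needed instead of this direct gate-by-gate product.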


References

Actel, Reliability Considerations for Automotive FPGAs, San Jose, USA, 2003.
Actel. Electro-static discharge. California: Mountain View; 2005.
AP1000 Design Control Document (Revision 15), Chapter 7: Instrumentation and Controls, Westinghouse, http://pbadupws.nrc.gov/docs/ML1117/ML11171A500.html.
Avizienis, A., Laprie, J.C., Randell, B., Landwehr, C., 2004. Basic concepts and taxonomy of dependable and secure computing. IEEE Trans. Dependable Secure Comput. 1 (1), 11–33. http://dx.doi.org/10.1109/TDSC.2004.2.
Benfica, Juliano et al., 2016. Analysis of SRAM-Based FPGA SEU sensitivity to combined EMI and TID-imprinted effects. IEEE Trans. Nucl. Sci. 63, 1294–1300. http://dx.doi.org/10.1109/TNS.2016.2523458.
Bobrek, M., & Bouldin, D., Review Guidelines for FPGAs in NPP Safety Systems, Oak Ridge, Tennessee, 2010. ORNL.
Brombacher, A.C., van Beurden, I.W.R.J., 1999. RIFIT: analyzing hardware and software in safeguarding system. Reliab. Eng. Syst. Saf. 66, 149–156. http://dx.doi.org/10.1016/S0951-8320(99)00032-0.
CSA Group, CSA N290.15, Qualification of digital hardware and software for use in instrumentation and control applications for nuclear power plants, Toronto, Canada, 2015.
Digital Instrumentation and Controls Working Group (DICWG), DICWG-05, "Common Position on the Treatment of Hardware Description Language (HDL) Programmed Devices for use in Nuclear Safety Systems", Issy-les-Moulineaux, France, 2013.
Electric Power Research Institute (EPRI), TR-1019181, Guidelines on the Use of Field Programmable Gate Arrays (FPGAs) in Nuclear Power Plant I&C System. Palo Alto, California, 2009.
Electric Power Research Institute (EPRI), TR-1025243, Plant Engineering: Guideline for the Acceptance of Commercial Grade Design and Analysis Computer Programs Used in Nuclear Safety-Related Applications, Palo Alto, California, 2013.
EPRI, TR-1022983, Recommended Approaches and Design Criteria for Application of Field Programmable Gate Arrays in Nuclear Power Plant Instrumentation and Control Systems, Palo Alto, California, 2011.
ESA, ECSS-Q-40-04A Part 2, Sneak Analysis – Part 2: Clue list, Noordwijk, The Netherlands, 1997.
European Space Agency (ESA), ECSS-Q-40-04A Part 1, Sneak Analysis – Part 1: Method and Procedure, Noordwijk, The Netherlands, 1997.
Habinc, Sandi, Lessons Learned from FPGA Developments, Gaisler Research, Göteborg, Sweden, 2002.
Hadžić, Ilija; Udani, Sanjay; and Smith, Jonathan M., "FPGA Viruses" (1999). Technical Reports (CIS). Paper 94. http://repository.upenn.edu/cis_reports/94
Hahn, Heidi A., Blackman, Harold S., Gertman, David I., 1991. Applying sneak circuit analysis to the identification of human errors of commission. Reliab. Eng. Syst. Saf. 33, 289–300. http://dx.doi.org/10.1016/0951-8320(91)90065-F.
Huffmire, Ted, Brotherton, Brett, Sherwood, Timothy, Kastner, Ryan, Levin, Timothy, Nguyen, Thuy D., Irvine, Cynthia, 2008. Managing security in FPGA-based embedded systems. IEEE Des. Test Comput. 25, 590–598. http://dx.doi.org/10.1109/MDT.2008.166.
Huffmire, Ted, Irvine, Cynthia, Nguyen, Thuy D., Levin, Timothy, Kastner, Ryan, Sherwood, Timothy, 2010. Handbook of FPGA Design Security. Springer, New York, USA.
Hwang, Inseok, Kim, Sungwan, Kim, Youdan, Seah, Chze Eng, 2010. A survey of fault detection, isolation and reconfiguration methods. IEEE Trans. Control Syst. Technol. 18, 636–653. http://dx.doi.org/10.1109/TCST.2009.2026285.
IAEA, Technical Challenges in the Application and Licensing of Digital Instrumentation and Control Systems in Nuclear Power Plants, Vienna, Austria, 2015.
IAEA, Application of Field Programmable Gate Arrays in Instrumentation and Control Systems of Nuclear Power Plants, Vienna, Austria, 2016.
IEC, 62138, Nuclear power plants - Instrumentation and control systems important to safety - Software aspects for computer-based systems performing category B and C functions, Geneva, 2004.
IEC, 60880, Nuclear power plants - Instrumentation and control systems important to safety - Software aspects for computer-based systems performing category A functions, Geneva, 2006.
IEC, 62340, Nuclear power plants - Instrumentation and control systems important to safety - Requirements for coping with common cause failure (CCF), Geneva, 2007.
IEC, 61508-2, Functional Safety of electrical/electronic/programmable electronic safety related systems - Part 2: Requirements for electrical/electronic/programmable electronic safety related systems, Geneva, 2010.
IEC, 61508-7, Functional Safety of electrical/electronic/programmable electronic safety related systems - Part 7: Overview of techniques and measures, Geneva, 2010.
IEC, 61508-4, Functional Safety of electrical/electronic/programmable electronic safety related systems - Part 4: Definitions and Abbreviations, Geneva, 2010.
IEC, 62566, Development of HDL-programmed integrated circuits for systems performing category A functions, Geneva, 2012.
IEC, 61131-6, Programmable Controllers – Part 6: Functional Safety, Geneva, 2012.
IEEE Power and Energy Society, IEEE Std 7.4.3.2, IEEE Standard Criteria for Programmable Digital Devices in Safety Systems of Nuclear Power Generating Stations, New York, New York, 2016, IEEE.
International Electrotechnical Commission (IEC), 60812, Analysis techniques for system reliability – Procedures for failure mode and effects analysis (FMEA), Geneva, 2006.


JEDEC Solid State Technology Association, JEP-122G, Failure Mechanisms and Models for Semiconductor Devices, Arlington, Virginia, 2011.
Jung, Kil-Young, Roh, Myung-Sub, 2017. A Study for an Appropriate Risk Management of New Technology Deployment in Nuclear Power Plants. Ann. Nucl. Energy, 157–164. http://dx.doi.org/10.1016/j.anucene.2016.08.013.
Jung, Sejin, Eui-Sub, Kim, Yoo, Junbeom, Kim, Jang-Yeol, Choi, Jong Gyun, 2016. An evaluation and acceptance of COTS software for FPGA-based controllers in NPPs. Ann. Nucl. Energy 94, 338–349. http://dx.doi.org/10.1016/j.anucene.2016.03.026.
Kang, Hyun Gook, Kim, Hee Eun, 2012. Unavailability and spurious operation probability of k-out-of-n reactor protection systems in consideration of CCF. Ann. Nucl. Energy 49, 102–108. http://dx.doi.org/10.1016/j.anucene.2012.06.012.
Khalaquzzaman, M. et al., 2010. A model for estimation of reactor spurious shutdown rate considering maintenance human errors in reactor protection system of nuclear power plant. Nucl. Eng. Des. 240, 2693–2971. http://dx.doi.org/10.1016/j.nucengdes.2010.05.031.
Korash, K., Hassan, M., Tanaka, T. J., and Wood, R.T., NUREG/CR-6479, Technical Basis for Environmental Qualification of Microprocessor-Based Safety-Related Equipment in Nuclear Power Plants, Washington, DC, 1998.
Kretzschmar, U., Gomez-Cornejo, J., Astarioa, A., Bidarte, U., Del Ser, J., 2016. Synchronization of faulty processors in coarse-grained TMR protected partially reconfigurable FPGA designs. Reliab. Eng. Syst. Saf. 191, 1–9. http://dx.doi.org/10.1016/j.ress.2015.12.018.
Lu, Jun-Jen, Hsu, Teng-Chieh, Chou, Hwai-Pwu, 2015a. System assessment of an FPGA-Based RPS for ABWR nuclear power plant. Prog. Nucl. Energy 85, 44–55. http://dx.doi.org/10.1016/j.pnucene.2015.05.010.
Lu, Jun-Jen, Huang, Hsuan-Han, Chou, Hwai-Pwu, 2015b. Evaluation of an FPGA-based fuzzy logic control of feed-water ABWR under automatic power regulating. Prog. Nucl. Energy 79, 22–31. http://dx.doi.org/10.1016/j.pnucene.2014.10.010.
McNelles, P. et al., 2016. A comparison of fault trees and the dynamic flowgraph methodology for the analysis of FPGA-based safety systems part 1: Reactor trip logic loop reliability analysis. Reliab. Eng. Syst. Saf. http://dx.doi.org/10.1016/j.ress.2016.04.014.
McNelles, P., Lu, L., A review of the current state of fpga systems in nuclear instrumentation and control. In: Proceedings of the 21st International Conference on Nuclear Engineering, Chengdu, July 30–August 2, 2013. ICONE. DOI: 10.1115/ICONE21-16819.
McNelles, P., Zeng, Z.C., Renganathan, G., 2015, Modelling of field programmable gate array based nuclear power plant safety systems part i: failure mode and effects analysis. In: Proceedings of the 7th International Conference on Modelling and Simulation in Nuclear Science and Engineering, Ottawa, Canada, 2015.
Menon, Catherine, Guerra, Sofia, 2015. Field Programmable Gate Arrays in Safety Related Instrumentation and Control Applications. Energiforsk, Sweden.
Monmasson, Eric, Cirstea, Marcian N., 2007. FPGA design methodology for industrial control systems – A review. IEEE Trans. Industr. Electron. 54, 1824–1842. http://dx.doi.org/10.1109/TIE.2007.898281.
Mossman, Tim, IDSRS Chapter 7, Appendix A: I&C Perspectives on Hazard Analysis (HA), Washington, DC, 2013, U.S. NRC.
Mulder, E.De., Ors, S.B., Preneel, B., Verbauwhede, I., 2007. Differential power and electromagnetic attacks on an FPGA implementation of elliptic curve cryptosystems. Comput. Electr. Eng. 33, 367–382. http://dx.doi.org/10.1016/j.compeleceng.2007.05.009.
Mutuel, L.H., Appreciating the effectiveness of single event effect mitigation techniques. In: Proceedings of the 33rd Digital Avionics Systems Conference, Colorado Springs, 2014, SB1-11, IEEE, http://dx.doi.org/10.1109/DASC.2014.6979481.
Mutuel, L.H., 2016. Single Event Effect Mitigation Techniques. Federal Aviation Administration (FAA), New Jersey, USA.
National Aeronautics and Space Administration (NASA), JPL 08-5, Microelectronics Reliability: Physics-of-Failure Based Modeling and Lifetime Evaluation. Pasadena, California, 2008.
O'Connor, Andrew, Mosleh, Ali, 2016. A general cause based methodology for analysis of common cause and dependant failures in system risk and reliability assessments. Reliab. Eng. Syst. Saf. 145, 341–350. http://dx.doi.org/10.1016/j.ress.2015.06.007.
O'Connor, Matthew, Geddes, Bruce, Kelley, Sean, 2016. Guidance and methodologies for managing digital instrumentation and control obsolescence. J. Nucl. Eng. Radiat. Sci. http://dx.doi.org/10.1115/1.4034287.
OECD-NEA, Recommendations on Assessing Digital System Reliability in Probabilistic Risk Assessment of Nuclear Power Plants, France, 2009.
OECD-NEA, Failure Modes Taxonomy for Reliability Assessment of Digital I&C Systems for PRA, France, 2015.
Preckshot, G. G, and Scott, J. A., NUREG/CR-6241, A Proposed Acceptance Process for Commercial Off-The-Shelf (COTS) Software in Reactor Applications, USNRC, Washington, DC, 1996.
Rainer, Faller, 2007. Specification of a Software Common Cause Analysis Method, SAFECOMP 2007. Springer-Verlag, Berlin, pp. 162–171.
Remnant, M. The Application Of Sneak Analysis to Safety Critical FPGA, University of York, York, U.K., 2009. E-DOCS-#4638251
Salewski, Falk and Taylor, Adam, "Fault Handling in FPGAs and Microcontrollers in Safety-Critical Embedded Applications: A Comparative Survey". In: 10th Euromicro Conference on Digital System Design Architectures, Methods and Tools (DSD 2007), Lubeck, Germany, August 29-31, 2007, pp. 124-131. DOI: 10.1109/DSD.2007.4341459
Scheik, Leif, 2008. Testing Guidelines for Single Event Gate Rupture (SEGR) of Power MOSFETs. NASA, Pasadena, California.
Smith, Farouk, 2012. Single event upset mitigation by means of a sequential circuit state freeze. Microelectron. Reliab. 52, 1233–1240. http://dx.doi.org/10.1016/j.microrel.2011.11.019.
Srinivasan, S. et al., 2008. Toward increasing FPGA lifetime. IEEE Trans. Dependable Secure Comput. 5, 115–127. http://dx.doi.org/10.1109/TDSC.2007.70235.
Taylor, A. "The Basics of FPGA Mathematics." Xcell Journal (2012, Third Quarter): pp. 44-49.
Titus, Jeffrey L., 2013. An Updated Perspective of Single Event Gate Rupture and Single Event Burnout in Power MOSFETs, "IEEE Transactions on Nuclear Science", 60, pp. 1912-1928. DOI: 10.1109/TNS.2013.2252194
Trimberger, Stephen M., Moore, Jason J., 2014. FPGA security: motivations, features and applications. Proc. IEEE 102, 1248–1265. http://dx.doi.org/10.1109/JPROC.2014.2331672.
U.S. NRC, NUREG-7006, Review Guidelines for Field Programmable Gate Arrays in Nuclear Power Plant Safety Systems, Washington D.C., 2010.
US NRC, Regulatory Guide 5.71, Cyber Security Programs for Nuclear Facilities, Washington, DC, 2010.
Valtion Teknillinen Tutkimuskeskus (VTT), 2011. The current state of FPGA technology in the nuclear domain. Vuorimiehentie, Finland.
Wang, Xin., Holber, Keith E., Clark, Lawrence T., 2011. Single event upset mitigation techniques for FPGAs utilized in nuclear power plant digital instrumentation and control. Nucl. Eng. Des. 341, 3317–3324. http://dx.doi.org/10.1016/j.nucengdes.2011.06.033.
Wu, Yichan et al., 2016. Development, verification and validation of an FPGA-based core heat removal protection system for a PWR. Nucl. Eng. Des. 301, 311–319. http://dx.doi.org/10.1016/j.nucengdes.2016.03.018.
Xilinx, WP433, "Understanding and Mitigating System-Level ESD and EOS Events in Xilinx 7 Series Device", San Jose, California, 2013.

Further reading

Kastensmidt, F., Rech, P., 2015. FPGAs and Parallel Architectures for Aerospace Applications: Soft Errors and Fault-Tolerant Design. Springer, Cham, Switzerland, p. 129.
