Microelectronics Reliability 44 (2004) 1017–1028 www.elsevier.com/locate/microrel
FPGA-based Monte Carlo simulation for fault tree analysis Alireza Ejlali *, Seyed Ghassem Miremadi Department of Computer Engineering, Sharif University of Technology, P.O. Box 11365–9517 Azadi Ave., Tehran, Iran Received 23 December 2003; received in revised form 28 January 2004 Available online 21 April 2004
Abstract
The reliability analysis of critical systems is often performed using fault-tree analysis. Fault trees are analyzed using either analytic approaches or Monte Carlo simulation. The analytic approaches are limited to a few models and certain kinds of distributions. In contrast, Monte Carlo simulation is broadly applicable, but it is time-consuming because an extremely large number of simulated samples may be needed to estimate the reliability parameters at a high level of confidence. This paper presents a tree model, called the time-to-failure tree, which can be used to accelerate the Monte Carlo simulation of fault trees. The time-to-failure tree of a system shows the relationship between the time to failure of the system and the times to failure of its components. Static and dynamic fault trees can be easily transformed into time-to-failure trees. Each time-to-failure tree can be implemented as a pipelined digital circuit, which can be synthesized to a field programmable gate array (FPGA). In this way, Monte Carlo simulation can be significantly accelerated. The performance analysis of the method shows that the speed-up grows with the size of the fault trees. Experimental results for some benchmark fault trees show that this method can be about 471 times faster than software-based Monte Carlo simulation.
© 2004 Elsevier Ltd. All rights reserved.
1. Introduction
Fault trees [1–3] provide a compact, graphical and intuitive method to analyze system reliability. They have been found to be the most popular choice for building an analytical model of a system and can be easily specified and understood by humans [4]. Fault trees are divided into two categories: static and dynamic. Static fault trees use Boolean AND, OR and M-out-of-N gates to represent how component failures lead to system failures [5,6]. Dynamic fault trees [7] add a special set of dynamic gates (such as FDEP, PAND, SEQ, Cold Spare, Hot Spare, Warm Spare) to the three static gates (AND, OR, and M-out-of-N gates) to model sequential dependency [5]. The complex redundancy-management techniques typically used in fault-tolerant systems (e.g. prioritized use of spares) cannot be captured in static fault trees, but they can be effectively captured in dynamic fault trees [5].

Static fault trees are commonly solved in two ways: (1) by converting them to an equivalent BDD [4,5,8,9], or (2) by using cut sets [1,10]. The cut-set approaches are generally inferior to the newer BDD-based approaches [11]. Dynamic fault trees can be solved by conversion to an equivalent Markov model [5,6,12]. Another approach, called the modular approach, divides a fault tree into independent sub-trees and combines a BDD solution for static sub-trees with a Markov-model solution for dynamic sub-trees [4,5,13].

Abbreviations: FTA, fault-tree analysis; MCS, Monte Carlo simulation; TTF, time-to-failure; FDEP, functional dependency gate; PAND, priority AND gate; SEQ, sequence enforcing gate; BDD, binary decision diagram; FPGA, field programmable gate array
* Corresponding author: A. Ejlali.
0026-2714/$ - see front matter 2004 Elsevier Ltd. All rights reserved. doi:10.1016/j.microrel.2004.01.016
MCS presents a viable alternative to the analytic approaches (BDD and Markov model) in several interesting situations. Compared with MCS, the analytic approaches are usually fast and computationally cheap. However, they are limited to a few models and certain kinds of distributions, and they are not suitable when the parameters are correlated [5,14]. MCS, in contrast, is broadly applicable, but it is limited by its intensive computation [14,15]. Because the events of interest (such as failures and errors) are rare, an extremely large number of simulated samples may be needed to obtain estimates at a high level of confidence. MCS is therefore time-consuming.

One method to reduce the number of runs in MCS is a probabilistic modeling technique called importance sampling [15,16]. Although this technique has long been known [16], its use has been limited. Importance sampling requires the user to specify certain biasing values. In the general application of importance sampling, the best assignment of these biasing values is difficult to derive, and the success of the technique depends critically on the correct choice. This problem has been an important factor in limiting the use of importance sampling [15].

Another way to accelerate MCS is to use FPGAs. An FPGA is a reconfigurable chip whose hardware can be configured by the user in the field. FPGAs are increasingly used for accelerating time-consuming computations [35]. Several works use FPGAs to accelerate MCS for specific problems, such as MCS of electron dynamics in semiconductors [17], MCS of statistical physics models [18], and MCS for calculating the energy of dipolar systems [19]. This paper presents the TTF tree [20], a tree model for reliability modeling, in order to accelerate the MCS of fault trees using FPGAs.
The TTF-tree model is a method for representing the mathematical relation between a system TTF and the TTFs of its components. In other words, a TTF tree receives the TTFs of the components and computes the TTF of the whole system. TTF trees can be implemented as pipelined digital circuits, which can be synthesized to FPGA chips. Therefore, MCS can be significantly accelerated. Importance sampling can also be used to reduce the number of inputs that are applied to a TTF tree, which results in still higher performance. Both dynamic and static fault trees can be easily converted into TTF trees using a simple replacement operation, in which each fault-tree gate is replaced with its corresponding TTF-tree unit.

This paper extends the work in [20] by adding the following investigations:
• The TTF-tree model and its components are studied in more detail.
• The corresponding TTF-tree units for the dynamic fault-tree gates, which were not discussed in [20], are presented.
• Some of the important points about the hardware implementation of TTF trees are discussed (e.g., number representation).
• An analytical performance estimation of the method is presented.
• The TTF-tree model is experimentally evaluated using the benchmark fault trees presented in [21].

Section 2 describes how TTF trees can model non-fault-tolerant systems. Section 3 introduces the TTF-tree units to model fault-tolerant systems. Section 4 describes how the coverage factor can be incorporated into the TTF-tree model. Section 5 describes some important issues about the hardware implementation of TTF trees. Section 6 presents an analytical performance estimation of the presented method. The experimental evaluation of the TTF-tree model using the benchmark fault trees presented in [21] is discussed in Section 7. Finally, Section 8 discusses the work.

1.1. Assumptions
1. The components of the system cannot be repaired or maintained.
2. The fault trees are coherent (i.e. they do not have NOT logic).

These assumptions are justified as follows.
(1) Safety systems and protection systems can experience two phases of operation: a standby phase and a demand phase. The standby phase can last for a long time, during which the safety system is periodically tested and maintained. Once a demand occurs, the safety system must operate successfully for the length of the demand. The system is non-maintainable during the demand phase (i.e. the active components cannot be repaired or maintained during demand) [22]. The standby phase requires an unavailability (availability) analysis, while the demand phase requires an unreliability (reliability) analysis. In this paper only the unreliability (reliability) analysis of the system during the demand phase is considered.
(2) Fault trees are commonly coherent (i.e. they do not have NOT logic). This is because a system would be quite unusual (or poorly designed) if replacing a failed component by a functioning component caused the whole system to fail [23].
2. TTF trees for non-fault-tolerant systems
This section presents the building units of TTF trees that correspond to the static fault-tree gates and describes how static fault trees can be converted to the corresponding TTF trees. Static fault trees (composed of AND, OR, and M-out-of-N gates) are a popular modeling choice for reliability analysis of non-fault-tolerant systems.

The fault-tree model for a series system is simply an OR gate. In a series system, each element of the system is required to operate correctly for the system to operate correctly. Therefore, the TTF of a series system is equal to the minimum of the components' TTFs. For this reason, the TTF-tree model for a series system is a MIN unit. Fig. 1 shows how the MIN unit in the TTF-tree model corresponds to the OR gate in the fault-tree model. In this figure $T_1$ and $T_2$ are the TTFs of the components and $T_S$ is the TTF of the whole system. Also, $X_1$ and $X_2$ are the Boolean fault indicators of the components and $X_S$ is the Boolean fault indicator of the system.

Fig. 1. The MIN unit computes the TTF of a series system.

The fault-tree model for a parallel system is simply an AND gate. In a parallel system, only one of several elements must be operational for the system to perform its function correctly. Therefore, the TTF of a parallel system is equal to the maximum of the components' TTFs. For this reason, the TTF-tree model for a parallel system is a MAX unit. Fig. 2 shows how the MAX unit in the TTF-tree model corresponds to the AND gate in the fault-tree model.

The fault-tree model for an M-out-of-N:F system is simply an M-out-of-N gate. An M-out-of-N:F system is failed if and only if at least M of its N components are failed. Therefore, if $T_1, T_2, \ldots, T_N$ is the list of the component TTFs sorted in increasing order, then the TTF of the whole system is $T_M$. In this way, an M-out-of-N unit (shown in Fig. 3), the TTF-tree model for an M-out-of-N:F system, computes the TTF of the whole system. Note that an M-out-of-N unit need not sort the input TTFs; it only needs to find the Mth smallest TTF.

Fig. 3. (a) 2-out-of-3 gate (b) 2-out-of-3 unit which computes the TTF of a 2-out-of-3:F system.

Static fault trees can be easily converted to TTF trees by replacing AND, OR, and M-out-of-N gates with MAX, MIN, and M-out-of-N units respectively. The MAX, MIN and M-out-of-N units can be easily implemented as digital circuits using digital magnitude comparators. Therefore, any TTF tree constructed with these units is a digital circuit that can be synthesized to an FPGA.
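The gate-to-unit replacement described above can be illustrated with a small software model. The following Python sketch is only an illustration under stated assumptions, not the hardware implementation used in the paper: the function names and the four-component example tree are hypothetical.

```python
import heapq


def min_unit(*ttfs):
    """MIN unit (OR gate): the system fails at the first component failure."""
    return min(ttfs)


def max_unit(*ttfs):
    """MAX unit (AND gate): the system fails at the last component failure."""
    return max(ttfs)


def m_out_of_n_unit(m, *ttfs):
    """M-out-of-N unit: returns the M-th smallest input TTF."""
    if not 1 <= m <= len(ttfs):
        raise ValueError("m must satisfy 1 <= m <= N")
    # Only the m smallest values are needed, not a full ordering of the inputs.
    return heapq.nsmallest(m, ttfs)[-1]


def example_system_ttf(t1, t2, t3, t4):
    """Hypothetical tree: OR(component 1, 2-out-of-3(components 2, 3, 4))."""
    return min_unit(t1, m_out_of_n_unit(2, t2, t3, t4))


print(example_system_ttf(5.0, 2.0, 7.0, 4.0))  # 4.0
```

In this model a 1-out-of-N unit coincides with the MIN unit and an N-out-of-N unit coincides with the MAX unit, which matches the series and parallel cases described above.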
Fig. 2. The MAX unit computes the TTF of a parallel system.

3. TTF trees for fault-tolerant systems
Dynamic fault trees are used for reliability analysis of fault-tolerant systems. They are composed of dynamic gates as well as static gates. A variety of dynamic gates have been presented in the literature, and not all dynamic FTA methods and tools use the same set. Ref. [24] presents a formal definition for a set of six dynamic gates (i.e. FDEP, PAND, SEQ, Cold Spare, Hot Spare and Warm Spare). This section uses the same set and presents the corresponding TTF-tree units for these six dynamic gates.

3.1. SEQ gate
The SEQ gate is a dynamic gate that forces the input events to occur in a specific (left-to-right) order. That is, an input event to a SEQ gate is not enabled until all of the inputs to its left have occurred [21]. Fig. 4 shows how an ADD unit in the TTF-tree model corresponds to the SEQ gate in the fault-tree model. As shown in this figure, the event $X_1$ occurs (component 1 fails) at time $T_1$. The occurrence of this event enables the event $X_2$, which occurs a time $T_2$ later. Therefore, $T_S = T_1 + T_2$. An ADD unit can be easily implemented as a digital circuit using digital adders.

Fig. 4. The ADD unit corresponds to the SEQ gate.

3.2. FDEP gate
The FDEP gate is another dynamic gate, which models the cases where the occurrence of some event (called the trigger event) causes other dependent components to become inaccessible or unusable [6]. The FDEP gate operates by labeling the dependent event as failed when the trigger event occurs [7]. Therefore, the TTF of the dependent component is equal to the minimum of the time left before the occurrence of the trigger event and the time left before the occurrence of the dependent event. Consequently, in such cases the MIN unit can be used to develop the TTF-tree model of the system. As mentioned previously, the MIN unit can be easily implemented as a digital circuit using a digital magnitude comparator. Fig. 5 shows that the MIN unit in the TTF-tree model corresponds to the FDEP gate in the fault-tree model. In this figure, $T_A$ and $T_B$ are the TTFs of components A and B respectively and $T_S$ is the TTF of component B when it is functionally dependent on component A.

3.3. Cold spare gate
The cold spare gate [7] is another dynamic gate, which is used to model cold spares. The leftmost input to the cold spare gate is the primary (active) component. The other inputs represent cold spare units, which are switched into active operation as needed. The spare inputs can also be shared with other cold spare gates. When a primary component shares a spare unit with other primary components, it is replaced with the spare unit only if it fails before the other primary components.

Consider a primary component A that shares a cold spare with other primary components $B_i$, $i = 1, 2, \ldots, n$. Let $T_A$, $T_{Bi}$ and $T_{SPARE}$ be the TTFs of component A, components $B_i$ and the cold spare respectively. If component A fails before all components $B_i$ (i.e. $T_A < T_{B1}, T_A < T_{B2}, \ldots, T_A < T_{Bn}$), then the cold spare replaces component A and therefore component A is operational for $T_A + T_{SPARE}$. Otherwise, the cold spare replaces some component $B_j$ (where $T_A > T_{Bj}$) and is no longer available to component A, so component A is operational for $T_A$. The TTF-tree unit that models such cases is the Selector unit shown in Fig. 6. The Selector unit is defined by Eq. (1).

$$T_S = \begin{cases} T_{A1} & \text{if } T_C < T_{B1},\ T_C < T_{B2},\ \ldots,\ T_C < T_{Bn} \\ T_{A2} & \text{otherwise} \end{cases} \tag{1}$$

The cascade of a Selector unit and an ADD unit corresponds to the cold spare gate in the fault-tree model. As an example, consider dual-redundant processors A1 and A2 and a cold spare that can replace either upon failure. A dynamic fault-tree model for this processor system is shown in Fig. 7 [25]. Fig. 8 shows how the combination of Selector and ADD units can be used to develop the TTF-tree model of the system shown in Fig. 7. The fault tree shown in Fig. 7 is converted to the TTF tree shown in Fig. 8 by replacing cold spare gates with the cascade of ADD and Selector units.

Cold spare gates with more than two inputs cannot be replaced with Selector and ADD units directly. They can first be replaced with a cascade of two-input cold spare gates, and then each two-input cold spare gate can be replaced with Selector and ADD units. As an example, Fig. 9 shows how a fault tree with three-input cold spare gates can be converted to a corresponding TTF tree.
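The Selector and ADD cascade can be sketched in software as follows. This Python model is a hedged illustration: the function names, the exponential TTF distribution and the common `rate` parameter are assumptions made for the example, and the top-level MAX reflects the unit shown at the root of the TTF tree in Fig. 8. The last function estimates the unreliability as the fraction of sampled system TTFs that fall below a time $t$.

```python
import random


def add_unit(t1, t2):
    """ADD unit: the second lifetime starts when the first one ends."""
    return t1 + t2


def selector_unit(t_c, t_bs, t_a1, t_a2):
    """Selector unit, Eq. (1): T_A1 if T_C is below every T_Bi, else T_A2."""
    return t_a1 if all(t_c < t_b for t_b in t_bs) else t_a2


def cold_spare_unit(t_primary, t_rivals, t_spare):
    """Cold spare gate as a Selector fed by an ADD unit."""
    with_spare = add_unit(t_primary, t_spare)
    return selector_unit(t_primary, t_rivals, with_spare, t_primary)


def dual_processor_ttf(t_a1, t_a2, t_spare):
    """TTF tree of Fig. 8: MAX over two cold spare units sharing one spare."""
    return max(cold_spare_unit(t_a1, [t_a2], t_spare),
               cold_spare_unit(t_a2, [t_a1], t_spare))


def estimate_unreliability(t, n_samples, rate, seed=0):
    """Fraction of sampled system TTFs below t (all rates are hypothetical)."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(n_samples):
        ttf = dual_processor_ttf(rng.expovariate(rate),
                                 rng.expovariate(rate),
                                 rng.expovariate(rate))
        if ttf < t:
            failures += 1
    return failures / n_samples


print(dual_processor_ttf(3.0, 5.0, 4.0))  # 7.0: A1 fails first and gets the spare
```

With the strict comparison of Eq. (1), two primaries that fail at exactly the same time would both be denied the spare; for continuous TTF distributions this case has probability zero.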
Fig. 5. The MIN unit corresponds to the FDEP gate.

Fig. 6. Selector unit.
3.4. Warm and hot spare gates
The warm and hot spare gates [21] are other dynamic gates, which are used to model warm and hot spares. Unlike cold spares, warm and hot spares may fail even when they are dormant. The failure rate of a warm spare component changes when it is switched into active use, whereas the failure rate of a hot spare is the same whether it is dormant or active. The combination of Selector and ADD units can also be used for modeling warm and hot spare configurations.

Consider a primary component A that shares a warm spare with other primary components $B_i$, $i = 1, 2, \ldots, n$. Let $T_A$ and $T_{Bi}$ be the TTFs of component A and components $B_i$ respectively. Also, suppose $T_{DORMANT}$ is a random number generated using the TTF distribution of a dormant spare and $T_{ACTIVE}$ is a random number generated using the TTF distribution of an active spare. If component A fails before all components $B_i$ (i.e. $T_A < T_{B1}, T_A < T_{B2}, \ldots, T_A < T_{Bn}$) and before the dormant spare fails (i.e. $T_A < T_{DORMANT}$), then the warm spare replaces component A and therefore component A is operational for $T_A + T_{ACTIVE}$. Otherwise, the spare component is not available to component A, so component A is operational for $T_A$.

As an example, consider dual-redundant processors A1 and A2 and a warm spare that can replace either upon failure. The fault-tree model and the TTF-tree model of this system are shown in Figs. 10 and 11 respectively. In a similar manner, Selector and ADD units can be used for modeling hot spares. In this case, $T_{DORMANT}$ and $T_{ACTIVE}$ are generated without changing the failure rate.

Fig. 7. Fault-tree model for a dual-redundant processors system with a shared cold spare.

Fig. 8. The TTF-tree model of the processors system shown in Fig. 7.
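The warm spare rule described in Section 3.4 differs from the cold spare case only in the extra comparison against $T_{DORMANT}$. A minimal Python sketch follows; the function names, the exponential distributions and the choice of giving the primaries the active failure rate are assumptions made for illustration only.

```python
import random


def warm_spare_unit(t_primary, t_rivals, t_dormant, t_active):
    """Selector + ADD model of a warm spare gate.

    The primary receives the spare only if it fails before every rival
    primary and before the dormant spare itself has failed.
    """
    gets_spare = all(t_primary < t for t in (*t_rivals, t_dormant))
    return t_primary + t_active if gets_spare else t_primary


def sample_dual_processor_ttf(rng, rate_active, rate_dormant):
    """One sample of the dual-processor system with a shared warm spare.

    Setting rate_dormant equal to rate_active gives the hot spare case.
    """
    t_a1 = rng.expovariate(rate_active)
    t_a2 = rng.expovariate(rate_active)
    t_dormant = rng.expovariate(rate_dormant)  # lifetime of the idle spare
    t_active = rng.expovariate(rate_active)    # lifetime once switched in
    return max(warm_spare_unit(t_a1, [t_a2], t_dormant, t_active),
               warm_spare_unit(t_a2, [t_a1], t_dormant, t_active))


rng = random.Random(42)
print(sample_dual_processor_ttf(rng, rate_active=1.0, rate_dormant=0.1))
```

The same $T_{DORMANT}$ and $T_{ACTIVE}$ samples are passed to both units because the two processors share a single physical spare.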
The Selector unit can be implemented as a digital circuit using a digital magnitude comparator.

Fig. 9. Conversion of a fault tree with three-input cold spare gates to the corresponding TTF tree.

Fig. 10. Fault-tree model for a dual-redundant processors system with a shared warm spare.

Fig. 11. The TTF-tree model of the processors system shown in Fig. 10.

3.5. PAND gate
PAND is another dynamic gate, which is used to detect certain sequences of events [6]. A PAND gate is a two-input gate, which fires if and only if the input events occur in a specific order (left to right). Fig. 12 shows how a MAX-or-Infinite unit in the TTF-tree model corresponds to the PAND gate in the fault-tree model. In this figure $T_1$, $T_2$ and $T_S$ are the TTFs of component 1, component 2 and the whole system respectively. Also, $X_1$, $X_2$ and $X_S$ are the Boolean fault indicators of component 1, component 2 and the whole system respectively. As shown in this figure, the PAND gate fires if $X_1$ occurs before $X_2$ (i.e. $T_1 < T_2$). Otherwise, the PAND gate never fires. The MAX-or-Infinite unit in the TTF-tree model is defined by Eq. (2).

$$T_S = \begin{cases} T_2 & \text{if } T_1 < T_2 \\ \infty & \text{otherwise} \end{cases} \tag{2}$$

The infinity produced by a MAX-or-Infinite unit is processed by other TTF-tree units according to the following rules:
• $\mathrm{MAX}(x, \infty) = \mathrm{MAX}(\infty, x) = \infty$ for all $x$
• $\mathrm{MIN}(x, \infty) = \mathrm{MIN}(\infty, x) = x$ for all $x$
• $x + \infty = \infty + x = \infty$ for all $x$
• $\infty > x$ for all $x \neq \infty$

The MAX-or-Infinite unit can be implemented as a digital circuit using digital magnitude comparators, provided that the number system used for the hardware implementation is capable of representing infinity. This issue is treated in Section 5, where the number system used for the hardware implementation of TTF trees is discussed.

Fig. 12. The MAX-or-Infinite unit corresponds to the PAND gate.

4. TTF trees and coverage factor
Including the concept of coverage in the system-level model is critical to an accurate reliability assessment [26]. Fault coverage is the conditional probability $C$ that the system recovers from a fault given that a fault has occurred. There are some approaches for incorporating coverage modeling into FTA [8,27,28]. The random-selector unit, shown in Fig. 13, is the TTF-tree unit used for modeling the coverage factor. The output of the random-selector unit is equal to $T_i$ with probability $P_i$, where

$$\sum_{i=1}^{n} P_i = 1 \tag{3}$$

For example, Fig. 14 shows how a cold spare with the coverage factor $C = 0.7$ can be modeled by a random-selector unit. In this figure $T_P$ is the TTF of the primary unit and $T_{SPARE}$ is the TTF of the spare. Here $T_S$ (i.e. the TTF of the whole system) is equal to $T_P$ with probability 0.3 and is equal to $T_P + T_{SPARE}$ with probability 0.7. A random-selector unit can be easily implemented as a digital circuit using a multiplexer whose select lines are determined by random values.

Fig. 13. Random selector unit.

Fig. 14. The TTF tree which models a cold spare system with coverage factor = 0.7.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity max_unit is
  generic (w : natural);
  port (
    T1, T2 : in  std_logic_vector(w-1 downto 0);
    Ts     : out std_logic_vector(w-1 downto 0)
  );
end max_unit;

architecture synthesizable of max_unit is
begin
  -- Forward the larger of the two TTF inputs, compared as unsigned values.
  process (T1, T2)
  begin
    if unsigned(T1) > unsigned(T2) then
      Ts <= T1;
    else
      Ts <= T2;
    end if;
  end process;
end synthesizable;
```

Fig. 15. The synthesizable VHDL description of MAX unit.
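The MAX-or-Infinite unit of Section 3.5 and the random-selector unit of Section 4 can also be modelled in software. The Python sketch below is an illustration under assumptions: all function names are hypothetical, floating-point infinity stands in for the hardware encoding, and `add_fixed` is one possible way to make an adder respect the rule $x + \infty = \infty$ when infinity is encoded as an all-ones code word and the result is one bit wider than the operands.

```python
import math
import random


def max_or_infinite_unit(t1, t2):
    """Eq. (2): T2 if input 1 fails first, otherwise the gate never fires."""
    return t2 if t1 < t2 else math.inf


def random_selector_unit(ttfs, probs, rng):
    """Outputs ttfs[i] with probability probs[i]; probabilities must sum to 1."""
    if len(ttfs) != len(probs) or abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("need one probability per input, summing to 1")
    return rng.choices(ttfs, weights=probs, k=1)[0]


def cold_spare_with_coverage(t_p, t_spare, coverage, rng):
    """Fig. 14: the spare takes over only with probability `coverage`."""
    return random_selector_unit([t_p, t_p + t_spare],
                                [1.0 - coverage, coverage], rng)


def add_fixed(a, b, width):
    """Adds two width-bit codes; the all-ones code means infinity.

    The result is a (width + 1)-bit code, so finite sums never overflow
    and never collide with the wider all-ones infinity code.
    """
    inf_in = (1 << width) - 1
    inf_out = (1 << (width + 1)) - 1
    return inf_out if inf_in in (a, b) else a + b


print(max_or_infinite_unit(1.0, 2.0))  # 2.0
print(max_or_infinite_unit(2.0, 1.0))  # inf
print(add_fixed(15, 4, 4))             # 31, the 5-bit infinity code
```

Python's `math.inf` already satisfies the four rules listed in Section 3.5 under `max`, `min`, `+` and `>`, which is why no special handling is needed in the floating-point model.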
5. Translation program and hardware implementation of TTF trees
A translation program has been developed which automatically transforms a fault tree into the hardware implementation of the corresponding TTF tree within an FPGA, so that transforming a fault tree into a TTF tree does not require extra time and effort from the user. For this program, a library of VHDL [30] code has been developed which contains a synthesizable VHDL description for each TTF-tree unit (i.e. MAX, MIN, ADD, Selector, MAX-or-Infinite and random-selector). As an example, Fig. 15 shows the synthesizable VHDL description of the MAX unit. The library also contains synthesizable VHDL descriptions for pseudo random number generators [32–34], because the generation of random numbers is inherent to any MCS. The input values to the hardware implementation of a TTF tree are generated using these pseudo random number generators.
The translation program takes the textual description of a fault tree as input and replaces each fault-tree gate with instantiations of the appropriate components from the library. It also replaces the primary inputs of the fault tree with pseudo random number generators, which generate the input values. The resulting VHDL code is synthesizable and can be synthesized to an FPGA using common synthesis tools.

Number representation is an important issue in arithmetic digital circuits. To attain a better utilization of FPGA resources, a fixed-point representation of TTF values has been used rather than a floating-point representation. To overcome the limited range of fixed-point representation, the widths of numbers can increase through the digital circuit as needed, so that an overflow never occurs. Fortunately, TTF trees do not involve operations such as multiplication, which can increase the widths of numbers greatly. The sum of two fixed-point numbers of $n$ bits each can be up to $n + 1$ bits long, and the MAX, MIN, Selector and random-selector units do not increase the widths of their operands. Therefore, TTF trees involve only operations that do not increase the widths of numbers greatly.

As mentioned in Section 3, the output of a MAX-or-Infinite unit may be infinite. Therefore, the number representation used for the hardware implementation of TTF trees should be capable of representing infinity. To do this, infinity has been assigned the all-1s representation, so that an $n$-bit code can represent $2^n - 1$ finite fixed-point numbers as well as $\infty$.

Since MCS needs to repeat the same task many times with different samples, a pipeline [29] implementation has been used. For this purpose, the translation program puts pipelining registers between the instantiations of the components. As an example, Fig. 16 shows the hardware implementation of an example TTF tree. The boxes labeled "R" in the figure are pipelining registers. The width of each register is enclosed in parentheses. As shown in the figure, the widths of numbers (registers) can change as needed, so that an overflow never occurs. The clock period of a pipeline is determined by the propagation delay of the longest pipeline stage. Unless the stage delays are balanced, one big and slow stage can slow down the whole pipeline. In such cases the longest pipeline stage (such as the 3-out-of-4 unit in Fig. 16) can be broken up into smaller pipeline stages.

Fig. 16. Pipeline implementation of an example TTF tree.

6. Performance estimation
The performance estimation of the FPGA-based MCS is as follows. Let $N$ be the number of simulation iterations, $N_U$ the number of TTF-tree units of the TTF tree under evaluation, $T_{CPU}$ the average CPU time consumed for the evaluation of a single TTF-tree unit and $T_{\text{Computer Setup}}$ the setup time needed for starting the computer-based simulation (such as the compile time of the simulation code). Then the computer-based simulation time is

$$T_{Computer} = T_{\text{Computer Setup}} + N \cdot N_U \cdot T_{CPU} \tag{4}$$

This is because all the units of the TTF tree must be evaluated during each simulation iteration. Let $T_P$ be the clock period of the pipeline implementation of the TTF tree under evaluation (i.e. the signal propagation time through the hardware-implemented TTF-tree units) and $T_{\text{FPGA Setup}}$ the setup time needed for starting the FPGA-based MCS (such as the synthesis time of the hardware description code). Then the FPGA-based simulation time is

$$T_{FPGA} = T_{\text{FPGA Setup}} + N \cdot T_P \tag{5}$$

This is because, regardless of the number of units of the TTF tree under evaluation, a pipeline generates an output for each clock period [29]. The speed-up of the FPGA-based MCS over computer-based MCS is obtained from Eqs. (4) and (5) as

$$S = \frac{T_{Computer}}{T_{FPGA}} = \frac{T_{\text{Computer Setup}} + N \cdot N_U \cdot T_{CPU}}{T_{\text{FPGA Setup}} + N \cdot T_P} \tag{6}$$

Since $T_{\text{Computer Setup}}$ and $T_{\text{FPGA Setup}}$ are independent of $N$, as $N \to \infty$ (MCS needs an extremely large number of simulated samples to obtain estimates at a high level of confidence), the speed-up approaches

$$S = N_U \cdot \frac{T_{CPU}}{T_P} \tag{7}$$

It can be seen from Eq. (7) that the speed-up is directly proportional to $N_U$ (i.e. the number of units of the TTF tree under evaluation). Therefore the speed-up grows with the size of the TTF tree (i.e. the size of the corresponding fault tree).
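The behaviour of Eqs. (4) to (7) can be checked numerically. In the following Python sketch every parameter value (number of units, per-unit CPU time, pipeline clock period, setup times) is a hypothetical placeholder chosen only to show how the speed-up of Eq. (6) rises toward the limit of Eq. (7) as $N$ grows; none of them is a measurement from the paper.

```python
def computer_time(n, n_units, t_cpu, setup):
    """Eq. (4): every unit is evaluated in software in every iteration."""
    return setup + n * n_units * t_cpu


def fpga_time(n, t_p, setup):
    """Eq. (5): the pipeline delivers one result per clock period."""
    return setup + n * t_p


def speed_up(n, n_units, t_cpu, t_p, setup_computer, setup_fpga):
    """Eq. (6): ratio of the two simulation times."""
    return (computer_time(n, n_units, t_cpu, setup_computer)
            / fpga_time(n, t_p, setup_fpga))


def limiting_speed_up(n_units, t_cpu, t_p):
    """Eq. (7): the limit of Eq. (6) as N grows without bound."""
    return n_units * t_cpu / t_p


# Hypothetical parameters, for illustration only.
params = dict(n_units=20, t_cpu=50e-9, t_p=1 / 8.25e6,
              setup_computer=10.0, setup_fpga=300.0)

for n in (10**8, 10**10, 10**12):
    print(n, round(speed_up(n, **params), 3))
print("limit", round(limiting_speed_up(20, 50e-9, 1 / 8.25e6), 3))
```

With these placeholder values the long FPGA setup time dominates for small $N$, so the FPGA only pays off once the number of iterations is large enough, and doubling `n_units` doubles the limiting speed-up.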
7. Experiments
In order to evaluate the TTF-tree method, a set of experiments was performed using the benchmarks provided in [21]. The experiments were performed using a PCI-based FPGA board (the CPCI10K-PROD board [31]), on which a FLEX 10K200SFC484-1 FPGA chip [31] is mounted. This FPGA can be configured through the PCI bus. After configuration, the FPGA can communicate with the host computer through the PCI bus. Using the translation program (see Section 5), a synthesis tool (Leonardo Version: 2001_1a.32), and the FPGA programming tool (MAXPLUS II Version: 10), each benchmark fault tree was transformed into the hardware implementation of the corresponding TTF tree on the FPGA within a few minutes and was then analyzed by MCS. The translation program, the synthesis tool and the FPGA programming tool were run on a Pentium IV system (2.4 GHz, RAM = 256 MB, OS = Windows XP).

In the experiments, the clock rate applied to the random number generators was 33 MHz and the pipeline clock rate applied to the pipelined TTF trees was 33/4 = 8.25 MHz. Each random number generator generated random numbers for four random variables (i.e. the inputs of the TTF tree), because random variables with the same distribution can share a random number generator and the only limitation is the speed of the random number generator. Table 1 shows the amount of FPGA resources used in the experiments.

Table 1
Available and consumed FPGA resources (FLEX 10K200SFC484-1)

| | N_RNG | N_TTF-tree | N | % |
|---|---|---|---|---|
| Total available LCs in the FPGA | – | – | 9984 | 100 |
| Benchmark B1 (8 of 10 redundant system) | 1448 | 1160 | 2608 | 26.12 |
| Benchmark B1 (8+2 Cold spares) | 1448 | 4263 | 5711 | 57.20 |
| Benchmark B1 (8+2 Warm spares) | 1448 | 4759 | 6207 | 62.17 |
| Benchmark B2 | 3978 | 3042 | 6620 | 66.31 |
| Benchmark B3 | 607 | 445 | 1012 | 10.14 |
| Benchmark B4 | 3712 | 2499 | 5811 | 58.20 |

N_RNG: total number of logic cells which were used for random number generators. N_TTF-tree: total number of logic cells which were used for TTF trees. N: total number of logic cells (N = N_RNG + N_TTF-tree).

During the MCS of each TTF tree, the host computer received the outputs of the TTF tree (as the TTFs of the whole system) from the FPGA board and stored them. From these stored values, the unreliability parameters of the benchmarks were estimated as follows. Suppose $N$ output values (i.e. TTF values of the system) are obtained from $N$ simulation iterations, and $N_F(t)$ is the number of output values that are less than $t$ (i.e. the system fails before time $t$). Then the unreliability of the system at time $t$ is estimated by

$$U(t) = \frac{N_F(t)}{N} \tag{8}$$

which is simply the probability that the system fails before time $t$. Table 2 shows the experimental results obtained from the Monte Carlo analysis of TTF trees and compares these results with those reported in [21].

In order to evaluate the speed-up of the presented method, a C simulation program was developed for the MCS of each benchmark and was run on a PC (Pentium IV, 2.4 GHz, RAM = 256 MB, OS = Windows XP). Table 3 shows the speed-ups. The speed-up values reported in Table 3 are obtained from the following equation:
$$S = \frac{T_{Computer}}{T_{FPGA}} = \frac{(\text{PC Setup Time}) + (\text{PC-based Simulation Time})}{(\text{FPGA Setup Time}) + (\text{FPGA-based Simulation Time})} \tag{9}$$

where PC Setup Time is the setup time needed for starting the PC-based MCS (such as the compile time of the simulation code) and FPGA Setup Time is the setup time needed for starting the FPGA-based MCS (such as the time needed for translating fault trees to VHDL code and the synthesis time of the VHDL code). As shown in Table 3, the speed-up increases as the number of MCS iterations increases from $10^8$ to $10^{10}$. This agrees with the analytical estimate in Section 6: as the number of MCS iterations approaches infinity, the speed-up increases and approaches a limiting value. Table 3 also shows that the speed-up depends on the size of the TTF tree. This dependency was also estimated analytically in Section 6.

8. Discussion
This paper introduces the TTF tree as a model for reliability modeling. This model is a method for representing the mathematical relation between a system TTF and its components' TTFs. The advantages of this model are:
• Both dynamic and static fault trees can be easily converted into TTF trees using a simple replacement operation. In fact, a translation program can perform this task automatically without requiring much of the user's time.
• Each TTF tree can be interpreted as a digital circuit, which receives the TTFs of the components and computes the TTF of the whole system. Therefore, it can be synthesized into an FPGA chip in order to accelerate MCS.
• Since MCS needs to repeat the same task many times with different samples, a pipeline implementation of TTF trees can be used, which results in great speed-ups.

The analytical performance estimation of the TTF-tree model shows that the speed-up of the method grows with the size of the TTF tree. Experimental results for some benchmark fault trees show the capability of the
1026
A. Ejlali, S. Ghassem Miremadi / Microelectronics Reliability 44 (2004) 1017–1028
Table 2 Experimental results Benchmark and TTF distribution of basic components
t (years)
N
B1 (8 of 10 redundant system) Exponential, k ¼ 1 FPMH
1
106 108 1010 106 108 1010 106 108 1010
5
10
B1 (8 of 10 redundant system) Weibull, 1=a ¼ 2 FPMH, b ¼ 1:1
1
5
10
B1 (8+2 Cold Spare) Exponential, k ¼ 1 FPMH
1
5
10
B1 (8+2 Warm Spare) Exponential, kACTIVE ¼ 1 FPMH, kDORMANT ¼ 0:1 FPMH
1
5
10
NF ðtÞ
Estimated UðtÞ from MCS
UðtÞ reported in [21]
82 7399 774954 7741 768326 76154725 44553 4590249 457416757
8.2 · 105 7.339 · 105 7.74954 · 105 7.741 · 103 7.68326 · 103 7.6154725 · 103 4.4553 · 102 4.590249 · 102 4.57416757 · 102
7.62 · 105
106 108 1010 106 108 1010 106 108 1010
184 18233 1755081 24169 2505321 249408257 149499 14517723 1486178261
1.84 · 104 1.8233 · 104 1.755081 · 104 2.4169 · 102 2.505321 · 102 2.49408257 · 102 1.49499 · 101 1.4517723 · 101 1.486178261 · 101
1.78 · 104
106 108 1010 106 108 1010 106 108 1010
42 3926 389258 892 94746 9215246 3733 357554 36750120
4.2 · 105 3.926 · 105 3.89258 · 105 8.92 · 104 9.4746 · 104 9.215246 · 104 3.733 · 103 3.57554 · 103 3.6750120 · 103
3.82 · 105
106 108 1010 106 108 1010 106 108 1010
52 5534 576754 5908 563255 56704086 36239 3496452 357451564
5.2 · 105 5.534 · 105 5.76754 · 105 5.908 · 103 5.63255 · 103 5.6704086 · 103 3.6239 · 102 3.496452 · 102 3.57451564 · 102
5.66 · 105
7.54 · 103
4.53 · 102
2.47 · 102
1.47 · 101
9.33 · 104
3.63 · 103
5.73 · 103
3.54 · 102
B2 Exponential, kMASTER ¼ 0:421 FPMH, kSLAVE ¼ 0:3837 FPMH
12
106 108 1010
16805 1673592 166608758
1.6805 · 102 1.673592 · 102 1.66608758 · 102
1.65 · 102
B3 Exponential, k (MIB) ¼ 1 FPMH, k (Active & Spare Feeder) ¼ 1.5 FPMH, Coverage Factor(Feeder) ¼ 0.99
2.5
106 108 1010
15696 1577606 158563753
1.5696 · 102 1.577606 · 102 1.58563753 · 102
1.6 · 102
B4 Exponential, kACTIVE ¼ 1 FPMH, kSPARE ¼ 1 FPMH
1
106 108 1010
161 15167 1559558
1.61 · 104 1.5167 · 104 1.559558 · 104
1.54E · 104
t: the time at which unreliability is estimated. N : number of MCS iterations. NF ðtÞ: number of the TTFs which are less than t. Hrs: hours. FPMH: failures per million hours.
method to accelerate MCS. This method makes it feasible to run MCS with extremely large number of sim-
ulated samples in order to obtain estimates at a high level of confidence.
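As a software illustration of the estimator in Eq. (8), the following Python sketch simulates a k-out-of-n system with exponentially distributed component TTFs and counts the iterations in which the system TTF falls below t. It is not the authors' C program or their FPGA circuit: the function names, the order-statistic treatment of the k-out-of-n gate, the 8760 h/year conversion, and the normal-approximation confidence interval are all assumptions made for illustration.

```python
import math
import random

HOURS_PER_YEAR = 8760  # assumed conversion; Table 2 gives t in years


def k_of_n_ttf(component_ttfs, k):
    """TTF of a system that needs at least k of its n components working.

    Assumed reading of "8 of 10 redundant system": the system fails at the
    (n - k + 1)-th component failure, i.e. an order statistic of the TTFs.
    """
    n = len(component_ttfs)
    return sorted(component_ttfs)[n - k]


def estimate_unreliability(n_iter, t_hours, k=8, n=10,
                           rate_per_hour=1e-6, seed=1):
    """Estimate U(t) = N_F(t) / N as in Eq. (8).

    rate_per_hour=1e-6 corresponds to 1 FPMH (failures per million hours).
    Returns (N_F, U, half-width of an approximate 95% confidence interval).
    """
    rng = random.Random(seed)
    n_f = 0
    for _ in range(n_iter):
        ttfs = [rng.expovariate(rate_per_hour) for _ in range(n)]
        if k_of_n_ttf(ttfs, k) < t_hours:
            n_f += 1
    u = n_f / n_iter
    # Normal approximation to the binomial; an illustrative choice only.
    half_width = 1.96 * math.sqrt(u * (1.0 - u) / n_iter)
    return n_f, u, half_width


if __name__ == "__main__":
    n_f, u, hw = estimate_unreliability(20_000, 10 * HOURS_PER_YEAR)
    print(f"N_F = {n_f}, U(10 years) ~ {u:.4f} +/- {hw:.4f}")
```

Under these assumptions the half-width of the interval shrinks in proportion to 1/sqrt(N), which is why small unreliabilities such as those at t = 1 year in Table 2 call for very large N before the estimate stabilizes.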
Table 3
Speed-ups

| Benchmark | N | Computer setup time (min) | Computer-based simulation time (min) | FPGA setup time (min) | FPGA-based simulation time (min) | Speed-up |
|---|---|---|---|---|---|---|
| Benchmark B1 (8 of 10 redundant system) | $10^{8}$ | 1.3 | 33.5 | 2.1 | 0.21 | 15.06 |
| | $10^{10}$ | 1.3 | 3218.2 | 2.1 | 20.2 | 144.372 |
| Benchmark B1 (8+2 Cold spares) | $10^{8}$ | 1.9 | 109.3 | 4.3 | 0.21 | 24.66 |
| | $10^{10}$ | 1.9 | 10499.4 | 4.3 | 20.1 | 430.38 |
| Benchmark B1 (8+2 Warm spares) | $10^{8}$ | 1.9 | 121.2 | 4.5 | 0.21 | 26.14 |
| | $10^{10}$ | 1.9 | 11647.3 | 4.5 | 20.2 | 471.63 |
| Benchmark B2 | $10^{8}$ | 2.0 | 91.6 | 4.9 | 0.21 | 18.32 |
| | $10^{10}$ | 2.0 | 8801.8 | 4.9 | 20.4 | 347.98 |
| Benchmark B3 | $10^{8}$ | 0.3 | 13.5 | 0.9 | 0.22 | 12.32 |
| | $10^{10}$ | 0.3 | 1300.1 | 0.9 | 20.6 | 63.04 |
| Benchmark B4 | $10^{8}$ | 1.9 | 43.9 | 4.4 | 0.22 | 9.91 |
| | $10^{10}$ | 1.9 | 4212.6 | 4.4 | 21.1 | 165.27 |

N: number of MCS iterations. Computer setup time: the setup time needed for starting the computer-based MCS (such as the compile time of the simulation code). FPGA setup time: the setup time needed for starting the FPGA-based MCS (such as the time needed for translating fault trees into VHDL code and the synthesis time of the VHDL code).
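The speed-up column can be recomputed from the four timing columns with Eq. (9). The short Python sketch below does this for two rows of Table 3. The function names are hypothetical, and the limiting value assumes that both simulation times grow linearly with N so that the setup times become negligible; this is one interpretation of the Section 6 result, not a formula quoted from it.

```python
def speed_up(pc_setup, pc_sim, fpga_setup, fpga_sim):
    """Eq. (9): ratio of total computer time to total FPGA time (minutes)."""
    return (pc_setup + pc_sim) / (fpga_setup + fpga_sim)


def limiting_speed_up(pc_sim, fpga_sim):
    """Assumed large-N limit: setup times vanish relative to simulation time."""
    return pc_sim / fpga_sim


# Timing values (minutes) copied from Table 3.
ROWS = {
    "B1 (8 of 10), N = 1e8": (1.3, 33.5, 2.1, 0.21),
    "B1 (8+2 warm spares), N = 1e10": (1.9, 11647.3, 4.5, 20.2),
}

if __name__ == "__main__":
    for name, (pc_setup, pc_sim, fpga_setup, fpga_sim) in ROWS.items():
        s = speed_up(pc_setup, pc_sim, fpga_setup, fpga_sim)
        s_inf = limiting_speed_up(pc_sim, fpga_sim)
        print(f"{name}: S = {s:.2f}, assumed limit = {s_inf:.1f}")
```

For small N the fixed FPGA setup time dominates the denominator, which is consistent with the much lower speed-ups in the $10^{8}$ rows than in the $10^{10}$ rows.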
References

[1] Henley EJ, Kumamoto H. Probabilistic Risk Assessment. IEEE Press; 1992.
[2] Vesely WE, Goldberg FF, Roberts NH, Haasl DF. Fault tree handbook. United States Nuclear Regulatory Commission, January 1981.
[3] Lee WS, Grosh DL, Tillman FA, Lie CH. Fault tree analysis methods and applications: a review. IEEE Trans Reliab 1985;R-34:194–302.
[4] Gulati R, Dugan JB. A modular approach for analyzing static and dynamic fault trees. In: Proc Ann Reliability & Maintainability Symp, Philadelphia, Pennsylvania, USA, January 1997. p. 57–63.
[5] Dugan JB, Sullivan KJ, Coppit D. Developing a low-cost high-quality software tool for dynamic fault-tree analysis. IEEE Trans Reliab 2000;49(March):49–59.
[6] Anand A, Somani AK. Hierarchical analysis of fault trees with dependencies, using decomposition. In: Proc Ann Reliability & Maintainability Symp, Anaheim, California, USA, January 1998. p. 69–75.
[7] Dugan JB, Bavuso S, Boyd M. Dynamic fault tree models for fault tolerant computer systems. IEEE Trans Reliab 1992;41(September):363–77.
[8] Dugan JB, Doyle SA. Incorporating imperfect coverage into a BDD solution of a combinational model. Euro J Automat 1996;30(8):1073–86.
[9] Sinnamon R, Andrews JD. Fault tree analysis and binary decision diagrams. In: Proc Ann Reliability & Maintainability Symp, Las Vegas, Nevada, USA, January 1996. p. 215–222.
[10] Kohda T, Inoue K. Probability evaluation of system-failure occurrence based on minimal cut-sets. In: Proc Ann Reliability & Maintainability Symp, Seattle, Washington, USA, January 2002. p. 190–194.
[11] Dugan JB, Sullivan KJ, Coppit D. Developing a high-quality software tool for fault tree analysis. In: Proc 10th Int'l Symp on Software Reliability Engineering, Boca Raton, Florida, USA, November 1999. p. 222–231.
[12] Dugan JB, Bavuso S, Boyd M. Fault trees and Markov models for reliability analysis of fault tolerant systems. Reliab Eng Syst Saf 1993;39:291–307.
[13] Dutuit Y, Rauzy A. A linear-time algorithm to find modules in fault trees. IEEE Trans Reliab 1996;45(September):422–5.
[14] Yin L, Smith MAJ, Trivedi KS. Uncertainty analysis in reliability modeling. In: Proc Ann Reliability & Maintainability Symp, Philadelphia, Pennsylvania, USA, January 2001. p. 229–234.
[15] Bavuso J. Aerospace applications of Weibull and Monte Carlo simulation with importance sampling. In: Proc Ann Reliability & Maintainability Symp, Philadelphia, Pennsylvania, USA, January 1997. p. 208–210.
[16] Goyal A, Shahabuddin P, Heidelberger P, Nicola VF, Glynn PW. A unified framework for simulating Markovian models of highly dependable systems. IEEE Trans Comput 1992;41(January):36–51.
[17] Negoi A, Zimmermann J. Monte Carlo hardware simulator for electron dynamics in semiconductors. In: Proc Intl Semiconductor Conf, Sinaia, Romania, October 1996. p. 557–60.
[18] Cowen CP, Monaghan S. A reconfigurable Monte-Carlo clustering processor (MCCP). In: Proc IEEE Workshop on FPGA for Custom Computing Machines, Napa Valley, California, USA, April 1994. p. 59–65.
[19] Danese G, De Lotto L, Leporati F, Spelgatti A. FPGA based coprocessor to calculate the energy of dipolar system. In: Proc 10th Euromicro Workshop on Parallel, Distributed and Network-based Processing, Canary Islands, Spain, January 2002. p. 227–34.
[20] Ejlali A, Miremadi SG. Time-to-failure tree. In: Proc Ann Reliability & Maintainability Symp, Tampa, Florida, USA, January 2003. p. 148–52.
[21] Zhu H, Zhou S, Dugan JB, Sullivan KJ. A benchmark for quantitative fault tree reliability analysis. In: Proc Ann Reliability & Maintainability Symp, Philadelphia, Pennsylvania, USA, January 2001. p. 86–93.
[22] Meshkat L, Dugan JB, Andrews JD. Dependability analysis of systems with on-demand and active failure modes, using dynamic fault trees. IEEE Trans Reliab 2002;51(June):240–51.
[23] Barlow RE, Proschan F. Mathematical Theory of Reliability. Society for Industrial & Applied Mathematics; 1996.
[24] Coppit D, Sullivan KJ, Dugan JB. Formal semantics of models for computational engineering: a case study on dynamic fault trees. In: Proc 11th Intl Symp on Software Reliability Engineering, San Jose, California, USA, October 2000. p. 270–82.
[25] Dugan JB, Venkataraman B, Gulati R. DIFTree: A software package for the analysis of dynamic fault tree models. In: Proc Ann Reliability & Maintainability Symp, Philadelphia, Pennsylvania, USA, January 1997. p. 64–70.
[26] Dugan JB, Trivedi KS. Coverage modeling for dependability analysis of fault tolerant systems. IEEE Trans Comput 1989;38(6):775–87.
[27] Doyle SA, Dugan JB. Dependability assessment using binary decision diagrams. In: Proc IEEE Int'l Symp Fault-Tolerant Computing, vol. FTCS-25, Pasadena, California, USA, June 1995.
[28] Doyle SA, Dugan JB. Fault trees and imperfect coverage: a combinational approach. In: Proc Ann Reliability & Maintainability Symp, Atlanta, Georgia, USA, January 1993. p. 214–9.
[29] Parhami B. Computer Arithmetic: Algorithms and Hardware Designs. Oxford University Press; 2000.
[30] IEEE Std 1076-1993: IEEE Standard VHDL Language Reference Manual.
[31] PLD Applications Corp. CPCI10K-PROD B User's. [Online]. Available: http://www.compumodules.com/pdf/plda-pdf/cpciprod10k.pdf.
[32] Chu P, Jones R. Design techniques of FPGA based random number generator. Presented at Military and Aerospace Applications of Programmable Devices and Technologies Conf, 1999. [Online]. Available: http://academic.csuohio.edu:8080/chup/chu_web_stuff/research_stuff/Random.PDF.
[33] Bardell PH, McAnney WH, Savir J. Built-in Test for VLSI: Pseudo-random Techniques. John Wiley and Sons; 1987.
[34] Banks J, Carson JS, Nelson BL, Nicol DM. Discrete-Event System Simulation. Prentice-Hall; 2000.
[35] Brown S, Rose J. FPGA and CPLD architectures: a tutorial. IEEE Design Test Comput 1996;13(2):42–57.