Accepted Manuscript
Early Prediction of Reliability and Availability of Combined Hardware-Software Systems based on Functional Failures
Sourav Sinha, Neeraj Kumar Goyal, Rajib Mall
PII: S1383-7621(18)30230-3
DOI: https://doi.org/10.1016/j.sysarc.2018.10.007
Reference: SYSARC 1537
To appear in: Journal of Systems Architecture
Received date: 3 June 2018
Revised date: 15 October 2018
Accepted date: 22 October 2018
Please cite this article as: Sourav Sinha, Neeraj Kumar Goyal, Rajib Mall, Early Prediction of Reliability and Availability of Combined Hardware-Software Systems based on Functional Failures, Journal of Systems Architecture (2018), doi: https://doi.org/10.1016/j.sysarc.2018.10.007
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Early Prediction of Reliability and Availability of Combined Hardware-Software Systems based on Functional Failures
Sourav Sinha a,*, Neeraj Kumar Goyal a and Rajib Mall b
a Subir Chowdhury School of Quality and Reliability, Indian Institute of Technology Kharagpur, India
b Department of Computer Science & Engineering, Indian Institute of Technology Kharagpur, India
* Corresponding author: Subir Chowdhury School of Quality and Reliability, Indian Institute of Technology Kharagpur, Kharagpur-721302, India. Tel.: +91 3222269868. E-mail address: [email protected] (S. Sinha).
Abstract
Interactions among software and hardware components play an important role in the successful operation of a system. Researchers have identified two types of interaction failures: software failure influenced by hardware breakdown (hardware-driven software failure) and hardware failure influenced by software malfunction (software-driven hardware failure). The existing research in this domain either has not considered the entire spectrum of interaction failures or has limited its scope to mere failure analysis rather than reliability/ availability modeling. In this paper, we propose a unified model to predict the worst case achievable reliability/ availability of a combined hardware-software system at the early design phases. The proposed model identifies system functions from the requirements specification document. Then, these functions are mapped to corresponding conceptual design components. Subsequently, the functional design is simulated for sets of input data that have been randomly generated for the different operation modes (failure/ working) of the components. We also simulate the system state transitions due to the component operation modes. Finally, reliability and availability are predicted from the simulation results. In this context, we address four important aspects: i) proposing a conceptual design based early reliability/ availability prediction model, ii) addressing, apart from individual hardware-software component failures, the different interaction failures such as hardware-driven software and software-driven hardware failures, iii) implementing the proposed model through a case study, and iv) validating the model by comparing the reliability/ availability values obtained using the proposed approach with those of an established method.

Keywords: Reliability/ availability prediction, hardware-software interactions, failure analysis, functional failure, embedded system

1. Introduction
A sharp increase in the use of software-intensive systems has been noticed in recent times. Even a wide range of safety-critical hardware devices that perform a multitude of activities are often controlled by software [1, 2]. For example, in the aircraft industry, a significant increase in the use of combined hardware-software systems can be noticed. Tumer and Smidts [3] have reported that the total percentage of
functional requirements performed by software has increased from 8% for the F-4 US aircraft in the 1960s to 80% for the F-22 US aircraft in 2000. The integration of hardware and software makes reliability evaluation more complicated, because we must consider hardware-software interaction failures apart from their independent failures. Iyer and Velardi [4] at Stanford University experimentally showed that degradation of a hardware component due to fatigue, temperature, electrical stress, design susceptibilities or configuration changes might impact software operation. To corroborate their findings, they demonstrated that nearly 35 per cent of the software errors on an MVS/SP operating system were hardware-related [4]. This implies that a fault in a hardware component may cause malfunction of the corresponding software component. On the other hand, bugs in the software may also lead a hardware device/ peripheral to failure. For example, in March 2015, a failure of the F-35 joint strike fighter aircraft was attributed to a software glitch that caused the aircraft to detect targets incorrectly. Propagation of faults from hardware to software, or vice versa, leads to failure of the combined HW-SW system. Therefore, a pragmatic way to predict HW-SW combined system reliability/ availability is not to ignore the interactions among components.

A few research results have been reported on system reliability/ availability prediction that consider interaction failures among HW-SW components [5]. Some researchers have proposed reliability/ availability models taking into account hardware-driven software interaction failures [5-8]. They mostly relied on the Markovian approach for reliability/ availability modeling. Another group of researchers have considered hardware-driven software and software-driven hardware interaction failures together for failure/ reliability analysis [3, 9-13]. Out of these, Huang et al. [13] presented a quantitative reliability analysis for the hardware part only, considering the usage profile of the embedded software in a SPICE simulation environment. They did not give a consolidated model for combined hardware-software system reliability. The other models of this group have performed failure analysis using the Functional Failure Identification and Propagation (FFIP) framework or its extensions [3, 9-12, 14, 15]. As the name suggests, FFIP analyzes a system based on its functional failures [9, 10, 16, 17]. However, they also did not give a quantitative reliability analysis for the combined hardware-software system. We follow this line of work to some extent by employing functional failure analysis for early prediction of system reliability. Unlike the FFIP framework, our work is not limited to mere failure analysis; rather, it is extended to quantitative reliability prediction. The applicability of the proposed model is limited to the classical embedded system. Such systems comprise a microcontroller with an embedded software controller and input/ output peripheral devices that are connected to the microcontroller.

A unified model that considers hardware-driven software and software-driven hardware interaction failures apart from individual component failures is lacking at present. In this paper, we propose a consolidated model that predicts the lower bound of reliability/ availability considering functional failures for a given system design. If design alternatives are available, this model can evaluate the minimum achievable reliability/ availability for each alternative. Based on the evaluation results, the system design with comparatively higher reliability can be chosen. At the early design stages, component-level technical details of the system are usually unavailable, but system functions can be identified from the requirements specification document. In this research, we map the system functions to abstract components at the conceptual design stage. For example, consider that a functional requirement of a system is "temperature measurement" and this function should have 99.99% reliability. As per the proposed model, the function "measure temperature" is mapped to a generic "temperature sensor" with a maximum on-demand failure probability of 0.0001, complying with Safety Integrity Level 4 (SIL 4) [18]. Once the process of identifying the generic hardware components is complete, standard reliability handbooks with generic failure data sources can be referred to in order to identify the failure modes of each component with their associated probabilities of failure. All modes of a component, including the failure modes and the normal working mode, are termed component operation modes in this paper. Similarly, the embedded software controller operation modes can be identified from the System Requirement Specification document. Based on the operation modes of the generic hardware components and the software controller, we configure the system, which defines the inter-component dependencies and the flow of function execution. Finally, we simulate random operation scenarios (input data variations) to test the system functionalities and predict the reliability/ availability based on the simulation results. In this case, the simulation-based approach is helpful as the number of operational/ non-operational states of the system is quite large. Modeling such a huge state space using a Markov chain based approach is a tedious job. In this paper, the reliability of the system is obtained as the ratio of the total number of times the system responds within the operational limit (correct range/ value) to the total number of simulation iterations. Availability of the system is predicted from the same ratio with consideration of recovery from failed states.

The proposed reliability/ availability prediction model also incorporates two types of interaction failures apart from input variation and individual hardware/ software component failures. These interaction failures are: a) hardware-driven software failure, and b) software-driven hardware failure. For better understanding of the proposed model, we have demonstrated its application to an aircraft fuel control system as a case study in Section 3. Evaluation of transient reliability and steady-state availability is also explained in the case study. The transient reliability/ steady-state availability results obtained for the case study are compared with established approaches in Section 4.

The rest of this paper is organized as follows: Section 2 presents a review of the existing work. Section 3 proposes a functional failure based model for early reliability prediction. Section 4 presents a validation of the proposed model. Finally, Section 5 summarises the important contributions of this research work.
2. Background of the Research
The existing reliability/ availability/ failure analysis models based on HW-SW interaction failures can be divided into two groups: 1) models considering only hardware-driven software interaction failures [5-8, 19], and 2) models considering both hardware-driven software and software-driven hardware interaction failures [3, 9-12].

Reliability prediction approaches based on hardware-driven software interaction failures used Markov models, derivatives of Markov models, or other stochastic processes like Stochastic Petri Nets [5-7]. For example, Teng et al. [5] used a Markov chain for system reliability modeling. They considered that a system fails due to hardware, software or hardware-software interaction failures. They assumed a Weibull distribution for hardware failures, an NHPP model for software failures and a Markov chain model for HW-SW interaction failures. They mentioned that deterioration of hardware manifests into system failure if it remains undetected or gets ignored. Finally, they evaluated system reliability as the product of the independent hardware reliability, the independent software reliability and the hardware-software interaction reliability.

Kanoun and Ortalo-Borrel [7] used derivatives of Generalized Stochastic Petri Nets (GSPN) and Markov chains for dependability modeling of a distributed combined hardware-software system. They considered that master software installed in the main controller system interacts with the slave software counterparts installed in the distributed peers. In this distributed structure, Kanoun and Ortalo-Borrel [7] considered the possibility of two types of interactions: a) interactions with the internal components of the system, and b) interactions with external components/ systems. They modeled the dependability of the software, the hardware and their interactions using Petri Nets. The Petri Nets produce a reachability graph that is identical to a Continuous Time Markov Chain (CTMC). However, they could not deduce any concrete conclusion due to state-space explosion.

Another distinct reliability/ availability prediction approach was presented by Sumita and Masuda [8]. They formulated system failure as a multivariate stochastic process using the matrix Laguerre transform and a semi-Markov process. They assumed that hardware failures are independent of the software. They modeled hardware failures as an alternating renewal process. They considered exponentially distributed uptime and any general distribution for downtime. They illustrated that, if hardware related failures lead to software failure, two different situations may arise. First, a software repair may complete without any interruption from hardware failures. Second, a hardware failure may interrupt the software repair operation. They modeled both situations using stochastic processes and computed system reliability using the matrix Laguerre transform. Finally, they demonstrated the efficacy of their proposed methodology using a numerical example.

Costes et al. [6] proposed a reliability/ availability approach for a repairable system using Markov state-based modeling. They assumed that hardware failure follows a Poisson distribution with a known failure rate whereas the software failure rate is constant. They derived the software failure behavior model from the previous literature [20]. This model considered that the residual errors of the software are unknown and the debugging process is imperfect. At first, they studied the impact of hardware and software failures on a non-redundant computer system. Then, they applied the learning from the non-redundant system to the redundant system. After that, they compared the obtained availability of the redundant and non-redundant systems. Finally, they concluded that redundancy of the hardware/ software parts increased the availability of the system.
Roy et al. [19] proposed a reliability framework for Phasor Measurement Units (PMU) as an extension of the work presented by Teng et al. [5]. At first, they modeled hardware component failures through a Weibull distribution, software component failures using an NHPP model and hardware-software interaction failures using a Markov chain. Then, they predicted the system reliability as the product of the independent hardware reliability, the independent software reliability and the hardware-software interaction reliability. To validate their model, they used Monte Carlo simulation (MCS). During the validation process, they generated failure data for each component using MCS. Then, they identified the failure distribution of each component. Subsequently, they estimated the system reliability assuming that all components are in series. Finally, they validated the model by comparing the predicted reliability with the estimated reliability.

Another group of researchers considered hardware-driven software and software-driven hardware interaction failures together for the failure/ reliability analysis of the system. They used the Functional Failure Identification and Propagation (FFIP) framework or its extensions for reliability/ failure modeling. Jensen et al. [10] were probably the first to introduce the FFIP framework for combined HW-SW system failure analysis. They developed the functional layout of the system following the FFIP framework. Then, they analysed the material/ information flow along the flow-paths of the functional layout to identify critical nodes. Subsequently, they used a reasoning based approach to monitor the flow level at the critical nodes to restrict fault propagation in the system. FFIP combines system modeling and behavioral simulation approaches for failure analysis at the early design phase of system development. Later, FFIP was adopted by others for qualitative reliability/ failure analysis of safety-critical systems [3, 11, 21].
Tumer and Smidts [3] adopted the FFIP framework for high-level system modeling and failure analysis. They used a five step approach: a) modeling the functional layout and system configuration using abstract components, b) ascertaining the failure states of each component based on specified input and output flow information, c) identifying system components which can act as checkpoints to sense failures of the previous node (component), d) apprehending the abnormal behavior of the predecessor node and identifying the mechanisms to stop the propagation of failure, and e) evaluating different scenarios against a predecided set of rules for the alternate flow-path. This model used checkpoints to monitor fault propagation. If a fault is manifested, the model alters the flow path to stop further propagation. They also demonstrated an extension of FFIP to the software domain, where they used a UML based system modeling approach. However, the central idea is the same as the traditional FFIP.

Sierla et al. [11] modified the FFIP framework for the failure analysis of systems that control concurrent processes. They demonstrated concurrent process integration for the software controller in the context of failure analysis. Moreover, they also considered flows across the boundaries of mechatronic domains that are hardly covered by the contemporary FFIP based models. They used SysML to implement the functional model and configured the system behavior using a configuration flow graph (CFG). The component behavioral models define the formulation of output values from input values using statechart diagrams. Then the flows of material, energy, and signal across domain boundaries were analysed using an FFIP based simulation approach. Finally, the graphical output of Simulink/ Stateflow was used to analyse the abnormal flow levels along the FFIP path.

Papakonstantinou et al. [21] identified some drawbacks of the FFIP based models. They argued that using different functional models (alternative system designs) of the same system will provide different outcomes; moreover, the framework could not suggest the best alternative. To overcome such drawbacks, they proposed a model considering alternate flow paths for mitigating failure propagation using the Simulink/ Stateflow environment. They analysed the risk of failure propagation using a modified FFIP approach. Their approach was illustrated using the example of a boiling water nuclear reactor. However, the way their approach integrates concurrent safety processes to mitigate risk is not clear.

Papakonstantinou et al. [15] proposed another approach in an effort to improve their previous work [21]. Their extended approach considered a Hierarchical Functional Fault Detection and Identification (HFFDI) framework that combines machine learning techniques and the traditional FFIP for failure analysis. The machine learning techniques were used for fault detection and identification (FDI) from historical data whereas FFIP was used for functional decomposition of the system. They implemented the HFFDI framework for a complex nuclear power plant system as a case study. Then, they compared the failure analysis
results of HFFDI with an FDI based approach for the same case study. Finally, they concluded that HFFDI gave an edge over its counterpart. The results revealed that, in two-fault scenarios, HFFDI could isolate one fault with 79% accuracy and both faults with 13% accuracy. In three-fault scenarios, HFFDI could isolate single faults with 69% accuracy, two faults with 22% accuracy and all three faults with 1% accuracy.

Mutha et al. [12] claimed that the traditional FFIP framework, which is efficient in detecting electromechanical faults, hardly detects faults in cross-domain functionalities. To overcome such problems, they proposed the Integrated System Failure Analysis (ISFA) approach that identifies and analyzes faults of cross-domain functionalities. As a part of ISFA they introduced a new simulation mechanism, named the Failure Propagation and Simulation Approach (FPSA). The FPSA works on the principles of FFIP. They applied the ISFA technique to a holdup tank as a case study. They demonstrated two instances of commonly occurring faults that cause system failure. Based on the results of the case study, they presented the efficiency of the ISFA approach in analyzing faults in a combined hardware-software system.

Later, Diao et al. [14] also used the Integrated System Failure Analysis (ISFA) framework [12] for the combined study of hardware-software faults. As a novel feature, they added an online monitoring (OLM) system to ISFA. The OLM isolated the potential faults in the critical components. They implemented their proposed methodology for a nuclear hybrid energy system as a case study. In this regard, they configured the system model using a conceptual design of the components. Then, they analyzed the fault propagation using OLM. Subsequently, they evaluated the effectiveness of the fault detection and diagnosis techniques of their proposed model using functional simulation. Based on the simulation results they proposed an optimization plan for the OLM system. Finally, the correctness of their methodology was verified through some experiments on a hardware-in-the-loop system.

It can be observed that the above quantitative reliability/ availability/ dependability prediction models only consider hardware-driven software interaction failures [5-8, 19]. They do not consider the impact of software-driven hardware interaction failures. On the other hand, the models pertaining to failure analysis considered both hardware-driven software and software-driven hardware interaction failures [3, 10-12, 14, 15, 21]. Unfortunately, the reported failure analysis models are largely limited to qualitative reliability analysis and do not provide any quantitative reliability evaluation. Therefore, a scarcity of literature is noticed in the area of quantitative reliability/ availability prediction considering the entire spectrum of interaction failures.

3. Proposed Combined Hardware-Software System Reliability/ Availability Prediction Model
We propose a simulation-based reliability/ availability prediction model for combined hardware-software systems. The applicability of the proposed model is limited to classical embedded systems with specific functionalities and real-time computing constraints. The software that controls the functionalities of the system remains embedded in a microcontroller. All the peripheral input/ output devices are kept connected to the microcontroller for serving the communication with external entities. We have attempted to predict the worst case achievable reliability/ availability of a system for a given conceptual design. As mentioned earlier, the proposed model can identify the most reliable system design among the available alternatives. However, the predicted reliability/ availability values may not accurately match the post-development reliability/ availability values estimated through system testing. This is because the proposed model does not consider the actual system components while predicting the reliability/ availability. It maps the system functions to functionally equivalent generic components to configure the system. Finally, we simulate different operational scenarios (input datasets) to test the system functions and predict reliability/ availability based on the simulation results. The proposed model can significantly reduce the production cost of safety-critical systems. During the post-development testing phase, if the system does not meet the reliability requirement due to an inefficient design, it incurs a huge cost for rework. Even reliability improvement methods at the end of the development cycle force ad-hoc solutions due to budgetary or time constraints. Such approaches may compromise product quality, reliability and safety. Therefore, it is important to identify the most reliable system design at the beginning of the high-level development processes.

3.1. Assumptions
1) The System Requirement and Specification (SyRS) document is available at the early design phase of system development.
2) Operation modes of a hardware component are divided into three categories:
a) Normal: The component performs its designated work adequately.
b) Degraded/ partial failure: In this mode the component works in a limited manner. It is also of two types: a) additive polarity: the partial failure enhances the signal strength/ performance of the component; b) subtractive polarity: the partial failure diminishes the signal strength/ performance of the component.
c) Complete failure: The overall functionality of the component gets disrupted. Such failures can be permanent or transient.
3) The maximum allowable failure probability and the failure modes of each hardware component are known.
4) Based on system functionality, software controller operational modes are broadly divided into two categories:
a. Working: The response of the software controller matches the system requirement specification document for a given set of inputs. It is of two types: i) normal working: the software performs its designated work adequately without any hardware component failure; and ii) fault-tolerant: the software performs its designated work adequately even when some hardware component has failed.
b. Failure: The response of the software controller does not match the system requirement specification document for a given set of inputs.
5) Hardware-software interaction has two operation modes:
a. Normal: The successful exchange of data/ signals/ material among the hardware-software components.
b. Failure: The unsuccessful exchange of data/ signals among the hardware-software components. Such failures are of two types: i) hardware-driven software interaction failure: transient hardware failures (like indeterminate memory and delay) lead to this type of failure; and ii) software-driven hardware interaction failure: under some operating conditions the software response may be undefined; in such situations the software operation becomes uncontrollable, and this may lead to hardware breakdown.
3.2. Proposed Model
We briefly explain each step of the proposed reliability/ availability prediction model for the combined hardware-software system. Subsequently, we apply the proposed model to a case study. Then, we validate the reliability/ availability results obtained using the proposed model against an already established approach. The steps of the proposed model are the following:
3.2.1. Functional Requirement Identification
At first, we need to identify all the functions that the system should perform to fulfil the user requirements. For this purpose, we refer to the System Requirement Specification (SyRS) document. Based on the information available in the SyRS document during the early design stage, a list of $n$ system functions is prepared. Any function in this list is referred to as $F_i$, where $i = 1, 2, \ldots, n$.
3.2.2. Configuration Model
The functions identified in the above step are mapped to conceptual design components. Each function ($F_i$) gets replaced by an abstract component $C_i$, where $i = 1, 2, \ldots, n$. The mapping process follows the same logic as used in the FFIP framework for representing the configuration model [3, 9-12]. During configuration modeling, we do not look into the technical details of the components; we only consider the components as generic types. Out of these $n$ components, one is a microcontroller in which the software controller $S$ is embedded. The other components are peripheral devices
connected to the microcontroller for supplying input data/ signals or producing outputs.

3.2.3. Component Behavior Model
Each component $C_i$ has $m$ operation modes (normal/ failure), and the component behavior changes with its operation mode. Past experience from similar projects or standard reliability handbooks with generic failure data sources is referred to in order to identify the failure modes of each component and their failure probabilities. As we are interested in evaluating the worst case achievable reliability/ availability, we consider the maximum possible failure probability (the sum of the probabilities of occurrence of the failure modes) for each component as recorded in the data source. A mapping function $\phi$ maps component $C_i$ to operation mode $O_{ij}$ with a probability of occurrence $p_{ij}$:

$\phi : C_i \to O_{ij}$, where $j = 1, 2, \ldots, m$ and $\sum_{j=1}^{m} p_{ij} = 1$.

We use a digital simulation platform where random sampling is performed to select the operation mode of each component. Consider that at any instant operation mode $O_{ij}$ of the component $C_i$ occurs; the response of the component is then governed accordingly. For any input $x$, the operational state of component $C_i$ is denoted as $c_i(x)$. We categorize the operational state $c_i$ into the following three sets:

a) Normal ($c_i^{N}$): In this state the component response $c_i(x)$ does not deviate from the specified range/ value $[\alpha_i, \beta_i]$ of the system requirement specification (SyRS) document. It can be expressed as the following:

$c_i^{N} = \{\, x : \alpha_i \le c_i(x) \le \beta_i \,\}$

b) Partial Failure/ Degraded ($c_i^{P}$): Due to noise, moderate degradation of the materials, reverse polarity, oscillation, etc., the component response deviates from the specified value/ range of the SyRS. This is considered a partial failure of the component. The partial failure has additive polarity when the measured signal strength in state $c_i^{P}$ is higher than the specified $\beta_i$; it has subtractive polarity when the measured signal strength in state $c_i^{P}$ is lower than the specified $\alpha_i$. It is expressed as the following:

$c_i^{P} = \{\, x : c_i(x) > \beta_i \,\}$ or $c_i^{P} = \{\, x : c_i(x) < \alpha_i \,\}$

c) Failure ($c_i^{F}$): If the component does not respond at all, it is in the failure state. It is of two types: permanent and transient. Due to complete breakdown of the component, degradation of the materials, intermittent faults, open circuit, short circuit, etc., the component stops responding permanently ($c_i^{F_p}$). Sometimes, due to transient failures like an indeterminate bit value, delay in signal, improper synchronization, etc., the component may stop responding temporarily ($c_i^{F_t}$). In both cases, component failure can be expressed as the following:

$c_i^{F} = c_i^{F_p} \cup c_i^{F_t} = \{\, x : c_i(x) = \text{no response} \,\}$
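The mapping $\phi$ and the probabilities $p_{ij}$ translate directly into inverse-CDF sampling of one operation mode per component. The following MATLAB fragment is a minimal sketch of this selection step; the function name and the convention that mode 1 is the normal mode are our own assumptions, not part of the original model:

```matlab
function j = sample_mode(p)
% Select an operation mode index j for one component, where p(j) is the
% probability p_ij of mode j and sum(p) == 1 (mode 1 = normal mode).
j = find(rand < cumsum(p), 1);
if isempty(j), j = numel(p); end   % guard against floating-point round-off
end
```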
3.2.4. Software Controller Behavior Model
The system state is determined by the current operational state of the software controller. The operational states are broadly classified as working ($S_W$) and failure ($S_F$). The working state is further of two types: normal working ($S_{NW}$) and fault tolerant ($S_{FT}$). We notice three types of state transitions of a software controller if repair is not considered. These are: normal to fault tolerant ($T_{NW \to FT}$), normal to failure ($T_{NW \to F}$) and fault tolerant to failure ($T_{FT \to F}$). These state transitions occur due to one or more component failures or interaction failures. The impacts of component failures and interaction failures on the state transitions of the software controller are described in the Aggregate Component Behavior Model and the Interaction Behavior Model, respectively. Software controller state transitions due to the combined impact of component and interaction failures are represented in the Software Operation Behavior Model.

Aggregate component behavior model
As mentioned above, failure of specific sets of components may trigger a state transition of the software controller. We refer to the SyRS document to identify these sets of components. Such a set of components can be an element of either of the following two distinct supersets:

a) An element (a set of components that failed) of the first superset triggers a state transition of the software controller from the normal state to the fault-tolerant state. As an illustration, consider that $b$ is a set of $k$ components. If all $k$ components of the set $b$ fail, the system reaches the fault-tolerant state. Now, consider that $b$ itself is an element of the superset $\mathcal{B}$, which represents the collection of all sets of components that trigger the state transition from normal to fault tolerant. Any element $b$ of the superset $\mathcal{B}$ is represented as the following:

$b = \{C_1, \ldots, C_k\} \in \mathcal{B}$ such that, for any input $x$, $\bigcap_{C_i \in b} \{\, c_i(x) \in c_i^{F} \,\}$ holds.

b) An element (a set of components that failed) of the other superset triggers a state transition of the software controller from the normal/ fault-tolerant state to the complete failure state. For illustration, consider that $d$ is a set of $l$ components. If all $l$ components fail, the system reaches the complete failure state. Now, consider that $d$ itself is an element of another superset $\mathcal{D}$, which represents the collection of all sets of components that trigger the state transition from the normal/ fault-tolerant state to the failure state. Any element $d$ of the superset $\mathcal{D}$ is represented as follows:

$d = \{C_1, \ldots, C_l\} \in \mathcal{D}$ such that, for any input $x$, $\bigcap_{C_i \in d} \{\, c_i(x) \in c_i^{F} \,\}$ holds.

Interaction Behavior Model
The interaction failures are divided into two major groups: software-driven hardware failures and hardware-driven software failures. Sometimes the software controller fails to restrict the system within its operational limits due to exceptional input conditions, out-of-range inputs, logical errors for degraded inputs, etc. Under such exceptional conditions, the control signal generated by the software controller causes malfunctioning of the associated peripheral components. This is considered software-driven hardware interaction failure. We refer to the SyRS to identify the potential exceptional input signals invoked by sets of degraded components ($c_i(x) \in c_i^{P}$). Consider that $e$ is a set of $r$ components. If all $r$ components of the set $e$ generate exceptional input conditions to the software controller due to their degraded modes, the system fails. Now, consider that $e$ itself is an element of the superset $\mathcal{E}$, which represents the collection of all sets of components that generate exceptional input conditions to the software controller. Any element $e$ of the superset $\mathcal{E}$ is represented as the following:

$e = \{C_1, \ldots, C_r\} \in \mathcal{E}$ such that, for any input $x$, $\bigcap_{C_i \in e} \{\, c_i(x) \in c_i^{P} \,\}$ holds.

Transient behavior of the components, like an indeterminate bit value, delay in signal, improper synchronization, etc. ($c_i(x) \in c_i^{F_t}$), also leads to software failure. This is considered hardware-driven software interaction failure. Consider that $g$ is a set of $s$ components. If all $s$ components of the set $g$ undergo transient failure, the software controller fails. Now, consider that $g$ itself is an element of the superset $\mathcal{G}$, which represents the collection of all sets of components that undergo transient failure. Any element $g$ of the superset $\mathcal{G}$ is represented as the following:

$g = \{C_1, \ldots, C_s\} \in \mathcal{G}$ such that, for any input $x$, $\bigcap_{C_i \in g} \{\, c_i(x) \in c_i^{F_t} \,\}$ holds.

Software Operation Behavior
The operational states of a software controller are of three types: normal ($S_{NW}$), fault-tolerant ($S_{FT}$), and failure ($S_F$). Due to the combined impact of the component failures and interaction failures, the following state transitions occur:

a) The state transition from normal working to fault tolerant due to component failures ($b \in \mathcal{B}$) is represented as the following:

$T_{NW \to FT} : S_{NW} \to S_{FT}$

b) The state transition from normal working to the failure state due to different component failures/ interaction failures ($d \in \mathcal{D}$, $e \in \mathcal{E}$ or $g \in \mathcal{G}$) is represented as the following:

$T_{NW \to F} : S_{NW} \to S_{F}$

c) The state transition from fault tolerant to the failure state due to different component failures/ interaction failures ($d \in \mathcal{D}$, $e \in \mathcal{E}$ or $g \in \mathcal{G}$) is represented as the following:

$T_{FT \to F} : S_{FT} \to S_{F}$

3.2.5. System Behavior Simulation
At the beginning of the simulation process, we assume that the system is in the normal working state. Therefore, we start the simulation process by setting the operation mode of each component to normal. During the simulation process, we randomize the occurrence of the operation modes of each component. We know that in each operation mode the component behavior follows a distinct trend. Therefore, during the
simulation process the components generate random signals as input to the software controller. We simulate the software controller for a large set of random input signals. Finally, the response of the software controller is recorded at each simulation iteration and reliability/ availability is predicted on the basis of the simulation results.
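Putting Sections 3.2.3-3.2.5 together, each simulation iteration samples one mode per component, maps the failed/ degraded/ transient components onto the supersets $\mathcal{B}$, $\mathcal{D}$, $\mathcal{E}$ and $\mathcal{G}$, and counts functional failures, which Section 3.2.6 turns into the reliability estimate. The MATLAB sketch below is our own illustration of this loop; the helper names, the encoding of the supersets as cell arrays of component index vectors, and the mode-index conventions are assumptions for the sketch, not artifacts of the paper:

```matlab
function R = predict_reliability(L, P, B, D, E, G)
% Monte Carlo prediction of worst case reliability (sketch).
% L : number of simulation iterations
% P : n-by-m matrix, P(i,j) = p_ij (row-wise mode probabilities, mode 1 = normal)
% B, D, E, G : supersets as cell arrays of component index vectors
n = size(P, 1);
nFail = 0;                                  % counts the transitions u + w
inSet = @(S, flags) any(cellfun(@(s) all(flags(s)), S));
for it = 1:L
    mode = zeros(1, n);
    for i = 1:n
        mode(i) = sample_mode(P(i, :));     % sketch from Section 3.2.3
    end
    isDegraded  = mode == 2;                % assumed: mode 2 = partial failure
    isFailed    = mode >= 3;                % assumed: modes >= 3 = complete failure
    isTransient = mode == 4;                % assumed: mode 4 = transient failure
    if inSet(D, isFailed) || inSet(E, isDegraded) || inSet(G, isTransient)
        nFail = nFail + 1;                  % T(NW->F) or T(FT->F): functional failure
    end                                     % inSet(B, isFailed) alone -> fault tolerant
end
R = 1 - nFail / L;                          % Section 3.2.6: R = 1 - (u + w)/L
% With instant repair of failed states, the same ratio estimates availability.
end
```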
3.2.6. Reliability/ Availability Prediction
The basis of reliability/ availability prediction in the proposed model is functional failure. The SyRS document specifies the desired range of the system response. If the system response exceeds the specified range for any input dataset, it is considered a functional failure. During each simulation iteration, random sampling is performed to select the operation mode of each component. The random sampling of the operation modes creates input data variations for the software controller. Then, we feed these random input data to the software controller for execution and observe the response. For some input datasets the system state transition may not occur, whereas for others transitions may occur. Three distinct state transitions are observed: normal to failure ($T_{NW \to F}$), normal to fault-tolerant ($T_{NW \to FT}$), and fault-tolerant to failure ($T_{FT \to F}$). We consider that the estimated numbers of times the transitions $T_{NW \to F}$, $T_{NW \to FT}$ and $T_{FT \to F}$ occur are $u$, $v$, and $w$, respectively. The total number of simulation iterations ($L$) is known, as we run the simulation process on the digital platform as per our requirement. Therefore, the unreliability of the system ($\bar{R}$) is estimated as the ratio of the total number of times ($u + w$) the system response undergoes functional failure to the total number of simulation iterations ($L$). The unreliability of the system is expressed as $\bar{R} = (u + w)/L$. So, the reliability of the system is expressed as $R = 1 - (u + w)/L$.

If we consider instant repair of the failure states, this model can be used to estimate the availability of the system. In such a case, we assume that the system state transitions from failure to fault tolerant ($T_{F \to FT}$) and from fault tolerant to normal ($T_{FT \to NW}$) are instantaneous. We consider that the estimated numbers of times the transitions $T_{NW \to F}$, $T_{NW \to FT}$, and $T_{FT \to F}$ occur are $u'$, $v'$, and $w'$, respectively. So, the total number of times the system response undergoes functional failure is estimated as ($u' + w'$). If the total number of simulation iterations ($L'$) is known, then the availability of the system is expressed as $A = 1 - (u' + w')/L'$.

Each step of our presented model is demonstrated in the subsequent parts of this section through a case study of a fault-tolerant aircraft fuel control system. The system has been taken from the MathWorks website [22] with some admissible changes as per our needs.
3.3. Case Study
An aircraft fuel control system comprises an engine, an actuator, a fuel rate controller (FRC), and four sensors. These sensors are the throttle sensor, the engine fan speed sensor, the exhaust gas oxygen (EGO) sensor, and the manifold absolute pressure (MAP) sensor. Functional details of each component are given in the subsequent parts. We assume that the microcontroller and three out of the four sensors must give readings in an acceptable range for the system to operate. Therefore, failure of the FRC system can be defined as the functional failure of at least two sensors or failure of the microcontroller. We must know the failure modes and failure probabilities of the hardware components to evaluate the reliability/ availability of the FRC system. The System Requirement Specification (SyRS) document of the FRC also needs to be available before we start the evaluation process. The proposed model is applied to a portion of the whole FRC system for the reliability/ availability evaluation. This portion consists of the fuel rate controller (FRC) and the four sensors providing input to the FRC. Altogether these five components are referred to as the Fuel Rate Control System (FRCS) in the rest of the paper. The following parts of this section present the reliability/ availability modelling of the FRCS. We have used the MATLAB Simulink/ Stateflow environment for this case study.
3.3.1. Functional Requirement Identification
We have listed the functional requirements of the aircraft fuel control system in the first column of Table 2. In this regard, we have considered the information available at the MathWorks website as the SyRS document of the system [22]. The aircraft fuel control system requires four functional inputs. These are exhaust gas oxygen (EGO), manifold pressure (MAP), throttle angle (open/ close), and engine fan speed. Among these four, throttle and engine speed are the forward signals; the others, EGO and MAP, are the feedback signals. The throttle signal gives information about the opening/ closing of the throttle valve. Based on the throttle valve angle, the system estimates the required amount of air flow to the engine. The fan speed signal gives information about the rotational speed of the turbine. These two forward signals are fed to the fuel rate controller (FRC). The feedback signal EGO gives information about the amount of oxygen present at the engine, and the feedback signal MAP gives information about the air density at the engine. These two feedback signals feed the engine oxygen content and air density back to the FRC, respectively. The FRC, based on the input signals, determines the required fuel outflow rate for combustion and maintains the required proportion of fuel and air at the engine. It also maintains the internal temperature of the system for smooth combustion operation. The uninterrupted fuel supply to the combustor ensures turbine rotation for energy generation.
Table 2. System Functions and Functionally Equivalent Components

Functions (F_i)                  | Functionally Equivalent Component (C_i)
F1: Throttle Angle Open/ Close   | C1: Throttle
F2: Run Engine Fan               | C2: Engine Fan
F3: Exhaust Gas Oxygen Supply    | C3: Exhaust Gas Oxygen (EGO)
F4: Air Pressure Measure         | C4: Manifold Pressure (MAP)
F5: Sense Throttle Signal        | C5: Throttle Sensor
F6: Sense Fan Speed              | C6: Speed Sensor
F7: Sense EGO Signal             | C7: EGO Sensor
F8: Sense MAP Signal             | C8: MAP Sensor
F9: Control Fuel Outflow Rate    | C9: Fuel Rate Controller
F10: Run Engine                  | C10: Engine
Fig. 6. Configuration model of fault-tolerant aircraft fuel control system
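Read together with Table 2, the configuration model of Fig. 6 is a small directed flow graph. The MATLAB encoding below is our own reading of the figure; in particular, the feedback edges from the engine to the EGO and MAP components are inferred from the text of Section 3.3.1, not supplied by the paper:

```matlab
% Configuration model of Fig. 6 encoded as a directed signal/ data flow graph.
src = {'Throttle','EngineFan','EGO','MAP','ThrottleSensor','SpeedSensor', ...
       'EGOSensor','MAPSensor','FuelRateController','Engine','Engine'};
dst = {'ThrottleSensor','SpeedSensor','EGOSensor','MAPSensor', ...
       'FuelRateController','FuelRateController','FuelRateController', ...
       'FuelRateController','Engine','EGO','MAP'};
cfg = digraph(src, dst);   % base MATLAB directed graph (R2015b or later)
% plot(cfg)                % quick visual check of the flow-paths
```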
3.3.2. Configuration Model
We map the functional requirements of the aircraft fuel control system to a set of functionally equivalent abstract components, as listed in the second column of Table 2. The system configuration model is created using these components (refer to Fig. 6). We use a flow taxonomy, similar to that of the functional failure identification and propagation (FFIP) framework, to mark the signal/ data flows [23, 24].

3.3.3. Component Behavioral Model
Operational modes of the components are grouped into three categories: a) normal, b) degraded/ partial
failure, and c) complete failure. Some components, like the EGO sensor, may have an initial warm-up state. During this state, the component undergoes a preparatory phase as the feedback signal takes time to reach the operational range. From the literature [25], we have identified the degraded (partial failure) modes of EGO sensors. These degraded modes cause deviation of the response signal with additive/ subtractive polarity. As identified in the literature [25], these degraded modes are: a) incorrect signal/ calibration error, b) error in the transmission line, c) error in the computation device, and d) improper response to the recipient. Again, the complete failure modes of the EGO sensor that cause loss of signal are: a) loss of signal from the sensor, b) loss of signal from the transmission line, c) short circuit, and d) open circuit [25]. In Table 3 we have listed the above-mentioned failure modes of the EGO sensor. The probability of occurrence of each failure mode is also given in Table 3, as identified in the literature [26]. Here we consider, on demand, a maximum of one failure in ten thousand observations, as we assume the system requirement specification instructed the use of sensor components that qualify for Safety Integrity Level (SIL) 4. As represented in Table 3, during a complete failure of the sensor, the signal strength becomes zero except in the short circuit case, in which it rises abruptly to high values. On the other hand, during a partial failure, the signal strength erroneously deviates from the actual value. This error follows the normal distribution N(µ=0, σ). We assume the standard deviation (σ) of the sensor signal due to incorrect signal, error in the transmission line, error in the computation device, and improper response to the recipient to be 0.3%, 0.4%, 0.5%, and 0.6% of the actual signal, respectively. All operational modes of the exhaust gas oxygen (EGO) sensor are shown in Fig. 7.

Similarly, the component behaviour analysis is performed for the other sensors and the microcontroller. Table 4 presents the actual input signal to each sensor and their acceptance ranges. Throttle angle and fan speed are the forward input signals whereas EGO and MAP are the feedback signals. For normal working of the system, the input signals should be within the acceptance ranges given in Table 4. Based on these input data we simulate the system in the next step of the model.
Fig. 7. EGO sensor component behavior model
Table 3. Failure modes, probability of each failure mode, and expected EGO sensor behavior in each failure mode

Partial failure
Failure mode (j)               | Probability (p_j) | EGO response, additive polarity        | EGO response, subtractive polarity
Incorrect signal               |                   | Ci(Normal) + N(0, 0.003×Ci(Normal))    | Ci(Normal) - N(0, 0.003×Ci(Normal))
Error in transmission line     |                   | Ci(Normal) + N(0, 0.004×Ci(Normal))    | Ci(Normal) - N(0, 0.004×Ci(Normal))
Error in computation device    |                   | Ci(Normal) + N(0, 0.005×Ci(Normal))    | Ci(Normal) - N(0, 0.005×Ci(Normal))
Improper response to recipient | 0.00058           | Ci(Normal) + N(0, 0.006×Ci(Normal))    | Ci(Normal) - N(0, 0.006×Ci(Normal))

Complete failure
Failure mode (j)                      | Probability (p_j) | EGO response
Loss of signal from sensor            | 0.0001            | No_signal
Loss of signal from transmission line |                   | No_signal
Short circuit                         | 0.0002            | ∞
Open circuit                          | 0.00012           | No_signal

Table 4. Initial input signals and acceptance ranges of the components

                     | Throttle Angle (TA) (degree) | Fan Speed (FS) (rad/s) | EGO (volt)            | MAP (bar)             | Microcontroller
Initial input signal | 20                           | 300                    | N/A (feedback signal) | N/A (feedback signal) | N/A
Acceptance range     | 3 < TA < 90                  | 50 < FS < 628          | EGO < 1.2             | 0.05 < MAP < 0.95     | [On, Off]
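Table 3 specifies, for each failure mode, a selection probability and a response law, which is exactly what a per-component sampler needs. The sketch below draws one EGO reading; the function name and the parameterization of the mode-probability vectors are our assumptions for illustration:

```matlab
function ego = sample_ego(egoNormal, pPartial, pComplete)
% One random EGO sensor reading per the component behavior model (sketch).
% egoNormal : fault-free feedback value (volt)
% pPartial  : 1x4 probabilities of the partial-failure modes of Table 3
% pComplete : 1x4 probabilities of the complete-failure modes of Table 3
sigma = [0.003 0.004 0.005 0.006];     % Table 3 error scales (0.3%..0.6%)
edges = cumsum([pPartial pComplete]);  % mode-selection thresholds
m = find(rand < edges, 1);             % first mode whose threshold exceeds rand
if isempty(m)                          % remaining probability mass: normal mode
    ego = egoNormal;
elseif m <= 4                          % partial failure; the sign of the error
    ego = egoNormal + sigma(m) * egoNormal * randn;  % gives the polarity
elseif m == 7                          % short circuit: signal rises abruptly
    ego = Inf;
else                                   % loss of signal / open circuit
    ego = 0;
end
end
```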
3.3.4. Software Controller Behavioral Model
The software controller behavior model is presented using three concurrent behavior models: the aggregate component failure behavior, the interaction failure behavior, and the software operation behavior. These three models are inter-dependent. In the following, we explain them with the example of the FRCS.

Fig. 8. Aggregate components behavior of the software controller

Fig. 9. Interaction behavior model of the software controller

The aggregate component operational behavior explains the impact of single or composite component failures on the state transitions of the software controller. Based on the system specification at the MathWorks website, we have identified two different sets of component(s). The failure of one set of components leads to a state transition of the software controller from the normal state to the fault-tolerant state. For example, each single sensor (throttle/ speed/ EGO/ MAP) of the FRCS belongs to this set. The failure of the other set of components leads to a state transition of the software controller from the normal/ fault-tolerant state to complete failure. The microcontroller and any combination of two or more sensors belong to this set. We have modeled the aggregate component operational behavior in Fig. 8 using the Simulink/ Stateflow
environment. In this figure, four operational states are identified: all components working (all_working), the fault-tolerant state due to a single sensor failure (single_sensors_fail), system failure due to more than one sensor failure (multi_sensor_failure), and system failure due to microcontroller failure (chip_fail). If all sensors and the microcontroller are working, the fueling system keeps supplying fuel to the engine as normal. At any point, if the microcontroller is working and one out of the four sensors fails, the system transits to the fault-tolerant state. In this state, the system still supplies fuel to the engine, but the outflow rate may not remain steady. However, if multiple sensors and/ or the microcontroller fail, the fuel supply is disrupted and the engine may stall. At this point the software controller state transits from normal/ fault-tolerant to complete failure. The suspension of fuel supply continues until the system is restored to its operational condition. Fig. 8 shows the aggregate component operational behavior of the software controller.
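The four Stateflow states above reduce to a simple counting rule: the microcontroller plus at least three of the four sensors must be healthy. A minimal MATLAB restatement of that rule (function and variable names are ours):

```matlab
function st = frcs_state(sensorOK, microOK)
% FRCS aggregate rule: the microcontroller and at least 3 of the 4 sensors
% must read within range (see Fig. 8).
% sensorOK : 1x4 logical [throttle speed EGO MAP]; microOK : logical.
nFail = sum(~sensorOK);
if ~microOK || nFail >= 2
    st = "failure";         % multi_sensor_failure or chip_fail
elseif nFail == 1
    st = "fault_tolerant";  % single_sensors_fail (rich mode)
else
    st = "normal";          % all_working (low emission mode)
end
end
```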
The modeling of the interaction behavior for the FRCS is represented in Fig. 9. To demonstrate software-driven hardware interaction failure, we have defined some critical states. If the system reaches such states due to exceptional input conditions, it damages the associated hardware component. Based on the specification at the MathWorks website [22], we have listed such exceptional input conditions in Table 5. On the other hand, we have demonstrated two types of hardware-driven software interaction failures: memory inaccessibility and delay [27]. We have modeled memory inaccessibility using a Matlab/ Simulink function (memory_op()) that investigates the failure of any memory operation of the software controller. If the function returns true, it leads the corresponding operation running on the microprocessor to failure. Delay in operation occurs due to a low clock frequency of the microprocessor. If the function (delay_op()) returns true, it leads the corresponding operation running on the microprocessor to failure due to operational delay.

Table 5. Fuel rate controller (FRC) failures dependent on composite signal strengths

Sl/No. | Condition                                 | Sl/No. | Condition
1      | Throttle < 3 degree & Speed < 50 rad/s    | 10     | Throttle > 90 degree & MAP > 0.95 bar
2      | Throttle < 3 degree & Speed > 628 rad/s   | 11     | Speed < 50 rad/s & EGO > 1.2 volt
3      | Throttle < 3 degree & EGO > 1.2 volt      | 12     | Speed < 50 rad/s & MAP < 0.05 bar
4      | Throttle < 3 degree & MAP < 0.05 bar      | 13     | Speed < 50 rad/s & MAP > 0.95 bar
5      | Throttle < 3 degree & MAP > 0.95 bar      | 14     | Speed > 628 rad/s & EGO > 1.2 volt
6      | Throttle > 90 degree & Speed < 50 rad/s   | 15     | Speed > 628 rad/s & MAP < 0.05 bar
7      | Throttle > 90 degree & Speed > 628 rad/s  | 16     | Speed > 628 rad/s & MAP > 0.95 bar
8      | Throttle > 90 degree & EGO > 1.2 volt     | 17     | EGO > 1.2 volt & MAP < 0.05 bar
9      | Throttle > 90 degree & MAP < 0.05 bar     | 18     | EGO > 1.2 volt & MAP > 0.95 bar
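The 18 rows of Table 5 are exactly the pairwise combinations of one out-of-range condition per signal, so the whole table collapses to counting simultaneous range violations. A compact MATLAB sketch of this check (our own condensation of the table; the function name is assumed):

```matlab
function bad = exceptional_input(throttle, speed, ego, map)
% True when at least two signals are simultaneously out of range (Table 5):
% these composite conditions drive the FRC into a critical state
% (software-driven hardware interaction failure).
outT = throttle < 3    || throttle > 90;    % degrees
outS = speed    < 50   || speed    > 628;   % rad/s
outE = ego      > 1.2;                      % volts
outM = map      < 0.05 || map      > 0.95;  % bar
bad  = sum([outT outS outE outM]) >= 2;     % any pair of violations
end
```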
Fig. 10. Software operation behavior of the software controller
The software operational behavior model (shown in Fig. 10) represents the state transitions of the software controller due to component or interaction failures. The software controller of the FRCS broadly has two operation modes: fuel running (working) and disabled (failure). If all system components are working properly, fuel flows at a constant rate. This is considered the fuel running mode (working). This running mode can be analyzed further into two types, based on the proportion of air and fuel in the mixture: the low emission mode (normal working) and the rich mode (fault tolerant). The low emission mode consists of the initial warm-up mode and the normal operational mode. At the beginning of operation, the oxygen level and air pressure may not be optimal at the combustor. During the warm-up mode, the software controller tries to bring the oxygen level and air pressure to an optimal condition based on the feedback signals. Once the software controller attains the desired operating condition after the warm-up mode, the system starts its normal operation. During the warm-up and normal operation modes, the fuel outflow rate remains comparatively low. However, if one out of the four sensors fails, the equilibrium is disturbed and the controller increases the fuel outflow rate to bring back normalcy. This situation is denoted as the rich emission mode. On the other hand, the fueling mode turns to the disabled mode if more than one sensor or the microcontroller fails, independently or due to interaction failure.

3.3.5. System Behavior Simulation
At the beginning of the simulation process we assume that the FRCS is in the normal working state. Therefore, we start the simulation process by setting the operation mode of each component to normal. The normal operating ranges of the components are given in Table 4. During the simulation process, random sampling is performed to select the operation mode of each component. We use the Matlab/ Simulink platform for this purpose. As mentioned in the component behavioral model, in each operation mode the component behavior follows a distinct trend. For example, the EGO sensor responses during the failure modes are given in Table 3 and the normal mode (EGO feedback signal range) is given in Table 4. During the simulation process, each component generates random signals based on its operation mode. These random input signals are fed to the software controller for execution. Finally, the response of the software controller is recorded at each simulation iteration, and reliability/ availability is predicted on the basis of the simulation results.
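One iteration of this case-study simulation can be assembled from the earlier sketches. Everything below (the reuse of one sampler shape for all four sensors, the per-sensor probability parameters, and the illustrative feedback values) is assumed for illustration; only the acceptance ranges come from Table 4:

```matlab
% One FRCS simulation iteration (sketch), combining the earlier helpers.
throttle = sample_ego(20,  pP_thr, pC_thr);   % hypothetical per-sensor mode
speed    = sample_ego(300, pP_spd, pC_spd);   % probabilities; 0.5 V and
ego      = sample_ego(0.5, pP_ego, pC_ego);   % 0.6 bar are illustrative
map      = sample_ego(0.6, pP_map, pC_map);   % feedback values
sensorOK = [throttle > 3 && throttle < 90, ...
            speed > 50 && speed < 628, ...
            ego < 1.2, ...
            map > 0.05 && map < 0.95];        % acceptance ranges of Table 4
microOK  = rand > pMicro;                     % microcontroller healthy this step
failed   = frcs_state(sensorOK, microOK) == "failure" ...
           || exceptional_input(throttle, speed, ego, map);
```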
3.3.6. Reliability/ Steady-State Availability Prediction of the System
We start the simulation process assuming the FRCS is in the normal working state. During each iteration a random input dataset is generated. The randomization of the components' operation modes creates input data variations for the software controller. Then, we feed these random input data to the software controller for execution and record the response. For some input datasets a state transition of the software controller may not occur, whereas for others a transition may occur. Three distinct state transitions are observed: normal to failure ($T_{NW \to F}$), normal to fault-tolerant ($T_{NW \to FT}$), and fault-tolerant to failure ($T_{FT \to F}$). As mentioned in the section on the software controller behavioral model, by observing the output of the FRCS at each simulation iteration we estimate the number of times these transitions occur. Briefly, if transition $T_{NW \to F}$ or $T_{FT \to F}$ occurs then the fuel outflow rate of the FRCS drops to zero, whereas in the other cases the outflow rate complies with the desired value (>0) as mentioned at the MathWorks website. We record the input datasets and the corresponding responses of the system using the Matlab/ Simulink platform.

Fig. 11. Reliability graph of the FRCS

Fig. 12. Availability graph for the FRCS
The total number of simulation iterations (L) is known as we run the simulation process on the digital platform. Say the estimated numbers of times the normal-to-failure, normal-to-fault-tolerant, and fault-tolerant-to-failure transitions occurred in the Matlab/ Simulink platform are u, v, and w, respectively. Therefore, the unreliability $\bar{R}$ of the FRCS is estimated as the ratio of the total number of times ($u + w$) the fuel outflow rate does not comply with the operational limit to the total number of simulation iterations ($L$). So, the reliability of the system is estimated as

$R = 1 - \frac{u + w}{L}.$

Initially we set a simulation time of 500 milliseconds (L = 503 iterations) and record the FRCS fuel outflow rate. Then we increase the simulation time until we get a stable $(u + w)/L$ ratio. We observed that the reliability of the system tends to zero when time tends to infinity. The transient reliability graph of the FRCS is given in Fig. 11. If we consider the FRCS as a non-repairable system then the transient reliability of the system can be defined as

$R(t) = e^{-\sum_i \lambda_i t},$

where $\lambda_i$ is the failure rate of the $i$-th component. In this proposed model, we have assumed that the failure rate of each component is constant, so the system reliability should follow an exponential distribution. It is clear from Fig. 11 that the transient reliability curve obtained using the proposed method decreases exponentially with increase in simulation iterations. If we increase the simulation iterations till the steady state, the plot (Fig. 11) will saturate at zero and no further change will be noticed. This fact can also be explained from the above reliability expression: if time t tends to infinity then the transient reliability $R(t)$ tends to zero.
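The estimation loop implied by this procedure can be sketched as follows. The helper run_simulation is hypothetical; it stands in for executing the Simulink model for L iterations and returning the recorded fuel outflow rates.

```matlab
% Sketch: estimate R = 1 - (u + w)/L from the recorded outflow and grow the
% simulation length until the failure ratio stabilizes.
% run_simulation is a hypothetical helper (1 x L vector of outflow rates).
L = 503;  R_prev = Inf;  tol = 1e-5;
while true
    outflow  = run_simulation(L);
    failures = sum(outflow <= 0);      % iterations violating the > 0 limit
    R = 1 - failures / L;
    if abs(R - R_prev) < tol, break; end
    R_prev = R;
    L = 2 * L;                         % increase simulation time step by step
end
fprintf('Stable reliability estimate: %.5f (L = %d)\n', R, L);
```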
Table 6. Steady-state Availability of the Fuel Rate Control System (FRCS)

System Failures due to Component Failure | System Failures due to Interaction Failure | Steady-state Availability
6 | 5 | 0.99983
To develop the availability model we have considered instant repair of the failure states of the system. In this regard, we estimate the total number of times ($n'$) the fuel outflow rate exceeds the operational limit and the total number of simulation iterations ($L'$). So, the availability of the system is estimated as $A = 1 - n'/L'$. Initially, we set the simulation time as 500 milliseconds and record the value of the $n'/L'$ ratio. Then, step by step, we increase the simulation time till the $n'/L'$ ratio reaches stability. At the point where $n'/L'$ gets stabilized, $1 - n'/L'$ is called the steady-state availability of the system. We have noticed that at 1000 milliseconds (100001 iterations) the fuel rate control system (FRCS) reaches steady-state availability. The availability graph of the FRCS is given in Fig. 12. Table 6 presents the system failures due to component failure, the system failures due to interaction failure, and the steady-state availability of the FRCS.

4. Validation of the Proposed Reliability/ Availability Prediction Models
We validate the proposed methodology by comparing the reliability/ availability values obtained using the proposed approach for the above case study with those obtained using an already established approach, the Petri Nets model. At first, we use Generalized Stochastic Petri Nets (GSPNs) to model the above fuel rate control system (FRCS). Then, we evaluate the transient reliability and steady-state availability of the FRCS using the GSPN. Finally, we compare the reliability/ availability values obtained using the GSPN model with the results that we obtained using our proposed model.

The GSPN marking produces a reachability graph that is equivalent to a Continuous Time Markov Chain (CTMC). Therefore, the state transition rates of the reachability graph are constant. The reachability graph does not have vanishing states; only tangible states are considered in constituting the CTMC. Fig. 13 shows a GSPN of the FRCS with one place (All_working_P1) enabled, containing four tokens. At this point, transitions t-failure-T1, s-failure-T2, e-failure-T3, m-failure-T4, or u-failure-T9 may occur due to throttle sensor (t), speed sensor (s), EGO sensor (e), MAP sensor (m), or microcontroller (u) failure, respectively. Transitions t-failure-T1, s-failure-T2, e-failure-T3, m-failure-T4, or u-failure-T9 may lead the system to the t-failure_P2, s-failure_P3, e-failure_P4, m-failure_P5, and u-failure_P12 places, respectively. From the places t-failure_P2, s-failure_P3, e-failure_P4, m-failure_P5, and u-failure_P12 it may again return to the All_working_P1 place through the transitions t-repair-T5, s-repair-T6, e-repair-T7, m-repair-T8, and u-repair-T10, respectively.
On the other hand, place t-failure_P2 may fire transitions s-failure-T19, e-failure-T20, m-failure-T21, and u-failure-T11 to reach the places t&s-failure_P6, t&e-failure_P7, t&m-failure_P8, and t&u-failure_P13, respectively. Again, from the places t&s-failure_P6, t&e-failure_P7, t&m-failure_P8, and t&u-failure_P13 the system may return to the place t-failure_P2 through the transitions s-repair-T31, e-repair-T33, m-repair-T35, and u-repair-T12, respectively.

The place s-failure_P3 may fire transitions t-failure-T22, e-failure-T23, m-failure-T24, and u-failure-T13 to reach the places t&s-failure_P6, s&e-failure_P9, s&m-failure_P10, and s&u-failure_P14, respectively. Again, from the places t&s-failure_P6, s&e-failure_P9, s&m-failure_P10, and s&u-failure_P14 it may return to the place s-failure_P3 through the transitions t-repair-T32, e-repair-T38, m-repair-T39, and u-repair-T14, respectively.

The place e-failure_P4 may fire the transitions t-failure-T25, s-failure-T26, m-failure-T27, and u-failure-T15 to reach the places t&e-failure_P7, s&e-failure_P9, e&m-failure_P11, and e&u-failure_P15, respectively. Again, from the places t&e-failure_P7, s&e-failure_P9, e&m-failure_P11, and e&u-failure_P15 it may return to e-failure_P4 through the transitions t-repair-T34, s-repair-T37, m-repair-T41, and u-repair-T16, respectively.
The place m-failure_P5 may fire the transitions t-failure-T28, s-failure-T29, e-failure-T30, and u-failure-T17 to reach the places t&m-failure_P8, s&m-failure_P10, e&m-failure_P11, and m&u-failure_P16, respectively. Again, from the places t&m-failure_P8, s&m-failure_P10, e&m-failure_P11, and m&u-failure_P16 it may return to m-failure_P5 through the transitions t-repair-T36, s-repair-T40, e-repair-T42, and u-repair-T18, respectively.
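To keep track of this place/transition structure, the failure and repair arcs can be enumerated programmatically. A sketch under the place numbering used above (P1 all-working, P2-P5 single sensor failures, P6-P11 sensor pairs, P12 microcontroller, P13-P16 sensor plus microcontroller); the triple encoding is a bookkeeping convention for this sketch, not part of the GSPN formalism.

```matlab
% Sketch: enumerate the GSPN failure arcs as (from, to, rate-index) triples;
% rate index 1..5 -> lambda_1..lambda_5, and every arc has a repair back-arc
% with rate mu. The 21 failure + 21 repair arcs match transitions T1-T42.
arcs = [ ...
    1  2 1;  1  3 2;  1  4 3;  1  5 4;  1 12 5;   % failures from All_working_P1
    2  6 2;  2  7 3;  2  8 4;  2 13 5;            % second failures after t
    3  6 1;  3  9 3;  3 10 4;  3 14 5;            % second failures after s
    4  7 1;  4  9 2;  4 11 4;  4 15 5;            % second failures after e
    5  8 1;  5 10 2;  5 11 3;  5 16 5];           % second failures after m
repairs = [arcs(:,2) arcs(:,1) zeros(size(arcs,1),1)];   % rate index 0 -> mu
fprintf('%d failure arcs, %d repair arcs, %d places\n', ...
        size(arcs,1), size(repairs,1), max(max(arcs(:,1:2))));
```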
Fig. 13. Petri Nets model of the Fuel Rate Controller System (FRCS)

The GSPN produces a reachability graph (Fig. 14) that is equivalent to a CTMC. No vanishing states are observed in the reachability graph; only 16 tangible states constitute the CTMC. In this model the infinitesimal generator is denoted as $Q = [q_{ij}]$, where $q_{ij}$ is the transition rate from state $s_i$ to state $s_j$. If there is no arc from $s_i$ to $s_j$ then $q_{ij} = 0$. We denote the steady-state vector as $\pi$, which satisfies the balance and normalization conditions. Therefore, we can express it as:

$\pi Q = 0, \qquad \sum_{i=1}^{16} \pi_i = 1.$
To calculate the transient probability of each state, we define $P_i(t)$ as the probability that the system is in state $i$ at time $t$. So, we have 16 first-order linear differential equations, where the failure probabilities of the throttle sensor ($\lambda_1$), speed sensor ($\lambda_2$), EGO sensor ($\lambda_3$), and MAP sensor ($\lambda_4$), the microcontroller failure probability ($\lambda_5$), and the repair probability ($\mu$) are known. The equations are as follows:

$\frac{dP_1(t)}{dt} = -(\lambda_1+\lambda_2+\lambda_3+\lambda_4+\lambda_5)\,P_1(t) + \mu\,[P_2(t)+P_3(t)+P_4(t)+P_5(t)+P_{12}(t)]$
$\frac{dP_2(t)}{dt} = -(\mu+\lambda_2+\lambda_3+\lambda_4+\lambda_5)\,P_2(t) + \lambda_1 P_1(t) + \mu\,[P_6(t)+P_7(t)+P_8(t)+P_{13}(t)]$
$\frac{dP_3(t)}{dt} = -(\mu+\lambda_1+\lambda_3+\lambda_4+\lambda_5)\,P_3(t) + \lambda_2 P_1(t) + \mu\,[P_6(t)+P_9(t)+P_{10}(t)+P_{14}(t)]$
$\frac{dP_4(t)}{dt} = -(\mu+\lambda_1+\lambda_2+\lambda_4+\lambda_5)\,P_4(t) + \lambda_3 P_1(t) + \mu\,[P_7(t)+P_9(t)+P_{11}(t)+P_{15}(t)]$
$\frac{dP_5(t)}{dt} = -(\mu+\lambda_1+\lambda_2+\lambda_3+\lambda_5)\,P_5(t) + \lambda_4 P_1(t) + \mu\,[P_8(t)+P_{10}(t)+P_{11}(t)+P_{16}(t)]$
$\frac{dP_6(t)}{dt} = -2\mu P_6(t) + \lambda_2 P_2(t) + \lambda_1 P_3(t)$
$\frac{dP_7(t)}{dt} = -2\mu P_7(t) + \lambda_3 P_2(t) + \lambda_1 P_4(t)$
$\frac{dP_8(t)}{dt} = -2\mu P_8(t) + \lambda_4 P_2(t) + \lambda_1 P_5(t)$
$\frac{dP_9(t)}{dt} = -2\mu P_9(t) + \lambda_3 P_3(t) + \lambda_2 P_4(t)$
$\frac{dP_{10}(t)}{dt} = -2\mu P_{10}(t) + \lambda_4 P_3(t) + \lambda_2 P_5(t)$
$\frac{dP_{11}(t)}{dt} = -2\mu P_{11}(t) + \lambda_4 P_4(t) + \lambda_3 P_5(t)$
$\frac{dP_{12}(t)}{dt} = -\mu P_{12}(t) + \lambda_5 P_1(t)$
$\frac{dP_{13}(t)}{dt} = -\mu P_{13}(t) + \lambda_5 P_2(t)$
$\frac{dP_{14}(t)}{dt} = -\mu P_{14}(t) + \lambda_5 P_3(t)$
$\frac{dP_{15}(t)}{dt} = -\mu P_{15}(t) + \lambda_5 P_4(t)$
$\frac{dP_{16}(t)}{dt} = -\mu P_{16}(t) + \lambda_5 P_5(t)$
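Once numeric rates are fixed, the 16 coupled equations can be solved directly through the matrix exponential, $P(t) = P(0)\,e^{Qt}$. Below is a sketch that treats the per-iteration failure probabilities of Table 7 as the rates $\lambda_1..\lambda_5$; the repair rate $\mu$ is an assumed placeholder.

```matlab
% Sketch: build the 16-state generator Q from the arc list and compute the
% transient state probabilities P(t) = P(0) * expm(Q * t).
lam = [0.000039 0.000059 0.000029 0.000019 0.000019];  % Table 7 rates
mu  = 0.01;                                            % assumed repair rate
arcs = [1 2 1; 1 3 2; 1 4 3; 1 5 4; 1 12 5; ...
        2 6 2; 2 7 3; 2 8 4; 2 13 5;  3 6 1; 3 9 3; 3 10 4; 3 14 5; ...
        4 7 1; 4 9 2; 4 11 4; 4 15 5; 5 8 1; 5 10 2; 5 11 3; 5 16 5];
Q = zeros(16);
for a = 1:size(arcs,1)
    Q(arcs(a,1), arcs(a,2)) = lam(arcs(a,3));   % failure transition
    Q(arcs(a,2), arcs(a,1)) = mu;               % matching repair transition
end
Q = Q - diag(sum(Q, 2));       % diagonal makes each row sum to zero
P0 = [1 zeros(1,15)];          % system starts in All_working_P1
Pt = P0 * expm(Q * 1000);      % transient probabilities at t = 1000
```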
The transient probabilities we get from the above differential equations can be expressed as the following:

$\sum_{i=1}^{16} P_i = 1, \qquad P^{T}[D - I] = 0,$

where $P$ is the state probability vector, $D$ is the transition probability matrix, and $I$ is the identity matrix. The entries of $D$ follow directly from the differential equations above: each off-diagonal entry $d_{ij}$ is the failure rate $\lambda_k$ or the repair rate $\mu$ of the transition from state $i$ to state $j$ (for example, $d_{1,2} = \lambda_1$ and $d_{2,1} = \mu$), $d_{ij} = 0$ where no transition exists, and each diagonal entry is one minus the sum of the remaining entries in its row (for example, $d_{1,1} = 1 - (\lambda_1+\lambda_2+\lambda_3+\lambda_4+\lambda_5)$ and $d_{6,6} = 1 - 2\mu$).
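Numerically, the steady-state vector is the solution of these balance equations with one of them replaced by the normalization constraint. A hedged sketch, rebuilding Q exactly as in the previous snippet (with the same assumed $\mu$):

```matlab
% Sketch: steady-state probabilities from pi * Q = 0 with sum(pi) = 1.
lam = [0.000039 0.000059 0.000029 0.000019 0.000019];  mu = 0.01;  % assumed mu
arcs = [1 2 1; 1 3 2; 1 4 3; 1 5 4; 1 12 5; 2 6 2; 2 7 3; 2 8 4; 2 13 5; ...
        3 6 1; 3 9 3; 3 10 4; 3 14 5; 4 7 1; 4 9 2; 4 11 4; 4 15 5; ...
        5 8 1; 5 10 2; 5 11 3; 5 16 5];
Q = zeros(16);
for a = 1:size(arcs,1)
    Q(arcs(a,1), arcs(a,2)) = lam(arcs(a,3));   % failure transition
    Q(arcs(a,2), arcs(a,1)) = mu;               % matching repair transition
end
Q = Q - diag(sum(Q, 2));
A = Q';  A(end,:) = 1;                 % swap one balance equation for sum = 1
b = [zeros(15,1); 1];
pi_ss = (A \ b)';                      % steady-state probability vector
availability = sum(pi_ss(1:5));        % working states P1..P5 (cf. Table 8)
fprintf('Steady-state availability: %.5f\n', availability);
```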
Table 7. Number of failures and failure probability of the components

 | Throttle Sensor | Engine Fan Speed Sensor | EGO Sensor | MAP Sensor | Microcontroller
Number of Failures out of 100001 iterations | 4 | 6 | 3 | 2 | 2
Failure Probability | 0.000039 | 0.000059 | 0.000029 | 0.000019 | 0.000019
Table 8: Steady-state availability

States | Steady-state probability
P1 | 0.98719
P2 | 0.00064
P3 | 0.00081
P4 | 0.00065
P5 | 0.01068
Steady-state availability | 0.99999
Fig. 14. Reachability graph of the Petri Nets model
To estimate the failure probability of the components we use the same data that was used during the system simulation. The first row of Table 7 represents the total number of failures of the throttle sensor, speed sensor, EGO sensor, MAP sensor, and microcontroller during the entire 100001 simulation iterations. We estimate the failure probability of each component as the ratio of the total number of failures to the total number of simulation iterations. The estimated failure probabilities of the components are represented in the second row of Table 7.
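This estimation is a direct ratio, e.g.:

```matlab
% Sketch: component failure probabilities as failures / iterations (Table 7).
failures   = [4 6 3 2 2];        % throttle, speed, EGO, MAP, microcontroller
iterations = 100001;
p_fail = failures / iterations;  % ~ [0.000039 0.000059 0.000029 0.000019 0.000019]
disp(p_fail);
```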
In the reachability graph, one set of states ($m_A$) represents the working condition of the system and the other set ($m'_A$) represents the failure states. If we consider the FRCS as a non-repairable system, then the transient reliability of the system can be defined as

$R(t) = e^{-\sum_k \lambda_k t},$

where $\lambda_k$ is the firing rate of a transition whose firing causes the target net to leave the reliable states. In this reliability expression, if t tends to infinity, $R(t)$ tends to zero. The transient reliability plot of the FRCS is given in Fig. 15.
Fig. 15. Graphical representation of the steady-state reliability using CTMC model

Fig. 16. Graphical representation of the steady-state availability using CTMC model
If we consider the FRCS as a repairable system then the transient availability is $A(t) = \sum_{i \in m_A} P_i(t)$, the sum of the transient probabilities of the working states. The repair probability of the throttle sensor, speed sensor, EGO sensor, MAP sensor, and microcontroller is assumed to be the same ($\mu$). Then, solving the above first-order linear differential equations, we get the steady-state probabilities of the working states ($m_A$) as given in Table 8. The sum of the steady-state probabilities of the working states gives the steady-state availability of the system (0.99999). Fig. 16 represents the steady-state availability of the system with respect to time. Finally, we observe that the steady-state availability of the FRCS obtained using the CTMC model (0.99999) is quite similar to the corresponding value (0.99983) achieved using the proposed simulation-based method. The CTMC model gives a slightly higher availability as there is no scope for considering interaction failures; it only considers system failures due to component failures.

5. Conclusion
At the early stage of system development, more than one design alternative may be available. It is difficult to determine and select the most reliable system design among the available alternatives. Moreover, at the initial stages of system development, the actual system components may not be available. So, it is desirable to perform reliability/ availability analysis on the system design. We have proposed a model for a combined hardware-software system that can be used to predict the worst case system reliability/ availability based on the conceptual design. The novelty of our work is the quantitative reliability/ availability analysis of a combined HW-SW system, considering hardware-driven software and software-driven hardware failures, at the early design stages. The proposed model converts the system functions to conceptual-level abstract components. The technical composition of such components is unknown, but their functionalities are defined. We simulate system behavior based on functional logic for a set of input data variations. Functional failure/ success of the system for different input sets gives the reliability/ availability of the system. Unlike the existing functional failure based models, the proposed model predicts the reliability/ availability of the system rather than being confined to performing only risk analysis. At the same time, we have considered the entire spectrum of interaction failures that may arise among the hardware-software components, apart from individual component failures. To demonstrate the applicability of the proposed model, we have predicted the reliability/ availability of an aircraft fuel rate controller as a case study. Further, we validate the proposed model using the same example. Finally, we can conclude that the proposed simulation-based model avoids the inconvenience of handling the huge state space of Markovian early reliability/ availability models. It also avoids the qualitative analysis of huge numbers of execution paths required by functional failure identification and propagation (FFIP) based models.

Acknowledgements

This work was carried out at the Subir Chowdhury School of Quality and Reliability, Indian Institute of Technology Kharagpur, India. We thank all the faculty members, research scholars, and staff of the school for their co-operation and support. We gratefully acknowledge the Ministry of Human Resource Development (MHRD), Government of India, for funding this research.
References

[1] A. Syed, D. G. Pérez, and G. Fohler, "Job-shifting: An algorithm for online admission of non-preemptive aperiodic tasks in safety critical systems," Journal of Systems Architecture, vol. 85, pp. 14-27, 2018.
[2] Q. Zhao, Z. Gu, M. Yao, and H. Zeng, "HLC-PCP: A resource synchronization protocol for certifiable mixed criticality scheduling," Journal of Systems Architecture, vol. 66, pp. 84-99, 2016.
[3] I. Tumer and C. Smidts, "Integrated design-stage failure analysis of software-driven hardware systems," IEEE Transactions on Computers, vol. 60, pp. 1072-1084, 2011.
[4] R. K. Iyer and P. Velardi, "Hardware-related software errors: measurement and analysis," IEEE Transactions on Software Engineering, pp. 223-231, 1985.
[5] X. Teng, H. Pham, and D. R. Jeske, "Reliability modeling of hardware and software interactions, and its applications," IEEE Transactions on Reliability, vol. 55, pp. 571-577, 2006.
[6] A. Costes, C. Landrault, and J.-C. Laprie, "Reliability and availability models for maintained systems featuring hardware failures and design faults," IEEE Transactions on Computers, vol. 100, pp. 548-560, 1978.
[7] K. Kanoun and M. Ortalo-Borrel, "Fault-tolerant system dependability-explicit modeling of hardware and software component-interactions," IEEE Transactions on Reliability, vol. 49, pp. 363-376, 2000.
[8] U. Sumita and Y. Masuda, "Analysis of software availability/reliability under the influence of hardware failures," IEEE Transactions on Software Engineering, pp. 32-41, 1986.
[9] D. Jensen, I. Y. Tumer, and T. Kurtoglu, "Flow State Logic (FSL) for analysis of failure propagation in early design," in ASME 2009 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, 2009, pp. 1033-1043.
[10] D. C. Jensen, I. Y. Tumer, and T. Kurtoglu, "Modeling the Propagation of Failures in Software Driven Hardware Systems to Enable Risk-Informed Design," in ASME 2008 International Mechanical Engineering Congress and Exposition, 2008, pp. 283-293.
[11] S. Sierla, I. Tumer, N. Papakonstantinou, K. Koskinen, and D. Jensen, "Early integration of safety to the mechatronic system design process by the functional failure identification and propagation framework," Mechatronics, vol. 22, pp. 137-151, 2012.
[12] C. Mutha, D. Jensen, I. Tumer, and C. Smidts, "An integrated multidomain functional failure and propagation analysis approach for safe system design," AI EDAM, vol. 27, pp. 317-347, 2013.
[13] B. Huang, M. Rodriguez, M. Li, J. B. Bernstein, and C. S. Smidts, "Hardware error likelihood induced by the operation of software," IEEE Transactions on Reliability, vol. 60, pp. 622-639, 2011.
[14] X. Diao, Y. Zhao, M. Pietrykowski, Z. Wang, S. Bragg-Sitton, and C. Smidts, "Fault Propagation and Effects Analysis for Designing an Online Monitoring System for the Secondary Loop of the Nuclear Power Plant Portion of a Hybrid Energy System," Nuclear Technology, pp. 1-18, 2018.
[15] N. Papakonstantinou, S. Proper, B. O'Halloran, and I. Y. Tumer, "A Plant-Wide and Function-Specific Hierarchical Functional Fault Detection and Identification (HFFDI) System for Multiple Fault Scenarios on Complex Systems," in ASME 2015 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, 2015, pp. V01BT02A039.
[16] T. Kurtoglu and I. Y. Tumer, "A graph-based fault identification and propagation framework for functional design of complex systems," Journal of Mechanical Design, vol. 130, p. 051401, 2008.
[17] T. Kurtoglu and I. Y. Tumer, "A risk-informed decision making methodology for evaluating failure impact of early system designs," in ASME 2008 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, 2008, pp. 457-467.
[18] A. M. Dowell III, "Layer of protection analysis for determining safety integrity level," ISA Transactions, vol. 37, pp. 155-165, 1998.
[19] D. S. Roy, C. Murthy, and D. K. Mohanta, "Reliability analysis of phasor measurement unit incorporating hardware and software interaction failures," IET Generation, Transmission & Distribution, vol. 9, pp. 164-171, 2015.
[20] A. K. Trivedi and M. L. Shooman, A Markov model for the evaluation of computer software performance. Polytechnic Institute of New York, Department of Electrical Engineering and Electrophysics, 1974.
[21] N. Papakonstantinou, S. Sierla, I. Y. Tumer, and D. C. Jensen, "Using fault propagation analyses for early elimination of unreliable design alternatives of complex cyber-physical systems," in ASME 2012 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, 2012, pp. 1183-1191.
[22] MathWorks. (2011). MATLAB/ Simulink Examples. Available: http://in.mathworks.com/examples
[23] J. Hirtz, R. B. Stone, D. A. McAdams, S. Szykman, and K. L. Wood, "A functional basis for engineering design: reconciling and evolving previous efforts," Research in Engineering Design, vol. 13, pp. 65-82, 2002.
[24] R. B. Stone and K. L. Wood, "Development of a functional basis for design," Journal of Mechanical Design, vol. 122, pp. 359-370, 2000.
[25] NSWCD, "Handbook of reliability prediction procedures for mechanical equipment," Naval Surface Warfare Center Carderock Division, West Bethesda, Maryland 20817-5700, 2011.
[26] R. Borgovini, S. Pemberton, and M. Rossi, "Failure Mode, Effects and Criticality Analysis (FMECA)," Reliability Analysis Center, Rome Laboratory, 1993.
[27] D. Gil, J. Gracia, J. C. Baraza, and P. J. Gil, "Impact of faults in combinational logic of commercial microcontrollers," in European Dependable Computing Conference, 2005, pp. 379-390.
Sourav Sinha is currently pursuing a Ph.D. at the Indian Institute of Technology (IIT) Kharagpur. He received a B.Tech. in Computer Science & Engineering and an MS in Industrial and Systems Engineering. His areas of research are software reliability, system reliability, and dependability analysis. He also has more than five years of work experience as a software programmer on an ERP implementation project at IIT Kharagpur.

Neeraj Kumar Goyal is currently an associate professor in the Reliability Engineering Centre, Indian Institute of Technology (IIT) Kharagpur, India. He received his Ph.D. degree in reliability engineering from IIT Kharagpur in 2006. His areas of research and teaching are network reliability, software reliability, electronic system reliability, reliability testing, probabilistic risk/safety assessment, and reliability design. He has completed research and consultancy projects for various organizations, e.g., DRDO, NPCIL, Vodafone, and ECIL. He has contributed several research papers to international journals and conference proceedings.

Rajib Mall is a Professor in the Department of Computer Science and Engineering at the Indian Institute of Technology Kharagpur, West Bengal, India. He received his Bachelor's, Master's, and Ph.D. degrees in Computer Science, all from the Indian Institute of Science, Bangalore. Prof. Mall has presented numerous lectures, conference presentations, and workshops on software engineering, real-time systems, and wireless sensor networks. He has published more than 200 journal papers, conference papers, and book chapters. He has active research interests in software engineering, real-time systems, wireless sensor networks, web engineering, web sizing, cost estimation, and web quality and productivity.

Sourav Sinha

Dr. Neeraj Kumar Goyal

Prof. Rajib Mall