Early Prediction of Reliability and Availability of Combined Hardware-Software Systems based on Functional Failures

Sourav Sinha (a,*), Neeraj Kumar Goyal (a) and Rajib Mall (b)

(a) Subir Chowdhury School of Quality and Reliability, Indian Institute of Technology Kharagpur, India
(b) Department of Computer Science & Engineering, Indian Institute of Technology Kharagpur, India

(*) Corresponding author: Subir Chowdhury School of Quality and Reliability, Indian Institute of Technology Kharagpur, Kharagpur-721302, India. Tel.: +91 3222269868. E-mail address: [email protected] (S. Sinha).

Journal of Systems Architecture (2018). DOI: https://doi.org/10.1016/j.sysarc.2018.10.007
Received 3 June 2018; revised 15 October 2018; accepted 22 October 2018.

Abstract

Interactions among software and hardware components play an important role in the successful operation of a system. Researchers have identified two types of interaction failures: software failure influenced by hardware breakdown (hardware-driven software failure) and hardware failure influenced by software malfunction (software-driven hardware failure). Existing research in this domain either has not considered the entire spectrum of interaction failures or has limited its scope to failure analysis rather than reliability/ availability modeling. In this paper, we propose a unified model to predict the worst-case achievable reliability/ availability of a combined hardware-software system at the early design phases. The proposed model identifies system functions from the requirements specification document. These functions are then mapped to corresponding conceptual design components. Subsequently, the functional design is simulated for sets of input data randomly generated for the different operation modes (failure/ working) of the components. We also simulate system state transitions due to the component operation modes. Finally, reliability and availability are predicted from the simulation results. In this context, we address four important aspects: i) we propose a conceptual-design-based early reliability/ availability prediction model; ii) apart from individual hardware-software component failures, the proposed model addresses the different interaction failures, such as hardware-driven software and software-driven hardware failures; iii) we implement the proposed model through a case study; and iv) we validate the model by comparing the reliability/ availability values obtained using the proposed approach with an established method.

Keywords: Reliability/ availability prediction, hardware-software interactions, failure analysis, functional failure, embedded system

1. Introduction

A sharp increase in the use of software-intensive systems has been noticed in recent times. Even a wide range of safety-critical hardware devices that perform a multitude of activities are often controlled by software [1, 2]. For example, in the aircraft industry, a significant increase in the use of combined hardware-software systems can be noticed. Tumer and Smidts [3] have reported that the total percentage of functional requirements handled by software increased from 8% for the US F-4 aircraft in the 1960s to 80% for the US F-22 aircraft in 2000. The integration of hardware and software makes reliability evaluation more complicated, because we must consider hardware-software interaction failures in addition to independent failures. Iyer and Velardi [4] at Stanford University experimentally showed that degradation of a hardware component due to fatigue, temperature, electrical stress, design susceptibilities, or configuration changes might impact software operation. To corroborate their findings, they demonstrated that nearly 35 per cent of the software errors on an MVS/SP operating system are hardware-related [4]. This implies that a fault in a hardware component may cause malfunction of the corresponding software component. On the other hand, bugs in the software may also lead a hardware device/ peripheral to failure. For example, in March 2015 it was reported that a software glitch caused the F-35 Joint Strike Fighter to detect targets incorrectly. Propagation of faults from hardware to software, or vice versa, leads to failure of the combined HW-SW system. Therefore, a pragmatic approach to predicting HW-SW combined system reliability/ availability must not ignore the interactions among components.

A few research results have been reported on system reliability/ availability prediction considering interaction failures among HW-SW components [5]. Some researchers have proposed reliability/ availability models taking into account hardware-driven software interaction failures [5-8]; they mostly relied on Markovian approaches for reliability/ availability modeling. Another group of researchers considered hardware-driven software and software-driven hardware interaction failures together for failure/ reliability analysis [3, 9-13]. Out of these, Huang et al. [13] presented a quantitative reliability analysis for the hardware part only, considering the usage profile of the embedded software in a SPICE simulation environment; they did not give a consolidated model for combined hardware-software system reliability. The other models of this group performed failure analysis using the Functional Failure Identification and Propagation (FFIP) framework or its extensions [3, 9-12, 14, 15]. As the name suggests, FFIP analyzes a system based on its functional failures [9, 10, 16, 17]. However, these works also do not provide a quantitative reliability analysis for the combined hardware-software system. We follow this line of work to some extent by employing functional failure analysis for early prediction of system reliability. Unlike the FFIP framework, our work is not limited to failure analysis but extends to quantitative reliability prediction. The applicability of the proposed model is limited to classical embedded systems. Such systems comprise a microcontroller with an embedded software controller and input/ output peripheral devices connected to the microcontroller.

A unified model that considers hardware-driven software and software-driven hardware interaction failures, in addition to individual component failures, is lacking at present. In this paper, we propose a consolidated model that predicts a lower bound on the reliability/ availability of a given system design, considering functional failures. If design alternatives are available, this model can evaluate the minimum achievable reliability/ availability of each alternative, and the system design with comparatively higher reliability can then be chosen. At the early design stages, component-level technical details of the system are usually unavailable, but the system functions can be identified from the requirements specification document. In this research, we map the system functions to abstract components at the conceptual design stage. For example, consider that a functional requirement of a system is "temperature measurement" and that this function should have 99.99% reliability. As per the proposed model, the function "measure temperature" is mapped to a generic "temperature sensor" with a maximum on-demand failure probability of 0.0001, complying with Safety Integrity Level 4 (SIL 4) [18]. Once the process of identifying the generic hardware components is complete, standard reliability handbooks with generic failure data sources can be consulted to identify the failure modes of each component with the associated probabilities of failure. All modes of a component, including the failure modes and the normal working mode, are termed component operation modes in this paper. Similarly, the embedded software controller operation modes can be identified from the System Requirement Specification document. Based on the operation modes of the generic hardware components and the software controller, we configure the system, which defines the inter-component dependencies and the flow of function execution. Finally, we simulate random operation scenarios (input data variations) to test the system functionalities and predict reliability/ availability based on the simulation results. A simulation-based approach is helpful here because the number of operational/ non-operational states of the system is quite large, and modeling such a huge state space using a Markov chain based approach is tedious. In this paper, the reliability of the system is obtained as the ratio of the total number of times the system responds within the operational limit (correct range/ value) to the total number of simulation iterations. The availability of the system is predicted from the same ratio with consideration of recovery from failed states.

The proposed reliability/ availability prediction model also incorporates two types of interaction failures, apart from input variation and individual hardware/ software component failures. These interaction failures are: a) hardware-driven software failure, and b) software-driven hardware failure. For better understanding of the proposed model, we demonstrate its application to an aircraft fuel control system as a case study in Section 3. The evaluation of transient reliability and steady-state availability is also explained in the case study. The transient reliability/ steady-state availability results obtained for the case study are compared with established approaches in Section 4.

The rest of this paper is organized as follows: Section 2 presents a review of the existing work. Section 3 proposes a functional failure based model for early reliability prediction. Section 4 presents a validation of the proposed model. Finally, Section 5 summarises the important contributions of this research work.

2. Background of the Research

The existing reliability/ availability/ failure analysis models based on HW-SW interaction failures can be divided into two groups: 1) models considering only hardware-driven software interaction failures [5-8, 19], and 2) models considering both hardware-driven software and software-driven hardware interaction failures [3, 9-12].

Reliability prediction approaches based on hardware-driven software interaction failures have used Markov models, derivatives of Markov models, and other stochastic processes like Stochastic Petri Nets [5-7]. For example, Teng et al. [5] used a Markov chain for system reliability modeling. They considered that a system fails due to hardware, software, or hardware-software interaction failures. They assumed a Weibull distribution for hardware failures, an NHPP model for software failures, and a Markov chain model for HW-SW interaction failures. They mentioned that deterioration of hardware manifests as system failure if it remains undetected or gets ignored. Finally, they evaluated system reliability as the product of the independent hardware reliability, the independent software reliability, and the hardware-software interaction reliability.
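Written out, this product form amounts to a series-system assumption. A schematic restatement is given below; the symbols $R_{HW}$, $R_{SW}$, and $R_{I}$ are shorthand introduced here for the three factors, not Teng et al.'s original notation:

\[ R_{sys}(t) = R_{HW}(t) \cdot R_{SW}(t) \cdot R_{I}(t), \]

where $R_{HW}(t)$ follows the Weibull hardware model, $R_{SW}(t)$ the NHPP software model, and $R_{I}(t)$ the Markov interaction model.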

Kanoun and Ortalo-Borrel [7] used offshoots of Generalized Stochastic Petri Nets (GSPN) and Markov chains for dependability modeling of a distributed combined hardware-software system. They considered that a master software component installed in the main controller system interacts with slave software counterparts installed in the distributed peers. In this distributed structure, they considered the possibility of two types of interactions: a) interactions with the internal components of the system, and b) interactions with external components/ systems. They modeled the dependability of the software, the hardware, and their interactions using Petri Nets. The Petri Nets produce a reachability graph that is identical to a Continuous Time Markov Chain (CTMC). However, they could not deduce any concrete conclusion due to state-space explosion.

Another distinct reliability/ availability prediction approach was presented by Sumita and Masuda [8]. They formulated system failure as a multivariate stochastic process using the matrix Laguerre transform and a semi-Markov process. They assumed that hardware failures are independent of the software, and modeled hardware failures as an alternating renewal process with exponentially distributed uptime and a general distribution for downtime. They illustrated that if hardware-related failures lead to software failure, two different situations may arise: first, a software repair may complete without any interruption from hardware failures; second, a hardware failure may interrupt the software repair operation. They modeled both situations using stochastic processes and computed system reliability using the matrix Laguerre transform. Finally, they demonstrated the efficacy of their methodology using a numerical example.

Costes et al. [6] proposed a reliability/ availability approach for a repairable system using Markov state-based modeling. They assumed that hardware failures follow a Poisson distribution with a known failure rate, whereas the software failure rate is constant. They derived the software failure behavior model from the previous literature [20]; this model considered that the residual errors of the software are unknown and that the debugging process is imperfect. At first, they studied the impact of hardware and software failures on a non-redundant computer system. Then, they applied the learning from the non-redundant system to a redundant system. After that, they compared the availability obtained for the redundant and non-redundant systems. Finally, they concluded that redundancy of the hardware/ software parts increased the availability of the system.

Roy et al. [19] proposed a reliability framework for Phasor Measurement Units (PMU) as an extension of the work presented by Teng et al. [5]. At first, they modeled hardware component failures through a Weibull distribution, software component failures using an NHPP model, and hardware-software interaction failures using a Markov chain. Then, they predicted the system reliability as the product of the independent hardware reliability, the independent software reliability, and the hardware-software interaction reliability. To validate their model, they used Monte Carlo simulation (MCS). During the validation process, they generated failure data for each component using MCS and identified the failure distribution of each component. Subsequently, they estimated the system reliability assuming that all components are in series. Finally, they validated the model by comparing the predicted reliability with the estimated reliability.

Another group of researchers considered hardware-driven software and software-driven hardware interaction failures together for the failure/ reliability analysis of the system. They used the Functional Failure Identification and Propagation (FFIP) framework or its extensions for reliability/ failure modeling. Jensen et al. [10] were probably the first to introduce the FFIP framework for combined HW-SW system failure analysis. They developed the functional layout of the system following the FFIP framework. Then, they analysed the material/ information flow along the flow-paths of the functional layout to identify critical nodes. Subsequently, they used a reasoning-based approach to monitor the flow levels at the critical nodes to restrict fault propagation in the system. FFIP combines system modeling and behavioral simulation approaches for failure analysis at the early design phase of system development. Later, FFIP was adopted by others for qualitative reliability/ failure analysis of safety-critical systems [3, 11, 21].

Tumer and Smidts [3] adopted the FFIP framework for high-level system modeling and failure analysis. They used a five-step approach: a) modeling the functional layout and system configuration using abstract components, b) ascertaining the failure states of each component based on specified input and output flow information, c) identifying system components which can act as checkpoints to sense failures of the previous node (component), d) apprehending the abnormal behavior of the predecessor node and identifying the mechanisms to stop the propagation of failure, and e) evaluating different scenarios against a predecided set of rules for the alternate flow-path. This model used checkpoints to monitor fault propagation; if a fault manifests, the model alters the flow path to stop further propagation. They also demonstrated an extension of FFIP to the software domain, where they used a UML-based system modeling approach. However, the central idea is the same as that of traditional FFIP.

Sierla et al. [11] modified the FFIP framework for the failure analysis of systems that control concurrent processes. They demonstrated the integration of concurrent processes for the software controller in the context of failure analysis. Moreover, they also considered flows across the boundaries of mechatronic domains, which are hardly covered by contemporary FFIP-based models. They used SysML to implement the functional model and configured the system behavior using a configuration flow graph (CFG). The component behavioral models define the formulation of output values from input values using statechart diagrams. The flows of material, energy, and signal across domain boundaries were then analysed using an FFIP-based simulation approach. Finally, the graphical output of Simulink/ Stateflow was used to analyse abnormal flow levels in the FFIP path.

Papakonstantinou et al. [21] identified some drawbacks of the FFIP-based models. They argued that using different functional models (alternative system designs) of the same system provides different outcomes, and that the framework could not suggest the best alternative. To overcome such drawbacks, they proposed a model considering alternate flow paths for mitigating failure propagation using the Simulink/ Stateflow environment. They analysed the risk of failure propagation using a modified FFIP approach. Their approach was illustrated using the example of a boiling water nuclear reactor. However, the way their approach integrates concurrent safety processes to mitigate risk is not clear.

Papakonstantinou et al. [15] proposed another approach in an effort to improve their previous work [21]. Their extended approach considered a Hierarchical Functional Fault Detection and Identification (HFFDI) framework that combines machine learning techniques and traditional FFIP for failure analysis. The machine learning techniques were used for fault detection and identification (FDI) from historical data, whereas FFIP was used for the functional decomposition of the system. They applied the HFFDI framework to a complex nuclear power plant system as a case study. Then, they compared the failure analysis results of HFFDI with an FDI-based approach for the same case study. Finally, they concluded that HFFDI gave an edge over its counterpart. The results revealed that, in two-fault scenarios, HFFDI could isolate one fault with 79% accuracy and both faults with 13% accuracy. In three-fault scenarios, HFFDI could isolate single faults with 69% accuracy, two faults with 22% accuracy, and all three faults with 1% accuracy.

Mutha et al. [12] claimed that the traditional FFIP framework, while efficient in detecting electromechanical faults, hardly detects faults in cross-domain functionalities. To overcome this problem, they proposed the Integrated System Failure Analysis (ISFA) approach, which identifies and analyzes faults in cross-domain functionalities. As a part of ISFA, they introduced a new simulation mechanism named the Failure Propagation and Simulation Approach (FPSA), which works on the principles of FFIP. They applied the ISFA technique to a holdup tank as a case study and demonstrated two instances of commonly occurring faults that cause system failure. Based on the results of the case study, they showed the efficiency of the ISFA approach in analyzing faults in a combined hardware-software system.

Later, Diao et al. [14] also used the Integrated System Failure Analysis (ISFA) framework [12] for the combined study of hardware-software faults. As a novel feature, they added an online monitoring (OLM) system to ISFA; the OLM isolated potential faults in the critical components. They applied their methodology to a nuclear hybrid energy system as a case study. In this regard, they configured the system model using a conceptual design of the components. Then, they analyzed the fault propagation using OLM. Subsequently, they evaluated the effectiveness of the fault detection and diagnosis techniques of their model using functional simulation. Based on the simulation results, they proposed an optimization plan for the OLM system. Finally, the correctness of their methodology was verified through experiments on a hardware-in-the-loop system.

In summary, the above quantitative reliability/ availability/ dependability prediction models consider only hardware-driven software interaction failures [5-8, 19]; they do not consider the impact of software-driven hardware interaction failures. The models pertaining to failure analysis, on the other hand, considered both hardware-driven software and software-driven hardware interaction failures [3, 10-12, 14, 15, 21], but these models are largely limited to qualitative reliability analysis and do not provide any quantitative reliability evaluation. Therefore, a scarcity of literature is noticed in the area of quantitative reliability/ availability prediction considering the entire spectrum of interaction failures.

3. Proposed Combined Hardware-Software System Reliability/ Availability Prediction Model

We propose a simulation-based reliability/ availability prediction model for combined hardware-software systems. The applicability of the proposed model is limited to classical embedded systems with specific functionalities and real-time computing constraints. The software that controls the functionalities of the system remains embedded in a microcontroller, and all the peripheral input/ output devices are connected to the microcontroller for communication with external entities. We attempt to predict the worst-case achievable reliability/ availability of a system for a given conceptual design. As mentioned earlier, the proposed model can identify the most reliable system design among available alternatives. However, the predicted reliability/ availability values may not accurately match the post-development reliability/ availability values estimated through system testing. This is because the proposed model does not consider the actual system components while predicting reliability/ availability; it maps system functions to functionally equivalent generic components to configure the system. Finally, we simulate different operational scenarios (input datasets) to test the system functions and predict reliability/ availability based on the simulation results.

The proposed model can significantly reduce the production cost of safety-critical systems. During the post-development testing phase, if the system does not meet the reliability requirement due to an inefficient design, it incurs a huge rework cost. Moreover, reliability improvement at the end of the development cycle forces ad-hoc solutions due to budgetary or time constraints. Such approaches may compromise product quality, reliability, and safety. Therefore, it is important to identify the most reliable system design at the beginning of the high-level development processes.

3.1. Assumptions

1) The System Requirement and Specification (SyRS) document is available at the early design phase of system development.
2) Operation modes of a hardware component are divided into three categories:
   a) Normal: The component performs its designated work adequately.
   b) Degraded/ partial failure: In this mode the component works in a limited manner. It is of two types: i) additive polarity, where the partial failure enhances the signal strength/ performance of the component, and ii) subtractive polarity, where the partial failure diminishes the signal strength/ performance of the component.
   c) Complete failure: The overall functionality of the component is disrupted. Such failures can be permanent or transient.
3) The maximum allowable failure probability and the failure modes of each hardware component are known.
4) Based on system functionality, software controller operational modes are broadly divided into two categories:
   a) Working: The response of the software controller matches the system requirement specification document for a given set of inputs. It is of two types: i) normal working, where the software performs its designated work adequately without any hardware component failure, and ii) fault-tolerant, where the software performs its designated work adequately even when some hardware component has failed.
   b) Failure: The response of the software controller does not match the system requirement specification document for a given set of inputs.
5) Hardware-software interaction has two operation modes:
   a) Normal: The successful exchange of data/ signal/ material among hardware-software components.
   b) Failure: The unsuccessful exchange of data/ signal among hardware-software components. Such failures are of two types: i) hardware-driven software interaction failure, where a transient hardware failure (like indeterminate memory content or delay) leads to this type of failure, and ii) software-driven hardware interaction failure, where under some operating conditions the software response may be undefined; in such situations the software operation becomes uncontrollable, and this may lead to hardware breakdown.

3.2. Proposed Model

We briefly explain each step of the proposed reliability/ availability prediction model for the combined hardware-software system. Subsequently, we apply the proposed model to a case study. Then, we validate the reliability/ availability results obtained using the proposed model against an already established approach. The steps of the proposed model are the following:

3.2.1. Functional Requirement Identification

At first, we need to identify all the functions that the system should perform to fulfil the user requirements. For this purpose, we refer to the System Requirement Specification (SyRS) document. Based on the information available in the SyRS document during the early design stage, a list of $n$ system functions is prepared. Any function in this list is referred to as $F_i$, where $i \in \{1, 2, \ldots, n\}$.

3.2.2. Configuration Model

The functions identified in the above step are mapped to conceptual design components. Each function $F_i$ is replaced by an abstract component $C_i$, where $i \in \{1, 2, \ldots, n\}$. The mapping process follows the same logic as used in the FFIP framework for representing the configuration model [3, 9-12]. During configuration modeling, we do not look into the technical details of the components; we only consider the components as generic types. Out of these $n$ components, one is the microcontroller in which the software controller $S$ is embedded. The other components are peripheral devices connected to the microcontroller for supplying input data/ signals or producing outputs.

3.2.3. Component Behavior Model

Each component $C_i$ has $m$ operation modes (normal/ failure), and the component behavior changes with its operation mode. Past experience from similar projects, or standard reliability handbooks with generic failure data sources, is used to identify the failure modes of each component and their failure probabilities. As we are interested in evaluating the worst-case achievable reliability/ availability, we consider the maximum possible failure probability (the sum of the probabilities of occurrence of the failure modes) for each component as recorded in the data source. A mapping function maps component $C_i$ to operation mode $M_{ij}$ with a probability of occurrence $p_{ij}$, where $j \in \{1, 2, \ldots, m\}$ and $\sum_{j=1}^{m} p_{ij} = 1$.

We use a digital simulation platform in which random sampling is performed to select the operation mode of each component. Consider that at any instant operation mode $M_{ij}$ of component $C_i$ occurs; the response of the component is then governed accordingly. For any input $x$, the operational state of component $C_i$ is denoted as $C_i(x)$. We categorize the operational state of a component into the following three sets:

a) Normal: In this state the component response does not deviate from the specified range/ value of the system requirement specification (SyRS) document. Denoting the specified response by $C_i(\mathrm{Normal})$, it can be expressed as:

$C_i(x) = C_i(\mathrm{Normal})$

b) Partial Failure/ Degraded: Due to noise, moderate degradation of the materials, reverse polarity, oscillation, etc., the component response deviates from the specified value/ range of the SyRS. This is considered a partial failure of the component. The partial failure has additive polarity when the measured signal strength is higher than the specified value, and subtractive polarity when the measured signal strength is lower than the specified value. With a zero-mean normally distributed error $N(0, \sigma)$, it is expressed as:

$C_i(x) = C_i(\mathrm{Normal}) + N(0, \sigma)$ or $C_i(x) = C_i(\mathrm{Normal}) - N(0, \sigma)$

c) Failure: If the component does not respond at all, it is in the failure state. It is of two types, permanent and transient. Due to complete breakdown of the component, degradation of the materials, intermittent faults, open circuits, short circuits, etc., the component may stop responding permanently. Sometimes, due to transient failures like an indeterminate bit value, delay in signal, or improper synchronization, the component may stop responding temporarily. In both cases the component failure can be expressed as:

$C_i(x) = \mathrm{No\_signal}$

3.2.4. Software Controller Behavior Model

The system state is determined by the current operational state of the software controller. The operational states are broadly classified as working and failure. The working state is further of two types: normal working ($N$) and fault-tolerant ($T$); the failure state is denoted $F$. We notice three types of state transition of a software controller if repair is not considered: normal to fault-tolerant ($N \to T$), normal to failure ($N \to F$), and fault-tolerant to failure ($T \to F$). These state transitions occur due to one or more component failures or interaction failures. The impact of component failures and of interaction failures on the state transitions of the software controller is described in the Aggregate Component Behavior Model and the Interaction Behavior Model, respectively. The software controller state transitions due to the combined impact of component and interaction failures are represented in the Software Operation Behavior model.
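Before detailing the three behavior models, the sketch below illustrates how the operation-mode sampling of Section 3.2.3 can be realized. This is a minimal illustration in Python (the paper itself uses the MATLAB Simulink/ Stateflow environment); the mode names, probabilities, and sigma value are assumptions introduced for illustration only:

```python
import random

# Assumed operation-mode probabilities p_ij for one component (must sum to 1);
# the normal mode absorbs whatever the failure modes leave over.
MODES = {
    "normal":              0.99900,
    "partial_additive":    0.00029,
    "partial_subtractive": 0.00029,
    "complete_failure":    0.00042,
}

def sample_mode() -> str:
    """Randomly select an operation mode M_ij according to its probability."""
    modes = list(MODES)
    return random.choices(modes, weights=[MODES[m] for m in modes], k=1)[0]

def component_response(normal_value: float, mode: str, sigma_frac: float = 0.003):
    """Return the component response C_i(x) for the sampled operation mode."""
    if mode == "normal":
        return normal_value
    if mode == "partial_additive":
        return normal_value + random.gauss(0.0, sigma_frac * normal_value)
    if mode == "partial_subtractive":
        return normal_value - random.gauss(0.0, sigma_frac * normal_value)
    return None  # complete failure: No_signal
```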

Aggregate component behavior model

As mentioned above, the failure of specific sets of components may trigger a state transition of the software controller. We refer to the SyRS document to identify these sets of components. Such a set of components can be an element of either of the following two distinct supersets:

a) An element (a set of failed components) of the first superset triggers a state transition of the software controller from the normal state to the fault-tolerant state. As an illustration, consider that $b$ is a set of $k$ components. If all $k$ components of the set $b$ fail, the system reaches the fault-tolerant state. Now, consider that $b$ itself is an element of the superset $B$, which represents the collection of all sets of components that trigger the state transition from the normal to the fault-tolerant state. Any element $b$ of the superset $B$ is characterized as follows: $b = \{C_1, C_2, \ldots, C_k\}$, where for any input $x$, $C_i(x) = \mathrm{No\_signal}$ for every $C_i \in b$, and $b \in B$.

b) An element (a set of failed components) of the other superset triggers a state transition of the software controller from the normal/ fault-tolerant state to the complete failure state. For illustration, consider that $d$ is a set of $l$ components. If all $l$ components fail, the system reaches the complete failure state. Now, consider that $d$ itself is an element of another superset $D$, which represents the collection of all sets of components that trigger the state transition from the normal/ fault-tolerant state to the failure state. Any element $d$ of the superset $D$ is characterized as follows: $d = \{C_1, C_2, \ldots, C_l\}$, where for any input $x$, $C_i(x) = \mathrm{No\_signal}$ for every $C_i \in d$, and $d \in D$.
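A minimal sketch of how membership in the supersets $B$ and $D$ can be checked during simulation is given below, instantiated with the FRCS sets of the case study in Section 3.3 (single-sensor failures are tolerated; any two sensors, or the microcontroller, cause system failure). The set and function names are ours:

```python
from itertools import combinations

SENSORS = ["throttle", "speed", "ego", "map"]

# Superset B: sets whose failure moves the controller to fault-tolerant.
# Superset D: sets whose failure moves the controller to failure.
B_SETS = [{s} for s in SENSORS]
D_SETS = [{"chip"}] + [set(p) for p in combinations(SENSORS, 2)]

def controller_state(failed: set) -> str:
    """Map the set of currently failed components to a controller state."""
    if any(d <= failed for d in D_SETS):
        return "F"  # some element of D has fully failed
    if any(b <= failed for b in B_SETS):
        return "T"  # some element of B has fully failed
    return "N"
```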

Interaction Behavior Model

The interaction failures are divided into two major groups: software-driven hardware failures and hardware-driven software failures. Sometimes the software controller fails to keep the system within its operational limits due to exceptional input conditions, out-of-range inputs, logical errors for degraded inputs, etc. Under such exceptional conditions, the control signal generated by the software controller causes malfunctioning of the associated peripheral components. This is considered a software-driven hardware interaction failure. We refer to the SyRS to identify the potential exceptional input signals invoked by sets of degraded components. Consider that $e$ is a set of $r$ components. If all $r$ components of the set $e$ generate exceptional input conditions to the software controller due to their degraded modes, the system fails. Now, consider that $e$ itself is an element of the superset $E$, which represents the collection of all sets of components that generate exceptional input conditions to the software controller. Any element $e$ of the superset $E$ is characterized as follows: $e = \{C_1, C_2, \ldots, C_r\}$, where for any input $x$, $C_i(x)$ lies in the degraded range for every $C_i \in e$, and $e \in E$.

Transient behaviors of the components, like an indeterminate bit value, delay in signal, improper synchronization, etc., also lead to software failure. This is considered a hardware-driven software interaction failure. Consider that $g$ is a set of $s$ components. If all $s$ components of the set $g$ undergo transient failure, the software controller fails. Now, consider that $g$ itself is an element of the superset $G$, which represents the collection of all sets of components that undergo transient failure. Any element $g$ of the superset $G$ is characterized as follows: $g = \{C_1, C_2, \ldots, C_s\}$, where for any input $x$, $C_i(x)$ corresponds to a transient failure for every $C_i \in g$, and $g \in G$.

Software Operation Behavior

The operational states of a software controller are of three types: normal ($N$), fault-tolerant ($T$), and failure ($F$). Due to the combined impact of the component failures and the interaction failures, the following state transitions occur:

a) The state transition from normal working to fault-tolerant ($N \to T$) occurs when the set of failed components contains an element of the superset $B$ but no element of $D$, $E$, or $G$.

b) The state transition from normal working to failure ($N \to F$) occurs when the set of failed/ degraded components contains an element of the superset $D$, $E$, or $G$.

c) The state transition from fault-tolerant to failure ($T \to F$) occurs when, in the fault-tolerant state, the set of failed/ degraded components comes to contain an element of the superset $D$, $E$, or $G$.

3.2.5. System Behavior Simulation

At the beginning of the simulation process, we assume that the system is in the normal working state. Therefore, we start the simulation process by setting the operation mode of each component to normal. During the simulation process, we randomize the occurrence of the operation modes of each component. As noted above, in each operation mode the component behavior follows a distinct trend; therefore, during the simulation process the components generate random signals as input to the software controller. We simulate the software controller for a large set of random input signals. Finally, the response of the software controller is recorded at each simulation iteration, and reliability/ availability is predicted on the basis of the simulation results.
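A toy end-to-end version of this simulation loop is sketched below, combining the mode sampling of Section 3.2.3 with the state mapping of Section 3.2.4. All component names and probabilities are assumptions for illustration; fresh sampling in every iteration corresponds to the instant-repair (availability) variant, while for the reliability variant failed components would persist across iterations:

```python
import random
from itertools import combinations

SENSORS = ["throttle", "speed", "ego", "map"]
P_FAIL = {**{s: 1e-4 for s in SENSORS}, "chip": 1e-5}   # assumed probabilities
D_SETS = [{"chip"}] + [set(p) for p in combinations(SENSORS, 2)]

def derive_state(failed: set) -> str:
    if any(d <= failed for d in D_SETS):
        return "F"                  # two or more sensors, or the chip
    return "T" if failed else "N"   # a single failed sensor is tolerated

L = 100_000                         # number of simulation iterations
u = v = w = 0                       # counts of N->F, N->T, T->F transitions
state = "N"
for _ in range(L):
    failed = {c for c in P_FAIL if random.random() < P_FAIL[c]}
    new_state = derive_state(failed)
    if (state, new_state) == ("N", "F"):
        u += 1
    elif (state, new_state) == ("N", "T"):
        v += 1
    elif (state, new_state) == ("T", "F"):
        w += 1
    state = new_state

print("functional-failure fraction:", (u + w) / L)
```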


3.2.6. Reliability/ Availability Prediction

The basis of reliability/ availability prediction in the proposed model is functional failure. The SyRS document specifies the desired range of the system response; if the system response exceeds the specified range for any input dataset, it is considered a functional failure. During each simulation iteration, random sampling is performed to select the operation mode of each component. The random sampling of the operation modes creates input data variations for the software controller. We feed these random input data to the software controller for execution and observe the response. For some input datasets a system state transition may not occur, whereas for others a transition may occur. Three distinct state transitions are observed: normal to failure ($N \to F$), normal to fault-tolerant ($N \to T$), and fault-tolerant to failure ($T \to F$). Let the estimated numbers of times the transitions $N \to F$, $N \to T$, and $T \to F$ occur be $u$, $v$, and $w$, respectively. The total number of simulation iterations ($L$) is known, as we run the simulation process on the digital platform as per our requirement. Therefore, the unreliability of the system ($\bar{R}$) is estimated as the ratio of the total number of times ($u + w$) the system response undergoes functional failure to the total number of simulation iterations ($L$). The unreliability of the system is expressed as $\bar{R} = (u + w)/L$, so the reliability of the system is expressed as $R = 1 - (u + w)/L$.

If we consider instant repair of the failed states, this model can be used to estimate the availability of the system. In such a case, we assume that the system state transitions from failure to fault-tolerant ($F \to T$) and from fault-tolerant to normal ($T \to N$) are instantaneous. Let the estimated numbers of times the transitions $N \to F$, $N \to T$, and $T \to F$ occur be $u'$, $v'$, and $w'$, respectively. The total number of times the system response undergoes functional failure is then estimated as ($u' + w'$). If the total number of simulation iterations ($L'$) is known, the availability of the system is expressed as $A = 1 - (u' + w')/L'$.
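The two estimators reduce to a pair of one-line functions; a sketch under the naming of this section:

```python
def reliability(u: int, w: int, L: int) -> float:
    """R = 1 - (u + w)/L, where u and w count N->F and T->F transitions."""
    return 1.0 - (u + w) / L

def availability(u_r: int, w_r: int, L_r: int) -> float:
    """A = 1 - (u' + w')/L', assuming instantaneous repair of failed states."""
    return 1.0 - (u_r + w_r) / L_r
```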

Each step of the presented model is demonstrated in the remaining parts of this section with a case study of a fault-tolerant aircraft fuel control system. This system has been taken from the MathWorks website [22] with some admissible changes as per our need.

3.3. Case Study

An aircraft fuel control system comprises an engine, an actuator, a fuel rate controller (FRC), and four sensors: a throttle sensor, an engine fan speed sensor, an exhaust gas oxygen (EGO) sensor, and a manifold absolute pressure (MAP) sensor. Functional details of each component are given in the subsequent parts. We assume that the microcontroller and three out of the four sensors must give readings in an acceptable range for the system to operate. Therefore, failure of the FRC system is defined as the functional failure of at least two sensors or the failure of the microcontroller. We must know the failure modes and failure probabilities of the hardware components to evaluate the reliability/ availability of the system, and the System Requirement Specification (SyRS) document of the FRC also needs to be available before we start the evaluation process. The proposed model is applied to a portion of the whole system for the reliability/ availability evaluation. This portion consists of the fuel rate controller (FRC) and the four sensors providing input to it. Together, these five components are referred to as the Fuel Rate Control System (FRCS) in the rest of the paper. The following parts of this section present the reliability/ availability modelling of the FRCS. We have used the MATLAB Simulink/ Stateflow environment for this case study.

3.3.1. Functional Requirement Identification

We have listed the functional requirements of the aircraft fuel control system in the first column of Table 2. In this regard, we treat the information available on the MathWorks website as the SyRS document of the system [22]. The aircraft fuel control system requires four functional inputs: exhaust gas oxygen (EGO), manifold pressure (MAP), throttle angle (open/ close), and engine fan speed. Among these four, throttle and engine speed are forward signals, while EGO and MAP are feedback signals. The throttle signal gives information about the opening/ closing of the throttle valve; based on the throttle valve angle, the system estimates the required amount of air flow to the engine. The fan speed signal gives information about the rotational speed of the turbine. These two forward signals are fed to the fuel rate controller (FRC). The feedback signal EGO gives information about the amount of oxygen present at the engine, and the feedback signal MAP gives information about the air density at the engine; these two signals feed the engine oxygen content and air density back to the FRC, respectively. Based on the input signals, the FRC determines the required fuel outflow rate for combustion and maintains the required proportion of oil and air at the engine. It also maintains the internal temperature of the system for smooth combustion. The uninterrupted fuel supply to the combustor ensures turbine rotation for energy generation.

Table 2. System Functions and Functionally Equivalent Components

Functions (F_i)                       Functionally Equivalent Component (C_i)
F1: Throttle Angle Open/ Close        C1: Throttle
F2: Run Engine Fan                    C2: Engine Fan
F3: Exhaust Gas Oxygen Supply         C3: Exhaust Gas Oxygen (EGO)
F4: Air Pressure Measure              C4: Manifold Pressure (MAP)
F5: Sense Throttle Signal             C5: Throttle Sensor
F6: Sense Fan Speed                   C6: Speed Sensor
F7: Sense EGO Signal                  C7: EGO Sensor
F8: Sense MAP Signal                  C8: MAP Sensor
F9: Control Fuel Outflow Rate         C9: Fuel Rate Controller
F10: Run Engine                       C10: Engine

Fig. 6. Configuration model of fault-tolerant aircraft fuel control system
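Rendered as a data structure, the configuration model of Table 2 is simply a function-to-component mapping; a sketch (the identifier names are ours):

```python
# Function-to-component mapping of Table 2 (F_i -> C_i).
CONFIGURATION_MODEL = {
    "throttle_angle_open_close": "throttle",
    "run_engine_fan":            "engine_fan",
    "exhaust_gas_oxygen_supply": "exhaust_gas_oxygen",
    "air_pressure_measure":      "manifold_pressure",
    "sense_throttle_signal":     "throttle_sensor",
    "sense_fan_speed":           "speed_sensor",
    "sense_ego_signal":          "ego_sensor",
    "sense_map_signal":          "map_sensor",
    "control_fuel_outflow_rate": "fuel_rate_controller",
    "run_engine":                "engine",
}
```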

3.3.2. Configuration Model

We map the functional requirements of the aircraft fuel control system to a set of functionally equivalent abstract components, as listed in the second column of Table 2. The system configuration model is created using these components (refer to Fig. 6). We use a flow taxonomy similar to that of the functional failure identification and propagation (FFIP) framework to mark the signal/ data flows [23, 24].

3.3.3. Component Behavioral Model

Operational modes of the components are grouped into three categories: a) normal, b) degraded/ partial failure, and c) complete failure. Some components, like the EGO sensor, may additionally have an initial warm-up state; during this state the component undergoes a preparatory phase, as the feedback signal takes time to reach the operational range. From the literature [25], we have identified the degraded (partial failure) modes of EGO sensors. These degraded modes cause deviation of the response signal with additive/ subtractive polarity. As identified in the literature [25], these degraded modes are: a) incorrect signal/ calibration error, b) error in the transmission line, c) error in the computation device, and d) improper response to the recipient. Likewise, the complete failure modes of the EGO sensor that cause loss of signal are: a) loss of signal from the sensor, b) loss of signal from the transmission line, c) short circuit, and d) open circuit [25].

Table 3 lists the above-mentioned failure modes of the EGO sensor together with the probability of occurrence of each failure mode, as identified in the literature [26]. Here we consider, on demand, at most one failure in ten thousand observations, as we assume the system requirement specification instructs the use of sensor components that qualify for Safety Integrity Level 4 (SIL 4). As represented in Table 3, during a complete failure of the sensor the signal strength becomes zero, except in the short circuit case, in which it rises abruptly to high values. During a partial failure, on the other hand, the signal strength erroneously deviates from the actual value. This error follows a normal distribution N(µ=0, σ). We assume the standard deviation (σ) of the sensor signal due to an incorrect signal, an error in the transmission line, an error in the computation device, and an improper response to the recipient to be 0.3%, 0.4%, 0.5%, and 0.6% of the actual signal, respectively. All operational modes of the exhaust gas oxygen (EGO) sensor are shown in Fig. 7. A similar component behaviour analysis is performed for the other sensors and the microcontroller.

Table 4 lists the actual input signal to each sensor and its acceptance range. Throttle angle and fan speed are the forward input signals, whereas EGO and MAP are the feedback signals. For normal working of the system, the input signals should be within the acceptance ranges given in Table 4. Based on these input data, we simulate the system in the next step of the model.

Fig. 7. EGO sensor component behavior model

Table 3. Failure modes, probability of each failure mode, and expected EGO sensor behavior at each failure mode

Partial failure (response deviates from the normal response Ci(Normal) by a normally distributed error):

Failure mode (j)                 Probability (pj)   Response (additive polarity)           Response (subtractive polarity)
Incorrect signal                 -                  Ci(Normal) + N(0, 0.003×Ci(Normal))    Ci(Normal) - N(0, 0.003×Ci(Normal))
Error in transmission line       -                  Ci(Normal) + N(0, 0.004×Ci(Normal))    Ci(Normal) - N(0, 0.004×Ci(Normal))
Error in computation device      -                  Ci(Normal) + N(0, 0.005×Ci(Normal))    Ci(Normal) - N(0, 0.005×Ci(Normal))
Improper response to recipient   0.00058            Ci(Normal) + N(0, 0.006×Ci(Normal))    Ci(Normal) - N(0, 0.006×Ci(Normal))

Complete failure (the sensor stops responding):

Failure mode (j)                        Probability (pj)   Response
Loss of signal from sensor              -                  No_signal
Loss of signal from transmission line   0.0001             No_signal
Short circuit                           0.0002             Signal rises abruptly to high values
Open circuit                            0.00012            No_signal

Table 4. Initial input signals and acceptance ranges of the components

Signal                  Throttle Angle (TA) (degree)   Fan Speed (FS) (rad/s)   EGO (volt)        MAP (bar)            Microcontroller
Initial input signal    20                             300                      N/A (feedback)    N/A (feedback)       N/A
Acceptance range        3 < TA < 90                    50 < FS < 628            EGO < 1.2         0.05 < MAP < 0.95    [On, Off]
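The component behavior of Tables 3 and 4 can be exercised directly in simulation. Below is an illustrative Python sketch (the study itself uses the Matlab/ Simulink platform); the mode identifiers and the polarity convention are ours:

```python
import random

EGO_SIGMAS = {  # sigma as a fraction of the normal response (Table 3)
    "incorrect_signal":         0.003,
    "transmission_line_error":  0.004,
    "computation_device_error": 0.005,
    "improper_response":        0.006,
}

def ego_response(normal_value: float, mode: str, polarity: int = +1):
    """EGO sensor response C_i(x) for a given operation mode (Table 3)."""
    if mode == "normal":
        return normal_value
    if mode in EGO_SIGMAS:  # partial failure: additive (+1) or subtractive (-1)
        return normal_value + polarity * random.gauss(
            0.0, EGO_SIGMAS[mode] * normal_value)
    return None             # complete failure: No_signal

def ego_in_acceptance_range(value) -> bool:
    """Acceptance check for the EGO feedback signal (Table 4): EGO < 1.2 V."""
    return value is not None and value < 1.2
```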


3.3.4. Software Controller Behavioral Model

The software controller behavior model is presented using three concurrent state machines: the aggregate components failure behavior, the interactions failure behavior, and the software operation behavior. These three are inter-dependent. In the following, we explain them with the example of the FRCS.

Fig. 8. Aggregate components behavior of software controller

Fig. 9. Interaction behavior model of the software controller

The aggregate components operational behavior explains the impact of single or composite component failures on the state transitions of the software controller. Based on the system specification on the MathWorks website, we have identified two different sets of component(s). The failure of one set of components leads to a state transition of the software controller from the normal state to the fault-tolerant state; each individual sensor (throttle/ speed/ EGO/ MAP) of the FRCS belongs to this set. The failure of the other set of components leads to a state transition of the software controller from normal/ fault-tolerant to complete failure; the microcontroller and any combination of two or more sensors belong to this set. We have modeled the aggregate components operational behavior in Fig. 8 using the Simulink/ Stateflow environment. In this figure, four operational states are identified: all components working (all_working), the fault-tolerant state due to a single sensor failure (single_sensors_fail), system failure due to more than one sensor failure (multi_sensor_failure), and system failure due to microcontroller failure (chip_fail). If all sensors and the microcontroller are working, the fueling system keeps supplying fuel to the engine normally. At any point, if the microcontroller is working and one out of the four sensors fails, the system transits to the fault-tolerant state; the system still supplies fuel to the engine, but the outflow rate may not remain steady. However, if multiple sensors and/ or the microcontroller fail, the fuel supply is disrupted and the engine may stall. At this point the software controller state transits from normal/ fault-tolerant to complete failure. The suspension of the fuel supply continues until the system is restored to its operational condition.

The modeling of the interaction behavior of the FRCS is represented in Fig. 9. To demonstrate software-driven hardware interaction failures, we have defined some critical states; if the system reaches such a state due to exceptional input conditions, the associated hardware component is damaged. Based on the specification on the MathWorks website [22], we have listed these exceptional input conditions in Table 5. On the other hand, we have demonstrated two types of hardware-driven software interaction failures: memory inaccessibility and delay [27]. We have modeled memory inaccessibility using a Matlab/ Simulink function (memory_op()) that checks for failure of any memory operation of the software controller; if the function returns true, the corresponding operation running on the microprocessor fails. Delay in operation occurs due to a low clock frequency of the microprocessor; if the function (delay_op()) returns true, the corresponding operation running on the microprocessor fails due to operational delay.

Table 5. Fuel rate controller (FRC) failures dependent on composite signal strengths

Sl/No.  Condition                                  Sl/No.  Condition
1       Throttle < 3 degree & Speed < 50 rad/s     10      Throttle > 90 degree & MAP > 0.95 bar
2       Throttle < 3 degree & Speed > 628 rad/s    11      Speed < 50 rad/s & EGO > 1.2 volt
3       Throttle < 3 degree & EGO > 1.2 volt       12      Speed < 50 rad/s & MAP < 0.05 bar
4       Throttle < 3 degree & MAP < 0.05 bar       13      Speed < 50 rad/s & MAP > 0.95 bar
5       Throttle < 3 degree & MAP > 0.95 bar       14      Speed > 628 rad/s & EGO > 1.2 volt
6       Throttle > 90 degree & Speed < 50 rad/s    15      Speed > 628 rad/s & MAP < 0.05 bar
7       Throttle > 90 degree & Speed > 628 rad/s   16      Speed > 628 rad/s & MAP > 0.95 bar
8       Throttle > 90 degree & EGO > 1.2 volt      17      EGO > 1.2 volt & MAP < 0.05 bar
9       Throttle > 90 degree & MAP < 0.05 bar      18      EGO > 1.2 volt & MAP > 0.95 bar

Fig. 10. Software operation behavior of the software controller
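A direct encoding of Table 5's composite conditions is straightforward. The sketch below (function and parameter names are ours) exploits the fact that the eighteen rows enumerate exactly the pairs of simultaneously out-of-range signals:

```python
def exceptional_condition(throttle: float, speed: float,
                          ego: float, map_pressure: float) -> bool:
    """True if any composite condition of Table 5 holds, i.e., at least two
    input signals are simultaneously outside their acceptance ranges."""
    out_of_range = [
        throttle < 3 or throttle > 90,               # degrees
        speed < 50 or speed > 628,                   # rad/s
        ego > 1.2,                                   # volt
        map_pressure < 0.05 or map_pressure > 0.95,  # bar
    ]
    return sum(out_of_range) >= 2
```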

The software operational behavior model (shown in Fig. 10) represents the state transitions of the software controller due to component or interaction failures. The software controller of the FRCS broadly has two operation modes: fuel running (working) and disabled (failure). If all system components are working properly, fuel flows at a constant rate; this is the fuel running mode (working). The running mode can be further divided into two types, low emission mode (normal working) and rich mode (fault-tolerant), based on the proportion of air and oil in the fuel. The low emission mode consists of an initial warm-up mode and a normal operational mode. At the beginning of operation, the oxygen level and air pressure may not be optimal at the combustor; during the warm-up mode, the software controller tries to bring the oxygen level and air pressure to an optimal condition based on the feedback signals. Once the software controller reaches the desired operating condition after the warm-up mode, the system starts its normal operation. During the warm-up and normal operation modes, the fuel outflow rate remains comparatively low. However, if one out of the four sensors fails, the equilibrium is disturbed and the controller increases the fuel outflow rate to bring back normalcy; this situation is denoted as the rich emission mode. On the other hand, the fueling mode turns to the disabled mode if more than one sensor or the microcontroller fails, independently or due to an interaction failure.

3.3.5. System Behavior Simulation

At the beginning of the simulation process, we assume that the FRCS is in the normal working state. Therefore, we start the simulation process by setting the operation mode of each component to normal. The normal operating ranges of the components are given in Table 4. During the simulation process, random sampling is performed to select the operation mode of each component; we use the Matlab/ Simulink platform for this purpose. As mentioned in the component behavioral model, in each operation mode the component behavior follows a distinct trend. For example, the EGO sensor responses during the failure modes are given in Table 3, and the normal mode (EGO feedback signal range) is given in Table 4. During the simulation process, each component generates random signals based on its operation mode. These random input signals are fed to the software controller for execution. Finally, the response of the software controller is recorded at each simulation iteration, and reliability/ availability is predicted on the basis of the simulation results.

3.3.6. Reliability/ Steady-State Availability Prediction of the System

We start the simulation process assuming the FRCS is in the normal working state. During each iteration a random input dataset is generated; the randomization of the components' operation modes creates input data variations for the software controller. We feed these random input data to the software controller for execution and record the response. For some input datasets a state transition of the software controller may not occur, whereas for others a transition may occur. Three distinct state transitions are observed: normal to failure ($N \to F$), normal to fault-tolerant ($N \to T$), and fault-tolerant to failure ($T \to F$). As mentioned in the Software Controller Behavioral Model section, by observing the output of the FRCS at each simulation iteration we estimate the number of times these transitions occur. Briefly, if a transition $N \to F$ or $T \to F$ occurs, the fuel outflow rate of the FRCS drops to zero, whereas in the other cases the outflow rate complies with the desired value (> 0), as specified on the MathWorks website. We record the input datasets and the corresponding responses of the system using the Matlab/ Simulink platform.

Fig. 11. Reliability graph of the FRCS

Fig. 12. Availability graph for the FRCS

The total number of simulation iterations ($L$) is known, as we run the simulation process on a digital platform. Say the estimated numbers of times the transitions normal to failure, normal to fault-tolerant, and fault-tolerant to failure occurred in the Matlab/ Simulink platform are $u$, $v$, and $w$, respectively. The unreliability $\bar{R}$ of the FRCS is estimated as the ratio of the total number of times the fuel outflow rate does not comply with the operational limit to the total number of simulation iterations ($L$). So, the reliability of the system is estimated as $R = 1 - \bar{R}$. Initially, we set a simulation time of 500 milliseconds (iterations $L = 503$) and record the FRCS fuel outflow rate. Then we increase the simulation time until the ratio becomes stable. We observed that the reliability of the system tends to zero as time tends to infinity. The transient reliability graph of the FRCS is given in Fig. 11. If we consider the FRCS as a non-repairable system, the transient reliability of the system can be defined as $R(t) = e^{-\sum_i \lambda_i t}$, where $\lambda_i$ is the failure rate of the $i$th component. In the proposed model, we have assumed the failure rate of each component to be constant, so the system reliability should follow an exponential distribution. It is clear from Fig. 11 that the transient reliability curve obtained using the proposed method decreases exponentially with increasing simulation iterations. If we increase the simulation iterations up to the steady state, the plot (Fig. 11) saturates at zero and no further change is noticed. This fact can also be explained from the above reliability expression: if time $t$ tends to infinity, the transient reliability tends to $R(\infty) = 0$.
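For concreteness, the estimator can be sketched in a few lines of Python. This is an illustration under our own naming, not the authors' code, and the failure rates shown are placeholders.

```python
# Sketch of the simulation-based reliability estimate described above.
import math

def reliability(num_noncompliant: int, total_iterations: int) -> float:
    """R = 1 - (iterations with non-compliant fuel outflow) / (total iterations)."""
    return 1.0 - num_noncompliant / total_iterations

# With constant component failure rates the transient reliability should
# decay as R(t) = exp(-sum(lambda_i) * t); illustrative placeholder rates:
lambdas = [4e-5, 6e-5, 3e-5, 2e-5, 2e-5]

def R(t: float) -> float:
    return math.exp(-sum(lambdas) * t)   # tends to zero as t grows, as in Fig. 11
```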

Table 6. Steady-state Availability of the Fuel Rate Control System (FRCS)

System Failures due to Component Failure   System Failures due to Interaction Failure   Steady-state Availability
                   5                                          6                                  0.99983

To develop the availability model, we have considered instant repair of the failure states of the system. In this regard, we estimate the total number of times ($u'$) the fuel outflow rate exceeds the operational limit and the total number of simulation iterations ($L'$). So, the availability of the system is estimated as $A = 1 - u'/L'$. Initially, we set the simulation time as 500 milliseconds and record the value of this ratio. Then, step by step, we increase the simulation time till the $u'/L'$ ratio reaches stability. The point at which the ratio gets stabilized gives the steady-state availability of the system. We have noticed that at 1000 milliseconds (100001 iterations) the fuel rate control system (FRCS) reaches steady-state availability. The availability graph of the FRCS is given in Fig. 12. Table 6 presents the system failures due to component failures, the system failures due to interaction failures, and the steady-state availability of the FRCS.
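A minimal sketch of this instant-repair estimate follows, assuming a hypothetical step() predicate that runs one simulation iteration and reports whether the fuel outflow was compliant; the function and parameter names are our own.

```python
# Sketch: track the availability ratio until successive estimates stabilise.
def run_until_stable(step, tol=1e-6, min_iter=1000, max_iter=10**6):
    """Instant repair: each iteration is classified compliant/non-compliant
    and the running ratio A = 1 - bad/n is followed until it settles."""
    bad = 0
    prev = 1.0
    for n in range(1, max_iter + 1):
        if not step():                  # step() -> True when outflow complies
            bad += 1
        avail = 1.0 - bad / n
        if n >= min_iter and abs(avail - prev) < tol:
            return avail, n             # stabilised: steady-state availability
        prev = avail
    return avail, max_iter
```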

4. Validation of the Proposed Reliability/ Availability Prediction Models

We validate the proposed methodology by comparing the reliability/ availability values obtained using the proposed approach for the above case study with those obtained using an already established approach, namely a Petri Nets model. At first, we use Generalized Stochastic Petri Nets (GSPNs) to model the above fuel rate control system (FRCS). Then, we evaluate the transient reliability and the steady-state availability of the FRCS using the GSPN. Finally, we compare the reliability/ availability values obtained using the GSPN model with the results that we obtained using our proposed model. The GSPN marking produces a reachability graph that is equivalent to a Continuous Time Markov Chain (CTMC); therefore, the state transition rates of the reachability graph are constant. The reachability graph does not have vanishing states; only tangible states are considered in constituting the CTMC.

Fig. 13 shows a GSPN of the FRCS with one place (All_working_P1) enabled containing four tokens. At this point, the transitions t-failure-T1, s-failure-T2, e-failure-T3, m-failure-T4, or u-failure-T9 may fire due to throttle sensor (t), speed sensor (s), EGO sensor (e), MAP sensor (m), or microcontroller (u) failure, respectively. The transitions t-failure-T1, s-failure-T2, e-failure-T3, m-failure-T4, or u-failure-T9 may lead the system to the t-failure_P2, s-failure_P3, e-failure_P4, m-failure_P5, and u-failure_P12 places, respectively. From the places t-failure_P2, s-failure_P3, e-failure_P4, m-failure_P5, and u-failure_P12 it may again return to the All_working_P1 place through the transitions t-repair-T5, s-repair-T6, e-repair-T7, m-repair-T8, and u-repair-T10, respectively.

On the other hand, the place t-failure_P2 may fire the transitions s-failure-T19, e-failure-T20, m-failure-T21, and u-failure-T11 to reach the places t&s-failure_P6, t&e-failure_P7, t&m-failure_P8, and t&u-failure_P13, respectively. Again, from the places t&s-failure_P6, t&e-failure_P7, t&m-failure_P8, and t&u-failure_P13 the system may return to the place t-failure_P2 through the transitions s-repair-T31, e-repair-T33, m-repair-T35, and u-repair-T12, respectively.

The place s-failure_P3 may fire the transitions t-failure-T22, e-failure-T23, m-failure-T24, and u-failure-T13 to reach the places t&s-failure_P6, s&e-failure_P9, s&m-failure_P10, and s&u-failure_P14, respectively. Again, from the places t&s-failure_P6, s&e-failure_P9, s&m-failure_P10, and s&u-failure_P14 it may return to the place s-failure_P3 through the transitions t-repair-T32, e-repair-T38, m-repair-T39, and u-repair-T14, respectively.

The place e-failure_P4 may fire the transitions t-failure-T25, s-failure-T26, m-failure-T27, and u-failure-T15 to reach the places t&e-failure_P7, s&e-failure_P9, e&m-failure_P11, and e&u-failure_P15, respectively. Again, from the places t&e-failure_P7, s&e-failure_P9, e&m-failure_P11, and e&u-failure_P15 it may return to e-failure_P4 through the transitions t-repair-T34, s-repair-T37, m-repair-T41, and u-repair-T16, respectively.


The place m-failure_P5 may fire the transitions t-failure-T28, s-failure-T29, e-failure-T30, and u-failure-T17 to reach the places t&m-failure_P8, s&m-failure_P10, e&m-failure_P11, and m&u-failure_P16, respectively. Again, from the places t&m-failure_P8, s&m-failure_P10, e&m-failure_P11, and m&u-failure_P16 it may return to m-failure_P5 through the transitions t-repair-T36, s-repair-T40, e-repair-T42, and u-repair-T18, respectively.

Fig. 13. Petri Nets model of the Fuel Rate Controller System (FRCS)

The GSPN produces a reachability graph (Fig. 14) that is equivalent to a CTMC. No vanishing states are observed in the reachability graph; only 16 tangible states constitute the CTMC. In this model, the infinitesimal generator is denoted as $Q = [q_{ij}]$, where $q_{ij}$ is the transition rate from state $i$ to state $j$. If there is no arc from $i$ to $j$, then $q_{ij} = 0$, and the diagonal entries are $q_{ii} = -\sum_{j \neq i} q_{ij}$. We denote the steady-state vector as $\pi = [\pi_1, \pi_2, \dots, \pi_{16}]$, which satisfies $\pi Q = 0$ together with $\sum_i \pi_i = 1$.

To calculate the transient probability of each state, we define $P_i(t)$ as the probability of the CTMC being in state $i$ at time $t$. So, we have 16 first-order linear differential equations, where the failure probabilities of the throttle sensor ($\lambda_1$), speed sensor ($\lambda_2$), EGO sensor ($\lambda_3$), MAP sensor ($\lambda_4$), and microcontroller ($\lambda_5$), and their repair probability ($\mu$), are known. The equations are as follows:

$$\begin{aligned}
\frac{d}{dt}P_1(t) &= -(\lambda_1+\lambda_2+\lambda_3+\lambda_4+\lambda_5)\,P_1(t) + \mu\,[P_2(t)+P_3(t)+P_4(t)+P_5(t)+P_{12}(t)] \\
\frac{d}{dt}P_2(t) &= -(\mu+\lambda_2+\lambda_3+\lambda_4+\lambda_5)\,P_2(t) + \lambda_1 P_1(t) + \mu\,[P_6(t)+P_7(t)+P_8(t)+P_{13}(t)] \\
\frac{d}{dt}P_3(t) &= -(\mu+\lambda_1+\lambda_3+\lambda_4+\lambda_5)\,P_3(t) + \lambda_2 P_1(t) + \mu\,[P_6(t)+P_9(t)+P_{10}(t)+P_{14}(t)] \\
\frac{d}{dt}P_4(t) &= -(\mu+\lambda_1+\lambda_2+\lambda_4+\lambda_5)\,P_4(t) + \lambda_3 P_1(t) + \mu\,[P_7(t)+P_9(t)+P_{11}(t)+P_{15}(t)] \\
\frac{d}{dt}P_5(t) &= -(\mu+\lambda_1+\lambda_2+\lambda_3+\lambda_5)\,P_5(t) + \lambda_4 P_1(t) + \mu\,[P_8(t)+P_{10}(t)+P_{11}(t)+P_{16}(t)] \\
\frac{d}{dt}P_6(t) &= -2\mu P_6(t) + \lambda_2 P_2(t) + \lambda_1 P_3(t) \\
\frac{d}{dt}P_7(t) &= -2\mu P_7(t) + \lambda_3 P_2(t) + \lambda_1 P_4(t) \\
\frac{d}{dt}P_8(t) &= -2\mu P_8(t) + \lambda_4 P_2(t) + \lambda_1 P_5(t) \\
\frac{d}{dt}P_9(t) &= -2\mu P_9(t) + \lambda_3 P_3(t) + \lambda_2 P_4(t) \\
\frac{d}{dt}P_{10}(t) &= -2\mu P_{10}(t) + \lambda_4 P_3(t) + \lambda_2 P_5(t) \\
\frac{d}{dt}P_{11}(t) &= -2\mu P_{11}(t) + \lambda_4 P_4(t) + \lambda_3 P_5(t) \\
\frac{d}{dt}P_{12}(t) &= -\mu P_{12}(t) + \lambda_5 P_1(t) \\
\frac{d}{dt}P_{13}(t) &= -\mu P_{13}(t) + \lambda_5 P_2(t) \\
\frac{d}{dt}P_{14}(t) &= -\mu P_{14}(t) + \lambda_5 P_3(t) \\
\frac{d}{dt}P_{15}(t) &= -\mu P_{15}(t) + \lambda_5 P_4(t) \\
\frac{d}{dt}P_{16}(t) &= -\mu P_{16}(t) + \lambda_5 P_5(t)
\end{aligned}$$
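One possible numerical treatment of these Kolmogorov equations is sketched below in Python with SciPy. The rate values are placeholders, and the zero-based state indexing is our own bookkeeping, not the paper's; the sketch merely assembles the generator $Q$ from the coefficients above and integrates $dP/dt = P\,Q$.

```python
# Sketch: build the 16-state generator Q and integrate the transient probabilities.
import numpy as np
from scipy.integrate import solve_ivp

lam = [4e-5, 6e-5, 3e-5, 2e-5, 2e-5]   # lambda_1..lambda_5 (placeholders)
mu = 1e-3                               # common repair rate (placeholder)

Q = np.zeros((16, 16))
# 0: all working; 1-4: single sensor failures (t, s, e, m);
# 5-10: double sensor failures (t&s, t&e, t&m, s&e, s&m, e&m);
# 11: u alone; 12-15: t&u, s&u, e&u, m&u   (u = microcontroller)
pairs = {5: (1, 2), 6: (1, 3), 7: (1, 4), 8: (2, 3), 9: (2, 4), 10: (3, 4)}
for k in range(4):
    Q[0, 1 + k] = lam[k]                # a sensor fails from the all-working state
Q[0, 11] = lam[4]                       # the microcontroller fails
for s in range(1, 5):
    Q[s, 0] = mu                        # single failure repaired
    Q[s, 11 + s] = lam[4]               # microcontroller fails on top
for d, (a, b) in pairs.items():
    Q[a, d] = lam[b - 1]                # the second sensor fails
    Q[b, d] = lam[a - 1]
    Q[d, a] = Q[d, b] = mu              # one of the two sensors repaired
for s in range(11, 16):
    Q[s, s - 11] = mu                   # microcontroller failure repaired
np.fill_diagonal(Q, -Q.sum(axis=1))     # q_ii = -sum of off-diagonal rates

P0 = np.zeros(16); P0[0] = 1.0          # start in the all-working state
sol = solve_ivp(lambda t, P: P @ Q, (0.0, 1e4), P0)   # dP/dt = P Q
```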

The transient probabilities we get from the above differential equations can be expressed as follows:

$$\sum_{i=1}^{16} P_i = 1, \qquad P^{T}\,[D - I] = 0,$$

where $P$ is the state probability vector, $D$ is the transition probability matrix, and $I$ is the identity matrix. Writing $\Lambda = \lambda_1+\lambda_2+\lambda_3+\lambda_4+\lambda_5$ and $d_k = 1-\mu-(\Lambda-\lambda_{k-1})$ for $k = 2,\dots,5$, the matrix $D$ assembled from the coefficients of the above differential equations is:

$$D = \begin{bmatrix}
1-\Lambda & \lambda_1 & \lambda_2 & \lambda_3 & \lambda_4 & 0 & 0 & 0 & 0 & 0 & 0 & \lambda_5 & 0 & 0 & 0 & 0 \\
\mu & d_2 & 0 & 0 & 0 & \lambda_2 & \lambda_3 & \lambda_4 & 0 & 0 & 0 & 0 & \lambda_5 & 0 & 0 & 0 \\
\mu & 0 & d_3 & 0 & 0 & \lambda_1 & 0 & 0 & \lambda_3 & \lambda_4 & 0 & 0 & 0 & \lambda_5 & 0 & 0 \\
\mu & 0 & 0 & d_4 & 0 & 0 & \lambda_1 & 0 & \lambda_2 & 0 & \lambda_4 & 0 & 0 & 0 & \lambda_5 & 0 \\
\mu & 0 & 0 & 0 & d_5 & 0 & 0 & \lambda_1 & 0 & \lambda_2 & \lambda_3 & 0 & 0 & 0 & 0 & \lambda_5 \\
0 & \mu & \mu & 0 & 0 & 1-2\mu & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & \mu & 0 & \mu & 0 & 0 & 1-2\mu & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & \mu & 0 & 0 & \mu & 0 & 0 & 1-2\mu & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & \mu & \mu & 0 & 0 & 0 & 0 & 1-2\mu & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & \mu & 0 & \mu & 0 & 0 & 0 & 0 & 1-2\mu & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & \mu & \mu & 0 & 0 & 0 & 0 & 0 & 1-2\mu & 0 & 0 & 0 & 0 & 0 \\
\mu & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1-\mu & 0 & 0 & 0 & 0 \\
0 & \mu & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1-\mu & 0 & 0 & 0 \\
0 & 0 & \mu & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1-\mu & 0 & 0 \\
0 & 0 & 0 & \mu & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1-\mu & 0 \\
0 & 0 & 0 & 0 & \mu & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1-\mu
\end{bmatrix}$$

Table 7. Number of failures and failure probability of the components

                                              Throttle Sensor   Engine Fan Speed Sensor   EGO Sensor   MAP Sensor   Microcontroller
Number of Failures out of 100001 iterations          4                     6                   3            2              2
Failure Probability                               0.000039              0.000059           0.000029     0.000019       0.000019

Table 8. Steady-state availability

States                      Steady-state probability
P1                          0.98719
P2                          0.00064
P3                          0.00081
P4                          0.00065
P5                          0.01068
Steady-state availability   0.99999

Fig. 14. Reachability graph of the Petri Nets model


To estimate the failure probability of the components, we use the same data that was used during the system simulation. The first row of Table 7 presents the total number of failures of the throttle sensor, speed sensor, EGO sensor, MAP sensor, and microcontroller during the entire 100001 simulation iterations. We estimate the failure probability of each component as the ratio of its total number of failures to the total number of simulation iterations. The estimated failure probabilities of the components are presented in the second row of Table 7.
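For example, using the counts in Table 7, the estimated failure probability of the throttle sensor is

$$\hat{p}_{\text{throttle}} = \frac{4}{100001} \approx 3.9 \times 10^{-5} = 0.000039,$$

and the remaining entries of the second row follow in the same way.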

In the reachability graph, one set of states ($m_A$) represents the working condition of the system and the other set ($m'_A$) represents the failure states. If we consider the FRCS as a non-repairable system, then the transient reliability of the system can be defined as $R(t) = e^{-\sum_k \theta_k t}$, where $\theta_k$ is the firing rate of the $k$th transition whose firing causes the target net to leave the reliable states. In this reliability expression, if $t$ tends to infinity, $R(\infty) = 0$. The transient reliability plot of the FRCS is given in Fig. 15.

Fig. 15. Graphical representation of the steady-state reliability using CTMC model

Fig. 16. Graphical representation of the steady-state availability using CTMC model

If we consider the FRCS as a repairable system, then the transient availability is $A(t) = \sum_{i \in m_A} P_i(t)$, i.e., the probability mass in the working states at time $t$. The repair probability of the throttle sensor, speed sensor, EGO sensor, MAP sensor, and microcontroller is assumed to be the same ($\mu$). Then, solving the above first-order linear differential equations, we get the steady-state probabilities of the working states ($m_A$) as given in Table 8. The sum of the steady-state probabilities of the working states gives the steady-state availability of the system. Fig. 16 presents the steady-state availability of the system with respect to time. Finally, we observe that the steady-state availability of the FRCS obtained using the CTMC model (0.99999) is quite similar to the corresponding value (0.99983) achieved using the proposed simulation-based method. The CTMC model gives a slightly higher availability because it has no scope for considering interaction failures; it only considers system failures due to component failures.

5. Conclusion


At the early stage of system development, more than one design alternative may be available, and it is difficult to determine and select the most reliable system design among the available alternatives. Moreover, at the initial stages of system development, the actual system components may not be available. So, it is desirable to perform reliability/ availability analysis on the system design. We have proposed a model for a combined hardware-software system that can be used to predict the worst-case system reliability/ availability based on the conceptual design. The novelty of our work is the quantitative reliability/ availability analysis of a combined HW-SW system considering hardware-driven software and software-driven hardware failures at the early design stages. The proposed model converts the system functions to conceptual-level abstract components. The technical composition of such components is unknown, but their functionalities are defined. We simulate the system behavior based on functional logic for a set of input data variations. The functional failure/ success of the system for different input sets gives the reliability/ availability of the system. Unlike the existing functional failure based models, the proposed model predicts the reliability/ availability of the system rather than being confined to risk analysis alone. At the same time, we have considered the entire spectrum of interaction failures that may arise among the hardware-software components, apart from individual component failures. To demonstrate the applicability of the proposed model, we have predicted the reliability/ availability of an aircraft fuel rate controller as a case study. Further, we have validated the proposed model using the same example. Finally, we can conclude that the proposed simulation-based model avoids the inconvenience of handling the huge state space of Markovian early reliability/ availability models. It also avoids the qualitative analysis of huge numbers of execution paths required by functional failure identification and propagation (FFIP) based models.

Acknowledgements

This work was carried out at the Subir Chowdhury School of Quality and Reliability, Indian Institute of Technology Kharagpur, India. We thank all the faculty members, research scholars, and staff of the school for their co-operation and support. We gratefully acknowledge the Ministry of Human Resource Development (MHRD), Government of India, for funding this research.


References

[1] A. Syed, D. G. Pérez, and G. Fohler, "Job-shifting: An algorithm for online admission of nonpreemptive aperiodic tasks in safety critical systems," Journal of Systems Architecture, vol. 85, pp. 14-27, 2018.
[2] Q. Zhao, Z. Gu, M. Yao, and H. Zeng, "HLC-PCP: A resource synchronization protocol for certifiable mixed criticality scheduling," Journal of Systems Architecture, vol. 66, pp. 84-99, 2016.
[3] I. Tumer and C. Smidts, "Integrated design-stage failure analysis of software-driven hardware systems," IEEE Transactions on Computers, vol. 60, pp. 1072-1084, 2011.
[4] R. K. Iyer and P. Velardi, "Hardware-related software errors: measurement and analysis," IEEE Transactions on Software Engineering, pp. 223-231, 1985.
[5] X. Teng, H. Pham, and D. R. Jeske, "Reliability modeling of hardware and software interactions, and its applications," IEEE Transactions on Reliability, vol. 55, pp. 571-577, 2006.
[6] A. Costes, C. Landrault, and J.-C. Laprie, "Reliability and availability models for maintained systems featuring hardware failures and design faults," IEEE Transactions on Computers, vol. 100, pp. 548-560, 1978.
[7] K. Kanoun and M. Ortalo-Borrel, "Fault-tolerant system dependability-explicit modeling of hardware and software component-interactions," IEEE Transactions on Reliability, vol. 49, pp. 363-376, 2000.
[8] U. Sumita and Y. Masuda, "Analysis of software availability/reliability under the influence of hardware failures," IEEE Transactions on Software Engineering, pp. 32-41, 1986.
[9] D. Jensen, I. Y. Tumer, and T. Kurtoglu, "Flow State Logic (FSL) for analysis of failure propagation in early design," in ASME 2009 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, 2009, pp. 1033-1043.
[10] D. C. Jensen, I. Y. Tumer, and T. Kurtoglu, "Modeling the Propagation of Failures in Software Driven Hardware Systems to Enable Risk-Informed Design," in ASME 2008 International Mechanical Engineering Congress and Exposition, 2008, pp. 283-293.
[11] S. Sierla, I. Tumer, N. Papakonstantinou, K. Koskinen, and D. Jensen, "Early integration of safety to the mechatronic system design process by the functional failure identification and propagation framework," Mechatronics, vol. 22, pp. 137-151, 2012.
[12] C. Mutha, D. Jensen, I. Tumer, and C. Smidts, "An integrated multidomain functional failure and propagation analysis approach for safe system design," AI EDAM, vol. 27, pp. 317-347, 2013.
[13] B. Huang, M. Rodriguez, M. Li, J. B. Bernstein, and C. S. Smidts, "Hardware error likelihood induced by the operation of software," IEEE Transactions on Reliability, vol. 60, pp. 622-639, 2011.
[14] X. Diao, Y. Zhao, M. Pietrykowski, Z. Wang, S. Bragg-Sitton, and C. Smidts, "Fault Propagation and Effects Analysis for Designing an Online Monitoring System for the Secondary Loop of the Nuclear Power Plant Portion of a Hybrid Energy System," Nuclear Technology, pp. 1-18, 2018.
[15] N. Papakonstantinou, S. Proper, B. O'Halloran, and I. Y. Tumer, "A Plant-Wide and Function-Specific Hierarchical Functional Fault Detection and Identification (HFFDI) System for Multiple Fault Scenarios on Complex Systems," in ASME 2015 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, 2015, pp. V01BT02A039-V01BT02A039.
[16] T. Kurtoglu and I. Y. Tumer, "A graph-based fault identification and propagation framework for functional design of complex systems," Journal of Mechanical Design, vol. 130, p. 051401, 2008.
[17] T. Kurtoglu and I. Y. Tumer, "A risk-informed decision making methodology for evaluating failure impact of early system designs," in ASME 2008 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, 2008, pp. 457-467.
[18] A. M. Dowell III, "Layer of protection analysis for determining safety integrity level," ISA Transactions, vol. 37, pp. 155-165, 1998.
[19] D. S. Roy, C. Murthy, and D. K. Mohanta, "Reliability analysis of phasor measurement unit incorporating hardware and software interaction failures," IET Generation, Transmission & Distribution, vol. 9, pp. 164-171, 2015.
[20] A. K. Trivedi and M. L. Shooman, A Markov Model for the Evaluation of Computer Software Performance. Polytechnic Institute of New York, Department of Electrical Engineering and Electrophysics, 1974.
[21] N. Papakonstantinou, S. Sierla, I. Y. Tumer, and D. C. Jensen, "Using fault propagation analyses for early elimination of unreliable design alternatives of complex cyber-physical systems," in ASME 2012 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, 2012, pp. 1183-1191.
[22] MathWorks. (2011). MATLAB/ Simulink Examples. Available: http://in.mathworks.com/examples
[23] J. Hirtz, R. B. Stone, D. A. McAdams, S. Szykman, and K. L. Wood, "A functional basis for engineering design: reconciling and evolving previous efforts," Research in Engineering Design, vol. 13, pp. 65-82, 2002.
[24] R. B. Stone and K. L. Wood, "Development of a functional basis for design," Journal of Mechanical Design, vol. 122, pp. 359-370, 2000.
[25] NSWCD, "Handbook of reliability prediction procedures for mechanical equipment," Naval Surface Warfare Center Carderock Division, West Bethesda, Maryland 20817-5700, 2011.
[26] R. Borgovini, S. Pumford, and M. Rossi, "Failure Mode, Effects and Criticality Analysis (FMECA)," Reliability Analysis Center, Rome Laboratory, 1993.
[27] D. Gil, J. Gracia, J. C. Baraza, and P. J. Gil, "Impact of faults in combinational logic of commercial microcontrollers," in European Dependable Computing Conference, 2005, pp. 379-390.

Sourav Sinha is currently pursuing a Ph.D. at the Indian Institute of Technology (IIT) Kharagpur. He received a B.Tech in Computer Science & Engineering and an MS in Industrial and Systems Engineering. His areas of research are software reliability, system reliability, and dependability analysis. He also has more than five years of work experience as a Software Programmer in an ERP implementation project at IIT Kharagpur.

Neeraj Kumar Goyal is currently an associate professor in the Reliability Engineering Centre, Indian Institute of Technology (IIT) Kharagpur, India. He received his PhD degree from IIT Kharagpur in reliability engineering in 2006. His areas of research and teaching are network reliability, software reliability, electronic system reliability, reliability testing, probabilistic risk/safety assessment, and reliability design. He has completed various research and consultancy projects for organizations such as DRDO, NPCIL, Vodafone, and ECIL. He has contributed several research papers to international journals and conference proceedings.

Rajib Mall is a Professor of the Department of Computer Science and Engineering at the Indian Institute of Technology, Kharagpur, West Bengal, India. He received his Bachelor's, Master's, and Ph.D. degrees in Computer Science, all from the Indian Institute of Science, Bangalore. Prof. Mall has presented numerous lectures, conference presentations, and workshops on Software Engineering, Real Time Systems, and Wireless Sensor Networks. He has published more than 200 journal papers, conference papers, and book chapters. He has active research interests in Software Engineering, Real Time Systems, Wireless Sensor Networks, Web Engineering, Web Sizing, Cost Estimation, and Web Quality and Productivity.
