20th IFAC Symposium on Automatic Control in Aerospace, August 21-25, 2016, Sherbrooke, Quebec, Canada
Available online at www.sciencedirect.com
ScienceDirect IFAC-PapersOnLine 49-17 (2016) 248–253
Flight Control Software Failure Mitigation: Design Optimization for Software-implemented Fault Detectors
Andrey Morozov ∗ Klaus Janschek ∗

∗ Technische Universität Dresden, Institute of Automation, 01062 Dresden, Germany (e-mail: {andrey.morozov, klaus.janschek}@tu-dresden.de)

Abstract: Failures of avionic and aerospace control hardware, caused by negative environmental impacts like increasing heat or cosmic radiation, can lead to silent data corruption and undetected incorrect system outputs. Traditionally, redundant and specifically protected hardware is used, which is expensive and available only on restricted markets. The application of software-implemented fault detectors like SWIFT, SWIFT ECF, or Software Encoded Processing is a promising alternative solution that offers the opportunity to use cost-effective, but less reliable hardware. However, this entails generation of extra source code, resulting in a considerable computational overhead and, as a consequence, leads to performance degradation. This article introduces an approach that aims at minimizing the negative performance impact while maintaining the required system reliability level. It is shown that selective and balanced application of the software-implemented fault detectors solely to the most critical parts of the control software is an efficient system design solution. The presented approach uses a combination of two methods for reliability and performance analysis. Both methods are used for the quantitative exploration of different strategies of selective protection and allow finding a balance between system performance and reliability. The article demonstrates the application of the introduced approach using embedded flight control software of a UAV.

© 2016, IFAC (International Federation of Automatic Control) Hosting by Elsevier Ltd. All rights reserved.

Keywords: Flight control software, UAV, error propagation, reliability, performance, optimization, Markov models, model-based design.

1. INTRODUCTION
1.1 Motivation

Usually, software in safety-critical systems assumes a fault-free execution through the executing hardware. However, decreasing feature sizes of integrated circuits (e.g. CPU and memory) and increasing system complexity lead to less reliable hardware (Borkar (2005)). Hardware failures, e.g. a bit-flip, can cause hidden data corruption and may result in undetected incorrect system outputs (silent data corruption). In avionic and aerospace applications, an undetected erroneous output can cause hazardous issues. A failure of the Russian space mission "Phobos Grunt" in February 2012 is an example, see Oberg (2012). According to the official report (Roskosmos (2012)), this failure happened because of an SRAM fault, caused by "a local influence of heavy charged particles" (cosmic radiation).

Fig. 1. An example of a hardware failure: A negative environmental impact corrupts a part of the system's memory; this changes the value of a stored variable and causes a data error that propagates to a critical system output.

The example in Fig. 1 illustrates a hardware failure in focus. A negative environmental impact corrupts a part of the memory of the computing hardware. This results in a single or several bit flips that change the application state, e.g. the value of a critical variable as shown in Fig. 1. Later, during an execution of the software function f2, this erroneous value is read and propagated further to the system output. An error in the output is considered a system failure that may lead to various unintended consequences. Similar hardware failures leading to data corruption can happen not only in memory, but in a CPU, a bus, or other computing hardware parts.

This article is organized as follows. The remaining part of this section presents the relevant state of the art and the basic concept of the introduced method. Section 2 demonstrates a case-study UAV platform. Section 3 introduces the design optimization method itself. Concluding, Section 4 describes the method evaluation and results.

Copyright © 2016 IFAC
2405-8963 © 2016, IFAC (International Federation of Automatic Control) Hosting by Elsevier Ltd. All rights reserved. Peer review under responsibility of International Federation of Automatic Control.
10.1016/j.ifacol.2016.09.043
Fig. 2. A general principle of software-based system protection using a software-implemented fault detector. A data error caused by a hardware fault is detected on a software level. This results in an error message instead of an undetected error.

1.2 Traditional Hardware-implemented Solutions

Traditionally, specific custom hardware protects systems from situations like the one shown in Fig. 1. For example: heat- and radiation-protected hardware (hardened chips, bipolar integrated circuits, magnetoresistive RAM, shielding, etc.), or hardware redundancy (dual or triple module redundancy). However, these solutions have several serious disadvantages. First of all, such custom systems are expensive and their markets are restricted. Second, custom hardware is usually an order of magnitude slower than up-to-date commodity hardware (Barnaby (2005)). Third, custom hardware solutions age over time. This results in the appearance of new unexpected errors in critical components (Borkar (2005)). Fourth, custom hardware leads to critical dependencies on a single supplier, which can be unacceptable for long-running systems.

1.3 Software-implemented Hardware Fault Detectors

An alternative solution is the application of software-implemented hardware fault detectors (SFDs) like SWIFT (Reis et al. (2005)), SWIFT ECF (Reis et al. (2007)), or Software Encoded Processing (AN-, ANB-, ANBD-codes) (Schiffel et al. (2010)). These techniques cannot prevent hardware faults, but they can detect them early enough and prevent data errors in the system. The principle of an SFD application is shown in Fig. 2. The SFDs offer the opportunity for using cost-effective but less reliable hardware, while maintaining the required level of system reliability. Nowadays, independent R&D projects based on cost-effective platforms like CubeSats become more and more popular in the aerospace domain. SFDs do not require any special hardware.
They are cheaper, more flexible, and have sufficiently high error detection rates (Schiffel et al. (2010)) in comparison with traditional hardware solutions. Moreover, SFDs can be applied automatically. This results in shorter development cycles and the minimization of programmers' errors.

1.4 Introduced Design Optimization Method

However, existing software-implemented solutions also have a strong drawback that limits their utilization: a considerable computational overhead. Application of SFDs
entails generation of extra source code ("protection superstructure" in Fig. 2). This considerably increases the system execution time, leads to higher memory consumption, and, as a consequence, to performance degradation, which is critical for control algorithms.

Fig. 3. System decomposition and selective application of different combinations of SFDs result in different variants of selectively protected software with different performance and reliability characteristics.

This article is focused on balancing the performance degradation of SFDs versus the increase of system reliability. Particularly, it aims to answer the following question: How to minimize the negative performance impact of software-implemented hardware fault detectors while maintaining the required system reliability level? We claim that it makes sense to apply SFDs selectively, only to the most critical parts of the system. This will maximize the achieved error-detection rate, while minimizing the performance overhead.

In Nakka et al. (2007), the authors also address reliability and performance optimization. As opposed to SFDs' application, that article describes reliability improvement using processor-level selective replication. However, the key idea is also to replicate only critical parts of the code. This research has shown the following quantitative results: "With about 59% less overhead than full duplication, selective replication detects 97% of the data errors and 87% of the instruction errors that were covered by full duplication".

In this article, we introduce a method for the identification of suitable places for SFD application through a probabilistic quantitative analysis of the original system. Fig. 3 and Fig. 4 demonstrate the general idea of the proposed method. First, we decompose the system into separate elements in order to apply SFDs selectively (see Fig. 3).
In general, depending on software size, complexity, design, and programming paradigms, this can be done on different abstraction levels: components, functions, basic blocks of code, etc. In our case study, software functions play the role of the elements. This is reasoned by our software design approach with a UML Activity Diagram: each action block models a function of the control software, see Fig. 5 c).
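To make the idea of a protection superstructure on a single element concrete, the sketch below illustrates the core principle of an AN-code, the simplest member of the Software Encoded Processing family mentioned in Section 1.3. It is a minimal illustration only, not the encoding of Schiffel et al.; the constant A is an arbitrary illustrative choice.

```python
# Minimal AN-code sketch: every value x is stored as A*x, so every valid
# code word is divisible by A. A random bit flip almost never preserves
# divisibility and is therefore detected at decode time.
A = 58659  # illustrative constant, not a recommended production value

def encode(x: int) -> int:
    return x * A

def decode(v: int) -> int:
    if v % A != 0:
        raise RuntimeError("Error detected")  # bit flip broke the code word
    return v // A

a, b = encode(7), encode(5)
s = a + b                    # addition preserves the code: A*7 + A*5 = A*12
assert decode(s) == 12

corrupted = s ^ (1 << 3)     # emulate a single bit flip in memory
assert corrupted % A != 0    # the flip is detectable: decode() would raise
```

Applied selectively, only the elements judged critical would operate on encoded values, while the remaining elements keep their original, faster arithmetic.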
Fig. 4. The central idea of the reliability and performance optimization method (an abstract example). The dots in the charts represent different protection strategies. The "optimal" strategies form the green Pareto frontier.

Second, we use methods for reliability and performance analysis of different variants of selectively protected software. The capability of analyzing selectively protected software allows us to find a balance between performance and reliability by solving a multi-objective optimization problem. The two abstract charts, shown in Fig. 4, demonstrate how to find an appropriate combination of SFDs and places of their application. Quantitative system properties like the mean number of system failures (a reliability metric) and the mean execution time (a performance metric) can be evaluated for all variants of the selectively protected software. Each possible variant of the software is represented by a dot in the charts. A variant is judged as "optimal" if there are no other variants with a lower mean execution time and a lower mean number of system failures at the same time. These "optimal" variants form the Pareto frontier, highlighted in green in Fig. 4. The variants above this frontier can be excluded from consideration. The two boundaries in the right chart in Fig. 4 represent reliability and performance requirements of a system. All software variants with a mean execution time higher than the performance boundary are considered to be too slow. All software variants with a mean number of system failures higher than the reliability boundary are considered to be unreliable. If there exist several sufficient variants, we can choose between them (see the right chart in Fig. 4).

2. CASE STUDY: UNMANNED AERIAL VEHICLE

A part of the embedded flight control software of an octocopter flight platform (unmanned aerial vehicle, UAV) has been analyzed. The UAV (see Fig.
5 a) was developed during the "S3 ARV: Small Safe & Space Autonomous Robot Vehicles" project that was carried out by the Institute of Automation of Technische Universität Dresden and the Institute of Flight Mechanics and Control of Universität Stuttgart in 2012-2014. The flight vehicle contains a number of onboard computers (see Fig. 5 b) with embedded guidance, navigation, and control software. A part of the control software, responsible for low-level flight control, was selected as a case study. It was already developed and suitable for the application of the discussed method. Attitude and rate control are the main functions of this software. This also makes it one of the most critical parts. A UML activity diagram of the main loop of the selected flight control software is shown in Fig. 5 c). The software is written in C and contains approximately 800 lines of
code. It is decomposed into six functions. They are represented by UML activities (rounded rectangles). The black arrows in the diagram represent control flow transitions. The control flow contains forks and joins. The functions "read input", "rate control", and "ecg" are invoked in each iteration of the main loop, the functions "err quat" and "attd ctrl" are executed in each second iteration, and the function "eul to quat" in each fourth iteration. The colored rectangles represent nine internal variables and one output variable. The colored, thin arrows show the data flow between the functions and the variables. The software reads sensor data and external inputs ("read input"), processes them ("eul to quat" and "err quat"), performs attitude and rate control ("attd ctrl" and "rate ctrl"), and generates engine commands ("ecg"). The output variable "mtr cmd" is a critical system output. An erroneous value of this variable is a violation of the reliability requirements.

Fig. 5. a) The UAV during one of the test flights. b) Onboard electronics. An embedded computer with the low-level flight control software is highlighted with the blue frame. c) A UML activity diagram of the main loop of the software. The black lines represent control flow between six functions. The colored lines represent data flow between the functions and variables.

The original on-board software was instrumented with fault injection and error detection mechanisms. In our experiments, the instrumented copy was run on a ground PC using a predefined sequence of input variables. Each statistical experiment contains 60 iterations of the main loop; 100000 statistical experiments have been conducted. The fault injection mechanisms emulate CPU faults and produce incorrect outputs of the functions with given probabilities using a "random" operator. We assume that the probability of fault activation is proportional to the execution time of the function.
The error detection mechanisms compare computed values of ”mtr cmd” with stored correct values. Additionally, the software was extended with time measurement blocks that allow the measurement of the mean execution time of each function and the mean execution time of one iteration of the main loop. The mean execution time of one iteration of the main loop is considered to be our performance metric.
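The statistical experiment described above can be sketched as follows. The per-function execution times and the proportionality constant K are placeholders, not the measured case-study values, and for brevity every activated fault is assumed to propagate to "mtr cmd"; the real instrumentation works inside the C software and tracks propagation explicitly.

```python
import random

# Placeholder mean execution times (ms) of the six functions; the measured
# case-study values differ.
exec_time = {"read_input": 0.03, "eul_to_quat": 0.12, "err_quat": 0.12,
             "attd_ctrl": 0.28, "rate_ctrl": 0.33, "ecg": 0.10}
K = 0.02  # assumed proportionality between execution time and fault activation

def run_experiment(iterations: int, rng: random.Random) -> int:
    """One statistical experiment: count main-loop iterations whose output
    is corrupted by an activated (and here always propagated) fault."""
    failures = 0
    for _ in range(iterations):
        # fault activation probability of each function is K * exec_time
        if any(rng.random() < K * t for t in exec_time.values()):
            failures += 1
    return failures

rng = random.Random(1)
experiments, iters = 5_000, 60
total = sum(run_experiment(iters, rng) for _ in range(experiments))
print(f"mean failures per experiment: {total / experiments:.3f}")
```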
Fig. 6. Model-based design flow for identification of suitable protection strategies.
3. METHOD DESCRIPTION

The goal of the proposed method is the identification of suitable combinations of SFDs (protection strategies). A top-level structure of the method is presented in Fig. 6.

First, we perform system-level and element-level analysis of an available base-line system model and generate an abstract mathematical representation that describes behavioral aspects of the system from the reliability and performance points of view. After that, two discrete time Markov chain (DTMC) models are used for numerical evaluation of the reliability and performance metrics. These evaluations also take into account a selected protection strategy. Based on the estimated reliability and performance, we modify the current protection strategy and perform another evaluation until a suitable combination of SFDs is found.
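The selection step at the end of this loop is a Pareto filter over the two metrics. A minimal sketch of that filter, with entirely hypothetical strategy names and metric values:

```python
def pareto_frontier(variants):
    """Keep the protection strategies that are not dominated in both
    mean execution time and mean number of failures (lower is better)."""
    front = []
    for name, time, fails in variants:
        dominated = any(t2 <= time and f2 <= fails and (t2, f2) != (time, fails)
                        for _, t2, f2 in variants)
        if not dominated:
            front.append((name, time, fails))
    return front

# hypothetical evaluations of four strategies: (name, exec time, failures)
variants = [("none", 0.79, 4.1), ("ANB(err_quat)", 1.21, 0.9),
            ("ANBD(all)", 3.5, 0.1), ("SWIFT(ecg)", 1.25, 1.5)]
print(pareto_frontier(variants))
# "SWIFT(ecg)" is dominated by "ANB(err_quat)" and drops out
```

Note that this simple dominance test treats exact metric duplicates as mutually dominating; a production implementation would deduplicate first.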
Fig. 7. An extended dual-graph error propagation model generated from the UML activity diagram shown in Fig. 5 c).

3.1 Abstract Behavioral Model

An abstract behavioral model is the core part of the method. This model is based on an extended dual-graph error propagation model (DEPM) that was introduced in our previous publications. An early concept of the DEPM has been presented in Morozov and Janschek (2011), a completed version complemented with a mechatronic case study in Morozov and Janschek (2013, 2014). Fig. 7 demonstrates a DEPM that has been automatically generated for the case study software from the UML activity diagram shown in Fig. 5 c). The DEPM, used in this article, consists of four components:

Control flow graph (CFG): CFG nodes represent executable system elements (software functions). Arcs represent control flow between the elements. The arcs are weighted with transition probabilities that have been obtained from the nominal operational profile. The CFG of the case study software is shown in Fig. 7 a).

Data flow graph (DFG): A DFG contains two types of nodes that represent (i) the elements, like the CFG nodes, and (ii) a set of memory slots. Arcs of a DFG connect inputs and outputs of elements with memory slots. The DFG of the case study software is shown in Fig. 7 b).

Reliability properties: Probabilities of fault activation during element execution and probabilities of error propagation through the elements (from inputs to outputs). We assume that the probability of fault activation is proportional to the execution time of the function. The properties are extended with coefficients that show how the probability of fault activation decreases after application of an SFD. The probabilities of error propagation have been defined with element-level statistical experiments.

Performance properties: (i) A mean execution time of each element (shown in the CFG in Fig. 7 a)) and (ii) coefficients that show how this mean execution time increases after application of an SFD.

The memory slots in the DFG, the performance properties, and the performance and reliability coefficients are new features in comparison to the DEPM presented in our previous publications.

3.2 Markov-based Numerical Evaluation

Reliability analysis: A DTMC model is used for reliability analysis. The model describes system dynamics as a stochastic process in terms of the occurrence and propagation of data errors. The DTMC model is automatically generated from the abstract behavioral model and is applied for the computation of the mean number of errors in critical system outputs during a given time period. Fig. 8 sketches a state graph of this DTMC model. Each state of the DTMC model corresponds to a moment in time between the executions of two elements and is defined with two parameters. The parameter enext shows which element will be executed next, according to the control flow. An expression enext = ek means that the execution of some element has already been completed, and the system is about to run the element ek. The second parameter is a binary vector v. The length of v equals the number of memory slots in the system. The elements of this vector show the presence or absence of errors in the memory slots at the particular moment of time.
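To illustrate how the (enext, v) state space yields the mean number of output errors, the following toy model uses two elements and a single memory slot; all probabilities are invented for the illustration, and the real DEPM of Fig. 7 is far larger.

```python
import numpy as np

# Toy DEPM: element f1 writes slot m with fault activation probability q1;
# element f2 then reads m and writes the critical output, propagating an
# input error with probability r. All numbers are illustrative only.
q1, r, T = 0.01, 0.8, 60

# DTMC states (e_next, v): 0=(f1,0) 1=(f1,1) 2=(f2,0) 3=(f2,1)
P = np.array([
    [0.0, 0.0, 1 - q1, q1],   # run f1: it writes m, possibly corrupting it
    [0.0, 0.0, 1 - q1, q1],   # f1 overwrites m, so the old error vanishes
    [1.0, 0.0, 0.0, 0.0],     # run f2 on a clean slot, control returns to f1
    [0.0, 1.0, 0.0, 0.0],     # run f2 on an erroneous slot; m stays erroneous
])

p = np.array([1.0, 0.0, 0.0, 0.0])   # start before f1 with error-free memory
mean_errors = 0.0
for _ in range(T):           # one main-loop iteration = f1 then f2
    p = p @ P                # after f1: p[3] is the prob. that m is erroneous
    mean_errors += p[3] * r  # f2 propagates the error to the critical output
    p = p @ P                # after f2: control is back at f1

print(round(mean_errors, 4))
```

For this particular chain the result matches the closed form T * q1 * r, which is a convenient sanity check; the generated case-study models have no such closed form and are evaluated numerically.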
Fig. 8. A general structure of the DTMC model for reliability analysis: enext defines which element will be executed next, v shows which data contain errors.
Fig. 9. Model-based numerical results, obtained by the application of the presented method.
Results: The obtained reliability and performance estimations help to identify a set of suitable protection strategies, which form the Pareto frontier in the performance-reliability chart in Fig. 9.
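Identifying that frontier is a standard non-dominance check: a strategy survives unless some other strategy is at least as good in both metrics and strictly better in one. A sketch with invented numbers (lower mean loop time and fewer erroneous outputs are both better; the real values come from the two DTMC analyses):

```python
# strategy number -> (mean loop execution time, mean erroneous outputs)
# All values below are made up for illustration.
strategies = {
    0b00000: (10.0, 5.0e-3),
    0b11100: (18.0, 1.2e-3),
    0b11111: (26.0, 4.0e-4),
    0b00111: (19.0, 2.0e-3),
}

def pareto_frontier(points):
    frontier = []
    for s, (t, r) in points.items():
        # s is dominated if another strategy is no worse in both metrics
        # and strictly better in at least one
        dominated = any(
            (t2 <= t and r2 <= r) and (t2 < t or r2 < r)
            for s2, (t2, r2) in points.items() if s2 != s
        )
        if not dominated:
            frontier.append(s)
    return sorted(frontier)

print(pareto_frontier(strategies))  # strategy 0b00111 is dominated by 0b11100
```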
Performance analysis: The method for performance analysis is also based on a DTMC model. However, this DTMC model is much simpler: it is a modification of a standard DTMC-based method for performance analysis in terms of execution time. A similar idea is shown in Happe (2005). The performance DTMC model is generated from the CFG by removing the control flow transition that closes up the main loop: the transition from "ecg" to "read input" in Fig. 7 a). After the removal, we compute the mean number of executions n_ei of each element ei using the underlying mathematical methods of Markov chains. The product n_ei t_ei of the computed mean number and the given mean execution time of the corresponding element is an estimation of the time spent on the execution of element ei during one iteration of the main loop. If an element is protected with an SFD, we apply the corresponding performance overhead coefficient to t_ei. The sum of such products over all elements of the system is the mean execution time of one iteration of the main loop. This value is our performance metric. The discussed Markov models have been generated automatically for all possible protection strategies. The PRISM software tool (Kwiatkowska et al. (2011)) for Markovian analysis has been used in the presented study in addition to our in-house software toolset ErrorPro (Morozov et al. (2015)).
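With the loop-closing arc removed, the chain becomes absorbing and the mean visit counts n_ei are the first row of the fundamental matrix N = (I − Q)⁻¹, where Q holds the transitions among transient states. A minimal sketch with assumed element names, execution times, and branch probabilities (not the case study values):

```python
import numpy as np

elems = ["read_input", "filter", "ctrl", "ecg"]   # assumed names
t = np.array([1.0, 2.0, 4.0, 0.5])                # assumed mean execution times

# Transition matrix among the transient states (the absorbing last
# element "ecg" is excluded); the loop-closing arc is already removed.
Q = np.array([
    [0.0, 1.0, 0.0],   # read_input -> filter
    [0.0, 0.0, 0.3],   # filter -> ctrl (0.3); otherwise straight to ecg
    [0.0, 0.0, 0.0],   # ctrl -> ecg
])
# Fundamental matrix: expected number of visits to each transient state
N = np.linalg.inv(np.eye(3) - Q)
visits = np.append(N[0], 1.0)   # start in read_input; ecg runs exactly once
# SFD overhead coefficients: e.g. triple redundancy on "filter" triples t
overhead = np.array([1.0, 3.0, 1.0, 1.0])
loop_time = float(visits @ (t * overhead))
print(round(loop_time, 3))      # mean execution time of one main-loop iteration
```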
The DTMC generation method is based on an iterative algorithm. First, an initial node of the DTMC state graph is created; this node is considered the current node in the first iteration. With each succeeding iteration, the algorithm identifies all possible states the system can move to after the execution of enext, and computes the corresponding transition probabilities using the probabilities of control flow branches, fault activation, and error propagation. Using the discussed DTMC model we are able to identify the states of interest that represent moments when the system "writes" erroneous values to critical outputs. Using the underlying mathematical methods of Markov chains we compute the mean number of passages through the states of interest. The computed value corresponds to the mean number of erroneous outputs during the given time interval.
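The iterative construction amounts to a breadth-first exploration from the initial state. The sketch below illustrates only that exploration, under a deliberately simplified fault model (a written slot is erroneous iff a fault activates during the write; error propagation from inputs is omitted); all names and probabilities are invented.

```python
from collections import deque

cfg = {"f1": {"f2": 1.0}, "f2": {"f1": 0.9, "f3": 0.1}, "f3": {"f1": 1.0}}
p_fault = {"f1": 0.01, "f2": 0.02, "f3": 0.0}   # fault activation per element
writes = {"f1": 0, "f2": 1, "f3": 2}            # memory slot each element writes

def successors(state):
    """Enumerate successor states of (e_next, v) with their probabilities."""
    e, v = state
    for e2, p_ctrl in cfg[e].items():
        for erroneous in (0, 1):
            p = p_ctrl * (p_fault[e] if erroneous else 1.0 - p_fault[e])
            if p == 0.0:
                continue
            v2 = list(v)
            v2[writes[e]] = erroneous
            yield (e2, tuple(v2)), p

# Breadth-first construction of the reachable state graph
initial = ("f1", (0, 0, 0))
transitions, frontier, seen = {}, deque([initial]), {initial}
while frontier:
    s = frontier.popleft()
    transitions[s] = dict(successors(s))
    for s2 in transitions[s]:
        if s2 not in seen:
            seen.add(s2)
            frontier.append(s2)

print(len(seen))  # number of reachable DTMC states
```

In the full method, the states of interest are then those whose enext writes a critical output while the corresponding entry of v is 1, and the mean number of passages through them is obtained by standard Markov chain analysis.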
Fig. 10. The results of the statistical experiments.

4. EVALUATION OF THE METHOD AND RESULTS

In order to evaluate the correctness of the introduced method, we compared the numerical results obtained by the application of the presented method (see Fig. 9) with the results of the statistical experiments (see Fig. 10). Each dot on the charts represents a single protection strategy. In order to demonstrate the feasibility of the proposed method, we used a trivial SFD: triple redundancy of software functions. During the statistical experiments, protected elements were executed three times and the results were post-processed by a majority-voting function. At the model level, it was assumed that the application of triple redundancy triples the mean execution time of an element and decreases the probability of fault activation towards zero.
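A triple-redundancy wrapper of this kind can be sketched in a few lines. The function, fault model, and fault probability below are illustrative stand-ins, not the case study code:

```python
import random

def faulty_square(x, p_fault=0.2):
    """Compute x*x, but occasionally corrupt the result with a bit-flip."""
    y = x * x
    if random.random() < p_fault:
        y ^= 1 << random.randrange(8)   # emulate silent data corruption
    return y

def tmr(func, x):
    """Execute func three times and majority-vote the results."""
    a, b, c = func(x), func(x), func(x)
    if a == b or a == c:
        return a
    if b == c:
        return b
    return a   # no majority: arbitrarily return the first result

random.seed(0)
results = [tmr(faulty_square, 7) for _ in range(1000)]
print(sum(r == 49 for r in results) / 1000)   # fraction of correct outputs
```

With a 20% per-execution fault probability, at least two of three runs are fault-free about 90% of the time, so the voted output is correct far more often than a single unprotected run.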
The element "read input" emulates inputs from real sensors and is not supposed to be protected. Different combinations of SFD application to the remaining five elements result in 2^5 = 32 different protection strategies. The binary representation of a protection strategy's number shows which elements are protected. For instance, protection strategy 31 (binary 11111) represents the situation when all five elements are protected, and protection strategy 28 (binary 11100) represents the system with protected elements "eul to quat", "err quat", and "attd ctrl". The comparison of the model-based and the experimental results is as follows: relative difference for performance estimations: 4.5598%; relative difference for reliability estimations: 2.4850%. 5. CONCLUSION A design optimization method for minimizing the negative performance impact of software-implemented hardware fault detectors while maintaining the required system reliability level has been introduced. The method combines an abstract behavioral system model and two DTMC-based methods for probabilistic reliability and performance analysis. The feasibility of the method has been demonstrated using the low-level flight control software of an experimental unmanned aerial vehicle. The introduced method shall be improved in several important aspects. The method has to be validated using more sophisticated fault injectors that emulate bit-flips caused by single event upsets. Moreover, only fault activation during the execution of system elements (emulation of CPU faults) has been taken into account; however, faults could also be activated in memory. In the presented case study, only one trivial type of SFD has been used: triple redundancy of software functions. In follow-up activities, we will extend our method to other SFDs. ACKNOWLEDGEMENTS The authors would like to thank Prof.
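The bit-vector encoding of strategies can be decoded mechanically. In the sketch below, the most significant of the five bits flags the first element, matching the example of strategy 28; the paper names three of the five protectable elements, so the last two names are placeholders:

```python
# Protectable elements in bit order (MSB first). The paper names the
# first three; "elem_4" and "elem_5" are placeholder names.
elements = ["eul_to_quat", "err_quat", "attd_ctrl", "elem_4", "elem_5"]

def protected(strategy):
    """Return the list of elements protected under a strategy number 0..31."""
    return [e for i, e in enumerate(elements) if strategy >> (4 - i) & 1]

print(protected(28))           # binary 11100: first three elements protected
print(len(protected(31)))      # binary 11111: all five protected
assert len(range(2 ** 5)) == 32  # total number of candidate strategies
```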
Christof Fetzer from the Faculty of Computer Science of Technische Universität Dresden for ideas helping to formulate the key concept of this study, André Schmitt from SIListra Systems for his expertise in software-implemented fault tolerance, and two of our students, Jin Li and Regina Took, for their help in the development of the analysis software. This work was partially funded by the German Aerospace Center (DLR Space Administration), Contract No. 50RA1208 (S3ARV - Small & Safe Space Autonomous Robotic Vehicles).

REFERENCES

Barnaby, H.J. (2005). Will radiation-hardening-by-design (RHBD) work? Plasma Sciences.
Borkar, S. (2005). Designing reliable systems from unreliable components: The challenges of transistor variability and degradation. IEEE Micro, 25, 10–16. doi:10.1109/MM.2005.110.
Happe, J. (2005). Predicting mean service execution times of software components based on Markov models. In R. Reussner, J. Mayer, J. Stafford, S. Overhage, S. Becker, and P. Schroeder (eds.), Quality of Software
Architectures and Software Quality, volume 3712 of Lecture Notes in Computer Science, 53–70. Springer Berlin Heidelberg. doi:10.1007/11558569_6.
Kwiatkowska, M., Norman, G., and Parker, D. (2011). PRISM 4.0: Verification of probabilistic real-time systems. In G. Gopalakrishnan and S. Qadeer (eds.), Proc. 23rd International Conference on Computer Aided Verification (CAV'11), volume 6806 of LNCS, 585–591. Springer.
Morozov, A. and Janschek, K. (2011). Dual graph error propagation model for mechatronic system analysis. In 18th IFAC World Congress, Milano, Italy.
Morozov, A. and Janschek, K. (2013). Case study results for probabilistic error propagation analysis of a mechatronic system. In Tagungsband Fachtagung Mechatronik 2013, Aachen, 06.03.-08.03.2013, 229–234.
Morozov, A. and Janschek, K. (2014). Probabilistic error propagation model for mechatronic systems. Mechatronics, 24(8), 1189–1202. doi:10.1016/j.mechatronics.2014.09.005.
Morozov, A., Tuk, R., and Janschek, K. (2015). ErrorPro: Software tool for stochastic error propagation analysis. In 1st International Workshop on Resiliency in Embedded Electronic Systems, Amsterdam, The Netherlands, 59–60.
Nakka, N., Pattabiraman, K., and Iyer, R. (2007). Processor-level selective replication. In Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN '07, 544–553. IEEE Computer Society, Washington, DC, USA. doi:10.1109/DSN.2007.75.
Oberg, J. (2012). Did bad memory chips down Russia's Mars probe? URL http://spectrum.ieee.org/aerospace/space-flight/did-bad-memory-chips-down-russias-mars-probe.
Reis, G.A., Chang, J., and August, D.I. (2007). Automatic instruction-level software-only recovery. IEEE Micro, 27, 36–47. doi:10.1109/MM.2007.4.
Reis, G.A., Chang, J., Vachharajani, N., Rangan, R., and August, D.I. (2005). SWIFT: Software implemented fault tolerance.
In Proceedings of the International Symposium on Code Generation and Optimization, CGO '05, 243–254. IEEE Computer Society, Washington, DC, USA. doi:10.1109/CGO.2005.34.
Roskosmos (2012). The general conclusions of the interdepartmental commission for analysis of the causes of abnormal situations during the flight testing of the spacecraft "Phobos-Grunt" (in Russian). URL http://www.roscosmos.ru/main.php?id=2&nid=18647.
Schiffel, U., Schmitt, A., Süßkraut, M., and Fetzer, C. (2010). ANB- and ANBDmem-encoding: Detecting hardware errors in software. In E. Schoitsch (ed.), Computer Safety, Reliability, and Security, volume 6351 of Lecture Notes in Computer Science, 169–182. Springer Berlin / Heidelberg. doi:10.1007/978-3-642-15651-9_13.