Reliability Engineering and System Safety 142 (2015) 456–462
Fault tolerant design of a field data modular readout architecture for railway applications

Ada Fort a, Marco Mugnaini a,*, Valerio Vignoli a, Vittorio Gaggii b, Moreno Pieralli b

a Department of Information Engineering and Mathematics, University of Siena, Siena, Italy
b Comesa SrL, Prato, Italy
ARTICLE INFO

Article history:
Received 27 January 2015
Received in revised form 21 May 2015
Accepted 11 June 2015
Available online 20 June 2015

Keywords:
Field sensors
SIL
Safety design
System availability

ABSTRACT

Modern data acquisition systems used to collect sensor signals are usually designed with performance and operating parameters in mind, mainly sensitivity, selectivity, resolution and stability over time. In addition to these features, systems deployed in the field must also satisfy reliability and availability constraints and, depending on the specific application, particular safety requirements. This paper gives an overview of how parameters such as the safety integrity level affect the hardware design of a sensor input/output module. The discussion covers the overall system design once availability considerations are integrated. Considerations concerning the on-board software implementation are omitted without loss of generality. The study has been developed with reference to solutions suitable for railway applications such as signaling or crossing detection systems.
© 2015 Elsevier Ltd. All rights reserved.
* Corresponding author: M. Mugnaini.
http://dx.doi.org/10.1016/j.ress.2015.06.008
0951-8320/© 2015 Elsevier Ltd. All rights reserved.

1. Introduction

The design of a sensor acquisition system that can operate in harsh environments and meet severe constraints, not only on data acquisition performance but also on robustness in everyday use, is still an open research field. Although compact, field-distributed acquisition systems are used in a wide variety of industrial contexts, ranging from telecommunications to oil field monitoring, safety requirements are in most cases application dependent. In railway signaling systems, for example, the constraints imposed by some standards [1,2] can be a real obstacle for both software and hardware designers, because the safety functions are considered to operate continuously and not only in low demand mode [1]. The introduction of new technologies and new hardware solutions to cover safety functions is an added resource for designers, but it also constitutes a further constraint: well proven architectures are usually a must for complying with the most widely used standards while avoiding additional testing costs. Therefore additional design parameters such as system reliability, availability, maintainability and safety (RAMS) should be considered during the main design phases to ensure that the newly designed system can withstand a wide variety of application conditions.

Some researchers have discussed how system availability and reliability vary over time for specific uses related to railways or telecommunications [3–7] or to oil and gas systems [8,9], with only limited safety considerations. These works generally describe how RAM parameters can change dynamically and provide exploitable, reconfigurable models. Other authors [9–11] have described suitable roadmaps for defining degradation models of the systems while keeping the safety requirements constrained within upper and lower boundaries. Nevertheless, in these cases the authors generally propose complex approaches that designers may find difficult to follow in practice, where maintenance considerations should also be taken into account. The introduction of advanced techniques such as Hidden Markov Models has made it possible to embed service shop activities in order to retrofit the failure and repair rates assumed a priori. In this way the model is fitted to the actual system life under real use, an effort that is in principle useful for retrofitting the system parameters. Other authors, as in [12], show how variations in the risk reduction factors (RRFs) may affect the design options and introduce a rough cost model to discriminate between the options on a cost basis. In [13] the authors provide an effective approach to the functional safety assessment of pre-crash systems for reciprocal hazards in the automotive field, building suitable simulation models. Finally, in [14,15] the authors discuss general
approach strategies to address safety problems by exploiting traditional methods. Nevertheless, none of the papers mentioned above addresses the design of a flexible hardware architecture for sensor readout that can cope with different safety requirements (linked to different safety functions) with a single modular solution. Other authors [16] have discussed safety instrumented systems in low demand mode, analyzing the impact of the different testing strategies used in mechanical and industrial plants. The authors of [17] propose an interesting and simplified method for safety integrity level evaluation based on a reliability block diagram degradation approach. This approach simplifies the formulas presented in [1,2] and gives designers a useful tool for system design. However, although this kind of research introduces alternatives to the formulas provided in the most widely used standards, it has been applied only to cases with long proof test intervals and cannot be exploited for continuous operation mode. In general case studies [18,19], designers and researchers mostly focus on efficiency management and measurement performance, skipping most of the considerations on the safety constraints introduced by the specific application requirements. Some authors have analyzed the behavior of low demand mode versus high demand mode systems in terms of both proof test interval changes and configuration management [20,21]. In particular, while [20] focused on the effectiveness of testing a general instrumented system on the basis of the demand mode classification, [21] optimized the proof test interval for a specific selected architecture.
Nevertheless, neither [20] nor [21] specifically addressed problems related to the railway context, which are very particular and strictly application dependent. The authors of this manuscript present an analysis performed within the boundaries of some relevant international safety standards [1,2] in order to suggest one solution suitable for use whenever a safety design requirement is established a priori. In detail, the authors start from a basic architecture composed of the sensing element, a logical unit devoted to data manipulation and management, and a final actuating/output element, and develop a modular solution able to cope with a commonly required safety integrity level (SIL according to the standard [2]), established in particular for railway applications. In these cases the safety requirements are generally selected according to [1,2] with a higher rank (SIL 3 or 4) than in other fields. The proposed modular solutions have then been evaluated in terms of availability parameters, and the best configuration with respect to these parameters has been proposed. The purpose and novelty of this manuscript lie in providing useful guidance for designers who have to deal with continuous monitoring systems, starting from very simple architectures and moving up to complex structures suitable in particular for railway applications, where standard configurations can be exploited and further enhanced for specific signaling interfacing systems.

The paper is arranged in six sections. Section 1 gives a general introduction to the problem and indicates the state of the art of hardware safety design approaches in different application fields. Section 2 gives a general system description and overview.
In Section 3 a selected architecture is proposed and different basic configurations are compared in terms of RAMS characteristics, excluding any discussion of maintenance policies and assuming only corrective actions. Section 4 shows and discusses the simulation results, while Section 5 proposes a possible modular hardware solution able to cover different safety function requirements. In particular, the results highlight that complex systems may of course present
lower reliability/availability figures while providing at the same time a satisfactory degree of protection and a reduced residual risk. Section 6 presents the conclusions.
2. System description

The proposed basic architecture is a fault tolerant smart front-end system for safety-critical applications in industrial processes or in the railway field, supporting severe requirements on configuration and response time. The system works in centralized and distributed configurations, with a modular redundant (MR) architecture that eliminates single points of failure and ensures the required system availability. The system can operate correctly in the presence of a major component fault and tolerates multiple, non-concurrent faults if properly arranged in an XooY configuration, where X is the minimum number of signals that must be received from the sensors (inputs) and Y is the total number available. In redundant configurations it identifies and compensates for faulty elements and allows repair activities while continuing the assigned task without process interruption. The system with MR architecture operates as a single set of hardware and software (the software section is not discussed in this manuscript and is recalled only for diagnostic purposes). The general system architecture is depicted in Fig. 1.

Regarding the system scan time, a specific SET-PLC system should always be present to allow selection of the optimal strategy for processing the I/O data. The following three sample strategies can be applied:

• Polling driven by the Main Processor Module
• Spontaneous dispatch at time out
• Data change dispatch

The cycle time between an input state change and the corresponding output state change, e.g. for a 1 km communication channel and a 32 I/O module, typically ranges from 6 ms to 70 ms depending on the strategy used to collect and process the data.
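The XooY arrangement described in this section can be illustrated with a minimal sketch. It is only an illustration under assumptions: the function name `xooy_operational` and the representation of each channel as a boolean "valid signal received" flag are hypothetical and are not taken from the system described here.

```python
from typing import Sequence


def xooy_operational(channel_ok: Sequence[bool], x: int) -> bool:
    """Return True if at least x of the Y channels deliver a valid signal.

    channel_ok holds one flag per channel (Y = len(channel_ok)); the flag
    is an assumed abstraction of "signal received and judged valid".
    """
    y = len(channel_ok)
    if not 1 <= x <= y:
        raise ValueError("x must satisfy 1 <= x <= Y")
    return sum(bool(ok) for ok in channel_ok) >= x


# A 2oo3 group keeps working with one faulty channel, but not with two.
one_fault = xooy_operational([True, False, True], x=2)
two_faults = xooy_operational([True, False, False], x=2)
```

With these inputs `one_fault` is true and `two_faults` is false, which mirrors the statement that a properly arranged redundant group tolerates a single component fault.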
3. Architectures modeling

Once the basic structure is set, the problem is to define the redundancy of the three main sections of Fig. 1 so as to meet the requirements of the multiple safety functions usually present in such systems. The analysis of suitable configurations is developed to meet the requirements of the IEC 61508 [2] safety standard (safety integrity level, SIL) when designing local or distributed systems for collecting data from field sensors and subsequently processing them for control purposes. The design considers the implementation of fault tolerant structures rated at least SIL 3, taking the base chain of every safety function as per Fig. 1, because this application, although general purpose, can be specifically exploited for railway signaling monitoring. The possibility of exploiting a general structure
Fig. 1. Basic system description. Acquisition and change state times are assumed as typical mean values for railways cases.
for readout purposes enables designers to focus on architecture details starting from a recognized and exploitable configuration. Fig. 2 gives the generic representation of the PLC architecture, in which the three main safety components covering a safety loop are identified. The S block represents the analog signal coming from the starting/sensor element (IN Module of Fig. 1); the logic solver LS is usually identified with the intelligent unit, such as an ASIC, a microprocessor or a microcontroller (Main Processor Module in Fig. 1); finally, the element acting on the output side is identified by the letters FE (final element, or OUT Module of Fig. 1) and can be used to send the data straight to an actuator or to another processing unit for further processing according to the specific needs.

In the present document several architectures are proposed and evaluated in order to compare their performance both in terms of safety integrity level (SIL), with respect to some pre-established requirements, and in terms of availability parameters, once the basic reliability assumptions for failure and repair rates have been set. For this reason Table 1 reports the failure and repair rates of the specific elements shown in Fig. 3. Table 1 gives only a subset of the data, obtained at 25 °C in a ground benign environment (GB according to the MIL-HDBK-217 definitions) and assuming a distributed derating of 30%. Additionally, the probability of failure on demand (PFD) is evaluated considering the continuous operating condition (which generally applies to railway signaling systems). To evaluate the PFD of the overall structure, the PFD of each section involved is evaluated independently and the results are then considered jointly over the different architectures, to verify whether they satisfy the standard requirements, at least according to the SIL 3 definitions.
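As a small worked example of how the rates in Table 1 can be read, the sketch below converts a repair rate into a mean time to repair and computes the steady-state availability of a single repairable unit. It assumes constant (exponential) failure and repair rates, which the source does not state explicitly; the function name is hypothetical, and the numerical values are the I/O front end entries of Table 1.

```python
def steady_state_availability(failure_rate_per_h: float,
                              repair_rate_per_h: float) -> float:
    """Steady-state availability of one repairable unit with constant rates.

    A = mu / (lambda + mu), the standard two-state (up/down) Markov result.
    """
    return repair_rate_per_h / (failure_rate_per_h + repair_rate_per_h)


# I/O front end, Table 1: 0.2399 failures per 10^6 h, 0.33 repairs per hour.
lam = 0.2399e-6          # failure rate [1/h]
mu = 0.33                # repair rate [1/h]

mttr_h = 1.0 / mu        # mean time to repair, about 3 hours
availability = steady_state_availability(lam, mu)
```

The repair rate of 0.33 per hour corresponds to a mean time to repair of roughly three hours, so a single subsystem with these rates is unavailable only for a very small fraction of the time.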
In addition, owing to the particular implementation of the specific architecture, and as required by [1,2], the hardware components are considered to belong to class A devices, i.e. to well proven technologies, and the following assumptions can be made:

• Proof test interval = 500 ms, well below the system mean time to repair (MTTR)
• $DC = 0.99$ (diagnostic coverage, i.e. 99%)
• $\beta = 2\%$ (common cause failures)
• $\beta_D = 1\%$ (common cause dangerous failures)

Because further operating conditions may be undetermined, the following additional conservative hypotheses are assumed:

• $\lambda_D = \lambda/2$ (dangerous failure rate fraction)
• $\lambda_{DD} = (\lambda/2)\,DC$ (detectable dangerous failure rate fraction)
• $\lambda_{DU} = (\lambda/2)(1 - DC)$ (undetectable dangerous failure rate fraction)
• $\lambda_{SD} = (\lambda/2)\,DC$ (detectable safe failure rate fraction)

As a final consideration, particular attention should be paid to the input section because, for safety applications, the information coming from the sensors is the most critical part in terms of data corruption and noise contribution. Decisions taken on wrong data may affect the overall system behavior even if the system is correctly designed. Different architectures can be considered to meet the proposed requirements in continuous operating mode. Five possible implementations are suggested below; as shown in Table 2, not all of them meet the single fault tolerance requirement, even though they may satisfy the desired SIL threshold. In general the sub-structure of the smart front-end electronics, which represents the system input, is analogous to that of Fig. 2, as shown in Fig. 4, but embeds the front-end electronics components only. Each configuration has been evaluated in terms of mean availability, number of expected failures over time and safety figures, in order to obtain a global overview of the tradeoff among these parameters. In particular, the system availability and failure behavior are reported in Section 4 for the first configuration only, while summarized results are reported in Section 5 for all the proposed redundant structures. All the other functional parameters, such as accuracy, uncertainty and response time, are equivalent for all the proposed structures.
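The conservative failure-rate split assumed above can be sketched in a few lines. This is a minimal illustration: the class and function names are hypothetical, and the diagnostic coverage is taken as the fraction 0.99 (an interpretation of the value given in the assumptions list).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureRateSplit:
    dangerous: float             # lambda_D
    dangerous_detected: float    # lambda_DD
    dangerous_undetected: float  # lambda_DU
    safe_detected: float         # lambda_SD


def split_failure_rate(lam: float, dc: float = 0.99) -> FailureRateSplit:
    """Split a total failure rate following the hypotheses listed above.

    Half of the failures are taken as dangerous; the diagnostic coverage dc
    (assumed here to be a fraction between 0 and 1) separates detected
    from undetected failures.
    """
    if not 0.0 <= dc <= 1.0:
        raise ValueError("dc must be a fraction between 0 and 1")
    half = lam / 2.0
    return FailureRateSplit(
        dangerous=half,
        dangerous_detected=half * dc,
        dangerous_undetected=half * (1.0 - dc),
        safe_detected=half * dc,
    )


# Example with the I/O front end rate of Table 1 (0.2399 failures per 10^6 h).
io_front_end = split_failure_rate(0.2399e-6)
```

By construction the detected and undetected dangerous rates add up to the dangerous rate, so only 1% of the dangerous failures remain hidden from diagnostics under these assumptions.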
Fig. 2. Basic safety chain according to IEC61508.

Table 1
Reliability data.

Safety loop section   Subsystem                  Failure rate [f/10^6 h]   Repair rate [r/h]
S                     I/O Front end              λ = 0.2399                0.33
                      FRAM                       λ = 0.1328                0.33
                      Microcontroller            λ = 0.8693                0.33
                      Power supply               λ = 0.1781                0.33
                      Communication management   λ = 0.6655                0.33
                      I/O Front end              λ = 0.2399                0.33
                      Power management           λ = 0.3437                0.33
LS                    CPU                        λ = 0.8993                0.33
FE                    I/O Front end              λ = 0.2399                0.33

Data have been calculated considering an aggregation of components for each subsystem, with 30% derating, in ground benign (GB) conditions and at 25 °C.

Fig. 3. Safety loop description identifying the main system components.

Table 2
Proposed and evaluated architectures.

Architecture (configuration)   Fault tolerant
1oo1 (simplex)                 No
Mixed solutions:
  A                            No
  B                            No
  C                            No
  D                            Yes

Fig. 4. System configuration of the basic safety loop.

4. Simulation results

4.1. Single line 1oo1

The architecture shown in Fig. 5 is the simplest configuration, in terms of system blocks, that can be implemented to cover a safety function. It is discussed only as an example, because this arrangement cannot meet the fault tolerance requirements; it will therefore not be considered in the following analysis. Using the data of Table 1, the system PFD can be defined as

PFD_S = 4.0038E-06; PFD_LS = 1.304E-06; PFD_FE = 3.5985E-07;
System PFD = 5.6676E-06; SIL 1

Such a structure is therefore not able to satisfy the SIL 3 requirement. Fig. 6 shows the behavior of the simplex structure in terms of mean availability and expected number of failures over time, simulated with the BlockSim software.

Fig. 5. Representation of the 1oo1 configuration.

Fig. 6. Availability and failure behaviour of the 1oo1 configuration over time.

Fig. 7. Representation of the A configuration.

Fig. 8. Representation of the B configuration.

Fig. 9. Representation of the C configuration.

4.2. Configuration A

The architecture shown in Fig. 7 consists of a 1oo2D selection for the input front end, a 1oo2D configuration for the logic solver and a 1oo2D configuration for the output (final element). This architecture is discussed because it presents functional redundancy in all sections, which improves the possibility of covering the safety requirements through system diagnostics. Using the data of Table 1, the system PFD can be defined as

PFD_S = 4.0439E-08; PFD_LS = 1.31699E-08; PFD_FE = 3.63449E-09;
System PFD = 5.72434E-08; SIL 3

This architecture is in principle able to meet the SIL 3 requirements; the comparison of its availability over time with that of all the other configurations is discussed in Section 5.
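The source does not state the expression used to obtain the 1oo1 figures of Section 4.1. As a hedged cross-check only, the sketch below shows that the reported S and FE values are numerically consistent with the approximation PFD ≈ λ_DD · MTTR = (λ/2) · DC / μ. Everything here is an assumption made for illustration: the approximation itself, DC = 0.99, the grouping of the first seven subsystems of Table 1 into the S section, and the function name `pfd_estimate`.

```python
DC = 0.99            # diagnostic coverage, assumed to be a fraction
REPAIR_RATE = 0.33   # repairs per hour, Table 1


def pfd_estimate(failure_rates_per_1e6_h: list) -> float:
    """Assumed approximation: detected dangerous failure rate times MTTR."""
    lam = sum(failure_rates_per_1e6_h) * 1e-6   # total rate [1/h]
    lam_dd = (lam / 2.0) * DC                   # detected dangerous rate
    return lam_dd / REPAIR_RATE                 # MTTR = 1 / repair rate


# Assumed grouping of Table 1 entries (failures per 10^6 h).
S_RATES = [0.2399, 0.1328, 0.8693, 0.1781, 0.6655, 0.2399, 0.3437]
FE_RATES = [0.2399]

pfd_s = pfd_estimate(S_RATES)    # compare with the reported 4.0038E-06
pfd_fe = pfd_estimate(FE_RATES)  # compare with the reported 3.5985E-07
```

Under these assumptions both estimates agree with the reported figures to the printed precision, which suggests (but does not prove) that the undetected contribution over the 500 ms proof test interval was treated as negligible.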
Fig. 10. Representation of the D configuration.

Fig. 11. Availability comparison of the configurations from 1oo1 to D.

Fig. 12. Failure rate comparison of the configurations from 1oo1 to D.

Table 3
Proposed and evaluated architectures.

Architecture       Mean availability @ 87,600 h   System failure rate [f/yr]   Possible SIL matching
1oo1 (simplex)     0.999989                       0.0330                       1
Configuration A    0.999977                       0.0011                       3
Configuration B    0.999967                       0.0034                       4
Configuration C    0.999975                       0.0014                       3
Configuration D    0.999966                       0.0034                       4
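The tradeoff summarized in Table 3 can be explored with a short sketch that filters the configurations by a required SIL and ranks the remaining ones by mean availability. The data are copied from Table 3; the function name and the ranking criterion (availability alone, ignoring the fault tolerance column of Table 2) are assumptions made for illustration.

```python
from typing import Optional

# name: (mean availability at 87,600 h, system failure rate [f/yr], SIL)
TABLE_3 = {
    "1oo1": (0.999989, 0.0330, 1),
    "A": (0.999977, 0.0011, 3),
    "B": (0.999967, 0.0034, 4),
    "C": (0.999975, 0.0014, 3),
    "D": (0.999966, 0.0034, 4),
}


def best_configuration(min_sil: int) -> Optional[str]:
    """Return the configuration with the highest mean availability among
    those reaching at least min_sil, or None if no configuration qualifies."""
    candidates = [
        (availability, name)
        for name, (availability, _rate, sil) in TABLE_3.items()
        if sil >= min_sil
    ]
    return max(candidates)[1] if candidates else None


best_for_sil3 = best_configuration(3)
best_for_sil4 = best_configuration(4)
```

A real selection would also weigh the single fault tolerance requirement, which in Table 2 only configuration D satisfies, so this ranking is a starting point rather than a design decision.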
4.3. Configuration B

The architecture shown in Fig. 8 consists of a 2oo3 selection for the input front end, a 2oo3 configuration for the logic solver and a 1oo2D configuration for the output (final element). Using the data of Table 1, the system PFD can be defined as

PFD_S = 1.29544E-09; PFD_LS = 4.01195E-10; PFD_FE = 3.63449E-09;
System PFD = 5.33112E-09; SIL 4

4.4. Configuration C

The architecture shown in Fig. 9 represents the case of a 1oo2D selection for the input front end, a 2oo3 configuration for the logic solver and a 1oo2D configuration for the output (final element). Using the data of Table 1, the system PFD can be defined as

PFD_S = 4.0439E-08; PFD_LS = 4.01195E-10; PFD_FE = 3.63449E-09;
System PFD = 4.44747E-08; SIL 3

4.5. Configuration D

The architecture shown in Fig. 10 consists of a 2oo3 selection for the input front end, a 2oo3 configuration for the logic solver and a 2oo3 configuration for the FE. Such a configuration is normally considered to be fully 2oo3. Using the data of Table 1, the system PFD can be defined as

PFD_S = 1.29544E-09; PFD_LS = 4.01195E-10; PFD_FE = 1.0872E-10;
System PFD = 1.80535E-09; SIL 4

The system configurations have been compared in terms of their mean availability (Fig. 11) and of their failures over time (Fig. 12). The results show that the sensor stage heavily affects the system availability, since it is characterized by the highest failure rate contributions. This penalizes the configurations in which multiple sensor inputs are required. On the other hand, from a safety standpoint, redundant configurations make it possible to respect the modularity constraints and the SIL 3 target.
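The system figures reported in Sections 4.1 to 4.5 are the sum of the three section values, and the SIL labels match the IEC 61508 bands for high demand or continuous mode. The sketch below reproduces both steps from the reported numbers; the dictionary layout and function names are hypothetical, and the band table is the standard's continuous-mode table, assumed here to be the one applied by the authors.

```python
from typing import Optional, Sequence

# (PFD_S, PFD_LS, PFD_FE) as reported for each configuration.
SECTION_PFD = {
    "1oo1": (4.0038e-06, 1.304e-06, 3.5985e-07),
    "A": (4.0439e-08, 1.31699e-08, 3.63449e-09),
    "B": (1.29544e-09, 4.01195e-10, 3.63449e-09),
    "C": (4.0439e-08, 4.01195e-10, 3.63449e-09),
    "D": (1.29544e-09, 4.01195e-10, 1.0872e-10),
}

# IEC 61508 high demand / continuous mode bands: (SIL, lower, upper),
# lower bound inclusive, upper bound exclusive.
SIL_BANDS = [
    (4, 1e-9, 1e-8),
    (3, 1e-8, 1e-7),
    (2, 1e-7, 1e-6),
    (1, 1e-6, 1e-5),
]


def system_pfd(sections: Sequence[float]) -> float:
    """Series combination of the safety loop sections (simple sum)."""
    return sum(sections)


def sil_for(value: float) -> Optional[int]:
    """Return the SIL whose band contains value, or None if outside all bands."""
    for sil, lower, upper in SIL_BANDS:
        if lower <= value < upper:
            return sil
    return None


sil_by_configuration = {
    name: sil_for(system_pfd(sections))
    for name, sections in SECTION_PFD.items()
}
```

Evaluating `sil_by_configuration` yields SIL 1 for the simplex case, SIL 3 for configurations A and C, and SIL 4 for configurations B and D, in agreement with the values reported in the text.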
5. Sensor interface proposed architecture

Table 3 summarizes the expected number of failures and the availability figures of the proposed configurations. On the basis of the previous considerations, and from the tradeoff among system availability, safety degree and system configurability, a possible solution has been identified, at least for the input stage, which proved to be the most critical section.

The Triple Modular Redundant (TMR) architecture ensures fault tolerance and provides error-free, uninterrupted control in the presence of either hard failures of components or transient faults from internal or external sources. Each I/O module houses the circuitry for three independent channels. Each channel in the input modules reads the process data and passes that information to its respective Main Processor. Each channel can be independently driven by the corresponding microcontroller in order to force (test) the input to a pre-established and well known analog or digital value. In this way possible hidden faults (stuck signals of active sensors) can be monitored and identified, improving the self-diagnostic capabilities of the design. Any disagreement among the microcontroller results causes one or more microcontrollers to go to the inactive safe state. In this state, the microcontroller that detected the disagreement stops generating the power signal for the others, ultimately forcing the module to a complete shutdown and therefore to a safe condition.

The possibility of exploiting a so-called vital power supply strategy allows the diagnostic coverage of these systems to be increased beyond the traditional 90% found in commercial products. This concept relies on controlling the enabling or disabling of the overall processing system through the power supply logic connections, minimizing the interaction between software and hardware for anomaly detection. Figs. 13–15 show the block diagram of one of the channels of the TMR structure (three identical, physically independent channels), divided into three main sections. Fig. 13 shows the analog input section, where the self-test architecture can be seen: on demand, it tests the input signal path by driving the input to a well defined value (logic 0 or 1) and reading back the imposed analog voltage. Fig. 14 shows the simplified block diagram of the microcomputer with its main functional components. To account for system safety, the following parts should always be considered in the analysis: the interrupt controller, the clock, the DMA, the CRC, the system timer, the DSP (if present) and the memories should be modeled considering the possible hazardous failure modes identified by a specific failure mode analysis. Finally, Fig. 15 describes the final section, which is proposed in this representation as an intrinsically robust hardware solution. The figure represents the interaction of two of the channels of the TMR structure (the same principle applies to channels A, B and C), and it should be pointed out that the fail safe power supply section of microcomputer A is managed by a signal coming from two separate sections (channels A and B in this example).

Therefore any mismatch between the two codes that refer to the analog inputs and come from the CPUs (A and B) leads, through the OR logic port, to an interruption of the fundamental power supply of CPU A, generating a cascade effect on the power supply of CPU B (2oo2 structure for the CPU). The same principle is applied, as an example in this representation, to the communication section, but other solutions can be implemented as well, resulting in an effective barrier against unsafe conditions.

Fig. 13. Analog input stage (sensor) of single channel.

Fig. 14. Elaboration section of single channel configuration.

Fig. 15. Hardware logic configuration of single channel mixing signals from channel A and B for fail safe power supply and communication management.

6. Conclusions
In this paper the authors have described a possible solution for sensor based smart front-end architectures to be used in railway applications that must match an established SIL. Several possible configurations have been analyzed in terms of system availability and safety requirements, ending with the proposal of a solution that represents the tradeoff among the reference parameters considered. A triple redundant module for the sensing input interface represents a possible implementation strategy. Such a solution preserves the desired degree of flexibility and a reasonable availability level for the overall system. Moreover, the suggested architecture helps designers in assessing a basic hardware solution able to cope with high severity requirements in terms of failure tolerance and probability of failure on demand in continuous operation. Nevertheless, the overall safety structure can be completed, and the safety degree therefore extended well beyond this level, by including the software safety requirements and implementation.

References

[1] CENELEC 50129 and CENELEC 50126.
[2] IEC61508-1-6. Functional safety of electrical/electronic/programmable electronic safety related systems.
[3] Babczyński T, Magott J. Dependability and safety analysis of ERTMS level 3 using analytic estimation. Safety and reliability: methodology and applications. In: Proceedings of the European safety and reliability conference, ESREL; 2014. p. 293–8.
[4] Fort A, et al. Availability modeling of a safe communication system for rolling stock applications. In: Proceedings of the IEEE instrumentation and measurement technology conference; 2013. p. 427–30.
[5] Addabbo T, Cordovani O, Fort A, Mugnaini M, Vignoli V. Exhaust thermoelements redundant strategy to improve temperature reading reliability and serviceability. SMARTGREENS. In: Proceedings of the 3rd international conference on smart grids and green IT systems; 2014. p. 96–100.
[6] Catelani M, Ciani L, Mugnaini M, Scarano V, Singuaroli R. Definition of safety levels and performances of safety: applications for an electronic equipment used on rolling stock. In: Proceedings of the IEEE instrumentation and measurement technology conference; 2007.
[7] Catelani M, Mugnaini M, Masi A, Ceschini G, Nocentini F. Pseudotime-variant parameters in centrifugal compressor availability studies by means of Markov models. Microelectron Reliab 2002;42(9–11):1373–6.
[8] Ceschini G, Mugnaini M, Masi A. A reliability study for a submarine compression application. Microelectron Reliab 2002;42(9–11):1377–80.
[9] Addabbo T, et al. HMM used for component parameters apportionment. In: Proceedings of SSD14, Casteldefells; 2014.
[10] Fort A, Mugnaini M, Vignoli V. Hidden Markov models approach used for life parameters estimations. Reliab Eng Syst Saf 2015;136:85–91.
[11] Guo Haitao, Yang Xianhui. Automatic creation of Markov models for reliability assessment of safety instrumented systems. Reliab Eng Syst Saf 2008;93(6):829–37.
[12] Alijagic E, Dang VN. Achieving safe designs through evaluation of options at the conceptual level of safety system design. 2015 Safety and reliability: methodology and applications. In: Proceedings of the European safety and reliability conference, ESREL; 2014. p. 1733–40.
[13] Takeichi M, Suyama K, Sato Y. Functional safety assessment of pre-crash systems for reciprocal hazards. Nihon Kikai Gakkai Ronbunshu, C Hen/Trans Jpn Soc Mech Eng C 2013;79(806):3839–53.
[14] Alijagic E, Dang VN. Safety review process – a methodology for qualitative verification of safety measures. 2014 Safety, reliability and risk analysis: beyond the horizon. In: Proceedings of the European safety and reliability conference, ESREL; 2013. p. 77–84.
[15] Duran L, Krywonos M. A practical view of risk reduction management of hardware and software for safety critical applications. In: Proceedings of the 28th center for chemical process safety international conference 2013, CCPS – topical conference at the 2013 AIChE Spring meeting and 9th global congress on process safety; 2013.
[16] Jin Hui, Rausand Marvin. Reliability of safety-instrumented systems subject to partial testing and common-cause failures. Reliab Eng Syst Saf 2014;121:146–51.
[17] Ding Long, Wang Hong, Kang Kai, et al. A novel method for SIL verification based on system degradation using reliability block diagram. Reliab Eng Syst Saf 2014;132:36–45.
[18] Sunder R, Kolbasseff A, Kieninger A, Rohm A, Walter J. Operational experiences with onboard diagnosis system for high speed trains. In: Proceedings of the world congress on rail research; November 2001.
[19] Elia M, Diana G, Bocciolone M, Bruni S, Cheli F, Collina A, Resta F. Condition monitoring of the railway line and overhead equipment through onboard train measurement-an Italian experience. In: Proceedings of the institution of engineering and technology international conference on railway condition monitoring; November 2006. p. 102–7.
[20] Liu Yiliu, Rausand Marvin. Reliability assessment of safety instrumented systems subject to different demand modes. J Loss Prev Process Ind 2011;24(1):49–56.
[21] Torres-Echeverría AC, Martorell S, Thompson HA. Modelling and optimization of proof testing policies for safety instrumented systems. Reliab Eng Syst Saf 2009;94(4):838–54.