Fault-tolerant design of local controller for the poloidal field converter control system on ITER


Jun Shen, Peng Fu, Ge Gao, Shiying He, Liansheng Huang*, Lili Zhu, Xiaojiao Chen

Institute of Plasma Physics, Chinese Academy of Sciences (ASIPP), P.O. Box 1126, Hefei, Anhui Province 230031, China

* Corresponding author. E-mail address: [email protected] (L. Huang).

Highlights

• The requirements on the Local Control Cubicles (LCC) for the ITER Poloidal Field Converter are analyzed.
• A decoupled service-based software architecture is proposed so that control loops on the LCC can run at varying cycle times.
• Fault detection and recovery methods for the LCC are developed to enhance the system.
• The performance of the LCC with and without the fault-tolerant feature is tested and compared.

Article info

Article history: Received 28 May 2016; received in revised form 20 August 2016; accepted 13 September 2016.

Keywords: ITER poloidal field; fault tolerance; service-based program; control system

Abstract

The control system for the Poloidal Field (PF) converters on ITER is a synchronously networked control system comprising several kinds of computational controllers. The Local Control Cubicles (LCC) play a critical role in this networked control system because they are the interface to all input and output signals, so additional work must be done to guarantee their proper operation under the influence of faults. This paper analyzes the system demands of the LCCs and the faults encountered in recent operation. To handle these faults, a decoupled service-based software architecture is proposed. On top of this architecture, fault detection and system recovery methods, such as redundancy and rejuvenation, are incorporated to achieve a fault-tolerant private network with the aid of the QNX operating system. Unlike conventional methods, this approach requires no additional hardware and is comparatively easy to implement. The LCCs were successfully exercised during the recent PF Converter Unit performance tests for ITER, demonstrating the effectiveness of the design.

1. Introduction

The control system of the Poloidal Field Converter System (PFCS) on ITER is a networked control system comprising several kinds of subsystems, including the Master Controller (MRC), the Circuit Controller (CCR) and the Local Control Cubicles (LCC). The LCCs are the only controllers located near the controlled power supply equipment; they are therefore exposed to a harsh operating environment and require very high tolerance to disturbances. Moreover, the LCCs are a mission-critical system, and any malfunction in them may damage the device, so fault-tolerant design plays a critical role. Even though the program for the LCCs has been developed using a test-and-verify method, the system still runs the risk of failure due to transient faults [1].


In the early stages of the experiment, there were many cases of program abort, invalid memory access and network package loss, which caused interruptions to the experiment or even damage to the equipment. The LCCs therefore needed to be able to detect fault conditions and resume operation from an erroneous state. In designing such a fault-tolerant system, the main challenge was recovering the system without interrupting the ordinary control loop. The physical and functional architecture of ITER is quite unique, and the size of its control system is estimated in [2]. Fault-tolerant control from a holistic view is dealt with in [3], which covers the entire design process. The effectiveness of a service-oriented architectural approach to fault-tolerant systems is evaluated in [4], which also illustrates a prototype implementation of service management using an experimental remote handling control system. This paper presents a service-based program architecture and a method to reinforce fault-tolerant dependability by manipulating critical services and communications.


Fig. 1. The basic architecture and components of the PFCS. Note that each LCC has one or three sets of PF Control Units [1].

The original idea of the service-based architecture was inspired by the real-time service-oriented architecture (RTSOA) introduced and evaluated in [5].

The paper is organized as follows. In the next section, the basic functions and system requirements of the LCCs are described. Section 3 expounds the definition of fault tolerance and related issues, provides methods to identify faults both in the conventional way and by their consequences, discusses fault-tolerant methods for dealing with typical LCC faults, and elaborates a redundancy design in a private network. In Section 4, the run time assigned to each service of the LCC is measured, and the results of recent PF Converter Unit performance tests are presented to demonstrate the effectiveness of the proposed methods. Finally, Section 5 summarizes and concludes the paper.

2. The LCC requirements

To meet the requirements of ITER plasma confinement, the LCCs need to control the output current of the ITER PF converter, which can range arbitrarily between −55 and +55 kA. The PF converter module consists of four six-pulse bridges supplied by two converter transformers. The main purpose of the LCCs is to coordinate two bridges across the different operating modes, e.g. circulating current mode, single bridge mode and parallel mode [6]. The LCCs are dedicated industrial controllers implemented on the PCI eXtensions for Instrumentation (PXI) platform and running a Real-Time Operating System (RTOS). Periodically, the LCCs receive parameters from the CCR through the private network and follow the instructions of the Plant System Operating State (PSOS) condition, then send data and system status back to the CCR using a self-defined communication protocol; these data serve as sources of Process Variables (PV) for the Channel Access (CA) of the Experimental Physics and Industrial Control System (EPICS) [7]. In accordance with the PSOS instructions, the LCCs implement local control policies in real time to give the overall control system a timely response, especially for controlling the firing angles of the Converter Unit (CU). The whole control-cycle time from the Plasma Control System (PCS) to the LCCs is less than 1 ms, so the time for each CCR to aggregate data from the LCCs must be under 0.5 ms.

The basic architecture of the PF converter control system is shown in Fig. 1. The MRC processes control functions, handles alarm events, retrieves error logs, and is managed as the interface to CODAC (Control, Data Access and Communication). The MRC communicates with the CCR to dispatch control parameters and receives real-time plant data from the LCCs. To satisfy the requirements of the LCCs, the PXI solution was chosen for its advantages as a Commercial Off-The-Shelf (COTS) component that is widely available on the market. The QNX Neutrino RTOS was adopted because it is qualified for various safety and security standards in highly critical products that also need high fault tolerance. However, there were still fault occurrences in the early stages of the experiment, e.g. abnormal program operation, invalid resource access and incomplete communication data. Some of these problems are recurring and easily observed, while others are transient and difficult to detect. It is therefore necessary to provide a fault-tolerant method within the LCC software package, covering not only the software architecture but also the communication medium.
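As a concrete illustration of the cyclic CCR-to-LCC traffic, the sketch below shows what one command frame of the self-defined protocol might look like in C. The paper does not publish the actual frame layout, so every field name and size here is an assumption; a sequence number and a cyclic redundancy check of this kind are the minimum needed to support the package-loss detection and data checks discussed in Section 3.

```c
/* Hypothetical layout of a CCR->LCC command frame; all fields are
 * illustrative assumptions, not the published protocol. */
#include <stdint.h>

#pragma pack(push, 1)
typedef struct {
    uint32_t seq;          /* cycle sequence number, used to detect lost frames */
    uint32_t psos_state;   /* Plant System Operating State condition */
    float    i_ref;        /* reference current for the converter module */
    float    v_ref;        /* reference voltage for the converter module */
    uint16_t mode;         /* circulating current / single bridge / parallel */
    uint16_t crc16;        /* cyclic redundancy check over the fields above */
} ccr_cmd_frame_t;
#pragma pack(pop)
```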

3. Fault-tolerant design

The basic concept of "fault tolerance" was established at the First International Symposium on Fault-Tolerant Computing in 1971 [8]. The definitions used here derive from those proposed by Avizienis et al. [9]: (1) a failure is an event that occurs whenever the behavior of the system deviates from the prescribed system specification; (2) an error is an erroneous state of the system; and (3) a fault is the hypothesized or activated cause of an error. IEEE Std. 1044-2009 [10] clarifies the relationship between fault and failure: "A failure may be caused by (and thus indicate the presence of) a fault." The main purpose of a fault-tolerant design is therefore to avoid failures and to prevent any fault from leading to one [3]. Several approaches to fault tolerance have been developed, such as system rejuvenation, N-version systems and block recovery.


An N-version system uses two or more diverse versions of hardware or software running simultaneously in order to keep the system in a correct state, but its overhead is relatively high [3,11]. Our work focuses on building the LCC on a service-based architecture and making the system fault tolerant by partially rejuvenating services or communication messages.

3.1. Service-based software architecture

To solve the problems encountered in the early experiments, it was decided to build a new, robust software package with fault-tolerant capabilities. The new software package aims to (1) reform the system by making use of the microkernel architecture of QNX; (2) enhance the vulnerable or critical parts of the system, duplicating processes where necessary; (3) guarantee that resetting an abnormal part does not affect the rest of normal operation; and (4) ensure that every communication medium has cyclic redundancy checks and duplicated data in case of data error. To achieve these goals, the implementation of the LCCs is divided into a set of periodic processes and threads, called services. The services are decoupled, without interdependencies on each other, and the threads are created by the services to maintain high performance on multi-core hardware. Each service can be easily configured, started, stopped and, most importantly, reset by the Supervisory Service to avoid failure. The Supervisory Service not only manages local services but also monitors the Supervisory Services of the other LCCs located within the private network. Inter-process communication methods guarantee that the services can be distributed on a single node or remotely over the private network.

The basic architecture of the LCCs is shown in Fig. 2.

Fig. 2. The Basic Service-based Architecture.

At the user layer, the program is divided into two parts, MAIN and SUB. The MAIN part is a general multi-process and multi-thread program that allows each process to handle different functions. Note that the decoupled architecture makes services in the MAIN part exchange data by means of a distributed data bus, which holds two duplicate data copies protected by cyclic codes [12] to provide for data error conditions. Scheduling is handled by the Supervisory Service; all of the other processes are created and manipulated by it and work as normal services. The Scheduling service manipulates the other services and allocates resources to them, such as memory, time slots and interfaces. Moreover, it also monitors the Supervisory Services of the other LCCs within the private network by means of QNX, as discussed in Section 3.7. The Timer process synchronizes local time with global time on the ITER Time Communication Network (TCN), which is based on the IEEE 1588 Precise Time Protocol (PTP) [13]. The Comm part is responsible for communicating with the CCR via the private network and applies a channel-substitution strategy when a fault occurs. The Field-Bus part acquires all slow signals from the field-bus module, while the RT D/A Signal part acquires the fast digital and analog data over the PCI bus. The Fault-Handling part continually checks the health of the system and detects, classifies and reports faults to the Supervisory Service. The SUB part detaches the RT Alpha-Control and Control Policies processes from the original MAIN part and runs at a higher frequency in order to obtain more efficient control of the phase shift.
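A minimal sketch of the shape of one such decoupled service is given below: a periodic POSIX process that registers itself and then runs its control function at a fixed cycle. The real services run on QNX Neutrino with message-passing IPC; register_with_supervisor() and the 1 ms cycle are illustrative assumptions, not the actual LCC code.

```c
/* Sketch of one decoupled LCC service as a periodic POSIX process,
 * assuming a 1 ms cycle and a placeholder registration hook. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define CYCLE_NS (1000 * 1000)   /* assumed 1 ms control cycle */

static void register_with_supervisor(pid_t pid) {
    /* Placeholder: the real service would hand its PID to the
     * Scheduling process via IPC (e.g. a QNX message channel). */
    printf("service %d registered\n", (int)pid);
}

int main(void) {
    struct timespec next;
    register_with_supervisor(getpid());
    clock_gettime(CLOCK_MONOTONIC, &next);
    for (;;) {
        /* ... execute this service's control function (Tloop) ... */
        /* ... run local fault checks (Tdetect) ... */
        next.tv_nsec += CYCLE_NS;
        while (next.tv_nsec >= 1000000000L) {   /* normalize timespec */
            next.tv_nsec -= 1000000000L;
            next.tv_sec++;
        }
        /* sleep until the absolute start of the next cycle */
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
    }
}
```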

3.2. Fault detection

Faults can be classified, in the conventional way, as transient or persistent [3]. Transient faults are related to aging, such as memory leaks, bit flips and resource faults. In general, the activation of transient faults may depend on particular timing and run-time conditions, which makes them difficult to trace. The best way to handle a transient fault is therefore to compare the data against a duplicate copy, or to recreate a new process in place of the current one: after refreshing the data or the process, the transient fault is no longer present. Persistent faults, in contrast, remain even after refreshing data or processes, because they are caused by defects in the algorithms or the control flow.

In addition to this conventional classification, the LCCs dynamically classify faults by their possible consequences into minor, major and crucial faults. Minor faults can be detected and recovered within a process in one control period and do not affect other processes. If a fault cannot be recovered, or cannot even be detected, in a normal process, the supervisory process will detect the fault, recover the normal process and upgrade the fault to major. A major fault usually cannot be recovered within one control period and thus causes limited degradation of the system. Finally, if the supervisory process fails to recover the normal process from a fault state, or detects its malfunction, the fault is upgraded to crucial. In the case of a crucial fault, the system may temporarily terminate the faulted process(es) in order to protect the equipment. A detailed flow chart is illustrated in Fig. 3.

Fig. 3. Flow chart of the normal and supervisory processes when a fault occurs.

3.3. System recovery

This section describes the system recovery methods implemented in the improved LCC system. To estimate the capabilities of each service and the cost of its recovery, the services must first be categorized as hardware related or algorithm related. Note that the SUB, Timer, Comm, Field-Bus and RT D/A Signal services are independent services with separate, non-overlapping hardware access, but their recovery may involve a hardware reset that can interfere with the normal control loop. The other services do not rely on hardware but need either a faster response (Fault-Handling and Scheduling) or enhanced data recovery capabilities (Control Policies).

3.4. Program abort error

A program abort is caused by a major fault which cannot be recovered by the process itself, but only with the help of the Supervisory Service. During the initialization phase of the program, the normal services register with the Scheduling process using their Process Identification (PID), which is obtained from the RTOS. Any unexpected abort of a process destroys or invalidates its PID, so the Scheduling process checks the PID of every process to detect unexpected crashes. When a crash happens, the Scheduling process (1) estimates the fault reach and recovery time; (2) records the fault as transient and reports it to the upper level via the distributed data bus; and (3) restarts the process and brings it back to normal operation.
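A minimal sketch of this PID-based crash detection is shown below, assuming the supervisor spawned each service itself so that POSIX waitpid() can report an unexpected abort without blocking; the service path and restart policy are illustrative, not the actual LCC implementation.

```c
/* Sketch of PID-based crash detection and restart by a supervisor,
 * under the assumption that services are direct child processes. */
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static pid_t spawn_service(const char *path) {
    pid_t pid = fork();
    if (pid == 0) {                 /* child: become the service */
        execl(path, path, (char *)NULL);
        _exit(127);                 /* exec failed */
    }
    return pid;
}

/* Called once per supervisory cycle for each registered service. */
static pid_t check_and_restart(pid_t pid, const char *path) {
    if (waitpid(pid, NULL, WNOHANG) == pid) {   /* unexpected abort */
        fprintf(stderr, "service %s aborted, restarting\n", path);
        /* (1) estimate fault reach, (2) log it as transient ... */
        return spawn_service(path);             /* (3) bring it back */
    }
    return pid;   /* still alive */
}
```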

3.5. Invalid memory access error

The identification of invalid memory access is comparatively dynamic, so great effort is put into keeping this class of error as low as possible. According to experience and tests on the EAST project, the application is unlikely to use up all CPU resources, which leaves unused CPU capacity available for redundant shadow threads. A shadow thread in the Fault-Handling service has therefore been implemented to periodically check the integrity of each core while it is unused. When this periodic self-check detects a fault, the Fault-Handling service prevents the program from running on the faulted core.

3.6. Package loss error

Package loss here mainly refers to the control parameters distributed by the CCR and the system status reported by the LCCs through the private synchronized network. For maximum real-time performance of this network, the User Datagram Protocol (UDP) is employed for the transmission of network data. Since UDP is inherently connectionless, fault-tolerant measures must be taken. In the current PFCS experiments for ITER, the CCR sends reference current and voltage to the LCCs cyclically, so package loss may cause discontinuities in the control signal and thus degrade the control performance. To prevent package loss, a hand-shaking frame exchange that preserves real-time performance was introduced.
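The sketch below illustrates the idea of such a hand-shake on top of UDP: the receiver tracks a frame sequence number and requests retransmission when it sees a gap. The frame layout and the NACK message are assumptions; the paper does not specify the actual protocol.

```c
/* Sketch of gap detection and retransmission request over UDP;
 * socket setup is omitted, and the NACK format is hypothetical. */
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

static uint32_t expected_seq = 0;

static void handle_frame(int sock, const uint8_t *buf, size_t len,
                         const struct sockaddr *ccr, socklen_t ccr_len) {
    uint32_t seq;
    memcpy(&seq, buf, sizeof(seq));          /* leading sequence number */
    if (seq != expected_seq) {
        /* Package loss detected: ask the CCR to resend the gap. */
        uint8_t nack[8] = { 'N', 'A', 'C', 'K' };
        memcpy(nack + 4, &expected_seq, 4);
        sendto(sock, nack, sizeof(nack), 0, ccr, ccr_len);
        return;                               /* wait for the resend */
    }
    expected_seq = seq + 1;
    /* ... hand the control parameters to the control services ... */
    (void)len;
}
```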

3.7. Network redundancy

The application is implemented on the QNX Neutrino RTOS, which is based on a microkernel architecture and provides the foundation for a unique networking technology. Unlike conventional operating systems, a microkernel operating system allows only the most essential OS primitives to run in the kernel and keeps all other services, including drivers, file systems and protocol stacks, outside the kernel as individual processes. These services do not have to be called through the kernel, because they do not run in it. On this message-passing foundation, a private QNX distributed-processing control system was built for the sake of fault-tolerant design. With the aid of Qnet [14], any node in the private QNX-based network can access another node's protocol stack through the Portable Operating System Interface (POSIX). With the Global Name Service (GNS) [14] enabled, local services are fully transparent to all other nodes in the private network. This transparency is used to form a fault-tolerant network purely in software, without additions to the current hardware. The architecture of the redundant network topology is shown in Fig. 4.

Fig. 4. The topology of Network Redundancy.

By means of the GNS, each node on the private network shares its protocol stack with all the other nodes in case of a network fault. If one LCC fails or becomes blocked by too many frames or requests from the CCR, the other LCCs can take over that LCC's duties until it returns to normal operation. Fig. 4 also shows the flow of the takeover process: while cyclically receiving command frames from the CCR, an LCC can detect faults in its communication link. The LCC then contacts the other LCCs and compares the round-trip time of an echo message to decide which link would be the fastest substitute. It creates a new connection with the substitute, transmits its data, and waits for the echo from the CCR that indicates a complete message transmission. After that, the LCC releases the connection and waits for the next cyclic data request.
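The reason this takeover needs no extra hardware is that Qnet exposes a peer node's resources under an ordinary pathname prefix (/net/<node>/...), so the substitute link is plain POSIX file I/O. The node and device names in the sketch below are hypothetical.

```c
/* Sketch of sending a frame through a peer LCC's communication
 * resource over Qnet; "/net/lcc2/dev/lcc_comm" is a made-up path. */
#include <fcntl.h>
#include <unistd.h>

int send_via_substitute(const void *frame, size_t len) {
    /* A peer LCC's locally registered resource, made visible
     * network-wide by Qnet and the Global Name Service. */
    int fd = open("/net/lcc2/dev/lcc_comm", O_WRONLY);
    if (fd < 0)
        return -1;                  /* try the next backup node */
    ssize_t n = write(fd, frame, len);
    close(fd);                      /* release after the CCR echo */
    return (n == (ssize_t)len) ? 0 : -1;
}
```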

4. Test and results

A performance test was completed on the current PFCS to evaluate the fault-tolerant design. The cycle period (Tcycle) is the time within which each service must be performed correctly at least once. The run time of a service (Trun) consists of several parts: (1) the time to execute the prescribed functions (Tloop), which equals the run time of the system without the fault-tolerant feature; (2) the time to detect faults (Tdetect); and (3) the time to recover from an error or to restart a service to avoid failure (Trecovery). The aim of the fault-tolerant design is to satisfy

Trun = Tloop + Tdetect + Trecovery < Tcycle    (1)

Tloop of the system without the fault-tolerant feature is measured to determine the minimum value of Trun. By deliberately injecting faults, the newly added and crucial parts, Tdetect and Trecovery, are also measured. For example: (1) to simulate a package loss error, the first command package on the LCC side is intentionally dropped to trigger a retransmission from the CCR; (2) to observe a service's response time to an invalid memory access error, a memory value in the program is changed at random [15]; (3) to evaluate the impact of a program abort error on the system, each service is manually restarted to measure the absolute restart time. The test platform is a PXI-based chassis equipped with an Advantech® MIC-3395 controller running QNX 6.5 SP1, an NI PXI-6255 DAQ module for acquiring analog signals, an ADLINK PCI-7432 module for acquiring digital signals and a WAGO 750-881 ETHERNET Fieldbus Controller for acquiring equipment status. Each period of time is measured by making use of spare output pins on the PXI-6255 module: a service toggles an output pin to mark the start and stop times, and three output pins are allocated to each service to represent the three phases of its run time. The overhead of manipulating the output pins is negligible and has no impact on normal operation. The results of the test are shown in Table 1.
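The pin-toggle instrumentation can be captured in a small macro like the one below; set_pin() stands for the hardware-specific DAQ driver call and is an assumption, as the paper does not show its measurement code.

```c
/* Sketch of pin-toggle timing instrumentation: each run-time phase
 * brackets itself with a digital edge on a spare output line, and an
 * external counter or oscilloscope measures the gap between edges. */
typedef enum { PIN_LOOP, PIN_DETECT, PIN_RECOVERY } phase_pin_t;

extern void set_pin(phase_pin_t pin, int level);   /* assumed driver hook */

#define MEASURED(pin, stmt) \
    do { set_pin((pin), 1); stmt; set_pin((pin), 0); } while (0)

/* Usage inside a service cycle:
 *   MEASURED(PIN_LOOP,     run_control_function());
 *   MEASURED(PIN_DETECT,   run_fault_checks());
 *   MEASURED(PIN_RECOVERY, recover_if_needed());
 */
```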

Fig. 5. Qnet-based data link switching time measurement.

Note that Tloop equals Trun of the system without the fault-tolerant feature. Following the categorization of services discussed in Section 3.3, the results show that hardware-related services usually take more recovery time upon detecting faults, while algorithm-related services reset comparatively quickly. Note also that the Comm and Field-Bus services set an acknowledge (ACK) frame timeout to avoid package loss. To shorten the recovery time of the Comm service, the Qnet-based network redundancy technique is implemented in the system: the Qnet data link is inactive in normal operation and quickly takes over when the UDP data link is broken or a package is lost. There are four LCCs in the private network, so each one has three potential backup connections. The time to establish a new connection and transmit data may influence the effectiveness of this fault-tolerant method. Fig. 5 shows a complete process of a new connection taking over a faulted connection, as measured by an oscilloscope: by injecting a fault (the first falling edge), the duration of the takeover process (up to the second falling edge) is easy to observe and measure. The main purpose of the Field-Bus service, by contrast, is to acquire slow signals, so neither its cycle period nor its recovery time is as stringent as those of the other services.

Although the scheme spends considerable time on detection and recovery, the actual CPU occupation is much lower; for example, the Comm service does not occupy the CPU while waiting for the ACK message. To evaluate the overhead of the reform, it is important to measure the actual system load with and without the fault-tolerant feature. Kernel Event Trace [16] in the QNX Momentics IDE is therefore used to take a "snapshot" of all processes and threads in the QNX operating system over a period of time. Fig. 6 shows the overall CPU summary time of the four cores for around 10 ms; the comparison shows approximately 13.5% overhead for the fault-tolerant feature.

To prove the effectiveness of the schemes, they were implemented in the PF Converter Unit test, and the output currents of PF1IDC, PF1IDCCU4, PF1VDC and PF1VDCCU4 were measured to evaluate the results. In Fig. 7(a), shot number 76519 shows the system running under normal working conditions, and shot number 76509 shows the output current failing to follow the input signal under the influence of faults. Fig. 7(b) shows that good current-output performance can be achieved by the fault-tolerant implementation even under fault disturbance.


Table 1
The results of the test.

Service           Tloop (μs)   Tdetect (μs)   Trecovery (μs)   Trun (μs)   Tcycle (μs)
Timer             13           2              101              114         100
Comm (Qnet)       102          100            122              324         1000
Comm (UDP)        102          100            223              425         1000
Field-Bus         962          862            1000             2824        4000
Fault-Handling    9.09         9              5                25.09       1000
RT D/A Signal     62.1         46             100              208.1       500
Scheduling        32           21 (PF1,6)     20 (PF1,6)       73          100
                               61 (PF2–5)     60 (PF2–5)       153
SUB part          28           18             10               56          100
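As a consistency check against Eq. (1), the Comm (UDP) row gives Trun = 102 + 100 + 223 = 425 μs, well below its 1000 μs cycle period; likewise, the Field-Bus row gives 962 + 862 + 1000 = 2824 μs < 4000 μs.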

Fig. 6. Summary of the trace log: (a) without the fault-tolerant feature; (b) with the fault-tolerant feature.

5. Conclusion

This paper has introduced the architecture and complexity of the PFCS, together with the requirements of the LCC, and has described an approach that manages this complexity by means of service-based software. On top of the conventional classification, it has proposed a fault identification method based on the consequences of faults; the LCC system deals with each fault according to this identification and confines its influence to a minimal range. Redundancy strategies, including online fault detection and network redundancy, have been incorporated to enhance the system's fault-tolerance capability, and the redundancy method can be realized on the present PFCS network architecture and running platform. The effectiveness of the system and its fault-tolerant design has been proven in the recent PF Converter Unit performance tests for ITER.


Fig. 7. Test results of the PF Converter Unit from EASTScope: (a) system without the fault-tolerant feature; (b) fault-tolerant system under fault disturbance.

Disclaimer

The views and opinions expressed herein do not necessarily reflect those of the ITER Organization.

Acknowledgement

The authors would like to express their gratitude to the Ministry of Science and Technology of China for funding, and to the staff of the Institute of Plasma Physics, Chinese Academy of Sciences, for helpful discussions and suggestions.

References

[1] J. Shen, P. Fu, G. Gao, L. Huang, S. He, A timed-token based network for the ITER poloidal field converter control system, J. Fusion Energy 33 (6) (2014) 726–730.
[2] W. Treutterer, D. Humphreys, G. Raupp, E. Schuster, J. Snipes, G. De Tommasi, M. Walker, A. Winter, Architectural concept for the ITER plasma control system, Fusion Eng. Des. 89 (5) (2014) 512–517.
[3] T. Anderson, J.C. Knight, A framework for software fault tolerance in real-time systems, IEEE Trans. Software Eng. SE-9 (3) (1983) 355–364.
[4] P. Alho, J. Mattila, Real-time service-oriented architectures: a data-centric implementation for distributed embedded systems, in: Embedded Systems: Design, Analysis and Verification, 4th IFIP TC 10 International Embedded Systems Symposium (IESS 2013), Paderborn, Germany, June 17–19, 2013, pp. 262–271.


[5] W.T. Tsai, Y. Lee, Z. Cao, Y. Chen, B. Xiao, RTSOA: real-time service-oriented architecture, in: 2006 Second IEEE International Symposium on Service-Oriented System Engineering (SOSE'06), 2006, pp. 49–56.
[6] H. Yuan, P. Fu, G. Gao, L. Huang, Z. Song, L. Dong, M. Wang, T. Fang, On the current sharing control of ITER poloidal field converter, J. Fusion Energy 33 (3) (2014) 294–298.
[7] M. Ruiz, J. Vega, R. Castro, D. Sanz, J.M. Lopez, G. De Arcas, E. Barrera, J. Nieto, B. Goncalves, J. Sousa, B. Carvalho, N. Utzel, P. Makijarvi, ITER fast plant system controller prototype based on PXIe platform, Fusion Eng. Des. 87 (12) (2012) 2030–2035.
[8] A. Avizienis, Fault-tolerant systems, IEEE Trans. Comput. C-25 (12) (1976) 1304–1312.
[9] A. Avižienis, J.C. Laprie, B. Randell, C. Landwehr, Basic concepts and taxonomy of dependable and secure computing, IEEE Trans. Dependable Secure Comput. 1 (1) (2004) 11–33.
[10] IEEE Standard Classification for Software Anomalies, IEEE Std 1044-2009, 2010, pp. 1–23.
[11] M. Correia, J. Sousa, A.P. Rodrigues, A.J.N. Batista, Á. Combo, B.B. Carvalho, B. Santos, P.F. Carvalho, B. Gonçalves, C.M.B.A. Correia, C.A.F. Varandas, N + 1 redundancy on ATCA instrumentation for nuclear fusion, Fusion Eng. Des. 88 (6–8) (2013) 1418–1422.
[12] W. Peterson, D. Brown, Cyclic codes for error detection, Proc. IRE 49 (1) (1961) 228–235.
[13] C. Kutschera, A. Gröblinger, R. Höller, C. Gemeiner, N. Kerö, G.R. Cadek, IEEE 1588 clock synchronization over IEEE 802.3/10 GBit Ethernet, in: ISPCS 2010 – International IEEE Symposium on Precision Clock Synchronization for Measurement, Control and Communication, 2010, pp. 71–76.
[14] An Introduction to QNX Transparent Distributed Processing, http://qnx.symmetry.com.au/resources/whitepapers/qnx_distributed_processing.pdf.
[15] M.-C. Hsueh, T.K. Tsai, R.K. Iyer, Fault injection techniques and tools, Computer 30 (4) (1997) 75–82.
[16] B. Veldhuijzen, Redesign of the CSP Execution Engine, University of Twente, 2009.
