Reliability Engineering and System Safety 113 (2013) 76–93
Analyzing vulnerabilities between SCADA system and SUC due to interdependencies

Cen Nan, Irene Eusgeld, Wolfgang Kröger
ETH Zürich, Zürich, Switzerland; University of Duisburg-Essen, Essen, Germany
Article history: Received 10 August 2012; Received in revised form 11 December 2012; Accepted 27 December 2012; Available online 9 January 2013

Abstract: Interdependencies within and among Critical Infrastructures (CIs), e.g., between Industrial Control Systems (ICSs), in particular Supervisory Control and Data Acquisition (SCADA) systems, and the underlying System Under Control (SUC), have dramatically increased the overall complexity of related systems, causing the emergence of unpredictable behaviors and making them more vulnerable to cascading failures. It is vital to get a clear understanding of these often hidden interdependency issues and tackle them with advanced modeling and simulation techniques. In this paper, vulnerabilities due to interdependencies between these two exemplary systems (SCADA and SUC) are investigated and analyzed comprehensively using a modified five-step methodical framework. Furthermore, suggestions for system performance improvements based on the investigation and analysis results, which could be useful to minimize the negative effects and improve their coping capacities, are also presented in this paper. © 2013 Elsevier Ltd. All rights reserved.
Keywords: SCADA; Critical infrastructure protection (CIP); Interdependency study; Simulation and modeling; High level architecture (HLA)
1. Introduction

1.1. CI interdependency

The welfare and security of each nation rely on a continuous flow of essential goods (e.g., energy, food, water, data, etc.) and services (e.g., banking, health care, public administration, etc.). CIs are those infrastructures so vital to any country that their incapacity or destruction would have a debilitating impact on
Abbreviations: ABM, Agent-based modeling; ASAI, Average service availability index; ASSAI, Average substation service availability index; CCF, Common cause failure; CI, Critical infrastructure; CN, Complex network; COCOM, Cognitive control model; CPC, Common performance condition; CREAM, Cognitive reliability error analysis method; CU, Communication unit; DCST, Dynamic control system theory; DI, Degree of impact; EPSS, Electricity power supply system; FC, Failure to close; FCD, Field level control device; FID, Field level instrumentation device; FIS, Fuzzy inference system; FO, Failure to open; FRC, Failure to run due to communication error; FRF, Failure to run with field device; FRH, Failure to run (too high); FRL, Failure to run (too low); FRW, Failure to run due to hardware failure; HEP, Human error probability; HLA, High level architecture; HRA, Human reliability analysis; ICS, Industrial control system; ICT, Information and communication technology; IIM, Input–output inoperability modeling; LAN, Local area network; MMI, Man–machine interface; MTU, Master terminal unit; PLC, Programmable logic controller; PSF, Performance shaping factor; RTU, Remote terminal unit; RTI, Run time infrastructure; SCADA, Supervisory control and data acquisition; SO, Spurious operation; SOE, Sequence of events; SUC, System under control; THERP, Technique for human error rate prediction; TSO, Transmission system operator; UPS, Uninterrupted power supply

* Corresponding author. Tel.: +41 44 632 6334. E-mail address: [email protected] (C. Nan).

http://dx.doi.org/10.1016/j.ress.2012.12.014
health, safety, security, economics and social well-being [1]. CIs have been continuously exposed to multiple threats and hazards such as natural hazards, technical or human failures, and socio-political hazards. A failure caused by these hazards within any CI or loss of its continuous service may be damaging enough to our society and economy, while cascading failures crossing subsystems (within a single CI) and/or even CI boundaries have the potential for multi-infrastructural collapse and unprecedented consequences [2]. Negative cascading impacts due to these interlinks have started to challenge our society to study and cope with the recently recognized weakness of CIs, i.e., vulnerability caused by their interdependencies. From a technical perspective, the term dependency depicts a linkage between two systems (CIs) through which the state of one system influences the state of the other, whereas interdependency is a bidirectional relationship through which the state of each system is correlated to the state of the other [3]. It should be noted that two systems, as mentioned above, can correspond to one CI (internal interdependency) or more CIs (external interdependency). For example, the interdependency between a SCADA system and its associated SUC can be referred to as an example of internal interdependency due to the fact that two systems are included within one system, e.g., an Electricity Power Supply System (EPSS), while the interdependency between an EPSS and its interlinked telecommunication system can be referred to as an example of external interdependency. Interdependencies¹ can
¹ This paper uses the general term interdependency to describe the interactions within and among CIs including both interdependency and dependency.
also be of different types: physical, cyber, geographic, and logic [3]. The importance of preventing or at least minimizing negative impacts of cascading failures caused by these interdependencies has been recognized and accepted by both governments and the public as a topic of CI Protection (CIP). The purpose of the protection is not just to identify the cause of failures and prevent them but also to halt ongoing cascading or escalating events before they affect other CIs.

1.2. SCADA system

Modern CIs, e.g., power supply, telecommunication, and rail transport systems, are all large-scale, complex, highly integrated and particularly interconnected. The operators of these systems must continuously monitor and control them to ensure their proper operation [4]. These industrial monitor and control functions are generally implemented using a SCADA² system. Its fundamental purpose is to allow a user (operator) to collect data from one or more remote facilities and send control instructions back to those facilities. For instance, voltage, frequency and phase angle are all important parameters in a power supply system which need to be continuously monitored for maintaining a normal working environment. In general, a SCADA system has two main functions: (1) monitor function: transmit data from sensors, transducers, and other devices installed at remote facilities to a centrally located control facility³, and (2) control function: adjust process parameters by changing the states of field devices, e.g., opening or closing valves. CIs have been benefiting from using SCADA systems [5–7]; they can be regarded as backbones of these CIs [8]. In general, there are four levels in a standard SCADA system hierarchy mainly based on the functionalities of devices, shown in Fig. 1. Level 1, the lowest level in the standard hierarchy, includes Field Level Instrumentation and Control Devices (FIDs and FCDs), e.g., sensors and actuators.
Remote Terminal Unit (RTU), level 2 in the standard hierarchy, is a rugged industrial common system providing intelligence in the field. It is a standard stand-alone data acquisition and control unit with the capabilities of acquiring data from monitored processes, transferring data back to the control center, and controlling locally installed equipment. Communication Unit (CU), level 3 in the standard hierarchy, provides a pathway for communications between a control center and RTUs. Master Terminal Unit (MTU) can be regarded as a ‘‘host computer’’ issuing commands, collecting data, storing information, and interacting with the SCADA operator⁴, who can communicate with substation level components. Compared to the RTU, the MTU is a ‘‘master machine’’, which is able to initiate the communication either automatically by its installed programs or manually by an operator. It should be noted that most devices in the scope of the first three levels of the standard SCADA system hierarchy (levels 1–3) are normally installed (hardwired) at substations.

1.3. Motivations and objectives

Originally, a SCADA system was designed as a point-to-point system connecting a monitoring or command device to remotely

² In this paper, SCADA will be referred to as a system if it is individually introduced and discussed. Nevertheless, if the discussions are related to the interdependency study within one CI, SCADA will be referred to as a subsystem.
³ As part of a SCADA system this facility is always referred to as a control center.
⁴ An operator is a person who has access to the MTU and makes decisions according to the monitored field information.
Fig. 1. Standard SCADA system hierarchy.
located sensors or actuators. By now, it has evolved into a complex network that supports communication between a central control unit and multiple remote units using advanced Information and Communication Technologies (ICTs) [9]. However, extensive use of ICTs introduces new types of security threats to SCADA systems [10,11]. The increased connectivity of a SCADA system has the potential to expose its monitored/controlled safety-critical CI (SUC) to a wide range of security issues and severe threats, e.g., unauthorized accesses, malicious intrusions, etc., causing cascading failures and incidents. For example, Stuxnet, a self-replicating computer worm, has recently become notorious for its capability of challenging the security of CIs through SCADA systems by modifying the control logic of control systems [12,13]. Recent surveys show that a number of attacks against SCADA systems have been reported over the years, e.g., the prominent Maroochy Shire accident in Australia (2000), the Florida power outage in the USA (2008), etc. [14,15]. Numerous incidents related to security issues in SCADA systems also remain unreported by asset owners and operators [16]. All of these incidents and lessons learned from the past have motivated us to identify and study vulnerabilities of SCADA systems, including interdependency-related vulnerabilities between SCADA system and SUC, which have been addressed by researchers during the last decades [13,16]. Most publications are related to the investigation regarding pervasive uses of ICT. For example, the SCADA communication protocol Modbus has become one of the most widely discussed topics for securing the data transmission between the control center and remote substation facilities.
However, the importance of components installed at substations should not be ignored due to the fact that a minor operation disruption of these devices could possibly lead to a significant service loss and even unavailability of both systems (SCADA and SUC). The objective of this paper is to identify and analyze both obvious and hidden vulnerabilities⁵ due to the (internal) interdependencies between the SCADA system and its interconnected SUC. It should be noted that the reliability analysis of the two systems is the main focus of this paper. Security analysis, which is related to topics such as malware attacks, is also very important, but it is not the subject of this paper. Devices and equipment installed at the substation level of the SCADA system are systematically analyzed due to the fact that they could have impacts on the
⁵ See Section 2 for more information regarding obvious and hidden vulnerabilities.
vulnerability of the whole system. The EPSS is used as an example of CI in this paper. The assessment uses a modified methodical framework, which was originally developed by Eusgeld and Kröger for analyzing vulnerabilities due to CI interdependencies [17]. In general, this framework can be divided into five steps, i.e., preparatory phase, screening analysis, in-depth analysis, results assessment, and potential technical improvement. The focus of this paper is the presentation of a novel hybrid modeling/simulation approach and in-depth analytical experiments, which have been developed especially for the most challenging step of this framework, the step of in-depth analysis. The core of this approach is the idea to first divide the overall simulation tool into different modules and then combine them into a distributed simulation platform. The rest of this paper is organized mainly based on the five working steps of the framework: In Section 2, the objective of the task, as well as of the CIs and/or subsystems within a CI to be analyzed, is framed in detail. The approaches that have been applied to model/simulate CI interdependencies are also briefly introduced. The purpose of Section 3 is to develop adequate system understanding according to the previously framed task. Furthermore, identified obvious vulnerabilities are presented in this section. Section 4 presents a novel hybrid modeling/simulation approach for in-depth analysis of CI interdependencies including the implementation of this approach, i.e., modeling the SCADA system (both technical and non-technical components) and the development of an experimental simulation platform. Section 5 focuses on the introduction of three sets of in-depth experiments conducted using the developed simulation platform and discussion of experiment results. In Sections 6 and 7, results assessment and identification of potential improvements are presented. Section 8 presents our conclusion.
2. Step 1: Preparatory phase

The main purpose of this step is to reach a clear understanding regarding the objective of the task and the systems under study. The understanding of these systems will facilitate distinguishing between obvious and hidden vulnerabilities [18]. Obvious vulnerabilities can be recognized by the screening analysis, while hidden vulnerabilities need the in-depth analysis using more advanced techniques such as system modeling and simulation. The boundaries of the studied systems need to be determined as well, which could range from the delimitation of an elementary model, focusing only on fundamental components, to the delimitation of a whole infrastructure system, composed of subsystems. Furthermore, it is necessary to identify interdependency characteristics, e.g., types, coupling degrees, etc. After framing the task, the knowledge base should then be checked with respect to available methods/approaches suitable for the framed task.

2.1. Framing the task

According to [1], the interdependencies can be described by six dimensions, e.g., type of failure, type of interdependencies, coupling/response behavior, etc. It is essential to decide which CIs or subsystems should be analyzed. Then, the general term ‘‘to find system(s) vulnerabilities due to interdependencies’’ can be stated precisely. In this paper, two subsystems, SCADA and SUC, within the EPSS are selected. The SCADA system in this case can be regarded as a general SCADA system including four levels in the standard system hierarchy. The SUC within the power supply system could be a distribution system, a generation system, or a transmission system. In this paper, the 220 kV/380 kV Swiss electricity transmission
network is selected as an exemplary SUC. It is assumed that this transmission network is a stand-alone system and the energy exchange with neighboring countries is regarded as independent positive or negative power injections at the respective boundary substations.

2.2. General understanding of studied interdependencies

In general, interdependencies between the SCADA system and its associated SUC can be summarized as follows:
- Physical interdependency: the physical interdependency exists since a SCADA system requires electric power supplied by the SUC, and some substation level devices of a SCADA system, e.g., RTUs, FCDs, etc., have control (manipulation) over its connected SUC.
- Cyber dependency: the cyber dependency exists since a SCADA system monitors and controls components of a SUC via various communication media.
- Geographic interdependency: the geographic interdependency exists since some of their components need to be installed at the same places, e.g., substations.
- Logic interdependency: the logic interdependency between SCADA system and SUC does not exist in this case.
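The classification above can be captured as a small lookup table. This is an illustrative sketch only; the dictionary and variable names are hypothetical, while the true/false values follow the text:

```python
# Presence of each interdependency type between the SCADA system and
# the SUC, as summarized in the text (illustrative data structure).
SCADA_SUC_INTERDEPENDENCIES = {
    "physical":   True,   # SCADA needs power from SUC; RTUs/FCDs act on SUC
    "cyber":      True,   # SCADA monitors/controls SUC via communication media
    "geographic": True,   # shared installation sites (substations)
    "logic":      False,  # not present in this case
}

present = sorted(t for t, p in SCADA_SUC_INTERDEPENDENCIES.items() if p)
print(present)  # ['cyber', 'geographic', 'physical']
```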
Based on the definition of the degree of coupling, introduced in Ref. [1], the SCADA system and the SUC are tightly coupled. After framing the task and gaining general understanding of interdependencies under study, methods and approaches available for performing the task need to be checked.

2.3. Available methods/approaches

The challenges regarding understanding, characterizing, and investigating CI interdependencies are immense and research in this area is still at an early stage [19,20]. In recent years a great deal of effort has been devoted by researchers and two main directions have been followed, i.e., knowledge-based and model-based approaches. Knowledge-based approaches, e.g., empirical investigations and brainstorming, intend to use data collected by interviewing experts and/or analyzing past events to acquire information and improve the understanding of dimensions and types of interdependencies [21,22]. An example of this approach is a policy brief from the International Risk Governance Council (IRGC), which introduces an assessment of dependencies among CIs based on brainstorming sessions among experts around the world and then categorizes how dependent each CI is on the others [23]. This type of approach is straightforward and easy to understand. It is capable of providing a qualitative assessment of the severity of CI interdependencies and can be considered as an efficient screening method. However, it is a purely data-driven approach, meaning that the accuracy of results depends on the quality and the interpretation of the collected information. Model-based approaches aim to analyze interdependent CIs comprehensively using advanced modeling/simulation techniques and are capable of providing both quantitative and qualitative assessment. Currently, a variety of model-based approaches have been applied, e.g., Input–output Inoperability Modeling (IIM), Complex Network (CN) Theory, Agent-Based Modelling (ABM), etc.
The IIM approach is an example of capturing CI interdependencies via the development of mathematical models. Originally this approach was a framework for studying the equilibrium behavior of an economy by describing the degree of interconnectedness among various economic sectors [24]. It assumes that each CI can be modeled as an atomic entity whose level of operability depends on other CIs and propagation between CIs can be
described mathematically based on the basic Leontief high order mathematical model [25]. The IIM approach is capable of analyzing cascading failures and providing a mechanism for dependency measurement of CIs. Haimes et al. [26,27] applied this approach to study impacts of high-altitude electromagnetic pulse on electric power infrastructure. The great advantage of this type of mathematical model is its precision. However, deriving an appropriate mathematical representation of multiple infrastructure systems is not easy due to their inherent complexities. To overcome this difficulty, the task of analyzing behaviors of interdependent CIs as a whole can be turned into the analysis of the aggregate behaviors of many smaller interacting entities, e.g., nodes (CN theory) and agents (ABM). Fundamental elements of the CN theory approach are originally formed by graph theory [28]. A graph G(V, E) is composed of a set of nodes (vertices) V and a set of connections E between them. Each node (or vertex) represents an element of the system, while a link (or edge) represents the relation between corresponding elements. A graph can then be drawn by plotting nodes as points and edges as lines between them. In general, a graph can be analyzed by well-developed parameters, e.g., the order/size of a graph, the weight/strength of a link, the degree/degree distribution/betweenness of nodes, etc. A complex network can be regarded as a graph with non-trivial topological features that do not occur in simple networks such as lattices or random graphs but often occur in real graphs. The CN theory is an approach capturing the CI coupling phenomenon as a set of nodes connected by a set of links, thereby characterizing their topology.
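The core inoperability relation behind the Leontief-based IIM approach mentioned above can be written in its standard form. This is a sketch using the notation common in the IIM literature, not reproduced from this paper:

```latex
% Inoperability of each CI as a function of the inoperability of the
% others (standard IIM form), with its equilibrium solution:
\begin{align*}
  q &= A^{*} q + c^{*} \\
  q &= \left(I - A^{*}\right)^{-1} c^{*}
\end{align*}
% q   : vector of inoperabilities (fractional loss of functionality)
% A^* : interdependency matrix; entry a^*_{ij} measures how much the
%       inoperability of CI j contributes to the inoperability of CI i
% c^* : external perturbation (demand-side disruption) vector
```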
A number of modeling efforts have been made to adopt this approach for the development of CI models and interdependency-related assessments, demonstrating its capability of representing relationships established through connections among CI components [29,30]. Apostolakis and Lemon have developed a screening methodology investigating vulnerabilities in the MIT campus by modeling its infrastructures as interconnected graphs [31]. Ouyang et al. proposed a topology-driven approach using complex networks to comprehensively assess the vulnerability among interdependent infrastructures [32]. The analysis of the topological properties of the network representing given CIs is able to reveal useful information about the structural properties, topological vulnerability, and the level of functionality demanded for its components. However, this approach lacks the ability to capture uncertain and dynamic characteristics of CIs and system properties when dynamical processes, acting on the network, occur. This can be improved by replacing the nodes with agents via the ABM approach. Using the ABM approach, each agent is capable of modifying its own internal data, its behaviors, its environments and even its adaptation to environmental changes. Similar to the CN theory, this approach also describes a whole system by its individual parts, assuming that interdependency behaviors will emerge through interactions among all the agents. An agent can be used to model both a technical component (e.g., a transmission line) and a non-technical component (e.g., a human operator), while different agents interact with each other directly or indirectly. This approach is able to provide an integrated environment where a more comprehensive analysis of dynamic system behaviors can be performed by ‘‘looking into’’ the component level of studied system(s) [33]. In [34], Panzieri et al.
have developed a multiagent simulator, Critical Infrastructure Simulation by Interdependency Agents (CISIA), to analyze fault propagation across heterogeneous infrastructures, which is able to use rich sources of micro-level data to develop interdependency forecasts. Overall, the ABM approach achieves a closer representation of system behaviors by integrating the spectrum of different phenomena that may occur, e.g., generating a multitude of representative stochastic, time-dependent event chains. However, this approach demands a large number of parameters defined for each agent, requiring thorough knowledge of the studied system(s).

Table 1
Definition of functional failure modes.

Component   Failure mode
FCD         FO (failure to open); FC (failure to close); SO (spurious operation)
FID         FRH (failure to run (too high)); FRL (failure to run (too low))
RTU         FRW (failure to run due to hardware failure); FRF (failure to run with field device); FRC (failure to run due to communication error)

It should be noted that other approaches, which have also been applied by researchers but will not be discussed in this paper, include System Dynamics (SD) [35], Petri Net (PN)-based [36], Bayesian Network (BN) [37,38], and the Dynamic Control System Theory (DCST) [39–41]. Compared to the knowledge-based approaches, the model-based approaches promise to gain a deeper understanding of CI behaviors as well as their interdependencies. The level of this so-called deeper understanding by each model-based approach also varies. Some approaches are only capable of analyzing studied system(s) at the structure/topology level, e.g., the CN theory approach, while some approaches are capable of capturing and analyzing dynamic behaviors of the systems at the functional level, e.g., the ABM approach.
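A minimal sketch of the ABM idea just described: one technical agent (a transmission line) and one non-technical agent (an operator) interacting over discrete time steps. All names, failure probabilities, and repair rules here are hypothetical illustrations, not the authors' model:

```python
import random

class LineAgent:
    """Technical component: a transmission line that fails stochastically."""
    def __init__(self, name, fail_p):
        self.name, self.fail_p, self.up = name, fail_p, True

    def step(self, env):
        if self.up and env["rng"].random() < self.fail_p:
            self.up = False  # stochastic failure event

class OperatorAgent:
    """Non-technical component: repairs at most one failed line per step."""
    def step(self, env):
        for line in env["lines"]:
            if not line.up:
                line.up = True  # repair action
                break

def simulate(steps=100, n_lines=5, fail_p=0.05, seed=1):
    """Run all agents for `steps` ticks; return accumulated line downtime."""
    env = {
        "rng": random.Random(seed),
        "lines": [LineAgent(f"L{i}", fail_p) for i in range(n_lines)],
    }
    agents = env["lines"] + [OperatorAgent()]
    downtime = 0
    for _ in range(steps):
        for agent in agents:
            agent.step(env)
        downtime += sum(not line.up for line in env["lines"])
    return downtime

print(simulate())
```

Repeated runs with different seeds would generate exactly the kind of stochastic, time-dependent event chains mentioned in the text.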
3. Step 2: Screening analysis

The purpose of this step is to reach a further understanding of the previously framed task by acquiring sufficient information/knowledge of main functionalities, interfaces, and components of each studied system, as well as interdependencies among previously determined systems, in order to decide which to evaluate in more detail. Components and interfaces of each system need to be analyzed systematically, especially components essential for the normal functionalities of the system. In this step, obvious vulnerabilities should be identified using methods such as empirical investigations and topological analysis. Indicators of the obvious vulnerabilities could be reliability bottlenecks, errors in operation, emergency procedures, etc. [17].

3.1. Development of adequate system understanding

Although the studied system(s) has(have) been described and its(their) boundaries have been defined at the stage of ‘‘framing the task’’ (Step 1), it is still important to further develop an adequate system understanding, which aims to improve the accuracy of results obtained from the screening analysis and collect more detailed information for the in-depth analysis. To achieve these goals, components of the systems need to be analyzed first and then corresponding failure modes for each system component need to be defined. In this section, adequate understanding of the SCADA system within the electric power supply system is developed including components installed at levels 1 and 2 (substation level) of the standard SCADA system hierarchy, i.e., FID, FCD and RTU, due to their importance for the interdependency-related vulnerability analysis. More information regarding adequate understanding of the SUC and SCADA system
Fig. 2. Overview of the SCADA system for the 220 kV/380 kV Swiss power transmission network (red nodes represent key substations). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
can be found in [42,43]. In total, eight (functional) failure modes are defined for these substation level components, summarized in Table 1.

3.2. Identification of obvious vulnerabilities

3.2.1. Empirical investigation
One of the established methods to identify obvious vulnerabilities is to look into statistics. The Repository of Industrial Security Incidents (RISI) is a database including a number of technical incidents in which process control, industrial automation or SCADA systems were affected. The purpose of this database is to collect, investigate, analyze, and share important industrial security incidents among a number of companies for the purpose of exchanging experiences. These incidents include not just publicly known incidents, but also incidents from private reports. According to a technical report [44], which is based on 141 records collected from the RISI database, the most serious vulnerability exposed by these incidents is the lack of overall security awareness. The second most serious vulnerability is inadequately examined and maintained system administration mechanisms and software. Inadequately designed ICS networks and insufficient security for remote accesses are also the causes of many incidents.

3.2.2. Topological analysis
Another method to identify obvious vulnerabilities is to apply topological analysis using the approach of CN theory. For example, the SCADA system for the 220 kV/380 kV Swiss electric power transmission network consists of 149 substations, connecting 219 transmission lines in total. Some substations connect only one transmission line and some about 11 transmission lines, as shown in Fig. 2. Each node represents a substation and each link represents a transmission line. There are 149 nodes and 219 links in this graph, considering the SCADA system as an undirected and unweighted graph.
Generally, the failures of substations connecting more transmission lines could have more negative effects on the reliability of the whole system, compared to substations connecting fewer transmission lines. Therefore, it is important to identify these substations. As shown in Fig. 3, the degree distribution⁶ of the SCADA system peaks at k = 2 and also has large values at k = 1 and 3, meaning that most substations connect fewer than three transmission lines. It should be noted that substations with k = 1 are boundary substations. The number of substations with k ≥ 6 is very small. The graph can be regarded as a scale-free network, as defined in [28,45]. The characteristic of such type of networks is that most nodes have small degrees but there is a finite possibility

⁶ The degree distribution P(k) represents the probability that a generic node in the network is connected to k other nodes.
Fig. 3. Degree distribution of the SCADA system for 220 kV/380 kV Swiss electric power transmission network.
of identifying nodes with intermediate and large degrees. The nodes with a higher value of degree play a specific role in the structure of the network [28]. It has been demonstrated in [46] that the removal of these nodes usually causes quite a rapid destruction of the structure of the network. Due to the importance of these nodes (substations), it is assumed that the substations with k ≥ 6 are considered as key substations.
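The degree-based screening described above can be reproduced with a few lines of standard-library Python. The edge list below is a small hypothetical graph for illustration, not the 149-substation Swiss network:

```python
from collections import Counter

# Hypothetical edge list of a small substation graph (illustrative only;
# the real network has 149 substations and 219 transmission lines).
edges = [
    ("S1", "S2"), ("S1", "S3"), ("S1", "S4"), ("S1", "S5"),
    ("S1", "S6"), ("S1", "S7"), ("S2", "S3"), ("S4", "S5"),
    ("S6", "S8"), ("S7", "S8"),
]

def node_degrees(edges):
    """Degree of each node in an undirected, unweighted graph."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def degree_distribution(deg):
    """P(k): fraction of nodes having degree k."""
    n = len(deg)
    hist = Counter(deg.values())
    return {k: c / n for k, c in sorted(hist.items())}

deg = node_degrees(edges)
key_substations = sorted(n for n, k in deg.items() if k >= 6)

print(degree_distribution(deg))  # {2: 0.875, 6: 0.125}
print(key_substations)           # ['S1']
```

In this toy graph only S1 reaches the k ≥ 6 threshold and would be flagged as a key substation.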
3.3. Identification of obvious vulnerabilities
The top three vulnerabilities of the SCADA system, based on the analysis of the RISI database, are (1) the lack of overall security awareness, (2) inadequately examined and maintained system administration mechanisms/software, and (3) inadequately designed ICS networks and insufficient security for remote accesses. All these vulnerabilities are related to the inadequacy of the system design, maintenance, and procedure, which could be handled by defining more comprehensive company policies or improving the security of the system. However, none of these vulnerabilities are caused by CI interdependencies. Therefore, it is still necessary to identify interdependency-related vulnerabilities. According to the degree distribution obtained by analyzing the SCADA system, nodes (substations) with intermediate and large degrees exist, which are also referred to as key substations. Based on the CN theory, the removal of these nodes causes quite a rapid destruction of the structure of the network, meaning that the failures of these key substations could significantly degrade the reliability of the system. Therefore, these key substations require more attention. Some SCADA components such as the FID and the FCD are installed at the location where interlinked systems (in this case, SUC and SCADA) overlap. Therefore, these components can also be regarded as interface components. Due to their specific installation location these interface components are more likely to be affected by the interdependencies and require more attention.
4. Step 3: In-depth analysis

After the preparatory phase and the screening analysis, a more sophisticated analysis has to be performed, calling for advanced modeling techniques to represent CI interdependencies. One of the main goals of this step is to create a novel approach for the identification and assessment of hidden vulnerabilities, which requires such an approach to be capable of representing the complexities of these interdependencies. In general, the development of this approach faces two major methodical challenges.
C. Nan et al. / Reliability Engineering and System Safety 113 (2013) 76–93
4.1. Challenges

The first challenge is to model even a single CI, owing to its inherent characteristics such as dynamic/nonlinear behavior and intricate rules of interaction, including interaction with the environment due to its openness and high degree of interconnectedness. These characteristics make the modeling and simulation of such a system highly challenging and call for methods capable of representing it, often multi-layered, as a whole and not as a sum of single parts. Classical approaches and methods based on decoupling and decomposition, such as fault and event trees, therefore reach the limit of their capacity [47,48]. Several approaches have been introduced and discussed in Section 2. Among these, CN theory is one of the most frequently used techniques for topological analysis, while ABM can be combined with other techniques such as Monte Carlo simulation and offers the possibility to include physical laws in the simulation and to let the behavior of the infrastructure emerge from the behavior of the individual agents and their interactions.
The second challenge appears when more than one CI, or several subsystems within one CI, must be considered and the interdependencies among them need to be tackled. It has proven necessary to integrate different types of modeling approaches into one simulation tool in order to fully utilize the benefits of each approach and optimize the efficiency of the overall simulation. One of the key challenges in developing such a simulation tool is the required ability to create multiple-domain models, e.g., discrete and continuous time models, time-based and frequency-based models, etc., and to exchange data among them effectively [49]. In practice, there is still no "silver bullet" approach.
To find a promising solution to these challenges and technical difficulties, a hybrid modeling/simulation approach is proposed and discussed in [50,51]. It combines various simulation/modeling techniques by adopting the technology of distributed simulation and the concept of modular design, for the purpose of exploring and assessing CI vulnerabilities due to interdependencies both qualitatively and quantitatively. While several simulation standards exist for supporting the implementation of a distributed simulation approach, the most widely implemented and applicable one is the High Level Architecture (HLA) simulation standard [52]. HLA is a general-purpose, high-level simulation framework that facilitates the interoperability of multiple types of models and simulations [20]. More details about the HLA standard can be found in [51]. HLA itself is an architecture, i.e., a simulation standard rather than a piece of software; the Run Time Infrastructure (RTI) is software and forms the core element of the HLA standard, providing common services to all participating federates. A real-time HLA-compliant simulation platform has been created and used to assess interdependency-related vulnerabilities between the SUC and its SCADA system. The platform consists of four major simulation components: the SUC model, the SCADA model, the RTI server, and the simulation monitor system, shown in Fig. 4. All these components are connected over a Local Area Network (LAN). The SUC model is a continuous-time, agent-based model, while the SCADA model is a discrete-event, agent-based model. The RTI server acts as the center of the simulation platform and is responsible for simulation synchronization and communication routing among all components through the local RTI interface of each model. The simulation monitor system is a real-time tool through which the simulation of the two models can be observed (see Refs. [50,51] for more details).

4.2. Modeling SCADA

Modeling a SCADA system is not just a challenge from a theoretical point of view but also of great practical importance [9].
Fig. 4. Architecture of the experimental simulation platform.
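The coordinating role the RTI server plays in Fig. 4, routing messages between heterogeneous models and granting synchronized time advances, can be illustrated with a toy sketch. This is a conceptual stand-in, not the HLA/RTI API; all class names, message fields, and the load formula are hypothetical:

```python
class MiniRTI:
    """Toy stand-in for the RTI server: routes messages between federates
    and advances all federates in lock step. Conceptual sketch only."""
    def __init__(self):
        self.federates = {}
        self.inbox = {}

    def join(self, name, federate):
        self.federates[name] = federate
        self.inbox[name] = []

    def send(self, target, message):
        self.inbox[target].append(message)

    def advance_all(self, to_time):
        # Grant the time advance to every federate, delivering pending messages.
        for name, fed in self.federates.items():
            pending, self.inbox[name] = self.inbox[name], []
            fed.step(to_time, pending)

class SUCFederate:
    """Continuous-time side: publishes a (made-up) line load every step."""
    def __init__(self, rti):
        self.rti = rti
    def step(self, t, messages):
        self.rti.send("SCADA", {"time": t, "line_load_mw": 100 + 10 * t})

class SCADAFederate:
    """Discrete-event side: raises an alarm when the load exceeds a threshold."""
    def __init__(self, threshold_mw=130):
        self.threshold_mw = threshold_mw
        self.alarms = []
    def step(self, t, messages):
        for m in messages:
            if m["line_load_mw"] > self.threshold_mw:
                self.alarms.append(m)

rti = MiniRTI()
scada = SCADAFederate()
rti.join("SUC", SUCFederate(rti))
rti.join("SCADA", scada)
for t in range(1, 6):          # lock-step time grants from the "RTI"
    rti.advance_all(t)
# The published loads at t=4 and t=5 (140 and 150 MW) exceed the threshold.
```

A real HLA federation replaces the lock-step loop with the standard's time management services, but the division of labor is the same: the models never talk to each other directly, only through the RTI.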
Nowadays, there are still only a few well-developed SCADA models available [53]. Siemens PTI has developed a high-performance network modeling package, PSS SINCAL, for the planning of electricity, gas, water, and district heating networks. It is a commercial product that integrates all the models into one compact system; however, it can hardly be used for research purposes due to its commercial background. A computer-based simulation of a SCADA system has been realized and configured by the Italian National Agency for New Technologies (ENEA) through a set of computer machines connected to a LAN [9]. Each machine simulates a specific functionality of a specific SCADA system. This functionally distributed network approach provides a solution for the interfacing issue related to connections between different system models. A simulation environment has been created by Nai Fovino et al., in which identified attacks can be simulated for the purpose of assessing the cyber security of a power plant [54]. In this simulation environment, a group of devices is used to simulate a SCADA system. Similar to ENEA's simulation system discussed above, a number of computers, servers, and switches are included to simulate different system functions. Although this is an applicable approach for investigating vulnerabilities and weaknesses of the overall system, both simulation environments require considerable resources to keep the system running. Their intrinsic complexity could significantly limit maintainability and problem-diagnosing ability in future development, and the modification of such SCADA models could become a very challenging task. It is therefore important to have a model that can integrate all components of a SCADA system into one platform. In addition, SCADA is an event-driven, service-oriented system.
The model representing such a system is normally coupled with models representing time-based systems such as an EPSS, which increases the overall modeling difficulty, since it is hard to include both types of models in one platform. These technical difficulties have motivated us to model the SCADA system using the ABM approach. Firstly, it is very difficult to model this type of (event-driven) system using graph-based modeling methods such as CN theory. Secondly, the ABM approach improves the flexibility of the modeling effort: system behaviors can be modified easily by tuning the parameters of the corresponding agents. Furthermore, the agent-based SCADA model is capable of simulating such a distributed control network by running multiple agents simultaneously without increasing the overall complexity of the model. However, developers need a deep understanding of system behaviors and must be able to determine all inputs and outputs of the system. Thorough knowledge of an object-oriented programming language, such as Java or C++, is also a "must-have" for developing each agent of the SCADA model. Validating agent-based models is also a challenging task due to the
Fig. 5. Overview of structure of the overall SCADA model.
fact that the complexity of the overall model can grow considerably when a great number of agents, their states, and many possible interactions are required [55,56]. The SCADA model is developed by integrating agents, objects, and a database into one platform, and is implemented using the simulation software tool AnyLogic 6.4. Fig. 5 shows the model structure: the technical components, i.e., FCD, FID, RTU, and MTU, as well as the non-technical component, the human operator, are all modeled as agents. The SCADA model also includes a number of objects7, represented by blue circles, i.e., command, alarm, and monitor. The main purpose of these objects is to transmit information/data between agents. The DB_SCADA is a Microsoft Access based database linked to the SCADA model to record traced sequential events during the simulation via its real-time Sequence of Events (SOE) table. The components at the substation level of the SCADA system are modeled using a failure-oriented modeling approach (Fig. 6). In this approach, the "agent state" is defined as a location of control with a particular set of reactions to conditions and/or events of its related agent. For example, open and close are two states defined for an FCD agent. The "device mode", comprising both operational modes and failure modes, is defined as the hardware status of the corresponding simulated hardware device. For example, failure-to-open and failure-to-close are two device modes defined for a field control device simulated by an FCD agent. Transitions between device modes can affect the corresponding agent states. It should be noted that all corresponding (functional) failure modes of a SCADA system have been defined during the development of adequate system understanding in Step 2 and are shown in Table 1. With the help of this modeling approach, technical failures of simulated devices of a SCADA system (e.g., FID, FCD, and RTU) can be easily determined and the corresponding failure propagations can be visualized and studied. The core of the device mode model is given by the state diagrams illustrated in Fig. 7, which reflect a continuous-time, discrete-state Markov model describing the failure behavior of a studied device with one operation mode (left) and two failure modes (right) (see Ref. [50] for more details). It should be noted that Fig. 7 is general and only for illustrative purposes; the number of failure modes is not limited to two. To apply the failure-oriented modeling approach to the SCADA system of the 220 kV/380 kV Swiss electric power transmission network, 588 agents are created to model the corresponding technical components, i.e., FCDs, FIDs, RTUs, and an MTU.8 According to [57], the major purpose of model validation is to verify whether or not the model is an accurate representation of the real-world system by comparing experimental results to
7 The difference between an agent and an object is that an agent can be regarded as a decision-making entity and an object is more or less a data structure consisting of data fields and methods. An agent can also be called an intelligent object.
8 It should be noted that in total 587 agents have also been created to model technical components (e.g., transmission lines, generators, etc.) of SUC of the 220 kV/380 kV Swiss electric power transmission.
Fig. 6. Failure-oriented modeling approach [50].
Fig. 7. State diagram of the device mode model (λ: constant failure rate; μ: constant repair rate) [50].
real-world data. However, such traditional verification methods are not always applicable to agent-based models. The validation should therefore mainly focus on whether or not the model is useful and convincing in its explanation of how a system possibly operates, or of the potential states of the system [57]. Several methods to validate agent-based models have been recommended by Nikolic et al. [57], i.e., history replay, expert consultation, literature validation, and model replication. The validation of the developed SCADA model is conducted by examining and testing the functionalities of various types of single agents, e.g., the FCD agent, the FID agent, etc. For example, the functionalities of an RTU agent include receiving data from FCD and FID agents, sending data to the MTU agent, forwarding generated alarms to the MTU agent, and interpreting commands sent by the MTU agent. To verify these functionalities, a set of experiments has been developed and the results have been recorded in the DB_SCADA database. Based on these results, the parameters of the corresponding agents can then be tuned in order to improve the overall accuracy of the SCADA model. It should be noted that experts who have worked on various SCADA systems were also invited to provide suggestions and advice during the development of the SCADA model. See [58] for more information about the development and validation of the SCADA model. The failure-oriented modeling approach can only be applied to model the technical components of the SCADA system; the non-technical component, i.e., the human operator, needs to be modeled using a different approach.

4.3. Modeling human operator

During the last decades, the human operator of infrastructure systems has become an essential element not just for maintaining daily operation, but also for the security and quality of the system.
For example, in a power supply system, a Transmission System Operator (TSO) is responsible for ensuring the safe and efficient transmission of electrical power from generation plants to regional or local electricity distribution operators. Generally, the responsibilities of a TSO include monitoring and processing generated alarms, switching off components located at remote substations, sending commands to remote substations, etc. Although the operator's responsibilities are mainly related to the system functionalities of monitoring and remote control, examining the reliability of the human operator remains crucial. As part of the MTU agent, the human operator can be modeled using the Human Reliability Analysis (HRA) approach. Human error is defined as "any member of a set of human actions or activities that exceeds some limit of acceptability, i.e. an out of tolerance action (or failure to act) where the limits of
performance are defined by the system" by Swain [59]. Human error has become a cause of great concern for the reliability of interactive technical systems, since most of these systems depend on interaction with operators in order to maintain appropriate function. Research work related to HRA is therefore important for safety engineers to evaluate human error probabilities and the uncertainties of the data concerning human factors [60]. Over the years, many HRA methods have been developed to assess human performance, especially human errors. Qualitative methods focus on the identification of events or errors, while quantitative methods focus on translating identified events/errors into a Human Error Probability (HEP) [61]. The Technique for Human Error Rate Prediction (THERP), the best known first-generation HRA method, is probably the most widely used technique to date [60]. It is basically a hybrid approach, as it models human errors using both dependence models and Performance Shaping Factors (PSFs); appropriate HEPs are selected from a list of around 100 factors for a nominal assessment [62]. The use of THERP has limitations for human performance analysis, since the method characterizes each operator action with a binary path (success or failure) and is highly judgmental, relying on the assessor's experience. Additionally, its representation of the influence of PSFs on human performance is quite poor [60,61]. The Cognitive Reliability and Error Analysis Method (CREAM) is one of the best known second-generation HRA methods and offers a practical approach to both performance analysis and error prediction [63]. This method presents a consistent error classification system integrating individual, technological, and organizational factors, and can be used both as a stand-alone method for accident analysis and as part of larger design methods for interactive systems.
In this method, human error is not considered stochastic, but shaped by different factors such as the context of the task, the physical/psychological situation of the human operator, the time of day, etc. One of the main features of this method is the integration of a useful cognitive model and framework that can be used in both retrospective and prospective analysis [64]. CREAM is capable of providing a final estimated HEP that can be used as part of an overall system analysis. As part of the SCADA model, the CREAM method is selected to model the human operator for several reasons. Firstly, it represents a second-generation HRA method with improved applicability and accuracy compared to most first-generation methods: it extends the traditional description of error modes beyond the binary categorization of success and failure and accounts explicitly for how the (performance) conditions affect performance. Secondly, it was originally developed from the Cognitive Control Model (COCOM)9 and also uses it to organize some of the categories describing possible causes of and effects on human action. Last but not least, CREAM can be used for performance prediction, since quantified results are provided as the final outcome. This capability in particular makes it possible to integrate the CREAM-based non-technical component model with the other, technical component models, which is a critical requirement for the development of the SCADA model. Applying the CREAM approach to develop the human operator model as part of the MTU agent of the SCADA model can be divided into five steps10: constructing the event sequence, determining the COCOM functions, identifying the most likely cognitive function failures,
9 COCOM models human performance as a set of control modes: strategic, tactical, opportunistic, and scrambled, and proposes a model of how transitions between these control modes occur. See [63] for more information.
10 It should be noted that the development of these steps is based on the working steps suggested in [61,63].
assessing the Common Performance Conditions (CPCs), and determining the final failure probability. CREAM uses CPCs, instead of PSFs, to determine sets of error modes and probable error causes. In total, nine CPCs, e.g., working conditions, available time, time of day, number of simultaneous goals, etc., proposed by Hollnagel [63], are adopted in this model development. One of the most challenging steps is Step 4, assessing the CPCs. The purpose of this step is to examine and assess the CPCs under which the analyzed task is performed. In order to simplify this assessment, the following assumptions are made:
- working conditions (in the control center) are compatible,
- adequacy of organization is efficient,
- availability of procedures/plans is acceptable,
- adequacy of training and preparation is adequate with high experience, and
- crew collaboration quality is efficient.
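The assessed CPC levels ultimately modulate a failure probability. The sketch below shows one hypothetical way such a combination could work: each CPC level maps to a multiplier on a nominal cognitive failure probability. Both the multipliers and the nominal value are illustrative placeholders, not Hollnagel's published CREAM tables:

```python
# Hypothetical CPC weighting scheme: each assessed CPC level maps to a
# multiplier on a nominal cognitive failure probability. All numbers below
# are illustrative placeholders, NOT the published CREAM weights.
CPC_MULTIPLIERS = {
    "time_of_day":        {"day": 1.0, "night": 1.2},
    "simultaneous_goals": {"fewer": 1.0, "matching": 1.0, "more": 2.0},
    "mmi_support":        {"supportive": 0.8, "adequate": 1.0, "inappropriate": 2.0},
    "available_time":     {"adequate": 0.5, "temporarily_inadequate": 1.0,
                           "continuously_inadequate": 5.0},
}

NOMINAL_CFP = 0.01   # illustrative nominal failure probability

def assess_hep(cpc_levels, nominal=NOMINAL_CFP):
    """Combine the CPC multipliers with the nominal probability; clip to 1."""
    hep = nominal
    for cpc, level in cpc_levels.items():
        hep *= CPC_MULTIPLIERS[cpc][level]
    return min(hep, 1.0)

hep = assess_hep({
    "time_of_day": "night",
    "simultaneous_goals": "more",
    "mmi_support": "adequate",
    "available_time": "temporarily_inadequate",
})
# 0.01 * 1.2 * 2.0 * 1.0 * 1.0 = 0.024
```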
Based on the above assumptions, five of the nine CPCs are assigned a fixed CPC level, while the level of each remaining CPC (adequacy of Man-Machine Interface (MMI) and operational support, time of day, number of simultaneous goals, and available time) is updated depending on the actual performance conditions. For example, the level of "time of day" can be "day time" or "night time", depending on the time at which the analyzed task is performed. Compared to the other three updatable CPCs, the CPC "available time" is more difficult to assess quantitatively, for the following reasons. First, it is difficult to set a numerical threshold by which the corresponding level can be decided. Second, the assessment depends on knowledge and experience related to the specific task. Furthermore, many other issues can directly affect the assessment of "available time"; for example, both the number of current simultaneous tasks and the time left for the operator to handle a task can have a significant influence. In order to assess this CPC, a knowledge-based approach using fuzzy logic theory is proposed and developed as part of the human operator model. One of the advantages of fuzzy logic is its ability to accommodate the ambiguities of real-world human language and logic through its inference techniques. Fuzzy inference systems (FISs), which are built on fuzzy logic theory, have been successfully applied in various research and industrial fields such as automatic control, data classification, expert systems, and decision analysis [65]. Unlike conventional mathematical systems, a FIS deals with classes with unsharp boundaries, where the output is a matter of degree. It primarily addresses linguistic vagueness through its ability to allow an element to be a partial member of a set, so that its membership value can lie between 0 and 1 [66].
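A fuzzy assessment of "available time" can be sketched as follows. The membership function shapes, the units (minutes, task counts), the two rules, and the 0.6 cut-off are all illustrative assumptions, not the calibrated expert knowledge used in the actual model:

```python
def tri(x, a, b, c):
    """Triangular membership function with support (a, c) and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

# Hypothetical memberships for the two inputs:
# minutes left to react, and number of simultaneous tasks.
def time_short(t):  return tri(t, -1, 0, 10)
def time_ample(t):  return tri(t, 5, 15, 100)
def tasks_few(n):   return tri(n, -1, 0, 4)
def tasks_many(n):  return tri(n, 2, 6, 100)

def available_time_level(minutes_left, n_tasks):
    """Tiny rule base mapping the two inputs to one of the three CREAM
    'available time' levels. Rules and thresholds are illustrative."""
    r_adequate   = min(time_ample(minutes_left), tasks_few(n_tasks))   # R1: ample AND few
    r_inadequate = max(time_short(minutes_left), tasks_many(n_tasks))  # R2: short OR many
    if r_inadequate > 0.6:
        return "continuously_inadequate"
    if r_adequate >= r_inadequate:
        return "adequate"
    return "temporarily_inadequate"
```

With these shapes, twenty minutes and a single task yield "adequate", while two minutes and seven parallel tasks yield "continuously_inadequate"; intermediate situations fall into the middle level.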
One advantage of integrating the FIS approach into HRA is its capability of providing a fundamentally simple way to handle complex problems without becoming exceedingly complex itself. It is straightforward, flexible, and easy to develop. However, it is a data-driven approach, meaning that the accuracy of the output depends on the quality of expert knowledge and experience. The membership functions, as well as the developed rules, therefore need to be carefully calibrated. The validation of the human operator model is conducted by performing test runs under specific assumptions, e.g., the number of simultaneous tasks, the time left for the operator to handle an alarm, etc., and examining the plausibility of the results (see [58] for more information). This is the first effort to implement a human operator performance model that assesses the CPCs dynamically using the ABM approach. During the simulation, if there is a request for the operator to handle an alarm, the CPCs are assessed according to the current simulation environment, e.g., time of day, simultaneous
goals, etc., and the corresponding HEP11 is then calculated as an input to the MTU agent. However, only four CPCs are assessed, while five CPCs are assumed to be fixed without further assessment due to limited data sources, which affects the output accuracy of this model.

4.4. Validating the hybrid modeling/simulation approach

To demonstrate the capabilities of the hybrid modeling/simulation approach, as well as of the simulation platform for representing interdependencies within and among CIs, several experiments have been designed and developed, including feasibility and failure propagation experiments.

4.4.1. Feasibility experiment

The purpose of this experiment is to study the feasibility of the HLA-compliant distributed simulation environment as an approach to simulate interdependencies. Both the SCADA and the SUC model are used in this experiment. In order to visualize the interdependency phenomena between SCADA and SUC, scenarios that trigger a power line overload alarm are generated manually during the simulation. Generally, the maximum load each power transmission line can carry has been determined beforehand by its vendor and is called the overload threshold. If the real power flowing through a transmission line exceeds its overload threshold, the line is considered overloaded. An accidentally overloaded transmission line can cause a system collapse (partial or even complete blackouts); therefore, suitable corrective actions should be taken in order to relieve overloaded transmission lines. Normally, whenever a monitored transmission line is overloaded, an alarm is generated and sent to the operator in the control center (MTU) by the RTU of the SCADA system. If, after a certain period, the operator fails to react to the overload alarm, protection devices such as disconnectors (an example of an FCD) automatically disconnect the overloaded transmission line to minimize the negative consequences of the problem.
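The alarm-handling sequence just described (operator first, protection device as fallback) can be condensed into a short sketch; the function, its arguments, and the outcome labels are hypothetical simplifications of the simulated procedure:

```python
def handle_overload(load_mw, threshold_mw, operator_reacts, protection_ok):
    """Simplified overload-handling sequence: an alarm is raised when the
    load exceeds the threshold; the operator gets the first chance to
    react, and the protection device (FCD) is the fallback."""
    if load_mw <= threshold_mw:
        return "no_alarm"
    if operator_reacts:
        return "operator_redispatches"      # corrective action taken in time
    if protection_ok:
        return "line_disconnected_by_fcd"   # disconnector trips the line
    return "unhandled_overload"             # risk of cascading failure

# The three case-study outcomes of the feasibility experiment
# (450 MW on a line with a hypothetical 400 MW threshold):
case1 = handle_overload(450, 400, operator_reacts=False, protection_ok=False)
case2 = handle_overload(450, 400, operator_reacts=True,  protection_ok=True)
case3 = handle_overload(450, 400, operator_reacts=False, protection_ok=True)
```

Only the first case leaves the overload unhandled, which is precisely the simplifying assumption made for the case studies: the alarm is mishandled only when both the operator and the protection device fail.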
It should be noted that the procedure for handling a power line overload alarm is complicated and that other factors should also be considered. In order to simplify the problem, it is assumed that the overload alarm fails to be handled correctly only if the operator fails to react to the alarm in time and the protection device fails to trigger. Three case study scenarios are developed by modifying the parameters of the corresponding agents in order to observe three different outcomes after the occurrence of the transmission line overload: (1) neither the operator nor the protection device reacts to the alarm, (2) the operator reacts to the alarm, and (3) the protection device is triggered after the operator fails to react. The simulation results from the three case studies, presented in [51], show that the propagation of cascading failures between infrastructure systems due to interdependencies can be simulated and visualized with the help of the experimental platform. Although the models are distributed, the overall simulation performance is not affected and the interconnections between the models can still be handled efficiently (see [51] for more information).

4.4.2. Failure propagation experiment

Failures occurring in subsystem(s) of one CI can propagate into other subsystem(s) within the same CI, or even into other CIs, due to the existence of interdependencies. To investigate this phenomenon and related issues, an experiment focusing mainly on the consequences of failure propagation between the two systems under study has been developed and conducted. In this experiment, a number of tests are conducted by triggering
11 The calculated HEP value in this case is between 0.0014 and 0.672.
single technical failures or even multiple technical failures in order to observe and study the subsequent events due to failure propagation. For example, in a single technical failure test, which mainly concerns the investigation of physical interdependency, the FID agent represents a power flow transducer (PTi) measuring the power flow (in MW) transmitted over a selected transmission line that is represented by the SUC model. It is assumed that the PTi is calibrated incorrectly due to aging. A list of sequential events after the incorrect modification of the PTi's calibration value is recorded in the SOE table of the DB_SCADA database during the simulation. As learned by studying this table, at a certain time the PTi's calibration value is modified incorrectly. As a consequence, the output of the PTi is higher than the actual measured value. Based on this wrong value, the RTU generates a spurious overload alarm and sends it to the MTU, causing the operator in the control room to make a wrong decision, i.e., to redistribute the power flow of a transmission line. As a result, the amount of power transmitted over this line decreases, although it should not. The measured variable from the PTi, as part of the SUC, acts as a physical input into the SCADA system. This relationship can be considered a physical interdependency, which causes the failure of the PTi to propagate from the SUC into the SCADA system and back into the SUC (see [48] for more information). Based on the results of both the feasibility and the failure propagation experiments, it can be concluded that three types of interdependencies can be simulated using the current experimental simulation platform: physical, cyber, and geographical interdependency.12
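The miscalibration chain described above can be sketched in a few lines; the flow values, the calibration drift, and the threshold are hypothetical illustrations:

```python
def pt_output(true_flow_mw, calibration_factor=1.0):
    """Power transducer reading; a drifted calibration factor biases it."""
    return true_flow_mw * calibration_factor

def rtu_check(reading_mw, overload_threshold_mw=400.0):
    """The RTU raises an overload alarm based on the (possibly wrong) reading."""
    return "overload_alarm" if reading_mw > overload_threshold_mw else "ok"

true_flow = 350.0                                # the line is actually below its limit
good = rtu_check(pt_output(true_flow, 1.0))      # correct reading: no alarm
bad  = rtu_check(pt_output(true_flow, 1.25))     # 437.5 MW reading: spurious alarm
# The spurious alarm leads the operator to redispatch a healthy line: a failure
# originating in the SUC (aging transducer) propagates through the SCADA
# system and acts back on the SUC.
```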
5. In-depth experiments

Three in-depth experiments are developed for the identification and assessment of hidden vulnerabilities due to interdependencies between the SCADA system and the SUC, all performed on the HLA-compliant simulation platform:

1. substation level single failure mode experiment,
2. small network level single failure mode experiment, and
3. whole network worst-case failure mode experiment.

It should be noted that Experiment 1 has been introduced and its results discussed in detail in [50]. Therefore, this experiment is only briefly introduced in this section.
5.1. Brief introduction of Experiment 1

In the first experiment, different failure modes of each substation-level component are evaluated by performing a number of tests for each failure mode. One substation from the reference SCADA system, including two transmission lines, is randomly selected. During each test, the scenarios that trigger a power line overload alarm are loaded at the beginning of the simulation. Each test starts in the operation mode (a device mode) and one of the agent states. Within a given time period, the device mode of the respective component transitions to one failure mode; the transition time from the operation mode to this failure mode is assumed to be exponentially distributed with constant failure rate λ. After a given time period, the device mode returns to the operation mode; the transition time from the failure mode back to the operation mode is assumed to be exponentially distributed with repair rate μ. It should be noted that all reliability data used in this experiment and the other two experiments, such as the failure rate and repair rate defined for each failure mode, are adapted from [67]. The transitions between different device modes influence the corresponding agent states, resulting in changed behaviors of the SCADA system and the SUC. If there is a request for the human operator in the MTU to make a decision, the model of the human operator is activated and the corresponding HEP is calculated according to the current situation. All events occurring during each test are recorded in the SOE table of the DB_SCADA database. It should be noted that only one failure mode is assumed during each test. The simulation period of each test is set to 3 days, based on several trial tests conducted before starting the experiment.13 According to the conclusions of this experiment, among all the simulated SCADA-related devices, failures of the RTU device have the most significant negative effects on the interconnected SUC (see [50] for more information).

5.2. Experiment 2: Small network level single failure mode experiment

This experiment extends the scope of the first experiment to a small network including more components of the SCADA system and the SUC (40 substations and 50 transmission lines). The aim of this experiment is to identify the failure modes that cause the most negative effects due to interdependencies between the two studied systems. It should be noted that this experiment is based on the assumption that all substations are homogeneous, meaning that the structure and devices of each substation are identical. In this experiment, one key substation from the SUC model is selected for triggering the failure modes of substation-level components during the simulation. For each single failure mode, two types of tests are performed: normal and worst-case. The modeling scenarios are summarized below.

5.2.1. Normal test

The modeling scenarios of this test are similar to those of the tests in Experiment 1. However, compared to the first experiment, the transition from the operation mode to the respective failure mode at the beginning of each test is triggered manually instead of within a given time based on the failure rate.14 The purpose of this adjustment is to ensure that the transition time from the operation mode to each failure mode is the same. It should be noted that the transition from each failure mode back to the operation mode still depends on the repair rate. The simulation period of the normal test is set to 5 days.15

5.2.2. Worst-case test

This test represents the worst-case situation in which the operator is unable to handle any alarm received by the control center due to natural or technical failures (hazards), e.g., the failure of the control panel or flooding/fire in the control center. The purpose of performing experimental tests under this situation is to observe the corresponding consequences if the SCADA system fails to monitor and control the SUC through the MTU. It is assumed that this worst-case situation returns to normal after 1 day, and the simulation period of this test is accordingly set to 1 day. The following parameters are proposed to analyze the results.

12 Logic interdependency is not considered during these experiments.
13 Results from these trial tests show that after 3 days, both the SCADA system and the SUC become stable and no abnormal events are observed.
14 This experiment is considered a semi-quantitative experiment due to this adjustment.
15 The simulation time of 5 days is based on several trial tests conducted before starting the experiment. Results obtained from these tests show that after 5 days (simulation time), both the SCADA system and the SUC become stable and no abnormal events are observed.
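The exponential device-mode transitions used in these tests (time-to-failure with rate λ, time-to-repair with rate μ) can be sampled as follows. The rates and the horizon are illustrative values, not the reliability data adapted from [67]:

```python
import random

def sample_device_history(failure_rate, repair_rate, horizon_h, rng):
    """Alternate exponentially distributed up/down intervals, as in the
    device-mode model: time-to-failure ~ Exp(lambda), time-to-repair ~ Exp(mu).
    Returns a list of (time_h, event) tuples within the horizon."""
    t, up, events = 0.0, True, []
    while t < horizon_h:
        rate = failure_rate if up else repair_rate
        t += rng.expovariate(rate)          # next exponential interval
        if t < horizon_h:
            events.append((round(t, 1), "fail" if up else "repair"))
        up = not up
    return events

rng = random.Random(42)   # fixed seed for a reproducible trace
# Illustrative rates (per hour) over a 3-day (72 h) test run:
history = sample_device_history(failure_rate=0.05, repair_rate=0.5,
                                horizon_h=72.0, rng=rng)
```

Each trace starts in the operation mode, so the first event is always a failure, and failure and repair events strictly alternate, mirroring the state diagram of Fig. 7.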
5.2.3. Average substation service availability index (ASSAI)

This parameter represents the ratio of the total number of hours that service is provided by all available substations during a given time period to the total hours demanded (Eq. (1)). The parameter is adapted from the IEEE parameter Average Service Availability Index (ASAI), which represents the ratio of the total number of customer hours that service is available during a given time period to the total customer hours demanded.

ASSAI = [ N_S × (number of hours) − Σ_{i=1}^{N_S} R_i ] / [ N_S × (number of hours) ]    (1)

where R_i is the restoration time of the ith substation (if a service interruption exists) and N_S is the total number of substations.
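Eq. (1) is straightforward to implement; the following sketch uses our own function name and example values, not data from the paper:

```python
def assai(restoration_hours, n_substations, period_hours):
    """Eq. (1): ratio of substation-hours actually served to
    substation-hours demanded over the observation period."""
    demanded = n_substations * period_hours       # N_S x (number of hours)
    served = demanded - sum(restoration_hours)    # subtract restoration times R_i
    return served / demanded

# Hypothetical example: 10 substations observed for 24 h,
# one of them out of service for 6 h.
# assai([6.0], 10, 24.0) -> 0.975
```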
5.2.4. Degree of Impact (DI)

The purpose of this parameter is to quantify the negative effects caused by each failure mode, using three other parameters obtained during each test as indicators: ASSAI, the number of affected SCADA components (interdependency failures), and the number of affected SUC components (dependency failures). Each of these three indicators receives a value between 1 and 5 according to its real value, as shown in Table 2. In addition, a weighting factor (W_i) is defined for each indicator, reflecting its importance in calculating the degree of impact. As shown in Table 3, the weighting factor for the indicator ASSAI is at least as high as for the other indicators, since this parameter plays a more important role in the quantification of negative effects. The degree of impact caused by each failure mode is obtained according to Eq. (2), and the DI triggered by different technical failures is categorized into five levels, as shown in Table 4. It should be noted that all the values set up for the indicators and weighting factors are based on the authors' knowledge and experience.

Table 2
Parameters as three indicators.

I1 (ASSAI)             I2 (no. of affected SCADA components)   I3 (no. of affected SUC components)
Score  Real value      Score  Real value                       Score  Real value
1      = 1             1      0                                1      0
2      (0.999, 1)      2      (0, 1)                           2      (0, 4)
3      (0.99, 0.999)   3      (1, 2)                           3      (4, 8)
4      (0.94, 0.99)    4      (2, 3)                           4      (8, 12)
5      < 0.94          5      > 3                              5      > 12

Table 3
Weighting factors.

W_i (weighting factor)                        Real value
W_1 (for indicator ASSAI)                     4
W_2 (for indicator interdependent failures)   4
W_3 (for indicator dependent failures)        2

Table 4
Categories of DI.

Level of DI    Scope of DI
Very weak      = 10
Weak           (10, 20)
Middle         (20, 30)
Strong         (30, 40)
Very strong    ≥ 40

DI = Σ_{i=1}^{N} W_i I_i    (2)
where N is the number of indicators.

5.3. Experiment results

The analyzed test results from this experiment are summarized in Fig. 8. During the normal tests, the consequences caused by the FCD FC mode and the FCD SO mode are similar, although the causes of these two failure modes differ. The results from the FID FRL tests are close to those from the FID FRH tests in terms of DI value and ASSAI, although the FRL mode extends the period until the appearance of the first overload alarm while the FRH mode shortens it.

Fig. 9 shows how components of the SCADA system and the SUC are affected due to their interdependencies, based on the results from one of the FID FRH normal tests. In this test, the calibration value of the FID device (RTU #33) of the studied transmission line (line #127) was modified to a higher number. Therefore, the line was reported as overloaded, although it should not have been. As a consequence, two further transmission lines, controlled by the same RTU as line #127, also became overloaded. In this case, the technical failure (FRH failure mode) was triggered at a component (an FID device of an RTU) of the SCADA system. This failure then propagated to the SUC, affecting three of its components (transmission lines 127, 66, and 194). In this test, only dependency-related failure propagation was observed.

During the RTU FRF and FRW tests, the RTU device loses its connection to the field level devices and becomes blind, which causes further negative events, i.e., loss of alarms and an extended period of line disconnection. The RTU FRC mode is triggered when communication issues appear between the MTU and the RTU; as a result, the RTU device has difficulties interpreting commands sent by the MTU. The consequence of this failure mode is mainly a short disconnection of the corresponding transmission line. In this case, the service availability of the system is not affected significantly, according to the value of ASSAI.
During the worst-case tests, the cause of continuous transmission line disconnections is the combination of the lack of responses from operators in the control center and the failure of the hardware located in the substation, meaning that the hardware failure alone is not sufficient to affect the system service availability significantly. Compared to the results from the FCD FC tests, the FCD SO tests show more negative effects according to the average number of alarms and the average ASSAI, since the FCD FC mode is not capable of triggering the overload of the studied transmission line, whereas the FCD SO failure mode triggers a spurious overload alarm although it should not. Results collected from the FID FRH tests are similar to the results of the FCD SO tests: in both tests, the threshold of overload alarms is affected (modified), which is why the ASSAIs calculated in the two tests are very close (0.9357 for the FCD SO tests and 0.9358 for the FID FRH tests).

Fig. 10 shows how components of the SCADA system and the SUC are affected due to their interdependencies, based on the results from one of the FID FRH worst-case tests. Compared to the results from the FID FRH normal test (Fig. 9), about eight transmission lines were affected. It should be noted that the substations controlling these affected lines are closely located, which is why the triggered technical failure propagates easily from one to another. During this test, not only were SUC components affected; several SCADA components were also affected (power loss of several RTU devices). Therefore, interdependency-related failure propagation is observed. Results from all RTU tests indicate that the ASSAIs from these tests are smaller compared to the tests of the other two devices
Fig. 8. Summary of the small network level experiment (x-axis denotes different failure modes (defined in Table 1), left y-axis denotes degree of impact, and right y-axis denotes ASSAI).
Fig. 9. Affected components due to dependency between SCADA and SUC according to results from the FID FRH normal test.
(FID and FCD). However, this does not mean that the negative effects of RTU device failures on the system service availability are much less significant. Due to the disconnection between the RTU and its field level devices, several overload alarms are lost and, therefore, failures triggered by the sudden disconnection of the transmission line are unable to propagate. As shown in Fig. 8, on average, negative effects due to interdependencies are aggravated during the worst-case tests, in which very strong DIs are observed. The average DI from the FID FRH tests is 44, meaning its impact is very strong. The DIs from all RTU failure mode tests lie between middle and weak, which seems less serious than for the FID and FCD. Nevertheless, as the failure of this device means an interruption of the service provided by the SCADA substation level components, it is still worthwhile to develop further tests on this device, as well as on the FID device, in the next experiment.
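For illustration, the DI scoring scheme of Section 5.2.4 can be sketched in code. The weights and bin boundaries follow Tables 2 to 4; treating the interval endpoints as inclusive upper bounds, as well as the function interfaces, are our assumptions:

```python
# Weights from Table 3: W1 (ASSAI), W2 (affected SCADA), W3 (affected SUC).
WEIGHTS = (4, 4, 2)

def score_assai(value):
    """Indicator I1 from Table 2."""
    if value == 1:     return 1
    if value > 0.999:  return 2
    if value > 0.99:   return 3
    if value > 0.94:   return 4
    return 5

def score_count(n, band):
    """Indicators I2/I3 from Table 2 (band = 1 for SCADA, 4 for SUC);
    interval endpoints are treated here as inclusive upper bounds."""
    if n == 0:         return 1
    if n <= band:      return 2
    if n <= 2 * band:  return 3
    if n <= 3 * band:  return 4
    return 5

def degree_of_impact(assai_value, n_scada_affected, n_suc_affected):
    """Eq. (2): DI = sum_i W_i * I_i."""
    indicators = (score_assai(assai_value),
                  score_count(n_scada_affected, 1),
                  score_count(n_suc_affected, 4))
    return sum(w * s for w, s in zip(WEIGHTS, indicators))

def di_level(di):
    """Table 4 categories (boundary handling is our interpretation)."""
    if di <= 10: return "very weak"
    if di < 20:  return "weak"
    if di < 30:  return "middle"
    if di < 40:  return "strong"
    return "very strong"
```

With the weights of Table 3 summing to 10, the minimum DI (all indicators equal to 1) is exactly 10, matching the "very weak" boundary of Table 4, and a DI of 44, as reported for the FID FRH tests, falls into "very strong".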
5.4. Experiment 3: Whole network worst-case failure modes experiment

This experiment extends the scope of the previous experiment to the whole network, including all simulated components of the SCADA system and the SUC, so that negative consequences caused by interdependencies can be observed and analyzed. In this experiment, instead of just considering single failures, double failures occurring simultaneously at different substations are also included. The same modeling scenarios defined in the worst-case tests of the previous experiment are applied; in addition, two key substations and two non-key substations are selected as exemplary substations. The experiment mainly focuses on two failure modes, i.e., FID FRH and RTU FRW, based on the results from the last experiment. Two parameters, i.e., DI and ASSAI, developed in the
Fig. 10. Affected components due to dependency between SCADA and SUC according to results from the FID FRH worst-case test.
previous experiment, are also used to analyze the test results of this experiment. In total, eight different types of tests (single failure and double failure tests) are developed, as listed in Table 5 and summarized below:

Table 5
List of eight types of tests in the whole network experiment.

Test no.  Failure mode  Failure type  Substation type
1         FID FRH       Single        Key substation
2         FID FRH       Single        Non-key substation
3         RTU FRW       Single        Key substation
4         RTU FRW       Single        Non-key substation
5         FID FRH       Double        Key substations
6         FID FRH       Double        Non-key substations
7         RTU FRW       Double        Key substations
8         RTU FRW       Double        Non-key substations

Single failure tests: in each single FID failure mode test, the studied transmission line (in a key substation and a non-key substation, respectively) becomes overloaded at first. It is assumed that the FID device for this line malfunctions, meaning that the measured value is higher than it is supposed to be (recall the definition of the FID FRH mode), triggering a spurious overload alarm to be sent to the control center. However, no redistribution command is sent to the RTU devices of the corresponding substations. After a certain time, the studied transmission line is disconnected by its own FCD for safety reasons. In each single RTU failure mode test, it is assumed that the RTU device suffers a (hardware) malfunction and becomes blind to its field level devices, i.e., FID(s) and FCD(s). The affected line then remains disconnected, since no update can be sent by the RTU device. All of these tests are conducted about 10 times, and the simulation time for each test is 1 day.

Double failure tests: in these tests, two independent simultaneous failures of the same type of devices are considered. Instead of one transmission line, two transmission lines from two substations (two key substations or two non-key substations) are included in these tests.
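The test campaign of Table 5 is the cross product of failure mode, failure type, and substation type; it can be enumerated programmatically, for instance when scripting the simulation runs. The naming below is our own sketch:

```python
from itertools import product

# Test campaign of Table 5: failure mode x failure type x substation type.
FAILURE_MODES = ("FID FRH", "RTU FRW")
FAILURE_TYPES = ("single", "double")
SUBSTATION_TYPES = ("key", "non-key")

def test_matrix():
    """Enumerate the eight whole-network test types in Table 5 order:
    single tests 1-4 first, then double tests 5-8."""
    return [
        {"failure_mode": mode, "failure_type": ftype, "substation_type": subst}
        for ftype, mode, subst in product(FAILURE_TYPES, FAILURE_MODES,
                                          SUBSTATION_TYPES)
    ]
```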
5.5. Experiment results

The test results of this experiment are summarized in Fig. 11. In the first four tests (single failure tests), the average DI from the FID single failure tests (key substation) shows the highest value, as both SUC and SCADA components are affected by this technical failure. As shown in Fig. 12, at time 10 h, one transmission line (a component of the SUC) was disconnected due to the wrong overload alarm caused by the technical failure of its FID and the absence of operator action. At time 13.2 h, the number of failed SUC components (disconnected lines) reached its maximum value; this number then started to drop until only one line remained disconnected. At time 16.5 h, SCADA components (RTUs) were also affected, with one interrupted RTU device; the maximum number of affected SCADA components, two, was reached at time 16.8 h. After that, the numbers of both affected SCADA and SUC components started to drop and returned to zero at time 18 h. In this test, the total number of affected SUC components is 18, meaning that 18 transmission lines became overloaded, while the total number of affected SCADA components is three. As observed from this test, failures of SUC components do not seem to affect the interconnected SCADA system instantly: it took about 6 h before failures started to propagate from one system to the other, which can be considered the delay of dependency failures. Results from the RTU single failure tests (both key substation and non-key substation) show that the failure of the RTU device does not cause significant cascading effects in the SUC and is not able to propagate back into the SCADA system to affect more components.
Fig. 11. Summary of the whole network experiment (x-axis denotes different failure modes (defined in Table 1), left y-axis denotes degree of impact, and right y-axis denotes ASSAI).
Fig. 12. Affected SUC and SCADA components in one of FID single failure tests (key substation).
The results from the double failure tests are similar to the results from the previous single failure tests. The DI from the test of double FID failures (in key substations) is very strong, the highest compared to the same parameter from the other three tests. Fig. 13 illustrates the results collected from one of the FID double failure (key substations) tests. Compared to the results from the FID single failure tests (Fig. 12), more SUC and SCADA components are affected, since FID technical failures are triggered in two key substations.
The maximum number of simultaneously affected SUC components (transmission lines) is eight, only one more than in the single FID technical failure test, while the total number of affected SUC components increases significantly. The delay of dependency failures in this case is about 5 h, indicating that it took less time before failures started to propagate from one system to the other. Furthermore, both the SUC and the SCADA system became less resilient,
Fig. 13. Affected SUC and SCADA components in one of FID double failure tests (key substations).
as indicated by the increase of the ''back to normal time'' (8 h in the single failure tests and 12 h in the double failure tests). Both the double FID and the double RTU failure tests on the non-key substations show much less negative consequences: the DIs of these two tests are weak, and no SUC or SCADA components are affected.
6. Step 4: Results assessment

After analyzing the simulation results from the three in-depth experiments, hidden vulnerabilities caused by interdependencies between the SCADA system and the SUC are summarized as follows.

6.1. Importance of field level devices should not be underestimated

According to [68], field level devices such as the FID and FCD can be regarded as interface devices connecting a SCADA system to its controlled/monitored physical processes (SUC). In general, most past research, especially modeling efforts related to SCADA systems, focuses on RTU devices, which belong to the substation level in a standard SCADA system, and underestimates the role of field level devices [9,69,70]. However, as shown by the worst-case tests of the in-depth experiments, negative consequences caused by failures of field level devices can also be significant. In the small network level single failure mode experiment, results from the normal case tests suggest that the negative consequences caused by failures of the RTU device are the most significant (highest degree of impact and lowest ASSAI). However, if the operator is assumed to be unable to handle any alarm (worst-case scenario), the consequences caused by failures of field level devices become worse than those caused by failures of RTU devices. The simulation results from the three worst-case single failure mode tests related to field level devices (FCD FC, FCD SO, and FID FRH) in this experiment show a very strong degree of impact and a smaller ASSAI. This phenomenon is also observed in the whole network worst-case failure modes experiment: there, the ASSAI obtained from the key substation single FID failure test is 0.991, while it is 0.9996 in the key substation single
RTU failure test. In the same experiment, the degree of impact caused by the single FID failure is strong, while it is middle in the RTU failure test. Furthermore, the propagation of failures between the SCADA system and the SUC is also observed during the FID worst-case tests, but not during the RTU worst-case tests. One explanation for this is that the RTU device loses its connection to its field devices during the FRF and FRW tests; as a result, the RTU device is unable to handle any alarm sent by its field level devices, which is also why the results observed from the normal and worst-case RTU tests are similar. Therefore, although the results from the worst-case tests show that the negative consequences caused by field devices are more significant than those caused by RTU devices, RTU devices are as important as field level devices.

6.2. A predictable delay of dependency failures is important

As observed in the FID single key substation failure tests and the FID double key substation failure tests, the propagation of failures across interlinked systems needs a certain time and does not start instantly (delay of dependency failures). For example, this delay is about 6 h in the FID single key substation tests and about 5 h in the FID double key substation tests. Based on these two tests, the delay of dependency failures appears to be inversely related to the degree of impact and directly related to ASSAI, meaning that worse consequences come with a shorter delay period. This period is very important for minimizing the negative effects caused by interdependencies: if failures can be stopped from cascading within this period, it is possible to avoid the propagation of failures into the other system.

6.3. Negative consequences caused by failures of devices in key substations are significant

The whole network worst-case experiment also demonstrates the importance of the key substations of the SCADA system, since increasing the number of failed key substations and of failed non-key substations shows very different results. In this experiment, negative consequences caused by device failures become more significant as the number of failed key substations increases. For example, the ASSAI value calculated in the FID single key substation failure tests is 0.991, while the value drops to 0.9776 when failures of two key substations are triggered (the degree of impact
increases after increasing the number of failures of key substations in this case). This phenomenon is observed during the RTU failure tests as well. However, increasing the number of failures of non-key substations seems to cause no significant additional negative effects: the degree of impact remains the same after triggering failures of two non-key substations in both the FID and the RTU non-key substation tests. Therefore, the reliability of the key substations of the SCADA system is important for the whole system.

6.4. The role of the human operator in the control center is important

The human operator in the control center plays a very important role in the SCADA system. Although the lack of responses from human operators might not be the cause of failures of substation level devices, the negative consequences caused by the failures of these devices can worsen significantly without them. As demonstrated in all the experiments, if the human operator is able to respond to overload alarms adequately and send commands to the corresponding RTUs for further corrective actions, failures cannot propagate and negative consequences can be significantly minimized. In the second experiment, for instance, the ASSAI value calculated in the FCD FC normal case tests is 0.9996, while this value drops to 0.9604 in the worst-case test of the same failure mode; the degree of impact also changes from weak to very strong. Maintaining the normal functionality of the human operator in the control center, as well as the required hardware of the control center, can minimize the negative consequences caused by failures of substation level devices and even stop the propagation of those failures. Therefore, the absence of the human operator has to be strictly avoided.
7. Step 5: Potential technical improvements

Although the propagation of cascading failures due to interdependencies cannot be completely prevented, the following improvements are suggested, which could be useful to minimize the negative effects caused by this type of failure and to increase the coping capacity of the SCADA system and the SUC.

7.1. Increasing the reliability of field level devices

Several measures can be recommended. (1) Increasing redundancy. As mentioned in the last step, it is very important to maintain the normal functionality of field level devices such as instrumentation devices and control devices due to their specific installation locations (where the interlinked systems overlap). Installing more redundant devices could be one option to reduce the probability of device failures. (2) Implementation of diversity. Redundant devices could also fail simultaneously due to common causes such as human errors, lack of maintenance, design inadequacy, etc. Diversity can be used for protection against these so-called Common Cause Failures (CCFs); for instance, field level devices from different vendors could be used as redundant devices. (3) Implementation of self-diagnosis. For the field level devices installed at key substations, it is worthwhile to implement more sophisticated and advanced techniques in order to reduce the probability of CCFs, e.g., self-diagnosis techniques. For example, a real-time monitoring system can be installed to diagnose the current operation status of instrumentation devices by monitoring the outputs of redundant instrumentation devices in real time. If these outputs vary
significantly, then at least one of the devices must be malfunctioning, and an alarm can be sent to inform maintenance personnel.
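The self-diagnosis idea, cross-comparing redundant instrumentation outputs in real time, can be sketched as follows; the interface (device id to measured value) and the tolerance value are illustrative assumptions of ours:

```python
def diagnose(readings, rel_tol=0.05):
    """Cross-compare the outputs of redundant instrumentation devices.
    `readings` maps a device id to its current measured value; any device
    deviating from the median output by more than `rel_tol` (relative)
    is reported as a suspect so that an alarm can be raised.
    A zero median is not diagnosed in this simple sketch."""
    values = sorted(readings.values())
    median = values[len(values) // 2]
    return [device for device, value in readings.items()
            if median != 0 and abs(value - median) / abs(median) > rel_tol]
```

A non-empty result indicates that at least one redundant device is likely malfunctioning, which would trigger the maintenance alarm described above.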
7.2. Prevention of failure propagation

In order to minimize the negative effects caused by the propagation of cascading failures, a real-time prediction system can be implemented to analyze the most recent information (monitored variables) from all substations. This system should be able to identify early symptoms of failures that could trigger cascading failures and eventually propagate from one system to another. The identification must be completed within the delay of dependency failures, so that further actions can be performed in time to stop the propagation.

7.3. Increasing the capacity of batteries for RTUs of the SCADA system

One of the major causes of service interruptions of RTUs is the loss of power supply due to the full consumption of their batteries in the case of preferred power loss (caused by interdependencies). This type of interruption could be minimized by increasing the battery capacity for the case in which the power supply from another source is temporarily unavailable.

7.4. Setting up a remote emergency center

The worst-case tests have demonstrated the importance of maintaining the normal functionality of the control center, including human operator actions. When natural disasters occur, e.g., earthquakes, flooding, etc., not only will operators be unable to perform safety actions; the devices installed in the control center, e.g., control panels, working stations, monitors, etc., are also likely to fail. Setting up an emergency center located a certain distance away from the current control center is therefore necessary, and its importance should not be ignored. During normal situations, this remote emergency center receives, updates, and backs up current field information directly from the control center.
In an emergency situation, the role of the current control center can be transferred to the remote emergency center, where operators should be able to continuously monitor and control the system (SUC) and restore the system data according to the previously stored backup information. Although setting up a remote emergency center requires significant financial support and the probability of actually using this center is relatively low, society as a whole would certainly benefit from it.
8. Conclusion

Our society needs to face the fact that interdependencies within and among CIs are more complicated than imagined, and research related to this topic is unlikely to become easier in the future. Even assessing the interactions among subsystems within one infrastructure system, e.g., a SCADA system and its SUC, can be very challenging. Each approach developed or adapted for this topic has its own advantages and disadvantages; in practice, there is still no ''silver bullet'' approach. Combining different approaches into one simulation tool by adopting the technique of distributed simulation using appropriate simulation standards, referred to as a hybrid modeling and simulation approach and presented in this paper as part of the in-depth analysis, has already proven its feasibility and applicability and will hopefully be accepted
by researchers in the field of CI interdependency studies, or even in the broader field of reliability studies. With the help of this approach, classic approaches such as FTA can be integrated with advanced modeling approaches such as ABM and used for more advanced and comprehensive system reliability analyses. In this paper, the interdependency-related vulnerabilities between the SCADA system and its associated SUC are analyzed by following a five-step methodical framework. As part of the core step of this framework, the in-depth analysis, a hybrid modeling and simulation approach is presented, which is used to identify and assess hidden vulnerabilities. Three in-depth experiments are then designed and performed with the help of this approach. These experiments show the importance of mapping complex physical systems from the real world to the simulation world and then projecting data from the simulation world back into the real world. Furthermore, the simulation results from these experiments demonstrate the capability of the methodical framework, as well as of the hybrid approach, to analyze CIs in their full complexity, which allows us to identify and assess both obvious and hidden vulnerabilities. For instance, several hidden vulnerabilities due to interdependencies between the SCADA system and the SUC have been identified and presented in this paper, such as the importance of field level devices, which has been underestimated by most past research, and the discovery of the delay of dependency failures. The hybrid modeling/simulation approach removes technical obstacles for future research efforts that are required to handle the complexity of CIs by allowing the integration of different types of modeling/simulation methods into one simulation platform. More implementations of this approach to investigate and identify interdependency-related vulnerabilities among various CIs are currently under development.
Acknowledgments

The authors thank the Swiss Federal Office of Civil Protection for providing financial support for this research, which is part of a project on vulnerabilities of critical infrastructures.

References

[1] Kröger W, Zio E. Vulnerable systems. London: Springer; 2011.
[2] Buzna L, Peters K, Ammoser H, Kühnert C, Helbing D. Efficient response to cascading disaster spreading. Physical Review E 2007;75:056107.
[3] Rinaldi SM, Peerenboom JP, Kelly TK. Identifying, understanding, and analyzing critical infrastructure interdependencies. IEEE Control Systems Magazine 2001;21:11–25.
[4] Igure VM, Laughter SA, Williams RD. Security issues in SCADA networks. Journal of Computers and Security 2006;25:498–506.
[5] Richardson BT, Chavez L. National SCADA test bed consequence modeling tool. Sandia National Laboratories; 2008. p. 23.
[6] Pfander JP, Baumann R, Amitirigala R. New SCADA/EMS concept of the Swiss Federal Railways. In: Proceedings of the 4th international conference on power system control and management; 1996. p. 231–9.
[7] Kaneda K, Tamura S, Fujiyama N, Arata Y, Ito H. IEC61850 based substation automation system. In: Proceedings of the joint international conference on power system technology; 2008. p. 1–8.
[8] Giani A, Karsai G, Roosta T, Shah A, Sinopoli B, Wiley J. A testbed for secure and robust SCADA systems. SIGBED Review 2008;5:1–4.
[9] Balducelli C, Bologna S, Lavalle L, Vicoli G. Safeguarding information intensive critical infrastructures against novel types of emerging failures. Reliability Engineering and System Safety 2007;92:1218–29.
[10] Nai Fovino I, Carcano A, Masera M, Trombetta A. An experimental investigation of malware attacks on SCADA systems. International Journal of Critical Infrastructure Protection 2009;2:139–45.
[11] SWISSGRID: Die Nationale Netzgesellschaft; 2007.
[12] Stuxnet: rumours increase, infections spread. Network Security; 2010. p. 1–2.
[13] Johnson RE.
Survey of SCADA security challenges and potential attack vectors. Internet Technology and Secured Transactions (ICITST); 2010. p. 5.
[14] Slay J, Miller M. Lessons learned from the Maroochy water breach. IFIP International Federation for Information Processing; 2008, vol. 253. p. 73–82.
[15] FPL (Florida Power and Light Company). FPL announces preliminary findings of outage investigation; 2008.
[16] Christansson H, Luiijf E. Creating a European SCADA security testbed. In: Proceedings of the IFIP International Federation for Information Processing. Boston: Springer; 2007. p. 237–47.
[17] Eusgeld I, Kröger W. Towards a framework for vulnerability analysis of interconnected infrastructures. In: Proceedings of the 9th international probabilistic safety assessment & management conference (PSAM 09). Hong Kong; 2008.
[18] Eusgeld I, Kröger W, Sansavini G, Schläpfer M, Zio E. The role of network theory and object-oriented modeling within a framework for the vulnerability analysis of critical infrastructures. Reliability Engineering and System Safety 2009;94:954–63.
[19] Griot C. Modelling and simulation for critical infrastructure interdependency assessment: a meta-review for model characterisation. International Journal of Critical Infrastructure 2010;6:363–79.
[20] Pederson P, Dudenhoeffer D, Hartly S, Permann M. Critical infrastructure interdependency modeling: a survey of U.S. and international research. Idaho National Laboratory; 2006.
[21] Zimmerman R. Decision-making and the vulnerability of interdependent critical infrastructure. In: Proceedings of the IEEE international conference on systems, man and cybernetics; 2004. p. 4059–63.
[22] Rahman HA, Beznosov K, Marti JR. Identification of sources of failures and their propagation in critical infrastructures from 12 years of public failure reports. International Journal of Critical Infrastructures 2009;5:220–44.
[23] IRGC. Policy brief: managing and reducing social vulnerabilities from coupled critical infrastructures. Geneva, Switzerland: IRGC; 2007.
[24] Leontief WW. Input–output economics. 2nd ed. New York: Oxford University Press; 1986.
[25] Setola R, De Porcellinis S, Sforna M. Critical infrastructure dependency assessment using the input–output inoperability model. International Journal of Critical Infrastructure Protection 2009;2:170–8.
[26] Haimes YY, Horowitz BM, Lambert JH, Santos JR, Lian C, Crowther KG. Inoperability input–output model for interdependent infrastructure sectors I: theory and methodology. Journal of Infrastructure Systems 2005;11:67–79.
[27] Haimes YY, Horowitz BM, Lambert JH, Santos J, Crowther K, Lian C. Inoperability input–output model for interdependent infrastructure sectors II: case studies. Journal of Infrastructure Systems 2005;11:80–92.
[28] van Steen M. Graph theory and complex networks: an introduction. 1st ed. Maarten van Steen; 2010.
[29] Buldyrev SV, Parshani R, Paul G, Stanley HE, Havlin S. Catastrophic cascade of failures in interdependent networks. Nature 2010;464:1025–8.
[30] Johansson J, Hassel H. An approach for modelling interdependent infrastructures in the context of vulnerability analysis. Reliability Engineering and System Safety 2010;95:1335–44.
[31] Apostolakis GE, Lemon DM. A screening methodology for the identification and ranking of infrastructure vulnerabilities due to terrorism. Risk Analysis 2005;25:361–76.
[32] Ouyang M, Hong L, Mao Z-J, Yu M-H, Qi F. A methodological approach to analyze vulnerability of interdependent infrastructures. Simulation Modelling Practice and Theory 2009;17:817–28.
[33] Eusgeld I, Nan C. Creating a simulation environment for critical infrastructure interdependencies study. In: Proceedings of the IEEE international conference on industrial engineering and engineering management. Hong Kong; 2009. p. 2104–8.
[34] De Porcellinis S, Setola R, Panzieri S, Ulivi G. Simulation of heterogeneous and interdependent critical infrastructures. International Journal of Critical Infrastructures 2008;4:110–28.
[35] Min H-SJ, Beyeler W, Brown T, Son YJ, Jones AT. Toward modeling and simulation of critical national infrastructure interdependencies. IIE Transactions 2007;39:57–71.
[36] Sultana S, Chen Z. Modeling infrastructure interdependency among floodplain infrastructures with extended Petri-Net.
In: Proceedings of the 16th IASTED international conference on applied simulation and modelling. Palma de Mallorca, Spain: ACTA Press; 2007. p. 104–9. [37] Di Giorgio A, Liberati F. Interdependency modeling and analysis of critical infrastructures based on Dynamic Bayesian Networks. In: Proceedings of the 19th mediterranean conference on control and automation (MED); 2011. p. 791–7. [38] HadjSaid N, Tranchita C, Rozel B, Viziteu M, Caire R. Modeling cyber and physical interdependencies—application in ICT and power grids. In: 2009 Power Systems Conference and Exposition; 2009. p. 1–6. [39] D’Agostino G, Bologna S, Fioriti V, Casalicchio E, Brasca L, Ciapessoni E., et al. Methodologies for inter-dependency assessment. In: Proceedings of the 5th international conference on critical infrastructure (CRIS); 2010. p. 1–7. [40] Casalicchio E, Bologna S, Brasca L, Buschi S, Ciapessoni E, D’Agostino G, et al. Inter-dependency assessment in the ICT-PS network: the MIA project results. In: Xenakis C, Wolthusen S, editors. Critical information infrastructures security. Berlin Heidelberg: Springer; 2011. p. 1–12. [41] Fioriti V, D’Agostino G, Bologna S.. On modeling and measuring interdependencies among critical infrastructures. In: Proceedings of the 2010 complexity in engineering: IEEE computer society; 2010. p. 85–7. ¨ ¨ [42] Schlapfer M, Kessler T, Kroger W. Reliability analysis of electric power systems using an object-oriented hybrid modeling approach. In: Proceedings of the 16th power systems computation conference. Glasgow; 2008. ¨ [43] Nan C, Kroger W, Eusgeld I. Focal report: study of common cause failures of SCADA system at substation level. BABS: ETH Zurich; 2011.
[44] Zhou L. Focal report: vulnerability analysis of industrial control systems—Part B: statistics and analysis of industrial security incidents, challenges of ICS security research. BABS report. Zurich, Switzerland: ETH Zurich; 2011.
[45] Caretta Cartozo C. Complex networks: from biological applications to exact theoretical solutions. EPFL; 2009.
[46] Gallos LK, Cohen R, Argyrakis P, Bunde A, Havlin S. Stability and topology of scale-free networks under attack and defense strategies. Physical Review Letters 2005;94:188701.
[47] Kröger W. Critical infrastructure at risk: a need for a new conceptual approach and extended analytical tools. Reliability Engineering and System Safety 2008;93:1781–7.
[48] Eusgeld I, Nan C, Dietz S. System-of-systems approach for interdependent critical infrastructures. Reliability Engineering and System Safety 2011;96:679–86.
[49] Bloomfield R, Chozos N, Nobles P. Infrastructure interdependency analysis: introductory research review; 2009.
[50] Nan C, Kröger W, Probst P. Exploring critical infrastructure interdependency by hybrid simulation approach. In: Proceedings of ESREL 2011. Troyes, France; 2011. p. 2483–91.
[51] Nan C, Eusgeld I. Adopting HLA standard for interdependency study. Reliability Engineering and System Safety 2010;96:149–59.
[52] Gorbil G, Gelenbe E. Design of a mobile agent-based adaptive communication middleware for federations of critical infrastructure simulations. In: Proceedings of the CRITIS; 2009.
[53] Bobbio A, Bonanni G, Ciancamerla E, Clemente R, Iacomini A, Minichino M, et al. Unavailability of critical SCADA communication links interconnecting a power grid and a Telco network. Reliability Engineering and System Safety 2010;95:1345–57.
[54] Nai Fovino I, Guidi L, Masera M, Stefanini A. Cyber security assessment of a power plant. Electric Power Systems Research 2011;81:518–26.
[55] Louie MA, Carley KM. Balancing the criticisms: validating multi-agent models of social systems. Simulation Modelling Practice and Theory 2008;16:242–56.
[56] Tolk A, Uhrmacher AM. Agents: agenthood, agent architectures, and agent taxonomies. In: Agent-directed simulation and systems engineering. Weinheim: Wiley-VCH Verlag GmbH & Co. KGaA; 2010. p. 73–109.
[57] Nikolic I, Dam KHV, Kasmire J. Agent-based modelling of socio-technical systems. Netherlands: Springer; 2012. p. 267.
[58] Nan C. Hybrid modeling/simulation approach for identification of hidden vulnerabilities due to interdependencies within and among critical infrastructures. ETH; 2012.
[59] Swain AD. Comparative evaluation of methods for human reliability analysis. Institute for Reactor Safety; 1989.
[60] Konstandinidou M, Nivolianitou Z, Kiranoudis C, Markatos N. A fuzzy modeling application of CREAM methodology for human reliability analysis. Reliability Engineering and System Safety 2006;91:706–16.
[61] Kyriakidis M. Focal report: a study regarding human reliability within power system control rooms. BABS report. Zurich: Lab for Safety Analysis, ETH Zurich; 2009. p. 29.
[62] Verein Deutscher Ingenieure (VDI). Methods for quantitative assessment of human reliability; 2003.
[63] Hollnagel E. Cognitive reliability and error analysis method (CREAM). UK: Elsevier; 1998.
[64] He X, Wang Y, Shen Z, Huang X. A simplified CREAM prospective quantification process and its application. Reliability Engineering and System Safety 2008;93:298–306.
[65] Marcellus RL. Evaluation of a nonstationary policy for statistical process control. In: Proceedings of the 6th annual industrial engineering research conference; 1997. p. 89–94.
[66] Harris CJ, Hong X, Gan Q. Adaptive modeling, estimation and fusion from data. New York: Springer; 2002.
[67] IEEE. IEEE recommended practice for the design of reliable industrial and commercial power systems. IEEE Std 493-2007 (Revision of IEEE Std 493-1997); 2007. p. 1–383.
[68] Nan C, Eusgeld I. Exploring impacts of single failure propagation between SCADA and SUC. In: Proceedings of the IEEE international conference on industrial engineering and engineering management (IEEM); 2011. p. 1564–8.
[69] Nai Fovino I, Masera M, Guidi L, Carpi G. An experimental platform for assessing SCADA vulnerabilities and countermeasures in power plants. In: Proceedings of the 3rd conference on human system interactions (HSI); 2010. p. 679–86.
[70] Queiroz C, Mahmood A, Jiankun H, Tari Z, Xinghuo Y. Building a SCADA security testbed. In: Proceedings of the 3rd international conference on network and system security; 2009. p. 357–64.