Towards end-to-end network resilience

International Journal of Critical Infrastructure Protection 6 (2013) 159–178

Available online at www.sciencedirect.com

www.elsevier.com/locate/ijcip

Panagiotis Vlacheas a,*, Vera Stavroulaki a, Panagiotis Demestichas a, Scott Cadzow b, Demosthenes Ikonomou c, Slawomir Gorniak c

a University of Piraeus, 80 Karaoli and Dimitriou Street, 18534 Piraeus, Greece
b Cadzow Communications Consulting, 10 Yewlands, Sawbridgeworth, Hertfordshire CM21 9NP, United Kingdom
c European Network and Information Security Agency (ENISA), 1 Vasilissis Sofias, 15124 Marousi, Attica, Greece

Article info

Article history: Received 10 November 2011; received in revised form 16 August 2013; accepted 16 August 2013; available online 22 August 2013.

Keywords: Future networks; Network resilience; Threats; Cognitive framework; Ontology; Profiles; Policies

Abstract

Telecommunications networks have evolved towards a unified, service-oriented, operator-governed and autonomically managed infrastructure. Unification ensures interoperability and federation among different domains, technologies and architectures, while allowing the joint consideration of network and service aspects towards a “network as a service” view. Autonomicity reduces operational expenditures and governance guarantees operator control over the entire network. In this new environment, the meaning of network resilience must be revised in an end-to-end manner. This paper focuses on network resilience, identifies the principal network resilience concepts and proposes an ontology, which describes the content and the interactions between the resilience concepts.

© 2013 Elsevier B.V. All rights reserved.

Corresponding author. Tel.: +30 2104142749. E-mail address: [email protected] (P. Vlacheas).

1874-5482/$ - see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.ijcip.2013.08.004

1. Introduction

The notion of resilience was introduced in the early decades of the twentieth century in a variety of scientific domains, such as physics, psychology and psychiatry, ecology, business, industrial safety and telecommunications [21]. However, network resilience has recently gained much attention as the property of a network to sustain its normal operations and desired performance when facing a number of predictable or unpredictable situations such as threats and changes. There have been considerable efforts in the literature (see, e.g., [21]) to distinguish resilience from terms such as fault-tolerance, dependability and robustness, and to determine the boundaries and relationships between resilience and other terms such as stability and diversity. Often, resilience as a more global concept seems to encompass other terms (e.g., dependability) and resilience has even been built into their initial definitions in an incremental approach. After a careful investigation of resilience, it is safe to conclude that resilience evolves in parallel with network development and operating requirements and challenges. For example, fault-tolerance has been known to exhibit some robustness with respect to fault and error handling; and dependability has been used in ubiquitous systems to describe the ability to deliver services that can be trusted in the face of continuous changes [1]. Thus, fault-tolerance and dependability may be considered to be resilience properties. Trends and challenges regarding future networks impose new requirements when considering resilience. In order to avoid starting from scratch and recognizing the fact that there are already some initiatives that address aspects of future networks, this effort has capitalized on existing research results.

For example, the UniverSelf Project [36] provides a set of top-level requirements and design goals that have to be covered in order to address the general vision and research directions with regard to future networks, service-oriented computing and networking, and the future Internet. These requirements include governance, unification, service orientation, autonomicity, orchestration and coordination, and intelligence embodiment. Governance allows a network operator to have control even over an autonomic network by setting goals and requests, enforcing them on the corresponding network elements and receiving notifications when a situation requires prompt operator (re)action. Unification implies interoperability and federation among diverse management systems, which may involve different network segments (radio access, core and service) and may be implemented on different autonomic architectures. Service orientation denotes joint service and network management in the sense that everything in the network may be considered to be a managed service. Autonomicity allows network entities to be managed and to operate with “self” properties such as self-configuration, self-monitoring, self-optimization, self-healing, self-diagnosis, self-protection, self-awareness and self-testing, and, thus, may be considered to be a synonym of self-management. Orchestration and coordination guarantee that simultaneously operating and even conflicting autonomic entities will not cause any instabilities or incompatibilities. Finally, intelligence embodiment represents the progressive introduction of autonomic features in the management chain, and especially in network and service domains, in a distributed manner.

These requirements can guide the redefinition of resilience for future networks. A well-established way to do this is through the design and use of an ontology [37].
An ontology is the term used to refer to the specification of a shared conceptualization within a domain of interest. It involves the definition of domain concepts (e.g., objects, attributes and processes) and their properties and relationships. Therefore, it can be used to model and/or describe the domain, reason about the included entities, and create a unifying framework to solve problems in the domain. Usually, this specification is formal and standardized (explicit ontology), but it may also involve subjective usage (implicit ontology). Ontologies are useful tools for representing knowledge in many scientific domains, such as communications, system and software engineering, enterprise modeling, information architecture, and in general, wherever there is a strong need for a shared vocabulary and interoperability.

Conventional networks comprise several management domains in which resilience is of utmost importance. The management processes and systems in these domains are heterogeneous and typically adhere to specific standards. For the sake of argument, a resilience ontology may be at least applicable to one domain, allowing the definition of the semantic aspects of the domain and its information. In order to address the requirements of future networks, an end-to-end resilience ontology, which integrates information that currently belongs to each domain, is required. Such a “meta-ontology” would allow network managers to elaborate and reason about resilience aspects with an abstract and interoperable view. Moreover, the ability of an ontology to model behavior, in terms of rules and constraints, enables managers to compensate for problematic situations in an autonomic as well as domain-independent manner.

Ontology-based network management [23] has been applied to a variety of network management and security scenarios. A typical example is autonomic management on behalf of an operator of a service deployment constructed on top of a heterogeneous network infrastructure. The proposed end-to-end network resilience ontology will help to hide the heterogeneity of the underlying infrastructure and to formulate an appropriate information flow, while deploying the new services with concrete attributes such as quality of service (QoS) and quality of experience (QoE).
Another example is network security and policy management, which are tightly related to self-diagnosis. The end-to-end network resilience ontology could be used by an inference engine to dynamically correlate attacks identified by alerts from heterogeneous sources with security policies and enforce the appropriate context-aware policy to eliminate the threat. Some of these scenarios are analyzed later in this paper.

Fig. 1 summarizes the methodology that is used to derive the ontology. First, a thorough investigation of existing resilience ontologies is conducted. As stated above, these ontologies may focus on well-isolated management domains. Then, based on the requirements of future networks, these ontologies are analyzed. The purpose of the analysis is neither to critique the ontologies nor to classify them, but instead to identify their re-usable features and the potential gaps and areas that need to be improved in order to obtain a more integrated view. The results of the analysis can be exploited to define classes (concepts) and properties (relationships) in the ontology. Next, the required enablers (i.e., instruments and mechanisms that help achieve resilience) are incorporated in the ontology. Finally, the ontology is validated through a proof-of-concept description, which also demonstrates the utility of the ontology. Although the ontology aims at providing a consolidated view, it is by no means exhaustive. However, the high-level, abstract definition allows it to be characterized as extendable, in the sense that, if new ontologies are derived, the same procedure can be followed to update the ontology without disrupting the current status.

Fig. 1 – Ontology design methodology.

The next section, Section 2, discusses related work in the field. Section 3 provides a description of the proposed ontological model for end-to-end resilience. Section 4 presents a detailed description of the identified resilience domain in terms of interrelationships between the defined concepts.
Section 5 describes the approach used to enable resilience with cognitive aspects, which, in turn, enables the solution of the resilience problem. Section 6 demonstrates the use of the ontology in a use case scenario. Finally, Section 7 summarizes the conclusions and discusses avenues for future research.

2. Related work

The previous section highlighted the need for an ontology to describe end-to-end resilience and address current and future network requirements. Limited work has been done in this area. Much of the research focuses on separate domains and management silos, and, therefore, different considerations of resilience.

Avižienis and colleagues [1] focus on dependability and security. They define dependability in terms of reliability, availability, safety, integrity and maintainability; and define security with respect to confidentiality, integrity and availability. After providing these basic definitions, they specify the threats to dependability and security (faults, errors and failures) and the means for resolving them (fault prevention, fault tolerance, fault removal and fault forecasting). Their model enables a unified presentation of the threats and expresses specific details using various fault classes. Means are described in a more orthogonal manner than conventional classifications according to attributes, which tend to conflict with each other. Faults are examined based on their incentive (malicious or non-malicious), enabling the consideration of human involvement. Service failures are distinguished from dependability failures and dependability is related to the development process and trust. The major part of the analysis is provided in the form of ontologies. Since dependability and security are prerequisites for resilience and their attributes represent resilience properties, the defined ontologies are also concerned with resilience.

Laprie [21] investigates how to move from dependability to resilience by considering faults as well as change tolerance and adaptation. In his view, the term “resilience” has been used for decades in dependable computing systems essentially as a synonym for “fault-tolerance,” thereby ignoring the unexpected aspects of phenomena that systems may have to face. He argues that a totally different consideration is needed when future, large, networked, evolving systems, including ubiquitous systems, containing complex information infrastructures, are addressed. In such environments, emphasis should be placed on how to maintain dependability, i.e., the ability to deliver a service that can be justifiably trusted, in spite of continuous changes. Therefore, the definition of resilience builds on the definition of dependability with the addition of change handling and fault and error confrontation. Thus, resilience is viewed as the “persistence of dependability when facing changes.” Laprie also provides a classification of changes. He emphasizes that, in the context of dependability, changes may also apply to the threats faced by a system. Moreover, the changes themselves can turn into threats, such as when changes induce incompatibilities and instabilities to a system.
He also discusses the relationships between the required properties for resilience (evolvability, assessability, usability and diversity) and the means for dependability (fault prevention, fault tolerance, fault removal and fault forecasting).

Laprie's research was performed under the European Network of Excellence ReSIST (Resilience for Survivability in Information Society Technologies) Project [28]. This project addressed the strategic objective of a “global dependability and security framework” of the Sixth Framework Work Programme of the European Commission. In particular, it responded to the “need for resilience, self-healing, dynamic content and volatile environments.” The project involved experts in the areas of dependability, security and human factors, and attempted to ensure that future ubiquitous computing systems, which are needed to support ambient intelligence, have the necessary resilience and survivability properties, despite any residual development and physical faults, interaction errors, and malicious attacks and disruptions.

The ReSIST Project [28,29] has attempted to create a structured representation of resilience concepts, in the form of a thesaurus and an ontology, which could be used with natural language processing tools to perform computer-aided identification and classification of existing documents concerned with resilience and to classify new documents as they are generated. The approach first creates a hierarchical thesaurus based on the terminology that is extracted. The thesaurus is used to automatically index documents, after which a number of clusters are identified using automatic clustering analysis. The resulting thesaurus-based ontology is compared with the ontology specified by Avižienis and colleagues [1]. It is important to note that the ReSIST Project uses the same definition of resilience as Laprie [21], namely the “persistence of dependability in the presence of changes.”

The ReSIST Project [29] also attempted to find correspondence with and enhance the ontology provided by Avižienis and colleagues [1]. It noted that the ontology lacked multiple inheritance, specifically, that the sub-fault relationship spans a tree rather than a graph. This can result in inappropriate and possibly misleading categorizations of faults. Furthermore, with regard to the categorization of faults, eight basic viewpoints (whose possible combinations lead to 31 fault classes) exist, which leads to a number of overlapping groupings. A more detailed investigation of the distribution of potential faults shows that the ontology implicitly encodes several subsets in terms of sub-fault relationships. The ReSIST Project addressed these issues by proposing a new fault hierarchy that explicitly considers some of the missing sub-fault relationships.

In [20], Laprie presents the relationships existing between changes, resilience scalability properties and resilience scaling technologies (i.e., technologies for resilience) using an ontology diagram. The underlying idea is as follows. Dependability and security call for scalability in order to accommodate functional, environmental and technological changes. Scalability, in turn, requires that resilience policies, algorithms and mechanisms be extensible, composable, adaptive and consistent with the assumed threats. The evolvability, assessability, usability and diversity of resilience policies, algorithms and mechanisms are prerequisites for satisfying the scalability properties.

ENISA [5] is a European organization whose mission is to achieve high levels of network and information security within the European Union.
ENISA has produced a research report [7] that identifies gaps in standardization related to network resilience, defines resilience in the context of standardization, lists the major activities undertaken in the standards developing organizations in the area of security and resilient architectures, and identifies future work on resilience that may be undertaken by standards developing organizations. The ENISA research introduces isolated ontologies for resilience. First, a generic system security and attack model is specified, which expresses a system as an aggregation of assets that are attacked either in isolation or in combination to defeat the system objectives. Then, the ITU-T ontological model of cyber security stressing resilience is presented. This model explicitly considers aspects that impact resilience as a composition of typical security objectives such as confidentiality, integrity, accountability and availability. Next, an illustration of a threat hierarchy is presented as an alternative model for considering both ontologies and taxonomies. In addition, an ontology of cyber security, identified by ITU-T and ETSI in TR 187 010, is used to demonstrate that a “resilient infrastructure” can be obtained by enabling protection and including countermeasures that assure availability and integrity. A model that illustrates the metrics is used to measure or express system resilience. Finally, a taxonomy of hazards is provided based on attacks on network resilience and the areas of interest to standards developing organizations.

Another ENISA research report [6] presents an overview of three key technologies, namely Internet Protocol version 6 (IPv6), Domain Name System (DNS) Security (DNSSEC), and Multi-Protocol Label Switching (MPLS). The three technologies are specifically evaluated in terms of their effectiveness at improving network resilience. When studying how resilience is enabled by DNSSEC, an ontology model is used to express the main threats against DNS. The DNS threats are separated into two main categories, one containing addressable threats and the other threats that cannot be handled by DNSSEC.

Stavroulaki et al. [31] present an open, scalable platform for the integration and management of cognitive systems. This platform introduces cognitive and self-management features on the network and the user device sides. It enables efficient end-to-end operation, abstracts the complexity of the underlying infrastructure of network operators and offers ubiquitous, personalized services with enhanced user experiences. Cognitive management enables systems to determine and configure their operations in a reactive as well as proactive manner to predict and prevent critical issues, and to achieve network resilience. Stavroulaki et al. formulate the information flow as an ontology, facilitating the interoperability of different, proprietary platforms and testbeds, abstraction of complex infrastructures, and extensibility by introducing various software and hardware components.

The INTERSECTION FP7 Project [17] focuses on the design and implementation of a security framework with mechanisms for achieving intrusion detection and tolerance. The project has created an ontology of vulnerabilities in heterogeneous network infrastructures. This ontology is used by an offline decision support system [4], which provides information to network operators about vulnerabilities and allows them to conduct monitoring and diagnosis, and implement countermeasures.
Vulnerabilities are defined as “asset-related network weaknesses,” which are exploited by threats and attacks.

The goal of the INSPIRE FP7 Project [16] is to design and implement techniques and components that enhance the reliability of supervisory control and data acquisition (SCADA) systems used in critical infrastructure assets and to protect information flow against cyber attacks. A security ontology is used to describe knowledge about SCADA components, complex symptoms with regard to faults and attacks, and appropriate countermeasures. This knowledge is then exploited by a decision aid tool [3] that helps systems administrators diagnose and reason about security problems and select the appropriate countermeasures.

The REMPLANET FP7 Project [27] focuses on the development of methods, guidelines and tools for implementing resilient multi-plant networks in non-hierarchical and distributed (with respect to decision making) manufacturing networks. The environment requires effective operation and management of heterogeneous, dynamic and geographically distributed partners and bypasses the traditional concept of static supply chains. An ontological approach encompassing REMPLANET concepts and relationships is provided to address information exchange interoperability and distributed coordination [14]. The ontology is divided into three simple views to facilitate understanding: resilient organization, information and communications technologies (ICT) platform and members. A resilient organization has “the capacity to respond rapidly to unforeseen change, even chaotic disruption.” Therefore, a resilient organization can effectively update its strategy, operations and management, governance structure and decision-support capabilities based on continually changing risks and successfully face the corresponding threats. The ICT platform facilitates the integration and the connection of REMPLANET members, communications between the members and non-centralized decision making.

Lopez de Vergara et al. [23] summarize research efforts related to ontology-based network management, including studies involving autonomic systems, network security management, network monitoring, and the lessons learned about the use of ontologies and rule languages in these areas. They identify the need for generic ontology models to be used in autonomic systems. One of the studies [26] proposes an autonomic provisioning model for the dynamic configuration and self-diagnosis of end user devices employed for digital home services. The model contains a knowledge layer implemented using an ontology and policy-based reasoning rules to enable entities to understand services, relate services to devices and their capabilities, select the best device configuration for a service and the user preferences, as well as to provide information exchange support between entities. An identical approach is also followed in [25] for home area networks.

Another research effort [34] proposes the transition from taxonomies to ontologies for use in intrusion detection systems to exploit the relationships between collected data and to enable advanced attack recognition, reporting and correlation capabilities. The authors note that very limited, if any, published research exists for such ontologies.
They proceed to present an intrusion detection ontology and demonstrate its effectiveness at detecting instances of denial-of-service and buffer overflow attacks. A related effort [24], with the goal of improving the overall resilience of IP networks to attacks, employs an ontology and inference rules to map alerts to attack contexts, which are used to identify the security policies that should be enforced in a network to confront the threats.

Basile and colleagues [2] employ a policy-based network management approach for the security configuration of network devices. They use an ontology to represent domain knowledge and to reason about appropriate configurations. An autonomic translation of high-level goals to low-level device configurations is performed, reducing human effort and errors due to manual configuration, while allowing human intervention (e.g., when information is missing from the ontology).

The analysis of the research literature reveals a critical issue. Although several attempts have been made to define resilience and even to design ontologies, the attempts focus on highly isolated domains. There is a clear need to address end-to-end resilience using an ontology that can break down the different silos that come into play in resilience considerations. The resilience ontology should cover issues at different implementation levels (hardware, middleware and software), network layers (network and application layers) on the network and user sides, and different domains (network, service, business and human factors). Moreover, the resilience ontology should be able to cover future network requirements, namely governance, unification, service orientation, autonomicity, orchestration and coordination, and intelligence embodiment as much as possible. In this way, the resilience ontology would be applicable to a variety of areas of interest [8], such as cloud computing, real-time detection and diagnosis systems, sensor networks, future wireless networks and integrity of supply chain, in an integrated manner. Finally, the resilience ontology should be explicit in the sense that it bridges the gap in standardization and provides guidelines for stakeholders (e.g., operators and policy-makers).

3. End-to-end resilience: concepts and definitions

This section describes the basic concepts and definitions underlying end-to-end resilience along with the proposed ontological model for end-to-end resilience.

3.1. Resilience

As identified in the previous section, there is a need to address and define the vital issue of end-to-end resilience as imposed by future network requirements using an ontology. The end-to-end property implies an integrated resilience ontology that encompasses information coming from several heterogeneous management domains, so that a manager can elaborate and reason about resilience aspects in an abstract and interoperable manner. This section identifies the relevant concepts of such an ontology.

Fig. 2 shows the higher-level classes of the ontology. Resilience is the main class located at the center of the ontology and may be considered to be the root. It has direct relationships with the classes ThreatAgent, Domain, Properties, Threats and Means. Note that all these relationships are defined based on the root, i.e., network resilience.

Fig. 2 – Resilience concepts and relationships.

Threats are what Resilience confronts, namely potential attacks against network assets. However, the term “threats” is more general than the term “attacks” as used in the security literature to describe malicious external faults. In resilience, the consideration of threats exceeds the classical view of potentiality (i.e., threats are not yet active) and includes realization. The corresponding relationship is confronts, while the inverse relationship is isConfrontedBy.

A ThreatAgent denotes an entity, physical or not, that threatens Resilience, namely the agent that manifests the Threats and attacks on the system objectives. The corresponding relationship is isThreatenedBy and the inverse relationship is threatens.

The Domain is where Resilience is required. Actually, it is the concept that has a strong impact on the direction of end-to-end resilience. There is a clear need for the reconsideration of resilience based on the application domain in order to cover the orientation and the requirements of future networks. A resilience description per domain is presented in Section 4; it highlights the importance of the concept. The relationship is isRequiredBy and the inverse relationship is requires.
Properties are the attributes by which Resilience is expressed. Properties play an important role in resilience [1]. The reason is that the more resilience evolves, the greater the need to find appropriate ways to measure it. Moreover, resilience is a general concept, encompassing concepts such as dependability, security, even trust, and integrating the corresponding attributes. These attributes may be conflicting, revealing the need for tradeoffs to find the ideal balance of the countermeasures against Threats. Based on Properties, more specialized and technical Metrics are derived, allowing one to verify if a property is guaranteed. Furthermore, Metrics are the “key performance indicators” of resilience, permitting the evaluation of a system in terms of resilience and the triggering of proactive and reactive actions to sustain the desired operations. The monitoring and evaluation processes enable assurance, which is a key management process as specified by an enhanced telecom operations map (eTOM) [33]. The relationship is isExpressedBy and the inverse relationship is express.

The Means class represents the means that have been developed to attain the various resilience properties and are intended to either eliminate threats or fix vulnerabilities. This concept is of interest to resilience stakeholders, e.g., industry, telecom operators and service providers. The focus is on the mapping between Threats and Means per Domain (Section 4) to create a common and shared understanding on how the various threats could be faced. Although many Means have been developed over the past 50 years to confront Threats and attain Properties, the ongoing appearance of new threats in combination with the conflicting nature of some properties makes it an urgent objective to keep the resilience “outfit” complete and updated. The corresponding relationship is isEnabledBy and the inverse relationship is enables.
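The relationships listed in this subsection can be written down as subject-predicate-object triples, which makes the pairing of each relationship with its inverse explicit. The following Python fragment is a minimal sketch for illustration only: the class and relationship names are taken from the text, whereas the dictionary layout and the helper names `build_triples` and `objects_of` are assumptions of this example and are not part of the proposed ontology.

```python
# Illustrative sketch: class and relationship names come from the text;
# the data layout and helper names are assumptions made for this example.

# For each class related to the root: (relationship from Resilience to the
# class, inverse relationship from the class back to Resilience).
RELATIONSHIPS = {
    "Threats": ("confronts", "isConfrontedBy"),
    "ThreatAgent": ("isThreatenedBy", "threatens"),
    "Domain": ("isRequiredBy", "requires"),
    "Properties": ("isExpressedBy", "express"),
    "Means": ("isEnabledBy", "enables"),
}


def build_triples(root="Resilience"):
    """Return all (subject, predicate, object) triples, inverses included."""
    triples = set()
    for cls, (forward, inverse) in RELATIONSHIPS.items():
        triples.add((root, forward, cls))
        triples.add((cls, inverse, root))
    return triples


def objects_of(triples, subject, predicate):
    """Return the sorted objects linked to `subject` through `predicate`."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


TRIPLES = build_triples()
print(objects_of(TRIPLES, "Resilience", "confronts"))
print(objects_of(TRIPLES, "Means", "enables"))
```

Storing the inverse alongside each relationship means that a query can start from either end, for example from Means towards Resilience, without a second lookup table.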

3.2. Resilience Threats

Fig. 3 presents the Threats to Resilience in terms of subclasses. The relationship and inverse relationship are subsumes (a class subsumes a subclass) and isSubsumedBy, respectively.

Security Threats include Interception, Manipulation, Repudiation and Denial of Service. These threats are subclasses of the Security Threats subclass, forming a hierarchy in the ontology, called Taxonomy. However, for reasons of simplicity, second-level subclasses are not shown. This threat hierarchy is presented in [7] and is strongly related to cyber security. A threat may lead to an unwanted incident that impacts certain pre-defined security objectives; and an unwanted incident is an incident that involves the loss of confidentiality, integrity and/or availability [12]. Thus, the presented Security Threats correspond to the non-satisfaction of traditional cyber security properties, namely Confidentiality, Integrity and Availability. Interception stands for the illegal access and retrieval of private information and content of communications by a malicious agent. Manipulation is the modification of application settings and/or information by a malicious agent; it also implies access to the information. Manipulation subsumes the subclasses UnauthorizedAccess, Masquerade, Forgery, InformationCorruption and InformationLoss. Repudiation occurs when a malicious agent overcomes the controls that track and log user actions and/or changes in the authoring information of user actions, thus allowing interception and manipulation. Finally, Denial of Service is the result of a malicious attempt that renders a computer resource unavailable to its intended users.

Dependability Threats are based on the common Fault, Error, Failure model, which also defines the associated subclasses. This model is described in detail in [1]. A failure mainly relates to a service, in the sense that a delivered service deviates from the correct service.
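The subsumes and isSubsumedBy relationships introduced so far form a hierarchy that can be traversed transitively. The Python fragment below is a minimal sketch of such a traversal: it contains only the security and dependability classes named in the passage above, and the dictionary layout and the function names `is_subsumed_by` and `leaf_classes` are assumptions made for this example.

```python
# Illustrative sketch: class names follow the text; the dictionary layout
# and function names are assumptions made for this example.

# Maps each class to the subclasses it directly subsumes.
SUBSUMES = {
    "Threats": ["SecurityThreats", "DependabilityThreats"],
    "SecurityThreats": [
        "Interception", "Manipulation", "Repudiation", "DenialOfService",
    ],
    "Manipulation": [
        "UnauthorizedAccess", "Masquerade", "Forgery",
        "InformationCorruption", "InformationLoss",
    ],
    "DependabilityThreats": ["Fault", "Error", "Failure"],
}


def is_subsumed_by(cls, ancestor):
    """True if `ancestor` subsumes `cls`, directly or through other classes."""
    stack = list(SUBSUMES.get(ancestor, []))
    while stack:
        current = stack.pop()
        if current == cls:
            return True
        stack.extend(SUBSUMES.get(current, []))
    return False


def leaf_classes(cls):
    """Return the sorted classes below `cls` that have no subclasses."""
    children = SUBSUMES.get(cls, [])
    if not children:
        return [cls]
    leaves = []
    for child in children:
        leaves.extend(leaf_classes(child))
    return sorted(leaves)


print(is_subsumed_by("Forgery", "SecurityThreats"))
print(leaf_classes("DependabilityThreats"))
```

Because the lookup is transitive, a second-level class such as Forgery is recognized as a security threat even though it is attached to Manipulation rather than to Security Threats directly.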
Considering a service as a sequence of system external states, Error is defined as the

Fig. 3 – Resilience Threats.


deviation of an external system state from the correct state; such a deviation can lead to a Failure. The verified or hypothetical cause of an error is called a Fault and can be internal or external to the system. Disasters are physical threats caused by natural events or natural hazards. Natural hazards such as floods, volcanic eruptions, earthquakes and landslides are analyzed in [7]. Changes represent the threats due to dynamic conditions that prevail in a system (e.g., system evolution) or in its environment (surrounding context). Interaction Conflicts are implicit threats that arise when various network assets interact through specific mechanisms to achieve common or even competing goals. This interaction is a fundamental prerequisite to address the future network requirements of unification and orchestration/coordination, and also applies to the business domain. The threats may be inter-domain or intra-domain. Inter-domain conflicts occur when different network segments (radio access, core, service), administrative domains or even businesses interact. In contrast, intra-domain conflicts are confined to a given domain. Interaction Conflicts may be the result of correct operation or a failure of one component that has some service relation with another component belonging to the same or different system. Typical examples of the first case are conflicts in self-organizing networks [18]. Two types of conflicts may be defined in a self-organizing network and may lead to system instability or even a crash. The first type is referred to as parameter value conflict, i.e., two or more mechanisms attempt to enforce different values for the same network parameter. The second type of conflict, goal conflict, occurs when a metric optimized by one mechanism is negatively affected by one or more mechanisms aimed at optimizing other metrics.
In the second case, Interaction Conflicts among systems are identical to Emerging Disservices [13], which represent disservice caused in a system component by a failure in another system component that has a service relation with the first system but that is located in another independent system. Supply Chain Attacks are threats against “integrity across the supply chain,” which seek to enable trust and confidence


during the purchase, installation and utilization of products and technologies [8]. Supply chain attacks involve equipment (hardware and software) supplied by vendors, development of network services by third parties, insertion of improper functionality into products, creation of counterfeit products, questionable suppliers and purchasing practices, disposal of discarded products and intermixing equipment of different quality.
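The subsumes / isSubsumedBy hierarchy described in this subsection can be mirrored in a few lines of Python. The sketch below is only an illustration under stated assumptions: the dictionary layout and the helper names `ancestors` and `is_subsumed_by` are introduced here and are not part of the ontology, and only the subclasses named in the text are included.

```python
# Illustrative sketch of the Threats taxonomy (subsumes / isSubsumedBy).
# The data layout and helper names are assumptions made for this example.

# Parent class -> directly subsumed subclasses, as named in the text.
SUBSUMES = {
    "Threats": [
        "Security Threats", "Dependability Threats", "Disasters",
        "Changes", "Interaction Conflicts", "Supply Chain Attacks",
    ],
    "Security Threats": [
        "Interception", "Manipulation", "Repudiation", "Denial of Service",
    ],
    "Manipulation": [
        "UnauthorizedAccess", "Masquerade", "Forgery",
        "InformationCorruption", "InformationLoss",
    ],
    "Dependability Threats": ["Fault", "Error", "Failure"],
}

# Inverse relationship: subclass -> parent (isSubsumedBy).
IS_SUBSUMED_BY = {
    child: parent for parent, children in SUBSUMES.items() for child in children
}


def ancestors(cls):
    """Return the chain of classes that subsume `cls`, nearest first."""
    chain = []
    while cls in IS_SUBSUMED_BY:
        cls = IS_SUBSUMED_BY[cls]
        chain.append(cls)
    return chain


def is_subsumed_by(cls, candidate):
    """True if `candidate` subsumes `cls` directly or transitively."""
    return candidate in ancestors(cls)


print(ancestors("Forgery"))
```

Storing only the direct subsumes links and deriving the inverse keeps the two relationships consistent by construction, which is the same reason the ontology declares them as a property and its inverse.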

3.3.

Resilience means

In this study, the resilience means are kept at a high level and represent management functions. This is because the target audience of resilience stakeholders is mainly from industry (e.g., operators and service providers). Of course, the management functions involve instruments and tools, which are described later in this subsection. Fig. 4 illustrates the concept of Resilience Means. Fault Management is the means to confront Dependability Threats and consists of four subclasses that correspond to different management functions. Fault Prevention seeks to prevent the occurrence of threats. Fault Tolerance aims at maintaining correct operation and avoiding a system failure in the presence of threats. Fault Removal involves corrective actions that reduce the impact of threats and even eliminate them. Fault Forecasting estimates the current situation regarding threats and predicts the future existence and consequences of threats in order to trigger the appropriate inhibitory actions. Fault prevention and fault forecasting act proactively, while fault tolerance and fault removal act reactively. Cooperation is mainly related to the existence and confrontation of Interaction Conflicts. Actually, it addresses the requirement of orchestration/coordination in future networks. Future network environments will have simultaneously operating and even conflicting autonomic entities, which should have the ability to be set up in a “plug-and-play” manner. In order to cope with incompatibilities and conflicts, and to avoid network instabilities that jeopardize network operations, the cooperation of diverse managing and managed entities is a prerequisite. In a unified network

Fig. 4 – Resilience Means.


environment, Cooperation is responsible for achieving federation and bridging different segments, such as wireless access, backhaul networks and core networks, for example by translating data from one network management system (NMS) to another. Also, it enables the cooperation of future networks with legacy systems. Finally, Cooperation applies to collaborating businesses and helps achieve common plans and objectives in an easier and more effective manner. Governance, which addresses future network requirements, allows an operator or service provider to have absolute control over a network, even when the network acts in an autonomic manner. Thus, governance should first introduce an enhanced human-network interface for the insertion of business goals and requests. Then, the corresponding objectives should be translated to lower-level policies and commands that are understood by the managed entities. Next, governance should provide the means to enforce the commands on the corresponding entities. Finally, it should participate in the overall system evaluation and allow the receipt of notifications when a situation requires prompt operator (re)action. Clearly, Governance affects other resilience means such as Cooperation by providing clear human control/directives regarding functionality. Another resilience enabler is Cognitive and Self-Management. In order for all resilience means functionality to be adequate, the corresponding means to monitor, detect/predict, resolve and manage external/internal disturbances/dynamics in networks and services are required. Cognitive and Self-Management corresponds to the well-known Monitor, Analyze, Plan, Execute (MAPE) cycle [15,19] enabled with learning (knowledge) capabilities. Fig. 5 presents Cognitive and Self-Management in terms of its subclasses and properties (relationships). Monitor monitors the status and capabilities of the managing and managed entities, as well as the goals and high-level objectives set through Governance.
It also feeds the information to Analyze and Knowledge. Note that status, capabilities and goals stand for context, profile and policies, respectively, which are the instruments and tools that are used by Cognitive and Self-Management (described in detail in Section 5). Analyze analyzes, diagnoses and reasons about the current situation, derives the specific configuration policies and triggers the appropriate re-optimization and corrective actions when degradation exists and, thus, it feeds Plan. Plan is the decision-making function, which uses input from Knowledge. Plan also feeds Knowledge with the decision in order to produce

Fig. 5 – Cognitive and Self-Management subclasses and properties.

knowledge about the solution to a problem for future use. Then, Plan orders the enforcement of the decision to Execute. Part of the execution is the transfer of commands to specific devices and their translation into device-specific languages. Finally, Execute feeds Knowledge about the enforcement result. As stated above, the entire control loop feeds or isFedBy Knowledge about previous decisions and configurations, involved entities and their locations, device capabilities, policies, etc. Therefore, Knowledge builds knowledge through learning capabilities and enables other functions to acquire this knowledge and make more sophisticated decisions or take more immediate actions based on past experience. Cognitive and Self-Management has self-diagnosis and self-healing capabilities. Its operation and the corresponding information exchange are described in Section 5. Trust Management is a more horizontal management process. In a sense, it encompasses the rest of the resilience means, because their enforcement actually leads to trust creation. Trust management comprises all the mechanisms needed to decide the extent to which autonomic processes and systems can be trusted and can incrementally build trust. Supply Chain Integrity Management seeks to understand and confront Supply Chain Attacks. Interested readers are referred to [8] for a discussion about managing such attacks. The Definition of Product and Service Requirements should cover the entire supply chain lifecycle (i.e., design, production, delivery, purchase, installation and maintenance). Evaluation Frameworks should be considered in order to check the consistency of products and services with the defined requirements and the desired resilience level. Identification of Origin and Authenticity is required throughout the product lifecycle. Protection and Maintenance of System Integrity are necessary prerequisites for Supply Chain Integrity Management.
Finally, Trust Models are needed to enable end-to-end verifiable trustworthiness of systems. Needless to say, a shared management framework ensuring resilience at all levels of the supply chain and encompassing all the features discussed above must be created. Risk Management in the context of resilience is split into several subclasses, each representing a separate phase. The subclasses are: Risk Identification, Risk Evaluation, Risk Prioritization, Risk Monitoring, Risk Treatment, Risk Management Improvement and Risk Management Embedding (within other management functions) [9]. Risk Evaluation is further divided into Incident Likelihood Assessment and Impact Analysis. These concepts are fairly straightforward, so the focus here is mainly on the meaning of Risk Treatment. Risk Treatment is represented by the “4T” model: Terminate, Transfer, Treat and Tolerate. Terminate/Avoid is the elimination of the activity that creates risk and may be the worst case, when no other action is effective. Transfer is the passing of risk to another party. Treat/Mitigate is the reduction or elimination of the negative effect or probability of a risk by means of new solutions or investment to increase the level of resilience. Tolerate/Accept is the acceptance of risk consequences, given that they are kept at an acceptable level and their status is monitored periodically. Fig. 6 shows the properties of the classes Threats and Risks as well as other concepts. Threats induce Risks, which lead to Incidents if they are realized. Risks may have various Severity


Fig. 7 – Resilience Domain.

Fig. 6 – Threats and Risks.

Levels (e.g., high, medium and low) in terms of negative effects and probabilities. Moreover, they affect a Domain and areConfrontedBy Risk Management. Security Management involves the use of tools, processes and standards to manage security and is described in detail in [22]. Security Tools correspond to computer hardware/software and other devices with security functions. Security Standards describe mechanisms and protocols related to security. Security Business Processes are business functions, including their configurations. As mentioned above, each of the resilience Means is supported by specific Instruments, which render it effective. The relationship between Means and Instruments is use and its inverse is areUsedBy. Therefore, Trust Management uses mainly Certification. Cognitive and Self-Management uses Policies, Context and Profiles (discussed in Section 5). Governance uses Policies. Finally, Security uses Security Services (e.g., authentication, authorization, delegation and auditing), Security Mechanisms (e.g., cryptography and access control), Security Policies (e.g., high level goals and rules), and Security Technology (e.g., DNSSEC and IPsec).
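Returning to Cognitive and Self-Management, the Monitor, Analyze, Plan, Execute cycle with Knowledge described in this subsection can be sketched as a small control loop. The sketch is illustrative only: the function names, the single load metric, the `max_load` policy and the `rebalance_traffic` action are all hypothetical and stand in for the far richer context, profiles and policies discussed in Section 5.

```python
# Minimal sketch of a Monitor-Analyze-Plan-Execute loop with Knowledge.
# All names, the load metric and the threshold policy are hypothetical.
from dataclasses import dataclass, field


@dataclass
class Knowledge:
    """Stores what each function feeds back, for reuse in later cycles."""
    records: list = field(default_factory=list)

    def feed(self, source, item):
        self.records.append((source, item))

    def past_decision(self, problem):
        # Return the most recent Plan decision recorded for this problem.
        for source, item in reversed(self.records):
            if source == "Plan" and item["problem"] == problem:
                return item["action"]
        return None


def monitor(context, policy, knowledge):
    observation = {"load": context["load"], "max_load": policy["max_load"]}
    knowledge.feed("Monitor", observation)  # Monitor feeds Knowledge ...
    return observation                      # ... and Analyze


def analyze(observation):
    # Degradation exists when the observed load exceeds the policy goal.
    return "overload" if observation["load"] > observation["max_load"] else None


def plan(problem, knowledge):
    # Reuse a past decision when one exists; otherwise use a default action.
    action = knowledge.past_decision(problem) or "rebalance_traffic"
    knowledge.feed("Plan", {"problem": problem, "action": action})
    return action


def execute(action, context, knowledge):
    # Stand-in for translating the decision into device-specific commands.
    if action == "rebalance_traffic":
        context["load"] = context["load"] / 2
    knowledge.feed("Execute", {"action": action, "result": context["load"]})


def mape_k_cycle(context, policy, knowledge):
    """Run one cycle and return the diagnosed problem (None if healthy)."""
    problem = analyze(monitor(context, policy, knowledge))
    if problem is not None:
        execute(plan(problem, knowledge), context, knowledge)
    return problem


knowledge = Knowledge()
context = {"load": 1.0}
policy = {"max_load": 0.8}
print(mape_k_cycle(context, policy, knowledge), context["load"])
```

Every function writes to the shared `Knowledge` object, which mirrors the statement that the entire control loop feeds Knowledge; only Plan reads from it in this sketch.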

3.4.

Resilience Domains

Fig. 7 presents the concepts underlying Resilience Domains. The Network is the domain that is the focus of “attention” of the resilience stakeholders. A network is typically a collection of interconnected assets (e.g., user devices, links and nodes) that enable communications, resource sharing and information exchange. However, in order to address future network requirements, such as unification and service orientation, the resilience concept should be extended to cover more domains with an end-to-end direction. Resilience needs to surpass the traditional hardware view and assume a service-oriented view. Therefore, the Service domain represents everything related to services and applications, and even the need to shift and converge towards “everything as a managed service,” which also includes “network as a service.” However, the network as a service concept will only be realized when the network and service aspects are fully integrated. Business domain aspects should be included in resilience in order to have a strong impact on industry stakeholders,


Fig. 8 – Resilience ThreatAgents.

service providers and operators. This consideration is highlighted in eTOM [33], which describes business processes in the telecommunications industry. The description covers aspects and activities ranging from customer-facing to supplier-facing and support, as well as the interactions between these processes. As stated above, assurance is a principal eTOM process that can be tightly linked to resilience. Customer may or may not be considered to be a part of the Business domain. Although the Customer domain is explicitly presented in Fig. 7, it is treated as a member of Business in this study. The resilience domains will be described in more detail in Section 4.

3.5.

Resilience ThreatAgents

Resilience ThreatAgents are shown in Fig. 8. Human is the physical person who enacts a Threat. Machine can be divided into subclasses according to its mode of operation and control such as Autonomous (i.e., acts by itself), Controlled (i.e., controlled continually by (property: isControlledBy) a Human), and Scripted (i.e., programmed by (property: isProgrammedBy) a Human but then acting in an automatic way). Autonomous Machine may itself be a threat agent because it acts according to its own intention, while Human is the actual threat agent behind Controlled and Scripted Machines. All these types of threat agents may be malicious (when there is an incentive to induce improper operation) or fault prone (when there is no incentive, but a tendency for improper operation). Finally, a significant threat agent is Nature as a result of various natural


hazards and Disasters. Indeed, Nature may be the most unpredictable and dangerous threat agent.

3.6.

Resilience Properties

Resilience Properties are shown in Fig. 9. Properties are very useful for monitoring, measuring, evaluating and taking corrective actions to achieve resilience. Since the effectiveness of a resilient network is mainly assessed through its correct operation, the properties are strongly related to the Service domain. Avižienis and colleagues [1] provide a good analysis of resilience properties, which they refer to as dependability and security attributes. Availability is the continuous provision of services and reliability is the permanence of correct services. Safety is the condition of being protected from serious consequences caused by Threats to the user and environment. Integrity corresponds to system consistency and the absence of improper system modifications. Maintainability is the ability of a system to cope with modifications while remaining in, or being restored to, normal operation. Confidentiality ensures that information is disclosed only to authorized parties. Performance refers to the quality of the offered service, capturing aspects of QoS as well as QoE. Section 2 described how these attributes, which we call resilience properties in this paper, are related to dependability and security. Avižienis and colleagues [1] specify some secondary properties such as robustness, accountability, authenticity and nonrepudiability, which they use to refine or specialize other properties according to the types of information that are available. We note that their properties are easily expressed in terms of availability, integrity and maintainability.

Fig. 9 – Resilience Properties.

3.7.

Summary

This section has described the main resilience concepts, namely Threats, Means, Domains, ThreatAgents and Properties, along with their subclasses. Some of the subclasses have their own subclasses and, thus, a hierarchy may be created in the form of a taxonomy. Needless to say, all these concepts represent a thesaurus related to resilience that may be presented in an ontology using the Web Ontology Language (OWL) [23]. Fig. 10 summarizes the main concepts in the ontological model. For reasons of simplicity, only the first level of subclasses is presented. Next, we define properties (relationships) and inverse properties for several resilience classes (concepts) without the intervention of the root class Resilience. Therefore:

▪ Domain isAffectedBy (property) Threats – Threats affect (inverse property) Domain.
▪ Domain isAttackedBy (property) ThreatAgent – ThreatAgent attack (inverse property) Domain.
▪ Means confront (property) Threats – Threats areConfrontedBy (inverse property) Means.

But also:

▪ Means areAttackedBy (property) Threats – Threats attack (inverse property) Means.
▪ ThreatAgent performs (property) Threats – Threats arePerformedBy (inverse property) ThreatAgent.
▪ Domain isProtectedby (property) Means – Means protect (inverse property) Domain.

Fig. 11 presents the classes and properties specified above. Using the classes and properties, an attempt will now be made to combine them and see how they interact in the concept of end-to-end resilience. The interactions can be built as in the following examples:

▪ Cognitive and Self-Management confronts Changes (e.g., context changes, traffic changes, PKI degradation) which arePerformedBy Human (e.g., mobility, profile change) and which (Changes) affect Network.
▪ Risk Management confronts Risks which areInducedBy Changes (e.g., outstanding business plans) which arePerformedBy Human (e.g., competition) and which (Changes) affect Business.
▪ Security Technology (e.g., protocols, DNSSEC, IPsec) isUsedBy Security which confronts Security Threats, which arePerformedBy Human (e.g., malicious user) and which (Security Threats) affect Service (e.g., DNS application), but Security Technology (e.g., protocols, DNSSEC, IPsec) areAttackedBy other Security Threats.
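To make the role of inverse properties concrete, the statements above can be held as (subject, property, object) triples from which the inverse statements are derived mechanically. The following is a minimal sketch, not an OWL implementation: the helper names `with_inverses` and `objects` are assumptions, and only three of the property pairs listed above are included.

```python
# Sketch: ontology statements as (subject, property, object) triples,
# with inverse statements derived automatically.
# Helper names are assumptions made for this example.

# Property -> inverse property, taken from the pairs listed in the text.
INVERSE = {
    "isAffectedBy": "affect",
    "confront": "areConfrontedBy",
    "performs": "arePerformedBy",
}
# Make the table symmetric so that either direction can be asserted.
INVERSE.update({inv: prop for prop, inv in list(INVERSE.items())})


def with_inverses(triples):
    """Return the given triples plus the inverse statement of each one."""
    closed = set(triples)
    for subj, prop, obj in triples:
        if prop in INVERSE:
            closed.add((obj, INVERSE[prop], subj))
    return closed


def objects(triples, subj, prop):
    """Return, in sorted order, all objects o such that (subj, prop, o) holds."""
    return sorted(o for s, p, o in triples if s == subj and p == prop)


# First interaction example from the text.
facts = with_inverses({
    ("Cognitive and Self-Management", "confront", "Changes"),
    ("Changes", "arePerformedBy", "Human"),
    ("Changes", "affect", "Network"),
})

print(objects(facts, "Network", "isAffectedBy"))  # ['Changes']
```

Asserting a statement in one direction is enough here; a query phrased with the inverse property (Network isAffectedBy what?) is answered from the derived triple.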

In order to build all these interactions, a description per domain is needed; this is provided in Section 4. The end-to-end notion will, however, become more obvious after the determination of interrelationships between all the concepts and the demonstration of the use of the ontology in a use case scenario. By building concepts and relationships that are valid in the entire taxonomy of the Domain and that encompass a variety of resilience aspects as denoted in the other main classes of Resilience (ThreatAgent, Properties, Threats and


Fig. 10 – An extract of the Resilience ontological model.

Fig. 11 – Resilience classes and properties (without the intervention of the class Resilience).

Means), the ontology will achieve the goal of the “end-to-end property,” as well as break down the silos in conventional views of resilience, where a resilience ontology is usually applicable to only one or a limited number of domains. In this sense, the scope of the ontology may be considered as the definition of ThreatAgent, Properties, Threats and Means in the taxonomy of Domain. The main limitations in the scope

of the ontology comprise the intentional focus on aspects related to future network requirements as stated in this paper and the high-level representation of some concepts for extensibility reasons. The investigation and integration of many existing efforts as well as the feedback provided by validation could guarantee the adequacy of the ontology with respect to its scope. An autonomic system using the ontology, such as an inference engine, should be able to issue a notification to the system administrator if a deficiency is found (e.g., a Threat not included in the ontology is detected) as well as allow the system administrator to update the ontology and enforce a corrective action via a Policy. This is consistent with OWL's open world assumption [30], namely that the absence of a particular statement means that the statement has not yet been made explicitly regardless of whether or not it is true. Note also that the ontology is better characterized as an integrated ontology rather than an exhaustive one. An important property when designing an ontology is the existence of disjoint classes, namely the classification of an item in one class excludes all other classes because classes do not overlap. Class disjointness, when it exists, facilitates consistency checking and the automatic evaluation of individual components with respect to the ontology. However, as discussed in [38], disjointness decisions are far from trivial for nonexpert as well as expert users. The main reasons are different semantic contexts, existence of abstract individuals, modeling of different roles of the same individual, different levels of abstraction in the ontology and possible non-disjointness in the


extensions of disjoint classes. The majority of the classes in the proposed ontology are disjoint. However, some of the classes may not be disjoint. Specific examples are provided in Section 6. The reasons are the high-level representation of concepts to enable extensibility and the existence of multiple features that characterize some objects in different semantic contexts [38]. In this effort, context information (e.g., the Domain) is used to define specific rules using the Semantic Web Rule Language (SWRL) [23], which provides more expressive relations and more powerful deductive reasoning capabilities than OWL, including inference about the categorization of an individual. In the “human-readable” syntax of SWRL, a rule has the form “antecedent → consequent,” implying that if the antecedent holds, then the consequent must also hold. Consider an example of a router produced by an ICT manufacturing company that is deployed in the core network of an operator. Using this syntax, a set of rules asserting that a Router x is classified in ICT Products in the Business domain or in the Network Elements in the Network domain is written as:

Core(?net) ∧ Router(?x) ∧ consistsOf(?net, ?x) → NetworkElements(?x)
ICTmanufacturers(?ltd) ∧ Router(?x) ∧ produce(?ltd, ?x) → ICTproducts(?x)

The iterative approach used in ontology design (i.e., specify an ontology, consider new requirements, use feedback received from validation use cases, re-specify the ontology) can compensate in the medium term for ambiguities in the definitions of classes. The definition of properties (relationships) between classes (concepts) helps express relational events among the classes. For example, as mentioned above, Cognitive and Self-Management confronts Changes. SWRL can be used to construct more complicated relations. Consider the example where Cognitive and Self-Management is the appropriate Means to Confront Changes in the Network, while Cooperation is the appropriate Means to Confront Changes in the Business. These relations can be expressed using the following SWRL rules:

Network(?net) ∧ Changes(?y) ∧ isAffectedBy(?net, ?y) ∧ CognitiveAndSelfManagement(?mape-k) → confront(?mape-k, ?y)
Business(?ltd) ∧ Changes(?y) ∧ isAffectedBy(?ltd, ?y) ∧ Cooperation(?all) → confront(?all, ?y)
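The effect of the two rules above can be illustrated without an SWRL engine by matching every antecedent by brute force. The sketch below is an assumption-laden illustration rather than a reasoner: the individuals (`net1`, `ltd1`, `traffic_change`, `market_change`, `mape_k1`, `coop1`) are hypothetical, and the helper `apply_rule` only covers rules of exactly this shape.

```python
# Sketch: evaluating the two confront rules by brute-force matching.
# The individuals and the helper name are hypothetical.
from itertools import product

# Class membership of the example individuals.
classes = {
    "Network": {"net1"},
    "Business": {"ltd1"},
    "Changes": {"traffic_change", "market_change"},
    "CognitiveAndSelfManagement": {"mape_k1"},
    "Cooperation": {"coop1"},
}

# Asserted isAffectedBy(domain, change) statements.
is_affected_by = {("net1", "traffic_change"), ("ltd1", "market_change")}


def apply_rule(domain_cls, means_cls):
    """Domain(?d) AND Changes(?y) AND isAffectedBy(?d, ?y) AND Means(?m)
    implies confront(?m, ?y); returns the inferred (?m, ?y) pairs."""
    return {
        (m, y)
        for d, y, m in product(
            classes[domain_cls], classes["Changes"], classes[means_cls]
        )
        if (d, y) in is_affected_by
    }


confront = (
    apply_rule("Network", "CognitiveAndSelfManagement")
    | apply_rule("Business", "Cooperation")
)
print(sorted(confront))
```

With these facts, the Network rule binds only the traffic change and the Business rule only the market change, so each Means is inferred to confront the Changes that affect its own Domain.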

Other examples of relationships in terms of OWL properties and SWRL rules are provided in Sections 4 and 6, respectively. This paper attempts to provide sufficient class descriptions in order to enable other ontologies to map their concepts to concepts in this ontology. The obstacles introduced by terminology in the integration effort can be overcome if adequate descriptions are available to delimit the concepts and the respective classes.

4.

Ontology description per domain

Section 3 focused mainly on the definition of resilience concepts and classes. This section focuses on the next step by specifying the properties involving the classes, and especially between Threats and Means per Domain. This specification is important because it can be used as a reference and as a shared vocabulary among resilience stakeholders on how to confront the various threats.

4.1.

Network domain

Fig. 12 shows the classical view of the Network domain with network segments. Thus, Network is divided into Access and Core, and Access is further split into Wireless and Fixed. Wireless consistsOf Access Points and User Devices that a Customer uses. Core consistsOf Routers, which are divided into Edge Routers and Core Routers. Fixed consistsOf ADSL Gateways and User Devices. However, a different view of the Network domain with respect to operator organization levels (Fig. 13) is more convenient when addressing management issues. At the top of the hierarchy are the Operational Support System (OSS) and the Business Support System (BSS). Network Management System (NMS) follows, which controls one or more management domains. Each management domain has an Element Management System (EMS) that is responsible for some Network Elements (NEs). However, an NMS may directly control an NE with an incorporated EMS (i.e., an NE with embedding capabilities). An NE may be an Access Point, a User Device or a Router. The view presented in Fig. 12 is a simplification according to current network trends. A more complete view would

Fig. 12 – Network domain (classical view).


Fig. 13 – Network domain (operator organization level view).

Fig. 14 – Heterogeneous Networks.

consider the existence of Heterogeneous Networks. Fig. 14 presents an example of a Heterogeneous Network and the role of Cooperation as a means to federate and negotiate among the different networks. A Heterogeneous Network consistsOf several Networks. Wireline and Backhaul are subclasses of Fixed. Moreover, a specific instantiation exists under each Network. Fig. 15 describes Threats and Means, along with interconnections in the Network domain. The main Threats to Network are Changes (e.g., traffic changes, mobility pattern changes, PKI degradation and user profile changes) and Interaction Conflicts among network assets. Network trends are to have different network segments and even different networks (heterogeneous networks). Moreover, some management functions, located at the same or different network assets, may conflict. This case is enforced when autonomic functions are spread over the network layout (i.e., self-organizing networks) as well as over all the Operator organization levels. Governance, Cooperation and Cognitive and Self-Management are the Means to confront both Interaction Conflicts and Changes. Governance provides the rules and policies to address network instabilities caused by Changes or Interaction Conflicts. Cooperation resolves Interaction Conflicts, but also helps in addressing Changes in a more effective way. Cognitive and Self-Management monitors the network, identifies instabilities and triggers the appropriate actions. Furthermore, the learning capability

enables the acquisition of knowledge, so that similar events can be confronted more quickly the next time they occur. Finally, Trust Management is required to build trust in the autonomous systems.

4.2.

Service domain

Fig. 16 presents Threats and Means in the Service domain. Service mainly has three kinds of Threats: Security Threats, Dependability Threats and Changes. Security Threats areConfrontedBy Security, using the corresponding Instruments. Dependability Threats areConfrontedBy Fault Management. Finally, Changes areConfrontedBy Cognitive and Self-Management, which may also be achieved in an autonomous manner by Fault Management.
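The areConfrontedBy statements given so far for the Network and Service domains lend themselves to a simple per-domain lookup, which is one way such a shared vocabulary could be consulted in practice. This is a sketch under stated assumptions: the nested-dictionary layout and the function name `means_for` are introduced here for illustration, and only the pairings stated in Sections 4.1 and 4.2 are encoded.

```python
# Sketch: which Means confront a Threat in a given Domain (areConfrontedBy).
# The nested-dictionary layout and the function name are assumptions.

NETWORK_MEANS = ["Governance", "Cooperation", "Cognitive and Self-Management"]

ARE_CONFRONTED_BY = {
    "Network": {
        "Changes": NETWORK_MEANS,
        "Interaction Conflicts": NETWORK_MEANS,
    },
    "Service": {
        "Security Threats": ["Security"],
        "Dependability Threats": ["Fault Management"],
        "Changes": ["Cognitive and Self-Management"],
    },
}


def means_for(domain, threat):
    """Return the Means listed for a threat in a domain.

    An empty list means that the pairing is not modelled in this table,
    not that no Means exists.
    """
    return list(ARE_CONFRONTED_BY.get(domain, {}).get(threat, []))


print(means_for("Service", "Dependability Threats"))
print(means_for("Network", "Changes"))
```

Returning an empty list for an unknown pairing, rather than raising an error, is a deliberate choice here: it leaves room for the caller to notify an administrator that the table needs to be extended.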

4.3.

Business and Customer domains

At first glance, the Business domain may not seem to be so relevant to Resilience. This is the reason why the definition of Business with respect to network resilience is first investigated and presented in Fig. 17. Business may be divided into the following subclasses with regard to resilience: ICT Manufacturers, Software Vendors, Service Providers, System Integrators and Telecom Operators. This set also comprises the target audience


Fig. 15 – Threats and Means in the Network domain.

Fig. 16 – Threats and Means in the Service domain.

Fig. 17 – Business and Customer domains.


Fig. 18 – Business and Scope.

of the derived resilience ontology because they are the main stakeholders affected by resilience. Furthermore, the Customer domain may be a member of the Business domain and User (i.e., end-user) is an instantiation of Customer. The properties of these businesses with ICT Products and the Network domain are also presented. ICT Manufacturers, Software Vendors and Service Providers produce and use ICT Products, while System Integrators integrate and use them. Telecom Operators use them and manage/coordinate the Network, to which the ICT Products belong. A User uses ICT Products. ICT Products are divided into Hardware (e.g., Network Elements), Middleware (e.g., Platforms) and Software (e.g., Applications and Algorithms). A Business may have a large or a small-medium Scope. According to this criterion, it may be classified as Industry or Small-Medium Enterprise (Fig. 18). A Small-Medium Enterprise isAffectedBy Threats that induce more Risks to the Small-Medium Enterprise than to Industry. Industry can absorb the Risks more easily due to its large Scope. The Business domain mainly has three kinds of Threats: Supply Chain Attacks, Changes in the business environment and Interaction Conflicts with other businesses (Fig. 19). Supply Chain Integrity Management confronts Supply Chain Attacks, while Cooperation between businesses confronts all these Threats. As we have already seen, Cooperation protects the Business and Network domains because it confronts Threats corresponding to conflicting businesses and network assets, respectively.

5.

Enabling network resilience via cognitive aspects

Section 3 highlighted the importance of Profiles, Context and Policies, which represent the Instruments of Cognitive and

Self-Management. The first two concepts allow for perception and awareness of the status of users, devices, infrastructure elements and networks, while the third facilitates governance of the network according to certain requirements and goals. This section further elaborates on these concepts by extending the ontology and providing subclasses and properties of these concepts. Profiles, Context and Policies actually represent the way to enforce Cognitive and Self-Management – a subclass of Resilience Means in Figs. 10 and 15 – in the Network domain via appropriate information exchange among the network assets. Based on this perspective, Profiles, Context and Policies can enable network resilience because they guarantee the Resilience Properties as shown in Fig. 10. Earlier research [31] has ventured in the direction of end-to-end resilience. The “end-to-end property” is also valid here because the derived ontology of Profiles, Context and Policies is independent of the network domain and applicable to any kind of network shown in Fig. 12. A discussion of how Profiles, Context and Policies enable end-to-end resilience is provided in Section 6, which also provides an instantiation of the proposed ontology in a specific deployment scenario.

5.1.

Profiles

Profiles comprise User Profile, User Device Profile and Element Profile. User Profile contains information about the user subscription, in terms of identifiers, user class, services to which the user has subscribed, including the corresponding QoS levels per service and the user preferences regarding service provisioning. These user preferences are expressed via the utility value per service and QoS level. This information is obtained and updated dynamically following a process similar to that described in [32].

Fig. 19 – Threats and Means in the Business domain.

User Device Profile includes information about the capabilities of a device, such as the services that can be supported by the device and the various network interfaces of the device. It is also associated with the specific user profile of the user who uses the device. The network interface information consists of the corresponding radio access technologies (RAT) and the frequencies per RAT. Element Profile is a superclass of Access Point Profile in [31]. This extension is required in order to capture end-to-end resilience; it enables the consideration of multiple network segments in a unified manner. Thus, instead of having only a number of reconfigurable transceivers, with the classical wireless view of RATs and frequencies, Element Profile contains information about the supported network interfaces in an NE, covering, for example, the case of a router and its capabilities in terms of capacity, energy consumption, etc.
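The three kinds of Profiles can be pictured as plain data records. The Python sketch below is a minimal illustration, not part of the ontology: all field names (e.g., `subscribed_qos`, `utility`, `capacity_mbps`) and the helper `preferred_qos` are assumptions introduced here, chosen to mirror the information items listed above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class UserProfile:
    user_id: str
    user_class: str
    # Service name -> QoS levels the user has subscribed to.
    subscribed_qos: Dict[str, List[str]] = field(default_factory=dict)
    # (service, QoS level) -> utility value expressing the user preference.
    utility: Dict[Tuple[str, str], float] = field(default_factory=dict)

    def preferred_qos(self, service: str) -> Optional[str]:
        """Return the subscribed QoS level with the highest utility, if any."""
        levels = self.subscribed_qos.get(service, [])
        return max(levels, key=lambda q: self.utility.get((service, q), 0.0), default=None)


@dataclass
class NetworkInterface:
    rat: str                      # radio access technology or wired technology
    frequencies_mhz: List[float]  # empty for non-radio interfaces


@dataclass
class UserDeviceProfile:
    device_id: str
    supported_services: List[str]
    interfaces: List[NetworkInterface]
    user_profile: UserProfile     # profile of the user who uses the device


@dataclass
class ElementProfile:
    element_id: str
    interfaces: List[NetworkInterface]
    capacity_mbps: Optional[float] = None   # e.g., for a router
    energy_watts: Optional[float] = None


# Usage: a user who values high-quality video more than low-quality video.
user = UserProfile(
    user_id="u1",
    user_class="gold",
    subscribed_qos={"video": ["low", "high"]},
    utility={("video", "low"): 0.4, ("video", "high"): 0.9},
)
print(user.preferred_qos("video"))
```

Because Element Profile carries generic interfaces rather than only RATs and frequencies, the same record can describe an access point or a core router.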

5.2.

Context

Context comprises User Device Status, Element Context and Element Configuration. User Device Status is required in combination with policies in order to decide about the optimal device configuration. Also, it can be used to inform the network side about a new configuration. It includes the current configuration of the device, running services, active user profile and the most recently acquired user preferences. Element Context contains information about the current load of each active network interface. It may also include information about the energy consumption for green purposes. When necessary, for example for access NEs, the load should be given per user class, service and QoS level. Element Context is used for decision making about the optimal configuration of a specific NE or a certain service area of the network. Element Configuration determines the current configuration of an NE, including information about the active network interfaces, resources allocated to them and, if required (e.g., for NEs in access networks), the services and QoS levels that can be supported. The resource allocation may be specified on a per-resource basis (e.g., per subcarrier or per path).
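The per-interface load carried by Element Context, optionally broken down per user class, service and QoS level, can be sketched as a nested mapping. The following Python fragment is an illustrative assumption: the class layout, the field names and the unit (Mbit/s) are not prescribed by the ontology.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# (user class, service, QoS level), the breakdown suggested for access NEs.
LoadKey = Tuple[str, str, str]


@dataclass
class ElementContext:
    element_id: str
    # Interface name -> {(user class, service, QoS level) -> load in Mbit/s}.
    load: Dict[str, Dict[LoadKey, float]] = field(default_factory=dict)
    energy_watts: Optional[float] = None  # optional, for green purposes

    def interface_load(self, interface: str) -> float:
        """Total load on one active interface (0 if the interface is unknown)."""
        return sum(self.load.get(interface, {}).values())

    def class_load(self, user_class: str) -> float:
        """Total load generated by one user class across all interfaces."""
        return sum(
            value
            for per_interface in self.load.values()
            for (cls, _service, _qos), value in per_interface.items()
            if cls == user_class
        )


# Usage: an access point reporting the load of its single OFDM interface.
context = ElementContext(
    element_id="ap1",
    load={"ofdm0": {("gold", "video", "high"): 12.0, ("silver", "video", "low"): 3.0}},
)
print(context.interface_load("ofdm0"), context.class_load("gold"))
```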

5.3.

Policies

Policies consist of Business Level Entries, High-Level Policies (Associations) and Configuration Policies, corresponding to different operation, administration and maintenance (OA&M) levels. Business Level Entries are provided by the operator and/or service provider at the business level and comprise information related to the number of users anticipated for an application, user class, location and time zone. High-Level Policies (Associations) specify rules about the relationships of applications with user classes and quality levels, relationships of an application with other applications and relationships between user classes. Each Association comprises information about a set of applications. Each application may be associated with one or more user classes. Each user class may be associated with one or more QoS levels. Each QoS level is associated with one or more QoS parameters. Configuration Policies specify rules or constraints that should be taken into account when selecting the optimal configuration of a service area, NE or user device. In this sense, Configuration Policies refine the information comprised in Profiles and Context. Configuration Policies should be derived from Business Level Entries and High-Level Policies (Associations). A configuration policy consists of a compound policy condition and a set of policy configurations. A compound policy condition comprises a logical expression (e.g., AND, OR, XOR) and one or more compound policy conditions or policy conditions. A policy condition incorporates a policy expression (e.g. “equals,” “greater than” and “greater than or equal”) and a policy argument. A policy argument may include a user class, location and time zone information, and basically indicates the devices that are affected by the specific policy.

A policy configuration indicates the network interfaces that can be operated by network elements, as well as configuration parameters in a certain service area. It may also specify the services and corresponding QoS levels that can be provided over certain network interfaces.
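The recursive structure of a configuration policy (a compound policy condition built from nested conditions, plus a set of policy configurations) lends itself to a small evaluator. The Python sketch below is one possible reading and makes several assumptions that are labeled in the code: the `attribute` field, the dictionary of facts and the multi-operand interpretation of XOR are introduced here for illustration and are not defined in the ontology.

```python
import operator
from dataclasses import dataclass
from typing import Any, Dict, List, Union

# Policy expressions named in the text, mapped to Python comparison operators.
EXPRESSIONS = {
    "equals": operator.eq,
    "greater than": operator.gt,
    "greater than or equal": operator.ge,
}


@dataclass
class PolicyCondition:
    attribute: str   # assumption: which fact (user class, location, ...) is tested
    expression: str  # one of the keys of EXPRESSIONS
    argument: Any    # the policy argument to compare against

    def holds(self, facts: Dict[str, Any]) -> bool:
        return EXPRESSIONS[self.expression](facts[self.attribute], self.argument)


@dataclass
class CompoundPolicyCondition:
    logic: str  # "AND", "OR" or "XOR"
    operands: List[Union["CompoundPolicyCondition", PolicyCondition]]

    def holds(self, facts: Dict[str, Any]) -> bool:
        results = [operand.holds(facts) for operand in self.operands]
        if self.logic == "AND":
            return all(results)
        if self.logic == "OR":
            return any(results)
        if self.logic == "XOR":
            # Assumption: with several operands, XOR holds when an odd number are true.
            return sum(results) % 2 == 1
        raise ValueError(f"unknown logical expression: {self.logic}")


@dataclass
class ConfigurationPolicy:
    condition: CompoundPolicyCondition
    configurations: List[Dict[str, Any]]  # policy configurations

    def applicable(self, facts: Dict[str, Any]) -> List[Dict[str, Any]]:
        """Return the policy configurations if the condition holds, else nothing."""
        return self.configurations if self.condition.holds(facts) else []


# Usage: gold users who are at the stadium or connect after 20:00.
policy = ConfigurationPolicy(
    condition=CompoundPolicyCondition("AND", [
        PolicyCondition("user_class", "equals", "gold"),
        CompoundPolicyCondition("OR", [
            PolicyCondition("location", "equals", "stadium"),
            PolicyCondition("hour", "greater than or equal", 20),
        ]),
    ]),
    configurations=[{"interface": "OFDM", "service": "video", "qos": "high"}],
)
print(policy.applicable({"user_class": "gold", "location": "stadium", "hour": 18}))
```

Nesting compound conditions inside compound conditions is what allows arbitrarily deep logical expressions while keeping each leaf a simple comparison against a policy argument.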

6.

Case study

This section describes the use of the end-to-end resilience ontology in a case study. The case study highlights the added value of the ontology and the notion of the “end-to-end” property. Since the ontology design is based on future network requirements as specified in [36], a corresponding case study is used, namely “operator-governed, end-to-end, autonomic, joint network and service management” [35].

6.1.

Storyline

A telecom operator wishes to deploy new services and/or accommodate new traffic on top of its multi-vendor and multi-technology network, which incorporates wireless access (OFDM-based) and core (MPLS-based) segments. The scenario involves rock concert attendees who launch a real-time video application in order to share the event with their friends and family in various locations. The application requires an end-to-end connection, starting from the smart phones of the video stream emitter and ending at the smart phones of the receivers or mobile TVs. This situation implies that the operator either introduces a new service or application in the network or, in the case of an already-deployed service, introduces a new requested load in a specific geographical region and for a specific time period with concrete QoS and QoE attributes. Current solutions comprise network planning and OA&M processes. On one hand, network planning handles all the potentially changing situations (time-variant traffic demand, occurrence of faults, mobility and radio conditions) considering the worst-case (most demanding) scenario, which leads to unnecessary over-provisioning of resources (e.g., network elements and bandwidth). On the other hand, traditional OA&M processes have to deal with system heterogeneity, with respect to the technology and the vendor of the technology, and in general, they fail to achieve end-to-end optimization in the network due to loose integration or no integration. Furthermore, they rely on processes that are not fully automated, which negatively impacts the time required for (re)configuring the infrastructure and necessitates human intervention for cross-technology configurations, leading to errors, delays and increased operational expenditures. Finally, both the heterogeneity and the lack of full automation negatively affect the cost of service management and customer relations management.
The use case reports on the above problems and calls for solutions that could provide a unified (call for unification), goal-based (call for governance), autonomic (call for autonomicity) management system for the service deployment and/or new traffic accommodation on top of heterogeneous networks encompassing wireless access and core segments (call for coordination).

6.2.

Proof of concept description

This use case addresses heterogeneity at a variety of levels, targeting an end-to-end network topology that considers access and core segments, as well as application domains (servers), covering equipment (from the operator console down to the end-user device) that may come from different vendors, along with diverse management tools and systems. This use case is ideal for demonstrating the proposed end-to-end resilience ontology. However, since this use case represents only one (but nevertheless representative) application scenario of the ontology, only a subset of the ontology concepts and properties is present. Domain is the concept that makes an impact in the direction of the end-to-end resilience. Careful examination of the storyline reveals several domain concepts. In the Business domain, the main actor in the use case is the Telecom Operator that wishes to describe its business goals and manage/coordinate its Network. The Telecom Operator is also the Service Provider of the real-time video application. The attendees and their friends/family correspond to the Customer, who represents the User that uses ICT products such as smart phones. The different vendors and management tools and systems are modeled as ICT Manufacturers and Software Vendors that produce ICT Products. The Network that the Telecom Operator wishes to govern comprises both Access (specifically Wireless) and Core. The Network belongs to Heterogeneous Networks. Wireless Access consists of User Devices that the Customer uses and Access Points. As stated in Section 3.7, an individual may simultaneously have different roles and features but in a well-defined context. For instance, in many markets (the specific market is the context in this example), the Telecom Operator and Service Provider are not considered to be disjoint classes because the operator of the Network is actually the provider of the Service.
We recommend the use of SWRL rules to reason about the categorization of an individual in a class when complicated relations exist. These rules use information from the context, which also identifies differences in the use of classes. Core consists of Routers. User Devices, Access Points and Routers are ICT Products while Access Points and Routers are Elements of the Network; this statement is an example of classes in the ontology that are not disjoint. Section 3.7 presented a pair of SWRL rules to help decide if a router is to be placed in ICT Products or Network Elements based on context information from the Domain. Network Elements belong to an EMS and an NMS is responsible for managing one or more EMSs. The Network of the Telecom Operator is affected by two kinds of Threats in the use case. Interaction Conflicts are due to possible incompatibilities between the QoS offered by the Wireless Access and Core Network. Changes comprise potential changing situations, namely time-variant traffic demand, occurrence of Faults, mobility and radio conditions. This is another example of non-disjoint classes because the presence of Faults may result in Changes. The SWRL rule may be written as:

Faults(?x) ∧ Service(?serv) ∧ Network(?net) ∧ isAffectedBy(?serv, ?x) ∧ isDeployedIn(?serv, ?net) → Changes(?x) ∧ isAffectedBy(?net, ?x)

Two ThreatAgents may be identified. The Human attacks the Network if the Telecom Operator has inconsistencies in business goals or manual configurations. In addition, User Devices, Access Points and Routers may produce Faults during their operation. The inference engine calls for Cooperation to confront Interaction Conflicts and for Cognitive and Self-Management to confront Changes. Faults are confronted by Fault Management, but as shown in the ontology, Fault Management is achieved by Cognitive and Self-Management (classes that are also not disjoint). Thus, Cognitive and Self-Management is used to confront both Changes and Faults. The SWRL rule with respect to Faults may be written as:

Faults(?x) ∧ FaultManagement(?fm) ∧ CognitiveAndSelfManagement(?mape-k) ∧ isAchievedBy(?fm, ?mape-k) → areConfrontedBy(?x, ?mape-k)

Cooperation relies on mechanisms to fine-tune the offers from the Wireless Access and Core segments in order to achieve coherence. Cognitive and Self-Management uses Policies, Context and Profiles.

Fig. 20 presents the use case workflow. The workflow is summarized as follows. The Telecom Operator through Governance describes and enforces its business goals. Governance uses Policies, more specifically, Business Level Entries, which include the load demanded in the specific geographical region during the time period. In order to derive technology (network) specific requirements (e.g., required bit rate), Governance also uses High-Level Policies (Associations) and then issues commands through Configuration Policies. When an Element (e.g., Access Point or Router) identifies a problem situation (e.g., Changes), it sends an Element Context and Element Configuration to the NMS. The NMS via Cognitive and Self-Management takes into account the Element Profile and decides on the optimal network configuration to handle the Threat. As a result, the NMS sends a new Element Configuration to the Element and the corresponding Configuration Policies to the User Device.
The User Device via Cognitive and Self-Management receives the Configuration Policies, takes into account the User Device Profile, User Profile and User Device Status and then decides on the optimal device configuration. Finally, a User Device Status is sent to the Element and NMS.
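The effect of the two SWRL rules given in this subsection can be mimicked with a few lines of Python operating on sets of class memberships and property assertions. This is only a sketch of what an inference engine would derive, under stated assumptions: it is not a SWRL or OWL reasoner, the function `apply_rules` and the individual names (`linkFailure`, `video`, `opNet`, `fm`, `mapek`) are hypothetical, and each rule is applied in a single pass.

```python
from collections import defaultdict


def apply_rules(classes, props):
    """Apply the two rules once and return the extended facts.

    classes: class name -> set of individuals
    props:   property name -> set of (subject, object) pairs
    """
    classes = defaultdict(set, {name: set(members) for name, members in classes.items()})
    props = defaultdict(set, {name: set(pairs) for name, pairs in props.items()})

    # Rule 1: Faults(x), Service(s), Network(n), isAffectedBy(s, x),
    #         isDeployedIn(s, n)  ->  Changes(x), isAffectedBy(n, x)
    for x in list(classes["Faults"]):
        for service, threat in list(props["isAffectedBy"]):
            if threat != x or service not in classes["Service"]:
                continue
            for deployed, network in list(props["isDeployedIn"]):
                if deployed == service and network in classes["Network"]:
                    classes["Changes"].add(x)
                    props["isAffectedBy"].add((network, x))

    # Rule 2: Faults(x), FaultManagement(fm), CognitiveAndSelfManagement(m),
    #         isAchievedBy(fm, m)  ->  areConfrontedBy(x, m)
    for x in list(classes["Faults"]):
        for fm, m in list(props["isAchievedBy"]):
            if fm in classes["FaultManagement"] and m in classes["CognitiveAndSelfManagement"]:
                props["areConfrontedBy"].add((x, m))

    return classes, props


# Usage: a link failure that affects the video service deployed in the operator network.
facts_classes = {
    "Faults": {"linkFailure"},
    "Service": {"video"},
    "Network": {"opNet"},
    "FaultManagement": {"fm"},
    "CognitiveAndSelfManagement": {"mapek"},
}
facts_props = {
    "isAffectedBy": {("video", "linkFailure")},
    "isDeployedIn": {("video", "opNet")},
    "isAchievedBy": {("fm", "mapek")},
}
inferred_classes, inferred_props = apply_rules(facts_classes, facts_props)
print(sorted(inferred_classes["Changes"]), sorted(inferred_props["areConfrontedBy"]))
```

The first rule is what makes Faults and Changes non-disjoint: the same individual ends up in both classes once the affected service is known to be deployed in the network.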

As discussed above, Cognitive and Self-Management is also responsible for decisions (Analyze and Plan), configurations (Execute) and assurance (Monitor). The Monitor function is responsible for guaranteeing the desired QoS and QoE. These attributes are associated with Performance, one of the Resilience Properties. Achievable throughput is a typical Performance Metric for Wireless Access. Bandwidth, resource utilization and energy consumption are typical Performance Metrics for Core. Maintainability is also a relevant Property. Maintainability Metrics may include the time needed to resolve incompatibilities between segments and time to reach a mutually accepted (re)configuration decision. Finally, the deviation from the requested QoS and QoE belongs to Reliability Metrics. In summary, this section demonstrates how the ontology deals with Threat Agents, Properties, Threats and Means in the taxonomy of the Domain. The end-to-end property is justified by the fact that the ontology integrates resilience concepts in a domain-agnostic manner while addressing problems arising from heterogeneous, non-integrated domains. As stated above, the implementation scheme employs a set of SWRL rules in conjunction with classes and properties written in OWL. The ontology provides all the necessary constructs in terms of classes and properties to enable it to be (re)usable and extendable to any number of use case scenarios and contexts by a variety of stakeholders, who may add classes, properties and rules without disrupting the ontology.
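The Monitor, Analyze, Plan and Execute functions mentioned above can be sketched as one iteration of a control loop. The Python fragment below is a toy illustration under explicit assumptions: the paper does not define a formula for the deviation from the requested QoS, so the relative throughput shortfall used here, the tolerance of 5% and the dictionary keys (`achieved_mbps`, `resource_units`) are all hypothetical.

```python
def monitor(context):
    """Monitor: read the achieved throughput (Mbit/s) from the element context."""
    return context["achieved_mbps"]


def analyze(achieved_mbps, requested_mbps):
    """Analyze: relative shortfall from the requested QoS (0.0 means the target is met)."""
    return max(0.0, (requested_mbps - achieved_mbps) / requested_mbps)


def plan(deviation, config, tolerance=0.05, step=1):
    """Plan: propose one more resource unit when the shortfall exceeds the tolerance."""
    if deviation > tolerance:
        return {**config, "resource_units": config["resource_units"] + step}
    return config


def execute(element, new_config):
    """Execute: install the new element configuration."""
    element["config"] = new_config


def mape_iteration(element, requested_mbps):
    """Run one Monitor-Analyze-Plan-Execute pass and return the measured deviation."""
    achieved = monitor(element["context"])
    deviation = analyze(achieved, requested_mbps)
    execute(element, plan(deviation, element["config"]))
    return deviation


# Usage: an access point delivering 8 Mbit/s when 10 Mbit/s were requested.
access_point = {"context": {"achieved_mbps": 8.0}, "config": {"resource_units": 4}}
shortfall = mape_iteration(access_point, requested_mbps=10.0)
print(shortfall, access_point["config"])
```

In the terms of the ontology, the value returned by `analyze` plays the role of a Reliability Metric, while the monitored throughput is a Performance Metric for Wireless Access.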

Fig. 20 – Use case workflow.

7.

Conclusions

The network resilience concepts presented in this paper take into account current gaps in the understanding of network resilience as well as future network requirements. An overview of the current situation revealed the need to break down the different silos regarding the notion of resilience and adopt an end-to-end resilience concept. Future network requirements, in particular, governance, unification, service orientation, autonomicity, orchestration and coordination, and intelligence embodiment, should be taken into account when redefining the notion of resilience. The ontology proposed in this paper is intended to provide a shared conceptualization of resilience. It captures five aspects that are the lynchpins of resilience: Threats, Means, Domains, ThreatAgents and Properties. These concepts were analyzed in detail in terms of their subclasses and the subclasses were further decomposed into other subclasses, forming a hierarchy and, eventually, a taxonomy. The classes and subclasses collectively represent a thesaurus about resilience.

Another key aspect is the investigation of interrelationships existing between the various concepts (classes). These interrelationships, which are specified as OWL properties in the ontology, were derived and elaborated on a resilience Domain basis. A description was provided for each Domain, to ensure domain specificity and facilitate the identification of the connections among the defined concepts, especially among Threats and Means. This enables the construction of a common framework for resilience, which also provides a shared view regarding the confrontation of a Threat. This is crucial in an environment that requires interoperability, collaboration and aggregation of Means among different stakeholders to effectively handle resilience issues.

Resilience is enabled by cognitive aspects. After defining the related concepts and interactions, the paper presented a common structure of the information exchanged among the various components of a network, even in a business network, in order to provide resilience. Although the components may remain implementation-specific, the information exchange among them can be formally defined or standardized. This was accomplished by categorizing the information exchanged into Profiles, Context and Policies.
Previous work [31] was leveraged to develop a unified and end-to-end view of resilience, which encompasses different network segments (e.g., wireless access and core) and different architectures (e.g., autonomic and non-autonomic). Finally, the end-to-end resilience ontology was validated through a use case scenario. This use case highlights the added value provided by the ontology as well as the “end-to-end” property. The classes and properties of the ontology that are related to the use case were described in detail. Furthermore, the high-level logical flows of a service deployment and a problematic situation were presented.

The resilience ontology is designed to be applicable in several areas of interest and perhaps even influence resilience stakeholders, primarily from industry. In order to be able to reach a wider audience, it is essential to ensure standardization of the resilience ontology and the methods used to derive it. The ETSI TISPAN Group [11] will work on this effort with regard to the core network aspects and the ETSI MTS Group [10] will attempt to formalize the methods used to develop taxonomies and ontologies. Current standards definitions do not consider interactions at the level of complexity that can be examined using ontologies and taxonomies. This is somewhat surprising because standards, like ontologies, seek to communicate knowledge to audiences in a comprehensive and non-ambiguous manner.

Our future research will focus on standardization, specifically the elaboration of resilience concepts and interactions in order to identify missing classes and properties that would allow resilience to be modeled for all network variants. Also, we intend to examine how each domain influences the other domains, thus clarifying the relationships existing between domains instead of merely investigating resilience within each domain. This will constitute an important step towards unification, a preeminent requirement of future networks.

Acknowledgment

The research was supported by ENISA Project P/30/10/TCD “Ontology and Taxonomies for Resilience.” The authors express their appreciation to the members of the ETSI TISPAN Project Group and the ENISA CIIP/Resilience Group for their contributions during the course of this research.

References

[1] A. Avižienis, J. Laprie, B. Randell, C. Landwehr, Basic concepts and taxonomy of dependable and secure computing, IEEE Transactions on Dependable and Secure Computing 1 (1) (2004) 11–33.
[2] C. Basile, A. Lioy, S. Scozzi, M. Vallini, Ontology-based security policy translation, in: Proceedings of the Second International Workshop on Computational Intelligence in Security for Information Systems, 2009, pp. 117–126.
[3] M. Choraś, R. Kozik, A. Flizikowski, W. Hołubowicz, Ontology applied in decision support system for critical infrastructure protection, in: Proceedings of the Twenty-Third International Conference on Industrial and Other Applications of Applied Intelligent Systems, 2010, pp. 671–680.
[4] M. Choraś, R. Kozik, A. Flizikowski, R. Renk, W. Hołubowicz, Ontology-based decision support for security management in heterogeneous networks, in: Proceedings of the Fifth International Conference on Intelligent Computing, 2009, pp. 920–927.
[5] European Network and Information Security Agency, ENISA, Heraklion, Greece 〈www.enisa.europa.eu〉.
[6] European Network and Information Security Agency, Resilience Features of IPv6, DNSSEC and MPLS and Deployment Scenarios, Heraklion, Greece, 2008 〈www.enisa.europa.eu/act/it/library/deliverables/res-feat〉.
[7] European Network and Information Security Agency, Gaps in Standardization Related to Resilience of Communication Networks, Heraklion, Greece, 2009 〈www.enisa.europa.eu/act/it/library/deliverables/gapsstd〉.
[8] European Network and Information Security Agency, Priorities for Research on Current and Emerging Network Technologies, Heraklion, Greece, 2010 〈www.enisa.europa.eu/act/it/library/deliverables/procent〉.
[9] European Network and Information Security Agency, Enabling and Managing End-to-End Resilience, Heraklion, Greece, 2011 〈www.enisa.europa.eu/act/it/library/deliverables/e2eres〉.
[10] European Telecommunications Standards Institute, Methods for Testing and Specification, Sophia-Antipolis, France 〈www.etsi.org/mts〉.
[11] European Telecommunications Standards Institute, Telecommunications and Internet Converged Services and Protocols for Advanced Networking (TISPAN), Sophia-Antipolis, France 〈www.etsi.org/tispan〉.
[12] European Telecommunications Standards Institute, Telecommunications and Internet Converged Services and Protocols for Advanced Networking (TISPAN), NGN Security, Report on Issues Related to Security in Identity Management and Their Resolution in the NGN, Technical Report TR 187 010, Sophia-Antipolis, France, 2008.
[13] I. Fovino, M. Masera, Emergent disservices in interdependent systems and system-of-systems, in: Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, 2006, pp. 590–595.
[14] R. Franco, G. Prats, R. de Juan-Marín, An ontology proposal for resilient multi-plant networks, in: K. Popplewell, J. Harding, R. Poler, R. Chalmeta (Eds.), Enterprise Interoperability IV, Springer, London, United Kingdom, 2010, pp. 169–178.
[15] A. Ganek, T. Corbi, The dawning of the autonomic computing era, IBM Systems Journal 42 (1) (2003) 5–18.
[16] INSPIRE Project, Increasing Security and Protection Through Infrastructure Resilience, Consorzio Interuniversitario Nazionale per l'Informatica (CINI), University of Roma La Sapienza, Rome, Italy 〈www.inspire-strep.eu〉.
[17] INTERSECTION Project, Consorzio Interuniversitario Nazionale per l'Informatica (CINI), University of Roma La Sapienza, Rome, Italy 〈www.intersection-project.eu〉.
[18] T. Jansen, M. Amirijoo, U. Türke, L. Jorguseski, K. Zetterberg, R. Nascimento, L. Schmelz, J. Turk, I. Balan, Embedding multiple self-organization functionalities in future radio access networks, in: Proceedings of the Sixty-Ninth IEEE Vehicular Technology Conference, 2009.
[19] J. Kephart, D. Chess, The vision of autonomic computing, IEEE Computer 36 (1) (2003) 41–50.
[20] J. Laprie, Resilience for the scalability of dependability, in: Proceedings of the Fourth IEEE International Symposium on Network Computing and Applications, 2005, pp. 5–6.
[21] J. Laprie, From dependability to resilience, in: Proceedings of the Thirty-Eighth IEEE/IFIP International Conference on Dependable Systems and Networks, 2008.
[22] X. Li, C. Chandra, J. Shiau, Developing a taxonomy and model for security-centric supply chain management, International Journal of Manufacturing Technology and Management 17 (1/2) (2009) 184–212.
[23] J. López de Vergara, A. Guerrero, V. Villagrá, J. Berrocal, Ontology-based network management: study cases and lessons learned, Journal of Network and Systems Management 17 (3) (2009) 234–254.
[24] J. López de Vergara, E. Vázquez, A. Martin, S. Dubus, M. Lepareux, Use of ontologies for the definition of alerts and policies in a network security platform, Journal of Networks 4 (8) (2009) 720–733.
[25] J. Lopez de Vergara, V. Villagra, C. Fadon, J. Gonzalez, J. Lozano, M. Alvarez-Campana, An autonomic approach to offer services in OSGi-based home gateways, Computer Communications 31 (13) (2008) 3049–3058.
[26] J. Lozano, A. Castro, J. Gonzalez, J. Lopez de Vergara, V. Villagra, V. Olmedo, Autonomic provisioning model for digital home services, in: Proceedings of the Third IEEE International Workshop on Modeling Autonomic Communications Environments, 2008, pp. 114–119.
[27] REMPLANET Project, Resilient Multi-Plant Networks, Polytechnic University of Valencia, Valencia, Spain 〈www.remplanet.eu/web〉.
[28] ReSIST Project, Resilience for Survivability in IST, Laboratory for Analysis and Architecture of Systems, Toulouse, France 〈www.resist-noe.org/overview/summary.html〉.
[29] ReSIST Project, Deliverable D34: Resilience Ontology: Final, Laboratory for Analysis and Architecture of Systems, Toulouse, France, 2008 〈www.resist-noe.org/Publications/Deliverables/D34-Resilience_Ontology_Final.pdf〉.
[30] S. Russell, P. Norvig, Artificial Intelligence: A Modern Approach, Prentice Hall, Upper Saddle River, New Jersey, 2009.
[31] V. Stavroulaki, N. Koutsouris, K. Tsagkaris, P. Demestichas, A platform for the integration and management of cognitive systems in future networks, in: Proceedings of the IEEE International Workshop on Management of Emerging Networks and Services – IEEE GLOBECOM Workshop on Management of Emerging Networks and Services, 2010, pp. 492–497.
[32] V. Stavroulaki, Y. Kritikou, E. Darra, Acquiring and learning user information in the context of cognitive device management, in: Proceedings of the IEEE International Conference on Communications Workshops, 2009.
[33] TM Forum, Business Process Framework (eTOM), Morristown, New Jersey 〈www.tmforum.org/BusinessProcessFramework/1647/home.html〉.
[34] J. Undercoffer, A. Joshi, J. Pinkston, Modeling computer attacks: An ontology for intrusion detection, in: Proceedings of the Sixth International Symposium on Recent Advances in Intrusion Detection, 2003, pp. 113–135.
[35] UniverSelf Project, Case Study on Operator-Governed, End-to-End, Autonomic, Joint Network and Service, Alcatel-Lucent Bell Labs France, Villarceaux, France, 2011.
[36] UniverSelf Project, UniverSelf, Alcatel-Lucent Bell Labs France, Villarceaux, France 〈www.univerself-project.eu〉.
[37] M. Uschold, M. Gruninger, Ontologies: principles, methods and applications, The Knowledge Engineering Review 11 (2) (1996) 93–136.
[38] J. Völker, D. Vrandečić, Y. Sure, A. Hotho, Learning disjointness, in: Proceedings of the Fourth European Semantic Web Conference, 2007, pp. 175–189.