Accepted Manuscript

Cooperative Security in Distributed Networks
Oscar Garcia-Morchon, Dmitriy Kuptsov, Andrei Gurtov, Klaus Wehrle

PII: S0140-3664(13)00107-2
DOI: http://dx.doi.org/10.1016/j.comcom.2013.04.007
Reference: COMCOM 4808
To appear in: Computer Communications
Please cite this article as: O. Garcia-Morchon, D. Kuptsov, A. Gurtov, K. Wehrle, Cooperative Security in Distributed Networks, Computer Communications (2013), doi: http://dx.doi.org/10.1016/j.comcom.2013.04.007
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Cooperative Security in Distributed Networks
Oscar Garcia-Morchon (a), Dmitriy Kuptsov (b), Andrei Gurtov (b,d), Klaus Wehrle (c)
(a) Distributed Sensor Systems, Philips Research Europe, Eindhoven, The Netherlands
(b) Aalto University, HIIT, Finland
(c) Distributed Systems Group, RWTH Aachen University, Germany
(d) Center for Wireless Communication, University of Oulu, Finland
Abstract

We consider a distributed network in which faulty nodes can pose serious threats because they can subvert the correct operation of basic functionalities such as routing or data aggregation. To counter such nodes, we suggest that trust management between nodes is an essential part of a distributed system. In particular, benign nodes shall communicate with trusted nodes only, and misbehaving nodes must be rapidly removed from the system. This paper formalizes the concept and properties of cooperative security, a protocol that implements trust management by means of two voting procedures. During the first voting procedure (admission), each node gains trust by distributing revocation information to its neighbors. These neighbors form the node’s trusted entourage. If the node cooperates and discloses enough information, it is admitted and can communicate with the rest of the network; otherwise it is rejected. If the admitted node tries to endanger the network, the second voting procedure (revocation) takes place. In this case, if the node’s entourage agrees that the node has misbehaved, its members revoke the node network-wide using the previously disclosed revocation information.

Keywords: Distributed networks, cooperative security, Byzantine failures, revocation and trust management, voting and consensus.

1. Introduction

Nowadays distributed systems play a key role in many areas by performing many different types of tasks without a central entity. An example is a distributed smart environment comprising smart objects communicating over wireless links. Each node in a distributed system has to behave in a fair way, i.e., as expected, to ensure the correct operation of the distributed system, since faulty nodes can disrupt basic functionality such as routing or time synchronization protocols. However, we cannot prevent nodes from becoming faulty, whether through compromise by an attacker or simply through buggy software, battery depletion, or hardware failures.
Our goal in this paper is to deal with these faulty nodes by means of a distributed protocol such that (i) a node only communicates with other nodes that are trustworthy and (ii) nodes that do not behave correctly are rapidly identified and isolated. In this context, we describe Efficient Cooperative Security (ECoSec)†, a distributed and adaptive protocol that allows a network to control the admission and revocation of nodes. The protocol is built around the concept of cooperative security [14], in which correct network operation is ensured by enforcing node cooperation and mutual monitoring. If a node does not disclose its own revocation information, then a subset of the node’s neighbors does not authorize it to join the network, and the malicious node cannot cause any damage. If the node is cooperative but tries to attack the system, the protocol ensures that the node’s entourage has received enough information to rapidly and reliably isolate the attacker in the whole network.

ECoSec and the underlying concept of cooperative security are useful in a distributed system for two main reasons. First, network nodes can set up a trusted cluster (for which we give a precise definition in Section 4) in a simple way. This feature guarantees that each node only communicates with other honest nodes, ensuring correct network operation. Second, the protocol allows for distributed (and unattended) revocation, so that if a node misbehaves it is removed and any other node can verify this decision.

This work includes the following contributions:
• We provide a formal definition of the cooperative security concept and its desired properties.
• We present our Efficient Cooperative Security Protocol. Our comprehensive analysis reveals several improvements over previous works. In particular, the presented protocol can endure a higher fraction of compromised nodes in the network than its predecessor, the cooperative security protocol in [14]. Besides, the underlying keying material structure of the protocol allows reusing information across the two voting procedures, and thus reduces the network bandwidth and CPU overhead involved in such agreement procedures when compared with related approaches.

† A very brief description of the protocol and its properties was peer reviewed and published in [19].

Preprint submitted to Elsevier, May 15, 2013
The rest of this paper is organized as follows. Section 2 compares our work with previous studies. Section 3 formalizes the problem and states the assumptions under which we solve it. Section 4 describes the high-level protocol operation and its key components, and Section 5 describes the protocol operation in detail. Section 6 analyzes the correctness of the protocol, discusses its configuration and the utility of our solution, and gives an example of its usage in real-world applications. Section 7 concludes the paper.
2. Related Work

Prior revocation protocols had several limitations with respect to scalability (in terms of memory storage and communication complexity) and erroneous admission and revocation decisions. The cooperative security protocol [14] proposed an approach to deal with some of these issues, in particular the memory storage, but it is limited to rather low ratios of compromised devices that the system can endure. These problems are addressed by the ECoSec protocol. Overall, ECoSec is a cooperative security protocol that combines and extends existing results on Byzantine agreement and group membership protocols to create dynamic groups of nodes that are responsible for some node in the network. Each such group handles the trust relationships with a given node in the network (which initially creates this group) by reaching an agreement on its admission and revocation. In this way, the revocation keying material allows creating a network-wide revocation message verifiable beyond the group scope. With this at hand, in the following paragraphs we describe the related work in the areas of distributed node revocation, group membership protocols, and protocols related to Byzantine agreement.

Distributed node revocation protocols in sensor networks. The concept of cooperative security emerged first in the area of node revocation in wireless sensor networks [7, 6, 14], although its applicability is broader, as described in this paper. Although centralized approaches [11, 23, 12] can tackle the problem, they are ill suited for deployments lacking a central trusted authority. Cooperative security, introduced in [14], therefore aims to provide a fully distributed way for secure, fast and reliable node isolation and revocation. The first work on distributed node revocation that is close to ours can be found in [6]. In that paper the authors suggest configuring nodes, before deployment, with the revocation information against the rest of the devices in the network. After the deployment, this information is used to revoke misbehaving nodes. Preloading the revocation information inevitably leads to a need for rekeying all nodes in the network whenever a new node is added. In other words, the scheme is more suitable for static networks. Besides, the design presented in [6] requires O(n) memory (where n is the number of nodes in the network) because each node has to store revocation votes against each other node in the network. This can be unacceptably large, imposing serious limitations on system scalability.

In the protocol discussed in this paper, a joining node is responsible for the dissemination of its own revocation information, which is further verified during admission voting. These steps solve the scalability limitations of the protocol in [6]: a node only has to store its own keying material, requiring O(1) memory, and not the keying material to revoke any node in the network.¹ It is worth mentioning that if, for instance, public-key cryptography (such as in [34]) is employed to sign keying material, a network running an instance of the protocol presented in this paper can scale almost without limit. Some issues of the protocol detailed in [6] were first addressed in [14]. In that work the authors introduce the concept of a Cooperative Security Protocol (CSP), which uses two voting procedures to mitigate the problem of high memory requirements. However, the biggest issue with the protocol presented in [14] is that it is not optimal in terms of the number of colluding attackers the system can sustain, due to the type of keying material used in that protocol. In our work we achieve this optimality.

¹ Note that here we refer to the storage needs of partial revocation votes. Both schemes have memory needs O(log(n)) for the storage of the verification paths in the Merkle tree.

Distributed node revocation protocols in MANETs. There is a considerable body of work in the area of node revocation in mobile ad-hoc networks (MANETs). We outline the two most notable approaches. The first is the suicide node revocation scheme [9, 26]. The idea is simple: whenever a node finds some other node to be faulty, it issues a revocation message for both the faulty node and itself. This signed revocation message is then broadcast network-wide for the revocation to take effect. The shortcoming of the scheme is that false revocation decisions can quickly deplete the network. The protocol described in our work does not have this limitation. The other notable work is related to threshold-based public key cryptography (PKC). Luo et al. [24] suggest that a node joins a network by requesting a (cooperatively generated) certificate from some subset of its neighbors. If the certificate is granted, the node can start its normal operation. However, this approach has several shortcomings. First, the protocol can sustain a lower number of faulty nodes in the system than the protocol discussed in this paper: the authors mention that any node can collect partial signatures from k nodes in the entire network, e.g., by moving from one location to another. In practice, k would be a small number in comparison to the total number of nodes in the network. As we will see, our protocol tolerates as many as c faulty neighbors of a particular node (note that in practice k and c are likely to be of the same order of magnitude). This means that the total number of faulty nodes tolerated in the entire network by our protocol can exceed both c and k. Second, in protocols similar to [24], if the private keys are compromised, all nodes need to be rekeyed to ensure that the faulty nodes will not admit a colluding attacker into the network in the future by collectively constructing a valid certificate. Network-wide rekeying can be hard or even infeasible in many settings. In contrast, in the protocol discussed in this paper, because a joining node distributes its own keying material in order to join the network, no complex rekeying mechanisms are required: only compromised nodes need to be rekeyed, while non-compromised nodes remain unaffected.

Group membership protocols. A group membership protocol and its variants (for a sample see [29, 30, 10, 13, 27]) form a family of distributed protocols in which the processes belonging to some group of arbitrary size can, in the presence of faults, agree on which processes should further represent the group. In other words, the goal of the protocol is to ensure that only non-faulty processes belong to the group. In general, these protocols can be classified according to several assumptions made about the system model. Some protocols rely on existing cryptographic primitives [29, 10, 13, 27], while others introduce novel schemes [30]. These protocols can be further classified according to the network type, which can be fully [29] or partially [10] connected, or according to the communication channel assumed, which can be synchronous [10, 13] or asynchronous [29, 27]. The operation of the protocols varies depending on the assumptions made. Despite the similarities, our work addresses a slightly different problem. First, on the scale of the neighbors of a particular node attempting to join the network, we do not solve the group membership problem as such: in our protocol it is the joining node that decides on the set of neighbors which will represent and monitor it. And unlike in group membership protocols, the task of these neighbors is not to agree on a common list of non-faulty neighbors of the joining node, but rather first to agree whether or not to accept the joining node, and later to agree whether to isolate it.
Second, we note that the problem we are addressing can potentially be solved with existing group membership protocols (such as those found in [29, 30, 10, 13, 27]), e.g., if all nodes participate in the admission of any node in the network. However, this approach becomes impractical when the network is large. The protocol we discuss in this paper does not have such a scalability problem: the entire network is partitioned into arbitrarily many small groups, allowing these groups to decide (on behalf of all other nodes in the network) whether or not to trust a particular joining node.

Accordingly, our work differs from group membership protocols in several ways. First, we address and tackle the scalability issue. Second, ECoSec involves two different voting procedures. The methods presented in this paper correlate the information disclosed in these procedures so that we optimize the overall communication complexity. To the best of our knowledge, the bounds we obtain were not previously known. Third, we introduce a comprehensive analysis of the protocol operation when the intrusion detection system is biased and show that the threshold for the number of nodes needed for admission is a function of the number of nodes needed during revocation.

Byzantine agreement and related aspects. State-machine replication is an approach used to implement fault-tolerant systems by replicating resources and coordinating requests in a distributed way. Cooperative security is close to the notion of
Byzantine state-machine replication, in which a set of processors acts in unison, masking Byzantine faults. For instance, this is similar to the behavior of nodes in cooperative security when monitoring nodes ask each other whether a joining node distributed enough revocation information. These ideas appear in the literature starting with Lamport’s paper [20], followed by the contribution of Schneider on fail-stop processors [31]. A comprehensive overview of these concepts is provided in [32]. To the best of our knowledge, these concepts do not cover the formation of processor groups that monitor some other processor and revoke it whenever it starts to misbehave.

Furthermore, cooperative security is related to the work on fault containment in the context of self-stabilizing algorithms. Here, a group of processors attempts to contain the effects of faults by handling these effects locally so that processors outside of the group are not affected. The difference from state-machine replication lies in the type of fault model considered (transient faults in contrast to Byzantine faults). The first reference we are aware of is Ghosh, Gupta and Pemmaraju [15]. A more general paper is authored by Herman et al. [18].

Lower bounds are related to existing results on Byzantine agreement and its cryptographic variants. Lamport, Shostak, and Pease deserve credit for the term Byzantine faults [28] and for their 3t + 1 lower bound proof. There is also a large body of work that suggests variants of the original (recursive) algorithm; see, for instance, Cachin et al. [5].

Finally, cooperative security protocols also show some links to failure detectors. A failure detector aims at isolating the timing assumptions required to solve agreement instead of dealing with them directly within the agreement algorithm. The intruder detection scheme used in our protocol or in [14] might be extended to follow the modular approach of failure detectors, first introduced by Chandra and Toueg [8].
As a short summary of the differences, ECoSec uses Merkle trees to authenticate secret shares as in [7]; however, the structure is adapted to authenticate them during the admission phase (such a phase does not exist in [6] or [7]) and to verify the revocation vote created during the revocation phase. The use of both admission and revocation phases was first introduced in [14] to reduce the memory storage needs of, e.g., [6], but ECoSec further improves on [14] with an advanced protocol that endures a higher number of compromised devices. Group membership protocols also existed, but ECoSec differs from them in a number of aspects, such as the use of correlated revocation information that reduces overhead in consecutive operational phases (admission/revocation), leading to new bounds.

3. Problem Statement and Assumptions

Consider a distributed network comprising n nodes, e.g., a wireless sensor network, in which some nodes might be faulty due to node capture (and modification) in the sense of Byzantine failures. The goal is to remove those faulty nodes in an efficient way. In this setting, this paper deals with those faulty nodes by designing a cooperative security protocol,
formally defined as follows:

Definition 1 A cooperative security protocol is a distributed protocol that exhibits the following three design principles. Admission: a node ζ can communicate with the remaining n − 1 nodes in the network if and only if a set of q_j nodes agrees on its admission, where q_j ≤ n; Isolation: if a node is found to be an intruder by its neighbors, the node is revoked network-wide; Fault tolerance: the admission and revocation decisions are guaranteed to sustain a collusion of up to c faulty nodes.

To guarantee proper operation, any cooperative security protocol compliant with Definition 1 is required to fulfill the following properties (similar properties were defined in [6]):

• Correctness: If a node ζ is admitted to or revoked from the network, then q_j nodes have agreed on its admission or q_r nodes on its revocation, as long as the number of faulty nodes c does not exceed a given threshold.

• Completeness: If a joining node ζ is admitted by q_j nodes, node ζ can communicate with the whole network. If node ζ is detected to be an attacker by q_r other nodes, the node is permanently removed from the network.

• Bounded Admission and Revocation Times: A node ζ attempting to join the network and following the protocol will succeed before ΔT_joining. A node identified as an attacker will be revoked from the whole network in a bounded period of time ΔT_revocation. Note that although we specify the time bounds explicitly, our protocol is able to operate in an asynchronous setting, since the voting schemes we discuss in this paper are designed to tolerate node asynchrony (the exception is keying material disclosure during node revocation when the intruder detection system (IDS) is perfect, in which case this property is not required).

Thus, in this work we address the problem of constructing an instance of a cooperative security protocol that satisfies Definition 1 and fulfills all its properties. In addition, the resulting protocol should not exhibit the limitations of the distributed node revocation protocols described in Section 2 and should be highly scalable.

3.1. Assumptions

In this section we outline the assumptions under which we design the ECoSec protocol. While these assumptions are listed below, the notation we use throughout the paper is collected in Table 1.

• We adopt the threat model described in [6] and here we merely outline its key aspects. First, an adversary can perform chosen node compromise. Second, compromised nodes can cooperate. Third, an adversary can block or delay communications passing through it, but cannot jam the entire network.

• Nodes can establish pairwise secure links, e.g., using the approaches described in [2, 22], ensuring mutual authentication and allowing for integrity and confidentiality of unicast messages. Any other suitable cryptographic protocol can be used instead; for instance, we do not limit ourselves to symmetric cryptography, and public-key cryptography [34] can be utilized without any changes to the protocol operation if the system can tolerate resource-intensive computations.

• We assume that the links over which the messages are exchanged are reliable in the sense that packet losses are recovered with retransmission and error correction mechanisms. We further assume that a message traverses a link in a bounded period of time. This time includes the transmission time, the reception time, and the bounded message processing time. Furthermore, we denote by ΔT_transmit the maximum time for a message exchange between a pair of neighbors of an arbitrary node, and by ΔT_network the maximum time for a message to propagate through the entire network.

• Each joining node discloses the revocation keying material to its q_j neighbors. We assume that all neighbors are in direct communication range of the joining node.

• We say that a node is honest if it follows the expected protocol operation. Otherwise the node is considered to be faulty due to node compromise. Compromised nodes can collude to subvert the protocol operation in the sense of Byzantine faults.

• Each node running a protocol instance is able to detect other nodes’ failures, e.g., by using an IDS [16]. The IDS is an element, independent from the ECoSec protocol, that outputs a decision every ΔT_IDS time interval on the behavior of another node ζ. In this paper, we consider two hypothetical IDS variants: perfect and biased. An IDS is perfect if it detects the intruders unerringly, with neither false positives nor false negatives. An IDS is biased if it can also produce false positive decisions with some non-negative probability p_e.

Table 1: Notation
Symbol              Meaning
DTSD                Dynamic trusted security domain
IDS                 Intruder detection scheme
RV                  Revocation vote
PRV                 Partial revocation vote
Verification tree   Merkle tree for verifying node IDs, communication sessions, PRVs and RVs
|S|                 Signature size
|V|                 Vote size
p_e                 Probability of (biased) IDS false positive decision
n                   Network size
c                   Threshold for number of faulty nodes
q_j                 Number of nodes required for admission
q_r                 Minimum number of nodes required for revocation
τ_j, τ_r            Admission and revocation thresholds
PRV_{ζ,x}^k         PRVs associated with node ζ during communication session k, x ∈ {1, ..., q_j}
f_ζ^k(0)            RV associated with node ζ in communication session k
ΔT_admission        Maximum time for executing node admission algorithm
ΔT_revocation       Maximum time for executing node revocation algorithm
ΔT_transmit         Message delivery time within a DTSD
ΔT_network          Message delivery time in the whole network
ΔIDS                Maximum time needed for IDS to detect misbehavior of a node

3.2. Background

In this subsection we summarize the primitives that we use in the ECoSec design. We begin with a data structure that can be used to efficiently authenticate the origin of messages. The structure we consider is a Merkle tree†. A Merkle tree is a binary tree in which each of the n leaves, L_i, is calculated as the hash of a value a_i, i.e., L_i = H(a_i), and each internal node m_ij is calculated as the hash of the concatenation of its two child nodes, m_ij = H(m_i || m_j). Trivially, a tree comprising n leaves has depth log(n). Most importantly, given the root of such a tree along with log(n) elements (the siblings of the nodes on the path from some leaf L_i up to the root), one can efficiently verify whether a message a_i is authentic or not [25].

† We rely on Merkle trees instead of asymmetric cryptography since they perform better in some settings [17]. Note that in other settings, in which nodes are more powerful, public-key cryptography might be applied for information verification or to set up secure communication links between nodes without any changes to the protocol.

In our study we also use the notion of secret sharing, which refers to a mechanism that allows distributing a secret among several parties. More specifically, this mechanism does not allow a single share to be used to reconstruct the secret. But if the number of shares exceeds some threshold, the secret can be collectively recovered. In our work we use the widely accepted Shamir scheme [33]. It is based on Lagrange polynomial interpolation and allows recovering a secret if at least t + 1 shares are known.

Finally, in this work we extensively use several voting procedures. Briefly, such mechanisms allow a group of nodes to reach an agreement on a certain action. The actions we consider are whether to admit or revoke a node. Overall, the described algorithms resemble to some extent the Byzantine Generals Problem [21], in which the generals can communicate with each other only via a messenger. After exchanging messages, they have to decide upon a common action plan.
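The Merkle tree authentication summarized above can be illustrated with a minimal sketch. It assumes SHA-256 as the hash function H and a number of leaves that is a power of two; the helper names (build_tree, auth_path, verify) are hypothetical and are not taken from the ECoSec design.

```python
import hashlib


def H(data: bytes) -> bytes:
    """Hash function H; SHA-256 is an assumption made for this sketch."""
    return hashlib.sha256(data).digest()


def build_tree(values):
    """Return all tree levels, leaves first. Assumes len(values) is a power of two."""
    level = [H(v) for v in values]          # L_i = H(a_i)
    levels = [level]
    while len(level) > 1:
        # Each internal node is the hash of the concatenation of its two children.
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def auth_path(levels, index):
    """Collect the sibling of each node on the path from leaf `index` to the root."""
    path = []
    for level in levels[:-1]:
        path.append(level[index ^ 1])       # sibling position differs in the last bit
        index //= 2
    return path


def verify(root, value, index, path):
    """Recompute the root from a value and its log(n) sibling hashes."""
    node = H(value)
    for sibling in path:
        node = H(node + sibling) if index % 2 == 0 else H(sibling + node)
        index //= 2
    return node == root


values = [b"a%d" % i for i in range(8)]     # eight example messages a_0 .. a_7
levels = build_tree(values)
root = levels[-1][0]
path = auth_path(levels, 5)
print(len(path), verify(root, values[5], 5, path), verify(root, b"forged", 5, path))
```

With eight leaves the path holds three sibling hashes, so a verifier that knows only the root needs three hash evaluations beyond hashing the message itself.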
4. ECoSec Protocol: Preliminaries

While we give a more detailed description of the ECoSec protocol (an instance of a cooperative security protocol) in Section 5, here we overview its design principles and main building blocks. Namely, in this section we first give a precise definition of the ECoSec protocol. We then discuss in detail the keying material structure and voting mechanisms which, as we shall see in Section 5, make ECoSec compliant with the definition of a cooperative security protocol.

4.1. Protocol Definition

Formally we define the ECoSec protocol as follows. Before deployment, a network is governed by a trusted third party (TTP), an entity that is responsible for managing the identities and supplying the nodes with keying material. In practice this would be a network operator that programs and deploys the nodes (i.e., physically places the nodes in a desired location). After deployment, however, the TTP takes a passive role, while deployed nodes take an active role in network monitoring. Thus, after the deployment each node has to distribute part of the keying material to its q_j neighbors to join the network. If the node does not disclose this information, it cannot join, and thus it cannot endanger the network because it is not allowed to participate in normal network operation. If a node follows the protocol, then the set of q_j nodes that received the revocation information is responsible for monitoring its operation and responding to the node’s misbehavior. In addition, there is a guarantee that these q_j nodes have enough information to remove the node if wrong behavior is detected. Finally, a node that has not been revoked from the entire network can rejoin by disclosing a new set of keying material for another communication session.

Note that the protocol described in this work is able to deal with any type of fault because it is designed for adversarial and malicious faults. However, if one is only interested in non-adversarial faults such as fail-stop or node crash, the protocol conditions can be relaxed. The description of these alternative designs is out of the scope of this paper.

Definition 2 A mandatory condition for a node to join the network during a communication session is the distribution of a set of q_j partial revocation votes (PRVs), which are verifiable secret tokens, to q_j neighbors (i.e., each neighbor receives a single PRV from the joining node).

Definition 3 The set of q_j neighbors that received the PRVs from a joining node and agreed on its admission during a given communication session forms the node’s Dynamic Trusted Security Domain (DTSD). A neighbor that receives a PRV is denoted as a DTSD member, while a node joining the network by distributing PRVs is denoted as the owner of the DTSD.
Definition 4 A communication session is defined as a period of time during which a node remains trusted by its DTSD. During this period of time, a node communicates with the network through its DTSD members: only DTSD members that admitted the node into the network can forward the packets from the corresponding owner; other nodes that are not DTSD members (and which may be direct neighbors) will ignore such messages because they lack trust in this node.
Before a deployed node can start communicating with other nodes during a given communication session, it has to gain trust during the first voting procedure, admission voting. If the joining node is successful, it becomes trusted and can start normal communication. If the DTSD detects the node to be malicious during a particular communication session, the second voting round starts, in which DTSD members need to agree whether the node should be revoked or not. If a positive decision is reached, nodes exchange the PRVs to reconstruct the final revocation vote.
Definition 5 A revocation vote (RV) is a verifiable piece of information that can be reconstructed from a subset of t + 1 < q_j PRVs belonging to a given revocation session of a particular DTSD owner. The RV allows for verifiable node revocation in the entire network by proving that (i) the node joined the network by disclosing its PRVs, (ii) τ_j nodes in its DTSD admitted it to the network, and (iii) τ_r nodes agreed on its removal due to misbehavior.
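The reconstruction of an RV from t + 1 PRVs can be sketched with the Shamir scheme mentioned in Section 3.2. The sketch below is illustrative only: the prime modulus P, the example sizes (t = 2, q_j = 5), and the function names are assumptions made here, not parameters of ECoSec. It requires Python 3.8 or later for the modular inverse via pow.

```python
import secrets

# Assumed prime field modulus (2**127 - 1 is a Mersenne prime); not from the paper.
P = 2**127 - 1


def make_polynomial(rv, t):
    """Polynomial of degree t whose constant term is the secret RV."""
    return [rv] + [secrets.randbelow(P) for _ in range(t)]


def evaluate(coeffs, x):
    """Horner evaluation of the polynomial at x, modulo P."""
    result = 0
    for c in reversed(coeffs):
        result = (result * x + c) % P
    return result


def reconstruct_rv(shares):
    """Lagrange interpolation at x = 0 from a list of (x, f(x)) pairs."""
    rv = 0
    for j, (xj, yj) in enumerate(shares):
        num, den = 1, 1
        for m, (xm, _) in enumerate(shares):
            if m != j:
                num = (num * (-xm)) % P
                den = (den * (xj - xm)) % P
        rv = (rv + yj * num * pow(den, -1, P)) % P
    return rv


t, q_j = 2, 5                       # hypothetical example sizes
rv = 123456789                      # hypothetical secret revocation vote
coeffs = make_polynomial(rv, t)
prvs = [(x, evaluate(coeffs, x)) for x in range(1, q_j + 1)]   # one PRV per neighbor

# Any t + 1 = 3 PRVs recover the RV, regardless of which members contribute.
print(reconstruct_rv(prvs[:3]) == rv, reconstruct_rv([prvs[0], prvs[2], prvs[4]]) == rv)
```

Any subset of t + 1 distinct PRVs yields the same value at x = 0, which is why the choice of contributing DTSD members does not matter in this sketch.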
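Figure 1: Keying material structure and its usage. [Figure not reproduced. Panels: global verification tree G^n with public root H(G^n) and leaves 1, ..., n; communication sessions verification tree G_ζ with leaves 1, ..., s, where leaf k stores H(H(G_ζ^k) || H(RV^k)); PRV and RV verification tree G_ζ^k with leaves 1, ..., q_j storing H(H(PRV^k)). Labels: admission voting, revocation voting (biased IDS), revocation voting (perfect IDS).]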
4.2. Keying Material
The ECoSec protocol relies on cryptographic keying material. We keep the following properties in mind when defining its structure. First, we require the nodes to efficiently authenticate identities, disclosed PRVs, and reconstructed RVs. Second, the structure of the keying material should also provide the means to securely link votes disclosed in admission voting with votes disclosed in subsequent revocation voting procedures; in this way, the verification performed during admission voting can be easily reused later. As we will see in Section 6, this property allows achieving certain lower bounds for communication complexity during revocation voting. Overall, the keying material comprises several combined Merkle trees (which we denote as the verification tree) and polynomials. This data structure is presented in Figure 1 and explained in the following paragraphs.
We begin with the description of the verification tree, which comprises three hierarchically connected Merkle trees. First, the global verification tree, Gn (the top-level tree), is used to verify the identities of the n nodes in the network. Second, n subtrees, denoted as communication session verification trees, Gζ, are associated with and unique to each node in the network. These subtrees are bound to the corresponding n leaves of Gn and are used to verify the keying material corresponding to the s communication sessions of a particular node. Finally, the third tree, G^k_ζ, is used to authenticate both the PRVs and the RV belonging to a particular communication session k of a node ζ. Thus, to authenticate RV or PRV values, the corresponding paths in the Gn, Gζ, and G^k_ζ trees and the H(RV) value are required. For example, following Figure 1, the verification path in G^k_ζ for the first PRV value would comprise the right sibling element at each level from the leaf to the root of this tree. The paths in Gζ and Gn are defined similarly.
While we use Merkle trees for authentication, we represent the PRVs and RV of a joining node ζ during communication session k as points on a polynomial f^k_ζ(x) of degree t. It is important that PRVs are computed as f^k_ζ(x) with x ∈ {1, . . . , qj}, while the corresponding secret RV is represented by f^k_ζ(0). This choice allows reconstructing the RV from any t + 1 PRVs using the approach introduced in Section 3.2. Observe that each leaf element in tree Gζ stores the hash of the root element of the corresponding third tree G^k_ζ concatenated with the hash of the RV, i.e., H(H(G^k_ζ) ‖ H(RV)). This construction has several useful properties. First, it allows any node in the network, provided with an RV and the corresponding paths in Gζ and Gn, to efficiently verify the authenticity of a received RV. Second, it allows nodes to verify a PRV without prior knowledge of the RV value, which is important during admission voting when the RV is not known.
Finally, each of the qj leaf elements of a tree G^k_ζ stores a double hash of a PRV, i.e., H(H(PRV)), which represents the anchor of a hash chain comprising three elements: H(H(PRV)), H(PRV), and PRV. Although the need for such a hash chain is explained later, here we merely point out that it allows nodes to authenticate the votes seen in different voting procedures using a single hash function evaluation. More importantly, it makes it possible to correlate the revocation votes
with previously seen votes during admission voting. In other words, given the PRV or H(PRV) values, nodes can easily verify whether they correspond to a previously seen H(H(PRV)) value.
Keying material generation. In some settings (especially when network nodes have limited storage) it can be beneficial to allow the nodes to reconstruct on demand the polynomials, and hence the corresponding PRVs and RV, initially generated by the TTP. To achieve this, we can represent coefficient i of polynomial f^k_ζ(x) as H(Aζ ‖ k ‖ i), where Aζ is a master seed selected by the TTP and i ∈ {0, . . . , t} is the index of a polynomial coefficient. Clearly, if the TTP provides a node with a master seed, the polynomial degree t, and the corresponding path in Gn, the node can reconstruct its entire keying material after deployment.
Usage of public-key cryptography. The keying material described above relies on symmetric cryptography only. While this approach can be favorable in networks involving nodes with limited computational capabilities, public-key cryptography can be employed whenever scalability is desired. For instance, instead of having Gn and Gζ, the TTP can directly sign H(H(G^k_ζ) ‖ H(RV)) with its private key. In this work we omit the discussion of this design option and only note that such a change would not affect the protocol operation.
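The seed-based generation of the polynomial and the reconstruction of the RV from t + 1 PRVs can be illustrated with a minimal Python sketch. It is not the authors' implementation: the prime P, the use of SHA-256 as H, the "|" separator used to encode the concatenation, and all function names are assumptions made for illustration only.

```python
import hashlib

# Assumption: the Mersenne prime 2^127 - 1 stands in for the field GF(p).
P = 2**127 - 1


def h_int(*parts: bytes) -> int:
    """Hash the joined parts with SHA-256 and map the digest into GF(P)."""
    digest = hashlib.sha256(b"|".join(parts)).digest()
    return int.from_bytes(digest, "big") % P


def coefficients(seed: bytes, session: int, t: int) -> list:
    """Coefficient i of the session polynomial is H(seed || session || i), i = 0..t."""
    return [h_int(seed, str(session).encode(), str(i).encode()) for i in range(t + 1)]


def evaluate(coeffs: list, x: int) -> int:
    """Evaluate the polynomial at x with Horner's rule, modulo P."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % P
    return acc


def make_shares(seed: bytes, session: int, t: int, q_j: int):
    """Return (RV, PRVs): RV = f(0), PRVs = {x: f(x)} for x = 1..q_j."""
    coeffs = coefficients(seed, session, t)
    rv = coeffs[0]  # f(0) is the constant coefficient
    prvs = {x: evaluate(coeffs, x) for x in range(1, q_j + 1)}
    return rv, prvs


def reconstruct_rv(shares: dict) -> int:
    """Lagrange interpolation at x = 0 from a dict {x: f(x)} of t + 1 shares."""
    secret = 0
    for xi, yi in shares.items():
        num, den = 1, 1
        for xj in shares:
            if xj != xi:
                num = (num * (-xj)) % P
                den = (den * (xi - xj)) % P
        secret = (secret + yi * num * pow(den, -1, P)) % P
    return secret
```

With t = 4 and qj = 13, any five of the thirteen PRVs returned by make_shares recover the RV, whereas four do not.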
4.3. Voting Strategies
The core of our distributed protocol is the ability to reach consensus regarding the joining (trust acquisition) or revocation (trust loss) of a node in the presence of corrupted or faulty nodes. To this end, we consider three different voting strategies, namely: yes/no agreement, H(H(PRV))/H(PRV) agreement, and simple PRV disclosure. From now on, we use the term agreement to indicate a voting protocol that allows reaching consensus on a decision among nodes that do not have any knowledge regarding which nodes are faulty and which are not. On the other hand, we use the term disclosure to indicate a voting protocol in which nodes know which nodes are compromised and which are not, so that the consensus can be considered a vote broadcast by the majority. We therefore say that a DTSD, when reaching consensus, can use agreement schemes both for admission and revocation, while disclosure mechanisms should be used only during revocation. We briefly summarize this in Table 2. Although we provide a more detailed discussion of trade-offs in Section 6.5, here we merely mention that the protocol (depending on the requirements) can be configured in the following ways:
• Admission: Yes/no agreement; Revocation: Yes/no agreement followed by PRV disclosure (any IDS)
• Admission: H(H(PRV)) agreement; Revocation: H(PRV) voting followed by PRV disclosure (biased IDS)
• Admission: H(H(PRV)) agreement; Revocation: PRV disclosure (only if a perfect IDS is used)
4.3.1. Yes/no agreement
In this section we define the yes/no voting scheme. Note that if the protocol is configured with the first configuration, this voting scheme must be used in both admission and revocation procedures.
Definition 6 Yes/no agreement is a Binary Byzantine Agreement that allows a DTSD to reach consensus on the admission or revocation of a node under the collusion of faulty nodes.
In this strategy, during the admission voting, the DTSD decides whether the joining node should be admitted or not. To this end, each DTSD member that has received a PRV from the joining node first verifies that the PRV belongs to the joining node. Then, each DTSD member reliably broadcasts [4] its vote (yes or no) regarding the correct reception of the PRV. Based on the exchanged votes, the DTSD decides whether the DTSD owner disclosed enough PRVs valid for the particular communication session. In a similar way, during the revocation procedure, the DTSD decides whether the DTSD owner is to be revoked. A positive decision is taken if enough DTSD members claim that the DTSD owner behaves in a malicious manner. In this case, they disclose the PRVs collected from the node in order to reconstruct the RV and revoke the node. This voting strategy requires the existence of pairwise secure channels between the DTSD members for two reasons. First, the DTSD owner can securely transmit its PRVs to each DTSD member such that each member only learns one PRV. Second, each DTSD member should authenticate the messages exchanged during the voting procedures by means of keyed message authentication codes (MAC).
4.3.2. H(H(PRV)) agreement
In this section we describe a voting scheme that can be used only during admission voting, in configurations involving either a perfect or a biased IDS.
Definition 7 H(H(PRV)) agreement allows a DTSD to reach consensus on the admission of a node under the collusion of faulty nodes; the agreement uses H(H(PRV)) values as votes and reliable broadcast (secured with MAC) for exchanging these votes between participants.
H(H(PRV)) agreement requires a reliable broadcast with the totality property, which guarantees that no faulty node can cheat by disclosing its H(H(PRV)) vote only to colluding attackers rather than to all nodes. In other words, reliable broadcast is a communication primitive that guarantees that if some node broadcasts an H(H(PRV)) value then all honest nodes will receive it (or reliably deliver it) [4]. Such a property is achievable, for example, with Bracha’s reliable broadcast [3] and can be used both in yes/no and H(H(PRV)) agreement. This protocol satisfies the basic properties of reliable broadcast primitives [4], including the totality property, which is crucial in the context of this work.
Definition 8 Totality property: If a node broadcasts a message and some honest node receives it, then all honest nodes will eventually receive the same message [4].

Table 2: Relationships between voting schemes

| | | Yes/no | H(H(PRV)) |
|---|---|---|---|
| Admission | Admission voting | Agreement between qj nodes regarding node admission using binary Byzantine agreement | Agreement between qj nodes regarding node admission with H(H(PRV)) values exchanged using Bracha’s reliable broadcast |
| Revocation | Revocation voting | Agreement between qr nodes regarding node revocation using binary Byzantine agreement | Agreement between qr nodes regarding node revocation based on flooding of H(PRV) values inside DTSD |
| Revocation | Actual revocation | Disclosure of PRV to compute the RV | Disclosure of PRV to compute the RV |
5.1. System Setup and Node Initialization
Initially, the network’s trusted third party (TTP) generates the keying material for all the nodes and communication sessions in the network. To this end, the TTP performs the following steps: (i) for each node ζ in the network, it randomly chooses a seed Aζ from a sufficiently large field GF(p); (ii) using the seed, it constructs a polynomial f^k_ζ(x) for each communication session k of node ζ according to the description given in Section 4.2; (iii) it calculates the PRVs and the RV for a given node (and all its communication sessions) as the shares of the polynomial and uses the double hash values of these PRVs, i.e., H(H(PRV)) where H is a hash function, to construct the tree G^k_ζ; (iv) the constructed trees G^k_ζ and the corresponding RV f^k_ζ(0) for each node and each session are used to generate the Merkle tree Gζ for each node, and from these the global Merkle tree Gn that completes the verification tree. Prior to node deployment, the TTP supplies each node ζ with a unique path in Gn and the seed Aζ. This allows node ζ to reconstruct the revocation polynomials for all its communication sessions during normal operation.
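Step (iii) and the later verification of a disclosed PRV can be sketched as follows. The sketch covers only the session tree G^k_ζ, whose leaves are H(H(PRV)); the trees Gζ and Gn would be traversed in the same way. SHA-256 as H, a power-of-two number of leaves, and all function names are assumptions for illustration, not part of the protocol specification.

```python
import hashlib


def H(data: bytes) -> bytes:
    """Hash function H; SHA-256 is an assumption."""
    return hashlib.sha256(data).digest()


def build_levels(leaves: list) -> list:
    """Return all tree levels, leaves first; assumes a power-of-two leaf count."""
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([H(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels


def auth_path(levels: list, index: int) -> list:
    """Sibling hashes from the leaf level up to (but excluding) the root."""
    path = []
    for level in levels[:-1]:
        path.append(level[index ^ 1])  # sibling of the current node
        index //= 2
    return path


def verify_prv(prv: bytes, index: int, path: list, root: bytes) -> bool:
    """Recompute the leaf H(H(PRV)) and walk the path up to the root."""
    node = H(H(prv))
    for sibling in path:
        node = H(node + sibling) if index % 2 == 0 else H(sibling + node)
        index //= 2
    return node == root
```

For the first PRV (index 0) the path consists of the right sibling at each level, which matches the example given for Figure 1.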
4.3.3. H(PRV) and PRV disclosure
The last voting scheme we cover in this paper is H(PRV) and PRV disclosure. This voting scheme can be used only together with H(H(PRV)) voting for admission. Note that H(PRV) voting is preferable when the IDS is biased; otherwise it is more effective to use the simpler PRV disclosure algorithm during revocation.
Definition 9 H(PRV) and PRV disclosure is a simple voting mechanism for revocation based on (DTSD-limited) broadcast flooding, with H(PRV) values used as votes and verified against the corresponding H(H(PRV)) votes observed in the admission phase. If the voting is successful, the PRVs are disclosed.
This approach relies on the fact that the admission and revocation procedures involve the same nodes and related cryptographic material (observe that H(H(PRV)) values can be used to verify the corresponding H(PRV) and PRV values). These two aspects make it possible to link the verified information of the first voting procedure (H(H(PRV)) agreement-based admission) with the information disclosed in the second voting procedure (node revocation), and this reduces the communication complexity considerably (see Section 6). Note that an even simpler method is the direct disclosure of PRVs for the revocation voting, which can be employed if the IDS is perfect.
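The link between admission and revocation votes amounts to walking the three-element hash chain backwards. The following Python sketch illustrates the idea; the class VoteLinker, its method names, and the choice of SHA-256 are hypothetical and serve only as an illustration.

```python
import hashlib


def H(data: bytes) -> bytes:
    """Hash function H; SHA-256 is an assumption."""
    return hashlib.sha256(data).digest()


class VoteLinker:
    """Keeps the H(H(PRV)) anchors seen at admission and checks later votes."""

    def __init__(self):
        self.anchors = {}  # member id -> H(H(PRV)) verified during admission

    def record_admission_vote(self, member: str, double_hash: bytes) -> None:
        self.anchors[member] = double_hash

    def accept_revocation_vote(self, member: str, h_prv: bytes) -> bool:
        """An H(PRV) vote is valid if one hash evaluation yields the anchor."""
        anchor = self.anchors.get(member)
        return anchor is not None and H(h_prv) == anchor

    def accept_disclosed_prv(self, member: str, h_prv: bytes, prv: bytes) -> bool:
        """A disclosed PRV is valid if it hashes to the accepted H(PRV) vote."""
        return self.accept_revocation_vote(member, h_prv) and H(prv) == h_prv
```

Each check costs a single hash evaluation because the Merkle path of the anchor was already verified during admission.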
5.2. Static System Operation We refer to static system operation as the operation within a single communication session with static nodes having a fixed set of neighbors. For instance, let us assume that a network comprising n static nodes is deployed. Then, the static system operation refers to the process during which (i) each node ζ shares its PRVs with qj neighbors and later (ii) node ζ can be removed if wrong behavior is detected. Note that this situation does not allow node ζ to move to another location. Next, we explain in detail the most important steps within a communication session.
5. ECoSec: Protocol Operation
5.2.1. Node admission procedure (bootstrapping process)
A newly deployed node ζ is required to reveal fresh, authentic PRVs of communication session k to join the network and gain trust according to Algorithm 1. To this end, joining node ζ first establishes a secure channel with each DTSD member (e.g., by using a key agreement approach as described in [2], [22]) based on authenticated encryption, i.e., each message is encrypted and a message authentication code is attached. Key agreement also allows verifying that the distributed ECoSec keying material belongs to the authenticated node. Node ζ distributes qj PRVs corresponding to the kth communication session to the qj members of its potential DTSD in a confidential way. After
This section specifies in detail the operation of the protocol. We first describe how the network and the nodes are initialized by the TTP. Then, we explain the protocol operation within a single communication session, pointing out the differences between the voting configurations discussed in Section 6.5. Finally, we consider other aspects that need to be taken into account when configuring the system. In particular, we discuss static and dynamic settings of the system. We differentiate between static and dynamic networks because in the static setting the DTSDs remain constant, while in a dynamic one nodes can move and the membership of a DTSD changes over time.
5.2.2. Node revocation procedure After node joining, the cooperative network monitoring phase starts. In this phase, the nodes in the network observe the correct operation of each other. In the design of ECoSec, we consider the following elements: (i) an intruder detection scheme (IDS) to monitor suspicious nodes [16]; (ii) distributed agreement schemes interfacing with the IDS; (iii) time bounds required in the algorithms to ensure the correct operation; and (iv) cryptographic primitives to construct the revocation vote against the malicious node in case of a positive revocation decision. First, each network node z – a member of a particular DTSD – runs a local instance of an IDS algorithm such as [16] to find out suspicious nodes. In Algorithm 3, each node’s DTSD triggers a revocation decision every ΔTIDS against node ζ based on the information collected by the IDS during the last ΔTIDS seconds. Note that if the IDS is perfect, then configuration 3 can be used (with direct PRV disclosure as discussed in Section 4.3). Otherwise, we need first agreement, and then the PRV disclosure. Below we outline the algorithms for PRV disclosure (Algorithm 4) and H(PRV) voting (Algorithm 5). The goal of these algorithms is to allow the DTSD members to agree or disagree on the revocation of the suspicious nodes. Algorithm 4 is based on PRV disclosure, and thus, it is suited for perfect IDSs. Algorithm 5 involves additional H(PRV) disclosure prior to PRV disclosure (thus avoiding the risk of false PRV disclosure) between the DTSD members to decide whether a node is to be removed or not. This increases the system robustness in case of biased IDSs. If the disclosure is positive, the DTSD proceeds with the actual node revocation.
the initial distribution of PRVs, the joining node broadcasts verification paths for the disclosed PRVs comprising the paths of the Gn, Gζ, and G^k_ζ trees as well as H(RV). This information allows each DTSD member to verify that the disclosed PRVs actually belong to ζ.
Algorithm 1 Admission algorithm run by joining node ζ in communication session k
    while DTSD size less than qj do
        Add new member z to the DTSD
        Set up secure channel with z
        Securely transmit unique PRV to z
    end while
    Broadcast verification paths of Gn, Gζ, G^k_ζ to the DTSD
    Securely broadcast the DTSD members’ IDs to all DTSD members
ζ’s DTSD accepts the joining node only if (i) the node has not been revoked before and (ii) the node has disclosed enough revocation information. This last point is verified in two steps. First, each DTSD member locally verifies the PRV received from ζ. To this end, it produces the hash chain PRV, H(PRV), H(H(PRV)) from the PRV and traverses the paths of the trees G^k_ζ, Gζ, and Gn. Each member can verify the authenticity of the leaf because the root of the Gn tree is public (i.e., common to all the nodes in the network). Second, after the PRV has been locally verified by each DTSD member, all DTSD members cooperate to ensure that enough PRVs have been distributed by joining node ζ. To this end, the nodes vote according to one of the voting strategies discussed in Section 4.3. We briefly outline the admission algorithm for H(H(PRV)) voting.
Algorithm 2 Admission algorithm run by each DTSD member z of joining node ζ in communication session k
    Local verification [run in parallel]
    if (PRV received from ζ is authentic) AND (node ζ not revoked before) then
        reliably broadcast(admission vote H(H(PRV^k_{ζ,z})))
        Increase DTSDsize by 1
    end if
    Admission loop [run in parallel]
    while DTSDsize < τj do
        if ΔTadmission exceeded then
            DTSD member z DENIES access to node ζ
        end if
        if reliably receive(H(H(PRV^k_{ζ,z}))) and vote is authentic and not replayed then
            Increase DTSDsize by 1
        end if
    end while
    DTSD member z GRANTS access to node ζ
Algorithm 3 Periodic IDS revocation decision taken by node z
    while 1 do
        Wait ΔTIDS
        for each DTSD owner ζ of which node z is a DTSD member do
            if wrong behavior of ζ then
                % For perfect IDSs implement this option
                Trigger Algorithm 4
                % For biased IDSs implement this option
                Trigger Algorithm 5
            end if
        end for
    end while
Next, if the IDS of node z detects the wrong behavior of node ζ, node z starts a revocation procedure against ζ by following one of the three voting strategies described in Section 4.3.
Algorithm 2 is an H(H(PRV)) agreement protocol in which the DTSD has to agree on the admission of the joining node, as discussed in Section 4.3. If a node accepts the joining node, it reliably broadcasts its H(H(PRV)). As discussed, the reliable broadcast should be secured (have authentication information attached) using the keys shared between the DTSD members. Whenever DTSD members receive τj votes, they can conclude that agreement has been reached.
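The counting logic of the admission loop can be sketched in Python as follows. The sketch assumes that the reliable broadcast layer has already delivered each vote, that every vote carries an HMAC-SHA256 tag under a pairwise key, and that the set of authentic H(H(PRV)) leaves is known from the verified Merkle paths; the ΔTadmission timeout is omitted. The class and parameter names are hypothetical.

```python
import hashlib
import hmac


class AdmissionTally:
    """Counts authentic, non-replayed H(H(PRV)) votes until tau_j is reached."""

    def __init__(self, tau_j: int, pairwise_keys: dict, authentic_leaves: set):
        self.tau_j = tau_j
        self.pairwise_keys = pairwise_keys        # sender id -> shared MAC key
        self.authentic_leaves = authentic_leaves  # leaves verified via Merkle paths
        self.counted = set()                      # votes already counted

    def receive(self, sender: str, vote: bytes, tag: bytes) -> bool:
        """Return True if the vote is authentic, known, and not replayed."""
        key = self.pairwise_keys.get(sender)
        if key is None:
            return False
        expected = hmac.new(key, vote, hashlib.sha256).digest()
        if not hmac.compare_digest(expected, tag):
            return False
        if vote not in self.authentic_leaves or vote in self.counted:
            return False
        self.counted.add(vote)
        return True

    def admitted(self) -> bool:
        """Agreement is reached once tau_j distinct valid votes were counted."""
        return len(self.counted) >= self.tau_j
```

A member would grant access to ζ as soon as admitted() returns True and deny it if the timeout expires first.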
Lemma 1 If at least τj nodes reliably broadcast their H(H(PRV)) votes, then by the totality property all honest nodes will receive at least τj H(H(PRV)) values.
• If the IDS is perfect, PRV disclosure is preferred and Algorithm 3 triggers Algorithm 4 directly. In this case, the nodes vote by directly disclosing the PRVs against the target node. The revocation vote can be calculated as soon as t + 1 PRVs have been disclosed.
• If the IDS is biased, H(PRV) disclosure shall be chosen. The nodes first agree on the node revocation by broadcasting H(PRV). They disclose the PRVs only if the DTSD agrees on the revocation of ζ. This happens if at least τr H(PRV) values were received. Note that in this case it is enough to broadcast the PRV hashes to authenticate the messages, because the double hashes were already verified during the admission procedure. This is the key difference from yes/no voting, where such a possibility is not available. Finally, the PRVs are disclosed (or not disclosed) after agreement (disagreement).
Algorithm 4 Revocation algorithm run by DTSD member z against DTSD owner ζ in communication session k for perfect IDSs
    Broadcast PRV^k_{ζ,z}
    while number of disclosed PRV^k_{ζ,z} < τr do
        if ΔTrevocation exceeded then
            Revocation failed
        end if
        if next PRV^k_{ζ,z} received, is authentic and not replayed then
            Increase number of received votes by 1
        end if
    end while
    Reconstruct and authenticate revocation vote RV^k_{ζ,z}
    Broadcast revocation message (revocation succeeded)
• Yes/no voting can be applied with any IDS. In this case, the DTSD first agrees whether the node is to be revoked or not. To this end, the DTSD members exchange their yes/no votes. If at least τr of the DTSD members (see Table 3 for particular values) voted against the node, then, as a second step, the DTSD members disclose the PRVs that they have received to revoke the node. This leads to the revocation of ζ.
Algorithm 5 Revocation algorithm run by DTSD member z against DTSD owner ζ in communication session k for biased IDSs
    Broadcast vote H(PRV^k_{ζ,z})
    while number of disclosed H(PRV^k_{ζ,z}) < τr do
        if ΔTrevocation exceeded then
            Revocation failed
        end if
        if next H(PRV^k_{ζ,z}) received, is authentic and not replayed then
            Increase number of received votes by 1
        end if
    end while
    Trigger Algorithm 4
Once the revocation starts, and considering a maximum message delivery time ΔTTransmit, each DTSD member hears the votes of all the DTSD members at most after ΔTrevocation = ΔTIDS + ΔTTransmit. Each DTSD member that can hear the voting procedure starts a local counter after (i) receiving the first vote for revocation or (ii) disclosing its own vote. The DTSD owner likewise starts a counter as soon as it hears the first vote against it. As a result, when ΔTrevocation has elapsed, the DTSD owner either (i) dismisses its DTSD and needs to reinitialize it, or (ii) loses the ability to communicate with the entire network if it was revoked. DTSD members, which drop the association with the corresponding DTSD owner after ΔTrevocation has elapsed, correspondingly either (i) wait for node re-association or (ii) revoke the node from the network. If the consensus on revocation is positive, the collaborating nodes collect t + 1 different PRVs to revoke the target. A node receiving the required number of PRVs can verify their validity by checking whether they belong to the set of elements stored in the previously disclosed Merkle tree G^k_ζ. After successful verification, any node can reconstruct the secret f^k_ζ(0), i.e., the RV, for the corresponding revocation session. The nodes in the DTSD construct the revocation message against the node by concatenating the reconstructed secret RV with the public tree path that was disclosed at joining time. Observe that the verification of the reconstructed secret is possible because the previously verified tree path includes the hash of f^k_ζ(0). Hence, any node in the network only needs to verify whether the hash of f^k_ζ(0) yields the value contained in the tree path.
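Once t + 1 PRVs have been disclosed and the RV has been reconstructed, checking a revocation message reduces to recomputing the session leaf H(H(G^k_ζ) ‖ H(RV)) of tree Gζ. The Python sketch below illustrates this final check only; the message layout, the function names, and SHA-256 as H are assumptions, and the leaf is taken to have been authenticated earlier through the paths in Gζ and Gn.

```python
import hashlib


def H(data: bytes) -> bytes:
    """Hash function H; SHA-256 is an assumption."""
    return hashlib.sha256(data).digest()


def session_leaf(gk_root_hash: bytes, rv_hash: bytes) -> bytes:
    """Leaf of the node's session tree: H(H(G^k) || H(RV))."""
    return H(gk_root_hash + rv_hash)


def build_revocation_message(rv: bytes, gk_root_hash: bytes) -> dict:
    """Hypothetical message layout: the reconstructed RV plus the public tree data."""
    return {"rv": rv, "gk_root_hash": gk_root_hash}


def verify_revocation_message(message: dict, trusted_leaf: bytes) -> bool:
    """Accept the RV only if its hash reproduces the previously verified leaf."""
    rv_hash = H(message["rv"])
    return session_leaf(message["gk_root_hash"], rv_hash) == trusted_leaf
```

Because the check needs only public values, any node in the network can perform it, not just the DTSD members.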
5.3. Message passing
Correct protocol operation is not possible if the delivery of PRVs and RVs is not guaranteed during the respective voting procedures. To ensure this, ECoSec relies on PRV and RV gossip, i.e., nodes in the network cooperate in forwarding the PRVs and RVs to the other nodes in the network, and no faulty node (or collusion of faulty nodes) can form a network cut inside the DTSD.
PRV gossip allows a PRV to be conveyed from one member to the rest of the nodes in the same DTSD. One way is to unicast or broadcast these messages, independently of the type of voting scheme, if the network is a small cluster of nodes that exhibits full connectivity. Alternatively, in a DTSD that does not exhibit full connectivity, messages can be relayed through intermediary non-faulty nodes: nodes inside the same DTSD send unicast messages (again using a reliable transmission protocol) either directly or via several hops such that each intermediary node is a member of the same DTSD. Although the first case is trivial, the second is challenging, since in some topologies there might not exist a path without a faulty node, making message delivery impossible. In Section 6, we show conditions for such paths to exist in certain topologies.
As for RV passing, whenever the final revocation vote is constructed, it can be broadcast to all neighbors. Such a message will reach the nodes in the network as long as the network is connected. This protocol operation is adequate because (i) revocation messages are self-certifying, i.e., they can be authenticated by any node in the network and not just by DTSD members, and (ii) revocation messages are always guaranteed to be fresh, since no node can be revoked twice.
5.4. Dynamic System Operation
In Section 5.2 we analyzed the system operation within a single communication session. We refer to dynamic system operation as the operation of a node ζ running ECoSec between communication sessions k and k + 1, each associated with a different set of PRVs and RV. The concept of communication session is designed to address two main operational issues. In both cases, node ζ follows the basic steps presented in Algorithm 6.
Algorithm 6 Dynamic system operation for node ζ in communication session k
    Broadcast DROP ζ’s DTSD FOR COMMUNICATION SESSION k
    Trigger Algorithm 1 for communication session k + 1
Dynamic system operation first addresses the topic of false revocations, first introduced in [6] as revocation sessions. In this situation, if the IDSs of a node’s DTSD take wrong decisions, the affected node ζ might be removed after a certain time. To avoid this, node ζ can disclose new keying material to make sure that it is not removed by error. Note that this use case is motivated by the usage of a revocation voting procedure for perfect IDSs (as in Algorithm 4) in which PRVs are directly disclosed. If agreement is used, as in Algorithm 5, the effect of IDS imperfections is minimized. In this situation, Algorithm 6 is triggered when the revocation procedure against node ζ is negative after ΔTrevocation has elapsed. In this case, node ζ can monitor the voting procedures against it, and if it observes many voting procedures, it might deduce that some of the nodes are not operating correctly. It then rejoins the network by redistributing fresh keying material for session k + 1. Note that the newly joining node can choose the members of its DTSD and remove those that, e.g., might be making wrong revocation decisions.
The second motivation for dynamic operation is node mobility: when a node moves, it can build a new DTSD at a different location. In the case of a mobile node, Algorithm 6 allows node ζ to send a drop communication message to its old DTSD for revocation session k such that each DTSD member removes it from the list of trusted devices and deletes ζ’s keying material associated with communication session k. When the node joins a new cluster of nodes in a different location, it discloses the keying material of communication session k + 1.
Lemma 2 ECoSec relies on a secret sharing scheme based on polynomials of degree t, and each DTSD member receives a PRV. Thus, the DTSD is secure under the collusion of up to c = t attackers because the hidden revocation vote can only be recovered by combining at least t + 1 PRVs.
Given this maximum threshold for each DTSD, c attackers might still try to subvert ECoSec operation during the voting procedures, namely node admission and revocation (Definition 1). To prove correct system operation during node revocation, we have to show that c faulty nodes cannot (i) hinder the network from removing an intruder or (ii) remove an honest node. Likewise, for node admission we have to show that a collusion of up to c attackers cannot (iii) help another faulty node to join the network without disclosing enough revocation information or (iv) prevent an honest node from joining the network. The question to answer is: what is the minimum number of nodes that should participate in the admission voting such that a coalition of c faulty nodes cannot undertake those actions? We note that this problem relates to the Byzantine agreement problem.
Theorem 1 A collusion of c = t faulty nodes cannot subvert protocol operation if during admission qj = 3t + 1 nodes are present and the underlying IDS operates faultlessly. Proof: To prove this it is sufficient to show that conditions (i), (ii), (iii), and (iv) above hold. Note that there are up to c faulty nodes (Lemma 2) and at least qj − c ≥ qr honest nodes in the DTSD. Each node’s IDS is perfect, and thus, the decisions of a node are faultless. (i) If the IDS is perfect ECoSec uses Algorithm 4 for voting. In this case, c attackers will try to remove the honest node by disclosing their t PRVs. However, as the IDS of the honest nodes does not trigger any alarm (they are perfect), then, the last and needed PRV will not be disclosed by any honest node. And eventually, the attackers will fail.
6. Analysis
This section analyzes ECoSec operation and parameters for the following configuration: H(H(PRV)) agreement for admission voting, and either PRV disclosure for revocation when a perfect IDS is in place or H(PRV) disclosure for revocation when a biased IDS is used. For the perfect IDS configuration we prove the three properties introduced in Section 3: system correctness, completeness, and bounded execution times. Section 6.4 discusses the differences in configuration when a biased IDS is in place. Finally, in Section 6.5 we demonstrate how different protocol configurations affect the system performance. Specifically, we show how H(H(PRV)) agreement reduces the overall communication overhead.
6.1. Correctness Property for Perfect IDSs
We prove the protocol correctness by showing that it fulfills the design principles of a cooperative security protocol (see Definition 1) under the collusion of c compromised nodes in each DTSD. We first show the maximum number of compromised nodes that can be endured within a DTSD. Then, we elaborate on the minimum DTSD size and the total number of PRVs that ensure correct system operation.
(ii) If honest nodes in a DTSD find a node to be an attacker, the IDS of all honest nodes will trigger an alarm. Honest nodes will disclose at least t + 1 distinct PRVs, allowing for the reconstruction of the RV. As a result, this leads to a DTSD-wide revocation. Then, qr ≥ c + t + 1 = 2t + 1.
(iii) The joining attacker has to disclose enough information; however, it can collude with up to c attackers. If c attackers within the DTSD vote positively, i.e., reliably broadcast their H(H(PRV)) according to Algorithm 2, then at least another t + 1 honest nodes must do so as well to make sure that the network has enough revocation information to reconstruct the RV and revoke the node in the future. It is obvious that the reliable broadcast of 2t + 1 H(H(PRV)) values is mandatory: if fewer than 2t + 1 votes are received during the voting procedure, then according to the totality property (Definition 8) it can only mean that the joining attacker tries to fool DTSD members by disclosing fewer PRVs to the honest nodes than required.
inside the same DTSD. In other words, a member of a DTSD can reliably send a message to any other member of the same DTSD via a path that is either direct (1-hop) or multi-hop, and comprises non-faulty members of the same DTSD only. Availability of such communication paths is needed to ensure the correct execution of the admission and revocation voting procedures. Otherwise, no protocol can work, since intentional attackers can drop all communications. In what follows, we review how these conditions are satisfied for two particular topologies: fully connected star topology and random graph. Other deployments are out of the scope of this work.
1-hop clustered topology: Occurs when n nodes are clustered into disconnected groups of qj + 1 members whose membership is unknown a priori. We believe that such a deployment is also practical. For instance, it can represent a large network divided into a number of smaller isolated security domains, such as isolated personal area networks or clusters of processes. Clearly, in such a case the network connectivity is guaranteed to be qj, i.e., the number of disjoint paths between any pair of nodes is always qj. Moreover, because the network is fully connected, every DTSD member is at most 1 hop away. Hence, the cluster topology unconditionally exhibits the completeness properties described earlier.
Random topology: A deployment in which the topology is formed by randomly scattering n wireless nodes (each having communication radius R) over an area Adeployment. Obviously, the connectivity of such a network largely depends on the parameters n, Adeployment, and R. According to [1], these parameters can be selected such that the resulting network is properly connected.
(iv) If the compromised nodes within the potential DTSD try to prevent the node from joining, c nodes will not reliably broadcast their H(H(PRVs)) stating that they have not received them. We know from (ii) that the DTSD must receive at least 2t + 1 H(H(PRV)). Hence, the potential DTSD must initially comprise at least qj ≥ c + (2t + 1) = 3t + 1 nodes. In this way, (at least) 2t + 1 honest nodes will send their H(H(PRV)) values, and thus, the honest node will be allowed to join the network.
Corollary 1 From points (iii) and (iv) in Theorem 1, the threshold in Algorithm 2 is equal to τj = 2t + 1 PRVs. Based on points (i) and (ii) in Theorem 1, the assumption of perfect IDSs, and Lemma 2, the threshold in Algorithm 4 is equal to τr = t + 1 PRVs.
Corollary 2 From Lemma 2, ECoSec can endure up to c = t faulty nodes within a DTSD. From Theorem 1, the system operates correctly if qj ≥ 3t + 1. The ratio between faulty nodes and the number of DTSD members is maximized when c is maximum and qj is minimum. Thus, ECoSec can endure up to one third of faulty nodes, and the optimal DTSD size is equal to 3t + 1 nodes.
A final remark: Theorem 1 and Corollary 2 for ECoSec fit existing results for the Byzantine Generals Problem [21]. This is a reasonable result because ECoSec relies on two voting procedures whose outcome can be modified by the faulty nodes. The main interesting point here is that qj is a function of qr. This result also shows the improvement of the ECoSec protocol over the original Cooperative Security Protocol [14], which is only able to endure up to 17% of faulty nodes in the best case.
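The relations between the parameters in Theorem 1 and Corollaries 1 and 2 can be collected in a few lines of Python. This is only a restatement of the bounds above for the perfect-IDS configuration; the function name and the returned dictionary keys are hypothetical.

```python
def ecosec_parameters(t: int) -> dict:
    """Derive DTSD sizes and thresholds from the polynomial degree t (perfect IDS)."""
    c = t                  # tolerated colluding nodes (Lemma 2)
    tau_r = t + 1          # PRVs needed to reconstruct the RV (Algorithm 4)
    q_r = c + tau_r        # 2t + 1 members needed at revocation
    tau_j = 2 * t + 1      # admission votes needed (Algorithm 2)
    q_j = c + tau_j        # 3t + 1 members needed at admission
    return {
        "c": c,
        "tau_r": tau_r,
        "q_r": q_r,
        "tau_j": tau_j,
        "q_j": q_j,
        "faulty_ratio": c / q_j,  # t / (3t + 1), always below one third
    }
```

For example, t = 4 gives qj = 13, τj = 9, qr = 9, and τr = 5, with a tolerated fraction of 4/13 faulty members.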
Lemma 3 In a random topology, there exists with high probability at least one non-compromised path between any pair of honest members of the same DTSD.
Sketch of the proof: Assume that each DTSD comprises at least $q_r = 2t + 1$ nodes which are uniformly distributed within the area $A_{\text{DTSD}} = \pi R^2$ with the DTSD owner in the center. We prove the lemma by contradiction. Suppose there always exists a benign DTSD member that is completely isolated from the other benign DTSD members. This essentially means that the probability of the subgraph comprising the $q_r - c$ benign DTSD members being connected equals 0. In other words, using the definition of this probability as in [1], we let $\Pr(\text{path exists}) \approx (1 - e^{c - q_r})^{q_r - c} = 0$. However, choosing a practical value t = 4 and assuming that c = t, we find that this probability is approximately 0.97. This contradicts our previous statement.
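The numerical value quoted in the proof sketch can be reproduced with a few lines of Python. This is only a sketch: it assumes the probability expression reads (1 − e^(c − q_r))^(q_r − c) with q_r = 2t + 1, and the function name `path_probability` is ours.

```python
import math


def path_probability(t: int, c: int) -> float:
    """Evaluate (1 - e^(c - q_r))^(q_r - c) for a DTSD of q_r = 2t + 1 nodes."""
    q_r = 2 * t + 1      # DTSD size assumed in the proof sketch
    benign = q_r - c     # number of benign DTSD members
    if benign <= 0:
        raise ValueError("the DTSD must contain at least one benign member")
    return (1.0 - math.exp(-benign)) ** benign


if __name__ == "__main__":
    # t = 4 and c = t, as in the proof sketch
    print(round(path_probability(4, 4), 2))
```

With t = 4 and c = 4 the DTSD has 9 members, 5 of them benign, and the expression evaluates to roughly 0.97, in line with the value stated in the text.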
6.2. Completeness Property
In this section we show that the completeness property is satisfied. Essentially, the problem reduces to the following questions: Can nodes participating in the admission and revocation of node ζ communicate with each other? And if node ζ is detected to be an attacker, can the final revocation vote reach the entire network? To answer the first question, we show that non-compromised nodes inside the same DTSD can reliably deliver messages to each other either directly or via a path that comprises non-compromised members of the same DTSD. To answer the second, we show that a communication path exists between any pair of nodes in the network. Before we dive into the analysis, we provide several useful definitions. In the context of this work, network connectivity refers to the number of paths that exist between any pair of nodes in the whole network. If each node has at least one communication path to any other node, then the network is called connected. On the other hand, we define DTSD connectivity as the existence of a non-compromised path between any pair of nodes
6.3. Bounded Execution and Stabilization Time
This section analyzes the bounded execution time of ECoSec operation during node admission and revocation, as well as the network stabilization time.
Lemma 4 A joining node ζ is admitted by a DTSD within $\Delta T_{\text{admission}} = \Delta T_{\text{Algorithm 1}} + \Delta T_{\text{LocalVerification}} + q_j^3 \, \Delta T_{\text{Transmit}}$ if (i) node ζ discloses at least 3t + 1 PRVs to at least $q_j = 3t + 1$ nodes according to Algorithm 1 and (ii) there are at most c = t faulty nodes among its $q_j$ DTSD members.
fails due to a timeout event, is known and that the maximum duration, $\Delta T_{\text{admission}}$, of an admission process is defined as in Lemma 4. Then, the expected time needed for a successful DTSD creation can be defined as follows: $E[X] = \Delta T_{\text{admission}} (1 - p_t) \sum_{i=0}^{\infty} i \, p_t^{i-1} = \Delta T_{\text{admission}} \frac{1}{1 - p_t}$. Assuming that $p_t$ is identical for all DTSDs, i.e., the probability of timeout does not depend on the number of nodes in the network currently constructing their DTSDs, the upper bound for the network stabilization time can be characterized as $O\left(\Delta T_{\text{admission}} \frac{1}{1 - p_t}\right)$. For example, when $p_t = 1 \cdot 10^{-5}$ the stabilization time is approximately $\Delta T_{\text{admission}}$, and when $p_t = 0.9$ this time equals $10 \, \Delta T_{\text{admission}}$, that is, on average each node will attempt 10 times before successfully constructing its DTSD. In practice, however, $p_t$ would be well below 1.
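The expected stabilization time can be checked numerically. The sketch below evaluates the closed form 1/(1 − p_t) for the expected number of admission attempts and compares it against a truncated version of the series (1 − p_t) Σ i·p_t^(i−1). The function names and the truncation length are our own illustrative choices.

```python
def expected_attempts(p_t: float) -> float:
    """Closed form 1 / (1 - p_t) for the expected number of admission attempts."""
    if not 0.0 <= p_t < 1.0:
        raise ValueError("p_t must lie in [0, 1)")
    return 1.0 / (1.0 - p_t)


def expected_attempts_series(p_t: float, terms: int = 2000) -> float:
    """Truncated series (1 - p_t) * sum_i i * p_t^(i-1), used as a cross-check."""
    return (1.0 - p_t) * sum(i * p_t ** (i - 1) for i in range(1, terms + 1))


def expected_stabilization_time(dt_admission: float, p_t: float) -> float:
    """Expected time for a node to construct its DTSD successfully."""
    return dt_admission * expected_attempts(p_t)


if __name__ == "__main__":
    for p_t in (1e-5, 0.5, 0.9):
        print(p_t, expected_attempts(p_t), expected_attempts_series(p_t))
```

For p_t = 0.9 both functions give about 10 attempts, matching the example in the text; for very small p_t the expected time is essentially a single admission round.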
Proof: An honest joining node ζ discloses its PRVs, which will be received by each of its $q_j$ neighbors within $\Delta T_{\text{Transmit}}$ due to our assumed transmission model and topology. Thus, ζ executes Algorithm 1 in finite time $\Delta T_{\text{Algorithm 1}}$, which depends linearly on the number of DTSD members $q_j$. Each of them can validate the correctness of the disclosed PRV by means of the verification tree path in $\Delta T_{\text{LocalVerification}}$ according to Algorithm 2. Next, the DTSD verification, carried out by securely broadcasting (using reliable broadcast) the votes between the DTSD members, is successfully performed in up to $q_j^3 \, \Delta T_{\text{Transmit}}$ after each of the DTSD members has disclosed its vote regarding the admission of the joining node, this vote has been verified by each of the DTSD members, and at least $q_j - c$ positive votes have been collected, requiring $\Delta T_{\text{GlobalVerification}}$. Thus, the joining node will join in $\Delta T_{\text{admission}} = \Delta T_{\text{Algorithm 1}} + \Delta T_{\text{LocalVerification}} + q_j^3 \, \Delta T_{\text{Transmit}}$.
6.4. Perfect vs. Biased IDSs
As discussed earlier, ECoSec relies on an intruder detection system (IDS) to make the node revocation decisions. This allows us to differentiate between the actual ECoSec operation and the mechanism used to trigger the alarms. Note that this is similar to some extent to the operation of failure detectors [8]. So far, we have analyzed the protocol operation for a perfect IDS only. However, perfect IDS operation is difficult to ensure in a real system. In the following paragraphs we therefore consider the effect of biased IDSs on the protocol operation. We consider the worst case, in which the DTSD after admission voting contains c = t colluding attackers. We let x be the number of honest nodes in the DTSD. During the revocation, attackers will always tend to vote positively to remove a good node from the network. This makes a single vote, erroneously disclosed by any one of the honest nodes, sufficient to revoke a node. The probability of such a wrong revocation due to a biased IDS is equal to $p = \Pr(\text{at least one honest node votes}) = 1 - (1 - p_e)^x$. To deal with this issue, one can require a higher number of nodes agreeing on the removal of the node. In other words, instead of requiring just $\tau_r = t + 1$ positive votes, we can set the revocation threshold as a function of $q_r$, i.e., $\tau_r = q_r - t = t + 1 + y'$, where $q_r = 2t + 1 + y'$. In this case the number of honest nodes will be at least $x' = x + y'$, where x is as above. Then a node will be erroneously removed if at least $y' + 1$ honest nodes out of $x'$ make a mistake. This probability follows a binomial distribution and is calculated as $p' = \Pr(\text{at least } y' + 1 \text{ honest nodes vote}) = \sum_{i=y'+1}^{x'} \binom{x'}{i} p_e^i (1 - p_e)^{x'-i}$. Note that by increasing the number of nodes required for the revocation, the number of nodes participating in the admission procedure should be increased as well.
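To illustrate how raising the revocation threshold reduces the probability of a false revocation, the following sketch evaluates the binomial tail described above. It rests on assumptions not spelled out in this section: p_e is read as the probability that a single honest node's IDS raises a false alarm, and these errors are taken to be independent across nodes. The parameter `y` is the number of extra votes required beyond t + 1, and the function name is ours.

```python
from math import comb


def false_revocation_probability(x: int, y: int, p_e: float) -> float:
    """Probability that at least y + 1 of the x + y honest nodes vote erroneously.

    x   -- honest nodes in the baseline DTSD
    y   -- extra votes required beyond the baseline threshold t + 1
    p_e -- assumed per-node probability of an erroneous IDS alarm
    """
    if x < 1 or y < 0 or not 0.0 <= p_e <= 1.0:
        raise ValueError("invalid parameters")
    honest = x + y  # honest nodes after enlarging the DTSD
    return sum(
        comb(honest, i) * p_e ** i * (1.0 - p_e) ** (honest - i)
        for i in range(y + 1, honest + 1)
    )


if __name__ == "__main__":
    for y in range(4):
        print(y, false_revocation_probability(x=5, y=y, p_e=0.01))
```

With y = 0 the expression reduces to 1 − (1 − p_e)^x, the single-vote case discussed first; each additional required vote lowers the probability by roughly a factor on the order of p_e when p_e is small.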
With the revocation threshold raised in this way, we have $q_j = q_r + t$ and $\tau_j = q_r$. Consequently, the usage of biased IDSs affects the correctness property because faulty nodes can exploit the wrong decisions of honest nodes due to IDS errors. The usage of Algorithm 5 with a higher revocation threshold, instead of the simple disclosure of PRVs (Algorithm 4), allows ensuring correct system operation with a relatively high probability. The analysis for completeness remains valid because that property is only related to the protocol operation and does not rely on
Lemma 5 A node ζ is revoked from the network by its DTSD within $\Delta T_{\text{revocation}}$ if there are at most c = t faulty nodes among its $q_r$ DTSD members and the IDS is biased. $\Delta T_{\text{revocation}}$ is at most $\Delta T_{\text{IDS}} + \Delta T_{\text{Algorithm 5}} + (t + 1) \Delta T_{\text{Transmit}} + \Delta T_{\text{network}}$.
Proof: Since the biased IDS does not produce false negatives, each of the $q_r - c$ honest DTSD members will start a vote against the node within $\Delta T_{\text{IDS}}$ after the node has misbehaved (Algorithm 5). Each DTSD member will receive and verify each other's votes in up to $\Delta T_{\text{Transmit}}$. Thus, Algorithm 4 is triggered in up to $\Delta T_{\text{Algorithm 5}} = q_r \, \Delta T_{\text{Transmit}}$. Algorithm 4 then involves the disclosure of only t + 1 PRVs in $(t + 1) \Delta T_{\text{Transmit}}$, allowing for the reconstruction of the revocation vote against the misbehaving node ζ, which will reach the whole network in $\Delta T_{\text{network}}$. Thus, ζ is removed from the network in up to $\Delta T_{\text{revocation}} = \Delta T_{\text{IDS}} + \Delta T_{\text{Algorithm 5}} + (t + 1) \Delta T_{\text{Transmit}} + \Delta T_{\text{network}}$.
The above lemma presents the result for bounded revocation time when the IDS is biased. However, if the IDS is perfect, we can omit the running time of Algorithm 5 and directly derive the result for the perfect IDS configuration: $\Delta T_{\text{revocation}} = \Delta T_{\text{IDS}} + (t + 1) \Delta T_{\text{Transmit}} + \Delta T_{\text{network}}$. Finally, we estimate the time needed for the network to stabilize after deployment. In the context of this work, the term stabilization refers to the system state in which all non-faulty nodes in the network have constructed their corresponding DTSDs and, thus, successfully joined the network. We consider that the probability, $p_t$, with which a joining operation
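the IDS operation. However, the execution time increases due to the usage of Algorithm 5.

Table 3: Voting configurations

| Agreement during admission | IDS | Revocation scheme | $q_j$ | $q_r$ | $\tau_j$ | $\tau_r$ | Message complexity, admission | Message complexity, revocation | Communication complexity, admission | Communication complexity, revocation |
|---|---|---|---|---|---|---|---|---|---|---|
| Yes/no | Perfect | PRV disclosure | $\geq 3t + 1$ | $\geq 2t + 1$ | $2t + 1$ | $t + 1$ | $O(q_j^3)$ | $O(q_r)$ | $O(\lvert S \rvert q_j^3)$ | $O(\lvert V \rvert q_r)$ |
| Yes/no | Biased | Yes/no agreement | $\geq 4t + 1$ | $\geq 3t + 1$ | $3t + 1$ | $2t + 1$ | $O(q_j^3)$ | $O(q_r^3)$ | $O(\lvert S \rvert q_j^3)$ | $O(\lvert S \rvert q_r^3)$ |
| H(H(PRV)) | Perfect | PRV disclosure | $\geq 3t + 1$ | $\geq 2t + 1$ | $2t + 1$ | $t + 1$ | $O(q_j^3)$ | $O(q_r)$ | $O((\lvert S \rvert + \lvert V \rvert) q_j^3)$ | $O(\lvert V \rvert q_r)$ |
| H(H(PRV)) | Biased | H(PRV) disclosure | $\geq q_r + t$ | $\geq 2t + 1$ | $q_r$ | $q_r - t$ | $O(q_j^3)$ | $O(q_r)$ | $O((\lvert S \rvert + \lvert V \rvert) q_j^3)$ | $O(\lvert V \rvert q_r)$ |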
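The parameter choices summarized in Table 3 can be captured in a small helper that returns the minimal values of q_j, q_r, τ_j, and τ_r for a given configuration. This is an illustrative sketch only: the names `VotingConfig` and `minimal_config` and the string labels for the admission schemes are ours, and the function always picks the smallest values allowed by the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VotingConfig:
    q_j: int    # nodes participating in admission voting
    q_r: int    # nodes participating in revocation voting
    tau_j: int  # admission threshold
    tau_r: int  # revocation threshold


def minimal_config(admission: str, biased_ids: bool, t: int,
                   q_r: Optional[int] = None) -> VotingConfig:
    """Smallest parameters from Table 3 tolerating t faulty nodes.

    admission -- "yes_no" or "hash" (H(H(PRV)) agreement); labels are ours
    q_r       -- optional larger revocation group (biased IDS, "hash" only)
    """
    if admission not in ("yes_no", "hash"):
        raise ValueError("unknown admission scheme")
    if t < 1:
        raise ValueError("t must be a positive integer")
    if not biased_ids:
        # Perfect IDS: plain PRV disclosure, identical for both schemes.
        return VotingConfig(3 * t + 1, 2 * t + 1, 2 * t + 1, t + 1)
    if admission == "yes_no":
        # Biased IDS with yes/no agreement for admission and revocation.
        return VotingConfig(4 * t + 1, 3 * t + 1, 3 * t + 1, 2 * t + 1)
    # Biased IDS with H(PRV) disclosure: everything is a function of q_r.
    q_r = 2 * t + 1 if q_r is None else q_r
    if q_r < 2 * t + 1:
        raise ValueError("q_r must be at least 2t + 1")
    return VotingConfig(q_r + t, q_r, q_r, q_r - t)


if __name__ == "__main__":
    print(minimal_config("yes_no", biased_ids=True, t=2))
    print(minimal_config("hash", biased_ids=True, t=2, q_r=8))
```

Passing a larger `q_r` in the hash-based biased configuration shows the trade-off directly: the revocation threshold grows with q_r, and so does the number of nodes needed for admission.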
lower the probability of false revocation becomes, but the larger $q_j$ gets.
Node readmission and duration of a communication session: Faulty nodes, by intentionally disclosing their PRVs after a node's admission, can try to subvert normal system operation. Although the voting procedures against an honest node will fail (if c < t + 1), this still represents a threat since the node has to rejoin the network after each failed revocation procedure. Eventually, this can make a node unavailable for some time. There are several ways to combat such misbehavior. First, such nodes should be revoked from the system by their respective DTSDs. Second, instead of a PRV hash chain with three elements (PRV, H(PRV), and H(H(PRV))), the protocol can use a longer hash chain $(PRV, \ldots, H^h(PRV))$. This will allow the DTSD members to use the first hash element for the admission voting (as before) and the next h − 2 hashes in revocation voting procedures without the need for rekeying, giving the neighbors enough time to detect and isolate the intruder.
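The longer hash chain mentioned above can be sketched as follows. This is a minimal illustration, not the ECoSec keying material itself: SHA-256 stands in for the hash function H, which this section does not specify, and the function names are ours. A disclosed element is accepted if hashing it forward a bounded number of times reproduces a previously verified chain element.

```python
import hashlib
from typing import List, Optional


def H(data: bytes) -> bytes:
    """Hash function H; SHA-256 is an assumption made for this sketch."""
    return hashlib.sha256(data).digest()


def build_hash_chain(prv: bytes, h: int) -> List[bytes]:
    """Return [PRV, H(PRV), ..., H^h(PRV)]."""
    if h < 1:
        raise ValueError("h must be at least 1")
    chain = [prv]
    for _ in range(h):
        chain.append(H(chain[-1]))
    return chain


def steps_to_anchor(disclosed: bytes, anchor: bytes, max_steps: int) -> Optional[int]:
    """Hash `disclosed` forward; return the number of steps needed to reach
    `anchor`, or None if it is not reached within max_steps."""
    value = disclosed
    for steps in range(1, max_steps + 1):
        value = H(value)
        if value == anchor:
            return steps
    return None


if __name__ == "__main__":
    chain = build_hash_chain(b"example partial revocation vote", h=4)
    anchor = chain[-1]  # element verified during admission voting
    print(steps_to_anchor(chain[3], anchor, max_steps=4))  # one step below the anchor
    print(steps_to_anchor(b"forged value", anchor, max_steps=4))
```

Because each element is the preimage of the next, a node that holds only the anchor can verify later disclosures without any additional keys, which is what makes the intermediate hashes usable across several voting rounds.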
6.5. Protocol configuration
This section briefly discusses some practical considerations related to the protocol operation.
Choice of the voting strategy: The choice of the voting strategy largely depends on the type of IDS used. Table 3 summarizes the different configurations, and the following paragraphs elaborate on the differences between them. Having a good IDS allows applying the PRV disclosure voting scheme, as it minimizes the overhead. The usage of a biased IDS requires reaching consensus on node revocation before disclosing the PRVs; otherwise the corrupted nodes within a DTSD will exploit the situation to force the revocation of an honest node. In this setting, yes/no voting offers a similar degree of resiliency as the H(H(PRV)) disclosure. Moreover, both voting procedures require reaching consensus between at least 2t + 1 nodes and therefore considerably decrease the probability of false revocation. However, there are two key differences. First, the yes/no voting strategy, whenever used for both admission and revocation, requires $q_j \geq 4t + 1$. Intuitively, the result follows from two facts: (a) a joining attacker that distributes t + 1 votes to honest nodes and t votes to colluding attackers, and conceals t other votes (given that the admission threshold is 2t + 1), will be successfully admitted, as both the t attackers and the t + 1 honest nodes will vote positively; (b) for the yes/no voting to work during the revocation, the cooperation of at least 3t + 1 nodes is needed. Hence, the only way for this configuration to work is to set the admission threshold to 3t + 1 and the number of nodes participating in admission voting to at least 4t + 1. Second, yes/no voting spawns $O(q_r^3)$ messages during admission and revocation, but H(PRV) and PRV disclosure require significantly fewer message exchanges.
Also, neither H(PRV) nor PRV requires authentication during revocation voting, since the votes are self-certifying after disclosing the first hash element in authenticated admission voting. This reduces the communication complexity even more. Finally, we note that in the configuration that involves H(H(PRV)) agreement for admission and H(PRV) agreement for revocation, the revocation threshold ($\tau_r$) and the number of nodes required for admission ($q_j$) depend on the number of nodes required for revocation ($q_r$). By varying $q_r$ one can control the probability of erroneous revocation: the higher $q_r$ is the
6.6. Using ECoSec in a real-world application
After the formal description of the ECoSec protocol, this section describes in a simple way how ECoSec can be used in a real-world application. As an example, we consider a large wireless sensor network (WSN) deployed in a smart city. Such a network can be used, e.g., for monitoring environmental parameters or controlling smart city devices such as outdoor luminaires. We consider that this network is occasionally managed by a base station, the TTP, that connects the wireless sensor network with a backend server. The backend server collects information from the network or controls the smart city devices. In this setting the TTP is in charge of generating the ECoSec keying material and distributing it to the nodes before deployment (discussed in Section 5.1). Just after deployment, the nodes in the network start communicating with each other to form the network. It is at this stage that, e.g., the routing tables are created. In this early phase, each node forms its DTSD by following the admission voting procedure of the ECoSec protocol. For a real-world application, one might even assume that the nodes at this early phase are non-faulty devices. Under this assumption, which is valid in real deployments, the devices would just distribute their ECoSec keying material to their neighbors, and the neighbors (the DTSD) would skip the verification phase since the devices are non-faulty and will follow the ECoSec protocol. If this cannot be assumed, then the nodes will follow the normal verification process explained in Section 5.2.1.
During operation the WSN nodes perform their tasks, e.g., monitoring the smart city or device control. Furthermore, the WSN nodes exchange messages with the backend server through the base station related to the sensed values, energy consumption values, or control commands. For this, the WSN nodes have to route the messages through the network. At this time, it might happen that a node is not working reliably due to hardware failures. This can cause serious problems in routing protocols. It is even possible that a device has been captured by an attacker. In any case, the neighbors can detect this by using an intruder detection algorithm. If the neighbors observe and agree on a wrong behavior, they can use ECoSec to isolate the faulty device as described in Section 5.2.2. Most importantly, the revocation message will also reach the base station, which will then be able to take the corresponding actions. Such actions can be, e.g., the replacement of the device if it is broken, or its revocation from the network if it has been compromised. In case of node capture, the base station might also update keys (e.g., if any network-wide keys are used).
7. Conclusions
This paper formalizes the concept of cooperative security by presenting its accurate definition and required properties. As a proof of concept we present ECoSec, which allows a network to decide on the trustworthiness and revocation of each node in a cooperative and fully distributed manner. Starting from an undefined level of trust after deployment, we show how a node can gain trust during the admission voting by distributing its own revocation information among its neighbors. The node can lose trust during a revocation voting if its neighbors decide that it is misbehaving. The fact that a node carries its own revocation information, a feature first introduced in [14], reduces memory storage needs when compared with other distributed revocation protocols such as [6].
Our detailed analysis allows us to deduce the exact thresholds for the number of endured faulty devices during both voting procedures. In particular, for the setting with a perfect IDS we show that these thresholds are optimal when exactly 3t + 1 nodes (including up to t compromised nodes) take part in admission voting. Thus, ECoSec improves on the number of faulty devices endured by its predecessor, the cooperative security protocol in [14]. Otherwise, if the IDS has a non-zero false positive probability, the optimal values are a function of the number of nodes participating in the node revocation procedure. Furthermore, while the message complexity for the admission voting is $O(q_j^3)$, our keying material design allows reducing the complexity of the revocation voting considerably by using information verified during the first voting to achieve a much more efficient revocation agreement procedure. Similar results are not achievable when voting procedures are implemented in an isolated manner, e.g., with binary Byzantine agreement or with an existing protocol for group membership management.
Acknowledgments
This work was supported in part by Academy of Finland project SEMOHealth.
References
[1] C. Bettstetter. On the minimum node degree and connectivity of a wireless multihop network. In Proceedings of the 3rd ACM international symposium on Mobile ad hoc networking & computing, MobiHoc ’02, pages 80–91, New York, NY, USA, 2002. ACM.
[2] C. Blundo, A. De Santis, A. Herzberg, S. Kutten, U. Vaccaro, and M. Yung. Perfectly-Secure Key Distribution for Dynamic Conferences. Lecture Notes in Computer Science, 740:471–486, 1993.
[3] G. Bracha. An asynchronous [(n - 1)/3]-resilient consensus protocol. In Proceedings of the third annual ACM symposium on Principles of distributed computing, PODC ’84, pages 154–162, New York, NY, USA, 1984. ACM.
[4] C. Cachin, K. Kursawe, F. Petzold, and V. Shoup. Secure and efficient asynchronous broadcast protocols (extended abstract). In Advances in Cryptology: CRYPTO 2001, pages 524–541. Springer, 2001.
[5] C. Cachin, K. Kursawe, and V. Shoup. Random oracles in constantinople: Practical asynchronous byzantine agreement using cryptography. In Proc. 19th ACM Symposium on Principles of Distributed Computing (PODC), pages 123–132, 2000.
[6] H. Chan, V. D. Gligor, A. Perrig, and G. Muralidharan. On the distribution and revocation of cryptographic keys in sensor networks. IEEE Transactions on Dependable and Secure Computing, 2:233–247, 2005.
[7] H. Chan, A. Perrig, and D. Song. Random key predistribution schemes for sensor networks. In Proceedings of the 2003 IEEE Symposium on Security and Privacy, SP ’03, pages 197–, Washington, DC, USA, 2003. IEEE Computer Society.
[8] T. D. Chandra and S. Toueg. Unreliable failure detectors for asynchronous systems. pages 325–340, New York, NY, USA, 1991. ACM.
[9] J. Clulow and T. Moore. Suicide for the common good: a new strategy for credential revocation in self-organizing systems. SIGOPS Oper. Syst. Rev., 40(3):18–21, July 2006.
[10] F. Cristian. Reaching agreement on processor group membership in synchronous distributed systems. Distributed Computing, 4:175–187, 1991.
[11] G. Dini and I. M. Savino. An efficient key revocation protocol for wireless sensor networks. In Proceedings of the 2006 International Symposium on a World of Wireless, Mobile and Multimedia Networks, WOWMOM ’06, pages 450–452, Washington, DC, USA, 2006. IEEE Computer Society.
[12] L. Eschenauer and V. D. Gligor. A key-management scheme for distributed sensor networks. In Proceedings of the 9th ACM conference on Computer and communications security, CCS ’02, pages 41–47, New York, NY, USA, 2002. ACM.
[13] P. D. Ezhilchelvan and R. de Lemos. Reaching agreement on processor group membership in synchronous distributed systems. pages 173–179, 1990.
[14] O. Garcia-Morchon, H. Baldus, T. Heer, and K. Wehrle. Cooperative security in distributed sensor networks. In CollaborateCom, pages 96–105. IEEE, 2007.
[15] S. Ghosh, T. Herman, and S. V. Pemmaraju. A fault-containing self-stabilizing algorithm for spanning trees. In Proceedings of Journal of Computing and Information, pages 322–338, 1996.
[16] S. Gupta, R. Zheng, and A. M. K. Cheng. ANDES: an Anomaly Detection System for Wireless Sensor Networks. In MASS’07, pages 1–9, 2007.
[17] P. Hamalainen, M. Kuorilehto, T. Alho, M. Hannikainen, and T. D. Hamalainen. Security in wireless sensor networks: Considerations and experiments. In Proceedings of 6th International Workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation, pages 167–177, 2006.
[18] T. Herman and S. Pemmaraju. Error-detecting codes and fault-containing self-stabilization. Inf. Process. Lett., 73(1-2):41–46, Jan. 2000.
[19] D. Kuptsov, A. Gurtov, O. Garcia-Morchon, and K. Wehrle. Brief announcement: distributed trust management and revocation. In Proceedings of the 29th ACM SIGACT-SIGOPS symposium on Principles of distributed computing, PODC ’10, pages 233–234, New York, NY, USA, 2010. ACM.
[20] L. Lamport. Time, clocks, and the ordering of events in a distributed system. Commun. ACM, 21(7):558–565, July 1978.
[21] L. Lamport, R. Shostak, and M. Pease. The byzantine generals problem. ACM Trans. Program. Lang. Syst., 4(3):382–401, July 1982.
[22] D. Liu, P. Ning, and R. Li. Establishing pairwise keys in distributed sensor networks. ACM Trans. Inf. Syst. Secur., 8(1):41–77, Feb. 2005.
[23] D. Liu, P. Ning, and K. Sun. Efficient self-healing group key distribution with revocation capability. In Proceedings of the 10th ACM Conference on Computer and Communications Security, CCS ’03, pages 231–240, New York, NY, USA, 2003. ACM.
[24] H. Luo, J. Kong, P. Zerfos, S. Lu, and L. Zhang. URSA: Ubiquitous and robust access control for mobile ad hoc networks. IEEE/ACM Trans. Netw., 12(6):1049–1063, Dec. 2004.
[25] R. C. Merkle. Secrecy, authentication, and public key systems. PhD thesis, Stanford, CA, USA, 1979. AAI8001972.
[26] T. Moore, J. Clulow, S. Nagaraja, and R. Anderson. New strategies for revocation in ad-hoc networks. In Proceedings of the 4th European conference on Security and privacy in ad-hoc and sensor networks, ESAS’07, pages 232–246, Berlin, Heidelberg, 2007. Springer-Verlag.
[27] L. E. Moser, P. M. Melliar-Smith, and V. Agrawala. Membership algorithms for asynchronous distributed systems. In ICDCS’91, pages 480–488, 1991.
[28] M. Pease, R. Shostak, and L. Lamport. Reaching agreement in the presence of faults. J. ACM, 27(2):228–234, Apr. 1980.
[29] M. K. Reiter. A secure group membership protocol. IEEE Trans. Softw. Eng., 22(1):31–42, Jan. 1996.
[30] N. Saxena, G. Tsudik, and J. H. Yi. Identity-based access control for ad hoc groups. In Proceedings of the 7th international conference on Information Security and Cryptology, ICISC’04, pages 362–379, Berlin, Heidelberg, 2005. Springer-Verlag.
[31] R. D. Schlichting and F. B. Schneider. Fail-stop processors: An approach to designing fault-tolerant computing systems. ACM Transactions on Computer Systems, 1:222–238, 1983.
[32] F. B. Schneider. Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Comput. Surv., 22(4):299–319, Dec. 1990.
[33] A. Shamir. How to share a secret. Commun. ACM, 22(11):612–613, Nov. 1979.
[34] H. Wang, B. Sheng, and Q. Li. Elliptic curve cryptography-based access control in sensor networks. Int. J. Secur. Netw., 1(3/4):127–137, Dec. 2006.