computers & security 58 (2016) 106–124
Proactively applied encryption in multipath networks

James Obert a,*, Inna Pivkina b, Hong Huang b, Huiping Cao b

a Cyber R&D Solutions, Sandia National Labs, Albuquerque, NM, USA
b Computer Science & Electrical Engineering Departments, New Mexico State University, Las Cruces, NM, USA
A R T I C L E   I N F O

Article history:
Received 30 January 2015
Received in revised form 12 November 2015
Accepted 24 December 2015
Available online 6 January 2016

Keywords:
Multipath security
Information assurance
Anomaly detection
Intrusion detection
Data encryption

A B S T R A C T

In providing data privacy on multipath networks, it is important to conserve bandwidth by ensuring that only the necessary level of encryption is applied to each path. This is achieved by dispersing data along multiple secure paths in such a way that the highest encryption level is applied to those paths where threats are most likely to be present. Conversely, for those paths where the likelihood of attack is least, the encryption levels should be commensurately lower. In order to maintain data privacy, path encryption level adjustments should be proactive. In so doing, the multipath network should have the ability to calculate the probability of an attack and proactively adjust the encryption strength long before the final steps of an attack sequence occur. The unique methods described in this research are able to sense when an attack sequence is initiated on a path. This is achieved by calculating the probability of the presence of specific attack sequence signatures along each network path using statistical learning techniques, and by deriving path information assurance levels using these probabilities. As an attack sequence progresses, the likelihood of the presence of specific attacks grows until a threshold level is met and an encryption adjustment for a path is warranted.

© 2015 Elsevier Ltd. All rights reserved.
1. Introduction
Using multiple paths impedes an adversary's ability to focus an attack on a single path. A secure multipath network requires a sender to transmit data over multiple paths with a level of security enabled along each path that is commensurate with the relative threats and is proactively adjusted according to the foreseen attack environment. Even though proactively adjusting security measures is crucial, existing multipath routing protocols such as Multipath TCP lack the ability to sense threats or adjust the level of security along a path (Wei-wei and Hai-feng, 2010).
In this paper, we present a novel approach that utilizes machine learning techniques to determine the current and future information assurance levels of network paths in multipath networks. Through the use of mutual information theory, entropy analysis and Bayesian inference (Gelman, 2004), predicted future information assurance levels are calculated. With the ability to predict future attacks, a multipath network is able to proactively enforce appropriate security measures on the most vulnerable paths. Compared to other types of multipath network security methods, which are based on heuristically derived trust models, the proposed approach is based on actual cloud infrastructure and service provider directed attack sequences.
Additionally, our approach finds the likelihood of the presence of cloud-directed attack patterns in data event windows, proactively assigns information assurance levels, and applies appropriate encryption strength to paths. Although the exact security measures applied to vulnerable paths depend on the type of attack sequence signature identified, in this paper we address only attacks on data privacy, where an increase in path encryption strength will preserve privacy. In addition, the field of intrusion detection is quite expansive and includes complex topics such as polymorphic and zero-day attacks, which we will not attempt to address in this paper directly. We refer the reader to Bilge and Dumitras (2012) and Polychronakis et al. (2009) for an extensive review of these topics. Although we recognize that polymorphic and zero-day attacks pose a real threat to multipath networks, we restricted our research to slow attacks, where known static attack signature sequences are transmitted over multipath networks within an extended time period (see Section 3.1). Specifically, we explored scenarios where an attacker attempts to stagger the transmission of known attack signature elements over a targeted network path using a slow attack strategy in order to evade detection. Our methods calculate the probability of an attack occurring prior to the transmission of the final attack signature elements. Once the probability of a slow attack on a path is calculated, the encryption level on the targeted path is proactively adjusted.

The remainder of this paper is organized as follows. Section 2 provides a review of related work. Section 3 presents the derived network path information assurance level and proactive encryption methods. Section 4 presents evaluation results and conclusions.
2. Background
Early forms of multipath routing used no encryption; data were simply split among different routes in order to minimize the effects of malicious nodes. In the case of fixed bandwidth networks, the approach in Lee et al. (2005) uses existing multiple paths such that an intruder needs to spread resources across several paths to seriously degrade data confidentiality. Later approaches presented in Monnet and Mokdad (2013) and Younis et al. (2009) statically apply fixed encryption strengths to the data on each path according to heuristically predetermined data sensitiveness on that path. The approach in Younis et al. (2009) randomly alternates the encryption strengths along paths as a means to confuse an adversary. The primary limitation of Younis et al. (2009) is accurately forecasting the sensitivity of the data transmitted along a path. None of the approaches suggests an adequate or explicit means for combining dispersive data security methods with intelligent, dynamically differentiating path data security measures. Additionally, little progress has been made in providing resilience to protocol attacks, slow attacks or evasive attacks in multipath networks. The differentiating approach proposed in this paper is to proactively sense initiated attack sequences present along each network path and correspondingly increase the encryption strength on more vulnerable paths while decreasing it on the less vulnerable ones. In order to manage throughput loss, the transmission rates on more vulnerable paths will drop, while those on the less vulnerable paths will increase.
3. Network path security determination
Given a network, let I be the information assurance factor, C be the path cost factor (i.e., Open Shortest Path First Cost (Li and Kwok, 2005)), and E be the encryption scaling factor. The information assurance factor I is a measure of how secure a network path is. If I is determined to be low on a network path P, the probability of a network threat being present on that path is high and that path is considered to be vulnerable. The encryption scaling factor E is a factor representing how much encryption is to be applied to a network path in order that the path can be effectively protected from an attack. The path cost factor C represents the impedance presented by a network path. When C is high, network traffic is less likely to travel down a specific path. For distinct paths in a multipath network, the values of these factors are different. Ii, Ei and Ci are the information assurance, encryption scaling, and cost factors, respectively, for a path Pi. Given a message L transmitted through a multipath network from source node vs to a destination node ve, the data of L are divided among the network paths using multipath routing. If the network attack threat levels are sensed on each path, then the information assurance factor Ii for each path can be determined based on these threats. Data security is enforced by decreasing Ci and Ei on less vulnerable paths while increasing these factors on more vulnerable ones. For example, assume that the network routing algorithm decides to use two paths Pi = path (v1, v6, v3, v4, v2) and Pj = path (v1, v6, v5, v7, v2) to send a message L from v1 to v2 as shown in Fig. 1. Then, the loss of throughput through the multipath network is lowered by increasing or decreasing C and E on each path according to the value of I derived for each path. The values of C and E are varied inversely to the path value of I over n paths. The data are transmitted to destination vertex v2 and protected by dynamically adjusting encryption E scaling factors according to the values of the information assurance factor I over each path.
Fig. 1 – Multipath graph.
Let us look at the following scenario where a message L with a length of 100 Mbits is to be transmitted across a multipath network. Assume the ranges for the information assurance, encryption and cost factors are I ∈ [0.02, 1], E ∈ [0.25, 1], and C ∈ [2.5, 10], respectively. At time 1 the values for I, E and C on paths Pi and Pj are as follows:
Time 1:
Pi values: Ii = 0.8, Ei = 0.25, and Ci = 2.5.
Pj values: Ij = 0.8, Ej = 0.25, and Cj = 2.5.

At time 1, paths Pi and Pj both have equal information assurance factors: Ii = 0.8 and Ij = 0.8. These values are on the higher end of the scale for I and thus indicate a low probability that serious threats to either path currently exist. Because the threat levels on both paths are low and the information assurance levels are high, the encryption scaling factors Ei and Ej and the cost factors Ci and Cj for each path are set low. When packets from message L are transmitted through paths Pi and Pj, routing is based upon the cost factors Ci and Cj of each path. A higher cost on a path equates to fewer packets being sent down that path. Because the cost factors are equal on each path, the packets of L are randomly but equally distributed over each path from the source node v1 to destination node v2. At time 2, traffic data are sampled and new information assurance factors are derived for each path.

Time 2:
Pi values: Ii = 0.8.
Pj values: Ij = 0.1.

At time 2, the information assurance factor for path Pi stays the same at Ii = 0.8, but decreases for path Pj. This indicates that there is now an increased probability of attacks being present on path Pj. In response to this lowered information assurance level on Pj, the encryption scaling factor is increased to Ej = 1.0 and the cost factor to Cj = 10 on this path (see Section 3.7 for details on how individual E and C path values are derived). This has the effect of decreasing the number of L message packets per second transmitted over path Pj, while increasing the number of packets per second transmitted over path Pi. In this instance L/Cj (100 Mbits/10 = 10 Mbits) of data are transmitted through path Pj, while L/Ci (100 Mbits/2.5 = 40 Mbits) of data are transmitted through path Pi. If the transmission rate for both paths Pi and Pj is 100 Mbits/s, 80 Mbits will transfer through path Pi while 20 Mbits will transfer through path Pj in one second. Summarizing, at time 2 the encryption level was increased on path Pj because attacks were detected on that path, ensuring that the confidentiality of the data was protected. Additionally, the cost factor was increased for Pj in order to direct less traffic through this path, where there is a greater threat to data security. It will be shown that the information assurance factor I along a path can be derived by finding the likelihood of the presence of attack signature patterns within a defined event window of network traffic (Section 3.2). We will show how experimentally derived E factors are used to determine path C factors in Section 3.7 such that a secure quality of service (SQoS) is maintained in multipath networks.
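As a minimal illustration of the inverse relationship between a path's information assurance factor and its cost and encryption factors, the following Python sketch reproduces the Time 2 allocation above. The I-to-(E, C) values are taken directly from the example; the general mapping is formalized in Section 3.7 and Table 3.

```python
# Minimal sketch: splitting a message across paths in inverse proportion to cost.
# The I -> (E, C) values used here are the worked-example values, not a general rule.

def split_message(msg_mbits, paths):
    """Allocate message data so that low-cost (trusted) paths carry more traffic."""
    inv_cost_total = sum(1.0 / p["C"] for p in paths)
    for p in paths:
        share = (1.0 / p["C"]) / inv_cost_total      # proportion of traffic for this path
        p["mbits"] = msg_mbits * share
    return paths

# Time 2 values from the worked example: Pj is now considered vulnerable.
paths = [
    {"name": "Pi", "I": 0.8, "E": 0.25, "C": 2.5},   # trusted path: weaker encryption, low cost
    {"name": "Pj", "I": 0.1, "E": 1.00, "C": 10.0},  # vulnerable path: strong encryption, high cost
]

for p in split_message(100, paths):
    print(f'{p["name"]}: {p["mbits"]:.0f} Mbits, encryption scaling E={p["E"]}')
# Pi carries 80 Mbits and Pj 20 Mbits, matching the example above.
```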
Our approach determines the security levels of network paths by examining the traffic data with different temporal partitions. In particular, the network traffic is partitioned into event windows, where each window collects data over a 30 minute sampling period. Within each path Pi of a multipath network, a separate event window of data is collected by sensors. TCP/IP packets are sampled on each network path, with each packet count type corresponding to a feature sampled every 180 ms. Within one event window, approximately 833 samples (vectors) are stored in a raw data matrix Φ over a period of 30 minutes. Each vector component (f1 − f12) is termed a vector feature value, and there are approximately 10,000 vector feature values in one event window. Each of the vector components in a sample represents the number of packets of a specific network protocol type passing by a network path sensor within one sampling interval. As will be shown in Sections 3.4, 3.5, and 4.1, the anomaly detection methods described in this section are able to chain multiple event windows together during analysis. The periodicity of the network attacks analyzed is not limited to 30 minutes; the choice of the 30 minute event window was driven only by data storage and processing conveniences.

For each 30-minute event window, we collect N samples from the network traffic for a single path and perform traffic sampling, anomaly detection, and path security determination, calculating the current and future path information assurance levels. The path security determination process is discussed in Section 3.2. In this process, cluster analysis on the traffic samples is performed, significant clusters are inspected for the presence of active attack signature features, and the likelihood of a respective cluster containing attack signatures is calculated (Section 3.2). Given the likelihood of specific attack features (associated with a signature Sr) being present on a path Pi, the cyber threat level Wr and information assurance factor Ii are determined using Equation 8 and Equation 9, which are discussed in Section 3.5. We will show in Section 3.6 that a predicted value IProj can be found that enables a multipath network to proactively adjust path encryption strengths.

Figure 2 summarizes the proactive encryption adjustment process. In steps 1–3 (described in Sections 3.2, 3.3, and 3.4) traffic is sampled and the data are clustered (refer to Section 3.1 for the list of data features). Using entropy analysis, anomalous clusters are identified. Within the anomalous clusters, samples that are statistically significant and have a high probability of matching known attack sequence signatures are identified. In step 4 (described in Section 3.5), the information assurance level I is determined. Finally, in step 5 (described in Sections 3.6 and 3.7), IProj is derived and path encryption strengths are proactively applied.

Fig. 2 – Summary of proactive encryption methods.
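To make the event window structure concrete, the following sketch builds the raw data matrix Φ for one path. The sampling constants mirror the description above; the packet-count source is a stand-in, and reconciling the 180 ms interval with the ~833 samples per window by polling the 12 features round-robin is our assumption.

```python
import numpy as np

N_FEATURES = 12           # f1..f12, per-protocol packet counts
SAMPLE_INTERVAL_S = 0.18  # each feature value sampled every 180 ms
WINDOW_S = 30 * 60        # one 30-minute event window

def count_packets(feature_idx, t):
    """Stand-in for a path sensor; returns the packet count of one protocol type."""
    return np.random.poisson(5)

def collect_event_window():
    """Return the raw data matrix Phi (N x N(f)) for one path and one event window."""
    n_samples = int(WINDOW_S / SAMPLE_INTERVAL_S / N_FEATURES)  # ~833 samples per window
    phi = np.empty((n_samples, N_FEATURES))
    for p in range(n_samples):
        t = p * SAMPLE_INTERVAL_S * N_FEATURES
        phi[p] = [count_packets(j, t) for j in range(N_FEATURES)]
    return phi

phi = collect_event_window()
print(phi.shape)   # roughly (833, 12) -> ~10,000 feature values per window
```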
3.1. Traffic sampling and attack signatures
Network packet header, network time, port, protocol, and flags are collected at each router interface. An event window corresponds to a set of Transmission Control Protocol/Internet
Protocol (TCP/IP) packet records for a path transiting a set of subnets or virtual LANs (VLANs) contained within an autonomous system. For each event window, we first define several notations used in the process of data sampling and in the later discussions.

• f: one feature that is abstracted from the network packet records. When there are multiple features, we also use fi to denote the i-th feature.
• N(f): the total number of features that are of interest. In this research, a total of 12 (N(f) = 12) features were extracted from the captured network packets. These features correspond to a specific subset of TCP, Internet Control Message Protocol (ICMP), Hypertext Transfer Protocol (HTTP), and OSPF protocol states, which are most often associated with router and host attacks.
• N: the number of samples that we take for one event window.
• Φ = (x1, x2, …, xN): an N × N(f) matrix with the N samples that are taken for an event window. Here, xp is represented as an N(f)-dimensional row vector.
• Φi: over each path Pi, an event window of raw network data is collected every 30 minutes. For a network of 50 paths, n = 50 and i = 1, …, n.

Three separate network-based attack suites, namely reconnaissance, vulnerability scanning, and exploitation, were used in emulating real-world host and network conditions. Each attack suite possesses a unique signature (Sr) consisting of the set of features listed under column 3 of Table 1. Signatures Sr were determined by observing the three classes of cloud provider attacks listed under column 2 of Table 1.

Table 1 – Network attack suites.
Suite signature (Sr) | Description | Active features | Threat level (Wr)
1 | Cloud guest reconnaissance, vulnerabilities & exploitation | 6: {f1, f2, f3, f4, f9, f11} | 3
2 | Cloud infrastructure reconnaissance, vulnerabilities & exploitation | 6: {f1, f5, f6, f7, f8, f9} | 5
3 | Cloud services reconnaissance, vulnerabilities & exploitation | 5: {f1, f5, f8, f9, f10} | 4

A threat level (Wr) is assigned to each attack suite type; it has a value that ranges from 1 for least severe to 5 for most severe, with r = 1, …, 3. As shown in Table 1 column 4, the threat levels assigned to the attack classes are as follows: 3 for cloud guest attacks (S1), 5 for cloud infrastructure attacks (S2), and 4 for cloud service attacks (S3). Threat levels were determined for each of the attack suites in Table 1 by analyzing the consequences of each attack class in terms of the potential loss of data confidentiality, integrity and availability for cloud provider guest virtual machines, network infrastructure and services. A particular emphasis was placed on ensuring that network infrastructure data availability and confidentiality were not impacted. The information assurance level on a path is inversely proportional to the threat level. For example, when the threat level of an attack class is set at the highest level of 5, it has the effect of lowering the information assurance factor the greatest amount on a path where that attack signature is found. Thus, by assigning the cloud infrastructure attack class (S2) a value of 5, we ensure that large encryption and cost factors are assigned to paths where cloud infrastructure attacks are found. Large encryption and cost factors assigned to a network path cause less traffic to pass through that path, and those packets that do transit the path are encrypted with greater strength than on paths that do not have the same level of cost and encryption factor values.

Table 2 describes the individual features composing the attack signatures in Table 1. These features consist of protocols that operate on TCP/IP networks. The cloud service provider enterprise consists of client guest virtual machines, network and switching devices, and enterprise services. These cloud provider enterprise components utilize the protocols in Table 2 to communicate with nodes both within and outside of the cloud networks. When the signature patterns in Table 1 are observed in event window data, it indicates that a possible attempt has been made to exploit cloud enterprise services using one or more of the protocols in Table 2. The definitions for the acronyms within the brackets in Table 2 are: reset (RST), synchronize (SYN) and acknowledge (ACK). The details of these protocols are beyond the scope of this research and will not be discussed in this paper. Reset, synchronize, acknowledge, redirect, and read are protocol activities (Bonaventure, 2011).
Table 2 – Feature definitions.
Feature | Indicator | Description
f1 | ICMP Redirect | Internet control protocol redirect
f2 | TCP http [RST] | Hypertext transfer protocol reset
f3 | TCP http [SYN, ACK] | Hypertext transfer protocol SYN
f4 | TCP ftp [SYN, ACK] | File transfer protocol SYN
f5 | UDP tftp [Read, ACK] | Trivial file transfer protocol read
f6 | TCP bgp [RST, ACK] | Border gateway protocol reset
f7 | TCP ospf-lite [RST, ACK] | Open shortest path first lite protocol reset
f8 | TCP mp-bgp [RST, ACK] | Multiprotocol border gateway protocol reset
f9 | TCP https [RST, ACK] | Hypertext transfer protocol secure reset
f10 | TCP smb [RST, ACK] | Server message block protocol reset
f11 | TCP ssh [RST, ACK] | Secure shell tunneling protocol reset
f12 | TCP telnet [RST, ACK] | Telnet application protocol reset
We formed the feature attack signatures in column 3 of Table 1 by observing variations in network protocol packet counts over each event window period when a known attack type was injected into network traffic. Those features that showed high variations were flagged as indicators of a specific attack class being present in the traffic. The observed attack indicator features were also confirmed through literature discussing the anatomy of each attack class (see Archer, 2013). Each attack signature suite (Sr) consists of patterns of observed TCP/IP packet headers.

As an example, let us analyze the remote exploitation attack graph shown in Fig. 3. In the Identify and Recon stages of an attack in a multipath network, an attacker transmits TCP/IP packets along the paths of a network in order to identify open services, known vulnerabilities and exploits contained within routers and hosts. Indicators that an attacker is performing identify and recon activities along a path Pi include significant changes in the distribution entropy of the network traffic features {f1 − f12}. Signature attack suites Sr consist of sequenced combinations of observed traffic feature distribution entropy changes over time. In the case of attack signature suite S1, feature distribution entropy changes are observed for the set of features {f1, f2, f3, f4, f9, f11}. It should be noted that, depending on the particular target (cloud infrastructure, guest or services), a specific set of features serve as indicators. The order in which feature entropy changes occur in observed path traffic over time is important. For example, in the remote execution exploit attack shown in Fig. 3, for a cloud infrastructure attack we expect to observe feature distribution entropy changes for the full set of features {f1, f5, f6, f7, f8, f9} during the identify and recon stages of the attack. During the identify and recon stages, the attacker is interested in identifying all open services, vulnerabilities in those services, and how they can be exploited. Thus, any protocols associated with those open services will be exercised during the identify and recon stages. If an attacker identifies vulnerabilities and attack vectors, then in order to launch a remote exploit attack he will progress to the crafting, delivery and trigger stages.
Fig. 3 – Remote execution attack graph.
Fig. 4 – Hierarchical clustering of data event windows.
In the case of the cloud infrastructure attack, we would expect to see a reduction of entropy in the feature distributions where service vulnerabilities were not identified by the attacker. However, for those features related to the vulnerabilities and exploits that were found, we expect to see continued distribution entropy changes and patterned traffic. As an example, if an attacker was only able to identify an OSPF-lite remote execution exploit service vulnerability in a router during the identify and recon stages, we would expect to see patterned f7 (OSPF-lite) and f5 (tftp) traffic changes during the delivery, trigger, exploit and execute stages of the attack.
3.2. Determination of path security
Samples consist of row vectors xp that belong to a collected event window Φ. As a first step in identifying anomalous samples, we need to define associations between the samples by way of data clustering. The two most commonly used methods for performing clustering are partitioning and agglomerative clustering. In the partitioning approach, a set number of cluster centroids is hypothesized prior to clustering the samples. The distance of all samples from the cluster centers is calculated, and each sample is then assigned to a cluster based on its distance from each predetermined centroid. In agglomerative clustering, there are no initial predetermined centroids; instead, each sample is initially treated as a cluster. A similarity matrix containing pair-wise distances between samples is created, and an iterative process ensues that finds the two most similar clusters and recalculates the similarity matrix after each iteration. The agglomerative process is usually terminated when a specific average threshold distance between cluster centroids, or distance between points within clusters, is reached. The results of the clustering process are represented by a dendrogram such as that shown in Fig. 4. We experimented with the partitioning method K-means and with an agglomerative hierarchical method. Because K-means does not work well with overlapping clusters, is sensitive to outlier noise, and requires the number of cluster centroids to be accurately predicted prior to clustering, we chose agglomerative hierarchical clustering. Generally, the
results of hierarchical clustering of sizable data sets are a large number of clusters, many of which contain only a small fraction of the samples and are considered minor clusters. We took a straightforward approach to prioritizing clusters and eliminating the minor clusters by cutting the lower tiers of the hierarchical dendrogram. Once clusters are identified in an event window (Algorithm 1), a determination of which clusters contain attack signature features of high magnitude is conducted (Algorithm 1). The path information assurance factor is then calculated using Equations 8 and 9. A signature consisting of a distinct collection of significant features is associated with each attack suite; thus, the nature of the significant features contained within the traffic data of an event window is captured in a hierarchical clustering. Algorithm 1 implements a modified hierarchical agglomerative clustering algorithm that merges clusters until a minimum distance threshold between clusters is reached or all the clusters have been merged into one. When a minimum distance threshold is used, the algorithm ensures maximum partitioning of the data into feature-rich clusters and increases the probability that the top-tier clusters contain a full attack signature feature set.
Algorithm 1 takes as input (1) SSigattack, the set of attack signatures, each of which consists of several features, (2) Φ, with the N samples, where each sample is a row vector in Φ, (3) δ, the distance threshold to stop cluster merging, and (4) APbuf, significant features that have a high probability of matching signatures in SSigattack from earlier analysis periods. The derivation of parameter APbuf is discussed in Section 3.4. The algorithm groups the samples in one event window into c clusters (lines 3–8). Then, for each cluster, it finds the attack signature that has the highest probability of matching the cluster's features (lines 10–21). The signatures and matching probabilities for all the clusters are put into SSig and PProb, respectively. The algorithm outputs a triple (Clusts, SSig, PProb), where Clusts[i] is the i-th derived cluster and SSig[i] contains the attack signature with the highest probability (in PProb[i]) of matching Clusts[i]'s features. In this algorithm, c represents the total number of clusters found so far and is initialized to 0 (line 2). Di is a cluster that is being processed; initially, Di contains the i-th sample in one event window. Clusts is the set of clusters finally derived, and is initialized as an empty set. Lines 3–8 merge samples into c significant clusters.

The group sum of squares (GSS) measure is a measure of how dispersed data samples are from one another in terms of distance. GSS estimates the variance, or spread, of the samples about their mean distance value. In the context of data clustering, the within group sum of squares (WGSS) is thus a measure of the amount of dispersion of samples within a cluster, while the between groups sum of squares (BGSS) measures the dispersion of samples in one cluster as compared to the samples of other clusters. A successful result of agglomerative clustering is a set of clusters with tightly grouped samples; in addition, the clusters should be well separated in distance from one another. A high value for the ratio BGSS/WGSS indicates that the agglomerative clustering process resulted in a set of good candidate clusters and that the proper distance metric was selected for computing the similarity matrix.

In Algorithm 1, the cluster merging process stops when the minimum distance between the two nearest clusters exceeds the distance threshold δ. In finding the nearest clusters from all the existing ones, both the Ward and complete linkage methods can be utilized. Past research (Steinbach et al., 2000) showed that the complete linkage method (Equation 1) yields the best ratio of BGSS to WGSS, indicating tighter grouping of members within clusters and optimal cluster-to-cluster spacing. We used the complete linkage distance method for calculating the similarity matrix and for measuring the distance between clusters. Using the complete linkage method, all samples in one cluster are compared with all samples of another cluster. The distance threshold δ was heuristically selected by clustering baseline event window data and observing which values of δ yielded the best BGSS/WGSS ratio.
dist(Di, Dj) = max { d(xi, xj) : xi ∈ Di, xj ∈ Dj }    (1)
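A minimal sketch of the clustering step in Algorithm 1, using SciPy's agglomerative (complete-linkage) clustering with a distance threshold δ. The threshold value and the toy feature matrix below are placeholders rather than the values used in this research.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, cophenet
from scipy.spatial.distance import pdist

def cluster_event_window(phi, delta):
    """Group the samples of one event window using complete-linkage clustering."""
    dists = pdist(phi, metric="euclidean")           # pairwise sample distances
    tree = linkage(dists, method="complete")         # Eq. 1: complete-linkage merging
    labels = fcluster(tree, t=delta, criterion="distance")  # stop merging at threshold delta
    corfac, _ = cophenet(tree, dists)                # cophenetic correlation factor (Section 3.3, Eq. 7)
    clusters = [phi[labels == c] for c in np.unique(labels)]
    return clusters, corfac

# Toy event window: 200 samples x 12 features; delta chosen arbitrarily here.
phi = np.random.poisson(5, size=(200, 12)).astype(float)
clusters, corfac = cluster_event_window(phi, delta=10.0)
print(len(clusters), round(corfac, 3))
```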
For each cluster Di out of the c clusters in Clusts, lines 10–21 calculate the probability that its features match each attack signature. Because data transmission rates are often extremely high, detecting individual network path attacks is quite challenging. Typical network intrusion detection systems are vulnerable to denial of service attacks and evasive attacks, and when too many packets are delivered, the systems will time out, drop packets and consequently often miss attacks altogether (Salour and Su, 2007). The novelty and utility of our devised Algorithm 1 lies in its ability to efficiently locate anomalies and classify clusters containing attack signatures in multipath network traffic data. The average successful anomaly detection and classification rate has been observed to be greater than 90% (see Section 4.1 for a detailed review of Algorithm 1's performance characteristics).
3.3. Matching features to signatures
Identifying feature matches is performed by measuring the entropy for each feature within an event window. Baseline event window data containing no attacks and test event windows containing attacks were inspected, and it was observed that specific features were indicators of attacks, as listed in Table 1, column 3. These attack indicator features showed changes in entropy: the relative entropy F of an individual attack signature feature fj changes when attack suite traffic is injected into the network. Measuring entropy levels of feature random variable data cannot distinguish between differing distributions with the same amount of uncertainty (see Kullback and Leibler, 1951), and simply measuring the entropy levels of these features between time periods does not guarantee that anomalies will be detected. For this reason, it is important to compare the relative entropies (Equation 6) of attack signature features within specific timeframes. Identifying feature matches is therefore performed by finding the KL divergence designated by Equation 6. KL divergence is a non-symmetric measure of the difference between two probability distributions and is often used as a relative entropy measure between distributions. We use KL divergence to measure the relative entropy changes between baseline data that we know contain no anomalies and test data that may contain anomalies. The respective baseline and anomalous feature distributions F and F′ are represented by Equations 2 and 3, and the respective relative entropies by Equations 4, 5 and 6. Equation 6 is used in order to determine feature relative entropies in Algorithms 1 and 2.
F = [Prob(f1), Prob(f2), …, Prob(fJ)]^T    (2)

F′ = [Prob(f′1), Prob(f′2), …, Prob(f′K)]^T    (3)

DKL(Fi ‖ Fi′) = Σ_{fi, f′i} Prob(fi) ln[ Prob(fi) / Prob(f′i) ]    (4)

DKL(Fi′ ‖ Fi) = Σ_{fi, f′i} Prob(f′i) ln[ Prob(f′i) / Prob(fi) ]    (5)

KLD = DKL(Fi ‖ Fi′) + DKL(Fi′ ‖ Fi)    (6)

The complete form in Equation 6 is, by Gibbs' inequality, always ≥ 0; a formal proof is provided in Kullback and Leibler (1951). In order to reduce the search space, clusters with the largest cophenetic distances and highest inconsistency factor are selected for feature entropy comparisons. The cophenetic distance between two clusters is represented in a dendrogram such as Fig. 4 by the height of the link at which the two clusters are first joined. A correlation factor for a dendrogram is defined as the linear correlation coefficient between the cophenetic distances obtained from the tree and the original distances (or dissimilarities) used to construct the tree. As described by Equation 1, the complete distance measure method is used when generating the dendrogram. The correlation factor is used as a measure of how well the tree represents the dissimilarities among clusters. A value close to 1 indicates a high-quality, well-grouped clustering tree (Steinbach et al., 2000). The cophenetic correlation coefficient CorFac is represented in Equation 7 below:

CorFac = [ Σ_{i<j} (Odist_ij − odist)(Cofdist_ij − cofdist) ] / √[ Σ_{i<j} (Odist_ij − odist)² · Σ_{i<j} (Cofdist_ij − cofdist)² ]    (7)
where Odist_ij is the Euclidean distance between clusters Di and Dj, Cofdist_ij is the cophenetic distance between clusters Di and Dj, and odist and cofdist are the averages of Odist_ij and Cofdist_ij, respectively.

As natural fluctuations in feature distributions do occur in baseline data, it is necessary to measure the baseline KLDblo using Equation 6, with o = 1, …, 12 corresponding to the features f1 through f12. KLDblo is averaged over a set number of event windows of non-anomalous data and represented as KLDblavgo. KLDblavgo is calculated for each feature and used as a threshold when determining whether a feature is anomalous in test event window data. Test event windows are event windows where anomalies may be present, as opposed to baseline event windows where anomalies are not present. KLDtesto is derived using test event window data. A feature f′k appearing in a test event window is significant if KLDtesto > KLDblavgo for that feature. These significant features are subsequently used in determining the probability that a cluster is associated with a specific attack suite.
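The anomaly test just described can be sketched directly: for each feature, the symmetric KL divergence (Equation 6) between the baseline and test distributions is compared with the averaged baseline threshold KLDblavgo. The histogram binning, smoothing constant, and threshold values below are illustrative assumptions.

```python
import numpy as np

def feature_distribution(counts, bins=20, value_range=(0, 100)):
    """Histogram a feature's packet counts into a (smoothed) probability distribution."""
    hist, _ = np.histogram(counts, bins=bins, range=value_range)
    hist = hist + 1e-6                      # avoid zero probabilities in the log terms
    return hist / hist.sum()

def symmetric_kld(p, q):
    """Equation 6: D_KL(p||q) + D_KL(q||p)."""
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def significant_features(baseline_win, test_win, kld_blavg):
    """Return indices of features whose test KLD exceeds the averaged baseline KLD."""
    flagged = []
    for o in range(baseline_win.shape[1]):
        p = feature_distribution(baseline_win[:, o])
        q = feature_distribution(test_win[:, o])
        if symmetric_kld(p, q) > kld_blavg[o]:
            flagged.append(o)
    return flagged

# Toy data: feature f7 (index 6) behaves differently in the test window.
rng = np.random.default_rng(0)
baseline = rng.poisson(5, size=(833, 12)).astype(float)
test = rng.poisson(5, size=(833, 12)).astype(float)
test[:, 6] = rng.poisson(25, size=833)      # injected anomaly
kld_blavg = np.full(12, 0.5)                # assumed per-feature baseline thresholds
print(significant_features(baseline, test, kld_blavg))   # -> [6]
```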
Fig. 5 – Attack signature matching probabilities (presence of features f1, f2, f3, f9 in candidate clusters D1, D2, D5 and D9).

Let us look at an example of signature matching probabilities. Assume that we are given a hypothetical attack signature that consists of four features, f1, f2, f3, f9, i.e., Sr = {f1, f2, f3, f9}. Algorithm 1 finds clusters and extracts the features that exist in Sr. As shown in Fig. 5, for candidate clusters D1,
D2, D5 and D9, the algorithm found matching significant features H1 = {f1, f2, f9}, H2 = {f1, f2, f3}, H5 = {f1, f2, f3}, and H9 = {f1, f2, f3, f9}. The significant feature match for each cluster is computed as |Hu|/(number of features in Sr), with u = 1, 2, 5, 9, where |Hu| denotes the cardinality of Hu. The signature match probabilities are therefore 0.75 for H1, 0.75 for H2, 0.75 for H5 and 1.0 for H9; all have high (≥ 75%) feature matches to Sr. Among these four clusters, D9 has the highest feature match probability (100%), since all four features in the attack signature Sr exist in the cluster. After calculating the signature matching probability (lines 13–16), the attack signature with the highest feature-matching probability is recorded in maxProbSigi (line 18) and the matching probability is recorded in maxProbi (line 17). When the highest matching probability maxProbi and the matching attack signature maxProbSigi for each cluster are found, they are put into the sets PProb and SSig, respectively (lines 22–23).

In a new example, Fig. 6(a) shows clustering of samples prior to classification, and Fig. 6(b) shows those clusters which were identified to most closely match attack suite
signatures. Figure 6(a) shows the top 10 candidate clusters, including the annotated clusters D5, D6, and D8, prior to classification. The results of the classification process completed by Algorithm 1 are shown in Fig. 6(b). Of the top 10 candidate clusters, clusters D5, D6, and D8 matched attack suite signatures SSig1, SSig2, and SSig3, respectively, with signature match probabilities of 0.93, 0.947, and 0.9. Out of all candidate clusters, Algorithm 1 determined these three to have the highest match probabilities for the attack suite signatures SSigattack. The clusters having the highest probability of containing the attack suite signatures SSigattack are identified as follows:

1. The KLD is calculated between baseline feature distributions containing no anomalies. An averaged KLDblavgo is derived for use in step 2.
2. Those test dataset features with KLDtesto > KLDblavgo are classified as anomalous features, as illustrated in Fig. 6, with an associated probability of matching an attack signature.

If more than one cluster matches a specific attack signature, the cluster with the highest probability of a specific signature match and the highest number of similar samples is selected to represent the probability of that attack signature being present on a path. More explicitly, if two or more clusters have an equal signature match probability, the cluster with the largest number of samples is chosen to represent the probability of a specific attack being present on a path.
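A compact sketch of the signature matching step (lines 10–21 of Algorithm 1, as described above): each cluster's significant features are compared against every attack signature and the best match is kept. The signature definitions follow Table 1; the cluster feature sets are the Fig. 5 example values.

```python
# Attack signatures from Table 1 (feature names f1..f12).
SIGNATURES = {
    "SSig1": {"f1", "f2", "f3", "f4", "f9", "f11"},   # cloud guest
    "SSig2": {"f1", "f5", "f6", "f7", "f8", "f9"},    # cloud infrastructure
    "SSig3": {"f1", "f5", "f8", "f9", "f10"},         # cloud services
}

def best_signature_match(cluster_features):
    """Return (signature, probability) with the highest |H ∩ Sr| / |Sr| match."""
    best_sig, best_prob = None, 0.0
    for name, sig in SIGNATURES.items():
        prob = len(cluster_features & sig) / len(sig)
        if prob > best_prob:
            best_sig, best_prob = name, prob
    return best_sig, best_prob

# Worked example from Fig. 5, scored against a hypothetical Sr = {f1, f2, f3, f9}:
Sr = {"f1", "f2", "f3", "f9"}
for name, H in {"D1": {"f1", "f2", "f9"}, "D9": {"f1", "f2", "f3", "f9"}}.items():
    print(name, len(H & Sr) / len(Sr))   # D1 -> 0.75, D9 -> 1.0
```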
Fig. 6 – Cluster classification: (a) the unclassified top 10 clusters of interest prior to classification, including {D5, D6, D8}; (b) clusters {D5, D6, D8} classified as matching attack signatures SSig1, SSig2, and SSig3 with their associated probabilities.
Fig. 7 – Slow cloud attacks.
This methodology, as applied to attacks occurring over multiple event windows, is discussed formally in Section 3.4. As we will see in Sections 4.1 and 4.2, Algorithm 1 affords a high degree of anomaly detection and classification accuracy for network path attacks.
3.4. Stream look-forward probability of attack (SLFPA)
Due to data storage limitations, one of the largest challenges in a Signature-based Intrusion Detection system (SID) is missing essential attack sequence elements. Adversaries desiring to evade a SID’s signature matching capabilities will often purposely slow down the transmission of attack elements with the expectation that the signature matching buffer length is shorter than the attack data stream. Algorithm 1 is used to identify significant features (KLDtesto > KLDblavgo) over a set number of event windows which we refer to as an analysis match period (TSmatch). The analysis match period is a unit of measure consisting of a set number of event windows over which span we
inspect the contents of the event windows for the presence of attack signatures. If an attack sequence occurs over multiple analysis match periods, the stream look-forward probability of attack (SLFPA) must be calculated for statistically significant features. SLFPA is designed to work in concert with Algorithm 1 in order to detect a slow attack.

As an example, let us analyze scenarios using signatures SSig2 and SSig3 to illustrate SLFPA. Figure 7 shows two slow cloud attack sequences, with the cloud infrastructure attack (SSig2) consisting of 6 elements represented by features f1, f5, f6, f7, f8, and f9, and the cloud services attack (SSig3) consisting of 5 elements represented by features f1, f5, f8, f9, and f10. If the system signature matching buffer can only store 2 event windows (EWs) for a signature match analysis period TSmatch (match = 1, …, 3), then the system will at best be able to obtain a match with a probability of 0.33 for SSig2 and 0.40 for SSig3 at TS1, after observing features f1 and f5. In the next two EWs (EWs 3–4), features f6, f7, f8 and f9 are observed and, without the knowledge gathered in TS1, a match probability of 0.33 for SSig2 and 0.40 for SSig3 is again obtained in TS2. An obvious solution to this problem is to increase the signature match analysis period to the known longest attack signature length. This approach unfortunately would require Algorithm 1 to process all patterns within that much longer time period, requiring considerably more computing resources and data storage capacity. As illustrated in Fig. 8, a more efficient solution is to calculate the probability that the significant features present in each analysis matching buffer (APbuf) are part of a known attack sequence signature (SSigattack), and to forward only the highest probability pattern knowledge into the next analysis timeframe. We are in effect keeping track of patterns seen in previous event windows only if those patterns have a high probability of belonging to known attack signature patterns.
Fig. 8 – High probability pattern forwarding.
Given the data stream in Fig. 8, significant features f1 and f5, as members of the SSig2 and SSig3 item sets, are placed in the analysis match buffer during TS1, while significant features f2 and f4 are ignored. Likewise, in TS2, because significant features f6 and f7 are members of the SSig2 item set, these features are placed in the analysis match buffer while significant features f3 and f4 are ignored. Finally, in TS3, significant features f8, f9 and f10 are placed in the analysis match buffer to complete the matching process. Summarizing, after 3 match analysis periods (TS1 − TS3), the match analysis buffer contains the significant features of attack signatures SSig2 and SSig3, and the probability of these attacks being present on a path can be calculated.

SLFPA is implemented in the pseudocode of Algorithm 2. Algorithm 2 takes as input (1) SSigattack, the set of attack signatures, each of which consists of a subset of features f1 − f12, (2) Φ, the N observed raw data samples, (3) δ, the distance threshold to stop cluster merging, and (4) Sfactor, a factor used to set the number of TSmatch time periods that span a match analysis period. Lines 1–3 initialize variables and call Algorithm 1 in order to obtain the high probability cluster sets Clusts, the matching signatures SSig, and the corresponding probabilities of match PProb. In lines 4–10, for each significant cluster Di, samples are evaluated using Equation 6, and when a cluster contains a statistically significant number of features with KLDtesto > KLDblavgo, those features are concatenated with the corresponding attack sequence signature elements in the match buffer APbuf. In lines 11–14, the accumulating match buffer (resulting from the actions in lines 4–10) is processed successively by Algorithm 1 until all event windows within the analysis match period have been processed.
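A minimal sketch of the pattern-forwarding idea behind SLFPA (not the full Algorithm 2): significant features observed in each analysis match period are carried forward only if they belong to a known signature, and a running match probability is maintained per signature. Function and variable names are illustrative.

```python
# High-probability pattern forwarding across analysis match periods (TS1, TS2, ...).
SIGNATURES = {
    "SSig2": {"f1", "f5", "f6", "f7", "f8", "f9"},   # cloud infrastructure, Table 1
    "SSig3": {"f1", "f5", "f8", "f9", "f10"},        # cloud services, Table 1
}

def slfpa(event_window_features):
    """Carry matching features forward and report per-signature match probabilities."""
    ap_buf = set()                                   # APbuf: forwarded significant features
    history = []
    for ts, observed in enumerate(event_window_features, start=1):
        relevant = {f for f in observed if any(f in sig for sig in SIGNATURES.values())}
        ap_buf |= relevant                           # features outside all signatures are ignored
        probs = {name: len(ap_buf & sig) / len(sig) for name, sig in SIGNATURES.items()}
        history.append((f"TS{ts}", probs))
    return history

# Slow attack from Fig. 8: signature elements are staggered over three match periods.
stream = [{"f1", "f5", "f2", "f4"},      # TS1
          {"f6", "f7", "f3", "f4"},      # TS2
          {"f8", "f9", "f10"}]           # TS3
for ts, probs in slfpa(stream):
    print(ts, {k: round(v, 2) for k, v in probs.items()})
# TS1 -> SSig2 0.33, SSig3 0.40; TS3 -> SSig2 1.0, SSig3 1.0
```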
Algorithm 3 takes as input (1) Clusts, the set of threat-classified clusters, each of which consists of several features, (2) SSig, the corresponding attack signatures, (3) PProb, the probabilities of the presence of specific attack signatures, and (4) Wr, the threat levels, with r = 1, 2, 3. In lines 3–6, the probability for each classified cluster in the set Clusts is multiplied by the corresponding threat level. The sum of the respective cluster Wr · PProb products is accumulated in the variable O. The path information assurance factor Ii is calculated in line 9.
3.5. Calculate assurance level for a path
Once the set of clusters Clusts is derived and the probabilities of the presence of specific signatures in those clusters (PProb and SSig) are calculated, the path information assurance factor Ii for network path Pi is calculated using Equation 8 and Equation 9. The total number of paths is equal to n and i = 1, 2, …, n. Each attack threat level Wa is derived using a domain-specific threat analysis process as described in Section 3.1 and reference (Archer, 2013), with a = 1, 2, …, c, where c equals the number of clusters associated with attack signatures on a path Pi. For each cyber threat level Wa and a corresponding traffic threat signature Sa present in an event window, the likelihood of cyber threat signatures being present is high if both Wa and Prob(Sa) are high. For a path Pi, which consists of c clusters of samples in an event window (discovered in Algorithm 1), we can sum up the threat for each cluster (Equation 8). Then the information assurance factor Ii for Pi is derived using Equation 9:
Of = Σ_{a=1}^{c} Wa Prob(Sa)    (8)

Ii = 1 / Of    (9)
In practice, using the threat model we have specified in Table 1, the ranges Wa ∈ [1, 5], Of ∈ [1, 50], and Ii ∈ [0.02, 1] apply for each network path used. Algorithm 3 calculates the information assurance factor I for a path by utilizing Equations 8 and 9.
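A direct sketch of Algorithm 3 and Equations 8 and 9: each cluster's match probability is weighted by the threat level of its matched signature and summed, and the path's information assurance factor is the reciprocal of that sum. The threat levels follow Table 1; clamping Of to the stated range [1, 50] is our assumption.

```python
THREAT_LEVELS = {"SSig1": 3, "SSig2": 5, "SSig3": 4}   # Wr from Table 1

def information_assurance(ssig, pprob):
    """Equations 8 and 9: O_f = sum(W_a * Prob(S_a)); I_i = 1 / O_f."""
    o_f = sum(THREAT_LEVELS[sig] * prob for sig, prob in zip(ssig, pprob))
    o_f = max(o_f, 1.0)            # assumption: keep O_f in [1, 50] so that I_i stays in [0.02, 1]
    return 1.0 / o_f

# Clusters classified in the Fig. 6 example: SSig1, SSig2, SSig3 with the probabilities below.
ssig = ["SSig1", "SSig2", "SSig3"]
pprob = [0.93, 0.947, 0.90]
print(round(information_assurance(ssig, pprob), 3))    # ~0.09: a low I, i.e. a vulnerable path
```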
3.6. Trend derived information assurance levels
In the previous section, the probable association of statistically significant features within adjacent event windows was determined. These associative probabilities were found in order to facilitate efficient matching of observed data to signatures. As shown in Fig. 8, the probability of a specific signature match was determined using the significant feature information for each successive signature match analysis period TSmatch. The probability of a match with SSig3 increases through TS1 − TS3,
and it can be surmised that at each match analysis period the stream look-forward probability of attack (SLFPA) for SSig3 is known. The SLFPA is a temporal probability of attack within a data stream, and is limited to predicting an attack on a path only within a specified set of analysis periods {TS1, …, TSg}. In order to predict the long-term attack environment on a path, a future information assurance factor IProji must be found over a larger set {TS1, …, TSG}, where G ≫ g and µ is the prior distribution mean of a selected set of event window data contained within the set {TS1, …, TSG}. The mathematical techniques we selected to find IProji are Bayesian inference and Bayesian variable selection (BVS). These methods assume that the data feature distributions f1 − f12 are normally distributed and that, given prior feature distributions, future or posterior data distributions can be derived (see O'Hara and Sillanpaa (2009) for detailed coverage of BVS and Bayesian inference). When using BVS, we are able to find the probabilities that specific features determine the outcome of a future information assurance factor IProji. BVS helps us reduce the number of features that we need to include in our analysis: only those features with probabilities above a designated threshold are included in the IProji regression equations. The exact usage of BVS and Bayesian inference in predicting IProji is discussed in the paragraphs that follow. The posterior distribution Prob(σ | I) is found using the following Bayesian probability relation, with σ the assumed posterior mean and I the observed information assurance factor.
Prob(σ | I) = Prob(I | σ) Prob(σ) / Prob(I)    (10)
Prob(I) is an unconditional normalizing constant that ensures the posterior integrates to one. Prob(I) is a multidimensional integral, shown in Equation 11, over all of the features fj, and is approximated using Markov chain Monte Carlo (MCMC) algorithms (Meyer and Wilkinson, 1998).
Prob(I) = ∫ Prob(I | σ) Prob(σ) dσ    (11)
Performing the multidimensional integral in Equation 11 is computationally intensive, and it is therefore necessary to use MCMC algorithms to find Prob(I). MCMC algorithms are used iteratively to generate a series of values (chains) for all of the parameters specified in the models below, where Equation 12 is the generalized model and Equation 13 is the model for the feature set {f1, …, f12}. The variable i represents the path number for a specific network path Pi and n is the total number of paths in the network, i = 1, 2, …, n, while j is the feature number per path (j = 1, 2, …, N(f)). As explained later, Equations 10 and 11 are used to calculate the posterior probabilities for βi,j and αi. The generalized model is as follows:

IProji = ψ αi + ψ Σ_{j=1}^{N(f)} βi,j fi,j + ei    (12)

where
βi,j: the corresponding regression coefficients.
αi: the regression constant.
fi,j: the features, also referred to as predictor variables in the regression model.
ei: the regression error.
IProji: the response variable indicating the predicted information assurance factor.
ψ: a regression scaling factor, derived by comparing a predetermined number of prior event window Ii values with the multiple regression values for those prior event windows.

The generalized model of Equation 12 using the feature set {f1, f2, …, f12} takes the form of Equation 13:

IProji = ψ αi + ei + ψ (βi,1 fi,1 + βi,2 fi,2 + βi,3 fi,3 + βi,4 fi,4 + … + βi,12 fi,12)    (13)

In the Bayesian variable selection (BVS) approach, we assign a spike and slab prior probability distribution (discussed below) to βi,j and αi and formulate a regression expression using the full set of features fi,j in the form of Equation 13. Posterior probabilities for βi,j and αi are then determined using Equations 10 and 11 with the assistance of MCMC algorithms. Submodels based on Equation 13, with each βi,j assigned a spike and slab prior distribution with unique mean and variance, are evaluated. We inspect the posterior probabilities of the different submodels, with particular interest in those submodels where the individual probability values Prob(βi,j) are below a specified activation probability threshold. Activation thresholds are domain-specific, heuristically derived values that are used to determine whether a feature fi,j should be dropped or maintained as a predictor variable in Equation 13. As an example, setting an activation probability threshold of 0.4 will cause all features with Prob(βi,j) < 0.4 to be considered inconsequential to the outcome of the target variable IProji. As long as the MCMC iterations run long enough, the process will converge such that the values of IProji appear as if they come from the joint posterior distribution of the model parameters. After convergence, the calculated posterior distributions for βi,j can be further characterized by generating a large number of simulated values using the derived posterior distributions and compiling summary statistics such as the posterior mean, standard deviation and percentiles (Montgomery, 2012; O'Hara and Sillanpaa, 2009). Such simulations are of great value when assessing the relationship or covariance between βi,j terms and, ultimately, in selecting the most suitable submodel.

Equation 12 assumes a normal linear model (NLM) and is analyzed using an analysis of covariance (ANCOVA) (Gelfand and Smith, 1990) model consisting of a grouping of at least one covariate. In the Bayesian NLM model of Equation 12, the relationship between the response variable IProji and the covariate features fi,j can be linear or curvilinear (polynomial), provided the mean of IProji can be expressed as a linear function of the unknown parameters βi,j.

A common problem with multiple linear regression models like that of Equation 12 is overfitting. One primary reason overfitting can occur is when variables from the feature set {f1, …, f12} are indiscriminately included. There are several methods that can be used to ensure that only those features that are statistically significant are included (O'Hara and Sillanpaa, 2009). One effective method used for this purpose is Bayesian variable selection (BVS). The particular BVS method illustrated here
assumes a spike and slab prior distribution. With the spike and slab prior, it is assumed that the regression coefficients are mutually independent, with a two-point mixture distribution made up of a uniform flat slab distribution and a degenerate spike distribution at zero (Lempers, 1971). The spike and slab distribution has the following properties:

Inda ∼ Bern(π): a Bernoulli distribution with an indicator parameter equal to 1 if active and 0 otherwise, where π is the prior probability that a predictor variable is active.
ωa* ∼ N(0, γ²τ²): the slab part of the prior distribution, where τ² is the residual variance and γ² determines the prior's precision.

A detailed explanation of how the values of γ and π are determined is beyond the scope of this paper (π is 0.25 and γ lies in [0.5–3.0]); for those interested, please refer to O'Hara and Sillanpaa (2009). Simply stated, the values of π and γ are based on the assumption that in most cases the majority of predictor variables are inactive. It is assumed that for the slab, ωa (Inda = 1) equals βa*, and for the spike, ωa (Inda = 0) equals 0.

Finding the value of IProji for a path allows us to predict how we should protect the path in the future based on the threats that have been observed in the past. When using the Bayesian method described, it is important to select the proper distribution on which to base the model in order to allow the MCMC method to quickly and accurately resolve terms and converge on values for the regressors. Summarizing, the trend-derived method for proactively determining IProj is completed using four steps (a sketch of the reduced-model fit follows the list below):

1. Select a representative number of past event windows where attacks were encountered. Ideally these event windows should directly precede the timeframe for which IProj will be determined. The larger the number of event windows sampled, the greater the likelihood of calculating an accurate IProj.
2. Form a multiple regression model using the general form of Equation 12, adapting it to the number of features. Equation 13 contains 12 features and was derived from Equation 12.
3. Reduce the dimensionality of the regression model using the BVS method.
4. Solve the reduced regression model and verify that the predicted IProj value is sound. Verification methods include determining the coefficient of multiple determination (ρ²), conducting an overall F test, and conducting residual analysis (Montgomery, 2012).
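The following sketch illustrates steps 3–4 under simplifying assumptions: the per-feature posterior inclusion probabilities Prob(βi,j) are taken as already computed (hard-coded placeholders here, rather than the output of an MCMC run), features below the activation threshold are dropped, and the reduced model is fit by ordinary least squares instead of a full Bayesian posterior.

```python
import numpy as np

ACTIVATION_THRESHOLD = 0.4     # features with Prob(beta_ij) below this are dropped

def fit_reduced_model(windows, i_values, inclusion_probs):
    """Steps 3-4 (simplified): BVS-style feature pruning, then a least-squares fit."""
    keep = np.where(inclusion_probs >= ACTIVATION_THRESHOLD)[0]
    X = np.column_stack([np.ones(len(windows)), windows[:, keep]])   # constant + kept features
    coef, *_ = np.linalg.lstsq(X, i_values, rcond=None)
    def predict(new_window):
        return float(coef[0] + new_window[keep] @ coef[1:])
    return keep, predict

# Placeholder history: per-window mean feature counts and the observed I for those windows.
rng = np.random.default_rng(1)
windows = rng.poisson(5, size=(40, 12)).astype(float)       # 40 past event windows
i_values = 1.0 / (1.0 + 0.05 * windows[:, 6])                # toy relation: I driven mainly by f7
inclusion_probs = np.array([0.1, 0.1, 0.2, 0.1, 0.3, 0.2,    # assumed MCMC output
                            0.9, 0.2, 0.3, 0.1, 0.1, 0.2])

kept, predict = fit_reduced_model(windows, i_values, inclusion_probs)
print(kept, round(predict(windows[-1]), 3))                  # IProj estimate for a new window
```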
3.7. Path encryption levels adjusted
The path encryption level protocol described in this section utilizes Algorithms 1–3 to ensure a secure quality of service (SQoS), where data in a multipath network are transferred in a predictable manner with a level of security commensurate with the threats encountered on each path. Figure 9 illustrates the path security determination and encryption adjustment protocol in a two-path network. Each path possesses a monitor capable of determining path security and the necessary encryption levels. The protocol uses public and private key encryption.

Public key encryption algorithms utilize mathematical techniques such as integer factorization, discrete logarithm, and elliptic curve relationships. Those wishing to use public key encryption generate a private and a public key that are mathematically related. The private key must be securely stored and guarded by its owner. When data are encrypted using public key encryption methods, it is nearly computationally infeasible for a malicious party to construct a private key that allows the encrypted data to be deciphered. What makes public key encryption techniques useful is that the holder of the private key is able to publicly distribute the public key for others to use. Those wishing to send private data to the holder of the private key simply encrypt the data with the public key, and only the owner of the private key is able to decrypt the data. In public key encryption, an initial secure exchange is not required, since the public key is made public to those who want to send secure message data to the private key owner. Private key encryption, on the other hand, requires that an initial secure exchange of private keys between communicating parties be performed.
Fig. 9 – Adjustment of path C and E according to I.
a digital signature. In the case of verifying a digital signature, the owner of the public–private key pair is the only party able to sign data with the private key. The owner signs the data using the private key and sends it to the party with whom secure communication is desired. The recipient then uses the publicly available public key to decrypt the signed data; successful decryption verifies the signature and thus the identity of the private key owner.

Public key encryption is termed asymmetric encryption because two different keys are used to encrypt and decrypt the data. Private or secret key encryption is termed symmetric encryption because the same private key is used both to encrypt and to decrypt. Although asymmetric encryption has the advantage of not requiring a secure key exchange between communicating parties, it is not as computationally efficient as symmetric key encryption. For this reason, when possible, public key encryption is used to securely exchange a secret key between parties; once the secret key is exchanged, the parties use symmetric encryption to encrypt and decrypt the data sent between them. For these encryption/decryption efficiency reasons, the path encryption level protocol utilizes asymmetric encryption to initially exchange a private session key between path monitors. Once the session key is held by each party, messages sent between path monitors are symmetrically encrypted and decrypted using the common session key.

As shown in Fig. 9, private and public key pairs (pki, ski) and (pkj, skj) are first generated by the monitors for paths Pi and Pj. The public key of each path monitor is made publicly available to all other monitors in the multipath network. The secret (private) key skj of the monitor of path Pj is used to sign a session key ssk together with the public keys pkj and pki of path monitors Pj and Pi, respectively, and the resulting message is sent to Pi. The monitor of path Pi decrypts the signed message from Pj using pkj and verifies that the message came from the owner of private key skj. The session key ssk is now in the possession of both path monitors and is subsequently used for passing the information assurance factors Ii and Ij as they are updated after the collection of each 30-minute event window. Paths Pi and Pj, which are capable of determining Ii and Ij and adjusting Ei and Ej, first establish trusted communications via public key encryption. Monitoring then occurs for a specified number of event windows, during which Ii and Ij are determined using Algorithm 1 and Ei and Ej are adjusted as specified by Algorithm 4.
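The following sketch illustrates the sign-then-encrypt exchange of the session key ssk between two path monitors using the Python cryptography package. It is a simplified illustration of the handshake in Fig. 9 under assumed primitives (RSA with PSS signatures and OAEP key wrapping); the paper does not prescribe these particular algorithms, and a production implementation would wrap the full signed message rather than only the session key.

```python
import os
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes, serialization

# Each monitor generates an RSA key pair; public keys are published.
sk_i = rsa.generate_private_key(public_exponent=65537, key_size=2048)
sk_j = rsa.generate_private_key(public_exponent=65537, key_size=2048)
pk_i, pk_j = sk_i.public_key(), sk_j.public_key()

# Monitor P_j creates a symmetric session key and signs (pk_j, pk_i, ssk).
ssk = os.urandom(32)  # 256-bit session key
payload = (
    pk_j.public_bytes(serialization.Encoding.DER,
                      serialization.PublicFormat.SubjectPublicKeyInfo)
    + pk_i.public_bytes(serialization.Encoding.DER,
                        serialization.PublicFormat.SubjectPublicKeyInfo)
    + ssk
)
signature = sk_j.sign(
    payload,
    padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                salt_length=padding.PSS.MAX_LENGTH),
    hashes.SHA256(),
)

# The session key is encrypted under P_i's public key so only P_i can
# recover it; the signature travels alongside the wrapped key.
wrapped_ssk = pk_i.encrypt(
    ssk,
    padding.OAEP(mgf=padding.MGF1(hashes.SHA256()),
                 algorithm=hashes.SHA256(), label=None),
)

# Monitor P_i unwraps the session key and verifies P_j's signature.
# (In a real exchange P_i would rebuild the payload from the published
# keys and the unwrapped ssk before verifying.)
recovered_ssk = sk_i.decrypt(
    wrapped_ssk,
    padding.OAEP(mgf=padding.MGF1(hashes.SHA256()),
                 algorithm=hashes.SHA256(), label=None),
)
pk_j.verify(signature, payload,
            padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                        salt_length=padding.PSS.MAX_LENGTH),
            hashes.SHA256())
assert recovered_ssk == ssk  # both monitors now share ssk
```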
Table 3 – Information assurance and encryption.

I                   E      AES key length (bits)   Avg. packet delay (sec)
(1 ≥ I > 0.8)       0.25   128                     0.06
(0.8 ≥ I > 0.7)     0.53   192                     0.08
(0.7 ≥ I ≥ 0.02)    1.00   256                     0.15
skj[(pkj, pki, ssk)]: secret key skj is used to sign public keys pkj, pki and session key ssk.
pki{skj[(pkj, pki, ssk)]}: public key pki is used to encrypt the signed content from Pj.

Algorithm 4 takes as input (1) I, the information assurance factor vector of all paths, and (2) P, the matrix of all paths, which includes columns C and E, the cost and encryption factors, respectively. The algorithm looks up the encryption factor for each path Pi in the experimentally derived Table 3 and then sorts the rows of P in descending order (lines 3–4). A constant Cf = 100 is a factor of Ei and establishes a cost range of Ci ∈ [2.5, 10]; Cf = 100 was chosen arbitrarily to establish an upper path cost bound of 100/10 for Ci. The cost Ci of each row in P is then calculated, and the costs of paths with higher encryption factors are increased in proportion to each path's encryption factor Ei (lines 5–8). This adjustment directs less traffic through paths with higher threat levels while at the same time providing greater encryption and data protection to these more vulnerable paths (a brief illustrative sketch of this lookup and cost adjustment is given at the end of this subsection).

Table 3 was derived by injecting cloud guest, infrastructure and service suite attacks with the previously specified signatures into a 50-path, 1 Gbit/s multipath network. The attacks were injected in a Poisson fashion, which resulted in feature distributions showing significant entropy changes. The information assurance factor Ii of each path was found using Algorithm 1 over a span of forty 30-minute event windows. The goal of the experiment was to identify Advanced Encryption Standard (AES) key lengths that provided additional data confidentiality for the most vulnerable paths without adversely impacting the overall network data throughput. With the range Ii ∈ [0.02, 1], we chose heuristically based cutoffs for AES key lengths and encryption scaling factors E based on the threats defined in Archer (2013). These derived values for I, E and AES key lengths are shown in the first three columns of Table 3.

Greater encryption strengths (longer key lengths and greater encryption scaling factors) impose overhead on the network paths where such security is applied, delaying packet transfer through those paths. For this reason, packet delays for network data encrypted with AES 128, 192 and 256 bit keys were measured and recorded in column 4 of Table 3. In deriving the values in column 1 of Table 3, we chose values of I that did not adversely impact the overall network data throughput by ensuring that only a proportionally small fraction of the paths would utilize 192 and 256 bit AES encryption. We then tested our heuristically derived ranges for I: of the fifty paths, on average 38 paths had 1 ≥ I > 0.8, eight paths fell in the range 0.8 ≥ I > 0.7, and four fell in the range 0.7 ≥ I ≥ 0.02. With on average only 24% of the paths using AES 192 and 256 bit encryption,
the overall throughput loss was deemed acceptable. Section 4.2 discusses actual throughputs achieved using the derived path encryption level protocol.
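As an illustration of the Table 3 lookup and cost adjustment performed by Algorithm 4, the sketch below maps each path's information assurance factor I to an encryption factor E and AES key length and then orders paths by encryption factor. The cost formula E·Cf/10 is an assumption chosen only to reproduce the stated range Ci ∈ [2.5, 10]; function and variable names are illustrative.

```python
CF = 100  # constant cost factor from the paper

def encryption_factor(i_factor: float) -> tuple[float, int]:
    """Map a path's information assurance factor I to (E, AES key bits)
    using the experimentally derived thresholds of Table 3."""
    if i_factor > 0.8:
        return 0.25, 128
    if i_factor > 0.7:
        return 0.53, 192
    return 1.0, 256

def adjust_paths(assurance: dict[str, float]) -> list[dict]:
    """Assign encryption factors and costs to every path, then sort so the
    most threatened (highest E) paths carry proportionally less traffic."""
    rows = []
    for path, i_val in assurance.items():
        e, key_bits = encryption_factor(i_val)
        cost = e * CF / 10.0          # assumed form; yields C in [2.5, 10]
        rows.append({"path": path, "I": i_val, "E": e,
                     "AES": key_bits, "C": cost})
    return sorted(rows, key=lambda r: r["E"], reverse=True)

# Example: three paths with differing assurance levels.
print(adjust_paths({"P1": 0.95, "P2": 0.75, "P3": 0.4}))
```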
3.8. Proactive adjustment of path security
As described in Sections 3.4 and 3.6, it is possible to predict future information assurance factors on each path Pi given past values of Ii. Algorithm 5 summarizes the proactive path encryption process using the trend-derived methods described in Sections 3.4 and 3.6.
Fig. 10 – Prob(SSig2), TS1.
Algorithm 5 takes as input (1) I, the vector of information assurance factors for a set number of previous event windows; (2) Φn, the event windows for n paths; (3) ζ, the feature activation probability threshold; (4) n, the number of network paths; and (5) δ, the number of previous event windows from which the regression scaling factor is derived. In lines 1–4, the path information assurance level scaling factor is found for each path in the network. If we let ψ = 1 in Equation 12, we can find multiple linear regression values Ii for the previous δ event windows. A linear scale of values for ψ is created and correlated with the observed information assurance factor Ii values (see Section 4.2 for details on the value range of ψ). In line 5, for each event window of a path, the multiple linear regression over δ event windows is analyzed using a least-squares method with the spike and slab prior explained in Section 3.6. Posterior distributions for the elements of β are derived using MCMC, and activation probabilities for features f1–f12 are found. The reduced submodel, in which each retained feature has an activation probability greater than ζ, is returned in the form (β, Fe, αi, ei). In line 6, predicted information assurance values IProji for each path are found using Equation 12. In line 8, cost factors C and encryption scaling factors E are proactively set for all network paths using Algorithm 4. A brief illustrative sketch of this prediction step follows.
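The self-contained sketch below conveys the flavor of the prediction step only. The spike and slab BVS and MCMC machinery is replaced by an ordinary least-squares fit and a crude coefficient-magnitude filter purely for brevity; the names, threshold logic and synthetic data are illustrative assumptions rather than the authors' implementation. The predicted IProj values would then be passed to the cost and encryption adjustment of Algorithm 4.

```python
import numpy as np

def predict_iproj(features: np.ndarray, I_obs: np.ndarray,
                  psi: float, zeta: float = 0.6) -> float:
    """Fit a reduced linear model on past event windows and predict the
    next information assurance factor for one path."""
    X = np.column_stack([features, np.ones(len(features))])   # add intercept
    coef, *_ = np.linalg.lstsq(X, I_obs, rcond=None)
    beta, alpha = coef[:-1], coef[-1]
    # Stand-in for BVS: keep features whose |beta| exceeds a pseudo
    # "activation" cutoff derived from zeta (illustrative only).
    active = np.abs(beta) > zeta * np.abs(beta).max()
    latest = features[-1]
    return float(psi * (latest[active] @ beta[active] + alpha))

rng = np.random.default_rng(1)
past_features = rng.random((8, 12))            # 8 prior windows, 12 features
past_I = rng.uniform(0.02, 1.0, 8)             # observed I per window
print(predict_iproj(past_features, past_I, psi=1.0))
```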
4. Discussion, results and conclusion

4.1. Performance and accuracy of SLFPA
To verify the functionality of the Algorithm 1–SLFPA combination described in Sections 3.2 and 3.4, six 30-minute event windows containing the cloud attack SSig2 were analyzed. Our method has no dependency on the particular signature pattern; SSig2 was chosen at random to demonstrate the method's utility. The match analysis period TSmatch was set to two event windows, a period that is easily manageable in terms of data storage. As shown in Figs. 10–12, the expected cumulative probability of the slow attack SSig2 was clearly observed in data clusters 4 and 7 through the analysis match periods TS1, TS2 and TS3. This cumulative probability in the clusters verified that the SLFPA algorithm was retaining high-probability features between analysis match periods. An experiment was then conducted using twenty 30-minute event windows of captured cloud traffic. In addition to the cloud
Fig. 11 – Prob(SSig2), TS2.
Fig. 12 – Prob(SSig2), TS3.
dataset, the standard KDD Cup 99 dataset (Stolfo, 1999) was tested to verify our method's accuracy. As the longest attack sequence was not expected to exceed eight event windows, the match analysis period was set to eight event windows. As shown in Fig. 13, the anomaly detection and classification rates for Algorithm 1 combined with the SLFPA algorithm (Algorithm 2) averaged greater than 91% accuracy. This verified that the detection accuracy of Algorithms 1 and 2 held for both the standard KDD and the cloud datasets, and it serves as a firm basis for identifying network attacks whose signature features span multiple event windows. As expected, the use of the SLFPA method produced considerable savings over a configuration utilizing one large analysis buffer containing the full set of twenty 30-minute event windows. We first tested Algorithm 1 without SLFPA using one large match analysis buffer containing the full dataset of twenty 30-minute event windows. We then tested Algorithm 1 with SLFPA and inspected the resulting reduced analysis match buffer size. The error observed between SLFPA and non-SLFPA attack classification was only 2% mean squared error (MSE). The data buffer
Fig. 13 – ROC curves for cloud and KDD data Algorithm 1.
Fig. 14 – BVS feature activation probabilities for attack with signature SSig3.
storage savings observed when using SLFPA were 99.8% compared with the non-SLFPA configuration.
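The storage savings stem from retaining only high-probability signature features between analysis match periods rather than buffering whole event windows. The toy sketch below conveys that idea only; the retention rule (a running maximum with a 0.5 cutoff) and the feature probabilities are invented for illustration and are not the SLFPA algorithm itself.

```python
match_buffer: dict[str, float] = {}

def update_buffer(window_feature_probs: dict[str, float],
                  keep_above: float = 0.5) -> None:
    """Merge one event window's feature probabilities and drop anything
    that falls below the retention threshold."""
    for feat, prob in window_feature_probs.items():
        # keep the running maximum as a crude cumulative evidence score
        match_buffer[feat] = max(match_buffer.get(feat, 0.0), prob)
    for feat in [f for f, p in match_buffer.items() if p < keep_above]:
        del match_buffer[feat]

update_buffer({"f1": 0.9, "f4": 0.2, "f7": 0.6})
update_buffer({"f1": 0.95, "f9": 0.1})
print(match_buffer)   # only high-probability features survive
```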
4.2. Trend-derived information assurance levels analyzed
An experiment was conducted using twenty 30-minute event windows of cloud data gathered from a 50-path, 1 Gbit/s multipath network. We wanted to identify how well we could predict future values of the information assurance factor IProji, where i is the network path number with i = 1, …, 50. The ranges of IProj and ψ are IProj ∈ [0.02, 1] and ψ ∈ [0.8772, 1.724]. The exact value of ψ is inversely and linearly related to the regression terms Σ_{j=1}^{Nf} βi,j fi,j + αi ∈ [0.0116, 1.14], which are taken from Equation 12. The mapping runs over the ψ range from largest to smallest: if the sum of the regression terms is 0.0116, then ψ is 1.724; if the sum is 1.14, then ψ is 0.8772; and when the sum lies between 0.0116 and 1.14, ψ is interpolated linearly within [0.8772, 1.724], largest to smallest. A sketch of this mapping is given below.

The goal of the experiment was to predict IProj1 (the path 1 information assurance factor) over 20 event windows. Equation 13 was used as our model for IProji, with a spike and slab prior distribution for I1 and MCMC analysis to converge on posterior distributions for the regressors of the model. Using the BVS strategy, simulations were run to characterize the derived posterior probability distributions for the model regressors. With the regressors as factors of the indicator variables f1, …, f12, we derived the indicator activation probabilities shown in Fig. 14 for path 1 (P1). As Fig. 14 indicates, features f1, f2, f3, f6, f7, f9, f11 and f12 have activation probabilities greater than 0.5, while f4, f5, f8 and f10 are far less probable and have less influence on the dependent variable IProj1. Figure 14 thus shows which features have the greatest effect on IProj1; the complexity of Equation 13 is reduced by evaluating only the indicator variables with the highest activation probabilities.
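To make the inverse-linear mapping above concrete, the following minimal sketch maps a regression-term sum onto ψ using the endpoints given in the text; the function name and the clamping behavior are illustrative assumptions.

```python
def psi_from_regression_sum(s: float,
                            s_lo: float = 0.0116, s_hi: float = 1.14,
                            psi_lo: float = 0.8772, psi_hi: float = 1.724) -> float:
    """The largest regression-term sum maps to the smallest psi and vice versa."""
    s = min(max(s, s_lo), s_hi)                  # clamp to the observed range
    frac = (s - s_lo) / (s_hi - s_lo)
    return psi_hi - frac * (psi_hi - psi_lo)

print(psi_from_regression_sum(0.0116))  # -> 1.724
print(psi_from_regression_sum(1.14))    # -> 0.8772
```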
If, for example, we choose an activation threshold greater than 0.60 for P1, our model reduces to Equation 14. Features are subscripted in the form fi,j, with i representing the path number and j the feature number, j = 1, …, 12.
IProj1 = ψα1 + e1 + ψ(β1,1 f1,1 + β1,2 f1,2 + β1,3 f1,3 + β1,6 f1,6 + β1,7 f1,7 + β1,12 f1,12)    (14)
BVS analysis involved the use of the Bayesian regression Equation 14, which included the conditional expectation of IProj1 given all indicator variables f1,1, …, f1,12. Because the prior distribution Prob(µ) was known, a parametric method was used when finding values for α1 and the regressors β1,1, β1,3, β1,6, β1,7 and β1,12 (O'Hara and Sillanpaa, 2009). Our multiple linear regression model of Equation 14 was then verified using the 20 event windows of sampled data, and the following statistical measurements were analyzed (see Montgomery (2012) for detailed coverage of the methods used to confirm the model's accuracy).

• Coefficient of multiple determination (R²): the proportion of the variation in the dependent variable explained by the predictor variables. R² = 89.7%, which showed a high correlation between the variation of the predictor variables f1,1, f1,3, f1,6, f1,7 and f1,12 and the value of IProj1.
• Overall F test: verifies that a linear relationship exists between at least one predictor variable and the target variable IProj1. The overall F test p-value of 0.00128 fell below 0.05, indicating that, for the baseline data sample, a linear relationship does exist between the predictor variables f1,1, f1,3, f1,6, f1,7, f1,12 and IProj1.
• Residual analysis: verified that there was little or no pattern in the relationship between the residuals and the predicted values of IProj1. This confirmed that there was no curvilinear effect in the predictor variables and no violation of the equal variance assumption.

Summarizing, because the coefficient of multiple determination is high, the BVS method identified the correct feature subset. When we set up the multiple regression of Equation 14, we assumed that a linear or curvilinear relationship exists between the independent variables (the features) and the dependent variable (IProj1); because the overall F test yielded a p-value less than 0.05, it confirmed that the assumption of linearity was correct and that the values predicted for IProj1 are accurate. Finally, when conducting residual analysis of regression Equation 14, we graphed the residuals and verified that the residual data points were indeed randomly distributed, confirming that the error in our predictions of IProj1 is random rather than systematic. Had the residuals not been randomly distributed, we would have needed to further refine Equation 14 and evaluate whether the feature distributions were normal; features found not to be normal would need to be removed. A brief sketch of these model checks is given after the following paragraph.

Predicted information assurance factors IProj1,ew and observed information assurance factors I1,ew were determined over path 1 for 20 successive event windows. The values of I1,ew were calculated using Algorithm 1 without SLFPA, together with Equations 12 and 13. Prior distributions for each IProj1,ew data
Fig. 15 – Predicted IProj versus observed I.
point in Fig. 15 were based on the distributions of the two previously observed event windows, I1,ew−1 and I1,ew−2. The MSE between IProj1,ew and I1,ew was found to be 4.3% and is explained by the regression error e1,ew, which was highest when abrupt changes in I1,ew values were observed. The Bayesian prediction method is highly dependent on the choice of prior distributions, and in this case we used the distributions observed in the preceding two event windows as our priors. Our derived regression relation tended to predict average posteriors based on the priors; thus, when there were abrupt changes in the observed values of I1,ew, Equation 14 averaged those changes, and the values of IProj1,ew tracked slightly behind abrupt changes in I1,ew in those instances. By convention, an MSE < 5% is considered an acceptable margin of error, and at 4.3% MSE the expected deviation between predicted and determined encryption strengths was observed to be small.

Using the I1,ew and IProj1,ew values shown in Fig. 15, the individual encryption strengths for each event window were determined using Table 3. Table 4 lists the encryption strengths for I1,ew and IProj1,ew, while Fig. 16 illustrates the effects on throughput. As shown in Table 4, the encryption level was incorrectly predicted in only one event window (EW 17), a 5% error rate. This proactive encryption error is a result of the 4.3% MSE observed earlier between I1,ew and IProj1,ew and is acceptably small when the error is averaged over multiple paths in the network.

To ensure that the observed error rate in predicted encryption strengths would not seriously affect throughput, the throughput performance of proactively applied encryption using IProji,ew was tested on a 1 Gbit/s, 50-path network (i = 1, …, 50) over a span of 40 event windows (ew = 1, …, 40). Cloud attack suites 1–3 were transmitted over the 50 paths, and the results are illustrated in Fig. 16. As Fig. 16 illustrates, the MSE in throughput between the IProji,ew-predicted and Ii,ew-determined encryption strengths was relatively small at 3.3%. This is an acceptably low error rate, and one for which system tolerances can safely be established.
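The model checks described above (coefficient of multiple determination, overall F test and residual inspection) can be reproduced on any fitted linear model; a minimal sketch using the statsmodels package on synthetic stand-in data follows. The feature matrix, coefficients and noise level are assumptions made only for illustration.

```python
import numpy as np
import statsmodels.api as sm

# Synthetic stand-in for path 1: 20 event windows, 6 retained features.
rng = np.random.default_rng(2)
X = rng.random((20, 6))
y = X @ np.array([0.3, 0.1, 0.2, 0.15, 0.1, 0.05]) + rng.normal(0, 0.02, 20)

fit = sm.OLS(y, sm.add_constant(X)).fit()
print("R^2:", round(fit.rsquared, 3))          # coefficient of multiple determination
print("F p-value:", round(fit.f_pvalue, 5))    # overall F test (< 0.05 => linear relation)
resid = fit.resid
print("residual mean/std:", round(resid.mean(), 4), round(resid.std(), 4))
# A plot of resid against fit.fittedvalues should show no pattern if the
# equal-variance assumption holds.
```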
Table 4 – Required AES encryption strengths (path #1).

Event window   I required AES key length (bits)   IProj required AES key length (bits)
1              128                                128
2              128                                128
3              128                                128
4              128                                128
5              128                                128
6              128                                128
7              128                                128
8              128                                128
9              128                                128
10             128                                128
11             128                                128
12             128                                128
13             128                                128
14             128                                128
15             128                                128
16             128                                128
17             192                                128
18             128                                128
19             128                                128
20             128                                128
4.3. Practical considerations
While our method is very effective at identifying anomalies in multipath networks, some important practices must be followed to ensure the accuracy of the methods:

1. It is always important to understand the types of threats that need to be detected and countered. The slower the anticipated attack types, the larger the number of event windows that must be chained together using SLFPA. Algorithm 1 and SLFPA are effective at retaining only high-probability signature elements in the matching buffer; thus, the cumulative matching buffer storage size is very small. Nevertheless, if a large number of attack signatures are tracked by the system, it is important to carefully designate the proper subset of suspected slow attacks in order to optimize match buffer utilization and Algorithm 1 computation overhead.

2. The relative entropy methods discussed in Section 3.3 require the collection of baseline network samples. It is good practice to collect a number of baseline samples greater than or equal to the length of the longest anticipated slow attack. Additionally, because baseline (non-anomalous) network data transmission characteristics shift over time, it is essential that a sliding-window baseline data collection strategy be employed. The sliding baseline sampling window should not only be at least the size of the slowest anticipated attack length but should also contain the typical cyclic traffic patterns experienced by the network during normal usage (a minimal sketch of such a sliding baseline is given after this list).

3. As stated in the discussion of Algorithms 1, 3 and 4, if two or more attacks are found on any one path, the path encryption level is set according to the cumulative threat level on that path. In environments where cumulative threats develop at substantially different rates on different paths, it is generally advisable to set the encryption adjustment frequency to twice that of the shortest anticipated high-threat attack scenario. In so doing, the network path encryption levels will be adjusted appropriately to provide greater data security in networks where different paths encounter attacks at different rates. It should be noted that setting the encryption adjustment interval to a shorter time interval will reduce data throughput due to the overhead associated with the encryption adjustment protocol (Algorithm 4).

4. The selection of the BVS activation threshold (ζ in Algorithm 5) is another important consideration. If a low activation threshold is selected, non-essential features may be included when calculating IProj, which increases the probability of overfitting. Overfitting will result in a higher number of false predictions, which will ultimately cause paths to be assigned an incorrect encryption strength. On paths where a higher encryption level than necessary is erroneously employed, the overall throughput of the multipath network will be unnecessarily reduced; on paths where the encryption level is set too low due to a misprediction of IProj, an attacker has a higher probability of exploiting that path's vulnerabilities. For this reason, it is very important to validate the prediction model by ensuring that the assumption of linearity is correct, as discussed in Section 4.2 and in Montgomery (2012).
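A minimal sketch of the sliding baseline idea from item 2 is given below: the most recent non-anomalous event-window feature histograms form the baseline, and each new window is scored against the pooled baseline with a Kullback–Leibler style relative entropy measure. The window length, binning and scoring details are illustrative assumptions rather than the configuration used in this work.

```python
import numpy as np
from collections import deque

baseline_len = 8                               # assumed sliding-window length
baseline: deque[np.ndarray] = deque(maxlen=baseline_len)

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-9) -> float:
    """Relative entropy between two normalized feature histograms."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def score_window(feature_hist: np.ndarray) -> float:
    """Score a new event window against the pooled sliding baseline."""
    if not baseline:
        baseline.append(feature_hist)
        return 0.0
    pooled = np.sum(baseline, axis=0)
    score = kl_divergence(feature_hist, pooled)
    baseline.append(feature_hist)              # slide the baseline forward
    return score

rng = np.random.default_rng(3)
for _ in range(10):
    print(round(score_window(rng.integers(1, 50, size=16).astype(float)), 4))
```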
Fig. 16 – Throughput comparison IProj versus observed I.

4.4. Conclusion
Previous multipath security schemes, such as those of Lee et al. (2005), Monnet and Mokdad (2013), and Younis et al. (2009), utilized predetermined path encryption levels and random shuffling of traffic routes to increase multipath data privacy. Although these former methods are an improvement over applying a uniformly strong encryption level on all paths, they fail to respond dynamically to emerging path threats and instead rely on predetermined data sensitivity heuristics. In addition, none of these methods is able to intelligently sense or predict future threats on a network path. Conversely, we have shown that our methods maintain a predictable secure quality of service by
actively sensing the threats on fixed bandwidth network paths and proactively adjusting encryption levels with a high level of accuracy when attacks are directed against individual paths.
4.5. Future research
We restricted our research to detecting network attacks launched by an attacker over a specific set of network paths, and our methods analyzed only TCP/IP header data to detect malicious behaviors. Future research will include extending our methods to detect polymorphic or zero-day attacks via TCP/IP packet payload inspection. In addition, our planned future research will include methods that employ active countermeasures in response to sensed attacks, such as traffic shaping, profiling and re-routing of malicious traffic.

REFERENCES
Archer J. Top threats to cloud computing V1.4. In: Cloud Security Alliance Summit. 2013.
Bilge L, Dumitras T. Before we knew it – an empirical study of zero-day attacks in the real world. In: CCS. 2012.
Bonaventure O. Computer networks: principles, protocols and practice. The Saylor Foundation; 2011.
Gelfand A, Smith A. Sampling based approaches to calculating marginal densities. J Am Stat Assoc 1990;85:398–409.
Gelman AE. Bayesian data analysis. 2nd ed. CRC Press/Chapman & Hall; 2004.
Kullback S, Leibler RA. On information and sufficiency. Ann Math Stat 1951;22(1):79–86.
Lee PC, Misra V, Rubenstein D. Distributed algorithms for secure multipath routing. In: INFOCOM. 2005.
Lempers F. Posterior probabilities of alternative linear models. Rotterdam University Press; 1971.
Li Z, Kwok Y. A new multipath routing approach to enhancing TCP security in ad hoc wireless networks. In: ICCP 2005 workshop. 2005.
Meyer R, Wilkinson R. Bayesian variable assessment. Commun Stat-Theory Methods 1998;27(11):2675–705.
Monnet Q, Mokdad L. Data protection in multipath WSNs. In: IEEE Symposium on Computers and Communications (ISCC). 2013.
Montgomery D. Design and analysis of experiments. 8th ed. John Wiley & Sons; 2012.
Obert J, Pivkina I, Huang H, Cao H. Dynamically differentiated multipath security in fixed bandwidth networks. In: MILCOM. 2014.
O'Hara RB, Sillanpaa MJ. A review of Bayesian variable selection methods: what, how and which. Bayes Anal 2009;4(1):85–118.
Polychronakis M, Anagnostakis KG, Markatos EP. Real-world detection of polymorphic attacks. In: Fourth International Workshop on Digital Forensics & Incident Analysis. 2009.
Salour M, Su X. Dynamic two-layer signature-based IDS with unequal databases. In: The International Conference on Information Technology (ITNG '07). 2007.
Steinbach M, Karypis G, Kumar V. A comparison of document clustering techniques. In: KDD text mining workshop. 2000.
Stolfo S. KDD Cup 1999 data. In: The Fifth International Conference on Knowledge Discovery and Data Mining. 1999.
Wei-wei X, Hai-feng W. Prediction model of network security situation based on regression analysis. In: IEEE International Conference on Wireless Communications, Networking and Information Security. 2010.
Younis M, Krajewski N, Farrag O. Adaptive security provision for increased energy efficiency in WSNs. In: IEEE 34th Conference on Local Computer Networks. 2009.

James Obert is a computer scientist at Sandia National Labs and is actively involved in dynamic network defense and trusted computing research. Inna Pivkina is an associate professor in Computer Science at New Mexico State University (NMSU); her research is in the areas of knowledge representation and data mining. Hong Huang is an associate professor in Electrical and Computer Engineering at NMSU; his research interests include wireless sensor networks and network security. Huiping Cao is an assistant professor in Computer Science at NMSU; her research interests include data mining and databases.

This paper expands upon the authors' previously presented conference paper at MILCOM 2014 (Obert et al., 2014).