Modification of Supervised OPF-based Intrusion Detection Systems Using Unsupervised Learning and Social Network Concept

Hamid Bostani (a), Mansour Sheikhan (b,*)

(a) Department of Computer Engineering, Islamic Azad University, South Tehran Branch, Tehran, Iran
(b) Department of Communication Engineering, Islamic Azad University, South Tehran Branch, Tehran, Iran
[email protected], [email protected]
* Corresponding author. Tel: +98 21 88833104; Fax: +98 21 88830012. Postal address: Mahallati Highway, Dehhaghi Blvd., Faculty of Engineering, Islamic Azad University-South Tehran Branch, Tehran, Iran.
Abstract
Optimum-path forest (OPF) is a graph-based machine learning method that can overcome some limitations of the traditional machine learning algorithms that have been used in intrusion detection systems. This paper presents a novel approach for intrusion detection using a modified OPF (MOPF) algorithm that improves the performance of the traditional OPF in terms of detection rate (DR), false alarm rate (FAR), and execution time. To address the problem of scalability in large datasets, and also to achieve high attack recognition rates, the proposed framework employs the k-means clustering algorithm, as a partitioning module, for generating different homogeneous training subsets from the original heterogeneous training samples. In the proposed MOPF algorithm, the distance between an unlabeled sample and the root (prototype) of every sample in the OPF is also considered in classifying unlabeled samples, with the aim of improving the accuracy rate of the traditional OPF algorithm. Moreover, the centrality and prestige concepts from social network analysis are employed in a pruning module for determining the most informative samples in the training subsets, to speed up the traditional OPF algorithm. The experimental results on the NSL-KDD dataset show that the proposed method performs better than the traditional OPF in terms of the accuracy rate, DR, FAR, and cost per example (CPE) evaluation metrics.

Keywords: Optimum-path forest, classification, clustering, pruning, centrality, prestige, social network analysis
1. Introduction
Along with the rapid growth of technology in computer networks, especially the Internet, security has become a critical challenge for the community of security researchers. Networks have become vulnerable to security threats; therefore, detecting and preventing anomalous traffic have become important issues in the security of computer networks. Intrusion detection systems (IDSs) are effective tools for achieving a high level of security. IDSs gather network traffic as input data and then analyze it with the aim of detecting attacks or malicious behaviors. According to different data sources, IDSs take either a
network- or host-based approach to recognize attacks [1]. Various activities of hosts, such as audit records of the operating system and system logs, provide data for host-based IDSs [2], whereas data collection is performed using packets, such as Internet packets, in network-based IDSs [2]. According to different analysis methods, IDSs are classified into three main categories [3]: a) misuse detection; b) anomaly detection; and c) specification-based systems. In misuse detection systems, predefined attack patterns are stored in a signature database [2]. These attack patterns are matched against system behavior or network traffic with the aim of detecting intrusions. In misuse detection methods, all of the known attacks can be detected with a low false alarm rate (FAR); however, unknown abnormalities are not detectable. In an anomaly detection system, a statistical- or machine learning-based model is built for describing the normal behavior [4]. Then, an anomaly can be detected in the observed data by noticing deviations from this model. Anomaly detection algorithms are useful for detecting new intrusions; however, they are not as effective as misuse detection models in detecting known attacks. In addition, they suffer from high false positive rates [5]. The specification-based systems are similar to the anomaly detection systems; however, user guidance is required for developing a model of normal behavior in these systems. Notably, machine learning techniques are usually used in specification-based IDSs [3].

With the development of computer networks, intrusions are also becoming more sophisticated. Various machine learning techniques have been used in IDSs with the aim of achieving a high detection rate (DR), low FAR, and low computation and communication costs [6, 7]. Traditionally, machine learning methods are classified into three categories: a) supervised; b) unsupervised; and c) semi-supervised. The supervised learning methods employ labeled data in the training phase with the aim of correctly mapping data samples to the corresponding class labels in the test phase [7]. On the other hand, the unsupervised learning methods do not use class-labeled data in the training phase; they employ the similarities of data points for data clustering. The supervised and unsupervised learning models are usually employed in the misuse-based and anomaly-based IDSs, respectively. Semi-supervised learning is another type of machine learning that employs both unlabeled and labeled data for training. Several studies have been reported that employ these machine learning methods in the intrusion detection field.

The following studies are some examples of supervised and unsupervised learning models for intrusion detection: Li et al. [8] proposed a supervised network intrusion detection method based on the transductive confidence machines for k-nearest neighbors (TCM-KNN) learning algorithm. They introduced an active learning method for selecting fewer good training samples with the aim of reducing the scale of the training dataset, and consequently the computation cost. They evaluated the performance of the proposed method using the KDD Cup 99 dataset [9]. Yi et al. [10] introduced an incremental support vector machine (SVM) based on a reserved set, called RS-ISVM, for network intrusion detection. They proposed a modified kernel
function (called U-RBF, based on the Gaussian kernel function) for reducing the noise that is generated by feature differences. Moreover, they developed a reserved-set strategy to overcome the oscillation phenomenon which often occurs in the learning phase of a simple incremental SVM; this strategy retains non-support vectors that are likely to become support vectors. Constant-FAR outlier detection, using a supervised method based on normalized residual values, was proposed by Ru et al. [11]. Rajasegarar et al. [12] proposed a distributed anomaly detection architecture for modeling the data at each sensor in the network; this architecture used multiple hyperellipsoidal clusters and detected global and local anomalies.

Semi-supervised learning was employed in the following sample intrusion detection models: Xue et al. [13] focused on outlier detection by combining semi-supervised learning, fuzzy set theory, and rough set theory; they named this hybrid approach fuzzy rough semi-supervised outlier detection (FRSSOD). Daneshpazhouh and Sami [14] proposed an entropy-based outlier detection model whose method consisted of two phases: a) extracting negative examples from positive and unlabeled data and b) detecting the top N outliers using an entropy-based algorithm.

Some researchers employed hybrid learning or architecture approaches for intrusion detection. The following studies are some examples of these systems: Yeung and Ding [15] detected intrusions using program/user profiles that were modeled by two types of behavioral models (i.e., dynamic and static models) for data mining. The dynamic model was based on the hidden Markov model and the maximum likelihood principle; the static model was based on event occurrence frequency distributions and the minimum cross-entropy principle. Xiang et al. [16] combined supervised tree classifiers and unsupervised Bayesian clustering to detect intrusions. Wang et al. [17] proposed an approach based on artificial neural networks (ANNs) and fuzzy clustering, called FC-ANN, for achieving a high DR and a low false positive rate. The fuzzy clustering technique was used for generating different training subsets, which were employed for training the ANNs. The results of simulation experiments on the KDD Cup 99 dataset showed superior performance as compared to the back-propagation neural network (BPNN) and other well-known methods such as decision tree and naïve Bayes. Kim et al. [5] proposed a hybrid intrusion detection method that hierarchically integrated a misuse detection model based on the C4.5 decision tree algorithm and an anomaly detection model based on multiple one-class SVMs. The experimental results showed improvement of the IDS performance in terms of unknown-attack detection and speed of detection. Aburomman and Reaz [18] proposed an ensemble construction method that employed particle swarm optimization (PSO)-generated weights to create an ensemble of classifiers for intrusion detection.

Papa et al. [19] introduced a novel supervised machine learning algorithm based on graph theory, called optimum-path forest (OPF), which reduced a classification problem into the problem of partitioning the
vertices of a graph [20]. The OPF is a simple and fast classifier which is parameter-independent and natively supports multi-class problems [20]. The OPF does not make any assumption about the shape of the classes, so partial overlapping among the classes can be handled [20]. The complete graph in an OPF, which is generated based on the samples and their distances in the dataset, is partitioned into one or more optimum-path trees rooted at prototypes (i.e., key samples) that represent each class in the classification problem. Each tree of the OPF includes vertices that are strongly connected to their roots. The OPF is still a new classifier in the intrusion detection field and, to the best of the authors' knowledge, it has not been widely used in IDSs.

In this study, a novel IDS is proposed that is based on a modified OPF model. The proposed model consists of three main modules: a) partitioning; b) pruning; and c) detecting. One of the main challenges in the OPF is the size of the training set. In the first module, the k-means clustering algorithm was used as an unsupervised learning method for addressing the problem of scalability in large datasets. Moreover, this partitioning can enhance the detection accuracy of low-frequency attacks, such as remote to local (R2L) and user to root (U2R) attacks, in the NSL-KDD dataset [21]. The k-means clustering algorithm partitions the original training set into k clusters which are used as the training and evaluation sets of the OPFs in the detecting module. With the aim of speeding up the OPF, the centrality and prestige concepts from social network analysis were used in the second module for pruning the training set of the OPF by identifying and selecting the most informative samples. In this way, the size of the training and evaluation sets can be reduced for speeding up the OPF. Then, OPFs were trained in the detecting module by using the different training and evaluation sets which were generated in the first module. To fine-tune the classification algorithm in the OPF, the following points are important: a) considering all arcs that connect an unlabeled sample to the training set samples (for finding the optimum path from the prototypes to the unlabeled sample in the classification phase) and b) considering the distance between the unlabeled sample and the root (prototype) of each sample in the training set (as done in this study). The simulation results on the NSL-KDD dataset showed that the proposed method performs better than the traditional OPF.

The rest of this paper is organized as follows: the related work is presented in Section 2. The foundations of the following subjects are introduced briefly in Section 3: a) supervised OPF; b) the k-means clustering algorithm; and c) the centrality and the prestige concepts in social network analysis. The proposed IDS, which is based on the MOPF model, is introduced in Section 4. The performance of the proposed method is compared with the traditional OPF and well-known classifiers in Section 5 by reporting the experimental results on the NSL-KDD dataset. The paper is concluded in Section 6.
2. Related Work
As mentioned earlier, the proposed IDS is based on a modified OPF model. This model employs the k-means clustering algorithm for providing better performance. In addition, centrality- and prestige-based measures are adopted from social network analysis to speed up the proposed model. Based on these concepts, the related work is presented in this section.

Jiang et al. [22] considered an outlier factor of clusters for measuring the deviation degree of a cluster. In this way, the cluster radius threshold was computed and attack classification was performed by an improved nearest neighbor method. Their proposed clustering method was employed for unsupervised intrusion detection. Tsai and Lin [23] proposed a hybrid learning model based on the triangle area-based nearest neighbor (TANN) for detecting attacks. In TANN, cluster centers (corresponding to the attack classes) were first obtained by using the k-means clustering algorithm. Then, the triangle area was determined using two cluster centers and data from the dataset. Finally, a new feature was formed and the k-nearest neighbors (k-NN) classifier was used for classifying similar attacks based on the new feature. Lin et al. [24] proposed a feature representation approach, called cluster center and nearest neighbor (CANN), for intrusion detection. In CANN, two distances were measured and summed: a) the distance between each data sample and its cluster center and b) the distance between the data sample and its nearest neighbor in the same cluster. This summed distance was then employed to represent each data sample for intrusion detection by the k-NN classifier. Pereira et al. [25] proposed an intrusion detection model based on the OPF classifier. They used a pruning algorithm, which selected compact samples in the training set to speed up the classification phase, and also a variety of evolutionary-based feature selection methods for improving the accuracy of the OPF. They compared the performance of the OPF with some competitive methods such as SVM with a radial basis kernel mapping function (called SVM-RBF), a Bayesian classifier, and the self-organizing map (SOM) neural network. The performance of the proposed model was similar to the competitive methods; however, it was faster than those methods.

As related studies on the OPF model, the following works can be reported: Papa et al. [26] showed that the OPF can perform successfully with different graph topologies and learning techniques. In this way, they focused on a supervised learning approach and proposed an algorithm for fast classification using a reduced-size training dataset. Iwashita et al. [27] proposed an architecture-independent optimization method for big data classification by the OPF. This method used a theoretical formulation that relates the minimum spanning tree (MST) to the minimum spanning forest (MSF) generated by the OPF over the training dataset. Souza et al. [28] introduced the k-OPF supervised classifier. The k-OPF has many similarities with the k-NN classifier; these classifiers are equivalent when all training samples are used as prototypes. Saito et al. [29] proposed a data pre-organization technique for balancing the selection of samples from all classes and uncertain samples for training. In OPF-based clustering, the effectiveness of clustering is dependent on the probability density function (pdf) estimation. So, Costa et al. [30] proposed
a nature-inspired method for pdf estimation in data clustering based on the OPF. Finally, Osaku et al. [31] classified remote sensing images using a contextual approach that was based on the OPF and Markov random fields.

The centrality and the prestige measures are used for the identification of key members in social networks [32]. As related studies on these measures, the following works have been reported in recent years: Yan et al. [33] introduced a node centrality measure and its derivative indices to measure the collaboration competence of a node in a social network. The proposed indices utilized the following information to measure the centrality of a node: a) the number of a node's neighbors; b) link strengths; and c) the centrality information of neighbor nodes. Qi et al. [34] introduced the spanning tree centrality (STC) score for centrality measurement in weighted networks. They calculated the STC score based on the Kirchhoff polynomial. Kundu and Pal [35] proposed a community detection algorithm in which fuzzy-rough communities were identified. Qi et al. [36] introduced an algorithm for community selection. Their approach was based on the drops of densities between each pair of parent and child nodes.
3. Preliminaries
In this section, the foundations of the supervised OPF, the k-means clustering algorithm, and the centrality and prestige concepts in social network analysis are reviewed briefly.

3.1. Supervised OPF
In 2009, Papa et al. [19] introduced a supervised machine learning algorithm based on graph theory, called optimum-path forest, which reduces a pattern recognition problem to an optimal graph partitioning in a given feature space. In the OPF, each sample is represented by a feature vector and is shown as a node in a complete weighted graph. The weighted arcs, which are defined by adjacency relations between samples, link all pairs of nodes in this graph. The arcs are undirected, and the distance between two feature vectors is considered as the weight of the corresponding arc in the complete graph. Each path connects a pair of nodes. A path is composed of a sequence of distinct samples, and a connectivity function, such as the maximum arc weight along the path, assigns a cost to the path [19]. Generally, OPF-based classification is composed of two distinct phases: a) training and b) classification. In the training phase, some key samples from the training set, called prototypes, should be identified for each class in the classification problem. Then, the complete graph will be partitioned into optimum-path trees (OPTs) by a competitive process between the prototypes (as the roots of the OPTs) which offers optimum paths to the remaining nodes of the graph [25]. The OPF is composed of a union of OPTs. The nodes of an OPT are more strongly connected to their own prototype than to the other prototypes in the OPF. Hence, each sample that belongs to an OPT has the same class as its prototype. In the classification phase, all of the
arcs that connect an unlabeled sample to all samples in the OPF are considered to classify that unlabeled sample. By evaluating the paths from the nodes of the OPF to an unlabeled sample, a node can be found which offers an optimum path to that unlabeled sample. Hence, the class of an unlabeled sample will be the same as the class of that best node's root. In addition to the training and the classification phases, another phase (called learning) is usually used for improving the accuracy of the OPF. The learning phase is performed by using the classification errors on the evaluation set. Suppose $Z = Z_1 \cup Z_2 \cup Z_3$ is a labeled dataset in which $Z_1$, $Z_2$, and $Z_3$ are the training, the evaluation, and the test datasets, respectively. It is noted that $Z_2$ is used for improving the accuracy of the learning model on $Z_3$ by replacing the misclassified samples of $Z_2$ with randomly selected samples of the same class from the non-prototype samples in $Z_1$ [19]. The details of the OPF procedures are described in the following
subsections.

A. Training Phase
Identification of the optimum set of prototypes, called $S^*$, is one of the main processes in the training phase. Suppose $G = (Z_1, A)$ is a complete weighted graph with the specifications mentioned above. The samples in the training dataset are represented by the nodes of $G$, and each pair of samples $(s, t)$ is connected by its arc $(s, t) \in A$. To find the prototypes from $Z_1$, a minimum spanning tree (MST) of $G$ should be computed. The closest nodes of the MST which have different labels in $Z_1$ are the prototypes of the OPF [19]. The optimum set of prototypes $S^*$ is the set of prototypes that minimizes the classification error of the OPF training algorithm in $Z_1$ [19].
An OPT should be found in the training phase for each prototype $s \in S$, where $S$ is the set of prototypes. It means that the OPF classifier assigns one optimum path from $S$ to each training sample $t \in Z_1$ [26]. Notably, a sequence of distinct samples such as $\pi_t = \langle t_1, t_2, \ldots, t \rangle$ with terminus $t$ is a path, and a path with one sample, like $\pi_t = \langle t \rangle$, is called trivial [19]. A path-value function $f$ assigns a path cost $f(\pi_t)$ to each path $\pi_t$. The path-value function $f_{\max}$ is determined by Eq. (1) [19]:

$$f_{\max}(\langle s \rangle) = \begin{cases} 0 & \text{if } s \in S \\ +\infty & \text{otherwise} \end{cases}, \qquad f_{\max}(\pi_s \cdot \langle s, t \rangle) = \max\{f_{\max}(\pi_s), d(s, t)\} \qquad (1)$$

where $s$ is a training sample (i.e., $s \in Z_1$) and $S$ represents the prototype set. Moreover, the distance between samples $s$ and $t$ is denoted by $d(s, t)$. $f_{\max}(\pi_t)$ computes the maximum distance between two adjacent samples along the path $\pi_t$, in which $\pi_s \cdot \langle s, t \rangle$ denotes the concatenation of $\pi_s$ and $\langle s, t \rangle$ (as a path and an arc, respectively). Partitioning the graph into OPTs is performed by the minimization of $C(t)$, which assigns an optimum path $P^*(t)$ from the set of prototypes ($S$) to every sample $t \in Z_1$, whose minimum cost $C(t)$ is calculated as follows [19]:

$$C(t) = \min_{\pi_t} \{ f_{\max}(\pi_t) \} \qquad (2)$$

The minimization of $C(t)$ is performed in the training phase of the OPF. To see more details about the OPF training algorithm, refer to [19].
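To make the training procedure concrete, the following is a minimal Python sketch (not the authors' MATLAB implementation) of this phase: prototypes are taken from the MST edges that join different classes, and Eqs. (1)-(2) are then evaluated with a Dijkstra-style loop that minimizes the maximum arc weight along each path. The function name and the dense-graph strategy are illustrative assumptions.

import heapq
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import cdist

def opf_train(X, y):
    n = len(X)
    D = cdist(X, X)                       # complete weighted graph (arc weights)
    mst = minimum_spanning_tree(D).toarray()
    # Prototypes: endpoints of MST edges whose nodes carry different labels.
    prototypes = set()
    for i, j in zip(*np.nonzero(mst)):
        if y[i] != y[j]:
            prototypes.update((int(i), int(j)))
    # Optimum-path cost propagation with f_max (Eq. 1): a Dijkstra-like loop
    # that minimizes the maximum arc weight along the path (Eq. 2).
    cost = np.full(n, np.inf)
    label = np.array(y)                   # conquered nodes inherit the winner's label
    root = np.arange(n)
    for p in prototypes:
        cost[p] = 0.0
    heap = [(0.0, p) for p in prototypes]
    heapq.heapify(heap)
    done = np.zeros(n, dtype=bool)
    while heap:
        c, s = heapq.heappop(heap)
        if done[s]:
            continue
        done[s] = True
        for t in range(n):
            if t == s or done[t]:
                continue
            tmp = max(c, D[s, t])         # f_max over the extended path
            if tmp < cost[t]:
                cost[t] = tmp
                label[t] = label[s]       # t joins the OPT rooted at s's prototype
                root[t] = root[s]
                heapq.heappush(heap, (tmp, t))
    return cost, label, root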
B. Classification and Learning Phases
To classify each unlabeled sample such as $t$, we assume $t$ as a part of the training set; let arcs connect all samples in the training set to $t$. The purpose of the classification phase is to find an optimum path $P^*(t)$ from $S^*$ (the optimum set of prototypes) to $t$, and then to label $t$ with the class of its root, $\lambda(R(t))$, in which $R(t)$ denotes the root of $t$ and $\lambda(\cdot)$ is a function that assigns the correct label to a sample [19]. The optimum path can be found incrementally by evaluating the optimum cost as follows [19]:

$$C(t) = \min\{ \max\{C(s), d(s, t)\} \}, \quad \forall s \in Z_1 \qquad (3)$$

where $C(s)$ is the minimum cost of $s$. Suppose $s^*$ is the best node that satisfies Eq. (3). It is noted that the classifier assigns $L(t) = \lambda(R(s^*))$ to $t$ as the class of $t$, and an error occurs when $L(t) \neq \lambda(t)$ [19].

One of the main challenges in building an efficient classifier is the quality of the training samples in a classification problem. Hence, it would be interesting to select the most informative samples for $Z_1$, which can improve the accuracy of the classifier on $Z_3$. Papa et al. [19] presented a general learning algorithm for building an efficient classifier that used an evaluation set $Z_2$ to enhance the composition of samples in the training set $Z_1$ without changing its size. They used $Z_2$ for evaluating the classifier which was projected with $Z_1$. Then, they replaced the misclassified samples of $Z_2$ with samples of the same class from the non-prototype samples in $Z_1$, which were randomly selected. This process was repeated for $T$ iterations by using the new sets of $Z_1$ and $Z_2$. More details about the OPF classification and learning algorithms can be found in [25, 26].
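Under the same assumptions as the training sketch above, Eq. (3) reduces to a scan over the training samples; the ordered-list speed-up used later in Algorithm 2 is omitted here for clarity.

import numpy as np

def opf_classify(X_train, cost, label, t):
    d = np.linalg.norm(X_train - t, axis=1)   # arc weights d(s, t) for all s
    offered = np.maximum(cost, d)             # max{C(s), d(s, t)}
    best = int(np.argmin(offered))            # s* minimizing Eq. (3)
    return label[best], offered[best]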
3.2. k-means Clustering Algorithm
Clustering is based on unsupervised learning, which groups data samples into disjoint partitions based on their similarities. The k-means is a popular and crisp clustering technique (with non-overlapping partitions) that is a numeric, non-deterministic, and iterative method [37, 38]. Suppose $X = \{x_1, x_2, \ldots, x_n\}$ is a set of $n$ samples. The k-means algorithm partitions these samples into $k$ clusters $C_1, C_2, \ldots, C_k$ such that the following conditions are satisfied: a) $C_i \neq \emptyset$; b) $C_i \cap C_j = \emptyset$ for $i \neq j$; and c) $\bigcup_{i=1}^{k} C_i = X$ [21]. The main goal of the k-means clustering algorithm is to minimize the sum of the dissimilarity of all samples in a cluster from the centroid. In other words, a criterion is defined as follows [39]:

$$J = \sum_{i=1}^{k} \sum_{x \in C_i} \| x - \mu_i \|^2 \qquad (4)$$

and we want to partition the $n$ samples into $k$ clusters such that this criterion becomes minimum. In Eq. (4), $J$ is the k-means objective function and $\mu_i$ is the centroid of all samples which belong to cluster $C_i$. Notably, $J$ (as an error function) shows the
quality of the clustering.

The steps of the k-means clustering algorithm are as follows [38]:
Step 1: Generate $k$ random centroids from the dataset.
Step 2: For each sample in the dataset, calculate its Euclidean distance from the centroids.
Step 3: Assign each sample to the cluster whose centroid is nearest, based on the Euclidean distance computed in Step 2.
Step 4: Recalculate the new cluster centroids by computing the mean of the attribute values of the samples in each cluster.
Step 5: Repeat Steps 2 to 4 until the stopping criterion is met.

Cluster validity assessment (CVA) is a process for evaluating the results of clustering algorithms such as k-means [40]. With CVA in mind, several validity indices have been introduced for measuring the veracity of a clustering result, such as the Dunn index [41], Dunn-like indices (which use different definitions for inter-cluster distance and cluster diameter as compared to the Dunn index) [42, 43], the Silhouette index [44], and the Davies-Bouldin (DB) index [45]. The main disadvantages of the Dunn and Dunn-like indices are as follows: a) high computational load and b) sensitivity to noise [40]. One of the key parameters in the k-means clustering algorithm is the value of $k$, which has a great effect on the performance of the algorithm. The DB index is an appropriate validity index for evaluating the intra-cluster similarity and inter-cluster differences [46]. This attribute has a positive impact on the partitioning module for grouping homogeneous samples from the original heterogeneous samples into the same groups. In addition, the computational load of this index is less than those of the Silhouette index [47] and the Dunn index [40]; therefore, the DB index is used as the evaluation metric of the k-means clustering algorithm in this study [45]:

$$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \left( \frac{d_i + d_j}{d_{ij}} \right) \qquad (5)$$
where $d_i$ and $d_j$ are the average distances of all elements in the $i$th and the $j$th clusters from the centroid of each cluster, respectively, and $d_{ij}$ denotes the distance between the centroids of the $i$th and the $j$th clusters. As seen in Eq. (5), the DB index is a function of the ratio of the sum of within-cluster scatter to between-cluster separation; the within-cluster scatter should be minimized and the between-cluster separation should be maximized [48]. The term $(d_i + d_j)/d_{ij}$ in Eq. (5) shows the similarity measure of clusters $i$ and $j$; it is based on the dispersion measure of a cluster ($d_i$) and the cluster dissimilarity measure ($d_{ij}$) [45]. The minimum value of the DB index shows better quality of the clusters, and consequently better performance of the k-means clustering algorithm [39]. To see more details about this metric, refer to [45]. In the proposed model, the k-means clustering algorithm was used for partitioning the original training set into $k$ clusters that were used as the training sets of $k$ OPFs. This approach was used for addressing the scalability problem.
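As an illustration of how the partitioning module could pick $k$, the following sketch (assuming scikit-learn rather than the authors' MATLAB code) runs k-means for several values of $k$ and keeps the one with the smallest DB index of Eq. (5).

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def select_k(X, k_range=range(2, 11)):
    best_k, best_db = None, np.inf
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        db = davies_bouldin_score(X, labels)   # Eq. (5); lower is better
        if db < best_db:
            best_k, best_db = k, db
    return best_k, best_db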
3.3. Centrality and Prestige Concepts in Social Network Analysis
A social network is a social structure made up of a set of nodes, the social actors, who are connected by one or more specific types of interdependency. The social actors and their interactions or relationships can be represented by a graph. Hence, graph theory is an essential approach which is widely used in social network analysis. Identification of influential actors is one of the primary applications of graph theory in social network analysis for finding the most important ones. Centrality and prestige [49] are two fundamental measures that are widely used for this purpose. By using the centrality measure in social network analysis, actors who are extensively involved in relationships with other actors can be found [50]. A variety of metrics, such as the degree centrality, the closeness centrality, and the betweenness centrality, are used for measuring the centrality [51]. The prestige is a property which is derived from the patterns of social ties of a particular social network [52]. In social network analysis, the prestige measure focuses on in-links. In other words, the opinions of other actors (which are expressed by their arcs) determine the importance of an actor (instead of the opinion of a special actor) [53]. Different measures, such as the node degree-based prestige, the proximity prestige, the rank prestige, and the node position, are used for measuring the prestige [50]. The prestige measure differentiates in-links and out-links, so it is a more refined measure of the prominence of an actor than the centrality in social network analysis [53]. In this study, we have solely focused on the betweenness centrality (BC) and the proximity prestige (PR). The BC measure indicates the importance of an actor in terms of connecting other actors [51]. In other words, BC reflects the extent to which an actor ties two other actors together in the social network, provided that this connection path is the shortest path between those two actors. For each pair of actors, this metric takes
into account how many times a special actor can interrupt the shortest path between them. In a directed graph, the BC of node $v$ is computed as follows [50]:

$$BC(v) = \sum_{i \neq j \neq v \in M} \frac{g_{ij}(v)}{g_{ij}} \qquad (6)$$

where $M$ is the node set, $g_{ij}(v)$ is the number of the shortest paths between node $i$ and node $j$ that contain node $v$, and $g_{ij}$ is the number of the shortest paths between node $i$ and node $j$.
The PR measure shows how close all other actors within a social network are to a special actor [50]. This metric takes into account the influence domain of an actor, which is the set of actors who are directly (or indirectly) connected to it [54]. In a directed graph, the PR of actor $v_i$ is computed as follows [50]:

$$PR(v_i) = \frac{|I_i| / (n-1)}{\sum_{v_j \in I_i} d(v_j, v_i) / |I_i|} \qquad (7)$$

where $n$ is the number of nodes (as actors in the directed graph), $|I_i|$ specifies the number of all actors who are in the influence domain of actor $v_i$, and $v_j$ is an actor in the influence domain of actor $v_i$. The distance between two nodes $v_j$ and $v_i$ is shown by $d(v_j, v_i)$.
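A small worked sketch of Eqs. (6)-(7), assuming the networkx library: BC is built in, while the proximity prestige is hand-rolled from its definition; the toy graph is an arbitrary example, not the graph of Fig. 2.

import networkx as nx

G = nx.DiGraph([(1, 2), (2, 3), (1, 3), (4, 2), (3, 4)])

# Eq. (6): raw (unnormalized) betweenness centrality.
bc = nx.betweenness_centrality(G, normalized=False)

def proximity_prestige(G, v):
    # Influence domain I_v: actors with a directed path *to* v.
    influence = [u for u in G if u != v and nx.has_path(G, u, v)]
    if not influence:
        return 0.0
    n = G.number_of_nodes()
    avg_dist = sum(nx.shortest_path_length(G, u, v) for u in influence) / len(influence)
    return (len(influence) / (n - 1)) / avg_dist   # Eq. (7)

pp = {v: proximity_prestige(G, v) for v in G}
print(bc, pp)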
As mentioned earlier, the OPF algorithm works on a complete weighted graph which is generated based on the samples and their distances in the dataset. It means that each sample (from the classification problem) is shown as a node in the complete weighted graph. The weighted arcs, which are defined by adjacency relations between samples, link all pairs of nodes in this graph. It seems that the OPF and the social networks arise from different contexts; however, both of them are based on the graph theory. In other words, a social network is also based on the graph theory wherein a user is shown as a node and the relation between each pair of users links the corresponding nodes in the graph. Hence, certain analysis methods in the social networks, which are extracted from the graph theory, can be used in the OPF. In this study, we were inspired by the application of centrality and prestige as two different measures which are used to quantify graph theoretical ideas about an individual actor’s prominence within social networks [49]. Notably, the centrality and the prestige draw on two prominence classes [49]; thus, they were selected for being used in the proposed method. The centrality represents the volume of activity wherein an actor with high centrality is an actor who has high involvement in many relations regardless of send/receive direction [49]. Moreover, the prestige represents the popularity in which an actor with high prestige is an actor who receives many directed ties, but initiates few relations [49]. It is noted that the BC and the PR (as two well-known measures of centrality and prestige, respectively) are used for selecting the most informative samples and pruning the training set of OPF classifier to speed up and improve the
performance of the OPF. The details of using the concept of social network in the OPF-based classification are discussed in Section 4.
4. Proposed Model
In this section, the proposed IDS model is introduced. This model is based on a modified OPF. As mentioned earlier, there are three main modules in the proposed model: a) partitioning; b) pruning; and c) detecting. In the partitioning module, the k-means clustering algorithm is used as an unsupervised learning algorithm for grouping all samples from the original training and evaluation sets into smaller groups that will be used in the detecting module as the training sets of the OPF classifiers. Before using the generated training and evaluation sets in the OPFs, the most informative samples are selected in the pruning module. In the pruning module, the BC and PR metrics, which were introduced in social network analysis, are used for identifying the influential samples and selecting the most informative ones from the training and evaluation sets. Notably, for each generated training or evaluation set, the pruning module is employed as a preprocessing stage for the OPF classifiers. The detecting module is used for anomaly detection and the classification of attacks. Different OPFs are projected in this module based on each of the pruned training and evaluation sets. In the classification phase of the OPF, all of the arcs that link a new sample to the training samples are considered. In this way, the distance between the new sample and the root (prototype) of each training sample in the OPTs becomes important. The proposed IDS framework is depicted in Fig. 1. According to Fig. 1, the following steps are followed in the proposed IDS framework:

Step 1 (Preprocessing)
The features in the NSL-KDD dataset have three different types: a) continuous; b) discrete; and c) symbolic. These features have significantly different ranges and various resolutions; therefore, most classifiers are not able to process the data in this format [55]. Hence, one of the main challenges of the datasets used in machine learning is data imbalance, and it is essential to normalize the value of each feature to avoid it. In this study, all the features of the dataset were normalized before the partitioning phase. The details of preprocessing are discussed in Section 5.1.

Step 2 (Partitioning)
In this step, the training set is partitioned into $k$ clusters by using the k-means clustering algorithm for reducing the load of the intrusion detection engine and also for addressing the problem of scalability in a large dataset. As seen in Fig. 1 (Partitioning stage, Part 2), when a new sample is to be classified, the engine should determine which group of training samples the new sample is associated with; then, the corresponding OPF will be selected for the classification process. As given in Eq. (8), the algorithm calculates the Euclidean distance between the new sample and the centroid of each cluster for finding the
best cluster for the new sample. It is noted that the best cluster is the one with the minimum Euclidean distance between its centroid and the new sample (as compared to the other centroids):

$$d(x, c_i) = \sqrt{\sum_{j=1}^{q} (x_j - c_{ij})^2} \qquad (8)$$

where $x = (x_1, \ldots, x_q)$ and $c_i = (c_{i1}, \ldots, c_{iq})$ are the new sample and the centroid of the $i$th cluster, respectively.

Step 3 (Pruning)
As discussed earlier, each training set, such as TRi, should be pruned in this step based on the BC or the PR metric (originally employed in social network analysis) for finding the most informative samples. Before adopting one of these two metrics for pruning the training set, their performance was compared based on the proposed pruning approach, and the better-performing metric was used in the proposed pruning phase. The details of this comparison are discussed in Section 5.2. Unlike the centrality measures (e.g., BC), which were originally defined for undirected graphs, the prestige measure is defined for directed graphs. Therefore, a directed graph should be extracted from the complete weighted graph to employ the PR in the proposed model. On the other hand, the original complete weighted graph (which is used in the OPF) can be used for computing the BC; however, the computation time of the BC in a weighted graph is of order $O(nm + n^2 \log n)$ [56], wherein $n$ is the number of nodes and $m$ is the number of edges. In this study, the training set consisted of 4,000 samples; so, the complete weighted graph (which was used in the OPF algorithm) was a graph with 4,000 nodes and 7,998,000 edges. It is obvious that the computation of the BC (or even the PR) on such a graph is a time-consuming process. In order to deal with this problem, we were inspired by the proposal of Kang et al. [57] for large-scale networks, which was grounded on a new graph called the directed line graph. To reduce the large number of edges, the extracted directed graph with a reasonable number of edges, which was used for computing the PR, was also employed for computing the BC. So, a directed graph was used in the proposed model for computing both the BC and PR metrics. As mentioned earlier, we should create a directed graph from the complete graph that was originally used in the OPF for employing these metrics in the proposed model. In the directed graph, the directed relations between each pair of nodes are based on the $m$ best neighbors. To do this, as shown in Fig. 2, for each node $v_i$ in the complete undirected graph, an ordered list of the $m$ best neighbor nodes will be selected based on the weights of the arcs (i.e., the nodes most similar to $v_i$, or those with the minimum distance to $v_i$). By using these ordered lists, the directions of the relations between the training samples can be identified for constructing the directed graph. Once the directed graph is constructed, the BC and PR are computed for each node in it (using Eqs. 6 and 7, respectively). For example, for the generated directed graph in Fig. 2, the BC and PR values of a node are calculated by applying Eqs. (6) and (7) to that graph (the resulting worked expressions are given as Eqs. 9 and 10, respectively).
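The directed-graph construction and the pruning rule described above can be sketched as follows; this is an illustrative Python rendering, with m and the kept percentage theta as assumed parameters, and the selection of the "worst" scores (interpreted here as the lowest ones) follows the next paragraph.

import networkx as nx
import numpy as np
from scipy.spatial.distance import cdist

def prune_by_centrality(X, m=5, theta=65):
    D = cdist(X, X)
    G = nx.DiGraph()
    G.add_nodes_from(range(len(X)))
    for i in range(len(X)):
        # Ordered list of the m best (closest) neighbors; position 0 is i itself.
        for j in np.argsort(D[i])[1:m + 1]:
            G.add_edge(i, int(j))
    scores = nx.betweenness_centrality(G)            # or proximity prestige
    order = sorted(scores, key=scores.get)           # worst (lowest) scores first
    keep = order[:max(1, int(len(order) * theta / 100))]
    return X[np.array(keep)]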
Our simulation results showed that in the OPF (unlike in social networks), some samples that have the worst BC or PR are actually the informative ones. Hence, when the BCs (or PRs) of all samples in the training and evaluation subsets are computed, $\theta$ percent of the samples will be selected as the most informative samples according to the worst BCs (or PRs), where $\theta$ denotes the pruning percentage. In this study, the genetic algorithm (GA) is employed for finding an appropriate value of $\theta$. Newton-like methods can also be used for this purpose (e.g., the conjugate gradient (CG) and the quasi-Newton (QN) methods); however, results obtained by the GA in previous research show that the error (fitness) average and standard deviation are about an order of magnitude smaller than those obtained by the CG and the QN methods (such as the one reported in [58]). It is worth noting that the computational cost of the GA is considerably higher than that of the CG and QN methods; nevertheless, the GA was employed with the aim of achieving better accuracy rates. The GA, which is based on the evolutionary ideas of genetics and natural selection, is an adaptive heuristic search algorithm for solving optimization problems. Each individual in the proposed GA population represents a possible solution for $\theta$. In the GA, a fitness function is used for evaluating the population. The fitness function used in this study is based on the proposed version of the OPF (called advanced OPF (AOPF)) that will be explained in Step 4 of the proposed framework. The following pseudo-code describes the proposed algorithm for obtaining the fitness value (Algorithm 1):

Algorithm 1. Proposed algorithm for obtaining the fitness value in finding the optimum value of $\theta$
Input: A binary string of size N as a chromosome (corresponding to $\theta$). TR and TS are the training set and test set, respectively.
Output: Fitness value R.
Steps:
(1) Set cnt as the count of "1" values in the input binary string;
(2) Set $\theta \leftarrow$ cnt;
(3) Set S as the betweenness centrality set (or the proximity prestige set), initially empty;
(4) Create a directed graph (DG) from the training samples;
(5) for each node $v_i$ in DG (corresponding to a training sample)
(6)   Compute the BC (or PR) of $v_i$ based on DG, and add the result to S;
(7) end
(8) Set S' to the $\theta$ percent of the worst values in S;
(9) Set the pruned TR (PTR) to the training samples corresponding to S';
(10) Create an OPF model from PTR by the OPF training algorithm;
(11) Perform the AOPF classification algorithm for classifying the new samples in TS by using the OPF model;
(12) Compute the DR and FAR based on the classified samples;
(13) Set R to the Euclidean distance between the points with coordinates (FAR, DR) and (0%, 100%);
(14) return R;
In Algorithm 1, TR and TS are two new subsets of the NSL-KDD dataset that are used as the training and test sets in the current simulation. More details about these sets and the GA parameter settings are discussed in Section 5.2. The GA fitness function is based on minimizing the Euclidean distance between a classification point $i$ with coordinates (FARi, DRi) and the perfect classification point with coordinates (0%, 100%) in a receiver operating characteristic (ROC) curve. The best point is the one whose Euclidean distance to the perfect point is the least as compared to the other points. The Euclidean distance between the perfect point in the ROC curve and the current point with coordinates (FAR, DR) is computed in Step 13 of Algorithm 1.
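A compact sketch of Algorithm 1's fitness computation; train_opf, classify_aopf, and detection_rates are hypothetical helpers standing in for the OPF training, the AOPF classification, and the DR/FAR bookkeeping, prune_by_centrality is the earlier pruning sketch, and TR/TS are assumed to expose .X and .y attributes.

import math

def fitness_theta(chromosome, TR, TS):
    theta = sum(chromosome)                            # Steps 1-2: integer in 0..100
    pruned = prune_by_centrality(TR.X, theta=theta)    # Steps 3-9
    model = train_opf(pruned)                          # Step 10 (hypothetical helper)
    preds = classify_aopf(model, TS.X)                 # Step 11 (hypothetical helper)
    dr, far = detection_rates(preds, TS.y)             # Step 12 (hypothetical helper)
    return math.hypot(far - 0.0, dr - 100.0)           # Step 13: distance to (0%, 100%)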
It is clear that the appropriate value of $\theta$ should be between 0 and 100 ($0 \leq \theta \leq 100$). Therefore, each individual in the GA, which represents a possible solution, is coded with a binary string vector as a chromosome with a size of 100.

Step 4 (Detecting)
In this step, an OPF model will be projected for each pruned training (or evaluation) set. As seen in Fig. 1, the learning algorithm, which works based on the classification error of the OPF model on the evaluation set, is used for a few iterations to improve the composition of samples in the training (or evaluation) set. To evaluate the performance of the OPF models on the evaluation sets, the following accuracy metric is calculated in each iteration $t$ (as mentioned in [19]):

$$Acc = 1 - \frac{\sum_{i=1}^{c} E(i)}{2c} \qquad (11)$$

where $c$ is the number of the target classes and $E(i)$ is the partial sum error of class $i$ that is defined as follows [19]:

$$E(i) = \frac{FP(i)}{|Z_2| - |Z_2(i)|} + \frac{FN(i)}{|Z_2(i)|}, \quad i = 1, \ldots, c \qquad (12)$$

where $|Z_2|$ is the size of $Z_2$ and $|Z_2(i)|$ is the number of samples in $Z_2$ that belong to class $i$. $FP(i)$ and $FN(i)$ are the false positive and false negative rates of class $i$ on $Z_2$, respectively.
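A minimal sketch of Eqs. (11)-(12) for an evaluation set with true labels y and predictions p over c classes; the guards against empty classes are an added safety assumption.

import numpy as np

def opf_accuracy(y, p, c):
    y, p = np.asarray(y), np.asarray(p)
    E = 0.0
    for i in range(c):
        n_i = np.sum(y == i)                           # |Z2(i)|
        fp = np.sum((p == i) & (y != i)) / max(len(y) - n_i, 1)
        fn = np.sum((p != i) & (y == i)) / max(n_i, 1)
        E += fp + fn                                   # Eq. (12)
    return 1.0 - E / (2 * c)                           # Eq. (11)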
After projecting the efficient OPF classifiers, the new samples in the test set can be classified. To classify a new unlabeled sample (in the learning and the classification phases of the OPF), all of the arcs that connect the unlabeled sample to the samples in the training set are considered (with the aim of
finding an optimum path from the prototypes to the unlabeled sample). However, in the proposed new version of the OPF, called AOPF, the distance between the unlabeled sample and the root (prototype) of each sample in the training set is also considered as an important factor for improving the performance of a traditional OPF (as shown in Fig. 3). We assume that the proposed approach may select more appropriate training samples as compared to the traditional approach. Hence, it can lead to finding the best optimum path from the prototypes to the unlabeled sample. As seen in Fig. 3, the cosine distance (CD) was empirically chosen instead of the Euclidean distance (ED) for computing the weight of the arcs in the complete weighted graph. The computation time of the ED is less than that of the CD for high-dimensional datasets (e.g., the NSL-KDD dataset); however, our experimental results showed that when we used the CD (as a distance measure for computing the weight of the arcs), the performance of the OPF was improved. Feature selection methods can be used for reducing the dimension, and consequently reducing the computation time of the CD, in high-dimensional datasets. Moreover, some features in the dataset may not be relevant to the classification task (e.g., due to noise); therefore, choosing relevant features (as a proper subset of all features) can improve the performance of classifiers such as the OPF. The CD is calculated using Eq. (13). As seen in Eq. (13), the cosine similarity is subtracted from 1 because, unlike the ED, the cosine similarity measures similarity instead of dissimilarity. The cosine similarity is originally used for finding the similarity of two feature vectors such as $u$ and $v$:

$$CD(u, v) = 1 - \frac{u \cdot v}{\|u\| \|v\|} = 1 - \frac{\sum_{i=1}^{q} u_i v_i}{\sqrt{\sum_{i=1}^{q} u_i^2} \sqrt{\sum_{i=1}^{q} v_i^2}} \qquad (13)$$
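Eq. (13) in code form (a two-line NumPy sketch; zero-norm vectors are assumed not to occur after normalization):

import numpy as np

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))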
In the proposed classification phase of the AOPF algorithm, the ED is used for measuring the distance between a new unlabeled sample and the root of samples in the OPF. As mentioned earlier, an optimum path $P^*(t)$ from $S^*$ to $t$ (as an unlabeled sample) should be found for classifying $t$, where $t$ is a part of the OPF. Therefore, the optimum path can be found incrementally by evaluating the optimum cost using Eq. (3). In the proposed AOPF, the ED between the root of each sample in the training set and the unlabeled sample (i.e., $ED(R(s), t)$) is also considered in the cost function (Eq. 14):

$$C(t) = \min_{s \in Z_1} \{ \max\{C(s), CD(s, t)\} + \alpha \cdot ED(R(s), t) \} \qquad (14)$$

where $\alpha$ is used as a weight which shows the impact of $ED(R(s), t)$. By using Eq. (14), the classification
algorithm of the OPF is changed as given in the following pseudo-code (Algorithm 2): Algorithm 2. AOPF classification algorithm
Input: Classifier [T, C, L, Ord], where T, C, L, and Ord are the optimum-path forest, cost map, label map, and ordered list, respectively; the test set $Z_3$ (or the evaluation set $Z_2$); and the pair (v, d) for feature vector and distance computations.
Output: Label map L' and predecessor map P' defined for $Z_3$.
Auxiliary: Cost variables tmp and mincost.
Steps:
(1) for each $t \in Z_3$
(2)   Set $i \leftarrow 1$ and $mincost \leftarrow \max\{C(k_i), CD(k_i, t)\} + \alpha \cdot ED(R(k_i), t)$;
(3)   Set $L'(t) \leftarrow L(k_i)$ and $P'(t) \leftarrow k_i$;
(4)   while $i < |Z_1|$ and $mincost > C(k_{i+1})$
(5)     Compute $tmp \leftarrow \max\{C(k_{i+1}), CD(k_{i+1}, t)\} + \alpha \cdot ED(R(k_{i+1}), t)$;
(6)     if $tmp < mincost$
(7)       Set $mincost \leftarrow tmp$;
(8)       Set $L'(t) \leftarrow L(k_{i+1})$ and $P'(t) \leftarrow k_{i+1}$;
(9)     end
(10)    Set $i \leftarrow i + 1$;
(11)  end
(12) end
(13) return [L', P']
In Algorithm 2, Ord consists of all training nodes $k_1, k_2, \ldots, k_{|Z_1|}$ in a non-decreasing order of the optimum path cost; $C(k_i)$ and $L(k_i)$ are the optimum path cost and the class label of the corresponding sample in $Z_1$, respectively. Moreover, $P'(t)$ and $L'(t)$ represent the parent (predecessor) and the label of each classified sample in the test set, respectively. Algorithm 2 is akin to the original classification algorithm [25]; however, Steps 2 and 5 are altered. As seen in Step 5 of Algorithm 2, a training sample (e.g., $s$) whose root (e.g., $R(s)$) is distant from the unlabeled sample $t$ (based on $ED(R(s), t)$) will be ignored by Step 6, because the long distance represents low similarity between $t$ and $R(s)$. It means that $s$'s OPT (wherein $R(s)$ is the OPT's root) may not be an appropriate OPT for $t$; in other words, the class of $R(s)$ may not be appropriate for $t$. As mentioned earlier, this approach can select more appropriate training samples as compared to the traditional approach in finding the optimum path from the prototypes to the unlabeled sample. It is noted that at most $|Z_1|$ nodes are tried through Steps 4 to 11 (as the inner loop of Algorithm 2) for each sample until an optimum path $\pi_t$ is found [25]. Notably, T, C, L, and Ord are the elements of the classifier model which were projected in the training phase of the OPF. For more details about these elements, refer to [25].
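For illustration, a Python rendering of Algorithm 2's loop; the model attributes (Ord, C, L, R, X) and the default alpha mirror the notation above but are assumptions about the data layout, not the authors' implementation, and cosine_distance is the earlier sketch.

import numpy as np

def aopf_classify(model, t, alpha=0.696):
    Ord, C, L, R, X = model.Ord, model.C, model.L, model.R, model.X
    k = Ord[0]
    # Step 2 (altered): modified cost with the root-distance penalty of Eq. (14).
    mincost = max(C[k], cosine_distance(X[k], t)) + alpha * np.linalg.norm(X[R[k]] - t)
    best = k
    i = 0
    while i < len(Ord) - 1 and mincost > C[Ord[i + 1]]:
        k = Ord[i + 1]
        # Step 5 (altered): same modified cost for the next candidate node.
        tmp = max(C[k], cosine_distance(X[k], t)) + alpha * np.linalg.norm(X[R[k]] - t)
        if tmp < mincost:
            mincost, best = tmp, k
        i += 1
    return L[best], best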
In this study, the GA is also used for finding an optimum value of the weight $\alpha$ in Eq. (14). The following pseudo-code describes the proposed algorithm for obtaining the fitness value (Algorithm 3):

Algorithm 3. Proposed algorithm for obtaining the fitness value in finding the optimum value of $\alpha$
Input: A binary string of size N as a chromosome (corresponding to $\alpha$). TR and TS are the training set and test set, respectively.
Output: Fitness value R.
Steps:
(1) Set cnt as the count of "1" values in the input binary string;
(2) Set $\alpha \leftarrow cnt \times 0.001$;
(3) Create an OPF model from the training set by the OPF training algorithm;
(4) Perform the AOPF classification algorithm for classifying the new samples in the test set;
(5) Compute the DR and FAR based on the classified samples;
(6) Set R to the Euclidean distance between the points with coordinates (FAR, DR) and (0%, 100%);
(7) return R;
In Algorithm 3, TR and TS are two new subsets of the NSL-KDD dataset that are used as the training and test sets in the current simulation. It is noteworthy that the values of the average ED and the average CD between the training samples and the test samples were 5.87 and 1.09, respectively. So, if $\alpha$ in Eq. (14) becomes large (e.g., $\alpha \geq 1$), then the CD becomes ineffective as compared to the ED. Therefore, to retain the effects of both the CD and the ED in Eq. (14), we have experimentally found that the appropriate value of $\alpha$ should be between 0 and 1 ($0 \leq \alpha \leq 1$). Therefore, each individual in the GA which represents a possible solution for $\alpha$ is coded with a binary string vector as a chromosome with a size of 1000. So, the value of each bit is 0.001, which means that the threshold value is reported up to three decimal places.
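A minimal sketch of the two chromosome decodings used by Algorithms 1 and 3:

def decode_theta(bits100):
    # Algorithm 1: a 100-bit chromosome; the count of ones gives theta in [0, 100].
    return sum(bits100)

def decode_alpha(bits1000):
    # Algorithm 3: a 1000-bit chromosome; each bit is worth 0.001, so alpha is in [0, 1].
    return sum(bits1000) * 0.001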
5. Simulation Results
In this section, the simulation results on the NSL-KDD dataset are presented to show the effectiveness of the proposed method in the intrusion detection field. Pereira et al. [25] compared the performance of the OPF classifier in IDSs with some well-known classifiers such as SVM, the SOM neural network, and a Bayesian classifier; so, we have mainly focused on comparisons between the proposed modified OPF and the traditional OPF models in this study. As mentioned in Section 4, the proposed method is composed of three modules (functional stages). These modules (i.e., partitioning, pruning, and detecting) were employed for improving the performance of a traditional OPF in IDSs. The partitioning module was used for addressing the scalability of large datasets, detecting low-frequency attacks, and speeding up the OPF. The pruning module was used for pruning the training and evaluation sets by identifying the most informative samples. Finally, the anomaly detection and the attack classification were performed by the proposed AOPF in the detecting module. Therefore, the AOPF is first compared with the traditional OPF in the experiments of this study. Then, to evaluate the performance of the partitioning and pruning modules in the proposed model, the results of applying AOPF+P (partitioning), AOPF+Pr (pruning), and AOPF+P+Pr (called MOPF) are compared with the traditional OPF. To evaluate the performance of the proposed algorithm, the OPF, AOPF, pruning, and partitioning modules were implemented in MATLAB R2014a on a PC with an Intel(R) Core(TM) i5-4460 CPU at 3.20 GHz and 8 GB of RAM. Notably, the k-means clustering algorithm (which was already implemented in MATLAB) was used in the partitioning module.
5.1. Intrusion Dataset
In this study, the NSL-KDD dataset [21] was used for evaluating the proposed IDS. This dataset is a new version of KDD 99 that consists of selected records of the complete original KDD dataset. The original KDD dataset has some problems, such as redundant and duplicated records [59], which cause negative effects on the evaluation results (when it is used as an evaluation dataset). The NSL-KDD consists of 41 continuous, discrete, and symbolic features in addition to one target class (similar to KDD). The NSL-KDD features can be classified into three main categories: a) basic features; b) content features; and c) traffic features (including time-based traffic and host-based traffic features) [59]. The basic features encapsulate all the attributes extracted from the packet headers of a TCP/IP connection without considering the payload [59, 60]. The low-frequency attacks do not have frequent sequential intrusion patterns and are embedded in the portions of a packet. Hence, the content features, which use domain knowledge to assess the payload of the original TCP packets, are employed for identifying these attacks [60, 61]. On the other hand, the time-based traffic features are designed to capture properties that mature over a two-second temporal window, and the host-based traffic features use a historical window estimated over the number of connections instead of time; thus, they are designed to assess attacks that span intervals longer than 2 seconds [61]. The general description of the NSL-KDD dataset is given in Table 1.

The NSL-KDD dataset contains normal and attack data. The attack data is categorized into four types of attacks: a) denial of service (DoS); b) probe; c) remote to local (R2L); and d) user to root (U2R). The DoS attacks make a service too busy, denying legitimate users the use of the service [60]. The probe attacks are efforts to collect information about a target host or a network of computers, which can expose it to security threats by circumventing its security controls [60, 61]. R2L represents attackers who try to gain access to the victim machine without having any account on it; on the contrary, U2R represents attacks that have local access to the victim machine and try to gain super-user privileges [60].

As mentioned earlier, preprocessing is required to address the data imbalance. In the preprocessing step, data normalization is performed to scale the value of each continuous attribute into a suitable range [55]. As shown in Fig. 1, all of the features in the NSL-KDD dataset were normalized. Normalization, which is also called feature scaling, can solve the issue of data imbalance by eliminating the effect of scale differences [62-64]. In other words, with the aim of preventing features with large numerical values from dominating other features [65, 66], normalization attempts to weight all features equally, which retains the effect of all features in the classification algorithms. Generally, the supervised OPF is a graph-based classifier that uses the Euclidean distance (or cosine distance) between each pair of nodes (as the weight of the arcs) in most of its operations (e.g., partitioning the complete weighted graph into OPTs in the training
phase). However, in the calculation of the Euclidean distance, the distance can be governed by features which have a broad range of values; therefore, normalization was performed in this study by scaling the numeric features with respect to their mean and standard deviation. For example, the dst_bytes feature, which ranges from 0 to 1,309,937,401, may overwhelm the same_srv_rate feature, which ranges from 0 to 1 [55]. As seen in Table 1, there are three categorical features in the NSL-KDD dataset: a) protocol_type (with 3 different symbols); b) service (with 70 different symbols); and c) flag (with 11 different symbols). In the preprocessing step and before data normalization, these symbolic-valued features should be mapped to numeric values or vectors. Two approaches have been used in previous studies to convert categorical variables in the KDD Cup 99 and NSL-KDD datasets:

1- Dummy coding: Replacing a categorical value by a value in $\mathbb{R}^M$ using a function $e$ that maps the $j$th value of the feature to the $j$th component of an $M$-dimensional vector [67]:

$$e(j) = [0, \ldots, 1, \ldots, 0] \qquad (15)$$

in which 1 is placed at position $j$. For example, the protocol_type feature is replaced by [1,0,0] for TCP, [0,1,0] for UDP, and [0,0,1] for ICMP in this approach [68, 69].

2- Label coding: Replacing categorical values by numeric (integer) values ranging from 1 to $M$, where $M$ is the number of symbols. For example, the protocol_type feature is replaced by 1 for TCP, 2 for UDP, and 3 for ICMP in this approach [70, 71].

It is noted that the symbolic features were even removed from the NSL-KDD dataset in some studies, and only the numeric features were employed [55, 72]. The value of the numeric features was scaled by the statistical normalization as given below [55, 67]:

$$x_i' = \frac{x_i - \mu}{\sigma} \qquad (16)$$

where $x_i$ is the value of the $i$th instance in the dataset for the given feature, $N$ is the number of instances in the dataset, and $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$ and $\sigma$ are the mean and the standard deviation of the given feature, respectively.
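For illustration, a pandas sketch of this preprocessing (dummy coding per Eq. (15) and statistical normalization per Eq. (16)); the column names follow the NSL-KDD feature names, and the constant-column guard is an added assumption.

import pandas as pd

def preprocess(df):
    # Dummy-code the three symbolic features (Eq. 15).
    df = pd.get_dummies(df, columns=["protocol_type", "service", "flag"])
    # Z-score the numeric features (Eq. 16); constant columns get std 1
    # so they do not produce NaNs (an added guard).
    numeric = df.select_dtypes("number").columns
    std = df[numeric].std().replace(0, 1)
    df[numeric] = (df[numeric] - df[numeric].mean()) / std
    return df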
The total number of instances in the NSL-KDD dataset is 148,517, based on the full NSL-KDD training set (with 125,973 instances) and test set (with 22,544 instances) [21], together with predicted labels which show the difficulty level of the instances in the dataset [59]. Notably, the difficulty level of an instance shows the degree of difficulty of correctly predicting its label when different well-known classifiers (e.g., the J48 decision tree and SVM) are used for the classification of unlabeled instances. Higher values of the difficulty level show a lower degree of difficulty in correct prediction, and vice versa. For more details, refer to [59]. In this study, 7,000 and 3,000 instances were randomly selected from the NSL-KDD as the training and test
datasets, respectively. Table 2 lists the number of training and test instances. It is worthy to note that in all experiments, 4,000 initial training samples were used as the training set and the remaining 3,000 samples were used as the evaluation set. Notably, the distribution of data types in both training and test sets was almost same as the NSL-KDD dataset. It means that the distribution of each data type (i.e., Normal, DoS, Probe, U2R, and R2L) and even the difficulty level of instances were considered in their random selection. The details of the comparison between the NSL-KDD and our sub-datasets on NSL-KDD (training and test sets) for different ranges of difficulty level are given in Table 3. 5.2. Description and Parameters Setting As discussed earlier, the proposed MOPF model was composed of several stages. The parameters setting in different stages are reported in Table 4. As seen in Table 4, the maximum iteration number of the learning algorithm in all versions of the OPF was set to 5 (
. The reason is based on the experiments of Papa et al. [19] about adequacy of
repeating learning procedure for a few iterations. The number of clusters in the k-means clustering algorithm was set to 4. To select the best value of k in the k-means clustering algorithm, this algorithm was iterated for different values of k and after each execution; the DB index was computed for each k. As shown in Fig. 4, the value of DB index was minimized when
; therefore, the best value of k was
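A minimal sketch of this selection loop, assuming scikit-learn's k-means and Davies-Bouldin implementations in place of the original MATLAB code:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def select_k_by_db_index(X: np.ndarray, k_range=range(2, 11)) -> int:
    """Run k-means for each candidate k and keep the k minimizing the DB index."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = davies_bouldin_score(X, labels)  # lower is better
    return min(scores, key=scores.get)               # k = 4 in our experiments (Fig. 4)
```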
As reported in Table 4, the optimum values of the pruning percentage and the detection threshold were 65% and 0.696, respectively. These settings were used as thresholds in the pruning and detecting modules, respectively. It is noted that these parameters were obtained by the GA, and Algorithms 1 and 3 were run for obtaining the fitness values. As mentioned earlier, two new subsets of the NSL-KDD dataset were used as the training set and the test set in Algorithms 1 and 3. For this purpose, 1,500 and 750 instances were randomly selected from the NSL-KDD dataset as the new training set (TR) and test set (TS), respectively. These sets were used in Algorithms 1 and 3 alike. The numbers of TR and TS instances are given in Table 5. The GA, which is already implemented in MATLAB, was used for finding the optimum values of these two parameters. The population size and the number of generations of both implementations of the GA (which use Algorithms 1 and 3 as their fitness functions) were set to 5 and 10 in our simulations, respectively. However, simple nonlinear classic optimization algorithms can also be used for obtaining the optimum values of these parameters more rapidly (as compared to the GA, and as an offline computation in the proposed MOPF
model). For example, we employed the fminbnd function, which is already implemented in MATLAB, for obtaining these values and also for evaluating the reliability of the GA results. The fminbnd function, which combines the golden-section search and parabolic interpolation, implements a constrained optimization algorithm that finds the minimum of a single-variable function (as a one-dimensional problem) on a fixed interval. The optimum values of the pruning percentage and the detection threshold obtained when using the GA and the fminbnd function are reported in Table 6; the execution times of these optimization algorithms are given in the same table. As shown in Table 6, the optimizations by the GA are more time-consuming than those by the fminbnd function; however, the results achieved by the fminbnd function are consistent with those obtained by the GA. So, similar performance could be obtained if this function were employed instead of the GA.
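A sketch of the same bounded one-dimensional search in Python, using scipy's minimize_scalar (the analogue of fminbnd); the evaluate_threshold fitness below is a hypothetical stand-in for Algorithms 1 and 3, with a toy surrogate objective so that the snippet runs on its own:

```python
from scipy.optimize import minimize_scalar

def evaluate_threshold(theta: float) -> float:
    # Hypothetical fitness standing in for Algorithms 1 and 3: the real
    # pipeline would train/evaluate the model at threshold `theta` and
    # return a loss such as (1 - DR) + FAR.  A toy convex surrogate is
    # used here so the sketch is self-contained.
    return (theta - 0.696) ** 2

# Bounded minimization of a single-variable function on a fixed interval,
# combining golden-section search and parabolic interpolation, as fminbnd does.
result = minimize_scalar(evaluate_threshold, bounds=(0.0, 1.0), method="bounded")
print(f"optimum threshold: {result.x:.3f}")  # ~0.696 for the toy surrogate
```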
For computing the BC or the PR in the pruning stage, a directed graph should be constructed from the input samples. In constructing the directed graph, the m best neighbor nodes were selected based on the weights of the arcs in the complete weighted graph. For selecting the best value of m, we were inspired by the neighborhood selection problem in recommendation systems and by the k-NN algorithm (which is used for classification and regression). Generally, two approaches have been used in recommendation systems for selecting the best neighbors of an active user: a) selecting a fixed number of the best neighbors, and b) selecting the best neighbors based on a threshold on the similarity weights. Herlocker et al. [73] demonstrated that the first approach, which uses a fixed number of neighbors (i.e., 20 up to 60), performs better than the second approach. On the other hand, an appropriate value for m in the k-NN algorithm is obtained by calculating the square root of the number of instances in the training set (an empirical rule of thumb popularized by Duda et al. [74]). In the proposed method, the partitioning module ramifies the initial heterogeneous training samples into four different homogeneous training subsets, and the number of samples in each cluster is different; so, a different value of m should be considered for each cluster. According to the mentioned approaches, the best value of m is assumed to be ⌊√n⌋ in the proposed method, where n is the number of training samples. If ⌊√n⌋ is not between 20 and 60 (based on the size of the training samples), we select m = 20 or m = 60 where ⌊√n⌋ < 20 or ⌊√n⌋ > 60, respectively.
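The following sketch illustrates this pruning construction under our own reading of the details: directed arcs from each sample to its m nearest neighbors, centrality scores computed with networkx, and retention of the samples with the worst (here interpreted as lowest) BC values; none of the library choices are the authors'. The default keep_fraction of 0.65 mirrors the pruning percentage selected by the GA (Table 4).

```python
import math
import numpy as np
import networkx as nx
from scipy.spatial.distance import cdist

def prune_by_worst_bc(X: np.ndarray, keep_fraction: float = 0.65) -> np.ndarray:
    """Keep the `keep_fraction` of samples with the lowest betweenness
    centrality ('worst BC' read as lowest; see Figs. 5 and 6)."""
    n = len(X)
    m = min(max(math.isqrt(n), 20), 60)           # m = floor(sqrt(n)) clipped to [20, 60]
    dist = cdist(X, X)                            # arc weights of the complete graph
    graph = nx.DiGraph()
    graph.add_nodes_from(range(n))
    for i in range(n):
        for j in np.argsort(dist[i])[1:m + 1]:    # m nearest neighbors, skipping self
            graph.add_edge(i, int(j), weight=float(dist[i, j]))
    bc = nx.betweenness_centrality(graph, weight="weight")
    # bc = nx.pagerank(graph, weight="weight")    # the PR-based alternative
    keep = sorted(bc, key=bc.get)[: int(keep_fraction * n)]
    return np.array(sorted(keep))
```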
To test the validity of this assumption, a sample experiment was performed on AOPF+Pr based on three-fold cross-validation. The experimental results of this simulation are given in Table 7. As mentioned earlier, the TR set (which was used with the GA and the fminbnd function) was also used in this experiment. Three-fold cross-validation was applied for computing the DR and FAR of the AOPF+Pr method for each value of m used in constructing the directed graph. That is, the 1,500 samples of the TR set were divided into three equal subsets (of 500 samples each). In each validation, two subsets (1,000 samples) were used for training the OPF and the remaining subset was used for testing the generated OPF. Finally, the average performance over these validations was reported in terms of DR and FAR. As given in Table 7, the performance of AOPF+Pr is acceptable in terms of DR and FAR when m = ⌊√1000⌋ = 31 (the proposed number of neighbors). As seen in this table, for m = 20 (the best number of neighbors in terms of achieved DR), the DR is 95.74%, which is a 0.21 percent improvement over the DR when m equals 31. However, as given in Table 7, the FAR when m equals 31 is 0.45 percent better than the FAR when m equals 20.

To evaluate the effectiveness of the BC and PR measures in the pruning stage, they were compared in terms of the Euclidean distance between the classification point with coordinates (FAR, DR) and the perfect classification point in the ROC space, where a lower distance indicates better classification performance. The experimental results of these performance evaluations are shown in Figs. 5 and 6. Figure 5 shows the performance comparison of four classifiers (based on the AOPF+Pr model) using different pruning strategies in terms of Euclidean distance over 10 runs. It is noted that in each execution, a different dataset consisting of 1,000 samples was used as the training set and 500 samples as the test set (randomly selected from the original NSL-KDD dataset). At first, each pruning module computed the BC and the PR of each training sample. Then, the 50% of the samples with the worst BC values were selected as the training samples of the first simulated AOPF shown in Fig. 5, and the 50% of the samples with the best BC values were selected as the training samples of the second simulated AOPF shown in Fig. 5. Similarly, this scenario was repeated for the samples with the worst and the best values of PR for the third and fourth simulated AOPFs shown in Fig. 5, respectively. As seen in Fig. 5, according to both the BC and PR metrics, selecting the samples with the worst BC (or PR) values led to better performance than selecting the samples with the best BC (or PR) values. For a more accurate comparison, the average Euclidean distance over the 10 executions (for each type of pruning depicted in Fig. 5) is shown in Fig. 6. As seen in Fig. 6, AOPF+Pr achieves the best classification performance when the worst samples in terms of BC value are used. Therefore, it is concluded that in OPF machine learning, unlike in social network analysis, the samples with the worst BC (or PR) are the most informative samples for our classification problem. Furthermore, as seen in Fig. 5, using the BC measure for identifying the most informative samples is better than using the PR measure. Hence, the BC measure was used as the metric for finding the most informative samples in the pruning module of the proposed model. Notably, these experiments were conducted for comparing the BC and PR metrics (by employing the 50% of samples with the worst (or the best) BC (or PR) values as the training samples of the OPF). If more than 50% of the samples had been selected in these experiments, the classification results could have been improved; however, the outcome of the comparisons would likely not change even with a higher percentage of selected samples.

To investigate the effect of different coding methods for the categorical features on the performance of the proposed model, another experiment was conducted in this study. In this experiment, the performance of the AOPF+Pr model was evaluated when either the dummy or the label coding method was used for transforming the categorical features. It is noted that with dummy coding, each connection record of the NSL-KDD dataset is represented by 122 input coordinates instead of the 41 input coordinates of label coding. These 122 coordinates consist of the following values: a) 32 continuous values; b) 6 discrete values (0 or 1); c) 3 different values for the protocol_type feature; d) 70 different values for the service feature; and e) 11 different values for the flag feature. Three-fold cross-validation was used for determining the DR and FAR metrics in this experiment. For this purpose, 1,500 new samples were selected from the NSL-KDD dataset. This set was divided into three equal subsets; in each validation, two subsets were used for training the model and the remaining subset was used for testing the generated model. The average performance over these validations is reported in terms of DR and FAR in Table 8. As seen in Table 8, the performance is approximately similar when using the dummy and label coding methods for the categorical features. So, the experiments in this study were performed using label coding for the categorical features, which reduces the size of the input feature vector (i.e., 41 coordinates instead of 122).
5.3. Evaluation Metrics

As mentioned earlier, some standard metrics, such as DR and FAR, were used for evaluating the performance of the proposed IDS. It is noted that DR is the ratio between the number of detected attacks that are classified correctly and the total number of attacks, while FAR is the ratio between the number of normal connections that are classified as attacks and the total number of normal connections [61]. In addition to these metrics, the accuracy metric (based on Eq. (11), introduced by Papa et al. [19]) and the cost per example (CPE) [75] were also used for evaluating the performance of the proposed algorithm. CPE is defined as follows:

$CPE = \dfrac{1}{T} \sum_{i=1}^{m} \sum_{j=1}^{m} CM(i, j) \times C(i, j)$,     (17)
where T and m denote the total number of samples in the test set and the number of classes in the classification problem, respectively. CM is a square matrix, called the confusion matrix, in which the rows represent the actual classes while the columns correspond to the predicted classes. Each cell, such as CM(i, j), represents the number of samples that originally belong to class i but are classified into class j. The cost penalties for these misclassifications are represented by the cost matrix C, which is presented in Table 9 for the NSL-KDD dataset [75].
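For concreteness, the sketch below computes DR, FAR, and CPE from a confusion matrix laid out as in Table 12 (rows actual, columns predicted, index 0 = Normal); applied to the OPF entries of Tables 9 and 12, it reproduces the reported values.

```python
import numpy as np

def ids_metrics(cm: np.ndarray, cost: np.ndarray):
    """DR, FAR, and CPE (Eq. (17)) from a confusion matrix whose rows are
    actual classes and columns are predicted classes, with index 0 = Normal."""
    dr = cm[1:, 1:].sum() / cm[1:, :].sum()  # attacks classified as any attack type
    far = cm[0, 1:].sum() / cm[0, :].sum()   # normal connections flagged as attacks
    cpe = (cm * cost).sum() / cm.sum()       # Eq. (17): cm.sum() equals T
    return dr, far, cpe

# OPF results from Table 12 (rows: Normal, DoS, Probe, R2L, U2R) and the
# Table 9 cost matrix; this yields DR ~ 90.3%, FAR ~ 4.19%, CPE ~ 0.1950.
cm = np.array([[1462, 23, 11, 26, 4],
               [29, 994, 4, 1, 0],
               [33, 23, 225, 2, 1],
               [69, 4, 1, 33, 2],
               [12, 19, 3, 3, 15]])
cost = np.array([[0, 2, 1, 2, 2],
                 [2, 0, 1, 2, 2],
                 [1, 2, 0, 2, 2],
                 [3, 2, 2, 0, 2],
                 [4, 2, 2, 2, 0]])
print(ids_metrics(cm, cost))
```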
5.4. Simulation Results and Discussion

In this section, the performance of the proposed method is compared to that of the traditional OPF in terms of the evaluation metrics introduced in Section 5.3. As mentioned earlier, the proposed MOPF was composed of independent modules: partitioning, pruning, and detecting. To evaluate the performance of the proposed model, four variants of modification (i.e., AOPF, AOPF+P, AOPF+Pr, and MOPF (AOPF+P+Pr)) are compared with the traditional OPF. The details of the simulation results for the investigated models are reported in Tables 10, 11, and 12 for evaluating the IDS classification performance.

The accuracy rates of the investigated models and their training and execution times are reported in Table 10. It is noted that the computation of the BC measure for each training sample is time-consuming; so, for improving the speed of MOPF and AOPF+Pr (which use the value of each sample's BC in their pruning modules), the BC calculation was performed offline. Moreover, the k-means clustering, which is run in the partitioning modules of MOPF and AOPF+P, was also performed offline. The GA was run offline and its execution time is not included in Table 10 either. As seen in Table 10, the accuracy rate is improved considerably when using the MOPF algorithm as compared to the traditional OPF. However, an accuracy improvement is also achieved by each of the proposed modules individually (as in AOPF+P or AOPF+Pr). The modification of the optimum-path cost function in the classification phase of AOPF leads to a high computational overhead and, consequently, a high testing time. On the other hand, the partitioning and pruning modules reduce the training time significantly. This significant improvement is due to the computation time of the MST. As mentioned earlier, one of the key operations in the training phase of the OPF algorithm is the identification of the optimum set of prototypes. This set is obtained from the closest nodes of the MST that have different labels. Therefore, an MST should be computed on the complete weighted graph (generated from the training samples) for finding the prototypes in the training set. In this study, Prim's algorithm (as already implemented in MATLAB) was used to compute the MST in the training phase. The time complexity of this algorithm is O(m log n) for a binary-heap implementation, where n and m are the number of nodes and edges, respectively.
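A minimal sketch of this prototype-finding step, using scipy's MST routine in place of the MATLAB Prim implementation (our own simplified reading of the OPF training procedure):

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import minimum_spanning_tree

def find_prototypes(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Indices of OPF prototypes: the endpoints of MST edges whose two
    endpoints carry different class labels (the 'closest nodes of the MST')."""
    dist = cdist(X, X)                         # complete weighted graph on the samples
    mst = minimum_spanning_tree(dist).tocoo()  # note: scipy treats zero distances as absent edges
    prototypes = set()
    for i, j in zip(mst.row, mst.col):
        if y[i] != y[j]:                       # MST edge joining two different classes
            prototypes.update((int(i), int(j)))
    return np.array(sorted(prototypes))
```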
As noted earlier, the initial training set consisted of 7,000 samples in this study, of which 3,000 samples were used as the evaluation set and the remaining 4,000 samples were used for training the OPF. Hence, the complete weighted graph generated from these training samples has 4,000 nodes and 7,998,000 edges. Indubitably, the computational time of the MST for this graph will be high. By using the proposed partitioning module, the original training set was partitioned into four clusters. Subsequently, an OPF was generated for each cluster by means of a lighter graph (in terms of the number of nodes and edges) based on the cluster's training subset. This strategy reduces the computational time of the MST and consequently reduces the training time of the OPF algorithm. Notably, this scenario is repeated for the pruning module. As shown in Table 10, the traditional OPF is the fastest model in terms of testing time; however, if the whole execution time (i.e., training plus testing time) is considered, the MOPF is faster than the traditional OPF (about a 35% reduction in the execution time). This shows the superior
computational efficiency of the proposed MOPF in terms of the time of classifier training and testing (i.e., detection). As seen in Table 11, the DR is improved when using the AOPF and AOPF+Pr models as compared to the other models. However, the FAR of these models is high (and not acceptable) as compared to the AOPF+P and MOPF models. The MOPF model consists of the AOPF, partitioning, and pruning modules; therefore, we can conclude that the proposed MOPF achieves the best performance among the investigated methods when considering both DR and FAR. The confusion matrices and CPE of the five investigated models are reported in Table 12. Notably, the proposed MOPF model offers the best performance in terms of CPE as compared to the traditional OPF and the other three investigated models.

One of the main advantages of the proposed MOPF method is its superior performance against low-frequency attacks such as U2R and R2L. The DR for the different attack types, when using the proposed MOPF or the traditional OPF, is shown in Fig. 7. As seen in Fig. 7, the proposed MOPF model offers a better DR than the traditional OPF for all attack types. In addition, the FAR for the different attack types, when using the proposed MOPF or the traditional OPF, is shown in Fig. 8. As seen in Fig. 8, the proposed MOPF model offers a better FAR for the DoS, probe, and U2R attacks than the traditional OPF. As seen in Figs. 7 and 8, the FAR for the R2L attack when using the traditional OPF is slightly better than that of the proposed MOPF model. However, the DR of the U2R and R2L attacks, and also the FAR of the U2R attack, are considerably improved when using the proposed MOPF model.

Here, in addition to the comparisons between the proposed MOPF and the traditional OPF, the performance of the proposed MOPF was compared with the following classifiers: a) SVM; b) naïve Bayes (NB); and c) classification and regression tree (CART). These classifiers are already implemented in MATLAB. Notably, the DR and FAR of the mentioned classifiers were obtained by training and testing them on the dataset that was used for the proposed MOPF and detailed in Table 2. The results of this comparison are reported in Table 13. As seen in Table 13, the proposed MOPF model offers better DR and FAR compared to SVM and NB. On the other hand, the DR of CART is slightly better than that of MOPF (by 0.95 percent). However, the FAR of the proposed MOPF is 1.31 percent better than that of CART.
6. Conclusion and Future Work

In this study, a new intrusion detection approach, called MOPF, was introduced. The proposed model consisted of three main modules: a) partitioning; b) pruning; and c) detecting. For addressing the scalability problem of large datasets and reducing the complexity of the original training set, a partitioning module was used as the first stage of the proposed model to improve the performance of the IDS, especially in
detecting low-frequency attacks. This module used the k-means clustering algorithm for grouping the training set into k clusters used as training subsets. To determine k in the k-means clustering algorithm, the DB index was used as an apt validity measure for evaluating the intra-cluster similarity and inter-cluster differences, because of its low computational load as compared to the Silhouette and Dunn indices. Each subset was used as a training set for the detecting module with the aim of projecting an OPF model. Moreover, the pruning module, which was based on the centrality and prestige measures of social network analysis, was used for pruning the training set by identifying the most informative samples. It is worth noting that both the OPF and social networks are based on graph theory; so, certain analysis methods for social networks that derive from graph theory can be used in the OPF. The speed of the training phase and the accuracy of the traditional OPF were improved considerably by employing the proposed pruning stage. To determine the percentage of samples selected as the most informative ones, the GA was used. Detecting anomalous traffic and classifying the detected intrusions were performed in the detecting module, which was based on an advanced OPF model. In a traditional OPF, all the arcs that connect an unlabeled sample to the samples in the training set are considered for classifying the unlabeled sample and finding an optimum path from the prototypes to it. In the proposed advanced OPF model, however, the distance between the unlabeled sample and the root (prototype) of each sample in the training set (i.e., each node in the OPF model) was also considered as an important factor, with the aim of fine-tuning the classification algorithm.

The accuracy rate (defined in Eq. 11), DR, FAR, and CPE were used for evaluating the performance of the proposed model. For this purpose, four variants of modification were applied to the traditional OPF as follows: a) AOPF (detailed in Step 4 of Section 4); b) AOPF+Partitioning module (AOPF+P); c) AOPF+Pruning module (AOPF+Pr); and d) MOPF (AOPF+P+Pr). Different experiments on the NSL-KDD dataset showed the effectiveness of each of the proposed modules in terms of the evaluation metrics and training time as compared to the traditional OPF. For example, the accuracy rate of the proposed MOPF was improved by 14.86 percent as compared to the traditional OPF. The training time of MOPF was about 6.9 times less than that of the OPF, and the total time of training and testing was about 1.5 times lower than that of the traditional OPF. The DR and FAR of the proposed MOPF model were improved by 5.9% and 2.75%, respectively, as compared to the traditional OPF. Notably, the DR of low-frequency attacks, such as U2R and R2L, was improved considerably by using the proposed MOPF model. In addition, the CPE of the proposed MOPF model was about 2.6 times less than that of the traditional OPF.

Intrusion detection in dynamic environments (e.g., wireless sensor networks and cloud computing platforms) requires an adaptive real-time system whose parameters change with the conditions of the
environment. For example, IDSs that use machine learning approaches for identifying malicious behaviors should be able to take in new information about anomalies in order to deal with new intrusions (such as HTTP-based or XML-based attacks [76]). These attacks are different from the DoS attacks considered in the NSL-KDD dataset (e.g., smurf, processtable, and mailbomb [21]). In sum, the machine learning model needs to be retrained over the life of the IDS [77, 78]. Incremental learning is a high-performing technique that has been introduced to meet these retraining requirements. On the other hand, the speed of machine learning is a vital issue in retraining. Since the proposed MOPF trains much faster than the traditional OPF, it can be considered a suitable classifier for dealing with dynamic environments. Modern IDSs should also handle big data and rapid changes in data. As future work, promising approaches to these issues are as follows: a) ensemble-based data mining; b) distributed implementations; c) collaborative intrusion detection; and d) designing real-time intrusion response systems [79, 80].
7. References

[1] V. Das, V. Pathak, S. Sharma, R. Sreevathsan, M. Srikanth, G. Kumar, Network intrusion detection system based on machine learning algorithms, International Journal of Computer Science & Information Technology 2 (2010) 138-151.
[2] S. Wu, E. Yen, Data mining-based intrusion detectors, Expert Systems with Applications 36 (2009) 5605-5612. DOI: 10.1016/j.eswa.2008.06.138
[3] N. Stakhanova, S. Basu, J. Wong, On the symbiosis of specification-based and anomaly-based detection, Computers & Security 29 (2010) 253-268. DOI: 10.1016/j.cose.2009.08.007
[4] V. Golmah, An efficient hybrid intrusion detection system based on C5.0 and SVM, International Journal of Database Theory and Application 7 (2014) 59-70. DOI: 10.14257/ijdta.2014.7.2.06
[5] G. Kim, S. Lee, S. Kim, A novel hybrid intrusion detection method integrating anomaly detection with
misuse detection, Expert Systems with Applications 41 (2014) 1690-1700. DOI: 10.1016/j.eswa.2013.08.066
[6] M. Sheikhan, Artificial neural network models for intrusion detection, In: Encyclopedia of Information Assurance, Taylor & Francis, New York (2014) 1-12. DOI: 10.1081/E-EIA-120051983
[7] S.X. Wu, W. Banzhaf, The use of computational intelligence in intrusion detection systems: A review, Applied Soft Computing 10 (2010) 1-35. DOI: 10.1016/j.asoc.2009.06.019
[8] Y. Li, L. Guo, An active learning based TCM-KNN algorithm for supervised network intrusion detection, Computers & Security 26 (2007) 459-467. DOI: 10.1016/j.cose.2007.10.002
[9] KDD cup 99 Intrusion detection data set, (Available on http://kdd.ics.uci.edu/databases/kddcup99), [Accessed on 20 Feb. 2015]
[10] Y. Yi, J. Wu, W. Xu, Incremental SVM based on reserved set for network intrusion detection, Expert Systems with Applications 38 (2011) 7698-7707. DOI: 10.1016/j.eswa.2010.12.141 [11] X. Ru, Z. Liu, Z. Huang, W. Jiang, Normalized residual-based constant false-alarm rate outlier detection, Pattern Recognition Letters 69 (2016) 1-7. DOI: 10.1016/j.patrec.2015.10.002 [12] S. Rajasegarar, A. Gluhak, M.A. Imran, M. Nati, M. Moshtaghi, C. Leckie, M. Palaniswami, Ellipsoidal neighbourhood outlier factor for distributed anomaly detection in resource constrained networks, Pattern Recognition 47 (2014) 2867-2879. DOI: 10.1016/j.patcog.2014.04.006 [13] Z. Xue, Y. Shang, A. Feng, Semi-supervised outlier detection based on fuzzy rough C-means clustering,
Mathematics and Computers in Simulation 80 (2010) 1911-1921. DOI: 10.1016/j.matcom.2010.02.007
[14] A. Daneshpazhouh, A. Sami, Entropy-based outlier detection using semi-supervised approach with few positive examples, Pattern Recognition Letters 49 (2014) 77-84. DOI: 10.1016/j.patrec.2014.06.012
[15] D.Y. Yeung, Y. Ding, Host-based intrusion detection using dynamic and static behavioral models, Pattern Recognition 36 (2003) 229-243. DOI: 10.1016/S0031-3203(02)00026-2
[16] C. Xiang, P.C. Yong, L.S. Meng, Design of multiple-level hybrid classifier for intrusion detection system using Bayesian clustering and decision trees, Pattern Recognition Letters 29 (2008) 918-924. DOI: 10.1016/j.patrec.2008.01.008
[17] G. Wang, J. Hao, J. Ma, L. Huang, A new approach to intrusion detection using artificial neural networks and fuzzy clustering, Expert Systems with Applications 37 (2010) 6225-6232. DOI: 10.1016/j.eswa.2010.02.102
[18] A.A. Aburomman, M.B.I. Reaz, A novel SVM-kNN-PSO ensemble method for intrusion detection system, Applied Soft Computing 38 (2016) 360-372. DOI: 10.1016/j.asoc.2015.10.011
[19] J.P. Papa, A.X. Falcão, C.T.N. Suzuki, Supervised pattern classification based on optimum-path forest, International Journal of Imaging Systems and Technology 19 (2009) 120-131.
[20] W.P. Amorim, M.H. Carvalho, Supervised learning using local analysis in an optimal-path forest, In: Proceedings of the 25th Conference on Graphics, Patterns and Images, Ouro Preto, Brazil (2012) 330-335. DOI: 10.1109/SIBGRAPI.2012.53
[21] M. Tavallaee, E. Bagheri, L. Wei, A. Ghorbani, NSL-KDD Data Set (Available on http://nsl.cs.unb.ca/NSL-KDD), [Accessed on 20 Feb. 2015]
[22] S.Y. Jiang, X. Song, H. Wang, J.J. Han, Q.H. Li, A clustering-based method for unsupervised intrusion detections, Pattern Recognition Letters 27 (2006) 802-810. DOI: 10.1016/j.patrec.2005.11.007
[23] C.F. Tsai, C.Y. Lin, A triangle area based nearest neighbors approach to intrusion detection, Pattern Recognition 43 (2010) 222-229. DOI: 10.1016/j.patcog.2009.05.017
[24] W.C. Lin, S.W. Ke, C.F. Tsai, CANN: An intrusion detection system based on combining cluster centers
and nearest neighbors, Knowledge-Based Systems 78 (2015) 13-21. DOI:
10.1016/j.knosys.2015.01.009 [25] C.R. Pereira, R.Y.M. Nakamura, K.A.P. Costa, J.P. Papa, An optimum-path forest framework for intrusion detection in computer networks, Engineering Applications of Artificial Intelligence 25 (2012) 1226-1234. DOI: 10.1016/j.engappai.2012.03.008 [26] J.P. Papa, A.X. Falcão, V.H.C. Albuquerque, J.M.R.S. Tavares, Efficient supervised optimum-path forest
classification for large datasets, Pattern Recognition 45 (2012) 512-520. DOI:
10.1016/j.patcog.2011.07.013 [27] A.S. Iwashita, J.P. Papa, A.N. Souza, A.X. Falcão, R.A. Lotufo, V.M. Oliveira, V.H.C. de Albuquerque, J.M.R.S. Tavare, A path- and label-cost propagation approach to speed up the training of the optimum-path forest classifier, Pattern Recognition Letters 40 (2014) 121-127. DOI: 10.1016/j.patrec.2013.12.018 [28] R. Souza, L. Rittner, R. Lotufo, A comparison between k-optimum path forest and k-nearest neighbors
supervised classifiers, Pattern Recognition Letters 39 (2014) 2-10. DOI:
10.1016/j.patrec.2013.08.030 [29] P.T.M. Saito, C.T.N. Suzuki, J.F. Gomes, P.J. de Rezende, A.X. Falcão, Robust active learning for the diagnosis of parasites, Pattern Recognition 48 (2015) 3572-3583. DOI: 10.1016/j.patcog.2015.05.020 [30] K.A.P. Costa, L.A.M. Pereira, R.Y.M. Nakamura, C.R. Pereira, J.P. Papa, A.X. Falcão, A natureinspired approach to speed up optimum-path forest clustering and its application to intrusion detection in computer networks, Information Sciences 294 (2015) 95-108. DOI: 10.1016/j.ins.2014.09.025 [31] D. Osaku, R.Y.M. Nakamura, L.A.M. Pereira, R.J. Pisani, A.L.M. Levada, F.A.M. Cappabianco, A.X. Falcão, J.P. Papa, Improving land cover classification through contextual-based optimum-path forest, Information Sciences 324 (2015) 60-87. DOI: 10.1016/j.ins.2015.06.020 [32] A. Klein, H. Ahlf, V. Sharma, Social activity and structural centrality in online social networks, Telematics and Informatics 32 (2015) 321-332. DOI: 10.1016/j.tele.2014.09.008 [33] X. Yan, L. Zhai, W. Fan, C-index: A weighted network node centrality measure for collaboration competence, Journal of Informetrics 7 (2013) 223-239. DOI: 10.1016/j.joi.2012.11.004 [34] X. Qi, E. Fuller, R. Luo, C. Zhang, A novel centrality method for weighted networks based on the Kirchhoff polynomial, Pattern Recognition Letters 58 (2015) 51-60. DOI: 10.1016/j.patrec.2015.02.007 [35] S. Kundu, S.K. Pal, Fuzzy-rough community in social networks, Pattern Recognition Letters 67 (2015) 145-152. DOI: 10.1016/j.patrec.2015.02.005
[36] X. Qi, W. Tang, Y. Wu, G. Guo, E. Fuller, C.Q. Zhang, Optimal local community detection in social networks based on density drop of subgraphs, Pattern Recognition Letters 36 (2014) 46-53. DOI: 10.1016/j.patrec.2013.09.008
[37] J. MacQueen, Some methods for classification and analysis of multivariate observations, In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Oakland, USA (1967).
[38] O. Sangita, J. Dhanamma, An improved k-means clustering approach for teaching evaluation, In: Advances in Computing, Communication and Control, Springer (2011) 108-115. DOI: 10.1007/978-3-642-18440-6_13
[39] S. Bakshi, A.K. Jagadev, S. Dehuri, G. Wang, Enhancing scalability and accuracy of recommendation systems using unsupervised learning and particle swarm optimization, Applied Soft Computing 15 (2014) 21-29. DOI: 10.1016/j.asoc.2013.10.018
[40] C. Legány, S. Juhász, A. Babos, Cluster validity measurement techniques, In: Proceedings of 5th WSEAS International Conference on Artificial Intelligence, Knowledge Engineering and Data Bases, Madrid, Spain (2006) 388-393.
[41] J.C. Dunn, Well separated clusters and optimal fuzzy partitions, Journal of Cybernetica 4 (1974) 95-104. DOI: 10.1080/01969727408546059
[42] S. Theodoridis, K. Koutroumbas, Pattern Recognition, 4th Edition, Elsevier, 2008.
[43] N.R. Pal, J. Biswas, Cluster validation using graph theoretic concepts, Pattern Recognition 30 (1997) 847-857. DOI: 10.1016/S0031-3203(96)00127-6
[44] P.J. Rousseeuw, Silhouettes: A graphical aid to the interpretation and validation of cluster analysis, Journal of Computational and Applied Mathematics 20 (1987) 53-65. DOI: 10.1016/0377-0427(87)90125-7
[45] D.L. Davies, D.W. Bouldin, A cluster separation measure, IEEE Transactions on Pattern Analysis and Machine Intelligence 1 (1979) 224-227. DOI: 10.1109/TPAMI.1979.4766909
[46] J.H. Yeh, F.J. Joung, J.C. Lin, CDV index: A validity index for better clustering quality measurement,
Journal of Computer and Communications 2 (2014) 163-171.
DOI: 10.4236/jcc.2014.24022 [47] S. Petrović, A comparison between the Silhouette index and the Davies-Bouldin index in labeling IDS clusters, In: Proceedings of the 11th Nordic Workshop on Secure IT-systems, Linkoping, Sweden (2006) 53-64. [48] S. Ray, R.H. Turi, Determination of number of clusters in k-means clustering and application in colour image segmentation, In: Proceedings of the 4th International Conference on Advances in Pattern Recognition and Digital Techniques, Calcutta, India (1999) 137-143.
[49] S. Wasserman, K. Faust, Social network analysis: Methods and applications, New York, Cambridge University Press, 1994. [50] K. Musiał, P. Kazienko, P. Bródka, User position measures in social networks, In: Proceedings of the 3rd Workshop on Social Network Mining and Analysis, New York, USA (2009). DOI: 10.1145/1731011.1731017 [51] A. Rusinowska, R. Berghammer, H.D. Swart, M. Grabisch, Social networks: Prestige, centrality, and influence, In: Relational and Algebraic Methods in Computer Science, Springer (2011) 22-39. DOI: 10.1007/978-3-642-21070-9_2 [52] Y.M. Kim, A preliminary social network analysis of MPACT, In: Annual Meeting Proceedings Information
Science with Impact: Research in and for the Community (2007) 1-10
(https://www.asis.org/Conferences/AM07/proceedings/posters/79/79_poster.html.) [53] I. Varlamis, M. Eirinaki, M. Louta, Application of social network metrics to a trust-aware collaborative model for generating personalized user recommendations, In: The Influence of Technology on Social Network Analysis and Mining, Springer (2013) 49-74. DOI: 10.1007/978-3-7091-1346-2_3 [54] L.G. Pérez, F. Chiclana, S. Ahmadi, A social network representation for collaborative filtering recommender systems, In: Proceedings of 11th International Conference on Intelligent Systems Design and Applications, Córdoba, Spain (2011) 438-443. DOI: 10.1109/ISDA.2011.6121695 [55] W. Wang, X. Zhang, S. Gombault, S.J. Knapskog, Attribute normalization in network intrusion detection, In: Proceeding of 10th International Symposium on Pervasive Systems, Algorithms, and Networks, Kaohsiung, Taiwan (2009) 448-453. DOI: 10.1109/I-SPAN.2009.49 [56] D.A. Bader, S. Kintali, K. Madduri, M. Mihail, Approximating betweenness centrality, In: Proceedings of 5th International Conference on Algorithms and Models for the Web-Graph, San Diego, CA, USA (2007) 124-137. DOI: 10.1007/978-3-540-77004-6_10 [57] U. Kang, S. Papadimitriou, J. Sun, H. Tong, Centralities in large networks: Algorithms and observations, In: Proceedings of the SIAM International Conference on Data Mining, Mesa, Arizona, USA (2011). DOI: 10.1137/1.9781611972818.11 [58] E. Portes dos Santos, C.R. Xavier, P. Goldfeld, F. Dickstein, R. Weber dos Santos, Comparing genetic algorithms and Newton-like methods for the solution of the history matching problem, Lecture Notes in Computer Science 5544 (2009) 377-386. [59] M. Tavallaee, E. Bagheri, L. Wei, A. Ghorbani, Detailed analysis of the KDD CUP 99 data set, IEEE Symposium on Computational Intelligence for Security and Defense Applications, Ottawa, Canada (2009) 1-6. DOI: 10.1109/CISDA.2009.5356528
[60] K. Lahre, T.D. Diwan, S.K. Kashyap, P. Agrawal, Analyze different approaches for IDS using KDD 99 data set, International Journal on Recent and Innovation Trends in Computing and Communication 1 (2013) 645-651. [61] M. Sheikhan, Z. Jadidi, A. Farrokhi, Intrusion detection using reduced-size RNN based on feature grouping, Neural Computing and Applications 21 (2012) 1185-1190. DOI: 10.1007/s00521-010-0487-0 [62] P.A. Raj Kumar, S. Selvakumar, Distributed denial of service attack detection using an ensemble of neural classifier, Computer Communications 34 (2011) 1328-1341. DOI: 10.1016/j.comcom.2011.01.012 [63] P.A. Raj Kumar, S. Selvakumar, Detection of distributed denial of service attacks using an ensemble of adaptive and hybrid neuro-fuzzy systems, Computer Communications 36 (2013) 303-319. DOI: 10.1016/j.comcom.2012.09.010 [64] D. Ippoliti, X. Zhou, A-GHSOM: An adaptive growing hierarchical self organizing map for network anomaly detection, Journal of Parallel and Distributed Computing 72 (2012) 1576-1590. DOI: 10.1016/j.jpdc.2012.09.004 [65] J.J. Davis, A.J. Clark, Data preprocessing for anomaly based network intrusion detection: A review, Computers & Security 30 (2011) 353-375. DOI: 10.1016/j.cose.2011.05.008 [66] S.T. Ikram, A.K. Cherukuri, Intrusion detection model using fusion of chi-square feature selection and multi class SVM, Journal of King Saud University – Computer and Information Sciences (2016) {Article in Press}. DOI: 10.1016/j.jksuci.2015.12.004 [67] P. Laskov, P. Dussel, C. Schafer, K. Rieck, Learning intrusion detection: Supervised or unsupervised? Image Analysis and Processing (2005) pp. 50-57. [68] E. De la Hoz, E. De La Hoz, A. Ortiz, J. Ortega, B. Prieto, PCA filtering and probabilistic SOM for network intrusion detection, Neurocomputing 164 (2015) 71-81. DOI: 10.1016/j.neucom.2014.09.083 [69] Y. Bouzida, F. Cuppens, N. Cuppens-Boulahia, S. Gombault, Efficient intrusion detection using principal component analysis. In: Proceedings of the 3ème Conférence sur la Sécurite´ et Architectures Réseaux (SAR), Orlando, FL, USA, 2004.
[70] M.H. Bhuyan, D.K. Bhattacharyya, J.K. Kalita, A multi-step outlier-based anomaly detection approach
to network-wide traffic, Information Sciences 348 (2016) 243-271. DOI:
10.1016/j.ins.2016.02.023 [71] F. Amiri, M.M. Rezaei Yousefi, C. Lucas, A. Shakery, N. Yazdani, Mutual information-based feature selection for intrusion detection systems, Journal of Network and Computer Applications 34 (2011) 1184-1199. DOI:10.1016/j.jnca.2011.01.002 [72] S. Rastegari, P. Hingston, C.-P. Lam, Evolving statistical rulesets for network intrusion detection, Applied Soft Computing 33 (2015) 348-359. DOI: 10.1016/j.asoc.2015.04.041
[73] J. Herlocker, J.A. Konstan, J. Riedl, An empirical analysis of design choices in neighborhood-based collaborative
filtering algorithms, Information Retrieval 5 (2002) 287-310. DOI:
10.1023/A:1020443909834
[74] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, 2nd edition, New York: John Wiley & Sons, 2000.
[75] R. Agrawal, M.V. Joshi, PNrule: A new framework for learning classifier models in data mining (a case-study in network intrusion detection), IBM Research Division, Report No. RC-21719 (2000).
[76] V. Vidhya, A review of DoS attacks in cloud computing, IOSR Journal of Computer Engineering 16 (2014) 32-35.
[77] C.N. Seon, H. Lee, H. Kim, J. Seo, Improving domain action classification in goal-oriented dialogues using a mutual retraining method, Pattern Recognition Letters 45 (2014) 154-160. DOI: 10.1016/j.patrec.2014.03.021
[78] E.A. Gerlein, M. McGinnity, A. Belatreche, S. Coleman, Evaluating machine learning classification for financial trading: An empirical approach, Expert Systems with Applications 54 (2016) 193-207. DOI: 10.1016/j.eswa.2016.01.018
[79] G. Folino, P. Sabatino, Ensemble based collaborative and distributed intrusion detection systems: A survey, Journal of Network and Computer Applications 66 (2016) 1-16. DOI: 10.1016/j.jnca.2016.03.011
[80] Z. Inayat, A. Gani, N.B. Anuar, M. Khurram Khan, S. Anwar, Intrusion response systems: Foundations, design, and challenges, Journal of Network and Computer Applications 62 (2016) 53-74. DOI: 10.1016/j.jnca.2015.12.006

Biography
Mansour Sheikhan was born in Tehran, Iran, in 1966. He received the B.S. degree in electronic engineering from Ferdowsi University, Meshed, Iran, in 1988, and the M.S. and Ph.D. degrees in communication engineering from Islamic Azad University (IAU), Tehran, Iran, in 1991 and 1997, respectively. He is currently an Associate Professor in the Telecommunication Engineering Department of the IAU-South Tehran Branch. Dr. Sheikhan has published more than 80 journal papers and about 60 conference papers. He has published three books in Farsi and four book chapters for IET and Taylor & Francis. He has been a reviewer for international journals such as Information Sciences, IET Communications, IEEE Trans. on Smart Grid, IEEE Access, Circuits, Systems & Signal Processing, Multimedia Systems, Neural Computing and Applications, Soft Computing, Applied Mathematics and Computation, Biomedical Signal Processing and Control, Machine Learning and Cybernetics, and Computers in Biology and Medicine in recent years. He was selected as the outstanding researcher of IAU in 2003, 2008, and 2010-2012. His research
interests include intelligent systems, speech signal processing, security in communication networks, and neural networks.
Figure Captions

Fig. 1. The framework of the proposed IDS
Fig. 2. A simple example of generating a directed graph from a complete undirected graph
Fig. 3. Classification of a new unlabeled instance by using the projected OPF based on its distance from the OPT roots
Fig. 4. The DB index value versus the number of clusters
Fig. 5. The comparison of AOPF performance when using different pruning strategies in intrusion detection application
Fig. 6. The average Euclidean distance of AOPF when using different pruning strategies
Fig. 7. Performance comparison of the MOPF and the traditional OPF in terms of DR
Fig. 8. Performance comparison of the MOPF and the traditional OPF in terms of FAR
Table 1. General description of features in NSL-KDD dataset
Category               Number of features   Type (Feature IDs)
Basic                  9                    Continuous (IDs: {1, 5, 6, 8, 9}); Symbolic (IDs: {2, 3, 4}); Discrete (ID: {7})
Content                13                   Continuous (IDs: {10, 11, 13, 16, 17, 18, 19, 20}); Discrete (IDs: {12, 14, 15, 21, 22})
Time-based traffic     9                    Continuous (IDs: {23, 24, 25, 26, 27, 28, 29, 30, 31})
Host-based traffic     10                   Continuous (IDs: {32, 33, 34, 35, 36, 37, 38, 39, 40, 41})
Table 2. Size of the training and test sets
Data type   Number of instances in the training set*   Number of instances in the test set
Normal      3,490                                      1,526
DoS         2,489                                      1,028
Probe       684                                        284
R2L         228                                        109
U2R         109                                        53
Total       7,000                                      3,000
* 4,000 samples of the total training samples were used as the training set and the remaining 3,000 samples were used as the evaluation set.

Table 3. Comparison of instance specifications in NSL-KDD and the employed sub-dataset of NSL-KDD for different ranges of difficulty level
                     NSL-KDD dataset                            Employed sub-dataset in this study
Difficulty level     Number of instances   Distribution (%)     Number of instances   Distribution (%)
0-5                  992                   0.67                 135                   1.35
6-10                 1,605                 1.08                 151                   1.51
11-15                9,863                 6.64                 743                   7.43
16-20                62,806                42.29                4,101                 41.01
21                   73,251                49.32                4,870                 48.70
Table 4. Parameter settings of the proposed method in the simulations
Parameter                                  Value     Phase/Stage
T (max. number of learning iterations)     5         Learning phase of the OPF
k (number of clusters)                     4         Partitioning stage
Pruning percentage                         65 (%)    Pruning stage
Detection threshold                        0.696     Detecting stage
Table 5. Size of TR and TS used in Algorithms 1 and 3
Data type   Number of TR instances   Number of TS instances
Normal      738                      382
DoS         542                      246
Probe       144                      79
R2L         49                       28
U2R         27                       15
Total       1,500                    750
Table 6. Obtained values of the pruning percentage and the detection threshold when using the GA and the fminbnd function for optimization
Parameter             GA value   GA optimization time (min)   fminbnd value   fminbnd optimization time (min)
Pruning percentage    65 (%)     307                          62 (%)          132
Detection threshold   0.696      143                          0.618           50
Table 7. Performance of the AOPF+Pr model for different numbers of neighbors (m) in terms of DR and FAR
m (number of neighbors)   DR (%)   FAR (%)
20                        95.74    8.95
30                        94.86    8.15
31                        95.53    8.50
40                        94.48    8.50
50                        94.50    9.28
60                        95.14    10.37
Table 8. Performance of the AOPF+Pr model when using different coding methods for categorical features
Coding method for categorical features   DR (%)   FAR (%)
Dummy                                    93.91    8.19
Label                                    93.95    6.12
Table 9. Cost matrix for the NSL-KDD dataset
Actual \ Predicted   Normal   DoS   Probe   R2L   U2R
Normal               0        2     1       2     2
DoS                  2        0     1       2     2
Probe                1        2     0       2     2
R2L                  3        2     2       0     2
U2R                  4        2     2       2     0
Table 10. Accuracy rate and execution time of the investigated OPF-based models
Model     Accuracy rate (%)   Training time (s)   Testing time (s)   Training+Testing time (s)
OPF       76.88               136.94              4.05               140.99
AOPF      90.53               139.94              511.89             651.83
AOPF+P    91.09               49.43               147.64             197.06
AOPF+Pr   90.27               57.60               269.81             327.41
MOPF      91.74               19.93               72.35              92.28
Table 11. Performance comparison of the investigated models in terms of DR and FAR
Model     DR (%)   FAR (%)
OPF       90.30    4.19
AOPF      96.13    10.55
AOPF+P    93.83    1.96
AOPF+Pr   96.27    9.23
MOPF      96.20    1.44
Table 12. Confusion matrix and CPE of the investigated models tested on the NSL-KDD dataset (rows: actual class; columns: predicted class)

Model: OPF (CPE = 0.1950)
Actual \ Predicted   Normal   DoS     Probe   R2L   U2R
Normal               1,462    23      11      26    4
DoS                  29       994     4       1     0
Probe                33       23      225     2     1
R2L                  69       4       1       33    2
U2R                  12       19      3       3     15

Model: AOPF (CPE = 0.1473)
Actual \ Predicted   Normal   DoS     Probe   R2L   U2R
Normal               1,365    47      54      45    15
DoS                  10       1,010   8       0     0
Probe                20       7       257     0     4
R2L                  22       0       0       83    4
U2R                  5        0       1       4     43

Model: AOPF+P (CPE = 0.0983)
Actual \ Predicted   Normal   DoS     Probe   R2L   U2R
Normal               1,496    14      10      3     3
DoS                  53       961     8       3     3
Probe                21       3       258     1     1
R2L                  10       0       4       93    2
U2R                  7        3       2       4     37

Model: AOPF+Pr (CPE = 0.1460)
Actual \ Predicted   Normal   DoS     Probe   R2L   U2R
Normal               1,373    46      54      38    15
DoS                  6        1,014   8       0     0
Probe                20       7       257     0     0
R2L                  32       0       0       74    3
U2R                  5        0       1       4     43

Model: MOPF (CPE = 0.0753)
Actual \ Predicted   Normal   DoS     Probe   R2L   U2R
Normal               1,504    5       7       5     5
DoS                  20       996     7       4     1
Probe                27       7       244     2     4
R2L                  7        4       0       85    13
U2R                  2        1       2       5     43
Table 13. Performance comparison of the proposed MOPF with the SVM, NB, and CART classifiers in terms of DR and FAR
Model   DR (%)   FAR (%)
SVM     95.05    2.10
NB      81.00    9.37
CART    97.15    2.75
MOPF    96.20    1.44
Highlights
We proposed a modified version of optimum-path forest (MOPF) for intrusion detection.
Social network analysis is used for pruning the training set to speed up the OPF.
A partitioning module is used to improve the detection rate of low-frequency attacks.
The classification phase of traditional OPF is modified for improving the accuracy.
Our method improved detection/false alarm rate and execution time of traditional OPF.