A feature reduced intrusion detection system using ANN classifier

Expert Systems With Applications 88 (2017) 249–257


Akashdeep∗, Ishfaq Manzoor, Neeraj Kumar
UIET, Panjab University, Chandigarh, India

Article info

Article history: Received 2 August 2016; Revised 28 June 2017; Accepted 6 July 2017; Available online 8 July 2017
Keywords: Intrusion Detection System (IDS); Feature Ranking; Feature Reduction; ANN

Abstract

The rapid growth of internet and network technologies has led to a considerable increase in the number of attacks and intrusions. Detection and prevention of these attacks has become an important part of security. An intrusion detection system is one of the important ways to achieve high security in computer networks and is used to thwart different attacks. Intrusion detection systems suffer from the curse of dimensionality, which tends to increase time complexity and decrease resource utilization. As a result, it is desirable that only the important features of the data be analyzed by an intrusion detection system. This work proposes an intelligent system which first performs feature ranking on the basis of information gain and correlation. Feature reduction is then done by combining the ranks obtained from information gain and correlation using a novel approach to identify useful and useless features. The reduced feature set is fed to a feed forward neural network for training and testing on the KDD99 dataset. The KDD-99 dataset is pre-processed to normalize the number of instances of each class before training. The system then classifies test data into attack and non-attack classes. The aim of the feature reduced system is to achieve the same degree of performance as a system using the full feature set. The system is tested on five different test datasets, and both individual and average results over all datasets are reported. The proposed method is compared with and without feature reduction in terms of various performance metrics, and comparisons with recent and relevant approaches are also tabulated. The results obtained for the proposed method are encouraging. © 2017 Elsevier Ltd. All rights reserved.

1. Introduction

Traditionally, intrusion was countered by authentication, encryption and decryption techniques, firewalls, etc. These are known as the first line of defense in computer security. They enable evaluation of the computer programs installed on a host to detect known vulnerabilities; after evaluation, the vulnerable program is patched with the latest patch code (Amiri, Yousefi, Lucas, Shakery, & Yazdani, 2011). However, an attacker can bypass them easily, and the first line of defense is not flexible or powerful enough to thwart different kinds of attacks or intrusions. Antivirus software, which acts as the second line of defense, has the limitation that it can only detect attacks whose signatures are present in its database; it cannot cope with attacks that appear in the hours before its next update. A stronger countermeasure is the Intrusion Detection System (IDS), which gathers information related to activities that violate security policies. An IDS gathers information from a network system and analyzes it in order to determine elements which violate the security policies of computers

Corresponding author. E-mail addresses: [email protected] (Akashdeep), [email protected] (I. Manzoor), [email protected] (N. Kumar). http://dx.doi.org/10.1016/j.eswa.2017.07.005 0957-4174/© 2017 Elsevier Ltd. All rights reserved.

and networks. Accuracy, extensibility and adaptability are three important characteristics of an intrusion detection system. An IDS must achieve good accuracy and adaptability to counter attacks from intruders. An IDS distinguishes between legitimate and illegitimate users and must be used together with the first line of defense to thwart intrusions and aberrations from inside as well as outside attackers. An IDS is an important asset to computer security because an attacker tries to conceal his identity and launch attacks through intermediate hosts, widely known as stepping-stone intrusion. Moreover, the changing nature of technologies and techniques makes it more difficult to detect attacks, so an IDS can make use of learning techniques to detect unknown future attacks. Intrusion detection systems can be classified into misuse detection systems and anomaly detection systems; the two methods may also be combined to form a hybrid detection system. In a misuse detection system, signatures of already known attack patterns are stored in a database and matched against network data; if the match is positive, an attack is declared (Wu & Huang, 2010). For example, successive failed logins within a minute or two may indicate a password-guessing attack. Security experts encode rules which are obtained from real intrusions. Misuse detection fails or gives less effective results when signatures are not known or the attack varies from the actual signature


pattern (Lin, Ving, Lee, & Lee, 2012). This type of detection system also has the same problem as antivirus software: it needs periodic updating to detect new types of attack. An anomaly detection system creates a normal, or baseline, profile by analyzing and observing the normal behavior of the network system. It then checks for any deviation from the baseline profile (Bhuyan, Bhattacharya, & Kalita, 2014); patterns which deviate from the normal profile are called outliers, anomalies or aberrations, and a significant deviation from the normal profile is considered an attack. In an anomaly detection system there is no need for prior knowledge of signatures. Anomaly detection can be divided into static and dynamic. A static detector works on the principle that only a fixed part of the system which does not change, e.g. operating system software, is monitored. A dynamic anomaly detector addresses network traffic data or audit records, and sets a threshold to separate normal consumption of resources from anomalous consumption. This method can detect previously seen as well as new attacks, which is an advantage over misuse detection systems, but it may lead to a high rate of false alarms. Another drawback is that if attackers know they are being profiled, they can slowly shift the profile so that the anomaly detection system learns the intruder's malicious behavior as normal. False positive and false negative errors also lead to inevitable costs (Joo, Hong, & Han, 2003). Such systems can be further categorized into Network Intrusion Detection Systems (NIDS) and Host Intrusion Detection Systems (HIDS). Network-based intrusion detection monitors and analyzes network traffic to distinguish normal usage patterns from attack patterns; if a malicious pattern is detected, it is treated as an intrusion. Host-based intrusion detection analyzes log files for attack signatures.
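The threshold idea behind such a dynamic anomaly detector can be sketched as follows. This is an illustrative toy example, not the paper's method; the k-sigma deviation rule and the sample format are our assumptions:

```python
import statistics

def anomaly_scores(baseline, observed, k=3.0):
    """Flag observations deviating more than k standard deviations
    from the mean of the baseline (normal) profile.
    `baseline` and `observed` are resource-usage samples."""
    mu = statistics.mean(baseline)
    sigma = statistics.pstdev(baseline) or 1e-9  # guard against zero variance
    return [x for x in observed if abs(x - mu) / sigma > k]

# Normal CPU usage hovers around 10; a spike to 50 is flagged.
print(anomaly_scores([10, 11, 9, 10, 10], [10, 50]))  # [50]
```

As the surrounding text notes, the choice of threshold k trades detection of new attacks against false alarms.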
HIDS analyze host-based audit sources such as audit trails, system logs and application logs to detect attacks. A hybrid detection system ensembles both misuse and anomaly based detection: known attacks are eradicated by the signature-based mechanism when a match is identified, and anomaly detection then checks for any deviation from the normal baseline profile, increasing the detection rate and decreasing the false alarm rate. Feature ranking and reduction: Feature ranking and selection is an important aspect of intrusion detection systems for achieving better performance. Feature ranking and selection methods answer the question of which features in a dataset are important, categorizing them into features of high or low significance. These features help to classify network data traffic into normal or abnormal (attack) classes. Features which contribute only marginally, or not at all, to detecting different kinds of attacks should be removed to obtain better accuracy and speed in intrusion detection systems. Removal of these features improves IDS performance in terms of computation, dimension reduction and time complexity (Sangkatsance, Watlanapongsakorn, & CharnsriPinyo, 2011). Predicting the importance of such features is a complex task due to the lack of proper mathematical methods; empirical methods may be used instead, and usually more than one method is employed to gauge importance. Feature reduction develops an understanding of features, reduces the data, improves performance and can also allow simpler models to be used for classification. A complete analysis of the various advantages of feature reduction was given by Barmejo, Ossa, Gamez, and Puerta (2012). The literature is rich in studies on feature selection and reduction, and inspired by these studies, we develop a feature reduction based intrusion detection system.
This study is driven by the observation that working on all features collectively to differentiate attack from non-attack cases is not computationally advantageous. Therefore, the proposed system first reduces the number of features and then performs classification using supervised learning. The highlight of

the proposed method is that it first ranks features according to information gain and correlation. The important features are then combined using a novel mechanism so that only useful ones are included and useless ones are discarded. A neural network using back-propagation learning is then trained on the KDD 99 training dataset; the KDD dataset is pre-processed to prevent over-training and over-fitting. The developed IDS is then tested using testing datasets and its performance is evaluated with the help of various statistical measures. The proposed method is compared with other contemporary techniques in the literature, and the results obtained are promising. The manuscript is arranged as follows: a brief outline of related studies is covered in Section 2; Section 3 introduces the proposed method; illustrations and results are given in Section 4; conclusions and future directions are provided in Section 5.

2. Literature survey

Intrusion detection systems suffer from the curse of dimensionality: large datasets which simulate real network data increase the time complexity of training and testing. Large data also consumes resources and may result in lower detection stability. It is pragmatic that data not contributing to detection should be eliminated before processing. This motivates the development of an effective feature extraction and reduction policy that can not only reduce training time but also provide higher accuracy and safeguard against unknown attacks. Feature selection reduces computational complexity and information redundancy, increases the accuracy of the learning algorithm, facilitates data understanding and improves generalization. Feature selection and ranking methods are divided into two types, wrapper and filter methods (Barmejo et al., 2012). Filter methods use some predefined criteria to select features from the dataset, eliminating irrelevant features.
Wrapper methods, on the other hand, rely on training data to evaluate features. Amiri et al. (2011) proposed feature selection algorithms and developed an IDS using support vector machines. They investigated two feature selection methods, the linear correlation coefficient and the forward feature selection algorithm (FFSA), and proposed modified mutual information feature selection (MMIFS). They compared the results obtained by the three feature selection methods and analyzed their effects. The dataset used was KDD cup 99 (KDD Dataset, 1999). Experiments were performed on the Windows platform, and results showed that MMIFS is more effective in detecting the probe and R2L attack classes, while FFSA performed well in detecting U2R, DoS and normal profiles. Li et al. (2012) performed preprocessing by k-means clustering to get a compact dataset, then selected a small training dataset with the help of ant colony optimization (ACO); they performed feature reduction to reduce the KDD feature set from 41 to 19 features. Sangkatsance et al. (2011) proposed a real-time intrusion detection system (RT-IDS): in the first step, 12 essential features were extracted from network packet headers; in the second step, information gain (IG) was used to analyze their importance in detecting various types of attacks. RT-IDS achieved a detection rate of 98% in the denial of service and probing attack classes. Liu, Sui, and Xiao (2014) proposed a clustered mutual information hybrid method for feature reduction: features were clustered in an unsupervised stage based on similarity, and supervised learning was then used to select representative features that increase similarity with the response features representing class labels.
Xiao, Liu, and Xiao (2009) proposed a two-step feature selection algorithm for IDS in which redundant features were eliminated by a mutual information method; experiments carried out on the KDD cup 99 dataset showed improved processing speed and better accuracy.


Bolon-Canedo, Sanchnez-Marano, and Alonso-Betanzos (2011) obtained a reduced set of features with the help of feature reduction algorithms such as correlation and INTERACT (based on symmetrical uncertainty). An ensemble method combining discretizers, filters and classifiers was applied to achieve better detection accuracy, and in most cases the feature reduction was more than 80%. Al-Jarrah et al. (2014) proposed the Random Forest-Forward Selection Ranking (RF-FSR) and Random Forest-Backward Elimination Ranking (RF-BER) feature selection methods. In RF-FSR, two subsets were formed, the first containing the three highest-weighted features and the second the remaining features; features were then added to the first set one by one to check the detection rate. In RF-BER, the feature with the lowest weight was eliminated at each step to check the effect on the detection rate. The resulting feature set was compared with well-known feature sets, and the proposed method increased the detection rate and decreased the false alarm rate to 0.01%. Karimi, Mansour, and Harounabadi (2013) merged two different feature sets, the first created by applying information gain and the second by applying symmetrical uncertainty. The combined features were weighted and ranked to get the most important features; experimental results showed a better detection rate compared to other feature selection algorithms. Mukherjee and Sharma (2012) analyzed the performance of feature selection by correlation, information gain and gain ratio using the WEKA 3.6 tool. They also proposed a feature vitality based reduction method to identify important features and used a Naive Bayes classifier to classify different types of attacks. The detection rates were good, and their method achieved better accuracy on the U2R attack class. Sung and Mukammala (2003) proposed feature ranking on the basis of the support vector decision function.
Support vector machines and neural networks were used for the classification process, and good detection accuracy was achieved in all attack classes. Barmejo et al. (2012) proposed a method for subset selection in datasets with a very large number of attributes. Their goal was to maintain good performance with a reduced number of wrapper evaluations; the algorithm switches between filter ranking and wrapper feature subset selection, and was tested on 11 high-dimensional datasets using different classifiers. Uguz (2011) applied feature selection to text categorization using a two-stage feature extraction and selection algorithm: information gain in the first stage, and principal component analysis (PCA) and a genetic algorithm in the second. k-nearest neighbor and the C4.5 decision algorithm were used for classification on the reuters-21,578 and classic3 datasets, and high categorization effectiveness was achieved. A conditional mutual information based method for feature selection was proposed by Fleuret (2004), who compared it with other feature selection algorithms and showed that conditional mutual information combined with a Naive Bayes classifier outperforms methods such as support vector machines. Chebrolu, Abraham, and Thomas (2005) used a Markov blanket model and decision tree analysis in feature selection to identify the importance of different features; Bayesian networks and regression trees were used for classification. Mukkamela and Sung (2006) deleted each feature one by one to observe the change in detection rate. They extracted 19 significant features from the original 41, and results showed that the performance difference was statistically insignificant. Table 1 summarizes the important points of the various studies available in the literature. In spite of the large number of available studies, the major research gaps which lay the foundation of the current study are summarized here.
The cited literature indicates that predicting an optimal number of features to increase the accuracy of an intrusion detection system and reduce training time complexity is still an open issue. Studies are available, but these have been targeted


Fig. 1. Diagrammatic representation of proposed method.

to improve the performance of one or more attack classes. A generic model that fits well for both attack and non-attack classes is still desired. It is also evident that lower detection stability, or true positive rate, has been observed for less frequent attacks such as the U2R and R2L classes. Existing approaches also suffer from higher false alarm rates due to higher false positives for frequently occurring attacks, and redundant and irrelevant data tends to increase the overall complexity of the intrusion detection system. The reason for such variability is that training data is abundant for some classes and very scarce for others. Based on these gaps, an intelligent system is proposed in this study which performs pre-processing of data as an initial step to remove redundant data from the dataset. This step helps to overcome higher false alarm rates, and classes having fewer tuples are also normalized. The method then performs dimensionality reduction to decrease time complexity and increase resource utilization. The system employs a unique mechanism which combines information gain with correlation based features to find features with higher utility values. A classifier based on an artificial neural network (ANN) has been implemented for training and testing of the system; an ANN was used to increase the effectiveness of the classification process. The system has been tested on the KDD 99 dataset and the results are encouraging. The next section provides details of the proposed method.

3. Proposed method

The proposed methodology is shown in Fig. 1. The first step involves selection of the KDD-99 dataset, which is a benchmark dataset in intrusion detection. KDD is actually raw TCP/IP dump


Table 1. Summarization of various feature selection and reduction approaches.

Amiri et al. (2011)
Remarks: Compared results by three feature selection methods and analyzed their effects.
Advantages: MMIFS was effective in detecting probe and R2L attacks, while FFSA was good in detecting U2R, DoS and normal profile attacks.
Limitations: Method shows lower detection in DoS and R2L attacks.

Li et al. (2012)
Remarks: Used feature reduction to reduce 41 features to 19.
Advantages: Achieved higher overall accuracy.
Limitations: System was able to recognize only 71% of normal instances.

Sangkatsance et al. (2011)
Remarks: Extracted 12 essential features from network packet headers.
Advantages: Achieved 98% detection rate in DoS and probe classes.
Limitations: Results for R2L and U2R attacks were not presented.

Xiao et al. (2009)
Remarks: Redundant features were removed by mutual information.
Advantages: Processing speed and accuracy were improved.
Limitations: Experiments showed good results in DoS and probing attacks only.

Bolon-Canedo et al. (2011)
Remarks: Used correlation and symmetrical uncertainty to reduce features.
Advantages: In most cases, feature reduction was more than 80%.
Limitations: Detection rates for Normal, DoS and U2R attacks are low.

Al-Jarrah et al. (2014)
Remarks: Two feature sets were formed and lowest weight features were removed.
Advantages: Significant increase in detection rate and decrease in false alarm rate to 0.01%.
Limitations: Results were shown in the form of accuracy only.

Karimi et al. (2013)
Remarks: Merged two feature sets obtained by applying information gain and symmetrical uncertainty.
Advantages: Detection rate was improved.
Limitations: Detection accuracy in U2R and R2L needs further improvement.

Mukherjee and Sharma (2012)
Remarks: Proposed a feature vitality based reduction method to identify important features.
Advantages: Achieved good detection rate and better accuracy in U2R attacks.
Limitations: Detection rate of U2R attacks, along with complexity and overheads, needs improvement.

Sung and Mukammala (2003)
Remarks: Used SVM and neural networks for classification of features.
Advantages: Compared SVM with neural networks and found that SVM has more scalability, i.e., SVM can be used on large datasets.
Limitations: SVM only makes binary classifications, and neural networks took more training time than SVM.

Uguz (2011)
Remarks: Used a two-stage feature extraction method.
Advantages: High categorization effectiveness was achieved.
Limitations: Results were not appreciable for all classes.

Fleuret (2004)
Remarks: Proposed a mutual information based method for feature selection.
Advantages: Their method along with naïve Bayes has better performance than SVM.
Limitations: Focused more on processing time.

Chebrolu et al. (2005)
Remarks: Classified features according to their importance.
Advantages: Identified all types of attacks with 12 features only.
Limitations: U2R attacks have a lower detection rate.

Mukkamela and Sung (2006)
Remarks: Removed features one by one to check the change in detection rate.
Advantages: Comparative study of SVMs, MARs and LGPs with elimination of features one by one.
Limitations: The most important features in the Normal, DoS and U2R classes overlap with each other; accuracy for the Probe and DoS classes is low.

Horng et al. (2010)
Remarks: A hierarchical clustering algorithm provided the SVM with fewer, abstracted and higher-qualified training instances.
Advantages: Better performance in detection of DoS and Probe attacks.
Limitations: Very low detection accuracy for the R2L and U2R classes.

data which was acquired by MIT Lincoln Lab by simulating a US Air Force LAN. It was operated like a real network and subjected to different intentional and simulated attacks. KDD features fall into four categories and are both qualitative and quantitative in nature. The KDD-99 dataset consists of five classes, of which one is the normal class and the other four are attack classes known as DoS, U2R, R2L and Probe; the data contains redundancy and class imbalance. Denial of service (DoS) is the type of attack in which legitimate users are denied or kept waiting for resources because attackers make the resources so busy that legitimate users cannot use them or their requests are denied; examples are smurf, neptune, teardrop and back. In probing, attackers gather information about the computer network and look for weak points from which to launch an attack; port scanning is one of the major attacks of this category, and others include ip-sweep, saint and nmap. In a remote to local (R2L) attack, attackers exploit a vulnerability of the computer system to gain access as a local user; the attacker tries to obtain an account on the victim machine, e.g. by guessing passwords or spying. Guess-password, multi-hop, phf, spy and warezclient are examples of R2L attacks. In user to root (U2R) attacks, attackers having local access to the system exploit its weak points to obtain root privileges; examples are buffer overflow, rootkit, loadmodule and perl. The 10% KDD-99 subset consists of a total of 494,020 instances, of which 97,277 are normal instances, 391,458 are denial of service, 4107 are probe, only 52 are user to root and 1126 are remote to local instances. Data preprocessing has been done manually by removing duplicate instances from the KDD-99 dataset and separating instances into different classes. The method starts by removing some redundant instances in the high frequency classes. The result of the data preprocessing step is a compact dataset with redundancy and imbalance removed. In the next step, feature ranking is performed by two algorithms, namely information gain and correlation. Information gain calculates the entropy of each feature: the greater the reduction in entropy, the more information content a feature has. It determines which attributes in a given set of feature vectors are useful for learning; these features will be used by the classification algorithm to distinguish unknown instances into different attack classes. The second method used to rank features is correlation: the lower the correlation of an attribute in the feature vector, the greater its power to distinguish between types of attacks in a multiclass problem. The ranked features are then divided into subsets. Information gain features are divided into three subsets named IG-1, IG-2 and IG-3, and correlation features are divided into three subsets named CR-1, CR-2 and CR-3. IG-1 and CR-1 contain the first 10 features, ranked 1 to 10 by information gain and correlation respectively; IG-2 and CR-2 consist of features ranked 11 to 30; and IG-3 and CR-3 contain the rest. We call these the strongly useful, useful and useless feature subsets respectively. Strongly useful features have a high ranking and hence a high ability to differentiate instances into different classes; useful features contain more information, or ability to differentiate instances, than useless features. IG-1 and CR-1 features rank higher than those in IG-2 and CR-2, and similarly IG-2 and CR-2 features rank higher than those in IG-3 and CR-3. In the next step, the union of the IG-1 and CR-1 feature subsets has been performed

Table 2. Sample distributions of instances for five classes in the training dataset.

Category of class   Number of instances        Percentage of class occurrence
Normal              25,000                     42.65%
U2R                 35 (resampled 3 times)     0.17%
DoS                 30,020                     51.22%
R2L                 751                        1.28%
Probe               2,738                      4.67%
Total               58,614                     100%

Table 3. Sample distributions of instances for five classes in a single test dataset.

Category of class   Number of instances        Percentage of class occurrence
Normal              14,000                     41.76%
U2R                 35                         0.10%
DoS                 16,000                     47.72%
R2L                 751                        2.24%
Probe               2,738                      8.17%
Total               33,524                     100%

and the intersection of the IG-2 and CR-2 feature subsets is calculated, generating a new feature set. The remaining features, present in IG-3 and CR-3, were removed because their presence makes a negligible difference to intrusion detection. Union is implemented to make sure that all important features qualify for the next level of the detection and classification process. A reduced dataset consisting of 25 features is obtained by taking the union of the results of the previous step. The next step involves training of the classification network, which has been implemented as a feed forward neural network in which every neuron of a layer is connected to the neurons of the next layer. In our case, we have a three-layer feed forward neural network. The input layer has 25 neurons, equal to the number of input features. The output layer has 5 neurons, equal to the five output classes, i.e. Normal, DoS, U2R, R2L and Probe. The middle layer has 10 neurons; the number of middle layer neurons is determined by the empirical formula √(I + O) + α (α = 1–10), where I is the number of input layer neurons, O is the number of output layer neurons and α is a number in the range 1 to 10. Training of the feed forward network is done by the Levenberg-Marquardt method, in which weights and biases are adjusted to train the network; Levenberg-Marquardt is the fastest back propagation method but may need more memory. The activation of the feed forward neural network is computed from the weights of each connection between neurons and the biases of each layer; these weights keep changing during training. Advantages of using an artificial neural network are knowledge discovery about dependencies without prior knowledge, robustness to inaccuracies and high efficiency. The network is trained using a training dataset created by extracting tuples from the KDD dataset.
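The hidden-layer sizing rule above can be written out directly. This is a sketch of the empirical formula only; the specific choice α = 5, which reproduces the paper's 10 hidden neurons for I = 25 and O = 5, is our assumption:

```python
import math

def hidden_neurons(n_inputs, n_outputs, alpha):
    """Empirical rule: hidden = sqrt(I + O) + alpha, with alpha in 1..10."""
    return round(math.sqrt(n_inputs + n_outputs) + alpha)

# 25 input features, 5 output classes; alpha = 5 yields the
# 10 hidden neurons of the paper's 25-10-5 network.
print(hidden_neurons(25, 5, 5))  # 10
```

Varying α over its 1–10 range gives between 6 and 15 hidden neurons, so the rule is a heuristic starting point rather than a fixed prescription.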
A suitable ratio of the number of samples has been maintained among the various classes. MATLAB version 2013 was used to perform feature ranking by information gain and correlation and to measure classification performance for these features. The training dataset contains a total of 58,614 instances, of which 30,020 are DoS, 25,000 are normal, 2738 are probe, 751 are remote to local and 35 are user to root instances. Re-sampling of the user to root instances is done to increase their count. After training, testing is performed on five different test datasets. Each test dataset consists of 33,524 instances comprising a mixture of known and unknown instances in equal proportion: 16,000 DoS, 14,000 normal, 751 R2L, 35 U2R and 2738 Probe instances. Testing was performed on five different datasets, all containing different types of instances. Tables 2 and 3 show the distribution of instances into the different classes for training and testing. Results for each dataset were taken, and the average over all datasets was tabulated using statistical measures such as true positive rate, false positive rate, precision and recall. The next section details the results and analysis of the proposed method.

4. Illustrations and results

Since it is not possible to carry out experiments with the whole KDD-99 dataset because of its dimensionality, redundancy and class imbalance, a compact dataset in which all 41 features are present was formed manually. In the second step, the compact dataset was imported and the feature ranking and selection algorithms were applied. The information gain algorithm calculates the entropy of each feature to determine the extent of information present in the different features of the dataset. Correlation returns a matrix containing the pair-wise linear correlation coefficient between each pair of columns; the correlation coefficient of each feature is then calculated by taking the mean of its column. Feature rankings from both methods were prepared, and the corresponding results are shown in Table 4, in which features like Dst-host-count, Dst-host-srv-count, Dst-host-srv-diff-host-rate, flag and protocol type have higher rankings. Features are represented by feature number in the table for brevity; readers may refer to Appendix A for resolving feature numbers to names. Many of these features frequently change value in the corresponding feature column, while features like Num-root, Num-file-creation, Num-shell and Is-host-login have the lowest rankings, as they contain values which remain constant throughout. After ranking, three feature subsets were formed: strongly useful, useful and useless features. Strongly useful features cannot be removed from the dataset, as this may decrease the accuracy of the proposed method. Useful features, being more important than useless features, cannot be eliminated either, since they also help in the detection of attacks such as DoS. However, any useless feature which does not contribute significantly to differentiating types of attacks, or to distinguishing between normal and abnormal data, can be removed. One research problem that we address here is how to choose the optimal number of features in feature ranking and selection to obtain higher classification accuracy for intrusion detection.
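The entropy-based ranking step can be sketched as follows. This is a minimal illustration of information gain over discrete feature columns; the helper names and the tiny toy data are ours, not the paper's:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """IG = H(class) - H(class | feature) for one discrete feature column."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [lab for f, lab in zip(feature, labels) if f == value]
        conditional += (len(subset) / n) * entropy(subset)
    return entropy(labels) - conditional

# A feature that perfectly separates the classes has maximal gain;
# a constant feature carries no information.
labels = ["dos", "dos", "normal", "normal"]
print(information_gain([1, 1, 0, 0], labels))  # 1.0
print(information_gain([7, 7, 7, 7], labels))  # 0.0
```

Ranking then amounts to sorting the 41 feature columns by their gain, highest first; the correlation ranking works analogously but sorts by mean correlation, lowest first.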
The information gain ranked features are divided into three subsets, IG-1, IG-2 and IG-3, on the basis of the ranks attained. IG-1 consists of the first 10 features, ranked 1 to 10; IG-2 contains the 20 features ranked 11 to 30; and IG-3 contains the 11 features ranked 31 to 41. IG-1 consists of features <4,37,41,22,32,34,40,39,31,14>, IG-2 consists of features <33,29,36,30,28,35,15,20,38,9,1,8,13,11,6,19,12,26,27,10>, and IG-3 consists of features <17,18,2,3,23,5,25,7,24,16,21>. The correlation-ranked features are likewise divided into three subsets, CR-1, CR-2 and CR-3. CR-1 consists of the first 10 features ranked by the correlation algorithm, <33,2,41,27,22,14,37,38,12,39>; CR-2 consists of the 20 features ranked 11 to 30, <4,16,8,13,5,6,7,3,19,20,17,10,1,24,9,11,23,15,21,18>; and CR-3 consists of the features ranked 31 to 41, <25,26,29,35,28,30,36,31,40,34,32>. In the next step, the union of the IG-1 and CR-1 feature subsets is taken. Union is used here because it includes every strongly useful feature present in either IG-1 or CR-1. The result of this operation has 15 features, <2,4,12,14,22,27,31,32,33,34,37,38,39,40,41>, since five features are common to IG-1 and CR-1. The intersection of IG-2 and CR-2 is then taken, and the result consists of a total of 10 features

Akashdeep et al. / Expert Systems With Applications 88 (2017) 249–257

Table 4
Ranking of features by information gain and correlation method.

Method              # Features    Ranking
Information gain    41            4,37,41,22,32,34,40,39,31,14,33,29,36,30,28,35,15,20,38,9,1,8,13,11,6,19,12,26,27,10,17,18,2,3,23,5,25,7,24,16,21
Correlation         41            33,2,41,27,22,14,37,38,12,39,4,16,8,13,5,6,7,3,19,20,17,10,1,24,9,11,23,15,21,18,25,26,29,35,28,30,36,31,40,34,32

Table 5
Feature reduced dataset with 25 selected features.

Dataset                    # Features    Selected features
Feature reduced dataset    25            1,2,4,6,8,9,10,11,12,13,14,15,19,20,22,27,31,32,33,34,37,38,39,40,41
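The subset combination that produces these 25 features is a pair of plain set operations and is easy to reproduce. The sketch below (illustrative Python; the feature numbers are copied from the rankings described above) takes the union of the strongly useful subsets and the intersection of the useful subsets:

```python
# Ranked subsets from the information gain and correlation rankings (Table 4).
ig1 = {4, 37, 41, 22, 32, 34, 40, 39, 31, 14}        # IG ranks 1-10
cr1 = {33, 2, 41, 27, 22, 14, 37, 38, 12, 39}        # CR ranks 1-10
ig2 = {33, 29, 36, 30, 28, 35, 15, 20, 38, 9,
       1, 8, 13, 11, 6, 19, 12, 26, 27, 10}          # IG ranks 11-30
cr2 = {4, 16, 8, 13, 5, 6, 7, 3, 19, 20,
       17, 10, 1, 24, 9, 11, 23, 15, 21, 18}         # CR ranks 11-30

strongly_useful = ig1 | cr1   # union: keep every strongly useful feature
useful = ig2 & cr2            # intersection: keep features both methods agree on
reduced = sorted(strongly_useful | useful)   # the 25-feature reduced set
```

Running this reproduces the 15-feature union, the 10-feature intersection and the 25 selected features of the reduced dataset.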

as <1,6,8,9,10,11,13,15,19,20>. Subsets IG-3 and CR-3 are discarded because they have little influence on classification accuracy while increasing the training time on the dataset. In the next step, the feature reduced dataset is formed by combining the features obtained from the union of IG-1 and CR-1 with those from the intersection of IG-2 and CR-2. Table 5 shows the resulting feature reduced dataset of 25 features, used in the feature reduced training and testing datasets for the experiments. An artificial neural network based classifier, as defined in the previous section, was then set up to test the effectiveness of the proposed method. The classifier was trained on the training dataset and various experiments were performed. The performance of the system was observed using a number of statistical measures: true positive rate (sensitivity), false positive rate, precision and recall. A confusion matrix, also known as an error matrix, is generated and utilized for measuring the performance of the classifier. In the confusion matrix, columns correspond to actual classes and rows correspond to predicted classes, from which the values of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) can be easily obtained. A true positive (TP) is a correct prediction of attack when the actual test instance is an attack. A true negative (TN) is a correct prediction of normal when the actual test instance is normal. A false positive (FP) is an incorrect prediction of attack when the actual instance is normal. A false negative (FN) is an incorrect prediction of normal when the actual test instance is an attack.

TPR or Sensitivity = TP / (TP + FN)        (1)

FPR or (1 − Specificity) = FP / (FP + TN)        (2)

Precision = TP / (TP + FP)        (3)

Accuracy = (TP + TN) / (TP + TN + FP + FN)        (4)
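As a quick executable check of Eqs. (1)–(4), the four measures can be written as small helper functions (an illustrative Python sketch; the paper's experiments were run in MATLAB, and the function names are ours):

```python
def tpr(tp, fn):
    """Eq. (1): true positive rate / sensitivity / recall."""
    return tp / (tp + fn)

def fpr(fp, tn):
    """Eq. (2): false positive rate, i.e. 1 - specificity."""
    return fp / (fp + tn)

def precision(tp, fp):
    """Eq. (3): precision."""
    return tp / (tp + fp)

def accuracy(tp, tn, fp, fn):
    """Eq. (4): overall accuracy."""
    return (tp + tn) / (tp + tn + fp + fn)

# Sample check with the Normal-class counts reported for test dataset-1:
# TP = 13,833, FN = 140, FP = 167, TN = 19,384.
```

With those Normal-class counts, tpr gives 99.0% and accuracy gives 99.08%, matching the values reported in Table 7.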

Sensitivity, also known as true positive rate or recall, measures the proportion of actual positive cases that are correctly identified as such; thus sensitivity quantifies the avoidance of false negatives, as specificity does for false positives. For any test there is usually a trade-off between these measures. Eqs. (1)–(4) have been used to evaluate our performance parameters. We present here sample values obtained for our test datasets. Table 6 shows the results for test dataset-1 in the form of a confusion matrix. The true positives appear along the main diagonal of the confusion matrix, while the other parameters are calculated using the formulae mentioned above. Columns in the confusion matrix correspond to actual classes, while rows correspond to the classes predicted by the classification algorithm. The main diagonal of the confusion matrix holds the true positive counts, which are 13,833 for the normal class, 17 for U2R, and so on. The false negatives (FN) of a particular class can be calculated from the corresponding row, excluding the true positives of that class, for example:

Table 6
Confusion matrix for test dataset-1.

                Target class
Output class    Normal    U2R    DOS       R2L    Probe
Normal          13,833    9      14        92     25
U2R             4         17     0         1      0
DOS             23        5      15,982    3      2
R2L             57        4      1         654    0
Probe           83        0      3         1      2711

Table 7
Statistical parameters for test dataset-1.

S. No    Parameters        Normal    U2R       DOS       R2L       Probe
1        True positive     13,833    17        15,982    654       2711
2        False negative    140       5         33        62        87
3        False positive    167       18        18        97        27
4        True negative     19,384    33,484    17,491    32,711    30,699
5        TPR (%)           99.0      77.3      99.8      91.3      96.9
6        FPR               0.0086    0.0005    0.0010    0.0030    0.0009
7        Precision (%)     98.6      48.6      99.9      87.1      99.0
8        Recall (%)        99.0      77.3      99.8      91.3      96.9
9        Accuracy (%)      99.08     99.93     99.85     99.53     99.66

The normal class false negatives (FN) equal 140, the sum of 9, 14, 92 and 25 along the row, excluding the true positives. The false positives of a class can be calculated from the corresponding column, again excluding the true positives of that class: for the normal class, adding 4, 23, 57 and 83 gives 167. To calculate the true negatives of the normal class, remove column one and row one from the confusion matrix and sum the remaining sub-matrix, which includes the true positives of all other classes. Table 7 summarizes the values of the various statistical parameters for test dataset-1. It can be seen from Table 7 that the values of false positives and false negatives are small, which is good for the system: a false negative compromises the security of the system by allowing malicious data to enter the network while being falsely labelled as normal by the intrusion detection system, and false positives increase overheads and may cost system time and resources. However, it is not possible to eliminate all false positives and false negatives, because there is a trade-off between these parameters. True positives and true negatives must be high so that the system's accuracy in detecting different attacks increases, which is the case for our system. The above process was repeated for all test datasets; the values obtained for test datasets 2, 3, 4 and 5 are presented in Tables 8–15. Tables 7, 9, 11, 13 and 15 show that FPs and FNs are small, which is good because false negatives can compromise the security of the system by allowing malicious data into the network. The values of the statistical parameters precision and recall do not depend on the size of the dataset and are also appreciable across the various datasets.
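The row/column bookkeeping just described can be verified mechanically. The sketch below (illustrative Python, not the authors' MATLAB code) rebuilds the Table 6 matrix and derives TP, FN, FP and TN per class using the paper's convention, with FN taken from the class's row and FP from its column:

```python
import numpy as np

# Table 6 confusion matrix: rows = predicted (output) class,
# columns = actual (target) class; order: Normal, U2R, DoS, R2L, Probe.
cm = np.array([
    [13833,  9,    14,  92,   25],  # predicted Normal
    [    4, 17,     0,   1,    0],  # predicted U2R
    [   23,  5, 15982,   3,    2],  # predicted DoS
    [   57,  4,     1, 654,    0],  # predicted R2L
    [   83,  0,     3,   1, 2711],  # predicted Probe
])

def counts(cm, i):
    """TP, FN, FP, TN for class i under the paper's convention."""
    tp = cm[i, i]
    fn = cm[i, :].sum() - tp   # off-diagonal entries of the class's row
    fp = cm[:, i].sum() - tp   # off-diagonal entries of the class's column
    tn = cm.sum() - tp - fn - fp
    return int(tp), int(fn), int(fp), int(tn)

tp, fn, fp, tn = counts(cm, 0)  # Normal class: 13833, 140, 167, 19384
```

For the normal class this reproduces FN = 9 + 14 + 92 + 25 = 140 and FP = 4 + 23 + 57 + 83 = 167, and the same function yields the DoS row of Table 7.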

Table 8
Confusion matrix for test dataset-2.

                Target class
Output class    Normal    U2R    DOS       R2L    Probe
Normal          10,255    10     9         88     66
U2R             0         12     0         0      0
DOS             3688      4      15,991    1      0
R2L             29        7      0         662    0
Probe           28        2      0         0      2672

Table 9
Statistical parameters for test dataset-2.

S. No    Parameters        Normal    U2R       DOS       R2L       Probe
1        True positive     10,255    12        15,991    662       2672
2        False negative    173       0         3693      36        30
3        False positive    3745      23        9         89        66
4        True negative     19,351    33,489    13,831    32,737    27,988
5        TPR (%)           98.3      100       81.2      94.8      98.9
6        FPR               0.162     0.0007    0.0006    0.0027    0.0023
7        Precision (%)     73.7      34.3      99.9      88.1      97.6
8        Recall (%)        98.3      100       81.2      94.8      98.9
9        Accuracy (%)      88.31     99.93     88.95     99.62     99.68

Table 10
Confusion matrix of test dataset-3.

                Target class
Output class    Normal    U2R    DOS       R2L    Probe
Normal          12,391    9      0         92     25
U2R             2         17     0         1      0
DOS             403       5      16,000    3      2
R2L             66        4      0         654    0
Probe           1138      0      0         1      2711

Table 11
Statistical parameters for test dataset-3.

S. No    Parameters        Normal    U2R       DOS       R2L       Probe
1        True positive     12,391    17        16,000    654       2711
2        False negative    126       3         413       70        1139
3        False positive    1609      18        0         97        27
4        True negative     19,398    33,489    17,111    32,703    29,647
5        TPR (%)           99.0      85        97.5      90.3      70.4
6        FPR               0.0765    0.0005    0.0000    0.0029    0.0009
7        Precision (%)     88.5      48.6      100       87.1      99.0
8        Recall (%)        99        85        97.5      90.3      70.4
9        Accuracy (%)      94.82     99.93     98.76     99.49     96.52

Table 12
Confusion matrix for test dataset-4.

                Target class
Output class    Normal    U2R    DOS       R2L    Probe
Normal          13,299    10     0         88     66
U2R             2         12     0         0      0
DOS             135       4      16,000    1      0
R2L             59        7      0         662    0
Probe           505       2      0         0      2672

Table 13
Statistical parameters for test dataset-4.

S. No    Parameters        Normal    U2R       DOS       R2L       Probe
1        True positive     13,299    12        16,000    662       2672
2        False negative    164       2         140       66        507
3        False positive    701       23        0         89        66
4        True negative     14,164    33,470    17,367    32,707    30,279
5        TPR (%)           98.8      85.7      99.1      90.9      84.1
6        FPR               0.0471    0.0006    0.0000    0.0027    0.0022
7        Precision (%)     95        34.35     100       88.1      97.6
8        Recall (%)        98.8      85.7      99.1      90.9      84.1
9        Accuracy (%)      96.94     99.92     99.58     99.53     98.29

Table 14
Confusion matrix for test dataset-5.

                Target class
Output class    Normal    U2R    DOS       R2L    Probe
Normal          12,403    9      9         92     25
U2R             2         17     0         1      0
DOS             1509      5      15,991    3      2
R2L             48        4      0         654    0
Probe           38        0      0         1      2711

Table 15
Statistical parameters for test dataset-5.

S. No    Parameters        Normal    U2R       DOS       R2L       Probe
1        True positive     12,403    17        15,991    654       2711
2        False negative    135       3         1519      52        39
3        False positive    1597      18        9         97        27
4        True negative     19,372    33,486    15,988    32,704    30,747
5        TPR (%)           98.9      85        91.3      92.6      98.6
6        FPR               0.0761    0.0005    0.0006    0.0029    0.0008
7        Precision (%)     88.6      92.6      99.95     87.1      99
8        Recall (%)        98.9      98.6      91.35     92.6      99
9        Accuracy (%)      94.83     99.93     95.43     99.55     99.80

Table 16
Average statistical parameters of the proposed method with feature reduction.

S. No    Parameters            Normal     U2R       DOS       R2L       Probe
1        TPR or Sensitivity    98.8       86.6      93.8      91.9      89.8
2        FPR                   0.06558    0.0005    0.0004    0.0028    0.0014
3        Precision             88.9       42.88     99.9      87.5      98.4
4        Recall                98.8       86.6      93.8      91.9      89.8

Table 17
Comparison of average statistical parameters of the two methods.

Proposed method              Class     TPR     FPR       Precision    Recall
With feature reduction       Normal    98.8    0.0655    88.9         98.8
                             U2R       86.6    0.0005    42.9         86.6
                             DoS       93.8    0.0004    99.9         93.8
                             R2L       91.9    0.0028    87.5         91.9
                             Probe     89.8    0.0014    98.4         89.8
Without feature reduction    Normal    99.3    0.0835    85.9         99.3
                             U2R       81.4    0.0005    49.7         81.4
                             DoS       90.4    0.0004    99.9         90.4
                             R2L       91.6    0.0048    94.7         91.6
                             Probe     97.5    0.0010    98.8         97.5

Table 16 shows the empirical results obtained by averaging values across all datasets. The table shows that the proposed method achieves a good detection rate in all classes: it reached an 86.6% detection rate in the U2R class and 91.9% in the R2L class, which is appreciable. Also, the false alarm rates achieved in the U2R, DoS, R2L and Probe classes were very low, a good indication that the number of false detections was also very small. The values of precision and recall are likewise appreciable. To establish the significance of the proposed method, an experiment was performed in which the feature reduced system was compared against the system without feature reduction. The proposed method with the reduced set of 25 features was compared with the system using all 41 features of KDD-99, and statistical parameters TPR, FPR, precision and recall were calculated. The training and testing instances for both methods were the same as in the previous experiment. Table 17 compares the proposed method with and without feature reduction using the averages of all statistical parameters. The results of the two methods show that the true positive rate, or sensitivity, i.e., the proportion of positives correctly identified, increases when feature ranking is applied. It can be seen



Table 18
Comparison of detection accuracy for various algorithms.

Method/Study Name             Normal    DoS      U2R      R2L      Probe
Li et al. (2012)              71.51     98.61    99.69    77.86    99.66
Badran and Rockett (2012)     99.5      97.00    11.40    5.60     78.00
Amiri et al. (2011)           99.80     99.00    93.16    99.91    99.83
Mukkamela and Sung (2006)     99.55     99.25    99.87    99.78    99.70
Horng et al. (2010)           99.3      99.5     19.7     28.8     97.5
Chebrolu et al. (2005)        98.78     98.95    48.00    98.93    99.57
Proposed Method               94.80     99.93    96.51    99.54    98.79

from the table that the proposed method shows a gradual increase in true positive rate and a lower false positive rate for the less frequent attacks, U2R and R2L. TPR also increases for the DoS class, while it decreases slightly for the Probe class; this can be considered negligible given the savings in training time. The FPR achieved with feature ranking is lower for the Normal and R2L classes, which is a good indication, while for the U2R and DoS classes it remains the same. Precision decreases for the less frequent attacks, U2R and R2L, with the feature ranking method, because a smaller percentage of class occurrences is available for training on these attacks; this is acceptable for large systems, as such attacks do not occur often. Overall, this indicates that feature ranking plays an important role in increasing the detection rate of the different classes, especially the less frequent attack classes U2R and R2L. The performance of the system has been retained and the results are encouraging. The performance of the proposed system was further compared with recent and relevant approaches found in the literature. Studies were selected considering their wide dissemination, publishing agency and citations, and the availability of class-wise results on the KDD dataset. Table 18 presents comparative detection accuracy results for the various studies. It can be seen from the table that the detection accuracy of the proposed method on DoS attacks is higher than that of almost all other methods. Also, 98.79% of Probe attacks are detected by the proposed method, which is better than the studies of Badran and Rockett (2012) and Horng et al. (2010), while results are comparable for the other studies. The proposed method is better than the methods of Li, Xia, Zhang, Yan, Chuan, and Dai (2012), Badran and Rockett (2012), Horng et al. (2010) and Chebrolu et al. (2005) in detecting DoS, U2R, R2L and Probe attacks.
The only drawback of the proposed method is its performance on the Normal class, where its values are somewhat lower than those of other methods. There is little to separate the results of the proposed method from the studies of Amiri et al. (2011) and Mukkamela and Sung (2006), since the obtained values are in close competition with each other, but the proposed study scores over these studies in the majority of cases. Overall, it can be concluded that the proposed method is better considering its performance across both attack and non-attack classes and the advantages gained by reducing the number of features.

5. Conclusion

The study proposed a new intelligent intrusion detection system that works on a reduced number of features. The system extracts features using the concepts of information gain and correlation: features are first ranked by information gain and by correlation, and then combined using a suitably designed mechanism. The method uses pre-processing to eliminate redundant and irrelevant data from the dataset in order to improve resource utilization and reduce time complexity. A classification system was designed using an ANN, trained on the compact dataset and tested on five different subsets of the KDD-99 dataset. The results show that the method performs well on both attack and non-attack classes. Overall, the method reported an increased detection

rate and a decreased false alarm rate. The system was tested against contemporary techniques and the results were found to be encouraging. The implication of the proposed system is a demonstration that feature reduction can be an important means of reducing the dimensionality and training time of a system. The performance of the feature reduced system is actually better than that of the system without feature reduction, which can influence the design of systems with far lower time complexity. The proposed intrusion detection method can be used to provide security in network, organizational and social settings where security is of prime importance. The study may also inspire researchers from the fields of data science and big data to apply their work to this research problem. Although the present work is convincing, it has some shortcomings: the pre-processing was done manually, and the number of features in the reduced feature set could be made optimal. The present work can be extended to address these shortcomings in a number of ways, such as finding the optimal number of features to further increase the accuracy of the intrusion detection system; this could be accomplished with population based optimization algorithms like genetic algorithms and big bang-big crunch optimization. Fuzzy systems could also be used in the pre-processing step, as they have a stable history of working in imprecise domains. The performance of the system could be further improved by using fast converging learning algorithms for speedy and accurate detection. Increasing amounts of data require ever more powerful networks, and studies based on deep networks such as convolutional neural networks are strong candidates in this direction.

Appendix A. List of various features in KDD

#     Feature name          Description
1     duration              Length (number of seconds) of the connection
2     protocol_type         Type of the protocol, e.g. tcp, udp, etc.
3     service               Network service on the destination, e.g., http, telnet, etc.
4     src_bytes             Number of data bytes from source to destination
5     dst_bytes             Number of data bytes from destination to source
6     flag                  Normal or error status of the connection
7     land                  1 if connection is from/to the same host/port; 0 otherwise
8     wrong_fragment        Number of "wrong" fragments
9     urgent                Number of urgent packets
10    hot                   Number of "hot" indicators
11    num_failed_logins     Number of failed login attempts
12    logged_in             1 if successfully logged in; 0 otherwise
13    num_compromised       Number of "compromised" conditions
14    root_shell            1 if root shell is obtained; 0 otherwise
15    su_attempted          1 if "su root" command attempted; 0 otherwise
16    num_root              Number of "root" accesses
17    num_file_creations    Number of file creation operations
18    num_shells            Number of shell prompts
19    num_access_files      Number of operations on access control files
20    num_outbound_cmds     Number of outbound commands in an ftp session
21    is_hot_login          1 if the login belongs to the "hot" list; 0 otherwise
22    is_guest_login        1 if the login is a "guest" login; 0 otherwise
23    count                 Number of connections to the same host as the current connection in the past two seconds


#     Feature name                   Description
24    serror_rate                    % of connections that have "SYN" errors
25    rerror_rate                    % of connections that have "REJ" errors
26    same_srv_rate                  % of connections to the same service
27    diff_srv_rate                  % of connections to different services
28    srv_count                      Number of connections to the same service as the current connection in the past two seconds
29    srv_serror_rate                % of connections that have "SYN" errors
30    srv_rerror_rate                % of connections that have "REJ" errors
31    srv_diff_host_rate             % of connections to different hosts
32    dst_host_count                 Destination host count
33    dst_host_srv_count             Service count for destination host
34    dst_host_same_srv_rate         Same service count for destination host
35    dst_host_diff_srv_rate         Different service count for destination host
36    dst_host_same_src_port_rate    Same source port rate for destination host
37    dst_host_srv_diff_host_rate    Different host rate for destination host
38    dst_host_serror_rate           Serror rate for destination host
39    dst_host_srv_serror_rate       Srv-serror rate for destination host
40    dst_host_rerror_rate           Rerror rate for destination host
41    dst_host_srv_rerror_rate       Srv-rerror rate for destination host

Source: KDD Cup 1999 Dataset

References

Al-Jarrah, O. Y., Siddiqui, A., Elsalamouny, M., Yoo, P. D., Muhaidat, S., & Kim, K. (2014). Machine learning based feature selection techniques for large scale intrusion detection. In Distributed computing systems workshops, IEEE 34th international conference on (pp. 177–181). doi:10.1109/ICDCSW.2014.14.

Amiri, F., Yousefi, M. M. R., Lucas, C., Shakery, A., & Yazdani, N. (2011). Mutual information based feature selection for intrusion detection. Network and Computer Application, 34, 1184–1199.

Badran, K., & Rockett, P. (2012). Multi-class pattern classification using single, multi-dimensional feature-space feature extraction evolved by multi-objective genetic programming and its application to network intrusion detection. Genetic Programming and Evolvable Machines, 13(1), 33–63.

Barmejo, P., Ossa, L., Gamez, J. A., & Puerta, J. M. (2012). Fast wrapper feature subset selection in high dimensional datasets by means of filter re-ranking. Journal of Knowledge Based Systems, 25, 35–44.

Bhuyan, M. H., Bhattacharya, D. K., & Kalita, J. K. (2014). Network anomaly detection: Methods, systems and tools. IEEE Communication Surveys and Tutorials, 16, 303–336.

Bolon-Canedo, V., Sanchez-Marono, N., & Alonso-Betanzos, A. (2011). Feature selection and classification in multiple class datasets: An application to KDD Cup 99 dataset. Expert Systems with Applications, 38, 5947–5957.

Chebrolu, S., Abraham, A., & Thomas, P. (2005). Feature deduction and ensemble design of intrusion detection systems. Computers and Security, 24(4), 295–307.

Fleuret, F. (2004). Fast binary feature selection with conditional mutual information. Journal of Machine Learning Research, 5, 1531–1555.

Horng, S. J., Su, M.-Y., Chen, Y. H., Kao, T. K., Chen, R. J., & Lai, J. L. (2010). A novel intrusion detection system based on hierarchical clustering and support vector machines. Expert Systems with Applications. doi:10.1016/j.eswa.2010.06.066.

Joo, D., Hong, T., & Han, I. (2003). Neural network model for IDS based on asymmetrical costs of false negative errors and false positive errors. Expert Systems with Applications, 25, 69–75.

Karimi, Z., Mansour, M., & Harounabadi, A. (2013). Feature ranking in intrusion detection dataset using combination of filter methods. International Journal of Computer Applications, 78, 21–27.

KDD dataset. http://kdd.ics.uci.edu/databases/kddcup99 [Online; accessed 2 August 2016].

Li, Y., Xia, J., Zhang, S., Yan, J., Chuan, X., & Dai, K. (2012). An efficient intrusion detection system based on support vector machine and gradually features removal method. Expert Systems with Applications, 39, 424–430.

Lin, S., Ving, K., Lee, C., & Lee, Z. (2012). An intelligent algorithm with feature selection and decision rules applied to anomaly intrusion detection. Journal of Soft Computing, 12, 3285–3290.

Liu, Q., Sui, S., & Xiao, J. (2014). A mutual information based hybrid feature selection method using feature clustering. In IEEE 38th annual international conference on computers, software and applications (pp. 27–32). doi:10.1109/compsac.2014.99.

Mukherjee, S., & Sharma, N. (2012). Intrusion detection using Naïve Bayes classifier with feature reduction. Procedia Technology, 4, 119–128. doi:10.1016/j.protcy.2012.05.017.

Mukkamela, S., & Sung, A. H. (2006). Significant feature selection using computational intelligent techniques for intrusion detection. Advanced Information and Knowledge Processing, 24, 285–306. doi:10.1007/1-84628-284-5_11.

Sangkatsance, P., Watlanapongsakorn, N., & CharnsriPinyo, C. (2011). Practical real time intrusion detection using machine learning approach. Journal of Computer Communications, 34, 2227–2235.

Sung, A. H., & Mukammala, S. (2003). Feature selection for intrusion detection using neural network and support vector machine. Transportation Research Record: Journal of the Transportation Research Board, 1822, 1–11. doi:10.3141/1822-05.

Uguz, H. (2011). Two stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm. Journal of Knowledge Based Systems, 24, 1024–1032.

Wu, H., & Haung, S. S. (2010). Neural network based detection on stepping stone intrusion. Expert Systems with Applications, 37, 1431–1437.

Xiao, L., Liu, Y., & Xiao, L. (2009). A two step feature selection algorithm adapting to intrusion detection. In Convergence and hybrid information technology, 2009, international joint conference on (pp. 618–622).
Significant feature selection using computational intelligent techniques for intrusion detection. Advanced Information and Knowledge Processing, 24, 285–306. doi:10.1007/1- 84628- 284- 5_11. Sangkatsance, P., Watlanapongsakorn, N., & CharnsriPinyo, C. (2011). Practical real time intrusion detection using machine learning approach. Journal of Computer Communication, 34, 2227–2235. Sung, A. H., & Mukammala, S. (2003). Feature selection for intrusion detection using neural network and support vector machine. Transportation Research Record: Journal of the Transportation Research Board, 1822, 1–11. http://dx.doi.org/10. 3141/1822-05. Uguz, H. (2011). Two stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm. Journal of Knowledge Based Systems, 24, 1024–1032. Wu, H., & Haung, S. S. (2010). Neural network based detection on stepping stone intrusion. Expert System with Applications, 37, 1431–1437. Xiao, L., Liu, Y., & Xiao, L. (2009). A two step feature selection algorithm adapting to intrusion detection. In Convergence and hybrid information technology,2009, international joint conference on (pp. 618–622).