A latent class modeling approach to detect network intrusion

Computer Communications 30 (2006) 93–100 www.elsevier.com/locate/comcom

Yun Wang a,c,*, Inyoung Kim b, Gaston Mbateng c, Shih-Yieh Ho c

a Center for Outcomes Research and Evaluation, Yale University and Yale-New Haven Health, CORE, 300 George Street, Suite 505, New Haven, CT 06511, USA
b Section of Biostatistics, School of Public Health, Yale University, 300 George Street, Suite 501, New Haven, CT 06511, USA
c Qualidigm, 100 Roscommon Drive, Middletown, CT 06457, USA

Received 27 April 2006; received in revised form 27 July 2006; accepted 28 July 2006; available online 30 August 2006

Abstract

This study presents a latent class modeling approach to examine network traffic data when labeled abnormal events are absent from the training data, or when such events are too few to fit a conventional regression model. Using six anomaly-associated risk factors identified in previous studies, the latent class model based on an unlabeled sample yielded acceptable classification results compared with a logistic regression model based on a labeled sample (correctly classified: 0.95 vs. 0.98; sensitivity: 0.99 vs. 0.99; specificity: 0.77 vs. 0.97). The study demonstrates the strong potential of the latent class modeling technique for analyzing network traffic data.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Intrusion detection; Machine learning; Classification; Latent class model; Computer security

1. Introduction

An essential challenge in modern network security is to detect anomalous network traffic when known (labeled) abnormal events are absent from the training data, or when such events are insufficient to train the system. Network traffic data have traditionally been analyzed in a framework of modeling for associations. This approach aims to estimate a relationship between the outcome and its predictors (risk factors) and requires both normal (legitimate) and abnormal events in a training dataset. Previous studies, such as statistically based regression [1–4] and artificial intelligence-based neural network approaches [5–7], are essential examples of this framework. In many real-world scenarios, available network traffic data either include no anomalous events at all or include too few to fit a conventional regression model for establishing an association between the

Corresponding author. Tel.: +1 203 737 5415; fax: +1 203 737 5412. E-mail address: [email protected] (Y. Wang).

© 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.comcom.2006.07.018

outcome and its predictors. For example, initial audit data in newly constructed network systems are likely to contain only normal events; a regression-based intrusion detection model therefore cannot be activated until a certain percentage of abnormal events has been acquired. As high-speed mobile wireless and ad hoc network systems have become popular, the importance of, and need for, new methods that allow network traffic models to be built from attack-free training data have significantly increased. This study proposes a Bayesian latent class (LC) modeling approach to address this need and describes the emergent challenges. The approach can be considered an unsupervised learning procedure that does not require a labeled or observed outcome for building a model: it treats the unknown outcome as a latent variable and uses directly measured variables to infer it. The study had two goals: (i) to examine the feasibility of using the LC modeling technique to analyze network traffic data, and (ii) to evaluate the predictability and reliability of the LC modeling approach for intrusion detection.


2. Methods

2.1. Study design

The study was conducted in three sequential steps: preparing the data, fitting the LC model, and evaluating the results. The first step created derivation and validation samples and categorized all network connections in the derivation sample into patterns (clusters) based on the combination of each connection's features (e.g., type of protocol, service, login status), represented by a set of binary variables called risk factors. The second step developed a LC model with two classes (normal or anomaly) and fitted it to the derivation sample. The model estimated parameters for each risk factor and calculated, for each pattern in the derivation sample, the conditional probabilities of belonging to the normal and anomaly latent classes. These pattern-specific probabilities were then used to classify network traffic into a normal or anomalous group. The third step evaluated these classifications by comparing them with results from a conventional logistic regression (LR) model. The evaluation was conducted in three stages. First, we assessed the stability of the patterns identified in the derivation sample against the validation sample. Second, we assigned the pattern-specific probabilities estimated from the derivation sample to the patterns in the validation sample: if a connection in the validation sample has the same pattern as one in the derivation sample, it receives the same probability of belonging to the corresponding latent class as in the derivation sample. The actual labeled outcome (normal or anomaly) included in the derivation and validation samples was used as a "gold standard" against which to compare the LC model-based classifications. Finally, we fitted the data with a LR model and compared the classification performance of the LC and LR models.
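To make the pattern steps concrete, the following sketch encodes connections as binary risk-factor patterns and transfers derivation-sample probabilities to validation connections by pattern lookup. The field names, probabilities, and example connection are illustrative assumptions of ours, not taken from the paper's samples:

```python
# Sketch of the pattern-encoding and probability-transfer steps described
# above (hypothetical data; the paper used the KDD-cup 1999 samples).

FACTORS = ["tcp", "udp", "http", "rst", "login_success", "guest_login"]

def pattern_key(conn, factors):
    """Encode a connection as a string of 0/1 flags, e.g. '100010'."""
    return "".join("1" if conn[f] else "0" for f in factors)

# Illustrative pattern-level anomaly probabilities from a derivation sample.
derivation_prob = {"000000": 0.99, "100000": 0.98, "101010": 0.04}

def classify(conn, threshold=0.5):
    """Classify a validation connection by looking up its derivation pattern."""
    key = pattern_key(conn, FACTORS)
    p = derivation_prob.get(key)   # None if the pattern was never seen
    if p is None:
        return None                # excluded, as in the paper's validation step
    return "anomaly" if p >= threshold else "normal"

conn = {"tcp": 1, "udp": 0, "http": 1, "rst": 0,
        "login_success": 1, "guest_login": 0}
print(pattern_key(conn, FACTORS))  # -> 101010
print(classify(conn))              # -> normal
```

Connections whose pattern never occurred in the derivation sample return `None` and are excluded, mirroring how the paper dropped validation connections with unseen patterns.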

Fig. 1. Structure of a latent class model with i manifest binary variables, 2^i patterns, and two classes.

2.2. Latent class model

A LC model is a parametric model that uses observed data to estimate its parameter values. It finds subtypes of related cases (latent classes) in multivariate categorical data. The term "latent class" indicates that the categorical membership of a given data record (e.g., a network connection) is not directly observed; rather, its probability of being assigned to a given category is assumed to depend on which latent class the record belongs to. A LC model includes two types of variables: manifest variables and latent variables. A variable that can be measured directly is called a manifest variable; a variable that cannot be observed or measured directly is called a latent variable (Fig. 1). The assumption of a LC model is that the interaction between a set of manifest variables, $M_1, M_2, \ldots, M_i$, can be explained by $j < i$ latent categorical variables with $C_1, C_2, \ldots, C_j$ categories [8]. A LC model also assumes latent conditional independence: conditional on the level of the latent variable, the manifest variables are independent.

Assume that there are three manifest variables, $X_1, X_2, X_3$, which have $j$ different patterns (combinations) of their respective values (e.g., 000, 010, ..., 111). Let $Z$ denote a latent variable with two latent classes, normal ($c = 1$) and anomaly ($c = 2$), and let $p_c$ represent the proportion of connections in the population in class $c$, where $p_1 + p_2 = 1$. Each network connection is a member of one of the two classes and of one of the $j$ patterns; we do not know which class it belongs to, but we do know its pattern. The standard latent class model assumes that $X_1, X_2, X_3$ are conditionally independent given $Z$, which can be written as

$$P[X_1 = x_1, X_2 = x_2, X_3 = x_3] = \sum_z \pi^{Z}_{z}\, \pi^{X_1|Z}_{x_1|z}\, \pi^{X_2|Z}_{x_2|z}\, \pi^{X_3|Z}_{x_3|z}, \quad (1)$$

where $\pi^{X_1|Z}_{x_1|z} = P[X_1 = x_1 \mid Z = z]$, $\pi^{X_2|Z}_{x_2|z} = P[X_2 = x_2 \mid Z = z]$, and $\pi^{X_3|Z}_{x_3|z} = P[X_3 = x_3 \mid Z = z]$. The joint distribution of $X_1, X_2, X_3$ and $Z$ under the LC model has the log-linear representation

$$\log(P[X_1 = x_1, X_2 = x_2, X_3 = x_3, Z = z]) = \alpha + \alpha^{Z}_{z} + \alpha^{X_1}_{x_1} + \alpha^{X_2}_{x_2} + \alpha^{X_3}_{x_3} + \alpha^{ZX_1}_{zx_1} + \alpha^{ZX_2}_{zx_2} + \alpha^{ZX_3}_{zx_3}, \quad (2)$$

where the $\alpha$'s are model parameters to be estimated from the data using the Expectation-Maximization (EM) algorithm [9]. The log-linear representation of the latent class model reflects the conditional independence constraint that all interaction terms among the manifest variables vanish: $\alpha^{X_1X_2}_{x_1x_2} = \alpha^{X_1X_3}_{x_1x_3} = \alpha^{X_2X_3}_{x_2x_3} = \alpha^{X_1X_2X_3}_{x_1x_2x_3} = \alpha^{ZX_1X_2X_3}_{zx_1x_2x_3} = 0$. In the EM algorithm, the latent variable is treated as missing data. To handle the missing data, we calculated the probability of assigning a network connection to class $c$ using Bayes' rule, from which the positive and negative predictive values can correspondingly be obtained [10]:


$$\Pr(c = 1 \mid \mathbf{y}_j) = \frac{p_1 \prod_{i=1}^{3} \Pr(y_{ij} \mid c = 1)}{\sum_{c} p_c \prod_{i=1}^{3} \Pr(y_{ij} \mid c)} \quad (3)$$

and

$$\Pr(c = 2 \mid \mathbf{y}_j) = \frac{p_2 \prod_{i=1}^{3} \Pr(y_{ij} \mid c = 2)}{\sum_{c} p_c \prod_{i=1}^{3} \Pr(y_{ij} \mid c)}. \quad (4)$$
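To make the estimation procedure concrete, here is a minimal EM sketch for a two-class latent class model with binary manifest variables; the E-step applies the Bayes-rule posteriors of Eqs. (3) and (4). The function names and synthetic data are ours for illustration (the authors used a STATA program, see Section 2.5), and this is a sketch rather than the paper's implementation:

```python
import random

def posterior(x, pi, theta):
    """Bayes-rule posterior P(Z = c | x) for a 2-class model, as in Eqs. (3)-(4)."""
    lik = []
    for c in range(2):
        p = pi[c]
        for i, xi in enumerate(x):
            p *= theta[c][i] if xi else 1.0 - theta[c][i]
        lik.append(p)
    s = sum(lik) or 1e-300   # guard against numerical underflow
    return [l / s for l in lik]

def fit_lc(data, n_iter=200, seed=0):
    """EM for a 2-class latent class model with binary manifest variables.

    data: list of 0/1 tuples. Returns (pi, theta) with class weights pi[c] and
    conditional probabilities theta[c][i] = P(X_i = 1 | Z = c); the conditional
    independence assumption of Eq. (1) is built into `posterior`.
    """
    rng = random.Random(seed)
    k = len(data[0])
    pi = [0.5, 0.5]
    theta = [[rng.uniform(0.25, 0.75) for _ in range(k)] for _ in range(2)]
    for _ in range(n_iter):
        # E-step: treat class membership as missing data, fill in posteriors.
        resp = [posterior(x, pi, theta) for x in data]
        # M-step: re-estimate pi and theta from the posterior weights.
        for c in range(2):
            w = sum(r[c] for r in resp)
            pi[c] = w / len(data)
            for i in range(k):
                theta[c][i] = sum(r[c] * x[i] for r, x in zip(resp, data)) / (w or 1e-300)
    return pi, theta

# Two well-separated synthetic patterns: EM should recover the 0.4/0.6 mixture.
data = [(1, 1, 1)] * 40 + [(0, 0, 0)] * 60
pi, theta = fit_lc(data)
```

Up to label switching, `pi` recovers the mixture weights and `theta` the two patterns; `posterior` then gives the pattern-level class probabilities used for classification.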

2.3. Data source

The derivation and validation samples were drawn from the Third International Knowledge Discovery and Data Mining Tools Competition (KDD-cup) 1999 data [11], which were based on the Defense Advanced Research Projects Agency (DARPA) Intrusion Detection Evaluation off-line database developed by the Lincoln Laboratory at the Massachusetts Institute of Technology (MIT) [12,13]. The full dataset was generated on a network that simulated 1000 Unix hosts and 100 users [5]; it included seven weeks of transmission control protocol (TCP) dump network traffic as training data, processed into about five million connection records, two weeks of testing data, and 32 different attack types. The test data do not have the same probability distribution as the training data, and they include additional attack types that were not in the training data. Each record in the dataset represents a sequence of TCP packets starting and ending within a fixed time window; each packet carries about 100 bytes of information flowing from a source IP address to a destination IP address under pre-defined protocols. Each record is labeled as either normal or a specific attack type. Since fitting the latent model to the full 7-week initial training dataset would require substantial computing resources, 10% of the data were randomly selected to construct the final derivation sample, and the validation sample was constructed from the 2-week initial testing data.


2.4. Outcome and risk factors

The KDD-cup and DARPA-MIT data include an outcome variable labeling a connection as anomalous (yes/no); an anomaly could be any one of the 38 included attack types, 24 in the derivation sample and an additional 14 new types in the validation sample. As described in the study design section, this outcome variable was not used to develop the LC model but served as the "gold standard" for evaluating the LC model's classification results. The risk factors that served as manifest variables in the LC model are dichotomous variables with a value of "1" if the characteristic is present and "0" if it is absent. These variables were selected based on previous studies [14,15] and include three "type of service or protocol" dummy variables: TCP, HTTP (hypertext transfer protocol), and UDP (user datagram protocol); two login status indicators: login-successfully and guest-login; and one "normal or error status of the connection" variable, RST. Each of these variables has been shown [15] to have a probability of at least 0.85 of being statistically significantly associated with an abnormal network connection.

2.5. Statistical analyses

Bivariate and descriptive analyses were conducted to compare the frequency of each risk factor between the derivation and validation samples, as well as the prevalence of the different risk-factor combination patterns. Patterns with fewer than 10 connections were excluded to ensure that the LC model was fitted with stable patterns. Eqs. (1)–(4) were computed with a STATA program, Generalized Linear Latent and Mixed Models, developed by Skrondal and Rabe-Hesketh [10]. A conventional LR model linking the labeled outcome (anomaly: yes/no) to the same risk factors was developed to evaluate the classification results drawn from the LC model. Let $y_i$ be the outcome variable for individual connection $i$, $i = 1, 2, \ldots, n$, taking the value "1" for anomaly with probability $p_i$ and "0" for normal with probability $1 - p_i$, and let $X_1, X_2, X_3$ be three risk factors (independent variables) identified from a sample. The LR model [16] can then be represented as

$$\log[p_i/(1 - p_i)] = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3}, \quad (5)$$

where $p_i = \Pr(Y_i = 1)$. The probability of being abnormal was calculated for each connection in both the derivation and validation samples, and a probability of 0.5 was used as the threshold to classify connections as normal or abnormal for both the LC and LR models. Sensitivity, specificity, the area under the receiver operating characteristic (ROC) curve, and the correctly classified rate were calculated to evaluate the classification results yielded by the LC and LR models. All statistical analyses were conducted using SAS version 8.02 (SAS Institute Inc., Cary, NC) and STATA version 8.0 (STATA Corporation, College Station, TX).

3. Results

3.1. Data characteristics

The initial KDD-cup/DARPA-MIT training data, comprising a total of 494,021 TCP/IP connections, were collected during the 7-week period from June 1, 1998 to July 19, 1998. The final 10% derivation sample included 49,399 TCP/IP connections with nine patterns, each of which has at least 10 connections. The initial validation sample has 311,029 connections, of which 243 were excluded because their risk-factor combination patterns did not occur in the derivation sample. The final


validation sample included 310,786 (99.9%) connections. The proportion of connections labeled as normal was similar between the derivation and validation samples (19.7% vs. 19.5%). The abnormal connections came from four sources: (i) Probe, surveillance and other probing; (ii) DoS, denial of service; (iii) U2R, unauthorized access to local super-user (root) privileges; and (iv) R2L, unauthorized access from a remote machine. Approximately 79% of the connections in the derivation sample were DoS attacks, and the remaining 1.1% were split among Probe, R2L, and U2R. There was a remarkable difference in the distribution of attacks between the derivation and validation samples because new attack types were included in the validation data: for example, between the derivation and validation samples, R2L increased from 0.2% to 5.2% and U2R increased from 0.01% to 0.07%. Table 1 shows the frequency and rate of the risk factors in the derivation and validation samples. Overall, there was no difference between the initial derivation data and the final 10% sample, and no remarkable difference between the final derivation and validation samples.

3.2. Classification

Table 2 shows the parameters estimated by the LC model. The posterior probability of classifying a connection as anomalous was 0.84 with a standard deviation (SD) of 0.37, while 79.9% of connections in the derivation sample were observed anomalies. Although a set of J dichotomous variables can in principle produce 2^J different combination patterns, the combinations of the six selected risk factors, arranged as TCP (yes/no), UDP (yes/no), HTTP (yes/no), RST (yes/no), login-successfully (yes/no), and guest-login (yes/no), yielded only nine patterns. Probabilities of being a normal and an anomalous connection were estimated by the LC model for each connection included in the validation sample. Because each connection

represents a 2-s window, most records in both the derivation and validation samples had exactly the same values of the risk factors longitudinally, which results in an unbalanced distribution of connection prevalence across the nine patterns. More than 50% of the connections had a value of "0" for all risk factors, i.e., pattern "000000", in both the derivation and validation samples. The second highest-volume pattern was "100000", representing TCP "yes" and all other factors "no"; it included 22.3% and 20.5% of connections in the derivation and validation samples, respectively. The lowest-volume pattern differed between the two samples: "101110" (0.03%) in the derivation sample and "101000" (0.13%) in the validation sample. Overall, the top three high-volume patterns included approximately 91.5% of connections in the derivation sample and 86.5% in the validation sample (Table 3). The pattern-level mean probability of being anomalous based on the "gold-standard" outcome ranged from 0.0323 ("101000") to 0.9950 ("000000") in the derivation sample and from 0.0262 ("101010") to 0.9977 ("000000") in the validation sample. Using a probability of 0.5 as the classification threshold, six patterns in the derivation sample and four in the validation sample were correctly matched with the "gold-standard" results, including all of the top three high-volume patterns. Thus, although the matched rates at the pattern level do not seem high, they were remarkably high at the connection level: approximately 92.9% and 86.5% of connections in the derivation and validation samples, respectively, could be correctly matched. Based on the estimated parameters, probabilities of being abnormal and normal at the connection level were calculated for both the derivation and validation samples.
The probability of being anomalous ranged from 0.00 to 1.00 with a mean value of 0.9517 (SD = 0.2067); the 25th to 75th percentiles were 0.9869 and 1.0000, respectively for all abnormal connections, and the probability of being normal ranged from 0.00 to 1.00 with a mean of 0.7041

Table 1
Descriptive analysis of data characteristics (values are # (%))

Factors                                             Derivation,               Derivation,          Validation
                                                    initial data              final 10% sample     (n = 310,786)
                                                    (n = 494,021)             (n = 49,399)
Type of protocol
  Transmission control protocol (TCP)               190,065 (38.47)           19,007 (38.47)       119,111 (38.33)
  User datagram protocol (UDP)                      20,354 (4.12)             2,035 (4.12)         26,703 (8.59)
  Hypertext transfer protocol (HTTP)                64,293 (13.01)            6,430 (13.02)        41,237 (13.26)
Network service on the destination
  RSTO or RSTOS0 or RSTR (RST)                      1,493 (0.30)              149 (0.30)           2,627 (0.73)
Content features within a connection suggested by domain knowledge
  Login-successfully                                73,237 (14.82)            7,324 (14.83)        53,645 (17.25)
  Guest-login                                       685 (0.14)                68 (0.14)            754 (0.24)
Outcome
  Normal connection                                 97,278 (19.69)            9,725 (19.69)        60,593 (19.50)


Table 2
Estimates based on the derivation sample

Risk factor                               Class c = 1 (normal)        Class c = 2 (anomaly)
                                          Estimate     SE             Estimate      SE
Transmission control protocol (TCP)       708.02       0.4809         6055.00       2.2556
User datagram protocol (UDP)              2.58         0.0224         405.93        0.2694
Hypertext transfer protocol (HTTP)        280.69       0.1949         0.4714        0.0149
RSTO or RSTOS0 or RSTR (RST)              8.04         0.3194         14858.00      58.6519
Login-successfully                        191.42       0.2249         817.79        0.0365
Guest-login                               7.80         0.2834         120.02        –
Intercept                                 0.67         0.0095         –             –

SE denotes standard error.

Table 3
Probabilities of being normal and anomalous, by pattern

            Derivation (n = 49,399)                                                 Validation (n = 310,786)
Pattern^a   Total (#)   Rate (%)   P(normal)^b   P(anomaly)^b   Observed outcome    Total (#)   Rate (%)   True outcome
                                                                (anomaly, %)                               (anomaly, %)
000000      28,360      57.41      0.0000        0.9999         0.9950              164,969     53.08      0.9977
010000      2,035       4.12       0.0000        1.0000         0.0531              26,703      8.59       0.3972
100000      10,993      22.25      0.0130        0.9869         0.9867              63,649      20.48      0.9781
100010      1,380       2.79       1.0000        0.0000         0.0652              12,017      3.87       0.7268
100011      68          0.14       1.0000        0.0000         0.5147              653         0.21       0.8224
100100      133         0.27       0.0070        0.9928         0.9774              1,559       0.50       0.9513
101000      557         1.13       1.0000        0.0000         0.0323              402         0.13       0.9179
101010      5,860       11.86      1.0000        0.0000         0.0381              40,268      12.96      0.0262
101110      13          0.03       1.0000        0.0000         0.6923              566         0.18       1.0000

a Order of risk factors: TCP (yes/no), UDP (yes/no), HTTP (yes/no), RST (yes/no), login-successfully (yes/no), and guest-login (yes/no).
b The validation sample has the same probabilities of being normal and anomalous as the derivation sample.

(SD = 0.4561), and the 25th and 75th percentiles were 0.00 and 1.00, for all normal connections in the derivation sample. The validation sample showed the same pattern of probability distributions: the means were 0.9869 (SD = 0.0966) and 0.7720 (SD = 0.4194), and the 25th to 75th percentiles were 0.9869 to 1.0000 and 0.9999 to 1.0000, for abnormal and normal connections, respectively.

3.3. Evaluations

Table 4 shows the parameters estimated by the LR model, which modeled the log-odds of the "gold-standard" outcome as a function of the risk factors. While all factors were statistically significantly associated with the outcome, RST and guest-login predict anomalous connections and the remaining factors predict normal connections. Overall, there was no remarkable difference in estimated parameters between the two samples, except for the guest-login factor, whose effect size was reduced by approximately 53.8%. Using the same classification threshold (0.5), Table 5 compares the classification results obtained from the LC and LR models for the two samples. The LR model achieved higher performance on all six measures than the LC model, but the differences between the two models were not remarkable. The correctly classified rates were 94.8% vs. 98.4% in the derivation sample and 90.6% vs.

Table 4
Parameters estimated by the logistic regression model

Parameter^a                               Derivation (n = 49,399)     Validation (n = 310,786)
                                          Estimate     SE^b           Estimate     SE^b
Transmission control protocol (TCP)       -1.8598      0.0991         -2.1782      0.0583
User datagram protocol (UDP)              -8.1664      0.1296         -6.4935      0.0530
Hypertext transfer protocol (HTTP)        -3.8846      0.0904         -4.2664      0.0326
RSTO or RSTOS0 or RSTR (RST)              4.1306       0.6193         5.2287       0.1254
Login-successfully                        -4.7720      0.0790         -3.0045      0.0330
Guest-login                               1.4059       0.2508         0.6499       0.1040
Intercept                                 5.2848       0.0838         6.0763       0.0515

a Anomaly (yes/no) as outcome.
b SE denotes standard error.
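Coefficients such as those in Table 4 come from maximizing the likelihood of a model of form (5). The paper fitted this with SAS/STATA; the sketch below is our own plain-Python gradient-ascent illustration on synthetic data, with illustrative names throughout:

```python
import math

def fit_logistic(X, y, lr=0.1, n_iter=3000):
    """Fit log[p/(1-p)] = b0 + b1*x1 + ... (as in Eq. (5)) by gradient ascent."""
    k = len(X[0])
    b = [0.0] * (k + 1)                       # b[0] is the intercept
    for _ in range(n_iter):
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            z = b[0] + sum(bj * xj for bj, xj in zip(b[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))
            err = yi - p                      # gradient of the log-likelihood w.r.t. z
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        b = [bj + lr * g / len(y) for bj, g in zip(b, grad)]
    return b

def predict(b, xi):
    """Predicted probability of anomaly for one connection."""
    z = b[0] + sum(bj * xj for bj, xj in zip(b[1:], xi))
    return 1.0 / (1.0 + math.exp(-z))

# Toy binary risk factors: x1 behaves protectively, x2 raises anomaly risk.
X = [(1, 0), (1, 0), (1, 1), (0, 1), (0, 1), (0, 0)] * 20
y = [0, 0, 1, 1, 1, 0] * 20
b = fit_logistic(X, y)
```

As in Table 4, the sign of each fitted coefficient indicates whether a factor pushes a connection toward the anomalous (positive) or normal (negative) class.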


Table 5
Evaluation of classification results

                                 Latent class model            Logistic regression model
Measure of classification        Derivation    Validation      Derivation    Validation
ROC area                         0.8812        0.8294          0.9831        0.9779
Sensitivity                      0.9905        0.9550          0.9887        0.9148
Specificity                      0.7718        0.7038          0.9666        0.9676
Positive predictive value        0.9466        0.9301          0.9918        0.9915
Negative predictive value        0.9524        0.7911          0.9545        0.7332
Correctly classified             0.9475        0.9060          0.9844        0.9250
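The measures in Table 5 follow standard definitions over the confusion matrix at a 0.5 cutoff. A sketch with toy probabilities and labels (not the paper's data):

```python
def classification_measures(probs, labels, threshold=0.5):
    """Sensitivity, specificity, PPV, NPV and accuracy at a probability cutoff.

    probs: predicted probabilities of anomaly; labels: gold standard, 1 = anomaly.
    """
    tp = fp = tn = fn = 0
    for p, y in zip(probs, labels):
        pred = 1 if p >= threshold else 0
        if pred and y:
            tp += 1          # anomaly correctly flagged
        elif pred and not y:
            fp += 1          # false alarm
        elif not pred and y:
            fn += 1          # missed anomaly
        else:
            tn += 1          # normal correctly passed
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "correctly_classified": (tp + tn) / len(labels),
    }

probs = [0.9, 0.8, 0.6, 0.4, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0]
m = classification_measures(probs, labels)
```

Varying `threshold` over [0, 1] and plotting sensitivity against 1 − specificity traces the ROC curve whose area is also reported in Table 5.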

Fig. 2. Performance differences between the latent class and logistic regression models.

92.5% in the validation sample; the largest difference was in specificity, which measures the probability that a truly normal connection is classified as normal (Fig. 2). Theoretically, the LR model is expected to perform better because its classifications are derived from an explicit relationship between the observed outcome and its predictors, while the LC model's classifications are based on the predictors alone.

4. Discussion

Over the decades, various statistical modeling approaches have been developed and proposed for intrusion detection, many of which fit within the scope of modeling for associations between an outcome and its predictors. Such approaches require labeled abnormal events in the training dataset, a requirement that cannot always be satisfied, particularly in many dynamic, short-term, and ad hoc wireless network access environments. Consequently, the ability to train intrusion detection systems with attack-free data has become an essential challenge for modern network security and intrusion detection. Our study presents a latent class modeling approach to address this challenge. Overall, the proposed model shows good performance in ROC area, sensitivity, positive and negative predictive values, and classification agreement rate, and it provides a different angle from which to analyze network traffic data.

The current study is, to our knowledge, the first to apply a Bayesian-based LC modeling approach to intrusion detection. It brings two contributions to the network security area. First, it provides a novel, practical, coherent, and statistically robust technique for analyzing network traffic audit data when a labeled or known anomaly outcome is unavailable. Traditionally, such data are primarily suited to rule-based algorithms, which require developing a large set of rules founded on system policies and historical experience. Our study represents an alternative to the rule-based algorithm whose development is far less labor- and time-intensive. Second, although the outcome of a network connection is hard to measure directly with a single observable variable during a session, it can hypothetically be inferred from a series of observed variables; hence the underlying outcome of a connection is inferable. This is the essential idea behind fitting a LC model for intrusion detection, and our study demonstrates that it is practically achievable in a relatively simple and statistically testable way. Some conventional multivariate modeling methods that do not require an observed outcome variable in the training dataset, such as factor analysis and cluster analysis, also have potential for intrusion detection [17]. However, these methods mainly focus on data reduction or subject grouping, may carry more restrictive assumptions [18] (e.g., factor analysis requires all variables to be continuous), and do not provide classification results at the probability level, which is more suitable for intrusion detection. The LC model employed in this study provides a probability-based classification method that carries much richer information than some conventional methods [19] and gives other detection components or network administrators robust data on which to base better decisions.

Our study has limitations that warrant discussion.
First, the usual latent class model assumes latent conditional independence: conditional on the level of the latent variable, the manifest variables are independent [20]. Such a "conditional independence" or "local independence" assumption can be unrealistic. For example, a stream of network traffic with many positive risk factors may have a stronger or more salient abnormal level; such cases are more likely to elicit a positive classification by all manifest variables, which could violate the conditional independence assumption. Uebersax [21] provided a good summary of previous studies and a practical guide for dealing with conditional dependence, and Zhang [22] addressed this problem in the


framework of hierarchical latent class models. Second, the six risk factors used in this study may not fully represent the attributes of some typical network traffic; nevertheless, these factors were selected from the DARPA-MIT off-line intrusion detection evaluation data, which remain the most widely used public benchmark for testing intrusion detection systems and the most comprehensive data available today. In practice, the process of selecting robust risk factors should take into account the characteristics of the particular network system and the attributes of the audit log data that a LC model will analyze, as well as recommendations from previous studies on risk factor selection [23–26]. Finally, as shown in Table 5, the specificity of the LC model-based classification seems low, which could lead to a high false positive rate. In some high-volume network systems, a 20% false positive rate can yield approximately 1500 alarms per second, even without excluding repeated connections; validating every such case for review is not practical. Thus, a combination of the proposed approach with existing solutions is expected; for example, a hierarchical and hybrid intrusion detection system could be constructed using the LC and LR models together.

In conclusion, while the basic concept of using a Bayesian approach for intrusion detection has been introduced previously [27–31], this study further illustrates the potential advantages of the LC modeling approach for analyzing network traffic data. The latent class model can be estimated with easily programmed code or available software (e.g., SAS, WinBUGS, or STATA), and the results can be used to construct Bayesian decision rules for optimal detection or classification in combination with other detection algorithms and procedures.
Because of these advantages, the presented LC modeling approach could be useful for securing high-speed mobile wireless and ad hoc network systems, where only limited information and few known abnormal events are available for analysis.

Acknowledgements

The authors thank anonymous reviewers for helpful suggestions and comments, Keith W. Hebert, System Administrator at Qualidigm, for his valuable comments, and Joy Dorin, Manager of Marketing and Communications Services at Qualidigm, for her editorial commentary. The content of this publication does not necessarily reflect the views or policies of Yale University, Yale New Haven Health, or Qualidigm; nor does mention of trade names, commercial products, or organizations imply endorsement by Yale University, Yale New Haven Health, or Qualidigm. The authors assume full responsibility for the accuracy and completeness of the ideas presented.

References

[1] P. Helman, G. Liepins, Statistical foundations of audit trail analysis for the detection of computer misuse, IEEE Transactions on Software Engineering 19 (9) (1993) 886–901.


[2] S. Masum, E.M. Ye, Q. Chen, K. Noh, Chi-square statistical profiling for anomaly detection, in: Proceedings of the 2000 IEEE Workshop on Information Assurance and Security, 2000, pp. 182–188.
[3] S. Mukkamala, A. Sung, Intrusion detection systems using adaptive regression splines, in: Proceedings of the 6th International Conference on Enterprise Information Systems, 2004, pp. 26–33.
[4] Y. Wang, A multinomial logistic regression modeling approach for anomaly intrusion detection, Computers & Security 24 (8) (2005) 662–674.
[5] R. Lippmann, S. Cunningham, Improving intrusion detection performance using keyword selection and neural networks, in: Proceedings of the Second International Workshop on Recent Advances in Intrusion Detection (RAID99), West Lafayette, Indiana, 1999.
[6] Z. Zhang, J. Li, C.N. Manikopoulos, J. Jorgenson, J. Ucles, HIDE: a hierarchical network intrusion detection system using statistical preprocessing and neural network classification, in: Proceedings of the 2001 IEEE Workshop on Information Assurance and Security, 2001, pp. 85–90.
[7] S. Mukkamala, G. Janoski, A. Sung, Intrusion detection using neural networks and support vector machines, in: Proceedings of the IEEE 2002 International Joint Conference on Neural Networks, 2002, pp. 1702–1707.
[8] L. Goodman, Exploratory latent structure analysis using both identifiable and unidentifiable models, Biometrika 61 (1974) 215–231.
[9] A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, Series B 39 (1977) 1–38.
[10] A. Skrondal, S. Rabe-Hesketh, Generalized Latent Variable Modeling: Multilevel, Longitudinal and Structural Equation Models, Chapman & Hall/CRC Press, Boca Raton, FL, 2004.
[11] KDD Cup 1999 data, retrieved February 19, 2006. Available from: .
[12] R.K. Cunningham, R.P. Lippmann, D.J. Fried, S.L. Garfinkle, I. Graf, K.R. Kendall, et al., Evaluating intrusion detection systems without attacking your friends: the 1998 DARPA intrusion detection evaluation, SANS, 1999.
[13] R. Lippmann, W.J. Haines, D.J. Fried, J. Korba, K. Das, The 1999 DARPA off-line intrusion detection evaluation, Computer Networks 34 (4) (2000) 579–595.
[14] S. Mukkamala, G.R. Tadiparthi, N. Tummala, G. Janoski, Audit data reduction for intrusion detection, in: Proceedings of the IEEE 2003 International Joint Conference on Neural Networks, 2003, pp. 456–460.
[15] Y. Wang, L. Seidman, Risk factors to retrieve anomaly intrusion information and profile user behavior, International Journal of Business Data Communications and Networking 2 (1) (2006) 41–57.
[16] D.W. Hosmer, S. Lemeshow, Applied Logistic Regression, second ed., Wiley, New York, 2000.
[17] M. Shyu, S. Chen, K. Sarinnapakorn, L. Chang, A novel anomaly detection scheme based on principal component classifier, in: Proceedings of the IEEE Foundations and New Directions of Data Mining Workshop, in conjunction with the 3rd IEEE International Conference on Data Mining (ICDM), 2003.
[18] H. Bozdogan, Statistical Data Mining and Knowledge Discovery, Taylor & Francis/CRC Press, Boca Raton, FL, 2003, pp. 373–383.
[19] J.A. Hagenaars, A.L. McCutcheon, Applied Latent Class Analysis, Cambridge University Press, Cambridge, 2002.
[20] P. Congdon, Bayesian Statistical Modeling, Wiley, New York, 2002.
[21] J. Uebersax, A practical guide to local dependence in latent class models. Available from: .
[22] N. Zhang, Hierarchical latent class models for cluster analysis, Journal of Machine Learning Research 5 (2004) 697–723.
[23] W. Lee, S.J. Stolfo, A framework for constructing features and models for intrusion detection systems, ACM Transactions on Information and System Security (TISSEC) 3 (4) (2003) 227–261.



[24] S. Chebrolu, A. Abraham, J.P. Thomas, Feature deduction and ensemble design of intrusion detection systems, Computers & Security 24 (4) (2005) 295–307.
[25] G.H. Kayacik, A.N. Zincir-Heywood, Selecting features for intrusion detection: a feature relevance analysis on KDD 99 benchmark, in: Proceedings of the Third Annual Conference on Privacy, Security and Trust, 2005.
[26] Y. Wang, J. Cannady, Develop a composite risk score to detect anomaly intrusion, in: Proceedings of the IEEE SoutheastCon 2005, 2005, pp. 445–449.
[27] W. DuMouchel, Computer intrusion detection based on Bayes factors for comparing command transition probabilities, Technical Report 91, National Institute of Statistical Sciences, 1999, Available from: .
[28] M. Schonlau, W. DuMouchel, W.H. Ju, A.F. Karr, M. Theus, Y. Vardi, Computer intrusion: Detecting masquerades, Statistical Science 16 (1) (2001) 58–74.
[29] W.H. Ju, Y. Vardi, A hybrid high-order Markov chain model for computer intrusion detection, Journal of Computational and Graphical Statistics 10 (2001) 277–295.
[30] D. Barbará, N. Wu, S. Jajodia, Detecting novel network intrusions using Bayes estimators, in: Proceedings of the 1st SIAM International Conference on Data Mining, 2001, pp. 24–29.
[31] S.L. Scott, A Bayesian paradigm for designing intrusion detection systems, Computational Statistics and Data Analysis 45 (2004) 69–83.

Yun Wang, PhD, is a senior biostatistician and information specialist at the Center for Outcomes Research and Evaluation, Yale University and Yale-New Haven Health System, and Qualidigm. He has degrees in mathematics, computer science, information systems, and criminal law with a concentration in criminal statistics. His research interests include developing large, complex information systems and applying statistical modeling techniques to information analysis, information security, and patient privacy protection.

Inyoung Kim, PhD, is a postdoctoral associate in the Department of Epidemiology and Public Health, Yale University. She has degrees in mathematics, biostatistics, and statistics, and her research interests include developing and applying statistical modeling techniques for data security, biometrics, and bioinformatics.

Gaston Mbateng, PhD, was a senior information analyst at Qualidigm. He earned a PhD from the University of Illinois at Chicago. Before moving to industry, where he specializes in statistical data modeling, he taught at Bradley University and later at Roosevelt University. He is currently employed at Discover Financial Services.

Shih-Yieh Ho, PhD, MPH, is head of the Analysis Division and a senior health information analyst at Qualidigm. She earned degrees in quantitative and healthcare research fields, and has progressive experience in performing quantitative and qualitative analyses of healthcare-related information and Health Insurance Portability and Accountability Act security compliance.