Applied Mathematics and Computation 205 (2008) 735–750
Applied Mathematics and Computation journal homepage: www.elsevier.com/locate/amc
Exception rules in association rule mining
David Taniar a,*, Wenny Rahayu b, Vincent Lee a, Olena Daly a
a Clayton School of Information Technology, Monash University, Australia
b Department of Computer Science and Computer Engineering, La Trobe University, Australia
Article info
Keywords: Data mining; Association rules; Exception rules; Negative association rules; Association rule mining; Support; Confidence; Exceptionality; Knowledge discovery; Fuzzy association rules
Abstract
Previously, exception rules have been defined as association rules with low support and high confidence. Exception rules are important in data mining, as they form rules that can be categorized as an exception. This is the opposite of general association rules in data mining, which focus on high support and high confidence. In this paper, a new approach to mining exception rules is proposed and evaluated. A relationship between exception and positive/negative association rules is considered, whereby the candidate exception rules are generated based on knowledge of the positive and negative association rules in the database. As a result, the exception rules exist in the form of negative, as well as positive, association. A novel exceptionality measure is proposed to evaluate the candidate exception rules. The candidate exceptions with high exceptionality form the final set of exception rules. Algorithms for mining exception rules are developed and evaluated using an exceptionality measurement, the desired performance of which has been proven. © 2008 Elsevier Inc. All rights reserved.
* Corresponding author: D. Taniar.
0096-3003/$ - see front matter © 2008 Elsevier Inc. All rights reserved. doi:10.1016/j.amc.2008.05.020

1. Introduction
Exception rule mining has attracted a lot of research interest [1–12]. Exception rules have been defined as rules with low support and high confidence [4]. A traditional example of exception rules is the rule Champagne ⇒ Caviar. The rule may not have a high support, but it has high confidence. The items are expensive so they are not frequent in the database, but they are always bought together so the rule has high confidence. Exception rules provide valuable knowledge about database patterns.

This paper presents exception rules mining based on association rules in databases. Exception rules describe unusual, contradictory knowledge in the database. An interconnection between exception rules and association rules will be explored. Based on the knowledge about association rules in the database, the exception rules will be generated. In this paper, we consider that association rules may exist in the form of positive association, as well as negative association [13–15]. Since the exception rules are the opposite of association rules, the exception rules exist in the form of negative, as well as positive, association. A novel exceptionality measure will be proposed to evaluate the reliable exception rules. The exceptions with high exceptionality are the reliable exception rules.

The significance of exception rules has been highlighted in a number of research works [4,7–10,16]. Something that contradicts a user's common belief is bound to be interesting. Hussain et al. [4] state that "exceptions can take an important role in making critical decisions". Most researchers focus on association rules that represent common phenomena that occur with high support and confidence. Exceptions, despite their important role in decision making, are still foreign to many users. Exceptions are no doubt highly valuable. Liu et al. [16] maintain that "reliable exceptions are unknown, unexpected, or contradictory to what the user believes. Hence, they are novel and potentially more interesting than strong patterns to the user". For example, the rule 'jobless applicants are granted credit' will be more novel than the rule 'jobless applicants are not granted credit'. They stress that "an exception rule is often beneficial since it differs from a common sense rule, which is often a basis for people's daily activity".

The above-mentioned examples demonstrate the importance of exception rules in data mining. Association rules and exception rules discover different kinds of rules. Association rules present commonsense knowledge, whereas exception rules represent surprising and unusual facts in the data.

The rest of this paper is organized as follows: Section 2 summarizes existing work in exception rules. Section 3 describes exception rules in detail. Section 4 presents our proposed exceptionality measure. Section 5 describes our proposed algorithm and explains the proposed methods with a detailed example. Section 6 presents our experimental results, and Section 7 gives the conclusions.
2. Existing work
Existing work on the discovery of exception rules can be classified as either (i) directed or (ii) undirected. A directed search obtains a set of exception rules each of which contradicts a user-specified belief [5,17,18]. An undirected search obtains a set of pairs of an exception rule and a general rule [4,8–10,16].

In a directed search of exception rules, user-specified beliefs are obtained first. Each of the discovered exception rules contradicts the user-supplied beliefs. A post-analysis of the discovered database patterns is performed to identify the most interesting patterns [5]. The technique is characterized by asking the user to specify a set of patterns according to his/her previous knowledge or intuitive feelings. This specified set of patterns is then used by a fuzzy matching algorithm to match and rank the discovered patterns. The assumption is that some amount of domain knowledge and the user's interests are implicitly embedded in his/her specified patterns. In general, the discovered patterns are ranked according to their level of conformity to the user's knowledge, their unexpectedness, or their actionability. In terms of unexpectedness, patterns are interesting if they are unexpected or previously unknown to users. In terms of actionability, patterns are interesting if users can use them to their advantage. With such rankings, a user can simply check the few patterns on the top of the list to confirm his/her intuitions (or previous knowledge), or to find those patterns that are contrary to his/her expectations, or to discover those patterns that are actionable.

Another method of discovering unexpected patterns is proposed that takes into consideration the prior background knowledge of decision makers [17]. This prior knowledge constitutes a set of expectations or beliefs about the problem domain.
The proposed method of discovering unexpected patterns uses these beliefs to initiate the search for patterns in data that contradict the beliefs. The rule A ⇒ B is unexpected with respect to the belief X ⇒ Y on the dataset D if B and Y logically contradict each other, the antecedents of the belief and the rule hold for the same statistically large subset of D, and the rule A, X ⇒ B holds.

Actionable and unexpected interestingness measures have also been studied, and the relationship between them has been examined [18]. The unexpected measure of interestingness is defined in terms of the user's belief system. The interestingness of a pattern is expressed in terms of how it affects the belief system. The paper also discusses how this unexpected measure of interestingness can be used in the discovery process. The beliefs are classified into hard and soft beliefs. Hard beliefs cannot be changed with new evidence. The user can change the soft beliefs with new evidence. The degrees of soft beliefs can be assigned using a Bayesian, Dempster–Shafer, frequency, Cyc's, or statistical approach.

For the undirected method of searching exception rules, the exception rules are obtained based on general rules or commonsense rules. Patterns are categorized according to whether they are strong, weak, or random [16]. Strong patterns hold for numerous objects and are usually consistent with the expectations of experts. Weak patterns are the reliable exceptions representing a relatively small number of objects. Random patterns represent random and unreliable exceptions. Deviation analysis is used to identify interesting exceptions and explore reliable ones. The approach consists of four phases: rule induction and focusing; contingency table and deviations; positive, negative and outstanding deviations; and reliable exceptions. In the rule induction and focusing phase, the strong patterns are obtained and relevant attributes are defined.
In the contingency table and deviations phase, the deviations are calculated for all combinations of selected attributes. In the positive, negative and outstanding deviations phase, a threshold is determined to distinguish deviations caused by chance and interesting deviations are obtained. In the reliable exceptions phase, the discovered weak patterns are verified as having low support and high confidence. Another method for mining exception rules is studied based on a novel measure which estimates interestingness relative to its corresponding commonsense rule and reference rule [4]. Commonsense rules are rules with high support and high confidence. Reference rules are rules with low support and low confidence. Exception rules are defined as rules with low support and high confidence. The formula for the relative interestingness measure is derived based on information theory and statistics. The measure has two components: interestingness based on the rule’s support, and interestingness based on the rule’s confidence.
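The three rule categories described above for [4] can be illustrated with a small threshold test. This is only a sketch of the support/confidence quadrants named in the text, not the relative interestingness measure itself; the function name `classify_rule` and the use of `minsup` and `minconf` as the dividing thresholds are assumptions made for illustration.

```python
def classify_rule(support, confidence, minsup, minconf):
    """Label a rule by the support/confidence quadrant it falls in.

    Assumption: "high" means at or above the threshold, "low" means below it.
    """
    high_support = support >= minsup
    high_confidence = confidence >= minconf
    if high_support and high_confidence:
        return "commonsense"   # high support, high confidence
    if not high_support and high_confidence:
        return "exception"     # low support, high confidence
    if not high_support and not high_confidence:
        return "reference"     # low support, low confidence
    # High support with low confidence is not named in the text.
    return "unclassified"
```

For instance, with hypothetical thresholds minsup = 0.10 and minconf = 0.60, a rule with support 0.02 and confidence 0.90 falls in the exception quadrant.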
Table 1
Summary of exception rule mining

Authors | Disadvantages | Possible improvement

Directed search
Liu et al. [5]; Padmanabhan and Tuzhilin [6]; Silberschatz and Tuzhilin [18] | User's beliefs tend to change with time. Different users have different beliefs and expectations of the system. Different domains require different attributes and modified data mining systems | An automated data mining system is required that discovers exceptions independently of users and domains (undirected search)

Undirected search
Liu et al. [16] | In the rule induction and focusing phase, a user is still involved to specify relevant attributes | To search for relevant attributes from the strong patterns
Suzuki [7] | The scheduling policies are based on multiple balanced search-trees that may cause huge memory consumption in large DB; one of the policies still requires updating by the user | To use alternative data structures; do not involve user in the data mining process
Suzuki and Zytkow [10]; Hussain et al. [4] | Based on commonsense rules; may be incomplete | New researchers may discover new directions in the exception rules mining and novel improved interest measures
In Suzuki [7], an algorithm is presented for discovering pairs of an exception rule and a common sense rule under a prespecified schedule. Two scheduling policies are proposed to update the thresholds used in the mining algorithm [19,20]. The scheduling policies are based on a novel data structure. The data structure consists of multiple balanced search-trees. One of the policies represents a full specification of updating by the user, and the other iteratively improves a threshold value by deleting the worst pattern with respect to its corresponding index. Suzuki and Zytkow [10] present an algorithm that seeks every possible exception rule which contradicts a common sense rule and satisfies several assumptions of simplicity. Exception rules are categorized into eleven categories, and a unified algorithm is proposed for discovering all of them.

Table 1 summarizes previous work in exception rules mining. It highlights the disadvantages, if any, and possible future solutions. Although there have been a number of research works on exception rules mining, in the directed search the discovered rules can be biased by the expectations of a particular user. The discovered rules may not be interesting to other users. Besides, the rules are domain-dependent, and insufficient domain knowledge may cause improbable results. In undirected search, a user may still be involved in parts of the discovery process; it would be preferable to replace this involvement with a fully automated system. Hussain et al. [4] and Suzuki and Zytkow [10] presented excellent research projects in exception rules mining. However, their search for exception rules is based on commonsense rules, which may be incomplete as opposed to a search based on both positive and negative rules. Our proposed approach belongs to the undirected search. Therefore, we avoid the above limitations of a directed search.

In this paper, we describe a new approach to mining exception rules. Our approach is based on an undirected search.
An interconnection between strong association rules and exception rules will be established. A novel exceptionality measure will be proposed to evaluate the candidate exception rules. The candidate exceptions with high exceptionality will form the final set of exception rules.

3. Exception rules
3.1. Motivating examples
The general definition of an exception is 'something unusual, something that does not conform to the rule, or something deviating from the norm'. The key terms in this general definition of an exception are the words rule and norm. Therefore, to discover an exception in the given environment, we need to know what the common rules or the norm are in the given environment. Before we start searching for exceptions in a database, we have to discover the strong rules in the database, whether they be in the form of positive or negative rules. These strong rules define the norm or the standard for the given database. Our search for the exception rules will be based on these strong association rules.

For instance, we have a strong positive association rule about two people, Person1 and Person2. Suppose that they always come to work together. The fact is that they are always seen arriving at the workplace together, let us say, every day for a year. The rule that Person1 is always seen with Person2 {Person1 ⇒ Person2} is a strong positive association rule. The rule {Person1 ⇒ Person2} has a high support because we have a whole year of observations, and the rule has a high confidence because both appear together on a daily basis. The rule {Person1 ⇒ Person2} forms a norm, or a standard, for the given environment. One day, after conforming to the strong positive rule {Person1 ⇒ Person2}, one of the people Person1 or Person2 comes to work on his/her own. This shows that the usual routine that everyone is accustomed to has been broken.
The fact is that the strong positive rule {Person1 ⇒ Person2} has been violated and an exception has occurred {Person1 ⇒ ~Person2}. The exception rule {Person1 ⇒ ~Person2} states that the presence of Person1 implies the absence of Person2. Note that the exception rule {Person1 ⇒ ~Person2} exists in the form of a negative association, whereby ~Person2 indicates the absence of this party. This illustrates how a 'negative' exception rule is derived from a strong 'positive' association rule.
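The derivation in this example amounts to keeping the antecedent and flipping the presence of the consequent. A minimal sketch of that step follows; the `Rule` type, its field names, and the `contradicting_rule` helper are hypothetical and introduced only for illustration.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    """A two-item rule; `negated` marks an absent consequent (written ~item)."""
    antecedent: str
    consequent: str
    negated: bool = False

    def __str__(self):
        sign = "~" if self.negated else ""
        return f"{{{self.antecedent} => {sign}{self.consequent}}}"


def contradicting_rule(strong_rule):
    """Return the candidate exception: same antecedent, opposite consequent."""
    return strong_rule._replace(negated=not strong_rule.negated)


norm = Rule("Person1", "Person2")        # strong positive rule
candidate = contradicting_rule(norm)     # candidate exception in negative form
print(norm, "->", candidate)
```

Applying the same flip to a rule whose consequent is already negated yields a rule with a positive consequent, so one helper covers both directions of the derivation.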
Let us consider the opposite scenario. We have a strong negative association rule that two people Person1 and Person2 never come to work together on any day for the past year. The rule that Person1 is never seen with Person2 {Person1 ⇒ ~Person2} is a strong negative association rule with a high support and a high confidence. One day, after a year of conforming to the strong negative rule {Person1 ⇒ ~Person2}, the people Person1 and Person2 come to work together. This is an unusual occurrence. The fact is that the strong negative rule {Person1 ⇒ ~Person2} has been violated and an exception has occurred {Person1 ⇒ Person2}. This illustrates how a 'positive' exception rule is derived from a strong 'negative' association rule.

The aforementioned examples about matchmaking in real life (the people Person1 and Person2) could be easily applied to other application domains. If we have a strong positive rule that when item A exists, item B is most likely to exist, then it is a strong positive association rule of {A ⇒ B}, and when this does not take place, it is an exception {A ⇒ ~B}. As another example, we have a strong negative rule that item A and item B never exist simultaneously {A ⇒ ~B}. When this does not take place, and both items A and B exist together, then it is an exception {A ⇒ B}.

We propose a classification of the exception rules according to the type of strong association rule. When the strong association rule is the positive association rule of a type {A ⇒ B}, we derive an exception rule of a type {A ⇒ ~B}. We call an exception rule of a type {A ⇒ ~B} a negative exception. When the strong association rule is the negative association rule of a type {A ⇒ ~B}, we derive an exception rule of a type {A ⇒ B}. We call an exception rule of a type {A ⇒ B} a positive exception. The proposed classification of exception rules, namely negative and positive exception rules, is detailed in the following sections.

3.2. Exception rules in a form of negative association
In this section, we consider the strong positive association rules and attempt to derive the exception rules that contradict the strong positive association rules. Let us consider the strong positive rule {A ⇒ B}. This rule has two items: A and B. A rule that contradicts the strong positive rule {A ⇒ B} is a negative rule {A ⇒ ~B}, where ~B indicates a negative item. The strong positive rule {A ⇒ B} implies that the presence of item A implies the presence of item B. The contradicting negative rule {A ⇒ ~B} indicates that the presence of item A implies the absence of item B.

The contradicting rule that may become an exception rule is rule {A ⇒ ~B}. The antecedent of the strong positive rule is the same, but the consequent changes its value to the opposite value. The premises of the strong positive rule {A ⇒ B} and the possible exception rule {A ⇒ ~B} are the same. When the same premise results in contradicting consequents, this fact is abnormal. To ensure that the possible exception rule {A ⇒ ~B} contradicting the strong positive rule {A ⇒ B} is a reliable exception rule, we verify the exceptionality measure of the exception rule {A ⇒ ~B}. We also verify that the exception rule {A ⇒ ~B} is infrequent.

Exception rules describe the surprising and unusual trends in the database. Exception rules may be supported by only a small fraction of the database. The support of an exception rule has to be less than the minimum support value minsup. However, rules whose support is so far below minsup that it is close to zero may be noise. The noise represents random and unreliable database trends. Fig. 1 shows an illustrative scale of support in relation to the rules. To distinguish the noise from the exceptions, we set an interval of support that describes the exception rules. We call this range of support the exceptional interval.
The exceptional interval includes two support values, which are the lower bound lowerBound and the upper bound upperBound. The exceptional interval is presented as [lowerBound, upperBound]. The lower bound of the exceptional interval lowerBound is greater than 0% and greater than the noiseValue. The value noiseValue separates the noise in the database from the exception rules. The upper bound of the exceptional interval upperBound is less than the minimum support value minsup. The restrictions imposed on the lower and upper bounds of the exceptional interval are described below. Definition 1. The exceptional interval is defined by [lowerBound, upperBound], where lowerBound > noiseValue > 0, and upperBound < minsup.
Fig. 1. Rules in databases. [Figure: the database is partitioned by support. Positive and negative association rules: support ≥ minsup. Exception rules: support < minsup and support > noiseValue. Noise: support < noiseValue.]
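The constraints of Definition 1 can be checked mechanically. The sketch below is one possible encoding, with supports expressed as fractions; the class name `ExceptionalInterval` and its field names are assumptions, and the extra check that lowerBound does not exceed upperBound is added here for safety rather than taken from the definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExceptionalInterval:
    lower_bound: float
    upper_bound: float
    noise_value: float
    minsup: float

    def __post_init__(self):
        # Definition 1: lowerBound > noiseValue > 0 and upperBound < minsup.
        if not (self.lower_bound > self.noise_value > 0):
            raise ValueError("need lowerBound > noiseValue > 0")
        if not (self.upper_bound < self.minsup):
            raise ValueError("need upperBound < minsup")
        # Assumption (not stated in the definition): the interval is non-empty.
        if self.lower_bound > self.upper_bound:
            raise ValueError("need lowerBound <= upperBound")

    def contains(self, support):
        """True if `support` lies in [lowerBound, upperBound]."""
        return self.lower_bound <= support <= self.upper_bound
```

With hypothetical values lowerBound = 0.01, upperBound = 0.05, noiseValue = 0.005 and minsup = 0.10, a support of 0.03 lies inside the interval, while 0.001 (noise range) and 0.20 (association rule range) do not.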
Definition 2. Given a strong positive association rule {A ⇒ B}, we call the contradicting rule {A ⇒ ~B} an exception in the negative form if the support of the contradicting rule, support(A, ~B), belongs to the exceptional interval [lowerBound, upperBound].

During the data mining process, we obtain the strong positive association rules of a type {A ⇒ B} and exception rules of a type {A ⇒ ~B} at the same time. To discover the strong positive association rules of a form of {A ⇒ B}, we have to verify the support and confidence value of the rules. To distinguish the exception rules of a form of {A ⇒ ~B}, we have to verify whether the support of the rule belongs to the exceptional interval.

Definition 3. The conditions below are defined for mining exception rules in a negative form:
Condition 1: support(A, B) ≥ minsup, and confidence(A ⇒ B) ≥ minconf
Condition 2: support(A, ~B) ∈ [lowerBound, upperBound]

Condition 1 verifies whether the rule {A ⇒ B} is the strong positive association rule. The support of the rule {A ⇒ B} has to be equal to, or greater than, the minimum support value minsup. The confidence of the rule {A ⇒ B} has to be equal to, or greater than, the minimum confidence value minconf. Condition 2 verifies whether the contradicting rule {A ⇒ ~B} is an exception rule. The support of the rule {A ⇒ ~B} has to belong to the exceptional interval [lowerBound, upperBound].

In association rules mining, the strong association rules have high support and high confidence. In exception rules mining, the support of the exception rules has to belong to the exceptional interval. But the confidence of a negative exception rule is low by definition. Let us say the minimum confidence minconf equals 60%. Then, the confidence of the strong association rule {A ⇒ B} is at least 60%. This means that at least 60% of transactions that contain item A also contain item B. Then, only the remaining 40% or less of the transactions that contain item A do not contain item B.
Therefore, the confidence of the exception rule {A ⇒ ~B} is 40% or less. We conclude that if the minimum confidence equals minconf%, the confidence of an exception rule is at most (100 − minconf)%. If the minimum confidence minconf is less than, or equal to, 50%, then both the strong association rule and the exception rule may have a high confidence. However, we cannot rely on such an assumption. We conclude that the confidence measure is inappropriate for measuring the characteristics of the proposed negative exceptions. A special measure has to be developed to distinguish the reliable exceptions from all other exceptions. The reliable exceptions in exception rules mining are the analogue of strong association rules in association rules mining. We named the special measure that distinguishes the reliable exceptions from all other exceptions the exceptionality measure. The reliable negative exception has to have a high exceptionality.

Definition 4. A reliable exception rule {A ⇒ ~B} has a high exceptionality and follows this condition: if exceptionality(A ⇒ ~B) ≥ minexcep, rule {A ⇒ ~B} is a reliable exception rule in a form of a negative association rule.

Now, we would like to generalize the definitions of negative exception rules with n items.

Definition 5. Given a set of items I in database D, the rule {A1 ... Ak ⇒ B1 ... Bm} is a strong positive association rule, where Ai ∈ I, i = 1, ..., k; Bi ∈ I, i = 1, ..., m; k + m = n; k ≥ 1; m ≥ 1. We call the contradicting rule {A1 ... Ak ⇒ C1 ... Cm}, where Ai ∈ I, i = 1, ..., k; Ci = Bi or Ci = ~Bi, i = 1, ..., m; NumberOf(~Bi) ≥ 1; k + m = n; k ≥ 1; m ≥ 1, a negative exception if the support of the contradicting rule support(A1 ... Ak ⇒ C1 ... Cm) belongs to the exceptional interval [lowerBound, upperBound].

The antecedent and consequent of the strong positive association rule {A1 ... Ak ⇒ B1 ... Bm} are subsets of the set of items I in a database D. The antecedent of the negative exception {A1 ... Ak ⇒ C1 ... Cm} is the same as the antecedent of the strong positive association rule {A1 ... Ak ⇒ B1 ... Bm}. The consequent of the strong positive association rule is formed by the set {B1 ... Bm}. The consequent of the negative exception is formed by the set {C1 ... Cm}. Each item Ci, i = 1, ..., m, of the set {C1 ... Cm} is either the item Bi or the negation of the item Bi of the set {B1 ... Bm}. The number of antecedent items in both the strong positive association rule and the negative exception is the same and equals k. The number of consequent items in both rules is the same and equals m. The total number of items in both the strong association rule and the exception rule is the same and equals the sum of the antecedent and consequent numbers k + m. The conditions k ≥ 1, m ≥ 1 require at least one item in both the antecedent and consequent of the strong positive association rule and the negative exception. The condition NumberOf(~Bi) ≥ 1 requires at least one negative item in the consequent of the negative exception rule {A1 ... Ak ⇒ C1 ... Cm}.

Given a strong positive association rule {A1 ... Ak ⇒ B1 ... Bm}, the support of the rule {A1 ... Ak ⇒ C1 ... Cm} must belong to the exceptional interval [lowerBound, upperBound], and hence this verifies that the rule {A1 ... Ak ⇒ C1 ... Cm} is a negative exception rule. Apart from this exceptional interval condition, the exceptionality of the negative exception rule must be equal to, or greater than, the minimum exceptionality value minexcep. If the exceptionality of the negative exception rule {A1 ... Ak ⇒ C1 ... Cm} is high, the rule is a reliable negative exception rule.
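For the two-item case, the two conditions for a negative exception can be checked directly on transaction counts. The following is a minimal sketch, assuming transactions are given as Python sets of item labels; the function name `negative_exception` and its return format are hypothetical, and the exceptionality test is deliberately left out. It also makes the confidence argument concrete: the confidence of the candidate is one minus the confidence of the strong rule, so it cannot exceed 1 − minconf.

```python
def negative_exception(transactions, a, b, minsup, minconf,
                       lower_bound, upper_bound):
    """Return support/confidence of the candidate {a => ~b}, or None.

    Condition 1: support(a, b) >= minsup and confidence(a => b) >= minconf.
    Condition 2: support(a, ~b) lies in [lower_bound, upper_bound].
    """
    n = len(transactions)
    count_a = sum(1 for t in transactions if a in t)
    count_ab = sum(1 for t in transactions if a in t and b in t)
    if n == 0 or count_a == 0:
        return None
    # Condition 1: {a => b} must be a strong positive association rule.
    if count_ab / n < minsup or count_ab / count_a < minconf:
        return None
    # Transactions that contain a but not b support the rule {a => ~b}.
    count_a_not_b = count_a - count_ab
    support_exc = count_a_not_b / n
    # Condition 2: the candidate's support must fall in the exceptional interval.
    if not (lower_bound <= support_exc <= upper_bound):
        return None
    return {"support": support_exc, "confidence": count_a_not_b / count_a}
```

Using integer counts rather than subtracting two floating-point supports avoids rounding effects at the interval boundaries.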
3.3. Exception rules in a form of positive association
As described earlier, a positive exception rule is derived from a strong negative association rule. In this section, we derive exception rules that contradict strong negative association rules. This exception rule then exists in the form of positive association. For example, based on a strong negative rule {A ⇒ ~B}, we would derive a contradictory positive rule {A ⇒ B}. The former implies that the presence of item A signals the absence of item B, whereas the latter implies that the presence of item A suggests the presence of item B. Note that the antecedent of both rules (i.e. the negative association rule and the positive exception rule) is the same (i.e. item A), but the consequent is contradictory. In the above example, item ~B in the negative association rule becomes item B in the positive exception rule.

Like the negative exception rules explained in the previous section, to make sure the possible exception rule {A ⇒ B} contradicting the strong negative rule {A ⇒ ~B} is the reliable exception rule, we verify the exceptionality measure of the exception rule {A ⇒ B}. We also need to verify that the support of the possible exception rule {A ⇒ B} belongs to the exceptional interval [lowerBound, upperBound].

Definition 6. Given a strong negative association rule {A ⇒ ~B}, we call the contradicting rule {A ⇒ B} a positive exception rule if the support of the contradicting rule {A ⇒ B} belongs to the exceptional interval [lowerBound, upperBound].

The first stage of the process is to obtain the strong negative association rules of a type {A ⇒ ~B} and exception rules of a type {A ⇒ B} at the same time. As in traditional association rule mining, we use support and confidence measures to assess the rules. Additionally, for the exception rule {A ⇒ B}, we have to verify whether the support belongs to the exceptional interval.

Definition 7. The conditions below are defined for mining positive exception rules:
Condition 1: support(A, ~B) ≥ minsup, and confidence(A ⇒ ~B) ≥ minconf
Condition 2: support(A, B) ∈ [lowerBound, upperBound]

Condition 1 uses minsup and minconf to check the validity of the negative association rule {A ⇒ ~B}, whereas condition 2 verifies whether the contradicting rule {A ⇒ B} is a positive exception rule by checking whether the support of the itemset AB belongs to the exceptional interval [lowerBound, upperBound]. To distinguish reliable positive exceptions from all other positive exceptions, we have to verify the exceptionality of the positive exceptions.

Definition 8. A reliable exception rule {A ⇒ B} has a high exceptionality and follows this condition: if exceptionality(A ⇒ B) ≥ minexcep, rule {A ⇒ B} is a reliable exception rule in the form of a positive association rule.

The above definitions describe the positive exception rules of the two items A and B. The generalized version of positive exception rules with n items is similar to that of the generalized negative exception rules as described in Definition 5. Given a strong negative association rule {A1 ... Ak ⇒ B1 ... Bm}, the positive exception rule has the form {A1 ... Ak ⇒ C1 ... Cm}: its antecedent is the same as that of the strong negative association rule, but its consequent is a different set {C1 ... Cm}. Each item Ci, i = 1, ..., m, of the set {C1 ... Cm} is either the item Bi or the negation of the item Bi of the set {B1 ... Bm}. Note that the number of antecedent items in both rules is equal (k in the above example), and the number of consequent items is also the same (m in the above example). The rule {A1 ... Ak ⇒ C1 ... Cm} is a positive exception rule if its support belongs to the exceptional interval [lowerBound, upperBound].
In addition to this, the exceptionality value of this positive exception rule has to be at least the minimum exceptionality measure minexcep to make it a reliable positive exception.

4. Proposed exceptionality measure
We propose a novel measure to distinguish the reliable exception rules from all other positive and negative exception rules. We name the novel measure the exceptionality measure. The minimum exceptionality minexcep is specified by a user along with the minimum support value minsup and minimum confidence value minconf. Exceptionality of an exception rule ExcRule given the corresponding association rule AssocRule is defined by the formula below:
\[
\text{Exceptionality}(\text{ExcRule}/\text{AssocRule}) = \text{FuzzySup}(\text{ExcRule}) + \text{FuzzyFraction}(\text{ExcRule}/\text{AssocRule}) + \text{Neglect}(\text{ExcRule}/\text{AssocRule}) \tag{1}
\]
The exceptionality measure of an exception rule ExcRule, given the strong association rule AssocRule Exceptionality(ExcRule/ AssocRule), consists of the three components, which we name the fuzzy support of an exception rule FuzzySup(ExcRule), the fuzzy fraction of an exception rule ExcRule given the strong association rule AssocRule FuzzyFraction(ExcRule/AssocRule) and the neglect measure of an exception rule ExcRule given the strong association rule Neglect(ExcRule/AssocRule).
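The additive structure of formula (1) can be sketched as follows. The region values are the illustrative ones listed for FuzzySup and FuzzyFraction in this section, not values prescribed by the method; treating the regions as half-open intervals (so that no support value falls between two regions) and reading the two middle FuzzyFraction regions as 20–40% and 40–60% are assumptions. The neglect component is passed in as a precomputed number, and all function names are hypothetical.

```python
def fuzzy_sup(support_pct):
    """Illustrative FuzzySup for the exceptional interval [1%, 5%]."""
    if 1 <= support_pct < 2:
        return 0.1
    if 2 <= support_pct < 3:
        return 0.5
    if 3 <= support_pct < 4:
        return 0.2
    if 4 <= support_pct <= 5:
        return 0.5
    return 0.0  # outside the exceptional interval


def fuzzy_fraction(exc_support_pct, assoc_support_pct):
    """Illustrative FuzzyFraction of support(ExcRule) / support(AssocRule)."""
    if assoc_support_pct <= 0:
        raise ValueError("the association rule must have positive support")
    ratio_pct = exc_support_pct / assoc_support_pct * 100
    if ratio_pct < 10:
        return 0.6
    if ratio_pct < 20:
        return 0.5
    if ratio_pct < 40:
        return 0.4
    if ratio_pct < 60:
        return 0.2
    return 0.0


def exceptionality(exc_support_pct, assoc_support_pct, neglect):
    """Formula (1): the sum of the three components."""
    return (fuzzy_sup(exc_support_pct)
            + fuzzy_fraction(exc_support_pct, assoc_support_pct)
            + neglect)
```

A candidate would then be kept as a reliable exception when the returned value is at least minexcep.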
1. The first component of the exceptionality measure is the fuzzy support of an exception rule FuzzySup(ExcRule), where
\[
\text{FuzzySup}(\text{ExcRule}) = \text{FuzzySup}(\text{support}(\text{ExcRule})) \tag{2}
\]
The support of an exception rule ExcRule has to belong to the exceptional interval [lowerBound, upperBound]. The lower and upper bounds define the acceptable support values of an exception rule. The acceptable support range [lowerBound, upperBound] is divided into regions. Each region of the exceptional interval is characterized by a corresponding fuzzy support value, which is specified by a domain expert. The function FuzzySup returns the value assigned to the region into which the support falls. An example of the exceptional interval and corresponding fuzzy support values is shown as follows, where the exceptional interval is [1%, 5%], that is, lowerBound equals 1% and upperBound equals 5%:
FuzzySup(0–0.99%) = 0; FuzzySup(1–1.99%) = 0.1; FuzzySup(2–2.99%) = 0.5; FuzzySup(3–3.99%) = 0.2; FuzzySup(4–5%) = 0.5; FuzzySup(5.01–100%) = 0.
This example shows that the exception rules with support in the range [2–3%] have a better chance of gaining a high exceptionality value.
2. The second component of the exceptionality measure is the fuzzy fraction of an exception rule ExcRule given the strong association rule AssocRule FuzzyFraction(ExcRule/AssocRule):
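\[ \mathrm{FuzzyFraction}(\mathit{ExcRule}/\mathit{AssocRule}) = \mathrm{FuzzyFraction}\left(\frac{\mathrm{support}(\mathit{ExcRule})}{\mathrm{support}(\mathit{AssocRule})} \times 100\%\right). \tag{3} \]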
The ratio support(ExcRule)/support(AssocRule) has to be relatively low. The exception rule may be only a small fraction of the corresponding association rule. Similar to the FuzzySup(ExcRule) above, the acceptable support of FuzzyFraction(ExcRule/ AssocRule) is divided into fuzzy regions. An example of FuzzyFraction regions is as follows:
FuzzyFraction(0–9.99%) = 0.6; FuzzyFraction(10–19.99%) = 0.5; FuzzyFraction(20–39.99%) = 0.4; FuzzyFraction(40–59.99%) = 0.2; FuzzyFraction(60–100%) = 0.
In the above example, the fuzzy fraction value decreases as the ratio grows: an exception rule whose support is less than 10% of the support of the corresponding association rule has the best chance of gaining a high exceptionality, followed by a rule whose support is equal to, or greater than, 10% and less than 20% of it.
3. The third component of the exceptionality measure is the neglect measure of an exception rule ExcRule given the strong association rule Neglect(ExcRule/AssocRule):
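\[ \mathrm{Neglect}(\mathit{ExcRule}/\mathit{AssocRule}) = \frac{\mathrm{support}(\text{only } \mathit{ExcRule})}{\mathrm{support}(\mathit{ExcRule})} + \frac{\mathrm{support}(\text{only } \mathit{AssocRule})}{\mathrm{support}(\mathit{AssocRule})}. \tag{4} \]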
Support of the only exception rule is the fraction of database transactions that contain only items of candidate exception and no other items. Support of the only association rule is defined similarly. The neglect measure defines the fraction of exception rules and corresponding association rules that occur in the database transaction on their own, while the rest of the database items are absent from the transaction. The measure describes the bond between the elements of the exception rule/association rule when no other database items are present. The higher the neglect measure, the stronger is the bond between the items. When the neglect measure returns a high value, no other items can influence the occurrence of the exception rule items.
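As an illustration, the three components of Eqs. (1)–(4) can be sketched in Python as follows. This is a minimal sketch rather than the authors' implementation: the region tables reproduce the example values given above, all function and variable names are hypothetical, and treating each region as a half-open interval starting at its lower bound (so that, for instance, 1.995% falls into the 1–1.99% region) is an assumption, since the example regions leave small gaps between their bounds.

```python
from bisect import bisect_right

# (region lower bound in percent, fuzzy value), taken from the Section 4 examples.
FUZZY_SUP = [(0.0, 0.0), (1.0, 0.1), (2.0, 0.5), (3.0, 0.2), (4.0, 0.5)]
FUZZY_SUP_UPPER = 5.0  # upperBound of the exceptional interval [1%, 5%]
FUZZY_FRACTION = [(0.0, 0.6), (10.0, 0.5), (20.0, 0.4), (40.0, 0.2), (60.0, 0.0)]


def lookup(regions, percent):
    """Return the fuzzy value of the region with the largest lower bound <= percent."""
    bounds = [lower for lower, _ in regions]
    return regions[bisect_right(bounds, percent) - 1][1]


def fuzzy_sup(support_pct):
    """Eq. (2): fuzzy support; supports above upperBound get 0."""
    if support_pct > FUZZY_SUP_UPPER:
        return 0.0
    return lookup(FUZZY_SUP, support_pct)


def fuzzy_fraction(exc_support, assoc_support):
    """Eq. (3): fuzzy value of support(ExcRule) / support(AssocRule) * 100%."""
    return lookup(FUZZY_FRACTION, 100.0 * exc_support / assoc_support)


def neglect(only_exc, exc_support, only_assoc, assoc_support):
    """Eq. (4): share of each rule's transactions that contain no other items."""
    return only_exc / exc_support + only_assoc / assoc_support


def exceptionality(exc_support, assoc_support, only_exc, only_assoc):
    """Eq. (1): sum of the three components; all supports are in percent."""
    return (fuzzy_sup(exc_support)
            + fuzzy_fraction(exc_support, assoc_support)
            + neglect(only_exc, exc_support, only_assoc, assoc_support))


# Hypothetical inputs: exception support 2.5%, association support 30%,
# "only" supports of 1% and 12%.
print(round(exceptionality(2.5, 30.0, 1.0, 12.0), 2))  # 1.9
```

With these hypothetical inputs the three components are 0.5, 0.6 and 0.4 + 0.4, which sum to 1.9.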
5. Algorithm and examples
In this section, we present the algorithm for mining reliable exception rules. The reliable exception rules generated by the algorithm are the exception rules with high exceptionality. The algorithm is then followed by a walk-through example.
5.1. Algorithm for mining exception rules
The strong association rules are derived from frequent itemsets with high confidence. The mining process is based on frequent itemsets, and the output of the algorithm is reliable exceptional itemsets. Exceptional itemsets will become exception rules after the confidence of the corresponding association rules has been verified.
Fig. 2. Reliable Exceptional Itemsets Generation Algorithm.
The algorithm for mining exception rules is presented in Fig. 2. The frequent itemsets are generated at each step k (k is the length of the itemset). The input to the algorithm is the database D, the minimum support value minsup, and the minimum exceptionality value minexcep. In the first step of the algorithm (k = 1), the one-frequent itemsets are generated similarly to the Apriori algorithm [21–23] or any other frequent itemsets generation algorithm. Using the one-frequent itemsets as the input of the subsequent steps guarantees that the condition for mining positive and negative exception rules will be satisfied, where all exceptional itemsets contradict the frequent itemsets. Because each item of a frequent itemset is frequent by definition, it automatically satisfies the conditions for both the positive and negative exception rules.
The second step of the algorithm (k = 2) takes as input the set of one-frequent itemsets generated in the first step (k = 1). The candidate two-itemsets are generated from the one-frequent itemsets according to the rule of candidate positive itemsets generation, which is commonly based on a database join operation [24,25].
The main loop of the algorithm has two sections: I and II. Section I generates the reliable negative exceptional itemsets. Section II generates the reliable positive exceptional itemsets.
In Section I, we verify the support of each candidate two-itemset c in the set of candidate two-itemsets. If the support of the candidate two-itemset c is equal to, or greater than, the minimum support value minsup, the itemset c is added to the frequent two-itemsets. The high support of the itemset c is the preliminary condition for deriving an exception rule from the itemset c. We then verify whether the contradicting itemset negative_set is a negative exceptional itemset.
If the support of the contradicting itemset negative_set belongs to the exceptional interval ExceptionalInterval, then the procedure generate_k_Excep_Itemsets_negative generates the negative exceptional itemset negExcepItemset based on the itemset negative_set and the corresponding frequent itemset c. Now we check the condition for reliable negative exception rules. If the exceptionality measure of the exceptional itemset negExcepItemset Exceptionality(negExcepItemset) has a value equal to, or greater than, the minimum exceptionality value minexcep, then the itemset negExcepItemset is a reliable negative exceptional itemset. In Section II of the main loop of the algorithm, we are searching for the reliable positive exceptional itemset. We verify whether the support of the candidate two-itemset c belongs to the exceptional interval ExceptionalInterval. If the support of the candidate two-itemset c belongs to the exceptional interval ExceptionalInterval, then the condition for positive exception rules is satisfied. Now we verify the condition for positive exception rules. The itemset negative_set has to be frequent. Then the itemset c may become an exception based on the frequent negative itemset negative_set. If the support of the negative itemset negative_set support(negative_set) is equal to, or greater than, the minimum support value minsup, the condition for positive exception rules is satisfied. The procedure generate_k_Excep_Itemsets_positive generates the exceptional positive itemset based on the itemset c and the corresponding frequent itemset negative_set. The generated exceptional positive itemset is assigned to the variable posExcepItemset. Now we check the condition for reliable positive exception rules. 
If the exceptionality measure of the exceptional itemset posExcepItemset Exceptionality(posExcepItemset) has a value equal to, or greater than, the minimum exceptionality value minexcep, then the itemset posExcepItemset is a reliable exceptional positive itemset.
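The two sections of the main loop can be sketched in Python as follows. This is a hedged illustration, not the procedure of Fig. 2: the names `support`, `contradicting` and `exceptional_k_itemsets` are hypothetical, supports are expressed in percent, a contradicting itemset is assumed to negate at least one but not all items of the candidate, and the final exceptionality filter against minexcep is left out. The `elif` also assumes that upperBound lies below minsup, so that a candidate cannot be both frequent and exceptional.

```python
from itertools import combinations


def support(database, positive, negative=frozenset()):
    """Percentage of transactions with every item in `positive` and none in `negative`."""
    hits = sum(1 for t in database if positive <= t and not (negative & t))
    return 100.0 * hits / len(database)


def contradicting(itemset):
    """Yield (kept, negated) splits negating at least one, but not every, item."""
    items = sorted(itemset)
    for r in range(1, len(items)):
        for negated in combinations(items, r):
            yield frozenset(items) - frozenset(negated), frozenset(negated)


def exceptional_k_itemsets(database, candidates, minsup, lower, upper):
    """Return frequent itemsets plus candidate negative and positive exceptions."""
    frequent, negative_exc, positive_exc = [], [], []
    for c in candidates:
        sup_c = support(database, c)
        if sup_c >= minsup:
            # Section I: c is frequent; look for rare contradicting itemsets.
            frequent.append(c)
            for kept, negated in contradicting(c):
                if lower <= support(database, kept, negated) <= upper:
                    negative_exc.append(((kept, negated), c))
        elif lower <= sup_c <= upper:
            # Section II: c is rare; look for frequent contradicting itemsets.
            for kept, negated in contradicting(c):
                if support(database, kept, negated) >= minsup:
                    positive_exc.append((c, (kept, negated)))
    return frequent, negative_exc, positive_exc
```

Each returned exception is paired with the itemset it contradicts, which is the information the exceptionality measure needs in the next step.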
Sections I and II are repeated for each candidate two-itemset c in the candidate two-itemsets.
The subsequent steps of the algorithm (k > 2) are similar to the second step of the algorithm (k = 2). Step k of the algorithm takes as input the set of (k − 1)-frequent itemsets generated in the previous step (k − 1). If this set is empty, the algorithm terminates, and its outcome is the set of reliable exceptional itemsets of lengths 2, 3, …, (k − 1). If the set is not empty, the candidate k-itemsets are generated from the (k − 1)-frequent itemsets according to the rule of candidate positive itemsets generation. The procedure k_candidate_itemsets generates the candidate k-itemsets.
5.2. An illustrative example
This section presents a walk-through example to illustrate the work of the Reliable Exceptional Itemsets Generation Algorithm. A sample database with five records is presented in Table 2. A, B, C are the database items. The input of the Reliable Exceptional Itemsets Generation Algorithm is the database in Table 2. The values of the minimum support minsup, minimum exceptionality minexcep and the exceptional interval ExceptionalInterval are:
minsup = 40% (2 records): minimum support
minexcep = 1.5: minimum exceptionality
ExceptionalInterval = [5%, 39.99%]: exceptional interval
The exceptionality measure is the sum of the fuzzy support value FuzzySup, the fuzzy fraction value FuzzyFraction and the neglect value Neglect. To evaluate the exceptionality measure, the regions of the fuzzy support and fuzzy fraction and the corresponding fuzzy values are required. The regions and values for fuzzy support FuzzySup for this example are as follows:
FuzzySup(0–4.99%) = 0; FuzzySup(5–19.99%) = 0.3; FuzzySup(20–39.99%) = 0.2.
The regions and values for fuzzy fraction FuzzyFraction for this example are as follows:
FuzzyFraction(0–9.99%) = 0.6; FuzzyFraction(10–19.99%) = 0.5; FuzzyFraction(20–39.99%) = 0.4; FuzzyFraction(40–59.99%) = 0.2; FuzzyFraction(60–100%) = 0.
The walk-through example is composed of two main phases: formation of exceptional itemsets, and generation of reliable exceptional itemsets. These two phases are detailed in the following sections.
5.2.1. Exceptional itemsets
Firstly, frequent itemsets need to be generated. While going through each step of the frequent itemsets generation, exceptional itemsets are determined. The steps to generate frequent itemsets are those commonly used in traditional association rule mining:
(a) Step 1 (k = 1). In the first step of the algorithm, the frequent one-itemsets are generated. We verify the support of each item in the database in Table 2. If the support of an item is equal to, or greater than, the minimum support value minsup, the item is a frequent item and is added to the set of frequent one-itemsets. Step 1 of the algorithm is shown in Fig. 3, where the support of the database items is verified and the set of frequent one-itemsets {A, B, C} is formed.
Table 2. A sample database
TID   Items
1     A, B
2     A, B, C
3     A, B
4     A, C
5     C
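Step 1 on the database of Table 2 can be reproduced with a few lines of Python. This is only a sketch of a generic frequent one-itemset count; the names `DATABASE`, `MINSUP` and `frequent_one_itemsets` are hypothetical, and supports are counted in records.

```python
from collections import Counter

# Table 2 as a list of transactions (one set of items per TID).
DATABASE = [
    {"A", "B"},
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"C"},
]
MINSUP = 2  # 40% of the five records


def frequent_one_itemsets(database, minsup):
    """Count each item once per transaction and keep items with count >= minsup."""
    counts = Counter(item for transaction in database for item in transaction)
    return {item: count for item, count in counts.items() if count >= minsup}


frequent = frequent_one_itemsets(DATABASE, MINSUP)
print(sorted(frequent.items()))  # [('A', 4), ('B', 3), ('C', 3)]
```

All three items reach the minimum support of two records, so the set of frequent one-itemsets is {A, B, C}.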
Fig. 3. Frequent one-itemsets.
Fig. 4. Frequent two-itemsets.
(b) Step 2 (k = 2). Step 2 of the algorithm is presented in Fig. 4. The ‘YES’ and ‘NO’ indicate whether or not the condition has been satisfied. The two-candidate itemsets are generated from the one-frequent itemsets obtained in the previous step of the algorithm (k = 1). The two-candidate itemsets are the set {AB, AC, BC}. For each of the two-candidate itemsets, we verify its support. The support of the first two-candidate itemset AB equals 3 (records). The minimum support value minsup equals 2 (records). Therefore, the itemset AB is frequent and is added to the set of frequent two-itemsets. The itemsets that contradict the itemset AB are then checked as candidate negative exceptional itemsets. A contradicting itemset is formed by negating one of the items of the frequent itemset AB, so the contradicting itemsets are A¬B and B¬A. The support of the contradicting itemset A¬B equals 1 (record), which belongs to the exceptional interval ExceptionalInterval. Therefore, the itemset A¬B is an exceptional negative itemset. The support of the contradicting itemset B¬A equals 0 (records), which does not belong to the exceptional interval ExceptionalInterval. A support of 0 can never belong to the ExceptionalInterval, so when we encounter the support value 0 we do not have to check the interval. We highlight this fact with the notation ZERO. Therefore, the itemset B¬A is not an exceptional negative itemset. The support of the second two-candidate itemset AC equals 2 (records), so AC is frequent and is added to the set of frequent two-itemsets. The contradicting itemsets of AC are the itemsets A¬C and C¬A. The support of the contradicting itemset A¬C equals 2 (records), that is 40%, which does not belong to the exceptional interval ExceptionalInterval. Therefore, the itemset A¬C is not an exceptional negative itemset. The support of the contradicting itemset C¬A equals 1 (record), and belongs to the exceptional interval ExceptionalInterval.
Therefore, the itemset C¬A is an exceptional negative itemset. The support of the third two-candidate itemset BC equals 1 (record), so BC is infrequent (below minsup), but its support belongs to the exceptional interval ExceptionalInterval. The itemset BC becomes an exceptional positive itemset if the itemsets that contradict it are frequent. The contradicting itemsets are B¬C and C¬B. The support of the contradicting itemset B¬C equals 2 (records). Therefore, the negative itemset B¬C is frequent, and the itemset BC is an exceptional positive itemset based on the frequent negative itemset B¬C. The support of the contradicting itemset C¬B equals 2 (records), so it is also frequent, and the itemset BC is an exceptional positive itemset based on the frequent negative itemset C¬B. At the end of step 2 of the algorithm, the set of frequent two-itemsets {AB, AC} has been formed. We also discovered four exceptional itemsets {A¬B/AB, C¬A/AC, BC/B¬C, BC/C¬B}.
Fig. 5. Frequent three-itemsets.
(c) Step 3 (k = 3). Based on the frequent two-itemsets, we generate the set of candidate three-itemsets, which contains only one itemset, ABC (see Fig. 5). The support of the itemset ABC equals 1 (record) (refer to the database in Table 2). The minimum support minsup equals 2 (records), so the itemset ABC is infrequent. However, the support of the itemset ABC belongs to the exceptional interval ExceptionalInterval. The positive itemset ABC becomes an exceptional positive itemset if an itemset contradicting it is frequent. The itemsets contradicting the itemset ABC are {A¬BC, AB¬C, A¬B¬C, ¬ABC, ¬A¬BC, ¬AB¬C}; each is formed by negating at least one of the items of the positive itemset ABC. The support of the contradicting negative itemset A¬BC equals 1 (record), which is below the minimum support minsup of 2 (records), so A¬BC is infrequent and ABC is not an exceptional itemset based on it. The support of the contradicting negative itemset AB¬C equals 2 (records), so AB¬C is frequent, and the itemset ABC is an exceptional positive itemset based on the frequent negative itemset AB¬C. The support of the contradicting negative itemset A¬B¬C equals 0 (records), so ABC cannot be an exceptional itemset based on A¬B¬C. The support of the contradicting negative itemset ¬ABC equals 0 (records), since no record in Table 2 contains B and C without A, so ABC cannot be an exceptional itemset based on ¬ABC. The support of the contradicting negative itemset ¬A¬BC equals 1 (record), so ¬A¬BC is infrequent and ABC cannot be an exceptional itemset based on it. Finally, the support of the contradicting negative itemset ¬AB¬C equals 0 (records), so ABC cannot be an exceptional itemset based on ¬AB¬C. At the end of step 3 of the algorithm, the set of frequent three-itemsets is empty, { }, so no candidate four-itemsets can be formed and step 3 is the last step of the algorithm. We obtained one exceptional three-itemset, which is an exceptional positive itemset based on a frequent negative itemset: {ABC/AB¬C}.
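The support checks of Step 3 can be reproduced directly from Table 2. The sketch below enumerates every itemset obtained from ABC by negating one or two items and counts its support in records; the helper names `count` and `label` are hypothetical, and the enumeration order is that of `itertools.combinations`, not necessarily the order used in the text.

```python
from itertools import combinations

DATABASE = [{"A", "B"}, {"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"C"}]  # Table 2
MINSUP = 2  # records


def count(positive, negative=()):
    """Number of records containing all `positive` items and none of `negative`."""
    return sum(1 for t in DATABASE if set(positive) <= t and not set(negative) & t)


def label(items, negated):
    """Render an itemset with a negation sign in front of each negated item."""
    return "".join(("¬" if i in negated else "") + i for i in items)


items = ("A", "B", "C")
supports = {}
for r in (1, 2):  # negate one or two of the three items
    for negated in combinations(items, r):
        kept = [i for i in items if i not in negated]
        supports[label(items, negated)] = count(kept, negated)

for name, sup in supports.items():
    print(name, sup, "frequent" if sup >= MINSUP else "infrequent")
# Only AB¬C reaches minsup (2 records), so ABC/AB¬C is the only pair found here.
```

Because ABC itself has a support of one record, which lies inside the exceptional interval, the single frequent contradicting itemset is enough to make ABC an exceptional positive itemset.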
In all steps of the algorithm, we obtained five exceptional itemsets {A¬B/AB, C¬A/AC, BC/B¬C, BC/C¬B, ABC/AB¬C}. Now we have to verify their exceptionality measure to distinguish the reliable exceptional itemsets, as described next. The minimum exceptionality minexcep equals 1.5.
5.2.2. Reliable exceptional itemsets
For each of the five exceptional itemsets {A¬B/AB, C¬A/AC, BC/B¬C, BC/C¬B, ABC/AB¬C} generated in the previous section, we need to determine whether or not it is reliable by applying the exceptionality measure:
(a) Exceptionality A¬B/AB: Fig. 6 calculates the exceptionality of the negative exceptional itemset A¬B based on the frequent positive itemset AB. The support of A¬B is one record, which is 20% of the five records in Table 2. The fuzzy support of 20% equals 0.2 (FuzzySup(20–39.99%) = 0.2), hence the first component of the formula equals 0.2. For the fuzzy fraction, support(A¬B) = 1 and support(AB) = 3 (records), so support(A¬B)/support(AB) = 1/3. The fuzzy fraction of (1/3 × 100%) equals 0.4 (FuzzyFraction(20–39.99%) = 0.4), and hence the second component of the formula equals 0.4.
Fig. 6. Exceptionality A¬B/AB.
Support of only A¬B is the fraction of the records that contain A and no other items. Therefore, support(only A¬B) = 0, which means that the third term of the formula (the first part of the neglect measure) equals 0. Support of only AB is the fraction of the records that contain A and B and no other items. Because support(only AB) = 2 and support(AB) = 3, the fourth term of the formula (the second part of the neglect measure) equals 2/3. The overall formula equals 0.2 + 0.4 + 0 + 2/3 ≈ 1.3. The minimum exceptionality minexcep is 1.5. The exceptional itemset A¬B/AB is not a reliable exceptional itemset, as 1.3 < 1.5.
(b) Exceptionality C¬A/AC: Fig. 7 calculates the exceptionality of the exceptional negative itemset C¬A based on the frequent positive itemset AC. The support of C¬A is one record (20%), and the fuzzy support of 20% equals 0.2 (FuzzySup(20–39.99%) = 0.2). Hence, the first component of the formula equals 0.2. Because support(C¬A) = 1 and support(AC) = 2 (records), support(C¬A)/support(AC) = 1/2. The fuzzy fraction of (1/2 × 100%), which is 50%, equals 0.2 (FuzzyFraction(40–59.99%) = 0.2), so the second component of the formula equals 0.2. Support of only C¬A is the fraction of records that contain C and no other items, where support(only C¬A) = 1 and support(C¬A) = 1, so the third term of the formula equals 1. Support of only AC is the fraction of records that contain A and C and no other items, where support(only AC) = 1 and support(AC) = 2, so the fourth term of the formula equals 1/2. The total is 0.2 + 0.2 + 1 + 1/2 = 1.9. The minimum exceptionality minexcep is 1.5, so a reliable exceptional itemset has been obtained, as 1.9 > 1.5.
(c) Exceptionality BC/B¬C: Fig. 8 presents the exceptionality of the exceptional positive itemset BC based on the frequent negative itemset B¬C. The support of BC is one record (20%), and the fuzzy support of 20% equals 0.2 (FuzzySup(20–39.99%) = 0.2). Hence, the first component of the formula equals 0.2. As support(BC) = 1 and support(B¬C) = 2 (records), support(BC)/support(B¬C) = 1/2. The fuzzy fraction of (1/2 × 100%) is 0.2 (FuzzyFraction(40–59.99%) = 0.2). Support of only BC, that is, of the records that contain B and C and no other items, is support(only BC) = 0, and support of only B¬C, that is, of the records that contain B and no other items, is support(only B¬C) = 0. The sum of the four values is 0.2 + 0.2 + 0 + 0 = 0.4. Because the minimum exceptionality minexcep is 1.5, BC/B¬C is not a reliable exceptional itemset, as 0.4 < 1.5.
(d) Exceptionality BC/C¬B: Fig. 9 presents the exceptionality of the exceptional positive itemset BC based on the frequent negative itemset C¬B. The support of BC is one record (20%), and the fuzzy support of 20% equals 0.2 (FuzzySup(20–39.99%) = 0.2). For the second component, support(BC) = 1 and support(C¬B) = 2 (records), so support(BC)/support(C¬B) = 1/2, and the fuzzy fraction of (1/2 × 100%) is 0.2. Support of only BC is nil (support(only BC) = 0), and support(only C¬B) = 1 divided by support(C¬B) = 2 gives 0.5. The total is 0.2 + 0.2 + 0 + 0.5 = 0.9. The minimum exceptionality minexcep is 1.5, so BC/C¬B is not a reliable exceptional itemset, as 0.9 < 1.5.
(e) Exceptionality ABC/AB¬C: Finally, Fig. 10 shows the calculation for determining the exceptionality of the exceptional positive itemset ABC based on the frequent negative itemset AB¬C.
Fig. 7. Exceptionality C¬A/AC.
Fig. 8. Exceptionality BC/B¬C.
Fig. 9. Exceptionality BC/C¬B.
Fig. 10. Exceptionality ABC/AB¬C.
The support of ABC is one record (20%) and the fuzzy support of 20% equals 0.2 (FuzzySup(20–39.99%) = 0.2). For the fuzzy fraction, support(ABC) = 1 and support(AB¬C) = 2 (records), and therefore support(ABC)/support(AB¬C) = 1/2. The fuzzy fraction of (1/2 × 100%) is 0.2 (FuzzyFraction(40–59.99%) = 0.2). The third term in the equation in Fig. 10, support(only ABC)/support(ABC), equals 1, as both support(only ABC) = 1 and support(ABC) = 1. The last term, support(only AB¬C)/support(AB¬C), also equals 1, as both support values are 2 (records). Hence, the overall equation gives 0.2 + 0.2 + 1 + 1 = 2.4. The minimum exceptionality minexcep is 1.5, so a reliable exceptional itemset has been obtained, as 2.4 > 1.5.
Based on the sample data presented in Table 2, following the three steps of the algorithm and the five reliability checks described above, the final output is two reliable exceptional itemsets:
C¬A (based on AC)
ABC (based on AB¬C).
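The five reliability checks can be recomputed from Table 2 with the short Python sketch below. It is an illustration under stated assumptions, not the authors' code: all names are hypothetical, the fuzzy regions of this example are treated as half-open intervals in percent, and "support of only X" is taken, as in the walk-through, to mean the number of records that contain exactly the non-negated items of X.

```python
DATABASE = [{"A", "B"}, {"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"C"}]  # Table 2
MINEXCEP = 1.5

# (low, high, value): regions [low, high) in percent from Section 5.2.
FUZZY_SUP = [(5.0, 20.0, 0.3), (20.0, 40.0, 0.2)]
FUZZY_FRACTION = [(0.0, 10.0, 0.6), (10.0, 20.0, 0.5), (20.0, 40.0, 0.4), (40.0, 60.0, 0.2)]


def fuzzy(regions, percent):
    """Fuzzy value of the region containing `percent`, or 0 outside all regions."""
    return next((v for lo, hi, v in regions if lo <= percent < hi), 0.0)


def support(positive, negative=""):
    """Records containing every item of `positive` and no item of `negative`."""
    return sum(1 for t in DATABASE if set(positive) <= t and not set(negative) & t)


def only(positive):
    """Records containing exactly the non-negated items and nothing else."""
    return sum(1 for t in DATABASE if t == set(positive))


def exceptionality(exc, assoc):
    """Eq. (1) for an exception and its rule, each given as (positive, negated)."""
    (exc_pos, exc_neg), (assoc_pos, assoc_neg) = exc, assoc
    sup_exc, sup_assoc = support(exc_pos, exc_neg), support(assoc_pos, assoc_neg)
    return (fuzzy(FUZZY_SUP, 100.0 * sup_exc / len(DATABASE))
            + fuzzy(FUZZY_FRACTION, 100.0 * sup_exc / sup_assoc)
            + only(exc_pos) / sup_exc + only(assoc_pos) / sup_assoc)


PAIRS = {
    "A¬B/AB": (("A", "B"), ("AB", "")),
    "C¬A/AC": (("C", "A"), ("AC", "")),
    "BC/B¬C": (("BC", ""), ("B", "C")),
    "BC/C¬B": (("BC", ""), ("C", "B")),
    "ABC/AB¬C": (("ABC", ""), ("AB", "C")),
}
for name, (exc, assoc) in PAIRS.items():
    value = exceptionality(exc, assoc)
    print(f"{name}: {value:.2f}", "reliable" if value >= MINEXCEP else "not reliable")
```

Run on Table 2, the sketch gives 1.27, 1.90, 0.40, 0.90 and 2.40 for the five pairs, so only C¬A/AC and ABC/AB¬C pass the threshold of 1.5.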
6. Performance evaluation
The test database was downloaded from the UCI Repository of machine learning databases [26]. The test database is the Intrusion Detection database, formerly the KDD Cup 1999 data, whose task is to distinguish attacks on the network from other database records. The database represents the parameters of a network over a period of time. The original database included 40 parameters and a vast number of records. Most of the parameters are continuous. In our experimentation, a simplified model of 10 parameters and 10 thousand records was employed [27]. The chosen parameters are listed in Table 3. The parameters are either continuous or discrete. The database sample has been chosen randomly from the downloaded database. The values of each continuous parameter have been divided into 10 ranges according to its minimum/maximum values. The downloaded data set is then transformed [28,29] into the data set 10thousandTxt10attr.txt (Fig. 11). Each of the 10 chosen attributes of the downloaded data set is encoded by two digits. The first digit ∈ [0, 9] represents the attribute name; 0 is used for attribute 10. The second digit ∈ [0, 9] represents the attribute value: the attribute domain is divided into 10 regions and the attribute value belongs to one of the regions. The modified data set 10thousandTxt10attr.txt is presented in Fig. 12. The first number in a record corresponds to attribute 1, the second number to attribute 2, and so on; in each number, the first digit is the attribute’s name and the second digit is the attribute’s value. The downloaded data set has been transformed and the Reliable Exceptional Itemsets Generation Algorithm can now be executed.
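The two-digit encoding described above can be sketched as follows. The paper does not state how the 10 ranges are cut, so equal-width ranges between the minimum and maximum are an assumption here, as are the function names `region`, `encode` and `encode_record`; discrete attributes would additionally need a mapping from categories to digits, which is not shown.

```python
def region(value, minimum, maximum, regions=10):
    """Index in [0, regions - 1] of the equal-width range containing `value` (assumed binning)."""
    if maximum == minimum:
        return 0
    index = int((value - minimum) / (maximum - minimum) * regions)
    return min(max(index, 0), regions - 1)  # the maximum itself falls in the last range


def encode(attribute_number, region_index):
    """Two-digit code: attribute digit (0 stands for attribute 10), then region digit."""
    return f"{attribute_number % 10}{region_index}"


def encode_record(values, ranges):
    """Encode continuous values; `ranges` holds one (min, max) pair per attribute."""
    return [encode(n, region(v, lo, hi))
            for n, (v, (lo, hi)) in enumerate(zip(values, ranges), start=1)]


# Hypothetical record with three continuous attributes, each ranging over [0, 1000].
print(encode_record([0.0, 250.0, 1000.0], [(0.0, 1000.0)] * 3))  # ['10', '22', '39']
print(encode(10, 4))  # '04'
```

Encoding every attribute as a distinct two-digit item lets a generic frequent itemset miner treat each (attribute, range) pair as an ordinary database item.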
Table 3. Parameters
Parameter’s name          Parameter’s description                             Parameter’s type
1. duration               Length (number of seconds) of the connection        Continuous
2. protocol_type          Type of the protocol, e.g. tcp, udp, etc.           Discrete
3. flag                   Normal or error status of the connection            Discrete
4. src_bytes              Number of data bytes from source to destination     Continuous
5. dst_bytes              Number of data bytes from destination to source     Continuous
6. urgent                 Number of urgent packets                            Continuous
7. hot                    Number of “hot” indicators                          Continuous
8. logged_in              1 if successfully logged in; 0 otherwise            Discrete
9. Num_compromised        Number of “compromised” conditions                  Continuous
10. Num_file_creations    Number of file creation operations                  Continuous
Fig. 11. A fragment of the KDDCup1999 dataset.
Exception rules mining starts with one-frequent itemsets mining. The two-candidates are then derived from the one-frequent itemsets and the support of each two-candidate itemset is evaluated against minsup. The negative subsets are generated from a two-candidate itemset. If the support of the two-candidate itemset is greater than, or equal to, minsup, and the support of one of the negative subsets is less than minsup, the pair is a candidate exception (and vice versa). In the Reliable Exceptional Itemsets Generation Algorithm, only positive candidate itemsets are stored in the candidates’ hash trees. Therefore, the candidates’ hash trees store significantly fewer itemsets and are much more efficient in terms of memory consumption. Fig. 13 presents a graph showing the dependency between the minimum support value and the number of generated exceptional itemsets. Positive exceptions are exceptional itemsets in the positive form; negative exceptions are exceptional itemsets in the negative form. Reliable exceptional itemsets shown in the graph are the exceptional itemsets with the highest exceptionality measure among all exceptional itemsets. The number of reliable exceptional itemsets is averaged over all minimum support minsup values. Fig. 14 presents the dependency between the minimum support value and the number of generated exception rules. The graph changes direction with the falling/rising of different minsup values. In Table 4, there are a few samples of the generated exceptional itemsets featuring a high exceptionality value. Our algorithm generates reliable exceptional itemsets that become exception rules after computationally simple confidence value
Fig. 12. A fragment of the modified file 10thousandTxt10attr.txt.
Fig. 13. Itemset size/rules number dependency (x-axis: itemset size; y-axis: number of reliable exceptional itemsets; series: positive exceptions, negative exceptions).
Fig. 14. Minsup/execution time dependency (x-axis: minsup (%); y-axis: execution time (sec)).
Table 4. Samples of generated exceptions
Reliable exceptional itemsets in a form of positive association
Frequent Itemset: protocol_type = tcp, flag = SF, urgent = 0, dst_bytes ∈ [0; 500 000b], logged_in = 1
Exceptional Itemset: protocol_type = tcp, flag = Not SF, urgent = 0, dst_bytes ∈ [0; 500 000b], logged_in = 0
Frequent Itemset: urgent = 0, dst_bytes ∈ [0; 500 000b], hot ∈ [0, 9], logged_in = 1, num_compromised = 0, num_file_creations = 0
Exceptional Itemset: urgent = 0, dst_bytes ∈ [0; 500 000b], hot ∈ [0, 9], logged_in = 1, num_compromised > 0, num_file_creations > 0
Reliable exceptional itemsets in a form of negative association
Frequent Itemset: flag = SF, source_bytes ∈ [0; 10000b], num_file_creations = 0
Exceptional Itemset: flag = SF, source_bytes ∈ [0; 10000b], num_file_creations > 0
Frequent Itemset: flag = REJ, logged_in = 0
Exceptional Itemset: flag = REJ, logged_in = 1
verification of the frequent itemsets. Frequent itemsets represent strong rules in the database. When an exception based on a strong rule is generated, it indicates something unusual, such as an intrusion into the network.
7. Conclusion and future work
In this paper, a novel method has been developed for mining exception rules. The interconnection between strong positive and negative association rules and exception rules is explored, where an exception rule is formed if it contradicts the strong rule and also satisfies the exceptionality measure. The proposed exceptionality measure, whose desired performance has been proven, is used to evaluate the candidate exception rules. In the future, we are going to consider temporal exceptions, which are the temporal patterns in the database related to negative association rules and which change over time. Additional measures will also be considered to identify the temporal exceptions in the database. From the implementation perspective, we will make use of the indexing techniques that we have previously developed for high performance databases [30–33] in order to gain more performance benefits.
References
[1] H. Déjean, Learning rules and their exceptions, Journal of Machine Learning Research 2 (2002) 669–693. [2] B. Grosof, T. Poon, SweetDeal: representing agent contracts with exceptions using XML rules, ontologies, and process descriptions, in: Proceedings of the World Wide Web Conference, 2003, pp. 340–349. [3] J. Hellerstein, S. Ma, C. Perng, Discovering actionable patterns in event data, IBM Systems Journal 41 (3) (2002) 475–493. [4] F. Hussain, H. Liu, E. Suzuki, H. Lu, Exception rule mining with a relative interestingness measure, in: Proceedings of the Fourth Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2000, pp. 86–97. [5] B. Liu, W. Hsu, L. Mun, H. Lee, Finding interesting patterns using user expectations, IEEE Transactions on Knowledge and Data Engineering 11 (6) (1999) 817–832.
[6] B. Padmanabhan, A. Tuzhilin, Small is beautiful: discovering the minimal set of unexpected patterns, in: Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining, 2000, pp. 54–63.
[7] E. Suzuki, Scheduled discovery of exception rules, Discovery Science, Lecture Notes in Artificial Intelligence 1721 (1999) 184–195.
[8] E. Suzuki, In pursuit of interesting patterns with undirected discovery of exception rules, Progress in Discovery Science, Lecture Notes in Computer Science 2281 (2002) 504–517.
[9] E. Suzuki, Undirected discovery of interesting exception rules, International Journal of Pattern Recognition and Artificial Intelligence 16 (8) (2002) 1065–1086.
[10] E. Suzuki, J. Zytkow, Unified algorithm for undirected discovery of exception rules, Principles of Data Mining and Knowledge Discovery, Lecture Notes in Artificial Intelligence 1910 (2000) 169–180.
[11] Y. Yamada, E. Suzuki, Toward knowledge-driven spiral discovery of exception rules, in: Proceedings of the IEEE International Conference on Fuzzy Systems, vol. 2, 2002, pp. 872–877.
[12] S. Zhang, C. Zhang, X. Yan, Z. Qin, Identifying exceptional patterns in multi-databases, in: Proceedings of the First International Conference on Fuzzy Systems and Knowledge Discovery, 2002, pp. 146–150.
[13] J. Goh, D. Taniar, Mining frequency pattern from mobile users, Lecture Notes in Artificial Intelligence 3215 (2004) 795–801.
[14] J. Goh, D. Taniar, Mobile data mining by location dependencies, Lecture Notes in Computer Science 3177 (2004) 225–231.
[15] D. Taniar, J. Goh, On mining movement pattern from mobile users, International Journal of Distributed Sensor Networks 3 (1) (2007) 69–86.
[16] H. Liu, H. Lu, L. Feng, F. Hussain, Efficient search of reliable exceptions, in: Proceedings of the Third Pacific-Asia Conference on Knowledge Discovery and Data Mining, 1999, pp. 194–203.
[17] B. Padmanabhan, A. Tuzhilin, A belief-driven method for discovering unexpected patterns, in: Proceedings of the Fourth International Conference on Knowledge Discovery in Databases, 1998, pp. 94–100.
[18] A. Silberschatz, A. Tuzhilin, What makes patterns interesting in knowledge discovery systems, IEEE Transactions on Knowledge and Data Engineering 8 (6) (1996) 970–974.
[19] D. Taniar, C.H.C. Leung, The impact of load balancing to object-oriented query execution scheduling in parallel machine environment, Information Sciences 157 (2003) 33–71.
[20] D. Taniar, C.H.C. Leung, Query execution scheduling in parallel object-oriented databases, Information and Software Technology 41 (3) (1999) 163–178.
[21] R. Agrawal, R. Srikant, Fast algorithms for mining association rules in large databases, in: Proceedings of the 20th International Conference on Very Large Data Bases, 1994, pp. 487–499.
[22] R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases, in: Proceedings of the International Conference Management of Data, 1993, pp. 207–216.
[23] R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases, in: Proceedings of the ACM-SIGMOD International Conference on Management of Data, 1993, pp. 207–216.
[24] D. Taniar, J.W. Rahayu, Parallel sort-merge object-oriented collection join algorithms, Computer Systems Science and Engineering 17 (3) (2002) 145–158.
[25] D. Taniar, R.B.-N. Tan, C.H.C. Leung, K.H. Liu, Performance analysis of groupby-after-join query processing in parallel database systems, Information Sciences 168 (1–4) (2004) 5–50.
[26] C.L. Blake, C.J. Merz, UCI Repository of Machine Learning Databases, Department of Information and Computer Science, University of California, Irvine, CA, 1998.
[27] D. Taniar, J.W. Rahayu, Performance analysis of parallelization models for path expression queries, Information Sciences 117 (1–2) (1999) 107–142.
[28] J.W. Rahayu, E. Chang, T.S. Dillon, D. Taniar, Performance evaluation of the object-relational transformation methodology, Data and Knowledge Engineering 38 (3) (2001) 265–300.
[29] J.W. Rahayu, E. Chang, T.S. Dillon, D. Taniar, A methodology for transforming inheritance relationships in an object-oriented conceptual model to relational tables, Information and Software Technology 42 (8) (2000) 571–592.
[30] D. Taniar, J.W. Rahayu, A taxonomy of indexing schemes for parallel database systems, Distributed and Parallel Databases: An International Journal 12 (1) (2002) 73–106.
[31] D. Taniar, J.W. Rahayu, Parallel database sorting, Information Sciences 146 (1–4) (2002) 171–219.
[32] D. Taniar, J.W. Rahayu, Parallel group-by query processing in a cluster architecture, Computer Systems Science and Engineering 17 (1) (2002) 23–39.
[33] D. Taniar, J.W. Rahayu, Global parallel indexing for multi-processors database systems, Information Sciences 165 (1–2) (2004) 103–127.