Expert Systems with Applications 40 (2013) 6531–6537
Mining high coherent association rules with consideration of support measure ☆

Chun-Hao Chen (a), Guo-Cheng Lan (b), Tzung-Pei Hong (c,d,*), Yui-Kai Lin (d)

(a) Department of Computer Science and Information Engineering, Tamkang University, Taipei 251, Taiwan
(b) Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan 701, Taiwan
(c) Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung 811, Taiwan
(d) Department of Computer Science and Engineering, National Sun Yat-sen University, Kaohsiung 804, Taiwan
Keywords: Data mining; Association rules; Propositional logic; Coherent rules; Highly coherent rules
Abstract

Data mining has been studied for a long time. Its goal is to help market managers find relationships among items in large databases and thus increase sales volume. Association-rule mining is one of the well-known and commonly used techniques for this purpose, and the Apriori algorithm is an important method for such a task. Based on the Apriori algorithm, many mining approaches have been proposed for diverse applications. Many of these data mining approaches focus on positive association rules such as ''if milk is bought, then cookies are bought''. Such rules may, however, be misleading since there may be customers that buy milk but not cookies. This paper thus takes the properties of propositional logic into consideration and proposes an algorithm for mining highly coherent rules. The derived association rules are expected to be more meaningful and reliable for business. Experiments on two datasets are also made to show the performance of the proposed approach.
© 2013 Elsevier Ltd. All rights reserved.
1. Introduction

Data mining is often used to discover interesting information and relationships between items in large datasets or databases (Agrawal & Srikant, 1994; Agrawal, Imielinski, & Swami, 1993a, 1993b). One of the applications of data mining is to increase sales volume. Association rules in data mining can be expressed as X → Y, where X and Y are sets of items and X ∩ Y = ∅. The meaning of the expression is that if all the items in X exist in a transaction, then the items in Y will also exist in the same transaction with high probability. For example, if customers purchase bread and milk together with high probability when shopping in a market, the rule ''Bread → Milk'' will be mined out, as it is of interest to marketing managers.

Agrawal et al. proposed the Apriori approach for association-rule mining. Based on the Apriori approach, many algorithms have been proposed, with some focusing on positive association rules (Agarwal, Aggarwal, & Prasad, 2001; Baralis, Cerquitelli, & Chiusano, 2009; Bie, 2011; Brin, Motwani, & Silverstein, 1997; Cai, Tung, Zhang, & Hao, 2011; Cheung & Fu, 2004; Cule & Goethals, 2010; Do, Laurenty, & Termier, 2010; Han, Pei, Yin, & Mao, 2004; Plantevit, Laurent, Laurent, Teisseire, & Choong, 2010; Ruggieri, 2010; Teng et al., 2002; Wang, He, & Han, 2003; Webb, 2010). These approaches use different types of association-rule mining, such as weighted association-rule mining (Bie, 2011; Wang et al., 2003) and fuzzy association-rule mining (Cai et al., 2011), to achieve a specific goal. Some of them focus on negative association rules, which are useful for analyzing transactions in real-world applications (Antonie & Zaïane, 2004; Segond & Borgelt, 2011; Wu, Zhang, & Zhang, 2004; Zhong, Li, & Wu, 2012). Brin et al. described the importance of negative association rules (Brin et al., 1997). Unlike positive rules, negative association rules provide a more thoughtful and symmetric property in rule detection and analysis (Wu et al., 2004; Zhao et al., 2009). In general, the number of derived negative rules is larger than that of positive rules for a given transaction database. Wu et al. thus designed an algorithm to enhance the performance of the mining process through an interestingness function and a pruning strategy (Wu et al., 2004).

A common problem of association-rule mining approaches is that many of the derived rules are common sense and thus cannot be easily used, especially for business applications, and some may be misleading. For example, the rule ''if milk is bought, then bread is bought'' may be misleading because customers may buy milk and not buy bread. To solve this problem, Sim et al. proposed an algorithm for mining coherent rules based on the properties of propositional logic without a minimum support threshold. In their approach, if a rule satisfies logic equivalence, then it is a coherent rule. Logic equivalence means that if the rule X → Y exists, then the rule ¬X → ¬Y also exists under certain criteria.

☆ This is a modified and expanded version of the paper ''A high coherent association rule mining algorithm,'' presented at the International Conference on Artificial Intelligence and Applications, 2012, Taiwan.
* Corresponding author at: Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung 811, Taiwan.
E-mail addresses: [email protected] (C.-H. Chen), [email protected] (T.-P. Hong), [email protected] (Y.-K. Lin).
http://dx.doi.org/10.1016/j.eswa.2013.06.002
Based on logical equivalence, this paper proposes an Apriori-like approach, namely the Highly Coherent Association-Rule Mining (HCARM) algorithm, for mining rules from transaction databases. Since generating candidate coherent itemsets is time-consuming, the proposed approach first calculates the corresponding lower and upper bounds of subitemsets in the consequent part of an itemset. These intervals are then used to remove itemsets that cannot become highly coherent itemsets, speeding up the mining process. Then, the contingency tables of the qualified itemsets are calculated and used to check whether they satisfy the conditions of logical equivalence. If yes, the itemsets are used to generate highly coherent association rules. Experiments on the Foodmart dataset and a simulated dataset are made to show the performance of the proposed approach. The rest of this paper is organized as follows. Related works are described in Section 2. The derivation of the lower and upper bounds of itemsets is described in Section 3. The proposed mining algorithm and an example that illustrates its use are presented in Sections 4 and 5, respectively. Experimental results are shown in Section 6. The conclusions and suggestions for future work are given in Section 7.
2. Review of related mining approaches This section reviews related works, including association-rule mining approaches and the concept of coherent rules.
2.1. Association-rule mining approaches

Many data mining methods have been proposed for deriving association rules from transaction databases. Data mining aims to extract useful knowledge and patterns from existing data to solve a specific problem. Association-rule mining has been applied in fields such as marketing and crime prevention and prediction. An association rule is represented as A → B, where A and B are sets of products. The rule expresses that if the products in A are purchased, then the products in B will be purchased together with high probability (Agarwal et al., 2001). Since transaction databases usually consist of millions of records, rule mining is time-consuming. In order to improve the performance of the mining process in various domains, many methods have been proposed, including those based on the FP-tree structure and the mining of both positive and negative association rules.

Many existing mining approaches focus on mining positive rules. Cai et al. designed an incremental Apriori-like algorithm for selecting interesting rules with two interestingness measures, called Max-Subrule-Conf and Min-Subrule-Conf (Cai et al., 2011). Do et al. (2010) proposed the PGLCM algorithm for mining closed frequent gradual patterns; the execution time of their approach increases linearly with the number of closed frequent gradual itemsets. To derive the buying cycles of items and the intervals between those items, Chiang et al. combined data mining and statistics and proposed Cyclic Model Analysis (CMA) (Chiang, Wang, Chen, & Chen, 2009).

Negative rules are also important in transaction data. Teng et al. proposed a method for mining substitution rules, which means that customers can replace some items with others (Teng, Hsieh, & Chen, 2002). Their SRM algorithm was designed to enhance performance and consists of two phases: all frequent positive and negative itemsets are generated, and then the substitution rules are derived. Antonie et al.
devised an algorithm for generating both positive and negative association rules with a sliding correlation coefficient threshold (Antonie & Zaïane, 2004). The negative mining concept has also been extended for mining negative sequential patterns. Zheng et al. proposed a genetic algorithm (GA)-based algorithm
to find negative sequential patterns (Zheng, Zhao, Zuo, & Cao, 2010). Since data mining is a time-consuming task, many algorithms have been proposed to improve the efficiency of the mining process. Han et al. proposed the TFP algorithm for mining top-k frequent closed patterns (Han, Wang, Lu, & Tzvetkov, 2002). They also proposed a two-step FP-growth algorithm for mining frequent itemsets, with the database scanned only twice (Han et al., 2004). In the first step, the size of the database is reduced and the FP-tree is constructed from the reduced database. In the second step, the FP-growth approach is used to derive all frequent itemsets. Baralis et al. proposed the Itemset-Tree and the Item-Btree for creating a compact and lossless representation of item relations (Baralis et al., 2009). Marinica et al. proposed an interactive post-processing approach to reduce the number of discovered rules (Marinica & Guillet, 2010). The approach first integrates user domain knowledge into association rules according to ontologies and rule schemas; it then guides the user to prune and filter rules, and the ARIPSO framework is utilized to select interesting rules. Segond et al. proposed a mining algorithm based on the Eclat algorithm, which takes the covers of associated items into consideration in the mining task (Segond & Borgelt, 2011). Methods that use multiple levels and thresholds have also been proposed (Brin et al., 1997; Ruggieri, 2010).

2.2. Concept of coherent rules

One of the main issues of association-rule mining approaches is how to define an appropriate minimum support and minimum confidence. It has been reported that although an appropriate minimum support may exist, it can be difficult to find (Wu et al., 2004). Many studies have proposed methods for finding an appropriate minimum support (Han et al., 2004; Plantevit et al., 2010; Ruggieri, 2010; Webb, 2010; Zhong et al., 2012). The coherent rule mining algorithm proposed by Sim et al.
is based on the properties of propositional logic (Sim et al., 2010). In their approach, by using the properties of propositional logic, the relationships between items can be derived directly without knowing the appropriate value of the minimum support. The approach maps association rules to equivalences. Each mapping from an association rule to an equivalence should satisfy the conditions shown in Table 1. In Table 1, X and Y are two itemsets. An association rule X → Y is mapped to p ≡ q if and only if (1) X → Y is true; (2) X → ¬Y is false; (3) ¬X → Y is false; and (4) ¬X → ¬Y is true. When used on multiple transactions, association rules are mapped to implications based on comparisons between supports: X → Y is mapped to an implication p → q if and only if (1) Sup(X, Y) > Sup(¬X, Y); (2) Sup(X, Y) > Sup(X, ¬Y); (3) Sup(¬X, ¬Y) > Sup(¬X, Y); and (4) Sup(¬X, ¬Y) > Sup(X, ¬Y). In the same way, other association rules are mapped to implications based on comparisons between supports, creating pseudo-implications. Sim et al. developed coherent rules from the pseudo-implications. The following four conditions must be satisfied for a coherent rule: (1) Sup(X, Y) > Sup(¬X, Y); (2) Sup(X, Y) > Sup(X, ¬Y); (3) Sup(¬X, ¬Y) > Sup(¬X, Y); and (4) Sup(¬X,
Table 1
Four conditions for mapping rules to equivalences.

Equivalences    Association rules
p ≡ q           X → Y
¬p ≡ ¬q         ¬X → ¬Y

True or False on association rules    Association rules    Required conditions
T                                     X → Y                ¬X → ¬Y
F                                     X → ¬Y               ¬X → Y
F                                     ¬X → Y               X → ¬Y
T                                     ¬X → ¬Y              X → Y
Table 2
Contingency table of a rule.

Frequency of co-occurrences    Consequence Y       ¬Y
Antecedent X                   Q1 = Sup(X, Y)      Q2 = Sup(X, ¬Y)
           ¬X                  Q3 = Sup(¬X, Y)     Q4 = Sup(¬X, ¬Y)
¬Y) > Sup(X, ¬Y). These four conditions can also be represented as a contingency table, as shown in Table 2. The concept of a coherent rule is that if a rule X → Y exists, then the rule ¬X → ¬Y should also exist. Thus, according to the coherent properties, the coherent mining algorithm for processing a certain consequence itemset Y is stated as follows (Sim et al., 2010). Firstly, the algorithm finds the set of items in the given transaction database, namely I. Then, it maps the power sets of A (= I − Y) to the indices of a binary system. At last, the frequencies of each element in the set A and itemset Y are calculated for deriving coherent rules, exploiting the anti-monotone property on the condition Sup(X, Y) > Sup(¬X, Y). By utilizing the coherent rule concept, the present study proposes an association-rule mining algorithm that uses these four conditions in the mining process for deriving highly coherent association rules from transactions.

3. Derivation of lower and upper bounds of itemsets

This section derives the bounds based on the properties of coherent rules defined in the previous section. For a coherent rule X → Y, let the supports of X and Y be P1 and P2, respectively, and let the number of transactions be T. In the following, assume that the derived coherent rule X → Y has the highest confidence if P1 > P2. The contingency table is shown in Table 3. From Table 3, according to the four criteria, if the rule X → Y is a coherent rule, the following two formulas must be satisfied:
P2 > (P1 − P2)    (1)
(T − P1) > (P1 − P2)    (2)
The contingency table for P2 > P1 is shown in Table 4. From Table 4, the following two formulas must be satisfied:
P1 > (P2 − P1)    (3)
(T − P2) > (P2 − P1)    (4)
Using formulas (1)–(4), the following theorem is obtained:

Theorem 1. Let X → Y be a coherent rule, where X and Y are itemsets. Assume that the support of X is P1, the support of Y is P2, and the total number of transactions is T; then, the support of Y must be in the range Max[(2P1 − T), 0.5P1] < P2 < Min[(P1 + T)/2, 2P1].
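As a quick numeric check of Theorem 1, the bounds can be computed directly from P1 and T. The helper below is an illustrative sketch written for this discussion (the function name and count-based interface are not from the paper); it works on raw counts, matching the theorem's notation.

```python
def coherent_support_bounds(p1, t):
    # Bounds on P2 (the support count of Y) from Theorem 1:
    # Max[(2*P1 - T), 0.5*P1] < P2 < Min[(P1 + T)/2, 2*P1]
    lower = max(2 * p1 - t, 0.5 * p1)
    upper = min((p1 + t) / 2, 2 * p1)
    return lower, upper

# With P1 = 15 and T = 25 (item I4 in the example of Section 5),
# the interval is 7.5 < P2 < 20.
lo_i4, hi_i4 = coherent_support_bounds(15, 25)
```

Dividing the resulting counts by T recovers the normalized interval 0.3 < Sup(Y) < 0.8 used in the example of Section 5.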
Table 3
Contingency table when P1 > P2.

Frequency of co-occurrences    Consequence Y    ¬Y
Antecedent X                   P2               P1 − P2
           ¬X                  0                T − P1

Table 4
Contingency table when P2 > P1.

Frequency of co-occurrences    Consequence Y    ¬Y
Antecedent X                   P1               0
           ¬X                  P2 − P1          T − P2
Proof. For P1 > P2, the following two formulas are derived: (1) P2 > (P1 − P2) and (2) (T − P1) > (P1 − P2). They can be transformed into (1′) P2 > 0.5P1 and (2′) P2 > (2P1 − T), respectively. For P1 < P2, another two formulas are derived: (3) P1 > (P2 − P1) and (4) (T − P2) > (P2 − P1). They can also be transformed into (3′) 2P1 > P2 and (4′) P2 < (P1 + T)/2, respectively. From (1′) and (3′), it can be shown that (5) 0.5P1 < P2 < 2P1. From (2′) and (4′), it can be shown that (6) (2P1 − T) < P2 < (P1 + T)/2. Using formulas (5) and (6), the lower and upper bounds of P2 can be derived as Max[(2P1 − T), 0.5P1] and Min[(P1 + T)/2, 2P1], respectively. □

Given an itemset S, its candidate coherent itemsets {X, Y} can be formed, where X and Y are subsets of S and (X ∩ Y) = ∅. Since calculating the contingency table of a candidate coherent itemset is time-consuming, Theorem 1 is used: if the support of Y is not in the interval, then the itemset cannot generate coherent rules and is removed directly without having its contingency table calculated.

4. Proposed mining algorithm

This section describes the proposed mining algorithm, HCARM, for deriving highly coherent association rules from databases.

Proposed HCARM algorithm:
INPUT: A body of n transactions, a predefined minimum support threshold α, a predefined minimum confidence threshold w, and a coherence strength parameter k.
OUTPUT: A set of highly coherent association rules (HCAR).
STEP 1: Scan the database and calculate the support of each item Ij.
STEP 2: Compare the support value of each item Ij, Sup(Ij), to the predefined minimum support α. If the support value of item Ij is larger than or equal to the minimum support value, then put Ij into the large 1-itemset as follows:
L1 = {Ij | Sup(Ij) ≥ α, 1 ≤ j ≤ m}, where m is the number of items.
STEP 3: If L1 is not null, let L1 be the highly coherent 1-itemset, and set r = 2.
STEP 4: Generate candidate coherent r-itemsets from highly coherent (r − 1)-itemsets.
STEP 5: For each candidate coherent r-itemset S, form all its candidate coherent itemsets {X, Y}, where X and Y are subsets of S and (X ∩ Y) = ∅.
STEP 6: For each itemset X, calculate the corresponding lower bound and upper bound of the support of itemset Y, Sup(Y), of the candidate coherent itemset (X, Y) using:
LBY < Sup(Y) < UBY
where LBY and UBY are calculated as LBY = Max[(2 × Sup(X) − 1), k × Sup(X)] and UBY = Min[(Sup(X) + 1)/2, 2 × Sup(X)], respectively, Sup(X) and Sup(Y) are the support values of itemsets X and Y, and k is a parameter representing the coherence strength of a coherent itemset. Its range is [0.5, 1]; the larger the value of k, the stronger the coherence of the itemset. When the support of X is known in advance, if the support of Y is not in the interval [LBY, UBY], then the itemset {X, Y} cannot be a coherent itemset.
STEP 7: If Sup(Y) of a candidate coherent itemset S is in the
interval [LBY, UBY], then keep it in the candidate coherent r-itemsets. Otherwise, remove the itemset S from the candidate r-itemsets, which also means that the candidate coherent itemset (X, Y) cannot generate qualified coherent association rules.
STEP 8: For each candidate coherent itemset (X, Y), calculate the contingency table for antecedent X and consequent Y. Here, four support values are calculated, namely Q1: Sup(XY), Q2: Sup(X¬Y), Q3: Sup(¬XY), and Q4: Sup(¬X¬Y). Note that Q2, Q3, and Q4 can be calculated as (Sup(X) − Q1), (Sup(Y) − Q1), and (T − Q1 − Q2 − Q3), respectively.
STEP 9: Check whether all candidate coherent itemsets (X, Y) of itemset S meet the five conditions Q1 > Q2, Q1 > Q3, Q4 > Q2, Q4 > Q3, and Q1 ≥ α. If yes, put candidate coherent r-itemset S into the highly coherent r-itemset Lr.
STEP 10: If Lr is not null, set r = r + 1 and Lall = Lall ∪ Lr, and go to STEP 4. Otherwise, go to the next step.
STEP 11: Generate highly coherent association rules using the following substeps:
SUBSTEP 11.1: For each itemset S in Lall, generate all its candidate coherent association rules (X → Y), where X and Y are subsets of S and (X ∩ Y) = ∅.
SUBSTEP 11.2: Calculate the confidence value of each candidate coherent association rule (X → Y) using conf(X → Y) = Sup(XY)/Sup(X).
SUBSTEP 11.3: Check the confidence value of each candidate coherent association rule (X → Y). If its value is larger than or equal to the minimum confidence value w, then put it into the highly coherent association rule set (HCAR) as follows:
HCAR = {Rulei: (X → Y) | conf(Rulei) ≥ w}.
STEP 12: Output the highly coherent association rule set HCAR.
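The steps above can be sketched in Python as follows. This is a minimal illustrative implementation written for this article, not the authors' code: the function names, the Apriori-style join in STEP 4, and the data layout are assumptions, T is normalized to 1 (supports as fractions), and the pruning uses the LBY/UBY formulas from STEP 6.

```python
from itertools import combinations

def hcarm(transactions, min_sup, min_conf, k=0.5):
    # Illustrative sketch of the HCARM steps; names and structure are ours.
    t_sets = [frozenset(t) for t in transactions]
    n = len(t_sets)

    def sup(itemset):
        return sum(1 for t in t_sets if itemset <= t) / n

    # STEPs 1-3: large (and trivially coherent) 1-itemsets
    items = {i for t in t_sets for i in t}
    level = [frozenset([i]) for i in items if sup(frozenset([i])) >= min_sup]
    coherent = []  # L_all: highly coherent itemsets of size >= 2
    r = 2
    while level:
        # STEP 4: Apriori-style join of (r-1)-itemsets into r-itemsets
        candidates = {a | b for a in level for b in level if len(a | b) == r}
        level = []
        for s in candidates:
            ok = True
            # STEP 5: every split of S into antecedent X and consequent Y
            for size in range(1, len(s)):
                for x in map(frozenset, combinations(s, size)):
                    y = s - x
                    sx, sy = sup(x), sup(y)
                    # STEPs 6-7: prune S if Sup(Y) is outside (LB_Y, UB_Y)
                    lb = max(2 * sx - 1, k * sx)
                    ub = min((sx + 1) / 2, 2 * sx)
                    if not (lb < sy < ub):
                        ok = False
                        break
                    # STEP 8: contingency table from three support values
                    q1 = sup(x | y)
                    q2, q3 = sx - q1, sy - q1
                    q4 = 1 - q1 - q2 - q3
                    # STEP 9: coherence plus minimum-support conditions
                    if not (q1 > q2 and q1 > q3 and q4 > q2 and q4 > q3
                            and q1 >= min_sup):
                        ok = False
                        break
                if not ok:
                    break
            if ok:
                level.append(s)
        coherent.extend(level)
        r += 1
    # STEP 11: generate the rules whose confidence reaches min_conf
    rules = []
    for s in coherent:
        for size in range(1, len(s)):
            for x in map(frozenset, combinations(s, size)):
                rules.append((x, s - x, sup(s) / sup(x)))
    return [(x, y, c) for x, y, c in rules if c >= min_conf]

# Tiny synthetic run (data invented for illustration): A and B co-occur in
# 4 of 10 transactions, so the only highly coherent 2-itemset is {A, B}.
data = [["A", "B"]] * 4 + [["C"]] * 4 + [["A"], ["B"]]
rules = hcarm(data, min_sup=0.3, min_conf=0.6)
```

On this toy data the sketch returns the two rules A → B and B → A, each with confidence 0.4/0.5 = 0.8.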
5. An example
Step 1: The database is scanned to calculate the support of each item. Take item I1 as an example: its support is 0.28 (=7/25). The results for all items are shown in Table 6.
Step 2: The support value of each item Ij is compared to the predefined minimum support α. Take I1 as an example. Since the support value of I1 is 0.28, which equals the minimum support, I1 is put into the highly coherent 1-itemset. The results are shown in Table 7.
Step 3: Since L1 is not null, r = 2 is set and the next step follows.
Step 4: The candidate coherent 2-itemsets are generated from the highly coherent 1-itemsets. In this example, a total of 15 candidate coherent 2-itemsets are generated.
Step 5: For each candidate coherent 2-itemset S, all candidate coherent itemsets (X, Y) of the itemset S are formed, where X and Y are subsets of S and (X ∩ Y) = ∅. Take the itemset S = (I4, I5) as an example. Two candidate coherent itemsets are generated, namely {X = I4, Y = I5} and {X = I5, Y = I4}. Thus, in the following step, itemsets I4 and I5 will be used to calculate the corresponding lower bound and upper bound of the support of itemset Y.
Step 6: For each itemset X, the corresponding lower bound and upper bound of the support of itemset Y, Sup(Y), are calculated. Take the itemset {X = I4} as an example. Since the support value of I4 is 0.6, the lower bound of Sup(Y) is calculated as 0.3 (=Max(2 × 0.6 − 1, 0.5 × 0.6)) and the upper bound of Sup(Y) as 0.8 (=Min((0.6 + 1)/2, 2 × 0.6)). Thus, the boundaries of Sup(Y) are 0.3 < Sup(Y) < 0.8, which means that if the support value of a consequent itemset Y is not in this interval, then the itemset {X, Y} cannot be a coherent itemset. The final results for all itemsets are shown in Table 8.
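Step 6's arithmetic can be reproduced with a small helper (written for this illustration, not taken from the paper) implementing the LBY/UBY formulas with coherence strength k = 0.5, the value used in this example.

```python
def step6_bounds(sup_x, k=0.5):
    # LB_Y = Max[(2*Sup(X) - 1), k*Sup(X)]; UB_Y = Min[(Sup(X) + 1)/2, 2*Sup(X)]
    lb = max(2 * sup_x - 1, k * sup_x)
    ub = min((sup_x + 1) / 2, 2 * sup_x)
    return lb, ub

bounds_i4 = step6_bounds(0.6)   # itemset {X = I4}
bounds_i1 = step6_bounds(0.28)  # itemset {X = I1}
```

These reproduce the intervals [0.3, 0.8] for I4 and [0.14, 0.56] for I1 listed in Table 8.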
Step 7: Since the interval of Sup(Y) is [0.3, 0.8] for itemset {X = I4}, the support values of itemsets I2, I3, I5, and I7 are in the interval, and thus these itemsets are possible coherent 2-itemsets with item I4. The support of item I1 is 0.28, which is not in the range. Therefore, itemset {I1,
In this section, an example is given to illustrate the proposed mining algorithm. This is a simple example to show how the proposed algorithm can be used to mine highly coherent association rules from transaction data. Assume that there are seven items in a transaction database. The dataset includes the 25 transactions shown in Table 5. In this example, the minimum support is set at 0.28 and the minimum confidence is set at 0.5. For the transaction data in Table 5, the proposed mining algorithm proceeds as follows.
Table 5
Transaction database used for example.

ID    Items
1     I2
2     I6
3     I2, I4
4     I3
5     I3
6     I3
7     I4
8     I3, I4, I5, I7
9     I1, I3, I4, I5, I7
10–25 [item lists for these rows were garbled in extraction; the item supports they yield are given in Table 6]
Table 6
Supports of all items.

Item    Support
I1      0.28 (=7/25)
I2      0.44
I3      0.48
I4      0.6
I5      0.52
I6      0.2
I7      0.4
Table 7
Highly coherent 1-itemset.

Item    Sup(Ij)
I1      0.28 (=7/25)
I2      0.44
I3      0.48
I4      0.6
I5      0.52
I7      0.4
Table 8
Intervals of all itemsets.

X     Sup(X)    [LBY, UBY]
I1    0.28      [0.14, 0.56]
I2    0.44      [0.22, 0.72]
I3    0.48      [0.24, 0.74]
I4    0.6       [0.3, 0.8]
I5    0.52      [0.26, 0.76]
I7    0.4       [0.2, 0.7]
I4) cannot be a coherent itemset, and it is pruned. Thus, the candidate coherent 2-itemsets with respect to item I4 are {{I4, I2}, {I4, I3}, {I4, I5}, {I4, I7}}.
Step 8: The contingency table of each candidate coherent itemset (X, Y) derived in the previous step is then calculated. Take the candidate coherent 2-itemset {I4, I5} as an example. Four support values are calculated, namely Q1: Sup(I4I5) = 0.4 (=10/25), Q2: Sup(I4¬I5) = 0.2 (=(15 − 10)/25), Q3: Sup(¬I4I5) = 0.12 (=(13 − 10)/25), and Q4: Sup(¬I4¬I5) = 0.28 (=(25 − 10 − 5 − 3)/25).
Step 9: It is then checked whether itemset (I4, I5) satisfies the five conditions. Since (I4, I5) satisfies Q1 > Q2, Q1 > Q3, Q4 > Q2, and Q4 > Q3, and Sup(I4I5) ≥ 0.28, it is put into the highly coherent 2-itemset. In total, four itemsets are added to L2: {I3, I5}, {I4, I5}, {I4, I7}, and {I5, I7}.
Step 10: Since L2 is not null, r is set at 3 and Lall = Lall ∪ L2. STEP 4 is then repeated. In this example, the itemset {I4, I5, I7} is generated and put into the highly coherent 3-itemset L3.
Step 11: Each highly coherent itemset in Lall is then used to generate highly coherent association rules using the following substeps:
SUBSTEP 11.1: Each highly coherent itemset is used to form candidate coherent association rules. Take the itemset {I3, I5} as an example. The following two candidate coherent association rules are formed: (1) I3 → I5 and (2) I5 → I3. All candidate coherent association rules are shown in Table 9.
SUBSTEP 11.2: The confidence value of each candidate coherent association rule is then calculated. Take the first rule I3 → I5 as an example: its confidence value is 0.66 (=8/12). The results for all rules are shown in Table 10.
Step 12: If the minimum confidence is set at 0.7, then seven rules are output as highly coherent association rules (HCAR): (1) I5 → I4; (2) I7 → I4; (3) I7 → I5; (4) (I4, I5) → I7; (5) (I4, I7) → I5; (6) (I5, I7) → I4; and (7) I7 → (I4, I5).
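Steps 8 and 9 for {I4, I5} can be checked numerically. The snippet below is an illustrative check (not the authors' code) that rebuilds the contingency table from Sup(I4) = 15/25, Sup(I5) = 13/25, and Sup(I4I5) = 10/25, with T normalized to 1.

```python
n = 25
sup_x, sup_y, sup_xy = 15 / n, 13 / n, 10 / n  # Sup(I4), Sup(I5), Sup(I4I5)

q1 = sup_xy            # Sup(I4, I5)   = 0.40
q2 = sup_x - q1        # Sup(I4, ¬I5)  = 0.20
q3 = sup_y - q1        # Sup(¬I4, I5)  = 0.12
q4 = 1 - q1 - q2 - q3  # Sup(¬I4, ¬I5) = 0.28

# STEP 9: the four coherence conditions plus the minimum-support condition
coherent = q1 > q2 and q1 > q3 and q4 > q2 and q4 > q3 and q1 >= 0.28
```

Here coherent evaluates to True, matching the conclusion that {I4, I5} enters the highly coherent 2-itemset L2.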
6. Experimental results

In this section, experimental results of the proposed approach are presented. The programs were implemented in Java on a personal computer with an Intel Core2 2.2-GHz CPU and 2 GB of RAM running Windows 7 Professional. The algorithm was evaluated on two datasets: a real dataset, namely the Foodmart dataset from Microsoft SQL Server, and a simulated dataset generated by the IBM Generator. The parameters of the simulated data are shown in Table 11. The details of the two datasets are shown in Table 12. In the experiments, the minimum support thresholds were varied.

The Foodmart dataset was first used to evaluate the efficiency and performance of the proposed approach. Various minimum

Table 11
Parameters of the simulation data.

Parameter    Definition                              Value
D            Number of transactions (in 000s)        300
T            Average items per transaction           8
N            Number of different items (in 000s)     0.5
L            Number of patterns (possible rules)     500
I            Average length of maximal pattern       7
Table 12
Details of databases.

Database          # of transactions    # of items
Foodmart          21,557               1559
Simulated data    247,488              489
Table 13
Comparison results between HCARM and Apriori for Foodmart dataset.

Approach    Minimum support (%)    Total number of rules    Average conf.    Execution time (s)
HCARM       0.0001                 2840                     0.918            2739
HCARM       0.00015                136                      0.906            2481
Apriori     0.0001                 18,880                   0.325            3087
Apriori     0.00015                2520                     0.211            2889
Table 9
Candidate coherent association rules.

RID    Rule          RID    Rule
1      I3 → I5       8      I7 → I5
2      I5 → I3       9      (I4, I5) → I7
3      I4 → I5       10     (I4, I7) → I5
4      I5 → I4       11     (I5, I7) → I4
5      I4 → I7       12     I7 → (I4, I5)
6      I7 → I4       13     I5 → (I4, I7)
7      I5 → I7       14     I4 → (I5, I7)
Table 10
Confidence value of each rule.

RID    Rule          Confidence    RID    Rule             Confidence
1      I3 → I5       0.66          8      I7 → I5          0.9
2      I5 → I3       0.61          9      (I4, I5) → I7    0.9
3      I4 → I5       0.67          10     (I4, I7) → I5    1
4      I5 → I4       0.77          11     (I5, I7) → I4    1
5      I4 → I7       0.6           12     I7 → (I4, I5)    0.9
6      I7 → I4       0.9           13     I5 → (I4, I7)    0.69
7      I5 → I7       0.69          14     I4 → (I5, I7)    0.6
[Fig. 1. Comparison results of HCARM and Apriori for simulated dataset: average confidence versus minimum support (1–1.5%).]

[Fig. 2. Execution times of HCARM and Apriori for simulated dataset: execution time (s) versus minimum support (1–1.5%).]
Table 14
Numbers of rules obtained with various minimum supports.

                        Minimum support (%)
                        1        1.1      1.2      1.3      1.4      1.5
HCARM   # of rules      31,622   10,428   5144     552      10       4
        Avg. sup.       0.011    0.012    0.0125   0.0133   0.0153   0.017
        Avg. conf.      0.843    0.843    0.845    0.818    0.695    0.795
Apriori # of rules      45,334   16,812   9052     2094     338      158
        Avg. sup.       0.011    0.0121   0.0127   0.0138   0.016    0.018
        Avg. conf.      0.764    0.739    0.7245   0.603    0.383    0.322
supports were used to compare the rules derived using the proposed approach (HCARM) to those derived using the original Apriori algorithm. The results are shown in Table 13. The minimum supports were set at 0.0001% and 0.00015%. Table 13 shows that the numbers of rules derived by HCARM are 2840 and 136 for supports of 0.0001% and 0.00015%, respectively. The numbers of derived rules are lower than those obtained by the Apriori approach; these results are reasonable because the constraints of HCARM are stricter than those of Apriori. In addition, the average confidences of the rules obtained using HCARM are 0.918 and 0.906 for the two supports, respectively, whereas those of the rules obtained using the Apriori approach are 0.325 and 0.211. The rules derived using the proposed algorithm are thus more interesting for managers due to their higher confidences. The proposed approach also has a shorter execution time than Apriori.

Experiments were then made on the simulated dataset. Various minimum supports were set to compare the performance of the proposed algorithm with that of the Apriori algorithm. The results are shown in Fig. 1. The figure shows that the confidences of the rules derived using the proposed approach are larger than those of the rules derived using Apriori, especially when the support value was set at 1.5%. The execution times of HCARM and Apriori for the simulated dataset are shown in Fig. 2. The figure shows that the execution time of HCARM is lower than that of the Apriori algorithm, especially when the minimum support was set to less than 1.3%. The numbers of rules obtained using HCARM and Apriori with various minimum supports are shown in Table 14. Table 14 shows that HCARM derived fewer rules than did Apriori. However, the rules derived by HCARM might be more interesting than those derived by the Apriori algorithm in terms of average confidence.
In addition, the rules derived by HCARM have the logical equivalence property, which means that if a rule ''X → Y'' was derived, then the rule ''¬X → ¬Y'' also existed. The experimental results indicate that the proposed approach provides fewer but more useful rules (in terms of confidence) than those obtained using the Apriori algorithm. The derived rules have the logical equivalence property, which may be interesting from a business point of view.

7. Conclusion and future work

Association-rule mining produces many rules that are common sense and thus difficult to apply in business applications; the rules may also be misleading. This paper proposed an Apriori-like approach (HCARM) for mining interesting rules from transaction databases. Experiments were conducted on the Foodmart dataset and a simulated dataset to show the performance of the proposed approach. The two major contributions of this paper are: (1) Highly coherent association rules were developed to help decision-makers make marketing strategies.
(2) To reduce the time required to generate candidate coherent itemsets, the corresponding lower and upper bounds of the supports of subitemsets in the consequent part of an itemset are defined for removing itemsets that cannot become highly coherent itemsets.

The proposed approach can be further enhanced, for example, with higher-efficiency mining algorithms, evaluation on various transaction datasets, and multi-level rule mining. In the future, we will extend the proposed approach to more complex problems.

References

Agarwal, R., Aggarwal, C., & Prasad, V. V. V. (2001). A tree projection algorithm for generation of frequent itemsets. Journal of Parallel and Distributed Computing, 61(3), 350–371.
Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules in large databases. In The 20th international conference on very large data bases (pp. 487–499).
Agrawal, R., Imielinski, T., & Swami, A. (1993a). Mining association rules between sets of items in large databases. In The ACM SIGMOD conference, Washington DC, USA.
Agrawal, R., Imielinski, T., & Swami, A. (1993b). Database mining: A performance perspective. IEEE Transactions on Knowledge and Data Engineering, 5(6), 914–925.
Antonie, M. L., & Zaïane, O. R. (2004). Mining positive and negative association rules: An approach for confined rules. In PKDD '04: Proceedings of the 8th European conference on principles and practice of knowledge discovery in databases (pp. 27–38).
Baralis, E., Cerquitelli, T., & Chiusano, S. (2009). IMine: Index support for item set mining. IEEE Transactions on Knowledge and Data Engineering, 21(4), 493–506.
Bie, T. D. (2011). An information theoretic framework for data mining. In Proceedings of the 17th ACM SIGKDD conference on knowledge discovery and data mining (pp. 564–572).
Brin, S., Motwani, R., & Silverstein, C. (1997). Beyond market baskets: Generalizing association rules to correlations. In Proceedings of the ACM SIGMOD international conference on management of data (pp. 265–276).
Cai, R., Tung, A. K.
H., Zhang, Z., & Hao, Z. (2011). What is unequal among the equals? Ranking equivalent rules from gene expression data. IEEE Transactions on Knowledge and Data Engineering, 23(11), 1735–1747. Cheung, Y. L., & Fu, A. W. C. (2004). Mining frequent itemsets without support threshold: With and without item constraints. Proceedings of IEEE Transactions on Knowledge and Data Engineering, 16(9), 1052–1069. Chiang, D. A., Wang, C. T., Chen, S. P., & Chen, C. C. (2009). The cyclic model analysis on sequential patterns. IEEE Transactions on Knowledge and Data Engineering, 21(11), 1617–1628. Cule, B., Goethals, B. (2010). Mining association rules in long sequences. In Proceedings of the 14th Pacific-Asia conference on advances in knowledge discovery and data mining (pp. 300–309). Do, T. D. T., Laurenty, A., Termier, A. (2010). PGLCM: Efficient parallel mining of closed frequent gradual itemsets. In IEEE international conference on data mining (pp. 138–147). Han, J., Wang, J., Lu, Y., Tzvetkov, P. (2002). Mining top-K frequent closed patterns without minimum support. In Proceedings of the IEEE international conference on data mining (p. 211). Han, J., Pei, J., Yin, Y., & Mao, R. (2004). Mining frequent patterns without candidate generation a frequent-pattern tree approach. Journal of Data Mining and Knowledge Discovery, 8, 53–87. Marinica, C., & Guillet, F. (2010). Knowledge-based interactive postmining of association rules using ontologies. IEEE Transactions on Knowledge and Data Engineering, 22(6), 784–797. Plantevit, M., Laurent, A., Laurent, D., Teisseire, M., & Choong, Y. W. (2010). Mining multidimensional and multilevel sequential patterns. ACM Transactions on Knowledge Discovery from Data, 4(1). Article 4. Ruggieri, S. (2010). Frequent regular itemset mining. In Proceedings of the ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 263–272).
Sá, C. R. D., Soares, C., Jorge, A. M., Azevedo, P., & Costa, J. (2011). Mining association rules for label ranking. In Proceedings of the 15th Pacific-Asia conference on advances in knowledge discovery and data mining (pp. 432–443).
Segond, M., & Borgelt, C. (2011). Item set mining based on cover similarity. In Proceedings of the 15th Pacific-Asia conference on advances in knowledge discovery and data mining (pp. 493–505).
Sim, A. T. H., Indrawan, M., Zutshi, S., & Srinivasan, B. (2010). Logic-based pattern discovery. IEEE Transactions on Knowledge and Data Engineering, 22(6), 798–811.
Teng, W. G., Hsieh, M. J., & Chen, M. S. (2002). On the mining of substitution rules for statistically dependent items. In Proceedings of the IEEE international conference on data mining (pp. 442–449).
Wang, K., He, Y., & Han, J. (2003). Pushing support constraints into association rules mining. IEEE Transactions on Knowledge and Data Engineering, 15, 642–658.
Webb, G. I. (2010). Self-sufficient itemsets: An approach to screening potentially interesting associations between items. ACM Transactions on Knowledge Discovery from Data, 4, 1–20.
Wu, X., Zhang, C., & Zhang, S. (2004). Efficient mining of both positive and negative association rules. ACM Transactions on Information Systems, 22(3), 381–405.
Zhao, Y., Zhang, H., Wu, S., Pei, J., Cao, L., Zhang, C., & Bohlscheid, H. (2009). Debt detection in social security by sequence classification using both positive and negative patterns. In Proceedings of ECML-PKDD, Vol. 5782 (pp. 648–663).
Zheng, Z., Zhao, Y., Zuo, Z., & Cao, L. (2010). An efficient GA-based algorithm for mining negative sequential patterns. In Proceedings of the 14th Pacific-Asia conference on advances in knowledge discovery and data mining (pp. 262–273).
Zhong, N., Li, Y., & Wu, S. T. (2012). Effective pattern discovery for text mining. IEEE Transactions on Knowledge and Data Engineering, 24(1), 30–44.