Expert Systems with Applications 40 (2013) 6531–6537
Mining high coherent association rules with consideration of support measure ☆

Chun-Hao Chen (a), Guo-Cheng Lan (b), Tzung-Pei Hong (c,d,*), Yui-Kai Lin (d)

(a) Department of Computer Science and Information Engineering, Tamkang University, Taipei 251, Taiwan
(b) Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan 701, Taiwan
(c) Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung 811, Taiwan
(d) Department of Computer Science and Engineering, National Sun Yat-sen University, Kaohsiung 804, Taiwan
Keywords: Data mining; Association rules; Propositional logic; Coherent rules; Highly coherent rules
Abstract

Data mining has been studied for a long time. Its goal is to help market managers find relationships among items in large databases and thus increase sales volume. Association-rule mining is one of the well-known and commonly used techniques for this purpose, and the Apriori algorithm is an important method for such a task. Based on the Apriori algorithm, many mining approaches have been proposed for diverse applications. Many of these data mining approaches focus on positive association rules such as ''if milk is bought, then cookies are bought''. Such rules may, however, be misleading since there may be customers that buy milk but not cookies. This paper thus takes the properties of propositional logic into consideration and proposes an algorithm for mining highly coherent rules. The derived association rules are expected to be more meaningful and reliable for business. Experiments on two datasets are also made to show the performance of the proposed approach.
© 2013 Elsevier Ltd. All rights reserved.
1. Introduction

Data mining is often used to discover interesting information and relationships between items in large datasets or databases (Agrawal & Srikant, 1994; Agrawal, Imielinski, & Swami, 1993a, 1993b). One of the applications of data mining is to increase sales volume. Association rules in data mining can be expressed as X → Y, where X and Y are sets of items and X ∩ Y = ∅. The meaning of the expression is that if all the items in X exist in a transaction, then the items in Y will also exist in the same transaction with high probability. For example, if customers purchase bread and milk together with high probability when shopping in a market, the rule ''Bread → Milk'' will be mined out, as it is of interest to marketing managers.

Agrawal et al. proposed the Apriori approach for association-rule mining. Based on the Apriori approach, many algorithms have been proposed, with some focusing on positive association rules (Agarwal, Aggarwal, & Prasad, 2001; Baralis, Cerquitelli, & Chiusano, 2009; Bie, 2011; Brin, Motwani, & Silverstein, 1997; Cai, Tung, Zhang, & Hao, 2011; Cheung & Fu, 2004; Cule & Goethals, 2010; Do, Laurenty, & Termier, 2010; Han, Pei, Yin, & Mao, 2004; Plantevit, Laurent, Laurent, Teisseire, & Choong, 2010; Ruggieri, 2010; Teng et al., 2002; Wang, He, & Han, 2003; Webb, 2010). These approaches use different types of association-rule mining, such as weighted association-rule mining (Bie, 2011; Wang et al., 2003) and fuzzy association-rule mining (Cai et al., 2011), to achieve a specific goal. Some of them focus on negative association rules, which are useful for analyzing transactions in real-world applications (Antonie & Zaïane, 2004; Segond & Borgelt, 2011; Wu, Zhang, & Zhang, 2004; Zhong, Li, & Wu, 2012). Brin et al. described the importance of negative association rules (Brin et al., 1997). Unlike positive rules, negative association rules provide a more thoughtful and symmetric property in rule detection and analysis (Wu et al., 2004; Zhao et al., 2009). In general, the number of derived negative rules is larger than that of positive rules for a given transaction database. Wu et al. thus designed an algorithm to enhance the performance of the mining process through an interestingness function and a pruning strategy (Wu et al., 2004).

A common problem of association-rule mining approaches is that many of the derived rules are common sense and thus cannot be easily used, especially for business applications, and some may be misleading. For example, the rule ''if milk is bought, then bread is bought'' may be misleading because customers may buy milk and not buy bread. To solve this problem, Sim et al. proposed an algorithm for mining coherent rules based on the properties of propositional logic without a minimum support threshold. In their approach, if a rule satisfies logic equivalence, then it is a coherent rule. Logic equivalence means that if the rule X → Y exists, then the rule ¬X → ¬Y also exists under certain criteria.

☆ This is a modified and expanded version of the paper ''A high coherent association rule mining algorithm,'' presented at the International Conference on Artificial Intelligence and Applications, 2012, Taiwan.
* Corresponding author at: Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung 811, Taiwan.
E-mail addresses: [email protected] (C.-H. Chen), [email protected] (T.-P. Hong), [email protected] (Y.-K. Lin).
http://dx.doi.org/10.1016/j.eswa.2013.06.002
Based on logical equivalence, this paper proposes an Apriori-like approach, namely the Highly Coherent Association-Rule Mining (HCARM) algorithm, for mining rules from transaction databases. Since generating candidate coherent itemsets is time-consuming, the proposed approach first calculates the corresponding lower and upper bounds of subitemsets in the consequent part of an itemset. These intervals are then used to remove itemsets that cannot become highly coherent itemsets, speeding up the mining process. Then, the contingency tables of the qualified itemsets are calculated and used to check whether they satisfy the conditions of logical equivalence. If yes, the itemsets are used to generate highly coherent association rules. Experiments on the Foodmart dataset and a simulated dataset are made to show the performance of the proposed approach. The rest of this paper is organized as follows. Related works are described in Section 2. The derivation of the lower and upper bounds of itemsets is described in Section 3. The proposed mining algorithm and an example that illustrates its use are presented in Sections 4 and 5, respectively. Experimental results are shown in Section 6. The conclusions and suggestions for future work are given in Section 7.
2. Review of related mining approaches This section reviews related works, including association-rule mining approaches and the concept of coherent rules.
2.1. Association-rule mining approaches

Many data mining methods have been proposed for deriving association rules from transaction databases. Data mining aims to extract useful knowledge and patterns from existing data to solve a specific problem. Association-rule mining has been applied in fields such as marketing and crime prevention and prediction. An association rule is represented as A → B, where A and B are sets of products. The rule expresses that if the products in A are purchased, then the products in B will be purchased together with high probability (Agarwal et al., 2001). Since transaction databases usually consist of millions of records, rule mining is time-consuming. In order to improve the performance of the mining process in various domains, many methods have been proposed, including those based on the FP-tree structure and the mining of both positive and negative association rules.

Many existing mining approaches focus on mining positive rules. Cai et al. designed an incremental Apriori-like algorithm for selecting interesting rules with two interestingness measures, called Max-Subrule-Conf and Min-Subrule-Conf (Cai et al., 2011). Do et al. (2010) proposed the PGLCM algorithm for mining closed frequent gradual patterns; the execution time of their approach increases linearly with the number of closed frequent gradual itemsets. To derive the buying cycles of items and the intervals between those items, Chiang et al. combined data mining and statistics and proposed Cyclic Model Analysis (CMA) (Chiang, Wang, Chen, & Chen, 2009).

Negative rules are also important in transaction data. Teng et al. proposed a method for mining substitution rules, which means that customers can replace some items with others (Teng, Hsieh, & Chen, 2002). Their SRM algorithm was designed to enhance performance and consists of two phases: all frequent positive and negative itemsets are generated, and then the substitution rules are derived. Antonie et al.
devised an algorithm for generating both positive and negative association rules with a sliding correlation coefficient threshold (Antonie & Zaïane, 2004). The negative mining concept has also been extended for mining negative sequential patterns. Zheng et al. proposed a genetic algorithm (GA)-based algorithm
to find negative sequential patterns (Zheng, Zhao, Zuo, & Cao, 2010). Since data mining is a time-consuming task, many algorithms have been proposed to improve the efficiency of the mining process. Han et al. proposed the TFP algorithm for mining top-k frequent closed patterns (Han, Wang, Lu, & Tzvetkov, 2002). They also proposed a two-step FP-growth algorithm for mining frequent itemsets, with the database scanned only twice (Han et al., 2004). In the first step, the size of the database is reduced and the FP-tree is constructed from the reduced database. In the second step, the FP-growth approach is used to derive all frequent itemsets. Baralis et al. proposed the Itemset-Tree and the Item-Btree for creating a compact and lossless representation of item relations (Baralis et al., 2009). Marinica et al. proposed an interactive post-processing approach to reduce the number of discovered rules (Marinica & Guillet, 2010). The approach first integrates user domain knowledge into association rules according to ontologies and rule schemas; it then guides the user to prune and filter rules, and the ARIPSO framework is utilized to select interesting rules. Segond et al. proposed a mining algorithm based on the Eclat algorithm, which takes the covers of associated items into consideration in the mining task (Segond & Borgelt, 2011). Methods that use multiple levels and thresholds have also been proposed (Brin et al., 1997; Ruggieri, 2010).

2.2. Concept of coherent rules

One of the main issues of association-rule mining approaches is how to define an appropriate minimum support and minimum confidence. It has been reported that although an appropriate minimum support may exist, it can be difficult to find (Wu et al., 2004). Many studies have proposed methods for finding an appropriate minimum support (Han et al., 2004; Plantevit et al., 2010; Ruggieri, 2010; Webb, 2010; Zhong et al., 2012). The coherent rule mining algorithm proposed by Sim et al.
is based on the properties of propositional logic (Sim et al., 2010). In their approach, by using the properties of propositional logic, the relationships between items can be derived directly without knowing the appropriate value of the minimum support. The approach maps association rules to equivalences. Each mapping from an association rule to an equivalence should satisfy the conditions shown in Table 1. In Table 1, X and Y are two itemsets. An association rule X → Y is mapped to p ≡ q if and only if (1) X → Y is true; (2) X → ¬Y is false; (3) ¬X → Y is false; and (4) ¬X → ¬Y is true. When used on multiple transactions, association rules are mapped to implications based on comparisons between supports: X → Y is mapped to an implication p → q if and only if (1) Sup(X, Y) > Sup(¬X, Y); (2) Sup(X, Y) > Sup(X, ¬Y); (3) Sup(¬X, ¬Y) > Sup(¬X, Y); and (4) Sup(¬X, ¬Y) > Sup(X, ¬Y). In the same way, other association rules are mapped to implications based on comparisons between supports, creating pseudo-implications. Sim et al. developed coherent rules from the pseudo-implications. The following four conditions must be satisfied for a coherent rule: (1) Sup(X, Y) > Sup(¬X, Y); (2) Sup(X, Y) > Sup(X, ¬Y); (3) Sup(¬X, ¬Y) > Sup(¬X, Y); and (4) Sup(¬X,
Table 1
Four conditions for mapping rules to equivalences.

Equivalences    Association rules
p ≡ q           X → Y
¬p ≡ ¬q         ¬X → ¬Y

True or False on association rules    Association rules    Required conditions
T                                     X → Y                ¬X → ¬Y
F                                     X → ¬Y               ¬X → Y
F                                     ¬X → Y               X → ¬Y
T                                     ¬X → ¬Y              X → Y
Table 2
Contingency table of a rule.

Frequency of co-occurrences    Consequence Y       ¬Y
Antecedent X                   Q1 = Sup(X, Y)      Q2 = Sup(X, ¬Y)
           ¬X                  Q3 = Sup(¬X, Y)     Q4 = Sup(¬X, ¬Y)
¬Y) > Sup(X, ¬Y). These four conditions can also be represented as a contingency table, as shown in Table 2. The concept of a coherent rule is that if a rule X → Y exists, then the rule ¬X → ¬Y should also exist. Thus, according to the coherent properties, the coherent mining algorithm for processing a certain consequence itemset Y is stated as follows (Sim et al., 2010). Firstly, the algorithm finds the set of items in the given transaction database, namely I. Then, it maps the power sets of A (= I − Y) to the indices of a binary system. At last, the frequencies of each element in the set A and itemset Y are calculated for deriving coherent rules, exploiting the anti-monotone property on the condition Sup(X, Y) > Sup(¬X, Y). By utilizing the coherent rule concept, the present study proposes an association-rule mining algorithm that uses these four conditions in the mining process for deriving highly coherent association rules from transactions.

3. Derivation of lower and upper bounds of itemsets

This section derives the bounds based on the properties of coherent rules defined in the previous section. For a coherent rule X → Y, let the supports of X and Y be P1 and P2, respectively, and let the number of transactions be T. In the following, assume that the derived coherent rule X → Y has the highest confidence if P1 > P2. The contingency table is shown in Table 3. From Table 3, according to the four criteria, if the rule X → Y is a coherent rule, the following two formulas must be satisfied:
P2 > (P1 − P2)    (1)
(T − P1) > (P1 − P2)    (2)
The contingency table for P2 > P1 is shown in Table 4. From Table 4, the following two formulas must be satisfied:
P1 > (P2 − P1)    (3)
(T − P2) > (P2 − P1)    (4)
Using formulas (1)–(4), the following theorem is obtained:

Theorem 1. Let X → Y be a coherent rule, where X and Y are itemsets. Assume that the support of X is P1, the support of Y is P2, and the total number of transactions is T; then, the support of Y must be in the range Max[(2P1 − T), 0.5P1] < P2 < Min[(P1 + T)/2, 2P1].
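As a quick numeric check of Theorem 1, the bounds can be computed directly from P1 and T. The helper below is an illustrative sketch written for this discussion (the function name and count-based interface are not from the paper); it works on raw counts, matching the theorem's notation.

```python
def coherent_support_bounds(p1, t):
    # Bounds on P2 (the support count of Y) from Theorem 1:
    # Max[(2*P1 - T), 0.5*P1] < P2 < Min[(P1 + T)/2, 2*P1]
    lower = max(2 * p1 - t, 0.5 * p1)
    upper = min((p1 + t) / 2, 2 * p1)
    return lower, upper

# With P1 = 15 and T = 25 (item I4 in the example of Section 5),
# the interval is 7.5 < P2 < 20.
lo_i4, hi_i4 = coherent_support_bounds(15, 25)
```

Dividing the resulting counts by T recovers the normalized interval 0.3 < Sup(Y) < 0.8 used in the example of Section 5.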
Table 3
Contingency table when P1 > P2.

Frequency of co-occurrences    Consequence Y    ¬Y
Antecedent X                   P2               P1 − P2
           ¬X                  0                T − P1

Table 4
Contingency table when P2 > P1.

Frequency of co-occurrences    Consequence Y    ¬Y
Antecedent X                   P1               0
           ¬X                  P2 − P1          T − P2
Proof. For P1 > P2, the following two formulas are derived: (1) P2 > (P1 − P2) and (2) (T − P1) > (P1 − P2). They can be transformed into (1′) P2 > 0.5P1 and (2′) P2 > (2P1 − T), respectively. For P1 < P2, another two formulas are derived: (3) P1 > (P2 − P1) and (4) (T − P2) > (P2 − P1). They can also be transformed into (3′) 2P1 > P2 and (4′) P2 < (P1 + T)/2, respectively. From (1′) and (3′), it can be shown that (5) 0.5P1 < P2 < 2P1. From (2′) and (4′), it can be shown that (6) (2P1 − T) < P2 < (P1 + T)/2. Using formulas (5) and (6), the lower and upper bounds of P2 can be derived as Max[(2P1 − T), 0.5P1] and Min[(P1 + T)/2, 2P1], respectively. □

Given an itemset S, its candidate coherent itemsets {X, Y} can be formed, where X and Y are subsets of S and (X ∩ Y) = ∅. Since calculating the contingency table of a candidate coherent itemset is time-consuming, Theorem 1 is used: if the support of Y is not in the interval, then the itemset cannot generate coherent rules and is removed directly without having its contingency table calculated.

4. Proposed mining algorithm

This section describes the proposed mining algorithm, HCARM, for deriving highly coherent association rules from databases.

Proposed HCARM algorithm:
INPUT: A body of n transactions, a predefined minimum support threshold α, a predefined minimum confidence threshold w, and a coherence strength parameter k.
OUTPUT: A set of highly coherent association rules (HCAR).
STEP 1: Scan the database and calculate the support of each item Ij.
STEP 2: Compare the support value of each item Ij, Sup(Ij), to the predefined minimum support α. If the support value of item Ij is larger than or equal to the minimum support value, then put Ij into the large 1-itemset as follows:
L1 = {Ij | Sup(Ij) ≥ α, 1 ≤ j ≤ m}, where m is the number of items.
STEP 3: If L1 is not null, let L1 be the highly coherent 1-itemset, and set r = 2.
STEP 4: Generate candidate coherent r-itemsets from highly coherent (r − 1)-itemsets.
STEP 5: For each candidate coherent r-itemset S, form all its candidate coherent itemsets {X, Y}, where X and Y are subsets of S and (X ∩ Y) = ∅.
STEP 6: For each itemset X, calculate the corresponding lower bound and upper bound of the support of itemset Y, Sup(Y), of the candidate coherent itemset (X, Y) using:
LBY < Sup(Y) < UBY
where LBY and UBY are calculated as LBY = Max[(2 × Sup(X) − 1), k × Sup(X)] and UBY = Min[(Sup(X) + 1)/2, 2 × Sup(X)], respectively, Sup(X) and Sup(Y) are the support values of itemsets X and Y, and k is a parameter representing the coherence strength of a coherent itemset. Its range is [0.5, 1]; the larger the value of k, the stronger the coherence of the itemset. When the support of X is known in advance, if the support of Y is not in the interval [LBY, UBY], then the itemset {X, Y} cannot be a coherent itemset.
STEP 7: If Sup(Y) of a candidate coherent itemset S is in the
interval [LBY, UBY], then keep it in the candidate coherent r-itemsets. Otherwise, remove the itemset S from the candidate r-itemsets, which also means that the candidate coherent itemset (X, Y) cannot generate qualified coherent association rules.
STEP 8: For each candidate coherent itemset (X, Y), calculate the contingency table for antecedent X and consequent Y. Here, four support values are calculated, namely Q1: Sup(XY), Q2: Sup(X¬Y), Q3: Sup(¬XY), and Q4: Sup(¬X¬Y). Note that Q2, Q3, and Q4 can be calculated as (Sup(X) − Q1), (Sup(Y) − Q1), and (T − Q1 − Q2 − Q3), respectively.
STEP 9: Check whether all candidate coherent itemsets (X, Y) of itemset S meet the five conditions Q1 > Q2, Q1 > Q3, Q4 > Q2, Q4 > Q3, and Q1 ≥ α. If yes, put candidate coherent r-itemset S into the highly coherent r-itemset Lr.
STEP 10: If Lr is not null, set r = r + 1 and Lall = Lall ∪ Lr, and go to STEP 4. Otherwise, go to the next step.
STEP 11: Generate highly coherent association rules using the following substeps:
SUBSTEP 11.1: For each itemset S in Lall, generate all its candidate coherent association rules (X → Y), where X and Y are subsets of S and (X ∩ Y) = ∅.
SUBSTEP 11.2: Calculate the confidence value of each candidate coherent association rule (X → Y) using conf(X → Y) = Sup(XY)/Sup(X).
SUBSTEP 11.3: Check the confidence value of each candidate coherent association rule (X → Y). If its value is larger than or equal to the minimum confidence value w, then put it into the highly coherent association rule set (HCAR) as follows:
HCAR = {Rulei: (X → Y) | conf(Rulei) ≥ w}.
STEP 12: Output the highly coherent association rule set HCAR.
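The steps above can be sketched in Python as follows. This is a minimal illustrative implementation written for this article, not the authors' code: the function names, the Apriori-style join in STEP 4, and the data layout are assumptions, T is normalized to 1 (supports as fractions), and the pruning uses the LBY/UBY formulas from STEP 6.

```python
from itertools import combinations

def hcarm(transactions, min_sup, min_conf, k=0.5):
    # Illustrative sketch of the HCARM steps; names and structure are ours.
    t_sets = [frozenset(t) for t in transactions]
    n = len(t_sets)

    def sup(itemset):
        return sum(1 for t in t_sets if itemset <= t) / n

    # STEPs 1-3: large (and trivially coherent) 1-itemsets
    items = {i for t in t_sets for i in t}
    level = [frozenset([i]) for i in items if sup(frozenset([i])) >= min_sup]
    coherent = []  # L_all: highly coherent itemsets of size >= 2
    r = 2
    while level:
        # STEP 4: Apriori-style join of (r-1)-itemsets into r-itemsets
        candidates = {a | b for a in level for b in level if len(a | b) == r}
        level = []
        for s in candidates:
            ok = True
            # STEP 5: every split of S into antecedent X and consequent Y
            for size in range(1, len(s)):
                for x in map(frozenset, combinations(s, size)):
                    y = s - x
                    sx, sy = sup(x), sup(y)
                    # STEPs 6-7: prune S if Sup(Y) is outside (LB_Y, UB_Y)
                    lb = max(2 * sx - 1, k * sx)
                    ub = min((sx + 1) / 2, 2 * sx)
                    if not (lb < sy < ub):
                        ok = False
                        break
                    # STEP 8: contingency table from three support values
                    q1 = sup(x | y)
                    q2, q3 = sx - q1, sy - q1
                    q4 = 1 - q1 - q2 - q3
                    # STEP 9: coherence plus minimum-support conditions
                    if not (q1 > q2 and q1 > q3 and q4 > q2 and q4 > q3
                            and q1 >= min_sup):
                        ok = False
                        break
                if not ok:
                    break
            if ok:
                level.append(s)
        coherent.extend(level)
        r += 1
    # STEP 11: generate the rules whose confidence reaches min_conf
    rules = []
    for s in coherent:
        for size in range(1, len(s)):
            for x in map(frozenset, combinations(s, size)):
                rules.append((x, s - x, sup(s) / sup(x)))
    return [(x, y, c) for x, y, c in rules if c >= min_conf]

# Tiny synthetic run (data invented for illustration): A and B co-occur in
# 4 of 10 transactions, so the only highly coherent 2-itemset is {A, B}.
data = [["A", "B"]] * 4 + [["C"]] * 4 + [["A"], ["B"]]
rules = hcarm(data, min_sup=0.3, min_conf=0.6)
```

On this toy data the sketch returns the two rules A → B and B → A, each with confidence 0.4/0.5 = 0.8.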
5. An example
Step 1: The database is scanned to calculate the support of each item. Take item I1 as an example: its support is 0.28 (=7/25). The results for all items are shown in Table 6.
Step 2: The support value of each item Ij is compared to the predefined minimum support α. Take I1 as an example. Since the support value of I1 is 0.28, which equals the minimum support, I1 is put into the highly coherent 1-itemset. The results are shown in Table 7.
Step 3: Since L1 is not null, r = 2 is set and the next step follows.
Step 4: The candidate coherent 2-itemsets are generated from the highly coherent 1-itemsets. In this example, a total of 15 candidate coherent 2-itemsets are generated.
Step 5: For each candidate coherent 2-itemset S, all candidate coherent itemsets (X, Y) of the itemset S are formed, where X and Y are subsets of S and (X ∩ Y) = ∅. Take the itemset S = (I4, I5) as an example. Two candidate coherent itemsets are generated, namely {X = I4, Y = I5} and {X = I5, Y = I4}. Thus, in the following step, itemsets I4 and I5 will be used to calculate the corresponding lower bound and upper bound of the support of itemset Y.
Step 6: For each itemset X, the corresponding lower bound and upper bound of the support of itemset Y, Sup(Y), are calculated. Take the itemset {X = I4} as an example. Since the support value of I4 is 0.6, the lower bound of Sup(Y) is calculated as 0.3 (=Max(2 × 0.6 − 1, 0.5 × 0.6)) and the upper bound of Sup(Y) as 0.8 (=Min((0.6 + 1)/2, 2 × 0.6)). Thus, the boundaries of Sup(Y) are 0.3 < Sup(Y) < 0.8, which means that if the support value of a consequent itemset Y is not in this interval, then the itemset {X, Y} cannot be a coherent itemset. The final results for all itemsets are shown in Table 8.
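Step 6's arithmetic can be reproduced with a small helper (written for this illustration, not taken from the paper) implementing the LBY/UBY formulas with coherence strength k = 0.5, the value used in this example.

```python
def step6_bounds(sup_x, k=0.5):
    # LB_Y = Max[(2*Sup(X) - 1), k*Sup(X)]; UB_Y = Min[(Sup(X) + 1)/2, 2*Sup(X)]
    lb = max(2 * sup_x - 1, k * sup_x)
    ub = min((sup_x + 1) / 2, 2 * sup_x)
    return lb, ub

bounds_i4 = step6_bounds(0.6)   # itemset {X = I4}
bounds_i1 = step6_bounds(0.28)  # itemset {X = I1}
```

These reproduce the intervals [0.3, 0.8] for I4 and [0.14, 0.56] for I1 listed in Table 8.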
Step 7: Since the interval of Sup(Y) is [0.3, 0.8] for itemset {X = I4}, the support values of itemsets I2, I3, I5, and I7 are in the interval, and thus these itemsets are possible coherent 2-itemsets with item I4. The support of item I1 is 0.28, which is not in the range. Therefore, itemset {I1,
In this section, an example is given to illustrate the proposed mining algorithm. This is a simple example to show how the proposed algorithm can be used to mine highly coherent association rules from transaction data. Assume that there are seven items in a transaction database. The dataset includes the 25 transactions shown in Table 5. In this example, the minimum support is set at 0.28 and the minimum confidence is set at 0.5. For the transaction data in Table 5, the proposed mining algorithm proceeds as follows.
Table 5
Transaction database used for example.

ID    Items
1     I2
2     I6
3     I2, I4
4     I3
5     I3
6     I3
7     I4
8     I3, I4, I5, I7
9     I1, I3, I4, I5, I7
10–25 [item lists for these rows were garbled in extraction; the item supports they yield are given in Table 6]
Table 6
Supports of all items.

Item    Support
I1      0.28 (=7/25)
I2      0.44
I3      0.48
I4      0.6
I5      0.52
I6      0.2
I7      0.4
Table 7
Highly coherent 1-itemset.

Item    Sup(Ij)
I1      0.28 (=7/25)
I2      0.44
I3      0.48
I4      0.6
I5      0.52
I7      0.4
Table 8
Intervals of all itemsets.

X     Sup(X)    [LBY, UBY]
I1    0.28      [0.14, 0.56]
I2    0.44      [0.22, 0.72]
I3    0.48      [0.24, 0.74]
I4    0.6       [0.3, 0.8]
I5    0.52      [0.26, 0.76]
I7    0.4       [0.2, 0.7]
I4) cannot be a coherent itemset, and it is pruned. Thus, the candidate coherent 2-itemsets with respect to item I4 are {{I4, I2}, {I4, I3}, {I4, I5}, {I4, I7}}.
Step 8: The contingency table of each candidate coherent itemset (X, Y) derived in the previous step is then calculated. Take the candidate coherent 2-itemset {I4, I5} as an example. Four support values are calculated, namely Q1: Sup(I4I5) = 0.4 (=10/25), Q2: Sup(I4¬I5) = 0.2 (=(15 − 10)/25), Q3: Sup(¬I4I5) = 0.12 (=(13 − 10)/25), and Q4: Sup(¬I4¬I5) = 0.28 (=(25 − 10 − 5 − 3)/25).
Step 9: It is then checked whether itemset (I4, I5) satisfies the five conditions. Since (I4, I5) satisfies Q1 > Q2, Q1 > Q3, Q4 > Q2, and Q4 > Q3, and Sup(I4I5) ≥ 0.28, it is put into the highly coherent 2-itemset. In total, four itemsets are added to L2: {I3, I5}, {I4, I5}, {I4, I7}, and {I5, I7}.
Step 10: Since L2 is not null, r is set at 3 and Lall = Lall ∪ L2. STEP 4 is then repeated. In this example, the itemset {I4, I5, I7} is generated and put into the highly coherent 3-itemset L3.
Step 11: Each highly coherent itemset in Lall is then used to generate highly coherent association rules using the following substeps:
SUBSTEP 11.1: Each highly coherent itemset is used to form candidate coherent association rules. Take the itemset {I3, I5} as an example. The following two candidate coherent association rules are formed: (1) I3 → I5 and (2) I5 → I3. All candidate coherent association rules are shown in Table 9.
SUBSTEP 11.2: The confidence value of each candidate coherent association rule is then calculated. Take the first rule I3 → I5 as an example: its confidence value is 0.66 (=8/12). The results for all rules are shown in Table 10.
Step 12: If the minimum confidence is set at 0.7, then seven rules are output as highly coherent association rules (HCAR): (1) I5 → I4; (2) I7 → I4; (3) I7 → I5; (4) (I4, I5) → I7; (5) (I4, I7) → I5; (6) (I5, I7) → I4; and (7) I7 → (I4, I5).
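Steps 8 and 9 for {I4, I5} can be checked numerically. The snippet below is an illustrative check (not the authors' code) that rebuilds the contingency table from Sup(I4) = 15/25, Sup(I5) = 13/25, and Sup(I4I5) = 10/25, with T normalized to 1.

```python
n = 25
sup_x, sup_y, sup_xy = 15 / n, 13 / n, 10 / n  # Sup(I4), Sup(I5), Sup(I4I5)

q1 = sup_xy            # Sup(I4, I5)   = 0.40
q2 = sup_x - q1        # Sup(I4, ¬I5)  = 0.20
q3 = sup_y - q1        # Sup(¬I4, I5)  = 0.12
q4 = 1 - q1 - q2 - q3  # Sup(¬I4, ¬I5) = 0.28

# STEP 9: the four coherence conditions plus the minimum-support condition
coherent = q1 > q2 and q1 > q3 and q4 > q2 and q4 > q3 and q1 >= 0.28
```

Here coherent evaluates to True, matching the conclusion that {I4, I5} enters the highly coherent 2-itemset L2.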
6. Experimental results

In this section, experimental results of the proposed approach are presented. The programs were implemented in Java on a personal computer with an Intel Core2 2.2-GHz CPU and 2 GB of RAM running Windows 7 Professional. The algorithm was evaluated on two datasets: a real dataset, namely the Foodmart dataset from Microsoft SQL Server, and a simulated dataset generated by the IBM Generator. The parameters of the simulated data are shown in Table 11. The details of the two datasets are shown in Table 12. In the experiments, the minimum support thresholds were varied.

The Foodmart dataset was first used to evaluate the efficiency and performance of the proposed approach. Various minimum

Table 11
Parameters of the simulation data.

Parameter    Definition                              Value
D            Number of transactions (in 000s)        300
T            Average items per transaction           8
N            Number of different items (in 000s)     0.5
L            Number of patterns (possible rules)     500
I            Average length of maximal pattern       7
Table 12
Details of databases.

Database          # of transactions    # of items
Foodmart          21,557               1559
Simulated data    247,488              489
Table 13
Comparison results between HCARM and Apriori for Foodmart dataset.

Approach    Minimum support (%)    Total number of rules    Average conf.    Execution time (s)
HCARM       0.0001                 2840                     0.918            2739
HCARM       0.00015                136                      0.906            2481
Apriori     0.0001                 18,880                   0.325            3087
Apriori     0.00015                2520                     0.211            2889
Table 9
Candidate coherent association rules.

RID    Rule          RID    Rule
1      I3 → I5       8      I7 → I5
2      I5 → I3       9      (I4, I5) → I7
3      I4 → I5       10     (I4, I7) → I5
4      I5 → I4       11     (I5, I7) → I4
5      I4 → I7       12     I7 → (I4, I5)
6      I7 → I4       13     I5 → (I4, I7)
7      I5 → I7       14     I4 → (I5, I7)
Table 10
Confidence value of each rule.

RID    Rule          Confidence    RID    Rule             Confidence
1      I3 → I5       0.66          8      I7 → I5          0.9
2      I5 → I3       0.61          9      (I4, I5) → I7    0.9
3      I4 → I5       0.67          10     (I4, I7) → I5    1
4      I5 → I4       0.77          11     (I5, I7) → I4    1
5      I4 → I7       0.6           12     I7 → (I4, I5)    0.9
6      I7 → I4       0.9           13     I5 → (I4, I7)    0.69
7      I5 → I7       0.69          14     I4 → (I5, I7)    0.6
[Fig. 1. Comparison results of HCARM and Apriori for simulated dataset: average confidence versus minimum support (1–1.5%).]

[Fig. 2. Execution times of HCARM and Apriori for simulated dataset: execution time (s) versus minimum support (1–1.5%).]
Table 14
Numbers of rules obtained with various minimum supports.

                        Minimum support (%)
                        1        1.1      1.2      1.3      1.4      1.5
HCARM   # of rules      31,622   10,428   5144     552      10       4
        Avg. sup.       0.011    0.012    0.0125   0.0133   0.0153   0.017
        Avg. conf.      0.843    0.843    0.845    0.818    0.695    0.795
Apriori # of rules      45,334   16,812   9052     2094     338      158
        Avg. sup.       0.011    0.0121   0.0127   0.0138   0.016    0.018
        Avg. conf.      0.764    0.739    0.7245   0.603    0.383    0.322
supports were used to compare the rules derived using the proposed approach (HCARM) to those derived using the original Apriori algorithm. The results are shown in Table 13. The minimum supports were set at 0.0001% and 0.00015%. Table 13 shows that the numbers of rules derived by HCARM are 2840 and 136 for supports of 0.0001% and 0.00015%, respectively. The numbers of derived rules are lower than those obtained by the Apriori approach; these results are reasonable because the constraints of HCARM are stricter than those of Apriori. In addition, the average confidences of the rules obtained using HCARM are 0.918 and 0.906 for the two supports, respectively, whereas those of the rules obtained using the Apriori approach are 0.325 and 0.211. The rules derived using the proposed algorithm are thus more interesting for managers due to their higher confidences. The proposed approach also has a shorter execution time than Apriori.

Experiments were then made on the simulated dataset. Various minimum supports were set to compare the performance of the proposed algorithm with that of the Apriori algorithm. The results are shown in Fig. 1. The figure shows that the confidences of the rules derived using the proposed approach are larger than those of the rules derived using Apriori, especially when the support value was set at 1.5%. The execution times of HCARM and Apriori for the simulated dataset are shown in Fig. 2. The figure shows that the execution time of HCARM is lower than that of the Apriori algorithm, especially when the minimum support was set to less than 1.3%. The numbers of rules obtained using HCARM and Apriori with various minimum supports are shown in Table 14. Table 14 shows that HCARM derived fewer rules than did Apriori. However, the rules derived by HCARM might be more interesting than those derived by the Apriori algorithm in terms of average confidence.
In addition, the rules derived by HCARM have the logical equivalence property, which means that if a rule ''X → Y'' was derived, then the rule ''¬X → ¬Y'' also existed. The experimental results indicate that the proposed approach provides fewer but more useful rules (in terms of confidence) than those obtained using the Apriori algorithm. The derived rules have the logical equivalence property, which may be interesting from a business point of view.

7. Conclusion and future work

Association-rule mining produces many rules that are common sense and thus difficult to apply in business applications; the rules may also be misleading. This paper proposed an Apriori-like approach (HCARM) for mining interesting rules from transaction databases. Experiments were conducted on the Foodmart dataset and a simulated dataset to show the performance of the proposed approach. The two major contributions of this paper are: (1) Highly coherent association rules were developed to help decision-makers make marketing strategies.
(2) To reduce the time required to generate candidate coherent itemsets, the corresponding lower and upper bounds of the supports of subitemsets in the consequent part of an itemset are defined for removing itemsets that cannot become highly coherent itemsets.

The proposed approach can be further enhanced, for example, with higher-efficiency mining algorithms, evaluation on various transaction datasets, and multi-level rule mining. In the future, we will extend the proposed approach to more complex problems.

References

Agarwal, R., Aggarwal, C., & Prasad, V. V. V. (2001). A tree projection algorithm for generation of frequent itemsets. Journal of Parallel and Distributed Computing, 61(3), 350–371.
Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules in large databases. In The 20th international conference on very large data bases (pp. 487–499).
Agrawal, R., Imielinski, T., & Swami, A. (1993a). Mining association rules between sets of items in large databases. In The ACM SIGMOD conference, Washington DC, USA.
Agrawal, R., Imielinski, T., & Swami, A. (1993b). Database mining: A performance perspective. IEEE Transactions on Knowledge and Data Engineering, 5(6), 914–925.
Antonie, M. L., & Zaïane, O. R. (2004). Mining positive and negative association rules: An approach for confined rules. In PKDD '04: Proceedings of the 8th European conference on principles and practice of knowledge discovery in databases (pp. 27–38).
Baralis, E., Cerquitelli, T., & Chiusano, S. (2009). IMine: Index support for item set mining. IEEE Transactions on Knowledge and Data Engineering, 21(4), 493–506.
Bie, T. D. (2011). An information theoretic framework for data mining. In Proceedings of the 17th ACM SIGKDD conference on knowledge discovery and data mining (pp. 564–572).
Brin, S., Motwani, R., & Silverstein, C. (1997). Beyond market baskets: Generalizing association rules to correlations. In Proceedings of the ACM SIGMOD international conference on management of data (pp. 265–276).
Cai, R., Tung, A. K.
H., Zhang, Z., & Hao, Z. (2011). What is unequal among the equals? Ranking equivalent rules from gene expression data. IEEE Transactions on Knowledge and Data Engineering, 23(11), 1735–1747. Cheung, Y. L., & Fu, A. W. C. (2004). Mining frequent itemsets without support threshold: With and without item constraints. Proceedings of IEEE Transactions on Knowledge and Data Engineering, 16(9), 1052–1069. Chiang, D. A., Wang, C. T., Chen, S. P., & Chen, C. C. (2009). The cyclic model analysis on sequential patterns. IEEE Transactions on Knowledge and Data Engineering, 21(11), 1617–1628. Cule, B., Goethals, B. (2010). Mining association rules in long sequences. In Proceedings of the 14th Pacific-Asia conference on advances in knowledge discovery and data mining (pp. 300–309). Do, T. D. T., Laurenty, A., Termier, A. (2010). PGLCM: Efficient parallel mining of closed frequent gradual itemsets. In IEEE international conference on data mining (pp. 138–147). Han, J., Wang, J., Lu, Y., Tzvetkov, P. (2002). Mining top-K frequent closed patterns without minimum support. In Proceedings of the IEEE international conference on data mining (p. 211). Han, J., Pei, J., Yin, Y., & Mao, R. (2004). Mining frequent patterns without candidate generation a frequent-pattern tree approach. Journal of Data Mining and Knowledge Discovery, 8, 53–87. Marinica, C., & Guillet, F. (2010). Knowledge-based interactive postmining of association rules using ontologies. IEEE Transactions on Knowledge and Data Engineering, 22(6), 784–797. Plantevit, M., Laurent, A., Laurent, D., Teisseire, M., & Choong, Y. W. (2010). Mining multidimensional and multilevel sequential patterns. ACM Transactions on Knowledge Discovery from Data, 4(1). Article 4. Ruggieri, S. (2010). Frequent regular itemset mining. In Proceedings of the ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 263–272).
Sá, C. R. D., Soares, C., Jorge, A. M., Azevedo, P., & Costa, J. (2011). Mining association rules for label ranking. In Proceedings of the 15th Pacific-Asia conference on advances in knowledge discovery and data mining (pp. 432–443).
Segond, M., & Borgelt, C. (2011). Item set mining based on cover similarity. In Proceedings of the 15th Pacific-Asia conference on advances in knowledge discovery and data mining (pp. 493–505).
Sim, A. T. H., Indrawan, M., Zutshi, S., & Srinivasan, B. (2010). Logic-based pattern discovery. IEEE Transactions on Knowledge and Data Engineering, 22(6), 798–811.
Teng, W. G., Hsieh, M. J., & Chen, M. S. (2002). On the mining of substitution rules for statistically dependent items. In Proceedings of the IEEE international conference on data mining (pp. 442–449).
Wang, K., He, Y., & Han, J. (2003). Pushing support constraints into association rules mining. IEEE Transactions on Knowledge and Data Engineering, 15, 642–658.
Webb, G. I. (2010). Self-sufficient itemsets: An approach to screening potentially interesting associations between items. ACM Transactions on Knowledge Discovery from Data, 4, 1–20.
Wu, X., Zhang, C., & Zhang, S. (2004). Efficient mining of both positive and negative association rules. ACM Transactions on Information Systems, 22(3), 381–405.
Zhao, Y., Zhang, H., Wu, S., Pei, J., Cao, L., Zhang, C., & Bohlscheid, H. (2009). Debt detection in social security by sequence classification using both positive and negative patterns. In Proceedings of ECML-PKDD, Vol. 5782 (pp. 648–663).
Zheng, Z., Zhao, Y., Zuo, Z., & Cao, L. (2010). An efficient GA-based algorithm for mining negative sequential patterns. In Proceedings of the 14th Pacific-Asia conference on advances in knowledge discovery and data mining (pp. 262–273).
Zhong, N., Li, Y., & Wu, S. T. (2012). Effective pattern discovery for text mining. IEEE Transactions on Knowledge and Data Engineering, 24(1), 30–44.