Available online at www.sciencedirect.com
Procedia Engineering
Procedia Engineering 15 (2011) 1678–1683 www.elsevier.com/locate/procedia
Advanced in Control Engineering and Information Science
Association Rules Mining with Multiple Constraints
Li Guang-yuan^{a,b,*}, Cao Dan-yang^{a}, Guo Jian-wei^{a}
^{a} School of Computer and Communication Engineering, University of Science & Technology Beijing, Beijing, China
^{b} School of Computer and Information Engineering, Guangxi Teachers Education University, Nanning, China
Abstract
Association rules mining (ARM) is an important task in data mining, and mining frequent itemsets is a key step in many ARM algorithms. On a very large dataset, the number of generated rules may be very large, yet many of them are of no use to the user. To improve the effectiveness and efficiency of mining, constraint-based mining lets users concentrate on the association rules they are interested in rather than on the complete set of association rules. Most previously proposed methods deal with a single constraint. In this paper, we present an algorithm for mining association rules with multiple constraints; it handles two different kinds of constraints simultaneously. The algorithm consists of three phases. First, the frequent 1-itemsets are generated. Second, we exploit the properties of the given constraints to prune the search space or to save constraint checking in the conditional databases. Third, for each itemset that may satisfy the constraints, we generate its conditional database and perform the three phases in that conditional database recursively. Experimental results show that the proposed method outperforms the revised FP-growth algorithm.
© 2011 Published by Elsevier Ltd. Selection and/or peer-review under responsibility of [CEIS 2011]
Keywords: Data mining; Association rules mining; Constraint-based mining.
* Corresponding author. Tel: (0)13977177085. E-mail address: [email protected].
1877-7058 © 2011 Published by Elsevier Ltd. doi:10.1016/j.proeng.2011.08.313
1. Introduction
Association rules mining is an important task in data mining, and frequent itemset mining is a key step in many algorithms for it. A great deal of work has been done on mining association rules. When the dataset is large, the number of generated rules may be very large, yet many of them are not interesting to the user. It is therefore common to set parameters that reduce the number of rules generated; support and confidence are the two usual ones. Using only support and confidence, however, has drawbacks [1]: first, a lack of user exploration and control; second, a lack of focus; third, a rigid notion of relationship. To improve the effectiveness and efficiency of mining
tasks, constraint-based mining lets users concentrate on the association rules they are interested in rather than on the complete set of association rules. According to their properties, constraints fall into four kinds: monotone, anti-monotone, succinct, and convertible. Discovering all frequent itemsets that satisfy a set of constraints is difficult for two reasons [2]. First, testing for minimum support and maximum support cannot be done simultaneously since, when valid, one is always true for subsets while the other is always true for supersets. Second, despite their selective power, some constraints cannot be used to filter candidate itemsets until a very late stage of the mining process, depending on the type of constraint and the search space traversal strategy used. Several efficient algorithms have been proposed for this problem [2-7], but most of them handle only one constraint. In this paper, we present an algorithm for mining association rules with multiple constraints; it handles two different kinds of constraints simultaneously. The proposed method consists of three phases. First, the frequent 1-itemsets are generated. Second, we exploit the properties of the given constraints to prune the search space or to save constraint checking in the conditional databases. Third, for each itemset that may satisfy the constraints, we generate its conditional database and perform the three phases in that conditional database recursively. Experimental results show that the proposed method outperforms the revised FP-growth algorithm.
2. Problem Definition
Let $Items = \{x_1, x_2, \ldots, x_n\}$ be a set of distinct items. An itemset $X$ is a non-empty subset of $Items$. If $X$ has $k$ items, then $X$ is called a $k$-itemset. A transaction is a pair $\langle ID_t, X \rangle$, where $ID_t$ is the transaction identifier and $X$ is the content of the transaction.
A transaction database $DB$ is a set of transactions. An itemset $X$ is contained in a transaction $\langle ID_t, Y \rangle$ if $X \subseteq Y$. The support of an itemset $X$, written $Support(X)$, is the number of transactions in $DB$ that contain $X$. Given a user-defined minimum support $\delta$, an itemset $X$ is called frequent in $DB$ if $Support(X) \geq \delta$. A constraint $C$ is a predicate on the powerset of the set of items $I = Items$, i.e., $C : 2^{I} \rightarrow \{true, false\}$. An itemset $X$ satisfies a constraint $C$ if and only if $C(X)$ is true. The complete set of itemsets satisfying a constraint $C$ is
$SAT_{C}(I) = \{X \mid X \subseteq I \wedge C(X) = true\}$.
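The definitions of support, frequent itemset, and $SAT_C$ can be made concrete with a small sketch. It is an illustration only, not the authors' implementation: the function names `support`, `frequent_itemsets`, and `sat` are hypothetical, the enumeration is brute force (usable only on tiny data), and the transactions are the five listed in Table 1 of Section 3.

```python
from itertools import combinations

# Toy transaction database: the five transactions of Table 1 (TID -> content).
DB = {
    1: {"A", "B", "E"},
    2: {"B", "C"},
    3: {"A", "D", "E"},
    4: {"A", "B", "C", "E"},
    5: {"B", "D"},
}


def support(itemset, db):
    """Number of transactions whose content contains every item of `itemset`."""
    items = set(itemset)
    return sum(1 for content in db.values() if items <= content)


def frequent_itemsets(db, min_support):
    """Brute-force map from each frequent itemset to its support."""
    all_items = sorted(set().union(*db.values()))
    result = {}
    for k in range(1, len(all_items) + 1):
        for candidate in combinations(all_items, k):
            s = support(candidate, db)
            if s >= min_support:
                result[frozenset(candidate)] = s
    return result


def sat(itemsets, constraint):
    """Members of `itemsets` for which the predicate `constraint` is true."""
    return {x for x in itemsets if constraint(x)}


freq = frequent_itemsets(DB, min_support=2)
# Hypothetical constraint for illustration: itemsets with at least two items.
at_least_two = sat(freq, lambda x: len(x) >= 2)
```

With a minimum support of 2, this toy database has ten frequent itemsets, five of which contain at least two items.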
Definition 1. Given an itemset $X$, a constraint $C$ is anti-monotone if $\forall Y \subseteq X : C(X) = true \Rightarrow C(Y) = true$.
Definition 2. Given an itemset $X$, a constraint $C$ is monotone if $\forall Y \subseteq X : C(Y) = true \Rightarrow C(X) = true$.
Definition 3. A constraint $C$ is convertible anti-monotone provided there is an order on items such that whenever an itemset $X$ satisfies $C$, so does any prefix of $X$. A constraint $\Phi$ is convertible monotone provided there is an order on items such that whenever an itemset $X$ violates $\Phi$, so does any prefix of $X$.
In this paper, the proposed algorithm deals with two constraints, one anti-monotone and one monotone. We use the FP-growth algorithm as the basic approach to mining frequent itemsets since it is more efficient than many other algorithms, such as Apriori-like algorithms. Given a database $DB$ and two constraints $C_1$ and $C_2$, where $C_1$ is anti-monotone and $C_2$ is monotone, our goal is to generate all frequent itemsets $X$ that satisfy $C_1(X) \wedge C_2(X)$, that is, the frequent members of
$SAT_{C_1 \wedge C_2}(I) = \{X \mid X \subseteq I \wedge C_1(X) \wedge C_2(X) = true\}$.
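Definitions 1 and 2 can be checked exhaustively on a small item set. The sketch below is a brute-force illustration under stated assumptions: the helper names `is_anti_monotone` and `is_monotone` are hypothetical, the quantification is restricted to non-empty itemsets, and the item attributes and the two predicates are the ones used as the running example in Section 3 (Table 2).

```python
from itertools import combinations

# Item attributes as listed in Table 2: item -> (cost, price).
ATTRS = {"A": (30, 45), "B": (35, 40), "C": (50, 60), "D": (25, 40), "E": (20, 45)}


def c1(itemset):
    """Example constraint max(S.cost) <= min(S.price)."""
    return max(ATTRS[i][0] for i in itemset) <= min(ATTRS[i][1] for i in itemset)


def c2(itemset):
    """Example constraint total(S.price) >= 100."""
    return sum(ATTRS[i][1] for i in itemset) >= 100


def nonempty_subsets(items):
    """All non-empty subsets of `items`, as frozensets."""
    items = sorted(items)
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)


def is_anti_monotone(constraint, items):
    """Definition 1: every non-empty subset of a satisfying itemset satisfies it."""
    return all(
        constraint(y)
        for x in nonempty_subsets(items) if constraint(x)
        for y in nonempty_subsets(x)
    )


def is_monotone(constraint, items):
    """Definition 2: every superset (within `items`) of a satisfying itemset satisfies it."""
    universe = list(nonempty_subsets(items))
    return all(
        constraint(x)
        for y in universe if constraint(y)
        for x in universe if y <= x
    )
```

On these five items the first predicate passes the anti-monotone check but not the monotone one, and the second predicate does the opposite. A brute-force check over one item set is evidence for that data only, not a proof for arbitrary attribute values.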
3. The Proposed Method
We use the example data set shown in Table 1 to illustrate how the proposed algorithm works. The two constraints are $\max(S.cost) \leq \min(S.price)$ and $total(S.price) \geq 100$. The former is anti-monotone, because removing items from $S$ can only lower the maximum cost and raise the minimum price; the latter is monotone, because adding items with non-negative prices can only raise the total. Here $S$ is an itemset and each item in $S$ has two attributes, cost and price; $\max(S.cost)$ denotes the maximum cost of all items in $S$, $\min(S.price)$ denotes the minimum price of all items in $S$, and $total(S.price)$ denotes the total price of all items in $S$. Before describing the algorithm, we present some definitions and lemmas.
Definition 4 [4]. Given a database $T$ and a project condition $pc$:
1. $pc(s_1, s_2) = true$ if the relationship $pc$ holds between itemsets $s_1$ and $s_2$. For example, let $s_1 = ab$, $s_2 = abcd$ and $pc$ = prefix relationship; then $pc(s_1, s_2) = true$ because $s_1$ is a prefix of $s_2$.
2. Itemset $\beta$ is called the max-$\alpha$ projection of a transaction $\langle tid, I_t \rangle$ w.r.t. $pc$ if and only if $\alpha \subseteq I_t$ and $\beta \subseteq I_t$; $pc(\alpha, \beta) = true$; and there exists no proper superset $c$ of $\beta$ such that $c \subseteq I_t$ and $pc(\alpha, c) = true$.
3. The $\alpha$-conditional database is the collection of max-$\alpha$ projections of transactions containing $\alpha$ w.r.t. $pc$.
Definition 5 [4]. Let $\alpha$ be a frequent itemset and $\lambda$ be the set of frequent items in $\alpha$'s conditional database. Then $\alpha \cup \lambda$ forms the potential largest frequent itemset in $\alpha$'s conditional database.
Lemma 1 [4]. Let $\beta$ be the set of frequent items in $T_{\alpha}$ and $\gamma$ be a subset of the set of frequent items in $T_{\alpha}$. If we have confirmed $\Phi(\alpha \cup \beta) = true$ in $T_{\alpha}$, where $\Phi$ is an anti-monotone constraint, we do not need to check $\Phi$ in $T_{\{a\} \cup \alpha}$ for each $a \in \beta$, because $(\{a\} \cup \alpha \cup \gamma) \subseteq (\alpha \cup \beta)$ and thus $\Phi(\{a\} \cup \alpha \cup \gamma)$ is certainly true.
Lemma 2 [4]. Let $\gamma$ be a subset of the set of frequent items in $T_{\{a\} \cup \alpha}$. If we have confirmed $\Phi(\{a\} \cup \alpha) = false$, where $\Phi$ is an anti-monotone constraint and $a$ is an individual frequent item in $T_{\alpha}$, we do not need to generate $T_{\{a\} \cup \alpha}$, because $\{a\} \cup \alpha \cup \gamma$ contains $\{a\} \cup \alpha$ and thus $\Phi(\{a\} \cup \alpha \cup \gamma)$ is certainly false.
Lemma 3. Let $\beta$ be the set of frequent items in $T_{\alpha}$ and $\gamma$ be a subset of the set of frequent items in $T_{\alpha}$. If we have confirmed $\Phi(\alpha \cup \beta) = false$ in $T_{\alpha}$, where $\Phi$ is a monotone constraint, we do not need to check $\Phi$ in $T_{\{a\} \cup \alpha}$ for each $a \in \beta$, because $(\{a\} \cup \alpha \cup \gamma) \subseteq (\alpha \cup \beta)$ and thus $\Phi(\{a\} \cup \alpha \cup \gamma)$ is certainly false.
Lemma 4. Let $\gamma$ be a subset of the set of frequent items in $T_{\{a\} \cup \alpha}$. If we have confirmed $\Phi(\{a\} \cup \alpha) = true$, where $\Phi$ is a monotone constraint and $a$ is an individual frequent item in $T_{\alpha}$, we do not need to check $\Phi$ in $T_{\{a\} \cup \alpha}$, because $\{a\} \cup \alpha \cup \gamma$ contains $\{a\} \cup \alpha$ and thus $\Phi(\{a\} \cup \alpha \cup \gamma)$ is certainly true.
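The way the four lemmas cut down constraint checks can be illustrated with a small recursive pattern-growth sketch. Several assumptions apply: it is not the authors' MCAL procedure, it replaces the FP-tree with plain projected transaction lists under a fixed alphabetical item order, the function name `mine_constrained` and its flags are hypothetical, and Lemma 4 is read here as "once the monotone constraint holds for a prefix, it need not be checked again for its extensions". The data are the transactions of Table 1 and the item attributes of Table 2.

```python
from collections import Counter

# Transactions of Table 1 and item attributes (cost, price) of Table 2.
DB = [{"A", "B", "E"}, {"B", "C"}, {"A", "D", "E"}, {"A", "B", "C", "E"}, {"B", "D"}]
ATTRS = {"A": (30, 45), "B": (35, 40), "C": (50, 60), "D": (25, 40), "E": (20, 45)}


def c1(s):
    """Anti-monotone example constraint: max(S.cost) <= min(S.price)."""
    return max(ATTRS[i][0] for i in s) <= min(ATTRS[i][1] for i in s)


def c2(s):
    """Monotone example constraint: total(S.price) >= 100."""
    return sum(ATTRS[i][1] for i in s) >= 100


def mine_constrained(alpha, cond_db, min_support, c1, c2,
                     c1_known=False, c2_known=False, out=None):
    """Collect frequent itemsets X (extensions of alpha) with c1(X) and c2(X).

    cond_db holds one entry per transaction containing alpha, each entry being
    the set of items that may still extend alpha.
    """
    if out is None:
        out = {}
    counts = Counter(item for t in cond_db for item in t)
    beta = sorted(i for i, n in counts.items() if n >= min_support)
    if not beta:
        return out
    largest = alpha | set(beta)  # Definition 5: potential largest frequent itemset
    # Lemma 3: if monotone c2 fails on alpha ∪ beta, it fails on every subset.
    if not c2_known and not c2(largest):
        return out
    # Lemma 1: if anti-monotone c1 holds on alpha ∪ beta, it holds on every subset.
    if not c1_known and c1(largest):
        c1_known = True
    for idx, a in enumerate(beta):
        new_alpha = alpha | {a}
        # Lemma 2: if c1 fails on {a} ∪ alpha, skip its conditional database.
        if not c1_known and not c1(new_alpha):
            continue
        # Lemma 4: once c2 holds for a prefix, it holds for all its extensions.
        c2_holds = c2_known or c2(new_alpha)
        if c2_holds:
            out[new_alpha] = counts[a]  # support of {a} ∪ alpha
        later = set(beta[idx + 1:])
        new_db = [t & later for t in cond_db if a in t]
        mine_constrained(new_alpha, new_db, min_support, c1, c2,
                         c1_known, c2_holds, out)
    return out


result = mine_constrained(frozenset(), DB, 2, c1, c2)
```

With a minimum support of 2, the only itemset returned for this toy data is {A, B, E} with support 2: {B, C} is cut by Lemma 2, and all other frequent itemsets have a total price below 100.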
Table 1. An example database T
TID   Items
1     A, B, E
2     B, C
3     A, D, E
4     A, B, C, E
5     B, D
Table 2. Items in database T
Item  Cost  Price
A     30    45
B     35    40
C     50    60
D     25    40
E     20    45
Now, we give a brief description of the algorithm below.
MCAL($\alpha$, $T_{\alpha}$)
Input: $DB$, anti-monotone constraint $C_1$, monotone constraint $C_2$, minimum support threshold $\delta$
Output: All frequent itemsets $X$ satisfying $C_1(X) \wedge C_2(X)$.
1. Collect the set of frequent items $L$ and their supports from the FP-tree header table of $T_{\alpha}$
2. $\beta = L$; if $C_2(\beta \cup \alpha) = false$ then exit, since no frequent itemset $X$ satisfies $C_1(X) \wedge C_2(X)$
3. if $C_2(\beta \cup \alpha) = true$, apply Lemma 3 and Lemma 4 to calculate the number $N$ of items that satisfy $C_2$
4. for each $a \in \beta$
5.   MCAL($\alpha$, $T_{\alpha}$)
6.   if $|L| < N$
7.     continue
8.   else
9.     for each $\chi \in L$
10.      if $C_1(\chi \cup a) = true$
11.        Generate $T_{\chi \cup a}$, if $|L| > N$
12.        Output $L$
13.    endfor
14. endfor
4. Experimental Results
To evaluate the performance of the proposed algorithm, we compare it with FP-growth+ [5]. All experiments were performed on a Pentium IV 3.2 GHz personal computer with 2MB main memory, running Windows XP. The program is written in C++ and compiled with Microsoft Visual C++ 6.0. The data set is generated in a way similar to [8] and is denoted V25F20T50I1L100, where V25 denotes that the average size of the transactions is 25, F20 denotes that the average size of the maximal potentially frequent itemsets is 20, T50 denotes that the number of transactions is 50K, I1 denotes that the number of items is 1K, and L100 denotes that the number of maximal potentially frequent itemsets is 100. The experimental results are shown in Fig. 1 and Fig. 2.
[Figure: execution time (sec) versus number of transactions (20K to 100K) on V25F20T50I1L100, comparing FP-growth+ and the proposed algorithm]
Fig. 1. Scalability with number of transactions
[Figure: execution time (sec) versus support threshold (0.1% to 1%) on V25F20T50I1L100, comparing FP-growth+ and the proposed algorithm]
Fig. 2. Scalability with minimum support threshold
5. Conclusions
In this paper, we present an efficient algorithm for mining association rules with multiple constraints. The proposed method consists of three phases. First, the frequent 1-itemsets are generated. Second, we exploit the properties of the given constraints to prune the search space or to save constraint checking in the conditional databases. Third, for each itemset that may satisfy the constraints, we generate its conditional database and perform the three phases in that conditional database recursively. Experimental results show that the proposed method outperforms revised FP-growth algorithms such as FP-growth+. In the future, we plan to investigate mining with multiple constraints on uncertain data.
Acknowledgements
This work was supported by the Science and Technology Project of Beijing, Beijing, China.
References
[1] Raymond T. Ng, Laks V.S. Lakshmanan, Jiawei Han. Exploratory mining and pruning optimizations of constrained associations rules. Proceedings ACM SIGMOD International Conference on Management of Data, June 1998, Seattle, Washington, USA.
[2] Mohammad El-Hajj, Osmar R. Zaiane, Paul Nalos. Bifold constraint-based mining by simultaneous monotone and anti-monotone checking. Proceedings of the 5th IEEE International Conference on Data Mining, November 2005, Houston, Texas, USA.
[3] L. Lakshmanan, R. Ng, J. Han, A. Pang. Optimization of constrained frequent set queries with 2-variable constraints. In ACM SIGMOD Conference on Management of Data, p. 157–168, 1999.
[4] Anthony J.T. Lee, Wan-chuen Lin, Chun-sheng Wang. Mining association rules with multi-dimensional constraints. The Journal of Systems and Software 2006, (79), p. 79–92.
[5] J. Pei, J. Han, L. Lakshmanan. Mining frequent itemsets with convertible constraints. In IEEE ICDE Conference, p. 433–442, 2001.
[6] C. Bucila, J. Gehrke, D. Kifer, W. White. DualMiner: A dual-pruning algorithm for itemsets with constraints. In Eighth ACM SIGKDD International Conf. on Knowledge Discovery and Data Mining, p. 42–51, Edmonton, Alberta, August 2002.
[7] R. M. Ting, J. Bailey, K. Ramamohanarao. ParaDualMiner: An efficient parallel implementation of the DualMiner algorithm. In Eighth Pacific-Asia Conference, PAKDD 2004, p. 96–105, Sydney, Australia, May 2004.
[8] R. Agrawal, R. Srikant. Fast algorithms for mining association rules. Proceedings of International Conference on Very Large Data Bases, p. 487–499.